
Modelling the Cross-section Dependence, the Spatial Heterogeneity and the Network Diffusion in the Multi-dimensional Dataset

Yongcheol Shin, University of York

Workshop at University of Lodz

10-12 June 2019 (Preliminary)


1 Introduction

Panel data models have become increasingly popular in applied economics and finance, owing to their ability to model various sources of heterogeneity. A standard practice is to impose strong restrictions on error cross-section dependence (CSD). This takes the form of independence across individual units under the fixed effects model, whilst a common time effect severely restricts the nature of CSD under the random effects specification.

CSD seems pervasive in panels, since it is rare that the covariance of the errors is zero. Recently, there has been much progress in characterising and modelling CSD. Phillips and Sul (2003) note that the consequences of ignoring CSD can be serious: pooling may provide little gain in efficiency over single equation estimation, estimates may be badly biased, and tests for unit roots and cointegration may be misleading.

The pervasive evidence of strong CSD in panels over the last decade (e.g. Pesaran, 2015) has prompted a large number of studies to develop proper econometric methodologies for modelling CSD, mainly through the structure of interactive effects. This introduces heterogeneous unobserved factors into the error components, allowing for a richer cross-sectional covariance structure.

Currently, there are two leading approaches that have received considerable attention. The first, based on principal component (PC) estimation, estimates the factors jointly with the main slope parameters. This approach has been exhaustively analysed by Bai (2009) and extended by Fernandez-Val and Weidner (2016) and Moon and Weidner (2015). The second approach, advanced by Pesaran (2006), treats factors as nuisance terms and removes their effects by proxying them with the cross-section averages of the regressors and the dependent variable. This is referred to as the common correlated effects (CCE) estimator. A growing number of extensions have also been developed by Chudik and Pesaran (2015) and Westerlund and Urbain (2015).

CSD has always been central in spatial econometrics (Baltagi, 2005), where there is a natural way to characterise dependence in terms of distance, but for most economic and social science problems there is no obvious distance measure. For instance, trade between countries reflects not just geographical distance, but transport costs, common language, policy and historical factors such as colonial links, as well as multilateral barriers (Anderson and van Wincoop, 2003). The ability of spatial econometric models to capture co-dependencies across a known terrain or network has proved highly attractive to economists and regional scientists, following early work by Cliff and Ord (1973), Upton and Fingleton (1985), Anselin (1988), Cressie (1993) and Kelejian and Robinson (1993). The spatial lag or spatial autoregressive model, which captures spatial correlation within the system through the dependent variable, has proved the most popular but can be restrictive. A more general form, the spatial Durbin model, allows the explanatory variables of one unit to affect the dependent variable of another directly, as well as indirectly through their impact on the original dependent variable. For example, an improvement in school quality in one area will directly improve house prices in neighbouring areas, whose residents may access those schools, as well as indirectly through the transmission of increased house prices from one area to its neighbours.

As the availability of spatial datasets with a large time dimension grows, it is of interest to explore dynamic spatial models in greater detail; see Elhorst (2014) for an overview as well as Anselin et al. (2009), Baltagi et al. (2003) and Lee and Yu (2011). Yu et al. (2008) study the stable spatial dynamic panel data model, featuring individual time lags, spatial time lags and contemporaneous spatial lags. Recently, Shin and Thornton (2019) follow Aquaro, Bailey and Pesaran (2015), who allow heterogeneous spatial lag parameters, and propose to generalise the spatial panel model to higher-order temporal dynamics through the spatio-temporal autoregressive distributed lag (STARDL) model. Allowing for parameter heterogeneity raises significant complications for estimation and interpretation. As some simultaneity is inevitable in spatial models, we propose to develop a quasi-maximum likelihood (QML) estimator and an alternative control function-based estimator utilising internal instruments.

A number of specification tests have been proposed for testing the validity of cross-section dependence or the presence of interactive effects, e.g. Pesaran (2015), Sarafidis, Yamagata and Robertson (2009), Bai (2009) and Castagnetti, Rossi and Trapani (2015). For large T, it is straightforward to test for cross-section dependence using the squared correlations between the residuals (e.g. Pesaran, 2015). See also Pesaran, Ullah and Yamagata (2007) for a survey of the various tests. The conventional wisdom has so far been that the standard two-way fixed effects (FE) estimator is inappropriate and inconsistent in the presence of interactive effects, because it ignores the potential endogeneity arising from the correlation between regressors and factors and/or factor loadings (e.g. Bai, 2009). Recently, Kapetanios, Serlenga and Shin (2019) highlight the simple fact that the FE estimator is not always inconsistent even in the presence of an unobserved factor structure: if the factor loadings are uncorrelated, the FE estimator is still consistent. In retrospect, it is surprising that the literature has been silent on the important issue of testing the validity of uncorrelated factor loadings. To fill this gap, we develop a Hausman-type test for correlated factor loadings in cross-sectionally correlated panels.

2 Overview on Cross-section Dependence (CSD)

We consider a generic panel data model advanced by Serlenga and Shin (2007):

$$y_{it} = \beta' x_{it} + \lambda' z_i + \pi_i' s_t + \varepsilon_{it}, \quad i = 1, \dots, N, \; t = 1, \dots, T, \tag{1}$$

where $x_{it} = (x_{1,it}, \dots, x_{k,it})'$ is a $k \times 1$ vector of variables that vary across individuals and over time periods, $s_t = (s_{1t}, \dots, s_{st})'$ is an $s \times 1$ vector of observed time-specific factors, $z_i = (z_{1i}, \dots, z_{gi})'$ is a $g \times 1$ vector of individual-specific variables, and $\beta = (\beta_1, \dots, \beta_k)'$, $\lambda = (\lambda_1, \dots, \lambda_g)'$ and $\pi_i = (\pi_{1i}, \dots, \pi_{si})'$ are conformably defined column vectors of parameters.


To address the heterogeneous individual effects and common time effects, we consider the following one-way and two-way error components specifications:

$$\varepsilon_{it} = \alpha_i + u_{it}, \tag{2}$$

$$\varepsilon_{it} = \alpha_i + \theta_t + u_{it}. \tag{3}$$

However, in the presence of cross-section correlations among the $\varepsilon_{it}$'s, conventional panel data estimators such as FE tend to become biased. To control for CSD, the most popular approach is to add heterogeneous factors:

$$\varepsilon_{it} = \alpha_i + \gamma_i' f_t + u_{it}, \tag{4}$$

where $\alpha_i$ is an unobserved individual effect (heterogeneity), $f_t$ is the $c \times 1$ vector of unobserved common factors with heterogeneous parameter vector $\gamma_i = (\gamma_{1i}, \dots, \gamma_{ci})'$, and $u_{it}$ is a zero mean idiosyncratic uncorrelated random disturbance. Notice that both $\alpha_i$ and $f_t$ might be correlated with the explanatory variables $x_{it}$ and $z_i$.

The distinguishing feature of this model is that it allows for observed and unobserved time effects, both of which are cross-sectionally correlated. Factors are expected to provide a good proxy for any remaining complex time-varying patterns associated with multilateral resistance and globalisation trends, e.g. Mastromarco et al. (2016). Notice that the cross-section dependence in (1) is explicitly allowed through the heterogeneous loadings, $\gamma_i$; see Pesaran (2006) and Bai (2009).

2.1 Representations of CSD

There are various sources of CSD (neighborhood or network effects, the influence of a dominant unit, or the influence of common unobserved factors). Factor models propose that the errors reflect a vector of unobserved common factors:

$$y_{it} = \delta_i' z_t + \beta_i' x_{it} + \varepsilon_{it} \quad \text{with} \quad \varepsilon_{it} = \gamma_i' f_t + u_{it}, \tag{5}$$

where $y_{it}$ is a scalar dependent variable, $z_t$ is a $k_z \times 1$ vector of variables that do not differ over units, e.g. intercept and trend, $x_{it}$ is a $k_x \times 1$ vector of observed regressors which differ over units, $f_t$ is an $r \times 1$ vector of unobserved factors, which may influence each unit differently and which may be correlated with the $x_{it}$, and $u_{it}$ is an unobserved disturbance with $E(u_{it}) = 0$, $E(u_{it}^2) = \sigma_i^2$, which is independently distributed across $i$ and (possibly) $t$. The covariance of the errors, $\varepsilon_{it} = \gamma_i' f_t + u_{it}$, is determined by the factor loadings $\gamma_i$. Notice that if $f_t$ is correlated with $x_{it}$, as is likely in many economic applications, then not allowing for CSD by omitting $f_t$ causes the estimates of $\beta_i$ to be biased.

Spatial models allow the $N \times 1$ vector of errors, $\varepsilon_t = (\varepsilon_{1t}, \dots, \varepsilon_{Nt})'$, to follow

$$\varepsilon_t = W u_t,$$

where $u_t = (u_{1t}, \dots, u_{Nt})'$ is cross-sectionally independent and $W$ is a known (possibly time-varying) matrix, e.g. reflecting whether the units share a common border. This can be used to represent spatial autoregressive, moving average or error component models.

This approach assumes that the structure of cross-section correlation is related to the location and distance among units on the basis of a pre-specified weight matrix. Hence, cross-section correlation is represented by means of a spatial process, which explicitly relates each unit to its neighbors. A number of approaches for modelling spatial dependence have been suggested. The most popular ones are the Spatial Autoregressive (SAR), the Spatial Moving Average (SMA) and the Spatial Error Component (SEC) specifications. The spatial panel data model is estimated using maximum likelihood (ML) or generalized method of moments (GMM) techniques (Elhorst, 2011).

We consider a spatial panel data gravity (SARAR) model, which combines a spatial lagged variable and a spatial autoregressive error term:

$$y_{it} = \rho y^*_{it} + \beta' x_{it} + \gamma' z_i + \alpha_i + v_{it}, \quad i = 1, \dots, N, \; t = 1, \dots, T, \tag{6}$$

$$v_{it} = \lambda v^*_{it} + u_{it}, \tag{7}$$

where $y^*_{it} = \sum_{j \neq i}^{N} w_{ij} y_{jt}$ is the spatial lagged variable, $v^*_{it} = \sum_{j \neq i}^{N} w_{ij} v_{jt}$ is the spatial autoregressive error term, the $w_{ij}$'s are the spatial weights with row-sum normalisation, $\sum_{j} w_{ij} = 1$ and $w_{ii} = 0$, and $u_{it}$ is a zero mean idiosyncratic disturbance with constant variance. $\rho$ is the spatial lag coefficient and $\lambda$ the spatial error component coefficient. They capture the spatial spillover effects and measure the influence of the weighted average of neighboring observations on cross-section units. Chudik et al. (2011) show that a particular form of a weak cross-sectionally dependent process arises when pairwise correlations take non-zero values only across a finite number of units that do not spread widely as the sample size rises. A similar case occurs in spatial/network processes, where local dependency exists only among adjacent observations.
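To make the spatial lag concrete, the following is a minimal sketch of building a row-normalised weight matrix and computing $y^*_{it}$ for one period. The four-unit "line" contiguity pattern and the data values are illustrative assumptions, not taken from the source.

```python
import numpy as np

# Hypothetical contiguity pattern: four units on a line, each adjacent
# to its immediate neighbours (an assumption made for illustration only).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# Row-sum normalisation: each row of W sums to one and w_ii = 0.
W = A / A.sum(axis=1, keepdims=True)

# Cross-section of the dependent variable at one time period (made-up values).
y_t = np.array([1.0, 2.0, 4.0, 8.0])

# Spatial lag: the weighted average of the neighbours' outcomes.
y_star = W @ y_t
print(y_star)
```

With row normalisation, each element of `y_star` is simply the average of the neighbouring units' values, which is what $\rho$ multiplies in (6).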

2.2 Weak and Strong CSD

Chudik et al. (2011) show that these factor models exhibit the strong form of cross-sectional dependence, since the maximum eigenvalue of the covariance matrix for $\varepsilon_{it}$ tends to infinity at rate $N$. On the other hand, spatial econometric models display the weak form of cross-sectional dependence, which can be represented by an infinite number of weak factors and no idiosyncratic error.

With weak CSD, the dependences are local and decline with $N$. This could be the case with spatial correlations, where each cross-section unit is correlated with near neighbors but not others; with strong CSD the dependences influence all units. The distinction can be expressed in various ways.

Suppose the elements of $y_t$ are stationary, e.g. growth rates, and consider the weighted average of the elements, $\bar{y}_t = \sum_{i=1}^{N} y_{it}/N$, where the weights are 'granular', i.e. they go to zero as $N \to \infty$. Then, with weak CSD the variance of $\bar{y}_t$ goes to zero as $N \to \infty$. If there is strong CSD, it does not; for instance, there may be a global cycle in $y_t$. If there is weak CSD, the influence of the factors, $\sum_{i=1}^{N} \gamma_i^2$, is bounded as $N \to \infty$, while if there is strong dependence, it goes to infinity with $N$. If there is weak dependence, all the eigenvalues of the covariance matrix of the errors are bounded as $N \to \infty$. If there is strong dependence, the largest eigenvalue goes to infinity with $N$. Bailey, Kapetanios and Pesaran (2016) characterise the strength of the dependence in terms of the exponent of CSD, defined as $\alpha = \ln(n)/\ln(N)$, where $n$ is the number of units with non-zero factor loadings. In the case of a strong factor, $\alpha = 1$, while $\alpha < 1/2$ indicates a weak factor. The values $\tfrac{1}{2} \le \alpha \le \tfrac{3}{4}$ represent a moderate degree of CSD.[1]
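The eigenvalue characterisation can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original notes: the strong case uses a single factor with loadings drawn from $N(1, 0.25)$ and unit idiosyncratic variance, and the weak case uses spatially autoregressive errors $\varepsilon_t = (I - \rho W)^{-1} u_t$ on a ring with two equally weighted neighbours and $\rho = 0.4$.

```python
import numpy as np

def max_eig_factor(N, rng):
    # Covariance of eps_it = gamma_i * f_t + u_it with Var(f_t) = Var(u_it) = 1.
    gamma = rng.normal(1.0, 0.5, size=N)
    cov = np.outer(gamma, gamma) + np.eye(N)
    return np.linalg.eigvalsh(cov)[-1]          # eigvalsh sorts ascending

def max_eig_spatial(N, rho=0.4):
    # Row-normalised ring: each unit has two neighbours with weight 1/2.
    idx = np.arange(N)
    W = np.zeros((N, N))
    W[idx, (idx - 1) % N] = 0.5
    W[idx, (idx + 1) % N] = 0.5
    R = np.linalg.inv(np.eye(N) - rho * W)      # eps_t = R u_t
    return np.linalg.eigvalsh(R @ R.T)[-1]

rng = np.random.default_rng(0)
sizes = [50, 100, 200, 400]
factor_eigs = [max_eig_factor(N, rng) for N in sizes]
spatial_eigs = [max_eig_spatial(N) for N in sizes]
for N, lf, ls in zip(sizes, factor_eigs, spatial_eigs):
    print(N, round(lf, 2), round(ls, 4))
```

In this design the largest eigenvalue under the factor structure is $1 + \sum_i \gamma_i^2$ and so grows roughly in proportion to $N$, whereas under the spatial structure it stays at $1/(1-\rho)^2$ whatever the value of $N$.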

CSD is central to all the issues discussed. For instance, there is a growing literature on testing for structural change in panels. However, the apparent structural change may result from having left out an unobserved global variable, $f_t$. If $f_t$ is omitted and the correlation between $f_t$ and $x_{it}$ changes, this will change the estimate of $\beta_i$, giving the appearance of structural change. Similarly, an omitted factor may give the impression of non-linearity.[2] Since unobserved factors play a major role in the treatment of CSD, we begin by discussing the estimation of such factors. The implications for estimation differ depending on whether the $f_t$ are merely regarded as nuisance parameters that we wish to control for in order to get better estimates of $\beta$, or whether they are the parameters of interest, so that one wishes to estimate $f_t$ as variables of economic interest in their own right.

3 Introduction to Panel Data Model

We discuss techniques for panel data, that is, repeated time series observations on the same set of cross-section units. This amounts to "pooling" cross-section and time series data, where there are $N$ cross-section individuals, $i = 1, 2, \dots, N$, and $T$ time periods, $t = 1, 2, \dots, T$. The regression model is written as

$$y_{it} = \beta_1 x_{it1} + \beta_2 x_{it2} + \cdots + \beta_k x_{itk} + \varepsilon_{it}, \tag{8}$$

$i = 1, 2, \dots, N$, $t = 1, 2, \dots, T$, where $y_{it}$ is the value of the dependent variable for cross-section unit $i$ at time $t$, and $x_{itj}$ is the value of the $j$th explanatory variable for unit $i$ at time $t$, for $j = 1, \dots, k$. There are $k$ explanatory variables, so let $x_{it} = (x_{it1}, \dots, x_{itk})$ be a $1 \times k$ vector of regressors (including a constant), and $\beta = (\beta_1, \dots, \beta_k)'$ a $k \times 1$ vector of parameters. Then, (8) can be compactly written as

$$y_{it} = x_{it} \beta + \varepsilon_{it}, \quad i = 1, 2, \dots, N, \; t = 1, 2, \dots, T. \tag{9}$$

More compactly,

$$\mathbf{y}_i = \mathbf{x}_i \beta + \boldsymbol{\varepsilon}_i, \quad i = 1, 2, \dots, N, \tag{10}$$

[1] Vega and Elhorst (2016) argue "the terminology of weak and strong cross-sectional dependence is to some extent misleading, because the terms strong and weak suggest that the former is more important than the latter, while we find that their contributions are almost equally as important. We therefore propose the descriptions common factors and spatial dependence to acknowledge the importance of both properties."

[2] Cerrato, de Peretti and Sarantis (2007) extend the Kapetanios, Shin and Snell (2003) test for a unit root against a non-linear ESTAR alternative to allow for cross-section dependence.


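where

$$\underset{T \times 1}{\mathbf{y}_i} = \begin{pmatrix} y_{i1} \\ y_{i2} \\ \vdots \\ y_{iT} \end{pmatrix}, \quad \underset{T \times k}{\mathbf{x}_i} = \begin{pmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{iT} \end{pmatrix}, \quad \underset{T \times 1}{\boldsymbol{\varepsilon}_i} = \begin{pmatrix} \varepsilon_{i1} \\ \varepsilon_{i2} \\ \vdots \\ \varepsilon_{iT} \end{pmatrix},$$

and finally

$$\mathbf{y} = \mathbf{X} \beta + \boldsymbol{\varepsilon}, \tag{11}$$

where

$$\underset{NT \times 1}{\mathbf{y}} = \begin{pmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \\ \vdots \\ \mathbf{y}_N \end{pmatrix}, \quad \underset{NT \times k}{\mathbf{X}} = \begin{pmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \\ \vdots \\ \mathbf{x}_N \end{pmatrix}, \quad \underset{NT \times 1}{\boldsymbol{\varepsilon}} = \begin{pmatrix} \boldsymbol{\varepsilon}_1 \\ \boldsymbol{\varepsilon}_2 \\ \vdots \\ \boldsymbol{\varepsilon}_N \end{pmatrix}.$$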

Motivation for use of panel data: The analysis of panel data is the subject of an active literature in econometrics. See Hsiao (1986), Baltagi (1995) and Greene (1997, Ch. 14). First, we can obtain an efficiency gain from using more observations, e.g. a budget study where y is consumption of some good and x is (prices, income): prices vary over time and (real) income varies over individuals. Second, we can control the bias of the estimation, e.g. in labor economics, an earnings equation where y is the wage and x includes education, age, etc., and where there are time-invariant unobservables or individual effects, which can be related to individual ability or intelligence.

Sources and types of the panel data

• The Panel Study of Income Dynamics (PSID), collected by the Institute of Social Research at the University of Michigan (since 1968). Information about economic status such as income, job, marital status and so on.

• The Survey of Income and Program Participation (SIPP, US Department of Commerce) covers shorter time periods.

• The study by Card (1992): effects of the minimum wage law on employment. Collected information by US States on youth employment, unemployment rates, average wages and other factors for 1976-90.

• Macropanels such as the international data set obtained from version 5.5 of the Penn World Tables collected by Summers and Heston.

See also 12.1 in JD for more explanations and references.

3.1 Models For Pooled Time Series

Here $N$ is relatively small, and $T$ is large enough to run separate regressions for each individual, but combining individuals may yield better (more efficient) estimates. See Greene (1997, Sec. 15.1-15.3). Define the $NT \times NT$ covariance matrix,

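$$\mathbf{V} = \mathrm{Cov}(\boldsymbol{\varepsilon}) = E(\boldsymbol{\varepsilon} \boldsymbol{\varepsilon}') = E \begin{pmatrix} \boldsymbol{\varepsilon}_1 \boldsymbol{\varepsilon}_1' & \boldsymbol{\varepsilon}_1 \boldsymbol{\varepsilon}_2' & \cdots & \boldsymbol{\varepsilon}_1 \boldsymbol{\varepsilon}_N' \\ \boldsymbol{\varepsilon}_2 \boldsymbol{\varepsilon}_1' & \boldsymbol{\varepsilon}_2 \boldsymbol{\varepsilon}_2' & \cdots & \boldsymbol{\varepsilon}_2 \boldsymbol{\varepsilon}_N' \\ \vdots & \vdots & \ddots & \vdots \\ \boldsymbol{\varepsilon}_N \boldsymbol{\varepsilon}_1' & \boldsymbol{\varepsilon}_N \boldsymbol{\varepsilon}_2' & \cdots & \boldsymbol{\varepsilon}_N \boldsymbol{\varepsilon}_N' \end{pmatrix} = \begin{pmatrix} \mathbf{v}_{11} & \mathbf{v}_{12} & \cdots & \mathbf{v}_{1N} \\ \mathbf{v}_{21} & \mathbf{v}_{22} & \cdots & \mathbf{v}_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{v}_{N1} & \mathbf{v}_{N2} & \cdots & \mathbf{v}_{NN} \end{pmatrix},$$

where the dimension of each block is $T \times T$. Assume that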

ε ∼ N(0,V),

and X’s are exogenous. Then, we consider the two basic estimators:

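Ordinary least squares (OLS):

$$\hat{\beta}_{OLS} = (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\mathbf{y},$$

which is unbiased but inefficient; that is,

$$\mathrm{Cov}(\hat{\beta}_{OLS}) = (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\mathbf{V}\mathbf{X} (\mathbf{X}'\mathbf{X})^{-1}.$$

Generalised least squares (GLS):

$$\hat{\beta}_{GLS} = (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1} \mathbf{X}'\mathbf{V}^{-1}\mathbf{y},$$

which is unbiased and efficient; that is,

$$\mathrm{Cov}(\hat{\beta}_{GLS}) = (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}.$$

Exercise 1 Show that $\mathrm{Cov}(\hat{\beta}_{OLS}) \ge \mathrm{Cov}(\hat{\beta}_{GLS})$.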
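Before attempting the proof, the inequality in Exercise 1 can be checked numerically. The sketch below is only a sanity check under assumptions of my own: X and V are randomly generated (V is an arbitrary symmetric positive definite matrix), and the ordering is read in the matrix sense, i.e. the difference of the two covariance matrices should be positive semi-definite.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 30, 3                       # n plays the role of NT
X = rng.normal(size=(n, k))
A = rng.normal(size=(n, n))
V = A @ A.T + n * np.eye(n)        # arbitrary symmetric positive definite V

XtX_inv = np.linalg.inv(X.T @ X)
cov_ols = XtX_inv @ X.T @ V @ X @ XtX_inv             # sandwich form for OLS
cov_gls = np.linalg.inv(X.T @ np.linalg.solve(V, X))  # (X' V^{-1} X)^{-1}

# The difference should have no negative eigenvalues (up to rounding).
diff = cov_ols - cov_gls
min_eig = np.linalg.eigvalsh((diff + diff.T) / 2).min()
print(min_eig)
```

A numerical check of this kind does not replace the proof, but it helps confirm that the two covariance formulas have been coded consistently.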

We have different models depending on specifications of V.

3.1.1 Classical regression model

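We have ideal conditions such that the $\varepsilon_{it}$'s are iid $N(0, \sigma^2)$. Then,

$$\mathbf{V} = \sigma^2 \mathbf{I}_{NT}, \quad \text{GLS} = \text{OLS},$$

where $\mathbf{I}_{NT}$ is an $NT \times NT$ identity matrix.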


3.1.2 Cross-sectional heteroskedasticity: heterogeneous variances

We now have ideal conditions except

$$\mathrm{Var}(\varepsilon_{it}) = \sigma_i^2 \neq \mathrm{Var}(\varepsilon_{jt}) = \sigma_j^2, \quad \text{for } i \neq j,$$

that is, the error variance is allowed to vary across individuals. Then,

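$$\mathbf{V} = \begin{pmatrix} \sigma_1^2 \mathbf{I}_T & 0 & \cdots & 0 \\ 0 & \sigma_2^2 \mathbf{I}_T & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_N^2 \mathbf{I}_T \end{pmatrix},$$

and therefore, it is straightforward to show that the GLS estimator becomes the weighted least squares estimator given by

$$\hat{\beta}_{GLS} = (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1} \mathbf{X}'\mathbf{V}^{-1}\mathbf{y} = \left( \sum_{i=1}^{N} \sigma_i^{-2} \mathbf{X}_i' \mathbf{X}_i \right)^{-1} \left( \sum_{i=1}^{N} \sigma_i^{-2} \mathbf{X}_i' \mathbf{y}_i \right).$$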

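Noting that the $\sigma_i^2$'s are not observable, the feasible GLS estimator is obtained by

$$\hat{\beta}_{FGLS} = \left( \sum_{i=1}^{N} s_i^{-2} \mathbf{X}_i' \mathbf{X}_i \right)^{-1} \left( \sum_{i=1}^{N} s_i^{-2} \mathbf{X}_i' \mathbf{y}_i \right),$$

where

$$s_i^2 = \frac{1}{T - k} \left( \mathbf{y}_i - \mathbf{X}_i \hat{\beta}_{GLS} \right)' \left( \mathbf{y}_i - \mathbf{X}_i \hat{\beta}_{GLS} \right).$$

The FGLS is consistent and asymptotically efficient (as $T \to \infty$ with $N$ fixed).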
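The weighted least squares form of the FGLS estimator can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in the notes: the function name `fgls_groupwise` is hypothetical, the first-step residuals come from pooled OLS (a common practical choice, since the GLS estimate is not available before the variances are estimated), and the data are simulated.

```python
import numpy as np

def fgls_groupwise(y_list, X_list):
    """Two-step FGLS with one error variance per cross-section unit."""
    X = np.vstack(X_list)
    y = np.concatenate(y_list)
    k = X.shape[1]
    # Step 1 (assumption): pooled OLS residuals are used to estimate s_i^2.
    b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    s2 = np.array([np.sum((yi - Xi @ b_ols) ** 2) / (len(yi) - k)
                   for yi, Xi in zip(y_list, X_list)])
    # Step 2: weighted least squares with weights 1 / s_i^2.
    A = sum(Xi.T @ Xi / s for Xi, s in zip(X_list, s2))
    c = sum(Xi.T @ yi / s for Xi, yi, s in zip(X_list, y_list, s2))
    return np.linalg.solve(A, c), s2

# Simulated example: N = 5 units, T = 200, very different error variances.
rng = np.random.default_rng(2)
beta_true = np.array([1.0, 2.0])
sigmas = [0.5, 1.0, 2.0, 4.0, 8.0]
T = 200
X_list = [np.column_stack([np.ones(T), rng.normal(size=T)]) for _ in sigmas]
y_list = [Xi @ beta_true + s * rng.normal(size=T)
          for Xi, s in zip(X_list, sigmas)]
beta_fgls, s2 = fgls_groupwise(y_list, X_list)
print(beta_fgls, s2)
```

Units with small estimated variances receive the largest weights, which is where the efficiency gain over pooled OLS comes from.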

Example 2 Greene (1997).

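$$I_{it} = \beta_1 + \beta_2 F_{it} + \beta_3 C_{it} + \varepsilon_{it},$$

where $N = 5$, $T = 20$ (1935-1954), $I_{it}$ is gross investment, $F_{it}$ the market value of the firm and $C_{it}$ the value of plant and equipment.

$$\text{OLS}: \quad \hat{I} = -48.03 + \underset{(.011)}{.106}\, F + \underset{(.043)}{.305}\, C, \quad R^2 = .78,$$

$$\text{FGLS}: \quad \hat{I} = -36.25 + \underset{(.0074)}{.095}\, F + \underset{(.030)}{.338}\, C.$$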

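Equality of variances can be tested using the LM statistic given by

$$LM = \frac{T}{2} \sum_{i=1}^{N} \left( \frac{s_i^2}{s^2} - 1 \right)^2,$$

where

$$s^2 = \frac{1}{N} \sum_{i=1}^{N} s_i^2.$$

Under the null, $\sigma_i^2 = \sigma^2$ for all $i$, and as $T \to \infty$,

$$LM \to \chi^2_N.$$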
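The statistic is simple to compute once the unit-specific variance estimates are available. The helper below is a minimal sketch; the name `lm_groupwise` and the input values are hypothetical.

```python
import numpy as np

def lm_groupwise(s2, T):
    """LM statistic for equal variances, given the estimates s_i^2."""
    s2 = np.asarray(s2, dtype=float)
    s2_bar = s2.mean()                       # s^2 = (1/N) * sum_i s_i^2
    return T / 2.0 * np.sum((s2 / s2_bar - 1.0) ** 2)

# Two units with variances 1 and 3, T = 20:
# s^2 = 2, ratios 0.5 and 1.5, so LM = 10 * (0.25 + 0.25) = 5.
lm_example = lm_groupwise([1.0, 3.0], T=20)
print(lm_example)
```

The value is then compared with the chi-squared critical value stated above.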

Example 3 In the above example, we find that $LM = 46.63$, which is much greater than the 95% critical value of $\chi^2_5$, so we strongly reject the case for homogeneous variances.

3.1.3 Cross-sectional Heteroskedasticity: Correlation across groups

Now, there is correlation across individuals at the same time;

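$$\mathrm{Cov}(\varepsilon_{it}, \varepsilon_{js}) = \begin{cases} 0 & \text{for } t \neq s, \\ \sigma_{ij} & \text{for } t = s. \end{cases}$$

Then,

$$\mathbf{V} = \begin{pmatrix} \sigma_{11} \mathbf{I}_T & \sigma_{12} \mathbf{I}_T & \cdots & \sigma_{1N} \mathbf{I}_T \\ \sigma_{21} \mathbf{I}_T & \sigma_{22} \mathbf{I}_T & \cdots & \sigma_{2N} \mathbf{I}_T \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{N1} \mathbf{I}_T & \sigma_{N2} \mathbf{I}_T & \cdots & \sigma_{NN} \mathbf{I}_T \end{pmatrix} = \boldsymbol{\Sigma} \otimes \mathbf{I}_T,$$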

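where the dimension of $\boldsymbol{\Sigma}$ is $N \times N$. This is the same error structure as in the "seemingly unrelated regressions" model.

FGLS can be obtained as previously: replace the unknown $\sigma_{ij}$ in $\mathbf{V}$ by

$$s_{ij} = \frac{1}{T - k} \left( \mathbf{y}_i - \mathbf{X}_i \hat{\beta}_{GLS} \right)' \left( \mathbf{y}_j - \mathbf{X}_j \hat{\beta}_{GLS} \right),$$

and define $\hat{\mathbf{V}}$ as $\mathbf{V}$ with $s_{ij}$ in place of $\sigma_{ij}$. Then,

$$\hat{\beta}_{FGLS} = \left( \mathbf{X}' \hat{\mathbf{V}}^{-1} \mathbf{X} \right)^{-1} \left( \mathbf{X}' \hat{\mathbf{V}}^{-1} \mathbf{y} \right).$$

Again, the FGLS estimator is consistent and asymptotically normal as $T \to \infty$ with $N$ fixed.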

Example 4 Greene example continued:

$$\text{FGLS}: \quad \hat{I} = -30.28 + \underset{(.0068)}{.094}\, F + \underset{(.027)}{.341}\, C.$$

To test that the off-diagonal elements of $\boldsymbol{\Sigma}$ are zero, that is, that there is no correlation across groups, we use the following LM statistic developed by Breusch and Pagan (1980):

$$LM = T \sum_{i=2}^{N} \sum_{j=1}^{i-1} r_{ij}^2,$$

where $r_{ij}$ is the $ij$th residual correlation coefficient. Under the (joint) null of $\sigma_{ij} = 0$ for $i, j = 1, 2, \dots, N$ and $i \neq j$, as $T \to \infty$,

$$LM \to \chi^2_{N(N-1)/2}.$$
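A minimal sketch of the Breusch-Pagan LM statistic follows. The function name `breusch_pagan_lm` is hypothetical, and the residual correlation is assumed to be $r_{ij} = s_{ij}/\sqrt{s_{ii} s_{jj}}$ computed from an $N \times T$ array of residuals without further demeaning.

```python
import numpy as np

def breusch_pagan_lm(E):
    """LM = T * sum_{i>j} r_ij^2 from an (N, T) array of residuals."""
    E = np.asarray(E, dtype=float)
    N, T = E.shape
    S = E @ E.T                              # cross-products e_i' e_j
    d = np.sqrt(np.diag(S))
    R = S / np.outer(d, d)                   # residual correlations r_ij
    lower = np.tril_indices(N, k=-1)         # pairs with i > j
    return T * np.sum(R[lower] ** 2), N * (N - 1) // 2

# Two perfectly correlated residual series: r_21 = 1, so LM = T = 4.
lm_equal, df_equal = breusch_pagan_lm([[1, -1, 1, -1],
                                       [1, -1, 1, -1]])
# Two orthogonal residual series: r_21 = 0, so LM = 0.
lm_orth, _ = breusch_pagan_lm([[1, -1, 1, -1],
                               [1, 1, -1, -1]])
print(lm_equal, df_equal, lm_orth)
```

The second return value is the degrees of freedom, $N(N-1)/2$, of the limiting chi-squared distribution.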

Example 5 Greene example continued: Using the residuals based on the FGLS estimates given above, we find $LM = 51.32$, which is far greater than the 95% critical value of $\chi^2_{10}$. Hence we may conclude that the simple heteroskedastic model is not general enough for the investment data.

3.1.4 Autocorrelation (but not correlation across individuals)

See Greene (1997, Section 15.2.3).

3.2 Models for Longitudinal Data

Here we have large $N$ but small $T$; hence we use asymptotic theory as $N \to \infty$ with $T$ fixed.

Example 6 Panel study for income dynamics (PSID), N = 5000, T = 9.

In principle, the methods of the previous section could be applied, but this is problematic because only a few time periods are available. In this case the techniques have focused on cross-sectional variation or heterogeneity.

The basic assumption made is that a time-invariant "individual effect" becomes part of the error process; that is, we consider the following error components-based panel,

$$y_{it} = x_{it} \beta + \varepsilon_{it}, \quad i = 1, 2, \dots, N; \; t = 1, 2, \dots, T, \tag{12}$$

and $\varepsilon_{it}$ is decomposed as

$$\varepsilon_{it} = \alpha_i + u_{it}, \tag{13}$$

where αi’s are called individual effects. Here we assume:

• The $u_{it}$'s are $N(0, \sigma_u^2)$.

• The $u_{it}$'s are uncorrelated with $x_{js}$ for all $i, t, j, s$, i.e., the $x$'s are exogenous with respect to $u$.

But, assumptions about αi vary.

Example 7 Suppose we have observations for $T$ time periods on $N$ regions of a country. We want to estimate the spillover effect of foreign technology on domestic firm productivity in manufacturing. An error components model describing output in each region with a standard Cobb-Douglas production function is given by

$$q_{it} = \beta_1 k_{it} + \beta_2 l_{it} + \beta_3 m_{it} + \beta_4 f_{it} + \alpha_i + u_{it},$$

where $q_{it}$ is the log of output of domestic firms, $k_{it}$ the log of capital, $l_{it}$ the log of labor, $m_{it}$ the log of materials, and $f_{it}$ a measure of the influence of foreign firms. A positive spillover is indicated by $\beta_4 > 0$. Why should we allow for the unobserved individual effects $\alpha_i$? One reason is that an observed positive relationship between output in a region and foreign influence, controlling only for capital, labor and materials, might simply reflect the fact that foreign firms tend to settle in areas that lend themselves to higher productivity; there may be no spillover effect at all. By adding $\alpha_i$ we allow $f_{it}$ to be correlated with features of the region, embodied in $\alpha_i$, that are related to higher productivity. This solution is an improvement over not allowing for $\alpha_i$.

Example 8 Suppose that $y_{it}$ is net migration into city $i$ at time $t$. You would like to see whether taxes, housing prices, educational quality and other factors influence population flows. There are certain features of cities, for example geographical characteristics and reputation, that could be difficult to model but are essentially constant over short periods of time. Because the unobservables influence $y_{it}$ and might also be related to local policy and economic variables, it is important to control for them in any model. One model would be

$$y_{it} = \beta_1 x_{it} + \beta_2 h_{it} + \beta_3 e_{it} + \beta_4 c_{it} + \alpha_i + u_{it},$$

where $x$ are tax rates, $h$ housing prices, $e$ educational quality and $c$ crime rates. Now the $\alpha_i$ capture all time-constant (unobservable) differences across cities that might affect migration. Thus, the above regression allows us to estimate

$$E(y_{it} \mid x_{it}, h_{it}, e_{it}, c_{it}, \alpha_i),$$

which makes it clear that we are controlling for unobserved city effects when estimating, for example, the effects of tax policy on net migration.

There are two main approaches to dealing with the $\alpha_i$'s: fixed effects and random effects.

3.2.1 Fixed Effects Estimator

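Here we treat the $\alpha_i$ as fixed; in particular, remember that we do not assume $\alpha_i$ to be uncorrelated with $x_{it}$. This implies that differences across cross-section units can be captured by differences in the constant terms. Notice that the regression of $y_{it}$ on $x_{it}$ only is biased if $\alpha_i$ is correlated with $x_{it}$. (Why?)

We now want to avoid this bias by using fixed effects estimation. Defining

$$\underset{T \times 1}{\mathbf{e}_T} = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}, \quad \underset{N \times 1}{\boldsymbol{\alpha}^*} = \begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_N \end{pmatrix}, \quad \underset{NT \times N}{\mathbf{D}} = \begin{pmatrix} \mathbf{e}_T & 0 & \cdots & 0 \\ 0 & \mathbf{e}_T & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \mathbf{e}_T \end{pmatrix} = \mathbf{I}_N \otimes \mathbf{e}_T,$$

and

$$\underset{NT \times 1}{\boldsymbol{\alpha}} = (\alpha_1, \dots, \alpha_1, \alpha_2, \dots, \alpha_2, \dots, \alpha_N, \dots, \alpha_N)' = \begin{pmatrix} \alpha_1 \mathbf{e}_T \\ \alpha_2 \mathbf{e}_T \\ \vdots \\ \alpha_N \mathbf{e}_T \end{pmatrix} = \mathbf{D} \boldsymbol{\alpha}^* = \boldsymbol{\alpha}^* \otimes \mathbf{e}_T,$$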

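then we have (in $T$ observations for each individual)

$$\mathbf{y}_i = \mathbf{x}_i \beta + \alpha_i \mathbf{e}_T + \mathbf{u}_i, \tag{14}$$

or (in $NT$ observations)

$$\mathbf{y} = \mathbf{X}\beta + \boldsymbol{\alpha} + \mathbf{u} = \mathbf{X}\beta + \mathbf{D}\boldsymbol{\alpha}^* + \mathbf{u}. \tag{15}$$

Notice that the bias in the regression of $\mathbf{y}$ on $\mathbf{X}$ only is due to the omission of $\mathbf{D}$ in (15). The solution is simply to regress $\mathbf{y}$ on $(\mathbf{X}, \mathbf{D})$. This is sometimes called "OLS with dummy variables" or the least squares dummy variable (LSDV) model; that is,

$$\hat{\beta} = \text{coefficients of } \mathbf{X} \text{ in the regression of } \mathbf{y} \text{ on } (\mathbf{X}, \mathbf{D}).$$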

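But there is a computational problem for large $N$, since the dimension of $\mathbf{D}$ is very big ($N$ columns). So we need an alternative formulation. Define the $NT \times NT$ matrices $\mathbf{P}$ and $\mathbf{Q}$ by

$$\mathbf{P} = \mathbf{D}(\mathbf{D}'\mathbf{D})^{-1}\mathbf{D}',$$

$$\mathbf{Q} = \mathbf{I}_{NT} - \mathbf{P} = \mathbf{I}_{NT} - \mathbf{D}(\mathbf{D}'\mathbf{D})^{-1}\mathbf{D}',$$

then a standard result from least squares algebra says:

$$\hat{\beta} = (\mathbf{X}'\mathbf{Q}\mathbf{X})^{-1}(\mathbf{X}'\mathbf{Q}\mathbf{y}).$$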
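The equivalence between the dummy-variable regression and the $\mathbf{Q}$-based formula can be verified numerically. The sketch below uses a small simulated panel; the dimensions, coefficient values and seed are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
N, T, k = 6, 5, 2
alpha = rng.normal(size=N)
# Regressors correlated with the individual effects, stacked unit by unit.
X = rng.normal(size=(N * T, k)) + np.repeat(alpha, T)[:, None]
y = (X @ np.array([1.0, -0.5]) + np.repeat(alpha, T)
     + 0.1 * rng.normal(size=N * T))

# D = I_N kron e_T: one dummy column per individual.
D = np.kron(np.eye(N), np.ones((T, 1)))
P = D @ np.linalg.inv(D.T @ D) @ D.T
Q = np.eye(N * T) - P

# LSDV: regress y on (X, D) and keep the coefficients on X.
b_lsdv = np.linalg.lstsq(np.hstack([X, D]), y, rcond=None)[0][:k]
# Partitioned-regression formula: (X'QX)^{-1} X'Qy.
b_q = np.linalg.solve(X.T @ Q @ X, X.T @ Q @ y)
print(b_lsdv, b_q)
```

Both routes give the same slope estimates; the second avoids estimating the $N$ dummy coefficients explicitly.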

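Digression on algebra: Notice that

$$\mathbf{e}_T' \mathbf{e}_T = T; \quad \mathbf{e}_T \mathbf{e}_T' = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ 1 & 1 & \cdots & 1 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \cdots & 1 \end{pmatrix}.$$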


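Defining

$$\mathbf{P}^* = \frac{1}{T} \mathbf{e}_T \mathbf{e}_T' = \begin{pmatrix} \frac{1}{T} & \frac{1}{T} & \cdots & \frac{1}{T} \\ \frac{1}{T} & \frac{1}{T} & \cdots & \frac{1}{T} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{1}{T} & \frac{1}{T} & \cdots & \frac{1}{T} \end{pmatrix},$$

then this matrix creates "means" for any $T \times 1$ vector $\mathbf{c} = [c_1, \dots, c_T]'$; that is,

$$\mathbf{P}^* \mathbf{c} = \begin{pmatrix} \bar{c} \\ \bar{c} \\ \vdots \\ \bar{c} \end{pmatrix} = \bar{c}\, \mathbf{e}_T, \quad \bar{c} = \frac{1}{T} \sum_{t=1}^{T} c_t.$$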

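Next, define

$$\mathbf{Q}^* = \mathbf{I}_T - \frac{1}{T} \mathbf{e}_T \mathbf{e}_T' = \begin{pmatrix} 1 - \frac{1}{T} & -\frac{1}{T} & \cdots & -\frac{1}{T} \\ -\frac{1}{T} & 1 - \frac{1}{T} & \cdots & -\frac{1}{T} \\ \vdots & \vdots & \ddots & \vdots \\ -\frac{1}{T} & -\frac{1}{T} & \cdots & 1 - \frac{1}{T} \end{pmatrix},$$

which makes "deviations from means"; that is,

$$\mathbf{Q}^* \mathbf{c} = \begin{pmatrix} c_1 - \bar{c} \\ c_2 - \bar{c} \\ \vdots \\ c_T - \bar{c} \end{pmatrix}.$$

These matrices are relevant because

$$\mathbf{P} = \mathbf{I}_N \otimes \mathbf{P}^*, \quad \mathbf{Q} = \mathbf{I}_N \otimes \mathbf{Q}^*,$$

which are idempotent,

$$\mathbf{P}\mathbf{P} = \mathbf{P}, \quad \mathbf{Q}\mathbf{Q} = \mathbf{Q},$$

symmetric,

$$\mathbf{P} = \mathbf{P}', \quad \mathbf{Q} = \mathbf{Q}',$$

and orthogonal to each other,

$$\mathbf{P}\mathbf{Q} = 0.$$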
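These algebraic properties are easy to confirm numerically. The following sketch uses arbitrary small dimensions ($N = 3$, $T = 4$) and a made-up vector $\mathbf{c}$; none of the values come from the source.

```python
import numpy as np

N, T = 3, 4
e_T = np.ones((T, 1))
P_star = e_T @ e_T.T / T            # creates means
Q_star = np.eye(T) - P_star         # creates deviations from means

c = np.array([1.0, 2.0, 3.0, 6.0])  # mean is 3
means = P_star @ c                  # every element equals the mean
deviations = Q_star @ c             # c_t minus the mean

# Block-diagonal versions acting on the stacked NT x 1 vectors.
P = np.kron(np.eye(N), P_star)
Q = np.kron(np.eye(N), Q_star)

print(means, deviations)
print(np.allclose(P @ P, P), np.allclose(Q @ Q, Q),
      np.allclose(P, P.T), np.allclose(P @ Q, 0.0))
```

Because $\mathbf{P}$ and $\mathbf{Q}$ are block diagonal with identical blocks, applying them to stacked data is the same as computing means and deviations unit by unit.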


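In particular, they make "individual means" and "deviations from individual means":

$$\mathbf{P}\mathbf{y} = \begin{pmatrix} \bar{y}_1 \\ \vdots \\ \bar{y}_1 \\ \bar{y}_2 \\ \vdots \\ \bar{y}_2 \\ \vdots \\ \bar{y}_N \\ \vdots \\ \bar{y}_N \end{pmatrix}, \quad \mathbf{Q}\mathbf{y} = \begin{pmatrix} y_{11} - \bar{y}_1 \\ \vdots \\ y_{1T} - \bar{y}_1 \\ y_{21} - \bar{y}_2 \\ \vdots \\ y_{2T} - \bar{y}_2 \\ \vdots \\ y_{N1} - \bar{y}_N \\ \vdots \\ y_{NT} - \bar{y}_N \end{pmatrix}, \tag{16}$$

where $\bar{y}_i = \frac{1}{T} \sum_{t=1}^{T} y_{it}$, and similarly for $x$ and $u$. In particular, note that

$$\mathbf{Q}\mathbf{y} = \mathbf{Q}(\mathbf{X}\beta + \boldsymbol{\alpha} + \mathbf{u}) = \mathbf{Q}\mathbf{X}\beta + \mathbf{Q}\mathbf{u},$$

because

$$\mathbf{Q}\boldsymbol{\alpha} = 0.$$

Therefore, taking deviations from (individual) means removes time-invariant unobservables. Multiplication by $\mathbf{Q}$ (taking deviations from individual means) is often called the "within transformation".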

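Within estimator This can be obtained by one of the following equivalent procedures:

1. Regression of $\mathbf{Q}\mathbf{y}$ on $\mathbf{Q}\mathbf{X}$.

2. Regression of $y_{it} - \bar{y}_i$ on $x_{it} - \bar{x}_i$, $i = 1, \dots, N$, $t = 1, \dots, T$, where $\bar{y}_i = \frac{1}{T} \sum_{t=1}^{T} y_{it}$ and $\bar{x}_i = \frac{1}{T} \sum_{t=1}^{T} x_{it}$.

3. OLS estimation with dummy variables (LSDV).

The within estimator is obtained by[3]

$$\hat{\beta}_W = (\mathbf{X}'\mathbf{Q}\mathbf{X})^{-1} \mathbf{X}'\mathbf{Q}\mathbf{y}.$$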

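Here we should bear in mind that

1. We cannot include any time-invariant regressors.

2. If you estimate by within-estimation and still want $\alpha_i$, then

$$\hat{\alpha}_i = \bar{y}_i - \bar{x}_i \hat{\beta}_W.$$

3. Statistical properties of $\hat{\beta}_W$: unbiased, consistent (as $N \to \infty$ or $T \to \infty$) and asymptotically normal,[4]

$$\sqrt{NT} \left( \hat{\beta}_W - \beta \right) \to N \left( 0, \; \sigma_u^2 \lim_{NT \to \infty} \left( \frac{1}{NT} \mathbf{X}'\mathbf{Q}\mathbf{X} \right)^{-1} \right),$$

so that the usual inference procedures, such as t and Wald tests, can be used.

[3] Using the (double) summation notation, we have

$$\hat{\beta}_W = \left[ \sum_{i=1}^{N} \sum_{t=1}^{T} (x_{it} - \bar{x}_i)' (x_{it} - \bar{x}_i) \right]^{-1} \left[ \sum_{i=1}^{N} \sum_{t=1}^{T} (x_{it} - \bar{x}_i)' (y_{it} - \bar{y}_i) \right].$$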
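The points above can be illustrated with a short simulation. This is a sketch under stated assumptions: the function name `within_estimator` is hypothetical, the design ($x_{it} = \alpha_i + v_{it}$, true slope 2, $\sigma_u = 0.5$) is invented so that $\alpha_i$ is correlated with the regressor, and the variance estimate divides the within residual sum of squares by $N(T-1)$.

```python
import numpy as np

def within_estimator(y, X):
    """Within (fixed effects) estimator; y is (N, T), X is (N, T, k)."""
    N, T, k = X.shape
    y_dev = y - y.mean(axis=1, keepdims=True)      # y_it - ybar_i
    X_dev = X - X.mean(axis=1, keepdims=True)      # x_it - xbar_i
    Xs, ys = X_dev.reshape(N * T, k), y_dev.reshape(N * T)
    beta = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)
    alpha = y.mean(axis=1) - X.mean(axis=1) @ beta  # recovered alpha_i
    resid = ys - Xs @ beta
    sigma2_u = resid @ resid / (N * (T - 1))
    return beta, alpha, sigma2_u

rng = np.random.default_rng(5)
N, T = 200, 5
alpha_true = rng.normal(size=N)
x = alpha_true[:, None] + rng.normal(size=(N, T))   # correlated with alpha_i
y = 2.0 * x + alpha_true[:, None] + 0.5 * rng.normal(size=(N, T))

beta_w, alpha_hat, sigma2_u = within_estimator(y, x[:, :, None])

# Pooled OLS with a single intercept ignores alpha_i and is biased here.
Z = np.column_stack([np.ones(N * T), x.ravel()])
beta_pooled = np.linalg.lstsq(Z, y.ravel(), rcond=None)[0][1]
print(beta_w, beta_pooled, sigma2_u)
```

In this design the pooled slope is pushed upwards by the correlation between $x_{it}$ and $\alpha_i$, while the within estimate stays close to the true value.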

In sum, fixed effects estimation removes the potential bias caused by time-invariant unobservables by means of the within transformation. The cost is that only variation over time (not between individuals) is used in estimating $\beta$, which may result in imprecise estimates. Essentially, the fixed effects model concentrates on differences within individuals; that is, it explains to what extent $y_{it}$ differs from $\bar{y}_i$ and does not explain why $\bar{y}_i$ is different from $\bar{y}_j$. It is important to realize that the $\beta$'s are identified (or consistently estimated) only through the within variation of the data.

Testing the significance of the individual effects If we are interested in differences across individuals, we can test the hypothesis

$$\alpha_i = \alpha \text{ for all } i, \tag{17}$$

with an F test given by

$$F_{FE} = \frac{\left( R_u^2 - R_r^2 \right) / (N - 1)}{\left( 1 - R_u^2 \right) / (NT - N - k)},$$

where a subscript $u$ indicates the unrestricted model, (291), and a subscript $r$ indicates the restricted or pooled model with only a single overall constant term,

$$y_{it} = x_{it} \beta + \alpha + u_{it}.$$

Under the null (17), it can be shown that

$$F_{FE} \sim F(N - 1, NT - N - k).$$

[4] A consistent estimate for $\sigma_u^2$ is obtained as the within residual sum of squares divided by $N(T - 1)$, that is,

$$\hat{\sigma}_u^2 = \frac{1}{N(T - 1)} \sum_{i=1}^{N} \sum_{t=1}^{T} \left[ (y_{it} - \bar{y}_i) - (x_{it} - \bar{x}_i) \hat{\beta}_W \right]^2.$$

It is also possible to apply the usual degrees of freedom correction, in which case $k$ is subtracted from the denominator.


3.2.2 Random Effects Estimator

We consider the same basic model, (291), but now assume:

• The $u_{it}$'s are iid $N(0, \sigma_u^2)$.

• The $\alpha_i$'s are iid $N(0, \sigma_\alpha^2)$.

• The $\alpha_i$'s are uncorrelated with $u_{jt}$ for all $i, j, t$, that is, $E[\alpha_i u_{jt}] = 0$ for all $i, j, t$.

• $\alpha_i$ and $u_{it}$ are uncorrelated with $x_{js}$ for all $i, j, s, t$ (so $x$ is exogenous with respect to $\alpha$ and $u$).

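This approach would be appropriate if we believed that the sampled cross-sectional units were drawn from a large population. First of all, the OLS estimator is unbiased and consistent, because $x$ is assumed to be exogenous with respect to $\varepsilon = \alpha + u$, but inefficient. This model is suitable for the case of "pooling", not of "bias reduction" as in the fixed effects model. In general, the GLS estimator will be more efficient. For GLS estimation we need a more detailed expression for $\mathrm{Cov}(\boldsymbol{\varepsilon}) = \mathbf{V}$.

First, consider

$$\mathrm{Cov}(\boldsymbol{\varepsilon}_i) = E \begin{pmatrix} \varepsilon_{i1}^2 & \varepsilon_{i1}\varepsilon_{i2} & \cdots & \varepsilon_{i1}\varepsilon_{iT} \\ \varepsilon_{i2}\varepsilon_{i1} & \varepsilon_{i2}^2 & \cdots & \varepsilon_{i2}\varepsilon_{iT} \\ \vdots & \vdots & \ddots & \vdots \\ \varepsilon_{iT}\varepsilon_{i1} & \varepsilon_{iT}\varepsilon_{i2} & \cdots & \varepsilon_{iT}^2 \end{pmatrix}_{T \times T}.$$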

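Under the assumptions given above it is easily seen that

$$E\left( \varepsilon_{it}^2 \right) = E\left( \alpha_i^2 + u_{it}^2 + 2 \alpha_i u_{it} \right) = \sigma_\alpha^2 + \sigma_u^2,$$

$$E(\varepsilon_{it} \varepsilon_{is}) = E\left[ (\alpha_i + u_{it})(\alpha_i + u_{is}) \right] = \sigma_\alpha^2, \quad t \neq s.$$

Hence, for all $i$,

$$\boldsymbol{\Omega} \equiv \mathrm{Cov}(\boldsymbol{\varepsilon}_i) = \sigma_\alpha^2 \mathbf{e}_T \mathbf{e}_T' + \sigma_u^2 \mathbf{I}_T = \begin{pmatrix} \sigma_\alpha^2 + \sigma_u^2 & \sigma_\alpha^2 & \cdots & \sigma_\alpha^2 \\ \sigma_\alpha^2 & \sigma_\alpha^2 + \sigma_u^2 & \cdots & \sigma_\alpha^2 \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_\alpha^2 & \sigma_\alpha^2 & \cdots & \sigma_\alpha^2 + \sigma_u^2 \end{pmatrix}.$$

Therefore, the $NT \times NT$ matrix $\mathbf{V}$ can be written as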

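$$\mathbf{V} = \mathrm{Cov}(\boldsymbol{\varepsilon}) = \begin{pmatrix} \boldsymbol{\Omega} & 0 & \cdots & 0 \\ 0 & \boldsymbol{\Omega} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \boldsymbol{\Omega} \end{pmatrix} = \mathbf{I}_N \otimes \boldsymbol{\Omega}, \tag{18}$$

where $\otimes$ is the Kronecker product.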


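Now, recall

$$\mathbf{P}^* = \frac{1}{T} \mathbf{e}_T \mathbf{e}_T', \quad \mathbf{Q}^* = \mathbf{I}_T - \frac{1}{T} \mathbf{e}_T \mathbf{e}_T', \quad \mathbf{P} = \mathbf{I}_N \otimes \mathbf{P}^*, \quad \mathbf{Q} = \mathbf{I}_N \otimes \mathbf{Q}^*.$$

Then, $\boldsymbol{\Omega}$ can be rewritten as

$$\boldsymbol{\Omega} = \left( T\sigma_\alpha^2 + \sigma_u^2 \right) \frac{1}{T} \mathbf{e}_T \mathbf{e}_T' + \sigma_u^2 \left( \mathbf{I}_T - \frac{1}{T} \mathbf{e}_T \mathbf{e}_T' \right) = \left( T\sigma_\alpha^2 + \sigma_u^2 \right) \mathbf{P}^* + \sigma_u^2 \mathbf{Q}^*.$$

Next,

$$\mathbf{V} = \mathbf{I}_N \otimes \boldsymbol{\Omega} = \left( T\sigma_\alpha^2 + \sigma_u^2 \right) \mathbf{P} + \sigma_u^2 \mathbf{Q}, \tag{19}$$

where we used $\mathbf{P} = \mathbf{I}_N \otimes \mathbf{P}^*$ and $\mathbf{Q} = \mathbf{I}_N \otimes \mathbf{Q}^*$.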

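Digression on derivation of the inverse of V For GLS estimation we need to find

$$\mathbf{V}^{-1} = \mathbf{I}_N \otimes \boldsymbol{\Omega}^{-1} \quad \text{or} \quad \mathbf{V}^{-1/2} = \mathbf{I}_N \otimes \boldsymbol{\Omega}^{-1/2}.$$

Using the special nature of $\mathbf{P}$ and $\mathbf{Q}$, it can be shown that[5]

$$\mathbf{V}^{-1} = \frac{1}{T\sigma_\alpha^2 + \sigma_u^2} \mathbf{P} + \frac{1}{\sigma_u^2} \mathbf{Q}, \tag{20}$$

and then

$$\mathbf{V}^{-1/2} = \frac{1}{\sqrt{T\sigma_\alpha^2 + \sigma_u^2}} \mathbf{P} + \frac{1}{\sqrt{\sigma_u^2}} \mathbf{Q} = \frac{1}{\sigma_u} \left[ \sqrt{\frac{\sigma_u^2}{T\sigma_\alpha^2 + \sigma_u^2}}\, \mathbf{P} + \mathbf{Q} \right] = \frac{1}{\sigma_u} \left[ \mathbf{I}_{NT} - \left( 1 - \sqrt{\frac{\sigma_u^2}{T\sigma_\alpha^2 + \sigma_u^2}} \right) \mathbf{P} \right].$$

Define

$$\theta = 1 - \sqrt{\frac{\sigma_u^2}{T\sigma_\alpha^2 + \sigma_u^2}}, \tag{21a}$$

then

$$\mathbf{V}^{-1/2} = \frac{1}{\sigma_u} \left( \mathbf{I}_{NT} - \theta \mathbf{P} \right). \tag{22}$$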

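Notice that the GLS estimator is obtained by

$$\hat{\beta}_{GLS} = (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1} (\mathbf{X}'\mathbf{V}^{-1}\mathbf{y}) = \left[ (\mathbf{V}^{-1/2}\mathbf{X})' (\mathbf{V}^{-1/2}\mathbf{X}) \right]^{-1} (\mathbf{V}^{-1/2}\mathbf{X})' (\mathbf{V}^{-1/2}\mathbf{y}),$$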

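[5] Similarly, the inverse of $\boldsymbol{\Omega}$ can be obtained as

$$\boldsymbol{\Omega}^{-1} = \frac{1}{T\sigma_\alpha^2 + \sigma_u^2} \mathbf{P}^* + \frac{1}{\sigma_u^2} \mathbf{Q}^*.$$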


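and therefore $\hat{\beta}_{GLS}$ is obtained from the regression of $\mathbf{V}^{-1/2}\mathbf{y}$ on $\mathbf{V}^{-1/2}\mathbf{X}$. In fact, this is equivalent to OLS estimation after "$\theta$ differences", since

$$\mathbf{V}^{-1/2}\mathbf{y} = \frac{1}{\sigma_u} \left( \mathbf{I}_{NT} - \theta \mathbf{P} \right) \mathbf{y} = \frac{1}{\sigma_u} \left( \mathbf{y} - \theta \mathbf{P}\mathbf{y} \right),$$

or more precisely

$$\left( \mathbf{V}^{-1/2}\mathbf{y} \right)_{it} = \frac{1}{\sigma_u} \left( y_{it} - \theta \bar{y}_i \right),$$

and likewise

$$\left( \mathbf{V}^{-1/2}\mathbf{X} \right)_{it} = \frac{1}{\sigma_u} \left( x_{it} - \theta \bar{x}_i \right),$$

where $\bar{y}_i = \frac{1}{T} \sum_{t=1}^{T} y_{it}$ and $\bar{x}_i = \frac{1}{T} \sum_{t=1}^{T} x_{it}$. So the GLS estimator is obtained from a regression of

$$\left( y_{it} - \theta \bar{y}_i \right) = \left( x_{it} - \theta \bar{x}_i \right) \beta + \tilde{\varepsilon}_{it},$$

where $\tilde{\varepsilon}_{it} = \varepsilon_{it} - \theta \bar{\varepsilon}_i$ (the proportionality constant $\frac{1}{\sigma_u}$ being cancelled out). In other words, a fixed proportion $\theta$ of the individual means is subtracted from the data to obtain this transformed model.

We note in passing that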

1. You can now include time-invariant or individual-specific variables. They also get multiplied by $I_{NT} - \theta P$.

2. If $\theta = 0$, then the GLS estimator is equivalent to the OLS estimator. But this would occur only if $\sigma_\alpha^2 = 0$.

3. The GLS estimator is consistent as $N \to \infty$ (with $T$ fixed or with $T \to \infty$ such that $N/T$ is constant). It is also asymptotically efficient relative to the within estimator, with

\[
\mathrm{Cov}(\hat\beta_{GLS}) = \left(X'V^{-1}X\right)^{-1}
= \left[ X' \left( \frac{1}{T\sigma_\alpha^2 + \sigma_u^2} P + \frac{1}{\sigma_u^2} Q \right) X \right]^{-1}.
\]

4. The efficiency difference tends to 0 as $T \to \infty$ and $\theta \to 1$. (When $\theta = 1$, $\hat\beta_{GLS} = \hat\beta_W$.)
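The equivalence between GLS with the full $V^{-1}$ and OLS on $\theta$-differenced data can be checked numerically. The following is a minimal sketch, not part of the original notes: the simulated design, the variance components `sigma_a2` and `sigma_u2`, and all variable names are assumptions chosen for illustration, and the variance components are treated as known.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, k = 50, 4, 2
sigma_a2, sigma_u2 = 2.0, 1.0            # assumed (known) variance components

# Simulated random effects panel, stacked individual by individual
X = rng.normal(size=(N * T, k))
alpha = np.repeat(rng.normal(scale=np.sqrt(sigma_a2), size=N), T)
u = rng.normal(scale=np.sqrt(sigma_u2), size=N * T)
y = X @ np.array([1.0, -0.5]) + alpha + u

# P = I_N (x) (e_T e_T' / T), Q = I_NT - P, and V as in (19)
P = np.kron(np.eye(N), np.full((T, T), 1.0 / T))
Q = np.eye(N * T) - P
V = (T * sigma_a2 + sigma_u2) * P + sigma_u2 * Q

# GLS using the full NT x NT inverse
V_inv = np.linalg.inv(V)
beta_gls = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ y)

# OLS after theta-differencing, with theta as in (21a)
theta = 1.0 - np.sqrt(sigma_u2 / (T * sigma_a2 + sigma_u2))
y_td = y - theta * (P @ y)
X_td = X - theta * (P @ X)
beta_td = np.linalg.lstsq(X_td, y_td, rcond=None)[0]

# Closed-form inverse from (20), for comparison with the numerical inverse
V_inv_closed = P / (T * sigma_a2 + sigma_u2) + Q / sigma_u2
```

Comparing `beta_gls` with `beta_td`, and `V_inv` with `V_inv_closed`, illustrates why the $NT \times NT$ inverse never has to be formed in practice.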

3.2.3 Between estimator

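We now formulate the model in terms of the individual means,

\[
\bar y_i = \bar x_i \beta + \bar\varepsilon_i, \quad i = 1, \dots, N, \tag{23}
\]

where

\[
\underset{T \times 1}{\bar y_i} = \bar y_{i\cdot} \otimes e_T, \quad
\underset{T \times k}{\bar x_i} = \bar x_{i\cdot} \otimes e_T, \quad
\bar\varepsilon_i = \alpha_i \otimes e_T + \bar u_i,
\]

\[
\bar y_{i\cdot} = T^{-1} \sum_{t=1}^T y_{it}, \quad \bar x_{i\cdot} = T^{-1} \sum_{t=1}^T x_{it}.
\]

Recalling that the matrix $P$ creates "means" for any conformable vector, we write (23) in matrix form as

\[
Py = PX\beta + P\varepsilon. \tag{24}
\]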

The OLS estimator in the above regression gives the between estimator,

\[
\hat\beta_B = (X'PX)^{-1} X'Py, \tag{25}
\]

which is also unbiased and consistent as $N \to \infty$ under the assumption that $\bar x_i$ is uncorrelated with $\alpha_i$ (such that $E(\bar x_i' \bar\varepsilon_i) = 0$), and with

\[
\mathrm{Var}(\hat\beta_B) = \left(T\sigma_\alpha^2 + \sigma_u^2\right) (X'PX)^{-1}.
\]

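The between estimator ignores any information within individuals. The formula in (25) can be simplified as follows. Notice from (16) that $Py$ and $PX$ can be written as

\[
Py = \bar y \otimes e_T, \quad PX = \bar X \otimes e_T,
\]

where

\[
\underset{N \times 1}{\bar y} = \begin{bmatrix} \bar y_{1\cdot} \\ \vdots \\ \bar y_{N\cdot} \end{bmatrix}, \quad
\underset{N \times k}{\bar X} = \begin{bmatrix} \bar x_{1\cdot} \\ \vdots \\ \bar x_{N\cdot} \end{bmatrix},
\]

and thus we have

\[
\begin{aligned}
\hat\beta_B &= \left[(\bar X \otimes e_T)'(\bar X \otimes e_T)\right]^{-1} (\bar X \otimes e_T)'(\bar y \otimes e_T) \qquad (26) \\
&= (\bar X'\bar X \otimes e_T'e_T)^{-1} (\bar X'\bar y \otimes e_T'e_T) = (\bar X'\bar X \otimes T)^{-1} (\bar X'\bar y \otimes T) \\
&= \left[(\bar X'\bar X)^{-1} \otimes T^{-1}\right] (\bar X'\bar y \otimes T) = (\bar X'\bar X)^{-1} \bar X'\bar y,
\end{aligned}
\]

which is equivalent to the OLS estimator obtained from the following cross-sectional regression:

\[
\bar y_{i\cdot} = \bar x_{i\cdot}\beta + \bar\varepsilon_{i\cdot}, \quad i = 1, \dots, N. \tag{27}
\]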

The GLS estimator can be shown to be a weighted average of the within estimator $\hat\beta_W$ and the between estimator $\hat\beta_B$ (see Greene, p. 473),

\[
\hat\beta_G = F\hat\beta_W + (I_k - F)\hat\beta_B,
\]

where

\[
F = (X'QX + \lambda X'PX)^{-1} X'QX, \quad \lambda = (1 - \theta)^2.
\]

This clearly shows that the efficiency gain of the GLS estimator relative to the within estimator comes from the use of between (across individuals) variation. The GLS estimator is the optimal combination of the within estimator and the between estimator, and therefore is more efficient than either.

There are some polar cases to consider:

1. If $\lambda = 1$, the GLS is equivalent to the OLS. This would occur only if $\sigma_\alpha^2 = 0$. Thus, the OLS estimator is also a linear combination of the within and the between estimators, but not an efficient one.

2. If $\lambda = 0$, the GLS is equivalent to the within estimator. There are two possibilities. The first is $\sigma_u^2 = 0$, in which case all variation across individuals would be due to the $\alpha_i$'s, which would be equivalent to the dummy variables used in the fixed effects model. The question of whether they were fixed or random would be moot. The other case is $T \to \infty$.
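The weighted-average representation and its two polar cases can be verified on simulated data. The sketch below is illustrative only: the data-generating process, the variance components and the helper name `combine` are assumptions, and the variance components are treated as known.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, k = 40, 5, 2
sigma_a2, sigma_u2 = 1.5, 1.0            # assumed (known) variance components

# Regressors with both within and between variation
X = rng.normal(size=(N * T, k)) + np.repeat(rng.normal(size=(N, k)), T, axis=0)
y = (X @ np.array([0.8, 0.3])
     + np.repeat(rng.normal(scale=np.sqrt(sigma_a2), size=N), T)
     + rng.normal(scale=np.sqrt(sigma_u2), size=N * T))

P = np.kron(np.eye(N), np.full((T, T), 1.0 / T))
Q = np.eye(N * T) - P

beta_w = np.linalg.solve(X.T @ Q @ X, X.T @ Q @ y)    # within estimator
beta_b = np.linalg.solve(X.T @ P @ X, X.T @ P @ y)    # between estimator

def combine(lam):
    """Return F beta_W + (I_k - F) beta_B for a given weight lambda."""
    F = np.linalg.solve(X.T @ Q @ X + lam * (X.T @ P @ X), X.T @ Q @ X)
    return F @ beta_w + (np.eye(k) - F) @ beta_b

theta = 1.0 - np.sqrt(sigma_u2 / (T * sigma_a2 + sigma_u2))
beta_g = combine((1.0 - theta) ** 2)

# Direct GLS with the closed-form inverse of V, and pooled OLS
V_inv = P / (T * sigma_a2 + sigma_u2) + Q / sigma_u2
beta_gls = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ y)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
```

Here `combine(1.0)` reproduces pooled OLS and `combine(0.0)` reproduces the within estimator, matching the two polar cases.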

3.2.4 Feasible GLS estimator

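We need to find a consistent (as $N \to \infty$) estimator of

\[
\theta = 1 - \sqrt{\frac{\sigma_u^2}{\sigma_u^2 + T\sigma_\alpha^2}},
\]

and then of $V$ (see (19)). The estimator of $\sigma_u^2$ is easily obtained from the within residuals (see (14) or (15)); it is denoted $\hat\sigma_{u,W}^2$ and given by

\[
\hat\sigma_{u,W}^2 = \frac{1}{N(T-1) - k} \sum_{i=1}^N \sum_{t=1}^T \hat u_{it}^2, \tag{28}
\]

with the within residuals given by

\[
\hat u_{it} = (y_{it} - \bar y_i) - (x_{it} - \bar x_i)\hat\beta_W.
\]

From the between regression (27) we find

\[
\sigma_B^2 = E\left(\bar\varepsilon_{i\cdot}^2\right) = E\left(\alpha_i^2 + \bar u_{i\cdot}^2 + 2\alpha_i \bar u_{i\cdot}\right) = \sigma_\alpha^2 + \frac{1}{T}\sigma_u^2.
\]

Hence, a consistent estimator of $\sigma_\alpha^2$ is obtained by

\[
\hat\sigma_\alpha^2 = \hat\sigma_B^2 - \frac{1}{T}\hat\sigma_{u,W}^2,
\]

where $\hat\sigma_B^2$ is the consistent estimator of $\sigma_B^2 = \sigma_\alpha^2 + \frac{1}{T}\sigma_u^2$ obtained by

\[
\hat\sigma_B^2 = \frac{1}{N - k} \sum_{i=1}^N \hat{\bar\varepsilon}_{i\cdot}^2, \tag{29}
\]

with the between residuals given by (see footnote 6)

\[
\hat{\bar\varepsilon}_{i\cdot} = \bar y_{i\cdot} - \bar x_{i\cdot}\hat\beta_B.
\]

Footnote 6: It is also possible to apply a degrees-of-freedom correction in computing $\hat\sigma_{u,W}^2$ and $\hat\sigma_B^2$.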

Now, we have

\[
\hat\theta = 1 - \sqrt{\frac{\hat\sigma_{u,W}^2}{\hat\sigma_{u,W}^2 + T\hat\sigma_\alpha^2}},
\]

and the resulting feasible GLS estimator is referred to as the random effects estimator,

\[
\hat\beta_{RE} = (X'\hat V^{-1}X)^{-1}(X'\hat V^{-1}y), \tag{30}
\]

where

\[
\hat V^{-1} = \frac{1}{\hat\sigma_{u,W}^2 + T\hat\sigma_\alpha^2} P + \frac{1}{\hat\sigma_{u,W}^2} Q.
\]

One thing to note is that the implied estimate of $\sigma_\alpha^2$ may be negative in finite samples. Such a negative finding calls the specification of the model into question. See Greene, section 16.4.3b.

None of the desirable properties of the random effects estimator relies on $T \to \infty$, although it can be shown that some consistency results follow for $T$ increasing. On the other hand, in this case the fixed effects estimator does rely on $T$ increasing for consistency. See Greene (p. 478) and Nickell (1981).
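The feasible GLS steps can be sketched as follows. This is a hedged illustration, not the notes' own implementation: the function name `random_effects_fgls`, the centred data without an intercept, and the truncation of a negative $\hat\sigma_\alpha^2$ at zero are all assumptions made to keep the example short.

```python
import numpy as np

def random_effects_fgls(y, X, N, T):
    """Feasible GLS for a balanced panel stacked individual by individual.

    y has shape (N*T,), X has shape (N*T, k). No intercept is included,
    so the data are assumed to be centred (an assumption of this sketch).
    """
    k = X.shape[1]
    y_bar = y.reshape(N, T).mean(axis=1)         # individual means of y
    X_bar = X.reshape(N, T, k).mean(axis=1)      # individual means of x

    # Within regression and the estimator of sigma_u^2, as in (28)
    y_w = y - np.repeat(y_bar, T)
    X_w = X - np.repeat(X_bar, T, axis=0)
    beta_w = np.linalg.lstsq(X_w, y_w, rcond=None)[0]
    sigma_u2 = np.sum((y_w - X_w @ beta_w) ** 2) / (N * (T - 1) - k)

    # Between regression and the estimator of sigma_B^2, as in (29)
    beta_b = np.linalg.lstsq(X_bar, y_bar, rcond=None)[0]
    sigma_b2 = np.sum((y_bar - X_bar @ beta_b) ** 2) / (N - k)

    sigma_a2 = sigma_b2 - sigma_u2 / T           # may be negative in small samples
    sigma_a2_used = max(sigma_a2, 0.0)           # truncation at zero: an assumption

    # Theta-differencing with the estimated theta, then OLS
    theta = 1.0 - np.sqrt(sigma_u2 / (sigma_u2 + T * sigma_a2_used))
    y_td = y - theta * np.repeat(y_bar, T)
    X_td = X - theta * np.repeat(X_bar, T, axis=0)
    beta_re = np.linalg.lstsq(X_td, y_td, rcond=None)[0]
    return beta_re, sigma_u2, sigma_a2, theta

# Usage on simulated data with sigma_alpha^2 = 2 and sigma_u^2 = 1
rng = np.random.default_rng(42)
N, T = 2000, 5
beta_true = np.array([1.0, -0.5])
X = rng.normal(size=(N * T, 2))
alpha = np.repeat(rng.normal(scale=np.sqrt(2.0), size=N), T)
y = X @ beta_true + alpha + rng.normal(size=N * T)
beta_re, s_u2, s_a2, theta_hat = random_effects_fgls(y, X, N, T)
```

Returning the untruncated `sigma_a2` alongside the estimate lets the user see a negative value, which the text flags as a sign of possible misspecification.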

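Testing for random effects. Breusch and Pagan (1980) have derived an LM test for random effects based on the OLS residuals,

\[
LM = \frac{NT}{2(T-1)} \left[ \frac{\hat\varepsilon' D D' \hat\varepsilon}{\hat\varepsilon'\hat\varepsilon} - 1 \right]^2,
\]

where $\hat\varepsilon = y - X\hat\beta_{OLS}$ and $D$ is defined as

\[
\underset{NT \times N}{D} =
\begin{bmatrix}
e_T & 0 & \cdots & 0 \\
0 & e_T & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & e_T
\end{bmatrix}, \quad
\underset{T \times 1}{e_T} = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}.
\]

Under the null of no random effects, $\sigma_\alpha^2 = 0$, as $N \to \infty$,

\[
LM \sim \chi^2(1).
\]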
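Since $\hat\varepsilon' D D' \hat\varepsilon = \sum_i \left(\sum_t \hat\varepsilon_{it}\right)^2$, the statistic can be computed without forming $D$. The sketch below assumes a balanced panel stacked individual by individual; the function name `breusch_pagan_lm` and the simulated design are illustrative assumptions.

```python
import numpy as np

def breusch_pagan_lm(resid, N, T):
    """LM statistic from pooled OLS residuals of a balanced, stacked panel."""
    e = resid.reshape(N, T)
    # e' D D' e equals the sum over i of the squared within-individual sums
    ratio = np.sum(e.sum(axis=1) ** 2) / np.sum(e ** 2)
    return N * T / (2.0 * (T - 1)) * (ratio - 1.0) ** 2

# Simulated data with a genuine random effect (sigma_alpha^2 = 1)
rng = np.random.default_rng(2)
N, T = 200, 5
X = np.column_stack([np.ones(N * T), rng.normal(size=N * T)])
y = (X @ np.array([0.5, 1.0])
     + np.repeat(rng.normal(size=N), T)
     + rng.normal(size=N * T))
resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
lm = breusch_pagan_lm(resid, N, T)

# Same statistic using the explicit NT x N matrix D from the text
D = np.kron(np.eye(N), np.ones((T, 1)))
lm_matrix = (N * T / (2.0 * (T - 1))
             * ((resid @ D @ D.T @ resid) / (resid @ resid) - 1.0) ** 2)
```

The value `lm` is then compared with a $\chi^2(1)$ critical value, for example 3.841 at the 5% level.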

3.2.5 Fixed Effects or Random Effects

Whether to treat the individual effects $\alpha_i$ as fixed or random is not an easy question to answer. The most common view is that the discussion should not be about the true nature of $\alpha_i$. The appropriate interpretation is that the fixed effects approach is conditional on the values of $\alpha_i$. This makes sense if the individuals in the sample are 'one of a kind' and cannot be viewed as a random draw from some underlying population. This is probably most appropriate when $i$ denotes (large) companies or industries.

In contrast, the random effects approach is not conditional on the individual $\alpha_i$'s but integrates them out. In this case we are usually not interested in the particular value of some individual's $\alpha_i$; we just focus on arbitrary individuals that have certain characteristics. The random effects approach allows one to make inference with respect to population characteristics.

One way of formalizing this is to note that the random effects model states that

\[
E(y_{it} \mid x_{it}) = x_{it}\beta,
\]

while the fixed effects model estimates

\[
E(y_{it} \mid x_{it}, \alpha_i) = x_{it}\beta + \alpha_i.
\]

Notice that the $\beta$'s in these two conditional expectations are the same only if $E(\alpha_i \mid x_{it}) = 0$.

However, even if we are interested in the larger population of individuals, and a random effects framework seems appropriate, the fixed effects estimator may be preferred, since it is likely in practice that $x_{it}$ and $\alpha_i$ are correlated, in which case the random effects approach, ignoring this correlation, leads to inconsistent estimators. This problem of correlation can be handled only by using the fixed effects approach, which essentially eliminates the $\alpha_i$ from the model.

3.2.6 Hausman Test

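The general specification test suggested by Hausman (1978) can be used to test the null hypothesis

\[
H_0: x_{it} \text{ and } \alpha_i \text{ are uncorrelated},
\]

against the alternative hypothesis

\[
H_1: x_{it} \text{ and } \alpha_i \text{ are correlated}.
\]

This test is based on the idea that the fixed effects estimator is consistent under both the null and the alternative, while the random effects estimator is efficient under the null but inconsistent under the alternative. Let us consider the difference between $\hat\beta_W$ and $\hat\beta_{RE}$. To evaluate the significance of this difference we need to find its covariance matrix. Under the null the two estimates should not differ significantly, and it can also be shown under the null that

\[
\begin{aligned}
\mathrm{Var}(\hat\beta_W - \hat\beta_{RE}) &= \mathrm{Var}(\hat\beta_W) + \mathrm{Var}(\hat\beta_{RE}) - \mathrm{Cov}(\hat\beta_W, \hat\beta_{RE}) - \mathrm{Cov}(\hat\beta_W, \hat\beta_{RE}) \\
&= \mathrm{Var}(\hat\beta_W) - \mathrm{Var}(\hat\beta_{RE}),
\end{aligned} \tag{31}
\]

where we used Hausman's essential result that

\[
\mathrm{Cov}\left((\hat\beta_W - \hat\beta_{RE}), \hat\beta_{RE}\right) = \mathrm{Cov}(\hat\beta_W, \hat\beta_{RE}) - \mathrm{Var}(\hat\beta_{RE}) = 0,
\]

or

\[
\mathrm{Cov}(\hat\beta_W, \hat\beta_{RE}) = \mathrm{Var}(\hat\beta_{RE}).
\]

Consequently, the Hausman test statistic is defined as

\[
h = (\hat\beta_W - \hat\beta_{RE})' \left[ \widehat{\mathrm{Var}}(\hat\beta_W) - \widehat{\mathrm{Var}}(\hat\beta_{RE}) \right]^{-1} (\hat\beta_W - \hat\beta_{RE}), \tag{32}
\]

where $\widehat{\mathrm{Var}}(\hat\beta_W)$ and $\widehat{\mathrm{Var}}(\hat\beta_{RE})$ denote the estimates of $\mathrm{Var}(\hat\beta_W)$ and $\mathrm{Var}(\hat\beta_{RE})$. Under the null,

\[
h \sim \chi^2(k),
\]

where $k$ is the number of parameters. The Hausman test thus tests whether the fixed effects and random effects estimators are significantly different. It is also possible to test for a subset of parameters in $\beta$.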
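The construction of $h$ can be illustrated on simulated data in which the regressor is deliberately correlated with $\alpha_i$, so that the null is false. This is a sketch under stated assumptions: a single centred regressor without intercept, variance components estimated from the within and between residuals, and a negative $\hat\sigma_\alpha^2$ truncated at zero; none of these choices comes from the notes.

```python
import numpy as np

rng = np.random.default_rng(7)
N, T = 500, 4
alpha = rng.normal(size=N)

# Regressor correlated with alpha_i, so H0 is false in this design
x = np.repeat(alpha, T) + rng.normal(size=N * T)
y = 1.0 * x + np.repeat(alpha, T) + rng.normal(size=N * T)
X = x[:, None]

x_bar = np.repeat(X.reshape(N, T, 1).mean(axis=1), T, axis=0)   # P X
y_bar = np.repeat(y.reshape(N, T).mean(axis=1), T)              # P y

# Within (fixed effects) estimator and its estimated variance
Xw, yw = X - x_bar, y - y_bar
beta_w = np.linalg.solve(Xw.T @ Xw, Xw.T @ yw)
sigma_u2 = np.sum((yw - Xw @ beta_w) ** 2) / (N * (T - 1) - 1)
var_w = sigma_u2 * np.linalg.inv(Xw.T @ Xw)

# Between residual variance (each mean appears T times in the stacked data)
beta_b = np.linalg.solve(x_bar.T @ x_bar, x_bar.T @ y_bar)
sigma_b2 = np.sum((y_bar - x_bar @ beta_b) ** 2) / T / (N - 1)
sigma_a2 = max(sigma_b2 - sigma_u2 / T, 0.0)     # truncation: an assumption

# Random effects estimator via theta-differencing, and its estimated variance
theta = 1.0 - np.sqrt(sigma_u2 / (sigma_u2 + T * sigma_a2))
Xt, yt = X - theta * x_bar, y - theta * y_bar
beta_re = np.linalg.solve(Xt.T @ Xt, Xt.T @ yt)
var_re = sigma_u2 * np.linalg.inv(Xt.T @ Xt)

# Hausman statistic (32)
d = beta_w - beta_re
h = float(d @ np.linalg.solve(var_w - var_re, d))
```

With $k = 1$, `h` is compared with the $\chi^2(1)$ critical value (3.841 at the 5% level); in this design the null should be rejected.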

3.2.7 Empirical Illustration

See section 10.3 in Verbeek and various examples in Greene.

3.2.8 Random Effects Correlated with Regressors

The essential difference between "fixed" and "random" effects is whether or not the individual effects are correlated with the regressors. We now consider the case where the random effects are correlated with the regressors.

Mundlak (1978) argued that the dichotomy between fixed effects and random effects models disappears if we make the assumption that the $\alpha_i$ depend on the mean values of $x_i$, an assumption he regards as reasonable in many problems. As before, consider the error components model,

\[
y_i = x_i\beta + \alpha_i + u_i, \tag{33}
\]

but now assume

\[
\alpha_i = \bar x_i \pi + w_i,
\]

where $w_i$ has the same properties that $\alpha_i$ was assumed to have; that is,

1. The $w_i$'s are $iid\,N(0, \sigma_w^2)$.

2. The $w_i$'s are uncorrelated with $u_{jt}$ for all $i, j, t$, that is, $E[w_i u_{jt}] = 0$ for all $i, j, t$.

3. The $w_i$'s are uncorrelated with $x_{jt}$ for all $i, j, t$, that is, $E[w_i x_{jt}] = 0$ for all $i, j, t$.

Then, we rewrite (33) as

\[
y_i = x_i\beta + \bar x_i\pi + w_i + u_i, \tag{34}
\]

which can be written in matrix form as

\[
y = X\beta + PX\pi + w + u = \begin{bmatrix} X & PX \end{bmatrix} \begin{bmatrix} \beta \\ \pi \end{bmatrix} + w + u,
\]

where we used

\[
\alpha = (PX)\pi + w,
\]

and where $\mathrm{Cov}(w + u) = V$ as before (see (18)), with $\sigma_w^2$ in place of $\sigma_\alpha^2$. Carrying out the GLS estimation, we have

\[
\begin{bmatrix} \hat\beta_{GLS} \\ \hat\pi_{GLS} \end{bmatrix}
= \left\{ \begin{bmatrix} X & PX \end{bmatrix}' V^{-1} \begin{bmatrix} X & PX \end{bmatrix} \right\}^{-1}
\begin{bmatrix} X & PX \end{bmatrix}' V^{-1} y. \tag{35}
\]

After some algebra, it can be shown that (see footnote 7)

\[
\hat\beta_{GLS} = \hat\beta_W = (X'QX)^{-1} X'Qy, \tag{36}
\]

and

\[
\hat\pi_{GLS} = \hat\beta_B - \hat\beta_W = (X'PX)^{-1} X'Py - (X'QX)^{-1} X'Qy, \tag{37}
\]

with

\[
\mathrm{Var}(\hat\pi_{GLS}) = \mathrm{Var}(\hat\beta_B) + \mathrm{Var}(\hat\beta_W)
= \left(T\sigma_w^2 + \sigma_u^2\right)(X'PX)^{-1} + \sigma_u^2 (X'QX)^{-1}.
\]

This shows that, for the linear regression model, fixed effects is effectively the same as random effects correlated with all regressors. The test of $\pi = 0$ can be based on the following statistic:

\[
\hat\pi_{GLS}' \left[\mathrm{Var}(\hat\pi_{GLS})\right]^{-1} \hat\pi_{GLS} \xrightarrow{d} \chi^2_k \quad \text{under } H_0.
\]
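Results (36) and (37) can be checked numerically: GLS on the augmented regressor matrix $[X, PX]$ returns the within estimator for $\beta$ and the difference between the between and within estimators for $\pi$. The sketch below is illustrative; the simulated design and the values of $\sigma_w^2$ and $\sigma_u^2$ are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
N, T, k = 60, 4, 2
sigma_w2, sigma_u2 = 1.0, 1.0             # assumed variance components

# Regressors with an individual-specific component
X = np.repeat(rng.normal(size=(N, k)), T, axis=0) + rng.normal(size=(N * T, k))
x_bar_i = X.reshape(N, T, k).mean(axis=1)

# Mundlak device: alpha_i depends on the individual means of x
pi = np.array([0.7, -0.4])
alpha_i = x_bar_i @ pi + rng.normal(scale=np.sqrt(sigma_w2), size=N)
y = (X @ np.array([1.0, 2.0]) + np.repeat(alpha_i, T)
     + rng.normal(scale=np.sqrt(sigma_u2), size=N * T))

P = np.kron(np.eye(N), np.full((T, T), 1.0 / T))
Q = np.eye(N * T) - P
V_inv = P / (T * sigma_w2 + sigma_u2) + Q / sigma_u2

# GLS on the augmented regressor matrix [X, PX], as in (35)
Z = np.hstack([X, P @ X])
coef = np.linalg.solve(Z.T @ V_inv @ Z, Z.T @ V_inv @ y)
beta_gls, pi_gls = coef[:k], coef[k:]

# Within and between estimators for comparison with (36) and (37)
beta_w = np.linalg.solve(X.T @ Q @ X, X.T @ Q @ y)
beta_b = np.linalg.solve(X.T @ P @ X, X.T @ P @ y)
```

The equalities hold exactly (up to rounding) for any positive variance components, because $QX$ and $PX$ are orthogonal in the metric defined by $V^{-1}$.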

3.3 Alternative IV Estimators

The fixed effects estimator eliminates anything that is time-invariant from the model, which might be a high price to pay for allowing the x variables to be correlated with the individual-specific heterogeneity $\alpha_i$. For example, we may be interested in the effect of time-invariant variables like gender or schooling on a person's wage. In this section we show that there is no need to restrict attention to the fixed and the random effects only, as it is possible to derive instrumental variables estimators that can be considered an intermediate case between the fixed and random effects approaches.

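Footnote 7: The same result can be derived alternatively as follows. Applying the GLS transformation described earlier to (34), we have
\[
\begin{aligned}
y_{it} - \theta \bar y_i &= (x_{it} - \theta \bar x_i)\beta + (\bar x_i - \theta \bar x_i)\pi + v_{it} \\
&= (x_{it} - \bar x_i)\beta + \bar x_i \delta + v_{it},
\end{aligned}
\]
where $\delta = (1 - \theta)(\pi + \beta)$. Using the fact that $\bar x_i$ is orthogonal to $x_{it} - \bar x_i$, we get the result.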

First of all, we show that the fixed effects estimator is in fact a special case of an IV estimator. For this, notice that the fixed effects estimator can be written as

\[
\begin{aligned}
\hat\beta_W &= \left[ \sum_{i=1}^N \sum_{t=1}^T (x_{it} - \bar x_i)'(x_{it} - \bar x_i) \right]^{-1} \sum_{i=1}^N \sum_{t=1}^T (x_{it} - \bar x_i)'(y_{it} - \bar y_i) \\
&= \left[ \sum_{i=1}^N \sum_{t=1}^T (x_{it} - \bar x_i)' x_{it} \right]^{-1} \sum_{i=1}^N \sum_{t=1}^T (x_{it} - \bar x_i)' y_{it} = \hat\beta_{IV},
\end{aligned}
\]

which shows that $\hat\beta_W$ has the interpretation of an IV estimator in the model,

\[
y_{it} = x_{it}\beta + \alpha_i + u_{it}, \quad i = 1, 2, \dots, N; \; t = 1, 2, \dots, T,
\]

where $x_{it}$ is instrumented by $x_{it} - \bar x_i$. Notice that by construction

\[
E\left[(x_{it} - \bar x_i)'\alpha_i\right] = 0,
\]

so that an IV estimator is consistent provided that

\[
E\left[(x_{it} - \bar x_i)' u_{it}\right] = 0,
\]

which is satisfied by our assumption of strict exogeneity of $x_{it}$. This route may allow us to estimate the effect of time-invariant variables in a general context. To describe this approach, consider the following model:

\[
y_{it} = x_{1,it}\beta_1 + x_{2,it}\beta_2 + z_{1,i}\gamma_1 + z_{2,i}\gamma_2 + \alpha_i + u_{it}, \quad i = 1, 2, \dots, N; \; t = 1, 2, \dots, T, \tag{38}
\]

where we now have four different groups of variables: the $x$'s vary over both time periods and cross-section units, whereas the $z$'s vary only over cross-section units and are time-invariant.

In addition, we now assume that the $1 \times k_1$ vector $x_{1,it}$ and the $1 \times g_1$ vector $z_{1,i}$ are uncorrelated with $\alpha_i$ such that

\[
\operatorname*{plim}_{N \to \infty} \frac{1}{N} \sum_{i=1}^N \bar x_{1,i}\alpha_i = 0, \quad
\operatorname*{plim}_{N \to \infty} \frac{1}{N} \sum_{i=1}^N z_{1,i}\alpha_i = 0,
\]

whereas the $1 \times k_2$ vector $x_{2,it}$ and the $1 \times g_2$ vector $z_{2,i}$ are correlated with $\alpha_i$ ($k_1 + k_2 = k$ and $g_1 + g_2 = g$). Under these assumptions, fixed effects estimation still provides consistent estimators for $\beta_1$ and $\beta_2$, but would not identify $\gamma_1$ and $\gamma_2$, since the time-invariant variables $z_{1,i}$ and $z_{2,i}$ are wiped out by the within transformation.

3.3.1 Hausman and Taylor (1981) IV Estimator

Hausman and Taylor (1981) suggest estimating (38) by IV using the following variables as instruments:

\[
x_{1,it} \text{ for } x_{1,it}, \quad x_{2,it} - \bar x_{2,i} \text{ for } x_{2,it}, \quad z_{1,i} \text{ for } z_{1,i}, \quad \text{and } \bar x_{1,i} \text{ for } z_{2,i},
\]

that is, the uncorrelated variables $x_{1,it}$ and $z_{1,i}$ trivially serve as their own instruments, the $x_{2,it}$'s are instrumented by their deviations from individual means as in the fixed effects estimation, and finally, $z_{2,i}$ is instrumented by the individual average of $x_{1,it}$. Identification requires that the number of variables in $x_{1,it}$ is at least as large as that in $z_{2,i}$ ($k_1 \geq g_2$). The resulting estimator, the Hausman-Taylor estimator, also allows us to estimate $\gamma_1$ and $\gamma_2$ consistently. In particular, if some of the time-invariant variables are believed to be correlated with $\alpha_i$, we require that sufficiently many time-varying variables that are not correlated with $\alpha_i$ be included as instruments. The advantage of the Hausman and Taylor approach is that one does not have to use external instruments: the instruments can be obtained within the model.

There are two versions of the Hausman-Taylor estimator, called HT-IV and HT-GLS, respectively.

HT-IV estimator (consistent but less efficient). We now rewrite (38) in matrix form,

\[
\begin{aligned}
y &= X_1\beta_1 + X_2\beta_2 + Z_1\gamma_1 + Z_2\gamma_2 + \alpha + u \qquad (39) \\
&= X\beta + Z\gamma + \alpha + u,
\end{aligned}
\]

where $X = (X_1, X_2)$, $Z = (Z_1, Z_2)$, $\beta = (\beta_1', \beta_2')'$ and $\gamma = (\gamma_1', \gamma_2')'$. We first estimate $\beta$ by $\hat\beta_W$ (the within estimator), which is still consistent, and consider the following averaged within residuals in matrix form:

\[
P(y - X\hat\beta_W) = Z\gamma + \alpha + Pu + PX(\beta - \hat\beta_W), \tag{40}
\]

where $PZ = Z$ and $P\alpha = \alpha$ trivially. Applying 2SLS to (40) with $(X_1, Z_1)$ as instruments, we obtain the consistent estimate of $\gamma$ by

\[
\hat\gamma_W = \left[ Z' P_{(X_1,Z_1)} Z \right]^{-1} Z' P_{(X_1,Z_1)} (y - X\hat\beta_W),
\]

where

\[
P_{(X_1,Z_1)} = (X_1, Z_1) \left[ (X_1, Z_1)'(X_1, Z_1) \right]^{-1} (X_1, Z_1)'
\]

is the orthogonal projection operator onto the column space of $(X_1, Z_1)$. In summary, $\hat\gamma_W$ exists and is consistent as $N \to \infty$ if $k_1 \geq g_2$. We need at least as many instruments $[X_1, Z_1]$ as regressors $[Z_1, Z_2]$, so you need at least as many $X_1$'s as $Z_2$'s.

HT-GLS estimator (consistent and efficient). Transform (39) by $V^{-1/2}$ as above ($\theta$ difference),

\[
V^{-1/2}y = V^{-1/2}X\beta + V^{-1/2}Z\gamma + V^{-1/2}(\alpha + u), \tag{41}
\]

then do 2SLS using the instruments $(QX, Z_1, PX_1) = (QX_1, QX_2, Z_1, PX_1)$ to obtain $\hat\beta_{GLS}$ and $\hat\gamma_{GLS}$, where $QX$ are the deviations from means of all time-varying variables, $PX_1$ are the means of all time-varying variables not correlated with the effects, and $Z_1$ are the time-invariant variables not correlated with the effects. The condition for existence of the estimator now becomes

\[
k_1 + k_2 + g_1 + k_1 \ (\text{number of instruments}) \geq k_1 + k_2 + g_1 + g_2 \ (\text{number of regressors}),
\]

or again

\[
k_1 \geq g_2,
\]

which is the same as for the simple estimator.

Summary of HT estimator.
(i) $k_1 < g_2$ (underidentification): $\hat\beta_W = \hat\beta_{GLS}$, but $\hat\gamma_W$ and $\hat\gamma_{GLS}$ do not exist.
(ii) $k_1 = g_2$ (just-identification): $\hat\beta_W = \hat\beta_{GLS}$, and $\hat\gamma_W = \hat\gamma_{GLS}$.
(iii) $k_1 > g_2$ (over-identification): $\hat\beta_{GLS}$, $\hat\gamma_{GLS}$ are more efficient than $\hat\beta_W$, $\hat\gamma_W$.

An empirical application: Estimating returns to schooling. See Hausman and Taylor (1981).

3.3.2 Further Generalization

Amemiya and MaCurdy (1986) and Breusch, Mizon and Schmidt (1989) all consider the same random effects model correlated with some but not all regressors, and suggest a larger set of instruments to improve upon the efficiency of the Hausman and Taylor estimator. A basic question is how many explanatory variables or linear combinations of them must be uncorrelated with the effects in order to improve on the within estimator and to include time-invariant regressors.

Amemiya and MaCurdy (1986) suggest the use of the time-invariant instruments $x_{1,i1} - \bar x_{1,i}, \dots, x_{1,iT} - \bar x_{1,i}$ for $z_{2,i}$. This requires that

\[
E\left[(x_{1,it} - \bar x_{1,i})'\alpha_i\right] = 0 \quad \text{for each } t,
\]

which makes sense if the correlation between $x_{1,it}$ and $\alpha_i$ is due to a time-invariant component in $x_{1,it}$ such that, for a given $t$, $E\left[x_{1,it}'\alpha_i\right]$ does not depend on $t$.

Breusch, Mizon and Schmidt (1989) nicely summarise this literature, suggesting $x_{2,i1} - \bar x_{2,i}, \dots, x_{2,iT} - \bar x_{2,i}$ as additional instruments for $z_{2,i}$.

3.4 Extension to two-way error components model

Consider the linear regression model with large $N$ and large $T$,

\[
y_{it} = x_{it}\beta + \alpha_i + \lambda_t + u_{it}, \tag{42}
\]

where $\alpha_i$ denotes the unobservable individual effect and $\lambda_t$ denotes the unobservable time effect.

3.4.1 The fixed effects model

Regress $y$ on [$X$, dummies for individuals, dummies for time], which is equivalent to the within transformation. Define the following means:

\[
\bar y_{i\cdot} = \frac{1}{T} \sum_{t=1}^T y_{it}, \quad
\bar y_{\cdot t} = \frac{1}{N} \sum_{i=1}^N y_{it}, \quad
\bar y_{\cdot\cdot} = \frac{1}{NT} \sum_{i=1}^N \sum_{t=1}^T y_{it}.
\]

Then, carry out the within transformation:

\[
y_{it}^w = y_{it} - \bar y_{i\cdot} - \bar y_{\cdot t} + \bar y_{\cdot\cdot},
\]

\[
x_{it}^w = x_{it} - \bar x_{i\cdot} - \bar x_{\cdot t} + \bar x_{\cdot\cdot}.
\]

Then, the within estimator is obtained from the regression of $y_{it}^w$ on $x_{it}^w$. Notice that this transformation removes anything that does not vary over time (e.g., $\alpha_i$) and anything that does not vary over individuals (e.g., $\lambda_t$). The within estimator is unbiased, but consistency depends on.
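The equivalence between the dummy-variable regression and the two-way within transformation can be checked numerically for a balanced panel. The sketch below is illustrative only: the simulated design and the helper name `two_way_within` are assumptions, and one time dummy is dropped to avoid perfect collinearity with the individual dummies.

```python
import numpy as np

rng = np.random.default_rng(5)
N, T, k = 30, 6, 2
X = rng.normal(size=(N, T, k))
alpha = rng.normal(size=(N, 1))      # individual effects
lam = rng.normal(size=(1, T))        # time effects
Y = X @ np.array([1.5, -1.0]) + alpha + lam + rng.normal(size=(N, T))

def two_way_within(A):
    """Subtract individual and time means, then add back the grand mean."""
    return (A - A.mean(axis=1, keepdims=True)
              - A.mean(axis=0, keepdims=True)
              + A.mean(axis=(0, 1), keepdims=True))

# Within estimator: OLS on the transformed data
Xw = two_way_within(X).reshape(N * T, k)
Yw = two_way_within(Y).ravel()
beta_within = np.linalg.lstsq(Xw, Yw, rcond=None)[0]

# Dummy-variable regression: N individual dummies and T-1 time dummies
D_ind = np.kron(np.eye(N), np.ones((T, 1)))
D_time = np.kron(np.ones((N, 1)), np.eye(T))[:, 1:]
R = np.hstack([X.reshape(N * T, k), D_ind, D_time])
beta_lsdv = np.linalg.lstsq(R, Y.ravel(), rcond=None)[0][:k]
```

The two slope estimates coincide up to rounding, while the within route avoids estimating $N + T - 1$ nuisance coefficients.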

3.4.2 The random effects model

Treat $\alpha_i$ and $\lambda_t$ as iid draws from $N(0, \sigma_\alpha^2)$ and $N(0, \sigma_\lambda^2)$, and not correlated with $X$. Then, the OLS estimator is unbiased and consistent as $N \to \infty$ and $T \to \infty$. The GLS estimator is more efficient. For details see Baltagi (1995, Section 3.3).

3.5 Dynamic Panels with Fixed T

Consider a dynamic panel with a lagged dependent variable as regressor,

\[
y_{it} = \phi y_{i,t-1} + x_{it}\beta + \varepsilon_{it}, \quad i = 1, \dots, N, \; t = 1, \dots, T, \tag{43}
\]

where we also assume an error component specification,

\[
\varepsilon_{it} = \alpha_i + u_{it},
\]

where $u_{it} \sim iid\,N(0, \sigma_u^2)$. The motivation here is to distinguish the true dependence of $y$ on lagged $y$ from "spurious" correlation due to unobserved heterogeneity (e.g., earnings across generations). We also assume for simplicity that the $x_{it}$'s are not correlated with $\alpha_i$, and that $x_{it}$ is uncorrelated with $u_{it}$ for all $t = 1, \dots, T$. Note, however, that $\alpha_i$ is still correlated with $y_{i,t-1}$ since $y_{i,t-1}$ contains $\alpha_i$. Hence, $y_{i,t-1}$ is correlated with $\varepsilon_{it}$. This renders the OLS estimator biased and inconsistent even if the $u_{it}$'s are not serially correlated. The same applies to the GLS estimator.

Fixed effects would seem to be an obvious possibility, but the within estimator is inconsistent as $N \to \infty$ for fixed $T$. To illustrate this problem, consider a simple autoregressive model with individual effects (see footnote 8),

\[
y_{it} = \phi y_{i,t-1} + \alpha_i + u_{it}, \quad i = 1, \dots, N, \; t = 1, \dots, T, \tag{44}
\]

Footnote 8: Extension to a dynamic panel with exogenous variables would be straightforward and in this case we would obtain exactly the same result, at least asymptotically.

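where we assume $|\phi| < 1$. The fixed effects estimator is given by

\[
\hat\phi_W = \frac{\sum_{i=1}^N \sum_{t=1}^T (y_{i,t-1} - \bar y_{i,-1})(y_{it} - \bar y_i)}{\sum_{i=1}^N \sum_{t=1}^T (y_{i,t-1} - \bar y_{i,-1})^2}, \tag{45}
\]

where

\[
\bar y_i = \frac{1}{T} \sum_{t=1}^T y_{it}, \quad \bar y_{i,-1} = \frac{1}{T} \sum_{t=1}^T y_{i,t-1},
\]

and we assume that $y_{i0}$ exists. Clearly, the within-transformed regressor $(y_{i,t-1} - \bar y_{i,-1})$ is uncorrelated with $\alpha_i$, but it is still correlated with $u_{it}$; that is,

\[
\mathrm{Cov}(y_{i,t-1} - \bar y_{i,-1}, u_{it}) = -\frac{\sigma_u^2}{T},
\]

so the within estimator is biased and inconsistent as $N \to \infty$ for a fixed $T$. To analyze this, we can substitute (44) into (45) and get

\[
\hat\phi_W = \phi + \frac{\sum_{i=1}^N \sum_{t=1}^T (u_{it} - \bar u_i)(y_{i,t-1} - \bar y_{i,-1})}{\sum_{i=1}^N \sum_{t=1}^T (y_{i,t-1} - \bar y_{i,-1})^2}. \tag{46}
\]

In fact, the fixed effects estimator has a bias of order $O(T^{-1})$, and in particular, Nickell (1981) has shown that (see footnote 9)

\[
\operatorname*{plim}_{N \to \infty} \frac{1}{NT} \sum_{i=1}^N \sum_{t=1}^T (u_{it} - \bar u_i)(y_{i,t-1} - \bar y_{i,-1})
= -\frac{\sigma_u^2}{T^2} \cdot \frac{(T-1) - T\phi + \phi^T}{(1 - \phi)^2}. \tag{47}
\]

Therefore, for a typical panel where $N$ is large and $T$ is fixed, the fixed effects estimator is biased and inconsistent (see footnote 10). Only if $T \to \infty$ will the fixed effects estimator be consistent for the dynamic error components model.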
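A small simulation makes the fixed-$T$ inconsistency visible. The sketch below assumes an AR(1) panel with individual effects, $y_{it} = \phi y_{i,t-1} + \alpha_i + u_{it}$, a stationary draw for $y_{i0}$, and particular values of $N$, $T$, $\phi$; all of these are illustrative assumptions. It compares the sample average of $(u_{it} - \bar u_i)(y_{i,t-1} - \bar y_{i,-1})$ over the $NT$ observations with the limit $-\frac{\sigma_u^2}{T^2}\frac{(T-1) - T\phi + \phi^T}{(1-\phi)^2}$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, phi, sigma_u = 20000, 5, 0.5, 1.0

alpha = rng.normal(size=N)
u = rng.normal(scale=sigma_u, size=(N, T))

# y[:, 0] holds y_i0, drawn from the stationary distribution given alpha_i
y = np.empty((N, T + 1))
y[:, 0] = alpha / (1 - phi) + rng.normal(scale=sigma_u / np.sqrt(1 - phi**2), size=N)
for t in range(1, T + 1):
    y[:, t] = phi * y[:, t - 1] + alpha + u[:, t - 1]

y_lag, y_cur = y[:, :-1], y[:, 1:]
y_lag_dm = y_lag - y_lag.mean(axis=1, keepdims=True)   # y_{i,t-1} minus its mean
y_cur_dm = y_cur - y_cur.mean(axis=1, keepdims=True)   # y_{it} minus its mean
u_dm = u - u.mean(axis=1, keepdims=True)

# Within (fixed effects) estimator of phi
phi_w = np.sum(y_lag_dm * y_cur_dm) / np.sum(y_lag_dm ** 2)

# Sample analogue of the numerator term and its large-N limit
num_sample = np.mean(u_dm * y_lag_dm)
num_theory = -(sigma_u**2 / T**2) * ((T - 1) - T * phi + phi**T) / (1 - phi) ** 2
```

With these settings the within estimate falls well below the true $\phi = 0.5$ even though $N$ is very large, which is the sense in which the bias does not vanish as $N \to \infty$.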

3.5.1 The Anderson and Hsiao (1981) First-difference IV Estimation

Fortunately, there is a relatively easy way to fix the inconsistency problem. An alternative transformation that wipes out the individual effects, yet does not create the above problem in dynamic panels, is the first-difference transformation. Take the first difference of (44) to get rid of $\alpha_i$ and obtain

\[
\Delta y_{it} = \phi \Delta y_{i,t-1} + \Delta u_{it}, \quad t = 2, \dots, T, \tag{48}
\]

where we note that $\Delta u_{it}$ is an MA(1) process with a unit root. The OLS estimator obtained from (48) will be inconsistent since $\Delta y_{i,t-1}$ and $\Delta u_{it}$, or more precisely $y_{i,t-1}$ and $u_{i,t-1}$, are by definition correlated. Anderson and Hsiao suggested $\Delta y_{i,t-2}$ or $y_{i,t-2}$ as an instrument for $\Delta y_{i,t-1}$. These instruments will not be

Footnote 9: See p. 328 in Verbeek for the actual magnitude of the bias for a fixed $T$ and for different values of $\phi$.

Footnote 10: Notice that the same inconsistency problem occurs with the random effects GLS estimator.

correlated with $\Delta u_{it}$ as long as the $u_{it}$ are not serially correlated. For example, when using $y_{i,t-2}$ as an instrument for $\Delta y_{i,t-1}$, we obtain the following (consistent) IV estimator (see footnote 11):

\[
\hat\phi_{IV} = \frac{\sum_{i=1}^N \sum_{t=2}^T y_{i,t-2} \Delta y_{it}}{\sum_{i=1}^N \sum_{t=2}^T y_{i,t-2} \Delta y_{i,t-1}}. \tag{49}
\]

Since $\Delta u_{it} = u_{it} - u_{i,t-1}$, in fact any of $y_{i,s}$, $s \leq t - 2$, can be used as legitimate instruments.
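The IV estimator in (49) takes only a few lines to compute. The sketch below is an illustration under assumptions: the data are simulated from an AR(1) panel with individual effects and a stationary $y_{i0}$, the sums run over $t = 2, \dots, T$, and the values of $N$, $T$ and $\phi$ are arbitrary. OLS on the first-differenced equation is computed alongside for contrast.

```python
import numpy as np

rng = np.random.default_rng(4)
N, T, phi = 20000, 5, 0.5

alpha = rng.normal(size=N)
# Columns of y are y_i0, y_i1, ..., y_iT; y_i0 is drawn from the stationary law
y = np.empty((N, T + 1))
y[:, 0] = alpha / (1 - phi) + rng.normal(size=N) / np.sqrt(1 - phi**2)
for t in range(1, T + 1):
    y[:, t] = phi * y[:, t - 1] + alpha + rng.normal(size=N)

# Quantities indexed by t = 2, ..., T
dy = y[:, 2:] - y[:, 1:-1]         # Delta y_it
dy_lag = y[:, 1:-1] - y[:, :-2]    # Delta y_{i,t-1}
y_lag2 = y[:, :-2]                 # y_{i,t-2}, the instrument

# Anderson-Hsiao IV estimator, as in (49)
phi_iv = np.sum(y_lag2 * dy) / np.sum(y_lag2 * dy_lag)

# OLS on the first-differenced equation, which is inconsistent
phi_fd_ols = np.sum(dy_lag * dy) / np.sum(dy_lag ** 2)
```

In this design the IV estimate is close to the true value, whereas OLS on first differences is pulled far below it because $\Delta y_{i,t-1}$ and $\Delta u_{it}$ share $u_{i,t-1}$.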

3.5.2 The Arellano and Bond (1991) IV-GMM Estimator

It is well known that imposing more moment conditions increases the efficiency of the estimators, provided the additional moment conditions are valid. Arellano and Bond (1991) show that the list of instruments can be extended by exploiting additional moment conditions and letting their number vary with $t$. In particular, they argue that additional instruments can be obtained if one utilizes the orthogonality conditions that exist between lagged values of $y_{it}$ and $u_{it}$.

Consider again the simple autoregressive panel (44) and its first-difference version (48). For $t = 3$, we observe (note that here $t = 2, \dots$, so the observations start from $y_{i1}$, not $y_{i0}$)

\[
\Delta y_{i3} = \phi \Delta y_{i2} + \Delta u_{i3},
\]

and thus $y_{i1}$ is a valid instrument since it is highly correlated with $\Delta y_{i2}$ but not correlated with $\Delta u_{i3}$. When $t = 4$,

\[
\Delta y_{i4} = \phi \Delta y_{i3} + \Delta u_{i4},
\]

and $y_{i2}$ as well as $y_{i1}$ are valid instruments for $\Delta y_{i3}$ since both are uncorrelated with $\Delta u_{i4}$. Continuing in this fashion, for period $T$ the set of valid instruments becomes $(y_{i1}, y_{i2}, \dots, y_{i,T-2})$.

Now define the $(T-2) \times (1 + 2 + \cdots + T-2)$ matrix,

\[
W_i =
\begin{bmatrix}
(y_{i1}) & 0 & \cdots & 0 \\
0 & (y_{i1}, y_{i2}) & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & (y_{i1}, y_{i2}, \dots, y_{i,T-2})
\end{bmatrix}, \tag{50}
\]

as the matrix of instruments, where each row contains the instruments that are valid for a given period. Consequently, the set of all moment conditions can be written concisely as

\[
E(W_i' \Delta u_i) = 0,
\]

Footnote 11: A necessary condition for consistency is that
\[
\operatorname*{plim}_{N \to \infty} \frac{1}{NT} \sum_{i=1}^N \sum_{t=2}^T y_{i,t-2} \Delta u_{it} = 0.
\]

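where $\Delta u_i = (\Delta u_{i3}, \dots, \Delta u_{iT})'$, or alternatively,

\[
E\left(W_i'(\Delta y_i - \phi \Delta y_{i,-1})\right) = 0,
\]

where $\Delta y_i = (\Delta y_{i3}, \dots, \Delta y_{iT})'$ and $\Delta y_{i,-1} = (\Delta y_{i2}, \dots, \Delta y_{i,T-1})'$ are $(T-2) \times 1$ vectors. The total number of moment conditions adds up to $(1 + 2 + \cdots + T-2)$. Next, define the $N(T-2) \times (1 + 2 + \cdots + T-2)$ matrix of instruments by

\[
W = \begin{bmatrix} W_1 \\ \vdots \\ W_N \end{bmatrix},
\]

and rewrite (48) in matrix form as

\[
\Delta y = \phi \Delta y_{-1} + \Delta u, \tag{51}
\]

where

\[
\underset{N(T-2) \times 1}{\Delta y} = \begin{bmatrix} \Delta y_1 \\ \vdots \\ \Delta y_N \end{bmatrix}, \quad
\underset{N(T-2) \times 1}{\Delta y_{-1}} = \begin{bmatrix} \Delta y_{1,-1} \\ \vdots \\ \Delta y_{N,-1} \end{bmatrix}, \quad
\underset{N(T-2) \times 1}{\Delta u} = \begin{bmatrix} \Delta u_1 \\ \vdots \\ \Delta u_N \end{bmatrix}.
\]

Pre-multiplying (51) by $W'$,

\[
W'\Delta y = \phi W'\Delta y_{-1} + W'\Delta u. \tag{52}
\]

The estimator suggested by Arellano and Bond is in fact the GLS estimator applied to (52); that is,

\[
\hat\phi_{GLS} = \left[ \Delta y_{-1}' W V^{-1} W' \Delta y_{-1} \right]^{-1} \Delta y_{-1}' W V^{-1} W' \Delta y, \tag{53}
\]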

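where $V = \mathrm{Var}(W'\Delta u)$. They then suggested two feasible GLS estimators. First, under the assumption that $u_{it}$ is iid over both $i$ and $t$, it is easily seen that $E(\Delta u_i \Delta u_i') = \sigma_u^2 G$ for each $i$, so that for the stacked vector

\[
E(\Delta u \Delta u') = \sigma_u^2 (I_N \otimes G),
\]

where $\Delta u_i = (\Delta u_{i3}, \dots, \Delta u_{iT})'$ and $G$ is the matrix given by

\[
\underset{(T-2) \times (T-2)}{G} =
\begin{bmatrix}
2 & -1 & 0 & \cdots & 0 & 0 \\
-1 & 2 & -1 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 2 & -1 \\
0 & 0 & 0 & \cdots & -1 & 2
\end{bmatrix}.
\]

Then,

\[
V = \mathrm{Var}(W'\Delta u) = \sigma_u^2 W'(I_N \otimes G)W,
\]

and therefore, we obtain the one-step Arellano and Bond (GMM) estimator by

\[
\hat\phi_{FGLS,1} = \left\{ \Delta y_{-1}' W \left[ W'(I_N \otimes G)W \right]^{-1} W' \Delta y_{-1} \right\}^{-1}
\Delta y_{-1}' W \left[ W'(I_N \otimes G)W \right]^{-1} W' \Delta y. \tag{54}
\]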

Since $G$ is a fixed matrix, the optimal GMM estimator can be computed in one step if the $u_{it}$'s are assumed to be homoskedastic and exhibit no autocorrelation.

In general, the GMM approach does not impose that $u_{it}$ is iid over both cross-section units and time periods. In this case $V$ or $V^{-1}$ can be estimated without imposing these restrictions (see footnote 12). Now, we need to replace

\[
W'(I_N \otimes G)W = \sum_{i=1}^N W_i' G W_i
\]

by

\[
V_N = \sum_{i=1}^N W_i' \Delta u_i \Delta u_i' W_i.
\]

Since $\Delta u_i$ is unobservable, we obtain the two-step Arellano and Bond GMM estimator by

\[
\hat\phi_{FGLS,2} = \left[ \Delta y_{-1}' W \hat V_N^{-1} W' \Delta y_{-1} \right]^{-1} \Delta y_{-1}' W \hat V_N^{-1} W' \Delta y, \tag{55}
\]

where

\[
\Delta \hat u_i = \Delta y_i - \hat\phi_{FGLS,1} \Delta y_{i,-1} \quad \text{and} \quad
\hat V_N = \sum_{i=1}^N W_i' \Delta \hat u_i \Delta \hat u_i' W_i.
\]

This GMM estimator requires no knowledge concerning the initial conditions or the distributions of $u_i$ and $\alpha_i$. In general, the GMM estimator for $\phi$ is asymptotically normal with its covariance given by

\[
\mathrm{Var}(\hat\phi_{FGLS,2}) = \left[ \Delta y_{-1}' W \hat V_N^{-1} W' \Delta y_{-1} \right]^{-1}.
\]
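To make the block-diagonal instrument matrix concrete, the sketch below builds $W_i$ for a series $y_{i1}, \dots, y_{iT}$ and computes the one-step estimator with the fixed weighting matrix $\sum_i W_i' G W_i$. It is a minimal illustration under assumptions: the function names `ab_instruments` and `arellano_bond_one_step`, the pure AR(1) model with individual effects, and the simulation settings are not from the notes.

```python
import numpy as np

def ab_instruments(y_i):
    """Instrument matrix W_i for one individual observed at t = 1, ..., T.

    Row r corresponds to period t = r + 3 and holds (y_i1, ..., y_{i,t-2}).
    """
    T = y_i.shape[0]
    W = np.zeros((T - 2, (T - 2) * (T - 1) // 2))
    col = 0
    for r in range(T - 2):
        n = r + 1                      # number of valid lagged levels
        W[r, col:col + n] = y_i[:n]
        col += n
    return W

def arellano_bond_one_step(Y):
    """One-step GMM estimate of phi from an N x T array of levels."""
    N, T = Y.shape
    G = 2 * np.eye(T - 2) - np.eye(T - 2, k=1) - np.eye(T - 2, k=-1)
    m = (T - 2) * (T - 1) // 2
    A = np.zeros((m, m))               # sum_i W_i' G W_i
    b_lag = np.zeros(m)                # sum_i W_i' (Delta y_{i,-1})
    b_cur = np.zeros(m)                # sum_i W_i' (Delta y_i)
    for i in range(N):
        W = ab_instruments(Y[i])
        dy = Y[i, 2:] - Y[i, 1:-1]         # Delta y_i3, ..., Delta y_iT
        dy_lag = Y[i, 1:-1] - Y[i, :-2]    # Delta y_i2, ..., Delta y_{i,T-1}
        A += W.T @ G @ W
        b_lag += W.T @ dy_lag
        b_cur += W.T @ dy
    w = np.linalg.solve(A, b_lag)
    return (w @ b_cur) / (w @ b_lag)

# Usage on a simulated AR(1) panel with individual effects
rng = np.random.default_rng(11)
N, T, phi = 3000, 6, 0.5
alpha = rng.normal(size=N)
y_prev = alpha / (1 - phi) + rng.normal(size=N) / np.sqrt(1 - phi**2)
Y = np.empty((N, T))
for t in range(T):
    y_prev = phi * y_prev + alpha + rng.normal(size=N)
    Y[:, t] = y_prev
phi_hat = arellano_bond_one_step(Y)
W_demo = ab_instruments(np.arange(1.0, 6.0))   # T = 5 gives a 3 x 6 matrix
```

A two-step version would reuse the same accumulation loop, replacing $G$ by the outer product of the one-step residuals $\Delta\hat u_i \Delta\hat u_i'$ for each individual.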

GMM estimator in models with exogenous variables. Now we consider a more general dynamic panel model,

\[
y_{it} = \phi y_{i,t-1} + x_{it}\beta + \alpha_i + u_{it}, \quad i = 1, \dots, N, \; t = 1, \dots, T, \tag{56}
\]

where $x_{it}$ is the $1 \times k$ vector of regressors. Its first-difference version becomes

\[
\Delta y_{it} = \phi \Delta y_{i,t-1} + \Delta x_{it}\beta + \Delta u_{it}. \tag{57}
\]

Footnote 12: Notice, however, that the absence of autocorrelation is necessary for the validity of the moment conditions and thus of the instruments.

Suppose that the $k$-dimensional regressors $x_{it}$ are strictly exogenous such that

\[
E(x_{it}' u_{is}) = 0 \quad \text{for all } t, s = 1, \dots, T,
\]

and assume that all $x_{it}$ are not correlated with the individual effects $\alpha_i$. Then, all $x_{it}$ are valid instruments for (57), in which case the $1 \times kT$ vector defined by

\[
x_i^* = (x_{i1}, x_{i2}, \dots, x_{iT})
\]

should be added to each diagonal element of $W_i$ in (50); that is, we now have the $(T-2) \times [(1 + 2 + \cdots + T-2) + kT(T-2)]$ instrument matrix,

\[
W_i =
\begin{bmatrix}
(y_{i1}, x_i^*) & 0 & \cdots & 0 \\
0 & (y_{i1}, y_{i2}, x_i^*) & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & (y_{i1}, y_{i2}, \dots, y_{i,T-2}, x_i^*)
\end{bmatrix}. \tag{58}
\]

Writing (57) in matrix form and premultiplying it by $W'$, we obtain

\[
W'\Delta y = \phi W'\Delta y_{-1} + W'\Delta X\beta + W'\Delta u, \tag{59}
\]

where $\Delta y$, $\Delta y_{-1}$, $\Delta u$ are defined just after equation (51), and $\Delta X$ is the stacked $N(T-2) \times k$ matrix of observations on $\Delta x_{it}$. The two-step GLS estimator of $(\phi, \beta')'$ can then be obtained by

\[
\begin{pmatrix} \hat\phi_{FGLS,2} \\ \hat\beta_{FGLS,2} \end{pmatrix}
= \left( \Delta z' W \hat V_N^{-1} W' \Delta z \right)^{-1} \left( \Delta z' W \hat V_N^{-1} W' \Delta y \right), \tag{60}
\]

where $W = (W_1', \dots, W_N')'$, $\Delta z = (\Delta y_{-1}, \Delta X)$ and $\hat V_N^{-1}$ is estimated similarly as in (55).

Next, if the $x_{it}$ are not strictly exogenous but predetermined such that

\[
E(x_{it}' u_{is}) \neq 0 \quad \text{for } s < t,
\]

(and still assuming that all $x_{it}$ are not correlated with $\alpha_i$), then only $(x_{i1}, x_{i2}, \dots, x_{i,s-1})$ are valid instruments for (57) at period $s$. Thus, we now get the $(T-2) \times [(1 + 2 + \cdots + T-2) + k(2 + 3 + \cdots + T-1)]$ matrix of instruments,

\[
W_i =
\begin{bmatrix}
(y_{i1}, x_{i1}, x_{i2}) & 0 & \cdots & 0 \\
0 & (y_{i1}, y_{i2}, x_{i1}, x_{i2}, x_{i3}) & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & (y_{i1}, \dots, y_{i,T-2}, x_{i1}, \dots, x_{i,T-1})
\end{bmatrix}, \tag{61}
\]

and the two-step GLS estimator of $(\phi, \beta')'$ can be obtained by (60), but with the choice of $W_i$ given by (61).

In empirical studies a combination of both exogenous and predetermined regressors may occur rather than the two extreme cases, and one can adjust the matrix of instruments $W$ accordingly. In the case where only a subset of $x_{it}$ is correlated with the individual effects $\alpha_i$, we can also extend the Hausman-Taylor estimation procedure. See Baltagi (1995, section 8.2).

3.5.3 The Arellano and Bover (1995) Study

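Arellano and Bover develop a unifying IV-GMM framework for dynamic panel data models, which also includes the Hausman and Taylor type estimator as a special case.

To begin with, consider the following static panel:

\[
y_{it} = x_{it}\beta + z_i\gamma + \varepsilon_{it}, \quad i = 1, 2, \dots, N; \; t = 1, 2, \dots, T, \tag{62}
\]

where $\beta$ is $k \times 1$, $\gamma$ is $g \times 1$ and

\[
\varepsilon_{it} = \alpha_i + u_{it}.
\]

In vector form,

\[
y_i = X_i\beta + Z_i\gamma + \varepsilon_i = S_i\delta + \varepsilon_i, \tag{63}
\]

\[
\varepsilon_i = \alpha_i e_T + u_i,
\]

where

\[
\underset{T \times 1}{y_i} = \begin{bmatrix} y_{i1} \\ \vdots \\ y_{iT} \end{bmatrix}, \quad
\underset{T \times k}{X_i} = \begin{bmatrix} x_{i1} \\ \vdots \\ x_{iT} \end{bmatrix}, \quad
\underset{T \times g}{Z_i} = \begin{bmatrix} z_i \\ \vdots \\ z_i \end{bmatrix}, \quad
\underset{T \times 1}{\varepsilon_i} = \begin{bmatrix} \varepsilon_{i1} \\ \vdots \\ \varepsilon_{iT} \end{bmatrix},
\]

\[
\underset{T \times (k+g)}{S_i} = [X_i, Z_i], \quad
\underset{(k+g) \times 1}{\delta} = \begin{bmatrix} \beta \\ \gamma \end{bmatrix}, \quad
\underset{T \times 1}{e_T} = \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix}, \quad
\underset{T \times 1}{u_i} = \begin{bmatrix} u_{i1} \\ \vdots \\ u_{iT} \end{bmatrix}.
\]

Arellano and Bover transform (63) using the following nonsingular matrix transformation:

\[
\underset{T \times T}{H} = \begin{bmatrix} C \\ e_T'/T \end{bmatrix},
\]

where $C$ is any $(T-1) \times T$ matrix of rank $(T-1)$ such that $Ce_T = 0$. For example, $C$ could be the first $(T-1)$ rows of the within-group operator (see the definition of $Q^*$ in Chapter 3) or the first-difference operator. Premultiplying $\varepsilon_i$ by $H$, we obtain the transformed disturbances,

\[
\varepsilon_i^* = H\varepsilon_i = \begin{bmatrix} C\varepsilon_i \\ \bar\varepsilon_i \end{bmatrix}. \tag{64}
\]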


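This class of transformations performs a decomposition between 'within-group' and 'between-group' variation, which is helpful in order to implement the moment conditions implied by the model. Notice now that the first $(T-1)$ transformed disturbances are free of the individual effects $\alpha_i$ by construction. Hence all exogenous variables are valid instruments for the first $(T-1)$ equations in (63).

Define the $1 \times (kT + g)$ vector $w_i = (x_{i1}, \dots, x_{iT}, z_i)$, and let $m_i$ denote the row vector of the subset of variables of $w_i$ assumed to be uncorrelated with $\alpha_i$ such that $\dim(m_i) = m \geq \dim(\gamma) = g$. Then, a valid instrument matrix becomes

\[
\underset{T \times (kT+g+m)}{W_i} =
\begin{bmatrix}
w_i & \cdots & 0 & 0 \\
\vdots & \ddots & \vdots & \vdots \\
0 & \cdots & w_i & 0 \\
0 & \cdots & 0 & m_i
\end{bmatrix}, \tag{65}
\]

and the moment conditions are given by

\[
E(W_i' H \varepsilon_i) = 0. \tag{66}
\]

Write (63) in matrix form,

\[
y = S\delta + \varepsilon, \tag{67}
\]

\[
\underset{NT \times (k+g)}{S} = \begin{bmatrix} S_1 \\ \vdots \\ S_N \end{bmatrix}, \quad
\underset{NT \times 1}{\varepsilon} = \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_N \end{bmatrix}.
\]

Defining the matrix of instruments,

\[
\underset{NT \times (kT+g+m)}{W} = \begin{bmatrix} W_1 \\ \vdots \\ W_N \end{bmatrix},
\]

and premultiplying (67) by $W'\bar H$, where $\bar H = I_N \otimes H$ is an $NT \times NT$ matrix, we obtain the following complete transformed system:

\[
W'\bar H y = W'\bar H S\delta + W'\bar H \varepsilon. \tag{68}
\]

The Arellano-Bover optimal GMM estimator of $\delta$ based on the moment condition (66) is in fact the GLS estimator applied to (68), given by

\[
\hat\delta_{GLS} = \left( S'\bar H' W V^{-1} W'\bar H S \right)^{-1} \left( S'\bar H' W V^{-1} W'\bar H y \right), \tag{69}
\]

where

\[
\mathrm{Var}(\varepsilon) = I_N \otimes E(\varepsilon_i \varepsilon_i') = I_N \otimes \Omega,
\]

\[
V = \mathrm{Var}(W'\bar H \varepsilon) = W'\bar H (I_N \otimes \Omega) \bar H' W = W'(I_N \otimes H\Omega H')W.
\]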

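The feasible GLS estimator is obtained by replacing $H\Omega H'$ or $V$ by its consistent estimator. First, an unrestricted estimator of $H\Omega H'$ takes the form

\[
\frac{1}{N} \sum_{i=1}^N \hat\varepsilon_i^* \hat\varepsilon_i^{*\prime},
\]

where $\hat\varepsilon_i^*$ are the residuals based on consistent preliminary estimates, in which case we have, with $\bar H = I_N \otimes H$,

\[
\hat\delta_{FGLS} = \left( S'\bar H' W \hat V^{-1} W'\bar H S \right)^{-1} \left( S'\bar H' W \hat V^{-1} W'\bar H y \right), \tag{70}
\]

where

\[
\hat V = W' \left( I_N \otimes \left( \frac{1}{N} \sum_{i=1}^N \hat\varepsilon_i^* \hat\varepsilon_i^{*\prime} \right) \right) W.
\]

Alternatively, we consider a restricted estimate of $\Omega$,

\[
\tilde\Omega = \tilde\sigma_\alpha^2 e_T e_T' + \tilde\sigma_u^2 I_T,
\]

where $\tilde\sigma_\alpha^2$ and $\tilde\sigma_u^2$ are consistent estimators of $\sigma_\alpha^2$ and $\sigma_u^2$ (recall the random effects model). Thus, we have

\[
\tilde\delta_{FGLS} = \left( S'\bar H' W \tilde V^{-1} W'\bar H S \right)^{-1} \left( S'\bar H' W \tilde V^{-1} W'\bar H y \right), \tag{71}
\]

where

\[
\tilde V = W' \left( I_N \otimes H\tilde\Omega H' \right) W.
\]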

Consider now the Hausman and Taylor model again,

\[
y_{it} = x_{1,it}\beta_1 + x_{2,it}\beta_2 + z_{1,i}\gamma_1 + z_{2,i}\gamma_2 + \alpha_i + u_{it}, \quad i = 1, 2, \dots, N; \; t = 1, 2, \dots, T, \tag{72}
\]

where the $1 \times k_1$ vector $x_{1,it}$ and the $1 \times g_1$ vector $z_{1,i}$ are uncorrelated with $\alpha_i$, but the $1 \times k_2$ vector $x_{2,it}$ and the $1 \times g_2$ vector $z_{2,i}$ are correlated with $\alpha_i$ ($k_1 + k_2 = k$ and $g_1 + g_2 = g$). Using the Arellano-Bover transformation, it is easily seen that $m_i$ includes the set of variables $x_{1,it}$ and $z_{1,i}$, namely,

\[
m_i = (x_{1,i1}, \dots, x_{1,iT}, z_{1,i}).
\]

Then, the Hausman and Taylor estimator is equivalent to the restricted feasible GLS estimator in (71).

Because the set of instruments $W_i$ is block-diagonal, it can be shown that the feasible GLS estimators in (70) and (71) are invariant to the choice of the transformation matrix $C$. Another advantage is that the form of $\Omega^{-1/2}$ (for the GLS transformation) need not be known. Hence, this approach generalises the HT, AM and BMS type estimators to a more general form than that of error components, and it easily extends to dynamic panels.

Consider now the dynamic panel:

yit = φyit−1 + xitβ + ziγ + εit, i = 1, 2, . . . , N, ; t = 1, 2, . . . , T, (73)

37

where β is k × 1, γ is g × 1 and

εit = αi + uit.

In the vector from,

yi = yi,−1φ+ Xiβ + Ziγ + εi = Siδ + εi, (74)

εi = αieT + ui,

where13

$$
\underset{T\times 1}{y_i} = \begin{bmatrix} y_{i1} \\ \vdots \\ y_{iT} \end{bmatrix}, \quad
y_{i,-1} = \begin{bmatrix} y_{i0} \\ \vdots \\ y_{i,T-1} \end{bmatrix}, \quad
\underset{T\times k}{X_i} = \begin{bmatrix} x_{i1} \\ \vdots \\ x_{iT} \end{bmatrix}, \quad
\underset{T\times g}{Z_i} = \begin{bmatrix} z_i \\ \vdots \\ z_i \end{bmatrix}, \quad
\varepsilon_i = \begin{bmatrix} \varepsilon_{i1} \\ \vdots \\ \varepsilon_{iT} \end{bmatrix},
$$

$$
\underset{T\times(1+k+g)}{S_i} = [y_{i,-1}, X_i, Z_i], \quad
\underset{(1+k+g)\times 1}{\delta} = \begin{bmatrix} \phi \\ \beta \\ \gamma \end{bmatrix}, \quad
e_T = \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix}, \quad
u_i = \begin{bmatrix} u_{i1} \\ \vdots \\ u_{iT} \end{bmatrix}.
$$

Provided there are enough valid instruments to ensure identification of all parameters, the GMM estimator defined in (69) still provides a consistent estimator of δ in (74). Now, the matrix of instruments Wi is basically the same as in (65), where the 1 × (k (T + 1) + g) vector wi = (xi0, xi1, ..., xiT , zi) and mi is the row vector of the subset of variables of wi assumed to be uncorrelated with αi such that dim (mi) = m ≥ dim (γ) = g. (The only difference is that t = 0 is now our first time period observed.) Notice, however, that yi,−1 is excluded despite its presence in Si, because yi,−1 is generally regarded as endogenous unless uit is serially uncorrelated (see below). Then, the previous steps of consistent estimation follow.

Finally, consider the case where uit is not serially correlated, so that yit−1 is predetermined. In this case we could obtain additional orthogonality restrictions under the additional condition that the transformation matrix C is upper triangular. The transformed error in the equation for period t is then independent of αi and (ui1, ..., uit−1), so that (yi0, yi1, ..., yit−1) are additional valid instruments. Thus, a valid instrument matrix now becomes

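$$
W_i = \begin{bmatrix}
(w_i, y_{i0}) & 0 & \cdots & 0 \\
\vdots & \ddots & & \vdots \\
0 & \cdots & (w_i, y_{i0}, y_{i1}, \dots, y_{i,T-2}) & 0 \\
0 & \cdots & 0 & m_i
\end{bmatrix}_{T\times(kT+g+m)}, \qquad (75)
$$

and consistent GMM estimation follows accordingly. For further details see Arellano and Bover (1995).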

13We now assume that yi0 exists.


3.5.4 Further Readings

The dynamic panel model based on the IV-GMM estimator is still an ongoing research topic, but a few papers are worth reading for further study. Ahn and Schmidt (1995) observed that the standard IV-GMM estimator suggested by Arellano and Bond (1991) neglects quite a lot of information and is therefore inefficient. They explain how the additional moment conditions arise for the simple dynamic model and show how they can be utilized in a GMM framework. Ahn and Schmidt also consider the dynamic model with exogenous regressors and show how one can make efficient use of exogenous variables as instruments. Holtz-Eakin et al. (1988) discuss the VAR in panels, and Blundell and Bond (1998) examine initial conditions and moment conditions in dynamic panels.

3.6 Dynamic Panels When Both N and T are Large

The econometric theory for panel data (see Hsiao (1986), Baltagi (1995) and Matyas and Sevestre (1996) for surveys) was largely developed for survey data where T was small (often only four or five observations) but N, the number of individuals, was large.

In recent years there has been an upsurge in the availability and use of panel data sets where both N and T are sufficiently large. For example, Summers and Heston’s (1991) large multi-country panel data set has been, and still is, the focus of much empirical work in the area of macroeconomic growth. Furthermore, as time progresses, micro panel data sets such as the British Household Panel Survey (BHPS), where T is not currently so large, are being updated to incorporate new time series observations as they become available. It is now recognized that, when used in an appropriately rigorous fashion, large panels can be hugely informative about the unknown parameters of economic models and yield very powerful tests of hypotheses nested within these models.

These large N, large T panels raise a number of issues. First of all, since it is possible to estimate a separate regression for each group, which is not possible in the small T case, it is natural to think of heterogeneous panels where the parameters can differ over groups. One can then test for equality of the parameters, rather than having to assume it, as one is forced to do in the small T case. When equality of parameters over groups is tested it is very often rejected, and the differences in the estimates between groups can be large. Baltagi and Griffin (1997) discuss the dispersion in OECD country estimates for gasoline demand functions. Boyd and Smith (2000) review possible explanations for the large dispersion in models of the monetary transmission mechanism for 57 developing countries.

Secondly, since time-series data tend to be non-stationary, determining the order of integration or cointegration of the variables becomes important, where the order of integration is the number of times a time series must be differenced to make it stationary. Extending the estimation and testing procedures for integrated and cointegrated series to panels is thus a natural development. In fact, the tendency of tests on individual time series to reject Purchasing Power Parity (that is, not to reject a unit root in the real exchange rate) has led the emphasis to switch to testing PPP in panels.

Third, one needs to determine the asymptotic properties of standard panel estimators when the data are non-stationary. These properties are rather different from those of single time series; in particular, spurious regression seems to be less of a problem. See Phillips and Moon (1999) and Coakley, Fuertes and Smith (2001). There is also a question about how to do the asymptotic analysis, as both N and T can go to infinity. There are a number of different ways that N and T can go to infinity, and the relation between these different ways remains a subject of research; see Phillips and Moon (1999, 2000) and Shin and Snell (2001).

3.6.1 The Mean Group Estimator

The conventional dynamic panel model with small T focuses only on allowing for intercept variation via individual effects. In comparison, little attention has been paid to the implications of variation in slope parameters. There are three justifications for analyzing dynamic heterogeneous panels.

First, slope heterogeneity does not matter when the primary interest lies in obtaining an unbiased estimator of the average effect of exogenous variables. Zellner (1969) showed that when the regressors are exogenous but their coefficients differ randomly across groups, pooled estimators such as the fixed and random effects estimators provide unbiased estimators of the average effect. However, Pesaran and Smith (1995) showed that such results do not extend to dynamic models with lagged dependent variables.

Second, when only a relatively small number of time periods was available, the scope for analysing slope heterogeneity explicitly appeared limited, e.g. the seminal paper by Balestra and Nerlove (1966). Considering now that panels with a reasonable time dimension are available and that the evidence of slope heterogeneity in panels is pervasive, it is high time to examine the implications of slope heterogeneity directly.

A third possible justification is that the long-run responses, which are often the primary focus of analysis, are less likely to be subject to slope heterogeneity than the short-run adjustment patterns across groups. Therefore, it is interesting to see how the time-series, cross-section and panel estimates of such long-run coefficients can be compared.

In practice the extent of cross-sectional heterogeneity may be so large as to preclude the use of pooling. An approach that is becoming increasingly popular in this context is to focus estimation and inference on so-called mean group quantities that are “averages” across panel units. This approach has been advanced by Pesaran and Smith (1995), and applied to the panel-based unit root test by Im, Pesaran and Shin (1999) and to the panel-based stationarity test by Shin and Snell (2001).

Consider the following dynamic heterogeneous panels,

yit = φiyit−1 + xitβi + εit, i = 1, ..., N, t = 1, ..., T, (76)


where we assume:

1. xit and εis are uncorrelated with each other for all t and s.

2. $\phi_i \sim \text{iid } N(\phi, \omega_1^2)$.

3. $\beta_i \sim \text{iid } N(\beta, \Omega_2)$.

4. φi and βi are distributed independently of y1t, xit and εit for all t.

5. The k-dimensional vector xit is a covariance stationary process.14

Under the second and third assumptions this can be regarded as the standard random coefficients model, see Swamy (1970). Under this scenario Pesaran and Smith (1995) have investigated the asymptotic as well as small sample properties of the following three alternative estimators:

1. The pooled (within) estimator, which involves pooling the data by imposing homogeneous slope coefficients, allowing for fixed or random effects.

2. The cross-section (between) estimator, which involves averaging the data over time for each group and estimating the cross-section regression based on the group mean data. Notice that many works on endogenous growth follow this approach, e.g. Barro (1991).

3. The mean group estimator suggested by Pesaran and Smith (1995), which involves estimating a separate regression for each group and averaging the individual estimates over groups, that is,

$$
\hat{\beta}_{MG} = N^{-1}\sum_{i=1}^{N} \hat{\beta}_i, \qquad (77)
$$

where $\hat{\beta}_i$ is the OLS estimator of βi.
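The mean group calculation in (77) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `simulate_panel` and `mean_group`, the parameter values (φ = 0.5, β = 1) and the simulated design are hypothetical and are not taken from Pesaran and Smith (1995). Model (76) has no intercept, so none is included in the group regressions.

```python
import numpy as np

def simulate_panel(N, T, seed=0):
    """Simulate y_it = phi_i * y_i,t-1 + beta_i * x_it + eps_it (illustrative design)."""
    rng = np.random.default_rng(seed)
    # Random coefficients around phi = 0.5 and beta = 1.0; phi_i is clipped
    # so that every group is stable (an assumption made for this sketch only).
    phi = np.clip(rng.normal(0.5, 0.1, N), 0.0, 0.9)
    beta = rng.normal(1.0, 0.2, N)
    X = rng.normal(size=(N, T + 1, 1))
    y = np.zeros((N, T + 1))            # y[:, 0] plays the role of y_i0
    for t in range(1, T + 1):
        y[:, t] = phi * y[:, t - 1] + beta * X[:, t, 0] + rng.normal(size=N)
    return y, X

def mean_group(y, X):
    """Group-by-group OLS followed by a simple average over groups, as in (77)."""
    coefs = []
    for i in range(y.shape[0]):
        # Regressors for group i: lagged y and current x
        S = np.column_stack([y[i, :-1], X[i, 1:, :]])
        b, *_ = np.linalg.lstsq(S, y[i, 1:], rcond=None)
        coefs.append(b)
    coefs = np.asarray(coefs)
    return coefs, coefs.mean(axis=0)

y, X = simulate_panel(N=50, T=100)
coefs, mg = mean_group(y, X)
print("MG estimate of (phi, beta):", mg)
```

The averaging is explicit here: each row of `coefs` is one group's OLS estimate, and the MG estimate is their unweighted mean.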

Notice that all the above procedures provide an estimate of the average coefficient; the main difference is that in the first two cases the averaging is implicit while in the third case it is explicit. Pesaran and Smith (1995) obtain the following results:

1. When T is small (even if N is large), all the procedures yield inconsistent estimators.

2. When both T and N are large, both the cross-section (between) estimator and the mean group estimator yield consistent estimators of φ and β.

3. The pooled (within) estimator is not consistent for the expected values of φi and βi even when both T and N are large.15

14 In general, most of the results in what follows also hold when xit are unit root processes.

15 This inconsistency in heterogeneous dynamic panel models was first noted by Robertson and Symonds (1992).


To see that the within estimator is inconsistent we rewrite (76) as16

$$
y_{it} = \phi y_{i,t-1} + x_{it}\beta + v_{it}, \qquad (78)
$$

where

$$
v_{it} = \varepsilon_{it} + (\phi_i - \phi)\, y_{i,t-1} + x_{it}(\beta_i - \beta).
$$

It is then easily seen that vit is correlated with yi,t−1−s and xit−s for all s ≥ 0. This correlation renders the OLS estimator inconsistent. Furthermore, the fact that vit is correlated with yi,t−1−s and xit−s for all s ≥ 0 rules out the possibility of choosing lagged values of yi,t−1 and xit as legitimate instruments. The composite disturbance vit will also be serially correlated if xit is serially correlated, as it usually is, and will not be independent of the lagged dependent variable. This heterogeneity bias, which depends on the serial correlation in x and the variance of the random parameters, can be quite severe.

Pesaran, Smith and Im (1996) confirmed the above theoretical findings via Monte Carlo simulations. The heterogeneity issue is important since, when the hypothesis that the slope coefficients are in fact identical is tested using, for example, the log-likelihood ratio statistic, it is almost always rejected at conventional significance levels. However, in small samples conventional significance levels may not be appropriate. An alternative to testing is to use a model selection criterion to decide whether or not to pool. One possibility which does correct for the effect of sample size is to choose the model with the largest value of the Schwarz Bayesian Criterion (SBC). For large samples the SBC penalizes over-parameterization more heavily than tests at conventional significance levels and is thus more likely to choose the pooled model. The decision whether to pool or not will depend both on the degree of heterogeneity and on the purpose of the exercise. Baltagi and Griffin (1997) use a large number of different estimators to estimate the demand for gasoline in OECD countries. They find that there is a very large degree of heterogeneity in the individual country estimates but that, for the purpose of medium-term forecasting, simple pooled estimators do well.

3.6.2 Pooled mean group estimation in dynamic heterogeneous panels

To date, there have been three alternative estimation procedures for dynamicpanels, differing in the relative magnitudes of N and T .

16 Recall that if φi = φ and βi = β, then the fixed effects estimators are consistent only as T → ∞, for fixed N. However, they are inconsistent as N → ∞ for fixed T. The latter results from the fact that the lagged dependent variable bias arising from the initial conditions is not removed by increasing N. There is a large literature dealing with dynamic models in the small T case; see the previous section. There is a further difficulty that the traditional GMM estimator breaks down if the variables are I(1); see Binder, Hsiao and Pesaran (1999), who suggest a maximum likelihood estimator that is consistent whether the variables are I(0) or I(1).


1. (Small N and large T) When N = 1, the traditional approach was to estimate an autoregressive distributed lag (ARDL) model. For N > 1, following Zellner (1962), the seemingly unrelated regression equation (SURE) procedure is often used. The main attraction of the SURE procedure lies in the fact that it allows the contemporaneous error covariances to be freely estimated. However, this is possible only when N is reasonably small relative to T. When N is of the same order of magnitude as T, the case we are interested in, SURE is not feasible.

2. (Small T and large N) Pesaran and Smith (1995) show that the traditional procedures for estimation of pooled models, such as the fixed effects, the random effects, the instrumental variables (IV) or the Generalized Method of Moments (GMM) estimators proposed by Anderson and Hsiao (1981, 1982), Arellano (1989), Arellano and Bover (1995), Keane and Runkle (1992) and Ahn and Schmidt (1995), can produce inconsistent, and potentially very misleading, estimates of the average values of the parameters unless the slope coefficients are homogeneous. In most panels of this sort, however, tests indicate that these parameters differ significantly across groups. Thus an estimator which imposes weaker homogeneity assumptions would be useful.

3. (The Bayes and empirical Bayes estimators) Hsiao, Pesaran and Tahmiscioglu (1998) consider Bayes estimation of short-run coefficients in dynamic heterogeneous panels, and establish the asymptotic equivalence of the Bayes estimator (see Swamy (1970)) and the mean group estimator, showing that the mean group estimator is asymptotically normal for large N and large T so long as $\sqrt{N}/T \to 0$ as both N and T → ∞.

In particular, when both N and T are sufficiently large, there have been two extreme approaches to analyzing dynamic panels. At one extreme are the traditional pooled estimators, such as the fixed and random effects estimators, where only the intercepts are allowed to differ across groups while all other coefficients and error variances are constrained to be the same. At the other extreme, one can estimate separate equations for each group and examine the distribution of the estimated coefficients across groups. Of particular interest is the mean group estimator (MGE) advanced by Pesaran and Smith (1995).

There are often good reasons to expect the long-run equilibrium relationships between variables to be similar across groups, due to budget or solvency constraints, arbitrage conditions or common technologies influencing all groups in a similar way. However, the reasons for assuming that short-run dynamics and error variances should be the same tend to be less compelling. On this ground Pesaran, Shin and Smith (1999) provide an intermediate estimator called the pooled mean group estimator (PMGE), which involves both pooling and averaging. This estimator allows the intercepts, short-run coefficients and error variances to differ freely across groups, and constrains only the long-run coefficients to be the same.


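We now extend the single time series ARDL modelling to the dynamic panel data model,

$$
y_{it} = \sum_{j=1}^{p} \lambda_{ij}\, y_{i,t-j} + \sum_{j=0}^{q} x_{i,t-j}\,\delta_{ij} + \mu_i + \varepsilon_{it}, \quad t = 1, 2, \dots, T, \; i = 1, 2, \dots, N, \qquad (79)
$$

where µi represents the fixed effects, xi,t is a 1 × k vector of regressors, λij are scalar coefficients and δij are k × 1 vectors of parameters. It is convenient to work with the following (unrestricted) error correction form of (79):

$$
\Delta y_{it} = \phi_i\, y_{i,t-1} + x_{it}\beta_i + \sum_{j=1}^{p-1} \lambda^{*}_{ij}\, \Delta y_{i,t-j} + \sum_{j=0}^{q-1} \Delta x_{i,t-j}\,\delta^{*}_{ij} + \mu_i + \varepsilon_{it}, \qquad (80)
$$

where

$$
\phi_i = -\Big(1 - \sum_{j=1}^{p} \lambda_{ij}\Big) \quad \text{and} \quad \beta_i = \sum_{j=0}^{q} \delta_{ij}.
$$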
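The step from the ARDL form (79) to the error correction form (80) can be checked numerically. The sketch below assumes a scalar regressor (k = 1) and uses the re-parameterisation λ*ij = −Σ_{m=j+1}^{p} λim and δ*ij = −Σ_{m=j+1}^{q} δim. These two expressions are not stated in the text; they are what the algebra of rewriting lagged levels as the level plus differences implies. The function names are hypothetical.

```python
import numpy as np

def ardl_to_ecm(lam, delta):
    """Map ARDL coefficients to the error correction parameters of (80).

    lam   : (lam_1, ..., lam_p)
    delta : (delta_0, ..., delta_q), scalar regressor assumed
    """
    lam = np.asarray(lam, dtype=float)
    delta = np.asarray(delta, dtype=float)
    phi = -(1.0 - lam.sum())
    beta = delta.sum()
    # lam*_j = -(lam_{j+1} + ... + lam_p), j = 1, ..., p-1
    lam_star = np.array([-lam[j:].sum() for j in range(1, len(lam))])
    # delta*_j = -(delta_{j+1} + ... + delta_q), j = 0, ..., q-1
    delta_star = np.array([-delta[j + 1:].sum() for j in range(len(delta) - 1)])
    return phi, beta, lam_star, delta_star

def ardl_rhs(y, x, t, lam, delta, mu):
    """Systematic part of y_t according to (79)."""
    return (sum(lam[j - 1] * y[t - j] for j in range(1, len(lam) + 1))
            + sum(delta[j] * x[t - j] for j in range(len(delta))) + mu)

def ecm_rhs(y, x, t, lam, delta, mu):
    """Systematic part of y_t implied by (80): y_{t-1} plus the fitted change."""
    phi, beta, lam_star, delta_star = ardl_to_ecm(lam, delta)
    dy = lambda s: y[s] - y[s - 1]
    dx = lambda s: x[s] - x[s - 1]
    change = (phi * y[t - 1] + beta * x[t]
              + sum(lam_star[j - 1] * dy(t - j) for j in range(1, len(lam)))
              + sum(delta_star[j] * dx(t - j) for j in range(len(delta) - 1))
              + mu)
    return y[t - 1] + change

# The two forms agree for arbitrary series, since (80) is only a re-parameterisation.
rng = np.random.default_rng(0)
y, x = rng.normal(size=30), rng.normal(size=30)
lam, delta, mu = [0.5, 0.2], [0.3, 0.1, -0.05], 0.7
print(all(np.isclose(ardl_rhs(y, x, t, lam, delta, mu),
                     ecm_rhs(y, x, t, lam, delta, mu)) for t in range(3, 30)))
```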

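If we stack the time-series observations for each group, equation (80) can be written as

$$
\Delta y_i = \phi_i\, y_{i,-1} + X_i\beta_i + \sum_{j=1}^{p-1} \lambda^{*}_{ij}\, \Delta y_{i,-j} + \sum_{j=0}^{q-1} \Delta X_{i,-j}\,\delta^{*}_{ij} + \mu_i \iota_T + \varepsilon_i, \qquad (81)
$$

where

$$
\underset{T\times 1}{y_i} = \begin{bmatrix} y_{i1} \\ \vdots \\ y_{iT} \end{bmatrix}, \quad
\underset{T\times k}{X_i} = \begin{bmatrix} x_{i1} \\ \vdots \\ x_{iT} \end{bmatrix}, \quad
\underset{T\times 1}{\iota_T} = \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix}, \quad
\underset{T\times 1}{\varepsilon_i} = \begin{bmatrix} \varepsilon_{i1} \\ \vdots \\ \varepsilon_{iT} \end{bmatrix}.
$$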

We now assume:

• Assumption 1. The εit are independently distributed across i and t, with zero means, variances $\sigma_i^2 > 0$, and finite fourth-order moments. They are also distributed independently of the regressors, xit, so that the xit are strictly exogenous.

• Assumption 2. The ARDL(p, q, q, ..., q) model (79) is stable in that the roots of

$$
\sum_{j=1}^{p} \lambda_{ij} z^{j} = 1, \quad i = 1, 2, \dots, N,
$$

lie outside the unit circle. This assumption ensures that

$$
\phi_i < 0 \quad \text{for all } i = 1, 2, \dots, N,
$$

and hence there exists a long-run relationship between yit and xit defined by

$$
y_{it} = x_{it}\theta_i + \eta_{it},
$$

where $\theta_i = -\beta_i/\phi_i$ are the long-run coefficients and ηit is a stationary process.


• Assumption 3. (Long-run homogeneity) The long-run coefficients θi are the same across the groups, i.e.

$$
\theta_i = \theta, \quad i = 1, 2, \dots, N. \qquad (82)
$$
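Assumption 2 and the implied long-run coefficient can be checked for a given set of ARDL coefficients. The following is a minimal sketch, assuming a single group and hypothetical function names; it finds the roots of 1 − λ1 z − ... − λp z^p with `numpy.roots` and computes θ = −β/φ from the definitions of φ and β given with (80).

```python
import numpy as np

def is_stable(lam):
    """True if all roots of 1 - lam_1 z - ... - lam_p z^p lie outside the unit circle."""
    lam = np.asarray(lam, dtype=float)
    # numpy.roots expects coefficients ordered from the highest power down to the constant
    coeffs = np.concatenate([-lam[::-1], [1.0]])
    roots = np.roots(coeffs)
    return bool(np.all(np.abs(roots) > 1.0))

def long_run_coefficient(lam, delta):
    """theta = -beta / phi, with phi = -(1 - sum(lam)) and beta = sum(delta)."""
    phi = -(1.0 - np.sum(lam))
    beta = np.sum(np.asarray(delta, dtype=float), axis=0)
    return -beta / phi

print(is_stable([0.5, 0.2]))                                # stable case, phi = -0.3
print(is_stable([0.9, 0.2]))                                # sum of lam exceeds one
print(long_run_coefficient([0.5, 0.2], [0.3, 0.1, -0.05]))
```

Under long-run homogeneity (82), the value returned by `long_run_coefficient` would be the same for every group even though the individual λij and δij differ.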

Under Assumptions 2 and 3, (81) can be written compactly as

$$
\Delta y_i = \phi_i\, \xi_i(\theta) + W_i\kappa_i + \varepsilon_i, \quad i = 1, 2, \dots, N, \qquad (83)
$$

where we highlight the dependence of the error correction term on θ as

$$
\xi_i(\theta) = y_{i,-1} - X_i\theta, \quad i = 1, 2, \dots, N, \qquad (84)
$$

$W_i = (\Delta y_{i,-1}, \dots, \Delta y_{i,-p+1}, \Delta X_i, \Delta X_{i,-1}, \dots, \Delta X_{i,-q+1}, \iota_T)$, and $\kappa_i = (\lambda^{*}_{i1}, \dots, \lambda^{*}_{i,p-1}, \delta^{*\prime}_{i0}, \delta^{*\prime}_{i1}, \dots, \delta^{*\prime}_{i,q-1}, \mu_i)'$.

To estimate the model, we adopt a likelihood approach. Since the parameters of interest are the long-run effects and adjustment coefficients, we work directly with the concentrated log-likelihood function given by

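$$
\ell_T(\varphi) = -\frac{T}{2}\sum_{i=1}^{N} \ln 2\pi\sigma_i^2 - \frac{1}{2}\sum_{i=1}^{N} \frac{1}{\sigma_i^2} \big(\Delta y_i - \phi_i \xi_i(\theta)\big)' H_i \big(\Delta y_i - \phi_i \xi_i(\theta)\big), \qquad (85)
$$

where

$$
H_i = I_T - W_i (W_i' W_i)^{-1} W_i',
$$

$$
\varphi = (\theta', \phi', \sigma')', \quad \phi = (\phi_1, \phi_2, \dots, \phi_N)', \quad \sigma = (\sigma_1^2, \sigma_2^2, \dots, \sigma_N^2)'.
$$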
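For concreteness, the concentrated log-likelihood (85) can be evaluated as follows. This is a sketch under stated assumptions rather than the authors' implementation: the function name and argument layout are hypothetical, the panel is balanced, and the product H_i r is computed as the residual from a least-squares projection of r on W_i instead of forming H_i explicitly.

```python
import numpy as np

def pmg_concentrated_loglik(theta, phi, sigma2, dy, ylag, X, W):
    """Evaluate (85).

    theta  : (k,) common long-run coefficients
    phi    : (N,) error correction coefficients
    sigma2 : (N,) error variances
    dy, ylag : lists of (T,) arrays (Delta y_i and y_{i,-1})
    X, W     : lists of (T, k) and (T, m) arrays
    """
    total = 0.0
    for i in range(len(dy)):
        T = dy[i].shape[0]
        xi = ylag[i] - X[i] @ theta                 # error correction term (84)
        r = dy[i] - phi[i] * xi
        coef, *_ = np.linalg.lstsq(W[i], r, rcond=None)
        Hr = r - W[i] @ coef                        # H_i r without building H_i
        total += -0.5 * T * np.log(2 * np.pi * sigma2[i]) - 0.5 * (r @ Hr) / sigma2[i]
    return total

# Illustrative call on random inputs (values carry no economic meaning)
rng = np.random.default_rng(0)
N, T, k = 3, 12, 2
dy = [rng.normal(size=T) for _ in range(N)]
ylag = [rng.normal(size=T) for _ in range(N)]
X = [rng.normal(size=(T, k)) for _ in range(N)]
W = [np.column_stack([rng.normal(size=T), np.ones(T)]) for _ in range(N)]
print(pmg_concentrated_loglik(np.array([0.5, 1.0]), np.array([-0.3, -0.5, -0.2]),
                              np.array([1.0, 2.0, 0.5]), dy, ylag, X, W))
```

A numerical optimiser applied to the negative of this function over (θ, φ, σ) would be one way to obtain the ML estimates; the text does not specify the algorithm used.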

The maximum likelihood (ML) estimators of the long-run coefficients, θ, and the group-specific error-correction coefficients, φi, will be referred to as the “pooled mean group” (PMG) estimators in order to highlight both the pooling implied by the homogeneity restrictions on the long-run coefficients and the averaging across groups used to obtain means of the short-run parameters of the model.

The Case of Stationary Regressors In this case, under fairly standard conditions, the consistency and asymptotic normality of the ML estimators of ϕ in (85) can be easily established.

The Case of Non-Stationary Regressors The asymptotic analysis in this case is more complicated, but we can still show that the ML estimators of the short-run coefficients φ and σ in the dynamic heterogeneous panel data model (83) are $\sqrt{T}$-consistent, and the ML estimator of θ is T-consistent. Furthermore, for a fixed N and as T → ∞, the ML estimator of ψ = (θ′,φ′)′ asymptotically has a (mixture) normal distribution.

In sum, for the common long-run coefficients, θ, the pooled ML estimator is consistent so long as T → ∞, irrespective of whether N is large or not, but the estimator of θ will not be consistent for finite T, even if N → ∞.


Once the pooled ML estimator of the long-run parameters, $\hat{\theta}$, has been obtained, all other short-run coefficients can be consistently estimated by running the individual OLS regressions,

$$
\Delta y_i = \phi_i\, \hat{\xi}_i + W_i\kappa_i + \text{error}, \quad i = 1, \dots, N,
$$

where

$$
\hat{\xi}_i = y_{i,-1} - X_i\hat{\theta}.
$$

In this case the mean of the error correction coefficients and the other short-run parameters can be estimated consistently by the MGE,

$$
\hat{\phi}_{MG} = N^{-1}\sum_{i=1}^{N} \hat{\phi}_i \quad \text{and} \quad \hat{\kappa}_{MG} = N^{-1}\sum_{i=1}^{N} \hat{\kappa}_i.
$$
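The second step described above (individual OLS regressions given the pooled long-run estimate, followed by averaging) might be sketched as below. The function name and data layout are hypothetical, and the illustration uses noise-free synthetic data so that the group regressions recover the coefficients exactly; it is not a dynamically consistent simulation of model (79).

```python
import numpy as np

def pmg_second_step(theta_hat, dy, ylag, X, W):
    """OLS of Delta y_i on (xi_i, W_i) for each group, then mean group averages."""
    phi_hat, kappa_hat = [], []
    for dy_i, ylag_i, X_i, W_i in zip(dy, ylag, X, W):
        xi_i = ylag_i - X_i @ theta_hat             # estimated error correction term
        R = np.column_stack([xi_i, W_i])
        coef, *_ = np.linalg.lstsq(R, dy_i, rcond=None)
        phi_hat.append(coef[0])
        kappa_hat.append(coef[1:])
    phi_hat = np.array(phi_hat)
    kappa_hat = np.array(kappa_hat)
    return phi_hat, kappa_hat, phi_hat.mean(), kappa_hat.mean(axis=0)

# Synthetic, noise-free illustration
rng = np.random.default_rng(1)
N, T, k = 4, 30, 2
theta = np.array([0.8, -0.3])
phi = np.array([-0.2, -0.4, -0.6, -0.5])
kappa = rng.normal(size=(N, 2))
ylag = [rng.normal(size=T) for _ in range(N)]
X = [rng.normal(size=(T, k)) for _ in range(N)]
W = [np.column_stack([rng.normal(size=T), np.ones(T)]) for _ in range(N)]
dy = [phi[i] * (ylag[i] - X[i] @ theta) + W[i] @ kappa[i] for i in range(N)]

phi_hat, kappa_hat, phi_mg, kappa_mg = pmg_second_step(theta, dy, ylag, X, W)
print(phi_hat, phi_mg)
```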

3.6.3 Empirical Applications: The Consumption Function in the OECD

See Pesaran, Shin and Smith (1999). For further empirical applications of PMG estimation see Pesaran, Haque and Sharma (1999) and Fedderke, Shin and Vaze (2000). A very useful applied paper is Attanasio et al. (2000), who use a variety of estimators to measure the dynamic interaction between savings, investment and growth.

4 Factor Models

The term ‘factor’ has a variety of different meanings in different areas. Suppose that the observed variables xit, i = 1, 2, ..., N, are determined by unobserved factors, fjt, j = 1, ..., r:

$$
x_{it} = \lambda_{i0} + \sum_{j=1}^{r} \lambda_{ij} f_{jt} + e_{it}, \qquad (86)
$$

where λij are factor loadings and eit are idiosyncratic effects. Usually r is much smaller than N, so the variation in a large number of observed variables can be reduced to a few unobserved factors.

4.1 Uses

Factor models are used in various applications:

• In economics the oldest is the decomposition of time series into unobserved factors labelled trend, cycle, seasonals etc.

• The observed series may be generated by some underlying unobserved factors and the objective is to measure them. This was developed extensively in psychometrics, where the xit are answers to a variety of questions by a sample of people. The underlying factors are aspects of personality, e.g. neuroticism, openness, conscientiousness, agreeableness and extroversion. It has also been used in economics for unobserved variables like development, natural rates, permanent components, core inflation, etc.

• Factor models can be used to measure the dimension of the independentvariation in a set of data, e.g. how many factors are needed to accountfor most of the variation in xit. For I(1) series these dimensions may bethe stochastic trends.

• Factor models can be used to reduce the dimensionality of a set of possible explanatory variables in regression or forecasting models, i.e. replace the large number of xit by a few fjt which contain most of the information in the xit. This may reduce omitted variable problems.

• Factor models are used to model residual cross-section dependence in panel data models.

• Factor models have been used to choose instruments for IV or GMM estimators when there is a large number of potential instruments.

One can distinguish two different types of problem. The Pesaran approach to the role of unobserved factors in panels is primarily motivated by the need to allow for "error" cross-sectional dependence. The aim is to estimate the (mean) coefficient of xit allowing for error cross-sectional dependence and/or missing unobserved effects.

Alternatively, it might be relevant to view the unobserved factor as "missing" (omitted) common effects. An example is "technology" in the aggregate production function. In modelling error CSD the error of each cross-section unit should have mean zero (otherwise the model suffers from omitted variables) and could be serially correlated. The errors could also be I(1). But if the aim is to test for cointegration between observables, yit and xit (which could also contain observed common effects such as oil prices), and if we maintain the possibility that the errors of the relationship between yit and xit can be I(1), then it is clear that yit and xit cannot cointegrate.

One could hypothesize that yit, xit and ft are cointegrated, where ft is an "unobserved" factor. Even if such a possibility existed, it may not be relevant if the economic relation of interest is between yit and xit, e.g. in the case of PPP or UIP.

In the case where the common factor represents a missing variable, such as the technological variable in the production function, our main interest is in fact whether yit (log output per man hour), kit (log capital per man hour) and ft (global technology) are cointegrated. The role of ft is not to model error CSD; it is an integral part of the model which happens to be unobserved. One could try to obtain proxies for ft: some assume that ft is a linear trend with a stationary component, while others assume that it is a latent variable and use the HP filter to measure it. In the context of growth convergence one might estimate ft by the cross-sectional average of yit over i. But there will be some degree of arbitrariness. For example, how can we establish that ft is I(1) or trend stationary? Not knowing whether ft is I(1) or trend stationary, how can we test that yit, kit and ft are cointegrated?17

4.2 Estimation Methods

There are various ways to estimate factors:

• Univariate (N = 1) filters (e.g. the Hodrick-Prescott filter for trends).

• Multivariate (N > 1) filters such as the Kalman filter used to estimate unobserved-component models, see Canova (2007).

• Multivariate judgemental approaches, e.g. NBER cycle dating based on many series.

• Using a priori weighted averages of the variables.

• Deriving estimates from a model, e.g. Beveridge-Nelson decompositions, which treat the unobserved variable as the long-horizon forecast.

• Principal component based methods.

The relative attractiveness of these methods depends on the number of observed series, N, and the number of unobserved factors, r. The method emphasised here is Principal Components (PC), which can be appropriate for large N and small r. Unobserved component models for small N tend to put more parametric structure on the factors whilst PCs do not. The sizes of N and T are crucial. There are some methods that work for small N, other methods that work for large N, but no obvious methods for the medium-sized N that we have in practice.

Factor models have a long history. In the early days, it was not clear whether the errors in variables model (observed data generated by unobserved factors) or the errors in equation model was appropriate, and both models were used. From the late 1940s the errors in equations model came to dominate. The basic approach to measuring unobserved variables by the PCs of a data matrix was developed by Hotelling (1933). Stone (1947) used this method to show that most of the variation in a large number of financial accounts series could be accounted for by three factors, which could be interpreted as trend, cycle and rate of change of the cycle. Factor analysis was extensively developed in psychometrics and played relatively little role in the development of econometrics, which focussed on the errors in equations model. There are some exceptions, such as factor interpretations of Friedman’s permanent income.

More recently, however, there has been an explosion of papers on factor models. The original statistical theory was developed for cases where one dimension, say N, was fixed and the other, say T, went to infinity. It is only recently that the theory for large panels, where both dimensions can go to infinity, has been developed, e.g. Bai (2003).

17 In the case of testing for panel unit roots, the cross-sectionally augmented Dickey-Fuller (CADF) test is a joint test of a unit root and a stationary ft.

4.3 Calculating Principal Components

Static models Suppose that we have a T × N data matrix, X, with element xit for units i = 1, ..., N and periods t = 1, ..., T. The direction in which you take the factors could also be reversed, i.e. treat X as an N × T matrix. We assume that the T × N data matrix X is generated by a smaller set of r unobserved factors stacked in the T × r matrix F. In matrix notation,

$$
X = F\Lambda + E, \qquad (87)
$$

where Λ is an r × N matrix of factor loadings and E is a T × N matrix of idiosyncratic components. Units can differ in the weight that is given to each of the factors. Strictly, factor analysis involves making some distributional assumptions about eit and applying ML to estimate the factor loadings, but we use a different approach and estimate the factors as the PCs of the data matrix.

The PCs of X are the linear combinations of X that have maximal variance and are orthogonal to (uncorrelated with) each other. Often the X matrix is first standardised (subtracting the mean and dividing by the standard deviation) to remove the effect of units of measurement on the variance. X′X (scaled by 1/T) is then the correlation matrix. To obtain the first PC we construct a T × 1 vector f1 = Xa1 such that f1′f1 = a1′X′Xa1 is maximised. We need some normalisation, so we use a1′a1 = 1.18 The problem is to choose a1 to maximise the variance of f1 subject to this constraint. The Lagrangian is

$$
L = a_1' X'X a_1 - \phi_1 (a_1' a_1 - 1),
$$

$$
\frac{\partial L}{\partial a_1} = 2X'X a_1 - 2\phi_1 a_1 = 0,
$$

$$
X'X a_1 = \phi_1 a_1,
$$

so a1 is the first eigenvector of X′X (the one corresponding to the largest eigenvalue, φ1), or the first eigenvector of the correlation matrix of X if the data are standardised. This gives us the weights we need for the first PC.

The second PC, f2 = Xa2, is the linear combination which has the second largest variance, subject to being uncorrelated with f1, which requires a2′a1 = 0; so a2 is the second eigenvector. If X is of full rank, there are N distinct eigenvalues and associated eigenvectors, and the number of PCs is N.

We can stack the results:

$$
X'X A = A\Phi,
$$

18We need to impose normalizations on the factors and factor loadings to pin down therotational indeterminacy. This is due to the fact that FΛ = FQQ−1Λ for any r× r full-rankmatrix Q. Because an arbitrary r × r matrix has r2 degrees of freedom, we need to imposeat least r2 restrictions (order condition) to remove the indeterminacy.


where A is the matrix of eigenvectors and Φ = diag (φ1, ..., φN ) is the diagonal matrix of eigenvalues. We can also write this as

$$
A'X'XA = \Phi \quad \text{or} \quad F'F = \Phi.
$$

The eigenvalues can be used to calculate the proportion of the variation in X that each principal component explains: $\phi_i / \sum_{i=1}^{N} \phi_i$. If the data are standardised, then $\sum_{i=1}^{N} \phi_i = N$ is the total variance. Forming the PCs is a mathematical operation replacing the T × N matrix X by the T × N matrix F.

We define the PCs as F = XA, but we can also write X = FA′, defining X in terms of the PCs.19 Usually, we want to reduce the number of PCs. To reduce the dimensionality, we can write:

$$
X = F_1 A_1' + F_2 A_2', \qquad (88)
$$

where the T × r matrix F1 contains the r < N largest PCs, and the r × N matrix A1′ contains the first r eigenvectors corresponding to the largest eigenvalues. We treat F1 = (f1t, ..., frt) as the common factors and F2A2′ as the idiosyncratic factors corresponding to the eit in (87). While it is an abuse of this notation, we usually write F1 as F and F2A2′ as E.
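The eigenvalue calculation above can be reproduced with `numpy.linalg.eigh`. The sketch below is illustrative only: the function name and the one-factor simulated data are assumptions. Note that the code works with X′X itself rather than X′X/T, so for standardised data the eigenvalues sum to NT instead of N; the variance shares are unaffected by this scaling.

```python
import numpy as np

def principal_components(X, standardise=True):
    """Return standardised data, PCs F = XA, eigenvectors A, eigenvalues and variance shares."""
    X = np.asarray(X, dtype=float)
    if standardise:
        X = (X - X.mean(axis=0)) / X.std(axis=0)
    eigval, eigvec = np.linalg.eigh(X.T @ X)    # eigh returns ascending eigenvalues
    order = np.argsort(eigval)[::-1]            # reorder from largest to smallest
    Phi = eigval[order]
    A = eigvec[:, order]
    F = X @ A
    share = Phi / Phi.sum()
    return X, F, A, Phi, share

# One common factor plus idiosyncratic noise (illustrative design)
rng = np.random.default_rng(0)
T, N = 200, 10
f = rng.normal(size=(T, 1))
loadings = rng.normal(1.0, 0.2, size=(1, N))
X_raw = f @ loadings + 0.5 * rng.normal(size=(T, N))

Xs, F, A, Phi, share = principal_components(X_raw)
r = 1
common = F[:, :r] @ A[:, :r].T                  # F_1 A_1' in (88)
idiosyncratic = F[:, r:] @ A[:, r:].T           # F_2 A_2' in (88)
print("share explained by first PC:", share[0])
```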

Dynamic models We write the factor model (87) in time series form:

$$
x_t = \Lambda f_t + e_t, \qquad (89)
$$

where xt is an N × 1 vector, Λ an N × r matrix of loadings, ft an r × 1 vector of factors and et an N × 1 vector of errors. In using PCs to calculate the factors we have ignored all the information in the lagged values of xt. It may be that some lagged elements xit−j contain information that helps predict xit, e.g. because factors influence the variables at different times. Standard PCs, which just extract the information from the covariance matrix, are often called static factor models, because they ignore the dynamic information, and the idiosyncratic component, et, may be serially correlated. There are also dynamic factor models which extract the PCs of the long-run covariance matrix or spectral density matrix, see Forni et al. (2000, 2003, 2005). The spectral density matrix is estimated using some weight function, like the Bartlett or Parzen windows, with some truncation lag.

The dynamic factor model gives different factors, say

$$
x_t = \Lambda^{*} f^{*}_t + e^{*}_t, \qquad (90)
$$

where $f^{*}_t$ is an r* × 1 vector. In practice, we can approximate the dynamic factors by using lagged values of the static factors,

$$
x_t = \Lambda(L) f_t + e^{s}_t, \qquad (91)
$$

19Note that AA′ = IN and A′ = A−1.


where Λ (L) is a pth order lag polynomial. This may be less efficient in the sense that r < rp: one can get the same degree of fit with fewer parameters using the dynamic factors than using current and lagged static factors. Determining whether the dynamics in xt come from an autoregression in xt, dynamics in ft or serial correlation in et raises quite difficult issues of identification.

Dynamic PCs are two-sided filters, taking account of future as well as past information, and are thus less suitable for forecasting purposes. This problem does not arise when using current and lagged static factors. Forni et al. (2003) discuss one-sided dynamic PCs which can be used for forecasting. Forecasting also includes ‘nowcasting’, where one has a series, say quarterly GDP, produced with a lag but various monthly series produced very quickly, such as industrial production and retail sales. PCs of the rapidly produced series are then used to provide a ‘flash’ estimate of current GDP.

4.4 Issues in Using PCs

How to choose r? How many factors to use depends on statistical criteria, the purpose of the exercise and the context. Traditional rules of thumb for determining r included choosing the PCs that correspond to eigenvalues that are above the average value (equivalently, greater than unity for standardised data), or graphing the eigenvalues and seeing where they drop off sharply. There are also various tests and information criteria, including information criteria designed for the case where both N and T are large.

Kapetanios (2004) suggests using the largest eigenvalue to choose the number of factors. Onatski (2009) suggests another function of the largest eigenvalues of the spectral density matrix at a specified frequency. The statistical properties of the various tests, information criteria and other methods of choosing r for economic data are still a matter of research.

We focus on two types of methods: one based on information criteria and the other based on the distribution of eigenvalues. Bai and Ng (2002) proposed a model selection procedure which can consistently estimate the number of factors when N and T converge to ∞. Let $\hat\lambda_i^k$ and $\hat F_t^k$ be the PC estimators assuming that the number of factors is k. We may treat the sum of squared residuals (divided by NT) as a function of k:

$$ V(k) = \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\left(x_{it} - \hat\lambda_i^{k\prime}\hat F_t^{k}\right)^2 $$

Define the following loss function:

$$ IC(k) = \ln(V(k)) + k\,g(N,T) $$

where the penalty function $g(N,T)$ satisfies two conditions: (i) $g(N,T) \to 0$, and (ii) $\min\{N^{1/2}, T^{1/2}\}\,g(N,T) \to \infty$, as $N, T \to \infty$. Define the estimator for the number of factors as

$$ \hat k_{IC} = \arg\min_{0<k<k_{\max}} IC(k), $$


where $k_{\max}$ is the upper bound. Then consistency can be established under standard conditions:

$$ \hat k_{IC} \to r \quad \text{as } N, T \to \infty. $$
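As a rough illustration of the selection rule above, the following Python sketch computes V(k) from principal components and minimises IC(k) on simulated data. The simulated design (N = 80, T = 100, three factors), the function names `pc_ssr` and `select_k`, the search grid k = 1, ..., kmax, and the particular penalty $g(N,T) = \frac{N+T}{NT}\ln\frac{NT}{N+T}$ (one of the Bai and Ng (2002) penalties) are assumptions made for this example only.

```python
import numpy as np

def pc_ssr(X, k):
    """V(k): average squared residual after removing k PC factors from the T x N panel X."""
    T, N = X.shape
    # Top-k eigenvectors of XX' (eigh returns eigenvalues in ascending order)
    _, eigvec = np.linalg.eigh(X @ X.T)
    F = np.sqrt(T) * eigvec[:, ::-1][:, :k]   # T x k, normalised so that F'F/T = I_k
    Lam = X.T @ F / T                         # N x k loadings by least squares
    return np.mean((X - F @ Lam.T) ** 2)

def select_k(X, kmax):
    """Return the k in 1..kmax that minimises IC(k) = ln V(k) + k g(N,T)."""
    T, N = X.shape
    g = (N + T) / (N * T) * np.log(N * T / (N + T))   # assumed penalty function
    ic = np.array([np.log(pc_ssr(X, k)) + k * g for k in range(1, kmax + 1)])
    return int(np.argmin(ic)) + 1, ic

# Simulated panel with r = 3 strong factors (illustrative design only)
rng = np.random.default_rng(0)
T, N, r = 100, 80, 3
F0 = rng.standard_normal((T, r))
L0 = rng.standard_normal((N, r))
X = F0 @ L0.T + rng.standard_normal((T, N))

k_hat, ic = select_k(X, kmax=8)
print("selected number of factors:", k_hat)
```

In practice the data would usually be demeaned and standardised before extracting the PCs; that step is omitted here because the simulated series already have mean zero.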

Bai and Ng (2002) propose the following information criteria:

$$ IC_1(k) = \ln(V(k)) + k\left(\frac{N+T}{NT}\right)\ln\left(\frac{NT}{N+T}\right) $$

$$ IC_2(k) = \ln(V(k)) + k\left(\frac{N+T}{NT}\right)\ln\left(C_{NT}^2\right) $$

$$ IC_3(k) = \ln(V(k)) + k\left(\frac{\ln\left(C_{NT}^2\right)}{C_{NT}^2}\right) $$

where $C_{NT}^2 = \min\{N, T\}$.

Monte Carlo simulations show that all criteria perform well when both N and T are large. For the cases where either N or T is small, and if errors are uncorrelated across units and time, the preferred criteria tend to be $IC_1$ and $IC_2$. Still, they may not work well when N or T is small, leading to too many factors being estimated, e.g. always choosing the maximum number allowed.

Some desirable features of the above method are worth mentioning. Firstly, consistency is established without any restriction on the relation between N and T. Secondly, the results hold under heteroskedasticity in both the time and the cross-section dimensions, as well as under weak serial and cross-section correlation.

Kapetanios (2004) suggests using the largest eigenvalue to choose the number of factors. Based on random matrix theory, Onatski (2009) established a test of $k_0$ factors against the alternative that the number of factors is between $k_0$ and $k_1$. The test statistic is given by

$$ R = \max_{k_0 < k \le k_1} \frac{\gamma_k - \gamma_{k+1}}{\gamma_{k+1} - \gamma_{k+2}} $$

where $\gamma_k$ is the k-th largest eigenvalue of the sample spectral density of the data at a given frequency. For macroeconomic data, the frequency could be chosen at the business cycle frequency. The basic idea is that under the null of $k_0$ factors, the first $k_0$ eigenvalues will be unbounded, while the remaining eigenvalues are all bounded. As a result, R will be bounded under the null, while it explodes under the alternative, making R asymptotically pivotal. The limiting distribution of R, derived under the assumption that T grows sufficiently faster than N, turns out to be a function of the Tracy-Widom distribution. Ahn and Horenstein (2013) proposed two estimators, the Eigenvalue Ratio (ER) estimator and the Growth Ratio (GR) estimator, based on simple calculation of eigenvalues. The ER estimator is defined as the value of k that maximises the ratio of two adjacent eigenvalues arranged in decreasing order. The intuition is similar to Onatski (2009, 2010).

The statistical properties of the various tests, information criteria and other methods of choosing r for economic data are still a matter of research. The choice of r will depend not just on statistical criteria but also on the purpose of the exercise and the context.

YC: update and how to extend to the MD case?
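The eigenvalue-ratio idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `er_estimator`, the use of the eigenvalues of $XX'/(NT)$, the simulated design and the choice kmax = 8 are all made up for the example, and the refinement that allows for zero factors is left out.

```python
import numpy as np

def er_estimator(X, kmax):
    """Eigenvalue-ratio estimate of the number of factors for a T x N panel X."""
    T, N = X.shape
    mu = np.linalg.eigvalsh(X @ X.T / (N * T))[::-1]   # eigenvalues, largest first
    ratios = mu[:kmax] / mu[1:kmax + 1]                # mu_k / mu_{k+1} for k = 1..kmax
    return int(np.argmax(ratios)) + 1, ratios

# Simulated panel with r = 3 strong factors (illustrative design only)
rng = np.random.default_rng(1)
T, N, r = 100, 80, 3
X = rng.standard_normal((T, r)) @ rng.standard_normal((r, N)) \
    + rng.standard_normal((T, N))

k_er, ratios = er_estimator(X, kmax=8)
print("ER estimate:", k_er)
```

The ratio is largest at the point where the eigenvalues drop from the diverging factor eigenvalues to the bounded idiosyncratic ones, which is the numerical counterpart of looking for the sharp drop in a scree plot.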

Determining the number of dynamic factors. The dynamic factor model considers the case in which lags of factors also directly affect $x_{it}$. The methods for static factor models can be extended to estimate the number of dynamic factors. Consider

$$ x_{it} = \lambda_{i0}' f_t + \lambda_{i1}' f_{t-1} + \dots + \lambda_{is}' f_{t-s} + e_{it} = \lambda_i(L)' f_t + e_{it} \tag{92} $$

where $f_t$ is $q \times 1$ and $\lambda_i(L) = \lambda_{i0} + \lambda_{i1}L + \dots + \lambda_{is}L^s$. While Forni et al. (2000, 2004, 2005) consider the case with $s \to \infty$, we focus on the case with a fixed s. Model (92) can be represented as a static factor model with $r = q(s+1)$ static factors:

$$ x_{it} = \lambda_i' F_t + e_{it}, \qquad \lambda_i = \begin{pmatrix} \lambda_{i0} \\ \lambda_{i1} \\ \vdots \\ \lambda_{is} \end{pmatrix}, \qquad F_t = \begin{pmatrix} f_t \\ f_{t-1} \\ \vdots \\ f_{t-s} \end{pmatrix} $$

We refer to $f_t$ as the dynamic factors and $F_t$ as the static factors. Regarding the dynamic process for $f_t$, we may use a finite-order VAR:

$$ \Phi(L) f_t = \varepsilon_t $$

where $\Phi(L) = I_q - \Phi_1 L - \dots - \Phi_h L^h$. Then, we may form the VAR(k) representation of the static factors, $F_t$, where $k = \max\{h, s+1\}$,

$$ \Phi_F(L) F_t = u_t \quad \text{with } u_t = R\varepsilon_t $$

where $\Phi_F(L) = I_{q(s+1)} - \Phi_{F1}L - \dots - \Phi_{Fk}L^k$, and the $q(s+1) \times q$ matrix R is given by $R = [I_q, 0, \dots, 0]'$.

The spectrum of the static factors has rank q instead of $r = q(s+1)$. Given that $\Phi_F(L)F_t = R\varepsilon_t$, the spectrum of F at frequency ω is

$$ S_F(\omega) = \Phi_F\left(e^{-i\omega}\right)^{-1} R\, S_\varepsilon(\omega)\, R'\, \Phi_F\left(e^{i\omega}\right)^{-1}, $$

whose rank is q if $S_\varepsilon(\omega)$ has rank q for $|\omega| \le \pi$. This implies that $S_F(\omega)$ has only q nonzero eigenvalues. Hallin and Liska (2007) estimate the rank of this matrix to determine the number of dynamic factors. Onatski (2009) also considers estimating q using the sample estimates of $S_F(\omega)$.
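The rank deficiency of the innovations $u_t = R\varepsilon_t$ can be checked numerically in a toy case. The sketch below assumes q = 1 dynamic factor following an AR(1) with coefficient 0.6, s = 1 lag, and uses the true (not estimated) factors; all of these choices are assumptions for illustration. In this design a VAR(1) in $F_t = (f_t, f_{t-1})'$ fits exactly, so it is used in place of the general VAR(k).

```python
import numpy as np

rng = np.random.default_rng(0)
T, phi = 500, 0.6

# One dynamic factor (q = 1) following an AR(1)
eps = rng.standard_normal(T)
f = np.zeros(T)
for t in range(1, T):
    f[t] = phi * f[t - 1] + eps[t]

# Static factors F_t = (f_t, f_{t-1})', so r = q(s+1) = 2
F = np.column_stack([f[1:], f[:-1]])

# VAR(1) for F_t by equation-by-equation OLS
Y, Z = F[1:], F[:-1]
coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
U = Y - Z @ coef                       # residuals u_t
Sigma_u = U.T @ U / U.shape[0]

eig = np.linalg.eigvalsh(Sigma_u)[::-1]   # eigenvalues, largest first
print("eigenvalues of Sigma_u:", eig)
```

Only one eigenvalue is materially different from zero, matching q = 1 even though there are r = 2 static factors. With estimated factors the small eigenvalues are no longer exactly zero, which is why a formal criterion is needed.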

Alternatively, we may first estimate a static factor model using Bai and Ng (2002) to obtain $\hat F_t$. Next, we may estimate a VAR(p) for $\hat F_t$ to obtain the residuals $\hat u_t$. Let $\hat\Sigma_u = T^{-1}\sum_t \hat u_t \hat u_t'$. Note that the theoretical moment $E(u_t u_t')$ has rank q. We may estimate q using the information about the rank of $\hat\Sigma_u$.

Using a different approach, Stock & Watson (2005) considered richer dynamics in the error terms, and transformed the model such that the residual of the transformed model has a static factor representation with q factors. Bai and Ng (2002)’s IC can then be directly applied to estimate q.

Amengual and Watson (2007) derived the corresponding econometric theory for estimating q. They started from the static factor model,

$$ X_t = \Lambda F_t + e_t, $$

and considered a VAR(p) for $F_t$,

$$ F_t = \sum_{i=1}^{p}\Phi_i F_{t-i} + \varepsilon_t, \qquad \varepsilon_t = G\eta_t, $$

where G is $r \times q$ with full column rank and $\eta_t$ is a sequence of shocks with mean 0 and variance $I_q$. The shock $\eta_t$ is the dynamic factor shock, whose dimension is the number of dynamic factors. Let

$$ Y_t = X_t - \sum_{i=1}^{p}\Lambda\Phi_i F_{t-i} $$

and $\Gamma = \Lambda G$; then $Y_t$ has a static factor representation with q factors,

$$ Y_t = \Gamma\eta_t + e_t. $$

If $Y_t$ were observed, q could be directly estimated using Bai and Ng (2002)’s IC. In practice, $Y_t$ needs to be estimated. Let $\hat Y_t = X_t - \sum_{i=1}^{p}\hat\Lambda\hat\Phi_i\hat F_{t-i}$, where $\hat\Lambda$ and $\hat F_t$ are PC estimators from $X_t$, and $\hat\Phi_i$ is obtained by a VAR(p) regression of $\hat F_t$.

How to choose N? One may have very large amounts of potential data available (e.g. thousands of time series on different economic, social, and demographic variables for different countries) and an issue is how many series one should use in constructing the PCs. More information seems better, so one should include as many as possible, but this may not be the case. Adding variables that are weakly dependent on the common factors will add very little information.

To calculate the PCs the weights on the series have to be estimated, and adding more series adds more parameters to be estimated. This increases the noise due to parameter estimation error. If the series have little information on the factors of interest, they just add noise, worsening the estimation problem. The series may be determined by different factors, increasing the number of factors needed to explain the variance. They may also have outliers or idiosyncratic jumps, and this will introduce a lot of variance which may be picked up by the estimated factors. Many of the disputes in the literature about the relevant number of factors reflect the range of series used to construct the PCs.


If the series are mainly different sorts of price and output measures, two factors may be adequate; but if one adds financial series such as stock prices and interest rates, or foreign variables, more factors may be needed. One may be able to look at the factor loading matrix and see whether it has a block structure, with certain factors loading on certain sets of variables. If this is the case one may want to split the data, using different datasets to estimate different factors. But it may be difficult to determine the structure of the factor loading matrix.

N may be larger than the number of variables, if transformations of the variables (e.g. logarithms, powers, first differences, etc.) are also included. This trade-off between the size of the information set and the errors introduced by estimation may be a particular issue in forecasting, where parsimony tends to produce better forecasting models. Then using more data may not improve forecasts, e.g. Mitchell et al. (2005) and Elliott and Timmermann (2008). Notice that in forecasting we would need to update our estimates of $F_t$, and perhaps r, the number of factors, as our sample size T changes.

How to identify and interpret factors? Interpreting the factors requires just-identifying restrictions. Suppose that we have obtained estimates:

$$ X = F\Lambda + E $$

For any non-singular $r \times r$ matrix Q, the new factors and loadings $(FQ)(Q^{-1}\Lambda)$ are observationally equivalent to $F\Lambda$. The new loadings are $\Lambda^* = Q^{-1}\Lambda$ and the new factors are $F^* = FQ$. The $r^2$ just-identifying restrictions used to calculate PCs are the unit length and orthogonality normalisations which come from treating it as an eigenvalue problem. Thus, the factors are only defined up to a non-singular transformation. A major problem in applications is to interpret the estimated PCs. Often in time series the first PC has roughly equal weights and corresponds to the mean of the series. Looking at the factor loadings and the graphs of the PCs may help interpret them. The choice of Q, the just-identifying restrictions, called ‘rotations’ in psychometrics, is an important part of traditional factor analysis. These are needed to provide some interpretation of the factors.$^{20}$
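The observational equivalence of $(F, \Lambda)$ and $(FQ, Q^{-1}\Lambda)$ is easy to verify numerically. The dimensions and the particular matrix Q in this sketch are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
T, N, r = 50, 10, 2
F = rng.standard_normal((T, r))        # factors, T x r
Lam = rng.standard_normal((r, N))      # loadings, r x N, so the common component is F @ Lam

Q = np.array([[2.0, 1.0],
              [0.5, 3.0]])             # any non-singular r x r matrix

F_star = F @ Q                         # rotated factors F* = FQ
Lam_star = np.linalg.solve(Q, Lam)     # rotated loadings Lam* = Q^{-1} Lam

# The common component is unchanged, so the data cannot distinguish the two
max_diff = np.max(np.abs(F @ Lam - F_star @ Lam_star))
print("largest difference in common component:", max_diff)
```

Since the fitted common component is identical for every such Q, any interpretation of individual factors has to come from the $r^2$ restrictions imposed, not from the fit.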

The same identification issue arises in simple regression. The model

$$ y = X\beta + u $$

is observationally equivalent to the reparameterisation:

$$ y = \left(XQ^{-1}\right)(Q\beta) + u = Z\delta + u $$

For instance, Z could be the PCs, which have the advantage that they are orthogonal and so the estimates of the factor coefficients are invariant to which other factors are included. But there could be other Q matrices. To interpret the regression coefficients we need to choose a parameterisation, $k^2$ restrictions that specify Q. We tend to take the parameterisation for granted in economics, so this is not usually called an identification problem.

For some purposes, e.g. forecasting, one may not need to identify the factors, but for other purposes their interpretation is crucial. It is quite often the case that one estimates the PCs and has no idea what they mean or measure.

20 Rotations in psychometrics are as controversial as just-identifying restrictions in economics, so while many psychologists agree that there are five dimensions to personality, r = 5, how they are described differs widely.

Estimated or imposed weights? Factors are estimated as linear combinations of observed data series. Above it has been assumed that the weights in the linear combination should be estimated to optimise some criterion function, e.g. to maximise variance explained in the case of PCs. However, there are possible a priori weights, that is, one may impose the weights rather than estimate them. Examples are equal weights as in the mean or trade weights as in effective exchange rates. There is a bias-variance trade-off. The imposed weights are almost certainly biased, but have zero variance. The estimated weights may be unbiased but may have large variance because of estimation error. The imposed weights may be better than the estimated weights in the sense of having smaller mean square error (bias squared plus variance). Forecast evaluation of regression models indicates that simple models with imposed coefficients tend to do very well. Measures constructed with imposed weights are usually also much easier to interpret.

The most obvious candidate for imposed weights is to use equal weights, a simple average (perhaps after having standardised the variables to have mean zero and variance one). In many cases the first PC seems to have roughly equal weights and thus behaves like an average or sum.

Alternatively, economic theory may suggest suitable weights. For instance, effective exchange rates for country i (weighted averages of exchange rates with all other countries) use trade weights: exports plus imports of i with j as a share of total exports plus imports of country i. PCs might give a lot of weight to a set of countries which have very volatile exchange rates even though country i does not trade with them. Measures of core inflation give zero weights to the inflation rates of certain volatile components of total expenditure, while a PC might give a high weight to those volatile components because they account for a lot of the variance. Monte Carlo evaluation of estimators that allow for CSD indicates that methods that use imposed weights, like the CCE estimator, often do much better than estimators that rely on estimating the number of PCs and their weights. See the comparison between the CCE estimator by Pesaran and the IPC estimator by Bai; see also the studies by Westerlund and coauthors.
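The remark that the first PC often behaves like a simple average can be illustrated on simulated data. The single-factor design, the loadings drawn from a uniform distribution on [0.5, 1.5] and the sample sizes are assumptions chosen for the example; with loadings of mixed sign the result would look different.

```python
import numpy as np

rng = np.random.default_rng(2)
T, N = 200, 30
f = rng.standard_normal(T)                       # one common factor
lam = rng.uniform(0.5, 1.5, N)                   # positive loadings (assumed)
X = np.outer(f, lam) + rng.standard_normal((T, N))

# Standardise each series to mean zero and variance one
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Estimated weights: eigenvector of the correlation matrix with the largest eigenvalue
_, eigvec = np.linalg.eigh(Z.T @ Z / T)
w = eigvec[:, -1]
w = w * np.sign(w.sum())                         # fix the arbitrary sign
pc1 = Z @ w

# Imposed weights: the equally weighted cross-section average
avg = Z.mean(axis=1)

corr = np.corrcoef(pc1, avg)[0, 1]
print("correlation between first PC and simple average:", round(corr, 3))
```

In this design the estimated weights are all of the same sign and the two measures are almost collinear, so little is lost by imposing equal weights.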

Explanation using PCs. Suppose the model of interest is

$$ y = X\beta + u \tag{93} $$

where β is an N × 1 vector of parameters and we wish to reduce the dimension of X. This could be because there are a very large number of candidate variables or because there is multicollinearity. Replacing X by all the PCs, F = XA, is just a reparameterisation:

$$ y = XAA'\beta + u = F\delta + u \tag{94} $$

However, we could reduce the number of PCs by writing it as:

$$ y = F_1\delta_1 + F_2\delta_2 + u \tag{95} $$

where $F_1$ are the r < N PCs (corresponding to the r largest eigenvalues). Setting $\delta_2 = A_2'\beta = 0$ gives

$$ y = F_1\delta_1 + v \tag{96} $$

In this case the original coefficients could be recovered as $\beta = A_1\delta_1$. The hypothesis $\delta_2 = 0$ is testable (as long as N < T). This has been suggested as a way of dealing with multicollinearity, or choosing a set of instruments.

There are some problems. First, it is quite possible that a PC which has a small eigenvalue and explains a very small part of the total variation of X may explain a large part of the variation of y. The PCs are chosen on the basis of their ability to explain X not y, but the regression is designed to explain y. Secondly, unless $F_1$ can be given an interpretation, e.g. as an unobserved variable, it is not clear whether the hypothesis $\delta_2 = A_2'\beta = 0$ has prior plausibility or what the interpretation of the estimated regression is. Thirdly, estimation error is introduced by using $F_1$, since these are generated regressors, with implications for the estimation of the standard errors of $\delta_1$. As a result, until recently with the factor-augmented VARs and ECMs discussed below, economists have tended not to use PCs as explanatory variables in regressions. Instead multicollinearity tended to be dealt with through the use of theoretical information, either explicitly through Bayesian estimators or implicitly by a priori weights, e.g. through the construction of aggregates. Notice that we could include certain elements of X directly and have others summarised in factors.
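The reparameterisation in (94) and the restricted regression in (96) can be sketched as follows. The simulated regressors, the true coefficient vector and the choice r = 2 are assumptions for illustration; A is taken to be the matrix of eigenvectors of $X'X$, ordered by decreasing eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(3)
T, N, r = 200, 6, 2

# Regressors driven by two common components plus noise (illustrative design)
X = rng.standard_normal((T, r)) @ rng.standard_normal((r, N)) \
    + 0.5 * rng.standard_normal((T, N))
beta = np.arange(1.0, N + 1)
y = X @ beta + rng.standard_normal(T)

# A: orthonormal eigenvectors of X'X, largest eigenvalue first; F = XA are the PCs
_, A = np.linalg.eigh(X.T @ X)
A = A[:, ::-1]
F = X @ A

# Using all PCs is just a reparameterisation of OLS: beta_hat = A delta_hat
delta_full, *_ = np.linalg.lstsq(F, y, rcond=None)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Keeping only the first r PCs imposes delta_2 = 0; beta is recovered as A_1 delta_1
delta_1, *_ = np.linalg.lstsq(F[:, :r], y, rcond=None)
beta_pc = A[:, :r] @ delta_1

print("OLS:", np.round(beta_ols, 2))
print("PCR:", np.round(beta_pc, 2))
```

Because the PCs are orthogonal, the estimate of $\delta_1$ is the same whether or not $F_2$ is included, while the implied $\beta = A_1\delta_1$ generally differs from OLS when $\delta_2 = 0$ is false, which is the first problem noted above.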

5 The Factor-based Models for CSD

CSD has attracted considerable attention and a large number of estimators have been suggested. Currently, the market leader, according to Monte Carlo studies, appears to be the CCE estimator. It is common to transform the data to make them stationary, by differencing, before calculating PCs. If one tries to measure a stationary unobservable, e.g. a global trade cycle, this is clearly sensible. It is equally not sensible if one is trying to measure a non-stationary unobservable, e.g. a global trend. Even in the stationary case it is important that transformations beyond differencing be considered: stationary transformations of levels variables, such as interest rate spreads, may also contain valuable information. We describe the main techniques for dealing with factor-based CSD.


SURE. Suppose that the model is heterogeneous:

$$ y_{it} = \delta_i' z_t + \beta_i' x_{it} + \gamma_i' f_t + u_{it}, \quad i = 1, \dots, N, \ t = 1, \dots, T \tag{97} $$

where $y_{it}$ is a scalar dependent variable, $z_t$ is a $k_z \times 1$ vector of variables that do not differ over groups (intercept and trend), $x_{it}$ is a $k_x \times 1$ vector of observed regressors which differ over groups, $f_t$ is an $r \times 1$ vector of unobserved factors and $u_{it}$ is an unobserved disturbance with $E(u_{it}) = 0$, $E(u_{it}^2) = \sigma_i^2$, which is independently distributed across i and t. Estimating

$$ y_{it} = \delta_i' z_t + b_i' x_{it} + v_{it}, \quad i = 1, \dots, N, \ t = 1, \dots, T \tag{98} $$

will give inconsistent estimates of $\beta_i$ if $f_t$ is correlated with $x_{it}$, and inefficient estimates even if $f_t$ is not correlated with $x_{it}$. In the latter case, if N is small the equations can be estimated by SURE, but if N is large relative to T, SURE is not feasible, because the estimated covariance matrix cannot be inverted. Robertson and Symons (2007) suggest using the factor structure to obtain an invertible covariance matrix. Their estimator is quite complicated and will not be appropriate if the factors are correlated with the regressors.
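The reason SURE breaks down when N is large relative to T is that the N × N residual covariance matrix is estimated from only T observations, so its rank cannot exceed T. A minimal numerical check, with N = 30, T = 20 and standard normal residuals as purely illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
N, T = 30, 20                                # more units than time periods

resid = rng.standard_normal((T, N))          # stand-in for first-stage residuals, T x N
Sigma_hat = resid.T @ resid / T              # N x N estimated cross-equation covariance

rank = np.linalg.matrix_rank(Sigma_hat)
print("N =", N, "T =", T, "rank of Sigma_hat =", rank)
```

The estimated matrix is singular whenever N > T, so the GLS weighting matrix required by SURE does not exist without further structure.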

Time effects/demeaning. If $\beta_i = \beta$, and there is a single factor which influences each group in the same way, i.e. $\gamma_i = \gamma$, then including time effects, a dummy variable for each period, i.e. the two-way fixed effects estimator:

$$ y_{it} = \theta_t + \alpha_i + \beta' x_{it} + u_{it} $$

will estimate $f_t'\gamma = \theta_t$. This can be implemented by using time-demeaned data, $\tilde y_{it} = y_{it} - \bar y_t$, where $\bar y_t = N^{-1}\sum_{i=1}^{N} y_{it}$, and similarly for $x_{it}$. Unlike SURE, the factor does not have to be distributed independently of $x_{it}$ for this to work.

It is sometimes suggested (e.g. for unit root tests) that demeaned data be used even in the case of heterogeneous slopes. Suppose we have heterogeneous random parameters:

$$ y_{it} = \theta_t + \beta_i' x_{it} + u_{it} \quad \text{with } \beta_i = \beta + \eta_i $$

Averaging over groups for each period we get:

$$ \bar y_t = \theta_t + \beta'\bar x_t + \bar u_t + N^{-1}\sum_{i=1}^{N}\eta_i' x_{it} $$

Noting that

$$ \beta_i' x_{it} - \beta'\bar x_t = \beta_i'\tilde x_{it} + \eta_i'\bar x_t, $$

demeaning ($\tilde y_{it} = y_{it} - \bar y_t$) gives:

$$ \tilde y_{it} = \beta_i'\tilde x_{it} + \tilde u_{it} + e_{it} \quad \text{with } e_{it} = \eta_i'\bar x_t - N^{-1}\sum_{j=1}^{N}\eta_j' x_{jt} $$


This removes the common factor $\theta_t$, but adds new terms to the error reflecting the effect of slope heterogeneity. If $\eta_i$ is independent of the regressors, $e_{it}$ will have expected value zero and be independent of the regressors, so one can obtain large-T consistent estimates of $\beta_i$, but the variances will be larger. One can compare the fit of the panels using the original data $y_{it}$ and the demeaned data $\tilde y_{it}$ to see which effect dominates, i.e. whether the reduction in variance from eliminating $\theta_t$ is greater or less than the increase in variance from adding $e_{it}$.

This model assumes that the factor has identical effects on each unit, implying that it imposes the same CSD across all cross-section units. Rather than demeaning, it is usually better to include the means directly.
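A small simulation shows what time-demeaning achieves in the homogeneous case. The design is an assumption for illustration: a common slope β = 1, a common time effect $\theta_t$ that also drives the regressor, no individual effects $\alpha_i$, and N = T = 50.

```python
import numpy as np

rng = np.random.default_rng(4)
N, T, beta = 50, 50, 1.0

theta = 2.0 * rng.standard_normal(T)                       # common time effect
x = theta[None, :] + rng.standard_normal((N, T))           # regressor correlated with theta_t
y = theta[None, :] + beta * x + rng.standard_normal((N, T))

def ols_slope(x, y):
    """Pooled OLS slope of y on x with an intercept."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / (xc ** 2).sum()

# Ignoring the time effect: theta_t is an omitted variable correlated with x
b_naive = ols_slope(x, y)

# Subtracting the cross-section mean in each period removes theta_t
x_dm = x - x.mean(axis=0, keepdims=True)
y_dm = y - y.mean(axis=0, keepdims=True)
b_dm = ols_slope(x_dm, y_dm)

print("pooled OLS:", round(b_naive, 3), " time-demeaned:", round(b_dm, 3))
```

The demeaned estimate is close to the true slope even though the factor is correlated with the regressor, whereas the pooled estimate is biased upwards. This relies on the factor having the same effect on every unit.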

5.1 The Common Correlated Effects (CCE) Estimator

If one wishes to treat factors as nuisance parameters and remove the effect of CSD, a simple and effective procedure, for large N, T, is the CCE estimator proposed by Pesaran (2006).

Let $y_{it}$ be the observation on the i-th cross-section unit at time t for i = 1, ..., N and t = 1, ..., T. Consider the linear heterogeneous panel data model

$$ y_{it} = \alpha_i' d_t + \beta_i' x_{it} + e_{it} \tag{99} $$

where $d_t$ is an n × 1 vector of observed common effects (such as intercepts or seasonal dummies), $x_{it}$ is a k × 1 vector of observed individual-specific regressors on the i-th cross-section unit at time t, and the errors have the multifactor structure:

$$ e_{it} = \gamma_i' f_t + \varepsilon_{it} \tag{100} $$

where $f_t$ is the m × 1 vector of unobserved common effects and $\varepsilon_{it}$ are the idiosyncratic errors assumed to be distributed independently of $(d_t, x_{it})$. In general, however, the unobserved factors $f_t$ could be correlated with $(d_t, x_{it})$. To allow for such a possibility, we adopt the fairly general model for the individual-specific regressors,

$$ x_{it} = A_i' d_t + \Gamma_i' f_t + v_{it} \tag{101} $$

where $A_i$ and $\Gamma_i$ are n × k and m × k factor loading matrices, and $v_{it}$ are the specific components of $x_{it}$, distributed independently of the common effects and across i but assumed to follow general covariance stationary processes.$^{21}$ In what follows, however, we focus on the case where $d_t$ and $f_t$ are covariance stationary.

21 This setup only allows for correlation between factors and regressors, and not for correlation between regressors and factor loadings. This is one of the drawbacks of CCE; see Bai (2009).


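Combining (99)-(101), we have the system of equations$^{22}$:

$$ \underset{(k+1)\times 1}{z_{it}} = \begin{pmatrix} y_{it} \\ x_{it} \end{pmatrix} = \underset{(k+1)\times n}{B_i'}\ \underset{n\times 1}{d_t} + \underset{(k+1)\times m}{C_i'}\ \underset{m\times 1}{f_t} + \underset{(k+1)\times 1}{u_{it}} \tag{102} $$

where

$$ u_{it} = \begin{pmatrix} \varepsilon_{it} + \beta_i' v_{it} \\ v_{it} \end{pmatrix} \tag{103} $$

$$ B_i = (\alpha_i, A_i)\begin{pmatrix} 1 & 0 \\ \beta_i & I_k \end{pmatrix}, \qquad C_i = (\gamma_i, \Gamma_i)\begin{pmatrix} 1 & 0 \\ \beta_i & I_k \end{pmatrix}, \tag{104} $$

$I_k$ is an identity matrix of order k, and the rank of $C_i$ is determined by the rank of the m × (k + 1) matrix of the unobserved factor loadings

$$ \tilde\Gamma_i = (\gamma_i, \Gamma_i) \tag{105} $$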

Throughout we shall assume that $B_i$ and $C_i$ or their expectations are bounded.

The above setup is sufficiently general and renders a variety of panel data models as special cases. (i) The familiar fixed or random effects models correspond to the case where $d_t = 1$, $\beta_i = \beta$, and $\gamma_i = 0$ for all i. (ii) The time-varying effects models of Kiefer (1980), Lee (1991), and Ahn, Lee, and Schmidt (2001) allow for error cross-section dependence through a single unobserved factor, but also require the individual-specific regressors to be cross-sectionally independent, namely $A_i = 0$ and $\Gamma_i = 0$. However, the individual-specific regressors are likely to be cross-sectionally dependent and a formulation such as (101) will be more widely applicable. (iii) The random coefficient model of Swamy (1970) allows for slope heterogeneity but assumes $\gamma_i = 0$ for all i. (iv) In the special case where $\gamma_i = \gamma$, the multifactor structure reduces to $\theta_t = \gamma' f_t$, and (99) and (100) become the familiar panel data model with time dummies. In this case the estimation of β can be achieved using standard panel data estimators based on cross-sectionally demeaned observations. (v) The large N and T factor models analyzed by Stock and Watson (2002) and Bai and Ng (2002) focus on consistent estimation of $f_t$ (including its dimension m) and the factor loadings $\gamma_i$, and are not concerned with the estimation of the “structural” parameters $\beta_i$. Here, $f_t$ are treated as nuisance “parameters” and no assumption is made about m, except that it is fixed as N and T tend to infinity.

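Consider the cross-section averages of (102) using the weights $w_j$:

$$ \bar z_{wt} = \bar B_w' d_t + \bar C_w' f_t + \bar u_{wt} \tag{106} $$

where $\bar z_{wt} = \sum_{j=1}^{N} w_j z_{jt}$ and

$$ \bar B_w = \sum_{i=1}^{N} w_i B_i, \qquad \bar C_w = \sum_{i=1}^{N} w_i C_i, \qquad \bar u_{wt} = \sum_{i=1}^{N} w_i u_{it}. \tag{107} $$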

22 CZ: Two ways to obtain this equation: first, substitute (101) into (99); second, rewrite (99) as $y_{it} - \beta_i' x_{it} = \alpha_i' d_t + e_{it}$.


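Suppose that the following rank condition holds:

$$ \mathrm{Rank}\left(\bar C_w\right) = m \le k+1 \quad \text{for all } N \tag{108} $$

Then, we have

$$ f_t = \left(\bar C_w\bar C_w'\right)^{-1}\bar C_w\left(\bar z_{wt} - \bar B_w' d_t - \bar u_{wt}\right) \tag{109} $$

But, as N → ∞, it is easily seen that

$$ \bar u_{wt} \to 0 \quad \text{for each } t \tag{110} $$

and

$$ \bar C_w \to_p C = \tilde\Gamma\begin{pmatrix} 1 & 0 \\ \beta & I_k \end{pmatrix} \tag{111} $$

where

$$ \tilde\Gamma = \left(E(\gamma_i), E(\Gamma_i)\right) = (\gamma, \Gamma) \tag{112} $$

Therefore, assuming that $\mathrm{Rank}(\tilde\Gamma) = m$, we obtain

$$ f_t - (CC')^{-1}C\left(\bar z_{wt} - \bar B_w' d_t\right) \to_p 0 $$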

This suggests using $\bar h_{wt} = (d_t', \bar z_{wt}')'$ as observable proxies for $f_t$. Although consistent estimation of $f_t$ using the above results still requires knowledge of the underlying parameters, the individual slope coefficients of interest, $\beta_i$, and their mean β can be consistently estimated by augmenting the OLS or pooled regressions of $y_{it}$ on $x_{it}$ with $d_t$ and the cross-section averages $\bar z_{wt}$. We shall refer to such estimators as the common correlated effects (CCE) estimators. As we shall see later, the basic idea of augmenting the regressions with cross-section averages continues to work even if the rank condition (108) is not satisfied.

Consider the panel data model with unobserved common factors:

$$ y_{it} = \delta_i' d_t + \beta_i' x_{it} + \varepsilon_{it} \quad \text{with } \varepsilon_{it} = \gamma_i' f_t + u_{it} \tag{113} $$

where $y_{it}$ is a scalar dependent variable, $d_t$ is a $k_d \times 1$ vector of variables that do not differ over units, e.g. intercept and trend, $x_{it}$ is a $k_x \times 1$ vector of observed regressors which differ over units, $f_t$ is an r × 1 vector of unobserved factors, which may influence each unit differently and which may be correlated with $x_{it}$, and $u_{it}$ is an unobserved disturbance with $E(u_{it}) = 0$ and $E(u_{it}^2) = \sigma_i^2$, which is independently distributed across i and t. If $x_{it}$ is correlated with $f_t$ and $\gamma_i$, then not allowing for CSD by omitting $f_t$ causes the estimates of $\beta_i$ to be biased and inconsistent.

Pesaran suggests including the means of $y_{it}$ and $x_{it}$ as additional regressors, to remove the effect of the factors. The CCE procedure can handle multiple factors which are I(0) or I(1), which can be correlated with the regressors, and handles serial correlation in the errors. Consistency holds for any linear combination of the dependent variable and the regressors, not just the arithmetic mean, provided that the weights $w_i$ satisfy:

$$ \text{(i)}:\ w_i = O\left(\frac{1}{N}\right); \qquad \text{(ii)}:\ \sum_{i=1}^{N}|w_i| < K; \qquad \text{(iii)}:\ \sum_{i=1}^{N} w_i\gamma_i \ne 0 $$


These clearly hold for the mean:

$$ w_i = \frac{1}{N}; \qquad \sum_{i=1}^{N}|w_i| = 1; \qquad \sum_{i=1}^{N} w_i\gamma_i = N^{-1}\sum_{i=1}^{N}\gamma_i \ne 0 $$

as long as the mean effect of the factor on the dependent variable is non-zero.

This involves adding the means of the dependent and independent variables to the regression (113):

$$ y_{it} = \delta_i' d_t + \beta_i' x_{it} + \pi_{yi}\bar y_t + \pi_{xi}'\bar x_t + u_{it} \tag{114} $$

The covariance between $\bar y_t$ and $u_{it}$ goes to zero as N grows, so for large N there is no endogeneity problem. The CCE approach generalises to many factors and lagged dependent variables, but requires that γ, or the vector equivalent, is non-zero.

Pesaran (2006) showed that $\beta_i$ can be consistently estimated through the augmented OLS regression (114), for large N and T, by

$$ \hat\beta_i = \left(X_i' M_D X_i\right)^{-1} X_i' M_D y_i $$

where $y_i$ is a T × 1 vector of the dependent variable for the ith unit, $X_i$ is a $T \times k_x$ matrix of regressors, $M_D = I_T - D(D'D)^{-1}D'$, and D consists of the observed common factors and the cross-sectional averages of the dependent and independent variables. The following theorem provides the asymptotic distribution in the case where the rank condition is satisfied and $(N,T) \to_j \infty$.

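THEOREM 1: Consider the panel data model (99) and (100). Suppose that Assumptions 1, 2, and 5(a) hold, and the rank condition (108) is satisfied. Then, $\hat\beta_i$ is a consistent estimator of $\beta_i$. If $\sqrt{T}/N \to 0$ as $(N,T) \to_j \infty$, then

$$ \sqrt{T}\left(\hat\beta_i - \beta_i\right) \to_d N\left(0, \Sigma_{\beta_i}\right) \tag{115} $$

where

$$ \Sigma_{\beta_i} = \Sigma_i^{-1} S_{i\varepsilon}\Sigma_i^{-1} \tag{116} $$

$\Sigma_i^{-1}$ is defined by (11),

$$ S_{i\varepsilon} = \operatorname*{plim}_{T\to\infty}\left[T^{-1}\left(X_i' M_g\Omega_{\varepsilon i} M_g X_i\right)\right] \tag{117} $$

$M_g$ is defined by (16), and $\Omega_{\varepsilon i} = E(\varepsilon_i\varepsilon_i')$.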

We shall assume that the parameters of interest are $\beta = E(\beta_i)$, the cross-section means of $\beta_i$. We consider two alternative estimators, the mean group (MG) estimator and the pooled estimator. We shall refer to the former as the common correlated effects mean group (CCEMG) estimator, given by

$$ \hat\beta_{MG} = \frac{1}{N}\sum_{i=1}^{N}\hat\beta_i, $$

and the latter as the common correlated effects pooled (CCEP) estimator, given by

$$ \hat\beta_P = \left(\sum_{i=1}^{N} X_i' M_D X_i\right)^{-1}\sum_{i=1}^{N} X_i' M_D y_i $$
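The CCEMG and CCEP formulas above can be tried out on simulated data. The sketch below is a minimal illustration under assumptions that are not in the text: a single regressor, a single unobserved factor that enters both the y and x equations, an intercept as the only observed common effect, equal weights, and N = T = 50. A unit-by-unit OLS regression that ignores the factor is included for comparison.

```python
import numpy as np

rng = np.random.default_rng(4)
N, T = 50, 50

f = rng.standard_normal(T)                         # unobserved common factor
beta_i = 1.0 + 0.1 * rng.standard_normal(N)        # heterogeneous slopes with mean 1
gamma_i = 1.0 + 0.2 * rng.standard_normal(N)       # factor loadings in the y equation
Gamma_i = 1.0 + 0.2 * rng.standard_normal(N)       # factor loadings in the x equation

x = Gamma_i[:, None] * f[None, :] + rng.standard_normal((N, T))
y = beta_i[:, None] * x + gamma_i[:, None] * f[None, :] + rng.standard_normal((N, T))

# D: intercept plus equally weighted cross-section averages of y and x
D = np.column_stack([np.ones(T), y.mean(axis=0), x.mean(axis=0)])
M_D = np.eye(T) - D @ np.linalg.solve(D.T @ D, D.T)   # I - D (D'D)^{-1} D'

# Unit-specific CCE estimates (scalar regressor, so the inverse is a division)
num = np.array([x[i] @ M_D @ y[i] for i in range(N)])
den = np.array([x[i] @ M_D @ x[i] for i in range(N)])
b_i = num / den

b_ccemg = b_i.mean()                 # mean group estimator
b_ccep = num.sum() / den.sum()       # pooled estimator

# Comparison: unit-by-unit OLS with an intercept only, ignoring the factor
M_1 = np.eye(T) - np.ones((T, T)) / T
b_naive = np.mean([(x[i] @ M_1 @ y[i]) / (x[i] @ M_1 @ x[i]) for i in range(N)])

print("CCEMG:", round(b_ccemg, 3), " CCEP:", round(b_ccep, 3),
      " naive MG:", round(b_naive, 3))
```

No estimate of the factor or of the number of factors is needed: the cross-section averages in D do the work, and the whole calculation is a sequence of OLS projections.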

THEOREM 2: Consider the panel data model (99) and (100). Suppose that Assumptions 1-4, and 5(a) hold. Then the CCEMG estimator is asymptotically (for a fixed T and as N → ∞) unbiased and, as (N, T) → ∞,

$$ \sqrt{N}\left(\hat\beta_{MG} - \beta\right) \to_d N\left(0, \Sigma_{MG}\right) $$

where $\Sigma_{MG}$ can be consistently estimated nonparametrically.

This theorem does not require the rank condition to hold; it is valid for any number m of unobserved factors as long as m is fixed, and it does not impose any restrictions on the relative rates of expansion of N and T.
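For reference, the nonparametric variance estimator proposed in Pesaran (2006) for the mean group case is the sample covariance of the unit-specific estimates around their average. Writing $\hat\beta_i$ for the individual CCE estimates and $\hat\beta_{MG}$ for their mean, it takes the form:

```latex
\hat\Sigma_{MG} = \frac{1}{N-1}\sum_{i=1}^{N}
  \left(\hat\beta_i - \hat\beta_{MG}\right)\left(\hat\beta_i - \hat\beta_{MG}\right)'
```

Standard errors for $\hat\beta_{MG}$ are then the square roots of the diagonal elements of $\hat\Sigma_{MG}/N$. No estimate of the factors or of the error covariance structure is required, since the dispersion of the $\hat\beta_i$ across units captures both slope heterogeneity and estimation error.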

THEOREM 3: Consider the panel data model (99) and (100). Then, the CCEP estimator is asymptotically unbiased and, as (N, T) → ∞, we have:

$$ \sqrt{N}\left(\hat\beta_P - \beta\right) \to_d N\left(0, \Sigma_P^*\right) $$

where

$$ \Sigma_P^* = \Psi^{*-1} R^*\Psi^{*-1} \tag{118} $$

$$ \Psi^* = \lim_{N\to\infty}\left(\sum_{i=1}^{N} w_i\Sigma_{iq}\right) $$

$$ R^* = \lim_{N\to\infty}\left(N^{-1}\sum_{i=1}^{N}\tilde w_i^2\left(\Sigma_{iq}\Omega_\upsilon\Sigma_{iq} + Q_{if}\Omega_\eta Q_{if}'\right)\right)\ {}^{23} $$

$$ \tilde w_i = \frac{w_i}{\sqrt{N^{-1}\sum_{i=1}^{N} w_i^2}}\ {}^{24} $$

and $\Sigma_{iq}$ and $Q_{if}$ are defined by (61).

Although the asymptotic variance matrix of $\hat\beta_P$ depends on the unobserved factors and their loadings, it is nevertheless possible to estimate it consistently along lines similar to those followed in the case of CCEMG.

A clear advantage of the CCE method is that the number of unobserved factors need not be estimated. In fact, the method is valid with a single or multiple unobserved factors and does not require the number of factors to be smaller than the number of observed cross-section averages. In addition, CCE is easy to compute as an outcome of OLS and no iteration is needed. Desirable small-sample properties of CCE have also been demonstrated.

23 CZ: Also involves the order of $Q_{if}$.
24 CZ: It is easy to show this is $O_P(1)$. See proof of Lemma 4.


There are sometimes economic reasons for adding averages, but in other cases the economic interpretation is not straightforward. In a variety of circumstances estimating the factors by the means seems to work better than estimating them directly by the PC estimator. This procedure determines the weights a priori rather than estimating them by PCs. Not estimating the weights seems to improve the performance of the procedure. Kapetanios, Pesaran and Yamagata (2011) show that this procedure is robust to a wide variety of data generation processes including unit roots.

Remark: Westerlund and Urbain (2013) show that Pesaran’s estimator becomes inconsistent when the factor loadings in the y equation are correlated with the factor loadings in the x equation.

• YC: update More issues

5.2 Panel Data Models with Interactive Fixed Effects

Bai (2009) considers the following panel data model with N cross-sectional units and T time periods:

$$ Y_{it} = X_{it}'\beta + u_{it} \tag{119} $$

$$ u_{it} = \lambda_i' F_t + \varepsilon_{it} \quad (i = 1, \dots, N,\ t = 1, \dots, T), $$

where $X_{it}$ is a p × 1 vector of observable regressors, β is a p × 1 vector of unknown coefficients, $u_{it}$ has a factor structure, $\lambda_i$ (r × 1) is a vector of factor loadings, and $F_t$ (r × 1) is a vector of common factors:

$$ \lambda_i' F_t = \lambda_{i1}F_{1t} + \dots + \lambda_{ir}F_{rt}; $$

$\varepsilon_{it}$ are idiosyncratic errors; $\lambda_i$, $F_t$, and $\varepsilon_{it}$ are all unobserved. Our interest is centered on the inference for the slope coefficient β.

The preceding set of equations constitutes the interactive-effects model, in which $\lambda_i$ and $F_t$ interact. The usual fixed-effects model takes the form:

$$ Y_{it} = X_{it}'\beta + \alpha_i + \xi_t + \varepsilon_{it} \tag{120} $$

where the individual effects $\alpha_i$ and the time effects $\xi_t$ enter the model additively. Note that multiple interactive effects include additive effects as special cases. For r = 2, consider the special factors and factor loadings such that, for all i and all t,

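$$ F_t = \begin{bmatrix} 1 \\ \xi_t \end{bmatrix}, \qquad \lambda_i = \begin{bmatrix} \alpha_i \\ 1 \end{bmatrix}. $$

Then

$$ \lambda_i' F_t = \alpha_i + \xi_t. $$

Owing to potential correlations between the unobservable effects and the regressors, we treat $\lambda_i$ and $F_t$ as fixed-effects parameters to be estimated. We allow the observable $X_{it}$ to be written as

$$ X_{it} = \tau_i + \theta_t + \sum_{k=1}^{r} a_k\lambda_{ik} + \sum_{k=1}^{r} b_k F_{kt} + \sum_{k=1}^{r} c_k\lambda_{ik}F_{kt} + \pi_i' G_t + \eta_{it} \tag{121} $$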


where $a_k$, $b_k$, and $c_k$ are scalar constants (or vectors when $X_{it}$ is a vector), and $G_t$ is another set of common factors that do not enter the $Y_{it}$ equation. So $X_{it}$ can be correlated with $\lambda_i$ alone or with $F_t$ alone, or can be simultaneously correlated with $\lambda_i$ and $F_t$. In fact, $X_{it}$ can be a nonlinear function of $\lambda_i$ and $F_t$. We directly estimate $\lambda_i$ and $F_t$, together with β, subject to some identifying restrictions. We consider the least squares method.

While additive effects can be removed by the within-group transformation, the scheme fails to purge interactive effects. For example, consider r = 1,

$$ Y_{it} = X_{it}'\beta + \lambda_i F_t + \varepsilon_{it}. $$

Then

$$ Y_{it} - \bar Y_{i\cdot} = \left(X_{it} - \bar X_{i\cdot}\right)'\beta + \lambda_i\left(F_t - \bar F\right) + \varepsilon_{it} - \bar\varepsilon_{i\cdot} $$

where $\bar Y_{i\cdot}$, $\bar X_{i\cdot}$, and $\bar\varepsilon_{i\cdot}$ are averages over time. Because $F_t \ne \bar F$, the within-group transformation is unable to remove the interactive effects. Similarly, the interactive effects cannot be removed with time dummy variables. Thus the within-group estimator is inconsistent, since the unobservables are correlated with the regressors.

We provide a large N and large T perspective on panel data models with interactive effects, permitting the regressor $X_{it}$ to be correlated with either $\lambda_i$ or $F_t$, or both. Compared with the fixed T analysis, the large T perspective has its own challenges. For example, an incidental parameter problem is now present in both dimensions. On the other hand, the large T setup also presents new opportunities. We show that if T is large, then the least squares estimator for β is $\sqrt{NT}$ consistent, despite serial or cross-sectional correlations and heteroskedasticities of unknown form in $\varepsilon_{it}$. We allow $\varepsilon_{it}$ to be weakly correlated across i and over t; thus, $u_{it}$ has the approximate factor structure of Chamberlain and Rothschild (1983). Additionally, heteroskedasticity is allowed in both dimensions.

Even for large T, asymptotic bias can persist in dynamic or nonlinear panel data models with fixed effects. We show that asymptotic bias arises under interactive effects, leading to nonzero centered limiting distributions. We also show that bias-corrected estimators can be constructed in a way similar to Hahn and Kuersteiner (2002) and Hahn and Newey (2004).

Because additive effects are special cases of interactive effects, the interactive-effects estimator is consistent when the effects are, in fact, additive, but the estimator is less efficient than the one with additivity imposed. We also consider the problem of testing additive effects versus interactive effects.

Write the model as

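$$Y_i = X_i\beta + F\lambda_i + \varepsilon_i$$

where

$$\underset{T\times 1}{Y_i} = \begin{pmatrix} Y_{i1} \\ \vdots \\ Y_{iT} \end{pmatrix}, \qquad \underset{T\times p}{X_i} = \begin{pmatrix} X_{i1}' \\ \vdots \\ X_{iT}' \end{pmatrix}, \qquad \underset{T\times r}{F} = \begin{pmatrix} F_1' \\ \vdots \\ F_T' \end{pmatrix}.$$

Similarly, define $\Lambda = (\lambda_1, \lambda_2, \ldots, \lambda_N)'$, an $N \times r$ matrix. In matrix notation,

$$Y = X\beta + F\Lambda' + \varepsilon \tag{122}$$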

where $Y = (Y_1, \ldots, Y_N)$ is $T \times N$ and $X$ is a three-dimensional matrix with $p$ sheets ($T \times N \times p$), the $\ell$th sheet of which is associated with the $\ell$th element of $\beta$ ($\ell = 1, \ldots, p$). The product $X\beta$ is $T \times N$ and $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_N)$ is $T \times N$.

In view of $F\Lambda' = FAA^{-1}\Lambda'$ for an arbitrary $r \times r$ invertible $A$, identification is not possible without restrictions. Because an arbitrary $r \times r$ invertible matrix has $r^2$ free elements, the number of restrictions needed is $r^2$. The normalization$^{25}$

$$\frac{F'F}{T} = I_r \tag{123}$$

yields $r(r+1)/2$ restrictions, implying that $T^{-1}\sum_{t=1}^{T} F_{jt}^2 = 1$ for $j = 1, \ldots, r$ and $T^{-1}\sum_{t=1}^{T} F_{jt}F_{ht} = 0$ for $j, h = 1, \ldots, r$ and $j \neq h$. This is a commonly used normalization. Additional $r(r-1)/2$ restrictions can be obtained by requiring

$$\Lambda'\Lambda = \text{diagonal}$$

implying that $\sum_{i=1}^{N} \lambda_{ij}^2 \neq 0$ for $j = 1, \ldots, r$ and $\sum_{i=1}^{N} \lambda_{ij}\lambda_{ih} = 0$ for $j, h = 1, \ldots, r$ and $j \neq h$. These two sets of restrictions uniquely determine $\Lambda$ and $F$. The least squares estimators for $F$ and $\Lambda$ derived below satisfy these restrictions.

Sufficient variation in $X_{it}$ is also needed. The usual identification condition for $\beta$ is that the matrix $\frac{1}{NT}\sum_{i=1}^{N} X_i' M_F X_i$ is of full rank, where $M_F = I_T - F(F'F)^{-1}F'$. Because $F$ is not observable and is estimated, a stronger condition is required.

The least squares objective function is defined as

$$SSR(\beta, F, \Lambda) = \sum_{i=1}^{N} (Y_i - X_i\beta - F\lambda_i)'(Y_i - X_i\beta - F\lambda_i) \tag{124}$$

subject to the constraints $F'F/T = I_r$ and $\Lambda'\Lambda$ being diagonal. Define the projection matrix

$$M_F = I_T - F(F'F)^{-1}F' = I_T - FF'/T.$$

The least squares estimator for $\beta$ for each given $F$ is simply

$$\hat\beta(F) = \left(\sum_{i=1}^{N} X_i' M_F X_i\right)^{-1}\left(\sum_{i=1}^{N} X_i' M_F Y_i\right).$$

Given $\beta$, the variable $W_i = Y_i - X_i\beta$ has a pure factor structure such that

$$W_i = F\lambda_i + \varepsilon_i \tag{125}$$

25 The normalization still leaves rotation indeterminacy. For example, let $G$ be an $r \times r$ orthogonal matrix, and let $F^* = FG$ and $\Lambda^* = \Lambda G$. Then $F\Lambda' = F^*\Lambda^{*\prime}$ and $F^{*\prime}F^*/T = F'F/T = I$. To remove this indeterminacy, we fix $G$ to make $\Lambda^{*\prime}\Lambda^* = G'\Lambda'\Lambda G$ a diagonal matrix. This is the reason for the restriction that $\Lambda'\Lambda$ be diagonal.


Define $W = (W_1, \ldots, W_N)$, a $T \times N$ matrix, so that

$$W = F\Lambda' + \varepsilon.$$

The least squares objective function is

$$\mathrm{tr}\left[(W - F\Lambda')(W - F\Lambda')'\right].$$

By concentrating out $\Lambda = W'F(F'F)^{-1} = W'F/T$, the objective function becomes

$$\mathrm{tr}(W' M_F W) = \mathrm{tr}(W'W) - \mathrm{tr}(F'WW'F)/T \tag{126}$$

where we use

$$W - F\Lambda' = W - F(F'F)^{-1}F'W = M_F W.$$

Therefore, minimizing with respect to $F$ is equivalent to maximizing $\mathrm{tr}(F'WW'F)$. The estimator for $F$ is equal to the first $r$ eigenvectors (multiplied by $\sqrt{T}$ due to the restriction $F'F/T = I$) associated with the first $r$ largest eigenvalues of the matrix

$$WW' = \sum_{i=1}^{N} W_i W_i' = \sum_{i=1}^{N} (Y_i - X_i\beta)(Y_i - X_i\beta)'.$$
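As a rough numerical illustration of this eigenvector step, the following Python sketch (assuming NumPy is available; the function name `pc_factors` and the simulated data are purely illustrative and do not come from the source) extracts $\hat F$ and $\hat\Lambda$ from a given $T \times N$ matrix $W$. The checks at the end only confirm that the two normalizations, $F'F/T = I_r$ and $\Lambda'\Lambda$ diagonal, hold by construction.

```python
import numpy as np

def pc_factors(W, r):
    """Principal-components estimate of (F, Lambda) from a T x N matrix W.

    F is sqrt(T) times the eigenvectors of W W' that belong to the r largest
    eigenvalues, so that F'F/T = I_r; Lambda = W'F/T.
    """
    T = W.shape[0]
    eigval, eigvec = np.linalg.eigh(W @ W.T)   # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1][:r]       # positions of the r largest
    F = np.sqrt(T) * eigvec[:, order]          # T x r
    Lam = W.T @ F / T                          # N x r
    return F, Lam

# Hypothetical pure factor data W = F0 L0' + noise (illustration only).
rng = np.random.default_rng(0)
T, N, r = 60, 40, 2
F0 = rng.standard_normal((T, r))
L0 = rng.standard_normal((N, r))
W = F0 @ L0.T + 0.1 * rng.standard_normal((T, N))

F_hat, Lam_hat = pc_factors(W, r)
print("F'F/T =\n", F_hat.T @ F_hat / T)
print("Lambda'Lambda =\n", Lam_hat.T @ Lam_hat)
```

Only the product $\hat F\hat\Lambda'$ is comparable with the true common component; $\hat F$ itself estimates a rotation of the true factors.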

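Therefore, given $F$, we can estimate $\beta$, and given $\beta$, we can estimate $F$. The final least squares estimator $(\hat\beta, \hat F)$ is the solution of the set of nonlinear equations

$$\hat\beta = \left(\sum_{i=1}^{N} X_i' M_{\hat F} X_i\right)^{-1}\left(\sum_{i=1}^{N} X_i' M_{\hat F} Y_i\right) \tag{127}$$

$$\left[\frac{1}{NT}\sum_{i=1}^{N} (Y_i - X_i\hat\beta)(Y_i - X_i\hat\beta)'\right]\hat F = \hat F V_{NT}$$

where $V_{NT}$ is a diagonal matrix that consists of the $r$ largest eigenvalues of the above matrix$^{26}$ in the brackets, arranged in decreasing order. The solution $(\hat\beta, \hat F)$ can be simply obtained by iteration. Finally, from $\Lambda = W'F/T$, $\hat\Lambda$ is expressed as

$$\hat\Lambda' = (\hat\lambda_1, \ldots, \hat\lambda_N) = T^{-1}\left[\hat F'(Y_1 - X_1\hat\beta), \ldots, \hat F'(Y_N - X_N\hat\beta)\right].$$

We may also write

$$\hat\Lambda' = T^{-1}\hat F'(Y - X\hat\beta)$$

where $Y$ is $T \times N$ and $X$ is $T \times N \times p$, a three-dimensional matrix.

The triplet $(\hat\beta, \hat F, \hat\Lambda)$ jointly minimizes the objective function (124). The pair $(\hat\beta, \hat F)$ jointly minimizes the concentrated objective function (126), which is equal to

$$\mathrm{tr}(W' M_F W) = \sum_{i=1}^{N} W_i' M_F W_i = \sum_{i=1}^{N} (Y_i - X_i\beta)' M_F (Y_i - X_i\beta) \tag{128}$$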

26 We divide this matrix by $NT$ so that $V_{NT}$ will have a proper limit. The scaling does not affect $\hat F$.


Here we suggest a more robust iteration scheme than the one implied by (127) and (??). Given $F$ and $\Lambda$, we compute

$$\hat\beta(F, \Lambda) = \left(\sum_{i=1}^{N} X_i' X_i\right)^{-1}\sum_{i=1}^{N} X_i'(Y_i - F\lambda_i)$$

and given $\beta$, we compute $F$ and $\Lambda$ from the pure factor model

$$W_i = F\lambda_i + e_i$$

with $W_i = Y_i - X_i\beta$. This iteration scheme only requires a single matrix inverse, $\left(\sum_{i=1}^{N} X_i' X_i\right)^{-1}$, with no need to update it during iteration. Our simulation results are based on this scheme.

Alternative method: We can extend Mundlak (1978) and Chamberlain (1984) to models with interactive effects. When $\lambda_i$ is correlated with the regressors, it can be projected onto the regressors such that

$$\lambda_i = A\bar{X}_{i.} + \eta_i,$$

where $\bar{X}_{i.}$ is the time average of $X_{it}$ and $A$ is $r \times p$, so that the model can be rewritten as

$$Y_{it} = X_{it}'\beta + \bar{X}_{i.}'\delta_t + \eta_i' F_t + \varepsilon_{it}$$

where $\delta_t = A'F_t$. The above model still has a factor error structure. However, when $F_t$ is assumed to be uncorrelated with the regressors, the aggregated error $\eta_i' F_t + \varepsilon_{it}$ is now uncorrelated with the regressors, so we can use a random-effects GLS to estimate $(\beta, \delta_1, \ldots, \delta_T)$.

Similarly, when $F_t$ is correlated with the regressors, but $\lambda_i$ is not, one can project $F_t$ onto the cross-sectional averages such that

$$F_t = B\bar{X}_{.t} + \xi_t$$

to obtain

$$Y_{it} = X_{it}'\beta + \bar{X}_{.t}'\rho_i + \lambda_i'\xi_t + \varepsilon_{it}$$

with $\rho_i = B'\lambda_i$. Again, a random-effects GLS can be used.

When both $\lambda_i$ and $F_t$ are correlated with regressors, we apply both projections and augment the model with cross-products of $\bar{X}_{i.}$ and $\bar{X}_{.t}$, in addition to $\bar{X}_{i.}$ and $\bar{X}_{.t}$ so that, with $\rho_i = B'\eta_i$ and $\delta_t = A'\xi_t$,

$$Y_{it} = X_{it}'\beta + \bar{X}_{i.}'\delta_t + \bar{X}_{.t}'\rho_i + \bar{X}_{i.}'C\bar{X}_{.t} + \eta_i'\xi_t + \varepsilon_{it}$$

where $C$ is a matrix. The above can be estimated by the random-effects GLS.

METHOD 3: Pesaran (2006) augments the model with regressors $(\bar{Y}_{\cdot t}, \bar{X}_{\cdot t})$ under the assumption of $F_t$ being correlated with regressors, where $\bar{Y}_{\cdot t}$ and $\bar{X}_{\cdot t}$ attempt to estimate $F_t$. But in the Mundlak argument, the projection residual $\xi_t$ is assumed to have a fixed variance. In contrast, the variance of $\xi_t$ is assumed to converge to zero as $N \to \infty$ in Pesaran (2006), who assumed $X_{it}$ is of the form $X_{it} = B_i F_t + e_{it}$ so that $\bar{X}_{\cdot t} = \bar{B}F_t + \xi_t$ with $\bar{B} = N^{-1}\sum_{i=1}^{N} B_i$ and $\xi_t = N^{-1}\sum_{i=1}^{N} e_{it}$. The variance of $\xi_t$ is $O(N^{-1})$. Thus the factor error $\lambda_i'\xi_t$ is negligible under large $N$. He established consistency. It appears that when $\lambda_i$ is correlated with the regressors, additional regressors $\bar{Y}_{i\cdot}$ and $\bar{X}_{i\cdot}$ should also be added to achieve consistency. (HOW??)

We use $(\beta^0, F^0)$ to denote the true parameters, and we still use $\lambda_i$ without the superscript 0 as it is not directly estimated. The estimator $\hat F$ below estimates a rotation of $F^0$. Define $S_{NT}(\beta, F)$ as the concentrated objective function in (128) divided by $NT$ together with centering, that is,

$$S_{NT}(\beta, F) = \frac{1}{NT}\sum_{i=1}^{N} (Y_i - X_i\beta)' M_F (Y_i - X_i\beta) - \frac{1}{NT}\sum_{i=1}^{N} \varepsilon_i' M_{F^0}\varepsilon_i.$$

The second term is for the purpose of centering, where $M_F = I - P_F = I - FF'/T$ with $F'F/T = I$. We estimate $\beta^0$ and $F^0$ by

$$(\hat\beta, \hat F) = \arg\min_{\beta, F} S_{NT}(\beta, F).$$

$(\hat\beta, \hat F)$ satisfies

$$\hat\beta = \left(\sum_{i=1}^{N} X_i' M_{\hat F} X_i\right)^{-1}\sum_{i=1}^{N} X_i' M_{\hat F} Y_i$$

$$\left[\frac{1}{NT}\sum_{i=1}^{N} (Y_i - X_i\hat\beta)(Y_i - X_i\hat\beta)'\right]\hat F = \hat F V_{NT}$$

where $\hat F$ is the matrix that consists of the first $r$ eigenvectors (multiplied by $\sqrt{T}$) of the matrix $\frac{1}{NT}\sum_{i=1}^{N} (Y_i - X_i\hat\beta)(Y_i - X_i\hat\beta)'$ and $V_{NT}$ is a diagonal matrix that consists of the first $r$ largest eigenvalues of this matrix.

THEOREM 2: Assume Assumptions A-E hold. As $T, N \to \infty$, the following statements hold:

(i) In the absence of serial correlation and heteroskedasticity and with $T/N \to 0$,

$$\sqrt{NT}(\hat\beta - \beta^0) \to_d N\left(0, D_0^{-1} D_1 D_0^{-1\prime}\right)$$

(ii) In the absence of cross-section correlation and heteroskedasticity and with $N/T \to 0$,

$$\sqrt{NT}(\hat\beta - \beta^0) \to_d N\left(0, D_0^{-1} D_2 D_0^{-1\prime}\right)$$

where $D_0 = \mathrm{plim}\, D(F^0) = \mathrm{plim}\, \frac{1}{NT}\sum_{i=1}^{N} Z_i' Z_i$.

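Noting that $D_1 = D_2 = \sigma^2 D_0$ under the i.i.d. assumption on $\varepsilon_{it}$, the following statement holds:

COROLLARY 1: Under the assumptions of Theorem 2, if $\varepsilon_{it}$ are i.i.d. over $t$ and $i$, with zero mean and variance $\sigma^2$, then

$$\sqrt{NT}(\hat\beta - \beta^0) \to_d N\left(0, \sigma^2 D_0^{-1}\right)$$

It is conjectured that $\hat\beta$ is asymptotically efficient if $\varepsilon_{it}$ are i.i.d. $N(0, \sigma^2)$, based on the argument of Hahn and Kuersteiner (2002).

If $T/N$ converges to a constant, there will be a bias term due to correlation and heteroskedasticity. We shall deal with the more general case in which correlation and heteroskedasticity exist in both dimensions.

THEOREM 3: Assume Assumptions A-E hold and $T/N \to \rho > 0$. Then

$$\sqrt{NT}(\hat\beta - \beta^0) \to_d N\left(\rho^{1/2} B_0 + \rho^{-1/2} C_0,\; D_0^{-1} D_Z D_0^{-1\prime}\right)$$

where $B_0$ is the probability limit of $B$ with

$$B = -D(F^0)^{-1}\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{N} \frac{(X_i - V_i)'F^0}{T}\left(\frac{F^{0\prime}F^0}{T}\right)^{-1}\left(\frac{\Lambda'\Lambda}{N}\right)^{-1}\lambda_k\left(\frac{1}{T}\sum_{t=1}^{T}\sigma_{ik,tt}\right) \tag{129}$$

and $C_0$ is the probability limit of $C$ with

$$C = -D(F^0)^{-1}\frac{1}{NT}\sum_{i=1}^{N} X_i' M_{F^0}\,\Omega\, F^0\left(\frac{F^{0\prime}F^0}{T}\right)^{-1}\left(\frac{\Lambda'\Lambda}{N}\right)^{-1}\lambda_i \tag{130}$$

and $V_i = \frac{1}{N}\sum_{j=1}^{N} a_{ij}X_j$, $a_{ij} = \lambda_i'(\Lambda'\Lambda/N)^{-1}\lambda_j$, and $\Omega = \frac{1}{N}\sum_{k=1}^{N}\Omega_k$ with $\Omega_k = E(\varepsilon_k\varepsilon_k')$.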

There will be no biases in the absence of correlations and heteroskedasticities. When $\varepsilon_{it}$ are i.i.d. over $t$ and over $i$, then both $B_0$ and $C_0$ become zero, and the result specializes to Corollary 1.

REMARK 4: Suppose that $k$ factors are allowed in the estimation, with $k \geq r$. Then $\hat\beta$ remains $\sqrt{NT}$ consistent, albeit less efficient than with $k = r$. Consistency relies on controlling the space spanned by $\Lambda$ and that of $F$, which is achieved when $k \geq r$.

REMARK 5: Due to $\sqrt{NT}$ consistency for $\hat\beta$, estimation of $\beta$ does not affect the rates of convergence and the limiting distributions of $\hat F_t$ and $\hat\lambda_i$. That is, they are the same as those of the pure factor model of Bai (2003). This follows from

$$Y_{it} - X_{it}'\hat\beta = \lambda_i' F_t + \varepsilon_{it} + X_{it}'(\beta - \hat\beta),$$

which is a pure factor model with an added error $X_{it}'(\beta - \hat\beta) = (NT)^{-1/2}O_p(1)$. An error of this order of magnitude does not affect the analysis.

Define the bias-corrected PC estimator:

$$\hat\beta^{\dagger} = \hat\beta - \frac{1}{N}\hat B - \frac{1}{T}\hat C$$

THEOREM 4: Assume Assumptions A-E hold. In addition, $E(\varepsilon_{it}^2) = \sigma_{it}^2$ and $E(\varepsilon_{it}\varepsilon_{js}) = 0$ for $i \neq j$ or $t \neq s$. If $T/N^2 \to 0$ and $N/T^2 \to 0$, then

$$\sqrt{NT}(\hat\beta^{\dagger} - \beta^0) \to_d N\left(0, D_0^{-1} D_3 D_0^{-1}\right)$$

where $D_3 = \mathrm{plim}\,\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T} Z_{it}Z_{it}'\sigma_{it}^2$.

The limiting variance $D_3$ is a special case of $D_Z$ due to the no-correlation assumption. Bias correction does not contribute to the limiting variance. Also note that the conditions $N/T^2 \to 0$ and $T/N^2 \to 0$ are less restrictive than $T/N$ converging to a positive constant. There exist other bias correction procedures (e.g., panel jackknife) that could be used.

Summary

Bai (2009) allows $X_{it}$ to be correlated with $\lambda_i$, $F_t$ or both. Under large $N, T$, we may estimate the model by minimising a LS objective function:

$$SSR(\beta, F, \Lambda) = \sum_{i=1}^{N} (Y_i - X_i\beta - F\lambda_i)'(Y_i - X_i\beta - F\lambda_i) \quad \text{s.t.} \quad \frac{F'F}{T} = I_r, \;\; \Lambda'\Lambda \text{ is diagonal}$$

Although no closed-form solution is available, the estimators can be obtained by iterations.

1. Obtain some initial values $\hat\beta^{(0)}$, such as least squares estimators from regressing $Y_i$ on $X_i$.

2. Perform principal component analysis for the pseudo-data, $Y_i - X_i\hat\beta^{(0)}$, to obtain $\hat F^{(1)}$ and $\hat\Lambda^{(1)}$.

3. Next, regress $Y_i - \hat F^{(1)}\hat\lambda_i^{(1)}$ on $X_i$ to obtain $\hat\beta^{(1)}$.

4. Iterate such steps until convergence is achieved.
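The four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than Bai's own implementation: it assumes NumPy, a balanced panel stored as a $T \times N$ array `Y` and a $T \times N \times p$ array `X`, and a known number of factors `r`. The function names (`pc_step`, `interactive_effects_ls`), the tolerance, and the simulated design are all hypothetical choices made for the example.

```python
import numpy as np

def pc_step(W, r):
    """PC estimates of (F, Lambda) from a T x N matrix W, with F'F/T = I_r."""
    T = W.shape[0]
    vals, vecs = np.linalg.eigh(W @ W.T)            # eigenvalues in ascending order
    F = np.sqrt(T) * vecs[:, np.argsort(vals)[::-1][:r]]
    return F, W.T @ F / T

def interactive_effects_ls(Y, X, r, tol=1e-9, max_iter=1000):
    """Iterate between beta and (F, Lambda). Y is T x N, X is T x N x p."""
    T, N, p = X.shape
    Xmat = X.reshape(T * N, p)
    XtX_inv = np.linalg.inv(Xmat.T @ Xmat)          # single inverse, never updated
    beta = XtX_inv @ (Xmat.T @ Y.reshape(T * N))    # step 1: pooled least squares
    for _ in range(max_iter):
        F, Lam = pc_step(Y - X @ beta, r)           # step 2: PCA on the pseudo-data
        beta_new = XtX_inv @ (Xmat.T @ (Y - F @ Lam.T).reshape(T * N))  # step 3
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:                               # step 4: stop at convergence
            break
    return beta, F, Lam

# Simulated example (illustrative): the first regressor is correlated with lambda_i * F_t.
rng = np.random.default_rng(1)
T, N, r = 60, 60, 1
beta0 = np.array([1.0, 2.0])
F0 = rng.standard_normal((T, r))
L0 = rng.standard_normal((N, r))
common = F0 @ L0.T
X = rng.standard_normal((T, N, 2))
X[:, :, 0] += 0.5 * common
Y = X @ beta0 + common + rng.standard_normal((T, N))

Xmat = X.reshape(T * N, 2)
beta_ols = np.linalg.solve(Xmat.T @ Xmat, Xmat.T @ Y.reshape(T * N))
beta_hat, F_hat, Lam_hat = interactive_effects_ls(Y, X, r)
print("pooled OLS:", beta_ols)
print("interactive effects:", beta_hat)
```

Each pass is a block-coordinate descent step on the SSR objective, so the objective never increases, but the iteration may stop at a local solution when the starting value is poor; trying several starting values is a sensible precaution.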

Bai (2009) showed that the resulting estimator $\hat\beta$ is $\sqrt{NT}$-consistent, and the limiting distributions for $\hat F$ and $\hat\Lambda$ are the same as in Bai (2003) due to their slower convergence rates. The limiting distribution for $\hat\beta$ depends on specific assumptions on $\varepsilon_{it}$ and on the ratio $T/N$. If $T/N \to 0$, then the limiting distribution of $\hat\beta$ will be centered around zero, given that $E(\varepsilon_{it}\varepsilon_{js}) = 0$ for $t \neq s$, and $E(\varepsilon_{it}\varepsilon_{jt}) = \sigma_{ij}$ for all $i, j, t$. On the other hand, if $N$ and $T$ are comparable such that $T/N \to K > 0$, then the limiting distribution will not be centered around zero, which poses a challenge for inference. Bai (2009) provided a bias-corrected estimator for $\beta$, whose limiting distribution is centered around zero. The bias-corrected estimator allows for heteroskedasticity across both $N$ and $T$. Let $\hat\beta^{\dagger}$ be the bias-corrected estimator, and assume that $T/N^2 \to 0$ and $N/T^2 \to 0$, $E(\varepsilon_{it}^2) = \sigma_{it}^2$, and $E(\varepsilon_{it}\varepsilon_{js}) = 0$ for $i \neq j$ or $t \neq s$. Then

$$\sqrt{NT}\left(\hat\beta^{\dagger} - \beta\right) \to_d N(0, \Sigma_\beta)$$

where a consistent estimator for $\Sigma_\beta$ is available. Notice that the issue of choosing $r$ remains.

Ahn et al. (2001, 2013) studied the model (??) under large $N$ but fixed $T$. They employ the GMM method that applies moments of zero correlation and homoskedasticity. Moon and Weidner (2014) consider the same model as (??), but allow lagged dependent variables as regressors. They devise a quadratic approximation of the profile objective function to derive the asymptotic theory for the least squares estimators and test statistics. Moon and Weidner (2015) extend their study by allowing an unknown number of factors. They show that the limiting distribution of the least squares estimator is not affected by the number of factors used in the estimation, as long as this number is no smaller than the true number of factors. Lu and Su (2015) propose the adaptive group LASSO (least absolute shrinkage and selection operator), which can simultaneously select the regressors and determine the number of factors.

Heterogeneous panel models with interactive effects are also studied by Ando and Bai (2014), where the number of regressors can be large and the regularisation method is used to select relevant regressors. Ando and Bai (2015) provide a formal test for homogeneous coefficients. The ML estimation of (??) is studied by Bai and Li (2014). They consider the case in which $X_{it}$ also follows a factor structure and is jointly modeled.

YC: update, Moon and Weidner (2015)?

5.3 An identification of the number of unobserved factors

• Bai and Ng IC

• Ahn and Horenstein ER

• LASSO

We provide a brief review of the existing studies that develop tests or information criteria to consistently select the number of (unobserved) factors in panels with an approximate factor structure. There are two main approaches. The first is the information criterion given by

$$PC(r) = V(r, \hat F^r) + r\,g(N, T)$$

where $V(r, \hat F^r)$ is the residual sum of squares, $\hat F^r$ denotes the $r$ factors estimated by PC and $g(N, T)$ is a penalty term for overfitting as a function of $N$ and $T$. Different specifications have been proposed for $g(N, T)$, which lead to different estimation results in finite samples. The most popular information criteria are $IC_{p2}$ and $BIC_3$, proposed by Bai and Ng (2002), which are specified as

$$IC_{p2}(r) = \ln\left(V(r, \hat F^r)\right) + r\left(\frac{N + T}{NT}\right)\ln C_{NT}^2$$

and

$$BIC_3(r) = V(r, \hat F^r) + r\hat\sigma^2\left(\frac{(N + T - r)\ln(NT)}{NT}\right).$$
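To make the $IC_{p2}$ criterion concrete, here is a small Python sketch (assuming NumPy). It takes $V(r, \hat F^r)$ to be the average squared PC residual and $C_{NT}^2 = \min(N, T)$, as in Bai and Ng (2002); the function name `ic_p2` and the simulated panel are illustrative assumptions. It uses the identity in (126): the PC residual sum of squares with $r$ factors equals $\mathrm{tr}(Y'Y)$ minus the sum of the $r$ largest eigenvalues of $YY'$.

```python
import numpy as np

def ic_p2(Y, r_max):
    """Return (r_hat, ic_values) for the IC_p2 criterion on a T x N panel Y."""
    T, N = Y.shape
    vals = np.sort(np.linalg.eigvalsh(Y @ Y.T))[::-1]    # descending eigenvalues
    total = vals.sum()                                   # equals tr(Y'Y)
    penalty = (N + T) / (N * T) * np.log(min(N, T))      # C_NT^2 = min(N, T)
    ic = []
    for r in range(r_max + 1):
        V = (total - vals[:r].sum()) / (N * T)           # mean squared PC residual
        ic.append(np.log(V) + r * penalty)
    ic = np.array(ic)
    return int(np.argmin(ic)), ic

# Hypothetical panel with three strong factors and unit-variance noise.
rng = np.random.default_rng(2)
T, N, r0, r_max = 100, 80, 3, 8
Y = rng.standard_normal((T, r0)) @ rng.standard_normal((N, r0)).T
Y += rng.standard_normal((T, N))

r_hat, ic_values = ic_p2(Y, r_max)
print("selected number of factors:", r_hat)
```

Replacing `penalty` with another choice of $g(N, T)$ gives the other members of the Bai and Ng family.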

Another strand of the literature develops tests on the basis of eigenvalue comparisons. Let $r_0$ be the true number of factors and $\mu_r$ be the $r$-th largest eigenvalue of the covariance matrix of the $T \times N$ data matrix $Y$. Then, the first $r_0$ eigenvalues of the covariance matrix diverge and the rest are bounded. Kapetanios (2010) approximates the distribution of $\mu_r - \mu_{r_{\max}+1}$ by block subsampling techniques, where $r_{\max}$ is the (fixed) maximum number of factors. Based on the empirical distribution, a sequence of tests of the null hypothesis $r = 0, 1, \ldots, r_{\max}$ against the alternative $r > 0, 1, \ldots, r_{\max}$ can be performed. Onatski (2010) develops the edge distribution (ED) estimator based on the difference between adjacent eigenvalues, and proposes to select $r_0$ by

$$\hat r_0 = \max\{r \leq r_{\max} : \mu_r - \mu_{r+1} \geq \delta\}$$

where $\delta$ is a threshold value. He proposes to calibrate $\delta$ as follows: (i) Compute eigenvalues $\mu_1, \ldots, \mu_n$ of the sample covariance matrix $YY'/T$. Set $j = r_{\max} + 1$. (ii) Compute $\hat\beta$, the slope coefficient in the OLS regression of $\mu_j, \ldots, \mu_{j+4}$ on a constant and $(j-1)^{2/3}, \ldots, (j+3)^{2/3}$. Set $\delta = 2|\hat\beta|$. (iii) Compute $\hat r(\delta) = \max\{\ell \leq r_{\max} : \mu_\ell - \mu_{\ell+1} \geq \delta\}$. If $\mu_\ell - \mu_{\ell+1} < \delta$ for all $\ell \leq r_{\max}$, set $\hat r(\delta) = 0$. (iv) Set $j = \hat r(\delta) + 1$. Repeat (ii) and (iii) until convergence.

Let $\mu_r$ denote the $r$-th largest eigenvalue of the matrix $\frac{1}{NT}YY'$. Ahn and Horenstein (2013) propose to compare the eigenvalue ratios (ER) and select $r_0$ by

$$\hat r_0 = \arg\max_{r \leq r_{\max}} \mu_r/\mu_{r+1}.$$

The ER is easy to compute, and it is reported that the ER tends to provide better finite-sample performance than $BIC_3$ and ED.
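A minimal sketch of the eigenvalue ratio rule in Python is given below (assuming NumPy). The function name `eigenvalue_ratio` and the simulated data are hypothetical. As written, the rule searches over $r = 1, \ldots, r_{\max}$ only, so this sketch cannot return zero factors.

```python
import numpy as np

def eigenvalue_ratio(Y, r_max):
    """ER estimate of the number of factors for a T x N panel Y."""
    T, N = Y.shape
    mu = np.sort(np.linalg.eigvalsh(Y @ Y.T / (N * T)))[::-1]   # descending
    ratios = mu[:r_max] / mu[1:r_max + 1]     # ratios[k-1] = mu_k / mu_{k+1}
    return int(np.argmax(ratios)) + 1, ratios

# Hypothetical panel with three strong factors and unit-variance noise.
rng = np.random.default_rng(3)
T, N, r0, r_max = 100, 80, 3, 8
Y = rng.standard_normal((T, r0)) @ rng.standard_normal((N, r0)).T
Y += rng.standard_normal((T, N))

r_hat, ratios = eigenvalue_ratio(Y, r_max)
print("ER estimate:", r_hat)
```

The ratio peaks where the eigenvalues drop from the diverging group to the bounded group, which is why no penalty function or threshold has to be chosen.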

(?) apply the group bridge method and estimate the factor model by minimising the following penalised least squares criterion:

$$L(\Lambda) = \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\left(Y_{it} - \lambda_i'\hat F_t^{r_{\max}}\right)^2 + \frac{\gamma}{C_{NT}^2}\sum_{j=1}^{r_{\max}}\left(\frac{1}{N}\sum_{i=1}^{N}\lambda_{ij}^2\right)^{\alpha} \tag{131}$$

where $\gamma$ is a tuning parameter and $0 < \alpha < 1/2$ is a user-specified constant. Because the factor loadings for the redundant factors shrink to zero, $r$ can be estimated by the number of non-zero columns in the loading matrix $\hat\Lambda$. They also propose an alternative criterion given by

$$\ln\left[\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\left(Y_{it} - \hat\lambda_i'\hat F_t^{r_{\max}}\right)^2\right] + \hat r\left(\frac{N + T}{NT}\right)\ln\left(\frac{NT}{N + T}\right)\ln(\ln(N)) \tag{132}$$

where $\hat r$ is an estimate obtained from (131).

When there is a block factor structure, however, the information criteria fail to identify the number of global and local factors separately. Let $r_0$ be the number of global factors and $r_i$ be the number of local factors for the $i$th block data matrix, $Y_i$, for $i = 1, \ldots, R$. If the information criteria are applied to $Y_i$, the sum $r_0 + r_i$ can be consistently estimated, but not $r_0$ and $r_i$ separately. Further, if the information criteria are applied to the whole data matrix $Y$, the number of common global factors $r_0$ cannot be consistently estimated because the presence of the local factors will violate the weak cross-sectional correlation assumption made in Bai and Ng (2002). The eigenvalue-based approaches cannot be applied under the multilevel factor setup either. Now, the first $r_0 + r_i$ largest eigenvalues of the covariance matrix of $Y_i$ are diverging, such that the $r_0 + r_1 + \cdots + r_R$ largest eigenvalues of the covariance matrix of $Y$ will diverge for the whole data matrix $Y$.

A few studies have attempted to extend the existing approaches to the multilevel setting. Wang (2008) suggests using a three-step procedure to estimate the total number of global and local factors. For example, consider a two-block factor model. Then, one can apply the existing information criteria to the whole data and obtain $\widehat{r_0 + r_1 + r_2}$, which is a consistent estimator of $r_0 + r_1 + r_2$. Next, apply the IC to the data for each group and obtain $\widehat{r_0 + r_i}$, $i = 1, 2$. The number of global factors is then obtained by

$$\widehat{r_0 + r_1} + \widehat{r_0 + r_2} - \widehat{r_0 + r_1 + r_2}.$$

But Han (2018) shows that this approach can lead to a negative estimate in finite samples. Instead, he applies the adaptive LASSO to estimate the factors and factor loadings using a penalised least squares criterion analogous to (131), shrinking the factor loadings corresponding to irrelevant factors to zero. Then, the number of global factors and the number of local factors can be estimated consistently by counting the number of non-zero columns in the respective factor loading matrix. A modified BIC criterion analogous to (132) is suggested to select the optimal set of tuning parameters. This algorithm proceeds as follows:

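1. Let $\tilde F$ be equal to $\sqrt{T}$ times the eigenvectors corresponding to the $k$ largest eigenvalues of the $T \times T$ matrix $YY'$ in descending order. Compute $\tilde\Lambda = Y'\tilde F/T$. Let $\tilde\Lambda_i$ ($M_i \times k$) be the $i$-th block of $\tilde\Lambda$. Let $\tilde S_1 = \tilde\Lambda_1'\tilde\Lambda_1/M_1$ and then compute the spectral decomposition

   $$\tilde S_1 = \tilde A_1\tilde V_1\tilde A_1' \tag{133}$$

   where $\tilde V_1$ is a diagonal matrix consisting of the eigenvalues of $\tilde S_1$ in descending order and $\tilde A_1$ denotes the corresponding eigenvectors. Let $\grave F = \tilde F\tilde A_1$ and $\grave\Lambda = \tilde\Lambda\tilde A_1$.

2. The estimate of $\grave\Lambda_1$, denoted by $\hat\Lambda_1$, can be obtained by minimizing the LASSO function

   $$Q_1(\Lambda_1) = \frac{1}{M_1 T}\sum_{t=1}^{T}\left\|Y_{1t} - \Lambda_1\grave f_t\right\|^2 + \frac{\gamma_1}{\delta_{NT}^2}\sum_{j=1}^{k} w_{1j}\left\|\Lambda_{1j}\right\| \tag{134}$$

   where $\grave f_t$ is the transpose of the $t$-th row of $\grave F$, $\Lambda_{1j}$ is an $M_1$-vector denoting the $j$-th column of $\Lambda_1$, $\delta_{NT}^2 = \min\{N, T\}$, $\gamma_1 = \gamma_{1,NT}$ is a positive tuning parameter depending on $N$ and $T$, and $w_{1j}$ is an adaptive weight specified below. Let $\hat k_1$ be the number of nonzero columns in $\hat\Lambda_1$.

3. Partition $\grave\Lambda$ as $[\grave\Lambda_1', \ldots, \grave\Lambda_R']'$ where $\grave\Lambda_i$ is an $M_i \times k$ loading matrix of the $i$-th block for $i = 1, \ldots, R$. Let $\grave\Lambda_{i,1:\hat k_1}$ and $\grave\Lambda_{i,\hat k_1+1:k}$ be the first $\hat k_1$ columns and last $k - \hat k_1$ columns of $\grave\Lambda_i$, respectively. Define

   $$\grave S_0 = (N - M_1)^{-1}\sum_{i=2}^{R}\grave\Lambda_{i,1:\hat k_1}'\grave\Lambda_{i,1:\hat k_1}$$

   $$S_i = \frac{1}{M_i}\grave\Lambda_{i,\hat k_1+1:k}'\grave\Lambda_{i,\hat k_1+1:k}, \quad i = 2, \ldots, R.$$

   Compute the spectral decompositions $\grave S_0 = \grave A_0\grave V_0\grave A_0'$ and $S_i = A_i V_i A_i'$ for $i = 2, \ldots, R$. Let $G = \grave F_{1:\hat k_1}\grave A_0$ and $F_i = \grave F_{\hat k_1+1:k}A_i$ for $i = 2, \ldots, R$.

4. Let $\Lambda^{(-1)} = [\Lambda_2', \ldots, \Lambda_R']'$. The factor loading matrix can be estimated by minimizing the following function:

   $$Q(\Lambda^{(-1)}) = \frac{1}{(N - M_1)T}\sum_{i=2}^{R}\sum_{t=1}^{T}\left\|Y_{it} - \Lambda_{i,1:\hat k_1}G_t - \Lambda_{i,\hat k_1+1:k}F_{it}\right\|^2 + \frac{\gamma_0}{\delta_{NT}^2}\sum_{j=1}^{\hat k_1} w_{0j}\left\|\Lambda_j^{(-1)}\right\| + \frac{1}{\delta_{NT}^2}\sum_{i=2}^{R}\gamma_i\sum_{j=\hat k_1+1}^{k} w_{ij}\left\|\Lambda_{ij}\right\| \tag{135}$$

   where $G_t$ and $F_{it}$ denote the transpose of the $t$-th rows of $G$ and $F_i$ respectively, $\Lambda_j^{(-1)}$ is the $j$-th column of $\Lambda^{(-1)}$, $\Lambda_{ij}$ is the $j$-th column of the $M_i \times k$ matrix $\Lambda_i$ for $i = 2, \ldots, R$, $\gamma_0 = \gamma_{0,NT}$ and $\gamma_i = \gamma_{i,NT}$ are positive tuning parameters that depend on $N$ and $T$, and $w_{0j}$ and $w_{ij}$ are adaptive weights. Let $\hat\Lambda^{(-1)} = [\hat\Lambda_2', \ldots, \hat\Lambda_R']'$, where $\hat\Lambda_i$ denotes the $M_i \times k$ estimated loadings for the $i$-th block ($i = 2, \ldots, R$). The estimated number of global factors $\hat r_0$ is therefore the number of nonzero columns in $\hat\Lambda_{1:\hat k_1}^{(-1)}$. Similarly, the number of local factors $\hat r_i$ is the number of nonzero columns in $\hat\Lambda_{i,\hat k_1+1:k}$ for $i = 2, \ldots, R$, and $\hat r_1 = \hat k_1 - \hat r_0$. $\gamma_1$ can be selected by minimizing the following criterion:

   $$T_1(\gamma_1) = \ln\left(\frac{1}{M_1 T}\sum_{t=1}^{T}\left\|Y_{1t} - \hat\Lambda_1\grave f_t\right\|^2\right) + \hat k_1(\gamma_1)\,\frac{M_1 + T}{M_1 T}\ln\left(\frac{M_1 T}{M_1 + T}\right)\cdot\ln[\ln(M_1)].$$

   Let $\gamma_1^*$ be the selected tuning parameter and $\hat k_1(\gamma_1^*)$ be the estimated $k_1$ based on $\gamma_1^*$. To determine the rest of the tuning parameters, they propose to use the following criterion:

   $$T_2(\gamma_0, \gamma_2, \ldots, \gamma_R) = \ln\left(\frac{1}{(N - M_1)T}\sum_{i=2}^{R}\sum_{t=1}^{T}\left\|Y_{it} - \hat\Lambda_{i,1:\hat k_1}G_t - \hat\Lambda_{i,\hat k_1+1:k}F_{it}\right\|^2\right) + \left(\hat r_0(\gamma_0, \gamma_i, \gamma_1^*) + \sum_{i=2}^{R}\hat r_i(\gamma_0, \gamma_i, \gamma_1^*)\right)\left(\frac{N + T}{NT}\right)\ln\left(\frac{NT}{N + T}\right)\ln[\ln(N)].$$

   Here the selection of the first block is critical. Thus, Han proposes a selection criterion for determining which block is ordered first:

   $$IC(s) = \ln(SSR_s) + \left(\hat r_0 + \sum_{i=1}^{R}\hat r_i\right)\left(\frac{\bar N + T}{\bar N T}\right)\ln\left(\frac{\bar N T}{\bar N + T}\right)\ln(\ln(\bar N)), \quad s = 1, \ldots, R$$

   where $SSR_s$ is the residual sum of squares when the $s$-th block is ordered first and $\bar N = \sum_i M_i/R$. However, as the number of blocks grows, the number of combinations of the tuning parameters becomes huge, resulting in too many candidate models to be estimated, which is computationally very intensive. In the simulations they consider sample sizes such as $R \in \{2, 3, 4, 5\}$, $M \in \{50, 100, 150, 200\}$ and $T \in \{100, 200\}$. They find that the convergence of $\hat r_i$ is slower than that of $\hat r_0$.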

Alternatively, Chen (2012) and Dias et al. (2013) extend the IC by including the number of local factors as arguments in the PC formula. Let $\hat G^{k_0}$ be the $k_0$ estimated global factors, $\hat F_i^{k_i}$ be the estimated local factors and $M_i$ be the number of cross-section units for each block $i = 1, \ldots, R$. Chen suggests modifying the information criterion to

$$IC_{Chen}(k_0, k_1, \ldots, k_R) = \sum_{i=1}^{R}\frac{M_i}{N}V_i\left(\hat G^{k_0}, \hat F_i^{k_i}, k_0, k_i\right) + \hat\sigma\left(\frac{1}{R}\sum_{i=1}^{R}k_i + h + \alpha k_0\right)g(N, T)$$

$$(\hat r_0, \hat r_1, \ldots, \hat r_R) = \arg\min_{0 \leq k_0, k_1, \ldots, k_R \leq r_{\max}} IC_{Chen}(k_0, k_1, \ldots, k_R)$$

where $V_i$ is the residual sum of squares for each block $i$, $\hat\sigma$ is the average variance of the idiosyncratic errors, $h$ and $\alpha < 1$ are fixed scaling parameters and $g(N, T)$ is a penalty function.$^{27}$ The parameter $\alpha$ indicates that global factors are less penalized for their parsimony. In the Monte Carlo simulation by Chen, $R$ varies from 2 to 4, $M$ varies from 30 to 200 and $T$ varies from 80 to 500. This information criterion seems easier to implement, especially when $R$ is small.

27 Chen (2012) also uses this criterion to determine the block membership. In this case $h$ is a function of $M_i$, $N$ and $T$ controlling the dispersion of the blocks. If the membership is known, $h$ is a fixed constant. Then, this can be seen as an extended Bai and Ng information criterion.

However, as $R$ increases, the number of candidates drastically increases and the computation will be almost infeasible. Indeed, (?), (?) and (?) consider only a small number of blocks in their simulation studies as well as empirical applications.

[Dasi] Recently, (?) derives the asymptotic normal distribution of the canonical correlation between factor spaces estimated from different blocks under the null hypothesis that the first $r_0$ canonical correlations are equal to one. They first apply existing information criteria to each block and obtain the estimator $\widehat{r_0 + r_i}$ for $i = 1, 2$. Set $r_{\max} = \min\{\widehat{r_0 + r_1}, \widehat{r_0 + r_2}\}$ and $N^* = \min\{M_1, M_2\}$. They extract $r_{\max}$ PCs from the data matrix $Y_i$ and obtain $\hat K_i$ for $i = 1, 2$. Let $\hat V_{ab}$, $a, b = 1, 2$, be the covariance matrix between $\hat K_a$ and $\hat K_b$ such that $\hat V_{ab} = \hat K_a'\hat K_b/T$. Compute $\hat R = \hat V_{11}^{-1}\hat V_{12}\hat V_{22}^{-1}\hat V_{21}$ and $\hat R^* = \hat V_{22}^{-1}\hat V_{21}\hat V_{11}^{-1}\hat V_{12}$. Define the $r_{\max} \times r$ matrix $\hat W_1$ (resp. $\hat W_2$) which collects the eigenvectors corresponding to the $r$ largest eigenvalues of $\hat R$ (resp. $\hat R^*$) in descending order. Define the $r_{\max} \times (r_{\max} - r)$ matrix $\hat W_1^{-}$ (resp. $\hat W_2^{-}$) which collects the remaining $r_{\max} - r$ eigenvectors corresponding to the remaining eigenvalues of $\hat R$ (resp. $\hat R^*$) in descending order. The global factors can be estimated by either $\hat G = \hat K_1\hat W_1$ or $\hat G = \hat K_2\hat W_2$. The local factors can be estimated by $\hat F_1 = \hat K_1\hat W_1^{-}$ and $\hat F_2 = \hat K_2\hat W_2^{-}$. The factor loadings can be estimated by $\hat\Gamma_i = Y_i'\hat G/T$ and $\hat\Lambda_i = Y_i'\hat F_i/T$ for $i = 1, 2$. Let $\hat D_i = \hat E_i'\hat E_i/T$ where $\hat E_i = Y_i - \hat G\hat\Gamma_i' - \hat F_i\hat\Lambda_i'$ and $\hat\Theta_i = [\hat\Gamma_i, \hat\Lambda_i]$. Define

$$\hat\Sigma_{u,i} = \left(\frac{\hat\Theta_i'\hat\Theta_i}{M_i}\right)^{-1}\left(\frac{\hat\Theta_i'\hat C_i\hat\Theta_i}{M_i}\right)\left(\frac{\hat\Theta_i'\hat\Theta_i}{M_i}\right)^{-1}, \quad i = 1, 2$$

where $\hat C_i$ is an $M_i \times M_i$ diagonal matrix with diagonal elements equal to those of $\hat D_i$. Then the following statistic

$$\tilde\xi(r) = N^*\sqrt{T}\left(\frac{1}{2}\mathrm{tr}\,\hat\Sigma_U^2\right)^{-1/2}\left[\hat\xi(r) - r + \frac{1}{2N^*}\mathrm{tr}\,\hat\Sigma_U^2\right]$$

is asymptotically standard normal. Therefore, the number of global factors can be estimated by the sequence of tests:

$$\hat r_0 = \min\{r \leq r_{\max} : \tilde\xi(r) > -z_{\alpha_{N^*T}}\}$$

where $\hat\xi(r)$ is the sum of the $r$ largest eigenvalues of $\hat R$, $\hat\Sigma_U = (M_2/M_1)\hat\Sigma_{u,1}^{(rr)} + \hat\Sigma_{u,2}^{(rr)}$, and $(rr)$ means the upper-left $(r, r)$ elements of the matrix. $z_{\alpha_{N^*T}}$ is a threshold value depending on $N^*$ and $T$. Given the estimated number of global factors $\hat r_0$, $\hat r_i$ can be obtained by $\widehat{r_0 + r_i} - \hat r_0$, and the numbers of global and local factors are separately identified. In their simulation analysis, they allow for cross-sectional correlation but not serial correlation. For the sample sizes in their simulation study, $M$ varies from 50 to 1000 and $T$ is smaller than $M$ and varies from 50 to 600. When $M$ and $T$ are not large, their estimator $\hat r_0$ does not outperform $IC_{Chen}$ in general. When both $M$ and $T$ are large, $\hat r_0$ converges to its true value. However, their approach only allows the number of blocks to be two.

A sequential approach is developed by (?), who assume that the number of global factors is known a priori so that the global factors can be projected out from the data. Then they apply Bai and Ng type information criteria to each block to estimate the number of local factors. In this paper, we adopt the same sequential approach except that we estimate $r_0$ based on (squared) canonical correlations. We differ from (?) in the sense that no asymptotic distribution of the canonical correlation is needed. In addition, they assume that the idiosyncratic error terms across blocks are uncorrelated, but we do not need this strong assumption.

Lu and Su (2015) consider the problem of determining the number of factors and selecting the proper regressors in linear dynamic panel data models with interactive fixed effects. Based on the preliminary estimates of the slope parameters and factors a la Bai and Ng (2009) and Moon and Weidner (2014a), they propose a method for simultaneous selection of regressors and factors and estimation through the method of adaptive group Lasso. With probability approaching one, this method can correctly select all relevant regressors and factors and shrink the coefficients of irrelevant regressors and redundant factors to zero. Further, the shrinkage estimators of the nonzero slope parameters exhibit some oracle property. Monte Carlo simulations demonstrate the superb finite-sample performance of the proposed method.

5.4 The specifications tests

A number of specification tests have been proposed for testing the validity of cross-section dependence or the presence of interactive effects, e.g. Pesaran (2015), Sarafidis, Yamagata, and Robertson (2009), Bai (2009) and Castagnetti, Rossi, and Trapani (2015). Notice, however, that the rejection of the null hypothesis by these tests does not determine whether the FE estimator is consistent or not under the alternative model with unobserved factors. For instance, Sarafidis, Yamagata, and Robertson (2009) maintain an assumption that factor loadings are uncorrelated under the alternative, whilst the Hausman test proposed by Bai (2009) would have no power when factor loadings are uncorrelated under the alternative hypothesis. This suggests that the presence of correlated loadings emerges as an influential but under-appreciated feature of the panel data model with interactive effects; see also Westerlund and Urbain (2013).

In retrospect, it is rather surprising to find that the literature has been silent on the important issue of testing the validity of uncorrelated factor loadings. In order to fill this gap, as the main contribution of this paper, we proceed to develop a Hausman-type test that determines the validity of correlated factor loadings in cross-sectionally correlated panels. Both the FE and PC estimators are consistent under the null hypothesis of uncorrelated factor loadings whilst only the latter is consistent under the alternative hypothesis of correlated factor loadings. Further, the PC estimator is more efficient even under the null. Based on this observation, we develop two nonparametric variance estimators for the difference between the FE and PC estimators, which are shown to be robust to the presence of heteroscedasticity, autocorrelation and slope heterogeneity. We then show that the proposed test statistic follows the $\chi^2$ distribution asymptotically.

The CD test by Pesaran (2015). We now develop a diagnostic test for the null hypothesis of residual cross-section independence and the estimation of the exponent of cross-sectional dependence in the triple-index panel data models. Those are evaluated using the residuals, which we denote as $e_{ij} = (e_{ij1}, \ldots, e_{ijT})'$.

The proposed cross-section dependence (CD) test is a modified counterpart of an existing CD test proposed by Pesaran (2015). For convenience, we represent $e_{ij}$ as the $(ij)$ pair using the single index $n = 1, \ldots, N_1N_2$, and compute the pair-wise residual correlations between the $n$ and $n'$ cross-section units by

$$\hat\rho_{nn'}\,(= \hat\rho_{n'n}) = \frac{e_n' e_{n'}}{\sqrt{(e_n' e_n)(e_{n'}' e_{n'})}}, \quad n, n' = 1, \ldots, N_1N_2 \text{ and } n \neq n'.$$

Then, we construct the CD statistic by

$$CD = \sqrt{\frac{2}{N_1N_2(N_1N_2 - 1)}}\sum_{n=1}^{N_1N_2 - 1}\sum_{n'=n+1}^{N_1N_2}\sqrt{T}\hat\rho_{nn'} \tag{136}$$
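The statistic in (136) is straightforward to compute once the 3D residuals are stacked over the pair index $n$. The following Python sketch (assuming NumPy; the function name `cd_statistic`, the array layout `(N1, N2, T)`, and the simulated residuals are illustrative assumptions) uses the uncentered correlations exactly as written above.

```python
import numpy as np

def cd_statistic(E):
    """CD statistic for a T x n matrix E of residuals (n = N1 * N2 stacked pairs)."""
    T, n = E.shape
    norms = np.sqrt((E ** 2).sum(axis=0))           # sqrt(e_n' e_n) for each unit
    Z = E / norms
    R = Z.T @ Z                                     # n x n pairwise correlations
    upper = np.triu_indices(n, k=1)                 # pairs with n < n'
    return np.sqrt(2.0 * T / (n * (n - 1))) * R[upper].sum()

rng = np.random.default_rng(4)
N1, N2, T = 5, 6, 50

# Cross-sectionally independent residuals, stored as an (N1, N2, T) array.
e_indep = rng.standard_normal((N1, N2, T))
cd_indep = cd_statistic(e_indep.reshape(N1 * N2, T).T)

# Residuals sharing one common factor across all (i, j) pairs.
f = rng.standard_normal(T)
e_factor = f + rng.standard_normal((N1, N2, T))
cd_factor = cd_statistic(e_factor.reshape(N1 * N2, T).T)

print("CD under independence:", cd_indep)
print("CD with a common factor:", cd_factor)
```

Applying the same function to a 2D slice of the residual array, for example all pairs with a fixed $i$, gives one way to build the hierarchical variants mentioned in the text.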

Pesaran derives that the CD test has the limiting N(0, 1) distribution under thenull hypothesis of cross-sectional error independence, namely H0 : ρnn′ = 0 forall n, n′ = 1, ..., N1N2 and n 6= n′. Following BKP, Pesaran further shows thatthe CD statistic, (136) can also be applicable to testing the null hypothesis ofweak cross-sectional error dependence.28 As an extension one can also constructhierarchical CD tests based on 2D sub-dataset out of the 3D dataset.Following BKP, we introduce the exponent of cross-sectional dependence

based on the double cross sectional averages defined as u..t = (N1N2)−1∑N1

i=1

∑N2

j=1 uijt.If uijt’s are cross sectionally correlated across (i, j) pairs, V ar (u..t) declines ata rate that is a function of α, where α is defined as

limN1N2→∞

(N1N2)−α

λmax (Σu)

and Σu is the N1N2 ×N1N2 covariance matrix of ut = (u11t, ..., uN1N2t)′ with

λmax (Σu) denoting the largest egeinvalue. Clearly, V ar (u..t) cannot declineat rate faster than (N1N2)

−1 as well as it cannot decline at rate slower than(N1N2)

α−1 with 0 ≤ α ≤ 1. Given that

V ar (u..t) ≤ (N1N2)−1λmax (Σu) ,

28 Consider the effects of the $j$th factor on the $i$th error, and suppose that the factor loadings take nonzero values for $M_j$ out of the $N$ cross-section units. The degree of cross-sectional dependence due to the $j$th factor can be measured by $\alpha_j = \ln(M_j)/\ln(N)$, and the overall degree of CSD by $\alpha = \max_j \alpha_j$, with $\alpha \in [0, 1]$ and 1 indicating the highest CSD. Then, the implicit null is given by $H_0^w: \alpha < 1/2$. See BKP for more details.


we find that $\alpha$ defined by $(N_1N_2)^{-1}\lambda_{\max}(\Sigma_u) = O\left((N_1N_2)^{\alpha-1}\right)$ will provide an upper bound for $Var(\bar u_{..t})$. The extent of CSD depends on the nature of the factor loadings in the following factor-based errors:

\[ u_{ijt} = \pi_{ij}\lambda_t + \varepsilon_{ijt}. \]

If the average of the heterogeneous loading parameters, $\pi_{ij}$, denoted $\mu_\pi$, is bounded away from zero, the cross-sectional dependence will be strong, in which case $(N_1N_2)^{-1}\lambda_{\max}(\Sigma_u)$ and $Var(\bar u_{..t})$ are both $O(1)$, which yields $\alpha = 1$.

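Furthermore, for $1/2 < \alpha \le 1$, BKP propose the following bias-adjusted estimator to consistently estimate $\alpha$:

\[ \hat\alpha = 1 + \frac{1}{2}\frac{\ln\left(\hat\sigma^2_{\bar u_{..t}}\right)}{\ln(N_1N_2)} - \frac{\ln\left(\hat\mu^2_\pi\right)}{2\ln(N_1N_2)} - \frac{\hat c_{N_1N_2}}{2\left[N_1N_2 \ln(N_1N_2)\,\hat\sigma^2_{\bar u_{..t}}\right]} \tag{137} \]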

where $\hat c_{N_1N_2} = \bar\sigma^2_{N_1N_2} = (N_1N_2)^{-1}\sum_{i=1}^{N_1}\sum_{j=1}^{N_2}\hat\sigma^2_{ij}$, and $\hat\sigma^2_{ij} = T^{-1}\sum_{t=1}^{T}\hat\varepsilon^2_{ijt}$ is the $ij$th diagonal element of the estimated covariance matrix $\hat\Sigma_\varepsilon$, with $\hat\varepsilon_{ijt} = u_{ijt} - \hat\delta_{ij}\bar u_{..t}$ and $\hat\delta_{ij}$ being the OLS estimator from the regression of $u_{ijt}$ on $\bar u_{..t}$.

BKP also discuss how a suitable estimator of $\mu^2_\pi$ can be derived, noticing that $\mu_\pi$ is the mean of the population regression coefficient of $u_{ijt}$ on $\tilde u_{..t} = \bar u_{..t}/\sigma_{\bar u_{..t}}$ for units $u_{ijt}$ that have at least one non-zero loading; those units are selected using Holm's (1979) multiple testing approach.

In the empirical section we apply the above 3D extension of the BKP estimation and testing techniques directly to the residuals $e_{ijt}$. We also evaluate the confidence band for the estimated CSD exponent by employing the test statistic defined in (B47) in BKP's Supplementary Appendix VI.

The Hausman test by Bai (2009) for testing additive versus interactive effects. There exist two methods to evaluate which specification (fixed effects or interactive effects) gives a better description of the data. The first method is that of the Hausman test statistic (Hausman (1978)) and the second is based on the number of factors. We detail the Hausman test method. For simplicity, we assume $\varepsilon_{it}$ are i.i.d. over $i$ and $t$, and that $E(\varepsilon^2_{it}) = \sigma^2$.

The null hypothesis is an additive-effects model

\[ Y_{it} = X_{it}\beta + \alpha_i + \xi_t + \mu + \varepsilon_{it} \tag{138} \]

with restrictions $\sum_{i=1}^{N}\alpha_i = 0$ and $\sum_{t=1}^{T}\xi_t = 0$ due to the grand mean parameter $\mu$. The alternative hypothesis is

\[ Y_{it} = X_{it}\beta + \lambda_i' F_t + \varepsilon_{it} \tag{139} \]

The null model is nested in the general model with $\lambda_i = (\alpha_i, 1)'$ and $F_t = (1, \xi_t + \mu)'$.

The interactive-effects estimator for $\beta$ is consistent under both models (138) and (139), but is less efficient than the least squares dummy-variable estimator for model (138), as the latter imposes restrictions on factors and factor loadings. But the fixed-effects estimator is inconsistent under model (139). The principle of the Hausman test is applicable here.

The within-group estimator of $\beta$ in (138) satisfies

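\[ \sqrt{NT}\left(\hat\beta_{FE} - \beta\right) = \left(\frac{1}{NT}\sum_{i=1}^{N}\dot X_i'\dot X_i\right)^{-1}\left(\frac{1}{\sqrt{NT}}\sum_{i=1}^{N}\dot X_i'\varepsilon_i\right) \]

where $\dot X_i = X_i - \iota_T\bar X_{i.} - \bar X + \iota_T\bar X_{..}$ with $\bar X = (\bar X_{.1}, \ldots, \bar X_{.T})'$. Rewrite the fixed-effects estimator more compactly as

\[ \sqrt{NT}\left(\hat\beta_{FE} - \beta\right) = C^{-1}\psi, \quad C = \frac{1}{NT}\sum_{i=1}^{N}\dot X_i'\dot X_i, \quad \psi = \frac{1}{\sqrt{NT}}\sum_{i=1}^{N}\dot X_i'\varepsilon_i. \]

The interactive-effects estimator can be written as (see PROPOSITION A.3)

\[ \sqrt{NT}\left(\hat\beta_{IE} - \beta\right) = D(F^0)^{-1}(\eta - \xi) + o_p(1) \]

where

\[ D(F) = \frac{1}{NT}\sum_{i=1}^{N} Z_i'Z_i = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{T}\sum_{t=1}^{T} Z_{it}Z_{it}', \qquad Z_i = M_F X_i - \frac{1}{N}\sum_{k=1}^{N} M_F X_k a_{ik}, \]

\[ \eta = \frac{1}{\sqrt{NT}}\sum_{i=1}^{N} X_i' M_{F^0}\varepsilon_i, \qquad \xi = \frac{1}{\sqrt{NT}}\sum_{i=1}^{N}\left(\frac{1}{N}\sum_{k=1}^{N} a_{ik}X_k' M_{F^0}\right)\varepsilon_i. \]

The variances of the two estimators are

\[ var\left(\sqrt{NT}(\hat\beta_{FE} - \beta)\right) = \sigma^2 C^{-1}; \qquad var\left(\sqrt{NT}(\hat\beta_{IE} - \beta)\right) = \sigma^2 D(F^0)^{-1}. \]

We show, under the null hypothesis of additivity,

\[ E\left[(\eta - \xi)\psi'\right] = \sigma^2 D(F^0). \]

This implies

\[ Var\left(\hat\beta_{IE} - \hat\beta_{FE}\right) = var\left(\hat\beta_{IE}\right) - var\left(\hat\beta_{FE}\right). \]

Since the variance of the difference is therefore $\sigma^2\left[D(F^0)^{-1} - C^{-1}\right]/(NT)$, the Hausman test takes the form

\[ J = NT\sigma^{-2}\left(\hat\beta_{IE} - \hat\beta_{FE}\right)'\left[D(F^0)^{-1} - C^{-1}\right]^{-1}\left(\hat\beta_{IE} - \hat\beta_{FE}\right) \to_d \chi^2. \]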

REMARK 9: The Hausman test is also applicable when there are no time effects but only individual effects (i.e., $\xi_t = 0$). Then it is testing whether the individual effects are time-varying. Similarly, the Hausman test is applicable when $\alpha_i = 0$ in (138) but $\xi_t \neq 0$. Then it is testing whether the common shocks have heterogeneous effects on individuals. Details are given in the Supplemental Materials.

Testing for correlated factor loadings by KSS. A number of specification tests have been proposed to test the validity of the cross-section dependence or the presence of multiplicative interactive effects in panels. The most popular test is the so-called CD test proposed by Pesaran (2015). However, the CD test fails to reject the null hypothesis of no error CSD when the factor loadings have zero means, implying that the CD test will display very poor power when it is applied to cross-sectionally demeaned data. Sarafidis, Yamagata, and Robertson (2009) propose an alternative testing procedure for homogeneous factor loadings after estimating a linear dynamic panel data model by GMM. This approach is valid only when $N$ is large relative to $T$, but it can be applied to testing for any error CSD remaining after including time dummies. However, they still maintain the assumption that the factor loadings are uncorrelated. If the factor loadings are correlated under the alternative hypothesis, the proposed GMM-based test will be invalid because the GMM estimator is no longer consistent under the alternative.

The PC estimator is consistent both under models with additive effects and with interactive effects, but less efficient than the FE estimator under the null model with additive effects only. On the other hand, the FE estimator is inconsistent under the alternative model with interactive effects and correlated factor loadings. Following this idea, Bai (2009) advances a Hausman test for testing the null hypothesis of additive effects against the alternative of interactive effects. Focussing on the special cases, Castagnetti, Rossi, and Trapani (2015) propose two tests for the null of no factor structure: one for the null that factor loadings are cross-sectionally homogeneous, and another for the null that common factors are homogeneous over time. Using extremes of the estimated loadings and common factors, they show that their statistics follow an asymptotic Gumbel distribution under the null.

The conventional wisdom is that if the null hypothesis of no error CSD is rejected, the standard FE estimator would be invalid due to ignoring the potential endogeneity arising from the correlation between regressors and unobserved factors. We have shown that the presence of CSD does not always imply that the FE estimator is inconsistent, even in the presence of an unobserved factor structure. In particular, if the factor loadings are uncorrelated, the FE estimator is still consistent. More importantly, the FE estimator avoids any issue related to selecting the correct number of unobserved factors, which has been shown to significantly affect the performance of both CCE and PC estimators.

In this regard, it is rather surprising to find that the literature has been silent on the important issue of testing the validity of uncorrelated factor loadings in panels with interactive effects. Notice that the Hausman test developed by Bai (2009) would be valid only if regressors are correlated with both factors and loadings. Further, the Hausman test has no power if the factor loadings are uncorrelated under the alternative model with interactive effects. This raises a potentially important research question. For large $T$, without loss of generality, we suppose that $f_t$ represents the unobserved common policy or globalisation trend, and $\gamma_i$ are the associated heterogeneous individual responses (parameters). In this context, it is natural to allow for $x_{it}$ to be correlated with $f_t$ to avoid the potential omitted variables bias. But it still remains an important issue to test whether $x_{it}$ are correlated with $\gamma_i$ or not.

Given the pervasive evidence of cross-sectionally dependent errors in panels, we proceed to develop a Hausman-type test that determines the presence of uncorrelated factor loadings in cross-sectionally correlated panels. Consider the following panel data model with interactive effects:

\[ y_{it} = \alpha_i + \beta_i' x_{it} + \gamma_i' f_t + \varepsilon_{it} \tag{140} \]

\[ x_{it} = b_i + \Gamma_i' f_t + v_{it} \tag{141} \]

where $y_{it}$ is the dependent variable of the $i$-th cross-sectional unit in period $t$, and $x_{it}$ is the $k \times 1$ vector of covariates with $\beta_i$ the $k \times 1$ vector of parameters. $\alpha_i$ and $b_i$ are unobserved individual effects, and $\varepsilon_{it}$ and $v_{it}$ are idiosyncratic errors. $f_t$ is an $r \times 1$ vector of unobserved common factors while $\gamma_i$ and $\Gamma_i$ are random heterogeneous loadings. Recall that both the two-way FE estimator and the PC estimator are consistent under the null hypothesis that the factor loadings, $\gamma_i$ and $\Gamma_i$, are uncorrelated. Only the latter is consistent under the alternative of correlated loadings. Following this idea, we propose the Hausman-type test based on the difference between the FE and PC estimators:

\[ H = \left(\hat\beta_{FE} - \hat\beta_{PC}\right)' V^{-1}\left(\hat\beta_{FE} - \hat\beta_{PC}\right) \tag{142} \]

where $\hat\beta_{PC}$ is the PC estimator to be defined below, and

\[ V = Var\left(\hat\beta_{FE} - \hat\beta_{PC}\right) = Var\left(\hat\beta_{FE}\right) + Var\left(\hat\beta_{PC}\right) - 2Cov\left(\hat\beta_{FE}, \hat\beta_{PC}\right). \]

Notice that both estimators are consistent under the null hypothesis, but the PC estimator is more efficient than the FE estimator even under the null. This implies that

\[ Var\left(\hat\beta_{FE} - \hat\beta_{PC}\right) \neq Var\left(\hat\beta_{FE}\right) - Var\left(\hat\beta_{PC}\right) \]

in contrast to the well-established finding in Hausman (1978). We interpret (142) as a test for correlated factor loadings in panels with interactive effects.

Next, we propose two robust versions of the variance estimator as follows:

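\[ \hat V_{NON}\left(\hat\beta_{PC}\right) = \left(\sum_{i=1}^{N} X_i' M_F X_i\right)^{-1}\left(\sum_{i=1}^{N}\left(X_i' M_F X_i\right)\left(\hat\beta_{PC,i} - \hat\beta_{PC}\right)\left(\hat\beta_{PC,i} - \hat\beta_{PC}\right)'\left(X_i' M_F X_i\right)\right)\left(\sum_{i=1}^{N} X_i' M_F X_i\right)^{-1} \tag{143} \]

where $\hat\beta_{PC,i} = \left(X_i' M_F X_i\right)^{-1} X_i' M_F y_i$, and

\[ \hat V_{HAC}\left(\hat\beta_{PC}\right) = \left(\sum_{i=1}^{N} X_i' M_F X_i\right)^{-1}\left(\sum_{i=1}^{N} X_i'\hat V_i\hat V_i' X_i\right)\left(\sum_{i=1}^{N} X_i' M_F X_i\right)^{-1} \tag{144} \]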


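where $\hat V_i = y_i - X_i\hat\beta_{PC}$. Having established that the two versions of the robust estimator can consistently standardise the estimator, we propose to account for the covariance $Cov\left(\hat\beta_{FE}, \hat\beta_{PC}\right)$ by setting

\[ \hat C_{NON}\left(\hat\beta_{FE}, \hat\beta_{PC}\right) = \left(\sum_{i=1}^{N} X_i' X_i\right)^{-1}\left(\sum_{i=1}^{N}\left(X_i' X_i\right)\left(\hat\beta_{FE,i} - \hat\beta_{FE}\right)\left(\hat\beta_{PC,i} - \hat\beta_{PC}\right)'\left(X_i' M_F X_i\right)\right)\left(\sum_{i=1}^{N} X_i' M_F X_i\right)^{-1} \]

and

\[ \hat C_{HAC}\left(\hat\beta_{FE}, \hat\beta_{PC}\right) = \left(\sum_{i=1}^{N} X_i' X_i\right)^{-1}\left(\sum_{i=1}^{N} X_i'\hat u_i\hat V_i' X_i\right)\left(\sum_{i=1}^{N} X_i' M_F X_i\right)^{-1}. \]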

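Accordingly, we define two operating versions of the Hausman-type statistic by

\[ H_{NON} = \left(\hat\beta_{FE} - \hat\beta_{PC}\right)'\left(\hat V_{NON}\right)^{-1}\left(\hat\beta_{FE} - \hat\beta_{PC}\right) \tag{145} \]

\[ H_{HAC} = \left(\hat\beta_{FE} - \hat\beta_{PC}\right)'\left(\hat V_{HAC}\right)^{-1}\left(\hat\beta_{FE} - \hat\beta_{PC}\right) \tag{146} \]

where

\[ \hat V_{NON} = \hat V_{NON}\left(\hat\beta_{FE}\right) + \hat V_{NON}\left(\hat\beta_{PC}\right) - 2\hat C_{NON}\left(\hat\beta_{FE}, \hat\beta_{PC}\right) \tag{147} \]

\[ \hat V_{HAC} = \hat V_{HAC}\left(\hat\beta_{FE}\right) + \hat V_{HAC}\left(\hat\beta_{PC}\right) - 2\hat C_{HAC}\left(\hat\beta_{FE}, \hat\beta_{PC}\right) \tag{148} \]

We provide the main result in the following Theorem.

Theorem 9 Under Assumption A, as $N, T \to \infty$,

\[ H_j \to_d \chi^2_k \quad \text{for } j = NON, HAC. \]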
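Once the two estimates and their variance and covariance estimates are available, the quadratic form in (145) and (146) is a short computation. The sketch below is illustrative only: `hausman_type_stat` is a hypothetical name, the inputs are assumed to have been computed elsewhere, and the covariance term is symmetrised as $C + C'$, which coincides with $2C$ in (147) and (148) whenever $C$ is symmetric.

```python
import numpy as np

def hausman_type_stat(beta_fe, beta_pc, var_fe, var_pc, cov_fe_pc):
    """Quadratic form d' V^{-1} d with d = beta_fe - beta_pc and
    V = Var(FE) + Var(PC) - (C + C'); assumption: C + C' stands in for 2C."""
    d = np.asarray(beta_fe, dtype=float) - np.asarray(beta_pc, dtype=float)
    c = np.asarray(cov_fe_pc, dtype=float)
    v = np.asarray(var_fe, dtype=float) + np.asarray(var_pc, dtype=float) - c - c.T
    # Solve V x = d rather than forming the inverse explicitly
    return float(d @ np.linalg.solve(v, d))
```

By Theorem 9 the resulting value would be compared with critical values of the $\chi^2_k$ distribution, where $k$ is the number of slope coefficients.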

There is ample evidence, as will be seen in the Monte Carlo section, that the CCE and the FE estimators perform comparably to the PC estimator in small samples, if the null hypothesis is not rejected. Hence, the proposed Hausman-type test statistic has a clear operational rationale in practice.

We present the estimation and test results. First of all, the test results for both $H_{NON}$ and $H_{HAC}$ provide a surprisingly convincing finding that the null of uncorrelated factor loadings is rejected (at levels of significance greater than 1%) only in one out of six cases considered. We also report the results for the CD test proposed by Pesaran (2015), which tests the null of no (weak) CSD against the alternative of strong CSD, and the Hausman test proposed by Bai (2009), which tests the null of additive effects against the alternative of interactive effects. The CD test strongly rejects the null hypothesis for all the datasets whilst the $H_{Bai}$ test never rejects the null hypothesis of the additive-effects model.

These test results are rather in conflict, since the former suggests the presence of interactive effects while the latter suggests no such effects. As highlighted, however, the rejection of the CD test does not always imply that the FE estimator is biased even in panels with interactive effects. Further, we note that the $H_{Bai}$ test has no power against the alternative model with interactive effects if the factor loadings are uncorrelated. Indeed, such conflicting results could provide support for the view that the panels are subject to interactive effects but factor loadings are uncorrelated.

5.5 Dynamic panels

Chudik and Pesaran (2015) develop the dynamic Mean Group CCE (CCEMG) estimator where the unobserved factors are approximated by a set of cross-sectional averages of $y_{it}$ and $x_{it}$ including their lagged values. A bias-corrected CCEMG estimator is also presented for panel ARDL models under different assumptions concerning the parameter values and the degree of cross-sectional dependence. They compare the robustness of the quasi maximum likelihood estimator developed by Moon and Weidner (2015), the IPC estimator by (?), and the alternative MG estimator based on Song's (2013) extension of the IPC approach. Monte Carlo simulation results confirm that the dynamic CCEMG estimator is valid asymptotically when the following two conditions hold: (i) a sufficient number of lagged cross-sectional averages is included in the individual augmented regression, and (ii) the number of cross-section averages is at least as large as the number of unobserved common factors. If the parameter of interest is the mean of the parameters on the weakly exogenous regressors, the CCEMG estimator performs relatively well, even in relatively small samples. However, if the parameter of interest is the mean coefficient of the lagged dependent variable, the CCEMG estimator does not perform well. Similar findings are also provided by De Vos and Everaert (2018).

Recently, Norkute, Sarafidis, Yamagata and Cui (2019) consider a two-step IV estimator for models with homogeneous and heterogeneous slope coefficients where the instruments are obtained from within the model. In the case with homogeneous parameters they implement a two-step estimator. The first-step IV estimator is obtained using the defactored covariates as instruments. In the second step, the entire model is defactored by the extracted factors obtained from the residuals of the first-step estimation. Subsequently, they apply the final IV estimator on the basis of the transformed variables. Interestingly, they also mention that they have applied a two-way FE transformation in the first step. In particular, when factor loadings are correlated and slope parameters are heterogeneous, they find that the proposed IV estimators outperform the CCEMG estimator by Chudik and Pesaran (2015) and the QML estimator by Moon and Weidner (2017).

• More to come!


6 The Spatial-based Models for CSD

In general, an important research issue is to model the spatial dependence, the spatial heterogeneity and nonlinearity simultaneously.

6.1 The Spatial Autoregressive (SAR) Process

Some models for cross-sectional data may capture spatial interactions across spatial units. Consider the first-order spatial autoregressive (SAR) process:

\[ y_i = \lambda w_{i,n} Y_n + \varepsilon_i, \quad i = 1, \ldots, n, \tag{149} \]

where $Y_n = (y_1, \ldots, y_n)'$ is an $n \times 1$ vector of the dependent variable, $w_{i,n}$ is a $1 \times n$ row vector of weights, and $\varepsilon_i \sim iid(0, \sigma^2)$. We write (149) in matrix form:

\[ Y_n = \lambda W_n Y_n + E_n \tag{150} \]

where $W_n Y_n$ is 'the spatial lag'. Under the assumption that $S_n(\lambda) = I_n - \lambda W_n$ is nonsingular, we have:

\[ Y_n = S_n(\lambda)^{-1} E_n. \]

Also consider the regression model with SAR disturbance:

\[ Y_n = X_n\beta + U_n, \quad U_n = \rho W_n U_n + E_n \tag{151} \]

The disturbances in $U_n$ are spatially correlated. The variance matrix of $U_n$ is $\sigma^2 S_n(\rho)^{-1} S_n(\rho)^{-1\prime}$. As the off-diagonal elements of $S_n(\rho)^{-1} S_n(\rho)^{-1\prime}$ may be nonzero, the $u_i$'s are cross-sectionally correlated across units.

Spatial autoregressive model with covariates. This generalises SAR by incorporating exogenous variables $x_i$:

\[ Y_n = \lambda W_n Y_n + X_n\beta + E_n \tag{152} \]

where $E_n \sim iid(0, \sigma^2 I_n)$. This model has the feature of a simultaneous equation model and its reduced form is:

\[ Y_n = S_n(\lambda)^{-1} X_n\beta + S_n(\lambda)^{-1} E_n. \]
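The reduced form lends itself to a direct simulation. The following is a minimal sketch, assuming a dense NumPy weights matrix; `row_normalize` and `sar_reduced_form` are hypothetical helper names introduced only for illustration.

```python
import numpy as np

def row_normalize(w):
    """Scale each row of a nonnegative weights matrix to sum to one
    (assumes every unit has at least one neighbour)."""
    w = np.asarray(w, dtype=float)
    return w / w.sum(axis=1, keepdims=True)

def sar_reduced_form(w, x, beta, lam, eps):
    """Return Y = S(lam)^{-1} (X beta + eps), with S(lam) = I - lam * W."""
    n = w.shape[0]
    s = np.eye(n) - lam * w
    # Solving the linear system avoids forming S^{-1} explicitly
    return np.linalg.solve(s, x @ beta + eps)
```

The simulated vector satisfies the structural equation (152) by construction, which gives a quick consistency check on any estimator coded against this model.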

Some Intuitions on Spatial Weights Matrix, $W_n$. The $j$th element of $w_{i,n}$, $w_{n,ij}$, represents the link (or distance) between the neighbor $j$ and the spatial unit $i$. The diagonal of $W_n$ is specified to be zero, i.e., $w_{n,ii} = 0$ for all $i$, because $\lambda w_{i,n} Y_n$ represents the effect of other spatial units on the spatial unit $i$. It is common practice for $W_n$ to have a zero diagonal and to be row-normalized such that the elements in each row of $W_n$ sum to unity. In some applications, the $i$th row $w_{i,n}$ of $W_n$ may be constructed as $w_{i,n} = (d_{i1}, d_{i2}, \ldots, d_{in})/\sum_{j=1}^{n} d_{ij}$, where $d_{ij} \ge 0$ represents a function of the spatial distance between $i$ and $j$ in some space. The weighting operation may be interpreted as an average of neighboring values.

When neighbors are defined as adjacent ones for each unit, the correlation is local in the sense that correlations will be stronger for neighbors but weak for units far away. Suppose that $\|\rho W_n\| < 1$ for a matrix norm $\|\cdot\|$; then

\[ S_n(\rho)^{-1} = I_n + \sum_{i=1}^{\infty}\rho^i W_n^i. \]

Notice that, since $\sum_{i=m}^{\infty}\rho^i W_n^i = (\rho W_n)^m S_n(\rho)^{-1}$,

\[ \left\|\sum_{i=m}^{\infty}\rho^i W_n^i\right\| \le \|\rho W_n\|^m \left\|S_n(\rho)^{-1}\right\|. \]

If $W_n$ is row-normalized, then

\[ \left\|\sum_{i=m}^{\infty}\rho^i W_n^i\right\|_{\infty} \le \sum_{i=m}^{\infty}|\rho|^i = \frac{|\rho|^m}{1 - |\rho|} \]

will be small as $m$ gets larger. $U_n$ can be represented as

\[ U_n = E_n + \rho W_n E_n + \rho^2 W_n^2 E_n + \ldots, \]

where $\rho W_n$ may represent the influence of neighbors on each unit, $\rho^2 W_n^2$ is the second-layer neighborhood influence, etc. In the social interactions literature, $W_n S_n(\rho)^{-1}$ gives measures of centrality, which summarise the position of each spatial unit in a network.

In conventional spatial cases, neighbor units are defined by only a few adjacent ones. However, there are cases where 'neighbors' may consist of many units. An example is a social interactions model, where 'neighbors' refer to individuals in the same group. The latter may be regarded as models with large group interactions. For models with a large number of interactions for each unit, the spatial weights matrix $W_n$ will depend on the sample size. Suppose that there are $R$ groups and $m$ individuals in each group, with the sample size $n = mR$. In a special network (e.g., friendship), one may assume that each individual in a group is given an equal weight. In that case,

\[ W_n = I_R \otimes B_m \quad \text{with} \quad B_m = \left(\ell_m\ell_m' - I_m\right)/(m - 1), \]

where $\ell_m$ is the $m$-dimensional vector of ones. In general, the number of members in each group may be large and the groups may have different sizes. This model has many interesting applications in social interactions.
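The equal-weight group structure $W_n = I_R \otimes B_m$ can be built directly with a Kronecker product. The sketch below assumes equal group sizes, as in the formula above; `group_weights` is a hypothetical name.

```python
import numpy as np

def group_weights(r, m):
    """Block-diagonal weights for R groups of m members each:
    every member gives weight 1/(m-1) to each other member of its group."""
    b = (np.ones((m, m)) - np.eye(m)) / (m - 1)
    return np.kron(np.eye(r), b)
```

The result has a zero diagonal and unit row sums, so it is already row-normalized; its nonzero entries shrink at rate $1/(m-1)$ as groups grow, which is the sense in which the weights depend on the sample size.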

Other generalizations. We may combine the SAR equation with SAR disturbances:

\[ Y_n = \lambda W_n Y_n + X_n\beta + U_n, \quad U_n = \rho M_n U_n + E_n \tag{153} \]

where $W_n$ and $M_n$ are spatial weights matrices, which may not be identical.

A further extension of the SAR model may allow high-order spatial lags as in

\[ Y_n = \sum_{j=1}^{p}\lambda_j W_{jn} Y_n + X_n\beta + E_n \tag{154} \]

where the $W_{jn}$'s are $p$ distinct spatial weights matrices.

6.2 Estimation Methods

We consider the QML, the 2SLS (IV), and the generalized method of moments (GMM). The QML usually has good finite sample properties. However, the ML method is not computationally attractive for the model with higher-order spatial lags, in which case the IV and GMM methods may be feasible (Lee, 2007).

MLE. For the SAR process in (149), we have the log-likelihood function:

\[ \ln L_n(\lambda, \sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 + \ln|S_n(\lambda)| - \frac{1}{2\sigma^2} Y_n' S_n(\lambda)' S_n(\lambda) Y_n, \tag{155} \]

where $S_n(\lambda) = I_n - \lambda W_n$. For the model with SAR disturbances, (151), the log-likelihood function is

\[ \ln L_n(\rho, \beta, \sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 + \ln|S_n(\rho)| - \frac{1}{2\sigma^2}(Y_n - X_n\beta)' S_n(\rho)' S_n(\rho)(Y_n - X_n\beta). \tag{156} \]

The log-likelihood function for the SAR model with covariates in (152) is

\[ \ln L_n(\lambda, \beta, \sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 + \ln|S_n(\lambda)| - \frac{1}{2\sigma^2}\left(S_n(\lambda) Y_n - X_n\beta\right)'\left(S_n(\lambda) Y_n - X_n\beta\right). \tag{157} \]

The likelihood function involves the computation of the determinant of $S_n(\lambda)$, which is a function of the unknown parameter $\lambda$ and may have a large dimension $n$. A computationally tractable method is due to Ord (1975), where $W_n$ is a row-normalized weights matrix with

\[ W_n = D_n W_n^*, \]

where $W_n^*$ is a symmetric matrix and $D_n = \mathrm{diag}\left\{\sum_{j=1}^{n} w^*_{n,ij}\right\}^{-1}$. Though $W_n$ is not symmetric, the eigenvalues of $W_n$ are still all real. This is because

\[
\begin{aligned}
|W_n - \nu I_n| &= |D_n W_n^* - \nu I_n| \\
&= \left|D_n W_n^* D_n^{1/2} D_n^{-1/2} - \nu D_n^{1/2} D_n^{-1/2}\right| \\
&= \left|D_n^{1/2}\right|\left|D_n^{1/2} W_n^* D_n^{1/2} - \nu I_n\right|\left|D_n^{-1/2}\right| \\
&= \left|D_n^{1/2} W_n^* D_n^{1/2} - \nu I_n\right|.
\end{aligned}
\]

Thus, the eigenvalues of $W_n$ are the same as those of $D_n^{1/2} W_n^* D_n^{1/2}$, which is a symmetric matrix. As the eigenvalues of a symmetric matrix are real, the eigenvalues of $W_n$ are real. Let $\mu_i$'s be the eigenvalues of $W_n$, which are the same as those of $D_n^{1/2} W_n^* D_n^{1/2}$. Let $\Gamma$ be the orthogonal matrix such that

\[ D_n^{1/2} W_n^* D_n^{1/2} = \Gamma\,\mathrm{diag}\{\mu_i\}\,\Gamma'. \]

The above relations show that

\[
\begin{aligned}
|I_n - \lambda W_n| &= |I_n - \lambda D_n W_n^*| \\
&= \left|D_n^{1/2} D_n^{-1/2} - \lambda D_n^{1/2} D_n^{1/2} W_n^* D_n^{1/2} D_n^{-1/2}\right| \\
&= \left|D_n^{1/2}\right|\left|I_n - \lambda D_n^{1/2} W_n^* D_n^{1/2}\right|\left|D_n^{-1/2}\right| \\
&= \left|I_n - \lambda D_n^{1/2} W_n^* D_n^{1/2}\right| = \left|I_n - \lambda\Gamma\,\mathrm{diag}\{\mu_i\}\,\Gamma'\right| \\
&= \left|I_n - \lambda\,\mathrm{diag}\{\mu_i\}\right| = \prod_{i=1}^{n}(1 - \lambda\mu_i).
\end{aligned}
\]

Thus, $|I_n - \lambda W_n|$ can be easily updated during iterations within a maximization subroutine, as the $\mu_i$'s need be computed only once.

Another tractable method is the characteristic polynomial. The determinant $|W_n - \mu I_n|$ is a polynomial in $\mu$ and is called the characteristic polynomial of $W_n$ (the zeros of the characteristic polynomial are the eigenvalues of $W_n$). Thus,

\[ |I_n - \lambda W_n| = a_n\lambda^n + \ldots + a_1\lambda + a_0, \]

where the constants $a_j$ depend only on $W_n$. So the $a_j$'s need to be computed only once during the maximization algorithm.
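The eigenvalue route to the log-determinant can be sketched as follows, together with the log-likelihood (155) for the pure SAR process. This is an illustration under stated assumptions: `w_star` is a symmetric nonnegative matrix with positive row sums, $1 - \lambda\mu_i > 0$ for all $i$ (which holds for $|\lambda| < 1$ when $W_n$ is row-normalized), and the function names are hypothetical.

```python
import numpy as np

def ord_eigenvalues(w_star):
    """Eigenvalues of W = D W*, obtained from the symmetric matrix
    D^{1/2} W* D^{1/2}, where D = diag(1 / row sums of W*)."""
    half = np.sqrt(1.0 / w_star.sum(axis=1))
    sym = half[:, None] * w_star * half[None, :]
    return np.linalg.eigvalsh(sym)

def log_det_s(lam, mu):
    """ln|I - lam W| = sum_i ln(1 - lam * mu_i); mu is computed once."""
    return float(np.sum(np.log(1.0 - lam * mu)))

def sar_loglik(lam, sigma2, y, w, mu):
    """Log-likelihood (155) of the pure SAR process Y = lam W Y + E."""
    n = y.shape[0]
    sy = y - lam * (w @ y)  # S(lam) Y
    return (-0.5 * n * np.log(2.0 * np.pi) - 0.5 * n * np.log(sigma2)
            + log_det_s(lam, mu) - (sy @ sy) / (2.0 * sigma2))
```

Inside an optimiser only `log_det_s` and the quadratic form are re-evaluated for each trial value of $\lambda$; the eigenvalues are reused.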

2SLS estimation. For the SAR model with covariates, (152), the spatial lag $W_n Y_n$ can be correlated with the disturbance $E_n$. So OLS may not be a consistent estimator. However, there is a class of spatial $W_n$ (with large group interaction) for which the OLS estimator can be consistent (Lee 2002).

To avoid the bias due to the correlation of $W_n Y_n$ with $E_n$, Kelejian and Prucha (1998) suggested the use of instrumental variables. Let $Q_n$ be a matrix of IVs. Denote $Z_n = (W_n Y_n, X_n)$ and $\theta = (\lambda, \beta')'$, and rewrite (152) as

\[ Y_n = Z_n\theta + E_n \tag{158} \]

The 2SLS estimator of $\theta$ is

\[ \hat\theta_{2SLS} = \left[Z_n' Q_n\left(Q_n' Q_n\right)^{-1} Q_n' Z_n\right]^{-1}\left[Z_n' Q_n\left(Q_n' Q_n\right)^{-1} Q_n' Y_n\right]. \]

The asymptotic distribution of $\hat\theta_{2SLS}$ follows:

\[ \sqrt{n}\left(\hat\theta_{2SLS} - \theta\right) \to_d N\left(0, \sigma^2\lim_{n\to\infty}\left\{\frac{1}{n}(G_n X_n\beta, X_n)' Q_n\left(Q_n' Q_n\right)^{-1} Q_n'(G_n X_n\beta, X_n)\right\}^{-1}\right) \]

where $G_n = W_n S_n^{-1}$, under the assumption that the limiting matrix $\frac{1}{n}(G_n X_n\beta, X_n)$ has the full column rank $(k+1)$, where $k$ is the number of columns of $X_n$. Kelejian and Prucha (1998) suggest the use of linearly independent variables in $(X_n, W_n X_n)$ for the construction of $Q_n$.
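A minimal sketch of the 2SLS estimator for (152) is given below. The instrument set $Q_n = (X_n, W_nX_n, W_n^2X_n)$ is an illustrative choice in the spirit of Kelejian and Prucha (1998), not a prescription from the text, and `sar_2sls` is a hypothetical name. The projection on $Q_n$ is computed by least squares so that collinear instrument columns (for example a constant under a row-normalized $W_n$) do no harm.

```python
import numpy as np

def sar_2sls(y, x, w):
    """2SLS estimate of theta = (lambda, beta')' in Y = lambda W Y + X beta + E."""
    wx = w @ x
    q = np.column_stack([x, wx, w @ wx])   # instruments (assumed choice)
    z = np.column_stack([w @ y, x])        # Z = (W Y, X)
    # Z_hat = Q (Q'Q)^{-1} Q' Z, obtained via a least-squares fit of Z on Q
    z_hat = q @ np.linalg.lstsq(q, z, rcond=None)[0]
    # theta = [Z' P_Q Z]^{-1} [Z' P_Q Y], using Z_hat' Z = Z' P_Q Z
    return np.linalg.solve(z_hat.T @ z, z_hat.T @ y)
```

The first element of the returned vector is the estimate of $\lambda$ and the remaining elements estimate $\beta$.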

By the Schwarz inequality, the optimum IV matrix is $(G_n X_n\beta, X_n)$. This 2SLS cannot be used for the estimation of the (pure) SAR process with $\beta = 0$. When $\beta = 0$, $G_n X_n\beta = 0$. Hence, $(G_n X_n\beta, X_n) = (0, X_n)$ would have rank $k$ but not full rank $(k+1)$. Intuitively, when $\beta = 0$, $X_n$ does not appear in the model and there is no other IV available.

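Method of moments. Kelejian and Prucha (1999) suggest an MOM estimation:

\[ \min_\theta\; g_n'(\theta)\, g_n(\theta). \]

The moment equations are based on three moments:

\[ E\left(E_n' E_n\right) = n\sigma^2; \quad E\left(E_n' W_n' W_n E_n\right) = \sigma^2 tr\left(W_n' W_n\right); \quad E\left(E_n' W_n E_n\right) = 0. \]

In this case we have:

\[ g_n(\theta) = \begin{pmatrix} Y_n' S_n(\lambda)' S_n(\lambda) Y_n - n\sigma^2 \\ Y_n' S_n(\lambda)' W_n' W_n S_n(\lambda) Y_n - \sigma^2 tr\left(W_n' W_n\right) \\ Y_n' S_n(\lambda)' W_n S_n(\lambda) Y_n \end{pmatrix}. \]

For the regression model with SAR disturbances, $Y_n$ shall be replaced by least squares residuals.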

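GMM estimation. For the SAR model with covariates, we can obtain other moment equations in addition to those based on $Q_n$. Let $Q_n$ be an $n \times k_x$ IV matrix constructed as functions of $X_n$ and $W_n$. Let

\[ \varepsilon_n(\theta) = S_n(\lambda) Y_n - X_n\beta \]

for any possible $\theta$. The orthogonality conditions, $Q_n'\varepsilon_n(\theta) = 0$, provide the $k_x \times 1$ vector of moment conditions.

Now, consider a finite number, say $m$, of $n \times n$ constant matrices, $P_{1n}, \ldots, P_{mn}$, each of which has a zero diagonal. Then, $(P_{jn}\varepsilon_n(\theta))'\varepsilon_n(\theta)$ can be used as the moment functions in addition to $Q_n'\varepsilon_n(\theta)$. Then, we have the following moment conditions vector:

\[ g_n(\theta) = \left(P_{1n}\varepsilon_n(\theta), \ldots, P_{mn}\varepsilon_n(\theta), Q_n\right)'\varepsilon_n(\theta) = \begin{pmatrix} \varepsilon_n(\theta)' P_{1n}\varepsilon_n(\theta) \\ \vdots \\ \varepsilon_n(\theta)' P_{mn}\varepsilon_n(\theta) \\ Q_n'\varepsilon_n(\theta) \end{pmatrix}. \]

Proposition. For any constant $n \times n$ matrix $P_n$ with $tr(P_n) = 0$, $P_n\varepsilon_n$ is uncorrelated with $\varepsilon_n$, i.e., $E\left((P_n\varepsilon_n)'\varepsilon_n\right) = 0$.

Proof:

\[ E\left((P_n\varepsilon_n)'\varepsilon_n\right) = E\left(\varepsilon_n' P_n'\varepsilon_n\right) = E\left(\varepsilon_n' P_n\varepsilon_n\right) = \sigma^2 tr(P_n) = 0. \]

This shows that $E(g_n(\theta_0)) = 0$. Thus, $g_n(\theta)$ are valid moment equations for GMM. Intuitively, as

\[ W_n Y_n = G_n X_n\beta_0 + G_n\varepsilon_n \quad \text{with } G_n = W_n S_n^{-1} \text{ and } S_n = S_n(\lambda_0), \]

and $G_n\varepsilon_n$ is correlated with the disturbance $\varepsilon_n$ in the model

\[ Y_n = \lambda W_n Y_n + X_n\beta + \varepsilon_n, \]

any $P_{jn}\varepsilon_n$, which is uncorrelated with $\varepsilon_n$, can be used as an IV for $W_n Y_n$ as long as $P_{jn}\varepsilon_n$ and $G_n\varepsilon_n$ are correlated.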

6.3 The Spatial Dynamic Panel Data (SDPD) Model

Dynamic panel data models consider not only heterogeneity but also state dependence that cannot be handled by cross-sectional or static panel data models. The most general case is the "time-space dynamic" model (Anselin, Le Gallo, and Jayet, 2008; see footnote 29), which is termed the spatial dynamic panel data (SDPD) model in Yu, de Jong and Lee (2008):

\[ Y_{nt} = \lambda_0 W_n Y_{nt} + \gamma_0 Y_{n,t-1} + \rho_0 W_n Y_{n,t-1} + X_{nt}\beta_0 + c_{n0} + \alpha_{t0} l_n + V_{nt}, \tag{159} \]

where $c_{n0}$ is an $n \times 1$ column vector of fixed effects and the $\alpha_{t0}$'s are time effects. $\gamma_0$ captures the pure dynamic effect and $\rho_0$ captures the spatial-time or diffusion effect. Due to the presence of fixed individual and time effects, $X_{nt}$ will not include any time-invariant or individual-invariant regressors.

Define

\[ S_n(\lambda) = I_n - \lambda W_n; \quad S_n \equiv S_n(\lambda_0) = I_n - \lambda_0 W_n. \]

Presuming that $S_n$ is invertible and denoting $A_n = S_n^{-1}(\gamma_0 I_n + \rho_0 W_n)$, (159) can be rewritten as

\[ Y_{nt} = A_n Y_{n,t-1} + S_n^{-1}\left(X_{nt}\beta_0 + c_{n0} + \alpha_{t0} l_n + V_{nt}\right). \tag{160} \]

We study the eigenvalues of $A_n$ by focusing on the case with $W_n$ being row-normalized. Let $\varpi_n = \mathrm{diag}\{\varpi_{n1}, \ldots, \varpi_{nn}\}$ be the $n \times n$ diagonal eigenvalues matrix of $W_n$ such that $W_n = \Gamma_n\varpi_n\Gamma_n^{-1}$ where $\Gamma_n$ is the eigenvector

29Anselin, Le Gallo, and Jayet (2008) divide spatial dynamic models into four categories,namely, “pure space recursive” if only a spatial time lag is included; “time-space recursive”if an individual time lag and a spatial time lag are included; “time-space simultaneous” if anindividual time lag and a contemporaneous SL term are specified; and “time-space dynamic”if all forms of lags are included. Korniotis (2010) studies a time-space recursive model withfixed effects, which is applied to the growth of consumption in each state in the United Statesto investigate habit formation. Su and Yang (2007) derive the quasi-maximum likelihood(QML) estimation of the above model under both fixed and random effects specifications.


matrix. Because $A_n = S_n^{-1}(\gamma_0 I_n + \rho_0 W_n)$, the eigenvalues matrix of $A_n$ is $D_n = (I_n - \lambda_0\varpi_n)^{-1}(\gamma_0 I_n + \rho_0\varpi_n)$ such that $A_n = \Gamma_n D_n\Gamma_n^{-1}$. As $W_n$ is row-normalized, all the eigenvalues are less than or equal to 1 in absolute value. Denote $m_n$ as the number of unit eigenvalues of $W_n$ and let the first $m_n$ eigenvalues of $W_n$ be unity. $D_n$ can be decomposed into two parts, one corresponding to the unit eigenvalues of $W_n$, and the other corresponding to the eigenvalues of $W_n$ which are smaller than 1. Define $J_n = \mathrm{diag}\{l_{m_n}', 0, \ldots, 0\}$ with $l_{m_n}$ being an $m_n \times 1$ vector of ones and $\tilde D_n = \mathrm{diag}\{0, \ldots, 0, d_{n,m_n+1}, \ldots, d_{nn}\}$, where $|d_{ni}| < 1$ for $i = m_n + 1, \ldots, n$. We have

\[ A_n^h = \left(\frac{\gamma_0 + \rho_0}{1 - \lambda_0}\right)^h\Gamma_n J_n\Gamma_n^{-1} + B_n^h \quad \text{with } B_n = \Gamma_n\tilde D_n\Gamma_n^{-1}. \]

Depending on the value of $\frac{\gamma_0 + \rho_0}{1 - \lambda_0}$, we may divide the process into four cases.

• Stable case when γ0 + ρ0 + λ0 < 1 (and with some other restrictions onthree parameters).

• Spatial cointegration case when γ0 + ρ0 + λ0 = 1 but γ0 < 1.

• Unit roots case when γ0 + ρ0 + λ0 = 1 and γ0 = 1.

• Explosive case when γ0 + ρ0 + λ0 > 1.

For stability, or in terms of stationarity in the time series notion, there are more restrictions on the parameter space of $(\gamma_0, \rho_0, \lambda_0)$ in addition to $\gamma_0 + \rho_0 + \lambda_0 < 1$. Such a parameter region can be revealed from the condition that all the eigenvalues $d_{ni}$ of $A_n$ are less than one in absolute value. The eigenvalues of $A_n$ are $d_{ni} = \frac{\gamma_0 + \rho_0\varpi_{ni}}{1 - \lambda_0\varpi_{ni}}$, where the $\varpi_{ni}$'s are the eigenvalues of $W_n$. By regarding $d$ as a function of $\varpi$, we have:

\[ \frac{\partial}{\partial\varpi}\left(\frac{\gamma_0 + \rho_0\varpi}{1 - \lambda_0\varpi}\right) = \frac{\rho_0 + \lambda_0\gamma_0}{(1 - \lambda_0\varpi)^2}. \]

Thus, we have three different situations:

1. $\rho_0 + \lambda_0\gamma_0 > 0$ if and only if $d_{ni}$ has the same increasing order as $\varpi_{ni}$.

2. $\rho_0 + \lambda_0\gamma_0 = 0$, i.e., separable space-time filter, if and only if $d_{ni}$ is a constant (under this case, $d_{ni} = \gamma_0$).

3. $\rho_0 + \lambda_0\gamma_0 < 0$ if and only if $d_{ni}$ has the decreasing order of $\varpi_{ni}$.
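The mapping from the eigenvalues of $W_n$ to those of $A_n$ gives a direct numerical stability check. The sketch below is illustrative: the function names are hypothetical, and the eigenvalues of $W_n$ are computed with a general (possibly complex) eigenvalue routine.

```python
import numpy as np

def sdpd_eigenvalues(w, lam, gamma, rho):
    """Eigenvalues d_i = (gamma + rho * w_i) / (1 - lam * w_i) of
    A = (I - lam W)^{-1} (gamma I + rho W), from the eigenvalues w_i of W."""
    w_eig = np.linalg.eigvals(w)
    return (gamma + rho * w_eig) / (1.0 - lam * w_eig)

def is_stable(w, lam, gamma, rho):
    """True when every eigenvalue of A lies strictly inside the unit circle."""
    return bool(np.max(np.abs(sdpd_eigenvalues(w, lam, gamma, rho))) < 1.0)
```

For a row-normalized $W_n$ the unit eigenvalue maps to $(\gamma_0 + \rho_0)/(1 - \lambda_0)$, so with $1 - \lambda_0 > 0$ the condition $\gamma_0 + \rho_0 + \lambda_0 < 1$ is necessary but, as noted above, not sufficient.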

(160) expresses the model in terms of a space-time multiplier (Anselin, Le Gallo, and Jayet 2008), which specifies how the joint determination of the dependent variables is a function of both spatial and time lags of explanatory variables and disturbances of all spatial units. This is useful for calculating marginal effects of changes of exogenous variables on outcomes over time and across spatial units. LeSage and Pace (2009) have introduced the concepts of direct impact, total impact, and indirect impact. In a SAR model:

\[ Y_n = \alpha_0 l_n + \lambda_0 W_n Y_n + \sum_{k=1}^{k_x}\beta_{k0} X_{nk} + \varepsilon_n, \]

where $W_n$ does not depend on $X_n$, the impact of a regressor $X_{nk}$ on $Y_n$ is

\[ \frac{\partial Y_n}{\partial X_{nk}'} = (I_n - \lambda_0 W_n)^{-1}\beta_{k0} \quad \text{for the } k\text{th regressor.} \]

The average direct, average total and average indirect impacts are defined as

\[ f_{k,direct}(\theta_0) \equiv \frac{1}{n} tr\left((I_n - \lambda_0 W_n)^{-1}\beta_{k0}\right), \]

\[ f_{k,total}(\theta_0) \equiv \frac{1}{n} l_n'\left((I_n - \lambda_0 W_n)^{-1}\beta_{k0}\right) l_n, \]

\[ f_{k,indirect}(\theta_0) \equiv f_{k,total}(\theta_0) - f_{k,direct}(\theta_0), \]

with $l_n$ being an $n$-dimensional column of ones.

Debarsy, Ertur and LeSage (2012) extend such impact analyses to spatial dynamic panel models to investigate diffusion effects over time. Lee and Yu (2012b) extend the impact analysis to the case with time-varying spatial weights. Elhorst (2012) points out various restrictions on spatial dynamic panel models with marginal effects implied by specified models.

For an impact analysis for the SDPD model, a change in exogenous variables will influence the dependent variable not only in the current period but also in future periods, since the effect is carried forward through $A_n$. Consider changing the value of a regressor by the same amount across all spatial units in some consecutive time periods, say, from the time period $t_1$ to $t_2$. Then, we have

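\[ \frac{\partial E(Y_{nt})}{\partial x} = \beta_0\sum_{h=t-t_2}^{t-t_1} A_n^h S_n^{-1} l_n \]

where $x$ is a regressor with its coefficient being $\beta_0$. Hence, by denoting $\theta = (\lambda, \gamma, \rho, \beta)$, the average total impact is

\[ f_{t,total}(\theta_0) \equiv \frac{1}{n} l_n'\frac{\partial E(Y_{nt})}{\partial x} = \beta_0\sum_{h=t-t_2}^{t-t_1}\frac{1}{n}\left[l_n' A_n^h S_n^{-1} l_n\right], \]

and similarly for other impacts. Debarsy, Ertur and LeSage (2012) study the case of $t_2 = t$ so that

\[ f_{t,total}(\theta_0) \equiv \beta_0\sum_{h=0}^{t-t_1}\frac{1}{n}\left[l_n' A_n^h S_n^{-1} l_n\right], \]

and the object of interest is how a permanent change in $X_{n,t_1}$ will affect the future horizons (cumulatively) until $t$.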
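Both the static SAR impacts and the cumulative SDPD total impact for the case $t_2 = t$ can be sketched in a few lines. The function names and the argument `horizon` (standing for $t - t_1$) are hypothetical, and a dense matrix inverse is used for clarity rather than efficiency.

```python
import numpy as np

def sar_impacts(w, lam, beta_k):
    """Average direct, total and indirect impacts of the k-th regressor
    in a SAR model, based on (I - lam W)^{-1} * beta_k."""
    n = w.shape[0]
    mat = np.linalg.inv(np.eye(n) - lam * w) * beta_k
    direct = np.trace(mat) / n
    total = mat.sum() / n          # (1/n) l' M l
    return direct, total, total - direct

def sdpd_total_impact(w, lam, gamma, rho, beta, horizon):
    """beta * sum_{h=0}^{horizon} (1/n) l' A^h S^{-1} l, with
    A = S^{-1} (gamma I + rho W) and S = I - lam W."""
    n = w.shape[0]
    s_inv = np.linalg.inv(np.eye(n) - lam * w)
    a = s_inv @ (gamma * np.eye(n) + rho * w)
    ones = np.ones(n)
    v = s_inv @ ones               # A^0 S^{-1} l
    total = 0.0
    for _ in range(horizon + 1):
        total += ones @ v / n
        v = a @ v                  # move from A^h S^{-1} l to A^{h+1} S^{-1} l
    return beta * total
```

With a row-normalized $W_n$ the static total impact reduces to $\beta_{k0}/(1 - \lambda_0)$, and the cumulative SDPD impact becomes a geometric sum in $(\gamma_0 + \rho_0)/(1 - \lambda_0)$, which provides a simple sanity check.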


6.4 The Spatial Durbin Model

Elhorst (2012) proposes the following general spatial Durbin-type model:

\[ Y_t = \tau Y_{t-1} + \delta W Y_t + \eta W Y_{t-1} + X_t\beta_1 + W X_t\beta_2 + X_{t-1}\beta_3 + W X_{t-1}\beta_4 + Z_t\theta + v_t \tag{161} \]

\[ v_t = \gamma v_{t-1} + \rho W v_t + \mu + \lambda_t\iota_N + \varepsilon_t \]

\[ \mu = \kappa W\mu + \xi \]

where $Y_t$ is an $N \times 1$ vector of the dependent variable for the spatial units ($i = 1, \ldots, N$) observed at time $t$ ($t = 1, \ldots, T$), $X_t$ is an $N \times K$ matrix of exogenous regressors, and $Z_t$ is an $N \times L$ matrix of endogenous regressors. The $N \times N$ matrix $W$ is a nonnegative spatial-weight matrix with its diagonal elements being zero. $\tau$, $\delta$ and $\eta$ are the scalar parameters on $Y_{t-1}$, $W Y_t$, and $W Y_{t-1}$. $\beta_1$, $\beta_2$, $\beta_3$, and $\beta_4$ are the $K \times 1$ vectors of the parameters on exogenous regressors and $\theta$ is the $L \times 1$ vector of the parameters on endogenous regressors. $v_t$ is the $N \times 1$ vector of error terms, which may be allowed to be serially and spatially correlated. $\mu = (\mu_1, \ldots, \mu_N)'$ is the $N \times 1$ vector of the spatial-specific effects, and $\lambda_t$ ($t = 1, \ldots, T$) denotes time effects with $\iota_N$ an $N \times 1$ vector of ones. We may allow the spatial-specific effects to be spatially autocorrelated. Finally, $\varepsilon_t = (\varepsilon_{1t}, \ldots, \varepsilon_{Nt})'$ and $\xi$ are vectors of iid disturbances with zero mean and finite variances $\sigma^2$ and $\sigma^2_\xi$.

Stationarity conditions We impose restrictions on the parameters and $W$ to obtain stationarity. The characteristic roots of $(I - \delta W)^{-1}(\tau I + \eta W)$ in (161) should lie within the unit circle as follows:

$$\begin{aligned}
\tau &< 1 - (\delta + \eta)\,\omega_{max}, && \text{if } \delta + \eta \geq 0 \\
\tau &< 1 - (\delta + \eta)\,\omega_{min}, && \text{if } \delta + \eta < 0 \\
-1 + (\delta - \eta)\,\omega_{max} &< \tau, && \text{if } \delta - \eta \geq 0 \\
-1 + (\delta - \eta)\,\omega_{min} &< \tau, && \text{if } \delta - \eta < 0
\end{aligned} \tag{162}$$

where $\omega_{min}$ is the smallest (most negative) and $\omega_{max}$ the largest real characteristic root of $W$.

Remark: The restriction $\tau + \delta + \eta < 1$ in Lee and Yu (2010a) is not too restrictive. The stationarity conditions in (162) are more difficult to work with.
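The stationarity requirement can be checked numerically in two ways: directly, from the eigenvalues of $(I - \delta W)^{-1}(\tau I + \eta W)$, or through the inequalities in (162). The following is a minimal sketch, assuming a row-normalised ring network (each unit has two neighbours) as a purely illustrative $W$; the function names and parameter values are hypothetical and not taken from the text.

```python
import numpy as np

def row_normalised_ring(n):
    """Illustrative W: each unit's two ring neighbours get weight 0.5."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, (i - 1) % n] = 0.5
        W[i, (i + 1) % n] = 0.5
    return W

def is_stationary(tau, delta, eta, W):
    """Direct check: all eigenvalues of (I - delta W)^{-1}(tau I + eta W) inside the unit circle."""
    I = np.eye(W.shape[0])
    A = np.linalg.solve(I - delta * W, tau * I + eta * W)
    return bool(np.max(np.abs(np.linalg.eigvals(A))) < 1.0)

def conditions_162(tau, delta, eta, W):
    """Check the four inequalities in (162) using the real eigenvalues of W."""
    omega = np.linalg.eigvals(W).real
    w_min, w_max = omega.min(), omega.max()
    s, d = delta + eta, delta - eta
    upper = 1.0 - s * (w_max if s >= 0 else w_min)
    lower = -1.0 + d * (w_max if d >= 0 else w_min)
    return bool(lower < tau < upper)

W = row_normalised_ring(6)
stable = is_stationary(0.3, 0.2, 0.1, W)      # moderate persistence
unstable = is_stationary(0.9, 0.3, 0.2, W)    # tau + delta + eta > 1
```

For this symmetric $W$ the two checks agree; for weight matrices with complex eigenvalues the direct eigenvalue check is the safer of the two.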

Estimation methods The spatial model has been estimated mainly by the iterative ML estimation procedure developed by Anselin (1988). Three estimation methods have been considered: (i) the bias-corrected quasi-maximum likelihood (QML) estimator, (ii) the IV or GMM estimator, and (iii) the Bayesian Markov Chain Monte Carlo (MCMC) approach.

Yu et al. (2008) construct a bias-corrected estimator for a dynamic model with $(Y_{t-1}, WY_t, WY_{t-1})$ and spatial fixed effects. Lee and Yu (2010d) extend it to include time effects. They provide an asymptotic theory for the bias-corrected LSDV (BCLSDV) estimator when both $N$ and $T$ tend to infinity, but $T$ cannot be too small relative to $N$.

A few studies consider IV/GMM estimators, building on the work of Arellano and Bond (1991) and Blundell and Bond (1998). Elhorst (2010d) extends the FD-GMM to include endogenous interaction effects and finds that this estimator can still be severely biased. Lee and Yu (2010c) show that a 2SLS estimator, which is based on lagged values of $Y_{t-1}$, $WY_{t-1}$, $X_t$ and $WY_t$, is inconsistent due to too many moments; the dominant bias is caused by the endogeneity of $WY_t$. They propose an optimal GMM estimator. Kukenova and Monteiro (2009) and Jacobs et al. (2009) consider a dynamic panel data model with $(Y_{t-1}, WY_t)$ and extend the system GMM to account for endogenous interaction effects. GMM can also be used to instrument endogenous explanatory variables (other than $Y_{t-1}$ and $WY_t$).

The dynamic spatial Durbin model Burridge (1981) recommends the first-order spatial autoregressive distributed lag model, in which $Y$ is regressed on $WY$, $X$ and $WX$. This is known as the spatial Durbin model. The cost of ignoring spatial dependence in the dependent/independent variables is relatively high (biased estimates) whilst ignoring spatial dependence in the disturbances will only cause a loss of efficiency (LeSage and Pace, 2009). The spatial Durbin model produces unbiased estimates, even if the DGP contains a spatial error.

If explanatory variables are endogenous, the best estimation method is the IV/GMM estimator (Fingleton and LeGallo, 2008).30 Elhorst et al. (2010b) propose a dynamic spatial Durbin model:

$$Y_t = \tau Y_{t-1} + \delta W Y_t + \eta W Y_{t-1} + X_t\beta_1 + W X_t\beta_2 + v_t \tag{163}$$

Rewriting (163) as

$$Y_t = (I_N - \delta W)^{-1}(\tau I_N + \eta W)Y_{t-1} + (I_N - \delta W)^{-1}(X_t\beta_1 + W X_t\beta_2) + (I_N - \delta W)^{-1}v_t, \tag{164}$$

we can derive the partial derivatives of $Y$ with respect to the $k$th explanatory variable of $X$ at time $t$ by

$$\left[\frac{\partial Y}{\partial x_{1k}} \ \cdots \ \frac{\partial Y}{\partial x_{Nk}}\right]_t = (I_N - \delta W)^{-1}(\beta_{1k}I_N + \beta_{2k}W) \tag{165}$$

These denote the effect of a change of an explanatory variable in a spatial unit on the dependent variable of all other units in the short term. Similarly, the long-term effects are given by:

$$\left[\frac{\partial Y}{\partial x_{1k}} \ \cdots \ \frac{\partial Y}{\partial x_{Nk}}\right] = \left[(1 - \tau)I_N - (\delta + \eta)W\right]^{-1}(\beta_{1k}I_N + \beta_{2k}W) \tag{166}$$

(165) and (166) show that the short-term indirect effects do not occur if both $\delta = 0$ and $\beta_{2k} = 0$, while the long-term indirect effects do not occur if both $\delta = -\eta$ and $\beta_{2k} = 0$. By simulating the effects of shocks in $v_t$, it is possible to find the path along which an economy moves to its long-term equilibrium (De Groot and Elhorst, 2010).
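The effect matrices in (165) and (166) are straightforward to evaluate numerically. The sketch below computes them and summarises each by a direct effect (mean diagonal element) and an indirect effect (mean row sum of the off-diagonal elements). The ring-shaped $W$, the parameter values and the function names are illustrative assumptions only.

```python
import numpy as np

def summarise(M):
    """Direct effect = mean diagonal element; indirect = mean off-diagonal row sum."""
    n = M.shape[0]
    direct = np.trace(M) / n
    indirect = (M.sum() - np.trace(M)) / n
    return direct, indirect

def short_term_effects(W, delta, b1k, b2k):
    """Effect matrix in (165)."""
    I = np.eye(W.shape[0])
    return np.linalg.solve(I - delta * W, b1k * I + b2k * W)

def long_term_effects(W, tau, delta, eta, b1k, b2k):
    """Effect matrix in (166)."""
    I = np.eye(W.shape[0])
    return np.linalg.solve((1.0 - tau) * I - (delta + eta) * W, b1k * I + b2k * W)

# Hypothetical row-normalised ring network with 6 units.
n = 6
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.5

tau, delta, eta, b1k, b2k = 0.3, 0.2, 0.1, 1.0, 0.5
sr_direct, sr_indirect = summarise(short_term_effects(W, delta, b1k, b2k))
lr_direct, lr_indirect = summarise(long_term_effects(W, tau, delta, eta, b1k, b2k))

# No short-term spillovers when delta = 0 and beta_2k = 0.
_, sr_none = summarise(short_term_effects(W, 0.0, b1k, 0.0))
# No long-term spillovers when delta = -eta and beta_2k = 0.
_, lr_none = summarise(long_term_effects(W, tau, 0.2, -0.2, b1k, 0.0))
```

With a row-normalised $W$, the long-term direct and indirect effects sum to $(\beta_{1k} + \beta_{2k})/(1 - \tau - \delta - \eta)$, which gives a quick sanity check on any implementation.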

30 The studies on growth and convergence typically regress economic growth on economic growth in neighboring economies, and the initial income level, the rates of saving, population growth, technological change, and depreciation in the own and in neighboring countries.

The dynamic spatial Durbin model can be used to determine direct effects and indirect (spatial spillover) effects in the short and long term. Anselin et al. (2008) argue that this model might suffer from identification problems. By continuous substitution of $Y_{t-1}$ up to $Y_1$ in (164), we have:

$$Y_t = (I_N - \delta W)^{-T}(\tau I_N + \eta W)^{T}Y_{t-T} + \sum_{p=1}^{T}(I_N - \delta W)^{-p}(\tau I_N + \eta W)^{p-1}\left(X_{t-(p-1)}\beta_1 + W X_{t-(p-1)}\beta_2 + v_{t-(p-1)}\right). \tag{167}$$

This shows that two global spatial multiplier matrices, $(I_N - \delta W)^{-p}$ and $(\tau I_N + \eta W)^{p-1}$, are at work at the same time in conjunction with one process that produces local spatial spillover effects, $W X_{t-(p-1)}\beta_2$.31 In addition, more empirical research is needed to find out whether the short-term and long-term direct and indirect effects make sense.

To avoid identification problems, four restrictions can be imposed. The first restriction is $\beta_2 = 0$, in which case the local indirect effects (spatial spillover) are zero. The indirect effects in relation to the direct effects become the same for every explanatory variable both in the short and the long term. For example, the ratio for the $k$th explanatory variable in the short term takes the form:

$$\frac{\left[(I_N - \delta W)^{-1}\beta_{1k}I_N\right]^{rsum}}{\left[(I_N - \delta W)^{-1}\beta_{1k}I_N\right]^{d}} = \frac{\left[(I_N - \delta W)^{-1}\right]^{rsum}}{\left[(I_N - \delta W)^{-1}\right]^{d}}$$

where $d$ is the operator that calculates the mean diagonal element of a matrix and $rsum$ is the operator that calculates the mean row sum of the non-diagonal elements.32 This ratio is independent of $\beta_{1k}$ and thus the same for every explanatory variable. A similar result holds in the long term.

The second restriction is $\delta = 0$, in which case $(I_N - \delta W)^{-1} = I_N$. Thus, the global short-term indirect (spatial spillover) effect is zero. This model is less suitable if the analysis focuses on spatial spillover effects in the short term.

The third restriction is $\eta = -\tau\delta$ (Parent and LeSage, 2011). The advantage is that the impact of a change in one of the explanatory variables on the dependent variable can be decomposed into a spatial effect and a time effect; the impact over space falls by $\delta W$ for every higher-order neighbor, and over time by the factor $\tau$ for every period. The disadvantage is that the indirect (spatial spillover) effects in relation to the direct effects remain constant over time. The ratio for the $k$th explanatory variable takes the form:

$$\left[(I_N - \delta W)^{-1}(\beta_{1k}I_N + \beta_{2k}W)\right]^{rsum} \Big/ \left[(I_N - \delta W)^{-1}(\beta_{1k}I_N + \beta_{2k}W)\right]^{d}$$

both in the short term and the long term.

The fourth restriction is $\eta = 0$. Although this model limits the flexibility of the ratio between indirect and direct effects, it seems to be the least restrictive model.

31 That is one process too many. One should examine whether the log-likelihood function is flat or almost flat. Hendry (1995) recommends regressing $Y_t$ on $Y_{t-1}$, $X_t$ and $X_{t-1}$ as a generalization of the first-order autocorrelation model for time series, while Burridge (1981) recommends regressing $Y_t$ on $WY_t$, $X_t$ and $WX_t$ as a generalization of the first-order spatial autocorrelation model for cross-section data. Elhorst (2001) suggests regressing $Y_t$ on $Y_{t-1}$, $WY_t$, $WY_{t-1}$, $X_t$, $WX_t$, $X_{t-1}$ and $WX_{t-1}$. This extension, however, worsens the identification problem.

32 LeSage and Pace (2009) propose to report one direct effect measured by the average of the diagonal elements, and one indirect effect measured by the average of the row sums of the non-diagonal elements of that matrix.

6.5 QMLE of Heterogeneous Spatial Models

Recently, Aquaro, Bailey and Pesaran (2015) extend the spatial autoregressive panel data model to the case where the spatial coefficients differ across the spatial units. They develop the QML estimator, which is shown to be consistent and asymptotically normally distributed when both the time and cross-section dimensions are large. Small sample properties of the proposed estimators are investigated by Monte Carlo simulations for Gaussian and non-Gaussian errors, and with spatial weight matrices of differing degrees of sparseness, showing that the QML estimators have satisfactory small sample properties under certain sparsity conditions on the spatial weight matrix.

Consider the heterogeneous spatial autoregressive (HSAR) model:

$$y_{it} = \psi_i \sum_{j=1}^{N} w_{ij}y_{jt} + \varepsilon_{it}, \quad i = 1, \ldots, N;\ t = 1, \ldots, T \tag{168}$$

where $w_i'y_t = \sum_{j=1}^{N} w_{ij}y_{jt}$, $w_i = (w_{i1}, \ldots, w_{iN})'$ with $w_{ii} = 0$, and $y_t = (y_{1t}, \ldots, y_{Nt})'$. Here, $w_i$ denotes an $N \times 1$ non-stochastic vector. Stacking the observations on individual units for each $t$, we have

$$(I_N - \Psi W)y_t = \varepsilon_t, \quad t = 1, \ldots, T$$

where $\Psi = diag(\psi)$ with $\psi = (\psi_1, \ldots, \psi_N)'$. The true value of $\psi_i$ will be denoted by $\psi_{i0}$. Define:

$$S(\psi) = (I_N - \Psi W), \quad \text{and} \quad S_0 = (I_N - \Psi_0 W).$$

Assumption 1 The $N \times N$ spatial weight matrix, $W = (w_{ij})$, is exactly sparse such that

$$h_N = \max_{i \leq N} \sum_{j \leq N} I(w_{ij} \neq 0)$$

is bounded in $N$, where $I(A)$ denotes the indicator function.

Assumption 2 $\varepsilon_{it}$ are independently distributed over $i$ and $t$, have zero means, and constant variances, $E(\varepsilon_{it}^2) = \sigma^2$.

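Assumption 3 The $(N + 1) \times 1$ parameter vector, $\theta = (\psi', \sigma^2)' \in \Theta$, where $\Theta$ is a subset of the $N + 1$ dimensional Euclidean space, $R^{N+1}$. $\Theta$ is a closed and bounded (compact) set and includes the true value of $\theta$, $\theta_0$, as an interior point.

Assumption 4 $\lambda_{min}\left[S'(\psi)S(\psi)\right] > 0$ for all values of $\psi \in \Theta$ and $N$.

Assumption 5 Let $y_t = (y_{1t}, \ldots, y_{Nt})'$, and consider the sample covariance matrix of $y_t$, $\hat{\Sigma}_T = T^{-1}\sum_{t=1}^{T} y_t y_t'$. For a given $N$ we have (as $T \to \infty$)

$$\hat{\Sigma}_T \to_p \Sigma_0 \quad \text{uniformly in } \theta = (\psi', \sigma^2)',$$

where

$$\Sigma_0 = \sigma_0^2 (I_N - \Psi_0 W)^{-1}(I_N - \Psi_0 W)^{-1\prime} = \sigma_0^2 S_0^{-1}S_0^{-1\prime}.$$

Under Assumption 1 and under the following condition:

$$\sup_i |\psi_i| < \max\left\{\frac{1}{\|W\|_1}, \frac{1}{\|W\|_\infty}\right\}$$

$S(\psi) = (I_N - \Psi W)$ is non-singular and (168) can be expressed as

$$y_t = S^{-1}(\psi)\varepsilon_t; \quad t = 1, \ldots, T \tag{169}$$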
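The sparsity measure $h_N$, the bound on $\sup_i|\psi_i|$ and the reduced form (169) can be illustrated with a small simulation. This is a sketch under stated assumptions: the ring-shaped $W$, the uniform draw of $\psi_i$ and the function names are hypothetical, and $\|W\|_1$ and $\|W\|_\infty$ are taken to be the maximum absolute column and row sums.

```python
import numpy as np

def max_neighbours(W):
    """h_N in Assumption 1: largest number of non-zero weights in any row."""
    return int((W != 0).sum(axis=1).max())

def invertibility_bound(W):
    """max{1/||W||_1, 1/||W||_inf} using column-sum and row-sum norms."""
    norm_1 = np.abs(W).sum(axis=0).max()
    norm_inf = np.abs(W).sum(axis=1).max()
    return max(1.0 / norm_1, 1.0 / norm_inf)

def simulate_hsar(psi, W, T, sigma=1.0, seed=0):
    """Draw y_t = S(psi)^{-1} eps_t for t = 1..T; returns an N x T array."""
    rng = np.random.default_rng(seed)
    N = W.shape[0]
    S = np.eye(N) - np.diag(psi) @ W
    eps = sigma * rng.standard_normal((N, T))
    return np.linalg.solve(S, eps)

# Hypothetical row-normalised ring network.
N, T = 8, 200
W = np.zeros((N, N))
for i in range(N):
    W[i, (i - 1) % N] = W[i, (i + 1) % N] = 0.5

psi = np.random.default_rng(1).uniform(-0.8, 0.8, size=N)
h_N = max_neighbours(W)
bound = invertibility_bound(W)
condition_holds = bool(np.max(np.abs(psi)) < bound)
Y = simulate_hsar(psi, W, T)
```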

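Then, the joint density function of the observations $y_1, \ldots, y_T$ is given by

$$\prod_{t=1}^{T} \frac{|S(\psi)|}{|\sigma^2 I_N|^{1/2}(2\pi)^{N/2}} \exp\left(-\frac{1}{2\sigma^2}\, y_t' S'(\psi) S(\psi) y_t\right)$$

and the (quasi) log-likelihood function can be written as

$$\ell(\theta) = -\frac{NT}{2}\ln 2\pi - \frac{NT}{2}\ln\sigma^2 + T\ln|S(\psi)| - \frac{1}{2\sigma^2}\sum_{t=1}^{T} y_t' S'(\psi) S(\psi) y_t$$

where $\theta = (\psi', \sigma^2)'$. The last term of the LLF can be written conveniently as

$$\sum_{t=1}^{T} y_t' S(\psi)' S(\psi) y_t = T\, tr\left[S(\psi)' S(\psi) \hat{\Sigma}_T\right]$$

Hence,

$$\ell(\theta) = -\frac{NT}{2}\ln 2\pi - \frac{NT}{2}\ln\sigma^2 + T\ln|S(\psi)| - \frac{T}{2\sigma^2}\, tr\left[S(\psi)' S(\psi) \hat{\Sigma}_T\right] \tag{170}$$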
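The trace form (170) is convenient computationally, since the data enter only through $\hat{\Sigma}_T$. The following sketch evaluates (170) and compares it with the period-by-period sum of Gaussian log densities. The data-generating values, the ring-shaped $W$ and the function name `hsar_loglik` are illustrative assumptions; the log determinant is taken in absolute value, which coincides with $\ln|S(\psi)|$ whenever the determinant is positive.

```python
import numpy as np

def hsar_loglik(psi, sigma2, W, Y):
    """Quasi log-likelihood (170) for an N x T data array Y."""
    N, T = Y.shape
    S = np.eye(N) - np.diag(psi) @ W
    _, logabsdet = np.linalg.slogdet(S)      # ln|det S(psi)|
    Sigma_T = Y @ Y.T / T                    # sample covariance of y_t
    quad = np.trace(S.T @ S @ Sigma_T)
    return (-0.5 * N * T * np.log(2.0 * np.pi)
            - 0.5 * N * T * np.log(sigma2)
            + T * logabsdet
            - 0.5 * T * quad / sigma2)

# Simulated example with hypothetical parameter values.
N, T = 6, 300
W = np.zeros((N, N))
for i in range(N):
    W[i, (i - 1) % N] = W[i, (i + 1) % N] = 0.5
psi_true = np.array([0.5, 0.2, -0.3, 0.4, 0.0, 0.6])
rng = np.random.default_rng(0)
S_true = np.eye(N) - np.diag(psi_true) @ W
Y = np.linalg.solve(S_true, rng.standard_normal((N, T)))

ll_trace = hsar_loglik(psi_true, 1.0, W, Y)

# Same quantity computed observation by observation, for comparison.
ll_direct = sum(
    -0.5 * N * np.log(2.0 * np.pi)
    + np.log(abs(np.linalg.det(S_true)))
    - 0.5 * np.sum((S_true @ Y[:, t]) ** 2)
    for t in range(T)
)
```

In practice this function would be passed (with a sign change) to a numerical optimiser over $(\psi', \sigma^2)'$, subject to the invertibility condition.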

Proposition 1 (The main identification) Consider the HSAR model (168) and suppose that Assumptions 1 to 5 hold and the invertibility condition (169) is met. Then the true parameter values, $\sigma_0^2$ and $\psi_{i0}$, for $i = 1, 2, \ldots, N$, are identified if $\lambda_{min}[\Lambda_N(\phi)] > 0$, where

$$\Lambda_N(\phi) = \begin{bmatrix} (A_0 \odot A_0') + (1 - \delta)\, diag(G_0 G_0') & diag(G_0)\tau_N - diag(G_0 G_0' D)\tau_N \\ \left[diag(G_0)\tau_N - diag(G_0 G_0' D)\tau_N\right]' & \dfrac{N}{2(1 - \delta)^2} \end{bmatrix}$$

$\tau_N$ is an $N \times 1$ vector of ones, $\odot$ is the Hadamard product matrix operator, and

$$A_0 = G_0(I_N - DG_0)^{-1} = W(I_N - \Psi W)^{-1}.$$

For large $N$, $\lambda_{min}[\Lambda_N(\phi)] > 0$ is necessary for identification but need not be sufficient if the aim is to identify all the spatial coefficients, $\psi_i$ for $i = 1, 2, \ldots, N$. For local identification the condition simplifies to

$$\lambda_{min}[H_0] > \epsilon > 0, \quad \text{for all } N,$$

$$H_0 = \left[(G_0 \odot G_0') + diag(G_0 G_0')\right] - \frac{2}{N}\, diag(G_0)\tau_N\tau_N'\, diag(G_0)$$

$$G_0 = G_0(\psi_0) = W(I_N - \Psi_0 W)^{-1},$$

where the $i$th element of $diag(G_0 G_0')$ is given by $g_{0i}'g_{0i}$. For large $N$ it is also required that $\lambda_{max}[H_0] < K$ for all $N$.

Proposition 2 (The main asymptotic result) Consider the HSAR model (168) and suppose that: (a) Assumptions 1 to 5 hold, (b) the invertibility condition (169) is met, (c) the $N \times N$ information matrix

$$H_{11,2} = (G_0 \odot G_0') + diag(G_0 G_0') - \frac{2}{N}\, diag(G_0)\tau_N'\tau_N\, diag(G_0')$$

is full rank, where $G_0 = W(I_N - \Psi_0 W)^{-1}$, $\Psi_0 = diag(\psi_0)$; $\psi_0 = (\psi_{10}, \ldots, \psi_{N0})'$, the $i$th element of $diag(G_0 G_0')$ is given by $g_{0i}'g_{0i}$, and $W$ is the spatial weight matrix, and (d) $\varepsilon_{it} \sim IIDN(0, \sigma_0^2)$. The MLE of $\psi_0$ has the following asymptotic distribution as $T \to \infty$,

$$\sqrt{T}\left(\hat{\psi}_T - \psi_0\right) \to_d N\left(0, AsyVar\left(\hat{\psi}_T\right)\right)$$

$$AsyVar\left(\hat{\psi}_T\right) = \left[(G_0 \odot G_0') + diag(G_0 G_0') - \frac{2}{N}\, diag(G_0)\tau_N'\tau_N\, diag(G_0')\right]^{-1}$$

which does not depend on $\sigma_0^2$.

The HSAR model with heteroskedastic error variances and exogenous regressors The HSAR model (168) can be extended to include exogenous regressors as well as heteroskedastic errors:

$$y_{it} = \psi_i \sum_{j=1}^{N} w_{ij}y_{jt} + \beta_i' x_{it} + \varepsilon_{it} \tag{171}$$

where $w_i'y_t = \sum_{j=1}^{N} w_{ij}y_{jt}$, $w_i = (w_{i1}, \ldots, w_{iN})'$ with $w_{ii} = 0$, and $y_t = (y_{1t}, \ldots, y_{Nt})'$. Now, we also introduce a $k \times 1$ vector of exogenous regressors $x_{it} = (x_{i1,t}, \ldots, x_{ik,t})'$ with parameters $\beta_i = (\beta_{i1}, \ldots, \beta_{ik})'$. The above specification allows for fixed effects by setting one of the regressors equal to unity.

We also allow $\varepsilon_{it}$ to be cross-sectionally heteroskedastic, $Var(\varepsilon_{it}) = \sigma_i^2$ for $i = 1, \ldots, N$. Stacking by individual units for each $t$, (171) becomes

$$y_t = \Psi W y_t + B x_t + \varepsilon_t \tag{172}$$

where $\Psi = diag(\psi)$, $\psi = (\psi_1, \ldots, \psi_N)'$, $B = diag(\beta_1', \ldots, \beta_N')$, and $\varepsilon_t = (\varepsilon_{1t}, \ldots, \varepsilon_{Nt})'$. Then (172) can be written as

$$y_t = (I_N - \Psi W)^{-1}(B x_t + \varepsilon_t)$$

The (quasi) LLF can be written as (assuming that the errors are Gaussian)

$$\ell(\theta) = -\frac{NT}{2}\ln 2\pi - \frac{T}{2}\sum_{i=1}^{N}\ln\sigma_i^2 + T\ln|I_N - \Psi W| - \frac{1}{2}\sum_{t=1}^{T}\left[(I_N - \Psi W)y_t - Bx_t\right]'\Sigma_\varepsilon^{-1}\left[(I_N - \Psi W)y_t - Bx_t\right]$$

where $\Sigma_\varepsilon = diag(\sigma_1^2, \ldots, \sigma_N^2)$. Alternatively, it is often more convenient to write the above log-likelihood function as

$$\ell(\theta) = -\frac{NT}{2}\ln 2\pi - \frac{T}{2}\sum_{i=1}^{N}\ln\sigma_i^2 + T\ln|I_N - \Psi W| - \frac{1}{2}\sum_{i=1}^{N}\frac{(y_i - \psi_i y_i^* - X_i\beta_i)'(y_i - \psi_i y_i^* - X_i\beta_i)}{\sigma_i^2}$$

where $\theta = (\psi', \beta_1', \ldots, \beta_N', \sigma_1^2, \ldots, \sigma_N^2)'$ with $\beta_i = (\beta_{i1}, \ldots, \beta_{ik})'$, $X_i = (x_{i1}, \ldots, x_{iT})'$ is the $T \times k$ matrix of regressors on the $i$th cross-section unit with $x_{it} = (x_{i1,t}, \ldots, x_{ik,t})'$, $y_i = (y_{i1}, \ldots, y_{iT})'$, and $y_i^* = (y_{i1}^*, \ldots, y_{iT}^*)'$, with the elements $y_{it}^* = w_i'y_t = \sum_{j=1}^{N} w_{ij}y_{jt}$.

Assumption 6 The $N(k + 2) \times 1$ parameter vector, $\theta = (\psi', \beta_1', \ldots, \beta_N', \sigma_1^2, \ldots, \sigma_N^2)' \in \Theta$, where $\Theta$ is a subset of the $N(k + 2)$ dimensional Euclidean space, $R^{N(k+2)}$. $\Theta$ is a closed and bounded (compact) set and includes $\theta_0$ as an interior point, and $\sup_i \|\beta_i\|_1 < K$.

Assumption 7 $\varepsilon_{it}$ are independently distributed over $i$ and $t$, have zero means, and variances $E(\varepsilon_{it}^2) = \sigma_i^2$ that are constant over time.

Assumption 8 $x_{it}$ are exogenous such that $E(x_{it}\varepsilon_{jt}) = 0$ for all $i$ and $j$,

$$T^{-1}\sum_{t=1}^{T} x_{it}\varepsilon_{jt} \to_p 0, \quad \text{uniformly in } i \text{ and } j = 1, \ldots, N.$$

The covariance matrices $E(x_{it}x_{jt}') = \Sigma_{ij}$, for all $i$ and $j$, are time-invariant and finite, $\Sigma_{ii}$ is non-singular, $T^{-1}\sum_{t=1}^{T} x_{it}x_{jt}' \to_p \Sigma_{ij}$;

$$\sup_i \lambda_{max}\left(T^{-1}X_i'X_i\right) < K, \quad \inf_i \lambda_{min}\left(T^{-1}X_i'X_i\right) > 0$$

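Proposition 3 (The main asymptotic result) Consider the HSAR model (171) and suppose that: (a) Assumptions 1 and 4 to 8 hold, (b) the invertibility condition (169) is met, (c) $\lambda_{min}(H_{11,2}) > \epsilon > 0$ for all $N$, where $H_{11,2}$ is the $N \times N$ matrix

$$H_{11,2} = (G_0 \odot G_0') + diag\left[-g_{0,ii} + \sum_{s=1, s \neq i}^{N}\frac{\sigma_s^2}{\sigma_i^2}\, g_{0,is}^2,\ i = 1, \ldots, N\right] + diag\left[\frac{1}{\sigma_i^2}\sum_{r=1}^{N}\sum_{s=1}^{N} g_{0,is}\, g_{0,ir}\, \beta_r'\left(\Sigma_{rs} - \Sigma_{ri}\Sigma_{ii}^{-1}\Sigma_{is}\right)\beta_s,\ i = 1, \ldots, N\right]$$

$G_0 = W(I_N - \Psi_0 W)^{-1}$, $\Psi_0 = diag(\psi_0)$; $\psi_0 = (\psi_{10}, \ldots, \psi_{N0})'$, the $i$th element of $diag(G_0 G_0')$ is given by $g_{0i}'g_{0i}$, and $W$ is the spatial weight matrix, and (d) $\varepsilon_{it} \sim IIDN(0, \sigma_i^2)$. Then the MLE of $\psi_0$ has the following asymptotic distribution as $T \to \infty$,

$$\sqrt{T}\left(\hat{\psi}_T - \psi_0\right) \to_d N\left(0, AsyVar\left(\hat{\psi}_T\right)\right)$$

where $AsyVar\left(\hat{\psi}_T\right) = H_{11,2}^{-1}$.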

6.5.1 BHP (2016) application to US housing prices

Almost all spatial econometric models assume that the spatial parameters do not vary across the units. Such parameter homogeneity is unavoidable when $T$ is very small. The evidence of parameter heterogeneity in panel data models is quite prevalent, particularly in the case of cross-county or cross-country datasets. In such cases, and when $T$ is sufficiently large, reducing the spatial effects to a single parameter appears rather restrictive. ABP allow the spatial effects to differ across the units, and derive the conditions needed for identification and consistent estimation under parameter heterogeneity. Consider the following heterogeneous equation:

$$x_{it} = \psi_i x_{it}^* + u_{it}; \quad \text{for } i = 1, \ldots, N,\ t = 1, \ldots, T;$$

where $x_{it}^* = w_i'x_t$ and $w_i'$ denotes the $i$th row of the $N \times N$ row-standardized spatial matrix, $W$. In the spatial econometrics literature it is assumed that all units have at least one neighbour, which ensures that $w_i'\tau = 1$ for all $i$. But when using correlation-based weights, it is possible for some units not to have any connections. In such cases $x_{it}^* = 0$ and the associated coefficient, $\psi_i$, is unidentified; to resolve the identification problem, we set $\psi_i = 0$. In matrix notation we have

$$x_t = \Psi W x_t + u_t; \quad \text{for } t = 1, \ldots, T;$$

where $\Psi = diag(\psi)$, $\psi = (\psi_1, \ldots, \psi_N)'$, and $\sigma_{ui}^2 = var(u_{it})$, $i = 1, \ldots, N$.
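The treatment of unconnected units can be made concrete with a short sketch. It assumes a non-negative adjacency matrix (an assumption made here for simplicity; the text's correlation-based weights may be constructed differently), row-standardises the connected rows, leaves isolated rows at zero, and sets $\psi_i = 0$ for isolated units. The function name and the example values are hypothetical.

```python
import numpy as np

def row_standardise(A):
    """Row-standardise a non-negative adjacency matrix.

    Rows of units without any connection are left as zeros, so that
    x*_it = 0 for those units. Returns W and a boolean 'connected' mask.
    """
    A = np.asarray(A, dtype=float).copy()
    np.fill_diagonal(A, 0.0)                  # no unit is its own neighbour
    row_sums = A.sum(axis=1)
    connected = row_sums > 0
    W = np.zeros_like(A)
    W[connected] = A[connected] / row_sums[connected, None]
    return W, connected

# Unit 2 has no connections in this illustrative network.
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [0, 0, 0]])
W, connected = row_standardise(A)

# Impose psi_i = 0 wherever the coefficient is unidentified.
psi_unrestricted = np.array([0.4, 0.7, 0.9])
psi = np.where(connected, psi_unrestricted, 0.0)
```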

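An extension that incorporates richer temporal and spatial dynamics and accommodates negative as well as positive connections is given by

$$x_t = \sum_{j=1}^{h_\lambda}\Lambda_j x_{t-j} + \sum_{j=0}^{h_\psi^+}\Psi_j^+ W^+ x_{t-j} + \sum_{j=0}^{h_\psi^-}\Psi_j^- W^- x_{t-j} + u_t$$

where $h_\lambda = \max(h_{\lambda 1}, \ldots, h_{\lambda N})$; $h_\psi^+ = \max(h_{\psi 1}^+, \ldots, h_{\psi N}^+)$; $h_\psi^- = \max(h_{\psi 1}^-, \ldots, h_{\psi N}^-)$; $\Lambda_j$, $\Psi_j^+$, $\Psi_j^-$ are $N \times N$ diagonal matrices with $\lambda_{ij}$, $\psi_{ij}^+$ and $\psi_{ij}^-$ over $i$ as their diagonal elements. Also, $W^+$ and $W^-$ are $N \times N$ network matrices for positive and negative connections, respectively, such that $W = W^+ + W^-$. We set $h_\lambda = h_\psi^+ = h_\psi^- = 1$ for expositional simplicity.

ABP propose a QML procedure. The following concentrated log-likelihood function can be used:

$$\ell\left(\psi_0^+, \psi_0^-\right) \propto T\ln\left|I_N - \Psi_0^+ W^+ - \Psi_0^- W^-\right| - \frac{T}{2}\sum_{i=1}^{N}\ln\left(\frac{1}{T}\tilde{x}_i' M_i \tilde{x}_i\right)$$

where

$$\tilde{x}_i = x_i - \psi_{i0}^+ x_i^+ - \psi_{i0}^- x_i^-$$

$$M_i = I_T - Z_i(Z_i'Z_i)^{-1}Z_i', \quad Z_i = \left(x_{i,-1}, x_{i,-1}^+, x_{i,-1}^-\right)$$

$$\psi_0^+ = \left(\psi_{10}^+, \ldots, \psi_{N0}^+\right)', \quad \psi_0^- = \left(\psi_{10}^-, \ldots, \psi_{N0}^-\right)'.$$

The parameters of the lagged variables, $\lambda_1$, $\psi_1^+$ and $\psi_1^-$, can be estimated by least squares applied to the equations for individual units conditional on $\psi_{i0}^+$ and $\psi_{i0}^-$. For inference the analysis must be carried out with respect to the unconcentrated log-likelihood function in terms of $\theta = (\theta_1', \ldots, \theta_N')'$, where $\theta_i = \left(\psi_{i0}^+, \psi_{i0}^-, \psi_{i1}^+, \psi_{i1}^-, \lambda_{i1}, \sigma_{ui}^2\right)'$. The variance-covariance matrix of $\hat{\theta}_{ML}$ is computed as

$$\hat{\Sigma}_{\hat{\theta}_{ML}} = \left[-\frac{1}{T}\frac{\partial^2 \ell\left(\hat{\theta}_{ML}\right)}{\partial\hat{\theta}_{ML}\,\partial\hat{\theta}_{ML}'}\right]^{-1}$$

See BHP for more detailed applications.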

6.6 The Spatiotemporal Autoregressive Distributed Lag (STARDL) Modelling

The ability of spatial econometric models to capture co-dependencies across a known terrain or network at relatively low parametric cost has proved highly attractive to economists and regional scientists. Following early work by Cliff and Ord (1973), Upton and Fingleton (1985), Anselin (1988), Cressie (1993) and Kelejian and Robinson (1993), these models have been used in a wide range of applications. The spatial lag or spatial autoregressive model, which captures spatial correlation within the system through the dependent variable, has proved the most popular but can be restrictive. A more general form, the spatial Durbin model, is also widely used and allows the explanatory variables of one unit to impact the dependent variable of another directly as well as indirectly through their impact on the original dependent variable. For example, an improvement in school quality in one area will directly improve house prices in neighbouring areas, whose residents may access those schools, as well as indirectly through the transmission of increased house prices from one area to its neighbours. Investigating identification in spatial Durbin models under both instrumental variable and maximum likelihood estimation, Lee and Yu (2016) show that significant biases can arise if relevant Durbin terms are omitted while their unnecessary inclusion causes no material loss of efficiency. Our model is in the spirit of the spatial Durbin model but generalises it to include time as well as spatial lags.

As the availability of spatial datasets with a large time dimension grows, it is of interest to explore time dynamics in greater detail. Much of the early work on spatial models was done with large cross-section dimensions in mind. In data with a time dimension, a static model is, in effect, assuming that the spatial system is only observed on (or relatively close to) its equilibrium path. Because many systems take some time to adjust fully to shocks, interest in dynamic spatial models has grown recently (see Elhorst (2014) for an overview), with the spatial panel data model now well established; see Anselin et al. (2009), Baltagi et al. (2003) and Lee and Yu (2011), among others. Yu et al. (2008) study the stable spatial dynamic panel data model, featuring individual time lags, spatial time lags and contemporaneous spatial lags. Maximum likelihood estimation requires bias correction when the time dimension is small, but alternative approaches, using time lags as instrumental variables, are available. Elhorst (2010) uses Monte Carlo studies to investigate small sample performances of various estimators. While these models endow the system with some temporal memory, they are incapable of capturing the dynamics seen in many economic series. We therefore propose to generalise the spatial panel model to higher-order temporal dynamics through the spatial-temporal autoregressive distributed lag (STARDL) model. In time series econometrics, the autoregressive distributed lag (ARDL) model has proved an extremely effective tool for both the estimation of dynamic parameters and for understanding the interaction of variables over time, becoming widely used in particular to differentiate short-run and long-run behaviour and the process of adjustment to new equilibria; see Pesaran and Shin (1998), Pesaran et al. (2001) and Shin et al. (2014). We adopt this approach in the novel context of spatially correlated data.

A significant drawback to spatial models is that the spatial weighting matrix must be known a priori. A range of methods have been used to construct weighting matrices in applied work, including contiguity, inverse distance or measures of similarity, with the only restriction that no unit can be its own neighbour. The researcher may also choose to normalise the rows of this matrix, for example to sum to unity. When, as in the overwhelming majority of cases, a common spatial parameter is used, this matrix determines not only which units are to be considered as neighbours and their relative importance but also the relative intensity with which each unit is influenced by its neighbours. Only the general intensity of transmission within the system is left to be estimated; the relative degree of openness to transmission or susceptibility to external influence of each unit is assumed known.

We follow a recent paper by Aquaro, Bailey and Pesaran (2015), which is unusual in allowing heterogeneous spatial lag parameters, although our model is more general in allowing for time dynamics and also encompasses the widely used spatial Durbin model as a special case. This relaxation offers a number of advantages. Firstly, the predictions of the model are invariant to any row-normalisation of the spatial weighting matrix, in the sense that the residuals are not affected by multiplying the spatial weights of the neighbours and dividing the spatial coefficient of unit $i$ by an arbitrary constant. Secondly, it enables the researcher to estimate the relative openness of each spatial unit. As is common in spatial models, our setup assumes that the relative importance of different neighbours is known while leaving the importance of the spatial effect relative to other influences and relative to other spatial units to be estimated.

Allowing for parameter heterogeneity raises significant complications for estimation and interpretation that this paper addresses. As some simultaneity is inevitable in spatial models, we propose to develop a quasi-maximum likelihood (QML) estimator and one utilising internal instruments. QML techniques, based around a transformation of the data, were developed by Ord (1975), Cliff and Ord (1981) and Anselin (1980) among others, and the asymptotic properties were studied rigorously in Lee (2004). The evaluation of the quasi-likelihood requires calculation of a Jacobian of a matrix which grows with the cross-section dimension, typically by calculating the eigenvalues of the weighting matrix. As pointed out by Kelejian and Prucha (1998), this becomes computationally difficult for large cross-sections, and the problem is exacerbated by heterogeneity across the spatial parameters, which prohibits the typical approach. In the light of these difficulties we also develop an alternative control function approach, following the instrumental variables/method of moments approach that has been developed by Anselin (1980), Kelejian and Robinson (1993), Kelejian and Prucha (1998, 1999, 2004), Lee and Liu (2010), Lee and Yu (2014) and Kuersteiner and Prucha (2018). The choice of the optimal set of instruments is also discussed in Anselin (1980), Land and Deane (1992) and Lee (2003). Importantly, the presence of time and spatial correlation equips our model with a greater variety of instruments, which we exploit in developing the estimator of the parameters. We undertake an asymptotic analysis of both estimators, establishing conditions for the stability of the model and the identification of the parameters, and show that, under certain conditions, they are consistent and normally distributed asymptotically.

We explore the properties of both the QML and control function estimators in a Monte Carlo simulation. Our results indicate that both methods provide good estimates in finite samples, with both bias and root mean square error falling as the time dimension increases. The results are largely unaffected by increases in the cross-section dimension, with the control function providing a far easier algorithm to implement. The control function estimator is also shown to be robust to heteroskedasticity and to time dependence in the regressor, while the QML estimator has the lower root mean square error. We investigate both methods using different row-normalised weighting matrices and find that performance is maintained regardless of its sparsity.

While offering useful flexibility, allowing heterogeneity across a large number of units can make it difficult to interpret results in a meaningful way, once estimated, and to analyse the influence of both spatial interactions and diffusion dependence. Another contribution of the paper is to provide two such measures, the individual spatio-temporal dynamic multipliers and the system diffusion multipliers, based on work by Shin et al. (2014). We then illustrate their usefulness in an empirical application looking at casualty data from the 2003 Iraq war and its aftermath. Although the conduct of the war itself was initially quick and decisive, subsequent attempts by the coalition forces to maintain the safety of the Iraqi population during the subsequent insurgency required considerable efforts over a longer time horizon. Using novel (and famously unofficial) casualty data, we analyse the effect of insurgent deaths on those of civilians in each of the 18 provinces of Iraq, modelling both time and spatial contaminations. Our results highlight the contrasting experiences of the most populous province, the capital Baghdad, where the majority of civilian casualties occurred but which, nevertheless, was largely a net propagator of civilian violence, and the Shia centre of Basrah, where the number of civilian casualties was particularly sensitive to events in neighbouring provinces.

We consider the STARDL model with the heterogeneous parameters:

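$$y_{it} = \phi_i y_{i,t-1} + \pi_{i0}' x_{it} + \pi_{i1}' x_{i,t-1} + \phi_{i0}^* y_{it}^* + \phi_{i1}^* y_{i,t-1}^* + \pi_{i0}^{*\prime} x_{it}^* + \pi_{i1}^{*\prime} x_{i,t-1}^* + u_{it} \tag{173}$$

where $y_{it}$ is the scalar dependent variable of the $i$th spatial unit at time $t$, $x_{it} = \left(x_{it}^1, \ldots, x_{it}^K\right)'$ is a $K \times 1$ vector of exogenous regressors with a $K \times 1$ vector of parameters, $\pi_{i0} = \left(\pi_{i0}^1, \ldots, \pi_{i0}^K\right)'$. Similarly for $x_{i,t-1} = \left(x_{i,t-1}^1, \ldots, x_{i,t-1}^K\right)'$ and $\pi_{i1} = \left(\pi_{i1}^1, \ldots, \pi_{i1}^K\right)'$.

The spatial variables, $y_{it}^*$ and $x_{it}^*$, are defined by

$$y_{it}^* = \sum_{j=1}^{N} w_{ij}y_{jt} = w_i y_t \quad \text{with} \quad \underset{N \times 1}{y_t} = (y_{1t}, \ldots, y_{Nt})', \tag{174}$$

$$\underset{K \times 1}{x_{it}^*} = \begin{bmatrix} x_{it}^{1*} \\ \vdots \\ x_{it}^{K*} \end{bmatrix} = \begin{bmatrix} \sum_{j=1}^{N} w_{ij}x_{jt}^1 \\ \vdots \\ \sum_{j=1}^{N} w_{ij}x_{jt}^K \end{bmatrix} = (w_i \otimes \iota_K)x_t; \quad \underset{NK \times 1}{x_t} = \begin{bmatrix} x_{1t} \\ \vdots \\ x_{Nt} \end{bmatrix} \tag{175}$$

where $w_i = (w_{i1}, \ldots, w_{iN})$ is a $1 \times N$ vector of spatial weights determined a priori with $w_{ii} = 0$, and $\iota_K$ is a $K \times 1$ vector of unity. Similarly,

$$y_{i,t-1}^* = \sum_{j=1}^{N} w_{ij}y_{j,t-1} = w_i y_{t-1} \quad \text{with} \quad \underset{N \times 1}{y_{t-1}} = (y_{1,t-1}, \ldots, y_{N,t-1})',$$

$$x_{i,t-1}^* = \left(x_{i,t-1}^{1*}, \ldots, x_{i,t-1}^{K*}\right)' = (w_i \otimes \iota_K)x_{t-1} \quad \text{with} \quad \underset{NK \times 1}{x_{t-1}} = \left(x_{1,t-1}', \ldots, x_{N,t-1}'\right)'.$$

Then, $y_t^* = (y_{1t}^*, \ldots, y_{Nt}^*)'$ and $x_t^* = (x_{1t}^{*\prime}, \ldots, x_{Nt}^{*\prime})'$ can be expressed as

$$\underset{N \times 1}{y_t^*} = W y_t \quad \text{and} \quad \underset{NK \times 1}{x_t^*} = (W \otimes \iota_K)x_t \tag{176}$$

where $W$ is the $N \times N$ matrix of the spatial weights given by

$$W = \begin{bmatrix} w_{11} & \cdots & w_{1N} \\ \vdots & \ddots & \vdots \\ w_{N1} & \cdots & w_{NN} \end{bmatrix} = \begin{bmatrix} w_1 \\ \vdots \\ w_N \end{bmatrix} \quad \text{with } w_{ii} = 0 \tag{177}$$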

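Stacking the individual STARDL(1,1) regressions, (173), we have the following generalised spatial representation:

$$y_t = \Phi y_{t-1} + \Pi_0 x_t + \Pi_1 x_{t-1} + \Phi_0^* W y_t + \Phi_1^* W y_{t-1} + \Pi_0^*(W \otimes \iota_K)x_t + \Pi_1^*(W \otimes \iota_K)x_{t-1} + u_t \tag{178}$$

where

$$\underset{N \times N}{\Phi} = \begin{bmatrix} \phi_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \phi_N \end{bmatrix}, \quad \underset{N \times N}{\Phi_h^*} = \begin{bmatrix} \phi_{1h}^* & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \phi_{Nh}^* \end{bmatrix} \quad \text{for } h = 0, 1$$

$$\underset{N \times NK}{\Pi_h} = \begin{bmatrix} \pi_{1h}' & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \pi_{Nh}' \end{bmatrix}, \quad \underset{N \times NK}{\Pi_h^*} = \begin{bmatrix} \pi_{1h}^{*\prime} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \pi_{Nh}^{*\prime} \end{bmatrix} \quad \text{for } h = 0, 1$$

It is straightforward to develop the general STARDL($p$, $q$) model with the heterogeneous parameters as follows:

$$y_{it} = \sum_{h=1}^{p}\phi_{ih}y_{i,t-h} + \sum_{h=0}^{q}\pi_{ih}' x_{i,t-h} + \sum_{h=0}^{p}\phi_{ih}^* y_{i,t-h}^* + \sum_{h=0}^{q}\pi_{ih}^{*\prime} x_{i,t-h}^* + u_{it} \tag{179}$$

Suppose that the lag orders $p$ and $q$ are selected sufficiently large and assume that $x_{it}$ are exogenous. In such a case the $u_{it}$'s in (179) are free from serial correlation. Stacking the individual STARDL($p$, $q$) regressions, (179), we have the following generalised spatial representation:

$$y_t = \sum_{h=1}^{p}\Phi_h y_{t-h} + \sum_{h=0}^{q}\Pi_h x_{t-h} + \sum_{h=0}^{p}\Phi_h^* W y_{t-h} + \sum_{h=0}^{q}\Pi_h^*(W \otimes \iota_K)x_{t-h} + u_t \tag{180}$$

where

$$\underset{N \times N}{\Phi_h} = \begin{bmatrix} \phi_{1h} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \phi_{Nh} \end{bmatrix},\ h = 1, \ldots, p, \quad \underset{N \times N}{\Phi_h^*} = \begin{bmatrix} \phi_{1h}^* & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \phi_{Nh}^* \end{bmatrix},\ h = 0, 1, \ldots, p$$

$$\underset{N \times NK}{\Pi_h} = \begin{bmatrix} \pi_{1h}' & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \pi_{Nh}' \end{bmatrix}, \quad \underset{N \times NK}{\Pi_h^*} = \begin{bmatrix} \pi_{1h}^{*\prime} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \pi_{Nh}^{*\prime} \end{bmatrix},\ h = 0, 1, \ldots, q.$$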

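Remark: Spatial stability: The eigenvalues of $\Phi_0^* W$ lie inside the unit circle.

Remark: Time stability: We rewrite equation (180) as

$$y_t = \sum_{h=1}^{p}\tilde{\Phi}_h y_{t-h} + \sum_{h=0}^{q}\tilde{\Pi}_h x_{t-h} + \tilde{u}_t, \tag{181}$$

where $\tilde{\Phi}_h = (I_N - \Phi_0^* W)^{-1}(\Phi_h + \Phi_h^* W)$, $\tilde{\Pi}_h = (I_N - \Phi_0^* W)^{-1}\left[\Pi_h + \Pi_h^*(W \otimes \iota_K)\right]$, and $\tilde{u}_t = (I_N - \Phi_0^* W)^{-1}u_t$. The roots of the $N \times N$ matrix polynomial $\tilde{\Phi}(z) = I_N - \sum_{h=1}^{p}\tilde{\Phi}_h z^h$ lie outside the unit circle.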
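Both stability remarks can be verified numerically for given parameter values. The sketch below computes the spectral radius of $\Phi_0^* W$ for spatial stability and, for time stability, the spectral radius of the companion matrix built from $\tilde{\Phi}_1, \ldots, \tilde{\Phi}_p$; the roots of $\tilde{\Phi}(z)$ lie outside the unit circle exactly when the companion eigenvalues lie inside it. The function name, the ring-shaped $W$ and the homogeneous parameter values are illustrative assumptions.

```python
import numpy as np

def stardl_stability(Phi, Phi_star, W):
    """Spectral radii for the spatial and time stability conditions.

    Phi      : list of p arrays of length N (phi_{ih}, h = 1..p)
    Phi_star : list of p+1 arrays of length N (phi*_{ih}, h = 0..p)
    Both radii must be below one for the model to be stable.
    """
    N, p = W.shape[0], len(Phi)
    Phi0_star_W = np.diag(Phi_star[0]) @ W
    spatial = np.max(np.abs(np.linalg.eigvals(Phi0_star_W)))

    S0 = np.eye(N) - Phi0_star_W
    blocks = [np.linalg.solve(S0, np.diag(Phi[h - 1]) + np.diag(Phi_star[h]) @ W)
              for h in range(1, p + 1)]
    companion = np.zeros((N * p, N * p))
    companion[:N, :] = np.hstack(blocks)          # first block row: Phi~_1 ... Phi~_p
    if p > 1:
        companion[N:, :-N] = np.eye(N * (p - 1))  # identity blocks below
    temporal = np.max(np.abs(np.linalg.eigvals(companion)))
    return spatial, temporal

# Hypothetical STARDL(1, .) example on a row-normalised ring of 6 units.
N = 6
W = np.zeros((N, N))
for i in range(N):
    W[i, (i - 1) % N] = W[i, (i + 1) % N] = 0.5

Phi = [np.full(N, 0.3)]                           # phi_{i1}
Phi_star = [np.full(N, 0.2), np.full(N, 0.1)]     # phi*_{i0}, phi*_{i1}
spatial_radius, temporal_radius = stardl_stability(Phi, Phi_star, W)
is_stable = bool(spatial_radius < 1 and temporal_radius < 1)
```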

6.6.1 Estimation and Inference

To deal with the endogeneity of $y_{it}^*$ in (179) we apply the control function approach.33 Consider the following CF DGP for $y_{it}^*$:

$$y_{it}^* = \varphi_i' z_{it} + v_{it} \quad \text{with} \quad E(z_{it}v_{it}) = 0 \tag{182}$$

where $z_{it}$ is the $L \times 1$ vector of exogenous variables:

$$z_{it} = \left(z_{it}^{1\prime}, z_{it}^{2\prime}\right)'$$

where $z_{it}^1$ is the $L_1 \times 1$ vector of exogenous variables included in (179) and $z_{it}^2$ is the $L_2 \times 1$ vector of exogenous variables excluded from it.

We assume that the error terms $u_{it}$ and $v_{it}$ have a joint distribution:

$$\begin{pmatrix} u_{it} \\ v_{it} \end{pmatrix} \sim iid\,(0, \Sigma_{uv}), \quad \Sigma_{uv} = \begin{bmatrix} \sigma_u^2 & \sigma_{uv} \\ \sigma_{uv} & \sigma_v^2 \end{bmatrix} \tag{183}$$

Furthermore, $E(u_{it}|v_{it}) = \rho v_{it}$ and $Var(u_{it}|v_{it}) = \sigma_e^2$. The endogeneity of $y_{it}^*$ comes from the correlation between $u_{it}$ and $v_{it}$. Using (183), we can construct34

$$u_{it} = \rho v_{it} + e_{it} \tag{184}$$

where $\rho = E(v_{it}u_{it})/E(v_{it}^2)$. By construction, $E(z_{it}e_{it}) = 0$ and $E(v_{it}e_{it}) = 0$. Replacing $u_{it}$ by (184), we obtain the following transformation of (179):

$$y_{it} = \sum_{h=1}^{p}\phi_{ih}y_{i,t-h} + \sum_{h=0}^{q}\pi_{ih}' x_{i,t-h} + \sum_{h=0}^{p}\phi_{ih}^* y_{i,t-h}^* + \sum_{h=0}^{q}\pi_{ih}^{*\prime} x_{i,t-h}^* + \rho v_{it} + e_{it} \tag{185}$$

where $v_{it}$ is the additional control variable, rendering the new error term, $e_{it}$, uncorrelated with $y_{it}^*$ as well as with $v_{it}$ and the other regressors in (185). In practice, we use the two-step procedure: (i) obtain the reduced form residuals, $\hat{v}_{it} = y_{it}^* - \hat{\varphi}_i' z_{it}$, from (182) and (ii) then run the following regression:

$$y_{it} = \sum_{h=1}^{p}\phi_{ih}y_{i,t-h} + \sum_{h=0}^{q}\pi_{ih}' x_{i,t-h} + \sum_{h=0}^{p}\phi_{ih}^* y_{i,t-h}^* + \sum_{h=0}^{q}\pi_{ih}^{*\prime} x_{i,t-h}^* + \rho\hat{v}_{it} + e_{it}^* \tag{186}$$

where $e_{it}^* = e_{it} + \rho\left(\hat{\varphi}_i - \varphi_i\right)' z_{it}$ depends on the sampling error in $\hat{\varphi}_i$ unless $\rho = 0$ (exogeneity test). Then, the OLS estimator from (186) will be consistent for all the parameters.

33 Most linear models are estimated using IV methods, such as two stage least squares (2SLS). An alternative, the control function (CF) approach, relies on the same kind of identification conditions. However, in models with nonlinearities or random coefficients, the form of exogeneity is stronger and more restrictions are imposed on the reduced forms.

34 In the special case where $(u_{it}, v_{it})'$ has a jointly normal distribution, then

$$u_{it}|v_{it} \sim N\left(\frac{\sigma_{uv}}{\sigma_v^2}v_{it},\ \sigma_u^2 - \frac{\sigma_{uv}^2}{\sigma_v^2}\right)$$

and $e_{it}$ is independent of $v_{it}$.

Remark: The selection of exogenous IVs, $z_{it}$. We prefer to construct them internally. Potentially, under the assumption of exogeneity of $x_{it}$, $x_{it}$ and $x_{it}^*$ would be good candidates. Then, why not $x_{jt}$ and $x_{jt}^*$, $j (\neq i) = 1, \ldots, N$? For the current application on the relationship between inflation and the output gap, we suggest using the cross-section average of $y_{i,t-1}$, $\bar{y}_{t-1}$, as the just-identified IV for $y_{it}^*$. $y_{i,t-1}$ is uncorrelated with $u_{it}$ whereas $\bar{y}_{t-1}$ (measuring a sort of global inflation) should be highly correlated with $y_{it}^*$ (measuring the individual country-specific global inflation).
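The two-step procedure in (182) to (186) can be sketched for a single unit with simulated data. This is a deliberately simplified illustration, not the estimator used in the text: the design has one included exogenous regressor, one excluded instrument and no lags, and all names and parameter values are hypothetical. In this linear setting the CF coefficients on the structural regressors coincide numerically with 2SLS, which provides a useful check.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def control_function(y, y_star, X_incl, Z_excl):
    """Two-step CF estimator: (i) reduced-form residuals, (ii) augmented OLS."""
    Z = np.column_stack([X_incl, Z_excl])           # all exogenous variables z_it
    v_hat = y_star - Z @ ols(Z, y_star)             # step (i): residuals from (182)
    R = np.column_stack([X_incl, y_star, v_hat])    # step (ii): regressors in (186)
    coef = ols(R, y)
    k = X_incl.shape[1]
    return {"pi": coef[:k], "phi_star": coef[k], "rho": coef[k + 1]}

# Simulated data for one unit (hypothetical design).
rng = np.random.default_rng(0)
T = 2000
x = rng.standard_normal((T, 1))       # included exogenous regressor
z2 = rng.standard_normal((T, 1))      # excluded instrument
v = rng.standard_normal(T)
e = rng.standard_normal(T)
u = 0.8 * v + e                       # u_it = rho * v_it + e_it with rho = 0.8
y_star = 0.5 * x[:, 0] + 1.0 * z2[:, 0] + v
y = 0.4 * y_star + 1.0 * x[:, 0] + u  # true phi* = 0.4, pi = 1.0

cf = control_function(y, y_star, x, z2)

# 2SLS for comparison: replace y* by its first-stage fitted values.
Z = np.column_stack([x, z2])
y_star_fit = Z @ ols(Z, y_star)
tsls = ols(np.column_stack([x, y_star_fit]), y)
```

A t-test on the estimated coefficient of `v_hat` (here `cf["rho"]`) corresponds to the exogeneity test mentioned in the text; standard errors in the second step would need to account for the generated regressor.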

6.6.2 The Spatiotemporal Dynamic Multipliers

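It is straightforward to derive the (dynamic) multipliers associated with unit changes in $x_t$, $y^{*}_t$ and $x^{*}_t$ on $y_t$. Rewrite the STARDL(p, q) model (179) as$^{35}$

\[ \phi_i(L)y_{it} = \phi^{*}_i(L)y^{*}_{it} + \pi_i(L)x_{it} + \pi^{*}_i(L)x^{*}_{it} + u_{it} \tag{187} \]

where

\[ \phi_i(L) = 1 - \sum_{h=1}^{p}\phi_{ih}L^{h};\quad \phi^{*}_i(L) = \sum_{h=0}^{p}\phi^{*}_{ih}L^{h};\quad \pi_i(L) = \sum_{h=0}^{q}\pi_{ih}'L^{h};\quad \pi^{*}_i(L) = \sum_{h=0}^{q}\pi^{*\prime}_{ih}L^{h}. \]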

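Premultiplying (187) by the inverse of $\phi_i(L)$, we obtain:

\[ y_{it} = \tilde\phi^{*}_i(L)y^{*}_{it} + \tilde\pi_i(L)x_{it} + \tilde\pi^{*}_i(L)x^{*}_{it} + \tilde u_{it} \tag{188} \]

where $\tilde\phi^{*}_i(L) = \sum_{j=0}^{\infty}\tilde\phi^{*}_{ij}L^{j} = [\phi_i(L)]^{-1}\phi^{*}_i(L)$, $\tilde\pi_i(L) = \sum_{j=0}^{\infty}\tilde\pi_{ij}'L^{j} = [\phi_i(L)]^{-1}\pi_i(L)$, $\tilde\pi^{*}_i(L) = \sum_{j=0}^{\infty}\tilde\pi^{*\prime}_{ij}L^{j} = [\phi_i(L)]^{-1}\pi^{*}_i(L)$ and $\tilde u_{it} = [\phi_i(L)]^{-1}u_{it}$.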

35 To construct the dynamic multipliers, we should use the structural parameters in (179), which are consistently estimated by the CF-augmented regression, (186).


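The dynamic multipliers, $\tilde\phi^{*}_{ij}$, $\tilde\pi_{ij}'$ and $\tilde\pi^{*\prime}_{ij}$ for $j = 0, 1, \ldots$, can be evaluated using the following recursive relationships:

\[ \tilde\phi^{*}_{ij} = \phi_{i1}\tilde\phi^{*}_{i,j-1} + \phi_{i2}\tilde\phi^{*}_{i,j-2} + \cdots + \phi_{i,j-1}\tilde\phi^{*}_{i1} + \phi_{ij}\tilde\phi^{*}_{i0} + \phi^{*}_{ij},\quad j = 1, 2, \ldots \tag{189} \]

where $\phi_{ij} = 0$ for $j < 1$ and $\tilde\phi^{*}_{i0} = \phi^{*}_{i0}$, $\tilde\phi^{*}_{ij} = 0$ for $j < 0$ by construction,

\[ \tilde\pi_{ij}' = \phi_{i1}\tilde\pi_{i,j-1}' + \phi_{i2}\tilde\pi_{i,j-2}' + \cdots + \phi_{i,j-1}\tilde\pi_{i1}' + \phi_{ij}\tilde\pi_{i0}' + \pi_{ij}',\quad j = 1, 2, \ldots \tag{190} \]

where $\tilde\pi_{i0}' = \pi_{i0}'$, $\tilde\pi_{ij}' = 0$ for $j < 0$, and

\[ \tilde\pi^{*\prime}_{ij} = \phi_{i1}\tilde\pi^{*\prime}_{i,j-1} + \phi_{i2}\tilde\pi^{*\prime}_{i,j-2} + \cdots + \phi_{i,j-1}\tilde\pi^{*\prime}_{i1} + \phi_{ij}\tilde\pi^{*\prime}_{i0} + \pi^{*\prime}_{ij},\quad j = 1, 2, \ldots \tag{191} \]

where $\tilde\pi^{*\prime}_{i0} = \pi^{*\prime}_{i0}$, $\tilde\pi^{*\prime}_{ij} = 0$ for $j < 0$.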

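Define the dynamic multiplier effects as

\[ \frac{\partial y_{i,t+h}}{\partial y^{*}_{it}},\qquad \frac{\partial y_{i,t+h}}{\partial x_{it}'} = \left[\frac{\partial y_{i,t+h}}{\partial x^{1}_{it}},\ldots,\frac{\partial y_{i,t+h}}{\partial x^{K}_{it}}\right]_{1\times K},\qquad \frac{\partial y_{i,t+h}}{\partial x^{*\prime}_{it}} = \left[\frac{\partial y_{i,t+h}}{\partial x^{*1}_{it}},\ldots,\frac{\partial y_{i,t+h}}{\partial x^{*K}_{it}}\right]_{1\times K}. \]

The cumulative dynamic multiplier effects of $y^{*}_{it}$, $x_{it}$ and $x^{*}_{it}$ on $y_{i,t+h}$ for $h = 0,\ldots,H$ can be evaluated as follows:

\[ m_{y_i}(y^{*}_i, H) = \sum_{h=0}^{H}\tilde\phi^{*}_{ih},\qquad m_{y_i}(x_i, H) = \sum_{h=0}^{H}\tilde\pi_{ih}',\qquad m_{y_i}(x^{*}_i, H) = \sum_{h=0}^{H}\tilde\pi^{*\prime}_{ih},\qquad H = 0, 1, \ldots \]

By construction, as $H\to\infty$,

\[ m_{y_i}(y^{*}_i, H)\to\beta_{yi};\qquad m_{y_i}(x_i, H)\to\beta_{xi}';\qquad m_{y_i}(x^{*}_i, H)\to\beta^{*\prime}_{xi} \]

where $\beta_{yi}$, $\beta_{xi}$ and $\beta^{*}_{xi}$ are the associated long-run coefficients.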
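The recursion for the multipliers with respect to $y^{*}_{it}$ and their convergence to the long-run coefficient can be checked numerically. The sketch below uses made-up coefficient values and a hypothetical helper name; it assumes $\phi_{ij} = 0$ and $\phi^{*}_{ij} = 0$ for $j > p$, and uses the fact that the long-run coefficient equals $\phi^{*}_i(1)/\phi_i(1)$ when the lag polynomial is stable.

```python
def dynamic_multipliers(phi, phi_star, horizon):
    """Recursive dynamic multipliers for one unit.

    phi      : [phi_1, ..., phi_p]            (own autoregressive lags)
    phi_star : [phi*_0, phi*_1, ..., phi*_p]  (spatial lag coefficients)
    Returns the multipliers for horizons 0, ..., horizon.
    """
    out = []
    for j in range(horizon + 1):
        # Direct term phi*_j (zero beyond the lag order).
        val = phi_star[j] if j < len(phi_star) else 0.0
        # Feedback through own lags: sum_h phi_h * multiplier_{j-h}.
        for h in range(1, min(j, len(phi)) + 1):
            val += phi[h - 1] * out[j - h]
        out.append(val)
    return out

# Illustrative (hypothetical) parameter values for a stable model with p = 2.
phi = [0.5, 0.2]
phi_star = [0.3, 0.1, 0.05]

mult = dynamic_multipliers(phi, phi_star, horizon=200)
cumulative = sum(mult)                          # cumulative multiplier at H = 200
long_run = sum(phi_star) / (1.0 - sum(phi))     # phi*(1) / phi(1)
print(mult[:3], cumulative, long_run)
```

The same function applies element by element to the multipliers for $x_{it}$ and $x^{*}_{it}$ by passing the corresponding coefficient sequences in place of `phi_star`.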

Remark: An important feature of the STARDL model is that it captures three different forms of dynamic adjustment from the initial equilibrium to the new equilibrium following an economic perturbation with respect to domestic conditions, overseas conditions and overseas policy decisions.

• An investigation of the key parameters in (179) enables us to categorise countries into groups, say countries that focus on domestic conditions only (e.g. the US) and those that pay attention to both domestic and overseas conditions. A small open economy is likely to depend more on overseas conditions, and vice versa. These patterns may also differ between the short and the long run.

• We may apply the mean group estimation of the key parameters and the dynamic multipliers to see the overall mean patterns on a global scale.

We now develop spatial-dynamic multipliers more generally in terms of the spatial system representation (180). We rewrite (180) as

\[ \Phi(L)y_t = \Phi^{*}(L)Wy_t + \Pi(L)x_t + \Pi^{*}(L)(W\otimes\iota_K)x_t + u_t \tag{192} \]


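where

\[ \Phi(L) = I_N - \sum_{h=1}^{p}\Phi_hL^{h};\quad \Phi^{*}(L) = \sum_{h=0}^{p}\Phi^{*}_hL^{h};\quad \Pi(L) = \sum_{h=0}^{q}\Pi_hL^{h};\quad \Pi^{*}(L) = \sum_{h=0}^{q}\Pi^{*}_hL^{h}. \]

Premultiplying (192) by the inverse of $\Phi(L)$, we obtain:

\[ y_t = \tilde\Phi^{*}(L)Wy_t + \tilde\Pi(L)x_t + \tilde\Pi^{*}(L)(W\otimes\iota_K)x_t + \tilde u_t \tag{193} \]

where $\tilde\Phi^{*}(L) = \sum_{h=0}^{\infty}\tilde\Phi^{*}_hL^{h} = [\Phi(L)]^{-1}\Phi^{*}(L)$, $\tilde\Pi(L) = \sum_{h=0}^{\infty}\tilde\Pi_hL^{h} = [\Phi(L)]^{-1}\Pi(L)$, $\tilde\Pi^{*}(L) = \sum_{h=0}^{\infty}\tilde\Pi^{*}_hL^{h} = [\Phi(L)]^{-1}\Pi^{*}(L)$, and $\tilde u_t = [\Phi(L)]^{-1}u_t$.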

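The dynamic multipliers, $\tilde\Phi^{*}_j$, $\tilde\Pi_j$ and $\tilde\Pi^{*}_j$ for $j = 0, 1, \ldots$, can be evaluated using the following recursive relationships:

\[ \tilde\Phi^{*}_j = \Phi_1\tilde\Phi^{*}_{j-1} + \Phi_2\tilde\Phi^{*}_{j-2} + \cdots + \Phi_{j-1}\tilde\Phi^{*}_1 + \Phi_j\tilde\Phi^{*}_0 + \Phi^{*}_j,\quad j = 1, 2, \ldots \tag{194} \]

where $\Phi_j = 0$ for $j < 1$ and $\tilde\Phi^{*}_0 = \Phi^{*}_0$, $\tilde\Phi^{*}_j = 0$ for $j < 0$,

\[ \tilde\Pi_j = \Phi_1\tilde\Pi_{j-1} + \Phi_2\tilde\Pi_{j-2} + \cdots + \Phi_{j-1}\tilde\Pi_1 + \Phi_j\tilde\Pi_0 + \Pi_j,\quad j = 1, 2, \ldots \tag{195} \]

where $\tilde\Pi_0 = \Pi_0$, $\tilde\Pi_j = 0$ for $j < 0$, and

\[ \tilde\Pi^{*}_j = \Phi_1\tilde\Pi^{*}_{j-1} + \Phi_2\tilde\Pi^{*}_{j-2} + \cdots + \Phi_{j-1}\tilde\Pi^{*}_1 + \Phi_j\tilde\Pi^{*}_0 + \Pi^{*}_j,\quad j = 1, 2, \ldots \tag{196} \]

where $\tilde\Pi^{*}_0 = \Pi^{*}_0$, $\tilde\Pi^{*}_j = 0$ for $j < 0$. The matrices of the cumulative dynamic multiplier effects can be evaluated as follows:

\[ m_{y^{*}}(H) = \sum_{h=0}^{H}\frac{\partial y_{t+h}}{\partial y^{*\prime}_t} = \sum_{h=0}^{H}\tilde\Phi^{*}_h,\qquad m_x(H) = \sum_{h=0}^{H}\frac{\partial y_{t+h}}{\partial x_t'} = \sum_{h=0}^{H}\tilde\Pi_h,\qquad m_{x^{*}}(H) = \sum_{h=0}^{H}\frac{\partial y_{t+h}}{\partial x^{*\prime}_t} = \sum_{h=0}^{H}\tilde\Pi^{*}_h. \]

Remark: $m_{y^{*}}(H)$, $m_x(H)$ and $m_{x^{*}}(H)$ capture the dynamic multiplier effects with respect to three different types of regressors, $y^{*}_t$, $x_t$ and $x^{*}_t$, respectively. However, they are block-diagonal by construction because $\tilde\Phi^{*}_h$, $\tilde\Pi_h$ and $\tilde\Pi^{*}_h$ are block-diagonal.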

6.6.3 Diffusion Multipliers

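We rewrite (181) as

\[ \tilde\Phi(L)y_t = \tilde\Pi(L)x_t + \tilde u_t, \tag{197} \]

where $\tilde u_t = (I_N - \Phi^{*}_0W)^{-1}u_t$,

\[ \tilde\Phi(L) = I_N - \sum_{j=1}^{p}\tilde\Phi_jL^{j}\quad\text{with}\quad \tilde\Phi_j = (I_N - \Phi^{*}_0W)^{-1}\left(\Phi_j + \Phi^{*}_jW\right), \]

\[ \tilde\Pi(L) = \sum_{j=0}^{q}\tilde\Pi_jL^{j}\quad\text{with}\quad \tilde\Pi_j = (I_N - \Phi^{*}_0W)^{-1}\left[\Pi_j + \Pi^{*}_j(W\otimes\iota_K)\right]. \]

Premultiplying (197) by the inverse of $\tilde\Phi(L)$ gives

\[ y_t = B(L)x_t + [\tilde\Phi(L)]^{-1}\tilde u_t,\qquad B(L) = \sum_{j=0}^{\infty}B_jL^{j} = [\tilde\Phi(L)]^{-1}\tilde\Pi(L). \tag{198} \]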

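The diffusion multipliers, $B_j$ for $j = 0, 1, \ldots$, can be evaluated as follows. Since $\tilde\Phi(L)B(L) = \tilde\Pi(L)$, we have

\[ \left(I_N - \sum_{j=1}^{p}\tilde\Phi_jL^{j}\right)\left(\sum_{j=0}^{\infty}B_jL^{j}\right) = \sum_{j=0}^{q}\tilde\Pi_jL^{j}, \]

and expanding the left-hand side,

\[ \left(I_N - \tilde\Phi_1L - \tilde\Phi_2L^{2} - \tilde\Phi_3L^{3} - \cdots\right)\left(B_0 + B_1L + B_2L^{2} + B_3L^{3} + \cdots\right) \]
\[ = B_0 + \left(-\tilde\Phi_1B_0 + B_1\right)L + \left(-\tilde\Phi_1B_1 - \tilde\Phi_2B_0 + B_2\right)L^{2} + \left(-\tilde\Phi_1B_2 - \tilde\Phi_2B_1 - \tilde\Phi_3B_0 + B_3\right)L^{3} + \cdots \]

Therefore, matching the coefficients on each power of $L$,

\[ B_0 = \tilde\Pi_0 \]
\[ B_1 = \tilde\Phi_1B_0 + \tilde\Pi_1 \]
\[ B_2 = \tilde\Phi_1B_1 + \tilde\Phi_2B_0 + \tilde\Pi_2 \]
\[ B_3 = \tilde\Phi_1B_2 + \tilde\Phi_2B_1 + \tilde\Phi_3B_0 + \tilde\Pi_3 \]
\[ \cdots \]
\[ B_j = \tilde\Phi_1B_{j-1} + \tilde\Phi_2B_{j-2} + \cdots + \tilde\Phi_{j-1}B_1 + \tilde\Phi_jB_0 + \tilde\Pi_j,\quad j = 1, 2, \ldots \tag{199} \]

where $B_0 = \tilde\Pi_0$ and $B_j = 0$ for $j < 0$ by construction.

Define the dynamic multiplier effects as

\[ \frac{\partial y_{t+h}}{\partial x_t'} = \begin{bmatrix}
\frac{\partial y_{1,t+h}}{\partial x^{1}_{1t}} & \cdots & \frac{\partial y_{1,t+h}}{\partial x^{k}_{1t}} & \cdots & \frac{\partial y_{1,t+h}}{\partial x^{1}_{Nt}} & \cdots & \frac{\partial y_{1,t+h}}{\partial x^{k}_{Nt}} \\
\frac{\partial y_{2,t+h}}{\partial x^{1}_{1t}} & \cdots & \frac{\partial y_{2,t+h}}{\partial x^{k}_{1t}} & \cdots & \frac{\partial y_{2,t+h}}{\partial x^{1}_{Nt}} & \cdots & \frac{\partial y_{2,t+h}}{\partial x^{k}_{Nt}} \\
\vdots & & \vdots & & \vdots & & \vdots \\
\frac{\partial y_{N,t+h}}{\partial x^{1}_{1t}} & \cdots & \frac{\partial y_{N,t+h}}{\partial x^{k}_{1t}} & \cdots & \frac{\partial y_{N,t+h}}{\partial x^{1}_{Nt}} & \cdots & \frac{\partial y_{N,t+h}}{\partial x^{k}_{Nt}}
\end{bmatrix}_{N\times NK} \tag{200} \]

Then, the matrices of the cumulative diffusion multiplier effects can be evaluated as follows: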

\[ m_x(H) = \sum_{h=0}^{H}\frac{\partial y_{t+h}}{\partial x_t'} = \sum_{h=0}^{H}B_h,\quad H = 0, 1, 2, \ldots \]
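The recursion (199) and the cumulative diffusion multipliers translate directly into matrix code. The sketch below is an illustration for the single-regressor case $k = 1$, in which $(W\otimes\iota_K)$ reduces to $W$; the function name and all parameter values are hypothetical.

```python
import numpy as np

def diffusion_multipliers(Phi, Phi_star, Pi, Pi_star, W, horizon):
    """Diffusion multipliers B_0, ..., B_horizon for k = 1.

    Phi      : [Phi_1, ..., Phi_p]      diagonal N x N matrices
    Phi_star : [Phi*_0, ..., Phi*_p]    diagonal N x N matrices
    Pi, Pi_star : [Pi_0, ..., Pi_q], [Pi*_0, ..., Pi*_q]
    """
    N = W.shape[0]
    A0_inv = np.linalg.inv(np.eye(N) - Phi_star[0] @ W)
    # Reduced-form coefficient matrices (the tilde quantities).
    Phi_t = [A0_inv @ (Phi[j] + Phi_star[j + 1] @ W) for j in range(len(Phi))]
    Pi_t = [A0_inv @ (Pi[j] + Pi_star[j] @ W) for j in range(len(Pi))]
    B = []
    for j in range(horizon + 1):
        Bj = Pi_t[j].copy() if j < len(Pi_t) else np.zeros((N, N))
        for h in range(1, min(j, len(Phi_t)) + 1):
            Bj = Bj + Phi_t[h - 1] @ B[j - h]
        B.append(Bj)
    return B

# Hypothetical three-unit example with p = 1 and q = 0.
W = np.array([[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]])
Phi = [np.diag([0.3, 0.4, 0.2])]
Phi_star = [np.diag([0.2, 0.1, 0.3]), np.diag([0.1, 0.1, 0.1])]
Pi = [np.diag([1.0, 0.5, 0.8])]
Pi_star = [np.diag([0.2, 0.2, 0.2])]

B = diffusion_multipliers(Phi, Phi_star, Pi, Pi_star, W, horizon=300)
m_x = sum(B)   # cumulative diffusion multipliers m_x(H) at H = 300

# Long-run limit implied by evaluating the lag polynomials at L = 1.
long_run = np.linalg.solve(
    np.eye(3) - Phi_star[0] @ W - Phi[0] - Phi_star[1] @ W,
    Pi[0] + Pi_star[0] @ W,
)
print(np.round(m_x, 4))
```

Unlike the block-diagonal multipliers of the previous subsection, `m_x` is generally a full matrix, because the inverse of $I_N - \Phi^{*}_0W$ spreads each shock across all units.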


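The cumulative diffusion multiplier effects of $x^{\ell}_{jt}$ on $y_{i,t+h}$ are given by the $(i, (j-1)k+\ell)$th element of the $N\times NK$ matrix, $m_x(H)$.

Consider the special case of the single regressor, $x_t$, with $k = 1$. Then, (200) is simplified to

\[ \frac{\partial y_{t+h}}{\partial x_t'} = \begin{bmatrix}
\frac{\partial y_{1,t+h}}{\partial x_{1t}} & \cdots & \frac{\partial y_{1,t+h}}{\partial x_{Nt}} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_{N,t+h}}{\partial x_{1t}} & \cdots & \frac{\partial y_{N,t+h}}{\partial x_{Nt}}
\end{bmatrix}_{N\times N},\quad h = 0, 1, \ldots \tag{201} \]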

and similarly for $m_x(H)$.

Remark: In the case of homogeneous spatial panel models, LeSage and Pace (2009) propose using the average of the main diagonal elements of the $N\times N$ matrix as a summary measure of the own-partial derivatives, which they label a direct (own-region) effect. The direct effect for region i includes some feedback loop effects that arise as a result of impacts passing through neighboring regions j and back to region i. LeSage and Pace also propose an average of the (cumulative) off-diagonal elements over all rows (observations) to produce a summary that corresponds to the cross-partial derivative or indirect (other-region) effect associated with changes in the rth explanatory variable. Debarsy et al. (2012) extend this cross-sectional reasoning to the case of dynamic space-time panel data. This allows us to compute own- and cross-partial derivatives that trace the effects (own-region and other-region) through time and space. Space-time dynamic models produce a situation where a change in the ith observation of the rth explanatory variable at time t will produce contemporaneous and future responses in all regions' dependent variables $y_{i,t+T}$ as well as other-region future responses $y_{j,t+T}$. This is due to the presence of an individual time lag (time dependence), a spatial lag (spatial dependence) and a cross-product term reflecting the space-time diffusion. The main diagonal elements of the $N\times N$ matrix sums for time horizon T represent (cumulative) own-region impacts that arise from both time and spatial dependence. The sum of off-diagonal elements reflects both spillovers, measuring contemporaneous cross-partial derivatives, and diffusion, measuring cross-partial derivatives that involve different time periods. We note that it is not possible to separate out the time dependence from spillover and diffusion effects. This implies that, except for the contemporaneous effects that represent pure spatial effects, future time horizons contain both time and space diffusion effects, which cannot be distinguished from each other. Indeed, this approach is somewhat similar to the Diebold-Yilmaz aggregate measures.

Remark: In the case with heterogeneous spatial coefficients, LeSage and Chin (2016) propose use of the N diagonal elements in (201) to produce observation-level direct effects estimates for each of the N regions. As estimates of region-specific (observation-level) indirect spill-in and spill-out effects (similar to from- and to-effects, or in-degree and out-degree in the network approach), they propose use of the sum of off-diagonal elements in each row and column. In this regard, initially, we may apply our GCM type approach to $m_x(H)$ at different horizons.

Remark: $m_x(H)$ captures the total diffusion multiplier effects with respect to $x_t$. An important issue is how to decompose the overall diffusion multipliers into the spatial and dynamic components. The model can be estimated using the QML-type algorithms employed in the spatial literature.

• Add the GCM measures and network/graph approach to the analysis of the dynamic or diffusion multipliers.
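The direct, spill-in and spill-out summaries discussed in the remarks above are simple functions of the cumulative multiplier matrix. A minimal sketch follows, assuming the $k = 1$ case in which $m_x(H)$ is $N\times N$ with element $(i, j)$ giving the effect of $x_{jt}$ on $y_i$; the function name and the numbers in the example matrix are hypothetical.

```python
import numpy as np

def effects_summary(m):
    """Summarise an N x N cumulative multiplier matrix.

    direct    : diagonal elements (own-region effects)
    spill_in  : row sums of off-diagonal elements (effects received by unit i)
    spill_out : column sums of off-diagonal elements (effects sent by unit j)
    """
    m = np.asarray(m, dtype=float)
    direct = np.diag(m).copy()
    off = m - np.diag(direct)
    spill_in = off.sum(axis=1)
    spill_out = off.sum(axis=0)
    return direct, spill_in, spill_out

# Hypothetical cumulative multiplier matrix for three regions.
m_H = np.array([[1.0, 0.2, 0.1],
                [0.3, 0.8, 0.0],
                [0.1, 0.4, 0.9]])
direct, spill_in, spill_out = effects_summary(m_H)
avg_direct = direct.mean()       # average direct effect (homogeneous-case summary)
avg_indirect = spill_in.mean()   # average indirect effect (homogeneous-case summary)
print(direct, spill_in, spill_out, avg_direct, avg_indirect)
```

The averages mirror the scalar summaries used in the homogeneous case, while the unit-level vectors correspond to the observation-level measures used in the heterogeneous case.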

6.6.4 STARDL models with observed common factors

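We can also add the global factors in a straightforward manner. We now consider the STARDL(p, q) model with the $G\times 1$ vector of observed global factors, $g_t = (g^{1}_t,\ldots,g^{G}_t)'$ (e.g. oil prices, commodity prices or a common currency such as the Euro):$^{36}$

\[ y_{it} = \sum_{h=1}^{p}\phi_{ih}y_{i,t-h} + \sum_{h=0}^{q}\pi_{ih}'x_{i,t-h} + \sum_{h=0}^{p}\phi^{*}_{ih}y^{*}_{i,t-h} + \sum_{h=0}^{q}\pi^{*\prime}_{ih}x^{*}_{i,t-h} + \sum_{h=0}^{q}\psi_{ih}'g_{t-h} + u_{it} \tag{202} \]

Stacking the individual STARDL-F(p, q) regressions, (202), we have:

\[ y_t = \sum_{h=1}^{p}\Phi_hy_{t-h} + \sum_{h=0}^{q}\Pi_hx_{t-h} + \sum_{h=0}^{p}\Phi^{*}_hWy_{t-h} + \sum_{h=0}^{q}\Pi^{*}_h(W\otimes\iota_K)x_{t-h} + \sum_{h=0}^{q}\Psi_h(i_N\otimes g_{t-h}) + u_t \tag{203} \]

where $i_N$ is an $N\times 1$ vector of ones,

\[ \underset{N\times N}{\Phi_h} = \begin{bmatrix}\phi_{1h} & \cdots & 0\\ \vdots & \ddots & \vdots\\ 0 & \cdots & \phi_{Nh}\end{bmatrix}\ \text{for } h = 1,\ldots,p,\qquad \underset{N\times N}{\Phi^{*}_h} = \begin{bmatrix}\phi^{*}_{1h} & \cdots & 0\\ \vdots & \ddots & \vdots\\ 0 & \cdots & \phi^{*}_{Nh}\end{bmatrix}\ \text{for } h = 0, 1,\ldots,p, \]

\[ \underset{N\times NK}{\Pi_h} = \begin{bmatrix}\pi_{1h}' & \cdots & 0\\ \vdots & \ddots & \vdots\\ 0 & \cdots & \pi_{Nh}'\end{bmatrix},\qquad \underset{N\times NK}{\Pi^{*}_h} = \begin{bmatrix}\pi^{*\prime}_{1h} & \cdots & 0\\ \vdots & \ddots & \vdots\\ 0 & \cdots & \pi^{*\prime}_{Nh}\end{bmatrix}, \]

\[ \underset{N\times NG}{\Psi_h} = \begin{bmatrix}\psi_{1h}' & \cdots & 0\\ \vdots & \ddots & \vdots\\ 0 & \cdots & \psi_{Nh}'\end{bmatrix}\ \text{for } h = 0, 1,\ldots,q. \]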

6.6.5 Special Case: STAR models with factors

We consider the STAR model with the heterogeneous parameters given by

\[ y_{it} = \phi_iy_{i,t-1} + \phi^{*}_{i0}y^{*}_{it} + \phi^{*}_{i1}y^{*}_{i,t-1} + u_{it} \tag{204} \]

where yit is the scalar dependent variable of the ith spatial unit at time t.

36For notational simplicity we use the same lag order q for the global factors.


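The spatial variable $y^{*}_{it}$ is defined by

\[ y^{*}_{it} = \sum_{j=1}^{N}w_{ij}y_{jt} = w_iy_t\quad\text{with}\quad \underset{N\times 1}{y_t} = (y_{1t},\ldots,y_{Nt})' \tag{205} \]

where $w_i = (w_{i1},\ldots,w_{iN})$ denotes a $1\times N$ vector of spatial weights determined a priori with $w_{ii} = 0$. Similarly,

\[ y^{*}_{i,t-1} = \sum_{j=1}^{N}w_{ij}y_{j,t-1} = w_iy_{t-1}\quad\text{with}\quad \underset{N\times 1}{y_{t-1}} = (y_{1,t-1},\ldots,y_{N,t-1})'. \]

Then, $y^{*}_t = (y^{*}_{1t},\ldots,y^{*}_{Nt})'$ can be expressed as

\[ \underset{N\times 1}{y^{*}_t} = \begin{bmatrix}w_1y_t\\ \vdots\\ w_Ny_t\end{bmatrix} = Wy_t \tag{206} \]

where $W$ is the $N\times N$ matrix of the spatial weights given by

\[ W = \begin{bmatrix}w_{11} & \cdots & w_{1N}\\ \vdots & \ddots & \vdots\\ w_{N1} & \cdots & w_{NN}\end{bmatrix} = \begin{bmatrix}w_1\\ \vdots\\ w_N\end{bmatrix}\quad\text{with } w_{ii} = 0. \tag{207} \]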
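As a small illustration of (205)-(207), the spatial lag is just a matrix-vector product once $W$ is fixed. The sketch below builds $W$ from a hypothetical adjacency matrix; row-normalisation is an assumption made here for the example, since the text only requires weights determined a priori with $w_{ii} = 0$.

```python
import numpy as np

def row_normalise(adjacency):
    """Build a spatial weight matrix with zero diagonal and rows summing to one."""
    W = np.array(adjacency, dtype=float)
    np.fill_diagonal(W, 0.0)                 # impose w_ii = 0
    row_sums = W.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0            # leave isolated units with a zero row
    return W / row_sums

# Hypothetical three-unit example: unit 1 neighbours units 2 and 3.
adjacency = [[0, 1, 1],
             [1, 0, 0],
             [1, 0, 0]]
W = row_normalise(adjacency)
y_t = np.array([1.0, 2.0, 4.0])
y_star_t = W @ y_t                           # stacked spatial lags y*_t = W y_t
print(W)
print(y_star_t)
```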

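Stacking the individual STAR(1) regressions, (204), we have the following generalised spatial representation:

\[ y_t = \Phi y_{t-1} + \Phi^{*}_0Wy_t + \Phi^{*}_1Wy_{t-1} + u_t \tag{208} \]

where

\[ \underset{N\times N}{\Phi} = \begin{bmatrix}\phi_1 & \cdots & 0\\ \vdots & \ddots & \vdots\\ 0 & \cdots & \phi_N\end{bmatrix},\qquad \underset{N\times N}{\Phi^{*}_h} = \begin{bmatrix}\phi^{*}_{1h} & \cdots & 0\\ \vdots & \ddots & \vdots\\ 0 & \cdots & \phi^{*}_{Nh}\end{bmatrix}\ \text{for } h = 0, 1. \]

It is straightforward to develop the general STAR(p) model with heterogeneous parameters as follows:

\[ y_{it} = \sum_{h=1}^{p}\phi_{ih}y_{i,t-h} + \sum_{h=0}^{p}\phi^{*}_{ih}y^{*}_{i,t-h} + u_{it} \tag{209} \]

Stacking the individual STAR(p) regressions, (209), we have the following generalised spatial representation:

\[ y_t = \sum_{h=1}^{p}\Phi_hy_{t-h} + \sum_{h=0}^{p}\Phi^{*}_hWy_{t-h} + u_t \tag{210} \]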


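where

\[ \underset{N\times N}{\Phi_h} = \begin{bmatrix}\phi_{1h} & \cdots & 0\\ \vdots & \ddots & \vdots\\ 0 & \cdots & \phi_{Nh}\end{bmatrix},\ h = 1,\ldots,p,\qquad \underset{N\times N}{\Phi^{*}_h} = \begin{bmatrix}\phi^{*}_{1h} & \cdots & 0\\ \vdots & \ddots & \vdots\\ 0 & \cdots & \phi^{*}_{Nh}\end{bmatrix},\ h = 0, 1,\ldots,p. \]

Remark: Spatial stability: The eigenvalues of $\Phi^{*}_0W$ lie inside the unit circle.

Remark: Time stability: We rewrite equation (210) as

\[ y_t = \sum_{h=1}^{p}\tilde\Phi_hy_{t-h} + \tilde u_t, \tag{211} \]

where $\tilde\Phi_h = (I_N - \Phi^{*}_0W)^{-1}(\Phi_h + \Phi^{*}_hW)$ and $\tilde u_t = (I_N - \Phi^{*}_0W)^{-1}u_t$. The roots of the $N\times N$ matrix polynomial $\tilde\Phi(z) = I_N - \sum_{h=1}^{p}\tilde\Phi_hz^{h}$ lie outside the unit circle.

Suppose that the lag order p is selected sufficiently large, in which case the $u_{it}$'s in (209) are free from serial correlation. To deal with the endogeneity of $y^{*}_{it}$ in (209) we apply the control function approach. Consider the following CF DGP for $y^{*}_{it}$:

\[ y^{*}_{it} = \varphi_i'z_{it} + v_{it}\quad\text{with}\quad E(z_{it}'v_{it}) = 0 \tag{212} \]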

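where $z_{it}$ is the $L\times 1$ vector of exogenous variables:

\[ z_{it} = \left(z^{1\prime}_{it}, z^{2\prime}_{it}\right)' \]

where $z^{1}_{it}$ is the $L_1\times 1$ vector of exogenous variables included in (179) and $z^{2}_{it}$ is the $L_2\times 1$ vector of excluded exogenous variables.

We assume that the error terms $u_{it}$ and $v_{it}$ have a joint distribution:

\[ \begin{pmatrix}u_{it}\\ v_{it}\end{pmatrix} \sim iid\,(0, \Sigma_{uv}),\qquad \Sigma_{uv} = \begin{bmatrix}\sigma_u^2 & \sigma_{uv}\\ \sigma_{uv} & \sigma_v^2\end{bmatrix} \tag{213} \]

Furthermore, $E(u_{it}|v_{it}) = \rho v_{it}$ and $Var(u_{it}|v_{it}) = \sigma_e^2$. The endogeneity of $y^{*}_{it}$ comes from the correlation between $u_{it}$ and $v_{it}$. Using (183), we can construct$^{37}$

\[ u_{it} = \rho v_{it} + e_{it} \tag{214} \]

where $\rho = E(v_{it}u_{it})/E(v_{it}^2)$. By construction, $E(z_{it}'e_{it}) = 0$ and $E(v_{it}e_{it}) = 0$. Replacing $u_{it}$, we obtain the following transformation of (209):

\[ y_{it} = \sum_{h=1}^{p}\phi_{ih}y_{i,t-h} + \sum_{h=0}^{p}\phi^{*}_{ih}y^{*}_{i,t-h} + \rho v_{it} + e_{it} \tag{215} \]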

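37 In the special case where $(u_{it}, v_{it})'$ has a jointly normal distribution, then
\[ u_{it}\,|\,v_{it} \sim N\left(\frac{\sigma_{uv}}{\sigma_v^2}v_{it},\ \sigma_u^2 - \frac{\sigma_{uv}^2}{\sigma_v^2}\right) \]
and $e_{it}$ is independent of $v_{it}$.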


where $v_{it}$ is the control variable, rendering $e_{it}$ uncorrelated with $y^{*}_{it}$ as well as with $v_{it}$ in (215). In practice, we use a two-step procedure: (i) obtain the reduced-form residuals, $\hat v_{it} = y^{*}_{it} - \hat\varphi_i'z_{it}$, and (ii) run the following regression:

\[ y_{it} = \sum_{h=1}^{p}\phi_{ih}y_{i,t-h} + \sum_{h=0}^{p}\phi^{*}_{ih}y^{*}_{i,t-h} + \rho\hat v_{it} + e^{*}_{it} \tag{216} \]

where $e^{*}_{it} = e_{it} + \rho(\hat\varphi_i - \varphi_i)'z_{it}$ depends on the sampling error in $\hat\varphi_i$ unless $\rho = 0$ (which provides an exogeneity test). Then, the OLS estimator from (216) will be consistent for all the parameters.

It is straightforward to derive the (dynamic) multipliers associated with unit changes in $y^{*}_t$ on $y_t$. Rewrite the STAR(p) model, (209), as

\[ \phi_i(L)y_{it} = \phi^{*}_i(L)y^{*}_{it} + u_{it} \tag{217} \]

where

\[ \phi_i(L) = 1 - \sum_{h=1}^{p}\phi_{ih}L^{h};\qquad \phi^{*}_i(L) = \sum_{h=0}^{p}\phi^{*}_{ih}L^{h}. \]

Premultiplying (217) by the inverse of $\phi_i(L)$, we obtain:

\[ y_{it} = \tilde\phi^{*}_i(L)y^{*}_{it} + \tilde u_{it} \tag{218} \]

where $\tilde\phi^{*}_i(L) = \sum_{j=0}^{\infty}\tilde\phi^{*}_{ij}L^{j} = [\phi_i(L)]^{-1}\phi^{*}_i(L)$ and $\tilde u_{it} = [\phi_i(L)]^{-1}u_{it}$. The dynamic multipliers, $\tilde\phi^{*}_{ij}$ for $j = 0, 1, \ldots$, can be evaluated using the following recursive relationship:

\[ \tilde\phi^{*}_{ij} = \phi_{i1}\tilde\phi^{*}_{i,j-1} + \phi_{i2}\tilde\phi^{*}_{i,j-2} + \cdots + \phi_{i,j-1}\tilde\phi^{*}_{i1} + \phi_{ij}\tilde\phi^{*}_{i0} + \phi^{*}_{ij},\quad j = 1, 2, \ldots \tag{219} \]

where $\phi_{ij} = 0$ for $j < 1$ and $\tilde\phi^{*}_{i0} = \phi^{*}_{i0}$, $\tilde\phi^{*}_{ij} = 0$ for $j < 0$ by construction.

Define the dynamic multiplier effects as $\partial y_{i,t+h}/\partial y^{*}_{it}$. Then, the cumulative dynamic multiplier effects of $y^{*}_{it}$ on $y_{i,t+h}$ for $h = 0,\ldots,H$ can be evaluated as follows:

\[ m_{y_i}(y^{*}_i, H) = \sum_{h=0}^{H}\tilde\phi^{*}_{ih},\quad H = 0, 1, \ldots \]

By construction, as $H\to\infty$,

\[ m_{y_i}(y^{*}_i, H)\to\beta^{*}_{yi} \]

where $\beta^{*}_{yi}$ is the long-run coefficient.

Remark: An important feature of the STAR model is that it captures the dynamic (spillover) adjustment from the initial equilibrium to the new equilibrium following a perturbation with respect to overseas conditions.


• An investigation of the key parameters in (179) enables us to categorise countries into groups, say countries that focus on domestic conditions only (e.g. the US) and those that pay attention to both domestic and overseas conditions. A small open economy is likely to depend more on overseas conditions, and vice versa. These patterns may also differ between the short and the long run. This is still useful, but not sufficiently informative about the spatial and the dynamic multipliers.

We now develop spatial-dynamic multipliers in terms of the spatial system representation (??), which we rewrite as

\[ \Phi(L)y_t = \Phi^{*}(L)Wy_t + u_t \tag{220} \]

where

\[ \Phi(L) = I_N - \sum_{h=1}^{p}\Phi_hL^{h};\qquad \Phi^{*}(L) = \sum_{h=0}^{p}\Phi^{*}_hL^{h}. \]

Premultiplying (220) by the inverse of $\Phi(L)$, we obtain:

\[ y_t = \tilde\Phi^{*}(L)Wy_t + \tilde u_t \tag{221} \]

where $\tilde\Phi^{*}(L) = \sum_{h=0}^{\infty}\tilde\Phi^{*}_hL^{h} = [\Phi(L)]^{-1}\Phi^{*}(L)$ and $\tilde u_t = [\Phi(L)]^{-1}u_t$. The dynamic multipliers, $\tilde\Phi^{*}_j$ for $j = 0, 1, \ldots$, can be evaluated using the following recursive relationship:

\[ \tilde\Phi^{*}_j = \Phi_1\tilde\Phi^{*}_{j-1} + \Phi_2\tilde\Phi^{*}_{j-2} + \cdots + \Phi_{j-1}\tilde\Phi^{*}_1 + \Phi_j\tilde\Phi^{*}_0 + \Phi^{*}_j,\quad j = 1, 2, \ldots \tag{222} \]

where $\Phi_j = 0$ for $j < 1$ and $\tilde\Phi^{*}_0 = \Phi^{*}_0$, $\tilde\Phi^{*}_j = 0$ for $j < 0$ by construction.

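Define the dynamic multiplier effects as

\[ \frac{\partial y_{t+h}}{\partial y^{*\prime}_t} = \begin{bmatrix}
\frac{\partial y_{1,t+h}}{\partial y^{*}_{1t}} & \frac{\partial y_{1,t+h}}{\partial y^{*}_{2t}} & \cdots & \frac{\partial y_{1,t+h}}{\partial y^{*}_{Nt}} \\
\frac{\partial y_{2,t+h}}{\partial y^{*}_{1t}} & \frac{\partial y_{2,t+h}}{\partial y^{*}_{2t}} & \cdots & \frac{\partial y_{2,t+h}}{\partial y^{*}_{Nt}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial y_{N,t+h}}{\partial y^{*}_{1t}} & \frac{\partial y_{N,t+h}}{\partial y^{*}_{2t}} & \cdots & \frac{\partial y_{N,t+h}}{\partial y^{*}_{Nt}}
\end{bmatrix}_{N\times N} \]

Q: What is $\partial y_{i,t+h}/\partial y_{jt}$? Say,

\[ \frac{\partial y_{1,t+h}}{\partial y_{2t}} = \frac{\partial y_{1,t+h}}{\partial y^{*}_{1t}}\times w_{12} + \frac{\partial y_{1,t+h}}{\partial y^{*}_{2t}}\times w_{22} + \cdots + \frac{\partial y_{1,t+h}}{\partial y^{*}_{Nt}}\times w_{N2}. \]

Hence,

\[ \frac{\partial y_{i,t+h}}{\partial y_{jt}} = \frac{\partial y_{i,t+h}}{\partial y^{*}_{1t}}\times w_{1j} + \frac{\partial y_{i,t+h}}{\partial y^{*}_{2t}}\times w_{2j} + \cdots + \frac{\partial y_{i,t+h}}{\partial y^{*}_{Nt}}\times w_{Nj} = \sum_{k=1}^{N}\frac{\partial y_{i,t+h}}{\partial y^{*}_{kt}}\times w_{kj}\quad\text{for } i\neq j. \]

Then, what about $\partial y_{i,t+h}/\partial y_{it}$? Simply set to zero? More??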
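In matrix form, the chain rule above is a post-multiplication of the multiplier matrix by $W$. The sketch below is illustrative only, with a hypothetical diagonal multiplier matrix and weight matrix; it shows that, under this chain rule alone, the implied own effects are zero whenever the multiplier matrix is diagonal and $w_{ii} = 0$.

```python
import numpy as np

# Hypothetical spatial weights with zero diagonal (rows sum to one).
W = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [0.5, 0.5, 0.0]])

# Hypothetical horizon-h multipliers dy_{i,t+h}/dy*_{it}; diagonal by construction.
M_h = np.diag([0.4, 0.2, 0.3])

# Element (i, j) equals sum_k M_h[i, k] * W[k, j], i.e. the effect of y_{jt} on y_{i,t+h}
# transmitted through the spatial lags.
dy_dy = M_h @ W
print(dy_dy)
```

This only traces the channel through $y^{*}_t$; own dynamics entering through the lags of $y_{it}$ are not part of this product.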

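The matrix of the cumulative dynamic multiplier effects can be evaluated as follows:

\[ m_{y^{*}}(H) = \sum_{h=0}^{H}\frac{\partial y_{t+h}}{\partial y^{*\prime}_t} = \sum_{h=0}^{H}\tilde\Phi^{*}_h. \]

The cumulative dynamic multiplier effects of $y^{*}_{jt}$ on $y_{it}$ are given by the $(i, j)$th element of the $N\times N$ matrix $m_{y^{*}}(H)$.

Remark: $m_{y^{*}}(H)$ captures the dynamic multiplier effects with respect to $y^{*}_t$. However, it is block-diagonal by construction:

\[ \underset{N\times N}{m_{y^{*}}(H)} = \begin{bmatrix}m_{y_1}(y^{*}_1, H) & \cdots & 0\\ \vdots & \ddots & \vdots\\ 0 & \cdots & m_{y_N}(y^{*}_N, H)\end{bmatrix} \]

because $\tilde\Phi^{*}_h$ is block-diagonal.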

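Diffusion IRF and FEVD. We rewrite (210) as (see also (211)):

\[ \tilde\Phi(L)y_t = \tilde u_t, \tag{223} \]

where $\tilde u_t = (I_N - \Phi^{*}_0W)^{-1}u_t$ and

\[ \tilde\Phi(L) = I_N - \sum_{j=1}^{p}\tilde\Phi_jL^{j}\quad\text{with}\quad \tilde\Phi_j = (I_N - \Phi^{*}_0W)^{-1}\left(\Phi_j + \Phi^{*}_jW\right). \]

Premultiplying (223) by the inverse of $\tilde\Phi(L)$, we obtain:

\[ y_t = [\tilde\Phi(L)]^{-1}\tilde u_t = [\tilde\Phi(L)]^{-1}(I_N - \Phi^{*}_0W)^{-1}u_t \tag{224} \]

from which we can construct the (diffusion) IRF and FEVD. Q: with respect to $\tilde u_t$ or $u_t$? Probably $u_t$ (can we treat $u_t$ as structural?).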
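A sketch of how (223)-(224) could be used in practice is given below: it forms the reduced-form matrices, checks the spatial and time stability conditions through eigenvalues (the time condition via the companion matrix, which is equivalent to the roots of the matrix polynomial lying outside the unit circle), and computes responses to a unit shock in $u_t$. All function names and parameter values are hypothetical, and treating $u_t$ as the shock of interest is an assumption.

```python
import numpy as np

def reduced_form(Phi, Phi_star, W):
    """Return (I - Phi*_0 W)^{-1} and the reduced-form lag matrices."""
    N = W.shape[0]
    A0_inv = np.linalg.inv(np.eye(N) - Phi_star[0] @ W)
    Phi_t = [A0_inv @ (Phi[h] + Phi_star[h + 1] @ W) for h in range(len(Phi))]
    return A0_inv, Phi_t

def companion(Phi_t):
    """Companion matrix of the reduced-form VAR(p)."""
    p, N = len(Phi_t), Phi_t[0].shape[0]
    C = np.zeros((N * p, N * p))
    C[:N, :] = np.hstack(Phi_t)
    if p > 1:
        C[N:, :-N] = np.eye(N * (p - 1))
    return C

def impulse_responses(Phi_t, A0_inv, horizon):
    """Responses of y_{t+h}, h = 0..horizon, to unit shocks in u_t."""
    N = A0_inv.shape[0]
    Psi = [np.eye(N)]                      # moving-average coefficients
    for h in range(1, horizon + 1):
        acc = np.zeros((N, N))
        for j in range(1, min(h, len(Phi_t)) + 1):
            acc += Phi_t[j - 1] @ Psi[h - j]
        Psi.append(acc)
    return [P @ A0_inv for P in Psi]

# Hypothetical STAR(2) example with three units.
W = np.array([[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]])
Phi = [np.diag([0.3, 0.4, 0.2]), np.diag([0.1, 0.05, 0.1])]
Phi_star = [np.diag([0.2, 0.1, 0.3]), np.diag([0.1, 0.1, 0.1]), np.diag([0.05, 0.05, 0.05])]

A0_inv, Phi_t = reduced_form(Phi, Phi_star, W)
spatially_stable = np.max(np.abs(np.linalg.eigvals(Phi_star[0] @ W))) < 1
time_stable = np.max(np.abs(np.linalg.eigvals(companion(Phi_t)))) < 1
irf = impulse_responses(Phi_t, A0_inv, horizon=10)
print(spatially_stable, time_stable)
print(np.round(irf[0], 3))
```

Responses to $\tilde u_t$ instead are obtained by dropping the final multiplication by `A0_inv`; an FEVD would additionally require an assumption on the covariance of the chosen shocks.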

More to follow: Algebra:

With Factors. In spatial modelling it is implicitly assumed that $u_{it}$ is iid across spatial units or spatially correlated:

\[ u_t = \lambda Wu_t + \varepsilon_t. \]

Now we introduce the common factor structure as in the QVAR paper such that$^{38}$

\[ u_t = \Lambda f_t + v_t. \]

38 Shi and Lee (2017) also postulate that the error term in a SAR panel equation can be decomposed into a common factor component and an idiosyncratic component, where common factors can potentially correlate with included regressors. See also HLP17??


Then, (204) can be extended as

\[ y_{it} = \phi_iy_{i,t-1} + \phi^{*}_{i0}y^{*}_{it} + \phi^{*}_{i1}y^{*}_{i,t-1} + \lambda_i'f_t + v_{it} \tag{225} \]

where $v_{it}$ contains the idiosyncratic components, which are mutually uncorrelated across $(i, j)$. More generally, we have the STAR(p) model with (observed) factors:

\[ y_{it} = \sum_{h=1}^{p}\phi_{ih}y_{i,t-h} + \sum_{h=0}^{p}\phi^{*}_{ih}y^{*}_{i,t-h} + \lambda_i'f_t + v_{it} \tag{226} \]

To deal with the endogeneity of $y^{*}_{it}$ we apply the control function approach described above. Then, (226) can be expressed as

\[ \tilde\Phi(L)y_t = (I_N - \Phi^{*}_0W)^{-1}(\Lambda f_t + v_t) \tag{227} \]

from which we can construct the IRF and FEVD.

Remark: This is quite a parsimonious specification, implying that we can include a large number N of spatial units; it is thus another way of circumventing the curse of input dimensionality (e.g. 100 or 1000 asset returns). Further, we may argue that $v_{it}$ would carry certain structural interpretations.

6.7 GVAR-SPVAR Model

Consider a global economy consisting of N economies, indexed by $i = 1,\ldots,N$, and denote the country-specific variables by an $m_i\times 1$ vector $y_{it}$, and the country-specific foreign variables by an $m^{*}_i\times 1$ vector $y^{*}_{it} = \sum_{j=1}^{N}w_{ij}y_{jt}$, where $w_{ij}\geq 0$ is the set of granular weights with $\sum_{j=1}^{N}w_{ij} = 1$ and $w_{ii} = 0$ for all i (e.g. Pesaran, 2006 and GNS). The country-specific VARX*(2, 2) model can be written as

\[ y_{it} = h_{i0} + h_{i1}t + \Phi_{i1}y_{i,t-1} + \Phi_{i2}y_{i,t-2} + \Psi_{i0}y^{*}_{it} + \Psi_{i1}y^{*}_{i,t-1} + \Psi_{i2}y^{*}_{i,t-2} + u_{it} \tag{228} \]

where the dimension of $h_{ij}$ and $\delta_{ij}$, $j = 0, 1, 2$, is $m_i\times 1$, while the dimensions of $\Phi_{ij}$ and $\Psi_{ij}$, $j = 0, 1, 2$, are $m_i\times m_i$ and $m_i\times m^{*}_i$. GNS assume that $u_{it}\sim iid\,(0, \Sigma_{ii})$, where $\Sigma_{ii}$ is an $m_i\times m_i$ positive definite matrix.

6.7.1 Spatial VAR (SPVAR) Models

Beenstock and Felsenstein (2007) propose the panel VAR model with both spatial and time lags, which they refer to as spatial vector autoregressions (SpVAR):

\[ Y_{nt} = \mu_n + \theta WY_{nt} + \sum_{j=1}^{q}\beta_jY_{n,t-j} + \lambda WY_{n,t-1} + u_{nt} \]
\[ u_{nt} = \rho u_{n,t-1} + \delta Wu_{nt} + \gamma Wu_{n,t-1} + \varepsilon_{nt}\quad\text{with}\quad \sigma_{ni} = Cov(\varepsilon_n, \varepsilon_i). \]

The SpVAR resembles the SDPD except that $Y_{nt}$ is allowed to be a $K\times 1$ vector.


The structural multivariate counterpart is (SPSVAR):

\[ Y_{knt} = \mu_{kn} + \sum_{i=1}^{K}\left(\alpha_{ki}Y_{int} + \beta_{ki}Y_{in,t-1} + \theta_{ki}Y^{*}_{int} + \lambda_{ki}Y^{*}_{in,t-1}\right) + \varepsilon_{knt} \tag{229} \]

where the $\mu$'s are region-specific effects, the $\alpha$'s are within-region contemporaneous causal effects between the Y's with $\alpha_{kk} = 0$, the $\theta$'s are spatial lag coefficients, the $\beta$'s are temporal lag coefficients, and the $\lambda$'s are lagged spatial lag coefficients.

Denote by $A$, $B$, $\Theta$ and $\Lambda$ the $K\times K$ coefficient matrices for the $\alpha$s, $\beta$s, $\theta$s and $\lambda$s, respectively. We then express (229) as follows:

\[ Y_t = \mu + A^{*}Y_t + B^{*}Y_{t-1} + \Theta^{*}Y^{*}_t + \Lambda^{*}Y^{*}_{t-1} + \varepsilon_t \tag{230} \]

where $Y$ is an $NK\times 1$ vector of observations stacked by n, $\mu$ is an $NK\times 1$ vector of region-specific effects, and

\[ A^{*} = I_N\otimes A,\quad B^{*} = I_N\otimes B,\quad \Theta^{*} = I_N\otimes\Theta,\quad \Lambda^{*} = I_N\otimes\Lambda \]

are $NK\times NK$ block-diagonal matrices. $Y_t$ and $Y^{*}_t$ are not independent of $\varepsilon_t$. (230) has a corresponding reduced form:

\[ Y_t = \Pi_0 + \Pi_1Y_{t-1} + \Pi_2Y^{*}_t + \Pi_3Y^{*}_{t-1} + v_t \tag{231} \]

where $\Pi_0 = (I_{NK} - A^{*})^{-1}\mu$, $\Pi_1 = (I_{NK} - A^{*})^{-1}B^{*}$, $\Pi_2 = (I_{NK} - A^{*})^{-1}\Theta^{*}$, $\Pi_3 = (I_{NK} - A^{*})^{-1}\Lambda^{*}$, and $v_t = (I_{NK} - A^{*})^{-1}\varepsilon_t$. There are $K(K-1)$ unknown $A$ coefficients, $K^2$ unknown coefficients for each of $B$, $\Theta$ and $\Lambda$, and $K$ unknown variances in $\Sigma_\varepsilon$, making a total of $K(4K-1) + K = 4K^2$ unknown structural parameters in (230). In (231) there are $3K^2$ data restrictions from the $\Pi$s, and $\Sigma_v$ provides $\frac{1}{2}K(K+1)$ further restrictions. Therefore, the SpVAR underidentifies the structural parameters, and the identification deficit is $4K^2 - 3K^2 - \frac{1}{2}K(K+1) = \frac{1}{2}K(K-1)$.

Beenstock and Felsenstein (2007) note that the "incidental parameter problem" arises and that the LSDV estimate of $B^{*}$ has an $O(1/T)$ downward bias when T is finite; further issues arise because q is unknown. When $q = 1$, they propose the bias-corrected estimator following Hsiao (2003). They show that impulse responses of the SpVAR help to simulate the spatial-temporal dynamic effects of exogenous shocks.

• Add the review on more recent SPVAR Models: e.g. Mutl (2009) and Spatial Panel Vector Autoregression by Xie (2015).

6.7.2 The spatial representation of the GVAR model

For convenience we assume $m_i = k$ and $m = Nk$. Define the $m\times 1$ vector of the global variables:

\[ \underset{m\times 1}{y_t} = (y_{1t}',\ldots,y_{Nt}')'\quad\text{with}\quad \underset{k\times 1}{y_{it}} = (y_{1,it},\ldots,y_{k,it})'. \]


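Define the $N\times N$ weight matrix:

\[ W = \begin{bmatrix}w_{11} & w_{12} & \cdots & w_{1N}\\ w_{21} & w_{22} & \cdots & w_{2N}\\ \vdots & \vdots & \ddots & \vdots\\ w_{N1} & w_{N2} & \cdots & w_{NN}\end{bmatrix} = \begin{bmatrix}w_1\\ w_2\\ \vdots\\ w_N\end{bmatrix}. \]

Then, we have:

\[ \underset{k\times 1}{y^{*}_{it}} = \left(y^{*}_{1,it},\ldots,y^{*}_{k,it}\right)' = (w_i\otimes I_k)y_t,\qquad \underset{m\times 1}{y^{*}_t} = (y^{*\prime}_{1t},\ldots,y^{*\prime}_{Nt})' = (W\otimes I_k)y_t. \]

Thus, (228) can be written as (dropping $h_{i0} + h_{i1}t$ without loss of generality):

\[ y_{it} = \Phi_{i1}y_{i,t-1} + \Phi_{i2}y_{i,t-2} + \Psi_{i0}(w_i\otimes I_k)y_t + \Psi_{i1}(w_i\otimes I_k)y_{t-1} + \Psi_{i2}(w_i\otimes I_k)y_{t-2} + u_{it} \tag{232} \]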

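Further,

\[ y_{1t} = \Phi_{11}y_{1,t-1} + \Phi_{12}y_{1,t-2} + \Psi_{10}(w_1\otimes I_k)y_t + \Psi_{11}(w_1\otimes I_k)y_{t-1} + \Psi_{12}(w_1\otimes I_k)y_{t-2} + u_{1t} \]
\[ \vdots \]
\[ y_{Nt} = \Phi_{N1}y_{N,t-1} + \Phi_{N2}y_{N,t-2} + \Psi_{N0}(w_N\otimes I_k)y_t + \Psi_{N1}(w_N\otimes I_k)y_{t-1} + \Psi_{N2}(w_N\otimes I_k)y_{t-2} + u_{Nt} \]

Stacking these results, we have:

\[ \begin{bmatrix}y_{1t}\\ \vdots\\ y_{Nt}\end{bmatrix} = \sum_{\ell=1}^{2}\begin{bmatrix}\Phi_{1\ell} & & 0\\ & \ddots & \\ 0 & & \Phi_{N\ell}\end{bmatrix}\begin{bmatrix}y_{1,t-\ell}\\ \vdots\\ y_{N,t-\ell}\end{bmatrix} + \sum_{\ell=0}^{2}\begin{bmatrix}\Psi_{1\ell} & & 0\\ & \ddots & \\ 0 & & \Psi_{N\ell}\end{bmatrix}\begin{bmatrix}y^{*}_{1,t-\ell}\\ \vdots\\ y^{*}_{N,t-\ell}\end{bmatrix} + \begin{bmatrix}u_{1t}\\ \vdots\\ u_{Nt}\end{bmatrix} \]

Hence, writing $\Phi_\ell$ and $\Psi_\ell$ for the block-diagonal matrices above, we have:

\[ y_t = \Phi_1y_{t-1} + \Phi_2y_{t-2} + \Psi_0y^{*}_t + \Psi_1y^{*}_{t-1} + \Psi_2y^{*}_{t-2} + u_t \tag{233} \]
\[ \phantom{y_t} = \Phi_1y_{t-1} + \Phi_2y_{t-2} + \Psi_0(W\otimes I_k)y_t + \Psi_1(W\otimes I_k)y_{t-1} + \Psi_2(W\otimes I_k)y_{t-2} + u_t \]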

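Alternatively, (233) can be written as

\[ \left(I_m - \Psi_0(W\otimes I_k)\right)y_t = \left(\Phi_1 + \Psi_1(W\otimes I_k)\right)y_{t-1} + \left(\Phi_2 + \Psi_2(W\otimes I_k)\right)y_{t-2} + u_t \tag{234} \]

or

\[ \left[\left(I_m - \Phi_1L - \Phi_2L^{2}\right) - \left(\Psi_0 + \Psi_1L + \Psi_2L^{2}\right)(W\otimes I_k)\right]y_t = u_t \tag{235} \]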
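The Kronecker structure in (233)-(234) is easy to reproduce numerically. The following sketch, with hypothetical helper names and randomly drawn block-diagonal coefficient matrices, builds $(W\otimes I_k)$, forms the foreign variables and solves (234) for the reduced-form VAR(2) coefficients.

```python
import numpy as np

def block_diag(blocks):
    """Stack equally sized square blocks along the diagonal."""
    k, N = blocks[0].shape[0], len(blocks)
    out = np.zeros((N * k, N * k))
    for i, b in enumerate(blocks):
        out[i * k:(i + 1) * k, i * k:(i + 1) * k] = b
    return out

def gvar_reduced_form(Phi, Psi, W, k):
    """Phi = [Phi_1, Phi_2], Psi = [Psi_0, Psi_1, Psi_2], all m x m block-diagonal.

    Returns (W kron I_k), G0 = I_m - Psi_0 (W kron I_k), and the reduced-form
    lag matrices F_l = G0^{-1} (Phi_l + Psi_l (W kron I_k)), l = 1, 2.
    """
    Wk = np.kron(W, np.eye(k))
    m = Wk.shape[0]
    G0 = np.eye(m) - Psi[0] @ Wk
    F = [np.linalg.solve(G0, Phi[l] + Psi[l + 1] @ Wk) for l in range(2)]
    return Wk, G0, F

# Hypothetical two-country example with k = 2 variables per country.
W = np.array([[0.0, 1.0], [1.0, 0.0]])
k = 2
rng = np.random.default_rng(1)

def draw(scale):
    return block_diag([scale * rng.uniform(-1, 1, (k, k)) for _ in range(2)])

Phi = [draw(0.3), draw(0.1)]
Psi = [draw(0.2), draw(0.1), draw(0.05)]
Wk, G0, F = gvar_reduced_form(Phi, Psi, W, k)

y_t = np.array([1.0, 2.0, 3.0, 4.0])   # (y_1t', y_2t')'
y_star_t = Wk @ y_t                    # foreign variables (y*_1t', y*_2t')'
print(y_star_t)
```

The reduced-form matrices `F` can then be fed into any standard VAR impulse-response routine, with the contemporaneous impact of $u_t$ given by the inverse of `G0`.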

Remark: The above representation clearly shows that the SPVAR is a special case of the GVAR. Notice that in GVAR modelling we are interested in IRFs in terms of $\partial y_{t+h}/\partial u_t$, which combine the spatial and dynamic effects. An important issue is therefore how to decompose the overall IRFs into the spatial and dynamic components.

Remark: We may also be interested in evaluating the dynamic multipliers in terms of $\partial y_{1,t+h}/\partial y_{2t}$ or vice versa, where $y_t = (y_{1t}', y_{2t}')'$, but this is not quite straightforward (see the STARDL model above). When we add global factors, denoted $f_t$, such as oil or commodity prices, it is straightforward to derive the dynamic multipliers in terms of $\partial y_{t+h}/\partial f_t$, say.

7 The Joint Modelling of the Spatial Dependence/Heterogeneity and Unobserved Factors

There has been massive development in modelling CSD in double-indexed panel data (e.g. Pesaran, 2006 and Bai, 2009). There have been two main approaches: spatial effects deal with weak CSD, whilst the factor approach accommodates strong CSD. Recently, a few studies have attempted to develop a combined approach that can accommodate both weak and strong CSD.

Bailey et al. (2016) develop a multi-step estimation procedure that can distinguish the relationship between spatial units that is purely spatial from that which is due to common factors. Mastromarco et al. (2015) propose a novel technique for modelling the technical efficiency of stochastic frontier panels by combining the exogenously driven factor-based approach and an endogenous threshold regime selection advanced by Kapetanios et al. (2014). Gunella et al. (2015) develop a unified framework for modelling multilateral resistance and bilateral heterogeneity simultaneously in panel gravity models. Shi and Lee (2016), Bai and Li (2014,5) and Kuersteiner and Prucha (2018) have also developed frameworks for jointly modelling spatial effects and interactive effects.

We first consider the univariate spatio-temporal autoregressive (STAR) model with homogeneous parameters and with heterogeneous parameters. We then extend the models to panels with unobserved factors. In the homogeneous case, the QML and GMM estimators have been developed, even together with factors, whilst no unified approach has yet been fully developed in the heterogeneous case. Here we consider the (system) QML approach and the CF-based STAR approach. In particular, the latter approach provides a consistent estimator of the heterogeneous spatial parameters on the basis of the individual regressions, so that it is computationally much simpler...

Next, we develop the multivariate spatio-temporal autoregressive distributed lag (STARDL) model with homogeneous parameters and with heterogeneous parameters. We also extend this approach to panels with unobserved factors. To our knowledge, this is the first study to develop the (system) QML approach and the CF-based STARDL approach...

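7.1 The spatial panel data model with interactive effects by Bai and Li (2014)

Consider the following spatial panel data model with common shocks:

\[ y_{it} = \alpha_i + \rho\sum_{j=1}^{N}w_{ij,N}y_{jt} + \sum_{p=1}^{k}x_{itp}\beta_p + \lambda_i'f_t + e_{it}, \tag{236} \]
\[ x_{itp} = \nu_{ip} + \gamma_{ip}'f_t + v_{itp},\quad p = 1, 2,\ldots,k; \tag{237} \]

where $y_{it}$ is the dependent variable; $x_{it} = (x_{it1}, x_{it2},\ldots,x_{itk})'$ is a k-dimensional vector of explanatory variables; $f_t$ is an r-dimensional vector of unobservable common shocks; $\lambda_i$ is the corresponding heterogeneous response to the common shocks; $W_N = (w_{ij,N})_{N\times N}$ is a spatial weights matrix with $w_{ii,N} = 0$; and $e_{it}$ and $v_{it}$ are the idiosyncratic errors. The term $\lambda_i'f_t$ captures the common shock effects, and $\rho\sum_{j=1}^{N}w_{ij,N}y_{jt}$ captures the spatial effects.

Let $v_{it} = (v_{it1},\ldots,v_{itk})'$ and $\nu_i = (\nu_{i1},\ldots,\nu_{ik})'$, and let $\gamma_i = (\gamma_{i1},\ldots,\gamma_{ik})$. The x equation is equivalent to

\[ x_{it} = \nu_i + \gamma_i'f_t + v_{it}. \]

The model (236) and (237) can be written as

\[ \begin{bmatrix}y_{it} - \rho\sum_{j=1}^{N}w_{ij,N}y_{jt} - x_{it}'\beta\\ x_{it}\end{bmatrix} = \begin{bmatrix}\alpha_i\\ \nu_i\end{bmatrix} + \begin{bmatrix}\lambda_i'\\ \gamma_i'\end{bmatrix}f_t + \begin{bmatrix}e_{it}\\ v_{it}\end{bmatrix} = \mu_i + \Phi_i'f_t + \varepsilon_{it} \tag{238} \]

with $\Phi_i = (\lambda_i, \gamma_i)$, $\mu_i = (\alpha_i, \nu_i')'$ and $\varepsilon_{it} = (e_{it}, v_{it}')'$. Let $D(\rho, \beta)$ be an $N(k+1)\times N(k+1)$ matrix whose $(i, j)$ subblock, denoted by $D_{ij}(\rho, \beta)$, is the $(k+1)\times(k+1)$ matrix

\[ D_{ij}(\rho, \beta) = \begin{bmatrix}1 & -\beta'\\ 0 & I_k\end{bmatrix}\ \text{if } i = j,\qquad D_{ij}(\rho, \beta) = \begin{bmatrix}-\rho w_{ij,N} & 0\\ 0 & 0\end{bmatrix}\ \text{if } i\neq j, \]

where the sign of the off-diagonal block follows from the left-hand side of (238).

Derivation: Consider the equation (238) for $i = 1$:

\[ D_{i\bullet}z_t = \mu_i + \Phi_i'f_t + \varepsilon_{it} \]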


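which can be expressed in detail as:

\[ D_{1\bullet}z_t = \begin{bmatrix}D_{11} & \cdots & D_{1N}\end{bmatrix}\begin{bmatrix}z_{1t}\\ \vdots\\ z_{Nt}\end{bmatrix} = D_{11}z_{1t} + \cdots + D_{1N}z_{Nt} \]
\[ = \begin{bmatrix}1 & -\beta'\\ 0 & I_k\end{bmatrix}\begin{bmatrix}y_{1t}\\ x_{1t}\end{bmatrix} + \begin{bmatrix}-\rho w_{12} & 0\\ 0 & 0\end{bmatrix}\begin{bmatrix}y_{2t}\\ x_{2t}\end{bmatrix} + \cdots + \begin{bmatrix}-\rho w_{1N} & 0\\ 0 & 0\end{bmatrix}\begin{bmatrix}y_{Nt}\\ x_{Nt}\end{bmatrix} \]
\[ = \begin{bmatrix}y_{1t} - \beta'x_{1t}\\ x_{1t}\end{bmatrix} + \begin{bmatrix}-\rho w_{12}y_{2t}\\ 0\end{bmatrix} + \cdots + \begin{bmatrix}-\rho w_{1N}y_{Nt}\\ 0\end{bmatrix} = \begin{bmatrix}y_{1t} - \beta'x_{1t} - \rho\sum_{j=1}^{N}w_{1j}y_{jt}\\ x_{1t}\end{bmatrix}. \]

Now, the model can be written as the system

\[ D(\rho, \beta)z_t = \mu + \Phi f_t + \varepsilon_t \tag{239} \]

where $z_t = (z_{1t}', z_{2t}',\ldots,z_{Nt}')'$ is $N(k+1)\times 1$ with $z_{it} = (y_{it}, x_{it}')'$, $\Phi = (\Phi_1,\ldots,\Phi_N)'$, $\mu = (\mu_1',\ldots,\mu_N')'$ and $\varepsilon_t = (\varepsilon_{1t}',\ldots,\varepsilon_{Nt}')'$. In detail:

\[ \underset{N(k+1)\times r}{\Phi} = \begin{bmatrix}\Phi_1'\\ \vdots\\ \Phi_N'\end{bmatrix} = \begin{bmatrix}\begin{pmatrix}\lambda_1'\\ \gamma_1'\end{pmatrix}\\ \vdots\\ \begin{pmatrix}\lambda_N'\\ \gamma_N'\end{pmatrix}\end{bmatrix},\qquad D = \begin{bmatrix}D_{11} & D_{12} & \cdots & D_{1N}\\ D_{21} & D_{22} & \cdots & D_{2N}\\ \vdots & \vdots & \ddots & \vdots\\ D_{N1} & D_{N2} & \cdots & D_{NN}\end{bmatrix} = \begin{bmatrix}D_{1\bullet}\\ D_{2\bullet}\\ \vdots\\ D_{N\bullet}\end{bmatrix}, \]

\[ D(\rho, \beta)z_t = \begin{bmatrix}D_{1\bullet}z_t\\ D_{2\bullet}z_t\\ \vdots\\ D_{N\bullet}z_t\end{bmatrix} = \begin{bmatrix}D_{11}z_{1t} + \cdots + D_{1N}z_{Nt}\\ D_{21}z_{1t} + \cdots + D_{2N}z_{Nt}\\ \vdots\\ D_{N1}z_{1t} + \cdots + D_{NN}z_{Nt}\end{bmatrix} = \begin{bmatrix}\sum_{j=1}^{N}D_{1j}z_{jt}\\ \sum_{j=1}^{N}D_{2j}z_{jt}\\ \vdots\\ \sum_{j=1}^{N}D_{Nj}z_{jt}\end{bmatrix}. \]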

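Next, writing $V = D^{-1}$,

\[ \underset{N(k+1)\times 1}{z_t} = D^{-1}(\mu + \Phi f_t + \varepsilon_t) = V(\mu + \Phi f_t + \varepsilon_t) \]

and, with the intercept $\mu$ suppressed,

\[ z_t = V(\Phi f_t + \varepsilon_t), \]

\[ \underset{N(k+1)\times N(k+1)}{M_{zz}} = \frac{1}{T}\sum_{t=1}^{T}z_tz_t' = \frac{1}{T}\sum_{t=1}^{T}V(\Phi f_t + \varepsilon_t)(\Phi f_t + \varepsilon_t)'V' = \frac{1}{T}\sum_{t=1}^{T}V\left(\Phi f_tf_t'\Phi' + \Phi f_t\varepsilon_t' + \varepsilon_tf_t'\Phi' + \varepsilon_t\varepsilon_t'\right)V'. \]

Since $DV = I$,

\[ DM_{zz}D' = D\left[\frac{1}{T}\sum_{t=1}^{T}V\left(\Phi f_tf_t'\Phi' + \Phi f_t\varepsilon_t' + \varepsilon_tf_t'\Phi' + \varepsilon_t\varepsilon_t'\right)V'\right]D' = \frac{1}{T}\sum_{t=1}^{T}\left(\Phi f_tf_t'\Phi' + \Phi f_t\varepsilon_t' + \varepsilon_tf_t'\Phi' + \varepsilon_t\varepsilon_t'\right). \]

Let $\theta = (\omega, \Phi, \Sigma_{\varepsilon\varepsilon})$, with $\omega = (\rho, \beta')'$, be the parameters. The objective function is

\[ L(\theta) = -\frac{1}{2N}\ln|\Sigma_{zz}| + \frac{1}{N}\ln|D| - \frac{1}{2N}\mathrm{tr}\left[DM_{zz}D'\Sigma_{zz}^{-1}\right] \]

where $\Sigma_{zz} = \Phi\Phi' + \Sigma_{\varepsilon\varepsilon}$, $D = D(\rho, \beta)$ and $M_{zz} = \frac{1}{T}\sum_{t=1}^{T}z_tz_t'$ is the $N(k+1)\times N(k+1)$ data matrix. The maximizer $\hat\theta$, defined by

\[ \hat\theta = \arg\max_{\theta\in\Theta}L(\theta), \]

is the quasi maximum likelihood estimator or MLE, where $\Theta$ is the parameter space. By the definition of $D = D(\rho, \beta)$,

\[ \ln|D| = \ln|I_N - \rho W_N|. \]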
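The block structure of $D(\rho, \beta)$ and the determinant identity can be verified numerically. The sketch below is only illustrative: the function name and the numerical values are assumptions, and the off-diagonal blocks use the sign $-\rho w_{ij,N}$ implied by the left-hand side of (238).

```python
import numpy as np

def build_D(rho, beta, W):
    """Assemble the N(k+1) x N(k+1) matrix D(rho, beta) block by block."""
    beta = np.asarray(beta, dtype=float)
    N, k = W.shape[0], beta.size
    diag_block = np.eye(k + 1)
    diag_block[0, 1:] = -beta                      # [[1, -beta'], [0, I_k]]
    D = np.zeros((N * (k + 1), N * (k + 1)))
    for i in range(N):
        for j in range(N):
            r, c = i * (k + 1), j * (k + 1)
            if i == j:
                D[r:r + k + 1, c:c + k + 1] = diag_block
            else:
                D[r, c] = -rho * W[i, j]           # only the (1,1) element is non-zero
    return D

# Hypothetical example: three units, two regressors.
W = np.array([[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]])
rho, beta = 0.4, np.array([1.0, -0.5])
D = build_D(rho, beta, W)

rng = np.random.default_rng(0)
N, k = 3, 2
y = rng.normal(size=N)
x = rng.normal(size=(N, k))
z = np.column_stack([y, x]).ravel()                # z_t = (y_1, x_1', ..., y_N, x_N')'
Dz = (D @ z).reshape(N, k + 1)

logdet_D = np.linalg.slogdet(D)[1]
logdet_J = np.linalg.slogdet(np.eye(N) - rho * W)[1]
print(logdet_D, logdet_J)
```

The first column of `Dz` reproduces $y_{it} - \rho\sum_j w_{ij,N}y_{jt} - x_{it}'\beta$ and the remaining columns return $x_{it}$ unchanged, which is why the Jacobian reduces to that of $I_N - \rho W_N$.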

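The Jacobian term is relatively easy to handle. The more difficult part is that $D$ also appears in the trace term, which depends on both $\rho$ and $\beta$. We can rewrite the objective function as

\[ L(\omega, \Phi, \Sigma_{\varepsilon\varepsilon}) = -\frac{1}{2N}\ln|\Sigma_{zz}| + \frac{1}{N}\ln|I_N - \rho W_N| - \frac{1}{2N}\mathrm{tr}\left[DM_{zz}D'\Sigma_{zz}^{-1}\right]. \]

The first order condition for $\Phi$ is

\[ \hat\Phi'\hat\Sigma_{\varepsilon\varepsilon}^{-1}\left(\hat DM_{zz}\hat D' - \hat\Sigma_{zz}\right) = 0 \]

where $\hat D = D(\hat\rho, \hat\beta)$. The first order condition for $\Sigma_{\varepsilon\varepsilon}$ is

\[ \hat DM_{zz}\hat D' - \hat\Sigma_{zz} = \mathbb{W} \]

where $\mathbb{W}$ is an $N(k+1)\times N(k+1)$ matrix whose ith $(k+1)\times(k+1)$ diagonal subblock is such that the upper-left $1\times 1$ and lower-right $k\times k$ submatrices are both zeros and the remaining elements are unspecified. The unspecified elements of $\mathbb{W}$ correspond to the zero elements of $\Sigma_{\varepsilon\varepsilon}$. The first order condition for $\rho$ is

\[ -\frac{1}{N}\mathrm{tr}\left[(I_N - \hat\rho W_N)^{-1}W_N\right] + \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\frac{1}{\hat\sigma_i^2}\left(y_{it} - \hat\rho\ddot y_{it} - x_{it}'\hat\beta\right)\ddot y_{it} - \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\frac{1}{\hat\sigma_i^2}\hat\lambda_i'\hat G\hat\Phi'\hat\Sigma_{\varepsilon\varepsilon}^{-1}\hat Dz_t\ddot y_{it} = 0 \]

where $\ddot y_{it} = \sum_{j=1}^{N}w_{ij,N}y_{jt}$ and $\hat G = \left(I_r + \hat\Phi'\hat\Sigma_{\varepsilon\varepsilon}^{-1}\hat\Phi\right)^{-1}$. The first order condition for $\beta$ is

\[ \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\frac{1}{\hat\sigma_i^2}x_{it}\left(y_{it} - \hat\rho\ddot y_{it} - x_{it}'\hat\beta\right) - \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\frac{1}{\hat\sigma_i^2}x_{it}\hat\lambda_i'\hat G\hat\Phi'\hat\Sigma_{\varepsilon\varepsilon}^{-1}\hat Dz_t = 0. \]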

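These first order conditions are useful when deriving the rate of convergence and the limiting distributions. They are not used in computing the MLE. The MLE obtained by the EM algorithm automatically satisfies the first order conditions.

The EM algorithm. Lee (2004a) uses the usual maximization procedures (e.g. the Newton-Raphson method) to estimate the spatial models; Bai and Li (2014) use the EM algorithm to estimate panel data models with common shocks. Neither of these methods is suitable here on its own. There is no closed form solution for the maximizer $\rho$ even if the other parameters are all known. However, this problem can be easily overcome because $\rho$ is a scalar and the actual maximization (a low-dimensional maximization) can be carried out directly. Thus the algorithm combines the usual maximization procedures with the EM algorithm. Let $\theta^{(s)} = \left(\rho^{(s)}, \beta^{(s)}, \Phi^{(s)}, \Sigma_{\varepsilon\varepsilon}^{(s)}\right)$ denote the estimates at the sth iteration. The updating procedure consists of two steps.

In the first step, we update $\beta$, $\Phi$ and $\Sigma_{\varepsilon\varepsilon}$ according to the EM algorithm:

\[ \Phi^{(s+1)} = \left[\frac{1}{T}\sum_{t=1}^{T}E\left(Dz_tf_t'\,|\,\theta^{(s)}\right)\right]\left[\frac{1}{T}\sum_{t=1}^{T}E\left(f_tf_t'\,|\,\theta^{(s)}\right)\right]^{-1} \]

\[ \Sigma_{\varepsilon\varepsilon}^{(s+1)} = \mathrm{Dg}\left[D^{(s)}M_{zz}D^{(s)\prime} - \Phi^{(s+1)}\Phi^{(s)\prime}\left(\Sigma_{zz}^{(s)}\right)^{-1}D^{(s)}M_{zz}D^{(s)\prime}\right] \]

\[ \beta^{(s+1)} = \left[\sum_{i=1}^{N}\sum_{t=1}^{T}\frac{1}{\left(\sigma_i^{(s+1)}\right)^2}x_{it}x_{it}'\right]^{-1}\left[\sum_{i=1}^{N}\sum_{t=1}^{T}\frac{1}{\left(\sigma_i^{(s+1)}\right)^2}x_{it}\left(y_{it} - \rho^{(s)}\sum_{j=1}^{N}w_{ij}y_{jt} - \lambda_i^{(s+1)\prime}f_t^{(s)}\right)\right] \]

where $\mathrm{Dg}$ is the operator which sets the entries of its argument to zero if the counterparts of $E(\varepsilon_t\varepsilon_t')$ are zero; $\left(\sigma_i^{(s+1)}\right)^2$ is the $[(i-1)(k+1)+1]$th diagonal element of $\Sigma_{\varepsilon\varepsilon}^{(s+1)}$ and $\lambda_i^{(s+1)}$ is the transpose of the $[(i-1)(k+1)+1]$th row of $\Phi^{(s+1)}$. In addition,

\[ \frac{1}{T}\sum_{t=1}^{T}E\left(Dz_tf_t'\,|\,\theta^{(s)}\right) = D^{(s)}M_{zz}D^{(s)\prime}\left(\Sigma_{zz}^{(s)}\right)^{-1}\Phi^{(s)} \]

\[ \frac{1}{T}\sum_{t=1}^{T}E\left(f_tf_t'\,|\,\theta^{(s)}\right) = I_r - \Phi^{(s)\prime}\left(\Sigma_{zz}^{(s)}\right)^{-1}\Phi^{(s)} + \Phi^{(s)\prime}\left(\Sigma_{zz}^{(s)}\right)^{-1}D^{(s)}M_{zz}D^{(s)\prime}\left(\Sigma_{zz}^{(s)}\right)^{-1}\Phi^{(s)} \]

\[ f_t^{(s)} = \Phi^{(s)\prime}\left(\Sigma_{zz}^{(s)}\right)^{-1}D^{(s)}z_t. \]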

126

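In the second step, we update $\rho$ by maximizing (3.1) with respect to $\rho$ at $\beta=\beta^{(s+1)}$, $\Phi=\Phi^{(s+1)}$ and $\Sigma_{\varepsilon\varepsilon}=\Sigma_{\varepsilon\varepsilon}^{(s+1)}$, using $\rho^{(s)}$ as the initial value. The suggested procedure is a version of the ECME procedure of Liu and Rubin (1994). Putting the two steps together, we obtain

$$\theta^{(s+1)}=\left(\rho^{(s+1)},\beta^{(s+1)},\Phi^{(s+1)},\Sigma_{\varepsilon\varepsilon}^{(s+1)}\right)$$

This procedure guarantees that the value of the likelihood function does not decrease from one iteration to the next. This is because

$$L\left(\rho^{(s)},\beta^{(s+1)},\Phi^{(s+1)},\Sigma_{\varepsilon\varepsilon}^{(s+1)}\right)\geq L\left(\rho^{(s)},\beta^{(s)},\Phi^{(s)},\Sigma_{\varepsilon\varepsilon}^{(s)}\right)$$

$$L\left(\rho^{(s+1)},\beta^{(s+1)},\Phi^{(s+1)},\Sigma_{\varepsilon\varepsilon}^{(s+1)}\right)\geq L\left(\rho^{(s)},\beta^{(s+1)},\Phi^{(s+1)},\Sigma_{\varepsilon\varepsilon}^{(s+1)}\right)$$

The limit of the iterated solution satisfies the first order conditions (3.2)-(3.5) and therefore possesses the local optimality property.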
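The second (scalar) step can be sketched as follows. This is only an illustration under stated assumptions: the objective below is a simplified stand-in for (3.1) with $\beta$, the loadings, the factors and the variances held fixed, and the names `ring_weights`, `rho_objective` and `update_rho` are hypothetical. The search is a plain golden-section search, and the update is rejected if it would lower the objective, which mirrors the non-decreasing likelihood property stated above.

```python
import numpy as np

def ring_weights(n):
    """Row-normalised ring: each unit's neighbours are the units on either side."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, (i - 1) % n] = 0.5
        W[i, (i + 1) % n] = 0.5
    return W

def rho_objective(rho, resid, ylag, sigma2, W):
    """Stand-in objective in rho, all other parameters held fixed (an assumption).

    resid  : N x T array of y_it - x_it'beta - lambda_i'f_t
    ylag   : N x T array of spatial lags sum_j w_ij y_jt
    sigma2 : length-N vector of idiosyncratic variances
    """
    n, t = resid.shape
    sign, logdet = np.linalg.slogdet(np.eye(n) - rho * W)
    if sign <= 0:
        return -np.inf
    u = resid - rho * ylag
    return logdet / n - 0.5 * np.sum(u**2 / sigma2[:, None]) / (n * t)

def update_rho(rho_old, resid, ylag, sigma2, W, lo=-0.99, hi=0.99, iters=80):
    """Golden-section search on [lo, hi]; never returns a worse value than rho_old."""
    f = lambda r: rho_objective(r, resid, ylag, sigma2, W)
    g = (np.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    for _ in range(iters):
        if f(c) > f(d):
            b = d          # maximum lies in [a, d]
        else:
            a = c          # maximum lies in [c, b]
        c, d = b - g * (b - a), a + g * (b - a)
    rho_new = 0.5 * (a + b)
    return rho_new if f(rho_new) >= f(rho_old) else rho_old

# Toy data: pure SAR process without regressors or factors.
rng = np.random.default_rng(0)
n, t = 20, 30
W = ring_weights(n)
y = np.linalg.solve(np.eye(n) - 0.4 * W, rng.normal(size=(n, t)))
ylag, sigma2 = W @ y, np.ones(n)

rho_old = 0.0
rho_new = update_rho(rho_old, y, ylag, sigma2, W)
obj_old = rho_objective(rho_old, y, ylag, sigma2, W)
obj_new = rho_objective(rho_new, y, ylag, sigma2, W)
```

In a full ECME loop this update would be called once per iteration, after the EM step for $\beta$, $\Phi$ and $\Sigma_{\varepsilon\varepsilon}$.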

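For the initial value $\theta^{(1)}=\left(\rho^{(1)},\beta^{(1)},\Phi^{(1)},\Sigma_{\varepsilon\varepsilon}^{(1)}\right)$, $\rho^{(1)}$ and $\beta^{(1)}$ can be set to the within group estimator, ignoring the endogeneity problem. $\Phi^{(1)}$ and $\Sigma_{\varepsilon\varepsilon}^{(1)}$ are then the maximizers at $\rho^{(1)},\beta^{(1)}$.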

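To provide the asymptotic representation of $\hat\omega$, the estimator of $\omega=(\rho,\beta')'$, we define

$$\Omega=\frac{1}{N}\begin{pmatrix}\mathrm{tr}\left[S_N^2\right]+\mathrm{tr}\left[\Sigma_{ee}^{-1}S_N\Psi S_N'\right]-2\sum_{i=1}^{N}S_{ii,N}^2 & \beta'\sum_{i=1}^{N}\dfrac{S_{ii,N}\Sigma_{iiv}}{\sigma_i^2}\\[2ex] \sum_{i=1}^{N}\dfrac{S_{ii,N}\Sigma_{iiv}}{\sigma_i^2}\,\beta & \sum_{i=1}^{N}\dfrac{\Sigma_{iiv}}{\sigma_i^2}\end{pmatrix}$$

where $\Psi$ is a diagonal matrix with the $i$th diagonal element being $\sigma_i^2+\beta'\Sigma_{iiv}\beta$, that is,

$$\Psi=\mathrm{diag}\left(\sigma_1^2+\beta'\Sigma_{11v}\beta,\ \ldots,\ \sigma_N^2+\beta'\Sigma_{NNv}\beta\right)$$

Then we have the following theorem.

Theorem 4.2 Under Assumptions A-F, when $N,T\to\infty$ and $\sqrt{N}/T\to 0$, we have

$$\sqrt{NT}\left(\hat\omega-\omega\right)=\Omega^{-1}\frac{1}{\sqrt{NT}}\sum_{i=1}^{N}\sum_{t=1}^{T}\frac{1}{\sigma_i^2}\begin{pmatrix}\sum_{j=1}^{N}e_{it}\eta_{ijt}S_{ij,N}\\ e_{it}v_{it}\end{pmatrix}+o_p(1)$$

where $\eta_{ijt}=v_{jt}'\beta+1(i\neq j)\,e_{jt}$.

Corollary 4.1 Under the assumptions of Theorem 4.2, if $\sqrt{N}/T\to 0$, we have

$$\sqrt{NT}\left(\hat\omega-\omega\right)\to_d N\left(0,\ \bar\Omega^{-1}\right)$$

where $\bar\Omega=\lim_{N\to\infty}\Omega$.

Remark 4.4 $\bar\Omega$ can be consistently estimated by

$$\hat\Omega=\frac{1}{N}\begin{pmatrix}\mathrm{tr}\left[\hat S_N^2\right]+\mathrm{tr}\left[\hat\Sigma_{ee}^{-1}\hat S_N\hat\Psi\hat S_N'\right]-2\sum_{i=1}^{N}\hat S_{ii,N}^2 & \hat\beta'\sum_{i=1}^{N}\dfrac{\hat S_{ii,N}\hat\Sigma_{iiv}}{\hat\sigma_i^2}\\[2ex] \sum_{i=1}^{N}\dfrac{\hat S_{ii,N}\hat\Sigma_{iiv}}{\hat\sigma_i^2}\,\hat\beta & \sum_{i=1}^{N}\dfrac{\hat\Sigma_{iiv}}{\hat\sigma_i^2}\end{pmatrix}$$

where $\hat S_N=W_N\left(I_N-\hat\rho W_N\right)^{-1}$ and $\hat\Psi=\mathrm{diag}\left(\hat\sigma_1^2+\hat\beta'\hat\Sigma_{11v}\hat\beta,\ \ldots,\ \hat\sigma_N^2+\hat\beta'\hat\Sigma_{NNv}\hat\beta\right)$.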

• Under large N, the consistency analysis for the MLE under heteroskedasticity is challenging even for spatial panel models without common shocks, owing to the simultaneous estimation of a large number of variance parameters along with (ρ, β). The QML studies, such as Yu et al. (2008) and Lee and Yu (2010), assume homoskedasticity.

• Remarks: We will develop a number of models and describe how to adopt the QML and EM algorithms. One of the main motivations is the excellent simulation evidence reported in the sequence of papers by Bai and Li (2014, 2015) and Lu (2017).

7.2 SDPD Models with Interactive Fixed Effects by Shi and Lee (2017)

Shi and Lee (2017) consider the following SDPD model with large n and T :

$$Y_{nt}=\lambda_0W_nY_{nt}+\gamma_0Y_{n,t-1}+\rho_0W_nY_{n,t-1}+X_{nt}\beta_0+\Gamma_{0n}f_{0t}+U_{nt}\qquad(240)$$

$$U_{nt}=\alpha_0\tilde W_nU_{nt}+\varepsilon_{nt}$$

where $Y_{nt}$ is an n-dimensional column vector of dependent variables and $X_{nt}$ is an $n\times(K-2)$ matrix of exogenous regressors, so that the total number of variables is K. The model accommodates two types of cross-sectional dependence, namely local dependence and global (strong) dependence. Individual units are affected by time-varying unknown common factors $f_{0t}$, which capture global (strong) dependence. The effects of the factors, $\Gamma_{0n}$, can be heterogeneous across the cross-section units. For example, in an earnings regression where $Y_{nt}$ is the wage rate, each row of $\Gamma_{0n}$ may correspond to a vector of an individual's skills and $f_{0t}$ is their time-varying premium. The true number of unobserved factors is assumed to be fixed at $r_0$, which is much smaller than n and T. The $n\times r_0$ matrix of factor loadings $\Gamma_{0n}$ and the $T\times r_0$ matrix of factors $F_{0T}=(f_{01},f_{02},\ldots,f_{0T})'$ are treated as fixed effects parameters (see footnote 39). The fixed effects approach allows unknown correlation between the time factors and the regressors. The $n\times n$ spatial weights matrices $W_n$ and $\tilde W_n$ represent local (spatial) dependence. $\lambda_0W_nY_{nt}$ describes the contemporaneous spatial interactions. $\gamma_0Y_{n,t-1}$ captures the pure dynamic effect. $\rho_0W_nY_{n,t-1}$ is a spatial time lag of interactions, which captures diffusion. The idiosyncratic error $U_{nt}$, with the elements $\varepsilon_{it}$ of $\varepsilon_{nt}$ being iid$(0,\sigma_0^2)$, also possesses a spatial structure through $\tilde W_n$.

39 Since $\Gamma_nf_t$ is observationally equivalent to $\Gamma_nHH^{-1}f_t$ for any $r\times r$ invertible matrix $H$, $r^2$ restrictions are needed to uniquely determine $\Gamma_n$ and $F_T$. One set of restrictions is that $\frac{1}{n}\Gamma_n'\Gamma_n=I_r$ and that $F_T'F_T$ is diagonal with positive diagonal elements. The first condition imposes $\frac{1}{2}(r^2+r)$ restrictions and the second provides an additional $\frac{1}{2}(r^2-r)$. There remains a "rotational indeterminacy" in the sense that the order of the factors can be switched. Let $K$ be the matrix of eigenvectors of $F_T'F_T$. Because $F_T'F_T$ is diagonal under the restriction, $K$ is a permutation of the columns of $I_r$. It can be verified that $\tilde\Gamma_n=\Gamma_nK$ and $\tilde F_T=F_TK$ also satisfy the restrictions above. This indeterminacy can be eliminated by requiring the diagonal elements of $F_T'F_T$ to be in decreasing or increasing order. Here $\Gamma_n$ and $F_T$ are treated as nuisance parameters and the above restrictions are not imposed.

They consider quasi-maximum likelihood estimation, show that the estimator is consistent, and characterize its asymptotic distribution. When n and T are comparable, the estimator is $\sqrt{nT}$-consistent. The Monte Carlo experiments show that the estimator performs well and that the proposed bias correction is effective.

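QML Estimation The parameters of the model are $\theta=(\delta',\lambda,\alpha)'$ with $\delta=(\gamma,\rho,\beta')'$, together with $\sigma^2$, $\Gamma_n$ and $F_T$. Collect the exogenous regressors in the $n\times K$ matrix $Z_{nt}=(Y_{n,t-1},W_nY_{n,t-1},X_{nt})$, where $K=k+2$. Denote $S_n(\lambda)=I_n-\lambda W_n$ and $R_n(\alpha)=I_n-\alpha\tilde W_n$. The sample averaged quasi-log likelihood function is

$$Q_{nt}(\theta,\sigma^2,\Gamma_n,F_T)=-\frac{1}{2}\log 2\pi-\frac{1}{2}\log\sigma^2+\frac{1}{n}\log\left|S_n(\lambda)R_n(\alpha)\right|-\frac{1}{2\sigma^2nT}\sum_{t=1}^{T}\left(S_n(\lambda)Y_{nt}-Z_{nt}\delta-\Gamma_nf_t\right)'R_n(\alpha)'R_n(\alpha)\left(S_n(\lambda)Y_{nt}-Z_{nt}\delta-\Gamma_nf_t\right)$$

Concentrating out $\sigma^2$ and dropping the overall constant term for simplicity,

$$Q_{nt}(\theta,\Gamma_n,F_T)=\frac{1}{n}\log\left|S_n(\lambda)R_n(\alpha)\right|-\frac{1}{2}\log\left[\frac{1}{nT}\sum_{t=1}^{T}\left(S_n(\lambda)Y_{nt}-Z_{nt}\delta-\Gamma_nf_t\right)'R_n(\alpha)'R_n(\alpha)\left(S_n(\lambda)Y_{nt}-Z_{nt}\delta-\Gamma_nf_t\right)\right]$$

is a concentrated sample averaged log likelihood function of $\theta$, $\Gamma_n$ and $F_T$. Because no restriction is imposed on $\Gamma_n$ and $R_n(\alpha)$ is assumed invertible, optimizing with respect to $\Gamma_n\in\mathbb{R}^{n\times r}$ is equivalent to optimizing with respect to the transformed $\tilde\Gamma_n=R_n(\alpha)\Gamma_n$. The objective function can be equivalently written as

$$Q_{nt}(\theta,\tilde\Gamma_n,F_T)=\frac{1}{n}\log\left|S_n(\lambda)R_n(\alpha)\right|-\frac{1}{2}\log\left[\frac{1}{nT}\sum_{t=1}^{T}\left(R_n(\alpha)\left(S_n(\lambda)Y_{nt}-Z_{nt}\delta\right)-\tilde\Gamma_nf_t\right)'\left(R_n(\alpha)\left(S_n(\lambda)Y_{nt}-Z_{nt}\delta\right)-\tilde\Gamma_nf_t\right)\right]$$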

Because it is easier to optimize with respect to a finite dimensional vector, we concentrate out the factors and their loadings using PCA: for an $n\times T$ matrix $H_{nT}$,

$$\min_{F_T\in\mathbb{R}^{T\times r},\ \tilde\Gamma_n\in\mathbb{R}^{n\times r}}\mathrm{tr}\left[\left(H_{nT}-\tilde\Gamma_nF_T'\right)\left(H_{nT}-\tilde\Gamma_nF_T'\right)'\right]=\min_{F_T\in\mathbb{R}^{T\times r}}\mathrm{tr}\left(H_{nT}M_{F_T}H_{nT}'\right)=\sum_{i=r+1}^{n}\mu_i\left(H_{nT}H_{nT}'\right)$$

where $\mu_i(\cdot)$ denotes the $i$th largest eigenvalue and the $j$th column of $F_T$ is the eigenvector corresponding to the $j$th largest eigenvalue of $H_{nT}'H_{nT}$. Because $\tilde\Gamma_nF_T'$ cannot be separately identified from $\tilde\Gamma_nMM^{-1}F_T'$ for an invertible matrix $M$, the identification conditions that $F_T'F_T=I_r$ and $\tilde\Gamma_n'\tilde\Gamma_n$ is diagonal have been imposed. The estimated factors and factor loadings will be rotations of their true values. However, the asymptotic distribution of the QML estimator is unaffected by the normalizations because the spaces spanned by the factors and their loadings do not change. The concentrated log likelihood is

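$$Q_{nt}(\theta)=\max_{F_T\in\mathbb{R}^{T\times r},\ \tilde\Gamma_n\in\mathbb{R}^{n\times r}}Q_{nt}(\theta,\tilde\Gamma_n,F_T)=\frac{1}{n}\log\left|S_n(\lambda)R_n(\alpha)\right|-\frac{1}{2}\log L_{nT}(\theta)$$

with

$$L_{nT}(\theta)=\frac{1}{nT}\sum_{i=r+1}^{n}\mu_i\left(R_n(\alpha)\left(S_n(\lambda)Y_{nT}-\sum_{k=1}^{K}Z_k\delta_k\right)\left(S_n(\lambda)Y_{nT}-\sum_{k=1}^{K}Z_k\delta_k\right)'R_n(\alpha)'\right)$$

where $Y_{nT}=(Y_{n1},\ldots,Y_{nT})$ is the $n\times T$ matrix of dependent variables and $Z_k$ is the $n\times T$ matrix of observations on the $k$th regressor. The QML estimator is

$$\hat\theta_{nT}=\arg\max_{\theta\in\Theta}Q_{nt}(\theta).$$

The estimate of $\tilde\Gamma_n$ can be obtained as the eigenvectors associated with the $r$ largest eigenvalues of

$$R_n(\alpha)\left(S_n(\lambda)Y_{nT}-\sum_{k=1}^{K}Z_k\delta_k\right)\left(S_n(\lambda)Y_{nT}-\sum_{k=1}^{K}Z_k\delta_k\right)'R_n(\alpha)'.$$

By switching n and T, the estimate of $F_T$ can be obtained similarly. Note that the estimated $\tilde\Gamma_n$ and $F_T$ are not unique, as $\tilde\Gamma_nHH^{-1}F_T'$ is observationally equivalent to $\tilde\Gamma_nF_T'$ for any invertible $r\times r$ matrix $H$. However, the column spaces of $\tilde\Gamma_n$ and $F_T$ are invariant to $H$, hence the projectors $M_{\tilde\Gamma_n}$ and $M_{F_T}$ are uniquely determined.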
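The PCA concentration step used above can be checked numerically. The sketch below is an illustration only: `H` is a random stand-in for the $n\times T$ matrix $H_{nT}$, and all variable names are hypothetical. It confirms that the minimized sum of squares, the projected trace, and the sum of the $n-r$ smallest eigenvalues of $H_{nT}H_{nT}'$ coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, r = 8, 12, 2
H = rng.normal(size=(n, T))          # stand-in for the n x T matrix H_nT

# Eigen-decomposition of H'H (T x T); eigh returns eigenvalues in ascending order.
eigval, eigvec = np.linalg.eigh(H.T @ H)
F = eigvec[:, ::-1][:, :r]           # eigenvectors of the r largest eigenvalues, F'F = I_r
Gamma = H @ F                        # least-squares loadings given F

resid = H - Gamma @ F.T
ssr = np.trace(resid @ resid.T)      # tr[(H - Gamma F')(H - Gamma F')']

# Sum of the n - r smallest eigenvalues of HH' (n x n).
mu = np.sort(np.linalg.eigvalsh(H @ H.T))[::-1]
tail = mu[r:].sum()

M_F = np.eye(T) - F @ F.T            # projector off the column space of F
concentrated = np.trace(H @ M_F @ H.T)
```

Replacing `F` by `F @ R` for any orthogonal `R` leaves `M_F`, and hence the concentrated value, unchanged, which is the invariance of the projectors noted above.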

7.3 Dynamic spatial panel data models with common shocks by Bai and Li (2015)

Bai and Li (2015) consider jointly modeling spatial interactions, dynamic interactions and common shocks within the following model:

$$y_{it}=\alpha_i+\rho\sum_{j=1}^{N}w_{ij,N}\,y_{jt}+\delta y_{it-1}+x_{it}'\beta+\lambda_i'f_t+e_{it}\qquad(241)$$

where $y_{it}$ is the dependent variable, $x_{it}=(x_{it1},\ldots,x_{itk})'$ is a k-dimensional vector of explanatory variables, $f_t$ is an r-dimensional vector of unobservable common shocks, $\lambda_i$ is the heterogeneous response to the common shocks, $W_N=(w_{ij,N})_{N\times N}$ is a spatial weights matrix whose diagonal elements $w_{ii,N}$ are 0, and $e_{it}$ are the idiosyncratic errors. The term $\lambda_i'f_t$ captures the common-shock effects, $\sum_{j=1}^{N}w_{ij,N}\,y_{jt}$ captures the spatial effects, and $\delta y_{it-1}$ captures the dynamic effects. The joint modeling allows one to test which type of effect is present. We may test $\rho=0$ while allowing for common-shock effects and dynamic effects or, similarly, determine whether the number of factors is zero in a model with spatial and dynamic effects. These features make model (241) flexible enough to cover a wide range of applications.

An additional feature of the model is that it allows for cross-sectional heteroskedasticity. If heteroskedasticity exists but homoskedasticity is imposed, the MLE can be inconsistent. Under large N, the consistency analysis for the MLE under heteroskedasticity is challenging even for spatial panel models without common shocks, owing to the simultaneous estimation of a large number of variance parameters along with $(\rho,\delta,\beta)$. Interestingly, we show that the limiting variance of the MLE is not of a sandwich form if heteroskedasticity is allowed.

The spatial interaction in the dependent variable gives rise to an endogeneity problem, while the spatial interaction in the errors does not. As a result, existing estimation methods for common shocks models, such as Pesaran (2006) and Bai (2009), cannot be directly applied to model (241) because of the endogeneity arising from the spatial interactions.

Real data often have complicated correlation over the cross section and over time. Modeling, estimating and interpreting these correlations is particularly important in economic analysis. This paper integrates several correlation-modeling techniques and proposes dynamic spatial panel data models with common shocks to accommodate a possibly complicated correlation structure over the cross section and over time. A large number of incidental parameters exist. QML estimation is proposed and heteroskedasticity is explicitly estimated. Our analysis indicates that the MLE has a non-negligible bias. We propose a bias correction method. The simulations reveal the excellent finite sample properties of the QMLE after bias correction.

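Let $\alpha=(\alpha_1,\ldots,\alpha_N)'$ and $Y_t=(y_{1t},\ldots,y_{Nt})'$. The symbols $Y_{t-1}$, $X_t$ and $e_t$ are defined similarly to $Y_t$. Then we can rewrite model (241) in matrix form:

$$Y_t=\alpha+\rho W_NY_t+\delta Y_{t-1}+X_t\beta+\Lambda f_t+e_t.$$

The above model can always be written as

$$Y_t=\left(\alpha+\Lambda\bar f\right)+\rho W_NY_t+\delta Y_{t-1}+X_t\beta+\Lambda Q^{-1/2}Q^{1/2}\left(f_t-\bar f\right)+e_t=\alpha^{+}+\rho W_NY_t+\delta Y_{t-1}+X_t\beta+\Lambda^{+}f_t^{+}+e_t$$

where $Q=\frac{1}{N}\Lambda'\Sigma_{ee}^{-1}\Lambda$ and $\bar f=\frac{1}{T}\sum_{t=1}^{T}f_t$, and $\alpha^{+}=\alpha+\Lambda\bar f$, $\Lambda^{+}=\Lambda Q^{-1/2}$, $f_t^{+}=Q^{1/2}(f_t-\bar f)$ are defined as indicated in the above equation. We see that

$$\sum_{t=1}^{T}f_t^{+}=0$$

and

$$\frac{1}{N}\Lambda^{+\prime}\Sigma_{ee}^{-1}\Lambda^{+}=\frac{1}{N}\left(\Lambda Q^{-1/2}\right)'\Sigma_{ee}^{-1}\left(\Lambda Q^{-1/2}\right)=\frac{1}{N}Q^{-1/2\prime}\Lambda'\Sigma_{ee}^{-1}\Lambda Q^{-1/2}=Q^{-1/2\prime}QQ^{-1/2}=I_r$$

Without loss of generality we therefore assume the following.

Normalization conditions: $\sum_{t=1}^{T}f_t=0$; $\frac{1}{N}\Lambda'\Sigma_{ee}^{-1}\Lambda=I_r$.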
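The re-parameterisation behind the normalization conditions can be verified numerically. The following is a minimal sketch on simulated inputs (all names and values are illustrative assumptions): it builds $\alpha^{+}$, $\Lambda^{+}$ and $f_t^{+}$ from arbitrary $\alpha$, $\Lambda$, $f_t$ and a diagonal $\Sigma_{ee}$, and the assertions in the accompanying checks confirm that the common component is unchanged while both normalizations hold.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, r = 30, 15, 2
Lam = rng.normal(size=(N, r))                 # loadings Lambda (N x r)
F = rng.normal(size=(T, r)) + 1.0             # rows are f_t', deliberately non-zero mean
alpha = rng.normal(size=N)
sig2 = rng.uniform(0.5, 2.0, size=N)          # diagonal of Sigma_ee
Sinv = np.diag(1.0 / sig2)

Q = Lam.T @ Sinv @ Lam / N
w, V = np.linalg.eigh(Q)                      # Q is symmetric positive definite
Q_half = V @ np.diag(np.sqrt(w)) @ V.T        # symmetric square root Q^{1/2}
Q_mhalf = V @ np.diag(1.0 / np.sqrt(w)) @ V.T # Q^{-1/2}

fbar = F.mean(axis=0)
alpha_p = alpha + Lam @ fbar                  # alpha^+
Lam_p = Lam @ Q_mhalf                         # Lambda^+
F_p = (F - fbar) @ Q_half                     # rows are f_t^{+'} (Q_half is symmetric)

common_old = alpha[:, None] + Lam @ F.T       # N x T array of alpha + Lambda f_t
common_new = alpha_p[:, None] + Lam_p @ F_p.T # N x T array of alpha^+ + Lambda^+ f_t^+
```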

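We shall use $(\rho^*,\delta^*,\beta^*)$ to denote the true values of $(\rho,\delta,\beta)$, and $(\Lambda^*,f_t^*)$ to denote the true values of $(\Lambda,f_t)$. So the data generating process is

$$Y_t=\alpha^*+\rho^*W_NY_t+\delta^*Y_{t-1}+X_t\beta^*+\Lambda^*f_t^*+e_t.$$

Let

$$Z_t(\alpha,w,\Lambda,F)=Y_t-\alpha-\rho W_NY_t-\delta Y_{t-1}-X_t\beta-\Lambda f_t$$

with $w=(\rho,\delta,\beta')'$. The quasi likelihood function, obtained by assuming normality of $e_{it}$, is

$$L^*(\theta)=-\frac{1}{2NT}\sum_{t=1}^{T}Z_t(\alpha,w,\Lambda,F)'\Sigma_{ee}^{-1}Z_t(\alpha,w,\Lambda,F)-\frac{1}{2N}\ln\left|\Sigma_{ee}\right|+\frac{1}{N}\ln\left|I_N-\rho W_N\right|$$

where $\theta=(w,\Lambda,\mathrm{diag}(\Sigma_{ee}))$. Given $\Sigma_{ee}$, $w$ and $\Lambda$, it is seen that $\alpha$ and $f_t$ maximize the above function at

$$\hat\alpha=\bar Y-\rho W_N\bar Y-\delta\bar Y_{-1}-\bar X\beta-\Lambda\bar f$$

and

$$\hat f_t=\left(\Lambda'\Sigma_{ee}^{-1}\Lambda\right)^{-1}\Lambda'\Sigma_{ee}^{-1}\left(\dot Y_t-\rho W_N\dot Y_t-\delta\dot Y_{t-1}-\dot X_t\beta\right)$$

where a bar denotes a time average and a dot the deviation from it, e.g. $\bar Y=T^{-1}\sum_{t=1}^{T}Y_t$ and $\dot Y_t=Y_t-\bar Y$.

Note: once $\alpha$ is concentrated out, the model reads $\dot Y_t-\rho W_N\dot Y_t-\delta\dot Y_{t-1}-\dot X_t\beta=\Lambda f_t+e_t$, so $f_t$ is obtained by GLS, which gives the expression above.

Substituting the above two equations into the preceding likelihood function to concentrate out $\alpha$ and $f_t$, the objective function can be simplified as

$$L(\theta)=-\frac{1}{2NT}\sum_{t=1}^{T}\left(\dot Y_t-\rho W_N\dot Y_t-\delta\dot Y_{t-1}-\dot X_t\beta\right)'M\left(\dot Y_t-\rho W_N\dot Y_t-\delta\dot Y_{t-1}-\dot X_t\beta\right)-\frac{1}{2N}\ln\left|\Sigma_{ee}\right|+\frac{1}{N}\ln\left|I_N-\rho W_N\right|.$$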

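Note: write $\tilde Y_t=\dot Y_t-\rho W_N\dot Y_t-\delta\dot Y_{t-1}-\dot X_t\beta$, where a dot denotes the deviation from the time average. After concentrating out $\alpha$ and $f_t$,

$$Y_t-\alpha-\rho W_NY_t-\delta Y_{t-1}-X_t\beta-\Lambda f_t=\left(I_N-\Lambda\left(\Lambda'\Sigma_{ee}^{-1}\Lambda\right)^{-1}\Lambda'\Sigma_{ee}^{-1}\right)\tilde Y_t$$

so that

$$Z_t(\alpha,w,\Lambda,F)'\Sigma_{ee}^{-1}Z_t(\alpha,w,\Lambda,F)=\tilde Y_t'\left(I_N-\Lambda\left(\Lambda'\Sigma_{ee}^{-1}\Lambda\right)^{-1}\Lambda'\Sigma_{ee}^{-1}\right)'\Sigma_{ee}^{-1}\left(I_N-\Lambda\left(\Lambda'\Sigma_{ee}^{-1}\Lambda\right)^{-1}\Lambda'\Sigma_{ee}^{-1}\right)\tilde Y_t=\tilde Y_t'M\tilde Y_t$$

where

$$\begin{aligned}M&=\left(I_N-\Lambda\left(\Lambda'\Sigma_{ee}^{-1}\Lambda\right)^{-1}\Lambda'\Sigma_{ee}^{-1}\right)'\Sigma_{ee}^{-1}\left(I_N-\Lambda\left(\Lambda'\Sigma_{ee}^{-1}\Lambda\right)^{-1}\Lambda'\Sigma_{ee}^{-1}\right)\\&=\Sigma_{ee}^{-1}-2\,\Sigma_{ee}^{-1}\Lambda\left(\Lambda'\Sigma_{ee}^{-1}\Lambda\right)^{-1}\Lambda'\Sigma_{ee}^{-1}+\Sigma_{ee}^{-1}\Lambda\left(\Lambda'\Sigma_{ee}^{-1}\Lambda\right)^{-1}\Lambda'\Sigma_{ee}^{-1}\Lambda\left(\Lambda'\Sigma_{ee}^{-1}\Lambda\right)^{-1}\Lambda'\Sigma_{ee}^{-1}\\&=\Sigma_{ee}^{-1}-\Sigma_{ee}^{-1}\Lambda\left(\Lambda'\Sigma_{ee}^{-1}\Lambda\right)^{-1}\Lambda'\Sigma_{ee}^{-1}\\&=\Sigma_{ee}^{-1}-\frac{1}{N}\Sigma_{ee}^{-1}\Lambda\left(\frac{1}{N}\Lambda'\Sigma_{ee}^{-1}\Lambda\right)^{-1}\Lambda'\Sigma_{ee}^{-1}=\Sigma_{ee}^{-1}-\frac{1}{N}\Sigma_{ee}^{-1}\Lambda\Lambda'\Sigma_{ee}^{-1}\end{aligned}$$

and the last equality uses the normalization $\frac{1}{N}\Lambda'\Sigma_{ee}^{-1}\Lambda=I_r$.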
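The algebra in the preceding note can be confirmed numerically. This is a small sketch on simulated inputs, with hypothetical variable names: the loadings are first rescaled so that $\frac{1}{N}\Lambda'\Sigma_{ee}^{-1}\Lambda=I_r$ holds, and the quadratic form in the GLS residual-maker is then compared with the closed form for $M$.

```python
import numpy as np

rng = np.random.default_rng(2)
N, r = 25, 3
sig2 = rng.uniform(0.5, 2.0, size=N)       # diagonal of Sigma_ee
Sinv = np.diag(1.0 / sig2)
Lam = rng.normal(size=(N, r))

# Impose the normalization (1/N) Lam' Sinv Lam = I_r by right-multiplying with Q^{-1/2}.
Q = Lam.T @ Sinv @ Lam / N
w, V = np.linalg.eigh(Q)
Lam = Lam @ (V @ np.diag(1.0 / np.sqrt(w)) @ V.T)

A = np.linalg.inv(Lam.T @ Sinv @ Lam)      # equals I_r / N under the normalization
P = np.eye(N) - Lam @ A @ Lam.T @ Sinv     # GLS residual-maker for Lambda
lhs = P.T @ Sinv @ P
M = Sinv - Sinv @ Lam @ Lam.T @ Sinv / N   # closed form from the note
```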

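The maximizer, defined by

$$\hat\theta=\arg\max_{\theta\in\Theta}L(\theta)$$

is referred to as the QML estimator. More specifically, $\Theta$ is defined as

$$\Theta=\left\{\theta:\ \|w\|\leq C;\ C^{-1}\leq\sigma_i^2\leq C;\ \frac{1}{N}\Lambda'\Sigma_{ee}^{-1}\Lambda=I_r\right\}$$

In what follows a dot denotes the deviation from the time average, $\dot y_{it}=y_{it}-T^{-1}\sum_{s=1}^{T}y_{is}$, and $\ddot y_{it}=\sum_{j=1}^{N}w_{ij,N}\,\dot y_{jt}$, so that $\ddot Y_t=W_N\dot Y_t$. The first order condition for $\Lambda$ gives

$$\left[\frac{1}{NT}\sum_{t=1}^{T}\left(\dot Y_t-\rho\ddot Y_t-\delta\dot Y_{t-1}-\dot X_t\beta\right)\left(\dot Y_t-\rho\ddot Y_t-\delta\dot Y_{t-1}-\dot X_t\beta\right)'\right]\Sigma_{ee}^{-1}\Lambda=\Lambda V$$

where $V$ is a diagonal matrix. The first order condition for $\sigma_i^2$ gives

$$\sigma_i^2=\frac{1}{T}\sum_{t=1}^{T}\left(\dot y_{it}-\rho\ddot y_{it}-\delta\dot y_{it-1}-\dot x_{it}'\beta-\lambda_i'\hat f_t\right)^2$$

where

$$\hat f_t=\left(\Lambda'\Sigma_{ee}^{-1}\Lambda\right)^{-1}\Lambda'\Sigma_{ee}^{-1}\left(\dot Y_t-\rho\ddot Y_t-\delta\dot Y_{t-1}-\dot X_t\beta\right)=\frac{1}{N}\Lambda'\Sigma_{ee}^{-1}\left(\dot Y_t-\rho\ddot Y_t-\delta\dot Y_{t-1}-\dot X_t\beta\right)$$

The first order condition for $\rho$ is

$$\frac{1}{NT}\sum_{t=1}^{T}\ddot Y_t'M\left(\dot Y_t-\rho\ddot Y_t-\delta\dot Y_{t-1}-\dot X_t\beta\right)-\frac{1}{N}\mathrm{tr}\left[W_N\left(I_N-\rho W_N\right)^{-1}\right]=0$$

The first order condition for $\delta$ is

$$\frac{1}{NT}\sum_{t=1}^{T}\dot Y_{t-1}'M\left(\dot Y_t-\rho\ddot Y_t-\delta\dot Y_{t-1}-\dot X_t\beta\right)=0$$

The first order condition for $\beta$ is

$$\frac{1}{NT}\sum_{t=1}^{T}\dot X_t'M\left(\dot Y_t-\rho\ddot Y_t-\delta\dot Y_{t-1}-\dot X_t\beta\right)=0$$

We emphasize that in computing the MLE we do not need to solve the first order conditions. They are used only for the theoretical analysis.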

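Apart from the $O(1/T)$ bias term, our analysis indicates that there is another $O(1/N)$ bias arising from the common shocks part $\Lambda f_t$. The presence of biases in the MLE is due to the incidental parameters problem. To state the asymptotic properties of the MLE, we define the following notation:

$$B_t=\sum_{l=0}^{\infty}\left(\delta^*G_N^*\right)^lG_N^*X_{t-l}\beta^*+\sum_{l=0}^{\infty}\left(\delta^*G_N^*\right)^lG_N^*\Lambda^*f_{t-l},\qquad\dot B_t=B_t-\frac{1}{T}\sum_{s=1}^{T}B_s$$

$$\ddot B_t=W_N\dot B_t,\qquad Q_t=\sum_{l=0}^{\infty}\left(\delta^*G_N^*\right)^lG_N^*e_{t-l},\qquad J_t=S_N^*\sum_{l=0}^{\infty}\left(\delta^*G_N^*\right)^le_{t-l}$$

$$G_N^*=\left(I_N-\rho^*W_N\right)^{-1},\qquad S_N^*=W_N\left(I_N-\rho^*W_N\right)^{-1}$$

Theorem 5.2 Under Assumptions A-H, when $N,T\to\infty$ and $\sqrt{N}/T\to 0$, $\sqrt{T}/N\to 0$, we have:

$$\sqrt{NT}\left(\hat\omega-\omega^*+b\right)=D^{-1}\xi+o_p(1),$$

where

$$\xi=\frac{1}{\sqrt{NT}}\begin{pmatrix}\sum_{t=1}^{T}\ddot B_t'M^*e_t-\sum_{t=1}^{T}\sum_{s=1}^{T}\ddot B_t'M^*e_s\pi_{st}^*+\sum_{t=1}^{T}J_t'\Sigma_{ee}^{*-1}e_t+\eta\\[1ex]\sum_{t=1}^{T}\dot B_t'M^*e_t-\sum_{t=1}^{T}\sum_{s=1}^{T}\dot B_{t-1}'M^*e_s\pi_{st}^*+\sum_{t=1}^{T}Q_{t-1}'\Sigma_{ee}^{*-1}e_t\\[1ex]\sum_{t=1}^{T}\dot X_t'M^*e_t-\sum_{t=1}^{T}\sum_{s=1}^{T}\dot X_{t-1}'M^*e_s\pi_{st}^*\end{pmatrix}$$

with

$$\pi_{st}^*=f_s^{*\prime}\left(F^{*\prime}F^*\right)^{-1}f_t^*,\qquad M^*=\Sigma_{ee}^{*-1}-\frac{1}{N}\Sigma_{ee}^{*-1}\Lambda^*\Lambda^{*\prime}\Sigma_{ee}^{*-1}$$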

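The $(k+2)\times(k+2)$ matrix $D$ is defined as

$$D=\frac{1}{NT}\begin{pmatrix}\mathrm{tr}\left(\ddot Y'M\ddot YM_{F^*}\right)+\Phi&\mathrm{tr}\left(\ddot Y'M\dot Y_{-1}M_{F^*}\right)&\mathrm{tr}\left(\ddot Y'M\dot X_1M_{F^*}\right)&\cdots&\mathrm{tr}\left(\ddot Y'M\dot X_kM_{F^*}\right)\\\mathrm{tr}\left(\dot Y_{-1}'M\ddot YM_{F^*}\right)&\mathrm{tr}\left(\dot Y_{-1}'M\dot Y_{-1}M_{F^*}\right)&\mathrm{tr}\left(\dot Y_{-1}'M\dot X_1M_{F^*}\right)&\cdots&\mathrm{tr}\left(\dot Y_{-1}'M\dot X_kM_{F^*}\right)\\\mathrm{tr}\left(\dot X_1'M\ddot YM_{F^*}\right)&\mathrm{tr}\left(\dot X_1'M\dot Y_{-1}M_{F^*}\right)&\mathrm{tr}\left(\dot X_1'M\dot X_1M_{F^*}\right)&\cdots&\mathrm{tr}\left(\dot X_1'M\dot X_kM_{F^*}\right)\\\vdots&\vdots&\vdots&\ddots&\vdots\\\mathrm{tr}\left(\dot X_k'M\ddot YM_{F^*}\right)&\mathrm{tr}\left(\dot X_k'M\dot Y_{-1}M_{F^*}\right)&\mathrm{tr}\left(\dot X_k'M\dot X_1M_{F^*}\right)&\cdots&\mathrm{tr}\left(\dot X_k'M\dot X_kM_{F^*}\right)\end{pmatrix}$$

with

$$\Phi=T\left[\mathrm{tr}\left(S_N^{*2}\right)-2\sum_{i=1}^{N}S_{ii,N}^{*2}\right].$$

The $(k+2)$-dimensional vector $b$ is defined as

$$b=D^{-1}\begin{pmatrix}\dfrac{1}{NT}\mathrm{tr}\left[\delta^*S_N^*G_N^*\left(I_N-\delta^*G_N^*\right)^{-1}\right]+\dfrac{1}{N}\mathrm{tr}\left[\Lambda^{*\prime}\bar S_N'\Sigma_{ee}^{*-1}\Lambda^*\left(\Lambda^{*\prime}\Sigma_{ee}^{*-1}\Lambda^*\right)^{-1}\right]\\[2ex]\dfrac{1}{NT}\mathrm{tr}\left[G_N^*\left(I_N-\delta^*G_N^*\right)^{-1}\right]\\[2ex]0_{k\times 1}\end{pmatrix}$$

and

$$\eta=\sum_{t=1}^{T}e_t'\bar S_N'\Sigma_{ee}^{*-1}e_t=\sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{t=1}^{T}\frac{1}{\sigma_i^{*2}}1(i\neq j)\,e_{it}e_{jt}S_{ij,N}^*$$

Here $\bar S_N$ is the $N\times N$ matrix obtained by setting all the diagonal elements of $S_N^*$ to zero.

Although $\xi$ has a relatively complicated expression, it can be shown that

$$D^{-1/2}\xi\to_dN(0,I_{k+2})$$

by resorting to the martingale difference central limit theorem. Given this result, we have the following corollary.

Corollary 5.1 Under the assumptions in Theorem 5.2, when $N,T\to\infty$ and $N/T\to\kappa^2$, we have

$$\sqrt{NT}\left(\hat\omega-\omega^*\right)\to_dN\left(-\bar b,\ \left[\mathrm{plim}_{N,T\to\infty}D\right]^{-1}\right),$$

where

$$\bar b=\mathrm{plim}_{N,T\to\infty}\,D^{-1}\begin{pmatrix}\dfrac{1}{NT}\mathrm{tr}\left[\delta^*S_N^*G_N^*\left(I_N-\delta^*G_N^*\right)^{-1}\right]+\dfrac{1}{N}\mathrm{tr}\left[\Lambda^{*\prime}\bar S_N'\Sigma_{ee}^{*-1}\Lambda^*\left(\Lambda^{*\prime}\Sigma_{ee}^{*-1}\Lambda^*\right)^{-1}\right]\\[2ex]\dfrac{1}{NT}\mathrm{tr}\left[G_N^*\left(I_N-\delta^*G_N^*\right)^{-1}\right]\\[2ex]0_{k\times 1}\end{pmatrix}$$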

7.4 Dynamic Spatial Panel Models: Networks, Common Shocks, and Sequential Exogeneity by Kuersteiner and Prucha (2015)

We consider panel data $\{y_t,x_t,z_t\}_{t=1}^{T}$, where $y_t=[y_{1t},\ldots,y_{nt}]'$, $x_t=[x_{1t}',\ldots,x_{nt}']'$, and $z_t=[z_{1t}',\ldots,z_{nt}']'$ denote the vector of the endogenous variables and the matrices of $k_x$ weakly exogenous and $k_z$ strictly exogenous variables. The specification allows for temporal dynamics in that $x_{it}$ may include a finite number of time lags of the endogenous variables.

We allow in each period t for the regressors and disturbances to be affected by common shocks. Alternatively, we allow for CSD from "spatial lags" in the endogenous variables, the exogenous variables and the disturbance process. Our specification accommodates higher order spatial lags, as well as time lags in $x_{it}$. Spatial lags represent weighted cross-sectional averages, where the weights reflect some measure of distance between units. The spatial weights are summarized by $n\times n$ spatial weight matrices denoted $W_{pt}=(w_{p,ijt})$ and $M_{qt}=(m_{q,ijt})$. $\varepsilon_t=[\varepsilon_{1t},\ldots,\varepsilon_{nt}]'$ denotes the vector of regression disturbances, $u_t=[u_{1t},\ldots,u_{nt}]'$ denotes the vector of idiosyncratic disturbances, and $\mu$ is an $n\times 1$ vector of unobserved factor loadings or individual specific fixed effects, which may be time varying through a common unobserved factor $f_t$. Let $\lambda$ and $\rho$ be P- and Q-dimensional vectors of parameters with typical elements $\lambda_p$ and $\rho_q$, and define

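$$R_t(\lambda)=\sum_{p=1}^{P}\lambda_pW_{pt}\quad\text{for the SAR term}$$

$$R_t^*(\rho)=I-\sum_{q=1}^{Q}\rho_qM_{qt}\quad\text{for a spatial autoregressive error term}$$

$$R_t^*(\rho)=\left(I+\sum_{q=1}^{Q}\rho_qM_{qt}\right)^{-1}\quad\text{for a spatial moving average error term}$$

Then, the dynamic and cross sectionally dependent panel data model can be written as

$$y_t=R_t(\lambda)y_t+x_t\beta_x+z_t\beta_z+\varepsilon_t=X_t\delta+\varepsilon_t\qquad(242)$$

$$R_t^*(\rho)\varepsilon_t=\mu f_t+u_t$$

where $X_t=[W_{1t}y_t,\ldots,W_{Pt}y_t,x_t,z_t]$ and $\delta=[\lambda',\beta']'$ with $\beta=(\beta_x',\beta_z')'$. As a normalization we take

$$w_{p,iit}=m_{q,iit}=0\quad\text{and}\quad f_T=1$$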

Note that (242) is a system of n equations describing simultaneous interactions between the individual units. The weighted averages $\bar y_{p,it}=\sum_{j=1}^{n}w_{p,ijt}\,y_{jt}$ and $\bar\varepsilon_{q,it}=\sum_{j=1}^{n}m_{q,ijt}\,\varepsilon_{jt}$ model contemporaneous direct cross-sectional interactions. We allow the weights to be stochastic and endogenous in that they can depend on $\mu_1,\ldots,\mu_n$ and $u_{it}$, and can be correlated with the disturbances $\varepsilon_t$. This extension is important for modelling sequential group formation as in Carrell et al. (2013) or endogenous network formation as in Goldsmith-Pinkham and Imbens (2013).

The reduced form is given by

$$y_t=\left(I_n-R_t(\lambda)\right)^{-1}\left(x_t\beta_x+z_t\beta_z\right)+\left(I_n-R_t(\lambda)\right)^{-1}\varepsilon_t\qquad(243)$$

$$\varepsilon_t=R_t^*(\rho)^{-1}\left(\mu f_t+u_t\right)$$

Applying a Cochrane-Orcutt type transformation by premultiplying the first equation in (242) with $R_t^*(\rho)$ yields

$$R_t^*(\rho)y_t=R_t^*(\rho)X_t\delta+\mu f_t+u_t\qquad(244)$$
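The relation between the structural form (242), the reduced form and the Cochrane-Orcutt type transformation can be illustrated for a single period. The sketch below is a toy example under simplifying assumptions that are not in the text: one spatial lag in $y$ ($P=1$), one spatial autoregressive error lag ($Q=1$), no $z_t$ regressors, and randomly generated row-normalised weights. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6

def row_normalise(A):
    """Zero the diagonal and scale each row to sum to one."""
    A = A.copy()
    np.fill_diagonal(A, 0.0)
    return A / A.sum(axis=1, keepdims=True)

W1 = row_normalise(rng.uniform(size=(n, n)))   # weights for the spatial lag of y
M1 = row_normalise(rng.uniform(size=(n, n)))   # weights for the error process
lam, rho = 0.3, 0.4
beta_x = np.array([1.0, -0.5])

x = rng.normal(size=(n, 2))                    # regressors for one period
mu = rng.normal(size=n)                        # factor loadings
f_t = 0.7                                      # common factor in this period
u = rng.normal(size=n)                         # idiosyncratic disturbances

R = lam * W1                                   # R_t(lambda)
R_star = np.eye(n) - rho * M1                  # R*_t(rho), autoregressive error case

eps = np.linalg.solve(R_star, mu * f_t + u)    # eps_t = R*^{-1}(mu f_t + u_t)
y = np.linalg.solve(np.eye(n) - R, x @ beta_x + eps)   # reduced form

X = np.column_stack([W1 @ y, x])               # X_t = [W_1t y_t, x_t]
delta = np.concatenate([[lam], beta_x])        # delta = (lambda, beta')'
```

The transformed equation has the interactive effect $\mu f_t$ entering additively, which is what the forward-differencing step discussed below exploits.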

Three examples:

1. The social interactions model by Graham (2008) illustrates the use of both spatial interaction terms and interactive effects in a social interaction model;

2. The analysis of group level heterogeneity is based on Carrell et al. (2013), which illustrates the use of higher order and data-dependent spatial lags to model within-group heterogeneity. By allowing $R_t(\lambda)$ to depend on predetermined outcomes we can accommodate the fact that group membership is not exogenous;

3. The next example is in the area of health, and considers the spread of an infectious disease.

Kuersteiner and Prucha (2015) consider a class of GMM estimators, allowing for CSD due to spatial lags and due to common shocks. They expand the scope of the existing literature by allowing for endogenous spatial weight matrices, time-varying interactive effects, and weakly exogenous covariates. An important area of application is social interaction and network models, where the specification can accommodate data dependent network formation. Identification of the spatial interaction parameters is achieved through a combination of linear and quadratic moment conditions. They develop an orthogonal forward differencing transformation to aid in the estimation of factor components while maintaining orthogonality of the moment conditions. In the social interactions example, orthogonal forward differencing amounts to controlling for unobserved correlated effects by combining multiple outcome measures.

7.5 Common Factors and Spatial Dependence by Yang (2018)

Yang (2018) considers panel data models with cross-sectional dependence arising from both spatial autocorrelation and unobserved common factors, and proposes estimation methods that employ cross-sectional averages as factor proxies, including 2SLS, Best 2SLS, and GMM estimation. The proposed estimators are robust to unknown heteroskedasticity and serial correlation in the disturbances, do not require estimating the number of unknown factors, and are computationally tractable.

Related Literature The literature on the characterization of CSD falls into two strands. On the one hand, there is a large body of literature on common factor models. Recent contributions include Pesaran (2006), Bai (2009), Bai and Li (2012), and Moon and Weidner (2015). Our study is particularly related to Pesaran (2006), who develops Common Correlated Effects (CCE) estimators for panel data models with a multifactor error structure. The basic idea is to filter the unobserved factors with cross-sectional averages. Chudik and Pesaran (2015a) extend the estimation approach to models with lagged dependent variables and weakly exogenous regressors.

On the other hand, the present paper also draws on the spatial econometrics literature. Two main methods have been developed to estimate spatial models: the maximum likelihood (ML) techniques (Anselin, 1988; Lee, 2004; Yu et al., 2008; Lee and Yu, 2010a; Aquaro et al., 2015), and the instrumental variables (IV)/GMM approaches (Kelejian and Prucha, 1999, 2010; Lee, 2007; Lin and Lee, 2010; Lee and Yu, 2014). The estimation strategy in the current article builds on the GMM framework. The present paper sheds new light on the identification of spatial models with factors, and it shows that the conditions in Lee and Yu (2016) cannot be applied when N tends to infinity.

The current paper is closely related to a number of recent studies that consider both common factors and spatial effects. Pesaran and Tosetti (2011) consider models where the idiosyncratic errors are spatially correlated and subject to common shocks. Bai and Li (2014) specify the spatial autocorrelation on the dependent variable while assuming the presence of unobserved common shocks. They advocate a pseudo-ML method that simultaneously estimates a large group of parameters, including the heterogeneous factor loadings and heterogeneous variances of the disturbances. A similar approach is considered by Bai and Li (2015) for dynamic models. Other studies within the ML framework include Shi and Lee (2017) and Lu (2017). However, besides their computational complexity, the ML methods are not robust to serial correlation in the errors, and they require knowing or estimating the number of latent factors. Instead of estimating the two effects jointly, Bailey, Holly, and Pesaran (2016) propose a two-stage approach that extracts the common factors in the first stage and then estimates the spatial connections in the second stage. Nonetheless, a formal distribution theory that takes into account the first-stage sampling errors is not yet available.

Consider the following spatial autoregressive (SAR) model with common factors,

$$y_{it}=\rho y_{it}^*+\beta'x_{it}+\gamma_i'f_t+e_{it};\qquad(245)$$

$$x_{it}=A_i'f_t+v_{it};$$

for $i=1,\ldots,N$ and $t=1,\ldots,T$, where $y_{it}$ is the dependent variable of unit i at time t, and $y_{it}^*=\sum_{j=1}^{N}w_{ij}y_{jt}$ represents the endogenous interaction effects (or spatial lag effects). The matrix $W=(w_{ij})_{N\times N}$ is a spatial weights matrix of known constants. It characterizes neighborhood relations, which are typically based on a geographical arrangement or on socio-economic connections of the cross-section units. The parameter $\rho$ captures the strength of spatial dependence across observations on the dependent variable and is known as the spatial autoregressive coefficient. The $k\times 1$ vector $x_{it}=(x_{it,1},\ldots,x_{it,k})'$ contains individual-specific explanatory variables, and $\beta$ is the corresponding vector of coefficients. The variables $e_{it}$ and $v_{it}=(v_{it,1},\ldots,v_{it,k})'$ are the idiosyncratic disturbances associated with $y_{it}$ and $x_{it}$, respectively. The $m\times 1$ vector $f_t=(f_{1t},\ldots,f_{mt})'$ represents unobserved common factors, where m is fixed but possibly unknown. The factor loadings $\gamma_i$ and $A_i$ are heterogeneous across units. Overall, $y_{it}^*$ captures the spatial effect, while $\gamma_i'f_t$ captures the common factor effect.

In model (245), the explanatory variables can be influenced by the same factors. Such a specification has been considered in Pesaran (2006) and Bai and Li (2014). This model can be readily extended to include observable factors such as intercepts, seasonal dummies, and deterministic trends; here we focus on unobservable factors.

To cope with the unknown factors in (245), we replace them with cross-sectional averages of the dependent and independent variables. To see why this approximation works for the SAR model, we rewrite the model (245) as

$$\begin{pmatrix}y_{it}-\rho\sum_{j=1}^{N}w_{ij}y_{jt}-\beta'x_{it}\\x_{it}\end{pmatrix}=\Phi_i'f_t+u_{it}\qquad(246)$$

where $\Phi_i=(\gamma_i,A_i)$ and $u_{it}=(e_{it},v_{it}')'$. Then, stacking (246) by individual unit for each time period, the model can be expressed compactly as

$$\Delta(\rho,\beta)\,z_{.t}=\Phi f_t+u_{.t};\quad\text{for }t=1,\ldots,T;\qquad(247)$$

where $z_{.t}=(z_{1t}',z_{2t}',\ldots,z_{Nt}')'$ is an $N(k+1)$-dimensional vector of observations with $z_{it}=(y_{it},x_{it}')'$, $\Phi=(\Phi_1,\ldots,\Phi_N)'$, $u_{.t}=(u_{1t}',\ldots,u_{Nt}')'$, and $\Delta=\Delta(\rho,\beta)$ is a square matrix whose $(i,j)$th subblock of size $(k+1)$, for $i,j=1,\ldots,N$, is given by

$$\Delta_{ii}=\begin{pmatrix}1&-\beta'\\0&I_k\end{pmatrix}\ \text{if }i=j;\qquad\Delta_{ij}=\begin{pmatrix}-\rho w_{ij}&0\\0&0\end{pmatrix}\ \text{if }i\neq j$$

The way of stacking the equations in (246) follows Bai and Li (2014), who show that $\Delta^{-1}=\Delta^{-1}(\rho,\beta)$ exists and its $(i,j)$th subblock is given by

$$\left(\Delta^{-1}\right)_{ii}=\begin{pmatrix}s_{ii}&s_{ii}\beta'\\0&I_k\end{pmatrix};\qquad\left(\Delta^{-1}\right)_{ij}=\begin{pmatrix}s_{ij}&s_{ij}\beta'\\0&0\end{pmatrix}\qquad(248)$$

where $s_{ij}$ denotes the $(i,j)$th element of $S^{-1}(\rho)$, and $S(\rho)=I_N-\rho W$. It then follows from (248) that (247) is equivalent to

$$z_{.t}=\Delta^{-1}\left(\Phi f_t+u_{.t}\right)=C'f_t+\varepsilon_{.t}\qquad(249)$$

where $C=\left(\Delta^{-1}\Phi\right)'$ and $\varepsilon_{.t}=\Delta^{-1}u_{.t}=(\varepsilon_{1t}',\ldots,\varepsilon_{Nt}')'$.

Now letting $\Theta_a=N^{-1}\tau_N'\otimes I_{k+1}$, where $\tau_N$ is an $N\times 1$ vector of ones, it is easily verified that

$$\bar z_{.t}=\Theta_az_{.t}=(\bar y_{.t},\bar x_{.t}')',$$

where $\bar y_{.t}=N^{-1}\sum_{i=1}^{N}y_{it}$ and $\bar x_{.t}=N^{-1}\sum_{i=1}^{N}x_{it}$. $\Theta_a$ is a matrix that operates on any $N(k+1)$-dimensional vector that is stacked in the same order as $z_{.t}$ and produces a $(k+1)$-dimensional vector of cross-sectional averages. Similarly, we have $\bar\varepsilon_{.t}=\Theta_a\varepsilon_{.t}=N^{-1}\sum_{i=1}^{N}\varepsilon_{it}$. Premultiplying both sides of (249) with $\Theta_a$ yields

$$\bar z_{.t}=\bar C'f_t+\bar\varepsilon_{.t}\qquad(250)$$

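where

$$\bar C=\left(\Theta_aC'\right)'=N^{-1}\left(\sum_{i=1}^{N}\sum_{j=1}^{N}s_{ij}\left(\gamma_j+A_j\beta\right),\ \sum_{j=1}^{N}A_j\right)\qquad(251)$$

Assuming that $\bar C$ has full row rank, namely $\mathrm{Rank}(\bar C)=m\leq k+1$, for all N including $N\to\infty$, we obtain

$$f_t=\left(\bar C\bar C'\right)^{-1}\bar C\left(\bar z_{.t}-\bar\varepsilon_{.t}\right)\qquad(252)$$

We establish in Lemma A2 that $\bar\varepsilon_{.t}$ converges to zero in quadratic mean as $N\to\infty$, for any t. It follows from (252) that $f_t$ can be approximated by the cross-sectional averages $\bar z_{.t}$ with an error of order $O_p(1/\sqrt{N})$:

$$f_t\to_p\left(C_0C_0'\right)^{-1}C_0\bar z_{.t}\quad\text{as }N\to\infty\qquad(253)$$

where

$$C_0=\lim_{N\to\infty}\bar C=\left[E(\gamma_i),E(A_i)\right]\begin{pmatrix}\bar s&0\\\bar s\beta&I_k\end{pmatrix},\qquad\bar s=N^{-1}\tau_N'S^{-1}(\rho)\tau_N=N^{-1}\sum_{i=1}^{N}\sum_{j=1}^{N}s_{ij}$$

It is clear from (253) that the averages $\bar z_{.t}$ serve fairly well as factor proxies as long as N is large.

Define the infeasible de-factoring matrices as follows: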

$$M_f=I_T-F\left(F'F\right)^{-}F';\qquad M_f^b=M_f\otimes I_N\qquad(254)$$

where $F=(f_1,\ldots,f_T)'$ is a $T\times m$ matrix of unobserved common factors, and $\left(F'F\right)^{-}$ denotes the generalized inverse of $F'F$. The observable counterparts of (254) that utilize cross-sectional averages are given by

$$\bar M=I_T-\bar Z\left(\bar Z'\bar Z\right)^{-}\bar Z';\qquad\bar M^b=\bar M\otimes I_N$$

where $\bar Z=(\bar z_{.1},\ldots,\bar z_{.T})'$ is the $T\times(k+1)$ matrix of cross-sectional averages. Note that $M_f^b$ and $\bar M^b$ are de-factoring matrices of dimension NT that operate on $Y=(y_{.1}',\ldots,y_{.T}')'$ and $X=(X_{.1}',\ldots,X_{.T}')'$, where $y_{.t}=(y_{1t},\ldots,y_{Nt})'$ and $X_{.t}=(x_{1t},\ldots,x_{Nt})'$, for $t=1,\ldots,T$.

We suggest three estimation methods: 2SLS, Best 2SLS, and GMM. This section formally establishes the asymptotic distributions of these estimators.
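The idea that cross-sectional averages can stand in for the unobserved factors can be illustrated with a small simulation. The sketch below makes several assumptions that are not part of the text: a single factor, a row-normalised ring weights matrix, loadings with mean one, and standard normal disturbances. It builds the infeasible matrix $M_f$ and its feasible counterpart from the averages, and measures how much of the factor survives the feasible projection.

```python
import numpy as np

rng = np.random.default_rng(5)
N, T, k = 400, 50, 2
rho, beta = 0.3, np.array([1.0, -0.5])

# Row-normalised ring weights (an illustrative choice of W).
W = np.zeros((N, N))
for i in range(N):
    W[i, (i - 1) % N] = W[i, (i + 1) % N] = 0.5

F = rng.normal(size=(T, 1))                       # one unobserved factor
A = rng.normal(loc=1.0, size=(N, k))              # loadings of x on the factor
gam = rng.normal(loc=1.0, size=N)                 # loadings of y on the factor

X = F[:, :, None] * A[None, :, :] + rng.normal(size=(T, N, k))   # T x N x k
rhs = X @ beta + F * gam[None, :] + rng.normal(size=(T, N))      # T x N
Y = np.linalg.solve(np.eye(N) - rho * W, rhs.T).T                # T x N, SAR reduced form

# T x (k+1) matrix of cross-sectional averages of (y, x').
Zbar = np.column_stack([Y.mean(axis=1), X.mean(axis=1)])

def annihilator(Z):
    """I - Z (Z'Z)^- Z' using the Moore-Penrose generalised inverse."""
    return np.eye(Z.shape[0]) - Z @ np.linalg.pinv(Z.T @ Z) @ Z.T

M_f = annihilator(F)        # infeasible: uses the true factor
M_bar = annihilator(Zbar)   # feasible: uses cross-sectional averages

# Share of the factor that survives the feasible de-factoring.
leak = np.linalg.norm(M_bar @ F) / np.linalg.norm(F)
```

With a larger N the surviving share `leak` shrinks further, in line with the $O_p(1/\sqrt{N})$ approximation error in (253).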

2SLS Estimation The 2SLS estimator of $\delta_0$ is defined as

$$\hat\delta_{2sls}=\left(L'P_QL\right)^{-1}L'P_QY\qquad(255)$$

where

$$P_Q=\bar M^bQ\left(Q'\bar M^bQ\right)^{-1}Q'\bar M^b,\qquad L=(Y^*,X),\qquad Y^*=(I_T\otimes W)Y$$
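A direct implementation of (255) on simulated data might look as follows. This is a sketch under assumptions that are not in the text: the instrument matrix `Q` is taken to be $(X, WX, W^2X)$ stacked by period, the weights form a ring, and there is a single factor. It is not claimed to be the instrument set used by Yang (2018).

```python
import numpy as np

rng = np.random.default_rng(4)
N, T, k = 30, 20, 2
rho0, beta0 = 0.3, np.array([1.0, -0.5])

W = np.zeros((N, N))
for i in range(N):
    W[i, (i - 1) % N] = W[i, (i + 1) % N] = 0.5

# Simulate the SAR model with one common factor (arrays indexed as period x unit).
f = rng.normal(size=T)
A = rng.normal(loc=1.0, size=(N, k))
gam = rng.normal(loc=1.0, size=N)
Xp = A[None, :, :] * f[:, None, None] + rng.normal(size=(T, N, k))
rhs = Xp @ beta0 + f[:, None] * gam[None, :] + rng.normal(size=(T, N))
Yp = np.linalg.solve(np.eye(N) - rho0 * W, rhs.T).T

# Stack by period: Y = (y_.1', ..., y_.T')'.
Y = Yp.reshape(T * N)
X = Xp.reshape(T * N, k)
Ystar = (Yp @ W.T).reshape(T * N)            # (I_T kron W) Y
L = np.column_stack([Ystar, X])

# De-factoring matrix built from cross-sectional averages.
Zbar = np.column_stack([Yp.mean(axis=1), Xp.mean(axis=1)])
Mbar = np.eye(T) - Zbar @ np.linalg.pinv(Zbar.T @ Zbar) @ Zbar.T
Mb = np.kron(Mbar, np.eye(N))

# Assumed instruments: X, WX and W^2 X (hypothetical choice).
WX = W @ Xp
Q = np.hstack([X, WX.reshape(T * N, k), (W @ WX).reshape(T * N, k)])

MQ = Mb @ Q
P_Q = MQ @ np.linalg.solve(Q.T @ MQ, MQ.T)   # Mb Q (Q' Mb Q)^{-1} Q' Mb
delta_hat = np.linalg.solve(L.T @ P_Q @ L, L.T @ P_Q @ Y)
moment = L.T @ P_Q @ (Y - L @ delta_hat)     # sample moment condition at the estimate
```

Forming the $NT\times NT$ Kronecker product is only practical for small panels; for larger ones the same projection can be applied period by period by multiplying the $T\times N$ data arrays by `Mbar`.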

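We show that the 2SLS estimator, $\hat\delta_{2sls}$, is consistent as $N\to\infty$, for T fixed or $T\to\infty$. To see this, note that

$$\hat\delta_{2sls}-\delta_0=\left(L'P_QL\right)^{-1}L'P_Q\left[(I_T\otimes\Gamma_0)f+e\right];$$

and then

$$\sqrt{NT}\left(\hat\delta_{2sls}-\delta_0\right)=\left(\frac{1}{NT}L'\bar M^bQ\left(\frac{1}{NT}Q'\bar M^bQ\right)^{-1}\frac{1}{NT}Q'\bar M^bL\right)^{-1}\times\left(\frac{1}{NT}L'\bar M^bQ\left(\frac{1}{NT}Q'\bar M^bQ\right)^{-1}\frac{1}{\sqrt{NT}}Q'\bar M^b\left[(I_T\otimes\Gamma_0)f+e\right]\right)$$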

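Applying Lemma A6, we have

$$\frac{1}{NT}Q'\bar M^bQ=\frac{1}{NT}Q'M_f^bQ+O_p\left(\frac{1}{N}\right)+O_p\left(\frac{1}{\sqrt{NT}}\right)$$

$$\frac{1}{NT}Q'\bar M^bL=\frac{1}{NT}Q'M_f^bL_0+O_p\left(\frac{1}{N}\right)+O_p\left(\frac{1}{\sqrt{NT}}\right)$$

where $L_0=\left(G_0^bX\beta_0,X\right)$ with $G_0^b=I_T\otimes G_0$ and $G_0=G_0(\rho_0)=W\left(I_N-\rho_0W\right)^{-1}$. It follows that

$$\frac{1}{NT}L'P_QL=\frac{1}{NT}L_0'P_{Q,f}L_0+O_p\left(\frac{1}{N}\right)+O_p\left(\frac{1}{\sqrt{NT}}\right)$$

where

$$P_{Q,f}=M_f^bQ\left(Q'M_f^bQ\right)^{-1}Q'M_f^b.$$

Under Assumption 8, $\mathrm{plim}\,\frac{1}{NT}L_0'P_{Q,f}L_0$ exists and is nonsingular. Furthermore, we have shown in Lemma A6 that

$$\mathrm{plim}_{N\to\infty}\frac{1}{\sqrt{NT}}Q'M_f^b\left[(I_T\otimes\Gamma_0)f+e\right]=0$$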

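As a result, $\hat\delta_{2sls}$ is consistent for $\delta_0$ as $N\to\infty$.

We show that as $(N,T)\to_j\infty$ and $T/N\to 0$, the term $\frac{1}{\sqrt{NT}}Q'\bar M^b\left[(I_T\otimes\Gamma_0)f\right]$ converges in probability to zero, and $\frac{1}{\sqrt{NT}}Q'\bar M^be$ tends toward a normal distribution. The following theorem summarizes the limiting distribution of the 2SLS estimator.

Theorem 1. Consider the panel data model given by (245) and suppose that Assumptions 1, 2(i), and 3-8 hold. The 2SLS estimator, $\hat\delta_{2sls}$, defined by (255), is consistent for $\delta_0$ as $N\to\infty$, for T fixed or $T\to\infty$. Moreover, as $(N,T)\to_j\infty$ and $T/N\to 0$, we have

$$\sqrt{NT}\left(\hat\delta_{2sls}-\delta_0\right)\to_dN(0,\Sigma_{2sls});$$

where

$$\Sigma_{2sls}=\Psi_{LPL}^{-1}\Omega_{LPe}\Psi_{LPL}^{-1}$$

$$\Psi_{LPL}=\mathrm{plim}_{N,T\to\infty}\frac{1}{NT}L_0'P_{Q,f}L_0,\qquad\Omega_{LPe}=\Psi_{QML}'\Psi_{QMQ}^{-1}\Omega_{QMe}\Psi_{QMQ}^{-1}\Psi_{QML}$$

$$\Psi_{QMQ}=\mathrm{plim}_{N,T\to\infty}\frac{1}{NT}Q'M_f^bQ,\qquad\Psi_{QML}=\mathrm{plim}_{N,T\to\infty}\frac{1}{NT}Q'M_f^bL_0$$

$$\Omega_{QMe}=\lim_{N\to\infty}\left(N^{-1}\sum_{i=1}^{N}\Omega_{i,QMe}\right),\qquad\Omega_{i,QMe}=\mathrm{plim}_{T\to\infty}\frac{1}{T}Q_{i.}'M_f\Omega_{e,i}M_fQ_{i.}$$

$$\Omega_{e,i}=E\left(e_{i.}e_{i.}'\right)\quad\text{and}\quad Q_{i.}=(Q_{i1},\ldots,Q_{iT})'$$

A consistent estimator of $\Sigma_{2sls}$ is given by

$$\hat\Sigma_{2sls}=\left(\frac{1}{NT}L'P_QL\right)^{-1}\hat\Omega_{LPe}\left(\frac{1}{NT}L'P_QL\right)^{-1}$$

where $\hat\Omega_{LPe}$ can be obtained by a Newey-West type robust estimator as follows:

$$\hat\Omega_{LPe}=N^{-1}\sum_{i=1}^{N}\hat\Omega_{iLPe}$$

$$\hat\Omega_{iLPe}=\hat\Omega_{iLPe,0}+\sum_{h=1}^{M_l}\left(1-\frac{h}{M_l+1}\right)\left(\hat\Omega_{iLPe,h}+\hat\Omega_{iLPe,h}'\right)$$

$$\hat\Omega_{iLPe,h}=T^{-1}\sum_{t=h+1}^{T}\hat e_{it}\hat e_{it-h}\hat l_{it}\hat l_{it-h}'$$

where $M_l$ is the window size of the Bartlett kernel, $\hat e=\bar M^b\left(Y-L\hat\delta_{2sls}\right)=\left(\hat e_{.1}',\ldots,\hat e_{.T}'\right)'$, $\hat e_{it}$ is the $i$th element of $\hat e_{.t}$, $\hat L=P_QL=\left(\hat L_{.1}',\ldots,\hat L_{.T}'\right)'$, and $\hat l_{it}'$ is the $i$th row of $\hat L_{.t}$.

Remark 8. Although our interest lies in the parameters $\delta$, we can gain insight into the variability of the factor loadings after obtaining estimates of $\delta$. This can be done by regressing $y_{it}-l_{it}'\hat\delta$ on $\bar z_{.t}$ and an intercept for each cross-section unit i, where $l_{it}=(y_{it}^*,x_{it}')'$ and $\bar z_{.t}=(\bar y_{.t},\bar x_{.t}')'$.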


7.6 Estimation and Inference in Heterogenous Spatial Panel Data Models with a Multifactor Error Structure by Shin and Zheng (2019)

Following Yang (2018), we consider the heterogenous spatial panel data model with interactive effects, and develop the asymptotic theory for the CCE-type estimator.

• The main differences and contributions are highlighted as follows:

• First, we allow $y_{it}$ and $x_{it}$ to contain different factors, whereas $y_{it}$ and $x_{it}$ share the same common factors in the existing CCE literature; this may raise an additional issue in selecting the full set or a subset of the cross-section averages (CSA) of the $x_{it}$'s. See also Chudik and Pesaran (2015).

• Second, we propose to employ the cross-section averages of $x_{it}$ as proxies for the factors. In particular, the use of $\bar y_t$ in the spatial panel data model cannot be justified even as $N\to\infty$. This is because some or all of the $y_{jt}$'s, for $j=1,\ldots,N$ and $j\neq i$, are likely to be correlated with the idiosyncratic disturbance $e_{it}$. It would be more sensible to derive the conditions under which the use of $\bar y_t$ is or is not detrimental.

• More importantly, we allow the main slope parameters to be heterogeneous across spatial units. So far there are only two papers, Aquaro, Bailey and Pesaran (2015) and Shin and Thornton (2019), which are unusual in allowing heterogeneous spatial parameters. In particular, we are the first to develop consistent IV estimation for the heterogeneous spatial panel data model with interactive effects. We derive the asymptotic theory for the individual estimator as well as for the mean group and pooled estimators of the averages of the heterogeneous spatial parameters. In particular, we find that the limiting distribution of the ith individual estimator depends on those of the other individual estimators for $j=1,\ldots,N$ and $j\neq i$.

• The proposed approach is somewhat similar to the Mundlak transformation?? Pros and cons?

• We then provide network analyses using the outputs involving the $N\times N$ spatial units.

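Consider the heterogeneous spatial panel data model with interactive effects:

$$ y_{it} = \rho_i y^*_{it} + \underset{1\times k}{\beta_i'}\,\underset{k\times 1}{x_{it}} + \underset{1\times r_1}{\Gamma_{1i}'}\,\underset{r_1\times 1}{f_{1t}} + \underset{1\times r_2}{\Gamma_{2i}'}\,\underset{r_2\times 1}{f_{2t}} + e_{it} = \rho_i y^*_{it} + \beta_i' x_{it} + \underset{1\times(r_1+r_2)}{\Gamma_{yi}'}\,\underset{(r_1+r_2)\times 1}{f_{yt}} + e_{it}, \tag{256} $$

$$ \underset{k\times 1}{x_{it}} = \underset{k\times r_1}{A_{1i}'}\,\underset{r_1\times 1}{f_{1t}} + \underset{k\times r_3}{A_{2i}'}\,\underset{r_3\times 1}{f_{3t}} + \underset{k\times 1}{v_{it}} = \underset{k\times(r_1+r_3)}{A_{xi}'}\,\underset{(r_1+r_3)\times 1}{f_{xt}} + v_{it}, \tag{257} $$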


for $i = 1, 2, \ldots, N$ and $t = 1, 2, \ldots, T$, where $y_{it}$ is the dependent variable of unit $i$ at time $t$, and $y^*_{it} = \sum_{j=1}^{N} w_{ij}y_{jt}$ represents the spatially lagged dependent variable. The matrix $W = (w_{ij})_{N\times N}$ is a fixed spatial weights matrix. The $k \times 1$ vector $x_{it} = (x_{it,1}, x_{it,2}, \ldots, x_{it,k})'$ contains individual-specific explanatory variables, and $\beta_i$ is the corresponding vector of coefficients. $e_{it}$ and $v_{it} = (v_{it,1}, v_{it,2}, \ldots, v_{it,k})'$ are the idiosyncratic disturbances associated with $y_{it}$ and $x_{it}$, respectively.

Notice that we introduce three types of unobserved factors. $f_{1t} = (f_{1t,1}, \ldots, f_{1t,r_1})'$ is the $r_1 \times 1$ vector of unobserved factors that are common to both $y_{it}$ and $x_{it}$, whilst $f_{2t}$ and $f_{3t}$ are the $r_2 \times 1$ and $r_3 \times 1$ vectors of unobserved factors that are specific to $y_{it}$ and $x_{it}$, respectively. Accordingly, $\Gamma_{yi}$ and $A_{xi}$ capture heterogeneous factor loadings. This factor structure is more practical than the maintained setup in the existing literature (e.g. Pesaran, 2006; Bai and Li, 2014; Yang, 2018), in which $y_{it}$ and $x_{it}$ are affected by the same factors. See also Chudik and Pesaran (2015), who consider a similar multifactor structure.

Following Pesaran (2006) and Yang (2018), we are initially inclined to employ the cross-section averages of both $y_{it}$ and $x_{it}$ as factor proxies for large $N$. But there are a few critical issues. First, $y^*_{it} = \sum_{j=1}^{N} w_{ij}y_{jt}$ in the model (256) is endogenous, such that $\bar y_t = N^{-1}\sum_{j=1}^{N} y_{jt}$ is also likely to be endogenous even as $N \to \infty$. See also Chen and Yan (2019), who consider the case where $\bar y_t$ is correlated with the idiosyncratic error in the panel data model with interactive effects. (More details to be added.) Second, we observe that the use of $\bar y_t$ as a proxy for factors sometimes makes the estimation results unreliable, for example in estimating the gravity model of international trade, the stochastic frontier regression, and PPP (e.g. SS and more applied papers). Thirdly, the main motivation behind the CCE or the correlated RE estimation is to control for the correlation between regressors and unobserved factors, which would otherwise lead to biased estimates. Similar to the control function approach, however, we only need to control for the correlation between regressors and unobserved factors in order to deal with endogeneity. One might argue that adding the cross-section average of $y_{it}$ would produce a more efficient estimator. But this is not necessarily true, especially if $r_2 \geq 2$, in which case $f_{2t}$ cannot be fully approximated. Finally, we may refer to the few papers in the literature that use $\bar x_t$ only, e.g. Linton, who applies the CCE to the binary choice model, or other nonlinear applications.

In this regard, we propose to employ the cross-section averages of regressors only as factor proxies. In what follows, we show that this scheme works very well and has several advantages as compared to the approximation using both $\bar y_t$ and $\bar x_t$. To this end we rewrite (257) in matrix form:

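$$ \underset{N\times k}{X_{.t}} = \underset{N\times Nr_1}{(I_N \otimes f_{1t}')}\,\underset{Nr_1\times k}{A_1} + \underset{N\times Nr_3}{(I_N \otimes f_{3t}')}\,\underset{Nr_3\times k}{A_2} + \underset{N\times k}{V_{.t}} = \underset{N\times N(r_1+r_3)}{(I_N \otimes f_{xt}')}\,\underset{N(r_1+r_3)\times k}{A_x} + V_{.t}, \tag{258} $$

where $X_{.t} = (x_{1t}, x_{2t}, \ldots, x_{Nt})'$ and $V_{.t} = (v_{1t}, v_{2t}, \ldots, v_{Nt})'$ are $N \times k$; $A_1 = (A_{11}', A_{12}', \ldots, A_{1N}')'$ is $Nr_1 \times k$; $A_2 = (A_{21}', A_{22}', \ldots, A_{2N}')'$ is $Nr_3 \times k$; and $A_x = (A_{x1}', A_{x2}', \ldots, A_{xN}')'$. Then, we rewrite (258) as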

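$$ \underset{Nk\times 1}{\mathrm{vec}(X_{.t}')} = \underset{Nk\times r_1}{\mathcal{A}_1'}\,\underset{r_1\times 1}{f_{1t}} + \underset{Nk\times r_3}{\mathcal{A}_2'}\,\underset{r_3\times 1}{f_{3t}} + \mathrm{vec}(V_{.t}') = \underset{Nk\times(r_1+r_3)}{\mathcal{A}_x'}\,\underset{(r_1+r_3)\times 1}{f_{xt}} + \mathrm{vec}(V_{.t}'), \tag{259} $$

where $\mathcal{A}_1 = (A_{11}, \ldots, A_{1N})$, $\mathcal{A}_2 = (A_{21}, \ldots, A_{2N})$ and $\mathcal{A}_x = (A_{x1}, \ldots, A_{xN})$ stack the loadings horizontally (the calligraphic letters distinguish them from the vertically stacked matrices in (258)). Letting $\Theta_a = N^{-1}\tau_N' \otimes I_k$, where $\tau_N$ is an $N \times 1$ vector of ones, it is easily seen that

$$ \underset{k\times 1}{\bar x_{.t}} = \underset{k\times Nk}{\Theta_a}\,\mathrm{vec}(X_{.t}') = \frac{1}{N}(\tau_N' \otimes I_k)\,\mathrm{vec}(X_{.t}') = \frac{1}{N}\mathrm{vec}(X_{.t}'\tau_N) = \frac{1}{N}\sum_{i=1}^{N} x_{it}. \tag{260} $$

Similarly,

$$ \underset{k\times 1}{\bar v_{.t}} = \Theta_a\,\mathrm{vec}(V_{.t}') = \frac{1}{N}\sum_{i=1}^{N} v_{it}. \tag{261} $$

Pre-multiplying both sides of (259) by $\Theta_a$ yields:

$$ \bar x_{.t} = \Theta_a\,\mathrm{vec}(X_{.t}') = \Theta_a\mathcal{A}_x' f_{xt} + \Theta_a\,\mathrm{vec}(V_{.t}') = \underset{k\times(r_1+r_3)}{\bar A_x'}\,\underset{(r_1+r_3)\times 1}{f_{xt}} + \bar v_{.t}, \tag{262} $$

where

$$ \bar A_x = (\Theta_a\mathcal{A}_x')' = \frac{1}{N}\sum_{j=1}^{N} A_{xj}. \tag{263} $$

Assume that $\bar A_x$ has full row rank, namely $\mathrm{Rank}(\bar A_x) = r_1 + r_3 = r_x \leq k$, for all $N$ including $N \to \infty$. Pre-multiplying (262) by $\bar A_x$, we have:

$$ f_{xt} = (\bar A_x\bar A_x')^{-1}\bar A_x(\bar x_{.t} - \bar v_{.t}). \tag{264} $$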

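[39] This can be established by noticing that, from (??), we have

$$ \underset{1\times k}{x_{it}'} = \underset{1\times r_1}{f_{1t}'}\,\underset{r_1\times k}{A_{1i}} + \underset{1\times r_3}{f_{3t}'}\,\underset{r_3\times k}{A_{2i}} + v_{it}' = \underset{1\times(r_1+r_3)}{f_{xt}'}\,\underset{(r_1+r_3)\times k}{A_{xi}} + v_{it}'. $$

It seems that equation (258) does not play any role in our analysis, but the matrix form for $X_{i.}$ will play a role:

$$ \underset{T\times k}{X_{i.}} = \underset{T\times r_1}{F_1}\,A_{1i} + \underset{T\times r_3}{F_3}\,A_{2i} + \underset{T\times k}{V_{i.}} = \underset{T\times(r_1+r_3)}{F_x}\,A_{xi} + V_{i.}, $$

where $F_x = (f_{x1}, f_{x2}, \ldots, f_{xT})'$, and $F_1$ and $F_3$ are defined similarly.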


We establish in Lemma A1 that $\bar v_{.t}$ converges to zero in quadratic mean as $N \to \infty$, for any $t$. Therefore, $f_{xt}$ can be approximated by the cross-sectional averages $\bar x_{.t}$; namely,

$$ f_{xt} = (A_{x0}A_{x0}')^{-1}A_{x0}\bar x_{.t} + o_p(1), \tag{265} $$

where $A_{x0} = \lim_{N\to\infty}\bar A_x = E(A_{xi})$.[40] It is clear from (265) that $\bar x_{.t}$ serves fairly well as a factor proxy as long as $N$ is large.

We now define the (infeasible) projection matrices for the factors $F_x$ as

$$ M_{fx} = I_T - F_x(F_x'F_x)^{-}F_x', \qquad M^b_{fx} = M_{fx} \otimes I_N, \tag{266} $$

where $F_x = (f_{x1}, \ldots, f_{xT})'$ is a $T \times (r_1 + r_3)$ matrix of unobserved common factors, and $(F_x'F_x)^{-}$ denotes the generalized inverse of $F_x'F_x$. We also define the observable counterparts of (266) on the basis of the cross-sectional averages of $x_{it}$ by

$$ M_x = I_T - \bar X(\bar X'\bar X)^{-}\bar X', \qquad M^b_x = M_x \otimes I_N, \tag{267} $$

where $\bar X = (\bar x_{.1}, \bar x_{.2}, \ldots, \bar x_{.T})'$ is $T \times k$.[41] Note that $M^b_{fx}$ and $M^b_x$ are de-factoring matrices of dimension $NT$ that operate on the $NT \times k$ matrix $X = (X_{.1}', X_{.2}', \ldots, X_{.T}')'$, where $X_{.t} = (x_{1t}, x_{2t}, \ldots, x_{Nt})'$ is $N \times k$ for $t = 1, \ldots, T$.
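The construction of $M_x$ in (267), and the sense in which cross-section averages proxy the factors in (265), can be illustrated numerically. The sketch below is an assumption-laden toy example rather than the authors' code: the simulated design (two factors, loadings centred on the identity so that $\bar A_x$ has full rank, Gaussian noise) and the name `csa_projection` are hypothetical.

```python
import numpy as np

def csa_projection(x):
    """x: (T, N, k) panel of regressors. Returns (xbar, M_x) as in (267)."""
    xbar = x.mean(axis=1)                              # T x k cross-section averages
    # pinv plays the role of the generalized inverse (Xbar'Xbar)^-
    m_x = np.eye(x.shape[0]) - xbar @ np.linalg.pinv(xbar.T @ xbar) @ xbar.T
    return xbar, m_x

rng = np.random.default_rng(0)
T, N, k = 30, 200, 2
factors = rng.normal(size=(T, k))                      # F_x with r_x = k (assumed)
# loadings A_xi (r_x x k) scattered around the identity, so mean loadings have full rank
loadings = np.eye(k) + 0.5 * rng.normal(size=(N, k, k))
noise = rng.normal(size=(T, N, k))                     # v_it
x = np.einsum("tr,nrk->tnk", factors, loadings) + noise  # x_it' = f_xt' A_xi + v_it'

xbar, m_x = csa_projection(x)
# share of the factor space left after projecting on the averages; small when N is large
rel = np.linalg.norm(m_x @ factors) / np.linalg.norm(factors)
print(round(rel, 3))
```

The printed ratio shrinks as `N` grows, because the averaged noise $\bar v_{.t}$ vanishes, which is the content of the approximation in (265).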

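Now rewrite (256) in matrix form:

$$ \underset{N\times 1}{Y_{.t}} = \begin{pmatrix} y_{1t} \\ \vdots \\ y_{Nt} \end{pmatrix} = \underset{N\times N}{\rho}\,\underset{N\times N}{W}\,\underset{N\times 1}{Y_{.t}} + \underset{N\times Nk}{\beta}\,\underset{Nk\times 1}{X_{.t}} + \underset{N\times(r_1+r_2)}{\Gamma_y}\,\underset{(r_1+r_2)\times 1}{f_{yt}} + \underset{N\times 1}{e_{.t}}, \tag{268} $$

where

$$ \rho = \begin{pmatrix} \rho_1 & 0 & \cdots & 0 \\ \vdots & \ddots & & \vdots \\ 0 & \cdots & 0 & \rho_N \end{pmatrix}, \qquad \beta = \begin{pmatrix} \beta_1' & 0 & \cdots & 0 \\ \vdots & \ddots & & \vdots \\ 0 & \cdots & 0 & \beta_N' \end{pmatrix}, \qquad \Gamma_y = \begin{pmatrix} \Gamma_{y1}' \\ \vdots \\ \Gamma_{yN}' \end{pmatrix}, $$

$$ \underset{k\times 1}{\beta_i} = \begin{pmatrix} \beta_{i1} \\ \vdots \\ \beta_{ik} \end{pmatrix}, \qquad \underset{Nk\times 1}{X_{.t}} = \begin{pmatrix} x_{1t} \\ \vdots \\ x_{Nt} \end{pmatrix}, \qquad \underset{k\times 1}{x_{it}} = \begin{pmatrix} x_{it,1} \\ \vdots \\ x_{it,k} \end{pmatrix}. $$

Note that in (268) $X_{.t}$ denotes the $Nk \times 1$ stacked vector of regressors.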

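[40] The notation $E(A_{xi})$ might look somewhat strange. It simply means the population expectation of (123). If we further assume that $E(A_{xi}) = E(A_{xj}) = B$, we can also write the above equation as $\lim_{N\to\infty}\bar A_x = B$.

[41] Since, from (262), we have

$$ \underset{1\times k}{\bar x_{.t}'} = \underset{1\times(r_1+r_3)}{f_{xt}'}\,\underset{(r_1+r_3)\times k}{\bar A_x} + \underset{1\times k}{\bar v_{.t}'}, $$

it follows that

$$ \underset{T\times k}{\bar X} = \underset{T\times(r_1+r_3)}{F_x}\,\underset{(r_1+r_3)\times k}{\bar A_x} + \underset{T\times k}{\bar V}. $$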


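Further, we rewrite (268) as

$$ \underset{NT\times 1}{Y} = \begin{pmatrix} Y_{.1} \\ \vdots \\ Y_{.T} \end{pmatrix} = \underset{NT\times NT}{(I_T \otimes \rho W)}\,Y + \underset{NT\times NTk}{(I_T \otimes \beta)}\,\underset{NTk\times 1}{X} + \underset{NT\times T(r_1+r_2)}{(I_T \otimes \Gamma_y)}\,\underset{T(r_1+r_2)\times 1}{f_y} + \underset{NT\times 1}{e}, $$

where $X = (X_{.1}', \ldots, X_{.T}')'$, $f_y = (f_{y1}', \ldots, f_{yT}')'$ and $e = (e_{.1}', \ldots, e_{.T}')'$, and where $I_T \otimes \rho W$, $I_T \otimes \beta$ and $I_T \otimes \Gamma_y$ are block-diagonal matrices with $T$ copies of $\rho W$, $\beta$ and $\Gamma_y$, respectively, on the diagonal. Solving for $Y$ gives

$$ \begin{aligned} Y &= (I_{NT} - I_T \otimes \rho W)^{-1}[(I_T \otimes \beta)X + (I_T \otimes \Gamma_y)f_y + e] \\ &= (I_T \otimes (I_N - \rho W))^{-1}[(I_T \otimes \beta)X + (I_T \otimes \Gamma_y)f_y + e] \\ &= (I_T \otimes (I_N - \rho W)^{-1})[(I_T \otimes \beta)X + (I_T \otimes \Gamma_y)f_y + e] \\ &= (I_T \otimes S^{-1})[(I_T \otimes \beta)X + (I_T \otimes \Gamma_y)f_y + e], \end{aligned} \tag{269} $$

where $S = I_N - \rho W$.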

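Since we are interested in constructing the mean group estimator, which requires estimates of each $\rho_i$ and $\beta_i$, $i = 1, 2, \ldots, N$, we further rewrite (256) as

$$ \begin{aligned} \underset{T\times 1}{Y_{i.}} &= \rho_i\,\underset{T\times NT}{(I_T \otimes b_i'W)}\,\underset{NT\times 1}{Y} + \underset{T\times k}{X_{i.}}\,\underset{k\times 1}{\beta_i} + \underset{T\times r_1}{F_1}\,\underset{r_1\times 1}{\Gamma_{1i}} + \underset{T\times r_2}{F_2}\,\underset{r_2\times 1}{\Gamma_{2i}} + e_{i.} \\ &= \rho_i(I_T \otimes b_i'W)Y + X_{i.}\beta_i + \underset{T\times(r_1+r_2)}{F_y}\,\underset{(r_1+r_2)\times 1}{\Gamma_{yi}} + e_{i.} \\ &= \rho_iY^*_{i.} + X_{i.}\beta_i + F_y\Gamma_{yi} + e_{i.}, \end{aligned} \tag{270} $$

$$ \underset{T\times k}{X_{i.}} = \underset{T\times r_1}{F_1}\,\underset{r_1\times k}{A_{1i}} + \underset{T\times r_3}{F_3}\,\underset{r_3\times k}{A_{2i}} + \underset{T\times k}{V_{i.}} = \underset{T\times(r_1+r_3)}{F_x}\,\underset{(r_1+r_3)\times k}{A_{xi}} + V_{i.},^{[42]} \tag{271} $$

where $Y^*_{i.} = (y^*_{i1}, y^*_{i2}, \ldots, y^*_{iT})'$ is $T \times 1$, $y^*_{it} = \sum_{j=1}^{N} w_{ij}y_{jt}$, $w_{ij}$ is the $(i,j)$th element of the spatial weights matrix $W$, $b_i = (0, 0, \ldots, 1, \ldots, 0)'$ is the $i$th basis vector of dimension $N$, and $X_{i.} = (x_{i1}, x_{i2}, \ldots, x_{iT})'$.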
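The selection identity behind (270), $(I_T \otimes b_i'W)Y = Y^*_{i.}$, is easy to check numerically. The following is a small sketch with made-up dimensions and a randomly generated, row-normalised $W$ (both are assumptions for illustration only); it also confirms that the diagonal matrix $\rho$ equals $\sum_i \rho_i b_i b_i'$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, i = 4, 3, 2                                   # toy dimensions, unit index i (0-based)

W = rng.random((N, N))
np.fill_diagonal(W, 0.0)
W /= W.sum(axis=1, keepdims=True)                   # row-normalised weights (assumption)

Y_mat = rng.normal(size=(N, T))                     # columns are Y_.1, ..., Y_.T
Y = Y_mat.reshape(-1, order="F")                    # stacked NT vector (Y_.1', ..., Y_.T')'

b_i = np.zeros(N)
b_i[i] = 1.0
selector = np.kron(np.eye(T), (b_i @ W)[None, :])   # I_T ⊗ b_i'W, of size T x NT
y_star_i = selector @ Y                             # (I_T ⊗ b_i'W) Y
direct = W[i] @ Y_mat                               # (y*_i1, ..., y*_iT) computed row by row

rho_vec = rng.uniform(-0.5, 0.5, size=N)
basis = np.eye(N)
rho_mat = sum(rho_vec[j] * np.outer(basis[j], basis[j]) for j in range(N))
```

The key design point is the stacking order: $Y$ must be stacked period by period (all units at $t=1$, then $t=2$, and so on) for the Kronecker form to pick out unit $i$ in every period.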

[42] To make (270) clearer, notice that

$$ (I_T \otimes b_i'W)Y = \mathrm{vec}(b_i'W(Y_{.1}, Y_{.2}, \ldots, Y_{.T})) = \mathrm{vec}(W_{i.}(Y_{.1}, Y_{.2}, \ldots, Y_{.T})) = \mathrm{vec}((W_{i.}Y_{.1}, W_{i.}Y_{.2}, \ldots, W_{i.}Y_{.T})) = (y^*_{i1}, y^*_{i2}, \ldots, y^*_{iT})' = Y^*_{i.}, $$

where $W_{i.} = b_i'W$ is the $i$th row of $W$. Since (270) plays a very important role in our analysis, we discuss its derivation in more detail. To study the properties of the individual coefficient estimator and, further, of the Mean Group estimator, we need its expression in matrix form. As can be seen from (269), $\rho$ is now an $N \times N$ diagonal coefficient matrix, and Aquaro, Bailey and Pesaran (2015) estimate this coefficient matrix directly. This is not what we want.

There are two ways to derive (270). First, we know that the spatial term for $y_{it}$ is $y^*_{it} = b_i'WY_{.t}$; therefore, $Y^*_{i.} = (y^*_{i1}, y^*_{i2}, \ldots, y^*_{iT})' = (b_i'WY_{.1}, b_i'WY_{.2}, \ldots, b_i'WY_{.T})' = (I_T \otimes b_i'W)Y$.

Second, it is important to observe that this coefficient matrix can be further written as $\rho = \sum_{i=1}^{N}(b_i'\boldsymbol{\rho})b_ib_i' = \sum_{i=1}^{N}\rho_ib_ib_i'$, where $\boldsymbol{\rho} = (\rho_1, \rho_2, \ldots, \rho_N)'$ is an $N \times 1$ vector. This is important because we are now able to work with a coefficient vector. Suppose now that we focus


7.6.1 Individual Coefficient Estimation

We use 2SLS to estimate (270), with instruments $(I_T \otimes b_i')Q$,[43] where $I_T \otimes b_i'$ is $T \times TN$ and the $TN \times q$ matrix $Q$ is the same as before and is specified in Assumption ??. Let $\delta_i = (\rho_i, \beta_i')'$ denote the individual coefficient vector. To estimate $\delta_i$ consistently, we first project (270) using the $T \times T$ matrix $M_x$ defined in (267), to cope with the endogeneity caused by the correlation between $X_{i.}$ and the unobserved factors, $f_{1t}$. To deal with the endogeneity caused by the spatial term, we use $M_x(I_T \otimes b_i')Q$ as instruments. The 2SLS estimator of $\delta_i = (\rho_i, \beta_i')'$, denoted $\hat\delta_i$, is defined as

$$ \hat\delta_i = (Z_i'P_QZ_i)^{-1}Z_i'P_QY_{i.}, $$

where $P_Q = M_x(I_T \otimes b_i')Q\,[Q'(I_T \otimes b_i)M_x(I_T \otimes b_i')Q]^{-1}\,Q'(I_T \otimes b_i)M_x$ and $Z_i = ((I_T \otimes b_i'W)Y, X_{i.})$.

Theorem 10 Consider the spatial panel data model given by (256) and suppose that Assumptions ??-?? and ?? hold. The individual 2SLS estimator, $\hat\delta_i$, defined by (??), is consistent for $\delta_i$ as $(N, T) \xrightarrow{j} \infty$. Moreover, as $(N, T) \xrightarrow{j} \infty$ and $T/N^2 \to 0$, we have:

$$ \sqrt{T}(\hat\delta_i - \delta_i) \xrightarrow{d} N(0, \Sigma_{\delta_i}), \tag{272} $$

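[42, continued] on the following SAR model:

$$ \begin{aligned} \underset{NT\times 1}{Y} &= (I_T \otimes \rho W)Y + e = \Big(I_T \otimes \sum_{i=1}^{N}(b_i'\boldsymbol{\rho})b_ib_i'W\Big)Y + e = \sum_{i=1}^{N}\big(I_T \otimes (b_i'\boldsymbol{\rho})b_ib_i'W\big)Y + e \\ &= \sum_{i=1}^{N}(I_T \otimes \rho_ib_ib_i'W)Y + e = \sum_{i=1}^{N}\mathrm{vec}(\rho_ib_ib_i'W(Y_{.1}, Y_{.2}, \ldots, Y_{.T})) + e, \end{aligned} $$

where $\boldsymbol{\rho} = (\rho_1, \ldots, \rho_N)'$. We can now conjecture that the expression for the $T \times 1$ vector $Y_{i.}$ can be obtained by some operation on the $NT \times 1$ vector $\mathrm{vec}(\rho_ib_ib_i'W(Y_{.1}, Y_{.2}, \ldots, Y_{.T}))$. Notice that $\rho_ib_ib_i'W(Y_{.1}, Y_{.2}, \ldots, Y_{.T})$ is an $N \times T$ matrix with zero elements except in the $i$-th row, and that this $i$-th row is $Y_{i.}'$ net of the error term. Combining all these observations, we have

$$ \begin{aligned} \underset{T\times 1}{Y_{i.} - e_{i.}} &= (\rho_ib_i'b_ib_i'W(Y_{.1}, \ldots, Y_{.T}))' = \rho_i(Y_{.1}, \ldots, Y_{.T})'W'b_i = \rho_i\,\mathrm{vec}((Y_{.1}, \ldots, Y_{.T})'W'b_i) = \rho_i\,\mathrm{vec}(b_i'W(Y_{.1}, \ldots, Y_{.T})) \\ &= \rho_i\,\mathrm{vec}(b_i'W(Y_{.1}, \ldots, Y_{.T})I_T) = \rho_i(I_T \otimes b_i'W)\,\mathrm{vec}((Y_{.1}, \ldots, Y_{.T})) = \rho_i(I_T \otimes b_i'W)Y. \end{aligned} $$

The key here is that we should work with $Y = (Y_{.1}', Y_{.2}', \ldots, Y_{.T}')'$.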

[43] To see why, note that $Q$ can serve well as the instruments for the $NT \times 1$ vector $Y^* = (Y^{*\prime}_{.1}, Y^{*\prime}_{.2}, \ldots, Y^{*\prime}_{.T})'$, where $Y^*_{.t} = (y^*_{1t}, y^*_{2t}, \ldots, y^*_{Nt})'$ is $N \times 1$, whereas we now only need instruments for $Y^*_{i.}$. Since $Y^*_{i.} = (I_T \otimes b_i')Y^*$, the instruments for $Y^*_{i.}$ are $(I_T \otimes b_i')Q$.


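where

$$ \Sigma_{\delta_i} = \Psi_{ZPZ}^{-1}\,\Omega_{ZP\varepsilon_i}\,\Psi_{ZPZ}^{-1}, \tag{273} $$

$$ \Psi_{ZPZ} = \operatorname*{plim}_{T\to\infty}\frac{Z_{i0}'P_{Q,f}Z_{i0}}{T}, \qquad \Omega_{ZP\varepsilon_i} = \Psi_{QMZ}'\Psi_{QMQ}^{-1}\,\Omega_{QM\varepsilon_i}\,\Psi_{QMQ}^{-1}\Psi_{QMZ}, \tag{274} $$

$$ \Psi_{QMQ} = \operatorname*{plim}_{T\to\infty}\frac{Q'(I_T \otimes b_i)M_{fx}(I_T \otimes b_i')Q}{T}, \qquad \Psi_{QMZ} = \operatorname*{plim}_{T\to\infty}\frac{Q'(I_T \otimes b_i)M_{fx}Z_{i0}}{T}, \tag{275} $$

$$ \Omega_{QM\varepsilon_i} = \operatorname*{plim}_{T\to\infty}\frac{Q'(I_T \otimes b_i)M_{fx}(\Omega_{e,i} + \Omega_{f2})M_{fx}(I_T \otimes b_i')Q}{T}, \tag{276} $$

$$ \Omega_{e,i} = E(e_{i.}e_{i.}'), \qquad \Omega_{f2} = E(F_2\Gamma_{2i}\Gamma_{2i}'F_2'). \tag{277} $$

Estimation of this variance-covariance matrix will be provided later.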
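The individual 2SLS estimator of this subsection can be sketched in a few lines. This is only an illustrative implementation of the formula $\hat\delta_i = (Z_i'P_QZ_i)^{-1}Z_i'P_QY_{i.}$: the function name `individual_2sls` and its argument layout are assumptions, the instrument block $Q_i = (I_T \otimes b_i')Q$ is taken as given, and no variance estimator is included.

```python
import numpy as np

def individual_2sls(y_i, z_i, q_i, m_x):
    """2SLS for unit i after de-factoring with M_x.

    y_i : (T,)       dependent variable Y_i.
    z_i : (T, k+1)   regressors [Y*_i., X_i.]
    q_i : (T, q)     instruments (I_T ⊗ b_i')Q for unit i, with q >= k+1
    m_x : (T, T)     symmetric idempotent projection matrix from (267)
    """
    mq = m_x @ q_i                                   # M_x (I_T ⊗ b_i') Q
    # P_Q = MQ_i (Q_i' M Q_i)^{-1} Q_i' M, using M'M = M
    p_q = mq @ np.linalg.solve(mq.T @ mq, mq.T)
    return np.linalg.solve(z_i.T @ p_q @ z_i, z_i.T @ p_q @ y_i)
```

A simple sanity check is that, with noise-free data generated as `y_i = z_i @ delta`, the function returns `delta` exactly (up to rounding) for any valid instrument matrix.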

7.6.2 Mean Group Estimator

Suppose the heterogeneous coefficients $\delta_i$ satisfy the following assumption.

Random Slope Coefficients: The slope coefficients $\delta_i$ follow the random coefficient model

$$ \delta_i = \delta + \xi_i, \qquad \xi_i \sim IID(0, \Omega_\xi) \quad \text{for } i = 1, \ldots, N, \tag{278} $$

where $\|\delta\| < K$, $\|\Omega_\xi\| < K$, $\Omega_\xi$ is a $(k+1) \times (k+1)$ symmetric non-negative definite matrix, and the random deviations $\xi_i$ are distributed independently of $\Gamma_{ji}$, $A_{ji}$, $e_{it}$ and $v_{it}$ for all $i$, $j$ and $t$, where $j = 1, 2$ and $i = 1, 2, \ldots, N$.

Define the Mean Group estimator as

$$ \hat\delta_{MG} = \frac{1}{N}\sum_{i=1}^{N}\hat\delta_i. \tag{279} $$

Theorem 11 Consider the panel data model (256) and (257), and suppose that Assumptions ??-?? and ??-?? hold. Then the common correlated effects mean group estimator $\hat\delta_{MG}$ defined by (279) is asymptotically unbiased for $\delta$ as long as $N \to \infty$ ($T$ can be fixed), and, as $(N, T) \xrightarrow{j} \infty$,

$$ \sqrt{N}(\hat\delta_{MG} - \delta) \xrightarrow{d} N(0, \Sigma_{\delta_{MG}}), \qquad \Sigma_{\delta_{MG}} = \Omega_\xi, \tag{280} $$

where $\Omega_\xi$ is given in (278) and can be consistently estimated by

$$ \hat\Sigma_{\delta_{MG}} = \hat\Omega_\xi = \frac{1}{N-1}\sum_{i=1}^{N}(\hat\delta_i - \hat\delta_{MG})(\hat\delta_i - \hat\delta_{MG})'. \tag{281} $$
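Equations (279) and (281) translate directly into code. The sketch below assumes the individual estimates $\hat\delta_i$ are already available as the rows of an array; the function name `mean_group` is hypothetical, and the standard errors use the large-$N$ approximation $\mathrm{Var}(\hat\delta_{MG}) \approx \hat\Omega_\xi / N$ implied by (280).

```python
import numpy as np

def mean_group(delta_hat):
    """delta_hat: (N, p) array whose i-th row is the individual estimate for unit i."""
    n = delta_hat.shape[0]
    delta_mg = delta_hat.mean(axis=0)                 # (279)
    dev = delta_hat - delta_mg
    omega_xi = dev.T @ dev / (n - 1)                  # (281)
    std_err = np.sqrt(np.diag(omega_xi) / n)          # from sqrt(N)(MG - delta) -> N(0, Omega_xi)
    return delta_mg, omega_xi, std_err

delta_mg, omega_xi, std_err = mean_group(
    np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]])
)
```

The nonparametric form of (281) is convenient because it requires no estimate of the unit-level covariance matrices $\Sigma_{\delta_i}$.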


7.6.3 Pooled Estimator

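Construct the following pooled data:

$$ \tilde Y = \begin{pmatrix} M_xY_{1.} \\ M_xY_{2.} \\ \vdots \\ M_xY_{N.} \end{pmatrix} \quad \text{and} \quad \tilde Z = \begin{pmatrix} M_xZ_{1.} \\ M_xZ_{2.} \\ \vdots \\ M_xZ_{N.} \end{pmatrix}, \tag{282} $$

where $Y_{i.}$ and $Z_{i.}$ are defined in (270) and (??). Due to the presence of the spatial term, directly regressing $\tilde Y$ on $\tilde Z$ does not yield a consistent estimator. To handle this, we use the following instruments:

$$ \tilde Q = \begin{pmatrix} M_x(I_T \otimes b_1')Q \\ M_x(I_T \otimes b_2')Q \\ \vdots \\ M_x(I_T \otimes b_N')Q \end{pmatrix}. \tag{283} $$

The pooled estimator for $\delta$ is then defined as

$$ \begin{aligned} \hat\delta_p &= [\tilde Z'\tilde Q(\tilde Q'\tilde Q)^{-1}\tilde Q'\tilde Z]^{-1}\tilde Z'\tilde Q(\tilde Q'\tilde Q)^{-1}\tilde Q'\tilde Y \\ &= \left\{\Big[\sum_{i=1}^{N}Z_{i.}'M_x(I_T \otimes b_i')Q\Big]\Big[\sum_{i=1}^{N}Q'(I_T \otimes b_i)M_x(I_T \otimes b_i')Q\Big]^{-1}\Big[\sum_{i=1}^{N}Q'(I_T \otimes b_i)M_xZ_{i.}\Big]\right\}^{-1} \\ &\quad \times \Big[\sum_{i=1}^{N}Z_{i.}'M_x(I_T \otimes b_i')Q\Big]\Big[\sum_{i=1}^{N}Q'(I_T \otimes b_i)M_x(I_T \otimes b_i')Q\Big]^{-1}\Big[\sum_{i=1}^{N}Q'(I_T \otimes b_i)M_xY_{i.}\Big]. \end{aligned} \tag{284} $$

Theorem 12 Consider the panel data model (256) and (257), and suppose that Assumptions ??-?? and ??-?? hold. Then the common correlated effects pooled estimator $\hat\delta_p$ defined by (284) is asymptotically unbiased for $\delta$ and, as $(N, T) \xrightarrow{j} \infty$,

$$ \sqrt{N}(\hat\delta_p - \delta) \xrightarrow{d} N(0, \Sigma_{\delta_p}), \tag{285} $$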

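where

$$ \Sigma_{\delta_p} = \big[\Psi_{QMZ}'\Psi_{QMQ}^{-1}\Psi_{QMZ}\big]^{-1}\big[\Psi_{QMZ}'\Psi_{QMQ}^{-1}\,\Omega_{QM\xi}\,\Psi_{QMQ}^{-1}\Psi_{QMZ}\big]\big[\Psi_{QMZ}'\Psi_{QMQ}^{-1}\Psi_{QMZ}\big]^{-1}, \tag{286} $$

$$ \Psi_{QMQ} = \operatorname*{plim}_{N\to\infty}\sum_{i=1}^{N}\operatorname*{plim}_{T\to\infty}\frac{Q'(I_T \otimes b_i)M_{fx}(I_T \otimes b_i')Q}{T}, \qquad \Psi_{QMZ} = \operatorname*{plim}_{N\to\infty}\sum_{i=1}^{N}\operatorname*{plim}_{T\to\infty}\frac{Q'(I_T \otimes b_i)M_{fx}Z_{i0}}{T}, \tag{287} $$

$$ \Omega_{QM\xi} = \operatorname*{plim}_{N\to\infty}\sum_{i=1}^{N}\operatorname*{plim}_{T\to\infty}\frac{Q'(I_T \otimes b_i)M_{fx}Z_{i0}}{T}\,\Omega_\xi\,\frac{Z_{i0}'M_{fx}(I_T \otimes b_i')Q}{T}. \tag{288} $$

Estimation of this variance-covariance matrix will be provided later.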


7.7 Other Approaches

Introducing factors into a model is a way to specify possible CSD. Chudik, Pesaran and Tosetti (2011) introduce the concepts of time-specific weak and strong CSD in panel data models. A factor model allows time effects to interact with spatial units with different intensity.

Pesaran and Tosetti (2011) investigate the following model with both common factors and spatial correlation:

$$ y_{it} = \alpha_i'd_t + \beta_i'x_{it} + \gamma_i'f_t + e_{it}, \tag{289} $$

where $d_t$ is a $k \times 1$ vector of observed common effects, $x_{it}$ is the $k_x \times 1$ vector of observed individual-specific regressors, $f_t$ is an $m$-dimensional vector of unobservable common factors, and $\gamma_i$ is the $m \times 1$ vector of factor loadings. The common factors $f_t$ simultaneously affect all cross-section units, albeit with different intensity as measured by $\gamma_i$. The errors $e_{it}$ follow some spatial process, either an SAR or an SMA process.

To estimate the parameters of interest (the mean of $\beta_i$), we can use the mean group estimator

$$ \hat\beta_{MG} = \frac{1}{n}\sum_{i=1}^{n}\hat\beta_i, $$

where $\hat\beta_i$ is the OLS estimate from regressing $y_i$ on $X_i$ after the orthogonal projection of the data by $M_D$. Here $D$ is a $T \times (k + k_x + 1)$ matrix consisting of the observed common effects and the cross-sectional averages of the dependent and independent variables, so that

$$ \hat\beta_i = (X_i'M_DX_i)^{-1}X_i'M_Dy_i, $$

where $X_i$ is a $T \times k_x$ matrix of regressors for the $i$th unit, $y_i$ is a $T \times 1$ vector of the dependent variable for the $i$th unit, and $M_D = I_T - D(D'D)^{-1}D'$. Alternatively, the regression coefficient $\beta$ can be obtained by a pooled estimate:

$$ \hat\beta_P = \Big(\sum_{i=1}^{n}X_i'M_DX_i\Big)^{-1}\sum_{i=1}^{n}X_i'M_Dy_i. $$

Given the spatial structure in $e_{it}$, Pesaran and Tosetti (2011) derive the asymptotic properties of $\hat\beta_{MG}$ and $\hat\beta_P$ under different scenarios, such as whether unobserved common factors are present or not and whether the regression parameters are heterogeneous or not. Pesaran and Tosetti (2011) then consider the Common Correlated Effects (CCE) estimator advanced by Pesaran (2006), which continues to yield estimates of the slope coefficients that are consistent and asymptotically normal. Holly, Pesaran, and Yamagata (2010) use (289) to analyze the changes in real house prices in the US. They also specify a spatial process in $e_{it}$, which is shown to be significant after controlling for unobserved common factors.
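The mean group and pooled estimators above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the cited papers: the function name `cce_estimators` is hypothetical, the observed common effects default to an intercept only, and a pseudo-inverse replaces $(D'D)^{-1}$ to guard against collinear columns in $D$.

```python
import numpy as np

def cce_estimators(y, x, d=None):
    """CCE mean group and pooled estimates of the slope coefficients.

    y : (T, N) dependent variable, x : (T, N, kx) regressors,
    d : (T, k) observed common effects (defaults to an intercept; an assumption).
    """
    T, N, kx = x.shape
    if d is None:
        d = np.ones((T, 1))
    # D = [d_t, cross-section average of y, cross-section averages of x]
    D = np.hstack([d, y.mean(axis=1, keepdims=True), x.mean(axis=1)])
    M = np.eye(T) - D @ np.linalg.pinv(D.T @ D) @ D.T
    betas = np.empty((N, kx))
    xx, xy = np.zeros((kx, kx)), np.zeros(kx)
    for i in range(N):
        a = x[:, i, :].T @ M @ x[:, i, :]            # X_i' M_D X_i
        b = x[:, i, :].T @ M @ y[:, i]               # X_i' M_D y_i
        betas[i] = np.linalg.solve(a, b)             # unit-specific estimate
        xx += a
        xy += b
    return betas.mean(axis=0), np.linalg.solve(xx, xy)
```

In a noise-free design with a single factor and a common slope, the factor lies exactly in the span of $D$, so both estimators recover the slope exactly; with idiosyncratic errors they do so only approximately.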


Chudik, Pesaran, and Tosetti (2011) extend the model by allowing for additional weakly dependent common factors. Their model is

$$ y_{it} = \alpha_i'd_t + \beta_i'x_{it} + \gamma_i'f_t + \lambda_i'n_t + e_{it}, \tag{290} $$

where $n_t$, uncorrelated with the regressors $x_{it}$, is a vector of additional weakly dependent common factors, with the dimension of the factor loadings $\lambda_i$ going to infinity. They show that the CCE method still yields consistent estimates of the mean of the slope coefficients and that the asymptotic normal theory continues to be applicable.

Holly, Pesaran, and Yamagata (2011) provide a method for the analysis of the spatial and temporal diffusion of shocks in a dynamic system. They generalize VAR panel models with unobserved common factors to incorporate spatial elements in the dynamic coefficient matrix. With coefficients being spatial-unit specific, consistent estimation relies on $T$ tending to infinity, while the number of spatial units can be moderate or large. They use changes in real house prices within the UK economy, and analyze the effect of shocks using generalized spatiotemporal impulse responses.

The spatial panel data models assume a time-invariant spatial weights matrix. When the spatial weights matrix is constructed with economic/socioeconomic distances or demographic characteristics, it can be time varying. For example, Case, Hines, and Rosen (1993), in their study of state spending, use weights based on the difference in the percentage of the population that is black. Lee and Yu (2012b) investigate the QML estimation of SDPD models where spatial weights matrices can be time varying. They find that the QML estimate is consistent and asymptotically normal. Monte Carlo results show that, when spatial weights matrices vary substantially over time, misspecifying the model with a time-invariant spatial weights matrix may cause substantial bias in estimation.

7.7.1 A Nonlinear Panel Data Model of Cross-Sectional Dependence

Kapetanios, Mitchell and Shin (2014) propose a nonlinear panel data model which can endogenously generate both ‘weak’ and ‘strong’ CSD. The model’s distinguishing characteristic is that a given agent’s behaviour is influenced by an aggregation of the views or actions of those around them. The model allows for considerable flexibility in terms of the genesis of this herding or clustering type behaviour. At an econometric level, the model is shown to nest various extant dynamic panel data models. These include panel AR models, spatial models, which accommodate weak dependence only, and panel models where cross-sectional averages or factors exogenously generate strong CSD. An important implication is that the appropriate model for the aggregate series becomes intrinsically nonlinear, due to the clustering behaviour.

We propose dynamic nonlinear panel data models:

$$ x_{i,t} = \rho\sum_{j=1}^{N}w_{ij}(x_{-i,t-1}, x_{i,t-1}; \gamma)\,x_{j,t-1} + \varepsilon_{i,t}, \qquad i = 1, \ldots, N, \quad t = 1, \ldots, T, \tag{291} $$


where $x_{-i,t} = (x_{1,t}, x_{2,t}, \ldots, x_{i-1,t}, x_{i+1,t}, \ldots, x_{N,t})'$ and $\sum_{j=1}^{N}w_{ij}(x_{-i,t-1}, x_{i,t-1}; \gamma) = 1$. This form of the model signifies that $x_{i,t}$ depends, possibly in a nonlinear fashion depending on how $w_{ij}$ is parameterised, on weighted averages of past values of $x_t = (x_{1,t}, \ldots, x_{N,t})'$, where the weights depend on $x_{t-1}$. One particular motivation is structural and follows from the claim that it mimics structural interactions between economic units. Another, more econometric, justification notes that this model can accommodate generic forms of CSD, including evolving clusters.

The model in (291) encompasses a wide variety of nonlinear specifications. We place particular emphasis on specifications where the weights depend on $x_{t-1}$ only through distances of the form $|x_{j,t-1} - x_{i,t-1}|$. A specification of this type, based on a threshold mechanism, is easy to analyse and serves to illustrate the class of models. This model nests a variety of dynamic panel data models, such as panel data AR models and panel models where cross-sectional averages are used to pick up CSD (Pesaran, 2006). Interestingly, it is also closely related to factor or interactive effects models (Bai, 2009).

Our models provide an intuitive means by which many forms of cross-sectional dependence can arise in a large panel dataset comprised of variables of a ‘similar’ nature that relate to different agents/units. These variables might be the disaggregates underlying often studied macroeconomic or financial aggregates, such as economy-wide inflation or the S&P500 index. In particular, the model allows these different economic units to cluster, and for these clusters (including their number) to evolve over time. Such clustering also has implications when modelling and forecasting the aggregate of these units.

The degree of CSD in our models can vary, from a case where it is similar to standard factor models, for which the largest eigenvalue of the variance covariance matrix of the data tends to infinity at rate $N$, to the case of very weak or no factor structure, where this eigenvalue is bounded as $N \to \infty$. Of course, all intermediate cases can arise as well. Our model constitutes the first attempt to introduce endogenous cross-sectional dependence into a panel modelling framework.

Let $x_{i,t}$ denote the variable of interest, such as the agent’s income or the agent’s view of the future value of some macroeconomic variable, at time $t$, for agent $i$. Then, we specify:

$$ x_{i,t} = \frac{\rho}{m_{i,t}}\sum_{j=1}^{N}I(|x_{i,t-1} - x_{j,t-1}| \leq r)\,x_{j,t-1} + \varepsilon_{i,t}, \qquad t = 2, \ldots, T, \quad i = 1, \ldots, N, \tag{292} $$

where

$$ m_{i,t} = \sum_{j=1}^{N}I(|x_{i,t-1} - x_{j,t-1}| \leq r), $$

$\{\varepsilon_{i,t}\}_{t=1}^{T}$ is an error process, $I(\cdot)$ is the indicator function and $-1 < \rho < 1$. Verbally, the above model states that $x_{i,t}$ is influenced by the cross-sectional average of a selection of $x_{j,t-1}$, and in particular that the relevant $x_{j,t-1}$ are those that lie closest to $x_{i,t-1}$. The model involves a K nearest neighbor mechanism, except that it is part of the data generating process rather than a technique to estimate an unknown function. This formalises the intuitive idea that people are affected more by those with whom they share common views or behaviour. The model may be equally viewed as a descriptive model of agents’ behaviour, reflecting the fact that ‘similar’ agents are affected by ‘similar’ effects, or as a structural model of agents’ views, whereby agents use the past views of other agents similar to them to form their own views. The interaction term in (292) may then be thought of as capturing the (cross-sectional) local average or common component of their views. This idea of commonality has various clear, motivating and concrete examples in a variety of social science disciplines, such as psychology and politics. In economics and finance, the herding could be rational (imitative herding) or irrational.

A deterministic form of the above model has been analysed previously in the mathematical and system engineering literature, which has studied a continuous form of the restricted version of (292) given by

$$ x_{i,t} = \frac{1}{m_{i,t}}\sum_{j=1}^{N}I(|x_{i,t-1} - x_{j,t-1}| \leq 1)\,x_{j,t-1}, \qquad t = 2, \ldots, T, \quad i = 1, \ldots, N, \tag{293} $$

where $m_{i,t} = \sum_{j=1}^{N}I(|x_{i,t-1} - x_{j,t-1}| \leq 1)$.
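The threshold mechanism in (292) is straightforward to simulate. The sketch below is illustrative only: the function names `local_average` and `simulate`, the Gaussian errors and the seed handling are assumptions made for the example, not part of the original model description.

```python
import numpy as np

def local_average(x_prev, r):
    """Interaction term of (292): average of x_{j,t-1} over all j with |x_i - x_j| <= r."""
    close = (np.abs(x_prev[:, None] - x_prev[None, :]) <= r).astype(float)  # N x N indicators
    m = close.sum(axis=1)              # m_{i,t}; at least 1 because unit i is its own neighbor
    return close @ x_prev / m

def simulate(x0, rho, r, T, sigma=1.0, seed=0):
    """Generate a (T, N) panel from (292) with N(0, sigma^2) errors (an assumption)."""
    rng = np.random.default_rng(seed)
    path = np.empty((T, x0.size))
    path[0] = x0
    for t in range(1, T):
        path[t] = rho * local_average(path[t - 1], r) + sigma * rng.normal(size=x0.size)
    return path

x_prev = np.array([0.0, 1.0, 5.0])
own_lag = local_average(x_prev, 0.0)      # r = 0: each unit only sees itself
clusters = local_average(x_prev, 1.0)     # r = 1: units 1 and 2 form a cluster
global_avg = local_average(x_prev, 10.0)  # large r: plain cross-section average
```

The three calls at the end show how the threshold `r` moves the interaction term between each unit's own lag and the full cross-section average.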

By setting $r = 0$, we obtain a simple panel autoregressive model:

$$ x_{i,t} = \rho x_{i,t-1} + \varepsilon_{i,t}. \tag{294} $$

On the other hand, letting $r \to \infty$, we obtain the following model:

$$ x_{i,t} = \frac{\rho}{N}\sum_{j=1}^{N}x_{j,t-1} + \varepsilon_{i,t}, \tag{295} $$

where past cross-sectional averages of opinions inform, in similar fashions, current opinions. Recently, the use of such cross-sectional averages has been advocated by Pesaran (2006) as a means of modelling CSD in the form of unobserved factors. However, in our case, the use of cross-sectional averages is a limiting case of a ‘structural’ nonlinear model.

Factor models have the property that both the maximum eigenvalue and the row/column sum norm of the covariance matrix of $x_t = (x_{1,t}, \ldots, x_{N,t})'$ tend to infinity at rate $N$, as $N \to \infty$. In contrast, for other models such as spatial AR or MA models, these quantities are bounded, implying that they exhibit much lower degrees of CSD than factor models. We show that the column sum norm of the covariance matrix of $x_t$ when $x_t$ follows (292) is $O(N)$. Thus, the model is more similar to factor models than to spatial models. Interestingly, there are versions of (292) that resemble spatial models. Another finding is that (295) implies a covariance matrix for $x_t$ with a column sum norm that is $O(1)$. This is surprising, given the similarity that cross-sectional average schemes have with factor models as detailed in Pesaran (2006).


Factor models are intrinsically reduced form; they focus on modelling CSD using an exogenously given number of unobserved factors. Since our model nests (295), it can approximate a factor model when $r \to \infty$. On the other hand, our model has a clear parametric structure, which is a feature shared by dynamic spatial models. But our models are more general than spatial models, in the sense that the weighting schemes are estimated endogenously, rather than assumed ex ante. It is worth noting that the factor model cannot accommodate weak CSD, in contrast to the extensions of our nonlinear model. The nonlinear model can be seen to lie between the two extremes characterised by weakly cross-sectionally dependent spatial models and strongly cross-sectionally dependent factor models.

Further suggestions on the empirical applications

Inflation expectations: Here, we have considered the basic model with fixed effects:

$$ \pi_{i,t} = \nu_i + \frac{\rho}{m_{i,t}}\sum_{j=1}^{N}I(|\pi_{i,t-1} - \pi_{j,t-1}| \leq r)\,\pi_{j,t-1} + \varepsilon_{i,t}, \tag{296} $$

where $\pi$ is the one-quarter ahead CPI inflation rate forecast and $\nu_i \sim iid(0, \sigma_\nu^2)$, and obtained the within estimator of $\rho$ along with the consistent estimator of $r$. We can provide some economic interpretations for $\rho$ and $r$ in terms of persistence or inertia and the relative distance of similarity. Two notable special cases are the extreme models, denoted PAR and CSA, respectively:

$$ \pi_{i,t} = \nu_i + \rho\pi_{i,t-1} + \varepsilon_{i,t}, \tag{297} $$

$$ \pi_{i,t} = \nu_i + \rho\bar\pi_{t-1} + \varepsilon_{i,t}, \tag{298} $$

where $\bar\pi_{t-1}$ is the cross-section average of $\pi_{j,t-1}$. It is interesting to see how the estimates of $\rho$ differ across the three models. Assuming that the overall performance of the model (296) is superior, we then move on to estimate extensions of the form:

$$ \pi_{i,t} = \nu_i + \rho_1\tilde\pi_{i,t-1} + \rho_2\tilde\pi^c_{i,t-1} + \varepsilon_{i,t}, \tag{299} $$

where $\tilde\pi_{i,t-1}$ and $\tilde\pi^c_{i,t-1}$ are the respective cross-section averages related to similar and dissimilar forecasters, given by

$$ \tilde\pi_{i,t-1} = \frac{1}{m_{i,t}}\sum_{j=1}^{N}I(|\pi_{i,t-1} - \pi_{j,t-1}| \leq r)\,\pi_{j,t-1}, $$

$$ \tilde\pi^c_{i,t-1} = \frac{1}{N - m_{i,t}}\sum_{j=1}^{N}I(|\pi_{i,t-1} - \pi_{j,t-1}| > r)\,\pi_{j,t-1}. $$

We can also test the hypothesis $\rho_2 = 0$ (no informational content arising from dissimilar forecasters). If $\rho_2 \neq 0$, what is the prior implication of the sign of $\rho_2$? In other words, in forming their own forecast, how does each forecaster use any (past) information from others?


Next, we need to find a way of decomposing the (partial) aggregate parameter, $\rho$ in (296) or $\rho_1$ in (299), into the own effect and the neighbor effect. One obvious candidate is to consider:

$$ \pi_{i,t} = \nu_i + \rho_0\pi_{i,t-1} + \rho_1\tilde\pi^*_{i,t-1} + \varepsilon_{i,t}, \tag{300} $$

$$ \pi_{i,t} = \nu_i + \rho_{10}\pi_{i,t-1} + \rho_{11}\tilde\pi^*_{i,t-1} + \rho_2\tilde\pi^c_{i,t-1} + \varepsilon_{i,t}, \tag{301} $$

where the neighbor average now excludes unit $i$ itself:

$$ \tilde\pi^*_{i,t-1} = \frac{1}{m_{i,t}}\sum_{j=1, j\neq i}^{N}I(|\pi_{i,t-1} - \pi_{j,t-1}| \leq r)\,\pi_{j,t-1}. $$

Anselin et al. (2008) distinguish spatial dynamic models into four categories based on the following general time-space-dynamic specification:

$$ x_{i,t} = \rho_0x_{i,t-1} + \rho_1\sum_{j\neq i}w_{ij}x_{j,t-1} + \beta\sum_{j\neq i}w_{ij}x_{j,t} + v_i + \varepsilon_{i,t}. \tag{302} $$

Here, $\sum_{j\neq i}w_{ij}x_{j,t}$ and $\sum_{j\neq i}w_{ij}x_{j,t-1}$ are a first order spatial lag and its time-lagged value, respectively. The parameter $\rho_0$ captures serial dependence of $x_{i,t}$, $\beta$ represents the intensity of a contemporaneous spatial effect, and $\rho_1$ captures space-time autoregressive dependence (diffusion). Most studies focus on the stable case with $\rho_0 + \rho_1 + \beta < 1$. This specification, (302), includes various special cases:

• if $\rho_0 = \beta = 0$, we obtain a ‘pure-space recursive’ model, in which dependence results only from the neighboring locations in the previous time period;

• if $\beta = 0$, the model is reduced to a ‘time-space recursive’ model, in which dependence relates to both the location itself ($x_{i,t-1}$) and its neighbors in the previous time period, $\sum_{j\neq i}w_{ij}x_{j,t-1}$;

• if $\rho_1 = 0$, we obtain a ‘time-space simultaneous’ model, which includes the time lag ($x_{i,t-1}$) and the spatial lag, $\sum_{j\neq i}w_{ij}x_{j,t}$;

• if $\rho_0 = \rho_1 = 0$, we are dealing with a spatial autoregressive model on panel data, while if $\rho_1 = \beta = 0$ we obtain a ‘simple’ dynamic model.

According to Anselin (2001) and Abreu et al. (2005), the addition of a spatially lagged dependent variable causes simultaneity and endogeneity problems, and thus a candidate consistent estimator should lie between the OLS and within estimates.

Our model, (300), is similar to the time-space recursive model considered in Korniotis (2010), who applies it to investigate the issue of internal versus external habit formation using annual consumption data for the U.S. states, and finds that state consumption growth is not significantly affected by its own (lagged) consumption growth but is affected by lagged consumption growth of nearby states. Notice that the weight $w_{ij}$ measures the importance of $x_{j,t-1}$ for $x_{i,t}$.


The weights are observed quantities, which are known to the econometrician, and therefore exogenous. Because the spatial-time lag, $\sum_{j=1}^{N}w_{ij}x_{j,t-1}$, is a weighted average of past consumption choices of other cross-sectional units, it is the measure of the catching-up habit.

There is a trade-off between our model, (300), and the time-space recursive model employed by Korniotis (2010). In (300), the neighbors are selected endogenously, but equal weights are imposed on the selected neighbors. By contrast, in Korniotis’s model, the neighbors are selected exogenously, but the weights are selected in a flexible manner, albeit not time-varying. The application of the model (300) to a similar issue of consumption habit formation will provide an interesting insight.[44]

Unless ρ2 = 0, the model (301) should be more general, and it is interestingto find out the potential application, say GVC, upstream and downstream orasymmetry?YC: updateIt is also interesting to investigate how we relate our approach to the Sias’

(2004) approach to an analysis of herding. The idea is to estimate ρ from (296),and find a way to decompose:

ρ = ρown + ρneighbor

following the the Sias’approach. But, the analogy is not quite one-to-one. Thepotential advantage of this approach is the possible robustness of this measurewhich can also be used for a finite T , as well.We generalise (296) and allow different weights to the selected neighbors as

follows:

xi,t = νi +ρ

mi,t

N∑j=1

I (|xi,t−1 − xj,t−1| ≤ r)wijxj,t−1 + εi,t (303)

where we may consider the following weights

wij =d−2ij∑N

j=1 d−2ij

, dij = |xi,t−1 − xj,t−1| (304)

(then, how we define wii, just normalised to 1?). The estimation can be donein two steps: first, the consistent estimate of r is obtained from (296). Then,construct the weights by (304) and the associated cross-section averages, andestimate the model, (303). Or possibly more complicated due to the grid searchover r. If successful, then our approach is more general than the spatial models.Next, we may consider the following extension of (296):

$$x_{i,t} = \nu_i + \frac{\rho}{m_{i,t}} \sum_{j=1}^{N} I\left(\left|x^{\max}_{t-1} - x_{j,t-1}\right| \le r\right) x_{j,t-1} + \varepsilon_{i,t} \tag{305}$$

Footnote 44: We can allow the weights to be inversely proportional to the distance once the threshold parameter is consistently estimated.

where $x^{\max}_{t-1} = \max_j x_{j,t-1}$, so that the distance is measured with respect to the best performer rather than unit $i$. Alternative functional forms can also be considered.

Remark: From an empirical point of view, the following consideration may be useful. Suppose that the distance between $x_{i,t-1}$ and $x_{j,t-1}$, or more generally between $q_{it}$ and $q_{jt}$, can be regarded as a similarity measure, and that the parameter $\rho$ may measure the impact of a certain policy. We can then estimate $\rho$ under different values of $r$ and plot the estimates against $r$ to investigate whether the relationship between $\rho$ and $r$ is monotonic or nonlinear. This approach may be related to the recent GMM analytic approach in which the sample moment condition is not equal to zero in the over-identified case.
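The remark above can be illustrated with a minimal sketch that traces the estimate of $\rho$ over a grid of thresholds $r$. It assumes the equal-weight specification (the sum over $j$ includes unit $i$ itself) and removes the unit effects $\nu_i$ by simple within demeaning; the function names `neighbour_average` and `rho_profile` are hypothetical and are not taken from the original notes.

```python
import numpy as np

def neighbour_average(x_lag, r):
    """Equal-weight average of x_{j,t-1} over units j with |x_{i,t-1} - x_{j,t-1}| <= r.

    x_lag has shape (N,). Unit i is always its own neighbour (distance 0),
    so the count m_{i,t} is at least 1 (an assumption of this sketch).
    """
    close = np.abs(x_lag[:, None] - x_lag[None, :]) <= r   # N x N indicator matrix
    m = close.sum(axis=1)                                   # m_{i,t}
    return (close @ x_lag) / m

def rho_profile(x, r_grid):
    """Least-squares estimate of rho for each r after removing unit effects.

    x has shape (N, T). Returns one estimate of rho per value in r_grid.
    """
    y = x[:, 1:]                                            # x_{i,t}, t = 2..T
    y_dm = y - y.mean(axis=1, keepdims=True)                # within (unit) demeaning
    rhos = []
    for r in r_grid:
        z = np.column_stack(
            [neighbour_average(x[:, t], r) for t in range(x.shape[1] - 1)]
        )
        z_dm = z - z.mean(axis=1, keepdims=True)
        rhos.append((z_dm * y_dm).sum() / (z_dm ** 2).sum())
    return np.array(rhos)
```

Plotting the returned array against `r_grid` gives the two-dimensional picture described in the remark. With a short $T$ the within-demeaned least-squares estimate is subject to the usual dynamic panel bias, which this sketch does not correct.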

7.7.2 MSS (2015) Approach to Modelling Technical Efficiency in Cross Sectionally Dependent Stochastic Frontier Panels

Mastromarco, Serlenga and Shin (2015) propose a unified framework for accommodating both time- and cross-section dependence in modelling technical efficiency in stochastic frontier models. The proposed approach enables us to deal with both weak and strong forms of CSD by introducing exogenously driven common factors and an endogenous threshold selection mechanism. Using a dataset of 26 OECD countries over the period 1970-2010, we obtain satisfactory estimation results for the production technology parameters and the associated efficiency ranking of individual countries. We find a positive spillover effect on efficiency, supporting the hypothesis that knowledge spillover is more likely to be induced by technological proximity. Furthermore, our approach enables us to identify efficiency clubs endogenously.

We begin with the standard Cobb-Douglas production function:

$$y_{it} = \beta' x_{it} + \varepsilon_{it}, \quad i = 1, \dots, N, \; t = 1, \dots, T, \tag{306}$$

where $y_{it}$ is the logarithm of output of country $i$ at time $t$, $x_{it}$ is a $k \times 1$ vector of (logged) production inputs, $\beta$ is a $k \times 1$ vector of structural parameters, and $\varepsilon_{it}$ is the composite stochastic error comprising the idiosyncratic disturbance ($v_{it}$) and the time-varying (logged) technical inefficiency ($u_{it}$):

$$\varepsilon_{it} = v_{it} - u_{it}. \tag{307}$$

Mastromarco et al. (2013) propose the panel stochastic frontier model with unobserved factors for modelling the time-varying technical inefficiency, $u_{it}$:

$$u_{it} = \alpha_i + \lambda_i' f_t, \quad i = 1, \dots, N, \; t = 1, \dots, T, \tag{308}$$

where $\alpha_i$ is an (unobserved) individual effect and $f_t$ is an $r \times 1$ vector of unobserved factors that are expected to proxy the nonlinear and complex trending patterns associated with globalisation and the business cycle. This factor approach clearly accommodates strong CSD.

Recent literature emphasises that an individual country's total factor productivity (TFP) is likely to be significantly affected by the economic performance of neighboring or frontier countries. To allow for such spatial dependence, Ertur and Koch (2007) develop a growth model in which technological interdependency is specified through spatial externalities. In particular, the productivity shocks in SFA are assumed to be spatially correlated:

$$\varepsilon_t = \rho W \varepsilon_t + e_t, \quad t = 1, \dots, T, \tag{309}$$

where $\varepsilon_t = (\varepsilon_{1t}, \dots, \varepsilon_{Nt})'$, $W = \{w_{ij}\}_{i,j=1}^{N}$ is the $N \times N$ spatial weight matrix with diagonal elements equal to zero, $\rho$ is a spatial autoregressive parameter, and $e_t = (e_{1t}, \dots, e_{Nt})'$ is the vector of zero-mean idiosyncratic disturbances. The elements of $W$ are selected exogenously on the basis of geographic or economic proximity measures such as contiguity, physical/economic/climatic distances or dissimilarities.

Spatial models can only control for weak CSD, whilst factor-based models can allow for strong CSD. In this regard, the spatial-based approach is likely to produce biased estimates in the presence of strong CSD. While spatial autoregressive models are generally estimated by MLE, Pesaran (2006) and Bai (2009) develop two alternative consistent estimation methodologies in the presence of strong CSD. These studies clearly suggest that factors can play an important role in cross-sectionally correlated panels. Nevertheless, factor-based models impose the assumption that strong CSD is mainly driven by exogenously given unobserved factors. Recently, KMS propose an alternative approach that allows the CSD to be determined endogenously.

Suppose that the output of country $i$ at time $t$, $Y_{it}$, is determined by the levels of labor input and private capital, $L_{it}$ and $K_{it}$. It is also affected by the Hicks-neutral multi-factor productivity, $TFP$:

$$Y_{it} = TFP_{it}\, F(L_{it}, K_{it}), \tag{310}$$

where $TFP_{it}$ depends on technological progress. The $TFP_{it}$ component can be decomposed into the level of technology $A_{it}$, a measurement error $w_{it}$, and the efficiency measure $\tau_{it}$ with $0 < \tau_{it} \le 1$:

$$TFP_{it} = A_{it}\, \tau_{it}\, w_{it}. \tag{311}$$

Writing (310) in log form:

$$y_{it} = \alpha + \beta_1 k_{it} + \beta_2 l_{it} - u_{it} + v_{it}, \tag{312}$$

with the two-way error components structure given by

$$\varepsilon_{it} = v_{it} - u_{it}, \tag{313}$$

where $v_{it} = \ln w_{it}$ and $u_{it} = -\ln(\tau_{it})$ is the term measuring the (time-varying) technical inefficiency.

We propose that innovators consider the behaviour of other agents as:

$$u_{it} = \alpha_i + \rho\, u_{it}(r) + \lambda_i' f_t. \tag{314}$$

where

$$u_{it}(r) = \frac{1}{m_{it}} \sum_{j=1}^{N} I\left(\left|u^*_{t-1} - u_{jt-1}\right| \le r\right) u_{jt-1}, \tag{315}$$

$r$ is the threshold parameter that is determined endogenously, $u^*_{t-1}$ is the efficiency of the best performing country, and $m_{it} = \sum_{j=1}^{N} I\left(\left|u^*_{t-1} - u_{jt-1}\right| \le r\right)$.

The term $u_{it}(r)$ is an interaction term that may be thought of as capturing the cross-sectional local average of best practices or common technology. The specification in (314) explicitly allows the dynamics of technical inefficiency to interact spatially. A priori, we expect that such externalities can be captured by a negative $\rho$. We can also identify heterogeneous technology clubs that may vary over time and across cross-section units: the frontier cluster formed by the technology-leading countries and the other group substantially below the frontier. We follow Schmidt and Sickles (1984) and Kumbhakar (1990) and attempt to measure individual inefficiency:

$$e_{it} = \max_i (u_{it}) - u_{it} = \max_i \left(\alpha_i + \rho\, u_{it}(r) + \lambda_i' f_t\right) - \left(\alpha_i + \rho\, u_{it}(r) + \lambda_i' f_t\right) \tag{316}$$

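We discuss how to estimate the proposed model (312)-(314) by rewriting it as follows:

$$y_{it} = \beta' x_{it} + \varepsilon_{it}, \quad i = 1, \dots, N, \; t = 1, \dots, T, \tag{317}$$

$$\varepsilon_{it} = v_{it} - u_{it}, \tag{318}$$

$$u_{it} = \alpha_i + \rho\, u_{it}(r) + \lambda_i' f_t, \tag{319}$$

$$u_{it}(r) = \frac{1}{m_{it}} \sum_{j=1}^{N} I\left(\left|u^*_{t-1} - u_{jt-1}\right| \le r\right) u_{jt-1}, \tag{320}$$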

where $\alpha_i$ is an (unobserved) individual-specific effect, $f_t$ is an $r \times 1$ vector of unobserved factors, $\lambda_i$ is an $r \times 1$ vector of heterogeneous loadings, $u_{it}(r)$ represents a cluster effect equal to the average efficiency of the countries that are close to the frontier, where $u^*_{t-1} = \min_j (u_{jt-1})$, and $v_{it}$ is an idiosyncratic disturbance. The distinguishing feature of our model is the use of unit-specific aggregates, which summarise past values of efficiency and connect the units that are close to the technology frontier (the best units).

To obtain a consistent estimate of inefficiency in (316), we first estimate $\beta$ in (317) by PCCE or IPC, and derive $\hat\varepsilon_{it} = y_{it} - \hat\beta' x_{it}$ with $\hat v_{it} = v_{it} - (\hat\beta - \beta)' x_{it} = v_{it} + o_p(1)$. Then, by normalising with respect to the maximum, we get a first proxy of inefficiency as $\hat e_{it} = \max_i (\hat\varepsilon_{it}) - \hat\varepsilon_{it}$. Next, we consider the threshold estimation procedure, in which a grid of values for $r$ is constructed. For every value on that grid, the model is estimated by least squares to obtain estimates of $\rho$. Specifically, we estimate $r$ and $\rho$ jointly by minimising the following criterion function:

$$V(r, \rho) = \min_{r,\rho} \sum_{i=1}^{N} \sum_{t=1}^{T} \left[ \hat e_{it} - \rho\, \frac{1}{m_{it}} \sum_{j=1}^{N} I\left(\left|\hat e^*_{t-1} - \hat e_{jt-1}\right| \le r\right) \hat e_{jt-1} \right]^2. \tag{321}$$

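The time-varying individual technical inefficiencies can be consistently estimated by

$$\hat e_{it} = \max_i (\hat u_{it}) - \hat u_{it} = \max_i \left(\hat\alpha_i + \hat\rho\, \hat u_{it}(\hat r) + \hat\lambda_i' \hat f_t\right) - \left(\hat\alpha_i + \hat\rho\, \hat u_{it}(\hat r) + \hat\lambda_i' \hat f_t\right) \tag{322}$$

Finally, we convert $\hat e_{it}$ to the time-varying individual technical efficiency by

$$\hat\tau_{it} = \exp(-\hat e_{it}). \tag{323}$$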
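The estimation steps above can be sketched roughly as follows. This is only an illustration under simplifying assumptions: the first-step residuals are taken as given, the criterion follows (321) literally (no individual effects or factors in the second step), and the frontier is the period minimum as stated in the text. The function names `frontier_average`, `grid_search` and `efficiency` are hypothetical.

```python
import numpy as np

def frontier_average(e_lag, r):
    """Average of e_{j,t-1} over units within distance r of the frontier value.

    The frontier is taken as min_j e_{j,t-1}; the frontier unit always
    qualifies, so the count m is at least 1.
    """
    close = np.abs(e_lag.min() - e_lag) <= r
    return e_lag[close].mean()

def grid_search(e, r_grid):
    """Minimise the sum of squared residuals in (321) jointly over (r, rho).

    e has shape (N, T). For each r, rho is obtained by least squares.
    """
    y = e[:, 1:]                                    # e_{it}, t = 2..T
    best = None
    for r in r_grid:
        # The regressor is common to all units at date t (common frontier).
        u = np.array([frontier_average(e[:, t], r) for t in range(e.shape[1] - 1)])
        z = np.broadcast_to(u, y.shape)
        den = (z ** 2).sum()
        if den == 0.0:                              # regressor identically zero: skip this r
            continue
        rho = (z * y).sum() / den
        ssr = ((y - rho * z) ** 2).sum()
        if best is None or ssr < best[0]:
            best = (ssr, r, rho)
    if best is None:
        raise ValueError("no admissible threshold on the grid")
    return best[1], best[2]

def efficiency(eps_hat):
    """Map residuals (N, T) to efficiency scores in (0, 1], as in (323)."""
    e = eps_hat.max(axis=0, keepdims=True) - eps_hat   # e_it = max_i(eps_it) - eps_it
    return np.exp(-e)                                   # tau_it = exp(-e_it)
```

In practice the grid for `r` would typically be built from the empirical quantiles of the distances to the frontier, but that choice is left open here.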

For empirical implementation, we follow Bailey et al. (2016), who propose a multi-step procedure to deal with both strong and weak forms of CSD:

1. Test for the existence of CSD by applying the Pesaran (2015) CD test.

2. If the null of no CSD is rejected, apply the factor-based model to control for strong CSD.

3. Apply the Pesaran (2015) CD test again to the (de-factored) residuals.

4. If the null of no CSD is rejected again, also apply spatial or network modelling to the residuals (see (319)).

This extended KMS approach enables us to deal with both strong and weak forms of CSD by combining the (exogenously driven) factor-based approach with an endogenous threshold efficiency regime selection mechanism.

8 Multi-dimensional Panel Data Modelling in the Presence of Cross Sectional Dependence

Given the growing availability of datasets that contain information on multiple dimensions, the recent literature has focused on extending the two-way fixed effects models to the multi-dimensional setting. The 'triple-way specification' was popularised by Mátyás (1997), where time, source (exporter) and destination (importer) fixed effects are each specified as unobservable. Baltagi et al. (2003) propose an extended specification with fixed exporter-time, importer-time, and country-pair effects. The triple-index models can be applied to a number of bilateral flows between countries or regions, such as trade, FDI, capital or migration flows (e.g. Feenstra, 2005; Bertoli and Fernandez-Huertas Moraga, 2013; Gunnella et al., 2015), but also to a variety of matched datasets that link, for example, employers and employees or pupils and teachers (e.g. Abowd et al., 1999; Kramarz et al., 2008).

Balazsi, Mátyás and Wansbeek (2015, BMW) introduce the 3D within estimators and analyse the behavior of these estimators in the cases of no self-flow data, unbalanced data, and dynamic autoregressive models (see footnote 45). Consider the 3D country-time fixed effects panel data model:

$$y_{ijt} = \beta' x_{ijt} + \gamma' s_{it} + \delta' d_{jt} + \kappa' q_t + \varphi' z_{ij} + u_{ijt}, \tag{324}$$

for $i = 1, \dots, N_1$, $j = 1, \dots, N_2$, $t = 1, \dots, T$, with errors:

$$u_{ijt} = \mu_{ij} + v_{it} + \zeta_{jt} + \varepsilon_{ijt} \tag{325}$$

where $y_{ijt}$ is the dependent variable observed across three indices (e.g. the import of country $j$ from country $i$ in period $t$); $x_{ijt}$, $s_{it}$, $d_{jt}$, $q_t$, $z_{ij}$ are the $k_x \times 1$, $k_s \times 1$, $k_d \times 1$, $k_q \times 1$, $k_z \times 1$ vectors of covariates covering all measurements across the three indices. The multiple error components contain pair fixed effects ($\mu_{ij}$) as well as origin and destination country-time fixed effects (CTFEs), $v_{it}$ and $\zeta_{jt}$.

The unobserved fixed effects, $\mu_{ij}$, $v_{it}$ and $\zeta_{jt}$, can be modelled by adding the following $N^2 T \times (N^2 + 2NT)$ matrix of dummies (written here for $N_1 = N_2 = N$):

$$D = \left( (I_N \otimes I_N \otimes l_T),\; (I_N \otimes l_N \otimes I_T),\; (l_N \otimes I_N \otimes I_T) \right)$$

where $I_N$ is the $N \times N$ identity matrix and $l_N$ is the $N \times 1$ vector of ones (similarly for $I_T$ and $l_T$). To remove all unobserved FEs, BMW derive the 3D within transformation:

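$$\tilde y_{ijt} = y_{ijt} - \bar y_{ij\cdot} - \bar y_{\cdot jt} - \bar y_{i \cdot t} + \bar y_{\cdot\cdot t} + \bar y_{\cdot j \cdot} + \bar y_{i \cdot\cdot} - \bar y_{\cdot\cdot\cdot} \tag{326}$$

where $\bar y_{ij\cdot} = T^{-1} \sum_{t=1}^{T} y_{ijt}$, $\bar y_{\cdot jt} = N^{-1} \sum_{i=1}^{N} y_{ijt}$, $\bar y_{i \cdot t} = N^{-1} \sum_{j=1}^{N} y_{ijt}$, $\bar y_{\cdot\cdot t} = N^{-2} \sum_{i=1}^{N} \sum_{j=1}^{N} y_{ijt}$, $\bar y_{\cdot j \cdot} = (NT)^{-1} \sum_{i=1}^{N} \sum_{t=1}^{T} y_{ijt}$, $\bar y_{i \cdot\cdot} = (NT)^{-1} \sum_{j=1}^{N} \sum_{t=1}^{T} y_{ijt}$ and $\bar y_{\cdot\cdot\cdot} = (N^2 T)^{-1} \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{t=1}^{T} y_{ijt}$.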

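We can estimate $\beta$ consistently from the transformed regression:

$$\tilde y_{ijt} = \beta' \tilde x_{ijt} + \tilde\varepsilon_{ijt}, \tag{327}$$

where $\tilde x_{ijt} = x_{ijt} - \bar x_{ij\cdot} - \bar x_{\cdot jt} - \bar x_{i \cdot t} + \bar x_{\cdot\cdot t} + \bar x_{\cdot j \cdot} + \bar x_{i \cdot\cdot} - \bar x_{\cdot\cdot\cdot}$. We write (327) compactly as

$$\tilde Y_{ij} = \tilde X_{ij} \beta + \tilde E_{ij} \tag{328}$$

$$\underset{T \times 1}{\tilde Y_{ij}} = \begin{bmatrix} \tilde y_{ij1} \\ \vdots \\ \tilde y_{ijT} \end{bmatrix}, \quad \underset{T \times k_x}{\tilde X_{ij}} = \begin{bmatrix} \tilde x_{ij1}' \\ \vdots \\ \tilde x_{ijT}' \end{bmatrix}, \quad \underset{T \times 1}{\tilde E_{ij}} = \begin{bmatrix} \tilde\varepsilon_{ij1} \\ \vdots \\ \tilde\varepsilon_{ijT} \end{bmatrix}.$$

Footnote 45: See also Balazsi, Mátyás and Pus (2015) for the random effects approach and Baltagi et al. (2015) for the host of issues related to the panel data gravity models of trade.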

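The 3D-within estimator of $\beta$ is obtained by

$$\hat\beta_W = \left( \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \tilde X_{ij}' \tilde X_{ij} \right)^{-1} \left( \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \tilde X_{ij}' \tilde Y_{ij} \right). \tag{329}$$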
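As a rough illustration of (326) and (329), the sketch below applies the 3D within transformation to arrays indexed by $(i, j, t)$ and then runs pooled least squares on the transformed data. It assumes a balanced panel stored as dense NumPy arrays; the function names `within3d` and `within_estimator` are hypothetical.

```python
import numpy as np

def within3d(a):
    """3D within transformation (326) for an array indexed (i, j, t, ...)."""
    m = lambda axes: a.mean(axis=axes, keepdims=True)
    # m(2): average over t, m(0): over i, m(1): over j, and so on.
    return (a - m(2) - m(0) - m(1)
            + m((0, 1)) + m((0, 2)) + m((1, 2)) - m((0, 1, 2)))

def within_estimator(y, x):
    """Pooled OLS on within-transformed data, as in (329).

    y has shape (N1, N2, T); x has shape (N1, N2, T, k).
    """
    yt, xt = within3d(y), within3d(x)
    xtx = np.einsum('ijtk,ijtl->kl', xt, xt)    # sum over pairs of X'X
    xty = np.einsum('ijtk,ijt->k', xt, yt)      # sum over pairs of X'Y
    return np.linalg.solve(xtx, xty)
```

Applying `within3d` to any array of the form $\mu_{ij} + v_{it} + \zeta_{jt}$ returns zeros up to rounding error, which is the sense in which the transformation removes all three sets of fixed effects.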

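As $(N_1, N_2, T) \to \infty$,

$$\sqrt{N_1 N_2 T} \left( \hat\beta_W - \beta \right) \overset{a}{\sim} N\left( 0,\; \sigma_\varepsilon^2 \lim_{(N_1, N_2, T) \to \infty} \left( \frac{1}{N_1 N_2 T} \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \tilde X_{ij}' \tilde X_{ij} \right)^{-1} \right). \tag{330}$$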

The 3D within transformation wipes out all other covariates, $x_{it}$, $x_{jt}$, $x_t$ and $x_{ij}$. It would be worthwhile to develop an extension of the Hausman-Taylor (1981) estimation, which is popular in 2D panels in the presence of CSD (e.g. Serlenga and Shin, 2007). Balazsi, Bun, Chan and Harris (2017) develop an extended HT estimator for multi-dimensional panels.

8.1 Research Extensions

We now propose a number of important research extensions.

• First, the 3D within transformation (326) wipes out the regressors $x_{it}$, $x_{jt}$, $x_t$ and $x_{ij}$. But we are also interested in uncovering the effects of those covariates, e.g. the impacts of measured trade costs in the structural gravity model. To recover these coefficients, we wish to develop an extension of the Hausman-Taylor type estimation or the Mundlak transformations.

• Second, and more importantly, we aim to introduce cross-section dependence (CSD) explicitly within the three-way error components specification. This makes a timely contribution to the literature, given the massive development of CSD modelling in 2D panels, e.g. Pesaran (2006) and Bai (2009, 2014).

—What is the nature of CSD: strong, semi-strong or weak? As $\alpha_{it}$ and $\alpha^*_{jt}$ in the CTFE model (??) are related to the local time factors, their CSD may be semi-strong.

—To introduce strong CSD, we may add the common unobserved global factor, $\lambda_t$:

$$u_{ijt} = \mu_{ij} + v_{it} + \zeta_{jt} + \lambda_t + \varepsilon_{ijt} \tag{331}$$

However, the 3D-within transformation also removes $\lambda_t$, because $\lambda_t$ is proportional to $\sum_{i=1}^{N_1} v_{it}$ or $\sum_{j=1}^{N_2} \zeta_{jt}$.

—Consider the following error components:

$$u_{ijt} = \mu_{ij} + \pi_{ij} \lambda_t + \varepsilon_{ijt} \tag{332}$$

which is the two-way version of the factor model considered by Serlenga and Shin (2007). More generally, we will examine:

$$u_{ijt} = \mu_{ij} + v_{it} + \zeta_{jt} + \pi_{ij} \lambda_t + \varepsilon_{ijt} \tag{333}$$

—Kapetanios, Serlenga and Shin (2017) model the error components $u_{ijt}$ as following the hierarchical multi-factor structure:

$$u_{ijt} = \gamma_{ij}' f_t + \gamma_j' f_{it} + \gamma_i' f_{jt} + \varepsilon_{ijt}, \tag{334}$$

where $f_t$, $f_{jt}$ and $f_{it}$ are respectively $m_f \times 1$, $m_{\bullet} \times 1$ and $m_{\bullet} \times 1$ vectors of unobserved common effects, and $\varepsilon_{ijt}$ are idiosyncratic errors. Exporter $i$ reacts heterogeneously to the common import market condition $f_{jt}$ and importer $j$ reacts heterogeneously to the common export market condition $f_{it}$. Both exporter $i$ and importer $j$ react heterogeneously to the common global market condition $f_t$. Within this model, we can distinguish between three types of CSD: (i) the strong global factor, $f_t$, which influences the $(ij)$ pairwise interactions (of $N^2$ dimension); (ii) the semi-strong local factors, $f_{it}$ and $f_{jt}$, which influence origin or destination separately (each of $N$ dimension); and (iii) the weak CSD of the idiosyncratic errors, $\varepsilon_{ijt}$.

• We expect that this kind of generalisation would be most natural within 3D panel data models.

• In such a case, the 3D within estimator is unable to estimate $\beta$ consistently due to the nonzero correlation between the omitted interactive effects ($\pi_{ij} \lambda_t$) and the regressors.

• The full estimation of (??) with (332), (333) or (412) can be made feasible by combining the Pesaran or the Bai type estimation procedures with the Hausman-Taylor extension.

• Finally, there are a few studies that attempt to develop a general approach accommodating both weak and strong CSD in modelling cross-sectionally correlated panels.

—Bailey et al. (2015) develop a multi-step estimation procedure that can distinguish the relationship between spatial units that is purely spatial from that which is due to the effect of common factors.

—Kapetanios et al. (2014) advance a flexible nonlinear panel data model, which can generate strong and/or weak CSD endogenously. Furthermore, Mastromarco et al. (2015) propose a novel technique for allowing weak and strong CSD in modelling the technical efficiency of stochastic frontier panels by combining the exogenously driven factor-based approach by SS with the endogenous threshold regime selection by KMS. See also Gunnella et al. (2015).

—Bai and Li (2015) and Shi and Lee (2017) develop a framework for jointly modelling spatial effects and interactive effects. See also Kuersteiner and Prucha (2015).

—Hence, the extension of such joint modelling to multi-dimensional data would be challenging but would shed further light on the complex structure of CSD.

8.2 The Econometrics of Multi-dimensional Panel Data Modelling in the Presence of Cross-sectional Error Dependences by Kapetanios, Mastromarco, Serlenga and Shin (2017, KMSS)

There has been no study addressing the issue of controlling for cross-section (error) dependence (CSD) in 3D or higher-dimensional data, despite strong evidence of CSD in 2D panels (Pesaran, 2015). The two main approaches to modelling CSD in 2D panels are: (i) the factor-based approach (Pesaran, 2006; Bai, 2009) and (ii) spatial econometrics techniques (Baltagi, 2005; Behrens et al., 2012). Factor-based models exhibit strong CSD while spatial models exhibit weak CSD only (Chudik et al., 2011). See also Bailey et al. (2016), Le Gallo and Pirotte (2017), and Baltagi, Egger and Erhardt (2017).

We develop 3D models with strong CSD. We generalise the multi-dimensional error components by incorporating unobserved heterogeneous global factors. The country-time fixed effects (CTFE) and random effects (CTRE) estimators fail to remove heterogeneous global factors, and they are inconsistent in the presence of nonzero correlation between the regressors and the unobserved global factors.

We develop a two-step estimation procedure. Following Pesaran (2006), we augment the 3D model with cross-section averages of the dependent variable and regressors as proxies for the unobserved global factors. We then apply the 3D-within transformation to the augmented specification and obtain consistent estimators (the 3D-PCCE estimator). Our approach is the first attempt to accommodate strong CSD in multi-dimensional panels.

We discuss the extent of CSD under three different error components specifications: with CTFE, with the two-way heterogeneous factor, and with both. We develop a diagnostic test for the null of (pairwise) residual cross-section independence or weak dependence, which is a modified version of the CD test for 2D panels proposed by Pesaran (2015). We also provide extensions to unbalanced panels and 4D models.

Monte Carlo studies confirm that the 3D-PCCE estimators perform well when the 3D panels are subject to heterogeneous global factors. By contrast, CTFE displays severe biases and size distortions. We apply the 3D PCCE estimation, together with the two-way FE and CTFE estimators, to the dataset over 1960-2008 for 91 country-pairs amongst 14 EU countries. Based on the CD test results and the predicted signs and statistical significance of the coefficients, we find that the 3D PCCE estimation results are the most reliable and satisfactory. The trade effect of currency union is rather modest. This suggests that the trade increase within the Euro area may reflect a continuation of a long-run historical trend linked to the broader set of the EU's economic integration policies.

Throughout we adopt the following standard notation. $I_N$ is the $N \times N$ identity matrix, $J_N$ the $N \times N$ matrix of ones, and $\iota_N$ the $N \times 1$ vector of ones. $M_A$ projects the $N \times N$ matrix $A$ into its null-space, i.e. $M_A = I_N - A(A'A)^{-1}A'$. Finally, $\bar y_{\cdot jt} = N_1^{-1} \sum_{i=1}^{N_1} y_{ijt}$, $\bar y_{i \cdot t} = N_2^{-1} \sum_{j=1}^{N_2} y_{ijt}$ and $\bar y_{ij\cdot} = T^{-1} \sum_{t=1}^{T} y_{ijt}$ denote the averages of $y$ over the indices $i$, $j$ and $t$, respectively, with the definition extending to other quantities such as $\bar y_{\cdot\cdot t}$, $\bar y_{\cdot j \cdot}$, $\bar y_{i \cdot\cdot}$ and $\bar y_{\cdot\cdot\cdot}$.

8.2.1 The 3D Models with CSD

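We first consider the following error components specification:

$$u_{ijt} = \mu_{ij} + \pi_{ij} \lambda_t + \varepsilon_{ijt}. \tag{335}$$

This is similar to the 2-way heterogeneous factor model by Serlenga and Shin (2007). We take the cross-section averages of (324) over $i$ and $j$:

$$\bar y_{\cdot\cdot t} = \beta' \bar x_{\cdot\cdot t} + \gamma' \bar s_{\cdot t} + \delta' \bar d_{\cdot t} + \kappa' q_t + \varphi' \bar z_{\cdot\cdot} + \bar\mu_{\cdot\cdot} + \bar\pi_{\cdot\cdot} \lambda_t + \bar\varepsilon_{\cdot\cdot t} \tag{336}$$

Hence, we have:

$$\lambda_t = \frac{1}{\bar\pi_{\cdot\cdot}} \left\{ \bar y_{\cdot\cdot t} - \left( \beta' \bar x_{\cdot\cdot t} + \gamma' \bar s_{\cdot t} + \delta' \bar d_{\cdot t} + \kappa' q_t + \varphi' \bar z_{\cdot\cdot} + \bar\mu_{\cdot\cdot} + \bar\varepsilon_{\cdot\cdot t} \right) \right\}$$

We augment the model (324) with the cross-section averages:

$$y_{ijt} = \beta' x_{ijt} + \gamma' s_{it} + \delta' d_{jt} + \psi_{ij}' f_t + \tau_{ij} + \mu^*_{ij} + \varepsilon^*_{ijt}, \tag{337}$$

where

$$\psi_{ij}' = \left( \frac{\pi_{ij}}{\bar\pi_{\cdot\cdot}},\; -\frac{\pi_{ij}}{\bar\pi_{\cdot\cdot}} \beta',\; -\frac{\pi_{ij}}{\bar\pi_{\cdot\cdot}} \gamma',\; -\frac{\pi_{ij}}{\bar\pi_{\cdot\cdot}} \delta',\; \left( 1 - \frac{\pi_{ij}}{\bar\pi_{\cdot\cdot}} \right) \kappa' \right), \quad f_t = \left( \bar y_{\cdot\cdot t},\; \bar x_{\cdot\cdot t}',\; \bar s_{\cdot t}',\; \bar d_{\cdot t}',\; q_t' \right)' \tag{338}$$

$$\tau_{ij} = \varphi' z_{ij} - \frac{\pi_{ij}}{\bar\pi_{\cdot\cdot}} \varphi' \bar z_{\cdot\cdot}, \quad \mu^*_{ij} = \mu_{ij} - \frac{\pi_{ij}}{\bar\pi_{\cdot\cdot}} \bar\mu_{\cdot\cdot}, \quad \varepsilon^*_{ijt} = \varepsilon_{ijt} - \frac{\pi_{ij}}{\bar\pi_{\cdot\cdot}} \bar\varepsilon_{\cdot\cdot t}.$$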

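We write (337) compactly as

$$Y_{ij} = W_{ij} \theta + H \psi^*_{ij} + E^*_{ij}, \quad i = 1, \dots, N_1, \; j = 1, \dots, N_2 \tag{339}$$

$$\underset{T \times 1}{Y_{ij}} = \begin{bmatrix} y_{ij1} \\ \vdots \\ y_{ijT} \end{bmatrix}, \quad \underset{T \times k_x}{X_{ij}} = \begin{bmatrix} x_{ij1}' \\ \vdots \\ x_{ijT}' \end{bmatrix}, \quad \underset{T \times k_s}{S_i} = \begin{bmatrix} s_{i1}' \\ \vdots \\ s_{iT}' \end{bmatrix},$$

$$\underset{T \times k_d}{D_j} = \begin{bmatrix} d_{j1}' \\ \vdots \\ d_{jT}' \end{bmatrix}, \quad \underset{T \times k_f}{F} = \begin{bmatrix} f_1' \\ \vdots \\ f_T' \end{bmatrix}, \quad \underset{T \times 1}{E^*_{ij}} = \begin{bmatrix} \varepsilon^*_{ij1} \\ \vdots \\ \varepsilon^*_{ijT} \end{bmatrix},$$

$W_{ij} = (X_{ij}, S_i, D_j)$, $\theta = (\beta', \gamma', \delta')'$, $\psi^*_{ij} = \left( \psi_{ij}', (\tau_{ij} + \mu^*_{ij}) \right)'$ and $H = [F, \iota_T]$.

We derive the 3D-PCCE estimator of $\theta$ by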

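$$\hat\theta_{PCCE} = \left( \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} W_{ij}' M_H W_{ij} \right)^{-1} \left( \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} W_{ij}' M_H Y_{ij} \right) \tag{340}$$

where $M_H = I_T - H (H'H)^{-1} H'$. Following Pesaran (2006), it is straightforward to show that as $(N_1, N_2, T) \to \infty$,

$$\sqrt{N_1 N_2 T} \left( \hat\theta_{PCCE} - \theta \right) \overset{a}{\sim} N(0, \Sigma_\theta), \tag{341}$$

where the (robust) consistent estimator of $\Sigma_\theta$ is given by

$$\hat\Sigma_\theta = \frac{1}{N_1 N_2} S_\theta^{-1} R_\theta S_\theta^{-1}, \tag{342}$$

$$R_\theta = \frac{1}{N_1 N_2 - 1} \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \left( \frac{W_{ij}' M_H W_{ij}}{T} \right) \left( \hat\theta_{ij} - \hat\theta_{MG} \right) \left( \hat\theta_{ij} - \hat\theta_{MG} \right)' \left( \frac{W_{ij}' M_H W_{ij}}{T} \right),$$

$$S_\theta = \frac{1}{N_1 N_2} \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \left( \frac{W_{ij}' M_H W_{ij}}{T} \right), \quad \hat\theta_{MG} = \frac{1}{N_1 N_2} \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \hat\theta_{ij},$$

where $\hat\theta_{ij}$ is the $(ij)$ pairwise OLS estimator obtained from the individual regression of $Y_{ij}$ on $(W_{ij}, H)$ in (339).

Next, we consider the 3D model (324) with the general errors: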

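$$u_{ijt} = \mu_{ij} + v_{it} + \zeta_{jt} + \pi_{ij} \lambda_t + \varepsilon_{ijt}. \tag{343}$$

The 3D-within transformation fails to remove $\pi_{ij} \lambda_t$, because

$$\tilde u_{ijt} = \tilde\pi_{ij} \tilde\lambda_t + \tilde\varepsilon_{ijt}$$

where $\tilde\lambda_t = \lambda_t - \bar\lambda$ with $\bar\lambda = T^{-1} \sum_{t=1}^{T} \lambda_t$ and $\tilde\pi_{ij} = \pi_{ij} - \bar\pi_{\cdot j} - \bar\pi_{i \cdot} + \bar\pi_{\cdot\cdot}$ with $\bar\pi_{\cdot j} = N_1^{-1} \sum_{i=1}^{N_1} \pi_{ij}$ and $\bar\pi_{i \cdot} = N_2^{-1} \sum_{j=1}^{N_2} \pi_{ij}$ (see footnote 46). In the presence of nonzero correlation between $x_{ijt}$ and $\lambda_t$, the 3D-within estimator of $\beta$ is biased.

We develop a two-step consistent estimation procedure. First, taking the cross-section averages of (??) over $i$ and $j$,

$$\bar y_{\cdot\cdot t} = \beta' \bar x_{\cdot\cdot t} + \gamma' \bar s_{\cdot t} + \delta' \bar d_{\cdot t} + \kappa' q_t + \varphi' \bar z_{\cdot\cdot} + \bar\mu_{\cdot\cdot} + \bar v_{\cdot t} + \bar\zeta_{\cdot t} + \bar\pi_{\cdot\cdot} \lambda_t + \bar\varepsilon_{\cdot\cdot t} \tag{344}$$

where $\bar v_{\cdot t} = N_1^{-1} \sum_{i=1}^{N_1} v_{it}$ and $\bar\zeta_{\cdot t} = N_2^{-1} \sum_{j=1}^{N_2} \zeta_{jt}$. We augment the model (324) with the cross-section averages:

$$y_{ijt} = \beta' x_{ijt} + \gamma' s_{it} + \delta' d_{jt} + \psi_{ij}' f_t + \tau_{ij} + \mu^*_{ij} + v^*_{ijt} + \zeta^*_{ijt} + \varepsilon^*_{ijt}, \tag{345}$$

Footnote 46: Unless $\tilde\pi_{ij} = 0$, $\tilde u_{ijt} \ne \tilde\varepsilon_{ijt}$. This holds only if the factor loadings, $\pi_{ij}$, are homogeneous.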

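where $v^*_{ijt} = v_{it} - \dfrac{\pi_{ij}}{\bar\pi_{\cdot\cdot}} \bar v_{\cdot t}$ and $\zeta^*_{ijt} = \zeta_{jt} - \dfrac{\pi_{ij}}{\bar\pi_{\cdot\cdot}} \bar\zeta_{\cdot t}$. We rewrite (345) as

$$y_{ijt} = \beta' x_{ijt} + \gamma' s_{it} + \delta' d_{jt} + \psi_{ij}' f_t + \tau_{ij} + \mu^*_{ij} + v_{it} + \zeta_{jt} + \varepsilon^{**}_{ijt}, \tag{346}$$

where $\varepsilon^{**}_{ijt} = \varepsilon_{ijt} - \dfrac{\pi_{ij}}{\bar\pi_{\cdot\cdot}} \bar\varepsilon_{\cdot\cdot t} - \dfrac{\pi_{ij}}{\bar\pi_{\cdot\cdot}} \bar v_{\cdot t} - \dfrac{\pi_{ij}}{\bar\pi_{\cdot\cdot}} \bar\zeta_{\cdot t}$. As $N_1, N_2 \to \infty$, $\varepsilon^{**}_{ijt} \to_p \varepsilon_{ijt}$.

We apply the 3D-within transformation (326) to (346):

$$\tilde y_{ijt} = \beta' \tilde x_{ijt} + \tilde\psi_{ij}' \tilde f_t + \tilde\varepsilon^{**}_{ijt}, \tag{347}$$

where $\tilde\psi_{ij} = \psi_{ij} - \bar\psi_{\cdot j} - \bar\psi_{i \cdot} + \bar\psi_{\cdot\cdot}$ and $\tilde f_t = f_t - \bar f$ with $\bar f = T^{-1} \sum_{t=1}^{T} f_t$. Rewriting (347) compactly as

$$\tilde Y_{ij} = \tilde X_{ij} \beta + \tilde F \tilde\psi_{ij} + \tilde E^{**}_{ij}, \quad i = 1, \dots, N_1, \; j = 1, \dots, N_2 \tag{348}$$

$$\underset{T \times 1}{\tilde Y_{ij}} = \begin{bmatrix} \tilde y_{ij1} \\ \vdots \\ \tilde y_{ijT} \end{bmatrix}, \quad \underset{T \times k_x}{\tilde X_{ij}} = \begin{bmatrix} \tilde x_{ij1}' \\ \vdots \\ \tilde x_{ijT}' \end{bmatrix}, \quad \underset{T \times k_f}{\tilde F} = \begin{bmatrix} \tilde f_1' \\ \vdots \\ \tilde f_T' \end{bmatrix}, \quad \tilde E^{**}_{ij} = \begin{bmatrix} \tilde\varepsilon^{**}_{ij1} \\ \vdots \\ \tilde\varepsilon^{**}_{ijT} \end{bmatrix}.$$

The 3D-PCCE estimator of $\beta$ is obtained by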

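$$\hat\beta_{PCCE} = \left( \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \tilde X_{ij}' M_{\tilde F} \tilde X_{ij} \right)^{-1} \left( \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \tilde X_{ij}' M_{\tilde F} \tilde Y_{ij} \right) \tag{349}$$

where $M_{\tilde F} = I_T - \tilde F (\tilde F' \tilde F)^{-1} \tilde F'$ is the $T \times T$ idempotent matrix. As $(N_1, N_2, T) \to \infty$,

$$\sqrt{N_1 N_2 T} \left( \hat\beta_{PCCE} - \beta \right) \overset{a}{\sim} N(0, \Sigma_\beta), \tag{350}$$

where the (robust) consistent estimator of $\Sigma_\beta$ is given by

$$\hat\Sigma_\beta = \frac{1}{N_1 N_2} S_\beta^{-1} R_\beta S_\beta^{-1}, \tag{351}$$

$$R_\beta = \frac{1}{N_1 N_2 - 1} \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \left( \frac{\tilde X_{ij}' M_{\tilde F} \tilde X_{ij}}{T} \right) \left( \hat\beta_{ij} - \hat\beta_{MG} \right) \left( \hat\beta_{ij} - \hat\beta_{MG} \right)' \left( \frac{\tilde X_{ij}' M_{\tilde F} \tilde X_{ij}}{T} \right),$$

$$S_\beta = \frac{1}{N_1 N_2} \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \left( \frac{\tilde X_{ij}' M_{\tilde F} \tilde X_{ij}}{T} \right), \quad \hat\beta_{MG} = \frac{1}{N_1 N_2} \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \hat\beta_{ij},$$

where $\hat\beta_{ij}$ is the $(ij)$ pairwise OLS estimator from the individual regression of $\tilde Y_{ij}$ on $(\tilde X_{ij}, \tilde F)$ in (348).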
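The two-step procedure leading to (349) can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes a balanced panel with only the pair-varying regressors $x_{ijt}$, so the factor proxies are just the cross-section averages of $y$ and $x$, and it uses a pseudo-inverse to form the annihilator. The function names `within3d` and `pcce3d` are hypothetical.

```python
import numpy as np

def within3d(a):
    """3D within transformation (326) for an array indexed (i, j, t, ...)."""
    m = lambda axes: a.mean(axis=axes, keepdims=True)
    return (a - m(2) - m(0) - m(1)
            + m((0, 1)) + m((0, 2)) + m((1, 2)) - m((0, 1, 2)))

def pcce3d(y, x):
    """Sketch of the 3D-PCCE estimator of beta in (349).

    y has shape (N1, N2, T); x has shape (N1, N2, T, k).
    """
    T = y.shape[2]
    # Step 1: cross-section averages over (i, j) as proxies for the global factor.
    f = np.column_stack([y.mean(axis=(0, 1)), x.mean(axis=(0, 1))])  # (T, 1 + k)
    f_dm = f - f.mean(axis=0)                                         # f_t minus its time average
    m_f = np.eye(T) - f_dm @ np.linalg.pinv(f_dm)                     # annihilator of the proxies
    # Step 2: 3D within transformation, then project out the demeaned proxies.
    yt, xt = within3d(y), within3d(x)
    my = np.einsum('ts,ijs->ijt', m_f, yt)
    mx = np.einsum('ts,ijsk->ijtk', m_f, xt)
    xtx = np.einsum('ijtk,ijtl->kl', xt, mx)
    xty = np.einsum('ijtk,ijt->k', xt, my)
    return np.linalg.solve(xtx, xty)
```

Because the annihilator is symmetric and idempotent, it only needs to be applied to one side of each cross-product, which is why `xt` rather than `mx` appears as the left factor.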

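We extend to the 3D panels with heterogeneous slope parameters:

$$y_{ijt} = \beta_{ij}' x_{ijt} + \gamma_j' s_{it} + \delta_i' d_{jt} + \kappa_{ij}' q_t + \varphi' z_{ij} + u_{ijt} \tag{352}$$

We develop the mean group estimators:

$$\hat\beta_{W,MG} = \frac{1}{N_1 N_2} \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \left( \tilde X_{ij}' \tilde X_{ij} \right)^{-1} \left( \tilde X_{ij}' \tilde Y_{ij} \right) \tag{353}$$

$$\hat\theta_{MGCCE} = \frac{1}{N_1 N_2} \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \left( W_{ij}' M_H W_{ij} \right)^{-1} \left( W_{ij}' M_H Y_{ij} \right) \tag{354}$$

$$\hat\beta_{MGCCE} = \frac{1}{N_1 N_2} \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \left( \tilde X_{ij}' M_{\tilde F} \tilde X_{ij} \right)^{-1} \left( \tilde X_{ij}' M_{\tilde F} \tilde Y_{ij} \right) \tag{355}$$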
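A mean group estimator of the form (353) simply averages pair-specific least-squares slopes. The following sketch assumes the inputs have already been transformed (for example by the 3D within transformation) and that each pair has enough time observations to identify its own slope; the function name `mean_group` is hypothetical.

```python
import numpy as np

def mean_group(y, x):
    """Average of pair-specific OLS slopes, in the spirit of (353).

    y has shape (N1, N2, T); x has shape (N1, N2, T, k). Both are assumed
    to be already transformed. Returns the mean group estimate and the
    array of pair-specific slopes.
    """
    n1, n2, _, k = x.shape
    slopes = np.empty((n1, n2, k))
    for i in range(n1):
        for j in range(n2):
            # OLS for pair (i, j): regress the T observations of y on x.
            slopes[i, j] = np.linalg.lstsq(x[i, j], y[i, j], rcond=None)[0]
    return slopes.mean(axis=(0, 1)), slopes
```

The pair-specific slopes returned alongside the average are the ingredients needed for the nonparametric variance estimators in (342) and (351).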

8.2.2 Cross-section Dependence (CD) Test

The extent of CSD is captured by the non-zero covariance between $u_{ijt}$ and $u_{i'j't}$, which relates to the rate at which $\frac{1}{N_1 N_2} \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \sigma_{ijt,u}$ declines with $N_1 N_2$. CTFE accommodates non-zero covariance locally, but imposes the same covariance for all $i = 1, \dots, N_1$ and $j = 1, \dots, N_2$. Such restrictions are too strong. Our proposed error components (343) accommodate non-zero covariances both locally and globally.

We develop a diagnostic test for the null hypothesis of residual cross-section independence in the 3D panels using the residuals, $e_{ij} = (e_{ij1}, \dots, e_{ijT})'$. We have $e_{ij} = \tilde Y_{ij} - \tilde X_{ij} \hat\beta_W$ for the model (328), $e_{ij} = M_H Y_{ij} - M_H W_{ij} \hat\theta_{PCCE}$ for (339), and $e_{ij} = M_{\tilde F} \tilde Y_{ij} - M_{\tilde F} \tilde X_{ij} \hat\beta_{PCCE}$ for (348). The cross-section dependence (CD) test is a modified counterpart of the existing CD test by Pesaran (2015).

We compute the pairwise residual correlations between the $n$-th and $n'$-th cross-section units by

$$\hat\rho_{nn'} = \frac{e_n' e_{n'}}{\sqrt{(e_n' e_n)(e_{n'}' e_{n'})}}, \quad n, n' = 1, \dots, N_1 N_2 \text{ and } n \ne n',$$

where we represent $e_{ij}$ as the $(ij)$ pair using the single index $n = 1, \dots, N_1 N_2$. We construct the CD statistic by

$$CD = \sqrt{\frac{2}{N_1 N_2 (N_1 N_2 - 1)}} \sum_{n=1}^{N_1 N_2 - 1} \sum_{n'=n+1}^{N_1 N_2} \sqrt{T}\, \hat\rho_{nn'} \tag{356}$$

The CD test has the limiting $N(0, 1)$ distribution under the null $H_0: \rho_{nn'} = 0$ for all $n, n' = 1, \dots, N_1 N_2$ and $n \ne n'$ (Pesaran, 2015).
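The statistic in (356) is straightforward to compute once the residuals are stacked by pair. The sketch below assumes a balanced panel with residuals arranged as an $(N_1 N_2) \times T$ array (for instance a 3D residual array reshaped so that each row is one $(ij)$ pair); the function name `cd_statistic` is hypothetical.

```python
import numpy as np

def cd_statistic(resid):
    """CD statistic (356) from residuals stacked as an (N1*N2, T) array."""
    n, T = resid.shape
    norms = np.sqrt((resid ** 2).sum(axis=1))
    corr = (resid @ resid.T) / np.outer(norms, norms)   # pairwise rho_{nn'}
    upper = corr[np.triu_indices(n, k=1)]               # pairs with n < n'
    return np.sqrt(2.0 * T / (n * (n - 1))) * upper.sum()
```

Under the null the value should behave like a standard normal draw, whereas residuals sharing a common factor with loadings of the same sign produce a large positive value.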

8.2.3 Monte Carlo Study

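We construct DGP1 by

$$y_{ijt} = \beta' x_{ijt} + \mu_{ij} + \pi_{ij} \lambda_t + \varepsilon_{ijt}, \tag{357}$$

$$x_{ijt} = \mu^x_{ij} + \mu_{ij} + \pi^x_{ij} \lambda_t + v_{ijt}, \tag{358}$$

for $i = 1, \dots, N_1$, $j = 1, \dots, N_2$, and $t = 1, \dots, T$. The global factor, $\lambda_t$, and the idiosyncratic errors, $\varepsilon_{ijt}$ and $v_{ijt}$, are generated independently as iid processes:

$$\lambda_t \sim iid\, N(0, 1), \quad \varepsilon_{ijt} \sim iid\, N(0, 1), \quad v_{ijt} \sim iid\, N(0, 1).$$

We generate the pairwise individual effects independently as

$$\mu_{ij} \sim iid\, N(0, 1), \quad \mu^x_{ij} \sim iid\, N(0, 1).$$

The factor loadings, $\pi_{ij}$ and $\pi^x_{ij}$, are independently generated from $U[1, 2]$.

Next, we construct DGP2 by

$$y_{ijt} = \beta' x_{ijt} + \mu_{ij} + v_{it} + \zeta_{jt} + \pi_{ij} \lambda_t + \varepsilon_{ijt}, \tag{359}$$

$$x_{ijt} = \mu^x_{ij} + \mu_{ij} + \pi^x_{ij} \lambda_t + v_{ijt}, \tag{360}$$

for $i = 1, \dots, N_1$, $j = 1, \dots, N_2$, and $t = 1, \dots, T$. In addition, we generate $v_{it}$ and $\zeta_{jt}$ independently as:

$$v_{it} \sim U(-1, 1) \quad \text{and} \quad \zeta_{jt} \sim U(-1, 1)$$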
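The two designs above can be simulated with a few lines of NumPy. The sketch assumes a scalar regressor and uses array broadcasting over the $(i, j, t)$ axes; the function name `simulate_dgp` and its arguments are hypothetical.

```python
import numpy as np

def simulate_dgp(n1, n2, T, beta=1.0, dgp=2, seed=0):
    """Draw one sample from DGP1 (357)-(358) or DGP2 (359)-(360), scalar x."""
    rng = np.random.default_rng(seed)
    lam = rng.standard_normal((1, 1, T))             # global factor lambda_t
    mu = rng.standard_normal((n1, n2, 1))            # pair effects mu_ij
    mu_x = rng.standard_normal((n1, n2, 1))          # pair effects mu^x_ij
    pi = rng.uniform(1.0, 2.0, (n1, n2, 1))          # loadings pi_ij
    pi_x = rng.uniform(1.0, 2.0, (n1, n2, 1))        # loadings pi^x_ij
    x = mu_x + mu + pi_x * lam + rng.standard_normal((n1, n2, T))
    y = beta * x + mu + pi * lam + rng.standard_normal((n1, n2, T))
    if dgp == 2:
        v = rng.uniform(-1.0, 1.0, (n1, 1, T))       # origin country-time effects v_it
        zeta = rng.uniform(-1.0, 1.0, (1, n2, T))    # destination country-time effects zeta_jt
        y = y + v + zeta
    return y, x
```

Repeating the draw with different seeds and feeding each sample to the estimators under comparison yields the bias, RMSE and size figures of a Monte Carlo exercise.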

In both DGP1 and DGP2 we set $\beta = 1$.

In Table 1, the biases of the 2D PCCE and 3D PCCE estimators of $\beta$ are mostly negligible even for $(N_1, N_2, T) = (25, 25, 50)$. The CTFE estimator displays substantial biases. The RMSE results are qualitatively similar to the bias pattern. CTFE over-rejects the null in all cases, with its rejection frequency tending to 1 as $N_1$ ($N_2$) or $T$ rises. The size of the 2D PCCE is close to the nominal 5% while the 3D PCCE slightly over-rejects when $N_1$ or $N_2$ is small. Overall, the performance of the 2D PCCE estimator is the best under DGP1.

The simulation results in Table 2 are qualitatively similar to those in Table 1. The biases of PCCE are almost negligible and the RMSEs decrease rapidly with $N_1$ ($N_2$) or $T$. Empirical sizes are still close to the nominal 5% level. CTFE suffers from substantial biases and size distortions, and its performance does not improve in large samples. The good performance of the 2D PCCE is rather surprising, as the 3D PCCE estimator is expected to dominate. Overall, the simulation results support the findings reported for 2D panels by Kapetanios and Pesaran (2005) and Pesaran (2006).

8.2.4 The Gravity Model of the Intra-EU Trade

Anderson and van Wincoop (2003): “The gravity equation tells us that bilateral trade, after controlling for size, depends on the bilateral trade barriers but relative to the product of their Multilateral Resistance Indices (MTR).” Omitting MTR induces severe bias (e.g. Baldwin and Taglioni, 2006). Subsequent research focused on estimating the model with directional country-specific fixed effects to control for unobservable MTRs (e.g. Feenstra, 2004).

A large number of studies have established the importance of taking into account multilateral resistance and bilateral heterogeneity in 2D panels. Serlenga and Shin (2007) is the first to develop a panel gravity model incorporating observed and unobserved factors. Behrens et al. (2012) develop a spatial econometric specification to control for multilateral cross-sectional correlations across trade flows. Mastromarco et al. (2015) compare the factor- and spatial-based gravity models to investigate the Euro's impact on intra-EU trade flows over 1960-2008 for 190 country-pairs of 14 EU and 6 non-EU OECD countries. The CD test confirms that the factor-based model is more appropriate for controlling for CSD.

For the 3D models, we should control for the source of biases presented by unobserved time-varying MTRs. Baltagi et al. (2003) propose the 3D model (327) with the CTFE specification (??). This approach is popular for measuring the impacts of the MTRs of exporters and importers in structural gravity studies (e.g. Baltagi et al., 2015). However, CTFE or CTRE estimators fail to accommodate (strong and heterogeneous) CSD. The presence of CSD across $(ij)$ pairs suggests that appropriate econometric techniques are required.

We apply our approach to the dataset covering the period 1960-2008 (49 years) for 182 country-pairs amongst 14 EU member countries (Austria, Belgium-Luxemburg, Denmark, Finland, France, Germany, Greece, Ireland, Italy, Netherlands, Portugal, Spain, Sweden, United Kingdom).

Consider the generalised panel gravity specification:

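$$\begin{aligned} \ln EXP_{ijt} = {} & \beta_0 + \beta_1 CEE_{ijt} + \beta_2 EMU_{ijt} + \beta_3 SIM_{ijt} + \beta_4 RLF_{ijt} \\ & + \beta_5 \ln GDP_{it} + \beta_6 \ln GDP_{jt} + \beta_7 RER_t \\ & + \gamma_1 DIS_{ij} + \gamma_2 BOR_{ij} + \gamma_3 LAN_{ij} + u_{ijt} \end{aligned} \tag{361}$$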

The dependent variable $EXP_{ijt}$ is the export flow from country $i$ to country $j$ at time $t$; CEE and EMU are dummies for European Community membership and European Monetary Union; SIM and RLF measure similarity in size and difference in relative factor endowments; RER represents the logarithm of common real exchange rates; $GDP_{it}$ and $GDP_{jt}$ are the logged GDPs of exporter and importer. The logarithm of geographical distance (DIS) and the dummies for common language (LAN) and common border (BOR) represent time-invariant bilateral barriers.

We report the CD test results for the residuals and the estimates of the CSD exponent ($\alpha$). Our focus is on the impacts of $t_{ij}$, which contain both barriers and incentives to trade. We focus on two dummy variables: CEE (equal to one when both countries belong to the European Community) and EMU (equal to one when both adopt the same currency). Both are expected to exert a positive impact on bilateral export flows.

The empirical evidence is mixed. Rose (2001), Frankel and Rose (2002), Glick and Rose (2002) and Frankel (2008) document a huge positive effect, whereas a number of studies report negative or insignificant effects (Persson, 2001; Pakko and Wall, 2002; De Nardis and Vicarelli, 2003). Recent studies by Serlenga and Shin (2007), Mastromarco et al. (2015) and Gunnella et al. (2015) find a small but significant effect (7 to 10%) of the euro on intra-EU trade, after controlling for strong CSD.

Table 3 reports the estimation results. The two-way FE estimates are statistically significant except for RER. The impacts of home and foreign GDPs on exports are positive but, surprisingly, the former is about twice as large as the latter. The impact of SIM is negative and significant, inconsistent with a priori expectations. CEE and EMU significantly boost exports, but their magnitudes seem too high. The CD test convincingly rejects the null of no or weak CSD. The estimate of $\alpha$ is 0.99 with a confidence interval containing unity, so the residuals are strongly correlated and the FE results are biased and unreliable. This supports our main concern that upward trends in omitted trade determinants may cause these estimates to be upward-biased.

We turn to the CTFE estimation results. The CD test results indicate that the CTFE residuals do not suffer from any strong CSD. This rather surprising result is not supported by $\alpha = 0.91$ (pretty close to 1). All the coefficients become insignificant except for CEE. The CEE effect is still substantial (0.29) while the EMU effect turns negligible (-0.011). Overall, the CTFE results are unreliable.

The 2D PCCE results are significant with the expected signs except for EMU. The impact of foreign GDP on exports is substantially larger than that of home GDP. The RER coefficient is positive, confirming that a depreciation of the home currency increases exports. The CEE effect is smaller (0.186), but the EMU effect is insignificant and negligible (0.017). The 2D PCCE suffers from strongly cross-sectionally dependent residuals, with $\alpha = 0.87$.

Finally, the 3D PCCE results show that all the coefficients are significant with the expected signs. The CD test fails to strongly reject the null, supported by the smaller estimate of $\alpha = 0.77$, close to the moderate range of weak CSD (see footnote 47). The CEE effect is still substantial (0.335) while the EMU effect is modest at 0.081, close to the consensus reported in the 2D panel studies (e.g. Baldwin, 2006; Gunnella et al., 2015). The 3D PCCE results are mostly reliable, suggesting that the trade-boosting effect of the Euro should be viewed from long-run historical and multilateral perspectives rather than by simply focusing on the formation of a monetary union as an isolated event.

The CTFE estimator is proposed to capture (unobserved) multilateral re-sistance terms and trade costs, but it fails to accommodate strong CSD amongMTRs, clearly present in our sample of the EU countries (confirmed by CDtests and CSD exponent estimates). We should model the time-varying interde-pendency of bilateral export flows in a flexible manner than simply introducingdeterministic country-time specific dummies. MTRs arise from the bilateralcountry-pair specific reactions to global shocks or the local spillover effects orboth.47BHP show that the values of α ∈ [1/2, 3/4) represent a moderate degree of CSD.

172

Table 1: 3D panel gravity model estimation results for bilateral export flows

                          FE                           CTFE
                Coeff     se     t-ratio     Coeff     se     t-ratio
gdph            2.185   0.041     52.97
gdpf            1.196   0.041     28.98
sim            -0.263   0.052    -5.069     -0.055   0.074    -0.754
rlf             0.031   0.006     5.011      0.006   0.005     1.294
rer             0.005   0.007     0.791      0.031   0.072     0.436
cee             0.302   0.014     22.05      0.290   0.017     16.99
emu             0.204   0.019     10.71     -0.011   0.036    -0.315
CD stat         620.1                       -2.676
                α0.05     α       α0.95      α0.05     α       α0.95
CSD exponent    0.925   0.992     1.059      0.865   0.914     0.963

                       2D PCCE                       3D PCCE
                Coeff     se     t-ratio     Coeff     se     t-ratio
gdph            0.289   0.095     3.033
gdpf            1.491   0.095     15.69
sim             0.042   0.105     0.401      1.032   0.111     9.290
rlf             0.007   0.005     1.420     -0.004   0.005    -0.748
rer             0.144   0.019     7.427      0.168   0.114     1.471
cee             0.187   0.014     13.20      0.335   0.022     15.10
emu             0.018   0.015     1.160      0.081   0.045     1.793
CD stat         76.11                        -4.19
                α0.05     α       α0.95      α0.05     α       α0.95
CSD exponent    0.837   0.867     0.897      0.724   0.775     0.826

Notes: Using the annual dataset over 1960-2008 for 182 country-pairs amongst 14 EU member countries, we estimate the generalised panel gravity specification, (409). FE stands for the standard two-way fixed effects estimator with country-pair and time fixed effects. CTFE refers to the 3D within estimator given by (328). 2D PCCE is the PCCE estimator given by (339) with factors ft = {gdp.t, sim..t, rlf..t, cee..t, rert, t}. 3D PCCE is the PCCE estimator given by (348) with factors ft = {sim..t, rlf..t, rert}. CD test refers to testing the null hypothesis of residual cross-sectional error independence or weak dependence and is defined in (356). CSD exponent denotes the point estimate of the exponent of CSD, α, and the 90% level confidence bands.
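The CD statistic reported in Table 1 is defined in (356), which is not reproduced in this part of the notes. As a rough illustration only, the sketch below assumes the familiar pairwise-correlation form, CD = sqrt(2T / (N(N-1))) times the sum of the residual correlations over all pairs of units, and treats each country pair as one cross-section unit. The function name `cd_statistic` and the simulated residuals are hypothetical.

```python
import numpy as np

def cd_statistic(resid):
    """Pairwise-correlation CD statistic for a (T, N) matrix of residuals.

    Assumed form: sqrt(2T / (N(N-1))) * sum_{i<j} rho_ij, where rho_ij is the
    sample correlation between the residuals of units i and j.
    """
    T, N = resid.shape
    corr = np.corrcoef(resid, rowvar=False)   # (N, N) pairwise correlations
    upper = np.triu_indices(N, k=1)           # index pairs with i < j
    return np.sqrt(2.0 * T / (N * (N - 1))) * corr[upper].sum()

rng = np.random.default_rng(0)
T, N = 50, 30
indep = rng.standard_normal((T, N))                    # no cross-section dependence
factor = rng.standard_normal((T, 1))                   # one strong common factor
loadings = rng.uniform(0.5, 1.5, size=(1, N))
strong = indep + factor * loadings                     # strongly dependent residuals

cd_indep = cd_statistic(indep)
cd_strong = cd_statistic(strong)
```

Under independence the statistic should behave roughly like a standard normal draw, whereas a common factor with non-zero mean loadings pushes it far into the tail, which is the pattern seen for the FE residuals above.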

8.2.5 Conclusion

We propose novel estimation techniques to accommodate CSD within the 3D panel data models. Our framework is a generalisation of the multi-dimensional country-time fixed and random effects estimators. Our approach is the first attempt to introduce strong CSD into the multi-dimensional error components. We develop a two-step estimation procedure, the 3D-PCCE estimator. The empirical usefulness of the 3D-PCCE estimator is demonstrated via Monte Carlo studies and an empirical application to the gravity model of intra-EU trade.

Extensions and generalisations. First, we develop multi-dimensional heterogeneous panel data models with a hierarchical multi-factor error structure (KSS). Next, we aim to develop more challenging models by combining the spatial-based and the factor-based techniques. Bailey et al. (2016) develop a multi-step estimation procedure that can distinguish the relationship between spatial units from that which is due to the effect of common factors. Mastromarco et al. (2015) propose a technique for allowing weak and strong CSD in stochastic frontier panels by combining the exogenously driven factor-based approach and an endogenous threshold regime selection by Kapetanios et al. (2014, KMS). Bai and Li (2015) and Shi and Lee (2014,5) develop a framework for jointly modelling spatial effects and interactive effects. See also Gunnella et al. (2015) and Kuersteiner and Prucha (2015).

8.3 Estimation and Inference for Multi-dimensional Heterogeneous Panel Datasets with Hierarchical Multi-factor Error Structure by Kapetanios, Serlenga and Shin (2019)

8.3.1 The Model

Consider the triple-index heterogeneous panel data model:

$$ y_{ijt} = \beta_{ij}' x_{ijt} + \delta_{ij}' d_t + u_{ijt}, \quad i, j = 1, \ldots, N, \; t = 1, \ldots, T, \tag{362} $$

where $y_{ijt}$ is the dependent variable observed across three indices, $i$ the origin and $j$ the destination at period $t$ (say, the export from country $i$ to $j$ at $t$); $x_{ijt}$ is the $m_x \times 1$ vector of covariates; $d_t$ is the $m_d \times 1$ vector of observed common effects such as constants and trends; $\beta_{ij}$ and $\delta_{ij}$ are the $m_x \times 1$ and $m_d \times 1$ vectors of parameters.

We allow $u_{ijt}$ to follow the hierarchical multi-factor structure:

$$ u_{ijt} = \gamma_{ij}' f_t + \gamma_j' f_{it} + \gamma_i' f_{jt} + \varepsilon_{ijt}, \tag{363} $$

where $f_t$, $f_{jt}$ and $f_{it}$ are $m_f \times 1$, $m_\bullet \times 1$ and $m_\bullet \times 1$ vectors of unobserved common effects, and $\varepsilon_{ijt}$ are idiosyncratic errors distributed independently of $(x_{ijt}, d_t)$. The $f_t$ are the global factors affecting all of the bilateral pairs; $f_{it}$ and $f_{jt}$ are the local origin-$i$ and destination-$j$ factors. They are designed to account for commonality in $y_{ijt}$: CSD between a given flow and a flow from the exporting region's commonality to the importing region (exporting-based dependence), and another flow from the exporting region to the importing region's commonality (importing-based dependence). This can provide a natural alternative to the existing literature. Business cycles can be decomposed into world, region and country-specific factors (Kose et al. 2003). See also Choi et al. (2016).

To deal with the general case where $f_t$, $f_{jt}$ and $f_{it}$ are correlated with $(x_{ijt}, d_t)$, we consider the following DGP for $x_{ijt}$:

$$ x_{ijt} = D_{ij} d_t + \Gamma_{ij} f_t + \Gamma_j f_{it} + \Gamma_i f_{jt} + v_{ijt}, \tag{364} $$

where $D_{ij}$ is the $(m_x \times m_d)$ parameter matrix, $\Gamma_{ij}$, $\Gamma_j$ and $\Gamma_i$ are $(m_x \times m_f)$, $(m_x \times m_\bullet)$, $(m_x \times m_\bullet)$ factor loading matrices, and $v_{ijt}$ are the idiosyncratic errors.

Combining (362)-(364), we have:

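$$ z_{ijt} = \begin{pmatrix} y_{ijt} \\ x_{ijt} \end{pmatrix} = \Xi_{ij} d_t + \Phi_{ij} f_t + \Phi_j f_{it} + \Phi_i f_{jt} + u_{ijt}, \tag{365} $$

$$ \Xi_{ij} = \begin{pmatrix} \delta_{ij}' + \beta_{ij}' D_{ij} \\ D_{ij} \end{pmatrix}, \quad \Phi_{ij} = \begin{pmatrix} \gamma_{ij}' + \beta_{ij}' \Gamma_{ij} \\ \Gamma_{ij} \end{pmatrix}, \quad \Phi_i = \begin{pmatrix} \gamma_i' + \beta_{ij}' \Gamma_i \\ \Gamma_i \end{pmatrix}, \tag{366} $$

$$ \Phi_j = \begin{pmatrix} \gamma_j' + \beta_{ij}' \Gamma_j \\ \Gamma_j \end{pmatrix}, \quad u_{ijt} = \begin{pmatrix} \varepsilon_{ijt} + \beta_{ij}' v_{ijt} \\ v_{ijt} \end{pmatrix}. $$

The ranks of $\Phi_{ij}$, $\Phi_i$ and $\Phi_j$ are determined by the ranks of

$$ \underset{(m_x+1) \times m_f}{\tilde\Gamma_{ij}} = \begin{pmatrix} \gamma_{ij}' \\ \Gamma_{ij} \end{pmatrix}, \quad \underset{(m_x+1) \times m_\bullet}{\tilde\Gamma_i} = \begin{pmatrix} \gamma_i' \\ \Gamma_i \end{pmatrix}, \quad \underset{(m_x+1) \times m_\bullet}{\tilde\Gamma_j} = \begin{pmatrix} \gamma_j' \\ \Gamma_j \end{pmatrix}. $$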

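Rewrite (362) and (365) in the matrix notation:

$$ y_{ij} = X_{ij} \beta_{ij} + D \delta_{ij} + F \gamma_{ij} + F_i \gamma_j + F_j \gamma_i + \varepsilon_{ij}, \tag{367} $$

$$ z_{ij} = D \Xi_{ij}' + F \Phi_{ij}' + F_i \Phi_j' + F_j \Phi_i' + u_{ij}, \tag{368} $$

where

$$ \underset{T \times 1}{y_{ij}} = (y_{ij1}, \ldots, y_{ijT})', \quad \underset{T \times m_x}{X_{ij}} = (x_{ij1}, \ldots, x_{ijT})', \quad \underset{T \times m_d}{D} = (d_1, \ldots, d_T)', \quad \underset{T \times (m_x+1)}{z_{ij}} = (z_{ij1}, \ldots, z_{ijT})', \tag{369} $$

$$ \underset{T \times m_f}{F} = (f_1, \ldots, f_T)', \quad \underset{T \times m_\bullet}{F_i} = (f_{i1}, \ldots, f_{iT})', \quad \underset{T \times m_\bullet}{F_j} = (f_{j1}, \ldots, f_{jT})', \quad \underset{T \times 1}{\varepsilon_{ij}} = (\varepsilon_{ij1}, \ldots, \varepsilon_{ijT})', \quad \underset{T \times (m_x+1)}{u_{ij}} = (u_{ij1}, \ldots, u_{ijT})'. $$

Assumption 1. Common Effects: The $(m_d + m_f + m_\bullet + m_\bullet) \times 1$ vector of common factors $g_t = (d_t', f_t', f_{it}', f_{jt}')'$ is covariance stationary with absolute summable autocovariances, distributed independently of $\varepsilon_{ijt'}$ and $v_{ijt'}$ for all $i$, $j$, $t$ and $t'$.

Assumption 2. Individual-specific Errors: $\varepsilon_{ijt}$ and $v_{ijt'}$ are distributed independently for all $i$, $j$, $t$ and $t'$, and they are distributed independently of $x_{ijt}$ and $d_t$.

Assumption 3. Factor Loadings: Unobserved factor loadings are independently and identically distributed across $(i, j)$, and of $\varepsilon_{ijt}$, $v_{ijt}$, $g_t$ for all $i$, $j$, $t$, with finite means and variances. In particular,

$$ \gamma_{ij} = \gamma + \eta_{ij}, \quad \gamma_i = \gamma_\bullet + \eta_i, \quad \gamma_j = \gamma_\bullet + \eta_j, \tag{370} $$

$$ \Gamma_{ij} = \Gamma + \xi_{ij}, \quad \Gamma_i = \Gamma_\bullet + \xi_i, \quad \Gamma_j = \Gamma_\bullet + \xi_j, \tag{371} $$

where $\eta_{ij} \sim iid(0, \Omega_\eta)$, $\xi_{ij} \sim iid(0, \Omega_\xi)$, $\eta_i \sim iid(0, \Omega_{\eta\bullet})$, $\xi_i \sim iid(0, \Omega_{\xi\bullet})$, $\eta_j \sim iid(0, \Omega_{\eta\bullet})$ and $\xi_j \sim iid(0, \Omega_{\xi\bullet})$. Further, $\|\gamma\| < K$, $\|\gamma_\bullet\| < K$, $\|\gamma_\bullet\| < K$, $\|\Gamma\| < K$, $\|\Gamma_\bullet\| < K$ and $\|\Gamma_\bullet\| < K$ for a positive constant $K < \infty$.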

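Assumption 4. Random Slope Coefficients:

$$ \beta_{ij} = \beta + \nu_i + \nu_j + \nu_{ij}, \quad \nu_i \sim iid(0, \Omega_{\nu\bullet}), \quad \nu_j \sim iid(0, \Omega_{\nu\bullet}), \quad \nu_{ij} \sim iid(0, \Omega_\nu), \tag{372} $$

where $\|\beta\| < K$ and $\nu_{ij}$, $\nu_i$, $\nu_j$ are distributed independently of one another, and of $\gamma_{ij}$, $\Gamma_{ij}$, $\varepsilon_{ijt}$, $v_{ijt}$ and $g_t$ for all $i$, $j$ and $t$.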

Assumption 5. Identification of $\beta_{ij}$ and $\beta$: Construct the cross-section averages of $z_{ijt}$ by

$$ \bar z_t = \frac{1}{N^2} \sum_{i=1}^N \sum_{j=1}^N z_{ijt}, \quad \bar z_{it} = \frac{1}{N} \sum_{j=1}^N z_{ijt} \quad \text{and} \quad \bar z_{jt} = \frac{1}{N} \sum_{i=1}^N z_{ijt}. \tag{373} $$

Let $\bar Z_{ij} = (\bar Z, \bar Z_i, \bar Z_j)$ and $\bar H_{ij} = (D, \bar Z_{ij})$, where

$$ \underset{T \times (m_x+1)}{\bar Z} = (\bar z_1, \ldots, \bar z_T)', \quad \underset{T \times (m_x+1)}{\bar Z_i} = (\bar z_{i1}, \ldots, \bar z_{iT})', \quad \underset{T \times (m_x+1)}{\bar Z_j} = (\bar z_{j1}, \ldots, \bar z_{jT})', $$

and construct the idempotent matrix:

$$ \bar M_{ij} = I_T - \bar H_{ij} \left( \bar H_{ij}' \bar H_{ij} \right)^{-1} \bar H_{ij}'. \tag{374} $$

(i) Identification of $\beta_{ij}$: The $m_x \times m_x$ matrices, $\Psi_{ij,T} = T^{-1} \left( X_{ij}' \bar M_{ij} X_{ij} \right)$, are nonsingular, and $\Psi_{ij,T}^{-1}$ have finite second-order moments.

(ii) Identification of $\beta$: The $m_x \times m_x$ matrix, $\Psi = N^{-2} \sum_{i=1}^N \sum_{j=1}^N \Psi_{ij,T}$, is nonsingular.

Remark 1: It is challenging to develop an appropriate model for accommodating CSD within the multilevel dataset. LeSage and Llano (2015) propose a spatial econometric methodology that introduces spatially-structured origin and destination effects, in such a way that regions treated as origins (destinations) exhibit similar effects to neighbors of the origins (destinations). Choi et al. (2016) develop a multilevel factor model with global and country factors, and propose a sequential principal component estimation procedure. KMSS address the important issue of controlling CSD in 3D panels by adding unobserved heterogeneous global factors to the CTFE specification, and propose the 3D PCCE estimator. The hierarchical multi-factor error model is more parsimonious and structural.

Remark 2: The weights are not necessarily unique. One could use the equal weight, 1/N, for reasonably large N. Alternatively, economic distance-based or time-varying measures could be considered.

Remark 3: The number of observed factors and the number of individual-specific regressors are fixed and known. The number of unobserved factors, $m = m_f + m_\bullet + m_\bullet$, is assumed fixed, but need not be known.

Represent hierarchical cross-section averages as follows:

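$$ \bar z_t = \bar\Xi d_t + \bar\Phi f_t + \bar\Phi_\bullet \bar f_{\bullet t} + \bar\Phi_\bullet \bar f_{\bullet t} + \bar u_t, \tag{375} $$

$$ \bar z_{it} = \bar\Xi_i d_t + \bar\Phi_i f_t + \bar\Phi_\bullet f_{it} + \Phi_i \bar f_{\bullet t} + \bar u_{it}, \tag{376} $$

$$ \bar z_{jt} = \bar\Xi_j d_t + \bar\Phi_j f_t + \Phi_j \bar f_{\bullet t} + \bar\Phi_\bullet f_{jt} + \bar u_{jt}, \tag{377} $$

$$ \bar\Xi = \frac{1}{N^2} \sum_{i=1}^N \sum_{j=1}^N \Xi_{ij}, \quad \bar\Xi_i = \frac{1}{N} \sum_{j=1}^N \Xi_{ij}, \quad \bar\Xi_j = \frac{1}{N} \sum_{i=1}^N \Xi_{ij}, $$

$$ \bar\Phi = \frac{1}{N^2} \sum_{i=1}^N \sum_{j=1}^N \Phi_{ij}, \quad \bar\Phi_i = \frac{1}{N} \sum_{j=1}^N \Phi_{ij}, \quad \bar\Phi_j = \frac{1}{N} \sum_{i=1}^N \Phi_{ij}, \tag{378} $$

$$ \bar\Phi_\bullet = \frac{1}{N^2} \sum_{i=1}^N \sum_{j=1}^N \Phi_j = \frac{1}{N} \sum_{j=1}^N \Phi_j, \quad \bar\Phi_\bullet = \frac{1}{N^2} \sum_{i=1}^N \sum_{j=1}^N \Phi_i = \frac{1}{N} \sum_{i=1}^N \Phi_i, \tag{379} $$

$$ \bar u_t = \frac{1}{N^2} \sum_{i=1}^N \sum_{j=1}^N u_{ijt}, \quad \bar u_{it} = \frac{1}{N} \sum_{j=1}^N u_{ijt}, \quad \bar u_{jt} = \frac{1}{N} \sum_{i=1}^N u_{ijt}, $$

$$ \bar f_{\bullet t} = \frac{1}{N} \sum_{j=1}^N f_{jt}, \quad \bar f_{\bullet t} = \frac{1}{N} \sum_{i=1}^N f_{it}. $$

Combining (375)-(377), we have:

$$ \bar z_{ijt} = \bar\Xi_{ij} d_t + \bar\Phi_{ij} f_{ijt} + \bar u_{ijt}, \tag{380} $$

where

$$ \underset{3(m_x+1) \times 1}{\bar z_{ijt}} = \begin{pmatrix} \bar z_t \\ \bar z_{it} \\ \bar z_{jt} \end{pmatrix}, \quad \underset{m \times 1}{f_{ijt}} = \begin{pmatrix} f_t \\ f_{it} \\ f_{jt} \end{pmatrix}, \quad \bar u_{ijt} = \begin{pmatrix} \bar u_t + \bar\Phi_\bullet \bar f_{\bullet t} + \bar\Phi_\bullet \bar f_{\bullet t} \\ \bar u_{it} + \Phi_i \bar f_{\bullet t} \\ \bar u_{jt} + \Phi_j \bar f_{\bullet t} \end{pmatrix}, $$

$$ \underset{3(m_x+1) \times m_d}{\bar\Xi_{ij}} = \begin{pmatrix} \bar\Xi \\ \bar\Xi_i \\ \bar\Xi_j \end{pmatrix}, \quad \underset{3(m_x+1) \times m}{\bar\Phi_{ij}} = \begin{pmatrix} \bar\Phi & 0 & 0 \\ \bar\Phi_i & \bar\Phi_\bullet & 0 \\ \bar\Phi_j & 0 & \bar\Phi_\bullet \end{pmatrix}, \tag{381} $$

with $m = m_f + m_\bullet + m_\bullet$.

Using (366) and (371), $\bar\Phi_{ij}$ can be represented as follows: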

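$$ \underset{(k+1) \times m_f}{\bar\Phi} = B \bar{\tilde\Gamma} + \begin{pmatrix} \frac{1}{N^2} \sum_{i=1}^N \sum_{j=1}^N (\nu_i + \nu_j + \nu_{ij})' \Gamma_{ij} \\ 0 \end{pmatrix} \tag{382} $$

$$ \underset{(k+1) \times m_\bullet}{\bar\Phi_\bullet} = B \bar{\tilde\Gamma}_\bullet + \begin{pmatrix} \frac{1}{N} \sum_{j=1}^N (\nu_i + \nu_j + \nu_{ij})' \Gamma_j \\ 0 \end{pmatrix} \tag{383} $$

$$ \underset{(k+1) \times m_\bullet}{\bar\Phi_\bullet} = B \bar{\tilde\Gamma}_\bullet + \begin{pmatrix} \frac{1}{N} \sum_{i=1}^N (\nu_i + \nu_j + \nu_{ij})' \Gamma_i \\ 0 \end{pmatrix} \tag{384} $$

$$ \underset{(k+1) \times m_f}{\bar\Phi_i} = B \bar{\tilde\Gamma}_i + \begin{pmatrix} \frac{1}{N} \sum_{j=1}^N (\nu_i + \nu_j + \nu_{ij})' \Gamma_{ij} \\ 0 \end{pmatrix} \tag{385} $$

$$ \underset{(k+1) \times m_f}{\bar\Phi_j} = B \bar{\tilde\Gamma}_j + \begin{pmatrix} \frac{1}{N} \sum_{i=1}^N (\nu_i + \nu_j + \nu_{ij})' \Gamma_{ij} \\ 0 \end{pmatrix} \tag{386} $$

$$ B = \begin{pmatrix} 1 & \beta' \\ 0 & I_k \end{pmatrix}, \quad \bar{\tilde\Gamma} = \begin{pmatrix} \bar\gamma' \\ \bar\Gamma \end{pmatrix}, \quad \bar{\tilde\Gamma}_\bullet = \begin{pmatrix} \bar\gamma_\bullet' \\ \bar\Gamma_\bullet \end{pmatrix}, \quad \bar{\tilde\Gamma}_\bullet = \begin{pmatrix} \bar\gamma_\bullet' \\ \bar\Gamma_\bullet \end{pmatrix}, \quad \bar{\tilde\Gamma}_i = \begin{pmatrix} \bar\gamma_i' \\ \bar\Gamma_i \end{pmatrix}, \quad \bar{\tilde\Gamma}_j = \begin{pmatrix} \bar\gamma_j' \\ \bar\Gamma_j \end{pmatrix}, $$

where $k = m_x$, and $\bar\Gamma$, $\bar\gamma$, $\bar\Gamma_\bullet$, $\bar\gamma_\bullet$, $\bar\Gamma_\bullet$, $\bar\gamma_\bullet$, $\bar\Gamma_i$, $\bar\gamma_i$, $\bar\Gamma_j$ and $\bar\gamma_j$ are defined similarly to $\bar\Phi$, $\bar\Phi_i$, $\bar\Phi_j$, $\bar\Phi_\bullet$ and $\bar\Phi_\bullet$.

Suppose that the rank condition holds: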

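$$ Rank\left( \bar\Phi_{ij} \right) = m \text{ for all } (ij). \tag{387} $$

Then, we obtain from (380):

$$ f_{ijt} = \left( \bar\Phi_{ij}' \bar\Phi_{ij} \right)^{-1} \bar\Phi_{ij}' \left( \bar z_{ijt} - \bar\Xi_{ij} d_t - \bar u_{ijt} \right). \tag{388} $$

It is easily seen that

$$ \bar u_{ijt} = O_p\left( \frac{1}{N} \right) + O_p\left( \frac{1}{\sqrt{NT}} \right) \text{ for each } t, \text{ as } N \to \infty. $$

Therefore,

$$ f_{ijt} - \left( \bar\Phi_{ij}' \bar\Phi_{ij} \right)^{-1} \bar\Phi_{ij}' \left( \bar z_{ijt} - \bar\Xi_{ij} d_t \right) = O_p\left( \frac{1}{N} \right) + O_p\left( \frac{1}{\sqrt{NT}} \right). $$

We can use $\bar h_{ijt} = (d_t', \bar z_{ijt}')'$ as observable proxies for $f_{ijt}$, and consistently estimate $\beta_{ij}$ and their mean $\beta$ by augmenting the regression (362) with $d_t$ and $\bar z_{ijt}$. These are referred to as the 3DCCE estimators.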
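The proxy construction in (373)-(374) can be sketched numerically as follows. This is only a minimal illustration on random numbers: the function name `cce_projection` and the array layout are assumptions made here, equal weights 1/N are used, and the pseudo-inverse replaces the explicit inverse in (374) so that nearly collinear averages do not break the computation.

```python
import numpy as np

def cce_projection(z, d, i, j):
    """Return (H_ij, M_ij) for pair (i, j), following (373)-(374).

    z : (N, N, T, k) array stacking z_ijt = (y_ijt, x_ijt')'
    d : (T, md) array of observed common effects d_t
    """
    z_bar = z.mean(axis=(0, 1))   # (T, k): global average over all pairs
    z_i = z[i].mean(axis=0)       # (T, k): average over destinations j for origin i
    z_j = z[:, j].mean(axis=0)    # (T, k): average over origins i for destination j
    H = np.hstack([d, z_bar, z_i, z_j])
    # H @ pinv(H) is the orthogonal projector onto the column space of H
    M = np.eye(H.shape[0]) - H @ np.linalg.pinv(H)
    return H, M

rng = np.random.default_rng(0)
N, T, k = 5, 30, 3
z = rng.standard_normal((N, N, T, k))   # placeholder data, not a real panel
d = np.ones((T, 1))                     # d_t contains a constant only
H, M = cce_projection(z, d, i=1, j=2)
```

Here `H` has 1 + 3k columns (the constant plus the global, origin and destination averages), and `M` annihilates every column of `H`, which is the property used to purge the factors in the estimators that follow.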


8.3.2 3D Common Correlated Effects Estimator

Individual Specific Coefficients. The 3DCCE estimator of $\beta_{ij}$ is given by

$$ b_{ij} = \left( X_{ij}' \bar M_{ij} X_{ij} \right)^{-1} X_{ij}' \bar M_{ij} y_{ij}. \tag{389} $$
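To make (389) concrete, the sketch below simulates a small panel from (362)-(364) and computes the pair-specific slopes. It is an illustration under simplifying assumptions that are not part of the notes: a single covariate, one factor of each type, a constant as the only observed common effect, and loadings with mean one. The helper names `simulate` and `cce_individual` are hypothetical.

```python
import numpy as np

def simulate(N, T, rng):
    """Draw (y, x, beta) from (362)-(364) with one covariate and one factor of each type."""
    f = rng.standard_normal(T)              # global factor f_t
    fi = rng.standard_normal((N, T))        # origin-specific factors f_it
    fj = rng.standard_normal((N, T))        # destination-specific factors f_jt
    beta = 1.0 + 0.1 * rng.standard_normal((N, N))        # heterogeneous slopes beta_ij
    g_ij, G_ij = 1.0 + 0.2 * rng.standard_normal((2, N, N))   # loadings on f_t
    g_j, G_j = 1.0 + 0.2 * rng.standard_normal((2, N))        # loadings on f_it
    g_i, G_i = 1.0 + 0.2 * rng.standard_normal((2, N))        # loadings on f_jt
    x = (G_ij[:, :, None] * f + G_j[None, :, None] * fi[:, None, :]
         + G_i[:, None, None] * fj[None, :, :] + rng.standard_normal((N, N, T)))
    u = (g_ij[:, :, None] * f + g_j[None, :, None] * fi[:, None, :]
         + g_i[:, None, None] * fj[None, :, :] + rng.standard_normal((N, N, T)))
    return beta[:, :, None] * x + u, x, beta

def cce_individual(y, x):
    """3DCCE slope b_ij for every pair, as in (389) with a scalar regressor."""
    N, _, T = y.shape
    z = np.stack([y, x], axis=-1)                            # (N, N, T, 2)
    z_bar, z_i, z_j = z.mean((0, 1)), z.mean(1), z.mean(0)   # averages in (373)
    b = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            H = np.hstack([np.ones((T, 1)), z_bar, z_i[i], z_j[j]])
            M = np.eye(T) - H @ np.linalg.pinv(H)
            xm, ym = M @ x[i, j], M @ y[i, j]
            b[i, j] = (xm @ ym) / (xm @ xm)
    return b

rng = np.random.default_rng(0)
y, x, beta = simulate(N=12, T=80, rng=rng)
b = cce_individual(y, x)
ols = (x * y).sum(-1) / (x * x).sum(-1)     # pair-by-pair OLS that ignores the factors
cce_bias = abs((b - beta).mean())
ols_bias = abs((ols - beta).mean())
```

Because the factors enter both the regressor and the error with non-zero mean loadings, the unadjusted OLS slopes are biased on average, while the augmented regressions should remove most of that bias even for a modest N.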

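We show the dependence of $b_{ij}$ on the unobserved factors as:

$$ b_{ij} - \beta_{ij} = \left( \frac{X_{ij}' \bar M_{ij} X_{ij}}{T} \right)^{-1} \frac{X_{ij}' \bar M_{ij} F_{ij}}{T} \gamma_{ij}^* + \left( \frac{X_{ij}' \bar M_{ij} X_{ij}}{T} \right)^{-1} \frac{X_{ij}' \bar M_{ij} \varepsilon_{ij}}{T} \tag{390} $$

$$ = \left( \frac{X_{ij}' M_{Q,ij} X_{ij}}{T} \right)^{-1} \frac{X_{ij}' M_{Q,ij} F_{ij}}{T} \gamma_{ij}^* + \left( \frac{X_{ij}' M_{Q,ij} X_{ij}}{T} \right)^{-1} \frac{X_{ij}' M_{Q,ij} \varepsilon_{ij}}{T} + O_p\left( \frac{1}{N} \right) + O_p\left( \frac{1}{\sqrt{NT}} \right), $$

where $F_{ij} = (F, F_i, F_j)$, $\gamma_{ij}^* = (\gamma_{ij}', \gamma_j', \gamma_i')'$ and $M_{Q,ij} = I_T - Q_{ij} \left( Q_{ij}' Q_{ij} \right)^{-1} Q_{ij}'$ with $Q_{ij} = \left( F \bar\Phi', F_i \bar\Phi_\bullet', F_j \bar\Phi_\bullet' \right)$.

Suppose that the rank condition (387) is satisfied. Then,

Theorem 13. Consider the triple-index heterogeneous panel data model, (362)-(364). Suppose that Assumptions 1-4 and 5(i) hold. Then, the 3DCCE estimator of the individual slope coefficients given by (389) is consistent. Further, as $N, T \to \infty$ and $T/N \to K < \infty$,

$$ \sqrt{T} \left( b_{ij} - \beta_{ij} \right) \to_d N(0, V_{ij}), \tag{391} $$

where $V_{ij} = \Sigma_{v,ij}^{-1} \Sigma_{ij\varepsilon} \Sigma_{v,ij}^{-1}$, $\Sigma_{v,ij} = Var(v_{ijt})$, $\Sigma_{ij\varepsilon} = p\lim_{T \to \infty} \left[ \frac{X_{ij}' M_{F,ij} \Omega_{ij\varepsilon} M_{F,ij} X_{ij}}{T} \right]$, and $\Omega_{ij\varepsilon} = E\left( \varepsilon_{ij} \varepsilon_{ij}' \right)$.

Remark: If the rank condition (387) does not hold, we need to show that $\frac{1}{T} X_{ij}' \bar M_{ij} \left( F \gamma_{ij} + F_i \gamma_j + F_j \gamma_i \right)$ converges to zero. We can establish that

$$ b_{ij} - \beta_{ij} = \left( \frac{X_{ij}' M_{Q,ij} X_{ij}}{T} \right)^{-1} \frac{X_{ij}' M_{Q,ij} \varepsilon_{ij}}{T} + o_p(1). $$

$\sqrt{T} \left( b_{ij} - \beta_{ij} \right)$ will be asymptotically normal if $\sqrt{T}/N \to 0$ as $N, T \to \infty$.

3D Common Correlated Effects Mean Group Estimator. The 3DCCEMG estimator is an average of the individual $b_{ij}$:

$$ b_{MG} = \frac{1}{N^2} \sum_{i=1}^N \sum_{j=1}^N b_{ij}. \tag{392} $$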

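Under Assumption 4 and using (390), we decompose $\sqrt{N} \left( b_{MG} - \beta \right)$ and analyse each term to obtain the following theorem.

Theorem 14. Consider the 3D model, (362)-(364). Suppose that Assumptions 1-4 and 5(i) hold. Then, the 3D CCEMG estimator, $b_{MG}$, is consistent. As $N, T \to \infty$,

$$ \sqrt{N} \left( b_{MG} - \beta \right) \to_d N(0, V_{MG}), \quad V_{MG} = \Omega_{\nu\bullet} + \Omega_{\eta\bullet} + \Omega_{\nu\bullet} + \Omega_{\eta\bullet}, \tag{393} $$

$$ \Omega_{\nu\bullet} = \lim_{N,T \to \infty} \frac{1}{N} \sum_{i=1}^N E\left( A_{1,i,NT} \Omega_{\nu i} A_{1,i,NT}' \right), \quad \Omega_{\eta\bullet} = \lim_{N,T \to \infty} \frac{1}{N} \sum_{i=1}^N E\left( A_{2,i,NT} \Omega_{\eta i} A_{2,i,NT}' \right), $$

$$ \Omega_{\nu\bullet} = \lim_{N,T \to \infty} \frac{1}{N} \sum_{j=1}^N E\left( A_{1,j,NT} \Omega_{\nu j} A_{1,j,NT}' \right), \quad \Omega_{\eta\bullet} = \lim_{N,T \to \infty} \frac{1}{N} \sum_{j=1}^N E\left( A_{2,j,NT} \Omega_{\eta j} A_{2,j,NT}' \right). $$

$V_{MG}$ can be consistently estimated by

$$ \hat V_{MG} = \frac{1}{N-1} \sum_{i=1}^N \left( \bar b_i - b_{MG} \right) \left( \bar b_i - b_{MG} \right)' + \frac{1}{N-1} \sum_{j=1}^N \left( \bar b_j - b_{MG} \right) \left( \bar b_j - b_{MG} \right)', \tag{394} $$

where $\bar b_i = \frac{1}{N} \sum_{j=1}^N b_{ij}$ and $\bar b_j = \frac{1}{N} \sum_{i=1}^N b_{ij}$.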
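The mean group average (392) and the variance estimator (394) translate directly into array operations. The sketch below is a hedged illustration: the function name `cce_mean_group` is hypothetical, and the pair-specific estimates are generated synthetically from the random-coefficient structure of Assumption 4 rather than estimated from data. It also computes the double-sum estimator over all pairs for comparison.

```python
import numpy as np

def cce_mean_group(b):
    """Mean group estimate (392) and the variance estimator (394).

    b : (N, N, k) array of pair-specific estimates b_ij.
    Returns b_MG with shape (k,) and the (k, k) estimate of the
    variance of sqrt(N) * (b_MG - beta).
    """
    N = b.shape[0]
    b_mg = b.mean(axis=(0, 1))
    dev_i = b.mean(axis=1) - b_mg      # row i holds (average over j of b_ij) - b_MG
    dev_j = b.mean(axis=0) - b_mg      # row j holds (average over i of b_ij) - b_MG
    v_hat = (dev_i.T @ dev_i + dev_j.T @ dev_j) / (N - 1)
    return b_mg, v_hat

rng = np.random.default_rng(0)
N, k = 200, 2
beta = np.array([1.0, -0.5])
nu_i = 0.3 * rng.standard_normal((N, 1, k))     # origin-specific deviations
nu_j = 0.4 * rng.standard_normal((1, N, k))     # destination-specific deviations
nu_ij = 0.5 * rng.standard_normal((N, N, k))    # pair-specific deviations
b = beta + nu_i + nu_j + nu_ij                  # synthetic b_ij, no estimation error

b_mg, v_hat = cce_mean_group(b)
se = np.sqrt(np.diag(v_hat) / N)                # standard errors of b_MG
naive = np.einsum('ijk,ijl->kl', b - b_mg, b - b_mg) / N**2
```

In this synthetic design the target variance of each element is 0.3² + 0.4² = 0.25; the estimator based on the origin and destination averages should be close to it, while the double-sum version also picks up the pair-specific variance 0.5².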

An important finding is that the dominant terms of $\sqrt{N} \left( b_{MG} - \beta \right)$ are those that involve $\nu_i$, $\nu_j$, $\eta_i$ and $\eta_j$ only, because the terms associated with $\nu_{ij}$ and $\eta_{ij}$ are asymptotically negligible. This explains the $N^{1/2}$ rate of convergence.

The nonparametric variance estimator $\frac{1}{N^2} \sum_{i=1}^N \sum_{j=1}^N \left( b_{ij} - b_{MG} \right) \left( b_{ij} - b_{MG} \right)'$ used by Pesaran (2006) is not consistent, since it gives equal weights to the terms containing $\nu_i$, $\nu_j$, $\eta_i$ and $\eta_j$ and to those containing $\nu_{ij}$ and $\eta_{ij}$. The consistent nonparametric estimator, $\hat V_{MG}$ in (394), ensures that $\nu_{ij}$ and $\eta_{ij}$ are averaged out by the use of $\bar b_i$ and $\bar b_j$.

Remark: Theorem 14 does not require the rank condition to hold as long as the number of unobserved factors $m$ is fixed. We do not require any restriction on the relative rate of $N$ and $T$.

3D Common Correlated Effects Pooled Estimator. Consider the special case where $\beta_{ij}$ are homogeneous, so that efficiency gains from pooling can be achieved. We still allow the coefficients on observed and unobserved common effects to differ across $(ij)$. We derive the pooled estimator of $\beta$, referred to as the 3D CCEP estimator, by

$$ b_P = \left( \sum_{i=1}^N \sum_{j=1}^N X_{ij}' \bar M_{ij} X_{ij} \right)^{-1} \sum_{i=1}^N \sum_{j=1}^N X_{ij}' \bar M_{ij} y_{ij}. \tag{395} $$

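Theorem 15. Consider the 3D model, (362)-(364). Suppose that Assumptions 1-4 and 5(ii) hold. Then,

$$ \sqrt{N} \left( b_P - \beta \right) \to_d N(0, \Psi^{-1} R \Psi^{-1}), \tag{396} $$

$$ \Psi = \lim_{N \to \infty} \frac{1}{N^2} \sum_{i=1}^N \sum_{j=1}^N \Psi_{ij} \text{ with } \Psi_{ij} = E\left[ \frac{X_{ij}' \bar M_{ij} X_{ij}}{T} \right], \tag{397} $$

$$ R = \Omega_{\nu\bullet} + \Omega_{\eta\bullet} + \Omega_{\nu\bullet} + \Omega_{\eta\bullet}, \tag{398} $$

$$ \Omega_{\nu\bullet} = \lim_{N,T \to \infty} \frac{1}{N} \sum_{i=1}^N E\left( A_{1,i,NT} \Omega_{\nu i} A_{1,i,NT}' \right), \quad \Omega_{\eta\bullet} = \lim_{N,T \to \infty} \frac{1}{N} \sum_{i=1}^N E\left( A_{2,i,NT} \Omega_{\eta i} A_{2,i,NT}' \right), $$

$$ \Omega_{\nu\bullet} = \lim_{N,T \to \infty} \frac{1}{N} \sum_{j=1}^N E\left( A_{1,j,NT} \Omega_{\nu j} A_{1,j,NT}' \right), \quad \Omega_{\eta\bullet} = \lim_{N,T \to \infty} \frac{1}{N} \sum_{j=1}^N E\left( A_{2,j,NT} \Omega_{\eta j} A_{2,j,NT}' \right), $$

where $A_{1,i,NT}$, $A_{2,i,NT}$, $A_{1,j,NT}$ and $A_{2,j,NT}$ are defined in (??)-(??). The variance $\Psi^{-1} R \Psi^{-1}$ can be consistently estimated by $\hat\Psi^{-1} \hat R \hat\Psi^{-1}$, where

$$ \hat\Psi = \frac{1}{N^2} \sum_{i=1}^N \sum_{j=1}^N \frac{X_{ij}' \bar M_{ij} X_{ij}}{T}, \tag{399} $$

and

$$ \hat R = \frac{1}{N} \sum_{i=1}^N \left[ \frac{1}{N} \sum_{j=1}^N \left( \frac{X_{ij}' \bar M_{ij} X_{ij}}{T} \right) \left( b_{ij} - b_{MG} \right) \right] \left[ \frac{1}{N} \sum_{j=1}^N \left( b_{ij} - b_{MG} \right)' \left( \frac{X_{ij}' \bar M_{ij} X_{ij}}{T} \right) \right] $$

$$ \qquad + \frac{1}{N} \sum_{j=1}^N \left[ \frac{1}{N} \sum_{i=1}^N \left( \frac{X_{ij}' \bar M_{ij} X_{ij}}{T} \right) \left( b_{ij} - b_{MG} \right) \right] \left[ \frac{1}{N} \sum_{i=1}^N \left( b_{ij} - b_{MG} \right)' \left( \frac{X_{ij}' \bar M_{ij} X_{ij}}{T} \right) \right]. $$

Remark: The asymptotic variance matrix of $b_P$ depends on unobserved factors and loadings, but it is possible to estimate it consistently along lines similar to 3DCCEMG.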
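The pooled estimator (395) and the sandwich estimator built from (399) and the expression for the estimated R can be sketched from per-pair moment matrices. This is a minimal illustration, not the authors' code: the function name `cce_pooled` is hypothetical, and the inputs are generated synthetically (random positive definite moment matrices and slopes with origin and destination effects) instead of being computed from projected data.

```python
import numpy as np

def cce_pooled(A, c):
    """Pooled estimator (395) with its sandwich variance.

    A : (N, N, k, k) array, A[i, j] = X_ij' M_ij X_ij / T
    c : (N, N, k) array,    c[i, j] = X_ij' M_ij y_ij / T
    Returns b_P and the estimated variance of sqrt(N) * (b_P - beta).
    """
    N = A.shape[0]
    b_p = np.linalg.solve(A.sum((0, 1)), c.sum((0, 1)))
    b_ij = np.linalg.solve(A, c[..., None])[..., 0]       # pair-specific slopes
    b_mg = b_ij.mean((0, 1))
    w = np.einsum('ijkl,ijl->ijk', A, b_ij - b_mg)         # A_ij (b_ij - b_MG)
    w_i, w_j = w.mean(1), w.mean(0)                        # averages over j and over i
    psi = A.mean((0, 1))                                   # (399)
    r_hat = (w_i.T @ w_i + w_j.T @ w_j) / N                # two outer-product sums
    psi_inv = np.linalg.inv(psi)
    return b_p, psi_inv @ r_hat @ psi_inv

rng = np.random.default_rng(0)
N, k = 100, 2
beta = np.array([1.0, -0.5])
L = rng.standard_normal((N, N, k, k))
A = L @ np.swapaxes(L, -1, -2) + np.eye(k)                 # synthetic positive definite moments
b_true = (beta + 0.3 * rng.standard_normal((N, 1, k))
          + 0.3 * rng.standard_normal((1, N, k)))          # origin and destination effects
c = np.einsum('ijkl,ijl->ijk', A, b_true)                  # implies b_ij = b_true exactly

b_p, v_p = cce_pooled(A, c)
se_p = np.sqrt(np.diag(v_p) / N)
```

The pooled estimate is a matrix-weighted average of the pair-specific slopes, so in this design it should sit close to the common mean, and the sandwich matrix is symmetric and positive definite by construction.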

The Special Cases. Better convergence rates can be achieved if the hierarchical structure is simplified. We focus on two special cases:

Condition S1: $\eta_i = \eta_j = \nu_i = \nu_j = 0$

Condition S2: $F_i = F_j = 0$.

S2 is more restrictive and is considered by KMSS.

Under Condition S2, the setup is similar to that of Pesaran (2006) because there is no hierarchical factor structure. We can treat the dataset as a $T \times N^2$ panel by amalgamating the two cross-section dimensions into one and applying the 2D CCE estimation procedure. The $\sqrt{N}$ rate will be replaced by $N$, and all the results of Pesaran (2006) and others analysing the CCE estimator hold.

Next, consider the case where S1 holds but not S2. Then,

$$ \sqrt{N} \left( b_{MG} - \beta \right) = \frac{1}{N^{3/2}} \sum_{i=1}^N \sum_{j=1}^N \nu_{ij} + \frac{1}{N^2} \sum_{i=1}^N \sum_{j=1}^N \Psi_{ijT}^{-1} \left( \frac{\sqrt{N} X_{ij}' M_{Q,ij} \varepsilon_{ij}}{T} \right) \tag{400} $$

$$ \qquad + \frac{1}{N^{3/2}} \sum_{i=1}^N \sum_{j=1}^N \chi_{ij,\bullet} + O_p\left( \frac{1}{N} \right) + O_p\left( \frac{1}{\sqrt{NT}} \right). $$

From the proof of Theorem 14, the magnitude of all terms on the RHS of (400) is still $N$ as long as $N/T \to 0$, since $\frac{1}{\sqrt{NT}} = o\left( \frac{1}{N} \right)$. Normality does not follow since the $O_p\left( \frac{1}{N} \right)$ term on the RHS of (400) is not negligible. In this case, the asymptotic variance estimators in Pesaran (2006) become relevant only if normality holds. In particular,

$$ \hat V_{MG} = \frac{1}{N^2} \sum_{i=1}^N \sum_{j=1}^N \left( b_{ij} - b_{MG} \right) \left( b_{ij} - b_{MG} \right)', \tag{401} $$

for the mean group estimator, and

$$ \hat R = \frac{1}{N^2} \sum_{i=1}^N \sum_{j=1}^N \left( \frac{X_{ij}' \bar M_{ij} X_{ij}}{T} \right) \left( b_{ij} - b_{MG} \right) \left( b_{ij} - b_{MG} \right)' \left( \frac{X_{ij}' \bar M_{ij} X_{ij}}{T} \right), \tag{402} $$

for the pooled estimator.

If Condition S1 is considered too restrictive, as it implies homogeneity for the coefficients of the factors in $u_{ijt}$, we may entertain the more general setup:

$$ \gamma_i = \gamma_{ij} = \gamma_\bullet + \eta_{ij}, \quad \gamma_j = \gamma_{ij} = \gamma_\bullet + \eta_{ij}. $$

Because of the double cross-sectional averaging, $\eta_{ij}$ and $\eta_{ij}$ are negligible, since terms associated with $\chi_{ij,\bullet}$ and $\chi_{ij,\bullet}$ decay sufficiently fast to give the same fast convergence rate as under S1.

8.3.3 Empirical Application

Anderson and van Wincoop (2003) show that bilateral trade depends on bilateral trade barriers, but relative to the product of the two regions' Multilateral Resistance Indices, i.e. the bilateral barrier relative to the average trade barriers that both regions face with all their trading partners. They derive the following system of structural gravity equations:

$$ X_{ij} = \frac{Y_i Y_j}{Y} \left( \frac{t_{ij}}{\Pi_i P_j} \right)^{1-\sigma} \tag{403} $$

$$ \Pi_i^{1-\sigma} = \sum_j \left( \frac{t_{ij}}{P_j} \right)^{1-\sigma} \frac{Y_j}{Y} \quad \text{and} \quad P_j^{1-\sigma} = \sum_i \left( \frac{t_{ij}}{\Pi_i} \right)^{1-\sigma} \frac{Y_i}{Y} \tag{404} $$

where $X_{ij}$ are exports from $i$ to $j$; $Y_i$, $Y_j$ and $Y$ are GDP for $i$ (exporter), $j$ (importer) and the world; $t_{ij}$ (> 1) is one plus the tariff equivalent of trade costs of imports of $j$ from $i$; $\sigma$ (> 1) is the elasticity of substitution with CES preferences; $\Pi_i$ is the ease of access of exporter $i$; and $P_j$ is the ease of access of importer $j$. $P_j$ and $\Pi_i$ are called inward and outward multilateral resistance. Omitting the multilateral resistance terms (MTRs) induces potentially severe bias.

Consider the log-linearised specification of (403):

$$ \ln X_{ij} = \beta_0 + \beta_1 \ln Y_i + \beta_2 \ln Y_j + \beta_3 \ln t_{ij} + \beta_4 \ln P_i + \beta_5 \ln P_j + \varepsilon_{ij} \tag{405} $$

where $P_i$ and $P_j$ are unobservable MTRs, and $t_{ij}$ contain both barriers and incentives to trade between $i$ and $j$. Subsequent research has focused on estimating (405) by replacing the unobservable MTRs with $N$ country-specific dummies, $\mu_i$ and $\mu_j$. We extend (405) into 3D panels:

$$ \ln X_{ijt} = \beta_0 + \beta_1 \ln Y_{it} + \beta_2 \ln Y_{jt} + \beta_3 \ln t_{ijt} + \beta_4 \ln P_{it} + \beta_5 \ln P_{jt} + \varepsilon_{ijt}, \tag{406} $$

where we should allow MTRs to vary over time. Baltagi et al. (2003) propose:

$$ u_{ijt} = \alpha_{ij} + \theta_{it} + \theta_{jt}^* + \varepsilon_{ijt}, \tag{407} $$

which contains bilateral pair-fixed effects $\alpha_{ij}$ as well as origin (exporter) and destination (importer) country-time fixed effects (CTFE), $\theta_{it}$ and $\theta_{jt}^*$. This approach is popular for measuring the impacts of the MTRs of exporters and importers in structural gravity studies.

The main drawback of the CTFE approach lies in the assumption that bilateral trade flows are independent of what happens to the rest of the trading world. Recently, KMSS extend the 3D panel data model (362) with the more general error components:

$$ u_{ijt} = \alpha_{ij} + \theta_{it} + \theta_{jt}^* + \pi_{ij} \theta_t + \varepsilon_{ijt}, \tag{408} $$

which attempts to model residual CSD via the unobserved heterogeneous global factor $\theta_t$ in addition to the CTFEs. The CTFE estimator is biased because it fails to remove heterogeneous global factors that would be correlated with the covariates. KMSS develop the two-step consistent 3D-PCCE estimation procedure by approximating global factors with double cross-section averages of the dependent variable and regressors and applying the 3D-within transformation. Following this research trend, in this paper, we develop the hierarchical multi-factor error components specification, (363), which is more structural and parsimonious than (408).

The data. We collect the dataset over the period 1970-2013 (44 years), and consider two control groups: the 210 country-pairs of the EU15 member countries with 11 Euro countries (Austria, Belgium-Luxemburg, Finland, France, Germany, Greece, Ireland, Italy, Netherlands, Portugal, Spain) and 4 control countries (Denmark, Norway, Sweden, the UK); and the 342 country-pairs among 19 countries, comprising the EU15 countries and 4 non-EU OECD countries (Australia, Canada, Japan and the US).

We collect the bilateral export flows from the IMF. The data start from 1970 as the German data are unavailable in the 1960s. There are no missing data, so we consider a balanced panel. Our sample period covers several important steps of economic integration, such as the European Monetary System in 1979 and the Single Market in 1993, all of which can be regarded as promoting intra-EU trade.

Empirical specification: We consider the 3D panel gravity specification:

$$ \ln EXP_{ijt} = \beta_0 + \beta_1 CEE_{ijt} + \beta_2 EMU_{ijt} + \beta_3 SIM_{ijt} + \beta_4 RLF_{ijt} + \beta_5 \ln GDP_{it} + \beta_6 \ln GDP_{jt} + \beta_7 RER_t + \gamma_1 DIS_{ij} + \gamma_2 BOR_{ij} + \gamma_3 LAN_{ij} + u_{ijt}, \tag{409} $$

where the dependent variable, $EXP_{ijt}$, is the export flow from country $i$ to country $j$ at time $t$; CEE and EMU are dummies for European Community membership and for European Monetary Union; SIM is the logarithm of an index that captures the relative size of two countries and is bounded between zero (absolute divergence) and 0.5 (equal size); RLF is the logarithm of the absolute value of the difference between per capita GDPs of trading countries; RER represents the logarithm of common real exchange rates; $GDP_{it}$ and $GDP_{jt}$ are logged GDPs of exporter and importer; the logarithm of geographical distance (DIS) and the dummies for common language (LAN) and for common border (BOR) represent time-invariant bilateral barriers.

We apply the four estimators considered in the MC simulations, namely the two-way within estimator and the three versions of the 3D CCEP estimator. We also report the CD results applied to the residuals and the estimates of the CSD exponent (α). We focus on investigating the impacts of $t_{ij}$ that contain both barriers and incentives to trade, namely the two dummy variables CEE (equal to one when both countries belong to the European Community) and EMU (equal to one when both trading partners adopt the same currency). Both are expected to exert a positive impact on export flows. The empirical evidence is mixed, though recent studies by Mastromarco et al. (2015) and Gunnella et al. (2015), which control for strong CSD in 2D panels, find modest but significant effects (7 to 10%) of the euro on intra-EU trade flows. KMSS (2017) apply a 3D PCCE estimator, finding that the EMU impact on exports is about 8%.
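The notes describe SIM and RLF in words only. The sketch below shows one way to build them, assuming the size-similarity index takes the conventional form 1 - s_i² - s_j², where s_i and s_j are the two countries' shares of their combined GDP. That index is an assumption made here; it does match the stated bounds of zero (absolute divergence) and 0.5 (equal size). The function names are hypothetical.

```python
import numpy as np

def sim_index(gdp_i, gdp_j):
    """Log of the assumed similarity index 1 - s_i^2 - s_j^2 (s = GDP shares of the pair)."""
    total = gdp_i + gdp_j
    index = 1.0 - (gdp_i / total) ** 2 - (gdp_j / total) ** 2   # lies in (0, 0.5]
    return np.log(index)

def rlf(gdp_pc_i, gdp_pc_j):
    """Log of the absolute difference between per capita GDPs."""
    return np.log(np.abs(gdp_pc_i - gdp_pc_j))

equal_size = sim_index(1.0, 1.0)        # two economies of the same size
unequal_size = sim_index(1.0, 99.0)     # one economy dwarfs the other
gap = rlf(30.0, 20.0)                   # per capita GDPs in arbitrary units
```

With equal-sized economies the index reaches its upper bound of 0.5, and it falls towards zero as one economy dominates the pair.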

The Estimation Results for the EU15 Countries. Table 5 reports the panel gravity estimation results for the 210 country-pairs among the EU15 member countries over the period 1970-2013 (44 years). The FE estimator suffers from strong CSD while the 3DCCEP estimators display a lower degree of CSD. The CD diagnostic test by Pesaran (2015) fails to reject the null of weak CSD for both 3DCCEPL and 3DCCEPGL. This is also supported by the smaller estimates of α for 3DCCEPL (0.624) and 3DCCEPGL (0.609), close to a moderate range of weak CSD.

Factor approximations. Theoretically, we should employ the entire set of cross-section averages to approximate heterogeneous global and local factors. In practice, this may raise an issue of multicollinearity. Further, to avoid the curse of dimensionality, we search for an optimal subset of cross-sectional averages. In the aftermath of the global financial crisis, export flows display a negative average growth as shown below:

Export Growth      70/80   80/90   90/00   00/10   10/13
EU15 + 4 OECD       7.06    6.25    4.35    2.16   -0.34
EU15                8.86    7.37    3.92    2.82   -2.05

Hence, we also add t² as an observed factor, which helps to capture the confounding effect of the crisis.

We focus on the 3DCCEPGL estimation results, which have the lowest degree of CSD. All the coefficients are significant and their signs are consistent with our a priori expectations. The effect of foreign GDP is substantially higher than that of home GDP. The effects of SIM and RER are positive, the latter implying that a depreciation of the home currency leads to a significant increase in exports. SIM boosts real export flows, which suggests that intra-industry trade is the main part of trade in the EU. Importantly, the impacts of EMU and CEE are significant, but substantially smaller than the potentially biased FE estimates. The Euro and CEE impacts drop sharply, from 0.099 and 0.074 to 0.03 and 0.05. Other estimators provide rather unreliable results [48].

The 3DCCEP estimator wipes out the time-invariant regressors. Following the 2-step approach as in Serlenga and Shin (2007), we can estimate γ by the between estimator:

$$ \hat d_{ijt} = \alpha_{ij} + \gamma_1 DIS_{ij} + \gamma_2 BOR_{ij} + \gamma_3 LAN_{ij} + u_{ijt} \tag{410} $$

where $\hat d_{ijt} = y_{ijt} - \hat\beta' x_{ijt}$ with $\hat\beta$ being the 3D CCEP estimator. We test the validity of the following hypothesis: if the Euro had a positive effect on EU trade by reducing bilateral barriers and eliminating exchange-related uncertainties and transaction costs, then this caused a decrease in the trade impacts of bilateral barriers (e.g. Cafiso, 2010). A declining trend after 1999 will support the hypothesis that the Euro helps to promote more EU integration. To this end we estimate (410) by the cross-section regression at each period, and produce time-varying coefficients of γ.

[48] For example, the impact of home GDP on exports is surprisingly larger than the foreign impact while both Euro and CEE impacts seem to be rather high for the FE. The RER coefficient is significantly negative for the CCEP with the global approximation only whereas the CEE impact is insignificant and the Euro impact is almost negligible for the CCEP with the local approximation only.

Table 2 (Table 5): Estimation Results for 15 EU Countries

             FE        3DCCEPG    3DCCEPL    3DCCEPGL
gdph        1.517       0.230      0.023      0.342
           (0.044)     (0.036)    (0.037)    (0.124)
gdpf        0.953       1.478      0.779      1.498
           (0.044)     (0.037)    (0.057)    (0.031)
sim        -0.045       0.639     -0.012      0.197
           (0.060)     (0.069)    (0.056)    (0.075)
rlf         0.030      -0.002      0.002      0.006
           (0.006)     (0.005)    (0.002)    (0.004)
rer         0.012      -0.046      0.016      0.103
           (0.008)     (0.007)    (0.004)    (0.010)
euro        0.099       0.030      0.012      0.030
           (0.016)     (0.003)    (0.003)    (0.003)
cee         0.074       0.066      0.007      0.050
           (0.014)     (0.007)    (0.007)    (0.013)
CD stat     206.6       4.67       2.33       2.72
α           0.91        0.78       0.62       0.61
         (0.90-0.93) (0.72-0.84) (0.59-0.66) (0.57-0.65)

Notes: FE is the two-way fixed effect estimator. 3DCCEPG is the CCEP estimator with only the global factors approximated by ft = {export..t, gdp..t, sim..t, rlf..t, cee..t, t, t²}. 3DCCEPL is the CCEP estimator with only the local factors approximated by fi∘t = {yi.t, gdpi.t} and f∘jt = {sim.jt, rlf.jt}. 3DCCEPGL is the CCEP estimator with both global and local factors approximated by ft = {export..t, gdp..t, sim..t, rlf..t, cee..t, t, t²} and fi∘t = {simi.t, rlfi.t, reri.t}. * and ** stand for significance at 5% and 10% level. CD test refers to testing the null hypothesis of residual cross-section independence or weak dependence (Pesaran, 2015). α is the estimate of the CSD exponent with 90% confidence bands inside parentheses.

Figure 1 shows the time-varying estimates of γ in (410), using the CCEP estimator with both local and global factor approximations. The border effect declined until the mid 1980s and has since been quite stable, except for a slight (statistically insignificant) dip during the global financial crisis. The language effect decreased steadily until the end of the 1980s, reflecting the progressive lessening of restrictions on labor mobility within the EU, which encouraged migration and reduced the relative importance of trade costs and cultural differences. Since the introduction of the Euro in 1999, both language and border effects have become flat, suggesting that EU integration may have reached a near-completion stage. This is consistent with the currency union formation hypothesis by Frankel (2005) that countries which decide to join a currency union are self-selected on the basis of distinctive features shared by EU members. The effect of distance was on a declining trend from the mid 1980s, but started to rise slightly after 1999.

The Estimation Results for the EU15 plus 4 OECD Countries. Table 3 reports the estimation results for an enlarged sample of the 342 country-pairs among the EU15 member countries plus four more countries (Australia, Canada, Japan and the US). Again we focus on the estimation results for the 3DCCEPGL, which shows the lowest degree of CSD. The results are qualitatively similar to those reported in Table 2. All the coefficients are significant with the expected signs. The effect of foreign GDP is substantially higher than the home GDP effect. The effect of SIM is slightly higher (from 0.2 to 0.22) but that of RLF becomes negligibly negative. The impact of RER is stronger (from 0.10 to 0.18), implying a stronger terms of trade effect.

The impacts of EMU and CEE are significant, though their magnitudes become smaller than those with the EU15 countries, namely from 3% to 1.5% and from 5% to 3%. The smaller effects for the enlarged sample might reflect trade diversion between the Euro and non-Euro areas. The effects of the EMU on trade will differ with respect to the selected control group and depend on the composition of treatment and control groups (e.g. Baier and Bergstrand, 2009). Other estimators provide rather misleading results. In particular, the FE estimation provides the opposite result: both Euro and CEE impacts increase substantially, from 0.099 and 0.074 to 0.258 and 0.161.

Figure 2 displays the time-varying estimates of γ, using the CCEP estimator with both local and global factor approximations. Both border and language effects show a pattern similar to the case with the EU15 countries. Again, we do not observe any evidence in favour of the Euro effect on trade integration, consistent with the currency union formation hypothesis by Frankel (2005). The effect of distance has been slightly increasing over the whole period. This is consistent with the meta-study by Disdier and Head (2008), who document that the trade elasticity with respect to distance has not declined, but has rather increased recently.

Table 3: Table 6: Estimation Results for 15 EU plus 4 OECD countries

            FE             3DCCEPG        3DCCEPL        3DCCEPGL
gdph        1.066          0.531          0.069          0.169
            (0.019)        (0.016)        (0.010)        (0.055)
gdpf        0.904          1.419          1.262          1.417
            (0.020)        (0.020)        (0.015)        (0.017)
sim         0.332          0.109          0.100          0.220
            (0.029)        (0.021)        (0.013)        (0.023)
rlf         0.027          -0.008         0.010          -0.004
            (0.004)        (0.002)        (0.001)        (0.002)
rer         0.058          0.086          0.074          0.179
            (0.008)        (0.004)        (0.002)        (0.005)
euro        0.258          0.012          0.012          0.014
            (0.009)        (0.003)        (0.001)        (0.002)
cee         0.161          0.021          0.007          0.030
            (0.028)        (0.017)        (0.001)        (0.012)
CD stat     243.33         3.272          2.331          3.201
α           0.90           0.74           0.65           0.62
            (0.88-0.920)   (0.69-0.76)    (0.61-0.69)    (0.57-0.66)

Notes: FE is the two-way fixed effect estimator. 3DCCEPG is the CCEP estimator with only the global factors approximated by $f_t = \{export_{\cdot\cdot t}, gdp_{\cdot\cdot t}, sim_{\cdot\cdot t}, rlf_{\cdot\cdot t}, cee_{\cdot\cdot t}, t\}$. 3DCCEPL is the CCEP estimator with only the local factors approximated by $f_{i\circ t} = \{y_{i\cdot t}, gdp_{i\cdot t}\}$ and $f_{\circ jt} = \{sim_{\cdot jt}, rlf_{\cdot jt}\}$. 3DCCEPGL is the CCEP estimator with both global and local factors approximated by $f_t = \{export_{\cdot\cdot t}, gdp_{\cdot\cdot t}, sim_{\cdot\cdot t}, rlf_{\cdot\cdot t}, cee_{\cdot\cdot t}, t\}$ and $f_{i\circ t} = \{sim_{i\cdot t}, rlf_{i\cdot t}, rer_{i\cdot t}\}$. * and ** stand for significance at 5% and 10% level. CD test refers to testing the null hypothesis of residual cross-section independence or weak dependence (Pesaran, 2015). α is the estimate of the CSD exponent with 90% confidence bands inside parentheses.


8.4 The 3D Panels with Regional/Global factor specification

We consider the 3D heterogeneous panel data model given by

\[ y_{ijt} = \beta_{ij}' x_{ijt} + \delta_{ij}' d_t + e_{ijt}, \quad i = 1,\dots,R,\ j = 1,\dots,N_i,\ t = 1,\dots,T, \tag{411} \]

where $y_{ijt}$ is the dependent variable observed across three indices, $i = 1,\dots,R$ denotes the $i$th region, $j = 1,\dots,N_i$ denotes the $j$th individual variable of region $i$ (say, the GDP growth of country $j$ in region $i$), $x_{ijt}$ is the $m_x \times 1$ vector of covariates and $d_t$ is the $m_d \times 1$ vector of observed common effects including deterministic components such as constants and trends. $\beta_{ij}$ and $\delta_{ij}$ are $m_x \times 1$ and $m_d \times 1$ vectors of parameters.

We allow $e_{ijt}$ to follow the multi-level factor structure:

\[ e_{ijt} = \gamma_{ij}^{g\prime} g_t + \gamma_{ij}^{i\prime} f_t^{i} + \varepsilon_{ijt}, \tag{412} \]

where $g_t = (g_{1t},\dots,g_{m_g t})'$ is an $m_g \times 1$ vector of global factors, $f_t^{i} = (f_{1t}^{i},\dots,f_{m_i t}^{i})'$ collects the $m_i \times 1$ vector of local factors in region $i = 1,\dots,R$, $\gamma_{ij}^{g}$ and $\gamma_{ij}^{i}$ are $m_g \times 1$ and $m_i \times 1$ vectors of heterogeneous loadings, and $\varepsilon_{ijt}$ are idiosyncratic errors. Define the total number of observations by $N = \sum_{i=1}^{R} N_i$.

Extensions here...

8.4.1 The plans

Here we may develop multiple projects and papers.

• Consider the simpler model given by (411) and (412). Here we may develop the 3D CCE and/or the 3D PCA extensions (see below).

• An extension to the general case

\[ y_{ijt} = \beta_{ij}' x_{ijt} + \beta_{j}' x_{it} + \beta_{i}' x_{jt} + \delta_{ij}' d_t + u_{ijt} \]

would be straightforward, though we stick to the simpler setting for tractability.

• Consider the different hierarchical factor structure, (432) (say, source-destination countries) considered by KMS (2018) and Lu and Su (2018) and/or the three-level or overlapping factor structure, (454), analysed by Breitung and Eickmeier (2016). Here we may also develop the 3D CCE and/or the 3D PCA extensions by developing different modelling techniques.

• Eventually, we wish to develop alternative approaches by combining the (local) spatial effects and global (or multi-level) factors. Indeed, this would be my ultimate goal. I consider extending the QML-EM algorithms proposed by Bai and Li in a sequence of papers (all published in top journals).

• There will be abundant empirical topics in Economics, Finance, Social Science and so on. One of my Ph.D. students, Rui, has been working on this application in extending the 2D CAPM model into 3D CAPM or APT models.

8.4.2 The KMS CCE extension

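Unobserved factors, $g_t$ and $f_t^{i}$, are likely to be correlated with $x_{ijt}$. Thus, we consider the data generating process for $x_{ijt}$ as follows:

\[ x_{ijt} = D_{ij} d_t + \Gamma_{ij}^{g} g_t + \Gamma_{ij}^{i} f_t^{i} + v_{ijt}, \tag{413} \]

where $D_{ij}$ is the $(m_x \times m_d)$ parameter matrix on observed common effects, $\Gamma_{ij}^{g}$ and $\Gamma_{ij}^{i}$ are $(m_x \times m_g)$ and $(m_x \times m_i)$ factor loading matrices, and $v_{ijt}$ are the idiosyncratic errors.

Combining (411)-(413), we have:

\[ z_{ijt} = \begin{pmatrix} y_{ijt} \\ x_{ijt} \end{pmatrix} = \Xi_{ij} d_t + \Phi_{ij}^{g} g_t + \Phi_{ij}^{i} f_t^{i} + u_{ijt}, \tag{414} \]

where

\[ \Xi_{ij} = \begin{pmatrix} \delta_{ij}' + \beta_{ij}' D_{ij} \\ D_{ij} \end{pmatrix},\quad \Phi_{ij}^{g} = \begin{pmatrix} \gamma_{ij}^{g\prime} + \beta_{ij}' \Gamma_{ij}^{g} \\ \Gamma_{ij}^{g} \end{pmatrix},\quad \Phi_{ij}^{i} = \begin{pmatrix} \gamma_{ij}^{i\prime} + \beta_{ij}' \Gamma_{ij}^{i} \\ \Gamma_{ij}^{i} \end{pmatrix},\quad u_{ijt} = \begin{pmatrix} \varepsilon_{ijt} + \beta_{ij}' v_{ijt} \\ v_{ijt} \end{pmatrix}. \]

The ranks of $\Phi_{ij}^{g}$ and $\Phi_{ij}^{i}$ are determined by the ranks of the following matrices:

\[ \underset{(m_x+1)\times m_g}{\widetilde{\Gamma}_{ij}^{g}} = \begin{pmatrix} \gamma_{ij}^{g\prime} \\ \Gamma_{ij}^{g} \end{pmatrix},\qquad \underset{(m_x+1)\times m_i}{\widetilde{\Gamma}_{ij}^{i}} = \begin{pmatrix} \gamma_{ij}^{i\prime} \\ \Gamma_{ij}^{i} \end{pmatrix}. \]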

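For each $(i,j)$, we rewrite (411) and (414) in matrix notation:

\[ y_{ij} = X_{ij}\beta_{ij} + D\delta_{ij} + G\gamma_{ij}^{g} + F^{i}\gamma_{ij}^{i} + \varepsilon_{ij}, \tag{415} \]

\[ z_{ij} = D\,\Xi_{ij}' + G\,\Phi_{ij}^{g\prime} + F^{i}\,\Phi_{ij}^{i\prime} + u_{ij}, \tag{416} \]

where

\[ \underset{T\times 1}{y_{ij}} = (y_{ij1},\dots,y_{ijT})',\quad \underset{T\times m_x}{X_{ij}} = (x_{ij1},\dots,x_{ijT})',\quad \underset{T\times m_d}{D} = (d_1,\dots,d_T)',\quad \underset{T\times (m_x+1)}{z_{ij}} = (z_{ij1},\dots,z_{ijT})', \tag{417} \]

\[ \underset{T\times m_g}{G} = (g_1,\dots,g_T)',\quad \underset{T\times m_i}{F^{i}} = (f_1^{i},\dots,f_T^{i})',\quad \underset{T\times 1}{\varepsilon_{ij}} = (\varepsilon_{ij1},\dots,\varepsilon_{ijT})',\quad \underset{T\times (m_x+1)}{u_{ij}} = (u_{ij1},\dots,u_{ijT})'. \]

We develop the estimation and inference theory for $E(\beta_{ij}) = \beta$ as well as the individual coefficients, $\beta_{ij}$. We make the following assumptions: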


• Check the assumptions??

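Assumption 1. Common Effects: The $\left(m_d + m_g + \sum_{i=1}^{R} m_i\right) \times 1$ vector of common factors $(d_t', g_t', f_t^{1\prime},\dots,f_t^{R\prime})'$ is covariance stationary with absolutely summable autocovariances, distributed independently of $\varepsilon_{ijt'}$ and $v_{ijt'}$ for all $N$, $i$, $j$, $t$ and $t'$. $(g_t', f_t^{1\prime},\dots,f_t^{R\prime})'$ are zero-mean processes and are mutually uncorrelated.

Assumption 2. Individual-specific Errors: $\varepsilon_{ijt}$ and $v_{ijt'}$ are distributed independently for all $i$, $j$, $t$ and $t'$, and they are distributed independently of $x_{ijt}$ and $d_t$.

Assumption 3. Factor Loadings: The factor loadings are independently and identically distributed across $(i,j)$, and of the individual-specific errors $\varepsilon_{ijt}$ and $v_{ijt}$, the common factors for all $i$, $j$ and $t$, with finite means and finite variances. In particular, we have:

\[ \gamma_{ij}^{g} = \gamma^{g} + \eta_{ij}^{g},\qquad \gamma_{ij}^{i} = \gamma_{i}^{i} + \eta_{i}^{i}\ \text{ or }\ \gamma_{ij}^{i} = \gamma_{i}^{i} + \eta_{ij}^{i}, \tag{418} \]

\[ \Gamma_{ij}^{g} = \Gamma^{g} + \xi_{ij}^{g},\qquad \Gamma_{ij}^{i} = \Gamma_{i}^{i} + \xi_{i}^{i}\ \text{ or }\ \Gamma_{ij}^{i} = \Gamma_{i}^{i} + \xi_{ij}^{i}, \tag{419} \]

where $\eta_{ij}^{g} \sim iid(0,\Omega_{\eta^g})$, $\xi_{ij}^{g} \sim iid(0,\Omega_{\xi^g})$, $\eta_{i}^{i} \sim iid(0,\Omega_{\eta^i})$, ?? $\xi_{i} \sim iid(0,\Omega_{\xi i\bullet})$, $\eta_{j} \sim iid(0,\Omega_{\eta \bullet j})$ and $\xi_{j} \sim iid(0,\Omega_{\xi \bullet j})$. Further, $\|\gamma\| < K$, $\|\gamma_{i\bullet}\| < K$, $\|\gamma_{\bullet j}\| < K$, $\|\Gamma\| < K$, $\|\Gamma_{i\bullet}\| < K$, and $\|\Gamma_{\bullet j}\| < K$ for some positive constant $K < \infty$.

Assumption 4. Random Slope Coefficients: $\beta_{ij}$ follow the random coefficient specification: ??

\[ \beta_{ij} = \beta + \nu_{i} + \nu_{j} + \nu_{ij}\ \text{ with }\ \nu_{i} \sim iid(0,\Omega_{\nu i\bullet}),\ \nu_{j} \sim iid(0,\Omega_{\nu \bullet j}),\ \nu_{ij} \sim iid(0,\Omega_{\nu}) \tag{420} \]

or

\[ \beta_{ij} = \beta + \nu_{j} + \nu_{ij}\ \text{ with }\ \nu_{j} \sim iid(0,\Omega_{\nu \bullet j}),\ \nu_{ij} \sim iid(0,\Omega_{\nu}), \]

where $\|\beta\| < K$ and $\nu_{ij}$, $\nu_{i}$, $\nu_{j}$ are distributed independently of one another, and of $\gamma_{ij}$, $\Gamma_{ij}$, $\varepsilon_{ijt}$, $v_{ijt}$ and $g_t$ for all $i$, $j$ and $t$.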

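Assumption 5. Identification of $\beta_{ij}$ and $\beta$: Let

\[ \bar{z}_{t} = \frac{1}{R}\sum_{i=1}^{R}\frac{1}{N_i}\sum_{j=1}^{N_i} z_{ijt},\qquad \bar{z}_{i\cdot t} = \frac{1}{N_i}\sum_{j=1}^{N_i} z_{ijt},\quad i = 1,\dots,R, \tag{421} \]

and $\bar{Z}_{ij} = (\bar{Z}, \bar{Z}_{i\cdot})$ and $\bar{H}_{ij} = (D, \bar{Z}_{ij})$, where

\[ \underset{T\times(1+m_x)}{\bar{Z}} = (\bar{z}_{1},\dots,\bar{z}_{T})',\qquad \underset{T\times(1+m_x)}{\bar{Z}_{i\cdot}} = (\bar{z}_{i\cdot 1},\dots,\bar{z}_{i\cdot T})'. \]

Further,

\[ \bar{M}_{ij} = I_T - \bar{H}_{ij}\left(\bar{H}_{ij}'\bar{H}_{ij}\right)^{-1}\bar{H}_{ij}'. \tag{422} \]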


(i) Identification of $\beta_{ij}$: The $m_x \times m_x$ matrices, $\Psi_{ij,T} = T^{-1}\left(X_{ij}'\bar{M}_{ij}X_{ij}\right)$, are nonsingular, and $\Psi_{ij,T}^{-1}$ have finite second-order moments for all $(i,j)$.

(ii) Identification of $\beta$: The $m_x \times m_x$ matrix, $N^{-2}\sum_{i=1}^{N}\sum_{j=1}^{N}\Psi_{ij,T}$, is nonsingular.

Remark 16 The factors are assumed to have zero mean for simplicity. Any means can be subsumed in $\delta_{ij}$. Further, they are assumed mutually uncorrelated to ensure that cross-sectional averages of local factors converge to zero. This is another simplifying assumption, since some weak cross-sectional dependence across local factors could be allowed, in exact analogy to weak cross-sectional dependence across idiosyncratic shocks.

Remark 17 The weights are not necessarily unique, but they do not affect the asymptotic results advanced in this paper (see also (?)). We focus, for simplicity, on equal weights, $1/N$. Alternatively, economic distance-based or time-varying measures such as trade weights or input-output shares could be considered (e.g. (?); (?)). The number of observed factors, $m_d$, and the number of individual-specific regressors, $m_x$, are assumed fixed. The number of unobserved factors, $m = m_g + m_i$, is assumed fixed, but need not be known.

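Using (414), we represent the hierarchical cross-section averages in (421) as follows:

\[ \bar{z}_{i\cdot t} = \frac{1}{N_i}\sum_{j=1}^{N_i}\left(\Xi_{ij}d_t + \Phi_{ij}^{g}g_t + \Phi_{ij}^{i}f_t^{i} + u_{ijt}\right) = \bar{\Xi}_{i\cdot}d_t + \bar{\Phi}_{i\cdot}^{g}g_t + \bar{\Phi}_{i\cdot}^{i}f_t^{i} + \bar{u}_{i\cdot t} = \bar{\Xi}_{i\cdot}d_t + \left(\bar{\Phi}_{i\cdot}^{g},\ \bar{\Phi}_{i\cdot}^{i}\right)\begin{pmatrix} g_t \\ f_t^{i} \end{pmatrix} + \bar{u}_{i\cdot t}, \tag{423} \]

where

\[ \bar{\Xi}_{i\cdot} = \frac{1}{N_i}\sum_{j=1}^{N_i}\Xi_{ij},\quad \bar{\Phi}_{i\cdot}^{g} = \frac{1}{N_i}\sum_{j=1}^{N_i}\Phi_{ij}^{g},\quad \bar{\Phi}_{i\cdot}^{i} = \frac{1}{N_i}\sum_{j=1}^{N_i}\Phi_{ij}^{i},\quad \bar{u}_{i\cdot t} = \frac{1}{N_i}\sum_{j=1}^{N_i}u_{ijt}, \]

and

\[ \bar{z}_{t} = \frac{1}{R}\sum_{i=1}^{R}\frac{1}{N_i}\sum_{j=1}^{N_i}\left(\Xi_{ij}d_t + \Phi_{ij}^{g}g_t + \Phi_{ij}^{i}f_t^{i} + u_{ijt}\right) = \bar{\Xi}_{\cdot\cdot}d_t + \bar{\Phi}_{\cdot\cdot}^{g}g_t + \left(\frac{1}{R}\sum_{i=1}^{R}\bar{\Phi}_{i\cdot}^{i}f_t^{i}\right) + \bar{u}_{\cdot\cdot t}, \tag{424} \]

where

\[ \bar{\Xi}_{\cdot\cdot} = \frac{1}{R}\sum_{i=1}^{R}\frac{1}{N_i}\sum_{j=1}^{N_i}\Xi_{ij},\quad \bar{\Phi}_{\cdot\cdot}^{g} = \frac{1}{R}\sum_{i=1}^{R}\frac{1}{N_i}\sum_{j=1}^{N_i}\Phi_{ij}^{g},\quad \bar{u}_{\cdot\cdot t} = \frac{1}{R}\sum_{i=1}^{R}\frac{1}{N_i}\sum_{j=1}^{N_i}u_{ijt}. \]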


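Combining (423) and (424), we have:

\[ \bar{z}_{it} = \bar{\Xi}_{ij}d_t + \bar{\Phi}_{ij}h_t + \bar{u}_{ijt}, \tag{425} \]

where

\[ \underset{2(m_x+1)\times 1}{\bar{z}_{it}} = \begin{bmatrix} \bar{z}_{t} \\ \bar{z}_{i\cdot t} \end{bmatrix},\quad \underset{(m_g+m_i)\times 1}{h_t} = \begin{bmatrix} g_t \\ f_t^{i} \end{bmatrix},\quad \underset{2(m_x+1)\times 1}{\bar{u}_{ijt}} = \begin{bmatrix} \bar{u}_{\cdot\cdot t} + \frac{1}{R}\sum_{i=1}^{R}\bar{\Phi}_{i\cdot}^{i}f_t^{i} \\ \bar{u}_{i\cdot t} \end{bmatrix}, \]

\[ \underset{2(m_x+1)\times m_d}{\bar{\Xi}_{ij}} = \begin{bmatrix} \bar{\Xi}_{\cdot\cdot} \\ \bar{\Xi}_{i\cdot} \end{bmatrix},\quad \underset{2(m_x+1)\times (m_g+m_i)}{\bar{\Phi}_{ij}} = \begin{bmatrix} \bar{\Phi}_{\cdot\cdot}^{g} & 0 \\ \bar{\Phi}_{i\cdot}^{g} & \bar{\Phi}_{i\cdot}^{i} \end{bmatrix}. \]

Then, we obtain from (425):

\[ h_t = \left(\bar{\Phi}_{ij}'\bar{\Phi}_{ij}\right)^{-1}\bar{\Phi}_{ij}'\left(\bar{z}_{it} - \bar{\Xi}_{ij}d_t - \bar{u}_{ijt}\right). \tag{426} \]

It is easily seen, by applying Lemma 1 of (?) to $\bar{u}_{\cdot\cdot t}$, $\bar{u}_{i\cdot t}$ and $\frac{1}{R}\sum_{i=1}^{R}\bar{\Phi}_{i\cdot}^{i}f_t^{i}$,$^{49}$ that for each $t$, as $N \to \infty$,

\[ \bar{u}_{ijt} = O_p\left(\frac{1}{\sqrt{N}}\right)\ ?? \]

Therefore, we establish that

\[ h_t - \left(\bar{\Phi}_{ij}'\bar{\Phi}_{ij}\right)^{-1}\bar{\Phi}_{ij}'\left(\bar{z}_{it} - \bar{\Xi}_{ij}d_t\right) = O_p\left(\frac{1}{\sqrt{N}}\right)\ ?? \]

This suggests that we can use $\bar{h}_{it} = (d_t', \bar{z}_{it}')'$ as observable proxies for $h_t$. Then, we can consistently estimate the individual slope coefficients, $\beta_{ij}$, and their mean $\beta$ by augmenting the regression (411) with $d_t$ and the cross-section averages $\bar{z}_{it}$. This estimator is referred to as the 3D CCE estimator??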
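The augmentation just described can be illustrated numerically. The following is only a minimal sketch under assumptions that are not in the notes: Python with NumPy, a simulated design with one global factor and one local factor per region, an intercept as the only observed common effect, and hypothetical names (`cce_3d_mean_group`, `gam`, `Gam`). It forms the cross-section averages as in (421), the projection as in (422) (computed with a pseudo-inverse), unit-level OLS on the projected data, and a simple mean-group average.

```python
import numpy as np

def cce_3d_mean_group(y, X, D):
    """3D CCE mean-group sketch.

    y[i][j]: (T,) array, X[i][j]: (T, mx) array, D: (T, md) array.
    Returns the mean-group estimate and the stacked unit-level estimates.
    """
    R, T = len(y), D.shape[0]
    # z_ijt = (y_ijt, x_ijt')' stacked over t: one (T, 1 + mx) array per unit
    Z = [[np.column_stack([y[i][j], X[i][j]]) for j in range(len(y[i]))]
         for i in range(R)]
    Zbar_i = [np.mean(Z[i], axis=0) for i in range(R)]  # regional averages
    Zbar = np.mean(Zbar_i, axis=0)                      # average of regional averages
    betas = []
    for i in range(R):
        H = np.column_stack([D, Zbar, Zbar_i[i]])       # H = (D, Zbar, Zbar_i.)
        M = np.eye(T) - H @ np.linalg.pinv(H)           # annihilator of col(H)
        for j in range(len(y[i])):
            Xm = M @ X[i][j]
            betas.append(np.linalg.solve(Xm.T @ Xm, Xm.T @ y[i][j]))
    betas = np.array(betas)
    return betas.mean(axis=0), betas

# Simulated example: every number below is an illustrative assumption
rng = np.random.default_rng(0)
R, Ni, T, beta = 3, 50, 100, 1.0
D = np.ones((T, 1))                                     # intercept only
g = rng.standard_normal(T)                              # global factor
y, X = [], []
for i in range(R):
    f = rng.standard_normal(T)                          # local factor of region i
    yi, Xi = [], []
    for j in range(Ni):
        gam = np.array([1.0, -0.5]) + 0.5 * rng.standard_normal(2)  # loadings in e
        Gam = np.array([0.5, 1.0]) + 0.5 * rng.standard_normal(2)   # loadings in x
        x = Gam[0] * g + Gam[1] * f + rng.standard_normal(T)
        b = beta + 0.1 * rng.standard_normal()          # heterogeneous slope
        yi.append(b * x + gam[0] * g + gam[1] * f + rng.standard_normal(T))
        Xi.append(x[:, None])
    y.append(yi)
    X.append(Xi)

beta_mg, betas = cce_3d_mean_group(y, X, D)
print(beta_mg)
```

The loading means in the simulation are chosen so that the averaged loading matrix has full column rank; whether the rank condition is needed for the mean-group estimator in this nested setting is one of the open points flagged in the text.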

• Remark: The multi-level factor structure in (412) is different from KMS, and it is more suitable for an analysis of nested multi-level panel data (here 2-level).

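49 Alternatively, do we assume $\frac{1}{R}\sum_{i=1}^{R}\bar{\Phi}_{i\cdot}^{i}f_t^{i} = 0$ for identification?? How to make sure of it under the zero mean assumption for $f_t^{i}$? If $\frac{1}{R}\sum_{i=1}^{R}\bar{\Phi}_{i\cdot}^{i}f_t^{i} \neq 0$, then

\[ \frac{1}{R}\sum_{i=1}^{R}\bar{\Phi}_{i\cdot}^{i}f_t^{i} = \frac{1}{R}\left(\bar{\Phi}_{1\cdot}^{1},\dots,\bar{\Phi}_{R\cdot}^{R}\right)\begin{pmatrix} f_t^{1} \\ \vdots \\ f_t^{R} \end{pmatrix}. \tag{427} \]

In this case, instead of (425), we have

\[ \bar{z}_{it} = \begin{pmatrix} \bar{\Phi}_{\cdot\cdot}^{g} & \widetilde{\Phi}_{1\cdot}^{1} & \cdots & \widetilde{\Phi}_{i\cdot}^{i} & \cdots & \widetilde{\Phi}_{R\cdot}^{R} \\ \bar{\Phi}_{i\cdot}^{g} & 0 & \cdots & \bar{\Phi}_{i\cdot}^{i} & \cdots & 0 \end{pmatrix}\begin{pmatrix} g_t \\ f_t^{1} \\ \vdots \\ f_t^{R} \end{pmatrix} + \bar{u}_{ijt}, \tag{428} \]

where $\widetilde{\Phi}_{i\cdot}^{i} = \frac{1}{R}\bar{\Phi}_{i\cdot}^{i}$. And the analysis will be more complicated.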


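• ??Further, the local factors are allowed to have loadings that are heterogeneous across $(i,j)$. Consider the special case with within-region homogeneity:

\[ e_{ijt} = \gamma_{ij}^{g\prime}g_t + \gamma_{i}^{i\prime}f_t^{i} + \varepsilon_{ijt}. \tag{429} \]

In this case we have (425) with

\[ \bar{u}_{ijt} = \begin{bmatrix} \bar{u}_{\cdot\cdot t} + \frac{1}{R}\sum_{i=1}^{R}\bar{\Phi}_{i\cdot}^{i}f_t^{i} \\ \bar{u}_{i\cdot t} \end{bmatrix},\qquad \bar{\Phi}_{ij} = \begin{bmatrix} \bar{\Phi}_{\cdot\cdot}^{g} & 0 \\ \bar{\Phi}_{i\cdot}^{g} & \bar{\Phi}_{i\cdot}^{i} \end{bmatrix}, \]

or see the footnote.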

8.4.3 Panel Data with Cross-Sectional Dependence Characterized by a Multi-Level Factor Structure by Rodríguez-Caballero (2016)

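This is also a CCE extension.

Consider the linear heterogeneous panel data model for $i = 1,\dots,N$, $t = 1,\dots,T$, $r = 1,\dots,R$ (NB: abuse of notation):

\[ y_{r,it} = \alpha_{r,i}'d_{r,t} + \beta_{r,i}'x_{r,it} + e_{r,it}, \]

\[ e_{r,it} = \mu_{r,i}'G_t + \lambda_{r,i}'F_{r,t} + \varepsilon_{r,it}, \]

where $d_{r,t}$ is a $RN \times 1$ vector of observed common effects in block $r$, $x_{r,it}$ is a $k \times 1$ vector of observed individual-specific regressors, $G_t$ is the $r_G \times 1$ vector of unobserved global factors, $F_{r,t}$ is the $r_F \times 1$ vector of unobserved regional factors, and $\varepsilon_{r,it}$ are the individual-specific idiosyncratic errors. Assume the number of blocks, $R$, to be fixed.

I adopt the following specification for the individual-specific regressors:

\[ x_{r,it} = A_{r,i}'d_{r,t} + M_{r,i}'G_t + \Lambda_{r,i}'F_{r,t} + v_{r,it}, \]

where $A_{r,i}$, $M_{r,i}$ and $\Lambda_{r,i}$ are $N \times k$, $r_G \times k$, and $r_F \times k$ matrices, and $v_{r,it}$ are errors of $x_{r,it}$ distributed independently of the factor structure and across $i$ and $r$.

The model can be re-written for the specific block $r$ as

\[ \underset{(k+1)\times 1}{z_{r,it}} = \underset{(k+1)\times N}{B_{r,i}'}\ \underset{N\times 1}{d_{r,t}} + \underset{(k+1)\times r_G}{C_{r,i}'}\ \underset{r_G\times 1}{G_t} + \underset{(k+1)\times r_F}{D_{r,i}'}\ \underset{r_F\times 1}{F_{r,t}} + u_{r,it}, \]

where

\[ z_{r,it} = \begin{pmatrix} y_{r,it} \\ x_{r,it} \end{pmatrix},\quad B_{r,i} = (\alpha_{r,i}, A_{r,i})\begin{pmatrix} 1 & 0 \\ \beta_{r,i} & I_k \end{pmatrix},\quad C_{r,i} = (\mu_{r,i}, M_{r,i})\begin{pmatrix} 1 & 0 \\ \beta_{r,i} & I_k \end{pmatrix}, \]

\[ D_{r,i} = (\lambda_{r,i}, \Lambda_{r,i})\begin{pmatrix} 1 & 0 \\ \beta_{r,i} & I_k \end{pmatrix},\quad u_{r,it} = \begin{bmatrix} \varepsilon_{r,it} + \beta_{r,i}'v_{r,it} \\ v_{r,it} \end{bmatrix}. \]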


The ranks of the matrices $C_{r,i}$ and $D_{r,i}$ are determined by the ranks of the $r_G \times (k+1)$ and $r_F \times (k+1)$ matrices of the unobserved top-level and block-specific factor loadings, respectively.

The model simplifies to that proposed by Pesaran (2006) and Bai (2009): i) when $R = 1$; ii) when there are no block-specific factors, i.e. $\sum_{r=1}^{R} r_{F_r} = 0$.

Assumption A. Observed Common Effects:
The $RN \times 1$ vector of observed common effects $d_{r,t}$ is covariance stationary with absolutely summable autocovariances, distributed independently of the individual-specific errors $\varepsilon_{r,it}$ and $v_{r,it}$. Each $d_r$ is orthogonal to $G$.

Assumption B. Unobserved Common Factors:
B1 Block-specific factors $F_{r,t}$ are covariance stationary such that

\[ E\|F_{r,t}\|^4 \le M < \infty\quad\text{with}\quad T^{-1}\sum_{t=1}^{T}F_{r,t}F_{r,t}' \to_p \Sigma_{F_r} \]

for some $r_F \times r_F$ positive definite matrix $\Sigma_{F_r}$, $r = 1,\dots,R$.

B2 The pervasive top-level factor $G_t$ is covariance stationary such that

\[ E\|G_t\|^4 \le M < \infty\quad\text{with}\quad T^{-1}\sum_{t=1}^{T}G_tG_t' \to_p \Sigma_{G} \]

for some $r_G \times r_G$ positive definite matrix $\Sigma_G$.

B3 Define $H_t = [G_t', F_{r,t}']'$. For a fixed $r$, assume that

\[ T^{-1}\sum_{t=1}^{T}H_tH_t' \to_p \Sigma_{H} \]

for some positive-definite matrix $\Sigma_H$ with rank $r_G + r_1 + \cdots + r_R$.

B4 Factors have zero mean, and

\[ \sum_{t=1}^{T}G_tF_{r,t}' = 0\quad\text{for } r = 1,\dots,R. \]

Assumption C. Individual-specific Errors:
C1 $\varepsilon_{r,it}$, $r = 1,\dots,R$, $i = 1,\dots,N$, $t = 1,\dots,T$, are independently distributed across $r$, $i$, and $t$ with zero mean and variance $\sigma_i^2$, and have a finite fourth moment.

C2 $v_{r,it}$, $r = 1,\dots,R$, $i = 1,\dots,N$, $t = 1,\dots,T$, are independently distributed across $r$, $i$, and $t$ with zero mean and variance $\Sigma_i > 0$, and $\sup_{r;it}E\|v_{r,it}\|^4 < \infty$.

C3 $\varepsilon_{r,it}$ and $v_{r,jt'}$ are distributed independently for all $r$, $i$, $j$, $t$ and $t'$. For each $r$ and $i$, $\varepsilon_{r,it}$ and $v_{r,it}$ follow linear stationary processes with absolutely summable autocovariances.

Assumption D. Factor loadings:
The unobserved factor loadings $\mu_{r,i}$, $\lambda_{r,i}$, $M_{r,i}$ and $\Lambda_{r,i}$ are independently and identically distributed across $r$, $i$, and of the individual-specific errors $\varepsilon_{r,it}$ and $v_{r,jt}$, the common observable factors $d_{r,t}$ and the unobserved common factors $(G_t, F_{r,t})$ for all $r$, $i$, $j$, and $t$ with fixed means $\mu$, $\lambda$, $M$ and $\Lambda$, and finite variances. In particular,

D1 $\lambda_{r,i}$ is either deterministic such that $\|\lambda_{r,i}\| \le M < \infty$, or it is stochastic such that $E\|\lambda_{r,i}\|^4 \le M < \infty$. In the latter case,

\[ N_r^{-1}\Lambda_r'\Lambda_r \to_p \Sigma_{\Lambda_r} > 0 \]

for an $r_F \times r_F$ non-random matrix $\Sigma_{\Lambda_r}$ for all $r = 1,\dots,R$, with a positive constant $M$.

D2 $\mu_{r,i}$ is either deterministic such that $\|\mu_{r,i}\| \le M < \infty$, or it is stochastic such that $E\|\mu_{r,i}\|^4 \le M < \infty$ with

\[ N_r^{-1}\mu_r'\mu_r \to_p \Sigma_{\mu_r} > 0 \]

for an $r_G \times r_G$ non-random matrix $\Sigma_{\mu_r}$ for all $r = 1,\dots,R$.

Assumption E. Random Slope Coefficients:
The slope coefficients $\beta_{r,i}$ follow the random coefficient model

\[ \beta_{r,i} = \beta + v_{r,i},\qquad v_{r,i} \sim iid(0,\Omega_v) \]

for $i = 1,\dots,N$ and $r = 1,\dots,R$, where $\|\beta\| < K$, $\|\Omega_v\| < K$, $\Omega_v$ is a $k \times k$ symmetric nonnegative definite matrix, and $v_{r,i}$ are distributed independently of $\mu_{r,j}$, $\lambda_{r,j}$, $M_{r,j}$ and $\Lambda_{r,j}$, $\varepsilon_{r,jt}$, $v_{r,jt}$, $d_{r,t}$, $F_{r,t}$, and $G_t$ for all $r$, $i$, $j$ and $t$.

Assumption F. Identification of $\beta_{r,i}$:
Identification of the slope coefficients is given by a two-step procedure of cross-sectional averages. In the first step, consider the cross-sectional averages of the individual-specific variables $z_{r,it}$ in each one of the $R$ blocks separately. Define

\[ \bar{z}_{r,t} = \frac{1}{N}\sum_{j=1}^{N}z_{r,jt} \]

and let

\[ \bar{W}_r = I_T - \bar{H}_r\left(\bar{H}_r'\bar{H}_r\right)^{-}\bar{H}_r', \]

\[ W_{F_r} = I_T - \mathbf{F}_r\left(\mathbf{F}_r'\mathbf{F}_r\right)^{-}\mathbf{F}_r', \]

where $\bar{H}_r = (D_r, \bar{Z}_r)$ with $\underset{T\times N}{D_r} = (d_{r,1},\dots,d_{r,T})'$, and $\bar{Z}_r = (\bar{z}_{r,1},\dots,\bar{z}_{r,T})'$ is the $T \times (k+1)$ matrix of time observations on the cross-sectional averages for each region $r$, and

\[ \mathbf{F}_r = (D_r, F_r), \]

where $F_r = (F_{r,1},\dots,F_{r,T})'$ is the $T \times r_{F_r}$ data matrix on the unobserved block-specific factors.


For the second step, define

\[ \underset{(k+1)RN\times T}{Z^{*}} = \left[\left(\underset{(k+1)N\times T}{Z_1}\ \underset{T\times T}{\bar{W}_1}\right)',\ \left(Z_2\bar{W}_2\right)',\ \dots,\ \left(Z_R\bar{W}_R\right)'\right]' \]

and consider the cross-sectional average of the complete panel and let

\[ \bar{W}^{*} = I_T - \bar{H}^{*}\left(\bar{H}^{*\prime}\bar{H}^{*}\right)^{-}\bar{H}^{*\prime}, \]

\[ W_G = I_T - G\left(G'G\right)^{-1}G', \]

where $\bar{H}^{*} = \bar{Z}^{*}$, with $\bar{Z}^{*} = (\bar{z}_1^{*},\dots,\bar{z}_T^{*})'$ the $T \times (k+1)$ matrix of observations on the cross-sectional averages. Denote $\bar{W}_{r,t}G_t'$ as $G_t^{*}$; then $G^{*} = (G_1^{*},\dots,G_T^{*})'$ is the $T \times r_G$ data matrix on the unobserved top-level factors.

F1 Identification of $\beta_{r,i}$: The $k \times k$ matrices $\Psi_{r,iT} = \left(\frac{X_{r,i}^{*\prime}\bar{W}^{*}X_{r,i}^{*}}{T}\right)$ and $\Psi_{r,iG} = \left(\frac{X_{r,i}^{*\prime}W_GX_{r,i}^{*}}{T}\right)$ are nonsingular, and $\Psi_{r,iT}$, $\Psi_{r,iG}$ have finite second-order moments for all $r$ and $i$.

Assumption E can be relaxed allowing for $\beta_{r,i} = \beta_r + v_{r,i}$, $v_{r,i} \sim IID(0,\Omega_v)$, implying that the slope can vary among regions while being the same within the specific block $r$. Assumption F details the identification strategy of the slope coefficients.

I propose an extended CCE procedure to filter out the full factor space involved in a model whose cross-sectional dependence is driven by a multi-level factor structure. The Multi-Level Common Correlated Effect Estimators (MLCCE) consist of two steps.

In the first step, consider each of the $z_{r,it}$ vectors for each block $r = 1,\dots,R$. Note that the cross-sectional dependence in block $r$ is only driven by the block-specific observable common factors, $d_{r,t}$, as well as the unobservable common factors $F_{r,t}$. The pervasive common factors $G_t$ do not play a role hitherto since none of the remaining blocks are considered why??. In this sense, $\mu_{r,i} = 0$ and $M_{r,i} = 0$ lead to $C_{r,i} = 0$ for all $i = 1,\dots,N$. Then, we have

\[ z_{r,it} = B_{r,i}'d_{r,t} + D_{r,i}'F_{r,t} + u_{r,it}. \]

The cross-sectional averages of the observables of $z_{r,it}$ can work as a proxy for these block-specific factors:

\[ \bar{z}_{r,t} = \bar{B}_r'd_{r,t} + \bar{D}_r'F_{r,t} + \bar{u}_{r,t}, \]

where

\[ \bar{z}_{r,t} = \frac{1}{N}\sum_{i=1}^{N}z_{r,it};\quad \bar{B}_r = \frac{1}{N}\sum_{i=1}^{N}B_{r,i};\quad \bar{D}_r = \frac{1}{N}\sum_{i=1}^{N}D_{r,i};\quad \bar{u}_{r,t} = \frac{1}{N}\sum_{i=1}^{N}u_{r,it}. \]

Assuming

\[ rk(\bar{D}_r) = r_{F_r} < k+1, \]

then

\[ F_{r,t} = \left(\bar{D}_r\bar{D}_r'\right)^{-1}\bar{D}_r\left(\bar{z}_{r,t} - \bar{B}_r'd_{r,t} - \bar{u}_{r,t}\right). \]

Lemma 1 in Pesaran (2006) shows that $\bar{u}_{r,t} \to_{qm} 0$ as $N \to \infty$, for each $t$, which implies

\[ F_{r,t} - \left(DD'\right)^{-1}D\left(\bar{z}_{r,t} - \bar{B}_r'd_{r,t}\right) \to_{qm} 0, \]

where

\[ D = \lim_{N\to\infty}\bar{D}_r = \widetilde{\Lambda}\begin{pmatrix} 1 & 0 \\ \beta_r & I_k \end{pmatrix}, \]

$\widetilde{\Lambda} = E(\lambda_{r,i}, \Lambda_{r,i}) = (\lambda, \Lambda)$ and $\beta_r = E(\beta_{r,i})$.

Therefore, the block-specific factors, $F_{r,t}$, can be approximated by a linear combination of $d_r$, the cross-sectional averages of the dependent variable, $\bar{y}_{r,t} = \frac{1}{N}\sum_{i=1}^{N}y_{r,it}$, and of the individual-specific regressors, $\bar{x}_{r,t} = \frac{1}{N}\sum_{i=1}^{N}x_{r,it}$.

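Define $Z^{*}$ as

\[ \underset{(k+1)RN\times T}{Z^{*}} = \begin{bmatrix} \underset{(k+1)N\times T}{Z_1}\ \underset{T\times T}{\bar{W}_1} \\ Z_2\bar{W}_2 \\ \vdots \\ Z_R\bar{W}_R \end{bmatrix}, \]

where

\[ \underset{T\times T}{\bar{W}_r}\ \underset{T\times (k+1)N}{Z_r'} = \left(I_T - \bar{H}_r\left(\bar{H}_r'\bar{H}_r\right)^{-}\bar{H}_r'\right)Z_r', \]

which is the residual from the regression of $Z_r'$ on $\bar{H}_r = (D_r, \bar{Z}_r)$. And similarly,

\[ G^{*} = \left(G\bar{W}_r\right)' = \left(I_T - \bar{H}_r\left(\bar{H}_r'\bar{H}_r\right)^{-}\bar{H}_r'\right)G';\qquad U^{*} = \left[\left(U_1\bar{W}_1\right)',\dots,\left(U_R\bar{W}_R\right)'\right]', \]

where $\bar{W}_r = I_T - \bar{H}_r\left(\bar{H}_r'\bar{H}_r\right)^{-}\bar{H}_r'$. Then, we have:

\[ z_{it}^{*} = C_i'G_t^{*} + u_{it}^{*} \tag{430} \]

after partialling out the effects of the block-specific factors from all the blocks $r = 1,\dots,R$ by using the orthogonal projection matrix $\bar{W}_r$. The cross-section averages will result in

\[ \bar{z}_t^{*} = \bar{C}'G_t^{*} + \bar{u}_t^{*}, \]

where

\[ \bar{z}_t^{*} = \frac{1}{RN}\sum_{i=1}^{RN}z_{it}^{*}. \]

In the second step, $\bar{H}^{*} = \bar{Z}^{*}$ becomes an observable proxy for the unobserved global factor $G_t$.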


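The Common Correlated Effects Mean Group (CCEMG) estimator is an average of the individual MLCCE estimators, $\hat{\beta}_{r,i}$,

\[ \hat{\beta}_{CCEMG} = \frac{1}{R}\sum_{r=1}^{R}\frac{1}{N}\sum_{i=1}^{N}\hat{\beta}_{r,i}, \]

where

\[ \hat{\beta}_{r,i} = \left(X_{r,i}^{*\prime}\bar{W}^{*}X_{r,i}^{*}\right)^{-1}X_{r,i}^{*\prime}\bar{W}^{*}y_{r,i}^{*}, \tag{431} \]

\[ X_{r,i}^{*} = \left(x_{r,i1}\bar{W}_{r,1},\ x_{r,i2}\bar{W}_{r,2},\ \dots,\ x_{r,iT}\bar{W}_{r,T}\right), \]

\[ y_{r,i}^{*} = \left(y_{r,i1}\bar{W}_{r,1},\ y_{r,i2}\bar{W}_{r,2},\ \dots,\ y_{r,iT}\bar{W}_{r,T}\right). \]

Theorem 4.1. Under Assumptions A-C and F1, as $(N,T) \overset{j}{\to} \infty$ and under the rank conditions, $\hat{\beta}_{r,i}$ is a consistent estimator of $\beta_{r,i}$. Furthermore, assuming $\sqrt{T}/N \to 0$ as $(N,T) \overset{j}{\to} \infty$ in block $r$, then

\[ \sqrt{T}\left(\hat{\beta}_{r,i} - \beta_{r,i}\right) \to_d N\left(0, \Sigma_{\beta_{r,i}}\right), \]

where

\[ \Sigma_{\beta_{r,i}} = \sigma_{r,i}^{2}\,\Sigma_{r,iv}^{-1}(0)\,\Sigma_{r,iv}^{-1}(0). \]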

It is possible to consider a CCEMG estimator either for the specific block $r$ or for the full panel. I focus on the more general setting.

Theorem 4.2. Under Assumptions A-E and F1, $\hat{\beta}_{CCEMG}$ is asymptotically unbiased for fixed $R$ and $T$ as $N \to \infty$. Furthermore, as $(N,T) \overset{j}{\to} \infty$ with $R$ fixed,

\[ \sqrt{RN}\left(\hat{\beta}_{CCEMG} - \beta\right) \to_d N\left(0, \Sigma_{CCEMG}\right), \]

where $\Sigma_{CCEMG}$ can be consistently estimated non-parametrically by

\[ \hat{\Sigma}_{CCEMG} = \frac{1}{R-1}\sum_{r=1}^{R}\frac{1}{N-1}\sum_{i=1}^{N}\left(\hat{\beta}_{r,i} - \hat{\beta}_{CCEMG}\right)\left(\hat{\beta}_{r,i} - \hat{\beta}_{CCEMG}\right)'. \]

Note that the block-specific and global rank conditions are necessary in Theorem 4.1 but not in Theorem 4.2 (how??).

When $\beta_{r,i} = \beta$ for all $r$ and $i$, pooled estimators would gain efficiency. Similar asymptotic analysis can be done. I avoid these details to focus only on the heterogeneity assumption, which seems to be more appropriate in macroeconomic and financial applications.
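The mean-group average and the non-parametric variance estimator of Theorem 4.2 are simple to compute once the unit-level estimates are available. The sketch below assumes Python with NumPy, a balanced design with $N$ units in each of the $R$ blocks, and a hypothetical helper name (`ccemg_with_variance`); it applies the double normalisation $1/((R-1)(N-1))$ exactly as the formula is written in the text, and the toy input is invented purely to show the arithmetic.

```python
import numpy as np

def ccemg_with_variance(betas):
    """betas: (R, N, k) array of unit-level slope estimates.

    Returns the mean-group estimate, the non-parametric variance estimate
    as written in the text, and standard errors based on the
    sqrt(R*N) normalisation of Theorem 4.2.
    """
    R, N, k = betas.shape
    beta_mg = betas.mean(axis=(0, 1))              # (1/R) sum_r (1/N) sum_i
    dev = (betas - beta_mg).reshape(R * N, k)      # deviations from the mean group
    sigma = dev.T @ dev / ((R - 1) * (N - 1))      # sum of outer products, scaled
    se = np.sqrt(np.diag(sigma) / (R * N))
    return beta_mg, sigma, se

# Toy input: R = 2 blocks, N = 2 units, k = 1 regressor (illustrative values)
betas = np.array([[[0.0], [2.0]], [[2.0], [4.0]]])
beta_mg, sigma, se = ccemg_with_variance(betas)
print(beta_mg, sigma, se)
```

With unbalanced blocks the two averages no longer collapse to a single overall mean, so the helper would need block-specific weights; that case is not covered here.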

• Remark: The MLCCE first augments the regional block regression with $\bar{H}_r = (D_r, \bar{Z}_r)$, where $\bar{Z}_r$ are designed to approximate the local factors, see the regression (430). Then, it augments the global regression (430) with $\bar{z}_t^{*}$, where $\bar{z}_t^{*}$ are designed to approximate the global factors. He showed via MC that the performance of the MLCCE is satisfactory... See the MC section and we may consider similar designs?

• First, I wonder whether this 2-step approach in (431) is equivalent to our algorithm developed in the previous section ?? Does this cover the model in (425) or in (??)? Which is more efficient?

8.5 The Multi-level PCA approach

• What about applying the PCA following the international BC literature, say Breitung and Eickmeier (2016) and Choi et al. (2016)? This is quite feasible and worthwhile along with developing the information criteria for determining the number of both global and local factors, though I prefer the approach not affected by the consistent estimation of the number of true factors...

Here we aim to adopt and extend the approaches advanced by Breitung and Eickmeier (2016) and Choi et al. (2017). We first review a couple of existing approaches (incomplete or not rigorous yet).

8.5.1 Three-Dimensional Panel Data Models with Factor Structures by Lu and Su (2018)

We consider general three-dimensional panel data models with factor structures, which include both global factors and local factors. This type of model can be applied to many fields in economics, such as international trade, macroeconomics and finance, among others. We will propose a method to determine the number of factors and provide estimators for the factors and factor loadings. We will study the asymptotic properties of our methods, show the consistency and derive the asymptotic distribution of our estimators. Monte Carlo simulations will be used to demonstrate the finite sample performance of our methods. Empirical applications to international trade will also be provided.

Latent factor models have received considerable attention in econometrics. Most factor models have been applied to two-dimensional (2D) panel data. However, more three-dimensional (3D) panel datasets have become available (e.g., Mátyás (2017) for a recent review). In this project, we consider estimation and inference for factor models with 3D panel data.

We first consider a 3D pure factor model without any exogenous regressors:

\[ y_{ijt} = \lambda_{ij}^{(0)\prime}f_{t}^{(0)} + \lambda_{ij}^{(1)\prime}f_{it}^{(1)} + \lambda_{ij}^{(2)\prime}f_{jt}^{(2)} + u_{ijt};\quad i = 1,\dots,N;\ j = 1,\dots,M_i;\ t = 1,\dots,T; \tag{432} \]

where $y_{ijt}$ is the observable data, $f_t^{(0)}$ is the global factor, $f_{it}^{(1)}$ and $f_{jt}^{(2)}$ are local factors that depend on $i$ and $j$, respectively, $\left(\lambda_{ij}^{(0)\prime}, \lambda_{ij}^{(1)\prime}, \lambda_{ij}^{(2)\prime}\right)'$ are their corresponding factor loadings and $u_{ijt}$ is the idiosyncratic error. We assume that the dimensions of $f_t^{(0)}$, $f_{it}^{(1)}$ and $f_{jt}^{(2)}$ are $r_0 \times 1$, $r_i^1 \times 1$ and $r_j^2 \times 1$, respectively, and $\lambda_{ij}^{(0)}$, $\lambda_{ij}^{(1)}$, and $\lambda_{ij}^{(2)}$ have the same corresponding dimensions. That is, there are $r_0$, $r_i^1$ and $r_j^2$ global factors, $i$-specific local factors and $j$-specific local factors, respectively.

We can also extend the model to allow for exogenous regressors:

\[ y_{ijt} = \beta'x_{ijt} + \lambda_{ij}^{(0)\prime}f_{t}^{(0)} + \lambda_{ij}^{(1)\prime}f_{it}^{(1)} + \lambda_{ij}^{(2)\prime}f_{jt}^{(2)} + u_{ijt}, \tag{433} \]

where $x_{ijt}$ is a $k \times 1$ vector of observable regressors and $\beta$ is the corresponding vector of slope coefficients. For example, in the international trade application above, $x_{ijt}$ could be the trading costs from country $i$ to country $j$ at year $t$. Another example of $x_{ijt}$ is the lagged dependent variable, i.e., $x_{ijt} = y_{ij,t-1}$ in the dynamic model. This model is often referred to as a panel data model with interactive fixed effects, as we allow the regressor $x_{ijt}$ to be correlated with the unobservable factors.

The model can be used in various ways. The most general 3D linear fixed effect model considered in the literature is

\[ y_{ijt} = \beta'x_{ijt} + \gamma_{ij} + \alpha_{it} + \alpha_{jt}^{*} + u_{ijt}, \]

where $\gamma_{ij}$, $\alpha_{it}$, $\alpha_{jt}^{*}$ are fixed effects (e.g., Lu, Miao and Su (2017)). This model can be thought of as a special case of our Model (433). We consider large panels where the three dimensions $(N, M, T)$ go to infinity jointly. The goal of this project is to determine $\left(r_0, r_i^1, r_j^2\right)$ and estimate $\left(f_t^{(0)}, f_{it}^{(1)}, f_{jt}^{(2)}\right)$, $\left(\lambda_{ij}^{(0)\prime}, \lambda_{ij}^{(1)\prime}, \lambda_{ij}^{(2)\prime}\right)'$ up to a rotation matrix.

Literature Review. There is a large literature on factor models for 2D panel data. Omitting the $j$ index, models (432) and (433) reduce to

\[ y_{it} = \lambda_i'f_t + u_{it}, \]

\[ y_{it} = \beta'x_{it} + \lambda_i'f_t + u_{it}, \]

where $f_t$ and $\lambda_i$ are $r \times 1$ vectors of factors and factor loadings. The literature on factor models for 2D panel data has developed rapidly. For a detailed review, see Bai and Wang (2016). For the pure factor models, Bai (2003) considers estimation based on principal components analysis (PCA) and develops the inference theory assuming the number of factors $r$ is known. For the model with exogenous regressors, Bai (2009) and Moon and Weidner (2015, 2017) provide estimators based on Gaussian quasi-maximum likelihood estimation (QMLE) and PCA. Pesaran (2006) proposes common correlated effects (CCE) estimators for estimating $\beta$.

There is a limited number of papers on 3D factor models. See Breitung and Eickmeier (2015), Wang (2016), and Choi et al. (2017). All these papers consider a special case of our pure factor model where there is only one local component, i.e.,

\[ y_{ijt} = \lambda_{ij}^{(0)\prime}f_{t}^{(0)} + \lambda_{ij}^{(1)\prime}f_{it}^{(1)} + u_{ijt};\quad i = 1,\dots,N;\ j = 1,\dots,M_i;\ t = 1,\dots,T. \]

This model is often referred to as a multi-level factor model. There are certain limitations of the existing methods. First, most papers assume that the numbers of global factors $r^{(0)}$ and local factors $r_i^{(1)}$ are known, e.g., Breitung and Eickmeier (2015) and Wang (2016). Choi et al. (2017) provide information criteria for selecting the number of local factors $r_i^{(1)}$ but assume the number of global factors $r^{(0)}$ is known. Second, there is no inference theory for the estimated factors and factor loadings. These papers only establish the consistency of their estimators. Third, usually strong assumptions are imposed on the local factors, $f_{it}^{(1)}$. For example, Choi et al. (2017) assume that the local factors are uncorrelated, that is,

\[ cov\left(f_{i_1t}^{(1)}, f_{i_2t}^{(1)}\right) = 0\quad\text{for } i_1 \neq i_2, \]

which may not be satisfied in practice. Fourth, these papers only consider pure factor models that do not allow exogenous regressors, $x_{ijt}$.

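Estimation of Pure Factor Models. We impose the key assumptions.

Assumption A.1. $f_t^{(0)}$, $f_{it}^{(1)}$, and $f_{jt}^{(2)}$ are random variables such that (i) $E\left(f_t^{(0)}\right) = 0$, $E\left(f_{it}^{(1)}\right) = 0$ and $E\left(f_{jt}^{(2)}\right) = 0$ for all $i$, $j$ and $t$; and (ii) $E\left(f_t^{(0)}f_{it}^{(1)\prime}\right) = 0$, $E\left(f_t^{(0)}f_{jt}^{(2)\prime}\right) = 0$ and $E\left(f_{it}^{(1)}f_{jt}^{(2)\prime}\right) = 0$ for all $i$, $j$ and $t$.

The uncorrelatedness assumption allows us to separate the three factors. Define $M = \max\{M_i,\ i = 1,\dots,N\}$. By symmetry, we assume that for each $j$, the $i$ index runs from $1,\dots,N_j$. We can rewrite Model (432) as

\[ y_{ijt} = \lambda_{ij}^{(0)\prime}f_t^{(0)} + u_{ijt}^{(0)},\qquad u_{ijt}^{(0)} = \lambda_{ij}^{(1)\prime}f_{it}^{(1)} + \lambda_{ij}^{(2)\prime}f_{jt}^{(2)} + u_{ijt}, \tag{434} \]

\[ y_{ijt}^{(1)} = \lambda_{ij}^{(1)\prime}f_{it}^{(1)} + u_{ijt}^{(1)},\qquad y_{ijt}^{(1)} = y_{ijt} - \lambda_{ij}^{(0)\prime}f_t^{(0)},\qquad u_{ijt}^{(1)} = \lambda_{ij}^{(2)\prime}f_{jt}^{(2)} + u_{ijt}, \tag{435} \]

\[ y_{ijt}^{(2)} = \lambda_{ij}^{(2)\prime}f_{jt}^{(2)} + u_{ijt},\qquad y_{ijt}^{(2)} = y_{ijt} - \lambda_{ij}^{(0)\prime}f_t^{(0)} - \lambda_{ij}^{(1)\prime}f_{it}^{(1)}. \tag{436} \]

In matrix form, we can write

\[ Y^{(0)} = F^{(0)}\Lambda^{(0)\prime} + U^{(0)}; \tag{437} \]

\[ Y_i^{(1)} = F_i^{(1)}\Lambda_i^{(1)\prime} + U_i^{(1)};\quad i = 1,\dots,N, \tag{438} \]

\[ Y_j^{(2)} = F_j^{(2)}\Lambda_j^{(2)\prime} + U_j^{(2)};\quad j = 1,\dots,M, \tag{439} \]

where $y_{ij} = (y_{ij1},\dots,y_{ijT})'$, $y_{ij}^{(1)} = \left(y_{ij1}^{(1)},\dots,y_{ijT}^{(1)}\right)'$, $y_{ij}^{(2)} = \left(y_{ij1}^{(2)},\dots,y_{ijT}^{(2)}\right)'$,

$Y_i^{(1)} = \left(y_{i1}^{(1)},\dots,y_{iM_i}^{(1)}\right)$, $Y_j^{(2)} = \left(y_{1j}^{(2)},\dots,y_{N_jj}^{(2)}\right)$, $Y^{(0)} = \left(y_{11}, y_{12},\dots,y_{1M_1},\dots,y_{N1},\dots,y_{NM_N}\right)$,

$F^{(0)} = \left(f_1^{(0)}, f_2^{(0)},\dots,f_T^{(0)}\right)'$, $F_i^{(1)} = \left(f_{i1}^{(1)}, f_{i2}^{(1)},\dots,f_{iT}^{(1)}\right)'$, $F_j^{(2)} = \left(f_{j1}^{(2)}, f_{j2}^{(2)},\dots,f_{jT}^{(2)}\right)'$,

$\Lambda_i^{(1)} = \left(\lambda_{i1}^{(1)}, \lambda_{i2}^{(1)},\dots,\lambda_{iM_i}^{(1)}\right)'$, $\Lambda_j^{(2)} = \left(\lambda_{1j}^{(2)}, \lambda_{2j}^{(2)},\dots,\lambda_{N_jj}^{(2)}\right)'$, and $\Lambda^{(0)} = \left(\lambda_{11}^{(0)}, \lambda_{12}^{(0)},\dots,\lambda_{1M_1}^{(0)},\dots,\lambda_{N1}^{(0)},\dots,\lambda_{NM_N}^{(0)}\right)'$.

Definitions of $U^{(0)}$, $U_i^{(1)}$ and $U_j^{(2)}$ are similar to $Y^{(0)}$, $Y_i^{(1)}$ and $Y_j^{(2)}$.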


We identify the factors and factor loadings sequentially based on the three equations above. We first identify the global factor and global factor loadings based on Model (434). Note that we can treat $U^{(0)}$ or $u_{ijt}^{(0)}$ as a new error term, as $\left(\lambda_{ij}^{(1)\prime}f_{it}^{(1)} + \lambda_{ij}^{(2)\prime}f_{jt}^{(2)}\right)$ is uncorrelated with $f_t^{(0)}$. If we stack the $(i,j)$ indices into one index as in (437), then we can view Model (434) as a standard 2D factor model and apply the PCA method as in Bai (2003). We can determine the number of factors $r^{(0)}$ by maximizing the ratio of two adjacent eigenvalues, as suggested by Ahn and Horenstein (2013). To identify $F^{(0)}$ and $\Lambda^{(0)}$ we need to impose a normalization condition:

\[ \frac{1}{T}F^{(0)\prime}F^{(0)} = I_{r^{(0)}} \]

(an $r^{(0)} \times r^{(0)}$ identity matrix) and $\Lambda^{(0)\prime}\Lambda^{(0)}$ is an $r^{(0)} \times r^{(0)}$ diagonal matrix. Then, we can identify $r^{(0)}$, $\lambda_{ij}^{(0)\prime}f_t^{(0)}$ and a rotated version of $\lambda_{ij}^{(0)}$ and $f_t^{(0)}$.

After identifying $\lambda_{ij}^{(0)\prime}f_t^{(0)}$, we can identify $y_{ijt}^{(1)} = y_{ijt} - \lambda_{ij}^{(0)\prime}f_t^{(0)}$ in Model (435). For each $i$, we impose the normalization condition that

\[ \frac{1}{T}F_i^{(1)\prime}F_i^{(1)} = I_{r_i^{(1)}} \]

and $\Lambda_i^{(1)\prime}\Lambda_i^{(1)}$ is an $r_i^{(1)} \times r_i^{(1)}$ diagonal matrix. Then for each $i$, we can identify $r_i^{(1)}$, $\lambda_{ij}^{(1)\prime}f_{it}^{(1)}$ and a rotated version of $\lambda_{ij}^{(1)}$ and $f_{it}^{(1)}$.

After achieving the identification of $\lambda_{ij}^{(0)\prime}f_t^{(0)}$ and $\lambda_{ij}^{(1)\prime}f_{it}^{(1)}$, $y_{ijt}^{(2)}$ in Model (436) is identified. With the normalization condition that $\frac{1}{T}F_j^{(2)\prime}F_j^{(2)} = I_{r_j^{(2)}}$ and $\Lambda_j^{(2)\prime}\Lambda_j^{(2)}$ is an $r_j^{(2)} \times r_j^{(2)}$ diagonal matrix for each $j$, $r_j^{(2)}$, $\lambda_{ij}^{(2)\prime}f_{jt}^{(2)}$ and a rotated version of $\lambda_{ij}^{(2)}$ and $f_{jt}^{(2)}$ are identified.

We can obtain the initial consistent estimators based on the discussion above. Then we can achieve more efficient estimators through iterations. The detailed estimation algorithm is described in the appendix.

Remark 2.1.1 To ensure A.1(i), we need to demean the data first. Assuming that the data are weakly stationary along the time dimension, we can apply our estimation method to the demeaned data: $y_{ijt} - T^{-1}\sum_{t=1}^{T}y_{ijt}$.

Remark 2.1.2 A.1(ii) is crucial for the identification and is often assumed in multi-level factor models, see Assumption 1(ii) in Choi et al. (2017). It can be thought of as a normalization, as it can be satisfied by linear projections and redefining factors and factor loadings.
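The time demeaning in Remark 2.1.1 amounts to a one-line array operation. The sketch below assumes Python with NumPy and a balanced panel stored as an array of shape $(N, M, T)$ with time on the last axis; the function name and the toy data are hypothetical.

```python
import numpy as np

def demean_over_time(Y):
    """Subtract T^{-1} * sum_t y_ijt from each (i, j) series.

    Y: array of shape (N, M, T) with time on the last axis.
    """
    return Y - Y.mean(axis=-1, keepdims=True)

# Toy panel: N = 2, M = 3, T = 4 (illustrative values only)
Y = np.arange(24, dtype=float).reshape(2, 3, 4)
Y_dm = demean_over_time(Y)
print(Y_dm[0, 0])
```

For an unbalanced panel ($M_i$ varying with $i$) the same operation would be applied series by series rather than on a single array.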

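Estimation of Models with Exogenous Regressors. We propose the following iterative estimation method for $\beta$. Define the factor component as

\[ C_{ijt} = \lambda_{ij}^{(0)\prime}f_t^{(0)} + \lambda_{ij}^{(1)\prime}f_{it}^{(1)} + \lambda_{ij}^{(2)\prime}f_{jt}^{(2)}. \]

Estimation for $\beta$: Start with an initial estimator of $\beta$, and iterate the following two steps until a convergence criterion is satisfied.

Step 1: For a given estimator of $\beta$, $\hat{\beta}$, apply Algorithm 1 in the appendix to $y_{ijt} - \hat{\beta}'x_{ijt}$ to obtain an estimator of $C_{ijt}$:

\[ \hat{C}_{ijt} = \hat{\lambda}_{ij}^{(0)\prime}\hat{f}_t^{(0)} + \hat{\lambda}_{ij}^{(1)\prime}\hat{f}_{it}^{(1)} + \hat{\lambda}_{ij}^{(2)\prime}\hat{f}_{jt}^{(2)}. \]

Step 2: Given the estimator of $C_{ijt}$, run a regression of $y_{ijt} - \hat{C}_{ijt}$ on $x_{ijt}$ to obtain the OLS estimator $\hat{\beta}$.

The convergence criterion could be such that

\[ \frac{\left\|\hat{\beta}_{\ell} - \hat{\beta}_{\ell-1}\right\|}{\left\|\hat{\beta}_{\ell-1}\right\|} < \epsilon_0\quad\text{and}\quad \frac{\sum_{i=1}^{N}\sum_{j=1}^{M_i}\left\|\hat{C}_{ij,\ell} - \hat{C}_{ij,\ell-1}\right\|}{\sum_{i=1}^{N}\sum_{j=1}^{M_i}\left\|\hat{C}_{ij,\ell-1}\right\|} < \epsilon_0, \]

where $\epsilon_0$ is a small number, $\hat{C}_{ij,\ell} = \left(\hat{C}_{ij1},\dots,\hat{C}_{ijT}\right)'$ and the subscript $\ell$ represents the $\ell$th iteration.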
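The two-step iteration and its stopping rule can be sketched in a few lines. This is a deliberately simplified illustration, not the algorithm of the text: it assumes Python with NumPy, collapses the $(i,j)$ pair into a single unit index, and replaces Algorithm 1 by a plain rank-$r$ principal-components fit with a known number of global factors. The names `pc_component` and `iterate_beta` and the simulated design are hypothetical.

```python
import numpy as np

def pc_component(W, r):
    """Rank-r principal-components fit of a T x n matrix W."""
    T = W.shape[0]
    _, vec = np.linalg.eigh(W @ W.T)       # eigenvalues in ascending order
    F = np.sqrt(T) * vec[:, -r:]           # F'F / T = I_r
    return F @ (F.T @ W) / T               # fitted common component

def iterate_beta(Y, X, r, eps0=1e-8, max_iter=500):
    """Y: (T, n) data; X: (T, n, k) regressors. Returns (beta, C, n_iter)."""
    T, n, k = X.shape
    Xs = X.reshape(T * n, k)
    def ols(V):
        return np.linalg.lstsq(Xs, V.reshape(T * n), rcond=None)[0]
    beta = ols(Y)                           # initial estimator: pooled OLS
    C = np.zeros_like(Y)
    for it in range(1, max_iter + 1):
        C_new = pc_component(Y - X @ beta, r)      # Step 1: factor component
        beta_new = ols(Y - C_new)                  # Step 2: OLS given C
        d_beta = np.linalg.norm(beta_new - beta) / max(np.linalg.norm(beta), 1e-12)
        # sum over units of ||C_l - C_{l-1}|| relative to sum of ||C_{l-1}||
        d_C = (np.linalg.norm(C_new - C, axis=0).sum()
               / max(np.linalg.norm(C, axis=0).sum(), 1e-12))
        beta, C = beta_new, C_new
        if d_beta < eps0 and d_C < eps0:
            break
    return beta, C, it

# Simulated example: regressor correlated with the factor component
rng = np.random.default_rng(1)
T, n = 60, 120
f = rng.standard_normal(T)
lam = 1.0 + rng.standard_normal(n)
common = np.outer(f, lam)
x = common + rng.standard_normal((T, n))
Y = 2.0 * x + common + rng.standard_normal((T, n))
beta_hat, C_hat, n_iter = iterate_beta(Y, x[:, :, None], r=1)
print(beta_hat, n_iter)
```

In the full 3D setting, `pc_component` would be replaced by the sequential global, $i$-specific and $j$-specific PCA steps of the appendix algorithm, with the numbers of factors selected rather than fixed.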

Algorithm. Pick a reasonably large $r_{max}$, which is the largest number of factors we allow.

1. Initial estimation:

Step 01: Apply PCA to $y_{ijt}$ based on (2.4).
Let $\mu_k^{(0)}$ be the $k$th largest eigenvalue of $Y^{(0)}Y^{(0)\prime}/(NMT)$ (a $T \times T$ matrix). The estimated $r_0$ is

\[ \hat{r}^{(0)} = \arg\max_{k}\left\{\frac{\mu_k^{(0)}}{\mu_{k+1}^{(0)}};\ k = 1,\dots,r_{max}\right\}. \]

The estimated factor $\hat{F}^{(0)}$ and factor loading $\hat{\Lambda}^{(0)}$ are, respectively, $\hat{F}^{(0)} = \sqrt{T}$ times the eigenvectors corresponding to the $\hat{r}^{(0)}$ largest eigenvalues of $Y^{(0)}Y^{(0)\prime}$, and

\[ \hat{\Lambda}^{(0)} = \frac{\hat{F}^{(0)\prime}Y^{(0)}}{T}. \]

Let $\hat{\lambda}_{ij}^{(0)}$ and $\hat{f}_t^{(0)}$ be the elements of $\hat{\Lambda}^{(0)}$ and $\hat{F}^{(0)}$, respectively.
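Step 01 can be illustrated as follows. The sketch assumes Python with NumPy and a stacked $T \times n$ data matrix (with $n$ the number of $(i,j)$ pairs); the function name `pca_eigenvalue_ratio` and the simulated two-factor design are hypothetical, and the loadings are returned as an $n \times \hat{r}$ array, i.e. the transpose of the expression in the text.

```python
import numpy as np

def pca_eigenvalue_ratio(Y, rmax):
    """Eigenvalue-ratio selection and PCA for a T x n data matrix Y.

    Returns (r_hat, F_hat, Lam_hat) with F_hat of shape (T, r_hat),
    normalised so that F_hat' F_hat / T = I, and Lam_hat of shape (n, r_hat).
    """
    T, n = Y.shape
    eigval, eigvec = np.linalg.eigh(Y @ Y.T / (n * T))   # T x T, ascending
    order = np.argsort(eigval)[::-1]                     # sort descending
    eigval, eigvec = eigval[order], eigvec[:, order]
    ratios = eigval[:rmax] / eigval[1:rmax + 1]          # mu_k / mu_{k+1}
    r_hat = int(np.argmax(ratios)) + 1
    F_hat = np.sqrt(T) * eigvec[:, :r_hat]
    Lam_hat = Y.T @ F_hat / T
    return r_hat, F_hat, Lam_hat

# Simulated example with two strong factors (illustrative values only)
rng = np.random.default_rng(0)
T, n, r = 100, 200, 2
F = rng.standard_normal((T, r))
Lam = rng.standard_normal((n, r))
Y = F @ Lam.T + rng.standard_normal((T, n))
r_hat, F_hat, Lam_hat = pca_eigenvalue_ratio(Y, rmax=8)
print(r_hat)
```

The eigenvalue ratio needs $r_{max} + 1 \le T$ eigenvalues and strictly positive denominators, so in practice $r_{max}$ should be kept well below $\min(T, n)$.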

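Step 02: For each $i$, apply PCA to $y_{ijt}^{(1)} = y_{ijt} - \hat{\lambda}_{ij}^{(0)\prime}\hat{f}_t^{(0)}$ based on (2.5).

For a given $i$, define the data matrix $Y_i^{(1)} = \left(y_{i1}^{(1)},\dots,y_{iM_i}^{(1)}\right)$ where $y_{ij}^{(1)} = \left(y_{ij1}^{(1)},\dots,y_{ijT}^{(1)}\right)'$. Let $\mu_k^{(1)}$ be the $k$th largest eigenvalue of $Y_i^{(1)}Y_i^{(1)\prime}/(M_iT)$ (a $T \times T$ matrix). The estimated $r_i^{(1)}$ is

\[ \hat{r}_i^{(1)} = \arg\max_{k}\left\{\frac{\mu_k^{(1)}}{\mu_{k+1}^{(1)}};\ k = 1,\dots,r_{max}\right\}. \]

The estimated factor $\hat{F}_i^{(1)}$ and factor loading $\hat{\Lambda}_i^{(1)}$ are, respectively, $\hat{F}_i^{(1)} = \sqrt{T}$ times the eigenvectors corresponding to the $\hat{r}_i^{(1)}$ largest eigenvalues of $Y_i^{(1)}Y_i^{(1)\prime}$, and

\[ \hat{\Lambda}_i^{(1)} = \frac{\hat{F}_i^{(1)\prime}Y_i^{(1)}}{T}. \]

Let $\hat{\lambda}_{ij}^{(1)}$ and $\hat{f}_{it}^{(1)}$ be the elements of $\hat{\Lambda}_i^{(1)}$ and $\hat{F}_i^{(1)}$, respectively.

Step 03: For each $j$, apply PCA to $y_{ijt}^{(2)} = y_{ijt} - \hat{\lambda}_{ij}^{(0)\prime}\hat{f}_t^{(0)} - \hat{\lambda}_{ij}^{(1)\prime}\hat{f}_{it}^{(1)}$ based on (2.6).

For a given $j$, define the data matrix $Y_j^{(2)} = \left(y_{1j}^{(2)},\dots,y_{N_jj}^{(2)}\right)$ where $y_{ij}^{(2)} = \left(y_{ij1}^{(2)},\dots,y_{ijT}^{(2)}\right)'$. Let $\mu_k^{(2)}$ be the $k$th largest eigenvalue of $Y_j^{(2)}Y_j^{(2)\prime}/(N_jT)$ (a $T \times T$ matrix). The estimated $r_j^{(2)}$ is

\[ \hat{r}_j^{(2)} = \arg\max_{k}\left\{\frac{\mu_k^{(2)}}{\mu_{k+1}^{(2)}};\ k = 1,\dots,r_{max}\right\}. \]

The estimated factor $\hat{F}_j^{(2)}$ and factor loading $\hat{\Lambda}_j^{(2)}$ are, respectively, $\hat{F}_j^{(2)} = \sqrt{T}$ times the eigenvectors corresponding to the $\hat{r}_j^{(2)}$ largest eigenvalues of $Y_j^{(2)}Y_j^{(2)\prime}$, and

\[ \hat{\Lambda}_j^{(2)} = \frac{\hat{F}_j^{(2)\prime}Y_j^{(2)}}{T}. \]

Let $\hat{\lambda}_{ij}^{(2)}$ and $\hat{f}_{jt}^{(2)}$ be the elements of $\hat{\Lambda}_j^{(2)}$ and $\hat{F}_j^{(2)}$, respectively.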

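2. Iteration: We use the subscript $\ell$ to denote the $\ell$th iteration, $\ell = 1, \dots$

Step $\ell$1: Apply PCA to the data

$$y_{ijt,\ell}^{(0)} = y_{ijt} - \lambda_{ij,\ell-1}^{(1)\prime} f_{it,\ell-1}^{(1)} - \lambda_{ij,\ell-1}^{(2)\prime} f_{jt,\ell-1}^{(2)}$$

as in Step 01.

Step $\ell$2: Apply PCA to

$$y_{ijt,\ell}^{(1)} = y_{ijt} - \lambda_{ij,\ell-1}^{(0)\prime} f_{t,\ell}^{(0)} - \lambda_{ij,\ell-1}^{(2)\prime} f_{jt,\ell-1}^{(2)}$$

as in Step 02.

Step $\ell$3: Apply PCA to the data

$$y_{ijt,\ell}^{(2)} = y_{ijt} - \lambda_{ij,\ell-1}^{(0)\prime} f_{t,\ell}^{(0)} - \lambda_{ij,\ell-1}^{(1)\prime} f_{it,\ell-1}^{(1)}$$

as in Step 03.

The algorithm stops if a convergence criterion is satisfied, e.g.

$$\frac{\sum_{i=1}^{N}\sum_{j=1}^{M_i}\|C_{ij,\ell} - C_{ij,\ell-1}\|}{\sum_{i=1}^{N}\sum_{j=1}^{M_i}\|C_{ij,\ell-1}\|} < \varepsilon_0$$

where $\varepsilon_0$ is a small number,

$$C_{ijt} = \lambda_{ij}^{(0)\prime} f_t^{(0)} + \lambda_{ij}^{(1)\prime} f_{it}^{(1)} + \lambda_{ij}^{(2)\prime} f_{jt}^{(2)}$$

and $C_{ij} = (C_{ij1}, \dots, C_{ijT})'$.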
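Each of Steps 01 to 03 applies the same building block to a different data matrix: pick the number of factors by the eigenvalue ratio, then take $\sqrt{T}$ times the leading eigenvectors as factors and $F'Y/T$ as loadings. A minimal sketch of that building block follows; the function name `pca_factors` and the $T \times n$ layout of `Y` (one column per cross-section unit) are assumptions made for illustration, and `r_max + 1` must not exceed $T$.

```python
import numpy as np

def pca_factors(Y, r_max):
    """Eigenvalue-ratio choice of r, then PC factors and loadings.

    Y: T x n data matrix (hypothetical layout: one column per unit).
    Returns (r_hat, F, Lam) with F of shape T x r_hat and Lam of shape r_hat x n.
    """
    T, n = Y.shape
    S = Y @ Y.T / (n * T)                           # T x T matrix, scaled as in the text
    eigval, eigvec = np.linalg.eigh(S)              # eigh returns ascending eigenvalues
    eigval, eigvec = eigval[::-1], eigvec[:, ::-1]  # reorder to descending
    ratios = eigval[:r_max] / eigval[1:r_max + 1]   # mu_k / mu_{k+1}, k = 1, ..., r_max
    r_hat = int(np.argmax(ratios)) + 1
    F = np.sqrt(T) * eigvec[:, :r_hat]              # normalised so that F'F / T = I
    Lam = F.T @ Y / T                               # loadings F'Y / T
    return r_hat, F, Lam
```

The scaling of `S` does not affect the eigenvectors or the eigenvalue ratios; it is kept only to mirror the text.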

8.5.2 A panel data model with interactive effects characterized by multilevel non-parallel factors by Li and Yang (2017)

Consider a two-level factor model:

$$\varepsilon_{it}^{s} = \gamma_i' g_t + \lambda_i^{s\prime} f_t^{s} + u_{it}^{s}, \quad s = 1, \dots, S$$

where the second-level factors $f_t^{s}$ ($r_s \times 1$) are parallel to each other and nested under the global factor $g_t$ ($r \times 1$). If regional shocks are also considered, the factor model is extended to

$$\varepsilon_{it}^{sd} = \gamma_i' g_t + \lambda_i^{s\prime} f_t^{s} + \delta_i^{d\prime} h_t^{d} + u_{it}^{sd}, \quad s = 1, \dots, S, \ d = 1, \dots, D \tag{440}$$

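where $h_t^{d}$ ($r_d \times 1$) are regional factors with loading parameters $\delta_i^{d}$. The $h_t^{d}$ are neither parallel to nor nested under the industrial factors $f_t^{s}$. The estimators of Wang (2008), Moench, Ng, and Potter (2013) and Bai and Wang (2015) cannot be used.

The vector form of the multilevel non-parallel factor model is given by

$$
\underset{S \times D}{\begin{bmatrix}
\varepsilon_{it}^{11} & \varepsilon_{it}^{12} & \cdots & \varepsilon_{it}^{1D} \\
\varepsilon_{it}^{21} & \varepsilon_{it}^{22} & \cdots & \varepsilon_{it}^{2D} \\
\vdots & \vdots & \ddots & \vdots \\
\varepsilon_{it}^{S1} & \varepsilon_{it}^{S2} & \cdots & \varepsilon_{it}^{SD}
\end{bmatrix}}
=
\underset{S \times r}{\begin{bmatrix}
\gamma_i' & \lambda_i^{1\prime} & 0 & \cdots & 0 \\
\gamma_i' & 0 & \lambda_i^{2\prime} & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\gamma_i' & 0 & 0 & \cdots & \lambda_i^{S\prime}
\end{bmatrix}}
\underset{r \times D}{\left( e_D' \otimes \begin{bmatrix} g_t \\ f_t^{1} \\ \vdots \\ f_t^{S} \end{bmatrix} \right)}
\tag{441}
$$

$$
\quad + \underset{S \times r_D}{\left( e_S \otimes \begin{bmatrix} \delta_i^{1\prime} & \delta_i^{2\prime} & \cdots & \delta_i^{D\prime} \end{bmatrix} \right)}
\underset{r_D \times D}{\begin{bmatrix}
h_t^{1} & 0 & \cdots & 0 \\
0 & h_t^{2} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & h_t^{D}
\end{bmatrix}}
+
\underset{S \times D}{\begin{bmatrix}
u_{it}^{11} & u_{it}^{12} & \cdots & u_{it}^{1D} \\
u_{it}^{21} & u_{it}^{22} & \cdots & u_{it}^{2D} \\
\vdots & \vdots & \ddots & \vdots \\
u_{it}^{S1} & u_{it}^{S2} & \cdots & u_{it}^{SD}
\end{bmatrix}}
$$

where

$$\underset{S \times 1}{e_S} = (1, \dots, 1)', \qquad \underset{D \times 1}{e_D} = (1, \dots, 1)'.$$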

Remark: Li and Yang (2017) propose the iterative PC algorithm following Bai (2009), with simulation evidence. But I believe that the model (440) is basically the same as (454). In particular, when (441) is expressed with more careful notation, it would be equivalent to (456). Still, it is worthwhile to consider their panel estimator and extend it more rigorously. See also BE2016...

8.5.3 The 3D Panel Data Models with the Multi-level Factor Structure

• Here we develop the 3D PCA following Breitung and Eickmeier (2016) and Choi et al. (2017)...

• This is quite feasible and worthwhile, along with developing the information criteria for determining the number of both global and local factors.

• One of my Ph.D. students, Rui, has been working on this extension, so far focussing mostly on the simulation performance.

• I will provide more details soon.

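3D Panel Data Models with the Multi-level Factor Structure: An iterative approach. Consider the following multi-dimensional factor model:

$$y_{ijt} = \beta' x_{ijt} + \gamma_{ij}' G_t + \lambda_{ij}' F_{it} + u_{ijt}, \quad i = 1, \dots, R, \ j = 1, \dots, M_i, \ t = 1, \dots, T \tag{442}$$

where $i = 1, \dots, R$ indicates the country, $j = 1, \dots, M_i$ denotes the sector and $t = 1, \dots, T$ is the time period. $x_{ijt}$ is a $k \times 1$ vector of regressors and $\beta$ is a conformably defined vector containing homogeneous coefficients. $G_t = (G_{1t}, \dots, G_{r_0 t})'$ comprises the $r_0 \times 1$ global factors, and $F_{it}$ collects the $r_i \times 1$ vector of sectoral factors in country $i$. Stacking (442) for $j = 1, \dots, M_i$, we obtain:

$$y_{i.t} = x_{i.t}\beta + (\Gamma_i, \Lambda_i) \begin{pmatrix} G_t \\ F_{it} \end{pmatrix} + u_{i.t} \tag{443}$$

where

$$
\underset{M_i \times 1}{y_{i.t}} = \begin{bmatrix} y_{i1t} \\ y_{i2t} \\ \vdots \\ y_{iM_it} \end{bmatrix}, \quad
\underset{M_i \times 1}{u_{i.t}} = \begin{bmatrix} u_{i1t} \\ u_{i2t} \\ \vdots \\ u_{iM_it} \end{bmatrix}, \quad
\underset{M_i \times r_0}{\Gamma_i} = \begin{bmatrix} \gamma_{i1}' \\ \gamma_{i2}' \\ \vdots \\ \gamma_{iM_i}' \end{bmatrix}, \quad
\underset{M_i \times r_i}{\Lambda_i} = \begin{bmatrix} \lambda_{i1}' \\ \lambda_{i2}' \\ \vdots \\ \lambda_{iM_i}' \end{bmatrix}, \quad
\underset{M_i \times k}{x_{i.t}} = \begin{bmatrix} x_{i1t}' \\ x_{i2t}' \\ \vdots \\ x_{iM_it}' \end{bmatrix}
$$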

The entire system becomes:

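$$y_t = x_t\beta + \Lambda^{*} F_t^{*} + u_t \tag{444}$$

where

$$
\underset{N \times 1}{y_t} = \begin{bmatrix} y_{1.t} \\ y_{2.t} \\ \vdots \\ y_{R.t} \end{bmatrix}, \quad
\underset{N \times 1}{u_t} = \begin{bmatrix} u_{1.t} \\ u_{2.t} \\ \vdots \\ u_{R.t} \end{bmatrix}, \quad
\underset{r^{*} \times 1}{F_t^{*}} = \begin{bmatrix} G_t \\ F_{1t} \\ F_{2t} \\ \vdots \\ F_{Rt} \end{bmatrix}, \quad
\underset{N \times k}{x_t} = \begin{bmatrix} x_{1.t} \\ x_{2.t} \\ \vdots \\ x_{R.t} \end{bmatrix},
$$

$$
\underset{N \times r^{*}}{\Lambda^{*}} = \begin{bmatrix}
\Gamma_1 & \Lambda_1 & 0 & \cdots & 0 \\
\Gamma_2 & 0 & \Lambda_2 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\Gamma_R & 0 & 0 & \cdots & \Lambda_R
\end{bmatrix},
$$

and

$$N = \sum_{i=1}^{R} M_i, \qquad r^{*} = r_0 + \sum_{i=1}^{R} r_i.$$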

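Then, we write the full system as

$$Y = (I_T \otimes \beta') X + F^{*}\Lambda^{*\prime} + U \tag{445}$$

where

$$
\underset{T \times N}{Y} = \begin{bmatrix} y_1' \\ \vdots \\ y_T' \end{bmatrix}, \quad
\underset{T \times r^{*}}{F^{*}} = \begin{bmatrix} F_1^{*\prime} \\ \vdots \\ F_T^{*\prime} \end{bmatrix}, \quad
\underset{T \times N}{U} = \begin{bmatrix} u_1' \\ \vdots \\ u_T' \end{bmatrix}, \quad
\underset{Tk \times N}{X} = \begin{bmatrix} x_1' \\ \vdots \\ x_T' \end{bmatrix}.
$$

Alternatively, we transpose equation (443) and stack over time as

$$y_i = (I_T \otimes \beta') x_i + G\Gamma_i' + F_i\Lambda_i' + u_i$$

where

$$
\underset{T \times M_i}{y_i} = \begin{bmatrix} y_{i.1}' \\ y_{i.2}' \\ \vdots \\ y_{i.T}' \end{bmatrix}, \quad
\underset{Tk \times M_i}{x_i} = \begin{bmatrix} x_{i.1}' \\ x_{i.2}' \\ \vdots \\ x_{i.T}' \end{bmatrix}, \quad
u_i = \begin{bmatrix} u_{i.1}' \\ u_{i.2}' \\ \vdots \\ u_{i.T}' \end{bmatrix}.
$$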

Estimation: The estimation involves two types of iteration, an outer iteration and an inner iteration. The outer iteration runs between $\beta$ and the factors, while the inner iteration runs between the country factors and the sectoral factors. We use the subscript $\ell$ to denote the $\ell$th outer iteration, $\ell = 1, \dots$

Outer iteration:

Step 1: Run the OLS regression, ignoring the factors, using the data

$$y_{ijt} = \beta' x_{ijt} + \varepsilon_{ijt}$$

where $\varepsilon_{ijt} = \gamma_{ij}' G_t + \lambda_{ij}' F_{it} + u_{ijt}$, and obtain the initial estimator of $\beta$, denoted $\hat\beta^{(0)}$.

Step 2: Apply one of the multi-level factor estimation techniques to

$$y_{ijt}^{(0)} = \gamma_{ij}' G_t + \lambda_{ij}' F_{it} + u_{ijt}$$

where $y_{ijt}^{(0)} = y_{ijt} - \hat\beta^{(0)\prime} x_{ijt}$. Denote the initial estimators of $\gamma_{ij}$, $G_t$, $\lambda_{ij}$ and $F_{it}$ as $\hat\gamma_{ij}^{(0)}$, $\hat G_t^{(0)}$, $\hat\lambda_{ij}^{(0)}$ and $\hat F_{it}^{(0)}$.


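Step 3: Construct

$$\tilde y_{ijt}^{(0)} = y_{ijt} - \hat\gamma_{ij}^{(0)\prime} \hat G_t^{(0)} - \hat\lambda_{ij}^{(0)\prime} \hat F_{it}^{(0)}$$

by subtracting the (estimated) factors from the data. Then, run the following OLS regression:

$$\tilde y_{ijt}^{(0)} = \beta' x_{ijt} + \varepsilon_{ijt}$$

and obtain the updated estimator of $\beta$, denoted $\hat\beta^{(1)}$.

Step 4: Next, construct the updated residuals by

$$y_{ijt}^{(1)} = y_{ijt} - \hat\beta^{(1)\prime} x_{ijt}.$$

Then, apply the multi-level factor estimation to

$$y_{ijt}^{(1)} = \gamma_{ij}' G_t + \lambda_{ij}' F_{it} + u_{ijt}$$

to obtain the updated estimates of the factors and factor loadings, denoted $\hat\gamma_{ij}^{(1)}$, $\hat G_t^{(1)}$, $\hat\lambda_{ij}^{(1)}$ and $\hat F_{it}^{(1)}$.

Repeat Steps 3 and 4 until convergence. After the $\ell$th iteration, we have the estimators $\hat\beta^{(\ell)}$, $\hat\gamma_{ij}^{(\ell)}$, $\hat G_t^{(\ell)}$, $\hat\lambda_{ij}^{(\ell)}$ and $\hat F_{it}^{(\ell)}$. The algorithm stops if a convergence criterion is satisfied, e.g.

$$\sum_{i=1}^{R}\sum_{j=1}^{M_i} \|C_{ij,\ell} - C_{ij,\ell-1}\| < \varepsilon_0$$

where $\varepsilon_0$ is a small number,

$$C_{ijt} = \lambda_{ij(\ell)}^{(0)\prime} f_{t(\ell)}^{(0)} + \lambda_{ij(\ell)}^{(1)\prime} f_{it(\ell)}^{(1)}$$

(that is, the estimated common component $\hat\gamma_{ij}^{(\ell)\prime}\hat G_t^{(\ell)} + \hat\lambda_{ij}^{(\ell)\prime}\hat F_{it}^{(\ell)}$ in the notation of (442)) and $C_{ij} = (C_{ij1}, \dots, C_{ijT})'$. We follow two different approaches, BE16 and Choi et al. (2017), to estimate the multi-level factors, and describe their algorithms in turn.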
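The outer iteration (Steps 1 to 4) can be sketched as follows. To keep the example short, a single-level rank-$r$ principal-component fit stands in for the multi-level factor estimation of Steps 2 and 4; that substitution, the function names, and the array layout (`Y` is $T \times n$, `X` is $T \times n \times k$) are all assumptions made for illustration, not the procedure of BE16 or Choi et al. (2017).

```python
import numpy as np

def pc_common_component(R, r):
    """Rank-r principal-component fit of a T x n matrix R (stand-in for Steps 2 and 4)."""
    T = R.shape[0]
    _, vec = np.linalg.eigh(R @ R.T)
    F = np.sqrt(T) * vec[:, -r:]            # eigenvectors of the r largest eigenvalues
    return F @ (F.T @ R) / T                # F Lam' with Lam' = F'R / T

def iterate_beta(Y, X, r, eps0=1e-8, max_iter=500):
    """Alternate between OLS for beta and factor extraction from the residuals."""
    T, n, k = X.shape
    Xmat = X.reshape(T * n, k)

    def ols(Z):
        return np.linalg.lstsq(Xmat, Z.reshape(T * n), rcond=None)[0]

    beta = ols(Y)                                    # Step 1: OLS ignoring the factors
    C = pc_common_component(Y - X @ beta, r)         # Step 2: factors from y - x'beta
    for _ in range(max_iter):
        beta = ols(Y - C)                            # Step 3: OLS on de-factored data
        C_new = pc_common_component(Y - X @ beta, r) # Step 4: update the factors
        done = np.linalg.norm(C_new - C) < eps0 * max(np.linalg.norm(C), 1.0)
        C = C_new
        if done:
            break
    return beta, C
```

Swapping `pc_common_component` for a two-level routine would give the scheme described in the text; the stopping rule here is a relative version of the criterion above.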

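The multi-level factor estimation by Choi et al. (2017). Suppose that $r_0$ and all $r_i$ are given. As in Steps 2 and 4, we need to estimate the factors and factor loadings from

$$y_{ijt}^{(\ell)} = \gamma_{ij}' G_t + \lambda_{ij}' F_{it} + u_{ijt} = \begin{bmatrix} \gamma_{ij}', & \lambda_{ij}' \end{bmatrix} \begin{bmatrix} G_t \\ F_{it} \end{bmatrix} + u_{ijt}$$

where $y_{ijt}^{(\ell)} = y_{ijt} - \hat\beta^{(\ell)\prime} x_{ijt}$.

Stacking the above equation for $j = 1, \dots, M_i$, we obtain:

$$y_{i.t}^{(\ell)} = [\Gamma_i, \Lambda_i] \begin{bmatrix} G_t \\ F_{it} \end{bmatrix} + u_{i.t} = \Theta_i K_{it} + u_{i.t} \tag{446}$$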

where $\Theta_i = [\Gamma_i, \Lambda_i]$ and $K_{it} = [G_t', F_{it}']'$. Transposing (446) and stacking over time, we obtain

$$\underset{T \times M_i}{y_i^{(\ell)}} = \underset{T \times r_0}{G}\,\Gamma_i' + \underset{T \times r_i}{F_i}\,\Lambda_i' + \underset{T \times M_i}{u_i} \tag{447}$$

where $y_i^{(\ell)} = [y_{i.1}^{(\ell)}, \dots, y_{i.T}^{(\ell)}]'$, $G = [G_1, \dots, G_T]'$, $F_i = [F_{i1}, \dots, F_{iT}]'$ and $u_i = [u_{i.1}, \dots, u_{i.T}]'$.

Step $\ell$0: Choose two sectors, say 1 and 2, apply the PC estimator (PCE) to (446) and obtain the estimators of $K_1$ ($T \times r_1^{*}$) and $K_2$, denoted by $\hat K_1$ and $\hat K_2$. Calculate the sample covariance matrices between $\hat K_1$ and $\hat K_2$ as $\hat\Sigma_{ab}$ ($a, b = 1, 2$). Denote by $p_m$ the $r_1^{*} \times 1$ eigenvector corresponding to the $m$th largest eigenvalue, $\mu_m$, of the canonical correlation matrix $\hat\Sigma_{11}^{-1/2}\hat\Sigma_{12}\hat\Sigma_{22}^{-1}\hat\Sigma_{21}\hat\Sigma_{11}^{-1/2}$. Collect the $r_0$ leading eigenvectors as $p = [p_1, \dots, p_{r_0}]$ ($r_1^{*} \times r_0$). Then, we obtain the initial estimate of $G$ by

$$\underset{T \times r_0}{\hat G^{(\ell,0)}} = \hat K_1 p,$$

where $(\ell, 0)$ means the initial estimator in the $\ell$th step of the outer iteration. This is the canonical correlation analysis (CCA). CCA searches for the linear combinations of two sets of variables that yield the $m$th maximum squared correlation, which equals the $m$th largest eigenvalue of $\hat\Sigma_{11}^{-1/2}\hat\Sigma_{12}\hat\Sigma_{22}^{-1}\hat\Sigma_{21}\hat\Sigma_{11}^{-1/2}$. The corresponding eigenvector contains the coefficients of that combination. Therefore, if $K_1$ and $K_2$ contain the same $r_0$ global factors, there must be $r_0$ eigenvalues equal to 1. BE16 suggest choosing the sector pair with the largest canonical correlation. Choi et al. suggest choosing the pair whose sample mean of canonical correlations is largest (but in their code they choose the pair with the largest canonical correlation).

Step $\ell$1: After obtaining the initial country factors $\hat G^{(\ell,0)}$, we can estimate the sectoral factors by projecting out $\hat G^{(\ell,0)}$. Specifically, the initial estimator of the $i$th sectoral factor, $\hat F_i^{(\ell,0)}$, is given by $\sqrt{T}$ times the matrix of eigenvectors corresponding to the $r_i$ largest eigenvalues of the matrix $y_{Hi}^{(\ell)} y_{Hi}^{(\ell)\prime}$, where $y_{Hi}^{(\ell)} = H y_i^{(\ell)}$ and $H = I - \hat G^{(\ell,0)}(\hat G^{(\ell,0)\prime}\hat G^{(\ell,0)})^{-1}\hat G^{(\ell,0)\prime}$. The initial factor loading matrix can be estimated by $\hat\Lambda_i^{(\ell,0)} = (1/T)\, y_{Hi}^{(\ell)\prime} \hat F_i^{(\ell,0)}$.

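Step $\ell$2: Now, we can update the country factors and factor loadings using the results of the previous step. Rewrite (446) as

$$y_{i.t}^{(\ell)} - \hat\Lambda_i^{(\ell,0)} \hat F_{it}^{(\ell,0)} = \Gamma_i G_t + u_{i.t} - \left(\hat\Lambda_i^{(\ell,0)} \hat F_{it}^{(\ell,0)} - \Lambda_i F_{it}\right) = \Gamma_i G_t + \tilde u_{i.t}.$$

Stacking them across sectors we obtain:

$$
\underset{N \times 1}{\begin{bmatrix} y_{1.t}^{(\ell)} - \hat\Lambda_1^{(\ell,0)} \hat F_{1t}^{(\ell,0)} \\ \vdots \\ y_{R.t}^{(\ell)} - \hat\Lambda_R^{(\ell,0)} \hat F_{Rt}^{(\ell,0)} \end{bmatrix}}
= \underset{N \times r_0}{\begin{bmatrix} \Gamma_1 \\ \vdots \\ \Gamma_R \end{bmatrix}} G_t
+ \underset{N \times 1}{\begin{bmatrix} \tilde u_{1.t} \\ \vdots \\ \tilde u_{R.t} \end{bmatrix}}
$$

Then we apply PCE to the above equation to obtain the updated $\hat G^{(\ell,1)}$ and its factor loading matrix $\hat\Gamma_i^{(\ell,1)}$. They will be the final estimates of the country factors and loadings for the $\ell$th outer iteration.

Step $\ell$3: Finally, we use the updated country factors and factor loadings to update the sectoral factors and the corresponding factor loadings, applying PCE for each sector to

$$y_{i.t}^{(\ell)} - \hat\Gamma_i^{(\ell,1)} \hat G_t^{(\ell,1)} = \Lambda_i F_{it} + u_{i.t}.$$

The estimators $\hat F_i^{(\ell,1)}$ and $\hat\Lambda_i^{(\ell,1)}$ will be the final estimators of the sectoral factors and factor loadings for the $\ell$th outer iteration. Relabel $\hat\Gamma_i^{(\ell,1)}$, $\hat G^{(\ell,1)}$, $\hat\Lambda_i^{(\ell,1)}$ and $\hat F_i^{(\ell,1)}$ as $\hat\Gamma_i^{(\ell)}$, $\hat G^{(\ell)}$, $\hat\Lambda_i^{(\ell)}$ and $\hat F_i^{(\ell)}$; they will be the inputs to the $(\ell+1)$th iteration.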
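Steps $\ell$0 and $\ell$1 (the CCA start for the global factors, followed by projecting them out) can be sketched as below. The function names are hypothetical, uncentred second moments are used for the $\hat\Sigma_{ab}$ (an assumption; with PC-normalised factors $\hat\Sigma_{11} = I$), and the extra factor $\hat\Sigma_{11}^{-1/2}$ in `G0` reduces to $\hat K_1 p$ in that normalised case.

```python
import numpy as np

def pc_factors(Y, r):
    """sqrt(T) times the eigenvectors of the r largest eigenvalues of Y Y'."""
    T = Y.shape[0]
    _, vec = np.linalg.eigh(Y @ Y.T)
    return np.sqrt(T) * vec[:, -r:]

def inv_sqrt(S):
    """Inverse symmetric square root of a positive definite matrix."""
    val, vec = np.linalg.eigh(S)
    return vec @ np.diag(val ** -0.5) @ vec.T

def cca_global_factors(K1, K2, r0):
    """Initial global factors from the canonical correlations of two factor spaces."""
    T = K1.shape[0]
    S11, S12, S22 = K1.T @ K1 / T, K1.T @ K2 / T, K2.T @ K2 / T
    A = inv_sqrt(S11)
    M = A @ S12 @ np.linalg.solve(S22, S12.T) @ A   # S11^{-1/2} S12 S22^{-1} S21 S11^{-1/2}
    mu, p = np.linalg.eigh(M)
    mu, p = mu[::-1], p[:, ::-1]                    # descending squared canonical correlations
    G0 = K1 @ A @ p[:, :r0]                         # equals K1 p when S11 = I
    return G0, mu

def project_out(G, Y):
    """H Y with H = I - G (G'G)^{-1} G'."""
    return Y - G @ np.linalg.solve(G.T @ G, G.T @ Y)
```

The sectoral factors of block $i$ would then be `pc_factors(project_out(G0, y_i), r_i)`, matching Step $\ell$1.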

The multi-level factor estimation by BE16. In Step 4 of the $\ell$th iteration, given the estimator $\hat\beta^{(\ell)}$, we estimate the following model:

$$y_{ijt}^{(\ell)} = \gamma_{ij}' G_t + \lambda_{ij}' F_{it} + u_{ijt}. \tag{448}$$

Assume that the idiosyncratic components are independently and identically normally distributed (i.i.d.) across $i$, $j$ and $t$ with $E(u_{ijt}^2) = \sigma^2$. The estimator remains consistent if the errors are heteroskedastic and autocorrelated, cf. Wang (2010). Using equation (445) and treating the factors and factor loadings as unknown parameters yields the log-likelihood function

$$L = \text{const} - \frac{N^{*}T}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\,\mathrm{tr}\!\left[(Y^{(\ell)} - F^{*}\Lambda^{*\prime})(Y^{(\ell)} - F^{*}\Lambda^{*\prime})'\right]$$

where $Y^{(\ell)} = Y - (I_T \otimes \hat\beta^{(\ell)\prime})X$. The maximization of the likelihood function is equivalent to minimizing the sum of squared residuals (RSS)

$$S(F, \Lambda) = \sum_{t=1}^{T} (y_t^{(\ell)} - \Lambda^{*}F_t^{*})'(y_t^{(\ell)} - \Lambda^{*}F_t^{*}) = \sum_{i=1}^{R}\sum_{j=1}^{M_i}\sum_{t=1}^{T} \left(y_{ijt}^{(\ell)} - \gamma_{ij}' G_t - \lambda_{ij}' F_{it}\right)^2 \tag{449}$$

We use a sequential sub-iteration to minimize the objective function.

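Inner iteration:

Step $\ell$0: Employ the same procedure as Steps $\ell$0 and $\ell$1 described above to obtain the initial estimators $\hat G^{(\ell,0)}$ and $\hat F_i^{(\ell,0)}$, the same as in Choi et al.'s approach.

Step $\ell$1: Run $N$ time series regressions of the form

$$y_{ijt}^{(\ell)} = \gamma_{ij}' \hat G_t^{(\ell,0)} + \lambda_{ij}' \hat F_{it}^{(\ell,0)} + u_{ijt}.$$

Collecting all the estimated factor loadings $\hat\gamma_{ij}^{(\ell,0)}$ and $\hat\lambda_{ij}^{(\ell,0)}$, we can construct the loading matrix for the full system as

$$
\underset{N \times r^{*}}{\hat\Lambda^{*(\ell,0)}} = \begin{bmatrix}
\hat\Gamma_1^{(\ell,0)} & \hat\Lambda_1^{(\ell,0)} & 0 & \cdots & 0 \\
\hat\Gamma_2^{(\ell,0)} & 0 & \hat\Lambda_2^{(\ell,0)} & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\hat\Gamma_R^{(\ell,0)} & 0 & 0 & \cdots & \hat\Lambda_R^{(\ell,0)}
\end{bmatrix}
$$

Step $\ell$2: For each $t$, run the following cross-section regression, similar to equation (444), treating $\hat\Lambda^{*(\ell,0)}$ as the data matrix:

$$y_t^{(\ell)} = \hat\Lambda^{*(\ell,0)} F_t^{*} + u_t.$$

Then we can obtain the OLS estimator of $F_t^{*}$ as

$$
\hat F_t^{*(\ell,1)} = \begin{bmatrix} \hat G_t^{(\ell,1)} \\ \hat F_{1t}^{(\ell,1)} \\ \hat F_{2t}^{(\ell,1)} \\ \vdots \\ \hat F_{Rt}^{(\ell,1)} \end{bmatrix}
= \left(\hat\Lambda^{*(\ell,0)\prime}\hat\Lambda^{*(\ell,0)}\right)^{-1}\hat\Lambda^{*(\ell,0)\prime}\, y_t^{(\ell)}.
$$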

Step $\ell$3: After obtaining the updated factors, we can use them to update the factor loadings by running $N$ time series regressions as in Step $\ell$1. Denote the updated factor loadings as $\hat\gamma_{ij}^{(\ell,1)}$, $\hat\lambda_{ij}^{(\ell,1)}$ and the full loading matrix as $\hat\Lambda^{*(\ell,1)}$.

Repeat Steps $\ell$1 to $\ell$3 until the RSS no longer decreases. It is easy to see that

$$S(\hat F^{*(\ell,0)}, \hat\Lambda^{*(\ell,0)}) \ge S(\hat F^{*(\ell,1)}, \hat\Lambda^{*(\ell,1)}) \ge S(\hat F^{*(\ell,2)}, \hat\Lambda^{*(\ell,2)}) \ge \dots$$

since in each step the previous estimators are contained in the parameter space of the subsequent least-squares estimators. Hence the next estimation step cannot yield a larger RSS. Any fixed point is characterized by the condition

$$\hat\Lambda^{*\prime} Y^{(\ell)\prime} Y^{(\ell)} \left(I - \hat\Lambda^{*}(\hat\Lambda^{*\prime}\hat\Lambda^{*})^{-1}\hat\Lambda^{*\prime}\right) = 0$$

which results from the fact that the sum of squared residuals no longer decreases whenever the estimated factors and factor loadings are orthogonal to the residuals of the previous step. In order to ensure the uniqueness of the solution we need to adopt the same restrictions as in PC: the country factors and sectoral factors must be orthogonal, and all factors should have unit variance. Suppose that the sub-iteration stops at the $\kappa$th iteration. We first regress each sectoral factor on the country factors as

$$\hat F_{izt}^{(\ell,\kappa)} = b' \hat G_t^{(\ell,\kappa)} + v_{izt}, \quad i = 1, \dots, R, \ z = 1, \dots, r_i.$$

We use the residuals of these regressions as the updated sectoral factors, denoted $\tilde F_{it}^{(\ell,\kappa)}$. Now the country factors are orthogonal to all sectoral factors. Given $\hat G_t^{(\ell,\kappa)}$ and $\tilde F_{it}^{(\ell,\kappa)}$, the corresponding factor loadings $\hat\Gamma_i^{(\ell,\kappa)}$ and $\hat\Lambda_i^{(\ell,\kappa)}$ can be updated using OLS regressions as in Step $\ell$1. The normalized country factors can be obtained as the $r_0$ PCs of the estimated common components resulting from the nonzero eigenvalues and the associated eigenvectors of the matrix

$$\hat\Gamma_i^{(\ell,\kappa)} \left(\frac{1}{T}\sum_{t=1}^{T} \hat G_t^{(\ell,\kappa)} \hat G_t^{(\ell,\kappa)\prime}\right) \hat\Gamma_i^{(\ell,\kappa)\prime}.$$

We denote the final estimator of the country factors as $\hat G_t^{(\ell)}$. Similarly, for each sector the normalized sectoral factors can be obtained as the $r_i$ PCs of the estimated common components resulting from the nonzero eigenvalues and the associated eigenvectors of the matrix

$$\hat\Lambda_i^{(\ell,\kappa)} \left(\frac{1}{T}\sum_{t=1}^{T} \tilde F_{it}^{(\ell,\kappa)} \tilde F_{it}^{(\ell,\kappa)\prime}\right) \hat\Lambda_i^{(\ell,\kappa)\prime}.$$

The final estimators of the sectoral factors are denoted by $\hat F_{it}^{(\ell)}$. Finally, running OLS regressions as in Step $\ell$1, we obtain the final estimators of the factor loadings, denoted by $\hat\gamma_{ij}^{(\ell)}$ and $\hat\lambda_{ij}^{(\ell)}$.
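The sequential least-squares sub-iteration (Steps $\ell$1 to $\ell$3) alternates two OLS problems, so the RSS in (449) cannot increase. A minimal sketch under simplifying assumptions: the data come as a list `Ys` of $T \times M_i$ blocks (one per country), starting values for the factors are supplied by the caller, and the final orthogonalisation and normalisation are omitted. All names are hypothetical.

```python
import numpy as np

def rss(Ys, G, Fs, Gams, Lams):
    """Sum of squared residuals, as in (449)."""
    return sum(np.sum((Y - G @ Ga.T - F @ La.T) ** 2)
               for Y, F, Ga, La in zip(Ys, Fs, Gams, Lams))

def update_loadings(Ys, G, Fs):
    """Time-series regressions of each series on (G_t, F_it)."""
    r0 = G.shape[1]
    Gams, Lams = [], []
    for Y, F in zip(Ys, Fs):
        K = np.hstack([G, F])                               # T x (r0 + r_i)
        Theta = np.linalg.lstsq(K, Y, rcond=None)[0].T      # M_i x (r0 + r_i)
        Gams.append(Theta[:, :r0])
        Lams.append(Theta[:, r0:])
    return Gams, Lams

def update_factors(Ys, Gams, Lams):
    """Cross-section regression of y_t on the block loading matrix Lambda*."""
    r0 = Gams[0].shape[1]
    rs = [La.shape[1] for La in Lams]
    N = sum(Y.shape[1] for Y in Ys)
    Lam_star = np.zeros((N, r0 + sum(rs)))
    row, col = 0, r0
    for Ga, La in zip(Gams, Lams):
        Mi, ri = La.shape
        Lam_star[row:row + Mi, :r0] = Ga                    # global loadings Gamma_i
        Lam_star[row:row + Mi, col:col + ri] = La           # own-block loadings Lambda_i
        row, col = row + Mi, col + ri
    Yfull = np.hstack(Ys)                                   # T x N
    Fstar = np.linalg.lstsq(Lam_star, Yfull.T, rcond=None)[0].T   # T x r*
    G = Fstar[:, :r0]
    Fs = list(np.split(Fstar[:, r0:], np.cumsum(rs)[:-1], axis=1))
    return G, Fs

def sequential_ls(Ys, G, Fs, max_iter=200, tol=1e-10):
    """Iterate until the RSS stops decreasing; returns the RSS path as well."""
    Gams, Lams = update_loadings(Ys, G, Fs)
    path = [rss(Ys, G, Fs, Gams, Lams)]
    for _ in range(max_iter):
        G, Fs = update_factors(Ys, Gams, Lams)
        Gams, Lams = update_loadings(Ys, G, Fs)
        path.append(rss(Ys, G, Fs, Gams, Lams))
        if path[-2] - path[-1] < tol * path[-2]:
            break
    return G, Fs, Gams, Lams, path
```

Inspecting `path` is a simple way to confirm the monotone decrease of the RSS on a given data set.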

8.6 Digressions: Factor Representations

The following factor representations will be useful for enhancing our understanding.

The benchmark case of Lu and Xu (2019). Consider the following generic multi-dimensional factor model:

$$y_{ijt} = \lambda_{ij}^{(0)\prime} f_t^{(0)} + \lambda_{ij}^{(1)\prime} f_{it}^{(1)} + \lambda_{ij}^{(2)\prime} f_{jt}^{(2)} + u_{ijt}, \quad i = 1, \dots, N, \ j = 1, \dots, M, \ t = 1, \dots, T \tag{450}$$

where $i = 1, \dots, N$ indicates the source country, the index $j = 1, \dots, M$ denotes the destination country, and $t = 1, \dots, T$ stands for the time period. (For simplicity we fix the number of $i$ and $j$ indices at $N$ and $M$, which can be generalised in different applications.) $f_t^{(0)} = (f_{1t}^{(0)}, \dots, f_{r_0 t}^{(0)})'$ comprises the $r^{(0)} \times 1$ global factors, $f_{it}^{(1)}$ collects the $r_i^{(1)} \times 1$ vector of source factors in country $i$, and $f_{jt}^{(2)}$ collects the $r_j^{(2)} \times 1$ vector of destination factors in country $j$. The idiosyncratic component, $u_{ijt}$, is assumed to satisfy an approximate factor model. Stacking (450) for $j = 1, \dots, M$, we obtain:

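$$
\begin{bmatrix} y_{i1t} \\ y_{i2t} \\ \vdots \\ y_{iMt} \end{bmatrix}
= \begin{bmatrix} \lambda_{i1}^{(0)\prime} \\ \lambda_{i2}^{(0)\prime} \\ \vdots \\ \lambda_{iM}^{(0)\prime} \end{bmatrix} f_t^{(0)}
+ \begin{bmatrix} \lambda_{i1}^{(1)\prime} \\ \lambda_{i2}^{(1)\prime} \\ \vdots \\ \lambda_{iM}^{(1)\prime} \end{bmatrix} f_{it}^{(1)}
+ \begin{bmatrix} \lambda_{i1}^{(2)\prime} f_{1t}^{(2)} \\ \lambda_{i2}^{(2)\prime} f_{2t}^{(2)} \\ \vdots \\ \lambda_{iM}^{(2)\prime} f_{Mt}^{(2)} \end{bmatrix}
+ \begin{bmatrix} u_{i1t} \\ u_{i2t} \\ \vdots \\ u_{iMt} \end{bmatrix}
$$

$$
= \begin{bmatrix} \lambda_{i1}^{(0)\prime} \\ \lambda_{i2}^{(0)\prime} \\ \vdots \\ \lambda_{iM}^{(0)\prime} \end{bmatrix} f_t^{(0)}
+ \begin{bmatrix} \lambda_{i1}^{(1)\prime} \\ \lambda_{i2}^{(1)\prime} \\ \vdots \\ \lambda_{iM}^{(1)\prime} \end{bmatrix} f_{it}^{(1)}
+ \begin{bmatrix}
\lambda_{i1}^{(2)\prime} & 0 & \cdots & 0 \\
0 & \lambda_{i2}^{(2)\prime} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \lambda_{iM}^{(2)\prime}
\end{bmatrix}
\begin{bmatrix} f_{1t}^{(2)} \\ f_{2t}^{(2)} \\ \vdots \\ f_{Mt}^{(2)} \end{bmatrix}
+ \begin{bmatrix} u_{i1t} \\ u_{i2t} \\ \vdots \\ u_{iMt} \end{bmatrix}
$$

which can be compactly written as

$$
y_{i.t} = \Lambda_i^{(0)} f_t^{(0)} + \Lambda_i^{(1)} f_{it}^{(1)} + \Lambda_i^{(2)} F_t^{(2)} + u_{i.t}
= \left(\Lambda_i^{(0)}, \Lambda_i^{(1)}, \Lambda_i^{(2)}\right) \begin{bmatrix} f_t^{(0)} \\ f_{it}^{(1)} \\ F_t^{(2)} \end{bmatrix} + u_{i.t} \tag{451}
$$

where $r^{(2)} = \sum_{j=1}^{M} r_j^{(2)}$,

$$
\underset{M \times 1}{y_{i.t}} = \begin{bmatrix} y_{i1t} \\ y_{i2t} \\ \vdots \\ y_{iMt} \end{bmatrix}, \quad
\underset{M \times 1}{u_{i.t}} = \begin{bmatrix} u_{i1t} \\ u_{i2t} \\ \vdots \\ u_{iMt} \end{bmatrix}, \quad
\underset{r^{(2)} \times 1}{F_t^{(2)}} = \begin{bmatrix} f_{1t}^{(2)} \\ f_{2t}^{(2)} \\ \vdots \\ f_{Mt}^{(2)} \end{bmatrix},
$$

$$
\underset{M \times r^{(0)}}{\Lambda_i^{(0)}} = \begin{bmatrix} \lambda_{i1}^{(0)\prime} \\ \lambda_{i2}^{(0)\prime} \\ \vdots \\ \lambda_{iM}^{(0)\prime} \end{bmatrix}, \quad
\underset{M \times r_i^{(1)}}{\Lambda_i^{(1)}} = \begin{bmatrix} \lambda_{i1}^{(1)\prime} \\ \lambda_{i2}^{(1)\prime} \\ \vdots \\ \lambda_{iM}^{(1)\prime} \end{bmatrix}, \quad
\underset{M \times r^{(2)}}{\Lambda_i^{(2)}} = \begin{bmatrix}
\lambda_{i1}^{(2)\prime} & 0 & \cdots & 0 \\
0 & \lambda_{i2}^{(2)\prime} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \lambda_{iM}^{(2)\prime}
\end{bmatrix}
$$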

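The entire system representing all $i = 1, \dots, N$ source countries becomes:

$$
\begin{bmatrix} y_{1.t} \\ y_{2.t} \\ \vdots \\ y_{N.t} \end{bmatrix}
= \begin{bmatrix}
\Lambda_1^{(0)} & \Lambda_1^{(1)} & 0 & \cdots & 0 & \Lambda_1^{(2)} \\
\Lambda_2^{(0)} & 0 & \Lambda_2^{(1)} & \cdots & 0 & \Lambda_2^{(2)} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
\Lambda_N^{(0)} & 0 & 0 & \cdots & \Lambda_N^{(1)} & \Lambda_N^{(2)}
\end{bmatrix}
\begin{bmatrix} f_t^{(0)} \\ f_{1t}^{(1)} \\ f_{2t}^{(1)} \\ \vdots \\ f_{Nt}^{(1)} \\ F_t^{(2)} \end{bmatrix}
+ \begin{bmatrix} u_{1.t} \\ u_{2.t} \\ \vdots \\ u_{N.t} \end{bmatrix}
$$

$$y_t = \Lambda F_t + u_t$$

where $r = r^{(0)} + r^{(1)} + r^{(2)}$ with $r^{(1)} = \sum_{i=1}^{N} r_i^{(1)}$,

$$
\underset{NM \times 1}{y_t} = \begin{bmatrix} y_{1.t} \\ y_{2.t} \\ \vdots \\ y_{N.t} \end{bmatrix}, \quad
\underset{NM \times 1}{u_t} = \begin{bmatrix} u_{1.t} \\ u_{2.t} \\ \vdots \\ u_{N.t} \end{bmatrix}, \quad
\underset{r \times 1}{F_t} = \begin{bmatrix} f_t^{(0)} \\ f_{1t}^{(1)} \\ f_{2t}^{(1)} \\ \vdots \\ f_{Nt}^{(1)} \\ F_t^{(2)} \end{bmatrix},
$$

and $\Lambda$ ($NM \times r$) is the block loading matrix in the first display above. Stacking over time,

$$Y = F\Lambda' + U$$

where

$$
\underset{T \times NM}{Y} = \begin{bmatrix} y_1' \\ \vdots \\ y_T' \end{bmatrix}, \quad
\underset{T \times r}{F} = \begin{bmatrix} F_1' \\ \vdots \\ F_T' \end{bmatrix}, \quad
\underset{T \times NM}{U} = \begin{bmatrix} u_1' \\ \vdots \\ u_T' \end{bmatrix}
$$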
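As a check on the stacked representation, the block loading matrix $\Lambda$ can be assembled numerically and $\Lambda F_t$ compared with the element-wise common component in (450). The sketch below uses small made-up dimensions and, as a simplifying assumption, a common number of source factors `r1` for every $i$ and of destination factors `r2` for every $j$; all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, r0, r1, r2 = 3, 4, 2, 1, 1       # assumed: same r1 for all i, same r2 for all j

lam0 = rng.normal(size=(N, M, r0))     # lambda_ij^(0)
lam1 = rng.normal(size=(N, M, r1))     # lambda_ij^(1)
lam2 = rng.normal(size=(N, M, r2))     # lambda_ij^(2)
f0 = rng.normal(size=r0)               # global factors at a fixed t
f1 = rng.normal(size=(N, r1))          # source factors f_it^(1)
f2 = rng.normal(size=(M, r2))          # destination factors f_jt^(2)

# Element-wise common component of (450)
C = (lam0 @ f0
     + np.einsum('ijr,ir->ij', lam1, f1)
     + np.einsum('ijr,jr->ij', lam2, f2))

def block_diag_rows(rows):
    """Lambda_i^(2): place the row vector lambda_ij^(2)' in the j-th diagonal block."""
    m, r = rows.shape
    out = np.zeros((m, m * r))
    for j in range(m):
        out[j, j * r:(j + 1) * r] = rows[j]
    return out

r = r0 + N * r1 + M * r2
Lam = np.zeros((N * M, r))
for i in range(N):
    rows = slice(i * M, (i + 1) * M)
    Lam[rows, :r0] = lam0[i]                                   # Lambda_i^(0)
    Lam[rows, r0 + i * r1:r0 + (i + 1) * r1] = lam1[i]         # Lambda_i^(1) in block i
    Lam[rows, r0 + N * r1:] = block_diag_rows(lam2[i])         # Lambda_i^(2)

F_t = np.concatenate([f0, f1.ravel(), f2.ravel()])             # (f0', f1_1', ..., f1_N', F2')'
y_common = Lam @ F_t                                           # stacked common component
```

Reshaping `y_common` to `(N, M)` reproduces `C`, which confirms that the sparse block layout encodes the three factor levels correctly.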

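Special case: the country-industry hierarchical model of Choi et al. (2017). Consider the following multi-dimensional factor model:

$$y_{ijt} = \lambda_{ij}^{(0)\prime} f_t^{(0)} + \lambda_{ij}^{(1)\prime} f_{it}^{(1)} + u_{ijt}, \quad i = 1, \dots, N, \ j = 1, \dots, M_i, \ t = 1, \dots, T \tag{452}$$

where $i = 1, \dots, N$ indicates the country, $j = 1, \dots, M_i$ denotes the sector and $t = 1, \dots, T$ is the time period. $f_t^{(0)} = (f_{1t}^{(0)}, \dots, f_{r_0 t}^{(0)})'$ comprises the $r^{(0)} \times 1$ global factors, and $f_{it}^{(1)}$ collects the $r_i^{(1)} \times 1$ vector of sectoral factors in country $i$. Stacking (452) for $j = 1, \dots, M_i$, we obtain:

$$
y_{i.t} = \Lambda_i^{(0)} f_t^{(0)} + \Lambda_i^{(1)} f_{it}^{(1)} + u_{i.t}
= \left(\Lambda_i^{(0)}, \Lambda_i^{(1)}\right) \begin{pmatrix} f_t^{(0)} \\ f_{it}^{(1)} \end{pmatrix} + u_{i.t} \tag{453}
$$

where

$$
\underset{M_i \times 1}{y_{i.t}} = \begin{bmatrix} y_{i1t} \\ y_{i2t} \\ \vdots \\ y_{iM_it} \end{bmatrix}, \quad
\underset{M_i \times 1}{u_{i.t}} = \begin{bmatrix} u_{i1t} \\ u_{i2t} \\ \vdots \\ u_{iM_it} \end{bmatrix}, \quad
\underset{M_i \times r^{(0)}}{\Lambda_i^{(0)}} = \begin{bmatrix} \lambda_{i1}^{(0)\prime} \\ \lambda_{i2}^{(0)\prime} \\ \vdots \\ \lambda_{iM_i}^{(0)\prime} \end{bmatrix}, \quad
\underset{M_i \times r_i^{(1)}}{\Lambda_i^{(1)}} = \begin{bmatrix} \lambda_{i1}^{(1)\prime} \\ \lambda_{i2}^{(1)\prime} \\ \vdots \\ \lambda_{iM_i}^{(1)\prime} \end{bmatrix}
$$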

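The entire system becomes:

$$
\begin{bmatrix} y_{1.t} \\ y_{2.t} \\ \vdots \\ y_{N.t} \end{bmatrix}
= \begin{bmatrix}
\Lambda_1^{(0)} & \Lambda_1^{(1)} & 0 & \cdots & 0 \\
\Lambda_2^{(0)} & 0 & \Lambda_2^{(1)} & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\Lambda_N^{(0)} & 0 & 0 & \cdots & \Lambda_N^{(1)}
\end{bmatrix}
\begin{bmatrix} f_t^{(0)} \\ f_{1t}^{(1)} \\ f_{2t}^{(1)} \\ \vdots \\ f_{Nt}^{(1)} \end{bmatrix}
+ \begin{bmatrix} u_{1.t} \\ u_{2.t} \\ \vdots \\ u_{N.t} \end{bmatrix}
$$

$$y_t = \Lambda F_t + u_t$$

where

$$
\underset{N^{*} \times 1}{y_t} = \begin{bmatrix} y_{1.t} \\ y_{2.t} \\ \vdots \\ y_{N.t} \end{bmatrix}, \quad
\underset{N^{*} \times 1}{u_t} = \begin{bmatrix} u_{1.t} \\ u_{2.t} \\ \vdots \\ u_{N.t} \end{bmatrix}, \quad
\underset{r^{*} \times 1}{F_t} = \begin{bmatrix} f_t^{(0)} \\ f_{1t}^{(1)} \\ f_{2t}^{(1)} \\ \vdots \\ f_{Nt}^{(1)} \end{bmatrix},
$$

$\Lambda$ ($N^{*} \times r^{*}$) is the block loading matrix in the display above, and

$$N^{*} = \sum_{i=1}^{N} M_i, \qquad r^{*} = r^{(0)} + \sum_{i=1}^{N} r_i^{(1)}.$$

Stacking over time,

$$Y = F\Lambda' + U$$

where

$$
\underset{T \times N^{*}}{Y} = \begin{bmatrix} y_1' \\ \vdots \\ y_T' \end{bmatrix}, \quad
\underset{T \times r^{*}}{F} = \begin{bmatrix} F_1' \\ \vdots \\ F_T' \end{bmatrix}, \quad
\underset{T \times N^{*}}{U} = \begin{bmatrix} u_1' \\ \vdots \\ u_T' \end{bmatrix}
$$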

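The extension to the three-level or overlapping factor model (BE16). Consider the following three-level factor model:

$$y_{ik,jt} = \lambda_{ik,j}^{(0)\prime} f_t^{(0)} + \lambda_{ik,j}^{(1)\prime} f_{it}^{(1)} + \lambda_{ik,j}^{(2)\prime} f_{kt}^{(2)} + u_{ik,jt}, \quad i = 1, \dots, N, \ k = 1, \dots, K, \ j = 1, \dots, M, \ t = 1, \dots, T \tag{454}$$

where $i = 1, \dots, N$ indicates the country, $j = 1, \dots, M$ denotes the sector, and $k = 1, \dots, K$ is another classification (which could be overlapping, e.g. style, value-growth, or the market, such as Nasdaq or Dow Jones). $f_t^{(0)} = (f_{1t}^{(0)}, \dots, f_{r_0 t}^{(0)})'$ comprises the $r^{(0)} \times 1$ global factors, $f_{it}^{(1)}$ collects the $r_i^{(1)} \times 1$ vector of industry factors, and $f_{kt}^{(2)}$ collects the $r_k^{(2)} \times 1$ vector of factors for the other classification. Stacking (454) for $j = 1, \dots, M$, we obtain:

$$
\begin{bmatrix} y_{ik,1t} \\ \vdots \\ y_{ik,Mt} \end{bmatrix}
= \begin{bmatrix} \lambda_{ik,1}^{(0)\prime} \\ \vdots \\ \lambda_{ik,M}^{(0)\prime} \end{bmatrix} f_t^{(0)}
+ \begin{bmatrix} \lambda_{ik,1}^{(1)\prime} \\ \vdots \\ \lambda_{ik,M}^{(1)\prime} \end{bmatrix} f_{it}^{(1)}
+ \begin{bmatrix} \lambda_{ik,1}^{(2)\prime} \\ \vdots \\ \lambda_{ik,M}^{(2)\prime} \end{bmatrix} f_{kt}^{(2)}
+ \begin{bmatrix} u_{ik,1t} \\ \vdots \\ u_{ik,Mt} \end{bmatrix}
$$

$$
y_{ik,.t} = \Lambda_{ik}^{(0)} f_t^{(0)} + \Lambda_{ik}^{(1)} f_{it}^{(1)} + \Lambda_{ik}^{(2)} f_{kt}^{(2)} + u_{ik,.t}
= \left(\Lambda_{ik}^{(0)}, \Lambda_{ik}^{(1)}, \Lambda_{ik}^{(2)}\right) \begin{bmatrix} f_t^{(0)} \\ f_{it}^{(1)} \\ f_{kt}^{(2)} \end{bmatrix} + u_{ik,.t} \tag{455}
$$

where

$$
\underset{M \times 1}{y_{ik,.t}} = \begin{bmatrix} y_{ik,1t} \\ \vdots \\ y_{ik,Mt} \end{bmatrix}, \quad
\underset{M \times 1}{u_{ik,.t}} = \begin{bmatrix} u_{ik,1t} \\ \vdots \\ u_{ik,Mt} \end{bmatrix},
$$

$$
\underset{M \times r^{(0)}}{\Lambda_{ik}^{(0)}} = \begin{bmatrix} \lambda_{ik,1}^{(0)\prime} \\ \vdots \\ \lambda_{ik,M}^{(0)\prime} \end{bmatrix}, \quad
\underset{M \times r_i^{(1)}}{\Lambda_{ik}^{(1)}} = \begin{bmatrix} \lambda_{ik,1}^{(1)\prime} \\ \vdots \\ \lambda_{ik,M}^{(1)\prime} \end{bmatrix}, \quad
\underset{M \times r_k^{(2)}}{\Lambda_{ik}^{(2)}} = \begin{bmatrix} \lambda_{ik,1}^{(2)\prime} \\ \vdots \\ \lambda_{ik,M}^{(2)\prime} \end{bmatrix}
$$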

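Next, the entire system for $i = 1, \dots, N$ and $k = 1, \dots, K$, becomes:

$$
\begin{bmatrix} y_{11,.t} \\ \vdots \\ y_{N1,.t} \\ y_{12,.t} \\ \vdots \\ y_{N2,.t} \\ \vdots \\ y_{1K,.t} \\ \vdots \\ y_{NK,.t} \end{bmatrix}
= \begin{bmatrix}
\Lambda_{11}^{(0)} & \Lambda_{11}^{(1)} & \cdots & 0 & \Lambda_{11}^{(2)} & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & & \vdots \\
\Lambda_{N1}^{(0)} & 0 & \cdots & \Lambda_{N1}^{(1)} & \Lambda_{N1}^{(2)} & 0 & \cdots & 0 \\
\Lambda_{12}^{(0)} & \Lambda_{12}^{(1)} & \cdots & 0 & 0 & \Lambda_{12}^{(2)} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & & \vdots \\
\Lambda_{N2}^{(0)} & 0 & \cdots & \Lambda_{N2}^{(1)} & 0 & \Lambda_{N2}^{(2)} & \cdots & 0 \\
\vdots & \vdots & & \vdots & \vdots & \vdots & \ddots & \vdots \\
\Lambda_{1K}^{(0)} & \Lambda_{1K}^{(1)} & \cdots & 0 & 0 & 0 & \cdots & \Lambda_{1K}^{(2)} \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & & \vdots \\
\Lambda_{NK}^{(0)} & 0 & \cdots & \Lambda_{NK}^{(1)} & 0 & 0 & \cdots & \Lambda_{NK}^{(2)}
\end{bmatrix}
\begin{bmatrix} f_t^{(0)} \\ f_{1t}^{(1)} \\ \vdots \\ f_{Nt}^{(1)} \\ f_{1t}^{(2)} \\ \vdots \\ f_{Kt}^{(2)} \end{bmatrix}
+ \begin{bmatrix} u_{11,.t} \\ \vdots \\ u_{N1,.t} \\ \vdots \\ u_{1K,.t} \\ \vdots \\ u_{NK,.t} \end{bmatrix} \tag{456}
$$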


$$y_t = \Lambda F_t + u_t$$

where

$$
\underset{NKM \times 1}{y_t} = \begin{bmatrix} y_{11,.t} \\ \vdots \\ y_{N1,.t} \\ \vdots \\ y_{1K,.t} \\ \vdots \\ y_{NK,.t} \end{bmatrix}, \quad
\underset{NKM \times 1}{u_t} = \begin{bmatrix} u_{11,.t} \\ \vdots \\ u_{N1,.t} \\ \vdots \\ u_{1K,.t} \\ \vdots \\ u_{NK,.t} \end{bmatrix}, \quad
\underset{r \times 1}{F_t} = \begin{bmatrix} f_t^{(0)} \\ f_{1t}^{(1)} \\ \vdots \\ f_{Nt}^{(1)} \\ f_{1t}^{(2)} \\ \vdots \\ f_{Kt}^{(2)} \end{bmatrix},
$$

and $\Lambda$ ($NKM \times r$) is the block loading matrix appearing in (456). Stacking over time,

$$Y = F\Lambda' + U$$

where

$$
\underset{T \times NKM}{Y} = \begin{bmatrix} y_1' \\ \vdots \\ y_T' \end{bmatrix}, \quad
\underset{T \times r}{F} = \begin{bmatrix} F_1' \\ \vdots \\ F_T' \end{bmatrix}, \quad
\underset{T \times NKM}{U} = \begin{bmatrix} u_1' \\ \vdots \\ u_T' \end{bmatrix}
$$

Remark: See the estimation algorithm in BE16... But this is not quite a three-level model. We assume that $E(f_{kt}^{(2)} f_{kt}^{(2)\prime}) = I_{r_k^{(2)}}$ as well as $E(f_{kt}^{(2)} f_t^{(0)\prime}) = 0$ and $E(f_{kt}^{(2)} f_{it}^{(1)\prime}) = 0$. Least squares can be applied to estimate the factors and factor loadings, where the iteration adopts a sequential estimation of the factors $f_t^{(0)}, f_{1t}^{(1)}, \dots, f_{Nt}^{(1)}$ and $f_{1t}^{(2)}, \dots, f_{Kt}^{(2)}$. In what follows we focus on the sequential LS procedure. Consistent starting values can be obtained from a CCA of the relevant subfactors (see below). Let $\hat f_t^{0(0)}, \hat f_{1t}^{0(1)}, \dots, \hat f_{Nt}^{0(1)}$ and $\hat f_{1t}^{0(2)}, \dots, \hat f_{Kt}^{0(2)}$ denote the initial estimators. The loading matrices can be estimated by running regressions of $y_{ik,jt}$ on the initial factor estimates $\hat f_t^{0(0)}, \hat f_{1t}^{0(1)}, \dots, \hat f_{Nt}^{0(1)}$ and $\hat f_{1t}^{0(2)}, \dots, \hat f_{Kt}^{0(2)}$. The resulting LS estimators of the loading coefficients are organised as in the matrix $\Lambda$, yielding the estimator $\hat\Lambda$. An update of the factor estimates is obtained by running a regression of $y_t$ on $\hat\Lambda$, yielding the updated vector of factors, $\hat f_t^{1(0)}, \hat f_{1t}^{1(1)}, \dots, \hat f_{Nt}^{1(1)}$ and $\hat f_{1t}^{1(2)}, \dots, \hat f_{Kt}^{1(2)}$. With updated estimates of the factors we obtain improved estimates of the loading coefficients by again running regressions of $y_{ik,jt}$ on the estimated factors. This sequential LS estimation procedure continues until convergence.

The last step involves orthogonalising the local factors, $(\hat f_{1t}^{0(1)}, \dots, \hat f_{Nt}^{0(1)})$ and $(\hat f_{1t}^{0(2)}, \dots, \hat f_{Kt}^{0(2)})$. Although this orthogonalisation step is not necessary for identification of the factors, it enables us to perform a variance decomposition of individual variables with respect to the factors. Orthogonalising the factors can be achieved by regressing $(\hat f_{1t}^{0(1)}, \dots, \hat f_{Nt}^{0(1)})$ on $(\hat f_{1t}^{0(2)}, \dots, \hat f_{Kt}^{0(2)})$ (or vice versa) and taking the residuals as new estimates of $(f_{1t}^{(1)}, \dots, f_{Nt}^{(1)})$ or $(f_{1t}^{(2)}, \dots, f_{Kt}^{(2)})$.

Remark: The initialization for the three-level factor model works as follows. We first estimate the global factors as the first $r^{(0)}$ PCs, and the global factors are eliminated from the variables by running least-squares regressions of the variables on the estimated global factors.[50] In the next step the CCA is employed to extract the common component among the $r_i^{(1)} + r_k^{(2)}$ estimated factors from region $i$, group $k$ and the estimated factors from the same region $i$ but a different group $k'$. This common component is the estimated regional factor. Similarly, the estimated factor $f_{kt}^{(2)}$ is obtained from a CCA of the factors of region $i$, group $k$ and of a different region $i'$ but the same group. These initial estimates are used to start the sequential LS procedure.

The overall estimation procedure outlined for the three-level factor model with an overlapping factor structure can be generalised to allow for further levels of factors (provided that the number of units in each group is sufficiently large). Furthermore, the levels may be specified as a hierarchical structure (e.g. Moench et al. (2013)), that is, the second level of factors (e.g. regions) is divided into a third level of factors (e.g. countries) such that each third-level group is uniquely assigned to one second-level group. For such hierarchical structures the CCA can be adapted to yield a consistent initial estimator for a sequential estimation procedure that switches between estimating the factors and the (restricted) loadings.

More to come...

[50] Alternatively, a CCA between (i) the variables in region $i$ and group $k$ and (ii) the variables in region $i'$ and group $k'$ with $i \neq i'$ and $k \neq k'$ may be employed to extract the common factors. In our experience the two-step top-down estimator used in our simulation performs similarly and has the advantage that the starting values are invariant with respect to a reorganization of the levels (that is, interchanging regions and groups).

8.7 Alternative approaches based on the (local) spatial effects and global factors

To be completed:

• Indeed, this is my ultimate goal. I have some preliminary notes and am considering extending the QML-EM algorithms proposed by Bai and Li in a sequence of papers (all published in top journals).

• I will provide more details soon.

9 Spatial Weights Matrix and KMS [51]

The choice of appropriate spatial weights is a central component of spatial models, as it assumes a priori a structure of spatial dependence, which may or may not correspond closely to reality. The spatial weights are interpreted as functions of relevant measures of geographic or economic distance (Anselin, 1988, 2002). The choice of weights is frequently arbitrary, there is substantial uncertainty regarding the choice, and empirical results vary considerably.

Spatial panel data models typically assume a time-invariant spatial weights matrix. When the spatial weights matrix is constructed from economic or socioeconomic distances or demographic characteristics, it can be time varying (e.g. Case, Hines, and Rosen, 1993). Lee and Yu (2012b) investigate the QML estimation of SDPD models with time-varying spatial weights matrices. Their Monte Carlo results show that, when the spatial weights matrices vary substantially over time, misspecifying the model with a time-invariant spatial weights matrix may cause substantial bias in estimation.

Define the $N \times N$ matrix of the spatial weights:[52]

$$
W = \begin{bmatrix} w_{11} & \cdots & w_{1N} \\ \vdots & \ddots & \vdots \\ w_{N1} & \cdots & w_{NN} \end{bmatrix}
= \begin{bmatrix} w_1 \\ \vdots \\ w_N \end{bmatrix} \quad \text{with } w_{ii} = 0 \tag{457}
$$

The $j$th element of $w_i$, $w_{ij}$, represents the link (or distance) between the neighbor $j$ and the spatial unit $i$. It is common practice for $W$ to have a zero diagonal and to be row-normalized, such that the sum of the elements in each row of $W$ is unity. The $i$th row $w_i$ may be constructed as $w_i = (d_{i1}, d_{i2}, \dots, d_{in}) / \sum_{j=1}^{n} d_{ij}$, where $d_{ij} \ge 0$ represents a function of the spatial distance between the $i$th and $j$th units. The weighting operation may be interpreted as an average of neighboring values.

[51] I will make the notation clearer and more consistent after your feedback.
[52] We can easily construct the time-varying matrix, $W_t$.

Shi (2016) assumes that the endogenous and time-varying spatial weights matrix satisfies the following assumption (see below for details):

Assumption 6. (1) The spatial weights satisfy $w_{ij,t} \ge 0$, $w_{ii,t} = 0$, and $w_{ij,t} = 0$ if $\rho_{ij,t} > \rho_c$, i.e., there exists a threshold $\rho_c > 1$ such that the weight is zero if the geographic distance exceeds $\rho_c$. For $i \neq j$,

$$w_{ij,t} = h_{ij}(z_{it}, z_{jt})\, I(\rho_{ij,t} < \rho_c) \tag{458}$$

or the row-normalized version,

$$w_{ij,t} = \frac{h_{ij}(z_{it}, z_{jt})\, I(\rho_{ij,t} < \rho_c)}{\sum_{k=1}^{n} h_{ik}(z_{it}, z_{kt})\, I(\rho_{ik,t} < \rho_c)}$$

where the $h_{ij}(\cdot)$ are nonnegative, uniformly bounded functions. (2) The function $h_{ij}(\cdot)$ satisfies the Lipschitz condition,

$$|h_{ij}(a_1, b_1) - h_{ij}(a_2, b_2)| \le c_0 \left(|a_1 - a_2| + |b_1 - b_2|\right)$$

for some finite constant $c_0$.

For convenience consider the simple SAR model:[53]

$$y_{it} = \lambda \sum_{j=1}^{n} w_{ij,t}\, y_{jt} + v_{it}. \tag{459}$$
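To make the construction concrete, the sketch below builds a thresholded, row-normalised weights matrix for one period and the spatial lag $\sum_j w_{ij,t} y_{jt}$ in (459). It takes $h_{ij}$ to be the inverse absolute difference of a scalar covariate, purely for illustration; that choice is not bounded as Assumption 6 requires, and assigning zero weight to tied units is an extra assumption made here to avoid division by zero. The function names are hypothetical.

```python
import numpy as np

def spatial_weights(z, rho, rho_c, row_normalize=True):
    """Weights w_ij = h_ij * I(rho_ij < rho_c) with h_ij = 1 / |z_i - z_j| (illustrative choice).

    z:   (n,) covariate used for the economic distance in one period
    rho: (n, n) geographic distances
    """
    econ = np.abs(z[:, None] - z[None, :])
    with np.errstate(divide="ignore"):
        h = 1.0 / econ                      # infinite where z_i == z_j, including the diagonal
    h[~np.isfinite(h)] = 0.0                # assumption: tied units receive zero weight
    W = h * (rho < rho_c)                   # keep only pairs within the distance threshold
    np.fill_diagonal(W, 0.0)                # w_ii = 0
    if row_normalize:
        s = W.sum(axis=1, keepdims=True)
        W = np.divide(W, s, out=np.zeros_like(W), where=s > 0)   # rows with no neighbours stay zero
    return W

def spatial_lag(W, y):
    """Spatial lag sum_j w_ij y_j for every unit i."""
    return W @ y
```

With three units, `z = (1, 2, 4)` and only adjacent units within the threshold, the middle unit places weight 2/3 on the first unit and 1/3 on the third, since the first is economically closer.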

We consider the following specifications for the spatial weights matrix:

• Fixed weights based on the physical or economic distance, assuming the threshold is known or trying a number of different thresholds. This exogeneity assumption may hold when spatial weights are constructed using predetermined geographic distances.

• If “economic distance” such as the relative GDP or trade volume is used to construct the weight matrix, then it is likely that these elements are correlated with the final outcome.

• BHP: using the spatial correlation-based adjacency or spatial weights matrix is subject to sparsity issues, rendering the estimation less reliable. For example, in the UK house price data in Beulah’s thesis, the correlation-based weights matrix contains only 0.3% nonzero weights, leaving so many zero weights that the effective sample in each region is mostly 1 or 2.

• Under (458), Qu and Lee (2014) and Shi (2016) consider (see below for details):[54]

$$h_{ij}(z_{it}, z_{jt}) = \frac{1}{|z_{it} - z_{jt}|} \tag{460}$$

where $|z_{it} - z_{jt}|$ measures the economic distance, and $I(\rho_{ij,t} < \rho_c)$ is predetermined based on the geographic distance threshold $\rho_c$, such that $I(\rho_{ij,t} < \rho_c) = 1$ if the two locations are neighbors and 0 otherwise.

[53] It is straightforward to develop the extended model with covariates and error components with unobserved fixed effects and factors or interactive effects.
[54] Qu and Lee (2014) also suggest higher and nonlinear orders, $h_{ij}(z_{it}, z_{jt}) = \sum_{d=1}^{D} \frac{1}{|z_{it} - z_{jt}|^{d}}$.

• Using (458) and (460), the model (459) can be written as

$$y_{it} = \lambda \sum_{j=1}^{n} \frac{1}{|z_{it} - z_{jt}|}\, I(\rho_{ij,t} < \rho_c)\, y_{jt} + v_{it}. \tag{461}$$

• Assuming that $\rho_c$ is known and $z_{it}$ is correlated with $v_{it}$, Shi extends Qu and Lee (2014) and develops the control function approach in panels:[55]

$$y_{it} = \lambda \sum_{j=1}^{n} w_{ij,t}\, y_{jt} + (Z_{nt} - X_{2nt}\Gamma)_{it}'\, \delta + \xi_{it}, \tag{462}$$

where

$$w_{ij,t} = \frac{1}{|z_{it} - z_{jt}|}\, I(\rho_{ij,t} < \rho_c).$$

• Consider (479) below, and assume a single covariate in $z$:

$$z_{it} = x_{it}'\beta_z + \gamma_{iz}' f_{zt} + \varepsilon_{it} \tag{463}$$

where $x_{it}$ are $k_z \times 1$ regressors with coefficient vector $\beta_z$, and the unobservables have two components: $f_{zt}$, consisting of $R_z \times 1$ time factors with loadings $\gamma_{iz}$, and the idiosyncratic error $\varepsilon_{it}$. Then, (462) can be written as

$$y_{it} = \lambda \sum_{j=1}^{n} w_{ij,t}\, y_{jt} + (z_{it} - x_{it}'\beta_z - \gamma_{iz}' f_{zt})'\delta + \xi_{it}. \tag{464}$$

• The structure of the model (461) is similar to KMS, which is given by

$$y_{it} = \lambda\, \frac{1}{m_{it}} \sum_{j=1}^{n} I(\rho_{ij,t} < \rho_c)\, y_{jt} + v_{it}, \tag{465}$$

where $m_{it} = \sum_{j=1}^{n} I(\rho_{ij,t} < \rho_c)$. Model (461) assumes a known threshold $\rho_c$, whilst (465) estimates $\lambda$ and $\rho_c$ jointly and then imposes equal weights once $y_{jt}$ is selected.

• More generally, we consider:

$$w_{ij,t} = \frac{1}{\rho_{ij,t}}\, I(\rho_{ij,t} < \rho_c) \quad \text{with } \rho_{ij,t} = |z_{it} - z_{jt}| \tag{466}$$

Alternatively, we may consider the row-normalised version

$$w_{ij,t} = \frac{\frac{1}{\rho_{ij,t}}\, I(\rho_{ij,t} < \rho_c)}{\sum_{k=1}^{n} \frac{1}{\rho_{ik,t}}\, I(\rho_{ik,t} < \rho_c)} \quad \text{with } \rho_{ij,t} = |z_{it} - z_{jt}| \tag{467}$$

Notice that Qu and Lee conjecture that another issue that needs future research is to consider an endogenous spatial weight matrix purely constructed with economic distances. This could be a technically challenging issue as the near-epoch assumption may not be met. WHY?? Alternative large sample theorems may need to be developed.

[55] It is unclear how to write the $z$ specification for each spatial unit. See (479) and (480) below. The basic idea seems to be to get some information about the single covariate $z_{it}$, $z_{jt}$ and their difference $z_{it} - z_{jt}$ from the large system of equations.

• We can consider (i) the exogenous case, $E(z_{it}' v_{it}) = 0$, and (ii) the endogenous case, $E(z_{it}' v_{it}) \neq 0$. We may also consider the time-invariant case using

$$w_{ij} = \frac{1}{\rho_{ij}}\, I(\rho_{ij} < \rho_c) \quad \text{with } \rho_{ij} = |z_i - z_j| \text{ or } \rho_{ij} = |\bar z_i - \bar z_j| \tag{468}$$

where we consider the time-invariant covariate, $z_i$, or the time average, $\bar z_i = T^{-1}\sum_{t=1}^{T} z_{it}$.

• I conjecture that the KMS algorithm for constructing the endogenous spatial weights or selection matrix would be useful and would contribute to the literature. In particular, we consider a VAR as the DGP for the $N \times 1$ vector $z_t = (z_{1t}, \dots, z_{Nt})'$, say

$$z_t = \sum_{j=1}^{p} \Phi_j z_{t-j} + \varepsilon_t$$

and derive the control function (CF):

$$v_t = \left(z_t - \sum_{j=1}^{p} \Phi_j z_{t-j}\right)' \delta + \xi_t.$$

Then, the final model will become:

$$y_{it} = \lambda \sum_{j=1}^{n} w_{ij,t}\, y_{jt} + \varepsilon_{it}'\delta + \xi_{it}, \tag{469}$$

where $w_{ij,t}$ is defined in (466) or (467).

• Horrace et al. (2015) consider a firm with $n$ workers. When the manager allocates workers to projects (peer groups) in each time period, $t = 1, ..., T$, she specifies an $n \times n$ adjacency matrix which determines the interrelatedness of the workers' productivity. Let the adjacency matrix be denoted by $A^o_t = [a^o_{ij,t}]$, where $a^o_{ij,t} = 1$ if workers $i$ and $j$ are assigned to the same project and $a^o_{ij,t} = 0$ otherwise. We set $a^o_{ii,t} = 0$. Let the row-normalized $A^o_t$ be $A_t = [a_{ij,t}]$, where $a_{ij,t} = a^o_{ij,t} / \sum_{k=1}^{n} a^o_{ik,t}$. Then the productivity of worker $i$ in period $t$ is given by

$$y_{it} = \rho \sum_{j=1}^{n} a_{ij,t}\, y_{jt} + x_{it}\beta + u_{it}. \tag{470}$$

The dependent variable $y_{it}$ is the productivity of worker $i$ in period $t$. $\sum_{j=1}^{n} a_{ij,t} y_{jt}$ is the average productivity of worker $i$'s co-workers assigned to the same project, with $\rho$ capturing the peer effect. $x_{it}$ is a $1 \times k_x$ vector of exogenous variables. $u_{it}$ is the disturbance. In this setting, the marginal product across workers in period $t$ is $\rho a_{ij,t}$ when the workers are on the same project and 0 otherwise.

If $A_t$ is exogenous so that $E(U_t | A_t, X_t) = 0$, then model (470) can be estimated using spatial panel data methods. However, it is reasonable to believe that the manager's choices of how to allocate workers to projects may be correlated with $U_t$. Then $E(U_t | A_t, X_t) \neq 0$ and $A_t$ is endogenous.

Let $d_{it}$ be an indicator variable such that $d_{it} = 1$ if worker $i$ is assigned to the project and $d_{it} = 0$ otherwise. Suppose $m_t$ workers are allocated to the project. Then, for worker $i$ assigned to the project (i.e. $d_{it} = 1$), (470) can be written as

$$y_{it} = \rho \frac{1}{m_t} \sum_{j=1}^{n} d_{jt}\, y_{jt} + x_{it}\beta + E(u_{it} | D_t) + u^*_{it}, \tag{471}$$

where $D_t = (d_{1t}, ..., d_{nt})'$ and $u^*_{it} = u_{it} - E(u_{it} | D_t)$. By construction, $E(u^*_{it} | D_t) = 0$ and the weights $d_{jt}$ in the peer effect regressor can be considered exogenous. We refer to $E(u_{it} | D_t)$ as the selectivity bias.

As $m_t$ is often predetermined (e.g., in sports games, the number of active players $m_t$ is fixed), $d_{it}$ is not independent across $i$. Hence, instead of modeling the probability that a certain worker is assigned to a project (i.e. $\Pr(d_{it} = 1)$), we consider the probability that a set of workers is assigned to a project.

• Notice that the structure of the model (471) is also similar to KMS, but it imposes that the $d_{jt}$'s are known a priori. So this model is more restrictive than (461).

• MORE discussions...

• CCE approximation possible for spatial and factors?

• Heterogeneous extension of KMS?

• How to estimate the weights and thresholds together in KMS with and without endogeneity?
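The threshold weights in (466) and their row-normalised version (467) can be sketched numerically as follows. This is only an illustrative sketch: the function name `threshold_weights`, the example values of $z_t$ and $\rho_c$, and the handling of ties ($\rho_{ij,t} = 0$ is treated as "no link" to avoid division by zero) are assumptions, not part of the notes.

```python
import numpy as np

def threshold_weights(z, rho_c, row_normalise=False):
    """w_ij = (1/rho_ij) * 1{rho_ij < rho_c} with rho_ij = |z_i - z_j| and zero diagonal."""
    z = np.asarray(z, dtype=float)
    rho = np.abs(z[:, None] - z[None, :])           # pairwise economic distances
    off_diag = ~np.eye(len(z), dtype=bool)          # enforce w_ii = 0
    keep = off_diag & (rho < rho_c) & (rho > 0)     # rho > 0: assumption to avoid 1/0
    w = np.zeros_like(rho)
    w[keep] = 1.0 / rho[keep]
    if row_normalise:                               # version (467)
        row_sums = w.sum(axis=1, keepdims=True)
        w = np.divide(w, row_sums, out=np.zeros_like(w), where=row_sums > 0)
    return w

# Hypothetical cross-section of the covariate at one time period
z_t = np.array([0.0, 0.5, 2.0, 2.25])
W_raw = threshold_weights(z_t, rho_c=1.0)
W_norm = threshold_weights(z_t, rho_c=1.0, row_normalise=True)
```

Rows with no unit inside the threshold are left as zero rows rather than being normalised, which is one possible convention among several.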


9.1 Qu and Lee (2014)

Consider the output equation of a cross-sectional SAR model:

$$Y_n = \lambda W_n Y_n + X_{1n}\beta + V_n, \tag{472}$$

where $Y_n = (y_{1,n}, ..., y_{n,n})'$ is an $n \times 1$ vector, $X_{1n}$ is an $n \times k_1$ matrix with its elements $\{x_{1,in};\ l(i) \in D_n,\ n \in \mathbb{N}\}$ being bounded for all $i$ and $n$, $V_n = (v_{1,n}, ..., v_{n,n})'$, $\lambda$ is a scalar, and $\beta = (\beta_1, ..., \beta_{k_1})'$ is a $k_1 \times 1$ vector of coefficients. $W_n = (w_{ij,n})$ is an $n \times n$ nonnegative weights matrix with zero diagonals and its elements constructed by $Z_n$:

$$w_{ij,n} = h_{ij}(Z_n, \rho_{ij}) \quad \text{for } i, j = 1, ..., n;\ i \neq j, \tag{473}$$

where $h(\cdot)$ is a bounded function. Finally, we consider the DGP for $Z_n$:

$$\underset{n \times p_2}{Z_n} = \underset{n \times k_2}{X_{2n}}\ \underset{k_2 \times p_2}{\Gamma} + \varepsilon_n, \tag{474}$$

where $X_{2n}$ is an $n \times k_2$ matrix with $\{x_{2,in};\ l(i) \in D_n,\ n \in \mathbb{N}\}$ bounded for all $i$ and $n$, $\Gamma$ is a $k_2 \times p_2$ matrix of coefficients, $\varepsilon_n = (\varepsilon_{1,n}, ..., \varepsilon_{n,n})'$ is an $n \times p_2$ matrix of disturbances with $\varepsilon_{i,n} = (\varepsilon_{1,in}, ..., \varepsilon_{p_2,in})'$ being $p_2$-dimensional column vectors, and $Z_n = (z_{1,n}, ..., z_{n,n})'$ is an $n \times p_2$ matrix with $z_{i,n} = (z_{1,in}, ..., z_{p_2,in})'$.

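In detail, the objects in (474) are laid out as:

$$\underset{p_2 \times 1}{z_{i,n}} = \begin{pmatrix} z_{1,in} \\ \vdots \\ z_{p_2,in} \end{pmatrix}, \quad \underset{n \times p_2}{Z_n} = \begin{pmatrix} z_{1,n}' \\ \vdots \\ z_{n,n}' \end{pmatrix} = \begin{pmatrix} z_{1,1n} & \cdots & z_{p_2,1n} \\ \vdots & \ddots & \vdots \\ z_{1,nn} & \cdots & z_{p_2,nn} \end{pmatrix}.$$

Dropping the subscript $n$ on the individual elements:

$$\underset{p_2 \times 1}{z_i} = \begin{pmatrix} z_{1i} \\ \vdots \\ z_{p_2 i} \end{pmatrix}, \quad \underset{n \times p_2}{Z_n} = \begin{pmatrix} z_1' \\ \vdots \\ z_n' \end{pmatrix} = \begin{pmatrix} z_{11} & \cdots & z_{p_2 1} \\ \vdots & \ddots & \vdots \\ z_{1n} & \cdots & z_{p_2 n} \end{pmatrix},$$

$$\underset{k_2 \times 1}{x_{2i}} = \begin{pmatrix} x_{2,1i} \\ \vdots \\ x_{2,k_2 i} \end{pmatrix}, \quad \underset{n \times k_2}{X_{2n}} = \begin{pmatrix} x_{21}' \\ \vdots \\ x_{2n}' \end{pmatrix} = \begin{pmatrix} x_{2,11} & \cdots & x_{2,k_2 1} \\ \vdots & \ddots & \vdots \\ x_{2,1n} & \cdots & x_{2,k_2 n} \end{pmatrix},$$

$$\underset{k_2 \times p_2}{\Gamma} = \begin{pmatrix} \gamma_1 \\ \vdots \\ \gamma_{k_2} \end{pmatrix} = \begin{pmatrix} \gamma_{11} & \cdots & \gamma_{1 p_2} \\ \vdots & \ddots & \vdots \\ \gamma_{k_2 1} & \cdots & \gamma_{k_2 p_2} \end{pmatrix}.$$

Hence, in $Z_n = X_{2n}\Gamma + \varepsilon_n$, the systematic part is

$$X_{2n}\Gamma = \begin{pmatrix} x_{2,11}\gamma_{11} + \cdots + x_{2,k_2 1}\gamma_{k_2 1} & \cdots & x_{2,11}\gamma_{1 p_2} + \cdots + x_{2,k_2 1}\gamma_{k_2 p_2} \\ \vdots & \ddots & \vdots \\ x_{2,1n}\gamma_{11} + \cdots + x_{2,k_2 n}\gamma_{k_2 1} & \cdots & x_{2,1n}\gamma_{1 p_2} + \cdots + x_{2,k_2 n}\gamma_{k_2 p_2} \end{pmatrix}.$$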

Let $\{(\varepsilon_{l(i),n}, v_{l(i),n});\ l(i) \in D_n,\ n \in \mathbb{N}\}$ be a triangular double array of real random variables defined on a probability space $(\Omega, \mathcal{F}, P)$, where the index set $D_n \subset D$ is a finite set.

Remark: We consider $n$ agents in an area where each agent $i$ is endowed with a predetermined location $l(i)$. Any two agents are separated by a distance of at least 1. Due to some competition or spillover effects, each agent $i$ has an outcome $y_{i,n}$ directly affected by its neighbors' outcomes $y_{j,n}$. The spatial weight $w_{ij,n}$ is a measure of the relative strength of linkage between agents $i$ and $j$, and the spatial coefficient $\lambda$ provides a multiplier for the spillover effects. However, the spatial weight $w_{ij,n}$ is not predetermined but depends on some observable random variable $Z_n$. We can think of $z_{i,n}$ as some economic variables at location $l(i)$, such as GDP, consumption and the economic growth rate, which influence the strength of links across units. This specification has been used in the literature, and it may introduce endogeneity into the spatial weight matrix. Case et al. (1993) consider weights of the form (before row normalization):

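$$w_{ij,n} = \frac{1}{|z_{i,n} - z_{j,n}|}$$

where $z_{i,n}$ and $z_{j,n}$ are observations on “meaningful” socioeconomic characteristics.

We have the following moment assumption.

Assumption 2. The error terms $v_{i,n}$ and $\varepsilon_{i,n}$ have a joint distribution:

$$(v_{i,n}, \varepsilon_{i,n}')' \sim iid\,(0, \Sigma_{v\varepsilon}), \quad \Sigma_{v\varepsilon} = \begin{pmatrix} \sigma_v^2 & \sigma_{v\varepsilon}' \\ \sigma_{v\varepsilon} & \Sigma_\varepsilon \end{pmatrix},$$

where $\Sigma_{v\varepsilon}$ is positive definite, $\sigma_v^2$ is a scalar variance, the covariance $\sigma_{v\varepsilon} = (\sigma_{v\varepsilon_1}, ..., \sigma_{v\varepsilon_{p_2}})'$ is a $p_2$-dimensional vector, and $\Sigma_\varepsilon$ is a $p_2 \times p_2$ matrix. The moments $\sup_{i,n} E|v_{i,n}|^{4+\delta_\varepsilon}$ and $\sup_{i,n} E\|\varepsilon_{i,n}\|^{4+\delta_\varepsilon}$ exist for some $\delta_\varepsilon > 0$. Furthermore, $E(v_{i,n} | \varepsilon_{i,n}) = \varepsilon_{i,n}'\delta$ and $Var(v_{i,n} | \varepsilon_{i,n}) = \sigma_\xi^2$.⁵⁶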

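The endogeneity of $W_n$ comes from the correlation between $v_{i,n}$ and $\varepsilon_{i,n}$. Using Assumption 2, we can construct⁵⁷

$$\xi_n = V_n - \varepsilon_n \delta,$$

where

$$\delta = \Sigma_\varepsilon^{-1}\sigma_{v\varepsilon} \quad \text{and} \quad \xi_n \sim (0, \sigma_\xi^2 I_n) \quad \text{with } \sigma_\xi^2 = \sigma_v^2 - \sigma_{v\varepsilon}'\Sigma_\varepsilon^{-1}\sigma_{v\varepsilon}.$$

In particular, $\xi_n$ is uncorrelated with $\varepsilon_n$. The outcome equation (472) becomes:

$$Y_n = \lambda W_n Y_n + X_{1n}\beta + (Z_n - X_{2n}\Gamma)\delta + \xi_n, \tag{475}$$

with $E(\xi_{i,n} | \varepsilon_{i,n}) = 0$ and $E(\xi_{i,n}^2 | \varepsilon_{i,n}) = \sigma_\xi^2$, and the $\xi_{i,n}$'s are iid across $i$. Our subsequent asymptotic analysis will rely on (475), where $(Z_n - X_{2n}\Gamma)$ are control variables to control the endogeneity of $W_n$.

56 This conditional homoskedasticity condition is required for the QMLE theory. For the IV or GMM estimations, we can relax this assumption.

57 In the special case that $(v_{i,n}, \varepsilon_{i,n}')'$ has a jointly normal distribution, then $v_{i,n} | \varepsilon_{i,n} \sim N(\sigma_{v\varepsilon}'\Sigma_\varepsilon^{-1}\varepsilon_{i,n},\ \sigma_v^2 - \sigma_{v\varepsilon}'\Sigma_\varepsilon^{-1}\sigma_{v\varepsilon})$ and $\xi_n$ is independent of $\varepsilon_n$.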
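The control-function construction behind (475) can be checked by simulation. The sketch below draws $(v_{i,n}, \varepsilon_{i,n}')'$ from a joint normal distribution, forms $\delta = \Sigma_\varepsilon^{-1}\sigma_{v\varepsilon}$ and $\xi_n = V_n - \varepsilon_n\delta$, and looks at the sample covariance between $\xi_n$ and $\varepsilon_n$. The covariance values, the sample size and the use of normality are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p2 = 100_000, 2

# Hypothetical population moments (assumptions chosen for illustration)
sigma_v2 = 1.0
sigma_ve = np.array([0.5, -0.2])                  # Cov(eps, v), p2 x 1
Sigma_e = np.array([[1.0, 0.3], [0.3, 1.0]])      # Var(eps), p2 x p2
Sigma = np.block([[np.array([[sigma_v2]]), sigma_ve[None, :]],
                  [sigma_ve[:, None], Sigma_e]])

draws = rng.multivariate_normal(np.zeros(1 + p2), Sigma, size=n)
v, eps = draws[:, 0], draws[:, 1:]

delta = np.linalg.solve(Sigma_e, sigma_ve)        # delta = Sigma_eps^{-1} sigma_v_eps
xi = v - eps @ delta                              # xi_n = V_n - eps_n * delta
sigma_xi2 = sigma_v2 - sigma_ve @ delta           # sigma_xi^2 = sigma_v^2 - sigma' Sigma^{-1} sigma

sample_cov = eps.T @ xi / n                       # should be close to zero
```

In practice $\varepsilon_n$ is unobserved and would be replaced by residuals from (474); the sketch uses the true disturbances only to isolate the algebra.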

Remark: Qu and Lee (2014) propose 3 estimators. If the error terms are jointly normally distributed, QMLE becomes MLE and achieves asymptotic efficiency. The 2SIV estimation is based on linear moments only and is therefore asymptotically less efficient. The GMM estimator based on some proper linear and quadratic moment conditions can be asymptotically as efficient as the ML estimator under normality. In the absence of normality, QMLE is no longer asymptotically efficient. The optimum GMM estimator based on linear and quadratic moment conditions might be asymptotically more efficient than QMLE, because it adopts the best weighting matrix for those moment conditions, while QMLE gives each moment an equal weight. However, the asymptotic efficiency of the 2SIV and QMLE cannot be directly compared. Therefore, the 2SIV and GMM methods have the merit of computational simplicity and robustness.

Assumption 4. We consider two cases of $W_n$:

(4.1) Case 1: The spatial weight $w_{ij,n} = h_{ij}(z_{i,n}, z_{j,n}, \rho_{ij})$ for $i \neq j$, where the $h_{ij}(\cdot)$'s are non-negative, uniformly bounded functions of some observable variable $Z_n$. $0 \leq w_{ij,n} \leq c_1 \rho_{ij}^{-c_3 d_0}$ for some $0 \leq c_1$ and $c_3 > 3$.⁵⁸ Furthermore, there exist at most $K$ ($K \geq 1$) columns of $W_n$ for which the column sum exceeds $c_w$, where $K$ is a fixed number that does not depend on $n$.⁵⁹

(4.2) Case 2: The spatial weight $w_{ij,n} = 0$ if $\rho_{ij} > \rho_c$, i.e., there exists a threshold $\rho_c > 1$ and if the geographic distance exceeds $\rho_c$, then the weight is zero. For $i \neq j$,

$$w_{ij,n} = h_{ij}(z_{i,n}, z_{j,n})\, I(\rho_{ij} \leq \rho_c) \quad \text{or} \quad w_{ij,n} = \frac{h_{ij}(z_{i,n}, z_{j,n})\, I(\rho_{ij} \leq \rho_c)}{\sum_{\rho_{ik} \leq \rho_c} h_{ik}(z_{i,n}, z_{k,n})},$$

where the $h_{ij}(\cdot)$'s are non-negative, uniformly bounded functions.

Remark: Assumption 4 provides the essential features of the weights matrix. The geographic distance plays an important role in constraining the magnitudes of spatial weights. The spatial weight of two locations would be larger if they were closer to each other or when their economic indices were more similar, but their weights would become smaller when two units are further apart. Assumption (4.1) allows the situation that all agents are spatially correlated but the spatial weight decreases sufficiently fast at a certain rate as physical distances increase. Symmetry is not imposed on the spatial weight matrix. If $W_n$ is symmetric, the second part on the column sum norm condition in (4.1) will not be needed. For an asymmetric $W_n$, the second part of (4.1) limits the number of columns which have large magnitudes relative to the row sum norm. For example, big countries may have great impact on small countries, but those small countries may have little or zero influence on big countries. In this example, we have some “stars” whose row sums are bounded by $c_w$, while their column sums can be much larger. Assumption (4.1) assumes that the number of such stars can only be finite and bounded. Assumption (4.2) allows for a row-normalized spatial weight matrix. In this case, $w_{ij,n}$ might have agents linked in an area, but once the geographic distance between two agents exceeds a threshold, the two units do not interact spatially.

58 For example, $w_{ij,n} = \min\left(\frac{1}{\|z_{i,n} - z_{j,n}\|_p},\ c_1 \rho_{ij}^{-c_3 d_0}\right)$.

59 As $c_0^{-\rho_{ij}}$ decreases faster than $\rho_{ij}^{-c_3 d_0}$, all the results hold for the case of $0 \leq w^d_{ij,n} \leq c_1 c_0^{-\rho_{ij}}$ with some $c_1 \geq 0$ and $c_0 > 1$.

In the MC they construct the endogenous, row-normalized $W_n = (w_{ij,n})$ as follows:

1. Generate bivariate normal random variables $(v_{i,n}, \varepsilon_{i,n})$ from $iid\ N\left(0, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right)$ as disturbances in the outcome equation and the spatial weights equation.

2. Construct the spatial weight matrix as the Hadamard product $W_n = W^d_n \circ W^e_n$, i.e., $w_{ij,n} = w^d_{ij,n} w^e_{ij,n}$, where $W^d_n$ is a predetermined matrix based on geographic distance: $w^d_{ij,n} = 1$ if the two locations are neighbors and 0 otherwise; $W^e_n$ is a matrix based on economic similarity: $w^e_{ij,n} = 1/|z_{i,n} - z_{j,n}|$ if $i \neq j$ and $w^e_{ii,n} = 0$, where the elements of $Z_n$ are generated by $z_{i,n} = 1 + 0.8 x_{i2,n} + \varepsilon_{i,n}$.

3. Row-normalize $W_n$.
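The three steps above can be sketched as follows. The notes do not state how neighbours or $x_{i2,n}$ are generated, so the sketch assumes units placed on a line (each unit neighbouring the adjacent ones) and $x_{i2,n} \sim N(0,1)$; the function name `qu_lee_mc_weights` is likewise hypothetical.

```python
import numpy as np

def qu_lee_mc_weights(n, rho, rng):
    """Endogenous, row-normalised W_n = W^d o W^e, following steps 1 to 3 (a sketch)."""
    # Step 1: (v_i, eps_i) iid bivariate normal, unit variances, correlation rho
    cov = np.array([[1.0, rho], [rho, 1.0]])
    v, eps = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    # Economic variable z_i = 1 + 0.8 x_i2 + eps_i (x_i2 ~ N(0,1) is an assumption)
    x2 = rng.standard_normal(n)
    z = 1.0 + 0.8 * x2 + eps
    # Step 2a: W^d, geographic neighbours; units on a line is an assumption
    idx = np.arange(n)
    Wd = (np.abs(idx[:, None] - idx[None, :]) == 1).astype(float)
    # Step 2b: W^e, economic similarity 1/|z_i - z_j| with zero diagonal
    diff = np.abs(z[:, None] - z[None, :])
    np.fill_diagonal(diff, np.inf)                # 1/inf = 0 gives w^e_ii = 0
    We = 1.0 / diff
    # Step 2c: Hadamard product; Step 3: row-normalise
    W = Wd * We
    return W / W.sum(axis=1, keepdims=True), z, v

W_n, z_n, v_n = qu_lee_mc_weights(50, rho=0.5, rng=np.random.default_rng(1))
```

Because $z_{i,n}$ depends on $\varepsilon_{i,n}$, which is correlated with $v_{i,n}$ when $\rho \neq 0$, the resulting $W_n$ is endogenous in the sense discussed above.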

Remark: See also Assumption 6 in Shi (2016), which considers only Assumption (4.2). Notice that $h_{ij}(z_{i,n}, z_{j,n})$ measures economic similarity while $I(\rho_{ij} \leq \rho_c)$ depends on the physical distance. In our case, $h_{ij}(z_{i,n}, z_{j,n})$ measures economic similarity and $I(\rho_{ij} \leq \rho_c)$ also depends on the economic distance.

9.2 Shi (2016)

A data set contains $n$ individuals indexed by $i$ for $T$ time periods indexed by $t$. Individuals are located on an unevenly spaced lattice $D \subset \mathbb{R}^{d_0}$ with $d_0 \geq 1$. The location $l : \{1, ..., n\} \to D_n \subset D$ is a mapping of individual $i$ to its location $l(i) \in D \subset \mathbb{R}^{d_0}$. Let $D_T \subset \mathbb{Z}$ denote the set of time indexes. This is an adaptation of the topological structure of spatial processes in Jenish and Prucha (2009, 2012) and Qu and Lee (2015) to the panel data setting. Define the metric

$$\rho_{it,js} = \rho(it, js) = \max\left\{ \max_{1 \leq k \leq d_0} |l(i)_k - l(j)_k|,\ |t - s| \right\}$$

where $l(i)_k$ is the $k$th component of the $d_0 \times 1$ vector $l(i)$.
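As a small illustration of the metric, the sketch below evaluates $\rho(it, js)$ for two space-time points; the function name `space_time_distance` and the example coordinates are assumptions made for illustration.

```python
import numpy as np

def space_time_distance(loc_i, t, loc_j, s):
    """rho(it, js) = max( max_k |l(i)_k - l(j)_k| , |t - s| )."""
    loc_i = np.asarray(loc_i, dtype=float)
    loc_j = np.asarray(loc_j, dtype=float)
    spatial = np.max(np.abs(loc_i - loc_j))       # sup-norm distance between locations
    return float(max(spatial, abs(t - s)))        # combined with the time gap

# Two units in R^2: the spatial gap dominates in the first call, the time gap in the second
d_space = space_time_distance([0, 0], 1, [3, 1], 2)
d_time = space_time_distance([0, 0], 1, [1, 1], 5)
```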

Assumption 1. The lattice $D \subset \mathbb{R}^{d_0}$ with $d_0 \geq 1$ is infinitely countable. All elements in $D$ are located at distances of at least $\rho_0 > 0$ from each other. We assume that $\rho_0 = 1$.

The equation of interest is (bad notations)

$$y_{nt} = \lambda W_{nt} y_{nt} + X_{nty}\beta_y + \Gamma_{ny} f_{yt} + v_{nt}, \tag{476}$$

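where $y_{nt}$ is $n \times 1$ and $X_{nty}$ is an $n \times k_y$ matrix of exogenous regressors. $W_{nt}$ is an $n \times n$ spatial weights matrix with elements $w_{ij,t}$:

$$\underset{n \times 1}{y_{nt}} = \begin{pmatrix} y_{1t} \\ \vdots \\ y_{nt} \end{pmatrix}, \quad \underset{n \times k_y}{X_{nty}} = \begin{pmatrix} X_{1t,1} & \cdots & X_{1t,k_y} \\ \vdots & \ddots & \vdots \\ X_{nt,1} & \cdots & X_{nt,k_y} \end{pmatrix}, \quad \underset{k_y \times 1}{\beta_y} = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_{k_y} \end{pmatrix},$$

$$\underset{n \times R_y}{\Gamma_{ny}} = \begin{pmatrix} \Gamma_{11} & \cdots & \Gamma_{1 R_y} \\ \vdots & \ddots & \vdots \\ \Gamma_{n1} & \cdots & \Gamma_{n R_y} \end{pmatrix} = \begin{pmatrix} \Gamma_1 \\ \vdots \\ \Gamma_n \end{pmatrix}, \quad \underset{R_y \times 1}{f_{yt}} = \begin{pmatrix} f_{1t} \\ \vdots \\ f_{R_y t} \end{pmatrix}, \quad \underset{n \times 1}{v_{nt}} = \begin{pmatrix} v_{1t} \\ \vdots \\ v_{nt} \end{pmatrix},$$

$$\underset{n \times n}{W_{nt}} = \begin{pmatrix} w_{11,t} & \cdots & w_{1n,t} \\ \vdots & \ddots & \vdots \\ w_{n1,t} & \cdots & w_{nn,t} \end{pmatrix} = \begin{pmatrix} w_{1,t} \\ \vdots \\ w_{n,t} \end{pmatrix} \quad \text{with } w_{ii,t} = 0. \tag{477}$$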

The spatial weights matrix, with zero diagonals, measures the degree of connections between spatial units. Let $y_{it}$ denote the $i$-th element of $y_{nt}$. $\lambda w_{ij,t}$ measures the impact of $y_{jt}$ on $y_{it}$ and $\lambda$ is the spatial interactions coefficient. Economic theory may suggest how the spatial weights matrix is constructed.

Example: Kelejian and Piras (2014): Weights are based on relative prices of cigarettes between two neighboring states. We assume the following model:

$$\ln C_{it} = \beta_1 \ln C_{i,t-1} + \beta_2 \ln p_{it} + \beta_3 \ln I_{it} + \lambda \sum_{j=1}^{46} \frac{p_{jt}}{p_{it}}\, d_{ijt} \ln C_{jt} + \mu_i + \delta_t + u_{it} \tag{478}$$

where $i = 1, ..., 46$ denotes states, $t = 1, ..., 29$ denotes time periods, and the disturbance term $u_{it}$ has the non-parametric specification.⁶⁰ $C_{it}$ is cigarette sales to persons of smoking age in packs per capita in state $i$ at time $t$. $p_{it}$ is the average retail price per pack of cigarettes. $I_{it}$ is per capita disposable income. All values are measured in real terms. The spatial lag accounts for cross border cigarette shopping, or bootlegging. $d_{ijt}$ is a dummy variable which indicates the desirability of cross border shopping. In particular, $d_{ijt} = 1$ if $i$ and $j$ are border states and $p_{jt} < p_{it}$; $d_{ijt} = 0$ if $i$ and $j$ are not border states, or if $p_{jt} > p_{it}$. The multiplication of $d_{ijt}$ by the price ratio indicates that the price ratio $p_{jt}/p_{it}$ will only be considered by cigarette consumers in state $i$ if $p_{jt} < p_{it}$. Since the price of cigarettes is endogenous in a demand model for cigarettes, our weighting matrix is endogenous. We expect $\lambda$ to be positive because the higher the price ratio, the less attractive cross state shopping is. See also an example (subsection 3.2.3 in Shi) about the average effect of a treatment on the treated (ATT).
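The price-ratio weights in (478) can be sketched as below for one time period. The three-state prices and the border matrix are made-up illustrative inputs, and `price_ratio_weights` is a hypothetical helper name.

```python
import numpy as np

def price_ratio_weights(prices, border):
    """w_ij = (p_j / p_i) * d_ij, with d_ij = 1 if i and j share a border and p_j < p_i."""
    p = np.asarray(prices, dtype=float)
    ratio = p[None, :] / p[:, None]               # ratio[i, j] = p_j / p_i
    cheaper = p[None, :] < p[:, None]             # neighbour j is cheaper than state i
    d = np.asarray(border).astype(bool) & cheaper
    return np.where(d, ratio, 0.0)

# Hypothetical example: three mutually bordering states at one time period
prices_t = [2.0, 1.0, 4.0]
border = np.ones((3, 3)) - np.eye(3)
W_price = price_ratio_weights(prices_t, border)
```

Since the weights are functions of $p_{it}$ and $p_{jt}$, any endogeneity of prices in the demand equation carries over to the weighting matrix, which is the point of the example.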

60 We assume the following non-parametric specification: $u_N = R_N \varepsilon_N$, where $R_N$ is an unknown $NT \times NT$ non-stochastic matrix, and $\varepsilon_N$ is an $NT \times 1$ random vector whose mean is zero and VC is $I_{NT}$, with $E(u_N) = 0$ and $E(u_N u_N') = R_N R_N'$.

Sources of endogeneity. The spatial weights matrix may be constructed from covariates that may correlate with $v_{nt}$ together with common factors. Suppose that there are $p$ variables, $z_{it1}, ..., z_{itp}$, that are used to construct $W_{nt}$. Consider the regression equation for $z_{itl}$:

$$z_{itl} = x_{itzl}'\beta_{zl} + \gamma_{izl}' f_{zt} + \varepsilon_{itl}, \quad l = 1, ..., p \tag{479}$$

where $x_{itzl}$ are $k_{zl} \times 1$ regressors with coefficient vector $\beta_{zl}$, and the unobservables have two components: $f_{zt}$, consisting of $R_z \times 1$ time factors with loadings $\gamma_{izl}$, and the idiosyncratic error $\varepsilon_{itl}$. Stacking across $i$ and then over $l$ for time period $t$, we have $z_{nt} = (z_{1t1}, ..., z_{nt1}, z_{1t2}, ..., z_{ntp})'$, an $np \times 1$ vector which has the following structure:

$$z_{nt} = X_{ntz}\beta_z + \Gamma_{nz} f_{zt} + \varepsilon_{nt} \tag{480}$$

where $X_{ntz}$ is $np \times k_z$ with $k_z = k_{z1} + ... + k_{zp}$, $\Gamma_{nz} = (\gamma_{1z1}, ..., \gamma_{nz1}, \gamma_{1z2}, ..., \gamma_{nzp})'$ is $np \times R_z$ and $\varepsilon_{nt}$ is defined similarly. Note that $X_{nty}$ and $X_{ntz}$ may have some common regressors.

Remark: Notations in Shi are generally unclear and need to be improved. In constructing $z_{nt}$, he allows the dimension of the $X$ regressors to differ across $l = 1, ..., p$. Hence, the definition of the $np \times k_z$ matrix $X_{ntz}$ is quite unclear. In the MC and empirical applications he used the univariate construction of the $z_i$ variable. Suppose that we use the difference between univariate variables, say $\frac{1}{|z_{it} - z_{jt}|}$, to construct the spatial weights. In such a case the use of the large dimensional specification for $z_{nt}$ in (480) seems to be redundant (and also computationally infeasible). If so, I conjecture that the use of a VAR or SPVAR seems to be more sensible, say

$$z_{nt} = \sum_{j=1}^{p} \Phi_j z_{n,t-j} + \Gamma_{nz} f_{zt} + \varepsilon_{nt} \tag{481}$$

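An important remaining issue is how to construct the spatial weights from (481), which I will discuss further later. Recall that the VAR cannot be employed in the cross-section setting of Qu and Lee (2015).

The following conditional moment assumption specifies that the disturbance terms in $z_{nt}$ may correlate with $v_{it}$ in the $y_{nt}$ equation.

Assumption 2 (Qu and Lee (2015)). The error terms $v_{it}$ and $\varepsilon_{it}$ are independently distributed over $i$ and $t$, and have a joint distribution

$$(v_{it}, \varepsilon_{it}')' \sim (0, \Sigma_{v\varepsilon}), \quad \Sigma_{v\varepsilon} = \begin{pmatrix} \sigma_v^2 & \sigma_{v\varepsilon}' \\ \sigma_{v\varepsilon} & \Sigma_\varepsilon \end{pmatrix},$$

where $\Sigma_{v\varepsilon}$ is positive definite, $\sigma_v^2$ is a scalar variance, the covariance $\sigma_{v\varepsilon} = (\sigma_{v\varepsilon_1}, ..., \sigma_{v\varepsilon_p})'$ is a $p \times 1$ vector, and $\Sigma_\varepsilon$ is a $p \times p$ matrix. Furthermore, $\varepsilon_{it}$ is generated as $\varepsilon_{it} = \Sigma_\varepsilon^{1/2} e_{it}$, where $e_{itp}$ is independently distributed across $i$, $t$ and $p$ with $E(e_{itp}) = 0$ and $E(e_{itp}^2) = 1$. Furthermore, $\sup_{n,T}\sup_{i,t} E|v_{it}|^{4+\delta_\varepsilon}$ and $\sup_{n,T}\sup_{i,t} E\|\varepsilon_{it}\|^{4+\delta_\varepsilon}$ exist for some $\delta_\varepsilon > 0$. Denote

$$E(v_{it} | \varepsilon_{it}) = \varepsilon_{it}'\delta$$

and define

$$\xi_{it} = v_{it} - \varepsilon_{it}'\delta.$$

We assume that $E(\xi_{it}^2 | \varepsilon_{it}) = E(\xi_{it}^2) = \sigma_\xi^2$, $E(\xi_{it}^3 | \varepsilon_{it}) = E(\xi_{it}^3)$ and $E(\xi_{it}^4 | \varepsilon_{it}) = E(\xi_{it}^4)$.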

If $\sigma_{v\varepsilon} \neq 0$, the spatial weights matrix correlates with the disturbances in the outcome equation of the same period. Hence, using Assumption 2, the outcome equation (476) becomes:

$$y_{nt} = \lambda W_{nt} y_{nt} + X_{nty}\beta_y + \Gamma_{ny} f_{yt} + (z_{nt} - X_{ntz}\beta_z - \Gamma_{nz} f_{zt})'\delta + \xi_{nt}. \tag{482}$$

Shi develops the QML estimator of (482), though (482) is not clearly defined.

Remark: The control function approach is found to be robust in nonlinear models, e.g. Wooldridge or Blundell?? Suppose in the linear model that $W_{nt} y_{nt}$ and/or $X_{nty}$ are endogenous; then we use as CF

$$(W_{nt} y_{nt} - z_{nt}\gamma_y)'\delta_y \quad \text{or} \quad (X_{nty} - z_{nt}\gamma_x)'\delta_x.$$

Similarly, when treating $w_{ij,t}$ as the variable, we may consider the following CF:

$$(W_{nt} - z_{nt}\gamma_W)'\delta_W.$$

Here, Qu and Lee and Shi assume that the spatial weights follow a smooth function of $z_{nt}$, and employ the CF in terms of $(z_{nt} - X_{ntz}\beta_z - \Gamma_{nz} f_{zt})'\delta$ in (482). So we should think about this issue more carefully.

MC. Consider the main outcome equation:

$$y_{it} = \lambda \sum_{j=1}^{n} w_{ij,t}\, y_{jt} + x_{it}\beta_y + \gamma_{yi}' f_{yt} + v_{it}.$$

Two unobserved factors ($f_{yt}$) and factor loadings ($\gamma_{yi}$) are generated independently from $U[-2, 2]$. The observed regressor ($x_{it}$) is a scalar and correlates with the factors through

$$x_{it} = \frac{1}{3}\left(\gamma_{yi}' f_{yt} + \gamma_{yi}'\ell_2 + f_{yt}'\ell_2\right) + \eta_{it},$$

where $\eta_{it} \sim U[-2, 2]$ and $\ell_2$ is a $2 \times 1$ vector of 1's.

The spatial weights are correlated with $v_{it}$ according to the following process.

1. Individuals indexed by 1 to $n$ are located successively on a chessboard of dimension $\sqrt{n} \times \sqrt{n}$. The neighborhood structure follows the rook pattern. Individuals in the interior of the chessboard have 4 neighbors, and those on the border and the corner have 3 and 2 neighbors, respectively. Let $w^d_{ij} = 1$ if $i$ and $j$ are neighbors and $w^d_{ij} = 0$ otherwise, and denote the $n \times n$ matrix $W_d = [w^d_{ij}]$.

2. Let

$$z_{it} = x_{it}\beta_z + \gamma_{zi} f_{zt} + \varepsilon_{it},$$

with one unobserved factor $f_{zt}$ being the first element of $f_{yt}$. The scalar $\gamma_{zi}$ is generated from $U[-2, 2]$. Let

$$w^e_{ij,t} = w^d_{ij} \times \min\left(\frac{1}{|z_{it} - z_{jt}|},\ 2\right).$$

The spatial dependence is stronger for neighbors with similar $z$'s. $w^e_{it,jt}$ is capped at 2 such that individuals whose $z$'s are very similar ($|z_{it} - z_{jt}| < 0.5$) have the same degree of spatial effect. The spatial weights in the outcome equation are row-normalized,

$$w_{it,jt} = \frac{w^e_{it,jt}}{\sum_{k=1}^{n} w^e_{it,kt}}.$$

3. The idiosyncratic errors $v_{it}$ and $\varepsilon_{it}$ are generated from i.i.d. bivariate normal random variables,

$$N\left(0,\ \frac{4}{3}\vartheta \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right).$$

The inverse of $\vartheta$ is a measure of the signal to noise ratio.
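Steps 1 and 2 of this design can be sketched as follows: a rook-contiguity matrix on a $\sqrt{n} \times \sqrt{n}$ chessboard, combined with the capped inverse-distance weights and row-normalisation. The function names and the example values of $z$ are assumptions; the generation of $z_{it}$ itself is left out.

```python
import numpy as np

def rook_neighbours(side):
    """W_d for units placed row by row on a side x side chessboard (rook pattern)."""
    n = side * side
    Wd = np.zeros((n, n))
    for i in range(n):
        r, c = divmod(i, side)
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < side and 0 <= cc < side:
                Wd[i, rr * side + cc] = 1.0
    return Wd

def capped_weights(z, Wd, cap=2.0):
    """Row-normalised w^e_ij = w^d_ij * min(1/|z_i - z_j|, cap)."""
    z = np.asarray(z, dtype=float)
    diff = np.abs(z[:, None] - z[None, :])
    np.fill_diagonal(diff, np.inf)                # keeps the diagonal at zero
    with np.errstate(divide="ignore"):            # exact ties are capped, not infinite
        We = Wd * np.minimum(1.0 / diff, cap)
    return We / We.sum(axis=1, keepdims=True)

Wd = rook_neighbours(3)                           # n = 9 units
W_t = capped_weights(np.arange(9.0), Wd)          # hypothetical z values 0, 1, ..., 8
```

In the full design this would be repeated for every $t$, since $z_{it}$ and hence the weights change over time while $W_d$ stays fixed.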

Application. Let $y_{it}$ denote the HECM origination rate, defined as the number of newly originated HECM loans in state $i$ and quarter $t$ as a percentage of the senior population (age 65 plus) from the 2010 census. There are two $n \times n$ spatial weights matrices, $W_{1,n} = (w_{1,ij})$ and $W_{2,nt} = (w_{2,it,jt})$. $w_{1,ij} = 1$ if states $i$ and $j$ share the same border and $w_{1,ij} = 0$ otherwise. $W_{2,nt}$ captures the different spillover effect from large lenders: $w_{2,it,jt} = w_{1,ij} z_{it} z_{jt}$, where $z_{it}$ is the share of the HECM loans originated by the large lenders (defined as being one of the top 10 largest lenders). A larger weight is given to a state with more dominant large lenders. House price dynamic variables are constructed using the Federal Housing Finance Agency's quarterly all-transactions house price indexes (HPI) deflated by the CPI, and include deviations from the previous 9 year averages (hpi_dev), standard deviations of house price changes in the previous 9 years (hpi_v) and the interaction between the two.

Our empirical application considers multiple spatial weights matrices such that

$$z_{it} = hpi\_dev_{it}\beta_{z1} + hpi\_v_{it}\beta_{z2} + (hpi\_dev_{it} \times hpi\_v_{it})\beta_{z3} + \gamma_{zi}' f_{zt} + \varepsilon_{it}; \tag{483}$$

$$y_{it} = \lambda_1 \sum_{j=1}^{n} w_{1,ij}\, y_{jt} + \lambda_2 \sum_{j=1}^{n} w_{2,it,jt}\, y_{jt} + hpi\_dev_{it}\beta_{y1} + hpi\_v_{it}\beta_{y2} + (hpi\_dev_{it} \times hpi\_v_{it})\beta_{y3} + \gamma_{yi}' f_{yt} + v_{it}; \tag{484}$$

where $\varepsilon_{it}$ and $v_{it}$ have variances $\sigma_\varepsilon^2$ and $\sigma_v^2$ with correlation $\rho$. According to the eigenvalue ratio criterion, $z_{it}$ has one unobserved factor and so does $y_{it}$, and the growth ratio criterion gives the same result. Table 3.7 reports the estimation results.

Remark: As discussed above, Shi used a relatively simple specification. There is only a single $z$ variable, and the $x$ regressors are common to (483) and (484). There are no details on whether there are common factors between $f_{zt}$ and $f_{yt}$. It is also unclear whether the one unobserved factor estimated for $z_{it}$ and for $y_{it}$ is the same or different. In general, I am unsure about the computational details... Further, the intuition behind $w_{2,it,jt} = w_{1,ij} z_{it} z_{jt}$ is unclear too, though it is designed to capture spillovers due to large lenders. Overall, this line of research will become more important, and these studies provide the first research step in such directions, but there are still many ambiguous issues in the current study.

9.3 Horrace et al. (2016)

Suppose there are $q_t$ possible lineups denoted by $L_s$ for $s = 1, ..., q_t$. The manager allocates lineup $L_s$ to the project if and only if

$$d^*_{st} > \max_{r \neq s} d^*_{rt},$$

where

$$d^*_{st} = \pi_{st} + \xi_{st}, \quad s = 1, ..., q_t,$$

where $\pi_{st}$ is the deterministic component of $d^*_{st}$ and $\xi_{st}$ is a random innovation with zero mean and unit variance. Let $d_{st}$ be a dummy such that $d_{st} = 1$ if the lineup $L_s$ is chosen in period $t$ and $d_{st} = 0$ otherwise. Then, $d_{st} = 1$ if and only if

$$\varepsilon_{st} < 0 \quad \text{with } \varepsilon_{st} = \max_{r \neq s} d^*_{rt} - d^*_{st}.$$

The productivity of $L_s$ is given by the following model:

$$Y_{st} = \rho W_t Y_{st} + X_{st}\beta + U_{st}, \quad s = 1, ..., q_t,\ t = 1, ..., T. \tag{485}$$

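$Y_{st} = [y_{it}]_{i \in L_s}$ is an $m_t \times 1$ vector of the dependent variable of the workers in $L_s$. $W_t$ is a constant weighting matrix given by (complete network)

$$W_t = \frac{1}{m_t - 1}\left(1_{m_t} 1_{m_t}' - I_{m_t}\right) = \frac{1}{m_t - 1}\left[\begin{pmatrix} 1 & \cdots & 1 \\ \vdots & \ddots & \vdots \\ 1 & \cdots & 1 \end{pmatrix} - \begin{pmatrix} 1 & & 0 \\ & \ddots & \\ 0 & & 1 \end{pmatrix}\right] = \frac{1}{m_t - 1}\begin{pmatrix} 0 & & 1 \\ & \ddots & \\ 1 & & 0 \end{pmatrix},$$

$$W_t Y_{st} = \frac{1}{m_t - 1}\begin{pmatrix} 0 & & 1 \\ & \ddots & \\ 1 & & 0 \end{pmatrix}\begin{pmatrix} y_{(1)t} \\ \vdots \\ y_{(m_t)t} \end{pmatrix} = \begin{pmatrix} \frac{1}{m_t - 1}\sum_{j=2}^{m_t} y_{(j)t} \\ \vdots \\ \frac{1}{m_t - 1}\sum_{j=1}^{m_t - 1} y_{(j)t} \end{pmatrix}.$$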
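A quick numerical check of the complete-network weighting matrix: each row of $W_t Y_{st}$ should equal the leave-one-out mean of the lineup's outcomes. The lineup size and the outcome values below are arbitrary illustrative numbers.

```python
import numpy as np

m = 5                                             # lineup size m_t (illustrative)
ones = np.ones((m, 1))
W = (ones @ ones.T - np.eye(m)) / (m - 1)         # W_t = (1 1' - I) / (m_t - 1)

y = np.array([1.0, 2.0, 3.0, 4.0, 10.0])          # hypothetical productivities
peer_mean = W @ y                                 # W_t Y_st
leave_one_out = (y.sum() - y) / (m - 1)           # mean of co-workers, excluding own outcome
```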

$W_t Y_{st}$ measures the average productivity of a worker's co-workers in lineup $L_s$, with its coefficient $\rho$ capturing the peer effect. $X_{st} = [x_{it}]_{i \in L_s}$ is an $m_t \times k_x$ matrix of $k_x$ exogenous variables of the workers in $L_s$. $U_{st}$ is an $m_t \times 1$ vector of disturbances such that $U_{st} \sim iid(0, \Sigma)$. We allow for possible correlation between $U_{st}$ and $\xi_t = (\xi_{1t}, ..., \xi_{q_t t})$ such that

$$E(U_{st} | d_{st} = 1, \pi_t) = \lambda_s(\pi_t) 1_{m_t} \quad \text{with } \pi_t = (\pi_{1t}, ..., \pi_{q_t t}). \tag{486}$$

Then, the model (485) can be written as⁶¹

$$Y_{st} = \rho W_t Y_{st} + X_{st}\beta + \lambda_s(\pi_t) 1_{m_t} + U^*_{st}, \tag{487}$$

where $U^*_{st} = U_{st} - \lambda_s(\pi_t) 1_{m_t}$. We consider three different approaches for estimation of (487).

The parametric selection correction approach. Let $F_{st}(.|\pi_t)$ denote the conditional distribution function of $\varepsilon_{st} = \max_{r \neq s} d^*_{rt} - d^*_{st}$. Let $\Phi(.)$ and $\phi(.)$ denote the standard normal distribution and density functions. Lee (1983) suggests using the transformation:

$$J_{st}(.) \equiv \Phi^{-1}(F_{st}(.|\pi_t))$$

to reduce the dimensionality of the selectivity bias. In this case the selectivity bias is given by

$$E(U_{st} | d_{st} = 1, \pi_t) = E[U_{st} | J_{st}(\varepsilon_{st}) < J_{st}(0), \pi_t] = E[U_{st} | J_{st}(\varepsilon_{st}) < J_{st}(0)],$$

where Lee (1983) implicitly assumes that the joint distribution of $U_{st}$ and $J_{st}(\varepsilon_{st})$ does not depend on $\pi_t$. Further, we assume that $U_{st}$ and $J_{st}(\varepsilon_{st})$ are i.i.d. with a joint normal distribution given by

$$\begin{bmatrix} U_{st} \\ J_{st}(\varepsilon_{st}) \end{bmatrix} \sim N\left[0, \begin{pmatrix} \Sigma & \sigma_{12} 1_{m_t} \\ \sigma_{12} 1_{m_t}' & 1 \end{pmatrix}\right].$$

The selectivity bias is then given by

$$E(U_{st} | d_{st} = 1, \pi_t) = -\sigma_{12}\frac{\phi(J_{st}(0))}{F_{st}(0|\pi_t)} 1_{m_t}. \tag{488}$$

Thus, from (486) and (488), we have:

$$\lambda_s(\pi_t) = -\sigma_{12}\frac{\phi(\Phi^{-1}(P_{st}))}{P_{st}}, \tag{489}$$

where we use

$$J_{st}(0) = \Phi^{-1}(F_{st}(0|\pi_t)) \quad \text{and} \quad P_{st} = F_{st}(0|\pi_t).$$

Substitution of (489) into (487) gives

$$Y_{st} = \rho W_t Y_{st} + X_{st}\beta - \sigma_{12}\frac{\phi(\Phi^{-1}(P_{st}))}{P_{st}} 1_{m_t} + U^*_{st}. \tag{490}$$

61 The selectivity bias $\lambda_s(\pi_t)$ introduces a group correlated effect (Manski, 1993). Semi-parametric estimation of $\rho$ and $\beta$ along with the unknown $\lambda_s(.)$ would face “the curse of dimensionality.”

For the network model, Lee's approach can be implemented as follows.

• Step 1: Let $\pi_{st} = z_{st}\gamma$, where $z_{st}$ is a $1 \times k_z$ vector of exogenous variables. Then, $\gamma$ can be estimated by maximizing the likelihood function:

$$\ln L = \sum_{t=1}^{T}\sum_{s=1}^{q_t} d_{st} \ln P_{st}.$$

It proves convenient to assume that $\xi_{st}$ is independently and identically Gumbel distributed so that

$$P_{st} = \exp(z_{st}\gamma) \Big/ \sum_{r=1}^{q_t} \exp(z_{rt}\gamma).$$

Then, $\gamma$ can be estimated by a conditional logit estimator $\hat{\gamma}$ (McFadden, 1974).

• Step 2: With the predicted probabilities

$$\hat{P}_{st} = \exp(z_{st}\hat{\gamma}) \Big/ \sum_{r=1}^{q_t} \exp(z_{rt}\hat{\gamma}),$$

we consider the feasible counterpart of (490)

$$Y_{st} = \rho W_t Y_{st} + X_{st}\beta - \sigma_{12}\frac{\phi(\Phi^{-1}(\hat{P}_{st}))}{\hat{P}_{st}} 1_{m_t} + U^{**}_{st}$$

and estimate $(\rho, \beta', \sigma_{12})'$ by the 2SLS estimator with linearly independent columns in $W_t X_{st}$ as instruments for $W_t Y_{st}$.
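The two ingredients of Steps 1 and 2, the conditional logit probabilities and the Lee-type correction regressor $\phi(\Phi^{-1}(P_{st}))/P_{st}$, can be sketched with the Python standard library alone. The helper names and the index values $z_{st}\hat{\gamma}$ are hypothetical; estimation of $\gamma$ and the 2SLS step are not shown.

```python
import math
from statistics import NormalDist

def conditional_logit_probs(index_values):
    """P_s = exp(pi_s) / sum_r exp(pi_r) over the lineups available in one period."""
    shift = max(index_values)                     # subtract the max for numerical stability
    expd = [math.exp(v - shift) for v in index_values]
    total = sum(expd)
    return [e / total for e in expd]

def lee_term(p):
    """phi(Phi^{-1}(p)) / p, the regressor whose coefficient is -sigma_12 in (490)."""
    std = NormalDist()                            # standard normal; requires 0 < p < 1
    return std.pdf(std.inv_cdf(p)) / p

# Hypothetical fitted indices z_st * gamma_hat for q_t = 3 candidate lineups
probs = conditional_logit_probs([0.2, -0.1, 1.0])
correction = [lee_term(p) for p in probs]
```

The same correction value is repeated for all $m_t$ workers of a lineup, since it multiplies $1_{m_t}$ in (490).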

The semi-parametric selection correction approach. Following Dahl (2002), we impose the following assumption to reduce the dimensionality of the selectivity bias:

$$\lambda_s(\pi_t) = \mu(P_{st}).$$

Thus, (487) becomes:

$$Y_{st} = \rho W_t Y_{st} + X_{st}\beta + \mu(P_{st}) 1_{m_t} + U^*_{st}.$$

The semi-parametric selection correction approach can be implemented in a similar two-step procedure.

• Step 1: We obtain the predicted probabilities $\hat{P}_{st}$ from, say, a conditional logit regression.

• Step 2: We replace $\mu(P_{st})$ by its (feasible) series approximation

$$\sum_{k=1}^{K} \kappa_k b_k(\hat{P}_{st}),$$

where $b_k(.)$ are the basis functions, and estimate $(\rho, \beta')'$ together with $\kappa_k$ by the 2SLS estimator with linearly independent columns in $W_t X_{st}$ as instruments for $W_t Y_{st}$.

The fixed-effect approach. The selectivity bias $\lambda_s(\pi_t)$ can be considered as a time-varying lineup-specific fixed effect. To avoid estimating the unknown function $\lambda_s(.)$, we can apply a within transformation to eliminate this term. Suppose $X_{st} = [X_{1,st},\ 1_{m_t} x_{2,st}]$, where $X_{1,st}$ is an $m_t \times k_1$ matrix of $k_1$ individual-varying exogenous variables and $x_{2,st}$ is a $1 \times k_2$ vector of individual-invariant exogenous variables ($k_1 + k_2 = k_x$). Then, (487) can be written as

$$Y_{st} = \rho W_t Y_{st} + X_{1,st}\beta_1 + 1_{m_t} x_{2,st}\beta_2 + \lambda_s(\pi_t) 1_{m_t} + U^*_{st}. \tag{491}$$

Let $Q_t = I_{m_t} - \frac{1}{m_t} 1_{m_t} 1_{m_t}'$ denote the within-transformation projector. As $Q_t 1_{m_t} = 0$ and $Q_t U^*_{st} = Q_t U_{st}$, premultiplication of (487) by $Q_t$ gives:

$$Q_t Y_{st} = \rho Q_t W_t Y_{st} + Q_t X_{1,st}\beta_1 + Q_t U_{st}. \tag{492}$$
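The properties of the within projector used to obtain (492) can be verified numerically. The sketch below also checks the algebraic identity $Q_t W_t = -Q_t/(m_t - 1)$, which holds for the complete-network $W_t$ defined earlier; it implies that $\rho$ enters (492) only through the factor $1 + \rho/(m_t - 1)$, which hints at why variation in $m_t$ matters for identification. The lineup size is an illustrative choice.

```python
import numpy as np

m = 4                                             # lineup size m_t (illustrative)
ones = np.ones((m, 1))
Q = np.eye(m) - ones @ ones.T / m                 # within projector Q_t
W = (ones @ ones.T - np.eye(m)) / (m - 1)         # complete-network W_t

Q_ones = Q @ ones                                 # Q_t 1 = 0 removes lambda_s(pi_t) 1 and 1 x_2 beta_2
QW = Q @ W                                        # equals -Q_t / (m_t - 1) for this W_t
```

With $Q_t W_t = -Q_t/(m_t - 1)$, equation (492) can be rearranged as $\left(1 + \frac{\rho}{m_t - 1}\right) Q_t Y_{st} = Q_t X_{1,st}\beta_1 + Q_t U_{st}$.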

Then, $\rho$ and $\beta_1$ can be estimated from the within model (492) by the conditional maximum likelihood (CML) in Lee (2007).

The fixed-effect approach does not impose any restrictions on $\lambda_s(\pi_t)$. However, the within transformation may cause an identification problem, especially if $m_t = m$ for all $t$, similar to the one studied in Lee (2007). The fixed-effect approach can be implemented by the following steps.

• Step 1: We estimate the within Eq. (492) by the CML estimator in Lee (2007).

• Step 2: We obtain the predicted probabilities $\hat{P}_{st}$ from, say, a conditional logit regression.

• Step 3: Let

$$r_{st} = \frac{1}{m_t} 1_{m_t}'\left(Y_{st} - \hat{\rho} W_t Y_{st} - X_{1,st}\hat{\beta}_1\right),$$

where $\hat{\rho}$ and $\hat{\beta}_1$ are the first-step estimates. We consider the regression:

$$r_{st} = x_{2,st}\beta_2 + \mu(\hat{P}_{st}) + \zeta_{st},$$

where the selectivity bias $\mu(\hat{P}_{st})$ is either given by $-\sigma_{12}\phi(\Phi^{-1}(\hat{P}_{st}))/\hat{P}_{st}$ in the parametric approach or approximated by $\sum_{k=1}^{K} \kappa_k b_k(\hat{P}_{st})$ in the semi-parametric approach. We estimate $\beta_2$ together with the unknown parameters in $\mu(\hat{P}_{st})$ by the OLS estimator.

• Remark: The parametric and semi-parametric approaches have the advantage of computational simplicity. However, both approaches impose strong restrictions on the selectivity bias $\lambda_s(\pi_t)$ to reduce its dimensionality. Because of the endogeneity of the peer effect regressor, the model needs to be estimated by the 2SLS estimator, which relies on the existence of valid instruments. This may be quite challenging in empirical applications. On the other hand, the fixed effect approach does not impose any restrictions on $\lambda_s(\pi_t)$. We can use the CML or GMM estimator, which exploit both linear and quadratic moment conditions, and may outperform the 2SLS estimator that only uses linear moment conditions. However, the within transformation makes the identification of the peer effect more challenging. In particular, the within equation is not identified if $m_t$ does not vary over time. In this case identification can be achieved by imposing exclusion restrictions through heterogeneous peer effects.

• Comments:

- No CSD...

- "The selectivity bias $\lambda_s(\pi_t)$ can be considered as a time-varying lineup-specific fixed effect. To avoid estimating the unknown function $\lambda_s(.)$, we can apply a within transformation to eliminate this term." Is this sufficiently general? In other words, how general is the endogeneity correction in (486)? Suppose that $\lambda_s(\pi_t)$ is time-varying; will the within transformation still remove $\lambda_s(\pi_t) 1_{m_t}$ for all $t$?

- In the threshold model, Kourtellos et al. (2015) have addressed the issue of an endogenous threshold variable in the single equation context. "Our estimation of the threshold parameter is based on a two-stage concentrated least squares method that involves an inverse Mills ratio bias correction term in each regime." So can we adopt the approach of Horrace et al. (2015) by extending the threshold model to the panel context? This would be an alternative approach to Seo and Shin (2015).

- Notice that the model (471) is also similar to KMS, but it imposes that the $d_{jt}$'s are known a priori. So we rewrite (471) using the KMS threshold selection mechanism as

$$y_{it} = \rho \frac{1}{m_{it}} \sum_{j=1}^{n} I(|z_{it} - z_{jt}| < r)\, y_{jt} + x_{it}\beta + E(u_{it} | z_t) + u^*_{it}, \tag{493}$$

where $z_t = (z_{1t}, ..., z_{nt})'$. As $m_{it} = \sum_{j=1}^{n} I(|z_{it} - z_{jt}| < r)$, the equal weights are row-normalised by construction. In this regard, (493) is a natural generalisation of (471), where all networks are generally unobserved a priori or where the network members are potentially randomly chosen individuals. "Bandiera et al. (2009) investigate how social connections between workers and managers affect the productivities of fruit pickers in the UK. Their measure of social connectedness is based on similarities of worker/manager characteristics, and there are multiple managers whose worker assignments change daily." We may consider and follow this line of research.

- Notice that Horrace et al. (2015) argue that $m_t$ is often predetermined (e.g., in sports games, the number of active players $m_t$ is fixed). Next, workers work on the project to produce output for a given time period. For the population of $n$ workers, the $n \times n$ adjacency matrix across all projects is potentially endogenous. By focusing on a single project of interest, we have an $m \times m$ submatrix of the adjacency matrix which is exogenous conditional on selection into the specific project. Thus, the network endogeneity is reduced to a selectivity bias, which can be corrected using a fixed effect estimator or a polychotomous Heckman-type bias correction procedure due to Lee (1983) and Dahl (2002).

- Our setup allows us to give a structural interpretation to the selectivity bias correction term that links the model to the productive efficiency literature, in that the bias can be viewed as “managerial (in)competence” or (in)efficiency, depending on the sign of the estimate. So in this regard, may we consider a stochastic frontier type application or extension?

- The KMS extension seems to be complicated. Here we have $N$ individuals, and select the $m_t$ members out of the $N$ individuals at time $t$. HLP assume that the selection is pre-determined and then control for selectivity bias. Now, we wish to endogenise the selection process using the threshold mechanism advanced by KMS. Then, how? Next, we control for the endogeneity of selection by using the generic control function approach. Here the main aim is to control for endogeneity; we are not interested in measuring the selection bias parametrically or semi-parametrically.

• Consider the following KMS type model:

yit = ρ1

mit

n∑j=1

I (|qit − qjt| < r) yjt + β′xit + uit, (494)

where I (A) is an indicator function, taking unity if the event A is true and0 otherwise, and mit =

∑nj=1 I (|qit − qjt| < r) with r being a threshold

parameter to be estimated. So the individual productivity is spatiallyaffected by peers or members. Now, qit is likely to be correlated. Tocontrol for such endogeneity, we consider the following control function:

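$$q_{it} = \lambda' q_{(i)t} + v_{it}, \qquad (495)$$

where

$$\underset{(N-1)\times 1}{q_{(i)t}} = \left[ q_{1t} \ \cdots \ q_{Nt} \right]', \qquad \underset{(N-1)\times 1}{\lambda} = \begin{pmatrix} \lambda_1 \\ \vdots \\ \lambda_N \end{pmatrix}.$$

Assume now:

$$\begin{pmatrix} u_{it} \\ v_{it} \end{pmatrix} \sim \left( 0, \begin{pmatrix} \sigma_u^2 & \sigma_{uv} \\ \sigma_{uv} & \sigma_v^2 \end{pmatrix} \right) \qquad (496)$$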


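then we construct

$$u_{it} = \delta v_{it} + e_{it}$$

where

$$\delta = \frac{\sigma_{uv}}{\sigma_v^2}, \qquad e_{it} \sim \left( 0, \ \sigma_e^2 = \sigma_u^2 - \frac{\sigma_{uv}^2}{\sigma_v^2} \right).$$

Hence, we rewrite (494) as

$$y_{it} = \rho \frac{1}{m_{it}} \sum_{j=1}^{n} I\left(|q_{it} - q_{jt}| < r\right) y_{jt} + \beta' x_{it} + \delta \left( q_{it} - \lambda' q_{(i)t} \right) + e_{it}. \qquad (497)$$

Now, the selection is exogenous with respect to $e_{it}$, so that we follow the KMS algorithm.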

• My main query is how to extend the above logic from the individual level to the line-up level analysed in your paper.

• A few more questions:

— I am still not 100% convinced by your estimation details.

— You now estimate eq. (22) in the empirical applications. How do you construct $Y_{st}$? Is it an $m \times 1$ vector? If so, what is the data dimension? How do you write the model in matrix form in terms of $Y_t = \left( Y_{1t}', \ldots, Y_{q_t t}' \right)'$? Also, the data seem to be unbalanced, since a specific line-up (out of a total of 79 different line-ups) does not play in every time period. How should such unbalanced data be handled?

— You propose the within estimator with $Q_t = I_{m_t} - \frac{1}{m_t} 1_{m_t} 1_{m_t}'$ denoting the within-transformation projector, since the selectivity bias $\lambda_s(\pi_t)$ can be considered as a time-varying lineup-specific fixed effect. But this is a bit confusing, as it looks like a time effect, and $Q_t$ constructs the deviation from the cross-section or group average rather than from the individual mean over time.

— In Section 5.3.1, "we can use player dummies to control for unobserved player-specific characteristics." I am a bit unclear about how to combine them... Is it feasible to introduce the line-up specific individual effects from eq. (7)? More generally, why not control for interactive effects by

$$Y_{st} = \rho W_t Y_{st} + X_{st}\beta + \varepsilon_{st}, \quad s = 1, \ldots, q_t, \ t = 1, \ldots, T,$$

$$\varepsilon_{st} = \lambda_s' f_t + U_{st},$$

so that the model can also allow for a strong form of cross-section dependence?
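To make the concern about $Q_t$ concrete, here is a minimal NumPy sketch with made-up numbers. The helper name `within_projector` is hypothetical and not taken from the paper under discussion. It illustrates that premultiplying by $Q_t = I_{m_t} - \frac{1}{m_t} 1_{m_t} 1_{m_t}'$ subtracts the cross-section (group) average at a given $t$, so any term common to all members at time $t$ drops out, whereas a member-specific mean over time is not removed by this operation.

```python
import numpy as np

def within_projector(m):
    """Q = I_m - (1/m) 1 1': demeans across the m members present at one time period."""
    ones = np.ones((m, 1))
    return np.eye(m) - ones @ ones.T / m

# Illustrative data (assumed): m_t = 4 members observed at a single period t.
y_t = np.array([2.0, 4.0, 6.0, 8.0])
Q_t = within_projector(len(y_t))

demeaned = Q_t @ y_t                # deviation from the cross-section average
common_shift = Q_t @ (y_t + 10.0)   # adding a term common to all members changes nothing

print(demeaned)
print(np.allclose(demeaned, common_shift))
```

The same projector applied period by period therefore behaves like a control for a time (or group-time) effect, which is the source of the query above.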


• Initially, I thought that the KMS extension would be straightforward. However, here we have $N$ individuals, and select the $m_t$ members out of the $N$ individuals at every time $t$. You assume that the selection is predetermined and then control for selectivity bias. Now, we wish to endogenise the selection process using the threshold mechanism advanced by KMS. Then, how do we move from an individual selection to the group selection? If feasible, my idea is to control for the endogeneity of selection by using the generic control function approach. Recall that the main aim is to control for endogeneity; we are not interested in measuring the selection bias parametrically or semi-parametrically.
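As a rough illustration of the two building blocks discussed in these notes, the sketch below constructs the threshold-based weights $I(|q_{it} - q_{jt}| < r)/m_{it}$ for a single period and a first-stage control-function residual $\hat{v}_{it}$. It is only a sketch under stated assumptions: the function names, the simulated data, and the first-stage regressor matrix `Z` (a stand-in for $q_{(i)t}$) are all hypothetical, and the estimation of $\rho$ and $r$ by the KMS algorithm is not shown.

```python
import numpy as np

def threshold_weights(q, r):
    """Row-normalised weights w_ij = I(|q_i - q_j| < r) / m_i for one period.

    The sum over j = 1..n is taken as written, so j = i is included
    (|q_i - q_i| = 0 < r whenever r > 0), which guarantees m_i >= 1.
    """
    close = (np.abs(q[:, None] - q[None, :]) < r).astype(float)
    m = close.sum(axis=1, keepdims=True)
    return close / m

def first_stage_residual(q, Z):
    """OLS of q on Z; the residual v_hat plays the role of the control function term."""
    coef, *_ = np.linalg.lstsq(Z, q, rcond=None)
    return q - Z @ coef

# Simulated inputs (assumed purely for illustration).
rng = np.random.default_rng(0)
n = 6
Z = np.column_stack([np.ones(n), rng.normal(size=n)])  # hypothetical first-stage regressors
q = Z @ np.array([0.5, 1.0]) + rng.normal(size=n)
y = rng.normal(size=n)
x = rng.normal(size=n)

W = threshold_weights(q, r=0.8)
v_hat = first_stage_residual(q, Z)

# Columns multiplying rho, beta and delta in the augmented equation, for a given r.
regressors = np.column_stack([W @ y, x, v_hat])
print(W.sum(axis=1))
print(regressors.shape)
```

For a given candidate $r$, the augmented regressor matrix can be rebuilt in this way. Since the spatial lag remains endogenous, plain least squares on these columns would not be appropriate, which is why the estimation step is left to the KMS procedure.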

10 Concluding Remarks

Eventually, this project aims to

• develop general econometric specifications that can accommodate spatial and factor dependence, spatial heterogeneity, an endogenous spatial weights matrix as well as spatial nonlinearity in dynamic heterogeneous panels within a unified framework, by combining the recent advances made in the related literature.

• extend all these advances to multi-dimensional datasets, separately and jointly. As the dimension grows, it becomes more complicated and challenging to develop the hierarchical structure of the spatial effects and factors jointly.

• This work should be widely applicable to a variety of big datasets, not only health economics data...

11 References

Abowd, J.M., F. Kramarz and D.N. Margolis (1999): “High Wage Workers and High Wage Firms”, Econometrica 67: 251-333.

Ahn, S. and A. Horenstein (2013): Eigenvalue ratio test for the number of factors. Econometrica 81: 1203-1227.

Ahn, S.C., Y.H. Lee and P. Schmidt (2013): Panel data models with multiple time-varying individual effects. Journal of Econometrics 174: 1-14.

Anderson, T.W. and C. Hsiao (1981): Estimation of dynamic models with error components. Journal of the American Statistical Association 76: 598-606.

Anderson, J. and E. van Wincoop (2003): “Gravity with Gravitas: A Solution to the Border Puzzle,” American Economic Review 93: 170-92.

Andrews, D.W.K. (2005): Cross-section regression with common shocks. Econometrica 73: 1551-1585.


Anselin, L. (1988): Spatial Econometrics: Methods and Models. Kluwer Academic, Boston, MA.

Arellano, M. and S. Bond (1991): Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations. Review of Economic Studies 58: 277-297.

Bai, J. (2003): Inferential theory for factor models of large dimensions. Econometrica 71: 135-171.

Bai, J. (2009): Panel data models with interactive fixed effects. Econometrica 77: 1229-1279.

Bai, J. (2013): “Likelihood Approach to Dynamic Panel Models with Interactive Effects,” mimeo., Columbia University.

Bai, J. and K. Li (2014): “Spatial Panel Data Models with Common Shocks,” mimeo., Columbia University.

Bai, J. and K. Li (2015): “Dynamic Spatial Panel Data Models with Common Shocks,” mimeo., Columbia University.

Baier, S.L. and J.H. Bergstrand (2007): “Do Free Trade Agreements Actually Increase Members’ International Trade?” Journal of International Economics 71: 72-95.

Bailey, N., S. Holly and M.H. Pesaran (2016): “A Two-Stage Approach to Spatio-Temporal Analysis with Strong and Weak Cross-sectional Dependence,” forthcoming in Journal of Applied Econometrics.

Bailey, N., G. Kapetanios and M.H. Pesaran (2016): “Exponent of Cross-sectional Dependence: Estimation and Inference,” forthcoming in Journal of Applied Econometrics.

Balazsi, L., M. Bun, F. Chan and M.N. Harris (2016): “Models with Endogenous Regressors,” Chapter 3 in The Econometrics of Multi-dimensional Panels ed. by L. Matyas.

Balazsi, L., B.H. Baltagi, L. Matyas and D. Pus (2016): “Modelling Multi-dimensional Panel Data: A Random Effects Approach”, mimeo., Central European University.

Balazsi, L., L. Matyas and T. Wansbeek (2015): “The Estimation of Multi-dimensional Fixed Effects Panel Data Models”, forthcoming in Econometric Reviews.

Baldwin, R.E. (2006): In or Out: Does it Matter? An Evidence-Based Analysis of the Euro’s Trade Effects. Centre for Economic Policy Research.

Baldwin, R.E. and D. Taglioni (2006): “Gravity for Dummies and Dummies for Gravity Equations,” NBER Working Paper 12516.

Baltagi, B.H. (2005): Econometric Analysis of Panel Data, 3rd edition. Wiley: Chichester.

Baltagi, B.H. and G. Bresson (2016): “Modelling Housing Using Multi-dimensional Panel Data,” Chapter 12 in The Econometrics of Multi-dimensional Panels ed. by L. Matyas.

Baltagi, B.H., P. Egger and M. Pfaffermayr (2003): “A Generalized Design for Bilateral Trade Flow Models,” Economics Letters 80: 391-397.

Baltagi, B., P. Egger and M. Pfaffermayr (2008): Estimating regional trade agreement effects on FDI in an interdependent world. Journal of Econometrics 145: 194-208.

Baltagi, B.H., P. Egger and M. Pfaffermayr (2015): “Panel Data Gravity Models of International Trade,” in The Oxford Handbook of Panel Data ed. by B.H. Baltagi.

Behrens, K., C. Ertur and W. Kock (2012): “Dual Gravity: Using Spatial Econometrics to Control For Multilateral Resistance,” Journal of Applied Econometrics 27: 773-794.

Bertoli, S. and J. Fernandez-Huertas Moraga (2013): “Multilateral Resistance to Migration,” Journal of Development Economics 102: 79-100.

Breitung, J. and S. Eickmeier (2016): Analyzing international business and financial cycles using multi-level factor models: A comparison of alternative approaches, in Dynamic Factor Models, Chap. 5, pp. 177-214. Emerald Group Publishing Limited.

Chamberlain, G. (1984): Panel Data, in Handbook of Econometrics, Volume 2, eds. Z. Griliches and M. Intriligator, Amsterdam: North-Holland, 1247-1318.

Choi, I. (2012): Efficient estimation of factor models. Econometric Theory 28: 274-308.

Choi, I., D. Kim, Y.J. Kim and N.-S. Kwark (2017): A Multilevel Factor Model: Identification, Asymptotic Theory and Applications. Journal of Applied Econometrics.

Chudik, A., M.H. Pesaran and E. Tosetti (2011): Weak and strong cross section dependence and estimation of large panels. The Econometrics Journal 14: C45-C90.

Chudik, A., K. Mohaddes, M.H. Pesaran and M. Raissi (2017): “Is there a debt-threshold effect on output growth?” Review of Economics and Statistics 99: 135-150.

Cliff, A.D. and J.K. Ord (1973): Spatial Autocorrelation. Pion, London.

Conley, T.G. and B. Dupor (2003): A spatial analysis of sectoral complementarity. Journal of Political Economy 111: 311-352.

Dang, V.A., M. Kim and Y. Shin (2012): Asymmetric Capital Structure Adjustments: New Evidence from Dynamic Panel Threshold Models. Journal of Empirical Finance 19: 465-482.

Defever, F., B. Heid and M. Larch (2015): Spatial exporters. Journal of International Economics 95: 145-156.

De Nardis, S. and C. Vicarelli (2003): “Currency Unions and Trade: The Special Case of EMU,” World Review of Economics 139: 625-49.

Eberhardt, M. and A.F. Presbitero (2015): “Public Debt and Growth: Heterogeneity and Non-linearity,” Journal of International Economics 97: 45-58.

Eberhardt, M., C. Helmers and H. Strauss (2013): Do spillovers matter when estimating private returns to R&D? Review of Economics and Statistics 95: 436-448.

Egger, P. and M. Pfaffermayr (2003): “The Proper Econometric Specification of the Gravity Equation: 3-Way Model with Bilateral Interaction Effects,” Empirical Economics 28: 571-580.

Elhorst, J.P. (2005): Unconditional maximum likelihood estimation of linear and log-linear dynamic models for spatial panels. Geographical Analysis 37: 85-106.

Elhorst, J.P. (2010): Dynamic panels with endogenous interaction effects when T is small. Regional Science and Urban Economics 40: 272-282.

Elliott, M., B. Golub and M.O. Jackson (2014): Financial Networks and Contagion. American Economic Review 104: 3115-53.

Fan, J., Y. Liao and M. Mincheva (2011): High dimensional covariance matrix estimation in approximate factor models. Annals of Statistics 39: 3320.

Feenstra, R.C. (2004): Advanced International Trade. Princeton University Press: Princeton, NJ.

Forni, M., M. Hallin, M. Lippi and L. Reichlin (2004): The generalized dynamic factor model consistency and rates. Journal of Econometrics 119: 231-255.

Frankel, J.A. and A.K. Rose (2002): “An Estimate of the Effect of Common Currencies on Trade and Income,” Quarterly Journal of Economics 117: 437-466.

Glick, R. and A.K. Rose (2002): “Does a Currency Union Affect Trade? The Time Series Evidence,” European Economic Review 46: 1125-1151.

Gunnella, V., C. Mastromarco, L. Serlenga and Y. Shin (2015): “The Euro Effects on Intra-EU Trade Flows and Balances: Evidence from the Cross Sectionally Correlated Panel Gravity Models,” mimeo., University of York.

Hausman, J.A. and W.E. Taylor (1981): “Panel Data and Unobservable Individual Effect,” Econometrica 49: 1377-1398.

Hahn, J. and G. Kuersteiner (2002): Asymptotically unbiased inference for a dynamic panel model with fixed effects when both n and T are large. Econometrica 70: 1639-1657.

Helpman, E. (1987): “Imperfect competition and international trade: evidence from fourteen industrialized countries,” Journal of the Japanese and International Economies 1: 62-81.

Herwartz, H. and H. Weber (2010): “The Euro’s Trade Effect under Cross-sectional Heterogeneity and Stochastic Resistance”, Kiel Working Paper No. 1631, Kiel Institute for the World Economy, Germany.

Kapetanios, G. and M.H. Pesaran (2005): “Alternative Approaches to Estimation and Inference in Large Multifactor Panels: Small Sample Results with an Application to Modelling of Asset Returns,” CESifo Working Paper Series 1416, CESifo Group Munich.

Kapetanios, G., L. Serlenga and Y. Shin (2018): Estimation and Inference for Multi-dimensional Heterogeneous Panel Datasets with Hierarchical Multi-factor Error Structure, mimeo., University of York.

Kato, T. (1995): Perturbation Theory for Linear Operators. Springer, Berlin.

Kelejian, H.H. and I.R. Prucha (2010): Specification and estimation of spatial autoregressive models with autoregressive and heteroskedastic disturbances. Journal of Econometrics 157: 53-67.

Kose, M., C. Otrok and C. Whiteman (2003): International business cycles: World, region, and country-specific factors. American Economic Review 93: 1216-1239.


Kramarz, F., S.J. Machin and A. Ouazad (2008): “What Makes a Test Score? The Respective Contributions of Pupils, Schools, and Peers in Achievement in English Primary Education,” INSEAD Working Paper.

Kuersteiner, G.M. and I.R. Prucha (2015): “Dynamic Spatial Panel Models: Networks, Common Shocks, and Sequential Exogeneity,” mimeo., University of Maryland.

Krugman, P.R. (1997): “Increasing Returns, Monopolistic Competition and International Trade,” Journal of International Economics 9: 469-479.

Lee, L.F. (2004): Asymptotic distributions of quasi-maximum likelihood estimator for spatial autoregressive models. Econometrica 72: 1899-1925.

Lee, L.F. and J. Yu (2010a): Some recent developments in spatial panel data models. Regional Science and Urban Economics 40: 255-271.

Lee, L.F. and J. Yu (2010b): A spatial dynamic panel data model with both time and individual fixed effects. Econometric Theory 26: 564-597.

Lee, L.F. and J. Yu (2013): Identification of spatial Durbin panel models. Working paper; forthcoming in Journal of Applied Econometrics.

Lee, L.F. and J. Yu (2015): Spatial panel data models, chapter 12, in: Baltagi, B.H. (Ed.), The Oxford Handbook of Panel Data. Oxford University Press, New York, NY.

Li, J. and Z. Yang (2017): A panel data model with interactive effects characterized by multilevel non-parallel factors. Applied Economics Letters.

Lin, X. and L.F. Lee (2010): GMM estimation of spatial autoregressive models with unknown heteroskedasticity. Journal of Econometrics 157: 34-52.

Liu, X. and L.F. Lee (2010): GMM estimation of social interaction models with centrality. Journal of Econometrics 159: 99-115.

Lu, X. and L. Su (2015): Shrinkage estimation of dynamic panel data models with interactive fixed effects. Working paper. Singapore Management School.

Lu, X. and S. Su (2018): Three-Dimensional Panel Data Models with Factor Structures, mimeo., Hong Kong University of Science and Technology.

Manski, C.F. (1993): Identification of endogenous social effects: The reflection problem. Review of Economic Studies 60: 531-542.

Mastromarco, C., L. Serlenga and Y. Shin (2016): “Modelling Technical Inefficiency in Cross Sectionally Dependent Stochastic Frontier Panels,” Journal of Applied Econometrics 31: 281-297.

Mastromarco, C., L. Serlenga and Y. Shin (2015): “Multilateral Resistance and Euro Effects on Trade Flows,” forthcoming in Spatial Econometric Interaction Modelling eds. by G. Arbia and R. Patuelli. Springer: Berlin.

Matyas, L. (1997): “Proper Econometric Specification of the Gravity Model,” The World Economy 20: 363-369.

Moon, H. and M. Weidner (2014): Dynamic linear panel regression models with interactive fixed effects. Working paper.

Moon, H. and M. Weidner (2015): Linear regression for panel with unknown number of factors as interactive fixed effects. Forthcoming in Econometrica.

Mundlak, Y. (1978): On the pooling of time series and cross section data. Econometrica 46: 69-85.


Nauges, C. and A. Thomas (2003): Consistent estimation of dynamic panel data models with time-varying individual effects. Annales d’Economie et de Statistique 70: 53-74.

Omay, T. and E. Kan (2010): “Re-examining the threshold effects in the inflation-growth nexus with cross-sectionally dependent non-linear panel: Evidence from six industrialized economies,” Economic Modelling 27: 996-1005.

Onatski, A. (2009): A formal statistical test for the number of factors in the approximate factor models. Econometrica 77: 1447-1479.

Onatski, A. (2010): Determining the number of factors from the empirical distribution of eigenvalues. Review of Economics and Statistics 92: 1004-1016.

Ord, K. (1975): Estimation methods for models of spatial interaction. Journal of the American Statistical Association 70: 120-297.

Pakko, M.R. and H.J. Wall (2002): “Reconsidering the Trade-creating Effects of a Currency Union,” Federal Reserve Bank of St Louis Review 83: 37-46.

Pesaran, M.H. (2006): Estimation and inference in large heterogeneous panels with a multifactor error structure. Econometrica 74: 967-1012.

Pesaran, M.H. (2015): “Testing Weak Cross-sectional Dependence in Large Panels,” Econometric Reviews 34: 1089-1117.

Pesaran, M.H. and E. Tosetti (2011): Large panels with common factors and spatial correlation. Journal of Econometrics 161: 182-202.

Persson, T. (2001): “Currency Union and Trade: How Large is the Treatment Effect?” Economic Policy 33: 435-448.

Rodríguez-Caballero, C.V. (2016): Panel Data with Cross-Sectional Dependence Characterized by a Multi-Level Factor Structure, mimeo., CREATES, Aarhus University.

Rose, A.K. (2001): “Currency Unions and Trade: The Effect is Large,” Economic Policy 33: 449-61.

Serlenga, L. and Y. Shin (2007): “Gravity Models of Intra-EU Trade: Application of the CCEP-HT Estimation in Heterogeneous Panels with Unobserved Common Time-specific Factors,” Journal of Applied Econometrics 22: 361-381.

Seo, M. and Y. Shin (2016): Inference for Dynamic Panels with Threshold Effect and Endogeneity. Journal of Econometrics 195: 169-186.

Shi, W. and L.F. Lee (2017): “Spatial Dynamic Panel Data Models with Interactive Fixed Effects,” Journal of Econometrics 197: 323-347.

Stock, J. and M.W. Watson (2010): Dynamic factor models. Oxford Handbook of Economic Forecasting. Michael P. Clements and David F. Hendry (eds), Oxford University Press. DOI: 10.1093/oxfordhb/9780195398649.013.0003

Su, L., S. Jin and Y. Zhang (2015): Specification test for panel data models with interactive fixed effects. Journal of Econometrics 186: 222-244.

Su, L. and Z. Yang (2015): QML estimation of dynamic panel data models with spatial errors. Journal of Econometrics 185: 230-258.
