Repeated Measures and Multilevel Modeling

Every year, the international data infrastructure comes to include data for more countries for more years. Previously existing data never disappear. The inexorable temporal and spatial expansion of the international data infrastructure has inevitably raised the question of what to do with all the extra years of data. The obvious answer is to use extra observations of the same variables for the same countries as additional cases for analysis, though other answers are possible (e.g., using additional years to produce ever-more-robust period averages and ever-longer time lags). With new observations forthcoming every year for pretty much every country for which data are already available, new analyses are always possible (and publishable). Science progresses and careers are built.

Model designs that make use of vertical data structures in which the same countries appear multiple times in the same database are known as repeated measures designs. With repeated measures designs it is possible to study multiple examples of change over time, contemporaneous (or lagged) movements in variables across time and geography, or (under certain conditions) simply more cases of the same underlying phenomena. There is a danger, however, that in using repeated measures to create additional cases for analysis, the additional observations of the same country at multiple points in time are not really additional cases in the sense of new, independent realizations of underlying quantitative macro-comparative research (QMCR) data-generating processes. Along these lines, Kittel (1999) makes a distinction between observations and cases because it might reasonably be questioned to what extent (say) France in 1996 was a different analytical case from France in 1997.

Unreflexively treating each additional time observation as a new case leads to the reductio ad absurdum that it is possible to generate an infinite number of cases just by slicing time into thinner and thinner units:

For example, consider investigating the effect of regime type on the provision of public goods with data on 20 countries. Suppose now that we obtain 20 years of data for these countries . . . is this data inflation really legitimate? Why not take monthly observations for each of these 20 countries, then we would have 4800 data points, and surely all our estimates would be statistically significant. . . . In short, if we have 20 countries in our data set, we have 20 countries, not 400 [20 countries times 20 years]. . . . This topic has received little attention in the literature. (Wilson and Butler 2007:108)


162 PART II . STATISTICAL ANALYSIS OF MACRO-COMPARATIVE DATA

On the other hand, as Figure 1.4 illustrated, the same argument could be made for Guatemala 2000 versus Honduras 2000 as for France 1996 versus France 1997; in other words, the proliferation of cases by year is just a form of compositional interdependence. The problem, however, is more severe for repeated observations of the same country over time than for observations of different but related countries. All QMCR data are susceptible to suspicions of case inflation due to compositional interdependence, but repeated measures data are especially and explicitly susceptible.

This is well illustrated by the cross-national relationship between national income and infant mortality. Babones (2009c) reports a range of correlations between infant mortality and national income measured over a 45-year period (10 time points). These are replicated in Figure 7.1. Presumably due to improved measurement, the correlation has slowly increased in magnitude over the years, from r = –0.887 in 1960 to r = –0.939 in 2005 (the relationship between infant mortality and poverty is tightening). The standard error of the correlation has slowly declined from 0.053 to 0.040. Pooling all 10 time points together into a single analysis yields N = 770 observations (77 countries by 10 time points). This has no real effect on the correlation; the pooled correlation (r = –0.889) falls within the range of the observed correlations for the 10 individual time points. On the other hand, it has a dramatic effect on the standard error; the pooled standard error (0.017) is much lower than any of the 10 original standard errors. As this example illustrates, parameters estimated using repeated measures data are highly susceptible to downward biases in their standard errors (and thus inflated statistical significance).

[Figure 7.1 Correlations and Standard Errors Between Infant Mortality (logged) and GDP per Capita (logged), 1960–2005. Left axis: correlation (sign reversed; all correlations are negative), scaled 0.000 to 1.000. Right axis: standard error of the correlation, scaled 0.000 to 0.100. Horizontal axis: 1960 to 2005 in 5-year steps. Annotations: pooled analysis (N = 770), r = –0.889; pooled analysis (N = 770), standard error = 0.017.]

Source: After Babones (2009c:93).

Note: Constant Panel of N = 77 Countries.

CHAPTER 7. REPEATED MEASURES AND MULTILEVEL MODELING 163
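The arithmetic behind the shrinking standard error can be checked by hand. The sketch below assumes the standard errors quoted for Figure 7.1 follow the textbook formula for the standard error of a Pearson correlation, sqrt((1 - r^2) / (N - 2)); the source does not state which formula was used, but this one agrees with the quoted values to three decimal places. The function name `correlation_se` is illustrative only.

```python
import math

def correlation_se(r: float, n: int) -> float:
    """Textbook standard error of a Pearson correlation: sqrt((1 - r^2) / (n - 2)).

    Assumption: this is the formula behind the figures quoted in the text.
    It treats all n observations as independent, which is exactly the
    assumption that pooled repeated measures violate.
    """
    return math.sqrt((1.0 - r * r) / (n - 2))

se_1960 = correlation_se(-0.887, 77)     # single cross-section, 77 countries
se_2005 = correlation_se(-0.939, 77)     # single cross-section, 77 countries
se_pooled = correlation_se(-0.889, 770)  # 77 countries x 10 time points, pooled

print(round(se_1960, 3), round(se_2005, 3), round(se_pooled, 3))
```

Because r barely changes under pooling, nearly all of the reduction comes from the denominator: N - 2 grows from 75 to 768, roughly tenfold, so the standard error falls by roughly the square root of ten.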

Scenarios like the one laid out in Figure 7.1 are easily handled by the statistical tools that have been developed for repeated measures models (described below), but researchers have to know about these tools and use them properly for them to make any difference. In a review of 195 papers from the political science literature, Wilson and Butler (2007:100) found that only seven met what they considered “basic criteria” for diagnosing and treating common problems with repeated measures designs. It might reasonably be argued that Wilson and Butler’s “basic criteria” are very advanced indeed, but Wilson and Butler found that over 20% of the papers they studied did nothing whatsoever to address the kinds of errors portrayed in Figure 7.1. This result is especially shocking given the fact that Wilson and Butler’s study universe consisted entirely of relatively sophisticated papers that had cited either Beck and Katz (1995) or Beck and Katz (1996), methodological contributions that explicitly warned of the necessity of correcting standard errors.

Although all repeated measures designs share some features in common, there are, broadly speaking, two common scenarios for the use of repeated measures data. Time series cross-sectional (TSCS) designs make use of relatively large numbers of time points (T) for relatively small numbers of countries (N) so that T is (usually) much greater than N. Multilevel modeling (MLM) designs—also called hierarchical linear model (HLM) designs—make use of relatively small numbers of time points (T) for relatively large numbers of countries (N) so that N is (usually) much greater than T. The set of countries included in a repeated measures database is known as the panel, so both methods (though MLM more often than TSCS designs) are referred to collectively as methods for the analysis of panel data, or panel designs.

The TSCS approach, as its name implies, puts greater emphasis on the cross-sectional variability between countries, while the MLM approach puts greater emphasis on the over-time variability within countries. That said, it must be emphasized that the two types of models share a common statistical toolkit. The difference between TSCS and MLM designs is methodological, not statistical: identical statistical tools are used in each, just with different frequency for different purposes, and there is no firm line between the two. The first section below lays out some of the common issues that arise from the use of repeated measures data in any model design. The second section focuses (briefly) on the methodological literature around TSCS designs in econometrics and political science, much of which applies equally well to the MLM designs used in the social sciences more broadly. The third section focuses in more depth on MLM designs, which are more common in QMCR proper, addressing in particular the issue of fixed versus random effects. The chapter concludes with an evaluation of the risks and benefits associated with the use of repeated measures data and suggests a new way forward, the slope-slope model.


THE STRUCTURE OF REPEATED MEASURES DATA

Repeated measures QMCR data are organized into vertical database structures (Figure 4.3) in which each country appears at least once (but usually more than once) as a database row. Each observation of a country is usually keyed to a year, though it need not be: repeated measures can just as well be for months, quarters, 5- or 10-year intervals, and so on. In practical application, time units below 1 year are not used in QMCR outside economics due to a lack of sub-annual data for most variables of interest, while time units greater than 1 year are rarely used because period-averaging reduces the (nominal) number of observations available for analysis. As a result, the year is the standard unit. Observations, however, are not necessarily annual; sometimes (as in Figure 7.1) year-specific data are used in 5- or 10-year intervals.

Repeated measures databases may be balanced or unbalanced. In a balanced panel every country is observed at every time point; in an unbalanced panel some country-period observations are missing. Database searches of the QMCR literature suggest that unbalanced panels are actually more common than balanced panels in QMCR. Unfortunately, the properties of repeated measures models based on unbalanced panels are not well established, since nearly all methodological research on the properties of repeated measures models presumes that the panels are balanced. As a result, it is not obvious to what extent the results of studies based on unbalanced panels of countries are valid. Moreover, many of the tools available for analyzing repeated measures data are only applicable to balanced panels. This is not to say that unbalanced panels cannot be studied, but rather that unbalanced panel designs have not been adequately studied and so are (by comparison with balanced panel designs) poorly understood.
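Whether a given vertical database is balanced is easy to check before any modeling begins. The following is a minimal sketch that takes "balanced" in its usual sense (every country observed at every time point that occurs anywhere in the data); the function name `is_balanced` and the toy country labels are hypothetical.

```python
from collections import defaultdict

def is_balanced(rows):
    """Return True if every country appears at every time point found in the data.

    rows: iterable of (country, year) pairs, one per database row.
    """
    years_by_country = defaultdict(set)
    all_years = set()
    for country, year in rows:
        years_by_country[country].add(year)
        all_years.add(year)
    return all(years == all_years for years in years_by_country.values())

# Toy panel: 3 countries observed at 3 time points each.
balanced = [(c, y) for c in ("A", "B", "C") for y in (1990, 1995, 2000)]
# Dropping a single country-year makes the panel unbalanced.
unbalanced = [row for row in balanced if row != ("B", 1995)]

print(is_balanced(balanced), is_balanced(unbalanced))
```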

The fundamental challenge of working with repeated measures data, however, arises from their very nature as multiple observations of the same units. Classical statistical methods are built on the assumption that each observation varies independently of all other observations. With repeated measures data, this is clearly not the case. While it can be argued that cross-sectional data are riddled with complex dependence structures, with repeated measures data there is no argument: repeated observations of the same country over time are not statistically independent cases for analysis. When cases are not independent, regression results (both coefficients and their standard errors) can become seriously biased. Some typical problems of error dependence in repeated measures models and commonly used solutions are discussed in this section.

The Problem of Nonspherical Errors

The main difficulty with repeated measures data is that models based on them routinely violate the spherical errors assumption that underlies the estimation of regression coefficients using ordinary least squares (OLS). The spherical errors assumption in words roughly translates as “the errors look the same from every direction”: the error associated with every case used in a regression analysis is independent of the error associated with every other case (there is no clustering in the errors) and the error associated with every case has the same variance as the error associated with every other case (the errors are homoskedastic). The data underlying Figure 7.1 fail on both counts. First, there is severe clustering by country in the size of the regression errors. Out of 77 countries, 9 have errors that are positive for at least 9 out of 10 time points and 16 have errors that are negative for at least 9 out of 10 time points. Second, there is noticeable clumping by year in the variability of the regression errors. The standard deviation of the errors ranges from a low of 0.145 for the 1990 observations to a high of 0.191 for the 2005 observations and varies widely across countries.
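Both diagnostics described here (counting countries whose residuals keep the same sign, and comparing the residual standard deviation across periods) are simple to compute once residuals are arranged as a country-by-period array. The sketch below uses simulated data, not the Figure 7.1 data; the effect sizes, the seed, and the helper names `sign_clustering` and `pooled_ols_residuals` are all assumptions made for illustration.

```python
import numpy as np

def sign_clustering(resid, min_same=9):
    """Count units whose residuals share a sign at >= min_same time points.

    resid has shape (n_units, n_periods). Returns (n_mostly_positive, n_mostly_negative).
    """
    n_pos = int(np.sum((resid > 0).sum(axis=1) >= min_same))
    n_neg = int(np.sum((resid < 0).sum(axis=1) >= min_same))
    return n_pos, n_neg

def pooled_ols_residuals(x, y):
    """Residuals from a pooled bivariate OLS fit of y on x (arrays of equal shape)."""
    X = np.column_stack([np.ones(x.size), x.ravel()])
    beta, *_ = np.linalg.lstsq(X, y.ravel(), rcond=None)
    return (y.ravel() - X @ beta).reshape(y.shape)

rng = np.random.default_rng(0)
n, t = 77, 10                                        # same dimensions as the panel in the text
x = rng.normal(size=(n, t))
country_effect = rng.normal(scale=0.3, size=(n, 1))  # simulated unit heterogeneity
noise = rng.normal(scale=0.1, size=(n, t))

resid_clustered = pooled_ols_residuals(x, -0.9 * x + country_effect + noise)
resid_spherical = pooled_ols_residuals(x, -0.9 * x + noise)

print("same-sign countries, clustered errors:", sign_clustering(resid_clustered))
print("same-sign countries, spherical errors:", sign_clustering(resid_spherical))
print("residual SD by period:", resid_clustered.std(axis=0).round(3))
```

With independent errors, a country should show the same residual sign at 9 or more of 10 time points only about 2% of the time, so counts like the 9 and 16 reported in the text (out of 77) are far above what chance alone would produce.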

These two violations of error sphericity (clustering and heteroskedasticity) have the potential to wreak havoc on regression models. Error clustering can be thought of as error mean dependence: the mean level of the error is not independent across observations. Wilson and Butler (2007) delineate two kinds of error mean dependence that are especially common in QMCR: country unit heterogeneity (errors are similar for all observations of the same country) and dynamic dependence over time (errors for one year are related to errors for the next year for the same country). Though not included in Wilson and Butler’s framework, both of these kinds of country-oriented mean dependence potentially have time-oriented counterparts: unit heterogeneity among years (errors are similar for all observations of the same year) and dynamic dependence among countries (errors for one country are related to errors for neighboring countries in the same year, i.e., spatial dependence). Other, more exotic forms of mean dependence are also possible.

Heteroskedasticity can be thought of as error variance dependence. It often arises in repeated measures models when observations for different countries or different time periods (or both) have different error variances. This is in addition to the heteroskedasticity that might exist, say, between rich and poor countries. Error variance dependence is much less damaging than mean dependence because it usually causes no biases in the coefficients themselves, though it does affect their standard errors. The general consensus is that it has to be severe before it becomes a concern. When the variables being used are reasonably well distributed or properly transformed (Chapter 3) heteroskedasticity on its own is unlikely to have a major effect on results.

These various kinds of mean and variance dependence are summarized in Table 7.1, along with some of the tools that are commonly used to address them.

Table 7.1 Error Sphericity Violations and Associated Tools for Addressing Them

| Sphericity violation | Dependence structure | Associated tools |
|---|---|---|
| Mean dependence | Unit heterogeneity across time units | Multilevel modeling with fixed effects for time |
| Mean dependence | Unit heterogeneity across countries | Multilevel modeling with fixed effects for country |
| Mean dependence | Dynamic dependence for adjacent time units | AR(1) error terms and/or lagged dependent variables |
| Mean dependence | Dynamic dependence for adjacent countries | Panel corrected standard errors (PCSE) |
| Variance dependence (heteroskedasticity) | Differences in variance across time units | Generally not corrected; assumed not to exist |
| Variance dependence (heteroskedasticity) | Differences in variance across countries | Panel corrected standard errors (PCSE) |


Correcting for Mean Dependence

Straightforward examples of unit heterogeneity in which the mean value of a dependent variable changes over time or across countries are easily addressed using dummy variables for countries and time points in a fixed effects regression model. Figure 7.2 illustrates the need for such adjustments using the underlying data from Figure 7.1. Figure 7.2 plots infant mortality against national income per capita for 77 countries at 10 five-year time increments, for a total of 770 observations. Regression lines have been plotted for the relationship between national income and infant mortality in 1960 versus 2005 and for the 10 observations of the United States over time versus the 10 observations for Sweden. The dashed line represents the OLS regression line for all 770 pooled observations. In the absence of unit heterogeneity, the lines for 1960, 2005, the United States, and Sweden should all (roughly) coincide with the overall pooled regression line. Clearly they do not.

[Figure 7.2 Relationship Between National Income and Infant Mortality. Vertical axis: infant mortality per 1,000 (log scale). Horizontal axis: GDP per capita (log scale). Regression lines are labeled United States, Sweden, Year 1960, and Year 2005.]

A common correction for unit heterogeneity is the use of country fixed effects. Country fixed effects effectively center the data on a country-wise basis, eliminating cross-national differences in the levels of variables while retaining their over-time variability. Figure 7.3 plots country-centered infant mortality rates versus country-centered national income levels for the 770 observations from Figure 7.2. As in Figure 7.2, illustrative regression lines for 1960, 2005, U.S., and Swedish data are depicted. Adjusting for the fixed effects of country has removed essentially all of the country mean dependence, with the U.S. and Swedish regression lines falling very close to the pooled regression line and crossing it at the origin. It is also now very clear that mean dependence by year exists as well, with the 1960 and 2005 regression lines falling almost parallel, separated by a fixed distance between them. This could be corrected through the inclusion of period fixed effects alongside the country fixed effects.
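The claim that country fixed effects amount to country-wise centering can be verified numerically: the slope from a regression on country-centered data equals the slope from a regression that includes one dummy variable per country. The sketch below uses simulated data with hypothetical parameter values (a true within-country slope of -0.5 and country intercepts that are deliberately correlated with the regressor); it is not a reanalysis of the data behind Figures 7.2 and 7.3.

```python
import numpy as np

rng = np.random.default_rng(1)
n, t = 20, 10
country = np.repeat(np.arange(n), t)                 # unit index for each row
alpha = rng.normal(scale=2.0, size=n)                # country-specific intercepts
x = rng.normal(size=n * t) + 0.5 * alpha[country]    # regressor correlated with the intercepts
y = alpha[country] - 0.5 * x + rng.normal(scale=0.1, size=n * t)

def group_center(v, groups):
    """Subtract each group's mean from that group's observations."""
    means = np.bincount(groups, weights=v) / np.bincount(groups)
    return v - means[groups]

# (a) Within estimator: OLS slope on country-centered data (no intercept needed).
xc, yc = group_center(x, country), group_center(y, country)
slope_within = (xc @ yc) / (xc @ xc)

# (b) Dummy-variable estimator: x plus one indicator column per country.
dummies = (country[:, None] == np.arange(n)[None, :]).astype(float)
X_fe = np.column_stack([x, dummies])
slope_dummies = np.linalg.lstsq(X_fe, y, rcond=None)[0][0]

# (c) Pooled OLS that ignores the country intercepts, for comparison.
X_pooled = np.column_stack([np.ones(n * t), x])
slope_pooled = np.linalg.lstsq(X_pooled, y, rcond=None)[0][1]

print("within:", round(slope_within, 3))
print("dummies:", round(slope_dummies, 3))
print("pooled:", round(slope_pooled, 3))
```

Estimators (a) and (b) agree up to floating-point error, while the pooled slope in this simulation is pulled away from the within-country slope because the intercepts were constructed to covary with x. Period fixed effects could be added in the same way, by centering on year as well or by adding year dummies.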

The second form of mean dependence, dynamic dependence, is much more subtle than unit heterogeneity. Dynamic dependence occurs when regression errors are correlated in a daisy-chained manner instead of being correlated for fixed groupings of observations. The two most obvious forms of dynamic dependence are temporal autocorrelation and spatial autocorrelation. In temporal autocorrelation, the error for each time period depends on the error for the previous one; in spatial autocorrelation, the error for each country depends on the errors of its neighbors. In the simplest scenario, these autocorrelated errors are Markovian, meaning that the error for each observation depends only on that of the observation immediately preceding it. In more-complex scenarios, the errors may be structured in multistep patterns, as with business cycles that span multiple years.

[Figure 7.3 Relationship Between National Income and Residual Infant Mortality After Adjusting for Country Fixed Effects. Vertical axis: infant mortality per 1,000 (log scale), country-centered, −0.8 to 0.8. Horizontal axis: GDP per capita (log scale), country-centered, −0.8 to 0.8. Regression lines are labeled United States, Sweden, Year 1960, and Year 2005.]


Straightforward temporal error autocorrelation that is both Markovian and linear can be addressed in one of three ways: through the use of a lagged dependent variable, through the use of differenced variables, or through the use of explicit autoregressive error modeling. The first two approaches can be implemented using OLS, while the last requires the use of generalized least squares (GLS) estimation techniques. More-complicated forms of error autocorrelation also exist that require the use of GLS techniques from econometric time series analysis that are rarely used in QMCR.

Beck and Katz (1996) strongly recommend the use of OLS regression with a lagged dependent variable to address error autocorrelation. In this approach, the dependent variable measured at period t is regressed on the other regressors (including any fixed effects) plus the value of the dependent variable for the same case for the period t–1. This is similar to the lagged dependent variable approach to establishing nonspuriousness discussed in Chapter 6, but in repeated measures settings the other independent variables are measured contemporaneously with the dependent variable, not with its earlier manifestation. That is to say, the form of the model is,

$$Y_t = B_0 + B_1 X_{1t} + B_2 X_{2t} + \dots + B_i X_{it} + B_{i+1} Y_{t-\mathrm{lag}} + \mathrm{Error},$$

where i represents the series of independent variables and t represents the period. The lagged dependent variable serves a different purpose in repeated measures designs than it does in cross-sectional designs. It is not there to eliminate common-cause variables but to eliminate the autocorrelation of errors among the repeated observations of the same country. Keele and Kelly (2006) caution, however, that the inclusion of a lagged dependent variable does not always eliminate error autocorrelation, and that explicit testing for residual error autocorrelation is necessary, while Maddala (1998) warns researchers off the use of lagged independent variables entirely. Finally, Keele and Kelly (2006) warn that lagged dependent variables should never be used with cyclical data, since lagged dependent variables fail to capture any non-Markovian dependence.
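In practice the only mechanical difficulty with this model is building the lag within each country, so that one country's last observation never serves as the lag for the next country's first. A minimal sketch follows, assuming a balanced panel stored as country-by-period arrays and a lag of one period; the data are simulated with hypothetical coefficients (0.5 on X and 0.6 on the lag), and the final step is a crude residual autocorrelation check of the kind the text says is still necessary.

```python
import numpy as np

def lagged_dv_design(y, x):
    """Build regression arrays for Y_t on X_t and Y_{t-1}.

    y, x: arrays of shape (n_units, n_periods). The first period of each unit is
    dropped because it has no lagged value, so lags never cross unit boundaries.
    """
    y_t = y[:, 1:].ravel()
    x_t = x[:, 1:].ravel()
    y_lag = y[:, :-1].ravel()
    X = np.column_stack([np.ones(y_t.size), x_t, y_lag])
    return X, y_t

rng = np.random.default_rng(2)
n, t = 50, 30
x = rng.normal(size=(n, t))
y = np.zeros((n, t))
for period in range(1, t):
    y[:, period] = 0.2 + 0.5 * x[:, period] + 0.6 * y[:, period - 1] + rng.normal(size=n)

X, y_t = lagged_dv_design(y, x)
beta = np.linalg.lstsq(X, y_t, rcond=None)[0]   # [intercept, B for X_t, B for Y_{t-1}]

# Residual check: correlation between each residual and the previous one, within unit.
resid = (y_t - X @ beta).reshape(n, t - 1)
resid_autocorr = np.corrcoef(resid[:, 1:].ravel(), resid[:, :-1].ravel())[0, 1]

print("coefficients:", beta.round(3))
print("lag-1 residual autocorrelation:", round(resid_autocorr, 3))
```

Each country loses its first observation, so the nominal sample falls from N x T to N x (T - 1). This simulation has no country fixed effects; combining them with a lagged dependent variable raises further issues that the sketch does not address.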

Regression based on differenced variables (change scores) should in principle remove error autocorrelation as well, but change scores are rarely used for this purpose outside economics (see Stuckler 2008 for a rare example). The change score approach to the analysis of repeated measures data is identical in logic to the difference model approach to controlling for time-invariant common-cause variables, with the distinction that in the change score approach differences are calculated for multiple time periods. Thus, instead of regressing a one-time change in the dependent variable on a one-time change in the independent variables, the dependent variable change in every period is regressed on the independent variables change in that period. As with the difference model approach to causality, one suspects that these models are rarely used because of their low statistical power. Difference and change score designs are highly robust, but due to low power they rarely produce significant results.
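A change score regression can be sketched in a few lines: difference both variables within each country, stack the period-to-period changes, and run OLS on the changes. The simulated data below are an assumption for illustration (a true slope of -0.4 and large time-invariant country effects); the point is only that anything constant within a country drops out of the differences.

```python
import numpy as np

rng = np.random.default_rng(3)
n, t = 30, 8
alpha = rng.normal(scale=3.0, size=(n, 1))               # time-invariant country effects
x = rng.normal(size=(n, t)).cumsum(axis=1) + alpha       # trending regressor, shifted by country
y = 2.0 * alpha - 0.4 * x + rng.normal(scale=0.2, size=(n, t))

# Period-to-period changes within each country; alpha cancels out of both series.
dy = np.diff(y, axis=1).ravel()
dx = np.diff(x, axis=1).ravel()

X = np.column_stack([np.ones(dx.size), dx])
intercept, slope = np.linalg.lstsq(X, dy, rcond=None)[0]

print("observations in levels:", n * t, "in changes:", dy.size)
print("change score slope:", round(slope, 3))
```

The loss of one observation per country, together with the fact that differencing discards all between-country variation, is one way to see why such designs have comparatively low statistical power.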

Explicit autoregressive modeling is also used, especially in the political science literature. In this approach, the dynamic dependence structures in the regression error are explicitly modeled instead of being adjusted away through creative model design. The most commonly used autoregressive error model is the AR(1), or 1-period autoregressive model. Like the lagged dependent variable and change score approaches, the AR(1) model assumes that the error dependence structure is Markovian. Multi-period autoregressive models, however, can overcome this assumption. For example, in the AR(2) model each year’s error term is assumed to depend on the prior 2 years’ errors. The AR(2) model is very flexible, and can accommodate simple business cycles. Econometrics textbooks provide extensive material on autoregressive error modeling.
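One simple way to see what an AR(1) error model does is a two-step feasible GLS procedure in the style of Cochrane-Orcutt: estimate the autoregressive parameter from pooled OLS residuals, then re-estimate the regression on quasi-differenced data. This is a swapped-in textbook technique chosen for brevity, not necessarily the estimator used in any study cited here, and the sketch assumes a single autoregressive parameter shared by all countries. The data and parameter values are simulated.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients for y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(4)
n, t, rho_true = 40, 25, 0.7
x = rng.normal(size=(n, t))
e = np.zeros((n, t))
e[:, 0] = rng.normal(size=n)
for period in range(1, t):
    # Markovian errors: each period's error depends only on the previous period's.
    e[:, period] = rho_true * e[:, period - 1] + rng.normal(size=n)
y = 1.0 - 0.5 * x + e

# Step 1: pooled OLS and its residuals, arranged country by period.
X = np.column_stack([np.ones(n * t), x.ravel()])
beta_ols = ols(X, y.ravel())
resid = (y.ravel() - X @ beta_ols).reshape(n, t)

# Step 2: estimate rho by regressing residuals on their own within-country lag.
rho_hat = (resid[:, 1:] * resid[:, :-1]).sum() / (resid[:, :-1] ** 2).sum()

# Step 3: quasi-difference y and x, then re-run OLS. The constant column becomes
# (1 - rho_hat) so that the first coefficient still estimates the original intercept.
y_star = (y[:, 1:] - rho_hat * y[:, :-1]).ravel()
x_star = (x[:, 1:] - rho_hat * x[:, :-1]).ravel()
X_star = np.column_stack([np.full(y_star.size, 1.0 - rho_hat), x_star])
beta_fgls = ols(X_star, y_star)

print("estimated rho:", round(rho_hat, 3))
print("OLS slope:", round(beta_ols[1], 3), "FGLS slope:", round(beta_fgls[1], 3))
```

An AR(2) version would regress each residual on its two previous values and quasi-difference with both estimated parameters.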

None of these methods addresses spatial autocorrelation: the correlation of errors for neighboring countries. The panel corrected standard errors (PCSE) technique developed by Beck and Katz (1995) to deal with differences in error variance across countries has the fortuitous side effect that it also adjusts for any pattern of mean dependence that is static across time periods. As a result, it corrects for the spatial autocorrelation of errors, as long as the patterns of spatial autocorrelation are stable over time. Beck et al. (2006), however, argue that instead of being adjusted away, spatial effects should be explicitly modeled. They take an expansive view of the spatial, including, in spatial modeling, not just geographical proximity but social and economic proximity as well (e.g., connection through trade networks). The spatial econometrics techniques they describe are not yet widely used in QMCR but may be in the future.

There is no firm consensus in the methodological literature around which of these techniques is superior, nor around which should be used in what particular situations. Table 7.1, however, does present some general guidance. Unit heterogeneity can be addressed by fixed effects for country (and, if necessary, for time), while dynamic dependence can be addressed using AR(1) error models. As a result, many studies use fixed effects in combination with AR(1) error models. Other studies instead use lagged dependent variables to address both unit heterogeneity and dynamic dependence at the same time. Nearly all TSCS studies use PCSE adjustments. It is not clear, however, that these fixes solve the underlying problems. For example, Babones (2009c) demonstrates extensive residual unit heterogeneity in models that implement both fixed effects and AR(1) error models. The problem is that dependence structures tend to be much more complicated than these straightforward fixes assume.

Correcting for Variance Dependence

Like mean dependence, variance dependence can be structured either along country or time dimensions—or both. The main danger in practice, though, is that error variances may be distinctive of particular countries. If each country has its own error variance, which is different from that of other countries, the repeated observations of each country will form a cluster. When T is much greater than N, such country-wise clustering can become the dominant feature of the error variance.

For many years, the standard treatment for country-wise heteroskedasticity was a method attributed to Parks (1967) called feasible generalized least squares (FGLS). Beck and Katz (1995), however, famously showed that in typical research settings the FGLS method deflated the estimated standard errors of coefficients by anywhere from 30% to 75%. Since the publication of Beck and Katz (1995), most TSCS studies have used the alternative method promoted by Beck and Katz, OLS regression with PCSE adjustments.


The PCSE technique is based on an assumption that any patterns of error dependence that exist in the data are the same for every time point. This allows the pooling of the errors from all time points to estimate the true variance-covariance matrix of all the data points and, ultimately, correct standard errors for the regression coefficients.
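The pooling logic can be made concrete. In the form in which the estimator is commonly presented, the OLS residuals are arranged as a country-by-period array, the country-by-country error covariance is estimated by averaging over the T periods, and that matrix is expanded to all N x T observations and inserted into a sandwich formula around the OLS estimate. The sketch below follows that presentation for a balanced panel; it should be checked against Beck and Katz (1995) before serious use, and the function name `ols_with_pcse` and the simulated data (with a common shock in each period) are assumptions.

```python
import numpy as np

def ols_with_pcse(X, y, n_units, n_periods):
    """OLS coefficients with conventional and panel corrected standard errors.

    Assumes a balanced panel whose rows are stacked unit by unit, with each
    unit's rows in time order.
    """
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = (y - X @ beta).reshape(n_units, n_periods)
    sigma = resid @ resid.T / n_periods             # N x N covariance, pooled over periods
    omega = np.kron(sigma, np.eye(n_periods))       # same covariance pattern at every period
    cov_pcse = XtX_inv @ X.T @ omega @ X @ XtX_inv  # sandwich estimator
    dof = X.shape[0] - X.shape[1]
    cov_ols = XtX_inv * (resid ** 2).sum() / dof    # conventional OLS covariance
    return beta, np.sqrt(np.diag(cov_ols)), np.sqrt(np.diag(cov_pcse))

rng = np.random.default_rng(5)
n, t = 15, 30
period_x = rng.normal(size=(1, t))                  # shock to x shared by all countries
period_e = rng.normal(size=(1, t))                  # shock to the error shared by all countries
x = period_x + 0.5 * rng.normal(size=(n, t))
e = period_e + 0.5 * rng.normal(size=(n, t))
y = (1.0 - 0.5 * x + e).ravel()
X = np.column_stack([np.ones(n * t), x.ravel()])

beta, se_ols, se_pcse = ols_with_pcse(X, y, n, t)
print("slope:", round(beta[1], 3))
print("conventional SE:", round(se_ols[1], 3), "PCSE:", round(se_pcse[1], 3))
```

In this simulation the errors of different countries move together within each period, so the conventional standard error is too small and the panel corrected one is larger. The coefficients themselves are unchanged, since the technique adjusts only the standard errors.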

Since it relies on pooling over T, the PCSE technique becomes more efficient as the number of time points rises. Beck and Katz (1996) show that it can be used effectively even when the number of time points is “small”—in their examples, as few as five. With such few repeated observations, however, the problems that the PCSE technique is designed to correct are usually not serious, and PCSE adjustments often produce no change in the interpretation of results. In practice, PCSE adjustments are usually made only in TSCS settings in which T>N and are rarely used in MLM settings in which N>T.

One great advantage of the PCSE technique is that it not only adjusts for country-wise variance dependence, but also for country-wise mean dependence (as long as both are constant over time and thus poolable). As discussed above, this includes all kinds of neighbor effects, which are surely ubiquitous in QMCR data. The great shortcoming of the PCSE technique is that it does nothing to correct for time-wise mean and variance dependence. Moreover, as Beck and Katz caution throughout their work, the PCSE technique can only be applied to data that do not exhibit error autocorrelation over time, since it is based on the assumption that the collection of data for each time period represents independent realizations of a single underlying error process. In their review of the political science literature citing the work of Beck and Katz, Wilson and Butler (2007) find that most authors do not even consider the possibility that this assumption may not be met. Though crediting Beck and Katz with important contributions, they argue that “the problems researchers tend to ignore are far more serious than the problems corrected with the PCSEs” (Wilson and Butler 2007:110).

One such problem (not even considered by Wilson and Butler) is time-wise variance dependence. This is illustrated in Figure 7.4, which plots the residuals from the regression of infant mortality on national income, controlling for year and country fixed effects. The error variance is visibly lowest in the middle years and highest at the beginning and end of the time period. The temporal heteroskedasticity displayed in Figure 7.4 becomes less systematic, but does not disappear entirely, when a lagged dependent variable is included in the model. With the inclusion of the lagged dependent variable, however, the coefficient for national income in the regression model is reduced to near zero. In other words, once most (but not all) of the potential violations of error sphericity have been addressed, there is no relationship left to study, despite the fact that the cross-sectional correlation between national income and infant mortality is on the order of r = 0.9.

TIME SERIES CROSS-SECTIONAL MODELS

Model designs that use repeated observations of QMCR data to focus on the cross-sectional aspects of data observed at multiple points in time are called time-series cross-sectional (TSCS) models. In a typical TSCS design, a regression model is estimated using data from a small number of countries for which data are available at annual increments for 20–40 years. The background variables often come from OECD databases, which usually include such long time series for only 15 or so of the 20 historical OECD member countries. As a result, T is usually 1–3 times N in TSCS models. The most common TSCS approach is to use OLS estimation with lagged dependent variables and PCSEs. The use of the PCSE technique is now near universal, and Beck and Katz (2011) persuasively advocate the inclusion of lagged dependent variables to control for dynamic dependence structures. Green et al. (2001) strongly recommend that fixed effects should (almost) always be included as well, but Beck and Katz (2001) argue that Green et al. overstate their case, further developing their argument in Beck and Katz (2011). There is no consensus yet, but as argued above, the inclusion of fixed effects does reduce unit heterogeneity at the cost of reduced statistical power.

In TSCS models, systematic variation over time has historically been considered an incidental factor or a nuisance to be adjusted for and removed, though Beck (2007) and DeBoef and Keele (2008) advocate more-considered approaches to intertemporal variation. Most TSCS models focus on how independent variables are related to the dependent variable in each cross section, either contemporaneously or with well-defined lags. Since each time-wise cross-section is construed as an independent realization of an underlying data-generating process, a premium is placed on having as many cross-sections as possible. This usually means working with annual data.

Figure 7.4 Residual Infant Mortality After Controlling for National Income and Fixed Effects for Year and Country

[Plot not reproduced. Horizontal axis: year, 1955 to 2010. Vertical axis: residual infant mortality, −0.40 to 0.60.]


Since the TSCS literature is effectively a specialist literature applying econometric time series techniques to questions from political science, this section includes only a brief summary and critique of TSCS modeling. A full tour of the main issues in the TSCS literature can be found in Beck (2001a, 2001b). More recent methodological developments can be found in Beck et al. (2006), on accounting for space, and in DeBoef and Keele (2008), on the dynamic treatment of time. Studies that use TSCS designs are found mainly in the comparative political science literature, where they are used to study outcomes as diverse as ethnic conflict (Saideman et al. 2002), homicide rates (Jacobs and Richardson 2008), the provision of childcare (Bonoli and Reber 2010), and of course party politics (Hellstrom 2008). By one reckoning, they have recently accounted for over 5% of all the articles published in political science journals (Beck 2007:98, reproducing data from Adolph et al. 2005).

A major reason for the flourishing of TSCS designs is that they generate enough observations (and thus degrees of freedom) to make possible the estimation of models with multiple independent variables. The leading TSCS methodologist explains it this way:

TSCS data became popular, particularly in political economy, because the initial complicated regressions on 15 or 20 observations were bound to be uninformative. These regressions were very sensitive to inclusion or exclusion of one particular country, or other seemingly arbitrary choices. Political economy scholars naturally gravitated toward TSCS designs that seemed to make it possible to move from only 15 or 20 data points to 20 or 30 times more than that. (Beck 2007:97)

Pooling observations in this way creates many challenges for (proper) estimation. The most obvious sphericity violations have been discussed above. There are, however, deeper challenges associated with the interpretation of results. For example, TSCS results can safely be interpreted cross-sectionally only if the data (both dependent and independent variables) are stationary over time. Stationarity implies that while the actual value of a variable may bounce up and down over time, its expected value remains constant. For example, gross domestic product (GDP) per capita is not stationary (it tends to grow over time) but GDP per capita growth is (it rises and falls but neither rises nor falls very far from the long-term global average of about 2% per year). If variables are not stationary, the estimated coefficients of TSCS models might capture trends over time rather than truly cross-sectional effects.
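The GDP example can be made concrete with a small simulation. The sketch below is only an illustration on made-up numbers: the series, the 2% mean growth rate, and the noise level are assumptions chosen to mirror the text, not data from any real country.

```python
import numpy as np

rng = np.random.default_rng(1)
years = 60
# Hypothetical annual growth rates: about 2% per year with random fluctuation
growth = rng.normal(loc=0.02, scale=0.02, size=years)
# GDP per capita level implied by compounding those growth rates
gdp = np.exp(np.log(10_000) + np.cumsum(growth))

half = years // 2
# Ratio of average GDP level in the second half to the first half
level_shift = gdp[half:].mean() / gdp[:half].mean()
# Difference in average growth rate between the two halves
growth_shift = growth[half:].mean() - growth[:half].mean()
```

In this setup the average level of GDP per capita differs substantially between the two halves of the period, while the average growth rate barely moves. That is the practical meaning of saying that the level is not stationary but the growth rate is.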

Further challenges ensue when rapidly fluctuating (but stationary) economic variables are included in TSCS models alongside slow-changing policy and institutional variables (Kittel 2008:36–37). This is because the rapidly fluctuating variables may be cointegrated with each other but not with the slow-changing variables. Cointegration is a variant of the common-cause variable problem, with the common cause being time (or a function of time). Cointegrated variables move together over time, but may not be causally related to each other. For example, food and fuel prices move together over time due to the influence of common-cause variables. The cointegration of some variables but not others can further obscure the true cross-sectional effects in TSCS models. Tests and corrections for nonstationarity and cointegration are well developed in the econometrics literature and have come to be used in the TSCS literature in political science.
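The common-cause logic can be illustrated with two simulated series that share nothing except a time trend. This is a deliberately simplified sketch: it uses a deterministic linear trend rather than a formal cointegration model or test, and the "food" and "fuel" series are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.arange(200)
trend = 0.1 * t                           # common cause: a shared time trend
food = trend + rng.normal(size=t.size)    # hypothetical price index
fuel = trend + rng.normal(size=t.size)    # hypothetical price index

def detrend(series):
    """Residuals from a linear regression of the series on time."""
    slope, intercept = np.polyfit(t, series, 1)
    return series - (intercept + slope * t)

# Correlation of the raw series versus the series with the trend removed
corr_levels = np.corrcoef(food, fuel)[0, 1]
corr_detrended = np.corrcoef(detrend(food), detrend(fuel))[0, 1]
```

The two series are strongly correlated in levels even though neither influences the other. Once the shared trend is removed, the correlation largely vanishes, which is why co-movement over time is weak evidence of a causal relationship.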


In fact, over the past 20 years the TSCS literature has become extraordinarily technical as advances in time series econometrics have percolated into the political science literature. The fact that a method requires advanced technical knowledge is not in itself problematic. That said, few social scientists outside economics have the proper mathematical background and statistical training to apply TSCS techniques in an informed manner. Many econometricians question whether the TSCS literature in political science really makes any sense. At a time when the field was notably less technical, the eminent econometrician G.S. Maddala wrote,

Political methodology has been quick in adopting econometric methods, rather too uncritically. But econometric methodology has far outstripped empirical applications and has acquired a life of its own. Much of what goes on in econometric methods is practically useless. . . . I do not think the uncritical adoption of econometric methods in political methodology is a good development. (Maddala 1998:81–82)

Along these lines, in a (highly technical) review of the TSCS literature, political scientists DeBoef and Keele (2008:184) catalog a litany of what they characterize as “weak connections between theory and tests, biased estimates, and incorrect inferences.” They roundly critique existing practice in the field. Nonetheless, they conclude their paper with an exhortation that

analysts should extract all the information available to them in the model. We should report and interpret error correction rates, long-run multipliers, lag distributions, and mean and median lags. . . . Taken together, careful specification and interpretation of dynamic linear models will enable us to have confidence in the estimates from our models and to draw more complete interpretations from them so that we will better understand change in the world around us. (DeBoef and Keele 2008:199)

No doubt they are right. Still, one suspects that for all their methodological sophistication TSCS models have not substantially improved our understanding of change in the world around us. An alternative view of the application of econometric techniques to repeated measures data was laid out by Robert H. Durr, who sadly did not live to see how the TSCS literature would develop. He predicted that “we will produce a method (1) dominated by questions of measurement and measurement strategy (at the expense of theory) and (2) inaccessible to (and therefore ignored by) most analysts” (Durr 1992:255).

Considering that every 6 or 7 years a replication study is published that invalidates nearly all previous work in the field (e.g., Beck and Katz 1995; DeBoef and Keele 2008; Wawro 2002), it is perhaps time to accept that TSCS models may not be very useful for answering the kinds of questions asked by social scientists. The data exist, and the models exist, but that doesn’t necessarily mean that the models can effectively be applied to the data. On the other hand, the methodological literature around TSCS modeling seems finally to be settling down to the normal science of consensus methods. Since an additional year of cross-sectional data is always minted every year, practice may eventually catch up.


MULTILEVEL MODELS

A generation ago, practitioners of QMCR rarely made use of the over-time variability in their data. Even when repeated observations of data were available for a fixed panel of countries, researchers focused instead on the cross-sectional properties of the panel, using the longitudinal properties of the repeated observations either to investigate long-term lag structures or to eliminate common-cause explanations. Since a long-term lagged dependent variable model (Chapter 6) accomplished both of these aims, these models became very popular. At the time, critics Firebaugh and Beck (1994) bemoaned the fact that these lagged dependent variable models “are so common . . . that practitioners refer to them as ‘panel analyses.’ However, the term used in that way is potentially misleading, because it suggests that there is only one way to analyze panel data” (Firebaugh and Beck 1994:638).

Today, the same might be said for MLMs, also called HLMs or, now, panel models. These models have become so ubiquitous in QMCR over the past 10 years that there often exists an a priori assumption that they should be used in all cases where it is possible to do so. The most common type of MLM design is the fixed effects model (FEM), which is mathematically identical to the TSCS design with fixed effects for countries. The difference between the TSCS design with fixed effects and the FEM is that in the FEM the fixed effects are not explicitly intended to address error sphericity violations. They are intended instead to shift the burden of explanatory power away from across-country factors toward over-time factors. This is because in MLM designs attention is focused on how changes in the independent variables over time are related to changes in the dependent variable over time. Cross-sectional factors are explicitly factored out through the use of fixed effects.

Because FEMs factor out all of the cross-national correlation between the dependent and independent variables, researchers who use them implicitly assume that their data vary meaningfully over time—specifically, over the time units being studied. Thus, it only makes sense to use MLMs when theory suggests that variables should covary over time within countries. This raises problems when independent variables of interest either don’t change or change only slowly over time. As Kenworthy and Hicks (2008) put it,

Many hypotheses about determinants of change in macrocomparative analysis either implicitly or explicitly refer to relatively long-term effects. Yet most pooled regression analyses use the country-year as the unit. . . . More often than not, using annual data to examine medium-run or long-run associations will obscure rather than clarify. (Kenworthy and Hicks 2008:8–9)

In situations where independent variables of interest change slowly or not at all over a study period, many practitioners advocate the use of random effects models (REMs). The REM design is a form of MLM in which, in effect, the explanatory power of the independent variables is shared out between over-time and across-country components. While this might seem to get around Kenworthy and Hicks’s criticisms, it does so at the cost of model misspecification, as argued below. The root problem is that most QMCR theories are not concerned with rapid year-on-year changes, yet at a very fundamental level MLMs are not well suited for the study of long-term or slow-moving social change.


Nonetheless, this section lays out the logic behind MLMs, using the FEM as a template. The FEM can be represented either as a multiple regression model or in explicitly multi-level terms, so it forms a convenient bridge between the familiar world of OLS regression and the more esoteric world of generalized linear models. The main competitor to the FEM, the REM, can be represented only in multilevel terms and cannot be estimated using least squares. This section focuses on explaining FEMs and REMs; the shortcomings of both are discussed in the next section, along with some suggested solutions for how to better take advantage of repeated measures data. The main conclusion here is that MLMs are almost never appropriate for use in QMCR, but the fact that they are so common in practice calls out for detailed explanation and criticism.

The Fixed Effects Model

Multilevel models are called multilevel or hierarchical because they separate regression models estimated on nested data structures into two hierarchical levels: an individual level (Level 1) and a group level (Level 2). In QMCR, the individual country-year observations are usually construed as Level 1 and the groups of observations over time for each country are construed as Level 2. This is illustrated for a generic FEM in Figure 7.5. Multilevel representations are generally not used for FEMs, but the representation is used here to establish continuity with the REM representations to follow.

Figure 7.5 The Multilevel Form for a Regression Model With Fixed Country Effects

Single-Equation Representation
$y_{it} = a + BX_{it} + CZ_i + u_{it}$

Multilevel or Hierarchical Representation
Level 1 (individual) model: $y_{it} = a + BX_{it} + \lambda_i + u_{it}$
Level 2 (group) model: $\lambda_i = CZ_i$

where:
$y_{it}$ is the predicted value of the dependent variable for country i at time t
$a$ is the regression constant
$B$ is the vector of slope coefficients for the independent variables
$X_{it}$ is the vector of observed values of the independent variables for country i at time t
$C$ is the vector of slope coefficients for the country dummy variables
$Z_i$ is the vector of observed (0/1) values of the country dummies for each country i (these don’t vary over t)
$u_{it}$ is the random regression error associated with each observation (country i at time t)
$\lambda_i$ is the effect associated with country i; since $Z_i$ is coded 0/1, it is a vector of one constant for each country

The data on which a model like that depicted in Figure 7.5 would be estimated would be organized into a vertical database (Figure 4.3) with one row for every country-year observation and one column for each variable, including a series of dummy (0/1) indicator variables denoting whether or not an observation belongs to each particular country. One country would be excluded from the analysis to be used as a reference category against which to compare the effects of the others. For analyses that include the United States, the United States is most often used as the reference category.
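The vertical layout with country dummies can be sketched as follows. The helper name `add_country_dummies`, the column names, and the data values are all hypothetical; the point is only to show one row per country-year and one 0/1 column for every country except the reference.

```python
def add_country_dummies(rows, reference):
    """Return copies of rows with a 0/1 dummy for every non-reference country."""
    countries = sorted({r["country"] for r in rows} - {reference})
    coded_rows = []
    for r in rows:
        coded = dict(r)
        for c in countries:
            # 1 if this country-year belongs to country c, otherwise 0
            coded[f"d_{c}"] = 1 if r["country"] == c else 0
        coded_rows.append(coded)
    return coded_rows

# Made-up vertical database: one row per country-year
rows = [
    {"country": "USA", "year": 2000, "y": 7.1, "x": 10.4},
    {"country": "USA", "year": 2001, "y": 6.9, "x": 10.5},
    {"country": "FRA", "year": 2000, "y": 4.5, "x": 10.2},
    {"country": "FRA", "year": 2001, "y": 4.4, "x": 10.3},
    {"country": "JPN", "year": 2000, "y": 3.3, "x": 10.3},
    {"country": "JPN", "year": 2001, "y": 3.1, "x": 10.3},
]
coded = add_country_dummies(rows, reference="USA")
```

Rows for the reference country carry zeros on every dummy, so the regression constant becomes the reference country's intercept and each dummy coefficient measures another country's departure from it.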

In the regression model representation depicted on the left side of Figure 7.5, the values of the dependent variable ($y_{it}$) are modeled for every country (i)–year (t) pair. The model includes a constant ($a$), a horizontal vector of slope coefficients corresponding to the independent variables ($B$), and a horizontal vector of slope coefficients corresponding to the country fixed effects ($C$). The data include a horizontal vector of data for each independent variable observed for each country at each point in time ($X_{it}$) and a horizontal vector of the country dummy variables that doesn’t change over time ($Z_i$). The regression error ($u_{it}$) is unique for each country at each time point. The usual four regression error assumptions (Chapter 5) apply, meaning that the error is a random variable that has the same distribution regardless of what country or year a case is associated with.

The associated FEM representation depicted on the right side of Figure 7.5 is the direct analog of the regression model depicted on the left side. A placeholder $\lambda$ (lambda) has been used in the Level 1 model to represent the country fixed effects for each country i. They are called fixed effects because on Level 2 it has been made explicit that $\lambda$ has no random component: the Level 2 model does not include an error term because there is no country-specific error in the model. The fixed effect for each country is equal to a coefficient ($c_i$, which is an element of $C$) times the 0/1 dummy variable ($Z_i$) that indicates whether or not a case is associated with that country. As a result, the individual country fixed effects $\lambda_i$ are coded 0/$c_i$ (coded $c_i$ for cases from country i and zero for cases from other countries). Since the country fixed effects are either zero or a constant, they are often represented as part of the regression intercept, $y_{it} = (a + \lambda_i) + BX_{it} + u_{it}$, with the interpretation that $a$ is the regression intercept for the reference country and $a + \lambda_i$ is the regression intercept for other countries.

The estimation of FEMs using country dummy variables in a linear regression framework is known as the least squares dummy variable (LSDV) approach. An alternative specification of the FEM can be arrived at by subtracting the country means from the values of all of the variables (dependent and independent) and then running a regression without the country dummy variables. This is often called the within-effect model. Hsiao (2003:30–32) prefers the within-effect model over the LSDV approach on computational grounds, since it requires the inversion of a much smaller data matrix. On the other hand, the within-effect model requires the user to make appropriate adjustments to the degrees of freedom of the model (to account for the fact that the country means have been subtracted off in a separate step). With modern computing there is little to be gained by the within-effect approach in QMCR applications; it makes sense only for use with repeated-measures survey data on individuals, where the number of fixed effects to be estimated could run into the thousands. Whether or not it is used for estimation, the within-effect approach does graphically illustrate the fact that FEMs rely entirely on the over-time correlation between the dependent and independent variables for their explanatory power. All of the cross-national correlation is wiped out by the centering of the dependent and independent variables.
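The equivalence of the LSDV and within-effect approaches, and the degrees-of-freedom adjustment that the latter requires, can be checked on simulated data. The sketch below assumes a balanced panel with a single independent variable; the data and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T = 6, 8
country = np.repeat(np.arange(N), T)           # country index for each row
alpha = rng.normal(scale=3.0, size=N)          # country-specific intercepts
x = rng.normal(size=N * T) + alpha[country]    # x is correlated with country effects
y = 2.0 + 0.7 * x + alpha[country] + rng.normal(scale=0.5, size=N * T)

# LSDV: constant, x, and N - 1 country dummies (country 0 is the reference)
dummies = (country[:, None] == np.arange(1, N)).astype(float)
X_lsdv = np.column_stack([np.ones(N * T), x, dummies])
b_lsdv = np.linalg.lstsq(X_lsdv, y, rcond=None)[0]

def demean(v):
    """Subtract each country's mean from its own observations."""
    means = np.bincount(country, weights=v) / T
    return v - means[country]

# Within-effect model: regress centered y on centered x, no dummies
x_w, y_w = demean(x), demean(y)
b_within = (x_w @ y_w) / (x_w @ x_w)

# Residual degrees of freedom: the centered regression "sees" only one
# estimated coefficient, but N country means were removed beforehand
df_naive = N * T - 1
df_correct = N * T - N - 1
```

The slope on `x` is the same under both approaches (`b_lsdv[1]` equals `b_within` up to rounding). Only the standard errors differ if the naive degrees of freedom are used, which is the adjustment the text warns about.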

The Random Effects Model

Unlike the FEM, the REM design does not remove quite all of the cross-national relationships between the independent and dependent variables. Following Hsiao (2003), many researchers thus view the REM as a kind of compromise between the FEM and pooled regression analyses. The reasoning is that while FEMs give country dummy variables their full observed effects and pooled regression analyses give them no effects at all (since country dummies aren’t included in the model), REMs give country dummies a weight that is somewhere between zero and their full (FEM) impact. This is arithmetically true but not necessarily substantively meaningful.

Substantively, REMs are best understood outside the QMCR context. A good starting point is classic experimental design. Imagine an experiment in which all the students in a school are assigned at random to one of several groups, with each group taught by a different teacher. The curriculums, classrooms, equipment, teaching times, and so on are identical across all of the groups; all that differs is the teacher. In a posttest, student performance is compared across groups. By construction, systematic differences in the expected values of the group means must be due to differences in teacher effectiveness. In finite samples, however, the FEM specification will attribute too much explanatory power to the group effects because chance differences in group composition will be (incorrectly) attributed to group effects. In the most extreme case, if each teacher had only one student, all of the between-student differences in performance would be attributed to their teachers. As a result, student performance differences due to student ability, student dedication, family income, and so on would all be underestimated.

The FEM design is thus too conservative from the standpoint of the Level 1 variables of interest, because the true variability in teacher effectiveness is less than its estimated variability. Unfortunately, it is not possible to observe true teacher effectiveness directly. What is needed is an MLM design that will attribute to the group level only the true (but unobserved) teacher effects, without overfitting. That design is the REM.

Of course, it is difficult to estimate the size of an unobserved effect. The guidance that it is something less than the full size of the fixed effects is of little help. The REM gets at the size of the unobserved teacher effects by pooling them across all of the experimental groups. The idea is that teachers in general exhibit a range of levels of effectiveness; instead of estimating the particular effectiveness of individual teachers, the REM estimates the standard deviation of the levels of teacher effectiveness in general. As usual in such situations, some assumptions have to be made. The REM requires two: “In the random-effects framework . . . there are two fundamental assumptions. One is that the unobserved [Level 2] effects are random draws from a common population. The other is that the [Level 1] explanatory variables are strictly exogenous” (Hsiao 2003:42).

The first assumption, that the unobserved effects are modeled as “random draws from a common population,” is another way of saying that they are assumed to be independent and identically distributed. In the experimental example, this means that the levels of effectiveness of individual teachers are independent of each other (the effectiveness of one teacher isn’t related to the effectiveness of another) and have an identical distribution of possible levels (presumably normal with mean zero; there are some good teachers and some bad teachers, with most falling somewhere in the middle). The second assumption is that any Level 1 attributes of the students (student ability, student dedication, family income, etc.) are independent of the group (teacher) effects. This is ensured by the random assignment of students to teachers. This second assumption is necessary because the REM attributes any group-wise clustering in the regression error entirely to the group effects. Were group effects related to the Level 1 attributes, some of the group-wise clustering of error would actually be due to clustering by individual student attributes.

In translating the REM from an experimental setting to a QMCR setting, consider an outcome like income inequality measured for 100 countries over 10 time periods. For students read “country-year observations,” and for teachers read “countries.” The idea would be that countries have some unobserved quality, perhaps a cultural egalitarianism, that helps determine their levels of income inequality. When there are only a few observations per country, the FEM with country group effects might give too much credit to this Level 2 cultural egalitarianism and too little to Level 1 determinants of inequality, such as investment, trade, and union density. With 10 observations per country, this is less of a problem. If we could get thousands of observations per country, the problem would disappear entirely, but of course this is not possible with QMCR data.

The basic layout of the REM applied to QMCR settings is summarized in Figure 7.6. Note that $\lambda_i$ is now a random variable, which is set equal to the (unobserved) random effect $v_i$ for each country’s cultural egalitarianism. The translation from the FEM specification is that the term $CZ_i + u_{it}$ (the country effects plus observation-specific error) has been replaced by the term $e_{it} = \lambda_i + u_{it}$ (a country random variable plus observation-specific error). The net effect is that while the FEM treats the country terms as a set of modifications to the regression intercept, the REM treats the country terms as a country-wise structuring of the regression error. If the REM assumptions hold, as the number of time points T becomes large the two $u_{it}$ terms converge. Since the REM requires the estimation of one coefficient (the variance of $\lambda$) to do the same job that in the FEM requires the estimation of N − 1 coefficients (one for each country other than the reference country), the REM is statistically much more efficient, if it can be used.
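The sense in which the REM sits between pooled regression and the FEM, and converges on the FEM as the number of time points grows, can be illustrated with a standard textbook result that is not stated in this chapter. For a balanced panel, the REM is equivalent to OLS on data from which only a fraction θ of each country mean has been subtracted, with θ = 1 − sqrt(σ²_u / (σ²_u + T·σ²_v)). The function name `re_weight` and its argument names are hypothetical.

```python
import math

def re_weight(var_u, var_v, n_periods):
    """Share of each country mean removed by the random effects estimator.

    var_u: variance of the observation-level error u_it
    var_v: variance of the country-level random effect v_i
    n_periods: number of time points per country (balanced panel assumed)
    """
    return 1.0 - math.sqrt(var_u / (var_u + n_periods * var_v))

theta_pooled = re_weight(1.0, 0.0, 10)      # no country-level variance at all
theta_short = re_weight(1.0, 1.0, 10)       # 10 observations per country
theta_long = re_weight(1.0, 1.0, 10_000)    # thousands of observations per country
```

A weight of zero reproduces pooled regression and a weight of one reproduces the FEM's full centering. With equal error variances, the weight rises toward one as the number of time points per country grows.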

From a substantive standpoint, this all seems quite reasonable. In QMCR settings, however, both of the REM assumptions are grossly violated. Attention usually focuses on violations of the second assumption: that group effects must be independent of the covariates. What this means in the inequality example is that the effect of a country’s (unobserved) cultural egalitarianism must be independent of the observed Level 1 independent variables in the regression model. This is extraordinarily unlikely in nearly all QMCR settings. After all, the whole point of controlling for country effects is to remove the cross-national dimension from the analyses; if there were no correlation between the country effects and the Level 1 variables, there would be no reason to use MLMs instead of pooled regression. As Bollen and Brand (2010) observe,

Researchers sometimes take false comfort in the use of the REM in that it does include a latent time-invariant variable . . . without realizing that biased coefficients might result if the observed covariates are associated with the latent time-invariant variable. (Bollen and Brand 2010:2)

Moreover, as Hausman (1978:1263) points out, any time a lagged dependent variable is included in the model (as it often is in QMCR settings) it is by definition correlated with any country effects that might exist.

Though rarely mentioned, however, the first REM assumption may be even more problematic in QMCR settings. It stipulates that the unobserved country effects must be independent of each other. While this may be unproblematic in survey research settings, it strains credulity to think that the unobserved differences among countries in terms of national attributes like cultural egalitarianism could be geographically and historically independent of each other. Again the REM reasoning falls apart.

Figure 7.6 The Multilevel Form for a Regression Model With Random Country Effects

Single-Equation Representation
$y_{it} = a + BX_{it} + e_{it}$

Multilevel or Hierarchical Representation
Level 1 (individual) model: $y_{it} = a + BX_{it} + \lambda_i + u_{it}$
Level 2 (group) model: $\lambda_i = v_i$

where:
$y_{it}$ is the predicted value of the dependent variable for country i at time t
$a$ is the regression constant
$B$ is the vector of slope coefficients for the independent variables
$X_{it}$ is the vector of observed values of the independent variables for country i at time t
$u_{it}$ is the random regression error associated with each observation (country i at time t)
$\lambda_i$ is the effect associated with country i
$e_{it}$ is the total random regression error due to both country and country-year errors ($e_{it} = \lambda_i + u_{it} = v_i + u_{it}$)
$v_i$ is the (unobserved) country i realization of a random effect variable with mean 0 and constant variance


Writing on REMs in the QMCR literature is riddled with errors. There are even some minor errors in the FEM versus REM discussion in Babones (2009c:102–103). Nonetheless, the methodologists’ verdict on REMs is clear. As Halaby puts it, “Without plausible theoretical grounds or empirical evidence for the random effects assumption, bias and consistency considerations alone would lead to a fixed effects model” (Halaby 2004:521). Simply put, it is difficult to imagine any scenario in QMCR in which it would be appropriate to use an REM design.

MAKING APPROPRIATE USE OF REPEATED MEASURES DATA

Many researchers (reinforced by journal editors and reviewers) take it for granted that since QMCR datasets include repeated measures over time it is only appropriate to analyze them using repeated measures techniques. They argue that not making use of the repeated measures is tantamount to throwing away data. Doubtless it makes sense to use all the data at our disposal, but it is not clear that the data are best used by treating every available observation as a distinct realization of a data-generating process of interest to us (that is, as a case). Other uses of the data may sometimes be more appropriate. For example, averaging the values of variables over multiple years (period averaging) is a way to make use of all of the available data to reduce measurement error. Cross-sectional regression based on period averages may not be as glamorous as random-effects multilevel modeling with lagged dependent variables and PCSEs, but under many circumstances it may be more effective. It all depends on what we want to know from our models.
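Period averaging is simple to carry out on a vertical database. The sketch below is one possible way to do it in plain Python; the function name `period_averages`, the column names, and the data values are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def period_averages(rows, start, end, variables):
    """Average each variable over the years start..end for every country."""
    by_country = defaultdict(list)
    for r in rows:
        if start <= r["year"] <= end:
            by_country[r["country"]].append(r)
    # One record per country: the mean of each variable within the period
    return {
        country: {v: mean(r[v] for r in obs) for v in variables}
        for country, obs in by_country.items()
    }

# Made-up country-year observations
rows = [
    {"country": "FRA", "year": 2001, "x": 1.0, "y": 10.0},
    {"country": "FRA", "year": 2002, "x": 2.0, "y": 20.0},
    {"country": "FRA", "year": 2003, "x": 3.0, "y": 30.0},
    {"country": "USA", "year": 2001, "x": 4.0, "y": 40.0},
    {"country": "USA", "year": 2002, "x": 6.0, "y": 60.0},
    {"country": "USA", "year": 2004, "x": 100.0, "y": 0.0},  # outside the period
]
averages = period_averages(rows, 2001, 2003, ["x", "y"])
```

The result has one record per country rather than one per country-year, so a cross-sectional regression on it uses every available observation in the period while still treating each country as a single case.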

The danger of trying too hard not to throw away data is that researchers will throw away meaning instead. The variety of possible cross-sectional analyses of QMCR datasets is severely limited by the small number of available cases: rarely more than 100 and never more than 200. In terms of advanced statistical modeling, there is only so much that can be done with so few cases. The use of repeated measures dramatically increases the number of cases (or, pace Kittel, observations) available for analysis, but it also dramatically increases the number of opportunities for mischief. As Beck and Katz put it, “Unfortunately, time-series cross-section data allows analysts to propose almost silly estimators, because the repeated observations allow such estimators to produce results that might appear meaningful at first glance” (Beck and Katz 2001:494).

The fact is that many repeated measures designs are inappropriate for answering the kind of questions that most QMCR practitioners want to ask. They are much better suited to studying economic time series. When it comes down to it, the basic question asked in repeated measures designs (do X and Y rise and fall together year by year, and if so, why?) is not a question that is typically asked by scholars working in the QMCR tradition. Most QMCR is concerned with long-term changes in structural relationships, not short-term fluctuations in annual data. Accordingly, it is probably more productive for models in QMCR to focus mainly on broad changes over long time periods rather than on year-by-year variability. In the TSCS literature, there is at least an explicit recognition of the problems introduced by the time-series nature of the underlying data, even if all but the best methodologists regularly mishandle it. Sadly, in the MLM literature there is almost no recognition whatsoever of the problems posed by time.


Time-Invariant Independent Variables

One of the biggest problems posed by the treatment of time in MLM designs is the question of how to handle time-invariant independent variables. Many QMCR variables either do not change at all over time (e.g., a country’s latitude, surface area, colonial heritage, or date of independence) or change so little over time that they often do not change over a period of several decades (e.g., a country’s government institutions, legal system, ethnic makeup, or mineral wealth). Such variables are time-invariant, at least over the course of a typical study period. If the purpose of the MLM design is to investigate how variability over time in the dependent variable is related to variability over time in the independent variable, time-invariant independent variables should pose no problem: they are irrelevant. Nonetheless, researchers using MLM methods want coefficients to report, even for time-invariant independent variables.

Thus, a major (perceived) shortcoming of the FEM design is that it cannot accommodate time-invariant independent variables. Since time-invariant variables have the same value for all instances of a country, controlling for country fixed effects effectively controls for time-invariant independent variables as well. To call this a shortcoming is a bit of a misnomer, since the elimination of time-invariant independent variables is a design feature that under other circumstances would be considered a major advantage. For example, the difference model (Chapter 6) is explicitly designed to eliminate the effects of time-invariant independent variables. Like the difference model, the FEM design factors out the effects of time-invariant independent variables whether they are measured or not, thus reducing the number of opportunities for spurious causality.
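The way country fixed effects absorb time-invariant variables can be seen in a few lines of Python. The sketch below assumes the usual "within" form of the FEM, in which each variable is centered on its own country mean (numerically equivalent to including one dummy variable per country). The toy panel, its values, and the function name `demean_within` are hypothetical and serve only as illustration.

```python
from collections import defaultdict

def demean_within(rows, country_key, value_key):
    """Subtract each country's own mean from its values (the 'within' transformation)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        sums[row[country_key]] += row[value_key]
        counts[row[country_key]] += 1
    return [row[value_key] - sums[row[country_key]] / counts[row[country_key]]
            for row in rows]

# Hypothetical toy panel: two countries observed in three years each.
# "income" varies over time; "latitude" is time-invariant within a country.
panel = [
    {"country": "A", "year": 1990, "income": 8.0, "latitude": 10.0},
    {"country": "A", "year": 1995, "income": 8.2, "latitude": 10.0},
    {"country": "A", "year": 2000, "income": 8.4, "latitude": 10.0},
    {"country": "B", "year": 1990, "income": 9.0, "latitude": 45.5},
    {"country": "B", "year": 1995, "income": 9.5, "latitude": 45.5},
    {"country": "B", "year": 2000, "income": 10.0, "latitude": 45.5},
]

income_within = demean_within(panel, "country", "income")
latitude_within = demean_within(panel, "country", "latitude")

print(income_within)    # still varies: over-time movement survives
print(latitude_within)  # all zeros: nothing is left to estimate a slope from
```

After centering, the time-invariant column has no variance at all, which is why no FEM coefficient can be estimated for it, whether or not the variable was ever measured.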

The REM, on the other hand, does permit the estimation of coefficients for time-invariant independent variables. One suspects that REMs are mainly used, despite the many warnings against them, for precisely this reason. The practice of using REMs with time-invariant independent variables is confined almost entirely to sociology, where according to Halaby (2004) it is the dominant MLM approach. It can be traced mainly to Nielsen and Alderson’s (1995) study of the role of sector dualism in determining national levels of income inequality, where REMs were used to permit the estimation of regression coefficients for regime type and other variables that did not vary over their study period. Coefficients for time-invariant independent variables can be estimated in REM designs because random effects, unlike fixed effects, do not remove all of the cross-national variability. As a result, researchers seemingly get to have their cakes and eat them too: an MLM to eliminate cross-national variability and focus on over-time effects, but estimated in an environment that still produces coefficients for important variables that would otherwise be factored out.

It has been argued above that REMs are almost never appropriate for use in QMCR settings. Nonetheless, they are used—frequently. Leaving aside objections to the use of REM designs in general, would it be reasonable to include time-invariant independent variables in an otherwise properly specified REM?

A multilevel representation sheds some light on this. Figure 7.7 reproduces the REM specification from Figure 7.6, but this time with the inclusion of time-invariant independent variables (W_i) that have values that are specific to each country. Since time-invariant variables by definition are always the same for any particular country, they appear on Level 2 of the MLM. The reasons for this are apparent from the teacher effectiveness example. Suppose that the teachers didn’t all hold class at the same time of day, but used the same classrooms at different times. In this scenario, a portion of the group (Level 2) variability in student performance would be due to the impact of class time. Since class time doesn’t vary across individuals within any one group, the class time effect would come out of the Level 2 random effect, not out of the Level 1 idiosyncratic error. As part of the Level 2 random effect, the class times would have to satisfy the REM assumptions. In other words, like unobserved group effects, time-invariant independent variables (in the QMCR setting) must satisfy the REM assumptions.

Figure 7.7 The Multilevel Form for a Regression Model With Random Country Effects and Level 2 Time-Invariant Independent Variables

Single-Equation Representation:
y_it = a + B X_it + H W_i + e_it

Multilevel or Hierarchical Representation:
y_it = a + B X_it + λ_i + u_it    Level 1 (individual) model
λ_i = H W_i + v_i    Level 2 (group) model

where:
y_it is the predicted value of the dependent variable for country i at time t
a is the regression constant
B is the vector of slope coefficients for the independent variables
X_it is the vector of observed values of the independent variables for country i at time t
u_it is the random regression error associated with each observation (country i at time t)
λ_i is the effect associated with country i
e_it is the total random regression error due to both country and country-year errors (e_it = v_i + u_it)
v_i is the (unobserved) country i realization of a random effect variable with mean 0 and constant variance
H is the Level 2 vector of slope coefficients for the time-invariant independent variables
W_i is the vector of observed values of the time-invariant independent variables for country i

In the teacher effectiveness example, the group-invariant independent variable (class time) can be randomly assigned to ensure that the REM assumptions are not violated. Of course, in QMCR settings this is not possible, but it’s much worse than that. We specifically want to include the time-invariant independent variables in order to violate the REM assumptions. That is to say, if we knew in advance that the time-invariant independent variables would only reduce the size of the (unobserved) random effects without affecting the coefficients of the other independent variables in any way, we would have no interest in using them. The reasons we want to use an REM design are the very reasons why we can’t.
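The link between the two representations in Figure 7.7 can be written out as a short derivation. The LaTeX below uses the figure's own symbols with the subscripts made explicit; it simply substitutes the Level 2 equation into the Level 1 equation. The final remark assumes the random effects condition as it is standardly stated (the random effect is uncorrelated with the regressors).

```latex
\begin{align*}
y_{it} &= a + B X_{it} + \lambda_i + u_{it} && \text{(Level 1)} \\
\lambda_i &= H W_i + v_i && \text{(Level 2)} \\
y_{it} &= a + B X_{it} + H W_i + \underbrace{v_i + u_{it}}_{e_{it}} && \text{(single equation)}
\end{align*}
% Standard random effects condition (assumed here): v_i is uncorrelated
% with both X_{it} and W_i, so that e_{it} is uncorrelated with all regressors.
```

Written this way, it is apparent that W_i enters the model through the country effect λ_i, so whatever is required of the random effect v_i is required with respect to W_i as well as X_it.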

This fact is reflected in the standard Hausman (1978) test that economists use to evaluate the reasonableness of using the REM specification in place of the FEM specification. The Hausman test does nothing more than determine whether or not the coefficients of the model covariates obtained using the REM specification differ significantly from their counterparts under the equivalent FEM specification. If they do, the Hausman guidance is to stick with the FEM. In other words, economists consider the REM specification to be reasonable only when it doesn’t change anything of interest to QMCR.
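For a single covariate, the logic of the Hausman comparison reduces to simple arithmetic. The Python sketch below assumes the textbook form of the statistic for one coefficient, H = (b_FEM − b_REM)² / (Var(b_FEM) − Var(b_REM)), compared against a chi-square distribution with 1 degree of freedom. The coefficient values and standard errors are invented for illustration, and the function name `hausman_single` is hypothetical.

```python
def hausman_single(b_fe, se_fe, b_re, se_re):
    """Hausman statistic for one coefficient (textbook single-covariate form)."""
    var_diff = se_fe ** 2 - se_re ** 2
    if var_diff <= 0:
        # The simple form is only defined when the FEM estimate is less precise.
        raise ValueError("FEM variance must exceed REM variance")
    return (b_fe - b_re) ** 2 / var_diff

# 5 percent critical value of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_5PCT = 3.841

# Hypothetical estimates of the same slope under the two specifications.
stat = hausman_single(b_fe=-0.40, se_fe=0.05, b_re=-0.25, se_re=0.03)
prefer_fem = stat > CHI2_1DF_5PCT

print(round(stat, 4), prefer_fem)
```

In this made-up case the two slopes differ by far more than their sampling variability allows, so the guidance would be to stay with the FEM. With several covariates the same comparison is made with coefficient vectors and covariance matrices rather than scalars.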

That is not to say that the effects of time-invariant independent variables cannot be studied. Researchers who want to study the effects of time-invariant independent variables can simply run a pooled model without any country effects, fixed or random. Dynamic error dependence tools can be used to address the error sphericity violations (Table 7.1) that cause the biased significance levels that might be introduced by case inflation (Figure 7.1). The inclusion of a lagged dependent variable will be sufficient in most cases to solve this problem. Other alternatives also exist for retaining the MLM framework, including the use of instrumental variables and a three-stage residualization procedure developed by Plumper and Troeger (2007). The effects of time-invariant independent variables can be studied, but the whole point of the MLM approach is to focus on intertemporal variation by factoring out country-specific effects. If researchers are interested in the effects of slow-moving or time-invariant independent variables, they shouldn’t be using MLM designs.

Lags and Trends

A further time-related problem that shows up in both TSCS and MLM designs is that the relationships between variables are almost universally assumed (without discussion) to be contemporaneous. This usually makes little substantive sense in QMCR applications. In econometrics (where TSCS models were first developed) it might very well be the case that (e.g.) a change in price causes a contemporaneous change in demand. Analogous scenarios are hard to imagine in QMCR. Instead, QMCR variables have effects that become apparent only over time, and even then only with vague, ill-defined time lags.

For example, how long does it take for a change in national income to affect a country’s infant mortality rate? The cross-correlations reported in Figure 4.7 shed little light on this question; indeed, the conclusion reached in Chapter 4 was that the contemporaneous correlation between national income and infant mortality is a reasonable proxy for the relationship between the two variables. An MLM approach gives a much more complicated answer. Table 7.2 reports the slope coefficients (and associated standard errors) from a series of 24 related FEMs regressing infant mortality on national income. Six different time lags are tested: none (national income has a contemporaneous effect on infant mortality) through 25 years (national income is related to infant mortality 25 years later). Models include either country or country and year fixed effects, and either a lagged dependent variable (measured for the period 5 years before the year of the dependent variable in the particular model) or not. The observations are the 770 country-years described for Figure 7.1; LDV models and models with time lags are based on accordingly smaller numbers of cases.

At first glance, Model 1 and Model 3 seem to indicate very clearly that the relationship between national income and infant mortality is contemporaneous. Both have the strongest coefficients when there is no lag. Note, however, that the t ratio for the country FEM with no lag (Model 1) is t = 28.7, which is even higher than that for the individual cross-sections displayed in Figure 7.1. Considering that the FEM by design throws away all of the cross-national variability in the two variables and relies for its explanatory power entirely on within-country variability over time, this is a surprising result.

The reason for the strong results reported in Model 1 and Model 3 is that the errors in the simple FEM specifications are nonspherical, as demonstrated in Figure 7.4. Specifically, the errors exhibit temporal heteroskedasticity. This is likely due to the fact that the data are nonstationary, that is, there are trends in national income and infant mortality in each country. Error sphericity can (mostly) be established through the inclusion of a lagged dependent variable, which effectively detrends the data (but on a global basis, not country-by-country, since there is a single LDV coefficient that applies to all countries and years). Including an LDV reduces the contemporaneous effect of national income on infant mortality to nonsignificance (Model 2) or literally zero (Model 4). This is remarkable, since the relationship between national income and infant mortality is one of the strongest among all QMCR variables, but it is not surprising. After all, it would be unrealistic to expect change in national income to translate instantaneously into changes in infant mortality. On reflection, the Model 2 and Model 4 results make complete sense.

Table 7.2 Lag Structure of the Relationship Between National Income and Infant Mortality, Revisited, Slopes and (Standard Errors)

              Country Fixed Effects Only            Country and Time Fixed Effects
Lag Period    No LDV (1)        LDV Included (2)    No LDV (3)        LDV Included (4)
None          -1.034 (0.036)    -0.011 (0.015)      -0.449 (0.028)    0.000 (0.015)
5 Years       -1.005 (0.038)    -0.009 (0.015)      -0.422 (0.031)    0.012 (0.015)
10 Years      -0.967 (0.041)    -0.018 (0.018)      -0.372 (0.035)    0.039 (0.018)
15 Years      -0.906 (0.044)    -0.041 (0.021)      -0.298 (0.041)    0.055 (0.021)
20 Years      -0.844 (0.045)    -0.046 (0.025)      -0.263 (0.048)    0.053 (0.026)
25 Years      -0.784 (0.047)    -0.039 (0.031)      -0.226 (0.059)    0.056 (0.033)

In fact, it more than makes sense. It helps illuminate the dynamics connecting national income with infant mortality. Model 2 suggests that the lag is actually 15–20 years, which seems quite reasonable (the maximum t ratio occurs at 15 years, while the maximum slope occurs at 20 years). Model 4, however, suggests a similar lag but in the opposite direction: above-trend increases in national income correspond to below-trend decreases in infant mortality 10–20 years later once period fixed effects are included. The full picture that emerges is that national income is related to infant mortality 15 years later, but the countries that have benefited most from the general fall in infant mortality are those countries where national income changes have been least closely related to later infant mortality declines. Rerunning the regressions with the 24 Sub-Saharan African countries removed confirms this: the Model 4 lagged coefficients for national income become negative (though they remain nonsignificant).
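The t ratios quoted in the discussion of Table 7.2 can be recomputed from the reported slopes and standard errors. The Python sketch below does only that arithmetic (t = slope / standard error) on the published values; the dictionary layout and the helper name `strongest_lag` are illustrative choices, not part of the original analysis.

```python
# Table 7.2 as (slope, standard error) pairs, in column order Models 1 to 4.
table_7_2 = {
    "None":     [(-1.034, 0.036), (-0.011, 0.015), (-0.449, 0.028), (0.000, 0.015)],
    "5 Years":  [(-1.005, 0.038), (-0.009, 0.015), (-0.422, 0.031), (0.012, 0.015)],
    "10 Years": [(-0.967, 0.041), (-0.018, 0.018), (-0.372, 0.035), (0.039, 0.018)],
    "15 Years": [(-0.906, 0.044), (-0.041, 0.021), (-0.298, 0.041), (0.055, 0.021)],
    "20 Years": [(-0.844, 0.045), (-0.046, 0.025), (-0.263, 0.048), (0.053, 0.026)],
    "25 Years": [(-0.784, 0.047), (-0.039, 0.031), (-0.226, 0.059), (0.056, 0.033)],
}

# t ratio for every cell: slope divided by its standard error.
t_ratios = {lag: [slope / se for slope, se in cells]
            for lag, cells in table_7_2.items()}

def strongest_lag(model_index):
    """Lag period with the largest absolute t ratio for one model (0-based index)."""
    return max(t_ratios, key=lambda lag: abs(t_ratios[lag][model_index]))

print(round(t_ratios["None"][0], 1))  # Model 1, no lag: about -28.7
print(strongest_lag(1))               # Model 2: 15 Years
print(strongest_lag(3))               # Model 4: 15 Years
```

The sign of a t ratio follows the sign of the slope, so the comparison across lags is made on absolute values.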

The Slope-Slope Model

The deconstruction of the results reported in Table 7.2 is all well and good, but do we really believe that national income has only a marginally significant impact on infant mortality? Once the MLM design has been tweaked to satisfy most (not even all) of the estimation conditions that are required to guard against estimation biases, there is very little explanatory power left to study the relationships we’re interested in. What’s more, if the relationship between national income and infant mortality is battered down to nonsignificance by proper model specification, what will happen to other, more-tenuous relationships? Like national income and infant mortality, nearly all QMCR variables exhibit strong time trends. It is quite possible that literally all of the multilevel modeling results published in the quantitative macro-comparative literature represent nothing more than spurious effects due to violations of error sphericity.

We need more-effective, more-robust ways to handle trended data. Fortunately, these do exist. The most obvious is the difference model (Chapter 6), which regresses change over time in the dependent variable on change over time in the independent variables. It might be objected, however, that difference models “don’t use all of the available data,” since they rely only on data for the beginning and end points of the data series. Davis et al. (2006) have demonstrated that measurement error in start and end points can produce grossly unreliable results overall. Despite strong support from Firebaugh and Beck (1994), Halaby (2004), and Babones (2009c), the difference model is not a panacea for the problems of QMCR.

An alternative to the difference model that takes full advantage of available repeated measures data is to relate the slope of the dependent variable over time to the slopes of the independent variables. Such an analysis might be called a slope-slope model, and is illustrated in Figure 7.8. The slope-slope model is estimated in two steps. In Step 1, each variable is regressed on time within countries to establish a time trend within the country. This answers Davis et al.’s (2006) reservations about the accuracy of start and end points, since the slope is estimated from the full range of data. It also answers the reductio ad absurdum of producing infinite cases by slicing time into thinner and thinner intervals: in the slope-slope model, having more-frequent observations merely improves the estimation of the slopes. In Step 2 of the slope-slope model, the slope coefficients for all the countries are used to estimate a simple cross-sectional regression model. There is no reason why time-invariant independent variables cannot be used as covariates. As a further refinement, the standard errors of the slopes computed in Step 1 could be used to estimate the degree of regression attenuation exhibited in Step 2.
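The two steps of the slope-slope model can be sketched in a few lines of Python. The sketch assumes ordinary least squares for both steps and a single independent variable. The three-country panel is hypothetical, with made-up, exactly linear series (it is not the data behind Figure 7.8), and the names `ols_slope` and `slope_slope` are illustrative.

```python
def ols_slope(xs, ys):
    """Slope from a simple OLS regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def slope_slope(panel, years):
    """Step 1: per-country time trends. Step 2: cross-sectional regression of trends."""
    income_trends, mortality_trends = [], []
    for income, mortality in panel.values():
        income_trends.append(ols_slope(years, income))        # Step 1, X on time
        mortality_trends.append(ols_slope(years, mortality))  # Step 1, Y on time
    # Step 2: one observation per country, however many years were observed.
    return ols_slope(income_trends, mortality_trends), income_trends, mortality_trends

# Hypothetical panel: country -> (logged income series, logged mortality series).
years = [0, 5, 10, 15]
panel = {
    "A": ([8.0 + 0.01 * t for t in years], [4.0 - 0.010 * t for t in years]),
    "B": ([7.0 + 0.02 * t for t in years], [4.5 - 0.015 * t for t in years]),
    "C": ([9.0 + 0.03 * t for t in years], [3.5 - 0.020 * t for t in years]),
}

step2_slope, income_trends, mortality_trends = slope_slope(panel, years)
print(len(income_trends), step2_slope)
```

Note that the number of cases in Step 2 equals the number of countries. Observing each country more often would change only the precision of the Step 1 slopes, not the Step 2 sample size.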

A scatterplot of the slope-slope model for the relationship between national income and infant mortality is depicted in Figure 7.8. The slope-slope model replicates the results of the MLMs reported in Table 7.2 but far more clearly and with far greater statistical power. Specifically,

1. infant mortality reduction is highly significantly correlated with national income growth;

2. infant mortality reduction has been much faster in Africa than in the rest of the world;

3. the relationship between national income growth and infant mortality reduction is the same in Africa as in the rest of the world; and

4. there is one major outlier, Botswana, where extraordinarily high national income growth has not been accompanied by commensurate infant mortality reduction.

The slope-slope model is not the answer to all QMCR questions, but it illustrates the kinds of analytical strategies that are most likely to be productive in analyzing repeated measures data. These should use the data at hand in ways that are appropriate to the kinds of questions asked in QMCR settings, not slavishly implement procedures designed for use in econometric or experimental research. More importantly, they should be much simpler: they should not torture the data to make them say more than they can legitimately reveal about the world. It is difficult even to put into words what the coefficients estimated using the TSCS and MLM designs described in this chapter mean. Whenever statistical modeling requires mental gymnastics of the kinds described here, mistakes (even gross errors) are bound to occur. In any case the data used in QMCR are neither sufficiently numerous nor sufficiently robust to support the application of highly refined repeated measures designs. Straightforward but creative cross-sectional designs like the slope-slope model may point the way forward.

Figure 7.8 Slope-Slope Model of the Relationship Between National Income and Infant Mortality (N = 77 countries)

[Scatterplot. Horizontal axis: Slope in Logged GDP per Capita (income growth), from −0.020 to 0.040. Vertical axis: Slope in Infant Mortality (mortality reduction), from −0.030 to 0.000. Plotted series: Botswana; Africa x-Botswana; Rest of World.]