
POLS W4912

Multivariate Political Analysis

Gregory Wawro

Associate Professor

Department of Political Science

Columbia University

420 W. 118th St.

New York, NY 10027

phone: (212) 854-8540

fax: (212) 222-0598

email: [email protected]


ACKNOWLEDGEMENTS

This course draws liberally on lecture notes prepared by Professors Neal Beck, Lucy Goodhart, George Jakubson, Nolan McCarty, and Chris Zorn. The course also draws from the following works:

Aldrich, John H. and Forrest D. Nelson. 1984. Linear Probability, Logit and Probit Models. Beverly Hills, CA: Sage.

Alvarez, R. Michael and Jonathan Nagler. 1995. "Economics, Issues and the Perot Candidacy: Voter Choice in the 1992 Presidential Election." American Journal of Political Science 39: 714–744.

Amemiya, Takeshi. 1985. Advanced Econometrics. Cambridge: Harvard University Press.

Baltagi, Badi H. 1995. Econometric Analysis of Panel Data. New York: John Wiley & Sons.

Beck, Nathaniel and Jonathan N. Katz. 1995. "What To Do (and Not To Do) with Time-Series Cross-Section Data in Comparative Politics." American Political Science Review 89: 634–647.

Davidson, Russell and James G. MacKinnon. 1993. Estimation and Inference in Econometrics. New York: Oxford University Press.

Eliason, Scott R. 1993. Maximum Likelihood Estimation: Logic and Practice. Newbury Park, CA: Sage.

Fox, John. 2002. An R and S-Plus Companion to Applied Regression. Thousand Oaks, CA: Sage Publications.

Greene, William H. 2003. Econometric Analysis. 5th ed. Upper Saddle River, NJ: Prentice Hall.

Gujarati, Damodar N. 2003. Basic Econometrics. 4th ed. New York: McGraw-Hill.

Herron, Michael. 2000. "Post-Estimation Uncertainty in Limited Dependent Variable Models." Political Analysis 8: 83–98.

Hsiao, Cheng. 2003. Analysis of Panel Data. 2nd ed. Cambridge: Cambridge University Press.

Kiefer, Nicholas M. 1988. "Economic Duration Data and Hazard Functions." Journal of Economic Literature 24: 646–679.

Kennedy, Peter. 2003. A Guide to Econometrics. 5th ed. Cambridge, MA: MIT Press.

King, Gary. 1989. Unifying Political Methodology. New York: Cambridge University Press.

Lancaster, Tony. 1990. The Econometric Analysis of Transition Data. Cambridge: Cambridge University Press.

Long, J. Scott. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.

Maddala, G. S. 2001. Introduction to Econometrics. 3rd ed. New York: John Wiley and Sons.

Wooldridge, Jeffrey M. 2002. Introductory Econometrics: A Modern Approach. Cincinnati, OH: South-Western College Publishing.

Wooldridge, Jeffrey M. 2002. Econometric Analysis of Cross Section and Panel Data. Cambridge, MA: MIT Press.

Yamaguchi, Kazuo. 1991. Event History Analysis. Newbury Park, CA: Sage.

Zorn, Christopher J. W. 2001. "Generalized Estimating Equations Models for Correlated Data: A Review with Applications." American Journal of Political Science 45: 470–490.

TABLE OF CONTENTS

Part I: The Classical Linear Regression Model

1 Probability Theory, Estimation, and Statistical Inference as a Prelude to the Classical Linear Regression Model
  1.1 Why is this review pertinent?
  1.2 Probability theory
    1.2.1 Properties of probability
    1.2.2 Random variables
    1.2.3 Joint probability density functions
    1.2.4 Conditional probability density functions
  1.3 Expectations, variance, and covariance
    1.3.1 Properties of expected values
    1.3.2 Variance
    1.3.3 Properties of variance
    1.3.4 Covariance
    1.3.5 Properties of covariance
    1.3.6 Correlation
    1.3.7 Variance of correlated variables
    1.3.8 Conditional expectation
  1.4 Important distributions
    1.4.1 The normal distribution
    1.4.2 χ² (Chi-squared) distribution
    1.4.3 Student's t distribution
    1.4.4 The F distribution
  1.5 Statistical inference

2 Matrix Algebra Review
  2.1 Introduction
  2.2 Terminology and notation
  2.3 Types of matrices
  2.4 Addition and subtraction
  2.5 Multiplying a vector or matrix by a constant
  2.6 Multiplying two vectors/matrices
  2.7 Matrix multiplication
  2.8 Representing a regression model via matrix multiplication
  2.9 Using matrix multiplication to compute useful quantities
  2.10 Representing a system of linear equations via matrix multiplication
  2.11 What is rank and why would a matrix not be of full rank?
  2.12 The rank of a non-square matrix
  2.13 Application of the inverse to regression analysis
  2.14 Partial differentiation
  2.15 Multivariate distributions

3 The Classical Regression Model
  3.1 Overview of ordinary least squares
  3.2 Optimal prediction and estimation
  3.3 Criteria for optimality for estimators
  3.4 Linear predictors
  3.5 Ordinary least squares
    3.5.1 Bivariate example
    3.5.2 Multiple regression
    3.5.3 The multiple regression model in matrix form
    3.5.4 OLS using matrix notation

4 Properties of OLS in Finite Samples
  4.1 Gauss-Markov assumptions
  4.2 Using the assumptions to show that β is unbiased
  4.3 Using the assumptions to find the variance of β
  4.4 Finding an estimate for σ²
  4.5 Distribution of the OLS coefficients
  4.6 Efficiency of OLS regression coefficients

5 Inference Using OLS Regression Coefficients
  5.1 A univariate example of a hypothesis test
  5.2 Hypothesis testing of multivariate regression coefficients
    5.2.1 Proof that (β_k − β_k)/se(β_k) = (β_k − β_k)/√(s²(X′X)⁻¹_kk) ∼ t
    5.2.2 Proof that (N − k)s²/σ² ∼ χ²_{N−k}
  5.3 Testing the equality of two regression coefficients
  5.4 Expressing the above as a "restriction" on the matrix of coefficients

6 Goodness of Fit
  6.1 The R-squared measure of goodness of fit
    6.1.1 The uses of R² and three cautions
  6.2 Testing Multiple Hypotheses: the F-statistic and R²
  6.3 The relationship between F and t for a single restriction
  6.4 Calculation of the F-statistic using the estimated residuals
  6.5 Tests of structural change
    6.5.1 The creation and interpretation of dummy variables
    6.5.2 The "dummy variable trap"
    6.5.3 Using dummy variables to estimate separate intercepts and slopes for each group
  6.6 The use of dummy variables to perform Chow tests of structural change
    6.6.1 A note on the estimate of s² used in these tests

7 Partitioned Regression and Bias
  7.1 Partitioned regression, partialling-out and applications
  7.2 R² and the addition of new variables
  7.3 Omitted variable bias
    7.3.1 Direction of the bias
    7.3.2 Bias in our estimate of σ²
    7.3.3 Testing for omitted variables: the RESET test
  7.4 Including an irrelevant variable
  7.5 Model specification guidelines
  7.6 Multicollinearity
    7.6.1 How to tell if multicollinearity is likely to be a problem
    7.6.2 What to do if multicollinearity is a problem

8 Regression Diagnostics
  8.1 Before you estimate the regression
  8.2 Outliers, leverage points, and influence points
    8.2.1 How to look at outliers in a regression model
  8.3 How to look at leverage points and influence points
  8.4 Leverage
    8.4.1 Standardized and studentized residuals
    8.4.2 DFBETAS

9 Presentation of Results, Prediction, and Forecasting
  9.1 Presentation and interpretation of regression coefficients
    9.1.1 Prediction
  9.2 Encompassing and non-encompassing tests

10 Maximum Likelihood Estimation
  10.1 What is a likelihood (and why might I need it)?
  10.2 An example of estimating the mean and the variance
  10.3 Are ML and OLS equivalent?
  10.4 Inference and hypothesis testing with ML
    10.4.1 The likelihood ratio test
  10.5 The precision of the ML estimates

Part II: Violations of Gauss-Markov Assumptions in the Classical Linear Regression Model

11 Large Sample Results and Asymptotics
  11.1 What are large sample results and why do we care about them?
  11.2 What are desirable large sample properties?
  11.3 How do we figure out the large sample properties of an estimator?
    11.3.1 The consistency of β_OLS
    11.3.2 The asymptotic normality of OLS
  11.4 The large sample properties of test statistics
  11.5 The desirable large sample properties of ML estimators
  11.6 How large does n have to be?

12 Heteroskedasticity
  12.1 Heteroskedasticity as a violation of Gauss-Markov
    12.1.1 Consequences of non-spherical errors
  12.2 Consequences for efficiency and standard errors
  12.3 Generalized Least Squares
    12.3.1 Some intuition
  12.4 Feasible Generalized Least Squares
  12.5 White-consistent standard errors
  12.6 Tests for heteroskedasticity
    12.6.1 Visual inspection of the residuals
    12.6.2 The Goldfeld-Quandt test
    12.6.3 The Breusch-Pagan test

13 Autocorrelation
  13.1 The meaning of autocorrelation
  13.2 Causes of autocorrelation
  13.3 Consequences of autocorrelation for regression coefficients and standard errors
  13.4 Tests for autocorrelation
    13.4.1 The Durbin-Watson test
    13.4.2 The Breusch-Godfrey test
  13.5 The consequences of autocorrelation for the variance-covariance matrix
  13.6 GLS and FGLS under autocorrelation
  13.7 Non-AR(1) processes
  13.8 OLS estimation with lagged dependent variables and autocorrelation
  13.9 Bias and "contemporaneous correlation"
  13.10 Measurement error
  13.11 Instrumental variable estimation
  13.12 In the general case, why is IV estimation unbiased and consistent?

14 Simultaneous Equations Models and 2SLS
  14.1 Simultaneous equations models and bias
    14.1.1 Motivating example: political violence and economic growth
    14.1.2 Simultaneity bias
  14.2 Reduced form equations
  14.3 Identification
    14.3.1 The order condition
  14.4 IV estimation and two-stage least squares
    14.4.1 Some important observations
  14.5 Recapitulation of 2SLS and computation of goodness-of-fit
  14.6 Computation of standard errors in 2SLS
  14.7 Three-stage least squares
  14.8 Different methods to detect and test for endogeneity
    14.8.1 Granger causality
    14.8.2 The Hausman specification test
    14.8.3 Regression version
    14.8.4 How to do this in Stata
  14.9 Testing for the validity of instruments

15 Time Series Modeling
  15.1 Historical background
  15.2 The auto-regressive and moving average specifications
    15.2.1 An autoregressive process
  15.3 Stationarity
  15.4 A moving average process
  15.5 ARMA processes
  15.6 More on stationarity
  15.7 Integrated processes, spurious correlations, and testing for unit roots
    15.7.1 Determining the specification
  15.8 The autocorrelation function for AR(1) and MA(1) processes
  15.9 The partial autocorrelation function for AR(1) and MA(1) processes
  15.10 Different specifications for time series analysis
  15.11 Determining the number of lags
  15.12 Determining the correct specification for your errors
  15.13 Stata commands

Part III: Special Topics

16 Time-Series Cross-Section and Panel Data
  16.1 Unobserved country effects and LSDV
    16.1.1 Time effects
  16.2 Testing for unit or time effects
    16.2.1 How to do this test in Stata
  16.3 LSDV as fixed effects
  16.4 What types of variation do different estimators use?
  16.5 Random effects estimation
  16.6 FGLS estimation of random effects
  16.7 Testing between fixed and random effects
  16.8 How to do this in Stata
  16.9 Panel regression and the Gauss-Markov assumptions

17 Models with Discrete Dependent Variables
  17.1 Discrete dependent variables
  17.2 The latent choice model
  17.3 Interpreting coefficients and computing marginal effects
  17.4 Measures of goodness of fit

18 Discrete Choice Models for Multiple Categories
  18.1 Ordered probit and logit
  18.2 Multinomial logit
    18.2.1 Interpreting coefficients, assessing goodness of fit
    18.2.2 The IIA assumption

19 Count Data, Models for Limited Dependent Variables, and Duration Models
  19.1 Event count models and Poisson estimation
  19.2 Limited dependent variables: the truncation example
  19.3 Censored data and tobit regression
  19.4 Sample selection: the Heckman model
  19.5 Duration models

LIST OF FIGURES

1.1 An Example of a PDF
1.2 An example of a CDF
1.3 Joint PDF
1.4 Conditional PDF
1.5 Conditional PDF, X and Y independent
1.6 Plots of special distributions
3.1 Data Plot
3.2 Data Plot w/ OLS Regression Line
8.1 Data on Presidential Approval, Unemployment, and Inflation
8.2 A plot of fitted values versus residuals
8.3 Added variable plots
8.4 An influential outlier
8.5 Hat values
8.6 Plot of DFBETAS


Part I

The Classical Linear Regression Model


Section 1

Probability Theory, Estimation, and Statistical Inference

as a Prelude to the Classical Linear Regression Model

1.1 Why is this review pertinent?

1. Most of the phenomena that we care about in the world can be characterized as random variables, meaning that we can conceive of their values as being determined by the outcome of a chance experiment. To understand the behavior of a random variable, we need to understand probability, which assigns a likelihood to each realization of that random variable.

2. In univariate statistics, much of the focus is on estimating the expectation (or mean) of a random variable. In multivariate analysis, we will focus on relating movements in something we care about to variables that can explain it, but we will still care about expectations.

3. We will focus on establishing the conditional expectation of a dependent variable Y. In other words, given the value of X (where X may be a single variable or a set of variables), what value do we expect Y to take on? That conditional expectation is often written as βX.

4. Since we do not directly observe all the data that make up the world (and/or cannot run experiments), we must estimate βX using a sample. To understand what we can say about that estimate, we will have to employ statistical inference.

5. Finally, many of the statistics we estimate (where a statistic is simply any quantity calculated from sample data) will conform to particular known probability distributions. This will simplify the business of conducting hypothesis tests. To assist in understanding the rationale behind those hypothesis tests, it helps to review the distributions in question.


1.2 Probability theory

• The set of all possible outcomes of a random experiment is called the population or sample space, and an event is one cell (or subset) of the sample space.

• If events cannot occur jointly, they are mutually exclusive. If they exhaust all of the things that could happen, they are collectively exhaustive.

1.2.1 Properties of probability

1. 0 ≤ P(A) ≤ 1 for every A.

2. If A, B, C constitute an exhaustive set of events, then P(A + B + C) = 1.

3. If A, B, C are mutually exclusive events, then P(A + B + C) = P(A) + P(B) + P(C).

1.2.2 Random variables

• Definition: a variable whose value is determined by a chance experiment.

• Convention: a rv is denoted by an upper case letter (e.g., X) and realizations of a rv are denoted by lower case letters (e.g., x1, x2, . . ., xn).

• A random variable is discrete if it takes on a finite (or countably infinite) set of values. Otherwise the rv is continuous.

• A probability density function (PDF) assigns a probability to each value (or event or occurrence) of a random variable X.

• Thus, a discrete PDF for a variable X taking on the values x1, x2, x3, . . . , xn is a function such that:

f_X(x) = P[X = x_i]   for x = x_i, i = 1, 2, 3, . . . , n,

and is zero otherwise (there is no probability of X taking on any other value).
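As a quick check of the definition, here is a minimal sketch in Python (a fair six-sided die is an assumed example, not one from the notes): the PDF is nonnegative, sums to one over the support, and is zero elsewhere.

```python
# A discrete PDF for a fair six-sided die: f(x) = 1/6 for x in {1,...,6},
# and zero otherwise. Fractions keep the arithmetic exact.
from fractions import Fraction

def f(x):
    """Discrete PDF of a fair die roll."""
    return Fraction(1, 6) if x in {1, 2, 3, 4, 5, 6} else Fraction(0)

total = sum(f(x) for x in range(1, 7))  # probabilities over the support
print(total)       # 1
print(f(3), f(7))  # 1/6 0
```

The same pattern works for any discrete rv: store the support and its probabilities, and let every other value map to zero.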


• A continuous probability density function for a variable X that covers a continuous range is a function such that:

f(x) ≥ 0 for all x

and

∫_{−∞}^{∞} f(x) dx = 1.

The probability that X falls in an interval is

P[a < X < b] = ∫_a^b f(x) dx.

• The cumulative distribution function (CDF) gives the probability of a random variable being less than or equal to some value: F(x) = P[X ≤ x].

• For a discrete rv,

F(x) = Σ_{x_j ≤ x} f(x_j).

• For a continuous rv,

F(x) = P[X ≤ x] = P[−∞ ≤ X ≤ x] = ∫_{−∞}^{x} f(u) du.
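In the discrete case, the CDF is just a running sum of the PDF over the support. A small Python sketch, again using the assumed fair-die distribution for illustration:

```python
# Build the discrete CDF F(x) = sum over {x_j <= x} of f(x_j)
# for an assumed fair six-sided die.
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def F(x):
    """CDF: probability that the roll is <= x."""
    return sum(p for xj, p in pmf.items() if xj <= x)

print(F(0))  # 0   (below the support)
print(F(3))  # 1/2
print(F(6))  # 1   (the whole support)
```

Note that F is non-decreasing, starts at 0 below the support, and reaches 1 at the top of the support, exactly as the definition requires.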


Figure 1.1: An Example of a PDF (plot of f(x) against x; not reproduced in this transcript)


Figure 1.2: An example of a CDF (plot of P against x; not reproduced in this transcript)


1.2.3 Joint probability density functions

• For all of the political science examples we will talk about, we will be interested in how one variable is related to another, so we will be interested in joint probability density functions:

f(x, y) = P [X = x and Y = y]

• See Fig. 1.3 for an example of a joint PDF.

• The marginal probability density function, f_X(x), can be derived from the joint probability density function, f_{X,Y}(x, y), by summing/integrating over all the different values that Y could take on:

f_X(x) = Σ_y f_{X,Y}(x, y)          if f_{X,Y} is discrete    (1.1)

f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy   if f_{X,Y} is continuous  (1.2)

1.2.4 Conditional probability density functions

• Since we are interested in understanding how one variable is related to another, we will also consider the conditional PDF, which gives, for example, the probability that X has the realization x given that Y has the realization y. The conditional PDF is written as:

f_{X,Y}(x | y) = P(X = x | Y = y)        (1.3)
             = f_{X,Y}(x, y) / f_Y(y)    (1.4)

• Statistical Independence: We say that X and Y are statistically independent if and only if

f_{X,Y}(x, y) = f_X(x) · f_Y(y)

• See Fig. 1.4 for an example of a conditional PDF and Fig. 1.5 for an example of a conditional PDF for two independent variables.
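Marginalizing, conditioning, and checking independence are all mechanical operations on a table of joint probabilities. A small Python sketch (the joint PMF below is an assumed example constructed to be independent, so the check succeeds by design):

```python
# A small joint PMF over X in {0,1} and Y in {0,1}, chosen so that
# f(x, y) = fX(x) * fY(y), i.e. X and Y are independent.
from fractions import Fraction

fX = {0: Fraction(1, 4), 1: Fraction(3, 4)}
fY = {0: Fraction(1, 2), 1: Fraction(1, 2)}
joint = {(x, y): fX[x] * fY[y] for x in fX for y in fY}

# Marginal of X: sum the joint PMF over all values of Y (eq. 1.1).
marginal_X = {x: sum(joint[(x, y)] for y in fY) for x in fX}
assert marginal_X == fX

# Conditional PMF f(x|y) = f(x, y) / fY(y) (eq. 1.4); under
# independence it collapses back to the marginal of X.
cond = {(x, y): joint[(x, y)] / fY[y] for x in fX for y in fY}
assert all(cond[(x, y)] == fX[x] for x in fX for y in fY)
print("independence verified")
```

With a dependent joint PMF the conditional f(x|y) would change with y, which is exactly what the regression models later in the course exploit.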


Figure 1.3: Joint PDF


Figure 1.4: Conditional PDF


Figure 1.5: Conditional PDF, X and Y independent


1.3 Expectations, variance, and covariance

• We will often want to know about the central tendency of a random variable. We are particularly interested in the expected value of the random variable, which is the average value that it takes on over many repeated trials.

• The expected value (µ), or population mean, is the first moment of X.

E[X] = Σ_{j=1}^{n} x_j f(x_j)   (1.5)

if X is discrete and has the possible values x1, x2, . . . , xn, and

E[X] = ∫_{−∞}^{∞} x f(x) dx   (1.6)

if X is continuous.

1.3.1 Properties of expected values

• If b is a constant, then E(b) = b.

• If a and b are constants, then E(aX + b) = aE(X) + b.

• If X and Y are independent, then E(XY ) = E(X)E(Y )

• If X is a random variable with a PDF f(x) and if g(x) is any function of X, then

E[g(X)] = Σ_x g(x) f(x)           if X is discrete

E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx   if X is continuous
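The discrete formulas above can be verified directly. A Python sketch with an assumed three-point distribution (the values and probabilities are illustrative, not from the notes):

```python
# Expected value of a discrete rv (eq. 1.5) and of a function of it,
# E[g(X)] with g(x) = x^2, on an assumed three-point distribution.
from fractions import Fraction

pmf = {-1: Fraction(1, 4), 0: Fraction(1, 2), 2: Fraction(1, 4)}

EX = sum(x * p for x, p in pmf.items())     # E[X]
Eg = sum(x**2 * p for x, p in pmf.items())  # E[X^2]
print(EX)  # 1/4
print(Eg)  # 5/4

# Linearity property: E(aX + b) = aE(X) + b, here with a = 3, b = 2.
assert sum((3*x + 2) * p for x, p in pmf.items()) == 3*EX + 2
```

The final assertion checks the linearity property listed above; it holds for any a, b, and PMF, not just these numbers.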


1.3.2 Variance

• The distribution of values of X around its expected value can be measured by the variance, which is defined as:

var(X) = σ²_X = E[(X − µ)²]

• The variance is the second moment of X. The standard deviation of X, denoted σ_X, is +√var[X].

• The variance is computed by applying the E[g(X)] formula with g(X) = (X − µ)²:

var[X] = Σ_x (x − µ)² f(x)           if X is discrete

var[X] = ∫_{−∞}^{∞} (x − µ)² f(x) dx   if X is continuous

• We can also express the variance as

var[X] = E[(X − µ_X)²] = E[X²] − (E[X])²
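The shortcut var[X] = E[X²] − (E[X])² can be confirmed numerically. A Python sketch, with an assumed small discrete distribution:

```python
# Check var[X] = E[(X - mu)^2] = E[X^2] - (E[X])^2 exactly,
# using an assumed three-point discrete distribution.
from fractions import Fraction

pmf = {1: Fraction(1, 2), 3: Fraction(1, 4), 5: Fraction(1, 4)}

mu = sum(x * p for x, p in pmf.items())
var_def = sum((x - mu)**2 * p for x, p in pmf.items())      # definition
var_short = sum(x**2 * p for x, p in pmf.items()) - mu**2   # shortcut
print(mu, var_def, var_short)  # 5/2 11/4 11/4
assert var_def == var_short
```

Because Fractions are exact, the two expressions agree to the last digit rather than merely to floating-point tolerance.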

1.3.3 Properties of variance:

• If b is a constant, then var[b] = 0.

• If a and b are constants, and Y = a + bX then

var[Y ] = b2var[X].

• If X and Y are independent random variables, then var(X + Y) = var(X) + var(Y) and var(X − Y) = var(X) + var(Y).


1.3.4 Covariance

• The covariance of two rvs, X and Y, with means µ_X and µ_Y is defined as

cov(X, Y) = E[(X − µ_X)(Y − µ_Y)]   (1.7)
          = E[XY] − µ_X µ_Y.         (1.8)

1.3.5 Properties of covariance

• If X and Y are independent, then E[XY] = E[X]E[Y], implying cov(X, Y) = 0.

• cov(a + bX, c + dY ) = bd · cov(X, Y ), where a, b, c, and d are constants.

1.3.6 Correlation

• The size of the covariance will depend on the units in which X and Y are measured. This has led to the development of the correlation coefficient, which gives a measure of statistical association that ranges between −1 and +1.

• The population correlation coefficient, ρ, is defined as:

ρ = cov(X, Y) / √(var(X) var(Y)) = cov(X, Y) / (σ_X σ_Y)

1.3.7 Variance of correlated variables

var(X + Y) = var(X) + var(Y) + 2 cov(X, Y)

var(X − Y) = var(X) + var(Y) − 2 cov(X, Y)
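Covariance, correlation, and the variance-of-a-sum identity can all be computed from one joint PMF. A Python sketch (the joint PMF is an assumed example, deliberately not independent so that the covariance term matters):

```python
# Covariance, correlation, and var(X + Y) from an assumed joint PMF
# over (X, Y) that concentrates mass on (0,0) and (1,1).
import math

joint = {(0, 0): 0.4, (1, 1): 0.4, (0, 1): 0.1, (1, 0): 0.1}

EX = sum(x * p for (x, y), p in joint.items())
EY = sum(y * p for (x, y), p in joint.items())
cov = sum((x - EX) * (y - EY) * p for (x, y), p in joint.items())
varX = sum((x - EX) ** 2 * p for (x, y), p in joint.items())
varY = sum((y - EY) ** 2 * p for (x, y), p in joint.items())
rho = cov / math.sqrt(varX * varY)

# var(X + Y) = var(X) + var(Y) + 2 cov(X, Y)
varXpY = sum((x + y - EX - EY) ** 2 * p for (x, y), p in joint.items())
assert abs(varXpY - (varX + varY + 2 * cov)) < 1e-12
print(round(cov, 3), round(rho, 3))  # 0.15 0.6
```

Ignoring the covariance term here would understate var(X + Y) by 2 × 0.15 = 0.3, which is why the independence caveat in 1.3.3 matters.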


1.3.8 Conditional expectation

• The conditional expectation of X, given Y = y, is defined as

E(X|Y = y) = Σ_x x f(x|Y = y) if X is discrete

E(X|Y = y) = ∫_{−∞}^{∞} x f(x|Y = y) dx if X is continuous

• The Law of Iterated Expectations indicates how the expected value of X relates to the conditional expectation of X: E(X) = E_Y[E(X|Y)].

1.4 Important distributions

1.4.1 The normal distribution

• A random variable X is normally distributed (denoted X ∼ N(µ, σ²)) if it has the PDF

f(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)) (1.9)

where −∞ < x < ∞, µ is the mean, and σ² is the variance of the distribution.

• Properties of the Normal Distribution:

It is symmetric around its mean value (so mean=median=mode).

Approximately 68% of the area under a normal curve lies between the values of µ ± σ, about 95% of the area lies between µ ± 2σ, and about 99.7% in the range µ ± 3σ.

A normal distribution is completely characterized by its two parameters.


• Any normally distributed variable can be transformed into a standard normal variable by subtracting its mean and dividing by its standard deviation.

E.g., if Y ∼ N(µ_Y, σ²_Y) and Z = (Y − µ_Y)/σ_Y, then Z ∼ N(0, 1). That is,

f_Z(z) = (1/√(2π)) exp(−z²/2) (1.10)

1.4.2 χ² (chi-squared) distribution

• Let Z1, Z2, . . . , Zk be independent standard normal variables. Then

U = Σ_{i=1}^{k} Z_i²

has a chi-squared distribution with k degrees of freedom (denoted U ∼ χ²_k).

1.4.3 Student’s t distribution

• Suppose Z ∼ N(0, 1), U ∼ χ²_k, and U and Z are distributed independently of each other. Then the variable

t = Z/√(U/k)

has a t distribution with k degrees of freedom.

1.4.4 The F distribution

• The F distribution is the distribution of the ratio of two independent chi-squared random variables, each divided by its degrees of freedom.

• That is, if U ∼ χ²_m and V ∼ χ²_n, then the variable

F = (U/m)/(V/n)

has an F distribution with m and n degrees of freedom (denoted F_{m,n}).
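A simulation sketch of these constructions (the sample size and degrees of freedom below are arbitrary choices):

```python
import numpy as np

# Build t and F draws directly from their definitions:
# t = Z / sqrt(U/k) with Z ~ N(0,1), U ~ chi2_k; F = (U/m) / (V/n).
rng = np.random.default_rng(2)
k, m, n = 5, 3, 10
N = 100_000

Z = rng.normal(size=N)
U = rng.chisquare(k, size=N)
t = Z / np.sqrt(U / k)              # t with k degrees of freedom

F = (rng.chisquare(m, size=N) / m) / (rng.chisquare(n, size=N) / n)

print(t.mean())                     # near 0: t is symmetric about zero
print(F.mean())                     # near n/(n - 2) = 1.25 for n = 10
```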


Figure 1.6: Plots of special distributions: χ² densities (df = 1, 5, 10), t densities (df = 1, 5, 10), and F densities (df = (1,1), (1,5), (1,10)).


1.5 Statistical inference

• In the real world, we typically do not know the true probability distribution of our population of interest; we estimate it using a sample, relying on the laws of statistics.

• If we don’t know, for instance, the true distribution of income in thepopulation, we could take a random sample of individuals.

• Each individual that we poll, or observation that we record, can be viewed as a random variable, X1, X2, . . . , Xn, because they have an income whose level is in part the result of an experiment. Each of those random draws (or variables) will, however, have a known value, x1, x2, . . . , xn.

• If each comes from the same population (i.e., with the same likelihood of being rich or poor), then the observations are said to be identically distributed. If the selection of one person does not affect the chances of selection of another person, they are said to be independent. If the observations are both, they are described as iid, the standard assumption.

• Once we have our sample, we can estimate a sample mean,

Ȳ = (1/N) Σ_{i=1}^{N} Y_i

and a sample variance,

s²_Y = (1/(N − 1)) Σ_{i=1}^{N} (Y_i − Ȳ)².
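Both estimators can be computed by hand on a small made-up sample and checked against numpy:

```python
import numpy as np

# Sample mean and sample variance (divisor N - 1), from the formulas above.
y = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
N = len(y)

ybar = y.sum() / N
s2 = ((y - ybar)**2).sum() / (N - 1)

print(ybar, s2)     # 5.0 and 32/7, about 4.571
```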

• What we want to know is how good these estimates are as descriptions of the real world → use statistical inference.

• We can show that the sample avg. is an unbiased estimator of the true population mean, µ_Y:

E(Ȳ) = (1/N) Σ_{i=1}^{N} E(Y_i) = µ_Y


• The sample avg. is a rv: take a different sample, and you are likely to get a different average. The hypothetical distribution of the sample avg. is the sampling distribution.

• We can deduce the variance of this sample avg.:

var(Ȳ) = σ²_Y/N.

• If the underlying population is distributed normal, we can also say that the avg. is distributed normal (why?). If the underlying population does not have a normal distribution, however, and the income distribution is not normal, we cannot infer that the sampling distribution is normal.

• Can use two laws of statistics that can tell us about the precision and distribution of our estimator (in this case the avg.):

The Law of Large Numbers states that, under general conditions, Ȳ will be near µ_Y with very high probability when N is large. The property that Ȳ is near µ_Y with increasing probability as N ↑ is called convergence in probability or, more concisely, consistency. We will return to consistency in the section on asymptotics.

The Central Limit Theorem states that, under general conditions, the distribution of Ȳ is well approximated by a normal distribution when N is large.

∴ (under general conditions) we can treat Ȳ as normally distributed and use the variance of the average to construct hypothesis tests (e.g., "how likely is it that avg. income < $20,000 given our estimate?").

Can also construct confidence intervals, outlining the range of income in which we expect the true mean income to fall with a given probability.
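Both laws can be illustrated by simulation; the population (exponential, deliberately skewed) and the sizes below are arbitrary choices:

```python
import numpy as np

# LLN/CLT sketch: averages of N iid draws from a skewed population are
# centered on the true mean with variance sigma^2 / N.
rng = np.random.default_rng(3)
N, reps = 200, 20_000

draws = rng.exponential(scale=1.0, size=(reps, N))   # mean 1, variance 1
ybar = draws.mean(axis=1)                            # one average per sample

print(ybar.mean())      # near 1: unbiasedness
print(ybar.var() * N)   # near 1: var(Ybar) = sigma^2 / N
```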

• We will estimate statistics that are analogous to a sample mean and are also distributed normal. We can again use the normal as the basis of hypothesis tests. We will also estimate parameters that are distributed as χ², as t, and as F.


Section 2

Matrix Algebra Review

2.1 Introduction

• A matrix is simply a rectangular array of numbers, like a table. Often, when we collect data, it takes the form of a table (or Excel spreadsheet) with variable names in the columns across the top and observations in the rows down the side. That's a typical matrix.

• A vector (a single-columned matrix) can also be thought of as a coordinate in space.

• Matrix algebra is the topic that covers the mathematical operations we can do with matrices, analogous to those we do with single numbers. These include addition, subtraction, multiplication, division (via the inverse), differentiation, and factoring. In doing these matrix operations, we just apply the particular operation to the matrix as a whole rather than to a single number.

• Matrix algebra is sometimes also called linear algebra because one of the major uses of matrix algebra is to solve systems of linear equations such as the following:

2x + 4y = 8

x + 5y = 7

• Easy to substitute to get expressions for x (or y) and solve for y (or x). Harder to do for more equations, which is the motivation for using matrix algebra.

• In regression analysis, estimation of coefficients requires a solution to a set of N equations with k unknowns.


2.2 Terminology and notation

• Matrices are represented w/ bold capital letters (e.g., A)

• Vectors are represented w/ bold lower case letters (e.g., a).

• Each element in a matrix, A, is represented as a_ij, where i gives the row and j gives the column.

• Transpose: switch the row and column positions of a matrix (so that the transposed matrix has a_ji for its elements). Example:

[4 2; 1 5]′ = [4 1; 2 5]

• Scalar: a 1 × 1 matrix.

• A matrix is equal to another matrix if every element in them is identical. This implies that the matrices have the same dimensions.

2.3 Types of matrices

• Square: the number of rows equals the number of columns.

• Symmetric: a matrix whose elements off the main diagonal are a mirror-image of one another (⇒ A′ = A). Symmetric matrices must be square.

• Diagonal: all off-diagonal elements are zero.

• Identity: a square matrix with all elements on the main diagonal equal to 1 and all other elements equal to 0. This functions as the matrix equivalent of the number one.

• Idempotent: a matrix A is idempotent if A² = A (⇒ Aⁿ = A). For a matrix to be idempotent it must be square.


2.4 Addition and subtraction

• Addition works as though we were dropping one matrix over another, adding the two exactly together. For that reason, we can add only matrices of exactly the same dimensions, and we find the sum by adding together each equivalently located element. Example:

[2 4; 1 3] + [1 1; 1 1] = [3 5; 2 4]

• In subtraction, it is simply as though we are removing one matrix from the other, so we subtract each element from each equivalent element. Example:

[2 4; 1 3] − [1 1; 1 1] = [1 3; 0 2]

• Properties of Addition:

A + 0 = A

(A + B) = (B + A)

(A + B)′ = A′ + B′

(A + B) + C = A + (B + C)

2.5 Multiplying a vector or matrix by a constant

• For any constant k, kA is the matrix with elements ka_ij ∀ i, j.


2.6 Multiplying two vectors/matrices

• To multiply two vectors or two matrices together, they must be conformable: the column dimension of the first must match the row dimension of the second.

• Inner-product multiplication. For two vectors, a and b, the inner-product is written as a′b.

Let a = [1; 2; 4] and b = [2; 4; 3]. Then

a′b = [1 2 4][2; 4; 3] = (1)(2) + (2)(4) + (4)(3) = 22

• Note: a′ is 1 × 3, b is 3 × 1, and the result is 1 × 1.

• Note: if a′b = 0, they are said to be orthogonal (⇒ the two vectors in space are at right angles to each other).

2.7 Matrix multiplication

• Same as vector multiplication, except that we treat each row of the first matrix as a transposed vector and multiply it with each column of the second. An example helps:

[2 4; 1 3] · [1 2; 1 1] = [(2)(1) + (4)(1) (2)(2) + (4)(1); (1)(1) + (3)(1) (1)(2) + (3)(1)] = [6 8; 4 5]

• Again, the number of columns in the first matrix must be equal to the number of rows in the second matrix; otherwise there will be an element that has nothing to be multiplied with.

• Thus, for A_{n×m} · B_{j×k}, we need m = j. Note also that

A_{n×m} · B_{m×k} = C_{n×k}


• It makes a difference which matrix is pre-multiplying and which is post-multiplying (e.g., A · B = B · A only under special conditions).

• Properties of Multiplication:

AI = A

(AB)′ = B′A′

(AB)C = A(BC)

A(B + C) = AB + AC or (A + B)C = AC + BC
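The worked 2 × 2 product above, and two of these properties, can be verified with numpy:

```python
import numpy as np

# The 2x2 product from the notes, and the transpose rule (AB)' = B'A'.
A = np.array([[2, 4],
              [1, 3]])
B = np.array([[1, 2],
              [1, 1]])

print(A @ B)                                  # [[6 8], [4 5]], as in the notes
assert np.array_equal((A @ B).T, B.T @ A.T)   # (AB)' = B'A'
assert not np.array_equal(A @ B, B @ A)       # order matters in general
```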

2.8 Representing a regression model via matrix

multiplication

• We have a model: y = β0 + β1X + ε, implying:

y1 = β0 + β1X1 + ε1

y2 = β0 + β1X2 + ε2

• This can be represented in matrix form by:

y = Xβ + ε

where y is n× 1, X is n× 2, β is 2× 1, and ε is n× 1.


2.9 Using matrix multiplication to compute useful

quantities

• To get the average of four observations of a random variable: 30, 32, 31, 33.

[1/4 1/4 1/4 1/4][30; 32; 31; 33] = x̄, and this can also be written as (1/n) i′x = x̄,

where i is a column vector of ones of dimensions n × 1 and x is the data vector.

• Suppose we want to get a vector containing the deviation of each observation of x from its mean (why would that be useful?):

[x1 − x̄; x2 − x̄; …; xn − x̄] = [x − i x̄] = [x − (1/n) i i′x]

• Since x = Ix,

[x − (1/n) i i′x] = [Ix − (1/n) i i′x] = [I − (1/n) i i′]x = M0x

• M0 has (1 − 1/n) for its diagonal elements and −1/n for all its off-diagonal elements (∴ symmetric). Also, M0 is equal to its square, M0M0 = M0, so it is idempotent.
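These claimed properties of M0 can be checked directly, using the four observations from the averaging example:

```python
import numpy as np

# M0 = I - (1/n) i i' turns x into deviations from its mean, and is
# symmetric and idempotent.
x = np.array([30.0, 32.0, 31.0, 33.0])
n = len(x)

i = np.ones((n, 1))
M0 = np.eye(n) - (1.0 / n) * (i @ i.T)

print(M0 @ x)                     # [-1.5  0.5 -0.5  1.5]
assert np.allclose(M0, M0.T)      # symmetric
assert np.allclose(M0 @ M0, M0)   # idempotent
```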

2.10 Representing a system of linear equations via

matrix multiplication

• Say that we have a fairly elementary set of linear equations:

2x + 3y = 5

3x− 6y = −3


• Matrix representation:

[2 3; 3 −6][x; y] = [5; −3]

• In general form, this would be expressed as: Ax = b

• Solve by using the inverse, A⁻¹, where A⁻¹A = I.

• Pre-multiply both sides of the equation by A⁻¹:

A⁻¹Ax = A⁻¹b ⇒ Ix = A⁻¹b ⇒ x = A⁻¹b

• Only square matrices have inverses, but not all square matrices possess an inverse.

• To have an inverse (i.e., be non-singular), a matrix must be "full rank" ⇒ its determinant (denoted |A|) is not zero.

• For a 2 × 2 matrix A = [a b; c d], |A| = ad − bc and

A⁻¹ = (1/|A|)[d −b; −c a]
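Applying the 2 × 2 formula to the system above (2x + 3y = 5, 3x − 6y = −3) and checking against numpy's solver:

```python
import numpy as np

# Solve Ax = b via the explicit 2x2 inverse formula.
A = np.array([[2.0, 3.0],
              [3.0, -6.0]])
b = np.array([5.0, -3.0])

det = A[0, 0] * A[1, 1] - A[0, 1] * A[1, 0]          # ad - bc = -21
A_inv = (1.0 / det) * np.array([[ A[1, 1], -A[0, 1]],
                                [-A[1, 0],  A[0, 0]]])

print(A_inv @ b)                                     # [1. 1.]
assert np.allclose(A_inv @ b, np.linalg.solve(A, b))
```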


2.11 What is rank and why would a matrix not be of full rank?

• Let us consider a set of two linear equations with two unknowns:

3x + y = 7

6x + 2y = 14

[3 1; 6 2][x; y] = [7; 14]

• Can't solve this: the second equation is not truly a different equation; it is a "linear combination" of the first ⇒ an infinity of solutions.

• The matrix [3 1; 6 2] does not have full "row rank."

• One of its rows can be expressed as a linear combination of the others (⇒ its determinant = 0).

• It does not have full column rank either; if a square matrix does not have full row rank, it will not have full column rank.

• Rows and columns must be "linearly independent" for a matrix to have full rank.


2.12 The Rank of a non-square matrix

• For non-square matrices, rank(A) = rank(A′) ≤ min(number of rows, number of columns).

• For example, for the matrix [5 1; 2 −3; 7 4], the rank of the matrix is at most two (the number of columns).

• Why is this? Let us look at the three rows separately and think of each of them as a vector in space. If we have two rows with two elements, we can produce the third row via a linear combination of the first two, and we can do this for any third row that we could imagine. In technical language, the matrix only spans a vector space of dimension two.
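numpy can report the rank directly; below, the singular matrix from the previous section and the 3 × 2 matrix from this one:

```python
import numpy as np

# Rank checks: a square matrix with linearly dependent rows, and a
# non-square matrix whose rank is capped by its column count.
A = np.array([[3.0, 1.0],
              [6.0, 2.0]])           # row 2 = 2 * row 1  ->  rank 1
B = np.array([[5.0, 1.0],
              [2.0, -3.0],
              [7.0, 4.0]])           # rank at most min(3, 2) = 2

print(np.linalg.matrix_rank(A))      # 1
print(np.linalg.matrix_rank(B))      # 2
assert abs(np.linalg.det(A)) < 1e-12   # not full rank -> determinant zero
```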

2.13 Application of the inverse to regression analysis

• Can’t use this to directly solve

y = Xβ + ε

• We have the error vector and X is not square.

• But we will use this to get β̂ by seeking a solution that minimizes the sum of squared residuals; this will give us a system of K equations from which we may derive K unknown β coefficients.


2.14 Partial differentiation

• To find the partial derivative of a function f(x) of a vector x, ∂f(x)/∂x: take the derivative of f with respect to each element x_i and stack the results.

• ∂(a′x)/∂x = a

• ∂(x′Ax)/∂x = 2Ax (for symmetric A).

2.15 Multivariate distributions

• A multivariate distribution describes the joint distribution of any group of random variables X1 to Xn.

• This set can be collected as a vector or matrix, and we can write their moments (i.e., mean and variance) in matrix notation.

• E.g., if X1 to Xn are written as x, then E(x) = [E(X1), E(X2), . . . , E(Xn)].

• For any n × 1 random vector x, its variance-covariance matrix, denoted var(x), is defined as:

var(x) = [σ²_1 σ_12 … σ_1n; σ_21 σ²_2 … σ_2n; ⋮ ⋱ ⋮; σ_n1 σ_n2 … σ²_n] = E[(x − µ)(x − µ)′] = Σ (2.1)

• This matrix is symmetric. Why? In addition, if the individual elements of x are independent, then the off-diagonal elements are equal to zero. Why?

• If x is multivariate normal, we write x ∼ N(µ, Σ).

• The properties of the multivariate normal ⇒ each element of this vector is normally distributed, and any two elements of x are independent iff they are uncorrelated.


Section 3

The Classical Regression Model

3.1 Overview of ordinary least squares

• OLS is appropriate whenever the Gauss-Markov assumptions are satisfied.

• Technically simple and tractable.

• Allows us to develop the theory of hypothesis testing.

• Regression analysis tells us what is the relationship between a dependent variable, y, and an independent variable, x.

3.2 Optimal prediction and estimation

• Suppose that we observe some data (x_i, y_i) for i = 1, . . ., N observational units or individuals.

• Our theory tells us that x_i and y_i are related in some systematic fashion for all individuals.

• Use the tools of statistics to describe this relationship more precisely and to test its empirical validity.

• For now, no assumptions concerning the functional form linking the variables in the data set.

• First problem: how to predict y_i optimally given the values of x_i?

• x_i is known as an independent variable and y_i is the dependent or endogenous variable.

• The function that delivers the best prediction of y_i for given values of x_i is known as an optimal predictor.

The "best predictor" is a real-valued function g(·), not necessarily linear.


Figure 3.1: Data plot (scatterplot of Y against X).


• How to define "best predictor"? One criterion we could apply is to minimize the mean-square error, i.e., to solve:

min_{g(·)} E[(y_i − g(x_i, β))²]

• It turns out that the optimal predictor in this sense is the conditional expectation of y given x, also known as the regression function, E(y_i|x_i) = g(x_i, β).

• g(·) presumably involves a number of fixed parameters (β) needed to correctly specify the theoretical relationship between y_i and x_i.

• Whatever part of the actually observed value of y_i is not captured in g(x_i, β) must be a random component which has a mean of zero conditional on x_i.

• ∴ another way to write the equation above, in a way that explicitly captures the randomness inherent in any political process, is as follows:

y_i = g(x_i, β) + ε_i

where ε_i is an error term or disturbance, with a mean of zero conditional on x_i. For the moment, we shall not say anything else about the distribution of ε_i.

3.3 Criteria for optimality for estimators

• Need to define what we mean by "good" or optimal estimators.

• Normally, there are two dimensions along which we compare estimators.

1. Unbiasedness, i.e.: E(β̂) = β

• Large sample analog of unbiasedness is "consistency".

2. Efficiency: var(β̂) ≤ var(β*), where β* is any unbiased estimator for β.

• Large sample analog is asymptotic efficiency.


3.4 Linear predictors

• We often restrict g(·) to being a linear function of x_i, or a linear function of non-linear transformations of x_i.

• More specifically, the best linear predictor (BLP) of y_i given values of x_i is denoted by:

E∗(y_i|x_i) = β0 + β1x_i.

• Note that the BLP is not the same as the conditional expectation, unless the conditional expectation happens to be linear. As before, we can write the equation in an alternative but equivalent fashion:

y_i = β0 + β1x_i + ε_i

• The assumption of linearity corresponds to a restriction that the true relationship between y_i and x_i be a line in (y_i, x_i) space. Such a line is known as a regression line.

• Moreover, the fact that we are trying to predict y_i given x_i requires that the error term be equal to the vertical distance between the true regression line and the data points.

• Thus, in order to estimate the true regression line (that is, to approximate it using the information contained in our finite sample of observations), one possible criterion will consist of minimizing some function of the vertical distance between the estimated line and the scattered points.

In doing so, we hope to exhaust all the information in x_i that is useful in order to predict y_i linearly.

• The problem that we are now confronted with consists of estimating the true parameters β0 and β1 and drawing statistical inferences from these estimates.

• One way: ordinary least squares estimation (OLS).

• Under special assumptions (the Gauss-Markov assumptions or assumptions of the classic linear regression model), OLS estimates are "optimal" in terms of unbiasedness and efficiency.


3.5 Ordinary least squares

• The estimates of β0 and β1 resulting from the minimization of the sum of squared vertical deviations are known as the OLS estimates.

3.5.1 Bivariate example

• The vertical deviations from the estimated line, or residuals, e_i, are as follows:

e_i = y_i − β̂0 − β̂1x_i,

where β̂0 and β̂1 are the estimates of the parameters that fully describe the regression line.

• Thus, the objective is to solve the following problem:

min_{β̂0, β̂1} Σ_{i=1}^{N} e_i² = Σ_{i=1}^{N} (y_i − β̂0 − β̂1x_i)²

• The first order conditions for a minimum (applying the chain rule) are:

∂(Σ_{i=1}^{N} e_i²)/∂β̂0 = Σ_{i=1}^{N} 2(y_i − β̂0 − β̂1x_i)(−1) = 0 ⇒ Σ_{i=1}^{N} e_i = 0

∂(Σ_{i=1}^{N} e_i²)/∂β̂1 = Σ_{i=1}^{N} 2(y_i − β̂0 − β̂1x_i)(−x_i) = 0 ⇒ Σ_{i=1}^{N} x_i e_i = 0

• These are the "normal equations". They can be re-written by substituting for e_i and collecting terms. This yields the following two expressions:

Σ_{i=1}^{N} y_i = Nβ̂0 + β̂1 Σ_{i=1}^{N} x_i

Σ_{i=1}^{N} x_i y_i = β̂0 Σ_{i=1}^{N} x_i + β̂1 Σ_{i=1}^{N} x_i²


• To obtain a solution, divide the first equation by N:

ȳ = β̂0 + β̂1x̄

⇒ the OLS regression line passes through the mean of the data (though this may not be true if there is no constant in the model) and

β̂0 = ȳ − β̂1x̄

• Use the last expression in place of β̂0 in the solution for β̂1 and use Σ_{i=1}^{N} x_i = Nx̄:

Σ_{i=1}^{N} x_i y_i − Nx̄ȳ = β̂1 (Σ_{i=1}^{N} x_i² − Nx̄²)

or

β̂1 = (Σ_{i=1}^{N} x_i y_i − Nx̄ȳ)/(Σ_{i=1}^{N} x_i² − Nx̄²) = Σ_{i=1}^{N} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{N} (x_i − x̄)²
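A sketch of the closed-form solution on simulated data (the true coefficients are arbitrary choices), checked against numpy's least-squares fit:

```python
import numpy as np

# Bivariate OLS from the formulas above:
# b1 = sum (x - xbar)(y - ybar) / sum (x - xbar)^2, b0 = ybar - b1 * xbar.
rng = np.random.default_rng(4)
x = rng.normal(size=100)
y = 1.0 + 2.0 * x + rng.normal(size=100)   # true beta0 = 1, beta1 = 2

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()

print(b0, b1)                              # close to 1 and 2
assert np.allclose([b1, b0], np.polyfit(x, y, 1))
```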

• Check to see if this is a minimum. The matrix of second derivatives is:

[∂²(Σ e_i²)/∂β̂0² ∂²(Σ e_i²)/∂β̂0∂β̂1; ∂²(Σ e_i²)/∂β̂0∂β̂1 ∂²(Σ e_i²)/∂β̂1²] = [2N 2Nx̄; 2Nx̄ 2Σ_{i=1}^{N} x_i²]

• Sufficient condition for a minimum: matrix must be positive definite.

• The two diagonal elements are positive, so we only need to verify that the determinant is positive:


|D| = 4N Σ_i x_i² − 4(Σ_i x_i)² = 4N Σ_i x_i² − 4N²x̄² = 4N(Σ_i x_i² − Nx̄²) = 4N Σ_i (x_i − x̄)² > 0 if there is variation in x

• If there are more regressors, one can add normal equations of the form

∂(Σ e_i²)/∂β̂k = 0.

• Math is tedious; more convenient to use matrix notation.


Figure 3.2: Data plot with OLS regression line.


3.5.2 Multiple regression

• When we want to know the direct effects of a number of independent variables on a dependent variable, we use multiple regression by assuming that

E(yi|x1i, . . . , xki) = β0 + β1x1i + . . . + βkxki

(Notice that there is no x0i. Actually, we assume that x0i = 1 for all i.)

• May also be written as

yi = β0 + β1x1i + . . . + βkxki + εi

• The sample regression function can be estimated by least squares as well and can be written as:

ŷ_i = β̂0 + β̂1x_1i + . . . + β̂k x_ki

or

y_i = β̂0 + β̂1x_1i + . . . + β̂k x_ki + e_i

• Interpretation of coefficients: β̂k is the effect on y of a one-unit increase in x_k, holding all of the other xs constant.

Example: Infant deaths per 1000 live births

Infant mortality = 125−.972×(% Females who Read)−.002×(Per capita GNP)

Holding per capita GNP constant, a one percentage point increase in female literacy reduces the expected infant mortality rate by .972 deaths per 1000 live births.

Holding female literacy rates constant, a one-thousand-dollar increase in per capita GNP reduces the expected infant mortality rate by 2.234 deaths per 1000 live births (the displayed coefficient, −.002, is rounded).


3.5.3 The multiple regression model in matrix form

• To expedite presentation and for computational reasons, it is important to be able to express the linear regression model in matrix form. To this end, note that the observations of a multiple regression can be written as:

y_1 = β0 + β1x_11 + . . . + βk x_k1 + . . . + βK x_K1 + ε_1
⋮
y_i = β0 + β1x_1i + . . . + βk x_ki + . . . + βK x_Ki + ε_i
⋮
y_n = β0 + β1x_1n + . . . + βk x_kn + . . . + βK x_Kn + ε_n

• Now let y = [y_1; …; y_n], β = [β0; …; βK], ε = [ε_1; …; ε_n], and X = [1 x_11 … x_1K; ⋮ ⋱ ⋮; 1 x_n1 … x_nK].

• Now the entire model can be written as

y = Xβ + ε

• We will often want to refer to specific vectors within X. Let x_k refer to a column in X (the kth independent variable) and x_i be a row in X (the independent variables for the ith observation).

• Thus, we can write the model for a specific observation as

yi = x′iβ + εi

3.5.4 OLS using matrix notation

• The sum of squared residuals is

Σ_{i=1}^{N} e_i² = Σ_{i=1}^{N} (y_i − x_i′β̂)²

where β̂ is now a vector of coefficients to be estimated.


• The minimization problem can now be expressed as

min_{β̂} e′e = (y − Xβ̂)′(y − Xβ̂)

• Expanding this gives:

e′e = y′y − β̂′X′y − y′Xβ̂ + β̂′X′Xβ̂

or

e′e = y′y − 2β̂′X′y + β̂′X′Xβ̂

• The first order conditions now become:

∂e′e/∂β̂ = −2X′y + 2X′Xβ̂ = 0

• The solution then satisfies the least squares normal equations:

X′Xβ̂ = X′y

• If the inverse of X′X exists (i.e., assuming full rank), then the solution is:

β̂ = (X′X)⁻¹X′y
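A sketch of the matrix solution on simulated data (coefficients and sample size are arbitrary choices); np.linalg.solve is used on the normal equations rather than forming the inverse explicitly:

```python
import numpy as np

# beta_hat = (X'X)^{-1} X'y, with an intercept column of ones in X.
rng = np.random.default_rng(5)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 0.5 + 1.5 * x1 - 2.0 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # solves X'X b = X'y

print(beta_hat)          # near the true values [0.5, 1.5, -2.0]
assert np.allclose(beta_hat, np.linalg.lstsq(X, y, rcond=None)[0])
```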


Section 4

Properties of OLS in finite samples

4.1 Gauss-Markov assumptions

• OLS estimation relies on five basic assumptions about the way in which the data are generated.

• If these assumptions hold, then OLS is BLUE (Best Linear Unbiased Estimator).

Assumptions:

1. The true model is a linear functional form of the data: y = Xβ + ε.

2. E[ε|X] = 0

3. E[εε′|X] = σ²I

4. X is n × k with rank k (i.e., full column rank)

5. ε|X ∼ N[0, σ²I]

In English:

• Assumptions 1 and 2 ⇒ E[y|X] = Xβ. (Why?)

• Assumption 2: the "strict exogeneity" assumption. ⇒ E[ε_i] = 0, since E[ε_i] = E_X[E[ε_i|X_i]] = 0 (via the Law of Iterated Expectations).

• Assumption 3: errors are spherical/iid.

• Assumption 4: the data matrix has full column rank ⇒ X′X is invertible (or "non-singular"); X is not characterized by perfect multicollinearity.

• Sometimes we add to Assumption 4 "X is a non-stochastic matrix" or "X is fixed in repeated samples." This amounts to saying that X is something that we fix in an experiment. Generally not true for political science.


We will assume X can be fixed or random, but it is generated by a mechanism unrelated to ε.

• Note that

E(εε′) = [E(ε_1²) E(ε_1ε_2) … E(ε_1ε_N); ⋮ ⋱ ⋮; E(ε_Nε_1) E(ε_Nε_2) … E(ε_N²)]

var(ε_i) = E[(ε_i − E(ε_i))²] = E[ε_i²]

and

cov(ε_i, ε_j) = E[(ε_i − E(ε_i))(ε_j − E(ε_j))] = E(ε_iε_j)

• Thus, we can understand the assumption that E(εε′) = σ²I_N as a statement about the variance-covariance matrix of the error terms, which is equal to:

[σ² 0 … 0; 0 σ² … 0; ⋮ ⋱ ⋮; 0 0 … σ²]

• Diagonal elements ⇒ homoskedasticity; off-diagonal elements ⇒ no autocorrelation.

4.2 Using the assumptions to show that β̂ is unbiased

• β̂ is unbiased if E[β̂] = β

β̂ = (X′X)⁻¹X′y
= (X′X)⁻¹X′(Xβ + ε)
= β + (X′X)⁻¹X′ε

• Taking expectations:

E[β̂|X] = β + (X′X)⁻¹X′E[ε|X] = β


• So the coefficients are unbiased "conditioned" on the data set at hand. Using the law of iterated expectations, however, we can also say something about the unconditional expectation of the coefficients.

• By the LIE, the unconditional expectation of β̂ can be derived by "averaging" the conditional expectation over all the samples of X that we could observe.

E[β̂] = E_X[E[β̂|X]] = β + E_X[(X′X)⁻¹X′E[ε|X]] = β

The last step holds because the expectation over X of something that is always equal to zero is still zero.

4.3 Using the assumptions to find the variance of β̂

• Now that we have the expected value of β̂, we should be able to find its variance using the equation for the variance-covariance matrix of a vector of random variables (see Eq. 2.1).

var(β̂) = E[(β̂ − β)(β̂ − β)′]

β̂ − β = (X′X)⁻¹X′(Xβ + ε) − β
= β + (X′X)⁻¹X′ε − β
= (X′X)⁻¹X′ε

• Thus:

var(β̂) = E[((X′X)⁻¹X′ε)((X′X)⁻¹X′ε)′]

• Use the transpose rule (AB)′ = B′A′:

((X′X)⁻¹X′ε)′ = ε′X[(X′X)⁻¹]′


• The transpose of the inverse is equal to the inverse of the transpose:

[(X′X)⁻¹]′ = [(X′X)′]⁻¹

• Since X′X is a symmetric matrix:

[(X′X)⁻¹]′ = [(X′X)′]⁻¹ = (X′X)⁻¹

which gives:

((X′X)⁻¹X′ε)′ = ε′X(X′X)⁻¹

• Then,

var(β̂) = E[((X′X)⁻¹X′ε)(ε′X(X′X)⁻¹)] = E[(X′X)⁻¹X′εε′X(X′X)⁻¹]

• Next, let's apply the conditional (rather than unconditional) expectations operator to obtain the conditional variance:

E[(X′X)⁻¹X′εε′X(X′X)⁻¹|X] = (X′X)⁻¹X′E(εε′|X)X(X′X)⁻¹
= (X′X)⁻¹X′σ²IX(X′X)⁻¹
= σ²(X′X)⁻¹X′X(X′X)⁻¹ = σ²(X′X)⁻¹

• But the variance of β̂ depends on the unconditional expectation.

• Use a theorem on the decomposition of variance from Greene, 6th ed., p. 1008:

var(β̂) = E_X[var[β̂|X]] + var_X[E[β̂|X]]

• The second term in this decomposition is zero since E[β̂|X] = β for all X, thus:

var(β̂) = σ²(X′X)⁻¹


4.4 Finding an estimate for σ²

• Since σ² = E[(ε − E(ε))²] = E[ε²], and e is an estimate of ε, Σ e_i²/N would seem a natural choice.

• But e_i = y_i − x_i′β̂ = ε_i − x_i′(β̂ − β).

• Note that:

e = y − Xβ̂ = y − X(X′X)⁻¹X′y = (I_N − X(X′X)⁻¹X′)y

• Let

M = I_N − X(X′X)⁻¹X′

M is a symmetric, idempotent matrix, like M0 (⇒ M = M′ and MM = M).

• Note MX = 0.

• We can also show (by substituting for y) that:

e = My ⇒ e′e = y′M′My = y′My = y′e = e′y

and (even more useful)

e = My = M[Xβ + ε] = MXβ + Mε = Mε

• So E(e′e) = E(ε′Mε).


• Now we need some matrix algebra: the trace operator gives the sum of the diagonal elements of a matrix. So, for example:

tr[1 7; 2 3] = 1 + 3 = 4

• Three key results for trace operations:

1. tr(ABC) = tr(BCA) = tr(CAB) (cyclic permutations)

2. tr(cA) = c · tr(A)

3. tr(A−B) = tr(A)− tr(B)

• For a 1 × 1 matrix, the trace is equal to the matrix (Why?). The matrix ε′Mε is a 1 × 1 matrix. As a result:

E(ε′Mε) = E[tr(ε′Mε)] = E[tr(Mεε′)]

using the cyclic-permutation result above, and

E[tr(Mεε′)] = tr(ME[εε′]) = tr(Mσ²I) = σ²tr(M)

• Thus, E(e′e) = σ²tr(M), and

tr(M) = tr(I_N − X(X′X)⁻¹X′)
= tr(I_N) − tr(X(X′X)⁻¹X′)
= N − tr(X′X(X′X)⁻¹)
= N − tr(I_k) = N − k

• Putting this all together: E(e′e) = σ²(N − k) ⇒ s² = e′e/(N − k) is an unbiased estimator for σ².

• Thus, our estimate is var(β̂) = s²(X′X)⁻¹.
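Continuing the earlier simulated-data style, s² and the estimated covariance matrix of β̂ can be computed as:

```python
import numpy as np

# s^2 = e'e / (N - k) and var_hat(beta_hat) = s^2 (X'X)^{-1}.
rng = np.random.default_rng(6)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([0.5, 1.5, -2.0])
y = X @ beta + rng.normal(size=n)         # true sigma^2 = 1

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
s2 = (e @ e) / (n - k)
var_beta = s2 * np.linalg.inv(X.T @ X)

print(s2)                                 # near the true sigma^2 = 1
print(np.sqrt(np.diag(var_beta)))         # estimated standard errors
```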


4.5 Distribution of the OLS coefficients

• Use Assumption Five to figure this out.

• In solving for β̂, we arrived at an expression that was a linear function of the error terms. Earlier, we stated that β̂ = β + (X′X)⁻¹X′ε; this is a linear function of ε, analogous to Az + b, where A = (X′X)⁻¹X′, b = β, and z = ε, a random vector.

• Next, we can use a property of the multivariate normal distribution:

If z is a multivariate normal vector (i.e., z ∼ N[µ, Σ]), then Az + b ∼ N[Aµ + b, AΣA′]

This implies that β̂|X ∼ N[β, σ²(X′X)⁻¹]

• If we could realistically treat X as non-stochastic, then we could just say that β̂ was distributed multivariate normal, without any conditioning on X.

• Later, in the section on asymptotic results, we will be able to show that the distribution of β̂ is approximately normal as the sample size gets large, without any assumptions on the distribution of the true error terms and without having to condition on X.

• In the meantime, it is also useful to remember that, for a multivariate normal distribution, each element of the vector β̂ is also distributed normal, so that we can say that:

β̂_k|X ∼ N[β_k, σ²(X′X)⁻¹_kk]

• Will use this result for hypothesis testing.


4.6 Efficiency of OLS regression coefficients

• To show this, we must prove that there is no other linear unbiased estimator β̃ of the true parameters β such that var(β̃) < var(β̂).

• Let β̃ be a linear estimator of β calculated as Cy, where C is a K × n matrix.

• If β̃ is unbiased for β, then:

E[β̃] = E[Cy] = β

which implies that

E[C(Xβ + ε)] = β ⇒ CXβ + CE[ε] = β ⇒ CXβ = β and CX = I

We will use the last result when we calculate the variance of β̃.

var[β̃] = E[(β̃ − E[β̃])(β̃ − E[β̃])′]
= E[(β̃ − β)(β̃ − β)′]
= E[(Cy − β)(Cy − β)′]
= E[((CXβ + Cε) − β)((CXβ + Cε) − β)′]
= E[(β + Cε − β)(β + Cε − β)′]
= E[(Cε)(Cε)′]
= CE[εε′]C′
= σ²CC′

• Now we want to compare this variance with var(β̂) = σ²(X′X)⁻¹.


• Let C = D + (X′X)⁻¹X′

• D could be positive or negative. We are not saying anything yet about the "size" of this matrix versus (X′X)⁻¹.

• Now, var[β̃] = σ²CC′ = σ²(D + (X′X)⁻¹X′)(D + (X′X)⁻¹X′)′

• Recalling that CX = I, we can now say that:

(D + (X′X)⁻¹X′)X = I

so that

DX + (X′X)⁻¹X′X = I, and DX + I = I, and DX = 0

• Given this last step, we can finally re-express the variance of β̃ in terms of D and X.

var[β̃] = σ²(D + (X′X)⁻¹X′)(D + (X′X)⁻¹X′)′

= σ²(D + (X′X)⁻¹X′)(D′ + [(X′X)⁻¹X′]′)   (using (A + B)′ = A′ + B′)

= σ²(D + (X′X)⁻¹X′)(D′ + X(X′X)⁻¹)

= σ²(DD′ + DX(X′X)⁻¹ + (X′X)⁻¹X′D′ + (X′X)⁻¹X′X(X′X)⁻¹)

= σ²(DD′ + (X′X)⁻¹)   (using DX = 0 and X′D′ = (DX)′ = 0)

Therefore:

var[β̃] = σ²(DD′ + (X′X)⁻¹) = var(β̂) + σ²DD′

• The diagonal elements of var[β̃] and var(β̂) are the variances of the estimates. So to prove efficiency it is sufficient to show that the diagonal elements of σ²DD′ are all non-negative.


• Since each diagonal element of DD′ is made up of a sum of squares (i.e., Σj d²ij), no diagonal element could be negative.

• They could all be zero, but if so then D = 0 and C = (X′X)⁻¹X′, meaning that β̃ = Cy = β̂ and thus completing the proof.

• In conclusion, any other linear unbiased estimator of β will have a sampling variance at least as large as the sampling variance of the OLS regression coefficients.

• OLS is "good" in the sense that it produces regression coefficients that are "BLUE": the best linear unbiased estimator of the true, underlying population parameters.
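To make the Gauss-Markov result concrete, here is a small illustration of my own (not from the notes): an alternative linear unbiased estimator that discards half the sample still satisfies CX = I, but its variance σ²CC′ exceeds the OLS variance on every diagonal element.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 100, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])

A = np.linalg.inv(X.T @ X) @ X.T                  # OLS: beta-hat = A y
X1 = X[: n // 2]
C = np.zeros_like(A)
C[:, : n // 2] = np.linalg.inv(X1.T @ X1) @ X1.T  # alternative: uses first half only

# CX = I, so Cy is unbiased; its sampling variance is sigma^2 CC'
var_ols = sigma**2 * np.diag(A @ A.T)             # = sigma^2 diag((X'X)^{-1})
var_alt = sigma**2 * np.diag(C @ C.T)
```

Both estimators are unbiased, but throwing away data inflates the sampling variance, exactly as the σ²DD′ term predicts.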


Section 5

Inference using OLS regression coefficients

• While we have explored the properties of the OLS regression coefficients, we have not explained what they tell us about the true parameters, β, in the population. This is the task of statistical inference.

• Inference in multivariate regression analysis is conducted via hypothesis testing. Two approaches:

1. Test of significance approach

2. Confidence interval approach

• Some key concepts: the size and the power of a test, Type I and Type II errors, and the null hypothesis and the alternative hypothesis.

5.1 A univariate example of a hypothesis test

• In the univariate case, we were often concerned with finding the average of a random distribution. Via the Central Limit Theorem, we could say that X̄ ∼ N(µ, σ²/n). Thus, it followed that:

Z = (X̄ − µ)/σX̄ = (X̄ − µ)/(σ/√n) = √n(X̄ − µ)/σ ∼ N(0, 1)

• We could then use the Z distribution to perform hypothesis tests.

• E.g.: H0 : µ = 4 and H1 : µ ≠ 4.

• Next, let us decide the size of the test. Suppose we want to have 95% "confidence" with regard to our inferences, implying a confidence coefficient of (1 − α) = 0.95, so α, which is the size of the test, is equal to 0.05. The size of the test gives us a critical value of Z = ±Zα/2.

• Then the test of significance approach is to calculate (X̄ − µ)/(σ/√n) directly.


• If (X̄ − µ)/(σ/√n) > +Zα/2 or (X̄ − µ)/(σ/√n) < −Zα/2, we reject the null hypothesis. Intuitively, if X̄ is very different from what we said it is under the null, we reject the null.

• The confidence interval approach is to construct the 95% confidence interval for µ:

Pr[X̄ − Zα/2 · σ/√n ≤ µ ≤ X̄ + Zα/2 · σ/√n] = 0.95

• If the hypothesized value of µ = 4 lies in the confidence interval, then we fail to reject the null. If the hypothesized value does not lie in the confidence interval, then we reject H0 with 95% confidence. Intuitively, we reject the null hypothesis if the hypothesized value is far from the likely range of values of µ suggested by our estimate.
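The two approaches can be sketched numerically. The values of x̄, σ, and n below are made up for illustration; they are not from the notes.

```python
import math

xbar, sigma, n = 4.5, 2.0, 100
mu0, z_crit = 4.0, 1.96          # two-sided critical value for alpha = 0.05

# Test-of-significance approach: compute Z directly
z = (xbar - mu0) / (sigma / math.sqrt(n))
reject = abs(z) > z_crit

# Confidence-interval approach: reject iff mu0 falls outside the interval
lo = xbar - z_crit * sigma / math.sqrt(n)
hi = xbar + z_crit * sigma / math.sqrt(n)
reject_ci = not (lo <= mu0 <= hi)
```

Here z = 2.5 > 1.96 and the interval [4.108, 4.892] excludes 4, so the two approaches agree: reject H0.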

5.2 Hypothesis testing of multivariate regression coefficients

• As demonstrated earlier, the regression coefficients are distributed multivariate normal conditional on X:

β̂|X ∼ N[β, σ²(X′X)⁻¹]

and each individual regression coefficient is distributed normal conditional on X:

β̂k|X ∼ N[βk, σ²(X′X)⁻¹kk]

• Let us assume, at this point, that we can use the relationship between the marginal and conditional multivariate normal distributions to say simply that the OLS regression coefficients are normally distributed:

β̂k ∼ N[βk, σ²(X′X)⁻¹kk]


• Thus, we should be able to use

Z = (β̂k − βk)/σβ̂k = (β̂k − βk)/√(σ²(X′X)⁻¹kk) = (β̂k − βk)/(σ√((X′X)⁻¹kk)) ∼ N(0, 1)

to conduct hypothesis tests, right?

• So, why are regression results always reported with the t-statistic? The problem is that we don't know σ² and have to estimate it using s² = e′e/(N − k).

• Aside: as the sample size becomes large, we can ignore the sampling variability in s² and proceed as though s² = σ². Given our assumption that the errors are normally distributed, however, we can use the t-distribution.

5.2.1 Proof that (β̂k − βk)/se(β̂k) = (β̂k − βk)/√(s²(X′X)⁻¹kk) ∼ tN−k

• Recall: if Z1 ∼ N(0, 1) and another variable Z2 ∼ χ²k is independent of Z1, then the variable defined as:

t = Z1/√(Z2/k)

is said to follow Student's t distribution with k degrees of freedom.

• We have something that is distributed like Z1: (β̂k − βk)/(σ√((X′X)⁻¹kk)).

• What we need is to divide it by something that is chi-squared. Let us assume for now that (N − k)s²/σ² ∼ χ²N−k.


• If that were true, we could divide one by the other to get:

tN−k = [(β̂k − βk)/√(σ²(X′X)⁻¹kk)] / √([(N − k)s²/σ²]/(N − k))

= [(β̂k − βk)/(σ√((X′X)⁻¹kk))] · (σ/s)

= (β̂k − βk)/(s√((X′X)⁻¹kk))

= (β̂k − βk)/se(β̂k)

5.2.2 Proof that (N − k)s²/σ² ∼ χ²N−k

• This proof will be done by figuring out whether the component parts of this fraction are summed squares of variables that are distributed standard normal. The proof will proceed using something we used earlier (e = Mε).

(N − k)s²/σ² = e′e/σ² = ε′MMε/σ² = (ε/σ)′M(ε/σ)

• We know that each element of (ε/σ) ∼ N(0, 1)

• Thus, if M = I we have (ε/σ)′(ε/σ) ∼ χ²N. This holds because the matrix multiplication would give us the sum of N squared standard normal random variables. However, it can also be shown that if M is any idempotent matrix, then:

(ε/σ)′M(ε/σ) ∼ χ²[tr(M)]

• We showed before that tr(M) = N − k, so:

(ε/σ)′M(ε/σ) ∼ χ²N−k

• This completes the proof. In conclusion, the test statistic (β̂k − βk)/se(β̂k) ∼ tN−k


5.3 Testing the equality of two regression coefficients

• This section will introduce us to the notion of testing a "restriction" on the vector of OLS coefficients.

• Suppose we assume that the true model is:

Yi = β0 + β1X1i + β2X2i + β3X3i + εi

• Suppose also that we want to test the hypothesis H0 : β2 = β3.

• We can also express this hypothesis as: H0 : β2 − β3 = 0.

• We test this hypothesis using the estimator (β̂2 − β̂3). If the estimated value is very different from the hypothesized value of zero, then we will be able to reject the null. We can construct the t-statistic for the test using:

t = [(β̂2 − β̂3) − (β2 − β3)] / se(β̂2 − β̂3)

• What is the standard error? β̂2 and β̂3 are both random variables. From section 1.3.7, var(X − Y) = var(X) + var(Y) − 2cov(X, Y). Thus:

se(β̂2 − β̂3) = √(var(β̂2) + var(β̂3) − 2cov(β̂2, β̂3))

• How do we get those variances and covariances? They are all elements of var(β̂), the variance-covariance matrix of β̂ (which elements?).

5.4 Expressing the above as a "restriction" on the vector of coefficients

• We could re-express the hypothesis as a linear restriction on the β vector:

[0 0 1 −1] · [β0 β1 β2 β3]′ = 0 or R′β = q, where R = [0 0 1 −1]′ and q = 0


• The sample estimate of R′β is R′β̂, and the sample estimate of q is q̂ = R′β̂.

• Consistent with the procedure of hypothesis tests, we could calculate t = (q̂ − q)/se(q̂).

• To test this hypothesis, we need se(q̂). Since q̂ is a linear function of β̂, and since we have estimated the variance of β̂ as s²(X′X)⁻¹, we can estimate q̂'s variance as var(q̂) = R′[s²(X′X)⁻¹]R.

• This is just the matrix version of the rule that var(aX) = a²var(X) (see section 1.3.3).

• Thus, se(q̂) = √(R′[s²(X′X)⁻¹]R)

• Given all of this, the t-statistic for the test of the equality of β2 and β3 can be expressed as:

t = (q̂ − q)/se(q̂) = (R′β̂ − q)/√(R′[s²(X′X)⁻¹]R)

• How does that get us exactly what we had above? R′β̂ = (β̂2 − β̂3) and q = 0.


• More explicitly, writing σβi,βj for the estimated covariances in s²(X′X)⁻¹:

R′[s²(X′X)⁻¹] = [0 0 1 −1] ·

[ σ²β0     σβ0,β1   σβ0,β2   σβ0,β3 ]
[ σβ1,β0   σ²β1     σβ1,β2   σβ1,β3 ]
[ σβ2,β0   σβ2,β1   σ²β2     σβ2,β3 ]
[ σβ3,β0   σβ3,β1   σβ3,β2   σ²β3   ]

= [σβ2,β0 − σβ3,β0   σβ2,β1 − σβ3,β1   σ²β2 − σβ3,β2   σβ2,β3 − σ²β3]

And

R′[s²(X′X)⁻¹]R = [σβ2,β0 − σβ3,β0   σβ2,β1 − σβ3,β1   σ²β2 − σβ3,β2   σβ2,β3 − σ²β3] · [0 0 1 −1]′

= σ²β2 − σβ3,β2 − σβ2,β3 + σ²β3

= var(β̂2) + var(β̂3) − 2cov(β̂2, β̂3)

Thus, se(q̂) = √(R′[s²(X′X)⁻¹]R) = √(var(β̂2) + var(β̂3) − 2cov(β̂2, β̂3))

• We will come back to this when we do multiple hypotheses via F -tests.
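The equivalence just derived can be verified numerically. In this sketch (simulated data; the true β2 = β3 by construction), the restriction form R′β̂/√(R′[s²(X′X)⁻¹]R) matches the direct form built from var(β̂2) + var(β̂3) − 2cov(β̂2, β̂3).

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 120, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([0.5, 1.0, 2.0, 2.0]) + rng.normal(size=n)  # beta2 = beta3 here

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
V = (e @ e / (n - k)) * np.linalg.inv(X.T @ X)   # s^2 (X'X)^{-1}

R = np.array([0.0, 0.0, 1.0, -1.0])
t_R = (R @ b) / np.sqrt(R @ V @ R)               # restriction form
t_direct = (b[2] - b[3]) / np.sqrt(V[2, 2] + V[3, 3] - 2 * V[2, 3])
```

The two t-statistics agree to machine precision; the restriction form generalizes directly to the F-tests that follow.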


Section 6

Goodness of Fit

6.1 The R-squared measure of goodness of fit

• The original fitting criterion used to produce the OLS regression coefficients was to minimize the sum of squared errors. Thus, the sum of squared errors itself could serve as a measure of the fit of the model. In other words, how well does the model fit the data?

• Unfortunately, the sum of squared errors will always rise if we add another observation or if we multiply the values of y by a constant. Thus, if we want a measure of how well the model fits the data, we might ask instead whether variation in X is a good predictor of variation in y.

• Recall the mean deviations matrix M0 from Section 2.9, which is used to subtract the column means from every column of a matrix. E.g.,

       [ x11 − x̄1  …  x1K − x̄K ]
M0X =  [    ⋮      ⋱      ⋮    ]
       [ xn1 − x̄1  …  xnK − x̄K ]

• Using M0, we can take the mean deviations of both sides of our sample regression equation:

M0y = M0Xβ̂ + M0e

• To square the sum of deviations, we just use

(M0y)′(M0y) = y′M0M0y = y′M0y

and

y′M0y = (M0Xβ̂ + M0e)′(M0Xβ̂ + M0e)

= ((M0Xβ̂)′ + (M0e)′)(M0Xβ̂ + M0e)

= (β̂′X′M0 + e′M0)(M0Xβ̂ + M0e)

= β̂′X′M0Xβ̂ + e′M0Xβ̂ + β̂′X′M0e + e′M0e


• Now consider: what is the term M0e? The M0 matrix takes the deviation of a variable from its mean, but the mean of e is equal to zero, so M0e = e and X′M0e = X′e = 0.

• Why does the last hold true? Intuitively, the way that we have minimized the OLS residuals is to set the estimated residuals orthogonal to the data matrix; there is no information in the data matrix that helps to predict the residuals.

• You can also prove that X′e = 0 using the fact that e = Mε and that M = (IN − X(X′X)⁻¹X′):

X′e = X′(IN − X(X′X)⁻¹X′)ε = (X′ − X′X(X′X)⁻¹X′)ε = X′ε − X′ε = 0

• Given this last result, we can say that:

y′M0y = β̂′X′M0Xβ̂ + e′M0e

• The first term in this decomposition is the regression sum of squares (or the variation in y that is explained) and the second term is the error sum of squares. What we have shown is that the total variation in y can be fully decomposed into the explained and unexplained portion. That is,

SST = SSR + SSE

• Thus, a good measure of fit is the coefficient of determination, or R²:

R² = SSR/SST

• It follows that:

1 = β̂′X′M0Xβ̂/(y′M0y) + e′M0e/(y′M0y) = R² + e′M0e/(y′M0y)

and

R² = 1 − e′M0e/(y′M0y) = 1 − SSE/SST

• By construction, R² can vary between 1 (if the model fully explains y and all points lie along the regression plane) and zero (if all the slope coefficients are equal to zero).
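The decomposition SST = SSR + SSE can be checked directly. In the sketch below (simulated data, illustrative only), demeaning plays the role of M0, and both routes to R² agree.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 80
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.8, -0.5]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
yhat, e = X @ b, y - X @ b

SST = np.sum((y - y.mean()) ** 2)     # y'M0y
SSR = np.sum((yhat - y.mean()) ** 2)  # explained sum of squares
SSE = np.sum(e ** 2)                  # e'e (e has mean zero: model has a constant)
R2 = 1 - SSE / SST
```

Because the model includes a constant, the cross terms vanish and SST decomposes exactly into SSR + SSE.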


6.1.1 The uses of R2 and three cautions

• R² is frequently used to compare different models and decide between them. Other things equal, we would prefer a model that explains more to a model that explains less. There are three qualifications that need to be borne in mind, however, when using R², one of which motivates the use of the adjusted R².

1. R2 (almost) always rises when you add another regressor

• Consider a model y = Xb + e.

• Add to this model an explanatory variable, z, which we know a priori to be completely independent of y. Unless the estimated regression coefficient on z is exactly equal to zero, it will add some information to the model, automatically lowering the estimated residuals and increasing R². This leads to a temptation to throw ever more variables into the model.

• Alternative: the adjusted R²:

R̄² = 1 − [e′M0e/(y′M0y)] · [(n − 1)/(n − K)]

• As the number of regressors increases, the error sum of squares decreases, but the adjustment term increases. Thus, the adjusted R² can go up or down when you add a new regressor. The adjusted R² can also be written as:

R̄² = 1 − [(n − 1)/(n − K)](1 − R²)


2. R2 is not comparable across different dependent variables

• If you are comparing two models on the basis of R² or the adjusted R², the sample size, n, and the dependent variable must be the same.

• Say you used ln y as the dependent variable in your next model. R² measures the proportion of the variation in the dependent variable that is explained by the model, but in the first case it will be measuring the variation in y and in the second case it will be measuring the variation in ln y. The two are not expected to be the same and cannot be taken as the basis for model comparison. Also, clearly, as the number of observations changes, the adjusted R² will also change.

• The standard error of the regression is recommended instead of R² in these cases.

3. R² does not necessarily lie between zero and one if we do not include a constant in the model.

• Inclusion of a constant gives the result in the normal equations that ȳ = x̄′β̂, and this also implies that ē = 0. If you do not include a constant term in the regression model, this does not necessarily hold, which means that we cannot decompose the variation in y as we did above.

• As a result, if you do not include a constant, it is possible to get an R² of less than zero. Generally, you should exclude the constant term only if you have strong theoretical reasons to believe the dependent variable is zero when all of the explanatory variables are zero.

• R² is often abused. We will cover alternative measures that can be used to judge the adequacy of a regression model. These include Akaike's Information Criterion, the Bayes Information Criterion, and Amemiya's Prediction Criterion.


6.2 Testing Multiple Hypotheses: the F -statistic and R2

• We can use the restriction approach to test multiple hypotheses. E.g., H0 : β2 = 0 and β3 = 0.

• The method we will use can accommodate any case in which we have J < K hypotheses to test jointly, where K is the number of regressors including the constant.

• J restrictions ⇒ J rows in the restriction matrix, R. We can express the null hypothesis associated with the example above as a set of linear restrictions on the β vector, Rβ = q:

[0 0 1 0]   [β0]   [0]
[0 0 0 1] · [β1] = [0]
            [β2]
            [β3]

• Given our estimate β̂ of β, our interest centers on how much Rβ̂ deviates from q, and we base our hypothesis test on this deviation.

• Let m = Rβ̂ − q. Intuition: we will reject H0 if m is too large. Since m is a vector, we must have some way to measure distance in multi-dimensional space. One possibility, suggested by Wald, is to use the normalized sum of squares, or m′var(m)⁻¹m.

• This is equivalent to estimating (m/σm)′(m/σm).

• If m can be shown to be normal with mean zero, then the statistic above would be made up of a sum of squared standard normal variables (with the number of squared standard normal variables given by the number of restrictions) ⇒ χ² (Wald statistic).

• We know that β̂ is normally distributed ⇒ m is normally distributed (Why?).


• Further, var(m) = var(Rβ̂ − q) = σ²R(X′X)⁻¹R′, and so the statistic

W = m′var(m)⁻¹m = (Rβ̂ − q)′[σ²R(X′X)⁻¹R′]⁻¹(Rβ̂ − q) ∼ χ²J

• Since we do not know σ² and must estimate it by s², we cannot use the Wald statistic. Instead, we derive the sample statistic:

F = { (Rβ̂ − q)′[σ²R(X′X)⁻¹R′]⁻¹(Rβ̂ − q)/J } / { [(n − k)s²/σ²]/(n − k) }

[Notice that we are dividing by the same correction factor that we used to prove the appropriateness of the t-test in the single-hypothesis case.]

• Recall that

s² = e′e/(n − k) = ε′Mε/(n − k)

• Since under the null hypothesis Rβ = q, then Rβ̂ − q = Rβ̂ − Rβ = R(β̂ − β) = R(X′X)⁻¹X′ε.

• Using these expressions, which are true under the null, we can transform the sample statistic above so that it is a function of ε/σ:

F = { [R(X′X)⁻¹X′ε/σ]′[R(X′X)⁻¹R′]⁻¹[R(X′X)⁻¹X′ε/σ]/J } / { [M(ε/σ)]′[M(ε/σ)]/(n − k) }

• The F -statistic is the ratio of two quadratic forms in ε/σ.

• Thus, both the numerator and the denominator are, in effect, sums of "squared" standard normal random variables (where the ε/σ are the standard normal variables) and are therefore distributed χ².

• Since it can be shown that the two quadratic forms are independent (see Greene, 6th ed., p. 85) ⇒ the sample statistic above is distributed as F with degrees of freedom [J, n − k], where J is the number of restrictions and n − k is the degrees of freedom in the estimation of s².


• Cancel the σs to get

F = { (Rβ̂ − q)′[R(X′X)⁻¹R′]⁻¹(Rβ̂ − q)/J } / { e′e/(n − k) }

or

F = (Rβ̂ − q)′[s²R(X′X)⁻¹R′]⁻¹(Rβ̂ − q)/J

• Question posed by this test: how far is the estimated value, Rβ̂, from its value under the null, where this distance is "scaled" in the F-test by its estimated variance, equal to s²R(X′X)⁻¹R′?

• The F-statistic from our running example is:

F[2,n−k] = [β̂2 β̂3] · [ var(β̂2)      cov(β̂2, β̂3) ]⁻¹ · [β̂2]  / 2
                      [ cov(β̂3, β̂2)  var(β̂3)     ]    [β̂3]

The subscripts indicate that its degrees of freedom are two (for the number of restrictions) and (n − k) for the number of free data points used to estimate σ².

• The F-statistic can be handy when confronted with high collinearity.
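The final form of the statistic, F = (Rβ̂ − q)′[s²R(X′X)⁻¹R′]⁻¹(Rβ̂ − q)/J, can be computed directly. This sketch uses simulated data in which the null (β2 = β3 = 0) is true by construction; everything here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
n, k, J = 100, 4, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.normal(size=n)  # H0 true here

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = e @ e / (n - k)
V = s2 * np.linalg.inv(X.T @ X)              # s^2 (X'X)^{-1}

R = np.array([[0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])        # J = 2 restrictions
q = np.zeros(2)
m = R @ b - q
F = m @ np.linalg.solve(R @ V @ R.T, m) / J  # F-statistic, df [J, n-k]
```

Under a true null the resulting F[2, 96] statistic should be modest, far from the upper tail of the F distribution.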

6.3 The relationship between F and t for a single restriction

• Assume that in the example above we now have only one restriction reflected in the R matrix: β2 = 0. The relevant F-statistic is given by:

F = (β̂2)′[var(β̂2)]⁻¹(β̂2)/1

Since [var(β̂2)] is a matrix containing one element, its inverse is equal to 1/var(β̂2).

• Thus, F = β̂2²/var(β̂2), which is just equal to the square of the estimated t-statistic.


6.4 Calculation of the F-statistic using the estimated residuals

• Another approach to using the F-test is to regard it as a measure of "loss of fit." We'll show that this is equivalent, although it may be easier to implement computationally.

• Imposing any restrictions on the model is likely to increase the sum of the squared errors, simply because the minimization of the error terms is being done subject to a constraint. On the other hand, if the null hypothesis is true, then the restrictions should not substantially increase the error term. Therefore, we can build a test of the null hypothesis from the estimated residuals, which will result in an F-statistic equivalent to the one above.

• Let e∗ equal the estimated residuals from the restricted model ⇒ e∗ = y − Xβ̂∗.

• Adding and subtracting Xβ̂ from the restricted model we get:

e∗ = y − Xβ̂ − X(β̂∗ − β̂) = e − X(β̂∗ − β̂)

• The sum of squared residuals in the restricted model is:

e∗′e∗ = (e′ − (β̂∗ − β̂)′X′)(e − X(β̂∗ − β̂)) = e′e + (β̂∗ − β̂)′X′X(β̂∗ − β̂) ≥ e′e

(the cross terms vanish because X′e = 0).

• The loss of fit is equal to:

e∗′e∗ − e′e = (β̂∗ − β̂)′X′X(β̂∗ − β̂) = (Rβ̂ − q)′[R(X′X)⁻¹R′]⁻¹(Rβ̂ − q)

• In other words, the loss of fit can also be seen as (most of) the usual numerator from the F-statistic. In the case where the restriction pins down the whole vector, (Rβ̂ − q) is equal to (β̂∗ − β̂) and R is the identity matrix. Thus, we simply need to introduce our s² estimate and divide by J, the number of restrictions, to obtain an F-statistic.

FJ,n−k = [(e∗′e∗ − e′e)/J] / [e′e/(n − k)]


• Thus, an equivalent procedure for calculating the F-statistic is to run the restricted model, calculate the residuals, run the unrestricted model, calculate this set of residuals, and construct the F-statistic based on the expression above.
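That equivalence can be checked numerically. In this sketch (simulated, illustrative data), the "loss of fit" form built from restricted and unrestricted residuals matches the restriction form for H0: β2 = β3 = 0.

```python
import numpy as np

rng = np.random.default_rng(9)
n, k, J = 90, 4, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 0.5, 0.2, -0.1]) + rng.normal(size=n)

def sum_sq_resid(Xmat, y):
    b = np.linalg.solve(Xmat.T @ Xmat, Xmat.T @ y)
    e = y - Xmat @ b
    return e @ e

ee_u = sum_sq_resid(X, y)             # unrestricted e'e
ee_r = sum_sq_resid(X[:, :2], y)      # restricted: drop regressors 2 and 3
F_fit = ((ee_r - ee_u) / J) / (ee_u / (n - k))

# Restriction form for comparison
b = np.linalg.solve(X.T @ X, X.T @ y)
V = (ee_u / (n - k)) * np.linalg.inv(X.T @ X)
R = np.eye(4)[2:]                     # rows selecting beta2 and beta3
m = R @ b
F_res = m @ np.linalg.solve(R @ V @ R.T, m) / J
```

The restricted residual sum of squares is never smaller than the unrestricted one, and the two F computations agree.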

• Since e′e/SST = 1 − R², we can also divide the above expression through by the SST, to yield:

FJ,n−k = [(R² − R∗²)/J] / [(1 − R²)/(n − k)]

where R∗² is the coefficient of determination in the restricted model.

• If we want to test the significance of the model as a whole, we use:

Fk−1,n−k = [R²/(k − 1)] / [(1 − R²)/(n − k)]

6.5 Tests of structural change

• We often pool different sorts of data in order to test hypotheses of interest. For instance, we are liable to cumulate different U.S. Presidents in a time series used to test the relationship between the economy and approval, and to pool different countries when testing for the relationship between democracy and income levels.

• In some cases, this pooling is inappropriate because we would not expect the causal mechanisms that underlie our model to function identically across different countries or time periods.

• Tests for structural change or model stability (generally known as Chow tests) allow us to test whether the model is significantly different for two different time periods or different groups of observations (e.g., countries).


• Suppose that we are interested in how a set of explanatory variables affects a dependent variable before and after some critical juncture. Denote the data before the juncture as y1 and X1 and after as y2 and X2. We also allow the coefficients across the two periods to vary.

• We wish to test an unrestricted model

[y1]   [X1  0 ]   [β1]   [ε1]
[y2] = [0   X2] · [β2] + [ε2]

against the restricted model

[y1]   [X1]        [ε1]
[y2] = [X2] · β∗ + [ε2]

This is just a special case of linear restrictions, β1 = β2, where R = [I  −I] and q = 0.

• The restricted sum of squares is e∗′e∗, which we estimate using all n1 + n2 observations (so that this is just the normal sum of squared errors). The unrestricted sum of squares is e1′e1 + e2′e2, which is estimated from each sub-sample separately. Thus, we can construct an F-test from:

F[K,n1+n2−2K] = [(e∗′e∗ − e1′e1 − e2′e2)/K] / [(e1′e1 + e2′e2)/(n1 + n2 − 2K)]

• In this case, the number of restrictions, J, is equal to the number of coefficients in the restricted model, K, because we are assuming that each one of these coefficients is the same across sub-samples. If the F-statistic is sufficiently large, we will reject the hypothesis that the coefficients are equal.

• This test can also be performed using dummy variables.
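The Chow test above can be sketched on simulated data (my own illustration; sample sizes and coefficients are made up). The two sub-samples have genuinely different slopes, so the pooled (restricted) fit should lose badly to the split (unrestricted) fits.

```python
import numpy as np

rng = np.random.default_rng(10)
n1 = n2 = 60
K = 2                                        # intercept and slope
x1, x2 = rng.normal(size=n1), rng.normal(size=n2)
y1 = 1.0 + 0.5 * x1 + rng.normal(size=n1)
y2 = 1.0 + 2.0 * x2 + rng.normal(size=n2)    # different slope after the break

def sum_sq_resid(x, y):
    X = np.column_stack([np.ones(len(x)), x])
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    return e @ e

ee_pooled = sum_sq_resid(np.concatenate([x1, x2]), np.concatenate([y1, y2]))
ee_split = sum_sq_resid(x1, y1) + sum_sq_resid(x2, y2)
F_chow = ((ee_pooled - ee_split) / K) / (ee_split / (n1 + n2 - 2 * K))
```

With a slope difference of 1.5 and 120 observations, the Chow F-statistic is large, leading us to reject the pooling restriction β1 = β2.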


6.5.1 The creation and interpretation of dummy variables

• So far, we have mainly discussed continuous explanatory variables. Now we turn to the creation and interpretation of dummy variables that take on the value zero or one to indicate membership in a binary category, and discuss their uses as explanatory variables.

• Suppose that we are trying to determine whether money contributed to political campaigns varies across regions of the country (e.g., northeast, south, midwest, and west).

• We can estimate: Yi = β0 + β1D1i + β2D2i + β3D3i + εi

where

D1i = 1 if the contributor is from the south,

D2i = 1 if the contributor is from the midwest,

D3i = 1 if the contributor is from the west.

• This type of statistical analysis is called ANOVA (or analysis of variance) and informs you about how contribution levels vary by region.

• What is the interpretation of the intercept in this regression, and what is the interpretation of the dummy variables and their t-statistics?

6.5.2 The “dummy variable trap”

• We do not include all the dummy variables in the regression above. Why?

• There is a constant in the regression above. The constant appears in the X data matrix as a column of ones. Each of the dummy variables appears in the data matrix as a column of zeroes and ones. Since the regions are mutually exclusive and collectively exhaustive, if we were to introduce all the dummies into the data matrix, the sum

D1i + D2i + D3i + D4i = 1

for all observations.


• If we include all of them in the regression with a constant, then the data matrix X does not have full rank; it has rank of at most K − 1 (perfect collinearity). In this case, the matrix (X′X) is also not of full rank and has rank of at most K − 1.

• Another way of putting this is that we do not have enough information to estimate the constant separately if we include all the regional dummies: once we have them in, the overall model constant is completely determined by the individual regional intercepts and the weight of each region. It cannot be separately estimated, because it is already given by what is in the model.

• If you include a constant in the model, you can only include M − 1 of M mutually exclusive and collectively exhaustive dummy variables. Or you could drop the constant.

6.5.3 Using dummy variables to estimate separate intercepts and slopes for each group

• The model above allowed us to estimate different intercepts for each region in a model of political contributions. Suppose I am now running a model that looks at the effect of gender on campaign contributions. We might also expect personal income to affect the level of donations and include it in our model as well.

• Once we include both continuous and categorical variables in the same model, we are no longer performing ANOVA but ANCOVA (analysis of covariance). If we were interested only in estimating a model looking at the effects of gender and income on political contributions, our model would take the following form:

yi = β0 + β1dm,i + β2Xi + εi

This model simply estimates a different intercept for men.


• Suppose, however, we think that men will contribute more to political campaigns as their income rises than will women. To test this hypothesis, we would estimate the model:

Yi = β0 + β1dm,i + β2Xi + β3dm,i · Xi + εi

To include the last term in the model, simply multiply the income variable by the gender dummy. If income is measured in thousands of dollars, what is the marginal effect of an increase of $1,000 for men on expected contributions to political campaigns?

• How do we test whether gender has any impact on the marginal effect of income on political contributions? How do we test the hypothesis that gender has no impact on the behavior underlying political contributions?

• It is generally a bad idea to estimate interaction terms without including the main effects.

6.6 The use of dummy variables to perform Chow tests of structural change

• One of the more common applications of the F-test is in tests of structural change (or parameter stability) due to Chow. The test basically asks whether the behavior that one is modeling varies across different sub-sets of observations.

• We often estimate a causal model using samples that combine men and women, rich and poor countries, and different time periods. Yet we can hypothesize that the behavior that we are modeling varies by gender, income level, and time period. If that is correct, then restricting the coefficients to be the same across the different sub-sets of observations is inappropriate.

• For example, in the model of political contributions, if I assume that giving to political campaigns does depend on gender, I would estimate the "unrestricted" model:

Yi = β0 + β1dm,i + β2Xi + β3dm,i · Xi + εi


• If, however, I impose the restriction that gender does not matter, then I estimate the restricted model:

Yi = β0 + β2Xi + εi

In this case, I am restricting β1 to be equal to zero and β3 to be equal to zero.

• In general terms, we wish to test the unrestricted model where the OLS coefficients differ by gender.

• The Chow test to determine whether the restriction is valid is performed using an F-test and can be accomplished using the approach outlined above.

• This version of the Chow test is exactly equivalent to running the unrestricted model and then performing an F-test on the joint hypothesis that β1 = 0 and β3 = 0. Why is this?

• We saw previously that an F-test could also be computed using:

FJ,n−k = [(e∗′e∗ − e′e)/J] / [e′e/(n − k)]

This is just the same as we did above. We are simply getting the unrestricted error sum of squares from the regressions run on men and women separately.

• This form of the F-test could be derived as an extension of testing a restriction on the OLS coefficients.

• Under the restriction, the β vector of regression coefficients was equal to β∗. In the case outlined above:

β = [β0 β1 β2 β3]′ and β∗ = [β0 0 β2 0]′

• But the null hypothesis is given by β = β∗. In terms of our notation for expressing a hypothesis as a restriction on the vector of coefficients, Rβ = q, we have β = β∗, so R is equal to the identity matrix. Moreover, the distance Rβ̂ − q is now given by β̂∗ − β̂.


• As a result, the quantity (β̂∗ − β̂)′X′X(β̂∗ − β̂) = e∗′e∗ − e′e can be written as:

(Rβ̂ − q)′[R(X′X)⁻¹R′]⁻¹(Rβ̂ − q)

• This is most of the F-statistic, except that we now have to divide by s², equal to e′e/(n − k), our estimate of the variance of the true errors, and divide by J to get the same F-test that we would get above.

• Testing this restriction on the vector of OLS coefficients, however, is exactly the same thing as testing that β1 and β3 are jointly equal to zero, and the latter computing procedure is far easier in statistical software.

6.6.1 A note on the estimate of s2 used in these tests

• In using these Chow tests, we have assumed that σ² is the same across sub-samples, so that it is okay if we use the s² estimate from the restricted or unrestricted model (although in practice we use s² from the unrestricted model).

• If this is not persuasive, there is an alternative. Suppose that θ̂1 and θ̂2 are two normally distributed estimators of a parameter based on independent sub-samples, with variance-covariance matrices V1 and V2. Then, under the null hypothesis that the two estimators have the same expected value, θ̂1 − θ̂2 has mean 0 and variance V1 + V2. Thus, the Wald statistic:

W = (θ̂1 − θ̂2)′[V1 + V2]⁻¹(θ̂1 − θ̂2)

has a chi-square distribution with K degrees of freedom, and a test that the difference between the two parameters is equal to zero can be based on this statistic.

• What remains for us to resolve is whether it is valid to base such a test on estimates of V1 and V2. We shall see, when we examine asymptotic results, that we can do so as our sample becomes sufficiently large.


• Finally, there are some more sophisticated techniques which allow you to be agnostic about the locations of the breaks (i.e., let the data determine where they are).


Section 7

Partitioned Regression and Bias

7.1 Partitioned regression, partialling-out and applications

• We will first examine the OLS formula for the regression coefficients using "partitioned regression" as a means of better understanding the effects of excluding a variable that should have been included in the model.

• Suppose our regression model contains two sets of variables, X1 and X2. Thus:

y = Xβ + ε = X1β1 + X2β2 + ε = [X1 X2] · [β1 β2]′ + ε

• What is the algebraic solution for β̂, the estimate of β, in this partitioned version of the regression model?

• To find the solution, we use the "normal equations," (X′X)β̂ = X′y, employing the partitioned matrix [X1 X2] for X.

• In this case,

(X′X) = [X1′] · [X1 X2] = [X1′X1  X1′X2]
        [X2′]             [X2′X1  X2′X2]

• The derivation of (X′y) proceeds analogously.

• The normal equations for this partitioned matrix give us two expressions with two unknowns:

[X1′X1  X1′X2]   [β̂1]   [X1′y]
[X2′X1  X2′X2] · [β̂2] = [X2′y]


• We can solve directly for β̂1, the first set of coefficients, by multiplying through the first row of the partitioned matrix and then using the trick of pre-multiplying by (X1′X1)⁻¹:

(X1′X1)β̂1 + (X1′X2)β̂2 = X1′y

and

β̂1 = (X1′X1)⁻¹X1′y − (X1′X1)⁻¹X1′X2β̂2 = (X1′X1)⁻¹X1′(y − X2β̂2)

• We can conclude from the above that if X1′X2 = 0, i.e., if X1 and X2 are orthogonal, then the estimated coefficient β̂1 from a regression of y on X1 will be the same as the estimated coefficients from a regression of y on X1 and X2.

• If not, however, and if we mistakenly omit X2 from the regression, then reporting the coefficient of β1 as (X1′X1)−1X1′y will misstate the true coefficient by (X1′X1)−1X1′X2β2.

• The solution for β̂2 is found by multiplying through the second row of the matrix and substituting the above expression for β̂1. This method for estimating the OLS coefficients is called "partialling out."

(X2′X1)β̂1 + (X2′X2)β̂2 = X2′y

• Now substitute for β̂1:

(X2′X1)[(X1′X1)−1X1′y − (X1′X1)−1X1′X2β̂2] + (X2′X2)β̂2 = X2′y

Multiplying through and rearranging gives:

[X2′X2 − X2′X1(X1′X1)−1X1′X2]β̂2 = X2′y − X2′X1(X1′X1)−1X1′y

X2′[I − X1(X1′X1)−1X1′]X2β̂2 = X2′[I − X1(X1′X1)−1X1′]y

Solving for β̂2 gives

β̂2 = [X2′(I − X1(X1′X1)−1X1′)X2]−1[X2′(I − X1(X1′X1)−1X1′)y]
    = (X2′M1X2)−1(X2′M1y)


• Here, the matrix M is once again the "residual-maker," with the subscript indicating the relevant set of explanatory variables. Thus:

M1X2 = a vector of residuals from a regression of X2 on X1.

M1y = a vector of residuals from a regression of y on X1.

• Using the fact that M1 is symmetric and idempotent, so that M1′M1 = M1, and setting X2∗ = M1X2 and y∗ = M1y, we can write the solution for β̂2 as:

β̂2 = (X2∗′X2∗)−1X2∗′y∗

• Thus, we have an entirely separate approach by which we can estimate the OLS coefficient β̂2 when X2 is a single variable:

1. Regress X2 on X1 and calculate the residuals.

2. Regress y on X1 and calculate the residuals.

3. Regress the second set of residuals on the first. The first round of regressions "partials out" the effect of X1 on X2 and y, allowing us to calculate the marginal effect of X2 on y, accounting for X1.
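The three steps can be verified in R with simulated data (the object names below are illustrative, not from the course data):

```r
set.seed(42)
n  <- 100
x1 <- rnorm(n)                        # X1
x2 <- 0.5 * x1 + rnorm(n)             # X2, correlated with X1
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)

x2.star <- resid(lm(x2 ~ x1))         # step 1: partial X1 out of X2
y.star  <- resid(lm(y ~ x1))          # step 2: partial X1 out of y
b2.partial <- coef(lm(y.star ~ x2.star))["x2.star"]  # step 3

b2.full <- coef(lm(y ~ x1 + x2))["x2"]  # identical to the full regression
```

The two estimates agree exactly (up to floating-point rounding), which is the Frisch-Waugh result the derivation establishes.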

• The process can also be carried out in reverse (partialling X2 out of X1 and y), yielding:

β̂1 = (X1∗′X1∗)−1X1∗′y∗

7.2 R2 and the addition of new variables

• Using the partitioning approach, we can prove that R2 will always rise when we add a variable to the regression.

• Start with the model y = Xβ + ε and the fitted regression y = Xb + e, where b = β̂.

• Add a single variable z with the coefficient c so that y = Xd + zc + u (d ≠ b if X′z ≠ 0).

• R2 will increase if u′u < e′e. To find out if this inequality holds, we have to find an expression relating u and e.


• From the formula for partitioned regression above, we can say that:

d = (X′X)−1X′(y − zc) = b − (X′X)−1X′zc

and we know that u = y − Xd − zc.

• Inserting the expression for d into this equation for the residuals, we get:

u = y − Xb + X(X′X)−1X′zc − zc    (7.1)

• Since I − X(X′X)−1X′ = M and X(X′X)−1X′ = I − M, we can rewrite Eq. 7.1 as:

u = e − Mzc = e − z∗c

where z∗ = Mz.

• Thus,

u′u = e′e + c²(z∗′z∗) − 2cz∗′e

Note that e = My = y∗ and z∗′e = z∗′y∗. Since c = (z∗′z∗)−1(z∗′y∗), we can say that z∗′e = z∗′y∗ = c(z∗′z∗).

• Therefore:

u′u = e′e + c²(z∗′z∗) − 2c²(z∗′z∗) = e′e − c²(z∗′z∗)

Since the last term is c² times a sum of squares, it is nonnegative.

• Thus, u′u < e′e unless c = 0. In other words, adding another variable to the regression will always result in a higher R2 unless the OLS coefficient on the additional variable is exactly equal to zero.
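A quick simulated check in R (variable names are illustrative): even a pure-noise regressor cannot lower R2.

```r
set.seed(1)
n <- 50
x <- rnorm(n)
y <- 1 + x + rnorm(n)
z <- rnorm(n)                         # pure noise, unrelated to y

r2.short <- summary(lm(y ~ x))$r.squared
r2.long  <- summary(lm(y ~ x + z))$r.squared
r2.long >= r2.short                   # always TRUE
```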


7.3 Omitted variable bias

• The question of which variables should be included in a regression model is part of the larger question of selecting the right specification. Other issues in specification testing include whether you have the right functional form and whether the parameters are the same across sub-samples.

• We will now use a different approach to prove that the OLS coefficients will normally be biased when you exclude a variable from the regression that is part of the true, population model.

• Suppose that the correctly specified regression model would be:

y = X1β1 + X2β2 + ε

• If we don't know this and assume that the correct model is y = X1β1 + ε, then we will regress y on X1 only. Then our estimate of β1 will be:

β̂1 = (X1′X1)−1X1′y

• Now substitute y = X1β1 + X2β2 + ε from the true equation for y into the expression for β̂1:

β̂1 = β1 + (X1′X1)−1X1′X2β2 + (X1′X1)−1X1′ε

• To check for bias (i.e., whether E(β̂1) ≠ β1), take the expectation:

E(β̂1) = β1 + (X1′X1)−1X1′X2β2

• Thus, unless X1′X2 = 0 or β2 = 0, β̂1 is biased. Also, note that (X1′X1)−1X1′X2 gives the coefficient(s) in a regression of X2 on X1.
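The bias algebra can be checked numerically in R. In sample, the identity β̂(short) = β̂1(full) + δ̂ · β̂2(full) holds exactly, where δ̂ is the coefficient from the auxiliary regression of X2 on X1 (simulated data; names illustrative):

```r
set.seed(7)
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)             # X2 correlated with X1
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)  # true beta1 = 2, beta2 = 3

full  <- coef(lm(y ~ x1 + x2))
short <- coef(lm(y ~ x1))["x1"]       # omits x2: biased toward 2 + 3(0.6)
delta <- coef(lm(x2 ~ x1))["x1"]      # auxiliary regression coefficient

short - (full["x1"] + delta * full["x2"])  # zero up to rounding
```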


7.3.1 Direction of the Bias

• If we have just two explanatory variables in our data set, then the coefficient from a regression of X2 on X1 will be equal to cov(X1, X2)/var(X1).

• Thus, if we had a good idea of the sign of the covariance and the sign of β2, we would be able to predict the direction of the bias.

• If there are more than two explanatory variables in the regression, then the direction of the bias on β̂1 when we exclude X2 will depend on the partial correlation between X1 and X2 (since this always has the same sign as the regression coefficient in the OLS regression model) and the sign of the true β2.

7.3.2 Bias in our estimate of σ2

• In order to calculate the standard error of the coefficients, we need to estimate σ2. If we did this just from the regression including X1, we would get:

s2 = e1′e1 / (n − K1)

but e1 = M1y = M1(X1β1 + X2β2 + ε) = M1X2β2 + M1ε.

(Note that M1X1 = 0.)

• Does E(e1′e1) = σ2(n − K1)?

E(e1′e1) = E[(M1X2β2 + M1ε)′(M1X2β2 + M1ε)]
= E[(β2′X2′M1′ + ε′M1′)(M1X2β2 + M1ε)]
= β2′X2′M1X2β2 + E(β2′X2′M1ε) + E(ε′M1′X2β2) + E(ε′M1ε)
= β2′X2′M1X2β2 + E(ε′M1ε)

This last step involves multiplying out M1X2 and using E(ε′X2) = 0 and E(X1′ε) = 0, both of which we know from the Gauss-Markov assumptions.

• From here we proceed as we did when working out the expectation of the sum of squared errors originally:

E(e1′e1) = β2′X2′M1X2β2 + σ2 tr(M1) = β2′X2′M1X2β2 + (n − K1)σ2


The first term in the expression is the increase in the error sum of squares (SSE) that results when X2 is dropped from the regression.

• We can show that β2′X2′M1X2β2 is positive, so the s2 derived from the regression of y on X1 will be biased upward. (Intuition: the shorter version of the model leaves more variance unaccounted for.)

• Omitted variable bias is a result of a violation of the Gauss-Markov assumptions: we assumed we had the correct model.

7.3.3 Testing for omitted variables: The RESET test

• The RESET test (due to Ramsey) checks for patterns in the estimated residuals against the predicted values of y, ŷ.

• Specifically, the test is conducted by estimating the original regression, computing the predicted values, and then using powers of these predicted values, ŷi², ŷi³, and of the original regressors, X, as explanatory variables in an expanded regression. An F-test on these additional regressors is then employed to detect whether or not they are jointly significant.
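A hand-rolled version of the test in R, on simulated data with a deliberately omitted quadratic term (the resettest function in the lmtest package automates the same idea):

```r
set.seed(3)
n <- 200
x <- rnorm(n)
y <- 1 + x + 0.5 * x^2 + rnorm(n)     # true relationship is quadratic

m0   <- lm(y ~ x)                     # misspecified linear model
yhat <- fitted(m0)
m1   <- lm(y ~ x + I(yhat^2) + I(yhat^3))  # add powers of the fitted values

anova(m0, m1)                         # F-test on the added regressors;
                                      # a small p-value flags misspecification
```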

7.4 Including an irrelevant variable

• What if the true model is y = X1β1 + ε and we estimate y = X1β1 + X2β2 + ε? We are likely to find that the estimated coefficient β̂2 cannot be distinguished from zero.

• β̂1 will not be biased, nor will s2. [To verify, use β2 = 0 in the above equations.]

• This does not mean we should estimate “kitchen sink” regressions.

• The cost of including an irrelevant variable is a loss of efficiency. It can be shown that the variances of the coefficient estimates rise when we add a variable, which means we will estimate the sampling distributions of the coefficients less precisely. Our standard errors will rise and our t-statistics will fall.


7.5 Model specification guidelines

• Think about theoretically likely confounding variables and alternative hypotheses.

• Avoid “stepwise regression”.

A common technique of model building is to build it iteratively by adding regressors sequentially and keeping those that are statistically significant. This is a very problematic technique.

Consider a simple case of adding columns from X sequentially and discarding those whose estimated coefficient is not statistically different from zero. First, we estimate y = β0 + x1β1 + ε, a simple model with an intercept and one explanatory variable.

We will keep x1 in if |β̂1|/se(β̂1) > tα/2 (the t-statistic is bigger than the critical value).

The problem is that unless y = β0 + x1β1 + ε is the true model, β̂1 and se(β̂1) will be biased. So our decision about what to include will be based on biased estimates.

• Avoid “Data Mining”

Data mining refers to the practice of estimating lots of different models to see which has the best fit and gives statistically significant coefficients.

The problem is that if you are using a 95% confidence interval (size of the test, α, is .05), then the probability that you will reject the null hypothesis that β = 0, when it is true, is 5%.

If you try out twenty different variables, all of which actually have true β = 0, your probability of finding at least one that looks significant in this sample is approximately 1 − .95^20 ≈ .64. Thus, when you are doing repeated tests, your probability of incorrectly rejecting the null is actually much higher than the nominal size of the test.
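A small simulation in R makes the point; all twenty candidate regressors here are pure noise, so every "discovery" is a false positive (names illustrative):

```r
set.seed(11)
n.obs <- 100; n.vars <- 20; n.sims <- 200
any.sig <- replicate(n.sims, {
  y <- rnorm(n.obs)                   # y unrelated to every regressor
  p <- sapply(seq_len(n.vars), function(j) {
    x <- rnorm(n.obs)
    summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
  })
  any(p < 0.05)                       # did at least one look "significant"?
})
mean(any.sig)                         # near 1 - 0.95^20, roughly 0.64
```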


• Try “Cross Validation”

Randomly divide your data set in half. Select your model using the first data set, guided by theory.

Next, after deciding on the right model using the first data set, apply it to the second data set. If any of the variables are not significant in your second data set, there's a good chance that they were incorrectly included to begin with.

• Let theory be your guide

Include variables that are theoretically plausible or necessary for accounting for relevant (but theoretically uninteresting) phenomena.

If theory is not a good guide, for instance when it comes to determining the number of lags in a time-series model, you could try using one of the criteria used in the profession for model selection, such as the Akaike Information Criterion.

7.6 Multicollinearity

• Partitioned regression can be used to show that if you add another variable to a regression, the standard errors of the previously included coefficients will rise and their t-statistics will fall.

• The magnitude of the change depends on the covariance between the variables.

• The estimated variance of β̂k is given by

var(β̂k) = s2[(X′X)−1]kk

⇒ the estimated variance depends on the presence of other variables in the regression via their influence on the kth diagonal element of the inverse matrix (X′X)−1.

• We can use the formula for the inverse of a partitioned matrix to say something about what this element will be.


• Let the true model be Yi = β0 + β1X1i + β2X2i + εi.

• Measure all the data as deviations from means, so that the transformed model is Yi∗ = α1X1i∗ + α2X2i∗ + εi, where α1 = β1, α2 = β2, and X1i∗ = (x1i − x̄1), etc.

• You can think of this as the case in which we had first "partialled out" the intercept, by regressing all the variables on an intercept and using the residuals from this regression, which amount to the deviations from the mean.

• The data matrix, X, is now X = [X1∗ X2∗], and

X′X = [ X1∗′X1∗  X1∗′X2∗ ]
      [ X2∗′X1∗  X2∗′X2∗ ]

• Multiplying out this last matrix,

X′X = [ Σi(x1i − x̄1)²             Σi(x1i − x̄1)(x2i − x̄2) ]
      [ Σi(x2i − x̄2)(x1i − x̄1)   Σi(x2i − x̄2)²           ]

• To find the variance of α̂2 = β̂2, we would need to know [(X′X)−1]22, the bottom-right element of the inverse of the matrix above.

• To find this, we will use a result on the inverse of a partitioned matrix (see Greene, 6th ed., p. 966). The formula for the bottom-right element of the inverse of a partitioned matrix is F2 = (A22 − A21A11−1A12)−1, where the A's refer to the blocks of the original partitioned matrix.

• Using this for the above matrix we have:

[(X′X)−1]22 = [Σ(x2i − x̄2)² − Σ(x2i − x̄2)(x1i − x̄1) · (1/Σ(x1i − x̄1)²) · Σ(x1i − x̄1)(x2i − x̄2)]−1

= [var(X2)(n − 1) − cov(X1, X2)(n − 1) · cov(X1, X2)(n − 1)/(var(X1)(n − 1))]−1

= [var(X2)(n − 1) − cov(X1, X2)²(n − 1)/var(X1)]−1


• Multiply through by 1 = var(X2)/var(X2) to give:

[(X′X)−1]22 = [(1 − cov(X1, X2)²/(var(X1)var(X2))) · var(X2)(n − 1)]−1

= [(1 − r12²) · var(X2)(n − 1)]−1

= 1 / [(1 − r12²) var(X2)(n − 1)]

• In this expression, r12² is the squared correlation coefficient between X1 and X2.

• It follows then that:

var(β̂2) = s2[(X′X)−1]22 = s2 / [(1 − r12²) var(X2)(n − 1)]

Thus, the variance of the OLS coefficient rises when the two included variables have a higher correlation.

• Intuition: the partialling-out formula for the OLS regression coefficients demonstrated that each regression coefficient β̂k is estimated from the covariance between y and Xk once you have netted out the effect of the other variables. If Xk is highly correlated with another variable, then there is less "information" remaining once you net out the effect of the remaining variables in your data set.

This result can be generalized for a multivariate model with K regressors (including the constant) in which the data are entered in standard form, not in deviations:

var(β̂k) = s2 / [(1 − R2k.X−k) var(Xk)(n − 1)]

where R2k.X−k is the R2 from a regression of Xk on all the other regressors, X−k.
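This variance formula can be confirmed against lm's own variance-covariance matrix in R, where it holds exactly for a model with a constant (simulated data; names illustrative):

```r
set.seed(5)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.5)   # collinear regressors
y  <- 1 + x1 + x2 + rnorm(n)

m   <- lm(y ~ x1 + x2)
s2  <- summary(m)$sigma^2
r2k <- summary(lm(x2 ~ x1))$r.squared # R2 of x2 on the other regressor

v.formula <- s2 / ((1 - r2k) * var(x2) * (n - 1))
v.lm      <- vcov(m)["x2", "x2"]      # matches the formula exactly
```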


• It follows that the variance of β̂k:

rises with σ2, the variance of ε, and with its estimate, s2.

falls as the variance of Xk rises and falls as the number of data points rises.

rises when Xk and the other regressors are highly correlated.

• Implications for adding another variable: As you add another variable to the data set, R2k.X−k will almost always rise ⇒ estimating a kitchen-sink regression raises the standard errors and reduces the efficiency of the estimates.

• Implications for the bias in the standard errors when you have an omitted variable: omitted relevant variable ⇒ s2 is biased upwards.

The direction of the bias for the standard errors is unclear, however.

While s2 in the numerator increases, the R2k.X−k will be smaller than it would have been if we had estimated the full model (we just know the standard errors are likely to be wrong).

• Implications for the standard errors when the variables are collinear: R2k.X−k will be high, so the variance and standard errors will be high and thus the t-statistics will be low (but check the F-test). If the variables are perfectly collinear, R2k.X−k = 1.


7.6.1 How to tell if multicollinearity is likely to be a problem

1. Check the bivariate correlations between variables using the R command cor.

2. If the overall R2 is high, but few if any of the t-statistics surpass the critical value ⇒ high collinearity.

3. Look at the Variance Inflation Factors (VIFs). The VIF for each variable is equal to:

VIFk = 1 / (1 − R2k.X−k)

Rules of thumb: there is evidence of multicollinearity if the largest VIF is greater than 10 or the mean of all the VIFs is considerably larger than one. VIFs can be computed in R using Fox's car library.

7.6.2 What to do if multicollinearity is a problem

1. Nothing. Report the F-tests as an indication that your variables are jointly significant and the bivariate correlations or VIFs as evidence that multicollinearity is likely to explain the lack of individual significance.

2. Get more data.

3. Amalgamate variables that are very highly collinear into an index of the underlying phenomenon. The only problem with this approach is that you lose the clarity of interpretation that comes with entering variables individually.


Section 8

Regression Diagnostics

• First commandment of multivariate analysis: "First, know your data."

8.1 Before you estimate the regression

• It is often useful to plot the variables in the model against each other to get a sense of bivariate correlations (other useful plots: histograms, nonparametric kernel density plots, lowess curves in plots).

• R's graphical capabilities are tremendous, providing numerous canned routines for displaying data. Fox's car library augments what is in the standard R package. E.g., see Figure 8.1, which was produced by the pairs command.
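A minimal sketch of such a scatterplot matrix (the data frame and variable names are made up for illustration, not the course data):

```r
pdf(NULL)                             # null device so this runs non-interactively
dat <- data.frame(approval = rnorm(40, 55, 10),
                  unemp    = runif(40, 5.0, 7.5),
                  infl     = runif(40, 0.6, 1.6))
pairs(dat)                            # all pairwise scatterplots, as in Fig. 8.1
dev.off()
```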

8.2 Outliers, leverage points, and influence points

• An outlier is any point that is far from the fitted values. In other words, it has a large residual. Why might we care about checking particular outliers? Two main reasons, although analysts also don't enjoy large outliers because they reduce the fit of the model:

Outliers may signal an error in transcribing data.

A pattern in the outliers may imply an incorrect specification: a missing variable or a likely quadratic relationship in the true model.

8.2.1 How to look at outliers in a regression model

• Plot the residuals against the fitted values, ŷ. If the model is properly specified, no patterns should be visible in the residuals. E.g., see Fig. 8.2.

• You might also want to graph the residuals against just one of the original explanatory variables.


Figure 8.1: Data on Presidential Approval, Unemployment, and Inflation

[Scatterplot matrix of var 1, var 2, and var 3, produced by the pairs command.]


Figure 8.2: A plot of fitted values versus residuals

[Scatterplot of the residuals (vertical axis) against the fitted values (horizontal axis).]


• It would be nice if we had a method that enabled us to identify influential pairs or subsets of observations. This can be done with an "added variable plot," which reduces a higher-dimension regression problem to a series of two-dimensional plots.

• An added variable plot is produced for a given regressor xk by regressing both the dependent variable and xk on all of the other K − 1 regressors. The residuals from these regressions are then plotted against each other.

• Fig. 8.3 shows an example of some added variable plots from Fox, using the car library. This was produced by the commands:

library(car)

Duncan <- read.table(’/usr/lib/R/library/car/data/Duncan.txt’,header=T)

attach(Duncan)

mod.duncan <- lm(prestige~income+education)

av.plots(mod.duncan,labels=row.names(Duncan),ask=F)

8.3 How to look at leverage points and influence points

• A leverage point is any point that is far from the mean value of X, x̄. A point with high leverage is capable of exerting influence over the estimated regression coefficients (the slope of the regression line).

• An influence point is a data point that actually does change the estimated regression coefficient, such that if that point were excluded, the estimated coefficient would change substantially.

• Why do points far from the mean of X have a bigger impact on the fit of the line? Recall that if a constant is included, the fitted line passes through the means, ȳ = x̄β̂, so that (x̄, ȳ) becomes the fulcrum around which the line can tilt. See Fig. 8.4.

• We could also have worked out which were the large residuals in the estimated regression using the lm command:


Figure 8.3: Added variable plots

[Three added-variable plots from the Duncan regression: prestige | others plotted against (Intercept) | others, income | others, and education | others.]


Figure 8.4: An influential outlier

[Two panels plotting Y against X, showing how a single influential outlier tilts the fitted regression line.]


xyreg <- lm(Y~X)

xyresid <- resid(xyreg)

and then sorting on the residuals to get the data points organized from highest to lowest residual. This would also allow us to tell the X values for which the residuals were largest.
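For instance (simulated data; object names follow the text):

```r
set.seed(2)
X <- rnorm(50)
Y <- 1 + 2 * X + rnorm(50)
xyreg   <- lm(Y ~ X)
xyresid <- resid(xyreg)

ord <- order(abs(xyresid), decreasing = TRUE)   # biggest residuals first
head(data.frame(X = X[ord], residual = xyresid[ord]))
```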

8.4 Leverage

• We should draw a distinction between outliers and high leverage observations.

• The regressor values that are most informative (i.e., have the most influence on the coefficient estimates) and lead to relatively large reductions in the variance of coefficient estimates are those that are far removed from the majority of values of the explanatory variables.

• Intuition: regressor values that differ substantially from average values contribute a great deal to isolating the effects of changes in the regressors.

• Most common measure of influence comes from the “hat matrix”:

H = X(X′X)−1X′

• For any vector y, Hy is the set of fitted values (ŷ) in the least squares regression of y on X. In matrix language, it projects any n × 1 vector into the column space of X.

• For our purposes, Hy = Xβ̂ = ŷ. Recall also that M = I − X(X′X)−1X′ = I − H.

• The least squares residuals are:

e = My = Mε = (I−H)ε

• Thus, the variance-covariance matrix for the least squares residual vector is:

E[ee′] = M E[εε′]M′ = σ2MM′ = σ2M = σ2(I − H)


• The diagonal elements of this matrix are the variances of each individual estimated residual. Since these diagonal elements are not generally the same, the residuals are heteroskedastic. Since the off-diagonal elements are not generally equal to zero, we can also say that the individual residuals are correlated (they have non-zero covariance). This is the case even when the true errors are homoskedastic and independent, as we assume under the Gauss-Markov assumptions.

• For an individual residual, ei, the variance is var(ei) = σ2(1 − hi), where hi is the ith diagonal element of the hat matrix, H. The value hi is considered a measure of leverage, since it increases with the distance of xi from the mean of the data.

• The average hat value is h̄ = (k + 1)/n, where k is the number of slope coefficients in the model (excluding the constant). Hat values exceeding 2h̄ or 3h̄ are typically treated as noteworthy.

• Thus, observations with the greatest leverage have corresponding residuals with the smallest variance. Observations with high leverage tend to pull the regression line toward them, decreasing the size and variance of their residuals.

• Note that high leverage observations are not necessarily bad: they contribute a great deal of information about the coefficient estimates (assuming they are not coding mistakes). They might indicate misspecification (different models for different parts of the data).

• Fig. 8.5 contains a plot of hat values. This was produced using the R commands:

plot(hatvalues(xyreg))

abline(h=c(2,3)*2/N,lty=2)
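The hat matrix itself can be formed directly to confirm what hatvalues reports (a simulated stand-in for the xyreg object in the text):

```r
set.seed(4)
N <- 50
X <- rnorm(N)
Y <- 1 + 2 * X + rnorm(N)
xyreg <- lm(Y ~ X)

Xmat <- model.matrix(xyreg)                           # includes the constant
H    <- Xmat %*% solve(t(Xmat) %*% Xmat) %*% t(Xmat)  # H = X(X'X)^{-1}X'
range(diag(H) - hatvalues(xyreg))                     # zero up to rounding
sum(diag(H))                                          # tr(H) = number of coefficients
```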


Figure 8.5: Hat values

[Index plot of hatvalues(xyreg), with horizontal cutoff lines at 2h̄ and 3h̄.]


8.4.1 Standardized and studentized residuals

• To identify which residuals are significantly large, we first standardize them by dividing by the appropriate standard error (given by the square root of each diagonal element of the variance-covariance matrix):

ri = ei / [s2(1 − hi)]1/2

This gives the standardized residuals.

• Since each residual has been standardized by its own standard error, we can now compare the standardized residuals against a critical value (e.g., ±1.96) to see if they are truly "large."

• A related measure is the "studentized residual." Standardized residuals use s2 as the estimate for σ2. Studentized residuals are calculated in exactly the same way, except that for the estimate of σ2 we use s(i)2, calculated using all the data points except i.

• Let β̂(i) = [X(i)′X(i)]−1X(i)′y(i) be the OLS estimator obtained by omitting the ith observation. The variance of the residual yi − xi′β̂(i) is given by σ(i)²{1 + xi′[X(i)′X(i)]−1xi}.

• Then

ei∗ = (yi − xi′β̂(i)) / (σ̂(i){1 + xi′[X(i)′X(i)]−1xi}1/2)

follows a t-distribution with N − K − 1 df (assuming normal errors). Note that

σ̂(i)² = [y(i) − X(i)β̂(i)]′[y(i) − X(i)β̂(i)] / (N − K − 1)

• Studentized residuals can be interpreted as the t-statistic on a dummy variable equal to one for the observation in question and zero everywhere else. Such a dummy variable will effectively absorb the observation and so remove its influence in determining the other coefficients in the model.

• Studentized residuals that are greater than 2 in absolute value are regardedas outliers.


• Studentized residuals are generally preferred to standardized residuals for the detection of outliers, since they give us an idea of which residuals are large while purging the estimate of the residual variance of the effect of that one data point.
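In R, rstandard and rstudent return the two quantities directly (simulated data with one deliberately planted outlier):

```r
set.seed(6)
x <- rnorm(30)
y <- 1 + x + rnorm(30)
y[30] <- y[30] + 8                    # plant a gross outlier
m <- lm(y ~ x)

r.std  <- rstandard(m)                # standardized: uses s^2
r.stud <- rstudent(m)                 # studentized: uses s(i)^2
which(abs(r.stud) > 2)                # flags the planted observation
```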

• Related measures that combine the index of leverage taken from the diagonal elements of the hat matrix and the studentized residuals include DFITS, Cook's Distance, and Welsch Distance.

• This studentized residual still does not, however, tell us how much the inclusion of a single data point has affected the line. For that, we need a measure of influence.

8.4.2 DFBETAS

• A measure that is often used to tell how much each data point affects the coefficients is DFBETAS, due to Belsley, Kuh, and Welsch (1980).

• DFBETAS focuses on the difference between coefficient values when the ith observation is included and excluded, scaling the difference by the estimated standard error of the coefficient:

DFBETASki = (β̂k − β̂k(i)) / (σ̂(i)√akk)

where akk is the kth diagonal element of (X′X)−1.

• Belsley et al. suggest that observations with

|DFBETASki| > 2/√N

deserve special attention (this accounts for the fact that a single observation has less of an effect as the sample size grows). It is also common practice simply to use an absolute value of one, meaning that the observation shifted the estimated coefficient by at least one standard error.


• To access the DFBETAS in R (e.g., for plotting) do something like

dfbs.xyreg <- dfbetas(xyreg)

where xyreg contains the output from the lm command. To plot them for the slope coefficient for our fake data, do

plot(dfbs.xyreg[,2],ylab="DFBETAS for X")

See Fig. 8.6 for the resulting plot.
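Flagging observations against the 2/√N cutoff can be sketched as follows (simulated data with a planted influential point; object names follow the text):

```r
set.seed(8)
N <- 50
X <- c(rnorm(N - 1), 4)               # one high-leverage x value
Y <- 1 + 2 * X + c(rnorm(N - 1), 6)   # ...with a large error attached
xyreg <- lm(Y ~ X)

dfbs.xyreg <- dfbetas(xyreg)
flagged <- which(abs(dfbs.xyreg[, 2]) > 2 / sqrt(N))
flagged                               # includes the planted observation
```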


Figure 8.6: Plot of DFBETAS

[Index plot of the DFBETAS for the slope coefficient on X.]


Section 9

Presentation of Results, Prediction, and Forecasting

9.1 Presentation and interpretation of regression coefficients

• Some rules of thumb for presenting results of regression analysis:

Provide a table of descriptive statistics (space permitting).

A table of OLS regression results should include coefficient estimates and standard errors (or t-statistics), measures of model fit (adjusted R2, F-tests, standard error of the regression), information about the sample (size), and the estimation approach.

Make your tables readable. Do not simply cut and paste regression output. No one needs to see coefficients/std. errs. reported to more than the second or third decimal place. Refrain from using scientific notation (take logs, divide by some factor if necessary). "Big" models make for unattractive tables.

Discuss substantive significance, providing ancillary tables or plots to help with inferences about the magnitude of effects.

• Substantive significance is easier to assess in linear models than in nonlinear models, although walking readers through some simulations is often helpful.

For example, what is the expected change in our dependent variable for a given change in the explanatory variable, where the given change is substantively interesting or intuitive? This could include, for instance, indicating how much the dependent variable changes for a one standard deviation movement up or down in the explanatory variable.

Relate this back to descriptive statistics.


9.1.1 Prediction

• We know from β̂ what the expected effect on y of a given change in xk is. But what is the predicted value of y, ŷ, for a given vector of values of x?

• In-sample prediction tells us how the level of y would change for a change in the Xs for a particular observation in our sample.

• For example, at a given value of each X variable, x0, ŷ is equal to:

ŷ = x0β̂

This is the conditional mean of y, conditional on the explanatory variables and the estimated parameters of the model.

• Let us say that we have a growth model estimated on 100 countries, including Namibia and Singapore. An example of in-sample prediction would be to predict how Namibia's expected (and predicted) growth rate would change if Namibia had Singapore's level of education.

• When we do in-sample prediction we don't have to be concerned about "fundamental uncertainty" (arising from the error term, which alters the value of y) because we are assuming that the true random error attached to Namibia would stay the same and we only change the average level of education.

• We also do not have to be concerned with what is called "estimation uncertainty," arising from the fact that we estimated β̂, because we are not generalizing out of the sample.

• Out-of-sample prediction or forecasting gives the prediction for the dependent variable for observations that are not included in the sample. When the prediction refers to the future, it is known as a forecast.

• For given values of the explanatory variables, x0, the predicted value is still ŷ0 = x0β̂. To get an idea of how precise this forecast is, we need a measure of the variance of y0, the actual value of y at x0, around ŷ0:

y0 = x0β + ε0


• Therefore the forecast error is equal to:

y0 − ŷ0 = e0 = x0β + ε0 − x0β̂ = x0(β − β̂) + ε0

• The expected value of this is equal to zero. Why?

• The variance of the forecast error is equal to:

var[e0] = E[e0e0′] = E[(x0(β − β̂) + ε0)(x0(β − β̂) + ε0)′]
= E[x0(β − β̂)(β − β̂)′x0′ + ε0ε0′ + ε0(β − β̂)′x0′ + x0(β − β̂)ε0′]
= x0E[(β − β̂)(β − β̂)′]x0′ + E[ε0ε0′]
= x0[var(β̂)]x0′ + σ2
= x0[σ2(X′X)−1]x0′ + σ2

This is the variance of the forecast error, and thus the variance of the actual value of y around the predicted value.

• If the regression contains a constant term, then an equivalent expression (derived using the expression for partitioned matrices) is:

var[e0] = σ2[1 + 1/n + Σj Σk (x0j − x̄j)(x0k − x̄k)(Z′M0Z)jk]

where the double sum runs over j, k = 1, . . . , K − 1, Z is the K − 1 columns of X that do not include the constant, and (Z′M0Z)jk is the jkth element of the inverse of the (Z′M0Z) matrix.

• Implications for the variance of the forecast:

It will go up with the true variance of the errors, σ2, and this effect is still present no matter how many data points we have.

It will go up as the distance of x0 from the mean of the data rises. Why is this? As x0 moves away from the mean, any difference between β and β̂ is multiplied by a larger number. Meanwhile, the true regression line and the predicted regression line both go through the mean of the data.


It will fall with the number of data points, because the elements of the inverse will be smaller and smaller (intuition: as you get more data points, you estimate β̂ more precisely).

• The variance above can be estimated using:

var[e0] = x0[s2(X′X)−1]x0′ + s2

The square root of this is the standard error of the forecast.

• A confidence interval can then be formed using ŷ0 ± tα/2 se(e0).
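In R, predict with interval = "prediction" builds exactly this interval; the standard error of the forecast can also be assembled by hand from the formula above (simulated data; names illustrative):

```r
set.seed(10)
x <- rnorm(40)
y <- 2 + 3 * x + rnorm(40)
m <- lm(y ~ x)

new <- data.frame(x = 1.5)
pr  <- predict(m, newdata = new, interval = "prediction", level = 0.95)

# by hand: var[e0] = x0 [s^2 (X'X)^{-1}] x0' + s^2
Xmat <- model.matrix(m)
s2   <- summary(m)$sigma^2
x0   <- c(1, 1.5)                     # constant plus the new x value
se.f <- sqrt(drop(t(x0) %*% (s2 * solve(t(Xmat) %*% Xmat)) %*% x0 + s2))
# the half-width of pr's interval equals t(.025, n - 2) * se.f
```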

• The Clarify software by King et al. (available for Stata) can be used to help with presenting predictions as well as other kinds of inferences (see "Making the Most of Statistical Analyses: Improving Interpretation and Presentation" by Gary King, Michael Tomz, and Jason Wittenberg, American Journal of Political Science 44(2), April 2000, pp. 347-361; the article and the software described in it can be found at http://gking.harvard.edu/clarify/docs/clarify.html).

9.2 Encompassing and non-encompassing tests

• All of the hypothesis tests we have examined so far have been what are known as “encompassing tests.” In other words, we have always stated the test as a linear restriction on a given model and have tested H₀ : Rβ = q versus H₁ : Rβ ≠ q.

• For example, testing Model A below against Model B is a question of running an F-test on the hypothesis that β₃ = β₄ = 0:

Model A: Yᵢ = β₀ + β₁X₁ᵢ + β₂X₂ᵢ + β₃X₃ᵢ + β₄X₄ᵢ + εᵢ

Model B: Yᵢ = β₀ + β₁X₁ᵢ + β₂X₂ᵢ + εᵢ

• This approach can be restrictive. In particular, it does not allow us to test which of two possible sets of regressors is more appropriate. In many cases, we are interested in testing a non-nested hypothesis of the form H₀ : y = Xβ + ε₀ versus H₁ : y = Zδ + ε₁.


• For example, how do we choose between Model C and Model D?

Model C: Yᵢ = α₀ + α₁X₁ᵢ + α₂X₂ᵢ + ε₀ᵢ

Model D: Yᵢ = β₀ + β₁Z₁ᵢ + β₂Z₂ᵢ + ε₁ᵢ

• The problem in conducting tests of this sort is to transform the two hypotheses above so that we get a “maintained” model (i.e., reflecting the null hypothesis, H₀) that encompasses the alternative. Once we have an encompassing model, we can base our test on restrictions to this model.

• Note that we could combine the two models above as Model F:

Model F: Yᵢ = λ₀ + λ₁X₁ᵢ + λ₂X₂ᵢ + λ₃Z₁ᵢ + λ₄Z₂ᵢ + uᵢ

• We could then proceed to run an F-test on λ₃ = λ₄ = 0, as a test of C against D, or an F-test of λ₁ = λ₂ = 0, as a test of D against C.

• Problem: the results of the model comparisons may depend on which model we treat as the reference model, and the standard errors of all the coefficients (and thus the precision of the F-tests) will be affected by the inclusion of these additional variables. This approach also assumes the same error term for both models.

• A way out of this problem, known as the J-test, is proposed by Davidson and MacKinnon (1981) and uses a non-nested F-test procedure. The test proceeds as follows:

1. Estimate Model D and obtain the predicted y values, Ŷᵢᴰ.

2. Add these predicted values as an additional regressor to Model C and run the following model. This model is an example of the encompassing principle.

Yᵢ = α₀ + α₁X₁ᵢ + α₂X₂ᵢ + α₃Ŷᵢᴰ + uᵢ

3. Use a t-test to test the hypothesis that α₃ = 0. If this null is not rejected, it means that the predicted values from Model D add no significant information to the model that can be used to predict Yᵢ, so that Model C is the true model and encompasses Model D. Here, “encompasses” means that it encompasses the information that is in D. If the null hypothesis is rejected, then Model C cannot be the true model.


4. Reverse the order of estimation. Estimate Model C first, obtain the predicted values, Ŷᵢᶜ, and use these as an additional explanatory variable in Model D. Run a t-test on the null hypothesis that β₃, the coefficient on Ŷᵢᶜ, is equal to zero. If you do not reject the null, then Model D is the true model.

• One potential problem with this test is that you can find yourself rejecting both C and D, or neither. This is a small-sample problem, however.

• Another popular test is the Cox test (see Greene, 5th ed., pp. 155–159, for details).
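The J-test steps can be sketched in Python. This is a simulation under assumed conditions: the data are generated so that Model C's regressors (X1, X2) form the true DGP while Z1, Z2 are irrelevant, so we expect the test to keep C and reject D:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Hypothetical data in which Model C's regressors are the true DGP
X1, X2 = rng.normal(size=(2, n))
Z1, Z2 = rng.normal(size=(2, n))
y = 1.0 + 2.0 * X1 - X2 + rng.normal(0.0, 1.0, n)

def ols(y, *regressors):
    """OLS with a constant; returns coefficients, fitted values, std. errors."""
    X = np.column_stack([np.ones(len(y))] + list(regressors))
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    s2 = e @ e / (len(y) - X.shape[1])
    return b, X @ b, np.sqrt(s2 * np.diag(XtX_inv))

# Steps 1-3: fit Model D, add its fitted values to Model C, t-test their coefficient
_, yhat_D, _ = ols(y, Z1, Z2)
b_C, _, se_C = ols(y, X1, X2, yhat_D)
t_D_into_C = b_C[-1] / se_C[-1]   # expected small |t| here, since C is true

# Step 4: reverse the roles
_, yhat_C, _ = ols(y, X1, X2)
b_D, _, se_D = ols(y, Z1, Z2, yhat_C)
t_C_into_D = b_D[-1] / se_D[-1]   # expected large |t|: D does not encompass C
```

With real data, of course, either, both, or neither t-statistic may be significant, which is exactly the ambiguity noted above.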


Section 10

Maximum Likelihood Estimation

10.1 What is a likelihood (and why might I need it)?

• OLS is simply a method for fitting a regression line that, under the Gauss-Markov assumptions, has attractive statistical properties (unbiasedness, efficiency). However, it has no particular statistical justification.

• Maximum likelihood (ML) estimation has a theoretical justification and, if we are prepared to make assumptions about the population model that generates our data, it has “attractive statistical properties” (consistency, efficiency) whether or not the Gauss-Markov assumptions hold.

• In addition, ML estimation is one method for estimating regression models that are not linear.

• Also, if an unbiased minimum variance estimator (something BLUE) exists, then ML estimation chooses it, and ML estimation will achieve the Cramer-Rao lower bound (more explanation of this jargon later).

• Where does the notion of a likelihood come from? It’s a partial answer to the statistician’s search for “inverse probability.”

• If we know (or can assume) the underlying model in the population, we can estimate the probability of seeing any particular sample of data. Thus, we could estimate P(data|model), where the model consists of parameters and a data generating function.

• This is the notion of conditional probability that underlies hypothesis testing: “How likely is it that we observe the data y, if the null is true?” In most cases, however, statisticians know the data; what they want to find out about are the parameters that generate the data.

• Often, statisticians will feel comfortable assigning a particular type of distribution to the DGP (e.g., the errors are distributed normally). This is the same as saying that we are confident about the “functional form.”


• What is unknown are the parameters of the DGP. For a single, normally distributed random variable, those unknown parameters would be the mean and the variance. For the dependent variable in a regression model, y, those parameters would include the regression coefficients, β, and the variance of the errors, σ².

• What we want is a measure of “inverse probability,” a means to estimate P(parameters|data), where the parameters are what we don’t know about the model. In this inverse probability statement, the data are taken as given and the probability is a measure of absolute uncertainty over various sets of coefficients.

• We cannot derive this measure of absolute uncertainty, but we can get close to it, using the notion of likelihood, which employs Bayes Theorem.

• Bayes Theorem: P(A|B) = P(B|A)P(A)/P(B) = P(A∩B)/P(B)

• Let’s paraphrase this for our case of having data and wanting to find out about coefficients:

P(parameters|data) = P(data|parameters) P(parameters) / P(data)

• For the purposes of estimation, we treat the marginal probability of seeing the data, P(data), as additional information that we use only to scale our beliefs about the parameters. Thus, we can say more simply:

P(parameters|data) ∝ P(data|parameters) P(parameters)   (10.1)

where ∝ means “is proportional to”.

• The conditional probability of the parameters given the data, or P(parameters|data), is also known as the likelihood function (e.g., L(µ, σ²|x), where µ refers to the mean and σ² refers to the variance).

• The likelihood function may be read as the likelihood of any value of the parameters, µ and σ² in the univariate case, given the sample of data that we observe. As our estimates of the parameters of interest, we might want to use the values of the parameters that are most likely given the data at hand.


• We can maximize the right-hand side of Eq. 10.1, P(data|parameters), with respect to µ and σ² to find the values of the parameters that maximize the likelihood of getting the particular data sample.

• We can’t calculate the absolute probability of any particular value of the parameters, but we can tell something about the values of the parameters that make the observed sample most likely.

• Given the proportional relationship, the values of the parameters that maximize the likelihood of getting the particular data sample are also the values of the parameters that are most likely to obtain given the sample.

• Thus, we can maximize P(data|parameters) with respect to the parameters to find the maximum likelihood estimates of the parameters.

• We treat P(parameters) as a prior assumption that we make independent of the sample, and often adopt what is called a “flat prior” so that we place no particular weight on any value of the parameters.

• To estimate the parameters of the causal model, we start with the statistical function that is thought to have generated the data (the “data generating function”). We then use this model to derive the expression for the probability of observing the sample, and we maximize this expression with respect to the parameters to obtain the maximum likelihood estimates.


10.2 An example of estimating the mean and the variance

• The general form of a normal distribution with mean µ and variance σ² is:

f(x|µ, σ²) = (1/(2πσ²))^{1/2} e^{−(x−µ)²/(2σ²)}

This function gives the density of a particular value x drawn from a normal distribution with mean µ and variance σ².

• If we had many independent random variables, all generated by the same underlying probability density function, then the function generating the sample as a whole would be given by:

f(x₁, …, xᵢ, …, xₙ|µ, σ²) = Π_{i=1}^{n} (1/(2πσ²))^{1/2} e^{−(xᵢ−µ)²/(2σ²)}

Note that there is an implicit iid assumption, which enables us to write the joint likelihood as the product of the marginals.

• The method of finding the most likely values of µ and σ², given the sample data, is to maximize the likelihood function. The expression above is the likelihood function, since it is the joint density of the sample and this is proportional to P(parameters|data).

• An easier way to perform the maximization, computationally, is to maximize the log likelihood function, which is just the natural log of the likelihood function. Since the log is a monotonic function, the values that maximize L are the same as the values that maximize ln L.

ln L(µ, σ²) = −(n/2) ln(2π) − (n/2) ln σ² − (1/2) Σ_{i=1}^{n} (xᵢ − µ)²/σ²


• To find the values of µ̂ and σ̂², we take the derivatives of the log likelihood with respect to the parameters and set them equal to zero.

∂ ln L/∂µ = (1/σ²) Σ_{i=1}^{n} (xᵢ − µ) = 0

∂ ln L/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^{n} (xᵢ − µ)² = 0

• To solve the likelihood equations, multiply both sides of the first equation by σ² and solve for µ̂, the estimate of µ. Next insert this in the second equation and solve for σ̂². The solutions are:

µ̂ML = (1/n) Σ_{i=1}^{n} xᵢ = x̄ₙ

σ̂²ML = (1/n) Σ_{i=1}^{n} (xᵢ − x̄ₙ)²

• You should recognize these as the sample mean and variance, so that we have some evidence that ML estimation is reliable. Note, however, that the denominator of the calculated variance is n rather than (n − 1) ⇒ the ML estimate of the variance is biased (although it is consistent).
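A quick numerical check of this derivation, as a pure-Python sketch: the sample is simulated from a hypothetical N(5, 4), and the closed-form solutions from the first-order conditions are confirmed to sit at the maximum of the log likelihood:

```python
import math
import random

random.seed(42)
# Simulated iid sample from N(mu = 5, sigma^2 = 4); the true values are hypothetical
x = [random.gauss(5, 2) for _ in range(1000)]
n = len(x)

def log_lik(mu, sig2):
    """The normal log likelihood derived above."""
    ssd = sum((xi - mu) ** 2 for xi in x)
    return -n / 2 * math.log(2 * math.pi) - n / 2 * math.log(sig2) - ssd / (2 * sig2)

# Closed-form ML solutions from the first-order conditions
mu_ml = sum(x) / n                                  # the sample mean
sig2_ml = sum((xi - mu_ml) ** 2 for xi in x) / n    # divides by n, not n - 1

best = log_lik(mu_ml, sig2_ml)
```

Evaluating log_lik at any nearby (µ, σ²) pair gives a smaller value than best, confirming that the analytic solutions maximize the likelihood.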

10.3 Are ML and OLS equivalent?

• We would be particularly perturbed if OLS and ML gave us different estimates of the regression coefficients, β.

• Since the error terms under the Gauss-Markov assumptions are assumed to be distributed normal, we can set up the regression model as an ML problem, just as we did in the previous example, and check the estimated parameters, β̂ and σ̂².

• The regression model is:

yᵢ = β₀ + β₁xᵢ + uᵢ, where uᵢ ∼ iid N(0, σ²).


• This implies that the yᵢ are independently and normally distributed with respective means β₀ + β₁xᵢ and a common variance σ². The joint density of the observations is therefore:

f(y₁, …, yᵢ, …, yₙ|β, σ²) = Π_{i=1}^{n} (1/(2πσ²))^{1/2} e^{−(yᵢ−β₀−β₁xᵢ)²/(2σ²)}

and the log likelihood is equal to:

ln L(β, σ²) = −(n/2) ln(2π) − (n/2) ln σ² − (1/2) Σ_{i=1}^{n} (yᵢ − β₀ − β₁xᵢ)²/σ²

• We will maximize the log likelihood function with respect to β₀ and β₁ and then with respect to σ².

• Note that only the last term in the log likelihood function involves β₀ and β₁, and that maximizing this term is the same as minimizing the sum of squared errors, since there is a negative sign in front of the whole term.

• Thus, the ML estimators of β₀ and β₁, together β̂ML, are the same as the least squares estimates that we have been dealing with throughout.

• Substituting β̂ML into the log likelihood and setting Q = Σ_{i=1}^{n} (yᵢ − β̂₀ − β̂₁xᵢ)², which is the SSE, we get the log likelihood function in terms of σ² only:

ln L(σ²) = −(n/2) ln(2π) − (n/2) ln σ² − Q/(2σ²) = a constant − (n/2) ln σ² − Q/(2σ²)

• Differentiating this with respect to σ and setting the derivative equal to zero we get:

∂ ln L/∂σ = −n/σ + Q/σ³ = 0

This gives the ML estimator for σ²: σ̂²ML = Q/n = SSE/n

• This estimator is different from the unbiased estimator that we have been using: σ̂²OLS = Q/(n − k) = SSE/(n − k).


• For large n, however, the two estimates will be very close. Thus, we can say that the ML estimate is consistent for σ², and we will define the meaning of consistency explicitly in the future.

• We can use these results to show the value of the log likelihood function at its maximum, a quantity that can sometimes be useful for computing test statistics. We will continue to use the short-hand that Q = SSE and will also substitute into the equation for the log likelihood, ln L, that σ̂²ML = Q/n. Substituting this into the equation for the log likelihood above, we get:

ln L(β̂, σ̂²) = −(n/2) ln(2π) − (n/2) ln(Q/n) − Q/(2(Q/n))

= −(n/2) ln(2π) − (n/2) ln Q + (n/2) ln(n) − n/2

• Putting together all the terms that do not rely on Q, since they are constant and do not depend on the values of the estimated regression coefficients, β̂ML, we can say that the maximum value of the log likelihood is:

max ln L = a constant − (n/2) ln Q = a constant − (n/2) ln(SSE)

• Taking the anti-log, we can say that the maximum value of the likelihood function is:

max L = a constant · (SSE)^{−n/2}

• This shows that, for a given sample size, n, the maximum value of the likelihood and log likelihood functions will rise as SSE falls. This can be handy to know, as it implies that measures of “goodness of fit” can be based on the value of the likelihood or log likelihood function at its maximum.

• In conclusion, we can say that ML is equivalent to OLS for the classical linear regression model. The real power of ML, however, is that we can use it in many cases where we do not assume that the errors are normally distributed or that the model is equal to yᵢ = β₀ + β₁xᵢ + uᵢ, so that the method is far more general.
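The equivalence can be verified numerically. This sketch (simulated data, hypothetical parameter values) computes the OLS fit, forms both the ML and the unbiased variance estimates, and checks that the maximized log likelihood equals the “constant − (n/2) ln SSE” form derived above:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x = rng.uniform(-2.0, 2.0, n)
y = 0.5 + 1.5 * x + rng.normal(0.0, 1.0, n)  # hypothetical DGP

X = np.column_stack([np.ones(n), x])
b_ols = np.linalg.solve(X.T @ X, X.T @ y)    # minimizes SSE, hence maximizes ln L
e = y - X @ b_ols
Q = e @ e                                     # SSE at the minimum

sig2_ml = Q / n                               # ML estimate: divides by n
s2_ols = Q / (n - 2)                          # unbiased OLS estimate: n - k

# Two ways to get the maximized log likelihood, which must agree:
# direct evaluation at (b_ols, sig2_ml) ...
ll_direct = -n / 2 * np.log(2 * np.pi) - n / 2 * np.log(sig2_ml) - Q / (2 * sig2_ml)
# ... and the "constant - (n/2) ln Q" form, with the constant written out
ll_formula = -n / 2 * (np.log(2 * np.pi) + 1 - np.log(n)) - n / 2 * np.log(Q)
```

Note that sig2_ml is always smaller than s2_ols, which is the bias discussed above.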


• For instance, models with binary dependent variables are estimated via ML, using the logistic distribution for the data if a logit model is chosen and a normal if the probit is used. Event count models are estimated via ML, assuming that the original data are distributed Poisson. Thus, ML is the operative method for estimating a number of models that are frequently employed in political science.

10.4 Inference and hypothesis testing with ML

• The tests that one can perform, and the statistics we can compute, on the ML regression coefficients are analogous to the tests and statistics we can form under OLS. The difference is that they are “large sample” tests.

10.4.1 The likelihood ratio test

• The likelihood ratio (LR) test is analogous to an F-test calculated using the residuals from the restricted and unrestricted models. Let θ be the set of parameters in the model and let L(θ) be the likelihood function.

• Hypotheses such as θ₁ = 0 impose restrictions on the set of parameters θ. What the LR test says is that we first obtain the maximum of L(θ) without any restrictions, and we then calculate the likelihood with the restrictions imposed by the hypothesis to be tested.

• We then consider the ratio

λ = [max L(θ) under the restrictions] / [max L(θ) without the restrictions]

• λ will in general be less than one, since the restricted maximum will be less than the unrestricted maximum.

• If the restrictions are not valid, then λ will be significantly less than one. If they are valid, λ will be close to one. The LR test consists of using −2 ln λ as a test statistic, which is distributed χ²ₖ, where k is the number of restrictions. If this amount is larger than the critical value of the chi-square distribution, then we reject the null.


• Another frequently used test statistic that is used with ML estimates is the Wald test, which is analogous to the t-statistic:

W = (β̂ML − β) / se(β̂ML)

where β is the hypothesized value and se(β̂ML) is the standard error of β̂ML, calculated using σ̂²ML(X′X)⁻¹.

• The only difference from a regular t-test is that this is a large sample test. Since in large samples the t-distribution becomes equivalent to the standard normal, the Wald statistic is distributed standard normal.
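For the linear-normal case, the Wald statistic can be computed directly. This sketch uses simulated data with a hypothetical true slope of 0.8, so the null β₁ = 0 should be rejected against the standard normal critical value:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 500
x = rng.normal(size=n)
y = 1.0 + 0.8 * x + rng.normal(0.0, 1.0, n)  # hypothetical DGP with beta1 = 0.8

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
sig2_ml = e @ e / n                           # ML variance estimate (divides by n)
se_ml = np.sqrt(sig2_ml * np.diag(XtX_inv))   # from sigma^2_ML (X'X)^-1

W = (b[1] - 0.0) / se_ml[1]                   # Wald statistic for H0: beta1 = 0
reject_at_05 = abs(W) > 1.96                  # standard normal critical value
```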

10.5 The precision of the ML estimates

• If the likelihood is very curved at the maximum, then the ML estimate of any parameter contains more information about what the true parameters are likely to be.

• If the likelihood is quite flat at the maximum, then the maximum is not telling you much about the likelihood as a whole and the maximum is not a great summary of the likelihood. Hence, a measure of the likelihood function’s curvature is also a measure of the precision of the ML estimate.

• What is a reasonable measure of the likelihood function’s curvature? The second derivative of any function tells us how much the slope (or gradient) of the function changes as we move along the x axis.

• At the maximum, the slope is changing from positive, through zero, to negative. The larger the second derivative in absolute value, the more curved the likelihood function is at the maximum and the more information is given by the ML estimates.


• Information about curvature and precision can therefore be captured by the second derivatives of the likelihood function. Indeed, the information matrix of the likelihood function is used for this purpose and is given by:

I(θ|y) = −E[∂² ln L(θ|y)/∂θ∂θ′]|_{θ=θ̂}

where θ̂ are the ML estimates (e.g., this would include β̂ML and σ̂² for the linear regression model). Thus, the information matrix is the negative of the expectation of the second derivatives of the likelihood function, evaluated at the point where θ = θ̂.

• If θ is a single parameter, then the larger is I(θ|y), the more curved the likelihood (or log likelihood) and the more precise is θ̂. If θ is a vector of parameters (the normal case), then I(θ|y) is a K × K matrix with diagonal elements containing the information on each corresponding element of θ.

• Because of the expectations operator, the information matrix must be estimated. The most intuitive estimate is:

Î(θ|y) = −[∂² ln L(θ|y)/∂θ∂θ′]|_{θ=θ̂}

In other words, one just uses the data sample at hand to estimate the second derivatives.

• The information matrix is closely related to the asymptotic variance of the ML estimates, θ̂. It can be shown that the asymptotic variance of the ML estimates, across an infinite number of hypothetically repeated samples, is:

V(θ̂) ≈ [I(θ|y)]⁻¹ = (E[−∂² ln L(θ|y)/∂θ∂θ′]|_{θ=θ̂})⁻¹

• Intuition: the variance is inversely related to the curvature. The greater the curvature, the more the log likelihood resembles a spike around the ML estimates θ̂ and the lower the variance in those estimates will be.


• In a given sample of data, one can estimate V(θ̂) on the basis of the estimated information:

V̂(θ̂) = [Î(θ|y)]⁻¹ = (−[∂² ln L(θ|y)/∂θ∂θ′]|_{θ=θ̂})⁻¹

• This expression is also known as the Cramer-Rao lower bound on the variance of the estimates. The Cramer-Rao lower bound is the lowest value that the asymptotic variance of any consistent estimator can take.

• Thus, the ML estimates have the very attractive property that they are asymptotically efficient. In other words, in large samples, they achieve the lowest possible variance (thus the highest efficiency) of any potential estimate of the true parameters.

• For the linear regression case, the estimate of V(θ̂) is σ̂²ML(X′X)⁻¹. This is the same as the variance-covariance matrix of β̂OLS except that we use the ML estimate of σ² to estimate the true variance of the error term.


Part II

Violations of Gauss-Markov Assumptions in the Classical Linear Regression Model


Section 11

Large Sample Results and Asymptotics

11.1 What are large sample results and why do we care about them?

• Large sample results for any estimator, θ̂, are the properties that we can say hold true as the number of data points, n, used to estimate θ becomes “large.”

• Why do we care about these large sample results? We have an OLS model that, when the Gauss-Markov assumptions hold, has desirable properties. Why would we ever want to rely on the more difficult mathematical proofs that involve the limits of estimators as n becomes large?

• Recall the Gauss-Markov assumptions:

1. The true model is a linear functional form of the data: y = Xβ + ε.

2. E[ε|X] = 0

3. E[εε′|X] = σ²I

4. X is n × k with rank k (i.e., full column rank)

5. ε|X ∼ N[0, σ²I]

• Recall that if we are prepared to make a further, simplifying assumption, that X is fixed in repeated samples, then the expectations conditional on X can be written in unconditional form.

• There are two main reasons for the use of large sample results, both of which have to do with violations of these assumptions.

1. Errors are not distributed normally

– If assumption 5, above, does not hold, then we cannot use the small sample results.

– We established that β̂OLS is unbiased and the “best linear unbiased estimator” without any recourse to the normality assumption, and we also established that the variance of β̂OLS is σ²(X′X)⁻¹.


– We used normality to show that β̂OLS ∼ N(β, σ²(X′X)⁻¹) and that (n − k)s²/σ² ∼ χ²ₙ₋ₖ, where the latter involved showing that (n − k)s²/σ² can also be expressed as a quadratic form in ε/σ, which is distributed standard normal.

– Both of these results were used to show that we could calculate test statistics that are distributed as t and F. It is these results on test statistics, and our ability to perform hypothesis tests, that are invalidated if we cannot assume that the true errors are normally distributed.

2. Non-linear functional forms

– We may be interested in estimating non-linear functions of the original model.

– E.g., suppose that you have an unbiased estimator, π*, of

π = 1/(1 − β)

but you want to estimate β. You cannot simply use (1 − 1/π*) as an unbiased estimate β* of β.

– To do so, you would have to be able to prove that E(1 − 1/π*) = β. It is not true, however, that the expected value of a non-linear function of π* is equal to the non-linear function of the expected value of π* (this fact is also known as “Jensen’s Inequality”).

– Thus, we can’t make the last step that we would require (in small samples) to show that our estimator of β is unbiased. As Kennedy says, “the algebra associated with finding (small sample) expected values can become formidable whenever non-linearities are involved.” This problem disappears in large samples.

– The models for discrete and limited dependent variables that we will discuss later in the course involve non-linear functions of the parameters, β. Thus, we cannot use the G-M assumptions to prove that these estimates are unbiased. However, ML estimates have attractive large sample properties. Thus, our discussion of the properties of those models will always be expressed in terms of large sample results.


11.2 What are desirable large sample properties?

• Under finite sample conditions, we look for estimators that are unbiased and efficient (i.e., have minimum variance among unbiased estimators). We have also found it useful, when calculating test statistics, to have estimators that are normally distributed. In the large sample setting, we look for analogous properties.

• The large sample analog to unbiasedness is consistency. An estimator β̂ is consistent if it converges in probability to β, that is, if its probability limit as n becomes large is β (plim β̂ = β).

• The best way to think about this is that the sampling distribution of β̂ collapses to a spike around the true β.

• Two Notions of Convergence

1. “Convergence in probability”:

Xₙ →P X iff lim_{n→∞} Pr(|X(ω) − Xₙ(ω)| ≥ ε) = 0

where ε is some small positive number. Equivalently, plim Xₙ(ω) = X(ω).

2. “Convergence in quadratic mean (or mean square)”: If Xₙ has mean µₙ and variance σ²ₙ such that lim_{n→∞} µₙ = c and lim_{n→∞} σ²ₙ = 0, then Xₙ converges in mean square to c.

Note: convergence in quadratic mean implies convergence in probability (Xₙ →qm c ⇒ Xₙ →P c).

• An estimator may be biased in small samples but consistent. For example, take an estimator β̂ = β + 1/n. In small samples this is biased, but as the sample size becomes infinite, the bias 1/n goes to zero (lim_{n→∞} 1/n = 0).


• Although an estimator may be biased yet consistent, it is very hard (read: impossible) for it to be unbiased but inconsistent. If β̂ is unbiased, there is nowhere for it to collapse to but β. Thus, the only way an estimator could be unbiased but inconsistent is if its sampling distribution never collapses to a spike around the true β.

• We are also interested in asymptotic normality. Even though an estimator may not be distributed normally in small samples, we can usually appeal to some version of the central limit theorem (see Greene, 6th ed., Appendix D.2.6 or Kennedy, Appendix C) to show that it will be distributed normally in large samples.

• More precisely, the different versions of the central limit theorem state that the mean of any random variable, whatever the distribution of the underlying variable, will in the limit be distributed such that:

√n(x̄ₙ − µ) →d N[0, σ²]

Thus, even if xᵢ is not distributed normal, the sampling distribution of the average of an independent sample of the xᵢ’s will be distributed normal.

• To see how we arrived at this result, begin with the variance of the mean: var(x̄ₙ) = σ²/n.

• Next, by the CLT, this will be distributed normal: x̄ₙ ∼ N(µ, σ²/n).

• As a result,

(x̄ₙ − µ)/(σ/√n) = √n(x̄ₙ − µ)/σ ∼ N(0, 1)

Multiplying the expression above by σ will get you the result on the asymptotic distribution of the sample average.

• When it comes to establishing the asymptotic normality of estimators, we can usually express the estimator in the form of an average, as a sum of values divided by the number of observations n, so that we can then apply the central limit theorem.
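The CLT is easy to see in simulation. This sketch draws repeated samples from a deliberately non-normal distribution (an exponential with mean 1, a hypothetical choice) and standardizes the sample means; the result behaves like a standard normal:

```python
import numpy as np

rng = np.random.default_rng(9)

# Repeated samples from a decidedly non-normal distribution:
# the exponential with mu = sigma = 1
n, reps = 200, 5000
samples = rng.exponential(1.0, size=(reps, n))

# Standardized sample means: sqrt(n) (xbar - mu) / sigma
z = np.sqrt(n) * (samples.mean(axis=1) - 1.0) / 1.0

mean_z, var_z = z.mean(), z.var()              # should be near 0 and 1
share_within_196 = np.mean(np.abs(z) < 1.96)   # should be near 0.95
```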


• We are also interested in asymptotic efficiency. An estimator is asymptotically efficient if it is consistent, asymptotically normally distributed, and has an asymptotic covariance matrix that is not larger than the asymptotic covariance matrix of any other consistent, asymptotically normally distributed estimator.

• We will rely on the result that (under most conditions) the maximum likelihood estimator is asymptotically efficient. In fact, it attains the smallest possible variance, the Cramer-Rao lower bound, if that bound exists. Thus, to show that any estimator is asymptotically efficient, it is sufficient to show that the estimator in question either is the maximum likelihood estimate or has identical asymptotic properties.

11.3 How do we figure out the large sample properties of an estimator?

• To show that any estimator of any quantity, θ, is consistent, we have to show that plim θ̂ = θ. The means of doing so is to show that any bias approaches zero as n becomes large and that the variance of the sampling distribution also collapses to zero.

• To show that θ̂ is asymptotically normal, we have to show that its sampling distribution can be expressed as the sampling distribution of a sample average pre-multiplied by √n.

• Let’s explore the large sample properties of β̂OLS without assuming that ε ∼ N(0, σ²I).

11.3.1 The consistency of βOLS

• Begin with the expression for β̂OLS:

β̂OLS = β + (X′X)⁻¹X′ε

• Instead of taking the expectation, we now take the probability limit:

plim β̂OLS = β + plim (X′X)⁻¹X′ε


We can multiply both sides of the equation by n/n = 1 to produce:

plim β̂OLS = β + plim (X′X/n)⁻¹(X′ε/n)

• For the next step we need the Slutsky Theorem: for a continuous function g(xₙ) that is not a function of n, plim g(xₙ) = g(plim xₙ).

• An implication of this theorem is that if xₙ and yₙ are random variables with plim xₙ = c and plim yₙ = d, then plim (xₙ · yₙ) = c · d.

• If Xₙ and Yₙ are random matrices with plim Xₙ = A and plim Yₙ = B, then plim XₙYₙ = AB.

• Thus, we can say that:

plim β̂OLS = β + plim (X′X/n)⁻¹ plim (X′ε/n)

• Since the inverse is a continuous function, the Slutsky theorem enables us to bring the first plim inside the parentheses:

plim β̂OLS = β + (plim X′X/n)⁻¹ plim (X′ε/n)

• Let’s assume that

lim_{n→∞} (X′X/n) = Q

where Q is a finite, positive definite matrix. In words, as n increases, the elements of X′X do not increase at a rate greater than n and the explanatory variables are not linearly dependent.

• To fix ideas, let’s consider a case where this assumption would not be valid:

yₜ = β₀ + β₁t + εₜ

which would give

X′X = [ T            Σ_{t=1}^{T} t
        Σ_{t=1}^{T} t   Σ_{t=1}^{T} t² ]
    = [ T            T(T + 1)/2
        T(T + 1)/2   T(T + 1)(2T + 1)/6 ]

Taking limits gives

lim_{T→∞} (X′X/T) = [ 1  ∞
                      ∞  ∞ ].


• More generally, each element of (X′X) is composed of the sum of squares or the sum of cross-products of the explanatory variables. As such, the elements of (X′X) grow larger with each additional data point, n.

• But if we assume the elements of this matrix do not grow at a rate faster than n and the columns of X are not linearly dependent, then dividing by n gives convergence to a finite matrix.

• We can now say that plim β̂OLS = β + Q⁻¹ plim (X′ε/n), and the next step in the proof is to show that plim (X′ε/n) is equal to zero. To demonstrate this, we will prove that its expectation is equal to zero and that its variance converges to zero.

• Think about the individual elements in (X′ε/n). This is a k × 1 vector in which each element is the sum over all n observations of a given explanatory variable multiplied by each realization of the error term, divided by n. In other words:

(1/n) X′ε = (1/n) Σ_{i=1}^{n} xᵢεᵢ = (1/n) Σ_{i=1}^{n} wᵢ = w̄   (11.1)

where w̄ is a k × 1 vector of the sample averages of x_{ki}εᵢ. (Verify this if it is not clear to you.)

• Since we are still assuming that X is non-stochastic, we can work through the expectations operator to say that:

E[w̄] = (1/n) Σ_{i=1}^{n} E[wᵢ] = (1/n) Σ_{i=1}^{n} xᵢE[εᵢ] = (1/n) X′E[ε] = 0   (11.2)

In addition, using the fact that E[εε′] = σ²I, we can say that:

var[w̄] = E[w̄w̄′] = (1/n) X′E[εε′]X (1/n) = (σ²/n)(X′X/n)

• In the limit, as n → ∞, σ²/n → 0 and X′X/n → Q, and thus

lim_{n→∞} var[w̄] = 0 · Q = 0


• Therefore, we can say that w̄ converges in mean square to 0 ⇒ plim (X′ε/n) is equal to zero, so that:

plim β̂OLS = β + Q⁻¹ · 0 = β

Thus, the OLS estimator is consistent as well as unbiased.

11.3.2 The asymptotic normality of OLS

• Let’s show that the OLS estimator, β̂OLS, is also asymptotically normal. Start with

β̂OLS = β + (X′X)⁻¹X′ε

and then subtract β from each side and multiply through by √n (using √n = n/√n) to yield:

√n(β̂OLS − β) = (X′X/n)⁻¹ (1/√n) X′ε

We’ve already established that the first term on the right-hand side converges to Q⁻¹. We need to derive the limiting distribution of the term (1/√n) X′ε.

• From equations 11.1 and 11.2, we can write

(1/√n) X′ε = √n(w̄ − E[w̄])

and then find the limiting distribution of √n w̄.

• To do this, we will use a variant of the CLT (called Lindeberg-Feller), which allows for variables to come from different distributions.

From above, we know that

w̄ = (1/n) Σ_{i=1}^{n} xᵢεᵢ

which means w̄ is the average of n independent vectors xᵢεᵢ with means 0 and variances

var[xᵢεᵢ] = σ²xᵢxᵢ′ = σ²Qᵢ


• Thus,

var[√n w̄] = σ²Q̄_n = σ²(1/n)[Q₁ + Q₂ + · · · + Q_n] = σ²(1/n)∑_{i=1}^{n} x_ix_i′ = σ²(X′X/n)

• Assuming the sum is not dominated by any particular term and that lim_{n→∞}(X′X/n) = Q, then

lim_{n→∞} σ²Q̄_n = σ²Q

• We can now invoke the Lindeberg-Feller CLT to formally state that if the ε_i are iid with mean 0 and finite variance, and if each element of X is finite and lim_{n→∞}(X′X/n) = Q, then

(1/√n)X′ε →d N[0, σ²Q]

It follows that:

Q⁻¹(1/√n)X′ε →d N[Q⁻¹ · 0, Q⁻¹(σ²Q)Q⁻¹]

[Recall that if a random variable X has a variance equal to σ², then kX, where k is a constant, has a variance equal to k²σ².]

• Combining terms, and recalling what it was we were originally interested in:

√n(βOLS − β) →d N[0, σ²Q⁻¹]

• How do we get from here to a statement about the normal distribution of βOLS? Divide through by √n on both sides and add β to show that the OLS estimator is asymptotically normally distributed:

βOLS ∼a N[β, (σ²/n)Q⁻¹]


• To complete the steps, we can also show that s² = e′e/(n − k) is consistent for σ². Thus, s²(X′X/n)⁻¹ is consistent for σ²Q⁻¹. As a result, a consistent estimate of the asymptotic variance of βOLS (= (σ²/n)Q⁻¹ = (σ²/n)(X′X/n)⁻¹) is s²(X′X)⁻¹.

• Thus, we can say that βOLS is normally distributed and a consistent estimate of its asymptotic variance is given by s²(X′X)⁻¹, even when the error terms are not distributed normally. We have gone through the rigors of large sample proofs in order to show that in large samples OLS retains desirable properties that are similar to those it has in small samples when all of the Gauss–Markov conditions hold.

• To conclude, the desirable properties of OLS do not rely on the assumption that the true error term is normally distributed. We can appeal to large sample results to show that the sampling distribution will still be normal as the sample size becomes large and that it will have a variance that can be consistently estimated by s²(X′X)⁻¹.

11.4 The large sample properties of test statistics

• Given the normality results and the consistent estimate of the asymptotic variance given by s²(X′X)⁻¹, hypothesis testing proceeds almost as normal. Hypotheses on individual coefficients can still be tested by constructing a t-statistic:

(β̂_k − β_k)/SE(β̂_k)

• When we did this in small samples, we made clear that we had to use an estimate, √(s²[(X′X)⁻¹]_{kk}), of the true standard error in the denominator, which equals √(σ²[(X′X)⁻¹]_{kk}).


• As the sample size becomes large, we can replace this estimate by its probability limit, which is the true standard error, so that the denominator just becomes a constant. When we do so, we have a normally distributed random variable in the numerator divided by a constant, so the test statistic is distributed as z, or standard normal. Another way to think of this intuitively is that the t-distribution converges to the z distribution as n becomes large.

• Testing joint hypotheses also proceeds via an F-test. In this case, we recall the formula for the F-test, put in terms of the restriction matrix:

F[J, n − K] = [(Rb − q)′[R(X′X)⁻¹R′]⁻¹(Rb − q)/J] / s²

• In small samples, we had to take account of the distribution of s² itself. This gave us the ratio of two chi-squared random variables, which is distributed F.

• In large samples, we can replace s² by its probability limit, σ², which is just a constant. Multiplying both sides by J, we now have that the test statistic JF is composed of a chi-squared random variable in the numerator over a constant, so the JF statistic is the large sample analog of the F-test. If the JF statistic is larger than the critical value, then we can say that the restrictions are unlikely to be true.

11.5 The desirable large sample properties of ML estimators

• Recall that MLEs are consistent, asymptotically normally distributed, and asymptotically efficient, in that they achieve the Cramér-Rao lower bound when this bound exists.

• Thus, MLEs always have desirable large sample properties, although their small-sample estimates (e.g., of σ²) may be biased.

• We could also have shown that OLS is consistent, asymptotically normally distributed, and asymptotically efficient by noting that OLS is the MLE for the classical linear regression model.


11.6 How large does n have to be?

• Does any of this help us if we need an infinity of data points before we can attain consistency and asymptotic normality? How do we know how many data points are required before the sampling distribution of βOLS becomes approximately normal?

• One way to check this is via Monte Carlo studies. Using a non-normal distribution for the error terms, we simulate repeated samples from the model y = Xβ + ε, using a different set of errors each time.

• We then calculate βOLS from repeated samples of size n and plot these different sample estimates βOLS on a histogram.

• We gradually enlarge n, checking how large n has to be in order to give us a sampling distribution that is approximately normal.
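This procedure can be sketched in a few lines of numpy. The design below (sample size, the centered-exponential error choice, and the coefficient values) is purely illustrative; the analytic standard deviation uses σ² = 1, the variance of that error distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

def slope_sampling_dist(n, reps=2000):
    """Sampling distribution of the OLS slope in y = 1 + 2x + e,
    with skewed (centered exponential) errors and fixed regressors."""
    x = rng.uniform(0, 10, size=n)
    X = np.column_stack([np.ones(n), x])
    proj = np.linalg.inv(X.T @ X) @ X.T          # (X'X)^{-1} X'
    draws = np.empty(reps)
    for r in range(reps):
        e = rng.exponential(1.0, size=n) - 1.0   # non-normal, mean 0, var 1
        draws[r] = (proj @ (X @ np.array([1.0, 2.0]) + e))[1]
    # analytic sd of the slope under sigma^2 = 1: sqrt([(X'X)^{-1}]_22)
    sd_theory = np.sqrt(np.linalg.inv(X.T @ X)[1, 1])
    return draws, sd_theory

draws, sd_theory = slope_sampling_dist(n=500)
print(draws.mean(), draws.std(), sd_theory)
```

Repeating this for gradually larger n and plotting `draws` as a histogram shows the sampling distribution becoming symmetric and bell-shaped, as described above.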


Section 12

Heteroskedasticity

12.1 Heteroskedasticity as a violation of Gauss-Markov

• The third of the Gauss-Markov assumptions is that E[εε′] = σ²In. The variance-covariance matrix of the true error terms is structured as:

E(εε′) = [ E(ε₁²)    E(ε₁ε₂)   . . .  E(ε₁ε_N)
           ⋮                    ⋱      ⋮
           E(ε_Nε₁)  E(ε_Nε₂)  . . .  E(ε_N²) ]

       = [ σ²  0   . . .  0
           0   σ²  . . .  0
           ⋮              ⋮
           0   0   . . .  σ² ]

• If all of the diagonal terms are equal to one another, then each realization of the error term has the same variance, and the errors are said to be homoskedastic. If the diagonal terms are not the same, then the true error term is heteroskedastic.

• Also, if the off-diagonal elements are zero, then the covariance between different error terms is zero and the errors are uncorrelated. If the off-diagonal terms are non-zero, then the error terms are said to be autocorrelated: the error term for one observation is correlated with the error term for another observation.

• When the third assumption is violated, the variance-covariance matrix of the error term does not take the special form σ²In and is generally written instead as σ²Ω. The disturbances in this case are said to be non-spherical, and the model should then be estimated by Generalized Least Squares, which employs σ²Ω rather than σ²In.


12.1.1 Consequences of non-spherical errors

• The OLS estimator, βOLS, is still unbiased and (under most conditions) consistent.

Proof of unbiasedness: As before,

βOLS = β + (X′X)⁻¹X′ε

and we still have that E[ε|X] = 0; thus E(βOLS) = β.

• The variance of βOLS is E[(βOLS − E(βOLS))(βOLS − E(βOLS))′], which we can evaluate as:

var(βOLS) = E[(X′X)−1X′εε′X(X′X)−1]

= (X′X)−1X′(σ2Ω)X(X′X)−1

= σ2(X′X)−1(X′ΩX)(X′X)−1

• Therefore, if the errors are normally distributed,

βOLS ∼ N [β, σ2(X′X)−1(X′ΩX)(X′X)−1]

• We can also use the formula we derived in the lecture on asymptotics to show that βOLS is consistent.

plim βOLS = β + Q−1plim (X′ε/n)

To show that plim (X′ε/n) = 0, we demonstrated that the expectation of (X′ε/n) was equal to zero and that its variance was equal to (σ²/n)(X′X/n), which becomes zero as n grows to infinity.

• In the case where E[εε′] = σ²Ω, the variance of (X′ε/n) is equal to (σ²/n)(X′ΩX/n). So long as this matrix converges to a finite matrix, then βOLS is also consistent for β.

• Finally, in most cases, βOLS is asymptotically normally distributed with mean β, and its variance-covariance matrix is given by σ²(X′X)⁻¹(X′ΩX)(X′X)⁻¹.


12.2 Consequences for efficiency and standard errors

• So what’s the problem with using OLS when the true error term may be heteroskedastic? It seems like it still delivers estimates of the coefficients that have highly desirable properties.

• The true variance of βOLS is no longer σ²(X′X)⁻¹, so any inference based on s²(X′X)⁻¹ is likely to be misleading.

• Not only is the wrong matrix used, but s² may be a biased estimator of σ².

• In general, there is no way to tell the direction of the bias, although Goldberger (1964) shows that in the special case of only one explanatory variable (in addition to the constant term), s² is biased downward if high error variances correspond with high values of the independent variable.

• Whatever the direction of the bias, we should not use the standard formula for the standard errors in hypothesis tests on βOLS.

• More importantly, OLS is no longer efficient, since another method called Generalized Least Squares will give estimates of the regression coefficients, βGLS, that are unbiased and have a smaller variance.

12.3 Generalized Least Squares

• We assume, as we always do for a variance matrix, that Ω is positive definite and symmetric. Thus, it can be factored into a set of matrices containing its characteristic roots and vectors:

Ω = CΛC′

• Here the columns of C contain the characteristic vectors of Ω, and Λ is a diagonal matrix containing its characteristic roots. We can “factor” Ω using the square root of Λ, written Λ^{1/2}, a matrix containing the square roots of the characteristic roots on the diagonal.

• Then, if T = CΛ^{1/2}, we have TT′ = Ω, because TT′ = CΛ^{1/2}Λ^{1/2}C′ = CΛC′ = Ω. Also, if we let P′ = CΛ^{−1/2}, then P′P = Ω⁻¹.


• It can be shown that the characteristic vectors are all orthogonal and that for each characteristic vector, c_i′c_i = 1 (Greene, 6th ed., pp. 968–969). It follows that CC′ = I and C′C = I, a fact that we will use below.

• GLS consists of estimating the following equation, using the standard OLS solutions for the regression coefficients:

Py = PXβ + Pε

or

y∗ = X∗β + ε∗.

Thus, βGLS = (X∗′X∗)⁻¹(X∗′y∗) = (X′Ω⁻¹X)⁻¹(X′Ω⁻¹y).

• It follows that the variance of ε∗ is E[ε∗ε∗′] = Pσ²ΩP′, and:

Pσ²ΩP′ = σ²PΩP′ = σ²Λ^{−1/2}C′ΩCΛ^{−1/2}
       = σ²Λ^{−1/2}C′CΛ^{1/2}Λ^{1/2}C′CΛ^{−1/2}
       = σ²Λ^{−1/2}Λ^{1/2}Λ^{1/2}Λ^{−1/2}
       = σ²In
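These factorization results are easy to verify numerically. Here is a quick numpy sketch with a randomly generated positive definite Ω (the 4 × 4 size and the construction of Ω are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# a random symmetric positive definite Omega
A = rng.normal(size=(4, 4))
Omega = A @ A.T + 4 * np.eye(4)

lam, C = np.linalg.eigh(Omega)   # Omega = C diag(lam) C'
P = np.diag(lam**-0.5) @ C.T     # P = Lambda^{-1/2} C', i.e., P' = C Lambda^{-1/2}

print(np.allclose(P.T @ P, np.linalg.inv(Omega)))   # P'P = Omega^{-1}
print(np.allclose(P @ Omega @ P.T, np.eye(4)))      # P Omega P' = I
```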

12.3.1 Some intuition

• In the standard case of heteroskedasticity, GLS consists of dividing each observation by the square root of its own element in the Ω matrix, √ω_i. The nice thing about this is that the variance of ε∗, equal to E[ε∗ε∗′] = Pσ²ΩP′, is σ²I.

• In effect, we’ve removed the heteroskedasticity from the residuals, and we can go ahead and estimate the variance of βGLS using the formula σ²(X∗′X∗)⁻¹ = σ²(X′Ω⁻¹X)⁻¹. This is also known as the Aitken estimator, after the statistician who originally proposed the method in 1935.

• We can also show, using the standard methods, that βGLS is unbiased, consistent, and asymptotically normally distributed.

• We can conclude that βGLS is BLUE for the generalized model in which the variance of the errors is given by σ²Ω. The result follows by applying the Gauss-Markov theorem to the model y∗ = X∗β + ε∗.


• In this general case (with normally distributed errors), the maximum likelihood estimator will be the GLS estimator, βGLS, and the Cramér-Rao lower bound for the variance of βGLS is given by σ²(X′Ω⁻¹X)⁻¹.

12.4 Feasible Generalized Least Squares

• All of the above assumes that Ω is a known matrix, which is usually not the case.

• One option is to estimate Ω in some way and to use Ω̂ in place of Ω in the GLS model above. For instance, one might believe that the true error variance is a function of one (or more) of the independent variables. Thus,

ε_i² = γ₀ + γ₁x_i + u_i

• Since βOLS is consistent and unbiased for β, we can use the OLS residuals to estimate the model above. The procedure is that we first estimate the original model using OLS, and then use the residuals from this regression to estimate Ω̂ and √ω̂_i.

• We then transform the data using these estimates, and use OLS again on the transformed data to estimate βGLS and the correct standard errors for hypothesis testing.

• One can also use the estimated error terms from the last stage to conduct FGLS again, and keep doing this until the error model begins to converge, so that the estimated residuals barely change as one moves through the iterations.

• Some examples:

If σ_i² = σ²x_i², we would divide all observations through by x_i.

If σ_i² = σ²x_i, we would divide all observations through by √x_i.

If σ_i² = σ²(γ₀ + γ₁x_i + u_i), we would divide all observations through by √(γ₀ + γ₁x_i).


• Essential Result: So long as our estimate of Ω is consistent, the FGLS estimator will be consistent and asymptotically efficient.

• Problem: Except for the simplest cases, the finite-sample properties and exact distributions of FGLS estimators are unknown. Intuitively, var(βFGLS) ≠ var(βGLS) because we also have to take into account the uncertainty in estimating Ω. Since we cannot work out the small sample distribution, we cannot even say that βFGLS is unbiased in small samples.
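To make the mechanics concrete, here is a small numpy sketch of the σ_i² = σ²x_i² case listed above, where dividing every observation through by x_i removes the heteroskedasticity. The data-generating values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# y = 1 + 2x + e, with error sd proportional to x (so var_i = sigma^2 x_i^2)
n = 2000
x = rng.uniform(1, 5, size=n)
y = 1.0 + 2.0 * x + x * rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

# OLS: unbiased but inefficient, and its usual standard errors are wrong
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# (F)GLS for this case: divide y and every column of X through by x_i
b_gls = np.linalg.lstsq(X / x[:, None], y / x, rcond=None)[0]

print(b_ols, b_gls)   # both should be near (1, 2)
```

Across repeated samples, the GLS slope estimates cluster more tightly around the truth than the OLS ones, which is the efficiency gain the theory promises.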

12.5 White-consistent standard errors

• If you can find a consistent estimator for Ω, then go ahead and perform FGLS if you have a large number of data points available.

• Otherwise, and in most cases, analysts estimate the regular OLS model and use “White-consistent standard errors,” which are described by Leamer as “White-washing heteroskedasticity.”

• White’s heteroskedasticity consistent estimator of the variance-covariance matrix of βOLS is recommended whenever OLS estimates are being used for inference in a situation in which heteroskedasticity is suspected, but the researcher is unable to consistently estimate Ω and use FGLS.

• White’s method consists of finding a consistent estimator for the true OLS variance below:

var[βOLS] = (1/n)(X′X/n)⁻¹ (X′[σ²Ω]X/n) (X′X/n)⁻¹

• The trick to White’s estimation of the asymptotic variance-covariance matrix is to recognize that what we need is a consistent estimator of:

Q∗ = σ²X′ΩX/n = (1/n)∑_{i=1}^{n} σ_i²x_ix_i′


• Under very general conditions,

S₀ = (1/n)∑_{i=1}^{n} e_i²x_ix_i′

(which is a k × k matrix with k(k + 1)/2 distinct terms) is consistent for Q∗. (See Greene, 6th ed., pp. 162–163.)

• Because βOLS is consistent for β, we can show that White’s estimate of var[βOLS] is consistent for the true asymptotic variance.

• Thus, without specifying the exact nature of the heteroskedasticity, we can still calculate a consistent estimate of var[βOLS] and use this in the normal way to derive standard errors and conduct hypothesis tests. This makes the White-consistent estimator extremely attractive in a wide variety of situations.

• The White heteroskedasticity consistent estimator is

Est. asy. var[βOLS] = (1/n)(X′X/n)⁻¹ [(1/n)∑_{i=1}^{n} e_i²x_ix_i′] (X′X/n)⁻¹
                    = n(X′X)⁻¹ S₀ (X′X)⁻¹

• Adding the robust option to the regression command in Stata produces standard errors computed from this estimated asymptotic variance matrix.
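Outside Stata, the sandwich is simple to compute directly. Here is a numpy sketch; the simulated heteroskedastic design (error sd of 0.5 + x) is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(3)

n = 1000
x = rng.uniform(0, 4, size=n)
X = np.column_stack([np.ones(n), x])
y = 0.5 + 1.5 * x + (0.5 + x) * rng.normal(size=n)   # heteroskedastic errors

b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b
XtX_inv = np.linalg.inv(X.T @ X)

# classical variance estimate: s^2 (X'X)^{-1} (wrong under heteroskedasticity)
s2 = e @ e / (n - X.shape[1])
se_classical = np.sqrt(np.diag(s2 * XtX_inv))

# White sandwich: (X'X)^{-1} [sum_i e_i^2 x_i x_i'] (X'X)^{-1}
meat = (X * e[:, None] ** 2).T @ X
se_white = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print(se_classical, se_white)
```

The two sets of standard errors differ here precisely because the classical formula uses the wrong variance matrix.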

12.6 Tests for heteroskedasticity

• Homoskedasticity implies that the error variance will be the same across different observations and should not vary with X, the independent variables.

• Unsurprisingly, all tests for heteroskedasticity rely on checking how far the error variances for different groups of observations depart from one another, or how precisely they are explained by the explanatory variables.


12.6.1 Visual inspection of the residuals

• Plot them and look for patterns (residuals against fitted values, residuals against explanatory variables).

12.6.2 The Goldfeld-Quandt test

• The Goldfeld-Quandt test consists of comparing the estimated error variances for two groups of observations.

• First, sort the data points by one of the explanatory variables (e.g., country size). Then run the model separately for the two groups of countries.

• If the true errors are homoskedastic, then the estimated error variances for the two groups should be approximately the same.

• We estimate the true error variance for each group by s_g² = (e_g′e_g)/(n_g − k), where the subscript g indexes the groups. Since each estimated error variance is distributed chi-squared, the test statistic below is distributed F.

• If the true error is homoskedastic, the statistic will be approximately equal to one. The larger the F-statistic, the less likely it is that the errors are homoskedastic. It should be noted that the formulation below is computed assuming that the error variances are higher for the first group.

[e₁′e₁/(n₁ − k)] / [e₂′e₂/(n₂ − k)] = F[n₁ − k, n₂ − k]

• It has also been suggested that the test statistic should be calculated using only the first and last thirds of the data points, excluding the middle section of the data, to sharpen the test results.
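A minimal numpy sketch of the procedure, including the drop-the-middle-third refinement; the simulated design (error sd proportional to the sorting variable) is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(4)

n = 300
x = np.sort(rng.uniform(1, 10, size=n))      # sorted by the suspect variable
y = 2.0 + 1.0 * x + x * rng.normal(size=n)   # error sd grows with x

def group_s2(xg, yg):
    """Estimated error variance e'e/(n_g - k) from a within-group OLS fit."""
    Xg = np.column_stack([np.ones(len(xg)), xg])
    eg = yg - Xg @ np.linalg.lstsq(Xg, yg, rcond=None)[0]
    return eg @ eg / (len(xg) - Xg.shape[1])

# drop the middle third of observations to sharpen the test
lo, hi = slice(0, n // 3), slice(2 * n // 3, n)
F = group_s2(x[hi], y[hi]) / group_s2(x[lo], y[lo])  # larger variance on top
print(F)   # compare to an F critical value with (n1 - k, n2 - k) df
```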


12.6.3 The Breusch-Pagan test

• The main problem with Goldfeld-Quandt is that it requires knowledge of how to sort the data points.

• The Breusch-Pagan test uses a different approach to see if the error variances are systematically related to the independent variables, or to any subset or transformation of those variables.

• The idea behind the test is that for the regression

σ_i² = σ²f(α₀ + α′z_i)

where z_i is a vector of independent variables, if α = 0 then the errors are homoskedastic.

• To perform the test, regress e_i²/(e′e/n) on some combination of the independent variables (z_i).

• We then compute a Lagrange multiplier statistic as the regression (explained) sum of squares from that regression divided by 2. That is,

SSR/2 ∼a χ²_{m−1},

where m is the number of explanatory variables in the auxiliary regression.

• Intuition: if the null hypothesis of no heteroskedasticity is true, then the SSR should equal zero. A non-zero SSR is telling us that e_i²/(e′e/n) varies with the explanatory variables.

• Note that this test depends on the fact that βOLS is consistent for β, so that the residuals e_i from an OLS regression of y_i on x_i will be consistent for ε_i.
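The steps can be sketched directly in numpy; the heteroskedastic design below is an illustrative assumption, and under the null the statistic would be compared to a chi-squared critical value:

```python
import numpy as np

rng = np.random.default_rng(5)

n = 500
x = rng.uniform(1, 5, size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + x * rng.normal(size=n)    # error variance rises with x

e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
g = e ** 2 / (e @ e / n)                      # scaled squared residuals

# auxiliary regression of g on a constant and x (the z_i here is just x)
gam = np.linalg.lstsq(X, g, rcond=None)[0]
ghat = X @ gam
LM = ((ghat - g.mean()) ** 2).sum() / 2.0     # explained sum of squares / 2
print(LM)   # compare to a chi-squared critical value (3.84 at 5%, 1 df)
```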


Section 13

Autocorrelation

13.1 The meaning of autocorrelation

• Autocorrelation is often paired with heteroskedasticity because it is another way in which the variance-covariance matrix of the true error terms (if we could observe it) differs from the Gauss-Markov assumption, E[εε′] = σ²In.

• In our earlier discussion of heteroskedasticity, we saw what happened when we relaxed the assumption that the variance of the error term is constant. An equivalent way of saying this is that we relaxed the assumption that the errors are “identically distributed.” In this section, we see what happens when we relax the assumption that the error terms are independent.

• In this case, we can have errors that covary (e.g., if one error is positive and large, the next error is likely to be positive and large) and are correlated. In either case, one error can give us information about another.

• Two types of error correlation:

1. Spatial correlation: e.g., of contiguous households, states, or counties.

2. Temporal correlation or autocorrelation: errors from adjacent time periods are correlated with one another. Thus, εt is correlated with εt+1, εt+2, . . . and εt−1, εt−2, . . . , ε1.

• The correlation between εt and εt−k is called autocorrelation of order k.

The correlation between εt and εt−1 is the first-order autocorrelation and is usually denoted by ρ1.

The correlation between εt and εt−2 is the second-order autocorrelation and is usually denoted by ρ2.


• What does this imply for E[εε′]?

E(εε′) = [ E(ε₁²)    E(ε₁ε₂)   . . .  E(ε₁ε_T)
           ⋮                    ⋱      ⋮
           E(ε_Tε₁)  E(ε_Tε₂)  . . .  E(ε_T²) ]

Here, the off-diagonal elements are the covariances between the different error terms. Why is this?

• Autocorrelation implies that the off-diagonal elements are not equal to zero.

13.2 Causes of autocorrelation

• Misspecification of the model

• Data manipulation, smoothing, seasonal adjustment

• Prolonged influence of shocks

• Inertia

13.3 Consequences of autocorrelation for regression coefficients and standard errors

• As with heteroskedasticity, in the case of autocorrelation the βOLS regression coefficients are still unbiased and consistent. Note, this is true only in the case where the model does not contain a lagged dependent variable. We will come to this latter case in a bit.

• As before, OLS is inefficient. Further, inferences based on the standard OLS estimate of var(βOLS) = s²(X′X)⁻¹ are wrong.

• OLS estimation in the presence of autocorrelation is likely to lead to an underestimate of σ², meaning that our t-statistics will be inflated, and we are more likely to reject the null when we should not do so.


13.4 Tests for autocorrelation

• Visual inspection, natch.

• Test statistics

There are two well-known test statistics to compute to detect autocorrelation.

1. Durbin-Watson test: most well-known, but it does not always give unambiguous answers and is not appropriate when there is a lagged dependent variable (LDV) in the original model.

2. Breusch-Godfrey (or LM) test: does not have a similar indeterminate range (more later) and can be used with an LDV.

Both tests use the fact, exploited in tests for heteroskedasticity, that since βOLS is consistent for β, the residuals, e_i, will be consistent for ε_i. Both tests, therefore, use the residuals from a preceding OLS regression to estimate the actual correlations between error terms.

At this point, it is useful to remember the expression for the correlation of two variables:

corr(x, y) = cov(x, y) / [√var(x) √var(y)]

For the error terms, if we assume that var(εt) = var(εt−1) = E[εt²], then we have

corr(εt, εt−1) = E[εtεt−1] / E[εt²] = ρ1

To get the sample average of this autocorrelation, we would compute this ratio for each pair of succeeding error terms and divide by n.


13.4.1 The Durbin-Watson test

• The Durbin-Watson test does something similar to estimating the sample autocorrelation between error terms. The DW statistic is calculated as:

d = ∑_{t=2}^{T} (e_t − e_{t−1})² / ∑_{t=1}^{T} e_t²

• We can write d as

d = [∑ e_t² + ∑ e_{t−1}² − 2∑ e_te_{t−1}] / ∑ e_t²

• Since ∑ e_t² is approximately equal to ∑ e_{t−1}² as the sample size becomes large, and ∑ e_te_{t−1}/∑ e_t² is the sample estimate of the autocorrelation coefficient, ρ, we have that d ≅ 2(1 − ρ).

• If ρ = +1, then d = 0; if ρ = −1, then d = 4; if ρ = 0, then d = 2. Thus, if d is close to zero or four, the residuals can be said to be highly correlated.

• The problem is that the exact sampling distribution of d depends on the values of the explanatory variables. To get around this problem, Durbin and Watson derived upper (du) and lower (dl) limits for the significance levels of d.

• If we were testing for positive autocorrelation, versus the null hypothesis of no autocorrelation, we would use the following procedure:

If d < dl, reject the null of no autocorrelation.

If d > du, we do not reject the null hypothesis.

If dl < d < du, the test is inconclusive.

• Can be calculated within Stata using dwstat.
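Outside Stata, the statistic is one line of numpy. The sketch below simulates AR(1) errors (ρ = 0.7 and the rest of the design are illustrative assumptions) and computes d from the OLS residuals:

```python
import numpy as np

rng = np.random.default_rng(6)

T, rho = 400, 0.7
u = rng.normal(size=T)
eps = np.empty(T)
eps[0] = u[0] / np.sqrt(1 - rho ** 2)   # start in the stationary distribution
for t in range(1, T):
    eps[t] = rho * eps[t - 1] + u[t]

x = rng.uniform(size=T)
X = np.column_stack([np.ones(T), x])
y = 1.0 + 2.0 * x + eps

e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
d = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)   # Durbin-Watson statistic
print(d)   # roughly 2(1 - rho), i.e., well below 2 here
```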


13.4.2 The Breusch-Godfrey test

• To perform this test, simply regress the OLS residuals, et, on all the explanatory variables, xt, and on their own lags, et−1, et−2, et−3, . . . , et−p.

• This test works because the coefficients on the lagged error terms will only be significant if the partial autocorrelation between et and et−p is significant. The partial autocorrelation is the autocorrelation between the error terms after accounting for the effect of the X explanatory variables.

• Thus, the null hypothesis of no autocorrelation can be tested using an F-test to see whether the coefficients on the lagged error terms, et−1, et−2, et−3, . . . , et−p, are jointly equal to zero.

• If you would like to be precise, and consider that we are using et as an estimate of εt, you can treat this as a large sample test and say that p · F, where p is the number of lagged error terms restricted to be zero, is asymptotically distributed chi-squared with degrees of freedom equal to p. Then use the chi-squared distribution to obtain critical values for the test.

• Using what is called the Lagrange multiplier (LM) approach to testing, we can also show that n · R² is asymptotically distributed as χ²_p and use this statistic for testing.
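The auxiliary regression is easy to code directly. Here is a numpy sketch with p = 2 lags on simulated AR(1) errors (all design values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)

T, rho, p = 400, 0.6, 2
u = rng.normal(size=T)
eps = np.empty(T)
eps[0] = u[0]
for t in range(1, T):
    eps[t] = rho * eps[t - 1] + u[t]

x = rng.uniform(size=T)
X = np.column_stack([np.ones(T), x])
y = 1.0 + 2.0 * x + eps
e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

# auxiliary regression: e_t on x_t and e_{t-1}, ..., e_{t-p}
dep = e[p:]
lags = np.column_stack([e[p - j:T - j] for j in range(1, p + 1)])
Z = np.column_stack([X[p:], lags])
resid = dep - Z @ np.linalg.lstsq(Z, dep, rcond=None)[0]
R2 = 1 - (resid @ resid) / ((dep - dep.mean()) @ (dep - dep.mean()))
LM = (T - p) * R2   # asymptotically chi-squared with p df under the null
print(LM)
```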

13.5 The consequences of autocorrelation for the variance-covariance matrix

• In the case of heteroskedasticity, we saw that we needed to make some assumptions about the form of the E[εε′] matrix in order to be able to transform the data and use FGLS to calculate correct and efficient standard errors.

• Under autocorrelation, the off-diagonal elements will not be equal to zero, but we don’t know what those off-diagonal elements are. In order to conduct FGLS, we usually make some assumption about the form of autocorrelation and the process by which the disturbances are generated.


• First, we must write the E[εε′] matrix in a way that can represent autocorrelation:

E[εε′] = σ2Ω

• Recall that

corr(εt, εt−s) = E[εtεt−s] / E[εt²] = ρs = γs/γ0

where γs = cov(εt, εt−s) and γ0 = var(εt).

• Let R be the “autocorrelation matrix” showing the correlation between all the disturbance terms. Then E[εε′] = γ0R.

• Second, we need to calculate the autocorrelation matrix. It helps to make an assumption about the process generating the disturbances or true errors. The most common assumption is that the errors follow an “autoregressive process” of order one, written as AR(1).

An AR(1) process is represented as:

εt = ρεt−1 + ut

where E[ut] = 0, E[ut²] = σ_u², and cov[ut, us] = 0 if t ≠ s.

• By repeated substitution, we have:

εt = ut + ρut−1 + ρ²ut−2 + · · ·

• Each disturbance, εt, embodies the entire past history of the us, with the most recent shocks receiving greater weight than those in the more distant past.

• The successive values of ut are uncorrelated, so we can compute the variance of εt, which is equal to E[εt²], as:

var[εt] = σ_u² + ρ²σ_u² + ρ⁴σ_u² + . . .

• This is a series of positive numbers (σ_u²) multiplied by increasing powers of ρ². If |ρ| is greater than or equal to one, the sum diverges and we won’t be able to get an expression for var[εt]. To proceed, we assume that |ρ| < 1.


• Here is a useful trick for a geometric series. If we have an infinite series of numbers

y = x + a·x + a²·x + a³·x + . . . , with |a| < 1,

then

y = x/(1 − a).

• Using this, we see that

var[εt] = σ_u²/(1 − ρ²) = σ_ε² = γ0   (13.1)

• We can also compute the covariances between the errors:

cov[εt, εt−1] = E[εtεt−1] = E[εt−1(ρεt−1 + ut)] = ρ var[εt−1] = ρσ_u²/(1 − ρ²)

• And, since εt = ρεt−1 + ut = ρ(ρεt−2 + ut−1) + ut = ρ²εt−2 + ρut−1 + ut, we have

cov[εt, εt−2] = E[εtεt−2] = E[εt−2(ρ²εt−2 + ρut−1 + ut)] = ρ² var[εt−2] = ρ²σ_u²/(1 − ρ²)

• By repeated substitution, we can see that:

cov[εt, εt−s] = E[εtεt−s] = ρ^s σ_u²/(1 − ρ²) = γs

• Now we can get back to the autocorrelations. Since corr[εt, εt−s] = γs/γ0, we have:

corr[εt, εt−s] = [ρ^s σ_u²/(1 − ρ²)] / [σ_u²/(1 − ρ²)] = ρ^s

• In other words, the autocorrelations fade over time. They are always less than one in absolute value and become smaller the farther apart two disturbances are in time.
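A short simulation can confirm the ρ^s pattern and the variance result in Equation 13.1. This numpy sketch uses illustrative values (ρ = 0.8 and a long series so the sample autocorrelations are precise):

```python
import numpy as np

rng = np.random.default_rng(8)

T, rho = 200_000, 0.8
u = rng.normal(size=T)
eps = np.empty(T)
eps[0] = u[0] / np.sqrt(1 - rho ** 2)   # start in the stationary distribution
for t in range(1, T):
    eps[t] = rho * eps[t - 1] + u[t]

def autocorr(e, s):
    """Sample correlation between e_t and e_{t-s}."""
    return np.corrcoef(e[s:], e[:-s])[0, 1]

print([round(autocorr(eps, s), 3) for s in (1, 2, 3)])  # near rho, rho^2, rho^3
print(eps.var())                                        # near 1/(1 - rho^2)
```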


• The autocorrelation matrix, R, shows all the autocorrelations between the disturbances. Given the link between the autocorrelation matrix and E[εε′] = σ²Ω, we can now say that:

E[εε′] = σ²Ω = γ0R

or

σ²Ω = [σ_u²/(1 − ρ²)] · [ 1        ρ        ρ²       ρ³   . . .  ρ^{T−1}
                          ρ        1        ρ        ρ²   . . .  ρ^{T−2}
                          ρ²       ρ        1        ρ    . . .  ρ^{T−3}
                          ⋮                          ⋱           ⋮
                          ρ^{T−1}  ρ^{T−2}  ρ^{T−3}       . . .  1 ]

13.6 GLS and FGLS under autocorrelation

• Given that we now have the expression for σ²Ω, we can in theory transform the data and estimate via GLS (or use an estimate, σ²Ω̂, for FGLS).

• We are transforming the data so that we get an error term that conforms to the Gauss-Markov assumptions. In the heteroskedasticity case, we showed how we transformed the data using the factorization of σ²Ω.

• In the case of autocorrelation of the AR(1) type, what’s nice is that the transformation is fairly simple, even though the matrix expression for σ²Ω may not look that simple.

• Let’s start with a simple model with an AR(1) error term:

yt = β0 + β1xt + εt,  t = 1, 2, . . . , T   (13.2)

where εt = ρεt−1 + ut.

• Now lag the equation by one period and multiply through by ρ:

ρyt−1 = β0ρ + β1ρxt−1 + ρεt−1

• Subtracting this from Equation 13.2 above, we get:

yt − ρyt−1 = β0(1− ρ) + β1(xt − ρxt−1) + ut


• But (given our assumptions) the uts are serially independent, have constant variance (σ_u²), and the covariance between different us is zero. Thus, the new error term, ut, conforms to all the desirable features of Gauss-Markov errors.

• If we conduct OLS using this transformed data, and use our normal estimate for the standard errors, s²(X∗′X∗)⁻¹, we will get correct and efficient standard errors.

• The transformation is:

y∗t = yt − ρyt−1
x∗t = xt − ρxt−1

• If we drop the first observation (because we don’t have a lag for it), then we are following the Cochrane-Orcutt procedure. If we keep the first observation and use it with the transformation

y∗1 = √(1 − ρ²) y1
x∗1 = √(1 − ρ²) x1

then we are following the Prais-Winsten procedure. Both of these are examples of FGLS.

• But how do we do this if we don’t “know” ρ? Since βOLS is unbiased, our estimated residuals are unbiased for the true disturbances, εt.

• In this case, we can estimate ρ using the residuals from an initial OLS regression and then perform FGLS using this estimate. The standard errors that we calculate using this procedure will be asymptotically efficient.

• To estimate ρ from the residuals, et, compute:

ρ̂ = ∑ e_te_{t−1} / ∑ e_t²

We could also estimate ρ from a regression of et on et−1.


• To sum up this procedure:

1. Estimate the model using OLS. Test for autocorrelation. If tests reveal this to be present, estimate the autocorrelation coefficient, ρ, using the residuals from the OLS estimation.

2. Transform the data.

3. Estimate OLS using the transformed data.

• For Prais-Winsten in Stata do:

prais depvar expvars

or

prais depvar expvars, corc

for Cochrane-Orcutt.
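Hand-rolling the Cochrane-Orcutt variant in numpy makes the three steps explicit; the simulated data and parameter values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(9)

# simulate y = 1 + 2x + eps with AR(1) errors, rho = 0.7
T, rho = 500, 0.7
u = rng.normal(size=T)
eps = np.empty(T)
eps[0] = u[0] / np.sqrt(1 - rho ** 2)
for t in range(1, T):
    eps[t] = rho * eps[t - 1] + u[t]
x = rng.uniform(size=T)
X = np.column_stack([np.ones(T), x])
y = 1.0 + 2.0 * x + eps

# Step 1: OLS, then estimate rho from the residuals
e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
rho_hat = (e[1:] @ e[:-1]) / (e @ e)

# Step 2: quasi-difference, dropping the first observation (Cochrane-Orcutt)
ys = y[1:] - rho_hat * y[:-1]
Xs = X[1:] - rho_hat * X[:-1]   # the constant column becomes (1 - rho_hat)

# Step 3: OLS on the transformed data
b_co = np.linalg.lstsq(Xs, ys, rcond=None)[0]
print(rho_hat, b_co)   # rho_hat near 0.7; b_co near (1, 2)
```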

13.7 Non-AR(1) processes

• A problem with the Prais-Winsten and Cochrane-Orcutt versions of FGLS is that the disturbances may not follow an AR(1) process. In this situation, we will have used the wrong assumptions to estimate the autocorrelations.

• There is an approach, analogous to the White-consistent standard errors, that directly estimates the correct OLS standard errors (from (X′X)⁻¹X′(σ²Ω)X(X′X)⁻¹), using an estimate of X′(σ²Ω)X based on very general assumptions about the autocorrelation of the error terms. These standard errors are called Newey-West standard errors.

• Stata users can use the following command to compute these:

newey depvar expvars, lag(#) t(varname)

In this case, lag(#) tells Stata how many lags there are between any two disturbances before the autocorrelations die out to zero. t(varname) tells Stata the variable that indexes time.


13.8 OLS estimation with lagged dependent variables and autocorrelation

• We said previously that autocorrelation, like heteroskedasticity, does not affect the unbiasedness of the OLS estimates, just the means by which we calculate efficient and correct standard errors. There is one important exception to this.

• Suppose that we use the lagged value of the dependent variable, yt−1, as a regressor, so that:

yt = βyt−1 + εt,  where |β| < 1

and

εt = ρεt−1 + ut

where E[ut] = 0, E[ut²] = σ_u², and cov[ut, us] = 0 if t ≠ s.

• The OLS estimator is β̂_OLS = (X′X)−1X′y where X = y_{t−1}. Let us also assume that y_t and y_{t−1} have been transformed so that they are measured as deviations from their means.

β̂_OLS = (∑_{t=2}^{T} y_{t−1} y_t) / (∑_{t=2}^{T} y_{t−1}²)

= (∑_{t=2}^{T} y_{t−1}(βy_{t−1} + ε_t)) / (∑_{t=2}^{T} y_{t−1}²)

= β + (∑_{t=2}^{T} y_{t−1} ε_t) / (∑_{t=2}^{T} y_{t−1}²)

• Thus,

E[β̂_OLS] = β + cov(y_{t−1}, ε_t) / var(y_t)


• To show that β̂_OLS is biased, we need only show that cov(y_{t−1}, ε_t) ≠ 0, and to show that β̂_OLS is inconsistent, we need only show that the limit of this covariance as T → ∞ is not equal to zero.

cov[y_{t−1}, ε_t] = cov[y_{t−1}, ρε_{t−1} + u_t] = ρ cov[y_{t−1}, ε_{t−1}] = ρ cov[y_t, ε_t]

• The last step is true assuming the DGP is “stationary” and the u_t's are uncorrelated.

• Continuing:

ρ cov[y_t, ε_t] = ρ cov[βy_{t−1} + ε_t, ε_t] = ρβ cov[y_{t−1}, ε_t] + ρ cov[ε_t, ε_t] (13.3)

= ρβ cov[y_{t−1}, ε_t] + ρ var[ε_t] (13.4)

• Since cov[y_{t−1}, ε_t] = ρ cov[y_t, ε_t] (from above), we have:

cov[y_{t−1}, ε_t] = ρβ cov[y_{t−1}, ε_t] + ρ var[ε_t]

so that

cov[y_{t−1}, ε_t] = ρ var[ε_t] / (1 − βρ).

• Given our calculation of var[ε_t] when the error is AR(1) (see Eq. 13.1):

cov[y_{t−1}, ε_t] = ρσ_u² / ((1 − βρ)(1 − ρ²))

• Thus, if ρ is positive, the estimate of β is biased upward (more of the fit is imputed to the lagged dependent variable than to the systematic relation between the error terms). Moreover, since the covariance above does not diminish to zero in the limit as T → ∞, the estimated regression coefficients will also be inconsistent.
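A quick simulation makes the bias concrete. For this simple model the probability limit of the OLS slope works out to the standard result (β + ρ)/(1 + βρ); the parameter values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
beta, rho, T = 0.5, 0.5, 100_000

u = rng.normal(size=T)
eps = np.zeros(T)
y = np.zeros(T)
for t in range(1, T):
    eps[t] = rho * eps[t - 1] + u[t]    # AR(1) disturbance
    y[t] = beta * y[t - 1] + eps[t]     # lagged dependent variable

# OLS of y_t on y_{t-1} (no intercept; the series has mean zero)
b_ols = (y[:-1] @ y[1:]) / (y[:-1] @ y[:-1])

plim = (beta + rho) / (1 + beta * rho)  # = 0.8 here, versus the true 0.5
```

With ρ > 0 the estimate lands near 0.8 rather than the true 0.5, and no amount of data fixes it.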

• It can also be shown that the estimate of ρ in the Durbin-Watson test will be biased downward, leading us to accept the null hypothesis of no autocorrelation too often. For that reason, when we include a lagged dependent variable in the model, we should be careful to use the Breusch-Godfrey test to determine whether autocorrelation is present.


13.9 Bias and “contemporaneous correlation”

• A more immediate question is what general phenomenon produces this bias and how to obtain unbiased estimates when this problem exists. The answer turns out to be a general one, so it is worth exploring.

• The bias (and inconsistency) in this case arose because of a violation of the G-M assumptions that E[ε] = 0 and that X is a known matrix of constants (“fixed in repeated samples”). We needed these so that we could say that E[X′ε] = 0, which yielded our proof of unbiasedness:

E[β̂_OLS] = β + E[(X′X)−1X′ε] = β

• Recall that we showed we could relax the assumption that X is fixed in repeated samples if we substituted the assumption that X is “strictly exogenous,” so that E[ε|X] = 0. This also yields the result that E[X′ε] = 0.

• The problem in the case of the lagged dependent variable is that E[X′ε] ≠ 0. When there is a large and positive error in the preceding period, we are likely to get a large and positive ε_t and a large and positive y_{t−1}.

• Thus, we see a positive relationship between the current error and the regressor. This is all we need to get bias: a “contemporaneous” correlation between a regressor and the error term, such that cov[x_t, ε_t] ≠ 0.

13.10 Measurement error

• Another case in which we get biased and inconsistent estimates because of contemporaneous correlation between a regressor and the error is measurement error.

• Suppose that the true relationship does not contain an intercept and is:

y = βx + ε

but x is measured with error, so that what we observe is z = x + u, where E[u_t] = 0 and E[u_t²] = σ_u².


• This implies that x can be written as z − u, and the true model can then be written as:

y = β(z − u) + ε = βz + (ε − βu) = βz + η

The new disturbance, η, is a function of u (the error in measuring x). But z is also a function of u. This sets up a non-zero covariance (and correlation) between z, our imperfect measure of the true regressor x, and the new disturbance term, (ε − βu).

• The covariance between them, as we showed then, is equal to:

cov[z, (ε − βu)] = cov[x + u, ε − βu] = −βσ_u²

• How does this lead to bias? We go back to the original equation for the expectation of the estimated coefficient, E[β̂] = β + E[(X′X)−1X′ε]. It is easy to show:

E[β̂] = E[(z′z)−1(z′y)] = E[(z′z)−1(z′(βz + η))] (13.5)

= E[[(x + u)′(x + u)]−1(x + u)′(βx + ε)] (13.6)

= E[ (∑_i (x_i + u_i)(βx_i + ε_i)) / (∑_i (x_i + u_i)²) ] ≈ βσ_x² / (σ_x² + σ_u²) (13.7)

where the last approximation uses (1/n)∑ x_i² → σ_x² and (1/n)∑ u_i² → σ_u² in large samples.

• Note: establishing a non-zero covariance between the regressor and the error term is sufficient to prove bias but is not the same as indicating the direction of the bias. In this special case, however, β is biased downward.
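The attenuation result is easy to verify by simulation. With σ_x² = σ_u² = 1 the plim of the OLS slope is β/2 (all values below are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, beta = 200_000, 2.0
x = rng.normal(size=n)             # true regressor, variance 1
u = rng.normal(size=n)             # measurement error, variance 1
z = x + u                          # what we actually observe
y = beta * x + rng.normal(size=n)

b = (z @ y) / (z @ z)              # OLS of y on the mismeasured z
# plim b = beta * sigma_x^2 / (sigma_x^2 + sigma_u^2) = 1.0 here
```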

• To show consistency, we would have to show that:

plim β̂ = β + plim(X′X/n)−1 · plim(X′ε/n) = β (recall Q∗ = plim(X′X/n))

• This involves having to show that plim(X′ε/n) = 0.

• In the case above,

plim β̂ = [plim (1/n)∑_i (x_i + u_i)(βx_i + ε_i)] / [plim (1/n)∑_i (x_i + u_i)²] = βQ∗ / (Q∗ + σ_u²)


• Describing the bias and inconsistency in the case of a lagged dependent variable with autocorrelation would follow the same procedure. We would look at the expectation term to show the bias and the probability limit to show inconsistency. See particularly Greene, 6th ed., pp. 325–327.

13.11 Instrumental variable estimation

• In any case of contemporaneous correlation between a regressor and the error term, we can use an approach known as instrumental variable (or IV) estimation. The intuition behind this approach is that we find an instrument for X, where X is the variable correlated with the error term. This instrument should ideally be highly correlated with X, but uncorrelated with the error term.

• We can use more than one instrument to estimate β_IV. Say that we have one variable, x₁, measured with error in the model:

y = β0 + β1x1 + β2x2 + ε

• We believe that there exists a set of variables, Z, that is correlated with x₁ but not with ε. We estimate the following model:

x₁ = Zα + u

where α is the vector of regression coefficients on Z, estimated by α̂ = (Z′Z)−1Z′x₁.

• We then calculate the fitted or predicted values of x₁, denoted x̂₁, equal to Zα̂, which is Z(Z′Z)−1Z′x₁. These fitted values should not be correlated with the error term because they are derived from instrumental variables that are uncorrelated with the error term.

• We then use the fitted values x̂₁ in place of x₁ in the original model:

y = β₀ + β₁x̂₁ + β₂x₂ + ε

This gives an unbiased estimate β̂_IV of β₁.
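A minimal IV illustration, continuing the measurement-error example: a second, independently mismeasured reading of x (a textbook device, simulated here) is correlated with the observed regressor but not with the composite error (ε − βu), so it is a valid instrument.

```python
import numpy as np

rng = np.random.default_rng(4)
n, beta = 200_000, 2.0
x = rng.normal(size=n)
z = x + rng.normal(size=n)         # mismeasured regressor we must use
w = x + rng.normal(size=n)         # second noisy measure: the instrument
y = beta * x + rng.normal(size=n)

b_ols = (z @ y) / (z @ z)          # attenuated (plim = 1.0 here)

# IV: first stage z on w, then use the fitted values; with one regressor
# and one instrument this collapses to (w'y)/(w'z)
b_iv = (w @ y) / (w @ z)
```

OLS on z lands near 1.0; the IV estimate recovers something near the true β = 2.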


In the case of a model with autocorrelation and a lagged dependent variable, Hatanaka (1974) suggests the following IV estimation for the model:

y_t = X_tβ + γy_{t−1} + ε_t, where ε_t = ρε_{t−1} + u_t

• Use the predicted values from a regression of y_t on X_t and X_{t−1} as an instrument for y_{t−1}. The coefficient on y_{t−1} is then a consistent estimate of γ, so it can be used to estimate ρ and perform FGLS.

13.12 In the general case, why is IV estimation unbiased and consistent?

• Suppose that

y = Xβ + ε

where X contains one regressor that is contemporaneously correlated with the error term, while the other variables are uncorrelated with it. The intercept and the variables uncorrelated with the error can serve as their own (perfect) instruments.

• Each instrument is correlated with the variable of interest and uncorrelated with the error term. We have at least one instrument for the explanatory variable correlated with the error term. By regressing X on Z we get X̂, the predicted values of X. For each uncorrelated variable, the predicted value is just the variable itself, since it perfectly predicts itself. For the correlated variable, the predicted value is the value given by the first-stage model.

X̂ = Zα̂ = Z(Z′Z)−1Z′X

• Then

β̂_IV = (X̂′X̂)−1X̂′y = [X′Z(Z′Z)−1Z′Z(Z′Z)−1Z′X]−1(X′Z(Z′Z)−1Z′y)


• Simplifying and substituting for y, we get:

E[β̂_IV] = E[(X̂′X̂)−1X̂′y]

= E[(X′Z(Z′Z)−1Z′X)−1(X′Z(Z′Z)−1Z′(Xβ + ε))]

= E[(X′Z(Z′Z)−1Z′X)−1(X′Z(Z′Z)−1Z′X)β + (X′Z(Z′Z)−1Z′X)−1(X′Z(Z′Z)−1Z′ε)]

= β + 0

since E[Z′ε] is zero by assumption.


Section 14

Simultaneous Equations Models and 2SLS

14.1 Simultaneous equations models and bias

• Where we have a system of simultaneous equations, we can get biased and inconsistent regression coefficients for the same reason as in the case of measurement error (i.e., we introduce a “contemporaneous correlation” between at least one of the regressors and the error term).

14.1.1 Motivating example: political violence and economic growth

• We are interested in the links between economic growth and political violence. Assume that we have good measures of both.

• We could estimate the following model:

Growth = f(Violence, Other Factors)

(This is the “bread riots are bad for business” model.)

• We could also estimate the model:

Violence = f(Growth, Other Factors)

(This is the “hunger and privation cause bread riots” model.)

• We could estimate either by OLS. But what if both are true? Then violence helps to explain growth and growth helps to explain violence.

• In previous estimations we have treated the variables on the right-hand side as exogenous. In this case, however, some of them are endogenous because they are themselves explained by another causal model. The model now has two equations:

Gi = β0 + β1Vi + εi

Vi = α0 + α1Gi + ηi



14.1.2 Simultaneity bias

• What happens if we just run OLS on the equation we care about? In general, this is a bad idea (although we will get to the one case in which we can). The reason is simultaneity bias.

• To see the problem, we will go through the following thought experiment.

1. Suppose some random event sparks political violence (ηi is big). This could be because you did not control for the actions of demagogues in stirring up crowds.

2. This causes growth to fall through the effect captured in β1.

3. Thus, Gi and ηi are negatively correlated.

• What happens in this case if we estimate Vi = α0 + α1Gi + ηi by OLS without taking the simultaneity into account? We will mis-estimate α1 because growth tends to be low when ηi is high.

• In fact, we are likely to estimate α1 with a negative bias, because E(G′η) is negative. We may even produce the mistaken result that low growth produces high violence solely through the simultaneity.

• To investigate the likely direction and existence of bias, let's look at E[X′ε], or the covariance between the explanatory variables and the error term.

E[G′η] = E[(β0 + β1V + ε)η] = E[(β0 + β1(α0 + α1G + η) + ε)η]

• Multiplying through and taking expectations, we get:

E[G′η] = E[(β0 + β1α0)η] + β1α1E[G′η] + β1E[η′η] + E[ε′η]

• Passing the expectations operator through, we get:

E[G′η] = (β0 + β1α0)E[η] + β1α1E[G′η] + β1E[η²] + E[ε]E[η]

(E[ε′η] = E[ε]E[η] because the two error terms are assumed independent.)


• Since E[ε] = 0, E[η] = 0, and E[η²] = σ_η², we get:

E[G′η] = β1α1E[G′η] + β1σ_η²

So (1 − β1α1)E[G′η] = β1σ_η², and

E[G′η] = β1σ_η² / (1 − β1α1)

• Since this is non-zero, we have a bias in the estimate of α1, the effect of growth, G, on violence, V.

• The equation above indicates that the size of the bias will depend on the variance of the disturbance term, η, and the size of the coefficient on V, β1. These parameters jointly determine the magnitude of the feedback effect from errors in the equation for violence, through the impact of violence, back onto growth.

• The actual computation of the bias term is not quite this simple in the case of the models above, because we will have other variables in the matrix of right-hand side variables, X.

• Nevertheless, the expression above can help us to deduce the likely direction of bias on the endogenous variable. When we have other variables in the model, the coefficients on these variables will also be affected by bias, but we cannot tell the direction of this bias a priori.

• How can we estimate the causal effects without bias in the presence of simultaneity? Doing so will involve re-expressing all the endogenous variables as functions of the exogenous variables and the error terms. This leads to the question of the “identification” of the system.
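The bias can be seen directly by simulating the two-equation system (intercepts dropped and coefficients chosen for illustration; with α1 = 0.8, β1 = −0.5 and unit error variances, the plim of the OLS estimate of α1 works out to (β1 + α1)/(β1² + 1) = 0.24):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
alpha1, beta1 = 0.8, -0.5          # true structural coefficients
eps = rng.normal(size=n)           # growth-equation error
eta = rng.normal(size=n)           # violence-equation error

# Solve the system G = beta1*V + eps, V = alpha1*G + eta for its
# reduced form and generate data from it
D = 1 - beta1 * alpha1
G = (beta1 * eta + eps) / D
V = (alpha1 * eps + eta) / D

# Naive OLS of V on G ignores that G depends on eta
a_ols = (G @ V) / (G @ G)
# plim = (beta1 + alpha1) / (beta1**2 + 1) = 0.24, far from the true 0.8
```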

14.2 Reduced form equations

• In order to estimate a system of simultaneous equations (or any equation in that system), the model must be either “just identified” or “over-identified”. These conditions depend on the exogenous variables in the system of equations.


• Suppose we have the following system of simultaneous equations:

Gi = β0 + β1Vi + β2Xi + εi

Vi = α0 + α1Gi + α2Zi + ηi

• We can insert the second equation into the first in order to derive an expression for growth that does not involve the endogenous variable, violence:

Gi = β0 + β1(α0 + α1Gi + α2Zi + ηi) + β2Xi + εi

Then you can pull all the expressions involving Gi onto the left-hand side and divide through to give:

Gi = (β0 + β1α0)/(1 − β1α1) + [β1α2/(1 − β1α1)]Zi + [β2/(1 − β1α1)]Xi + (εi + β1ηi)/(1 − β1α1)

• We can perform a similar exercise to write the violence equation as:

Vi = (α0 + α1β0)/(1 − α1β1) + [α1β2/(1 − α1β1)]Xi + [α2/(1 − α1β1)]Zi + (ηi + α1εi)/(1 − α1β1)

• We can then estimate these reduced form equations directly:

Gi = γ0 + γ1Zi + γ2Xi + ε∗i

Vi = φ0 + φ1Zi + φ2Xi + η∗i

where

γ0 = (β0 + β1α0)/(1 − β1α1),   γ1 = β1α2/(1 − β1α1),   γ2 = β2/(1 − β1α1)

and

φ0 = (α0 + α1β0)/(1 − α1β1),   φ1 = α2/(1 − α1β1),   φ2 = α1β2/(1 − α1β1)

• In this case, we can solve backward from these for the original regression coefficients of interest, α1 and β1. Re-expressing the coefficients from the reduced form equations above yields:

α1 = φ2/γ2   and   β1 = γ1/φ1


• This is often labeled “Indirect Least Squares,” a method that is mostly of pedagogic interest.

• Thus, we uncover and estimate the true relationship between growth and violence by looking at the relationship between those two variables and the exogenous variables. We say that X and Z identify the model.

• Without the additional variables, there would be no way to estimate α1 and β1 from the reduced form. The model would be under-identified.

• It is also possible to have situations where there is more than one solution for α1 and β1 from the reduced form. This can occur if either equation has more than one additional variable.

• Since one way to estimate the coefficients of interest without bias is to estimate the reduced form and compute the original coefficients from this directly, it is important to determine whether your model is identified. We can estimate just-identified and over-identified models, but not under-identified models.

14.3 Identification

• There are two conditions to check for identification: the order condition and the rank condition. In theory, since the rank condition is more binding, one checks first the order condition and then the rank condition. In practice, very few people bother with the rank condition.

14.3.1 The order condition

• Let g be the number of endogenous variables in the system (here 2) and let k be the total number of variables (endogenous and exogenous) missing from the equation under consideration. Then:

1. If k = g − 1, the equation is exactly identified.

2. If k > g − 1, the equation is over-identified.

3. If k < g − 1, the equation is under-identified.


• In general, this means that there must be at least one exogenous variable in the system, excluded from a given equation, in order to estimate the coefficient on the endogenous variable that is included as an explanatory variable in that equation.

• These conditions are necessary for a given degree of identification. The rank condition is sufficient for each type of identification. The rank condition assumes the order condition and adds that the reduced form equations must each have full rank.

14.4 IV estimation and two-stage least squares

1. If the model is just-identified, we could regress the endogenous variables on the exogenous variables and work back from the reduced form coefficients to estimate the structural parameters of interest.

2. If the model is just- or over-identified, we can use the exogenous variables as instruments for Gi and Vi. In this case, we use the instruments to form estimates of the two endogenous variables, Ĝi and V̂i, that are uncorrelated with the error terms, and we use these in the original structural equations.

• The second method is generally easier. It is also known as Two-Stage Least Squares, or 2SLS, because it involves the following stages:

1. Estimate the reduced form equations using OLS. To do this, regress each endogenous variable on all the exogenous variables in the system. In our running example we would have:

Vi = φ0 + φ1Zi + φ2Xi + η∗i

and

Gi = γ0 + γ1Zi + γ2Xi + ε∗i


2. From these first-stage regressions, compute the predicted values Ĝi and V̂i for each observation. The predicted values of the endogenous variables can then be used to estimate the structural models:

Vi = α0 + α1Ĝi + α2Zi + ηi

Gi = β0 + β1V̂i + β2Xi + εi

• This is just what we would do in standard IV estimation, which is to regress the problem variable on its instruments and then use the predicted value in the main regression.
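The two stages can be run by hand on data simulated from the system above (X and Z exogenous; all coefficient values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000
b1, b2, a1, a2 = -0.5, 1.0, 0.8, 1.0    # true structural coefficients
X = rng.normal(size=n)                   # exogenous, enters growth equation
Z = rng.normal(size=n)                   # exogenous, enters violence equation
eps, eta = rng.normal(size=n), rng.normal(size=n)

# Generate data from the reduced form of:
#   G = b1*V + b2*X + eps ;  V = a1*G + a2*Z + eta
D = 1 - b1 * a1
G = (b1 * a2 * Z + b2 * X + eps + b1 * eta) / D
V = (a2 * Z + a1 * b2 * X + eta + a1 * eps) / D

def ols(Xmat, y):
    return np.linalg.lstsq(Xmat, y, rcond=None)[0]

# Stage 1: regress G on all the exogenous variables, form fitted values
W = np.column_stack([np.ones(n), Z, X])
G_hat = W @ ols(W, G)

# Stage 2: structural violence equation with G_hat in place of G
S = np.column_stack([np.ones(n), G_hat, Z])
a1_2sls = ols(S, V)[1]
```

The second-stage coefficient on Ĝ lands near the true α1 = 0.8, whereas (as shown earlier) naive OLS would not.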

14.4.1 Some important observations

• This method gives you unbiased and consistent estimates of α1 and β1. Another way of saying this is that Ĝi and V̂i are good instruments for Gi and Vi.

A good instrument is highly correlated with the variable that we are instrumenting for and uncorrelated with the error term. In this case, Ĝi and V̂i are highly correlated with Gi and Vi because we use all the information available to us in the exogenous variables to come up with the estimates.

Second, Ĝi and V̂i are uncorrelated with the error terms ηi and εi because of the properties of OLS and the way we estimate the first stage. A property of OLS is that the estimated residuals are uncorrelated with the regressors and thus uncorrelated with the predicted values (Ĝi and V̂i).

Thus, since the direct effect of growth on Vi, for example, in the first reduced form equation is part of the estimated residual, the predicted value V̂i can be treated as exogenous to growth. We have broken the chain of connection that runs from one model to the other. We can say that E[Ĝ′η] = 0 and E[V̂′ε] = 0.

• When the system is exactly identified, 2SLS will give you results that are identical to those you would obtain from estimating the reduced form equations and using those coefficients directly to compute α1 and β1.


• There is one case in which estimation of a system of simultaneous equations by OLS will not give you biased and inconsistent estimates. This is the case of recursive systems. The following is an example of a recursive system:

yi = α0 + α1Xi + εi

pi = β0 + β1yi + β2Ri + ηi

qi = δ0 + δ1yi + δ2pi + δ3Si + νi

Here, all the errors are independent. In this system of recursive equations, substituting for the endogenous variables yi and pi will ultimately get you to the exogenous variable Xi, so we don't get the feedback loops and correlations between regressor and error that we did in the earlier case.

14.5 Recapitulation of 2SLS and computation of goodness-of-fit

Let us review the two-stage least squares procedure:

1. We first estimated the reduced form equations for Violence and Growth using OLS:

Vi = φ0 + φ1Zi + φ2Xi + η∗i

and

Gi = γ0 + γ1Zi + γ2Xi + ε∗i

We used the predicted values of Gi and Vi, namely Ĝi and V̂i, as instruments in the second stage.

2. We estimated the structural equations using the instruments:

Vi = α0 + α1Ĝi + α2Zi + ηi

Gi = β0 + β1V̂i + β2Xi + εi

and obtained unbiased coefficients.

• We would normally compute R² as:

R² = SSR/SST = β̂²∑(xi − x̄)² / ∑(yi − ȳ)²   or   R² = 1 − SSE/SST = 1 − ∑ei²/∑(yi − ȳ)² = 1 − e′e/(y′M⁰y)


• In the second stage, however, the estimated residuals are:

η̂i = Vi − (α̂0 + α̂1Ĝi + α̂2Zi)

ε̂i = Gi − (β̂0 + β̂1V̂i + β̂2Xi)

• If we use these residuals in our computation of R², we will get a statistic that tells us how well the model with the instruments fits the data. If we want an estimate of how well the original structural model fits the data, we should estimate the residuals using the true endogenous variables Gi and Vi. Thus, we use:

η̂i = Vi − (α̂0 + α̂1Gi + α̂2Zi)

ε̂i = Gi − (β̂0 + β̂1Vi + β̂2Xi)

• This gives us an estimate of the fit of the structural model. There is one oddity in the calculated R² that may result. When we derived R², we used the fact that the OLS normal equations produce residuals such that X′e = 0. This gave us the result that:

SST = SSR + SSE, or ∑(yi − ȳ)² = β̂²∑(xi − x̄)² + ∑ei²

The variance in y can be completely partitioned between the variance from the model and the variance from the residuals. This is no longer the case when you re-estimate the errors using the real values of Gi and Vi rather than their instruments, and you can get cases where SSE > SST. In such cases, you can get negative values of R² in the second stage. This may be perfectly okay if the coefficients are of the right sign and the standard errors are small.

• It also matters in this case whether you estimate R² as SSR/SST or as 1 − SSE/SST.

14.6 Computation of standard errors in 2SLS

• Let us denote all the original right-hand side variables in the two structural models as X, the instrumented variables as X̂, and the exogenous variables that we used to estimate X̂ as Z.


• In section 13.12, we showed that:

β̂_IV = (X̂′X̂)−1X̂′y = (X′Z(Z′Z)−1Z′X)−1(X′Z(Z′Z)−1Z′(Xβ + ε))

= β + (X′Z(Z′Z)−1Z′X)−1(X′Z(Z′Z)−1Z′ε)

= β + (X̂′X̂)−1X̂′ε

so that

var[β̂_IV] = E[(β̂_IV − β)(β̂_IV − β)′] = E[(X̂′X̂)−1X̂′εε′X̂(X̂′X̂)−1]

• If the errors conform to the Gauss-Markov assumptions, then E[εε′] = σ²I_N and

var[β̂_IV] = σ²(X̂′X̂)−1

• To estimate σ² we would normally use e′e/(n − K).

• As with R², we should use the estimated residuals from the structural model with the true variables in it rather than the predicted values. These are consistent estimates of the true disturbances.

η̂i = Vi − (α̂0 + α̂1Gi + α̂2Zi)

ε̂i = Gi − (β̂0 + β̂1Vi + β̂2Xi)

• The IV estimator can have very large standard errors, because the instruments by which X is proxied are not perfectly correlated with it, so your residuals will be larger.


14.7 Three-stage least squares

• What if, in the process above, we became concerned that E[εε′] ≠ σ²I_N?

• We would perform three-stage least squares.

1. Estimate the reduced form equations by OLS and calculate the predicted values of the endogenous variables, Ĝi and V̂i.

2. Estimate the structural equations with the fitted values.

3. Use the residuals calculated in the manner above (using the actual values of Gi and Vi) to test for heteroskedasticity and/or autocorrelation, and compute the appropriate standard errors if either is present. In the case of heteroskedasticity, this means either FGLS with a data transformation or White-corrected standard errors. In the case of autocorrelation, it means either FGLS with a data transformation or Newey-West standard errors.


14.8 Different methods to detect and test for endogeneity

1. A priori tests for endogeneity – Granger Causality.

2. A test to look at whether the coefficients under OLS are markedly different from the coefficients under 2SLS – Hausman Specification Test.

14.8.1 Granger causality

• A concept that is often used in time series work to define endogeneity is “Granger Causality.” What it defines is really “pre-determinedness.” Granger causality is absent when we can say that:

f(x_t | x_{t−1}, y_{t−1}) = f(x_t | x_{t−1})

The definition states that in the conditional distribution, lagged values of y add no information to our prediction of x_t beyond that provided by lagged values of x_t itself.

• This is tested by estimating:

xt = β0 + β1xt−1 + β2yt−1 + εt

• If a t-test fails to reject the hypothesis that β2 = 0, then we say that y does not Granger cause x. If y does not Granger cause x, then x is often said to be exogenous in a system of equations with y. Here, exogeneity implies only that prior movements in y do not lead to later movements in x.
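A sketch of the test on simulated data in which y really does Granger cause x, so the t-statistic on lagged y should be large (all parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)
T = 5000
y = rng.normal(size=T)                 # y is just noise here
x = np.zeros(T)
for t in range(1, T):
    x[t] = 0.5 * x[t - 1] + 0.4 * y[t - 1] + rng.normal()

# Regress x_t on a constant, x_{t-1}, and y_{t-1}
X = np.column_stack([np.ones(T - 1), x[:-1], y[:-1]])
target = x[1:]
b, *_ = np.linalg.lstsq(X, target, rcond=None)
e = target - X @ b
s2 = (e @ e) / (len(target) - X.shape[1])
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
t_y = b[2] / se[2]                     # t-statistic on lagged y
```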

• Kennedy has a critique of Granger causality in which he points out that under this definition weather reports “cause” the weather and an increase in Christmas card sales “causes” Christmas. The problem here is that variables based on expectations (that the weather will be rainy, that Christmas will arrive) cause earlier changes in behavior (warnings to carry an umbrella and a desire to buy cards).


14.8.2 The Hausman specification test

• If βOLS and βIV are “close” in magnitude, then it would appear that endogeneity is not producing bias.

• This intuition has been formalized into a test by the econometrician Jerry Hausman. The test is called a specification test because it tells you whether you were right to use 2SLS. In this case, however, it can also be used to test the original assumption of endogeneity. The logic is as follows:

• H0: There is no endogeneity

In this case both βOLS and βIV are consistent, and βOLS is efficient relative to βIV (recall that OLS is BLUE).

• H1: There is endogeneity

In this case βIV remains consistent while βOLS is inconsistent. Thus, their values will diverge.

• The suggestion, then, is to examine d = (βIV − βOLS). The question is how large this difference should be before we conclude that something is up. This will depend on the variance of d.

• Thus, we can form a Wald statistic to test the hypothesis above:

W = d′[Estimated Asymptotic Variance(d)]−1d

• The trouble, for a long time, was that no one knew how to estimate this variance, since it should involve the covariances between βIV and βOLS. Hausman solved this by proving that the covariance between an efficient estimator βOLS of a parameter vector β and its difference from an inefficient estimator βIV of the same parameter vector is zero (under the null). (For more details see Greene, 6th ed., Section 12.4.)

• Based on this proof, we can say that:

Asy. var[βIV − βOLS] = Asy. var[βIV] − Asy. var[βOLS]


• Under the null hypothesis, we are using two different but consistent estimators of σ². If we use s² as a common estimator of this, the Hausman statistic will be:

H = (βIV − βOLS)′[(X̂′X̂)−1 − (X′X)−1]−1(βIV − βOLS) / s²

• This test statistic is distributed χ², but the appropriate degrees of freedom for the test statistic will depend on the context (i.e., how many of the variables in the regression are thought to be endogenous).

14.8.3 Regression version

• The test statistic above can be computed automatically in most standard software packages. In the case of IV estimation (of which 2SLS is an example), there is a completely equivalent way of running the Hausman test using an “auxiliary regression.”

• Assume that the model has K1 potentially endogenous variables, X, and K2 remaining variables, W. We have predicted values of X, namely X̂, based on the reduced form equations. We estimate the model:

y = Xβ + X̂α + Wγ + ε

• The test for endogeneity is performed as an F-test on the K1 regression coefficients α being jointly different from zero, where the degrees of freedom are K1 and (n − (K1 + K2) − K1). If α = 0, then X is said to be exogenous.

• The intuition is as follows. If the X variables are truly exogenous, then the β should be unbiased and there will be no extra information added by the fitted values. If the X variables are endogenous, then the fitted values will add extra information and account for some of the variation in y. Thus, they will have coefficients on them significantly different from zero.
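A sketch of the auxiliary-regression version, reusing the measurement-error setup from section 13.10 (z is the endogenous regressor, w a valid instrument; both are hypothetical simulated variables):

```python
import numpy as np

rng = np.random.default_rng(7)
n, beta = 100_000, 2.0
x = rng.normal(size=n)
z = x + rng.normal(size=n)            # endogenous (measured with error)
w = x + rng.normal(size=n)            # instrument
y = beta * x + rng.normal(size=n)

def ols(Xm, yv):
    return np.linalg.lstsq(Xm, yv, rcond=None)[0]

# First stage: fitted values of z from the instrument
W = np.column_stack([np.ones(n), w])
z_hat = W @ ols(W, z)

# Augmented regression: y on z AND z_hat; a coefficient on z_hat that
# is significantly different from zero signals endogeneity of z
A = np.column_stack([np.ones(n), z, z_hat])
alpha_hat = ols(A, y)[2]
```

Here the coefficient on the fitted values comes out far from zero, as expected when z is endogenous; in practice one would test it with the F-test described above.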

• Greene does not go through the algebra to prove that the augmented regression is equivalent to running the Hausman test, but refers readers to Davidson and MacKinnon (1993). Kennedy has a nice exposition of the logic above on pp. 197–198.


14.8.4 How to do this in Stata

1. To perform 2SLS or any other type of IV estimation, see ivreg. The command is:

ivreg depvar [varlist1] (varlist2 = varlist_iv)

2. To perform 3SLS, see reg3

3. To perform a Hausman test, see hausman

The hausman command is used in conjunction with other regression commands. To use it, you would:

Run the less efficient model (here IV or reg3)

hausman, save

Run the fully efficient model (here OLS)

hausman

14.9 Testing for the validity of instruments

• We stated previously that the important features of “good” instruments, Z, are that they be highly correlated with the endogenous variables, X, and uncorrelated with the true errors, ε.

• The first requirement can be tested fairly simply by inspecting the reduced form model in which each endogenous variable is regressed on its instruments to yield predicted values, X̂. In a model with one instrument, look at the t-statistic. In an IV model with multiple instruments, look at the F-statistic.

• For the second requirement, the conclusion that Z and ε are uncorrelated in the case of one instrument must be a leap of faith, since we cannot observe ε and must appeal to theory or introspection.


• For multiple instruments, however, so long as we are prepared to believe that one of the instruments is uncorrelated with the error, we can test the assumption that the remaining instruments are uncorrelated with the error term. This is done via an auxiliary regression and is known as “testing over-identification restrictions.”

• It is called this because, if there is more than one instrument, then the endogenous regressor is over-identified. The logic of the test is simple.

If at least one of the instruments is uncorrelated with the true error term, then 2SLS gives consistent estimates of the true errors.

The residuals from the 2SLS estimation can then be used as the dependent variable in a regression on all the instruments.

If the instruments are, in fact, correlated with the true errors, then this will be apparent in a significant F-statistic for the joint significance of the instruments in the residual regression.

• Thus, the steps in the test for over-identifying restrictions are as follows:

1. Estimate the 2SLS residuals using η̂i = Vi − (α̂0 + α̂1Gi + α̂2Zi). Use η̂i as your consistent estimate of the true errors.

2. Regress ηi on all the instruments used in estimating the 2SLS coefficients.

3. You can approximately test the over-identifying restrictions via inspection of the F-statistic for the null hypothesis that all instruments are jointly insignificant. However, since the residuals are only consistent for the true errors, this test is only valid asymptotically and you should technically use a large-sample test.

An alternative is to use n·R², which is asymptotically distributed χ² with degrees of freedom equal to the number of instruments minus the number of endogenous regressors. If we reject the null, then at least some of the IVs are not exogenous.
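The steps can be sketched as follows. Here z is endogenous, w1 and w2 are valid instruments, and w_bad is deliberately constructed to be correlated with the error, so the n·R² statistic should flag the instrument set that includes it (all variables are simulated illustrations):

```python
import numpy as np

rng = np.random.default_rng(9)
n, beta = 50_000, 2.0
x = rng.normal(size=n)
u = rng.normal(size=n)
z = x + u                               # endogenous regressor
w1 = x + rng.normal(size=n)             # valid instrument
w2 = x + rng.normal(size=n)             # valid instrument
w_bad = u + rng.normal(size=n)          # invalid: correlated with the error
y = beta * x + rng.normal(size=n)       # composite error is (eps - beta*u)

def ols(Xm, yv):
    return np.linalg.lstsq(Xm, yv, rcond=None)[0]

def overid_nr2(instruments):
    """2SLS with the given instruments, then n*R^2 from regressing the
    2SLS residuals (computed at the actual z) on all the instruments."""
    W = np.column_stack([np.ones(n)] + instruments)
    z_hat = W @ ols(W, z)
    S = np.column_stack([np.ones(n), z_hat])
    b = ols(S, y)
    e = y - b[0] - b[1] * z             # residuals at the ACTUAL z
    g = ols(W, e)
    r = e - W @ g
    R2 = 1 - (r @ r) / ((e - e.mean()) @ (e - e.mean()))
    return n * R2

nr2_valid = overid_nr2([w1, w2])        # ~ chi^2(1) under the null: small
nr2_bad = overid_nr2([w1, w_bad])       # large: w_bad is not exogenous
```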


Section 15

Time Series Modeling

15.1 Historical background

• Genesis of modern time series models: large structural models of the macroeconomy, which involved numerous different variables, were poor predictors of actual economic outcomes.

• Box and Jenkins showed that the present and future values of an economic variable could often be better predicted by its own past values than by other variables, hence dynamic models.

• In political science: used in models of presidential approval and partisan identity.

15.2 The auto-regressive and moving-average specifications

• A time series is a sequence of numerical data in which each observation is associated with a particular instant in time.

• Univariate time series analysis: analysis of a single sequence of data

• Multivariate time series analysis: several sets of data for the same sequence of time periods.

• The purpose of time series analysis is to study the dynamics, or temporal structure, of the data.

• Two representations of the temporal structure that will allow us to describe almost all dynamics (for "stationary" sequences) are the auto-regressive and moving-average representations.


15.2.1 An autoregressive process

• The AR(1) process:

yt = µ + γyt−1 + εt

is said to be auto-regressive (or self-regressive) because the current value is explained by past values, so that:

E[yt | yt−1] = µ + γyt−1

• This AR(1) process also contains a per-period constant µ, although this is often set to zero. This is sometimes referred to as a "drift" term. A more general, pth-order autoregression or AR(p) process would be written:

yt = µ + γ1yt−1 + γ2yt−2 + ... + γpyt−p + εt

• In the case of an AR(1) process we can substitute infinitely for the y terms on the right-hand side (as we did previously) to show that:

yt = µ + γµ + γ^2µ + ... + εt + γεt−1 + γ^2εt−2 + ... = ∑_{i=0}^∞ γ^i µ + ∑_{i=0}^∞ γ^i εt−i

• So, one way to remember an auto-regressive process is that your current state is a function of all your previous errors.

• We can present this information far more simply using the lag operator L:

Lxt = xt−1, L^2xt = xt−2, and (1 − L)xt = xt − xt−1

• Using the lag operator, we can write the original AR(1) series as:

yt = µ + γLyt + εt

so that:

(1 − γL)yt = µ + εt

and

yt = µ/(1 − γL) + εt/(1 − γL) = µ/(1 − γ) + εt/(1 − γL) = ∑_{i=0}^∞ γ^i µ + ∑_{i=0}^∞ γ^i εt−i


• The last step comes from something we encountered before: how to represent an infinite series:

A = x(1 + a + a^2 + ... + a^n)

If |a| < 1, then as n → ∞ this series converges to A = x/(1 − a). In other words, the sequence is convergent (has a finite solution).

• Thus, for the series

∑_{i=0}^∞ γ^i εt−i = εt + γεt−1 + γ^2εt−2 + ... = εt + γLεt + γ^2L^2εt + ...

we have that a = γL and ∑_{i=0}^∞ γ^i εt−i = εt/(1 − γL).

• For similar reasons, ∑_{i=0}^∞ γ^i µ = µ/(1 − γL) = µ/(1 − γ), because Lµ = µ. In other words, µ, the per-period drift, is not subscripted by time and is assumed to be the same in each period.

15.3 Stationarity

• So, an AR(1) process can be written quite simply as:

yt = µ/(1 − γ) + εt/(1 − γL)

• Recall, though, that this requires that |γ| < 1. If |γ| ≥ 1 then we cannot even define yt: it keeps on growing as the error terms accumulate. Its expectation will be undefined and its variance will be infinite.

• That means that we cannot use standard statistical procedures if the auto-regressive process is characterized by |γ| ≥ 1. This is known as the stationarity condition: the data series is said to be stationary if |γ| < 1.

• The problem is that our standard results for consistency and for hypothesis testing require that (X′X)/n converges to a finite, positive definite matrix. This is no longer true: (X′X) will diverge when the data series X is non-stationary.


• In the more general case of an AR(p) model, the only difference is that the lag function by which we divide the right-hand side, (1 − γL), is more complex and is often written as C(L). In this case, the stationarity condition requires that the roots of this more complex expression "lie outside the unit circle."
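For a stationary AR(1), the infinite-MA representation above implies E[yt] = µ/(1 − γ) and var[yt] = σ^2/(1 − γ^2). A quick simulation confirms this; the parameter values below are chosen only for illustration.

```python
import numpy as np

# Simulate y_t = mu + gamma*y_{t-1} + eps_t with |gamma| < 1.
rng = np.random.default_rng(42)
mu, gamma, T = 2.0, 0.6, 200_000
eps = rng.normal(size=T)               # white-noise errors with sigma = 1
y = np.empty(T)
y[0] = mu / (1 - gamma)                # start at the stationary mean
for t in range(1, T):
    y[t] = mu + gamma * y[t - 1] + eps[t]

mean_y = y.mean()   # theory: mu/(1-gamma) = 5
var_y = y.var()     # theory: 1/(1-gamma^2) = 1.5625
```

Both sample moments land close to the theoretical values, which is exactly what stationarity buys us: moments that do not depend on t.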

15.4 A moving average process

• A first-order, moving average process MA(1) is written as:

yt = µ + εt − θεt−1

In this case, your current state depends on only the current and previous errors.

Using the lag operator:

yt = µ + (1 − θL)εt

Thus,

yt/(1 − θL) = µ/(1 − θL) + εt

Once again, if |θ| < 1, then we can invert the series and express yt as an infinite series of its own lagged values:

yt = µ/(1 − θ) − θyt−1 − θ^2yt−2 − ... + εt

Now we have written our MA process as an AR process of infinite lag length, describing yt in terms of all its own past values and the contemporaneous error term. Thus, an MA(1) process can be written as an infinite AR process.

• Similarly, when we expressed the AR(1) function in terms of all past error terms, we were writing it as an infinite MA process.

• Notice that this last step once again relies on the condition that |θ| < 1. This is referred to in this case as the invertibility condition, implying that we can divide through by (1 − θL).


• If we had a more general MA(q) process, with more lags, we could go through the same steps, but we would have a more complex function of the lags than (1 − θL). Greene's textbook refers to this function as D(L). In this case, the invertibility condition is satisfied when the roots of D(L) lie outside the unit circle (see Greene, 6th ed., pp. 718–721).

15.5 ARMA processes

• Time series can also be posited to contain both AR and MA terms. However, if we go through the inversion above, getting:

yt = µ/(1 − θ) − θyt−1 − θ^2yt−2 − ... + εt

and then substitute for the lagged yt's in the AR process, we will arrive at an expression for yt that is based only on a constant and a complex function of past errors. See Greene, 6th ed., p. 717 for an example.

• We could also write an ARMA(1,1) process as:

yt = µ/(1 − γ) + [(1 − θL)/(1 − γL)]εt

• An ARMA process with p autoregressive components and q moving average components is called an ARMA(p, q) process.

• Where does this get us? We can estimate yt and apply the standard proofs of consistency if the time series is stationary, so it makes sense to discuss what stationarity means in the context of time series data. If the time series is stationary and can be characterized by an AR process, then the model can be estimated using OLS. If it is stationary and characterized by an MA process, you will need to use a more complicated estimation procedure (non-linear least squares).


15.6 More on stationarity

• There are two main concepts of stationarity applied in the literature.

1. Strict Stationarity: For the process yt = ρyt−1 + εt, strict stationarity implies that:

E[yt] = µ exists and is independent of t.

var[yt] = γ0 is a finite, positive constant, independent of t.

cov[yt, ys] = γ(|t− s|) is a finite function of |t− s|, but not of t or s.

AND all other "higher order moments" (such as skewness or kurtosis) are also independent of t.

2. Weak Stationarity (or covariance stationarity): removes the condition on the "higher order moments" of yt.

• A stationary time series will tend to revert to its mean (mean reversion) and fluctuations around this mean will have a broadly consistent amplitude.

• Intuition: if we take two slices from the data series, they should have approximately the same mean, and the covariance between points should depend only on the number of time periods that divide them. This will not be true of "integrated" series, so we will have to transform the data to make it stationary before estimating the model.

15.7 Integrated processes, spurious correlations, and testing for unit roots

• One of the main concerns w/ non-stationary series is spurious correlation.

• Suppose you have a non-stationary, highly trending series, yt, and you regress it on another highly trending series, xt. You are likely to find a significant relationship between yt and xt even when there is none, because we see upward movement in both, produced by their own dynamics.

• Thus, when the two time series are non-stationary, standard critical values of the t and F statistics are likely to be highly misleading about true causal relationships.


• The question is, what kind of non-stationary sequence do we have and how can we tell it's non-stationary? Consider the following types of non-stationary series:

1. The Pure Random Walk

yt = yt−1 + εt

This DGP can also be written as:

yt = y0 + ∑_{s=1}^t εs

E[yt] = E[y0 + ∑_{s=1}^t εs] = y0

In similar fashion it can be shown that var[yt] = tσ^2. Thus, the mean is constant but the variance increases indefinitely as the number of time points grows.

If you take the first difference of the data process, however, ∆yt = yt − yt−1, we get:

yt − yt−1 = εt

The mean of this process is constant (and equal to zero) and its variance is also a finite constant. Thus, a random walk is a difference stationary process: its first difference is stationary.

2. The Random Walk with Drift

yt = µ + yt−1 + εt

For the random walk with drift process, we can show that E[yt] = y0 + tµ

and var[yt] = tσ2. Both the mean and the variance are non-constant.

In this case, first differencing of the series also will give you a variable that has a constant mean and variance.


3. The Trend Stationary Process

yt = µ + βt + εt

yt is non-stationary because the mean of yt is equal to µ + βt, which is non-constant, although its variance is constant and equal to σ^2.

Once the values of µ and β are known, however, the mean can be perfectly predicted. Therefore, if we subtract the mean of yt from yt, the resulting series will be stationary, and is thus called a trend stationary process in comparison to the difference stationary processes described above.
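The exploding variance of a random walk, and the fact that its first difference is stationary, are easy to see by simulation. A minimal numpy sketch (replication counts and the seed are arbitrary):

```python
import numpy as np

# Simulate many independent random walks y_t = y_{t-1} + eps_t with y_0 = 0.
# Across replications, var[y_t] should grow like t*sigma^2, while the first
# difference should have constant variance sigma^2.
rng = np.random.default_rng(7)
reps, T = 10_000, 400
eps = rng.normal(size=(reps, T))       # sigma = 1
y = eps.cumsum(axis=1)                 # y_t = sum of shocks through t

var_100 = y[:, 99].var()               # theory: about 100
var_400 = y[:, 399].var()              # theory: about 400
var_diff = np.diff(y, axis=1).var()    # theory: about 1
```

The variance at t = 400 is roughly four times the variance at t = 100, while the differenced series has constant unit variance, matching the algebra above.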

• Each of these series is characterized by a unit root, meaning that the coefficient on the lagged value of yt equals 1 in each process. For a trend stationary process, this follows because you can re-write the time series as yt = µ + βt + εt = yt−1 + β + εt − εt−1.

• In each case, the DGP can be written as:

(1− L)yt = α + ν

where α = 0, µ, and β respectively in each process and ν is a stationary process.

• In all cases, the data should be detrended or differenced to produce a stationary series. But which? The matter is not merely of academic interest, since detrending a random walk will induce autocorrelation in the error terms of an MA(1) type.


• A unit-root test is based on a model that nests the different processes above into one regression that you run to test the properties of the underlying data series:

yt = µ + βt + γyt−1 + εt

• Next subtract yt−1 from both sides of the equation to produce the equation below. This produces a regression with a (difference) stationary dependent variable (even under the null of non-stationarity), and this regression forms the basis for Dickey-Fuller tests of a unit root:

yt − yt−1 = µ + βt + (γ − 1)yt−1 + εt

Failing to reject the hypothesis that (γ − 1) is zero gives evidence for a random walk, because this ⇒ γ = 1.

If (γ − 1) = 0 and µ is significantly different from zero, we have evidence for a random walk with drift.

If (γ − 1) is significantly different from zero (and < 0) we have evidence of a stationary process.

If (γ − 1) is significantly different from and less than zero, and β (the coefficient on the trend variable) is significant, we have evidence for a trend stationary process.

• There is one complication. Dickey and Fuller (1979, 1981) showed that under the null of a unit root the test statistic does not follow the standard t distribution, so revised critical values are required for the test statistic above.

• For this reason, the test for stationarity is referred to as the Dickey-Fuller test. The augmented Dickey-Fuller test applies to the same equation above but adds lags of the first difference in y, (yt − yt−1).

• One problem with the Dickey-Fuller unit-root test is that it has low power and seems to privilege the null hypothesis of a random walk process over the alternatives.
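The Dickey-Fuller regression itself is just OLS of ∆yt on a constant and yt−1; only the critical values are non-standard. A sketch with simulated data, using −2.86 as the approximate 5% Dickey-Fuller critical value for the constant-only case (the series and seed are invented for illustration):

```python
import numpy as np

def df_tstat(y):
    """t-statistic on (gamma - 1) in: y_t - y_{t-1} = mu + (gamma - 1)*y_{t-1} + eps_t."""
    dy, ylag = np.diff(y), y[:-1]
    X = np.column_stack([np.ones(len(ylag)), ylag])
    b, *_ = np.linalg.lstsq(X, dy, rcond=None)
    e = dy - X @ b
    s2 = e @ e / (len(dy) - 2)                      # residual variance
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    return b[1] / se

rng = np.random.default_rng(3)
T = 1000
eps = rng.normal(size=T)
walk = eps.cumsum()                    # unit root: gamma = 1
ar = np.empty(T)
ar[0] = 0.0
for t in range(1, T):
    ar[t] = 0.5 * ar[t - 1] + eps[t]   # stationary: gamma = 0.5

t_walk = df_tstat(walk)   # typically well above -2.86: cannot reject the unit root
t_ar = df_tstat(ar)       # strongly negative: reject the unit root
```

For the stationary series the statistic is far below the critical value; for the random walk it typically is not, illustrating both the mechanics and the test's tendency to retain the unit-root null.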


• To sum up: If your data looks like a random walk, you will have to difference it until you get something that looks stationary. If your data looks like it's trend stationary, you will have to de-trend it until you get something stationary. An ARMA model carried out on differenced data is called an ARIMA model, standing for "Auto-Regressive, Integrated, Moving-Average."

15.7.1 Determining the specification

• You have data that is now stationary. How do you figure out which ARMA specification to use? How many, if any, AR terms should there be? How many, if any, MA terms?

• As a means of deciding the specification, analysts in the past have looked at the autocorrelations and partial autocorrelations between yt and yt−s. This is also known as looking at the autocorrelation function (ACF) and the partial autocorrelation function (or PACF).

• Recall that

corr(εt, εt−s) = cov(εt, εt−s)/(√var(εt) √var(εt−s)) = E[εtεt−s]/E[εt^2] = γs/γ0

• If yt and yt−s are both expressed in terms of deviations from their means, and if var(yt) = var(yt−s), then:

corr(yt, yt−s) = cov(yt, yt−s)/(√var(yt) √var(yt−s)) = E[ytyt−s]/E[yt^2] = γs/γ0

15.8 The autocorrelation function for AR(1) and MA(1) processes

• We showed in the section on autocorrelation in the error terms that if yt = ρyt−1 + εt, then corr[yt, yt−s] = ρ^s.

• In this context, εt is white noise: E[εt] = 0, E[εt^2] = σε^2, and E[εtεt−s] = 0 for s ≠ 0.

• Thus, the autocorrelations for an AR(1) process tend to die away gradually.


• By contrast, the autocorrelations for an MA(1) process die away abruptly.

• Let

yt = εt − θεt−1

Then

γ0 = var[yt] = E[(yt − E[yt])^2] = E[(εt − θεt−1)^2] = E(εt^2) + θ^2E(εt−1^2) = (1 + θ^2)σε^2

and

γ1 = cov[yt, yt−1] = E[(εt − θεt−1)(εt−1 − θεt−2)] = −θE[εt−1^2] = −θσε^2

• The covariances between yt and yt−s when s > 1 are zero, because the expression for yt only involves two error terms. Thus, the ACF for an MA(1) process has one or two spikes and then shows no autocorrelation.
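These moments imply corr(yt, yt−1) = −θ/(1 + θ^2) and zero correlation at longer lags, which a simulation reproduces (θ = 0.5 and the seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
theta, T = 0.5, 500_000
eps = rng.normal(size=T + 1)
y = eps[1:] - theta * eps[:-1]         # MA(1): y_t = eps_t - theta*eps_{t-1}

def acf(x, lag):
    """Sample autocorrelation at the given lag."""
    x = x - x.mean()
    return (x[lag:] * x[:-lag]).mean() / (x * x).mean()

rho1 = acf(y, 1)   # theory: -0.5/(1 + 0.25) = -0.4
rho2 = acf(y, 2)   # theory: 0
```

The lag-1 sample autocorrelation sits near −0.4 and the lag-2 value near zero: the abrupt cutoff that identifies an MA(1) in the ACF.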

15.9 The partial autocorrelation function for AR(1) and MA(1) processes

• The partial autocorrelation is the simple correlation between yt and yt−s minus that part explained by the intervening lags.

• Thus, the partial autocorrelation between yt and yt−s is estimated by the last coefficient in the regression of yt on [yt−1, yt−2, ..., yt−s]. The appearance of the partial autocorrelation function is the reverse of that for the autocorrelation function. For a true AR(1) process,

yt = ρyt−1 + εt

• There will be an initial spike at the first lag (where the autocorrelation equals ρ) and then nothing, because no other lagged value of y is significant.

• For the MA(1) process, the partial autocorrelation function will look like a gradually declining wave, because any MA(1) process can be written as an infinite AR process with declining weights on the lagged values of y.
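The regression definition of the partial autocorrelation translates directly into code. For a simulated AR(1) the PACF spikes at lag 1 and is near zero at lag 2 (ρ = 0.7 is chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
rho, T = 0.7, 200_000
eps = rng.normal(size=T)
y = np.empty(T)
y[0] = 0.0
for t in range(1, T):
    y[t] = rho * y[t - 1] + eps[t]

def pacf(y, s):
    """Last coefficient from regressing y_t on a constant and y_{t-1},...,y_{t-s}."""
    lags = np.column_stack([y[s - i: len(y) - i] for i in range(1, s + 1)])
    X = np.column_stack([np.ones(len(lags)), lags])
    b, *_ = np.linalg.lstsq(X, y[s:], rcond=None)
    return b[-1]

p1 = pacf(y, 1)   # near rho = 0.7
p2 = pacf(y, 2)   # near 0
```

This is the mirror image of the MA(1) case: for an AR(1) the PACF cuts off after one lag while the ACF decays geometrically.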


15.10 Different specifications for time series analysis

• We now turn to models in which yt is related to its own past values and a set of exogenous, explanatory variables, xt.

• The pathbreaking time series work in pol. sci. concerned how presidential approval levels respond to past levels of approval and measures of presidential performance. Let's use At, standing for the current level of approval, as the dependent variable for our examples here.

More on this topic can be found in an article by Neal Beck, 1991, "Comparing Dynamic Specifications," Political Analysis.

1. The Static Model

At = βXt + εt

Comment from Beck: "This assumes that approval adjusts instantaneously to new information ('no stickiness') and that prior information is of no consequence ('no memory')."

2. The Finite Distributed Lag Model

At = ∑_{i=0}^{M} βiXt−i + εt

This allows for memory but may have a large number of coefficients, reducing your degrees of freedom.

3. The Exponential Distributed Lag (EDL) Model

At = ∑_{i=0}^{M} (Xt−iλ^i)β + εt

Reduces the number of parameters that you now have to estimate. In the case in which T can be taken as infinite, this can also be written as:

At = ∑_{i=0}^{∞} Xtβ(λL)^i + εt = Xtβ/(1 − λL) + εt


4. The Partial Adjustment Model

Multiply both sides of the EDL model above by the "Koyck Transformation" or (1 − λL). After simplification, this will yield:

At = Xtβ + λAt−1 + εt − λεt−1

This says that current approval is a function of current exogenous variables and the past value of approval, allowing for memory and stickiness. By including a lagged dependent variable, you are actually allowing for all past values of X to affect your current level of approval, with more recent values of X weighted more heavily.

If the errors in the original EDL model were also AR(1), that is to say if errors in preceding periods also had an effect on current approval (e.g., ut = λut−1 + εt), with the size of that effect falling in an exponential way, then the Koyck transformation will actually give you a specification in which the errors are iid. In other words:

At = Xtβ + λAt−1 + εt

This specification is very often used in applied work with a stationary variable whose current level is affected by memory and stickiness. If the error is indeed iid, then the model can be estimated using OLS.

5. Models for "Difference Stationary" Data: The Error Correction Model

Often, if you run an augmented Dickey-Fuller test and are unable to reject the hypothesis that your data has a unit root, you will wind up running the following type of regression:

∆At = ∆Xtβ + εt

As Beck notes (p. 67) this is equivalent to saying that only the information in the current period counts and that it creates an instantaneous change in approval. This can be restrictive.

Moreover, it will frequently result in finding no significant results in first differences, although you have strong theoretical priors that the dependent variable is related to the independent variables.


What you can do in this instance, with some justification, is to run an Error Correction Model that allows you to estimate both long-term and short-term dynamics.

Let us assume, for the moment, that both A and X are integrated (of order one). Then we would not be advised to include them in a model in terms of levels. However, if they are in a long-term, equilibrium relationship with one another, then the errors:

et = At −Xtα

should be stationary. Moreover, if we can posit that people respond to their "errors" and that gradually the relationship comes back into equilibrium, then we can introduce the error term into a model of the change in approval. This is the Error Correction Model.

∆At = ∆Xtβ + γ(At−1 −Xt−1α) + εt

The γ coefficient in this model is telling you how fast errors are adjusted. An attraction of this model is that it allows you to estimate long-term dynamics (in levels) and short-term dynamics (in changes) simultaneously.
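As a check on the partial adjustment model (item 4 above): when the error really is iid, OLS with a lagged dependent variable recovers the parameters. A simulation with made-up values β = 1.5 and λ = 0.7:

```python
import numpy as np

rng = np.random.default_rng(5)
beta, lam, T = 1.5, 0.7, 100_000
x = rng.normal(size=T)                 # stand-in for the exogenous variable X_t
eps = rng.normal(size=T)               # iid errors
A = np.empty(T)
A[0] = 0.0
for t in range(1, T):
    A[t] = beta * x[t] + lam * A[t - 1] + eps[t]

W = np.column_stack([x[1:], A[:-1]])   # regressors: X_t and lagged approval
bhat, *_ = np.linalg.lstsq(W, A[1:], rcond=None)
# bhat[0] estimates beta, bhat[1] estimates lambda
```

Both estimates land very close to the true values. If the errors were instead autocorrelated, the lagged dependent variable would be correlated with the error and this same regression would be biased, which is why the iid assumption matters.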

15.11 Determining the number of lags

• In order to select the appropriate number of lags of the dependent variable (in an AR(p) model), you could use the "general to specific" methodology: include a number of lags that you think is more than sufficient and take out of the model any lags that are not significant.

• Second, analysts are increasingly using adjusted measures of fit (analogous to R2) that compensate for the fact that as you include more lags (to fit the data better) you are reducing the degrees of freedom.

• The two best known are the Akaike Information Criterion and the Schwarz Criterion. Both are based on the standard error of the estimate (actually on s^2). Thus, you want to minimize both, but you are penalized when you include additional lags to reduce the standard error of the estimate.


• The equations for the two criteria are:

AIC(K) = ln(e′e/n) + 2K/n

SC(K) = ln(e′e/n) + (K ln n)/n

The Schwarz Criterion penalizes degrees of freedom lost more heavily.

• The model with the smallest AIC or SC indicates the appropriate number of lags.
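A sketch of lag selection by these criteria for AR(p) models fit by OLS. The formulas follow the notes; the simulated series (true order 2) and all parameter values are invented for illustration:

```python
import numpy as np

def info_criteria(y, p):
    """AIC and SC for an AR(p) with a constant, fit by OLS."""
    Y = y[p:]
    lags = np.column_stack([y[p - i: len(y) - i] for i in range(1, p + 1)])
    X = np.column_stack([np.ones(len(Y)), lags])
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    e = Y - X @ b
    n, K = len(Y), p + 1                       # K counts the constant and p lags
    aic = np.log(e @ e / n) + 2 * K / n
    sc = np.log(e @ e / n) + K * np.log(n) / n
    return aic, sc

rng = np.random.default_rng(9)
T = 5000
eps = rng.normal(size=T)
y = np.empty(T)
y[0] = y[1] = 0.0
for t in range(2, T):
    y[t] = 0.5 * y[t - 1] - 0.3 * y[t - 2] + eps[t]   # true order p = 2

best_sc = min(range(1, 7), key=lambda p: info_criteria(y, p)[1])
```

Moving from one lag to two produces a large drop in both criteria because the second lag genuinely matters; further lags buy little fit and are punished by the penalty terms, especially under SC.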

15.12 Determining the correct specification for your errors

• Beck gives examples (p. 60 and p. 64) where the error term is autocorrelated even when you have built in lags of the dependent variable.

• This could occur if the effect of past errors dies out at a different rate than does the effect of past levels in the explanatory variables. If this is the case, you could still get an error term that is AR or MA, even when you have introduced a lagged dependent variable, although doing this will generally reduce the autocorrelation.

• The autocorrelation could be produced by an AR or an MA process in the error term. To determine which, you might again look at the autocorrelations and partial autocorrelations.

• We know how to transform the data in the case where the errors are AR(1) using the Cochrane-Orcutt transformation or using Newey-West standard errors. If the model was otherwise static, I would also suggest these approaches.


15.13 Stata Commands

• Dickey-Fuller Tests:

dfuller varname, noconstant lags(#) trend regress

• The Autocorrelation and Partial Autocorrelation Function:

corrgram varname, lags(#)

ac varname, lags(#)

pac varname, lags(#)

• ARIMA Estimation:

arima depvar [varlist], ar(numlist) ma(numlist)


Part III

Special Topics


Section 16

Time-Series Cross-Section and Panel Data

• Consider the model

yit = β0 + xitβ + εit

where

i = 1, . . . , N cross-sectional units (e.g., countries, states, households)

t = 1, . . . , T time units (e.g., years, quarters, months)

NT = total number of data points.

• Oftentimes, this model will be estimated pooling the data from different units and time periods.

• Advantages: can give more leverage in estimating parameters; consistent w/“general” theories

16.1 Unobserved country effects and LSDV

• In the model:

Govt Spendingit = β0 + β1Opennessit + β2Zit + εit

It might be argued that the level of government spending as a percentage of GDP differs for reasons that are specific to each country (e.g., solidaristic values in Sweden). This is also known as cross-sectional heterogeneity.

• If these unit-specific factors are correlated with other variables in the model, we will have an instance of omitted variable bias. Even if not, we will get larger standard errors because we are not incorporating sources of cross-country variation into the model.

• We could try to explicitly incorporate all the systematic factors that might lead to different levels of government spending across countries, but this places high demands in terms of data gathering.


• Another way to do this, which may not be as demanding data-wise, is to introduce a set of country dummies into the model.

Govt Spendingit = αi + β1Opennessit + β2Zit + εit

This is equivalent to introducing a country-specific intercept into the model. Either include a dummy for all the countries but one, and keep the intercept term, or estimate the model with a full set of country dummies and no intercept.

16.1.1 Time effects

• There might also be time-specific effects (e.g., government spending went up everywhere in 1973–74 in OECD economies because the first oil shock led to unemployment and increased government unemployment payments). Once again, if the time-specific factors are not accounted for, we could face the problem of bias.

• To account for this, introduce a set of dummies for each time period.

Govt Spendingit = αi + δt + β1Opennessit + β2Zit + εit

• The degrees of freedom for the model are now NT − k − N − T. The significance, or not, of the country-specific and time-specific effects can be tested by using an F-test to see if the country (time) dummies are jointly significant.

• The general approach of including unit-specific dummies is known as the Least Squares Dummy Variables model, or LSDV.

• Can also include (T − 1) year dummies for time effects. These give the difference between the predicted causal effect from xitβ and what you would expect for that year. There has to be one year that provides the baseline prediction.


16.2 Testing for unit or time effects

• For LSDV (including an intercept), we want to test the hypothesis that

α1 = α2 = ... = αN−1 = 0

• Can use an F -test:

F(N − 1, NT − N − K) = [(R^2_UR − R^2_R)/(N − 1)] / [(1 − R^2_UR)/(NT − N − K)]

In this case, the unrestricted model is the one with the country dummies (and hence different intercepts); the restricted model is the one with just a single intercept. A similar test could also be performed on the year dummies.

16.2.1 How to do this test in Stata

• After the regress command you type:

1. If there are (N-1) country dummies and an intercept

test dummy1=dummy2=dummy3=dummy4=...=dummyN-1=0

2. If there are N country dummies and no intercept

test dummy1=dummy2=dummy3=dummy4=...=dummyN

16.3 LSDV as fixed effects

• Least squares dummy variable estimation is also known as Fixed Effects, because it assumes that the variation in the dependent variable, yit, for given countries or years can be estimated as a given, fixed effect.

• Before we go into the justification for this, let us examine which part of the variation in yit is used to calculate the remaining β coefficients under fixed effects.

• A fixed effects model can be estimated by transforming the data. To do this, calculate the country mean of yit for all the different countries. Let the group mean of a given country, i, be represented as yi., with the dot denoting averaging over t.


• Let the original model be

yit = αi + xitβ + εit (16.1)

Then:

yi. = αi + xi.β + εi.

If we run OLS on this regression it will produce what is known as the "Between Effects" estimator, or βBE, which shows how the mean level of the dependent variable for each country varies with the mean level of the independent variables.

Subtracting this from eq. 16.1 gives

(yit − yi.) = (αi − αi) + (xit − xi.)β + (εit − εi.)

or

(yit − yi.) = (xit − xi.)β + (εit − εi.)

• If we run OLS on this regression it will produce what is known as the "Fixed Effects" estimator, or βFE.

• It is identical to LSDV and is sometimes called the within-group estimator, because it uses only the variation in yit and xit within each group (or country) to estimate the β coefficients. Any variation between countries is assumed to spring from the unobserved fixed effects.

• Note that if time-invariant regressors are included in the model, the standard FE estimator will not produce estimates for the effects of these variables. Similar issue w/ LSDV.

IV approach to produce estimates, but requires some exogeneity assumptions that may not be met in practice.

• The effects of slow-moving variables can be estimated very imprecisely due to collinearity.
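The claim that the within (demeaning) transformation reproduces LSDV exactly is easy to verify numerically. All numbers in this sketch are invented:

```python
import numpy as np

rng = np.random.default_rng(11)
N, T, beta = 30, 10, 2.0
alpha = rng.normal(scale=3.0, size=N)            # unit-specific effects
x = rng.normal(size=(N, T)) + alpha[:, None]     # regressor correlated with effects
y = alpha[:, None] + beta * x + rng.normal(size=(N, T))

# Within estimator: demean y and x by unit, then OLS with no intercept.
xd = x - x.mean(axis=1, keepdims=True)
yd = y - y.mean(axis=1, keepdims=True)
b_within = (xd * yd).sum() / (xd * xd).sum()

# LSDV: x plus a full set of N unit dummies, no common intercept.
D = np.kron(np.eye(N), np.ones((T, 1)))          # NT x N dummy matrix
X = np.column_stack([x.ravel(), D])
b_lsdv = np.linalg.lstsq(X, y.ravel(), rcond=None)[0][0]
# b_within and b_lsdv agree to machine precision, both near beta = 2
```

Because x is built to be correlated with the unit effects, pooled OLS without dummies would be biased here, while both fixed-effects computations recover β.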


16.4 What types of variation do different estimators use?

• Let us now determine the sum of squares (X′X) and cross-products (X′y) for the OLS estimator and within-group estimator in order to clarify which estimator uses what variation to calculate the β coefficients.

• Let Sxx be the sum of squares and let Sxy be the cross-products. Let the overall means of the data be represented as y and x.

• Then the total sum of squares and cross-products (which define the variation that we use to estimate βOLS) is:

S^T_xx = ∑_{i=1}^{N} ∑_{t=1}^{T} (xit − x)(xit − x)′

S^T_xy = ∑_{i=1}^{N} ∑_{t=1}^{T} (xit − x)(yit − y)

• The within-group sum of squares and cross-products (used to estimate βFE) is:

S^W_xx = ∑_{i=1}^{N} ∑_{t=1}^{T} (xit − xi.)(xit − xi.)′

S^W_xy = ∑_{i=1}^{N} ∑_{t=1}^{T} (xit − xi.)(yit − yi.)

• The between-group sum of squares and cross-products (used to estimate βBE) is:

S^B_xx = ∑_{i=1}^{N} ∑_{t=1}^{T} (xi. − x)(xi. − x)′

S^B_xy = ∑_{i=1}^{N} ∑_{t=1}^{T} (xi. − x)(yi. − y)


• It is easy to verify that:

S^T_xx = S^W_xx + S^B_xx and S^T_xy = S^W_xy + S^B_xy

• We also have that:

βOLS = [S^T_xx]^{−1}[S^T_xy] = [S^W_xx + S^B_xx]^{−1}[S^W_xy + S^B_xy]

and

βFE = [S^W_xx]^{−1}[S^W_xy]

while

βBE = [S^B_xx]^{−1}[S^B_xy]

• The standard βOLS uses all the variation in yit and xit to calculate the slope coefficients while βFE just uses the variation across time and βBE just uses the variation across countries.

• We can show that βOLS is a weighted average of βFE and βBE. In fact:

βOLS = FW βFE + FB βBE

where FW = [S^W_xx + S^B_xx]^{−1} S^W_xx and FB = I − FW


16.5 Random effects estimation

• Fixed effects is completely appropriate if we believe that the country-specific effects are indeed fixed, estimable amounts that we can calculate for each country.

• Thus, we believe that Sweden will always have an intercept of 1.2 units (for instance). If we were able to take another sample, we would once again estimate the same intercept for Sweden. There are cases, however, where we may not believe that we can estimate some fixed amount for each country.

• In particular, assume that we have a panel data model run on 20 countries, but which should be generalizable to 100 different countries. We cannot estimate the given intercept for each country or each type of country because we don't have all of them in the sample for which we estimate the model.

• In this case, we might want to estimate the βs on the explanatory variables taking into account that there could be country-specific effects that would enter as a random shock from a known distribution.

• The appropriate model that accounts for cross-national variation is random effects:

yit = α + xitβ + ui + εit

In this model, α is a general intercept and ui is a time-invariant, random disturbance characterizing the ith country. Thus, country-effects are treated as country-specific shocks. We also assume in this model that:

E[εit] = E[ui] = 0

E[εit^2] = σε^2, E[ui^2] = σu^2

E[εituj] = 0 ∀ i, t, j; E[εitεjs] = 0 ∀ t ≠ s, i ≠ j; E[uiuj] = 0 for i ≠ j.

• For each country, we have a separate error term, equal to wit, where:

wit = εit + ui

and

E[wit^2] = σε^2 + σu^2, and E[witwis] = σu^2 for t ≠ s.


• It is because the RE model decomposes the disturbance term into different components that it is also known as an error components model.

• For each panel (or country), the variance-covariance matrix of the T disturbance terms will take the following form:

Σ =

[ (σε^2 + σu^2)   σu^2            . . .   σu^2
  σu^2            (σε^2 + σu^2)   . . .   σu^2
  ...             ...             . . .   ...
  σu^2            σu^2            . . .   (σε^2 + σu^2) ]

= σε^2 IT + σu^2 iT iT′

where iT is a (T × 1) vector of ones.

• The full variance-covariance matrix for all the NT observations is:

Ω =

[ Σ  0  . . .  0
  0  Σ  . . .  0
  ...
  0  0  . . .  Σ ]

= IN ⊗ Σ

• The way in which the RE model differs from the original OLS estimation (with no fixed effects) is only in the specification of the disturbance term. When the model differs from the standard G-M assumptions only in the specification of the errors, the regression coefficients can be consistently and efficiently estimated by Generalized Least Squares (GLS) or (when we don't exactly know Ω) by FGLS.

• Thus, we can do a transformation of the original data that will create a new var-cov matrix for the disturbances that conforms to G-M.


16.6 FGLS estimation of random effects

• The FGLS estimator is

βFGLS = (X′Ω^{−1}X)^{−1}(X′Ω^{−1}y)

To estimate this we will need to know Ω^{−1} = [IN ⊗ Σ]^{−1}, which means that we need to estimate Σ^{−1/2}:

Σ^{−1/2} = (1/σε)[IT − (θ/T) iT iT′]

where

θ = 1 − σε/√(Tσu^2 + σε^2)

• Then the transformation of yi and Xi for FGLS is

Σ^{−1/2}yi = (1/σε) [ yi1 − θyi.,  yi2 − θyi.,  . . . ,  yiT − θyi. ]′

with a similar looking expression for the rows of Xi.

• It can be shown that the GLS estimator, βRE, like the OLS estimator, is a weighted average of the within (FE) and between (BE) estimators:

βRE = FW βFE + (I − FW )βBE

where:

FW = [S^W_xx + λS^B_xx]^{−1} S^W_xx

and

λ = σε^2/(σε^2 + Tσu^2) = (1 − θ)^2

• If λ = 1, then the RE model reduces to OLS. There is essentially no country-specific disturbance term, so the regression coefficients are most efficiently estimated using the OLS method.


• If λ = 0, the country-specific shocks dominate the other parts of the disturbance, and the RE model reduces to the fixed effects model. We attribute all cross-country variation to the country-specific shock, with none attributed to the random disturbance ε_it, and just use the cross-time variation to estimate the slope coefficients.

• To the extent that λ differs from one, OLS estimation involves an inefficient weighting of the two least squares estimators (within and between), and GLS will produce more efficient results.

• To estimate the RE model when Ω is unknown, we use the original OLS results, which are consistent, to get estimates of σ²ε and σ²u.

16.7 Testing between fixed and random effects

• If the random, country-specific disturbance term, u_i, is correlated with any of the other explanatory variables, x_it, then we will get biased estimates in the OLS stage because the regressors and the disturbance will be contemporaneously correlated.

• The coefficient on x_it will be biased and inconsistent, which means the OLS estimates of the residuals will be biased and β̂_RE will be biased. This sets us up for a Hausman test:

H0: E[u_i x_it] = 0; RE appropriate ⇒ β̂_RE is approximately equal to β̂_FE but is more efficient (has smaller standard errors).

H1: E[u_i x_it] ≠ 0; RE is not appropriate ⇒ β̂_RE will be different from β̂_FE (and inconsistent).

• In this setting, the Hausman test statistic, distributed χ²_K, is calculated as:

\[
W = [\hat\beta_{FE} - \hat\beta_{RE}]'\,\hat\Sigma^{-1}[\hat\beta_{FE} - \hat\beta_{RE}]
\quad\text{where}\quad
\hat\Sigma = \widehat{\mathrm{var}}[\hat\beta_{FE}] - \widehat{\mathrm{var}}[\hat\beta_{RE}]
\]

If the Hausman test statistic is larger than its appropriate critical value, then we reject RE as the appropriate specification.
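The Hausman statistic is simple to compute once both estimators and their covariance matrices are in hand. A sketch with hypothetical estimates; every number below is invented for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical FE and RE estimates and variance matrices for K = 2 coefficients
b_fe = np.array([1.10, -0.52])
b_re = np.array([1.00, -0.50])
V_fe = np.array([[0.040, 0.002], [0.002, 0.030]])
V_re = np.array([[0.025, 0.001], [0.001, 0.020]])

d = b_fe - b_re
S = V_fe - V_re                      # var[b_FE] - var[b_RE]
W = float(d @ np.linalg.inv(S) @ d)  # Hausman statistic, chi^2 with K df
p_value = 1 - stats.chi2.cdf(W, df=len(d))
```

A large W (small p-value) rejects H0 and favors the fixed effects specification.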


• Greene, 6th ed., pp. 205–206, also shows how to perform a Breusch-Pagan test for RE based on the residuals from the original OLS regression. This tests the appropriateness of OLS against the alternative of RE; it does not test RE against FE.

16.8 How to do this in Stata

• xtreg depvar [varlist], re for RE

xtreg depvar [varlist], fe for FE

xtreg depvar [varlist], be for between effects

• To perform the Hausman test, type xthausman after xtreg depvar [varlist], re.

• To run the Breusch-Pagan test for RE versus OLS, type xttest0 after xtreg depvar [varlist], re.

16.9 Panel regression and the Gauss-Markov assumptions

• OLS is BLUE if the errors are iid, implying that E[ε_it] = 0, E[ε²_it] = σ²ε, and E[ε_it ε_js] = 0 for t ≠ s or i ≠ j.

• The errors in panel regressions, however, are particularly unlikely to be spherical:

1. We might expect the errors in time-series, cross-sectional settings to be contemporaneously correlated (e.g., economies are linked).

2. We might also expect the errors in panel models to show "panel heteroskedasticity" (e.g., the scale of the dependent variable may vary across countries).

3. The errors may show first-order serial correlation (autocorrelation) or some other type of temporal dependence.


• Consider the following example to see how this will show up in the variance-covariance matrix of the errors. Assume that N = 2 and T = 2 and that the data are stacked by country and then by time period.

• The covariance matrix for spherical errors is:

\[
\Omega =
\begin{pmatrix}
\sigma^2 & 0 & 0 & 0 \\
0 & \sigma^2 & 0 & 0 \\
0 & 0 & \sigma^2 & 0 \\
0 & 0 & 0 & \sigma^2
\end{pmatrix}
\]

where

\[
\Omega =
\begin{pmatrix}
E(\varepsilon_{11}\varepsilon_{11}) & E(\varepsilon_{11}\varepsilon_{12}) & E(\varepsilon_{11}\varepsilon_{21}) & E(\varepsilon_{11}\varepsilon_{22}) \\
E(\varepsilon_{12}\varepsilon_{11}) & E(\varepsilon_{12}\varepsilon_{12}) & E(\varepsilon_{12}\varepsilon_{21}) & E(\varepsilon_{12}\varepsilon_{22}) \\
E(\varepsilon_{21}\varepsilon_{11}) & E(\varepsilon_{21}\varepsilon_{12}) & E(\varepsilon_{21}\varepsilon_{21}) & E(\varepsilon_{21}\varepsilon_{22}) \\
E(\varepsilon_{22}\varepsilon_{11}) & E(\varepsilon_{22}\varepsilon_{12}) & E(\varepsilon_{22}\varepsilon_{21}) & E(\varepsilon_{22}\varepsilon_{22})
\end{pmatrix}
\]

• This is the covariance matrix with contemporaneous correlation:

\[
\Omega =
\begin{pmatrix}
\sigma^2 & 0 & \sigma_{12} & 0 \\
0 & \sigma^2 & 0 & \sigma_{12} \\
\sigma_{12} & 0 & \sigma^2 & 0 \\
0 & \sigma_{12} & 0 & \sigma^2
\end{pmatrix}
\]

• This is the covariance matrix with contemporaneous correlation and panel-specific heteroskedasticity:

\[
\Omega =
\begin{pmatrix}
\sigma^2_1 & 0 & \sigma_{12} & 0 \\
0 & \sigma^2_1 & 0 & \sigma_{12} \\
\sigma_{12} & 0 & \sigma^2_2 & 0 \\
0 & \sigma_{12} & 0 & \sigma^2_2
\end{pmatrix}
\]

• This is the covariance matrix with all of the above and first-order serial correlation:

\[
\Omega =
\begin{pmatrix}
\sigma^2_1 & \rho & \sigma_{12} & 0 \\
\rho & \sigma^2_1 & 0 & \sigma_{12} \\
\sigma_{12} & 0 & \sigma^2_2 & \rho \\
0 & \sigma_{12} & \rho & \sigma^2_2
\end{pmatrix}
\]


• The most popular method for addressing these issues is known as panel corrected standard errors (PCSEs), which is due to Beck and Katz (APSR '95).

• In other cases of autocorrelation and/or heteroskedasticity, we have suggested GLS or FGLS. We would transform the data so that the errors become spherical by multiplying X and y by Ω^(−1/2). The FGLS estimates of β are then equal to (X'Ω⁻¹X)⁻¹(X'Ω⁻¹y).

• FGLS is most often performed by first using the OLS residuals to estimate ρ (to implement the Prais-Winsten method) and then using the residuals from an OLS regression on the transformed data to estimate the contemporaneous correlation. This is known as the Parks method and is done in Stata via xtgls.

• Beck and Katz argue, however, that unless T ≫ N, very few periods of data are being used to compute the contemporaneous correlation between each pair of countries.

• In addition, estimates of the panel-specific autocorrelation coefficient, ρ_i, are likely to be biased downwards when they are based on few T observations.

• In other words, FGLS via the Parks method has undesirable small-sample properties when T is not much greater than N.

• To show that FGLS leads to biased estimates of the standard errors in TSCS data, Beck and Katz use Monte Carlo simulation:

1. Simulate 1000 different samples of data with known properties. Each of the 1000 samples will be "small" in terms of T and N.

2. Compute the β coefficients and the standard errors of those coefficients using FGLS on the 1000 different runs of data.

3. Compare the variance implied by the calculated standard errors with the actual variance found in the data to see if the calculated standard errors are correct.


• PCSEs are built on the same approach as White consistent standard errors in the case of heteroskedasticity and Newey-West standard errors in the case of autocorrelation. Instead of transforming the data, we use the following equation to compute standard errors that account for non-sphericity:

\[
\mathrm{var}[\hat\beta] = (X'X)^{-1} X'\Omega X (X'X)^{-1}
\]

where Ω denotes the variance-covariance matrix of the errors.

• Beck and Katz assert that the kind of non-sphericity in TSCS produces the following Ω:

\[
\Omega =
\begin{pmatrix}
\sigma^2_1 I_T & \sigma_{12} I_T & \cdots & \sigma_{1N} I_T \\
\sigma_{21} I_T & \sigma^2_2 I_T & \cdots & \sigma_{2N} I_T \\
\vdots & & \ddots & \vdots \\
\sigma_{N1} I_T & \sigma_{N2} I_T & \cdots & \sigma^2_N I_T
\end{pmatrix}
\]

• Let

\[
\Sigma =
\begin{pmatrix}
\sigma^2_1 & \sigma_{12} & \sigma_{13} & \cdots & \sigma_{1N} \\
\sigma_{12} & \sigma^2_2 & \sigma_{23} & \cdots & \sigma_{2N} \\
\vdots & & & \ddots & \vdots \\
\sigma_{1N} & \sigma_{2N} & \sigma_{3N} & \cdots & \sigma^2_N
\end{pmatrix}
\]


• Use the OLS residuals, denoted e_it for unit i at time t (in Beck and Katz's notation), to estimate the elements of Σ:

\[
\hat\Sigma_{ij} = \frac{\sum_{t=1}^T e_{it} e_{jt}}{T}, \tag{16.2}
\]

which means the estimate of the full matrix Σ is

\[
\hat\Sigma = \frac{E'E}{T}
\]

where E is a T × N matrix formed by re-shaping the NT × 1 vector of OLS residuals, such that each column contains the T × 1 vector of residuals for one cross-sectional unit (conversely, each row contains the N × 1 vector of residuals for all cross-sectional units in a given time period):

\[
E =
\begin{pmatrix}
e_{11} & e_{21} & \cdots & e_{N1} \\
e_{12} & e_{22} & \cdots & e_{N2} \\
\vdots & \vdots & \ddots & \vdots \\
e_{1T} & e_{2T} & \cdots & e_{NT}
\end{pmatrix}
\]

Then

\[
\hat\Omega = \frac{E'E}{T} \otimes I_T.
\]

• Compute SEs using the square roots of the diagonal elements of

\[
(X'X)^{-1} X'\hat\Omega X (X'X)^{-1}, \tag{16.3}
\]

where X denotes the NT × k matrix of stacked vectors of explanatory variables, x_it.
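Steps (16.2)–(16.3) can be strung together in a few lines of numpy. The sketch below simulates a small stacked panel (all values illustrative), runs OLS, reshapes the residuals into E, and forms the PCSE sandwich:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, k = 3, 20, 2                      # small illustrative TSCS panel
X = rng.normal(size=(N * T, k))
beta = np.array([1.0, -0.5])
y = X @ beta + rng.normal(size=N * T)   # data stacked by unit, then time

# OLS and its residuals
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b_ols

# Reshape residuals into the T x N matrix E (column i = unit i's residuals)
E = e.reshape(N, T).T

# Sigma-hat = E'E / T ;  Omega-hat = Sigma-hat kron I_T
Sigma_hat = (E.T @ E) / T
Omega_hat = np.kron(Sigma_hat, np.eye(T))

# Sandwich (16.3): PCSEs are the square roots of the diagonal
XtX_inv = np.linalg.inv(X.T @ X)
V_pcse = XtX_inv @ X.T @ Omega_hat @ X @ XtX_inv
pcse = np.sqrt(np.diag(V_pcse))
```

The Kronecker ordering Σ̂ ⊗ I_T matches data stacked by unit and then by time.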


• The intuition behind why PCSEs do well: they are similar to White's heteroskedasticity-consistent standard errors for cross-sectional estimators, but better because they take advantage of the information provided by the panel structure.

• Good small sample properties confirmed by Monte Carlo studies.

• Note that this solves the problems of panel heteroskedasticity and contemporaneous correlation, but not serial correlation. Serial correlation must be removed before applying this fix. This can be done using the Prais-Winsten transformation or by including a lagged dependent variable (Beck and Katz are partial to the latter).


Section 17

Models with Discrete Dependent Variables

17.1 Discrete dependent variables

• There are many cases where the observed dependent variable is not continuous, but instead takes on discrete values. For example:

A binary or dichotomous dependent variable: y_i = 0 or 1, where these values can stand for any qualitative measure. We will estimate this type of data using logit or probit models.

Polychotomous but ordered data: y_i = 0, 1, 2, 3, . . . , where each value is part of a ranking, for example [did not graduate high school, graduated HS, some college, BA, MA, etc.]. We will estimate this type of data using ordered logit or probit models.

Polychotomous or multinomial and unordered data, where there is no particular ranking to the data. We will estimate this type of data using multinomial or conditional logit models.

Count data, y_i = 0, 1, 2, 3, . . . , but where the different values might represent the number of events per period rather than different qualitative categories. We will estimate this type of data using count models such as Poisson or negative binomial models.

Duration data, where we are interested in the time it takes until an event occurs. The dependent variable indicates a discrete time period.

• With continuous data, we are generally interested in estimating the expectation of y_i based on the data. With discrete dependent variables, we are generally interested in the probability of seeing a particular value and how that changes with the explanatory variables.


17.2 The latent choice model

• One of the most common ways of modeling and motivating discrete choices is the latent choice model:

\[
y^*_i = x_i\beta + \varepsilon_i,
\]

where we observe:

\[
y_i =
\begin{cases}
1 & \text{if } y^*_i > 0 \\
0 & \text{if } y^*_i \le 0
\end{cases}
\]

• Then, the probability that we observe a value of one for the dependent variable is:

\[
\Pr[y_i = 1|x_i] = \Pr[y^*_i > 0|x_i] = \Pr[x_i\beta + \varepsilon_i > 0] = \Pr[\varepsilon_i > -x_i\beta].
\]

• Let F(·) be the cumulative distribution function (cdf) of ε, so that F(ε_i) = Pr[ε ≤ ε_i]. Then

\[
\Pr[y_i = 1|x_i] = 1 - F(-x_i\beta)
\qquad
\Pr[y_i = 0|x_i] = F(-x_i\beta)
\]

If f(·), the pdf, is symmetric, then:

\[
\Pr[y_i = 1|x_i] = F(x_i\beta) \quad\text{and}\quad \Pr[y_i = 0|x_i] = 1 - F(x_i\beta)
\]

• One way to represent the probability of y_i = 1 is as the Linear Probability Model:

\[
\Pr[y_i = 1|x_i] = F(x_i\beta) = x_i\beta
\]

• This is the model that we implicitly estimate if we regress a column of ones and zeroes on a set of explanatory variables using ordinary least squares. The probability of y_i = 1 simply rises in proportion to the explanatory variables.


• There are two problems with this way of estimating the probability of y_i = 1.

1. The model may predict values of y_i outside the zero-one range.

2. The linear probability model is heteroskedastic. To see this, note that if y_i = x_iβ + ε_i, then ε_i = 1 − x_iβ when y_i = 1 and ε_i = −x_iβ when y_i = 0. Thus:

\[
\begin{aligned}
\mathrm{var}[\varepsilon_i|x_i] &= \Pr[y_i = 1]\,(1 - x_i\beta)^2 + \Pr[y_i = 0]\,(-x_i\beta)^2 \\
&= x_i\beta(1 - x_i\beta)^2 + (1 - x_i\beta)(x_i\beta)^2 \\
&= x_i\beta(1 - x_i\beta)
\end{aligned}
\]

which depends on x_i.

• Given these two drawbacks, it seems wise to select a non-linear function for the predicted probability that will not exceed the values of one or zero and that will offer a better fit to the data, where the data are composed of zeroes and ones.

• For this reason, we normally assume a cumulative distribution function for the error term where:

\[
\lim_{x\beta\to+\infty} \Pr(Y = 1) = 1
\quad\text{and}\quad
\lim_{x\beta\to-\infty} \Pr(Y = 1) = 0
\]

• The two most popular choices for a continuous probability distribution that satisfy these requirements are the normal and the logistic.

• The normal distribution function for the errors gives rise to the probit model: F(x_iβ) = Φ(x_iβ), where Φ is the cumulative normal distribution.

• The logistic function gives rise to the logit model:

\[
F(x_i\beta) = \frac{e^{x_i\beta}}{1 + e^{x_i\beta}} = \Lambda(x_i\beta).
\]

• The vector of coefficients, β, is estimated via maximum likelihood.


• Treating each observation, y_i, as an independent random draw from a given distribution, we can write out the likelihood of seeing the entire sample as:

\[
\Pr(Y_1 = y_1, Y_2 = y_2, \ldots, Y_n = y_n) = \Pr(Y_1 = y_1)\Pr(Y_2 = y_2)\cdots\Pr(Y_n = y_n)
= \prod_{y_i=0} [1 - F(x_i\beta)] \prod_{y_i=1} F(x_i\beta).
\]

• This can be more conveniently written as:

\[
L = \prod_{i=1}^n [F(x_i\beta)]^{y_i} [1 - F(x_i\beta)]^{(1-y_i)}.
\]

• We take the natural log of this to get the log likelihood:

\[
\ln L = \sum_{i=1}^n \left\{ y_i \ln F(x_i\beta) + (1 - y_i)\ln[1 - F(x_i\beta)] \right\}
\]

• Maximizing the log likelihood with respect to β yields the values of the parameters that maximize the likelihood of observing the sample. From what we learned of maximum likelihood earlier in the course, we can also say that these are the most probable values of the coefficients for the sample.
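For the logit case, the log likelihood simplifies to Σ[y_i x_iβ − ln(1 + e^{x_iβ})], which a generic optimizer can maximize directly. A sketch on simulated data, using scipy's BFGS minimizer on the negative log likelihood; the coefficient values are invented for illustration:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # constant + one regressor
beta_true = np.array([0.5, 1.0])                       # illustrative values
p = 1 / (1 + np.exp(-(X @ beta_true)))
y = rng.binomial(1, p)

def neg_loglik(b):
    xb = X @ b
    # logit log likelihood: sum of y*xb - log(1 + exp(xb))
    return -np.sum(y * xb - np.log1p(np.exp(xb)))

res = minimize(neg_loglik, x0=np.zeros(2), method="BFGS")
beta_hat = res.x
```

With a sample this large, the estimates should land close to the values used to generate the data, illustrating the consistency of the MLE.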

• The first-order conditions for the maximization require that:

\[
\frac{\partial \ln L}{\partial \beta} = \sum_{i=1}^n \left[ y_i\frac{f_i}{F_i} - (1 - y_i)\frac{f_i}{1 - F_i} \right] x_i = 0
\]

where f_i is the probability density function, dF_i/d(x_iβ).

• The standard errors are estimated by directly computing the information matrix of the log likelihood using the data at hand. The estimated variance-covariance matrix of the coefficients is the inverse of the negative Hessian, where the Hessian is the matrix containing all the second derivatives of the log likelihood with respect to the parameters:

\[
\frac{\partial^2 \ln L}{\partial\beta\,\partial\beta'}.
\]


• Maximum likelihood estimators have attractive large-sample properties (consistency, asymptotic efficiency, asymptotic normality). We therefore treat our variance estimate as consistent for the true parameter and converging toward it. Thus, we use the z distribution (which assumes that the variance is known) rather than the t distribution as the source of our critical values.

17.3 Interpreting coefficients and computing marginal effects

• A positive sign on a slope coefficient obtained from a logit or probit indicates that the probability of y_i = 1 increases as the explanatory variable increases.

• We would like to know how much the probability increases as x_i increases. In other words, we would like to know the marginal effect. In OLS, this was given by β:

\[
\frac{\partial E(y|x)}{\partial x} = \beta
\]

This is not true for logit and probit.

• In binary models,

\[
\frac{\partial E(y|x)}{\partial x} = f(x_i\beta)\beta
\]

• Thus, for probit models:

\[
\frac{\partial E(y|x)}{\partial x} = \phi(x_i\beta)\beta
\]

and for logit models:

\[
\frac{\partial E(y|x)}{\partial x} = \frac{e^{x_i\beta}}{(1 + e^{x_i\beta})^2}\,\beta = \Lambda(x_i\beta)[1 - \Lambda(x_i\beta)]\beta
\]

• Two points to note about the marginal effects:

1. They are not constant.

2. They are maximized at xiβ = 0, which is where E[y|x] = 0.5.
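Both points are easy to verify numerically for the logit. A small sketch with an illustrative coefficient:

```python
import numpy as np

def logit_marginal_effect(xb, beta):
    """Marginal effect dE(y|x)/dx = Lambda(xb)[1 - Lambda(xb)] * beta."""
    Lam = 1 / (1 + np.exp(-xb))
    return Lam * (1 - Lam) * beta

beta_k = 2.0  # illustrative slope coefficient (not from the text)
# The effect is largest at xb = 0, where Lambda = 0.5 and the effect is 0.25*beta
me_at_zero = logit_marginal_effect(0.0, beta_k)
me_far_out = logit_marginal_effect(4.0, beta_k)
```

At xβ = 0 the effect equals 0.25β; far out in either tail it shrinks toward zero, showing that the effect is not constant.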


17.4 Measures of goodness of fit

• A frequently used measure of fit for discrete choice models is the pseudo-R², which, as its name implies, is an attempt to tell you how well the model fits the data:

\[
\text{Pseudo-}R^2 = 1 - \frac{\ln L}{\ln L_0}
\]

where ln L is the log likelihood from the full model and ln L₀ is the log likelihood from the model estimated with a constant only. This will always be between 0 and 1.

• The log likelihood is always negative. The likelihood function that we start with is just the probability of seeing the entire sample:

\[
\Pr[Y_1 = y_1, Y_2 = y_2, \ldots, Y_n = y_n] = \prod_{y_i=0}[1 - F(x_i\beta)] \prod_{y_i=1} F(x_i\beta).
\]

• As a probability, it must lie between zero and one. When we take the natural log of this, to get the log likelihood,

\[
\ln L = \sum_{i=1}^n \left\{ y_i \ln F(x_i\beta) + (1 - y_i)\ln[1 - F(x_i\beta)] \right\},
\]

we will get a number between negative infinity and zero: the natural log of one is zero, because e⁰ = 1, and the natural log of zero is negative infinity, because e^(−∞) = 0. Thus, increases in the log likelihood toward zero imply that the probability of seeing the entire sample given the estimated coefficients is higher.

• The maximum likelihood method selects the value of β that maximizes the log likelihood and reports the value of the log likelihood at β̂_ML. A log likelihood far below zero implies that the model does not fit the data well.

• As you add variables to the model, you expect the log likelihood to get closer to zero (i.e., increase). ln L₀ is the log likelihood for the model with just a constant; ln L is the log likelihood for the full model, so ln L₀ should always be larger in absolute value (more negative) than ln L.


• When ln L₀ = ln L, the additional variables have added no predictive power, and the pseudo-R² is equal to zero. When ln L = 0, the model perfectly predicts the data, and the pseudo-R² is equal to one.

• A similar measure is the likelihood ratio test of the hypothesis that the coefficients on all the explanatory variables in the model (except the constant) are zero (similar to the F-test under OLS).

• Although these measures offer the benefit of comparison to OLS techniques, they do not easily get at what we want to do, which is to predict accurately when y_i is going to fall into a particular category. An alternative measure of the goodness of fit of the model might be the percentage of observations that were correctly predicted. This is calculated as:

[(Observation = 1 ∩ Prediction = 1) + (Observation = 0 ∩ Prediction = 0)] / n

The method of model prediction is to say that we predict a value of one for y_i if F(x_iβ̂) ≥ 0.5. Let ŷ_i be the predicted value of y_i.

• The percent correctly predicted is:

\[
\frac{1}{N}\sum_{i=1}^N \left[ y_i\hat y_i + (1 - y_i)(1 - \hat y_i) \right]
\]

• This measure has some shortcomings itself, which are overcome by the expected percent correctly predicted (ePCP):

\[
ePCP = \frac{1}{N}\left( \sum_{y_i=1} \hat p_i + \sum_{y_i=0} (1 - \hat p_i) \right)
\]

where p̂_i = F(x_iβ̂) is the predicted probability that y_i = 1.
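Both measures can be computed in a couple of lines given observed outcomes and fitted probabilities. A sketch with invented values of y_i and p̂_i:

```python
import numpy as np

# Observed outcomes and fitted probabilities p_i (illustrative numbers)
y = np.array([1, 1, 0, 0, 1, 0])
p = np.array([0.9, 0.6, 0.2, 0.4, 0.3, 0.8])

# Percent correctly predicted: predict 1 when p >= 0.5
yhat = (p >= 0.5).astype(int)
pcp = np.mean(y * yhat + (1 - y) * (1 - yhat))

# Expected PCP: average p_i over the observed 1s and (1 - p_i) over the 0s
epcp = (p[y == 1].sum() + (1 - p[y == 0]).sum()) / len(y)
```

Unlike the PCP, the ePCP rewards confident correct probabilities rather than treating a fitted 0.51 and a fitted 0.99 identically.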


Section 18

Discrete Choice Models for Multiple Categories

18.1 Ordered probit and logit

• Suppose we are interested in a model where the dependent variable takes on more than two discrete values, but the values can be ordered. The latent model framework can be used for this type of model:

\[
y^*_i = x_i\beta + \varepsilon_i,
\]

\[
y_i =
\begin{cases}
1 & \text{if } y^*_i \le \gamma_1 \\
2 & \text{if } \gamma_1 < y^*_i \le \gamma_2 \\
3 & \text{if } \gamma_2 < y^*_i \le \gamma_3 \\
\;\vdots \\
m & \text{if } \gamma_{m-1} < y^*_i
\end{cases}
\]

• This gives the probabilities:

\[
\begin{aligned}
\Pr(y_i = 1) &= \Phi(\gamma_1 - x_i\beta) \\
\Pr(y_i = 2) &= \Phi(\gamma_2 - x_i\beta) - \Phi(\gamma_1 - x_i\beta) \\
\Pr(y_i = 3) &= \Phi(\gamma_3 - x_i\beta) - \Phi(\gamma_2 - x_i\beta) \\
&\;\vdots \\
\Pr(y_i = m) &= 1 - \Phi(\gamma_{m-1} - x_i\beta) = \Phi(x_i\beta - \gamma_{m-1})
\end{aligned}
\]

• To write down the likelihood, let

\[
z_{ij} =
\begin{cases}
1 & \text{if } y_i = j \\
0 & \text{otherwise}
\end{cases}
\quad\text{for } j = 1, \ldots, m.
\]


• Then, with the conventions γ₀ = −∞ and γ_m = +∞,

\[
\Pr(z_{ij} = 1) = \Phi(\gamma_j - x_i\beta) - \Phi(\gamma_{j-1} - x_i\beta)
\]

• The likelihood for an individual is

\[
\begin{aligned}
L_i &= [\Phi(\gamma_1 - x_i\beta)]^{z_{i1}} [\Phi(\gamma_2 - x_i\beta) - \Phi(\gamma_1 - x_i\beta)]^{z_{i2}} \cdots [1 - \Phi(\gamma_{m-1} - x_i\beta)]^{z_{im}} \\
&= \prod_{j=1}^m [\Phi(\gamma_j - x_i\beta) - \Phi(\gamma_{j-1} - x_i\beta)]^{z_{ij}}
\end{aligned}
\]

• The likelihood function for the sample then is

\[
L = \prod_{i=1}^n \prod_{j=1}^m [\Phi(\gamma_j - x_i\beta) - \Phi(\gamma_{j-1} - x_i\beta)]^{z_{ij}}
\]

• We compute marginal effects just like for the dichotomous probit model. For the three-category case these would be:

\[
\begin{aligned}
\frac{\partial \Pr(y_i = 1)}{\partial x_i} &= -\phi(\gamma_1 - x_i\beta)\beta \\
\frac{\partial \Pr(y_i = 2)}{\partial x_i} &= [\phi(\gamma_1 - x_i\beta) - \phi(\gamma_2 - x_i\beta)]\beta \\
\frac{\partial \Pr(y_i = 3)}{\partial x_i} &= \phi(\gamma_2 - x_i\beta)\beta
\end{aligned}
\]
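The category probabilities and their marginal effects can be checked numerically: the probabilities must sum to one, and the marginal effects must sum to zero. A sketch with illustrative cutpoints and index values (none from the text):

```python
import numpy as np
from scipy.stats import norm

# Illustrative cutpoints and linear index for a 3-category ordered probit
gamma1, gamma2 = -0.5, 1.0
xb = 0.3

p1 = norm.cdf(gamma1 - xb)
p2 = norm.cdf(gamma2 - xb) - norm.cdf(gamma1 - xb)
p3 = 1 - norm.cdf(gamma2 - xb)
probs = np.array([p1, p2, p3])

# Marginal effects for an illustrative coefficient beta_k
beta_k = 0.8
me1 = -norm.pdf(gamma1 - xb) * beta_k
me2 = (norm.pdf(gamma1 - xb) - norm.pdf(gamma2 - xb)) * beta_k
me3 = norm.pdf(gamma2 - xb) * beta_k
```

The zero-sum property of the marginal effects follows because a shift in x can only move probability mass between categories.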


18.2 Multinomial Logit

• Suppose we are interested in modeling vote choices in multi-party systems. In this setting, the dependent variable (party choice) might be y ∈ {1, 2, 3}, but there is no implicit ordering; they are just different choices. Hence, this model is different from the ordered logit and probit discussed above.

• The multinomial logit or probit is motivated through a re-working of the latent variable model. The latent variable that we don't see now becomes the "utility" to the voter of choosing a particular party. Because the utility also depends on an individual shock, the model is called the random utility model.

• In the current example, we have three different choices for voter i, where those choices are subscripted by j:

\[
U_{ij} = x_i\beta_j + \varepsilon_{ij}
\]

In this case x_i are the attributes of the voters selecting a party and β_j are choice-specific coefficients determining how utility from a party varies with voter attributes (e.g., union members are presumed to derive more utility from the Democratic party). U_ij is the utility to voter i of party j.

• The voter will choose party one over party two if:

\[
U_{i1} > U_{i2} \;\Rightarrow\; x_i\beta_1 + \varepsilon_{i1} > x_i\beta_2 + \varepsilon_{i2} \;\Rightarrow\; (\varepsilon_{i1} - \varepsilon_{i2}) > x_i(\beta_2 - \beta_1)
\]

• The likelihood of this being true can be modeled using the logistic function if we assume that the original errors are distributed log-Weibull. This produces a logistic distribution for (ε_i1 − ε_i2), because if two random variables, ε_i1 and ε_i2, are distributed log-Weibull, then the difference between them is distributed according to the logistic distribution.


• When there are only two choices (party one or party two), the probability of a voter choosing party 1 is given by:

\[
\Pr(y_i = 1|x_i) = \frac{e^{x_i(\beta_1 - \beta_2)}}{1 + e^{x_i(\beta_1 - \beta_2)}}
\]

Thus, the random utility model in the binary case gives you something that looks very like the regular logit.

• We notice from this formulation that the two coefficients, β₁ and β₂, cannot be estimated separately. One of the coefficients, for instance β₁, serves as a base, and the estimated coefficients (β_j − β₁) tell us the relative utility of choice j compared to choice 1. We normalize β₁ to zero, so that U_i1 = 0, and we measure the probability that parties 2 and 3 are selected over party 1 as voter characteristics vary.

• Thus, the probability of selecting party j compared to party 1 depends on the relative utility of that choice compared to choice 1. Recalling that e^{x_iβ₁} = e⁰ = 1, and substituting into the equation above, we find that the probability of selecting party 2 over party 1 equals:

\[
\frac{\Pr(y_i = 2)}{\Pr(y_i = 1)} = e^{x_i\beta_2}
\]

• In just the same way, if there is a third party option, the relative probability of voting for that party over party 1 is equal to:

\[
\frac{\Pr(y_i = 3)}{\Pr(y_i = 1)} = e^{x_i\beta_3}
\]

Finally, we could use a ratio of the two expressions above to give us the relative probability of picking party 2 over party 3.


• What we may want, however, is the probability that you pick party 2 (or any other party). In other words, we want Pr(y_i = 2) and Pr(y_i = 3). To get this, we use a little simple algebra. First:

\[
\Pr(y_i = 2) = e^{x_i\beta_2}\Pr(y_i = 1)
\qquad
\Pr(y_i = 3) = e^{x_i\beta_3}\Pr(y_i = 1)
\]

• Next, we use the fact that the three probabilities must sum to one (the three choices are mutually exclusive and collectively exhaustive):

\[
\Pr(y_i = 1) + \Pr(y_i = 2) + \Pr(y_i = 3) = 1
\]

So that:

\[
\Pr(y_i = 1) + e^{x_i\beta_2}\Pr(y_i = 1) + e^{x_i\beta_3}\Pr(y_i = 1) = 1
\]

And:

\[
\Pr(y_i = 1)\left[1 + e^{x_i\beta_2} + e^{x_i\beta_3}\right] = 1
\]

so that

\[
\Pr(y_i = 1) = \frac{1}{1 + e^{x_i\beta_2} + e^{x_i\beta_3}}
\]

• Using this expression for the probability of the first choice, we can say that:

\[
\Pr(y_i = 2) = \frac{e^{x_i\beta_2}}{1 + e^{x_i\beta_2} + e^{x_i\beta_3}}
\qquad
\Pr(y_i = 3) = \frac{e^{x_i\beta_3}}{1 + e^{x_i\beta_2} + e^{x_i\beta_3}}
\qquad
\Pr(y_i = 1) = \frac{1}{1 + e^{x_i\beta_2} + e^{x_i\beta_3}}
\]

• The likelihood function is then constructed and maximized using these probabilities for each of the three choices. The model is estimated in Stata using mlogit.

• When the choice depends on attributes of the alternatives instead of the attributes of the individuals, the model is estimated as a conditional logit. This model is estimated in Stata using clogit.
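The three-choice probabilities above translate directly into code. A numpy sketch with invented coefficients, taking party 1 as the base category:

```python
import numpy as np

# Choice-specific coefficients with party 1 as base (beta_1 normalized to 0);
# illustrative values for a model with a constant and one voter attribute
beta2 = np.array([0.2, 0.5])
beta3 = np.array([-0.4, 1.0])
x_i = np.array([1.0, 0.7])  # includes the constant

denom = 1 + np.exp(x_i @ beta2) + np.exp(x_i @ beta3)
p1 = 1 / denom
p2 = np.exp(x_i @ beta2) / denom
p3 = np.exp(x_i @ beta3) / denom

# The odds of party 2 versus party 1 depend only on beta2 (the IIA property)
odds_2_vs_1 = p2 / p1
```

Note that the odds ratio p2/p1 collapses to e^{x_iβ₂}, with β₃ dropping out entirely, which is exactly the IIA property discussed below.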


18.2.1 Interpreting coefficients, assessing goodness of fit

• We can enter the coefficients into the expressions above to get the probability of a vote for any particular party. Further, we can easily tell something about the relative probabilities of picking one party over another and how that changes with a one-unit change in an x variable.

• Recall: Pr(y_i = 2)/Pr(y_i = 1) = e^{x_iβ₂}.

• Thus, e^{β₂} will tell you the change in the relative probabilities for a one-unit change in x. This is also known as the relative risk, and the exponentiated coefficients are known as relative risk ratios.

18.2.2 The IIA assumption

• The drawback to multinomial logit is the Independence of Irrelevant Alternatives (IIA) assumption. In this model, the odds ratios are independent of the other alternatives.

• That is to say, the ratio of the likelihood of voting Democrat to voting Republican is independent of whether Nader is an option and does not change if we add other options to the model. This is due to the fact that we treat the errors in the multinomial logit model as independent.

• Thus, the error term ε_i1, which makes someone more likely to vote for the Democrats than the Republicans, is treated as being wholly independent of the error term that might make them more likely to vote for the Greens.

• The assumption aids estimation, but is not always warranted in behavioral situations where the types of alternatives available might have an effect on the relative utility of other choices. In other words, if you could vote for the Greens, you might be less likely to vote for the Democrats compared to the Republicans.

• One test of whether the assumption is inappropriate in your model setting is to run the model with and without a particular category and use a Hausman test to see whether the coefficients on the explanatory variables in the remaining categories change.


• The IIA problem afflicts both the multinomial logit and the conditional logit models. An option is to use the multinomial probit model, although it comes with its own set of problems.


Section 19

Count Data, Models for Limited Dependent Variables, and Duration Models

19.1 Event count models and poisson estimation

• Suppose that we have data on the number of coups, or the number of instances of state-to-state conflict per year, over a 20-year stretch. These data are clearly not continuous, and therefore should not be modeled using OLS, because we never see 2.5 conflicts.

• In addition, OLS could sometimes predict that we should see negative numbers of state-to-state conflicts, something we want to rule out given that our data are always zero or positive.

• The data should not, however, be modeled as an ordered probit or logit, because the numbers represent more than different categories. They are count data, and changes in the count have a cardinal as well as an ordinal interpretation. That is, we can interpret a change of one in the dependent variable as meaning something about the magnitude of conflict, rather than just identifying a change in category.

• Enter the Poisson distribution. This is an important discrete probability distribution for a count starting at zero with no upper bound.

• For data that we believe could be modeled with a Poisson distribution, each observation is a unit of time (one year in our data example), and the Poisson gives the probability of observing a particular count for that period.


• For each observation, the probability of seeing a particular count is given by:

\[
\Pr(Y_i = y_i) = \frac{e^{-\lambda}\lambda^{y_i}}{y_i!}, \qquad y_i = 0, 1, 2, \ldots
\]

• In this model, λ is just a parameter that can be estimated, as µ is for the normal distribution. It can also be shown that the expected event count in each period, E(y_i), is equal to λ and that the variance of the count is also equal to λ. Thus, the variance of the event counts increases as their number increases.

• The distribution above implies that the expected event count is just the same in each period. This is the univariate analysis of state-to-state conflict. We are saying what the expected number of events is for each year.

• In multivariate analysis, however, we can do better. We can relate the number of events in each period to explanatory variables. This requires relating λ somehow to explanatory variables that we think should be associated with instances of state-to-state conflict. Then we will get a different parameter λ_i for each observation, related to the level of the explanatory variables.

• The standard way to do this for the Poisson model is to say that:

\[
\lambda_i = e^{x_i\beta} \quad\text{or equivalently}\quad \ln\lambda_i = x_i\beta
\]

• We can now say that the probability of observing a given count for period i is:

\[
\Pr(Y_i = y_i) = \frac{e^{-\lambda_i}\lambda_i^{y_i}}{y_i!}, \qquad y_i = 0, 1, 2, \ldots
\]

and E[y_i|x_i] = λ_i = e^{x_iβ}.

• Given this probability for a particular value at each observation, the likelihood of observing the entire sample is the following:

\[
\Pr[Y|\lambda_i] = \prod_{i=1}^n \frac{e^{-\lambda_i}\lambda_i^{y_i}}{y_i!}
\]


• The log likelihood of this is:

\[
\ln L = \sum_{i=1}^n \left[ -\lambda_i + y_i x_i\beta - \ln y_i! \right]
\]

Since the last term does not involve β, it can be dropped, and the log likelihood can be maximized with respect to β to give the maximum likelihood estimates of the coefficients.

• This model can be done in Stata using the poisson command.

• For inference about the effects of explanatory variables, we can:

examine the predicted number of events based on a given set of values for the variables, which is given by λ̂_i = e^{x_iβ̂};

examine the factor change: for a unit change in x_k, the expected count changes by a factor of e^{β_k}, holding other variables constant;

examine the percentage change in the expected count for a δ-unit change in x_k: 100 × [e^{(β_k·δ)} − 1], holding other variables constant.
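These three quantities are straightforward to compute from estimated coefficients. A sketch with invented values of β and x_i:

```python
import numpy as np

# Illustrative coefficients: constant plus one regressor x_k (not from the text)
beta = np.array([0.1, 0.4])
x_i = np.array([1.0, 2.0])

# Predicted count at x_i
lam = np.exp(x_i @ beta)

# Factor change in the expected count for a one-unit increase in x_k
factor = np.exp(beta[1])

# Percentage change for a delta-unit increase in x_k: 100*(exp(beta_k*delta) - 1)
delta = 0.5
pct_change = 100 * (np.exp(beta[1] * delta) - 1)
```

Because the conditional mean is exponential, effects are multiplicative: raising x_k by one unit multiplies the expected count by e^{β_k} regardless of the starting level.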

• Marginal effects can be computed in Stata after poisson using mfx compute.

• The Poisson model offers no natural counterpart to the R² in a linear regression because the conditional mean (the expected number of counts in each period) is non-linear. Many alternatives have been suggested: Greene (pp. 908–909) offers a variety, and Cameron and Trivedi (1998) give more. A popular statistic is the standard pseudo-R², which compares the log likelihood for the full model to the log likelihood for the model containing only a constant.

• One feature of the Poisson distribution is that E[y_i] = var[y_i]. If E[y_i] < var[y_i], this is called overdispersion and the Poisson is inappropriate. A negative binomial or generalized event count model can be used in this case.


19.2 Limited dependent variables: the truncation example

• The technical definition of a limited dependent variable is that it is limited because some part of the data is simply not observed and missing (truncation) or not observed and set at some limit (censoring).

• Although it is not often encountered, we start by discussing the truncation case because it establishes the intuition.

• Imagine that we have a random variable, x, with a normal distribution. However, we do not observe the distribution of x below some value a, because the distribution is truncated.

• If we wanted to estimate the expected value of x using the values we do observe, we would over-estimate the expected value, because we are not taking into account the observations that are truncated.

• How do we tell what the mean of x would have been had we seen all theobservations?

• To establish the expected value, we need the probability density function (pdf) of x given the truncation:

\[
f(x|x > a) = \frac{f(x)}{\Pr[x > a]}
\]

The denominator simply re-scales the distribution by the probability that we observe x, so that we get a pdf that integrates to one.

• In addition:

\[
\Pr[x > a] = 1 - \Phi\left(\frac{a - \mu}{\sigma}\right) = 1 - \Phi(\alpha)
\]

where Φ is the cumulative standard normal distribution and the term in parentheses simply transforms the normal distribution of x into a standard normal.

• Thus:

\[
f(x|x > a) = \frac{f(x)}{1 - \Phi(\alpha)}
\]


• To get the expected value of x, we use:

\[
E[x|x > a] = \int_a^\infty x f(x|x > a)\,dx = \mu + \sigma\lambda(\alpha)
\]

where α = (a − µ)/σ, φ(α) is the standard normal density, and:

λ(α) = φ(α)/[1 − Φ(α)] when truncation is at x > a

λ(α) = −φ(α)/Φ(α) when truncation is at x < a

• The function λ(α) is also known as the "inverse Mills ratio" and is relevant for succeeding models.
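The truncated mean µ + σλ(α) can be verified against scipy's built-in truncated normal distribution. A sketch with illustrative parameters:

```python
import numpy as np
from scipy.stats import norm, truncnorm

# Illustrative: a N(0, 1) variable truncated from below at a (not from the text)
mu, sigma, a = 0.0, 1.0, 0.5
alpha = (a - mu) / sigma

# Inverse Mills ratio and the truncated mean mu + sigma * lambda(alpha)
lam = norm.pdf(alpha) / (1 - norm.cdf(alpha))
mean_trunc = mu + sigma * lam

# Cross-check against scipy's truncated normal on [a, infinity)
mean_scipy = truncnorm.mean((a - mu) / sigma, np.inf, loc=mu, scale=sigma)
```

The two numbers agree, and both exceed a, illustrating how truncation from below pulls the observed mean above the untruncated mean µ.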

• Now, imagine that we have a dependent variable, y_i, which is truncated. For example, we only see values of y_i > a, or where x_iβ + ε_i > a, implying that ε_i > a − x_iβ. In this case, the probability that we observe y_i is equal to

\[
\Pr[\varepsilon_i > a - x_i\beta] = 1 - \Phi\left(\frac{a - x_i\beta}{\sigma}\right).
\]

• Given our assumptions about the distribution of εi, we can calculate thisprobability, meaning that we can correct the conditional mean we estimate.

• In most cases, we assume that the conditional mean of yi, or E[yi|xi] = xiβ.This is not the case for the truncated dependent variable, for the reasonsexpressed above.

• Instead, and following a similar logic:

E[yi|yi > a] = xiβ + σφ[(a− xiβ)/σ]

1− Φ[(a− xiβ)/σ= xiβ + σλ(αi)

• What this implies, very basically, is that we need to add an additional term to a regression model where we have truncated data in order to correctly estimate our β coefficients. If we don’t include the term that adjusts for the probability that we see yi to begin with, we will get an inconsistent estimate, similar to what would happen if we omitted a relevant variable.

• You can do this in Stata using the command truncreg. The truncated regression is estimated using maximum likelihood.
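To make the likelihood concrete, here is a sketch of the log-likelihood that a truncated regression maximizes (stdlib Python; the toy data and the cutoff a = 0 are invented for illustration). Each observed y contributes its normal density term minus the log of the probability of being observed at all:

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def trunc_loglik(beta0, beta1, sigma, data, a):
    """Log-likelihood for y truncated from below at a:
    each observation contributes log phi((y - xb)/sigma) - log sigma,
    minus log Pr[y > a | x] = log(1 - Phi((a - xb)/sigma))."""
    ll = 0.0
    for x, y in data:
        xb = beta0 + beta1 * x
        ll += math.log(norm_pdf((y - xb) / sigma)) - math.log(sigma)
        ll -= math.log(1.0 - norm_cdf((a - xb) / sigma))
    return ll

# hypothetical (x, y) pairs, all with y above the truncation point a = 0
data = [(0.5, 1.2), (1.0, 1.9), (1.5, 2.4), (2.0, 3.1)]
print(trunc_loglik(0.0, 1.5, 1.0, data, a=0.0))
```

Maximum likelihood then searches over (β0, β1, σ) for the values that make this quantity largest; parameters that fit the data badly produce a visibly lower log-likelihood.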


19.3 Censored data and tobit regression

• Suppose that instead of the data being simply missing, we observe the value yi = a whenever the outcome falls below a. That is, we can imagine a latent variable, yi∗, but we observe a whenever yi∗ ≤ a, and yi∗ otherwise. This is known as a case of “left-censoring.”

• In this case: E[yi] = Φa + (1 − Φ)(µ + σλ), where Φ and λ are evaluated at (a − µ)/σ.

• With probability Φ, yi = a and with probability (1− Φ), yi = µ + σλ.

• In the “nice” case, where a = 0, the expression above simplifies to:

E[yi] = Φ(µ/σ)(µ + σλ)
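A simulation check of this “nice” case (stdlib Python; µ and σ are picked arbitrarily): censor draws from N(µ, σ²) at zero and compare their sample mean with Φ(µ/σ)(µ + σλ):

```python
import math
import random

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma = 0.5, 1.0
lam = norm_pdf(mu / sigma) / norm_cdf(mu / sigma)     # lambda at the a = 0 cut
analytic = norm_cdf(mu / sigma) * (mu + sigma * lam)  # Phi(mu/sigma)(mu + sigma*lambda)

# simulate left-censoring at zero: y = max(y*, 0)
random.seed(1)
n = 200000
censored = sum(max(random.gauss(mu, sigma), 0.0) for _ in range(n)) / n

print(analytic)  # expected value of the censored variable
print(censored)  # simulated mean of y = max(y*, 0)
```

The two numbers agree, and both exceed what one would get by pretending E[yi] = µ.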

• In a multivariate regression model, where normally E[yi|xi] = xiβ, the censoring with a = 0 implies that:

E[yi|xi] = Φ(xiβ/σ)(xiβ + σλi)

where λi = φ[(0 − xiβ)/σ]/{1 − Φ[(0 − xiβ)/σ]} = φ(xiβ/σ)/Φ(xiβ/σ)

• If the censoring cut point is not equal to zero, and if we also have an upper censoring cut point, then the conditional expectation of yi will be more complicated, but the point remains that in estimating the conditional mean we must take account of the fact that yi could be equal to xiβ or to the censored value(s). If we don’t take this into account, we are likely to get incorrect estimates of the β coefficients, just as we did in the case of the truncated regression.

• The censored model (with an upper or lower censoring point or both) is estimated in Stata using the tobit command.
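A sketch of the log-likelihood behind the tobit model for left-censoring at zero (stdlib Python; the toy data are invented for illustration): censored observations contribute the probability that the latent yi∗ fell at or below the cut, uncensored ones the usual normal density.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tobit_loglik(beta0, beta1, sigma, data, a=0.0):
    """Tobit log-likelihood with left-censoring at a."""
    ll = 0.0
    for x, y in data:
        xb = beta0 + beta1 * x
        if y <= a:
            # censored: all we know is that the latent y* is at or below a
            ll += math.log(norm_cdf((a - xb) / sigma))
        else:
            # uncensored: ordinary normal density contribution
            ll += math.log(norm_pdf((y - xb) / sigma)) - math.log(sigma)
    return ll

# toy (x, y) pairs, with y recorded as 0 when the latent value is negative
data = [(-1.0, 0.0), (-0.5, 0.0), (0.5, 0.9), (1.0, 1.4), (2.0, 2.3)]
print(tobit_loglik(0.2, 1.0, 1.0, data))
```

As with the truncated case, maximum likelihood picks the (β0, β1, σ) that maximize this sum; a badly misspecified intercept, for instance, is penalized through the censored terms as well as the uncensored ones.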

• For an example of tobit in practice, see “Accountability and Coercion: Is Justice Blind When It Runs for Office?” by Greg Huber and Sanford Gordon, a paper presented at the American Politics seminar last year that recently appeared in the American Journal of Political Science.


19.4 Sample selection: the Heckman model

• Imagine that you want to estimate the impact of participating in a community group on one’s level of satisfaction with democratic institutions. This is a key concern for scholars of social capital. Perhaps taking part in some community organization is likely to increase one’s trust in government and to increase satisfaction with government performance. Thus, taking part in a community organization is like a treatment effect. We want to see how this treatment could affect outcomes for the “average” person.

• The problem is that the people participating in community groups are unlikely to be the same as the “average person.” There may well be something about them that leads them to take part in community initiatives and, if this unobserved characteristic also makes them likely to trust government, we will bias our estimates of the treatment effect.

• The difficulty is that we only observe those people who joined the community groups.

• The situation is analogous to one of truncation. Here, the truncation is “incidental.” It does not occur at given, definite values of yi; instead, the truncation cuts out those people for whom the underlying attractiveness of community service did not reach some critical level.

• Let us call the underlying attractiveness of community service a random variable, zi, and trust in government, our dependent variable of interest, yi.

• The expression governing whether someone joins a community group or not is given by:

z∗i = wiγ + ui, and when z∗i > 0 they join.

• The equation of primary interest is:

yi = xiβ + εi

• We assume that the two error terms, ui and εi, are distributed bivariate normal, with standard deviations σu and σε, and that the correlation between them is equal to ρ.


• Then we can show that:

E[yi|yi observed] = E[yi|z∗i > 0]
= E[yi|ui > −wiγ]
= xiβ + E[εi|ui > −wiγ]
= xiβ + ρσελi(αu)
= xiβ + βλλi(αu)

where αu = −wiγ/σu and λi(αu) = φ(wiγ/σu)/Φ(wiγ/σu)

• This sounds like a complicated model, but in fact it is not. The sample selection model is typically computed via the Heckman two-stage procedure.

1. Estimate the probability that z∗i > 0 by the probit model to obtain estimates of γ. For each observation in the sample, compute λi(αu) = φ(wiγ/σu)/Φ(wiγ/σu).

2. Estimate β and βλ by least squares regression of yi on xi and λi.

• The model is estimated in Stata using the heckman command. The output that Stata gives you will indicate the coefficients in the model for yi = xiβ + εi, the coefficients in the model for z∗i = wiγ + ui, and will also compute for you ρ and σε.
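The selection bias itself can be simulated directly (stdlib Python; ρ, σε, and the index wiγ below are arbitrary illustrative values): among selected observations the error εi no longer has mean zero but mean ρσελi(αu), which is why the second-stage regression needs the λi term.

```python
import math
import random

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

random.seed(7)
rho, sigma_eps = 0.6, 2.0
w_gamma = 0.3  # hypothetical selection index w_i * gamma (with sigma_u = 1)

selected_eps = []
for _ in range(300000):
    u = random.gauss(0.0, 1.0)  # selection-equation error
    v = random.gauss(0.0, 1.0)
    # construct eps with sd sigma_eps and corr(u, eps) = rho
    eps = sigma_eps * (rho * u + math.sqrt(1.0 - rho * rho) * v)
    if u > -w_gamma:  # z*_i > 0: the observation is selected into the sample
        selected_eps.append(eps)

mills = norm_pdf(w_gamma) / norm_cdf(w_gamma)  # lambda_i(alpha_u)
analytic = rho * sigma_eps * mills             # E[eps_i | selected]
simulated = sum(selected_eps) / len(selected_eps)

print(analytic)   # predicted mean of eps among the selected
print(simulated)  # simulated mean of eps among the selected
```

With ρ > 0, the selected sample has systematically positive errors, so OLS on the selected observations alone would be biased; including λi as a regressor absorbs exactly this term.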

19.5 Duration models

• Suppose the data that you have is on the length of civil wars or on the number of days of a strike. You are interested in the factors that prolong civil conflict or that bring strikes to an end speedily.

• You should not estimate this model using the Poisson distribution, because you are not interested in the number of days of unrest per se. You are interested in the probability that a war will end, and this probability can depend on the time for which combatants have already been fighting.


• Thus, duration models, and the kinds of questions that go with them, should be investigated using techniques that permit you to include the effects of time as a chronological sequence, and not just as a numerical marker. These models will also allow you to test for the effect of other “covariates” on the probability of war ending.

• Censoring is also a common (but easily dealt with) problem with duration analysis.

• Let us begin with a very simple example in which we are examining the probability of a “spell” of war or strike lasting t periods. Our dependent variable is the random variable, T, which has a continuous probability distribution f(t), where t is a realization of T.

• The cumulative probability is:

F(t) = ∫_0^t f(s)ds = Pr(T ≤ t)

• Note that the normal distribution might not be a good choice of functional form for this distribution, because the normal usually takes on negative values and time does not.

The probability that a spell is of length at least t is given by the survival function:

S(t) = 1 − F(t) = Pr(T ≥ t)

• We are sometimes interested in a related issue. Given that a spell has lasted until time t, what is the probability that it ends in the next short interval of time, say ∆t?

l(t, ∆t) = Pr[t ≤ T ≤ t + ∆t | T ≥ t]

• A useful function for characterizing this aspect of the question is the hazard rate:

λ(t) = lim_{∆t→0} Pr[t ≤ T ≤ t + ∆t | T ≥ t]/∆t = lim_{∆t→0} [F(t + ∆t) − F(t)]/[∆t S(t)] = f(t)/S(t)


Roughly, the hazard rate is the rate at which spells are completed after duration t, given that they last at least until t.

• Now, we build in different functional forms for F(t) to get different models of duration. In the exponential model, the hazard rate, λ, is just a parameter to be estimated and the survival function is S(t) = e^(−λt). As such, the exponential model has a hazard rate that does not vary over time.

• The Weibull model allows the hazard rate to vary over the duration of the spell. In the Weibull model, λ(t) = λp(λt)^(p−1), where p is also a parameter to be estimated, and the survival function is S(t) = e^(−(λt)^p). This model can “accelerate” or “decelerate” the effect of time and is thus called an accelerated failure time model.
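A quick check of these formulas (stdlib Python; λ, p, and t are chosen arbitrarily): the Weibull hazard λp(λt)^(p−1) should equal f(t)/S(t), approximated here by a finite difference on the survival function, and with p = 1 it collapses to the constant exponential hazard.

```python
import math

def weib_surv(t, lam, p):
    """Weibull survival function S(t) = exp(-(lam*t)^p)."""
    return math.exp(-((lam * t) ** p))

def weib_hazard(t, lam, p):
    """Weibull hazard lam * p * (lam*t)^(p-1)."""
    return lam * p * (lam * t) ** (p - 1)

lam, p, t = 0.5, 1.5, 2.0
dt = 1e-6
# f(t)/S(t) approximated as the conditional exit rate over a short interval
numeric = (weib_surv(t, lam, p) - weib_surv(t + dt, lam, p)) / (dt * weib_surv(t, lam, p))

print(weib_hazard(t, lam, p))  # closed-form hazard
print(numeric)                 # finite-difference approximation of f(t)/S(t)
print(weib_hazard(0.5, lam, 1.0), weib_hazard(5.0, lam, 1.0))  # p = 1: constant hazard
```

With p > 1 the hazard rises over the spell (wars become more likely to end the longer they last); with p < 1 it falls; p = 1 recovers the memoryless exponential model.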

• How do we relate duration to explanatory factors, like how well the combatants are funded? As in the Poisson model, we derive a λi for each observation (war or strike), where λi = e^(−xiβ).

• Either of these models can be estimated in Stata via maximum likelihood using the streg command. This command permits you to adopt the Weibull or the exponential or one of several other distributions to characterize your survival function.

• If you do not specify otherwise, your output will present the effect of your covariates on your “relative hazard rate” (i.e., how a one-unit increase in the covariate moves the hazard rate up or down). If you use the command streg depvar expvars, nohr, your output will be in terms of the actual coefficients. In all cases, you can use the predict command after streg to get the predicted time until end of spell for each observation. In cases where the hazard is constant over time, you can also predict the hazard rate using predict, hazard.


• One of the problems that is sometimes encountered with duration data is individual-level heterogeneity. That is, even if two conflicts had the same level of the explanatory factors, x, they might differ in the length of conflict because of some non-observed, conflict-specific factor (note the similarity to unobserved, country-specific factors in panel data).

• To estimate a duration model in this context, we use a “proportional hazards” model, such as the Cox proportional hazard model. This model allows us to say how hazard rates change as we alter the value of x, but does not permit us to compute the absolute level of hazard (or baseline hazard) for each observation. This model is estimated in Stata using stcox.