
Econometrics

Michael Creel

Department of Economics and Economic History

Universitat Autònoma de Barcelona

February 2014

Contents

1 About this document
  1.1 Prerequisites
  1.2 Contents
  1.3 Licenses
  1.4 Obtaining the materials
  1.5 An easy way to run the examples

2 Introduction: Economic and econometric models

3 Ordinary Least Squares
  3.1 The Linear Model
  3.2 Estimation by least squares
  3.3 Geometric interpretation of least squares estimation
  3.4 Influential observations and outliers
  3.5 Goodness of fit
  3.6 The classical linear regression model
  3.7 Small sample statistical properties of the least squares estimator
  3.8 Example: The Nerlove model
  3.9 Exercises

4 Asymptotic properties of the least squares estimator
  4.1 Consistency
  4.2 Asymptotic normality
  4.3 Asymptotic efficiency
  4.4 Exercises

5 Restrictions and hypothesis tests
  5.1 Exact linear restrictions
  5.2 Testing
  5.3 The asymptotic equivalence of the LR, Wald and score tests
  5.4 Interpretation of test statistics
  5.5 Confidence intervals
  5.6 Bootstrapping
  5.7 Wald test for nonlinear restrictions: the delta method
  5.8 Example: the Nerlove data
  5.9 Exercises

6 Stochastic regressors
  6.1 Case 1
  6.2 Case 2
  6.3 Case 3
  6.4 When are the assumptions reasonable?
  6.5 Exercises

7 Data problems
  7.1 Collinearity
  7.2 Measurement error
  7.3 Missing observations
  7.4 Missing regressors
  7.5 Exercises

8 Functional form and nonnested tests
  8.1 Flexible functional forms
  8.2 Testing nonnested hypotheses

9 Generalized least squares
  9.1 Effects of nonspherical disturbances on the OLS estimator
  9.2 The GLS estimator
  9.3 Feasible GLS
  9.4 Heteroscedasticity
  9.5 Autocorrelation
  9.6 Exercises

10 Endogeneity and simultaneity
  10.1 Simultaneous equations
  10.2 Reduced form
  10.3 Estimation of the reduced form equations
  10.4 Bias and inconsistency of OLS estimation of a structural equation
  10.5 Note about the rest of this chapter
  10.6 Identification by exclusion restrictions
  10.7 2SLS
  10.8 Testing the overidentifying restrictions
  10.9 System methods of estimation
  10.10 Example: Klein's Model 1

11 Numeric optimization methods
  11.1 Search
  11.2 Derivative-based methods
  11.3 Simulated Annealing
  11.4 A practical example: Maximum likelihood estimation using count data: The MEPS data and the Poisson model
  11.5 Numeric optimization: pitfalls
  11.6 Exercises

12 Asymptotic properties of extremum estimators
  12.1 Extremum estimators
  12.2 Existence
  12.3 Consistency
  12.4 Example: Consistency of Least Squares
  12.5 Example: Inconsistency of Misspecified Least Squares
  12.6 Example: Linearization of a nonlinear model
  12.7 Asymptotic Normality
  12.8 Example: Classical linear model
  12.9 Exercises

13 Maximum likelihood estimation
  13.1 The likelihood function
  13.2 Consistency of MLE
  13.3 The score function
  13.4 Asymptotic normality of MLE
  13.5 The information matrix equality
  13.6 The Cramér-Rao lower bound
  13.7 Likelihood ratio-type tests
  13.8 Examples
  13.9 Exercises

14 Generalized method of moments
  14.1 Motivation
  14.2 Definition of GMM estimator
  14.3 Consistency
  14.4 Asymptotic normality
  14.5 Choosing the weighting matrix
  14.6 Estimation of the variance-covariance matrix
  14.7 Estimation using conditional moments
  14.8 A specification test
  14.9 Example: Generalized instrumental variables estimator
  14.10 Nonlinear simultaneous equations
  14.11 Maximum likelihood
  14.12 Example: OLS as a GMM estimator - the Nerlove model again
  14.13 Example: The MEPS data
  14.14 Example: The Hausman Test
  14.15 Application: Nonlinear rational expectations
  14.16 Empirical example: a portfolio model
  14.17 Exercises

15 Models for time series data
  15.1 ARMA models
  15.2 VAR models
  15.3 ARCH, GARCH and Stochastic volatility
  15.4 Diffusion models
  15.5 State space models
  15.6 Nonstationarity and cointegration
  15.7 Exercises

16 Bayesian methods
  16.1 Definitions
  16.2 Philosophy, etc.
  16.3 Example
  16.4 Theory
  16.5 Computational methods
  16.6 Examples
  16.7 Exercises

17 Introduction to panel data
  17.1 Generalities
  17.2 Static models and correlations between variables
  17.3 Estimation of the simple linear panel model
  17.4 Dynamic panel data
  17.5 Example
  17.6 Exercises

18 Quasi-ML
  18.1 Consistent Estimation of Variance Components
  18.2 Example: the MEPS Data
  18.3 Exercises

19 Nonlinear least squares (NLS)
  19.1 Introduction and definition
  19.2 Identification
  19.3 Consistency
  19.4 Asymptotic normality
  19.5 Example: The Poisson model for count data
  19.6 The Gauss-Newton algorithm
  19.7 Application: Limited dependent variables and sample selection

20 Nonparametric inference
  20.1 Possible pitfalls of parametric inference: estimation
  20.2 Possible pitfalls of parametric inference: hypothesis testing
  20.3 Estimation of regression functions
  20.4 Density function estimation
  20.5 Examples
  20.6 Exercises

21 Quantile regression
  21.1 Quantiles of the linear regression model
  21.2 Fully nonparametric conditional quantiles
  21.3 Quantile regression as a semi-parametric estimator

22 Simulation-based methods for estimation and inference
  22.1 Motivation
  22.2 Simulated maximum likelihood (SML)
  22.3 Method of simulated moments (MSM)
  22.4 Efficient method of moments (EMM)
  22.5 Indirect likelihood inference
  22.6 Examples
  22.7 Exercises

23 Parallel programming for econometrics
  23.1 Example problems

24 Introduction to Octave
  24.1 Getting started
  24.2 A short introduction
  24.3 If you're running a Linux installation...

25 Notation and Review
  25.1 Notation for differentiation of vectors and matrices
  25.2 Convergence modes
  25.3 Rates of convergence and asymptotic equality

26 Licenses
  26.1 The GPL
  26.2 Creative Commons

27 The attic
  27.1 Hurdle models

List of Figures

1.1 Octave
1.2 LYX
3.1 Typical data, Classical Model
3.2 Example OLS Fit
3.3 The fit in observation space
3.4 Detection of influential observations
3.5 Uncentered R²
3.6 Unbiasedness of OLS under classical assumptions
3.7 Biasedness of OLS when an assumption fails
3.8 Gauss-Markov Result: The OLS estimator
3.9 Gauss-Markov Result: The split sample estimator
5.1 Joint and Individual Confidence Regions
5.2 RTS as a function of firm size
7.1 s(β) when there is no collinearity
7.2 s(β) when there is collinearity
7.3 Collinearity: Monte Carlo results
7.4 OLS and Ridge regression
7.5 With and without measurement error
7.6 Sample selection bias
9.1 Rejection frequency of 10% t-test, H0 is true
9.2 Motivation for GLS correction when there is HET
9.3 Residuals, Nerlove model, sorted by firm size
9.4 Residuals from time trend for CO2 data
9.5 Autocorrelation induced by misspecification
9.6 Efficiency of OLS and FGLS, AR1 errors
9.7 Durbin-Watson critical values
9.8 Dynamic model with MA(1) errors
9.9 Residuals of simple Nerlove model
9.10 OLS residuals, Klein consumption equation
10.1 Exogeneity and Endogeneity (adapted from Cameron and Trivedi)
11.1 Search method
11.2 Increasing directions of search
11.3 Newton iteration
11.4 Using Sage to get analytic derivatives
11.5 Mountains with low fog
11.6 A foggy mountain
12.1 Effects of I and J
13.1 Dwarf mongooses
13.2 Life expectancy of mongooses, Weibull model
13.3 Life expectancy of mongooses, mixed Weibull model
14.1 Method of Moments
14.2 Asymptotic Normality of GMM estimator, χ² example
14.3 Inefficient and Efficient GMM estimators, χ² data
14.4 GIV estimation results, dynamic model with measurement error
14.5 OLS
14.6 IV
14.7 Incorrect rank and the Hausman test
15.1 NYSE weekly close price, 100 × log differences
15.2 Returns from jump-diffusion model
15.3 Spot volatility, jump-diffusion model
16.1 Bayesian estimation, exponential likelihood, lognormal prior
16.2 Chernozhukov and Hong, Theorem 2
16.3 Metropolis-Hastings MCMC, exponential likelihood, lognormal prior
16.4 Data from RBC model
16.5 BVAR residuals, with separation
20.1 True and simple approximating functions
20.2 True and approximating elasticities
20.3 True function and more flexible approximation
20.4 True elasticity and more flexible approximation
20.5 Negative binomial raw moments
20.6 Kernel fitted OBDV usage versus AGE
20.7 Dollar-Euro
20.8 Dollar-Yen
20.9 Kernel regression fitted conditional second moments, Yen/Dollar and Euro/Dollar
21.1 Inverse CDF for N(0,1)
21.2 Quantiles of classical linear regression model
21.3 Quantile regression results
23.1 Speedups from parallelization
24.1 Running an Octave program

List of Tables

17.1 Dynamic panel data model. Bias. Source for ML and II is Gouriéroux, Phillips and Yu, 2010, Table 2. SBIL, SMIL and II are exactly identified, using the ML auxiliary statistic. SBIL(OI) and SMIL(OI) are overidentified, using both the naive and ML auxiliary statistics.
17.2 Dynamic panel data model. RMSE. Source for ML and II is Gouriéroux, Phillips and Yu, 2010, Table 2. SBIL, SMIL and II are exactly identified, using the ML auxiliary statistic. SBIL(OI) and SMIL(OI) are overidentified, using both the naive and ML auxiliary statistics.
18.1 Marginal Variances, Sample and Estimated (Poisson)
18.2 Marginal Variances, Sample and Estimated (NB-II)
18.3 Information Criteria, OBDV
22.1 True parameter values and bound of priors
22.2 Monte Carlo results, bias corrected estimators
27.1 Actual and Poisson fitted frequencies
27.2 Actual and Hurdle Poisson fitted frequencies

Chapter 1

    About this document

1.1 Prerequisites

These notes have been prepared under the assumption that the reader understands basic statistics, linear algebra, and mathematical optimization. There are many sources for this material; one is the set of appendices to Introductory Econometrics: A Modern Approach by Jeffrey Wooldridge. It is the student's responsibility to get up to speed on this material; it will not be covered in class.

This document integrates lecture notes for a one year graduate level course with computer programs that illustrate and apply the methods that are studied. The immediate availability of executable (and modifiable) example programs when using the PDF version of the document is a distinguishing feature of these notes. If printed, the document is a somewhat terse approximation to a textbook. These notes are not intended to be a perfect substitute for a printed textbook. If you are a student of mine, please note that last sentence carefully. There are many good textbooks available. Students taking my courses should read the appropriate sections from at least one of the following books (or other textbooks with similar level and content):

    Cameron, A.C. and P.K. Trivedi, Microeconometrics - Methods and Applications

    Davidson, R. and J.G. MacKinnon, Econometric Theory and Methods

    Gallant, A.R., An Introduction to Econometric Theory

    Hamilton, J.D., Time Series Analysis

    Hayashi, F., Econometrics

A more introductory-level reference is Introductory Econometrics: A Modern Approach by Jeffrey Wooldridge.

1.2 Contents

With respect to contents, the emphasis is on estimation and inference within the world of stationary data. If you take a moment to read the licensing information in the next section, you'll see that you are free to copy and modify the document. If anyone would like to contribute material that expands the contents, it would be very welcome. Error corrections and other additions are also welcome.

The integrated examples (they are on-line here and the support files are here) are an important part of these notes. GNU Octave (www.octave.org) has been used for most of the example programs, which are scattered throughout the document. This choice is motivated by several factors. The first is the high quality of the Octave environment for doing applied econometrics. Octave is similar to the commercial package Matlab®, and will run scripts for that language without modification¹. The fundamental tools (manipulation of matrices, statistical functions, minimization, etc.) exist and are implemented in a way that makes extending them fairly easy. Second, an advantage of free software is that you don't have to pay for it. This can be an important consideration if you are at a university with a tight budget or if you need to run many copies, as can be the case if you do parallel computing (discussed in Chapter 23). Third, Octave runs on GNU/Linux, Windows and MacOS. Figure 1.1 shows a sample GNU/Linux work environment, with an Octave script being edited, and the results are visible in an embedded shell window. As of 2011, some examples are being added using Gretl, the Gnu Regression, Econometrics, and Time-Series Library. This is an easy to use program, available in a number of languages, and it comes with a lot of data ready to use. It runs on the major operating systems. As of 2012, I am increasingly trying to make examples run on Matlab, though the need for add-on toolboxes for tasks as simple as generating random numbers limits what can be done.

The main document was prepared using LYX (www.lyx.org). LYX is a free² "what you see is what you mean" word processor, basically working as a graphical frontend to LATEX. It (with help from other applications) can export your work in LATEX, HTML, PDF and several other forms. It will run on Linux, Windows, and MacOS systems. Figure 1.2 shows LYX editing this document.

¹Matlab® is a trademark of The Mathworks, Inc. Octave will run pure Matlab scripts. If a Matlab script calls an extension, such as a toolbox function, then it is necessary to make a similar extension available to Octave. The examples discussed in this document call a number of functions, such as a BFGS minimizer, a program for ML estimation, etc. All of this code is provided with the examples, as well as on the PelicanHPC live CD image.

²Free is used in the sense of freedom, but LYX is also free of charge (free as in free beer).

Figure 1.1: Octave

Figure 1.2: LYX

1.3 Licenses

All materials are copyrighted by Michael Creel with the date that appears above. They are provided under the terms of the GNU General Public License, ver. 2, which forms Section 26.1 of the notes, or, at your option, under the Creative Commons Attribution-Share Alike 2.5 license, which forms Section 26.2 of the notes. The main thing you need to know is that you are free to modify and distribute these materials in any way you like, as long as you share your contributions in the same way the materials are made available to you. In particular, you must make available the source files, in editable form, for your modified version of the materials.

1.4 Obtaining the materials

The materials are available on my web page. In addition to the final product, which you're probably looking at in some form now, you can obtain the editable LYX sources, which will allow you to create your own version, if you like, or send error corrections and contributions.

1.5 An easy way to run the examples

Octave is available from the Octave home page, www.octave.org. Also, some updated links to packages for Windows and MacOS are at http://www.dynare.org/download/octave. The example programs are available as links to files on my web page in the PDF version, and here. Support files needed to run these are available here. The files won't run properly from your browser, since there are dependencies between files - they are only illustrative when browsing. To see how to use these files (edit and run them), you should go to the home page of this document, since you will probably want to download the PDF version together with all the support files and examples. Then set the base URL of the PDF file to point to wherever the Octave files are installed. Then you need to install Octave and the support files. All of this may sound a bit complicated, because it is. An easier solution is available:

The Linux OS image file econometrics.iso is an ISO image file that may be copied to USB or burnt to CDROM. It contains a bootable-from-CD-or-USB GNU/Linux system. These notes, in source form and as a PDF, together with all of the examples and the software needed to run them, are available on econometrics.iso. I recommend starting off by using virtualization, to run the Linux system with all of the materials inside of a virtual computer, while still running your normal operating system. Various virtualization platforms are available. I recommend Virtualbox³, which runs on Windows, Linux, and Mac OS.

³Virtualbox is free software (GPL v2). That, and the fact that it works very well, is the reason it is recommended here. There are a number of similar products available. It is possible to run PelicanHPC as a virtual machine, and to communicate with the installed operating system using a private network. Learning how to do this is not too difficult, and it is very convenient.

Chapter 2

Introduction: Economic and econometric models

Here's some data: 100 observations on 3 economic variables. Let's do some exploratory analysis using Gretl:

histograms

correlations

x-y scatterplots

So, what can we say? Correlations? Yes. Causality? Who knows? This is economic data, generated by economic agents, following their own beliefs, technologies and preferences. It is not experimental data generated under controlled conditions. How can we determine causality if we don't have experimental data?
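Most examples in these notes are written for GNU Octave, so here is a minimal Octave sketch of the same kind of exploratory analysis (Gretl does this point-and-click). The data below are simulated as a stand-in for the 100 observations referred to above; none of the numbers come from the actual data set.

% Exploratory analysis sketch: histograms, correlations, x-y scatterplots.
n = 100;
m = 10 + 2*randn(n,1);            % stand-in for "income"
p = 5 + 0.5*m + randn(n,1);       % stand-in for "price"
q = 20 - p + 0.8*m + randn(n,1);  % stand-in for "quantity"
data = [q p m];
for j = 1:3                       % histograms, one per variable
  figure(j); hist(data(:,j), 20);
end
z = (data - mean(data)) ./ std(data);  % standardize the columns
R = (z'*z)/(n - 1)                % correlation matrix
figure(4); plot(p, q, 'o'); xlabel('p'); ylabel('q');   % x-y scatterplots
figure(5); plot(m, q, 'o'); xlabel('m'); ylabel('q');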


Without a model, we can't distinguish correlation from causality. It turns out that the variables we're looking at are QUANTITY (q), PRICE (p), and INCOME (m). Economic theory tells us that the quantity of a good that consumers will purchase (the demand function) is something like:

    q = f(p,m, z)

    q is the quantity demanded

    p is the price of the good

    m is income

    z is a vector of other variables that may affect demand

The supply of the good to the market is the aggregation of the firms' supply functions. The market supply function is something like

    q = g(p, z)

Suppose we have a sample consisting of a number of observations on q, p and m at different time periods t = 1, 2, ..., n. Supply and demand in each period is

q_t = f(p_t, m_t, z_t)
q_t = g(p_t, z_t)

(draw some graphs showing roles of m and z)

This is the basic economic model of supply and demand: q and p are determined in the market equilibrium, given by the intersection of the two curves. These two variables are determined jointly by the model, and are the endogenous variables. Income (m) is not determined by this model; its value is determined independently of q and p by some other process. m is an exogenous variable. So, m causes q, through the demand function. Because q and p are jointly determined, m also causes p. p and q do not cause m, according to this theoretical model. q and p have a joint causal relationship.

Economic theory can help us to determine the causality relationships between correlated variables.

If we had experimental data, we could control certain variables and observe the outcomes for other variables. If we see that variable x changes as the controlled value of variable y is changed, then we know that y causes x. With economic data, we are unable to control the values of the variables: for example, in supply and demand, if price changes, then quantity changes, but quantity also affects price. We can't control the market price, because the market price changes as quantity adjusts. This is the reason we need a theoretical model to help us distinguish correlation and causality.

    The model is essentially a theoretical construct up to now:

We don't know the forms of the functions f and g.

Some components of z_t may not be observable. For example, people don't eat the same lunch every day, and you can't tell what they will order just by looking at them. There are unobservable components to supply and demand, and we can model them as random variables. Suppose we can break z_t into two unobservable components ε_t1 and ε_t2.

An econometric model attempts to quantify the relationship more precisely. A step toward an estimable econometric model is to suppose that the model may be written as

q_t = α_1 + α_2 p_t + α_3 m_t + ε_t1
q_t = β_1 + β_2 p_t + ε_t2

    We have imposed a number of restrictions on the theoretical model:

    The functions f and g have been specified to be linear functions

The parameters (α_1, α_2, etc.) are constant over time.

There is a single unobservable component in each equation, and we assume it is additive.

If we assume nothing about the error terms ε_t1 and ε_t2, we can always write the last two equations, as the errors simply make up the difference between the true demand and supply functions and the assumed forms. But in order for the coefficients to exist in a sense that has economic meaning, and in order to be able to use sample data to make reliable inferences about their values, we need to make additional assumptions. Such assumptions might be something like:

E(ε_tj) = 0, j = 1, 2

E(p_t ε_tj) = 0, j = 1, 2

E(m_t ε_tj) = 0, j = 1, 2

These are assertions that the errors are uncorrelated with the variables, and such assertions may or may not be reasonable. Later we will see how such assumptions may be used and/or tested.
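To see why the second assertion is particularly delicate here, note that the equilibrium price depends on both error terms, so p_t and ε_t1 are correlated by construction. The following Octave sketch solves the two linear equations for the equilibrium in each period and computes the relevant sample moments; the parameter values are made up purely for illustration.

% Simultaneity sketch: E(m_t eps_t1) is (near) zero, E(p_t eps_t1) is not.
n = 10000;
a = [10; -1; 0.5];                 % demand: q = a1 + a2*p + a3*m + e1
b = [2; 1];                        % supply: q = b1 + b2*p + e2
m  = 5 + randn(n,1);
e1 = randn(n,1);
e2 = randn(n,1);
p = (a(1) - b(1) + a(3)*m + e1 - e2) / (b(2) - a(2));  % equilibrium price
q = b(1) + b(2)*p + e2;                                % equilibrium quantity
printf('mean(m.*e1) = %f   mean(p.*e1) = %f\n', mean(m.*e1), mean(p.*e1));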

All of the last six bulleted points have no theoretical basis, in that the theory of supply and demand doesn't imply these conditions. The validity of any results we obtain using this model will be contingent on these additional restrictions being at least approximately correct. For this reason, specification testing will be needed, to check that the model seems to be reasonable. Only when we are convinced that the model is at least approximately correct should we use it for economic analysis.

When testing a hypothesis using an econometric model, at least three factors can cause a statistical test to reject the null hypothesis:

1. the hypothesis is false

2. a type I error has occurred

3. the econometric model is not correctly specified, and thus the test does not have the assumed distribution

To be able to make scientific progress, we would like to ensure that the third reason is not contributing in a major way to rejections, so that rejection will be most likely due to either the first or second reasons. Hopefully the above example makes it clear that econometric models are necessarily more detailed than what we can obtain from economic theory, and that this additional detail introduces many possible sources of misspecification of econometric models. In the next few sections we will obtain results supposing that the econometric model is entirely correctly specified. Later we will examine the consequences of misspecification and see some methods for determining if a model is correctly specified. Later on, econometric methods that seek to minimize maintained assumptions are introduced.

Chapter 3

    Ordinary Least Squares

3.1 The Linear Model

Consider approximating a variable y using the variables x_1, x_2, ..., x_k. We can consider a model that is a linear approximation:

Linearity: the model is a linear function of the parameter vector β^0:

y = β_1^0 x_1 + β_2^0 x_2 + ... + β_k^0 x_k + ε

or, using vector notation:

y = x'β^0 + ε

The dependent variable y is a scalar random variable, x = (x_1 x_2 ... x_k)' is a k-vector of explanatory variables, and β^0 = (β_1^0 β_2^0 ... β_k^0)'. The superscript 0 in β^0 means this is the true value of the unknown parameter. It will be defined more precisely later, and usually suppressed when it's not necessary for clarity.

Suppose that we want to use data to try to determine the best linear approximation to y using the variables x. The data {(y_t, x_t)}, t = 1, 2, ..., n are obtained by some form of sampling¹. An individual observation is

y_t = x_t'β + ε_t

The n observations can be written in matrix form as

y = Xβ + ε,    (3.1)

where y = (y_1 y_2 ... y_n)' is n × 1 and X = (x_1 x_2 ... x_n)'.

Linear models are more general than they might first appear, since one can employ nonlinear transformations of the variables:

φ_0(z) = [φ_1(w) φ_2(w) ... φ_p(w)] β + ε

where the φ_i() are known functions. Defining y = φ_0(z), x_1 = φ_1(w), etc. leads to a model in the form of equation 3.4. For example, the Cobb-Douglas model

z = A w_2^{β_2} w_3^{β_3} exp(ε)

can be transformed logarithmically to obtain

ln z = ln A + β_2 ln w_2 + β_3 ln w_3 + ε.

If we define y = ln z, β_1 = ln A, etc., we can put the model in the form needed. The approximation is linear in the parameters, but not necessarily linear in the variables.

¹For example, cross-sectional data may be obtained by random sampling. Time series data accumulate historically.

3.2 Estimation by least squares

Figure 3.1, obtained by running TypicalData.m, shows some data that follow the linear model y_t = β_1 + β_2 x_t2 + ε_t. The green line is the true regression line β_1 + β_2 x_t2, and the red crosses are the data points (x_t2, y_t), where ε_t is a random error that has mean zero and is independent of x_t2. Exactly how the green line is defined will become clear later. In practice, we only have the data, and we don't know where the green line lies. We need to gain information about the straight line that best fits the data points.

The ordinary least squares (OLS) estimator is defined as the value that minimizes the sum of the squared errors:

β̂ = arg min s(β)

where

s(β) = Σ_{t=1}^{n} (y_t − x_t'β)²    (3.2)
     = (y − Xβ)'(y − Xβ)
     = y'y − 2y'Xβ + β'X'Xβ
     = ‖y − Xβ‖²

Figure 3.1: Typical data, Classical Model (data points and true regression line)

This last expression makes it clear how the OLS estimator is defined: it minimizes the Euclidean distance between y and Xβ. The fitted OLS coefficients are those that give the best linear approximation to y using x as basis functions, where "best" means minimum Euclidean distance. One could think of other estimators based upon other metrics. For example, the minimum absolute distance (MAD) estimator minimizes Σ_{t=1}^{n} |y_t − x_t'β|. Later, we will see that which estimator is best in terms of statistical properties, rather than in terms of the metrics that define them, depends upon the properties of ε, about which we have as yet made no assumptions.

To minimize the criterion s(β), find the derivative with respect to β:

D_β s(β) = −2X'y + 2X'Xβ

Then setting it to zero gives

D_β s(β̂) = −2X'y + 2X'Xβ̂ ≡ 0

so β̂ = (X'X)^{-1}X'y.

    To verify that this is a minimum, check the second order sufficient condition:

D²_β s(β̂) = 2X'X

Since ρ(X) = K, this matrix is positive definite, since it's a quadratic form in a p.d. matrix (identity matrix of order n), so β̂ is in fact a minimizer.

The fitted values are the vector ŷ = Xβ̂.

The residuals are the vector ε̂ = y − Xβ̂.

Note that

y = Xβ + ε = Xβ̂ + ε̂

    Also, the first order conditions can be written as

X'y − X'Xβ̂ = 0

X'(y − Xβ̂) = 0

X'ε̂ = 0

which is to say, the OLS residuals are orthogonal to X. Let's look at this more carefully.
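Before doing that, here is a quick numerical check of the estimator formula and of the orthogonality condition: a minimal Octave sketch on simulated data (the design values are made up, and this is not one of the example programs distributed with the notes).

% OLS by the formula, plus a check that the residuals are orthogonal to X.
n = 50; k = 3;
X = [ones(n,1) randn(n,k-1)];
beta = [1; 2; -1];                 % arbitrary "true" parameter values
y = X*beta + randn(n,1);
betahat = (X'*X)\(X'*y);           % betahat = (X'X)^(-1) X'y
ehat = y - X*betahat;              % residuals
disp(betahat');                    % should be close to (1, 2, -1)
disp((X'*ehat)');                  % essentially zero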

    3.3 Geometric interpretation of least squares estimation

In X, Y Space

Figure 3.2 shows a typical fit to data, along with the true regression line. Note that the true line and the estimated line are different. This figure was created by running the Octave program OlsFit.m. You can experiment with changing the parameter values to see how this affects the fit, and to see how the fitted line will sometimes be close to the true line, and sometimes rather far away.

Figure 3.2: Example OLS Fit (data points, fitted line, true line)

In Observation Space

If we want to plot in observation space, we'll need to use only two or three observations, or we'll encounter some limitations of the blackboard. If we try to use 3, we'll encounter the limits of my artistic ability, so let's use two. With only two observations, we can't have K > 1.

Figure 3.3: The fit in observation space (axes: Observation 1 and Observation 2; the fitted value is Xβ̂ = P_X y and the residual is ε̂ = M_X y)

We can decompose y into two components: the orthogonal projection onto the K-dimensional space spanned by X, Xβ̂, and the component that is the orthogonal projection onto the n − K dimensional subspace that is orthogonal to the span of X, ε̂.

Since β̂ is chosen to make ε̂ as short as possible, ε̂ will be orthogonal to the space spanned by X. Since X is in this space, X'ε̂ = 0. Note that the f.o.c. that define the least squares estimator imply that this is so.

Projection Matrices

Xβ̂ is the projection of y onto the span of X, or

Xβ̂ = X(X'X)^{-1}X'y

Therefore, the matrix that projects y onto the span of X is

P_X = X(X'X)^{-1}X'

since

Xβ̂ = P_X y.

ε̂ is the projection of y onto the n − K dimensional space that is orthogonal to the span of X. We have that

ε̂ = y − Xβ̂
  = y − X(X'X)^{-1}X'y
  = [I_n − X(X'X)^{-1}X'] y.

So the matrix that projects y onto the space orthogonal to the span of X is

M_X = I_n − X(X'X)^{-1}X' = I_n − P_X.

We have

ε̂ = M_X y.

    Therefore

y = P_X y + M_X y = Xβ̂ + ε̂.

These two projection matrices decompose the n-dimensional vector y into two orthogonal components - the portion that lies in the K-dimensional space defined by X, and the portion that lies in the orthogonal n − K dimensional space.

Note that both P_X and M_X are symmetric and idempotent (these properties are verified numerically in the sketch after this list).

A symmetric matrix A is one such that A = A'.

An idempotent matrix A is one such that A = AA.

The only nonsingular idempotent matrix is the identity matrix.
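The following Octave fragment is a minimal numerical check of these claims: it builds P_X and M_X for a small simulated data set and verifies symmetry, idempotency, and the orthogonal decomposition of y. All names and dimensions are arbitrary.

% Projection matrices and the decomposition y = P_X y + M_X y.
n = 10;
X = [ones(n,1) randn(n,2)];
y = randn(n,1);
P = X*((X'*X)\X');                 % P_X = X (X'X)^(-1) X'
M = eye(n) - P;                    % M_X = I_n - P_X
printf('symmetry error %g, idempotency error %g\n', norm(P - P'), norm(P - P*P));
printf('decomposition error %g, orthogonality %g\n', ...
       norm(y - (P*y + M*y)), (P*y)'*(M*y));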

3.4 Influential observations and outliers

The OLS estimator of the i-th element of the vector β^0 is simply

β̂_i = [(X'X)^{-1}X']_i y = c_i'y

This is how we define a linear estimator - it's a linear function of the dependent variable. Since it's a linear combination of the observations on the dependent variable, where the weights are determined by the observations on the regressors, some observations may have more influence than others.

To investigate this, let e_t be an n-vector of zeros with a 1 in the t-th position, i.e., it's the t-th column of the matrix I_n. Define

h_t = (P_X)_tt = e_t'P_X e_t

so h_t is the t-th element on the main diagonal of P_X. Note that

h_t = ‖P_X e_t‖²

so

h_t ≤ ‖e_t‖² = 1

So 0 < h_t < 1. Also,

Tr P_X = K ⇒ h̄ = K/n.

So the average of the h_t is K/n. The value h_t is referred to as the leverage of the observation. If the leverage is much higher than average, the observation has the potential to affect the OLS fit importantly. However, an observation may also be influential due to the value of y_t, rather than the weight it is multiplied by, which only depends on the x_t's.

To account for this, consider estimation of β without using the t-th observation (designate this estimator as β̂^(t)). One can show (see Davidson and MacKinnon, pp. 32-5 for proof) that

β̂^(t) = β̂ − (1/(1 − h_t)) (X'X)^{-1} x_t ε̂_t

so the change in the t-th observation's fitted value is

x_t'β̂ − x_t'β̂^(t) = (h_t/(1 − h_t)) ε̂_t

While an observation may be influential if it doesn't affect its own fitted value, it certainly is influential if it does. A fast means of identifying influential observations is to plot (h_t/(1 − h_t)) ε̂_t (which I will refer to as the own influence of the observation) as a function of t. Figure 3.4 gives an example plot of data, fit, leverage and influence. The Octave program is InfluentialObservation.m. (Note to self when lecturing: load the data ../OLS/influencedata into Gretl and reproduce this.) If you re-run the program you will see that the leverage of the last observation (an outlying value of x) is always high, and the influence is sometimes high.
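A minimal Octave sketch along the same lines is given below. It is not the InfluentialObservation.m program itself; the data are simulated here, with one deliberately outlying regressor value.

% Leverage and own influence for a simple regression with one outlying x.
n = 30;
x = [3*rand(n-1,1); 10];           % last observation has an outlying x value
X = [ones(n,1) x];
y = X*[1; 2] + randn(n,1);
betahat = (X'*X)\(X'*y);
ehat = y - X*betahat;
h = diag(X*((X'*X)\X'));           % leverage: diagonal of P_X, average K/n
influence = (h ./ (1 - h)) .* ehat;  % own influence on the fitted value
[(1:n)' h influence]               % inspect: the last row shows high leverage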

After influential observations are detected, one needs to determine why they are influential. Possible causes include:

    data entry error, which can easily be corrected once detected. Data entry errors are very common.

Figure 3.4: Detection of influential observations (data points, fitted values, leverage, influence)

special economic factors that affect some observations. These would need to be identified and incorporated in the model. This is the idea behind structural change: the parameters may not be constant across all observations.

    pure randomness may have caused us to sample a low-probability observation.

    There exist robust estimation methods that downweight outliers.

3.5 Goodness of fit

The fitted model is

y = Xβ̂ + ε̂

Take the inner product:

y'y = β̂'X'Xβ̂ + 2β̂'X'ε̂ + ε̂'ε̂

But the middle term of the RHS is zero since X'ε̂ = 0, so

y'y = β̂'X'Xβ̂ + ε̂'ε̂    (3.3)

The uncentered R²_u is defined as

R²_u = 1 − ε̂'ε̂ / y'y
     = β̂'X'Xβ̂ / y'y
     = ‖P_X y‖² / ‖y‖²
     = cos²(φ),

where φ is the angle between y and the span of X.

The uncentered R² changes if we add a constant to y, since this changes ε̂ (see Figure 3.5; the yellow vector is a constant, since it's on the 45 degree line in observation space). Another, more common definition measures the contribution of the variables, other than the constant term, to explaining the variation in y. Thus it measures the ability of the model to explain the variation of y about its unconditional sample mean.

Let ι = (1, 1, ..., 1)', an n-vector. So

M_ι = I_n − ι(ι'ι)^{-1}ι' = I_n − ιι'/n

M_ι y just returns the vector of deviations from the mean. In terms of deviations from the mean, equation 3.3 becomes

y'M_ι y = β̂'X'M_ι Xβ̂ + ε̂'M_ι ε̂

Figure 3.5: Uncentered R²

The centered R²_c is defined as

R²_c = 1 − ε̂'ε̂ / (y'M_ι y) = 1 − ESS/TSS

where ESS = ε̂'ε̂ and TSS = y'M_ι y = Σ_{t=1}^{n} (y_t − ȳ)².

Supposing that X contains a column of ones (i.e., there is a constant term),

X'ε̂ = 0 ⇒ Σ_t ε̂_t = 0

so M_ι ε̂ = ε̂. In this case,

y'M_ι y = β̂'X'M_ι Xβ̂ + ε̂'ε̂

So

R²_c = RSS / TSS

where RSS = β̂'X'M_ι Xβ̂.

Supposing that a column of ones is in the space spanned by X (P_X ι = ι), then one can show that 0 ≤ R²_c ≤ 1.
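As a concrete illustration, the following Octave sketch computes both versions of R² directly from the definitions above, using simulated data that include a constant term (all numbers are made up).

% Uncentered and centered R^2 from the definitions.
n = 100;
X = [ones(n,1) randn(n,2)];
y = X*[1; 0.5; -0.5] + randn(n,1);
betahat = (X'*X)\(X'*y);
ehat = y - X*betahat;
R2_uncentered = 1 - (ehat'*ehat)/(y'*y)
TSS = sum((y - mean(y)).^2);       % y'M_iota y
R2_centered = 1 - (ehat'*ehat)/TSS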

3.6 The classical linear regression model

Up to this point the model is empty of content beyond the definition of a best linear approximation to y and some geometrical properties. There is no economic content to the model, and the regression parameters have no economic interpretation. For example, what is the partial derivative of y with respect to x_j? The linear approximation is

y = β_1 x_1 + β_2 x_2 + ... + β_k x_k + ε

The partial derivative is

∂y/∂x_j = β_j + ∂ε/∂x_j

Up to now, there's no guarantee that ∂ε/∂x_j = 0. For the β to have an economic meaning, we need to make additional assumptions. The assumptions that are appropriate to make depend on the data under consideration. We'll start with the classical linear regression model, which incorporates some assumptions that are clearly not realistic for economic data. This is to be able to explain some concepts with a minimum of confusion and notational clutter. Later we'll adapt the results to what we can get with more realistic assumptions.

Linearity: the model is a linear function of the parameter vector β^0:

y = β_1^0 x_1 + β_2^0 x_2 + ... + β_k^0 x_k + ε    (3.4)

or, using vector notation:

y = x'β^0 + ε

Nonstochastic linearly independent regressors: X is a fixed matrix of constants, it has rank K equal to its number of columns, and

lim_{n→∞} (1/n) X'X = Q_X    (3.5)

where Q_X is a finite positive definite matrix. This is needed to be able to identify the individual effects of the explanatory variables.

    Independently and identically distributed errors:

ε ∼ IID(0, σ²I_n)    (3.6)

ε is jointly distributed IID. This implies the following two properties:

Homoscedastic errors:
V(ε_t) = σ₀², ∀t    (3.7)

Nonautocorrelated errors:
E(ε_t ε_s) = 0, ∀t ≠ s    (3.8)

Optionally, we will sometimes assume that the errors are normally distributed.

Normally distributed errors:
ε ∼ N(0, σ²I_n)    (3.9)

3.7 Small sample statistical properties of the least squares estimator

Up to now, we have only examined numeric properties of the OLS estimator, that always hold. Now we will examine statistical properties. The statistical properties depend upon the assumptions we make.

Unbiasedness

We have β̂ = (X'X)^{-1}X'y. By linearity,

β̂ = (X'X)^{-1}X'(Xβ + ε) = β + (X'X)^{-1}X'ε

By 3.5 and 3.6

E[(X'X)^{-1}X'ε] = (X'X)^{-1}X'E(ε) = 0

so the OLS estimator is unbiased under the assumptions of the classical model.

Figure 3.6 shows the results of a small Monte Carlo experiment where the OLS estimator was calculated for 10000 samples from the classical model with y = 1 + 2x + ε, where n = 20, σ² = 9, and x is fixed across samples. We can see that β_2 appears to be estimated without bias. The program that generates the plot is Unbiased.m, if you would like to experiment with this.

With time series data, the OLS estimator will often be biased. Figure 3.7 shows the results of a small Monte Carlo experiment where the OLS estimator was calculated for 1000 samples from the AR(1) model with y_t = 0 + 0.9 y_{t−1} + ε_t, where n = 20 and σ² = 1. In this case, assumption 3.5 does not hold: the regressors are stochastic. We can see that the bias in the estimation of β_2 is about −0.2. The program that generates the plot is Biased.m, if you would like to experiment with this.
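For readers who want a self-contained version, here is a minimal Octave sketch of the first experiment. It is not the Unbiased.m program itself, but it follows the design described above.

% Monte Carlo: OLS is unbiased under the classical assumptions.
reps = 10000; n = 20; sigma = 3;   % sigma^2 = 9
x = 20*rand(n,1);                  % regressor, held fixed across samples
X = [ones(n,1) x];
b2 = zeros(reps,1);
for i = 1:reps
  y = 1 + 2*x + sigma*randn(n,1);  % classical model with beta = (1, 2)
  betahat = (X'*X)\(X'*y);
  b2(i) = betahat(2);
end
printf('mean of the estimates of beta_2: %f (true value 2)\n', mean(b2));
hist(b2 - 2, 30);                  % centered histogram, as in Figure 3.6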

Figure 3.6: Unbiasedness of OLS under classical assumptions

Figure 3.7: Biasedness of OLS when an assumption fails

Normality

With the linearity assumption, we have β̂ = β + (X'X)^{-1}X'ε. This is a linear function of ε. Adding the assumption of normality (3.9, which implies strong exogeneity), then

β̂ ∼ N(β, (X'X)^{-1}σ₀²)

since a linear function of a normal random vector is also normally distributed. In Figure 3.6 you can see that the estimator appears to be normally distributed. It in fact is normally distributed, since the DGP (see the Octave program) has normal errors. Even when the data may be taken to be IID, the assumption of normality is often questionable or simply untenable. For example, if the dependent variable is the number of automobile trips per week, it is a count variable with a discrete distribution, and is thus not normally distributed. Many variables in economics can take on only nonnegative values, which, strictly speaking, rules out normality.²

The variance of the OLS estimator and the Gauss-Markov theorem

Now let's make all the classical assumptions except the assumption of normality. We have β̂ = β + (X'X)^{-1}X'ε and we know that E(β̂) = β. So

Var(β̂) = E{(β̂ − β)(β̂ − β)'}
        = E{(X'X)^{-1}X'εε'X(X'X)^{-1}}
        = (X'X)^{-1}σ₀²

²Normality may be a good model nonetheless, as long as the probability of a negative value occurring is negligible under the model. This depends upon the mean being large enough in relation to the variance.

The OLS estimator is a linear estimator, which means that it is a linear function of the dependent variable, y.

β̂ = [(X'X)^{-1}X'] y = Cy

where C is a function of the explanatory variables only, not the dependent variable. It is also unbiased under the present assumptions, as we proved above. One could consider other weights W that are a function of X that define some other linear estimator. We'll still insist upon unbiasedness. Consider β̃ = Wy, where W = W(X) is some k × n matrix function of X. Note that since W is a function of X, it is nonstochastic, too. If the estimator is unbiased, then we must have WX = I_K:

E(Wy) = E(WXβ^0 + Wε) = WXβ^0 = β^0

⇒ WX = I_K

The variance of β̃ is

V(β̃) = WW'σ₀².

Define

D = W − (X'X)^{-1}X'

so

W = D + (X'X)^{-1}X'

Since WX = I_K, DX = 0, so

V(β̃) = (D + (X'X)^{-1}X') (D + (X'X)^{-1}X')' σ₀²
      = (DD' + (X'X)^{-1}) σ₀²

So

V(β̃) ≥ V(β̂)

The inequality is a shorthand means of expressing, more formally, that V(β̃) − V(β̂) is a positive semi-definite matrix. This is a proof of the Gauss-Markov Theorem. The OLS estimator is the best linear unbiased estimator (BLUE).

It is worth emphasizing again that we have not used the normality assumption in any way to prove the Gauss-Markov theorem, so it is valid if the errors are not normally distributed, as long as the other assumptions hold.

To illustrate the Gauss-Markov result, consider the estimator that results from splitting the sample into p equally-sized parts, estimating β using each part of the data separately by OLS, then averaging the p resulting estimators. You should be able to show that this estimator is unbiased, but inefficient with respect to the OLS estimator. The program Efficiency.m illustrates this using a small Monte Carlo experiment, which compares the OLS estimator and a 3-way split sample estimator. The data generating process follows the classical model, with n = 21. The true parameter value is β = 2. In Figures 3.8 and 3.9 we can see that the OLS estimator is more efficient, since the tails of its histogram are more narrow.

Figure 3.8: Gauss-Markov Result: The OLS estimator

Figure 3.9: Gauss-Markov Result: The split sample estimator

We have that E(β̂) = β and Var(β̂) = (X'X)^{-1}σ₀², but we still need to estimate the variance of ε, σ₀², in order to have an idea of the precision of the estimates of β. A commonly used estimator of σ₀² is

    20 =1

    nK

    This estimator is unbiased:

Figure 3.9: Gauss-Markov Result: The split sample estimator

\widehat{\sigma_0^2} = \frac{1}{n - K}\hat{\varepsilon}'\hat{\varepsilon} = \frac{1}{n - K}\varepsilon'M\varepsilon

E(\widehat{\sigma_0^2}) = \frac{1}{n - K}E\left(\mathrm{Tr}\left(\varepsilon'M\varepsilon\right)\right)
                        = \frac{1}{n - K}E\left(\mathrm{Tr}\left(M\varepsilon\varepsilon'\right)\right)
                        = \frac{1}{n - K}\mathrm{Tr}\left(E\left(M\varepsilon\varepsilon'\right)\right)
                        = \frac{1}{n - K}\sigma_0^2\,\mathrm{Tr}(M)
                        = \frac{1}{n - K}\sigma_0^2\,(n - K)
                        = \sigma_0^2

where we use the fact that Tr(AB) = Tr(BA) when both products are conformable. Thus, this estimator is also unbiased under these assumptions.
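A minimal Octave sketch of a Monte Carlo along the lines of the Efficiency.m experiment described above (this is not that program; the intercept and the uniform regressor are assumptions made for illustration). It also checks numerically that \hat{\varepsilon}'\hat{\varepsilon}/(n - K) is approximately unbiased for \sigma_0^2:

    % OLS vs. a 3-way split-sample estimator; unbiasedness of sigma-hat^2
    n = 21; reps = 5000; sigma2 = 1;
    beta_true = 2;                          % true slope, as in the text
    x = [ones(n,1) rand(n,1)];              % hypothetical regressors, held fixed
    b_ols = zeros(reps,1); b_split = zeros(reps,1); s2 = zeros(reps,1);
    for r = 1:reps
        y = x*[1; beta_true] + sqrt(sigma2)*randn(n,1);
        b = x\y;                            % full-sample OLS
        b_ols(r) = b(2);
        e = y - x*b;
        s2(r) = (e'*e)/(n - 2);             % unbiased variance estimator
        bs = zeros(3,1);
        for j = 1:3                         % estimate on each third, then average
            idx = (7*(j-1)+1):(7*j);
            bj = x(idx,:)\y(idx);
            bs(j) = bj(2);
        end
        b_split(r) = mean(bs);
    end
    printf("OLS:   mean %f, var %f\n", mean(b_ols), var(b_ols));
    printf("split: mean %f, var %f\n", mean(b_split), var(b_split));
    printf("mean of sigma-hat^2: %f (true value %f)\n", mean(s2), sigma2);

Both estimators should have mean close to 2, but the split-sample estimator should show a visibly larger variance, which is the Gauss-Markov point.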

    3.8 Example: The Nerlove model

Theoretical background
For a firm that takes input prices w and the output level q as given, the cost minimization problem is to choose the quantities of inputs x to solve the problem

\min_x w'x

subject to the restriction

f(x) = q.

The solution is the vector of factor demands x(w, q). The cost function is obtained by substituting the factor demands into the criterion function:

C(w, q) = w'x(w, q).

Monotonicity Increasing factor prices cannot decrease cost, so

\frac{\partial C(w, q)}{\partial w} \ge 0

Remember that these derivatives give the conditional factor demands (Shephard's Lemma).

Homogeneity The cost function is homogeneous of degree 1 in input prices: C(tw, q) = tC(w, q), where t is a scalar constant. This is because the factor demands are homogeneous of degree zero in factor prices - they only depend upon relative prices.

Returns to scale The returns to scale parameter \gamma is defined as the inverse of the elasticity of cost with respect to output:

\gamma = \left(\frac{\partial C(w, q)}{\partial q}\,\frac{q}{C(w, q)}\right)^{-1}

Constant returns to scale is the case where increasing production q implies that cost increases in the proportion 1:1. If this is the case, then \gamma = 1.

Cobb-Douglas functional form
The Cobb-Douglas functional form is linear in the logarithms of the regressors and the dependent variable. For a cost function, if there are g factors, the Cobb-Douglas cost function has the form

C = A\, w_1^{\beta_1} \cdots w_g^{\beta_g}\, q^{\beta_q} e^{\varepsilon}

What is the elasticity of C with respect to w_j?

e_{C,w_j} \equiv \frac{\partial C}{\partial w_j}\,\frac{w_j}{C} = \beta_j A\, w_1^{\beta_1} \cdots w_j^{\beta_j - 1} \cdots w_g^{\beta_g}\, q^{\beta_q} e^{\varepsilon}\,\frac{w_j}{A\, w_1^{\beta_1} \cdots w_g^{\beta_g}\, q^{\beta_q} e^{\varepsilon}} = \beta_j

This is one of the reasons the Cobb-Douglas form is popular - the coefficients are easy to interpret, since they are the elasticities of the dependent variable with respect to the explanatory variable. Note that in this case,

e_{C,w_j} = \frac{\partial C}{\partial w_j}\,\frac{w_j}{C} = x_j(w, q)\,\frac{w_j}{C} \equiv s_j(w, q)

the cost share of the jth input. So with a Cobb-Douglas cost function, \beta_j = s_j(w, q). The cost shares are constants.

Note that after a logarithmic transformation we obtain

\ln C = \alpha + \beta_1 \ln w_1 + \dots + \beta_g \ln w_g + \beta_q \ln q + \varepsilon

where \alpha = \ln A. So we see that the transformed model is linear in the logs of the data.

One can verify that the property of HOD1 implies that

\sum_{i=1}^{g}\beta_i = 1

In other words, the cost shares add up to 1.

The hypothesis that the technology exhibits CRTS implies that

\gamma = \frac{1}{\beta_q} = 1

so \beta_q = 1. Likewise, monotonicity implies that the coefficients \beta_i \ge 0, i = 1, \dots, g.
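As a quick check of why HOD1 forces the shares to sum to one, consider a hypothetical two-factor case (g = 2):

C(tw, q) = A (tw_1)^{\beta_1}(tw_2)^{\beta_2} q^{\beta_q} e^{\varepsilon} = t^{\beta_1 + \beta_2}\, A\, w_1^{\beta_1} w_2^{\beta_2} q^{\beta_q} e^{\varepsilon} = t^{\beta_1 + \beta_2}\, C(w, q),

which equals t\,C(w, q) for every t > 0 if and only if \beta_1 + \beta_2 = 1.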

The Nerlove data and OLS
The file nerlove.data contains data on 145 electric utility companies' cost of production, output and input prices. The data are for the U.S., and were collected by M. Nerlove. The observations are by row, and the columns are COMPANY, COST (C), OUTPUT (Q), PRICE OF LABOR (PL), PRICE OF FUEL (PF) and PRICE OF CAPITAL (PK). Note that the data are sorted by output

level (the third column). We will estimate the Cobb-Douglas model

\ln C = \beta_1 + \beta_2 \ln Q + \beta_3 \ln P_L + \beta_4 \ln P_F + \beta_5 \ln P_K + \varepsilon    (3.10)

using OLS. To do this yourself, you need the data file mentioned above, as well as Nerlove.m (the estimation program), and the library of Octave functions mentioned in the introduction to Octave that forms section 24 of this document.3
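A minimal Octave sketch of the calculation (this is not Nerlove.m; it assumes nerlove.data loads as a plain numeric matrix with the columns in the order just listed):

    % OLS estimation of the Cobb-Douglas cost function on the Nerlove data
    data = load("nerlove.data");             % columns: COMPANY C Q PL PF PK
    y = log(data(:,2));
    X = [ones(rows(data),1) log(data(:,3:6))];
    b = X\y;                                  % OLS estimates
    e = y - X*b;
    n = rows(X); K = columns(X);
    sig2 = (e'*e)/(n - K);                    % estimate of sigma^2
    se = sqrt(diag(sig2*inv(X'*X)));          % standard errors
    [b se b./se]                              % estimates, std. errors, t-statistics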

    The results are

*********************************************************
OLS estimation results
Observations     145
R-squared        0.925955
Sigma-squared    0.153943

Results (Ordinary var-cov estimator)

             estimate   st.err.   t-stat.   p-value
constant     -3.527     1.774     -1.987    0.049
output        0.720     0.017     41.244    0.000
labor         0.436     0.291      1.499    0.136
fuel          0.427     0.100      4.249    0.000
capital      -0.220     0.339     -0.648    0.518

    *********************************************************

3 If you are running the bootable CD, you have all of this installed and ready to run.

  • Do the theoretical restrictions hold?

    Does the model fit well?

    What do you think about RTS?

While we will most often use Octave programs as examples in this document, since following the programming statements is a useful way of learning how theory is put into practice, you may be interested in a more user-friendly environment for doing econometrics. I heartily recommend Gretl, the Gnu Regression, Econometrics, and Time-Series Library. This is an easy to use program, available in English, French, and Spanish, and it comes with a lot of data ready to use. It even has an option to save output as LaTeX fragments, so that I can just include the results into this document, no muss, no fuss. Here is the Nerlove data in the form of a GRETL data set: nerlove.gdt. Here are the results of the Nerlove model from GRETL:

Model 2: OLS estimates using the 145 observations 1-145
Dependent variable: l_cost

    Variable    Coefficient   Std. Error   t-statistic   p-value
    const       -3.5265       1.77437      -1.9875       0.0488
    l_output     0.720394     0.0174664    41.2445       0.0000
    l_labor      0.436341     0.291048      1.4992       0.1361
    l_fuel       0.426517     0.100369      4.2495       0.0000
    l_capita    -0.219888     0.339429     -0.6478       0.5182

    Mean of dependent variable       1.72466
    S.D. of dependent variable       1.42172
    Sum of squared residuals         21.5520
    Standard error of residuals      0.392356
    Unadjusted R-squared             0.925955
    Adjusted R-squared               0.923840
    F(4, 140)                        437.686
    Akaike information criterion     145.084
    Schwarz Bayesian criterion       159.967

Fortunately, Gretl and my OLS program agree upon the results. Gretl is included in the bootable CD mentioned in the introduction. I recommend using GRETL to repeat the examples that are done using Octave.

The previous properties hold for finite sample sizes. Before considering the asymptotic properties of the OLS estimator it is useful to review the MLE estimator, since under the assumption of normal errors the two estimators coincide.

3.9 Exercises

1. Prove that the split sample estimator used to generate figure 3.9 is unbiased.

2. Calculate the OLS estimates of the Nerlove model using Octave and GRETL, and provide printouts of the results. Interpret the results.

3. Do an analysis of whether or not there are influential observations for OLS estimation of the Nerlove model. Discuss.

4. Using GRETL, examine the residuals after OLS estimation and tell me whether or not you believe that the assumption of independent identically distributed normal errors is warranted. No need to do formal tests, just look at the plots. Print out any that you think are relevant, and interpret them.

5. For a random vector X \sim N(\mu_x, \Sigma), what is the distribution of AX + b, where A and b are conformable matrices of constants?

6. Using Octave, write a little program that verifies that Tr(AB) = Tr(BA) for A and B 4\times4 matrices of random numbers. Note: there is an Octave function trace.

7. For the model with a constant and a single regressor, y_t = \beta_1 + \beta_2 x_t + \varepsilon_t, which satisfies the classical assumptions, prove that the variance of the OLS estimator declines to zero as the sample size increases.

  • Chapter 4

Asymptotic properties of the least squares estimator

The OLS estimator under the classical assumptions is BLUE1, for all sample sizes. Now let's see what happens when the sample size tends to infinity.

1 BLUE = best linear unbiased estimator, if I haven't defined it before


4.1 Consistency

\hat{\beta} = (X'X)^{-1}X'y = (X'X)^{-1}X'(X\beta_0 + \varepsilon) = \beta_0 + (X'X)^{-1}X'\varepsilon = \beta_0 + \left(\frac{X'X}{n}\right)^{-1}\frac{X'\varepsilon}{n}

Consider the last two terms. By assumption \lim_{n\to\infty}\left(\frac{X'X}{n}\right) = Q_X, so \lim_{n\to\infty}\left(\frac{X'X}{n}\right)^{-1} = Q_X^{-1}, since the inverse of a nonsingular matrix is a continuous function of the elements of the matrix. Considering \frac{X'\varepsilon}{n},

\frac{X'\varepsilon}{n} = \frac{1}{n}\sum_{t=1}^{n} x_t\varepsilon_t

Each x_t\varepsilon_t has expectation zero, so

E\left(\frac{X'\varepsilon}{n}\right) = 0

The variance of each term is

V(x_t\varepsilon_t) = x_t x_t'\sigma^2.

As long as these are finite, and given a technical condition2, the Kolmogorov SLLN applies, so

\frac{1}{n}\sum_{t=1}^{n} x_t\varepsilon_t \overset{a.s.}{\to} 0.

This implies that

\hat{\beta} \overset{a.s.}{\to} \beta_0.

This is the property of strong consistency: the estimator converges almost surely to the true value.

The consistency proof does not use the normality assumption.

Remember that almost sure convergence implies convergence in probability.
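A minimal Octave sketch that illustrates consistency (the design, with true parameters 1 and 2 and standard normal regressors and errors, is made up for illustration):

    % OLS estimates should collect around the true value as n grows
    beta_true = [1; 2];
    for n = [20 100 1000 10000]
        x = [ones(n,1) randn(n,1)];
        y = x*beta_true + randn(n,1);
        b = x\y;
        printf("n = %6d: betahat = (%f, %f)\n", n, b(1), b(2));
    end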

4.2 Asymptotic normality
We've seen that the OLS estimator is normally distributed under the assumption of normal errors. If the error distribution is unknown, we of course don't know the distribution of the estimator. However, we can get asymptotic results. Assuming the distribution of \varepsilon is unknown, but the other classical assumptions hold:

2 For application of LLNs and CLTs, of which there are very many to choose from, I'm going to avoid the technicalities. Basically, as long as terms that make up an average have finite variances and are not too strongly dependent, one will be able to find a LLN or CLT to apply. Which one it is doesn't matter, we only need the result.

\hat{\beta} = \beta_0 + (X'X)^{-1}X'\varepsilon \;\Rightarrow\; \hat{\beta} - \beta_0 = (X'X)^{-1}X'\varepsilon

\sqrt{n}\left(\hat{\beta} - \beta_0\right) = \left(\frac{X'X}{n}\right)^{-1}\frac{X'\varepsilon}{\sqrt{n}}

Now as before, \left(\frac{X'X}{n}\right)^{-1} \to Q_X^{-1}. Considering \frac{X'\varepsilon}{\sqrt{n}}, the limit of the variance is

\lim_{n\to\infty} V\left(\frac{X'\varepsilon}{\sqrt{n}}\right) = \lim_{n\to\infty} E\left(\frac{X'\varepsilon\varepsilon'X}{n}\right) = \sigma_0^2 Q_X

The mean is of course zero. To get asymptotic normality, we need to apply a CLT. We assume one (for instance, the Lindeberg-Feller CLT) holds, so

\frac{X'\varepsilon}{\sqrt{n}} \overset{d}{\to} N\left(0, \sigma_0^2 Q_X\right)

Therefore,

\sqrt{n}\left(\hat{\beta} - \beta_0\right) \overset{d}{\to} N\left(0, \sigma_0^2 Q_X^{-1}\right)    (4.1)

In summary, the OLS estimator is normally distributed in small and large samples if \varepsilon is normally distributed. If \varepsilon is not normally distributed, \hat{\beta} is asymptotically normally distributed when a CLT can be applied.

4.3 Asymptotic efficiency
The least squares objective function is

s(\beta) = \sum_{t=1}^{n}\left(y_t - x_t'\beta\right)^2

Supposing that \varepsilon is normally distributed, the model is

y = X\beta_0 + \varepsilon,

\varepsilon \sim N(0, \sigma_0^2 I_n), so

f(\varepsilon) = \prod_{t=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{\varepsilon_t^2}{2\sigma^2}\right)

The joint density for y can be constructed using a change of variables. We have \varepsilon = y - X\beta, so \frac{\partial\varepsilon}{\partial y'} = I_n and \left|\frac{\partial\varepsilon}{\partial y'}\right| = 1, so

f(y) = \prod_{t=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y_t - x_t'\beta)^2}{2\sigma^2}\right).

Taking logs,

\ln L(\beta, \sigma) = -n\ln\sqrt{2\pi} - n\ln\sigma - \sum_{t=1}^{n}\frac{(y_t - x_t'\beta)^2}{2\sigma^2}.

Maximizing this function with respect to \beta and \sigma gives what is known as the maximum likelihood (ML) estimator. It turns out that ML estimators are asymptotically efficient, a concept that will be explained in detail later. It's clear that the first order conditions for the MLE of \beta_0 are the same as the first order conditions that define the OLS estimator (up to multiplication by a constant), so the OLS estimator of \beta is also the ML estimator. The estimators are the same, under the present assumptions. Therefore, their properties are the same. In particular, under the classical assumptions with normality, the OLS estimator is asymptotically efficient. Note that one needs to make an assumption about the distribution of the errors to compute the ML estimator. If the errors had a distribution other than the normal, then the OLS estimator and the ML estimator would not coincide.
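To see the coincidence of the first order conditions directly, differentiate the log-likelihood with respect to \beta:

\frac{\partial \ln L(\beta, \sigma)}{\partial \beta} = \frac{1}{\sigma^2}X'(y - X\beta) = 0 \iff X'X\beta = X'y,

which is the normal equation that defines the OLS estimator; \sigma only scales the gradient, so the maximizing value of \beta does not depend on it.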

As we'll see later, it will be possible to use (iterated) linear estimation methods and still achieve asymptotic efficiency even if Var(\varepsilon) \ne \sigma^2 I_n, as long as \varepsilon is still normally distributed. This is not the case if \varepsilon is nonnormal. In general with nonnormal errors it will be necessary to use nonlinear estimation methods to achieve asymptotically efficient estimation.

4.4 Exercises

1. Write an Octave program that generates a histogram for R Monte Carlo replications of \sqrt{n}\left(\hat{\beta}_j - \beta_j\right), where \hat{\beta} is the OLS estimator and \beta_j is one of the k slope parameters. R should be a large number, at least 1000. The model used to generate data should follow the classical assumptions, except that the errors should not be normally distributed (try U(-a, a), t(p), \chi^2(p) - p, etc). Generate histograms for n \in \{20, 50, 100, 1000\}. Do you observe evidence of asymptotic normality? Comment.

  • Chapter 5

    Restrictions and hypothesis tests

5.1 Exact linear restrictions
In many cases, economic theory suggests restrictions on the parameters of a model. For example, a demand function is supposed to be homogeneous of degree zero in prices and income. If we have a Cobb-Douglas (log-linear) model,

\ln q = \beta_0 + \beta_1 \ln p_1 + \beta_2 \ln p_2 + \beta_3 \ln m + \varepsilon,

then we need that

k^0 \ln q = \beta_0 + \beta_1 \ln kp_1 + \beta_2 \ln kp_2 + \beta_3 \ln km + \varepsilon,


so

\beta_1 \ln p_1 + \beta_2 \ln p_2 + \beta_3 \ln m = \beta_1 \ln kp_1 + \beta_2 \ln kp_2 + \beta_3 \ln km = (\ln k)(\beta_1 + \beta_2 + \beta_3) + \beta_1 \ln p_1 + \beta_2 \ln p_2 + \beta_3 \ln m.

The only way to guarantee this for arbitrary k is to set

\beta_1 + \beta_2 + \beta_3 = 0,

which is a parameter restriction. In particular, this is a linear equality restriction, which is probably the most commonly encountered case.

Imposition
The general formulation of linear equality restrictions is the model

y = X\beta + \varepsilon
R\beta = r

where R is a Q \times K matrix, Q < K, and r is a Q \times 1 vector of constants.

We assume R is of rank Q, so that there are no redundant restrictions.

We also assume that there is a \beta that satisfies the restrictions: they aren't infeasible.

Let's consider how to estimate \beta subject to the restrictions R\beta = r. The most obvious approach is to set up the Lagrangean

\min_\beta s(\beta) = \frac{1}{n}(y - X\beta)'(y - X\beta) + 2\lambda'(R\beta - r).

The Lagrange multipliers are scaled by 2, which makes things less messy. The fonc are

D_\beta s(\hat{\beta}, \hat{\lambda}) = -2X'y + 2X'X\hat{\beta}_R + 2R'\hat{\lambda} \equiv 0
D_\lambda s(\hat{\beta}, \hat{\lambda}) = R\hat{\beta}_R - r \equiv 0,

which can be written as

\begin{bmatrix} X'X & R' \\ R & 0 \end{bmatrix}\begin{bmatrix} \hat{\beta}_R \\ \hat{\lambda} \end{bmatrix} = \begin{bmatrix} X'y \\ r \end{bmatrix}.

We get

\begin{bmatrix} \hat{\beta}_R \\ \hat{\lambda} \end{bmatrix} = \begin{bmatrix} X'X & R' \\ R & 0 \end{bmatrix}^{-1}\begin{bmatrix} X'y \\ r \end{bmatrix}.

Maybe you're curious about how to invert a partitioned matrix? I can help you with that:

Note that

\begin{bmatrix} (X'X)^{-1} & 0 \\ -R(X'X)^{-1} & I_Q \end{bmatrix}\begin{bmatrix} X'X & R' \\ R & 0 \end{bmatrix} \equiv AB = \begin{bmatrix} I_K & (X'X)^{-1}R' \\ 0 & -R(X'X)^{-1}R' \end{bmatrix} \equiv \begin{bmatrix} I_K & (X'X)^{-1}R' \\ 0 & -P \end{bmatrix} \equiv C,

and

\begin{bmatrix} I_K & (X'X)^{-1}R'P^{-1} \\ 0 & -P^{-1} \end{bmatrix}\begin{bmatrix} I_K & (X'X)^{-1}R' \\ 0 & -P \end{bmatrix} \equiv DC = I_{K+Q},

so

DAB = I_{K+Q}
DA = B^{-1}

B^{-1} = \begin{bmatrix} I_K & (X'X)^{-1}R'P^{-1} \\ 0 & -P^{-1} \end{bmatrix}\begin{bmatrix} (X'X)^{-1} & 0 \\ -R(X'X)^{-1} & I_Q \end{bmatrix} = \begin{bmatrix} (X'X)^{-1} - (X'X)^{-1}R'P^{-1}R(X'X)^{-1} & (X'X)^{-1}R'P^{-1} \\ P^{-1}R(X'X)^{-1} & -P^{-1} \end{bmatrix},

If you weren't curious about that, please start paying attention again. Also, note that we have made the definition P = R(X'X)^{-1}R'. Using it,

\begin{bmatrix} \hat{\beta}_R \\ \hat{\lambda} \end{bmatrix} = \begin{bmatrix} (X'X)^{-1} - (X'X)^{-1}R'P^{-1}R(X'X)^{-1} & (X'X)^{-1}R'P^{-1} \\ P^{-1}R(X'X)^{-1} & -P^{-1} \end{bmatrix}\begin{bmatrix} X'y \\ r \end{bmatrix}

= \begin{bmatrix} \hat{\beta} - (X'X)^{-1}R'P^{-1}\left(R\hat{\beta} - r\right) \\ P^{-1}\left(R\hat{\beta} - r\right) \end{bmatrix}

= \begin{bmatrix} \left(I_K - (X'X)^{-1}R'P^{-1}R\right)\hat{\beta} + (X'X)^{-1}R'P^{-1}r \\ P^{-1}R\hat{\beta} - P^{-1}r \end{bmatrix}

The fact that \hat{\beta}_R and \hat{\lambda} are linear functions of \hat{\beta} makes it easy to determine their distributions, since the distribution of \hat{\beta} is already known. Recall that for x a random vector, and for A and b a matrix and vector of constants, respectively, Var(Ax + b) = A\,Var(x)\,A'.
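A minimal Octave sketch of the Lagrangean solution (the model, the restriction \beta_2 + \beta_3 = 1, and the data generating process are hypothetical):

    % Restricted least squares via the partitioned-matrix solution
    n = 100;
    X = [ones(n,1) randn(n,2)];
    y = X*[1; 0.4; 0.6] + randn(n,1);        % DGP happens to satisfy the restriction
    R = [0 1 1]; r = 1;                      % restriction: beta2 + beta3 = 1
    XX = X'*X;
    b = XX\(X'*y);                           % unrestricted OLS
    P = R*inv(XX)*R';
    b_R = b - inv(XX)*R'*inv(P)*(R*b - r);   % restricted estimator
    lambda = inv(P)*(R*b - r);               % estimated Lagrange multipliers
    disp([b b_R])                            % unrestricted vs. restricted estimates
    disp(R*b_R)                              % should equal r exactly

The last line is a useful numerical check: the restricted estimator satisfies the restriction by construction.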

Though this is the obvious way to go about finding the restricted estimator, an easier way, if the number of restrictions is small, is to impose them by substitution. Write

y = X_1\beta_1 + X_2\beta_2 + \varepsilon

\begin{bmatrix} R_1 & R_2 \end{bmatrix}\begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} = r

where R_1 is Q \times Q nonsingular. Supposing the Q restrictions are linearly independent, one can always make R_1 nonsingular by reorganizing the columns of X. Then

\beta_1 = R_1^{-1}r - R_1^{-1}R_2\beta_2.

Substitute this into the model:

y = X_1 R_1^{-1}r - X_1 R_1^{-1}R_2\beta_2 + X_2\beta_2 + \varepsilon
y - X_1 R_1^{-1}r = \left[X_2 - X_1 R_1^{-1}R_2\right]\beta_2 + \varepsilon

or with the appropriate definitions,

y_R = X_R\beta_2 + \varepsilon.

This model satisfies the classical assumptions, supposing the restriction is true. One can estimate by OLS. The variance of \hat{\beta}_2 is as before

V(\hat{\beta}_2) = (X_R'X_R)^{-1}\sigma_0^2

and the estimator is

\hat{V}(\hat{\beta}_2) = (X_R'X_R)^{-1}\hat{\sigma}^2

where one estimates \sigma_0^2 in the normal way, using the restricted model, i.e.,

\widehat{\sigma_0^2} = \frac{\left(y_R - X_R\hat{\beta}_2\right)'\left(y_R - X_R\hat{\beta}_2\right)}{n - (K - Q)}

To recover \hat{\beta}_1, use the restriction. To find the variance of \hat{\beta}_1, use the fact that it is a linear function of \hat{\beta}_2, so

V(\hat{\beta}_1) = R_1^{-1}R_2\,V(\hat{\beta}_2)\,R_2'\left(R_1^{-1}\right)' = R_1^{-1}R_2\,(X_R'X_R)^{-1}R_2'\left(R_1^{-1}\right)'\sigma_0^2

Properties of the restricted estimator
We have that

\hat{\beta}_R = \hat{\beta} - (X'X)^{-1}R'P^{-1}\left(R\hat{\beta} - r\right)
             = \hat{\beta} + (X'X)^{-1}R'P^{-1}r - (X'X)^{-1}R'P^{-1}R(X'X)^{-1}X'y
             = \beta + (X'X)^{-1}X'\varepsilon + (X'X)^{-1}R'P^{-1}\left[r - R\beta\right] - (X'X)^{-1}R'P^{-1}R(X'X)^{-1}X'\varepsilon

\hat{\beta}_R - \beta = (X'X)^{-1}X'\varepsilon + (X'X)^{-1}R'P^{-1}\left[r - R\beta\right] - (X'X)^{-1}R'P^{-1}R(X'X)^{-1}X'\varepsilon

Mean squared error is

MSE(\hat{\beta}_R) = E(\hat{\beta}_R - \beta)(\hat{\beta}_R - \beta)'

Noting that the crosses between the second term and the other terms have expectation zero, and that the cross of the first and third has a cancellation with the square of the third, we obtain

MSE(\hat{\beta}_R) = (X'X)^{-1}\sigma^2 + (X'X)^{-1}R'P^{-1}\left[r - R\beta\right]\left[r - R\beta\right]'P^{-1}R(X'X)^{-1} - (X'X)^{-1}R'P^{-1}R(X'X)^{-1}\sigma^2

So, the first term is the OLS covariance. The second term is PSD, and the third term is NSD.

If the restriction is true, the second term is 0, so we are better off. True restrictions improve efficiency of estimation.

If the restriction is false, we may be better or worse off, in terms of MSE, depending on the magnitudes of r - R\beta and \sigma^2.

5.2 Testing
In many cases, one wishes to test economic theories. If theory suggests parameter restrictions, as in the above homogeneity example, one can test theory by testing parameter restrictions. A number of tests are available. The first two (t and F) have known small sample distributions, when the errors are normally distributed. The third and fourth (Wald and score) do not require normality of the errors, but their distributions are known only approximately, so that they are not exactly valid with finite samples.

t-test
Suppose one has the model

y = X\beta + \varepsilon

and one wishes to test the single restriction H_0: R\beta = r vs. H_A: R\beta \ne r. Under H_0, with normality of the errors,

R\hat{\beta} - r \sim N\left(0, R(X'X)^{-1}R'\sigma_0^2\right)

so

\frac{R\hat{\beta} - r}{\sqrt{R(X'X)^{-1}R'\sigma_0^2}} = \frac{R\hat{\beta} - r}{\sigma_0\sqrt{R(X'X)^{-1}R'}} \sim N(0, 1).

The problem is that \sigma_0^2 is unknown. One could use the consistent estimator \widehat{\sigma_0^2} in place of \sigma_0^2, but the test would only be valid asymptotically in this case.

Proposition 1. \frac{N(0,1)}{\sqrt{\chi^2(q)/q}} \sim t(q), as long as the N(0,1) and the \chi^2(q) are independent.

We need a few results on the \chi^2 distribution.

Proposition 2. If x \sim N(\mu, I_n) is a vector of n independent r.v.'s, then x'x \sim \chi^2(n, \lambda) where \lambda = \sum_i \mu_i^2 = \mu'\mu is the noncentrality parameter.

When a \chi^2 r.v. has the noncentrality parameter equal to zero, it is referred to as a central \chi^2 r.v., and its distribution is written as \chi^2(n), suppressing the noncentrality parameter.

Proposition 3. If the n dimensional random vector x \sim N(0, V), then x'V^{-1}x \sim \chi^2(n).

We'll prove this one as an indication of how the following unproven propositions could be proved.

Proof: Factor V^{-1} as P'P (this is the Cholesky factorization, where P is defined to be upper triangular). Then consider y = Px. We have

y \sim N(0, PVP')

but

VP'P = I_n
PVP'P = P

so PVP' = I_n and thus y \sim N(0, I_n). Thus y'y \sim \chi^2(n), but

y'y = x'P'Px = x'V^{-1}x

and we get the result we wanted.

A more general proposition which implies this result is

Proposition 4. If the n dimensional random vector x \sim N(0, V), then x'Bx \sim \chi^2(\rho(B)) if and only if BV is idempotent.

An immediate consequence is

Proposition 5. If the random vector (of dimension n) x \sim N(0, I), and B is idempotent with rank r, then x'Bx \sim \chi^2(r).

Consider the random variable

\frac{\hat{\varepsilon}'\hat{\varepsilon}}{\sigma_0^2} = \frac{\varepsilon'M_X\varepsilon}{\sigma_0^2} = \left(\frac{\varepsilon}{\sigma_0}\right)'M_X\left(\frac{\varepsilon}{\sigma_0}\right) \sim \chi^2(n - K)

Proposition 6. If the random vector (of dimension n) x \sim N(0, I), then Ax and x'Bx are independent if AB = 0.

Now consider (remember that we have only one restriction in this case)

\frac{\dfrac{R\hat{\beta} - r}{\sigma_0\sqrt{R(X'X)^{-1}R'}}}{\sqrt{\dfrac{\hat{\varepsilon}'\hat{\varepsilon}}{(n - K)\sigma_0^2}}} = \frac{R\hat{\beta} - r}{\hat{\sigma}_0\sqrt{R(X'X)^{-1}R'}}

This will have the t(n - K) distribution if \hat{\beta} and \hat{\varepsilon} are independent. But \hat{\beta} = \beta + (X'X)^{-1}X'\varepsilon and

(X'X)^{-1}X'M_X = 0,

so

\frac{R\hat{\beta} - r}{\hat{\sigma}_0\sqrt{R(X'X)^{-1}R'}} = \frac{R\hat{\beta} - r}{\hat{\sigma}_{R\hat{\beta}}} \sim t(n - K)

In particular, for the commonly encountered test of significance of an individual coefficient, for which H_0: \beta_i = 0 vs. H_A: \beta_i \ne 0, the test statistic is

\frac{\hat{\beta}_i}{\hat{\sigma}_{\hat{\beta}_i}} \sim t(n - K)

Note: the t test is strictly valid only if the errors are actually normally distributed. If one has nonnormal errors, one could use the above asymptotic result to justify taking critical values from the N(0, 1) distribution, since t(n - K) \overset{d}{\to} N(0, 1) as n \to \infty. In practice, a conservative procedure is to take critical values from the t distribution if nonnormality is suspected. This will reject H_0 less often since the t distribution is fatter-tailed than is the normal.

F test
The F test allows testing multiple restrictions jointly.

Proposition 7. If x \sim \chi^2(r) and y \sim \chi^2(s), then \frac{x/r}{y/s} \sim F(r, s), provided that x and y are independent.

Proposition 8. If the random vector (of dimension n) x \sim N(0, I), then x'Ax and x'Bx are independent if AB = 0.

Using these results, and previous results on the \chi^2 distribution, it is simple to show that the following statistic has the F distribution:

F = \frac{\left(R\hat{\beta} - r\right)'\left(R(X'X)^{-1}R'\right)^{-1}\left(R\hat{\beta} - r\right)}{q\hat{\sigma}^2} \sim F(q, n - K).

A numerically equivalent expression is

\frac{(ESS_R - ESS_U)/q}{ESS_U/(n - K)} \sim F(q, n - K).

Note: The F test is strictly valid only if the errors are truly normally distributed. The following tests will be appropriate when one cannot assume normally distributed errors.
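A minimal Octave sketch of the F test in its ESS form (the model and the restriction \beta_2 + \beta_3 = 1 are hypothetical; the restricted model is estimated by substitution):

    % F test of a single linear restriction via restricted/unrestricted ESS
    n = 100; K = 3; q = 1;
    X = [ones(n,1) randn(n,2)];
    y = X*[1; 0.4; 0.6] + randn(n,1);                 % DGP satisfies the restriction
    b = X\y;  e = y - X*b;  ESSU = e'*e;              % unrestricted
    XR = [ones(n,1)  X(:,2)-X(:,3)];                  % impose beta2 + beta3 = 1 by substitution
    yR = y - X(:,3);
    bR = XR\yR;  eR = yR - XR*bR;  ESSR = eR'*eR;     % restricted
    F = ((ESSR - ESSU)/q) / (ESSU/(n - K));
    printf("F statistic: %f (compare with the F(%d,%d) critical value)\n", F, q, n - K);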

Wald-type tests
The t and F tests require normality of the errors. The Wald test does not, but it is an asymptotic test - it is only approximately valid in finite samples.

The Wald principle is based on the idea that if a restriction is true, the unrestricted model should approximately satisfy the restriction. Given that the least squares estimator is asymptotically normally distributed:

\sqrt{n}\left(\hat{\beta} - \beta_0\right) \overset{d}{\to} N\left(0, \sigma_0^2 Q_X^{-1}\right)

then under H_0: R\beta_0 = r, we have

\sqrt{n}\left(R\hat{\beta} - r\right) \overset{d}{\to} N\left(0, \sigma_0^2 R Q_X^{-1} R'\right)

so by Proposition [3]

n\left(R\hat{\beta} - r\right)'\left(\sigma_0^2 R Q_X^{-1} R'\right)^{-1}\left(R\hat{\beta} - r\right) \overset{d}{\to} \chi^2(q)

Note that Q_X^{-1} and \sigma_0^2 are not observable. The test statistic we use substitutes the consistent estimators. Use (X'X/n)^{-1} as the consistent estimator of Q_X^{-1}. With this, there is a cancellation of n's, and the statistic to use is

\left(R\hat{\beta} - r\right)'\left(\widehat{\sigma_0^2}\, R(X'X)^{-1}R'\right)^{-1}\left(R\hat{\beta} - r\right) \overset{d}{\to} \chi^2(q)

The Wald test is a simple way to test restrictions without having to estimate the restricted model.

Note that this formula is similar to one of the formulae provided for the F test.

Score-type tests (Rao tests, Lagrange multiplier tests)
The score test is another asymptotically valid test that does not require normality of the errors.

In some cases, an unrestricted model may be nonlinear in the parameters, but the model is linear in the parameters under the null hypothesis. For example, the model

y = (X\beta)^{\gamma} + \varepsilon

is nonlinear in \beta and \gamma, but is linear in \beta under H_0: \gamma = 1. Estimation of nonlinear models is a bit more complicated, so one might prefer to have a test based upon the restricted, linear model. The score test is useful in this situation.

Score-type tests are based upon the general principle that the gradient vector of the unrestricted model, evaluated at the restricted estimate, should be asymptotically normally distributed with mean zero, if the restrictions are true. The original development was for ML estimation, but the principle is valid for a wide variety of estimation methods.

We have seen that

\hat{\lambda} = \left(R(X'X)^{-1}R'\right)^{-1}\left(R\hat{\beta} - r\right) = P^{-1}\left(R\hat{\beta} - r\right)

so

\sqrt{n}\,P\hat{\lambda} = \sqrt{n}\left(R\hat{\beta} - r\right)

Given that

\sqrt{n}\left(R\hat{\beta} - r\right) \overset{d}{\to} N\left(0, \sigma_0^2 R Q_X^{-1} R'\right)

under the null hypothesis, we obtain

\sqrt{n}\,P\hat{\lambda} \overset{d}{\to} N\left(0, \sigma_0^2 R Q_X^{-1} R'\right)

So

\left(\sqrt{n}\,P\hat{\lambda}\right)'\left(\sigma_0^2 R Q_X^{-1} R'\right)^{-1}\left(\sqrt{n}\,P\hat{\lambda}\right) \overset{d}{\to} \chi^2(q)

Noting that \lim nP = R Q_X^{-1} R', we obtain

\frac{\hat{\lambda}' R(X'X)^{-1}R' \hat{\lambda}}{\sigma_0^2} \overset{d}{\to} \chi^2(q)

since the powers of n cancel. To get a usable test statistic substitute a consistent estimator of \sigma_0^2.

This makes it clear why the test is sometimes referred to as a Lagrange multiplier test. It may seem that one needs the actual Lagrange multipliers to calculate this. If we impose the restrictions by substitution, these are not available. Note that the test can be written as

\frac{\left(R'\hat{\lambda}\right)'(X'X)^{-1}R'\hat{\lambda}}{\sigma_0^2} \overset{d}{\to} \chi^2(q)

However, we can use the fonc for the restricted estimator:

-X'y + X'X\hat{\beta}_R + R'\hat{\lambda} \equiv 0

to get that

R'\hat{\lambda} = X'\left(y - X\hat{\beta}_R\right) = X'\hat{\varepsilon}_R

Substituting this into the above, we get

\frac{\hat{\varepsilon}_R' X(X'X)^{-1}X' \hat{\varepsilon}_R}{\sigma_0^2} \overset{d}{\to} \chi^2(q)

but this is simply

\frac{\hat{\varepsilon}_R' P_X \hat{\varepsilon}_R}{\sigma_0^2} \overset{d}{\to} \chi^2(q).

To see why the test is also known as a score test, note that the fonc for restricted least squares

-X'y + X'X\hat{\beta}_R + R'\hat{\lambda} \equiv 0

give us

R'\hat{\lambda} = X'y - X'X\hat{\beta}_R

and the rhs is simply the gradient (score) of the unrestricted model, evaluated at the restricted estimator. The scores evaluated at the unrestricted estimate are identically zero. The logic behind the score test is that the scores evaluated at the restricted estimate should be approximately zero, if the restriction is true. The test is also known as a Rao test, since P. Rao first proposed it in 1948.

5.3 The asymptotic equivalence of the LR, Wald and score tests

Note: the discussion of the LR test has been moved forward in these notes. I no longer teach the material in this section, but I'm leaving it here for reference.

We have seen that the three tests all converge to \chi^2 random variables. In fact, they all converge to the same \chi^2 rv, under the null hypothesis. We'll show that the Wald and LR tests are asymptotically equivalent. We have seen that the Wald test is asymptotically equivalent to

W \overset{a}{=} n\left(R\hat{\beta} - r\right)'\left(\sigma_0^2 R Q_X^{-1} R'\right)^{-1}\left(R\hat{\beta} - r\right) \overset{d}{\to} \chi^2(q)    (5.1)

Using

\hat{\beta} - \beta_0 = (X'X)^{-1}X'\varepsilon

and

R\hat{\beta} - r = R\left(\hat{\beta} - \beta_0\right)

we get

\sqrt{n}\,R\left(\hat{\beta} - \beta_0\right) = \sqrt{n}\,R(X'X)^{-1}X'\varepsilon = R\left(\frac{X'X}{n}\right)^{-1} n^{-1/2}X'\varepsilon

Substitute this into [5.1] to get

W \overset{a}{=} n^{-1}\varepsilon'X Q_X^{-1}R'\left(\sigma_0^2 R Q_X^{-1} R'\right)^{-1}R Q_X^{-1}X'\varepsilon
  \overset{a}{=} \varepsilon'X(X'X)^{-1}R'\left(\sigma_0^2 R(X'X)^{-1}R'\right)^{-1}R(X'X)^{-1}X'\varepsilon
  \overset{a}{=} \frac{\varepsilon'A(A'A)^{-1}A'\varepsilon}{\sigma_0^2}
  \overset{a}{=} \frac{\varepsilon'P_R\varepsilon}{\sigma_0^2}

where P_R is the projection matrix formed by the matrix X(X'X)^{-1}R'.

Note that this matrix is idempotent and has q columns, so the projection matrix has rank q.

Now consider the likelihood ratio statistic

LR \overset{a}{=} n^{1/2}g(\theta_0)'I(\theta_0)^{-1}R'\left(R I(\theta_0)^{-1}R'\right)^{-1}R I(\theta_0)^{-1}n^{1/2}g(\theta_0)    (5.2)

Under normality, we have seen that the likelihood function is

\ln L(\beta, \sigma) = -n\ln\sqrt{2\pi} - n\ln\sigma - \frac{1}{2}\frac{(y - X\beta)'(y - X\beta)}{\sigma^2}.

Using this,

g(\beta_0) \equiv D_\beta \frac{1}{n}\ln L(\beta, \sigma) = \frac{X'(y - X\beta_0)}{n\sigma^2} = \frac{X'\varepsilon}{n\sigma^2}

Also, by the information matrix equality:

I(\beta_0) = -H_\infty(\beta_0) = \lim -D_{\beta'}\,g(\beta_0) = \lim -D_{\beta'}\,\frac{X'(y - X\beta_0)}{n\sigma^2} = \lim \frac{X'X}{n\sigma^2} = \frac{Q_X}{\sigma^2}

so

I(\beta_0)^{-1} = \sigma^2 Q_X^{-1}

Substituting these last expressions into [5.2], we get

LR \overset{a}{=} \varepsilon'X(X'X)^{-1}R'\left(\sigma_0^2 R(X'X)^{-1}R'\right)^{-1}R(X'X)^{-1}X'\varepsilon
   \overset{a}{=} \frac{\varepsilon'P_R\varepsilon}{\sigma_0^2}
   \overset{a}{=} W

This completes the proof that the Wald and LR tests are asymptotically equivalent. Similarly, one can show that, under the null hypothesis,

qF \overset{a}{=} W \overset{a}{=} LM \overset{a}{=} LR

The proof for the statistics except for LR does not depend upon normality of the errors, as can be verified by examining the expressions for the statistics.

The LR statistic is based upon distributional assumptions, since one can't write the likelihood function without them.

However, due to the close relationship between the statistics qF and LR, supposing normality, the qF statistic can be thought of as a pseudo-LR statistic: like a LR statistic, it uses the values of the objective functions of the restricted and unrestricted models, but it doesn't require distributional assumptions.

The presentation of the score and Wald tests has been done in the context of the linear model. This is readily generalizable to nonlinear models and/or other estimation methods.

Though the four statistics are asymptotically equivalent, they are numerically different in small samples. The numeric values of the tests also depend upon how \sigma^2 is estimated, and we've already seen that there are several ways to do this. For example, all of the following are consistent for \sigma^2 under H_0:

\frac{\hat{\varepsilon}'\hat{\varepsilon}}{n - k}, \quad \frac{\hat{\varepsilon}'\hat{\varepsilon}}{n}, \quad \frac{\hat{\varepsilon}_R'\hat{\varepsilon}_R}{n - k + q}, \quad \frac{\hat{\varepsilon}_R'\hat{\varepsilon}_R}{n}

and in general the denominator can be replaced with any quantity a such that \lim a/n = 1.

It can be shown, for linear regression models subject to linear restrictions, and if \hat{\varepsilon}'\hat{\varepsilon}/n is used to calculate the Wald test and \hat{\varepsilon}_R'\hat{\varepsilon}_R/n is used for the score test, that

W > LR > LM.

For this reason, the Wald test will always reject if the LR test rejects, and in turn the LR test rejects if the LM test rejects. This is a bit problematic: there is the possibility that by careful choice of the statistic used, one can manipulate reported results to favor or disfavor a hypothesis. A conservative/honest approach would be to report all three test statistics when they are available. In the case of linear models with normal errors the F test is to be preferred, since asymptotic approximations are not an issue.

The small sample behavior of the tests can be quite different. The true size (probability of rejection of the null when the null is true) of the Wald test is often dramatically higher than the nominal size associated with the asymptotic distribution. Likewise, the true size of the score test is often smaller than the nominal size.

5.4 Interpretation of test statistics
Now that we have a menu of test statistics, we need to know how to use them.

5.5 Confidence intervals
Confidence intervals for single coefficients are generated in the normal manner. Given the t statistic

t(\beta) = \frac{\hat{\beta} - \beta}{\hat{\sigma}_{\hat{\beta}}}

a 100(1 - \alpha)\% confidence interval for \beta_0 is defined by the bounds of the set of \beta such that t(\beta) does not reject H_0: \beta_0 = \beta, using a \alpha significance level:

C(\alpha) = \left\{\beta : -c_{\alpha/2} < \frac{\hat{\beta} - \beta}{\hat{\sigma}_{\hat{\beta}}} < c_{\alpha/2}\right\}

The set of such \beta is the interval \hat{\beta} \pm \hat{\sigma}_{\hat{\beta}}\,c_{\alpha/2}.
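A minimal Octave sketch of a 95% interval for a single coefficient, using the Nerlove constant from the output above as an example (the critical value 1.977 is the approximate 97.5 percentile of t(140), taken from tables; a tinv-type function could be used instead if available):

    % 95% confidence interval for a single coefficient
    bhat = -3.527;  se = 1.774;   % estimate and standard error from the OLS output above
    c = 1.977;                    % approximate 97.5 percentile of t(140)
    printf("95%% CI: [%f, %f]\n", bhat - c*se, bhat + c*se);

The resulting interval just barely excludes zero, which agrees with the reported p-value of 0.049 for that coefficient.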

A confidence ellipse for two coefficients jointly would be, analogously, the set of \{\beta_1, \beta_2\} such that the F (or some other test statistic) doesn't reject at the specified critical value. This generates an ellipse, if the estimators are correlated.

  • Figure 5.1: Joint and Individual Confidence Regions

The region is an ellipse, since the CI for an individual coefficient defines an (infinitely long) rectangle with total prob. mass 1 - \alpha, since the other coefficient is marginalized (e.g., can take on any value). Since the ellipse is bounded in both dimensions but also contains mass 1 - \alpha, it must extend beyond the bounds of the individual CI.

From the picture we can see that:

Rejection of hypotheses individually does not imply that the joint test will reject.

Joint rejection does not imply individual tests will reject.

5.6 Bootstrapping
When we rely on asymptotic theory to use the normal distribution-based tests and confidence intervals, we're often at serious risk of making important errors. If the sample size is small and errors are highly nonnormal, the small sample distribution of \sqrt{n}\left(\hat{\beta} - \beta_0\right) may be very different than its large sample distribution. Also, the distributions of test statistics may not resemble their limiting distributions at all. A means of trying to gain information on the small sample distribution of test statistics and estimators is the bootstrap. We'll consider a simple example, just to get the main idea.

Suppose that

y = X\beta_0 + \varepsilon
\varepsilon \sim IID(0, \sigma_0^2)
X is nonstochastic

Given that the distribution of \varepsilon is unknown, the distribution of \hat{\beta} will be unknown in small samples. However, since we have random sampling, we could generate artificial data. The steps are:

1. Draw n observations from \hat{\varepsilon} with replacement. Call this vector \tilde{\varepsilon}^j (it's an n \times 1 vector).

2. Then generate the data by \tilde{y}^j = X\hat{\beta} + \tilde{\varepsilon}^j

3. Now take this and estimate \tilde{\beta}^j = (X'X)^{-1}X'\tilde{y}^j.

4. Save \tilde{\beta}^j

5. Repeat steps 1-4, until we have a large number, J, of \tilde{\beta}^j.

With this, we can use the replications to calculate the empirical distribution of \tilde{\beta}^j. One way to form a 100(1 - \alpha)\% confidence interval for \beta_0 would be to order the \tilde{\beta}^j from smallest to largest, and drop the first and