econometrics-creel (2005)

492
Econometrics c Michael Creel Version 0.70, September, 2005 DEPT. OF ECONOMICS AND ECONOMIC HISTORY,UNIVERSITAT AUTÒNOMA DE BARCELONA, MICHAEL. CREEL@UAB. ES , HTTP://PARETO.UAB.ES/MCREEL

Upload: atul-tripathi

Post on 30-Mar-2015

144 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Econometrics-Creel (2005)

Econometrics

c�

Michael Creel

Version 0.70, September, 2005

DEPT. OF ECONOMICS AND ECONOMIC HISTORY, UNIVERSITAT AUTÒNOMA DE

BARCELONA, [email protected], HTTP://PARETO.UAB.ES/MCREEL

Page 2: Econometrics-Creel (2005)
Page 3: Econometrics-Creel (2005)

Contents

List of Figures 10

List of Tables 12

Chapter 1. About this document 13

1.1. License 14

1.2. Obtaining the materials 14

1.3. An easy way to use LYX and Octave today 15

1.4. Known Bugs 17

Chapter 2. Introduction: Economic and econometric models 18

Chapter 3. Ordinary Least Squares 21

3.1. The Linear Model 21

3.2. Estimation by least squares 22

3.3. Geometric interpretation of least squares estimation 25

3.4. Influential observations and outliers 28

3.5. Goodness of fit 31

3.6. The classical linear regression model 34

3.7. Small sample statistical properties of the least squares estimator 36

3.8. Example: The Nerlove model 43

Exercises 49

Chapter 4. Maximum likelihood estimation 503

Page 4: Econometrics-Creel (2005)

CONTENTS 4

4.1. The likelihood function 50

4.2. Consistency of MLE 54

4.3. The score function 56

4.4. Asymptotic normality of MLE 58

4.6. The information matrix equality 63

4.7. The Cramér-Rao lower bound 65

Exercises 68

Chapter 5. Asymptotic properties of the least squares estimator 70

5.1. Consistency 70

5.2. Asymptotic normality 71

5.3. Asymptotic efficiency 72

Chapter 6. Restrictions and hypothesis tests 75

6.1. Exact linear restrictions 75

6.2. Testing 81

6.3. The asymptotic equivalence of the LR, Wald and score tests 90

6.4. Interpretation of test statistics 94

6.5. Confidence intervals 94

6.6. Bootstrapping 95

6.7. Testing nonlinear restrictions, and the Delta Method 98

6.8. Example: the Nerlove data 102

Chapter 7. Generalized least squares 110

7.1. Effects of nonspherical disturbances on the OLS estimator 111

7.2. The GLS estimator 112

7.3. Feasible GLS 115

7.4. Heteroscedasticity 117

Page 5: Econometrics-Creel (2005)

CONTENTS 5

7.5. Autocorrelation 130

Exercises 151

Exercises 153

Chapter 8. Stochastic regressors 154

8.1. Case 1 155

8.2. Case 2 156

8.3. Case 3 158

8.4. When are the assumptions reasonable? 158

Exercises 161

Chapter 9. Data problems 162

9.1. Collinearity 162

9.2. Measurement error 171

9.3. Missing observations 175

Exercises 181

Exercises 181

Exercises 181

Chapter 10. Functional form and nonnested tests 182

10.1. Flexible functional forms 183

10.2. Testing nonnested hypotheses 195

Chapter 11. Exogeneity and simultaneity 199

11.1. Simultaneous equations 199

11.2. Exogeneity 202

11.3. Reduced form 205

11.4. IV estimation 208

Page 6: Econometrics-Creel (2005)

CONTENTS 6

11.5. Identification by exclusion restrictions 214

11.6. 2SLS 227

11.7. Testing the overidentifying restrictions 231

11.8. System methods of estimation 236

11.9. Example: 2SLS and Klein’s Model 1 245

Chapter 12. Introduction to the second half 248

Chapter 13. Numeric optimization methods 257

13.1. Search 258

13.2. Derivative-based methods 258

13.3. Simulated Annealing 267

13.4. Examples 268

13.5. Duration data and the Weibull model 272

13.6. Numeric optimization: pitfalls 276

Exercises 282

Chapter 14. Asymptotic properties of extremum estimators 283

14.1. Extremum estimators 283

14.2. Consistency 284

14.3. Example: Consistency of Least Squares 289

14.4. Asymptotic Normality 291

14.5. Examples 294

14.6. Example: Linearization of a nonlinear model 298

Exercises 303

Chapter 15. Generalized method of moments (GMM) 304

15.1. Definition 304

Page 7: Econometrics-Creel (2005)

CONTENTS 7

15.2. Consistency 307

15.3. Asymptotic normality 308

15.4. Choosing the weighting matrix 310

15.5. Estimation of the variance-covariance matrix 313

15.6. Estimation using conditional moments 316

15.7. Estimation using dynamic moment conditions 322

15.8. A specification test 322

15.9. Other estimators interpreted as GMM estimators 325

15.10. Example: The Hausman Test 334

15.11. Application: Nonlinear rational expectations 341

15.12. Empirical example: a portfolio model 345

Exercises 347

Chapter 16. Quasi-ML 348

Chapter 17. Nonlinear least squares (NLS) 354

17.1. Introduction and definition 354

17.2. Identification 356

17.3. Consistency 358

17.4. Asymptotic normality 358

17.5. Example: The Poisson model for count data 360

17.6. The Gauss-Newton algorithm 361

17.7. Application: Limited dependent variables and sample selection 364

Chapter 18. Nonparametric inference 368

18.1. Possible pitfalls of parametric inference: estimation 368

18.2. Possible pitfalls of parametric inference: hypothesis testing 372

18.3. The Fourier functional form 373

Page 8: Econometrics-Creel (2005)

CONTENTS 8

18.4. Kernel regression estimators 385

18.5. Kernel density estimation 391

18.6. Semi-nonparametric maximum likelihood 391

18.7. Examples 397

Chapter 19. Simulation-based estimation 408

19.1. Motivation 408

19.2. Simulated maximum likelihood (SML) 415

19.3. Method of simulated moments (MSM) 418

19.4. Efficient method of moments (EMM) 422

19.5. Example: estimation of stochastic differential equations 428

Chapter 20. Parallel programming for econometrics 431

Chapter 21. Introduction to Octave 432

21.1. Getting started 432

21.2. A short introduction 432

21.3. If you’re running a Linux installation... 435

Chapter 22. Notation and Review 436

22.1. Notation for differentiation of vectors and matrices 436

22.2. Convergenge modes 437

22.3. Rates of convergence and asymptotic equality 441

Exercises 444

Chapter 23. The GPL 445

Chapter 24. The attic 456

24.1. MEPS data: more on count models 457

24.2. Hurdle models 462

Page 9: Econometrics-Creel (2005)

CONTENTS 9

24.3. Models for time series data 474

Bibliography 491

Index 492

Page 10: Econometrics-Creel (2005)

List of Figures

1.2.1 LYX 15

1.2.2 Octave 16

3.2.1 Typical data, Classical Model 23

3.3.1 Example OLS Fit 26

3.3.2 The fit in observation space 26

3.4.1 Detection of influential observations 30

3.5.1 Uncentered ��� 32

3.7.1 Unbiasedness of OLS under classical assumptions 37

3.7.2 Biasedness of OLS when an assumption fails 38

3.7.3 Gauss-Markov Result: The OLS estimator 41

3.7.4 Gauss-Markov Result: The split sample estimator 42

6.5.1 Joint and Individual Confidence Regions 96

6.8.1 RTS as a function of firm size 107

7.4.1 Residuals, Nerlove model, sorted by firm size 125

7.5.1 Autocorrelation induced by misspecification 132

7.5.2 Durbin-Watson critical values 144

7.6.1 Residuals of simple Nerlove model 147

7.6.2 OLS residuals, Klein consumption equation 149

10

Page 11: Econometrics-Creel (2005)

LIST OF FIGURES 11

9.1.1 ����� when there is no collinearity 164

9.1.2 ����� when there is collinearity 165

9.3.1 Sample selection bias 179

13.1.1 The search method 259

13.2.1 Increasing directions of search 261

13.2.2 Newton-Raphson method 263

13.2.3 Using MuPAD to get analytic derivatives 266

13.5.1 Life expectancy of mongooses, Weibull model 275

13.5.2 Life expectancy of mongooses, mixed Weibull model 277

13.6.1 A foggy mountain 278

15.10.1 OLS and IV estimators when regressors and errors are

correlated 335

21.2.1 Running an Octave program 433

Page 12: Econometrics-Creel (2005)

List of Tables

1 Marginal Variances, Sample and Estimated (Poisson) 457

2 Marginal Variances, Sample and Estimated (NB-II) 462

3 Actual and Poisson fitted frequencies 463

4 Actual and Hurdle Poisson fitted frequencies 467

5 Information Criteria, OBDV 474

12

Page 13: Econometrics-Creel (2005)

CHAPTER 1

About this document

This document integrates lecture notes for a one year graduate level course

with computer programs that illustrate and apply the methods that are stud-

ied. The immediate availability of executable (and modifiable) example pro-

grams when using the PDF1 version of the document is one of the advantages

of the system that has been used. On the other hand, when viewed in printed

form, the document is a somewhat terse approximation to a textbook. These

notes are not intended to be a perfect substitute for a printed textbook. If you

are a student of mine, please note that last sentence carefully. There are many

good textbooks available. A few of my favorites are listed in the bibliography.

With respect to contents, the emphasis is on estimation and inference within

the world of stationary data, with a bias toward microeconometrics. The sec-

ond half is somewhat more polished than the first half, since I have taught that

course more often. If you take a moment to read the licensing information in

the next section, you’ll see that you are free to copy and modify the document.

If anyone would like to contribute material that expands the contents, it would

be very welcome. Error corrections and other additions are also welcome. As

an example of a project that has made use of these notes, see these very nice

lecture slides.

1It is possible to have the program links open up in an editor, ready to run using keyboardmacros. To do this with the PDF version you need to do some setup work. See the bootableCD described below.

13

Page 14: Econometrics-Creel (2005)

1.2. OBTAINING THE MATERIALS 14

1.1. License

All materials are copyrighted by Michael Creel with the date that appears

above. They are provided under the terms of the GNU General Public License,

which forms Section 23 of the notes. The main thing you need to know is that

you are free to modify and distribute these materials in any way you like, as

long as you do so under the terms of the GPL. In particular, you must make

available the source files, in editable form, for your modified version of the

materials.

1.2. Obtaining the materials

The materials are available on my web page, in a variety of forms including

PDF and the editable sources, at pareto.uab.es/mcreel/Econometrics/. In ad-

dition to the final product, which you’re looking at in some form now, you can

obtain the editable sources, which will allow you to create your own version,

if you like, or send error corrections and contributions. The main document

was prepared using LYX (www.lyx.org) and Octave (www.octave.org). LYX is

a free2 “what you see is what you mean” word processor, basically working as

a graphical frontend to LATEX. It (with help from other applications) can export

your work in LATEX, HTML, PDF and several other forms. It will run on Linux,

Windows, and MacOS systems. Figure 1.2.1 shows LYX editing this document.

GNU Octave has been used for the example programs, which are scattered

though the document. This choice is motivated by two factors. The first is the

high quality of the Octave environment for doing applied econometrics. The

fundamental tools exist and are implemented in a way that make extending

2”Free” is used in the sense of ”freedom”, but LYX is also free of charge.

Page 15: Econometrics-Creel (2005)

1.3. AN EASY WAY TO USE LYX AND OCTAVE TODAY 15

FIGURE 1.2.1. LYX

them fairly easy. The example programs included here may convince you of

this point. Secondly, Octave’s licensing philosophy fits in with the goals of this

project. Thirdly, it runs on Linux, Windows and MacOS. Figure 1.2.2 shows an

Octave program being edited by NEdit, and the result of running the program

in a shell window.

1.3. An easy way to use LYX and Octave today

The example programs are available as links to files on my web page in the

PDF version, and here. Support files needed to run these are available here.

The files won’t run properly from your browser, since there are dependencies

Page 16: Econometrics-Creel (2005)

1.3. AN EASY WAY TO USE LYX AND OCTAVE TODAY 16

FIGURE 1.2.2. Octave

between files - they are only illustrative when browsing. To see how to use

these files (edit and run them), you should go to the home page of this doc-

ument, since you will probably want to download the pdf version together

with all the support files and examples. Then set the base URL of the PDF file

to point to wherever the Octave files are installed. All of this may sound a bit

complicated, because it is. An easier solution is available:

The file pareto.uab.es/mcreel/Econometrics/econometrics.iso is an ISO im-

age file that may be burnt to CDROM. It contains a bootable-from-CD Gnu/Linux

Page 17: Econometrics-Creel (2005)

1.4. KNOWN BUGS 17

system that has all of the tools needed to edit this document, run the Octave ex-

ample programs, etcetera. In particular, it will allow you to cut out small por-

tions of the notes and edit them, and send them to me as LYX (or TEX) files for

inclusion in future versions. Think error corrections, additions, etc.! The CD

automatically detects the hardware of your computer, and will not touch your

hard disk unless you explicitly tell it to do so. It is based upon the Knoppix

GNU/Linux distribution, with some material removed and other added. Ad-

ditionally, you can use it to install Debian GNU/Linux on your computer (run

knoppix-installer as the root user). The versions of programs on the CD

may be quite out of date, possibly with security problems that have not been

fixed. So if you do a hard disk installation you should do apt-get update,

apt-get upgrade toot sweet. See the Knoppix web page for more informa-

tion.

1.4. Known Bugs

This section is a reminder to myself to try to fix a few things.� The PDF version has hyperlinks to figures that jump to the wrong fig-

ure. The numbers are correct, but the links are not. ps2pdf bugs?

Page 18: Econometrics-Creel (2005)

CHAPTER 2

Introduction: Economic and econometric models

Economic theory tells us that the demand function for a good is something

like: �� �� ��������������� is the quantity demanded� � is ����� vector of prices of the good and its substitutes and comple-

ments� � is income� � is a vector of other variables such as individual characteristics that

affect preferences

Suppose we have a sample consisting of one observation on � individuals’

demands at time period (this is a cross section, where ! �"�$#%�'&(&)&(��� indexes the

individuals in the sample). The individual demand functions are

�+*, ��+* ��� * ��� * ��� * The model is not estimable as it stands, since:

� The form of the demand function is different for all !-&� Some components of � * may not be observable to an outside modeler.

For example, people don’t eat the same lunch every day, and you can’t

tell what they will order just by looking at them. Suppose we can18

Page 19: Econometrics-Creel (2005)

2. INTRODUCTION: ECONOMIC AND ECONOMETRIC MODELS 19

break � * into the observable components . * and a single unobservable

component / * .A step toward an estimable econometric model is to suppose that the model

may be written as

�+*0 �01,23�54* ��6728� * �59:28.;4* �5<=2�/ *We have imposed a number of restrictions on the theoretical model:

� The functions �+* �?>@ which in principle may differ for all ! have been

restricted to all belong to the same parametric family.� Of all parametric families of functions, we have restricted the model

to the class of linear in the variables functions.� The parameters are constant across individuals.� There is a single unobservable component, and we assume it is addi-

tive.

If we assume nothing about the error term A , we can always write the last

equation. But in order for the � coefficients to have an economic meaning,

and in order to be able to estimate them from sample data, we need to make

additional assumptions. These additional assumptions have no theoretical

basis, they are assumptions on top of those needed to prove the existence of

a demand function. The validity of any results we obtain using this model

will be contingent on these additional restrictions being at least approximately

correct. For this reason, specification testing will be needed, to check that the

model seems to be reasonable. Only when we are convinced that the model is

at least approximately correct should we use it for economic analysis.

Page 20: Econometrics-Creel (2005)

2. INTRODUCTION: ECONOMIC AND ECONOMETRIC MODELS 20

When testing a hypothesis using an econometric model, three factors can

cause a statistical test to reject the null hypothesis:

(1) the hypothesis is false

(2) a type I error has occured

(3) the econometric model is not correctly specified so the test does not

have the assumed distribution

We would like to ensure that the third reason is not contributing to rejections,

so that rejection will be due to either the first or second reasons. Hopefully the

above example makes it clear that there are many possible sources of misspec-

ification of econometric models. In the next few sections we will obtain results

supposing that the econometric model is entirely correctly specified. Later we

will examine the consequences of misspecification and see some methods for

determining if a model is correctly specified. Later on, econometric methods

that seek to minimize maintained assumptions are introduced.

Page 21: Econometrics-Creel (2005)

CHAPTER 3

Ordinary Least Squares

3.1. The Linear Model

Consider approximating a variable B using the variables � 1C� � � �'&)&(&(� �ED . We

can consider a model that is a linear approximation:

Linearity: the model is a linear function of the parameter vector �GF;HB � F1 � 1I28� F� � � 2�&)&(&J2K� FD �ED 2KA

or, using vector notation: B �L 4@� F 2MAThe dependent variable B is a scalar random variable, L: � � 1 � � >N>N> �ED POis a Q -vector of explanatory variables, and �F � � F1 � F� >N>N>R� FD ?O & The su-

perscript “0” in � F means this is the ”true value” of the unknown parameter.

It will be defined more precisely later, and usually suppressed when it’s not

necessary for clarity.

Suppose that we want to use data to try to determine the best linear ap-

proximation to B using the variables L & The data ST��BVUP� L UWYXZ�[ �"�Y#%�\&)&(&)�[� are

obtained by some form of sampling1. An individual observation is thus

B]U �L 4U �^2�/JU1For example, cross-sectional data may be obtained by random sampling. Time series dataaccumulate historically.

21

Page 22: Econometrics-Creel (2005)

3.2. ESTIMATION BY LEAST SQUARES 22

The � observations can be written in matrix form as

(3.1.1) _ a` �b28/T�where _ dc BT1eB � >N>N>RBgfih 4 is �3�j� and `k lc L 1 L � >N>N> L f�h 4 .

Linear models are more general than they might first appear, since one can

employ nonlinear transformations of the variables:

m F ���� on m 1\��.� m � ��.pq>N>N> m 6T��.psr �^28/where the t * �W are known functions. Defining B um F �W��Y� � 1 um 1v��.pC� etc. leads

to a model in the form of equation 3.6.1. For example, the Cobb-Douglas model

� aw .yxCz� .yxY{|M}C~%� ��/Vcan be transformed logarithmically to obtain��� � �)� w 2K� � ��� . � 2K� | ��� . | 2�/%&If we define B ��� �%���,1 ��� w � etc., we can put the model in the form needed.

The approximation is linear in the parameters, but not necessarily linear in the

variables.

3.2. Estimation by least squares

Figure 3.2.1, obtained by running TypicalData.m shows some data that fol-

lows the linear model BgU �01\2=� � � U � 2=A-U . The green line is the ”true” regression

line �01V2�� � � U � , and the red crosses are the data points � � U � ��B]U�C� where A-U is a ran-

dom error that has mean zero and is independent of � U � . Exactly how the green

line is defined will become clear later. In practice, we only have the data, and

Page 23: Econometrics-Creel (2005)

3.2. ESTIMATION BY LEAST SQUARES 23

FIGURE 3.2.1. Typical data, Classical Model

-15

-10

-5

0

5

10

0 2 4 6 8 10 12 14 16 18 20X

datatrue regression line

we don’t know where the green line lies. We need to gain information about

the straight line that best fits the data points.

The ordinary least squares (OLS) estimator is defined as the value that mini-

mizes the sum of the squared errors:�� �V�-����� � �����where

�T��� f� U(��1 ��BgU�� L 4U � � ��_�� ` �G 4 ��_�� ` � _I4�_i�K#J_�4 ` �^28��4 ` 4 ` � � _�� ` � � �

Page 24: Econometrics-Creel (2005)

3.2. ESTIMATION BY LEAST SQUARES 24

This last expression makes it clear how the OLS estimator is defined: it min-

imizes the Euclidean distance between B and ����& The fitted OLS coefficients

will define the best linear approximation to B using L as basis functions, where

”best” means minimum Euclidean distance. One could think of other esti-

mators based upon other metrics. For example, the minimum absolute distance

(MAD) minimizes � fU(��1G� BgU�� L 4U � � . Later, we will see that which estimator is

best in terms of their statistical properties, rather than in terms of the metrics

that define them, depends upon the properties of A , about which we have as

yet made no assumptions.

� To minimize the criterion �����C� find the derivative with respect to �and it to zero: � x ��� ��G �p# ` 4 _�2M# ` 4 ` �� ��so �� � ` 4 ` �� 1 ` 4�_�&� To verify that this is a minimum, check the s.o.s.c.:� �x �T� ��� # ` 4 `Since ��� `  � � this matrix is positive definite, since it’s a quadratic

form in a p.d. matrix (identity matrix of order �� , so�� is in fact a

minimizer.� The fitted values are in the vector�_ ¡` ��¢&� The residuals are in the vector

�/ _i� ` ��

Page 25: Econometrics-Creel (2005)

3.3. GEOMETRIC INTERPRETATION OF LEAST SQUARES ESTIMATION 25� Note that

_ ` �^2�/ ` ��^2 �/� Also, the first order conditions can be written as

` 4�_£� ` 4 ` �� �` 4 c _�� ` �� h �` 4 �/ �which is to say, the OLS residuals are orthogonal to ` . Let’s look at

this more carefully.

3.3. Geometric interpretation of least squares estimation

3.3.1. In �i��¤ Space. Figure 3.3.1 shows a typical fit to data, along with the

true regression line. Note that the true line and the estimated line are different.

This figure was created by running the Octave program OlsFit.m . You can

experiment with changing the parameter values to see how this affects the fit,

and to see how the fitted line will sometimes be close to the true line, and

sometimes rather far away.

3.3.2. In Observation Space. If we want to plot in observation space, we’ll

need to use only two or three observations, or we’ll encounter some limitations

of the blackboard. Let’s use two. With only two observations, we can’t have�¦¥ �"&

Page 26: Econometrics-Creel (2005)

3.3. GEOMETRIC INTERPRETATION OF LEAST SQUARES ESTIMATION 26

FIGURE 3.3.1. Example OLS Fit

-15

-10

-5

0

5

10

15

0 2 4 6 8 10 12 14 16 18 20X

data pointsfitted linetrue line

FIGURE 3.3.2. The fit in observation space

Observation 2

Observation 1

x

y

S(x)

x*beta=P_xY

e = M_xY

Page 27: Econometrics-Creel (2005)

3.3. GEOMETRIC INTERPRETATION OF LEAST SQUARES ESTIMATION 27� We can decompose B into two components: the orthogonal projection

onto the � � dimensional space spanned by � , � ��Z� and the compo-

nent that is the orthogonal projection onto the �i� � subpace that is

orthogonal to the span of �i� �/�&� Since�� is chosen to make

�/ as short as possible,�/ will be orthogonal

to the space spanned by �i& Since � is in this space, � 4 �/ ¡� & Note that

the f.o.c. that define the least squares estimator imply that this is so.

3.3.3. Projection Matrices. � �� is the projection of B onto the span of �i� or

� �� �§�¨�i4©�3 � 1 �i4�BTherefore, the matrix that projects B onto the span of � isªI« �K�¬�i4��:�� 1 �i4since � �� ª�« B�&�/ is the projection of B onto the ­®� � dimensional space that is orthogonal

to the span of � . We have that�/ B¯�j� �� B¯�j�K����4��:�� 1 �i4©B °@± fp�²�K����4��:�� 1 �i4�³0B�&

Page 28: Econometrics-Creel (2005)

3.4. INFLUENTIAL OBSERVATIONS AND OUTLIERS 28

So the matrix that projects B onto the space orthogonal to the span of � is´:« ± fp�j�K���i4��:$� 1 �i4 ± fp� ª�« &We have �/ ´3« B�&

Therefore

B ªI« B�2 ´:« B � ��^2 �/�&These two projection matrices decompose the � dimensional vector B into two

orthogonal components - the portion that lies in the � dimensional space de-

fined by �i� and the portion that lies in the orthogonal �K� � dimensional

space. � Note that bothª�«

and´:«

are symmetric and idempotent.

– A symmetric matrix w is one such that w¡ aw 4 &– An idempotent matrix w is one such that w� �wµw &– The only nonsingular idempotent matrix is the identity matrix.

3.4. Influential observations and outliers

The OLS estimator of the ! U�¶ element of the vector � F is simply�� *§ ° �¬�i4��3$� 1 �i4�³ *)· B ¸ 4* B

Page 29: Econometrics-Creel (2005)

3.4. INFLUENTIAL OBSERVATIONS AND OUTLIERS 29

This is how we define a linear estimator - it’s a linear function of the de-

pendent variable. Since it’s a linear combination of the observations on the

dependent variable, where the weights are detemined by the observations on

the regressors, some observations may have more influence than others. De-

fine ¹ U � ªI« U@U º 4U ª�« º U � ªI« º U � �» �¼º U � � �¹ U is the t U�¶ element on the main diagonal ofª«

( º U is a � vector of zeros with

a � in the t U�¶ position). So �¾½ ¹ U ½ �V� and¿yÀVªI« ��eÁ ¹ ¡�� �G&So, on average, the weight on the BVU ’s is �� � . If the weight is much higher, then

the observation has the potential to affect the fit importantly. The weight,¹ U

is referred to as the leverage of the observation. However, an observation may

also be influential due to the value of BVU , rather than the weight it is multiplied

by, which only depends on the � U ’s.

To account for this, consider estimation of � without using the U�¶ observa-

tion (designate this estimator as��Äà U(Å Y& One can show (see Davidson and MacK-

innon, pp. 32-5 for proof) that�� à U(Å ��£�ÇÆ ��È� ¹ UCÉ ����4��:�� 1 �i4U �/JU

Page 30: Econometrics-Creel (2005)

3.4. INFLUENTIAL OBSERVATIONS AND OUTLIERS 30

FIGURE 3.4.1. Detection of influential observations

-2

0

2

4

6

8

10

12

14

0 0.5 1 1.5 2 2.5 3X

Data pointsfitted

LeverageInfluence

so the change in the U�¶ observations fitted value is

�=U ��£�j�=U �� à U(Å Æ ¹ U�È� ¹ UCÉ �/]UWhile an observation may be influential if it doesn’t affect its own fitted value,

it certainly is influential if it does. A fast means of identifying influential ob-

servations is to plot c ¶YÊ1 � ¶YÊ h �/JU (which I will refer to as the own influence of the

observation) as a function of . Figure 3.4.1 gives an example plot of data, fit,

leverage and influence. The Octave program is InfluentialObservation.m . If

you re-run the program you will see that the leverage of the last observation

(an outlying value of x) is always high, and the influence is sometimes high.

After influential observations are detected, one needs to determine why

they are influential. Possible causes include:

Page 31: Econometrics-Creel (2005)

3.5. GOODNESS OF FIT 31� data entry error, which can easily be corrected once detected. Data

entry errors are very common.� special economic factors that affect some observations. These would

need to be identified and incorporated in the model. This is the idea

behind structural change: the parameters may not be constant across all

observations.� pure randomness may have caused us to sample a low-probability ob-

servation.

There exist robust estimation methods that downweight outliers.

3.5. Goodness of fit

The fitted model is B � ��b2 �/Take the inner product:

BT4�B ���4��i4�� ��^2Ë# ���4Ì�i4 �/¼2 �/J4 �/But the middle term of the RHS is zero since � 4 �/ a� , so

(3.5.1) BT4�B ���4��i4�� ��^2 �/J4 �/The uncentered � �Í is defined as

� �Í �È� �/ 4 �/B 4 B �� 4 � 4 � ��B 4 B � ª�« B � �� B � � ÎvÏ"Ð � �Wt0Y�

Page 32: Econometrics-Creel (2005)

3.5. GOODNESS OF FIT 32

where t is the angle between B and the span of � .

� The uncentered ��� changes if we add a constant to B�� since this changest (see Figure 3.5.1, the yellow vector is a constant, since it’s on theÑ�Òdegree line in observation space). Another, more common defini-

FIGURE 3.5.1. Uncentered � �

tion measures the contribution of the variables, other than the constant

term, to explaining the variation in B�& Thus it measures the ability of

the model to explain the variation of B about its unconditional sample

mean.

Page 33: Econometrics-Creel (2005)

3.5. GOODNESS OF FIT 33

Let Ó �Ô�"�\�"�'&(&)&(�'�J 4 � a � -vector. So´ÖÕ ± fs�MÓ-�PÓ¨4�ÓW$� 1 Ó¨4 ± fs�MÓ×Ó¬4 Â �´ÖÕ B just returns the vector of deviations from the mean. In terms of deviations

from the mean, equation 3.5.1 becomes

BT4 ´ÖÕ B ��I4Ì�i4 ´:Õ � ���2 �/J4 ´ÖÕ'�/The centered � �Ø is defined as

� �Ø �y� �/ 4 �/B 4 ´ÖÕ B �È�ÚÙÜÛZÛ¿ Û�Ûwhere ÙÜÛ�Û �/ 4 �/ and

¿ Û�Û B 4 ´ÖÕ B = � fU(��1 ��B]U��uÝB% � .Supposing that � contains a column of ones (i.e., there is a constant term),

�i4 �/ ��ÞÁ � U �/JU a�so´ÖÕ'�/ �/T& In this case

B�4 ´ÖÕ B ���4��i4 ´ÖÕ � ��^2 �/]4 �/So � �Ø � Û�Û¿ Û�Ûwhere � Û�Û �� 4 � 4 ´:Õ � ��

� Supposing that a column of ones is in the space spanned by � (ªG« Ó ÓWC� then one can show that � » � �Ø » �"&

Page 34: Econometrics-Creel (2005)

3.6. THE CLASSICAL LINEAR REGRESSION MODEL 34

3.6. The classical linear regression model

Up to this point the model is empty of content beyond the definition of a

best linear approximation to B and some geometrical properties. There is no

economic content to the model, and the regression parameters have no eco-

nomic interpretation. For example, what is the partial derivative of B with

respect to �%ß ? The linear approximation is

B �01 � 1�2K� � � � 2�&)&(&]2K� DY�ED 2MAThe partial derivative is à Bà �Tß � ß 2 à Aà �TßUp to now, there’s no guarantee that áYâá$ãÔä =0. For the � to have an economic

meaning, we need to make additional assumptions. The assumptions that are

appropriate to make depend on the data under consideration. We’ll start with

the classical linear regression model, which incorporates some assumptions

that are clearly not realistic for economic data. This is to be able to explain

some concepts with a minimum of confusion and notational clutter. Later we’ll

adapt the results to what we can get with more realistic assumptions.

Linearity: the model is a linear function of the parameter vector �GF;HB � F1 � 1I28� F� � � 2�&)&(&J2K� FD �ED 2KA(3.6.1)

or, using vector notation: B �L 4@� F 2MA

Page 35: Econometrics-Creel (2005)

3.6. THE CLASSICAL LINEAR REGRESSION MODEL 35

Nonstochastic linearly independent regressors: ` is a fixed matrix of con-

stants, it has rank � , its number of columns, and 3.6.2� �)� �� ` 4 `å �æ «(3.6.2)

where æ « is a finite positive definite matrix. This is needed to be able to iden-

tify the individual effects of the explanatory variables.

Independently and identically distributed errors:

(3.6.3) A�ç ±T± � � � ��è � ± f�/ is jointly distributed IIN. This implies the following two properties:

Homoscedastic errors:

(3.6.4) é^�¬/]U� è �F �?ê� Nonautocorrelated errors:

(3.6.5) ë¼�¬/]U�A�ì- �� �Ôê� pí �Optionally, we will sometimes assume that the errors are normally dis-

tributed.

Normally distributed errors:

(3.6.6) A�çR­j� � ��è � ± f"

Page 36: Econometrics-Creel (2005)

3.7. SMALL SAMPLE STATISTICAL PROPERTIES OF THE LEAST SQUARES ESTIMATOR 36

3.7. Small sample statistical properties of the least squares estimator

Up to now, we have only examined numeric properties of the OLS estima-

tor, that always hold. Now we will examine statistical properties. The statisti-

cal properties depend upon the assumptions we can make.

3.7.1. Unbiasedness. We have�� ��� 4 �: � 1 � 4 B . By linearity,�� �¬� 4 �3 � 1 � 4 �����^28/V �b2��¬�i4��3$� 1 �i4�/

By 3.6.2 and 3.6.3

Ù ����4��:�� 1 �i4�/ Ù �¬�i4��:�� 1 �i4�/ ���i4��:$� 1 �i4 Ù / �so the OLS estimator is unbiased under the assumptions of the classical model.

Figure 3.7.1 shows the results of a small Monte Carlo experiment where the

OLS estimator was calculated for 10000 samples from the classical model withB ��2�# � 2Ö/ , where � # � , è��î aï , and � is fixed across samples. We can see

that the � � appears to be estimated without bias. The program that generates

the plot is Unbiased.m , if you would like to experiment with this.

With time series data, the OLS estimator will often be biased. Figure 3.7.2

shows the results of a small Monte Carlo experiment where the OLS estimator

was calculated for 1000 samples from the AR(1) model with B"U a� 2 � & ï B]U � 1v2¯/JU ,where � # � and è �î � . In this case, assumption 3.6.2 does not hold: the

Page 37: Econometrics-Creel (2005)

3.7. SMALL SAMPLE STATISTICAL PROPERTIES OF THE LEAST SQUARES ESTIMATOR 37

FIGURE 3.7.1. Unbiasedness of OLS under classical assumptions

0

0.02

0.04

0.06

0.08

0.1

0.12

-3 -2 -1 0 1 2 3

Beta hat - Beta true

regressors are stochastic. We can see that the bias in the estimation of � � is

about -0.2.

The program that generates the plot is Biased.m , if you would like to ex-

periment with this.

3.7.2. Normality. With the linearity assumption, we have�� �2=��� 4 �: � 1 � 4 /T&

This is a linear function of / . Adding the assumption of normality (3.6.6, which

implies strong exogeneity), then

��içu­ñðò�Z�N���i4��:$� 1 è �F\ósince a linear function of a normal random vector is also normally distributed.

In Figure 3.7.1 you can see that the estimator appears to be normally dis-

tributed. It in fact is normally distributed, since the DGP (see the Octave pro-

gram) has normal errors. Even when the data may be taken to be IID, the

Page 38: Econometrics-Creel (2005)

3.7. SMALL SAMPLE STATISTICAL PROPERTIES OF THE LEAST SQUARES ESTIMATOR 38

FIGURE 3.7.2. Biasedness of OLS when an assumption fails

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

-1.2 -1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4

Beta hat - Beta true

assumption of normality is often questionable or simply untenable. For exam-

ple, if the dependent variable is the number of automobile trips per week, it

is a count variable with a discrete distribution, and is thus not normally dis-

tributed. Many variables in economics can take on only nonnegative values,

which, strictly speaking, rules out normality.2

3.7.3. The variance of the OLS estimator and the Gauss-Markov theo-

rem. Now let’s make all the classical assumptions except the assumption of

2Normality may be a good model nonetheless, as long as the probability of a negative valueoccuring is negligable under the model. This depends upon the mean being large enough inrelation to the variance.

Page 39: Econometrics-Creel (2005)

3.7. SMALL SAMPLE STATISTICAL PROPERTIES OF THE LEAST SQUARES ESTIMATOR 39

normality. We have�� �^2a��� 4 �: � 1 � 4 / and we know that ٠� ��� � . So

éÜô À � ��� ÙRõ c¢ö�÷��� h c¢ö����� h 4×ø Ùaù �¬�i4��3$� 1 �i4�/J/]4Ì�K�¬�i4��:�� 1'ú ��� 4 �: � 1 è �FThe OLS estimator is a linear estimator, which means that it is a linear func-

tion of the dependent variable, B�&�� ° ��� 4 �: � 1 � 4�³ B û Bwhere û is a function of the explanatory variables only, not the dependent vari-

able. It is also unbiased under the present assumptions, as we proved above.

One could consider other weights ü that are a function of � that define some

other linear estimator. We’ll still insist upon unbiasedness. Consider ý� üRB��where ü üþ���3 is some Q:�j� matrix function of �i& Note that since ü is

a function of ��� it is nonstochastic, too. If the estimator is unbiased, then we

must have üR� a±'ÿ :

ë¼�WüuB ë7�òü���� F 2Ëü�/V ü���� F � FÁüR� ±'ÿ

Page 40: Econometrics-Creel (2005)

3.7. SMALL SAMPLE STATISTICAL PROPERTIES OF THE LEAST SQUARES ESTIMATOR 40

The variance of ý� is éb��ý� üÚü�4©è �F &Define � ü �¡��� 4 �: � 1 � 4so ü � 2��¬� 4 �: � 1 � 4Since ü�� a±'ÿ � � � �� � so

éb��ý�G ð � 2a���i4��:$� 1 ��4 ó ð � 2a���i4��:$� 1 ��4 ó 4 è �F c �^� 4 2��¬� 4 �3 � 1 h è �FSo é�� ý� � é^� ���The inequality is a shorthand means of expressing, more formally, that éb�%ý���é^� ��� is a positive semi-definite matrix. This is a proof of the Gauss-Markov

Theorem. The OLS estimator is the ”best linear unbiased estimator” (BLUE).� It is worth emphasizing again that we have not used the normality

assumption in any way to prove the Gauss-Markov theorem, so it is

valid if the errors are not normally distributed, as long as the other

assumptions hold.

To illustrate the Gauss-Markov result, consider the estimator that results from

splitting the sample into � equally-sized parts, estimating using each part of

the data separately by OLS, then averaging the � resulting estimators. You

should be able to show that this estimator is unbiased, but inefficient with

respect to the OLS estimator. The program Efficiency.m illustrates this using

Page 41: Econometrics-Creel (2005)

3.7. SMALL SAMPLE STATISTICAL PROPERTIES OF THE LEAST SQUARES ESTIMATOR 41

FIGURE 3.7.3. Gauss-Markov Result: The OLS estimator

0

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

0.09

0.1

0 0.5 1 1.5 2 2.5 3 3.5 4

Beta 2 hat, OLS

a small Monte Carlo experiment, which compares the OLS estimator and a 3-

way split sample estimator. The data generating process follows the classical

model, with � #%� . The true parameter value is � #%& In Figures 3.7.3 and

3.7.4 we can see that the OLS estimator is more efficient, since the tails of its

histogram are more narrow.

We have that Ù � �� � and éÞô À � �� ð �iO�� ó � 1 è �F � but we still need to

estimate the variance of A , è �F , in order to have an idea of the precision of the

estimates of � . A commonly used estimator of è �F is�è �F ��÷� � �/ 4 �/This estimator is unbiased:

Page 42: Econometrics-Creel (2005)

3.7. SMALL SAMPLE STATISTICAL PROPERTIES OF THE LEAST SQUARES ESTIMATOR 42

FIGURE 3.7.4. Gauss-Markov Result: The split sample estimator

0

0.02

0.04

0.06

0.08

0.1

0.12

0 0.5 1 1.5 2 2.5 3 3.5 4

Beta 2 hat, Split Sample Estimator

�è �F ���� � �/J4 �/ ���� � /J4 ´ /ë¼� �è �F ���� � Ù � ¿yÀ / 4 ´ /V ���� � Ù � ¿yÀ"´ /]/ 4 ���� � ¿yÀ Ù « Ù â�� « � ´ /J/ 4 ���� � è �F Ù «�¿;ÀV´ ���� � è �F �����KQ5 è �F

Page 43: Econometrics-Creel (2005)

3.8. EXAMPLE: THE NERLOVE MODEL 43

where we use the fact that¿yÀ � w�� ¿yÀ � �Üw when both products are con-

formable. Thus, this estimator is also unbiased under these assumptions.

3.8. Example: The Nerlove model

3.8.1. Theoretical background. For a firm that takes input prices . and

the output level � as given, the cost minimization problem is to choose the

quantities of inputs � to solve the problem

� � �ã .y4 �subject to the restriction � � � �T&The solution is the vector of factor demands � ��.Ü��" . The cost function is ob-

tained by substituting the factor demands into the criterion function:

û .¯��" .;4 � ��.Ü��"Y&� Monotonicity Increasing factor prices cannot decrease cost, soà û ��.¯��"à . � �

Remember that these derivatives give the conditional factor demands

(Shephard’s Lemma).� Homogeneity The cost function is homogeneous of degree 1 in input

prices: û �� ?.¯��V û ��.Ü��" where is a scalar constant. This is because

Page 44: Econometrics-Creel (2005)

3.8. EXAMPLE: THE NERLOVE MODEL 44

the factor demands are homogeneous of degree zero in factor prices -

they only depend upon relative prices.� Returns to scale The returns to scale parameter � is defined as the in-

verse of the elasticity of cost with respect to output:� Æ à û ��.¯��Và � �û ��.Ü��"JÉ � 1Constant returns to scale is the case where increasing production � im-

plies that cost increases in the proportion 1:1. If this is the case, then� � .

3.8.2. Cobb-Douglas functional form. The Cobb-Douglas functional form

is linear in the logarithms of the regressors and the dependent variable. For a

cost function, if there are � factors, the Cobb-Douglas cost function has the

form

ûu ¡w . x� 1 &)&(&Ì. x��� � x�� º îWhat is the elasticity of û with respect to . ß ?

º��< ä Æ à ûà�� � É c . ßû h � ß$w . x� 1 &Ì. x ä � 1ß &(&�. x��� � x�� º î . ßw .;x� 1 &)&(&Ì. x��� � x � º î � ßThis is one of the reasons the Cobb-Douglas form is popular - the coefficients

are easy to interpret, since they are the elasticities of the dependent variable

Page 45: Econometrics-Creel (2005)

3.8. EXAMPLE: THE NERLOVE MODEL 45

with respect to the explanatory variable. Not that in this case,

º��< ä Æ à ûà�� � É c . ßû h �Tß ��.¯��V . ßû� � ß ��.¯��Vthe cost share of the � U�¶ input. So with a Cobb-Douglas cost function, � ßj � ß ��.Ü��" . The cost shares are constants.

Note that after a logarithmic transformation we obtain�)� ûu �� 2K�01 �)� .µ1�2�&)&(&J28� � �)� . � 28��� ��� �È2KAwhere �8 �)� w . So we see that the transformed model is linear in the logs of

the data.

One can verify that the property of HOD1 implies that�� * ��1 � � �In other words, the cost shares add up to 1.

The hypothesis that the technology exhibits CRTS implies that� ���� �so ��� �"& Likewise, monotonicity implies that the coefficients � * � � ��! �"�'&(&(&)�� .

3.8.3. The Nerlove data and OLS. The file nerlove.data contains data on

145 electric utility companies’ cost of production, output and input prices. The

data are for the U.S., and were collected by M. Nerlove. The observations are

Page 46: Econometrics-Creel (2005)

3.8. EXAMPLE: THE NERLOVE MODEL 46

by row, and the columns are COMPANY, COST � û , OUTPUT � æ C� PRICE

OF LABOR � ª�� , PRICE OF FUEL � ª� and PRICE OF CAPITAL � ª ÿ C& Note

that the data are sorted by output level (the third column).

We will estimate the Cobb-Douglas model

(3.8.1)��� ûÚ �,1�2K� � ��� æ 28� | ��� ª�� 2K�"! ��� ª# 28��$ ��� ª ÿ 2KA

using OLS. To do this yourself, you need the data file mentioned above, as

well as Nerlove.m (the estimation program) , and the library of Octave func-

tions mentioned in the introduction to Octave that forms section 21 of this

document.3

The results are

*********************************************************OLS estimation resultsObservations 145R-squared 0.925955Sigma-squared 0.153943

Results (Ordinary var-cov estimator)

estimate st.err. t-stat. p-valueconstant -3.527 1.774 -1.987 0.049output 0.720 0.017 41.244 0.000labor 0.436 0.291 1.499 0.136fuel 0.427 0.100 4.249 0.000capital -0.220 0.339 -0.648 0.518

*********************************************************

� Do the theoretical restrictions hold?� Does the model fit well?� What do you think about RTS?

3If you are running the bootable CD, you have all of this installed and ready to run.

Page 47: Econometrics-Creel (2005)

3.8. EXAMPLE: THE NERLOVE MODEL 47

While we will use Octave programs as examples in this document, since fol-

lowing the programming statements is a useful way of learning how theory

is put into practice, you may be interested in a more ”user-friendly” environ-

ment for doing econometrics. I heartily recommend Gretl, the Gnu Regression,

Econometrics, and Time-Series Library. This is an easy to use program, avail-

able in English, French, and Spanish, and it comes with a lot of data ready to

use. It even has an option to save output as LATEX fragments, so that I can just

include the results into this document, no muss, no fuss. Here the results of

the Nerlove model from GRETL:

Model 2: OLS estimates using the 145 observations 1–145

Dependent variable: l_cost

Variable Coefficient Std. Error -statistic p-value

const �&% & Ò #(' Ò �V&*)+) Ñ %,) �¯�"& ï+- ) Ò � & � Ñ -+-l_output � &.)"# � % ï Ñ � & � ��) Ñ '+' Ñ Ñ �V&�# Ñ"Ñ�Ò � & �"�"�"�l_labor � & Ñ %+'(% Ñ � � &©# ï � � Ñ - �V& Ñ ïVï # � &(��%+' �l_fuel � & Ñ #(' Ò �/) � &)� �V� %+' ï Ñ &�# Ñ ï Ò � & �"�"�"�l_capita � � &©#%� ï+-+-+- � &.%(% ï Ñ # ï � � &.' Ñ ) - � & Ò � - #

Page 48: Econometrics-Creel (2005)

3.8. EXAMPLE: THE NERLOVE MODEL 48

Mean of dependent variable �"&*)V# Ñ '+'S.D. of dependent variable �"& Ñ #%�/)"#Sum of squared residuals #%�g& Ò"Ò # �Standard error of residuals (

�è ) � &0% ï #(% Ò 'Unadjusted � � � & ï # Ò ï Ò"ÒAdjusted Ý� � � & ï #1% - Ñ �2 � Ñ �'� Ñ � Ñ %,)�&.' - 'Akaike information criterion � Ñ�Ò & �+- ÑSchwarz Bayesian criterion � Ò ï & ï ',)

Fortunately, Gretl and my OLS program agree upon the results. Gretl is in-

cluded in the bootable CD mentioned in the introduction. I recommend using

GRETL to repeat the examples that are done using Octave.

The previous properties hold for finite sample sizes. Before considering

the asymptotic properties of the OLS estimator it is useful to review the MLE

estimator, since under the assumption of normal errors the two estimators co-

incide.

Page 49: Econometrics-Creel (2005)

EXERCISES 49

Exercises

(1) Prove that the split sample estimator used to generate figure 3.7.4 is unbi-

ased.

(2) Calculate the OLS estimates of the Nerlove model using Octave and GRETL,

and provide printouts of the results. Interpret the results.

(3) Do an analysis of whether or not there are influential observations for OLS

estimation of the Nerlove model. Discuss.

(4) Using GRETL, examine the residuals after OLS estimation and tell me whether

or not you believe that the assumption of independent identically dis-

tributed normal errors is warranted. No need to do formal tests, just look

at the plots. Print out any that you think are relevant, and interpret them.

(5) For a random vector � ç ­²�43 ã �65¼C� what is the distribution of w �o287 ,where w and 7 are conformable matrices of contants?

(6) Using Octave, write a little program that verifies that¿yÀ � w�� ¿;À � �¯w

for w and � 4x4 matrices of random numbers. Note: there is an Octave

function trace.

(7) For the model with a constant and a single regressor, B"U �01Ä2 � � � U�2�A-U ,which satisfies the classical assumptions, prove that the variance of the

OLS estimator declines to zero as the sample size increases.

Page 50: Econometrics-Creel (2005)

CHAPTER 4

Maximum likelihood estimation

The maximum likelihood estimator is important since it is asymptotically

efficient, as is shown below. For the classical linear model with normal errors,

the ML and OLS estimators of � are the same, so the following theory is pre-

sented without examples. In the second half of the course, nonlinear models

with nonnormal errors are introduced, and examples may be found there.

4.1. The likelihood function

Suppose we have a sample of size � of the random vectors B and � . Suppose

the joint density of ¤ c BT1q&\&'&RBgf h and 9 c �g1 &'&\&Ú�\f h is character-

ized by a parameter vector : F H �<;>= ��¤Z�69y�: F Y&This is the joint density of the sample. This density can be factored as�/;?= �W¤��69;�: F � ; � = ��¤ � 9;�@ F �<= �A9;��� F

The likelihood function is just this density evaluated at other values :B ��¤Z�69y�:7 � �W¤��69;�: C�:DCFE �where E is a parameter space.

The maximum likelihood estimator of : F is the value of : that maximizes the

likelihood function.50

Page 51: Econometrics-Creel (2005)

4.1. THE LIKELIHOOD FUNCTION 51

Note that if @ F and � F share no elements, then the maximizer of the condi-

tional likelihood function� ; � = �W¤ � 9;��@V with respect to @ is the same as the max-

imizer of the overall likelihood function�<;?= �W¤��69;�:7 � ; � = �W¤ � 9;�@" �<= ��9;���T ,

for the elements of : that correspond to @ . In this case, the variables 9 are said

to be exogenous for estimation of @ , and we may more conveniently work with

the conditional likelihood function� ; � = �W¤ � 9;��@" for the purposes of estimating@ F .

DEFINITION 4.1.1. The maximum likelihood estimator of @ F ��V�[����� ~ �/; � = ��¤ � 9y��@"� If the � observations are independent, the likelihood function can be

written as B �W¤ � 9;��@V fG U(��1 � ��BgU � �\UP��@Vwhere the

� U are possibly of different form.� If this is not possible, we can always factor the likelihood into contribu-

tions of observations, by using the fact that a joint density can be factored

into the product of a marginal and conditional (doing this iteratively)B ��¤Z��@V � ��BT1 � �g1Y��@" � ��B � � B�1$�$� � �@" � ��B | � B�1Y��B � �$� | ��@V0>N>N> � ��Bgf � B�1IH B � �'&'&'&-BgU � fT�$�\f%��@"To simplify notation, define

� U SJBT1Y��B � �'&(&)&(��B]U � 1Y�$�\UPXso � 1 �g1C� � � SJBT1Y�$� � X , etc. - it contains exogenous and predetermined endo-

geous variables. Now the likelihood function can be written asB �W¤���@" fG U(��1 � ��B]U � � U?��@V

Page 52: Econometrics-Creel (2005)

4.1. THE LIKELIHOOD FUNCTION 52

The criterion function can be defined as the average log-likelihood function:

�\f+�4@" �� �)� B ��¤Z�@" �� f� U(��1 ��� � ��B]U � � U?��@VThe maximum likelihood estimator may thus be defined equivalently as�@ a�V�[����� ~ �\f5�A@VC�where the set maximized over is defined below. Since

��� �Ô>@ is a monotonic

increasing function,�)� B

andB

maximize at the same value of @%& Dividing by �has no effect on

�@%&4.1.1. Example: Bernoulli trial. Suppose that we are flipping a coin that

may be biased, so that the probability of a heads may not be 0.5. Maybe we’re

interested in estimating the probability of a heads. Let B ��� ¹ º ô�JT�J be a binary

variable that indicates whether or not a heads is observed. The outcome of a

toss is a Bernoulli random variable:�<; ��B��ò� F �"K F �?�È�i� F 1 � K ��BLC²S � �'�VX � ��B ÂCjS � �'�VXSo a representative term that enters the likelihood function is�/; ��B��ò�� � K �Ô�È���E 1 � Kand ��� �/; ��B��ò�� B ��� �Ü2a�Ô�È��B5 ��� �P�y�i�E

Page 53: Econometrics-Creel (2005)

4.1. THE LIKELIHOOD FUNCTION 53

The derivative of this is à ��� �<; ��B��ò�Eà � B� � �P�È��B5�Ô�¼�i�E BÜ�i��Ü�?�È�i��Averaging this over a sample of size � givesà �\fE�@��à � �� f� * ��1 B * �i��Ü�Ô�¼�i��Setting to zero and solving gives �� ÝBSo it’s easy to calculate the MLE of � F in this case.

Now imagine that we had a bag full of bent coins, each bent around a

sphere of a different radius (with the head pointing to the outside of the sphere).

We might suspect that the probability of a heads could depend upon the ra-

dius. Suppose that � * � ��� �+* ��� �?� 2 }C~%� �Ô� � 4* �[ � 1 where �E*G n � À * r 4 , so

that � is a 2 � 1 vector. Now à � * ���Gà � � * �Ô�È��� * �+*so

à �)� �<; ��B����à � B¯�i� *� * �?�È�i� * � * �Ô�È�i� * �+* ��B * �i��� �+* ���[ �+*

Page 54: Econometrics-Creel (2005)

4.2. CONSISTENCY OF MLE 54

So the derivative of the average log lihelihood function is nowà �\fE���à � � f* ��1 ��B * �i��� �+* ���[ �+*�This is a set of 2 nolinear equations in the two unknown elements in � . There

is no explicit solution for the two elements that set the equations to zero. This

is common with ML estimators, they are often nonlinear, and finding their

values often require use of numeric methods to find solutions to the first order

conditions.

4.2. Consistency of MLE

To show consistency of the MLE, we need to make explicit some assump-

tions.

Compact parameter space: @MCFN � an open bounded subset of O ÿ & Max-

imixation is over N � which is compact.

This implies that @ is an interior point of the parameter space N .

Uniform convergence:

�\f5�A@" Í�P Q�P ìR � �)�f�SUT ë"V�W��\f+�4@" � ��T=�A@%�@ F C�Ôê?@MC N &We have suppressed ¤ here for simplicity. This requires that almost sure con-

vergence holds for all possible parameter values. For a given parameter value,

an ordinary Law of Large Numbers will usually imply almost sure conver-

gence to the limit of the expectation. Convergence for a single element of

the parameter space, combined with the assumption of a compact parameter

space, ensures uniform convergence.

Page 55: Econometrics-Creel (2005)

4.2. CONSISTENCY OF MLE 55

Continuity: �'f5�A@V is continuous in @%�@XC NÜ& This implies that �/T=�A@T��@ F is

continuous in @%&Identification: �/T=�4@%��@ F has a unique maximum in its first argument.

We will use these assumptions to show that�@\f Q6P ì PR @ F &

First,�@\f certainly exists, since a continuous function has a maximum on a

compact set.

Second, for any @�í @ Fë Æ �)� Æ B �4@"B �A@ F ]ÉµÉ » ��� Æ ë Æ B �4@"B �A@ F ]ɵÉ

by Jensen’s inequality (��� �Ô>© is a concave function).

Now, the expectation on the RHS is

ë Æ B �A@VB �A@ F ]É �Y B �A@"B �4@ F B �A@ F ZJ�B �"�since

B �A@ F is the density function of the observations, and since the integral of

any density is 1 & Therefore, since��� ��J �� �

ë Æ �)� Æ B �4@"B �A@ F ]ÉpÉ » � �or ë��ò�\fÈ�4@"[Ä�²ë��ò�\fy�4@ F - » � &

Taking limits, this is

�/T �A@%�@ F G�K�/T �A@ F ��@ F » �except on a set of zero probability (by the uniform convergence assumption).

Page 56: Econometrics-Creel (2005)

4.3. THE SCORE FUNCTION 56

By the identification assumption there is a unique maximizer, the inequal-

ity is strict if @�í @ F : ��T=�4@%��@ F G�K��T=�4@ F ��@ F ½ � �?ê?@÷í @ F � a.s.

Suppose that @+[ is a limit point of�@\f (any sequence from a compact set has

at least one limit point). Since�@\f is a maximizer, independent of �G� we must

have

��T=�4@ [ ��@ F G�K��T=�4@ F ��@ F � � &These last two inequalities imply that@ [ @ F � a.s.

Thus there is only one limit point, and it is equal to the true parameter value

with probability one. In other words,� ���f�SUT �@ @ F � a & s &This completes the proof of strong consistency of the MLE. One can use weaker

assumptions to prove weak consistency (convergence in probability to @ F ) of

the MLE. This is omitted here. Note that almost sure convergence implies

convergence in probability.

4.3. The score function

Differentiability: Assume that �'f+�4@" is twice continuously differentiable

in a neighborhood ­²�A@ F of @ F , at least when � is large enough.

Page 57: Econometrics-Creel (2005)

4.3. THE SCORE FUNCTION 57

To maximize the log-likelihood function, take derivatives:�gf5�W¤���@V � VY�\f5�A@V �� f� U(��1 � V ��� � ��BgU � � ã ��@V� �� f� U(��1 �]UÔ�A@VC&This is the score vector (with dim � �p�NC& Note that the score function has ¤ as an

argument, which implies that it is a random function. ¤ (and any exogeneous

variables) will often be suppressed for clarity, but one should not forget that

they are still there.

The ML estimator�@ sets the derivatives to zero:�]f5� �@" �� f� U(��1 �]U� �@" � � &

We will show that ë�V�\ �gU?�A@"I] þ� �Vê� Y& This is the expectation taken with respect

to the density� �4@"Y� not necessarily

� �A@ F �&ë"V�\^�]UÔ�4@"_] Y \ � V ��� � ��BgU � � U?��@V_] � ��B]U � � ��@V`J�BgU Y �� ��BgU � � U?��@V \ � V � ��BgU � � U?��@V_] � ��BgU � � U?��@V`J�BgU Y � V � ��BgU � � U?��@V`J�BgUP&

Page 58: Econometrics-Creel (2005)

4.4. ASYMPTOTIC NORMALITY OF MLE 58

Given some regularity conditions on boundedness of� V � � we can switch the

order of integration and differentiation, by the dominated convergence theo-

rem. This gives

ëaV#\^�]U-�A@V_] � V Y � ��B]U � � U?�@"`J�B]U � V'� �where we use the fact that the integral of the density is 1.� So ëaVN�b�]UÔ�4@" �� H the expectation of the score vector is zero.� This hold for all Y� so it implies that ë�V�]f5�W¤���@" �� &

4.4. Asymptotic normality of MLE

Recall that we assume that �'f+�4@" is twice continuously differentiable. Take

a first order Taylor’s series expansion of �0��¤Z� �@V about the true value @ F H� � �0� �@" �,�4@ F �2�� � V O �0�A@ [ - c �@��c@ F hor with appropriate definitionsd

�4@ [ c �@��c@ F h �U�0�A@ F C�where @ [ fe �@y2��?�;� e `@ F � � ½ge�½ �"& Assume

d�A@ [ is invertible (we’ll justify

this in a minute). So h� c �@p�i@ F h �

d�4@ [ $� 1 h �>�0�4@ F

Page 59: Econometrics-Creel (2005)

4.4. ASYMPTOTIC NORMALITY OF MLE 59

Now consider

d�A@ [ Y& This isd

�A@ [ � V O �,�4@ [ � �V �\f5�A@ [ �� f� U(��1 � �V ��� � U-�A@ [ where the notation � �V �'f �4@" � à �\�\f5�A@Và @ à @ 4 &Given that this is an average of terms, it should usually be the case that this

satisfies a strong law of large numbers (SLLN). Regularity conditions are a set

of assumptions that guarantee that this will happen. There are different sets

of assumptions that can be used to justify appeal to different SLLN’s. For

example, the� �V �)� � U-�A@1[$ must not be too strongly dependent over time, and

their variances must not become infinite. We don’t assume any particular set

here, since the appropriate assumptions will depend upon the particularities

of a given model. However, we assume that a SLLN applies.

Also, since we know that�@ is consistent, and since @ [ je �@ 2þ�?� � e `@ F �

we have that @ [ Q�P ì PR @ F . Also, by the above differentiability assumtion,

d�4@" is

continuous in @ . Given this,

d�A@ [ converges to the limit of it’s expectation:d

�4@ [ Q�P ì PR � �)�f�SUT ë3ð � �V �\f5�A@ F ó d T¾�4@ F ½lkThis matrix converges to a finite limit.

Page 60: Econometrics-Creel (2005)

4.4. ASYMPTOTIC NORMALITY OF MLE 60

Re-arranging orders of limits and differentiation, which is legitimate given

regularity conditions, we getd T¾�4@ F � �V � �)�f�SUT ë��ò�\f5�A@ F - � �V ��T=�A@ F �@ F We’ve already seen that

��T¾�4@%��@ F ½ �/T=�4@ F ��@ F i.e., @ F maximizes the limiting objective function. Since there is a unique max-

imizer, and by the assumption that �'f+�4@" is twice continuously differentiable

(which holds in the limit), then

d T=�A@ F must be negative definite, and there-

fore of full rank. Therefore the previous inversion is justified, asymptotically,

and we have

(4.4.1)

h� c �@��c@ F h Q6P ì PR �

d T=�A@ F $� 1 h �>�0�A@ F C&Now consider

h�>�,�4@ F C& This ish�>�]f+�4@ F

h� � VY�\f5�A@V

h�� f� U(��1 � V ��� � U-��BgU � � U?��@ F �h � f� U(��1 �]U�A@ F

We’ve already seen that ë�V#\^�]U-�A@V_] ®� & As such, it is reasonable to assume that

a CLT applies.

Note that �]f5�4@ F Q6P ì PR � � by consistency. To avoid this collapse to a degenerate

r.v. (a constant vector) we need to scale by

h�& A generic CLT states that, for

Page 61: Econometrics-Creel (2005)

4.4. ASYMPTOTIC NORMALITY OF MLE 61�=f a random vector that satisfies certain conditions,

�=f�� Ù �¬� f"cmR ­²� � � � ��� é����=f�-The “certain conditions” that � f must satisfy depend on the case at hand. Usu-

ally, � f will be of the form of an average, scaled by

h� :

�=f h� � fU(��1 �=U�

This is the case for

h�>�,�4@ F for example. Then the properties of � f depend on

the properties of the � U?& For example, if the � U have finite variances and are

not too strongly dependent, then a CLT for dependent processes will apply.

Supposing that a CLT applies, and noting that Ù �h�n�gf5�A@ F ¡� � we geto T=�A@ F �� 1Ap � h �>�gf �A@ F cmR ­q\ � � ±'ÿ ]

where o T=�A@ F � ���f�SUT ë"VAW ð �X\^�]f+�4@ F _],\^�]f+�4@ F _] 4 ó � ���f�SUT érVAW ð h �>�]f �A@ F óThis can also be written as

(4.4.2)

h�>�]f+�4@ F mR ­q\ � � o T¾�4@ F _]� o T=�A@ F is known as the information matrix.� Combining [4.4.1] and [4.4.2], we geth

� c �@s�s@ F h Qçu­ °@� �d T¾�4@ F � 1 o T=�4@ F d T=�A@ F � 1 ³Ä&

The MLE estimator is asymptotically normally distributed.

Page 62: Econometrics-Creel (2005)

4.4. ASYMPTOTIC NORMALITY OF MLE 62

DEFINITION 1 (CAN). An estimator�@ of a parameter @ F is

h� -consistent

and asymptotically normally distributed if

(4.4.3)

h� c �@��c@ F h mR ­þ� � �$énT�

where érT is a finite positive definite matrix.

There do exist, in special cases, estimators that are consistent such thath� c �@s�s@ F h 6R � & These are known as superconsistent estimators, since nor-

mally,

h� is the highest factor that we can multiply by an still get convergence

to a stable limiting distribution.

DEFINITION 2 (Asymptotic unbiasedness). An estimator�@ of a parameter@ F is asymptotically unbiased if

(4.4.4)� �)�f�StT ëaVJ� �@V @%&

Estimators that are CAN are asymptotically unbiased, though not all consistent

estimators are asymptotically unbiased. Such cases are unusual, though. An

example is

EXERCISE 4.5. Consider an estimator�@ with density� � �@V �� 1f � �@ @ F1f1H �@ �

Show that this estimator is consistent but asymptotically biased. Also ask

yourself how you could define an estimator that would have this density.

Page 63: Econometrics-Creel (2005)

4.6. THE INFORMATION MATRIX EQUALITY 63

4.6. The information matrix equality

We will show that

d T¾�4@" � ± T=�4@"C& Let� U-�4@" be short for

� ��BgU � � U?��@V� Y � U-�A@V`J�B�� so� Y � V � U-�A@V`J�B Y � � V ��� � U?�A@"- � U-�4@"`J�B

Now differentiate again:

� Y ° � �V �)� � U-�A@V ³ � U-�A@V`J�B�2 Y \ � V ��� � UÔ�A@V_] � V O � U-�4@"`J�B ë"V ° � �V ��� � U-�A@"P³Z2 Y \ � V �)� � U-�4@"I]u\ � V O ��� � U-�4@"I] � U-�4@"ZJ�B ë"V ° � �V ��� � U-�A@" ³ 28ëaV#\ � V ��� � U-�A@V_]u\ � V O �)� � U-�A@V_] ë"V#\ d U-�4@"I]T28ëaV#\^�]UÔ�A@V_]u\ �]U-�A@"I] 4(4.6.1)

Now sum over � and multiply by 1fëaV �� f� U(��1 \

dU-�4@"I] �Èë"VMv �� f� U(��1 \ �gU-�4@"I]u\^�]UÔ�4@"_] 40w

The scores �gU and �"ì are uncorrelated for �í �"� since for ¥ ��� � UÔ��BgU � BT1Y�'&(&(&)��B]U � 1Y��@Vhas conditioned on prior information, so what was random in � is fixed in .(This forms the basis for a specification test proposed by White: if the scores

appear to be correlated one may question the specification of the model). This

allows us to write ë"V�\ d �A@"I] �Èë"V ð �X\ �,�4@"I]�\ �,�4@"I] 4 ó

Page 64: Econometrics-Creel (2005)

4.6. THE INFORMATION MATRIX EQUALITY 64

since all cross products between different periods expect to zero. Finally take

limits, we get

(4.6.2)

d T¾�A@V � o T=�4@"Y&This holds for all @%� in particular, for @ F & Using this,h

� c �@��c@ F h Q�P ì PR ­ ° � �d T �A@ F $� 1 o T=�A@ F d T �A@ F $� 1 ³

simplifies to

(4.6.3)

h� c �@p�i@ F h Q�P ì PR ­ °@� � o T=�4@ F $� 1 ³

To estimate the asymptotic variance, we need estimators of

d T¾�4@ F ando T=�4@ F .

We can use xo T=�A@ F � f� U(��1 �]U� �@V_�]U� �@�P4xd T=�A@ F d� �@"Y&

Note, one can’t usex± T=�4@ F � n �]f � �@V r n �gf5� �@V r 4

to estimate the information matrix. Why not?

From this we see that there are alternative ways to estimate é?T=�4@ F that are

all valid. These includexénT=�4@ F �

xd T=�A@ F � 1xénT=�4@ F

xo T=�A@ F � 1xénT=�4@ F

xd T �A@ F � 1 xo T=�A@ F xd T=�4@ F � 1

Page 65: Econometrics-Creel (2005)

4.7. THE CRAMÉR-RAO LOWER BOUND 65

These are known as the inverse Hessian, outer product of the gradient (OPG) and

sandwich estimators, respectively. The sandwich form is the most robust, since

it coincides with the covariance estimator of the quasi-ML estimator.

4.7. The Cramér-Rao lower bound

THEOREM 3. [Cramer-Rao Lower Bound] The limiting variance of a CAN

estimator of @ F , say ý@ , minus the inverse of the information matrix is a positive

semidefinite matrix.

Proof: Since the estimator is CAN, it is asymptotically unbiased, so� �)�f�SUT ë"VJ� ý@p�i@" a�Differentiate wrt @ 4 H� V O � ���f�SUT ëaVJ� ý@p�i@V � �)�f�SUT Y � V O n � ��¤Z�@" c ý@��c@ h r J�B � � this is a � � � matrix of zeros C&Noting that

� V O � �W¤���@" � �A@V � V O �)� � �A@VC� we can write� ���f�StT Y c ý@��c@ h � �4@" � V O ��� � �4@"ZJ�Bp2 � ���f�StT Y � �W¤���@V � V O c ý@p�i@ h J�B �� &Now note that

� V O c ý@p�i@ h � ±\ÿ � and y � �W¤���@"v�Ô� ±'ÿ ZJ�B � ±\ÿ & With this we

have � ���f�SUT Y c ý@p�i@ h � �A@" � V O ��� � �A@"ZJ�B a±\ÿ &Playing with powers of � we get� �)�f�SUT Y h

� c ý@��c@ h h � �� \ � V O ��� � �4@"_]z {}| ~ � �A@V`J�B ±\ÿ

Page 66: Econometrics-Creel (2005)

4.7. THE CRAMÉR-RAO LOWER BOUND 66

Note that the bracketed part is just the transpose of the score vector, �0�A@VC� so

we can write � �)�f�SUT ëaV n h � c ý@��c@ h h �?�0�A@"P4 r ¡±'ÿThis means that the covariance of the score function with

h� c ý@p�i@ h � for ý@

any CAN estimator, is an identity matrix. Using this, suppose the variance ofh� c ý@s�s@ h tends to énT¾� ý@�C& Therefore,

(4.7.1) énT���h� c ý@p�i@ hh�n�0�A@" �� �� énT=�'ý@" ±'ÿ±'ÿ o T=�A@V �� &

Since this is a covariance matrix, it is positive semi-definite. Therefore, for any� -vector � �n � 4 � � 4 o � 1T �A@V r �� érT¾� ý@� ±'ÿ±'ÿ o T=�A@V �� �� �� o T=�4@" � 1 � �� � � &

This simplifies to � 4 c érT¾�'ý@VG� o � 1T �A@" h � � � &Since � is arbitrary, énT=�'ý@VZ� o T=�A@V is positive semidefinite. This conludes the

proof.

This means thato � 1T �A@V is a lower bound for the asymptotic variance of a

CAN estimator.

DEFINITION 4.7.1. (Asymptotic efficiency) Given two CAN estimators of a

parameter @ F , say ý@ and�@ , �@ is asymptotically efficient with respect to ý@ iférT=� ý@"G��énT=� �@" is a positive semidefinite matrix.

A direct proof of asymptotic efficiency of an estimator is infeasible, but

if one can show that the asymptotic variance is equal to the inverse of the

Page 67: Econometrics-Creel (2005)

4.7. THE CRAMÉR-RAO LOWER BOUND 67

information matrix, then the estimator is asymptotically efficient. In particular,

the MLE is asymptotically efficient.

Summary of MLE� Consistent� Asymptotically normal (CAN)� Asymptotically efficient� Asymptotically unbiased� This is for general MLE: we haven’t specified the distribution or the

linearity/nonlinearity of the estimator

Page 68: Econometrics-Creel (2005)

EXERCISES 68

Exercises

(1) Consider coin tossing with a single possibly biased coin. The density func-

tion for the random variable B ��� ¹ º ô�JT�J is�<; ��B��ò� F �"K F �?�È�i� F 1 � K ��BLC²S � �'�VX � ��B ÂCjS � �'�VXSuppose that we have a sample of size � . We know from above that the ML

estimator is�� F ÝB . We also know from the theory above thath� �YÝBÜ�i� F QçR­ °@� �

d T=�@� F � 1 o T=��� F d T=��� F � 1 ³a) find the analytical expressions for

d T=�@� F ando T=��� F for this problem

b) Write an Octave program that does a Monte Carlo study that shows thath� �CÝBÜ��� F is approximately normally distributed when � is large. Please

give me histograms that show the sampling frequency of

h� �vÝBÜ�i� F for

several values of � .

(2) Consider the model BVU � 4U �j2 � A-U where the errors follow the Cauchy

(Student-t with 1 degree of freedom) density. So� �WA-UW �� �Ô��2MA �U �'� k ½ AÔU ½�kThe Cauchy density has a shape similar to a normal density, but with much

thicker tails. Thus, extremely small and large errors occur much more fre-

quently with this density than would happen if the errors were normally

distributed. Find the score function �gf+�4@" where @ lc � 4 � h 4 .(3) Consider the model classical linear regression model B"U Ç� 4U �i2¡A-U whereAÔU�ç ±�± ­j� � ��è � . Find the score function �gf+�4@" where @ c � 4 è h 4 .

Page 69: Econometrics-Creel (2005)

EXERCISES 69

(4) Compare the first order conditional that define the ML estimators of prob-

lems 2 and 3 and interpret the differences. Why are the first order condi-

tions that define an efficient estimator different in the two cases?

Page 70: Econometrics-Creel (2005)

CHAPTER 5

Asymptotic properties of the least squares estimator

The OLS estimator under the classical assumptions is unbiased and BLUE,

for all sample sizes. Now let’s see what happens when the sample size tends

to infinity.

5.1. Consistency

�� �¬� 4 �3 � 1 � 4 B �¬�i4��3$� 1 �i4J�����^28/V � F 2a���i4��:$� 1 ��4�/ � F 2 Æ � 4 �� É � 1 � 4 /�Consider the last two terms. By assumption

� �)� f�StT ð « O «f ó �æ « Á � ��� f�SUT ð « O «f ó � 1 æ � 1« � since the inverse of a nonsingular matrix is a continuous function of the

elements of the matrix. Considering« O îf �� 4 /� �� f� U(��1 � U¬/]U

Each � U¬/JU has expectation zero, so

Ù Æ � 4 /�eÉ ��70

Page 71: Econometrics-Creel (2005)

5.2. ASYMPTOTIC NORMALITY 71

The variance of each term is

é � � U�A-UW � U � 4U è � &As long as these are finite, and given a technical condition1, the Kolmogorov

SLLN applies, so �� f� U(��1 � U¬/]U Q�P ì PR � &This implies that �� Q�P ì PR � F &This is the property of strong consistency: the estimator converges in almost

surely to the true value.� The consistency proof does not use the normality assumption.� Remember that almost sure convergence implies convergence in prob-

ability.

5.2. Asymptotic normality

We’ve seen that the OLS estimator is normally distributed under the assump-

tion of normal errors. If the error distribution is unknown, we of course don’t

know the distribution of the estimator. However, we can get asymptotic re-

sults. Assuming the distribution of / is unknown, but the the other classical

assumptions hold:

1For application of LLN’s and CLT’s, of which there are very many to choose from, I’m goingto avoid the technicalities. Basically, as long as terms of an average have finite variances andare not too strongly dependent, one will be able to find a LLN or CLT to apply.

Page 72: Econometrics-Creel (2005)

5.3. ASYMPTOTIC EFFICIENCY 72

�� � F 2a���i4��:$� 1 �i4�/������ F ���i4��:$� 1 �i4�/h� c ��£�j� F h Æ � 4 �� É � 1 � 4 /

h�� Now as before, ð « O «f ó � 1 R æ � 1« &� Considering

« O î� f � the limit of the variance is� ���f�SUT é Æ � 4 /h�ÞÉ � �)�f�SUT Ù Æ � 4 A�A 4 �� É è �F æ «

The mean is of course zero. To get asymptotic normality, we need to

apply a CLT. We assume one (for instance, the Lindeberg-Feller CLT)

holds, so � 4 /h� mR ­ ð � ��è �F æ « ó

Therefore,h� c ��÷��� F h mR ­ ð � ��è �F æ � 1« ó� In summary, the OLS estimator is normally distributed in small and

large samples if / is normally distributed. If / is not normally dis-

tributed,�� is asymptotically normally distributed when a CLT can be

applied.

5.3. Asymptotic efficiency

The least squares objective function is

Page 73: Econometrics-Creel (2005)

5.3. ASYMPTOTIC EFFICIENCY 73

�����G f� U(��1 ��BgU�� � 4U � �Supposing that / is normally distributed, the model is

B �i� F 2�/%�/ ç ­²� � ��è �F ± f"C� so� �¬/" fG U(��1 �h

# � è � }Y~%� Æ,� / �U#gè � ÉThe joint density for B can be constructed using a change of variables. We have/ BÜ�j����� so á îá K O a± f and � á îá K O � �"� so� ��B5 fG U(��1 �h

# � è � }C~%� ÆI� ��BgU�� � 4U �Ô�#gè � É &Taking logs, ��� B ������è� �y� ���

h# � �j� �)� è�� f� U(��1 ��BgU�� � 4U � �#gè � &

It’s clear that the fonc for the MLE of � F are the same as the fonc for OLS (up

to multiplication by a constant), so the estimators are the same, under the present

assumptions. Therefore, their properties are the same. In particular, under the

classical assumptions with normality, the OLS estimator�� is asymptotically efficient.

As we’ll see later, it will be possible to use (iterated) linear estimation

methods and still achieve asymptotic efficiency even if the assumption thatéÜô À ��/VÜí è � ± f%� as long as / is still normally distributed. This is not the case if

Page 74: Econometrics-Creel (2005)

5.3. ASYMPTOTIC EFFICIENCY 74/ is nonnormal. In general with nonnormal errors it will be necessary to use

nonlinear estimation methods to achieve asymptotically efficient estimation.

Page 75: Econometrics-Creel (2005)

CHAPTER 6

Restrictions and hypothesis tests

6.1. Exact linear restrictions

In many cases, economic theory suggests restrictions on the parameters of

a model. For example, a demand function is supposed to be homogeneous

of degree zero in prices and income. If we have a Cobb-Douglas (log-linear)

model, �)� � � F 2K�01 ��� ��1I28� � �)� � � 2K� | ��� � 28/T�then we need that

Q F ��� � � F 2K�01 �)� Q]�01�2K� � ��� Q]� � 28� | ��� Q%� 28/T�so

�01 ��� �01�2K� � �)� � � 2K� | ��� � �01 ��� Q]�01�2K� � �)� Q]� � 2K� | ��� Q%� � �)� Q5����01,2K� � 2K� | �28�01 ��� �01,2K� � �)� � � 28� | ��� ��&The only way to guarantee this for arbitrary Q is to set

�01,2K� � 2K� | �� �which is a parameter restriction. In particular, this is a linear equality restriction,

which is probably the most commonly encountered case.

75

Page 76: Econometrics-Creel (2005)

6.1. EXACT LINEAR RESTRICTIONS 76

6.1.1. Imposition. The general formulation of linear equality restrictions

is the model

B ����2�/�p� Àwhere � is a æ � � matrix, æ ½Ë� and

Àis a æ �²� vector of constants.� We assume � is of rank æ � so that there are no redundant restrictions.� We also assume that � � that satisfies the restrictions: they aren’t infea-

sible.

Let’s consider how to estimate � subject to the restrictions �p� À & The most

obvious approach is to set up the Lagrangean

� � �x ����� �� ��BÜ�²�i�G 4 ��B¯�²����2Ë# e 4 ���p��� À C&The Lagrange multipliers are scaled by 2, which makes things less messy. The

fonc are � x ��� ��¢� �e �µ#]� 4 B�2M#]� 4 � ��r�=2Ë#V� 4 �e � ��M� ��� ��¢� �e � ��r�^� À � � �which can be written as�� � 4 � � 4� � �� �� ��r��e �� �� � 4 BÀ �� &We get �� ��r��e �� �� � 4 � � 4� � �� � 1 �� � 4 BÀ �� &

Page 77: Econometrics-Creel (2005)

6.1. EXACT LINEAR RESTRICTIONS 77

For the masochists: Stepwise Inversion

Note that�� ��� 4 �: � 1 ��µ�M��� 4 �: � 1 ±�� �� �� � 4 � � 4� � �� � w&� �� ±'ÿ ��� 4 �: � 1 � 4� �µ�M��� 4 �: � 1 � 4 ��� �� ±'ÿ ��� 4 �: � 1 � 4� � ª ��� û �

and �� ±'ÿ �¬� 4 �: � 1 � 4 ª � 1� � ª � 1 �� �� ±'ÿ �¬� 4 �: � 1 � 4� � ª �� � � û ±\ÿ���� �

so � w�� ±'ÿ����� w � � 1� � 1 �� ±\ÿ ��� 4 �: � 1 � 4 ª � 1� � ª � 1 �� �� �¬� 4 �: � 1 ��;�Ë�¬� 4 �: � 1 ±}� �� �� �¬� 4 �3 � 1 ����� 4 �: � 1 � 4 ª � 1 �K��� 4 �: � 1 ��� 4 �: � 1 � 4 ª � 1ª � 1 �M��� 4 �: � 1 � ª � 1 �� �

Page 78: Econometrics-Creel (2005)

6.1. EXACT LINEAR RESTRICTIONS 78

so (everyone should start paying attention again)�� ��r��e �� �� �¬� 4 �: � 1 ����� 4 �: � 1 � 4 ª � 1 �M�¬� 4 �3 � 1 ��� 4 �: � 1 � 4 ª � 1ª � 1 �M��� 4 �: � 1 � ª � 1 �� �� � 4 BÀ �� ��� ��£�¡��� 4 �: � 1 � 4 ª � 1 c � ���� À hª � 1 c � ���� À h ���� �� � ±'ÿ �¡��� 4 �: � 1 � 4 ª � 1 �sª � 1 � �� ���2 �� ��� 4 �: � 1 � 4 ª � 1 À� ª � 1 À ��

The fact that��r� and

�e are linear functions of�� makes it easy to determine their

distributions, since the distribution of�� is already known. Recall that for � a

random vector, and for w and 7 a matrix and vector of constants, respectively,éÜô À � w;� 2�7C ¡w é¯ô À � � w 4 &Though this is the obvious way to go about finding the restricted estima-

tor, an easier way, if the number of restrictions is small, is to impose them by

substitution. Write

B ��1Ô�01,28� � � � 28/n �Þ1 � � r �� �01� � �� Àwhere �Þ1 is æ � æ nonsingular. Supposing the æ restrictions are linearly inde-

pendent, one can always make �¯1 nonsingular by reorganizing the columns of�i& Then �01 �¯� 11 À ���¯� 11 � � � � &

Page 79: Econometrics-Creel (2005)

6.1. EXACT LINEAR RESTRICTIONS 79

Substitute this into the model

B ��1-�¯� 11 À �²��1-�¯� 11 � � � � 28� � � � 28/BÜ�j��1Ô�¯� 11 À ° � � �j��1Ô�¯� 11 � � ³0� � 2�/or with the appropriate definitions,

B(� ����� � 28/T&This model satisfies the classical assumptions, supposing the restriction is true.

One can estimate by OLS. The variance of�� � is as before

é�� �� � �¬�i4� ���, � 1 è �Fand the estimator is �é�� �� � �¬� 4� ���, � 1 �è �where one estimates è��F in the normal way, using the restricted model, i.e.,�è �F c B+�^�Ö��� �� � h 4 c B(�^�²��� �� � h�÷� � � � æ To recover

��01Y� use the restriction. To find the variance of��01C� use the fact that it

is a linear function of�� � � so

é^� ��01Ô � � 11 � � é^� �� � -� 4� ð � � 11 ó 4 � � 11 � � ��� 4� � � � 1 � 4� ð×� � 11 ó 4 è �F

Page 80: Econometrics-Creel (2005)

6.1. EXACT LINEAR RESTRICTIONS 80

6.1.2. Properties of the restricted estimator. We have that��r� ��÷�¡���i4��:$� 1 �µ4 ª � 1 c � ���� À h ��b2��¬�i4��:�� 1 �µ4 ª � 1 À �¡�¬�i4��3$� 1 �µ4 ª � 1 �=�¬�i4��3$� 1 �i4�B �b2��¬�i4��:�� 1 �i4©/72��¬�i4��3$� 1 �µ4 ª � 1 \ À �8�p��]+������4��:�� 1 �µ4 ª � 1 �=����4©�:$� 1 �i4�/��r�^��� �¬�i4��:�� 1 �i4�/2 �¬�i4��:�� 1 �µ4 ª � 1 \ À �8�p��]� �¬�i4��:�� 1 �µ4 ª � 1 �=�¬�i4��:�� 1 �i4�/Mean squared error is ´ Û¢Ù � ��r�, ë¼� ��r�b���\� ��r�b���?4Noting that the crosses between the second term and the other terms expect to

zero, and that the cross of the first and third has a cancellation with the square

of the third, we obtain´ Û¢Ù � ��r�� ��� 4 �: � 1 è �2 ���i4��:$� 1 �µ4 ª � 1 \ À ���p��]u\ À ���p��] 4 ª � 1 �=�¬�i4��3$� 1� ���i4��:$� 1 �µ4 ª � 1 �=���i4��:$� 1 è �So, the first term is the OLS covariance. The second term is PSD, and the third

term is NSD.� If the restriction is true, the second term is 0, so we are better off. True

restrictions improve efficiency of estimation.� If the restriction is false, we may be better or worse off, in terms of

MSE, depending on the magnitudes ofÀ ���p� and è � &

Page 81: Econometrics-Creel (2005)

6.2. TESTING 81

6.2. Testing

In many cases, one wishes to test economic theories. If theory suggests pa-

rameter restrictions, as in the above homogeneity example, one can test theory

by testing parameter restrictions. A number of tests are available.

6.2.1. t-test. Suppose one has the model

B ���b28/and one wishes to test the single restriction

dF H��p� À vs.

d��H��p��í À . Under

dF � with normality of the errors,

� ��£� À çÚ­ ð � �$�=��� 4 �: � 1 � 4 è �FNóso � ��£� À� �=�¬� 4 �: � 1 � 4 è �F � ���� Àè F � �=��� 4 �3 � 1 � 4 çR­þ� � �'�J�&The problem is that è��F is unknown. One could use the consistent estimator

�è �Fin place of è,�F � but the test would only be valid asymptotically in this case.

PROPOSITION 4.

(6.2.1)­j� � �'�J� � z à �PÅ� ça v�4�V

as long as the ­j� � �'�J and the � � �4�V are independent.

We need a few results on the � � distribution.

PROPOSITION 5. If � çR­j�b3Ä� ± f� is a vector of � independent r.v.’s., then

(6.2.2) � 4 � � � ���� e

Page 82: Econometrics-Creel (2005)

6.2. TESTING 82

where e� � * 3 �* 3 4 3 is the noncentrality parameter.

When a � � r.v. has the noncentrality parameter equal to zero, it is referred

to as a central � � r.v., and it’s distribution is written as � � ���,C� suppressing the

noncentrality parameter.

PROPOSITION 6. If the � dimensional random vector � ç ­²� � �Yé¯C� then� 4 é � 1 � ç�� � ���,C&We’ll prove this one as an indication of how the following unproven propo-

sitions could be proved.

Proof: Factor é � 1 asªÞª 4 (this is the Cholesky factorization). Then considerB ª 4 � & We have B=çR­j� � � ª 4�é ª

but

é ªÞª 4 ± fª 4 é ªÞª 4 ª 4soª é ª 4 �± f and thus B¾çu­j� � � ± f" . Thus B 4 B=ç�� � ���, but

B�4�B ¡� 4 ªÞª 4 �� � é=� 1 �and we get the result we wanted.

A more general proposition which implies this result is

PROPOSITION 7. If the � dimensional random vector � çu­j� � �YéÜY� then

(6.2.3) � 4 �¯� ç�� � ����� � -if and only if � é is idempotent.

Page 83: Econometrics-Creel (2005)

6.2. TESTING 83

An immediate consequence is

PROPOSITION 8. If the random vector (of dimension � ) � çu­j� � � ± C� and �is idempotent with rank

À � then

(6.2.4) � 4 �¯� ç�� � � À Y&Consider the random variable�/ 4 �/è �F / 4 ´:« /è �F Æ /è F É 4 ´:« Æ /è F Éç � � ����� � PROPOSITION 9. If the random vector (of dimension � ) � çl­²� � � ± Y� thenw;� and � 4 �¯� are independent if w&� a� &Now consider (remember that we have only one restriction in this case)���x ���� W h � à « O « Å�� � O� �î O �îà f � ÿ Å � zW � ���� À�è F � �=��� 4 �: � 1 � 4

This will have the v���K� � distribution if�� and

�/ 4 �/ are independent. But�� ��2��¬� 4 �3 � 1 � 4 / and

�¬�i4��:�� 1 �i4 ´:« a� �so � ��£� À�è F � �=��� 4 �: � 1 � 4 � ���� À�è ���x ça v����� �

Page 84: Econometrics-Creel (2005)

6.2. TESTING 84

In particular, for the commonly encountered test of significance of an individual

coefficient, for which

dF H�� *0 a� vs.

dF H�� * í �� , the test statistic is�� *�è �x * ça v���÷� � � Note: the Y� test is strictly valid only if the errors are actually normally

distributed. If one has nonnormal errors, one could use the above as-

ymptotic result to justify taking critical values from the ­j� � �'�J distri-

bution, since v����� � mR ­²� � �\�J as � R k & In practice, a conservative

procedure is to take critical values from the distribution if nonnor-

mality is suspected. This will reject

dF less often since the distribu-

tion is fatter-tailed than is the normal.

6.2.2.2

test. The2

test allows testing multiple restrictions jointly.

PROPOSITION 10. If � ç�� � � À and B=ç�� � �ò�JC� then

(6.2.5)�� ÀB  � ç 2 � À �Y�J

provided that � and B are independent.

PROPOSITION 11. If the random vector (of dimension � ) � ç ­j� � � ± C� then� 4 wy� and � 4 �¯� are independent if w�� �� &Using these results, and previous results on the � � distribution, it is simple

to show that the following statistic has the2

distribution:2 c � ��£� À h 4 ð �M��� 4 �: � 1 � 4 ó � 1 c � ���� À h� �è � ç 2 �4�T����� � Y&A numerically equivalent expression is

Page 85: Econometrics-Creel (2005)

6.2. TESTING 85

� ÙÜÛZÛ �^� ÙÜÛ�Û�   �Ù Û�Û�   ����� � ç 2 �4�T����� � Y&� Note: The2

test is strictly valid only if the errors are truly normally

distributed. The following tests will be appropriate when one cannot

assume normally distributed errors.

6.2.3. Wald-type tests. The Wald principle is based on the idea that if a

restriction is true, the unrestricted model should “approximately” satisfy the

restriction. Given that the least squares estimator is asymptotically normally

distributed:h� c ��÷��� F h mR ­ ð � ��è �F æ � 1« ó

then under

dF H��p� F À � we haveh

� c � ��£� À h mR ­ñð � ��è �F � æ � 1« �µ4 óso by Proposition [6]

� c � ���� À h 4 ð è �F � æ � 1« �µ4 ó � 1 c � ��£� À h mR � � �4�VNote that æ � 1« or è �F are not observable. The test statistic we use substitutes the

consistent estimators. Use �¬� 4 �  �, � 1 as the consistent estimator of æ � 1« & With

this, there is a cancellation of � 4 �"� and the statistic to use is

c � ���� À h 4 c �è �F �=���i4��:$� 1 �µ4 h � 1 c � ���� À h mR � � �4�V� The Wald test is a simple way to test restrictions without having to

estimate the restricted model.� Note that this formula is similar to one of the formulae provided for

the2

test.

Page 86: Econometrics-Creel (2005)

6.2. TESTING 86

6.2.4. Score-type tests (Rao tests, Lagrange multiplier tests). In some cases,

an unrestricted model may be nonlinear in the parameters, but the model is

linear in the parameters under the null hypothesis. For example, the model

B �¬���I¡728/is nonlinear in � and ��� but is linear in � under

dF Ht� �"& Estimation of

nonlinear models is a bit more complicated, so one might prefer to have a

test based upon the restricted, linear model. The score test is useful in this

situation.� Score-type tests are based upon the general principle that the gradient

vector of the unrestricted model, evaluated at the restricted estimate,

should be asymptotically normally distributed with mean zero, if the

restrictions are true. The original development was for ML estimation,

but the principle is valid for a wide variety of estimation methods.

We have seen that �e ð �=���i4��:$� 1 �µ4 ó � 1 c � ���� À h ª � 1 c � ��£� À hGiven that

h� c � ��£� À h mR ­ ð � ��è �F � æ � 1« � 4 ó

under the null hypothesis,h� �e mR ­ ð � ��è �F ª � 1 � æ � 1« � 4 ª � 1 ó

orh� �e mR ­ ð � ��è �F � �)� � ��� ª � 1 � æ � 1« �µ4 ª � 1 ó

Page 87: Econometrics-Creel (2005)

6.2. TESTING 87

since the � ’s cancel and inserting the limit of a matrix of constants changes

nothing.

However, � �)� � ª � ��� �,�=�¬�i4��:�� 1 �µ4 � ��� ��Æ � 4 �� É � 1 �µ4 � æ � 1« �µ4So there is a cancellation and we geth

� �e mR ­ñð � ��è �F � ��� � ª � 1 óIn this case, �e 4 Æ �=��� 4 �: � 1 � 4è �F É �e mR � � �4�Vsince the powers of � cancel. To get a usable test statistic substitute a consistent

estimator of è �F &� This makes it clear why the test is sometimes referred to as a Lagrange

multiplier test. It may seem that one needs the actual Lagrange mul-

tipliers to calculate this. If we impose the restrictions by substitution,

these are not available. Note that the test can be written asc � 4 �e h 4 ��� 4 �: � 1 � 4 �eè �F mR � � �4�VHowever, we can use the fonc for the restricted estimator:

�È�i4@B�2��i4�� ��r�=2M�µ4 �e

Page 88: Econometrics-Creel (2005)

6.2. TESTING 88

to get that

�µ4 �e �i4���B �²� ��r�0 �i4 �/1�Substituting this into the above, we get�/ 4 � �K�¬� 4 �: � 1 � 4 �/(�è �F mR � � �4�Vbut this is simply �/ 4 � ª�«è �F �/(� mR � � �4�VC&

To see why the test is also known as a score test, note that the fonc for restricted

least squares �È�i4@B�2��i4�� ��r�=2M�µ4 �egive us �µ4 �e^ �i4�BÜ�j�i4�� ��r�and the rhs is simply the gradient (score) of the unrestricted model, evaluated

at the restricted estimator. The scores evaluated at the unrestricted estimate are

identically zero. The logic behind the score test is that the scores evaluated at

the restricted estimate should be approximately zero, if the restriction is true.

The test is also known as a Rao test, since P. Rao first proposed it in 1948.

Page 89: Econometrics-Creel (2005)

6.2. TESTING 89

6.2.5. Likelihood ratio-type tests. The Wald test can be calculated using

the unrestricted model. The score test can be calculated using only the re-

stricted model. The likelihood ratio test, on the other hand, uses both the re-

stricted and the unrestricted estimators. The test statistic isB � # c ��� B � �@"G� ��� B � ý@" hwhere

�@ is the unrestricted estimate and ý@ is the restricted estimate. To show

that it is asymptotically ���N� take a second order Taylor’s series expansion of��� B �'ý@" about�@¾H ��� B �'ý@"�¢ ��� B � �@��2 � # c ý@p� �@ h 4 d � �@g c ý@�� �@ h

(note, the first order term drops out since� V ��� B � �@V � � by the fonc and we

need to multiply the second-order term by � since

d�4@" is defined in terms of1f �)� B �4@" ) so B �£¢ �y� c ý@p� �@ h 4 d � �@V c ý@s� �@ h

As � R k � d � �@g R d T¾�4@ F � o �4@ F Y� by the information matrix equality. SoB � Q � c ý@p� �@ h 4 o T¾�4@ F c ý@�� �@ hWe also have that, from [??] thath

� c �@p�i@ F h Q o T=�A@ F �� 1 � 1Ap � �0�A@ F C&An analogous result for the restricted estimator is (this is unproven here, to

prove this set up the Lagrangean for MLE subject to �p� À � and manipulate

Page 90: Econometrics-Creel (2005)

6.3. THE ASYMPTOTIC EQUIVALENCE OF THE LR, WALD AND SCORE TESTS 90

the first order conditions) :h� c ý@p�i@ F h Q o T=�A@ F �� 1 c\± fs���µ4%ðP� o T=�4@ F $� 1 �µ4 ó � 1 � o T=�4@ F $� 1 h � 1Ap � �,�4@ F C&

Combining the last two equationsh� c ý@p� �@ h Q �;� 1Ap � o T=�A@ F �� 1 �µ4 ð � o T=�A@ F �� 1 �µ4 ó � 1 � o T=�4@ F $� 1 �0�A@ F

so, substituting into [??]B � Q ° � 1Ap � �,�4@ F 4 o T=�A@ F � 1 � 4 ³ ° � o T=�4@ F � 1 � 4 ³ � 1 ° � o T=�4@ F � 1 � 1Ap � �0�4@ F ?³But since � 1Ap � �0�A@ F mR ­þ� � � o T¾�A@ F -the linear function

� o T=�A@ F � 1 � 1Ap � �0�A@ F mR ­j� � �$� o T=�4@ F � 1 � 4 C&We can see that LR is a quadratic form of this rv, with the inverse of its variance

in the middle, so B � mR � � �b�"Y&6.3. The asymptotic equivalence of the LR, Wald and score tests

We have seen that the three tests all converge to � � random variables. In

fact, they all converge to the same � � rv, under the null hypothesis. We’ll show

that the Wald and LR tests are asymptotically equivalent. We have seen that

the Wald test is asymptotically equivalent to

ü Q � c � ���� À h 4 ð è �F � æ � 1« �µ4 ó � 1 c � ��£� À h mR � � �b�"

Page 91: Econometrics-Creel (2005)

6.3. THE ASYMPTOTIC EQUIVALENCE OF THE LR, WALD AND SCORE TESTS 91

Using ��£��� F ��� 4 �: � 1 � 4 /and � ��£� À �=� ��£��� F we get h

�,�=� ��£��� F h�0�=���i4��:$� 1 �i4�/ � Æ � 4 �� É � 1 � � 1Ap � � 4 /

Substitute this into [??] to get

ü Q ��� 1 /J4�� æ � 1« �µ4 ð è �F � æ � 1« �µ4 ó � 1 � æ � 1« �i4�/Q / 4 �K��� 4 �: � 1 � 4%ð è �F �=��� 4 �: � 1 � 4 ó � 1 �=��� 4 �: � 1 � 4 /Q / 4 w � w 4 w � 1 w 4 /è �FQ / 4 ª �E/è �Fwhere

ª � is the projection matrix formed by the matrix �K��� 4 �: � 1 � 4 .� Note that this matrix is idempotent and has � columns, so the projec-

tion matrix has rank �T&Now consider the likelihood ratio statisticB � Q � 1Ap � �0�A@ F P4 o �A@ F $� 1 �µ4 ð � o �4@ F $� 1 �µ4 ó � 1 � o �A@ F �� 1 � 1Ap � �0�A@ F Under normality, we have seen that the likelihood function is�)� B ������èI �y� ���

h# � ��� ��� è�� �# ��BÜ�j��� 4 ��B¯�j���è � &

Page 92: Econometrics-Creel (2005)

6.3. THE ASYMPTOTIC EQUIVALENCE OF THE LR, WALD AND SCORE TESTS 92

Using this, �0��� F � � x �� ��� B ������èI � 4 ��B¯�j��� F ��è � � 4 /�0è �Also, by the information matrix equality:o �A@ F �

d T=�4@ F � �)� � � x O �0��� F � �)� � � x O � 4 ��BÜ�j�i� F ��è � � �)� � 4 ���è � æ «è �so o �A@ F � 1 è � æ � 1«Substituting these last expressions into [??], we getB � Q / 4 � 4 ��� 4 �: � 1 � 4 ðòè �F �=��� 4 �: � 1 � 4 ó � 1 �=��� 4 �: � 1 � 4 /Q / 4 ª �E/è �FQ üThis completes the proof that the Wald and LR tests are asymptotically equiv-

alent. Similarly, one can show that, under the null hypothesis,� 2 Q ü Q BZ´ Q B �

Page 93: Econometrics-Creel (2005)

6.3. THE ASYMPTOTIC EQUIVALENCE OF THE LR, WALD AND SCORE TESTS 93� The proof for the statistics except forB � does not depend upon nor-

mality of the errors, as can be verified by examining the expressions

for the statistics.� TheB � statistic is based upon distributional assumptions, since one

can’t write the likelihood function without them.� However, due to the close relationship between the statistics � 2 andB � � supposing normality, the � 2 statistic can be thought of as a pseudo-

LR statistic, in that it’s like a LR statistic in that it uses the value of

the objective functions of the restricted and unrestricted models, but it

doesn’t require distributional assumptions.� The presentation of the score and Wald tests has been done in the

context of the linear model. This is readily generalizable to nonlinear

models and/or other estimation methods.

Though the four statistics are asymptotically equivalent, they are numerically

different in small samples. The numeric values of the tests also depend upon

how è � is estimated, and we’ve already seen than there are several ways to do

this. For example all of the following are consistent for è � under

dF

�î O �îf � D�î O �îf�î O ¤ �î ¤f � D�� ��î O ¤ �î ¤fand in general the denominator call be replaced with any quantity ô such that� �)� ô  � �"&

Page 94: Econometrics-Creel (2005)

6.5. CONFIDENCE INTERVALS 94

It can be shown, for linear regression models subject to linear restrictions,

and if �î O �îf is used to calculate the Wald test and �î O ¤ �î ¤f is used for the score test,

that ü ¥ B � ¥ BZ´ &For this reason, the Wald test will always reject if the LR test rejects, and in

turn the LR test rejects if the LM test rejects. This is a bit problematic: there is

the possibility that by careful choice of the statistic used, one can manipulate

reported results to favor or disfavor a hypothesis. A conservative/honest ap-

proach would be to report all three test statistics when they are available. In

the case of linear models with normal errors the2

test is to be preferred, since

asymptotic approximations are not an issue.

The small sample behavior of the tests can be quite different. The true size

(probability of rejection of the null when the null is true) of the Wald test is

often dramatically higher than the nominal size associated with the asymptotic

distribution. Likewise, the true size of the score test is often smaller than the

nominal size.

6.4. Interpretation of test statistics

Now that we have a menu of test statistics, we need to know how to use

them.

6.5. Confidence intervals

Confidence intervals for single coefficients are generated in the normal

manner. Given the statistic

v��� ����²�¥ è �x

Page 95: Econometrics-Creel (2005)

6.6. BOOTSTRAPPING 95

a � �"� �?�È� � a¦ confidence interval for � F is defined by the bounds of the set of� such that v��� does not reject

dF H�� F ��� using a � significance level:

û � � SJ�ÖH5� ¸6§ p � ½ ������¥è �x ½Ë¸¨§ p � XThe set of such � is the interval ��ª© ¥ è �x ¸ § p �

A confidence ellipse for two coefficients jointly would be, analogously, the

set of { �,1C�[� � X such that the2

(or some other test statistic) doesn’t reject at the

specified critical value. This generates an ellipse, if the estimators are corre-

lated. � The region is an ellipse, since the CI for an individual coefficient de-

fines a (infinitely long) rectangle with total prob. mass � � � � since the

other coefficient is marginalized (e.g., can take on any value). Since the

ellipse is bounded in both dimensions but also contains mass �;� � � itmust extend beyond the bounds of the individual CI.� From the pictue we can see that:

– Rejection of hypotheses individually does not imply that the joint

test will reject.

– Joint rejection does not imply individal tests will reject.

6.6. Bootstrapping

When we rely on asymptotic theory to use the normal distribution-based

tests and confidence intervals, we’re often at serious risk of making impor-

tant errors. If the sample size is small and errors are highly nonnormal, the

small sample distribution of

h� c ��÷��� F h may be very different than its large

Page 96: Econometrics-Creel (2005)

6.6. BOOTSTRAPPING 96

FIGURE 6.5.1. Joint and Individual Confidence Regions

Page 97: Econometrics-Creel (2005)

6.6. BOOTSTRAPPING 97

sample distribution. Also, the distributions of test statistics may not resemble

their limiting distributions at all. A means of trying to gain information on the

small sample distribution of test statistics and estimators is the bootstrap. We’ll

consider a simple example, just to get the main idea.

Suppose that

B ��� F 2�// ç ±T± � � � ��è �F � is nonstochastic

Given that the distribution of / is unknown, the distribution of�� will be un-

known in small samples. However, since we have random sampling, we could

generate artificial data. The steps are:

(1) Draw � observations from�/ with replacement. Call this vector ý/ ß (it’s

a �3�j�JC&(2) Then generate the data by ýB ß � ��b2 ý/ ß(3) Now take this and estimate

ý� ß ����4��:�� 1 �i4 ýB ß &(4) Save ý� ß(5) Repeat steps 1-4, until we have a large number, « � of ý� ß &

With this, we can use the replications to calculate the empirical distribution of ý� ß &One way to form a 100(1- � ¦ confidence interval for � F would be to order theý� ß from smallest to largest, and drop the first and last « �GÂ # of the replications,

and use the remaining endpoints as the limits of the CI. Note that this will not

give the shortest CI if the empirical distribution is skewed.

Page 98: Econometrics-Creel (2005)

6.7. TESTING NONLINEAR RESTRICTIONS, AND THE DELTA METHOD 98� Suppose one was interested in the distribution of some function of��Z�

for example a test statistic. Simple: just calculate the transformation

for each �"� and work with the empirical distribution of the transforma-

tion.� If the assumption of iid errors is too strong (for example if there is

heteroscedasticity or autocorrelation, see below) one can work with a

bootstrap defined by sampling from ��B�� � with replacement.� How to choose « : « should be large enough that the results don’t

change with repetition of the entire bootstrap. This is easy to check.

If you find the results change a lot, increase « and try again.� The bootstrap is based fundamentally on the idea that the empiri-

cal distribution of the sample data converges to the actual sampling

distribution as � becomes large, so statistics based on sampling from

the empirical distribution should converge in distribution to statistics

based on sampling from the actual sampling distribution.� In finite samples, this doesn’t hold. At a minimum, the bootstrap is a

good way to check if asymptotic theory results offer a decent approxi-

mation to the small sample distribution.

6.7. Testing nonlinear restrictions, and the Delta Method

Testing nonlinear restrictions of a linear model is not much more difficult,

at least when the model is linear. Since estimation subject to nonlinear re-

strictions requires nonlinear estimation methods, which are beyond the score

of this course, we’ll just consider the Wald test for nonlinear restrictions on a

linear model.

Page 99: Econometrics-Creel (2005)

6.7. TESTING NONLINEAR RESTRICTIONS, AND THE DELTA METHOD 99

Consider the � nonlinear restrictionsÀ ��� F ¡� &where

À �?>@ is a � -vector valued function. Write the derivative of the restriction

evaluated at � as � x O À ��� � x �=���GWe suppose that the restrictions are not redundant in a neighborhood of � F , so

that ���W�=���- �in a neighborhood of � F & Take a first order Taylor’s series expansion of

À � ���about � F : À � ��G À ��� F �2M�=��� [ \� ��£�j� F where �#[ is a convex combination of

�� and � F & Under the null hypothesis we

have À � ��� �=��� [ v� ��£��� F Due to consistency of

�� we can replace � [ by � F , asymptotically, soh� À � ��� Q h

�,�=��� F \� ��÷��� F We’ve already seen the distribution of

h�¢� ����j� F Y& Using this we geth

� À � ��� mR ­ ð � �$�=��� F æ � 1« �=��� F P4©è �FNó &Considering the quadratic form� À � ��� 4 ð �=��� F æ � 1« �=��� F 4 ó � 1 À � ���è �F mR � � �b�"

Page 100: Econometrics-Creel (2005)

6.7. TESTING NONLINEAR RESTRICTIONS, AND THE DELTA METHOD 100

under the null hypothesis. Substituting consistent estimators for � F H æ « and è �F �the resulting statistic isÀ � ��� 4 c �=� ���\�¬� 4 �3 � 1 �=� ��G 4 h � 1 À � ����è � mR � � �b�"under the null hypothesis.

� This is known in the literature as the Delta method, or as Klein’s approx-

imation.� Since this is a Wald test, it will tend to over-reject in finite samples. The

score and LR tests are also possibilities, but they require estimation

methods for nonlinear models, which aren’t in the scope of this course.

Note that this also gives a convenient way to estimate nonlinear functions and

associated asymptotic confidence intervals. If the nonlinear functionÀ ��� F is

not hypothesized to be zero, we just haveh� c À � ���G� À ��� F h mR ­ ð � �$�=��� F æ � 1« �=��� F P4©è �FNó

so an approximation to the distribution of the function of the estimator isÀ � ����¬u­²� À ��� F Y�$�=��� F v���i4��:$� 1 �=��� F ?4�è �F For example, the vector of elasticities of a function

� � � is­ � � à � � � à � ® �� � � where ® means element-by-element multiplication. Suppose we estimate a

linear function B �� 4@�^28/T&

Page 101: Econometrics-Creel (2005)

6.7. TESTING NONLINEAR RESTRICTIONS, AND THE DELTA METHOD 101

The elasticities of B w.r.t. � are ­ � � �� 4 � ® �(note that this is the entire vector of elasticities). The estimated elasticities are�­ � � ��� 4 �� ® �To calculate the estimated standard errors of all five elasticites, use

�=��� à ­ � � à � 4

���������� 1 � >N>N> �� � � ...... . . . �� >N>N> � �ED

���������� � 4 �������������01 � � 1 � >N>N> �� � � � �� ...

... . . . �� >N>N> � � Dv� �D����������� � 4 � � &

To get a consistent estimator just substitute in�� . Note that the elasticity and

the standard error are functions of � & The program ExampleDeltaMethod.m

shows how this can be done.

In many cases, nonlinear restrictions can also involve the data, not just the

parameters. For example, consider a model of expenditure shares. Let � �@�����÷be a demand funcion, where � is prices and � is income. An expenditure share

system for � goods is

� * �����[�÷ � *��+* �������÷� ��! �"�$#%�'&(&)&(�$�=&

Page 102: Econometrics-Creel (2005)

6.8. EXAMPLE: THE NERLOVE DATA 102

Now demand must be positive, and we assume that expenditures sum to in-

come, so we have the restrictions

� » � * �@�����÷ » �"�aê�!¯� * ��1 � * �������÷ �Suppose we postulate a linear model for the expenditure shares:

� * �@�����÷ � *1 23� 4©� *6 2K��� *9 2�/ *It is fairly easy to write restrictions such that the shares sum to one, but the

restriction that the shares lie in the \ � �\�}] interval depends on both parameters

and the values of � and ��& It is impossible to impose the restriction that � »� * �@�����÷ » � for all possible � and ��& In such cases, one might consider whether

or not a linear model is a reasonable specification.

6.8. Example: the Nerlove data

Remember that we in a previous example (section 3.8.3) that the OLS re-

sults for the Nerlove model are

*********************************************************OLS estimation resultsObservations 145R-squared 0.925955Sigma-squared 0.153943

Results (Ordinary var-cov estimator)

estimate st.err. t-stat. p-valueconstant -3.527 1.774 -1.987 0.049output 0.720 0.017 41.244 0.000

Page 103: Econometrics-Creel (2005)

6.8. EXAMPLE: THE NERLOVE DATA 103

labor 0.436 0.291 1.499 0.136fuel 0.427 0.100 4.249 0.000capital -0.220 0.339 -0.648 0.518

*********************************************************

Note that � ÿK � ÿ ½ � , and that � � 28� 28� ÿ í � .Remember that if we have constant returns to scale, then � �Ú �"� and if

there is homogeneity of degree 1 then � � 2R� 2R� ÿñ � . We can test these

hypotheses either separately or jointly. NerloveRestrictions.m imposes and

tests CRTS and then HOD1. From it we obtain the results that follow:

Imposing and testing HOD1

*******************************************************

Restricted LS estimation results

Observations 145

R-squared 0.925652

Sigma-squared 0.155686

estimate st.err. t-stat. p-value

constant -4.691 0.891 -5.263 0.000

output 0.721 0.018 41.040 0.000

labor 0.593 0.206 2.878 0.005

fuel 0.414 0.100 4.159 0.000

capital -0.007 0.192 -0.038 0.969

*******************************************************

Page 104: Econometrics-Creel (2005)

6.8. EXAMPLE: THE NERLOVE DATA 104

Value p-value

F 0.574 0.450

Wald 0.594 0.441

LR 0.593 0.441

Score 0.592 0.442

Imposing and testing CRTS

*******************************************************

Restricted LS estimation results

Observations 145

R-squared 0.790420

Sigma-squared 0.438861

estimate st.err. t-stat. p-value

constant -7.530 2.966 -2.539 0.012

output 1.000 0.000 Inf 0.000

labor 0.020 0.489 0.040 0.968

fuel 0.715 0.167 4.289 0.000

capital 0.076 0.572 0.132 0.895

*******************************************************

Value p-value

F 256.262 0.000

Wald 265.414 0.000

LR 150.863 0.000

Page 105: Econometrics-Creel (2005)

6.8. EXAMPLE: THE NERLOVE DATA 105

Score 93.771 0.000

Notice that the input price coefficients in fact sum to 1 when HOD1 is im-

posed. HOD1 is not rejected at usual significance levels (e.g., �K � &(� � ). Also,� � does not drop much when the restriction is imposed, compared to the un-

restricted results. For CRTS, you should note that � �j � , so the restriction is

satisfied. Also note that the hypothesis that � �� � is rejected by the test sta-

tistics at all reasonable significance levels. Note that � � drops quite a bit when

imposing CRTS. If you look at the unrestricted estimation results, you can see

that a t-test for � �8 � also rejects, and that a confidence interval for � � does

not overlap 1.

From the point of view of neoclassical economic theory, these results are

not anomalous: HOD1 is an implication of the theory, but CRTS is not.

EXERCISE 12. Modify the NerloveRestrictions.m program to impose and

test the restrictions jointly.

The Chow test. Since CRTS is rejected, let’s examine the possibilities more

carefully. Recall that the data is sorted by output (the third column). Define

5 subsamples of firms, with the first group being the 29 firms with the lowest

output levels, then the next 29 firms, etc. The five subsamples can be indexed

by � �"�$#%�'&(&)&(� Ò � where � � for �"�Y#%�\&)&(&©# ï , � # for % � ��% �"�\&)&(& Ò - , etc.

Define a piecewise linear model

(6.8.1)�)� û U � ß 1 28� ß� ��� æ U+28� ß| �)� ª�� U52K� ß! ��� ª� U 28� ß$ ��� ª ÿ U 2MA-U

where � is a superscript (not a power) that inicates that the coefficients may be

different according to the subsample in which the observation falls. That is,

Page 106: Econometrics-Creel (2005)

6.8. EXAMPLE: THE NERLOVE DATA 106

the coefficients depend upon � which in turn depends upon Y& Note that the

first column of nerlove.data indicates this way of breaking up the sample. The

new model may be written as

(6.8.2)

������������BT1B �...

B($�������������

��������������1 � >N>N> �� � �... � | �X! �� �X$

�������������������������� 1� �� $

�������������2������������A 1A �...

A $�������������

where B�1 is 29 �y�V�E��1 is 29 � Ò �E� ß is theÒ �¡� vector of coefficient for the � U�¶

subsample, and A ß is the # ï �j� vector of errors for the � U�¶ subsample.

The Octave program Restrictions/ChowTest.m estimates the above model.

It also tests the hypothesis that the five subsamples share the same parameter

vector, or in other words, that there is coefficient stability across the five sub-

samples. The null to test is that the parameter vectors for the separate groups

are all the same, that is,

� 1 � � � | � ! � $This type of test, that parameters are constant across different sets of data, is

sometimes referred to as a Chow test.

� There are 20 restrictions. If that’s not clear to you, look at the Octave

program.� The restrictions are rejected at all conventional significance levels.

Since the restrictions are rejected, we should probably use the unrestricted

model for analysis. What is the pattern of RTS as a function of the output

Page 107: Econometrics-Creel (2005)

6.8. EXAMPLE: THE NERLOVE DATA 107

FIGURE 6.8.1. RTS as a function of firm size

0.8

1

1.2

1.4

1.6

1.8

2

2.2

2.4

2.6

1 1.5 2 2.5 3 3.5 4 4.5 5Output group

RTS

group (small to large)? Figure 6.8.1 plots RTS. We can see that there is increas-

ing RTS for small firms, but that RTS is approximately constant for large firms.

Page 108: Econometrics-Creel (2005)

6.8. EXAMPLE: THE NERLOVE DATA 108

(1) Using the Chow test on the Nerlove model, we reject that there is coef-

ficient stability across the 5 groups. But perhaps we could restrict the

input price coefficients to be the same but let the constant and output

coefficients vary by group size. This new model is

(6.8.3)��� û *, � ß 1 2K� ß� �)� æs* 28� | ��� ª�� * 2K�"! ��� ª# * 2K��$ ��� ª ÿG* 2MA *

(a) estimate this model by OLS, giving � , estimated standard errors

for coefficients, t-statistics for tests of significance, and the associ-

ated p-values. Interpret the results in detail.

(b) Test the restrictions implied by this model using the F, Wald, score

and likelihood ratio tests. Comment on the results.

(c) Plot the estimated RTS parameters as a function of firm size. Com-

pare the plot to that given in the notes for the unrestricted model.

Comment on the results.

(2) For the simple Nerlove model, estimated returns to scale is °� ¿ Û 1cx � .

Apply the delta method to calculate the estimated standard error for

estimated RTS. Directly test

dF H�� ¿ Û � versus

d��H�� ¿ Û í � rather

than testing

dF H�� �£ � versus

d��H�� � í � . Comment on the results.

(3) Perform a Monte Carlo study that generates data from the model

B �µ#;2�� � � 2¡� � | 2MAwhere the sample size is 30, � � and � | are independently uniformly

distributed on \ � �'�}] and A�ç ±�± ­²� � �\�J(a) Compare the means and standard errors of the estimated coeffi-

cients using OLS and restricted OLS, imposing the restriction that� � 28� | #T&

Page 109: Econometrics-Creel (2005)

6.8. EXAMPLE: THE NERLOVE DATA 109

(b) Compare the means and standard errors of the estimated coeffi-

cients using OLS and restricted OLS, imposing the restriction that� � 28� | �V&(c) Discuss the results.

(4) Get the Octave scripts bootstrap_example1.m , bootstrap.m , bootstrap_resample_iid.m

and myols.m figure out what they do, run them, and interpret the re-

sults.

Page 110: Econometrics-Creel (2005)

CHAPTER 7

Generalized least squares

One of the assumptions we’ve made up to now is that

/JU�ç ±�± � � � ��è � Y�or occasionally /JU,ç ±T± ­²� � ��è � C&Now we’ll investigate the consequences of nonidentically and/or dependently

distributed errors. We’ll assume fixed regressors for now, relaxing this admit-

tedly unrealistic assumption later. The model is

B �i��28/ë7��/V �é^��/V 5

where 5 is a general symmetric positive definite matrix (we’ll write � in place

of � F to simplify the typing of these notes).� The case where 5 is a diagonal matrix gives uncorrelated, nonidenti-

cally distributed errors. This is known as heteroscedasticity.� The case where 5 has the same number on the main diagonal but

nonzero elements off the main diagonal gives identically (assuming

higher moments are also the same) dependently distributed errors.

This is known as autocorrelation.110

Page 111: Econometrics-Creel (2005)

7.1. EFFECTS OF NONSPHERICAL DISTURBANCES ON THE OLS ESTIMATOR 111� The general case combines heteroscedasticity and autocorrelation. This

is known as “nonspherical” disturbances, though why this term is

used, I have no idea. Perhaps it’s because under the classical assump-

tions, a joint confidence region for / would be an �G� dimensional hy-

persphere.

7.1. Effects of nonspherical disturbances on the OLS estimator

The least square estimator is�� ��� 4 �: � 1 � 4 B ��2a���i4��:$� 1 ��4�/� We have unbiasedness, as before.� The variance of

�� is

ë n � ��£�j�\� ��£���?4 r ë ° �¬�i4��:�� 1 �i4�/J/J4Ì�K���i4��:$� 1 ³ ����4��:�� 1 �i4±5��8���i4��:$� 1(7.1.1)

Due to this, any test statistic that is based upon�è � or the probability

limit�è � of is invalid. In particular, the formulas for the Y� 2 �� � based

tests given above do not lead to statistics with these distributions.� �� is still consistent, following exactly the same argument given before.� If / is normally distributed, then���çu­ñðò���J����4��:�� 1 �i4±5��8���i4��:$� 1 óThe problem is that 5 is unknown in general, so this distribution won’t

be useful for testing hypotheses.

Page 112: Econometrics-Creel (2005)

7.2. THE GLS ESTIMATOR 112� Without normality, and unconditional on � we still haveh� c ����j� h

h�Ä���i4©�3$� 1 �i4�/ Æ � 4 �� É � 1 � � 1Ap � � 4 /

Define the limiting variance of � � 1Ap � � 4 / (supposing a CLT applies) as� ���f�StT ëKÆ � 4 /J/ 4 �� É �²so we obtain

h� c ������ h mR ­ ð � � æ � 1« ²¼æ � 1« ó

Summary: OLS with heteroscedasticity and/or autocorrelation is:� unbiased in the same circumstances in which the estimator is unbiased

with iid errors� has a different variance than before, so the previous test statistics aren’t

valid� is consistent� is asymptotically normally distributed, but with a different limiting

covariance matrix. Previous test statistics aren’t valid in this case for

this reason.� is inefficient, as is shown below.

7.2. The GLS estimator

Suppose 5 were known. Then one could form the Cholesky decompositionª 4 ª 5y� 1We have ª 4 ª 5 a± f

Page 113: Econometrics-Creel (2005)

7.2. THE GLS ESTIMATOR 113

so ª 4 ª 5 ª 4 ª 4 �which implies that ª 5 ª 4 a± f

Consider the model ª 4©B ª 4����b2 ª 4�/T�or, making the obvious definitions,

B [ � [ �b28/ [ &This variance of /([ ª / is

ë¼� ª /J/ 4 ª 4 ª 5 ª 4 ± fTherefore, the model

B [ � [ ��28/ [ë7��/ [ �é^��/ [ ± f

satisfies the classical assumptions. The GLS estimator is simply OLS applied

to the transformed model:�� ¯ �+³ ��� [ 4�� [ $� 1 � [ 4©B [ ��� 4 ª¯ª 4 �3 � 1 � 4 ª¯ª 4 B ��� 4 5 � 1 �: � 1 � 4 5 � 1 B

Page 114: Econometrics-Creel (2005)

7.2. THE GLS ESTIMATOR 114

The GLS estimator is unbiased in the same circumstances under which the

OLS estimator is unbiased. For example, assuming � is nonstochastic

ë7� �� ¯ �(³ ë ù ���i4�5y� 1 �:$� 1 ��4±5y� 1 B ú ë ù ���i4�5y� 1 �:$� 1 ��4±5y� 1 ���i�^2�/ ú ��&The variance of the estimator, conditional on � can be calculated using�� ¯ �+³ ��� [ 4©� [ $� 1 � [ 4©B [ ��� [ 4©� [ $� 1 � [ 4g��� [ �^28/ [ ��2��¬� [ 4�� [ $� 1 � [ 4�/ [so

ë õ c �� ¯ �(³ �j� h c �� ¯ �+³ ��� h 4 ø ë ù ��� [ 4©� [ $� 1 � [ 4©/ [ / [ 4�� [ �¬� [ 4�� [ $� 1 ú ��� [ 4�� [ $� 1 � [ 4�� [ �¬� [ 4�� [ $� 1 ��� [ 4�� [ $� 1 ���i4�5y� 1 �:$� 1Either of these last formulas can be used.� All the previous results regarding the desirable properties of the least

squares estimator hold, when dealing with the transformed model,

since the transformed model satisfies the classical assumptions..� Tests are valid, using the previous formulas, as long as we substitute� [ in place of ��& Furthermore, any test that involves è � can set it to �"&This is preferable to re-deriving the appropriate formulas.

Page 115: Econometrics-Creel (2005)

7.3. FEASIBLE GLS 115� The GLS estimator is more efficient than the OLS estimator. This is a

consequence of the Gauss-Markov theorem, since the GLS estimator is

based on a model that satisfies the classical assumptions but the OLS

estimator is not. To see this directly, not that (the following needs to

be completed)

é¯ô À � ��G�8éÜô À � �� ¯ �+³ �¬�i4��3$� 1 �i4�5��K����4©�:$� 1 �¡�¬�i4±5y� 1 �3$� 1 w 5 w Owhere wa ° �¬� 4 �3 � 1 � 4 �¡��� 4 5 � 1 �: � 1 � 4 5 � 1 ³ & This may not seem ob-

vious, but it is true, as you can verify for yourself. Then noting thatw 5 w O is a quadratic form in a positive definite matrix, we conclude

that w 5 w O is positive semi-definite, and that GLS is efficient relative to

OLS.� As one can verify by calculating fonc, the GLS estimator is the solution

to the minimization problem�� ¯ �+³ ��V�-�Z� � � ��B¯�j���?4*5y� 1 ��BÜ�j�i�Gso the metric 5 � 1 is used to weight the residuals.

7.3. Feasible GLS

The problem is that 5 isn’t known usually, so this estimator isn’t available.

� Consider the dimension of 5 : it’s an �;�¼� matrix with �¬� � ����  # 2Ü� ��� � 2K�,  # unique elements.

Page 116: Econometrics-Creel (2005)

7.3. FEASIBLE GLS 116� The number of parameters to estimate is larger than � and increases

faster than �G& There’s no way to devise an estimator that satisfies a

LLN without adding restrictions.� The feasible GLS estimator is based upon making sufficient assumptions

regarding the form of 5 so that a consistent estimator can be devised.

Suppose that we parameterize 5 as a function of � and @ , where @ may include� as well as other parameters, so that5 5�������@"where @ is of fixed dimension. If we can consistently estimate @%� we can con-

sistently estimate 5µ� as long as 5����i�@" is a continuous function of @ (by the

Slutsky theorem). In this case,�5 5����i� �@" 6R 5��¬�i��@VIf we replace 5 in the formulas for the GLS estimator with

�5p� we obtain the

FGLS estimator. The FGLS estimator shares the same asymptotic properties

as GLS. These are

(1) Consistency

(2) Asymptotic normality

(3) Asymptotic efficiency if the errors are normally distributed. (Cramer-

Rao).

(4) Test procedures are asymptotically valid.

In practice, the usual way to proceed is

(1) Define a consistent estimator of @%& This is a case-by-case proposition,

depending on the parameterization 5��4@"C& We’ll see examples below.

Page 117: Econometrics-Creel (2005)

7.4. HETEROSCEDASTICITY 117

(2) Form�5 5������ �@"

(3) Calculate the Cholesky factorization�ª �û ¹�´1µ � �5 � 1 .

(4) Transform the model using�ª 4 B �ª 4 �i��2 �ª 4 /(5) Estimate using OLS on the transformed model.

7.4. Heteroscedasticity

Heteroscedasticity is the case where

ë7��/J/J4© 5is a diagonal matrix, so that the errors are uncorrelated, but have different

variances. Heteroscedasticity is usually thought of as associated with cross

sectional data, though there is absolutely no reason why time series data can-

not also be heteroscedastic. Actually, the popular ARCH (autoregressive con-

ditionally heteroscedastic) models explicitly assume that a time series is het-

eroscedastic.

Consider a supply function� *0 �01�2K��6 ª * 2K�Eì Û * 28/ *where

ª * is price and Û * is some measure of size of the ! U�¶ firm. One might

suppose that unobservable factors (e.g., talent of managers, degree of coordi-

nation between production units, etc.) account for the error term / * & If there

is more variability in these factors for large firms than for small firms, then / *may have a higher variance when Û * is high than when it is low.

Page 118: Econometrics-Creel (2005)

7.4. HETEROSCEDASTICITY 118

Another example, individual demand.� *, �01I28��6 ª * 2K�59 ´ * 2�/ *where

ªis price and

´is income. In this case, / * can reflect variations in

preferences. There are more possibilities for expression of preferences when

one is rich, so it is possible that the variance of / * could be higher when´

is

high.

Add example of group means.

7.4.1. OLS with heteroscedastic consistent varcov estimation. Eicker (1967)

and White (1980) showed how to modify test statistics to account for het-

eroscedasticity of unknown form. The OLS estimator has asymptotic distri-

butionh� c ��÷��� h mR ­ñð � � æ � 1« ²¼æ � 1« ó

as we’ve already seen. Recall that we defined� ���f�StT ëKÆ � 4 /J/ 4 �� É �²This matrix has dimension � � � and can be consistently estimated, even if we

can’t estimate 5 consistently. The consistent estimator, under heteroscedastic-

ity but no autocorrelation is �²Ë �� f� U(��1 � 4U � U �/ �UOne can then modify the previous test statistics to obtain tests that are valid

when there is heteroscedasticity of unknown form. For example, the Wald test

Page 119: Econometrics-Creel (2005)

7.4. HETEROSCEDASTICITY 119

for

dF H��p�÷� À �� would be

� c � ��÷� À h 4#¶ � Æ � 4 �� É � 1 �² Æ � 4 �� É � 1 � 4�· � 1 c � ���� À h Qç£� � �4�V7.4.2. Detection. There exist many tests for the presence of heteroscedas-

ticity. We’ll discuss three methods.

Goldfeld-Quandt. The sample is divided in to three parts, with �G1C��� � and� | observations, where ��1 2i� � 2i� | � . The model is estimated using the first

and third parts of the sample, separately, so that�� 1 and

�� | will be independent.

Then we have �/ 1 4 �/ 1è � / 1 O ´ 1 / 1è � mR � � ����1� � and �/ | 4 �/ |è � / | O ´ | / |è � mR � � ��� | � � so �/ 1 4 �/ 1  ���I1G� � �/ | 4 �/ |  ��� | � � mR 2 ����1� � ��� | � � Y&The distributional result is exact if the errors are normally distributed. This test

is a two-tailed test. Alternatively, and probably more conventionally, if one has

prior ideas about the possible magnitudes of the variances of the observations,

one could order the observations accordingly, from largest to smallest. In this

case, one would use a conventional one-tailed F-test. Draw picture.� Ordering the observations is an important step if the test is to have

any power.� The motive for dropping the middle observations is to increase the

difference between the average variance in the subsamples, suppos-

ing that there exists heteroscedasticity. This can increase the power of

Page 120: Econometrics-Creel (2005)

7.4. HETEROSCEDASTICITY 120

the test. On the other hand, dropping too many observations will sub-

stantially increase the variance of the statistics�/ 1 4 �/ 1 and

�/ | 4 �/ | & A rule of

thumb, based on Monte Carlo experiments is to drop around 25% of

the observations.� If one doesn’t have any ideas about the form of the het. the test will

probably have low power since a sensible data ordering isn’t available.

White’s test. When one has little idea if there exists heteroscedasticity, and

no idea of its potential form, the White test is a possibility. The idea is that if

there is homoscedasticity, then

ë7��/ �U � � UW è � �Ôê� so that � U or functions of � U shouldn’t help to explain ë¼�¬/ �U C& The test works as

follows:

(1) Since /]U isn’t available, use the consistent estimator�/gU instead.

(2) Regress �/ �U è � 2M�]4U �=2¹¸JUwhere �\U is a

ª-vector. �\U may include some or all of the variables in� U?� as well as other variables. White’s original suggestion was to use� U , plus the set of all unique squares and cross products of variables in� U?&

(3) Test the hypothesis that � a� & The � 2 statistic in this case is� 2 ª � Ù Û�Û �^� ÙÜÛ�Û�   ªÙ Û�Û�   ����� ª � �J

Page 121: Econometrics-Creel (2005)

7.4. HETEROSCEDASTICITY 121

Note that Ù Û�Û � ¿ Û�Û�  � so dividing both numerator and denomina-

tor by this we get � 2 ����� ª � �J � ��¼�8� �Note that this is the �p� or the artificial regression used to test for het-

eroscedasticity, not the ��� of the original model.

An asymptotically equivalent statistic, under the null of no heteroscedasticity

(so that � � should tend to zero), is

�,� � Qç�� � � ª Y&This doesn’t require normality of the errors, though it does assume that the

fourth moment of /]U is constant, under the null. Question: why is this neces-

sary?

� The White test has the disadvantage that it may not be very power-

ful unless the �\U vector is chosen well, and this is hard to do without

knowledge of the form of heteroscedasticity.� It also has the problem that specification errors other than heteroscedas-

ticity may lead to rejection.� Note: the null hypothesis of this test may be interpreted as @  � for

the variance model é^��/ �U ¹ � � 2²� 4U @VC� where¹ �?>@ is an arbitrary func-

tion of unknown form. The test is more general than is may appear

from the regression that is used.

Plotting the residuals. A very simple method is to simply plot the residuals

(or their squares). Draw pictures here. Like the Goldfeld-Quandt test, this will

Page 122: Econometrics-Creel (2005)

7.4. HETEROSCEDASTICITY 122

be more informative if the observations are ordered according to the suspected

form of the heteroscedasticity.

7.4.3. Correction. Correcting for heteroscedasticity requires that a para-

metric form for 5��A@" be supplied, and that a means for estimating @ consis-

tently be determined. The estimation method will be specific to the for sup-

plied for 5��A@VC& We’ll consider two examples. Before this, let’s consider the

general nature of GLS when there is heteroscedasticity.

Multiplicative heteroscedasticity

Suppose the model is

BgU � 4U ��2�/]Uè �U ë¼�¬/ �U �W� 4U �0_ºbut the other classical assumptions hold. In this case

/ �U �W�g4U �0 º 2i¸]Uand ¸]U has mean zero. Nonlinear least squares could be used to estimate � and»

consistently, were /]U observable. The solution is to substitute the squared

OLS residuals�/ �U in place of / �U � since it is consistent by the Slutsky theorem.

Once we have�� and

�» � we can estimate è �U consistently using�è �U ���g4U ��0 �º 6R è �U &In the second step, we transform the model by dividing by the standard devi-

ation: BgU�è5U � 4U ��è5U 2 /JU�è5U

Page 123: Econometrics-Creel (2005)

7.4. HETEROSCEDASTICITY 123

or B [U � [ 4U ��2�/ [U &Asymptotically, this model satisfies the classical assumptions.� This model is a bit complex in that NLS is required to estimate the

model of the variance. A simpler version would be

B]U � 4U �^2�/]Uè �U ë¼�¬/ �U è � � ºUwhere �'U is a single variable. There are still two parameters to be esti-

mated, and the model of the variance is still nonlinear in the parame-

ters. However, the search method can be used in this case to reduce the

estimation problem to repeated applications of OLS.� First, we define an interval of reasonable values for» � e.g.,

» Cc\ � �%1]ò&� Partition this interval into´

equally spaced values, e.g., S � �\&)�"�\&©#%�\&)&(&)�$#%& ï ��%%X�&� For each of these values, calculate the variable � ºb¼U &� The regression �/ �U è � � ºb¼U 2i¸]Uis linear in the parameters, conditional on

» 9µ� so one can estimate è �by OLS.� Save the pairs ( è��9 � » 97C� and the corresponding ÙÜÛZÛ 9µ& Choose the pair

with the minimum Ù Û�Û 9 as the estimate.� Next, divide the model by the estimated standard deviations.� Can refine. Draw picture.� Works well when the parameter to be searched over is low dimen-

sional, as in this case.

Page 124: Econometrics-Creel (2005)

7.4. HETEROSCEDASTICITY 124

Groupwise heteroscedasticity

A common case is where we have repeated observations on each of a num-

ber of economic agents: e.g., 10 years of macroeconomic data on each of a set

of countries or regions, or daily observations of transactions of 200 banks. This

sort of data is a pooled cross-section time-series model. It may be reasonable to pre-

sume that the variance is constant over time within the cross-sectional units,

but that it differs across them (e.g., firms or countries of different sizes...). The

model is

B * U � 4* U �^28/ * Uë¼�¬/ �* U è �* �?ê� where ! �"�Y#%�'&(&(&)�$� are the agents, and �"�$#%�'&(&)&(��� are the observations on

each agent.

� The other classical assumptions are presumed to hold.� In this case, the variance è �* is specific to each agent, but constant over

the � observations for that agent.� In this model, we assume that ë7��/ * U�/ * ì- a� & This is a strong assumption

that we’ll relax later.

To correct for heteroscedasticity, just estimate each èI�* using the natural estima-

tor: �è �* �� f� U(��1 �/ �* U� Note that we use �  � here since it’s possible that there are more than �regressors, so ��� � could be negative. Asymptotically the difference

is unimportant.

Page 125: Econometrics-Creel (2005)

7.4. HETEROSCEDASTICITY 125

FIGURE 7.4.1. Residuals, Nerlove model, sorted by firm size

-1.5

-1

-0.5

0

0.5

1

1.5

0 20 40 60 80 100 120 140 160

Regression residuals

Residuals

� With each of these, transform the model as usual:B * U�è * � 4* U ��è * 2 / * U�è *Do this for each cross-sectional group. This transformed model satis-

fies the classical assumptions, asymptotically.

7.4.4. Example: the Nerlove model (again!) Let’s check the Nerlove data

for evidence of heteroscedasticity. In what follows, we’re going to use the

model with the constant and output coefficient varying across 5 groups, but

with the input price coefficients fixed (see Equation 6.8.3 for the rationale be-

hind this). Figure 7.4.1, which is generated by the Octave program GLS/NerloveResiduals.m

plots the residuals. We can see pretty clearly that the error variance is larger

for small firms than for larger firms.

Page 126: Econometrics-Creel (2005)

7.4. HETEROSCEDASTICITY 126

Now let’s try out some tests to formally check for heteroscedasticity. The

Octave program GLS/HetTests.m performs the White and Goldfeld-Quandt

tests, using the above model. The results are

Value p-value

White’s test 61.903 0.000

Value p-value

GQ test 10.886 0.000

All in all, it is very clear that the data are heteroscedastic. That means that OLS

estimation is not efficient, and tests of restrictions that ignore heteroscedastic-

ity are not valid. The previous tests (CRTS, HOD1 and the Chow test) were cal-

culated assuming homoscedasticity. The Octave program GLS/NerloveRestrictions-Het.m

uses the Wald test to check for CRTS and HOD1, but using a heteroscedastic-

consistent covariance estimator.1 The results are

Testing HOD1

Value p-value

Wald test 6.161 0.013

Testing CRTS

Value p-value

Wald test 20.169 0.001

1By the way, notice that GLS/NerloveResiduals.m and GLS/HetTests.m use the re-stricted LS estimator directly to restrict the fully general model with all coefficientsvarying to the model with only the constant and the output coefficient varying. ButGLS/NerloveRestrictions-Het.m estimates the model by substituting the restrictions into themodel. The methods are equivalent, but the second is more convenient and easier to under-stand.

Page 127: Econometrics-Creel (2005)

7.4. HETEROSCEDASTICITY 127

We see that the previous conclusions are altered - both CRTS is and HOD1 are

rejected at the 5% level. Maybe the rejection of HOD1 is due to to Wald test’s

tendency to over-reject?

From the previous plot, it seems that the variance of A is a decreasing func-

tion of output. Suppose that the 5 size groups have different error variances

(heteroscedasticity by groups):

é¯ô À ��A * è �ß �where � � if ! �"�Y#%�\&)&(&)�$# ï , etc., as before. The Octave program GLS/NerloveGLS.m

estimates the model using GLS (through a transformation of the model so that

OLS can be applied). The estimation results are

*********************************************************

OLS estimation results

Observations 145

R-squared 0.958822

Sigma-squared 0.090800

Results (Het. consistent var-cov estimator)

estimate st.err. t-stat. p-value

constant1 -1.046 1.276 -0.820 0.414

constant2 -1.977 1.364 -1.450 0.149

constant3 -3.616 1.656 -2.184 0.031

constant4 -4.052 1.462 -2.771 0.006

constant5 -5.308 1.586 -3.346 0.001

output1 0.391 0.090 4.363 0.000

Page 128: Econometrics-Creel (2005)

7.4. HETEROSCEDASTICITY 128

output2 0.649 0.090 7.184 0.000

output3 0.897 0.134 6.688 0.000

output4 0.962 0.112 8.612 0.000

output5 1.101 0.090 12.237 0.000

labor 0.007 0.208 0.032 0.975

fuel 0.498 0.081 6.149 0.000

capital -0.460 0.253 -1.818 0.071

*********************************************************

*********************************************************

OLS estimation results

Observations 145

R-squared 0.987429

Sigma-squared 1.092393

Results (Het. consistent var-cov estimator)

estimate st.err. t-stat. p-value

constant1 -1.580 0.917 -1.723 0.087

constant2 -2.497 0.988 -2.528 0.013

constant3 -4.108 1.327 -3.097 0.002

constant4 -4.494 1.180 -3.808 0.000

constant5 -5.765 1.274 -4.525 0.000

output1 0.392 0.090 4.346 0.000

output2 0.648 0.094 6.917 0.000

Page 129: Econometrics-Creel (2005)

7.4. HETEROSCEDASTICITY 129

output3 0.892 0.138 6.474 0.000

output4 0.951 0.109 8.755 0.000

output5 1.093 0.086 12.684 0.000

labor 0.103 0.141 0.733 0.465

fuel 0.492 0.044 11.294 0.000

capital -0.366 0.165 -2.217 0.028

*********************************************************

Testing HOD1

Value p-value

Wald test 9.312 0.002

The first panel of output are the OLS estimation results, which are used to

consistently estimate the è �ß . The second panel of results are the GLS estimation

results. Some comments:

� The ��� measures are not comparable - the dependent variables are

not the same. The measure for the GLS results uses the transformed

dependent variable. One could calculate a comparable �s� measure,

but I have not done so.� The differences in estimated standard errors (smaller in general for

GLS) can be interpreted as evidence of improved efficiency of GLS,

since the OLS standard errors are calculated using the Huber-White

estimator. They would not be comparable if the ordinary (inconsis-

tent) estimator had been used.

Page 130: Econometrics-Creel (2005)

7.5. AUTOCORRELATION 130� Note that the previously noted pattern in the output coefficients per-

sists. The nonconstant CRTS result is robust.� The coefficient on capital is now negative and significant at the 3%

level. That seems to indicate some kind of problem with the model or

the data, or economic theory.� Note that HOD1 is now rejected. Problem of Wald test over-rejecting?

Specification error in model?

7.5. Autocorrelation

Autocorrelation, which is the serial correlation of the error term, is a prob-

lem that is usually associated with time series data, but also can affect cross-

sectional data. For example, a shock to oil prices will simultaneously affect

all countries, so one could expect contemporaneous correlation of macroeco-

nomic variables across countries.

7.5.1. Causes. Autocorrelation is the existence of correlation across the er-

ror term: ë¼�¬/JU¨/gì-�í ¡� �[ �í �"&Why might this occur? Plausible explanations include

(1) Lags in adjustment to shocks. In a model such as

BgU �� 4U ��2�/JUP�one could interpret � 4U � as the equilibrium value. Suppose � U is con-

stant over a number of observations. One can interpret /gU as a shock

that moves the system away from equilibrium. If the time needed to

return to equilibrium is long with respect to the observation frequency,

Page 131: Econometrics-Creel (2005)

7.5. AUTOCORRELATION 131

one could expect /]U � 1 to be positive, conditional on /]U positive, which

induces a correlation.

(2) Unobserved factors that are correlated over time. The error term is

often assumed to correspond to unobservable factors. If these factors

are correlated, there will be autocorrelation.

(3) Misspecification of the model. Suppose that the DGP is

BgU � F 2K�01 � U52K� � � �U 28/JUbut we estimate BgU � F 2K�01 � U528/JUThe effects are illustrated in Figure 7.5.1.

7.5.2. Effects on the OLS estimator. The variance of the OLS estimator is

the same as in the case of heteroscedasticity - the standard formula does not

apply. The correct formula is given in equation 7.1.1. Next we discuss two

GLS corrections for OLS. These will potentially induce inconsistency when the

regressors are nonstochastic (see Chapter8) and should either not be used in

that case (which is usually the relevant case) or used with caution. The more

recommended procedure is discussed in section 7.5.5.

7.5.3. AR(1). There are many types of autocorrelation. We’ll consider two

examples. The first is the most commonly encountered case: autoregressive

Page 132: Econometrics-Creel (2005)

7.5. AUTOCORRELATION 132

FIGURE 7.5.1. Autocorrelation induced by misspecification

order 1 (AR(1) errors. The model is

B]U � 4U �^28/JU/JU �"/JU � 1�2¹½+U½+Ukç !W!¾J�� � �[è �Í ë7��/JU�½�ìÔ � �- ½ �We assume that the model satisfies the other classical assumptions.� We need a stationarity assumption: � � � ½ �"& Otherwise the variance of/]U explodes as increases, so standard asymptotics will not apply.

Page 133: Econometrics-Creel (2005)

7.5. AUTOCORRELATION 133� By recursive substitution we obtain

/JU �"/JU � 1I2i½EU �p���"/JU � � 2i½EU � 1Ô�2�½+U � � /JU � � 28�,½EU � 1,2i½EU � � ���"/JU � | 2i½EU � � �2K�,½+U � 1I2i½+UIn the limit the lagged / drops out, since � 9 R � as � R k � so we

obtain /]U T�9G� F � 9 ½+U � 9With this, the variance of /gU is found as

ë7��/ �U è �Í T�9G� F � � 9 è �Í�È�j� �� If we had directly assumed that /gU were covariance stationary, we could

obtain this using

éb��/JU� � � ë7��/ �U � 1 �2M#g�"ë¼�¬/JU � 1_½+UW�28ë7�4½ �U � � é^�¬/JU��2Kè �Í �so é^�¬/JU� è �Í�È��� �� The variance is the � U�¶ order autocovariance: � F é^��/JU�� Note that the variance does not depend on

Page 134: Econometrics-Creel (2005)

7.5. AUTOCORRELATION 134

Likewise, the first order autocovariance ��1 is

û ´ ¸��¬/]U×�[/JU � 1- �%ì ë7�[�¬�"/]U � 1�2¹½+U��/]U � 1[ �Téb�¬/JUW ��è,�Í�È��� �� Using the same method, we find that for � ½ û ´ ¸��¬/]U×�[/JU � ì- � ì � ì è,�Í�È�j� �� The autocovariances don’t depend on : the process SN/gUPX is covariance

stationary

The correlation (in general, for r.v.’s � and B ) is defined as

corr � � �[B5 cov � � ��B se � � se ��B

but in this case, the two standard errors are the same, so the � -order autocor-

relation �Tì is ��ì � ì� All this means that the overall matrix 5 has the form

5 è �Í�È�j� �z {}| ~this is the variance

������������� � ��� >N>N>u� f � 1� � � >N>N>u� f � �... . . . ...

. . . �� f � 1 >N>N> ��������������z {}| ~

this is the correlation matrix

Page 135: Econometrics-Creel (2005)

7.5. AUTOCORRELATION 135

So we have homoscedasticity, but elements off the main diagonal are

not zero. All of this depends only on two parameters, � and èI�Í & If we

can estimate these consistently, we can apply FGLS.

It turns out that it’s easy to estimate these consistently. The steps are

(1) Estimate the model BgU ¡� 4U ��2�/]U by OLS.

(2) Take the residuals, and estimate the model�/JU � �/JU � 1�2¹½ [USince

�/]U 6R /JU?� this regression is asymptotically equivalent to the re-

gression /JU �"/JU � 1�2¹½+Uwhich satisfies the classical assumptions. Therefore,

�� obtained by ap-

plying OLS to�/]U � �/JU � 1�2g½n[U is consistent. Also, since ½>[U 6R ½+U , the

estimator �è �Í �� f� U(� � � �½ [U � 6R è �Í(3) With the consistent estimators

�è �Í and��E� form

�5 5�� �è �Í � ��T using the

previous structure of 5µ� and estimate by FGLS. Actually, one can omit

the factor�è �Í Â �?�¼��� � C� since it cancels out in the formula�� ¯ �(³ c ��4 �5y� 1 � h � 1 ����4 �5y� 1 B C&� One can iterate the process, by taking the first FGLS estimator of �Z� re-

estimating � and è �Í � etc. If one iterates to convergences it’s equivalent

to MLE (supposing normal errors).

Page 136: Econometrics-Creel (2005)

7.5. AUTOCORRELATION 136� An asymptotically equivalent approach is to simply estimate the trans-

formed model

BgU�� ���BgU � 1 � � U�� �� � U � 1�?4���2¹½ [Uusing �K��� observations (since B F and � F aren’t available). This is

the method of Cochrane and Orcutt. Dropping the first observation is

asymptotically irrelevant, but it can be very important in small samples.

One can recuperate the first observation by putting

B [1 BT1 � �� �� �� [ 1 �� 1 � �� �� �This somewhat odd-looking result is related to the Cholesky factor-

ization of 5 � 1 & See Davidson and MacKinnon, pg. 348-49 for more

discussion. Note that the variance of B [1 is è �Í � asymptotically, so we

see that the transformed model will be homoscedastic (and nonauto-

correlated, since the ½ 4 � are uncorrelated with the B 4 �"� in different time

periods.

7.5.4. MA(1). The linear regression model with moving average order 1

errors is

B]U � 4U ��2�/JU/JU ½+U 2Ëtn½+U � 1½+Uñç !ò!�J�� � ��è �Í ë¼�¬/]U¿½�ì[ � �[ ½ �

Page 137: Econometrics-Creel (2005)

7.5. AUTOCORRELATION 137

In this case,

é���/JUW � F ë ° �4½+U52Ëtr½EU � 1Ô � ³ è �Í 2Mt � è �Í è �Í �Ô��2Ët � Similarly �E1 ëª\)�b½EU 2Mtn½+U � 1-��4½+U � 1,2Ëtr½+U � � _] tEè �Íand � � \(�4½+U 2Ëtr½+U � 1-��4½+U � � 2Ëtr½EU � | _] �so in this case

5 è �Í ������������� 2Mt � t � >N>N> �t ��2Ët � t� t . . . ...

... . . . t� >N>N> t �Z2Ët ��������������

Note that the first order autocorrelation is

� 1 À � zÁ� zÁ à 1 � À z Å �+1� F t�Ô��2Mt �

Page 138: Econometrics-Creel (2005)

7.5. AUTOCORRELATION 138� This achieves a maximum at t � and a minimum at t �Þ�"� and the

maximal and minimal autocorrelations are 1/2 and -1/2. Therefore,

series that are more strongly autocorrelated can’t be MA(1) processes.

Again the covariance matrix has a simple structure that depends on only two

parameters. The problem in this case is that one can’t estimate t using OLS on�/JU ½+U 2Ëtr½EU � 1because the ½EU are unobservable and they can’t be estimated consistently. How-

ever, there is a simple way to estimate the parameters.

� Since the model is homoscedastic, we can estimate

éb��/JU� è �î è �Í �Ô�Z2Mt � using the typical estimator:�è �î x

è �Í �?��2Ët � �� f� U(��1 �/ �U� By the Slutsky theorem, we can interpret this as defining an (uniden-

tified) estimator of both è,�Í and t�� e.g., use this as¥ è �Í �?��2 �t � �� f� U(��1 �/ �UHowever, this isn’t sufficient to define consistent estimators of the pa-

rameters, since it’s unidentified.� To solve this problem, estimate the covariance of /gU and /]U � 1 usingÂû ´ ¸��¬/]U×�[/JU � 1- Ât�è �Í �� f� U(� � �/JU �/JU � 1

Page 139: Econometrics-Creel (2005)

7.5. AUTOCORRELATION 139

This is a consistent estimator, following a LLN (and given that the

epsilon hats are consistent for the epsilons). As above, this can be

interpreted as defining an unidentified estimator:�t ¥ è �Í �� f� U(� � �/]U �/]U � 1� Now solve these two equations to obtain identified (and therefore con-

sistent) estimators of both t and è �Í & Define the consistent estimator�5 5�� �t�� ¥ è �Í following the form we’ve seen above, and transform the model us-

ing the Cholesky decomposition. The transformed model satisfies the

classical assumptions asymptotically.

7.5.5. Asymptotically valid inferences with autocorrelation of unknown

form. See Hamilton Ch. 10, pp. 261-2 and 280-84.

When the form of autocorrelation is unknown, one may decide to use the

OLS estimator, without correction. We’ve seen that this estimator has the lim-

iting distributionh� c ��÷��� h mR ­ñð � � æ � 1« ²¼æ � 1« ó

where, as before, ² is ²Ë � �)�f�SUT ë Æ � 4 /J/ 4 �� É

Page 140: Econometrics-Creel (2005)

7.5. AUTOCORRELATION 140

We need a consistent estimate of ² . Define ��U ®� U¬/JU (recall that � U is defined

as a � �j� vector). Note that

�i4�/ n � 1 � � >N>N> � fir ���������/�1/ �.../Jf���������� f� U(��1 � U¬/]U f� U(��1 �^U

so that ² � ���f�SUT �� ë�v ¶ f� U(��1 �^Ub· ¶ f� U(��1 � 4U · wWe assume that ��U is covariance stationary (so that the covariance between ��Uand �^U � ì does not depend on -C&

Define the ¸Ü�j ¹ autocovariance of ��U asÃÅÄ ë7���^U¬�^4U � Ä C&Note that ë¼����U¬� 4U � Ä Ã 4Ä & (show this with an example). In general, we expect

that:

� �^U will be autocorrelated, since /]U is potentially autocorrelated:ÃÅÄ ë¼���^U¬�^4U � Ä �í ¡�Note that this autocovariance does not depend on Y� due to covariance

stationarity.

Page 141: Econometrics-Creel (2005)

7.5. AUTOCORRELATION 141� contemporaneously correlated ( ë¼��� * U�� ß U��í k� ), since the regressors

in � U will in general be correlated (more on this later).� and heteroscedastic ( ë7��� �* U è �* , which depends upon ! ), again since

the regressors will have different variances.

While one could estimate ² parametrically, we in general have little informa-

tion upon which to base a parametric specification. Recent research has fo-

cused on consistent nonparametric estimators of ² &Now define ² f ë �� v ¶ f� U(��1 �^Ub· ¶ f� U(��1 �^4U · w

We have (show that the following is true, by expanding sum and shifting rows to left)² f à F 2 ��� �� � à 1,2 à 4 1 �2 ���K#� � à � 2 à 4 � 0>N>N>N2 �� ð à f � 1I2 à 4f � 1 óThe natural, consistent estimator of

ÃÆÄis¥ ÃÅÄ �� f�U(� Ä � 1 ��^U ��^4U � Ä &

where ��^U ¡� U �/JU(note: one could put �  ���^�Ǹ% instead of �  � here). So, a natural, but inconsis-

tent, estimator of ² f would be�² f ¥ à F 2 ��� �� c ¥ à 1,2 ¥ à 4 1 h 2 ���8#� c ¥ à � 2 ¥ à 4 � h 2¡>N>N>'2 �� c Âà f � 1�2 Âà 4f � 1 h ¥ à F 2 f � 1� Ä ��1 ���s¸� c ¥ ÃÅÄ 2 ¥ à 4Ä h &

Page 142: Econometrics-Creel (2005)

7.5. AUTOCORRELATION 142

This estimator is inconsistent in general, since the number of parameters to

estimate is more than the number of observations, and increases more rapidly

than � , so information does not build up as � R k &On the other hand, supposing that

Ã�Ätends to zero sufficiently rapidly as ¸

tends to k � a modified estimator�² f ¥ à F 2 � à f'Å� Ä ��1 c ¥ ÃÅÄ 2 ¥ à 4Ä h �where � ���� 6R k as � R k will be consistent, provided �5���� grows sufficiently

slowly.

� The assumption that autocorrelations die off is reasonable in many

cases. For example, the AR(1) model with � � � ½ � has autocorrelations

that die off.� The term f � Äf can be dropped because it tends to one for ¸ ½ �5���� ,given that � ���, increases slowly relative to �G&� A disadvantage of this estimator is that is may not be positive definite.

This could cause one to calculate a negative �� statistic, for example!� Newey and West proposed and estimator (Econometrica, 1987) that

solves the problem of possible nonpositive definiteness of the above

estimator. Their estimator is�² f ¥ à F 2 � à f'Å� Ä ��1MÈ �È� ¸�È2¡��É c ¥ ÃÅÄ 2 ¥ à 4Ä h &This estimator is p.d. by construction. The condition for consistency

is that � � 1Ap�! � ���, R � & Note that this is a very slow rate of growth

for �T& This estimator is nonparametric - we’ve placed no parametric

restrictions on the form of ² & It is an example of a kernel estimator.

Page 143: Econometrics-Creel (2005)

7.5. AUTOCORRELATION 143

Finally, since ² f has ² as its limit,�² f 6R ² & We can now use

�² f andÂ æ « 1f � 4 � to consistently estimate the limiting distribution of the OLS estimator

under heteroscedasticity and autocorrelation of unknown form. With this,

asymptotically valid tests are constructed in the usual way.

7.5.6. Testing for autocorrelation. Durbin-Watson test

The Durbin-Watson test statistic is� ü � fU(� � � �/JU�� �/JU � 1- �� fU(��1 �/ �U � fU(� � ð �/ �U ��# �/JU �/JU � 1I2 �/ �U � 1 ó� fU(��1 �/ �U� The null hypothesis is that the first order autocorrelation of the errors

is zero:

dF H�� 1 � & The alternative is of course

dX�H�� 13í � & Note

that the alternative is not that the errors are AR(1), since many gen-

eral patterns of autocorrelation will have the first order autocorrela-

tion different than zero. For this reason the test is useful for detecting

autocorrelation in general. For the same reason, one shouldn’t just as-

sume that an AR(1) model is appropriate when the DW test rejects the

null.� Under the null, the middle term tends to zero, and the other two tend

to one, so� ü 6R #%&� Supposing that we had an AR(1) error process with � �"& In this case

the middle term tends to �p#%� so� ü 6R �� Supposing that we had an AR(1) error process with � �Þ�"& In this

case the middle term tends to #%� so� ü 6R Ñ� These are the extremes:

� ü always lies between 0 and 4.

Page 144: Econometrics-Creel (2005)

7.5. AUTOCORRELATION 144

FIGURE 7.5.2. Durbin-Watson critical values

� The distribution of the test statistic depends on the matrix of regres-

sors, ��� so tables can’t give exact critical values. The give upper and

lower bounds, which correspond to the extremes that are possible. See

Figure 7.5.2. There are means of determining exact critical values con-

ditional on �i&� Note that DW can be used to test for nonlinearity (add discussion).� The DW test is based upon the assumption that the matrix � is fixed

in repeated samples. This is often unreasonable in the context of eco-

nomic time series, which is precisely the context where the test would

have application. It is possible to relate the DW test to other test sta-

tistics which are valid without strict exogeneity.

Page 145: Econometrics-Creel (2005)

7.5. AUTOCORRELATION 145

Breusch-Godfrey test

This test uses an auxiliary regression, as does the White test for heteroscedas-

ticity. The regression is�/JU ¡� 4U » 2i�+1 �/JU � 1I2c� � �/JU � � 2�>N>N>J2i�uÊ �/]U � Ê�2i¸]Uand the test statistic is the �0�s� statistic, just as in the White test. There are

ªrestrictions, so the test statistic is asymptotically distributed as a � � � ª C&� The intuition is that the lagged errors shouldn’t contribute to explain-

ing the current error if there is no autocorrelation.��� U is included as a regressor to account for the fact that the�/gU are not

independent even if the /]U are. This is a technicality that we won’t go

into here.� This test is valid even if the regressors are stochastic and contain lagged

dependent variables, so it is considerably more useful than the DW

test for typical time series data.� The alternative is not that the model is an AR(P), following the ar-

gument above. The alternative is simply that some or all of the firstªautocorrelations are different from zero. This is compatible with

many specific forms of autocorrelation.

7.5.7. Lagged dependent variables and autocorrelation. We’ve seen that

the OLS estimator is consistent under autocorrelation, as long as � µ !ò� « O îf ®� &This will be the case when ë7��� 4 /" Ú� � following a LLN. An important excep-

tion is the case where � contains lagged B 4 � and the errors are autocorrelated.

A simple example is the case of a single lag of the dependent variable with

Page 146: Econometrics-Creel (2005)

7.5. AUTOCORRELATION 146

AR(1) errors. The model is

B]U � 4U ��2KB]U � 1I� 2�/]U/JU �"/]U � 1�2¹½+UNow we can write

ë7��BgU � 1ò/JUW ë ù � � 4U � 1 ��28BgU � � �¾28/JU � 1[\���"/]U � 1�2¹½+U� úí �since one of the terms is ë7���"/ �U � 1 which is clearly nonzero. In this case ë7��� 4 /Vpí � � and therefore � µ !W� « O îf í �� & Since

� µ !ò� �� �^2�� µ !ò� � 4 /�the OLS estimator is inconsistent in this case. One needs to estimate by instru-

mental variables (IV), which we’ll get to later.

7.5.8. Examples.

Nerlove model, yet again. The Nerlove model uses cross-sectional data, so

one may not think of performing tests for autocorrelation. However, speci-

fication error can induce autocorrelated errors. Consider the simple Nerlove

model �)� ûÚ �01�2K� � ��� æ 28� | ��� ª�� 2K�"! ��� ª# 28��$ ��� ª ÿ 2KAand the extended Nerlove model�)� ûu � ß 1 28� ß� ��� æ 2K� | ��� ª#� 2K�"! ��� ª# 2K��$ ��� ª ÿ 2KA\&

Page 147: Econometrics-Creel (2005)

7.5. AUTOCORRELATION 147

FIGURE 7.6.1. Residuals of simple Nerlove model

-1

-0.5

0

0.5

1

1.5

2

0 1 2 3 4 5 6 7 8 9 10

ResidualsQuadratic fit to Residuals

We have seen evidence that the extended model is preferred. So if it is in

fact the proper model, the simple model is misspecified. Let’s check if this

misspecification might induce autocorrelated errors.

The Octave program GLS/NerloveAR.m estimates the simple Nerlove model,

and plots the residuals as a function of��� æ , and it calculates a Breusch-Godfrey

test statistic. The residual plot is in Figure 7.6.1 , and the test results are:

Value p-value

Breusch-Godfrey test 34.930 0.000

Clearly, there is a problem of autocorrelated residuals.

EXERCISE 7.6. Repeat the autocorrelation tests using the extended Nerlove

model (Equation ??) to see the problem is solved.

Page 148: Econometrics-Creel (2005)

7.5. AUTOCORRELATION 148

Klein model. Klein’s Model I is a simple macroeconometric model. One of

the equations in the model explains consumption ( û ) as a function of profits

), both current and lagged, as well as the sum of wages in the private sector

( ü 6 ) and wages in the government sector ( ü �). Have a look at the README

file for this data set. This gives the variable names and other information.

Consider the model

û U �� F 2 � 1 ª U52 � � ª U � 1�2 � | �òü 6U 2Ëü �U I2MAY1�UThe Octave program GLS/Klein.m estimates this model by OLS, plots the

residuals, and performs the Breusch-Godfrey test, using 1 lag of the residu-

als. The estimation and test results are:

*********************************************************

OLS estimation results

Observations 21

R-squared 0.981008

Sigma-squared 1.051732

Results (Ordinary var-cov estimator)

estimate st.err. t-stat. p-value

Constant 16.237 1.303 12.464 0.000

Profits 0.193 0.091 2.115 0.049

Lagged Profits 0.090 0.091 0.992 0.335

Wages 0.796 0.040 19.933 0.000

Page 149: Econometrics-Creel (2005)

7.5. AUTOCORRELATION 149

FIGURE 7.6.2. OLS residuals, Klein consumption equation

-2.5

-2

-1.5

-1

-0.5

0

0.5

1

1.5

2

0 5 10 15 20 25

Regression residuals

Residuals

*********************************************************

Value p-value

Breusch-Godfrey test 1.539 0.215

and the residual plot is in Figure 7.6.2. The test does not reject the null of

nonautocorrelatetd errors, but we should remember that we have only 21 ob-

servations, so power is likely to be fairly low. The residual plot leads me to

suspect that there may be autocorrelation - there are some significant runs be-

low and above the x-axis. Your opinion may differ.

Since it seems that there may be autocorrelation, lets’s try an AR(1) correc-

tion. The Octave program GLS/KleinAR1.m estimates the Klein consumption

equation assuming that the errors follow the AR(1) pattern. The results, with

the Breusch-Godfrey test for remaining autocorrelation are:

Page 150: Econometrics-Creel (2005)

7.5. AUTOCORRELATION 150

*********************************************************

OLS estimation results

Observations 21

R-squared 0.967090

Sigma-squared 0.983171

Results (Ordinary var-cov estimator)

estimate st.err. t-stat. p-value

Constant 16.992 1.492 11.388 0.000

Profits 0.215 0.096 2.232 0.039

Lagged Profits 0.076 0.094 0.806 0.431

Wages 0.774 0.048 16.234 0.000

*********************************************************

Value p-value

Breusch-Godfrey test 2.129 0.345

� The test is farther away from the rejection region than before, and the

residual plot is a bit more favorable for the hypothesis of nonauto-

correlated residuals, IMHO. For this reason, it seems that the AR(1)

correction might have improved the estimation.� Nevertheless, there has not been much of an effect on the estimated

coefficients nor on their estimated standard errors. This is probably

because the estimated AR(1) coefficient is not very large (around 0.2)

Page 151: Econometrics-Creel (2005)

EXERCISES 151� The existence or not of autocorrelation in this model will be important

later, in the section on simultaneous equations.

Exercises

Page 152: Econometrics-Creel (2005)

EXERCISES 152

(1) Comparing the variances of the OLS and GLS estimators, I claimed that the

following holds:

(2)

é¯ô À � ���Ä�8éÜô À � �� ¯ �(³ w 5 w OVerify that this is true.

(3) Show that the GLS estimator can be defined as�� ¯ �+³ ��V�-�Z� � � ��B¯�j���?4*5y� 1 ��BÜ�j�i�G(4) The limiting distribution of the OLS estimator with heteroscedasticity of

unknown form is h� c ����j� h mR ­ ð � � æ � 1« ²¼æ � 1« ó �

where � ���f�StT ëKÆ � 4 /J/ 4 �� É �²Explain why �²Ë �� f� U(��1 � 4U � U �/ �Uis a consistent estimator of this matrix.

(5) Define the ¸��R ¹ autocovariance of a covariance stationary process �÷U ,where Ù ����U �� as ÃÅÄ ë7���^U¬�^4U � Ä C&Show that ë¼����U¨� 4U � Ä Ã 4Ä &

(6) For the Nerlove model��� ûÚ � ß 1 2K� ß� �)� æ 2K� | ��� ª#� 2K�"! �)� ª� 2K��$ ��� ª ÿ 2MA

Page 153: Econometrics-Creel (2005)

EXERCISES 153

assume that é^�WA-U � � UW �� 2 ��� �)� æ .

Exercises

(a) Calculate the FGLS estimator and interpret the estimation results.

(b) Test the transformed model to check whether it appears to satisfy ho-

moscedasticity.

Page 154: Econometrics-Creel (2005)

CHAPTER 8

Stochastic regressors

Up to now we have treated the regressors as fixed, which is clearly un-

realistic. Now we will assume they are random. There are several ways to

think of the problem. First, if we are interested in an analysis conditional on the

explanatory variables, then it is irrelevant if they are stochastic or not, since

conditional on the values of they regressors take on, they are nonstochastic,

which is the case already considered.� In cross-sectional analysis it is usually reasonable to make the analysis

conditional on the regressors.� In dynamic models, where BgU may depend on BgU � 1C� a conditional anal-

ysis is not sufficiently general, since we may want to predict into the

future many periods out, so we need to consider the behavior of�� and

the relevant test statistics unconditional on �i&The model we’ll deal will involve a combination of the following assumptions

Linearity: the model is a linear function of the parameter vector � F HBgU �� 4U � F 28/JUP�

or in matrix form, B �i� F 2�/%�where B is �÷�i�"�'� c � 1 � � >N>N> � f h 4 � where � U is � �i�V� and � F and / are

conformable.154

Page 155: Econometrics-Creel (2005)

8.1. CASE 1 155

Stochastic, linearly independent regressors� has rank � with probability 1� is stochastic� ��� f�SUTFË � ð 1f � 4 � ¡æ « ó �V� where æ « is a finite positive definite matrix.

Central limit theorem� � 1Ap � � 4 /8mR ­j� � � æ « è �F Normality (Optional): / � �eçR­j� � ��è � ± f� : A is normally distributed

Strongly exogenous regressors:

ë7��/JU � ` � �Ôê� (8.0.1)

Weakly exogenous regressors:

Ù ��/JU � L?Ì � �Ôê� (8.0.2)

In both cases, L 4U � is the conditional mean of BVU given L U : ٠��B]U � L UW �L 4U �8.1. Case 1

Normality of /%� strongly exogenous regressors

In this case, �� � F 2��¬�i4��:�� 1 �i4�/ë7� �� � �: � F 2��¬�i4��:�� 1 �i4�ë¼�¬/ � �: � F

and since this holds for all �i� Ù � ��� � , unconditional on ��& Likewise,�� � �eçR­ñð×���J���i4Ì�:�� 1 è �F'ó

Page 156: Econometrics-Creel (2005)

8.2. CASE 2 156� If the density of � is J,3����:Y� the marginal density of�� is obtained by

multiplying the conditional density by J�3¢���: and integrating over �i&Doing this leads to a nonnormal density for

��¢� in small samples.� However, conditional on �i� the usual test statistics have the Y� 2 and� � distributions. Importantly, these distributions don’t depend on �i� so

when marginalizing to obtain the unconditional distribution, nothing

changes. The tests are valid in small samples.� Summary: When � is stochastic but strongly exogenous and / is nor-

mally distributed:

(1)�� is unbiased

(2)�� is nonnormally distributed

(3) The usual test statistics have the same distribution as with non-

stochastic �i&(4) The Gauss-Markov theorem still holds, since it holds condition-

ally on �i� and this is true for all ��&(5) Asymptotic properties are treated in the next section.

8.2. Case 2/ nonnormally distributed, strongly exogenous regressors

The unbiasedness of�� carries through as before. However, the argument

regarding test statistics doesn’t hold, due to nonnormality of /T& Still, we have�� � F 2a���i4��:$� 1 �i4�/ � F 2 Æ � 4 �� É � 1 � 4 /�

Page 157: Econometrics-Creel (2005)

8.2. CASE 2 157

Now Æ � 4 �� É � 1 6R æ � 1«by assumption, and � 4 /� � � 1Ap � � 4 /h

� 6R �since the numerator converges to a ­j� � � æ « è,�C r.v. and the denominator still

goes to infinity. We have unbiasedness and the variance disappearing, so, the

estimator is consistent: �� 6R � F &Considering the asymptotic distributionh

� c ����²� F h h�²Æ � 4 �� É � 1 � 4 /� Æ � 4 �� É � 1 � � 1Ap � � 4 /

soh� c ��÷��� F h mR ­j� � � æ � 1« è �F

directly following the assumptions. Asymptotic normality of the estimator still

holds. Since the asymptotic results on all test statistics only require this, all the

previous asymptotic results on test statistics are also valid in this case.� Summary: Under strongly exogenous regressors, with / normal or

nonnormal,�� has the properties:

(1) Unbiasedness

(2) Consistency

(3) Gauss-Markov theorem holds, since it holds in the previous case

and doesn’t depend on normality.

(4) Asymptotic normality

Page 158: Econometrics-Creel (2005)

8.4. WHEN ARE THE ASSUMPTIONS REASONABLE? 158

(5) Tests are asymptotically valid, but are not valid in small samples.

8.3. Case 3

Weakly exogenous regressors

An important class of models are dynamic models, where lagged dependent

variables have an impact on the current value. A simple version of these mod-

els that captures the important points is

B]U �g4U � 2 6� ìò��1 �%ìPBgU � ì�2�/]U � 4U �^2�/]Uwhere now � U contains lagged dependent variables. Clearly, even with Ù �WA�U � L U� � �'� and / are not uncorrelated, so one can’t show unbiasedness. For example,

ë¼�¬/JU � 1 � UW�í a�since � U contains BgU � 1 (which is a function of /]U � 1[ as an element.� This fact implies that all of the small sample properties such as un-

biasedness, Gauss-Markov theorem, and small sample validity of test

statistics do not hold in this case. Recall Figure 3.7.2. This is a case of

weakly exogenous regressors, and we see that the OLS estimator is

biased in this case.� Nevertheless, under the above assumptions, all asymptotic properties

continue to hold, using the same arguments as before.

8.4. When are the assumptions reasonable?

The two assumptions we’ve added are

Page 159: Econometrics-Creel (2005)

8.4. WHEN ARE THE ASSUMPTIONS REASONABLE? 159

(1)� �)� f�SUTFË � ð 1f � 4 � �æ « ó �"� a æ « finite positive definite matrix.

(2) � � 1Ap � � 4 / mR ­²� � � æ « è �F The most complicated case is that of dynamic models, since the other cases can

be treated as nested in this case. There exist a number of central limit theorems

for dependent processes, many of which are fairly technical. We won’t enter

into details (see Hamilton, Chapter 7 if you’re interested). A main requirement

for use of standard asymptotics for a dependent sequence

Sg�'UòX S �� f� U(��1 �\UPXto converge in probability to a finite limit is that �NU be stationary, in some sense.� Strong stationarity requires that the joint distribution of the set

S]�\U?���\U � ìC�$�\U � �\�'&)&(&�Xnot depend on Y&� Covariance (weak) stationarity requires that the first and second mo-

ments of this set not depend on Y&� An example of a sequence that doesn’t satisfy this is an AR(1) process

with a unit root (a random walk):

� U � U � 1I2�/JU/]U§ç ±�± ­²� � �[è � One can show that the variance of � U depends upon in this case.

Stationarity prevents the process from trending off to plus or minus infinity,

and prevents cyclical behavior which would allow correlations between far

removed �\U znd �Nì to be high. Draw a picture here.

Page 160: Econometrics-Creel (2005)

8.4. WHEN ARE THE ASSUMPTIONS REASONABLE? 160� In summary, the assumptions are reasonable when the stochastic con-

ditioning variables have variances that are finite, and are not too strongly

dependent. The AR(1) model with unit root is an example of a case

where the dependence is too strong for standard asymptotics to apply.� The econometrics of nonstationary processes has been an active area

of research in the last two decades. The standard asymptotics don’t

apply in this case. This isn’t in the scope of this course.

Page 161: Econometrics-Creel (2005)

EXERCISES 161

Exercises

(1) Show that for two random variables w and � � if Ù � w � � a� � then Ù � w � � � [ � . How is this used in the Gauss-Markov theorem?

(2) If it possible for an AR(1) model for time series data, e.g., B�U �� 2 � & ï BgU � 1C2 /JUsatisfy weak exogeneity? Strong exogeneity? Discuss.

Page 162: Econometrics-Creel (2005)

CHAPTER 9

Data problems

In this section well consider problems associated with the regressor matrix:

collinearity, missing observation and measurement error.

9.1. Collinearity

Collinearity is the existence of linear relationships amongst the regressors.

We can always write e 1 L 1�2 e � L � 2¡>N>N>N2 e�ÿZLIÿ 2¹¸ ¡�where L,* is the ! U�¶ column of the regressor matrix ��� and ¸ is an �j�M� vector.

In the case that there exists collinearity, the variation in ¸ is relatively small, so

that there is an approximately exact linear relation between the regressors.� “relative” and “approximate” are imprecise, so it’s difficult to define

when collinearilty exists.

In the extreme, if there are exact linear relationships (every element of ¸ equal)

then �����: ½ � � so ����� 4 �3 ½Ë� � so � 4 � is not invertible and the OLS estimator

is not uniquely defined. For example, if the model is

BgU �01I28� � � � UE28� | � | UE2�/JU� � U � 1�2 � � � | U162

Page 163: Econometrics-Creel (2005)

9.1. COLLINEARITY 163

then we can write

BgU �01,2K� � � � 1I2 � � � | UW�28� | � | U+2�/JU �01,2K� � � 1I28� � � � � | U52K� | � | U528/JU ���01,2K� � � 1-�2a��� � � � 2K� | � | U �+1�2c� � � | U+2�/JU� The � 4 � can be consistently estimated, but since the � 4 s define two

equations in three � 4 ��� the � 4 � can’t be consistently estimated (there

are multiple values of � that solve the fonc). The � 4 � are unidentified in

the case of perfect collinearity.� Perfect collinearity is unusual, except in the case of an error in con-

struction of the regressor matrix, such as including the same regressor

twice.

Another case where perfect collinearity may be encountered is with models

with dummy variables, if one is not careful. Consider a model of rental price��B * of an apartment. This could depend factors such as size, quality etc., col-

lected in �+* � as well as on the location of the apartment. Let �p*¼ � if the ! U�¶apartment is in Barcelona, �µ*, a� otherwise. Similarly, define � * � ¿ * and

B * for

Girona, Tarragona and Lleida. One could use a model such as

B *, �01I28� � �;* 2K� | � * 2K�"! ¿ * 2K��$ B * 2 � 4* �=2�/ *In this model, �µ* 2�� * 2 ¿ * 2 B *� �"�Cê�!-� so there is an exact relationship between

these variables and the column of ones corresponding to the constant. One

must either drop the constant, or one of the qualitative variables.

Page 164: Econometrics-Creel (2005)

9.1. COLLINEARITY 164

FIGURE 9.1.1. ����� when there is no collinearity

-6 -4 -2 0 2 4 6-6

-4

-2

0

2

4

6

60 55 50 45 40 35 30 25 20 15

9.1.1. A brief aside on dummy variables. Introduce a brief discussion of

dummy variables here.

9.1.2. Back to collinearity. The more common case, if one doesn’t make

mistakes such as these, is the existence of inexact linear relationships, i.e., cor-

relations between the regressors that are less than one in absolute value, but

not zero. The basic problem is that when two (or more) variables move to-

gether, it is difficult to determine their separate influences. This is reflected

in imprecise estimates, i.e., estimates with high variances. With economic data,

collinearity is commonly encountered, and is often a severe problem.

When there is collinearity, the minimizing point of the objective function

that defines the OLS estimator ( ����� , the sum of squared errors) is relatively

poorly defined. This is seen in Figures 9.1.1 and 9.1.2.

Page 165: Econometrics-Creel (2005)

9.1. COLLINEARITY 165

FIGURE 9.1.2. �T��� when there is collinearity

-6 -4 -2 0 2 4 6-6

-4

-2

0

2

4

6

100 90 80 70 60 50 40 30 20

To see the effect of collinearity on variances, partition the regressor matrix

as � on L ü rwhere L is the first column of � (note: we can interchange the columns of � isf

we like, so there’s no loss of generality in considering the first column). Now,

the variance of��¢� under the classical assumptions, is

é�� �� ����4��: � 1 è �Using the partition,

� 4 � �� L 4 L L 4 üü 4 L ü 4 ü ��

Page 166: Econometrics-Creel (2005)

9.1. COLLINEARITY 166

and following a rule for partitioned inversion,

���i4��: � 11IH 1 ð L 4 L � L 4�ü �Wü�4@üÚ$� 1 üR4 L ó � 1 c L 4 c ± fs�8üþ�òü�4�üÚ O 1 ü�4 h L h � 1 ð Ù Û�Û>Í � � ó � 1where by Ù Û�Û>Í � � we mean the error sum of squares obtained from the regres-

sion L� ü e 2¹¸E&Since � � �È� Ù Û�Û Â ¿ Û�Û �we have

ÙÜÛ�Û ¿ Û�Û �Ô�¼�8� � so the variance of the coefficient corresponding to L is

é^� �� Í è �¿ Û�Û Í �?�y�j� �Í � � We see three factors influence the variance of this coefficient. It will be high if

(1) è � is large

(2) There is little variation in L & Draw a picture here.

(3) There is a strong linear relationship between � and the other regres-

sors, so that ü can explain the movement in L well. In this case, � �Í � �will be close to 1. As ���Í � � R �"�Yé�� �� Í R k &

The last of these cases is collinearity.

Page 167: Econometrics-Creel (2005)

9.1. COLLINEARITY 167

Intuitively, when there are strong linear relations between the regressors, it

is difficult to determine the separate influence of the regressors on the depen-

dent variable. This can be seen by comparing the OLS objective function in

the case of no correlation between regressors with the objective function with

correlation between the regressors. See the figures nocollin.ps (no correlation)

and collin.ps (correlation), available on the web site.

9.1.3. Detection of collinearity. The best way is simply to regress each ex-

planatory variable in turn on the remaining regressors. If any of these auxiliary

regressions has a high � � � there is a problem of collinearity. Furthermore, this

procedure identifies which parameters are affected.� Sometimes, we’re only interested in certain parameters. Collinearity

isn’t a problem if it doesn’t affect what we’re interested in estimating.

An alternative is to examine the matrix of correlations between the regressors.

High correlations are sufficient but not necessary for severe collinearity.

Also indicative of collinearity is that the model fits well (high � � C� but none

of the variables is significantly different from zero (e.g., their separate influ-

ences aren’t well determined).

In summary, the artificial regressions are the best approach if one wants to

be careful.

9.1.4. Dealing with collinearity. More information

Collinearity is a problem of an uninformative sample. The first question

is: is all the available information being used? Is more data available? Are

there coefficient restrictions that have been neglected? Picture illustrating how

a restriction can solve problem of perfect collinearity.

Stochastic restrictions and ridge regression

Page 168: Econometrics-Creel (2005)

Supposing that there is no more data and no neglected restrictions, one possibility is to change perspectives, to Bayesian econometrics. One can express prior beliefs regarding the coefficients using stochastic restrictions. A stochastic linear restriction would be something of the form

$$R\beta = r + v$$

where $R$ and $r$ are as in the case of exact linear restrictions, but $v$ is a random vector. For example, the model could be

$$y = X\beta + \varepsilon$$
$$R\beta = r + v$$
$$\begin{pmatrix} \varepsilon \\ v \end{pmatrix} \sim N\left[ \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma^2_\varepsilon I_n & 0 \\ 0 & \sigma^2_v I_q \end{pmatrix} \right]$$

This sort of model isn't in line with the classical interpretation of parameters as constants: according to that interpretation the left hand side of $R\beta = r + v$ is constant but the right is random. This model does fit the Bayesian perspective: we combine information coming from the model and the data, summarized in

$$y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2_\varepsilon I_n)$$

with prior beliefs regarding the distribution of the parameter, summarized in

$$R\beta \sim N(r, \sigma^2_v I_q)$$

Since the sample is random it is reasonable to suppose that $E(\varepsilon v') = 0$, which is the last piece of information in the specification. How can you estimate using this model? The solution is to treat the restrictions as artificial data. Write

$$\begin{pmatrix} y \\ r \end{pmatrix} = \begin{pmatrix} X \\ R \end{pmatrix} \beta + \begin{pmatrix} \varepsilon \\ -v \end{pmatrix}$$

This model is heteroscedastic, since $\sigma^2_\varepsilon \neq \sigma^2_v$. Define the prior precision $k = \sigma_\varepsilon / \sigma_v$. This expresses the degree of belief in the restriction relative to the variability of the data. Supposing that we specify $k$, then the model

$$\begin{pmatrix} y \\ kr \end{pmatrix} = \begin{pmatrix} X \\ kR \end{pmatrix} \beta + \begin{pmatrix} \varepsilon \\ -kv \end{pmatrix}$$

is homoscedastic and can be estimated by OLS. Note that this estimator is biased. It is consistent, however, given that $k$ is a fixed constant, even if the restriction is false (this is in contrast to the case of false exact restrictions). To see this, note that there are $Q$ restrictions, where $Q$ is the number of rows of $R$. As $n \rightarrow \infty$, these $Q$ artificial observations have no weight in the objective function, so the estimator has the same limiting objective function as the OLS estimator, and is therefore consistent.

To motivate the use of stochastic restrictions, consider the expectation of the squared length of $\hat{\beta}$:

$$E(\hat{\beta}'\hat{\beta}) = E\left\{ \left[\beta + (X'X)^{-1}X'\varepsilon\right]' \left[\beta + (X'X)^{-1}X'\varepsilon\right] \right\}$$
$$= \beta'\beta + E\left[\varepsilon' X (X'X)^{-1}(X'X)^{-1}X'\varepsilon\right]$$
$$= \beta'\beta + \mathrm{Tr}\left[(X'X)^{-1}\right]\sigma^2$$
$$= \beta'\beta + \sigma^2 \sum_{i=1}^{K} \lambda_i \quad \text{(the trace is the sum of the eigenvalues)}$$
$$\geq \beta'\beta + \lambda_{\max\left[(X'X)^{-1}\right]}\sigma^2 \quad \text{(the eigenvalues are all positive, since } X'X \text{ is p.d.)}$$

so

$$E(\hat{\beta}'\hat{\beta}) \geq \beta'\beta + \frac{\sigma^2}{\lambda_{\min(X'X)}}$$

where $\lambda_{\min(X'X)}$ is the minimum eigenvalue of $X'X$ (which is the inverse of the maximum eigenvalue of $(X'X)^{-1}$). As collinearity becomes worse and worse, $X'X$ becomes more nearly singular, so $\lambda_{\min(X'X)}$ tends to zero (recall that the determinant is the product of the eigenvalues) and $E(\hat{\beta}'\hat{\beta})$ tends to infinity. On the other hand, $\beta'\beta$ is finite.

Now consider the restriction $I_K \beta = 0 + v$. With this restriction the model becomes

$$\begin{pmatrix} y \\ 0 \end{pmatrix} = \begin{pmatrix} X \\ kI_K \end{pmatrix} \beta + \begin{pmatrix} \varepsilon \\ -kv \end{pmatrix}$$

and the estimator is

$$\hat{\beta}_{ridge} = \left[ \begin{pmatrix} X' & kI_K \end{pmatrix} \begin{pmatrix} X \\ kI_K \end{pmatrix} \right]^{-1} \begin{pmatrix} X' & kI_K \end{pmatrix} \begin{pmatrix} y \\ 0 \end{pmatrix} = \left( X'X + k^2 I_K \right)^{-1} X'y$$

This is the ordinary ridge regression estimator. The ridge regression estimator can be seen to add $k^2 I_K$, which is nonsingular, to $X'X$, which is more and more nearly singular as collinearity becomes worse and worse. As $k \rightarrow \infty$, the restrictions tend to $\beta = 0$, that is, the coefficients are shrunken toward zero. Also, the estimator tends to

$$\hat{\beta}_{ridge} = \left( X'X + k^2 I_K \right)^{-1} X'y \rightarrow \left( k^2 I_K \right)^{-1} X'y = \frac{X'y}{k^2} \rightarrow 0$$

so $\hat{\beta}'_{ridge}\hat{\beta}_{ridge} \rightarrow 0$. This is clearly a false restriction in the limit, if our original model is at all sensible.

There should be some amount of shrinkage that is in fact a true restriction. The problem is to determine the $k$ such that the restriction is correct. The interest in ridge regression centers on the fact that it can be shown that there exists a $k$ such that $MSE(\hat{\beta}_{ridge}) < MSE(\hat{\beta}_{OLS})$. The problem is that this $k$ depends on $\beta$ and $\sigma^2$, which are unknown.

The ridge trace method plots $\hat{\beta}'_{ridge}\hat{\beta}_{ridge}$ as a function of $k$, and chooses the value of $k$ that "artistically" seems appropriate (e.g., where the effect of increasing $k$ dies off). Draw picture here. This means of choosing $k$ is obviously subjective. This is not a problem from the Bayesian perspective: the choice of $k$ reflects prior beliefs about the length of $\beta$.

In summary, the ridge estimator offers some hope, but it is impossible to guarantee that it will outperform the OLS estimator. Collinearity is a fact of life in econometrics, and there is no clear solution to the problem.
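To see how the ridge trace works in practice, here is a minimal Octave sketch. The data, the degree of collinearity, and the grid of $k$ values are all hypothetical, chosen only for illustration:

# Ridge trace sketch: squared length of the ridge estimator as k varies,
# for artificial near-collinear data.
n = 100;
x1 = randn(n,1);
x2 = x1 + 0.01*randn(n,1);                  # nearly collinear with x1
X = [ones(n,1), x1, x2];
y = X*[1; 1; 1] + randn(n,1);
K = columns(X);
kvals = linspace(0, 10, 50);
len = zeros(50,1);
for i = 1:50
  b = (X'*X + kvals(i)^2*eye(K)) \ (X'*y);  # ridge estimator
  len(i) = b'*b;
endfor
plot(kvals, len);
xlabel("k"); ylabel("squared length of the ridge estimator");

The squared length drops rapidly for small $k$ and then flattens out; the "artistic" choice is a $k$ in the flat region.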

9.2. Measurement error

Measurement error is exactly what it says: either the dependent variable or the regressors are measured with error. Thinking about the way economic data are reported, measurement error is probably quite prevalent. For example, estimates of growth of GDP, inflation, etc. are commonly revised several times. Why should the last revision necessarily be correct?

9.2.1. Error of measurement of the dependent variable. Measurement errors in the dependent variable and the regressors have important differences. First consider error in measurement of the dependent variable. The data generating process is presumed to be

$$y^* = X\beta + \varepsilon$$
$$y = y^* + v$$
$$v_t \sim iid(0, \sigma^2_v)$$

where $y^*$ is the unobservable true dependent variable, and $y$ is what is observed. We assume that $\varepsilon$ and $v$ are independent and that $y^* = X\beta + \varepsilon$ satisfies the classical assumptions. Given this, we have

$$y = X\beta + \varepsilon + v = X\beta + \omega$$
$$\omega_t \sim iid(0, \sigma^2_\varepsilon + \sigma^2_v)$$

- As long as $v$ is uncorrelated with $X$, this model satisfies the classical assumptions and can be estimated by OLS. This type of measurement error isn't a problem, then.

9.2.2. Error of measurement of the regressors. The situation isn't so good in this case. The DGP is

$$y_t = x^{*\prime}_t \beta + \varepsilon_t$$
$$x_t = x^*_t + v_t$$
$$v_t \sim iid(0, \Sigma_v)$$

where $\Sigma_v$ is a $K \times K$ matrix. Now $X^*$ contains the true, unobserved regressors, and $X$ is what is observed. Again assume that $v$ is independent of $\varepsilon$, and that the model $y = X^*\beta + \varepsilon$ satisfies the classical assumptions. Now we have

$$y_t = (x_t - v_t)'\beta + \varepsilon_t = x'_t\beta - v'_t\beta + \varepsilon_t = x'_t\beta + \omega_t$$

The problem is that now there is a correlation between $x_t$ and $\omega_t$, since

$$E(x_t \omega_t) = E\left[ (x^*_t + v_t)\left(-v'_t\beta + \varepsilon_t\right) \right] = -\Sigma_v \beta$$

where $\Sigma_v = E(v_t v'_t)$. Because of this correlation, the OLS estimator is biased and inconsistent, just as in the case of autocorrelated errors with lagged dependent variables. In matrix notation, write the estimated model as

$$y = X\beta + \omega$$

We have that

$$\hat{\beta} = \left( \frac{X'X}{n} \right)^{-1} \left( \frac{X'y}{n} \right)$$

and

$$\mathrm{plim} \left( \frac{X'X}{n} \right) = \mathrm{plim}\, \frac{(X^{*\prime} + V')(X^* + V)}{n} = Q_{X^*} + \Sigma_v$$

since $X^*$ and $V$ are independent, and

$$\mathrm{plim}\, \frac{V'V}{n} = \lim E\, \frac{1}{n} \sum_{t=1}^n v_t v'_t = \Sigma_v$$

Likewise,

$$\mathrm{plim} \left( \frac{X'y}{n} \right) = \mathrm{plim}\, \frac{(X^{*\prime} + V')(X^*\beta + \varepsilon)}{n} = Q_{X^*}\beta$$

so

$$\mathrm{plim}\, \hat{\beta} = \left( Q_{X^*} + \Sigma_v \right)^{-1} Q_{X^*}\beta.$$

So we see that the least squares estimator is inconsistent when the regressors are measured with error.

- A potential solution to this problem is the instrumental variables (IV) estimator, which we'll discuss shortly.
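A quick Monte Carlo illustration of this inconsistency, as an Octave sketch with hypothetical parameter values. With a single regressor the plim formula above reduces to $\beta Q_{X^*}/(Q_{X^*} + \sigma^2_v)$, so the slope is attenuated toward zero:

# Attenuation bias from measurement error in the regressor.
n = 10000;  beta = 1;  sig_v = 1;
xstar = randn(n,1);                  # true regressor, Q = V(xstar) = 1
x = xstar + sig_v*randn(n,1);        # observed regressor, with error
y = xstar*beta + randn(n,1);
b_ols = (x'*x) \ (x'*y);
printf("OLS slope: %f   plim: %f\n", b_ols, beta*1/(1 + sig_v^2));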


9.3. Missing observations

Missing observations occur quite frequently: time series data may not be gathered in a certain year, or respondents to a survey may not answer all questions. We'll consider two cases: missing observations on the dependent variable and missing observations on the regressors.

9.3.1. Missing observations on the dependent variable. In this case, we have

$$y = X\beta + \varepsilon$$

or

$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \beta + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \end{pmatrix}$$

where $y_2$ is not observed. Otherwise, we assume the classical assumptions hold.

- A clear alternative is to simply estimate using the complete observations

$$y_1 = X_1\beta + \varepsilon_1$$

Since these observations satisfy the classical assumptions, one could estimate by OLS.
- The question remains whether or not one could somehow replace the unobserved $y_2$ by a predictor, and improve over OLS in some sense. Let $\hat{y}_2$ be the predictor of $y_2$. Now

$$\hat{\beta} = \left[ \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}' \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \right]^{-1} \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}' \begin{pmatrix} y_1 \\ \hat{y}_2 \end{pmatrix} = \left[ X'_1 X_1 + X'_2 X_2 \right]^{-1} \left[ X'_1 y_1 + X'_2 \hat{y}_2 \right]$$

Recall that the OLS fonc are

$$X'X\hat{\beta} = X'y$$

so if we regressed using only the first (complete) observations, we would have

$$X'_1 X_1 \hat{\beta}_1 = X'_1 y_1.$$

Likewise, an OLS regression using only the second (filled in) observations would give

$$X'_2 X_2 \hat{\beta}_2 = X'_2 \hat{y}_2.$$

Substituting these into the equation for the overall combined estimator gives

$$\hat{\beta} = \left[ X'_1 X_1 + X'_2 X_2 \right]^{-1} \left[ X'_1 X_1 \hat{\beta}_1 + X'_2 X_2 \hat{\beta}_2 \right]$$
$$= \left[ X'_1 X_1 + X'_2 X_2 \right]^{-1} X'_1 X_1 \hat{\beta}_1 + \left[ X'_1 X_1 + X'_2 X_2 \right]^{-1} X'_2 X_2 \hat{\beta}_2$$
$$\equiv A\hat{\beta}_1 + (I_K - A)\hat{\beta}_2$$

where

$$A \equiv \left[ X'_1 X_1 + X'_2 X_2 \right]^{-1} X'_1 X_1$$

and we use

$$\left[ X'_1 X_1 + X'_2 X_2 \right]^{-1} X'_2 X_2 = \left[ X'_1 X_1 + X'_2 X_2 \right]^{-1} \left[ (X'_1 X_1 + X'_2 X_2) - X'_1 X_1 \right] = I_K - A.$$

Now,

$$E(\hat{\beta}) = A\beta + (I_K - A)E\left( \hat{\beta}_2 \right)$$

and this will be unbiased only if $E(\hat{\beta}_2) = \beta$.

- The conclusion is that the filled-in observations alone would need to define an unbiased estimator. This will be the case only if

$$\hat{y}_2 = X_2\beta + \hat{\varepsilon}_2$$

where $\hat{\varepsilon}_2$ has mean zero. Clearly, it is difficult to satisfy this condition without knowledge of $\beta$.
- Note that putting $\hat{y}_2 = \bar{y}_1$ does not satisfy the condition and therefore leads to a biased estimator.

EXERCISE 13. Formally prove this last statement.

- One possibility that has been suggested (see Greene, page 275) is to estimate $\beta$ using a first round estimation using only the complete observations

$$\hat{\beta}_1 = (X'_1 X_1)^{-1} X'_1 y_1$$

then use this estimate, $\hat{\beta}_1$, to predict $y_2$:

$$\hat{y}_2 = X_2 \hat{\beta}_1 = X_2 (X'_1 X_1)^{-1} X'_1 y_1$$

Now, the overall estimate is a weighted average of $\hat{\beta}_1$ and $\hat{\beta}_2$, just as above, but we have

$$\hat{\beta}_2 = (X'_2 X_2)^{-1} X'_2 \hat{y}_2 = (X'_2 X_2)^{-1} X'_2 X_2 \hat{\beta}_1 = \hat{\beta}_1$$

This shows that this suggestion is completely empty of content: the final estimator is the same as the OLS estimator using only the complete observations.

9.3.2. The sample selection problem. In the above discussion we assumed that the missing observations are random. The sample selection problem is a case where the missing observations are not random. Consider the model

$$y^*_t = x'_t \beta + \varepsilon_t$$

which is assumed to satisfy the classical assumptions. However, $y^*_t$ is not always observed. What is observed is $y_t$ defined as

$$y_t = y^*_t \quad \text{if } y^*_t \geq 0$$

Or, in other words, $y^*_t$ is missing when it is less than zero.

The difference in this case is that the missing values are not random: they are correlated with the $x_t$. Consider the case

$$y^* = x + \varepsilon$$

with $\varepsilon \sim N(0, \sigma^2_\varepsilon)$, but using only the observations for which $y^* > 0$ to estimate. Figure 9.3.1 illustrates the bias. The Octave program is sampsel.m

FIGURE 9.3.1. Sample selection bias (scatterplot of the data, with the true line and the fitted line).
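A sketch in the spirit of sampsel.m follows; the actual program may differ in its details, and the parameter values here are hypothetical:

# Sample selection sketch: true model y* = x + e, but y* is observed
# only when it is positive.
n = 200;
x = 10*rand(n,1);
e = 5*randn(n,1);
ystar = x + e;
keep = (ystar > 0);                  # the selection rule
Xk = [ones(sum(keep),1), x(keep)];
b = Xk \ ystar(keep);                # OLS on the selected sample
printf("intercept: %f   slope: %f (true: 0 and 1)\n", b(1), b(2));
# Low draws of e are dropped disproportionately when x is small, so the
# fitted line sits too high at the left: the intercept is biased upward
# and the slope toward zero, as in Figure 9.3.1.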

9.3.3. Missing observations on the regressors. Again the model is

$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \beta + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \end{pmatrix}$$

but we assume now that each row of $X_2$ has an unobserved component(s). Again, one could just estimate using the complete observations, but it may seem frustrating to have to drop observations simply because of a single missing variable.


In general, if the unobserved $X_2$ is replaced by some prediction, $X^*_2$, then we are in the case of errors of observation. As before, this means that the OLS estimator is biased when $X^*_2$ is used instead of $X_2$. Consistency is salvaged, however, as long as the number of missing observations doesn't increase with $n$.

- Including observations that have missing values replaced by ad hoc values can be interpreted as introducing false stochastic restrictions. In general, this introduces bias. It is difficult to determine whether MSE increases or decreases. Monte Carlo studies suggest that it is dangerous to simply substitute the mean, for example.
- In the case that there is only one regressor other than the constant, substitution of $\bar{x}$ for the missing $x_t$ does not lead to bias. This is a special case that doesn't hold for $K > 2$.

EXERCISE 14. Prove this last statement.

- In summary, if one is strongly concerned with bias, it is best to drop observations that have missing components. There is potential for reduction of MSE through filling in missing elements with intelligent guesses, but this could also increase MSE.

Exercises

(1) Consider the Nerlove model

$$\ln C = \beta_1 + \beta_2 \ln Q + \beta_3 \ln P_L + \beta_4 \ln P_F + \beta_5 \ln P_K + \varepsilon$$

When this model is estimated by OLS, some coefficients are not significant. This may be due to collinearity.
(a) Calculate the correlation matrix of the regressors.
(b) Perform artificial regressions to see if collinearity is a problem.
(c) Apply the ridge regression estimator.
    (i) Plot the ridge trace diagram.
    (ii) Check what happens as $k$ goes to zero, and as $k$ becomes very large.


CHAPTER 10

Functional form and nonnested tests

Though theory often suggests which conditioning variables should be included, and suggests the signs of certain derivatives, it is usually silent regarding the functional form of the relationship between the dependent variable and the regressors. For example, considering a cost function, one could have a Cobb-Douglas model

$$c = A w_1^{\beta_1} w_2^{\beta_2} q^{\beta_3} e^{\varepsilon}$$

This model, after taking logarithms, gives

$$\ln c = \beta_0 + \beta_1 \ln w_1 + \beta_2 \ln w_2 + \beta_3 \ln q + \varepsilon$$

where $\beta_0 = \ln A$. Theory suggests that $A > 0$, $\beta_1 > 0$, $\beta_2 > 0$, $\beta_3 > 0$. This model isn't compatible with a fixed cost of production since $c = 0$ when $q = 0$. Homogeneity of degree one in input prices suggests that $\beta_1 + \beta_2 = 1$, while constant returns to scale implies $\beta_3 = 1$.

While this model may be reasonable in some cases, an alternative

$$\sqrt{c} = \beta_0 + \beta_1 \sqrt{w_1} + \beta_2 \sqrt{w_2} + \beta_3 \sqrt{q} + \varepsilon$$

may be just as plausible. Note that $\sqrt{x}$ and $\ln(x)$ look quite alike, for certain values of the regressors, and up to a linear transformation, so it may be difficult to choose between these models.

The basic point is that many functional forms are compatible with the linear-in-parameters model, since this model can incorporate a wide variety of nonlinear transformations of the dependent variable and the regressors. For example, suppose that $g(\cdot)$ is a real valued function and that $x(\cdot)$ is a $K$-vector-valued function. The following model is linear in the parameters but nonlinear in the variables:

$$x_t = x(z_t)$$
$$y_t = x'_t \beta + \varepsilon_t$$

There may be $P$ fundamental conditioning variables $z_t$, but there may be $K$ regressors, where $K$ may be smaller than, equal to or larger than $P$. For example, $x_t$ could include squares and cross products of the conditioning variables in $z_t$.

10.1. Flexible functional forms

Given that the functional form of the relationship between the dependent variable and the regressors is in general unknown, one might wonder if there exist parametric models that can closely approximate a wide variety of functional relationships. A "Diewert-Flexible" functional form is defined as one such that the function, the vector of first derivatives and the matrix of second derivatives can take on an arbitrary value at a single data point. Flexibility in this sense clearly requires that there be at least

$$K = 1 + P + \frac{P^2 - P}{2} + P$$

free parameters: one for each independent effect that we wish to model.

Suppose that the model is

$$y = g(x) + \varepsilon$$

A second-order Taylor's series expansion (with remainder term) of the function $g(x)$ about the point $x = 0$ is

$$g(x) = g(0) + x' D_x g(0) + \frac{x' D^2_x g(0)\, x}{2} + R$$

Use the approximation, which simply drops the remainder term, as an approximation to $g(x)$:

$$g(x) \simeq g_K(x) = g(0) + x' D_x g(0) + \frac{x' D^2_x g(0)\, x}{2}$$

As $x \rightarrow 0$, the approximation becomes more and more exact, in the sense that $g_K(x) \rightarrow g(x)$, $D_x g_K(x) \rightarrow D_x g(x)$ and $D^2_x g_K(x) \rightarrow D^2_x g(x)$. For $x = 0$, the approximation is exact, up to the second order. The idea behind many flexible functional forms is to note that $g(0)$, $D_x g(0)$ and $D^2_x g(0)$ are all constants. If we treat them as parameters, the approximation will have exactly enough free parameters to approximate the function $g(x)$, which is of unknown form, exactly, up to second order, at the point $x = 0$. The model is

$$g_K(x) = \alpha + x'\beta + \frac{1}{2} x' \Gamma x$$

so the regression model to fit is

$$y = \alpha + x'\beta + \frac{1}{2} x' \Gamma x + \varepsilon$$
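In practice, fitting this model is just OLS on an expanded regressor set. A minimal Octave sketch of the construction, for $P = 2$ conditioning variables (the numeric values are illustrative only):

# Build the second-order expansion regressors for one observation x.
x = [0.3; -0.5];
quad = [x(1)^2/2; x(2)^2/2; x(1)*x(2)];  # coefficients: gamma_11, gamma_22, gamma_12
row = [1, x', quad'];                    # 1 + P + P(P+1)/2 = 6 free parameters

Stacking such rows over observations and regressing $y$ on them estimates $(\alpha, \beta, \Gamma)$, subject to the consistency caveats discussed next.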

- While the regression model has enough free parameters to be Diewert-flexible, the question remains: is $\mathrm{plim}\, \hat{\alpha} = g(0)$? Is $\mathrm{plim}\, \hat{\beta} = D_x g(0)$? Is $\mathrm{plim}\, \hat{\Gamma} = D^2_x g(0)$?
- The answer is no, in general. The reason is that if we treat the true values of the parameters as these derivatives, then $\varepsilon$ is forced to play the part of the remainder term, which is a function of $x$, so that $x$ and $\varepsilon$ are correlated in this case. As before, the estimator is biased in this case.
- A simpler example would be to consider a first-order Taylor series approximation to a quadratic function. Draw picture.
- The conclusion is that "flexible functional forms" aren't really flexible in a useful statistical sense, in that neither the function itself nor its derivatives are consistently estimated, unless the function belongs to the parametric family of the specified functional form. In order to lead to consistent inferences, the regression model must be correctly specified.

10.1.1. The translog form. In spite of the fact that FFF's aren't really as flexible as they were originally claimed to be, they are useful, and they are certainly subject to less bias due to misspecification of the functional form than are many popular forms, such as the Cobb-Douglas or the simple linear in the variables model. The translog model is probably the most widely used FFF. This model is as above, except that the variables are subjected to a logarithmic transformation. Also, the expansion point is usually taken to be the sample mean of the data, after the logarithmic transformation. The model is defined by

$$y = \ln(c)$$
$$x = \ln\left(\frac{z}{\bar{z}}\right) = \ln(z) - \ln(\bar{z})$$
$$y = \alpha + x'\beta + \frac{1}{2} x' \Gamma x + \varepsilon$$

In this presentation, the subscript that distinguishes observations is suppressed for simplicity. Note that

$$\frac{\partial y}{\partial x} = \beta + \Gamma x = \frac{\partial \ln(c)}{\partial \ln(z)} \quad \text{(the other part of } x \text{ is constant)} = \frac{\partial c}{\partial z} \frac{z}{c}$$

which is the elasticity of $c$ with respect to $z$. This is a convenient feature of the translog model. Note that at the means of the conditioning variables, $\bar{z}$, $x = 0$, so

$$\left. \frac{\partial y}{\partial x} \right|_{z = \bar{z}} = \beta$$

so the $\beta$ are the first-order elasticities, at the means of the data.

To illustrate, consider that $y$ is cost of production:

$$y = c(w, q)$$

where $w$ is a vector of input prices and $q$ is output. We could add other variables by extending $q$ in the obvious manner, but this is suppressed for simplicity. By Shephard's lemma, the conditional factor demands are

$$x = \frac{\partial c(w, q)}{\partial w}$$

and the cost shares of the factors are therefore

$$s = \frac{wx}{c} = \frac{\partial c(w, q)}{\partial w} \frac{w}{c}$$

which is simply the vector of elasticities of cost with respect to input prices. If the cost function is modeled using a translog function, we have

$$\ln(c) = \alpha + x'\beta + z\delta + \frac{1}{2} \begin{pmatrix} x' & z \end{pmatrix} \begin{pmatrix} \Gamma_{11} & \Gamma_{12} \\ \Gamma'_{12} & \Gamma_{22} \end{pmatrix} \begin{pmatrix} x \\ z \end{pmatrix} = \alpha + x'\beta + z\delta + \frac{1}{2} x' \Gamma_{11} x + x' \Gamma_{12} z + \frac{1}{2} z^2 \Gamma_{22}$$

where $x = \ln(w/\bar{w})$ and $z = \ln(q/\bar{q})$, and

$$\Gamma_{11} = \begin{pmatrix} \gamma_{11} & \gamma_{12} \\ \gamma_{12} & \gamma_{22} \end{pmatrix}, \qquad \Gamma_{12} = \begin{pmatrix} \gamma_{13} \\ \gamma_{23} \end{pmatrix}, \qquad \Gamma_{22} = \gamma_{33}.$$

Note that symmetry of the second derivatives has been imposed.

Then the share equations are just

$$s = \beta + \begin{pmatrix} \Gamma_{11} & \Gamma_{12} \end{pmatrix} \begin{pmatrix} x \\ z \end{pmatrix}$$

Therefore, the share equations and the cost equation have parameters in common. By pooling the equations together and imposing the (true) restriction that the parameters of the equations be the same, we can gain efficiency.

To illustrate in more detail, consider the case of two inputs, so

$$x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}.$$

In this case the translog model of the logarithmic cost function is

$$\ln c = \alpha + \beta_1 x_1 + \beta_2 x_2 + \delta z + \frac{\gamma_{11}}{2} x_1^2 + \frac{\gamma_{22}}{2} x_2^2 + \frac{\gamma_{33}}{2} z^2 + \gamma_{12} x_1 x_2 + \gamma_{13} x_1 z + \gamma_{23} x_2 z$$

The two cost shares of the inputs are the derivatives of $\ln c$ with respect to $x_1$ and $x_2$:

$$s_1 = \beta_1 + \gamma_{11} x_1 + \gamma_{12} x_2 + \gamma_{13} z$$
$$s_2 = \beta_2 + \gamma_{12} x_1 + \gamma_{22} x_2 + \gamma_{23} z$$

Note that the share equations and the cost equation have parameters in common. One can do a pooled estimation of the three equations at once, imposing that the parameters are the same. In this way we're using more observations and therefore more information, which will lead to improved efficiency. Note that this does assume that the cost equation is correctly specified (i.e., not an approximation), since otherwise the derivatives would not be the true derivatives of the log cost function, and would then be misspecified for the shares. To pool the equations, write the model in matrix form (adding in error terms)

$$\begin{pmatrix} \ln c \\ s_1 \\ s_2 \end{pmatrix} = \begin{pmatrix} 1 & x_1 & x_2 & z & \frac{x_1^2}{2} & \frac{x_2^2}{2} & \frac{z^2}{2} & x_1 x_2 & x_1 z & x_2 z \\ 0 & 1 & 0 & 0 & x_1 & 0 & 0 & x_2 & z & 0 \\ 0 & 0 & 1 & 0 & 0 & x_2 & 0 & x_1 & 0 & z \end{pmatrix} \begin{pmatrix} \alpha \\ \beta_1 \\ \beta_2 \\ \delta \\ \gamma_{11} \\ \gamma_{22} \\ \gamma_{33} \\ \gamma_{12} \\ \gamma_{13} \\ \gamma_{23} \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \end{pmatrix}$$

This is one observation on the three equations. With the appropriate notation, a single observation can be written as

$$y_t = X_t \theta + \varepsilon_t$$

The overall model would stack $n$ observations on the three equations for a total of $3n$ observations:

$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{pmatrix} \theta + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$

Next we need to consider the errors. For observation $t$ the errors can be placed in a vector

$$\varepsilon_t = \begin{pmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \\ \varepsilon_{3t} \end{pmatrix}$$

First consider the covariance matrix of this vector: the shares are certainly correlated since they must sum to one. (In fact, with 2 shares the variances are equal and the covariance is $-1$ times the variance. General notation is used to allow easy extension to the case of more than 2 inputs.) Also, it's likely that the shares and the cost equation have different variances. Supposing that the model is covariance stationary, the variance of $\varepsilon_t$ won't depend upon $t$:

$$Var(\varepsilon_t) = \Sigma_0 = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} \\ \cdot & \sigma_{22} & \sigma_{23} \\ \cdot & \cdot & \sigma_{33} \end{pmatrix}$$

Note that this matrix is singular, since the shares sum to 1. Assuming that there is no autocorrelation, the overall covariance matrix has the seemingly unrelated regressions (SUR) structure

$$Var \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix} = \Sigma = \begin{pmatrix} \Sigma_0 & 0 & \cdots & 0 \\ 0 & \Sigma_0 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & \Sigma_0 \end{pmatrix} = I_n \otimes \Sigma_0$$

where the symbol $\otimes$ indicates the Kronecker product. The Kronecker product of two matrices $A$ and $B$ is

$$A \otimes B = \begin{pmatrix} a_{11} B & a_{12} B & \cdots & a_{1q} B \\ a_{21} B & & \ddots & \vdots \\ \vdots & & & \\ a_{p1} B & \cdots & & a_{pq} B \end{pmatrix}.$$

Personally, I can never keep straight the roles of $A$ and $B$.
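For what it's worth, Octave computes Kronecker products directly, which makes the ordering easy to check on a small example:

A = [1, 2; 3, 4];
B = eye(2);
kron(A, B)   # each element a_ij of A is replaced by the block a_ij * B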

10.1.2. FGLS estimation of a translog model. So, this model has heteroscedasticity and autocorrelation, so OLS won't be efficient. The next question is: how do we estimate efficiently using FGLS? FGLS is based upon inverting the estimated error covariance $\hat{\Sigma}$. So we need to estimate $\Sigma$.

An asymptotically efficient procedure is (supposing normality of the errors):

(1) Estimate each equation by OLS.
(2) Estimate $\Sigma_0$ using

$$\hat{\Sigma}_0 = \frac{1}{n} \sum_{t=1}^n \hat{\varepsilon}_t \hat{\varepsilon}'_t$$

(3) Next we need to account for the singularity of $\Sigma_0$. It can be shown that $\hat{\Sigma}_0$ will be singular when the shares sum to one, so FGLS won't work. The solution is to drop one of the share equations, for example the second. The model becomes

$$\begin{pmatrix} \ln c \\ s_1 \end{pmatrix} = \begin{pmatrix} 1 & x_1 & x_2 & z & \frac{x_1^2}{2} & \frac{x_2^2}{2} & \frac{z^2}{2} & x_1 x_2 & x_1 z & x_2 z \\ 0 & 1 & 0 & 0 & x_1 & 0 & 0 & x_2 & z & 0 \end{pmatrix} \begin{pmatrix} \alpha \\ \beta_1 \\ \beta_2 \\ \delta \\ \gamma_{11} \\ \gamma_{22} \\ \gamma_{33} \\ \gamma_{12} \\ \gamma_{13} \\ \gamma_{23} \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \end{pmatrix}$$

or in matrix notation for the observation:

$$y^*_t = X^*_t \theta + \varepsilon^*_t$$

and in stacked notation for all observations we have the $2n$ observations:

$$\begin{pmatrix} y^*_1 \\ y^*_2 \\ \vdots \\ y^*_n \end{pmatrix} = \begin{pmatrix} X^*_1 \\ X^*_2 \\ \vdots \\ X^*_n \end{pmatrix} \theta + \begin{pmatrix} \varepsilon^*_1 \\ \varepsilon^*_2 \\ \vdots \\ \varepsilon^*_n \end{pmatrix}$$

or, finally in matrix notation for all observations:

$$y^* = X^* \theta + \varepsilon^*$$

Considering the error covariance, we can define

$$\Sigma^*_0 = Var \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \end{pmatrix}$$
$$\Sigma^* = I_n \otimes \Sigma^*_0$$

Define $\hat{\Sigma}^*_0$ as the leading $2 \times 2$ block of $\hat{\Sigma}_0$, and form

$$\hat{\Sigma}^* = I_n \otimes \hat{\Sigma}^*_0.$$

This is a consistent estimator, following the consistency of OLS and applying a LLN.

(4) Next compute the Cholesky factorization

$$\hat{P}_0 = Chol\left[ \left( \hat{\Sigma}^*_0 \right)^{-1} \right]$$

so that $\hat{P}'_0 \hat{P}_0 = (\hat{\Sigma}^*_0)^{-1}$, and the corresponding factorization of the overall covariance matrix of the 2 equation model, which can be calculated as

$$\hat{P} = I_n \otimes \hat{P}_0$$

(5) Finally the FGLS estimator can be calculated by applying OLS to the transformed model

$$\hat{P} y^* = \hat{P} X^* \theta + \hat{P} \varepsilon^*$$

or by directly using the GLS formula

$$\hat{\theta}_{FGLS} = \left[ X^{*\prime} \left( \hat{\Sigma}^* \right)^{-1} X^* \right]^{-1} X^{*\prime} \left( \hat{\Sigma}^* \right)^{-1} y^*$$

It is equivalent to transform each observation individually:

$$\hat{P}_0 y^*_t = \hat{P}_0 X^*_t \theta + \hat{P}_0 \varepsilon^*_t$$

and then apply OLS. This is probably the simplest approach.
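The following Octave sketch implements steps (1) through (5) for a generic pooled two-equation model $y^*_t = X^*_t \theta + \varepsilon^*_t$ with a common parameter $\theta$; the data and parameter values are simulated and hypothetical, standing in for the cost/share system:

# FGLS for a pooled 2-equation system sharing the parameter theta.
n = 100;  K = 3;
theta = [1; 0.5; -0.5];
R = chol([1.0, 0.3; 0.3, 0.5]);       # R'*R = true error covariance
X = randn(2*n, K);                    # stacked 2 x K blocks X*_t
E = randn(n,2) * R;                   # contemporaneously correlated errors
y = X*theta + reshape(E', 2*n, 1);    # interleave errors by observation
# (1)-(2): pooled OLS residuals, then estimate Sigma0 observation by observation
e = reshape(y - X*(X\y), 2, n);
Sigma0 = (e*e')/n;
# (4): P0 such that P0'*P0 = inv(Sigma0), so P0*Sigma0*P0' = I
P0 = chol(inv(Sigma0));
# (5): transform every 2-row block and apply OLS
P = kron(eye(n), P0);
theta_fgls = (P*X) \ (P*y);

For a large system one would transform observation by observation rather than forming the $2n \times 2n$ matrix $P$, but for a sketch this is the clearest form.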

A few last comments.

(1) We have assumed no autocorrelation across time. This is clearly restrictive. It is relatively simple to relax this, but we won't go into it here.
(2) Also, we have only imposed symmetry of the second derivatives. Another restriction that the model should satisfy is that the estimated shares should sum to 1. This can be accomplished by imposing

$$\beta_1 + \beta_2 = 1$$
$$\sum_{i=1}^{3} \gamma_{ij} = 0, \quad j = 1, 2, 3.$$

These are linear parameter restrictions, so they are easy to impose and will improve efficiency if they are true.
(3) The estimation procedure outlined above can be iterated. That is, estimate $\hat{\theta}_{FGLS}$ as above, then re-estimate $\Sigma^*_0$ using errors calculated as

$$\hat{\varepsilon} = y - X\hat{\theta}_{FGLS}$$

These might be expected to lead to a better estimate than the estimator based on $\hat{\theta}_{OLS}$, since FGLS is asymptotically more efficient. Then re-estimate $\theta$ using the new estimated error covariance. It can be shown that if this is repeated until the estimates don't change (i.e., iterated to convergence) then the resulting estimator is the MLE. At any rate, the asymptotic properties of the iterated and uniterated estimators are the same, since both are based upon a consistent estimator of the error covariance.

10.2. Testing nonnested hypotheses

Given that the choice of functional form isn't perfectly clear, in that many possibilities exist, how can one choose between forms? When one form is a parametric restriction of another, the previously studied tests such as Wald, LR, score or $F$ are all possibilities. For example, the Cobb-Douglas model is a parametric restriction of the translog: the translog is

$$y_t = \alpha + x'_t \beta + \frac{1}{2} x'_t \Gamma x_t + \varepsilon$$

where the variables are in logarithms, while the Cobb-Douglas is

$$y_t = \alpha + x'_t \beta + \varepsilon$$

so a test of the Cobb-Douglas versus the translog is simply a test that $\Gamma = 0$.

The situation is more complicated when we want to test non-nested hypotheses. If the two functional forms are linear in the parameters, and use the same transformation of the dependent variable, then they may be written as

$$M_1: y = X\beta + \varepsilon, \quad \varepsilon_t \sim iid(0, \sigma^2_\varepsilon)$$
$$M_2: y = Z\gamma + \eta, \quad \eta_t \sim iid(0, \sigma^2_\eta)$$

We wish to test hypotheses of the form: $H_0$: $M_i$ is correctly specified versus $H_A$: $M_i$ is misspecified, for $i = 1, 2$.

- One could account for non-iid errors, but we'll suppress this for simplicity.
- There are a number of ways to proceed. We'll consider the $J$ test, proposed by Davidson and MacKinnon, Econometrica (1981). The idea is to artificially nest the two models, e.g.,

$$y = (1 - \alpha) X\beta + \alpha (Z\gamma) + \omega$$

If the first model is correctly specified, then the true value of $\alpha$ is zero. On the other hand, if the second model is correctly specified then $\alpha = 1$.
  - The problem is that this model is not identified in general. For example, if the models share some regressors, as in

$$M_1: y_t = \beta_1 + \beta_2 x_{2t} + \beta_3 x_{3t} + \varepsilon_t$$
$$M_2: y_t = \gamma_1 + \gamma_2 x_{2t} + \gamma_3 x_{4t} + \eta_t$$

then the composite model is

$$y_t = (1 - \alpha)\beta_1 + (1 - \alpha)\beta_2 x_{2t} + (1 - \alpha)\beta_3 x_{3t} + \alpha\gamma_1 + \alpha\gamma_2 x_{2t} + \alpha\gamma_3 x_{4t} + \omega_t$$

Combining terms we get

$$y_t = \left[ (1 - \alpha)\beta_1 + \alpha\gamma_1 \right] + \left[ (1 - \alpha)\beta_2 + \alpha\gamma_2 \right] x_{2t} + (1 - \alpha)\beta_3 x_{3t} + \alpha\gamma_3 x_{4t} + \omega_t = \delta_1 + \delta_2 x_{2t} + \delta_3 x_{3t} + \delta_4 x_{4t} + \omega_t$$

The four $\delta$'s are consistently estimable, but $\alpha$ is not, since we have four equations in 7 unknowns, so one can't test the hypothesis that $\alpha = 0$.

The idea of the $J$ test is to substitute $\hat{\gamma}$ in place of $\gamma$. This is a consistent estimator supposing that the second model is correctly specified. It will tend to a finite probability limit even if the second model is misspecified. Then estimate the model

$$y = (1 - \alpha) X\beta + \alpha (Z\hat{\gamma}) + \omega = X\theta + \alpha\hat{y} + \omega$$

where $\hat{y} = Z(Z'Z)^{-1}Z'y = P_Z y$. In this model, $\alpha$ is consistently estimable, and one can show that, under the hypothesis that the first model is correct, $\alpha \overset{p}{\rightarrow} 0$ and that the ordinary $t$-statistic for $\alpha = 0$ is asymptotically normal:

$$t = \frac{\hat{\alpha}}{\hat{\sigma}_{\hat{\alpha}}} \overset{a}{\sim} N(0, 1)$$

(a worked Octave sketch of this procedure appears at the end of the section).

- If the second model is correctly specified, then $t \overset{p}{\rightarrow} \infty$, since $\hat{\alpha}$ tends in probability to 1, while its estimated standard error tends to zero. Thus the test will always reject the false null model, asymptotically, since the statistic will eventually exceed any critical value with probability one.

- We can reverse the roles of the models, testing the second against the first.
- It may be the case that neither model is correctly specified. In this case, the test will still reject the null hypothesis, asymptotically, if we use critical values from the $N(0, 1)$ distribution, since as long as $\hat{\alpha}$ tends to something different from zero, $|t| \overset{p}{\rightarrow} \infty$. Of course, when we switch the roles of the models the other will also be rejected asymptotically.
- In summary, there are 4 possible outcomes when we test two models, each against the other. Both may be rejected, neither may be rejected, or one of the two may be rejected.
- There are other tests available for non-nested models. The $J$ test is simple to apply when both models are linear in the parameters. The $P$-test is similar, but easier to apply when $M_1$ is nonlinear.
- The above presentation assumes that the same transformation of the dependent variable is used by both models. MacKinnon, White and Davidson, Journal of Econometrics, (1983) shows how to deal with the case of different transformations.
- Monte-Carlo evidence shows that these tests often over-reject a correctly specified model. One can use bootstrap critical values to get better-performing tests.
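Here is the Octave sketch of the $J$ test promised above, on simulated data with hypothetical parameter values; $M_1$ is the true model here, so the statistic should be roughly standard normal:

# J test of M1: y = X*beta against M2: y = Z*gamma.
n = 100;
x2 = randn(n,1);  x3 = randn(n,1);  x4 = randn(n,1);
y = 1 + x2 + x3 + randn(n,1);        # data generated by M1
X = [ones(n,1), x2, x3];
Z = [ones(n,1), x2, x4];
yhat2 = Z*(Z\y);                     # prediction from the second model
W = [X, yhat2];
b = W \ y;                           # estimate the artificially nested model
e = y - W*b;
sig2 = (e'*e)/(n - columns(W));
V = sig2 * inv(W'*W);
t = b(end)/sqrt(V(end,end));         # t-statistic on alpha
printf("J test statistic: %f\n", t);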


CHAPTER 11

Exogeneity and simultaneity

Several times we've encountered cases where correlation between regressors and the error term leads to bias and inconsistency of the OLS estimator. Cases include autocorrelation with lagged dependent variables and measurement error in the regressors. Another important case is that of simultaneous equations. The cause is different, but the effect is the same.

11.1. Simultaneous equations

Up until now our model is

$$y = X\beta + \varepsilon$$

where, for purposes of estimation, we can treat $X$ as fixed. This means that when estimating $\beta$ we condition on $X$. When analyzing dynamic models, we're not interested in conditioning on $X$, as we saw in the section on stochastic regressors. Nevertheless, the OLS estimator obtained by treating $X$ as fixed continues to have desirable asymptotic properties even in that case.

Simultaneous equations is a different prospect. An example of a simultaneous equation system is a simple supply-demand system:

$$\text{Demand:} \quad q_t = \alpha_1 + \alpha_2 p_t + \alpha_3 y_t + \varepsilon_{1t}$$
$$\text{Supply:} \quad q_t = \beta_1 + \beta_2 p_t + \varepsilon_{2t}$$
$$E \left[ \begin{pmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{pmatrix} \begin{pmatrix} \varepsilon_{1t} & \varepsilon_{2t} \end{pmatrix} \right] = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \cdot & \sigma_{22} \end{pmatrix} \equiv \Sigma, \quad \forall t$$

The presumption is that $q_t$ and $p_t$ are jointly determined at the same time by the intersection of these equations. We'll assume that $y_t$ is determined by some unrelated process. It's easy to see that we have correlation between regressors and errors. Solving for $p_t$:

$$\alpha_1 + \alpha_2 p_t + \alpha_3 y_t + \varepsilon_{1t} = \beta_1 + \beta_2 p_t + \varepsilon_{2t}$$
$$\beta_2 p_t - \alpha_2 p_t = \alpha_1 - \beta_1 + \alpha_3 y_t + \varepsilon_{1t} - \varepsilon_{2t}$$
$$p_t = \frac{\alpha_1 - \beta_1}{\beta_2 - \alpha_2} + \frac{\alpha_3 y_t}{\beta_2 - \alpha_2} + \frac{\varepsilon_{1t} - \varepsilon_{2t}}{\beta_2 - \alpha_2}$$

Now consider whether $p_t$ is uncorrelated with $\varepsilon_{1t}$:

$$E(p_t \varepsilon_{1t}) = E \left\{ \left( \frac{\alpha_1 - \beta_1}{\beta_2 - \alpha_2} + \frac{\alpha_3 y_t}{\beta_2 - \alpha_2} + \frac{\varepsilon_{1t} - \varepsilon_{2t}}{\beta_2 - \alpha_2} \right) \varepsilon_{1t} \right\} = \frac{\sigma_{11} - \sigma_{12}}{\beta_2 - \alpha_2}$$

Because of this correlation, OLS estimation of the demand equation will be biased and inconsistent. The same applies to the supply equation, for the same reason.

In this model, $q_t$ and $p_t$ are the endogenous variables (endogs), that are determined within the system. $y_t$ is an exogenous variable (exog). These concepts are a bit tricky, and we'll return to them in a minute. First, some notation. Suppose we group together current endogs in the vector $Y_t$. If there are $G$ endogs, $Y_t$ is $G \times 1$. Group current and lagged exogs, as well as lagged endogs, in the vector $X_t$, which is $K \times 1$. Stack the errors of the $G$ equations into the error vector $E_t$. The model, with additional assumptions, can be written as

$$Y'_t \Gamma = X'_t B + E'_t$$
$$E_t \sim N(0, \Sigma), \quad \forall t$$
$$E(E_t E'_s) = 0, \quad t \neq s$$

We can stack all $n$ observations and write the model as

$$Y\Gamma = XB + E$$
$$E(X'E) = 0_{(K \times G)}$$
$$vec(E) \sim N(0, \Psi)$$

where

$$Y = \begin{pmatrix} Y'_1 \\ Y'_2 \\ \vdots \\ Y'_n \end{pmatrix}, \quad X = \begin{pmatrix} X'_1 \\ X'_2 \\ \vdots \\ X'_n \end{pmatrix}, \quad E = \begin{pmatrix} E'_1 \\ E'_2 \\ \vdots \\ E'_n \end{pmatrix}$$

$Y$ is $n \times G$, $X$ is $n \times K$, and $E$ is $n \times G$.

- This system is complete, in that there are as many equations as endogs.
- There is a normality assumption. This isn't necessary, but allows us to consider the relationship between least squares and ML estimators.
- Since there is no autocorrelation of the $E_t$'s, and since the columns of $E$ are individually homoscedastic, then

$$\Psi = \begin{pmatrix} \sigma_{11} I_n & \sigma_{12} I_n & \cdots & \sigma_{1G} I_n \\ & \sigma_{22} I_n & & \vdots \\ & & \ddots & \\ \cdot & & & \sigma_{GG} I_n \end{pmatrix} = \Sigma \otimes I_n$$

- $X$ may contain lagged endogenous and exogenous variables. These variables are predetermined.
- We need to define what is meant by "endogenous" and "exogenous" when classifying the current period variables.

11.2. Exogeneity

The model defines a data generating process. The model involves two sets of variables, $Y_t$ and $X_t$, as well as a parameter vector

$$\theta = \left[ vec(\Gamma)' \;\; vec(B)' \;\; vec^*(\Sigma)' \right]'$$

where $vec^*(\Sigma)$ collects the distinct elements of the symmetric matrix $\Sigma$.

- In general, without additional restrictions, $\theta$ is a $G^2 + GK + (G^2 - G)/2 + G$ dimensional vector. This is the parameter vector that we're interested in estimating.
- In principle, there exists a joint density function for $Y_t$ and $X_t$, which depends on a parameter vector $\phi$. Write this density as

$$f_t(Y_t, X_t | \phi, \mathcal{I}_t)$$

where $\mathcal{I}_t$ is the information set in period $t$. This includes lagged $Y_t$'s and lagged $X_t$'s, of course. This can be factored into the density of $Y_t$ conditional on $X_t$ times the marginal density of $X_t$:

$$f_t(Y_t, X_t | \phi, \mathcal{I}_t) = f_t(Y_t | X_t, \phi, \mathcal{I}_t) f_t(X_t | \phi, \mathcal{I}_t)$$

This is a general factorization, but it may very well be the case that not all parameters in $\phi$ affect both factors. So use $\phi_1$ to indicate elements of $\phi$ that enter into the conditional density and write $\phi_2$ for parameters that enter into the marginal. In general, $\phi_1$ and $\phi_2$ may share elements, of course. We have

$$f_t(Y_t, X_t | \phi, \mathcal{I}_t) = f_t(Y_t | X_t, \phi_1, \mathcal{I}_t) f_t(X_t | \phi_2, \mathcal{I}_t)$$

- Recall that the model is

$$Y'_t \Gamma = X'_t B + E'_t$$
$$E_t \sim N(0, \Sigma), \quad \forall t$$
$$E(E_t E'_s) = 0, \quad t \neq s$$

Normality and lack of correlation over time imply that the observations are independent of one another, so we can write the log-likelihood function as the sum of likelihood contributions of each observation:

$$\ln L(Y | \theta, \mathcal{I}_t) = \sum_{t=1}^n \ln f_t(Y_t, X_t | \phi, \mathcal{I}_t) = \sum_{t=1}^n \ln f_t(Y_t | X_t, \phi_1, \mathcal{I}_t) + \sum_{t=1}^n \ln f_t(X_t | \phi_2, \mathcal{I}_t)$$

DEFINITION 15 (Weak Exogeneity). $X_t$ is weakly exogenous for $\theta$ (the original parameter vector) if there is a mapping from $\phi$ to $\theta$ that is invariant to $\phi_2$. More formally, for an arbitrary $(\phi_1, \phi_2)$, $\theta(\phi) = \theta(\phi_1)$.

This implies that $\phi_1$ and $\phi_2$ cannot share elements if $X_t$ is weakly exogenous, since $\phi_1$ would change as $\phi_2$ changes, which prevents consideration of arbitrary combinations of $(\phi_1, \phi_2)$.

Supposing that $X_t$ is weakly exogenous, then the MLE of $\phi_1$ using the joint density is the same as the MLE using only the conditional density

$$\ln L(Y | X, \theta, \mathcal{I}_t) = \sum_{t=1}^n \ln f_t(Y_t | X_t, \phi_1, \mathcal{I}_t)$$

since the conditional likelihood doesn't depend on $\phi_2$. In other words, the joint and conditional log-likelihoods maximize at the same value of $\phi_1$.

- With weak exogeneity, knowledge of the DGP of $X_t$ is irrelevant for inference on $\phi_1$, and knowledge of $\phi_1$ is sufficient to recover the parameter of interest, $\theta$. Since the DGP of $X_t$ is irrelevant, we can treat $X_t$ as fixed in inference.
- By the invariance property of MLE, the MLE of $\theta$ is $\theta(\hat{\phi}_1)$, and this mapping is assumed to exist in the definition of weak exogeneity.

- Of course, we'll need to figure out just what this mapping is to recover $\hat{\theta}$ from $\hat{\phi}_1$. This is the famous identification problem.
- With lack of weak exogeneity, the joint and conditional likelihood functions maximize in different places. For this reason, we can't treat $X_t$ as fixed in inference. The joint MLE is valid, but the conditional MLE is not.
- In summary, we require the variables in $X_t$ to be weakly exogenous if we are to be able to treat them as fixed in estimation. Lagged $Y_t$ satisfy the definition, since they are in the conditioning information set, e.g., $Y_{t-1} \in \mathcal{I}_t$. Lagged $Y_t$ aren't exogenous in the normal usage of the word, since their values are determined within the model, just earlier on. Weakly exogenous variables include exogenous (in the normal sense) variables as well as all predetermined variables.

11.3. Reduced form

Recall that the model is

$$Y'_t \Gamma = X'_t B + E'_t$$
$$V(E_t) = \Sigma$$

This is the model in structural form.

DEFINITION 16 (Structural form). An equation is in structural form when more than one current period endogenous variable is included.

The solution for the current period endogs is easy to find. It is

$$Y'_t = X'_t B \Gamma^{-1} + E'_t \Gamma^{-1} = X'_t \Pi + V'_t$$

Now only one current period endog appears in each equation. This is the reduced form.

DEFINITION 17 (Reduced form). An equation is in reduced form if only one current period endog is included.

An example is our supply/demand system. The reduced form for quantity is obtained by solving the supply equation for price and substituting into demand:

$$q_t = \alpha_1 + \alpha_2 \left( \frac{q_t - \beta_1 - \varepsilon_{2t}}{\beta_2} \right) + \alpha_3 y_t + \varepsilon_{1t}$$
$$\beta_2 q_t - \alpha_2 q_t = \beta_2 \alpha_1 - \alpha_2 (\beta_1 + \varepsilon_{2t}) + \beta_2 \alpha_3 y_t + \beta_2 \varepsilon_{1t}$$
$$q_t = \frac{\beta_2 \alpha_1 - \alpha_2 \beta_1}{\beta_2 - \alpha_2} + \frac{\beta_2 \alpha_3 y_t}{\beta_2 - \alpha_2} + \frac{\beta_2 \varepsilon_{1t} - \alpha_2 \varepsilon_{2t}}{\beta_2 - \alpha_2} = \pi_{11} + \pi_{21} y_t + v_{1t}$$

Similarly, the rf for price is

$$\beta_1 + \beta_2 p_t + \varepsilon_{2t} = \alpha_1 + \alpha_2 p_t + \alpha_3 y_t + \varepsilon_{1t}$$
$$\beta_2 p_t - \alpha_2 p_t = \alpha_1 - \beta_1 + \alpha_3 y_t + \varepsilon_{1t} - \varepsilon_{2t}$$
$$p_t = \frac{\alpha_1 - \beta_1}{\beta_2 - \alpha_2} + \frac{\alpha_3 y_t}{\beta_2 - \alpha_2} + \frac{\varepsilon_{1t} - \varepsilon_{2t}}{\beta_2 - \alpha_2} = \pi_{12} + \pi_{22} y_t + v_{2t}$$

The interesting thing about the rf is that the equations individually satisfy the classical assumptions, since $y_t$ is uncorrelated with $\varepsilon_{1t}$ and $\varepsilon_{2t}$ by assumption, and therefore $E(y_t v_{it}) = 0$, $i = 1, 2$, $\forall t$. The errors of the rf are

$$\begin{pmatrix} v_{1t} \\ v_{2t} \end{pmatrix} = \begin{pmatrix} \frac{\beta_2 \varepsilon_{1t} - \alpha_2 \varepsilon_{2t}}{\beta_2 - \alpha_2} \\ \frac{\varepsilon_{1t} - \varepsilon_{2t}}{\beta_2 - \alpha_2} \end{pmatrix}$$

The variance of $v_{1t}$ is

$$V(v_{1t}) = E \left[ \left( \frac{\beta_2 \varepsilon_{1t} - \alpha_2 \varepsilon_{2t}}{\beta_2 - \alpha_2} \right)^2 \right] = \frac{\beta_2^2 \sigma_{11} - 2\alpha_2 \beta_2 \sigma_{12} + \alpha_2^2 \sigma_{22}}{(\beta_2 - \alpha_2)^2}$$

- This is constant over time, so the first rf equation is homoscedastic.
- Likewise, since the $\varepsilon_t$ are independent over time, so are the $v_t$.

The variance of the second rf error is

$$V(v_{2t}) = E \left[ \left( \frac{\varepsilon_{1t} - \varepsilon_{2t}}{\beta_2 - \alpha_2} \right)^2 \right] = \frac{\sigma_{11} - 2\sigma_{12} + \sigma_{22}}{(\beta_2 - \alpha_2)^2}$$

and the contemporaneous covariance of the errors across equations is

$$E(v_{1t} v_{2t}) = E \left[ \left( \frac{\beta_2 \varepsilon_{1t} - \alpha_2 \varepsilon_{2t}}{\beta_2 - \alpha_2} \right) \left( \frac{\varepsilon_{1t} - \varepsilon_{2t}}{\beta_2 - \alpha_2} \right) \right] = \frac{\beta_2 \sigma_{11} - (\beta_2 + \alpha_2)\sigma_{12} + \alpha_2 \sigma_{22}}{(\beta_2 - \alpha_2)^2}$$

- In summary, the rf equations individually satisfy the classical assumptions, under the assumptions we've made, but they are contemporaneously correlated.

The general form of the rf is

$$Y'_t = X'_t B \Gamma^{-1} + E'_t \Gamma^{-1} = X'_t \Pi + V'_t$$

so we have that

$$V_t = \left( \Gamma^{-1} \right)' E_t \sim N \left( 0, \left( \Gamma^{-1} \right)' \Sigma \Gamma^{-1} \right), \quad \forall t$$

and that the $V_t$ are timewise independent (note that this wouldn't be the case if the $E_t$ were autocorrelated).

11.4. IV estimation

The IV estimator may appear a bit unusual at first, but it will grow on you over time.

The simultaneous equations model is

$$Y\Gamma = XB + E$$

Considering the first equation (this is without loss of generality, since we can always reorder the equations) we can partition the $Y$ matrix as

$$Y = \begin{pmatrix} y & Y_1 & Y_2 \end{pmatrix}$$

- $y$ is the first column
- $Y_1$ are the other endogenous variables that enter the first equation
- $Y_2$ are endogs that are excluded from this equation

Similarly, partition $X$ as

$$X = \begin{pmatrix} X_1 & X_2 \end{pmatrix}$$

- $X_1$ are the included exogs, and $X_2$ are the excluded exogs.

Finally, partition the error matrix as

$$E = \begin{pmatrix} \varepsilon & E_{12} \end{pmatrix}$$

Assume that $\Gamma$ has ones on the main diagonal. These are normalization restrictions that simply scale the remaining coefficients on each equation, and which scale the variances of the error terms.

Given this scaling and our partitioning, the coefficient matrices can be written as

$$\Gamma = \begin{pmatrix} 1 & \Gamma_{12} \\ -\gamma_1 & \Gamma_{22} \\ 0 & \Gamma_{32} \end{pmatrix}, \qquad B = \begin{pmatrix} \beta_1 & B_{12} \\ 0 & B_{22} \end{pmatrix}$$

With this, the first equation can be written as

$$y = Y_1 \gamma_1 + X_1 \beta_1 + \varepsilon = Z\delta + \varepsilon$$

The problem, as we've seen, is that $Z$ is correlated with $\varepsilon$, since $Y_1$ is formed of endogs.

Now, let's consider the general problem of a linear regression model with correlation between regressors and the error term:

$$y = X\beta + \varepsilon$$
$$\varepsilon \sim iid(0, I_n \sigma^2)$$
$$E(X'\varepsilon) \neq 0.$$

The present case of a structural equation from a system of equations fits into this notation, but so do other problems, such as measurement error or lagged dependent variables with autocorrelated errors. Consider some matrix $W$ which is formed of variables uncorrelated with $\varepsilon$. This matrix defines a projection matrix

$$P_W = W(W'W)^{-1}W'$$

so that anything that is projected onto the space spanned by $W$ will be uncorrelated with $\varepsilon$, by the definition of $W$. Transforming the model with this projection matrix we get

$$P_W y = P_W X\beta + P_W \varepsilon$$

or

$$y^* = X^*\beta + \varepsilon^*$$

Now we have that $\varepsilon^*$ and $X^*$ are uncorrelated, since this is simply

$$E(X^{*\prime}\varepsilon^*) = E(X' P'_W P_W \varepsilon) = E(X' P_W \varepsilon)$$

and

$$P_W X = W(W'W)^{-1}W'X$$

is the fitted value from a regression of $X$ on $W$. This is a linear combination of the columns of $W$, so it must be uncorrelated with $\varepsilon$. This implies that applying OLS to the model

$$y^* = X^*\beta + \varepsilon^*$$

will lead to a consistent estimator, given a few more assumptions. This is the generalized instrumental variables estimator. $W$ is known as the matrix of instruments. The estimator is

$$\hat{\beta}_{IV} = (X'P_W X)^{-1} X'P_W y$$

from which we obtain

$$\hat{\beta}_{IV} = (X'P_W X)^{-1} X'P_W (X\beta + \varepsilon) = \beta + (X'P_W X)^{-1} X'P_W \varepsilon$$

so

$$\hat{\beta}_{IV} - \beta = (X'P_W X)^{-1} X'P_W \varepsilon = \left[ X'W(W'W)^{-1}W'X \right]^{-1} X'W(W'W)^{-1}W'\varepsilon$$

Now we can introduce factors of $n$ to get

$$\hat{\beta}_{IV} - \beta = \left[ \left( \frac{X'W}{n} \right) \left( \frac{W'W}{n} \right)^{-1} \left( \frac{W'X}{n} \right) \right]^{-1} \left( \frac{X'W}{n} \right) \left( \frac{W'W}{n} \right)^{-1} \left( \frac{W'\varepsilon}{n} \right)$$

Assuming that each of the terms with an $n$ in the denominator satisfies a LLN, so that

- $\frac{W'W}{n} \overset{p}{\rightarrow} Q_{WW}$, a finite pd matrix
- $\frac{X'W}{n} \overset{p}{\rightarrow} Q_{XW}$, a finite matrix with rank $K$ (= cols($X$))
- $\frac{W'\varepsilon}{n} \overset{p}{\rightarrow} 0$

then the plim of the rhs is zero. This last term has plim 0 since we assume that $W$ and $\varepsilon$ are uncorrelated, e.g.,

$$E(W'_t \varepsilon_t) = 0.$$

Given these assumptions the IV estimator is consistent:

$$\hat{\beta}_{IV} \overset{p}{\rightarrow} \beta.$$

Furthermore, scaling by $\sqrt{n}$, we have

$$\sqrt{n} \left( \hat{\beta}_{IV} - \beta \right) = \left[ \left( \frac{X'W}{n} \right) \left( \frac{W'W}{n} \right)^{-1} \left( \frac{W'X}{n} \right) \right]^{-1} \left( \frac{X'W}{n} \right) \left( \frac{W'W}{n} \right)^{-1} \left( \frac{W'\varepsilon}{\sqrt{n}} \right)$$

Assuming that the far right term satisfies a CLT, so that $\frac{W'\varepsilon}{\sqrt{n}} \overset{d}{\rightarrow} N(0, Q_{WW}\sigma^2)$, then we get

$$\sqrt{n} \left( \hat{\beta}_{IV} - \beta \right) \overset{d}{\rightarrow} N \left( 0, \left( Q_{XW} Q_{WW}^{-1} Q'_{XW} \right)^{-1} \sigma^2 \right)$$

The estimators for $Q_{XW}$ and $Q_{WW}$ are the obvious ones. An estimator for $\sigma^2$ is

$$\widehat{\sigma^2_{IV}} = \frac{1}{n} \left( y - X\hat{\beta}_{IV} \right)' \left( y - X\hat{\beta}_{IV} \right).$$

This estimator is consistent following the proof of consistency of the OLS estimator of $\sigma^2$, when the classical assumptions hold.

The formula used to estimate the variance of $\hat{\beta}_{IV}$ is

$$\hat{V}(\hat{\beta}_{IV}) = \left[ (X'W)(W'W)^{-1}(W'X) \right]^{-1} \widehat{\sigma^2_{IV}}$$

The IV estimator is

(1) Consistent
(2) Asymptotically normally distributed
(3) Biased in general, since even though $E(X'P_W \varepsilon) = 0$, $E\left[ (X'P_W X)^{-1} X'P_W \varepsilon \right]$ may not be zero, since $(X'P_W X)^{-1}$ and $X'P_W \varepsilon$ are not independent.

An important point is that the asymptotic distribution of $\hat{\beta}_{IV}$ depends upon $Q_{XW}$ and $Q_{WW}$, and these depend upon the choice of $W$. The choice of instruments influences the efficiency of the estimator.

- When we have two sets of instruments, $W_1$ and $W_2$, such that $W_1 \subset W_2$, then the IV estimator using $W_2$ is at least as efficient asymptotically as the estimator that used $W_1$. More instruments leads to more asymptotically efficient estimation, in general.
- There are special cases where there is no gain (simultaneous equations is an example of this, as we'll see).
- The penalty for indiscriminate use of instruments is that the small sample bias of the IV estimator rises as the number of instruments increases. The reason for this is that $P_W X$ becomes closer and closer to $X$ itself as the number of instruments increases.
- IV estimation can clearly be used in the case of simultaneous equations. The only issue is which instruments to use.
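A compact Octave sketch of the generalized IV estimator and its estimated covariance, on simulated data; the instrument strength and parameter values are hypothetical:

# IV estimation when a regressor is correlated with the error.
n = 500;
w = randn(n,2);                      # instruments
u = randn(n,1);                      # common shock causing endogeneity
x = w*[1;1] + u + randn(n,1);        # regressor correlated with the error
y = 1 + x + u + 0.5*randn(n,1);      # true coefficients (1, 1)
X = [ones(n,1), x];  W = [ones(n,1), w];
PW = W*inv(W'*W)*W';
b_iv = (X'*PW*X) \ (X'*PW*y);
e = y - X*b_iv;                      # residuals use X, not PW*X
sig2 = (e'*e)/n;
V_iv = inv(X'*W*inv(W'*W)*W'*X) * sig2;
b_ols = X \ y;
printf("OLS slope: %f   IV slope: %f\n", b_ols(2), b_iv(2));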


11.5. Identification by exclusion restrictions

The identification problem in simultaneous equations is in fact of the same nature as the identification problem in any estimation setting: does the limiting objective function have the proper curvature so that there is a unique global minimum or maximum at the true parameter value? In the context of IV estimation, this is the case if the limiting covariance of the IV estimator is positive definite and $\mathrm{plim}\, \frac{1}{n} W'\varepsilon = 0$. This matrix is

$$V_\infty(\hat{\beta}_{IV}) = \left( Q_{XW} Q_{WW}^{-1} Q'_{XW} \right)^{-1} \sigma^2$$

- The necessary and sufficient condition for identification is simply that this matrix be positive definite, and that the instruments be (asymptotically) uncorrelated with $\varepsilon$.
- For this matrix to be positive definite, we need that the conditions noted above hold: $Q_{WW}$ must be positive definite and $Q_{XW}$ must be of full rank ($K$).
- These identification conditions are not that intuitive, nor is it very obvious how to check them.

11.5.1. Necessary conditions. If we use IV estimation for a single equation of the system, the equation can be written as

$$y = Z\delta + \varepsilon$$

where

$$Z = \begin{pmatrix} Y_1 & X_1 \end{pmatrix}$$

Notation:

- Let $K$ be the total number of weakly exogenous variables.
- Let $K^* = cols(X_1)$ be the number of included exogs, and let $K^{**} = K - K^*$ be the number of excluded exogs (in this equation).
- Let $G^* = cols(Y_1) + 1$ be the total number of included endogs, and let $G^{**} = G - G^*$ be the number of excluded endogs.

Using this notation, consider the selection of instruments.

- Now the $X_1$ are weakly exogenous and can serve as their own instruments.
- It turns out that $X$ exhausts the set of possible instruments, in that if the variables in $X$ don't lead to an identified model then no other instruments will identify the model either. Assuming this is true (we'll prove it in a moment), then a necessary condition for identification is that $cols(X_2) \geq cols(Y_1)$, since if not then at least one instrument must be used twice, so $W$ will not have full column rank:

$$\rho(W) < K^* + G^* - 1 \implies \rho(Q_{ZW}) < K^* + G^* - 1$$

This is the order condition for identification in a set of simultaneous equations. When the only identifying information is exclusion restrictions on the variables that enter an equation, then the number of excluded exogs must be greater than or equal to the number of included endogs, minus 1 (the normalized lhs endog), e.g.,

$$K^{**} \geq G^* - 1$$

- To show that this is in fact a necessary condition, consider some arbitrary set of instruments $W$. A necessary condition for identification is that

$$\rho \left( \mathrm{plim}\, \frac{1}{n} W'Z \right) = K^* + G^* - 1$$

where

$$Z = \begin{pmatrix} Y_1 & X_1 \end{pmatrix}$$

Recall that we've partitioned the model

$$Y\Gamma = XB + E$$

as

$$Y = \begin{pmatrix} y & Y_1 & Y_2 \end{pmatrix}$$
$$X = \begin{pmatrix} X_1 & X_2 \end{pmatrix}$$

Given the reduced form

$$Y = X\Pi + V$$

we can write the reduced form using the same partition

$$\begin{pmatrix} y & Y_1 & Y_2 \end{pmatrix} = \begin{pmatrix} X_1 & X_2 \end{pmatrix} \begin{pmatrix} \pi_{11} & \Pi_{12} & \Pi_{13} \\ \pi_{21} & \Pi_{22} & \Pi_{23} \end{pmatrix} + \begin{pmatrix} v & V_1 & V_2 \end{pmatrix}$$

so we have

$$Y_1 = X_1 \Pi_{12} + X_2 \Pi_{22} + V_1$$

so

$$\frac{1}{n} W'Z = \frac{1}{n} W' \begin{pmatrix} X_1 \Pi_{12} + X_2 \Pi_{22} + V_1 & X_1 \end{pmatrix}$$

Because the $W$'s are uncorrelated with the $V_1$'s, by assumption, the cross between $W$ and $V_1$ converges in probability to zero, so

$$\mathrm{plim}\, \frac{1}{n} W'Z = \mathrm{plim}\, \frac{1}{n} W' \begin{pmatrix} X_1 \Pi_{12} + X_2 \Pi_{22} & X_1 \end{pmatrix}$$

Since the far rhs term is formed only of linear combinations of columns of $X$, the rank of this matrix can never be greater than $K$, regardless of the choice of instruments. If $Z$ has more than $K$ columns, then it is not of full column rank. When $Z$ has more than $K$ columns we have

$$G^* - 1 + K^* > K$$

or, noting that $K^{**} = K - K^*$,

$$G^* - 1 > K^{**}$$

In this case, the limiting matrix is not of full column rank, and the identification condition fails.

11.5.2. Sufficient conditions. Identification essentially requires that the structural parameters be recoverable from the data. This won't be the case, in general, unless the structural model is subject to some restrictions. We've already identified necessary conditions. Turning to sufficient conditions (again, we're only considering identification through zero restrictions on the parameters, for the moment).

The model is

$$Y'_t \Gamma = X'_t B + E'_t$$
$$V(E_t) = \Sigma$$

This leads to the reduced form

$$Y'_t = X'_t B \Gamma^{-1} + E'_t \Gamma^{-1} = X'_t \Pi + V'_t$$
$$V(V_t) = \left( \Gamma^{-1} \right)' \Sigma \Gamma^{-1} = \Omega$$

The reduced form parameters are consistently estimable, but none of them are known a priori, and there are no restrictions on their values. The problem is that more than one structural form has the same reduced form, so knowledge of the reduced form parameters alone isn't enough to determine the structural parameters. To see this, consider the model

$$Y'_t \Gamma F = X'_t B F + E'_t F$$
$$V(E'_t F) = F' \Sigma F$$

where $F$ is some arbitrary nonsingular $G \times G$ matrix. The rf of this new model is

$$Y'_t = X'_t B F (\Gamma F)^{-1} + E'_t F (\Gamma F)^{-1} = X'_t B F F^{-1} \Gamma^{-1} + E'_t F F^{-1} \Gamma^{-1} = X'_t B \Gamma^{-1} + E'_t \Gamma^{-1} = X'_t \Pi + V'_t$$

Likewise, the covariance of the rf of the transformed model is

$$V \left( E'_t F (\Gamma F)^{-1} \right) = V \left( E'_t \Gamma^{-1} \right) = \Omega$$

Since the two structural forms lead to the same rf, and the rf is all that is directly estimable, the models are said to be observationally equivalent. What we need for identification are restrictions on $\Gamma$ and $B$ such that the only admissible $F$ is an identity matrix (if all of the equations are to be identified). Take the coefficient matrices as partitioned before:

$$\begin{pmatrix} \Gamma \\ B \end{pmatrix} = \begin{pmatrix} 1 & \Gamma_{12} \\ -\gamma_1 & \Gamma_{22} \\ 0 & \Gamma_{32} \\ \beta_1 & B_{12} \\ 0 & B_{22} \end{pmatrix}$$

The coefficients of the first equation of the transformed model are simply these coefficients multiplied by the first column of $F$. This gives

$$\begin{pmatrix} \Gamma \\ B \end{pmatrix} \begin{pmatrix} f_{11} \\ F_2 \end{pmatrix} = \begin{pmatrix} 1 & \Gamma_{12} \\ -\gamma_1 & \Gamma_{22} \\ 0 & \Gamma_{32} \\ \beta_1 & B_{12} \\ 0 & B_{22} \end{pmatrix} \begin{pmatrix} f_{11} \\ F_2 \end{pmatrix}$$

For identification of the first equation we need that there be enough restrictions so that the only admissible $\begin{pmatrix} f_{11} \\ F_2 \end{pmatrix}$ be the leading column of an identity matrix, so that

$$\begin{pmatrix} 1 & \Gamma_{12} \\ -\gamma_1 & \Gamma_{22} \\ 0 & \Gamma_{32} \\ \beta_1 & B_{12} \\ 0 & B_{22} \end{pmatrix} \begin{pmatrix} f_{11} \\ F_2 \end{pmatrix} = \begin{pmatrix} 1 \\ -\gamma_1 \\ 0 \\ \beta_1 \\ 0 \end{pmatrix}$$

Note that the third and fifth rows are

$$\begin{pmatrix} \Gamma_{32} \\ B_{22} \end{pmatrix} F_2 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$

Supposing that the leading matrix is of full column rank, e.g.,

$$\rho \begin{pmatrix} \Gamma_{32} \\ B_{22} \end{pmatrix} = cols \begin{pmatrix} \Gamma_{32} \\ B_{22} \end{pmatrix} = G - 1$$

then the only way this can hold, without additional restrictions on the model's parameters, is if $F_2$ is a vector of zeros. Given that $F_2$ is a vector of zeros, then the first equation gives

$$\begin{pmatrix} 1 & \Gamma_{12} \end{pmatrix} \begin{pmatrix} f_{11} \\ F_2 \end{pmatrix} = 1 \implies f_{11} = 1$$

Therefore, as long as

$$\rho \begin{pmatrix} \Gamma_{32} \\ B_{22} \end{pmatrix} = G - 1$$

then

$$\begin{pmatrix} f_{11} \\ F_2 \end{pmatrix} = \begin{pmatrix} 1 \\ 0_{G-1} \end{pmatrix}$$

The first equation is identified in this case, so the condition is sufficient for identification. It is also necessary, since the condition implies that this submatrix must have at least $G - 1$ rows. Since this matrix has

$$G^{**} + K^{**} = G - G^* + K^{**}$$

rows, we obtain

$$G - G^* + K^{**} \geq G - 1$$

or

$$K^{**} \geq G^* - 1$$

which is the previously derived necessary condition.

The above result is fairly intuitive (draw picture here). The necessary condition ensures that there are enough variables not in the equation of interest to potentially move the other equations, so as to trace out the equation of interest. The sufficient condition ensures that those other equations in fact do move around as the variables change their values. Some points:

- When an equation has $K^{**} = G^* - 1$, it is exactly identified, in that omission of an identifying restriction is not possible without losing consistency.
- When $K^{**} > G^* - 1$, the equation is overidentified, since one could drop a restriction and still retain consistency. Overidentifying restrictions are therefore testable. When an equation is overidentified we have more instruments than are strictly necessary for consistent estimation. Since estimation by IV with more instruments is more efficient asymptotically, one should employ overidentifying restrictions if one is confident that they're true.
- We can repeat this partition for each equation in the system, to see which equations are identified and which aren't.
- These results are valid assuming that the only identifying information comes from knowing which variables appear in which equations, e.g., by exclusion restrictions, and through the use of a normalization. There are other sorts of identifying information that can be used. These include
  (1) Cross equation restrictions
  (2) Additional restrictions on parameters within equations (as in the Klein model discussed below)
  (3) Restrictions on the covariance matrix of the errors
  (4) Nonlinearities in variables
- When these sorts of information are available, the above conditions aren't necessary for identification, though they are of course still sufficient.

To give an example of how other information can be used, consider the model

$$Y\Gamma = XB + E$$

where $\Gamma$ is an upper triangular matrix with 1's on the main diagonal. This is a triangular system of equations. In this case, the first equation is

$$y_1 = XB_{\cdot 1} + E_{\cdot 1}$$

Since only exogs appear on the rhs, this equation is identified.

The second equation is

$$y_2 = -\gamma_{21} y_1 + XB_{\cdot 2} + E_{\cdot 2}$$

This equation has $K^{**} = 0$ excluded exogs, and $G^* = 2$ included endogs, so it fails the order (necessary) condition for identification.

- However, suppose that we have the restriction $\Sigma_{21} = 0$, so that the first and second structural errors are uncorrelated. In this case

$$E(y_{1t} \varepsilon_{2t}) = E \left\{ \left( X'_t B_{\cdot 1} + \varepsilon_{1t} \right) \varepsilon_{2t} \right\} = 0$$

so there's no problem of simultaneity. If the entire $\Sigma$ matrix is diagonal, then, following the same logic, all of the equations are identified. This is known as a fully recursive model.

11.5.3. Example: Klein's Model 1. To give an example of determining identification status, consider the following macro model (this is the widely known Klein's Model 1)

$$\text{Consumption:} \quad C_t = \alpha_0 + \alpha_1 P_t + \alpha_2 P_{t-1} + \alpha_3 \left( W^p_t + W^g_t \right) + \varepsilon_{1t}$$
$$\text{Investment:} \quad I_t = \beta_0 + \beta_1 P_t + \beta_2 P_{t-1} + \beta_3 K_{t-1} + \varepsilon_{2t}$$
$$\text{Private Wages:} \quad W^p_t = \gamma_0 + \gamma_1 X_t + \gamma_2 X_{t-1} + \gamma_3 A_t + \varepsilon_{3t}$$
$$\text{Output:} \quad X_t = C_t + I_t + G_t$$
$$\text{Profits:} \quad P_t = X_t - T_t - W^p_t$$
$$\text{Capital Stock:} \quad K_t = K_{t-1} + I_t$$

$$\begin{pmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \\ \varepsilon_{3t} \end{pmatrix} \sim IID \left[ \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} \\ & \sigma_{22} & \sigma_{23} \\ & & \sigma_{33} \end{pmatrix} \right]$$

The other variables are the government wage bill, $W^g_t$, taxes, $T_t$, government nonwage spending, $G_t$, and a time trend, $A_t$. The endogenous variables are the lhs variables,

$$Y'_t = \begin{pmatrix} C_t & I_t & W^p_t & X_t & P_t & K_t \end{pmatrix}$$

and the predetermined variables are all others:

$$X'_t = \begin{pmatrix} 1 & W^g_t & G_t & T_t & A_t & P_{t-1} & K_{t-1} & X_{t-1} \end{pmatrix}.$$

The model assumes that the errors of the equations are contemporaneously correlated, but nonautocorrelated. The model written as $Y\Gamma = XB + E$ gives

$$\Gamma = \begin{pmatrix} 1 & 0 & 0 & -1 & 0 & 0 \\ 0 & 1 & 0 & -1 & 0 & -1 \\ -\alpha_3 & 0 & 1 & 0 & 1 & 0 \\ 0 & 0 & -\gamma_1 & 1 & -1 & 0 \\ -\alpha_1 & -\beta_1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}, \qquad B = \begin{pmatrix} \alpha_0 & \beta_0 & \gamma_0 & 0 & 0 & 0 \\ \alpha_3 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & -1 & 0 \\ 0 & 0 & \gamma_3 & 0 & 0 & 0 \\ \alpha_2 & \beta_2 & 0 & 0 & 0 & 0 \\ 0 & \beta_3 & 0 & 0 & 0 & 1 \\ 0 & 0 & \gamma_2 & 0 & 0 & 0 \end{pmatrix}$$

To check the identification of the consumption equation, we need to extract $\Gamma_{32}$ and $B_{22}$, the submatrices of coefficients of endogs and exogs that don't appear in this equation. These are the rows that have zeros in the first column, and we need to drop the first column. We get

$$\begin{pmatrix} \Gamma_{32} \\ B_{22} \end{pmatrix} = \begin{pmatrix} 1 & 0 & -1 & 0 & -1 \\ 0 & -\gamma_1 & 1 & -1 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & -1 & 0 \\ 0 & \gamma_3 & 0 & 0 & 0 \\ \beta_3 & 0 & 0 & 0 & 1 \\ 0 & \gamma_2 & 0 & 0 & 0 \end{pmatrix}$$

We need to find a set of 5 rows of this matrix that gives a full-rank $5 \times 5$ matrix. For example, selecting rows 3, 4, 5, 6, and 7 we obtain the matrix

$$A = \begin{pmatrix} 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & -1 & 0 \\ 0 & \gamma_3 & 0 & 0 & 0 \\ \beta_3 & 0 & 0 & 0 & 1 \end{pmatrix}$$

This matrix is of full rank, so the sufficient condition for identification is met. Counting included endogs, $G^* = 3$, and counting excluded exogs, $K^{**} = 5$, so

$$K^{**} - L = G^* - 1$$
$$5 - L = 3 - 1$$
$$L = 3$$

- The equation is over-identified by three restrictions, according to the counting rules, which are correct when the only identifying information is the exclusion restrictions. However, there is additional information in this case. Both $W^p_t$ and $W^g_t$ enter the consumption equation, and their coefficients are restricted to be the same. For this reason the consumption equation is in fact overidentified by four restrictions.

11.6. 2SLS

When we have no information regarding cross-equation restrictions or the

structure of the error covariance matrix, one can estimate the parameters of a

single equation of the system without regard to the other equations.� This isn’t always efficient, as we’ll see, but it has the advantage that

misspecifications in other equations will not affect the consistency of

the estimator of the parameters of the equation of interest.� Also, estimation of the equation won’t be affected by identification

problems in other equations.

The 2SLS estimator is very simple: in the first stage, each column of ¤�1 is re-

gressed on all the weakly exogenous variables in the system, e.g., the entire �matrix. The fitted values are�¤�1 �K�¬�i4��:�� 1 �i4©¤�1 ª�« ¤,1 � �ó 1Since these fitted values are the projection of ¤I1 on the space spanned by �i�and since any vector in this space is uncorrelated with / by assumption,

�¤,1 is

Page 228: Econometrics-Creel (2005)

11.6. 2SLS 228

uncorrelated with /T& Since�¤�1 is simply the reduced-form prediction, it is cor-

related with ¤�1Y� The only other requirement is that the instruments be linearly

independent. This should be the case when the order condition is satisfied,

since there are more columns in � � than in ¤,1 in this case.

The second stage substitutes�¤�1 in place of ¤�1Y� and estimates by OLS. This

original model is

B ¤,1I�+1,28��1Ô�01,28/ 9 » 28/and the second stage model is

B �¤,1I�+1,28��1Ô�01,28/T&Since ��1 is in the space spanned by ��� ª« ��1 ��1Y� so we can write the second

stage model as

B ªI« ¤,1I�+1,2 ª�« ��1Ô�01,28/� ªI« 9 » 28/The OLS estimator applied to this model is�» �A9 4 ªI« 9p � 1 9 4 ªI« Bwhich is exactly what we get if we estimate using IV, with the reduced form

predictions of the endogs used as instruments. Note that if we define�9 ª�« 9 n �¤,1d��1Þr

Page 229: Econometrics-Creel (2005)

11.6. 2SLS 229

so that�9 are the instruments for 9;� then we can write�» � �9y4±9p�� 1 �9y4©B

� Important note: OLS on the transformed model can be used to calcu-

late the 2SLS estimate of» � since we see that it’s equivalent to IV using

a particular set of instruments. However the OLS covariance formula is

not valid. We need to apply the IV covariance formula already seen

above.

Actually, there is also a simplification of the general IV variance formula. De-

fine �9 ªI« 9 n �¤ � rThe IV covariance estimator would ordinarily be�éb� �» dc 9y4 �9 h � 1 c �9y4 �9 h c �9È4±9 h � 1 �è �ôZõHowever, looking at the last term in brackets

�9 4 9 n �¤�1d��1 r 4 n ¤,1e��1 r �� ¤ 41 � ª�« -¤,1 ¤ 41 � ª�« ?��1� 41 ¤�1 � 41 ��1 ��

Page 230: Econometrics-Creel (2005)

11.6. 2SLS 230

but sinceª�«

is idempotent and sinceª« � �i� we can write

n �¤,1 ��1Þr 4 n ¤,1e��1Þr �� ¤ 41 ªI«7ªI« ¤,1 ¤ 41 ª�« ��1� 41 ªI« ¤,1 � 41 ��1 �� n �¤,1d��1 r 4 n �¤,1e��1 r �9y4 �9Therefore, the second and last term in the variance formula cancel, so the 2SLS

varcov estimator simplifies to�é^� �» c 9y4 �9 h � 1 �è �ôZõwhich, following some algebra similar to the above, can also be written as�é^� �» dc �9y4 �9 h � 1 �è �ôZõFinally, recall that though this is presented in terms of the first equation, it is

general since any equation can be placed first.

Properties of 2SLS:

(1) Consistent

(2) Asymptotically normal

(3) Biased when the mean esists (the existence of moments is a technical

issue we won’t go into here).

(4) Asymptotically inefficient, except in special circumstances (more on

this later).

Page 231: Econometrics-Creel (2005)

11.7. TESTING THE OVERIDENTIFYING RESTRICTIONS 231

11.7. Testing the overidentifying restrictions

The selection of which variables are endogs and which are exogs is part of

the specification of the model. As such, there is room for error here: one might

erroneously classify a variable as exog when it is in fact correlated with the

error term. A general test for the specification on the model can be formulated

as follows:

The IV estimator can be calculated by applying OLS to the transformed

model, so the IV objective function at the minimized value is

��� ���ô`õ� lc B¯�j� ���ôZõ h 4 ª � c BÜ�j� ���ô`õ h �but �/<ôZõ B¯�j� ���ôZõ B¯�j�K���i4 ª � �:�� 1 �i4 ª � B ð ± �j�K��� 4 ª � �: � 1 � 4 ª � ó B ð ± �j�K��� 4 ª � �: � 1 � 4 ª � ó �¬���b28/V w �¨����2�/Vwhere w � ± �j�K�¬�i4 ª � �:$� 1 ��4 ª �so ��� ���ôZõI �¬/J4g2K��4��i4@ w 4 ª � w �¬�i��28/V

Page 232: Econometrics-Creel (2005)

11.7. TESTING THE OVERIDENTIFYING RESTRICTIONS 232

Moreover, w 4 ª � w is idempotent, as can be verified by multiplication:

w 4 ª � w ð ± � ª � �K�¬�i4 ª � �:$� 1 �i4 ó ª � ð ± �j�K�¬�i4 ª � �:$� 1 ��4 ª � ó ð ª � � ª � �K����4 ª � �:�� 1 �i4 ª � ó ð ª � � ª � �K���i4 ª � �:�� 1 �i4 ª � ó ð ± � ª � �K�¬�i4 ª � �:$� 1 �i4 ó ª � &Furthermore, w is orthogonal to �

w � ð ± �²�K��� 4 ª � �3 � 1 � 4 ª � ó � �l�²� �so ��� ���ôZõI /J4 w 4 ª � w /Supposing the / are normally distributed, with variance èI�N� then the random

variable ��� ���ôZõ�è � / 4 w 4 ª � w /è �is a quadratic form of a ­j� � �'�J random variable with an idempotent matrix in

the middle, so ��� ���ôZõ�è � ç�� � ����� w 4 ª � w [This isn’t available, since we need to estimate è � . Substituting a consistent

estimator, ��� ���ôZõ��è � Qç�� � ����� w 4 ª � w [

Page 233: Econometrics-Creel (2005)

11.7. TESTING THE OVERIDENTIFYING RESTRICTIONS 233� Even if the / aren’t normally distributed, the asymptotic result still

holds. The last thing we need to determine is the rank of the idempo-

tent matrix. We have

w 4 ª � wa ð ª � � ª � �K���i4 ª � �:�� 1 �i4 ª � óso

��� w 4 ª � w ¿yÀ ð ª � � ª � �K�¬�i4 ª � �:$� 1 �i4 ª � ó ¿yÀVª � � ¿yÀ �i4 ª � ª � �K�¬�i4 ª � �:$� 1 ¿yÀ üþ�òü 4 üÚ � 1 ü 4 � � « ¿yÀ ü 4 ü �Wü 4 üÚ � 1 � � « � � � � «where � � is the number of columns of ü and � « is the number of

columns of ��& The degrees of freedom of the test is simply the number

of overidentifying restrictions: the number of instruments we have

beyond the number that is strictly necessary for consistent estimation.� This test is an overall specification test: the joint null hypothesis is that

the model is correctly specified and that the ü form valid instruments

(e.g., that the variables classified as exogs really are uncorrelated with/%& Rejection can mean that either the model B 9 » 2:/ is misspecified,

or that there is correlation between � and /T&� This is a particular case of the GMM criterion test, which is covered in

the second half of the course. See Section 15.8.� Note that since �/<ôZõ �w /

Page 234: Econometrics-Creel (2005)

11.7. TESTING THE OVERIDENTIFYING RESTRICTIONS 234

and ��� ���ôZõI / 4 w 4 ª � w /we can write��� ���ôZõ��è � � �/ 4 üþ�òü 4 üÚ � 1 ü 4 ��Wüþ�òü 4 üÚ � 1 ü 4 �/V�/ 4 �/  � �¢�W� Û�Û �î�üAý � �  ¿ Û�Û �î¾ü4ý �,� �Íwhere ���Í is the uncentered ��� from a regression of the ± é residuals

on all of the instruments ü . This is a convenient way to calculate the

test statistic.

On an aside, consider IV estimation of a just-identified model, using the stan-

dard notation

B ���b28/and ü is the matrix of instruments. If we have exact identification then ¸ ´<µ ���Wü® ¸ ´1µ �����: , so üRO�� is a square matrix. The transformed model isª � B ª � ���^2 ª � /and the fonc are ��4 ª � ��BÜ�j� ���ôZõI ��The IV estimator is ���ôZõ �¬�i4 ª � �: � 1 �i4 ª � B

Page 235: Econometrics-Creel (2005)

11.7. TESTING THE OVERIDENTIFYING RESTRICTIONS 235

Considering the inverse here

���i4 ª � �: � 1 ðW�i4@üþ�òü�4�üÚ$� 1 üR4�� ó � 1 �WüR4Ì�:�� 1 ð �i4@üþ�òü�4@üÚ�� 1 ó � 1 �WüR4Ì�:�� 1 �òü�4�üÚ��¨�i4(üÚ � 1Now multiplying this by � 4 ª � B�� we obtain���ô`õ �òü 4 �3 � 1 �Wü 4 üÚ��¬� 4 üÚ � 1 � 4 ª � B �òü 4 �3 � 1 �Wü 4 üÚ��¬� 4 üÚ � 1 � 4 ü �Wü 4 üÚ � 1 ü 4 B �òü�4��3$� 1 ü�4©BThe objective function for the generalized IV estimator is

��� ���ôZõ� c B �Ö� ���ôZõ h 4 ª � c BÜ�j� ���ôZõ h B�4 ª � c B¯�j� ���ôZõ h � ���4ôZõ �i4 ª � c BÜ�j� ���ôZõ h B�4 ª � c B¯�j� ���ôZõ h � ���4ôZõ �i4 ª � B�2 ��I4ô`õ �i4 ª � � ���ôZõ B 4 ª � c B¯�j� ���ôZõ h � �� 4ôZõ c � 4 ª � B�28� 4 ª � � ���ôZõ h B 4 ª � c B¯�j� ���ôZõ h

Page 236: Econometrics-Creel (2005)

11.8. SYSTEM METHODS OF ESTIMATION 236

by the fonc for generalized IV. However, when we’re in the just indentified

case, this is

��� ���ôZõ� B�4 ª � ð BÜ�j�8�òü�4��3$� 1 ü�4©B ó B�4 ª � ð ± �Ö�K�òü�4��3$� 1 ü�4 ó B B 4 ð üþ�òü 4 üÚ � 1 ü 4 ��ü �Wü 4 üÚ � 1 ü 4 �K�òü 4 �: � 1 ü 4 ó B �The value of the objective function of the IV estimator is zero in the just identified

case. This makes sense, since we’ve already shown that the objective function

after dividing by è � is asymptotically � � with degrees of freedom equal to the

number of overidentifying restrictions. In the present case, there are no overi-

dentifying restrictions, so we have a � � � � rv, which has mean 0 and variance 0,

e.g., it’s simply 0. This means we’re not able to test the identifying restrictions

in the case of exact identification.

11.8. System methods of estimation

2SLS is a single equation method of estimation, as noted above. The advan-

tage of a single equation method is that it’s unaffected by the other equations

of the system, so they don’t need to be specified (except for defining what are

the exogs, so 2SLS can use the complete set of instruments). The disadvantage

of 2SLS is that it’s inefficient, in general.� Recall that overidentification improves efficiency of estimation, since

an overidentified equation can use more instruments than are neces-

sary for consistent estimation.� Secondly, the assumption is that

Page 237: Econometrics-Creel (2005)

11.8. SYSTEM METHODS OF ESTIMATION 237¤ à � � 2 Ùë¼�¬�i4 Ù � à ÿ Ó ¯ Ÿ ºN¸ � Ù ç ­j� � �6Es� Since there is no autocorrelation of the Ù U ’s, and since the columns of

Ù are individually homoscedastic, then

E ���������è01P1 ± f è01 � ± f >N>N>uè01 ¯ ± fè �P� ± f ...

. . . ...> è ¯�¯ ± f���������� 5Ñï ± f

This means that the structural equations are heteroscedastic and cor-

related with one another� In general, ignoring this will lead to inefficient estimation, following

the section on GLS. When equations are correlated with one another

estimation should account for the correlation in order to obtain effi-

ciency.� Also, since the equations are correlated, information about one equa-

tion is implicitly information about all equations. Therefore, overiden-

tification restrictions in any equation improve efficiency for all equa-

tions, even the just identified equations.� Single equation methods can’t use these types of information, and are

therefore inefficient (in general).

Page 238: Econometrics-Creel (2005)

11.8. SYSTEM METHODS OF ESTIMATION 238

11.8.1. 3SLS. Note: It is easier and more practical to treat the 3SLS esti-

mator as a generalized method of moments estimator (see Chapter 15). I no

longer teach the following section, but it is retained for its possible historical

interest. Another alternative is to use FIML (Subsection 11.8.2), if you are will-

ing to make distributional assumptions on the errors. This is computationally

feasible with modern computers.

Following our above notation, each structural equation can be written as

B *å ¤ * �E1,2�� * �01�2�/ * 9 * » * 28/ *Grouping the � equations together we get���������

B�1B �...B ¯

���������� ���������971 � >N>N> �� 9 � ...... . . . �� >N>N> � 9 ¯

�������������������» 1» �...» ¯ ���������� 2

���������/�1/ �.../ ¯

����������or B 9 » 28/where we already have that

ë¼�¬/]/J4© E 5Çï ± f

Page 239: Econometrics-Creel (2005)

11.8. SYSTEM METHODS OF ESTIMATION 239

The 3SLS estimator is just 2SLS combined with a GLS correction that takes

advantage of the structure of E¾& Define�9 as

�9 ����������K�¬� 4 �: � 1 � 4 971 � >N>N> �� �K��� 4 �: � 1 � 4 9 � ...... . . . �� >N>N> � �K��� 4 �: � 1 � 4 9 ¯

���������� ���������

�¤�1d��1 � >N>N> �� �¤ � � � ...... . . . �� >N>N> � �¤ ¯ � ¯

����������These instruments are simply the unrestricted rf predicitions of the endogs,

combined with the exogs. The distinction is that if the model is overidentified,

then ó �� à � 1may be subject to some zero restrictions, depending on the restrictions on

Ãand � � and

�ó does not impose these restrictions. Also, note that�ó is calculated

using OLS equation by equation. More on this later.

The 2SLS estimator would be�» � �9y4±9p�� 1 �9y4©Bas can be verified by simple multiplication, and noting that the inverse of a

block-diagonal matrix is just the matrix with the inverses of the blocks on the

main diagonal. This IV estimator still ignores the covariance information. The

natural extension is to add the GLS transformation, putting the inverse of the

Page 240: Econometrics-Creel (2005)

11.8. SYSTEM METHODS OF ESTIMATION 240

error covariance into the formula, which gives the 3SLS estimator�» | ³/�+³ c �9y4]��5Çï ± f" � 1 9 h � 1 �9y4g�A5Çï ± f� � 1 B c �9y4 ð 5y� 1 ï ± f ó 9 h � 1 �9y4 ð 5y� 1 ï ± f ó BThis estimator requires knowledge of 5µ& The solution is to define a feasible

estimator using a consistent estimator of 5µ& The obvious solution is to use an

estimator based on the 2SLS residuals:�/ *0 B * �¹9 * �» * H � ³<�(³(IMPORTANT NOTE: this is calculated using 9 * � not

�9 * C& Then the element![�¾� of 5 is estimated by �è * ßÈ �/ 4* �/ ß�Substitute

�5 into the formula above to get the feasible 3SLS estimator.

Analogously to what we did in the case of 2SLS, the asymptotic distribution

of the 3SLS estimator can be shown to beh� c �» | ³<�(³ � » h QçR­ ÎÏ � � � ���f�SUT ë áã ä ¶ �9 4 �45Ñï ± f� � 1 �9� · � 1 å æç ÐÒ

A formula for estimating the variance of the 3SLS estimator in finite samples

(cancelling out the powers of �� is�é c �» | ³<�(³ h c �9È4 c �5y� 1 ï ± f h �9 h � 1� This is analogous to the 2SLS formula in equation (??), combined with

the GLS correction.

Page 241: Econometrics-Creel (2005)

11.8. SYSTEM METHODS OF ESTIMATION 241� In the case that all equations are just identified, 3SLS is numerically

equivalent to 2SLS. Proving this is easiest if we use a GMM interpre-

tation of 2SLS and 3SLS. GMM is presented in the next econometrics

course. For now, take it on faith.

The 3SLS estimator is based upon the rf parameter estimator�ó � calculated

equation by equation using OLS:�ó ���i4��:$� 1 ��4@¤which is simply �ó ��� 4 �: � 1 � 4 n B�1 B � >N>N>RB ¯ rthat is, OLS equation by equation using all the exogs in the estimation of each

column of ó &It may seem odd that we use OLS on the reduced form, since the rf equa-

tions are correlated:

¤¯4U �i4U � à � 1 2 Ù 4U à � 1 � 4U ó 2Ëé 4Uand é+U ð à � 1 ó 4 Ù U�çu­ c � � ð à � 1 ó 4 5 à � 1 h �Ôê� Let this var-cov matrix be indicated byþ ð à � 1 ó 4 5 à � 1

Page 242: Econometrics-Creel (2005)

11.8. SYSTEM METHODS OF ESTIMATION 242

OLS equation by equation to get the rf is equivalent to���������B�1B �...B ¯

� �������� ���������� � >N>N> �� � ...... . . . �� >N>N> � �

� ������������������ 1� �...� ¯ � �������� 2

���������¸�1¸ �...¸ ¯

� ��������where B * is the �M��� vector of observations of the ! U�¶ endog, � is the entire�j� � matrix of exogs, � * is the ! U�¶ column of ó � and ¸ * is the ! U�¶ column of é¢&Use the notation B a` � 2¹¸to indicate the pooled model. Following this notation, the error covariance

matrix is éb�4¸% þ ï ± f� This is a special case of a type of model known as a set of seemingly

unrelated equations (SUR) since the parameter vector of each equation

is different. The equations are contemporanously correlated, however.

The general case would have a different � * for each equation.� Note that each equation of the system individually satisfies the classi-

cal assumptions.� However, pooled estimation using the GLS correction is more efficient,

since equation-by-equation estimation is equivalent to pooled estima-

tion, since ` is block diagonal, but ignoring the covariance informa-

tion.� The model is estimated by GLS, whereþ

is estimated using the OLS

residuals from equation-by-equation estimation, which are consistent.

Page 243: Econometrics-Creel (2005)

11.8. SYSTEM METHODS OF ESTIMATION 243� In the special case that all the � * are the same, which is true in the

present case of estimation of the rf parameters, SUR � OLS. To show

this note that in this case `§ ¡± ftïM��& Using the rules

(1) � w ï � � 1 � w � 1 ï � � 1 (2) � w ï � 4 � w 4 ï � 4 and

(3) � w ï � v� û ï � � wµû ï � � Y� we get�� ³   � ðY� ± ftïM�: 4 � þ ï ± f" � 1 � ± ftïM�: ó � 1 � ± fUïM�: 4 � þ ï ± f� � 1 B ðJð þ � 1 ïM�i4 ó � ± ftïM�: ó � 1 ð þ � 1 ïM�i4 ó B ð þ ïu���i4��:$� 1 ó ð þ � 1 ïM�i4 ó B ° ± ¯ ïu�¬� 4 �: � 1 � 4 ³ B ���������

�� 1�� �...�� ¯ ����������� So the unrestricted rf coefficients can be estimated efficiently (assum-

ing normality) by OLS, even if the equations are correlated.� We have ignored any potential zeros in the matrix ó � which if they

exist could potentially increase the efficiency of estimation of the rf.� Another example where SUR � OLS is in estimation of vector autore-

gressions. See two sections ahead.

11.8.2. FIML. Full information maximum likelihood is an alternative es-

timation method. FIML will be asymptotically efficient, since ML estima-

tors based on a given information set are asymptotically efficient w.r.t. all

other estimators that use the same information set, and in the case of the

Page 244: Econometrics-Creel (2005)

11.8. SYSTEM METHODS OF ESTIMATION 244

full-information ML estimator we use the entire information set. The 2SLS

and 3SLS estimators don’t require distributional assumptions, while FIML of

course does. Our model is, recall

¤s4U à �i4U � 2 Ù 4UÙ Uñç ­j� � �65¼Y�Ôê� ë7� Ù U Ù 4ì � �[ µí �The joint normality of Ù U means that the density for Ù U is the multivariate nor-

mal, which is �W# � � � p � ð�ÿ }�� 5 � 1 ó � 1Ap � }Y~ � Æ � �# Ù 4U 5 � 1 Ù U ÉThe transformation from Ù U to ¤+U requires the Jacobian

� ÿ }�� J Ù UJ�¤ 4U � � ÿ }�� à �so the density for ¤EU is

�ò# � � ¯ p � � ÿ }�� à � ð ÿ }�� 5 � 1 ó � 1Ap � }C~%� Æ�� �# �W¤ 4U à �²� 4U � �5 � 1 �W¤ 4U à �j� 4U � 4 ÉGiven the assumption of independence over time, the joint log-likelihood func-

tion is��� B � � � à �65¼ � �,�# �)� �ò# � [2p� ��� � � ÿ }�� à � Y� � # �)� ÿ }�� 5 � 1 � �# f� U(��1 ��¤ 4U à �j� 4U � "5 � 1 �W¤ 4U à �j� 4U � 4� This is a nonlinear in the parameters objective function. Maximixation

of this can be done using iterative numeric methods. We’ll see how to

do this in the next section.

Page 245: Econometrics-Creel (2005)

11.9. EXAMPLE: 2SLS AND KLEIN’S MODEL 1 245� It turns out that the asymptotic distribution of 3SLS and FIML are the

same, assuming normality of the errors.� One can calculate the FIML estimator by iterating the 3SLS estimator,

thus avoiding the use of a nonlinear optimizer. The steps are

(1) Calculate�à | ³<�(³ and

�� | ³/�+³ as normal.

(2) Calculate�ó �� | ³/�+³ �à � 1| ³/�+³ & This is new, we didn’t estimate ó in

this way before. This estimator may have some zeros in it. When

Greene says iterated 3SLS doesn’t lead to FIML, he means this for

a procedure that doesn’t update�ó � but only updates

�5 and�� and�à & If you update

�ó you do converge to FIML.

(3) Calculate the instruments�¤ � �ó and calculate

�5 using�Ã

and��

to get the estimated errors, applying the usual estimator.

(4) Apply 3SLS using these new instruments and the estimate of 5µ&(5) Repeat steps 2-4 until there is no change in the parameters.� FIML is fully efficient, since it’s an ML estimator that uses all informa-

tion. This implies that 3SLS is fully efficient when the errors are normally

distributed. Also, if each equation is just identified and the errors are

normal, then 2SLS will be fully efficient, since in this case 2SLS � 3SLS.� When the errors aren’t normally distributed, the likelihood function is

of course different than what’s written above.

11.9. Example: 2SLS and Klein’s Model 1

The Octave program Simeq/Klein.m performs 2SLS estimation for the 3

equations of Klein’s model 1, assuming nonautocorrelated errors, so that lagged

endogenous variables can be used as instruments. The results are:

CONSUMPTION EQUATION

Page 246: Econometrics-Creel (2005)

11.9. EXAMPLE: 2SLS AND KLEIN’S MODEL 1 246

*******************************************************

2SLS estimation results

Observations 21

R-squared 0.976711

Sigma-squared 1.044059

estimate st.err. t-stat. p-value

Constant 16.555 1.321 12.534 0.000

Profits 0.017 0.118 0.147 0.885

Lagged Profits 0.216 0.107 2.016 0.060

Wages 0.810 0.040 20.129 0.000

*******************************************************

INVESTMENT EQUATION

*******************************************************

2SLS estimation results

Observations 21

R-squared 0.884884

Sigma-squared 1.383184

estimate st.err. t-stat. p-value

Constant 20.278 7.543 2.688 0.016

Profits 0.150 0.173 0.867 0.398

Lagged Profits 0.616 0.163 3.784 0.001

Page 247: Econometrics-Creel (2005)

11.9. EXAMPLE: 2SLS AND KLEIN’S MODEL 1 247

Lagged Capital -0.158 0.036 -4.368 0.000

*******************************************************

WAGES EQUATION

*******************************************************

2SLS estimation results

Observations 21

R-squared 0.987414

Sigma-squared 0.476427

estimate st.err. t-stat. p-value

Constant 1.500 1.148 1.307 0.209

Output 0.439 0.036 12.316 0.000

Lagged Output 0.147 0.039 3.777 0.002

Trend 0.130 0.029 4.475 0.000

*******************************************************

The above results are not valid (specifically, they are inconsistent) if the er-

rors are autocorrelated, since lagged endogenous variables will not be valid

instruments in that case. You might consider eliminating the lagged endoge-

nous variables as instruments, and re-estimating by 2SLS, to obtain consistent

parameter estimates in this more complex case. Standard errors will still be

estimated inconsistently, unless use a Newey-West type covariance estimator.

Food for thought...

Page 248: Econometrics-Creel (2005)

CHAPTER 12

Introduction to the second half

We’ll begin with study of extremum estimators in general. Let ��f be the

available data, based on a sample of size � .

DEFINITION 12.0.1. [Extremum estimator] An extremum estimator�@ is the

optimizing element of an objective function �'f+���Äf%��@" over a set N .

We’ll usually write the objective function suppressing the dependence on�Äf%&Example: Least squares, linear model

Let the d.g.p. be BgU ¡L 4U @ F 2j/JU?�5 �"�Y#%�'&(&(&)����r@ F CFN & Stacking observations

vertically, _�f �` f+@]F02�/Jf � where ` f dc � 1 � � >N>N> � f h 4 & The least squares

estimator is defined as�@ � �V�-��� � �� �'f5�A@V �Ô�  ��Å\ _,fs� ` f(@<] 4 \ _,f�� ` f(@<]We readily find that

�@ � ` 4 ` � 1 ` 4 _Z&Example: Maximum likelihood

Suppose that the continuous random variable BVU;ç ±�± ­j�A@]F'�'�JY& The maxi-

mum likelihood estimator is defined as�@ � �V�[����� ~� � f5�A@" fG U(��1 �ò# � � 1Ap � }C~%� ¶ � ��BgU��i@" �# ·Because the logarithmic function is strictly increasing on � � � k , maximization

of the average logarithm of the likelihood function is achieved at the same�@ as

248

Page 249: Econometrics-Creel (2005)

12. INTRODUCTION TO THE SECOND HALF 249

for the likelihood function:�@ � �g�[���b� ~� �\f5�A@" ��  �, ��� � f �4@" ��  # ��� # � � ��  �, f� U(��1 ��B]U��i@V �#Solution of the f.o.c. leads to the familiar result that

�@ Ý_Z&� MLE estimators are asymptotically efficient (Cramér-Rao lower bound,

Theorem3), supposing the strong distributional assumptions upon which

they are based are true.� One can investigate the properties of an “ML” estimator supposing

that the distributional assumptions are incorrect. This gives a quasi-

ML estimator, which we’ll study later.� The strong distributional assumptions of MLE may be questionable

in many cases. It is possible to estimate using weaker distributional

assumptions based only on some of the moments of a random vari-

able(s).

Example: Method of moments

Suppose we draw a random sample of BVU from the � � �A@]FC distribution. Here,@ F is the parameter of interest. The first moment (expectation), 3Ä1Y� of a random

variable will in general be a function of the parameters of the distribution, i.e.,31v�4@ F .� 31 31v�4@gFC is a moment-parameter equation.� In this example, the relationship is the identity function 3Ä1\�A@]FY @gFN�though in general the relationship may be more complicated. The

sample first moment is �31 f� U(��1 B]U  �G&

Page 250: Econometrics-Creel (2005)

12. INTRODUCTION TO THE SECOND HALF 250� Define ��1\�4@" 31C�A@"G� �3�1� The method of moments principle is to choose the estimator of the

parameter to set the estimate of the population moment equal to the

sample moment, i.e., �i1\� �@V � � . Then the moment-parameter equation

is inverted to solve for the parameter estimate.

In this case, ��1\� �@" �@p� f� U(��1 BgU  � �� &Since � fU(��1 B]U  � 6R @gF by the LLN, the estimator is consistent.

More on the method of moments

Continuing with the above example, the variance of a � � �A@]FC r.v. is

é ��BgU� Ù ð×BgU��i@ F ó � #(@ F &� Define � � �A@" #(@p� � fU(��1 ��BgU��uÝB% ��� The MM estimator would set

� � � �@" # �@p� � fU(��1 ��BgU��ÚÝB% �� � � &Again, by the LLN, the sample variance is consistent for the true vari-

ance, that is, � fU(��1 ��BgU��uÝB% �� 6R #(@ F &So, �@ � fU(��1 ��BgU��®ÝB% �#]� �

Page 251: Econometrics-Creel (2005)

12. INTRODUCTION TO THE SECOND HALF 251

which is obtained by inverting the moment-parameter equation, is

consistent.

Example: Generalized method of moments (GMM)

The previous two examples give two estimators of @ F which are both con-

sistent. With a given sample, the estimators will be different in general.

� With two moment-parameter equations and only one parameter, we

have overidentification, which means that we have more information

than is strictly necessary for consistent estimation of the parameter.� The GMM combines information from the two moment-parameter equa-

tions to form a new estimator which will be more efficient, in general

(proof of this below).

From the first example, define �i1�U-�4@" @�MBgUP& We already have that �i1v�A@V is

the sample average of �i1�U-�4@"Y� i.e.,

��1\�A@V �  � f� U(��1 ��1�U-�A@V @p� f� U(��1 BgU  �&Clearly, when evaluated at the true parameter value @ F � both Ù \ ��1�U-�A@ F I] ñ�and Ù \ ��1\�A@]FC_] a� .

From the second example we define additional moment conditions

� � U-�A@V #(@p����BgU��ÚÝB% �and � � �4@" #(@p� � fU(��1 ��BgU��uÝB% �� &

Page 252: Econometrics-Creel (2005)

12. INTRODUCTION TO THE SECOND HALF 252

Again, it is clear from the LLN that � � �A@ F Q�P ì PR � & The MM estimator would

chose�@ to set either �i1\� �@V k� or � � � �@" ñ� & In general, no single value of @

will solve the two equations simultaneously.� The GMM estimator is based on defining a measure of distance J����:�A@"-C�where �:�A@V ����1\�4@"C�[� � �A@"- 4 � and choosing�@ ¡�V�[�Z� � �� �\f5�A@" Jp�¬�:�A@"-0&

An example would be to choose J����÷ � 4 w ��� where w is a positive definite

matrix. While it’s clear that the MM gives consistent estimates if there is a one-

to-one relationship between parameters and moments, it’s not immediately

obvious that the GMM estimator is consistent. (We’ll see later that it is.)

These examples show that these widely used estimators may all be inter-

preted as the solution of an optimization problem. For this reason, the study

of extremum estimators is useful for its generality. We will see that the general

results extend smoothly to the more specialized results available for specific

estimators. After studying extremum estimators in general, we will study the

GMM estimator, then QML and NLS. The reason we study GMM first is that

LS, IV, NLS, MLE, QML and other well-known parametric estimators may all

be interpreted as special cases of the GMM estimator, so the general results on

GMM can simplify and unify the treatment of these other estimators. Never-

theless, there are some special results on QML and NLS, and both are impor-

tant in empirical research, which makes focus on them useful.

One of the focal points of the course will be nonlinear models. This is not to

suggest that linear models aren’t useful. Linear models are more general than

Page 253: Econometrics-Creel (2005)

12. INTRODUCTION TO THE SECOND HALF 253

they might first appear, since one can employ nonlinear transformations of the

variables:

m F ��BgU� n m 1v� � UW m � � � UW >N>N> m 6�� � UW¯r @ F 2�/JUFor example, ��� B]U è� 2K� � 1�U52i� � � 1�U 2 » � 1�U � � U+28/JUfits this form.� The important point is that the model is linear in the parameters but not

necessarily linear in the variables.

In spite of this generality, situations often arise which simply can not be con-

vincingly represented by linear in the parameters models. Also, theory that

applies to nonlinear models also applies to linear models, so one may as well

start off with the general case.

Example: Expenditure shares

Roy’s Identity states that the quantity demanded of the ! U�¶ of � goods is

�+*, � à ¸��@����B  à � *à ¸��@����B  à B &An expenditure share is � * � � *)�+*� B��so necessarily � * C�\ � �\�}]ò� and � ¯* ��1 � *� � . No linear in the parameters model

for �+* or � * with a parameter space that is defined independent of the data can

guarantee that either of these conditions holds. These constraints will often be

violated by estimated linear models, which calls into question their appropri-

ateness in cases of this sort.

Example: Binary limited dependent variable

Page 254: Econometrics-Creel (2005)

12. INTRODUCTION TO THE SECOND HALF 254

The referendum contingent valuation (CV) method of infering the social

value of a project provides a simple example. This example is a special case

of more general discrete choice (or binary response) models. Individuals are

asked if they would pay an amount w for provision of a project. Indirect util-

ity in the base case (no project) is ¸%F]�������TG2Ë/]F'� where � is income and � is a

vector of other variables such as prices, personal characteristics, etc. After pro-

vision, utility is ¸ 1 ������TG2Ë/ 1 & The random terms / * ��! �V�Y#%� reflect variations

of preferences in the population. With this, an individual agrees1 to pay w if

/ F �j/ 1 ½ ¸ 1 ���Ç� w ���T�c¸ F ������TDefine / /]F¼�Ë/ 1 � let collect � and �+� and let �M¸����� w ¸ 1 ���d� w ���T �¸�FJ������TC& Define B � if the consumer agrees to pay w for the change, B §�otherwise. The probability of agreement is

(12.0.1) Ë � ��B �J 2 î \ �M¸����� w I]"&To simplify notation, define ������ w � 2 î \ �M¸0��i� w _]g& To make the example

specific, suppose that ¸ 1 �������T � ���I�¸ F �������T �y�I�and /]F and / 1 are i.i.d. extreme value random variables. That is, utility de-

pends only on income, preferences in both states are homothetic, and a spe-

cific distributional assumption is made on the distribution of preferences in

the population. With these assumptions (the details are unimportant here, see

1We assume here that responses are truthful, that is there is no strategic behavior and thatindividuals are able to order their preferences in this hypothetical situation.

Page 255: Econometrics-Creel (2005)

12. INTRODUCTION TO THE SECOND HALF 255

articles by D. McFadden if you’re interested) it can be shown that

��� w ��@V � � � 28� w ��where � �W�� is the logistic distribution function� ���� ���2 }C~%� ��;��[ � 1 &This is the simple logit model: the choice probability is the logit function of a

linear in parameters function.

Now, B is either � or 1, and the expected value of B is � � � 28� w . Thus, we

can write

B � � � 28� w �2 ­ë7� ­ � &One could estimate this by (nonlinear) least squares

c �� � �� h ��V�-����� � �� � U ��B� � � � 2K� w - �The main point is that it is impossible that � � � 2K� w can be written as a linear

in the parameters model, in the sense that, for arbitrary w , there are no @T� m � w such that � � � 28� w Rm � w ?4*@%�Ôê��where m � w is a � -vector valued function of � and @ is a � dimensional param-

eter. This is because for any @%� we can always find a � such that m ���� 4 @ will be

negative or greater than �"� which is illogical, since it is the expectation of a 0/1

binary random variable. Since this sort of problem occurs often in empirical

work, it is useful to study NLS and other nonlinear models.

Page 256: Econometrics-Creel (2005)

12. INTRODUCTION TO THE SECOND HALF 256

After discussing these estimation methods for parametric models we’ll briefly

introduce nonparametric estimation methods. These methods allow one, for ex-

ample, to estimate� � � UW consistently when we are not willing to assume that a

model of the form B]U � � � U��2�/JUcan be restricted to a parametric form

BgU � � � U×��@"�2�/]UË � ��/JU ½ �� 2 î �W� � t�� � U�@ C N �Yt C��where

� �?>@ and perhaps2 î ��� � tI� � UW are of known functional form. This is im-

portant since economic theory gives us general information about functions

and the signs of their derivatives, but not about their specific form.

Then we’ll look at simulation-based methods in econometrics. These meth-

ods allow us to substitute computer power for mental power. Since computer

power is becoming relatively cheap compared to mental effort, any econome-

trician who lives by the principles of economic theory should be interested in

these techniques.

Finally, we’ll look at how econometric computations can be done in paral-

lel on a cluster of computers. This allows us to harness more computational

power to work with more complex models that can be dealt with using a desk-

top computer.

Page 257: Econometrics-Creel (2005)

CHAPTER 13

Numeric optimization methods

Readings: Hamilton, ch. 5, section 7 (pp. 133-139) [�� Gourieroux and Mon-

fort, Vol. 1, ch. 13, pp. 443-60 [ ; Goffe, et. al. (1994).

If we’re going to be applying extremum estimators, we’ll need to know

how to find an extremum. This section gives a very brief introduction to what

is a large literature on numeric optimization methods. We’ll consider a few

well-known techniques, and one fairly new technique that may allow one to

solve difficult problems. The main objective is to become familiar with the

issues, and to learn how to use the BFGS algorithm at the practical level.

The general problem we consider is how to find the maximizing element�@ (a � -vector) of a function �T�4@"Y& This function may not be continuous, and

it may not be differentiable. Even if it is twice continuously differentiable, it

may not be globally concave, so local maxima, minima and saddlepoints may

all exist. Supposing ���4@" were a quadratic function of @%� e.g.,

���A@" ô�2�7�4�@y2 �# @]4 û @T�the first order conditions would be linear:

� VY���A@" 7¢2 û @so the maximizing (minimizing) element would be

�@ � û � 1 7N& This is the sort

of problem we have with linear models estimated by OLS. It’s also the case for

257

Page 258: Econometrics-Creel (2005)

13.2. DERIVATIVE-BASED METHODS 258

feasible GLS, since conditional on the estimate of the varcov matrix, we have

a quadratic objective function in the remaining parameters.

More general problems will not have linear f.o.c., and we will not be able

to solve for the maximizer analytically. This is when we need a numeric opti-

mization method.

13.1. Search

The idea is to create a grid over the parameter space and evaluate the func-

tion at each point on the grid. Select the best point. Then refine the grid in

the neighborhood of the best point, and continue until the accuracy is ”good

enough”. See Figure 13.1.1. One has to be careful that the grid is fine enough

in relationship to the irregularity of the function to ensure that sharp peaks are

not missed entirely.

To check � values in each dimension of a � dimensional parameter space,

we need to check � ÿ points. For example, if � � �"� and �o � � � there would

be � �"� 1 F points to check. If 1000 points can be checked in a second, it would

take % &\�/)%�p�� ��� years to perform the calculations, which is approximately the

age of the earth. The search method is a very reasonable choice if � is small,

but it quickly becomes infeasible if � is moderate or large.

13.2. Derivative-based methods

13.2.1. Introduction. Derivative-based methods are defined by

(1) the method for choosing the initial value, @ 1(2) the iteration method for choosing @ D�� 1 given @ D (based upon deriva-

tives)

(3) the stopping criterion.

Page 259: Econometrics-Creel (2005)

13.2. DERIVATIVE-BASED METHODS 259

FIGURE 13.1.1. The search method

The iteration method can be broken into two problems: choosing the stepsizeô D (a scalar) and choosing the direction of movement, J D � which is of the same

dimension of @T� so that @ à D�� 1WÅ @ à D Å 2Mô D J D &A locally increasing direction of search J is a direction such that�5ôbH à ���A@y2Mô�JTà ô ¥ �

for ô positive but small. That is, if we go in direction J , we will improve on the

objective function, at least if we don’t go too far in that direction.

Page 260: Econometrics-Creel (2005)

13.2. DERIVATIVE-BASED METHODS 260� As long as the gradient at @ is not zero there exist increasing directions,

and they can all be represented as æ D �0�4@ D where æ D is a symmetric pd

matrix and �p�A@V � VY���A@V is the gradient at @ . To see this, take a T.S.

expansion around ô F �����A@È2Mô�JT ���A@È2 � JT�2��WôÞ� � u�0�A@È2 � JT 4 Jp2 ´ �?�J ���A@V�2Mô+�0�A@"P4�Jµ2 ´ �Ô�J

For small enough ô the´ �Ô�N term can be ignored. If J is to be an in-

creasing direction, we need �,�4@" 4 J ¥ � & Defining J ¡æ �0�A@VC� where æ is

positive definite, we guarantee that�0�A@"P4*J �0�A@V?4 æ �0�A@V ¥ �unless �0�A@V �� & Every increasing direction can be represented in this

way (p.d. matrices are those such that the angle between � and æ �,�4@"is less that 90 degrees). See Figure 13.2.1.� With this, the iteration rule becomes@ Ã D� 1WÅ @ Ã D Å 2Kô D æ D �0�A@ D

and we keep going until the gradient becomes zero, so that there is no increas-

ing direction. The problem is how to choose ô and æ &� Conditional on æ , choosing ô is fairly straightforward. A simple line

search is an attractive possibility, since ô is a scalar.� The remaining problem is how to choose æ &� Note also that this gives no guarantees to find a global maximum.

Page 261: Econometrics-Creel (2005)

13.2. DERIVATIVE-BASED METHODS 261

FIGURE 13.2.1. Increasing directions of search

13.2.2. Steepest descent. Steepest descent (ascent if we’re maximizing) just

sets æ to and identity matrix, since the gradient provides the direction of max-

imum rate of change of the objective function.

� Advantages: fast - doesn’t require anything more than first deriva-

tives.� Disadvantages: This doesn’t always work too well however (draw pic-

ture of banana function).

Page 262: Econometrics-Creel (2005)

13.2. DERIVATIVE-BASED METHODS 262

13.2.3. Newton-Raphson. The Newton-Raphson method uses information

about the slope and curvature of the objective function to determine which di-

rection and how far to move from an initial point. Supposing we’re trying

to maximize �'f5�A@VC& Take a second order Taylor’s series approximation of �Jf5�A@Vabout @ D (an initial guess).

�\f+�4@" ¬u�\f+�4@ D �2c�0�A@ D ?4 ð @��c@ D ó 2¡�  # ð @p�c@ D ó 4 d �4@ D ð @s�s@ D óTo attempt to maximize �Nf5�A@VC� we can maximize the portion of the right-hand

side that depends on @T� i.e., we can maximize

ý���A@" �,�4@ D ?4*@È2¡�  # ð @p�i@ D ó 4 d �4@ D ð @p�i@ D ówith respect to @%& This is a much easier problem, since it is a quadratic function

in @%� so it has linear first order conditions. These are

� V ý���A@V �0�A@ D �2 d �A@ D ð @p�i@ D óSo the solution for the next round estimate is@ D�� 1 @ D � d �4@ D � 1 �0�A@ D This is illustrated in Figure 13.2.2.

However, it’s good to include a stepsize, since the approximation to �Jf5�A@Vmay be bad far away from the maximizer

�@%� so the actual iteration formula is@ D�� 1 @ D �jô D d �A@ D �� 1 �0�A@ D � A potential problem is that the Hessian may not be negative definite

when we’re far from the maximizing point. So �d�A@ D � 1 may not be

Page 263: Econometrics-Creel (2005)

13.2. DERIVATIVE-BASED METHODS 263

FIGURE 13.2.2. Newton-Raphson method

positive definite, and �d�A@ D � 1 �0�A@ D may not define an increasing di-

rection of search. This can happen when the objective function has flat

regions, in which case the Hessian matrix is very ill-conditioned (e.g.,

is nearly singular), or when we’re in the vicinity of a local minimum,d�A@ D is positive definite, and our direction is a decreasing direction

of search. Matrix inverses by computers are subject to large errors

when the matrix is ill-conditioned. Also, we certainly don’t want to

go in the direction of a minimum when we’re maximizing. To solve

this problem, Quasi-Newton methods simply add a positive definite

component to

d�A@" to ensure that the resulting matrix is positive def-

inite, e.g., æ �d�A@V72 7��]� where 7 is chosen large enough so that

Page 264: Econometrics-Creel (2005)

13.2. DERIVATIVE-BASED METHODS 264æ is well-conditioned and positive definite. This has the benefit that

improvement in the objective function is guaranteed.� Another variation of quasi-Newton methods is to approximate the

Hessian by using successive gradient evaluations. This avoids actual

calculation of the Hessian, which is an order of magnitude (in the di-

mension of the parameter vector) more costly than calculation of the

gradient. They can be done to ensure that the approximation is p.d.

DFP and BFGS are two well-known examples.

Stopping criteria

The last thing we need is to decide when to stop. A digital computer is

subject to limited machine precision and round-off errors. For these reasons,

it is unreasonable to hope that a program can exactly find the point that max-

imizes a function. We need to define acceptable tolerances. Some stopping

criteria are:� Negligable change in parameters:

� @ Dß �c@ D � 1ß � ½ /�1Y�Ôêa�� Negligable relative change:

� @ Dß �c@ D � 1ß@ D � 1ß � ½ / � �Ôêa�� Negligable change of function:

� ���A@ D G�K���4@ D � 1 � ½ / |� Gradient negligibly different from zero:

� � ß �4@ D � ½ //!N�?êa�

Page 265: Econometrics-Creel (2005)

13.2. DERIVATIVE-BASED METHODS 265� Or, even better, check all of these.� Also, if we’re maximizing, it’s good to check that the last round (real,

not approximate) Hessian is negative definite.

Starting values

The Newton-Raphson and related algorithms work well if the objective

function is concave (when maximizing), but not so well if there are convex

regions and local minima or multiple local maxima. The algorithm may con-

verge to a local minimum or to a local maximum that is not optimal. The

algorithm may also have difficulties converging at all.� The usual way to “ensure” that a global maximum has been found

is to use many different starting values, and choose the solution that

returns the highest objective function value. THIS IS IMPORTANT

in practice. More on this later.

Calculating derivatives

The Newton-Raphson algorithm requires first and second derivatives. It

is often difficult to calculate derivatives (especially the Hessian) analytically if

the function �\f+�?>@ is complicated. Possible solutions are to calculate derivatives

numerically, or to use programs such as MuPAD or Mathematica to calculate

analytic derivatives. For example, Figure 13.2.3 shows MuPAD1 calculating a

derivative that I didn’t know off the top of my head, and one that I did know.� Numeric derivatives are less accurate than analytic derivatives, and

are usually more costly to evaluate. Both factors usually cause opti-

mization programs to be less successful when numeric derivatives are

used.

1MuPAD is not a freely distributable program, so it’s not on the CD. You can download it fromhttp://www.mupad.de/download.shtml

Page 266: Econometrics-Creel (2005)

13.2. DERIVATIVE-BASED METHODS 266

FIGURE 13.2.3. Using MuPAD to get analytic derivatives

� One advantage of numeric derivatives is that you don’t have to worry

about having made an error in calculating the analytic derivative. When

programming analytic derivatives it’s a good idea to check that they

are correct by using numeric derivatives. This is a lesson I learned the

hard way when writing my thesis.� Numeric second derivatives are much more accurate if the data are

scaled so that the elements of the gradient are of the same order of

magnitude. Example: if the model is BVU ¹ � ��� U�2 ���\U�2M/]U?� and esti-

mation is by NLS, suppose that� § �\f5�Ô>© � �"�V� and

� x �\f5�Ô>© å� & �"� �"&

Page 267: Econometrics-Creel (2005)

13.3. SIMULATED ANNEALING 267

One could define � [ �GÂ � �"�"� � � [U � �V�"�g� U ; � [ � �"�"� � � � [U �\U Â � �"�"� &In this case, the gradients

� § à �\f5�Ô>© and� x �\f+�?>@ will both be 1.

In general, estimation programs always work better if data is scaled

in this way, since roundoff errors are less likely to become important.

This is important in practice.� There are algorithms (such as BFGS and DFP) that use the sequen-

tial gradient evaluations to build up an approximation to the Hessian.

The iterations are faster for this reason since the actual Hessian isn’t

calculated, but more iterations usually are required for convergence.� Switching between algorithms during iterations is sometimes useful.

13.3. Simulated Annealing

Simulated annealing is an algorithm which can find an optimum in the

presence of nonconcavities, discontinuities and multiple local minima/maxima.

Basically, the algorithm randomly selects evaluation points, accepts all points

that yield an increase in the objective function, but also accepts some points

that decrease the objective function. This allows the algorithm to escape from

local minima. As more and more points are tried, periodically the algorithm

focuses on the best point so far, and reduces the range over which random

points are generated. Also, the probability that a negative move is accepted

reduces. The algorithm relies on many evaluations, as in the search method,

but focuses in on promising areas, which reduces function evaluations with

respect to the search method. It does not require derivatives to be evaluated. I

have a program to do this if you’re interested.

Page 268: Econometrics-Creel (2005)

13.4. EXAMPLES 268

13.4. Examples

This section gives a few examples of how some nonlinear models may be

estimated using maximum likelihood.

13.4.1. Discrete Choice: The logit model. In this section we will consider

maximum likelihood estimation of the logit model for binary 0/1 dependent

variables. We will use the BFGS algotithm to find the MLE.

We saw an example of a binary choice model in equation 12.0.1. A more

general representation is

B [ �0� � G�j/B ����B [ ¥ � ªÞÀ ��B �J 2 î \ �0� � _]� ��� � ��@"The log-likelihood function is

�\f5�A@" �� f� * ��1 ��B * �)� ��� �+* ��@"�2a�Ô�È��B * ��� \@�È�i��� �+* ��@"I]¬For the logit model (see the contingent valuation example above), the prob-

ability has the specific form

��� � ��@V ��Z2 }Y~ � �?� ��� @"You should download and examine LogitDGP.m , which generates data

according to the logit model, logit.m , which calculates the loglikelihood, and

EstimateLogit.m , which sets things up and calls the estimation routine, which

uses the BFGS algorithm.

Page 269: Econometrics-Creel (2005)

13.4. EXAMPLES 269

Here are some estimation results with � � �"� � and the true @ � � �'�N 4 &***********************************************Trial of MLE estimation of Logit model

MLE Estimation ResultsBFGS convergence: Normal convergence

Average Log-L: 0.607063Observations: 100

estimate st. err t-stat p-valueconstant 0.5400 0.2229 2.4224 0.0154slope 0.7566 0.2374 3.1863 0.0014

Information CriteriaCAIC : 132.6230BIC : 130.6230AIC : 125.4127

***********************************************

The estimation program is calling mle_results(), which in turn calls

a number of other routines. These functions are part of the octave-forge

repository.

13.4.2. Count Data: The Poisson model. Demand for health care is usu-

ally thought of a a derived demand: health care is an input to a home pro-

duction function that produces health, and health is an argument of the utility

function. Grossman (1972), for example, models health as a capital stock that

is subject to depreciation (e.g., the effects of ageing). Health care visits restore

the stock. Under the home production framework, individuals decide when to

make health care visits to maintain their health stock, or to deal with negative

shocks to the stock in the form of accidents or illnesses. As such, individual

Page 270: Econometrics-Creel (2005)

13.4. EXAMPLES 270

demand will be a function of the parameters of the individuals’ utility func-

tions.

The MEPS health data file , meps1996.data, contains 4564 observations

on six measures of health care usage. The data is from the 1996 Medical Expen-

diture Panel Survey (MEPS). You can get more information at http://www.meps.ahrq.gov/.

The six measures of use are are office-based visits (OBDV), outpatient vis-

its (OPV), inpatient visits (IPV), emergency room visits (ERV), dental visits

(VDV), and number of prescription drugs taken (PRESCR). These form columns

1 - 6 of meps1996.data. The conditioning variables are public insurance

(PUBLIC), private insurance (PRIV), sex (SEX), age (AGE), years of education

(EDUC), and income (INCOME). These form columns 7 - 12 of the file, in the

order given here. PRIV and PUBLIC are 0/1 binary variables, where a 1 indi-

cates that the person has access to public or private insurance coverage. SEX

is also 0/1, where 1 indicates that the person is female. This data will be used

in examples fairly extensively in what follows.

The program ExploreMEPS.m shows how the data may be read in, and

gives some descriptive information about variables, which follows:

All of the measures of use are count data, which means that they take on

the values � �\�"�Y#%�'&(&(& . It might be reasonable to try to use this information by

specifying the density as a count data density. One of the simplest count data

densities is the Poisson density, which is�/; ��B5 }C~%� �� e e KB�� &The Poisson average log-likelihood function is

�\f+�4@" �� f� * ��1 �P� e5* 2KB * ��� e5* � ��� B * ��

Page 271: Econometrics-Creel (2005)

13.4. EXAMPLES 271

We will parameterize the model ase+*§ }Y~%� � L 4* �L,*§ \@� ª�� � B ±Tû ª � ± é Û¢Ù � w � Ù Ù � � û�± ­ û ](4¬&This ensures that the mean is positive, as is required for the Poisson model.

Note that for this parameterization

� ß7 à e� à � ßeso

� ß$�TßÈ è­ �ãÔä �the elasticity of the conditional mean of B with respect to the � U�¶ conditioning

variable.

The program EstimatePoisson.m estimates a Poisson model using the full

data set. The results of the estimation, using OBDV as the dependent variable

are here:

MPITB extensions found

OBDV

******************************************************

Poisson model, MEPS 1996 full data set

MLE Estimation Results

Page 272: Econometrics-Creel (2005)

13.5. DURATION DATA AND THE WEIBULL MODEL 272

BFGS convergence: Normal convergence

Average Log-L: -3.671090

Observations: 4564

estimate st. err t-stat p-value

constant -0.791 0.149 -5.290 0.000

pub. ins. 0.848 0.076 11.093 0.000

priv. ins. 0.294 0.071 4.137 0.000

sex 0.487 0.055 8.797 0.000

age 0.024 0.002 11.471 0.000

edu 0.029 0.010 3.061 0.002

inc -0.000 0.000 -0.978 0.328

Information Criteria

CAIC : 33575.6881 Avg. CAIC: 7.3566

BIC : 33568.6881 Avg. BIC: 7.3551

AIC : 33523.7064 Avg. AIC: 7.3452

******************************************************

13.5. Duration data and the Weibull model

In some cases the dependent variable may be the time that passes between

the occurence of two events. For example, it may be the duration of a strike,

or the time needed to find a job once one is unemployed. Such variables take

on values on the positive real line, and are referred to as duration data.

Page 273: Econometrics-Creel (2005)

13.5. DURATION DATA AND THE WEIBULL MODEL 273

A spell is the period of time between the occurence of initial event and the

concluding event. For example, the initial event could be the loss of a job, and

the final event is the finding of a new job. The spell is the period of unemploy-

ment.

Let F be the time the initial event occurs, and Y1 be the time the conclud-

ing event occurs. For simplicity, assume that time is measured in years. The

random variable�

is the duration of the spell,� $1�j F . Define the density

function of� � �"! �¬ -C� with distribution function

2 ! �¬ - Ë � � � ½ -Y&Several questions may be of interest. For example, one might wish to know

the expected time one has to wait to find a job given that one has already

waited � years. The probability that a spell lasts � years isË � � � ¥ �J �È�cË � � � » �J �È� 2 ! �W�JC&The density of

�conditional on the spell already having lasted � years is�"! �� � � ¥ �J �"! �¬ -�È� 2 ! �ò�J &

The expectanced additional time required for the spell to end given that is has

already lasted � years is the expectation of�

with respect to this density, minus�"&Ù ë7� � � � ¥ �JÄ�8� Æ Y TU � �#! �W�"�y� 2 ! �ò�J J%� É �K�

To estimate this function, one needs to specify the density��! �� - as a para-

metric density, then estimate by maximum likelihood. There are a number of

possibilities including the exponential density, the lognormal, etc. A reason-

ably flexible model that is a generalization of the exponential density is the

Weibull density

Page 274: Econometrics-Creel (2005)

13.5. DURATION DATA AND THE WEIBULL MODEL 274�"! �¬ � @V �º � à � U(Å%$ e �G� e - ¡ � 1 &According to this model, ë¼� � e � ¡ & The log-likelihood is just the product of

the log densities.

To illustrate application of this model, 402 observations on the lifespan of

mongooses in Serengeti National Park (Tanzania) were used to fit a Weibull

model. The ”spell” in this case is the lifetime of an individual mongoose.

The parameter estimates and standard errors are�eR ¦� & Ò"Ò ï � � & � % Ñ and

�� � & - ',)� � & � %(%� and the log-likelihood value is -659.3. Figure 13.5.1 presents fitted

life expectancy (expected additional years of life) as a function of age, with 95%

confidence bands. The plot is accompanied by a nonparametric Kaplan-Meier

estimate of life-expectancy. This nonparametric estimator simply averages all

spell lengths greater than age, and then subtracts age. This is consistent by the

LLN.

In the figure one can see that the model doesn’t fit the data well, in that it

predicts life expectancy quite differently than does the nonparametric model.

For ages 4-6, the nonparametric estimate is outside the confidence interval that

results from the parametric model, which casts doubt upon the parametric

model. Mongooses that are between 2-6 years old seem to have a lower life

expectancy than is predicted by the Weibull model, whereas young mongooses

that survive beyond infancy have a higher life expectancy, up to a bit beyond

2 years. Due to the dramatic change in the death rate as a function of , one

might specify�"! �� - as a mixture of two Weibull densities,�"! �¬ � @" » cJº � à � U(Å $ e 1_�+1v� e 1× - ¡ � 1 h 2��Ô�È� » cJº � à � z U(Å $ z e � � � � e � - ¡ z � 1 h &

Page 275: Econometrics-Creel (2005)

13.5. DURATION DATA AND THE WEIBULL MODEL 275

FIGURE 13.5.1. Life expectancy of mongooses, Weibull model

The parameters � * and e5* ��! �V�Y# are the parameters of the two Weibull densi-

ties, and»

is the parameter that mixes the two.

With the same data, @ can be estimated using the mixed model. The results

are a log-likelihood = -623.17. Note that a standard likelihood ratio test can-

not be used to chose between the two models, since under the null that» �

(single density), the two parameters e � and � � are not identified. It is possi-

ble to take this into account, but this topic is out of the scope of this course.

Nevertheless, the improvement in the likelihood function is considerable. The

parameter estimates are

Page 276: Econometrics-Creel (2005)

13.6. NUMERIC OPTIMIZATION: PITFALLS 276

Parameter Estimate St. Errore 1 0.233 0.016�E1 1.722 0.166e � 1.731 0.101� � 1.522 0.096»0.428 0.035

Note that the mixture parameter is highly significant. This model leads to

the fit in Figure 13.5.2. Note that the parametric and nonparametric fits are

quite close to one another, up to around ' years. The disagreement after this

point is not too important, since less than 5% of mongooses live more than 6

years, which implies that the Kaplan-Meier nonparametric estimate has a high

variance (since it’s an average of a small number of observations).

Mixture models are often an effective way to model complex responses,

though they can suffer from overparameterization. Alternatives will be dis-

cussed later.

13.6. Numeric optimization: pitfalls

In this section we’ll examine two common problems that can be encoun-

tered when doing numeric optimization of nonlinear models, and some solu-

tions.

13.6.1. Poor scaling of the data. When the data is scaled so that the magni-

tudes of the first and second derivatives are of different orders, problems can

easily result. If we uncomment the appropriate line in EstimatePoisson.m, the

data will not be scaled, and the estimation program will have difficulty con-

verging (it seems to take an infinite amount of time). With unscaled data, the

elements of the score vector have very different magnitudes at the initial value

Page 277: Econometrics-Creel (2005)

13.6. NUMERIC OPTIMIZATION: PITFALLS 277

FIGURE 13.5.2. Life expectancy of mongooses, mixed Weibull model

of @ (all zeros). To see this run CheckScore.m. With unscaled data, one element

of the gradient is very large, and the maximum and minimum elements are 5

orders of magnitude apart. This causes convergence problems due to serious

numerical inaccuracy when doing inversions to calculate the BFGS direction

of search. With scaled data, none of the elements of the gradient are very

large, and the maximum difference in orders of magnitude is 3. Convergence

is quick.

13.6.2. Multiple optima. Multiple optima (one global, others local) can

complicate life, since we have limited means of determining if there is a higher

Page 278: Econometrics-Creel (2005)

13.6. NUMERIC OPTIMIZATION: PITFALLS 278

FIGURE 13.6.1. A foggy mountain

maximum the the one we’re at. Think of climbing a mountain in an unknown

range, in a very foggy place (Figure 13.6.1). You can go up until there’s nowhere

else to go up, but since you’re in the fog you don’t know if the true summit

is across the gap that’s at your feet. Do you claim victory and go home, or do

you trudge down the gap and explore the other side?

The best way to avoid stopping at a local maximum is to use many starting

values, for example on a grid, or randomly generated. Or perhaps one might

have priors about possible values for the parameters (e.g., from previous stud-

ies of similar data).

Page 279: Econometrics-Creel (2005)

13.6. NUMERIC OPTIMIZATION: PITFALLS 279

Let’s try to find the true minimizer of minus 1 times the foggy mountain

function (since the algoritms are set up to minimize). From the picture, you

can see it’s close to � � � � , but let’s pretend there is fog, and that we don’t know

that. The program FoggyMountain.m shows that poor start values can lead to

problems. It uses SA, which finds the true global minimum, and it shows that

BFGS using a battery of random start values can also find the global minimum

help. The output of one run is here:

MPITB extensions found

======================================================

BFGSMIN final results

Used numeric gradient

------------------------------------------------------

STRONG CONVERGENCE

Function conv 1 Param conv 1 Gradient conv 1

------------------------------------------------------

Objective function value -0.0130329

Stepsize 0.102833

43 iterations

------------------------------------------------------

param gradient change

15.9999 -0.0000 0.0000

-28.8119 0.0000 0.0000

Page 280: Econometrics-Creel (2005)

13.6. NUMERIC OPTIMIZATION: PITFALLS 280

The result with poor start values

ans =

16.000 -28.812

================================================

SAMIN final results

NORMAL CONVERGENCE

Func. tol. 1.000000e-10 Param. tol. 1.000000e-03

Obj. fn. value -0.100023

parameter search width

0.037419 0.000018

-0.000000 0.000051

================================================

Now try a battery of random start values and

a short BFGS on each, then iterate to convergence

The result using 20 randoms start values

ans =

3.7417e-02 2.7628e-07

The true maximizer is near (0.037,0)

Page 281: Econometrics-Creel (2005)

13.6. NUMERIC OPTIMIZATION: PITFALLS 281

In that run, the single BFGS run with bad start values converged to a point far

from the true minimizer, which simulated annealing and BFGS using a battery

of random start values both found the true maximizaer. battery of random

start values managed to find the global max. The moral of the story is be

cautious and don’t publish your results too quickly.

Page 282: Econometrics-Creel (2005)

EXERCISES 282

Exercises

(1) In octave, type ”help bfgsmin_example”, to find out the location of the

file. Edit the file to examine it and learn how to call bfgsmin. Run it, and

examine the output.

(2) In octave, type ”help samin_example”, to find out the location of the

file. Edit the file to examine it and learn how to call samin. Run it, and

examine the output.

(3) Using logit.m and EstimateLogit.m as templates, write a function to calcu-

late the probit loglikelihood, and a script to estimate a probit model. Run

it using data that actually follows a logit model (you can generate it in the

same way that is done in the logit example).

(4) Study mle_results.m to see what it does. Examine the functions that

mle_results.m calls, and in turn the functions that those functions call.

Write a complete description of how the whole chain works.

(5) Look at the Poisson estimation results for the OBDV measure of health care

use and give an economic interpretation. Estimate Poisson models for the

other 5 measures of health care usage.

Page 283: Econometrics-Creel (2005)

CHAPTER 14

Asymptotic properties of extremum estimators

Readings: Gourieroux and Monfort (1995), Vol. 2, Ch. 24 [&� Amemiya, Ch.

4 section 4.1 [ ; Davidson and MacKinnon, pp. 591-96; Gallant, Ch. 3; Newey

and McFadden (1994), “Large Sample Estimation and Hypothesis Testing,” in

Handbook of Econometrics, Vol. 4, Ch. 36.

14.1. Extremum estimators

In Definition 12.0.1 we defined an extremum estimator�@ as the optimizing

element of an objective function �'f5�A@" over a set N . Let the objective function�\f+���Äf%��@" depend upon a �:�^� random matrix 9¢f n �g1 � � >N>N>u�\f r 4 where

the �'U are � -vectors and � is finite.

EXAMPLE 18. Given the model B *È k� 4* @Þ2 / * � with � observations, define� *0 ��B * � � 4* 4 & The OLS estimator minimizes

�\f+�A9¢fT��@V �  � f� * ��1 ��B * � � 4* @V � �  � � ¤R�j�F@ � �where ¤ and � are defined similarly to 9;&

283

Page 284: Econometrics-Creel (2005)

14.2. CONSISTENCY 284

14.2. Consistency

The following theorem is patterned on a proof in Gallant (1987) (the article,

ref. later), which we’ll see in its original form later in the course. It is interest-

ing to compare the following proof with Amemiya’s Theorem 4.1.1, which is

done in terms of convergence in probability.

THEOREM 19. [Consistency of e.e.] Suppose that�@\f is obtained by maximiz-

ing �'f5�A@V over N &Assume

(1) Compactness: The parameter space N is an open subset of Euclidean

space O ÿ & The closure of N � N is compact.

(2) Uniform Convergence: There is a nonstochastic function �1T¾�4@" that is

continuous in @ on N such that� ���f�SUT Ð(' �V) � � �\f �A@"G�K��T=�4@" � a� � a.s.

(3) Identification: �/T¾�Ô>© has a unique global maximum at @VFtCFN � i.e., ��T¾�A@]FC ¥�/T=�4@"Y�Nê?@�í @]FN��@MC NThen

�@\f Q�P ì PR @gF'&Proof: Select a ÝèC ² and hold it fixed. Then Sg�Nf+�ßÝy��@"$X is a fixed sequence

of functions. Suppose that Ý is such that �Nf5�A@V converges uniformly to �/T=�4@"C&This happens with probability one by assumption (b). The sequence S �@\f%X lies

in the compact set N¾� by assumption (1) and the fact that maximixation is overN . Since every sequence from a compact set has at least one limit point (David-

son, Thm. 2.12), say that�@ is a limit point of S �@\f%X�& There is a subsequence S �@\f ¼ X

( SJ��9;X is simply a sequence of increasing integers) with� ��� 9 SUT �@\f ¼ �@ . By

Page 285: Econometrics-Creel (2005)

14.2. CONSISTENCY 285

uniform convergence and continuity� ���9 SUT �\f ¼ � �@\f ¼ ��T=� �@VC&To see this, first of all, select an element

�@\U from the sequence* �@\f ¼,+ & Then

uniform convergence implies� ���9 SUT �\f ¼ � �@\UW ��T¾� �@\UWC&Continuity of �/TK�?>@ implies that� �)�U SUT ��T¾� �@\Uò ��T¾� �@Vsince the limit as R k of

* �@\U + is�@ . So the above claim is true.

Next, by maximization

�'f ¼ � �@\f ¼ � �\f ¼ �A@ F which holds in the limit, so� ���9 SUT �\f ¼ � �@\f ¼ � � ���9 StT �\f ¼ �A@ F C&However, � ���9 SUT �\f ¼ � �@\f ¼ ��T=� �@VC�as seen above, and � ���9 StT �\f ¼ �A@ F ��T=�A@ F by uniform convergence, so

��T¾� �@V � ��T=�4@ F Y&

Page 286: Econometrics-Creel (2005)

14.2. CONSISTENCY 286

But by assumption (3), there is a unique global maximum of �1T=�4@" at @]F'� so

we must have ��T=� �@" ��T=�A@ F C� and�@ @ F & Finally, all of the above limits hold

almost surely, since so far we have held Ý fixed, but now we need to consider

all Ý C ² . Therefore S �@\fTX has only one limit point, @ F � except on a set û ÷ ²with

ª � û �� &Discussion of the proof:� This proof relies on the identification assumption of a unique global

maximum at @gFN& An equivalent way to state this is

(c) Identification: Any point @ in N with �/T¾�A@V � ��T=�4@ F must have � @��@ F �v a� �which matches the way we will write the assumption in the section on non-

parametric inference.� We assume that�@\f is in fact a global maximum of �Nf¼�A@V0& It is not re-

quired to be unique for � finite, though the identification assumption

requires that the limiting objective function have a unique maximiz-

ing argument. The next section on numeric optimization methods will

show that actually finding the global maximum of �'fy�A@V may be a non-

trivial problem.� See Amemiya’s Example 4.1.4 for a case where discontinuity leads to

breakdown of consistency.� The assumption that @ F is in the interior of N (part of the identifica-

tion assumption) has not been used to prove consistency, so we could

directly assume that @ F is simply an element of a compact set N¾& The

reason that we assume it’s in the interior here is that this is necessary

for subsequent proof of asymptotic normality, and I’d like to maintain

a minimal set of simple assumptions, for clarity. Parameters on the

boundary of the parameter set cause theoretical difficulties that we

Page 287: Econometrics-Creel (2005)

14.2. CONSISTENCY 287

will not deal with in this course. Just note that conventional hypothe-

sis testing methods do not apply in this case.� Note that �'f¼�A@V is not required to be continuous, though �<T¾�4@" is.� The following figures illustrate why uniform convergence is impor-

tant.

With uniform convergence, the maximum of the sampleobjective function eventually must be in the neighborhoodof the maximum of the limiting objective function

Page 288: Econometrics-Creel (2005)

14.2. CONSISTENCY 288

With pointwise convergence, the sample objective functionmay have its maximum far away from that of the limitingobjective function

We need a uniform strong law of large numbers in order to verify assump-

tion (2) of Theorem 19. The following theorem is from Davidson, pg. 337.

THEOREM 20. [Uniform Strong LLN] Let S]�Þf �A@VYX be a sequence of stochastic

real-valued functions on a totally-bounded metric space ��N ���TY& Then

Ð' �V) � � �sf5�A@V � Q�P ì PR �if and only if

(a) �sf �4@" Q6P ì PR � for each @MCFN F � where N F is a dense subset of N and

(b) S]�sf5�A@VYX is strongly stochastically equicontinuous..

� The metric space we are interested in now is simply N ÷ O ÿ � using

the Euclidean norm.� The pointwise almost sure convergence needed for assuption (a) comes

from one of the usual SLLN’s.

Page 289: Econometrics-Creel (2005)

14.3. EXAMPLE: CONSISTENCY OF LEAST SQUARES 289� Stronger assumptions that imply those of the theorem are:

– the parameter space is compact (this has already been assumed)

– the objective function is continuous and bounded with probabil-

ity one on the entire parameter space

– a standard SLLN can be shown to apply to some point in the pa-

rameter space� These are reasonable conditions in many cases, and henceforth when

dealing with specific estimators we’ll simply assume that pointwise

almost sure convergence can be extended to uniform almost sure con-

vergence in this way.� The more general theorem is useful in the case that the limiting ob-

jective function can be continuous in @ even if �Nf5�A@" is discontinuous.

This can happen because discontinuities may be smoothed out as we

take expectations over the data. In the section on simlation-based esti-

mation we will se a case of a discontinuous objective function.

14.3. Example: Consistency of Least Squares

We suppose that data is generated by random sampling of ��B���.p , whereBgU £� FG2M�IF�.7U�2µ/JU . ��.7UP�[/JU� has the common distribution function 3,<n3 î ( . and/ are independent) with support - �÷ë�& Suppose that the variances è �< and è �îare finite. Let @gF � � F\���IF$ 4 ClN � for which N is compact. Let � U �Ô�"�[.7U� 4 � so

we can write BgU Ú� 4U @]F�2K/JU?& The sample objective function for a sample size �

Page 290: Econometrics-Creel (2005)

14.3. EXAMPLE: CONSISTENCY OF LEAST SQUARES 290

is

�'f5�A@V �  � f� U(��1 ��BgU�� � 4U @" � �  � f� * ��1 ð � 4U @ F 28/JU�� � 4U @ ó � �  � f� U(��1 ð � 4U ð¾@ F �i@ óJó � 2Ë#  � f� U(��1 � 4U ðI@ F �c@ ó /JU 2¡�  � f� U(��1 / �U� Considering the last term, by the SLLN,

�  � f� U(��1 / �U Q�P ì PR Y/.DY10 / � J,3 . J�3 0 è �î &� Considering the second term, since Ù �¬/V �� and . and / are indepen-

dent, the SLLN implies that it converges to zero.� Finally, for the first term, for a given @ , we assume that a SLLN applies

so that

�  � f� U(��1 ð � 4U ð¾@ F �i@ óNó � Q6P ì PR Y1. ð � 4 ð¾@ F �i@ óJó � J,3 .(14.3.1)

ð � F � � ó � 2M# ð � F � � ó ð � F ��� ó Y . .&J�3 . 2 ð � F ��� ó � Y . . � J,3 . ð � F � � ó � 2M#Þð � F � � ó ðò� F ��� ó Ù ��.p,2 ð×� F ��� ó � Ù ðò. � óFinally, the objective function is clearly continuous, and the parameter space

is assumed to be compact, so the convergence is also uniform. Thus,

�/T=�4@" ð � F � � ó � 2Ë#sð � F � � ó ð×� F ��� ó Ù ��.��2 ðP� F �j� ó � Ù ð×. � ó 2Kè �îA minimizer of this is clearly �3 �� F'��� �IF'&

EXERCISE 21. Show that in order for the above solution to be unique it is

necessary that Ù ��. � pí �� & Discuss the relationship between this condition and

the problem of colinearity of regressors.

Page 291: Econometrics-Creel (2005)

14.4. ASYMPTOTIC NORMALITY 291

This example shows that Theorem 19 can be used to prove strong consis-

tency of the OLS estimator. There are easier ways to show this, of course - this

is only an example of application of the theorem.

14.4. Asymptotic Normality

A consistent estimator is oftentimes not very useful unless we know how

fast it is likely to be converging to the true value, and the probability that it

is far away from the true value. Establishment of asymptotic normality with

a known scaling factor solves these two problems. The following theorem is

similar to Amemiya’s Theorem 4.1.3 (pg. 111).

THEOREM 22. [Asymptotic normality of e.e.] In addition to the assumptions

of Theorem 19, assume

(a) 2�f+�4@" � � �V �'f5�A@V exists and is continuous in an open, convex neighbor-

hood of @ F &(b) S"2¢f+�4@\f�$X Q6P ì PR 2�T �A@ F Y� a finite negative definite matrix, for any sequenceS<@\fTX that converges almost surely to @ F &(c)

h� � VY�\f+�4@gFCcmR ­q\ � � o T¾�A@]FC_]"� where

o T=�4@gFC � ��� f�StT3é¯ô À h � � VY�\f+�4@gFYThen

h� c �@��c@ F h mR ­ \ � ��2 T �A@ F � 1 o T=�A@ F (2�T¾�4@ F � 1 ]

Proof: By Taylor expansion:� VY�\f5� �@\f" � V$�'f5�A@ F �2 � �V �\f+�4@ [ c �@p�i@ F hwhere @ [ £e �@È2��?�È� e Z@gFN� � » e » �"&� Note that

�@ will be in the neighborhood where� �V �\f+�4@" exists with

probability one as � becomes large, by consistency.

Page 292: Econometrics-Creel (2005)

14.4. ASYMPTOTIC NORMALITY 292� Now the l.h.s. of this equation is zero, at least asymptotically, since�@\f is a maximizer and the f.o.c. must hold exactly since the limiting

objective function is strictly concave in a neighborhood of @VF'&� Also, since @+[ is between�@\f and @ F � and since

�@\f Q�P ì PR @ F , assumption (b)

gives � �V �\f5�A@ [ Q6P ì PR 2 T¾�A@ F So �¯ � VY�\f+�4@ F �2 ° 2 T¾�A@ F �2 ´ 6T�Ô�J ³ c �@p�i@ F h

And �Þ h� � V$�'f5�A@ F �2 ° 2�T �A@ F �2 ´ 6��Ô�J ³ h � c �@s�s@ F h

Now 2�T �A@gF$ is a finite negative definite matrix, so the´ 6T�Ô�J term is asymptoti-

cally irrelevant next to 2øT=�4@gFY , so we can write

� Q h� � V$�'f5�A@ F �232 T¾�A@ F h � c �@p�i@ F hh

� c �@p�i@ F h Q �42�T=�A@ F $� 1 h � � VY�\f5�A@ F Because of assumption (c), and the formula for the variance of a linear combi-

nation of r.v.’s, h� c �@p�i@ F h mR ­ °�� ��2 T �4@ F � 1 o T=�A@ F 2�T �A@ F � 1 ³

� Assumption (b) is not implied by the Slutsky theorem. The Slutsky

theorem says that �0� � fV Q�P ì PR �0� � if � f R � and �0�Ô>@ is continuous at � &However, the function �0�Ô>© can’t depend on � to use this theorem. In

our case 2�f �A@\f� is a function of �& A theorem which applies (Amemiya,

Ch. 4) is

Page 293: Econometrics-Creel (2005)

14.4. ASYMPTOTIC NORMALITY 293

THEOREM 23. If �gf+�4@" converges uniformly almost surely to a nonstochastic

function �+T=�4@" uniformly on an open neighborhood of @ F � then �gf � �@V Q�P ì PR �(T=�A@ F if �+T=�4@gFY is continuous at @gF and

�@ Q�P ì PR @]FN&� To apply this to the second derivatives, sufficient conditions would

be that the second derivatives be strongly stochastically equicontinu-

ous on a neighborhood of @gFN� and that an ordinary LLN applies to the

derivatives when evaluated at @MC�­²�A@VFCY&� Stronger conditions that imply this are as above: continuous and bounded

second derivatives in a neighborhood of @VF'&� Skip this in lecture. A note on the order of these matrices: Supposing

that �\f5�A@" is representable as an average of � terms, which is the case

for all estimators we consider,� �V �\f+�4@" is also an average of � matrices,

the elements of which are not centered (they do not have zero expec-

tation). Supposing a SLLN applies, the almost sure limit of� �V �'f �4@gFCC�2�T¾�4@gFC 65 �Ô�JY� as we saw in Example 51. On the other hand, assump-

tion (c):

h� � V$�'f5�A@ F mR ­q\ � � o T �A@ F I] means thath

� � V$�'f5�A@ F 65 6��87Jwhere we use the result of Example 49. If we were to omit the

h�G�

we’d have � V$�'f5�A@ F ��� z 5 6��Ô�J 5 6 c ��� z h

Page 294: Econometrics-Creel (2005)

14.5. EXAMPLES 294

where we use the fact that 5 6���� � 5 6���� � 5 6���� � � � C& The sequence� V$�'f5�A@ F is centered, so we need to scale by

h� to avoid convergence

to zero.

14.5. Examples

14.5.1. Binary response models. Binary response models arise in a variety

of contexts. We’ve already seen a logit model. Another simple example is a

probit threshold-crossing model. Assume that

B [ � 4 �£�j/B ����B [ ¥ � / ç ­²� � �'�NHere, Ba[ is an unobserved (latent) continuous variable, and B is a binary vari-

able that indicates whether B [ is negative or positive. ThenªsÀ ��B �J ªsÀ �¬/ ½� � ��� � � , where ��� � £Y ã x� T �W# � � 1Ap � }Y~%� �Ô� / �# `J"/

is the standard normal distribution function.

In general, a binary response model will require that the choice probability

be parameterized in some form. For a vector of explanatory variables � , the

response probability will be parameterized in some mannerªÞÀ ��B � � � ��� � �@"If �I� � ��@V 9� � � 4 @VC� we have a logit model. If ��� � ��@V ��� � 4 @"Y� where ���?>@ is the

standard normal distribution function, then we have a probit model.

Page 295: Econometrics-Creel (2005)

14.5. EXAMPLES 295

Regardless of the parameterization, we are dealing with a Bernoulli den-

sity, �/;�: ��B * � �+* ��� �+* ��@" K : �?�¼�i��� � �@"[ 1 � K :so as long as the observations are independent, the maximum likelihood (ML)

estimator,�@T� is the maximizer of

�\f+�4@" �� f� * ��1 ��B * ��� ��� �+* ��@V�2��?�y��B * �)� \��È���I� �+* ��@V_]¨� �� f� * ��1 ����B * � �+* ��@"Y&(14.5.1)

Following the above theoretical results,�@ tends in probability to the @gF that

maximizes the uniform almost sure limit of �Nf+�4@"Y& Noting that ëB *Ä �I� �+* ��@]FYC�and following a SLLN for i.i.d. processes, �Nf5�A@V converges almost surely to the

expectation of a representative term ����B�� � ��@VC& First one can take the expectation

conditional on � to get

ë K � ã SJB ��� ��� � ��@"I2��?�È�jB5 �)� \@�y�£��� � ��@V_]WX �I� � ��@ F ��� ��� � ��@"-2 ° �È�i��� � ��@ F ³ �)� \��È�i��� � ��@"I]�&Next taking expectation over � we get the limiting objective function

(14.5.2) ��T=�A@" gY�; ù ��� � �@ F ��� ��� � ��@VI2 ° �È�i��� � ��@ F ?³ ��� \@�È�i��� � ��@V_] ú 3�� � ZJ � �where 3�� � is the (joint - the integral is understood to be multiple, and < is the

support of � ) density function of the explanatory variables � . This is clearly

continuous in @T� as long as ��� � ��@V is continuous, and if the parameter space is

compact we therefore have uniform almost sure convergence. Note that ��� � ��@"is continous for the logit and probit models, for example. The maximizing

Page 296: Econometrics-Creel (2005)

14.5. EXAMPLES 296

element of �/T=�4@"C�1@ [ � solves the first order conditionsY ; õ ��� � ��@]FC��� � ��@ [ àà @ ��� � ��@ [ G� �y�i�I� � ��@]FY�y�i�I� � ��@ [ àà @ ��� � ��@ [ ø 3�� � ZJ �� ��This is clearly solved by @ [ @gFN& Provided the solution is unique,

�@ is consis-

tent. Question: what’s needed to ensure that the solution is unique?

The asymptotic normality theorem tells us thath� c �@��c@ F h mR ­ °(� ��2�T¾�4@ F �� 1 o T=�4@ F 2 T¾�A@ F $� 1 ³G&

In the case of i.i.d. observationso T=�4@gFY � ��� f�SUTÖé¯ô À h � � VY�\f+�4@gFY is simply

the expectation of a typical element of the outer product of the gradient.

� There’s no need to subtract the mean, since it’s zero, following the

f.o.c. in the consistency proof above and the fact that observations are

i.i.d.� The terms in � also drop out by the same argument:� ���f�SUT é¯ô À h � � VY�\f5�A@ F � �)�f�SUT éÜô À h � � V �� � U ���A@ F � �)�f�SUT éÜô À �h � � V � U �T�4@ F � �)�f�SUT �� é¯ô À � U � VY���A@ F � �)�f�SUT éÜô ÀV� VY���A@ F é¯ô ÀV� VY���4@ F So we get o T=�A@ F ë õ àà @ ����B�� � ��@ F àà @ 4 ����B�� � ��@ F ø &

Page 297: Econometrics-Creel (2005)

14.5. EXAMPLES 297

Likewise, 2 T¾�A@ F ë à �à @ à @ 4 ����B�� � ��@ F C&Expectations are jointly over B and � � or equivalently, first over B conditional

on � � then over � & From above, a typical element of the objective function is

����B�� � ��@ F B ��� ��� � ��@ F �2a�Ô�¼��B ��� ° �È�i��� � ��@ F ³ &Now suppose that we are dealing with a correctly specified logit model:

��� � ��@" �?� 2 }C~%� �Ô� L 4±@V[ � 1 &We can simplify the above results in this case. We have that

àà @ ��� � �@" �Ô��2 }C~%� �Ô� L 4�@V[ � � }Y~%� �Ô� L 4±@" L �Ô��2 }C~%� �Ô� L 4 @V[ � 1 }Y~%� �Ô� L 4 @"��2 }C~%� �Ô� L 4 @V L ��� � ��@"��P�È�i��� � ��@V[ L ð ��� � ��@"Ä����� � ��@" � ó L &So àà @ ����B�� � ��@ F ° BÜ����� � �@ F ?³ L(14.5.3) à �à @ à @ 4 ���A@ F � ° ��� � ��@ F G�i��� � �@ F � ³ L,L 4�&

Page 298: Econometrics-Creel (2005)

14.6. EXAMPLE: LINEARIZATION OF A NONLINEAR MODEL 298

Taking expectations over B then L giveso T=�4@ F Y Ù ; ° B � �K#Y��� � ��@ F ��I� � ��@ F �23��� � ��@ F � ³ L,L 4*3�� � `J �(14.5.4) Y ° ��� � �@ F G�i��� � ��@ F � ³ L0L 4�3�� � `J � &(14.5.5)

where we use the fact that ٠; ��B5 ٠; ��B � ��� L �@gFY . Likewise,

(14.5.6) 2�T¾�4@ F � Y ° ��� � �@ F G�i��� � ��@ F � ³ L0L 4*3�� � `J � &Note that we arrive at the expected result: the information matrix equality

holds (that is, 2�T=�4@gFY � o T �A@]FY[ . With this,h� c �@p�i@ F h mR ­ °�� ��2 T �4@ F $� 1 o T=�A@ F 2�T �A@ F $� 1 ³

simplifies toh� c �@p�i@ F h mR ­ °@� �'�42�T¾�4@ F $� 1 ³

which can also be expressed ash� c �@p�i@ F h mR ­ ° � � o T �A@ F � 1 ³ &

On a final note, the logit and standard normal CDF’s are very similar - the

logit distribution is a bit more fat-tailed. While coefficients will vary slightly

between the two models, functions of interest such as estimated probabilities��� � � �@" will be virtually identical for the two models.

14.6. Example: Linearization of a nonlinear model

Ref. Gourieroux and Monfort, section 8.3.4. White, Intn’l Econ. Rev. 1980 is

an earlier reference.

Page 299: Econometrics-Creel (2005)

14.6. EXAMPLE: LINEARIZATION OF A NONLINEAR MODEL 299

Suppose we have a nonlinear model

B *� ¹ � �+* ��@ F �28/ *where / * ç�!ò!�J�� � ��è � The nonlinear least squares estimator solves�@\f ��g�[����� � �� f� * ��1 ��B * � ¹ � �+* ��@V[ �We’ll study this more later, but for now it is clear that the foc for minimization

will require solving a set of nonlinear equations. A common approach to the

problem seeks to avoid this difficulty by linearizing the model. A first order

Taylor’s series expansion about the point � F with remainder gives

B *� ¹ � � F ��@ F �2�� �+* � � F 4 à ¹ � � F ��@ F à � 23= *where = * encompasses both / * and the Taylor’s series remainder. Note that = *is no longer a classical error - its mean is not zero. We should expect problems.

Define � [ ¹ � � F ��@ F G� � 4 F à ¹ � � F]�@gFYà �� [ à ¹ � � F �@ F à �Given this, one might try to estimate � [ and � [ by applying OLS to

B *� g� 2K� �+* 23= *� Question, will

�� and�� be consistent for � [ and � [ ?

Page 300: Econometrics-Creel (2005)

14.6. EXAMPLE: LINEARIZATION OF A NONLINEAR MODEL 300� The answer is no, as one can see by interpreting�� and

�� as extremum

estimators. Let � � � ��� 4 4 &

�� ��V�[��� � � �\f �b�0 �� f� * ��1 ��B * � � ��� �+* �The objective function converges to its expectation

�'f5�b�0 Í�P Q6P ì PR �/T=���0 ë « ë ; � « ��BÜ� � �j� � �and

�� converges ô+&©�"& to the � F that minimizes �/T¾���0 :� F ¡�V�[�Z� � � ë « ë ; � « ��BÜ� � ��� � �Noting that

ë « ë ; � « ��B¯� � � � 4@� � ë « ë ; � « ð ¹ � � ��@ F �2�/p� � ��� � ó � è � 28ë « ð ¹ � � ��@ F G� � ��� � ó �since cross products involving / drop out. � F and ��F correspond to the hy-

perplane that is closest to the true regression function¹ � � �@"FC according to the

mean squared error criterion. This depends on both the shape of¹ �Ô>© and the

density function of the conditioning variables.

Page 301: Econometrics-Creel (2005)

14.6. EXAMPLE: LINEARIZATION OF A NONLINEAR MODEL 301

x_0

α

β

x

x

x

x

xx x

x

x

x

Tangent line

Fitted line

Inconsistency of the linear approximation, even at the approximation point

h(x,θ)

� It is clear that the tangent line does not minimize MSE, since, for ex-

ample, if¹ � � ��@gFC is concave, all errors between the tangent line and

the true function are negative.� Note that the true underlying parameter @VF is not estimated consis-

tently, either (it may be of a different dimension than the dimension

of the parameter of the approximating model, which is 2 in this exam-

ple).� Second order and higher-order approximations suffer from exactly

the same problem, though to a less severe degree, of course. For

this reason, translog, Generalized Leontiev and other “flexible func-

tional forms” based upon second-order approximations in general suf-

fer from bias and inconsistency. The bias may not be too important for

analysis of conditional means, but it can be very important for analyz-

ing first and second derivatives. In production and consumer analysis,

first and second derivatives (e.g., elasticities of substitution) are often

Page 302: Econometrics-Creel (2005)

14.6. EXAMPLE: LINEARIZATION OF A NONLINEAR MODEL 302

of interest, so in this case, one should be cautious of unthinking appli-

cation of models that impose stong restrictions on second derivatives.� This sort of linearization about a long run equilibrium is a common

practice in dynamic macroeconomic models. It is justified for the pur-

poses of theoretical analysis of a model given the model’s parameters,

but it is not justifiable for the estimation of the parameters of the model

using data. The section on simulation-based methods offers a means

of obtaining consistent estimators of the parameters of dynamic macro

models that are too complex for standard methods of analysis.

Page 303: Econometrics-Creel (2005)

EXERCISES 303

Exercises

(1) Suppose that �+* ç uniform(0,1), and B *G �µ� � �* 2Ë/ * � where / * is iid(0, è,�YC&Suppose we estimate the misspecified model B *� �� 2÷� �+* 2 ­]* by OLS. Find

the numeric values of � F and � F that are the probability limits of�� and

��(2) Verify your results using Octave by generating data that follows the above

model, and calculating the OLS estimator. When the sample size is very

large the estimator should be very close to the analytical results you ob-

tained in question 1.

(3) Use the asymptotic normality theorem to find the asymptotic distribution

of the ML estimator of ��F for the model B l� ��F72¡/T� where /3ç ­²� � �\�Jand is independent of � & This means finding á zá x á x O �\f5��� , 23���IFYC� á ì?> à x Åá x êêê � ando ��� F C&

Page 304: Econometrics-Creel (2005)

CHAPTER 15

Generalized method of moments (GMM)

Readings: Hamilton Ch. 14 [ ; Davidson and MacKinnon, Ch. 17 (see pg.

587 for refs. to applications); Newey and McFadden (1994), “Large Sample

Estimation and Hypothesis Testing,” in Handbook of Econometrics, Vol. 4, Ch.

36.

15.1. Definition

We’ve already seen one example of GMM in the introduction, based upon

the ��� distribution. Consider the following example based upon the t-distribution.

The density function of a t-distributed r.v. ¤�U is�<; Ê���BgUP�@ F à \)�4@gFG2¡�J  #<]� � @ F 1Ap � à �4@ F  #" ° ��2 ð×B �U  @ F ó ³ � � V W � 1 p �Given an iid sample of size �G� one could estimate @"F by maximizing the log-

likelihood function

�@ � �g�[�Z�b� ~� ��� � f5�4@" f� U(��1 �)� �<; Ê ��BgUP�@"� This approach is attractive since ML estimators are asymptotically ef-

ficient. This is because the ML estimator uses all of the available infor-

mation (e.g., the distribution is fully specified up to a parameter). Re-

calling that a distribution is completely characterized by its moments,

the ML estimator is interpretable as a GMM estimator that uses all of304

Page 305: Econometrics-Creel (2005)

15.1. DEFINITION 305

the moments. The method of moments estimator uses only � mo-

ments to estimate a � � dimensional parameter. Since information is

discarded, in general, by the MM estimator, efficiency is lost relative

to the ML estimator.� Continuing with the example, a t-distributed r.v. with density�(; Ê ��BgUP�@gFC

has mean zero and variance é^��BVU� @gF  �b@gFZ�8#" (for @gF ¥ #"C&� Using the notation introduced previously, define a moment condition��1�U-�A@" @  �4@p�K#" � B �U and ��1v�4@" �  � � fU(��1 ��1�U[�A@V @  �4@��8#V¼��  � � fU(��1 B%�U & As before, when evaluated at the true parameter value @ F �both ë V W \ ��1�U-�A@gF$_] �� and ë V W \ ��1\�A@]FY_] ¡� &� Choosing

�@ to set ��1\� �@V � � yields a MM estimator:

(15.1.1)�@ #�y� fP : K z:

This estimator is based on only one moment of the distribution - it uses less

information than the ML estimator, so it is intuitively clear that the MM esti-

mator will be inefficient relative to the ML estimator.

� An alternative MM estimator could be based upon the fourth moment

of the t-distribution. The fourth moment of a t-distributed r.v. is3?! � Ù ��B !U %y�4@]FC ��A@ F �K#",�4@ F � Ñ �provided @gF ¥ Ñ & We can define a second moment condition

� � �4@" %;�b@" ��4@��8#"��b@s� Ñ � �� f� U(��1 B !U

Page 306: Econometrics-Creel (2005)

15.1. DEFINITION 306� A second, different MM estimator chooses�@ to set � � � �@" � � & If you

solve this you’ll see that the estimate is different from that in equation

15.1.1.

This estimator isn’t efficient either, since it uses only one moment. A GMM es-

timator would use the two moment conditions together to estimate the single

parameter. The GMM estimator is overidentified, which leads to an estima-

tor which is efficient relative to the just identified MM estimators (more on

efficiency later).� As before, set ��f+�A@V ����1\�A@VC��� � �A@"- 4 & The � subscript is used to in-

dicate the sample size. Note that �:�A@VFC 5 6���� � 1Ap � C� since it is an

average of centered random variables, whereas �:�A@V @5 6��Ô�JY�,@3í @gFN�where expectations are taken using the true distribution with param-

eter @]FN& This is the fundamental reason that GMM is consistent.� A GMM estimator requires defining a measure of distance, Jp���:�4@"- . A

popular choice (for reasons noted below) is to set Jp���:�A@"- � 4 üifV���and we minimize �'f �A@V �:�A@" 4 üifV�:�4@"C& We assume ü�f converges to a

finite positive definite matrix.� In general, assume we have � moment conditions, so �:�4@" is a � -vector

and ü is a �b� � matrix.

For the purposes of this course, the following definition of the GMM estimator

is sufficiently general:

DEFINITION 24. The GMM estimator of the � -dimensional parameter vec-

tor @ F � �@ � �g�[����� � � �\f5�A@" � �^f5�A@V 4 ü�fg�^fE�4@"C� where �^f+�4@" 1f � fU(��1 �^U-�4@" is a� -vector, � � � � with ë"V��:�A@V þ� � and üif converges almost surely to a finite�b� � symmetric positive definite matrix üÞT .

Page 307: Econometrics-Creel (2005)

15.2. CONSISTENCY 307

What’s the reason for using GMM if MLE is asymptotically efficient?� Robustness: GMM is based upon a limited set of moment conditions.

For consistency, only these moment conditions need to be correctly

specified, whereas MLE in effect requires correct specification of every

conceivable moment condition. GMM is robust with respect to distribu-

tional misspecification. The price for robustness is loss of efficiency with

respect to the MLE estimator. Keep in mind that the true distribution

is not known so if we erroneously specify a distribution and estimate

by MLE, the estimator will be inconsistent in general (not always).

– Feasibility: in some cases the MLE estimator is not available, be-

cause we are not able to deduce the likelihood function. More

on this in the section on simulation-based estimation. The GMM

estimator may still be feasible even though MLE is not possible.

15.2. Consistency

We simply assume that the assumptions of Theorem 19 hold, so the GMM

estimator is strongly consistent. The only assumption that warrants addi-

tional comments is that of identification. In Theorem 19, the third assump-

tion reads: (c) Identification: �/T=�Ô>@ has a unique global maximum at @VF\� i.e.,��T=�A@]FY ¥ ��T=�A@VC��ê?@þí @]FN& Taking the case of a quadratic objective function�\f+�4@" �^f5�A@V 4 üifg�^f+�4@"C� first consider ��f5�A@"Y&� Applying a uniform law of large numbers, we get ��f5�A@V Q6P ì PR � T=�4@"C&� Since ë"V O �^f+�4@ F �� by assumption, � T=�4@ F �� &� Since �/T �A@gFC � T=�A@gFC 4 üFTp� T=�A@]FY a� � in order for asymptotic identi-

fication, we need that � T=�A@"¯í � for @�í @gF'� for at least some element

Page 308: Econometrics-Creel (2005)

15.3. ASYMPTOTIC NORMALITY 308

of the vector. This and the assumption that ü3f Q�P ì PR üFT � a finite positive� �U� definite ���t� matrix guarantee that @VF is asymptotically identified.� Note that asymptotic identification does not rule out the possibility

of lack of identification for a given data set - there may be multiple

minimizing solutions in finite samples.

15.3. Asymptotic normality

We also simply assume that the conditions of Theorem 22 hold, so we will

have asymptotic normality. However, we do need to find the structure of the

asymptotic variance-covariance matrix of the estimator. From Theorem 22, we

haveh� c �@p�i@ F h mR ­ ° � ��2 T �4@ F $� 1 o T=�A@ F 2�T �A@ F $� 1 ³

where 2�T¾�A@]FC is the almost sure limit of á zá V á V O �\f5�A@V ando T=�A@]FY � �)� f�SUT:é¯ô À h � áá V �\f5�A@gFCY&

We need to determine the form of these matrices given the objective function�\f+�4@" �^f5�A@V 4 üifg�^f+�4@"C&Now using the product rule from the introduction,àà @ �\f+�4@" # È àà @ � Of �A@V É üifg�^fy�A@V

Define the � � � matrix � f5�A@V � àà @ �^4f �4@V0�so:

(15.3.1)

àà @ ���A@" # � �A@V[üR� �4@V0&(Note that �'f5�A@V , � f5�A@"Y�%ü�f and ��f5�A@V all depend on the sample size �G� but it

is omitted to unclutter the notation).

Page 309: Econometrics-Creel (2005)

15.3. ASYMPTOTIC NORMALITY 309

To take second derivatives, let� * be the ![� th row of

� �A@VC& Using the prod-

uct rule, à �à @ 4 à @ * �T�4@" àà @ 4 # � * �A@V[üifV� �A@V # � * ü � 4g2Ë#g�^4@ü È àà @ 4 � 4* ÉWhen evaluating the term

#]�:�A@" 4 ü È àà @ 4 � �A@" 4* Éat @gFN� assume that áá V O � �A@V 4* satisfies a LLN, so that it converges almost surely

to a finite limit. In this case, we have

#g�:�4@ F 4 ü È àà @ 4 � �A@ F 4* É Q�P ì PR � �since �:�A@ F ´ 6T�?�JC�gü Q�P ì PR üÞT .

Stacking these results over the � rows of� � we get� �)� à �à @ à @ 4 �\f+�4@ F 2 T=�A@ F # � TsüFT � 4T �$ô+&©�"&(�

where we define� ��� � � T �%ô+&��"&)� and

� ��� ü üFT � a.s. (we assume a LLN

holds).

With regard too T=�4@ F , following equation 15.3.1, and noting that the scores

have mean zero at @gF (since ëG�:�4@gFC �� by assumption), we haveo T=�A@ F � �)�f�SUT éÜô À h � àà @ �\f+�4@ F � �)�f�SUT ë Ñ � � f�üifV�:�4@ F ?�:�A@"P4�üif � 4f � �)�f�SUT ë Ñ � fVüif ù h �0�:�A@ F ú ù h �0�:�4@" 4 ú ü�f � 4f

Page 310: Econometrics-Creel (2005)

15.4. CHOOSING THE WEIGHTING MATRIX 310

Now, given that �:�4@gFC is an average of centered (mean-zero) quantities, it is

reasonable to expect a CLT to apply, after multiplication by

h� . Assuming

this,h�0�:�4@ F mR ­²� � � ² TsC�

where ² T � �)�f�SUT ë ° �0�:�4@ F Ô�:�4@ F 4�³ &Using this, and the last equation, we geto T=�4@ F Ñ � TsüÞT ² TsüFT � 4TUsing these results, the asymptotic normality theorem gives ush

� c �@p�i@ F h mR ­ n � �N� � TsüFT � 4T � 1 � TpüÞT ² TsüFT � 4T � � TsüÞT � 4T � 1 r �the asymptotic distribution of the GMM estimator for arbitrary weighting ma-

trix üif & Note that for «�T to be positive definite,� T must have full row rank,��� � Ts Q .

15.4. Choosing the weighting matrixü is a weighting matrix, which determines the relative importance of viola-

tions of the individual moment conditions. For example, if we are much more

sure of the first moment condition, which is based upon the variance, than of

the second, which is based upon the fourth moment, we could set

ü �� ô �� 7 ��

Page 311: Econometrics-Creel (2005)

15.4. CHOOSING THE WEIGHTING MATRIX 311

with ô much larger than 7N& In this case, errors in the second moment condition

have less weight in the objective function.� Since moments are not independent, in general, we should expect that

there be a correlation between the moment conditions, so it may not

be desirable to set the off-diagonal elements to 0. ü may be a random,

data dependent matrix.� We have already seen that the choice of ü will influence the asymp-

totic distribution of the GMM estimator. Since the GMM estimator is

already inefficient w.r.t. MLE, we might like to choose the ü matrix

to make the GMM estimator efficient within the class of GMM estimators

defined by ��f5�A@V .� To provide a little intuition, consider the linear model B ¦L 4 �Ö2�/%�where /¾ç ­j� � � ² C& That is, he have heteroscedasticity and autocorre-

lation.� Letª

be the Cholesky factorization of ² � 1 � e.g,ª 4 ª �² � 1 &� Then the model

ª B ª ` �b2 ª / satisfies the classical assumptions of

homoscedasticity and nonautocorrelation, since éb� ª /" ª é���/V ª 4 ª ² ª 4 ª � ª 4 ª � 1 ª 4 ª¯ª � 1 � ª 4 � 1 ª 4 ± f & (Note: we use � w&� � 1 � � 1 w � 1 for w � � both nonsingular). This means that the transformed

model is efficient.� The OLS estimator of the modelª B ª ` �i2 ª / minimizes the ob-

jective function ��Bb� ` � 4 ² � 1 ��Bb� ` �Y& Interpreting ��B¯� ` �G / ���as moment conditions (note that they do have zero expectation when

evaluated at � F ), the optimal weighting matrix is seen to be the in-

verse of the covariance matrix of the moment conditions. This result

carries over to GMM estimation. (Note: this presentation of GLS is not

Page 312: Econometrics-Creel (2005)

15.4. CHOOSING THE WEIGHTING MATRIX 312

a GMM estimator, because the number of moment conditions here is

equal to the sample size, �G& Later we’ll see that GLS can be put into the

GMM framework defined above).

THEOREM 25. If�@ is a GMM estimator that minimizes ��f5�A@V 4 üifg�^fE�A@VC� the

asymptotic variance of�@ will be minimized by choosing ü�f so that ü�f Q�P ìRüFT �² � 1T � where ² T � �)� f�SUT�ë \ �0�:�4@gFCÔ�:�4@gFY 4 ]"&

Proof: For üÞT è² � 1T � the asymptotic variance

� � T�üFT � 4T � 1 � TsüÞT ² T¯üFT � 4T � � T�üFT � 4T � 1simplifies to � � T ² � 1T � 4T � 1 & Now, for any choice such that üÞT í ² � 1T � con-

sider the difference of the inverses of the variances when ü ² � 1 versus

when ü is some arbitrary positive definite matrix:

ð � T ² � 1T � 4T ó �¡� � TsüFT � 4T �\ � TsüFT ² TÞüFT � 4T ] � 1 � � TsüFT � 4T � T ² � 1Ap �T n�± � ² 1Ap �T �WüÞT � 4T Å\ � TsüÞT ² TsüFT � 4T ] � 1 � TsüFT ² 1Ap �T r ² � 1Ap �T � 4Tas can be verified by multiplication. The term in brackets is idempotent, which

is also easy to check by multiplication, and is therefore positive semidefinite.

A quadratic form in a positive semidefinite matrix is also positive semidefi-

nite. The difference of the inverses of the variances is positive semidefinite,

which implies that the difference of the variances is negative semidefinite,

which proves the theorem.

The result

(15.4.1)

h� c �@p�i@ F h mR ­ n�� � ð � T ² � 1T � 4T ó � 1 r

Page 313: Econometrics-Creel (2005)

15.5. ESTIMATION OF THE VARIANCE-COVARIANCE MATRIX 313

allows us to treat �@ù¬R­ ¶ @ F � � � T ² � 1T � 4T � 1� ·��where the ¬ means ”approximately distributed as.” To operationalize this we

need estimators of� T and ² T¾&

� The obvious estimator ofÂ� T is simply áá V � 4f c �@ h � which is consistent

by the consistency of�@%� assuming that áá V � 4f is continuous in @%& Sto-

chastic equicontinuity results can give us this result even if áá V � 4f is

not continuous. We now turn to estimation of ² T &15.5. Estimation of the variance-covariance matrix

(See Hamilton Ch. 10, pp. 261-2 and 280-84) [ .In the case that we wish to use the optimal weighting matrix, we need an

estimate of ² T � the limiting variance-covariance matrix of

h�,�^f+�4@gFC . While

one could estimate ² T parametrically, we in general have little information

upon which to base a parametric specification. In general, we expect that:

� �^U will be autocorrelated (à U(ì ë7���^U¬� 4U � ì sí ¡� ). Note that this autoco-

variance will not depend on if the moment conditions are covariance

stationary.� contemporaneously correlated, since the individual moment condi-

tions will not in general be independent of one another ( ë7��� * U�� ß U�^í � ).� and have different variances ( ë¼��� �* U è �* U ).

Page 314: Econometrics-Creel (2005)

15.5. ESTIMATION OF THE VARIANCE-COVARIANCE MATRIX 314

Since we need to estimate so many components if we are to take the paramet-

ric approach, it is unlikely that we would arrive at a correct parametric spec-

ification. For this reason, research has focused on consistent nonparametric

estimators of ² T &Henceforth we assume that ��U is covariance stationary (the covariance be-

tween �^U and �^U � ì does not depend on -C& Define the ¸¯�: ¹ autocovariance of

the moment conditionsÃÅÄ ë7���^U�� 4U � ì C& Note that ë¼����U¨� 4U � ì à 4Ä & Recall that�^U and � are functions of @%� so for now assume that we have some consistent

estimator of @ F � so that��^U �^U-� �@"Y& Now² f ë ° ���:�A@ F ?�:�A@ F 4 ³ ë v � ¶ �  � f� U(��1 �^U · ¶ �  � f� U(��1 � 4U · w ë v �  � ¶ f� U(��1 �^U · ¶ f� U(��1 �^4U · w à F 2 ��� �� � à 1�2 à 4 1 �2 �÷��#� � à � 2 à 4 � 0>N>N>N2 �� ð à f � 1�2 à 4f � 1 ó

A natural, consistent estimator ofÃÆÄ

is¥ ÃÅÄ �  � f�U(� Ä � 1 ��^U ��^4U � Ä &(you might use �÷�i¸ in the denominator instead). So, a natural, but inconsis-

tent, estimator of ² T would be�² ¥ à F 2 ��� �� c ¥ à 1,2 ¥ à 4 1 h 2 �÷��#� c ¥ à � 2 ¥ à 4 � h 2¡>N>N>'2 c Âà f � 1,2 Âà 4f � 1 h ¥ à F 2 f � 1� Ä ��1 �÷�Ѹ� c ¥ ÃÅÄ 2 ¥ à 4Ä h &

Page 315: Econometrics-Creel (2005)

15.5. ESTIMATION OF THE VARIANCE-COVARIANCE MATRIX 315

This estimator is inconsistent in general, since the number of parameters to

estimate is more than the number of observations, and increases more rapidly

than � , so information does not build up as � R k &On the other hand, supposing that

Ã�Ätends to zero sufficiently rapidly as ¸

tends to k � a modified estimator�² ¥ à F 2 � à fNÅ� Ä ��1 c ¥ ÃÅÄ 2 ¥ à 4Ä h �where � ���� 6R k as � R k will be consistent, provided �5���� grows sufficiently

slowly. The term f � Äf can be dropped because � ���� must be´ 6T���,C& This allows

information to accumulate at a rate that satisfies a LLN. A disadvantage of

this estimator is that it may not be positive definite. This could cause one to

calculate a negative � � statistic, for example!� Note: the formula for�² requires an estimate of �:�A@VFCY� which in turn

requires an estimate of @T� which is based upon an estimate of ² � The

solution to this circularity is to set the weighting matrix ü arbitrarily

(for example to an identity matrix), obtain a first consistent but ineffi-

cient estimate of @ F � then use this estimate to form�² � then re-estimate@gFN& The process can be iterated until neither

�² nor�@ change appreciably

between iterations.

15.5.1. Newey-West covariance estimator. The Newey-West estimator (Econo-

metrica, 1987) solves the problem of possible nonpositive definiteness of the

above estimator. Their estimator is�² ¥ à F 2 � à fNÅ� Ä ��1 È �y� ¸�È2¡�ÚÉ c ¥ Ã�Ä 2 ¥ à 4Ä h &

Page 316: Econometrics-Creel (2005)

15.6. ESTIMATION USING CONDITIONAL MOMENTS 316

This estimator is p.d. by construction. The condition for consistency is that� � 1Ap�! � R � & Note that this is a very slow rate of growth for �%& This estimator is

nonparametric - we’ve placed no parametric restrictions on the form of ² & It is

an example of a kernel estimator.

In a more recent paper, Newey and West (Review of Economic Studies, 1994)

use pre-whitening before applying the kernel estimator. The idea is to fit a VAR

model to the moment conditions. It is expected that the residuals of the VAR

model will be more nearly white noise, so that the Newey-West covariance

estimator might perform better with short lag lengths..

The VAR model is ��^U N¯1 ��^U � 1I2�>N>N>N2DN76 ��^U � 672¹½+UThis is estimated, giving the residuals

�½EUP& Then the Newey-West covariance

estimator is applied to these pre-whitened residuals, and the covariance ² is

estimated combining the fitted VAR¥ ��^U ¥N¯1 ��^U � 1I2¡>N>N>N2 ¥N76 ��^U � 6with the kernel estimate of the covariance of the ½�UP& See Newey-West for de-

tails. � I have a program that does this if you’re interested.

15.6. Estimation using conditional moments

If the above VAR model does succeed in removing unmodeled heteroscedas-

ticity and autocorrelation, might this imply that this information is not being

used efficiently in estimation? In other words, since the performance of GMM

depends on which moment conditions are used, if the set of selected moments

Page 317: Econometrics-Creel (2005)

15.6. ESTIMATION USING CONDITIONAL MOMENTS 317

exhibits heteroscedasticity and autocorrelation, can’t we use this information,

a la GLS, to guide us in selecting a better set of moment conditions to improve

efficiency? The answer to this may not be so clear when moments are defined

unconditionally, but it can be analyzed more carefully when the moments used

in estimation are derived from conditional moments.

So far, the moment conditions have been presented as unconditional ex-

pectations. One common way of defining unconditional moment conditions is

based upon conditional moment conditions.

Suppose that a random variable ¤ has zero expectation conditional on the

random variable � ë ; � « ¤ Y ¤ � �W¤ � �:`JE¤ a�Then the unconditional expectation of the product of ¤ and a function �0�¬�: of� is also zero. The unconditional expectation is

ëG¤ö�0���3 �Y ; Æ Y1A ¤ �0���: � �W¤��[�:ZJ�¤ É J"�i&This can be factored into a conditional expectation and an expectation w.r.t.

the marginal density of �oHëĤ �,�¬�: Y ; Æ Y1A ¤ö�0���3 � ��¤ � �:`JE¤ É � ���:ZJ���&

Since �0���3 doesn’t depend on ¤ it can be pulled out of the integral

ëĤ �,�¬�: �YB; Æ Y A ¤ � ��¤ � �:`JE¤ É �0�¬�: � �¬�:`J"��&But the term in parentheses on the rhs is zero by assumption, so

ëG¤ �,�¬�: ��

Page 318: Econometrics-Creel (2005)

15.6. ESTIMATION USING CONDITIONAL MOMENTS 318

as claimed.

This is important econometrically, since models often imply restrictions on

conditional moments. Suppose a model tells us that the function � ��B"U� � UW has

expectation, conditional on the information set ± U?� equal to Q,� � U?��@VC�ë"V � ��BgU×� � UW � ± U Q,� � U?�@"C&� For example, in the context of the classical linear model B"U �� 4U ��2M/JUP�

we can set � ��BVU×� � UW BgU so that Q,� � UÔ��@V ¡� 4U � .

With this, the function ¹ U-�A@V a� ��B]UP� � UWG�KQ,� � U?��@"has conditional expectation equal to zero

ë"V ¹ U-�4@" � ± U �� &This is a scalar moment condition,which wouldn’t be sufficient to identify a� � � ¥ �J dimensional parameter @%& However, the above result allows us to

form various unconditional expectations

�^U-�A@V 9 ��.7U� ¹ U-�A@Vwhere 9 ��.7U� is a ���j� -vector valued function of .¼U and .7U is a set of variables

drawn from the information set ± U?& The 9 ��.7U� are instrumental variables. We

now have � moment conditions, so as long as � ¥ � the necessary condition

for identification holds.

Page 319: Econometrics-Creel (2005)

15.6. ESTIMATION USING CONDITIONAL MOMENTS 319

One can form the �:� � matrix

9Äf ���������9 1\��.µ1Ô 9 � ��.µ1- >N>N>89 � ��.µ1-9 1\��. � 9 � ��. � 9 � ��. � ...

...9 1\��.7fV 9 � ��.7f" >N>N>89 � ��.7f�����������

���������9 419 4�9 4f

� ��������With this we can form the � moment conditions

�^f+�4@" �� 9È4f ���������¹ 1\�A@"¹ � �A@"...¹ fE�A@V

� �������� �� 9È4f ¹ f5�A@V �� f� U(��1 9ÄU ¹ U-�4@" �� f� U(��1 �^U-�A@Vwhere 9 à UßH · Å is the U�¶ row of 9ÄfT& This fits the previous treatment. An interesting

question that arises is how one should choose the instrumental variables 9��.yU�to achieve maximum efficiency.

Page 320: Econometrics-Creel (2005)

15.6. ESTIMATION USING CONDITIONAL MOMENTS 320

Note that with this choice of moment conditions, we have that� f � áá V � 4 �A@V

(a � � � matrix) is � f5�A@V àà @ �� ��9y4f ¹ f5�A@V[ 4 �� Æàà @ ¹ 4f �b@" É 9Äf

which we can define to be � f+�4@" ��df,9Äf &

where

df is a � ��� matrix that has the derivatives of the individual moment

conditions as its columns. Likewise, define the var-cov. of the moment condi-

tions ² f ë ° ���^fE�4@ F ?�^fE�A@ F ?4�³ ë È �� 9 4f ¹ f+�4@ F ¹ f5�A@ F 4 9Äf É 9y4f ëËÆ �� ¹ f5�A@ F ¹ f5�A@ F ?4 É 9Äf� 9 4f ��f� 9Äfwhere we have defined � f é¯ô À ¹ f5�A@]FCC& Note that matrix is growing with the

sample size and is not consistently estimable without additional assumptions.

The asymptotic normality theorem above says that the GMM estimator us-

ing the optimal weighting matrix is distributed ash� c �@��c@ F h mR ­j� � �YérTs

Page 321: Econometrics-Creel (2005)

15.6. ESTIMATION USING CONDITIONAL MOMENTS 321

where

(15.6.1) énT � ���f�SUT ¶ Æ d f,9Äf� É Æ 9 4f ��f+9¢f� É � 1 Æ 9 4fd4f� É · � 1 &

Using an argument similar to that used to prove that ² � 1T is the efficient weight-

ing matrix, we can show that putting9¢f � � 1f d 4fcauses the above var-cov matrix to simplify to

(15.6.2) énT � ���f�StT ÆdfC� � 1f d 4f� É � 1 &

and furthermore, this matrix is smaller that the limiting var-cov for any other

choice of instrumental variables. (To prove this, examine the difference of the

inverses of the var-cov matrices with the optimal intruments and with non-

optimal instruments. As above, you can show that the difference is positive

semi-definite).

� Note that both

dfT� which we should write more properly as

df �4@ F C�

since it depends on @ F � and � must be consistently estimated to apply

this.� Usually, estimation of

df is straightforward - one just uses�d àà @ ¹ 4f c ý@ h �

where ý@ is some initial consistent estimator based on non-optimal in-

struments.� Estimation of ��f may not be possible. It is an �8�:� matrix, so it has

more unique elements than �G� the sample size, so without restrictions

Page 322: Econometrics-Creel (2005)

15.8. A SPECIFICATION TEST 322

on the parameters it can’t be estimated consistently. Basically, you

need to provide a parametric specification of the covariances of the¹ U-�4@" in order to be able to use optimal instruments. A solution is to ap-

proximate this matrix parametrically to define the instruments. Note

that the simplified var-cov matrix in equation 15.6.2 will not apply if

approximately optimal instruments are used - it will be necessary to

use an estimator based upon equation 15.6.1, where the term= O>�D > = >f

must be estimated consistently apart, for example by the Newey-West

procedure.

15.7. Estimation using dynamic moment conditions

Note that dynamic moment conditions simplify the var-cov matrix, but are

often harder to formulate. The will be added in future editions. For now, the

Hansen application below is enough.

15.8. A specification test

The first order conditions for minimization, using the an estimate of the

optimal weighting matrix, areàà @ �T� �@" # È àà @ � Of c �@ h É �² � 1 �^f c �@ h � �or

� � �@" �² � 1 �^fE� �@V � �

Page 323: Econometrics-Creel (2005)

15.8. A SPECIFICATION TEST 323

Consider a Taylor expansion of �:� �@" :(15.8.1) �:� �@" �^f+�4@ F �2 � 4f �A@ F c �@��c@ F h 2 ´ 6T�?�JC&Multiplying by

� � �@V �² � 1 we obtain� � �@" �² � 1 �:� �@" � � �@V �² � 1 �^f+�4@ F �2 � � �@V �² � 1 � �4@ F P4 c �@p�i@ F h 2 ´ 6T�Ô�NThe lhs is zero, and since

�@ tends to @ F and�² tends to ² T , we can write� T ² � 1T �^f+�A@ F Q � � T ² � 1T � 4T c �@p�i@ F h

or h� c �@p�c@ F h Q �

h� ð � T ² � 1T � 4T ó � 1 � T ² � 1T �^f+�4@ F

With this, and taking into account the original expansion (equation ??), we

geth�0�:� �@� Q h

�0�^f5�A@ F G� h � � 4T ð � T ² � 1T � 4T ó � 1 � T ² � 1T �^f5�A@ F Y&This last can be written ash

���:� �@V Q h� c ² 1Ap �T � � 4T ð � T ² � 1T � 4T ó � 1 � T ² � 1Ap �T h ² � 1Ap �T �^f+�A@ F

Orh� ² � 1Ap �T �:� �@V Q h

� cN± � � ² � 1Ap �T � 4T ð � T ² � 1T � 4T ó � 1 � T ² � 1Ap �T h ² � 1Ap �T �^f+�4@ F Now

h� ² � 1Ap �T �^f5�A@ F imR ­²� � � ± �

Page 324: Econometrics-Creel (2005)

15.8. A SPECIFICATION TEST 324

and one can easily verify thatª lcN± � � ² � 1Ap �T � 4T ð � T ² � 1T � 4T ó � 1 � T ² � 1Ap �T his idempotent of rank ��� � � (recall that the rank of an idempotent matrix is

equal to its trace) so

ch� ² � 1Ap �T �:� �@� h 4 c h � ² � 1Ap �T �:� �@� h �0�:� �@�?4 ² � 1T �:� �@"imR � � �b�¯� �

Since�² converges to ² T � we also have

�0�:� �@"?4 �² � 1 �:� �@� mR � � �b�Ü� � or ��>]�\f+� �@" mR � � �b� � � supposing the model is correctly specified. This is a convenient test since we

just multiply the optimized value of the objective function by �G� and compare

with a � � �b�� � critical value. The test is a general test of whether or not the

moments used to estimate are correctly specified.

� This won’t work when the estimator is just identified. The f.o.c. are� VY�\f5�A@V � �² � 1 �:� �@� � � &But with exact identification, both

�and

�² are square and invertible

(at least asymptotically, assuming that asymptotic normality hold), so

�:� �@V � � &

Page 325: Econometrics-Creel (2005)

15.9. OTHER ESTIMATORS INTERPRETED AS GMM ESTIMATORS 325

So the moment conditions are zero regardless of the weighting matrix

used. As such, we might as well use an identity matrix and save trou-

ble. Also �'f � �@" �� , so the test breaks down.� A note: this sort of test often over-rejects in finite samples. If the sam-

ple size is small, it might be better to use bootstrap critical values. That

is, draw artificial samples of size � by sampling from the data with re-

placement. For � bootstrap samples, optimize and calculate the test

statistic ��>T��� �@ ß C�+� �"�Y#%�'&(&(&)�$� & Define the bootstrap critical value ûFEsuch that � >�� �"� percent of the �T� �@ ß exceed the value. Of course, �must be a very large number if �b� � is large, in order to determine

the critical value with precision. This sort of test has been found to

have quite good small sample properties.

15.9. Other estimators interpreted as GMM estimators

15.9.1. OLS with heteroscedasticity of unknown form.

EXAMPLE 26. White’s heteroscedastic consistent varcov estimator for OLS.

Suppose _ a` ��F28/T� where /sçR­j� � �65¼C�(5 a diagonal matrix.� The typical approach is to parameterize 5 5���èIC� where è is a finite

dimensional parameter vector, and to estimate � and è jointly (feasible

GLS). This will work well if the parameterization of 5 is correct.� If we’re not confident about parameterizing 5µ� we can still estimate �consistently by OLS. However, the typical covariance estimator é�� �� � ` 4 ` � 1 �è � will be biased and inconsistent, and will lead to invalid in-

ferences.

Page 326: Econometrics-Creel (2005)

15.9. OTHER ESTIMATORS INTERPRETED AS GMM ESTIMATORS 326

By exogeneity of the regressors � U (a � ��� column vector) we have Ù � � U¬/JUW � �which suggests the moment condition

�^U-���G �� U5��BgU�� L 4U ��&In this case, we have exact identification ( � parameters and � moment con-

ditions). We have

�:��� �  � � U �^U �  � � U L U¬B]U0�Ë�  � � U L U L 4U ��&For any choice of üÖ�\�:���G will be identically zero at the minimum, due to exact

identification. That is, since the number of moment conditions is identical to

the number of parameters, the foc imply that �:� �� � � regardless of ü:& There

is no need to use the “optimal” weighting matrix in this case, an identity matrix

works just as well for the purpose of estimation. Therefore�� ¶ � U L U L 4U · � 1 � U L U¬BgU � ` 4 ` $� 1 ` 4�_Z�which is the usual OLS estimator.

The GMM estimator of the asymptotic varcov matrix is c Â� T �² � 1 Â� T 4 h � 1 &Recall that

Â� T is simply áá V � 4 c �@ h & In this caseÂ� T �¯�  � � U L U L 4U � ` 4 `÷ �G&Recall that a possible estimator of ² is�² ¥ à F 2 f � 1� Ä ��1 c ¥ ÃÅÄ 2 ¥ à 4Ä h &

Page 327: Econometrics-Creel (2005)

15.9. OTHER ESTIMATORS INTERPRETED AS GMM ESTIMATORS 327

This is in general inconsistent, but in the present case of nonautocorrelation, it

simplifies to �²Ë ¥ à Fwhich has a constant number of elements to estimate, so information will ac-

cumulate, and consistency obtains. In the present case�² ¥ à F �  � ¶ f� U(��1 ��^U �� 4U · �  �lv f� U(��1 L U L 4U c B]U�� L 4U �� h � w �  � v f� U(��1 L U L 4U �/ �U w ` 4 �G `�where

�Gis an �:��� diagonal matrix with

�/ �U in the position Y�[ .Therefore, the GMM varcov. estimator, which is consistent, is�é c

h� c ����j� h,h H Æ � ` 4 `� É ¶ ` 4 �G `� � 1 ·åÆ � ` 4 `� ÉJI � 1 Æ ` 4 `� É � 1 ¶ ` 4 �G `� ·åÆ ` 4 `� É � 1

This is the varcov estimator that White (1980) arrived at in an influential article.

This estimator is consistent under heteroscedasticity of an unknown form. If

there is autocorrelation, the Newey-West estimator can be used to estimate ² -

the rest is the same.

Page 328: Econometrics-Creel (2005)

15.9. OTHER ESTIMATORS INTERPRETED AS GMM ESTIMATORS 328

15.9.2. Weighted Least Squares. Consider the previous example of a lin-

ear model with heteroscedasticity of unknown form:

_ ` � F 2�// ç ­j� � �65¼where 5 is a diagonal matrix.

Now, suppose that the form of 5 is known, so that 5��4@ F is a correct para-

metric specification (which may also depend upon ` C& In this case, the GLS

estimator is ý� ð ` 4 5 � 1 ` ó � 1 ` 4 5 � 1 _GThis estimator can be interpreted as the solution to the � moment conditions

�:��ý�� �  � � U L U�B]Uè5U��4@ F � �  � � U L U L 4Uè5U-�A@ F ý� � � &That is, the GLS estimator in this case has an obvious representation as a GMM

estimator. With autocorrelation, the representation exists but it is a little more

complicated. Nevertheless, the idea is the same. There are a few points:

� The (feasible) GLS estimator is known to be asymptotically efficient in

the class of linear asymptotically unbiased estimators (Gauss-Markov).� This means that it is more efficient than the above example of OLS with

White’s heteroscedastic consistent covariance, which is an alternative

GMM estimator.� This means that the choice of the moment conditions is important to

achieve efficiency.

Page 329: Econometrics-Creel (2005)

15.9. OTHER ESTIMATORS INTERPRETED AS GMM ESTIMATORS 329

15.9.3. 2SLS. Consider the linear model

BgU �g4U ��2�/JUP�or _ ���28/using the usual construction, where � is � ��� and /gU is i.i.d. Suppose that

this equation is one of a system of simultaneous equations, so that �NU contains

both endogenous and exogenous variables. Suppose that L U is the vector of all

exogenous and predetermined variables that are uncorrelated with /VU (suppose

that L U isÀ �j�JC&� Define

�� as the vector of predictions of � when regressed upon ` , e.g.,�� ¡` � ` 4 ` � 1 ` 4 � �� �` � ` 4 ` � 1 ` 4K�� Since

�� is a linear combination of the exogenous variables L � ��"U must

be uncorrelated with /T& This suggests the � -dimensional moment con-

dition �^U-��� ��"U ��BgU��L� 4U � and so

�:��� �  � � U ��"U ��BgU��M�"4U ��&� Since we have � parameters and � moment conditions, the GMM

estimator will set � identically equal to zero, regardless of üÖ� so we

have �� ¶ � U ��"UN�"4U · � 1 � U � ��"U¬B]UW dc ��Ä4K� h � 1 ��Ä4©_This is the standard formula for 2SLS. We use the exogenous variables and

the reduced form predictions of the endogenous variables as instruments, and

Page 330: Econometrics-Creel (2005)

15.9. OTHER ESTIMATORS INTERPRETED AS GMM ESTIMATORS 330

apply IV estimation. See Hamilton pp. 420-21 for the varcov formula (which

is the standard formula for 2SLS), and for how to deal with /VU heterogeneous

and dependent (basically, just use the Newey-West or some other consistent

estimator of ² � and apply the usual formula). Note that /gU dependent causes

lagged endogenous variables to loose their status as legitimate instruments.

15.9.4. Nonlinear simultaneous equations. GMM provides a convenient

way to estimate nonlinear systems of simultaneous equations. We have a sys-

tem of equations of the form

B�1�U � 1v�?�"U?�@ F1 �2�/�1�UB � U � � �?�"U?�@ F� �2�/ � U...B ¯ U � ¯ �?�"U×��@ F¯ �28/ ¯ UP�

or in compact notation BgU � ���"UP��@ F �2�/JUP�where

� �Ô>@ is a � -vector valued function, and @ F �A@ F 41 ��@ F 4� �'>N>N>,��@ F 4¯ 4 &We need to find an w;* �j� vector of instruments L,* U?� for each equation, that

are uncorrelated with / * UP& Typical instruments would be low order monomials

in the exogenous variables in ��U?� with their lagged values. Then we can define

the c � ¯* ��1 wy* h �²� orthogonality conditions

�^U-�A@V �����������BT1�U�� � 1\�?�"UP��@g1[- L 1�U��B � U�� � � �?�"UP��@ � - L � U

...��B ¯ U�� � ¯ �?�"UP��@ ¯ - L ¯ U� �������� &

Page 331: Econometrics-Creel (2005)

15.9. OTHER ESTIMATORS INTERPRETED AS GMM ESTIMATORS 331� A note on identification: selection of instruments that ensure identifi-

cation is a non-trivial problem.� A note on efficiency: the selected set of instruments has important ef-

fects on the efficiency of estimation. Unfortunately there is little theory

offering guidance on what is the optimal set. More on this later.

15.9.5. Maximum likelihood. In the introduction we argued that ML will

in general be more efficient than GMM since ML implicitly uses all of the mo-

ments of the distribution while GMM uses a limited number of moments. Ac-

tually, a distribution withª

parameters can be uniquely characterized byª

moment conditions. However, some sets ofª

moment conditions may contain

more information than others, since the moment conditions could be highly

correlated. A GMM estimator that chose an optimal set ofª

moment condi-

tions would be fully efficient. Here we’ll see that the optimal moment condi-

tions are simply the scores of the ML estimator.

Let BgU be a � -vector of variables, and let ¤�U ��B 41 ��B 4� �'&(&)&(��B 4U 4 & Then at time Y�+¤+U � 1 has been observed (refer to it as the information set, since we assume

the conditioning variables have been selected to take advantage of all useful

information). The likelihood function is the joint density of the sample:� �4@" � ��BT1$��B � �\&)&(&)�[BgfT��@"which can be factored as� �A@V � ��Bgf � ¤+f � 1���@"G> � ��¤Ef � 1$��@Vand we can repeat this to get� �A@" � ��B]f � ¤Ef � 1$��@VG> � ��B]f � 1 � ¤+f � � ��@VG>"&(&)&�> � ��BT1?C&

Page 332: Econometrics-Creel (2005)

15.9. OTHER ESTIMATORS INTERPRETED AS GMM ESTIMATORS 332

The log-likelihood function is therefore��� � �4@" f� U(��1 ��� � ��BgU � ¤+U � 1Y��@"Y&Define �^U-��¤EU×��@" � � V ��� � ��BgU � ¤EU � 1$��@"as the score of the U�¶ observation. It can be shown that, under the regularity

conditions, that the scores have conditional mean zero when evaluated at @�F(see notes to Introduction to Econometrics):

ëZSJ�^UÔ��¤EU?��@ F � ¤EU � 1�X ��so one could interpret these as moment conditions to use to define a just-

identified GMM estimator ( if there are � parameters there are � score equa-

tions). The GMM estimator sets

�  � f� U(��1 �^U-�W¤+UP� �@V �  � f� U(��1 � V ��� � ��BgU � ¤+U � 1Y� �@" a� �which are precisely the first order conditions of MLE. Therefore, MLE can be

interpreted as a GMM estimator. The GMM varcov formula is é?T � � T ² � 1 � 4T � 1 .Consistent estimates of variance components are as follows� � T Â� T àà @ 4 �:��¤EU?� �@" �  � f� U(��1 � �V ��� � ��BgU � ¤EU � 1$� �@"� ²

It is important to note that ��U and ��U � ìC�� ¥ � are both condi-

tionally and unconditionally uncorrelated. Conditional uncorrelation

follows from the fact that ��U � ì is a function of ¤EU � ìY� which is in the in-

formation set at time . Unconditional uncorrelation follows from the

Page 333: Econometrics-Creel (2005)

15.9. OTHER ESTIMATORS INTERPRETED AS GMM ESTIMATORS 333

fact that conditional uncorrelation hold regardless of the realization

of ¤+U � 1Y� so marginalizing with respect to ¤EU � 1 preserves uncorrelation

(see the section on ML estimation, above). The fact that the scores are

serially uncorrelated implies that ² can be estimated by the estimator

of the 0 U�¶ autocovariance of the moment conditions:�² �  � f� U(��1 �^U-�W¤+UP� �@VÔ�^UÔ��¤EU?� �@"?4 �  � f� U(��1 n � V �)� � ��B]U � ¤+U � 1Y� �@V r n � V ��� � ��BgU � ¤EU � 1$� �@" r 4Recall from study of ML estimation that the information matrix equality (equa-

tion ??) states that

Ù * ° � V ��� � ��BgU � ¤EU � 1C�@ F ³ ° � V �)� � ��B]U � ¤+U � 1Y��@ F ³ 4 + � Ùaù � �V ��� � ��BgU � ¤+U � 1Y��@ F ú &This result implies the well known (and already seeen) result that we can esti-

mate énT in any of three ways:� The sandwich version:

¥érT � áâââââã âââââä* � fU(��1 � �V ��� � ��BgU � ¤EU � 1Y� �@" + �õ � fU(��1 n � V ��� � ��BgU � ¤+U � 1Y� �@" r n � V �)� � ��B]U � ¤+U � 1Y� �@V r 4 ø � 1 �* � fU(��1 � �V �)� � ��B]U � ¤+U � 1Y� �@V + å âââââæâââââç

� 1

� or the inverse of the negative of the Hessian (since the middle and last

term cancel, except for a minus sign):¥érT v �Þ�  � f� U(��1 � �V ��� � ��BgU � ¤+U � 1Y� �@V w � 1 �� or the inverse of the outer product of the gradient (since the middle

and last cancel except for a minus sign, and the first term converges to

minus the inverse of the middle term, which is still inside the overall

inverse)

Page 334: Econometrics-Creel (2005)

15.10. EXAMPLE: THE HAUSMAN TEST 334¥énT OH �  � f� U(��1 n � V ��� � ��BgU � ¤EU � 1$� �@" r n � V �)� � ��B]U � ¤+U � 1Y� �@V r 4 I � 1 &This simplification is a special result for the MLE estimator - it doesn’t apply

to GMM estimators in general.

Asymptotically, if the model is correctly specified, all of these forms con-

verge to the same limit. In small samples they will differ. In particular, there

is evidence that the outer product of the gradient formula does not perform

very well in small samples (see Davidson and MacKinnon, pg. 477). White’s

Information matrix test (Econometrica, 1982) is based upon comparing the two

ways to estimate the information matrix: outer product of gradient or negative

of the Hessian. If they differ by too much, this is evidence of misspecification

of the model.

15.10. Example: The Hausman Test

This section discusses the Hausman test, which was originally presented

in Hausman, J.A. (1978), Specification tests in econometrics, Econometrica, 46,

1251-71.

Consider the simple linear regression model B"U R� 4U �^2ËA-U?& We assume that

the functional form and the choice of regressors is correct, but that the some of

the regressors may be correlated with the error term, which as you know will

produce inconsistency of��¢& For example, this will be a problem if� if some regressors are endogeneous� some regressors are measured with error� lagged values of the dependent variable are used as regressors and A$U

is autocorrelated.

Page 335: Econometrics-Creel (2005)

15.10. EXAMPLE: THE HAUSMAN TEST 335

FIGURE 15.10.1. OLS and IV estimators when regressors and er-rors are correlated

PP#Q P&RP#Q P�SP#Q P�TP#Q P�UPVQXWP#QYW�RP#QYW(S

RVQ R�U RVQ Z RVQ Z&R RVQ Z�S R[Q Z�T RVQ Z�U R[Q S

\^]`_ba�cYd�e"fhgid&fij Qlknmpo�q fhrsautwv R�x

PP#Q P&RP#Q P�SP#Q P�TP#Q P�UPVQXWP#QYW�RP#QYW(S

RVQ R�U RVQ Z RVQ Z&R RVQ Z�S R[Q Z�T RVQ Z�U R[Q S

\^]`_ba�cYd�e"fhgid&fij Qlknmpo�q fhrsautwv R�x

PP#Q P&RP#Q P�SP#Q P�TP#Q P�UPVQXWP#QYW�RP#QYW(S

W�Q U�U W�Q y W�Q y&R W�Q y�S W�Q y�T W�Q y�U R R[Q P�R R[Q P�S R[Q P�T R[Q P�U

z�{ a�cYd�e"fhgid&fij Qlknmpo�q fhrsautwv R�x

PP#Q P&RP#Q P�SP#Q P�TP#Q P�UPVQXWP#QYW�RP#QYW(S

W�Q U�U W�Q y W�Q y&R W�Q y�S W�Q y�T W�Q y�U R R[Q P�R R[Q P�S R[Q P�T R[Q P�U

z�{ a�cYd�e"fhgid&fij Qlknmpo�q fhrsautwv R�x

To illustrate, the Octave program biased.m performs a Monte Carlo experi-

ment where errors are correlated with regressors, and estimation is by OLS

and IV.

Figure 15.10.1 shows that the OLS estimator is quite biased, while the IV

estimator is on average much closer to the true value. If you play with the pro-

gram, increasing the sample size, you can see evidence that the OLS estimator

is asymptotically biased, while the IV estimator is consistent.

We have seen that inconsistent and the consistent estimators converge to

different probability limits. This is the idea behind the Hausman test - a pair

of consistent estimators converge to the same probability limit, while if one is

consistent and the other is not they converge to different limits. If we accept

that one is consistent (e.g., the IV estimator), but we are doubting if the other

is consistent (e.g., the OLS estimator), we might try to check if the difference

between the estimators is significantly different from zero.

Page 336: Econometrics-Creel (2005)

15.10. EXAMPLE: THE HAUSMAN TEST 336� If we’re doubting about the consistency of OLS (or QML, etc.), why

should we be interested in testing - why not just use the IV estima-

tor? Because the OLS estimator is more efficient when the regressors

are exogenous and the other classical assumptions (including normal-

ity of the errors) hold. When we have a more efficient estimator that

relies on stronger assumptions (such as exogeneity) than the IV es-

timator, we might prefer to use it, unless we have evidence that the

assumptions are false.

So, let’s consider the covariance between the MLE estimator�@ (or any other

fully efficient estimator) and some other CAN estimator, say ý@ . Now, let’s

recall some results from MLE. Equation 4.4.1 is:h� c �@��c@ F h Q6P ì PR �

d T=�A@ F $� 1 h �>�0�A@ F C&Equation 4.6.2 is

d T¾�A@V � o T=�4@"Y&Combining these two equations, we geth

� c �@p�i@ F h Q�P ì PR ± T=�A@ F �� 1 h �n�0�A@ F Y&Also, equation 4.7.1 tells us that the asymptotic covariance between any

CAN estimator and the MLE score vector is

érT���h� c ý@p�i@ hh�n�0�A@" �� �� énT=�'ý@" ±'ÿ±'ÿ o T=�A@V �� &

Page 337: Econometrics-Creel (2005)

15.10. EXAMPLE: THE HAUSMAN TEST 337

Now, consider�� ±'ÿ �Vÿ�"ÿ ± T¾�A@V � 1 �� ��h� c ý@p�i@ hh�>�,�4@" �� Q6P ì PR ���

h� c ý@p�i@ hh� c �@p�i@ h ���� &

The asymptotic covariance of this is

énT ���h� c ý@��c@ hh� c �@��c@ h ���� �� ±'ÿ �"ÿ�"ÿ ± T=�A@" � 1 �� �� énT=�'ý@" ±'ÿ±\ÿ o T=�4@" �� �� ±'ÿ �"ÿ�"ÿ ± T=�A@" � 1 ��

�� énT �Ný@� ± T=�4@" � 1± T¾�4@" � 1 ± T=�4@" � 1 �� �which, for clarity in what follows, we might write as

énT ���h� c ý@p�i@ hh� c �@p�i@ h ���� �� énT �Ný@� ± T=�4@" � 1± T¾�4@" � 1 érT=� �@" �� &

So, the asymptotic covariance between the MLE and any other CAN estima-

tor is equal to the MLE asymptotic variance (the inverse of the information

matrix).

Now, suppose we with to test whether the the two estimators are in fact

both converging to @ F , versus the alternative hypothesis that the ”MLE” esti-

mator is not in fact consistent (the consistency of ý@ is a maintained hypothesis).

Under the null hypothesis that they are, we have

n ±'ÿ � ±\ÿ r ���h� c ý@p�c@ F hh� c �@p�c@ F h � ��

h� c ý@�� �@ h �

will be asymptotically normally distributed as

Page 338: Econometrics-Creel (2005)

15.10. EXAMPLE: THE HAUSMAN TEST 338h� c ý@p� �@ h mR ­ c � �YénT � ý@"G�KérT¾� �@" h &

So, � c ý@p� �@ h 4 c énT=� ý@"G�8érT=� �@V h � 1 c ý@p� �@ h mR � � ���TC�where � is the rank of the difference of the asymptotic variances. A statistic

that has the same asymptotic distribution is

c ý@p� �@ h 4 c �éb� ý@VG� �é^� �@V h � 1 c ý@�� �@ h mR � � ���TC&This is the Hausman test statistic, in its original form. The reason that this

test has power under the alternative hypothesis is that in that case the ”MLE”

estimator will not be consistent, and will converge to @ � , say, where @ � í @ F .Then the mean of the asymptotic distribution of vector

h� c ý@�� �@ h will be @ F �@ � , a non-zero vector, so the test statistic will eventually reject, regardless of

how small a significance level is used.

� Note: if the test is based on a sub-vector of the entire parameter vector

of the MLE, it is possible that the inconsistency of the MLE will not

show up in the portion of the vector that has been used. If this is the

case, the test may not have power to detect the inconsistency. This

may occur, for example, when the consistent but inefficient estimator

is not identified for all the parameters of the model.

Some things to note:

� The rank, � , of the difference of the asymptotic variances is often less

than the dimension of the matrices, and it may be difficult to deter-

mine what the true rank is. If the true rank is lower than what is taken

Page 339: Econometrics-Creel (2005)

15.10. EXAMPLE: THE HAUSMAN TEST 339

to be true, the test will be biased against rejection of the null hypothe-

sis. The contrary holds if we underestimate the rank.� A solution to this problem is to use a rank 1 test, by comparing only

a single coefficient. For example, if a variable is suspected of possibly

being endogenous, that variable’s coefficients may be compared.� This simple formula only holds when the estimator that is being tested

for consistency is fully efficient under the null hypothesis. This means

that it must be a ML estimator or a fully efficient estimator that has

the same asymptotic distribution as the ML estimator. This is quite

restrictive since modern estimators such as GMM and QML are not in

general fully efficient.

Following up on this last point, let’s think of two not necessarily efficient es-

timators,�@g1 and

�@ � , where one is assumed to be consistent, but the other may

not be. We assume for expositional simplicity that both�@g1 and

�@ � belong to the

same parameter space, and that they can be expressed as generalized method

of moments (GMM) estimators. The estimators are defined (suppressing the

dependence upon data) by�@ *§ �V�[��� � �V : ) � � : �A@ * 4 ü * � * �A@ * where � * �A@ * is a � * ��� vector of moment conditions, and ü * is a � * � � * positive

definite weighting matrix, ! �V�Y#%& Consider the omnibus GMM estimator

(15.10.1)c �@g1C� �@ � h ¡�V�[����� �� Ó � n ��1\�A@g1[ 4 � � �4@ � 4 r �� ü²1 | à � Ó � z Å| à � z Ó � Å ü � �� �� ��1\�4@g1[� � �4@ � �� &

Page 340: Econometrics-Creel (2005)

15.10. EXAMPLE: THE HAUSMAN TEST 340

Suppose that the asymptotic covariance of the omnibus moment vector is5 � ���f�StT éÜô À áã äh� �� ��1v�A@g1[� � �A@ � �� å æç(15.10.2)

� ÎÏ 5y1 5y1 �> 5 � ÐÒ &The standard Hausman test is equivalent to a Wald test of the equality of @�1and @ � (or subvectors of the two) applied to the omnibus GMM estimator, but

with the covariance of the moment conditions estimated as�5 ÎÏ ¥5y1 | Ã � Ó � z Å| Ã � z Ó � Å ¥5 � ÐÒ &While this is clearly an inconsistent estimator in general, the omitted 5µ1 � term

cancels out of the test statistic when one of the estimators is asymptotically

efficient, as we have seen above, and thus it need not be estimated.

The general solution when neither of the estimators is efficient is clear: the

entire 5 matrix must be estimated consistently, since the 5µ1 � term will not can-

cel out. Methods for consistently estimating the asymptotic covariance of a

vector of moment conditions are well-known, e.g., the Newey-West estimator

discussed previously. The Hausman test using a proper estimator of the over-

all covariance matrix will now have an asymptotic �� distribution when nei-

ther estimator is efficient. However, the test suffers from a loss of power due to

the fact that the omnibus GMM estimator of equation 15.10.1 is defined using

an inefficient weight matrix. A new test can be defined by using an alternative

Page 341: Econometrics-Creel (2005)

15.11. APPLICATION: NONLINEAR RATIONAL EXPECTATIONS 341

omnibus GMM estimator

(15.10.3) c �@g1C� �@ � h ¡�V�[����� �� Ó � n ��1\�A@g1[ 4 � � �A@ � 4 r c~}5 h � 1 �� ��1v�4@g1[� � �4@ � �� �where }5 is a consistent estimator of the overall covariance matrix 5 of equation

15.10.2. By standard arguments, this is a more efficient estimator than that

defined by equation 15.10.1, so the Wald test using this alternative is more

powerful. See my article in Applied Economics, 2004, for more details, including

simulation results.

15.11. Application: Nonlinear rational expectations

Readings: Hansen and Singleton, 1982 [&� Tauchen, 1986

Though GMM estimation has many applications, application to rational

expectations models is elegant, since theory directly suggests the moment con-

ditions. Hansen and Singleton’s 1982 paper is also a classic worth studying in

itself. Though I strongly recommend reading the paper, I’ll use a simplified

model with similar notation to Hamilton’s.

We assume a representative consumer maximizes expected discounted util-

ity over an infinite horizon. Utility is temporally additive, and the expected

utility hypothesis holds. The future consumption stream is the stochastic se-

quence S ¸ U×X TU(� F & The objective function at time is the discounted expected util-

ity

(15.11.1)T� ìò� F � ì ë��b½� ¸ U � ì- � ± UW�&� The parameter � is between 0 and 1, and reflects discounting.

Page 342: Econometrics-Creel (2005)

15.11. APPLICATION: NONLINEAR RATIONAL EXPECTATIONS 342�i± U is the information set at time Y� and includes the all realizations of

random variables indexed and earlier.� The choice variable is ¸ U - current consumption, which is constained to

be less than or equal to current wealth .¼UP&� Suppose the consumer can invest in a risky asset. A dollar invested in

the asset yields a gross return

�Ô��2 À U � 1[ �5U � 1�2�J"U � 1�5Uwhere �5U is the price and J�U is the dividend in period Y& The price of ¸ Uis normalized to �V&� Current wealth .7U �Ô�p2 À U�?!WU � 1 , where !WU � 1 is investment in period ;� � . So the problem is to allocate current wealth between current

consumption and investment to finance future consumption: .ÈU �¸ U$2!WU .� Future net rates of returnÀ U � ìY�Y� ¥ � are not known in period : the asset

is risky.

A partial set of necessary conditions for utility maximization have the form:

(15.11.2) ½E4�� ¸ UW �Ië S%�Ô��2 À U � 1[Ú½+4�� ¸ U � 1[ � ± UòX &To see that the condition is necessary, suppose that the lhs < rhs. Then by

reducing current consumption marginally would cause equation 15.11.1 to

drop by ½ 4 � ¸ UWC� since there is no discounting of the current period. At the

same time, the marginal reduction in consumption finances investment, which

has gross return �Ô��2 À U � 1[�� which could finance consumption in period �2��V&This increase in consumption would cause the objective function to increase by

Page 343: Econometrics-Creel (2005)

15.11. APPLICATION: NONLINEAR RATIONAL EXPECTATIONS 343��ëbS%�P��2 À U � 1[a½ 4 � ¸ U � 1[ � ± U×X�& Therefore, unless the condition holds, the expected

discounted utility function is not maximized.� To use this we need to choose the functional form of utility. A constant

relative risk aversion form is½� ¸ U� ¸ ¡ U�where �È�s� is the coefficient of relative risk aversion ( � ½ �J . With this form,½+4¬� ¸ U� a¸ ¡ � 1Uso the foc are ¸ ¡ � 1U ��ë ù �Ô��2 À U � 1[ ¸ ¡ � 1U � 1 � ± U úWhile it is true that

ë ð ¸ ¡ � 1U �²� ù �Ô��2 À U � 1[ ¸ ¡ � 1U � 1 ú ó � ± U ¡�so that we could use this to define moment conditions, it is unlikely that ¸ U is

stationary, even though it is in real terms, and our theory requires stationarity.

To solve this, divide though by ¸ ¡ � 1UÙ ¶ 1- � H �Ô��2 À U � 1[ Æ ¸ U � 1¸ U÷É ¡ � 1 I · � ± U a�

(note that ¸ U can be passed though the conditional expectation since ¸ U is chosen

based only upon information available in time -C&Suppose that L U is a vector of variables drawn from the information set ± U?&

We can use the necessary conditions to form the expressions

È �È���^�?��2 À U � 1[ c Ø Ên� Ø Ê h ¡ � 1 É L U � �^U-�A@V

Page 344: Econometrics-Creel (2005)

15.11. APPLICATION: NONLINEAR RATIONAL EXPECTATIONS 344� @ represents � and ��&� Therefore, the above expression may be interpreted as a moment con-

dition which can be used for GMM estimation of the parameters @"FN&Note that at time Y�0��U � ì has been observed, and is therefore an element of

the information set. By rational expectations, the autocovariances of the mo-

ment conditions other thanà F should be zero. The optimal weighting matrix

is therefore the inverse of the variance of the moment conditions:²F�M � �)� Ù ° ���:�A@ F ?�:�A@ F ?4�³which can be consistently estimated by�² �  � f� U(��1 �^U-� �@"?�^U-� �@V?4As before, this estimate depends on an initial consistent estimate of @T� which

can be obtained by setting the weighting matrix ü arbitrarily (to an identity

matrix, for example). After obtaining�@%� we then minimize

�T�4@" �:�4@"?4 �² � 1 �:�A@VC&This process can be iterated, e.g., use the new estimate to re-estimate ² � use

this to estimate @gF'� and repeat until the estimates don’t change.

� This whole approach relies on the very strong assumption that equa-

tion 15.11.2 holds without error. Supposing agents were heteroge-

neous, this wouldn’t be reasonable. If there were an error term here, it

could potentially be autocorrelated, which would no longer allow any

variable in the information set to be used as an instrument..

Page 345: Econometrics-Creel (2005)

15.12. EMPIRICAL EXAMPLE: A PORTFOLIO MODEL 345� In principle, we could use a very large number of moment conditions

in estimation, since any current or lagged variable could be used in L U?&Since use of more moment conditions will lead to a more (asymptot-

ically) efficient estimator, one might be tempted to use many instru-

mental variables. We will do a compter lab that will show that this

may not be a good idea with finite samples. This issue has been stud-

ied using Monte Carlos (Tauchen, JBES, 1986). The reason for poor

performance when using many instruments is that the estimate of ²becomes very imprecise.� Empirical papers that use this approach often have serious problems

in obtaining precise estimates of the parameters. Note that we are bas-

ing everything on a single parial first order condition. Probably this

f.o.c. is simply not informative enough. Simulation-based estimation

methods (discussed below) are one means of trying to use more infor-

mative moment conditions to estimate this sort of model.

15.12. Empirical example: a portfolio model

The Octave program portfolio.m performs GMM estimation of a portfolio

model, using the data file tauchen.data. The columns of this data file are ¸ �'���and J in that order. There are 95 observations (source: Tauchen, « � Ù Û � 1986).

As instruments we use 2 lags of ¸ andÀ & The estimation results are

***********************************************

Example of GMM estimation of rational expectations model

Page 346: Econometrics-Creel (2005)

15.12. EMPIRICAL EXAMPLE: A PORTFOLIO MODEL 346

GMM Estimation Results

BFGS convergence: Normal convergence

Objective function value: 0.071872

Observations: 93

Value df p-value

X^2 test 6.6841 5.0000 0.2452

estimate st. err t-stat p-value

beta 0.8723 0.0220 39.6079 0.0000

gamma 3.1555 0.2854 11.0580 0.0000

***********************************************

� experiment with the program using lags of 1, 3 and 4 periods to define

instruments� Iterate the estimation of @ ���� and ² to convergence.� Comment on the results. Are the results sensitive to the set of instru-

ments used? (Look at�² as well as

�@%& Are these good instruments? Are

the instruments highly correlated with one another?

Page 347: Econometrics-Creel (2005)

EXERCISES 347

Exercises

(1) Show how to cast the generalized IV estimator presented in section 11.4 as

a GMM estimator. Identify what are the moment conditions, ��U-�A@V , what

is the form of the the matrix� f%� what is the efficient weight matrix, and

show that the covariance matrix formula given previously corresponds to

the GMM covariance matrix formula.

(2) Using Octave, generate data from the logit dgp . Recall that ٠��B"U � L U� � � L U?��@" \@�V2 }C~%� �?� L U � @"I] � 1 . Consider the moment condtions (exactly iden-

tified): �^U-�4@" \ BgU��i��� L U?��@V_] L UP&(a) Estimate by GMM, using these moments. Estimate by MLE.

(b) The two estimators should coincide. Prove analytically that the estima-

tors coicide.

(1) Verify the missing steps needed to show that ��>��:� �@" 4 �² � 1 �:� �@� has a� � �b�¯� � distribution. That is, show that the monster matrix is idem-

potent and has trace equal to �� � &

Page 348: Econometrics-Creel (2005)

CHAPTER 16

Quasi-ML

Quasi-ML is the estimator one obtains when a misspecified probability

model is used to calculate an ”ML” estimator.

Given a sample of size � of a random vector _ and a vector of conditioning

variables L � suppose the joint density of � c _G1 &'&'&�_,f h conditional on` c L 1q&'&'& L f h is a member of the parametric family � A ��� � ` ���TY�"� C þ &The true joint density is associated with the vector �TF;H

� A ��� � ` ��� F C&As long as the marginal density of ` doesn’t depend on �TFN� this conditional

density fully characterizes the random characteristics of samples: e.g., it fully

describes the probabilistically important features of the d.g.p. The likelihood

function is just this density evaluated at other values �B ��� � ` ���T � A ��� � ` ���TC����C þ &� Let ��U � 1 c _1q&'&'&R_0U � 1 h , � F � � and let ` U c L 1q&\&'& L U hThe likelihood function, taking into account possible dependence of

348

Page 349: Econometrics-Creel (2005)

16. QUASI-ML 349

observations, can be written asB ��� � ` ���T fG U(��1 � U-��_,U � ��U � 1Y� ` UP�[�T� fG U(��1 � U-���T� The average log-likelihood function is:

�\f+���T �� ��� B ��� � ` ���T �� f� U(��1 ��� �5U-���T� Suppose that we do not have knowledge of the family of densities�5U-���TY& Mistakenly, we may assume that the conditional density of _IUis a member of the family

� UÔ��_,U � ��U � 1Y� ` UP�@"C�a@ÞCèN � where there is no@gF such that� UÔ��_,U � ��U � 1Y� ` UP�@gFY �5U-��_0U � ��U � 1Y� ` U×����FYC�?ê� (this is what we

mean by “misspecified”).� This setup allows for heterogeneous time series data, with dynamic

misspecification.

The QML estimator is the argument that maximizes the misspecified average

log likelihood, which we refer to as the quasi-log likelihood function. This

objective function is

�\f+�4@" �� f� U(��1 ��� � U-�¬_,U � ��U � 1Y� ` UP��@ F � �� f� U(��1 ��� � U-�4@"and the QML is �@\f ��g�[���b� ~� �\f5�A@"

Page 350: Econometrics-Creel (2005)

16. QUASI-ML 350

A SLLN for dependent sequences applies (we assume), so that

�\f5�A@" Q�P ì PR � �)�f�SUT ë �� f� U(��1 �)� � U-�4@" � Ý���A@"We assume that this can be strengthened to uniform convergence, a.s., follow-

ing the previous arguments. The “pseudo-true” value of @ is the value that

maximizes �T�4@" : @ F ��V�-���b� ~� ���A@VGiven assumptions so that theorem 19 is applicable, we obtain� �)�f�SUT �@\f @ F � a.s.

An example of sufficient conditions for consistency are

� N is compact

– �'f5�A@V is continuous and converges pointwise almost surely to Ý���A@"(this means that Ý���4@" will be continuous, and this combined with

compactness of N means ���A@V is uniformly continuous).

– @gF is a unique global maximizer. A stronger version of this as-

sumption that allows for asymptotic normality is that� �V ���4@" ex-

ists and is negative definite in a neighborhood of @ F &� Applying the asymptotic normality theorem,h� c �@p�i@ F h mR ­ °�� ��2 T �4@ F $� 1 o T=�A@ F 2�T �A@ F $� 1 ³

where 2�T¾�4@ F � ���f�SUT ë � �V �\f5�A@ F

Page 351: Econometrics-Creel (2005)

16. QUASI-ML 351

and o T=�A@ F � �)�f�SUT é¯ô À h � � VY�\f5�A@ F Y&� Note that asymptotic normality only requires that the additional as-

sumptions regarding 2 ando

hold in a neighborhood of @ F for 2 and

at @gFN� foro � not throughout N¾& In this sense, asymptotic normality is a

local property.

16.0.1. Consistent Estimation of Variance Components. Consistent esti-

mation of 2�T �A@ F is straightforward. Assumption (b) of Theorem 22 implies

that 2�f5� �@\f" �� f� U(��1 � �V ��� � U-� �@\f" Q6P ì PR � ���f�StT ë �� f� U(��1 � �V ��� � UÔ�A@ F 2�T¾�4@ F Y&That is, just calculate the Hessian using the estimate

�@\f in place of @gF'&Consistent estimation of

o T=�A@ F is more difficult, and may be impossible.

� Notation: Let �gU � � V � U-�A@]FYWe need to estimateo T=�4@ F � ���f�SUT éÜô À h � � V$�'f5�A@ F � ���f�SUT éÜô À h � �� f� U(��1 � V �)� � UÔ�4@ F � ���f�SUT �� éÜô À f� U(��1 �gU � ���f�SUT �� ë H ¶ f� U(��1 �b�]U��jë��]U�_· ¶ f� U(��1 �b�]U��jë#�]U�_· 4 I

Page 352: Econometrics-Creel (2005)

16. QUASI-ML 352

This is going to contain a term� ���f�SUT �� f� U(��1 �¬ë��]U����ë#�]U� 4which will not tend to zero, in general. This term is not consistently estimable

in general, since it requires calculating an expectation using the true density

under the d.g.p., which is unknown.

� There are important cases whereo T=�A@gF$ is consistently estimable. For

example, suppose that the data come from a random sample (i.e., they

are iid). This would be the case with cross sectional data, for example.

(Note: we have that the joint distribution of ��BVUP� � UW is identical. This

does not imply that the conditional density� ��BVU � � UW is identical).� With random sampling, the limiting objective function is simply

Ý���A@ F ë « ë F ��� � ��B � � ��@ F where ë F means expectation of B � � and ë « means expectation respect

to the marginal density of � &� By the requirement that the limiting objective function be maximized

at @ F we have � V[ë « ë F ��� � ��B � � ��@ F � VgÝ���A@ F ��� The dominated convergence theorem allows switching the order of

expectation and differentiation, so� V[ë « ë F ��� � ��B � � ��@ F ë « ë F � V �)� � ��B � � ��@ F a�

Page 353: Econometrics-Creel (2005)

16. QUASI-ML 353

The CLT implies that�h � f� U(��1 � V �)� � ��B � � �@ F mR ­j� � � o T=�4@ F -C&That is, it’s not necessary to subtract the individual means, since they

are zero. Given this, and due to independent observations, a consis-

tent estimator is�o �� f� U(��1 � V �)� � U-� �@" � V O ��� � U?� �@�This is an important case where consistent estimation of the covariance matrix

is possible. Other cases exist, even for dynamically misspecified time series

models.

Page 354: Econometrics-Creel (2005)

CHAPTER 17

Nonlinear least squares (NLS)

Readings: Davidson and MacKinnon, Ch. 2 [ and 5 [ ; Gallant, Ch. 1

17.1. Introduction and definition

Nonlinear least squares (NLS) is a means of estimating the parameter of

the model BgU � � L UP��@ F �28/JU?&� In general, /]U will be heteroscedastic and autocorrelated, and possibly

nonnormally distributed. However, dealing with this is exactly as in

the case of linear models, so we’ll just treat the iid case here,

/JU,ç�!ò!¾J�� � ��è � If we stack the observations vertically, defining

_ ��B�1Y��B � �'&)&(&(��BgfV?4� � � � � 1C��@VC� � � � 1Y��@VC�'&(&)&(� � � � 1Y��@V[?4and / ��/�1Y�[/ � �\&)&(&)�-/]fV 4we can write the � observations as

_ � �4@"�2�/354

Page 355: Econometrics-Creel (2005)

17.1. INTRODUCTION AND DEFINITION 355

Using this notation, the NLS estimator can be defined as�@ � �V�[�Z� � �� �\f5�A@" �� \ _�� � �A@V_] 4 \ _�� � �A@V_] �� � _i� � �4@" � �� The estimator minimizes the weighted sum of squared errors, which

is the same as minimizing the Euclidean distance between _ and� �4@"Y&

The objective function can be written as

�\f+�4@" �� \ _I4�_£�K#]_I4 � �4@"�2 � �A@"P4 � �4@"_]"�which gives the first order conditions

� È àà @ � � �@� 4 É _�2 È àà @ � � �@" 4 É � � �@� � � &Define the �3� � matrix

(17.1.1) �s� �@� � � V O � � �@�Y&In shorthand, use

�� in place of �s� �@�C& Using this, the first order conditions can

be written as � �� 4 _�2 �� 4 � � �@� � � �or

(17.1.2)�� 4 n _i� � � �@" r � � &

This bears a good deal of similarity to the f.o.c. for the linear model - the

derivative of the prediction is orthogonal to the prediction error. If� �A@V �` @T�

then�� is simply ` � so the f.o.c. (with spherical errors) simplify to

` 4�_£� ` 4 ` � a� �

Page 356: Econometrics-Creel (2005)

17.2. IDENTIFICATION 356

the usual 0LS f.o.c.

We can interpret this geometrically: INSERT drawings of geometrical depiction

of OLS and NLS (see Davidson and MacKinnon, pgs. 8,13 and 46).� Note that the nonlinearity of the manifold leads to potential multiple

local maxima, minima and saddlepoints: the objective function �Jf5�A@Vis not necessarily well-behaved and may be difficult to minimize.

17.2. Identification

As before, identification can be considered conditional on the sample, and

asymptotically. The condition for asymptotic identification is that �Jf5�A@V tend

to a limiting function �/T �A@" such that �/T=�A@]FY ½ ��T=�A@VC�'ê?@£í @]FN& This will be the

case if �/T¾�4@gFC is strictly convex at @gFN� which requires that� �V ��T¾�A@]FC be positive

definite. Consider the objective function:

�'f5�A@V �� f� U(��1 \ BgU�� � � L UP��@V_] � �� f� U(��1 ° � � L U?��@ F �28/JU�� � U-� L U?��@"P³ � �� f� U(��1 ° � U-�A@ F G� � U-�A@V ³ � 2 �� f� U(��1 ��/JU� �� #� f� U(��1 ° � U-�A@ F G� � U-�A@V?³�/]U� As in example 14.3, which illustrated the consistency of extremum es-

timators using OLS, we conclude that the second term will converge

to a constant which does not depend upon @T&� A LLN can be applied to the third term to conclude that it converges

pointwise to 0, as long as� �4@" and / are uncorrelated.

Page 357: Econometrics-Creel (2005)

17.2. IDENTIFICATION 357� Next, pointwise convergence needs to be stregnthened to uniform al-

most sure convergence. There are a number of possible assumptions

one could use. Here, we’ll just assume it holds.� Turning to the first term, we’ll assume a pointwise law of large num-

bers applies, so

(17.2.1)�� f� U(��1 ° � U-�4@ F G� � U-�A@V?³ � Q�P ì PR Y ° � �W� ��@ F G� � �W�%��@"P³ � J,3��W�"C�

where 3�� � is the distribution function of � & In many cases,� � � ��@" will

be bounded and continuous, for all @lC N � so strengthening to uni-

form almost sure convergence is immediate. For example if� � � �@" \���2 }Y~%� �Ô� � @"_] � 1 � � H�O ÿ R � � �'�J�� a bounded range, and the function

is continuous in @%&Given these results, it is clear that a minimizer is @"FN& When considering identi-

fication (asymptotic), the question is whether or not there may be some other

minimizer. A local condition for identification is thatà �à @ à @ 4 ��T=�4@" à �à @ à @ 4 Y ° � � � ��@ F G� � � � ��@"P³ � J,3�� � be positive definite at @ F & Evaluating this derivative, we obtain (after a little

work)

à �à @ à @ 4 Y ° � � � ��@ F G� � � � ��@V ³ � J,3�� � êêêê V W # Y ° � V � ��� ��@ F ?4 ³ ° � V O � �W�%��@ F ³ 4 J,3�����the expectation of the outer product of the gradient of the regression function

evaluated at @gF'& (Note: the uniform boundedness we have already assumed

Page 358: Econometrics-Creel (2005)

17.4. ASYMPTOTIC NORMALITY 358

allows passing the derivative through the integral, by the dominated conver-

gence theorem.) This matrix will be positive definite (wp1) as long as the gra-

dient vector is of full rank (wp1). The tangent space to the regression manifold

must span a � -dimensional space if we are to consistently estimate a � -

dimensional parameter vector. This is analogous to the requirement that there

be no perfect colinearity in a linear model. This is a necessary condition for

identification. Note that the LLN implies that the above expectation is equal

to 2�T �A@ F # � ��� ë � 4 ��17.3. Consistency

We simply assume that the conditions of Theorem 19 hold, so the estimator

is consistent. Given that the strong stochastic equicontinuity conditions hold,

as discussed above, and given the above identification conditions an a com-

pact estimation space (the closure of the parameter space NsC� the consistency

proof’s assumptions are satisfied.

17.4. Asymptotic normality

As in the case of GMM, we also simply assume that the conditions for as-

ymptotic normality as in Theorem 22 hold. The only remaining problem is to

determine the form of the asymptotic variance-covariance matrix. Recall that

the result of the asymptotic normality theorem ish� c �@��c@ F h mR ­ ° � ��2�T¾�4@ F � 1 o T=�4@ F 2 T¾�A@ F � 1 ³ �

where 2 T¾�A@]FC is the almost sure limit of á zá V á V O �\f+�4@" evaluated at @gFN� and

� ° � V$�'f5�A@ F ?³ ° � VC�\f �A@ F ?³ 4 Q6P ì PR o T=�4@ F Y�

Page 359: Econometrics-Creel (2005)

17.4. ASYMPTOTIC NORMALITY 359

The objective function is

�\f5�A@" �� f� U(��1 \ BgU�� � � L UP��@V_] �So � VY�\f+�4@" � #� f� U(��1 \ B]U�� � � L UP�@"_] � V � � L U��@"Y&Evaluating at @gF'� � VY�\f+�4@ F � #� f� U(��1 /]U � V � � L U��@ F Y&With this we obtain

� ° � VY�\f+�4@ F P³ ° � VY�\f+�4@ F P³ 4 Ñ� v f� U(��1 /]U � V � � L U×��@ F w v f� U(��1 /JU � V � � L U?��@ F w 4Noting that f� U(��1 /JU � V � � L UP��@ F àà @ ° � �4@ F ?³ 4 / � 4 /we can write the above as

� ° � VY�\f+�4@ F P³ ° � VC�'f �4@ F ?³ 4 Ñ� � 4©/J/]4X�This converges almost surely to its expectation, following a LLNo T=�A@ F Ñ è � � �)� ë � 4 ��We’ve already seen that 2�T=�4@ F # � ��� ë � 4 �� �

Page 360: Econometrics-Creel (2005)

17.5. EXAMPLE: THE POISSON MODEL FOR COUNT DATA 360

where the expectation is with respect to the joint density of � and /T& Combin-

ing these expressions for 2øT=�4@ F ando T=�A@ F C� and the result of the asymptotic

normality theorem, we geth� c �@��c@ F h mR ­ ¶ � � Æ � ��� ë � 4 ��oÉ � 1 è � · &

We can consistently estimate the variance covariance matrix using

(17.4.1) ¶ �� 4 ��� · � 1 �è � �where

�� is defined as in equation 17.1.1 and

�è � n _�� � � �@V r 4 n _�� � � �@V r� �the obvious estimator. Note the close correspondence to the results for the

linear model.

17.5. Example: The Poisson model for count data

Suppose that BVU conditional on L U is independently distributed Poisson. A

Poisson random variable is a count data variable, which means it can take the

values {0,1,2,...}. This sort of model has been used to study visits to doctors per

year, number of patents registered by businesses per year, etc.

The Poisson density is� ��BgU� }C~%� �Ô� e UW e K ÊUBgU�� ��B]U#C²S � �'�"�$#%�'&(&)&�X�&The mean of BgU is e UP� as is the variance. Note that e U must be positive. Suppose

that the true mean is e FU }Y~%� � L 4U � F Y�

Page 361: Econometrics-Creel (2005)

17.6. THE GAUSS-NEWTON ALGORITHM 361

which enforces the positivity of e U?& Suppose we estimate �F by nonlinear least

squares: �� ¡�V�[�Z� � � �\f5��� �¿ f� U(��1 ��BgU�� }Y~ � � L 4U �[ �We can write

�\f+��� �¿ f� U(��1 ð }C~%� � L 4U � F 2�/JU�� }Y~%� � L 4U � ó � �¿ f� U(��1 ð }C~%� � L 4U � F � }C~%� � L 4U � ó � 2 �¿ f� U(��1 / �U 2Ë# �¿ f� U(��1 /JU ð }Y~ � � L 4U � F � }C~%� � L 4U �G óThe last term has expectation zero since the assumption that ë7��B"U � L UW }Y~ � � L 4U �IFYimplies that ë���/JU � L UW R� � which in turn implies that functions of L U are uncor-

related with /]UP& Applying a strong LLN, and noting that the objective function

is continuous on a compact parameter space, we get

��T¾��� ë Í ð }C~%� � L 4@� F � }C~%� � L 4©�G ó � 2�ë Í }C~%� � L 4�� F where the last term comes from the fact that the conditional variance of / is the

same as the variance of B�& This function is clearly minimized at � � F � so the

NLS estimator is consistent as long as identification holds.

EXERCISE 27. Determine the limiting distribution of

h� c ����j�IF h & This

means finding the the specific forms of á zá x á x O �\f+���G , 23���IFYY� á ì�> à x Åá x êêê � ando ���IFCC&

Again, use a CLT as needed, no need to verify that it can be applied.

17.6. The Gauss-Newton algorithm

Readings: Davidson and MacKinnon, Chapter 6, pgs. 201-207 [ .The Gauss-Newton optimization technique is specifically designed for non-

linear least squares. The idea is to linearize the nonlinear model, rather than

Page 362: Econometrics-Creel (2005)

17.6. THE GAUSS-NEWTON ALGORITHM 362

the objective function. The model is

_ � �A@ F �28/T&At some @ in the parameter space, not equal to @ F � we have

_ � �4@"I23=where = is a combination of the fundamental error term / and the error due

to evaluating the regression function at @ rather than the true value @ F & Take a

first order Taylor’s series approximation around a point @ 1 H_ � �4@ 1 �2 ° � V O � ð_@ 1 ó ³ ðI@p�i@ 1 ó 23=Þ2 approximationerror.

This can be written as � �s�4@ 1 Z7¢2cÝ ,

where, as above, �s�A@ 1 � � V O � �4@ 1 is the � � � matrix of derivatives of the

regression function, evaluated at @ 1 � and Ý is = plus approximation error from

the truncated Taylor’s series.

� Note that � is known, given @ 1 &� Similarly, � � _i� � �4@ 1 C� which is also known.� The other new element here is 7 � �A@�D@ 1 C& Note that one could esti-

mate 7 simply by performing OLS on the above equation.� Given�7]� we calculate a new round estimate of @ F as @ � �7�2l@ 1 & With

this, take a new Taylor’s series expansion around @ � and repeat the

process. Stop when�7 a� (to within a specified tolerance).

Page 363: Econometrics-Creel (2005)

17.6. THE GAUSS-NEWTON ALGORITHM 363

To see why this might work, consider the above approximation, but evaluated

at the NLS estimator:

_ � � �@��2��s� �@� c @p� �@ h 2sÝThe OLS estimate of 7 � @p� �@ is�7 dc �� 4 �� h � 1 �� 4 n _i� � � �@" r &This must be zero, since �� 4 n _�� � � �@" r � �by definition of the NLS estimator (these are the normal equations as in equa-

tion 17.1.2, Since�7 � � when we evaluate at

�@ � updating would stop.

� The Gauss-Newton method doesn’t require second derivatives, as does

the Newton-Raphson method, so it’s faster.� The varcov estimator, as in equation 17.4.1 is simple to calculate, since

we have�� as a by-product of the estimation process (i.e., it’s just the

last round “regressor matrix”). In fact, a normal OLS program will

give the NLS varcov estimator directly, since it’s just the OLS varcov

estimator from the last iteration.� The method can suffer from convergence problems since �s�A@V 4 �s�4@"Y�may be very nearly singular, even with an asymptotically identified

model, especially if @ is very far from�@ . Consider the example

B �01I28� � � U�� | 2�/JUWhen evaluated at � � ¬ � �]� | has virtually no effect on the NLS objec-

tive function, so � will have rank that is “essentially” 2, rather than 3.

Page 364: Econometrics-Creel (2005)

17.7. APPLICATION: LIMITED DEPENDENT VARIABLES AND SAMPLE SELECTION 364

In this case, � 4 � will be nearly singular, so ��� 4 �È � 1 will be subject to

large roundoff errors.

17.7. Application: Limited dependent variables and sample selection

Readings: Davidson and MacKinnon, Ch. 15 [ (a quick reading is suffi-

cient), J. Heckman, “Sample Selection Bias as a Specification Error”, Econo-

metrica, 1979 (This is a classic article, not required for reading, and which is a

bit out-dated. Nevertheless it’s a good place to start if you encounter sample

selection problems in your research).

Sample selection is a common problem in applied research. The problem

occurs when observations used in estimation are sampled non-randomly, ac-

cording to some selection scheme.

17.7.1. Example: Labor Supply. Labor supply of a person is a positive

number of hours per unit time supposing the offer wage is higher than the

reservation wage, which is the wage at which the person prefers not to work.

The model (very simple, with subscripts suppressed):� Characteristics of individual: L� Latent labor supply: �/[ ¡L 4 ��2cÝ� Offer wage: .4� � 4 �¾23=� Reservation wage: . � 6� 4 » 2 ­Write the wage differential as

. [ �?�"40�=2�=5Ä�¡� � 4 » 2 ­ � � 4�@y28/

Page 365: Econometrics-Creel (2005)

17.7. APPLICATION: LIMITED DEPENDENT VARIABLES AND SAMPLE SELECTION 365

We have the set of equations

� [ L 4@�^2sÝ. [ � 4*@;2�/T&Assume that �� Ý / �� çR­ ÎÏ �� �� �� � �� è � ��è��è � �� ÐÒ &We assume that the offer wage and the reservation wage, as well as the latent

variable � [ are unobservable. What is observed is

. ��\ . [ ¥ � ]� .p� [ &In other words, we observe whether or not a person is working. If the person

is working, we observe labor supply, which is equal to latent labor supply, �1[\&Otherwise, � �� í � [ & Note that we are using a simplifying assumption that

individuals can freely choose their weekly hours of work.

Suppose we estimated the model

� [ ¡L 4 ��2 residual

using only observations for which � ¥ � & The problem is that these observa-

tions are those for which . [ ¥ � � or equivalently, �È/ ½ � 4 @ and

ë \ Ý � �²/ ½ � 4 @<]Zí �� �

Page 366: Econometrics-Creel (2005)

17.7. APPLICATION: LIMITED DEPENDENT VARIABLES AND SAMPLE SELECTION 366

since / and Ý are dependent. Furthermore, this expectation will in general

depend on L since elements of L can enter in � & Because of these two facts,

least squares estimation is biased and inconsistent.

Consider more carefully ëª\ Ý � �j/ ½ � 4 @/]�& Given the joint normality of Ý and/T� we can write (see for example Spanos Statistical Foundations of Econometric

Modelling, pg. 122) Ý ��è0/È2 ­ �where ­ has mean zero and is independent of / . With this we can write

� [ ¡L 4���2K��è�/È2 ­ &If we condition this equation on �È/ ½ � 4 @ we get

� ¡L 4���2K��è�ë¼�¬/ � �j/ ½ � 4±@V�2 ­ &� A useful result is that for

�ÜçR­²� � �\�JÙ �W� � � ¥ � [ tG���+[Y���Ô�;� [ �

where t¾�P>@ and �Ö�?>@ are the standard normal density and distribution

function, respectively. The quantity on the RHS above is known as the

inverse Mill’s ratio:

± ´ �=�?� [ tG�W� [ ���Ô�;� [ With this we can write

Page 367: Econometrics-Creel (2005)

17.7. APPLICATION: LIMITED DEPENDENT VARIABLES AND SAMPLE SELECTION 367

� L 4©�^28��è t¾� � 4 @V�Ö� � 4 @" 2 ­(17.7.1) � n L 4 À ÃX� O V?ÅD Ãs� O V?Å r �� � � �� 2 ­ &(17.7.2)

where

� ��è�& The error term ­ has conditional mean zero, and is uncorrelated

with the regressors L 4 À Ãs�WO V?ÅD Ãs� O V?Å & At this point, we can estimate the equation by

NLS. � Heckman showed how one can estimate this in a two step procedure

where first @ is estimated, then equation 17.7.2 is estimated by least

squares using the estimated value of @ to form the regressors. This

is inefficient and estimation of the covariance is a tricky issue. It is

probably easier (and more efficient) just to do MLE.� The model presented above depends strongly on joint normality. There

exist many alternative models which weaken the maintained assump-

tions. It is possible to estimate consistently without distributional as-

sumptions. See Ahn and Powell, Journal of Econometrics, 1994.

Page 368: Econometrics-Creel (2005)

CHAPTER 18

Nonparametric inference

18.1. Possible pitfalls of parametric inference: estimation

Readings: H. White (1980) “Using Least Squares to Approximate Unknown

Regression Functions,” International Economic Review, pp. 149-70.

In this section we consider a simple example, which illustrates both why

nonparametric methods may in some cases be preferred to parametric meth-

ods.

We suppose that data is generated by random sampling of ��B�� � , whereB � � � µ2µ/ , � is uniformly distributed on � � �Y# � C� and / is a classical error.

Suppose that � � � ��2 % �# � � c �# � h �The problem of interest is to estimate the elasticity of

� � � with respect to � �throughout the range of � .

In general, the functional form of� � � is unknown. One idea is to take a

Taylor’s series approximation to� � � about some point � F & Flexible functional

forms such as the transcendental logarithmic (usually know as the translog)

can be interpreted as second order Taylor’s series approximations. We’ll work

with a first order approximation, for simplicity. Approximating about � F :¹ � � � � � F �2 � ã � � � F �� � � � F 368

Page 369: Econometrics-Creel (2005)

18.1. POSSIBLE PITFALLS OF PARAMETRIC INFERENCE: ESTIMATION 369

If the approximation point is � F �� � we can write¹ � � ô�2¹7 �The coefficient ô is the value of the function at � � � and the slope is the

value of the derivative at �� a� & These are of course not known. One might try

estimation by ordinary least squares. The objective function is

���Wô+��7C �  � f� U(��1 ��BgU�� ¹ � � UW[ � &The limiting objective function, following the argument we used to get equa-

tions 14.3.1 and 17.2.1 is

��T=�Wô+��7C Y ���F � � � � G� ¹ � � - � J � &The theorem regarding the consistency of extremum estimators (Theorem 19)

tells us that�ô and

�7 will converge almost surely to the values that minimize

the limiting objective function. Solving the first order conditions1 reveals that��T=�Wô+��7C obtains its minimum at ù ô�F ��� ��7$F 1� ú & The estimated approximat-

ing function�¹ � � therefore tends almost surely to

¹ T=� � )  'È2 �� �We may plot the true function and the limit of the approximation to see the

asymptotic bias as a function of � :

[Figure: the approximating model is the straight line; the true model has curvature.]

Note that the approximating model is in general inconsistent, even at the approximation point. This shows that "flexible functional forms" based upon Taylor's series approximations do not in general allow consistent estimation. The mathematical properties of the Taylor's series do not carry over when coefficients are estimated.

The approximating model seems to fit the true model fairly well, asymptotically. However, we are interested in the elasticity of the function. Recall that an elasticity is the marginal function divided by the average function:

$$\varepsilon(x) = x f'(x)/f(x).$$

Good approximation of the elasticity over the range of $x$ will require a good approximation of both $f(x)$ and $f'(x)$ over the range of $x$. The approximating elasticity is

$$\eta(x) = x h'(x)/h(x).$$

Plotting the true elasticity and the elasticity obtained from the limiting approximating model:

[Figure: true and limiting approximate elasticities; the true elasticity is the curve with negative slope for large $x$.]

Visually we see that the elasticity is not approximated so well. Root mean squared error in the approximation of the elasticity is

$$\left(\int_0^{2\pi}\left(\varepsilon(x) - \eta(x)\right)^2 dx\right)^{1/2} = 0.31546.$$

Now suppose we use the leading terms of a trigonometric series as the approximating model. The reason for using a trigonometric series as an approximating model is motivated by the asymptotic properties of the Fourier flexible functional form (Gallant, 1981, 1982), which we will study in more detail below. Normally with this type of model the number of basis functions is an increasing function of the sample size. Here we hold the set of basis functions fixed. We will consider the asymptotic behavior of a fixed model, which we interpret as an approximation to the estimator's behavior in finite samples. Consider the set of basis functions:

$$Z(x) = \left[\begin{array}{cccccc} 1 & x & \cos(x) & \sin(x) & \cos(2x) & \sin(2x) \end{array}\right].$$

The approximating model is

$$g_K(x) = Z(x)\alpha.$$

Maintaining these basis functions as the sample size increases, we find that the limiting objective function is minimized at

$$\left\{a_1 = \frac{7}{6},\ a_2 = \frac{1}{\pi},\ a_3 = -\frac{1}{\pi^2},\ a_4 = 0,\ a_5 = -\frac{1}{4\pi^2},\ a_6 = 0\right\}.$$

Substituting these values into $g_K(x)$ we obtain the almost sure limit of the approximation

(18.1.1)  $$g_\infty(x) = \frac{7}{6} + \frac{x}{\pi} - \frac{\cos x}{\pi^2} - \frac{\cos 2x}{4\pi^2}.$$

Plotting the approximation and the true function:

[Figure: the truncated trigonometric approximation and the true function.]

Clearly the truncated trigonometric series model offers a better approximation, asymptotically, than does the linear model. Plotting elasticities:

[Figure: true elasticity and the elasticity of the limiting trigonometric approximation.]

On average, the fit is better, though there is some implausible waviness in the estimate. Root mean squared error in the approximation of the elasticity is

$$\left(\int_0^{2\pi}\left(\varepsilon(x) - \frac{x\,g_\infty'(x)}{g_\infty(x)}\right)^2 dx\right)^{1/2} = 0.16213,$$

about half that of the RMSE when the first order approximation is used. If the trigonometric series contained infinite terms, this error measure would be driven to zero, as we shall see.
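The limiting coefficients and the elasticity RMSEs above are easy to check numerically. The following is a minimal sketch in Python (the programs accompanying these notes use Ox); it approximates the almost sure limits of the OLS estimators by projecting $f(x)$ on the basis functions over a dense grid on $(0,2\pi)$, which mimics the uniform distribution of $x$.

import numpy as np

# true function, its derivative, and its elasticity
f = lambda x: 1 + 3*x/(2*np.pi) - (x/(2*np.pi))**2
fp = lambda x: 3/(2*np.pi) - x/(2*np.pi**2)
eps = lambda x: x*fp(x)/f(x)

x = np.linspace(1e-6, 2*np.pi, 10**5)   # dense grid approximates U(0, 2*pi)

def limit_fit(Z):
    # least squares projection of f on the columns of Z:
    # this approximates the a.s. limit of the OLS estimator
    return np.linalg.lstsq(Z, f(x), rcond=None)[0]

def rmse_elasticity(g, gp):
    # square root of the integral over (0, 2*pi) of the squared
    # elasticity approximation error
    h = x[1] - x[0]
    return np.sqrt(np.sum((eps(x) - x*gp/g)**2)*h)

# first order (linear) approximation: h(x) = a + b*x
Z1 = np.column_stack([np.ones_like(x), x])
theta1 = limit_fit(Z1)                        # should be close to (7/6, 1/pi)
g1, g1p = Z1 @ theta1, theta1[1]*np.ones_like(x)

# truncated trigonometric approximation
Z2 = np.column_stack([np.ones_like(x), x, np.cos(x), np.sin(x),
                      np.cos(2*x), np.sin(2*x)])
theta2 = limit_fit(Z2)
g2 = Z2 @ theta2
Z2p = np.column_stack([np.zeros_like(x), np.ones_like(x), -np.sin(x),
                       np.cos(x), -2*np.sin(2*x), 2*np.cos(2*x)])
g2p = Z2p @ theta2

print(theta1)                      # approx. [7/6, 1/pi]
print(rmse_elasticity(g1, g1p))    # should be close to 0.31546
print(rmse_elasticity(g2, g2p))    # should be close to 0.16213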

18.2. Possible pitfalls of parametric inference: hypothesis testing

What do we mean by the term "nonparametric inference"? Simply, this means inferences that are possible without restricting the functions of interest to belong to a parametric family.

- Consider means of testing for the hypothesis that consumers maximize utility. A consequence of utility maximization is that the Slutsky matrix $D^2_p h(p,U)$, where $h(p,U)$ are the compensated demand functions, must be negative semi-definite. One approach to testing for utility maximization would estimate a set of demand functions $x(p,m)$.
- Estimation of these functions by standard parametric methods requires specification of the functional form of demand, for example

$$x(p,m) = x(p,m,\theta_0) + \varepsilon,\qquad \theta_0 \in \Theta_0,$$

where $x(p,m,\theta_0)$ is a function of known form and $\Theta_0$ is a finite dimensional parameter.
- After estimation, we could use $\hat{x} = x(p,m,\hat{\theta})$ to calculate (by solving the integrability problem, which is non-trivial) $\hat{D}^2_p h(p,U)$. If we can statistically reject that the matrix is negative semi-definite, we might conclude that consumers don't maximize utility.
- The problem with this is that the reason for rejection of the theoretical proposition may be that our choice of functional form is incorrect. In the introductory section we saw that functional form misspecification leads to inconsistent estimation of the function and its derivatives.
- Testing using parametric models always means we are testing a compound hypothesis. The hypothesis that is tested is 1) the economic proposition we wish to test, and 2) the model is correctly specified. Failure of either 1) or 2) can lead to rejection. This is known as the "model-induced augmenting hypothesis."
- Varian's WARP allows one to test for utility maximization without specifying the form of the demand functions. The only assumptions used in the test are those directly implied by theory, so rejection of the hypothesis calls into question the theory.
- Nonparametric inference allows direct testing of economic propositions, without the "model-induced augmenting hypothesis".

18.3. The Fourier functional form

Readings: Gallant, 1987, "Identification and consistency in semi-nonparametric regression," in Advances in Econometrics, Fifth World Congress, V. 1, Truman Bewley, ed., Cambridge.

- Suppose we have a multivariate model

$$y = f(x) + \varepsilon,$$

where $f(x)$ is of unknown form and $x$ is a $P$-dimensional vector. For simplicity, assume that $\varepsilon$ is a classical error. Let us take the estimation of the vector of elasticities with typical element

$$\xi_{x_i} = \frac{x_i}{f(x)}\frac{\partial f(x)}{\partial x_i},$$

at an arbitrary point $x$.

The Fourier form, following Gallant (1982), but with a somewhat different parameterization, may be written as

(18.3.1)  $$g_K(x\,|\,\theta_K) = \alpha + x'\beta + \frac{1}{2}x'Cx + \sum_{\alpha=1}^{A}\sum_{j=1}^{J}\left(u_{j\alpha}\cos\left(jk_\alpha' x\right) - v_{j\alpha}\sin\left(jk_\alpha' x\right)\right),$$

where the $K$-dimensional parameter vector is

(18.3.2)  $$\theta_K = \left(\alpha,\ \beta',\ \mathrm{vech}(C)',\ u_{11},\ v_{11},\ \ldots,\ u_{JA},\ v_{JA}\right)'.$$

- We assume that the conditioning variables $x$ have each been transformed to lie in an interval that is shorter than $2\pi$. This is required to avoid periodic behavior of the approximation, which is desirable since economic functions aren't periodic. For example, subtract sample means, divide by the maxima of the conditioning variables, and multiply by $2\pi - \mathrm{eps}$, where eps is some positive number less than $2\pi$ in value.
- The $k_\alpha$ are "elementary multi-indices" which are simply $P$-vectors formed of integers (negative, positive and zero). The $k_\alpha$, $\alpha = 1, 2, \ldots, A$ are required to be linearly independent, and we follow the convention that the first non-zero element be positive. For example

$$\left[\begin{array}{ccccc} 0 & 1 & -1 & 0 & 1 \end{array}\right]'$$

is a potential multi-index to be used, but

$$\left[\begin{array}{ccccc} 0 & -1 & -1 & 0 & 1 \end{array}\right]'$$

is not, since its first nonzero element is negative. Nor is

$$\left[\begin{array}{ccccc} 0 & 2 & -2 & 0 & 2 \end{array}\right]'$$

a multi-index we would use, since it is a scalar multiple of the original multi-index.
- We parameterize the matrix $C$ differently than does Gallant because it simplifies things in practice. The cost of this is that we are no longer able to test a quadratic specification using nested testing.

The vector of first partial derivatives is

(18.3.3)  $$D_x g_K(x\,|\,\theta_K) = \beta + Cx + \sum_{\alpha=1}^{A}\sum_{j=1}^{J}\left[\left(-u_{j\alpha}\sin\left(jk_\alpha' x\right) - v_{j\alpha}\cos\left(jk_\alpha' x\right)\right)jk_\alpha\right]$$

and the matrix of second partial derivatives is

(18.3.4)  $$D_x^2 g_K(x\,|\,\theta_K) = C + \sum_{\alpha=1}^{A}\sum_{j=1}^{J}\left[\left(-u_{j\alpha}\cos\left(jk_\alpha' x\right) + v_{j\alpha}\sin\left(jk_\alpha' x\right)\right)j^2 k_\alpha k_\alpha'\right].$$

To define a compact notation for partial derivatives, let $\lambda$ be an $N$-dimensional multi-index with no negative elements. Define $|\lambda|^*$ as the sum of the elements of $\lambda$. If we have $N$ arguments $x$ of the (arbitrary) function $h(x)$, use $D^\lambda h(x)$ to indicate a certain partial derivative:

$$D^\lambda h(x) \equiv \frac{\partial^{|\lambda|^*}}{\partial x_1^{\lambda_1}\partial x_2^{\lambda_2}\cdots\partial x_N^{\lambda_N}}\,h(x).$$

When $\lambda$ is the zero vector, $D^\lambda h(x) \equiv h(x)$. Taking this definition and the last few equations into account, we see that it is possible to define a $(1\times K)$ vector $z^\lambda(x)$ so that

(18.3.5)  $$D^\lambda g_K(x\,|\,\theta_K) = z^\lambda(x)'\theta_K.$$

- Both the approximating model and the derivatives of the approximating model are linear in the parameters.
- For the approximating model to the function (not derivatives), write $g_K(x\,|\,\theta_K) = z'\theta_K$ for simplicity; a sketch of how the regressor vector $z(x)$ can be constructed follows below.
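The following minimal Python sketch builds the regressor vector $z(x)$ of equation 18.3.1 for a given set of multi-indices and a given $J$. The function name and the use of numpy are my own illustration; the course's own programs (e.g., FFF.ox, introduced in the examples section) do the equivalent in Ox.

import numpy as np

def fourier_z(x, multi_indices, J):
    """Regressor vector z(x) such that g_K(x|theta_K) = z(x)' theta_K.

    Parameter order matches equation 18.3.2:
    (alpha, beta', vech(C)', u_11, v_11, ..., u_JA, v_JA)'.
    x is assumed to have been rescaled to an interval shorter than 2*pi.
    """
    x = np.asarray(x, dtype=float)
    P = x.size
    z = [1.0]                        # intercept alpha
    z.extend(x)                      # beta
    # vech of the quadratic part: (1/2) x'Cx, with C symmetric, so the
    # off-diagonal products are counted twice
    for i in range(P):
        for j in range(i, P):
            w = x[i]*x[j] if i == j else 2.0*x[i]*x[j]
            z.append(0.5*w)
    for k in multi_indices:          # each k is a P-vector of integers
        s = np.dot(k, x)
        for j in range(1, J + 1):
            z.append(np.cos(j*s))    # multiplies u_{j,alpha}
            z.append(-np.sin(j*s))   # multiplies v_{j,alpha} (note the sign)
    return np.array(z)

# example with P = 2, A = 2, J = 1
K = [np.array([1, 0]), np.array([0, 1])]
print(fourier_z(np.array([0.5, 1.0]), K, J=1))

Since $z(x)$, and by equations 18.3.3-18.3.5 also $z^\lambda(x)$, depend on $x$ alone, both the fitted function and its derivatives are linear functions of $\theta_K$.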

The following theorem can be used to prove the consistency of the Fourier form.

THEOREM 28. [Gallant and Nychka, 1987] Suppose that $\hat{h}_n$ is obtained by maximizing a sample objective function $s_n(h)$ over $\mathcal{H}_{K_n}$, where $\mathcal{H}_K$ is a subset of some function space $\mathcal{H}$ on which is defined a norm $\|h\|$. Consider the following conditions:

(a) Compactness: The closure of $\mathcal{H}$ with respect to $\|h\|$ is compact in the relative topology defined by $\|h\|$.

(b) Denseness: $\cup_K\mathcal{H}_K$, $K = 1, 2, 3, \ldots$ is a dense subset of the closure of $\mathcal{H}$ with respect to $\|h\|$, and $\mathcal{H}_K \subset \mathcal{H}_{K+1}$.

(c) Uniform convergence: There is a point $h^*$ in $\mathcal{H}$, and there is a function $s_\infty(h, h^*)$ that is continuous in $h$ with respect to $\|h\|$, such that

$$\lim_{n\to\infty}\sup_{\mathcal{H}}\left|s_n(h) - s_\infty(h, h^*)\right| = 0$$

almost surely.

(d) Identification: Any point $h$ in the closure of $\mathcal{H}$ with $s_\infty(h, h^*) \geq s_\infty(h^*, h^*)$ must have $\|h - h^*\| = 0$.

Under these conditions $\lim_{n\to\infty}\|h^* - \hat{h}_n\| = 0$ almost surely, provided that $\lim_{n\to\infty}K_n = \infty$ almost surely.

The modification of the original statement of the theorem that has been made is to set the parameter space $\Theta$ in Gallant and Nychka's (1987) Theorem 0 to a single point, and to state the theorem in terms of maximization rather than minimization.

This theorem is very similar in form to Theorem 19. The main differences are:

(1) A generic norm $\|h\|$ is used in place of the Euclidean norm. This norm may be stronger than the Euclidean norm, so that convergence with respect to $\|h\|$ implies convergence with respect to the Euclidean norm. Typically we will want to make sure that the norm is strong enough to imply convergence of all functions of interest.

(2) The "estimation space" $\mathcal{H}$ is a function space. It plays the role of the parameter space $\Theta$ in our discussion of parametric estimators. There is no restriction to a parametric family, only a restriction to a space of functions that satisfy certain conditions. This formulation is much less restrictive than the restriction to a parametric family.

(3) There is a denseness assumption that was not present in the other theorem.

We will not prove this theorem (the proof is quite similar to the proof of Theorem 19; see Gallant, 1987) but we will discuss its assumptions, in relation to the Fourier form as the approximating model.

18.3.1. Sobolev norm. Since all of the assumptions involve the norm $\|h\|$, we need to make explicit what norm we wish to use. We need a norm that guarantees that the errors in approximation of the functions we are interested in are accounted for. Since we are interested in first-order elasticities in the present case, we need close approximation of both the function $f(x)$ and its first derivative $f'(x)$, throughout the range of $x$. Let $\mathcal{X}$ be an open set that contains all values of $x$ that we're interested in. The Sobolev norm is appropriate in this case. It is defined, making use of our notation for partial derivatives, as:

$$\|h\|_{m,\mathcal{X}} = \max_{|\lambda|^*\leq m}\ \sup_{\mathcal{X}}\left|D^\lambda h(x)\right|.$$

To see whether or not the function $f(x)$ is well approximated by an approximating model $g_K(x\,|\,\theta_K)$, we would evaluate

$$\left\|f(x) - g_K(x\,|\,\theta_K)\right\|_{m,\mathcal{X}}.$$

We see that this norm takes into account errors in approximating the function and partial derivatives up to order $m$. If we want to estimate first order elasticities, as is the case in this example, the relevant $m$ would be $m = 1$. Furthermore, since we examine the $\sup$ over $\mathcal{X}$, convergence with respect to the Sobolev norm means uniform convergence, so that we obtain consistent estimates for all values of $x$.

18.3.2. Compactness. Verifying compactness with respect to this norm is quite technical and unenlightening. It is proven by Elbadawi, Gallant and Souza, Econometrica, 1983. The basic requirement is that if we need consistency with respect to $\|h\|_{m,\mathcal{X}}$, then the functions of interest must belong to a Sobolev space which takes into account derivatives of order $m+1$. A Sobolev space is the set of functions

$$\mathcal{W}_{m+1,\mathcal{X}}(D) = \left\{h(x) : \|h(x)\|_{m+1,\mathcal{X}} < D\right\},$$

where $D$ is a finite constant. In plain words, the functions must have bounded partial derivatives of one order higher than the derivatives we seek to estimate.

18.3.3. The estimation space and the estimation subspace. Since in our case we're interested in consistent estimation of first-order elasticities, we'll define the estimation space as follows:

DEFINITION 29. [Estimation space] The estimation space $\mathcal{H} = \mathcal{W}_{2,\mathcal{X}}(D)$. The estimation space is an open set, and we presume that $h^* \in \mathcal{H}$.

So we are assuming that the function to be estimated has bounded second derivatives throughout $\mathcal{X}$.

With seminonparametric estimators, we don't actually optimize over the estimation space. Rather, we optimize over a subspace, $\mathcal{H}_{K_n}$, defined as:

DEFINITION 30. [Estimation subspace] The estimation subspace $\mathcal{H}_K$ is defined as

$$\mathcal{H}_K = \left\{g_K(x\,|\,\theta_K) : g_K(x\,|\,\theta_K) \in \mathcal{W}_{2,\mathcal{Z}}(D),\ \theta_K \in \mathbb{R}^K\right\},$$

where $g_K(x\,|\,\theta_K)$ is the Fourier form approximation as defined in Equation 18.3.1.

18.3.4. Denseness. The important point here is that $\mathcal{H}_K$ is a space of functions that is indexed by a finite dimensional parameter ($\theta_K$ has $K$ elements, as in equation 18.3.2). With $n$ observations, $n > K$, this parameter is estimable. Note that the true function $h^*$ is not necessarily an element of $\mathcal{H}_K$, so optimization over $\mathcal{H}_K$ may not lead to a consistent estimator. In order for optimization over $\mathcal{H}_K$ to be equivalent to optimization over $\mathcal{H}$, at least asymptotically, we need that:

(1) The dimension of the parameter vector, $\dim\theta_{K_n} \to \infty$ as $n \to \infty$. This is achieved by making $A$ and $J$ in equation 18.3.1 increasing functions of $n$, the sample size. It is clear that $K$ will have to grow more slowly than $n$. The second requirement is:

(2) We need that the $\mathcal{H}_K$ be dense subsets of $\mathcal{H}$.

The estimation subspace $\mathcal{H}_K$, defined above, is a subset of the closure of the estimation space, $\mathcal{H}$. A set of subsets $\mathcal{A}_a$ of a set $\mathcal{A}$ is "dense" if the closure of the countable union of the subsets is equal to the closure of $\mathcal{A}$:

$$\overline{\cup_{a=1}^{\infty}\mathcal{A}_a} = \overline{\mathcal{A}}.$$

Use a picture here. The rest of the discussion of denseness is provided just for completeness: there's no need to study it in detail. To show that $\mathcal{H}_K$ is a dense subset of $\mathcal{H}$ with respect to $\|h\|_{1,\mathcal{X}}$, it is useful to apply Theorem 1 of Gallant (1982), who in turn cites Edmunds and Moscatelli (1977). We reproduce the theorem as presented by Gallant, with minor notational changes, for convenience of reference:

THEOREM 31. [Edmunds and Moscatelli, 1977] Let the real-valued function $h^*(x)$ be continuously differentiable up to order $m$ on an open set containing the closure of $\mathcal{X}$. Then it is possible to choose a triangular array of coefficients $\theta_1, \theta_2, \ldots, \theta_K, \ldots$, such that for every $q$ with $0 \leq q < m$, and every $\varepsilon > 0$,

$$\left\|h^*(x) - h_K(x\,|\,\theta_K)\right\|_{q,\mathcal{X}} = o\left(K^{-m+q+\varepsilon}\right)\ \text{as}\ K \to \infty.$$

In the present application, $q = 1$ and $m = 2$. By definition of the estimation space, the elements of $\mathcal{H}$ are once continuously differentiable on $\mathcal{X}$, which is open and contains the closure of $\mathcal{X}$, so the theorem is applicable. Closely following Gallant and Nychka (1987), let $\cup_\infty\mathcal{H}_K$ denote the countable union of the $\mathcal{H}_K$. The implication of Theorem 31 is that there is a sequence of $\{h_K\}$ from $\cup_\infty\mathcal{H}_K$ such that

$$\lim_{K\to\infty}\left\|h^* - h_K\right\|_{1,\mathcal{X}} = 0,$$

for all $h^* \in \mathcal{H}$. Therefore,

$$\mathcal{H} \subset \overline{\cup_\infty\mathcal{H}_K}.$$

However,

$$\cup_\infty\mathcal{H}_K \subset \mathcal{H},$$

so

$$\overline{\cup_\infty\mathcal{H}_K} \subset \overline{\mathcal{H}}.$$

Therefore

$$\overline{\mathcal{H}} = \overline{\cup_\infty\mathcal{H}_K},$$

so $\cup_\infty\mathcal{H}_K$ is a dense subset of $\mathcal{H}$, with respect to the norm $\|h\|_{1,\mathcal{X}}$.

18.3.5. Uniform convergence. We now turn to the limiting objective function. We estimate by OLS. The sample objective function stated in terms of maximization is

$$s_n(\theta_K) = -\frac{1}{n}\sum_{t=1}^{n}\left(y_t - g_K(x_t\,|\,\theta_K)\right)^2.$$

With random sampling, as in the case of Equations 14.3.1 and 17.2.1, the limiting objective function is

(18.3.6)  $$s_\infty(g, f) = -\int_{\mathcal{X}}\left(f(x) - g(x)\right)^2 d\mu(x) - \sigma_\varepsilon^2,$$

where the true function $f(x)$ takes the place of the generic function $h^*$ in the presentation of the theorem. Both $g(x)$ and $f(x)$ are elements of $\overline{\cup_\infty\mathcal{H}_K}$.

The pointwise convergence of the objective function needs to be strengthened to uniform convergence. We will simply assume that this holds, since the way to verify this depends upon the specific application. We also have continuity of the objective function in $g$, with respect to the norm $\|h\|_{1,\mathcal{X}}$, since

$$\lim_{\|g^1 - g^0\|_{1,\mathcal{X}}\to 0}\left[s_\infty\left(g^1, f\right) - s_\infty\left(g^0, f\right)\right] = \lim_{\|g^1 - g^0\|_{1,\mathcal{X}}\to 0}\int_{\mathcal{X}}\left[\left(g^0(x) - f(x)\right)^2 - \left(g^1(x) - f(x)\right)^2\right]d\mu(x) = 0.$$

By the dominated convergence theorem (which applies since the finite bound $D$ used to define $\mathcal{W}_{2,\mathcal{Z}}(D)$ is dominated by an integrable function), the limit and the integral can be interchanged, so by inspection, the limit is zero.

18.3.6. Identification. The identification condition requires that for any point $(g, f)$ in $\mathcal{H}\times\mathcal{H}$, $s_\infty(g, f) \geq s_\infty(f, f) \Rightarrow \|g - f\|_{1,\mathcal{X}} = 0$. This condition is clearly satisfied given that $g$ and $f$ are once continuously differentiable (by the assumption that defines the estimation space).

18.3.7. Review of concepts. For the example of estimation of first-order elasticities, the relevant concepts are:

- Estimation space $\mathcal{H} = \mathcal{W}_{2,\mathcal{X}}(D)$: the function space in the closure of which the true function must lie.
- Consistency norm $\|h\|_{1,\mathcal{X}}$. The closure of $\mathcal{H}$ is compact with respect to this norm.
- Estimation subspace $\mathcal{H}_K$. The estimation subspace is the subset of $\mathcal{H}$ that is representable by a Fourier form with parameter $\theta_K$. These are dense subsets of $\mathcal{H}$.
- Sample objective function $s_n(\theta_K)$, the negative of the sum of squares. By standard arguments this converges uniformly to the
- Limiting objective function $s_\infty(g, f)$, which is continuous in $g$ and has a global maximum in its first argument, over the closure of the infinite union of the estimation subspaces, at $g = f$.
- As a result of this, first order elasticities

$$\xi_{x_i} = \frac{x_i}{f(x)}\frac{\partial f(x)}{\partial x_i}$$

are consistently estimated for all $x \in \mathcal{X}$.

18.3.8. Discussion. Consistency requires that the number of parameters used in the expansion increase with the sample size, tending to infinity. If parameters are added at a high rate, the bias tends relatively rapidly to zero. A basic problem is that a high rate of inclusion of additional parameters causes the variance to tend more slowly to zero. The issue of how to choose the rate at which parameters are added, and which to add first, is fairly complex. A problem is that the allowable rates for asymptotic normality to obtain (Andrews 1991; Gallant and Souza, 1991) are very strict. Supposing we stick to these rates, our approximating model is:

$$g_K(x\,|\,\theta_K) = z'\theta_K.$$

- Define $Z_K$ as the $n\times K$ matrix of regressors obtained by stacking observations. The LS estimator is

$$\hat{\theta}_K = \left(Z_K'Z_K\right)^{+}Z_K'y,$$

where $(\cdot)^{+}$ is the Moore-Penrose generalized inverse.
  – This is used since $Z_K'Z_K$ may be singular, as would be the case for $K(n)$ large enough when some dummy variables are included.
- The prediction, $z'\hat{\theta}_K$, of the unknown function $f(x)$ is asymptotically normally distributed:

$$\sqrt{n}\left(z'\hat{\theta}_K - f(x)\right)\overset{d}{\to}N\left(0, A_V\right),$$

where

$$A_V = \lim_{n\to\infty}E\left[z'\left(\frac{Z_K'Z_K}{n}\right)^{+}z\,\sigma^2\right].$$

Formally, this is exactly the same as if we were dealing with a parametric linear model. I emphasize, though, that this is only valid if $K$ grows very slowly as $n$ grows. If we can't stick to acceptable rates, we should probably use some other method of approximating the small sample distribution. Bootstrapping is a possibility. We'll discuss this in the section on simulation.
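As a concrete illustration, here is a minimal Python sketch of the seminonparametric least squares fit just described (the accompanying course programs use Ox). The DGP, the univariate Fourier-form regressors, and the fixed $J$ are illustrative assumptions only.

import numpy as np

rng = np.random.default_rng(42)

# illustrative DGP: univariate x, nonlinear mean, classical error
n = 500
x = rng.uniform(0.1, 2*np.pi - 0.1, size=n)
y = np.sin(x) + x/np.pi + 0.2*rng.standard_normal(n)

# Fourier form regressor matrix for a univariate x (A = 1, k_1 = [1]),
# stacking z(x_t)' as rows; J is held fixed here, though in theory it
# should grow (slowly) with n
J = 4
Z = np.column_stack([np.ones(n), x, 0.5*x**2] +
                    [f(j*x) for j in range(1, J + 1)
                     for f in (np.cos, lambda s: -np.sin(s))])

# LS with the Moore-Penrose generalized inverse, as in the text:
# theta_K = (Z'Z)^+ Z'y, which remains defined even if Z'Z is singular
theta = np.linalg.pinv(Z.T @ Z) @ (Z.T @ y)
yhat = Z @ theta
print(theta.shape, np.mean((y - yhat)**2))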

18.4. Kernel regression estimators

Readings: Bierens, 1987, "Kernel estimators of regression functions," in Advances in Econometrics, Fifth World Congress, V. 1, Truman Bewley, ed., Cambridge.

An alternative method to the semi-nonparametric method is a fully nonparametric method of estimation. Kernel regression estimation is an example (others are splines, nearest neighbor, etc.). We'll consider the Nadaraya-Watson kernel regression estimator in a simple case.

- Suppose we have an iid sample from the joint density $f(x,y)$, where $x$ is $k$-dimensional. The model is

$$y_t = g(x_t) + \varepsilon_t,$$

where

$$E\left(\varepsilon_t\,|\,x_t\right) = 0.$$

- The conditional expectation of $y$ given $x$ is $g(x)$. By definition of the conditional expectation, we have

$$g(x) = \int y\,\frac{f(x,y)}{h(x)}\,dy = \frac{1}{h(x)}\int y\,f(x,y)\,dy,$$

where $h(x)$ is the marginal density of $x$:

$$h(x) = \int f(x,y)\,dy.$$

- This suggests that we could estimate $g(x)$ by estimating $h(x)$ and $\int y\,f(x,y)\,dy$.

18.4.1. Estimation of the denominator. A kernel estimator for $h(x)$ has the form

$$\hat{h}(x) = \frac{1}{n}\sum_{t=1}^{n}\frac{K\left[(x - x_t)/\gamma_n\right]}{\gamma_n^k},$$

where $n$ is the sample size and $k$ is the dimension of $x$.

- The function $K(\cdot)$ (the kernel) is absolutely integrable:

$$\int\left|K(x)\right|dx < \infty,$$

and $K(\cdot)$ integrates to 1:

$$\int K(x)\,dx = 1.$$

In this respect, $K(\cdot)$ is like a density function, but we do not necessarily restrict $K(\cdot)$ to be nonnegative.
- The window width parameter, $\gamma_n$, is a sequence of positive numbers that satisfies

$$\lim_{n\to\infty}\gamma_n = 0,\qquad \lim_{n\to\infty}n\gamma_n^k = \infty.$$

So, the window width must tend to zero, but not too quickly.
- To show pointwise consistency of $\hat{h}(x)$ for $h(x)$, first consider the expectation of the estimator (since the estimator is an average of iid terms, we only need to consider the expectation of a representative term):

$$E\left[\hat{h}(x)\right] = \int\gamma_n^{-k}K\left[(x - z)/\gamma_n\right]h(z)\,dz.$$

Change variables as $z^* = (x - z)/\gamma_n$, so $z = x - \gamma_n z^*$ and $|dz/dz^{*\prime}| = \gamma_n^k$; we obtain

$$E\left[\hat{h}(x)\right] = \int\gamma_n^{-k}K(z^*)h(x - \gamma_n z^*)\gamma_n^k\,dz^* = \int K(z^*)h(x - \gamma_n z^*)\,dz^*.$$

Now, asymptotically,

$$\lim_{n\to\infty}E\left[\hat{h}(x)\right] = \int K(z^*)h(x)\,dz^* = h(x)\int K(z^*)\,dz^* = h(x),$$

since $\gamma_n \to 0$ and $\int K(z^*)\,dz^* = 1$ by assumption. (Note: that we can pass the limit through the integral is a result of the dominated convergence theorem. For this to hold we need that $h(\cdot)$ be dominated by an absolutely integrable function.)
- Next, considering the variance of $\hat{h}(x)$, we have, due to the iid assumption,

$$n\gamma_n^k V\left[\hat{h}(x)\right] = n\gamma_n^k\frac{1}{n^2}\sum_{t=1}^{n}V\left\{\frac{K\left[(x - x_t)/\gamma_n\right]}{\gamma_n^k}\right\} = \gamma_n^{-k}\frac{1}{n}\sum_{t=1}^{n}V\left\{K\left[(x - x_t)/\gamma_n\right]\right\}.$$

- By the representative term argument, this is

$$n\gamma_n^k V\left[\hat{h}(x)\right] = \gamma_n^{-k}V\left\{K\left[(x - z)/\gamma_n\right]\right\}.$$

- Also, since $V(x) = E(x^2) - E(x)^2$, we have

$$n\gamma_n^k V\left[\hat{h}(x)\right] = \gamma_n^{-k}E\left\{\left(K\left[(x - z)/\gamma_n\right]\right)^2\right\} - \gamma_n^{-k}\left\{E\left(K\left[(x - z)/\gamma_n\right]\right)\right\}^2$$
$$= \int\gamma_n^{-k}K\left[(x - z)/\gamma_n\right]^2 h(z)\,dz - \gamma_n^k\left[\int\gamma_n^{-k}K\left[(x - z)/\gamma_n\right]h(z)\,dz\right]^2$$
$$= \int\gamma_n^{-k}K\left[(x - z)/\gamma_n\right]^2 h(z)\,dz - \gamma_n^k E\left[\hat{h}(x)\right]^2.$$

The second term converges to zero:

$$\gamma_n^k E\left[\hat{h}(x)\right]^2 \to 0,$$

by the previous result regarding the expectation and the fact that $\gamma_n \to 0$. Therefore,

$$\lim_{n\to\infty}n\gamma_n^k V\left[\hat{h}(x)\right] = \lim_{n\to\infty}\int\gamma_n^{-k}K\left[(x - z)/\gamma_n\right]^2 h(z)\,dz.$$

Using exactly the same change of variables as before, this can be shown to be

$$\lim_{n\to\infty}n\gamma_n^k V\left[\hat{h}(x)\right] = h(x)\int\left[K(z^*)\right]^2 dz^*.$$

Since both $\int\left[K(z^*)\right]^2 dz^*$ and $h(x)$ are bounded, this is bounded, and since $n\gamma_n^k \to \infty$ by assumption, we have that

$$V\left[\hat{h}(x)\right] \to 0.$$

- Since the bias and the variance both go to zero, we have pointwise consistency (convergence in quadratic mean implies convergence in probability).
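A minimal Python sketch of this estimator follows (the course routines are in KernelLib.ox). The product standard normal kernel used here is one common choice, not the only one.

import numpy as np

def kernel_density(x0, X, gamma):
    """Kernel estimator of the marginal density h(x) at the point x0.

    X is (n, k); gamma is the window width. Uses the product standard
    normal kernel, which integrates to one as required.
    """
    X = np.atleast_2d(X)
    n, k = X.shape
    u = (x0 - X)/gamma                                  # standardized distances
    Ku = np.prod(np.exp(-0.5*u**2)/np.sqrt(2*np.pi), axis=1)
    return Ku.mean()/gamma**k

# example: 500 draws from N(0,1); the estimate at 0 should be roughly 0.4
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 1))
print(kernel_density(np.zeros(1), X, gamma=0.5))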

18.4.2. Estimation of the numerator. To estimate $\int y\,f(x,y)\,dy$, we need an estimator of $f(x,y)$. The estimator has the same form as the estimator for $h(x)$, only with one dimension more:

$$\hat{f}(x,y) = \frac{1}{n}\sum_{t=1}^{n}\frac{K_*\left[(y - y_t)/\gamma_n,\ (x - x_t)/\gamma_n\right]}{\gamma_n^{k+1}}.$$

The kernel $K_*(\cdot)$ is required to have mean zero:

$$\int y\,K_*(y, x)\,dy = 0,$$

and to marginalize to the previous kernel for $h(x)$:

$$\int K_*(y, x)\,dy = K(x).$$

With this kernel, we have

$$\int y\,\hat{f}(y, x)\,dy = \frac{1}{n}\sum_{t=1}^{n}y_t\frac{K\left[(x - x_t)/\gamma_n\right]}{\gamma_n^k},$$

by marginalization of the kernel, so we obtain

$$\hat{g}(x) = \frac{1}{\hat{h}(x)}\int y\,\hat{f}(y, x)\,dy = \frac{\sum_{t=1}^{n}y_t K\left[(x - x_t)/\gamma_n\right]}{\sum_{t=1}^{n}K\left[(x - x_t)/\gamma_n\right]}.$$

This is the Nadaraya-Watson kernel regression estimator.
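A minimal Python sketch of the Nadaraya-Watson estimator (the course implementation is in KernelLib.ox; the Gaussian kernel, the DGP, and the evaluation grid below are illustrative choices):

import numpy as np

def nadaraya_watson(x0, X, y, gamma):
    """Nadaraya-Watson estimate of g(x0) = E(y | x = x0).

    The estimate is a weighted average of the y_t, with weights that
    sum to one and are larger for x_t close to x0.
    """
    u = (x0 - X)/gamma
    Ku = np.exp(-0.5*u**2)/np.sqrt(2*np.pi)   # Gaussian kernel, univariate x
    return np.sum(Ku*y)/np.sum(Ku)

# illustrative data with a nonlinear conditional mean
rng = np.random.default_rng(2)
X = rng.uniform(0, 2*np.pi, 500)
y = np.sin(X) + 0.3*rng.standard_normal(500)

grid = np.linspace(0.5, 2*np.pi - 0.5, 9)
print(np.round([nadaraya_watson(x0, X, y, gamma=0.5) for x0 in grid], 2))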

18.4.3. Discussion.

- The kernel regression estimator for $g(x_t)$ is a weighted average of the $y_j$, $j = 1, 2, \ldots, n$, where higher weights are associated with points that are closer to $x_t$. The weights sum to 1.
- The window width parameter $\gamma_n$ imposes smoothness. The estimator is increasingly flat as $\gamma_n \to \infty$, since in this case each weight tends to $1/n$.
- A large window width reduces the variance (strong imposition of flatness), but increases the bias.
- A small window width reduces the bias, but makes very little use of information except points that are in a small neighborhood of $x_t$. Since relatively little information is used, the variance is large when the window width is small.
- The standard normal density is a popular choice for $K(\cdot)$ and $K_*(y,x)$, though there are possibly better alternatives.

18.4.4. Choice of the window width: Cross-validation. The selection of an appropriate window width is important. One popular method is cross validation. This consists of splitting the sample into two parts (e.g., 50%-50%). The first part is the "in sample" data, which is used for estimation, and the second part is the "out of sample" data, used for evaluation of the fit through RMSE or some other criterion. The steps are:

(1) Split the data. The out of sample data is $y^{out}$ and $x^{out}$.
(2) Choose a window width $\gamma$.
(3) With the in sample data, fit $\hat{y}_t^{out}$ corresponding to each $x_t^{out}$. This fitted value is a function of the in sample data, as well as the evaluation point $x_t^{out}$, but it does not involve $y_t^{out}$.
(4) Repeat for all out of sample points.
(5) Calculate RMSE($\gamma$).
(6) Go to step (2), or to the next step if enough window widths have been tried.
(7) Select the $\gamma$ that minimizes RMSE($\gamma$) (verify that a minimum has been found, for example by plotting RMSE as a function of $\gamma$).
(8) Re-estimate using the best $\gamma$ and all of the data.

This same principle can be used to choose $A$ and $J$ in a Fourier form model. A sketch of the procedure is given below.
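A minimal Python sketch of steps (1)-(8) (illustrative names and data; the course program kernelreg2.ox instead does leave-one-out cross validation, discussed in the examples below):

import numpy as np

def nadaraya_watson(x0, X, y, gamma):
    # the kernel's normalizing constant cancels in the NW ratio,
    # so an unnormalized Gaussian kernel suffices here
    Ku = np.exp(-0.5*((x0 - X)/gamma)**2)
    return np.sum(Ku*y)/np.sum(Ku)

def cv_window_width(X, y, widths, rng):
    """Split-sample cross validation for the window width."""
    n = len(y)
    idx = rng.permutation(n)
    ins, outs = idx[: n//2], idx[n//2 :]          # (1) 50%-50% split
    best, best_rmse = None, np.inf
    for gamma in widths:                           # (2), (6): loop over widths
        # (3)-(4): fit each out-of-sample point from the in-sample data
        yhat = np.array([nadaraya_watson(x0, X[ins], y[ins], gamma)
                         for x0 in X[outs]])
        rmse = np.sqrt(np.mean((y[outs] - yhat)**2))   # (5)
        if rmse < best_rmse:                       # (7) keep the minimizer
            best, best_rmse = gamma, rmse
    return best, best_rmse

rng = np.random.default_rng(3)
X = rng.uniform(0, 2*np.pi, 500)
y = np.sin(X) + 0.3*rng.standard_normal(500)
print(cv_window_width(X, y, widths=np.linspace(0.1, 1.5, 15), rng=rng))
# step (8): re-estimate with the chosen width and all of the data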

18.5. Kernel density estimation

The previous discussion suggests that a kernel density estimator may easily be constructed. We have already seen how joint densities may be estimated. If we were interested in a conditional density, for example of $y$ conditional on $x$, then the kernel estimate of the conditional density is simply

$$\hat{f}_{y|x} = \frac{\hat{f}(x,y)}{\hat{h}(x)} = \frac{\frac{1}{n}\sum_t K_*\left[(y - y_t)/\gamma_n,\ (x - x_t)/\gamma_n\right]/\gamma_n^{k+1}}{\frac{1}{n}\sum_t K\left[(x - x_t)/\gamma_n\right]/\gamma_n^k} = \frac{1}{\gamma_n}\frac{\sum_t K_*\left[(y - y_t)/\gamma_n,\ (x - x_t)/\gamma_n\right]}{\sum_t K\left[(x - x_t)/\gamma_n\right]},$$

where we obtain the expressions for the joint and marginal densities from the section on kernel regression.

18.6. Semi-nonparametric maximum likelihood

Readings: Gallant and Nychka, Econometrica, 1987. For a Fortran program to do this and a useful discussion in the user's guide, see this link. See also Cameron and Johansson, Journal of Applied Econometrics, V. 12, 1997.

MLE is the estimation method of choice when we are confident about specifying the density. Is it possible to obtain the benefits of MLE when we're not so confident about the specification? In part, yes.

Suppose we're interested in the density of $y$ conditional on $x$ (both may be vectors). Suppose that the density $f(y\,|\,x,\phi)$ is a reasonable starting approximation to the true density. This density can be reshaped by multiplying it by a squared polynomial. The new density is

$$g_p(y\,|\,x,\phi,\gamma) = \frac{h_p^2(y\,|\,\gamma)\,f(y\,|\,x,\phi)}{\eta_p(x,\phi,\gamma)},$$

where

$$h_p(y\,|\,\gamma) = \sum_{k=0}^{p}\gamma_k y^k$$

and $\eta_p(x,\phi,\gamma)$ is a normalizing factor to make the density integrate (sum) to one. Because $h_p^2(y\,|\,\gamma)/\eta_p(x,\phi,\gamma)$ is a homogenous function of the parameters $\gamma$, it is necessary to impose a normalization: $\gamma_0$ is set to 1.

The normalization factor $\eta_p(\phi,\gamma)$ is calculated (following Cameron and Johansson) using

$$E(Y^r) = \sum_{y=0}^{\infty}y^r f_Y(y\,|\,\phi,\gamma)$$
$$= \sum_{y=0}^{\infty}y^r\frac{\left[h_p(y\,|\,\gamma)\right]^2}{\eta_p(\phi,\gamma)}f_Y(y\,|\,\phi)$$
$$= \sum_{y=0}^{\infty}\sum_{k=0}^{p}\sum_{l=0}^{p}y^r f_Y(y\,|\,\phi)\,\gamma_k\gamma_l\,y^k y^l\,\big/\,\eta_p(\phi,\gamma)$$
$$= \sum_{k=0}^{p}\sum_{l=0}^{p}\gamma_k\gamma_l\left[\sum_{y=0}^{\infty}y^{r+k+l}f_Y(y\,|\,\phi)\right]\big/\,\eta_p(\phi,\gamma)$$
$$= \sum_{k=0}^{p}\sum_{l=0}^{p}\gamma_k\gamma_l\,m_{k+l+r}\,\big/\,\eta_p(\phi,\gamma).$$

By setting $r = 0$ we get that the normalizing factor is

(18.6.1)  $$\eta_p(\phi,\gamma) = \sum_{k=0}^{p}\sum_{l=0}^{p}\gamma_k\gamma_l\,m_{k+l}.$$

Recall that $\gamma_0$ is set to 1 to achieve identification. The $m_r$ in equation 18.6.1 are the raw moments of the baseline density. Gallant and Nychka (1987) give conditions under which such a density may be treated as correctly specified, asymptotically. Basically, the order of the polynomial must increase as the sample size increases. However, there are technicalities.
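Equation 18.6.1 is easy to evaluate once the raw moments of the baseline density are available. A minimal Python sketch (hypothetical names; the course code does the same computation in Ox, shown below):

import numpy as np

def eta_p(gam, raw_moments):
    """Normalizing factor of equation 18.6.1.

    gam is (gamma_1, ..., gamma_p); gamma_0 = 1 is imposed for
    identification. raw_moments[r] must hold m_r, the r-th raw moment
    of the baseline density, for r = 0, ..., 2p (with m_0 = 1).
    """
    g = np.concatenate(([1.0], np.asarray(gam, dtype=float)))
    p = g.size - 1
    return sum(g[k]*g[l]*raw_moments[k + l]
               for k in range(p + 1) for l in range(p + 1))

# example with p = 2 and a Poisson(2) baseline, whose raw moments
# m_0, ..., m_4 are 1, 2, 6, 22, 94
m = [1.0, 2.0, 6.0, 22.0, 94.0]
print(eta_p([0.1, -0.05], m))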

Similarly to Cameron and Johansson (1997), we may develop a negative binomial polynomial (NBP) density for count data. The negative binomial baseline density may be written as

$$f_Y(y\,|\,\phi) = \frac{\Gamma(y + \psi)}{\Gamma(y + 1)\Gamma(\psi)}\left(\frac{\psi}{\psi + \lambda}\right)^{\psi}\left(\frac{\lambda}{\psi + \lambda}\right)^{y},$$

where $\phi = \{\lambda, \psi\}$, $\lambda > 0$ and $\psi > 0$. The usual means of incorporating conditioning variables $x$ is the parameterization $\lambda = e^{x'\beta}$. When $\psi = \lambda/\alpha$ we have the negative binomial-I model (NB-I). When $\psi = 1/\alpha$ we have the negative binomial-II (NB-II) model. For the NB-I density, $V(Y) = \lambda + \alpha\lambda$. In the case of the NB-II model, we have $V(Y) = \lambda + \alpha\lambda^2$. For both forms, $E(Y) = \lambda$.

The reshaped density, with normalization to sum to one, is

(18.6.2)  $$f_Y(y\,|\,\phi,\gamma) = \frac{\left[h_p(y\,|\,\gamma)\right]^2}{\eta_p(\phi,\gamma)}\frac{\Gamma(y + \psi)}{\Gamma(y + 1)\Gamma(\psi)}\left(\frac{\psi}{\psi + \lambda}\right)^{\psi}\left(\frac{\lambda}{\psi + \lambda}\right)^{y}.$$

To get the normalization factor, we need the moment generating function:

(18.6.3)  $$M_Y(t) = \psi^{\psi}\left(\lambda - e^t\lambda + \psi\right)^{-\psi}.$$

To illustrate, here are the first through fourth raw moments of the NB density, calculated using MuPAD, which is a Computer Algebra System that is free for personal use, and then programmed in Ox. These are the moments you would need to use a second order polynomial ($p = 2$).

if (k_gam >= 1)
{
    // first and second raw moments of the NB(lambda, psi) density
    m[][0] = lambda;
    m[][1] = (lambda .* (lambda + psi + lambda .* psi)) ./ psi;
}
if (k_gam >= 2)
{
    // third and fourth raw moments
    m[][2] = (lambda .* (psi .^ 2 + 3 .* lambda .* psi .* (1 + psi)
        + lambda .^ 2 .* (2 + 3 .* psi + psi .^ 2))) ./ psi .^ 2;
    m[][3] = (lambda .* (psi .^ 3 + 7 .* lambda .* psi .^ 2 .* (1 + psi)
        + 6 .* lambda .^ 2 .* psi .* (2 + 3 .* psi + psi .^ 2)
        + lambda .^ 3 .* (6 + 11 .* psi + 6 .* psi .^ 2 + psi .^ 3))) ./ psi .^ 3;
}

After calculating the raw moments, the normalization factor is calculated using equation 18.6.1, again with the help of MuPAD.

if (k_gam == 1)
{
    norm_factor = 1 + gam[0][] .* (2 .* m[][0] + gam[0][] .* m[][1]);
}
else if (k_gam == 2)
{
    norm_factor = 1 + gam[0][] .^ 2 .* m[][1]
        + 2 .* gam[0][] .* (m[][0] + gam[1][] .* m[][2])
        + gam[1][] .* (2 .* m[][1] + gam[1][] .* m[][3]);
}

For $p = 3$ the analogous formulae are impressively (i.e., several pages) long. This is an example of a model that would be difficult to formulate without the help of a program like MuPAD.

It is possible that there is conditional heterogeneity such that the appropriate reshaping should be more local. This can be accommodated by allowing the $\gamma_k$ parameters to depend upon the conditioning variables, for example using polynomials.

Gallant and Nychka, Econometrica, 1987 prove that this sort of density can approximate a wide variety of densities arbitrarily well as the degree of the polynomial increases with the sample size. This approach is not without its drawbacks: the sample objective function can have an extremely large number of local maxima that can lead to numeric difficulties. If someone could figure out how to do this in a way such that the sample objective function was nice and smooth, they would probably get the paper published in a good journal. Any ideas?

Here's a plot of the true and the limiting SNP approximations (with the order of the polynomial fixed) to four different count data densities, which variously exhibit over- and underdispersion, as well as excess zeros. The baseline model is a negative binomial density.

[Figure: true densities and limiting SNP approximations for four count data densities (Figures/SNP.eps).]

18.7. Examples

18.7.1. Fourier form estimation. You need to get the file FFF.ox, which sets up the data matrix for Fourier form estimation. The first DGP generates data with a nonlinear mean and $\chi^2(2)$ errors (with the mean subtracted out). Then the program fourierform.ox allows you to experiment with different sample sizes and values of $J$. There is no need to specify multi-indices with a univariate regressor (as is the case here, to keep the graphics simple). For a sample size of $n = 500$, here are several plots with different $J$.

This first plot shows an underparameterized fit ($J = 2$).

[Figure: Nonparametric-I/fff_2.eps]

This next one looks pretty good ($J = 4$).

[Figure: Nonparametric-I/fff_4.eps]

Here's an example of an overfitted model - we are starting to chase the error term too much ($J = 10$).

[Figure: Nonparametric-I/fff_10.eps]

18.7.2. Kernel regression estimation. You need to get the file KernelLib.ox, which contains the routines for kernel regression and density estimation.

18.7.3. Kernel regression. We will use the same data generating process as for the above examples of Fourier form models. The program kernelreg1.ox allows you to experiment with different sample sizes and window widths. For a sample size of $n = 500$, here are several plots with different window widths. Note that too small a window width (ww = 0.1) leads to a very irregular fit, while setting the window width too high leads to too flat a fit.

[Figure: Nonparametric-I/undersmoothed.eps]

[Figure: Nonparametric-I/oversmoothed.eps]

[Figure: Nonparametric-I/justright.eps]

- Cross validation

The leave-one-out method of cross validation consists of doing an out-of-sample fit to each data point in turn, and calculating the MSE. This is repeated for various window widths. The minimum MSE window width may then be chosen. The program kernelreg2.ox does this. The results are:

[Figure: Nonparametric-I/cvscores.eps]

[Figure: Nonparametric-I/crossvalidated.eps]

18.7.4. Kernel density estimation. The second DGP generates $N(0,1)$ random variables, then estimates their density using kernel density estimation. The program kerneldens.ox allows you to experiment using different sample sizes, kernels, and window widths. The following figure shows an Epanechnikov kernel fit using different window widths. To change kernels you need to selectively (un)comment lines in the KernelLib.ox file.

[Figure: Nonparametric-I/kerneldensfit.eps]

18.7.5. Seminonparametric density estimation and MuPAD. Following the lecture notes, an SNP density for count data may be obtained by reshaping a negative binomial density using a squared polynomial:

(18.7.1)  $$f_Y(y\,|\,\phi,\gamma) = \frac{\left[h_p(y\,|\,\gamma)\right]^2}{\eta_p(\phi,\gamma)}\frac{\Gamma(y + \psi)}{\Gamma(y + 1)\Gamma(\psi)}\left(\frac{\psi}{\psi + \lambda}\right)^{\psi}\left(\frac{\lambda}{\psi + \lambda}\right)^{y},$$

(18.7.2)  $$h_p(y\,|\,\gamma) = \sum_{k=0}^{p}\gamma_k y^k.$$

The normalization factor is

(18.7.3)  $$\eta_p(\phi,\gamma) = \sum_{k=0}^{p}\sum_{l=0}^{p}\gamma_k\gamma_l\,m_{k+l}.$$

To implement this using a polynomial of order $p$, we need the raw moments of the negative binomial density up to order $2p$. I couldn't find the NB moment generating function anywhere, so a solution is to calculate it using a Computer Algebra System (CAS). Rather than using one of the expensive alternatives, we can try out MuPAD, which can be downloaded and is free (in the sense of free beer) for personal use. It is installed on the Linux machines in the computer room, and if you like you can install the Windows version, too.

The file negbinSNP.mpd, if run using the command mupad negbinSNP.mpd, will give you the output that follows:

*----*  MuPAD 2.5.1 -- The Open Computer Algebra System
        Copyright (c) 1997 - 2002 by SciFace Software
        All rights reserved.
        Licensed to: Dr. Michael Creel

Negative Binomial SNP Density

First define the NB density

    gamma(a + y) (a/(a + b))^a (b/(a + b))^y / (gamma(a) gamma(y + 1))

Verify that it sums to 1

    1

Define the MGF

    (a/(a + b))^a / ((a + b - b exp(t))/(a + b))^a

Print the MGF in TeX format

    "\\frac{\\frac{a}{\\left(a + b\\right)}^a}{\\frac{\\left(a + b - b\\, \\mbox{exp}\\left(t\\right)\\right)}{\\left(a + b\\right)}^a}"

Find the first moment (which we know is b (lambda))

    b

Find the fifth moment (which we probably don't know)

    (24 b^5 + 60 a b^4 + a^4 b + 50 a b^5 + 50 a^2 b^3 + 15 a^3 b^2
     + 110 a^2 b^4 + 75 a^3 b^3 + 15 a^4 b^2 + 35 a^2 b^5 + 60 a^3 b^4
     + 25 a^4 b^3 + 10 a^3 b^5 + 10 a^4 b^4 + a^4 b^5) / a^4

Print the fifth moment in fortran form, to program ln L

    " t3 = a**-4*(b**5*24.0D0+60.0D0*a*b**4+a**4*b+50.0D0*a*b**5+50.0D0*
      ~(a*a)*b**3+15.0D0*a**3*(b*b)+110.0D0*(a*a)*b**4+75.0D0*a**3*b**3+1
      ~5.0D0*a**4*(b*b)+35.0D0*(a*a)*b**5+60.0D0*a**3*b**4+25.0D0*a**4*b*
      ~*3+10.0D0*a**3*b**5+10.0D0*a**4*b**4+a**4*b**5)"

Print the fifth moment in TeX form

    "\\frac{24\\, b^5 + 60\\, a\\, b^4 + a^4\\, b + 50\\, a\\, b^5 + 50\\, a^2\\, b^3 + 15\\, a^3\\, b^2 + 110\\, a^2\\, b^4 + 75\\, a^3\\, b^3 + 15\\, a^4\\, b^2 + 35\\, a^2\\, b^5 + 60\\, a^3\\, b^4 + 25\\, a^4\\, b^3 + 10\\, a^3\\, b^5 + 10\\, a^4\\, b^4 + a^4\\, b^5}{a^4}"

To get the normalizing factor, we need expressions of the form of the following

    a(0) b(0) m(0) + a(0) b(1) m(1) + b(0) a(1) m(1) + a(0) b(2) m(2) +
    b(0) a(2) m(2) + a(1) b(1) m(2) + a(0) b(3) m(3) + b(0) a(3) m(3) +
    a(1) b(2) m(3) + a(2) b(1) m(3) + a(1) b(3) m(4) + a(2) b(2) m(4) +
    b(1) a(3) m(4) + a(2) b(3) m(5) + a(3) b(2) m(5) + a(3) b(3) m(6)

>> quit
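MuPAD is no longer the only free option; the same derivation can be reproduced with any CAS. As an illustration only (this is not part of the original course materials), here is a minimal sketch using the sympy Python library, which differentiates the MGF of equation 18.6.3 to obtain the raw moments:

import sympy as sp

t, lam, psi = sp.symbols('t lambda psi', positive=True)

# MGF of the negative binomial density, equation 18.6.3
M = psi**psi * (lam - sp.exp(t)*lam + psi)**(-psi)

# the r-th raw moment is the r-th derivative of the MGF at t = 0
def raw_moment(r):
    return sp.simplify(sp.diff(M, t, r).subs(t, 0))

print(raw_moment(1))               # lambda, as expected
m5 = raw_moment(5)
print(sp.latex(m5))                # TeX form, as in the MuPAD session
print(sp.fcode(m5, standard=95))   # Fortran form, for programming ln L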

Once you get expressions for the moments and the double sums, you can use these to program a loglikelihood function in Ox, without too much trouble. The file NegBinSNP.ox implements this. The file EstimateNBSNP.ox will let you estimate NegBinSNP models for the MEPS data. The estimation results for OBDV using $p = 2$ and a NB-I baseline model are:

Ox version 3.20 (Linux) (C) J.A. Doornik, 1994-2002

**************************************************************************

MEPS data, OBDV


negbin_snp_obj results

Strong convergence

Observations = 500

Avg. Log Likelihood

-2.2426

Standard Errors

params se(OPG) se(Sand.) se(Hess)

constant 1.5340 0.13289 0.12645 0.12593

pub_ins 0.16113 0.053100 0.056824 0.054144

priv_ins 0.090624 0.062689 0.065619 0.063835

sex 0.16863 0.047614 0.050720 0.048707

age 0.17950 0.048407 0.045060 0.046301

educ 0.039692 0.047968 0.058794 0.052521

inc 0.032581 0.064384 0.043708 0.051033

ln_alpha 1.8138 0.18466 0.17398 0.17378

-0.052710 0.0089429 0.0078799 0.0083419

0.013382 0.0042349 0.0039745 0.0040547

t-Stats

params t(OPG) t(Sand.) t(Hess)

constant 1.5340 11.543 12.132 12.181


pub_ins 0.16113 3.0344 2.8356 2.9759

priv_ins 0.090624 1.4456 1.3811 1.4197

sex 0.16863 3.5416 3.3248 3.4621

age 0.17950 3.7082 3.9837 3.8769

educ 0.039692 0.82746 0.67509 0.75573

inc 0.032581 0.50603 0.74541 0.63842

ln_alpha 1.8138 9.8226 10.425 10.438

-0.052710 -5.8941 -6.6892 -6.3188

0.013382 3.1599 3.3669 3.3003

Information Criteria

CAIC BIC AIC

2314.7 2304.7 2262.6

**************************************************************************

Note that the CAIC and BIC are lower for this model than for the ordinary NB-I model. NOTE: density functions formed in this way may have MANY local maxima, so you need to be careful before accepting the results of a casual run. To guard against having converged to a local maximum, one can try using multiple starting values, or one could try simulated annealing as an optimization method. To do this, copy maxsa.ox and maxsa.h into your working directory, and then use the program EstimateNBSNP2.ox to see how to implement SA estimation of the reshaped negative binomial model. For more details on the Ox implementation of SA, see Charles Bos' page. Note - in my own experience, using a gradient-based method such as BFGS with many starting values is as successful as SA, and is usually faster. Perhaps I'm not using SA as well as is possible... YMMV.

CHAPTER 19

Simulation-based estimation

Readings: In addition to the book mentioned previously, articles include Gallant and Tauchen (1996), "Which Moments to Match?", Econometric Theory, Vol. 12, 1996, pages 657-681; Gourieroux, Monfort and Renault (1993), "Indirect Inference," J. Appl. Econometrics; Pakes and Pollard (1989), Econometrica; McFadden (1989), Econometrica.

19.1. Motivation

Simulation methods are of interest when the DGP is fully characterized by a parameter vector, but the likelihood function is not calculable. If it were available, we would simply estimate by MLE, which is asymptotically fully efficient.

19.1.1. Example: Multinomial and/or dynamic discrete response models. Let $y_i^*$ be a latent random vector of dimension $m$. Suppose that

$$y_i^* = X_i\beta + \varepsilon_i,$$

where $X_i$ is $m\times K$. Suppose that

(19.1.1)  $$\varepsilon_i \sim N(0, \Omega).$$

Henceforth drop the $i$ subscript when it is not needed for clarity.

- $y^*$ is not observed. Rather, we observe a many-to-one mapping

$$y = \tau(y^*).$$

This mapping is such that each element of $y$ is either zero or one (in some cases only one element will be one).
- Define

$$A_i = A(y_i) = \left\{y^* \,|\, y_i = \tau(y^*)\right\}.$$

Suppose random sampling of $(y_i, X_i)$. In this case the elements of $y_i$ may not be independent of one another (and clearly are not if $\Omega$ is not diagonal). However, $y_i$ is independent of $y_j$, $i \neq j$.
- Let $\theta = \left(\beta', \mathrm{vech}(\Omega)'\right)'$ be the vector of parameters of the model. The contribution of the $i$-th observation to the likelihood function is

$$p_i(\theta) = \int_{A_i}n\left(y_i^* - X_i\beta, \Omega\right)dy_i^*,$$

where

$$n(\varepsilon, \Omega) = (2\pi)^{-m/2}\left|\Omega\right|^{-1/2}\exp\left(-\frac{\varepsilon'\Omega^{-1}\varepsilon}{2}\right)$$

is the multivariate normal density of an $m$-dimensional random vector. The log-likelihood function is

$$\ln L(\theta) = \frac{1}{n}\sum_{i=1}^{n}\ln p_i(\theta)$$

and the MLE $\hat{\theta}$ solves the score equations

$$\frac{1}{n}\sum_{i=1}^{n}g_i(\hat{\theta}) = \frac{1}{n}\sum_{i=1}^{n}\frac{D_\theta p_i(\hat{\theta})}{p_i(\hat{\theta})} \equiv 0.$$

- The problem is that evaluation of $p_i(\theta)$ and its derivative with respect to $\theta$ by standard methods of numeric integration such as quadrature is computationally infeasible when $m$ (the dimension of $y$) is higher than 3 or 4 (as long as there are no restrictions on $\Omega$).
- The mapping $\tau(y^*)$ has not been made specific so far. This setup is quite general: for different choices of $\tau(y^*)$ it nests the case of dynamic binary discrete choice models as well as the case of multinomial discrete choice (the choice of one out of a finite set of alternatives).
  – Multinomial discrete choice is illustrated by a (very simple) job search model. We have cross sectional data on individuals' matching to a set of $m$ jobs that are available (one of which is unemployment). The utility of alternative $j$ is

$$u_j = X_j\beta + \varepsilon_j.$$

Utilities of jobs, stacked in the vector $u_i$, are not observed. Rather, we observe the vector formed of elements

$$y_j = 1\left[u_j > u_k,\ \forall k \in m,\ k \neq j\right].$$

Only one of these elements is different than zero.
  – Dynamic discrete choice is illustrated by repeated choices over time between two alternatives. Let alternative $j$ have utility

$$u_{jt} = W_{jt}\beta - \varepsilon_{jt},\qquad j \in \{1, 2\},\qquad t \in \{1, 2, \ldots, m\}.$$

Then

$$y^* = u_2 - u_1 = (W_2 - W_1)\beta + \varepsilon_2 - \varepsilon_1 \equiv X\beta + \varepsilon.$$

Now the mapping is (element-by-element)

$$y = 1\left[y^* > 0\right],$$

that is, $y_{it} = 1$ if individual $i$ chooses the second alternative in period $t$, zero otherwise.

19.1.2. Example: Marginalization of latent variables. Economic data often presents substantial heterogeneity that may be difficult to model. A possibility is to introduce latent random variables. This can cause the problem that there may be no known closed form for the distribution of observable variables after marginalizing out the unobservable latent variables. For example, count data (that takes values $0, 1, 2, 3, \ldots$) is often modeled using the Poisson distribution

$$\Pr(y = i) = \frac{\exp(-\lambda)\lambda^i}{i!}.$$

The mean and variance of the Poisson distribution are both equal to $\lambda$:

$$E(y) = V(y) = \lambda.$$

Often, one parameterizes the conditional mean as

$$\lambda_i = \exp(x_i\beta).$$

This ensures that the mean is positive (as it must be). Estimation by ML is straightforward.

Often, count data exhibits "overdispersion," which simply means that

$$V(y) > E(y).$$

If this is the case, a solution is to use the negative binomial distribution rather than the Poisson. An alternative is to introduce a latent variable that reflects heterogeneity into the specification:

$$\lambda_i = \exp(x_i\beta + \eta_i),$$

where $\eta_i$ has some specified density with support $S$ (this density may depend on additional parameters). Let $d\mu(\eta_i)$ be the density of $\eta_i$. In some cases, the marginal density of $y$,

$$\Pr(y = y_i) = \int_S\exp\left[-\exp(x_i\beta + \eta_i)\right]\frac{\left[\exp(x_i\beta + \eta_i)\right]^{y_i}}{y_i!}\,d\mu(\eta_i),$$

will have a closed-form solution (one can derive the negative binomial distribution in this way if $\exp(\eta)$ follows a gamma distribution), but often this will not be possible. In this case, simulation is a means of calculating $\Pr(y = y_i)$, which is then used to do ML estimation. This would be an example of Simulated Maximum Likelihood (SML) estimation.

- In this case, since there is only one latent variable, quadrature is probably a better choice. However, a more flexible model with heterogeneity would allow all parameters (not just the constant) to vary. For example,

$$\Pr(y = y_i) = \int\exp\left[-\exp(x_i\beta_i)\right]\frac{\left[\exp(x_i\beta_i)\right]^{y_i}}{y_i!}\,d\mu(\beta_i)$$

entails a $K = \dim\beta_i$-dimensional integral, which will not be evaluable by quadrature when $K$ gets large.

19.1.3. Estimation of models specified in terms of stochastic differential equations. It is often convenient to formulate models in terms of continuous time using differential equations. A realistic model should account for exogenous shocks to the system, which can be done by assuming a random component. This leads to a model that is expressed as a system of stochastic differential equations. Consider the process

$$dy_t = g(\theta, y_t)\,dt + h(\theta, y_t)\,dW_t,$$

which is assumed to be stationary. $\{W_t\}$ is a standard Brownian motion (Wiener process), such that

$$W(T) = \int_0^T dW_t \sim N(0, T).$$

Brownian motion is a continuous-time stochastic process such that

- $W(0) = 0$
- $\left[W(s) - W(t)\right] \sim N(0, s - t)$
- $\left[W(s) - W(t)\right]$ and $\left[W(j) - W(k)\right]$ are independent for $s > t > j > k$. That is, non-overlapping segments are independent.

One can think of Brownian motion as the accumulation of independent normally distributed shocks with infinitesimal variance.

- The function $g(\theta, y_t)$ is the deterministic part.
- $h(\theta, y_t)$ determines the variance of the shocks.

To estimate a model of this sort, we typically have data that are assumed to be observations of $y_t$ in discrete points $y_1, y_2, \ldots, y_T$. That is, though $y_t$ is a continuous process, it is observed in discrete time.

To perform inference on $\theta$, direct ML or GMM estimation is not usually feasible, because one cannot, in general, deduce the transition density $f(y_t\,|\,y_{t-1}, \theta)$. This density is necessary to evaluate the likelihood function or to evaluate moment conditions (which are based upon expectations with respect to this density).

- A typical solution is to "discretize" the model, by which we mean to find a discrete time approximation to the model. The discretized version of the model is

$$y_t - y_{t-1} = g(\phi, y_{t-1}) + h(\phi, y_{t-1})\varepsilon_t,\qquad \varepsilon_t \sim N(0, 1).$$

The discretization induces a new parameter, $\phi$ (that is, the $\phi^0$ which defines the best approximation of the discretization to the actual (unknown) discrete time version of the model is not equal to $\theta^0$, which is the true parameter value). This is an approximation, and as such "ML" estimation of $\phi$ (which is actually quasi-maximum likelihood, QML) based upon this equation is in general biased and inconsistent for the original parameter, $\theta$. Nevertheless, the approximation shouldn't be too bad, which will be useful, as we will see.
- The important point about these three examples is that computational difficulties prevent direct application of ML, GMM, etc. Nevertheless the model is fully specified in probabilistic terms up to a parameter vector. This means that the model is simulable, conditional on the parameter vector, as the sketch below illustrates.
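To make "simulable" concrete, here is a minimal Python sketch that simulates the discretized model above. The mean-reverting drift and constant diffusion shown are purely illustrative assumptions, not a model from these notes.

import numpy as np

def simulate_discretized_sde(phi, y0, T, rng):
    """Simulate the discretized model
        y_t - y_{t-1} = g(phi, y_{t-1}) + h(phi, y_{t-1}) eps_t,
    with eps_t ~ N(0,1). Illustrative specification:
        g(phi, y) = kappa*(mu - y),   h(phi, y) = sigma.
    """
    kappa, mu, sigma = phi
    y = np.empty(T)
    y[0] = y0
    eps = rng.standard_normal(T)
    for t in range(1, T):
        y[t] = y[t-1] + kappa*(mu - y[t-1]) + sigma*eps[t]
    return y

rng = np.random.default_rng(4)
path = simulate_discretized_sde((0.1, 1.0, 0.2), y0=1.0, T=1000, rng=rng)
print(path[:5])

Given a candidate parameter vector, paths like this one can be generated cheaply, which is all that the simulation-based estimators of this chapter require.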

19.2. Simulated maximum likelihood (SML)

For simplicity, consider cross-sectional data. An ML estimator solves

$$\hat{\theta}_{ML} = \arg\max s_n(\theta) = \frac{1}{n}\sum_{t=1}^{n}\ln p(y_t\,|\,X_t, \theta),$$

where $p(y_t\,|\,X_t, \theta)$ is the density function of the $t$-th observation. When $p(y_t\,|\,X_t, \theta)$ does not have a known closed form, $\hat{\theta}_{ML}$ is an infeasible estimator. However, it may be possible to define a random function such that

$$E_\nu\left[f(\nu, y_t, X_t, \theta)\right] = p(y_t\,|\,X_t, \theta),$$

where the density of $\nu$ is known. If this is the case, the simulator

$$\tilde{p}(y_t, X_t, \theta) = \frac{1}{H}\sum_{s=1}^{H}f(\nu_{ts}, y_t, X_t, \theta)$$

is unbiased for $p(y_t\,|\,X_t, \theta)$.

- The SML simply substitutes $\tilde{p}(y_t, X_t, \theta)$ in place of $p(y_t\,|\,X_t, \theta)$ in the log-likelihood function, that is

$$\hat{\theta}_{SML} = \arg\max s_n(\theta) = \frac{1}{n}\sum_{t=1}^{n}\ln\tilde{p}(y_t, X_t, \theta).$$

19.2.1. Example: multinomial probit. Recall that the utility of alternative $j$ is

$$u_j = X_j\beta + \varepsilon_j$$

and the vector $y$ is formed of elements

$$y_j = 1\left[u_j > u_k,\ k \in m,\ k \neq j\right].$$

The problem is that $\Pr(y_j = 1\,|\,\theta)$ can't be calculated when $m$ is larger than 4 or 5. However, it is easy to simulate this probability.

- Draw $\tilde{\varepsilon}_i$ from the distribution $N(0, \Omega)$.
- Calculate $\tilde{u}_i = X_i\beta + \tilde{\varepsilon}_i$ (where $X_i$ is the matrix formed by stacking the $X_{ij}$).
- Define $\tilde{y}_{ij} = 1\left[\tilde{u}_{ij} > \tilde{u}_{ik},\ \forall k \in m,\ k \neq j\right]$.
- Repeat this $H$ times and define

$$\tilde{\pi}_{ij} = \frac{\sum_{h=1}^{H}\tilde{y}_{ijh}}{H}.$$

- Define $\tilde{\pi}_i$ as the $m$-vector formed of the $\tilde{\pi}_{ij}$. Each element of $\tilde{\pi}_i$ is between 0 and 1, and the elements sum to one.
- Now

$$\tilde{p}(y_i, X_i, \theta) = y_i'\tilde{\pi}_i.$$

- The SML multinomial probit log-likelihood function is

$$\ln L(\beta, \Omega) = \frac{1}{n}\sum_{i=1}^{n}y_i'\ln\tilde{\pi}_i,$$

which equals $\frac{1}{n}\sum_i\ln\tilde{p}(y_i, X_i, \theta)$ since exactly one element of $y_i$ is one. This is to be maximized with respect to $\beta$ and $\Omega$. A sketch of the frequency simulator appears below.
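A minimal Python sketch of this frequency simulator (the function name and the example dimensions are illustrative assumptions):

import numpy as np

def probit_frequency_simulator(X, beta, draws):
    """Simulated choice probabilities pi_tilde for one observation.

    X is (m, K); draws is (H, m), drawn once from N(0, Omega) and held
    fixed across iterations, as the notes below require for convergence.
    """
    u = X @ beta + draws                 # (H, m) simulated utilities
    winners = np.argmax(u, axis=1)       # chosen alternative in each draw
    H, m = draws.shape
    return np.bincount(winners, minlength=m)/H   # in [0,1], sums to one

rng = np.random.default_rng(5)
m, K, H = 4, 3, 1000
X = rng.standard_normal((m, K))
beta = np.array([0.5, -0.2, 0.1])
L = np.linalg.cholesky(np.eye(m))            # Omega = I here, for simplicity
draws = rng.standard_normal((H, m)) @ L.T    # fixed N(0, Omega) draws
print(probit_frequency_simulator(X, beta, draws))

Because the indicator of the maximal utility is a step function of the parameters, this simulated probability is discontinuous in $\beta$ and $\Omega$, which is exactly the problem discussed in the notes that follow.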

Notes:

- The $H$ draws of $\tilde{\varepsilon}_i$ are drawn only once and are used repeatedly during the iterations used to find $\hat{\beta}$ and $\hat{\Omega}$. The draws are different for each $i$. If the $\tilde{\varepsilon}_i$ are re-drawn at every iteration the estimator will not converge.
- The log-likelihood function with this simulator is a discontinuous function of $\beta$ and $\Omega$. This does not cause problems from a theoretical point of view, since it can be shown that $\ln L(\beta, \Omega)$ is stochastically equicontinuous. However, it does cause problems if one attempts to use a gradient-based optimization method such as Newton-Raphson.
- It may be the case, particularly if few simulations, $H$, are used, that some elements of $\tilde{\pi}_i$ are zero. If the corresponding element of $y_i$ is equal to 1, there will be a $\log(0)$ problem.
- Solutions to discontinuity:
  – 1) Use an estimation method that doesn't require a continuous and differentiable objective function, for example, simulated annealing. This is computationally costly.
  – 2) Smooth the simulated probabilities so that they are continuous functions of the parameters. For example, apply a kernel transformation such as

$$\tilde{y}_{ij} = \Phi\left(A\left[u_{ij} - \max_{k=1}^{m}u_{ik}\right]\right) + .5 \times 1\left[u_{ij} = \max_{k=1}^{m}u_{ik}\right],$$

where $A$ is a large positive number. This approximates a step function such that $\tilde{y}_{ij}$ is very close to zero if $u_{ij}$ is not the maximum, and $\tilde{y}_{ij} = 1$ if it is the maximum. This makes $\tilde{y}_{ij}$ a continuous function of $\beta$ and $\Omega$, so that $\tilde{\pi}_{ij}$, and therefore $\ln L(\beta, \Omega)$, will be continuous and differentiable. Consistency requires that $A(n) \overset{p}{\to} \infty$, so that the approximation to a step function becomes arbitrarily close as the sample size increases. There are alternative methods (e.g., Gibbs sampling) that may work better, but this is too technical to discuss here.
- To solve the $\log(0)$ problem, one possibility is to search the web for the slog function. Also, increase $H$ if this is a serious problem.

19.2.2. Properties. The properties of the SML estimator depend on how $H$ is set. The following is taken from Lee (1995), "Asymptotic Bias in Simulated Maximum Likelihood Estimation of Discrete Choice Models," Econometric Theory, 11, pp. 437-483.

THEOREM 32. [Lee] 1) If $\lim_{n\to\infty}n^{1/2}/H = 0$, then

$$\sqrt{n}\left(\hat{\theta}_{SML} - \theta^0\right)\overset{d}{\to}N\left(0, \mathcal{I}^{-1}(\theta^0)\right).$$

2) If $\lim_{n\to\infty}n^{1/2}/H = \lambda$, $\lambda$ a finite constant, then

$$\sqrt{n}\left(\hat{\theta}_{SML} - \theta^0\right)\overset{d}{\to}N\left(B, \mathcal{I}^{-1}(\theta^0)\right),$$

where $B$ is a finite vector of constants.

- This means that the SML estimator is asymptotically biased if $H$ doesn't grow faster than $n^{1/2}$.
- The varcov is the typical inverse of the information matrix, so that as long as $H$ grows fast enough the estimator is consistent and fully asymptotically efficient.

19.3. Method of simulated moments (MSM)

Suppose we have a DGP $(y\,|\,x,\theta)$ which is simulable given $\theta$, but is such that the density of $y$ is not calculable.

One could, in principle, base a GMM estimator upon the moment conditions

$$m_t(\theta) = \left[K(y_t, x_t) - k(x_t, \theta)\right]z_t,$$

where

$$k(x_t, \theta) = \int K(y_t, x_t)\,p(y\,|\,x_t, \theta)\,dy,$$

$z_t$ is a vector of instruments in the information set, and $p(y\,|\,x_t, \theta)$ is the density of $y$ conditional on $x_t$. The problem is that this density is not available.

- However, $k(x_t, \theta)$ is readily simulated using

$$\tilde{k}(x_t, \theta) = \frac{1}{H}\sum_{h=1}^{H}K(\tilde{y}_t^h, x_t).$$

- By the law of large numbers, $\tilde{k}(x_t, \theta)\overset{a.s.}{\to}k(x_t, \theta)$ as $H\to\infty$, which provides a clear intuitive basis for the estimator, though in fact we obtain consistency even for $H$ finite, since a law of large numbers is also operating across the $n$ observations of real data, so errors introduced by simulation cancel themselves out.
- This allows us to form the moment conditions

(19.3.1)  $$\tilde{m}_t(\theta) = \left[K(y_t, x_t) - \tilde{k}(x_t, \theta)\right]z_t,$$

where $z_t$ is drawn from the information set. As before, form

(19.3.2)  $$\tilde{m}(\theta) = \frac{1}{n}\sum_{t=1}^{n}\tilde{m}_t(\theta) = \frac{1}{n}\sum_{t=1}^{n}\left[K(y_t, x_t) - \frac{1}{H}\sum_{h=1}^{H}K(\tilde{y}_t^h, x_t)\right]z_t,$$

with which we form the GMM criterion and estimate as usual. Note that the unbiased simulator $K(\tilde{y}_t^h, x_t)$ appears linearly within the sums.
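A minimal Python sketch of equation 19.3.2 (the simulator, the moment function $K$, and the instruments are illustrative placeholders, since they would be model-specific):

import numpy as np

def msm_moments(theta, y, x, z, simulate_y, K, H, rng):
    """Simulated moment conditions, equation 19.3.2.

    For each observation, the conditional moment k(x_t, theta) is
    replaced by an average of K over H simulated draws of y. In a real
    application the underlying random numbers are held fixed across
    theta values; this toy version redraws, for brevity.
    """
    n = len(y)
    mbar = np.zeros(z.shape[1])
    for t in range(n):
        y_sim = simulate_y(theta, x[t], H, rng)        # H draws of y | x_t
        k_tilde = np.mean([K(ys, x[t]) for ys in y_sim])
        mbar += (K(y[t], x[t]) - k_tilde)*z[t]
    return mbar/n

# toy illustration: y | x ~ N(theta*x, 1), K(y, x) = y, z_t = (1, x_t)
rng = np.random.default_rng(6)
n, theta0 = 200, 0.7
x = rng.standard_normal(n)
y = theta0*x + rng.standard_normal(n)
z = np.column_stack([np.ones(n), x])
sim = lambda th, xt, H, rng: th*xt + rng.standard_normal(H)
print(msm_moments(0.7, y, x, z, sim, lambda ys, xt: ys, H=10, rng=rng))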

19.3.1. Properties. Suppose that the optimal weighting matrix is used. McFadden (ref. above) and Pakes and Pollard (refs. above) show that the asymptotic distribution of the MSM estimator is very similar to that of the infeasible GMM estimator. In particular, assuming that the optimal weighting matrix is used, and for $H$ finite,

(19.3.3)  $$\sqrt{n}\left(\hat{\theta}_{MSM} - \theta^0\right)\overset{d}{\to}N\left[0, \left(1 + \frac{1}{H}\right)\left(D_\infty\Omega^{-1}D_\infty'\right)^{-1}\right],$$

where $\left(D_\infty\Omega^{-1}D_\infty'\right)^{-1}$ is the asymptotic variance of the infeasible GMM estimator.

- That is, the asymptotic variance is inflated by a factor $1 + 1/H$. For this reason the MSM estimator is not fully asymptotically efficient relative to the infeasible GMM estimator for $H$ finite, but the efficiency loss is small and controllable, by setting $H$ reasonably large.
- The estimator is asymptotically unbiased even for $H = 1$. This is an advantage relative to SML.
- If one doesn't use the optimal weighting matrix, the asymptotic varcov is just the ordinary GMM varcov, inflated by $1 + 1/H$.
- The above presentation is in terms of a specific moment condition based upon the conditional mean. Simulated GMM can be applied to moment conditions of any form.

19.3.2. Comments. Why is SML inconsistent if $H$ is finite, while MSM is consistent? The reason is that SML is based upon an average of logarithms of an unbiased simulator (the densities of the observations). To use the multinomial probit model as an example, the log-likelihood function is

$$\ln L(\beta,\Omega) = \frac{1}{n}\sum_{i=1}^n y_i' \ln \pi_i(\beta,\Omega).$$

The SML version is

$$\ln L(\beta,\Omega) = \frac{1}{n}\sum_{i=1}^n y_i' \ln \tilde\pi_i(\beta,\Omega).$$

The problem is that

$$E \ln\left(\tilde\pi_i(\beta,\Omega)\right) \ne \ln\left(E \tilde\pi_i(\beta,\Omega)\right),$$

in spite of the fact that $E \tilde\pi_i(\beta,\Omega) = \pi_i(\beta,\Omega)$, due to the fact that $\ln(\cdot)$ is a nonlinear transformation. The only way for the two to be equal (in the limit) is if $H$ tends to infinity so that $\tilde\pi(\cdot)$ tends to $\pi(\cdot)$.
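The Jensen's inequality effect is easy to check numerically. In this sketch (an illustrative frequency simulator for a single choice probability, bounded away from zero to dodge the log(0) problem mentioned earlier), $E\tilde\pi$ is close to $\pi$ while $E\ln\tilde\pi$ is well below $\ln\pi$:

% why averaging logs of an unbiased simulator creates bias
pi_true = 0.2; H = 10; reps = 100000;
% crude frequency simulator: mean of H Bernoulli(pi_true) draws, bounded
% below (which slightly perturbs unbiasedness) to avoid log(0)
pitilde = max(mean(rand(reps,H) < pi_true, 2), 1/(2*H));
printf("E pitilde:    %f  (pi = %f)\n", mean(pitilde), pi_true);
printf("E ln pitilde: %f  (ln pi = %f)\n", mean(log(pitilde)), log(pi_true));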

The reason that MSM does not suffer from this problem is that in this case the unbiased simulator appears linearly within every sum of terms, and it appears within a sum over $n$ (see equation [19.3.2]). Therefore the SLLN applies to cancel out simulation errors, from which we get consistency. That is, using simple notation for the random sampling case, the moment conditions

(19.3.4)
$$\tilde m(\theta) = \frac{1}{n}\sum_{t=1}^n \left[K(y_t,x_t) - \frac{1}{H}\sum_{h=1}^H K(\tilde y_t^h, x_t)\right] z_t$$

(19.3.5)
$$= \frac{1}{n}\sum_{t=1}^n \left[k(x_t,\theta^0) + \varepsilon_t - \frac{1}{H}\sum_{h=1}^H \left(k(x_t,\theta) + \tilde\varepsilon_{ht}\right)\right] z_t$$

converge almost surely to

$$\bar m_\infty(\theta) = \int \left[k(x,\theta^0) - k(x,\theta)\right] z(x)\, d\mu(x).$$

(Note: $z_t$ is assumed to be made up of functions of $x_t$.) The objective function converges to

$$s_\infty(\theta) = \bar m_\infty(\theta)'\, \Omega_\infty^{-1}\, \bar m_\infty(\theta),$$

which obviously has a minimum at $\theta^0$; hence consistency.

• If you look at equation 19.3.5 a bit, you will see why the variance inflation factor is $(1 + 1/H)$.


19.4. Efficient method of moments (EMM)

The choice of moments upon which to base a GMM estimator can have very pronounced effects upon the efficiency of the estimator.

• A poor choice of moment conditions may lead to very inefficient estimators, and can even cause identification problems (as we've seen with the GMM problem set).
• The drawback of the MSM approach above is that the moment conditions used in estimation are selected arbitrarily. The asymptotic efficiency of the estimator may be low.
• The asymptotically optimal choice of moments would be the score vector of the likelihood function,

$$m_t(\theta) = D_\theta \ln p_t(\theta \mid \mathcal{I}_t).$$

As before, this choice is unavailable.

The efficient method of moments (EMM) (see Gallant and Tauchen (1996), "Which Moments to Match?", Econometric Theory, 12, 657-681) seeks to provide moment conditions that closely mimic the score vector. If the approximation is very good, the resulting estimator will be very nearly fully efficient.

The DGP is characterized by random sampling from the density

$$p(y_t \mid x_t, \theta^0) \equiv p_t(\theta^0).$$


We can define an auxiliary model, called the "score generator", which simply provides a (misspecified) parametric density

$$f(y \mid x_t, \lambda) \equiv f_t(\lambda).$$

• This density is known up to a parameter $\lambda$. We assume that this density function is calculable. Therefore quasi-ML estimation is possible. Specifically,

$$\hat\lambda = \arg\max_{\Lambda}\ s_n(\lambda) = \frac{1}{n}\sum_{t=1}^n \ln f_t(\lambda).$$

• After determining $\hat\lambda$ we can calculate the score functions $D_\lambda \ln f(y_t \mid x_t, \hat\lambda)$.
• The important point is that even if the density is misspecified, there is a pseudo-true $\lambda^0$ for which the true expectation, taken with respect to the true but unknown density of $y$, $p(y \mid x_t, \theta^0)$, and then marginalized over $x$, is zero:

$$\exists \lambda^0 :\ E_X E_{Y|X}\left[ D_\lambda \ln f(y \mid x, \lambda^0) \right] = \int_X \int_{Y|X} D_\lambda \ln f(y \mid x, \lambda^0)\, p(y \mid x, \theta^0)\, dy\, d\mu(x) = 0.$$

• We have seen in the section on QML that $\hat\lambda \overset{p}{\to} \lambda^0$; this suggests using the moment conditions

(19.4.1)
$$m_n(\theta, \hat\lambda) = \frac{1}{n}\sum_{t=1}^n \int D_\lambda \ln f_t(\hat\lambda)\, p_t(\theta)\, dy.$$

• These moment conditions are not calculable, since $p_t(\theta)$ is not available, but they are simulable using

$$\tilde m_n(\theta, \hat\lambda) = \frac{1}{n}\sum_{t=1}^n \frac{1}{H}\sum_{h=1}^H D_\lambda \ln f(\tilde y_t^h \mid x_t, \hat\lambda),$$

where $\tilde y_t^h$ is a draw from $DGP(\theta)$, holding $x_t$ fixed. By the LLN and the fact that $\hat\lambda$ converges to $\lambda^0$,

$$\bar m_\infty(\theta^0, \lambda^0) = 0.$$

This is not the case for other values of $\theta$, assuming that $\lambda^0$ is identified.
• The advantage of this procedure is that if $f(y_t \mid x_t, \lambda)$ closely approximates $p(y \mid x_t, \theta)$, then $\tilde m_n(\theta, \hat\lambda)$ will closely approximate the optimal moment conditions which characterize maximum likelihood estimation, which is fully efficient.
• If one has prior information that a certain density approximates the data well, it would be a good choice for $f(\cdot)$.
• If one has no density in mind, there exist good ways of approximating unknown distributions parametrically: Phillips' ERA's (Econometrica, 1983) and Gallant and Nychka's (Econometrica, 1987) SNP density estimator, which we saw before. Since the SNP density is consistent, the efficiency of the indirect estimator is the same as the infeasible ML estimator. (A minimal Octave sketch of these steps follows below.)
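To make the steps above concrete, here is a minimal Octave sketch of EMM under simplifying assumptions; the toy model, the Gaussian score generator, and all names are illustrative, not the text's example programs. The identity weighting matrix is used here; the efficient choice is the topic of the next subsection.

% emm_sketch.m : a minimal EMM illustration (hypothetical toy model)
% DGP: y = theta0*x + e, with e a scale mixture of normals -- simulable,
% but treat its density as unknown. Score generator: the (misspecified)
% Gaussian regression y|x ~ N(lambda1*x, lambda2^2).
n = 500; H = 20; theta0 = 1.0;
x = randn(n,1);
mix = @(u,v) u .* (1 + 2*(v < 0.2));         % errors: N(0,1) w.p. 0.8, N(0,9) w.p. 0.2
y = theta0*x + mix(randn(n,1), rand(n,1));   % the "observed" data
% Step 1: quasi-ML estimation of the score generator on the real data
lam1 = x \ y;                                % QML slope
lam2 = std(y - x*lam1);                      % QML error standard deviation
% Step 2: average the auxiliary scores over simulated data, holding
% lambda fixed at the QML estimates; theta*x + E is n x H, and Octave
% broadcasting handles the n x 1 terms
E = mix(randn(n,H), rand(n,H));              % fixed simulation draws
mfun = @(theta) [mean(mean(((theta*x + E) - lam1*x) .* x)) / lam2^2; ...
                 mean(mean(-1/lam2 + ((theta*x + E) - lam1*x).^2 / lam2^3))];
% Step 3: minimize the GMM criterion (identity weighting, for simplicity)
obj = @(theta) mfun(theta)' * mfun(theta);
theta_hat = fminsearch(obj, 0.5);
printf("EMM estimate: %f (true value: %f)\n", theta_hat, theta0);

Since $\dim(\lambda) = 2 > \dim(\theta) = 1$, the model is overidentified, and the minimized criterion can feed the diagnostic test discussed in section 19.4.3.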

19.4.1. Optimal weighting matrix. I will present the theory for $H$ finite, and possibly small. This is done because it is sometimes impractical to estimate with $H$ very large. Gallant and Tauchen give the theory for the case of $H$ so large that it may be treated as infinite (the difference being irrelevant given the numerical precision of a computer). The theory for the case of $H$ infinite follows directly from the results presented here.


The moment condition $\tilde m(\theta, \hat\lambda)$ depends on the pseudo-ML estimate $\hat\lambda$. We can apply Theorem 22 to conclude that

(19.4.2)
$$\sqrt{n}\left(\hat\lambda - \lambda^0\right) \overset{d}{\to} N\left[0,\ \mathcal{J}(\lambda^0)^{-1}\mathcal{I}(\lambda^0)\mathcal{J}(\lambda^0)^{-1}\right].$$

If the density $f(y_t \mid x_t, \hat\lambda)$ were in fact the true density $p(y \mid x_t, \theta)$, then $\hat\lambda$ would be the maximum likelihood estimator, and $\mathcal{J}(\lambda^0)^{-1}\mathcal{I}(\lambda^0)$ would be an identity matrix, due to the information matrix equality. However, in the present case we assume that $f(y_t \mid x_t, \hat\lambda)$ is only an approximation to $p(y \mid x_t, \theta)$, so there is no cancellation.

Recall that $\mathcal{J}(\lambda^0) \equiv \operatorname{plim}\left( \frac{\partial^2}{\partial\lambda\,\partial\lambda'} s_n(\lambda^0) \right)$. Comparing the definition of $s_n(\lambda)$ with the definition of the moment condition in Equation 19.4.1, we see that

$$\mathcal{J}(\lambda^0) = D_{\lambda'}\, m(\theta^0, \lambda^0).$$

As in Theorem 22,

$$\mathcal{I}(\lambda^0) = \lim_{n\to\infty} E\left[ n\, \frac{\partial s_n(\lambda)}{\partial\lambda}\bigg|_{\lambda^0} \frac{\partial s_n(\lambda)}{\partial\lambda'}\bigg|_{\lambda^0} \right].$$

In this case, this is simply the asymptotic variance-covariance matrix of the moment conditions, $\Omega$. Now take a first order Taylor's series approximation to $\sqrt{n}\,\tilde m_n(\theta^0, \hat\lambda)$ about $\lambda^0$:

$$\sqrt{n}\,\tilde m_n(\theta^0, \hat\lambda) = \sqrt{n}\,\tilde m_n(\theta^0, \lambda^0) + \sqrt{n}\, D_{\lambda'}\tilde m(\theta^0, \lambda^0)\left(\hat\lambda - \lambda^0\right) + o_p(1).$$

First consider $\sqrt{n}\,\tilde m_n(\theta^0, \lambda^0)$. It is straightforward but somewhat tedious to show that the asymptotic variance of this term is $\frac{1}{H}\mathcal{I}_\infty(\lambda^0)$.


Next consider the second term $\sqrt{n}\, D_{\lambda'}\tilde m(\theta^0,\lambda^0)\left(\hat\lambda - \lambda^0\right)$. Note that $D_{\lambda'}\tilde m_n(\theta^0,\lambda^0) \overset{a.s.}{\to} \mathcal{J}(\lambda^0)$, so we have

$$\sqrt{n}\, D_{\lambda'}\tilde m(\theta^0,\lambda^0)\left(\hat\lambda-\lambda^0\right) = \sqrt{n}\,\mathcal{J}(\lambda^0)\left(\hat\lambda-\lambda^0\right), \quad a.s.$$

But noting equation 19.4.2,

$$\sqrt{n}\,\mathcal{J}(\lambda^0)\left(\hat\lambda-\lambda^0\right) \overset{a}{\sim} N\left[0,\ \mathcal{I}(\lambda^0)\right].$$

Now, combining the results for the first and second terms,

$$\sqrt{n}\,\tilde m_n(\theta^0,\hat\lambda) \overset{a}{\sim} N\left[0,\ \left(1+\frac{1}{H}\right)\mathcal{I}(\lambda^0)\right].$$

Suppose that $\hat{\mathcal{I}}(\lambda^0)$ is a consistent estimator of the asymptotic variance-covariance matrix of the moment conditions. This may be complicated if the score generator is a poor approximator, since the individual score contributions may not have mean zero in this case (see the section on QML). Even if this is the case, the individual means can be calculated by simulation, so it is always possible to consistently estimate $\mathcal{I}(\lambda^0)$ when the model is simulable. On the other hand, if the score generator is taken to be correctly specified, the ordinary estimator of the information matrix is consistent. Combining this with the result on the efficient GMM weighting matrix in Theorem 25, we see that defining $\hat\theta$ as

$$\hat\theta = \arg\min_{\Theta}\ m_n(\theta,\hat\lambda)' \left[ \left(1+\frac{1}{H}\right) \hat{\mathcal{I}}(\lambda^0) \right]^{-1} m_n(\theta,\hat\lambda)$$

is the GMM estimator with the efficient choice of weighting matrix.

• If one has used the Gallant-Nychka ML estimator as the auxiliary model, the appropriate weighting matrix is simply the information matrix of the auxiliary model, since the scores are uncorrelated. (E.g., it really is ML estimation asymptotically, since the score generator can approximate the unknown density arbitrarily well.)

19.4.2. Asymptotic distribution. Since we use the optimal weighting matrix, the asymptotic distribution is as in Equation 15.4.1, so we have (using the result in Equation 19.4.2):

$$\sqrt{n}\left(\hat\theta - \theta^0\right) \overset{d}{\to} N\left[0,\ \left( D_\infty \left[\left(1+\frac{1}{H}\right)\mathcal{I}(\lambda^0)\right]^{-1} D_\infty' \right)^{-1} \right],$$

where

$$D_\infty = \lim_{n\to\infty} E\left[ D_\theta\, m_n'(\theta^0, \lambda^0) \right].$$

This can be consistently estimated using

$$\hat D = D_\theta\, m_n'(\hat\theta, \hat\lambda).$$

19.4.3. Diagnostic testing. The fact that

$$\sqrt{n}\, m_n(\theta^0, \hat\lambda) \overset{a}{\sim} N\left[0,\ \left(1+\frac{1}{H}\right)\mathcal{I}(\lambda^0)\right]$$

implies that

$$n\, m_n(\hat\theta,\hat\lambda)' \left[ \left(1+\frac{1}{H}\right)\mathcal{I}(\hat\lambda) \right]^{-1} m_n(\hat\theta,\hat\lambda) \overset{a}{\sim} \chi^2(q),$$

where $q$ is $\dim(\lambda) - \dim(\theta)$, since without $\dim(\theta)$ moment conditions the model is not identified, so testing is impossible. One test of the model is simply based on this statistic: if it exceeds the $\chi^2(q)$ critical point, something may be wrong (the small sample performance of this sort of test would be a topic worth investigating).

• Information about what is wrong can be gotten from the pseudo-t-statistics:

$$\left( \operatorname{diag}\left[ \left(1+\frac{1}{H}\right)\mathcal{I}(\hat\lambda) \right]^{1/2} \right)^{-1} \sqrt{n}\, m_n(\hat\theta,\hat\lambda)$$

can be used to test which moments are not well modeled. Since these moments are related to parameters of the score generator, which are usually related to certain features of the model, this information can be used to revise the model. These aren't actually distributed as $N(0,1)$, since $\sqrt{n}\, m_n(\theta^0,\hat\lambda)$ and $\sqrt{n}\, m_n(\hat\theta,\hat\lambda)$ have different distributions (that of $\sqrt{n}\, m_n(\hat\theta,\hat\lambda)$ is somewhat more complicated). It can be shown that the pseudo-t statistics are biased toward nonrejection. See Gourieroux et al. or Gallant and Long (1995) for more details.

19.5. Example: estimation of stochastic differential equations

It is often convenient to formulate theoretical models in terms of differen-

tial equations, and when the observation frequency is high (e.g., weekly, daily,

hourly or real-time) it may be more natural to adopt this framework for econo-

metric models of time series.

The most common approach to estimation of stochastic differential equa-

tions is to “discretize” the model, as above, and estimate using the discretized

version. However, since the discretization is only an approximation to the true

discrete-time version of the model (which is not calculable), the resulting esti-

mator is in general biased and inconsistent.

An alternative is to use indirect inference: the discretized model is used as the score generator. That is, one estimates by QML to obtain the scores of the discretized approximation:

$$y_t - y_{t-1} = g(\phi, y_{t-1}) + h(\phi, y_{t-1})\,\varepsilon_t$$
$$\varepsilon_t \sim N(0,1).$$

Indicate these scores by $m_n(\theta, \hat\phi)$. Then the system of stochastic differential equations

$$dy_t = g(\theta, y_t)\,dt + h(\theta, y_t)\,dW_t$$

is simulated over $\theta$, and the scores are calculated and averaged over the simulations:

$$\tilde m_n(\theta, \hat\phi) = \frac{1}{N}\sum_{i=1}^N m_{in}(\theta, \hat\phi).$$

$\hat\theta$ is chosen to set the simulated scores to zero,

$$\tilde m_n(\hat\theta, \hat\phi) \equiv 0$$

(since $\theta$ and $\phi$ are of the same dimension).

This method requires simulating the stochastic differential equation. There are many ways of doing this. Basically, they involve doing very fine discretizations:

$$y_{t+\tau} = y_t + g(\theta, y_t)\,\tau + h(\theta, y_t)\,\nu_t$$
$$\nu_t \sim N(0, \tau).$$

By setting $\tau$ very small, the cumulated sequence of $\nu_t$ approximates a Brownian motion fairly well.
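For instance, a minimal Octave sketch of such a fine discretization, for a hypothetical mean-reverting model (the drift and diffusion functions and the parameter values are illustrative assumptions):

% euler_sketch.m : simulate dy = g(theta,y) dt + h(theta,y) dW on a fine grid
theta = [2; 0.5; 0.3];                  % hypothetical [reversion speed; long-run mean; volatility]
g = @(y) theta(1) * (theta(2) - y);     % drift
h = @(y) theta(3);                      % diffusion (constant, for simplicity)
tau = 1/1000; T = 5; steps = round(T/tau);
y = zeros(steps+1, 1);
y(1) = theta(2);                        % start at the long-run mean
for t = 1:steps
  nu = sqrt(tau) * randn;               % nu_t ~ N(0, tau)
  y(t+1) = y(t) + g(y(t))*tau + h(y(t))*nu;
end
% sampling the path y at the observation frequency gives the simulated
% data over which the scores are averaged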


This is only one method of using indirect inference for estimation of differential equations. There are others (see Gallant and Long, 1995 and Gourieroux et al.). Use of a series approximation to the transitional density as in Gallant and Long is an interesting possibility, since the score generator may have a higher dimensional parameter than the model, which allows for diagnostic testing. In the method described above the score generator's parameter $\phi$ is of the same dimension as is $\theta$, so diagnostic testing is not possible.


CHAPTER 20

Parallel programming for econometrics

In this chapter we’ll see how commonly used computations in economet-

rics can be done in parallel on a cluster of computers.


CHAPTER 21

Introduction to Octave

Why is Octave being used here, since it's not that well-known by econometricians? Well, because it is a high-quality environment that is easily extensible, uses well-tested and high-performance numerical libraries, is licensed under the GNU GPL (so you can get it for free and modify it if you like), and runs on GNU/Linux, Mac OS X and Windows systems. It's also quite easy to learn.

21.1. Getting started

Get the bootable CD, as was described in Section 1.3. Then burn the image,

and boot your computer with it. This will give you this same PDF file, but with

all of the example programs ready to run. The editor is configured with a macro

to execute the programs using Octave, which is of course installed. From this

point, I assume you are running the CD (or sitting in the computer room across

the hall from my office), or that you have configured your computer to be able

to run the *.m files mentioned below.

21.2. A short introduction

The objective of this introduction is to learn just the basics of Octave. There

are other ways to use Octave, which I encourage you to explore. These are just

some rudiments. After this, you can look at the example programs scattered

throughout the document (and edit them, and run them) to learn more about

how Octave can be used to do econometrics. Students of mine: your problem sets will include exercises that can be done by modifying the example programs in relatively minor ways. So study the examples!

FIGURE 21.2.1. Running an Octave program

Octave can be used interactively, or it can be used to run programs that are

written using a text editor. We’ll use this second method, preparing programs

with NEdit, and calling Octave from within the editor. The program first.m

gets us started. To run this, open it up with NEdit (by finding the correct

file inside the /home/knoppix/Desktop/Econometrics folder and click-

ing on the icon) and then type CTRL-ALT-o, or use the Octave item in the Shell

menu (see Figure 21.2.1).


Note that the output is not formatted in a pleasing way. That’s because

printf() doesn’t automatically start a new line. Edit first.m so that the

8th line reads printf("hello world\n"); and re-run the program.

We need to know how to load and save data. The program second.m

shows how. Once you have run this, you will find the file ”x” in the directory

Econometrics/Include/OctaveIntro/ You might have a look at it with

NEdit to see Octave’s default format for saving data. Basically, if you have

data in an ASCII text file, named for example ”myfile.data”, formed of

numbers separated by spaces, just use the command ”load myfile.data”.

After having done so, the matrix ”myfile” (without extension) will contain

the data.
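For example, a minimal round trip (the file name is arbitrary):

% save a matrix as plain ASCII text, then load it back
x = [1 2 3; 4 5 6];
save -ascii mydata.data x      % writes the numbers, separated by spaces
clear x
load mydata.data               % creates the matrix "mydata" (name taken from the file)
disp(mydata)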

Please have a look at CommonOperations.m for examples of how to do

some basic things in Octave. Now that we’re done with the basics, have a look

at the Octave programs that are included as examples. If you are looking at

the browsable PDF version of this document, then you should be able to click

on links to open them. If not, the example programs are available here and the

support files needed to run these are available here. Those pages will allow

you to examine individual files, out of context. To actually use these files (edit

and run them), you should go to the home page of this document, since you

will probably want to download the pdf version together with all the support

files and examples. Or get the bootable CD.

There are some other resources for doing econometrics with Octave. You

might like to check the article Econometrics with Octave and the Econometrics Toolbox,

which is for Matlab, but much of which could be easily used with Octave.


21.3. If you’re running a Linux installation...

Then to get the same behavior as found on the CD, you need to:

• Get the collection of support programs and the examples, from the document home page.
• Put them somewhere, and tell Octave how to find them, e.g., by putting a link to the MyOctaveFiles directory in /usr/local/share/octave/site-m
• Make sure NEdit is installed and configured to run Octave and use syntax highlighting. Copy the file /home/econometrics/.nedit from the CD to do this. Or, get the file NeditConfiguration and save it in your $HOME directory with the name ".nedit". Not to put too fine a point on it, please note that there is a period in that name.
• Associate *.m files with NEdit so that they open up in the editor when you click on them.

That should do it.


CHAPTER 22

Notation and Review

• All vectors will be column vectors, unless they have a transpose symbol (or I forget to apply this rule - your help catching typos and errors is much appreciated). For example, if $x_t$ is a $k \times 1$ vector, $x_t'$ is a $1 \times k$ vector. When I refer to a $k$-vector, I mean a column vector.

22.1. Notation for differentiation of vectors and matrices

[3, Chapter 1]

Let $s(\cdot): \mathbb{R}^p \to \mathbb{R}$ be a real valued function of the $p$-vector $\theta$. Then $\frac{\partial s(\theta)}{\partial\theta}$ is organized as a $p$-vector,

$$\frac{\partial s(\theta)}{\partial\theta} = \begin{bmatrix} \frac{\partial s(\theta)}{\partial\theta_1} \\ \frac{\partial s(\theta)}{\partial\theta_2} \\ \vdots \\ \frac{\partial s(\theta)}{\partial\theta_p} \end{bmatrix}.$$

Following this convention, $\frac{\partial s(\theta)}{\partial\theta'}$ is a $1\times p$ vector, and $\frac{\partial^2 s(\theta)}{\partial\theta\,\partial\theta'}$ is a $p\times p$ matrix. Also,

$$\frac{\partial^2 s(\theta)}{\partial\theta\,\partial\theta'} = \frac{\partial}{\partial\theta}\left(\frac{\partial s(\theta)}{\partial\theta'}\right) = \frac{\partial}{\partial\theta'}\left(\frac{\partial s(\theta)}{\partial\theta}\right).$$

EXERCISE 33. For $a$ and $x$ both $p$-vectors, show that $\frac{\partial a'x}{\partial x} = a$.

Let $f(\theta): \mathbb{R}^p \to \mathbb{R}^n$ be an $n$-vector valued function of the $p$-vector $\theta$. Let $f(\theta)'$ be the $1\times n$ valued transpose of $f$. Then

$$\left(\frac{\partial}{\partial\theta} f(\theta)'\right)' = \frac{\partial}{\partial\theta'} f(\theta).$$

• Product rule: Let $f(\theta): \mathbb{R}^p \to \mathbb{R}^n$ and $h(\theta): \mathbb{R}^p \to \mathbb{R}^n$ be $n$-vector valued functions of the $p$-vector $\theta$. Then

$$\frac{\partial}{\partial\theta'}\, h(\theta)'f(\theta) = h'\left(\frac{\partial}{\partial\theta'} f\right) + f'\left(\frac{\partial}{\partial\theta'} h\right)$$

has dimension $1\times p$. Applying the transposition rule we get

$$\frac{\partial}{\partial\theta}\, h(\theta)'f(\theta) = \left(\frac{\partial}{\partial\theta} f'\right) h + \left(\frac{\partial}{\partial\theta} h'\right) f,$$

which has dimension $p\times 1$.

EXERCISE 34. For $A$ a $p\times p$ matrix and $x$ a $p\times 1$ vector, show that $\frac{\partial x'Ax}{\partial x} = (A+A')x$.

• Chain rule: Let $f(\cdot): \mathbb{R}^p \to \mathbb{R}^n$ be an $n$-vector valued function of a $p$-vector argument, and let $g(\rho): \mathbb{R}^r \to \mathbb{R}^p$ be a $p$-vector valued function of an $r$-vector valued argument $\rho$. Then

$$\frac{\partial}{\partial\rho'} f\left[g(\rho)\right] = \frac{\partial}{\partial\theta'} f(\theta)\bigg|_{\theta=g(\rho)} \frac{\partial}{\partial\rho'} g(\rho)$$

has dimension $n\times r$.

EXERCISE 35. For $x$ and $\beta$ both $p\times 1$ vectors, show that $\frac{\partial \exp(x'\beta)}{\partial\beta} = \exp(x'\beta)\,x$.
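As a quick numeric check of Exercise 33 (a standalone sketch using central differences; the numgradient function mentioned in the exercises below does this more generally):

% verify d(a'x)/dx = a by central finite differences
p = 4; a = randn(p,1); x = randn(p,1);
f = @(x) a' * x;
g = zeros(p,1); d = 1e-6;
for i = 1:p
  e = zeros(p,1); e(i) = d;
  g(i) = (f(x+e) - f(x-e)) / (2*d);   % i-th partial derivative
end
disp([g a])                            % the two columns should agree to high precision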

22.2. Convergence modes

Readings: [1, Chapter 4]; [4, Chapter 4].

We will consider several modes of convergence. The first three modes dis-

cussed are simply for background. The stochastic modes are those which will

be used later in the course.


DEFINITION 36. A sequence is a mapping from the natural numbers $\{1, 2, \ldots\} = \{n\}_{n=1}^{\infty} = \{n\}$ to some other set, so that the set is ordered according to the natural numbers associated with its elements.

Real-valued sequences:

DEFINITION 37. [Convergence] A real-valued sequence of vectors $\{a_n\}$ converges to the vector $a$ if for any $\varepsilon > 0$ there exists an integer $N_\varepsilon$ such that for all $n > N_\varepsilon$, $\|a_n - a\| < \varepsilon$. $a$ is the limit of $a_n$, written $a_n \to a$.

Deterministic real-valued functions. Consider a sequence of functions $\{f_n(\omega)\}$ where

$$f_n : \Omega \to T \subseteq \mathbb{R}.$$

$\Omega$ may be an arbitrary set.

DEFINITION 38. [Pointwise convergence] A sequence of functions $\{f_n(\omega)\}$ converges pointwise on $\Omega$ to the function $f(\omega)$ if for all $\varepsilon > 0$ and $\omega \in \Omega$ there exists an integer $N_{\varepsilon\omega}$ such that

$$|f_n(\omega) - f(\omega)| < \varepsilon, \quad \forall n > N_{\varepsilon\omega}.$$

It's important to note that $N_{\varepsilon\omega}$ depends upon $\omega$, so that convergence may be much more rapid for certain $\omega$ than for others. Uniform convergence requires a similar rate of convergence throughout $\Omega$.

DEFINITION 39. [Uniform convergence] A sequence of functions $\{f_n(\omega)\}$ converges uniformly on $\Omega$ to the function $f(\omega)$ if for any $\varepsilon > 0$ there exists an integer $N$ such that

$$\sup_{\omega \in \Omega} |f_n(\omega) - f(\omega)| < \varepsilon, \quad \forall n > N.$$


(insert a diagram here showing the envelope around $f(\omega)$ in which $f_n(\omega)$ must lie)

Stochastic sequences. In econometrics, we typically deal with stochastic sequences. Given a probability space $(\Omega, \mathcal{F}, P)$, recall that a random variable maps the sample space to the real line, i.e., $X(\omega): \Omega \to \mathbb{R}$. A sequence of random variables $\{X_n(\omega)\}$ is a collection of such mappings, i.e., each $X_n(\omega)$ is a random variable with respect to the probability space $(\Omega, \mathcal{F}, P)$. For example, given the model $y = X\beta^0 + \varepsilon$, the OLS estimator $\hat\beta_n = (X'X)^{-1}X'y$, where $n$ is the sample size, can be used to form a sequence of random vectors $\{\hat\beta_n\}$.

A number of modes of convergence are in use when dealing with sequences of random variables. Several such modes of convergence should already be familiar:

DEFINITION 40. [Convergence in probability] Let $X_n(\omega)$ be a sequence of random variables, and let $X(\omega)$ be a random variable. Let $A_n = \{\omega : |X_n(\omega) - X(\omega)| > \varepsilon\}$. Then $\{X_n(\omega)\}$ converges in probability to $X(\omega)$ if

$$\lim_{n\to\infty} P(A_n) = 0, \quad \forall \varepsilon > 0.$$

Convergence in probability is written as $X_n \overset{p}{\to} X$, or $\operatorname{plim} X_n = X$.

DEFINITION 41. [Almost sure convergence] Let $X_n(\omega)$ be a sequence of random variables, and let $X(\omega)$ be a random variable. Let $A = \{\omega : \lim_{n\to\infty} X_n(\omega) = X(\omega)\}$. Then $\{X_n(\omega)\}$ converges almost surely to $X(\omega)$ if

$$P(A) = 1.$$

In other words, $X_n(\omega) \to X(\omega)$ (ordinary convergence of the two functions) except on a set $C = \Omega - A$ such that $P(C) = 0$. Almost sure convergence is


written as $X_n \overset{a.s.}{\to} X$, or $X_n \to X$, a.s. One can show that

$$X_n \overset{a.s.}{\to} X \;\Rightarrow\; X_n \overset{p}{\to} X.$$

DEFINITION 42. [Convergence in distribution] Let the r.v. $X_n$ have distribution function $F_n$ and the r.v. $X$ have distribution function $F$. If $F_n \to F$ at every continuity point of $F$, then $X_n$ converges in distribution to $X$.

Convergence in distribution is written as $X_n \overset{d}{\to} X$. It can be shown that convergence in probability implies convergence in distribution.

Stochastic functions. Simple laws of large numbers (LLN's) allow us to directly conclude that $\hat\beta_n \overset{a.s.}{\to} \beta^0$ in the OLS example, since

$$\hat\beta_n = \beta^0 + \left(\frac{X'X}{n}\right)^{-1}\left(\frac{X'\varepsilon}{n}\right),$$

and $\frac{X'\varepsilon}{n} \overset{a.s.}{\to} 0$ by a SLLN. Note that this term is not a function of the parameter $\beta$. This easy proof is a result of the linearity of the model, which allows us to express the estimator in a way that separates parameters from random functions. In general, this is not possible. We often deal with the more complicated situation where the stochastic sequence depends on parameters in a manner that is not reducible to a simple sequence of random variables. In this case, we have a sequence of random functions that depend on $\theta$: $\{X_n(\omega, \theta)\}$, where each $X_n(\omega,\theta)$ is a random variable with respect to a probability space $(\Omega, \mathcal{F}, P)$ and the parameter $\theta$ belongs to a parameter space $\theta \in \Theta$.


DEFINITION 43. [Uniform almost sure convergence] $\{X_n(\omega,\theta)\}$ converges uniformly almost surely in $\Theta$ to $X(\omega,\theta)$ if

$$\lim_{n\to\infty} \sup_{\theta\in\Theta} \|X_n(\omega,\theta) - X(\omega,\theta)\| = 0 \quad \text{(a.s.)}$$

Implicit is the assumption that all $X_n(\omega,\theta)$ and $X(\omega,\theta)$ are random variables w.r.t. $(\Omega,\mathcal{F},P)$ for all $\theta\in\Theta$. We'll indicate uniform almost sure convergence by $\overset{u.a.s.}{\to}$ and uniform convergence in probability by $\overset{u.p.}{\to}$.

• An equivalent definition, based on the fact that "almost sure" means "with probability one", is

$$P\left( \lim_{n\to\infty} \sup_{\theta\in\Theta} \|X_n(\omega,\theta) - X(\omega,\theta)\| = 0 \right) = 1.$$

This has a form similar to that of the definition of a.s. convergence - the essential difference is the addition of the $\sup$.

22.3. Rates of convergence and asymptotic equality

It’s often useful to have notation for the relative magnitudes of quantities.

Quantities that are small relative to others can often be ignored, which simpli-

fies analysis.

DEFINITION 44. [Little-o] Let $f(n)$ and $g(n)$ be two real-valued functions. The notation $f(n) = o(g(n))$ means $\lim_{n\to\infty} \frac{f(n)}{g(n)} = 0$.

DEFINITION 45. [Big-O] Let $f(n)$ and $g(n)$ be two real-valued functions. The notation $f(n) = O(g(n))$ means there exists some $N$ such that for $n > N$, $\left|\frac{f(n)}{g(n)}\right| < K$, where $K$ is a finite constant.

This definition doesn't require that $\frac{f(n)}{g(n)}$ have a limit (it may fluctuate boundedly).


If $\{f_n\}$ and $\{g_n\}$ are sequences of random variables, analogous definitions are

DEFINITION 46. The notation $f(n) = o_p(g(n))$ means $\frac{f(n)}{g(n)} \overset{p}{\to} 0$.

EXAMPLE 47. The least squares estimator $\hat\theta = (X'X)^{-1}X'Y = (X'X)^{-1}X'(X\theta^0 + \varepsilon) = \theta^0 + (X'X)^{-1}X'\varepsilon$. Since $\operatorname{plim}\frac{(X'X)^{-1}X'\varepsilon}{1} = 0$, we can write $(X'X)^{-1}X'\varepsilon = o_p(1)$ and $\hat\theta = \theta^0 + o_p(1)$. Asymptotically, the term $o_p(1)$ is negligible. This is just a way of indicating that the LS estimator is consistent.

DEFINITION 48. The notation $f(n) = O_p(g(n))$ means there exists some $N_\varepsilon$ such that for $\varepsilon > 0$ and all $n > N_\varepsilon$,

$$P\left( \left|\frac{f(n)}{g(n)}\right| < K_\varepsilon \right) > 1 - \varepsilon,$$

where $K_\varepsilon$ is a finite constant.

EXAMPLE 49. If $X_n \sim N(0,1)$ then $X_n = O_p(1)$, since, given $\varepsilon$, there is always some $K_\varepsilon$ such that $P(|X_n| < K_\varepsilon) > 1 - \varepsilon$.

Useful rules:

• $O_p(n^p)\, O_p(n^q) = O_p(n^{p+q})$
• $o_p(n^p)\, o_p(n^q) = o_p(n^{p+q})$

EXAMPLE 50. Consider a random sample of iid r.v.'s with mean 0 and variance $\sigma^2$. The estimator of the mean $\hat\theta = \frac{1}{n}\sum_{i=1}^n x_i$ is asymptotically normally distributed, e.g., $n^{1/2}\hat\theta \overset{a}{\sim} N(0, \sigma^2)$. So $n^{1/2}\hat\theta = O_p(1)$, so $\hat\theta = O_p(n^{-1/2})$. Before we had $\hat\theta = o_p(1)$; now we have the stronger result that relates the rate of convergence to the sample size.

EXAMPLE 51. Now consider a random sample of iid r.v.'s with mean $\mu$ and variance $\sigma^2$. The estimator of the mean $\hat\theta = \frac{1}{n}\sum_{i=1}^n x_i$ is asymptotically normally distributed, e.g., $n^{1/2}(\hat\theta - \mu) \overset{a}{\sim} N(0,\sigma^2)$. So $n^{1/2}(\hat\theta - \mu) = O_p(1)$, so $\hat\theta - \mu = O_p(n^{-1/2})$, so $\hat\theta = O_p(1)$.

These two examples show that averages of centered (mean zero) quantities typically have plim 0, while averages of uncentered quantities have finite nonzero plims. Note that the definition of $O_p$ does not mean that $f(n)$ and $g(n)$ are of the same order. Asymptotic equality ensures that this is the case.

DEFINITION 52. Two sequences of random variables $\{f_n\}$ and $\{g_n\}$ are asymptotically equal (written $f_n \overset{a}{=} g_n$) if

$$\operatorname{plim}\left( \frac{f(n)}{g(n)} \right) = 1.$$

Finally, analogous almost sure versions of $o_p$ and $O_p$ are defined in the obvious way.
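A short simulation illustrates Example 51's rates: the deviation $\hat\theta - \mu$ shrinks with $n$, while $\sqrt{n}(\hat\theta - \mu)$ has a standard deviation that is stable as $n$ grows, i.e., it is $O_p(1)$:

% rates_sketch.m : the sqrt(n) rate of the sample mean, by simulation
mu = 3; reps = 1000;
for n = [100 10000]
  xbar = mean(mu + randn(reps, n), 2);   % reps independent sample means of size n
  printf("n = %6d: sd(xbar - mu) = %8.5f, sd(sqrt(n)*(xbar - mu)) = %8.5f\n", ...
         n, std(xbar - mu), std(sqrt(n)*(xbar - mu)));
end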


Exercises

(1) For $a$ and $x$ both $p \times 1$ vectors, show that $\frac{\partial a'x}{\partial x} = a$.
(2) For $A$ a $p \times p$ matrix and $x$ a $p \times 1$ vector, show that $\frac{\partial x'Ax}{\partial x} = (A + A')x$.
(3) For $x$ and $\beta$ both $p \times 1$ vectors, show that $\frac{\partial \exp(x'\beta)}{\partial\beta} = \exp(x'\beta)\,x$.
(4) For $x$ and $\beta$ both $p \times 1$ vectors, find the analytic expression for $\frac{\partial^2 \exp(x'\beta)}{\partial\beta\,\partial\beta'}$.
(5) Write an Octave program that verifies each of the previous results by taking numeric derivatives. For a hint, type help numgradient and help numhessian inside octave.


CHAPTER 23

The GPL

This document and the associated examples and materials are copyright

Michael Creel, under the terms of the GNU General Public License. This li-

cense follows:

GNU GENERAL PUBLIC LICENSE Version 2, June 1991

Copyright (C) 1989, 1991 Free Software Foundation, Inc. 59 Temple Place,

Suite 330, Boston, MA 02111-1307 USA Everyone is permitted to copy and

distribute verbatim copies of this license document, but changing it is not al-

lowed.

Preamble

The licenses for most software are designed to take away your freedom to

share and change it. By contrast, the GNU General Public License is intended

to guarantee your freedom to share and change free software–to make sure the

software is free for all its users. This General Public License applies to most

of the Free Software Foundation’s software and to any other program whose

authors commit to using it. (Some other Free Software Foundation software is

covered by the GNU Library General Public License instead.) You can apply it

to your programs, too.

When we speak of free software, we are referring to freedom, not price. Our

General Public Licenses are designed to make sure that you have the freedom

to distribute copies of free software (and charge for this service if you wish),

that you receive source code or can get it if you want it, that you can change


the software or use pieces of it in new free programs; and that you know you

can do these things.

To protect your rights, we need to make restrictions that forbid anyone to

deny you these rights or to ask you to surrender the rights. These restrictions

translate to certain responsibilities for you if you distribute copies of the soft-

ware, or if you modify it.

For example, if you distribute copies of such a program, whether gratis or

for a fee, you must give the recipients all the rights that you have. You must

make sure that they, too, receive or can get the source code. And you must

show them these terms so they know their rights.

We protect your rights with two steps: (1) copyright the software, and

(2) offer you this license which gives you legal permission to copy, distribute

and/or modify the software.

Also, for each author’s protection and ours, we want to make certain that

everyone understands that there is no warranty for this free software. If the

software is modified by someone else and passed on, we want its recipients to

know that what they have is not the original, so that any problems introduced

by others will not reflect on the original authors’ reputations.

Finally, any free program is threatened constantly by software patents. We

wish to avoid the danger that redistributors of a free program will individually

obtain patent licenses, in effect making the program proprietary. To prevent

this, we have made it clear that any patent must be licensed for everyone’s

free use or not licensed at all.

The precise terms and conditions for copying, distribution and modifica-

tion follow.


GNU GENERAL PUBLIC LICENSE TERMS AND CONDITIONS FOR COPY-

ING, DISTRIBUTION AND MODIFICATION

0. This License applies to any program or other work which contains a

notice placed by the copyright holder saying it may be distributed under the

terms of this General Public License. The "Program", below, refers to any such

program or work, and a "work based on the Program" means either the Pro-

gram or any derivative work under copyright law: that is to say, a work con-

taining the Program or a portion of it, either verbatim or with modifications

and/or translated into another language. (Hereinafter, translation is included

without limitation in the term "modification".) Each licensee is addressed as

"you".

Activities other than copying, distribution and modification are not cov-

ered by this License; they are outside its scope. The act of running the Pro-

gram is not restricted, and the output from the Program is covered only if its

contents constitute a work based on the Program (independent of having been

made by running the Program). Whether that is true depends on what the

Program does.

1. You may copy and distribute verbatim copies of the Program’s source

code as you receive it, in any medium, provided that you conspicuously and

appropriately publish on each copy an appropriate copyright notice and dis-

claimer of warranty; keep intact all the notices that refer to this License and to

the absence of any warranty; and give any other recipients of the Program a

copy of this License along with the Program.

You may charge a fee for the physical act of transferring a copy, and you

may at your option offer warranty protection in exchange for a fee.


2. You may modify your copy or copies of the Program or any portion of

it, thus forming a work based on the Program, and copy and distribute such

modifications or work under the terms of Section 1 above, provided that you

also meet all of these conditions:

a) You must cause the modified files to carry prominent notices stating that

you changed the files and the date of any change.

b) You must cause any work that you distribute or publish, that in whole

or in part contains or is derived from the Program or any part thereof, to be

licensed as a whole at no charge to all third parties under the terms of this

License.

c) If the modified program normally reads commands interactively when

run, you must cause it, when started running for such interactive use in the

most ordinary way, to print or display an announcement including an appro-

priate copyright notice and a notice that there is no warranty (or else, saying

that you provide a warranty) and that users may redistribute the program un-

der these conditions, and telling the user how to view a copy of this License.

(Exception: if the Program itself is interactive but does not normally print such

an announcement, your work based on the Program is not required to print an

announcement.)

These requirements apply to the modified work as a whole. If identifiable

sections of that work are not derived from the Program, and can be reasonably

considered independent and separate works in themselves, then this License,

and its terms, do not apply to those sections when you distribute them as sep-

arate works. But when you distribute the same sections as part of a whole

which is a work based on the Program, the distribution of the whole must be


on the terms of this License, whose permissions for other licensees extend to

the entire whole, and thus to each and every part regardless of who wrote it.

Thus, it is not the intent of this section to claim rights or contest your rights

to work written entirely by you; rather, the intent is to exercise the right to

control the distribution of derivative or collective works based on the Program.

In addition, mere aggregation of another work not based on the Program

with the Program (or with a work based on the Program) on a volume of a

storage or distribution medium does not bring the other work under the scope

of this License.

3. You may copy and distribute the Program (or a work based on it, under

Section 2) in object code or executable form under the terms of Sections 1 and

2 above provided that you also do one of the following:

a) Accompany it with the complete corresponding machine-readable source

code, which must be distributed under the terms of Sections 1 and 2 above on

a medium customarily used for software interchange; or,

b) Accompany it with a written offer, valid for at least three years, to give

any third party, for a charge no more than your cost of physically performing

source distribution, a complete machine-readable copy of the corresponding

source code, to be distributed under the terms of Sections 1 and 2 above on a

medium customarily used for software interchange; or,

c) Accompany it with the information you received as to the offer to dis-

tribute corresponding source code. (This alternative is allowed only for non-

commercial distribution and only if you received the program in object code

or executable form with such an offer, in accord with Subsection b above.)

The source code for a work means the preferred form of the work for mak-

ing modifications to it. For an executable work, complete source code means


all the source code for all modules it contains, plus any associated interface

definition files, plus the scripts used to control compilation and installation of

the executable. However, as a special exception, the source code distributed

need not include anything that is normally distributed (in either source or bi-

nary form) with the major components (compiler, kernel, and so on) of the

operating system on which the executable runs, unless that component itself

accompanies the executable.

If distribution of executable or object code is made by offering access to

copy from a designated place, then offering equivalent access to copy the

source code from the same place counts as distribution of the source code,

even though third parties are not compelled to copy the source along with the

object code.

4. You may not copy, modify, sublicense, or distribute the Program ex-

cept as expressly provided under this License. Any attempt otherwise to copy,

modify, sublicense or distribute the Program is void, and will automatically

terminate your rights under this License. However, parties who have received

copies, or rights, from you under this License will not have their licenses ter-

minated so long as such parties remain in full compliance.

5. You are not required to accept this License, since you have not signed

it. However, nothing else grants you permission to modify or distribute the

Program or its derivative works. These actions are prohibited by law if you do

not accept this License. Therefore, by modifying or distributing the Program

(or any work based on the Program), you indicate your acceptance of this Li-

cense to do so, and all its terms and conditions for copying, distributing or

modifying the Program or works based on it.


6. Each time you redistribute the Program (or any work based on the Pro-

gram), the recipient automatically receives a license from the original licensor

to copy, distribute or modify the Program subject to these terms and condi-

tions. You may not impose any further restrictions on the recipients’ exercise

of the rights granted herein. You are not responsible for enforcing compliance

by third parties to this License.

7. If, as a consequence of a court judgment or allegation of patent infringe-

ment or for any other reason (not limited to patent issues), conditions are im-

posed on you (whether by court order, agreement or otherwise) that contradict

the conditions of this License, they do not excuse you from the conditions of

this License. If you cannot distribute so as to satisfy simultaneously your obli-

gations under this License and any other pertinent obligations, then as a con-

sequence you may not distribute the Program at all. For example, if a patent

license would not permit royalty-free redistribution of the Program by all those

who receive copies directly or indirectly through you, then the only way you

could satisfy both it and this License would be to refrain entirely from distri-

bution of the Program.

If any portion of this section is held invalid or unenforceable under any

particular circumstance, the balance of the section is intended to apply and the

section as a whole is intended to apply in other circumstances.

It is not the purpose of this section to induce you to infringe any patents or

other property right claims or to contest validity of any such claims; this sec-

tion has the sole purpose of protecting the integrity of the free software distri-

bution system, which is implemented by public license practices. Many people

have made generous contributions to the wide range of software distributed

through that system in reliance on consistent application of that system; it is


up to the author/donor to decide if he or she is willing to distribute software

through any other system and a licensee cannot impose that choice.

This section is intended to make thoroughly clear what is believed to be a

consequence of the rest of this License.

8. If the distribution and/or use of the

Program is restricted in certain countries either by patents or by copyrighted

interfaces, the original copyright holder who places the Program under this Li-

cense may add an explicit geographical distribution limitation excluding those

countries, so that distribution is permitted only in or among countries not thus

excluded. In such case, this License incorporates the limitation as if written in

the body of this License.

9. The Free Software Foundation may publish revised and/or new versions

of the General Public License from time to time. Such new versions will be

similar in spirit to the present version, but may differ in detail to address new

problems or concerns.

Each version is given a distinguishing version number. If the Program

specifies a version number of this License which applies to it and "any later

version", you have the option of following the terms and conditions either of

that version or of any later version published by the Free Software Founda-

tion. If the Program does not specify a version number of this License, you

may choose any version ever published by the Free Software Foundation.

10. If you wish to incorporate parts of the Program into other free pro-

grams whose distribution conditions are different, write to the author to ask

for permission. For software which is copyrighted by the Free Software Foun-

dation, write to the Free Software Foundation; we sometimes make exceptions

for this. Our decision will be guided by the two goals of preserving the free


status of all derivatives of our free software and of promoting the sharing and

reuse of software generally.

NO WARRANTY

11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE

IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED

BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING

THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE

PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EX-

PRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED

WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICU-

LAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFOR-

MANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE

DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,

REPAIR OR CORRECTION.

12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED

TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY

WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM AS PER-

MITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY

GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARIS-

ING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUD-

ING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED

INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR

A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PRO-

GRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED

OF THE POSSIBILITY OF SUCH DAMAGES.

END OF TERMS AND CONDITIONS


How to Apply These Terms to Your New Programs

If you develop a new program, and you want it to be of the greatest possible

use to the public, the best way to achieve this is to make it free software which

everyone can redistribute and change under these terms.

To do so, attach the following notices to the program. It is safest to attach

them to the start of each source file to most effectively convey the exclusion of

warranty; and each file should have at least the "copyright" line and a pointer

to where the full notice is found.

<one line to give the program’s name and a brief idea of what it does.>

Copyright (C) 19yy <name of author>

This program is free software; you can redistribute it and/or modify it un-

der the terms of the GNU General Public License as published by the Free

Software Foundation; either version 2 of the License, or (at your option) any

later version.

This program is distributed in the hope that it will be useful, but WITH-

OUT ANY WARRANTY; without even the implied warranty of MERCHANTABIL-

ITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public

License for more details.

You should have received a copy of the GNU General Public License along

with this program; if not, write to the Free Software Foundation, Inc., 59 Tem-

ple Place, Suite 330, Boston, MA 02111-1307 USA

Also add information on how to contact you by electronic and paper mail.

If the program is interactive, make it output a short notice like this when it

starts in an interactive mode:

Gnomovision version 69, Copyright (C) 19yy name of author Gnomovision

comes with ABSOLUTELY NO WARRANTY; for details type ‘show w’. This is


free software, and you are welcome to redistribute it under certain conditions;

type ‘show c’ for details.

The hypothetical commands ‘show w’ and ‘show c’ should show the ap-

propriate parts of the General Public License. Of course, the commands you

use may be called something other than ‘show w’ and ‘show c’; they could

even be mouse-clicks or menu items–whatever suits your program.

You should also get your employer (if you work as a programmer) or your

school, if any, to sign a "copyright disclaimer" for the program, if necessary.

Here is a sample; alter the names:

Yoyodyne, Inc., hereby disclaims all copyright interest in the program ‘Gnomo-

vision’ (which makes passes at compilers) written by James Hacker.

<signature of Ty Coon>, 1 April 1989 Ty Coon, President of Vice

This General Public License does not permit incorporating your program

into proprietary programs. If your program is a subroutine library, you may

consider it more useful to permit linking proprietary applications with the li-

brary. If this is what you want to do, use the GNU Library General Public

License instead of this License.


CHAPTER 24

The attic

The GMM estimator, briefly

The OLS estimator can be thought of as a method of moments estimator. With weak exogeneity, $E(x_t \varepsilon_t) = 0$. So, likewise, $E\left(\frac{X'\varepsilon}{n}\right) = 0$. The idea of the MM estimator is to choose the estimator to make the sample counterpart hold:

$$\frac{X'\hat\varepsilon}{n} = \frac{X'\left(y - X\hat\beta\right)}{n} = 0 \;\Rightarrow\; \hat\beta = (X'X)^{-1}X'y.$$

This means of deriving the formula requires no calculus. It provides another interpretation of how the OLS estimator is defined.
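In Octave, the sample moment condition can be solved directly (a sketch with artificial data):

% OLS as a method of moments estimator: solve X'(y - X*betahat) = 0
n = 100; K = 3;
X = [ones(n,1) randn(n,K-1)];
beta = ones(K,1);                  % true coefficients for the artificial data
y = X*beta + randn(n,1);
betahat = (X'*X) \ (X'*y)          % the MM solution; no calculus was needed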

We can perhaps think of other variables that are not correlated with $\varepsilon_t$, say $w_t$. This may be needed if the weak exogeneity assumption fails for $x_t$. Let us assume that we have instruments $w_t$ that satisfy $E(w_t \varepsilon_t) = 0$. If the dimension of $w_t$ is greater than $K$, then we have more un...

This chapter holds material that is not really ready to be incorporated into the main body, but that I don't want to lose. Basically, ignore it, unless you'd like to help get it ready for inclusion.


24.1. MEPS data: more on count models

Note to self: this chapter is yet to be converted to use Octave.

To check the plausibility of the Poisson model, we can compare the sample unconditional variance with the estimated unconditional variance according to the Poisson model: $\hat V(y) = \frac{\sum_{t=1}^n \hat\lambda_t}{n}$. For OBDV and ERV, we get

TABLE 1. Marginal Variances, Sample and Estimated (Poisson)

            OBDV     ERV
Sample      37.446   0.30614
Estimated   3.4540   0.19060

We see that even after conditioning, the overdispersion is not captured in either case. There is a huge problem with OBDV, and a significant problem with ERV. In both cases the Poisson model does not appear to be plausible.
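The Table 1 comparison is a one-line computation once the fitted means are available. A sketch with toy inputs (for the real exercise, y would be the MEPS counts and lambda_hat the Poisson fitted means):

% sample vs. Poisson-implied unconditional variance (toy inputs)
n = 500;
lambda_hat = exp(1 + 0.5*randn(n,1));            % hypothetical fitted means
y = randp(lambda_hat .* randg(0.5, n, 1)/0.5);   % toy overdispersed counts (gamma mixing)
printf("sample variance:          %f\n", var(y));
printf("Poisson-implied variance: %f\n", mean(lambda_hat));   % the text's formula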

24.1.1. Infinite mixture models. Reference: Cameron and Trivedi (1998), Regression analysis of count data, chapter 4.

The two measures seem to exhibit extra-Poisson variation. To capture unobserved heterogeneity, a possibility is the random parameters approach. Consider the possibility that the constant term in a Poisson model were random:

$$f_Y(y \mid x, \varepsilon) = \frac{\exp(-\theta)\,\theta^y}{y!}$$
$$\theta = \exp(x'\beta + \varepsilon) = \exp(x'\beta)\exp(\varepsilon) = \lambda\nu,$$

where $\lambda = \exp(x'\beta)$ and $\nu = \exp(\varepsilon)$. Now $\nu$ captures the randomness in the constant. The problem is that we don't observe $\nu$, so we will need to marginalize it to get a usable density:

$$f_Y(y \mid x) = \int_{-\infty}^{\infty} \frac{\exp(-\theta)\,\theta^y}{y!}\, f_\nu(z)\, dz.$$

This density can be used directly, perhaps using numerical integration to evaluate the likelihood function. In some cases, though, the integral will have an analytic solution. For example, if $\nu$ follows a certain one parameter gamma density, then

(24.1.1)
$$f_Y(y \mid x, \phi) = \frac{\Gamma(y+\psi)}{\Gamma(y+1)\,\Gamma(\psi)} \left( \frac{\psi}{\psi+\lambda} \right)^{\psi} \left( \frac{\lambda}{\psi+\lambda} \right)^{y},$$

where $\phi = (\lambda, \psi)$. $\psi$ appears since it is the parameter of the gamma density.

• For this density, $E(y \mid x) = \lambda$, which we have parameterized $\lambda = \exp(x'\beta)$.
• The variance depends upon how $\psi$ is parameterized.
  – If $\psi = \lambda/\alpha$, where $\alpha > 0$, then $V(y \mid x) = \lambda + \alpha\lambda$. Note that $\lambda$ is a function of $x$, so that the variance is too. This is referred to as the NB-I model.
  – If $\psi = 1/\alpha$, where $\alpha > 0$, then $V(y \mid x) = \lambda + \alpha\lambda^2$. This is referred to as the NB-II model.

So both forms of the NB model allow for overdispersion, with the NB-II model allowing for a more radical form.

• Testing reduction of a NB model to a Poisson model cannot be done by testing $\alpha = 0$ using standard Wald or LR procedures. The critical values need to be adjusted to account for the fact that $\alpha = 0$ is on the boundary of the parameter space. Without getting into details, suppose that the data were in fact Poisson, so there is equidispersion and the true $\alpha = 0$. Then about half the time the sample data will be underdispersed, and about half the time overdispersed. When the data is underdispersed, the MLE of $\alpha$ will be $\hat\alpha = 0$. Thus, under the null, there will be a probability spike in the asymptotic distribution of $\sqrt{n}(\hat\alpha - \alpha) = \sqrt{n}\hat\alpha$ at 0, so standard testing methods will not be valid.
• Here are NB-I estimation results for OBDV, obtained using this estimation program.

MEPS data, OBDV

negbin results

Strong convergence

Observations = 500

Function value -2.2656

t-Stats

params t(OPG) t(Sand.) t(Hess)

constant -0.055766 -0.16793 -0.17418 -0.17215

pub_ins 0.47936 2.9406 2.8296 2.9122

priv_ins 0.20673 1.3847 1.4201 1.4086

sex 0.34916 3.2466 3.4148 3.3434

age 0.015116 3.3569 3.8055 3.5974

educ 0.014637 0.78661 0.67910 0.73757

inc 0.012581 0.60022 0.93782 0.76330

ln_alpha 1.7389 23.669 11.295 16.660

Information Criteria

Consistent Akaike

2323.3

Page 460: Econometrics-Creel (2005)

24.1. MEPS DATA: MORE ON COUNT MODELS 460

Schwartz

2315.3

Hannan-Quinn

2294.8

Akaike

2281.6


Here are NB-II results for OBDV

**************************************************************************

MEPS data, OBDV

negbin results

Strong convergence

Observations = 500

Function value -2.2616

t-Stats

params t(OPG) t(Sand.) t(Hess)

constant -0.65981 -1.8913 -1.4717 -1.6977

pub_ins 0.68928 2.9991 3.1825 3.1436

priv_ins 0.22171 1.1515 1.2057 1.1917

sex 0.44610 3.8752 2.9768 3.5164

age 0.024221 3.8193 4.5236 4.3239

educ 0.020608 0.94844 0.74627 0.86004

inc 0.020040 0.87374 0.72569 0.86579

ln_alpha 0.47421 5.6622 4.6278 5.6281

Information Criteria

Consistent Akaike

2319.3

Schwartz

2311.3

Hannan-Quinn

2290.8

Akaike

2277.6

**************************************************************************

• For the OBDV model, the NB-II model does a better job, in terms of the average log-likelihood and the information criteria.
• Note that both versions of the NB model fit much better than does the Poisson model.
• The t-statistics are now similar for all three ways of calculating them, which might indicate that the serious specification problems of the Poisson model for the OBDV data are partially solved by moving to the NB model.
• The estimated $\ln\alpha$ is highly significant.

To check the plausibility of the NB-II model, we can compare the sample unconditional variance with the estimated unconditional variance according to the NB-II model: $\hat V(y) = \frac{\sum_{t=1}^n \left(\hat\lambda_t + \hat\alpha\hat\lambda_t^2\right)}{n}$. For OBDV and ERV (estimation results not reported), we get

TABLE 2. Marginal Variances, Sample and Estimated (NB-II)

            OBDV     ERV
Sample      37.446   0.30614
Estimated   26.962   0.27620

The overdispersion problem is significantly better than in the Poisson case, but there is still some overdispersion that is not captured, for both OBDV and ERV.

24.2. Hurdle models

Returning to the Poisson model, let's look at actual and fitted count probabilities. Actual relative frequencies are $P(y = j) = \frac{\sum_i 1(y_i = j)}{n}$ and fitted frequencies are $\hat P(y = j) = \frac{\sum_{i=1}^n f_Y(j \mid x_i, \hat\theta)}{n}$. We see that for the OBDV measure, there are many more actual zeros than predicted. For ERV, there are somewhat more actual zeros than fitted, but the difference is not too important.

TABLE 3. Actual and Poisson fitted frequencies

        OBDV                ERV
Count   Actual   Fitted     Actual   Fitted
0       0.32     0.06       0.86     0.83
1       0.18     0.15       0.10     0.14
2       0.11     0.19       0.02     0.02
3       0.10     0.18       0.004    0.002
4       0.052    0.15       0.002    0.0002
5       0.032    0.10       0        2.4e-5
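These frequencies are simple to compute. A sketch with toy inputs (with the real data, y and lambda_hat would come from the estimated model):

% actual vs. fitted Poisson count frequencies (toy inputs)
n = 500;
lambda_hat = exp(0.5 + 0.3*randn(n,1));   % hypothetical fitted means
y = randp(lambda_hat);                     % toy counts (here truly Poisson)
for j = 0:5
  printf("%d: actual %6.3f   fitted %6.3f\n", j, mean(y == j), ...
         mean(exp(-lambda_hat) .* lambda_hat.^j / factorial(j)));
end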

Why might OBDV not fit the zeros well? What if people made the deci-

sion to contact the doctor for a first visit, they are sick, then the doctor decides

on whether or not follow-up visits are needed. This is a principal/agent type

situation, where the total number of visits depends upon the decision of both

the patient and the doctor. Since different parameters may govern the two

decision-makers choices, we might expect that different parameters govern

the probability of zeros versus the other counts. Let e 6 be the parameters of

the patient’s demand for visits, and let e m be the paramter of the doctor’s “de-

mand” for visits. The patient will initiate visits according to a discrete choice

model, for example, a logit model:

Ë � �W¤ ¡� �<; � � � e 6J �È� �  \@��2 }C~%� �Ô� e 6]I]Ë � �W¤ ¥Ë� �  \©� 2 }C~%� �Ô� e 6]_]��

Page 464: Econometrics-Creel (2005)

24.2. HURDLE MODELS 464

The above probabilities are used to estimate the binary 0/1 hurdle process.

Then, for the observations where visits are positive, a truncated Poisson den-

sity is estimated. This density is�<; ��B�� e m � B ¥ � �<; ��B�� e m Ë � ��B ¥ � �<; ��B�� e m �È� }Y~ � �?� e m since according to the Poisson model with the doctor’s paramaters,Ë � ��B ¡� }Y~ � �?� e m e Fm� � &Since the hurdle and truncated components of the overall density for ¤ share

no parameters, they may be estimated separately, which is computationally more efficient than estimating the overall model. (Recall that the BFGS algorithm, for example, will have to invert the approximated Hessian. The computational overhead is of order $K^2$, where $K$ is the number of parameters to be estimated.) The expectation of $Y$ is

$$E(Y|x) = \Pr(Y > 0)\,E(Y|Y > 0) = \left[\frac{1}{1 + \exp(-\lambda_p)}\right]\left[\frac{\lambda_d}{1 - \exp(-\lambda_d)}\right]$$
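To make the two-part structure concrete, here is a minimal Octave sketch of the two log-likelihood components, under the assumptions that $\lambda_p = x'\theta$ in the logit part and $\lambda_d = \exp(x'\theta)$ in the truncated part; the function names and parameterizations are illustrative, not those of the estimation program used below.

% Sketch of the two hurdle Poisson average log-likelihood components.
% y is an n x 1 count vector, x an n x k regressor matrix; the
% parameterizations lambda_p = x*theta and lambda_d = exp(x*theta)
% are assumptions made for this illustration.
function ll = logit_ll(theta, y, x)
  % binary hurdle part: log-likelihood of the indicator 1(y > 0)
  p = 1 ./ (1 + exp(-x*theta));              % Pr(y > 0)
  d = (y > 0);
  ll = mean(d.*log(p) + (1 - d).*log(1 - p));
endfunction

function ll = trunc_poisson_ll(theta, y, x)
  % truncated Poisson part, using only the observations with y > 0
  keep = (y > 0);
  y = y(keep);
  x = x(keep, :);
  lambda = exp(x*theta);                     % conditional mean
  ll = mean(-lambda + y.*log(lambda) - lgamma(y + 1) - log(1 - exp(-lambda)));
endfunction

Each component can be handed to a maximizer separately, which is the computational point made above.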


Here are hurdle Poisson estimation results for OBDV, obtained from this estimation program:

**************************************************************************

MEPS data, OBDV

logit results

Strong convergence

Observations = 500

Function value -0.58939

t-Stats

params t(OPG) t(Sand.) t(Hess)

constant -1.5502 -2.5709 -2.5269 -2.5560

pub_ins 1.0519 3.0520 3.0027 3.0384

priv_ins 0.45867 1.7289 1.6924 1.7166

sex 0.63570 3.0873 3.1677 3.1366

age 0.018614 2.1547 2.1969 2.1807

educ 0.039606 1.0467 0.98710 1.0222

inc 0.077446 1.7655 2.1672 1.9601

Information Criteria

Consistent Akaike

639.89

Schwartz

632.89

Hannan-Quinn

614.96

Akaike

603.39

**************************************************************************


The results for the truncated part:

**************************************************************************

MEPS data, OBDV

tpoisson results

Strong convergence

Observations = 500

Function value -2.7042

t-Stats

params t(OPG) t(Sand.) t(Hess)

constant 0.54254 7.4291 1.1747 3.2323

pub_ins 0.31001 6.5708 1.7573 3.7183

priv_ins 0.014382 0.29433 0.10438 0.18112

sex 0.19075 10.293 1.1890 3.6942

age 0.016683 16.148 3.5262 7.9814

educ 0.016286 4.2144 0.56547 1.6353

inc -0.0079016 -2.3186 -0.35309 -0.96078

Information Criteria

Consistent Akaike

2754.7

Schwartz

2747.7

Hannan-Quinn

2729.8

Akaike

2718.2

**************************************************************************


Fitted and actual probabilities (NB-II fits are provided as well) are:

TABLE 4. Actual and Hurdle Poisson fitted frequencies

                   OBDV                              ERV
Count    Actual   Fitted HP   Fitted NB-II   Actual   Fitted HP   Fitted NB-II
0        0.32     0.32        0.34           0.86     0.86        0.86
1        0.18     0.035       0.16           0.10     0.10        0.10
2        0.11     0.071       0.11           0.02     0.02        0.02
3        0.10     0.10        0.08           0.004    0.006       0.006
4        0.052    0.11        0.06           0.002    0.002       0.002
5        0.032    0.10        0.05           0        0.0005      0.001

For the Hurdle Poisson models, the ERV fit is very accurate. The OBDV fit

is not so good. Zeros are exact, but 1’s and 2’s are underestimated, and higher

counts are overestimated. For the NB-II fits, performance is at least as good as

the hurdle Poisson model, and one should recall that many fewer parameters

are used. Hurdle versions of the negative binomial model are also widely used.

24.2.1. Finite mixture models. The finite mixture approach to fitting health

care demand was introduced by Deb and Trivedi (1997). The mixture approach

has the intuitive appeal of allowing for subgroups of the population with dif-

ferent health status. If individuals are classified as healthy or unhealthy then

two subgroups are defined. A finer classification scheme would lead to more

subgroups. Many studies have incorporated objective and/or subjective indi-

cators of health status in an effort to capture this heterogeneity. The available

objective measures, such as limitations on activity, are not necessarily very

informative about a person’s overall health status. Subjective, self-reported

measures may suffer from the same problem, and may also not be exogenous.


Finite mixture models are conceptually simple. The density is

$$f_Y(y; \phi_1, \dots, \phi_p, \pi_1, \dots, \pi_{p-1}) = \sum_{i=1}^{p-1} \pi_i f_Y^{(i)}(y; \phi_i) + \pi_p f_Y^{(p)}(y; \phi_p),$$

where $\pi_i > 0$, $i = 1, 2, \dots, p$, $\pi_p = 1 - \sum_{i=1}^{p-1}\pi_i$, and $\sum_{i=1}^{p}\pi_i = 1$. Identification requires that the $\pi_i$ are ordered in some way, for example, $\pi_1 \geq \pi_2 \geq \cdots \geq \pi_p$ and $\phi_i \neq \phi_j$, $i \neq j$. This is simple to accomplish post-estimation by rearrangement and possible elimination of redundant component densities.

• The properties of the mixture density follow in a straightforward way from those of the components. In particular, the moment generating function is the same mixture of the moment generating functions of the component densities, so, for example, $E(Y|x) = \sum_{i=1}^{p}\pi_i\mu_i(x)$, where $\mu_i(x)$ is the mean of the $i^{th}$ component density.
• Mixture densities may suffer from overparameterization, since the total number of parameters grows rapidly with the number of component densities. It is possible to constrain parameters across the mixtures.
• Testing for the number of component densities is a tricky issue. For example, testing for $p = 1$ (a single component, which is to say, no mixture) versus $p = 2$ (a mixture of two components) involves the restriction $\pi_1 = 1$, which is on the boundary of the parameter space. Note that when $\pi_1 = 1$, the parameters of the second component can take on any value without affecting the density. Usual methods such as the likelihood ratio test are not applicable when parameters are on the boundary under the null hypothesis. Information criteria, as means of choosing the model (see below), are valid.
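As an illustration of how simple the mixture log-likelihood is to code, here is an Octave sketch for a two-component Poisson mixture (the results below use NB-I components, but Poisson keeps the sketch short). The logit-inverse parameterization of the mixing probability mirrors the logit_inv_mix parameter reported in the output below; the function name and the packing of the parameter vector are assumptions of this sketch.

% Sketch: average log-likelihood of a 2-component Poisson mixture.
% theta packs [beta1; beta2; m], where pi_1 = 1/(1+exp(-m)) keeps the
% mixing probability in (0,1).
function ll = mixpoisson_ll(theta, y, x)
  k = columns(x);
  beta1 = theta(1:k);
  beta2 = theta(k+1:2*k);
  pi1 = 1/(1 + exp(-theta(2*k+1)));            % logit-inverse mixing weight
  l1 = exp(x*beta1);                           % component 1 conditional mean
  l2 = exp(x*beta2);                           % component 2 conditional mean
  f1 = exp(-l1 + y.*log(l1) - lgamma(y + 1));  % Poisson component densities
  f2 = exp(-l2 + y.*log(l2) - lgamma(y + 1));
  ll = mean(log(pi1*f1 + (1 - pi1)*f2));
endfunction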


The following are results for a mixture of 2 negative binomial (NB-I) models,

for the OBDV data, which you can replicate using this estimation program.


**************************************************************************

MEPS data, OBDV

mixnegbin results

Strong convergence

Observations = 500

Function value -2.2312

t-Stats

params t(OPG) t(Sand.) t(Hess)

constant 0.64852 1.3851 1.3226 1.4358

pub_ins -0.062139 -0.23188 -0.13802 -0.18729

priv_ins 0.093396 0.46948 0.33046 0.40854

sex 0.39785 2.6121 2.2148 2.4882

age 0.015969 2.5173 2.5475 2.7151

educ -0.049175 -1.8013 -1.7061 -1.8036

inc 0.015880 0.58386 0.76782 0.73281

ln_alpha 0.69961 2.3456 2.0396 2.4029

constant -3.6130 -1.6126 -1.7365 -1.8411

pub_ins 2.3456 1.7527 3.7677 2.6519

priv_ins 0.77431 0.73854 1.1366 0.97338

sex 0.34886 0.80035 0.74016 0.81892

age 0.021425 1.1354 1.3032 1.3387

educ 0.22461 2.0922 1.7826 2.1470

inc 0.019227 0.20453 0.40854 0.36313

ln_alpha 2.8419 6.2497 6.8702 7.6182

logit_inv_mix 0.85186 1.7096 1.4827 1.7883

Information Criteria


Consistent Akaike

2353.8

Schwartz

2336.8

Hannan-Quinn

2293.3

Akaike

2265.2

**************************************************************************

Delta method for mix parameter st. err.

mix se_mix

0.70096 0.12043

• The 95% confidence interval for the mix parameter is perilously close to 1, which suggests that there may really be only one component density, rather than a mixture. Again, this is not the way to test this - it is merely suggestive.
• Education is interesting. For the subpopulation that is "healthy", i.e., that makes relatively few visits, education seems to have a positive effect on visits. For the "unhealthy" group, education has a negative effect on visits. The other results are more mixed. A larger sample could help clarify things.

The following are results for a 2 component constrained mixture negative binomial model where all the slope parameters in $\lambda_j = e^{x\beta_j}$ are the same across the two components. The constants and the overdispersion parameters $\alpha_j$ are allowed to differ for the two components.


**************************************************************************

MEPS data, OBDV

cmixnegbin results

Strong convergence

Observations = 500

Function value -2.2441

t-Stats

params t(OPG) t(Sand.) t(Hess)

constant -0.34153 -0.94203 -0.91456 -0.97943

pub_ins 0.45320 2.6206 2.5088 2.7067

priv_ins 0.20663 1.4258 1.3105 1.3895

sex 0.37714 3.1948 3.4929 3.5319

age 0.015822 3.1212 3.7806 3.7042

educ 0.011784 0.65887 0.50362 0.58331

inc 0.014088 0.69088 0.96831 0.83408

ln_alpha 1.1798 4.6140 7.2462 6.4293

const_2 1.2621 0.47525 2.5219 1.5060

lnalpha_2 2.7769 1.5539 6.4918 4.2243

logit_inv_mix 2.4888 0.60073 3.7224 1.9693

Information Criteria

Consistent Akaike

2323.5

Schwartz

2312.5

Hannan-Quinn


2284.3

Akaike

2266.1

**************************************************************************

Delta method for mix parameter st. err.

mix se_mix

0.92335 0.047318

• Now the mixture parameter is even closer to 1.
• The slope parameter estimates are pretty close to what we got with the NB-I model.

24.2.2. Comparing models using information criteria. A Poisson model

can’t be tested (using standard methods) as a restriction of a negative bino-

mial model. Testing for collapse of a finite mixture to a mixture of fewer com-

ponents has the same problem. How can we determine which of competing

models is the best?

The information criteria approach is one possibility. Information criteria

are functions of the log-likelihood, with a penalty for the number of parame-

ters used. Three popular information criteria are the Akaike (AIC), Bayes (BIC)

and consistent Akaike (CAIC). The formulae are

$$\begin{aligned}
CAIC &= -2\ln L(\hat{\theta}) + k(\ln n + 1) \\
BIC &= -2\ln L(\hat{\theta}) + k\ln n \\
AIC &= -2\ln L(\hat{\theta}) + 2k
\end{aligned}$$

It can be shown that the CAIC and BIC will select the correctly specified model

from a group of models, asymptotically. This doesn’t mean, of course, that the


correct model is necessarily in the group. The AIC is not consistent, and will asymptotically favor an over-parameterized model over the correctly specified model. Here are information criteria values for the models we've seen, for OBDV.

TABLE 5. Information Criteria, OBDV

Model            AIC    BIC    CAIC
Poisson          3822   3911   3918
NB-I             2282   2315   2323
Hurdle Poisson   3333   3381   3395
MNB-I            2265   2337   2354
CMNB-I           2266   2312   2323

According to the AIC, the best is the MNB-I, which has relatively many parameters. The best according to the BIC is CMNB-I, and according to CAIC, the best is NB-I. The Poisson-based models do not do well.
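These formulae are immediate to compute. A small Octave sketch follows; note that the "Function value" lines in the output blocks above report mean log-likelihoods, so the total log-likelihood is n times the reported value.

% Sketch: information criteria from a maximized log-likelihood lnL,
% k parameters and n observations.
function [aic, bic, caic] = info_crit(lnL, k, n)
  aic  = -2*lnL + 2*k;
  bic  = -2*lnL + k*log(n);
  caic = -2*lnL + k*(log(n) + 1);
endfunction

For example, the MNB-I output above reports a mean log-likelihood of -2.2312 with 17 parameters and 500 observations, and info_crit(500*(-2.2312), 17, 500) reproduces the reported values 2265.2, 2336.8 and 2353.8.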

24.3. Models for time series data

This section can be ignored in its present form. Just left in to form a basis

for completion (by someone else ?!) at some point.

Hamilton, Time Series Analysis is a good reference for this section. This is

very incomplete and contributions would be very welcome.

Up to now we’ve considered the behavior of the dependent variable B"U as a

function of other variables � U?& These variables can of course contain lagged

dependent variables, e.g., � U ��.7UP��B]U � 1Y�'&(&)&(��B]U � ß Y& Pure time series methods

consider the behavior of BgU as a function only of its own lagged values, un-

conditional on other observable variables. One can think of this as modeling

the behavior of BgU after marginalizing out all other variables. While it’s not

immediately clear why a model that has other explanatory variables should

marginalize to a linear in the parameters time series model, most time series


work is done with linear models, though nonlinear time series is also a large

and growing field. We’ll stick with linear time series models.

24.3.1. Basic concepts.

DEFINITION 53 (Stochastic process). A stochastic process is a sequence of

random variables, indexed by time:

(24.3.1) $\{Y_t\}_{t=-\infty}^{\infty}$

DEFINITION 54 (Time series). A time series is one observation of a stochastic process, over a specific interval:

(24.3.2) $\{y_t\}_{t=1}^{n}$

So a time series is a sample of size $n$ from a stochastic process. It's important to keep in mind that conceptually, one could draw another sample, and that the values would be different.

DEFINITION 55 (Autocovariance). The $j^{th}$ autocovariance of a stochastic process is

(24.3.3) $\gamma_{jt} = E\left[(y_t - \mu_t)(y_{t-j} - \mu_{t-j})\right]$

where $\mu_t = E(y_t)$.

DEFINITION 56 (Covariance (weak) stationarity). A stochastic process is

covariance stationary if it has time constant mean and autocovariances of all


orders:

$$\mu_t = \mu, \; \forall t \qquad \gamma_{jt} = \gamma_j, \; \forall t$$

As we've seen, this implies that $\gamma_j = \gamma_{-j}$: the autocovariances depend only on the interval between observations, but not the time of the observations.

DEFINITION 57 (Strong stationarity). A stochastic process is strongly sta-

tionary if the joint distribution of an arbitrary collection of the $\{Y_t\}$ doesn't depend on $t$.

Since moments are determined by the distribution, strong stationarity $\Rightarrow$ weak stationarity.

What is the mean of $Y_t$? The time series is one sample from the stochastic process. One could think of $M$ repeated samples from the stochastic process, e.g., $\{y_t^m\}$. By a LLN, we would expect that

$$\frac{1}{M}\sum_{m=1}^{M} y_t^m \overset{p}{\to} E(Y_t) \quad \text{as } M \to \infty$$

The problem is, we have only one sample to work with, since we can't go back in time and collect another. How can $E(Y_t)$ be estimated then? It turns out that ergodicity is the needed property.

DEFINITION 58 (Ergodicity). A stationary stochastic process is ergodic (for

the mean) if the time average converges to the mean

(24.3.4) $\frac{1}{n}\sum_{t=1}^{n} y_t \overset{p}{\to} \mu$


A sufficient condition for ergodicity is that the autocovariances be abso-

lutely summable:

$$\sum_{j=0}^{\infty} |\gamma_j| < \infty$$

This implies that the autocovariances die off, so that the $y_t$ are not so strongly

dependent that they don’t satisfy a LLN.

DEFINITION 59 (Autocorrelation). The $j^{th}$ autocorrelation, $\rho_j$, is just the $j^{th}$ autocovariance divided by the variance:

(24.3.5) $\rho_j = \frac{\gamma_j}{\gamma_0}$

DEFINITION 60 (White noise). White noise is just the time series literature

term for a classical error. $\varepsilon_t$ is white noise if i) $E(\varepsilon_t) = 0, \forall t$, ii) $V(\varepsilon_t) = \sigma^2, \forall t$, and iii) $\varepsilon_t$ and $\varepsilon_s$ are independent, $t \neq s$. Gaussian white noise just adds a

normality assumption.

24.3.2. ARMA models. With these concepts, we can discuss ARMA mod-

els. These are closely related to the AR and MA error processes that we’ve

already discussed. The main difference is that the lhs variable is observed di-

rectly now.

24.3.2.1. MA(q) processes. A $q^{th}$ order moving average (MA) process is

$$y_t = \mu + \varepsilon_t + \theta_1\varepsilon_{t-1} + \theta_2\varepsilon_{t-2} + \cdots + \theta_q\varepsilon_{t-q}$$


where $\varepsilon_t$ is white noise. The variance is

$$\gamma_0 = E\left[(y_t - \mu)^2\right] = E\left[\left(\varepsilon_t + \theta_1\varepsilon_{t-1} + \theta_2\varepsilon_{t-2} + \cdots + \theta_q\varepsilon_{t-q}\right)^2\right] = \sigma^2\left(1 + \theta_1^2 + \theta_2^2 + \cdots + \theta_q^2\right)$$

Similarly, the autocovariances are

$$\gamma_j = \begin{cases} \left(\theta_j + \theta_{j+1}\theta_1 + \theta_{j+2}\theta_2 + \cdots + \theta_q\theta_{q-j}\right)\sigma^2, & j \leq q \\ 0, & j > q \end{cases}$$

Therefore an MA(q) process is necessarily covariance stationary and ergodic, as long as $\sigma^2$ and all of the $\theta_j$ are finite.
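Here is a short Octave sketch of these formulae for an MA(2), checked against a long simulated path; the parameter values are arbitrary.

% Sketch: autocovariances of an MA(2) by the formula above, checked
% against a long simulated path.
theta = [0.5; 0.25];
sig2 = 1;
q = 2;
th = [1; theta];                                  % theta_0 = 1
gam = zeros(q + 1, 1);
for j = 0:q
  gam(j+1) = sig2*sum(th(1:q+1-j).*th(1+j:q+1));  % gamma_j
endfor
e = randn(200000, 1)*sqrt(sig2);
y = filter(th, 1, e);                             % y_t = e_t + 0.5 e_{t-1} + 0.25 e_{t-2}
printf("gamma_0 = %g (simulated %g)\n", gam(1), var(y, 1));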

24.3.2.2. AR(p) processes. An AR(p) process can be represented as

$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t$$

The dynamic behavior of an AR(p) process can be studied by writing this $p^{th}$ order difference equation as a vector first order difference equation:

$$\begin{bmatrix} y_t \\ y_{t-1} \\ \vdots \\ y_{t-p+1} \end{bmatrix}
=
\begin{bmatrix} c \\ 0 \\ \vdots \\ 0 \end{bmatrix}
+
\begin{bmatrix}
\phi_1 & \phi_2 & \cdots & \cdots & \phi_p \\
1 & 0 & 0 & \cdots & 0 \\
0 & 1 & 0 & \cdots & 0 \\
\vdots & & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & 1 & 0
\end{bmatrix}
\begin{bmatrix} y_{t-1} \\ y_{t-2} \\ \vdots \\ y_{t-p} \end{bmatrix}
+
\begin{bmatrix} \varepsilon_t \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$

or

$$Y_t = C + F Y_{t-1} + E_t$$


With this, we can recursively work forward in time:

$$Y_{t+1} = C + FY_t + E_{t+1} = C + F(C + FY_{t-1} + E_t) + E_{t+1} = C + FC + F^2 Y_{t-1} + FE_t + E_{t+1}$$

and

$$Y_{t+2} = C + FY_{t+1} + E_{t+2} = C + F\left(C + FC + F^2 Y_{t-1} + FE_t + E_{t+1}\right) + E_{t+2} = C + FC + F^2 C + F^3 Y_{t-1} + F^2 E_t + FE_{t+1} + E_{t+2}$$

or in general

$$Y_{t+j} = C + FC + \cdots + F^j C + F^{j+1} Y_{t-1} + F^j E_t + F^{j-1} E_{t+1} + \cdots + FE_{t+j-1} + E_{t+j}$$

Consider the impact of a shock in period $t$ on $y_{t+j}$. This is simply

$$\frac{\partial y_{t+j}}{\partial \varepsilon_t} = \frac{\partial Y_{t+j}}{\partial E_t}\bigg|_{(1,1)} = F^j_{(1,1)}$$

If the system is to be stationary, then as we move forward in time this impact must die off. Otherwise a shock causes a permanent change in the mean of $y_t$. Therefore, stationarity requires that

$$\lim_{j\to\infty} F^j_{(1,1)} = 0$$

• Save this result, we'll need it in a minute.


Consider the eigenvalues of the matrix $F$. These are the $\lambda$ such that

$$|F - \lambda I_p| = 0$$

The determinant here can be expressed as a polynomial. For example, for $p = 1$, the matrix $F$ is simply

$$F = \phi_1$$

so

$$|\phi_1 - \lambda| = 0$$

can be written as

$$\phi_1 - \lambda = 0$$

When $p = 2$, the matrix $F$ is

$$F = \begin{bmatrix} \phi_1 & \phi_2 \\ 1 & 0 \end{bmatrix}$$

so

$$F - \lambda I_p = \begin{bmatrix} \phi_1 - \lambda & \phi_2 \\ 1 & -\lambda \end{bmatrix}$$

and

$$|F - \lambda I_p| = \lambda^2 - \lambda\phi_1 - \phi_2$$

So the eigenvalues are the roots of the polynomial

$$\lambda^2 - \lambda\phi_1 - \phi_2 = 0$$


which can be found using the quadratic formula. This generalizes. For a $p^{th}$ order AR process, the eigenvalues are the roots of

$$\lambda^p - \lambda^{p-1}\phi_1 - \lambda^{p-2}\phi_2 - \cdots - \lambda\phi_{p-1} - \phi_p = 0$$

Supposing that all of the roots of this polynomial are distinct, then the matrix $F$ can be factored as

$$F = T\Lambda T^{-1}$$

where $T$ is the matrix which has as its columns the eigenvectors of $F$, and $\Lambda$ is a diagonal matrix with the eigenvalues on the main diagonal. Using this decomposition, we can write

$$F^j = \left(T\Lambda T^{-1}\right)\left(T\Lambda T^{-1}\right)\cdots\left(T\Lambda T^{-1}\right)$$

where $T\Lambda T^{-1}$ is repeated $j$ times. This gives

$$F^j = T\Lambda^j T^{-1}$$

and

$$\Lambda^j = \begin{bmatrix} \lambda_1^j & 0 & \cdots & 0 \\ 0 & \lambda_2^j & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \lambda_p^j \end{bmatrix}$$

Supposing that the $\lambda_i$, $i = 1, 2, \dots, p$ are all real valued, it is clear that

$$\lim_{j\to\infty} F^j_{(1,1)} = 0$$

requires that

$$|\lambda_i| < 1, \quad i = 1, 2, \dots, p$$


e.g., the eigenvalues must be less than one in absolute value.

• It may be the case that some eigenvalues are complex-valued. The previous result generalizes to the requirement that the eigenvalues be less than one in modulus, where the modulus of a complex number $a + bi$ is
$$\mathrm{mod}(a + bi) = \sqrt{a^2 + b^2}$$
This leads to the famous statement that "stationarity requires the roots of the determinantal polynomial to lie inside the complex unit circle." [draw picture here]
• When there are roots on the unit circle (unit roots) or outside the unit circle, we leave the world of stationary processes.
• Dynamic multipliers: $\partial y_{t+j}/\partial\varepsilon_t = F^j_{(1,1)}$ is a dynamic multiplier or an impulse-response function. Real eigenvalues lead to steady movements, whereas complex eigenvalues lead to oscillatory behavior. Of course, when there are multiple eigenvalues the overall effect can be a mixture. [pictures]
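The stationarity condition is easy to check numerically. The following Octave sketch builds the companion matrix $F$ for an arbitrary AR(2) and examines the moduli of its eigenvalues; the coefficient values are arbitrary.

% Sketch: stationarity check for an AR(2) via the companion matrix.
phi = [1.4, -0.45];                % y_t = 1.4 y_{t-1} - 0.45 y_{t-2} + e_t
F = [phi; 1, 0];                   % companion matrix
lam = eig(F);                      % eigenvalues (possibly complex)
disp(abs(lam)');                   % moduli: 0.9 and 0.5 here
if all(abs(lam) < 1)
  disp("stationary: all eigenvalues inside the unit circle");
else
  disp("not stationary");
endif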

Invertibility of AR process

To begin with, define the lag operator $L$:

$$L y_t = y_{t-1}$$

The lag operator is defined to behave just as an algebraic quantity, e.g.,

$$L^2 y_t = L(L y_t) = L y_{t-1} = y_{t-2}$$


or

$$(1 - L)(1 + L)y_t = y_t + Ly_t - Ly_t - L^2 y_t = y_t - y_{t-2}$$

A mean-zero AR(p) process can be written as

$$y_t - \phi_1 y_{t-1} - \phi_2 y_{t-2} - \cdots - \phi_p y_{t-p} = \varepsilon_t$$

or

$$y_t\left(1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p\right) = \varepsilon_t$$

Factor this polynomial as

$$1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p = (1 - \lambda_1 L)(1 - \lambda_2 L)\cdots(1 - \lambda_p L)$$

For the moment, just assume that the $\lambda_i$ are coefficients to be determined. Since $L$ is defined to operate as an algebraic quantity, determination of the $\lambda_i$ is the same as determination of the $\lambda_i$ such that the following two expressions are the same for all $z$:

$$1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p = (1 - \lambda_1 z)(1 - \lambda_2 z)\cdots(1 - \lambda_p z)$$

Multiply both sides by $z^{-p}$:

$$z^{-p} - \phi_1 z^{1-p} - \phi_2 z^{2-p} - \cdots - \phi_{p-1} z^{-1} - \phi_p = \left(z^{-1} - \lambda_1\right)\left(z^{-1} - \lambda_2\right)\cdots\left(z^{-1} - \lambda_p\right)$$

and now define $\lambda = z^{-1}$ so we get

$$\lambda^p - \phi_1\lambda^{p-1} - \phi_2\lambda^{p-2} - \cdots - \phi_{p-1}\lambda - \phi_p = (\lambda - \lambda_1)(\lambda - \lambda_2)\cdots(\lambda - \lambda_p)$$


The LHS is precisely the determinantal polynomial that gives the eigenvalues

of $F$. Therefore, the $\lambda_i$ that are the coefficients of the factorization are simply the eigenvalues of the matrix $F$.

Now consider a different stationary process

$$(1 - \phi L)y_t = \varepsilon_t$$

• Stationarity, as above, implies that $|\phi| < 1$.

Multiply both sides by $1 + \phi L + \phi^2 L^2 + \cdots + \phi^j L^j$ to get

$$\left(1 + \phi L + \phi^2 L^2 + \cdots + \phi^j L^j\right)(1 - \phi L)y_t = \left(1 + \phi L + \phi^2 L^2 + \cdots + \phi^j L^j\right)\varepsilon_t$$

or, multiplying the polynomials on the LHS, we get

$$\left(1 + \phi L + \phi^2 L^2 + \cdots + \phi^j L^j - \phi L - \phi^2 L^2 - \cdots - \phi^j L^j - \phi^{j+1}L^{j+1}\right)y_t = \left(1 + \phi L + \phi^2 L^2 + \cdots + \phi^j L^j\right)\varepsilon_t$$

and with cancellations we have

$$\left(1 - \phi^{j+1}L^{j+1}\right)y_t = \left(1 + \phi L + \phi^2 L^2 + \cdots + \phi^j L^j\right)\varepsilon_t$$

so

$$y_t = \phi^{j+1}L^{j+1}y_t + \left(1 + \phi L + \phi^2 L^2 + \cdots + \phi^j L^j\right)\varepsilon_t$$

Now as $j \to \infty$, $\phi^{j+1}L^{j+1}y_t \to 0$, since $|\phi| < 1$, so

$$y_t \cong \left(1 + \phi L + \phi^2 L^2 + \cdots + \phi^j L^j\right)\varepsilon_t$$


and the approximation becomes better and better as $j$ increases. However, we started with

$$(1 - \phi L)y_t = \varepsilon_t$$

Substituting this into the above equation we have

$$y_t \cong \left(1 + \phi L + \phi^2 L^2 + \cdots + \phi^j L^j\right)(1 - \phi L)y_t$$

so

$$\left(1 + \phi L + \phi^2 L^2 + \cdots + \phi^j L^j\right)(1 - \phi L) \cong 1$$

and the approximation becomes arbitrarily good as $j$ increases arbitrarily. Therefore, for $|\phi| < 1$, define

$$(1 - \phi L)^{-1} = \sum_{j=0}^{\infty} \phi^j L^j$$

Recall that our mean zero AR(p) process

$$y_t\left(1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p\right) = \varepsilon_t$$

can be written using the factorization

$$y_t(1 - \lambda_1 L)(1 - \lambda_2 L)\cdots(1 - \lambda_p L) = \varepsilon_t$$

where the $\lambda_i$ are the eigenvalues of $F$, and given stationarity, all the $|\lambda_i| < 1$. Therefore, we can invert each first order polynomial on the LHS to get

$$y_t = \left(\sum_{j=0}^{\infty}\lambda_1^j L^j\right)\left(\sum_{j=0}^{\infty}\lambda_2^j L^j\right)\cdots\left(\sum_{j=0}^{\infty}\lambda_p^j L^j\right)\varepsilon_t$$

The RHS is a product of infinite-order polynomials in $L$, which can be represented as

$$y_t = \left(1 + \psi_1 L + \psi_2 L^2 + \cdots\right)\varepsilon_t$$


where the $\psi_i$ are real-valued and absolutely summable.

• The $\psi_i$ are formed of products of powers of the $\lambda_i$, which are in turn functions of the $\phi_i$.
• The $\psi_i$ are real-valued because any complex-valued $\lambda_i$ always occur in conjugate pairs. This means that if $a + bi$ is an eigenvalue of $F$, then so is $a - bi$. In multiplication
$$(a + bi)(a - bi) = a^2 - abi + abi - b^2 i^2 = a^2 + b^2$$
which is real-valued.
• This shows that an AR(p) process is representable as an infinite-order MA(q) process.
• Recall before that by recursive substitution, an AR(p) process can be written as
$$Y_{t+j} = C + FC + \cdots + F^j C + F^{j+1}Y_{t-1} + F^j E_t + F^{j-1}E_{t+1} + \cdots + FE_{t+j-1} + E_{t+j}$$
If the process is mean zero, then everything with a $C$ drops out. Take this and lag it by $j$ periods to get
$$Y_t = F^{j+1}Y_{t-j-1} + F^j E_{t-j} + F^{j-1}E_{t-j+1} + \cdots + FE_{t-1} + E_t$$
As $j \to \infty$, the lagged $Y$ on the RHS drops out. The $E_{t-s}$ are vectors of zeros except for their first element, so we see that the first equation here, in the limit, is just
$$y_t = \sum_{j=0}^{\infty}\left(F^j\right)_{1,1}\varepsilon_{t-j}$$


which makes explicit the relationship between the $\psi_i$ and the $\phi_i$ (and the $\lambda_i$ as well, recalling the previous factorization of $F^j$).
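The weights $\psi_j = (F^j)_{1,1}$ are easily computed by repeated multiplication, which also gives the impulse-response function discussed earlier. A sketch, reusing the AR(2) companion matrix from the previous example:

% Sketch: MA(infinity) weights psi_j = (F^j)(1,1) for an AR(2).
phi = [1.4, -0.45];
F = [phi; 1, 0];
J = 10;
psi = zeros(J + 1, 1);
Fj = eye(2);                       % F^0
for j = 0:J
  psi(j+1) = Fj(1, 1);             % psi_0 = 1, psi_1 = phi_1, ...
  Fj = Fj*F;
endfor
disp(psi');                        % dies off, since the process is stationary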

Moments of AR(p) process. The AR(p) process is

$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t$$

Assuming stationarity, $E(y_t) = \mu, \forall t$, so

$$\mu = c + \phi_1\mu + \phi_2\mu + \cdots + \phi_p\mu$$

so

$$\mu = \frac{c}{1 - \phi_1 - \phi_2 - \cdots - \phi_p}$$

and

$$c = \mu - \phi_1\mu - \cdots - \phi_p\mu$$

so

$$y_t - \mu = \mu - \phi_1\mu - \cdots - \phi_p\mu + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t - \mu = \phi_1(y_{t-1} - \mu) + \phi_2(y_{t-2} - \mu) + \cdots + \phi_p(y_{t-p} - \mu) + \varepsilon_t$$

With this, the second moments are easy to find: the variance is

$$\gamma_0 = \phi_1\gamma_1 + \phi_2\gamma_2 + \cdots + \phi_p\gamma_p + \sigma^2$$

The autocovariances of orders $j \geq 1$ follow the rule

$$\gamma_j = E\left[(y_t - \mu)(y_{t-j} - \mu)\right] = E\left[\left(\phi_1(y_{t-1} - \mu) + \phi_2(y_{t-2} - \mu) + \cdots + \phi_p(y_{t-p} - \mu) + \varepsilon_t\right)(y_{t-j} - \mu)\right] = \phi_1\gamma_{j-1} + \phi_2\gamma_{j-2} + \cdots + \phi_p\gamma_{j-p}$$


Using the fact that $\gamma_{-j} = \gamma_j$, one can take the $p + 1$ equations for $j = 0, 1, \dots, p$, which have the $p + 1$ unknowns $\gamma_0, \gamma_1, \dots, \gamma_p$ (given $\sigma^2$ and the $\phi_i$), and solve for the unknowns. With these, the $\gamma_j$ for $j > p$ can be solved for recursively.
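For an AR(2), for example, the three equations for $j = 0, 1, 2$ are linear in $\gamma_0, \gamma_1, \gamma_2$ and can be solved directly. A minimal Octave sketch, with arbitrary coefficient values:

% Sketch: solve the j = 0,1,2 autocovariance equations of an AR(2).
phi1 = 1.4;  phi2 = -0.45;  sig2 = 1;
A = [1,     -phi1,    -phi2;   % gamma_0 = phi1 gamma_1 + phi2 gamma_2 + sig2
     -phi1, 1 - phi2,  0;      % gamma_1 = phi1 gamma_0 + phi2 gamma_1
     -phi2, -phi1,     1];     % gamma_2 = phi1 gamma_1 + phi2 gamma_0
gam = A \ [sig2; 0; 0];        % [gamma_0; gamma_1; gamma_2]
gam3 = phi1*gam(3) + phi2*gam(2);  % higher orders follow the recursion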

24.3.2.3. Invertibility of MA(q) process. An MA(q) can be written as

$$y_t - \mu = \left(1 + \theta_1 L + \cdots + \theta_q L^q\right)\varepsilon_t$$

As before, the polynomial on the RHS can be factored as

$$1 + \theta_1 L + \cdots + \theta_q L^q = (1 - \eta_1 L)(1 - \eta_2 L)\cdots(1 - \eta_q L)$$

and each of the $(1 - \eta_i L)$ can be inverted as long as $|\eta_i| < 1$. If this is the case,

then we can write

$$\left(1 + \theta_1 L + \cdots + \theta_q L^q\right)^{-1}(y_t - \mu) = \varepsilon_t$$

where

$$\left(1 + \theta_1 L + \cdots + \theta_q L^q\right)^{-1}$$

will be an infinite-order polynomial in $L$, so we get

$$-\sum_{j=0}^{\infty}\delta_j L^j (y_{t-j} - \mu) = \varepsilon_t$$

with $\delta_0 = -1$, or

$$(y_t - \mu) - \delta_1(y_{t-1} - \mu) - \delta_2(y_{t-2} - \mu) - \cdots = \varepsilon_t$$

or

$$y_t = c + \delta_1 y_{t-1} + \delta_2 y_{t-2} + \cdots + \varepsilon_t$$


where

$$c = \mu - \delta_1\mu - \delta_2\mu - \cdots$$

So we see that an MA(q) has an infinite AR representation, as long as the $|\eta_i| < 1$, $i = 1, 2, \dots, q$.

• It turns out that one can always manipulate the parameters of an MA(q) process to find an invertible representation. For example, the two MA(1) processes
$$y_t - \mu = (1 + \theta L)\varepsilon_t$$
and
$$y_t^* - \mu = \left(1 + \theta^{-1}L\right)\varepsilon_t^*$$
have exactly the same moments if
$$\sigma_{\varepsilon^*}^2 = \sigma_\varepsilon^2\theta^2$$
For example, we've seen that
$$\gamma_0 = \sigma^2\left(1 + \theta^2\right).$$
Given the above relationships amongst the parameters,
$$\gamma_0^* = \sigma_\varepsilon^2\theta^2\left(1 + \theta^{-2}\right) = \sigma_\varepsilon^2\left(1 + \theta^2\right)$$
so the variances are the same. It turns out that all the autocovariances will be the same, as is easily checked (a numerical check appears after this list). This means that the two MA processes are observationally equivalent. As before, it's impossible to distinguish between observationally equivalent processes on the basis of data.

• For a given MA(q) process, it's always possible to manipulate the parameters to find an invertible representation (which is unique).
• It's important to find an invertible representation, since it's the only representation that allows one to represent $\varepsilon_t$ as a function of past $y$'s. The other representations express $\varepsilon_t$ as a function of future $y$'s.
• Why is invertibility important? The most important reason is that it provides a justification for the use of parsimonious models. Since an AR(1) process has an MA($\infty$) representation, one can reverse the argument and note that at least some MA($\infty$) processes have an AR(1) representation. At the time of estimation, it's a lot easier to estimate the single AR(1) coefficient rather than the infinite number of coefficients associated with the MA representation.
• This is the reason that ARMA models are popular. Combining low-order AR and MA models can usually offer a satisfactory representation of univariate time series data with a reasonable number of parameters.
• Stationarity and invertibility of ARMA models are similar to what we've seen - we won't go into the details. Likewise, calculating moments is similar.
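The observational equivalence of the two MA(1) parameterizations claimed in the list above is easy to verify numerically. A sketch in Octave, with arbitrary parameter values:

% Sketch: the two MA(1) parameterizations have the same second moments.
theta = 0.5;
sig2 = 2;                               % var(eps)
g0 = sig2*(1 + theta^2);                % gamma_0, original parameterization
g1 = sig2*theta;                        % gamma_1
sig2s = sig2*theta^2;                   % var(eps*) = sig2 * theta^2
g0s = sig2s*(1 + theta^(-2));           % gamma_0, inverted parameterization
g1s = sig2s*theta^(-1);                 % gamma_1
printf("gamma_0: %g = %g   gamma_1: %g = %g\n", g0, g0s, g1, g1s);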

EXERCISE 61. Calculate the autocovariances of an ARMA(1,1) model: $(1 + \phi L)y_t = c + (1 + \theta L)\varepsilon_t$


Bibliography

[1] Davidson, R. and J.G. MacKinnon (1993) Estimation and Inference in Econometrics, Oxford

Univ. Press.

[2] Davidson, R. and J.G. MacKinnon (2004) Econometric Theory and Methods, Oxford Univ.

Press.

[3] Gallant, A.R. (1985) Nonlinear Statistical Models, Wiley.

[4] Gallant, A.R. (1997) An Introduction to Econometric Theory, Princeton Univ. Press.

[5] Hamilton, J. (1994) Time Series Analysis, Princeton Univ. Press.

[6] Hayashi, F. (2000) Econometrics, Princeton Univ. Press.

[7] Wooldridge, J.M. (2003) Introductory Econometrics, Thomson. (Undergraduate level, for supplementary use only.)



Index

asymptotic equality, 442

Chain rule, 436

Cobb-Douglas model, 21

convergence, almost sure, 438

convergence, in distribution, 439

convergence, in probability, 438

Convergence, ordinary, 437

convergence, pointwise, 437

convergence, uniform, 437

convergence, uniform almost sure, 440

cross section, 17

estimator, linear, 28, 38

estimator, OLS, 23

extremum estimator, 247

leverage, 28

likelihood function, 49

matrix, idempotent, 27

matrix, projection, 26

matrix, symmetric, 27

observations, influential, 27

outliers, 27

own influence, 29

parameter space, 49

Product rule, 436

R-squared, uncentered, 31

R-squared, centered, 32
