the generalised method of moments

The Generalised Method of Moments

Ibrahim StevensJoint HKIMR/CCBS Workshop

Advanced Modelling for Monetary Policy in the Asia-Pacific Region

May 2004

GMM

• Why use GMM?– Nonlinear estimation– Structural estimation– ‘Robust’ estimation

• Models estimated using GMM– Many….

• Rational expectations models – Euler Equations

• Non-Gaussian distributed models

The Method of Moments

• Simple moment conditions

Population Sample

0ˆ0],cov[

0ˆ0][

'1

1

ttT

tT

XX

E

The Method of Moments

• OLS as a MM estimator

• Moment conditions:

• MM estimator

ˆˆ that so , XyXy

0)ˆ(0][1

1

T

tttT XyE

0)ˆ('0]'[ XyXXE

yXXXXXyX ''ˆ0ˆ'' 1

Slightly more Generalised MM

• IV is a MM estimator

• Moment condition:

• MM estimator:

0],cov[but ,0],cov[ ZX

0)ˆ('0]'[ IVXyZZE

yZXZXZyZ IVIV ''ˆ0ˆ'' 1

Slightly more Generalised MM

• In the previous IV estimator we have considered the case where the number of instruments is equal to the number of coefficients we want to estimate

• Size of Z is the same as the size of IV

• What happens if the number of instruments is greater than the number of coefficients?

• Essentially, the number of equations is greater than the number of coefficients you want to estimate: model is over-identified

IV with more constraints than equations

• Maintain the moment condition as before• Variance of moment condition is:

• Minimise ‘weighted’ distance:

WZZZZZ ']''[E ]'var[ 2

)ˆ(')'()''ˆ'(

''11

1

2

XyZZZZXy

ZZWV


• Why do we do a minimisation exercise?

• Because we have more equations than ‘unknows’.

• How do we determine the true values of the coefficients?

• Solution is to minimise the previous expression so that the coefficients are able to approximate the moment condition, that is pick coefficients such that the orthogonality condition is satisfied


• First order conditions:

• MM estimator (looks like an IV estimator with more instruments than parameters to estimate):

0)ˆ(')'('ˆ1

XyZZZZXV

yZZZZXXZZZZX ''')'''(ˆ 111

Moment conditions in estimation

• Model may be nonlinear– Euler equations often imply models in levels not logs

(consumption, output, other first order conditions)– Both ad hoc and structural models may be nonlinear

in parameters of interest (systems)

• Models may have unknown disturbance structure– Rational expectations– May not be interested in related parameters

A generalised problem

• Let any (nonlinear) moment condition be:

• Sample counterpart:

• Minimise:

0])',([ titt zxmE

0])',([1

1

T

ttitT zxm

T

ttit

T

ttit zxmWzxm

1

1

1

])',([])',([min


• If we have more instruments (n) than coefficients (p) we choose to minimise:

• What should the matrix W look like?

T

ttit

T

ttit zxmWzxm

1

1

1

])',([])',([min


• It turns out that any symmetric positive definite matrix of W yields consistent estimates for the parameters

• However, it does not yield efficient ones

• Hansen (1982) derives the necessary (not sufficient) condition to obtain asymptotically efficient estimates for the coefficients

Choice of W (efficiency)

• Appropriate weight matrices (Hansen, 82):

• Intuition: W-1 denotes the inverse of the covariance matrix of the sample moments. This matrix is chosen because it ‘means’ that less weight is placed on the more imprecise moments

T

ttit zxmW

1

)',(var

Implementation• Implementation is generally undertaken in a

‘two-step procedure’:

1. Any symmetric positive definite matrix yields consistent estimates of the parameters. Thus exploit this. Using ‘any’ symmetric positive definite matrix, back up estimates for the parameters in the model

• An arbitrary matrix such the identity matrix is normally used to obtain the first consistent estimator

2. Using these parameters construct the weighting matrix W and from that we can undertake the minimisation problem

• This process can be iterated– Some computational cost

Instrument validity and W

• Estimation of the minimised criterion can be used to test the validity of the instruments

• EViews gives you the ‘wrong’ Hansen J-statistic - test of overidentification

– Multiply by the number of observations to get correct J– This is a Chi squared with n-p degrees of freedom

• If a sub-optimal weighting matrix is used, Hansen’s J-test does not apply. See Chochrane 1996

– We can also test as sub-set of othogonality conditions

T

ttit

T

ttit zxmWzxmJ

1

1

1

])'ˆ,([ˆ])'ˆ,([

Covariance estimators

• Choosing the right weighting matrix is important for GMM estimation

• There have been many econometric papers written on this subject

• Estimation results can be sensitive to the choice of weighting matrix


• So far we have not considered the possibility that heteroskedasticity and autocorrelation be a part of your model

• How can we account for this?

• We need to modify the covariance matrix


• Write our covariance matrix of empirical moments as:

• Where Mq is the qth row of the Txn matrix of sample moments

T

q

T

ptitqtitpT

zxMzxMET

W1 1

),,(),,(1lim

Covariance estimators• Define the autocovariances:

• Express W in terms of the above expressions:

0for ,),,(),,()(

0for ,),,(),,()(

1

1

1

1

jzxMzxMEj

jzxMzxMEj

T

jptitjptitpT

T

jptitjptitpT

1

1

)(n

nj

n jW

Covariance estimators• If there is no serial correlation, the expressions for

j0 are all equal to zero (since the autocovariances will be zero):

• Note that this ‘looks like’ a White (1980) heteroskesdastic consistent estimator…

T

ptitptitpT

n zxMzxMEW1

1 ),,(),,()0(

Covariance estimators• If this looks like a White (1980) heteroskesdastic

consistent estimator… …implementation should be straight-forward! • Example (Remembering White): Take the standard

heteroskedastic version of the linear model

2

21

0

0

,

T

uXy

Covariance estimators• The appropriate problem and weighting matrix are

• The weighting matrix can be consistently estimated by using any consistent estimator of the model’s parameters and substituting the expected value of the squared residuals by the actual residual

(NB. The only difference here is that we are generalising the problem by allowing instruments, ie Zs)

pp

T

pp ZZuE

TZZW

uZZWu

'1')0(

''

1

2

1min

Covariance estimators• The problem is that with autocorrelation it is not

possible to replace the expected values of the squared residuals by the actual values from the first estimation

• It would lead to an inconsistent estimate of the autocovariance matrix of order j

• The problem of this approach is that, asymptotically, the number of estimated autocovariances grows at the same rate as the sample size

• Thus whilst unbiased W is not consistent in the mean squared error sense

Covariance estimators• Thus we require a class of estimators that circumvents these

problems• A class of estimators that prevent the autocovariances from

growing with the sample size are

• Parzen termed the ws’ the lag window• These estimators correspond to a class of kernel (spectral

density) estimators (evaluated at frequency zero)

1

1

)(T

Tjj jW

Covariance estimators• The key is to choose the sequence of ws’ such that

the sequence of weights approaches unity rapidly enough to obtain asymptotic unbiasedness but slowly enough to ensure that the variance converges to zero

• The type of weights you will find in EViews correspond to a particular class of lagged windows termed scale parameter windows

• The lag window is expressed as

Tj bjk /

Covariance estimators• HAC matrix estimation:

• k(j/bT) is a kernel, bT is the bandwidth• Intuition: bT streches or contracts the distribution; it

acts as a scaling parameter• k(z) is referred to as the ‘lagged window generator’

p

jT jjbjkW

1

)(ˆ)(ˆ/)0(ˆˆ

Covariance estimators• HAC matrix estimation:

• When the value of the kernel is zero for z>1, bT is called a ‘lag truncation parameter’ (autocovariances corresponding to lags greater than bT are given zero weight)

• The scalar bT is often referred to as the ‘bandwidth parameter’

p

jT jjbjkW

1

)(ˆ)(ˆ/)0(ˆˆ


• Eviews provides two kernels:1. Quadratic2. Barlett• It provides 3 options for the bandwidth parameter bT

(See manual for specific functional forms and good discussion!)

Covariance estimators• For instance Newey and West (1987) suggest

using a Barlett:

• Guarantees positive definiteness (which is something that we desire since we would like a positive variance)

p

j

jjp

jW1

)(ˆ)(ˆ1

1)0(ˆˆ

Alternative covariance estimators

• Andrews (1991)• Quadratic spectral estimator:

where:

p

jT jjbjkW

1

)(ˆ)(ˆ/)0(ˆˆ

56cos

5/6)5/6sin(

1225)( 22

xx

xx

xk

51

)(3221.1 and , Tbbjx TT

Pre-whitening

• Andrews and Monahan (1992)• Fit an VAR to the moment residuals:

where:

• This is known as a pre-whitened estimate

• Can be applied to any kernel

ttt A ˆˆˆˆ 1

.)ˆ(ˆ and ,'ˆˆˆˆ 1 AIDDWDWpw

Linear models

• Estimate by IV (consistent but inefficient):

• Use estimates to construct estimate of W:

• Can iterate on estimates of W

yZWZXXZWZX 'ˆ')'ˆ'(ˆ̂ 111

yZZZZXXZZZZX ''')'''(ˆ 111

Nonlinear models

• Estimate by nonlinear IV– May solve by standard ‘iterative’ nonlinear IV

• Estimate covariance matrix• Minimise J using non-linear optimisation• Iterate on covariance matrix (optional)• Eviews uses Berndt-Hall-Hall-Hausman or

Marquardt algorithms (see manual for pros and cons)

Useful facts

• Covariance matrix estimators must be positive-definite, asymptotically it has been shown that the quadratic spectral window is best

• But in small samples Newey and West (1994) show little difference between the Quadratic and their estimator (based on Barlett)

Useful facts

• Choice of bandwidth parameter more important than the choice of the kernel

• Variable Newey and West and Andrews is state of the art

• HAC estimators suffer from poor small sample performance, thus test statistics (eg t-test) may not be reliable – t-stats appear to reject a true null far more often than their nominal size

• Adjustments to the matrix W may be made but these depend on whether there is autocorrelation and/or heteroskesdacity

Useful facts

• Numerical Optimisation – common problem of not having a global maximum/minimum

• Eg Problems of local maximum/minimum or flat functions

• Without a global mimimum, GMM estimation does not yield consistent and efficient estimates

• Convexity of the criterion function is important – it guarantees global minima

Useful facts• For non-convex problems you must use ‘different

methods’• A multi-start algorithm popular: start at a local

optimisation algorithm from initial values of the parameters to converge to a local minimum and the repeat the process a number of times with different starting values. The estimator is taken to be the parameter values corresponding to the small value of the criterion function

• However it does not find the global minimum• Andrews (1997) proposes a stopping-rule procedure to

overcome this problem

Useful facts• Weak instrument literature• Nelson and Startz (1990) instrumental variables

estimators have poor sample properties when the instruments are weakly correlated with the explanatory variables

• Chi-square tests tend to reject the null too frequently compared to its asymptotic distribution

• T-ratios are too large• Hansen (1985) characterises an efficiency bound for the

asymptotic covariance matrices of the alternative GMM estimators and optimal instruments that attain the bound

Useful facts• Weak instrument literature – Stock, Wright and Yogo

(2002) provide an excellent summary of some of the issues related to weak instruments

• Recently, some authors advocate the use of limited-information maximum likelihood techniques to compare results with GMM estimation since both asymptotically equivalent

• Neely, Roy and Whiteman (2001) show that results can be very different for CAPM models

• Furher and Rudebush (2003) show this to be the case in Euler equations for output

• Mavroeidis (2003) finds similar results for KNPC

Useful facts• Finite sample properties of GMM estimators – similar to

weak instrument literature• Tauchen (1986) and Kocherlakota (1990) examine

artificial data generated from a non-linear CAPM• Using two-step GMM estimator Tauchen concluded that

GMM estimators and test stats had reasonable properties in small samples

• He also investigated optimal instruments finding that optimal estimators based on optimal selection of instruments often do not perform as well in small samples as GMM estimators using an arbitrary selection of instruments

Useful facts• Kocherlakota (1990) allows for multiple assets and

different sets of instruments• Using iterated GMM estimators, Kocherlakota finds

that GMM performs worse with larger instrument sets leading to downward biases in coefficient estimates and narrow confidence intervals. Also the J test tends to reject too often

• Hansen, Heaton and Yaron (1996) consider the same methods as Tauchen together with alternative choices for W. Both the 2 stage and the iterative methods have small sample distributions that can be greatly distorted

Useful facts

• Furher, Moore and Schuh (1995) compare GMM and maximum likelihood estimators in a class of nonlinear models using MonteCarlo simulations

• They find that GMM estimates tend to reject their model whilst ML support it

• Why? They find GMM estimates are often biased, statistically insignificant, economically implausible and dynamically unstable

• They attribute the result to weak instruments

Useful facts

• Nonstationarity – the data must all be nonstationary to use GMM

• Thus data are differenced or cointegrated• In the case of co-integration Cooley and Ogaki

(1996) suggest estimating the cointegration relationship using OLS and use these parameters for the covariance matrix W

Practical GMM

• Moment conditions– Theoretical moment conditions best– Empirical moment condition - try different

informational assumptions• Try ‘straight’ IV• If you know the form of autocorrelation try IV-

MA• Eviews reports J/T

Practical GMM

• Use Newey-West first– Try setting the lag truncation to something that is

close to the autocorrelation expected– Then try T/3

• Pre-whitening– Don’t do it unless nothing else works for NW

• QS-PW

– ‘State of the art’ - if it works use it

Euler equations and consumption

• Problem of intertemporal utility max:

subject to:

• Constrained problem:

1

00

max (1 )1t

t t

C t

CE

tttt CYArA 1)1(

))1((1

)1(max 10

1

0 tttttt

tt

CACYArCE

t

Euler equations and consumption

• First order conditions:

• Euler equation of:

011

0][

10

0

tt

tt

rE

CE

1 10

1 1 01

t t

t

r CEC

Consumption reduced form- dummy’s guide to Lucas’ critique

• Income process:

• Consumption function:

where

tttt Yr

rArC

11)1()1)(1( 1

1

1

)1(

)1(

r,

)1()()1(

0

i

iittit

t rYEY

ttt YY 1

Consumption moment conditions

• The orthogonality (zero mean) conditions:

011

1

11

t

ttt C

CRE

011

1

11

tt

ttt R

CCRE

011

1 1

11

t

t

t

ttt C

CCCRE

Conclusion• GMM a natural way to estimate many models

• Helpful since – It does not impose restrictions on the distribution of

errors– Allows for heteroskedasticity of unknown form– Estimates the parameters even when models cannot be

solved analytically

Useful References

• Davidson, R and MacKinnon, J. D., 2004, Econometric Theory and Methods, OUP

• Hayashi, F., 2000, Econometrics, Princeton• Matyas, L., 1999, Generalised Methods of Moments

Estimation• Cochrane, J. H., 2001, Asset Pricing, Princeton• Further references on especially the classic paper by Hansen

(1982), and papers by Hansen and Singleton, Newey and McFadden, and others can be found in the above textbook.

• The Handbook of Statistics, Vol. 11: Econometrics, edited by G.S. Maddala, C.R. Rao, and H.D. Vinod, North-Holland, Amsterdam, (1993) has two good papers on GMM by Ogaki and Hall

the generalised method of moments

Documents

number of coefficients

nonlinear moment condition

weighting matrix w

covariance matrix

identity matrix

number of instruments

number of equations

sufficient condition