TRANSCRIPT
Modern Basic Econometrics – Lectures
INTRODUCTION TO MATRIX APPROACH TO ECONOMETRICS
Esben Høg
Department of Mathematical Sciences, Aalborg University
18 February 2016
FSA ECONOMETRICS MATRIX APPROACH TO ECONOMETRICS AND REGRESSION ESBEN HØG 2016
List of Slides
1 Introduction
2 The Multiple Regression Model
3 The Multivariate Normal Distribution
4 Conditioning a multivariate normal distribution
5 The Bivariate Normal Distribution
6 Least Squares and differentiation w.r.t. a vector
6.1 Some formulas of Matrix Differentiation
6.2 The distribution of $\hat{\boldsymbol{\beta}}$
6.3 The Gauss-Markov theorem
7 The decomposition of sums of squares
8 Geometric interpretation of OLS
9 Linear Restrictions
9.1 Estimation
9.2 Tests of linear restrictions
9.3 The Chow Forecast test
9.4 Chow's Breakpoint Test
10 Heteroskedasticity defined
11 OLS with Heteroskedastic Errors
12 Heteroskedasticity-Robust Inference
13 GLS estimation
14 FGLS estimation
14.1 White's HCSE
1. Introduction
• In the following I will treat a number of subjects closely related to the standard multiple regression models in econometrics.

• These subjects are more or less thoroughly examined in other courses. The major difference is that here I almost exclusively use the notation of matrix algebra. The first half of the present slide set is, however, probably more or less well known to most of you.

• There are several reasons to make yourself well acquainted with this:

– Linear models are easier to work with in matrix notation.

– A lot of scientific papers use this notation.

– Several programming languages are matrix-oriented, which eases the implementation of calculations.
2. The Multiple Regression Model
• The general linear multiple regression model with $k$ explanatory variables and $n$ observations is fundamentally a system of $n$ equations:

$$
\begin{aligned}
y_1 &= \beta_1 + \beta_2 x_{12} + \dots + \beta_k x_{1k} + u_1,\\
y_2 &= \beta_1 + \beta_2 x_{22} + \dots + \beta_k x_{2k} + u_2,\\
y_3 &= \beta_1 + \beta_2 x_{32} + \dots + \beta_k x_{3k} + u_3,\\
&\;\;\vdots\\
y_n &= \beta_1 + \beta_2 x_{n2} + \dots + \beta_k x_{nk} + u_n,
\end{aligned}
$$

• with $u_1, \dots, u_n \overset{iid}{\sim} N(0, \sigma^2)$.
• If we construct the following vectors and matrices

$$
\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
\boldsymbol{\beta} = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{pmatrix} \quad\text{and}\quad
\mathbf{u} = \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix},
$$

• and

$$
\mathbf{X} = \begin{pmatrix}
1 & x_{12} & \cdots & x_{1k} \\
1 & x_{22} & \cdots & x_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_{n2} & \cdots & x_{nk}
\end{pmatrix}
= \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1k} \\
x_{21} & x_{22} & \cdots & x_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nk}
\end{pmatrix},
$$
• the regression model can be written as

$$
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} =
\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1k} \\
x_{21} & x_{22} & \cdots & x_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nk}
\end{pmatrix}
\begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{pmatrix} +
\begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}
\;\Leftrightarrow\;
\underset{n \times 1}{\mathbf{y}} = \underset{n \times k}{\mathbf{X}}\,\underset{k \times 1}{\boldsymbol{\beta}} + \underset{n \times 1}{\mathbf{u}},
\quad \mathbf{u} \sim N_n(\mathbf{0}, \sigma^2 \mathbf{I}_n).
$$

• This is perhaps known from the theory of linear normal models.
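As a quick sanity check on the notation, here is a minimal NumPy sketch (my own illustration, not from the slides; all dimensions and parameter values are made up) that builds such an $\mathbf{X}$ with an intercept column and simulates $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3  # n observations, k coefficients (including the intercept)

# Design matrix: first column of ones, remaining columns are regressors.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])

beta = np.array([1.0, 2.0, -0.5])     # arbitrary "true" parameter vector
sigma = 0.1
u = rng.normal(scale=sigma, size=n)   # u ~ N_n(0, sigma^2 I_n)

y = X @ beta + u                      # the model in matrix form
```

The matrix product `X @ beta` performs exactly the $n$ equations of the previous slide in one step.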
3. The Multivariate Normal Distribution

We introduce some concepts from multivariate normal (Gaussian) distributions.

$\mathbf{y}$ is multivariate normally distributed with mean vector $\boldsymbol{\mu}$ and non-singular covariance matrix $\boldsymbol{\Sigma}$:

$$ \mathbf{y} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma}). $$

The multivariate density of $\mathbf{y}$ is

$$
f(\mathbf{y}) = \frac{1}{(2\pi)^{n/2} \sqrt{|\boldsymbol{\Sigma}|}} \exp\!\Big( -\tfrac{1}{2} (\mathbf{y} - \boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\mathbf{y} - \boldsymbol{\mu}) \Big).
$$
Mean vector: $E(\mathbf{y}) = \boldsymbol{\mu}$.

Covariance matrix: $\operatorname{var}(\mathbf{y}) = E\big[(\mathbf{y} - \boldsymbol{\mu})(\mathbf{y} - \boldsymbol{\mu})'\big] = \boldsymbol{\Sigma}$.
4. Conditioning a multivariate normal distribution

Partitioned vectors and covariance matrix:

$$
\mathbf{y} = \begin{pmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \end{pmatrix}, \quad
\boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix}, \quad
\boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{pmatrix}.
$$

If $\mathbf{y} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then the distribution of $\mathbf{y}_2$ conditional on $\mathbf{y}_1$ is multivariate normal with

$$
\mathbf{y}_2 \mid \mathbf{y}_1 \sim N\big( \boldsymbol{\mu}_2 + \boldsymbol{\Sigma}_{21} \boldsymbol{\Sigma}_{11}^{-1} (\mathbf{y}_1 - \boldsymbol{\mu}_1),\; \boldsymbol{\Sigma}_{22} - \boldsymbol{\Sigma}_{21} \boldsymbol{\Sigma}_{11}^{-1} \boldsymbol{\Sigma}_{12} \big).
$$

In particular this is important if $\mathbf{y}$ is a time series, where for example $\mathbf{y}_2$ corresponds to a future period and $\mathbf{y}_1$ to the present. Then, under normality, we have a conditional normal distribution for the future observations given the present observations.
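The conditioning formula is easy to implement directly; the sketch below (my own illustration, with a hypothetical helper `condition_mvn`) checks it against the bivariate special case treated on the next slide:

```python
import numpy as np

def condition_mvn(mu, Sigma, y1, n1):
    """Distribution of y2 | y1 for y ~ N(mu, Sigma), where the first n1
    coordinates are observed as y1. Returns (conditional mean, conditional cov)."""
    mu1, mu2 = mu[:n1], mu[n1:]
    S11, S12 = Sigma[:n1, :n1], Sigma[:n1, n1:]
    S21, S22 = Sigma[n1:, :n1], Sigma[n1:, n1:]
    A = S21 @ np.linalg.inv(S11)          # Sigma_21 Sigma_11^{-1}
    return mu2 + A @ (y1 - mu1), S22 - A @ S12

# Bivariate check: mu = (0, 1), sigma1 = 1, sigma2 = 2, rho = 0.5, y1 = 0.8.
rho, s1, s2 = 0.5, 1.0, 2.0
mu = np.array([0.0, 1.0])
Sigma = np.array([[s1**2, rho * s1 * s2],
                  [rho * s1 * s2, s2**2]])
m, V = condition_mvn(mu, Sigma, y1=np.array([0.8]), n1=1)
```

Here the conditional mean should be $\mu_2 + \rho\,(\sigma_2/\sigma_1)(y_1 - \mu_1) = 1 + 0.5 \cdot 2 \cdot 0.8 = 1.8$ and the conditional variance $(1 - \rho^2)\sigma_2^2 = 3$.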
5. The Bivariate Normal Distribution

Special case: the bivariate case, that is, $\mathbf{y}_1 = y_1$ and $\mathbf{y}_2 = y_2$ are both scalars,

$$
\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}, \quad
\boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad
\boldsymbol{\Sigma} = \begin{pmatrix} \sigma_1^2 & \rho \sigma_1 \sigma_2 \\ \rho \sigma_1 \sigma_2 & \sigma_2^2 \end{pmatrix}.
$$

Under normality $y_2$ is conditionally normal given $y_1$:

$$
y_2 \mid y_1 \sim N\Big( \mu_2 + \rho \frac{\sigma_2}{\sigma_1} (y_1 - \mu_1),\; (1 - \rho^2) \sigma_2^2 \Big),
$$

where $\rho$ is the correlation coefficient between $y_1$ and $y_2$.
• Examples of scatter plots for $(x, y)$ bivariate normally distributed (figures omitted; only the parameter settings of the panels are kept):

– No. 1 (uncorrelated, $\rho = 0$): $X \sim N(0,1), Y \sim N(0,1)$; $X \sim N(0,2), Y \sim N(0,1)$; and $X \sim N(0,1), Y \sim N(0,2)$.

– No. 2 (positive correlation): $X \sim N(0,1), Y \sim N(0,1)$ with $\rho = 0.25$, $\rho = 0.5$ and $\rho = 0.75$.

– No. 3 (negative correlation): $X \sim N(0,1), Y \sim N(0,1)$ with $\rho = -0.25$, $\rho = -0.5$ and $\rho = -0.75$.

– No. 4 (unequal variances, all with $\rho = -0.75$): $X \sim N(0,1), Y \sim N(0,1)$; $X \sim N(0,2), Y \sim N(0,1)$; and $X \sim N(0,1), Y \sim N(0,2)$.
Classical Assumptions in Econometrics:

A1. Zero conditional mean: $E(\mathbf{u} \mid \mathbf{X}) = \mathbf{0}$.

A2. No perfect collinearity: $\operatorname{rank}(\mathbf{X}) = k \Leftrightarrow |\mathbf{X}^\top \mathbf{X}| \neq 0$.

A3. Homoskedasticity and no serial correlation: $\operatorname{var}(\mathbf{u} \mid \mathbf{X}) = \sigma^2 \mathbf{I}_n$.

A4. Normality of errors: $\mathbf{u} \sim N_n(\mathbf{0}, \sigma^2 \mathbf{I}_n)$,

where

$$
\sigma^2 \mathbf{I}_n = \sigma^2 \begin{pmatrix}
1 & 0 & \dots & 0 \\
0 & 1 & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \dots & 1
\end{pmatrix}.
$$
The case of a General Covariance Matrix

• The general case, where we may have heteroscedasticity and/or autocorrelation, corresponds to situations where the covariance matrix of $\mathbf{u}$ is no longer $\sigma^2 \mathbf{I}_n$. Instead the covariance matrix of $\mathbf{u}$ has the following general form:

$$
\operatorname{var}(\mathbf{u}) = E(\mathbf{u}\mathbf{u}^\top) =
\begin{pmatrix}
\operatorname{var}(u_1) & \operatorname{cov}(u_1, u_2) & \dots & \operatorname{cov}(u_1, u_n) \\
\operatorname{cov}(u_2, u_1) & \operatorname{var}(u_2) & \dots & \operatorname{cov}(u_2, u_n) \\
\vdots & \vdots & \ddots & \vdots \\
\operatorname{cov}(u_n, u_1) & \operatorname{cov}(u_n, u_2) & \dots & \operatorname{var}(u_n)
\end{pmatrix}
=
\begin{pmatrix}
\sigma_1^2 & \sigma_{12} & \dots & \sigma_{1n} \\
\sigma_{21} & \sigma_2^2 & \dots & \sigma_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{n1} & \sigma_{n2} & \dots & \sigma_n^2
\end{pmatrix}
= \boldsymbol{\Sigma}.
$$
Example: Heteroscedasticity

• For so-called heteroscedasticity of the form

$$ \operatorname{var}(u_i) = \sigma_i^2 = \sigma^2 z_i^2 \quad \text{for } i = 1, \dots, n, $$

where $z_1, \dots, z_n$ are some known values,

• the "General Covariance Matrix" then has the form

$$
\operatorname{var}(\mathbf{u}) = \boldsymbol{\Sigma} = \sigma^2
\begin{pmatrix}
z_1^2 & 0 & \dots & 0 \\
0 & z_2^2 & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \dots & z_n^2
\end{pmatrix}.
$$
Example: Autocorrelation

• For first-order autocorrelation of the form

$$ u_t = \rho u_{t-1} + \epsilon_t, \quad \epsilon_t \sim \operatorname{nid}(0, \sigma_\epsilon^2) \quad \text{for } t = 1, \dots, T, $$

• it turns out that the "General Covariance Matrix" has the form

$$
\operatorname{var}(\mathbf{u}) = \boldsymbol{\Sigma} = \frac{\sigma_\epsilon^2}{1 - \rho^2}
\begin{pmatrix}
1 & \rho & \dots & \rho^{T-2} & \rho^{T-1} \\
\rho & 1 & \dots & \rho^{T-3} & \rho^{T-2} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
\rho^{T-1} & \rho^{T-2} & \dots & \rho & 1
\end{pmatrix}.
$$

• Exercise: Show that $\boldsymbol{\Sigma}$ indeed has this form.
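The claimed form (entry $(s,t)$ equal to $\sigma_\epsilon^2 \rho^{|s-t|} / (1-\rho^2)$) can at least be checked numerically, which is not a proof but a useful sanity check before attempting the exercise. A sketch with made-up parameter values, comparing against the sample covariance of many simulated stationary AR(1) paths:

```python
import numpy as np

rng = np.random.default_rng(42)
T, rho, sig_eps = 5, 0.6, 1.0

# Claimed form: Sigma[s, t] = sig_eps^2 / (1 - rho^2) * rho^{|s - t|}.
idx = np.arange(T)
Sigma = sig_eps**2 / (1 - rho**2) * rho ** np.abs(idx[:, None] - idx[None, :])

# Monte Carlo check: simulate many stationary AR(1) paths.
n_paths = 200_000
u = np.empty((n_paths, T))
# Draw u_0 from the stationary distribution N(0, sig_eps^2 / (1 - rho^2)).
u[:, 0] = rng.normal(scale=sig_eps / np.sqrt(1 - rho**2), size=n_paths)
for t in range(1, T):
    u[:, t] = rho * u[:, t - 1] + rng.normal(scale=sig_eps, size=n_paths)

Sigma_mc = np.cov(u, rowvar=False)   # sample covariance across paths
```

With 200,000 paths the sample covariance matrix agrees with the formula to within Monte Carlo noise.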
6. Least Squares and differentiation w.r.t. a vector

• The first task is to estimate the regression parameters $\beta_1, \dots, \beta_k$ by OLS.

• OLS minimizes the sum of squared residuals:

$$
\min_{\beta_1, \beta_2, \dots, \beta_k} \sum_{i=1}^n u_i^2 = \min_{\beta_1, \beta_2, \dots, \beta_k} Q(\beta_1, \beta_2, \dots, \beta_k) = \min_{\boldsymbol{\beta}} Q(\boldsymbol{\beta}).
$$

• We interpret the sum of squared errors as a function of the unknown parameter vector $\boldsymbol{\beta}$. By minimizing $Q(\boldsymbol{\beta})$ w.r.t. $\boldsymbol{\beta}$ we find the OLS estimate $\hat{\boldsymbol{\beta}}$.

• To do that we differentiate $Q(\boldsymbol{\beta})$ w.r.t. $\boldsymbol{\beta}$, set the first-order derivatives equal to zero, and solve the resulting $k$ equations, which we call the normal equations.

• The second-order differentiation w.r.t. $\boldsymbol{\beta}$ produces a square matrix (a Hessian), which should be positive definite in order for the normal equations to define a minimum.
• We use the notation $\mathbf{u}^\top = \begin{bmatrix} u_1 & u_2 & \cdots & u_n \end{bmatrix}$, where $\underset{n \times 1}{\mathbf{u}}$ is the column vector of residuals.

• Obviously, by the definition of the inner product,

$$
\underset{1 \times n}{\mathbf{u}^\top}\,\underset{n \times 1}{\mathbf{u}} = u_1 u_1 + u_2 u_2 + \cdots + u_n u_n = \sum_{i=1}^n u_i^2 = Q(\boldsymbol{\beta}),
$$

• which motivates the need to calculate a derivative like

$$ \frac{\partial\, \mathbf{u}^\top \mathbf{u}}{\partial \boldsymbol{\beta}}. $$
• By definition $\underset{n \times 1}{\mathbf{u}} = \underset{n \times 1}{\mathbf{y}} - \underset{n \times k}{\mathbf{X}}\,\underset{k \times 1}{\boldsymbol{\beta}}$. If we insert this we obtain

$$
\begin{aligned}
Q(\boldsymbol{\beta}) = \mathbf{u}^\top \mathbf{u} &= (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \\
&= \mathbf{y}^\top \mathbf{y} - \boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{y} - \mathbf{y}^\top \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{X} \boldsymbol{\beta} \\
&= \mathbf{y}^\top \mathbf{y} - 2 \boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{y} + \boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{X} \boldsymbol{\beta},
\end{aligned}
$$

• and since

$$
\frac{\partial Q(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = \frac{\partial\, \mathbf{y}^\top \mathbf{y}}{\partial \boldsymbol{\beta}} - \frac{\partial\, 2 \boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{y}}{\partial \boldsymbol{\beta}} + \frac{\partial\, \boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{X} \boldsymbol{\beta}}{\partial \boldsymbol{\beta}},
$$

• we obviously have to know how to calculate these three derivatives of scalar functions of $\boldsymbol{\beta}$:

$$
\frac{\partial\, \mathbf{y}^\top \mathbf{y}}{\partial \boldsymbol{\beta}}, \quad \frac{\partial\, 2 \boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{y}}{\partial \boldsymbol{\beta}} \quad \text{and} \quad \frac{\partial\, \boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{X} \boldsymbol{\beta}}{\partial \boldsymbol{\beta}}.
$$
6.1. Some formulas of Matrix Differentiation

• Consider the case where we want to differentiate a scalar function of a vector, $f(\boldsymbol{\delta})$, with respect to that very same vector $\underset{k \times 1}{\boldsymbol{\delta}}$.

• Then the rule is:

$$
\frac{\partial f(\boldsymbol{\delta})}{\partial \boldsymbol{\delta}} =
\begin{pmatrix}
\partial f(\boldsymbol{\delta}) / \partial \delta_1 \\
\partial f(\boldsymbol{\delta}) / \partial \delta_2 \\
\vdots \\
\partial f(\boldsymbol{\delta}) / \partial \delta_k
\end{pmatrix}.
$$

• From the previous we have three "relevant" forms of scalar functions of $\boldsymbol{\beta}$:

$$
f(\boldsymbol{\beta}) = \mathbf{y}^\top \mathbf{y}, \quad f(\boldsymbol{\beta}) = \boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{y} = \mathbf{y}^\top \mathbf{X} \boldsymbol{\beta} \quad \text{and} \quad f(\boldsymbol{\beta}) = \boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{X} \boldsymbol{\beta}.
$$

• I.e. a constant, a linear function, and a quadratic form.
1. The constant function is

$$ f(\boldsymbol{\beta}) = \mathbf{y}^\top \mathbf{y}. $$

Therefore we have

$$
\frac{\partial\, \mathbf{y}^\top \mathbf{y}}{\partial \boldsymbol{\beta}} =
\begin{pmatrix}
\partial \big( \sum_{i=1}^n y_i^2 \big) / \partial \beta_1 \\
\partial \big( \sum_{i=1}^n y_i^2 \big) / \partial \beta_2 \\
\vdots \\
\partial \big( \sum_{i=1}^n y_i^2 \big) / \partial \beta_k
\end{pmatrix}
=
\begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}
= \underset{k \times 1}{\mathbf{0}}.
$$
2. The linear function is $f(\boldsymbol{\beta}) = \boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{y} = \mathbf{y}^\top \mathbf{X} \boldsymbol{\beta}$.

Because $\underset{k \times n}{\mathbf{X}^\top}\,\underset{n \times 1}{\mathbf{y}}$ is a $k \times 1$ vector (call it $\mathbf{a}$, for example), we have, with $\mathbf{a} = \begin{bmatrix} a_1 & a_2 & \dots & a_k \end{bmatrix}^\top$,

$$ f(\boldsymbol{\beta}) = \boldsymbol{\beta}^\top \mathbf{a} = \beta_1 a_1 + \beta_2 a_2 + \cdots + \beta_k a_k = \mathbf{a}^\top \boldsymbol{\beta}. $$

So

$$
\frac{\partial (\boldsymbol{\beta}^\top \mathbf{a})}{\partial \boldsymbol{\beta}} = \frac{\partial (\mathbf{a}^\top \boldsymbol{\beta})}{\partial \boldsymbol{\beta}} =
\begin{pmatrix}
\partial (\beta_1 a_1 + \beta_2 a_2 + \cdots + \beta_k a_k) / \partial \beta_1 \\
\partial (\beta_1 a_1 + \beta_2 a_2 + \cdots + \beta_k a_k) / \partial \beta_2 \\
\vdots \\
\partial (\beta_1 a_1 + \beta_2 a_2 + \cdots + \beta_k a_k) / \partial \beta_k
\end{pmatrix}
=
\begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_k \end{pmatrix}
= \mathbf{a} = \mathbf{X}^\top \mathbf{y}.
$$
3. The quadratic form is

$$ f(\boldsymbol{\beta}) = \boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{X} \boldsymbol{\beta}. $$

$\underset{k \times n}{\mathbf{X}^\top}\,\underset{n \times k}{\mathbf{X}}$ is a symmetric $k \times k$ matrix. Call it $\mathbf{A}$. Then we have

$$
f(\boldsymbol{\beta}) = \underset{1 \times k}{\boldsymbol{\beta}^\top}\,\underset{k \times k}{\mathbf{A}}\,\underset{k \times 1}{\boldsymbol{\beta}} =
\begin{bmatrix} \beta_1 & \beta_2 & \cdots & \beta_k \end{bmatrix}
\begin{pmatrix}
a_{11} & a_{12} & \dots & a_{1k} \\
a_{21} & a_{22} & \dots & a_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
a_{k1} & a_{k2} & \dots & a_{kk}
\end{pmatrix}
\begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{pmatrix}.
$$

The derivative in this case is best understood via a simple illustration.
Therefore, let us calculate

$$
f(\boldsymbol{\beta}) = \underset{1 \times 3}{\boldsymbol{\beta}^\top}\,\underset{3 \times 3}{\mathbf{A}}\,\underset{3 \times 1}{\boldsymbol{\beta}} =
\begin{bmatrix} \beta_1 & \beta_2 & \beta_3 \end{bmatrix}
\begin{pmatrix}
a_{11} & a_{12} & a_{13} \\
a_{12} & a_{22} & a_{23} \\
a_{13} & a_{23} & a_{33}
\end{pmatrix}
\begin{pmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix}
$$

$$
= \begin{bmatrix}
a_{11}\beta_1 + a_{12}\beta_2 + a_{13}\beta_3 & a_{12}\beta_1 + a_{22}\beta_2 + a_{23}\beta_3 & a_{13}\beta_1 + a_{23}\beta_2 + a_{33}\beta_3
\end{bmatrix}
\begin{pmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix}
$$

$$
= a_{11}\beta_1^2 + a_{12}\beta_1\beta_2 + a_{13}\beta_1\beta_3 + a_{12}\beta_2\beta_1 + a_{22}\beta_2^2 + a_{23}\beta_2\beta_3 + a_{13}\beta_3\beta_1 + a_{23}\beta_3\beta_2 + a_{33}\beta_3^2
$$

$$
= a_{11}\beta_1^2 + a_{22}\beta_2^2 + a_{33}\beta_3^2 + 2a_{12}\beta_1\beta_2 + 2a_{13}\beta_1\beta_3 + 2a_{23}\beta_2\beta_3.
$$
So,

$$
\frac{\partial\, \boldsymbol{\beta}^\top \mathbf{A} \boldsymbol{\beta}}{\partial \boldsymbol{\beta}} =
\begin{pmatrix}
\partial f(\boldsymbol{\beta}) / \partial \beta_1 \\
\partial f(\boldsymbol{\beta}) / \partial \beta_2 \\
\partial f(\boldsymbol{\beta}) / \partial \beta_3
\end{pmatrix}
=
\begin{pmatrix}
2(a_{11}\beta_1 + a_{12}\beta_2 + a_{13}\beta_3) \\
2(a_{22}\beta_2 + a_{12}\beta_1 + a_{23}\beta_3) \\
2(a_{33}\beta_3 + a_{13}\beta_1 + a_{23}\beta_2)
\end{pmatrix}
=
\begin{pmatrix}
2(a_{11}\beta_1 + a_{12}\beta_2 + a_{13}\beta_3) \\
2(a_{12}\beta_1 + a_{22}\beta_2 + a_{23}\beta_3) \\
2(a_{13}\beta_1 + a_{23}\beta_2 + a_{33}\beta_3)
\end{pmatrix}
= 2
\begin{pmatrix}
a_{11} & a_{12} & a_{13} \\
a_{12} & a_{22} & a_{23} \\
a_{13} & a_{23} & a_{33}
\end{pmatrix}
\begin{pmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix}
= 2 \mathbf{A} \boldsymbol{\beta}.
$$

Hence

$$ \frac{\partial (\boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{X} \boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = 2 \mathbf{X}^\top \mathbf{X} \boldsymbol{\beta}. $$
• So much tedious algebra to convince ourselves that

$$
\frac{\partial\, \mathbf{y}^\top \mathbf{y}}{\partial \boldsymbol{\beta}} = \mathbf{0}, \quad
\frac{\partial\, \boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{y}}{\partial \boldsymbol{\beta}} = \mathbf{X}^\top \mathbf{y} \quad \text{and} \quad
\frac{\partial\, \boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{X} \boldsymbol{\beta}}{\partial \boldsymbol{\beta}} = 2 \mathbf{X}^\top \mathbf{X} \boldsymbol{\beta}.
$$

• Hence we must have:

$$
\frac{\partial Q(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = \frac{\partial\, \mathbf{y}^\top \mathbf{y}}{\partial \boldsymbol{\beta}} - 2 \frac{\partial\, \boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{y}}{\partial \boldsymbol{\beta}} + \frac{\partial\, \boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{X} \boldsymbol{\beta}}{\partial \boldsymbol{\beta}} = -2 \mathbf{X}^\top \mathbf{y} + 2 \mathbf{X}^\top \mathbf{X} \boldsymbol{\beta}.
$$

• And therefore the normal equations are

$$
\frac{\partial Q(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = \mathbf{0} \quad \Leftrightarrow \quad \mathbf{X}^\top \mathbf{X} \boldsymbol{\beta} = \mathbf{X}^\top \mathbf{y}.
$$
• There exist three equivalent necessary and sufficient conditions that ensure that $\mathbf{X}^\top \mathbf{X}$ is invertible:

1. The columns in $\mathbf{X}$ are linearly independent.

2. $\mathbf{X}^\top \mathbf{X}$ has full rank, that is, $\rho(\mathbf{X}^\top \mathbf{X}) = k$.

3. $|\mathbf{X}^\top \mathbf{X}| \neq 0$.

• If these conditions are not met, we are faced with so-called perfect collinearity.

• If, however, the three conditions are fulfilled, $(\mathbf{X}^\top \mathbf{X})^{-1}$ is well defined and the OLS estimate of $\boldsymbol{\beta}$ is

$$ \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}. $$
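Numerically one usually solves the normal equations rather than forming the inverse explicitly. A minimal NumPy sketch (my own illustration, on simulated data), cross-checked against a library least-squares routine:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

# Normal equations: X'X beta = X'y.  Solving the linear system is
# numerically preferable to computing (X'X)^{-1} explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Both routes give the same $\hat{\boldsymbol{\beta}}$ up to floating-point precision.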
6.2. The distribution of $\hat{\boldsymbol{\beta}}$

• Note that $\hat{\boldsymbol{\beta}}$ is also the MLE of $\boldsymbol{\beta}$.

• Given $\mathbf{X}$, $\hat{\boldsymbol{\beta}}$ is a linear function of $\mathbf{y}$. From the theory of linear normal models we then know that $\hat{\boldsymbol{\beta}}$ is normally distributed.

• It is easily shown that $E[\hat{\boldsymbol{\beta}} \mid \mathbf{X}] = \boldsymbol{\beta}$; therefore $\hat{\boldsymbol{\beta}}$ is an unbiased estimator of $\boldsymbol{\beta}$.
• Furthermore it is easily shown that the variance matrix of $\hat{\boldsymbol{\beta}}$ is

$$ \operatorname{var}(\hat{\boldsymbol{\beta}} \mid \mathbf{X}) = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}. $$
• Hence $\hat{\boldsymbol{\beta}}$ is multivariate normally distributed according to

$$ \hat{\boldsymbol{\beta}} \sim N_k\big(\boldsymbol{\beta}, \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}\big). $$

• This fact is used to test linear hypotheses on $\boldsymbol{\beta}$.

• We estimate $\sigma^2$ by the unbiased estimator $s^2$:

$$ s^2 = \frac{(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^\top (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})}{n - k} = \frac{\hat{\mathbf{u}}^\top \hat{\mathbf{u}}}{n - k}. $$

• Then

$$ \frac{\hat{\beta}_j - \beta_j}{s \sqrt{x^{jj}}} \sim t(n - k), $$

where $x^{jj}$ is the $j$th diagonal element of $(\mathbf{X}^\top \mathbf{X})^{-1}$.
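The quantities on this slide translate directly into a few lines of code; a sketch (my own illustration, on simulated data) computing $s^2$, the standard errors $s\sqrt{x^{jj}}$, and the t-statistics for $H_0 : \beta_j = 0$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 2.0, 0.0])
y = X @ beta + rng.normal(scale=0.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
u_hat = y - X @ beta_hat               # residual vector

s2 = (u_hat @ u_hat) / (n - k)          # unbiased estimator of sigma^2
se = np.sqrt(s2 * np.diag(XtX_inv))     # standard errors s * sqrt(x^{jj})
t_stats = beta_hat / se                 # t-statistics for H0: beta_j = 0
```

Under A1-A4 each `t_stats[j]` follows a $t(n - k)$ distribution under its null.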
6.3. The Gauss-Markov theorem

• This theorem states that the OLS estimate $\hat{\boldsymbol{\beta}}$ is BLUE (Best Linear Unbiased Estimator).

BLUE:
1. Unbiased: $E[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}$.
2. Efficient (minimum variance): $\operatorname{var}(\hat{\boldsymbol{\beta}})$ smallest possible.
3. A linear function of the $y$'s.

Point 2 says that the estimate has the smallest variance among all possible unbiased linear estimators. (A square matrix $\mathbf{A}$ is here defined to be smaller than another square matrix $\mathbf{B}$ if $\mathbf{B} - \mathbf{A}$ is positive definite.)
7. The decomposition of sums of squares

• We can construct the usual decomposition of the variation of $\mathbf{y}$, since

$$ \mathbf{y} = \mathbf{X}\hat{\boldsymbol{\beta}} + \hat{\mathbf{u}}, $$

where $\hat{\mathbf{u}}$ is the residual vector. Then

$$
\mathbf{y}^\top \mathbf{y} = (\mathbf{X}\hat{\boldsymbol{\beta}} + \hat{\mathbf{u}})^\top (\mathbf{X}\hat{\boldsymbol{\beta}} + \hat{\mathbf{u}}) = \hat{\boldsymbol{\beta}}^\top \mathbf{X}^\top \mathbf{X} \hat{\boldsymbol{\beta}} + 2 \hat{\boldsymbol{\beta}}^\top \mathbf{X}^\top \hat{\mathbf{u}} + \hat{\mathbf{u}}^\top \hat{\mathbf{u}}.
$$
• Since

$$
\begin{aligned}
\hat{\boldsymbol{\beta}}^\top \mathbf{X}^\top \hat{\mathbf{u}} &= \hat{\boldsymbol{\beta}}^\top \mathbf{X}^\top (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) \\
&= \big((\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}\big)^\top \mathbf{X}^\top \big(\mathbf{y} - \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}\big) \\
&= \mathbf{y}^\top \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \big(\mathbf{y} - \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}\big) \\
&= \mathbf{y}^\top \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} - \mathbf{y}^\top \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} \\
&= \mathbf{y}^\top \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} - \mathbf{y}^\top \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} = 0,
\end{aligned}
$$

• we have

$$ \mathbf{y}^\top \mathbf{y} = \hat{\boldsymbol{\beta}}^\top \mathbf{X}^\top \mathbf{X} \hat{\boldsymbol{\beta}} + \hat{\mathbf{u}}^\top \hat{\mathbf{u}}. $$
• This only provides us with the sums of squares, but what we want is the squared deviations from averages. This is obtained by subtracting $n\bar{y}^2 = \tfrac{1}{n}(\boldsymbol{\iota}^\top \mathbf{y})^2$ on both sides:

$$
\underbrace{\mathbf{y}^\top \mathbf{y} - \tfrac{1}{n}(\boldsymbol{\iota}^\top \mathbf{y})^2}_{\text{Total variation (SS Total)}} = \underbrace{\hat{\boldsymbol{\beta}}^\top \mathbf{X}^\top \mathbf{X} \hat{\boldsymbol{\beta}} - \tfrac{1}{n}(\boldsymbol{\iota}^\top \mathbf{y})^2}_{\text{Explained variation (SS Explained)}} + \underbrace{\hat{\mathbf{u}}^\top \hat{\mathbf{u}}}_{\text{Unexplained variation (SS Residuals)}}.
$$

Here:

SSE: the part of the variation in $y$ which is explained by $\mathbf{X}$.
SSR: the part of the variation in $y$ which is not explained by the model.
• The coefficient of determination is now defined as

$$ R^2 = \frac{\text{ESS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}}, $$

• and the adjusted coefficient of determination as

$$ R^2_{\text{adj}} = 1 - \frac{n - 1}{n - k} (1 - R^2). $$

• Note the intuition in the definition of $R^2_{\text{adj}}$: mean sums of squares are used instead of sums of squares.
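The decomposition and both $R^2$ measures can be verified numerically; a sketch (my own illustration, on simulated data; note that TSS = ESS + RSS relies on $\mathbf{X}$ containing an intercept):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 80, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_fit = X @ beta_hat
u_hat = y - y_fit

TSS = np.sum((y - y.mean()) ** 2)        # total variation
RSS = np.sum(u_hat ** 2)                 # unexplained variation
ESS = np.sum((y_fit - y.mean()) ** 2)    # explained variation (intercept present)

R2 = 1 - RSS / TSS
R2_adj = 1 - (n - 1) / (n - k) * (1 - R2)
```

Since $(n-1)/(n-k) \ge 1$, the adjusted measure never exceeds the unadjusted one.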
8. Geometric interpretation of OLS

• Given the OLS estimator $\hat{\boldsymbol{\beta}}$, the predicted value of $\mathbf{y}$ is defined by

$$ \hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \underbrace{\mathbf{X}(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top}_{\mathbf{P}} \mathbf{y} = \mathbf{P}\mathbf{y}. $$

• The residuals are defined as

$$ \hat{\mathbf{u}} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{y} - \mathbf{X}(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} = \underbrace{[\mathbf{I} - \mathbf{X}(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top]}_{\mathbf{M}} \mathbf{y} = \mathbf{M}\mathbf{y} = \mathbf{M}\mathbf{u}, $$

• where

$$ \mathbf{P} = \mathbf{X}(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \quad \text{and} \quad \mathbf{M} = \mathbf{I} - \mathbf{P} $$

are the so-called fundamental OLS matrices.
• They are idempotent (and symmetric, and hence by definition projection matrices) and satisfy $\mathbf{P}\mathbf{X} = \mathbf{X}$ and $\mathbf{M}\mathbf{X} = \mathbf{0}$, and

$$
\hat{\mathbf{y}}^\top \hat{\mathbf{u}} = (\mathbf{P}\mathbf{y})^\top (\mathbf{M}\mathbf{y}) = \mathbf{y}^\top \mathbf{P}^\top (\mathbf{I} - \mathbf{P}) \mathbf{y} = \mathbf{y}^\top (\mathbf{P}^\top - \mathbf{P}^2) \mathbf{y} = \mathbf{y}^\top (\mathbf{P} - \mathbf{P}) \mathbf{y} = 0,
$$

which shows that the residuals and the predicted responses are orthogonal.

• Further it can be seen that

$$ \mathbf{y} = \mathbf{X}\hat{\boldsymbol{\beta}} + \hat{\mathbf{u}} = \mathbf{P}\mathbf{y} + \mathbf{M}\mathbf{y}. $$
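All of these projection-matrix properties can be checked in a few lines; a sketch (my own illustration, on arbitrary simulated data):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = rng.normal(size=n)

P = X @ np.linalg.inv(X.T @ X) @ X.T   # projection onto the column space of X
M = np.eye(n) - P                      # projection onto its orthogonal complement

y_fit = P @ y                          # predicted values
u_hat = M @ y                          # residuals
```

Idempotency, $\mathbf{P}\mathbf{X} = \mathbf{X}$, $\mathbf{M}\mathbf{X} = \mathbf{0}$ and the orthogonality $\hat{\mathbf{y}}^\top \hat{\mathbf{u}} = 0$ all hold up to floating-point error.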
• The projection related to OLS (figure omitted).
• The previous results, and the fact that both $\mathbf{P}$ and $\mathbf{M}$ are projection matrices with $\mathbf{P} + \mathbf{M} = \mathbf{I}$, show that OLS can geometrically be interpreted as a decomposition of $\mathbf{y}$ into two orthogonal components.

– $\mathbf{P}$ projects $\mathbf{y}$ onto the (hyper-)plane spanned by the columns of $\mathbf{X}$, and

– $\mathbf{M}$ projects $\mathbf{y}$ onto the (hyper-)plane orthogonal to the first (hyper-)plane.

• Thus $\mathbf{y}$ may be seen as the sum of two orthogonal vectors, which constitute the two sides enclosing the right angle in a right-angled triangle.
9. Linear Restrictions

• We want to apply some linear restrictions to the model:

$$
\begin{aligned}
r_{11}\beta_1 + r_{12}\beta_2 + \dots + r_{1k}\beta_k &= r_1 \\
r_{21}\beta_1 + r_{22}\beta_2 + \dots + r_{2k}\beta_k &= r_2 \\
&\;\;\vdots \\
r_{q1}\beta_1 + r_{q2}\beta_2 + \dots + r_{qk}\beta_k &= r_q.
\end{aligned}
$$

• In matrix notation this is written

$$ \underset{q \times k}{\mathbf{R}}\,\underset{k \times 1}{\boldsymbol{\beta}} = \underset{q \times 1}{\mathbf{r}}, $$

where the elements of $\mathbf{R}$ give the restrictions.
9.1. Estimation

• We want to estimate the model under such restrictions.

• Each row in $\mathbf{R}$ is a linear restriction on the coefficient vector.

• Typically $\mathbf{R}$ will have few rows and many zeros in each row.

• Examples:

– One of the coefficients is 0, $\beta_j = 0$:

$$ \mathbf{R} = \begin{bmatrix} 0 & 0 & \dots & 1 & 0 & \dots & 0 \end{bmatrix} \quad \text{and} \quad \mathbf{r} = 0. $$

– Two of the coefficients are equal, $\beta_k = \beta_j$:

$$ \mathbf{R} = \begin{bmatrix} 0 & 0 & 1 & \dots & -1 & \dots & 0 \end{bmatrix} \quad \text{and} \quad \mathbf{r} = 0. $$
• More examples:

– Some of the coefficients add up to 1, $\beta_2 + \beta_3 + \beta_4 = 1$:

$$ \mathbf{R} = \begin{bmatrix} 0 & 1 & 1 & 1 & 0 & \dots & 0 \end{bmatrix} \quad \text{and} \quad \mathbf{r} = 1. $$

– Some of the coefficients are 0, $\beta_1 = \beta_2 = \beta_4 = 0$:

$$
\mathbf{R} = \begin{bmatrix}
1 & 0 & 0 & 0 & \dots & 0 \\
0 & 1 & 0 & 0 & \dots & 0 \\
0 & 0 & 0 & 1 & \dots & 0
\end{bmatrix}
\quad \text{and} \quad
\mathbf{r} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}.
$$
• To estimate the model under the restrictions we set up the Lagrange function to be minimized:

$$
\begin{aligned}
\mathcal{L}(\boldsymbol{\lambda}, \boldsymbol{\beta}) &= (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + \boldsymbol{\lambda}^\top (\mathbf{R}\boldsymbol{\beta} - \mathbf{r}) \\
&= \mathbf{y}^\top \mathbf{y} - 2 \boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{y} + \boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\lambda}^\top (\mathbf{R}\boldsymbol{\beta} - \mathbf{r}).
\end{aligned}
$$

• The first-order conditions are

$$
\frac{\partial \mathcal{L}(\boldsymbol{\lambda}, \boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = -2 \mathbf{X}^\top \mathbf{y} + 2 \mathbf{X}^\top \mathbf{X} \boldsymbol{\beta} + \mathbf{R}^\top \boldsymbol{\lambda} = \mathbf{0}, \qquad
\frac{\partial \mathcal{L}(\boldsymbol{\lambda}, \boldsymbol{\beta})}{\partial \boldsymbol{\lambda}} = \mathbf{R}\boldsymbol{\beta} - \mathbf{r} = \mathbf{0}.
$$

• Let $\hat{\hat{\boldsymbol{\beta}}}$ denote the estimate of $\boldsymbol{\beta}$ under the restrictions.
• From the first set of conditions we get

$$ \boldsymbol{\beta} = (\mathbf{X}^\top \mathbf{X})^{-1} \big( \mathbf{X}^\top \mathbf{y} - \tfrac{1}{2} \mathbf{R}^\top \boldsymbol{\lambda} \big). $$

• Inserting this in the second set of conditions reveals

$$ \mathbf{r} = \mathbf{R} (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} - \tfrac{1}{2} \mathbf{R} (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{R}^\top \boldsymbol{\lambda}, $$

• which again may be solved for $\boldsymbol{\lambda}$:

$$ \boldsymbol{\lambda} = 2 \big[ \mathbf{R} (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{R}^\top \big]^{-1} \big[ \mathbf{R} (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} - \mathbf{r} \big]. $$
• This can be used to find $\hat{\hat{\boldsymbol{\beta}}}$:

$$
\begin{aligned}
\hat{\hat{\boldsymbol{\beta}}} &= (\mathbf{X}^\top \mathbf{X})^{-1} \Big( \mathbf{X}^\top \mathbf{y} - \tfrac{1}{2} \mathbf{R}^\top \underbrace{2 \big[ \mathbf{R} (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{R}^\top \big]^{-1} \big[ \mathbf{R} (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} - \mathbf{r} \big]}_{\boldsymbol{\lambda}} \Big) \\
&= (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} - (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{R}^\top \big[ \mathbf{R} (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{R}^\top \big]^{-1} \big[ \mathbf{R} (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} - \mathbf{r} \big] \\
&= \hat{\boldsymbol{\beta}} - (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{R}^\top \big[ \mathbf{R} (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{R}^\top \big]^{-1} \big[ \mathbf{R} \hat{\boldsymbol{\beta}} - \mathbf{r} \big].
\end{aligned}
$$

• It is worth noting that if $\hat{\boldsymbol{\beta}}$ fulfills the restrictions, then $\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{r} = \mathbf{0}$, which implies $\hat{\hat{\boldsymbol{\beta}}} = \hat{\boldsymbol{\beta}}$.
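The closed-form restricted estimator is straightforward to implement; a sketch (my own illustration, with a made-up restriction $\beta_2 + \beta_3 = 1$ on simulated data), which also confirms that the restricted estimate satisfies the restriction exactly:

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 60, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.4, 0.6, 0.0]) + rng.normal(scale=0.2, size=n)

# One restriction (q = 1): beta_2 + beta_3 = 1, i.e. R beta = r.
R = np.array([[0.0, 1.0, 1.0, 0.0]])
r = np.array([1.0])

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                        # unrestricted OLS
middle = np.linalg.inv(R @ XtX_inv @ R.T)
beta_restr = beta_hat - XtX_inv @ R.T @ middle @ (R @ beta_hat - r)
```

By construction $\mathbf{R}\hat{\hat{\boldsymbol{\beta}}} = \mathbf{r}$ up to floating-point error.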
9.2. Tests of linear restrictions

• We will see how to test whether the restrictions are valid.

• Thus we want to test the null hypothesis $H_0 : \mathbf{R}\boldsymbol{\beta} = \mathbf{r}$.

• This is done under assumption A4 via the following F-statistic:

$$
F = \frac{(\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{r})^\top \big[ \mathbf{R} (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{R}^\top \big]^{-1} (\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{r}) / q}{\hat{\mathbf{u}}^\top \hat{\mathbf{u}} / (n - k)} \sim F(q, n - k).
$$

• If A4 is not fulfilled, then we must use the so-called Wald statistic:

$$
W = \frac{(\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{r})^\top \big[ \mathbf{R} (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{R}^\top \big]^{-1} (\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{r})}{\hat{\mathbf{u}}^\top \hat{\mathbf{u}} / (n - k)} \sim \chi^2(q) \quad \text{(approximately)}.
$$
• Note that the F-statistic is exactly equal to

$$ F = \frac{(\hat{\mathbf{u}}_*^\top \hat{\mathbf{u}}_* - \hat{\mathbf{u}}^\top \hat{\mathbf{u}}) / q}{\hat{\mathbf{u}}^\top \hat{\mathbf{u}} / (n - k)} \sim F(q, n - k), $$

• where $\hat{\mathbf{u}}_*^\top \hat{\mathbf{u}}_*$ is the sum of squared residuals from the restricted model, and $\hat{\mathbf{u}}^\top \hat{\mathbf{u}}$ is the sum of squared residuals from the unrestricted model.

• A nice feature of the matrix version on the previous slide is that you only need the estimates from one regression, namely the unrestricted one.
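The claimed equality of the two forms of the F-statistic can be demonstrated numerically; a sketch (my own illustration, reusing the made-up restriction $\beta_2 + \beta_3 = 1$ on simulated data):

```python
import numpy as np

rng = np.random.default_rng(6)
n, k, q = 60, 4, 1
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.4, 0.6, 0.0]) + rng.normal(scale=0.2, size=n)

R = np.array([[0.0, 1.0, 1.0, 0.0]])   # H0: beta_2 + beta_3 = 1
r = np.array([1.0])

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
u_hat = y - X @ beta_hat
ssr_u = u_hat @ u_hat                   # unrestricted sum of squared residuals

# Matrix form (needs only the unrestricted fit):
d = R @ beta_hat - r
F_matrix = (d @ np.linalg.inv(R @ XtX_inv @ R.T) @ d / q) / (ssr_u / (n - k))

# SSR form: refit under the restriction, then compare sums of squares.
beta_restr = beta_hat - XtX_inv @ R.T @ np.linalg.inv(R @ XtX_inv @ R.T) @ d
u_restr = y - X @ beta_restr
ssr_r = u_restr @ u_restr
F_ssr = ((ssr_r - ssr_u) / q) / (ssr_u / (n - k))
```

The two statistics agree up to floating-point error, illustrating the exact equivalence stated above.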
9.3. The Chow Forecast test

• The idea is that, if the parameter vector is constant, then we have a specific confidence that out-of-sample predictions will fall within specified bounds.

– These bounds are the well-known prediction intervals computed from sample data.

• Hence, large prediction errors will be critical for the constancy hypothesis.
• To test this hypothesis we split the sample in two:

– use $n_1$ observations for estimation, and

– use $n_2 = n - n_1$ observations for testing.

1. For time series we would usually take the first $n_1$ observations for estimation.

2. In cross-sections we could e.g. split the sample according to a size variable.

• It will be appropriate to reserve 5-15 percent of the observations for testing.
• In splitting the sample, we use the following notation:

$$
\mathbf{y} = \begin{bmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \end{bmatrix} \begin{matrix} \leftarrow n_1 \times 1 \\ \leftarrow n_2 \times 1 \end{matrix}, \qquad
\mathbf{X} = \begin{bmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{bmatrix} \begin{matrix} \leftarrow n_1 \times k \\ \leftarrow n_2 \times k \end{matrix}, \qquad
\mathbf{u} = \begin{bmatrix} \mathbf{u}_1 \\ \mathbf{u}_2 \end{bmatrix} \begin{matrix} \leftarrow n_1 \times 1 \\ \leftarrow n_2 \times 1 \end{matrix}.
$$

• The complete model can then be written as

$$ \mathbf{y}_1 = \mathbf{X}_1 \boldsymbol{\beta} + \mathbf{u}_1, \qquad \mathbf{y}_2 = \mathbf{X}_2 \boldsymbol{\alpha} + \mathbf{u}_2. $$

• The null hypothesis is

$$ H_0 : \boldsymbol{\alpha} = \boldsymbol{\beta}. $$
• To use a dummy-variable approach, write the second equation as

$$
\begin{aligned}
\mathbf{y}_2 = \mathbf{X}_2 \boldsymbol{\alpha} + \mathbf{u}_2 &= \mathbf{X}_2 \boldsymbol{\alpha} + \mathbf{u}_2 + \mathbf{X}_2 \boldsymbol{\beta} - \mathbf{X}_2 \boldsymbol{\beta} \\
&= \mathbf{X}_2 \boldsymbol{\beta} + \mathbf{X}_2 (\boldsymbol{\alpha} - \boldsymbol{\beta}) + \mathbf{u}_2 \\
&= \mathbf{X}_2 \boldsymbol{\beta} + \boldsymbol{\gamma} + \mathbf{u}_2.
\end{aligned}
$$

• Hence if $\boldsymbol{\gamma} = \mathbf{0}$ then $\boldsymbol{\alpha} = \boldsymbol{\beta}$.

• Note that $\boldsymbol{\gamma} = \mathbf{X}_2 (\boldsymbol{\alpha} - \boldsymbol{\beta})$ is an $n_2 \times 1$ "parameter vector".

• Now the model may be written compactly as

$$
\begin{bmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \end{bmatrix} =
\underbrace{\begin{bmatrix} \mathbf{X}_1 & \mathbf{0} \\ \mathbf{X}_2 & \mathbf{I}_{n_2} \end{bmatrix}}_{\mathbf{Z}}
\begin{bmatrix} \boldsymbol{\beta} \\ \boldsymbol{\gamma} \end{bmatrix} +
\begin{bmatrix} \mathbf{u}_1 \\ \mathbf{u}_2 \end{bmatrix}.
$$
• Now it is simple to estimate $\begin{bmatrix} \boldsymbol{\beta} & \boldsymbol{\gamma} \end{bmatrix}^\top$:

$$
\begin{aligned}
\begin{bmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\boldsymbol{\gamma}} \end{bmatrix} &= (\mathbf{Z}^\top \mathbf{Z})^{-1} \mathbf{Z}^\top \mathbf{y} \\
&= \left( \begin{bmatrix} \mathbf{X}_1 & \mathbf{0} \\ \mathbf{X}_2 & \mathbf{I}_{n_2} \end{bmatrix}^\top \begin{bmatrix} \mathbf{X}_1 & \mathbf{0} \\ \mathbf{X}_2 & \mathbf{I}_{n_2} \end{bmatrix} \right)^{-1} \begin{bmatrix} \mathbf{X}_1 & \mathbf{0} \\ \mathbf{X}_2 & \mathbf{I}_{n_2} \end{bmatrix}^\top \begin{bmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \end{bmatrix} \\
&= \begin{bmatrix} (\mathbf{X}_1^\top \mathbf{X}_1)^{-1} \mathbf{X}_1^\top \mathbf{y}_1 \\ \mathbf{y}_2 - \mathbf{X}_2 (\mathbf{X}_1^\top \mathbf{X}_1)^{-1} \mathbf{X}_1^\top \mathbf{y}_1 \end{bmatrix}.
\end{aligned}
$$

• Hence $\hat{\boldsymbol{\beta}}$ estimates $\boldsymbol{\beta}$ using the first $n_1$ data points.
• Noting that

$$ \mathbf{X}_2 (\mathbf{X}_1^\top \mathbf{X}_1)^{-1} \mathbf{X}_1^\top \mathbf{y}_1 = \mathbf{X}_2 \hat{\boldsymbol{\beta}} = \hat{\mathbf{y}}_2, $$

we discover that $\hat{\boldsymbol{\gamma}} = \mathbf{y}_2 - \hat{\mathbf{y}}_2$ is the vector of prediction errors when we try to predict $\mathbf{y}_2$ using an estimate of $\boldsymbol{\beta}$ based on the first $n_1$ observations.

• Hence, to test for parameter constancy we test

$$ H_0 : \boldsymbol{\gamma} = \mathbf{0}, $$

using our test for linear restrictions applied to the model with the dummy variables $\begin{bmatrix} \mathbf{0} & \mathbf{I}_{n_2} \end{bmatrix}^\top$.
• The restriction matrices are

$$ \mathbf{R} = \begin{bmatrix} \mathbf{0} & \mathbf{I}_{n_2} \end{bmatrix} \quad \text{and} \quad \mathbf{r} = \mathbf{0}. $$

• Hence, the test is:

$$
F = \frac{\Big( \mathbf{R} \begin{bmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\boldsymbol{\gamma}} \end{bmatrix} \Big)^{\!\top} \big[ \mathbf{R} (\mathbf{Z}^\top \mathbf{Z})^{-1} \mathbf{R}^\top \big]^{-1} \Big( \mathbf{R} \begin{bmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\boldsymbol{\gamma}} \end{bmatrix} \Big) \Big/ n_2}{\hat{\mathbf{u}}^\top \hat{\mathbf{u}} \Big/ \underbrace{(n_1 - k)}_{= n_1 + n_2 - (k + n_2)}} \sim F(n_2, n_1 - k).
$$
• If A4 does not apply, then use:

$$
W = \frac{\Big( \mathbf{R} \begin{bmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\boldsymbol{\gamma}} \end{bmatrix} \Big)^{\!\top} \big[ \mathbf{R} (\mathbf{Z}^\top \mathbf{Z})^{-1} \mathbf{R}^\top \big]^{-1} \Big( \mathbf{R} \begin{bmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\boldsymbol{\gamma}} \end{bmatrix} \Big)}{\hat{\mathbf{u}}^\top \hat{\mathbf{u}} / (n_1 - k)} \sim \chi^2(n_2).
$$
9.4. Chow's Breakpoint Test

• If the forecast subset is large enough, it might be better to estimate two regression functions, one for each sub-sample, and then test for common parameters.

• The unrestricted model may then be written as

$$
\begin{bmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \end{bmatrix} =
\begin{bmatrix} \mathbf{X}_1 & \mathbf{0} \\ \mathbf{0} & \mathbf{X}_2 \end{bmatrix}
\begin{bmatrix} \boldsymbol{\beta}_1 \\ \boldsymbol{\beta}_2 \end{bmatrix} +
\begin{bmatrix} \mathbf{u}_1 \\ \mathbf{u}_2 \end{bmatrix},
$$

where $\boldsymbol{\beta}_1$ and $\boldsymbol{\beta}_2$ are the $k$-vectors of the two sub-samples.

• The null hypothesis of no structural break is then

$$ H_0 : \boldsymbol{\beta}_1 = \boldsymbol{\beta}_2. $$
• Again, apply the test for linear restrictions to this model, with $\mathbf{Z}$ now denoting the block-diagonal regressor matrix above. We just have to set up the restriction matrices:

$$ \mathbf{R} = \begin{bmatrix} \mathbf{I}_k & -\mathbf{I}_k \end{bmatrix} \quad \text{and} \quad \mathbf{r} = \mathbf{0}. $$

• Hence, the test is:

$$
F = \frac{\Big( \mathbf{R} \begin{bmatrix} \hat{\boldsymbol{\beta}}_1 \\ \hat{\boldsymbol{\beta}}_2 \end{bmatrix} \Big)^{\!\top} \big[ \mathbf{R} (\mathbf{Z}^\top \mathbf{Z})^{-1} \mathbf{R}^\top \big]^{-1} \Big( \mathbf{R} \begin{bmatrix} \hat{\boldsymbol{\beta}}_1 \\ \hat{\boldsymbol{\beta}}_2 \end{bmatrix} \Big) \Big/ k}{\hat{\mathbf{u}}^\top \hat{\mathbf{u}} / (n - 2k)} \sim F(k, n - 2k).
$$

• If A4 does not apply, then use the chi-squared alternative (the Wald test).

• It is not hard to modify this test to relax the restriction on the intercept term, or on the slope.
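The breakpoint test is often computed in its equivalent sum-of-squared-residuals form (pooled versus separate fits), which follows from the F-equivalence in Section 9.2. A sketch (my own illustration; data, break sizes and the helper `ssr` are made up):

```python
import numpy as np

rng = np.random.default_rng(7)
n1, n2, k = 40, 40, 2
X1 = np.column_stack([np.ones(n1), rng.normal(size=n1)])
X2 = np.column_stack([np.ones(n2), rng.normal(size=n2)])
y1 = X1 @ np.array([1.0, 2.0]) + rng.normal(scale=0.3, size=n1)
y2 = X2 @ np.array([1.5, 2.5]) + rng.normal(scale=0.3, size=n2)  # shifted parameters

def ssr(X, y):
    """Sum of squared OLS residuals."""
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    return e @ e

X_pool = np.vstack([X1, X2])
y_pool = np.concatenate([y1, y2])
ssr_pooled = ssr(X_pool, y_pool)           # restricted: beta_1 = beta_2
ssr_unres = ssr(X1, y1) + ssr(X2, y2)      # unrestricted: separate sub-sample fits

n = n1 + n2
F = ((ssr_pooled - ssr_unres) / k) / (ssr_unres / (n - 2 * k))
```

A large F relative to $F(k, n - 2k)$ indicates a structural break, as deliberately built into the simulated data here.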
10. Heteroskedasticity Defined

• A general heteroskedastic model violates A3.

• It can be presented as:

$$ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}, \quad \mathbf{u} \sim N(\mathbf{0}, \sigma^2 \boldsymbol{\Sigma}), $$

where

$$
\sigma^2 \boldsymbol{\Sigma} = \operatorname{var}(\mathbf{u}) = E(\mathbf{u}\mathbf{u}^\top) =
\begin{pmatrix}
\operatorname{var}(u_1) & \operatorname{cov}(u_1, u_2) & \dots & \operatorname{cov}(u_1, u_n) \\
\operatorname{cov}(u_2, u_1) & \operatorname{var}(u_2) & \dots & \operatorname{cov}(u_2, u_n) \\
\vdots & \vdots & \ddots & \vdots \\
\operatorname{cov}(u_n, u_1) & \operatorname{cov}(u_n, u_2) & \dots & \operatorname{var}(u_n)
\end{pmatrix}.
$$
11. OLS with Heteroskedastic Errors
• For a general covariance matrix the following results hold:

1. The OLS estimator $\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$ is unbiased and consistent.

2. The OLS estimator is inefficient.

3. The message is the following:

– We can still get a sensible estimate of the parameter vector using OLS, even when A3 is not satisfied, but

– we cannot perform hypothesis tests on the parameter vector, because the standard errors estimated using OLS are WRONG (i.e. biased and inconsistent).
• The correct variance matrix of the OLS estimator is

$$
\begin{aligned}
\operatorname{var}(\hat{\boldsymbol{\beta}} \mid \mathbf{X}) &= E\big[ (\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})^\top \mid \mathbf{X} \big] \\
&= E\big[ \big( (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{u} \big) \big( (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{u} \big)^\top \mid \mathbf{X} \big] \\
&= (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top E\big[ \mathbf{u}\mathbf{u}^\top \mid \mathbf{X} \big] \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} \\
&= (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \sigma^2 \boldsymbol{\Sigma} \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} \\
&= \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \boldsymbol{\Sigma} \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1}.
\end{aligned}
$$

– Hence, tests based on $\sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}$ are invalid.
To sum up:

• The usual OLS t-statistics are not t-distributed under heteroskedasticity, and the problem cannot be resolved by using large sample sizes. The same holds for the F-test.

• One will usually underestimate $\operatorname{SE}(\hat{\beta}_{j,\text{OLS}})$, which implies that confidence bands get too short and t-statistics get too large.

• Thus, hypotheses which should be retained may be rejected.
12. Heteroskedasticity-Robust Inference

• In the simple regression model with heteroskedasticity,

$$ y_i = \beta_0 + \beta_1 x_i + u_i, \quad \operatorname{var}(u_i) = \sigma_i^2, $$

we have an easy correction.

• If we don't know $\sigma_i^2$, then we may use:

$$ \widehat{\operatorname{var}}(\hat{\beta}_{1,\text{OLS}})_{\text{HRSE}} = \frac{\sum_{i=1}^n (x_i - \bar{x})^2 \hat{u}_i^2}{\big( \sum_{i=1}^n (x_i - \bar{x})^2 \big)^2}. $$

• This variance estimator is consistent.
• In the multiple regression model one should use

$$ \widehat{\operatorname{var}}(\hat{\beta}_{j,\text{OLS}})_{\text{HRSE}} = \frac{\sum_{i=1}^n \hat{r}_{ij}^2 \hat{u}_i^2}{\big( \sum_{i=1}^n \hat{r}_{ij}^2 \big)^2}, $$

• where $\hat{r}_{ij}$ are the residuals from the following regression:

$$ x_{ji} = \delta_0 + \delta_1 x_{1i} + \dots + \delta_{j-1} x_{j-1,i} + \delta_{j+1} x_{j+1,i} + \dots + \delta_k x_{ki} + r_{ij}, $$

• i.e. the residuals from the regression where $x_j$ is regressed on the remaining explanatory variables.

• $\sqrt{\widehat{\operatorname{var}}(\hat{\beta}_{j,\text{OLS}})_{\text{HRSE}}}$ is known as the heteroskedasticity-robust standard error of $\hat{\beta}_j$.
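In matrix form this estimator is the HC0 "sandwich" $(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top \operatorname{diag}(\hat{u}_i^2) \mathbf{X} (\mathbf{X}^\top\mathbf{X})^{-1}$; in the simple regression case its slope element coincides exactly with the summation formula above. A sketch (my own illustration, on simulated heteroskedastic data):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200
x = rng.normal(size=n)
u = rng.normal(size=n) * (0.5 + np.abs(x))   # errors with variance depending on x
y = 1.0 + 2.0 * x + u

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
u_hat = y - X @ beta_hat

# Slide formula for the slope: sum (x_i - xbar)^2 u_i^2 / (sum (x_i - xbar)^2)^2.
xc = x - x.mean()
var_b1_hrse = np.sum(xc**2 * u_hat**2) / np.sum(xc**2) ** 2

# Matrix (sandwich, HC0) version for all coefficients at once.
meat = X.T @ (X * u_hat[:, None] ** 2)       # X' diag(u_hat^2) X
var_hc0 = XtX_inv @ meat @ XtX_inv
```

The $(1,1)$ element of `var_hc0` (the slope) equals `var_b1_hrse` exactly.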
13. GLS Estimation

• Suppose we have something like $\sigma_i^2 = \sigma^2 z_i$ with $z_i$ known.

• The trick is:

$$ \frac{y_i}{\sqrt{z_i}} = \alpha \frac{1}{\sqrt{z_i}} + \beta \frac{x_i}{\sqrt{z_i}} + \frac{u_i}{\sqrt{z_i}}, \tag{1} $$

$$ \Rightarrow \quad \operatorname{var}\Big( \frac{u_i}{\sqrt{z_i}} \Big) = \sigma^2 \quad \text{for all } i, $$

• and now OLS is applicable to (1); $\hat{\beta} = \hat{\beta}_{\text{GLS}}$ with

$$ \widehat{\operatorname{var}}(\hat{\beta}_{\text{GLS}}) = \frac{s^2}{\big( \sum_i x_i^2 / z_i \big) - \big( \sum_i x_i / z_i \big)^2 \big( \sum_i z_i^{-1} \big)^{-1}}, \qquad s^2 = \frac{\text{SSE}_{\text{GLS}}}{n - 2}. $$

• This method is also called Weighted Least Squares (WLS).
• Hence, in general: least squares estimation when $\boldsymbol{\Sigma}$ is known can be done by transforming the model such that the classical assumptions are fulfilled.

• As $\boldsymbol{\Sigma}$ is a positive definite symmetric matrix, we can write it as

$$ \boldsymbol{\Sigma} = \mathbf{C} \boldsymbol{\Lambda} \mathbf{C}^\top. $$

• By the calculus rules for eigenvalues and eigenvectors we know

$$ \boldsymbol{\Sigma}^{-1} = \mathbf{C} \boldsymbol{\Lambda}^{-1} \mathbf{C}^\top = \mathbf{C} \boldsymbol{\Lambda}^{-1/2} \boldsymbol{\Lambda}^{-1/2} \mathbf{C}^\top = \mathbf{C} \boldsymbol{\Lambda}^{-1/2} (\mathbf{C} \boldsymbol{\Lambda}^{-1/2})^\top = \mathbf{P}^\top \mathbf{P}, $$

and

$$ \boldsymbol{\Sigma} = (\boldsymbol{\Sigma}^{-1})^{-1} = (\mathbf{P}^\top \mathbf{P})^{-1} = \mathbf{P}^{-1} (\mathbf{P}^\top)^{-1}. $$
Important general rule:

$$ \mathbf{y} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma}), \ \boldsymbol{\Sigma} \text{ invertible} \quad \Longleftrightarrow \quad \boldsymbol{\Sigma}^{-1/2} \mathbf{y} \sim N(\boldsymbol{\Sigma}^{-1/2} \boldsymbol{\mu}, \mathbf{I}). $$

Here $\boldsymbol{\Sigma}^{-1/2}$ is the matrix defined such that $\boldsymbol{\Sigma}^{-1} = (\boldsymbol{\Sigma}^{-1/2})^\top \boldsymbol{\Sigma}^{-1/2}$. Let $\mathbf{P} = \boldsymbol{\Sigma}^{-1/2}$. Then $\boldsymbol{\Sigma}^{-1} = \mathbf{P}^\top \mathbf{P}$, exactly as defined on the previous slide.

• Hence, in general, least squares estimation when $\boldsymbol{\Sigma}$ is known can be done by transforming the model such that the classical assumptions (classical design criteria) on the covariance matrix are fulfilled. See the next slide.
• Now use $\mathbf{P}$ to define:

$$ \mathbf{y}_* = \mathbf{P}\mathbf{y}, \quad \mathbf{X}_* = \mathbf{P}\mathbf{X}, \quad \mathbf{u}_* = \mathbf{P}\mathbf{u}, $$

• and note that this implies that

$$
\begin{aligned}
\operatorname{var}(\mathbf{u}_* \mid \mathbf{X}) &= E[\mathbf{u}_* \mathbf{u}_*^\top \mid \mathbf{X}] = E[\mathbf{P}\mathbf{u} (\mathbf{P}\mathbf{u})^\top \mid \mathbf{X}] \\
&= E[\mathbf{P} \mathbf{u}\mathbf{u}^\top \mathbf{P}^\top \mid \mathbf{X}] = \mathbf{P}\, E[\mathbf{u}\mathbf{u}^\top \mid \mathbf{X}]\, \mathbf{P}^\top \\
&= \mathbf{P} \sigma^2 \boldsymbol{\Sigma} \mathbf{P}^\top = \sigma^2 \mathbf{P} \mathbf{P}^{-1} (\mathbf{P}^\top)^{-1} \mathbf{P}^\top = \sigma^2 \mathbf{I}.
\end{aligned}
$$
GLS estimation/(page 68)
ä Hence, the transformed errors satisfy the design criteria, and it is valid to apply OLS to the model

y* = X*β + u*,   u* ∼ N(0, σ²I).

ä Hence, the estimates of β and σ² are the usual ones:

β̂_GLS = (X*⊤X*)⁻¹ X*⊤ y*
= ((PX)⊤PX)⁻¹ (PX)⊤ Py
= (X⊤P⊤PX)⁻¹ X⊤P⊤Py
= (X⊤Σ⁻¹X)⁻¹ X⊤Σ⁻¹y.
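The chain of equalities above says the two routes to β̂_GLS coincide: transform with P and run OLS, or evaluate the closed form (X⊤Σ⁻¹X)⁻¹X⊤Σ⁻¹y directly. A small simulation (assumed data, a diagonal Σ for simplicity) confirms this:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 100, 3

# Assumed setup: design matrix with intercept, known diagonal Sigma.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
sig = rng.uniform(0.5, 3.0, n)            # diagonal of Sigma
u = rng.normal(0, np.sqrt(sig))           # errors with var(u_i) = sig_i
y = X @ np.array([1.0, 2.0, -1.0]) + u

# Route 1: whiten with P = Sigma^{-1/2}, then ordinary least squares.
P = np.diag(sig ** -0.5)
b_transformed, *_ = np.linalg.lstsq(P @ X, P @ y, rcond=None)

# Route 2: the closed form (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} y.
Si = np.diag(1.0 / sig)
b_direct = np.linalg.solve(X.T @ Si @ X, X.T @ Si @ y)
```

Both vectors agree to machine precision.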
GLS estimation/(page 69)
s²_GLS = (y* − X*β̂_GLS)⊤(y* − X*β̂_GLS) / (n − k)
= (y − Xβ̂_GLS)⊤ Σ⁻¹ (y − Xβ̂_GLS) / (n − k).

å These are called GLS estimators.
ä Variance of β̂_GLS:

var(β̂_GLS) = σ² (X*⊤X*)⁻¹ = σ² (X⊤Σ⁻¹X)⁻¹,

ä and the estimated variance:

var(β̂_GLS) = s²_GLS (X⊤Σ⁻¹X)⁻¹.

ä In conclusion this shows that if Σ is known, then it is easy to take care of a general covariance matrix for the disturbances.
GLS estimation/(page 70)
14. FGLS estimation
ä FGLS is short for Feasible Generalized Least Squares, and it is the method to use when Σ is unknown.
ä In the general case the situation is

y = Xβ + u,   u ∼ N(0, V),

where V is the unknown covariance matrix of the disturbances. Note that the scaling factor σ² is included in V.
ä Now if we can come up with a consistent estimate of V, call it V̂, then we are home free:

β̂_FGLS = (X⊤V̂⁻¹X)⁻¹ X⊤V̂⁻¹y,
var(β̂_FGLS) = (X⊤V̂⁻¹X)⁻¹.
FGLS estimation/(page 71)
14.1. White's HCSE

ä Halbert White suggested some simple heteroskedasticity-consistent standard errors for the OLS estimator in the case where V has the following structure:

V = diag(σ_1², σ_2², …, σ_n²),

where σ_1², σ_2², …, σ_n² are unknown.
ä This estimator is exactly equal to the robust one we encountered pre-viously.
FGLS estimation/(page 72)
ä Remember that the correct variance matrix for the OLS estimator is

var(β̂) = (X⊤X)⁻¹ X⊤VX (X⊤X)⁻¹.

ä Hence, to perform FGLS we need an estimate of

X⊤VX = (x_1 x_2 ⋯ x_n) diag(σ_1², σ_2², …, σ_n²) (x_1 x_2 ⋯ x_n)⊤ = ∑_{t=1}^{n} σ_t² x_t x_t⊤,

where x_t⊤ = (1, x_2t, …, x_kt) is the t'th row of X.
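The identity X⊤VX = ∑_t σ_t² x_t x_t⊤ is easy to verify numerically. Here the design matrix and the σ_t² are assumed simulated values; `np.outer(X[t], X[t])` forms the rank-one term x_t x_t⊤.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
sig2 = rng.uniform(0.2, 2.0, n)    # the sigma_t^2 (unknown in practice, simulated here)
V = np.diag(sig2)

# Matrix product versus the sum of rank-one outer products.
lhs = X.T @ V @ X
rhs = sum(sig2[t] * np.outer(X[t], X[t]) for t in range(n))
```

The two (k × k) matrices agree, which is the computation the White estimator exploits row by row.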
FGLS estimation/(page 73)
ä The White estimator of var(β̂), then, is to replace σ_t² with the squared t'th residual û_t², such that

var(β̂)_HCSE = (X⊤X)⁻¹ X⊤V̂X (X⊤X)⁻¹
= (X⊤X)⁻¹ X⊤ diag(û_1², û_2², …, û_n²) X (X⊤X)⁻¹.
ä In general, one should always report these standard errors withthe regression output.
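The sandwich formula above translates directly into numpy. The simulated data and the form of the heteroskedasticity are assumptions for illustration; the diag(û²) factor is applied by row-scaling X rather than forming the n × n matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
u = rng.normal(0, 1 + np.abs(X[:, 1]))       # assumed heteroskedastic errors
y = X @ np.array([1.0, 2.0]) + u

# OLS point estimates and residuals.
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
uhat = y - X @ b

# Sandwich: (X'X)^{-1} X' diag(uhat^2) X (X'X)^{-1}.
meat = X.T @ (uhat[:, None] ** 2 * X)        # row-scaling avoids the n x n diagonal matrix
V_hc0 = XtX_inv @ meat @ XtX_inv
hc_se = np.sqrt(np.diag(V_hc0))              # heteroskedasticity-robust standard errors
```

These are the standard errors the slide recommends reporting alongside the usual OLS output.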
FGLS estimation/(page 74)
Problem 1
ä Verify the formulas on slide 9 from the notation on slides 6 and 8.
ä Show that Σ on slide 17 has the form as postulated.
ä Show that s² from slide 31 is unbiased.
ä Show that the OLS estimator β̂ on slide 32 has the smallest variance among all possible unbiased linear estimators.
FGLS estimation/(page 75)
Problem 2
ä Consider the equation y = Xβ + u. Suppose that the rows of X are a random sample from some distribution with mean 0 and covariance matrix Σ > 0, suppose that var(u|X) = σ²I_n, and E(u|X) = Xθ, where θ ≠ 0 does not depend on n.
ä Show that E(u) = 0 and var(u) = (σ² + θ⊤Σθ) I_n, so that the random disturbances have mean 0, constant variance, and are uncorrelated with one another, but are correlated with X.
ä Show that the OLS estimator of β satisfies β̂ = β + (X⊤X)⁻¹X⊤u.
ä Show that E(β̂|X) = β + θ and var(β̂|X) = σ²(X⊤X)⁻¹.
FGLS estimation/(page 76)
Problem 3
ä Let β̂ be the ((k + 1) × 1) vector of OLS estimates.

1. Show that for any (k + 1) × 1 vector b, we can write the sum of squared residuals as

SSR(b) = û⊤û + (β̂ − b)⊤ X⊤X (β̂ − b).

(Hint: Write (y − Xb)⊤(y − Xb) = [û + X(β̂ − b)]⊤[û + X(β̂ − b)] and use the fact that X⊤û = 0.)

2. Explain how the expression for SSR(b) in part 1. above proves that β̂ uniquely minimizes SSR(b) over all possible values of b, assuming X has rank k + 1.
FGLS estimation/(page 77)
Problem 4
Let β̂ be the OLS estimate from the regression of y on X. Let A be a (k + 1) × (k + 1) nonsingular matrix and define z_t ≡ x_t A, t = 1, …, n, where x_t is the t'th row in X. Therefore, z_t is 1 × (k + 1) and is a nonsingular linear combination of x_t. Let Z be the n × (k + 1) matrix with rows z_t. Let β̃ denote the OLS estimate from a regression of y on Z.
ä Show that β̃ = A⁻¹β̂.
ä Let ŷ_t be the fitted values from the original regression and let ỹ_t be the fitted values from regressing y on Z. Show that ỹ_t = ŷ_t, for all t = 1, …, n. How do the residuals from the two regressions compare?
ä Show that the estimated variance matrix for β̃ is s² A⁻¹(X⊤X)⁻¹(A⁻¹)⊤, where s² is the usual variance estimate from regressing y on X.
FGLS estimation/(page 78)
Problem 5
Consider the usual linear model y = Xβ + u, where y is (n × 1), X is (n × 5), β is (5 × 1), and u is (n × 1), with E(u) = 0 and var(u) = σ²I. The sample size n is equal to 500, and X is non-stochastic and has full rank. OLS is used to estimate the following coefficients and variance matrix of the coefficients:

β̂ = (β̂_1, β̂_2, β̂_3, β̂_4, β̂_5)⊤ = (2.5, 1.1, −3, 0.5, 0.4)⊤,

var(β̂) =
   0.5
   0.2   0.2
   0.1   0.3   0.1
  −0.2  −0.1   0.1   1
  −0.5  −0.7  −0.2  −0.5   0.1

(only the lower triangle is shown; the matrix is symmetric).
FGLS estimation/(page 79)
Problem 5 (continued)
Test the following hypotheses. In each case set up the test for the hypothesis, write down the R and r matrices, and do the F and Wald tests, cf. slide 47.
1. H0 : β1 = 1
2. H0 : −β2 + β4 + β5 = 0
3. H0 : β1 = 1, −β2 + β4 + β5 = 0.
FGLS estimation/(page 80)
HINTS to problems

ä Problem 1: The Gauss-Markov Theorem, slide page 24. The idea of the proof is to choose another arbitrary linear estimator for β. Call this estimator β̃. It has the form β̃ = Ay. Define a matrix D = A − (X⊤X)⁻¹X⊤, write the variance-covariance matrix of β̃ as something that depends on D and X, and recognize that var(β̃) − var(β̂) must be positive semidefinite.
ä Problem 2, second bullet on slide 76: Here you should use the vector versions of two general and important rules: (1) The law of iterated expectations: E[Y] = E[E[Y|Z]] for any pair of random variables. (2) The law of total variance: var[Y] = E[var[Y|Z]] + var[E[Y|Z]] for any pair of random variables.
ä Problem 3: Should be straightforward.
ä Problem 4: Should be straightforward when you realize that Z = XA.
ä Problem 5: Rewrite the F and Wald statistics on slide 47 using the fact that

s² = û⊤û / (n − k)   and   var(β̂) = s²(X⊤X)⁻¹.
FGLS estimation/(page 81)