EC212: Introduction to Econometrics
Review Materials
(Wooldridge, Appendix)
Taisuke Otsu
London School of Economics
Summer 2018
A.1. Summation operator
(Wooldridge, App. A.1)
Summation operator
• For a sequence {x_1, x_2, . . . , x_n}, denote the summation as

  ∑_{i=1}^n x_i = x_1 + x_2 + · · · + x_n

• Since data are a collection of numbers, "∑_{i=1}^n" plays a key role in econometrics and statistics
Properties

1. For any constant c,

   ∑_{i=1}^n c = nc

2. For any constant c,

   ∑_{i=1}^n c·x_i = c·∑_{i=1}^n x_i

3. For a sequence {(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)} and constants a and b,

   ∑_{i=1}^n (a·x_i + b·y_i) = a·∑_{i=1}^n x_i + b·∑_{i=1}^n y_i

• If you get confused, try the case of n = 2 or 3 (or see the numerical check below)
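• A minimal numerical check in Python (not part of the original slides; numpy is assumed and the sequences and constants are arbitrary):

```python
import numpy as np

# Verify Properties 1-3 of the summation operator on arbitrary numbers
rng = np.random.default_rng(0)
x = rng.normal(size=5)          # sequence {x_1, ..., x_n}
y = rng.normal(size=5)
c, a, b = 3.0, 2.0, -1.5        # arbitrary constants
n = len(x)

print(np.isclose(np.sum(np.full(n, c)), n * c))                           # Property 1
print(np.isclose(np.sum(c * x), c * np.sum(x)))                           # Property 2
print(np.isclose(np.sum(a * x + b * y), a * np.sum(x) + b * np.sum(y)))   # Property 3
```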
Average

• For {x_1, x_2, . . . , x_n}, the average or mean is defined as

  x̄ = (1/n)·∑_{i=1}^n x_i

• (x_i − x̄) is called the deviation from the average
Properties of x_i − x̄

1. The sum of deviations is always zero:

   ∑_{i=1}^n (x_i − x̄) = ∑_{i=1}^n x_i − n·x̄ = ∑_{i=1}^n x_i − ∑_{i=1}^n x_i = 0

2. Sum of squared deviations:

   ∑_{i=1}^n (x_i − x̄)² = ∑_{i=1}^n x_i² − n·x̄²

3. Cross-product version:

   ∑_{i=1}^n (x_i − x̄)(y_i − ȳ) = ∑_{i=1}^n x_i·y_i − n·x̄·ȳ

• These are shown by the properties of the summation operator (a numerical check follows below)
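• A quick Python check of the three identities (not part of the original slides; numpy assumed, data arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=10)
y = rng.normal(size=10)
n = len(x)
xbar, ybar = x.mean(), y.mean()

print(np.isclose(np.sum(x - xbar), 0.0))                                    # Property 1
print(np.isclose(np.sum((x - xbar) ** 2), np.sum(x ** 2) - n * xbar ** 2))  # Property 2
print(np.isclose(np.sum((x - xbar) * (y - ybar)),
                 np.sum(x * y) - n * xbar * ybar))                          # Property 3
```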
Derive Property 2

• Expand the square and apply the properties of the summation operator:

  ∑_{i=1}^n (x_i − x̄)² = ∑_{i=1}^n (x_i² − 2·x_i·x̄ + x̄²)
                        = ∑_{i=1}^n x_i² − 2·x̄·∑_{i=1}^n x_i + n·x̄²
                        = ∑_{i=1}^n x_i² − 2·x̄·(n·x̄) + n·x̄²
                        = ∑_{i=1}^n x_i² − n·x̄²

• Property 3 is shown similarly
A.2. Linear function
(Wooldridge, App. A.2)
Linear function
• The linear function plays an important role in specifying econometric models

• If x and y are related by

  y = β_0 + β_1·x

  then we say that y is a linear function of x

• This relation is described by two parameters: the intercept β_0 and the slope β_1
Property of linear function

• Let ∆ denote "change"

• The key feature of the linear function y = β_0 + β_1·x is that the change in y is given by the slope β_1 times the change in x, i.e.

  ∆y = β_1·∆x

• In other words, the marginal effect of x on y is constant and equal to β_1
Two variable case

• If we have x_1 and x_2, the linear function is

  y = β_0 + β_1·x_1 + β_2·x_2

• The change in y given changes in x_1 and x_2 is

  ∆y = β_1·∆x_1 + β_2·∆x_2

• If x_2 does not change, then

  ∆y = β_1·∆x_1 if ∆x_2 = 0

  or

  β_1 = ∆y/∆x_1 if ∆x_2 = 0

• So β_1 measures how y changes with x_1 holding x_2 fixed (called the partial effect). This is closely related to ceteris paribus
A.4. Some special functions
(Wooldridge, App. A.4)
Quadratic function
• One way to capture diminishing returns is to add a quadratic term

  y = β_0 + β_1·x + β_2·x²

• When β_1 > 0 and β_2 < 0, the graph is a parabolic mountain shape

• By applying calculus, the slope of the quadratic function is approximated by

  slope = ∆y/∆x ≈ β_1 + 2·β_2·x

  (a numerical sketch follows below)

• Caution: the quadratic function is not monotone
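• A small Python check of the slope approximation (not part of the original slides; the parameter values are arbitrary):

```python
# Compare a finite-difference slope of y = b0 + b1*x + b2*x^2 with b1 + 2*b2*x
beta0, beta1, beta2 = 1.0, 4.0, -0.5     # arbitrary "parabolic mountain" parameters

def y(x):
    return beta0 + beta1 * x + beta2 * x ** 2

x0, dx = 2.0, 1e-6
finite_diff = (y(x0 + dx) - y(x0)) / dx      # ∆y/∆x for a small ∆x
print(finite_diff, beta1 + 2 * beta2 * x0)   # both ≈ 2.0
```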
Natural logarithm

• Perhaps the most important nonlinear function in econometrics. Denote it by log(x) (but ln(x) is also common)

• log(x) is defined only for x > 0 and looks like

  [Figure: graph of log(x) for x from 0 to 20]
• It is not very important how the values of log(x) are obtained

• log(x) is monotone increasing and displays diminishing marginal returns (the slope gets closer to 0 as x increases)

• Also we can see

  log(x) < 0 for 0 < x < 1
  log(1) = 0
  log(x) > 0 for x > 1

• Some properties

  log(x_1·x_2) = log(x_1) + log(x_2)
  log(x_1/x_2) = log(x_1) − log(x_2)
  log(x^c) = c·log(x) for any c
Key property: Relationship with percent change

• (By using calculus) we can see that

  log(x_1) − log(x_0) ≈ (x_1 − x_0)/x_0 if x_1 − x_0 is small

• The right hand side multiplied by 100 gives us the percent change in x. So this can be written as

  100·∆log(x) ≈ %∆x

  i.e. the log change times 100 approximates the percent change (a numerical sketch follows below)
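• A quick Python illustration of the approximation (not part of the original slides; numpy assumed, numbers arbitrary):

```python
import numpy as np

x0, x1 = 200.0, 206.0                          # a 3% increase
log_change = 100 * (np.log(x1) - np.log(x0))   # 100 * ∆log(x)
exact_pct = 100 * (x1 - x0) / x0               # exact percent change
print(log_change, exact_pct)                   # ≈ 2.956 vs 3.000
```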
Elasticity

• Thus the log is useful to approximate elasticity. The elasticity of y with respect to x is defined as

  %∆y / %∆x = (∆y/y) / (∆x/x) = (∆y/∆x)·(x/y)

  i.e. the percentage change in y when x increases by 1% (a familiar concept in economics)

• By logs, elasticity is approximated as

  %∆y / %∆x ≈ ∆log(y) / ∆log(x)
B.1. Random variables and their probability distributions
(Wooldridge, App. B.1)
Definition
• An experiment is any procedure that can yield outcomes with uncertainty

• E.g. tossing a coin (head or tail)

• A random variable is one that takes numerical values and has an outcome determined by an experiment

• E.g. the number of heads from tossing 10 coins
Notation for appendix

• In the Appendix, denote random variables by uppercase letters, like X, Y, Z

• On the other hand, denote particular outcomes by corresponding lowercase letters, like x, y, z

• In the main body of the textbook, both are denoted by lowercase x, y, z (it should be clear from each context)

• X is not associated with any particular value but x is, say x = 3

• Typical example in mind: X is your exam score at this point (which is random and not realized yet). Once you take the exam, it realizes and you get a particular value x, say x = 80

• So the expression

  P(X = x) = 0.2

  means "the probability that the random variable X takes a particular number x is 0.2"
Discrete random variables

• If X takes on only a finite (like {1, 2, . . . , 10}) or countably infinite (like {1, 2, 3, . . .}) number of values, then X is called a discrete random variable

• Suppose X can take on k possible values {x_1, . . . , x_k}. Since X is random, we never know which number X takes for sure. So we need to talk about the probability of X taking each value

  p_j = P(X = x_j) for j = 1, 2, . . . , k

• Note: each p_j is between 0 and 1, and they satisfy

  p_1 + p_2 + · · · + p_k = 1
Probability density function (pdf)

• The distribution of X is summarized by the probability density function (pdf)

  f(x_j) = p_j for j = 1, 2, . . . , k

  with f(x) = 0 for any x not equal to one of the x_j's

• The probability of any event involving X can be computed from the p_j's
Continuous random variable

• If X takes values on some interval or the real line, then X is called a continuous random variable

• A continuous random variable takes on any particular real value with zero probability, i.e. if X is continuous, then

  P(X = x) = 0 for any value of x

• Since X can take on too many possible values, we cannot allocate probability to each value of x

• For continuous X, it only makes sense to talk about the probability of an interval, such as P(a ≤ X ≤ b) and P(X ≥ c)
Cumulative distribution function (cdf)

• To compute probabilities for a continuous random variable, it is useful to work with the cumulative distribution function (cdf)

  F(x) = P(X ≤ x) for any x

• F(x) is an increasing (or non-decreasing) function (it starts from 0 and increases to 1)

• By F(x), we can compute

  P(X ≥ c) = 1 − F(c)
  P(a ≤ X ≤ b) = F(b) − F(a)

• For the continuous case, a pdf f(x) is also available, which provides the probability of any interval via the integral over that interval (a computational sketch follows below)
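• A minimal Python sketch of these cdf formulas (not part of the original slides; scipy is assumed and the standard normal is used as the example):

```python
from scipy import stats

X = stats.norm(loc=0, scale=1)   # standard normal as an example continuous X
a, b, c = -1.0, 1.0, 1.96

print(1 - X.cdf(c))              # P(X >= c) = 1 - F(c), ≈ 0.025
print(X.cdf(b) - X.cdf(a))       # P(a <= X <= b) = F(b) - F(a), ≈ 0.683
```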
B.2. Joint distributions, conditional distributions and independence
(Wooldridge, App. B.2)
Joint distribution
• Let X and Y be discrete random variables. Then (X, Y) has a joint distribution, which is fully described by the joint pdf

  f_{X,Y}(x, y) = P(X = x, Y = y)

  where the right hand side is the probability that X takes x and Y takes y

• The pdf of a single variable, such as the pdf f_X(x) of X, is called the marginal pdf

• E.g. Y = wage, X = years of education
Independence

• We say X and Y are independent if

  f_{X,Y}(x, y) = f_X(x)·f_Y(y)

  for all x and y, where f_X(x) is the marginal pdf of X and f_Y(y) is the marginal pdf of Y

• Otherwise, we say X and Y are dependent

• As we will see soon, if X and Y are independent, knowing the outcome of X does not change the probabilities of outcomes of Y, and vice versa
Conditional distribution

• To talk about how X affects Y, we look at the conditional distribution of Y given X, which is summarized by the conditional pdf

  f_{Y|X}(y|x) = f_{X,Y}(x, y) / f_X(x)

  for all values of x such that f_X(x) > 0

• Note that by definition

  f_{Y|X}(y|x) = P(X = x, Y = y) / P(X = x) = P(Y = y | X = x)

  so the conditional pdf f_{Y|X}(y|x) gives us the "(conditional) probability of Y = y given that X = x"

• E.g. Y = wage and X = years of education. f_{Y|X}(y|12) means the pdf of wage for all people in the population with 12 years of education (a numerical sketch follows below)
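• A small Python sketch with a hypothetical joint pmf (not part of the original slides; numpy assumed, the probabilities are made up for illustration):

```python
import numpy as np

# Joint pmf f_{X,Y}(x, y): rows index x in {0, 1}, columns index y in {0, 1, 2}
f_xy = np.array([[0.10, 0.20, 0.10],
                 [0.15, 0.25, 0.20]])

f_x = f_xy.sum(axis=1)            # marginal pdf of X
f_y = f_xy.sum(axis=0)            # marginal pdf of Y
f_y_given_x0 = f_xy[0] / f_x[0]   # conditional pdf f_{Y|X}(y | x = 0)

print(f_y_given_x0, f_y_given_x0.sum())        # a proper pdf: sums to 1
print(np.allclose(f_xy, np.outer(f_x, f_y)))   # independence check: False here
```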
Relationship with independence

• If X and Y are independent (i.e. f_{X,Y}(x, y) = f_X(x)·f_Y(y)), then the conditional pdf of Y given X is

  f_{Y|X}(y|x) = f_{X,Y}(x, y) / f_X(x) = f_X(x)·f_Y(y) / f_X(x) = f_Y(y)

  i.e. knowledge of the value taken by X tells us nothing about the distribution of Y
B.3. Features of probability distributions
(Wooldridge, App. B.3)
Features of distribution
• Knowing the pdf is great, but for many purposes we will be interested in only a few aspects of the distribution of a random variable, such as
• Measure of central tendency
• Measure of variability or spread
• Measure of association between two random variables
Measure of central tendency: Expected value

• One of the most important concepts in this course

• The expected value (or expectation) of a random variable X (denoted by E(X) or sometimes µ) is the weighted average of all possible values of X, with weights determined by the pdf

• If X takes values in {x_1, . . . , x_k} with pdf f(x), then the expected value is written as

  E(X) = x_1·f(x_1) + · · · + x_k·f(x_k)

• If X is continuous, the expected value is given by an integral

  E(X) = ∫_{−∞}^{∞} x·f(x) dx
Expected value of function of X

• Consider g(X), a function of X. Its expected value is

  E[g(X)] = g(x_1)·f(x_1) + · · · + g(x_k)·f(x_k)

  i.e. the weighted average of all possible values of g(X)

• For example, if g(X) = X², then

  E[X²] = x_1²·f(x_1) + · · · + x_k²·f(x_k)

  (a numerical sketch follows below)
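• A minimal Python sketch of these weighted averages (not part of the original slides; numpy assumed, the pdf is made up for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])   # possible values {x_1, ..., x_k}
f = np.array([0.2, 0.5, 0.3])   # pdf: nonnegative, sums to 1

print(np.sum(x * f))        # E(X)   = 2.1
print(np.sum(x ** 2 * f))   # E(X^2) = 4.9
```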
Properties of E(·)

• Used very frequently in this course

• Property E.1: For any (nonrandom) constant c,

  E(c) = c

• E.g. E(3) = 3. Since c (or 3 in this case) never takes another number, this makes sense
• Property E.2: For any constants a and b,

  E(aX + b) = a·E(X) + b

• Intuitively, constants can go outside of E(·)

• This can be seen by expressing E(·) as a weighted average
• Property E.3: If {a_1, . . . , a_n} are constants and {X_1, . . . , X_n} are random variables, then

  E(a_1·X_1 + · · · + a_n·X_n) = a_1·E(X_1) + · · · + a_n·E(X_n)

• This is a generalization of Property E.2

• The expectation of a sum can be split into the sum of expectations, and the constant coefficients a_i can go outside of E(·)
Measure of variability: Variance and standard deviation

• Once we figure out the central tendency of the distribution of X by the expected value µ = E(X), the next step is to characterize the variability or spread of the distribution around µ

• The common measure of variability is the variance

  Var(X) = E[(X − µ)²]

  i.e. we measure variability by the squared difference (X − µ)² and summarize it by its expected value

• Also, the standard deviation is defined as

  sd(X) = √Var(X)
Properties of variance

• Property VAR.1: For any (nonrandom) constant c,

  Var(c) = 0

• A constant has no variability

• Property VAR.2: For any constants a and b,

  Var(aX + b) = a²·Var(X)

• b does not change the variance. When a goes outside of Var(·), it becomes a² (because variance is defined as the expected squared difference E[(X − µ)²]); a simulation sketch follows below
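• A Python simulation check of Property VAR.2 (not part of the original slides; numpy assumed, parameters arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(loc=5.0, scale=2.0, size=1_000_000)   # Var(X) = 4
a, b = 3.0, 7.0

print(np.var(a * X + b))    # ≈ a^2 * Var(X) = 36; b plays no role
print(a ** 2 * np.var(X))   # ≈ 36
```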
B.4. Features of joint and conditional distributions
(Wooldridge, App. B.4)
Covariance
• Consider two random variables X and Y. Let µ_X = E(X) and µ_Y = E(Y). To measure the association of X and Y, we look at the product of deviations from the means

  (X − µ_X)(Y − µ_Y)

  If X > µ_X, Y > µ_Y or X < µ_X, Y < µ_Y (i.e. same signs), then this product is positive. If X > µ_X, Y < µ_Y or X < µ_X, Y > µ_Y (i.e. different signs), then this product is negative

• Covariance is the expected value of this product

  Cov(X, Y) = E[(X − µ_X)(Y − µ_Y)]

• Property COV.1: If X and Y are independent, then

  Cov(X, Y) = 0

  (but the converse is not true in general)
Correlation coefficient

• A drawback of covariance is that it depends on the units of measurement. This can be overcome by the correlation coefficient

  Corr(X, Y) = Cov(X, Y) / (sd(X)·sd(Y))

• Property CORR.1:

  −1 ≤ Corr(X, Y) ≤ 1

• If Cov(X, Y) > 0 (or Corr(X, Y) > 0), we say X and Y are positively correlated

• If Cov(X, Y) < 0 (or Corr(X, Y) < 0), we say X and Y are negatively correlated (a simulated example follows below)
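• A Python sketch of the units point (not part of the original slides; numpy assumed, data simulated):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)   # positively related to x

print(np.cov(x, y)[0, 1], np.corrcoef(x, y)[0, 1])
# Rescale x (e.g. dollars to cents): covariance rescales, correlation does not
print(np.cov(100 * x, y)[0, 1], np.corrcoef(100 * x, y)[0, 1])
```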
Variance of sum of random variables

• Property VAR.3: For constants a and b,

  Var(aX + bY) = a²·Var(X) + b²·Var(Y) + 2ab·Cov(X, Y)

• If X and Y are uncorrelated (i.e. Cov(X, Y) = 0), then

  Var(aX + bY) = a²·Var(X) + b²·Var(Y)

• Property VAR.4: Suppose {X_1, . . . , X_n} are uncorrelated with each other (i.e. Cov(X_i, X_j) = 0 for any i ≠ j). Then for constants {a_1, . . . , a_n},

  Var(a_1·X_1 + · · · + a_n·X_n) = a_1²·Var(X_1) + · · · + a_n²·Var(X_n)
Conditional expectation

• Let X and Y be discrete random variables. Recall the conditional pdf is

  f_{Y|X}(y|x) = f_{X,Y}(x, y) / f_X(x) = P(Y = y | X = x)

  i.e. the probability of Y = y given that X = x

• E.g. Y = wage and X = years of education. f_{Y|X}(y|12) means the pdf of wage for all people in the population with 12 years of education. Similarly, we can define f_{Y|X}(y|13), f_{Y|X}(y|14), f_{Y|X}(y|16), and so on. In general, these distributions are all different

• The conditional expectation (or conditional mean) looks at the expected values of these conditional pdfs
• Suppose Y takes on values {y_1, . . . , y_m}. The conditional expectation of Y given X = x is

  E(Y | X = x) = y_1·f_{Y|X}(y_1|x) + · · · + y_m·f_{Y|X}(y_m|x)

• If Y is continuous, E(Y | X = x) is defined by an integral over y

• E.g. Y = wage and X = years of education. E(Y | X = 12) is the average wage for all people in the population with 12 years of education. E(Y | X = x) means the same for x years of education

• Note: E(Y | X = x) typically varies with x. In other words, E(Y | X = x) is a function of x (say, m(x) = E(Y | X = x))

• A very useful summary of how Y and X are related
Properties of conditional expectation

• Used frequently in this course

• Property CE.1: For any function c(X),

  E[c(X) | X] = c(X)

• Intuitively, if we know X, then we also know c(X)

• To compute an expectation conditional on X, the function c(X) of X is treated like a constant

• E.g. for c(X) = X², E[X² | X] = X²
• Property CE.2: For any functions a(X) and b(X),

  E[a(X)·Y + b(X) | X] = a(X)·E(Y | X) + b(X)

• Intuitively, functions of X can go outside of the conditional expectation E(· | X)

• To compute an expectation conditional on X, the functions a(X) and b(X) of X are treated like constants
• Property CE.5: If E(Y | X) = E(Y), then

  Cov(X, Y) = 0

  (and also Corr(X, Y) = 0)

• If knowledge of X does not change the expected value of Y, then X and Y must be uncorrelated

• The converse is not true in general: even if X and Y are uncorrelated, E(Y | X) could still depend on X
Conditional variance

• The conditional variance of the conditional distribution of Y given X = x is

  Var(Y | X = x) = E[(Y − E(Y | x))² | x]

• A formula often used:

  Var(Y | X) = E(Y² | X) − [E(Y | X)]²

• Property CV.1: If X and Y are independent, then

  Var(Y | X) = Var(Y)
B.5. Normal and related distributions
(Wooldridge, App. B.5)
Normal distribution
• The most widely used distribution in econometrics and statistics

• Other distributions such as the t- and F-distributions (explained later) are obtained as functions of normally distributed random variables

• A normal random variable is continuous and can take any value on the real line. Although the mathematical expression of its pdf is a bit complicated, the pdf is bell-shaped and symmetric around its expected value

• We say X has a normal distribution with expected value µ = E(X) and variance σ² = Var(X), written as

  X ∼ Normal(µ, σ²)

• If Z ∼ N(0, 1), we say Z has the standard normal distribution
Graph of N(0, 1) and t_6

[Figure: pdfs of the standard normal N(0, 1) and the t_6 distribution]
Property of normal random variable
• Property Normal.1: If X ∼ Normal(µ, σ²), then

  (X − µ)/σ ∼ N(0, 1)

• This transformation (i.e. subtract the expected value µ, then divide by the standard deviation σ) is called standardization

• Property Normal.4: A linear combination of normal random variables (e.g. a_1·X_1 + a_2·X_2 + · · · + a_n·X_n) is also normally distributed
Chi-square distribution

• Consider n independent standard normal random variables Z_1, . . . , Z_n (i.e. Z_i ∼ Normal(0, 1))

• Based on them, consider the sum of squares

  X = ∑_{i=1}^n Z_i²

• Since this object appears very often (it is closely related to the sample variance), people put a name on it

• The distribution of X is called the chi-square distribution with n degrees of freedom, written as

  X ∼ χ²_n

• Its pdf is complicated
t distribution

• Let

  Z ∼ N(0, 1)
  X ∼ χ²_n
  Z and X be independent

• Then consider the ratio

  T = Z / √(X/n)

• Since this object appears very often, people put a name on it

• The distribution of T is called the t_n distribution with n degrees of freedom, written as

  T ∼ t_n
• The t_n distribution depends on n (called the degrees of freedom)

• The pdf of the t distribution has a similar bell shape to the standard normal Normal(0, 1) but is more spread out (intuitively, Z is normal but T has extra variation due to the random denominator √(X/n))

• Indeed, the t_n distribution converges to Normal(0, 1) as n → ∞

• The mathematical expression of the t distribution is complicated. Use Table G in the Appendix or a computer (e.g. the simulation sketch below)
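• A Python simulation sketch of the construction of T (not part of the original slides; numpy and scipy assumed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, reps = 6, 200_000

Z = rng.standard_normal(reps)                           # Z ~ N(0, 1)
X = (rng.standard_normal((reps, n)) ** 2).sum(axis=1)   # X ~ chi-square_n, independent of Z
T = Z / np.sqrt(X / n)                                  # T ~ t_n

# Simulated tail probability P(T > 1.943) vs the exact t_6 tail (≈ 0.05)
print((T > 1.943).mean(), 1 - stats.t.cdf(1.943, df=n))
```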
F distribution

• Let

  X_1 ∼ χ²_{k_1}
  X_2 ∼ χ²_{k_2}
  X_1 and X_2 be independent

• Based on them, consider

  F = (X_1/k_1) / (X_2/k_2)

• Again, since this object appears very often, people put a name on it

• The distribution of F is called the F_{k_1,k_2} distribution with (k_1, k_2) degrees of freedom, written as

  F ∼ F_{k_1,k_2}
C.1. & C.2. Concepts for point estimation
(Wooldridge, App. C.1 & C.2)
Random sampling
• Consider n independent random variables Y_1, . . . , Y_n with common pdf f(y; θ). Then {Y_1, . . . , Y_n} is called a random sample from the population f(y; θ) with parameter θ

• Example: Y_i = 0 or 1 (say, tail or head) with pdf

  P(Y_i = 1) = θ
  P(Y_i = 0) = 1 − θ

• We want to estimate θ from the random sample {Y_1, . . . , Y_n}
Estimator & Estimate

• In principle, any method to estimate θ should be some function of the sample {Y_1, . . . , Y_n}, say

  θ̂ = g(Y_1, . . . , Y_n)

  Such an object is called an estimator of θ

• Note that an estimator is a function of random variables, so θ̂ is random, too

• What we report is its outcome based on the outcomes {y_1, . . . , y_n} of {Y_1, . . . , Y_n},

  θ̂ = g(y_1, . . . , y_n)

  which is called an estimate of θ

• An estimator is random. An estimate is non-random (just some number)
• For example, to estimate the population mean µ = E(Y_i), the sample mean

  Ȳ = (1/n)·∑_{i=1}^n Y_i

  is an estimator of µ. From the data {y_1, . . . , y_n} (i.e. particular outcomes of the sample), we report

  ȳ = (1/n)·∑_{i=1}^n y_i (say, ȳ = 75)

• The properties of an estimator are described by the sampling distribution of the estimator Ȳ (ȳ is a constant, so it does not have a distribution)
Unbiasedness

• The first property we focus on is the expected value E(θ̂) of an estimator

• θ̂ is an unbiased estimator for θ if

  E(θ̂) = θ

• If they are not equal, the estimator is biased and

  Bias(θ̂) = E(θ̂) − θ

• For example, Ȳ is unbiased for µ = E(Y_i) because

  E(Ȳ) = E((1/n)·∑_{i=1}^n Y_i) = (1/n)·∑_{i=1}^n E(Y_i) = (1/n)·∑_{i=1}^n µ = µ

  (a simulation sketch follows below)
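• A Python simulation sketch of unbiasedness (not part of the original slides; numpy assumed, the Uniform(0, 100) population borrows the later LLN example):

```python
import numpy as np

rng = np.random.default_rng(5)
mu, n, reps = 50.0, 10, 100_000

samples = rng.uniform(0, 100, size=(reps, n))   # E(Y_i) = 50
ybars = samples.mean(axis=1)                    # one estimate per replication

print(ybars.mean())   # average of the estimates ≈ µ = 50
```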
Sampling variance

• The second property is the sampling variance Var(θ̂) of an estimator

• If we have two unbiased estimators (say θ̂ and θ̃), we often compare them by their variances Var(θ̂) and Var(θ̃), and prefer the smaller-variance estimator. The estimator with smaller variance is called more efficient

• For example, the sampling variance of Ȳ is

  Var(Ȳ) = Var((1/n)·∑_{i=1}^n Y_i)
         = (1/n²)·Var(∑_{i=1}^n Y_i)
         = (1/n²)·∑_{i=1}^n Var(Y_i) (because the Y_i's are independent)
         = (1/n²)·∑_{i=1}^n σ² = σ²/n
C.3. Asymptotic properties of estimators
(Wooldridge, App. C.3)
Consistency
• The first asymptotic property of an estimator concerns how far the estimator is likely to be from the parameter it is supposed to be estimating as the sample size increases to infinity

• Intuitively, we want "convergence" of the estimator, say θ̂_n, to the unknown parameter, say θ, as n → ∞

• Recall: convergence of a non-random sequence, c_n → c. For example,

  c_n = 2 + 3/n → 2 as n → ∞

  or write lim_{n→∞} c_n = 2
Convergence in probability

• We want an analog of convergence for θ̂_n, which is random

• We say: a sequence of random variables W_n converges in probability to c if for any ε > 0,

  P(|W_n − c| > ε) → 0 as n → ∞

• This is denoted by

  plim(W_n) = c

  where c is called the probability limit
Consistency of estimator

• An estimator θ̂_n is consistent for the parameter θ if

  plim(θ̂_n) = θ

• It means the distribution of θ̂_n becomes more and more concentrated around θ and collapses to the constant θ in the limit

• In particular, we want consistency of the OLS estimator, plim(β̂_j) = β_j (note that β̂_j depends on the sample size n)
Law of large numbers (LLN)

• The basic tool for establishing consistency is the law of large numbers (LLN)

• LLN: Let Y_1, . . . , Y_n be independent and identically distributed random variables with mean µ = E(Y_i). Then

  plim(Ȳ_n) = µ

  i.e. the sample average converges in probability to the population mean

• In other words, Ȳ_n is a consistent estimator for µ
Simulation

• Let Y_1, . . . , Y_n be independent and

  Y_i ∼ Uniform(0, 100)

  for i = 1, . . . , n

• The population mean is E(Y_i) = 50

• Fix n. Then simulate Ȳ_n 10,000 times by computer and draw the histogram (a sketch of such a simulation follows below)
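• A Python sketch of this simulation (the original code is not shown in the slides; numpy and matplotlib are assumed):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
reps = 10_000   # number of simulated values of Ybar_n

fig, axes = plt.subplots(1, 5, figsize=(15, 3))
for ax, n in zip(axes, [1, 2, 5, 10, 100]):
    ybars = rng.uniform(0, 100, size=(reps, n)).mean(axis=1)
    ax.hist(ybars, bins=40)
    ax.set_title(f"n = {n}")
    ax.set_xlim(0, 100)   # histograms concentrate around 50 as n grows
plt.show()
```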
Histograms for Ȳ_n with n = 1, 2, 5, 10, 100

[Figures: histograms (frequency scale) of the 10,000 simulated values of Ȳ_n for n = 1, 2, 5, 10, 100. As n grows, the histogram concentrates around the population mean 50]
Intuition for LLN
• Key: look at the variance of Ȳ_n

• Let Var(Y_i) = σ². Recall that

  Var(Ȳ_n) = σ²/n → 0

  i.e. the variance of Ȳ_n shrinks to zero at the rate 1/n, so the distribution of Ȳ_n collapses
Consistency of sample moments

• We saw that the sample mean Ȳ_n is consistent for the population mean E(Y_i)

• The LLN also gives us consistency of other sample moment estimators, e.g. the sample variance

  plim( (1/(n−1))·∑_{i=1}^n (Y_i − Ȳ_n)² ) = Var(Y_i)

  and the sample covariance

  plim( (1/n)·∑_{i=1}^n (Y_i − Ȳ_n)(Z_i − Z̄_n) ) = Cov(Y_i, Z_i)
Property of plim

• Property PLIM.2: If plim(Z_n) = a and plim(W_n) = b, then

  plim(Z_n + W_n) = a + b
  plim(Z_n·W_n) = ab
  plim(Z_n/W_n) = a/b provided b ≠ 0
Asymptotic distribution

• Consistency is a desirable property of an estimator. If an estimator θ̂_n is consistent, it eventually converges to the unknown parameter θ of interest

• However, if we wish to conduct statistical inference (hypothesis testing or confidence intervals), we need more information about θ̂_n, i.e. its distribution

• Unless we impose restrictive assumptions (e.g. MLR.6), it is not easy to get the finite sample distribution of θ̂_n for a given n

• However, it is easy to get an approximate distribution for θ̂_n as n increases to infinity, under mild conditions

• Indeed, most estimators in econometrics are well approximated by a normal distribution
Asymptotic normal distribution

• We say: a sequence of random variables {Z_n} has an asymptotic standard normal distribution if for each a,

  P(Z_n ≤ a) → Φ(a) as n → ∞

  where Φ(a) is the cumulative distribution function (cdf) of the standard normal Normal(0, 1)

• In words, for each a, the cdf of Z_n evaluated at a converges to the cdf of Normal(0, 1) evaluated at a

• We often write

  Z_n ∼ᵃ Normal(0, 1)
Central limit theorem (CLT)

• The basic tool for establishing asymptotic normality is the central limit theorem (CLT)

• Let Y_1, . . . , Y_n be independent and identically distributed random variables with mean µ = E(Y_i) and variance σ² = Var(Y_i)

• Consider the sample average Ȳ_n = (1/n)·∑_{i=1}^n Y_i again

• Note: Ȳ_n itself does not have an asymptotic distribution (it collapses to µ by the LLN)
• Key: look at the standardized version of Ȳ_n

• Note that

  E(Ȳ_n) = µ
  Var(Ȳ_n) = σ²/n

  which implies that

  Z_n = (Ȳ_n − µ) / (σ/√n)

  satisfies E(Z_n) = 0 and Var(Z_n) = 1

• Therefore, the distribution of Z_n will not collapse even as n → ∞
• CLT: Let Y_1, . . . , Y_n be independent and identically distributed random variables with mean µ = E(Y_i) and variance σ² = Var(Y_i). Then

  Z_n = (Ȳ_n − µ) / (σ/√n) ∼ᵃ Normal(0, 1)

• Remarkably, regardless of the distribution of Y_i, the distribution of Z_n gets arbitrarily close to the standard normal
Simulation

• Again, let Y_1, . . . , Y_n be independent and

  Y_i ∼ Uniform(0, 100)

  for i = 1, . . . , n

• The population mean is µ = 50 and the variance is σ² = 10000/12

• Fix n. Then simulate

  Z_n = (Ȳ_n − µ) / (σ/√n)

  10,000 times by computer and draw the histogram (a sketch follows below)
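• A Python sketch of this simulation (the original code is not shown in the slides; numpy and matplotlib are assumed):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
reps = 10_000
mu, sigma = 50.0, np.sqrt(10_000 / 12)   # mean and sd of Uniform(0, 100)

fig, axes = plt.subplots(1, 5, figsize=(15, 3))
for ax, n in zip(axes, [1, 2, 5, 10, 100]):
    ybars = rng.uniform(0, 100, size=(reps, n)).mean(axis=1)
    z = (ybars - mu) / (sigma / np.sqrt(n))   # standardized sample mean
    ax.hist(z, bins=40)
    ax.set_title(f"n = {n}")   # shape approaches the standard normal
plt.show()
```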
Histograms for Z_n with n = 1, 2, 5, 10, 100

[Figures: histograms (frequency scale) of the 10,000 simulated values of Z_n for n = 1, 2, 5, 10, 100. As n grows, the histogram approaches the bell shape of the standard normal]
C.6. Hypothesis testing
(Wooldridge, App. C.6)
Hypothesis testing

• Let θ be the parameter of interest. An estimator θ̂ gives us an estimate for θ, i.e. we report some number

• E.g. θ = E(X) (population mean), θ̂ = X̄ (sample mean)

• Hypothesis testing is interested in answering a yes/no question about θ, i.e. we report yes or no

• Typical question: is some regression coefficient zero or not?
Example: Testing hypotheses about mean in normal population

• To illustrate the basic idea, consider a N(µ, σ²) population and hypothesis testing about the mean µ based on a random sample {Y_1, . . . , Y_n} (so Y_i ∼ N(µ, σ²) for all i = 1, . . . , n)

• Consider the null hypothesis

  H_0 : µ = µ_0

  where µ_0 is a value we specify (e.g. µ_0 = 0, so H_0 : µ = 0)
• To set up the yes/no question, we need to specify the alternative hypothesis. Popular examples are

  H_1 : µ > µ_0
  H_1 : µ < µ_0
  H_1 : µ ≠ µ_0

  The first and second are called one-sided alternative hypotheses. The third is called a two-sided alternative hypothesis

• Here let us consider

  H_0 : µ = µ_0 vs. H_1 : µ > µ_0

• We report: "Reject H_0 (in favor of H_1)" or "Do not reject H_0"
Idea for testing

• Intuitively, we should reject H_0 if

  ȳ is sufficiently greater than µ_0

  but how large? ȳ − µ_0 > 10, or 500, say?

• The meaning of ȳ − µ_0 = 10 (say) is case-by-case. So consider the standardized version, obtained by dividing by the standard error,

  t = (ȳ − µ_0) / se(ȳ) = (ȳ − µ_0) / (s/√n)

  where se(ȳ) = s/√n and

  s = √( (1/(n−1))·∑_{i=1}^n (y_i − ȳ)² )

• Now the meaning of t = 2 (say) is universal for any data
Find critical value

• Based on the standardized object t, a reasonable test would be:

  Reject H_0 : µ = µ_0 (in favor of H_1 : µ > µ_0) if t > c

  and do not reject H_0 if t ≤ c

• So what we have to do is find the critical value c

• To pin down c, we need some rule
Rule for critical value

• In testing, we can make two kinds of mistakes:

             | Reject   | Not reject
  H_0 true   | Type I   | correct
  H_1 true   | correct  | Type II

• Type I error probability: P(Reject ; H_0 true)

• Type II error probability: P(Accept ; H_1 true)

• Rule: find c to control the Type I error probability
• Let us find c in the current example. To compute probabilities, consider the random variable counterpart of t = (ȳ − µ_0)/(s/√n), that is

  T = (Ȳ − µ_0) / (S/√n)

• We want to find c such that

  P(Reject ; H_0 true) = P(T > c ; H_0 true) = α

  where α (called the significance level) should be specified by us. Typically α = .01, .05, or .10

• To find c, we need to know the distribution of T under H_0 : µ = µ_0. Indeed,

  T follows the t_{n−1} distribution under H_0
• Then look up the t distribution table (Table G.2). For example, if n = 29 and α = .05, the critical value is c = 1.701
Test for mean in normal population
• Hypotheses

  H_0 : µ = µ_0 vs. H_1 : µ > µ_0

• Significance level α = .05

• Test statistic & distribution under H_0

  T = (Ȳ − µ_0) / (S/√n) ∼ t_{n−1} under H_0

• Find the critical value c = 1.701 from the t_{29−1} distribution table

• Test: Reject H_0 if t > 1.701. Do not reject if t ≤ 1.701 (a computational sketch follows below)
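• A Python sketch of this test (not part of the original slides; numpy and scipy assumed, the data are simulated for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
y = rng.normal(loc=52.0, scale=10.0, size=29)   # hypothetical sample, n = 29
mu0, alpha, n = 50.0, 0.05, len(y)

s = y.std(ddof=1)                         # sample standard deviation
t = (y.mean() - mu0) / (s / np.sqrt(n))   # test statistic
c = stats.t.ppf(1 - alpha, df=n - 1)      # critical value ≈ 1.701 for df = 28

print(t, c, "Reject H0" if t > c else "Do not reject H0")
```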
Test for another one-sided alternative

• If the alternative hypothesis is

  H_1 : µ < µ_0

  we reject H_0 if

  t < −c

• c can be found in the same way by looking at the left tail of the t_{n−1} distribution. For example, if n = 29, we reject when

  t < −1.701
Test for two-sided alternative

• If the alternative hypothesis is two-sided,

  H_1 : µ ≠ µ_0

  we reject H_0 if

  |t| > c

• We should reject for both large positive and large negative values of t

• The distribution of T under H_0 remains the same, i.e. T ∼ t_{n−1}, but we have to allocate the significance level α to the left and right tails

• So if we look at the right tail, its area should be α/2
• Look up the t distribution table (Table G.2). For example, if n = 26 and α = .05, the critical value is 2.06 (the t distribution is symmetric, so the rejection region is |t| > 2.06)
Summary: Basic steps for testing
• State the null and alternative hypotheses, H_0 and H_1

• Declare the significance level α

• Find the test statistic & its distribution under H_0 (e.g. T ∼ t_{n−1} under H_0)

• Find the critical value c from a distribution table (or by software)

• State the testing procedure: Reject H_0 if... and do not reject H_0 if...

• Implement the test on the data and report the result: Reject (or do not reject) H_0 at the 100·α% significance level