EC212: Introduction to Econometrics
Review Materials
(Wooldridge, Appendix)
Taisuke Otsu
London School of Economics
Summer 2018
A.1. Summation operator
(Wooldridge, App. A.1)
Summation operator
• For a sequence {x_1, x_2, . . . , x_n}, denote the summation as

  ∑_{i=1}^n x_i = x_1 + x_2 + · · · + x_n

• Since data are a collection of numbers, "∑_{i=1}^n" plays a key role in econometrics and statistics
Properties

1. For any constant c,

   ∑_{i=1}^n c = nc

2. For any constant c,

   ∑_{i=1}^n c·x_i = c·∑_{i=1}^n x_i

3. For a sequence {(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)} and constants a and b,

   ∑_{i=1}^n (a·x_i + b·y_i) = a·∑_{i=1}^n x_i + b·∑_{i=1}^n y_i

• If you get confused, try the case of n = 2 or 3 (or see the numerical check below)
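• A minimal numerical check in Python (not part of the original slides; numpy is assumed and the sequences and constants are arbitrary):

```python
import numpy as np

# Verify Properties 1-3 of the summation operator on arbitrary numbers
rng = np.random.default_rng(0)
x = rng.normal(size=5)          # sequence {x_1, ..., x_n}
y = rng.normal(size=5)
c, a, b = 3.0, 2.0, -1.5        # arbitrary constants
n = len(x)

print(np.isclose(np.sum(np.full(n, c)), n * c))                           # Property 1
print(np.isclose(np.sum(c * x), c * np.sum(x)))                           # Property 2
print(np.isclose(np.sum(a * x + b * y), a * np.sum(x) + b * np.sum(y)))   # Property 3
```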
Average

• For {x_1, x_2, . . . , x_n}, the average or mean is defined as

  x̄ = (1/n)·∑_{i=1}^n x_i

• (x_i − x̄) is called the deviation from the average
Properties of x_i − x̄

1. The sum of deviations is always zero:

   ∑_{i=1}^n (x_i − x̄) = ∑_{i=1}^n x_i − n·x̄ = ∑_{i=1}^n x_i − ∑_{i=1}^n x_i = 0

2. Sum of squared deviations:

   ∑_{i=1}^n (x_i − x̄)² = ∑_{i=1}^n x_i² − n·x̄²

3. Cross-product version:

   ∑_{i=1}^n (x_i − x̄)(y_i − ȳ) = ∑_{i=1}^n x_i·y_i − n·x̄·ȳ

• These are shown by the properties of the summation operator (a numerical check follows below)
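• A quick Python check of the three identities (not part of the original slides; numpy assumed, data arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=10)
y = rng.normal(size=10)
n = len(x)
xbar, ybar = x.mean(), y.mean()

print(np.isclose(np.sum(x - xbar), 0.0))                                    # Property 1
print(np.isclose(np.sum((x - xbar) ** 2), np.sum(x ** 2) - n * xbar ** 2))  # Property 2
print(np.isclose(np.sum((x - xbar) * (y - ybar)),
                 np.sum(x * y) - n * xbar * ybar))                          # Property 3
```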
Derive Property 2

• Expand the square and apply the properties of the summation operator:

  ∑_{i=1}^n (x_i − x̄)² = ∑_{i=1}^n (x_i² − 2·x_i·x̄ + x̄²)
                        = ∑_{i=1}^n x_i² − 2·x̄·∑_{i=1}^n x_i + n·x̄²
                        = ∑_{i=1}^n x_i² − 2·x̄·(n·x̄) + n·x̄²
                        = ∑_{i=1}^n x_i² − n·x̄²

• Property 3 is shown similarly
A.2. Linear function
(Wooldridge, App. A.2)
Linear function
• The linear function plays an important role in specifying econometric models

• If x and y are related by

  y = β_0 + β_1·x

  then we say that y is a linear function of x

• This relation is described by two parameters: the intercept β_0 and the slope β_1
Property of linear function

• Let ∆ denote "change"

• The key feature of the linear function y = β_0 + β_1·x is that the change in y is given by the slope β_1 times the change in x, i.e.

  ∆y = β_1·∆x

• In other words, the marginal effect of x on y is constant and equal to β_1
Two variable case

• If we have x_1 and x_2, the linear function is

  y = β_0 + β_1·x_1 + β_2·x_2

• The change in y given changes in x_1 and x_2 is

  ∆y = β_1·∆x_1 + β_2·∆x_2

• If x_2 does not change, then

  ∆y = β_1·∆x_1 if ∆x_2 = 0

  or

  β_1 = ∆y/∆x_1 if ∆x_2 = 0

• So β_1 measures how y changes with x_1 holding x_2 fixed (called the partial effect). This is closely related to ceteris paribus
A.4. Some special functions
(Wooldridge, App. A.4)
Quadratic function
• One way to capture diminishing returns is to add a quadratic term

  y = β_0 + β_1·x + β_2·x²

• When β_1 > 0 and β_2 < 0, the graph is a parabolic mountain shape

• By applying calculus, the slope of the quadratic function is approximated by

  slope = ∆y/∆x ≈ β_1 + 2·β_2·x

  (a numerical sketch follows below)

• Caution: the quadratic function is not monotone
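• A small Python check of the slope approximation (not part of the original slides; the parameter values are arbitrary):

```python
# Compare a finite-difference slope of y = b0 + b1*x + b2*x^2 with b1 + 2*b2*x
beta0, beta1, beta2 = 1.0, 4.0, -0.5     # arbitrary "parabolic mountain" parameters

def y(x):
    return beta0 + beta1 * x + beta2 * x ** 2

x0, dx = 2.0, 1e-6
finite_diff = (y(x0 + dx) - y(x0)) / dx      # ∆y/∆x for a small ∆x
print(finite_diff, beta1 + 2 * beta2 * x0)   # both ≈ 2.0
```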
Natural logarithm

• Perhaps the most important nonlinear function in econometrics. Denote it by log(x) (but ln(x) is also common)

• log(x) is defined only for x > 0 and looks like

  [Figure: graph of log(x) for x from 0 to 20]
• It is not very important how the values of log(x) are obtained

• log(x) is monotone increasing and displays diminishing marginal returns (the slope gets closer to 0 as x increases)

• Also we can see

  log(x) < 0 for 0 < x < 1
  log(1) = 0
  log(x) > 0 for x > 1

• Some properties

  log(x_1·x_2) = log(x_1) + log(x_2)
  log(x_1/x_2) = log(x_1) − log(x_2)
  log(x^c) = c·log(x) for any c
Key property: Relationship with percent change

• (By using calculus) we can see that

  log(x_1) − log(x_0) ≈ (x_1 − x_0)/x_0 if x_1 − x_0 is small

• The right hand side multiplied by 100 gives us the percent change in x. So this can be written as

  100·∆log(x) ≈ %∆x

  i.e. the log change times 100 approximates the percent change (a numerical sketch follows below)
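• A quick Python illustration of the approximation (not part of the original slides; numpy assumed, numbers arbitrary):

```python
import numpy as np

x0, x1 = 200.0, 206.0                          # a 3% increase
log_change = 100 * (np.log(x1) - np.log(x0))   # 100 * ∆log(x)
exact_pct = 100 * (x1 - x0) / x0               # exact percent change
print(log_change, exact_pct)                   # ≈ 2.956 vs 3.000
```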
Elasticity

• Thus the log is useful to approximate elasticity. The elasticity of y with respect to x is defined as

  %∆y / %∆x = (∆y/y) / (∆x/x) = (∆y/∆x)·(x/y)

  i.e. the percentage change in y when x increases by 1% (a familiar concept in economics)

• By logs, elasticity is approximated as

  %∆y / %∆x ≈ ∆log(y) / ∆log(x)
B.1. Random variables and their probability distributions
(Wooldridge, App. B.1)
Definition
• An experiment is any procedure that can yield outcomes with uncertainty

• E.g. tossing a coin (head or tail)

• A random variable is one that takes numerical values and has an outcome determined by an experiment

• E.g. the number of heads from tossing 10 coins
Notation for appendix

• In the Appendix, denote random variables by uppercase letters, like X, Y, Z

• On the other hand, denote particular outcomes by corresponding lowercase letters, like x, y, z

• In the main body of the textbook, both are denoted by lowercase x, y, z (it should be clear from each context)

• X is not associated with any particular value but x is, say x = 3

• Typical example in mind: X is your exam score at this point (which is random and not realized yet). Once you take the exam, it realizes and you get a particular value x, say x = 80

• So the expression

  P(X = x) = 0.2

  means "the probability that the random variable X takes a particular number x is 0.2"
Discrete random variables

• If X takes on only a finite (like {1, 2, . . . , 10}) or countably infinite (like {1, 2, 3, . . .}) number of values, then X is called a discrete random variable

• Suppose X can take on k possible values {x_1, . . . , x_k}. Since X is random, we never know which number X takes for sure. So we need to talk about the probability of X taking each value

  p_j = P(X = x_j) for j = 1, 2, . . . , k

• Note: each p_j is between 0 and 1, and they satisfy

  p_1 + p_2 + · · · + p_k = 1
Probability density function (pdf)

• The distribution of X is summarized by the probability density function (pdf)

  f(x_j) = p_j for j = 1, 2, . . . , k

  with f(x) = 0 for any x not equal to one of the x_j's

• The probability of any event involving X can be computed from the p_j's
Continuous random variable

• If X takes values on some interval or the real line, then X is called a continuous random variable

• A continuous random variable takes on any particular real value with zero probability, i.e. if X is continuous, then

  P(X = x) = 0 for any value of x

• Since X can take on too many possible values, we cannot allocate probability to each value of x

• For continuous X, it only makes sense to talk about the probability of an interval, such as P(a ≤ X ≤ b) and P(X ≥ c)
Cumulative distribution function (cdf)

• To compute probabilities for a continuous random variable, it is useful to work with the cumulative distribution function (cdf)

  F(x) = P(X ≤ x) for any x

• F(x) is an increasing (or non-decreasing) function (it starts from 0 and increases to 1)

• By F(x), we can compute

  P(X ≥ c) = 1 − F(c)
  P(a ≤ X ≤ b) = F(b) − F(a)

• For the continuous case, a pdf f(x) is also available, which provides the probability of any interval via the integral over that interval (a computational sketch follows below)
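• A minimal Python sketch of these cdf formulas (not part of the original slides; scipy is assumed and the standard normal is used as the example):

```python
from scipy import stats

X = stats.norm(loc=0, scale=1)   # standard normal as an example continuous X
a, b, c = -1.0, 1.0, 1.96

print(1 - X.cdf(c))              # P(X >= c) = 1 - F(c), ≈ 0.025
print(X.cdf(b) - X.cdf(a))       # P(a <= X <= b) = F(b) - F(a), ≈ 0.683
```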
B.2. Joint distributions, conditional distributions and independence
(Wooldridge, App. B.2)
Joint distribution
• Let X and Y be discrete random variables. Then (X, Y) has a joint distribution, which is fully described by the joint pdf

  f_{X,Y}(x, y) = P(X = x, Y = y)

  where the right hand side is the probability that X takes x and Y takes y

• The pdf of a single variable, such as the pdf f_X(x) of X, is called the marginal pdf

• E.g. Y = wage, X = years of education
Independence

• We say X and Y are independent if

  f_{X,Y}(x, y) = f_X(x)·f_Y(y)

  for all x and y, where f_X(x) is the marginal pdf of X and f_Y(y) is the marginal pdf of Y

• Otherwise, we say X and Y are dependent

• As we will see soon, if X and Y are independent, knowing the outcome of X does not change the probabilities of outcomes of Y, and vice versa
Conditional distribution

• To talk about how X affects Y, we look at the conditional distribution of Y given X, which is summarized by the conditional pdf

  f_{Y|X}(y|x) = f_{X,Y}(x, y) / f_X(x)

  for all values of x such that f_X(x) > 0

• Note that by definition

  f_{Y|X}(y|x) = P(X = x, Y = y) / P(X = x) = P(Y = y | X = x)

  so the conditional pdf f_{Y|X}(y|x) gives us the "(conditional) probability of Y = y given that X = x"

• E.g. Y = wage and X = years of education. f_{Y|X}(y|12) means the pdf of wage for all people in the population with 12 years of education (a numerical sketch follows below)
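• A small Python sketch with a hypothetical joint pmf (not part of the original slides; numpy assumed, the probabilities are made up for illustration):

```python
import numpy as np

# Joint pmf f_{X,Y}(x, y): rows index x in {0, 1}, columns index y in {0, 1, 2}
f_xy = np.array([[0.10, 0.20, 0.10],
                 [0.15, 0.25, 0.20]])

f_x = f_xy.sum(axis=1)            # marginal pdf of X
f_y = f_xy.sum(axis=0)            # marginal pdf of Y
f_y_given_x0 = f_xy[0] / f_x[0]   # conditional pdf f_{Y|X}(y | x = 0)

print(f_y_given_x0, f_y_given_x0.sum())        # a proper pdf: sums to 1
print(np.allclose(f_xy, np.outer(f_x, f_y)))   # independence check: False here
```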
Relationship with independence

• If X and Y are independent (i.e. f_{X,Y}(x, y) = f_X(x)·f_Y(y)), then the conditional pdf of Y given X is

  f_{Y|X}(y|x) = f_{X,Y}(x, y) / f_X(x) = f_X(x)·f_Y(y) / f_X(x) = f_Y(y)

  i.e. knowledge of the value taken by X tells us nothing about the distribution of Y
B.3. Features of probability distributions
(Wooldridge, App. B.3)
Features of distribution
• Knowing the pdf is great, but for many purposes we will be interested in only a few aspects of the distribution of a random variable, such as
• Measure of central tendency
• Measure of variability or spread
• Measure of association between two random variables
Measure of central tendency: Expected value

• One of the most important concepts in this course

• The expected value (or expectation) of a random variable X (denoted by E(X) or sometimes µ) is the weighted average of all possible values of X, with weights determined by the pdf

• If X takes values in {x_1, . . . , x_k} with pdf f(x), then the expected value is written as

  E(X) = x_1·f(x_1) + · · · + x_k·f(x_k)

• If X is continuous, the expected value is given by an integral

  E(X) = ∫_{−∞}^{∞} x·f(x) dx
Expected value of function of X

• Consider g(X), a function of X. Its expected value is

  E[g(X)] = g(x_1)·f(x_1) + · · · + g(x_k)·f(x_k)

  i.e. the weighted average of all possible values of g(X)

• For example, if g(X) = X², then

  E[X²] = x_1²·f(x_1) + · · · + x_k²·f(x_k)

  (a numerical sketch follows below)
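• A minimal Python sketch of these weighted averages (not part of the original slides; numpy assumed, the pdf is made up for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])   # possible values {x_1, ..., x_k}
f = np.array([0.2, 0.5, 0.3])   # pdf: nonnegative, sums to 1

print(np.sum(x * f))        # E(X)   = 2.1
print(np.sum(x ** 2 * f))   # E(X^2) = 4.9
```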
Properties of E(·)

• Used very frequently in this course

• Property E.1: For any (nonrandom) constant c,

  E(c) = c

• E.g. E(3) = 3. Since c (or 3 in this case) never takes another number, this makes sense
• Property E.2: For any constants a and b,

  E(aX + b) = a·E(X) + b

• Intuitively, constants can go outside of E(·)

• This can be seen by expressing E(·) as a weighted average
• Property E.3: If {a_1, . . . , a_n} are constants and {X_1, . . . , X_n} are random variables, then

  E(a_1·X_1 + · · · + a_n·X_n) = a_1·E(X_1) + · · · + a_n·E(X_n)

• This is a generalization of Property E.2

• The expectation of a sum can be split into the sum of expectations, and the constant coefficients a_i can go outside of E(·)
Measure of variability: Variance and standard deviation

• Once we figure out the central tendency of the distribution of X by the expected value µ = E(X), the next step is to characterize the variability or spread of the distribution around µ

• The common measure of variability is the variance

  Var(X) = E[(X − µ)²]

  i.e. we measure variability by the squared difference (X − µ)² and summarize it by its expected value

• Also, the standard deviation is defined as

  sd(X) = √Var(X)
Properties of variance

• Property VAR.1: For any (nonrandom) constant c,

  Var(c) = 0

• A constant has no variability

• Property VAR.2: For any constants a and b,

  Var(aX + b) = a²·Var(X)

• b does not change the variance. When a goes outside of Var(·), it becomes a² (because variance is defined as the expected squared difference E[(X − µ)²]); a simulation sketch follows below
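• A Python simulation check of Property VAR.2 (not part of the original slides; numpy assumed, parameters arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(loc=5.0, scale=2.0, size=1_000_000)   # Var(X) = 4
a, b = 3.0, 7.0

print(np.var(a * X + b))    # ≈ a^2 * Var(X) = 36; b plays no role
print(a ** 2 * np.var(X))   # ≈ 36
```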
B.4. Features of joint and conditional distributions
(Wooldridge, App. B.4)
Covariance
• Consider two random variables X and Y. Let µ_X = E(X) and µ_Y = E(Y). To measure the association of X and Y, we look at the product of deviations from the means

  (X − µ_X)(Y − µ_Y)

  If X > µ_X, Y > µ_Y or X < µ_X, Y < µ_Y (i.e. same signs), then this product is positive. If X > µ_X, Y < µ_Y or X < µ_X, Y > µ_Y (i.e. different signs), then this product is negative

• Covariance is the expected value of this product

  Cov(X, Y) = E[(X − µ_X)(Y − µ_Y)]

• Property COV.1: If X and Y are independent, then

  Cov(X, Y) = 0

  (but the converse is not true in general)
Correlation coefficient

• A drawback of covariance is that it depends on the units of measurement. This can be overcome by the correlation coefficient

  Corr(X, Y) = Cov(X, Y) / (sd(X)·sd(Y))

• Property CORR.1:

  −1 ≤ Corr(X, Y) ≤ 1

• If Cov(X, Y) > 0 (or Corr(X, Y) > 0), we say X and Y are positively correlated

• If Cov(X, Y) < 0 (or Corr(X, Y) < 0), we say X and Y are negatively correlated (a simulated example follows below)
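• A Python sketch of the units point (not part of the original slides; numpy assumed, data simulated):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)   # positively related to x

print(np.cov(x, y)[0, 1], np.corrcoef(x, y)[0, 1])
# Rescale x (e.g. dollars to cents): covariance rescales, correlation does not
print(np.cov(100 * x, y)[0, 1], np.corrcoef(100 * x, y)[0, 1])
```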
Variance of sum of random variables

• Property VAR.3: For constants a and b,

  Var(aX + bY) = a²·Var(X) + b²·Var(Y) + 2ab·Cov(X, Y)

• If X and Y are uncorrelated (i.e. Cov(X, Y) = 0), then

  Var(aX + bY) = a²·Var(X) + b²·Var(Y)

• Property VAR.4: Suppose {X_1, . . . , X_n} are uncorrelated with each other (i.e. Cov(X_i, X_j) = 0 for any i ≠ j). Then for constants {a_1, . . . , a_n},

  Var(a_1·X_1 + · · · + a_n·X_n) = a_1²·Var(X_1) + · · · + a_n²·Var(X_n)
Conditional expectation

• Let X and Y be discrete random variables. Recall the conditional pdf is

  f_{Y|X}(y|x) = f_{X,Y}(x, y) / f_X(x) = P(Y = y | X = x)

  i.e. the probability of Y = y given that X = x

• E.g. Y = wage and X = years of education. f_{Y|X}(y|12) means the pdf of wage for all people in the population with 12 years of education. Similarly, we can define f_{Y|X}(y|13), f_{Y|X}(y|14), f_{Y|X}(y|16), and so on. In general, these distributions are all different

• The conditional expectation (or conditional mean) looks at the expected values of these conditional pdfs
• Suppose Y takes on values {y_1, . . . , y_m}. The conditional expectation of Y given X = x is

  E(Y | X = x) = y_1·f_{Y|X}(y_1|x) + · · · + y_m·f_{Y|X}(y_m|x)

• If Y is continuous, E(Y | X = x) is defined by an integral over y

• E.g. Y = wage and X = years of education. E(Y | X = 12) is the average wage for all people in the population with 12 years of education. E(Y | X = x) means the same for x years of education

• Note: E(Y | X = x) typically varies with x. In other words, E(Y | X = x) is a function of x (say, m(x) = E(Y | X = x))

• A very useful summary of how Y and X are related
Properties of conditional expectation

• Used frequently in this course

• Property CE.1: For any function c(X),

  E[c(X) | X] = c(X)

• Intuitively, if we know X, then we also know c(X)

• To compute an expectation conditional on X, the function c(X) of X is treated like a constant

• E.g. for c(X) = X², E[X² | X] = X²
• Property CE.2: For any functions a(X) and b(X),

  E[a(X)·Y + b(X) | X] = a(X)·E(Y | X) + b(X)

• Intuitively, functions of X can go outside of the conditional expectation E(· | X)

• To compute an expectation conditional on X, the functions a(X) and b(X) of X are treated like constants
• Property CE.5: If E(Y | X) = E(Y), then

  Cov(X, Y) = 0

  (and also Corr(X, Y) = 0)

• If knowledge of X does not change the expected value of Y, then X and Y must be uncorrelated

• The converse is not true in general: even if X and Y are uncorrelated, E(Y | X) could still depend on X
Conditional variance

• The conditional variance of the conditional distribution of Y given X = x is

  Var(Y | X = x) = E[(Y − E(Y | x))² | x]

• A formula often used:

  Var(Y | X) = E(Y² | X) − [E(Y | X)]²

• Property CV.1: If X and Y are independent, then

  Var(Y | X) = Var(Y)
B.5. Normal and related distributions
(Wooldridge, App. B.5)
Normal distribution
• The most widely used distribution in econometrics and statistics

• Other distributions such as the t- and F-distributions (explained later) are obtained as functions of normally distributed random variables

• A normal random variable is continuous and can take any value on the real line. Although the mathematical expression of its pdf is a bit complicated, the pdf is bell-shaped and symmetric around its expected value

• We say X has a normal distribution with expected value µ = E(X) and variance σ² = Var(X), written as

  X ∼ Normal(µ, σ²)

• If Z ∼ N(0, 1), we say Z has the standard normal distribution
Graph of N(0, 1) and t_6

[Figure: pdfs of the standard normal N(0, 1) and the t_6 distribution]
Property of normal random variable
• Property Normal.1: If X ∼ Normal(µ, σ²), then

  (X − µ)/σ ∼ N(0, 1)

• This transformation (i.e. subtract the expected value µ, then divide by the standard deviation σ) is called standardization

• Property Normal.4: A linear combination of normal random variables (e.g. a_1·X_1 + a_2·X_2 + · · · + a_n·X_n) is also normally distributed
Chi-square distribution

• Consider n independent standard normal random variables Z_1, . . . , Z_n (i.e. Z_i ∼ Normal(0, 1))

• Based on them, consider the sum of squares

  X = ∑_{i=1}^n Z_i²

• Since this object appears very often (it is closely related to the sample variance), people put a name on it

• The distribution of X is called the chi-square distribution with n degrees of freedom, written as

  X ∼ χ²_n

• Its pdf is complicated
t distribution

• Let

  Z ∼ N(0, 1)
  X ∼ χ²_n
  Z and X be independent

• Then consider the ratio

  T = Z / √(X/n)

• Since this object appears very often, people put a name on it

• The distribution of T is called the t_n distribution with n degrees of freedom, written as

  T ∼ t_n
• The t_n distribution depends on n (called the degrees of freedom)

• The pdf of the t distribution has a similar bell shape to the standard normal Normal(0, 1) but is more spread out (intuitively, Z is normal but T has extra variation due to the random denominator √(X/n))

• Indeed, the t_n distribution converges to Normal(0, 1) as n → ∞

• The mathematical expression of the t distribution is complicated. Use Table G in the Appendix or a computer (e.g. the simulation sketch below)
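• A Python simulation sketch of the construction of T (not part of the original slides; numpy and scipy assumed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, reps = 6, 200_000

Z = rng.standard_normal(reps)                           # Z ~ N(0, 1)
X = (rng.standard_normal((reps, n)) ** 2).sum(axis=1)   # X ~ chi-square_n, independent of Z
T = Z / np.sqrt(X / n)                                  # T ~ t_n

# Simulated tail probability P(T > 1.943) vs the exact t_6 tail (≈ 0.05)
print((T > 1.943).mean(), 1 - stats.t.cdf(1.943, df=n))
```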
F distribution

• Let

  X_1 ∼ χ²_{k_1}
  X_2 ∼ χ²_{k_2}
  X_1 and X_2 be independent

• Based on them, consider

  F = (X_1/k_1) / (X_2/k_2)

• Again, since this object appears very often, people put a name on it

• The distribution of F is called the F_{k_1,k_2} distribution with (k_1, k_2) degrees of freedom, written as

  F ∼ F_{k_1,k_2}
C.1. & C.2. Concepts for point estimation
(Wooldridge, App. C.1 & C.2)
Random sampling
• Consider n independent random variables Y_1, . . . , Y_n with common pdf f(y; θ). Then {Y_1, . . . , Y_n} is called a random sample from the population f(y; θ) with parameter θ

• Example: Y_i = 0 or 1 (say, tail or head) with pdf

  P(Y_i = 1) = θ
  P(Y_i = 0) = 1 − θ

• We want to estimate θ from the random sample {Y_1, . . . , Y_n}
Estimator & Estimate

• In principle, any method to estimate θ should be some function of the sample {Y_1, . . . , Y_n}, say

  θ̂ = g(Y_1, . . . , Y_n)

  Such an object is called an estimator of θ

• Note that an estimator is a function of random variables, so θ̂ is random, too

• What we report is its outcome based on the outcomes {y_1, . . . , y_n} of {Y_1, . . . , Y_n},

  θ̂ = g(y_1, . . . , y_n)

  which is called an estimate of θ

• An estimator is random. An estimate is non-random (just some number)
• For example, to estimate the population mean µ = E(Y_i), the sample mean

  Ȳ = (1/n)·∑_{i=1}^n Y_i

  is an estimator of µ. From the data {y_1, . . . , y_n} (i.e. particular outcomes of the sample), we report

  ȳ = (1/n)·∑_{i=1}^n y_i (say, ȳ = 75)

• The properties of an estimator are described by the sampling distribution of the estimator Ȳ (ȳ is a constant, so it does not have a distribution)
Unbiasedness

• The first property we focus on is the expected value E(θ̂) of an estimator

• θ̂ is an unbiased estimator for θ if

  E(θ̂) = θ

• If they are not equal, the estimator is biased and

  Bias(θ̂) = E(θ̂) − θ

• For example, Ȳ is unbiased for µ = E(Y_i) because

  E(Ȳ) = E((1/n)·∑_{i=1}^n Y_i) = (1/n)·∑_{i=1}^n E(Y_i) = (1/n)·∑_{i=1}^n µ = µ

  (a simulation sketch follows below)
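• A Python simulation sketch of unbiasedness (not part of the original slides; numpy assumed, the Uniform(0, 100) population borrows the later LLN example):

```python
import numpy as np

rng = np.random.default_rng(5)
mu, n, reps = 50.0, 10, 100_000

samples = rng.uniform(0, 100, size=(reps, n))   # E(Y_i) = 50
ybars = samples.mean(axis=1)                    # one estimate per replication

print(ybars.mean())   # average of the estimates ≈ µ = 50
```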
Sampling variance

• The second property is the sampling variance Var(θ̂) of an estimator

• If we have two unbiased estimators (say θ̂ and θ̃), we often compare them by their variances Var(θ̂) and Var(θ̃), and prefer the smaller-variance estimator. The estimator with smaller variance is called more efficient

• For example, the sampling variance of Ȳ is

  Var(Ȳ) = Var((1/n)·∑_{i=1}^n Y_i)
         = (1/n²)·Var(∑_{i=1}^n Y_i)
         = (1/n²)·∑_{i=1}^n Var(Y_i) (because the Y_i's are independent)
         = (1/n²)·∑_{i=1}^n σ² = σ²/n
C.3. Asymptotic properties of estimators
(Wooldridge, App. C.3)
Consistency
• The first asymptotic property of an estimator concerns how far the estimator is likely to be from the parameter it is supposed to be estimating as the sample size increases to infinity

• Intuitively, we want "convergence" of the estimator, say θ̂_n, to the unknown parameter, say θ, as n → ∞

• Recall: convergence of a non-random sequence, c_n → c. For example,

  c_n = 2 + 3/n → 2 as n → ∞

  or write lim_{n→∞} c_n = 2
Convergence in probability

• We want an analog of convergence for θ̂_n, which is random

• We say: a sequence of random variables W_n converges in probability to c if for any ε > 0,

  P(|W_n − c| > ε) → 0 as n → ∞

• This is denoted by

  plim(W_n) = c

  where c is called the probability limit
Consistency of estimator

• An estimator θ̂_n is consistent for the parameter θ if

  plim(θ̂_n) = θ

• It means the distribution of θ̂_n becomes more and more concentrated around θ and collapses to the constant θ in the limit

• In particular, we want consistency of the OLS estimator, plim(β̂_j) = β_j (note that β̂_j depends on the sample size n)
Law of large numbers (LLN)

• The basic tool for establishing consistency is the law of large numbers (LLN)

• LLN: Let Y_1, . . . , Y_n be independent and identically distributed random variables with mean µ = E(Y_i). Then

  plim(Ȳ_n) = µ

  i.e. the sample average converges in probability to the population mean

• In other words, Ȳ_n is a consistent estimator for µ
Simulation

• Let Y_1, . . . , Y_n be independent and

  Y_i ∼ Uniform(0, 100)

  for i = 1, . . . , n

• The population mean is E(Y_i) = 50

• Fix n. Then simulate Ȳ_n 10,000 times by computer and draw the histogram (a sketch of such a simulation follows below)
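• A Python sketch of this simulation (the original code is not shown in the slides; numpy and matplotlib are assumed):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
reps = 10_000   # number of simulated values of Ybar_n

fig, axes = plt.subplots(1, 5, figsize=(15, 3))
for ax, n in zip(axes, [1, 2, 5, 10, 100]):
    ybars = rng.uniform(0, 100, size=(reps, n)).mean(axis=1)
    ax.hist(ybars, bins=40)
    ax.set_title(f"n = {n}")
    ax.set_xlim(0, 100)   # histograms concentrate around 50 as n grows
plt.show()
```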
Histograms for Ȳ_n with n = 1, 2, 5, 10, 100

[Figures: histograms (frequency scale) of the 10,000 simulated values of Ȳ_n for n = 1, 2, 5, 10, 100. As n grows, the histogram concentrates around the population mean 50]
Intuition for LLN
• Key: look at the variance of Ȳ_n

• Let Var(Y_i) = σ². Recall that

  Var(Ȳ_n) = σ²/n → 0

  i.e. the variance of Ȳ_n shrinks to zero at the rate 1/n, so the distribution of Ȳ_n collapses
Consistency of sample moments

• We saw that the sample mean Ȳ_n is consistent for the population mean E(Y_i)

• The LLN also gives us consistency of other sample moment estimators, e.g. the sample variance

  plim( (1/(n−1))·∑_{i=1}^n (Y_i − Ȳ_n)² ) = Var(Y_i)

  and the sample covariance

  plim( (1/n)·∑_{i=1}^n (Y_i − Ȳ_n)(Z_i − Z̄_n) ) = Cov(Y_i, Z_i)
Property of plim

• Property PLIM.2: If plim(Z_n) = a and plim(W_n) = b, then

  plim(Z_n + W_n) = a + b
  plim(Z_n·W_n) = ab
  plim(Z_n/W_n) = a/b provided b ≠ 0
Asymptotic distribution

• Consistency is a desirable property of an estimator. If an estimator θ̂_n is consistent, it eventually converges to the unknown parameter θ of interest

• However, if we wish to conduct statistical inference (hypothesis testing or confidence intervals), we need more information about θ̂_n, i.e. its distribution

• Unless we impose restrictive assumptions (e.g. MLR.6), it is not easy to get the finite sample distribution of θ̂_n for a given n

• However, it is easy to get an approximate distribution for θ̂_n as n increases to infinity, under mild conditions

• Indeed, most estimators in econometrics are well approximated by a normal distribution
Asymptotic normal distribution

• We say: a sequence of random variables {Z_n} has an asymptotic standard normal distribution if for each a,

  P(Z_n ≤ a) → Φ(a) as n → ∞

  where Φ(a) is the cumulative distribution function (cdf) of the standard normal Normal(0, 1)

• In words, for each a, the cdf of Z_n evaluated at a converges to the cdf of Normal(0, 1) evaluated at a

• We often write

  Z_n ∼ᵃ Normal(0, 1)
Central limit theorem (CLT)

• The basic tool for establishing asymptotic normality is the central limit theorem (CLT)

• Let Y_1, . . . , Y_n be independent and identically distributed random variables with mean µ = E(Y_i) and variance σ² = Var(Y_i)

• Consider the sample average Ȳ_n = (1/n)·∑_{i=1}^n Y_i again

• Note: Ȳ_n itself does not have an asymptotic distribution (it collapses to µ by the LLN)
• Key: look at the standardized version of Ȳ_n

• Note that

  E(Ȳ_n) = µ
  Var(Ȳ_n) = σ²/n

  which implies that

  Z_n = (Ȳ_n − µ) / (σ/√n)

  satisfies E(Z_n) = 0 and Var(Z_n) = 1

• Therefore, the distribution of Z_n will not collapse even as n → ∞
• CLT: Let Y_1, . . . , Y_n be independent and identically distributed random variables with mean µ = E(Y_i) and variance σ² = Var(Y_i). Then

  Z_n = (Ȳ_n − µ) / (σ/√n) ∼ᵃ Normal(0, 1)

• Remarkably, regardless of the distribution of Y_i, the distribution of Z_n gets arbitrarily close to the standard normal
Simulation

• Again, let Y_1, . . . , Y_n be independent and

  Y_i ∼ Uniform(0, 100)

  for i = 1, . . . , n

• The population mean is µ = 50 and the variance is σ² = 10000/12

• Fix n. Then simulate

  Z_n = (Ȳ_n − µ) / (σ/√n)

  10,000 times by computer and draw the histogram (a sketch follows below)
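• A Python sketch of this simulation (the original code is not shown in the slides; numpy and matplotlib are assumed):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
reps = 10_000
mu, sigma = 50.0, np.sqrt(10_000 / 12)   # mean and sd of Uniform(0, 100)

fig, axes = plt.subplots(1, 5, figsize=(15, 3))
for ax, n in zip(axes, [1, 2, 5, 10, 100]):
    ybars = rng.uniform(0, 100, size=(reps, n)).mean(axis=1)
    z = (ybars - mu) / (sigma / np.sqrt(n))   # standardized sample mean
    ax.hist(z, bins=40)
    ax.set_title(f"n = {n}")   # shape approaches the standard normal
plt.show()
```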
Histograms for Z_n with n = 1, 2, 5, 10, 100

[Figures: histograms (frequency scale) of the 10,000 simulated values of Z_n for n = 1, 2, 5, 10, 100. As n grows, the histogram approaches the bell shape of the standard normal]
C.6. Hypothesis testing
(Wooldridge, App. C.6)
Hypothesis testing

• Let θ be the parameter of interest. An estimator θ̂ gives us an estimate for θ, i.e. we report some number

• E.g. θ = E(X) (population mean), θ̂ = X̄ (sample mean)

• Hypothesis testing is interested in answering a yes/no question about θ, i.e. we report yes or no

• Typical question: is some regression coefficient zero or not?
Example: Testing hypotheses about mean in normal population

• To illustrate the basic idea, consider a N(µ, σ²) population and hypothesis testing about the mean µ based on a random sample {Y_1, . . . , Y_n} (so Y_i ∼ N(µ, σ²) for all i = 1, . . . , n)

• Consider the null hypothesis

  H_0 : µ = µ_0

  where µ_0 is a value we specify (e.g. µ_0 = 0, so H_0 : µ = 0)
• To set up the yes/no question, we need to specify the alternative hypothesis. Popular examples are

  H_1 : µ > µ_0
  H_1 : µ < µ_0
  H_1 : µ ≠ µ_0

  The first and second are called one-sided alternative hypotheses. The third is called a two-sided alternative hypothesis

• Here let us consider

  H_0 : µ = µ_0 vs. H_1 : µ > µ_0

• We report: "Reject H_0 (in favor of H_1)" or "Do not reject H_0"
Idea for testing

• Intuitively, we should reject H_0 if

  ȳ is sufficiently greater than µ_0

  but how large? ȳ − µ_0 > 10, or 500, say?

• The meaning of ȳ − µ_0 = 10 (say) is case-by-case. So consider the standardized version, obtained by dividing by the standard error,

  t = (ȳ − µ_0) / se(ȳ) = (ȳ − µ_0) / (s/√n)

  where se(ȳ) = s/√n and

  s = √( (1/(n−1))·∑_{i=1}^n (y_i − ȳ)² )

• Now the meaning of t = 2 (say) is universal for any data
Find critical value

• Based on the standardized object t, a reasonable test would be:

  Reject H_0 : µ = µ_0 (in favor of H_1 : µ > µ_0) if t > c

  and do not reject H_0 if t ≤ c

• So what we have to do is find the critical value c

• To pin down c, we need some rule
Rule for critical value

• In testing, we can make two kinds of mistakes:

             | Reject   | Not reject
  H_0 true   | Type I   | correct
  H_1 true   | correct  | Type II

• Type I error probability: P(Reject ; H_0 true)

• Type II error probability: P(Accept ; H_1 true)

• Rule: find c to control the Type I error probability
• Let us find c in the current example. To compute probabilities, consider the random variable counterpart of t = (ȳ − µ_0)/(s/√n), that is

  T = (Ȳ − µ_0) / (S/√n)

• We want to find c such that

  P(Reject ; H_0 true) = P(T > c ; H_0 true) = α

  where α (called the significance level) should be specified by us. Typically α = .01, .05, or .10

• To find c, we need to know the distribution of T under H_0 : µ = µ_0. Indeed,

  T follows the t_{n−1} distribution under H_0
• Then look up the t distribution table (Table G.2). For example, if n = 29 and α = .05, the critical value is c = 1.701
Test for mean in normal population
• Hypotheses

  H_0 : µ = µ_0 vs. H_1 : µ > µ_0

• Significance level α = .05

• Test statistic & distribution under H_0

  T = (Ȳ − µ_0) / (S/√n) ∼ t_{n−1} under H_0

• Find the critical value c = 1.701 from the t_{29−1} distribution table

• Test: Reject H_0 if t > 1.701. Do not reject if t ≤ 1.701 (a computational sketch follows below)
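• A Python sketch of this test (not part of the original slides; numpy and scipy assumed, the data are simulated for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
y = rng.normal(loc=52.0, scale=10.0, size=29)   # hypothetical sample, n = 29
mu0, alpha, n = 50.0, 0.05, len(y)

s = y.std(ddof=1)                         # sample standard deviation
t = (y.mean() - mu0) / (s / np.sqrt(n))   # test statistic
c = stats.t.ppf(1 - alpha, df=n - 1)      # critical value ≈ 1.701 for df = 28

print(t, c, "Reject H0" if t > c else "Do not reject H0")
```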
Test for another one-sided alternative

• If the alternative hypothesis is

  H_1 : µ < µ_0

  we reject H_0 if

  t < −c

• c can be found in the same way by looking at the left tail of the t_{n−1} distribution. For example, if n = 29, we reject when

  t < −1.701
Test for two-sided alternative

• If the alternative hypothesis is two-sided,

  H_1 : µ ≠ µ_0

  we reject H_0 if

  |t| > c

• We should reject for both large positive and large negative values of t

• The distribution of T under H_0 remains the same, i.e. T ∼ t_{n−1}, but we have to allocate the significance level α to the left and right tails

• So if we look at the right tail, its area should be α/2
• Look up the t distribution table (Table G.2). For example, if n = 26 and α = .05, the critical value is 2.06 (the t distribution is symmetric, so the rejection region is |t| > 2.06)
Summary: Basic steps for testing
• State the null and alternative hypotheses, H_0 and H_1

• Declare the significance level α

• Find the test statistic & its distribution under H_0 (e.g. T ∼ t_{n−1} under H_0)

• Find the critical value c from a distribution table (or by software)

• State the testing procedure: Reject H_0 if... and do not reject H_0 if...

• Implement the test on the data and report the result: Reject (or do not reject) H_0 at the 100·α% significance level