Bauer College of Business · 2014-08-22
Chapter 7 Estimation
Criteria for Estimators
• Main problem in stats: Estimation of population parameters, say θ. • Recall that an estimator is a statistic. That is, any function of the observations {xi} with values in the parameter space.
• There are many estimators of θ. Question: Which is better? • Criteria for Estimators (1) Unbiasedness (2) Efficiency (3) Sufficiency (4) Consistency
Unbiasedness
Definition: Unbiasedness
An unbiased estimator, say θ̂, has an expected value equal to the value of the population parameter being estimated, say θ. That is, E[θ̂] = θ.
Examples: E[x̄] = µ, E[s²] = σ².
Efficiency

Definition: Efficiency / Mean squared error
An estimator is efficient if it estimates the parameter of interest in some best way. The notion of “best way” relies upon the choice of a loss function. The usual choice of loss function is the quadratic, ℓ(e) = e², resulting in the mean squared error (MSE) criterion of optimality:

MSE(θ̂) = E[(θ̂ − θ)²] = E[((θ̂ − E[θ̂]) + (E[θ̂] − θ))²] = Var(θ̂) + [b(θ)]²

where b(θ) = E[θ̂] − θ is the bias in θ̂. The MSE is the sum of the variance and the square of the bias => trade-off: a biased estimator can have a lower MSE than an unbiased estimator.
Note: The most efficient estimator among a group of unbiased estimators is the one with the smallest variance => BUE (best unbiased estimator).
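The bias-variance trade-off in the MSE can be checked by simulation. A sketch (not part of the slides; sample size, replication count, and seed are illustrative): for N(0, 1) data, s² (divide by n−1) is unbiased, while the ML-style variance estimator (divide by n) is biased yet attains a lower MSE.

```python
import numpy as np

# Monte Carlo comparison of two variance estimators for N(0, 1) data.
rng = np.random.default_rng(42)
n, reps = 10, 20000
samples = rng.normal(loc=0.0, scale=1.0, size=(reps, n))

s2 = samples.var(axis=1, ddof=1)      # unbiased: divide by n-1
s2_ml = samples.var(axis=1, ddof=0)   # biased:   divide by n

# true parameter is sigma^2 = 1
mse_unbiased = np.mean((s2 - 1.0) ** 2)
mse_biased = np.mean((s2_ml - 1.0) ** 2)
print(mse_unbiased, mse_biased)       # the biased estimator has lower MSE
```

Analytically, MSE(s²) = 2/(n−1) ≈ 0.222 here, while the divide-by-n estimator has MSE (2n−1)/n² = 0.19, so the simulation should show the biased estimator winning.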
Now we can compare estimators and select the “best” one.
Example: Three different estimators’ distributions (1, 2, 3, based on samples of the same size):
– 1 and 2: expected value = population parameter (unbiased)
– 3: positively biased
– Variance decreases from 1, to 2, to 3 (3 has the smallest)
– 3 can have the smallest MSE. 2 is more efficient than 1.
[Figure: the three sampling distributions plotted against the value of the estimator, centered near θ]
Relative Efficiency

It is difficult to prove that an estimator is the best among all estimators, so a relative concept is usually used.
Definition: Relative efficiency

Relative Efficiency = Variance of first estimator / Variance of second estimator

Example: Sample mean vs. sample median (normal data):
Variance of sample mean = σ²/n
Variance of sample median = πσ²/2n
Var[median]/Var[mean] = (πσ²/2n) / (σ²/n) = π/2 ≈ 1.57
The sample median is 1.57 times less efficient than the sample mean.
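The π/2 ≈ 1.57 relative efficiency can be verified by simulation. A sketch, with illustrative sample size, replication count, and seed:

```python
import numpy as np

# Check that for normal data Var(median)/Var(mean) approaches pi/2 ~ 1.57.
rng = np.random.default_rng(0)
n, reps = 501, 20000           # odd n so the median is a single order statistic
x = rng.normal(size=(reps, n))

var_mean = np.var(x.mean(axis=1))
var_median = np.var(np.median(x, axis=1))
rel_eff = var_median / var_mean
print(rel_eff)                 # close to pi/2 ~ 1.5708
```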
Asymptotic Efficiency
• We compare two sample statistics in terms of their variances. The statistic with the smallest variance is called efficient.
• When we look at asymptotic efficiency, we look at the asymptotic variances of two statistics as n grows. Note that if we compare two consistent estimators, both variances eventually go to zero.
Example: Random sampling from the normal distribution:
• The sample mean is asymptotically N[μ, σ²/n]
• The median is asymptotically N[μ, (π/2)σ²/n]
• The mean is asymptotically more efficient.
• Definition: Sufficiency A statistic is sufficient when no other statistic, which can be calculated from the same sample, provides any additional information as to the value of the parameter of interest. Equivalently, we say that conditional on the value of a sufficient statistic for a parameter, the joint probability distribution of the data does not depend on that parameter. That is, if P(X=x|T(X)=t, θ) = P(X=x|T(X)=t) we say that T is a sufficient statistic. • The sufficient statistic contains all the information needed to estimate the population parameter. It is OK to ‘get rid’ of the original data, while keeping only the value of the sufficient statistic.
Sufficiency
• Visualize sufficiency: Consider a Markov chain θ → T(X1, ..., Xn) → {X1, ..., Xn} (although in classical statistics θ is not a RV). Conditioned on the middle part of the chain, the front and back are independent.

Theorem: Let p(x; θ) be the pdf of X and q(t; θ) be the pdf of T(X). Then, T(X) is a sufficient statistic for θ if, for every x in the sample space, the ratio

p(x; θ) / q(T(x); θ)

is constant as a function of θ.

Example: Normal sufficient statistic: Let X1, X2, ..., Xn be i.i.d. N(μ, σ²), where the variance is known. The sample mean, x̄, is the sufficient statistic for μ.
Proof: Let’s start with the joint distribution function:

f(x | μ) = ∏ᵢ₌₁ⁿ (2πσ²)^(−1/2) exp[−(xᵢ − μ)²/(2σ²)] = (2πσ²)^(−n/2) exp[−Σᵢ₌₁ⁿ (xᵢ − μ)²/(2σ²)]
• Next, add and subtract the sample mean:
f(x | μ) = (2πσ²)^(−n/2) exp[−Σᵢ₌₁ⁿ (xᵢ − x̄ + x̄ − μ)²/(2σ²)]
         = (2πσ²)^(−n/2) exp{−[Σᵢ₌₁ⁿ (xᵢ − x̄)² + n(x̄ − μ)²]/(2σ²)}
• Recall that the distribution of the sample mean is
q(T(x) | μ) = (2πσ²/n)^(−1/2) exp[−n(x̄ − μ)²/(2σ²)]
• The ratio of the information in the sample to the information in the statistic becomes independent of μ
f(x | μ) / q(T(x) | μ) = (2πσ²)^(−n/2) exp{−[Σᵢ₌₁ⁿ (xᵢ − x̄)² + n(x̄ − μ)²]/(2σ²)} / {(2πσ²/n)^(−1/2) exp[−n(x̄ − μ)²/(2σ²)]}

= n^(−1/2) (2πσ²)^(−(n−1)/2) exp[−Σᵢ₌₁ⁿ (xᵢ − x̄)²/(2σ²)]
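The independence of this ratio from μ can be checked numerically. A sketch (σ² = 1 and the data values are illustrative assumptions):

```python
import numpy as np

# Numerical check that for N(mu, 1) data the ratio of the sample's joint
# density to the density of the sample mean does not depend on mu.
def joint_pdf(x, mu):
    return np.prod(np.exp(-(x - mu) ** 2 / 2) / np.sqrt(2 * np.pi))

def mean_pdf(xbar, mu, n):
    var = 1.0 / n   # Var(xbar) = sigma^2 / n with sigma^2 = 1
    return np.exp(-(xbar - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

x = np.array([4.8, 6.1, 7.3, 8.0, 9.2])
ratios = [joint_pdf(x, mu) / mean_pdf(x.mean(), mu, len(x))
          for mu in (0.0, 5.0, 7.5)]
print(ratios)   # identical up to floating-point error
```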
Theorem: Factorization Theorem
Let f(x|θ) denote the joint pdf or pmf of a sample X. A statistic T(X) is a sufficient statistic for θ if and only if there exist functions g(t|θ) and h(x) such that, for all sample points x and all parameter points θ,

f(x|θ) = g(T(x)|θ) h(x)

• Sufficient statistics are not unique. From the factorization theorem it is easy to see that (i) the identity function T(X) = X is a sufficient statistic vector and (ii) if T is a sufficient statistic for θ, then so is any 1-1 function of T. This motivates minimal sufficient statistics.
Definition: Minimal sufficiency
A sufficient statistic T(X) is called a minimal sufficient statistic if, for any other sufficient statistic T′(X), T(X) is a function of T′(X).
Consistency

Definition: Consistency
The estimator converges in probability to the population parameter being estimated as n (the sample size) becomes larger. That is, θ̂ₙ →p θ. We say that θ̂ₙ is a consistent estimator of θ.
Example: x̄ is a consistent estimator of μ (the population mean).
• Q: Does unbiasedness imply consistency? No. The first observation of {xₙ}, x₁, is an unbiased estimator of μ. That is, E[x₁] = μ. But letting n grow is not going to cause x₁ to converge in probability to μ.
Squared-Error Consistency

Definition: Squared-Error Consistency
The sequence {θ̂ₙ} is a squared-error consistent estimator of θ if limₙ→∞ E[(θ̂ₙ − θ)²] = 0. That is, θ̂ₙ →s.m. θ.
• Squared-error consistency implies that both the bias and the variance of the estimator approach zero. Thus, squared-error consistency implies consistency.
Order of a Sequence: Big O and Little o

• “Little o”, o(.): A sequence {xₙ} is o(n^δ) (order less than n^δ) if |n^(−δ) xₙ| → 0, as n → ∞.
Example: xₙ = n³ is o(n⁴), since |n^(−4) xₙ| = 1/n → 0, as n → ∞.
• “Big O”, O(.): A sequence {xₙ} is O(n^δ) (at most of order n^δ) if n^(−δ) xₙ → ψ, as n → ∞ (ψ ≠ 0, constant).
Example: f(z) = 6z⁴ − 2z³ + 5 is O(z⁴) and o(z^(4+δ)) for every δ > 0.
Special case: O(1): constant.
• Order of a sequence of RVs: the order of the variance gives the order of the sequence.
Example: What is the order of the sequence {x̄}? Var[x̄] = σ²/n, which is O(1/n), or O(n⁻¹).

Root n-Consistency
• Q: Let xₙ be a consistent estimator of θ. But how fast does xₙ converge to θ?
The sample mean, x̄, has variance σ²/n, which is O(1/n). That is, the convergence is at the rate n^(−½). This is called “root-n consistency.” Note: n^(½) x̄ has variance of O(1).
• Definition: n^δ convergence. If an estimator has an O(1/n^(2δ)) variance, then we say the estimator is n^δ-convergent.
Example: Suppose Var(xₙ) is O(1/n²). Then, xₙ is n-convergent.
The usual convergence rate is root-n. If an estimator has a faster (higher degree of) convergence, it is called super-consistent.
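The O(1/n) rate of Var(x̄) can be seen by simulation. A sketch (σ, sample sizes, and seed are illustrative): doubling n roughly halves the variance of the sample mean, while n·Var(x̄) stays at the constant σ².

```python
import numpy as np

# Var(xbar) = sigma^2/n is O(1/n); n * Var(xbar) ~ sigma^2 is O(1).
rng = np.random.default_rng(1)
reps, sigma = 40000, 2.0
vars_ = {}
for n in (50, 100, 200):
    xbar = rng.normal(0.0, sigma, size=(reps, n)).mean(axis=1)
    vars_[n] = np.var(xbar)
print(vars_)                 # ~ sigma^2/n: 0.08, 0.04, 0.02
print(100 * vars_[100])      # ~ sigma^2 = 4, i.e. O(1)
```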
Estimation • Two philosophies regarding models (assumptions) in statistics: (1) Parametric statistics. It assumes data come from a type of probability distribution and makes inferences about the parameters of the distribution. Models are parameterized before collecting the data. Example: Maximum likelihood estimation.
(2) Non-parametric statistics. It assumes no probability distribution –i.e., they are “distribution free.” Models are not imposed a priori, but determined by the data. Examples: histograms, kernel density estimation. • In general, parametric statistics makes more assumptions.
Least Squares Estimation
• Long history: Gauss (1795, 1801) used it in astronomy. • Idea: There is a functional form relating Y and k variables X. This function depends on unknown parameters, θ. The relation between Y and X is not exact. There is an error, ε. We will estimate the parameters θ by minimizing the sum of squared errors. (1) Functional form known
yi = f(xi, θ) + εi (2) Typical Assumptions
- f(x, θ) is correctly specified. For example, f(x, θ) = Xβ
- X are fixed numbers with full rank, or E(ε|X) = 0. That is, ε ⊥ X
- ε ~ iid D(0, σ²I)
• Objective function: S(xᵢ, θ) = Σᵢ εᵢ²
• We want to minimize it w.r.t. θ. That is,
minθ S(xᵢ, θ) = Σᵢ εᵢ² = Σᵢ [yᵢ − f(xᵢ, θ)]²
=> dS(xᵢ, θ)/dθ = −2 Σᵢ [yᵢ − f(xᵢ, θ)] f′(xᵢ, θ)
f.o.c. => −2 Σᵢ [yᵢ − f(xᵢ, θLS)] f′(xᵢ, θLS) = 0
Note: The f.o.c. deliver the normal equations. The solution to the normal equations, θLS, is the LS estimator. The estimator θLS is a function of the data (yᵢ, xᵢ).
Suppose we assume a linear functional form, f(x, θ) = Xβ. Using linear algebra, the objective function becomes
S(xᵢ, θ) = Σᵢ εᵢ² = ε′ε = (y − Xβ)′(y − Xβ)
The f.o.c.:
−2 Σᵢ [yᵢ − f(xᵢ, θLS)] f′(xᵢ, θLS) = −2 (y − Xb)′X = 0
where b = βOLS (Ordinary LS; ordinary = linear). Solving for b => b = (X′X)⁻¹X′y
Note: b is a (linear) function of the data (yᵢ, xᵢ).
Least Squares Estimation

The LS estimator of β when f(x, θ) = Xβ is linear is b = (X′X)⁻¹X′y
Note: b is a (linear) function of the data (yᵢ, xᵢ). Moreover,
b = (X′X)⁻¹X′y = (X′X)⁻¹X′(Xβ + ε) = β + (X′X)⁻¹X′ε
Under the typical assumptions, we can establish properties for b:
1) E[b|X] = β
2) Var[b|X] = E[(b − β)(b − β)′|X] = (X′X)⁻¹X′E[εε′|X]X(X′X)⁻¹ = σ²(X′X)⁻¹
Under the typical assumptions, Gauss established that b is BLUE.
3) If ε|X ~ iid N(0, σ²Iₙ) => b|X ~ N(β, σ²(X′X)⁻¹)
4) With some additional assumptions, we can use the CLT to get b|X →a N(β, (σ²/n)(X′X/n)⁻¹)
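The closed form b = (X′X)⁻¹X′y can be computed directly on simulated data. A sketch (the design matrix, true β = (1, 2)′, noise level, and seed are illustrative assumptions):

```python
import numpy as np

# LS estimator via the normal equations on simulated linear-model data.
rng = np.random.default_rng(7)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + 1 regressor
beta = np.array([1.0, 2.0])
y = X @ beta + rng.normal(scale=0.5, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)  # solve X'X b = X'y
print(b)                               # close to (1, 2)
```

Using `np.linalg.solve` rather than explicitly inverting X′X is the numerically stabler way to solve the normal equations.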
Maximum Likelihood Estimation
• Idea: Assume a particular distribution with unknown parameters. Maximum likelihood (ML) estimation chooses the set of parameters that maximize the likelihood of drawing a particular sample. • Consider a sample (X1, ... , Xn) which is drawn from a pdf f(X|θ) where θ are parameters. If the Xi’s are independent with pdf f(Xi|θ) the joint probability of the whole sample is:
L(X₁, ..., Xₙ | θ) = ∏ᵢ₌₁ⁿ f(Xᵢ | θ)

The function L(X|θ), also written as L(X; θ), is called the likelihood function. This function can be maximized with respect to θ to produce the maximum likelihood estimates (θ̂MLE).
• It is often convenient to work with the Log of the likelihood function. That is, ln L(X|θ) = Σi ln f(Xi| θ).
• The ML estimation approach is very general. However, if the model is not correctly specified, the estimates are sensitive to the misspecification.
Ronald Fisher (1890 – 1962)
Maximum Likelihood: Example I
Let the sample be X = {5, 6, 7, 8, 9, 10}, drawn from a N(μ, 1). The probability of each of these points, given the unknown mean μ, can be written as:

f(5 | μ) = (2π)^(−1/2) exp[−(5 − μ)²/2]
f(6 | μ) = (2π)^(−1/2) exp[−(6 − μ)²/2]
...
f(10 | μ) = (2π)^(−1/2) exp[−(10 − μ)²/2]
Assume that the sample is independent. Then, the joint pdf can be written as:

L(X | μ) = (2π)^(−6/2) exp{−[(5 − μ)² + (6 − μ)² + ... + (10 − μ)²]/2}

The value of µ that maximizes the likelihood function of the sample is defined by maxμ L(X | μ). It is easier, however, to maximize ln L(X | μ).
maxμ ln L(X | μ) = maxμ {−3 ln(2π) − [(5 − μ)² + (6 − μ)² + ... + (10 − μ)²]/2}

∂ ln L/∂μ = (5 − μ) + (6 − μ) + ... + (10 − μ) = 0
=> μ̂MLE = (5 + 6 + 7 + 8 + 9 + 10)/6 = 7.5 = x̄
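The analytic result μ̂MLE = x̄ = 7.5 can be confirmed by maximizing the log likelihood numerically. A sketch using a simple grid search (grid bounds and resolution are illustrative):

```python
import numpy as np

# Maximize ln L(mu) for the sample {5,...,10} from N(mu, 1) over a grid.
x = np.array([5.0, 6.0, 7.0, 8.0, 9.0, 10.0])

def log_lik(mu):
    return -0.5 * len(x) * np.log(2 * np.pi) - 0.5 * np.sum((x - mu) ** 2)

grid = np.linspace(5, 10, 2001)   # step 0.0025, so 7.5 lies on the grid
mu_hat = grid[np.argmax([log_lik(m) for m in grid])]
print(mu_hat, x.mean())           # both 7.5
```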
• Let’s generalize this example to an i.i.d. sample X = {X₁, X₂, ..., X_T} drawn from a N(μ, σ²). The joint pdf is:

L = ∏ᵢ₌₁ᵀ (2πσ²)^(−1/2) exp[−(Xᵢ − μ)²/(2σ²)] = (2πσ²)^(−T/2) exp[−Σᵢ₌₁ᵀ (Xᵢ − μ)²/(2σ²)]

Then, taking logs, we have:

ln L = −(T/2) ln(2π) − (T/2) ln σ² − (1/(2σ²)) Σᵢ₌₁ᵀ (Xᵢ − μ)² = −(T/2) ln(2π) − (T/2) ln σ² − (1/(2σ²)) (X − μ)′(X − μ)

We take first derivatives:

∂ ln L/∂μ = (1/σ²) Σᵢ₌₁ᵀ (Xᵢ − μ)
∂ ln L/∂σ² = −T/(2σ²) + (1/(2σ⁴)) Σᵢ₌₁ᵀ (Xᵢ − μ)²
• Then, we set the f.o.c. and jointly solve for the ML estimators:

∂ ln L/∂μ = (1/σ̂²) Σᵢ₌₁ᵀ (Xᵢ − μ̂MLE) = 0  =>  μ̂MLE = (1/T) Σᵢ₌₁ᵀ Xᵢ = x̄
∂ ln L/∂σ² = −T/(2σ̂²) + (1/(2σ̂⁴)) Σᵢ₌₁ᵀ (Xᵢ − μ̂MLE)² = 0  =>  σ̂²MLE = (1/T) Σᵢ₌₁ᵀ (Xᵢ − x̄)²

Note: The MLE of μ is the sample mean. Therefore, it is unbiased.
Note: The MLE of σ² is not s² (it divides by T, not T − 1). Therefore, it is biased!
• We will work the previous example with matrix notation. Suppose we assume:

y = Xβ + ε,  εᵢ ~ N(0, σ²),  or, in matrix form,  ε ~ N(0, σ²I_T)

where Xᵢ is a 1×k vector of exogenous numbers and β is a k×1 vector of unknown parameters. Then, the joint likelihood function becomes:

L = ∏ᵢ₌₁ᵀ (2πσ²)^(−1/2) exp[−εᵢ²/(2σ²)] = (2πσ²)^(−T/2) exp[−Σᵢ₌₁ᵀ εᵢ²/(2σ²)]

• Then, taking logs, we have the log likelihood function:

ln L = −(T/2) ln(2π) − (T/2) ln σ² − (1/(2σ²)) (y − Xβ)′(y − Xβ)
Maximum Likelihood: Example II
• We take first derivatives of the log likelihood w.r.t. β and σ²:

∂ ln L/∂β = (1/σ²) Σᵢ₌₁ᵀ xᵢ′εᵢ = (1/σ²) X′ε
∂ ln L/∂σ² = −T/(2σ²) + (1/(2σ⁴)) ε′ε

• Using the f.o.c., we jointly estimate β and σ²:

∂ ln L/∂β = (1/σ̂²) X′(y − Xβ̂MLE) = 0  =>  β̂MLE = (X′X)⁻¹X′y
∂ ln L/∂σ² = −T/(2σ̂²) + (1/(2σ̂⁴)) e′e = 0  =>  σ̂²MLE = e′e/T

where e = y − Xβ̂MLE.
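The f.o.c. solutions β̂MLE = (X′X)⁻¹X′y and σ̂²MLE = e′e/T can be checked on simulated data. A sketch (T, k, the true β, the error scale, and the seed are illustrative assumptions):

```python
import numpy as np

# ML estimates of (beta, sigma^2) in the normal linear model.
rng = np.random.default_rng(3)
T, k = 1000, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
beta = np.array([0.5, -1.0, 2.0])
y = X @ beta + rng.normal(scale=1.5, size=T)   # true sigma^2 = 2.25

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # same as OLS b
e = y - X @ beta_hat
sigma2_ml = (e @ e) / T                        # ML estimator: divide by T (biased)
s2 = (e @ e) / (T - k)                         # unbiased OLS estimator
print(beta_hat, sigma2_ml, s2)                 # sigma^2 estimates near 2.25
```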
ML: Score and Information Matrix

Definition: Score (or efficient score)

S(X; θ) = ∂ ln L(X | θ)/∂θ = Σᵢ₌₁ⁿ ∂ ln f(xᵢ | θ)/∂θ

S(X; θ) is called the score of the sample. It is the vector of partial derivatives (the gradient) with respect to the parameter θ. If we have k parameters, the score will have a k×1 dimension.

Definition: Fisher information for a single observation:

I(θ) = E[(∂ ln f(X | θ)/∂θ)²]

I(θ) is sometimes just called information. It measures the shape of ln f(X | θ).
• The concept of information can be generalized for the k-parameter case. In this case:

I(θ) = E[(∂ ln L(X | θ)/∂θ)(∂ ln L(X | θ)/∂θ′)]

This is a k×k matrix. If L is twice differentiable with respect to θ, and under certain regularity conditions, then the information may also be written as

I(θ) = −E[∂² ln L(X | θ)/∂θ∂θ′]

I(θ) is called the information matrix (the negative expected Hessian). It measures the shape of the likelihood function.
• Properties of S(X; θ):
(1) E[S(X; θ)] = 0.

Proof: Start from ∫ f(x; θ) dx = 1. Differentiating both sides w.r.t. θ:

∫ ∂f(x; θ)/∂θ dx = 0
=> ∫ [∂ ln f(x; θ)/∂θ] f(x; θ) dx = 0   (using ∂f/∂θ = f · ∂ ln f/∂θ)
=> E[S(x; θ)] = 0
(2) Var[S(X; θ)] = n I(θ)

Proof: Let’s differentiate the integral ∫ [∂ ln f(x; θ)/∂θ] f(x; θ) dx = 0 once more:

∫ [∂² ln f(x; θ)/∂θ²] f(x; θ) dx + ∫ [∂ ln f(x; θ)/∂θ] [∂f(x; θ)/∂θ] dx = 0
=> ∫ [∂² ln f(x; θ)/∂θ²] f(x; θ) dx + ∫ [∂ ln f(x; θ)/∂θ]² f(x; θ) dx = 0
=> E[(∂ ln f(x; θ)/∂θ)²] = −E[∂² ln f(x; θ)/∂θ²] = I(θ)

Since E[S] = 0 and the scores of the n i.i.d. observations are independent,

Var[S(X; θ)] = n Var[∂ ln f(x; θ)/∂θ] = n I(θ)
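Both score properties can be checked by Monte Carlo. A sketch for N(μ, 1), where the score for μ is S = Σᵢ (xᵢ − μ) and I(μ) = 1 (the parameter values, sample size, and seed are illustrative):

```python
import numpy as np

# Check E[S] = 0 and Var[S] = n * I(theta) for N(mu, 1).
rng = np.random.default_rng(11)
mu, n, reps = 2.0, 25, 40000
x = rng.normal(mu, 1.0, size=(reps, n))
S = (x - mu).sum(axis=1)   # score evaluated at the true parameter

print(S.mean())            # ~ 0
print(S.var())             # ~ n * I(mu) = 25
```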
(3) If the S(xᵢ; θ) are i.i.d. (with finite first and second moments), then we can apply the CLT to get:

Sₙ(X; θ) = Σᵢ S(xᵢ; θ) →a N(0, n I(θ))

Note: This is an important result. It will drive the distribution of ML estimators.
ML: Score and Information Matrix – Example

• Again, we assume:

y = Xβ + ε,  εᵢ ~ N(0, σ²),  ε ~ N(0, σ²I_T)

• Taking logs, we have the log likelihood function:

ln L = −(T/2) ln(2π) − (T/2) ln σ² − (1/(2σ²)) (y − Xβ)′(y − Xβ)

• The score function collects the first derivatives of ln L w.r.t. θ = (β, σ²):

∂ ln L/∂β = (1/σ²) Σᵢ₌₁ᵀ xᵢ′εᵢ = (1/σ²) X′ε
∂ ln L/∂σ² = −T/(2σ²) + (1/(2σ⁴)) ε′ε
• Then, we take second derivatives to calculate I(θ):

∂² ln L/∂β∂β′ = −(1/σ²) Σᵢ₌₁ᵀ xᵢ′xᵢ = −(1/σ²) X′X
∂² ln L/∂β∂σ² = −(1/σ⁴) Σᵢ₌₁ᵀ xᵢ′εᵢ
∂² ln L/∂σ²∂σ² = T/(2σ⁴) − (1/σ⁶) ε′ε

• Then, taking minus the expectation (using E[ε′ε] = Tσ² and E[X′ε] = 0):

I(θ) = −E[∂² ln L/∂θ∂θ′] = [ (1/σ²)X′X    0
                              0            T/(2σ⁴) ]
In deriving properties (1) and (2), we have made some implicit assumptions, which are called regularity conditions:
(i) θ lies in an open interval of the parameter space, Ω.
(ii) The 1st derivative and 2nd derivatives of f(X; θ) w.r.t. θ exist.
(iii) L(X; θ) can be differentiated w.r.t. θ under the integral sign.
(iv) E[S(X; θ)²] > 0, for all θ in Ω.
(v) T(X) L(X; θ) can be differentiated w.r.t. θ under the integral sign.
Recall: If the S(xᵢ; θ) are i.i.d. and the regularity conditions apply, then we can apply the CLT to get:

S(X; θ) →a N(0, n I(θ))
Theorem: Cramer-Rao inequality Let the random sample (X1, ... , Xn) be drawn from a pdf f(X|θ) and let T=T(X1, ... , Xn) be a statistic such that E[T]=u(θ), differentiable in θ. Let b(θ)= u(θ) - θ, the bias in T. Assume regularity conditions. Then,
Regularity conditions:
(1) θ lies in an open interval Ω of the real line.
(2) For all θ in Ω, δf(X|θ)/δθ is well defined.
(3) ∫L(X|θ)dx can be differentiated wrt. θ under the integral sign
(4) E[S(X;θ)2]>0, for all θ in Ω
(5) ∫T(X) L(X|θ)dx can be differentiated wrt. θ under the integral sign
Var(T) ≥ [u′(θ)]² / (n I(θ)) = [1 + b′(θ)]² / (n I(θ))

ML: Cramer-Rao inequality
The lower bound for Var(T) is called the Cramer-Rao (CR) lower bound.

Corollary: If T(X) is an unbiased estimator of θ, then

Var(T) ≥ (n I(θ))⁻¹

Note: This theorem establishes the superiority of the ML estimate over all others. The CR lower bound is the smallest theoretical variance. It can be shown that ML estimates achieve this bound (asymptotically); therefore, any other estimation technique can at best only equal it.
Proof: For any T(X) and S(X; θ) we have

[Cov(T, S)]² ≤ Var(T) Var(S)   (Cauchy-Schwarz inequality)

Since E[S] = 0, Cov(T, S) = E[TS].
Also, u(θ) = E[T] = ∫ T L(X; θ) dx. Differentiating both sides:

u′(θ) = ∫ T ∂L(X; θ)/∂θ dx = ∫ T [1/L ∂L(X; θ)/∂θ] L dx = ∫ T S L dx = E[TS] = Cov(T, S)

Substituting into the Cauchy-Schwarz inequality:

[u′(θ)]² ≤ Var(T) n I(θ)  =>  Var(T) ≥ [u′(θ)]²/(n I(θ))  ■
Note: For an estimator to achieve the CR lower bound, we need
[Cov(T, S)]² = Var(T) Var(S).
This is possible only if T is a linear function of S. That is,
T(X) = α(θ) S(X; θ) + β(θ)
Since E[T] = α(θ) E[S(X; θ)] + β(θ) = β(θ), we can write
S(X; θ) = ∂ ln L(X; θ)/∂θ = [T(X) − β(θ)]/α(θ)
Integrating both sides w.r.t. θ:
ln L(X; θ) = U(X) − T(X) A(θ) + B(θ)
That is, L(X; θ) = exp{Σᵢ U(Xᵢ) − A(θ) Σᵢ T(Xᵢ) + n B(θ)}
or, for a single observation, f(X; θ) = exp{U(X) − T(X) A(θ) + B(θ)}

That is, the exponential (Pitman-Koopman-Darmois) family of distributions attains the CR lower bound.
• Most of the distributions we have seen belong to this family: normal, exponential, gamma, chi-square, beta, Weibull (if the shape parameter is known), Dirichlet, Bernoulli, binomial, multinomial, Poisson, negative binomial (with known parameter r), and geometric.
• Note: The Chapman–Robbins bound is a lower bound on the variance of estimators of θ. It generalizes the Cramér–Rao bound: it is tighter and can be applied to more situations, for example, when I(θ) does not exist. However, it is usually more difficult to compute.
Cramer-Rao inequality: Multivariate Case

• When we have k parameters, the covariance matrix of the estimator T(X) has a CR lower bound given by:

Covar(T(X)) ≥ (∂u(θ)/∂θ′) I(θ)⁻¹ (∂u(θ)/∂θ′)′

Note: In matrix notation, the inequality A ≥ B means the matrix A − B is positive semidefinite. If T(X) is unbiased, then

Covar(T(X)) ≥ I(θ)⁻¹
C. R. Rao (1920, India) & Harald Cramer (1893-1985, Sweden)
Cramer-Rao inequality: Example

We want to check whether the sample mean and s² for an i.i.d. sample X = {X₁, X₂, ..., Xₙ} drawn from N(μ, σ²) achieve the CR lower bound. Recall:

I(θ) = −E[∂² ln L/∂θ∂θ′] = [ n/σ²   0
                              0      n/(2σ⁴) ]

Since the sample mean and s² are unbiased, the CR lower bound is given by Covar(T) ≥ I(θ)⁻¹. Then,

Var(X̄) ≥ σ²/n  and  Var(s²) ≥ 2σ⁴/n

We have already derived that Var(X̄) = σ²/n and Var(s²) = 2σ⁴/(n − 1). Then, the sample mean achieves its CR bound, but s² does not.
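This can be illustrated by simulation. A sketch for N(0, 1) with n = 10 (parameter values, replication count, and seed are illustrative): the simulated Var(X̄) matches its bound σ²/n, while the simulated Var(s²) ≈ 2σ⁴/(n−1) sits just above its bound 2σ⁴/n.

```python
import numpy as np

# Compare simulated variances of xbar and s2 with their CR lower bounds.
rng = np.random.default_rng(5)
mu, sigma2, n, reps = 0.0, 1.0, 10, 200000
x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))

var_xbar = np.var(x.mean(axis=1))
var_s2 = np.var(x.var(axis=1, ddof=1))
print(var_xbar, sigma2 / n)          # both ~ 0.1: bound attained
print(var_s2, 2 * sigma2**2 / n)     # ~ 0.222 vs bound 0.2: not attained
```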
Concentrated ML

• We split the parameter vector θ into two vectors:

L(θ) = L(θ₁, θ₂)

Sometimes, we can derive a formula for the ML estimate of θ₂, say:

θ₂ = g(θ₁)

If this is possible, we can write the likelihood function as

L(θ₁, θ₂) = L(θ₁, g(θ₁)) = L*(θ₁)

This is the concentrated likelihood function.
• This process is often useful as it reduces the number of parameters to be estimated.
Concentrated ML: Example

• The normal log likelihood function can be written as:

ln L(μ, σ²) = −(n/2) ln(2π) − (n/2) ln σ² − (1/(2σ²)) Σᵢ₌₁ⁿ (Xᵢ − μ)²

• This expression can be solved for the optimal choice of σ² by differentiating with respect to σ²:

∂ ln L(μ, σ²)/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σᵢ₌₁ⁿ (Xᵢ − μ)² = 0
=> σ̂²MLE(μ) = (1/n) Σᵢ₌₁ⁿ (Xᵢ − μ)²
• Substituting this result into the original log likelihood produces the concentrated log likelihood:

ln L*(μ) = −(n/2) ln(2π) − (n/2) ln[(1/n) Σᵢ₌₁ⁿ (Xᵢ − μ)²] − n/2
• Intuitively, the concentrated ML estimator of µ is the value that minimizes Σᵢ (Xᵢ − µ)², the sum of squared errors. Thus, the least squares estimate of the mean of a normal distribution is the same as the ML estimator under the assumption that the sample is i.i.d.
Properties of ML Estimators

(1) Efficiency. Under general conditions, we have that

Var(θ̂MLE) ≥ (n I(θ))⁻¹

The right-hand side is the Cramer-Rao lower bound (CR-LB). If an estimator can achieve this bound, ML will produce it.
(2) Consistency. We know that E[S(Xᵢ; θ)] = 0 and Var[S(Xᵢ; θ)] = I(θ). The consistency of ML can be shown by applying Khinchine’s LLN to S(Xᵢ; θ) and then to Sₙ(X; θ) = Σᵢ S(Xᵢ; θ). Then, do a 1st-order Taylor expansion of Sₙ(X; θ̂MLE) around θ:

Sₙ(X; θ̂MLE) = Sₙ(X; θ) + Sₙ′(X; θₙ*)(θ̂MLE − θ),  with |θₙ* − θ| ≤ |θ̂MLE − θ|

Since Sₙ(X; θ̂MLE) = 0 at the MLE, Sₙ(X; θ) and (θ̂MLE − θ) converge together to zero (i.e., in expectation).
(3) Theorem: Asymptotic Normality
Let the likelihood function be L(X₁, X₂, ..., Xₙ | θ). Under general conditions, the MLE of θ is asymptotically distributed as

θ̂MLE →a N(θ, (n I(θ))⁻¹)

Sketch of a proof. Using the CLT, we have already established Sₙ(X; θ) →a N(0, n I(θ)). Then, using a first-order Taylor expansion as before, we get

n^(−1/2) Sₙ(X; θ) = −[n⁻¹ Sₙ′(X; θₙ*)] n^(1/2) (θ̂MLE − θ)

Notice that E[Sₙ′(xᵢ; θ)] = −I(θ). Then, apply the LLN to get Sₙ′(X; θₙ*)/n →p −I(θ) (using θₙ* →p θ). Now, algebra and Slutsky’s theorem for RVs deliver the final result.
(4) Sufficiency. If a single sufficient statistic exists for θ, the MLE of θ must be a function of it. That is, θ̂MLE depends on the sample observations only through the value of a sufficient statistic.
(5) Invariance. The ML estimate is invariant under functional transformations. That is, if θ̂MLE is the MLE of θ and g(θ) is a function of θ, then g(θ̂MLE) is the MLE of g(θ).
Quasi Maximum Likelihood

• ML rests on the assumption that the errors follow a particular distribution (OLS is only ML if the errors are normal).
• Q: What happens if we make the wrong assumption? White (Econometrica, 1982) shows that, under broad assumptions about the misspecification of the error process, θ̂MLE is still a consistent estimator. The estimation is called Quasi ML.
• But the covariance matrix is no longer I(θ)⁻¹; instead, it is given by the sandwich form

Var[θ̂] = I(θ̂)⁻¹ [S(θ̂) S(θ̂)′] I(θ̂)⁻¹

• In general, Wald and LM tests remain valid when this corrected covariance matrix is used. But LR tests are invalid, since they work directly from the value of the likelihood function.
ML Estimation: Numerical Optimization

• In simple cases, like OLS, we can calculate the ML estimates from the f.o.c.’s, i.e., analytically. But in most situations we cannot.
• We resort to numerical optimization of the likelihood function.
• Think of hill climbing in parameter space. There are many algorithms to do this.
• General steps:
(1) Set an arbitrary initial set of parameters, i.e., starting values.
(2) Determine a direction of movement (for example, by dL/dθ).
(3) Determine a step length to move (for example, by d²L/dθ²).
(4) Check convergence criteria and either stop or go back to (2).
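The general steps can be sketched as a Newton-Raphson hill climb. A minimal sketch, assuming a Poisson(λ) sample (the true λ, sample size, starting value, and seed are illustrative; here the analytic MLE x̄ is known, which lets us verify the numerical answer):

```python
import numpy as np

# Newton-Raphson ML for the Poisson rate: lnL = sum(x)*ln(lam) - n*lam + const.
rng = np.random.default_rng(8)
x = rng.poisson(lam=3.0, size=500).astype(float)

lam = 1.0                                  # (1) arbitrary starting value
for _ in range(100):
    score = x.sum() / lam - len(x)         # (2) direction: dlnL/dlam
    hess = -x.sum() / lam ** 2             # (3) step length via d2lnL/dlam2
    step = -score / hess
    lam += step
    if abs(step) < 1e-10:                  # (4) convergence check
        break

print(lam, x.mean())   # Newton solution equals the analytic MLE xbar
```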
[Figure: a likelihood surface L(β₁, β₂), with the hill-climbing path rising toward the maximum at β*]
Method of Moments (MM) Estimation

• Simple idea: Suppose the first moment (the mean) is determined by the distribution f(X, θ). The observed first moment from a sample of n observations is

m₁ = (1/n) Σᵢ₌₁ⁿ xᵢ

Hence, we can retrieve the parameter θ by inverting the moment equation implied by f(X, θ):

m₁ = E[X | θ]  =>  θ̂MM solves this equation
• Example: Mean of a Poisson pdf: f(x) = exp(−λ) λˣ/x!
E[X] = λ => plim (1/n) Σᵢ xᵢ = λ. Then, the MM estimator of λ is the sample mean of X => λMM = x̄.
• Let’s complicate the MM idea: Now, suppose we have a model. This model implies certain knowledge about the moments of the distribution. Then, we invert the model to give us estimates of its unknown parameters, which match the theoretical moments for a given sample.
• Example: Mean of an Exponential pdf: f(x; λ) = λ e^(−λx)
E[X] = 1/λ => plim (1/n) Σᵢ xᵢ = 1/λ. Then, λMM = 1/x̄.
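Both MM examples amount to one line of code each. A sketch (true parameter values, sample sizes, and seed are illustrative):

```python
import numpy as np

# MM estimators: Poisson lambda_hat = xbar; Exponential lambda_hat = 1/xbar.
rng = np.random.default_rng(9)
pois = rng.poisson(lam=4.0, size=10000)
expo = rng.exponential(scale=1 / 3.0, size=10000)   # true lambda = 3

lam_pois_mm = pois.mean()
lam_expo_mm = 1.0 / expo.mean()
print(lam_pois_mm, lam_expo_mm)   # ~ 4 and ~ 3
```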
MM Estimation

• We have a model Y = h(X, θ), where θ is a vector of k parameters. Under this model, we know what some moments of the distribution should be. That is, the model provides us with k conditions (or moments), which should be met:

E[g(Y, X | θ)] = 0

• In this case, the (population) first moment of g(Y, X, θ) equals 0. Then, we approximate the k moments, i.e., E(g), with their sample measures and invert g to get an estimate of θ:

θ̂MM = g⁻¹(Y, X, 0)

θ̂MM is the Method of Moments estimator of θ.
Note: In this example we have as many moments (k) as unknown parameters (k). Thus, θ is uniquely and exactly determined.
MM Estimation: Example

We start with a model Y = Xβ + ε. In OLS estimation, we make the assumption that the X’s are orthogonal to the errors. Thus,

E(X′ε) = 0

The sample moment analogue for each xᵢ is

(1/n) Σₜ₌₁ⁿ xᵢₜ eₜ = 0,  or  (1/n) X′e = 0

And, thus,

0 = (1/n) X′e = (1/n) X′(Y − XβMM)  =>  X′Y = X′X βMM

Therefore, the method of moments estimator, βMM, solves the normal equations. That is, βMM is identical to the OLS estimator, b.
Generalized Method of Moments (GMM)

• So far, we have assumed that there are as many moment conditions (l) as unknown parameters (k). The parameters are uniquely and exactly determined.
• If l < k, i.e., fewer moment conditions than parameters, we would not be able to solve for a unique set of parameters (the model would be under-identified).
• If l > k, i.e., more moment conditions than parameters, then all the conditions cannot be met at the same time; the model is over-identified and we have GMM estimation.
If we cannot satisfy all the conditions at the same time, we want to make them all as close to zero as possible simultaneously. We have to figure out a way to weight them.
• Now, we have k parameters but l moment conditions, l > k. Thus,

E[mⱼ(θ)] = 0,  j = 1, ..., l   (l population moments)
m̄ⱼ(θ) = (1/n) Σₜ₌₁ⁿ mⱼₜ(θ) = 0,  j = 1, ..., l   (l sample moments)

• Then, we need to make all l moments as small as possible simultaneously. Let’s use a weighted least squares criterion:

Minθ q(θ) = m̄(θ)′ W m̄(θ)

That is, the weighted squared sum of the moments. The weighting matrix is the l×l matrix W. (Note that we have a quadratic form.)
• First order condition:

2 (∂m̄(θGMM)′/∂θ) W m̄(θGMM) = 0
• The GMM estimator, θGMM, solves this k×1 system of equations. There is typically no closed-form solution for θGMM; it must be obtained through numerical optimization methods.
• If plim m̄(θ) = 0, and W (not a function of θ) is a positive definite matrix, then θGMM is a consistent estimator of θ.
• The optimal W: Any such weighting matrix produces a consistent estimator of θ. We can select the most efficient one, i.e., the optimal W. The optimal W is the inverse of the covariance matrix of the moment conditions. Thus,

Optimal W = W* = [Asy. Var(m̄)]⁻¹
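An over-identified toy case makes the criterion concrete. A sketch, not from the slides: l = 2 moments (mean and second moment of an Exponential(λ)) against k = 1 parameter, with W = I and a plain grid search standing in for a numerical optimizer (true λ, sample size, grid, and seed are illustrative assumptions):

```python
import numpy as np

# Over-identified GMM for Exponential(lambda): E[X] = 1/lam, E[X^2] = 2/lam^2.
rng = np.random.default_rng(10)
x = rng.exponential(scale=1 / 2.0, size=5000)   # true lambda = 2

def q(lam):
    m = np.array([x.mean() - 1 / lam,
                  (x ** 2).mean() - 2 / lam ** 2])
    return m @ np.eye(2) @ m    # weighted squared sum of the moments, W = I

grid = np.linspace(1.0, 3.0, 4001)
lam_gmm = grid[np.argmin([q(l) for l in grid])]
print(lam_gmm)   # ~ 2
```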
Properties of the GMM estimator

(1) Consistency. If plim m̄(θ) = 0, and W (not a function of θ) is a pd matrix, then under some conditions, θGMM →p θ.
(2) Asymptotic Normality. Under some general conditions, θGMM →a N(θ, VGMM), where

VGMM = (1/n) [G′V⁻¹G]⁻¹

G is the matrix of derivatives of the moments with respect to the parameters, and V = Var(n^(1/2) m̄(θ)).

Lars Peter Hansen (1952)
Bayesian Estimation: Bayes’ Theorem

• Recall Bayes’ Theorem:

Prob(θ | X) = Prob(X | θ) Prob(θ) / Prob(X)
- P(θ): Prior probability about parameter θ.
- P(X|θ): Probability of observing the data, X, conditional on θ. This conditional probability is called the likelihood, i.e., the probability that X will be the outcome of the experiment, which depends on θ.
- P(θ|X): Posterior probability, i.e., the probability assigned to θ after X is observed.
- P(X): Marginal probability of X. This is the prior probability of witnessing the data X under all possible scenarios for θ, and it depends on the prior probabilities given to each θ.
• Example: Courtroom – Guilty vs. Non-guilty
G: Event that the defendant is guilty.
E: Event that the defendant's DNA matches DNA found at the crime scene.
The jurors, after initial questions, form a personal belief about the defendant’s guilt. This initial belief is the prior.
The jurors, after seeing the DNA evidence (event E), will update their prior beliefs. This update is the posterior.
• Example: Courtroom – Guilty vs. Non-guilty
- P(G): Juror’s personal estimate of the probability that the defendant is guilty, based on evidence other than the DNA match. (Say, .30.)
- P(E|G): Probability of seeing event E if the defendant is actually guilty. (In our case, it should be near 1.)
- P(E): E can happen in two ways: the defendant is guilty and the DNA match is correct, or the defendant is not guilty and the DNA match is incorrect (a one-in-a-million chance).
- P(G|E): Probability that the defendant is guilty given a DNA match:

Prob(G|E) = Prob(E|G) Prob(G) / Prob(E) = (1 × .3) / (1 × .3 + 10⁻⁶ × .7) = .999998
Bayesian Estimation: Viewpoints

• Implicitly, in our previous discussions about estimation (MLE), we adopted a classical viewpoint.
– We had some process generating random observations.
– This random process was a function of fixed, but unknown, parameters.
– Then, we designed procedures to estimate these unknown parameters based on observed data.
• For example, we assume a random process such as CEO compensation. This CEO compensation process can be characterized by a normal distribution.
– We can estimate the parameters of this distribution using maximum likelihood.
– The likelihood of a particular sample can be expressed as

L(X₁, X₂, ..., Xₙ | µ, σ²) = (2πσ²)^(−n/2) exp[−(1/(2σ²)) Σᵢ₌₁ⁿ (Xᵢ − µ)²]

– Our estimates of µ and σ² are then based on the values of the parameters that maximize the likelihood of drawing that sample.
• Turning the classical process around slightly, a Bayesian viewpoint starts with some kind of probability statement about the parameters (a prior). Then, the data, X, are used to update our prior beliefs (a posterior).
– First, assume that our prior beliefs about the distribution function can be expressed as a probability density function π(θ), where θ is the parameter we are interested in estimating.
– Based on a sample, i.e., the likelihood function L(X, θ), we can update our knowledge of the distribution using Bayes’ theorem:

π(θ | X) = L(X | θ) π(θ) / ∫ L(X | θ) π(θ) dθ
Thomas Bayes (1702–April 17, 1761)
Bayesian Estimation: Example

• Assume a Bernoulli setting. Our prior is that P in the Bernoulli distribution is distributed Β(α, β):

π(P) = f(P; α, β) = P^(α−1) (1 − P)^(β−1) / B(α, β)

where

B(α, β) = ∫₀¹ x^(α−1) (1 − x)^(β−1) dx = Γ(α) Γ(β) / Γ(α + β)

so that

π(P) = [Γ(α + β) / (Γ(α) Γ(β))] P^(α−1) (1 − P)^(β−1)
• Assume that we are interested in forming the posterior distribution after a single draw, X:

π(P | X) = P^X (1 − P)^(1−X) [Γ(α + β)/(Γ(α)Γ(β))] P^(α−1) (1 − P)^(β−1) / ∫₀¹ P^X (1 − P)^(1−X) [Γ(α + β)/(Γ(α)Γ(β))] P^(α−1) (1 − P)^(β−1) dP

= P^(α+X−1) (1 − P)^(β−X) / ∫₀¹ P^(α+X−1) (1 − P)^(β−X) dP
• Following the original specification of the beta function, let α* = α + X and β* = β − X + 1. Then

∫₀¹ P^(α+X−1) (1 − P)^(β−X) dP = ∫₀¹ P^(α*−1) (1 − P)^(β*−1) dP = Γ(α + X) Γ(β − X + 1) / Γ(α + β + 1)
• The posterior distribution, the distribution of P after the observation, is then

π(P | X) = [Γ(α + β + 1) / (Γ(α + X) Γ(β − X + 1))] P^(α+X−1) (1 − P)^(β−X)
• The Bayesian estimate of P is the value that minimizes a loss function. Several loss functions can be used, but we will focus on the quadratic loss function, consistent with mean squared error:

min_P̂ E[(P̂ − P)²]  =>  ∂E[(P̂ − P)²]/∂P̂ = 2 E[P̂ − P] = 0  =>  P̂ = E[P]
• Taking the expectation of the posterior distribution yields

E[P] = [Γ(α + β + 1) / (Γ(α + X) Γ(β − X + 1))] ∫₀¹ P^(α+X) (1 − P)^(β−X) dP
• As before, we solve the integral, now creating α* = α + X + 1 and β* = β − X + 1. The integral then becomes

∫₀¹ P^(α+X) (1 − P)^(β−X) dP = Γ(α*) Γ(β*) / Γ(α* + β*) = Γ(α + X + 1) Γ(β − X + 1) / Γ(α + β + 2)

so that

E[P] = [Γ(α + β + 1) Γ(α + X + 1)] / [Γ(α + β + 2) Γ(α + X)]
• This can be simplified using the fact that Γ(α + 1) = α Γ(α):

E[P] = [Γ(α + β + 1) (α + X) Γ(α + X)] / [(α + β + 1) Γ(α + β + 1) Γ(α + X)] = (α + X) / (α + β + 1)
• To make this estimation process operational, assume that we have a prior distribution with parameters α=β=1.4968 that yields a beta distribution with a mean P of 0.5 and a variance of the estimate of 0.0625.
• Extending the results to n Bernoulli trials yields

π(P | Y) = [Γ(α + β + n) / (Γ(α + Y) Γ(β + n − Y))] P^(α+Y−1) (1 − P)^(β+n−Y−1)

where Y is the sum of the individual X’s, i.e., the number of heads in the sample. The estimated value of P then becomes:

P̂ = (α + Y) / (α + β + n)
• Suppose in a first sample Y = 15 and n = 50. This yields an estimated value of P of 0.31129, which compares with the maximum likelihood estimate of 0.3000. Since the maximum likelihood estimator in this case is unbiased, the results imply that the Bayesian estimator is biased.
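The worked numbers above reduce to a two-line computation. A sketch of the Beta(1.4968, 1.4968) prior updated with Y = 15 heads in n = 50 trials:

```python
# Beta-Bernoulli posterior mean vs. the ML estimate Y/n.
alpha = beta = 1.4968   # prior parameters: mean 0.5, variance ~0.0625
Y, n = 15, 50

p_bayes = (alpha + Y) / (alpha + beta + n)   # posterior mean = Bayes estimate
p_mle = Y / n
print(round(p_bayes, 5), p_mle)              # ~0.3113 vs 0.3
```

The Bayes estimate is pulled from the sample frequency 0.30 toward the prior mean 0.50; as n grows, the prior's influence vanishes.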