Mathematical Statistics - Nanyang Technological University
TRANSCRIPT
Mathematical Statistics
MAS 713
Chapter 6
Previous lecture
Point estimators
- Estimation and sampling distribution
- Point estimation
- Properties of estimators
Sufficient Statistics
- Factorization Theorem
Any questions?
Mathematical Statistics (MAS713) Ariel Neufeld 2 / 53
This lecture
1 6.1 Maximum Likelihood Estimation
  6.1.1 Introduction
  6.1.3 Maximum Likelihood Principle
2 6.2 Cramér-Rao Lower Bound
  6.2.1 Introduction
  6.2.2 Examples
3 6.3 Method of Moments
4 6.4 Examples: MLE and Methods of Moments
Additional reading: Chapter 7
Mathematical Statistics (MAS713) Ariel Neufeld 3 / 53
6.1 Maximum Likelihood Estimation 6.1.1 Introduction
Intuition of MLE
A patient visits a physician and complains about the following symptoms: "I have a headache, I'm feeling weak and have no appetite."
The doctor's diagnostic options:
1 You have a brain tumor.
2 You broke your foot.
3 You have a cold.
The doctor's job is to determine the most likely illness. We'll revisit this example later.
Mathematical Statistics (MAS713) Ariel Neufeld 4 / 53
6.1 Maximum Likelihood Estimation 6.1.1 Introduction
We have seen that there are plenty of choices for an estimator θ̂ of an unknown parameter θ.
=⇒ How to choose θ̂?
One possible approach:
Given observations x1, x2, . . . , xn, choose the unknown parameter θ̂ = θ̂(x1, . . . , xn) in such a way that it maximizes the probability of the occurrence of our observed values x1, x2, . . . , xn.
=⇒ choose θ̂ such that

P(X1 = x1, . . . , Xn = xn | θ̂) = max_{θ} P(X1 = x1, . . . , Xn = xn | θ)

This is the intuition behind the Maximum Likelihood Estimator (MLE).
Mathematical Statistics (MAS713) Ariel Neufeld 5 / 53
6.1 Maximum Likelihood Estimation 6.1.3 Maximum Likelihood Principle
The Maximum Likelihood Principle
The main ingredients:
1 X: a random variable.
2 θ: the parameter to estimate (restricted to a parameter space Sθ).
3 p(X; θ) (or p(X|θ)): a statistical model (pmf or pdf).
4 X1, . . . , Xn: a random sample from X.
We want to construct good estimators for θ
Notation: Given observations x1, . . . , xn, we write
p(x|θ) = the joint probability mass function if X is discrete,
         the joint probability density function if X is continuous.
Mathematical Statistics (MAS713) Ariel Neufeld 6 / 53
6.1 Maximum Likelihood Estimation 6.1.3 Maximum Likelihood Principle
The Maximum Likelihood Principle
Definition
Let X = (X1, . . . , Xn) have joint pdf/pmf p(x; θ), where θ ∈ Sθ. The likelihood function (or simply likelihood) is defined by

Sθ ∋ θ ↦ L(θ) := L(θ; x) = p(x; θ)

Note: x is fixed and θ varies in Sθ.
The likelihood is a function of θ.
The likelihood is not a pdf/pmf (as a function of θ, for fixed x).
If the data are i.i.d., then

L(θ; x) = ∏_{i=1}^n p(xi; θ)
Mathematical Statistics (MAS713) Ariel Neufeld 7 / 53
6.1 Maximum Likelihood Estimation 6.1.3 Maximum Likelihood Principle
The Maximum Likelihood Principle
Choose θ̂ = θ̂(x) which maximizes the likelihood function, i.e.

L(θ̂; x) = max_{θ∈Sθ} L(θ; x)

By definition of the arg max, this means

θ̂(x) ∈ arg max_{θ∈Sθ} L(θ; x)
Definition of Maximum Likelihood Estimator (MLE)
Let X = (X1, . . . , Xn) be a random sample. If

θ̂(X) ∈ arg max_{θ∈Sθ} L(θ; X)

then we call θ̂(X) a Maximum Likelihood Estimator (MLE) for θ.
Note: the MLE may not be unique, or may not exist.
Remark: arg max_θ f(θ) is the set of points θ at which f(θ) attains the function's largest value.
Mathematical Statistics (MAS713) Ariel Neufeld 8 / 53
6.1 Maximum Likelihood Estimation 6.1.3 Maximum Likelihood Principle
Intuition of MLE
The data: x = "I have a headache, I'm feeling weak and have no appetite."
The (discrete) parameter space for θ:
1 You have a brain tumor.
2 You broke your foot.
3 You have a cold.
The likelihood under each parameter:
P("headache, weakness, no appetite" | θ = brain tumor) = 0.2
P("headache, weakness, no appetite" | θ = broken foot) = 0.05
P("headache, weakness, no appetite" | θ = cold) = 0.4
The ML estimate: the likelihood of having a cold is the highest, so θ̂ = cold.
Mathematical Statistics (MAS713) Ariel Neufeld 9 / 53
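The discrete arg max on this slide can be sketched in a few lines of Python (a toy illustration, not part of the lecture; the probabilities are the ones given above):

```python
# Maximum likelihood over a discrete parameter space: pick the diagnosis
# theta that maximizes P(observed symptoms | theta).
likelihood = {
    "brain tumor": 0.20,
    "broken foot": 0.05,
    "cold": 0.40,
}

theta_hat = max(likelihood, key=likelihood.get)  # arg max over diagnoses
```

With these numbers the arg max is "cold", matching the ML estimate above.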
6.1 Maximum Likelihood Estimation 6.1.3 Maximum Likelihood Principle
The Maximum Likelihood Principle
We may apply any monotone increasing function and still achieve the maximization. Very often it is more convenient to consider the logarithm of the likelihood function (the log-likelihood function)

log L(θ; x) = log p(x|θ)

Since the logarithm is a monotone function, maximizing the likelihood and maximizing the log-likelihood are equivalent; that is, θ̂ maximizes the likelihood function if and only if it also maximizes the log-likelihood function:

arg max_{θ∈Sθ} L(θ; X) = arg max_{θ∈Sθ} log L(θ; X)

or, in other words,

θ̂ ∈ arg max_{θ∈Sθ} L(θ; X) ⟺ θ̂ ∈ arg max_{θ∈Sθ} log L(θ; X)
Mathematical Statistics (MAS713) Ariel Neufeld 10 / 53
6.1 Maximum Likelihood Estimation 6.1.3 Maximum Likelihood Principle
Maximum Likelihood Estimation
[Figure: the likelihood p(x; θ) and the log-likelihood log p(x; θ) plotted against θ; both curves attain their maximum at the same θ̂.]
Mathematical Statistics (MAS713) Ariel Neufeld 11 / 53
6.1 Maximum Likelihood Estimation 6.1.3 Maximum Likelihood Principle
Example
Suppose that X is a discrete random variable with the following probability mass function:

X     :  0      1      2            3
p(X)  :  2θ/3   θ/3    2(1 − θ)/3   (1 − θ)/3

where 0 < θ < 1 is a parameter. The following 10 independent observations were taken from such a distribution:
x = (x1, . . . , x10) = (3, 0, 2, 1, 3, 2, 1, 0, 2, 1).
Find a point estimate of θ using the MLE.
Mathematical Statistics (MAS713) Ariel Neufeld 12 / 53
6.1 Maximum Likelihood Estimation 6.1.3 Maximum Likelihood Principle
Solution:
The likelihood function given the observations x = (x1, . . . , x10) = (3, 0, 2, 1, 3, 2, 1, 0, 2, 1) is given by

L(θ; x) = ∏_{i=1}^n p(xi|θ)
 = p(X = 3|θ) p(X = 0|θ) p(X = 2|θ) p(X = 1|θ) p(X = 3|θ)
   × p(X = 2|θ) p(X = 1|θ) p(X = 0|θ) p(X = 2|θ) p(X = 1|θ)
 = (2θ/3)² (θ/3)³ (2(1 − θ)/3)³ ((1 − θ)/3)².

=⇒ θ̂ ∈ arg max_{θ∈(0,1)} (2θ/3)² (θ/3)³ (2(1 − θ)/3)³ ((1 − θ)/3)²

Clearly, the likelihood function is not easy to maximize. Let's look at the log-likelihood.
Mathematical Statistics (MAS713) Ariel Neufeld 13 / 53
6.1 Maximum Likelihood Estimation 6.1.3 Maximum Likelihood Principle
The log-likelihood function given the observations x = (x1, . . . , x10) = (3, 0, 2, 1, 3, 2, 1, 0, 2, 1) is

log L(θ; x) = log ∏_{i=1}^n p(xi|θ)
 = 2( log(2/3) + log θ ) + 3( log(1/3) + log θ ) + 3( log(2/3) + log(1 − θ) ) + 2( log(1/3) + log(1 − θ) )
 = Constant + 5 log θ + 5 log(1 − θ)

Setting the derivative to 0 and solving:

d log L(θ)/dθ = 5 ( 1/θ − 1/(1 − θ) ) = 0

θ̂ = θ̂(x) = 0.5
Mathematical Statistics (MAS713) Ariel Neufeld 14 / 53
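The closed-form answer θ̂ = 0.5 can be checked numerically; the following Python sketch (an illustration, not from the slides) maximizes the log-likelihood over a grid on (0, 1):

```python
import math

# Data and pmf from the example: p(0) = 2*theta/3, p(1) = theta/3,
# p(2) = 2*(1 - theta)/3, p(3) = (1 - theta)/3.
x = [3, 0, 2, 1, 3, 2, 1, 0, 2, 1]

def pmf(k, theta):
    return [2 * theta / 3, theta / 3, 2 * (1 - theta) / 3, (1 - theta) / 3][k]

def log_lik(theta):
    return sum(math.log(pmf(k, theta)) for k in x)

# brute-force grid search over (0, 1)
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=log_lik)  # equals 0.5, matching the derivation
```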
6.1 Maximum Likelihood Estimation 6.1.3 Maximum Likelihood Principle
Example: Estimating mean and variance in a normal population
Given a random sample X = (X1, . . . , Xn) of size n where

Xi ∼ N(µ, σ) i.i.d.,

derive the Maximum Likelihood estimator for the mean and variance of a normal random variable.

Solution:

θ = (µ, σ²), Sθ = R × (0, ∞)

We need to find

(µ̂, σ̂²) ∈ arg max_{(µ,σ²)} p(x|µ, σ²)

Notation: We write φ(x|µ, σ) for the pdf of an N(µ, σ)-distributed random variable, i.e.

φ(x|µ, σ) := (1/√(2πσ²)) e^{−(x − µ)²/(2σ²)}
Mathematical Statistics (MAS713) Ariel Neufeld 15 / 53
6.1 Maximum Likelihood Estimation 6.1.3 Maximum Likelihood Principle
θ̂ := (µ̂, σ̂²) ∈ arg max_{µ,σ²} p(x|µ, σ²)
 = arg max_{µ,σ²} ∏_{i=1}^n p(xi|µ, σ²)   (by i.i.d.)
 = arg max_{µ,σ²} ∏_{i=1}^n φ(xi|µ, σ)
 = arg max_{µ,σ²} ∑_{i=1}^n log φ(xi|µ, σ)
 = arg max_{µ,σ²} ∑_{i=1}^n log[ (1/√(2πσ²)) exp( −(xi − µ)²/(2σ²) ) ]
 = arg max_{µ,σ²} [ −(n/2)( log(2π) + log(σ²) ) − ∑_{i=1}^n (xi − µ)²/(2σ²) ]
Mathematical Statistics (MAS713) Ariel Neufeld 16 / 53
6.1 Maximum Likelihood Estimation 6.1.3 Maximum Likelihood Principle
To find the maximizer, we calculate

∂/∂µ [ −(n/2)( log(2π) + log(σ²) ) − ∑_{i=1}^n (xi − µ)²/(2σ²) ] = ∑_{i=1}^n (xi − µ)/σ².

Similarly, setting v := σ² and taking the derivative yields

∂/∂σ² [ −(n/2)( log(2π) + log(σ²) ) − ∑_{i=1}^n (xi − µ)²/(2σ²) ]
 = ∂/∂v [ −(n/2)( log(2π) + log(v) ) − ∑_{i=1}^n (xi − µ)²/(2v) ]
 = −(n/2)(1/v) + (1/(2v²)) ∑_{i=1}^n (xi − µ)²
 = −(n/2)(1/σ²) + (1/(2σ⁴)) ∑_{i=1}^n (xi − µ)²
Mathematical Statistics (MAS713) Ariel Neufeld 17 / 53
6.1 Maximum Likelihood Estimation 6.1.3 Maximum Likelihood Principle
Setting both derivatives equal to 0 implies

∑_{i=1}^n (xi − µ)/σ² = 0 =⇒ µ̂ = (1/n) ∑_{i=1}^n xi = x̄n

−(n/2)(1/v) + (1/(2v²)) ∑_{i=1}^n (xi − µ̂)² = 0 =⇒ v̂ = σ̂² = (1/n) ∑_{i=1}^n (xi − µ̂)² = (1/n) ∑_{i=1}^n (xi − x̄n)²

Therefore, we obtain the estimators

µ̂ = (1/n) ∑_{i=1}^n Xi = X̄n
σ̂² = (1/n) ∑_{i=1}^n (Xi − X̄n)²

Note: Don't forget, estimators are random variables!
Mathematical Statistics (MAS713) Ariel Neufeld 18 / 53
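As a sanity check, the estimators derived above can be evaluated directly (a Python sketch; the data values are made up for illustration):

```python
# MLEs for a normal sample: mu_hat = sample mean,
# sigma2_hat = (1/n) * sum (x_i - mean)^2 (note the divisor n, not n - 1).
x = [2.1, 1.9, 2.5, 2.3, 1.7, 2.0]  # illustrative data
n = len(x)

mu_hat = sum(x) / n
sigma2_hat = sum((xi - mu_hat) ** 2 for xi in x) / n
```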
6.1 Maximum Likelihood Estimation 6.1.3 Maximum Likelihood Principle
Note:

E[µ̂] = E[ (1/n) ∑_{i=1}^n Xi ] = µ =⇒ µ̂ is unbiased

But one can show that

E[σ̂²] = E[ (1/n) ∑_{i=1}^n (Xi − X̄n)² ] = ((n − 1)/n) σ² =⇒ σ̂² is biased

Observe: in this setting, S² := (1/(n − 1)) ∑_{i=1}^n (Xi − X̄n)² is an unbiased estimator for σ².
Mathematical Statistics (MAS713) Ariel Neufeld 19 / 53
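The bias factor (n − 1)/n can be seen in a quick Monte Carlo experiment (a standard-library Python sketch; seed and parameters are arbitrary choices):

```python
import random

# Compare E[sigma2_hat] (divisor n, biased) with E[S^2] (divisor n - 1,
# unbiased) by averaging over many simulated normal samples.
random.seed(0)
mu, sigma2, n, reps = 0.0, 4.0, 5, 20000

biased_avg, unbiased_avg = 0.0, 0.0
for _ in range(reps):
    x = [random.gauss(mu, sigma2 ** 0.5) for _ in range(n)]
    xbar = sum(x) / n
    ss = sum((xi - xbar) ** 2 for xi in x)
    biased_avg += ss / n          # the MLE sigma2_hat
    unbiased_avg += ss / (n - 1)  # S^2
biased_avg /= reps
unbiased_avg /= reps
# biased_avg is close to (n - 1)/n * sigma2 = 3.2; unbiased_avg is close to 4
```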
6.1 Maximum Likelihood Estimation 6.1.3 Maximum Likelihood Principle
Some issues to consider:
1 How do we guarantee that the MLE exists?
2 How do we guarantee that the MLE is unique?
3 How do we guarantee that the calculation of the MLE is tractable?
4 Is the likelihood function convex (related to uniqueness)?
5 Boundary conditions?
6 Numerical sensitivity: in many cases the likelihood function is flat...
These are not statistical questions but mathematical ones, from functional analysis, convex analysis, ....
Mathematical Statistics (MAS713) Ariel Neufeld 20 / 53
6.1 Maximum Likelihood Estimation 6.1.3 Maximum Likelihood Principle
Cramér-Rao Bound (CRLB)
Mathematical Statistics (MAS713) Ariel Neufeld 21 / 53
6.2 Cramér-Rao Lower Bound 6.2.1 Introduction
Cramér-Rao Lower Bound (CRLB)
The Cramér-Rao Lower Bound (CRLB) sets a lower bound on the variance of any unbiased estimator. This can be extremely useful in several ways:
1 If we find an estimator that achieves the CRLB, then we know that we have found a Minimum Variance Unbiased Estimator (MVUE)!
2 The CRLB provides a benchmark against which we can compare the performance of any unbiased estimator (we know we're doing very well if our estimator is "close" to the CRLB).
3 The CRLB enables us to rule out impossible estimators: we know that it is physically impossible to find an unbiased estimator that beats the CRLB. This is useful in feasibility studies.
4 The theory behind the CRLB can tell us whether an estimator exists which achieves the bound.
Mathematical Statistics (MAS713) Ariel Neufeld 22 / 53
6.2 Cramér-Rao Lower Bound 6.2.1 Introduction
Cramér-Rao Lower Bound (CRLB)
Theorem: Cramér-Rao Lower Bound
If θ̂ is any unbiased estimator of θ based on the random sample X, then the variance of the error in the estimator is bounded below by the inverse of the Fisher information I:

E[ ‖θ̂ − θ‖² ] = Var(θ̂) ≥ I⁻¹,

where I is given by

I = −E[ d² log p(X|θ) / dθ² ].
Mathematical Statistics (MAS713) Ariel Neufeld 23 / 53
6.2 Cramér-Rao Lower Bound 6.2.1 Introduction
Cramér-Rao Bound (CRLB)
Definition: Efficient Estimator
An unbiased estimator θ̂ is called efficient if

Var(θ̂) = I⁻¹

An efficient estimator is an unbiased estimator with the minimal possible variance.

Theorem: Sufficient condition for efficiency
If θ̂ is an unbiased estimator of θ and

∂ log p(Y|θ)/∂θ = c(θ) (θ̂ − θ)

then θ̂ is an efficient estimator.
Mathematical Statistics (MAS713) Ariel Neufeld 24 / 53
6.2 Cramér-Rao Lower Bound 6.2.2 Examples
Example
Suppose that X ∼ Bin(m, p), where m is known. The pmf is given by

p(x; p) = (m choose x) p^x (1 − p)^{m−x},   x = 0, 1, . . . , m.

Find the CRLB.
Note: The range of X depends on m, but not on the unknown parameter p. Also, the sample size equals n = 1.
Mathematical Statistics (MAS713) Ariel Neufeld 25 / 53
6.2 Cramér-Rao Lower Bound 6.2.2 Examples
Solution:
The log-likelihood is given by

log p(x; p) = log (m choose x) + x log p + (m − x) log(1 − p)

The first derivative is given by

∂ log p(x; p)/∂p = x/p − (m − x)/(1 − p)

The second derivative is given by

∂² log p(x; p)/∂p² = −x/p² − (m − x)/(1 − p)²
Mathematical Statistics (MAS713) Ariel Neufeld 26 / 53
6.2 Cramér-Rao Lower Bound 6.2.2 Examples
Therefore the Fisher information I satisfies

I := −E[ −X/p² − (m − X)/(1 − p)² ] = E[X]/p² + (m − E[X])/(1 − p)²
  = mp/p² + (m − mp)/(1 − p)²
  = m/p + m/(1 − p)
  = m/(p(1 − p))

It follows that the CRLB is given by

Var(p̂) ≥ I⁻¹ = p(1 − p)/m
Mathematical Statistics (MAS713) Ariel Neufeld 27 / 53
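The closed form m/(p(1 − p)) can be verified directly from the definition of the Fisher information (a Python sketch; the values of m and p are arbitrary choices):

```python
import math

# Fisher information of Bin(m, p): I = -E[ d^2/dp^2 log p(X; p) ],
# computed by summing over the m + 1 support points.
m, p = 10, 0.3

def pmf(x):
    return math.comb(m, x) * p ** x * (1 - p) ** (m - x)

def d2_loglik(x):
    # second derivative from the slide: -x/p^2 - (m - x)/(1 - p)^2
    return -x / p ** 2 - (m - x) / (1 - p) ** 2

fisher = -sum(pmf(x) * d2_loglik(x) for x in range(m + 1))
crlb = 1 / fisher  # equals p * (1 - p) / m
```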
6.2 Cramér-Rao Lower Bound 6.2.2 Examples
Cramér-Rao Bound (CRLB)
Example
Consider n observations such that

Yk = m + Wk,   k ∈ {1, . . . , n}

where Wk ∼ N(0, σ²) i.i.d.
1 Find the MLE for m.
2 Is m̂ an efficient estimator?
Mathematical Statistics (MAS713) Ariel Neufeld 28 / 53
6.2 Cramér-Rao Lower Bound 6.2.2 Examples
Cramér-Rao Bound (CRLB)
Solution:
1) As Yk ∼ N(m, σ²) i.i.d., we know from Slide 18 that

m̂ = (∑_{i=1}^n Yi) / n = Ȳn

2) m̂ is unbiased, as E[m̂] = (1/n) ∑_{i=1}^n E[Yi] = m.
Moreover, from the calculation on Slides 16–17,

∂ log p(Y|m, σ²)/∂m = ∑_{i=1}^n (Yi − m)/σ² = (n/σ²) ( (1/n) ∑_{i=1}^n Yi − m ) = c (m̂ − m)

with c = n/σ², so m̂ is an efficient estimator.
Mathematical Statistics (MAS713) Ariel Neufeld 29 / 53
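Efficiency of the sample mean can also be checked by simulation: its variance should match the CRLB σ²/n (a standard-library Python sketch; parameters and seed are arbitrary):

```python
import random

# Simulate Y_k = m + W_k with W_k ~ N(0, sigma^2) and estimate m by the
# sample mean; its variance over many replications is close to sigma^2 / n.
random.seed(1)
m_true, sigma2, n, reps = 5.0, 2.0, 10, 20000

estimates = []
for _ in range(reps):
    y = [m_true + random.gauss(0.0, sigma2 ** 0.5) for _ in range(n)]
    estimates.append(sum(y) / n)

mean_est = sum(estimates) / reps
var_est = sum((e - mean_est) ** 2 for e in estimates) / reps
crlb = sigma2 / n  # inverse of the Fisher information n / sigma^2
```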
6.2 Cramér-Rao Lower Bound 6.2.2 Examples
Properties of MLE
Mathematical Statistics (MAS713) Ariel Neufeld 30 / 53
6.2 Cramér-Rao Lower Bound 6.2.2 Examples
The concept of MLE makes sense, but can we scientifically justify it?
Bad news: the MLE has no optimality properties for finite samples.
Good news: it has a few attractive limiting properties.
Mathematical Statistics (MAS713) Ariel Neufeld 31 / 53
6.2 Cramér-Rao Lower Bound 6.2.2 Examples
Properties of MLE
What are the criteria for a "good" estimator?
- Unbiasedness.
- Consistency.
- (Asymptotic) normality.
- (Asymptotic) efficiency.
Mathematical Statistics (MAS713) Ariel Neufeld 32 / 53
6.2 Cramér-Rao Lower Bound 6.2.2 Examples
The MLE satisfies the following 4 asymptotic properties (under some additional regularity and integrability conditions):

Consistency: the sequence of MLEs converges in probability to the value being estimated,

lim_{n→∞} P( |θ̂(n) − θ| > ε ) = 0   for all ε > 0.

Asymptotically unbiased: the MLE satisfies

lim_{n→∞} E[ θ̂(n) − θ ] = 0
Mathematical Statistics (MAS713) Ariel Neufeld 33 / 53
6.2 Cramér-Rao Lower Bound 6.2.2 Examples
Asymptotic normality: a consistent estimator is called asymptotically normal if for some σ²∞ > 0 the limiting distribution of √n (θ̂(n) − θ) is N(0, σ²∞), i.e.

√n (θ̂(n) − θ) →d N(0, σ²∞) as n → ∞

Asymptotic efficiency: moreover, we call a consistent estimator asymptotically efficient if σ²∞ = I⁻¹, meaning that

√n (θ̂(n) − θ) →d N(0, I⁻¹) as n → ∞

(Here the second argument of N(·, ·) denotes the limiting variance.)
Mathematical Statistics (MAS713) Ariel Neufeld 34 / 53
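These limiting properties can be illustrated with a Bernoulli(p) sample, where the MLE is the sample proportion and I⁻¹ = p(1 − p) (a simulation sketch, not from the slides; parameters and seed are arbitrary):

```python
import random

# For X_i ~ Bernoulli(p), the MLE is p_hat = X_bar.  Asymptotically,
# sqrt(n) * (p_hat - p) is approximately N(0, p * (1 - p)).
random.seed(2)
p, n, reps = 0.3, 400, 5000

z = []
for _ in range(reps):
    p_hat = sum(random.random() < p for _ in range(n)) / n
    z.append(n ** 0.5 * (p_hat - p))

mean_z = sum(z) / reps
var_z = sum(v * v for v in z) / reps - mean_z ** 2
# var_z is close to p * (1 - p) = 0.21, the inverse Fisher information
```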
6.2 Cramér-Rao Lower Bound 6.2.2 Examples
Method of Moments Estimator
Mathematical Statistics (MAS713) Ariel Neufeld 35 / 53
6.3 Method of Moments
Method of Moments Estimator
Facts:
- Moments give good (but not always full!) information about a distribution.
- If the distribution has bounded support, then the moments uniquely determine the law.
Idea:
=⇒ match sample moments with population moments

Theorem: Law of Large Numbers
Let X1, . . . , Xn be i.i.d. random variables with E[|X1|] < ∞, and denote the mean by µ = E[X1]. Then

(1/n) ∑_{i=1}^n Xi → µ in probability as n → ∞.
Mathematical Statistics (MAS713) Ariel Neufeld 36 / 53
6.3 Method of Moments
Method of Moments
Let X1, X2, . . . , Xn be a sample from a population with pdf or pmf p(x|θ1, θ2, . . . , θk). Let the unknown parameter θ = (θ1, θ2, . . . , θk) be k-dimensional.
The method of moments estimator is found by:
1) equating the first k sample moments to the corresponding k population moments,
2) solving the resulting system of simultaneous equations.
The k-th theoretical/population moment of this random variable is defined as

µk = E[X^k] = ∫ x^k p(x|θ1, θ2, . . . , θk) dx   if X is continuous,
µk = E[X^k] = ∑_x x^k p(x|θ1, θ2, . . . , θk)   if X is discrete.

If X1, X2, . . . , Xn are i.i.d. random variables from that distribution, the k-th sample moment is defined as

mk = (1/n) ∑_{i=1}^n Xi^k,

thus mk can be viewed as an estimator for µk. By the law of large numbers, mk → µk in probability as n → ∞.
Mathematical Statistics (MAS713) Ariel Neufeld 37 / 53
6.3 Method of Moments
Method of Moments:

E[X]   = (1/n) ∑_{i=1}^n Xi
E[X²]  = (1/n) ∑_{i=1}^n Xi²
...
E[X^k] = (1/n) ∑_{i=1}^n Xi^k

=⇒ Solve this set of k equations to find θ̂1, . . . , θ̂k.
Mathematical Statistics (MAS713) Ariel Neufeld 38 / 53
6.3 Method of Moments
Example
Suppose that X is a discrete random variable with the following probability mass function:

X     :  0      1      2            3
p(X)  :  2θ/3   θ/3    2(1 − θ)/3   (1 − θ)/3

where θ is a parameter in (0, 1). The following 10 independent observations were taken from such a distribution:
x = (x1, . . . , x10) = (3, 0, 2, 1, 3, 2, 1, 0, 2, 1).
Find a point estimate of θ using the method of moments and the MLE.
Mathematical Statistics (MAS713) Ariel Neufeld 39 / 53
6.3 Method of Moments
Solution:
We have only a single parameter to estimate =⇒ we need to calculate only the first moment.
The theoretical mean value is

E[X] = ∑_{x=0}^3 x p(x; θ) = 0 · (2θ/3) + 1 · (θ/3) + 2 · (2(1 − θ)/3) + 3 · ((1 − θ)/3) = 7/3 − 2θ

The sample mean is

x̄ = (1/n) ∑_{i=1}^n xi = (3 + 0 + 2 + 1 + 3 + 2 + 1 + 0 + 2 + 1)/10 = 1.5

We solve the single equation

7/3 − 2θ = 1.5

and find that θ̂ = 5/12.
Mathematical Statistics (MAS713) Ariel Neufeld 40 / 53
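The computation above is easy to replicate in a few lines (a Python sketch, not part of the slides):

```python
# Method of moments for the discrete example: E[X] = 7/3 - 2*theta,
# matched to the sample mean of the 10 observations.
x = [3, 0, 2, 1, 3, 2, 1, 0, 2, 1]
xbar = sum(x) / len(x)           # 1.5

theta_mom = (7 / 3 - xbar) / 2   # solve 7/3 - 2*theta = xbar
```

This gives θ̂ = 5/12 ≈ 0.4167, as derived above.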
6.3 Method of Moments
The likelihood function of X given the observations x = (x1, . . . , x10) = (3, 0, 2, 1, 3, 2, 1, 0, 2, 1) is

L(θ; x) = ∏_{i=1}^n p(xi|θ)
 = p(X = 3|θ) p(X = 0|θ) p(X = 2|θ) p(X = 1|θ) p(X = 3|θ)
   × p(X = 2|θ) p(X = 1|θ) p(X = 0|θ) p(X = 2|θ) p(X = 1|θ)
 = (2θ/3)² (θ/3)³ (2(1 − θ)/3)³ ((1 − θ)/3)².

θ̂ = arg max_{θ∈(0,1)} (2θ/3)² (θ/3)³ (2(1 − θ)/3)³ ((1 − θ)/3)²

Clearly, the likelihood function is not easy to maximize. Let's look at the log-likelihood.
Mathematical Statistics (MAS713) Ariel Neufeld 41 / 53
6.3 Method of Moments
The log-likelihood function of X given the observations x = (x1, . . . , x10) = (3, 0, 2, 1, 3, 2, 1, 0, 2, 1) is

log L(θ) = log ∏_{i=1}^n p(xi|θ)
 = 2( log(2/3) + log θ ) + 3( log(1/3) + log θ ) + 3( log(2/3) + log(1 − θ) ) + 2( log(1/3) + log(1 − θ) )
 = Constant + 5 log θ + 5 log(1 − θ)

Setting the derivative to 0 and solving:

d log L(θ)/dθ = 5 ( 1/θ − 1/(1 − θ) ) = 0

θ̂ = 0.5

(The Method of Moments yields θ̂ = 5/12, which is different from the MLE.)
Mathematical Statistics (MAS713) Ariel Neufeld 42 / 53
6.3 Method of Moments
Example
Use the Method of Moments to estimate the parameters µ and σ² for the normal density

p(x|µ, σ²) = (1/(√(2π) σ)) exp( −(x − µ)²/(2σ²) )

based on an i.i.d. random sample X1, . . . , Xn.
Mathematical Statistics (MAS713) Ariel Neufeld 43 / 53
6.3 Method of Moments
Solution:
The first and second theoretical moments for the normal distribution are

µ1 = E[X] = µ
µ2 = E[X²] = σ² + µ².

The first and second sample moments are

m1 = (1/n) ∑_{i=1}^n Xi
m2 = (1/n) ∑_{i=1}^n Xi².
Mathematical Statistics (MAS713) Ariel Neufeld 44 / 53
6.3 Method of Moments
Solving the equations

µ = (1/n) ∑_{i=1}^n Xi
σ² + µ² = (1/n) ∑_{i=1}^n Xi²,

we obtain the Method of Moments estimators

µ̂ = (1/n) ∑_{i=1}^n Xi
σ̂² = (1/n) ∑_{i=1}^n Xi² − ( (1/n) ∑_{i=1}^n Xi )² = (1/n) ∑_{i=1}^n Xi² − (X̄n)² = (1/n) ∑_{i=1}^n (Xi − X̄n)²

In this case the MLE and the MME yield the same estimators.
Mathematical Statistics (MAS713) Ariel Neufeld 45 / 53
6.3 Method of Moments
Example
Let X1, . . . , Xn be i.i.d. samples from a uniform distribution on the interval [a, b], that is,

p(x|a, b) = 1/(b − a) for a ≤ x ≤ b, and 0 otherwise.

Find the Method of Moments estimators for a and b.
Mathematical Statistics (MAS713) Ariel Neufeld 46 / 53
6.3 Method of Moments
Solution:
The first two moments are

µ1 = E[X] = ∫_a^b x · 1/(b − a) dx = (a + b)/2
µ2 = E[X²] = ∫_a^b x² · 1/(b − a) dx = (a² + ab + b²)/3.

The corresponding sample moments are

m1 = (1/n) ∑_{i=1}^n Xi
m2 = (1/n) ∑_{i=1}^n Xi²
Mathematical Statistics (MAS713) Ariel Neufeld 47 / 53
6.3 Method of Moments
We solve the equations

µ1 = m1
µ2 = m2

and obtain

â = m1 − √(3 (m2 − m1²))
b̂ = m1 + √(3 (m2 − m1²))
Mathematical Statistics (MAS713) Ariel Neufeld 48 / 53
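Applying these formulas is mechanical (a Python sketch; the sample values are illustrative, not from the slides):

```python
# Method of moments for Uniform[a, b]:
#   a_hat = m1 - sqrt(3 * (m2 - m1^2)),  b_hat = m1 + sqrt(3 * (m2 - m1^2)).
x = [0.1, 0.9, 0.4, 0.6, 0.2, 0.8]  # illustrative data
n = len(x)

m1 = sum(x) / n
m2 = sum(xi ** 2 for xi in x) / n

spread = (3 * (m2 - m1 ** 2)) ** 0.5
a_hat = m1 - spread
b_hat = m1 + spread
```

By construction (â + b̂)/2 = m1 and (â² + âb̂ + b̂²)/3 = m2, so both moment equations are satisfied.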
6.4 Examples: MLE and Methods of Moments
Example
Let X1, . . . , Xn be i.i.d. samples from a beta distribution (X ∼ β(θ, 1)) with pdf

p(x|θ) = θ x^{θ−1},   0 ≤ x ≤ 1,   0 < θ < ∞

1 Find the MLE for θ.
2 Find the Method of Moments estimator for θ.
Mathematical Statistics (MAS713) Ariel Neufeld 49 / 53
6.4 Examples: MLE and Methods of Moments
The likelihood function is given by

p(x|θ) = ∏_{i=1}^n θ xi^{θ−1} = θ^n ∏_{i=1}^n xi^{θ−1} = θ^n ( ∏_{i=1}^n xi )^{θ−1}

The derivative of the log-likelihood is given by

d/dθ log p(x|θ) = d/dθ log[ θ^n ( ∏_{i=1}^n xi )^{θ−1} ]
 = d/dθ ( n log θ + (θ − 1) ∑_{i=1}^n log(xi) )
 = n/θ + ∑_{i=1}^n log(xi)
Mathematical Statistics (MAS713) Ariel Neufeld 50 / 53
6.4 Examples: MLE and Methods of Moments
Setting the derivative equal to zero, solving for θ, and replacing xi by Xi, we obtain

θ̂ = −n / ∑_{i=1}^n log(Xi)

Is this the maximum? Let's calculate the second derivative:

d²/dθ² log p(x|θ) = d/dθ ( n/θ + ∑_{i=1}^n log(xi) ) = −n/θ² ≤ 0,

so this is indeed the MLE.
Mathematical Statistics (MAS713) Ariel Neufeld 51 / 53
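The closed-form MLE can be evaluated directly (a Python sketch with illustrative data in (0, 1), not from the slides):

```python
import math

# MLE for Beta(theta, 1), i.e. p(x | theta) = theta * x^(theta - 1):
# theta_hat = -n / sum(log x_i).
x = [0.8, 0.5, 0.9, 0.7, 0.6]  # illustrative sample
n = len(x)

theta_mle = -n / sum(math.log(xi) for xi in x)
```

The estimate is positive (the logs of values in (0, 1) are negative), and it zeroes the score n/θ + Σ log xi.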
6.4 Examples: MLE and Methods of Moments
The Method of Moments for θ:
The first moment of X ∼ β(θ, 1) is

E[X] = θ/(θ + 1)

The first sample moment is

m1 = (1/n) ∑_{i=1}^n Xi

We solve the equation

θ/(θ + 1) = (1/n) ∑_{i=1}^n Xi

which yields

θ̂ = ∑_{i=1}^n Xi / ( n − ∑_{i=1}^n Xi )
Mathematical Statistics (MAS713) Ariel Neufeld 52 / 53
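And the method-of-moments formula can be evaluated the same way (a Python sketch with illustrative data; note that it generally differs from the MLE):

```python
# Method of moments for Beta(theta, 1): E[X] = theta / (theta + 1),
# so theta_hat = sum(x_i) / (n - sum(x_i)).
x = [0.8, 0.5, 0.9, 0.7, 0.6]  # illustrative sample in (0, 1)
n = len(x)
s = sum(x)

theta_mom = s / (n - s)
fitted_mean = theta_mom / (theta_mom + 1)  # reproduces the sample mean s/n
```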
6.4 Examples: MLE and Methods of Moments
Objectives
Now you should be able to:
- Understand the likelihood principle
- Understand how to formulate the MLE procedure
- Apply the CRLB
- Understand how to formulate the Method of Moments estimation procedure
Put yourself to the test! Q7.1 p.355, Q7.2 p.355, Q7.6 p.355, Q7.8 p.355, Q7.10 p.355, Q7.15 p.355
Mathematical Statistics (MAS713) Ariel Neufeld 53 / 53