
Page 1: Maximum Likelihood Estimation

January 4, 2019

Page 2: Likelihood Function and ML estimator

Suppose we have a random sample $X_1, \dots, X_n$ i.i.d. $f(x;\theta)$.

Then the joint pdf is
$$f(\mathbf{x};\theta) = \prod_{i=1}^n f(x_i;\theta)$$
with $\mathbf{x} = (x_1,\dots,x_n)^T$.

The likelihood function $L(\theta;\mathbf{x})$ is the joint pdf viewed as a function of $\theta$. It is often more convenient to work with the log-likelihood
$$l(\theta;\mathbf{x}) = \ln[L(\theta;\mathbf{x})].$$

Page 3: Maximum likelihood estimator (MLE)

The maximum likelihood estimator (MLE) is the value $\hat\theta$ which maximises $L(\theta;\mathbf{x})$.

The MLE also maximises $l(\theta;\mathbf{x})$ because $\ln(\cdot)$ is monotonic. It is usually easier to maximise $l(\theta;\mathbf{x})$, so we work with this.

Comments:

- Finding the maximum of the likelihood function is an optimization problem. For simple cases we can find closed-form expressions for $\hat\theta$; however, we often need iterative numerical optimisation procedures.
- It is useful to plot the (log-)likelihood surface to identify potential problems.

Page 4: Example

Suppose $X_1,\dots,X_n$ is a random sample from the exponential distribution with pdf
$$f(x;\theta) = \begin{cases} \theta e^{-\theta x} & x > 0,\\ 0 & \text{otherwise.} \end{cases}$$

$L(\theta;\mathbf{x}) = \theta^n e^{-\theta\sum_{i=1}^n x_i}$, so $l(\theta;\mathbf{x}) = n\ln(\theta) - \theta\sum_{i=1}^n x_i$.

$$\frac{\partial l}{\partial\theta} = \frac{n}{\theta} - \sum_{i=1}^n x_i = 0 \quad\text{gives the MLE } \hat\theta = \frac{1}{\bar X}.$$

Check: $\dfrac{\partial^2 l}{\partial\theta^2} = -\dfrac{n}{\theta^2} < 0$, so $\hat\theta$ does correspond to a maximum.

Page 5: Example

The sample data on the atmospheric pollution (due to sulphur dioxide) in 50 cities, in micrograms/m3, follow an exponential distribution.

xin <- c(6.8, 6.0, 2.4, 0.98, 3.3,  5.3, 1.2, 3.7, 4.2, 7.5,
         6.9, 5.6, 3.2, 3.7, 4.7,   3.2, 3.5, 7.0, 4.4, 3.1,
         8.8, 3.4, 3.4, 7.9, 3.7,   3.4, 9.1, 5.2, 6.7, 2.5,
         7.8, 1.7, 2.4, 6.9, 4.2,   5.1, 6.4, 8.7, 3.6, 2.7,
         3.4, 5.7, 5.38, 5.2, 7.3,  4.9, 3.9, 7.9, 2.7, 2.4)

lvexp <- function(lambda, yoss) {   # the (log-)likelihood function
  n <- length(yoss)
  sumy <- sum(yoss)
  n * log(lambda) - lambda * sumy
}
lambda <- seq(0.01, 1, length = length(xin))  # grid for the plot (start above 0 to avoid log(0))
logv <- lvexp(lambda, xin)                    # plot of the log-likelihood
plot(lambda, logv, type = "l", xlab = "lambda", ylab = "log-likelihood")
lambda[which.max(logv)]   # max of log-likelihood function at
## [1] 0.2
1 / mean(xin)             # the MLE
## [1] 0.2

Page 6: Plot of the (log-)likelihood function

[Figure: "log likelihood: exponential distribution"; the log-likelihood plotted against lambda on [0, 1], y-axis running from about -240 to -140, with the peak near lambda = 0.2.]

Page 7: Example 2

Suppose $X_1,\dots,X_n$ is a random sample from the Bernoulli distribution with pdf
$$f(x) = \begin{cases} p^x(1-p)^{1-x} & x = 0,1,\quad 0 \le p \le 1\\ 0 & \text{otherwise} \end{cases}$$

$E[X] = p$ and $\mathrm{Var}[X] = p(1-p) = pq$.

$$l(p;\mathbf{x}) = \sum_{i=1}^n x_i \ln p + \left(n - \sum_{i=1}^n x_i\right)\ln(1-p)$$

$$\frac{\partial l}{\partial p} = \frac{\sum_{i=1}^n x_i}{p} - \frac{n - \sum_{i=1}^n x_i}{1-p} = 0 \quad\text{gives the MLE } \hat p = \bar X.$$

Check: $\dfrac{\partial^2 l}{\partial p^2} < 0$, so $\hat p$ does correspond to a maximum.

Since $\sum_{i=1}^n X_i \sim \mathrm{Bin}(n,p)$, we have $E[\hat p] = p$, so $\hat p$ is unbiased.

Page 8: Plot of the likelihood function

theta <- seq(0, 1, by = 0.01)             # values for theta
y <- dbinom(3, 10, theta)                 # likelihood for s = 3 successes in n = 10 trials
y <- y / max(y)                           # rescale so the maximum is 1
plot(theta, y, type = "l", xlab = expression(theta),
     ylab = "likelihood function", main = "Likelihood for n=10 s=3")
theta[which.max(y)]                       # theta maximising the likelihood
lines(theta[which.max(y)], max(y), type = "h", lty = 1)

[Figure: "Likelihood for n=10 s=3"; the rescaled likelihood plotted against θ on [0, 1], with a vertical line marking the maximum at θ = 0.3.]

Page 9: Example

Suppose $X_1,\dots,X_n$ is a random sample from $U[0,\theta]$. The likelihood function is
$$L(\theta;\mathbf{x}) = \frac{1}{\theta^n}\,\mathbf{1}\{\max_i x_i \le \theta\}.$$
For $\theta \ge \max_i x_i$, $L(\theta;\mathbf{x}) = \frac{1}{\theta^n} > 0$ and is decreasing as $\theta$ increases, while for $\theta < \max_i x_i$, $L(\theta;\mathbf{x}) = 0$. Hence the MLE is $\hat\theta = \max_i x_i$.

Exercise

Is $\hat\theta = \max_i x_i$ unbiased? Is it asymptotically unbiased?
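A quick Monte Carlo sketch of this exercise (illustrative only; the values of theta, n, and the seed are assumptions, not from the slides):

set.seed(11)
theta <- 10; n <- 4
that <- replicate(10000, max(runif(n, 0, theta)))  # simulate the MLE max(x)
mean(that)             # noticeably below theta = 10 for small n
theta * n / (n + 1)    # theoretical mean of the maximum, 8 here

The simulation suggests $E[\hat\theta] = \frac{n}{n+1}\theta < \theta$: biased for fixed $n$, but asymptotically unbiased as $n \to \infty$.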

Page 10: Plot of likelihood for U[0, θ]

Assume $\mathbf{x} = (4,7,2,10)$, so $n = 4$ and $\max_i x_i = 10$.

[Figure: "Likelihood U[0,theta] for max(x)=10, n=4"; the likelihood plotted against theta from 0 to 30: zero for theta < 10, jumping to 1e-04 at theta = 10, and decreasing thereafter.]

Page 11: R commands for plot of likelihood of U[0, θ]

L <- function(theta, x) {
  n <- length(x)
  maxx <- max(x)
  (1 / theta^n) * ifelse(maxx > theta, 0, 1)  # zero whenever theta < max(x)
}
x <- c(4, 7, 2, 10)
theta <- seq(1, 30, by = 1)
plot(theta, L(theta, x), type = "l", xlab = "theta", ylab = "likelihood",
     main = "Likelihood U[0,theta] for max(x)=10, n=4")
lines(theta[which.max(L(theta, x))], max(L(theta, x)), type = "h", lty = 1)  # mark the maximum

Page 12: MLE and Exponential Families of Distributions

Definition (The Exponential Family of Distributions)

The r.v. $X$ belongs to the $k$-parameter exponential family of distributions iff its pdf can be written in the form
$$f(x;\theta) = \exp\left\{\sum_{j=1}^k A_j(\theta)B_j(x) + C(x) + D(\theta)\right\}$$
where

- $A_1(\theta),\dots,A_k(\theta), D(\theta)$ are functions of $\theta$ alone.
- $B_1(x),\dots,B_k(x), C(x)$ are functions of $x$ alone.
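As a quick worked instance of the definition (a standard calculation; this is the $n = 1$ case of the Binomial row in the table two slides below), the Bernoulli pdf can be put in this form:
$$p^x(1-p)^{1-x} = \exp\left\{x\ln\left(\frac{p}{1-p}\right) + \ln(1-p)\right\}, \quad x = 0,1,$$
so $A(\theta) = \ln\frac{p}{1-p}$, $B(x) = x$, $C(x) = 0$ and $D(\theta) = \ln(1-p)$, giving $k = 1$.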

Page 13: Examples

Example

Exponential ($k = 1$): $\theta e^{-\theta x} = \exp\{(-\theta)(x) + \ln\theta\}$,

i.e. $A(\theta) = -\theta$, $B(x) = x$, $C(x) = 0$ and $D(\theta) = \ln\theta$.

Normal ($k = 2$):
$$(2\pi\sigma^2)^{-\frac12}\exp\left\{-\frac{1}{2\sigma^2}(x-\mu)^2\right\} = \exp\left\{-\frac{1}{2\sigma^2}x^2 + \frac{\mu}{\sigma^2}x - \frac{\mu^2}{2\sigma^2} - \frac12\ln(2\pi\sigma^2)\right\}$$

i.e. $A_1(\theta) = -\frac{1}{2\sigma^2}$, $A_2(\theta) = \frac{\mu}{\sigma^2}$, $B_1(x) = x^2$, $B_2(x) = x$, $C(x) = 0$ and $D(\theta) = -\frac{\mu^2}{2\sigma^2} - \frac12\ln(2\pi\sigma^2)$.

Page 14: Some distributions belonging to the exponential family

| Distribution | $f(x;\theta)$ | $A(\theta)$ | $B(x)$ | $C(x)$ | $D(\theta)$ |
|---|---|---|---|---|---|
| k = 1 | | | | | |
| Binomial | $\binom{n}{x}p^x(1-p)^{n-x}$ | $\ln\frac{p}{1-p}$ | $x$ | $\ln\binom{n}{x}$ | $n\ln(1-p)$ |
| Poisson | (exercise; see next page) | | | | |
| Exponential | $\theta e^{-\theta x}$ | $-\theta$ | $x$ | $0$ | $\ln\theta$ |
| $N(0,\sigma^2)$ | $(2\pi\sigma^2)^{-1/2}e^{-x^2/2\sigma^2}$ | $-\frac{1}{2\sigma^2}$ | $x^2$ | $0$ | $-\frac12\ln(2\pi\sigma^2)$ |
| $N(\mu,1)$ | $(2\pi)^{-1/2}e^{-(x-\mu)^2/2}$ | $\mu$ | $x$ | $-\frac{x^2}{2}$ | $-\frac12\ln(2\pi)-\frac{\mu^2}{2}$ |
| Gamma (1 param) | $\frac{\theta^r x^{r-1}e^{-\theta x}}{(r-1)!}$ | $-\theta$ | $x$ | $(r-1)\ln x$ | $r\ln\theta - \ln((r-1)!)$ |
| k = 2 | | | | | |
| $N(\mu,\sigma^2)$ | $(2\pi\sigma^2)^{-1/2}e^{-(x-\mu)^2/2\sigma^2}$ | $A_1 = -\frac{1}{2\sigma^2}$, $A_2 = \frac{\mu}{\sigma^2}$ | $B_1 = x^2$, $B_2 = x$ | $0$ | $-\frac{\mu^2}{2\sigma^2} - \frac12\ln(2\pi\sigma^2)$ |
| Gamma (2 param) | (exercise; see next page) | | | | |

Table: Some members of the exponential family of distributions

Page 15: Exercise

Fill in the gaps in the table above for the following:

Poisson: $\dfrac{e^{-\theta}\theta^x}{x!}$

Gamma (two parameters): $\dfrac{\beta^\alpha x^{\alpha-1}e^{-\beta x}}{\Gamma(\alpha)}$

Page 16: Natural Parameterization

Letting $\phi_j = A_j(\theta)$, $j = 1,\dots,k$, the exponential form becomes
$$f(x;\theta) = \exp\left\{\sum_{j=1}^k \phi_j B_j(x) + C(x) + D(\phi)\right\}$$
The parameters $\phi_1,\dots,\phi_k$ are called natural or canonical parameters.

Exponential in terms of its natural parameter $\phi = -\theta$:
$$-\phi e^{\phi x}$$

Normal in terms of its natural parameters $\phi_1 = -\frac{1}{2\sigma^2}$, $\phi_2 = \frac{\mu}{\sigma^2}$:
$$\exp\left\{\phi_1 x^2 + \phi_2 x + \frac{\phi_2^2}{4\phi_1} - \frac12\ln\left(-\frac{\pi}{\phi_1}\right)\right\}$$

Page 17: MLEs of Natural Parameters

Theorem (MLEs of Natural Parameters)

Suppose $X_1,\dots,X_n$ form a random sample from a distribution which is a member of the $k$-parameter exponential family with pdf
$$f(x;\theta) = \exp\left\{\sum_{j=1}^k \phi_j B_j(x) + C(x) + D(\phi)\right\}$$
then the MLEs of $\phi_1,\dots,\phi_k$ are found by solving the equations
$$t_j = E[T_j], \quad j = 1,\dots,k,$$
where $T_j = \sum_{i=1}^n B_j(X_i)$ and $t_j = \sum_{i=1}^n B_j(x_i)$, $j = 1,\dots,k$.
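As a quick illustration (a standard calculation, consistent with the exponential example on page 4): for the exponential distribution, $B(x) = x$, so $T = \sum_{i=1}^n X_i$ and $E[T] = n/\theta$. Solving $t = E[T]$, i.e. $\sum_{i=1}^n x_i = n/\theta$, gives $\hat\theta = 1/\bar x$, agreeing with the direct maximisation.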

Page 18

Proof. The likelihood function is

$$L(\phi;\mathbf{x}) = \prod_{i=1}^n f(x_i;\phi) = \prod_{i=1}^n \exp\left\{\sum_{j=1}^k \phi_j B_j(x_i) + C(x_i) + D(\phi)\right\}$$
$$= \exp\left\{\sum_{j=1}^k \phi_j \sum_{i=1}^n B_j(x_i) + \sum_{i=1}^n C(x_i) + nD(\phi)\right\}$$
$$= \exp\left\{\sum_{j=1}^k \phi_j t_j + \sum_{i=1}^n C(x_i) + nD(\phi)\right\}$$
$$\Rightarrow l(\phi;\mathbf{x}) = \text{constant} + \sum_{j=1}^k \phi_j t_j + nD(\phi)$$

Page 19

Proof (continued). From the log-likelihood,

$$l(\phi;\mathbf{x}) = \text{constant} + \sum_{j=1}^k \phi_j t_j + nD(\phi) \;\Rightarrow\; \frac{\partial l}{\partial\phi_j} = t_j + n\frac{\partial D(\phi)}{\partial\phi_j}$$

Furthermore,
$$E\left[\frac{\partial l}{\partial\phi_j}\right] = 0, \text{ so } E[T_j] = -n\frac{\partial D(\phi)}{\partial\phi_j},$$
hence
$$\frac{\partial l}{\partial\phi_j} = t_j - E[T_j]$$
and so solving $\frac{\partial l}{\partial\phi_j} = 0$ is equivalent to solving $t_j = E[T_j]$.

Moreover, it can be shown (not here) that if these equations have a solution then it is the unique MLE (thus there is no need to check second derivatives). See Bickel and Doksum (1977), Mathematical Statistics: Basic Ideas and Selected Topics, Holden-Day, San Francisco.

Page 20: Example: N(µ,1) distribution

For the $N(\mu,1)$ distribution,
$$A(\theta) = \mu \quad\text{and}\quad B(x) = x$$
Therefore $T = \sum_{i=1}^n X_i$ and
$$E[T] = nE[X_i] = n\mu$$
Setting $t = E[T]$ gives
$$\sum_{i=1}^n x_i = n\mu$$
and solving for $\mu$ gives the MLE $\hat\mu = \bar X$.

Page 21: The Cramér-Rao Inequality and Lower Bound

Theorem (The Cramér-Rao Inequality and Lower Bound)

Suppose $X_1,\dots,X_n$ form a random sample from the distribution with pdf $f(x;\theta)$. Subject to certain regularity conditions on $f(x;\theta)$, we have that for any unbiased estimator $\hat\theta$ for $\theta$,
$$\mathrm{Var}[\hat\theta] \ge I_\theta^{-1}$$
where $I_\theta$ is the Fisher information about $\theta$,
$$I_\theta = E\left[\left(\frac{\partial\ln[L(\theta;\mathbf{x})]}{\partial\theta}\right)^2\right] = E\left[\left(\frac{\partial l}{\partial\theta}\right)^2\right].$$
$I_\theta^{-1}$ is known as the Cramér-Rao lower bound.
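To make the bound concrete (a standard calculation, not on the original slide): for the exponential sample of page 4, $\frac{\partial l}{\partial\theta} = \frac{n}{\theta} - \sum_i x_i$ with $E\left[\sum_i X_i\right] = n/\theta$, so
$$I_\theta = E\left[\left(\frac{n}{\theta} - \sum_i X_i\right)^2\right] = \mathrm{Var}\left[\sum_i X_i\right] = \frac{n}{\theta^2},$$
and any unbiased estimator of $\theta$ has variance at least $\theta^2/n$.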

Page 22

Proof.

For unbiased $\hat\theta$,
$$E[\hat\theta] = \int_{\mathbf{x}} \hat\theta\, L(\theta;\mathbf{x})\, d\mathbf{x} = \theta$$
Under regularity conditions, differentiating both sides with respect to $\theta$,
$$\int \hat\theta\, \frac{\partial L}{\partial\theta}\, d\mathbf{x} = 1$$
Now
$$\frac{\partial l}{\partial\theta} = \frac{\partial\ln L}{\partial\theta} = \frac{1}{L}\frac{\partial L}{\partial\theta} \;\Rightarrow\; \frac{\partial L}{\partial\theta} = L\frac{\partial l}{\partial\theta}.$$
Therefore
$$1 = \int \hat\theta\,\frac{\partial L}{\partial\theta}\, d\mathbf{x} = E\left[\hat\theta\,\frac{\partial l}{\partial\theta}\right].$$

Page 23

Proof (continued).

We can then prove the result using the Cauchy-Schwarz inequality. Let $U = \hat\theta$ and $V = \frac{\partial l}{\partial\theta}$. Then
$$E[V] = \int \frac{\partial l}{\partial\theta}\, L\, d\mathbf{x} = \int \frac{\partial L}{\partial\theta}\, d\mathbf{x} = \frac{\partial}{\partial\theta}\left[\int L\, d\mathbf{x}\right] = 0$$
Therefore
$$\mathrm{Cov}[U,V] = E[UV] - E[U]E[V] = E[UV] = E\left[\hat\theta\,\frac{\partial l}{\partial\theta}\right] = 1$$
Also
$$\mathrm{Var}[V] = E[V^2] = E\left[\left(\frac{\partial l}{\partial\theta}\right)^2\right] = I_\theta$$
So by the Cauchy-Schwarz inequality,
$$(\mathrm{Cov}[U,V])^2 \le \mathrm{Var}[U]\,\mathrm{Var}[V] \;\Rightarrow\; 1 \le \mathrm{Var}(\hat\theta)\, I_\theta$$

Page 24

Comments:

- The larger $I_\theta$ is, the more information we have in the data about $\theta$, hence the lower the attainable variance of $\hat\theta$.
- Regularity conditions required to exchange integration and differentiation in the proof include that the range of values of $X$ must not depend on $\theta$.
- An unbiased estimator $\hat\theta$ whose variance attains the Cramér-Rao lower bound is called efficient.

Page 25: Example

Suppose $X_1,\dots,X_n$ form a random sample from $N(\mu,\sigma^2)$ with $\sigma^2$ known.
$$L = \prod_{i=1}^n f(x_i;\mu) = \prod_{i=1}^n (2\pi\sigma^2)^{-\frac12}\exp\left\{-\frac{1}{2\sigma^2}(x_i-\mu)^2\right\}$$
$$\Rightarrow l = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2$$
$$\Rightarrow \frac{\partial l}{\partial\mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i-\mu) = \frac{n}{\sigma^2}(\bar x - \mu)$$
$$\therefore \hat\mu = \bar X \text{ is the MLE.}$$

Page 26: Example (continued)

$$I_\theta = E\left[\left(\frac{\partial l}{\partial\mu}\right)^2\right] = E\left[\frac{n^2}{\sigma^4}(\bar X-\mu)^2\right] = \frac{n^2}{\sigma^4}E\left[(\bar X-\mu)^2\right] = \frac{n^2}{\sigma^4}\mathrm{Var}[\bar X] = \frac{n^2}{\sigma^4}\cdot\frac{\sigma^2}{n} = \frac{n}{\sigma^2}$$

Thus the lower bound is $I_\theta^{-1} = \frac{\sigma^2}{n}$, which is attained by $\hat\mu = \bar X$, hence $\hat\mu$ is an efficient estimator.

$\hat\mu$ may also be referred to as a minimum variance unbiased estimator (MVUE).

Page 27: Exercise

Under the same regularity conditions as before, show that $I_\theta$ can be expressed in the more useful form
$$I_\theta = -E\left[\frac{\partial^2 l}{\partial\theta^2}\right].$$
Using this result, show that the ML estimator obtained earlier for the parameter of a Poisson distribution attains the Cramér-Rao lower bound.

Page 28: Properties of MLEs

Theorem

Suppose $\theta$ and $\phi$ represent two alternative parameterizations and that $\phi$ is a one-to-one function of $\theta$, so we can write
$$\phi = g(\theta), \quad \theta = h(\phi)$$
for appropriate $g$ and $h$. Then if $\hat\theta$ is the MLE of $\theta$, the MLE of $\phi$ is $g(\hat\theta)$.

Page 29

Proof.

Suppose the value of $\phi$ that maximises $L$ corresponds to some $\tilde\theta \neq \hat\theta$, so that
$$L(g(\tilde\theta);\mathbf{x}) > L(g(\hat\theta);\mathbf{x})$$
Taking the inverse function $h(\cdot)$ we have that
$$L(\tilde\theta;\mathbf{x}) > L(\hat\theta;\mathbf{x})$$
so $\hat\theta$ is not the MLE, a contradiction.

Page 30: Invariance of MLE

Theorem (Invariance of MLE)

Let $\hat\theta_1,\dots,\hat\theta_k$ be an MLE for $\theta_1,\dots,\theta_k$. If
$$T(\theta) = (T_1(\theta),\dots,T_r(\theta))$$
is a transformation of the parameter space $\Omega$, then
$$T(\hat\theta) = (T_1(\hat\theta),\dots,T_r(\hat\theta))$$
is an MLE of $T(\theta)$.

Page 31: Example

Consider $X_1,\dots,X_n \sim N(\mu,\sigma^2)$, with $\mu$, $\sigma^2$ both unknown. Then the log-likelihood is
$$l = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2$$
Could find $\hat\sigma^2$ from
$$\frac{\partial l}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (x_i-\mu)^2$$
or
$$\frac{\partial l}{\partial\sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^n (x_i-\mu)^2$$

Page 32: Example (continued)

Substituting $\hat\mu = \bar X$ and setting the second equation to zero:
$$\frac{1}{\sigma^3}\sum_{i=1}^n (x_i-\bar X)^2 = \frac{n}{\sigma}$$
Could solve for $\hat\sigma$ and square, but it is easier to solve for $\hat\sigma^2$ directly:
$$\hat\sigma^2 = \frac{\sum_{i=1}^n (x_i-\bar X)^2}{n}$$
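A one-line check in R (a sketch with assumed data; note that R's var() uses the divisor n - 1, not the MLE's n):

set.seed(9)
x <- rnorm(50, mean = 1, sd = 2)         # assumed example data
sum((x - mean(x))^2) / length(x)         # the MLE of sigma^2, divisor n
var(x) * (length(x) - 1) / length(x)     # var() rescaled; identical value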

Page 33: Example

By invariance, we can find MLEs for parameters of distributions in the exponential family via the natural parameterisation. E.g. for the Poisson distribution,
$$A(\theta) = \ln\theta \quad\text{and}\quad B(x) = x$$
Therefore $T = \sum_{i=1}^n X_i$ and
$$E[T] = nE[X_i] = n\theta$$
Setting $t = E[T]$ gives
$$\sum_{i=1}^n x_i = n\theta$$
and solving for $\theta$ gives the MLE $\hat\theta = \bar X$.
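A numerical check (a sketch with assumed example counts) that the Poisson log-likelihood is indeed maximised at the sample mean:

x <- c(2, 4, 3, 0, 1, 5, 2, 3)          # assumed example counts
loglik <- function(theta) sum(dpois(x, theta, log = TRUE))
optimise(loglik, interval = c(0.01, 10), maximum = TRUE)$maximum
mean(x)                                 # agrees with the numerical maximiser (2.5)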

Page 34: Lemmas

Lemma

Suppose there exists an unbiased estimator, $\tilde\theta$, which attains the Cramér-Rao bound. Suppose that the MLE $\hat\theta$ is a solution to
$$\frac{\partial l}{\partial\theta} = 0.$$
Then $\hat\theta = \tilde\theta$.

Page 35

Lemma

Under fairly weak regularity conditions, MLEs are consistent. If $\hat\theta$ is the MLE for $\theta$, then asymptotically
$$\hat\theta \sim N(\theta, I_\theta^{-1})$$
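A Monte Carlo illustration of this asymptotic normality (a sketch; theta, n, and the seed are assumed): for the exponential example, $I_\theta = n/\theta^2$, so the MLE $1/\bar X$ should have variance close to $\theta^2/n$.

set.seed(7)
theta <- 2; n <- 200
mle <- replicate(5000, 1 / mean(rexp(n, rate = theta)))
c(mean(mle), theta)       # approximately centred at theta
c(var(mle), theta^2 / n)  # variance close to I_theta^{-1} = 0.02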

Page 36: Example

For the normal distribution $N(\mu,\sigma^2)$:
- the MLE $\hat\mu$ is an efficient estimator for $\mu$;
- the MLE $\hat\sigma^2$ is asymptotically efficient for $\sigma^2$, but not efficient for finite sample sizes.

Page 37: MLE and properties of MLE for the multi-parameter case

Let $X_1,\dots,X_n$ be iid with common pdf $f(x;\boldsymbol\theta)$, where $\boldsymbol\theta \in \Omega_{\boldsymbol\theta} \subset \mathbb{R}^p$.

Likelihood function:
$$L(\boldsymbol\theta) = \prod_{i=1}^n f(x_i;\boldsymbol\theta)$$
$$l(\boldsymbol\theta) = \log L(\boldsymbol\theta) = \sum_{i=1}^n \log f(x_i;\boldsymbol\theta)$$

The maximum likelihood estimator (MLE) $\hat{\boldsymbol\theta}$ solves the vector equation
$$\frac{\partial}{\partial\boldsymbol\theta}\, l(\boldsymbol\theta) = \mathbf{0}.$$

Page 38: Properties

The following properties extend to $\hat{\boldsymbol\theta}$ from the scalar case:

Invariance: let $\eta = g(\boldsymbol\theta)$; then $\hat\eta = g(\hat{\boldsymbol\theta})$ is the MLE of $\eta$.

Page 39: Properties of MLE, multi-parameter case

Theorem (Properties of MLE, multi-parameter case)

Under a set of regularity conditions:

1. Consistency: the likelihood equation
$$\frac{\partial}{\partial\boldsymbol\theta}\, l(\boldsymbol\theta) = \mathbf{0}$$
has a solution $\hat{\boldsymbol\theta}_n$ such that $\hat{\boldsymbol\theta}_n \xrightarrow{P} \boldsymbol\theta$.

2. Asymptotic normality:
$$\sqrt{n}(\hat{\boldsymbol\theta}_n - \boldsymbol\theta) \xrightarrow{D} \mathrm{MVN}_p(\mathbf{0}, I^{-1}(\boldsymbol\theta)).$$

Page 40

Theorem (continued)

$I(\boldsymbol\theta)$ is the Fisher information matrix with entries
$$I_{ii}(\boldsymbol\theta) = \mathrm{Var}\left[\frac{\partial\log f(X;\boldsymbol\theta)}{\partial\theta_i}\right] = -E\left[\frac{\partial^2}{\partial\theta_i^2}\log f(X;\boldsymbol\theta)\right]$$
$$I_{jk}(\boldsymbol\theta) = \mathrm{Cov}\left[\frac{\partial\log f(X;\boldsymbol\theta)}{\partial\theta_j}, \frac{\partial\log f(X;\boldsymbol\theta)}{\partial\theta_k}\right] = -E\left[\frac{\partial^2}{\partial\theta_j\,\partial\theta_k}\log f(X;\boldsymbol\theta)\right]$$
for $i,j,k = 1,\dots,p$.

Page 41: Comments

Cramér-Rao bound in the multi-parameter case: let $\hat\theta_{n,j}$ be an unbiased estimator of $\theta_j$. Then it can be shown that
$$\mathrm{Var}(\hat\theta_{n,j}) \ge \frac{1}{n} I^{-1}_{jj}(\boldsymbol\theta).$$
The unbiased estimator is efficient if it attains the lower bound.

$\hat{\boldsymbol\theta}_n$ are asymptotically efficient estimators, that is, for $j = 1,\dots,p$,
$$\sqrt{n}(\hat\theta_{n,j} - \theta_j) \xrightarrow{D} N(0, I^{-1}_{jj}(\boldsymbol\theta)).$$

Page 42: Transformation

Theorem (Transformation)

Let $g$ be a transformation $g(\boldsymbol\theta) = (g_1(\boldsymbol\theta),\dots,g_k(\boldsymbol\theta))^T$ such that $1 \le k \le p$ and such that the $k \times p$ matrix of partial derivatives
$$B = \left[\frac{\partial g_i}{\partial\theta_j}\right], \quad i = 1,\dots,k;\; j = 1,\dots,p,$$
has continuous elements and does not vanish in the neighbourhood of $\boldsymbol\theta$. Let $\hat\eta = g(\hat{\boldsymbol\theta})$. Then $\hat\eta$ is the MLE of $\eta = g(\boldsymbol\theta)$ and
$$\sqrt{n}(\hat\eta - \eta) \xrightarrow{D} \mathrm{MVN}_k(\mathbf{0}, B I^{-1}(\boldsymbol\theta) B').$$

Page 43: Computation of MLEs: the Newton-Raphson Method

Let $g(\boldsymbol\theta)$ be the gradient of $l(\boldsymbol\theta;\mathbf{x})$ and let $H(\boldsymbol\theta)$ denote the matrix of second derivatives (i.e. the Hessian matrix). Suppose $\boldsymbol\theta_0$ is an initial estimate of $\boldsymbol\theta$ and $\hat{\boldsymbol\theta}$ is the MLE. Expanding $g(\hat{\boldsymbol\theta})$ about $\boldsymbol\theta_0$ using the Taylor expansion gives
$$g(\hat{\boldsymbol\theta}) = g(\boldsymbol\theta_0) + (\hat{\boldsymbol\theta} - \boldsymbol\theta_0)^T H(\boldsymbol\theta_0) + \dots$$
$$\Rightarrow \mathbf{0} = g(\boldsymbol\theta_0) + (\hat{\boldsymbol\theta} - \boldsymbol\theta_0)^T H(\boldsymbol\theta_0) + \dots$$
Therefore $\hat{\boldsymbol\theta}$ is approximated by
$$\boldsymbol\theta_1 = \boldsymbol\theta_0 - g(\boldsymbol\theta_0) H^{-1}(\boldsymbol\theta_0)$$
Begin again with the improved estimate $\boldsymbol\theta_1$ and iterate until convergence.
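A minimal R sketch of the iteration, for the two-parameter Gamma($\alpha$, $\beta$) model where no closed form exists (the data, seed, and starting rule are assumptions; the score and Hessian follow from $l = n\alpha\ln\beta + (\alpha-1)\sum\ln x_i - \beta\sum x_i - n\ln\Gamma(\alpha)$):

set.seed(1)
x <- rgamma(100, shape = 2, rate = 0.5)      # assumed example data
n <- length(x); sx <- sum(x); slx <- sum(log(x))
# Method-of-moments starting values keep N-R in its region of convergence
a <- mean(x)^2 / var(x); b <- mean(x) / var(x)
for (i in 1:50) {
  g <- c(n * log(b) + slx - n * digamma(a),  # score: dl/dalpha
         n * a / b - sx)                     # score: dl/dbeta
  H <- matrix(c(-n * trigamma(a), n / b,
                n / b, -n * a / b^2), 2, 2)  # Hessian of l
  step <- solve(H, g)                        # H^{-1} g(theta_0)
  a <- a - step[1]; b <- b - step[2]         # theta_1 = theta_0 - H^{-1} g
  if (max(abs(step)) < 1e-8) break           # iterate until convergence
}
c(alpha = a, beta = b)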

Page 44: Fisher's Method of Scoring

A simple modification of N-R in which $H(\boldsymbol\theta)$ is replaced by its expectation,
$$E[H(\boldsymbol\theta)] = -I_{\boldsymbol\theta}$$
Now (under the usual regularity conditions)
$$E\left[\left(\frac{\partial l}{\partial\theta}\right)^2\right] = -E\left[\frac{\partial^2 l}{\partial\theta^2}\right] \quad\text{and so}\quad E\left[\frac{\partial^2 l}{\partial\theta_i\,\partial\theta_j}\right] = -E\left[\frac{\partial l}{\partial\theta_i}\frac{\partial l}{\partial\theta_j}\right],$$
therefore we need only calculate the score vector of first derivatives.

Also $I_{\boldsymbol\theta} = -E[H(\boldsymbol\theta)]$ is positive definite, thus eliminating possible non-convergence problems of N-R.
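A one-parameter sketch of scoring for the exponential rate (assumed data and seed): here $I_\theta = n/\theta^2$, so the update is $\theta_{i+1} = \theta_i + I_\theta^{-1} U(\theta_i)$ with score $U(\theta) = n/\theta - \sum x_i$.

set.seed(5)
x <- rexp(100, rate = 2)              # assumed example data
n <- length(x); sx <- sum(x)
theta <- 1                            # initial estimate
for (i in 1:100) {
  score <- n / theta - sx
  step <- (theta^2 / n) * score       # I_theta^{-1} * score
  theta <- theta + step
  if (abs(step) < 1e-10) break
}
c(theta, 1 / mean(x))                 # agrees with the closed-form MLE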

Page 45: N-R and the Exponential Families of Distributions

In this case N-R and Fisher's method of scoring are equivalent. Using the natural parametrization,
$$l(\phi;\mathbf{x}) = \text{constant} + \sum_{j=1}^k \phi_j t_j + nD(\phi)$$
thus
$$\frac{\partial l}{\partial\phi_j} = t_j + n\frac{\partial D(\phi)}{\partial\phi_j} \quad\text{and}\quad \frac{\partial^2 l}{\partial\phi_i\,\partial\phi_j} = n\frac{\partial^2 D(\phi)}{\partial\phi_i\,\partial\phi_j}$$
As $D(\phi)$ does not depend on $\mathbf{x}$, $H(\phi)$ and $E[H]$ are identical.

Page 46: The EM Algorithm

Suppose the data are decomposed into observed (incomplete data) and missing (augmented data) values

$$\mathbf{x} = (\mathbf{x}_0, \mathbf{x}_m)$$
$$L(\theta|\mathbf{x}_0) = \underbrace{g(\mathbf{x}_0|\theta)}_{\text{incomplete-data likelihood}} = \int \underbrace{f(\mathbf{x}_0,\mathbf{x}_m|\theta)}_{\text{complete-data likelihood}}\, d\mathbf{x}_m$$

We would like to maximise $L(\theta|\mathbf{x}_0)$, but this may be difficult to do directly. The EM algorithm maximises $L(\theta|\mathbf{x}_0)$ by working with $f(\mathbf{x}_0,\mathbf{x}_m|\theta)$.

We have
$$\frac{f(\mathbf{x}_0,\mathbf{x}_m|\theta)}{g(\mathbf{x}_0|\theta)} = k(\mathbf{x}_m|\theta,\mathbf{x}_0), \quad\text{so}\quad g(\mathbf{x}_0|\theta) = \frac{f(\mathbf{x}_0,\mathbf{x}_m|\theta)}{k(\mathbf{x}_m|\theta,\mathbf{x}_0)}$$

Page 47: The EM Algorithm (cont.)

Taking logs,
$$\ln g(\mathbf{x}_0|\theta) = \ln f(\mathbf{x}_0,\mathbf{x}_m|\theta) - \ln k(\mathbf{x}_m|\theta,\mathbf{x}_0)$$
or
$$l(\theta|\mathbf{x}_0) = l(\theta|\mathbf{x}_0,\mathbf{x}_m) - \ln k(\mathbf{x}_m|\theta,\mathbf{x}_0)$$
Let $\theta_i$ be a temporary estimate of $\theta$; now taking expectations under $k(\mathbf{x}_m|\theta_i,\mathbf{x}_0)$ (the left-hand side does not depend on $\mathbf{x}_m$),
$$l(\theta|\mathbf{x}_0) = E[l(\theta|\mathbf{x}_0,\mathbf{x}_m)|\theta_i,\mathbf{x}_0] - E[\ln k(\mathbf{x}_m|\theta,\mathbf{x}_0)|\theta_i,\mathbf{x}_0].$$
Let $Q(\theta|\theta_i) = E[l(\theta|\mathbf{x}_0,\mathbf{x}_m)|\theta_i,\mathbf{x}_0]$ and $H(\theta|\theta_i) = E[\ln k(\mathbf{x}_m|\theta,\mathbf{x}_0)|\theta_i,\mathbf{x}_0]$. Since for any $\theta$, $H(\theta|\theta_i) \le H(\theta_i|\theta_i)$, the log-likelihood increases whenever we increase $Q$. We therefore seek to maximise $Q(\theta|\theta_i)$.

Page 48: The EM Algorithm

Let $\theta_0$ be an initial estimate of $\theta$. The algorithm iterates as follows:

E-step: calculate the expected complete-data log-likelihood,
$$Q(\theta|\theta_i) = E[l(\theta|\mathbf{x}_0,\mathbf{x}_m)|\theta_i,\mathbf{x}_0] = \int l(\theta|\mathbf{x}_0,\mathbf{x}_m)\, k(\mathbf{x}_m|\theta_i,\mathbf{x}_0)\, d\mathbf{x}_m.$$

M-step: maximise $Q(\theta|\theta_i)$ w.r.t. $\theta$ to obtain a new estimate $\theta_{i+1}$.

We then iterate through the E- and M-steps until convergence (e.g. until $|\theta_{i+1} - \theta_i|$ is small) to the incomplete-data MLE.
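A minimal EM sketch in R for a two-component normal mixture with unit variances (a classic incomplete-data example; the data, model, and starting values are assumptions, not from the slides). The E-step computes membership probabilities under $k(\mathbf{x}_m|\theta_i,\mathbf{x}_0)$; the M-step maximises $Q(\theta|\theta_i)$ in closed form.

set.seed(3)
x <- c(rnorm(150, 0), rnorm(100, 4))             # assumed example data
p <- 0.5; mu1 <- -1; mu2 <- 1                    # initial estimate theta_0
for (i in 1:500) {
  # E-step: posterior probability that each observation came from component 1
  w <- p * dnorm(x, mu1) / (p * dnorm(x, mu1) + (1 - p) * dnorm(x, mu2))
  # M-step: closed-form maximisers of Q(theta | theta_i)
  p.new <- mean(w)
  mu1.new <- sum(w * x) / sum(w)
  mu2.new <- sum((1 - w) * x) / sum(1 - w)
  if (max(abs(c(p.new - p, mu1.new - mu1, mu2.new - mu2))) < 1e-8) break
  p <- p.new; mu1 <- mu1.new; mu2 <- mu2.new
}
c(p = p, mu1 = mu1, mu2 = mu2)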
