© Stanley Chan 2020. All Rights Reserved.
ECE595 / STAT598: Machine Learning I
Lecture 12: Bayesian Parameter Estimation
Spring 2020
Stanley Chan
School of Electrical and Computer Engineering
Purdue University
Overview
In linear discriminant analysis (LDA), there are generally two types of approaches:

Generative approach: Estimate the model, then define the classifier.
Discriminative approach: Directly define the classifier.
Outline

Generative Approaches
  Lecture 9: Bayesian Decision Rules
  Lecture 10: Evaluating Performance
  Lecture 11: Parameter Estimation
  Lecture 12: Bayesian Prior
  Lecture 13: Connecting Bayesian and Linear Regression

Today's Lecture
  Basic Principles: Posterior, 1D Illustration, Interpretations
  Choosing Priors: Prior for Mean, Prior for Variance, Conjugate Prior
MLE and MAP
There are two typical ways of estimating parameters.

Maximum-likelihood estimation (MLE): θ is deterministic.
Maximum-a-posteriori estimation (MAP): θ is random and has a prior distribution.
From MLE to MAP
In MLE, the parameter θ is deterministic.
What if we assume θ has a distribution? This makes θ probabilistic.
So we treat Θ as a random variable, and θ as a state of Θ.
Distribution of Θ:

    p_Θ(θ)

p_Θ(θ) is the distribution of the parameter Θ. Θ has its own mean and its own variance.
Maximum-a-Posteriori
By Bayes' Theorem again:

    p_{Θ|X}(θ|x_n) = p_{X|Θ}(x_n|θ) p_Θ(θ) / p_X(x_n).

To maximize the posterior distribution:

    θ̂ = argmax_θ p_{Θ|X}(θ|D)
       = argmax_θ ∏_{n=1}^N p_{Θ|X}(θ|x_n)
       = argmax_θ ∏_{n=1}^N p_{X|Θ}(x_n|θ) p_Θ(θ) / p_X(x_n)
       = argmin_θ −[ ∑_{n=1}^N log p_{X|Θ}(x_n|θ) + log p_Θ(θ) ]
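The last line can be checked numerically. A minimal sketch, not from the lecture (function names and numbers are illustrative): minimize the negative log-posterior over a grid, for a Gaussian likelihood with known σ and a Gaussian prior on θ.

```python
import numpy as np

def map_by_grid(x, log_lik, log_prior, grid):
    """MAP by brute force: minimize -[sum_n log p(x_n|t) + log p(t)] over a grid."""
    obj = [-(sum(log_lik(xn, t) for xn in x) + log_prior(t)) for t in grid]
    return grid[int(np.argmin(obj))]

# Gaussian likelihood with known sigma and a Gaussian prior on theta;
# normalization constants are dropped since they do not affect the argmin.
sigma, sigma0, theta0 = 1.0, 0.5, 0.0
log_lik = lambda xn, t: -(xn - t) ** 2 / (2 * sigma ** 2)
log_prior = lambda t: -(t - theta0) ** 2 / (2 * sigma0 ** 2)

x = [1.2, 0.8, 1.0]
theta_hat = map_by_grid(x, log_lik, log_prior, np.linspace(-2.0, 2.0, 4001))
```

For this Gaussian-Gaussian pair the grid minimizer should agree (up to grid resolution) with the closed-form solution derived later in the lecture.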
The Role of p_Θ(θ)

Let's look at the MAP:

    θ̂ = argmin_θ −[ ∑_{n=1}^N log p_{X|Θ}(x_n|θ) + log p_Θ(θ) ]

Special case: when

    p_Θ(θ) = δ(θ − θ0),

the delta function gives

    log p_Θ(θ) = { −∞,  if θ ≠ θ0,
                 {  0,  if θ = θ0.

The objective is therefore +∞ whenever θ ≠ θ0, so

    θ̂ = argmin_θ { +∞,                              if θ ≠ θ0,
                  { −∑_{n=1}^N log p_{X|Θ}(x_n|θ0),  if θ = θ0,
       = θ0.

No uncertainty. Absolutely sure θ = θ0.
Illustration: 1D Example
Suppose that:

    p_{X|Θ}(x|θ) = (1/√(2πσ²))  exp( −(x − θ)² / (2σ²) ),
    p_Θ(θ)       = (1/√(2πσ0²)) exp( −(θ − θ0)² / (2σ0²) ).

When N = 1, the MAP problem is simply

    θ̂ = argmax_θ p_{X|Θ}(x|θ) p_Θ(θ)
       = argmax_θ (1/√(2πσ²)) exp( −(x − θ)²/(2σ²) ) · (1/√(2πσ0²)) exp( −(θ − θ0)²/(2σ0²) )
       = argmax_θ [ −(x − θ)²/(2σ²) − (θ − θ0)²/(2σ0²) ]
Illustration: 1D Example
Taking derivatives:

    d/dθ [ −(x − θ)²/(2σ²) − (θ − θ0)²/(2σ0²) ] = 0
    ⇒ (x − θ)/σ² − (θ − θ0)/σ0² = 0
    ⇒ σ0²(x − θ) = σ²(θ − θ0)
    ⇒ σ0²x + σ²θ0 = (σ0² + σ²)θ

Therefore, the solution is

    θ̂ = (σ0²x + σ²θ0) / (σ0² + σ²).
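As a quick sanity check of this closed form (my own illustrative code, not part of the lecture), the estimate interpolates between the prior mean θ0 and the observation x:

```python
def map_1d(x, theta0, sigma, sigma0):
    """Closed-form MAP for one Gaussian observation with a Gaussian prior on the mean."""
    return (sigma0 ** 2 * x + sigma ** 2 * theta0) / (sigma0 ** 2 + sigma ** 2)

x, theta0, sigma = 2.0, 0.0, 1.0
# Equal variances: the estimate lands halfway between prior mean and observation.
balanced = map_1d(x, theta0, sigma, sigma0=1.0)
# A very confident prior (tiny sigma0) pins the estimate near theta0;
# a very diffuse prior (huge sigma0) leaves it near the observation x.
confident = map_1d(x, theta0, sigma, sigma0=1e-6)
diffuse = map_1d(x, theta0, sigma, sigma0=1e6)
```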
Interpreting the Result
Let us interpret the result

    θ̂ = (σ0²x + σ²θ0) / (σ0² + σ²).

Does it make sense?

If σ0 = 0, then θ̂ = (σ0²x + σ²θ0)/(σ0² + σ²) = θ0.
This means: no uncertainty; absolutely sure that θ = θ0. This corresponds to

    p_Θ(θ) = δ(θ − θ0).

The other extreme: if σ0 = ∞, then θ̂ = (σ0²x + σ²θ0)/(σ0² + σ²) = x.
This means: I don't trust my prior at all; use the data. This corresponds to

    p_Θ(θ) = 1/|Ω|, for all θ ∈ Ω.

Therefore, the MAP solution gives you a trade-off between the data and the prior.
When N = 2.

When N = 2, the MAP problem is

    θ̂ = argmax_θ p_{X|Θ}(x|θ) p_Θ(θ)
       = argmax_θ ( ∏_{n=1}^2 p_{X|Θ}(x_n|θ) ) p_Θ(θ)
       = argmax_θ (1/√(2πσ²))² exp( −[(x1 − θ)² + (x2 − θ)²]/(2σ²) )
                  × (1/√(2πσ0²)) exp( −(θ − θ0)²/(2σ0²) )
       = argmin_θ −log(·)
       = argmin_θ [ (x1 − θ)²/(2σ²) + (x2 − θ)²/(2σ²) + (θ − θ0)²/(2σ0²) ]
When N = 2.
Taking derivatives and setting to zero:

    d/dθ [ (x1 − θ)²/(2σ²) + (x2 − θ)²/(2σ²) + (θ − θ0)²/(2σ0²) ]
      = −(x1 − θ)/σ² − (x2 − θ)/σ² + (θ − θ0)/σ0² = 0.

Equating to zero yields

    θ̂ = ( (x1 + x2)σ0² + θ0σ² ) / ( 2σ0² + σ² ).

If σ0 = 0 (certain prior), then θ̂ = θ0.
If σ0 = ∞ (useless prior), then θ̂ = (x1 + x2)/2.
When N is Arbitrary
General N:

    θ̂ = argmax_θ [ ∏_{n=1}^N p_{X|Θ}(x_n|θ) ] p_Θ(θ)
       = argmax_θ (1/√(2πσ²))^N exp( −∑_{n=1}^N (x_n − θ)²/(2σ²) )
                  · (1/√(2πσ0²)) exp( −(θ − θ0)²/(2σ0²) )
       = argmin_θ [ ∑_{n=1}^N (x_n − θ)²/(2σ²) + (θ − θ0)²/(2σ0²) ]
       = ( σ0² ∑_{n=1}^N x_n + θ0σ² ) / ( Nσ0² + σ² ).

What does it mean?
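The general-N estimator fits in a few lines. A sketch (the sampling setup and names are my own, for illustration): with little data the estimate is pulled toward the prior mean, while with a lot of data it tracks the sample mean.

```python
import numpy as np

def map_gaussian_mean(x, theta0, sigma, sigma0):
    """MAP of a Gaussian mean with a N(theta0, sigma0^2) prior, known sigma:
    (sigma0^2 * sum(x) + theta0 * sigma^2) / (N * sigma0^2 + sigma^2)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    return (sigma0 ** 2 * x.sum() + theta0 * sigma ** 2) / (n * sigma0 ** 2 + sigma ** 2)

rng = np.random.default_rng(0)
theta_true, sigma, theta0, sigma0 = 0.8, 0.3, 0.0, 0.5

x_small = rng.normal(theta_true, sigma, size=2)
x_large = rng.normal(theta_true, sigma, size=5000)

est_small = map_gaussian_mean(x_small, theta0, sigma, sigma0)  # pulled toward theta0
est_large = map_gaussian_mean(x_large, theta0, sigma, sigma0)  # close to the sample mean
```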
Interpreting the MAP solution: N → ∞

Let's do some algebra:

    θ̂ = ( (∑_{n=1}^N x_n)σ0² + θ0σ² ) / ( Nσ0² + σ² )
       = ( (∑_{n=1}^N x_n)σ0² + θ0σ² ) / ( N(σ0² + σ²/N) )
       = ( (1/N ∑_{n=1}^N x_n)σ0² + (σ²/N)θ0 ) / ( σ0² + σ²/N ).

Fix σ0 and σ. As N → ∞,

    θ̂ = ( (1/N ∑_{n=1}^N x_n)σ0² + (σ²/N)θ0 ) / ( σ0² + σ²/N ) → (1/N) ∑_{n=1}^N x_n.

This is the maximum-likelihood estimate.
When I have a lot of samples, the prior does not really matter.
Interpreting the MAP solution: N → 0

    θ̂ = ( (1/N ∑_{n=1}^N x_n)σ0² + (σ²/N)θ0 ) / ( σ0² + σ²/N )

Fix σ0 and σ. As N → 0,

    θ̂ = ( (1/N ∑_{n=1}^N x_n)σ0² + (σ²/N)θ0 ) / ( σ0² + σ²/N ) → θ0.

This is just the prior.
When I have very few samples, I should rely on the prior.
If the prior is good, then I can do well even if I have very few samples.
Maximum-likelihood does not have the same luxury!
What Happens to the Posterior?
Consider

    p(D|μ) = (1/√(2πσ²))^N exp( −∑_{n=1}^N (x_n − μ)²/(2σ²) ),
    p(μ)   = (1/√(2πσ0²)) exp( −(μ − μ0)²/(2σ0²) ).

We can show that the posterior is

    p(μ|D) = (1/√(2πσ_N²)) exp( −(μ − μ_N)²/(2σ_N²) ),

where

    μ_N  = σ²/(Nσ0² + σ²) · μ0 + Nσ0²/(Nσ0² + σ²) · μ_ML,
    σ_N² = σ²σ0² / (σ² + Nσ0²).
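The update (μ_N, σ_N²) can be written as a small function. A sketch under the same Gaussian assumptions, with my own variable names:

```python
import numpy as np

def posterior_mean_update(x, mu0, sigma0_sq, sigma_sq):
    """Return (mu_N, sigma_N^2): Gaussian posterior over the mean, given a
    known-variance Gaussian likelihood and a N(mu0, sigma0_sq) prior."""
    x = np.asarray(x, dtype=float)
    n = x.size
    mu_ml = x.mean() if n > 0 else 0.0
    denom = n * sigma0_sq + sigma_sq
    mu_n = (sigma_sq / denom) * mu0 + (n * sigma0_sq / denom) * mu_ml
    sigma_n_sq = sigma_sq * sigma0_sq / denom
    return mu_n, sigma_n_sq

# Same setting as the illustration on the next slide: data drawn with
# mu = 0.8 and sigma^2 = 0.1; the posterior variance shrinks as N grows.
rng = np.random.default_rng(1)
x = rng.normal(0.8, np.sqrt(0.1), size=100)
mu_n, var_n = posterior_mean_update(x, mu0=0.0, sigma0_sq=1.0, sigma_sq=0.1)
```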
What Happens to the Posterior?
[Figure] The x_n are generated from μ = 0.8 and σ² = 0.1. As N increases, the posterior shifts towards the true distribution.
Outline

Generative Approaches
  Lecture 9: Bayesian Decision Rules
  Lecture 10: Evaluating Performance
  Lecture 11: Parameter Estimation
  Lecture 12: Bayesian Prior
  Lecture 13: Connecting Bayesian and Linear Regression

Today's Lecture
  Basic Principles: Posterior, 1D Illustration, Interpretations
  Choosing Priors: Prior for Mean, Prior for Variance, Conjugate Prior
Prior for μ

So far we have considered

    p(D|μ) = (1/√(2πσ²))^N exp( −∑_{n=1}^N (x_n − μ)²/(2σ²) ),
    p(μ)   = (1/√(2πσ0²)) exp( −(μ − μ0)²/(2σ0²) ).

Unknown μ, and known σ².
The likelihood is Gaussian (by problem setup).
The prior for μ is Gaussian (by our choice).
Good, because the posterior remains Gaussian.
What happens if σ² is unknown but μ is known?
Prior for σ²

Let us define the precision: λ = 1/σ².
The likelihood is

    p(D|λ) = (1/√(2πσ²))^N exp( −∑_{n=1}^N (x_n − μ)²/(2σ²) )
           = ( λ^{N/2} / (√(2π))^N ) exp( −∑_{n=1}^N (λ/2)(x_n − μ)² )
           = (1/(2π)^{N/2}) λ^{N/2} exp( −(λ/2) ∑_{n=1}^N (x_n − μ)² ).

We want to choose p(λ) in a similar form:

    p(λ) = A λ^B exp(−Cλ),

so that the posterior p(λ|D) is easy to compute.
Prior for σ2
We want to choose p(λ) in a similar form:
p(λ) = AλB exp −CλThe candidate is ...
p(λ) =1
Γ(a)baλa−1 exp(−bλ)
This distribution is called the Gamma distribution Gam(λ|a, b).We can show that
E[λ] =a
b, Var[λ] =
a
b2.
21 / 32
c©Stanley Chan 2020. All Rights Reserved.
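These moments can be checked by sampling; a hedged sketch (numbers are illustrative). Note that NumPy's gamma sampler is parameterized by shape a and scale 1/b, so the rate b must be inverted:

```python
import numpy as np

# Gam(lambda | a, b) has mean a/b and variance a/b^2.
a, b = 5.0, 2.0
rng = np.random.default_rng(0)
samples = rng.gamma(shape=a, scale=1.0 / b, size=200_000)

emp_mean = samples.mean()   # should be close to a/b
emp_var = samples.var()     # should be close to a/b^2
```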
Prior for σ2
If we consider this pair of likelihood and prior
p(D|λ) =1
(2π)N/2λN/2 exp
−λ
2
N∑n=1
(xn − µ)2
p(λ) =1
Γ(a0)ba0
0 λa0−1 exp(−b0λ),
then the posterior is
p(λ|D) ∝ λ(a0+N/2)−1 exp
−
(b0 +
1
2
N∑n=1
(xn − µ)2
)λ
Just another Gamma distribution.
You can now do estimation on this Gamma by finding λ whichmaximizes the posterior. Details: See Appendix.
22 / 32
c©Stanley Chan 2020. All Rights Reserved.
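The conjugate update itself is two lines of arithmetic. A sketch with illustrative names and numbers (not from the lecture):

```python
import numpy as np

def gamma_posterior(x, mu, a0, b0):
    """Conjugate update for the precision lambda = 1/sigma^2 (known mean mu):
    Gam(a0, b0) prior -> Gam(a0 + N/2, b0 + 0.5 * sum((x - mu)^2)) posterior."""
    x = np.asarray(x, dtype=float)
    a_n = a0 + x.size / 2.0
    b_n = b0 + 0.5 * np.sum((x - mu) ** 2)
    return a_n, b_n

rng = np.random.default_rng(0)
true_var, mu = 0.25, 0.0                 # true precision = 1/0.25 = 4
x = rng.normal(mu, np.sqrt(true_var), size=2000)
a_n, b_n = gamma_posterior(x, mu, a0=2.0, b0=1.0)
post_mean = a_n / b_n                    # posterior mean of lambda
```

With a reasonable amount of data the posterior mean of λ lands near the true precision.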
Prior for Both μ and σ²

Again, let λ = 1/σ².
The likelihood is

    p(D|μ, λ) = (1/√(2πσ²))^N exp( −∑_{n=1}^N (x_n − μ)²/(2σ²) )
              ∝ [ λ^{1/2} exp(−λμ²/2) ]^N exp( λμ ∑_{n=1}^N x_n − (λ/2) ∑_{n=1}^N x_n² ).

The candidate for the prior is

    p(μ, λ) ∝ [ λ^{1/2} exp(−λμ²/2) ]^β exp( cλμ − dλ )
            = exp( −(βλ/2)(μ − c/β)² ) · λ^{β/2} exp( −(d − c²/(2β)) λ ),

where the first factor is N(μ|μ0, σ0²) and the second is Gam(λ|a, b).
The prior distribution is called the Normal-Gamma distribution.
Priors for High-Dimensional Gaussians

Let Λ = Σ^{−1}.
The likelihood is

    p(x_n|μ, Σ) = N(x_n|μ, Λ^{−1}).

Prior for μ: Gaussian.

    p(μ) = N(μ|μ0, Λ0^{−1}).

Prior for Λ: Wishart.

    p(Λ) = W(Λ|W, ν).

Prior for both μ and Λ: Normal-Wishart.

    p(μ, Λ) = N(μ|μ0, (βΛ)^{−1}) W(Λ|W, ν).
Conjugate Prior
You have a likelihood p_{X|Θ}(x|θ).
You want to choose a prior p_Θ(θ) so that the posterior p_{Θ|X}(θ|x) takes the same form as the prior.
Such a prior is called the conjugate prior (conjugate with respect to the likelihood).
Finding the conjugate prior may not be easy!
Good news: any likelihood belonging to the exponential family has a conjugate prior that is also in the exponential family.
Exponential family: Gaussian, Exponential, Poisson, Bernoulli, etc.
For more discussion, see Bishop Chapter 2.4.
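As a concrete non-Gaussian instance (my own example, not from the slides): the Bernoulli likelihood listed above has the Beta distribution as its conjugate prior, and the posterior update is pure counting.

```python
def beta_bernoulli_update(data, a, b):
    """Beta(a, b) prior on the Bernoulli parameter; observing k ones and
    (n - k) zeros gives a Beta(a + k, b + n - k) posterior -- same family."""
    k = sum(data)
    n = len(data)
    return a + k, b + (n - k)

data = [1, 0, 1, 1, 0, 1]                              # 4 ones, 2 zeros
a_post, b_post = beta_bernoulli_update(data, a=1, b=1)  # uniform Beta(1,1) prior
post_mean = a_post / (a_post + b_post)
```

Because the posterior is again a Beta, the update can be applied repeatedly as data streams in, which is the practical payoff of conjugacy.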
Reading List
Bayesian Parameter Estimation

Duda, Hart, and Stork, Pattern Classification, Chapters 3.3-3.5
Bishop, Pattern Recognition and Machine Learning, Chapter 2.4
M. Jordan (Berkeley), https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/other-readings/chapter9.pdf
CMU Note, http://www.cs.cmu.edu/~aarti/Class/10701_Spring14/slides/MLE_MAP_Part1.pdf
A. Kak (Purdue), https://engineering.purdue.edu/kak/Tutorials/Trinity.pdf
Appendix
Prior for σ²: Solution

The posterior is

    p(λ|D) ∝ λ^{(a0 + N/2) − 1} exp( −( b0 + (1/2) ∑_{n=1}^N (x_n − μ)² ) λ )
           ∝ λ^{a_N − 1} exp(−b_N λ).

The maximum-a-posteriori estimate of λ is

    λ̂ = argmax_λ p(λ|D)
       = argmax_λ λ^{a_N − 1} exp(−b_N λ)
       = argmax_λ (a_N − 1) log λ − b_N λ.

Taking the derivative and setting it to zero:

    d/dλ [ (a_N − 1) log λ − b_N λ ] = (a_N − 1)/λ − b_N = 0.
Prior for σ²: Solution

Therefore,

    λ̂ = (a_N − 1) / b_N,

where the parameters are

    a_N = a0 + N/2,
    b_N = b0 + (1/2) ∑_{n=1}^N (x_n − μ)² = b0 + (N/2) σ²_ML.

Hence, the MAP estimate is

    λ̂ = ( a0 + N/2 − 1 ) / ( b0 + (N/2) σ²_ML ).

As N → ∞, λ̂ → 1/σ²_ML.
As N → 0, λ̂ → (a0 − 1)/b0.
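These two limits can be verified numerically. A sketch using the posterior mode (a_N − 1)/b_N throughout (names and numbers are mine):

```python
import numpy as np

def lambda_map(x, mu, a0, b0):
    """MAP (mode) of the Gamma posterior over the precision lambda = 1/sigma^2."""
    x = np.asarray(x, dtype=float)
    a_n = a0 + x.size / 2.0
    b_n = b0 + 0.5 * np.sum((x - mu) ** 2)
    return (a_n - 1.0) / b_n

rng = np.random.default_rng(0)
mu, sigma = 0.0, 0.5                     # true precision 1/sigma^2 = 4
x = rng.normal(mu, sigma, size=50_000)

lam_large_n = lambda_map(x, mu, a0=2.0, b0=1.0)     # large N: near 1/sigma_ML^2
lam_no_data = lambda_map(x[:0], mu, a0=2.0, b0=1.0)  # N = 0: the prior mode (a0-1)/b0
```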
Prior for Both µ and σ2: Detailed Derivation
Prior for Both μ and σ²: Detailed Derivation

Again, let λ = 1/σ².
The likelihood is

    p(D|μ, λ) = (1/√(2πσ²))^N exp( −∑_{n=1}^N (x_n − μ)²/(2σ²) )
              = (λ/(2π))^{N/2} exp( −(λ/2) ∑_{n=1}^N (x_n − μ)² )
              = (λ/(2π))^{N/2} exp( −(λ/2) ∑_{n=1}^N (x_n² − 2μx_n + μ²) )
              = (λ/(2π))^{N/2} exp( −(λ/2) ∑_{n=1}^N x_n² + λμ ∑_{n=1}^N x_n ) · [ exp(−λμ²/2) ]^N
              = (1/(2π))^{N/2} [ λ^{1/2} exp(−λμ²/2) ]^N exp( λμ ∑_{n=1}^N x_n − (λ/2) ∑_{n=1}^N x_n² )
Prior for Both µ and σ2: Detailed Derivation
The likelihood is

    p(D|μ, λ) ∝ [ λ^{1/2} exp(−λμ²/2) ]^N exp( λμ ∑_{n=1}^N x_n − (λ/2) ∑_{n=1}^N x_n² ).

The candidate for the prior is

    p(μ, λ) ∝ [ λ^{1/2} exp(−λμ²/2) ]^β exp( cλμ − dλ )
            = [ exp(−λμ²/2) ]^β [ λ^{β/2} exp( cλμ − dλ ) ]
            = exp( −(βλ/2)(μ − c/β)² ) · λ^{β/2} exp( −(d − c²/(2β)) λ ),

where the first factor is N(μ|μ0, σ0²) and the second is Gam(λ|a, b), with

    μ0 = c/β,   σ0² = (βλ)^{−1},   a = 1 + β/2,   b = d − c²/(2β).
Prior for Both µ and σ2: Detailed Derivation
The prior distribution is

    p(μ, λ) = N( μ | μ0, (βλ)^{−1} ) Gam(λ|a, b).

This is called the Normal-Gamma distribution.
[Figure] A 2D plot of p(μ, λ) when μ0 = 0, β = 2, a = 5, b = 6.