
ECE595 / STAT598: Machine Learning I
Lecture 12: Bayesian Parameter Estimation

Spring 2020

Stanley Chan
School of Electrical and Computer Engineering, Purdue University

© Stanley Chan 2020. All Rights Reserved.


Overview

In linear discriminant analysis (LDA), there are generally two types of approaches:

Generative approach: Estimate model, then define the classifier

Discriminative approach: Directly define the classifier


Outline

Generative Approaches

Lecture 9 Bayesian Decision Rules

Lecture 10 Evaluating Performance

Lecture 11 Parameter Estimation

Lecture 12 Bayesian Prior

Lecture 13 Connecting Bayesian and Linear Regression

Today's Lecture
  Basic Principles
    Posterior
    1D Illustration
    Interpretations
  Choosing Priors
    Prior for Mean
    Prior for Variance
    Conjugate Prior


MLE and MAP

There are two typical ways of estimating parameters.

Maximum-likelihood estimation (MLE): θ is deterministic.

Maximum-a-posteriori estimation (MAP): θ is random and has a prior distribution.


From MLE to MAP

In MLE, the parameter θ is deterministic.
What if we assume θ has a distribution? This makes θ probabilistic.
So we treat Θ as a random variable, and θ as a state (a realization) of Θ.
Distribution of Θ:

    pΘ(θ)

pΘ(θ) is the distribution of the parameter Θ.
Θ has its own mean and its own variance.


Maximum-a-Posteriori

By Bayes' Theorem again:

    pΘ|X(θ|xn) = pX|Θ(xn|θ) pΘ(θ) / pX(xn).

To maximize the posterior distribution:

    θ̂ = argmax_θ  pΘ|X(θ|D)

       = argmax_θ  [ ∏_{n=1}^N pX|Θ(xn|θ) ] pΘ(θ) / pX(D)

       = argmin_θ  −∑_{n=1}^N log pX|Θ(xn|θ) − log pΘ(θ)
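Since the last line is just the minimization of a negative log-posterior, the MAP estimate can also be computed numerically when no closed form is available. Below is a minimal sketch (not from the slides), assuming a Gaussian likelihood with known σ and a Gaussian prior on θ; the sample values and the search grid are illustrative only.

```python
import numpy as np

# Hypothetical setup: x_n ~ N(theta, sigma^2) with sigma known,
# and a Gaussian prior theta ~ N(theta0, sigma0^2).
sigma, theta0, sigma0 = 1.0, 0.0, 2.0
x = np.array([1.2, 0.7, 1.5])                 # observed samples x_1, ..., x_N

def neg_log_posterior(theta):
    # -sum_n log p(x_n | theta) - log p(theta), with additive constants dropped
    neg_log_lik   = np.sum((x - theta) ** 2) / (2 * sigma ** 2)
    neg_log_prior = (theta - theta0) ** 2 / (2 * sigma0 ** 2)
    return neg_log_lik + neg_log_prior

# Crude stand-in for argmin: evaluate on a dense grid of candidate thetas.
grid = np.linspace(-5, 5, 20001)
theta_map = grid[np.argmin([neg_log_posterior(t) for t in grid])]
print(theta_map)                              # approx 1.046 for these numbers
```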


The Role of pΘ(θ)

Let's look at the MAP:

    θ̂ = argmin_θ  −∑_{n=1}^N log pX|Θ(xn|θ) − log pΘ(θ)

Special case: When

    pΘ(θ) = δ(θ − θ0).

Then the delta function gives (informally)

    log pΘ(θ) = −∞,  if θ ≠ θ0;    0,  if θ = θ0.

This will give

    θ̂ = argmin_θ  { +∞,  if θ ≠ θ0;    −∑_{n=1}^N log pX|Θ(xn|θ0),  if θ = θ0 }

       = θ0.

No uncertainty. Absolutely sure θ = θ0.


Illustration: 1D Example

Suppose that:

    pX|Θ(x|θ) = (1/√(2πσ²)) exp( −(x − θ)² / (2σ²) )

    pΘ(θ)     = (1/√(2πσ0²)) exp( −(θ − θ0)² / (2σ0²) ).

When N = 1, the MAP problem is simply

    θ̂ = argmax_θ  pX|Θ(x|θ) pΘ(θ)

       = argmax_θ  (1/√(2πσ²)) exp( −(x − θ)²/(2σ²) ) · (1/√(2πσ0²)) exp( −(θ − θ0)²/(2σ0²) )

       = argmax_θ  −(x − θ)²/(2σ²) − (θ − θ0)²/(2σ0²)    (taking the log and dropping constants)


Illustration: 1D Example

Taking derivatives:

    d/dθ [ −(x − θ)²/(2σ²) − (θ − θ0)²/(2σ0²) ] = 0

    ⇒  (x − θ)/σ² − (θ − θ0)/σ0² = 0

    ⇒  σ0²(x − θ) = σ²(θ − θ0)

    ⇒  σ0² x + σ² θ0 = (σ0² + σ²) θ

Therefore, the solution is

    θ̂ = (σ0² x + σ² θ0) / (σ0² + σ²).


Interpreting the Result

Let us interpret the result

    θ̂ = (σ0² x + σ² θ0) / (σ0² + σ²).

Does it make sense?

If σ0 = 0, then θ̂ = (σ0² x + σ² θ0)/(σ0² + σ²) = θ0.

This means: No uncertainty. Absolutely sure that θ = θ0.

    pΘ(θ) = δ(θ − θ0)

The other extreme:

If σ0 = ∞, then θ̂ = (σ0² x + σ² θ0)/(σ0² + σ²) = x.

This means: I don't trust my prior at all. Use the data.

    pΘ(θ) = 1/|Ω|,  for all θ ∈ Ω.

Therefore, the MAP solution gives you a trade-off between the data and the prior.
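The trade-off can be seen numerically by sweeping the prior variance. A small illustration (my own, not from the slides), assuming a single observation x and the closed-form N = 1 solution above:

```python
import numpy as np

x, sigma, theta0 = 2.0, 1.0, 0.0              # one observation, known sigma, prior mean

def theta_map(sigma0):
    # Closed-form MAP for N = 1: (sigma0^2 x + sigma^2 theta0) / (sigma0^2 + sigma^2)
    return (sigma0**2 * x + sigma**2 * theta0) / (sigma0**2 + sigma**2)

for s0 in [0.0, 0.1, 1.0, 10.0, 1e6]:
    print(f"sigma0 = {s0:>9}: theta_MAP = {theta_map(s0):.4f}")
# sigma0 -> 0 recovers theta0 (trust the prior); sigma0 -> infinity recovers x (trust the data).
```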


When N = 2.

When N = 2, the MAP problem is

    θ̂ = argmax_θ  pX|Θ(x1, x2|θ) pΘ(θ)

       = argmax_θ  ( ∏_{n=1}^2 pX|Θ(xn|θ) ) pΘ(θ)

       = argmax_θ  (1/√(2πσ²))² exp( −[(x1 − θ)² + (x2 − θ)²]/(2σ²) ) × (1/√(2πσ0²)) exp( −(θ − θ0)²/(2σ0²) )

       = argmin_θ  −log( · )

       = argmin_θ  (x1 − θ)²/(2σ²) + (x2 − θ)²/(2σ²) + (θ − θ0)²/(2σ0²)


When N = 2.

Taking derivatives and setting to zero

    d/dθ [ (x1 − θ)²/(2σ²) + (x2 − θ)²/(2σ²) + (θ − θ0)²/(2σ0²) ]

      = −(x1 − θ)/σ² − (x2 − θ)/σ² + (θ − θ0)/σ0² = 0.

Equating to zero yields

    θ̂ = ( (x1 + x2) σ0² + θ0 σ² ) / ( 2σ0² + σ² ).

If σ0 = 0 (certain prior), then θ̂ = θ0.

If σ0 = ∞ (useless prior), then θ̂ = (x1 + x2)/2.


When N is Arbitrary

General N.

    θ̂ = argmax_θ  [ ∏_{n=1}^N pX|Θ(xn|θ) ] pΘ(θ)

       = argmax_θ  (1/√(2πσ²))^N exp( −∑_{n=1}^N (xn − θ)²/(2σ²) ) · (1/√(2πσ0²)) exp( −(θ − θ0)²/(2σ0²) )

       = argmin_θ  ∑_{n=1}^N (xn − θ)²/(2σ²) + (θ − θ0)²/(2σ0²)

       = ( σ0² ∑_{n=1}^N xn + θ0 σ² ) / ( Nσ0² + σ² ).
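The general-N solution is easy to package as a function. A minimal sketch with illustrative variable names (not from the slides):

```python
import numpy as np

def map_gaussian_mean(x, theta0, sigma0, sigma):
    """MAP estimate of the mean of N(theta, sigma^2) under a N(theta0, sigma0^2) prior."""
    x = np.asarray(x, dtype=float)
    N = x.size
    return (sigma0**2 * x.sum() + theta0 * sigma**2) / (N * sigma0**2 + sigma**2)

x = [1.2, 0.7, 1.5]
# Result sits between the prior mean (0.0) and the sample mean (about 1.13).
print(map_gaussian_mean(x, theta0=0.0, sigma0=2.0, sigma=1.0))
```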

What does it mean?


Interpreting the MAP solution: N → ∞

Let's do some algebra:

    θ̂ = ( (∑_{n=1}^N xn) σ0² + θ0 σ² ) / ( Nσ0² + σ² )

       = ( (∑_{n=1}^N xn) σ0² + θ0 σ² ) / ( N (σ0² + σ²/N) )

       = ( (1/N ∑_{n=1}^N xn) σ0² + (σ²/N) θ0 ) / ( σ0² + σ²/N ).

Fix σ0 and σ.

As N → ∞,

    θ̂ = ( (1/N ∑_{n=1}^N xn) σ0² + (σ²/N) θ0 ) / ( σ0² + σ²/N )  →  (1/N) ∑_{n=1}^N xn.

This is the maximum-likelihood estimate.

When I have a lot of samples, the prior does not really matter.


Interpreting the MAP solution: N → 0

    θ̂ = ( (1/N ∑_{n=1}^N xn) σ0² + (σ²/N) θ0 ) / ( σ0² + σ²/N )

Fix σ0 and σ.

As N → 0,

    θ̂ = ( (1/N ∑_{n=1}^N xn) σ0² + (σ²/N) θ0 ) / ( σ0² + σ²/N )  →  θ0.

This is just the prior.

When I have very few samples, I should rely on the prior.

If the prior is good, then I can do well even if I have very few samples.

Maximum-likelihood does not have the same luxury!
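Both limits can be checked with a quick simulation. The sketch below (my own illustration; the true mean, noise level, and prior values are made up) plugs data of increasing size into the closed-form MAP solution:

```python
import numpy as np

rng = np.random.default_rng(0)
true_theta, sigma = 3.0, 1.0                  # data-generating mean and known noise std
theta0, sigma0 = 0.0, 0.5                     # prior mean and prior std (made up)

def map_estimate(x):
    # (sigma0^2 sum(x) + sigma^2 theta0) / (N sigma0^2 + sigma^2)
    return (sigma0**2 * x.sum() + sigma**2 * theta0) / (x.size * sigma0**2 + sigma**2)

for N in [1, 10, 100, 10000]:
    x = rng.normal(true_theta, sigma, size=N)
    print(f"N = {N:>5}: MAP = {map_estimate(x):.3f}   MLE (sample mean) = {x.mean():.3f}")
# Few samples: the MAP estimate leans on theta0.  Many samples: MAP and MLE coincide.
```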


What Happens to the Posterior?

Consider

    p(D|µ) = (1/√(2πσ²))^N exp( −∑_{n=1}^N (xn − µ)²/(2σ²) )

    p(µ)   = (1/√(2πσ0²)) exp( −(µ − µ0)²/(2σ0²) ).

We can show that the posterior is

    p(µ|D) = (1/√(2πσN²)) exp( −(µ − µN)²/(2σN²) ),

where

    µN = ( σ²/(Nσ0² + σ²) ) µ0 + ( Nσ0²/(Nσ0² + σ²) ) µML

    σN² = σ²σ0² / (σ² + Nσ0²).


What Happens to the Posterior?

The xn are generated from µ = 0.8 and σ² = 0.1. When N increases, the posterior shifts towards the true distribution.
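The figure is not reproduced in this transcript, but the behavior is easy to regenerate from the µN and σN² formulas on the previous slide. A short sketch using the same data-generating values µ = 0.8 and σ² = 0.1; the prior values µ0 = 0 and σ0² = 0.1 are my own assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
mu_true, sigma2 = 0.8, 0.1                    # data-generating mean and known variance (from the slide)
mu0, sigma0_sq = 0.0, 0.1                     # Gaussian prior on mu (assumed for illustration)

x_all = rng.normal(mu_true, np.sqrt(sigma2), size=10000)

for N in [0, 1, 10, 100, 10000]:
    x = x_all[:N]
    mu_ml = x.mean() if N > 0 else 0.0        # mu_ML is undefined at N = 0; its weight is 0 there anyway
    w = N * sigma0_sq / (N * sigma0_sq + sigma2)
    mu_N = (1 - w) * mu0 + w * mu_ml          # mu_N from the previous slide
    sigma_N_sq = sigma2 * sigma0_sq / (sigma2 + N * sigma0_sq)
    print(f"N = {N:>5}: mu_N = {mu_N:.3f}, sigma_N^2 = {sigma_N_sq:.6f}")
# As N grows, mu_N approaches 0.8 and sigma_N^2 shrinks to 0: the posterior concentrates.
```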


Outline

Generative Approaches

Lecture 9 Bayesian Decision Rules

Lecture 10 Evaluating Performance

Lecture 11 Parameter Estimation

Lecture 12 Bayesian Prior

Lecture 13 Connecting Bayesian and Linear Regression

Today's Lecture
  Basic Principles
    Posterior
    1D Illustration
    Interpretations
  Choosing Priors
    Prior for Mean
    Prior for Variance
    Conjugate Prior


Prior for µ

So far we have considered

    p(D|µ) = (1/√(2πσ²))^N exp( −∑_{n=1}^N (xn − µ)²/(2σ²) )

    p(µ)   = (1/√(2πσ0²)) exp( −(µ − µ0)²/(2σ0²) ).

Unknown µ, and known σ².

The likelihood is Gaussian (by problem setup).

The prior for µ is Gaussian (by our choice).

Good, because posterior remains a Gaussian.

What happens if σ² is unknown but µ is known?


Prior for σ²

Let us define the precision: λ = 1/σ².

The likelihood is

    p(D|λ) = (1/√(2πσ²))^N exp( −∑_{n=1}^N (xn − µ)²/(2σ²) )

           = ( λ^{N/2} / (√(2π))^N ) exp( −∑_{n=1}^N (λ/2)(xn − µ)² )

           = ( 1/(2π)^{N/2} ) λ^{N/2} exp( −(λ/2) ∑_{n=1}^N (xn − µ)² ).

We want to choose p(λ) in a similar form:

    p(λ) = A λ^B exp(−Cλ)

so that the posterior p(λ|D) is easy to compute.


Prior for σ²

We want to choose p(λ) in a similar form:

    p(λ) = A λ^B exp(−Cλ)

The candidate is ...

    p(λ) = ( 1/Γ(a) ) b^a λ^{a−1} exp(−bλ)

This distribution is called the Gamma distribution Gam(λ|a, b). We can show that

    E[λ] = a/b,    Var[λ] = a/b².


Prior for σ²

If we consider this pair of likelihood and prior

    p(D|λ) = ( 1/(2π)^{N/2} ) λ^{N/2} exp( −(λ/2) ∑_{n=1}^N (xn − µ)² )

    p(λ)   = ( 1/Γ(a0) ) b0^{a0} λ^{a0−1} exp(−b0 λ),

then the posterior is

    p(λ|D) ∝ λ^{(a0 + N/2) − 1} exp( −( b0 + (1/2) ∑_{n=1}^N (xn − µ)² ) λ ).

Just another Gamma distribution.

You can now do estimation on this Gamma by finding the λ which maximizes the posterior. Details: See Appendix.
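The conjugate update for the precision amounts to two scalar updates of the Gamma parameters. A minimal sketch (the hyperparameters a0, b0 and the simulated data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma2 = 0.0, 0.5                         # known mean; true variance, so true precision = 2
x = rng.normal(mu, np.sqrt(sigma2), size=50)

a0, b0 = 2.0, 1.0                             # Gam(lambda | a0, b0) prior on the precision
aN = a0 + x.size / 2                          # posterior is Gam(lambda | aN, bN)
bN = b0 + 0.5 * np.sum((x - mu) ** 2)

print(f"posterior: Gam(lambda | {aN:.1f}, {bN:.2f})")
print(f"posterior mean of lambda = {aN / bN:.3f}   (true precision = {1 / sigma2:.1f})")
```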


Prior for Both µ and σ²

Again, let λ = 1/σ².

The likelihood is

    p(D|µ, λ) = (1/√(2πσ²))^N exp( −∑_{n=1}^N (xn − µ)²/(2σ²) )

              ∝ [ λ^{1/2} exp(−λµ²/2) ]^N exp( λµ ∑_{n=1}^N xn − (λ/2) ∑_{n=1}^N xn² )

Candidate for the prior is

    p(µ, λ) ∝ [ λ^{1/2} exp(−λµ²/2) ]^β exp( cλµ − dλ )

            = exp( −(βλ/2)(µ − c/β)² ) · λ^{β/2} exp( −(d − c²/(2β)) λ ),

where the first factor is N(µ|µ0, σ0²) as a function of µ and the second is Gam(λ|a, b) as a function of λ.

The prior distribution is called the Normal-Gamma distribution.


Priors for High-dimensional Gaussians

Let Λ = Σ^{-1}.

The likelihood is

    p(xn|µ, Σ) = N(xn | µ, Λ^{-1})

Prior for µ: Gaussian.

    p(µ) = N(µ | µ0, Λ0^{-1}).

Prior for Σ: Wishart.

    p(Λ) = W(Λ | W, ν).

Prior for both µ and Σ: Normal-Wishart.

    p(µ, Λ) = N(µ | µ0, (βΛ)^{-1}) W(Λ | W, ν).
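For intuition, one can draw samples from such a prior in a few lines. A sketch assuming SciPy is available; the dimension, µ0, β, W, and ν below are made-up illustrative values, and the scipy.stats.wishart parameterization (df = ν, scale = W) is assumed to match the W(Λ|W, ν) notation above:

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(3)
d = 2                                         # dimension of the Gaussian
mu0 = np.zeros(d)                             # prior mean for mu
beta, nu = 2.0, 5.0                           # scaling beta and Wishart degrees of freedom nu (nu >= d)
W = np.eye(d)                                 # Wishart scale matrix

# One draw (mu, Lambda) from the Normal-Wishart prior:
Lambda = wishart.rvs(df=nu, scale=W, random_state=3)              # Lambda ~ W(Lambda | W, nu)
mu = rng.multivariate_normal(mu0, np.linalg.inv(beta * Lambda))   # mu | Lambda ~ N(mu0, (beta Lambda)^-1)

print("Lambda =\n", Lambda)
print("mu =", mu)
```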


Conjugate Prior

You have a likelihood pX|Θ(x|θ).

You want to choose a prior pΘ(θ) so that ...

the posterior pΘ|X(θ|x) takes the same form as the prior.

Such a prior is called the conjugate prior (a small example follows after this list).

Conjugate with respect to the likelihood.

Finding the conjugate prior may not be easy!

Good news: Any likelihood belonging to the exponential family has a conjugate prior that is also in the exponential family.

Exponential family: Gaussian, Exponential, Poisson, Bernoulli, etc.

For more discussions, see Bishop Chapter 2.4
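As a concrete non-Gaussian instance (not worked out in these slides, but using the Bernoulli case listed above): the conjugate prior for a Bernoulli likelihood is the Beta distribution, and the posterior update reduces to a pair of counts.

```python
import numpy as np

rng = np.random.default_rng(4)
theta_true = 0.7
x = rng.binomial(1, theta_true, size=100)     # Bernoulli(theta) observations

a0, b0 = 1.0, 1.0                             # Beta(a0, b0) prior on theta (uniform)
aN = a0 + x.sum()                             # posterior is Beta(aN, bN): same family as the prior
bN = b0 + x.size - x.sum()

print(f"posterior: Beta({aN:.0f}, {bN:.0f}), posterior mean = {aN / (aN + bN):.3f}")
```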


Reading List

Bayesian Parameter Estimation

Duda-Hart-Stork, Pattern Classification, Chapter 3.3 - 3.5

Bishop, Pattern Recognition and Machine Learning, Chapter 2.4

M. Jordan (Berkeley), https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/other-readings/chapter9.pdf

CMU Note, http://www.cs.cmu.edu/~aarti/Class/10701_Spring14/slides/MLE_MAP_Part1.pdf

A. Kak (Purdue), https://engineering.purdue.edu/kak/Tutorials/Trinity.pdf


Appendix


Prior for σ²: Solution

The posterior is

    p(λ|D) ∝ λ^{(a0 + N/2) − 1} exp( −( b0 + (1/2) ∑_{n=1}^N (xn − µ)² ) λ )

           ∝ λ^{aN − 1} exp(−bN λ).

The maximum-a-posteriori estimate of λ is

    λ̂ = argmax_λ  p(λ|D)

       = argmax_λ  λ^{aN − 1} exp(−bN λ)

       = argmax_λ  (aN − 1) log λ − bN λ.

Taking derivative and setting to zero:

    d/dλ [ (aN − 1) log λ − bN λ ] = (aN − 1)/λ − bN = 0.


Prior for σ²: Solution

Therefore,

    λ̂ = (aN − 1) / bN,

where the parameters are

    aN = a0 + N/2,
    bN = b0 + (1/2) ∑_{n=1}^N (xn − µ)² = b0 + (N/2) σ²ML.

Hence, the MAP estimate is

    λ̂ = ( a0 + N/2 − 1 ) / ( b0 + (N/2) σ²ML ).

As N → ∞, λ̂ → 1/σ²ML.

As N → 0, λ̂ → (a0 − 1)/b0.
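A quick numerical check of the two limits; the data and the hyperparameters a0, b0 below are made-up illustrative values.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma2 = 0.0, 0.25                        # known mean; true variance, so true precision = 4
a0, b0 = 3.0, 2.0                             # Gamma prior hyperparameters

for N in [1, 10, 100, 100000]:
    x = rng.normal(mu, np.sqrt(sigma2), size=N)
    sigma2_ml = np.mean((x - mu) ** 2)
    lam_map = (a0 + N / 2 - 1) / (b0 + (N / 2) * sigma2_ml)
    print(f"N = {N:>6}: lambda_MAP = {lam_map:.3f}")
# Large N: lambda_MAP approaches 1/sigma2_ML (about 4 here).
# Small N: it stays near the prior mode (a0 - 1)/b0 = 1.
```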


Prior for Both µ and σ²: Detailed Derivation

Again, let λ = 1/σ².

The likelihood is

    p(D|µ, λ) = (1/√(2πσ²))^N exp( −∑_{n=1}^N (xn − µ)²/(2σ²) )

              = ( λ/(2π) )^{N/2} exp( −(λ/2) ∑_{n=1}^N (xn − µ)² )

              = ( λ/(2π) )^{N/2} exp( −(λ/2) ∑_{n=1}^N (xn² − 2µxn + µ²) )

              = ( λ/(2π) )^{N/2} exp( −(λ/2) ∑_{n=1}^N xn² + λµ ∑_{n=1}^N xn ) [ exp(−λµ²/2) ]^N

              = ( 1/(2π) )^{N/2} [ λ^{1/2} exp(−λµ²/2) ]^N exp( λµ ∑_{n=1}^N xn − (λ/2) ∑_{n=1}^N xn² )


Prior for Both µ and σ²: Detailed Derivation

The likelihood is

    p(D|µ, λ) ∝ [ λ^{1/2} exp(−λµ²/2) ]^N exp( λµ ∑_{n=1}^N xn − (λ/2) ∑_{n=1}^N xn² )

Candidate for the prior is

    p(µ, λ) ∝ [ λ^{1/2} exp(−λµ²/2) ]^β exp( cλµ − dλ )

            = [ exp(−λµ²/2) ]^β [ λ^{β/2} exp( cλµ − dλ ) ]

            = exp( −(βλ/2)(µ − c/β)² ) · λ^{β/2} exp( −(d − c²/(2β)) λ ),

where the first factor is N(µ|µ0, σ0²) and the second is Gam(λ|a, b), with

    µ0 = c/β,    σ0² = (βλ)^{-1},    a = 1 + β/2,    b = d − c²/(2β).


Prior for Both µ and σ²: Detailed Derivation

The prior distribution is

    p(µ, λ) = N(µ | µ0, (βλ)^{-1}) Gam(λ | a, b)

This is called the Normal-Gamma distribution.

Here is a 2D plot of p(µ, λ) when µ0 = 0, β = 2, a = 5, b = 6.
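The 2D plot itself is not included in this transcript, but the density is straightforward to evaluate on a grid with the stated parameters. A sketch assuming SciPy is available (Gam(λ|a, b) with rate b corresponds to scipy's gamma with scale = 1/b):

```python
import numpy as np
from scipy.stats import norm, gamma

mu0, beta, a, b = 0.0, 2.0, 5.0, 6.0          # the parameters quoted for the plot

def normal_gamma_pdf(mu, lam):
    # p(mu, lambda) = N(mu | mu0, (beta*lambda)^-1) * Gam(lambda | a, b)
    return norm.pdf(mu, loc=mu0, scale=1.0 / np.sqrt(beta * lam)) * gamma.pdf(lam, a, scale=1.0 / b)

mu_grid = np.linspace(-2.0, 2.0, 201)
lam_grid = np.linspace(0.05, 2.5, 200)
M, L = np.meshgrid(mu_grid, lam_grid)
P = normal_gamma_pdf(M, L)
print(P.shape, P.max())                       # feed (M, L, P) to a contour plot to reproduce the figure
```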
