© Stanley Chan 2020. All Rights Reserved.
ECE595 / STAT598: Machine Learning I
Lecture 12: Bayesian Parameter Estimation
Spring 2020
Stanley Chan
School of Electrical and Computer Engineering
Purdue University
Overview
In linear discriminant analysis (LDA), there are generally two types of approaches:

Generative approach: Estimate the model, then define the classifier.
Discriminative approach: Directly define the classifier.
Outline

Generative Approaches
  Lecture 9: Bayesian Decision Rules
  Lecture 10: Evaluating Performance
  Lecture 11: Parameter Estimation
  Lecture 12: Bayesian Prior
  Lecture 13: Connecting Bayesian and Linear Regression

Today's Lecture
  Basic Principles: Posterior, 1D Illustration, Interpretations
  Choosing Priors: Prior for Mean, Prior for Variance, Conjugate Prior
MLE and MAP
There are two typical ways of estimating parameters.

Maximum-likelihood estimation (MLE): θ is deterministic.
Maximum-a-posteriori estimation (MAP): θ is random and has a prior distribution.
From MLE to MAP
In MLE, the parameter θ is deterministic.
What if we assume θ has a distribution? This makes θ probabilistic.
So we treat Θ as a random variable, and θ as a state of Θ.
Distribution of Θ:

    p_Θ(θ)

p_Θ(θ) is the distribution of the parameter Θ. Θ has its own mean and its own variance.
Maximum-a-Posteriori
By Bayes' Theorem again:

    p_{Θ|X}(θ|x_n) = p_{X|Θ}(x_n|θ) p_Θ(θ) / p_X(x_n).

To maximize the posterior distribution:

    θ̂ = argmax_θ p_{Θ|X}(θ|D)
       = argmax_θ ∏_{n=1}^N p_{Θ|X}(θ|x_n)
       = argmax_θ ∏_{n=1}^N p_{X|Θ}(x_n|θ) p_Θ(θ) / p_X(x_n)
       = argmin_θ −[ ∑_{n=1}^N log p_{X|Θ}(x_n|θ) + log p_Θ(θ) ]
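The last line can be checked numerically. A minimal sketch, not from the lecture (function names and numbers are illustrative): minimize the negative log-posterior over a grid, for a Gaussian likelihood with known σ and a Gaussian prior on θ.

```python
import numpy as np

def map_by_grid(x, log_lik, log_prior, grid):
    """MAP by brute force: minimize -[sum_n log p(x_n|t) + log p(t)] over a grid."""
    obj = [-(sum(log_lik(xn, t) for xn in x) + log_prior(t)) for t in grid]
    return grid[int(np.argmin(obj))]

# Gaussian likelihood with known sigma and a Gaussian prior on theta;
# normalization constants are dropped since they do not affect the argmin.
sigma, sigma0, theta0 = 1.0, 0.5, 0.0
log_lik = lambda xn, t: -(xn - t) ** 2 / (2 * sigma ** 2)
log_prior = lambda t: -(t - theta0) ** 2 / (2 * sigma0 ** 2)

x = [1.2, 0.8, 1.0]
theta_hat = map_by_grid(x, log_lik, log_prior, np.linspace(-2.0, 2.0, 4001))
```

For this Gaussian-Gaussian pair the grid minimizer should agree (up to grid resolution) with the closed-form solution derived later in the lecture.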
The Role of p_Θ(θ)

Let's look at the MAP:

    θ̂ = argmin_θ −[ ∑_{n=1}^N log p_{X|Θ}(x_n|θ) + log p_Θ(θ) ]

Special case: when

    p_Θ(θ) = δ(θ − θ0),

the delta function gives

    log p_Θ(θ) = { −∞,  if θ ≠ θ0,
                 {  0,  if θ = θ0.

The objective is therefore +∞ whenever θ ≠ θ0, so

    θ̂ = argmin_θ { +∞,                              if θ ≠ θ0,
                  { −∑_{n=1}^N log p_{X|Θ}(x_n|θ0),  if θ = θ0,
       = θ0.

No uncertainty. Absolutely sure θ = θ0.
Illustration: 1D Example
Suppose that:

    p_{X|Θ}(x|θ) = (1/√(2πσ²))  exp( −(x − θ)² / (2σ²) ),
    p_Θ(θ)       = (1/√(2πσ0²)) exp( −(θ − θ0)² / (2σ0²) ).

When N = 1, the MAP problem is simply

    θ̂ = argmax_θ p_{X|Θ}(x|θ) p_Θ(θ)
       = argmax_θ (1/√(2πσ²)) exp( −(x − θ)²/(2σ²) ) · (1/√(2πσ0²)) exp( −(θ − θ0)²/(2σ0²) )
       = argmax_θ [ −(x − θ)²/(2σ²) − (θ − θ0)²/(2σ0²) ]
Illustration: 1D Example
Taking derivatives:

    d/dθ [ −(x − θ)²/(2σ²) − (θ − θ0)²/(2σ0²) ] = 0
    ⇒ (x − θ)/σ² − (θ − θ0)/σ0² = 0
    ⇒ σ0²(x − θ) = σ²(θ − θ0)
    ⇒ σ0²x + σ²θ0 = (σ0² + σ²)θ

Therefore, the solution is

    θ̂ = (σ0²x + σ²θ0) / (σ0² + σ²).
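As a quick sanity check of this closed form (my own illustrative code, not part of the lecture), the estimate interpolates between the prior mean θ0 and the observation x:

```python
def map_1d(x, theta0, sigma, sigma0):
    """Closed-form MAP for one Gaussian observation with a Gaussian prior on the mean."""
    return (sigma0 ** 2 * x + sigma ** 2 * theta0) / (sigma0 ** 2 + sigma ** 2)

x, theta0, sigma = 2.0, 0.0, 1.0
# Equal variances: the estimate lands halfway between prior mean and observation.
balanced = map_1d(x, theta0, sigma, sigma0=1.0)
# A very confident prior (tiny sigma0) pins the estimate near theta0;
# a very diffuse prior (huge sigma0) leaves it near the observation x.
confident = map_1d(x, theta0, sigma, sigma0=1e-6)
diffuse = map_1d(x, theta0, sigma, sigma0=1e6)
```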
Interpreting the Result
Let us interpret the result

    θ̂ = (σ0²x + σ²θ0) / (σ0² + σ²).

Does it make sense?

If σ0 = 0, then θ̂ = (σ0²x + σ²θ0)/(σ0² + σ²) = θ0.
This means: no uncertainty; absolutely sure that θ = θ0. This corresponds to

    p_Θ(θ) = δ(θ − θ0).

The other extreme: if σ0 = ∞, then θ̂ = (σ0²x + σ²θ0)/(σ0² + σ²) = x.
This means: I don't trust my prior at all; use the data. This corresponds to

    p_Θ(θ) = 1/|Ω|, for all θ ∈ Ω.

Therefore, the MAP solution gives you a trade-off between the data and the prior.
When N = 2.

When N = 2, the MAP problem is

    θ̂ = argmax_θ p_{X|Θ}(x|θ) p_Θ(θ)
       = argmax_θ ( ∏_{n=1}^2 p_{X|Θ}(x_n|θ) ) p_Θ(θ)
       = argmax_θ (1/√(2πσ²))² exp( −[(x1 − θ)² + (x2 − θ)²]/(2σ²) )
                  × (1/√(2πσ0²)) exp( −(θ − θ0)²/(2σ0²) )
       = argmin_θ −log(·)
       = argmin_θ [ (x1 − θ)²/(2σ²) + (x2 − θ)²/(2σ²) + (θ − θ0)²/(2σ0²) ]
When N = 2.
Taking derivatives and setting to zero:

    d/dθ [ (x1 − θ)²/(2σ²) + (x2 − θ)²/(2σ²) + (θ − θ0)²/(2σ0²) ]
      = −(x1 − θ)/σ² − (x2 − θ)/σ² + (θ − θ0)/σ0² = 0.

Equating to zero yields

    θ̂ = ( (x1 + x2)σ0² + θ0σ² ) / ( 2σ0² + σ² ).

If σ0 = 0 (certain prior), then θ̂ = θ0.
If σ0 = ∞ (useless prior), then θ̂ = (x1 + x2)/2.
When N is Arbitrary
General N:

    θ̂ = argmax_θ [ ∏_{n=1}^N p_{X|Θ}(x_n|θ) ] p_Θ(θ)
       = argmax_θ (1/√(2πσ²))^N exp( −∑_{n=1}^N (x_n − θ)²/(2σ²) )
                  · (1/√(2πσ0²)) exp( −(θ − θ0)²/(2σ0²) )
       = argmin_θ [ ∑_{n=1}^N (x_n − θ)²/(2σ²) + (θ − θ0)²/(2σ0²) ]
       = ( σ0² ∑_{n=1}^N x_n + θ0σ² ) / ( Nσ0² + σ² ).

What does it mean?
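The general-N estimator fits in a few lines. A sketch (the sampling setup and names are my own, for illustration): with little data the estimate is pulled toward the prior mean, while with a lot of data it tracks the sample mean.

```python
import numpy as np

def map_gaussian_mean(x, theta0, sigma, sigma0):
    """MAP of a Gaussian mean with a N(theta0, sigma0^2) prior, known sigma:
    (sigma0^2 * sum(x) + theta0 * sigma^2) / (N * sigma0^2 + sigma^2)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    return (sigma0 ** 2 * x.sum() + theta0 * sigma ** 2) / (n * sigma0 ** 2 + sigma ** 2)

rng = np.random.default_rng(0)
theta_true, sigma, theta0, sigma0 = 0.8, 0.3, 0.0, 0.5

x_small = rng.normal(theta_true, sigma, size=2)
x_large = rng.normal(theta_true, sigma, size=5000)

est_small = map_gaussian_mean(x_small, theta0, sigma, sigma0)  # pulled toward theta0
est_large = map_gaussian_mean(x_large, theta0, sigma, sigma0)  # close to the sample mean
```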
Interpreting the MAP solution: N → ∞

Let's do some algebra:

    θ̂ = ( (∑_{n=1}^N x_n)σ0² + θ0σ² ) / ( Nσ0² + σ² )
       = ( (∑_{n=1}^N x_n)σ0² + θ0σ² ) / ( N(σ0² + σ²/N) )
       = ( (1/N ∑_{n=1}^N x_n)σ0² + (σ²/N)θ0 ) / ( σ0² + σ²/N ).

Fix σ0 and σ. As N → ∞,

    θ̂ = ( (1/N ∑_{n=1}^N x_n)σ0² + (σ²/N)θ0 ) / ( σ0² + σ²/N ) → (1/N) ∑_{n=1}^N x_n.

This is the maximum-likelihood estimate.
When I have a lot of samples, the prior does not really matter.
Interpreting the MAP solution: N → 0

    θ̂ = ( (1/N ∑_{n=1}^N x_n)σ0² + (σ²/N)θ0 ) / ( σ0² + σ²/N )

Fix σ0 and σ. As N → 0,

    θ̂ = ( (1/N ∑_{n=1}^N x_n)σ0² + (σ²/N)θ0 ) / ( σ0² + σ²/N ) → θ0.

This is just the prior.
When I have very few samples, I should rely on the prior.
If the prior is good, then I can do well even if I have very few samples.
Maximum-likelihood does not have the same luxury!
What Happens to the Posterior?
Consider

    p(D|μ) = (1/√(2πσ²))^N exp( −∑_{n=1}^N (x_n − μ)²/(2σ²) ),
    p(μ)   = (1/√(2πσ0²)) exp( −(μ − μ0)²/(2σ0²) ).

We can show that the posterior is

    p(μ|D) = (1/√(2πσ_N²)) exp( −(μ − μ_N)²/(2σ_N²) ),

where

    μ_N  = σ²/(Nσ0² + σ²) · μ0 + Nσ0²/(Nσ0² + σ²) · μ_ML,
    σ_N² = σ²σ0² / (σ² + Nσ0²).
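The update (μ_N, σ_N²) can be written as a small function. A sketch under the same Gaussian assumptions, with my own variable names:

```python
import numpy as np

def posterior_mean_update(x, mu0, sigma0_sq, sigma_sq):
    """Return (mu_N, sigma_N^2): Gaussian posterior over the mean, given a
    known-variance Gaussian likelihood and a N(mu0, sigma0_sq) prior."""
    x = np.asarray(x, dtype=float)
    n = x.size
    mu_ml = x.mean() if n > 0 else 0.0
    denom = n * sigma0_sq + sigma_sq
    mu_n = (sigma_sq / denom) * mu0 + (n * sigma0_sq / denom) * mu_ml
    sigma_n_sq = sigma_sq * sigma0_sq / denom
    return mu_n, sigma_n_sq

# Same setting as the illustration on the next slide: data drawn with
# mu = 0.8 and sigma^2 = 0.1; the posterior variance shrinks as N grows.
rng = np.random.default_rng(1)
x = rng.normal(0.8, np.sqrt(0.1), size=100)
mu_n, var_n = posterior_mean_update(x, mu0=0.0, sigma0_sq=1.0, sigma_sq=0.1)
```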
What Happens to the Posterior?
[Figure] The x_n are generated from μ = 0.8 and σ² = 0.1. As N increases, the posterior shifts towards the true distribution.
Outline

Generative Approaches
  Lecture 9: Bayesian Decision Rules
  Lecture 10: Evaluating Performance
  Lecture 11: Parameter Estimation
  Lecture 12: Bayesian Prior
  Lecture 13: Connecting Bayesian and Linear Regression

Today's Lecture
  Basic Principles: Posterior, 1D Illustration, Interpretations
  Choosing Priors: Prior for Mean, Prior for Variance, Conjugate Prior
Prior for μ

So far we have considered

    p(D|μ) = (1/√(2πσ²))^N exp( −∑_{n=1}^N (x_n − μ)²/(2σ²) ),
    p(μ)   = (1/√(2πσ0²)) exp( −(μ − μ0)²/(2σ0²) ).

Unknown μ, and known σ².
The likelihood is Gaussian (by problem setup).
The prior for μ is Gaussian (by our choice).
Good, because the posterior remains Gaussian.
What happens if σ² is unknown but μ is known?
Prior for σ²

Let us define the precision: λ = 1/σ².
The likelihood is

    p(D|λ) = (1/√(2πσ²))^N exp( −∑_{n=1}^N (x_n − μ)²/(2σ²) )
           = ( λ^{N/2} / (√(2π))^N ) exp( −∑_{n=1}^N (λ/2)(x_n − μ)² )
           = (1/(2π)^{N/2}) λ^{N/2} exp( −(λ/2) ∑_{n=1}^N (x_n − μ)² ).

We want to choose p(λ) in a similar form:

    p(λ) = A λ^B exp(−Cλ),

so that the posterior p(λ|D) is easy to compute.
Prior for σ2
We want to choose p(λ) in a similar form:
p(λ) = AλB exp −CλThe candidate is ...
p(λ) =1
Γ(a)baλa−1 exp(−bλ)
This distribution is called the Gamma distribution Gam(λ|a, b).We can show that
E[λ] =a
b, Var[λ] =
a
b2.
21 / 32
c©Stanley Chan 2020. All Rights Reserved.
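These moments can be checked by sampling; a hedged sketch (numbers are illustrative). Note that NumPy's gamma sampler is parameterized by shape a and scale 1/b, so the rate b must be inverted:

```python
import numpy as np

# Gam(lambda | a, b) has mean a/b and variance a/b^2.
a, b = 5.0, 2.0
rng = np.random.default_rng(0)
samples = rng.gamma(shape=a, scale=1.0 / b, size=200_000)

emp_mean = samples.mean()   # should be close to a/b
emp_var = samples.var()     # should be close to a/b^2
```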
Prior for σ2
If we consider this pair of likelihood and prior
p(D|λ) =1
(2π)N/2λN/2 exp
−λ
2
N∑n=1
(xn − µ)2
p(λ) =1
Γ(a0)ba0
0 λa0−1 exp(−b0λ),
then the posterior is
p(λ|D) ∝ λ(a0+N/2)−1 exp
−
(b0 +
1
2
N∑n=1
(xn − µ)2
)λ
Just another Gamma distribution.
You can now do estimation on this Gamma by finding λ whichmaximizes the posterior. Details: See Appendix.
22 / 32
c©Stanley Chan 2020. All Rights Reserved.
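The conjugate update itself is two lines of arithmetic. A sketch with illustrative names and numbers (not from the lecture):

```python
import numpy as np

def gamma_posterior(x, mu, a0, b0):
    """Conjugate update for the precision lambda = 1/sigma^2 (known mean mu):
    Gam(a0, b0) prior -> Gam(a0 + N/2, b0 + 0.5 * sum((x - mu)^2)) posterior."""
    x = np.asarray(x, dtype=float)
    a_n = a0 + x.size / 2.0
    b_n = b0 + 0.5 * np.sum((x - mu) ** 2)
    return a_n, b_n

rng = np.random.default_rng(0)
true_var, mu = 0.25, 0.0                 # true precision = 1/0.25 = 4
x = rng.normal(mu, np.sqrt(true_var), size=2000)
a_n, b_n = gamma_posterior(x, mu, a0=2.0, b0=1.0)
post_mean = a_n / b_n                    # posterior mean of lambda
```

With a reasonable amount of data the posterior mean of λ lands near the true precision.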
Prior for Both μ and σ²

Again, let λ = 1/σ².
The likelihood is

    p(D|μ, λ) = (1/√(2πσ²))^N exp( −∑_{n=1}^N (x_n − μ)²/(2σ²) )
              ∝ [ λ^{1/2} exp(−λμ²/2) ]^N exp( λμ ∑_{n=1}^N x_n − (λ/2) ∑_{n=1}^N x_n² ).

The candidate for the prior is

    p(μ, λ) ∝ [ λ^{1/2} exp(−λμ²/2) ]^β exp( cλμ − dλ )
            = exp( −(βλ/2)(μ − c/β)² ) · λ^{β/2} exp( −(d − c²/(2β)) λ ),

where the first factor is N(μ|μ0, σ0²) and the second is Gam(λ|a, b).
The prior distribution is called the Normal-Gamma distribution.
Priors for High-Dimensional Gaussians

Let Λ = Σ^{−1}.
The likelihood is

    p(x_n|μ, Σ) = N(x_n|μ, Λ^{−1}).

Prior for μ: Gaussian.

    p(μ) = N(μ|μ0, Λ0^{−1}).

Prior for Λ: Wishart.

    p(Λ) = W(Λ|W, ν).

Prior for both μ and Λ: Normal-Wishart.

    p(μ, Λ) = N(μ|μ0, (βΛ)^{−1}) W(Λ|W, ν).
Conjugate Prior
You have a likelihood p_{X|Θ}(x|θ).
You want to choose a prior p_Θ(θ) so that the posterior p_{Θ|X}(θ|x) takes the same form as the prior.
Such a prior is called the conjugate prior (conjugate with respect to the likelihood).
Finding the conjugate prior may not be easy!
Good news: any likelihood belonging to the exponential family has a conjugate prior that is also in the exponential family.
Exponential family: Gaussian, Exponential, Poisson, Bernoulli, etc.
For more discussion, see Bishop Chapter 2.4.
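As a concrete non-Gaussian instance (my own example, not from the slides): the Bernoulli likelihood listed above has the Beta distribution as its conjugate prior, and the posterior update is pure counting.

```python
def beta_bernoulli_update(data, a, b):
    """Beta(a, b) prior on the Bernoulli parameter; observing k ones and
    (n - k) zeros gives a Beta(a + k, b + n - k) posterior -- same family."""
    k = sum(data)
    n = len(data)
    return a + k, b + (n - k)

data = [1, 0, 1, 1, 0, 1]                              # 4 ones, 2 zeros
a_post, b_post = beta_bernoulli_update(data, a=1, b=1)  # uniform Beta(1,1) prior
post_mean = a_post / (a_post + b_post)
```

Because the posterior is again a Beta, the update can be applied repeatedly as data streams in, which is the practical payoff of conjugacy.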
Reading List
Bayesian Parameter Estimation

Duda, Hart, and Stork, Pattern Classification, Chapters 3.3-3.5
Bishop, Pattern Recognition and Machine Learning, Chapter 2.4
M. Jordan (Berkeley), https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/other-readings/chapter9.pdf
CMU Note, http://www.cs.cmu.edu/~aarti/Class/10701_Spring14/slides/MLE_MAP_Part1.pdf
A. Kak (Purdue), https://engineering.purdue.edu/kak/Tutorials/Trinity.pdf
Appendix
Prior for σ²: Solution

The posterior is

    p(λ|D) ∝ λ^{(a0 + N/2) − 1} exp( −( b0 + (1/2) ∑_{n=1}^N (x_n − μ)² ) λ )
           ∝ λ^{a_N − 1} exp(−b_N λ).

The maximum-a-posteriori estimate of λ is

    λ̂ = argmax_λ p(λ|D)
       = argmax_λ λ^{a_N − 1} exp(−b_N λ)
       = argmax_λ (a_N − 1) log λ − b_N λ.

Taking the derivative and setting it to zero:

    d/dλ [ (a_N − 1) log λ − b_N λ ] = (a_N − 1)/λ − b_N = 0.
Prior for σ²: Solution

Therefore,

    λ̂ = (a_N − 1) / b_N,

where the parameters are

    a_N = a0 + N/2,
    b_N = b0 + (1/2) ∑_{n=1}^N (x_n − μ)² = b0 + (N/2) σ²_ML.

Hence, the MAP estimate is

    λ̂ = ( a0 + N/2 − 1 ) / ( b0 + (N/2) σ²_ML ).

As N → ∞, λ̂ → 1/σ²_ML.
As N → 0, λ̂ → (a0 − 1)/b0.
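These two limits can be verified numerically. A sketch using the posterior mode (a_N − 1)/b_N throughout (names and numbers are mine):

```python
import numpy as np

def lambda_map(x, mu, a0, b0):
    """MAP (mode) of the Gamma posterior over the precision lambda = 1/sigma^2."""
    x = np.asarray(x, dtype=float)
    a_n = a0 + x.size / 2.0
    b_n = b0 + 0.5 * np.sum((x - mu) ** 2)
    return (a_n - 1.0) / b_n

rng = np.random.default_rng(0)
mu, sigma = 0.0, 0.5                     # true precision 1/sigma^2 = 4
x = rng.normal(mu, sigma, size=50_000)

lam_large_n = lambda_map(x, mu, a0=2.0, b0=1.0)     # large N: near 1/sigma_ML^2
lam_no_data = lambda_map(x[:0], mu, a0=2.0, b0=1.0)  # N = 0: the prior mode (a0-1)/b0
```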
Prior for Both µ and σ2: Detailed Derivation
Prior for Both μ and σ²: Detailed Derivation

Again, let λ = 1/σ².
The likelihood is

    p(D|μ, λ) = (1/√(2πσ²))^N exp( −∑_{n=1}^N (x_n − μ)²/(2σ²) )
              = (λ/(2π))^{N/2} exp( −(λ/2) ∑_{n=1}^N (x_n − μ)² )
              = (λ/(2π))^{N/2} exp( −(λ/2) ∑_{n=1}^N (x_n² − 2μx_n + μ²) )
              = (λ/(2π))^{N/2} exp( −(λ/2) ∑_{n=1}^N x_n² + λμ ∑_{n=1}^N x_n ) · [ exp(−λμ²/2) ]^N
              = (1/(2π))^{N/2} [ λ^{1/2} exp(−λμ²/2) ]^N exp( λμ ∑_{n=1}^N x_n − (λ/2) ∑_{n=1}^N x_n² )
Prior for Both µ and σ2: Detailed Derivation
The likelihood is

    p(D|μ, λ) ∝ [ λ^{1/2} exp(−λμ²/2) ]^N exp( λμ ∑_{n=1}^N x_n − (λ/2) ∑_{n=1}^N x_n² ).

The candidate for the prior is

    p(μ, λ) ∝ [ λ^{1/2} exp(−λμ²/2) ]^β exp( cλμ − dλ )
            = [ exp(−λμ²/2) ]^β [ λ^{β/2} exp( cλμ − dλ ) ]
            = exp( −(βλ/2)(μ − c/β)² ) · λ^{β/2} exp( −(d − c²/(2β)) λ ),

where the first factor is N(μ|μ0, σ0²) and the second is Gam(λ|a, b), with

    μ0 = c/β,   σ0² = (βλ)^{−1},   a = 1 + β/2,   b = d − c²/(2β).
Prior for Both µ and σ2: Detailed Derivation
The prior distribution is

    p(μ, λ) = N( μ | μ0, (βλ)^{−1} ) Gam(λ|a, b).

This is called the Normal-Gamma distribution.
[Figure] A 2D plot of p(μ, λ) when μ0 = 0, β = 2, a = 5, b = 6.