![Page 1: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/1.jpg)
Statistical Machine LearningLecture 06: Probability Density Estimation
Kristian KerstingTU Darmstadt
Summer Term 2020
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 1 / 77
![Page 2: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/2.jpg)
Today’s Objectives
Make you understand how to do find p (x)
Covered TopicsDensity Estimation
Maximum Likelihood Estimation
Non-Parametric Methods
Mixture Models
Expectation Maximization
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 2 / 77
![Page 3: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/3.jpg)
Outline
1. Probability Density
2. Parametric modelsMaximum Likelihood Method
3. Non-Parametric ModelsHistogramsKernel Density EstimationK-nearest Neighbors
4. Mixture models
5. Wrap-Up
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 3 / 77
![Page 4: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/4.jpg)
1. Probability Density
Outline
1. Probability Density
2. Parametric modelsMaximum Likelihood Method
3. Non-Parametric ModelsHistogramsKernel Density EstimationK-nearest Neighbors
4. Mixture models
5. Wrap-Up
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 4 / 77
![Page 5: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/5.jpg)
1. Probability Density
Training Data
0 0.25 0.5 0.75 10
0.5
1
1.5
2
How do we get the probability distributions from this so that wecan classify with them?
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 5 / 77
![Page 6: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/6.jpg)
1. Probability Density
Probability Density Estimation
So far we have seen:Bayes optimal classification, based on probability distributionsp(x | Ck)p(Ck)
The prior p(Ck) is easy to deal with. We can “just count” thenumber of occurrences of each class in the training data
We need to estimate (learn) the class-conditional probabilitydensity p(x | Ck)
Supervised training: we know the input data points and their truelabels (classes)
Estimate the density separately for each class Ck
“Abbreviation”: p(x) = p(x | Ck)
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 6 / 77
![Page 7: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/7.jpg)
1. Probability Density
Probability Density Estimation
Training data
x1, x2, x3, . . .
Estimation
p(x)
MethodsParametric model
Non-parametric model
Mixture models
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 7 / 77
![Page 8: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/8.jpg)
2. Parametric models
Outline
1. Probability Density
2. Parametric modelsMaximum Likelihood Method
3. Non-Parametric ModelsHistogramsKernel Density EstimationK-nearest Neighbors
4. Mixture models
5. Wrap-Up
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 8 / 77
![Page 9: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/9.jpg)
2. Parametric models
2. Parametric models
Simple case: Gaussian Distribution
p (x|µ, σ) =1√2πσ2
exp
{−(x − µ)2
2σ2
}
Is governed by two parameters: mean and variance. That is, if weknow these parameters we can fully describe p(x)
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 9 / 77
![Page 10: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/10.jpg)
2. Parametric models
2. Parametric models
Notation for parametric density models
x ∼ p(x | θ)
For the Gaussian distribution
θ = (µ, σ)
x ∼ p(x∣∣∣µ, σ)
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 10 / 77
![Page 11: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/11.jpg)
2. Parametric models : Maximum Likelihood Method
2. Parametric models
Learning means to estimate the parameters θ given the trainingdata X = {x1, x2, . . .}
Likelihood of θ is defined as the probability that the data X wasgenerated from the probability density function with parametersθ
L (θ) = p (X | θ)
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 11 / 77
![Page 12: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/12.jpg)
2. Parametric models : Maximum Likelihood Method
Maximum Likelihood Method
Consider a set of points X = {x1, . . . , xN}
Computing the likelihoodOf a single datum? p (xn|θ)
Of all data?
Assumption: the data is i.i.d. (independent and identicallydistributed)
The random variables x1 and x2 are independent if
P (x1 ≤ α, x2 ≤ β) = P (x1 ≤ α) P (x2 ≤ β) ∀α, β ∈ R
The random variables x1 and x2 are identically distributed if
P (x1 ≤ α) = P (x2 ≤ α) ∀α ∈ R
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 12 / 77
![Page 13: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/13.jpg)
2. Parametric models : Maximum Likelihood Method
Maximum Likelihood Method
Likelihood
L (θ) = p (X | θ) = p(x1, . . . , xN
∣∣∣ θ)(using the i.i.d. assumption)= p (x1 | θ) · . . . · p (xn | θ)
=N∏n=1
p (xn | θ)
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 13 / 77
![Page 14: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/14.jpg)
2. Parametric models : Maximum Likelihood Method
Maximum log-Likelihood Method
Maximize the (log-)likelihood w.r.t. θ
log L (θ) = log p (X | θ) = logN∏n=1
p (xn | θ) =N∑n=1
log p (xn | θ)
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 14 / 77
![Page 15: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/15.jpg)
2. Parametric models : Maximum Likelihood Method
Maximum Likelihood Method - Gaussian
Maximum likelihood estimation of a Gaussian
µ̂, σ̂ = arg maxµ,σ
log L (θ) = log p (X | θ) =N∑n=1
log p(xn∣∣∣µ, σ)
Take the partial derivatives and set them to zero∂L∂µ
= 0,∂L∂σ
= 0
This leads to a closed form solution
µ̂ =1N
N∑n=1
xn
σ̂2 =1N
N∑n=1
(xn − µ̂)2
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 15 / 77
![Page 16: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/16.jpg)
2. Parametric models : Maximum Likelihood Method
Maximum Likelihood Method - Gaussian
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 16 / 77
![Page 17: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/17.jpg)
2. Parametric models : Maximum Likelihood Method
Likelihood
L (θ) = p (X | θ) =N∏n=1
p (xn | θ)
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 17 / 77
![Page 18: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/18.jpg)
2. Parametric models : Maximum Likelihood Method
Degenerate case
If N = 1, X = {x1}, the resulting Gaussian looks like
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 18 / 77
![Page 19: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/19.jpg)
2. Parametric models : Maximum Likelihood Method
Degenerate case
What can we do to still get a useful estimate?
We can put a prior on the mean!
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 19 / 77
![Page 20: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/20.jpg)
2. Parametric models : Maximum Likelihood Method
Bayesian Estimation
Bayesian estimation / learning of parametric distributions,assumes that the parameters are not fixed, but are randomvariables too
This allows us to use prior knowledge about the parameters
How do we achieve that?What do we want? A density model for x, p(x)
What do we have? Data X
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 20 / 77
![Page 21: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/21.jpg)
2. Parametric models : Maximum Likelihood Method
Bayesian Estimation
Formalize this as a conditional probability p(x∣∣∣X)
p(x∣∣∣X) =
∫p(x, θ
∣∣∣X) dθp(x, θ
∣∣∣X) = p(x∣∣∣ θ,X) p(θ ∣∣∣X)
p(x) can be fully determined with the parameters θ, i.e., θ is asufficient statistic
Hence, we have p(x∣∣∣ θ,X) = p
(x∣∣∣ θ)
p(x∣∣∣X) =
∫p(x∣∣∣ θ) p(θ ∣∣∣X) dθ
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 21 / 77
![Page 22: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/22.jpg)
2. Parametric models : Maximum Likelihood Method
Bayesian Estimation
p(x∣∣∣X) =
∫p(x∣∣∣ θ) p(θ ∣∣∣X) dθ
p(θ∣∣∣X) =
p(X∣∣∣ θ) p (θ)
p (X)= L (θ)
p (θ)
p (X)
p (X) =
∫p(X∣∣∣ θ) p (θ) dθ =
∫L (θ) p (θ) dθ
p(x∣∣∣X) =
1p (X)
∫p(x∣∣∣ θ) L (θ) p (θ) dθ
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 22 / 77
![Page 23: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/23.jpg)
2. Parametric models : Maximum Likelihood Method
Bayesian Estimation
p(x∣∣∣X) =
∫p(x∣∣∣ θ) p(θ ∣∣∣X) dθ
The probability p(θ∣∣∣X) makes it explicit how the parameter
estimation depends on the training data
If p(θ∣∣∣X) is small in most places, but large for a single θ̂ then
we can approximate
p(x∣∣∣X) ≈ p(x ∣∣∣ θ̂)
Sometimes referred to as the Bayes point
The more uncertain we are about estimating θ̂, the more weaverage
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 23 / 77
![Page 24: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/24.jpg)
2. Parametric models : Maximum Likelihood Method
Bayesian Estimation
Problem: In general, it is intractable to integrate out theparameters θ (or only possible to do so numerically)
Example with closed form solutionGaussian data distribution, the variance is known and fixed
We estimate the distribution of the mean
p(µ∣∣∣X) =
p(X∣∣∣µ) p (µ)
p (X)
With prior
p (µ) = N(µ0, σ
20
)
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 24 / 77
![Page 25: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/25.jpg)
2. Parametric models : Maximum Likelihood Method
Bayesian Estimation
Sample mean
x̄ =1N
N∑n=1
xn
Bayesian estimation
p(µ∣∣∣X) ∼ N (µN, σ2N)
µN =Nσ20 x̄ + σ2µ0
Nσ20 + σ2,
1σ2N
=Nσ2
+1σ20
Check what happens when N grows to infinity...
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 25 / 77
![Page 26: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/26.jpg)
2. Parametric models : Maximum Likelihood Method
Conjugate Priors
Conjugate Priors are prior distributions for the parameters thatdo not “change” the type of the parametric model
For example, as we saw that a Gaussian prior on the mean isconjugate to the Gaussian model. This works here because...
The product of two Gaussians is a Gaussian
The marginal of a Gaussian is a Gaussian
In general, it is not as easy!
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 26 / 77
![Page 27: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/27.jpg)
3. Non-Parametric Models
Outline
1. Probability Density
2. Parametric modelsMaximum Likelihood Method
3. Non-Parametric ModelsHistogramsKernel Density EstimationK-nearest Neighbors
4. Mixture models
5. Wrap-Up
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 27 / 77
![Page 28: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/28.jpg)
3. Non-Parametric Models
3. Non-Parametric Models
Why use Non-parametric representations?Often we do not know what functional form the class-conditionaldensity takes (or we do not know what class of function we need)
Probability density is estimated directly from the data (i.e.without an explicit parametric model)
Histograms
Kernel density estimation (Parzen windows)
K-nearest neighbors
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 28 / 77
![Page 29: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/29.jpg)
3. Non-Parametric Models : Histograms
Histograms
Discretize the feature space into bins
Not smooth enough
About right
Too smooth
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 29 / 77
![Page 30: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/30.jpg)
3. Non-Parametric Models : Histograms
Histograms
PropertiesThey are very general, because in the infinite data limit anyprobability density can be approximated arbitrarily well
At the same time it is a Brute-force method
ProblemsHigh-dimensional feature spaces
Exponential increase in the number of bins
Hence requires exponentially much data
Commonly known as the Curse of dimensionality
How to choose the size of the bins?
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 30 / 77
![Page 31: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/31.jpg)
3. Non-Parametric Models : Histograms
Curse of Dimensionality
For histograms
We will see that it is a general issue that we have to keep in mind
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 31 / 77
![Page 32: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/32.jpg)
3. Non-Parametric Models : Histograms
More formally
Data point x is sampled from probability density p (x)
Probability that x falls in region R
P (x ∈ R) =
∫Rp (x) dx
If R is sufficiently small, with volume V , then p (x) is almost constant
P (x ∈ R) =
∫Rp (x) dx ≈ p (x) V
If R is sufficiently large
P (x ∈ R) =KN
=⇒ p (x) ≈ KNV
where N is the number of total points and K is the number of pointsfalling in the region R
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 32 / 77
![Page 33: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/33.jpg)
3. Non-Parametric Models : Histograms
More formally
p (x) ≈ KNV
Kernel density estimation - Fix V and determine KExample: determine the number of data points K in a fixedhypercube
K-nearest neighbor - Fix K and determine VExample: increase the size of a sphere until K data points fall intothe sphere
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 33 / 77
![Page 34: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/34.jpg)
3. Non-Parametric Models : Kernel Density Estimation
Parzen Window
Hypercubes in d dimensions with edge length h
H (u) =
{1∣∣uj∣∣ ≤ h
2 , j = 1, . . . , d0 otherwise
V =
∫H (u) du = hd
K (x) =N∑n=1
H(x− x(n)
)
p (x) ≈ K (x)
NV=
1Nhd
N∑n=1
H(x− x(n)
)
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 34 / 77
![Page 35: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/35.jpg)
3. Non-Parametric Models : Kernel Density Estimation
Gaussian Kernel
H (u) =1(√2πh2
)d exp
{−‖u‖
2
2h2
}
V =
∫H (u) du = 1
K (x) =N∑n=1
H(x− x(n)
)
p (x) ≈ K (x)
NV=
1
N(√2πh2
)d N∑n=1
exp
{−∥∥x− x(n)∥∥2
2h2
}
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 35 / 77
![Page 36: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/36.jpg)
3. Non-Parametric Models : Kernel Density Estimation
General formulation - arbitrary kernel
k (u) ≥ 0,∫k (u) du = 1
V = hd
K (x) =N∑n=1
k
(∥∥x− x(n)∥∥h
)
p (x) ≈ K (x)
NV=
1Nhd
N∑n=1
k
(∥∥x− x(n)∥∥h
)
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 36 / 77
![Page 37: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/37.jpg)
3. Non-Parametric Models : Kernel Density Estimation
Common Kernels
Gaussian Kernel
k (u) =1√2π
exp
{−12u2}
Problem: kernel has infinite support
Requires a lot of computation
Parzen window
k (u) =
{1 |u| ≤ 1/20 otherwise
Not very smooth results
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 37 / 77
![Page 38: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/38.jpg)
3. Non-Parametric Models : Kernel Density Estimation
Common Kernels
Epanechnikov kernel
k (u) = max
{0,34
(1− u)2}
Smoother, but finite support
Problem with kernel methods: We have to select the kernelbandwidth h appropriately
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 38 / 77
![Page 39: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/39.jpg)
3. Non-Parametric Models : Kernel Density Estimation
Gaussian KDE Example
Not smooth enough
About right
Too smooth
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 39 / 77
![Page 40: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/40.jpg)
3. Non-Parametric Models : K-nearest Neighbors
Again to our definition
p (x) ≈ KNV
Kernel density estimation - Fix V and determine KExample: determine the number of data points K in a fixedhypercube
K-nearest neighbor - Fix K and determine VExample: increase the size of a sphere until K data points fall intothe sphere
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 40 / 77
![Page 41: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/41.jpg)
3. Non-Parametric Models : K-nearest Neighbors
K-Nearest Neighbors (kNN)
Not smooth enough
About right
Too smooth
Note: Blue rescaled for visualization
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 41 / 77
![Page 42: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/42.jpg)
3. Non-Parametric Models : K-nearest Neighbors
K-Nearest Neighbors (kNN)
Bayesian classification
P(Cj∣∣∣ x) =
P(x∣∣∣ Cj) P (Cj)P (x)
k-Nearest Neighbors classificationAssume we have a dataset of N points, where Nj is the number ofdata points in class Cj and
∑j Nj = N. To classify a point x we
draw a sphere centered in x that contains K points (from anyclasses). Assume the sphere has volume V and contains Kj pointsof class Cj
P (x) ≈ KNV
, P(x∣∣∣ Cj) ≈ Kj
NjV, P
(Cj)≈NjN
P(Cj∣∣∣ x) ≈ Kj
NjVNjNNVK
=KjK
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 42 / 77
![Page 43: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/43.jpg)
3. Non-Parametric Models : K-nearest Neighbors
Bias-Variance Problem
Nonparametric probability density estimationHistograms: Size of the bins?
too large: too smooth
too small: not smooth enough
Kernel density estimation: Kernel bandwidth?h too large: too smooth
h too small: not smooth enough
K-nearest neighbor: Number of neighbors?K too large: too smooth
K too small: not smooth enough
A general problem of many density estimation approaches,including parametric and mixture models
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 43 / 77
![Page 44: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/44.jpg)
4. Mixture models
Outline
1. Probability Density
2. Parametric modelsMaximum Likelihood Method
3. Non-Parametric ModelsHistogramsKernel Density EstimationK-nearest Neighbors
4. Mixture models
5. Wrap-Up
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 44 / 77
![Page 45: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/45.jpg)
4. Mixture models
4. Mixture models
Parametric modelsGaussian, NeuralNetworks, ...
Good analytic properties
Simple
Small memoryrequirements
Fast
Nonparametric modelsKernel Density Estimation,k-Nearest Neighbors, ...
General
Large memoryrequirements
Slow
Mixture models are a mix of parametric and nonparametric models
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 45 / 77
![Page 46: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/46.jpg)
4. Mixture models
Mixture of Gaussians (MoG)
Sum of individual Gaussian distributions
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 46 / 77
![Page 47: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/47.jpg)
4. Mixture models
Mixture of Gaussians
Sum of individual Gaussian distributions
In the limit (i.e. with many mixture components) this canapproximate every (smooth) density
p (x) =M∑j=1
p(x∣∣∣ j) p (j)
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 47 / 77
![Page 48: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/48.jpg)
4. Mixture models
Mixture of Gaussians
p (x) =M∑j=1
p(x∣∣∣ j) p (j)
p(x∣∣∣ j) = N
(x∣∣∣µj, σj) =
1√2πσ2j
exp
{−(x − µj
)22σ2j
}
p (j) = πj with 0 ≤ πj ≤ 1,M∑j=1
πj = 1
RemarksThe mixture density integrates to 1:
∫p (x) dx = 1
The mixture parameters are: θ = {µ1, σ1, π1, . . . , µM, σM, πM}
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 48 / 77
![Page 49: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/49.jpg)
4. Mixture models
Mixture of Gaussians - MLE
Maximum (log-)Likelihood EstimationDataset with N i.i.d. points {x1, . . . , xN}
L = log L (θ) =N∑n=1
log p(xn∣∣∣ θ)
∂L∂µj
= 0
µj =
∑Nn=1 p
(j∣∣∣ xn) xn∑N
n=1 p(j∣∣∣ xn)
What is the problem with this approach?
Circular dependency - No analytical solution!
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 49 / 77
![Page 50: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/50.jpg)
4. Mixture models
Mixture of Gaussians - MLE Gradient Ascent
Maximum (log-)Likelihood EstimationDataset with N i.d.d. points {x1, . . . , xN}
L = log L (θ) =N∑n=1
log p(xn∣∣∣ θ)
∂L∂µj
= 0
Gradient ascentComplex gradient (nonlinear, circular dependencies)
Optimization of one Gaussian component depends on all othercomponents
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 50 / 77
![Page 51: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/51.jpg)
4. Mixture models
Mixture of Gaussians - Different strategy
Unobserved := hidden or latent variables (j|x)
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 51 / 77
![Page 52: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/52.jpg)
4. Mixture models
Mixture of Gaussians - Different strategy
Suppose we knew the observed and unobserved dataset (alsocalled the complete dataset)
Then we can compute the maximum likelihood solution ofcomponents 1 and 2
µ1 =
∑Nn=1 p
(1∣∣∣ xn) xn∑N
n=1 p(1∣∣∣ xn) µ2 =
∑Nn=1 p
(2∣∣∣ xn) xn∑N
n=1 p(2∣∣∣ xn)
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 52 / 77
![Page 53: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/53.jpg)
4. Mixture models
Mixture of Gaussians - Different strategy
Suppose we knew the distributions
We can infer the unobserved data using Bayes Decision Rule.Namely we decide 1 if
p(j = 1
∣∣∣ x) > p(j = 2∣∣∣ x)
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 53 / 77
![Page 54: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/54.jpg)
4. Mixture models
Mixture of Gaussians - Chicken and Eggproblem
We have big problem at hand... we neither know the distributionnor the unobserved data!
To break this loop, we need some estimation of the unobserveddata j
Temporary solution: Clustering (to be replaced soon)
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 54 / 77
![Page 55: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/55.jpg)
4. Mixture models
Estimation using Clustering
Clustering with hard assignments
Somehow assign mixturelabels to each data point
Estimate the mixturecomponent only from itsdata
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 55 / 77
![Page 56: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/56.jpg)
4. Mixture models
Mixture of Gaussians
Suppose we had a guess about the distribution, but did not knowthe unobserved data
Compute the probability for each mixture component:
p(j = 1
∣∣∣ x) =p(x∣∣∣ 1) p (1)p (x)
=p(x∣∣∣ 1)π1∑M
j=1 p(x∣∣∣ j)πj
p(j = 2
∣∣∣ x) =p(x∣∣∣ 2) p (2)p (x)
=p(x∣∣∣ 2)π2∑M
j=1 p(x∣∣∣ j)πj
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 56 / 77
![Page 57: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/57.jpg)
4. Mixture models
Expectation Maximization - Clustering
Clustering with soft assignments
Expectation-step of the EM-algorithm (shortly)
We can determine the means by maximum likelihood estimation
µj =
∑Nn=1 p
(j∣∣∣ xn) xn∑N
n=1 p(j∣∣∣ xn)
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 57 / 77
![Page 58: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/58.jpg)
4. Mixture models
Expectation Maximization Algorithm
AlgorithmInitialize with (random) means: µ1, µ2, . . . , µM
While stop-condition is not metE-step: Compute the posterior distribution for each mixturecomponent and for all data points
p(j∣∣∣ xn)
M-step: Compute the new means as the weighted means of alldata points
µj =
∑Nn=1 p
(j∣∣∣ xn) xn∑N
n=1 p(j∣∣∣ xn)
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 58 / 77
![Page 59: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/59.jpg)
4. Mixture models
Expectation Maximization
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 59 / 77
![Page 60: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/60.jpg)
4. Mixture models
Expectation Maximization (EM) Algorithm
Expectation-Maximization (EM) AlgorithmMethod for performing maximum likelihood estimation, evenwhen the data is incomplete (i.e. we only have access to observedvariables)
Idea: if we have unknown values in our estimation problem(so-called hidden variables) we can use EM
Assume:Observed (incomplete) data: X = {x1, . . . , xN}
Unobserved (hidden) data: Y = {y1, . . . , yN}
In case of Gaussian mixtures:Association of every data point to one of the mixture components
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 60 / 77
![Page 61: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/61.jpg)
4. Mixture models
Properties of EM
Incomplete (observed) data: X = {x1, . . . , xN}
Hidden (unobserved) data: Y = {y1, . . . , yN}
Complete data: Z = (X, Y)
Joint density
p (Z) = p (X, Y) = p(Y∣∣∣X) p (X)
With parameters
p(Z∣∣∣ θ) = p
(X, Y
∣∣∣ θ) = p(Y∣∣∣X, θ) p(X ∣∣∣ θ)
In the case of Gaussian mixturesp(X∣∣∣ θ) - likelihood of the mixture model
p(Y∣∣∣X, θ) - predictions of the mixture component
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 61 / 77
![Page 62: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/62.jpg)
4. Mixture models
Properties of EM
Incomplete likelihood
L(θ∣∣∣X) = p
(X∣∣∣ θ) =
N∏n=1
p(xn∣∣∣ θ)
Complete likelihood
L(θ∣∣∣ Z) = p
(Z∣∣∣ θ) = p
(X, Y
∣∣∣ θ) = p(Y∣∣∣X, θ) p(X ∣∣∣ θ)
=N∏n=1
p(yn∣∣∣ xn, θ) p(xn ∣∣∣ θ)
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 62 / 77
![Page 63: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/63.jpg)
4. Mixture models
EM Algorithm
We don’t know Y , but if we have the current guess θi−1 of theparameters θ, we can it use that to predict Y
Formally we compute the expected value of the (complete)log-likelihood given the data X and the current estimation of θ
EY[log p
(X, Y
∣∣∣ θ) ∣∣∣X, θi−1] =: Q(θ, θi−1
)X - fixed; Y - random variable; θ - variable; θi−1 - currentestimation of the parameters (fixed)
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 63 / 77
![Page 64: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/64.jpg)
4. Mixture models
Properties of the EM Algorithm
Maximize the expected complete log-likelihood
Q(θ, θi−1
)= EY
[log p
(X, Y
∣∣∣ θ) ∣∣∣X, θi−1]=
∫p(y∣∣∣X, θi−1) log p
(X, y
∣∣∣ θ) dy
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 64 / 77
![Page 65: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/65.jpg)
4. Mixture models
Properties of the EM Algorithm
Q(θ, θi−1
)=
∫p(y∣∣∣X, θi−1) log p
(X, y
∣∣∣ θ) dyE-step (expectation): compute p
(y∣∣∣X, θi−1) to be able to
compute the expectation Q(θ, θi−1
)M-step (maximization): maximize the expected value of thecomplete log-likelihood
θi = arg maxθQ(θ, θi−1
)
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 65 / 77
![Page 66: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/66.jpg)
4. Mixture models
Formal Properties of the EM Algorithm
Main result from Dempster et al, Maximum Likelihood fromIncomplete Data via the EM Algorithm, 1977
The expected complete log-likelihood of the i-th iteration is atleast as good as that of the (i-1)-th iteration:
Q(θi, θi−1
)≥ Q
(θi−1, θi−1
)If this expectation is maximized w.r.t. θi , then it holds that:
L(θi∣∣∣X) ≥ L(θi−1 ∣∣∣X)
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 66 / 77
Dempster (1929-)Laird (1943-)Rubin (1942-)
![Page 67: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/67.jpg)
4. Mixture models
Formal Properties of the EM Algorithm
Consequence of the previous statementsThe incomplete log-likelihood increases in every iteration (or atleast stays the same)
The incomplete log-likelihood is maximized (locally)
In practiceThe quality of the results depends on the initialization
If we initialize poorly, we may get stuck in poor local optima
EM relies on good initialization of the parameters
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 67 / 77
![Page 68: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/68.jpg)
4. Mixture models
Special case - Gaussian Mixtures
For mixtures of Gaussians there is a closed form solution
Look at the fully general case: also estimate the variances of themixture components and the prior distribution over the mixturecomponents
θi = arg maxθQ(θ, θi−1
)
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 68 / 77
![Page 69: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/69.jpg)
4. Mixture models
EM for Gaussian Mixtures
AlgorithmInitialize parameters: µ1, σ1, π1 . . .
While stop-condition is not metE-step: Compute the posterior distribution, also calledresponsibility, for each mixture component and for all data points
αnj = p(j∣∣∣ xn) =
πjN(xn∣∣∣µj, σj)∑M
i=1 πiN(xn∣∣∣µi, σi)
M-step: Compute the new parameters using weighted estimates
µnewj =1Nj
N∑n=1
αnjxn with Nj =N∑n=1
αnj
(σnewj
)2=1Nj
N∑n=1
αnj
(xn − µnewj
)2, πnewj =
NjN
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 69 / 77
![Page 70: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/70.jpg)
4. Mixture models
Expectation Maximization
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 70 / 77
![Page 71: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/71.jpg)
4. Mixture models
How many components?
How many mixture components do we need?More components will typically lead to a better likelihood
But are more components necessarily better? Not always,because of overfitting!
(Simple) automatic selectionFind K that maximizes the Akaike information criterion
log p(X∣∣∣ θML)− K
where K is the number of parameters
Or find K that maximizes the Bayesian information criterion
log p(X∣∣∣ θML)− 12K logN
where N is the number of data points
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 71 / 77
![Page 72: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/72.jpg)
4. Mixture models
Before we move on... It is important tounderstand
Mixture models are much more general than mixtures ofGaussians
One can have mixtures of any parametric distribution, and evenmixtures of different parametric distributions
Gaussian mixtures are only one of many possibilities, though byfar the most common one
Expectation maximization is not just for fitting mixtures ofGaussians
One can fit other mixture models with EM
EM is still more general, in that it applies to many other hiddenvariable models
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 72 / 77
![Page 73: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/73.jpg)
5. Wrap-Up
Outline
1. Probability Density
2. Parametric modelsMaximum Likelihood Method
3. Non-Parametric ModelsHistogramsKernel Density EstimationK-nearest Neighbors
4. Mixture models
5. Wrap-Up
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 73 / 77
![Page 74: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/74.jpg)
5. Wrap-Up
5. Wrap-Up
You know now:The difference between parametric and non-parametric models
More about the likelihood function and how to derive themaximum likelihood estimators for the Gaussian distribution
What Bayesian estimation is
Different non-parametric models (histogram, kernel densityestimation and k-nearest neighbors)
What mixture models are
What the Expectation-Maximization idea and algorithm are
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 74 / 77
![Page 75: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/75.jpg)
5. Wrap-Up
Self-Test Questions
Where do we get the probability of data from?
What are parametric methods and how to obtain theirparameters?
How many parameters have non-parametric methods?
What are mixture models?
Should gradient methods be used for training mixture models?
How does the EM algorithm work?
What is the biggest problem of mixture models?
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 75 / 77
![Page 76: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/76.jpg)
5. Wrap-Up
Homework
Reading Assignment for next lectureClustering: Murphy ch. 25
Bias & Variance: Bishop ch. 3.2, Murphy ch. 6.4
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 76 / 77
![Page 77: Statistical Machine Learning - GitHub Pages...1.Probability Density Probability Density Estimation So far we have seen: Bayes optimal classification, based on probability distributions](https://reader035.vdocuments.site/reader035/viewer/2022071603/613e028859df6428461641b2/html5/thumbnails/77.jpg)
5. Wrap-Up
References
EM Standard ReferenceA.P. Dempster, N.M. Laird, D.B. Rubin, Maximum-Likelihood fromincomplete data via EM algorithm, In Journal Royal StatisticalSociety, Series B. Vol. 39, 1977
EM TutorialJeff A. Bilmes, A Gentle Tutorial of the EM Algorithm and itsApplication to Parameter Estimation for Gaussian Mixture andHidden Markov Models, TR-97-021, ICSI, U.C. Berkeley, CA, USA
Modern interpretationNeal, R.M. and Hinton, G.E., A view of the EM algorithm thatjustifies incremental, sparse, and other variants, In Learning inGraphical Models, M.I. Jordan (editor)
K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 77 / 77