Probability Theory (review) Bayes Decision Theory Probability Density Estimation
Introduction to Machine Learning
Lecture 2
Chaohui Wang
October 14, 2019
Chaohui Wang Introduction to Machine Learning 1 / 63
Outline of This Lecture
Probability Theory (review)
Bayes Decision Theory
Probability Density Estimation
Chaohui Wang Introduction to Machine Learning 2 / 63
Basic Concepts
Let us consider the scenario where:
• Two discrete variables: X ∈ {xi} and Y ∈ {yj}
• N trials, with the counts denoted:
  nij = #{X = xi ∧ Y = yj},  ci = #{X = xi},  rj = #{Y = yj}
→ We then have:
• Joint probability: Pr(X = xi, Y = yj) = nij / N
• Marginal probability: Pr(X = xi) = ci / N
• Conditional probability: Pr(Y = yj | X = xi) = nij / ci
• Sum rule: Pr(X = xi) = (1/N) ∑_{j=1}^L nij = ∑_{j=1}^L Pr(X = xi, Y = yj)
• Product rule: Pr(X = xi, Y = yj) = nij/N = (nij/ci) · (ci/N) = Pr(Y = yj | X = xi) Pr(X = xi)
Chaohui Wang Introduction to Machine Learning 4 / 63
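The counting relations above can be checked numerically. A minimal sketch, using a purely illustrative contingency table of counts (the numbers are not from the lecture):

```python
# Hypothetical 2x3 contingency table of counts n_ij for X in {x1, x2}
# and Y in {y1, y2, y3}; the values are illustrative only.
n = [[10, 20, 30],
     [15, 5, 20]]
N = sum(sum(row) for row in n)           # total number of trials

i, j = 0, 1                              # look at X = x1, Y = y2
c_i = sum(n[i])                          # c_i = #{X = x_i}

joint = n[i][j] / N                      # Pr(X = x_i, Y = y_j)
marginal = c_i / N                       # Pr(X = x_i)
conditional = n[i][j] / c_i              # Pr(Y = y_j | X = x_i)

# Sum rule: the marginal equals the sum of the joints over j
assert abs(marginal - sum(n[i][jj] / N for jj in range(3))) < 1e-12
# Product rule: the joint equals conditional times marginal
assert abs(joint - conditional * marginal) < 1e-12
```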
The Rules of Probability
→ Thus we have:
• Sum rule: p(X) = ∑_Y p(X, Y)
• Product rule: p(X, Y) = p(Y|X) p(X)
→ Finally, we can derive:
• Bayes' Theorem: p(Y|X) = p(X|Y) p(Y) / p(X), with p(X) = ∑_Y p(X|Y) p(Y)
Chaohui Wang Introduction to Machine Learning 5 / 63
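A quick numeric sketch of the sum rule and Bayes' Theorem, with illustrative (made-up) distributions p(Y) and p(X|Y):

```python
# Hypothetical discrete distributions; all numbers are illustrative.
p_y = {'y1': 0.6, 'y2': 0.4}                       # p(Y)
p_x_given_y = {('x1', 'y1'): 0.9, ('x2', 'y1'): 0.1,
               ('x1', 'y2'): 0.3, ('x2', 'y2'): 0.7}  # p(X|Y)

x = 'x1'
# Sum rule for the evidence: p(x) = sum over Y of p(x|Y) p(Y)
p_x = sum(p_x_given_y[(x, y)] * p_y[y] for y in p_y)

# Bayes' Theorem: p(y|x) = p(x|y) p(y) / p(x)
posterior = {y: p_x_given_y[(x, y)] * p_y[y] / p_x for y in p_y}

assert abs(sum(posterior.values()) - 1.0) < 1e-12  # posteriors normalize
```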
Probability Densities
• Probabilities over continuous variables are defined over their probability density function (pdf) p(x):
  Pr(x ∈ (a, b)) = ∫_a^b p(x) dx
• Cumulative distribution function: the probability that x lies in the interval (−∞, z):
  P(z) = ∫_{−∞}^z p(x) dx
Chaohui Wang Introduction to Machine Learning 6 / 63
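As a sketch, the CDF integral can be approximated numerically for the standard normal pdf and compared against the closed form via the error function (the integration bounds and step count below are arbitrary choices):

```python
import math

def pdf(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def cdf_numeric(z, lo=-10.0, steps=100000):
    """P(z) = integral of p(x) from -inf to z, approximated by a midpoint
    sum (the tail below `lo` is negligible for the standard normal)."""
    h = (z - lo) / steps
    return sum(pdf(lo + (k + 0.5) * h) for k in range(steps)) * h

# Closed form via the error function, for comparison
cdf_exact = 0.5 * (1 + math.erf(1.0 / math.sqrt(2)))
assert abs(cdf_numeric(1.0) - cdf_exact) < 1e-6
```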
Expectations
• Expectation: the average value of some function f(x) under a probability distribution
  discrete case: E[f] = ∑_x p(x) f(x)
  continuous case: E[f] = ∫ p(x) f(x) dx
→ Given N samples {xn} drawn from the distribution, the expectation can be approximated by:
  E[f] ≈ (1/N) ∑_{n=1}^N f(xn)
• Conditional expectation:
  discrete case: E_x[f|y] = ∑_x p(x|y) f(x)
  continuous case: E_x[f|y] = ∫ p(x|y) f(x) dx
Chaohui Wang Introduction to Machine Learning 7 / 63
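The sample-average approximation can be sketched with a known case: for f(x) = x² under a standard normal, E[f] = Var + mean² = 1 (sample size and seed below are arbitrary):

```python
import random

random.seed(0)

# Approximate E[f] with f(x) = x^2 under a standard normal; the true
# value is Var(x) + E[x]^2 = 1.
N = 200000
samples = [random.gauss(0.0, 1.0) for _ in range(N)]
estimate = sum(x * x for x in samples) / N   # E[f] ≈ (1/N) Σ f(x_n)

assert abs(estimate - 1.0) < 0.02
```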
Variances and Covariances
• Variance of a function f(x):
  var[f] = E[(f(x) − E[f(x)])²] = E[f(x)²] − E[f(x)]²
• Covariance between variables X and Y:
  cov[X, Y] = E_{x,y}[{x − E[x]}{y − E[y]}] = E_{x,y}[xy] − E[x]E[y]
→ Covariance matrix in case X and Y are vectors:
  cov[X, Y] = E_{x,y}[xyᵀ] − E[x]E[yᵀ]
Chaohui Wang Introduction to Machine Learning 8 / 63
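The two covariance expressions can be verified to agree on a small (purely illustrative) sample:

```python
# Check cov[x, y] = E[xy] - E[x]E[y] on a small hypothetical sample.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 1.0, 4.0, 3.0]
n = len(xs)

Ex = sum(xs) / n
Ey = sum(ys) / n
Exy = sum(x * y for x, y in zip(xs, ys)) / n

# Definition: E[{x - E[x]}{y - E[y]}]
cov_def = sum((x - Ex) * (y - Ey) for x, y in zip(xs, ys)) / n
# Equivalent form: E[xy] - E[x]E[y]
cov_alt = Exy - Ex * Ey
assert abs(cov_def - cov_alt) < 1e-12
```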
Outline of This Lecture
Probability Theory (review)
Bayes Decision Theory
Probability Density Estimation
Chaohui Wang Introduction to Machine Learning 9 / 63
Classification Example
• Handwritten character recognition
→ Goal: classify a letter in a test image such that the probability of misclassification is minimized.
Chaohui Wang Introduction to Machine Learning 10 / 63
Priors
• Concept 1: priors (a priori probabilities) p(Ck)
• What we "know" (or assume in practice) about the probability before seeing the data.
  Example: C1 = a, C2 = b, p(C1) = 0.75, p(C2) = 0.25
→ In general: ∑_k p(Ck) = 1
Chaohui Wang Introduction to Machine Learning 11 / 63
Conditional probabilities
• Concept 2: Conditional probabilities p(x|Ck)
• Feature vector x: characterizes certain properties of the input.
• p(x|Ck): describes the likelihood of x for a given class Ck
Example:
Chaohui Wang Introduction to Machine Learning 12 / 63
How to decide?
• Example:
• Question: Which class to choose?
Chaohui Wang Introduction to Machine Learning 13 / 63
→ Since p(x|b) is much smaller than p(x|a), the decision should be 'a' here.
→ In a second example, since p(x|a) is much smaller than p(x|b), the decision should be 'b' here.
→ Attention: p(a) = 0.75 and p(b) = 0.25! What should we do in this case?
Posterior probabilities
• Concept 3: posterior probabilities p(Ck|x)
• p(Ck|x) characterizes the probability of class Ck given the feature vector x.
• Bayes' Theorem:
  p(Ck|x) = p(x|Ck) p(Ck) / p(x) = p(x|Ck) p(Ck) / ∑_i p(x|Ci) p(Ci)
• Interpretation:
  Posterior = (Likelihood × Prior) / Normalization Factor
Chaohui Wang Introduction to Machine Learning 14 / 63
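A sketch of the posterior computation, reusing the lecture's priors p(a) = 0.75, p(b) = 0.25; the likelihood values at the observed x are hypothetical:

```python
# Hypothetical likelihoods p(x|C_k) at one observed x, plus the priors.
likelihood = {'a': 0.05, 'b': 0.10}   # p(x|C_k), illustrative values
prior = {'a': 0.75, 'b': 0.25}        # p(C_k), as in the lecture example

evidence = sum(likelihood[c] * prior[c] for c in prior)   # p(x)
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}

# Even though p(x|b) is twice p(x|a), the prior tips the posterior to 'a'.
assert posterior['a'] > posterior['b']
assert abs(sum(posterior.values()) - 1.0) < 1e-12
```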
How to decide?
Chaohui Wang Introduction to Machine Learning 15 / 63
Bayesian Decision Theory
• Goal: Minimize the probability of a misclassification
Chaohui Wang Introduction to Machine Learning 16 / 63
Bayesian Decision Theory
• Optimal decision rule:
  • Decide for C1 if p(C1|x) > p(C2|x), and vice versa.
→ p(C1|x) > p(C2|x) is equivalent to:
  p(x|C1) p(C1) > p(x|C2) p(C2)
→ Further equivalent to the likelihood-ratio test:
  p(x|C1) / p(x|C2) > p(C2) / p(C1)
Chaohui Wang Introduction to Machine Learning 17 / 63
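The likelihood-ratio test can be sketched as a small function; the numeric values in the usage lines are illustrative only:

```python
def decide(px_c1, px_c2, p_c1, p_c2):
    """Likelihood-ratio test: choose C1 iff
    p(x|C1)/p(x|C2) > p(C2)/p(C1),
    equivalently p(x|C1) p(C1) > p(x|C2) p(C2)."""
    return 'C1' if px_c1 * p_c1 > px_c2 * p_c2 else 'C2'

# Illustrative numbers: the likelihood favors C2 but the prior favors C1.
assert decide(px_c1=0.05, px_c2=0.10, p_c1=0.75, p_c2=0.25) == 'C1'
# With equal priors the likelihood alone decides.
assert decide(px_c1=0.05, px_c2=0.10, p_c1=0.50, p_c2=0.50) == 'C2'
```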
Generalization to More Than 2 Classes
• Decide for class k if it has the greatest posterior probability of all classes:
  p(Ck|x) > p(Cj|x), ∀j ≠ k
  p(x|Ck) p(Ck) > p(x|Cj) p(Cj), ∀j ≠ k
→ Example:
→ Likelihood-ratio test:
  p(x|Ck) / p(x|Cj) > p(Cj) / p(Ck), ∀j ≠ k
Chaohui Wang Introduction to Machine Learning 18 / 63
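For more than two classes, the pairwise comparisons reduce to taking an argmax; a minimal sketch with illustrative likelihoods and priors:

```python
# Choose the class maximizing p(x|C_k) p(C_k), which realizes
# p(C_k|x) > p(C_j|x) for all j != k (the evidence p(x) cancels).
likelihood = {'C1': 0.02, 'C2': 0.10, 'C3': 0.05}  # p(x|C_k), illustrative
prior = {'C1': 0.5, 'C2': 0.2, 'C3': 0.3}          # p(C_k), illustrative

best = max(prior, key=lambda k: likelihood[k] * prior[k])
assert best == 'C2'   # scores: 0.010, 0.020, 0.015
```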
Classifying with Loss Functions
• Generalization to decisions with a loss function
  • Allowing inhomogeneous loss for different kinds of misclassification
  • Can be asymmetric, for example:
    loss(decision = healthy | patient = sick) >> loss(sick | healthy)
• Formalized using a loss matrix: Lkj is the loss for choosing Cj while the truth is Ck
→ For example:
Chaohui Wang Introduction to Machine Learning 19 / 63
Classifying with Loss Functions
• Goal: choose the class that minimizes the loss
→ But the loss function depends on the true class, which is unknown
• Solution: minimize the expected loss
  E[L] = ∑_k ∑_j ∫_{Rj} Lkj p(x, Ck) dx
→ This can be done by choosing, for each x, the region Rj such that
  ∑_k Lkj p(Ck|x)
  is minimized
→ It is still the posterior probability p(Ck|x) that matters!
Chaohui Wang Introduction to Machine Learning 20 / 63
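The per-x minimization can be sketched with the sick/healthy example from above; the loss values and posteriors are hypothetical:

```python
# Choose the decision j minimizing the expected loss sum_k L[k][j] p(C_k|x).
# Loss matrix entry L[(truth, decision)]: cost of deciding `decision` when
# the truth is `truth`. Asymmetric, illustrative values.
L = {('sick', 'sick'): 0, ('sick', 'healthy'): 100,
     ('healthy', 'sick'): 1, ('healthy', 'healthy'): 0}
posterior = {'sick': 0.1, 'healthy': 0.9}   # p(C_k|x), illustrative

def expected_loss(decision):
    return sum(L[(truth, decision)] * posterior[truth] for truth in posterior)

best = min(['sick', 'healthy'], key=expected_loss)
# Deciding 'healthy' risks 100 * 0.1 = 10; deciding 'sick' risks 1 * 0.9 = 0.9,
# so the asymmetric loss overrides the larger posterior for 'healthy'.
assert best == 'sick'
```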
Classifying with Loss Functions
• For the binary classification problem: decide for C1 if
  p(x|C1) / p(x|C2) > (L21 − L22) p(C2) / ((L12 − L11) p(C1))
→ Recall the likelihood-ratio test: p(x|C1) / p(x|C2) > p(C2) / p(C1)
→ Taking the loss function into account leads to the generalization above
Chaohui Wang Introduction to Machine Learning 21 / 63
Classification via Discriminant Functions
• Formulate classification in terms of comparisons
• Discriminant functions: y1(x), . . . , yK(x)
• Classify x as class Ck if:
  yk(x) > yj(x), ∀j ≠ k
→ Examples (Bayes decision theory):
  yk(x) = p(Ck|x)
  yk(x) = p(x|Ck) p(Ck)
  yk(x) = log p(x|Ck) + log p(Ck)
→ Question: how do we represent and estimate the probabilities p(x|Ck) and p(Ck)?
→ Probability density estimation. E.g., in supervised training, data and class labels are known.
Chaohui Wang Introduction to Machine Learning 22 / 63
Outline of This Lecture
Probability Theory (review)
Bayes Decision Theory
Probability Density Estimation
Chaohui Wang Introduction to Machine Learning 23 / 63
Probability Density Estimation
• Methods• Parametric• Non-parametric• Mixture models
Chaohui Wang Introduction to Machine Learning 24 / 63
Parametric Methods
• Given
  • Data X = {x1, x2, . . . , xN}
  • Parametric form of the distribution with parameters θ
    → e.g., Gaussian distribution: θ = (µ, σ)
• Learning
  → Estimation of the parameters θ
→ For example: using a Gaussian distribution as the parametric model, what is θ = (µ, σ)?
Chaohui Wang Introduction to Machine Learning 25 / 63
Maximum Likelihood Approach
• Likelihood L(θ) of θ: probability that the data X have indeed been generated from a probability density with parameters θ:
  L(θ) = p(X|θ)
• Computation of the likelihood
  • Single data point: p(xn|θ)
  • Assuming that all data points are independent:
    L(θ) = ∏_{n=1}^N p(xn|θ)
  • Negative log-likelihood:
    E(θ) = − log L(θ) = − ∑_{n=1}^N log p(xn|θ)
• Estimation/learning of the parameters θ
  • Maximize the likelihood → minimize the negative log-likelihood
Chaohui Wang Introduction to Machine Learning 26 / 63
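A sketch of the negative log-likelihood for a 1D Gaussian (the data points are illustrative); it is smallest when µ sits at the sample mean:

```python
import math

def nll_gaussian(data, mu, sigma):
    """Negative log-likelihood E(theta) = -sum_n log p(x_n | mu, sigma)
    for a 1D Gaussian, assuming i.i.d. samples."""
    const = math.log(sigma * math.sqrt(2 * math.pi))
    return sum(const + (x - mu) ** 2 / (2 * sigma ** 2) for x in data)

data = [1.2, 0.8, 1.1, 0.9]   # illustrative samples, mean = 1.0
# The NLL is lower at the sample mean than away from it.
assert nll_gaussian(data, mu=1.0, sigma=0.2) < nll_gaussian(data, mu=2.0, sigma=0.2)
```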
Maximum Likelihood Approach
• How to minimize the negative log-likelihood?
→ Take the derivative and set it to zero
• Result for the Normal distribution (1D case): θ̂ = (µ̂, σ̂)
  µML = (1/N) ∑_{n=1}^N xn,   σ²ML = (1/N) ∑_{n=1}^N (xn − µML)²
→ Unfortunately, the variance estimate is biased ...
→ Assuming the samples {xn} come from a true Gaussian distribution with mean µ and variance σ², we have:
  E[µML] = µ,   E[σ²ML] = ((N − 1)/N) σ²
• Corrected estimate: σ̃² = (N/(N−1)) σ²ML = (1/(N−1)) ∑_{n=1}^N (xn − µML)²
Chaohui Wang Introduction to Machine Learning 27 / 63
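The bias can be observed empirically; a sketch that averages both estimators over many small samples from N(0, 1) (sample size, trial count, and seed are arbitrary):

```python
import random

random.seed(1)

# Compare sigma^2_ML (divide by N) with the corrected estimate
# (divide by N-1) on many small samples from N(0, 1).
N, trials = 5, 20000
ml, corrected = 0.0, 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, 1.0) for _ in range(N)]
    mu = sum(xs) / N
    ss = sum((x - mu) ** 2 for x in xs)
    ml += ss / N
    corrected += ss / (N - 1)
ml /= trials
corrected /= trials

# E[sigma^2_ML] = (N-1)/N * sigma^2 = 0.8 here, while the corrected
# estimate is unbiased (expected value 1.0).
assert abs(ml - 0.8) < 0.05
assert abs(corrected - 1.0) < 0.05
```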
Probability Theory (review) Bayes Decision Theory Probability Density Estimation
Maximum Likelihood Approach
• How to minimize the negative log-likelihood?→ Take the derivative and set it to zero
• Result for Normal distribution (1D case): θ̂ = (µ̂, σ̂)
µML =1N
N∑n=1
xn, σ2ML =
1N
N∑n=1
(xn − µML)2
→ Unfortunately, it is not so correct ...
→ Assume the samples {xn} come from a true Gaussiandistribution with mean µ and variance σ2, we have:
E(µML) = µ,E(σ2ML) =
N − 1N
σ2
• Corrected estimate: σ̃2 = NN−1σ
2ML = 1
N−1∑N
n=1(xn − µ̂)2
Chaohui Wang Introduction to Machine Learning 27 / 63
Probability Theory (review) Bayes Decision Theory Probability Density Estimation
Maximum Likelihood Approach
• How to minimize the negative log-likelihood?→ Take the derivative and set it to zero
Maximum Likelihood Approach

• How to minimize the negative log-likelihood?
→ Take the derivative and set it to zero
• Result for the Normal distribution (1D case): θ̂ = (µ̂, σ̂)

    µ_ML = (1/N) ∑_{n=1}^{N} x_n,    σ²_ML = (1/N) ∑_{n=1}^{N} (x_n − µ_ML)²

→ Unfortunately, this is not quite correct ...
→ Assuming the samples {x_n} come from a true Gaussian distribution with mean µ and variance σ², we have:

    E(µ_ML) = µ,    E(σ²_ML) = ((N − 1)/N) σ²

• Corrected (unbiased) estimate:

    σ̃² = (N/(N − 1)) σ²_ML = (1/(N − 1)) ∑_{n=1}^{N} (x_n − µ_ML)²

Chaohui Wang Introduction to Machine Learning 27 / 63
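The estimates above can be sketched in a few lines; the data values here are illustrative:

```python
def ml_gaussian_estimates(xs):
    """ML estimates for a 1-D Gaussian, plus the bias-corrected variance."""
    n = len(xs)
    mu_ml = sum(xs) / n                                 # sample mean
    var_ml = sum((x - mu_ml) ** 2 for x in xs) / n      # biased: E = (n-1)/n * sigma^2
    var_corrected = var_ml * n / (n - 1)                # unbiased estimate sigma~^2
    return mu_ml, var_ml, var_corrected

mu, v_ml, v_corr = ml_gaussian_estimates([2.0, 4.0, 6.0])
print(mu, v_ml, v_corr)   # 4.0, 2.666..., 4.0
```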
Maximum Likelihood Approach - Limitations

• It systematically underestimates the variance of the distribution
→ Consider the extreme case N = 1, X = {x₁}: the maximum-likelihood estimate becomes µ_ML = x₁ and σ²_ML = 0
• ML overfits to the observed data
• Although ML is widely used, it is important to be aware of this limitation

Chaohui Wang Introduction to Machine Learning 28 / 63
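The systematic underestimation can be checked empirically; a quick simulation sketch (sample size N = 5, so the expected ML variance is (N − 1)/N · σ² = 0.8):

```python
import random

random.seed(0)
true_var, N, trials = 1.0, 5, 20000

total = 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, true_var ** 0.5) for _ in range(N)]
    mu_ml = sum(xs) / N
    total += sum((x - mu_ml) ** 2 for x in xs) / N   # sigma^2_ML of this sample

avg_var_ml = total / trials
print(avg_var_ml)   # close to (N-1)/N * true_var = 0.8, not 1.0
```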
A Deeper Reason

• Maximum Likelihood is a Frequentist concept
• In the Frequentist view, probabilities are the frequencies of random, repeatable events
• These frequencies are fixed, but can be estimated more precisely when more data is available
• This is in contrast to the Bayesian interpretation
• In the Bayesian view, probabilities quantify the uncertainty about certain states or events
• This uncertainty can be revised in the light of new evidence

Chaohui Wang Introduction to Machine Learning 29 / 63
Bayesian vs. Frequentist View

• To illustrate the difference ...
• Suppose we want to estimate the uncertainty of whether the Arctic ice cap will have totally disappeared by 2100
• This question makes no sense in the Frequentist view, since the event cannot be repeated numerous times
• In the Bayesian view, we generally have a prior, e.g. from calculations of how fast the polar ice is melting
• If we now get fresh evidence, e.g. from a new satellite, we may revise our opinion and update the uncertainty from the prior via:

    Posterior ∝ Likelihood × Prior

• This generally allows better uncertainty estimates in many situations
→ Main Frequentist criticism: the prior has to come from somewhere, and if it is wrong, the result will be worse

Chaohui Wang Introduction to Machine Learning 30 / 63
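Posterior ∝ Likelihood × Prior takes a closed form in conjugate cases; as an illustrative sketch (not from the slides), a Beta prior on an event probability updated with Bernoulli evidence:

```python
def beta_update(a, b, k, n):
    """Beta(a, b) prior + k successes in n trials -> posterior Beta(a + k, b + n - k)."""
    return a + k, b + (n - k)

a0, b0 = 2.0, 2.0                              # prior belief, mean 0.5
a1, b1 = beta_update(a0, b0, k=9, n=10)        # fresh evidence revises it
print(a0 / (a0 + b0), a1 / (a1 + b1))          # prior mean 0.5 -> posterior mean ~0.786
```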
Bayesian Approach to Parameter Learning

• Conceptual shift
• Maximum Likelihood views the true parameter vector θ as unknown, but fixed
• In Bayesian learning, we consider θ to be a random variable
• This allows us to use prior knowledge about the parameters θ:
• Use a prior for θ
• Training data then converts this prior distribution on θ into a posterior probability density
→ The prior thus encodes the knowledge we have about the type of distribution we expect to see for θ

Chaohui Wang Introduction to Machine Learning 31 / 63
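How training data converts a prior on θ into a posterior can be illustrated with the standard conjugate case of a Gaussian mean with known variance (a sketch; the numbers are made up):

```python
def gaussian_mean_posterior(xs, sigma2, mu0, sigma0_2):
    """Gaussian prior N(mu0, sigma0_2) on the mean mu, likelihood N(mu, sigma2)
    with known variance sigma2 -> Gaussian posterior over mu (conjugate pair)."""
    n = len(xs)
    prec = 1.0 / sigma0_2 + n / sigma2               # posterior precision
    mu_n = (mu0 / sigma0_2 + sum(xs) / sigma2) / prec
    return mu_n, 1.0 / prec                          # posterior mean and variance

mu_n, var_n = gaussian_mean_posterior([1.8, 2.2, 2.0], sigma2=1.0, mu0=0.0, sigma0_2=10.0)
print(mu_n, var_n)   # posterior concentrates near the data mean 2.0
```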
Bayesian Approach

• Bayesian view:
• Consider the parameter vector θ as a random variable
• When estimating the distribution, what we are interested in is the density of a new point x given the training data X:

    p(x|X) = ∫ p(x|θ) p(θ|X) dθ

Chaohui Wang Introduction to Machine Learning 32 / 63
Summary: ML vs. Bayesian Learning

• Maximum Likelihood
• Simple approach, often analytically possible
• Problem: the estimate is biased and tends to overfit to the data
→ Often needs some correction or regularization
• But: the approximation becomes accurate as N → +∞
• Bayesian Learning
• General approach, avoids the estimation bias through a prior
• Problems:
I Need to choose a suitable prior (not always obvious)
I The integral over θ is often no longer analytically feasible
→ Resort to efficient stochastic sampling techniques

Chaohui Wang Introduction to Machine Learning 33 / 63
Non-Parametric Methods

• Non-parametric representations
→ Often the functional form of the distribution is unknown
• Estimate the probability density directly from the data:
• Histograms
• Kernel density estimation (Parzen window / Gaussian kernels)
• k-Nearest-Neighbor
• etc.

Chaohui Wang Introduction to Machine Learning 34 / 63
Histograms

• Idea: Partition the data space into distinct bins with widths ∆_i and count the number of observations, n_i, in each bin (among N observations in total):

    p_i = n_i / (N ∆_i)

Chaohui Wang Introduction to Machine Learning 35 / 63

Histograms

• Usually the same width is used for all bins: ∆_i = ∆
• In principle, the method can be applied for any dimensionality D
→ But the number of bins grows exponentially with D!
→ A suitably large N is required to get an informative histogram

Chaohui Wang Introduction to Machine Learning 36 / 63

Histograms

• The bin width ∆ acts as a smoothing factor

Chaohui Wang Introduction to Machine Learning 37 / 63
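The estimate p_i = n_i/(N ∆) can be sketched directly (equal bin widths assumed):

```python
def histogram_density(xs, x_min, x_max, n_bins):
    """Histogram density estimate p_i = n_i / (N * delta), equal bin width delta."""
    delta = (x_max - x_min) / n_bins
    counts = [0] * n_bins
    for x in xs:
        i = min(int((x - x_min) / delta), n_bins - 1)   # clamp x == x_max into last bin
        counts[i] += 1
    N = len(xs)
    return [c / (N * delta) for c in counts]

p = histogram_density([0.1, 0.2, 0.25, 0.7], x_min=0.0, x_max=1.0, n_bins=2)
print(p)   # [1.5, 0.5]; the bars integrate to 1 (sum(p) * delta == 1)
```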
Towards More “Statistically”-founded Approaches

• A data point x comes from the underlying pdf p(x): the probability that x falls into a small region R is

    P = ∫_R p(y) dy

• If R is sufficiently small such that p(x) is roughly constant over it:

    P = ∫_R p(y) dy ≈ p(x) V

where V denotes the volume of R
• If the number N of samples is sufficiently large, we can estimate P as:

    P = K/N    =⇒    p(x) ≈ K / (N V)

where K denotes the number of samples falling in R

Chaohui Wang Introduction to Machine Learning 38 / 63
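In 1-D, the estimate p(x) ≈ K/(NV) amounts to counting the K samples in an interval of length V = h around x; a minimal sketch:

```python
def density_at(x, samples, h):
    """p(x) ≈ K / (N V): K samples in R = [x - h/2, x + h/2], volume V = h (1-D)."""
    K = sum(1 for xn in samples if abs(xn - x) <= h / 2)
    return K / (len(samples) * h)

d = density_at(1.0, [0.9, 1.0, 1.1, 3.0], h=0.5)
print(d)   # K = 3, N = 4, V = 0.5 -> 1.5
```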
Kernel Methods

• Parzen Window: Determine the number K of data points inside a fixed hypercube
→ Unit hypercube around the origin:

    k(u) = 1, if |u_i| ≤ 1/2 ∀ i ∈ {1, . . . , D};  0, else

→ Considering a cube with side width h, the distribution of K in the space:

    K(x) = ∑_{n=1}^{N} k((x − x_n)/h),    V = ∫ k(u/h) du = h^D

→ Probability density estimate:

    p(x) ≈ K(x)/(NV) = (1/(N h^D)) ∑_{n=1}^{N} k((x − x_n)/h) = (1/N) ∑_{n=1}^{N} (1/h^D) k((x − x_n)/h)

Chaohui Wang Introduction to Machine Learning 40 / 63
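The Parzen-window estimate above, with the hypercube kernel, can be sketched as:

```python
def hypercube_kernel(u):
    """k(u) = 1 if |u_i| <= 1/2 for every dimension i, else 0."""
    return 1.0 if all(abs(ui) <= 0.5 for ui in u) else 0.0

def parzen_density(x, data, h):
    """p(x) ≈ (1/N) sum_n (1/h^D) k((x - x_n)/h) for D-dimensional points."""
    D, N = len(x), len(data)
    total = sum(
        hypercube_kernel([(xi - xni) / h for xi, xni in zip(x, xn)]) for xn in data
    )
    return total / (N * h ** D)

data = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0)]
d = parzen_density((0.0, 0.0), data, h=1.0)
print(d)   # 2 of 3 points fall in the unit cube around x -> 2/3
```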
Probability Theory (review) Bayes Decision Theory Probability Density Estimation
Kernel Methods
• Parzen Window - Interpretations• 1st interpretation : place a rescaled kernel window at
location x and count how many data points fall inside it
• 2nd interpretation : place a rescaled kernel window karound each data point xn and sum up their influences atlocation x→ Direct visualization of the density
• Issue: artificial discontinuities at the cube boundaries→ smoother k function (e.g., Gaussian) → smootherdensity model
Chaohui Wang Introduction to Machine Learning 41 / 63
Kernel Methods: Gaussian Kernel
• Gaussian kernel
  • Kernel function:
    k(u) = 1/(2πh²)^{D/2} exp{−‖u‖²/(2h²)}
    K(x) = Σ_{n=1}^{N} k(x − x_n),   V = ∫ k(u) du = 1
  • Probability density estimate:
    p(x) ≈ K(x)/(NV) = (1/N) Σ_{n=1}^{N} 1/(2πh²)^{D/2} exp{−‖x − x_n‖²/(2h²)}
Chaohui Wang Introduction to Machine Learning 42 / 63
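Swapping the hypercube for the Gaussian kernel above gives a smooth estimate; a minimal NumPy sketch (names illustrative):

```python
import numpy as np

def gaussian_kde(x, data, h):
    # p(x) ~ (1/N) * sum_n N(x | x_n, h^2 I): one Gaussian of width h per sample
    N, D = data.shape
    sq = np.sum((x - data) ** 2, axis=1)          # ||x - x_n||^2 for each n
    norm = (2 * np.pi * h ** 2) ** (D / 2)
    return np.exp(-sq / (2 * h ** 2)).sum() / (N * norm)
```

With a single 1-D sample at the origin and h = 1, this reduces to the standard normal density at x.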
Kernel Methods - General Principle
• In general, any kernel satisfying the following properties can be used:
  k(u) ≥ 0,   ∫ k(u) du = 1
• Then
  K(x) = Σ_{n=1}^{N} k(x − x_n),   V = ∫ k(u) du = 1
• and we obtain the probability density estimate
  p(x) ≈ K(x)/(NV) = (1/N) Σ_{n=1}^{N} k(x − x_n)
Chaohui Wang Introduction to Machine Learning 43 / 63
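Any non-negative, normalized k qualifies. For instance, the Epanechnikov kernel (a choice not discussed in the lecture, shown here only as an example) plugs into the same estimator:

```python
import numpy as np

def epanechnikov(u):
    # k(u) = 3/4 * (1 - u^2) on [-1, 1]: non-negative and integrates to 1
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def kernel_density(x, data, k):
    # p(x) ~ (1/N) * sum_n k(x - x_n), for any normalized 1-D kernel k
    return float(np.mean(k(x - data)))
```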
Towards More “Statistically”-founded Approaches
Chaohui Wang Introduction to Machine Learning 44 / 63
K-Nearest Neighbor Density Estimation
• Basic idea: increase the volume V until the K-th closest data point is found
• Fix K, consider a hypersphere centered on x, and let it grow to a volume V̂(x, K) that includes K of the given N data points. Then:
  p(x) ≈ K / (N V̂(x, K))
→ Note: strictly speaking, the model produced by K-NN is not a true density model, because its integral over all space diverges.
  E.g., consider K = 1 and x = x_j (i.e., x is exactly on a data point x_j).
→ It is therefore often exploited in a relative manner, to compare between classes, e.g., in K-NN classification (seen shortly).
Chaohui Wang Introduction to Machine Learning 45 / 63
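In 1-D the growing "hypersphere" is just an interval of radius r_K, so the estimate can be sketched as (illustrative code, 1-D case only):

```python
import numpy as np

def knn_density(x, data, K):
    # p(x) ~ K / (N * V), where V is the volume of the smallest ball
    # around x containing K samples; in 1-D, V = 2 * r_K
    N = data.shape[0]
    dists = np.sort(np.abs(data - x))   # distances from x to all samples
    r = dists[K - 1]                    # radius reaching the K-th neighbour
    return K / (N * 2 * r)
```

Note the divergence mentioned above: if x coincides with a data point and K = 1, then r = 0 and the estimate blows up.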
K-Nearest Neighbor - Examples
Chaohui Wang Introduction to Machine Learning 46 / 63
K-Nearest Neighbor Classification
• Recall Bayesian classification: posterior probability
  p(C_j|x) = p(x|C_j) p(C_j) / p(x)
• Now we have:
  p(x) ≈ K / (N V̂(x, K))
  p(x|C_j) ≈ K_j(x, K) / (N_j V̂(x, K))
  p(C_j) ≈ N_j / N
→ p(C_j|x) ≈ [K_j(x, K) / (N_j V̂(x, K))] · (N_j / N) · (N V̂(x, K) / K) = K_j(x, K) / K
Chaohui Wang Introduction to Machine Learning 47 / 63
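The result p(C_j|x) ≈ K_j/K is exactly majority voting among the K nearest neighbours; a minimal sketch (names illustrative):

```python
import numpy as np

def knn_classify(x, data, labels, K):
    # p(C_j | x) ~ K_j / K: fraction of the K nearest neighbours with label j;
    # predict the label with the largest fraction (majority vote)
    dists = np.linalg.norm(data - x, axis=1)
    nearest = labels[np.argsort(dists)[:K]]
    classes, counts = np.unique(nearest, return_counts=True)
    posterior = {int(c): k / K for c, k in zip(classes, counts)}
    return int(classes[np.argmax(counts)]), posterior
```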
K-Nearest Neighbor Classification
Chaohui Wang Introduction to Machine Learning 48 / 63
K-Nearest Neighbor Classification
• Results on an example data set
• K acts as a smoothing parameter
• Theoretical property: as N → ∞, the error rate of the 1-NN classifier is never more than twice the optimal error (obtained from the true conditional class distributions)
→ However, N is usually quite small in real applications . . .
Chaohui Wang Introduction to Machine Learning 49 / 63
Mixture Models - Motivations
• A single parametric distribution is often not sufficient
Chaohui Wang Introduction to Machine Learning 50 / 63
Mixture of Gaussians (MoG)
• Weighted sum of M individual Gaussian distributions
→ In the limit, every smooth distribution can be approximated this way (if M is large enough)
  p(x|θ) = Σ_{m=1}^{M} π_m p(x|θ_m),   π_m = p(l_n = m|θ_m)
→ Parameters of the MoG: θ = (π_1, µ_1, σ_1, π_2, µ_2, σ_2, . . . , π_M, µ_M, σ_M)
Chaohui Wang Introduction to Machine Learning 51 / 63
Mixture of Gaussians (MoG)
• Mixture of Gaussians (MoG):
  p(x|θ) = Σ_{m=1}^{M} π_m p(x|θ_m)
• Prior of component m:
  π_m = p(l_n = m|θ_m),   with 0 ≤ π_m ≤ 1 (∀m) and Σ_{m=1}^{M} π_m = 1
• Likelihood of x given component m:
  p(x|θ_m) = 1/(2πσ_m²)^{1/2} exp{−(x − µ_m)²/(2σ_m²)}
• ∫ p(x|θ) dx = 1
Chaohui Wang Introduction to Machine Learning 52 / 63
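Evaluating the 1-D mixture density defined above is a direct translation of the two formulas (a sketch; names illustrative):

```python
import numpy as np

def mog_pdf(x, pis, mus, sigmas):
    # p(x | theta) = sum_m pi_m * N(x | mu_m, sigma_m^2), for scalar x
    pis, mus, sigmas = map(np.asarray, (pis, mus, sigmas))
    comps = np.exp(-(x - mus) ** 2 / (2 * sigmas ** 2)) \
            / np.sqrt(2 * np.pi * sigmas ** 2)
    return float(np.sum(pis * comps))
```

With a single component (π_1 = 1), this reduces to an ordinary Gaussian density.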
Mixture of Gaussians (MoG)
Chaohui Wang Introduction to Machine Learning 53 / 63
Mixture of Multivariate Gaussians
Chaohui Wang Introduction to Machine Learning 54 / 63
Estimation of MoG
• Maximum Likelihood: there is no direct analytical solution
  ∂{−log L(Θ)}/∂µ_j = f(π_1, µ_1, Σ_1, π_2, µ_2, Σ_2, . . . , π_M, µ_M, Σ_M)
• Complex gradient function (non-linear mutual dependencies)
→ The optimization of one Gaussian depends on all other Gaussians
• Iterative numerical optimization could be applied, but there is a simpler method: the Expectation-Maximization (EM) algorithm
→ Note that its idea is widely used in CV-related fields
Chaohui Wang Introduction to Machine Learning 55 / 63
Preliminaries (1)
• Basic strategy:
  • Model the unobserved component label via a hidden variable
  • Evaluate the probability that a training example is generated by each component
Chaohui Wang Introduction to Machine Learning 56 / 63
Preliminaries (2)
• Mixture estimation with labeled data
  • When examples are labeled, we can estimate the Gaussians independently → e.g., using Maximum Likelihood
    l_i: the label of sample x_i
    N: the total number of samples
    N̂_j: the number of samples labeled j
    π̂_j ← N̂_j / N,   µ̂_j ← (1/N̂_j) Σ_{n: l_n = j} x_n
    Σ̂_j ← (1/N̂_j) Σ_{n: l_n = j} (x_n − µ̂_j)(x_n − µ̂_j)^T
• But we don't have such labels l_i.
→ We may start from some clustering result, but then...
Chaohui Wang Introduction to Machine Learning 57 / 63
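With labels available, the three estimates above decouple completely; a sketch of the labeled-data case (names illustrative):

```python
import numpy as np

def fit_labeled_mixture(X, labels):
    # ML estimates per class j: pi_j = N_j / N, mu_j = class mean,
    # Sigma_j = class covariance (dividing by N_j, as on the slide)
    N = X.shape[0]
    params = {}
    for j in np.unique(labels):
        Xj = X[labels == j]
        Nj = Xj.shape[0]
        mu = Xj.mean(axis=0)
        diff = Xj - mu
        params[int(j)] = (Nj / N, mu, diff.T @ diff / Nj)
    return params
```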
Preliminaries (3)
• Idea: mixture estimation with "soft" assignments
  • Given the mixture distribution parameters θ, we can evaluate the posterior probability that x_n was generated by a specific component j:
    p(l_n = j|x_n, θ) = p(l_n = j, x_n|θ) / p(x_n|θ) = p(l_n = j, x_n|θ) / Σ_{m=1}^{M} π_m p(x_n|θ_m)
    p(l_n = j, x_n|θ) = p(l_n = j|θ) p(x_n|l_n = j, θ) = π_j p(x_n|θ_j)
    → p(l_n = j|x_n, θ) = π_j p(x_n|θ_j) / Σ_{m=1}^{M} π_m p(x_n|θ_m)
Chaohui Wang Introduction to Machine Learning 58 / 63
Expectation-Maximization (EM) Algorithm
• E-Step: softly assign samples to mixture components
  γ_j(x_n) ← π_j N(x_n|µ_j, Σ_j) / Σ_{k=1}^{M} π_k N(x_n|µ_k, Σ_k),   ∀j = 1, . . . , M, n = 1, . . . , N
• M-Step: re-estimate the parameters (separately for each mixture component) based on the soft assignments
  N̂_j ← Σ_{n=1}^{N} γ_j(x_n): soft number of samples labeled j
  π̂_j^new ← N̂_j / N
  µ̂_j^new ← (1/N̂_j) Σ_{n=1}^{N} γ_j(x_n) x_n
  Σ̂_j^new ← (1/N̂_j) Σ_{n=1}^{N} γ_j(x_n)(x_n − µ̂_j^new)(x_n − µ̂_j^new)^T
→ How to initialize the algorithm, then?
Chaohui Wang Introduction to Machine Learning 59 / 63
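The two steps translate directly into code. Below is a compact sketch of EM for a full-covariance mixture; the quantile-based initialization is an arbitrary choice for this sketch, not prescribed by the lecture:

```python
import numpy as np

def em_gmm(X, M, iters=50):
    # X: (N, D) data matrix; M: number of mixture components
    N, D = X.shape
    mus = np.quantile(X, np.arange(1, M + 1) / (M + 1), axis=0)  # spread-out initial means
    Xc = X - X.mean(axis=0)
    Sigmas = np.stack([Xc.T @ Xc / N] * M)                       # shared initial covariance
    pis = np.full(M, 1.0 / M)
    for _ in range(iters):
        # E-step: gamma[n, j] proportional to pi_j * N(x_n | mu_j, Sigma_j)
        gamma = np.zeros((N, M))
        for j in range(M):
            diff = X - mus[j]
            inv = np.linalg.inv(Sigmas[j])
            norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigmas[j]))
            gamma[:, j] = pis[j] * np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1)) / norm
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: soft counts, then the closed-form re-estimates above
        Nj = gamma.sum(axis=0)
        pis = Nj / N
        mus = (gamma.T @ X) / Nj[:, None]
        for j in range(M):
            diff = X - mus[j]
            Sigmas[j] = (gamma[:, j, None] * diff).T @ diff / Nj[j]
    return pis, mus, Sigmas
```

On two well-separated 1-D clusters, the estimated means converge to the cluster centers within a few iterations.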
Expectation-Maximization (EM) Algorithm
• Initialization:
  • Way 1: initialize the algorithm with a set of initial parameters, then conduct an E-step
  • Way 2: start with a set of initial weights, then do a first M-step
Chaohui Wang Introduction to Machine Learning 60 / 63
EM Algorithm - Example
Chaohui Wang Introduction to Machine Learning 61 / 63
EM Algorithm - Implementation
• One issue in practice: singularities in the estimation
→ Mixture components may collapse onto single data points
• Why? If component j is exactly centered on a data point x_n, this data point contributes an infinite term to the likelihood function
• How? Introduce regularization, e.g., by enforcing a minimum width for the Gaussians: use (Σ + σ_min I)^{−1} instead of Σ^{−1}
Chaohui Wang Introduction to Machine Learning 62 / 63
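The suggested fix is essentially a one-liner: add σ_min I before inverting the covariance. A sketch (the σ_min value is an arbitrary example):

```python
import numpy as np

def regularized_precision(Sigma, sigma_min=1e-3):
    # Use (Sigma + sigma_min * I)^(-1) instead of Sigma^(-1), so that a
    # collapsed (singular) covariance still yields a finite Gaussian width
    D = Sigma.shape[0]
    return np.linalg.inv(Sigma + sigma_min * np.eye(D))
```

Even a fully collapsed covariance (Σ = 0) then produces a finite precision matrix instead of a division by zero.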
Gaussian Mixture Models - Applications
• Mixture models are used in many practical applications, wherever distributions with complex or unknown shapes need to be represented
• Popular applications in Computer Vision
→ e.g., modeling distributions of pixel colors
• Each pixel is one data point in, e.g., RGB space
• Learn a MoG to represent the class-conditional densities
• Use the learned models to classify other pixels
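A minimal numpy sketch of the classification step described above (the class names, component parameters, and mixture weights are hand-picked for illustration, not learned as in the lecture): each pixel color is assigned to the class whose class-conditional MoG density is higher.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density N(x | mu, Sigma)."""
    d = len(mu)
    diff = x - mu
    inv = np.linalg.inv(Sigma)
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return norm * np.exp(-0.5 * diff @ inv @ diff)

def mog_density(x, weights, means, covs):
    """Mixture of Gaussians: p(x) = sum_j pi_j * N(x | mu_j, Sigma_j)."""
    return sum(w * gaussian_pdf(x, m, S)
               for w, m, S in zip(weights, means, covs))

# Hypothetical two-component MoGs per class, RGB values in [0, 1].
# In practice these parameters would be learned with EM from labeled pixels.
skin = ([0.6, 0.4],
        [np.array([0.9, 0.6, 0.5]), np.array([0.7, 0.4, 0.35])],
        [0.01 * np.eye(3), 0.01 * np.eye(3)])
background = ([0.5, 0.5],
              [np.array([0.2, 0.3, 0.8]), np.array([0.1, 0.7, 0.2])],
              [0.02 * np.eye(3), 0.02 * np.eye(3)])

def classify_pixel(color):
    """Assign the class whose MoG gives the higher class-conditional density."""
    p_skin = mog_density(color, *skin)
    p_bg = mog_density(color, *background)
    return "skin" if p_skin > p_bg else "background"
```

For equal class priors this comparison of class-conditional densities is exactly the Bayes decision rule from earlier in the lecture; unequal priors would simply scale each density by Pr(class).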
Chaohui Wang Introduction to Machine Learning 63 / 63