Introduction to Machine Learning, Lecture 2

Chaohui Wang

October 14, 2019

Outline of This Lecture

Probability Theory (review)

Bayes Decision Theory

Probability Density Estimation


Basic Concepts

Let us consider the following scenario:
• Two discrete variables: X ∈ {x_i} and Y ∈ {y_j}, with j = 1, ..., L
• N trials, and denote:
  n_ij = #{X = x_i ∧ Y = y_j}
  c_i = #{X = x_i}
  r_j = #{Y = y_j}

→ We then have:
• Joint probability: Pr(X = x_i, Y = y_j) = n_ij / N
• Marginal probability: Pr(X = x_i) = c_i / N
• Conditional probability: Pr(Y = y_j | X = x_i) = n_ij / c_i
• Sum rule: Pr(X = x_i) = (1/N) ∑_{j=1}^{L} n_ij = ∑_{j=1}^{L} Pr(X = x_i, Y = y_j)
• Product rule: Pr(X = x_i, Y = y_j) = n_ij / N = (n_ij / c_i) · (c_i / N) = Pr(Y = y_j | X = x_i) Pr(X = x_i)
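As a quick numerical check (a minimal NumPy sketch with a made-up count table, not taken from the slides), the definitions above and the sum and product rules can be verified directly from the counts:

```python
import numpy as np

# Illustrative count table n_ij for two discrete variables (rows: x_i, cols: y_j).
n = np.array([[3, 7, 2],
              [5, 1, 6]], dtype=float)
N = n.sum()                      # total number of trials
joint = n / N                    # Pr(X = x_i, Y = y_j) = n_ij / N
c = n.sum(axis=1)                # c_i = #{X = x_i}
marginal_x = c / N               # Pr(X = x_i) = c_i / N
cond_y_given_x = n / c[:, None]  # Pr(Y = y_j | X = x_i) = n_ij / c_i

# Sum rule: the marginal equals the joint summed over Y.
assert np.allclose(marginal_x, joint.sum(axis=1))
# Product rule: the joint equals the conditional times the marginal.
assert np.allclose(joint, cond_y_given_x * marginal_x[:, None])
print(joint)
```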


The Rules of Probability

→ Thus we have:
• Sum rule: p(X) = ∑_Y p(X, Y)
• Product rule: p(X, Y) = p(Y|X) p(X)

→ Finally, we can derive:
• Bayes' Theorem:
  p(Y|X) = p(X|Y) p(Y) / p(X), with p(X) = ∑_Y p(X|Y) p(Y)
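A similarly minimal sketch, assuming small hypothetical tables for p(Y) and p(X|Y), shows Bayes' theorem as an array computation; each column of the resulting p(Y|X) sums to 1 by construction:

```python
import numpy as np

# Hypothetical discrete distributions (values chosen only for illustration).
p_y = np.array([0.3, 0.7])                 # prior p(Y)
p_x_given_y = np.array([[0.2, 0.5, 0.3],   # p(X | Y = y_1)
                        [0.6, 0.1, 0.3]])  # p(X | Y = y_2)

p_x = p_x_given_y.T @ p_y                  # p(X) = sum_Y p(X|Y) p(Y)
# Bayes' theorem: p(Y|X) = p(X|Y) p(Y) / p(X)
p_y_given_x = (p_x_given_y * p_y[:, None]) / p_x[None, :]

print(p_y_given_x)
print(p_y_given_x.sum(axis=0))             # each column sums to 1
```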


Probability Densities

• Probabilities over continuous variables are defined via their probability density function (pdf) p(x):
  Pr(x ∈ (a, b)) = ∫_a^b p(x) dx
• Cumulative distribution function: the probability that x lies in the interval (−∞, z):
  P(z) = ∫_{−∞}^{z} p(x) dx
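To make this concrete, here is a rough numerical sketch (a standard normal pdf, chosen purely for illustration) checking that Pr(a < x < b) equals P(b) − P(a) when the integrals are approximated by simple Riemann sums:

```python
import numpy as np

# Standard normal pdf, used only as an example density.
def p(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

a, b = -1.0, 2.0
xs = np.linspace(-10.0, b, 200001)       # grid reaching far into the left tail
dx = xs[1] - xs[0]
cdf_b = np.sum(p(xs)) * dx                # P(b) ≈ ∫_{-∞}^{b} p(x) dx
cdf_a = np.sum(p(xs[xs <= a])) * dx       # P(a) ≈ ∫_{-∞}^{a} p(x) dx
prob_ab = np.sum(p(xs[(xs > a) & (xs <= b)])) * dx  # ∫_a^b p(x) dx

# Pr(a < x < b) equals the difference of the cdf values.
print(prob_ab, cdf_b - cdf_a)             # both ≈ 0.819
```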


Expectations

• Expectation: the average value of some function f(x) under a probability distribution
  discrete case: E[f] = ∑_x p(x) f(x)
  continuous case: E[f] = ∫ p(x) f(x) dx

→ Given N samples {x_n} drawn from the pdf, the expectation can be approximated by:
  E[f] ≈ (1/N) ∑_{n=1}^{N} f(x_n)

• Conditional expectation:
  discrete case: E_x[f|y] = ∑_x p(x|y) f(x)
  continuous case: E_x[f|y] = ∫ p(x|y) f(x) dx
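The sample approximation of the expectation can be illustrated with a one-line Monte Carlo estimate; the choice f(x) = x² under a standard normal (true value 1) is an assumption made only for this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# E[f] for f(x) = x^2 under a standard normal is exactly 1 (the variance).
samples = rng.standard_normal(100_000)   # N samples drawn from the pdf
mc_estimate = np.mean(samples**2)        # E[f] ≈ (1/N) Σ_n f(x_n)

print(mc_estimate)                       # ≈ 1.0
```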


Variances and Covariances

• Variance of a function f(x):
  var[f] = E[(f(x) − E[f(x)])²] = E[f(x)²] − E[f(x)]²
• Covariance between variables X and Y:
  cov[X, Y] = E_{x,y}[{x − E[x]}{y − E[y]}] = E_{x,y}[xy] − E[x]E[y]
→ Covariance matrix in case X and Y are vectors:
  cov[X, Y] = E_{x,y}[x yᵀ] − E[x]E[yᵀ]
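A short sketch (synthetic correlated 2-D data, chosen only for illustration) comparing the identity cov[X, Y] = E[xyᵀ] − E[x]E[yᵀ] against NumPy's built-in estimator:

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy 2-D data with correlated components (purely illustrative).
x = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[2.0, 0.8], [0.8, 1.0]],
                            size=200_000)

mean = x.mean(axis=0)
# Empirical version of cov = E[x xᵀ] − E[x] E[x]ᵀ.
cov_identity = (x[:, :, None] * x[:, None, :]).mean(axis=0) - np.outer(mean, mean)

print(cov_identity)             # ≈ [[2.0, 0.8], [0.8, 1.0]]
print(np.cov(x, rowvar=False))  # NumPy's estimate (uses the N−1 normalization)
```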


Classification Example

• Handwritten character recognition

→ Goal: classify a letter in a test image such that the probability of misclassification is minimized.


Priors

• Concept 1: priors (a priori probabilities) p(C_k)
• What we "know" (or assume in practice) about the probability of each class before seeing the data.
  Example: C_1 = 'a', C_2 = 'b', p(C_1) = 0.75, p(C_2) = 0.25

→ In general: ∑_k p(C_k) = 1


Conditional probabilities

• Concept 2: conditional probabilities p(x|C_k)
• Feature vector x: characterizes certain properties of the input.
• p(x|C_k): describes the likelihood of x for a given class C_k.

Example: class-conditional densities p(x|a) and p(x|b) for the two letter classes (figure).


How to decide?

• Example: given the class-conditional densities above, which class should we choose for an observed x?

→ Where p(x|b) is much smaller than p(x|a), the decision should be 'a'.
→ Where p(x|a) is much smaller than p(x|b), the decision should be 'b'.
→ But note: p(a) = 0.75 and p(b) = 0.25! What should we do in this case?


Posterior probabilities

• Concept 3: posterior probabilities p(C_k|x)
• p(C_k|x) characterizes the probability of class C_k given the feature vector x.
• Bayes' Theorem:
  p(C_k|x) = p(x|C_k) p(C_k) / p(x) = p(x|C_k) p(C_k) / ∑_i p(x|C_i) p(C_i)
• Interpretation:
  Posterior = (Likelihood × Prior) / Normalization Factor
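A minimal sketch of this computation: the priors are taken from the example above, while the likelihood values are hypothetical numbers standing in for p(x|a) and p(x|b) read off the class-conditional densities at some x. It shows how the larger prior can outweigh a smaller likelihood:

```python
import numpy as np

# Priors from the slides' example; the likelihoods are illustrative stand-ins
# for p(x|C_k) evaluated at a particular x.
priors = np.array([0.75, 0.25])        # p(a), p(b)
likelihoods = np.array([0.05, 0.12])   # p(x|a), p(x|b)  (hypothetical numbers)

unnormalized = likelihoods * priors
posteriors = unnormalized / unnormalized.sum()    # Bayes' theorem

print(posteriors)                                 # ≈ [0.556, 0.444]
decision = ['a', 'b'][np.argmax(posteriors)]      # pick the larger posterior
print(decision)   # 'a': the prior outweighs the smaller likelihood here
```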


Bayesian Decision Theory

• Goal: minimize the probability of a misclassification

• Optimal decision rule: decide for C_1 if
  p(C_1|x) > p(C_2|x)
  and vice versa.

→ p(C_1|x) > p(C_2|x) is equivalent to:
  p(x|C_1) p(C_1) > p(x|C_2) p(C_2)

→ Further equivalent to the likelihood-ratio test:
  p(x|C_1) / p(x|C_2) > p(C_2) / p(C_1)


Generalization to More Than 2 Classes

• Decide for class C_k if it has the greatest posterior probability of all classes:
  p(C_k|x) > p(C_j|x), ∀j ≠ k
  equivalently: p(x|C_k) p(C_k) > p(x|C_j) p(C_j), ∀j ≠ k

→ Likelihood-ratio test:
  p(x|C_k) / p(x|C_j) > p(C_j) / p(C_k), ∀j ≠ k


Classifying with Loss Functions

• Generalization to decisions with a loss function
• Allows an inhomogeneous loss for different kinds of misclassification
• Can be asymmetric, for example:
  loss(decision = healthy | patient = sick) >> loss(decision = sick | patient = healthy)
• Formalized using a loss matrix: L_kj is the loss for choosing C_j while the truth is C_k


Classifying with Loss Functions

• Goal: choose the decision that minimizes the loss
→ But the loss depends on the true class, which is unknown
• Solution: minimize the expected loss
  E[L] = ∑_k ∑_j ∫_{R_j} L_kj p(x, C_k) dx
→ This can be done by assigning each x to the decision region R_j such that
  ∑_k L_kj p(C_k|x)
  is minimized
→ It is still the posterior probability p(C_k|x) that matters!
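A minimal sketch of this rule, with a hypothetical asymmetric loss matrix in the spirit of the sick/healthy example (the numbers are invented for illustration): even a class with low posterior probability can be the loss-minimizing decision if missing it is costly enough.

```python
import numpy as np

# Hypothetical loss matrix L[k, j]: loss of deciding class j when the truth is
# class k (classes: 0 = sick, 1 = healthy). Values are purely illustrative.
L = np.array([[0.0, 100.0],   # truth sick:    deciding "healthy" is very costly
              [1.0,   0.0]])  # truth healthy: deciding "sick" costs a little

def decide(posteriors):
    """Pick the decision j minimizing the expected loss sum_k L[k, j] * p(C_k|x)."""
    expected_loss = L.T @ posteriors
    return int(np.argmin(expected_loss))

p = np.array([0.1, 0.9])      # posterior p(C_k|x): only 10% probability of "sick"
print(decide(p))              # 0: still decide "sick", because missing it is so costly
```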


Classifying with Loss Functions

• For the binary classification problem: decide for C_1 if
  p(x|C_1) / p(x|C_2) > (L_21 − L_22) p(C_2) / ((L_12 − L_11) p(C_1))

→ Recall the likelihood-ratio test: p(x|C_1) / p(x|C_2) > p(C_2) / p(C_1)
→ Taking the loss function into account leads to the generalization above


Classification via Discriminant Functions

• Formulate classification in terms of comparisons
• Discriminant functions: y_1(x), ..., y_K(x)
• Classify x as class C_k if:
  y_k(x) > y_j(x), ∀j ≠ k

→ Examples (Bayes decision theory):
  y_k(x) = p(C_k|x)
  y_k(x) = p(x|C_k) p(C_k)
  y_k(x) = log p(x|C_k) + log p(C_k)

→ Question: how do we represent and estimate the probabilities p(x|C_k) and p(C_k)?
→ Probability density estimation. E.g., in supervised training, the data and class labels are known.
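As an illustration of the third discriminant above, here is a minimal sketch assuming 1-D Gaussian class-conditionals with hand-picked (not estimated) parameters:

```python
import numpy as np

# Assumed per-class parameters (illustrative, not learned from data).
means  = np.array([-1.0, 2.0])    # class-conditional means
sigmas = np.array([1.0, 0.5])     # class-conditional standard deviations
priors = np.array([0.75, 0.25])   # p(C_k)

def log_gauss(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

def classify(x):
    # y_k(x) = log p(x|C_k) + log p(C_k); pick the largest discriminant.
    y = log_gauss(x, means, sigmas) + np.log(priors)
    return int(np.argmax(y))

print([classify(x) for x in (-1.5, 0.4, 1.8)])   # [0, 0, 1]
```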


Probability Density Estimation

• Methods:
  • Parametric
  • Non-parametric
  • Mixture models


Parametric Methods

• Given:
  • Data X = {x_1, x_2, ..., x_N}
  • A parametric form of the distribution with parameters θ
    → e.g., Gaussian distribution: θ = (µ, σ)
• Learning:
  → estimation of the parameters θ

→ For example: using a Gaussian distribution as the parametric model, what is θ = (µ, σ)?


Maximum Likelihood Approach

• Likelihood L(θ) of θ: the probability that the data X have indeed been generated from a probability density with parameters θ:
  L(θ) = p(X|θ)
• Computation of the likelihood:
  • Single data point: p(x_n|θ)
  • Assuming that all data points are independent:
    L(θ) = ∏_{n=1}^{N} p(x_n|θ)
  • Negative log-likelihood:
    E(θ) = −log L(θ) = −∑_{n=1}^{N} log p(x_n|θ)
• Estimation/learning of the parameters θ:
  • Maximize the likelihood → minimize the negative log-likelihood
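A minimal sketch of the negative log-likelihood for a 1-D Gaussian model on synthetic data (the data-generating parameters are an assumption made for this sketch); unsurprisingly, the parameters that generated the data score better than an arbitrary guess:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(loc=1.5, scale=0.8, size=500)   # synthetic data (illustrative)

def neg_log_likelihood(mu, sigma, data):
    """E(theta) = -sum_n log p(x_n | theta) for a 1-D Gaussian model."""
    return np.sum(0.5 * ((data - mu) / sigma) ** 2
                  + np.log(sigma * np.sqrt(2.0 * np.pi)))

# The parameters that generated the data give a lower NLL than a poor guess.
print(neg_log_likelihood(1.5, 0.8, X))   # smaller
print(neg_log_likelihood(0.0, 2.0, X))   # larger
```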


Maximum Likelihood Approach

• How to minimize the negative log-likelihood?
→ Take the derivative and set it to zero
• Result for the Normal distribution (1D case), θ̂ = (µ̂, σ̂):
  µ_ML = (1/N) ∑_{n=1}^{N} x_n,   σ²_ML = (1/N) ∑_{n=1}^{N} (x_n − µ_ML)²

→ Unfortunately, this is not quite correct ...
→ Assume the samples {x_n} come from a true Gaussian distribution with mean µ and variance σ². Then:
  E[µ_ML] = µ,   E[σ²_ML] = ((N − 1)/N) σ²

• Corrected estimate: σ̃² = (N/(N − 1)) σ²_ML = (1/(N − 1)) ∑_{n=1}^{N} (x_n − µ_ML)²
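The bias E[σ²_ML] = ((N − 1)/N) σ² can be observed empirically; this small simulation (synthetic Gaussian data, N = 5 chosen to make the effect visible) is only a sanity check, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma2, N, trials = 0.0, 4.0, 5, 200_000   # small N makes the bias visible

samples = rng.normal(mu, np.sqrt(sigma2), size=(trials, N))
mu_ml = samples.mean(axis=1)
var_ml = ((samples - mu_ml[:, None]) ** 2).mean(axis=1)   # sigma^2_ML (divides by N)

print(var_ml.mean())                  # ≈ (N-1)/N * sigma^2 = 3.2  (biased)
print((N / (N - 1)) * var_ml.mean())  # ≈ 4.0 after the correction
```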


Maximum Likelihood Approach - Limitations

• It systematically underestimates the variance of the distribution
→ Consider the extreme case N = 1, X = {x_1}: the maximum-likelihood estimate puts the mean at x_1 and the variance at zero, i.e. the fitted Gaussian collapses onto the single observation.
• ML overfits to the observed data
• Although ML is widely used, it is important to be aware of this limitation


A Deeper Reason

• Maximum Likelihood is a Frequentist concept
  • In the Frequentist view, probabilities are the frequencies of random, repeatable events
  • These frequencies are fixed, but can be estimated more precisely as more data become available
• This is in contrast to the Bayesian interpretation
  • In the Bayesian view, probabilities quantify the uncertainty about certain states or events
  • This uncertainty can be revised in the light of new evidence


Bayesian vs. Frequentist View

• To illustrate the difference ...
  • Suppose we want to estimate the uncertainty about whether the Arctic ice cap will have totally disappeared by 2100
  • This question makes no sense in a Frequentist view, since the event cannot be repeated numerous times
  • In the Bayesian view, we generally have a prior, e.g. from calculations of how fast the polar ice is melting
  • If we now get fresh evidence, e.g. from a new satellite, we may revise our opinion and update the uncertainty from the prior, via:
    Posterior ∝ Likelihood × Prior
• This generally allows us to obtain better uncertainty estimates in many situations
→ Main Frequentist criticism: the prior has to come from somewhere, and if it is wrong, the result will be worse


Bayesian Approach to Parameter Learning

• Conceptual shift:
  • Maximum Likelihood views the true parameter vector θ as unknown, but fixed
  • In Bayesian learning, we consider θ to be a random variable
• This allows us to use prior knowledge about the parameters θ:
  • Use a prior for θ
  • The training data then convert this prior distribution over θ into a posterior probability density
→ The prior thus encodes the knowledge we have about the type of distribution we expect to see for θ


Bayesian Approach

• Bayesian view:
  • Consider the parameter vector θ as a random variable
  • When estimating the distribution from the data X, what we are interested in is the density of x given X, which is obtained by integrating the model p(x|θ) over the posterior distribution of θ
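A minimal sketch of Bayesian parameter learning, assuming a Gaussian likelihood with known variance and a Gaussian prior on the mean (a standard conjugate pair; the specific numbers are invented): the training data turn the prior over θ into a posterior.

```python
import numpy as np

rng = np.random.default_rng(4)
sigma2 = 1.0                              # known observation variance (assumed)
mu0, tau2 = 0.0, 10.0                     # Gaussian prior on the mean: N(mu0, tau2)
X = rng.normal(2.0, np.sqrt(sigma2), 20)  # observed data (synthetic)

N = len(X)
# Conjugate update: the posterior over the mean is again Gaussian.
post_var = 1.0 / (1.0 / tau2 + N / sigma2)
post_mean = post_var * (mu0 / tau2 + X.sum() / sigma2)

print(post_mean, post_var)   # the posterior concentrates near the sample mean as N grows
```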


Summary: ML vs. Bayesian Learning

• Maximum Likelihood
  • Simple approach, often analytically possible
  • Problem: the estimate is biased and tends to overfit the data
    → often needs some correction or regularization
  • But: the approximation becomes accurate as N → +∞
• Bayesian Learning
  • General approach, avoids the estimation bias through a prior
  • Problems:
    • Need to choose a suitable prior (not always obvious)
    • The integral over θ is often no longer analytically feasible
      → resort to efficient stochastic sampling techniques (see the sketch below)
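To illustrate the last point, a rough sketch of stochastic integration over θ, reusing the conjugate Gaussian-mean setting assumed in the previous sketch (not an example from the slides): the predictive density p(x|X) = ∫ p(x|θ) p(θ|X) dθ is approximated by averaging over posterior samples and compared to the closed form available in this special case.

```python
import numpy as np

rng = np.random.default_rng(5)

def gauss_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

# Assumed Gaussian posterior over the mean (e.g. from a conjugate update as above).
post_mean, post_var = 1.8, 0.05
sigma2 = 1.0                                   # known observation variance

# Monte Carlo marginalization: p(x|X) ≈ (1/S) Σ_s p(x|θ_s), with θ_s ~ p(θ|X).
thetas = rng.normal(post_mean, np.sqrt(post_var), size=100_000)
x = 2.5
p_mc = gauss_pdf(x, thetas, sigma2).mean()

# For this Gaussian case the integral is also available in closed form:
p_exact = gauss_pdf(x, post_mean, sigma2 + post_var)
print(p_mc, p_exact)                           # the two values agree closely
```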


Non-Parametric Methods

• Non-parametric representations
→ Often the functional form of the distribution is unknown
• Estimate the probability density directly from the data, e.g. with:
  • Histograms
  • Kernel density estimation (Parzen window / Gaussian kernels)
  • k-Nearest-Neighbor
  • etc.

Chaohui Wang Introduction to Machine Learning 34 / 63


Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Histograms

• Idea: Partition the data space into distinct bins with widths ∆_i and count the number of observations, n_i, in each bin (among N observations in total):

p_i = n_i / (N ∆_i)

Chaohui Wang Introduction to Machine Learning 35 / 63

Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Histograms

• Idea: Partition the data space into distinct bins with widths ∆_i and count the number of observations, n_i, in each bin (among N observations in total):

p_i = n_i / (N ∆_i)

• Usually the same width is used for all bins: ∆_i = ∆
• In principle, it can be adopted for any dimensionality D
→ But the number of bins grows exponentially with D!
→ A suitable N is required to get an informative histogram
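As an illustration, a minimal NumPy sketch of this estimator (the function name, array names, and bin count are assumptions chosen for the example, not part of the lecture):

```python
import numpy as np

def histogram_density(samples, n_bins=20):
    """Histogram density estimate p_i = n_i / (N * Delta_i) for a 1-D sample."""
    counts, edges = np.histogram(samples, bins=n_bins)   # n_i per bin
    widths = np.diff(edges)                              # Delta_i per bin
    density = counts / (len(samples) * widths)           # p_i
    return density, edges

# Example: 1000 samples from a standard normal distribution
rng = np.random.default_rng(0)
density, edges = histogram_density(rng.normal(size=1000))
print(np.sum(density * np.diff(edges)))  # ≈ 1 by construction
```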

Chaohui Wang Introduction to Machine Learning 36 / 63


Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Histograms

• The bin width ∆ acts as a smoothing factor

Chaohui Wang Introduction to Machine Learning 37 / 63

Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Towards More “Statistically”-founded Approaches

• Data point x comes from the underlying pdf p(x): the probability that x falls into a small region R is

P = ∫_R p(y) dy

• If R is sufficiently small such that p(x) is roughly constant over it:

P = ∫_R p(y) dy ≈ p(x) V

where V denotes the volume of R
• If the number N of samples is sufficiently large, we can estimate P as:

P = K/N   =⇒   p(x) ≈ K / (N V)

where K denotes the number of samples falling in R
Chaohui Wang Introduction to Machine Learning 38 / 63
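For concreteness, a small worked example with made-up numbers: if K = 30 of N = 1000 samples fall inside a region of volume V = 0.1, then the estimate is p(x) ≈ K/(N V) = 30/(1000 · 0.1) = 0.3.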



Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Kernel Methods

• Parzen Window: determine the number K of data points inside a fixed hypercube
→ Unit hypercube around the origin:

k(u) = { 1, if |u_i| ≤ 1/2 for all i = 1, . . . , D
       { 0, else

→ Considering a cube with side width h, the count K as a function of position is

K(x) = ∑_{n=1}^N k((x − x_n)/h),   with window volume V = ∫ k(u/h) du = h^D

→ Probability density estimate:

p(x) ≈ K(x)/(N V) = 1/(N h^D) ∑_{n=1}^N k((x − x_n)/h) = (1/N) ∑_{n=1}^N (1/h^D) k((x − x_n)/h)
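A minimal NumPy sketch of this estimator (the function and variable names are my own, chosen for the example):

```python
import numpy as np

def parzen_window_density(x, data, h):
    """Hypercube (Parzen window) density estimate at query points x.

    x    : (Q, D) query points
    data : (N, D) training samples
    h    : side width of the hypercube
    """
    N, D = data.shape
    # k((x - x_n)/h) = 1 iff every coordinate differs by at most h/2
    diff = np.abs(x[:, None, :] - data[None, :, :]) / h   # (Q, N, D)
    inside = np.all(diff <= 0.5, axis=-1)                 # (Q, N)
    K = inside.sum(axis=1)                                # points inside each window
    return K / (N * h**D)                                 # p(x) ≈ K / (N V), V = h^D

# Example usage on 1-D data
rng = np.random.default_rng(1)
data = rng.normal(size=(500, 1))
queries = np.linspace(-3, 3, 5).reshape(-1, 1)
print(parzen_window_density(queries, data, h=0.5))
```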

Chaohui Wang Introduction to Machine Learning 40 / 63


Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Kernel Methods

• Parzen Window - Interpretations
  • 1st interpretation: place a rescaled kernel window at location x and count how many data points fall inside it
  • 2nd interpretation: place a rescaled kernel window k around each data point x_n and sum up their influences at location x
  → Direct visualization of the density
• Issue: artificial discontinuities at the cube boundaries
  → a smoother k function (e.g., Gaussian) gives a smoother density model

Chaohui Wang Introduction to Machine Learning 41 / 63


Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Kernel Methods: Gaussian Kernel

• Gaussian kernel
  • Kernel function:

k(u) = 1/(2πh²)^{D/2} exp{ −‖u‖² / (2h²) }

K(x) = ∑_{n=1}^N k(x − x_n),   V = ∫ k(u) du = 1

  • Probability density estimate:

p(x) ≈ K(x)/(N V) = (1/N) ∑_{n=1}^N 1/(2πh²)^{D/2} exp{ −‖x − x_n‖² / (2h²) }
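A corresponding NumPy sketch of the Gaussian-kernel estimate (names are illustrative):

```python
import numpy as np

def gaussian_kde(x, data, h):
    """Gaussian kernel density estimate at query points x, with bandwidth h."""
    N, D = data.shape
    sq_dist = np.sum((x[:, None, :] - data[None, :, :]) ** 2, axis=-1)  # ||x - x_n||^2
    norm = (2.0 * np.pi * h**2) ** (D / 2.0)
    return np.exp(-sq_dist / (2.0 * h**2)).sum(axis=1) / (N * norm)

# Smaller h -> spikier estimate; larger h -> smoother estimate
rng = np.random.default_rng(2)
data = rng.normal(size=(500, 2))
queries = np.zeros((1, 2))
print(gaussian_kde(queries, data, h=0.3))
```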

Chaohui Wang Introduction to Machine Learning 42 / 63


Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Kernel Methods - General Principle

• In general, any kernel satisfying the following properties can be used:

k(u) ≥ 0,   ∫ k(u) du = 1

• Then

K(x) = ∑_{n=1}^N k(x − x_n),   V = ∫ k(u) du = 1

• Then we get the probability density estimate

p(x) ≈ K(x)/(N V) = (1/N) ∑_{n=1}^N k(x − x_n)

Chaohui Wang Introduction to Machine Learning 43 / 63



Probability Theory (review) Bayes Decision Theory Probability Density Estimation

K-Nearest Neighbor Density Estimation

• Basic idea: increase the volume V until the Kth closest data point is found
• Fix K, consider a hypersphere centered on x, and let it grow to a volume V̂(x,K) that includes K of the given N data points. Then:

p(x) ≈ K / (N V̂(x,K))

→ Note: strictly speaking, the model produced by K-NN is not a true density model, because its integral over all space diverges (e.g., consider K = 1 and x = x_j, i.e., x exactly on a data point x_j)
→ It is therefore often exploited in a relative manner, to compare between classes, e.g., KNN classification (to be seen in a moment)
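A small NumPy sketch of this estimator, using the standard volume formula for a D-dimensional hypersphere (function and variable names are my own):

```python
import numpy as np
from math import gamma, pi

def knn_density(x, data, K):
    """k-NN density estimate p(x) ~ K / (N * V_hat(x, K)), V_hat = volume of the K-NN ball."""
    N, D = data.shape
    dists = np.linalg.norm(x[:, None, :] - data[None, :, :], axis=-1)  # (Q, N)
    r = np.sort(dists, axis=1)[:, K - 1]                               # radius to K-th neighbour
    unit_ball = pi ** (D / 2) / gamma(D / 2 + 1)                       # volume of the unit D-ball
    V_hat = unit_ball * r ** D
    return K / (N * V_hat)

rng = np.random.default_rng(3)
data = rng.normal(size=(1000, 2))
print(knn_density(np.zeros((1, 2)), data, K=10))
```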

Chaohui Wang Introduction to Machine Learning 45 / 63


Probability Theory (review) Bayes Decision Theory Probability Density Estimation

K-Nearest Neighbor - Examples

Chaohui Wang Introduction to Machine Learning 46 / 63

Probability Theory (review) Bayes Decision Theory Probability Density Estimation

K-Nearest Neighbor Classification

• Recall Bayesian classification: posterior probability

p(C_j|x) = p(x|C_j) p(C_j) / p(x)

• Now we have

p(x) ≈ K / (N V̂(x,K)),   p(x|C_j) ≈ K_j(x,K) / (N_j V̂(x,K)),   p(C_j) ≈ N_j / N

→ p(C_j|x) ≈ [K_j(x,K) / (N_j V̂(x,K))] · (N_j / N) · (N V̂(x,K) / K) = K_j(x,K) / K
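A minimal sketch of the resulting decision rule, i.e. assign x to the class with the largest K_j(x,K) among its K nearest neighbours (array names are assumptions for the example):

```python
import numpy as np

def knn_classify(x, data, labels, K):
    """Predict class labels for query points x by majority vote among the K nearest samples."""
    dists = np.linalg.norm(x[:, None, :] - data[None, :, :], axis=-1)  # (Q, N)
    nearest = np.argsort(dists, axis=1)[:, :K]                         # indices of K nearest
    neighbour_labels = labels[nearest]                                 # (Q, K)
    # K_j(x, K)/K is maximised by the most frequent label among the K neighbours
    return np.array([np.bincount(row).argmax() for row in neighbour_labels])

rng = np.random.default_rng(4)
class0 = rng.normal(loc=-2.0, size=(50, 2))
class1 = rng.normal(loc=+2.0, size=(50, 2))
data = np.vstack([class0, class1])
labels = np.array([0] * 50 + [1] * 50)
print(knn_classify(np.array([[-1.5, -1.5], [2.5, 1.0]]), data, labels, K=5))
```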

Chaohui Wang Introduction to Machine Learning 47 / 63


Probability Theory (review) Bayes Decision Theory Probability Density Estimation

K-Nearest Neighbor Classification

Chaohui Wang Introduction to Machine Learning 48 / 63

Probability Theory (review) Bayes Decision Theory Probability Density Estimation

K-Nearest Neighbor Classification

• Results on an example data set
• K acts as a smoothing parameter
• Theoretical property: when N → ∞, the error rate of the 1-NN classifier is never more than twice the optimal error (obtained from the true conditional class distributions)
→ However, in real applications N is usually far too small for this asymptotic guarantee . . .

Chaohui Wang Introduction to Machine Learning 49 / 63


Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Mixture Models - Motivations

• A single parametric distribution is often not sufficient

Chaohui Wang Introduction to Machine Learning 50 / 63

Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Mixture of Gaussians (MoG)

• Sum of M individual Gaussian distributions
→ In the limit, every smooth distribution can be approximated in this way (if M is large enough)

p(x|θ) = ∑_{m=1}^M π_m p(x|θ_m),   π_m := p(l_n = m|θ_m)

→ Parameters of the MoG: θ = (π_1, µ_1, σ_1, π_2, µ_2, σ_2, . . . , π_M, µ_M, σ_M)

Chaohui Wang Introduction to Machine Learning 51 / 63


Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Mixture of Gaussians (MoG)

• Mixture of Gaussians (MoG):

p(x|θ) = ∑_{m=1}^M π_m p(x|θ_m)

• Prior of component m:

π_m = p(l_n = m|θ_m),   with 0 ≤ π_m ≤ 1 (∀m) and ∑_{m=1}^M π_m = 1

• Likelihood of x given the component m:

p(x|θ_m) = 1/(2πσ_m²)^{1/2} exp{ −(x − µ_m)² / (2σ_m²) }

• ∫ p(x) dx = 1
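A small NumPy sketch that evaluates, and samples from, such a 1-D mixture (the parameter values are made up for illustration):

```python
import numpy as np

def mog_pdf(x, pis, mus, sigmas):
    """Evaluate a 1-D Mixture of Gaussians: p(x|theta) = sum_m pi_m N(x | mu_m, sigma_m^2)."""
    x = np.asarray(x)[..., None]                                # broadcast over components
    comp = np.exp(-(x - mus) ** 2 / (2 * sigmas ** 2)) / np.sqrt(2 * np.pi * sigmas ** 2)
    return comp @ pis                                           # weighted sum over components

def mog_sample(n, pis, mus, sigmas, rng=np.random.default_rng(5)):
    """Sample by first drawing a component label l_n ~ pi, then x ~ N(mu_l, sigma_l^2)."""
    labels = rng.choice(len(pis), size=n, p=pis)
    return rng.normal(mus[labels], sigmas[labels])

pis = np.array([0.3, 0.7])        # mixture weights, sum to 1
mus = np.array([-2.0, 1.0])
sigmas = np.array([0.5, 1.0])
print(mog_pdf([-2.0, 0.0, 1.0], pis, mus, sigmas))
print(mog_sample(5, pis, mus, sigmas))
```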

Chaohui Wang Introduction to Machine Learning 52 / 63


Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Mixture of Gaussians (MoG)

Chaohui Wang Introduction to Machine Learning 53 / 63

Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Mixture of Multivariate Gaussians

Chaohui Wang Introduction to Machine Learning 54 / 63

Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Estimation of MoG

• Maximum Likelihood: there is no direct analytical solution

∂{− log L(θ)} / ∂µ_j = f(π_1, µ_1, Σ_1, π_2, µ_2, Σ_2, . . . , π_M, µ_M, Σ_M)

• Complex gradient function (non-linear mutual dependencies)
→ The optimization of one Gaussian depends on all the other Gaussians
• Iterative numerical optimization could be applied, but there is a simpler method, the Expectation-Maximization (EM) Algorithm
→ Note that its idea is widely used in CV-related fields

Chaohui Wang Introduction to Machine Learning 55 / 63


Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Preliminaries (1)

• Basic Strategy:
  • Model the unobserved component label via a hidden variable
  • Explore the probability that a training example is generated by each component

Chaohui Wang Introduction to Machine Learning 56 / 63


Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Preliminaries (2)

• Mixture Estimation with Labeled Data
  • When examples are labeled, we can estimate the Gaussians independently, e.g., using Maximum Likelihood:

l_i : the label for sample x_i
N : the total number of samples
N̂_j : the number of samples labeled j

π̂_j ← N̂_j / N,   µ̂_j ← (1/N̂_j) ∑_{n: l_n = j} x_n

Σ̂_j ← (1/N̂_j) ∑_{n: l_n = j} (x_n − µ̂_j)(x_n − µ̂_j)^T

• But we don't have such labels l_i.
→ We may use some clustering results at first, but then...
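A direct NumPy sketch of these labeled-data estimates (the function and array names are assumptions; every class label 0..M-1 is assumed to occur in labels):

```python
import numpy as np

def fit_labeled_gaussians(data, labels, M):
    """ML estimates (pi_j, mu_j, Sigma_j) per component when the labels l_i are known."""
    N, D = data.shape
    pis, mus, covs = [], [], []
    for j in range(M):
        X_j = data[labels == j]                  # samples labeled j
        N_j = len(X_j)
        mu_j = X_j.mean(axis=0)
        diff = X_j - mu_j
        pis.append(N_j / N)                      # pi_j = N_j / N
        mus.append(mu_j)                         # mu_j = mean of class-j samples
        covs.append(diff.T @ diff / N_j)         # Sigma_j = (1/N_j) sum (x - mu)(x - mu)^T
    return np.array(pis), np.array(mus), np.array(covs)
```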

Chaohui Wang Introduction to Machine Learning 57 / 63


Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Preliminaries (3)

• Idea: Mixture Estimation with "Soft" Assignments
  • Based on the mixture distribution parameter θ, we can evaluate the posterior probability that x_n was generated from a specific component j:

p(l_n = j|x_n, θ) = p(l_n = j, x_n|θ) / p(x_n|θ) = p(l_n = j, x_n|θ) / ∑_{m=1}^M π_m p(x_n|θ_m)

p(l_n = j, x_n|θ) = p(l_n = j|θ) p(x_n|l_n = j, θ) = π_j p(x_n|θ_j)

→ p(l_n = j|x_n, θ) = π_j p(x_n|θ_j) / ∑_{m=1}^M π_m p(x_n|θ_m)

Chaohui Wang Introduction to Machine Learning 58 / 63


Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Expectation-Maximization (EM) Algorithm

• E-Step: softly assign samples to mixture components

γ_j(x_n) ← π_j N(x_n|µ_j, Σ_j) / ∑_{k=1}^M π_k N(x_n|µ_k, Σ_k),   ∀ j = 1, . . . , M, n = 1, . . . , N

• M-Step: re-estimate the parameters (separately for each mixture component) based on the soft assignments

N̂_j ← ∑_{n=1}^N γ_j(x_n)   (soft number of samples labeled j)

π̂_j^new ← N̂_j / N

µ̂_j^new ← (1/N̂_j) ∑_{n=1}^N γ_j(x_n) x_n

Σ̂_j^new ← (1/N̂_j) ∑_{n=1}^N γ_j(x_n)(x_n − µ̂_j^new)(x_n − µ̂_j^new)^T

→ How to initialize the algorithm then?
Chaohui Wang Introduction to Machine Learning 59 / 63
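Putting the two steps together, a compact NumPy sketch of EM for a Mixture of Gaussians (function names, the random initialization of the means, and the small ridge on the covariances are my own choices for the example):

```python
import numpy as np

def gaussian(X, mu, cov):
    """Multivariate normal density N(x | mu, cov), evaluated at the rows of X."""
    D = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(cov)
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(cov))
    return np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1)) / norm

def em_mog(X, M, n_iters=100, seed=0):
    """EM for a MoG: alternate E-step (responsibilities) and M-step (parameter updates)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pis = np.full(M, 1.0 / M)
    mus = X[rng.choice(N, M, replace=False)]             # initialize means on random samples
    covs = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * M)
    for _ in range(n_iters):
        # E-step: gamma_j(x_n) proportional to pi_j * N(x_n | mu_j, Sigma_j)
        resp = np.stack([pis[j] * gaussian(X, mus[j], covs[j]) for j in range(M)], axis=1)
        resp /= resp.sum(axis=1, keepdims=True)           # (N, M)
        # M-step: re-estimate pi, mu, Sigma from the soft assignments
        Nj = resp.sum(axis=0)                             # soft counts N_hat_j
        pis = Nj / N
        mus = (resp.T @ X) / Nj[:, None]
        for j in range(M):
            diff = X - mus[j]
            # small ridge added to avoid singular covariances (cf. the implementation slide below)
            covs[j] = (resp[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(D)
    return pis, mus, covs
```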


Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Expectation-Maximization (EM) Algorithm

• Initialization:
  • Way 1: initialize the algorithm with a set of initial parameters, then conduct an E-step
  • Way 2: start with a set of initial weights (soft assignments), then do a first M-step

Chaohui Wang Introduction to Machine Learning 60 / 63


Probability Theory (review) Bayes Decision Theory Probability Density Estimation

EM Algorithm - Example

Chaohui Wang Introduction to Machine Learning 61 / 63

Probability Theory (review) Bayes Decision Theory Probability Density Estimation

EM Algorithm - Implementation

• One issue in practice: singularities in the estimation
→ Mixture components may collapse onto single data points
• Why? If component j is exactly centered on a data point x_n, this data point contributes an infinite term to the likelihood function as the component's variance shrinks to zero
• How? Introduce regularization, e.g., by enforcing a minimum width for the Gaussians: use (Σ + σ_min I)^{−1} instead of Σ^{−1}
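In code this amounts to a one-line change before inverting the covariance; a minimal sketch, where sigma_min is a hypothetical hyperparameter to be tuned:

```python
import numpy as np

def regularize_cov(cov, sigma_min=1e-3):
    """Enforce a minimum width for a Gaussian by adding sigma_min * I to its covariance."""
    D = cov.shape[0]
    return cov + sigma_min * np.eye(D)

# e.g. use np.linalg.inv(regularize_cov(Sigma)) instead of np.linalg.inv(Sigma)
print(regularize_cov(np.array([[1e-9, 0.0], [0.0, 2.0]])))
```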

Chaohui Wang Introduction to Machine Learning 62 / 63


Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Gaussian Mixture Models - Applications

• Mixture models are used in many practical applications
→ wherever distributions with complex or unknown shapes need to be represented...
• Popular applications in Computer Vision, e.g., modeling distributions of pixel colors (see the sketch below):
  • Each pixel is one data point in, e.g., RGB space
  • Learn a MoG to represent the class-conditional densities
  • Use the learned models to classify other pixels
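A hedged sketch of this pipeline using scikit-learn's GaussianMixture (the synthetic pixel arrays, the two classes, and the number of components are assumptions for illustration, not part of the lecture):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-ins for (N, 3) arrays of RGB values from labeled foreground/background regions
rng = np.random.default_rng(6)
fg_pixels = rng.normal(loc=[200, 80, 80], scale=20, size=(500, 3))
bg_pixels = rng.normal(loc=[60, 120, 60], scale=25, size=(500, 3))

# Learn one MoG per class as its class-conditional density p(x | C_j)
fg_model = GaussianMixture(n_components=3, random_state=0).fit(fg_pixels)
bg_model = GaussianMixture(n_components=3, random_state=0).fit(bg_pixels)

# Classify new pixels by comparing class-conditional log-likelihoods (equal priors assumed)
new_pixels = rng.normal(loc=[180, 90, 90], scale=30, size=(10, 3))
is_foreground = fg_model.score_samples(new_pixels) > bg_model.score_samples(new_pixels)
print(is_foreground)
```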

Chaohui Wang Introduction to Machine Learning 63 / 63

