Introduction to Machine Learning, Lecture 2

Chaohui Wang

October 14, 2019

Outline of This Lecture

Probability Theory (review)

Bayes Decision Theory

Probability Density Estimation


Basic Concepts

Let us consider the following scenario:
• Two discrete variables: X ∈ {x_i} and Y ∈ {y_j}, with j = 1, ..., L
• N trials, and denote:
  n_ij = #{X = x_i ∧ Y = y_j}
  c_i = #{X = x_i}
  r_j = #{Y = y_j}

→ We then have:
• Joint probability: Pr(X = x_i, Y = y_j) = n_ij / N
• Marginal probability: Pr(X = x_i) = c_i / N
• Conditional probability: Pr(Y = y_j | X = x_i) = n_ij / c_i
• Sum rule: Pr(X = x_i) = (1/N) ∑_{j=1}^{L} n_ij = ∑_{j=1}^{L} Pr(X = x_i, Y = y_j)
• Product rule: Pr(X = x_i, Y = y_j) = n_ij / N = (n_ij / c_i) · (c_i / N) = Pr(Y = y_j | X = x_i) Pr(X = x_i)
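As a quick numerical check (a minimal NumPy sketch with a made-up count table, not taken from the slides), the definitions above and the sum and product rules can be verified directly from the counts:

```python
import numpy as np

# Illustrative count table n_ij for two discrete variables (rows: x_i, cols: y_j).
n = np.array([[3, 7, 2],
              [5, 1, 6]], dtype=float)
N = n.sum()                      # total number of trials
joint = n / N                    # Pr(X = x_i, Y = y_j) = n_ij / N
c = n.sum(axis=1)                # c_i = #{X = x_i}
marginal_x = c / N               # Pr(X = x_i) = c_i / N
cond_y_given_x = n / c[:, None]  # Pr(Y = y_j | X = x_i) = n_ij / c_i

# Sum rule: the marginal equals the joint summed over Y.
assert np.allclose(marginal_x, joint.sum(axis=1))
# Product rule: the joint equals the conditional times the marginal.
assert np.allclose(joint, cond_y_given_x * marginal_x[:, None])
print(joint)
```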


The Rules of Probability

→ Thus we have:
• Sum rule: p(X) = ∑_Y p(X, Y)
• Product rule: p(X, Y) = p(Y|X) p(X)

→ Finally, we can derive:
• Bayes' Theorem:
  p(Y|X) = p(X|Y) p(Y) / p(X), with p(X) = ∑_Y p(X|Y) p(Y)
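A similarly minimal sketch, assuming small hypothetical tables for p(Y) and p(X|Y), shows Bayes' theorem as an array computation; each column of the resulting p(Y|X) sums to 1 by construction:

```python
import numpy as np

# Hypothetical discrete distributions (values chosen only for illustration).
p_y = np.array([0.3, 0.7])                 # prior p(Y)
p_x_given_y = np.array([[0.2, 0.5, 0.3],   # p(X | Y = y_1)
                        [0.6, 0.1, 0.3]])  # p(X | Y = y_2)

p_x = p_x_given_y.T @ p_y                  # p(X) = sum_Y p(X|Y) p(Y)
# Bayes' theorem: p(Y|X) = p(X|Y) p(Y) / p(X)
p_y_given_x = (p_x_given_y * p_y[:, None]) / p_x[None, :]

print(p_y_given_x)
print(p_y_given_x.sum(axis=0))             # each column sums to 1
```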


Probability Densities

• Probabilities over continuous variables are defined via their probability density function (pdf) p(x):
  Pr(x ∈ (a, b)) = ∫_a^b p(x) dx
• Cumulative distribution function: the probability that x lies in the interval (−∞, z):
  P(z) = ∫_{−∞}^{z} p(x) dx
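To make this concrete, here is a rough numerical sketch (a standard normal pdf, chosen purely for illustration) checking that Pr(a < x < b) equals P(b) − P(a) when the integrals are approximated by simple Riemann sums:

```python
import numpy as np

# Standard normal pdf, used only as an example density.
def p(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

a, b = -1.0, 2.0
xs = np.linspace(-10.0, b, 200001)       # grid reaching far into the left tail
dx = xs[1] - xs[0]
cdf_b = np.sum(p(xs)) * dx                # P(b) ≈ ∫_{-∞}^{b} p(x) dx
cdf_a = np.sum(p(xs[xs <= a])) * dx       # P(a) ≈ ∫_{-∞}^{a} p(x) dx
prob_ab = np.sum(p(xs[(xs > a) & (xs <= b)])) * dx  # ∫_a^b p(x) dx

# Pr(a < x < b) equals the difference of the cdf values.
print(prob_ab, cdf_b - cdf_a)             # both ≈ 0.819
```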


Expectations

• Expectation: the average value of some function f(x) under a probability distribution
  discrete case: E[f] = ∑_x p(x) f(x)
  continuous case: E[f] = ∫ p(x) f(x) dx

→ Given N samples {x_n} drawn from the pdf, the expectation can be approximated by:
  E[f] ≈ (1/N) ∑_{n=1}^{N} f(x_n)

• Conditional expectation:
  discrete case: E_x[f|y] = ∑_x p(x|y) f(x)
  continuous case: E_x[f|y] = ∫ p(x|y) f(x) dx
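The sample approximation of the expectation can be illustrated with a one-line Monte Carlo estimate; the choice f(x) = x² under a standard normal (true value 1) is an assumption made only for this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# E[f] for f(x) = x^2 under a standard normal is exactly 1 (the variance).
samples = rng.standard_normal(100_000)   # N samples drawn from the pdf
mc_estimate = np.mean(samples**2)        # E[f] ≈ (1/N) Σ_n f(x_n)

print(mc_estimate)                       # ≈ 1.0
```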


Variances and Covariances

• Variance of a function f(x):
  var[f] = E[(f(x) − E[f(x)])²] = E[f(x)²] − E[f(x)]²
• Covariance between variables X and Y:
  cov[X, Y] = E_{x,y}[{x − E[x]}{y − E[y]}] = E_{x,y}[xy] − E[x]E[y]
→ Covariance matrix in case X and Y are vectors:
  cov[X, Y] = E_{x,y}[x yᵀ] − E[x]E[yᵀ]
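A short sketch (synthetic correlated 2-D data, chosen only for illustration) comparing the identity cov[X, Y] = E[xyᵀ] − E[x]E[yᵀ] against NumPy's built-in estimator:

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy 2-D data with correlated components (purely illustrative).
x = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[2.0, 0.8], [0.8, 1.0]],
                            size=200_000)

mean = x.mean(axis=0)
# Empirical version of cov = E[x xᵀ] − E[x] E[x]ᵀ.
cov_identity = (x[:, :, None] * x[:, None, :]).mean(axis=0) - np.outer(mean, mean)

print(cov_identity)             # ≈ [[2.0, 0.8], [0.8, 1.0]]
print(np.cov(x, rowvar=False))  # NumPy's estimate (uses the N−1 normalization)
```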


Classification Example

• Handwritten character recognition

→ Goal: classify a letter in a test image such that the probability of misclassification is minimized.


Priors

• Concept 1: priors (a priori probabilities) p(C_k)
• What we "know" (or assume in practice) about the probability of each class before seeing the data.
  Example: C_1 = 'a', C_2 = 'b', p(C_1) = 0.75, p(C_2) = 0.25

→ In general: ∑_k p(C_k) = 1


Conditional probabilities

• Concept 2: conditional probabilities p(x|C_k)
• Feature vector x: characterizes certain properties of the input.
• p(x|C_k): describes the likelihood of x for a given class C_k.

Example: class-conditional densities p(x|a) and p(x|b) for the two letter classes (figure).


How to decide?

• Example: given the class-conditional densities above, which class should we choose for an observed x?

→ Where p(x|b) is much smaller than p(x|a), the decision should be 'a'.
→ Where p(x|a) is much smaller than p(x|b), the decision should be 'b'.
→ But note: p(a) = 0.75 and p(b) = 0.25! What should we do in this case?


Posterior probabilities

• Concept 3: posterior probabilities p(C_k|x)
• p(C_k|x) characterizes the probability of class C_k given the feature vector x.
• Bayes' Theorem:
  p(C_k|x) = p(x|C_k) p(C_k) / p(x) = p(x|C_k) p(C_k) / ∑_i p(x|C_i) p(C_i)
• Interpretation:
  Posterior = (Likelihood × Prior) / Normalization Factor
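A minimal sketch of this computation: the priors are taken from the example above, while the likelihood values are hypothetical numbers standing in for p(x|a) and p(x|b) read off the class-conditional densities at some x. It shows how the larger prior can outweigh a smaller likelihood:

```python
import numpy as np

# Priors from the slides' example; the likelihoods are illustrative stand-ins
# for p(x|C_k) evaluated at a particular x.
priors = np.array([0.75, 0.25])        # p(a), p(b)
likelihoods = np.array([0.05, 0.12])   # p(x|a), p(x|b)  (hypothetical numbers)

unnormalized = likelihoods * priors
posteriors = unnormalized / unnormalized.sum()    # Bayes' theorem

print(posteriors)                                 # ≈ [0.556, 0.444]
decision = ['a', 'b'][np.argmax(posteriors)]      # pick the larger posterior
print(decision)   # 'a': the prior outweighs the smaller likelihood here
```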


Bayesian Decision Theory

• Goal: minimize the probability of a misclassification

• Optimal decision rule: decide for C_1 if
  p(C_1|x) > p(C_2|x)
  and vice versa.

→ p(C_1|x) > p(C_2|x) is equivalent to:
  p(x|C_1) p(C_1) > p(x|C_2) p(C_2)

→ Further equivalent to the likelihood-ratio test:
  p(x|C_1) / p(x|C_2) > p(C_2) / p(C_1)


Generalization to More Than 2 Classes

• Decide for class C_k if it has the greatest posterior probability of all classes:
  p(C_k|x) > p(C_j|x), ∀j ≠ k
  equivalently: p(x|C_k) p(C_k) > p(x|C_j) p(C_j), ∀j ≠ k

→ Likelihood-ratio test:
  p(x|C_k) / p(x|C_j) > p(C_j) / p(C_k), ∀j ≠ k


Classifying with Loss Functions

• Generalization to decisions with a loss function
• Allows an inhomogeneous loss for different kinds of misclassification
• Can be asymmetric, for example:
  loss(decision = healthy | patient = sick) >> loss(decision = sick | patient = healthy)
• Formalized using a loss matrix: L_kj is the loss for choosing C_j while the truth is C_k


Classifying with Loss Functions

• Goal: choose the decision that minimizes the loss
→ But the loss depends on the true class, which is unknown
• Solution: minimize the expected loss
  E[L] = ∑_k ∑_j ∫_{R_j} L_kj p(x, C_k) dx
→ This can be done by assigning each x to the decision region R_j such that
  ∑_k L_kj p(C_k|x)
  is minimized
→ It is still the posterior probability p(C_k|x) that matters!
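A minimal sketch of this rule, with a hypothetical asymmetric loss matrix in the spirit of the sick/healthy example (the numbers are invented for illustration): even a class with low posterior probability can be the loss-minimizing decision if missing it is costly enough.

```python
import numpy as np

# Hypothetical loss matrix L[k, j]: loss of deciding class j when the truth is
# class k (classes: 0 = sick, 1 = healthy). Values are purely illustrative.
L = np.array([[0.0, 100.0],   # truth sick:    deciding "healthy" is very costly
              [1.0,   0.0]])  # truth healthy: deciding "sick" costs a little

def decide(posteriors):
    """Pick the decision j minimizing the expected loss sum_k L[k, j] * p(C_k|x)."""
    expected_loss = L.T @ posteriors
    return int(np.argmin(expected_loss))

p = np.array([0.1, 0.9])      # posterior p(C_k|x): only 10% probability of "sick"
print(decide(p))              # 0: still decide "sick", because missing it is so costly
```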


Classifying with Loss Functions

• For the binary classification problem: decide for C_1 if
  p(x|C_1) / p(x|C_2) > (L_21 − L_22) p(C_2) / ((L_12 − L_11) p(C_1))

→ Recall the likelihood-ratio test: p(x|C_1) / p(x|C_2) > p(C_2) / p(C_1)
→ Taking the loss function into account leads to the generalization above


Classification via Discriminant Functions

• Formulate classification in terms of comparisons
• Discriminant functions: y_1(x), ..., y_K(x)
• Classify x as class C_k if:
  y_k(x) > y_j(x), ∀j ≠ k

→ Examples (Bayes decision theory):
  y_k(x) = p(C_k|x)
  y_k(x) = p(x|C_k) p(C_k)
  y_k(x) = log p(x|C_k) + log p(C_k)

→ Question: how do we represent and estimate the probabilities p(x|C_k) and p(C_k)?
→ Probability density estimation. E.g., in supervised training, the data and class labels are known.
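As an illustration of the third discriminant above, here is a minimal sketch assuming 1-D Gaussian class-conditionals with hand-picked (not estimated) parameters:

```python
import numpy as np

# Assumed per-class parameters (illustrative, not learned from data).
means  = np.array([-1.0, 2.0])    # class-conditional means
sigmas = np.array([1.0, 0.5])     # class-conditional standard deviations
priors = np.array([0.75, 0.25])   # p(C_k)

def log_gauss(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

def classify(x):
    # y_k(x) = log p(x|C_k) + log p(C_k); pick the largest discriminant.
    y = log_gauss(x, means, sigmas) + np.log(priors)
    return int(np.argmax(y))

print([classify(x) for x in (-1.5, 0.4, 1.8)])   # [0, 0, 1]
```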


Probability Density Estimation

• Methods:
  • Parametric
  • Non-parametric
  • Mixture models


Parametric Methods

• Given:
  • Data X = {x_1, x_2, ..., x_N}
  • A parametric form of the distribution with parameters θ
    → e.g., Gaussian distribution: θ = (µ, σ)
• Learning:
  → estimation of the parameters θ

→ For example: using a Gaussian distribution as the parametric model, what is θ = (µ, σ)?


Maximum Likelihood Approach

• Likelihood L(θ) of θ: the probability that the data X have indeed been generated from a probability density with parameters θ:
  L(θ) = p(X|θ)
• Computation of the likelihood:
  • Single data point: p(x_n|θ)
  • Assuming that all data points are independent:
    L(θ) = ∏_{n=1}^{N} p(x_n|θ)
  • Negative log-likelihood:
    E(θ) = −log L(θ) = −∑_{n=1}^{N} log p(x_n|θ)
• Estimation/learning of the parameters θ:
  • Maximize the likelihood → minimize the negative log-likelihood
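A minimal sketch of the negative log-likelihood for a 1-D Gaussian model on synthetic data (the data-generating parameters are an assumption made for this sketch); unsurprisingly, the parameters that generated the data score better than an arbitrary guess:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(loc=1.5, scale=0.8, size=500)   # synthetic data (illustrative)

def neg_log_likelihood(mu, sigma, data):
    """E(theta) = -sum_n log p(x_n | theta) for a 1-D Gaussian model."""
    return np.sum(0.5 * ((data - mu) / sigma) ** 2
                  + np.log(sigma * np.sqrt(2.0 * np.pi)))

# The parameters that generated the data give a lower NLL than a poor guess.
print(neg_log_likelihood(1.5, 0.8, X))   # smaller
print(neg_log_likelihood(0.0, 2.0, X))   # larger
```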


Maximum Likelihood Approach

• How to minimize the negative log-likelihood?
→ Take the derivative and set it to zero
• Result for the Normal distribution (1D case), θ̂ = (µ̂, σ̂):
  µ_ML = (1/N) ∑_{n=1}^{N} x_n,   σ²_ML = (1/N) ∑_{n=1}^{N} (x_n − µ_ML)²

→ Unfortunately, this is not quite correct ...
→ Assume the samples {x_n} come from a true Gaussian distribution with mean µ and variance σ². Then:
  E[µ_ML] = µ,   E[σ²_ML] = ((N − 1)/N) σ²

• Corrected estimate: σ̃² = (N/(N − 1)) σ²_ML = (1/(N − 1)) ∑_{n=1}^{N} (x_n − µ_ML)²
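The bias E[σ²_ML] = ((N − 1)/N) σ² can be observed empirically; this small simulation (synthetic Gaussian data, N = 5 chosen to make the effect visible) is only a sanity check, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma2, N, trials = 0.0, 4.0, 5, 200_000   # small N makes the bias visible

samples = rng.normal(mu, np.sqrt(sigma2), size=(trials, N))
mu_ml = samples.mean(axis=1)
var_ml = ((samples - mu_ml[:, None]) ** 2).mean(axis=1)   # sigma^2_ML (divides by N)

print(var_ml.mean())                  # ≈ (N-1)/N * sigma^2 = 3.2  (biased)
print((N / (N - 1)) * var_ml.mean())  # ≈ 4.0 after the correction
```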


Maximum Likelihood Approach - Limitations

• It systematically underestimates the variance of the distribution
→ Consider the extreme case N = 1, X = {x_1}: the maximum-likelihood estimate puts the mean at x_1 and the variance at zero, i.e. the fitted Gaussian collapses onto the single observation.
• ML overfits to the observed data
• Although ML is widely used, it is important to be aware of this limitation


A Deeper Reason

• Maximum Likelihood is a Frequentist concept
  • In the Frequentist view, probabilities are the frequencies of random, repeatable events
  • These frequencies are fixed, but can be estimated more precisely as more data become available
• This is in contrast to the Bayesian interpretation
  • In the Bayesian view, probabilities quantify the uncertainty about certain states or events
  • This uncertainty can be revised in the light of new evidence


Bayesian vs. Frequentist View

• To illustrate the difference ...
  • Suppose we want to estimate the uncertainty about whether the Arctic ice cap will have totally disappeared by 2100
  • This question makes no sense in a Frequentist view, since the event cannot be repeated numerous times
  • In the Bayesian view, we generally have a prior, e.g. from calculations of how fast the polar ice is melting
  • If we now get fresh evidence, e.g. from a new satellite, we may revise our opinion and update the uncertainty from the prior, via:
    Posterior ∝ Likelihood × Prior
• This generally allows us to obtain better uncertainty estimates in many situations
→ Main Frequentist criticism: the prior has to come from somewhere, and if it is wrong, the result will be worse


Bayesian Approach to Parameter Learning

• Conceptual shift:
  • Maximum Likelihood views the true parameter vector θ as unknown, but fixed
  • In Bayesian learning, we consider θ to be a random variable
• This allows us to use prior knowledge about the parameters θ:
  • Use a prior for θ
  • The training data then convert this prior distribution over θ into a posterior probability density
→ The prior thus encodes the knowledge we have about the type of distribution we expect to see for θ


Bayesian Approach

• Bayesian view:
  • Consider the parameter vector θ as a random variable
  • When estimating the distribution from the data X, what we are interested in is the density of x given X, which is obtained by integrating the model p(x|θ) over the posterior distribution of θ
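A minimal sketch of Bayesian parameter learning, assuming a Gaussian likelihood with known variance and a Gaussian prior on the mean (a standard conjugate pair; the specific numbers are invented): the training data turn the prior over θ into a posterior.

```python
import numpy as np

rng = np.random.default_rng(4)
sigma2 = 1.0                              # known observation variance (assumed)
mu0, tau2 = 0.0, 10.0                     # Gaussian prior on the mean: N(mu0, tau2)
X = rng.normal(2.0, np.sqrt(sigma2), 20)  # observed data (synthetic)

N = len(X)
# Conjugate update: the posterior over the mean is again Gaussian.
post_var = 1.0 / (1.0 / tau2 + N / sigma2)
post_mean = post_var * (mu0 / tau2 + X.sum() / sigma2)

print(post_mean, post_var)   # the posterior concentrates near the sample mean as N grows
```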


Summary: ML vs. Bayesian Learning

• Maximum Likelihood
  • Simple approach, often analytically possible
  • Problem: the estimate is biased and tends to overfit the data
    → often needs some correction or regularization
  • But: the approximation becomes accurate as N → +∞
• Bayesian Learning
  • General approach, avoids the estimation bias through a prior
  • Problems:
    • Need to choose a suitable prior (not always obvious)
    • The integral over θ is often no longer analytically feasible
      → resort to efficient stochastic sampling techniques (see the sketch below)
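To illustrate the last point, a rough sketch of stochastic integration over θ, reusing the conjugate Gaussian-mean setting assumed in the previous sketch (not an example from the slides): the predictive density p(x|X) = ∫ p(x|θ) p(θ|X) dθ is approximated by averaging over posterior samples and compared to the closed form available in this special case.

```python
import numpy as np

rng = np.random.default_rng(5)

def gauss_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

# Assumed Gaussian posterior over the mean (e.g. from a conjugate update as above).
post_mean, post_var = 1.8, 0.05
sigma2 = 1.0                                   # known observation variance

# Monte Carlo marginalization: p(x|X) ≈ (1/S) Σ_s p(x|θ_s), with θ_s ~ p(θ|X).
thetas = rng.normal(post_mean, np.sqrt(post_var), size=100_000)
x = 2.5
p_mc = gauss_pdf(x, thetas, sigma2).mean()

# For this Gaussian case the integral is also available in closed form:
p_exact = gauss_pdf(x, post_mean, sigma2 + post_var)
print(p_mc, p_exact)                           # the two values agree closely
```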


Non-Parametric Methods

• Non-parametric representations
→ Often the functional form of the distribution is unknown
• Estimate the probability density directly from the data, e.g. with:
  • Histograms
  • Kernel density estimation (Parzen window / Gaussian kernels)
  • k-Nearest-Neighbor
  • etc.

Chaohui Wang Introduction to Machine Learning 34 / 63


Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Histograms

• Idea: Partition the data space into distinct bins with widths ∆_i and count the number of observations, n_i, in each bin (among N observations in total):

p_i = n_i / (N ∆_i)

Chaohui Wang Introduction to Machine Learning 35 / 63

Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Histograms

• Idea: Partition the data space into distinct bins with widths ∆_i and count the number of observations, n_i, in each bin (among N observations in total):

p_i = n_i / (N ∆_i)

• Usually the same width is used for all bins: ∆_i = ∆
• In principle, it can be adopted for any dimensionality D
→ But the number of bins grows exponentially with D!
→ A suitable N is required to get an informative histogram
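As an illustration, a minimal NumPy sketch of this estimator (the function name, array names, and bin count are assumptions chosen for the example, not part of the lecture):

```python
import numpy as np

def histogram_density(samples, n_bins=20):
    """Histogram density estimate p_i = n_i / (N * Delta_i) for a 1-D sample."""
    counts, edges = np.histogram(samples, bins=n_bins)   # n_i per bin
    widths = np.diff(edges)                              # Delta_i per bin
    density = counts / (len(samples) * widths)           # p_i
    return density, edges

# Example: 1000 samples from a standard normal distribution
rng = np.random.default_rng(0)
density, edges = histogram_density(rng.normal(size=1000))
print(np.sum(density * np.diff(edges)))  # ≈ 1 by construction
```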

Chaohui Wang Introduction to Machine Learning 36 / 63


Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Histograms

• The bin width ∆ acts as a smoothing factor

Chaohui Wang Introduction to Machine Learning 37 / 63

Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Towards More “Statistically”-founded Approaches

• Data point x comes from the underlying pdf p(x): the probability that x falls into a small region R is

P = ∫_R p(y) dy

• If R is sufficiently small such that p(x) is roughly constant over it:

P = ∫_R p(y) dy ≈ p(x) V

where V denotes the volume of R
• If the number N of samples is sufficiently large, we can estimate P as:

P = K/N   =⇒   p(x) ≈ K / (N V)

where K denotes the number of samples falling in R
Chaohui Wang Introduction to Machine Learning 38 / 63
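For concreteness, a small worked example with made-up numbers: if K = 30 of N = 1000 samples fall inside a region of volume V = 0.1, then the estimate is p(x) ≈ K/(N V) = 30/(1000 · 0.1) = 0.3.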



Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Kernel Methods

• Parzen Window: determine the number K of data points inside a fixed hypercube
→ Unit hypercube around the origin:

k(u) = { 1, if |u_i| ≤ 1/2 for all i = 1, . . . , D
       { 0, else

→ Considering a cube with side width h, the count K as a function of position is

K(x) = ∑_{n=1}^N k((x − x_n)/h),   with window volume V = ∫ k(u/h) du = h^D

→ Probability density estimate:

p(x) ≈ K(x)/(N V) = 1/(N h^D) ∑_{n=1}^N k((x − x_n)/h) = (1/N) ∑_{n=1}^N (1/h^D) k((x − x_n)/h)
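A minimal NumPy sketch of this estimator (the function and variable names are my own, chosen for the example):

```python
import numpy as np

def parzen_window_density(x, data, h):
    """Hypercube (Parzen window) density estimate at query points x.

    x    : (Q, D) query points
    data : (N, D) training samples
    h    : side width of the hypercube
    """
    N, D = data.shape
    # k((x - x_n)/h) = 1 iff every coordinate differs by at most h/2
    diff = np.abs(x[:, None, :] - data[None, :, :]) / h   # (Q, N, D)
    inside = np.all(diff <= 0.5, axis=-1)                 # (Q, N)
    K = inside.sum(axis=1)                                # points inside each window
    return K / (N * h**D)                                 # p(x) ≈ K / (N V), V = h^D

# Example usage on 1-D data
rng = np.random.default_rng(1)
data = rng.normal(size=(500, 1))
queries = np.linspace(-3, 3, 5).reshape(-1, 1)
print(parzen_window_density(queries, data, h=0.5))
```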

Chaohui Wang Introduction to Machine Learning 40 / 63


Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Kernel Methods

• Parzen Window - Interpretations
  • 1st interpretation: place a rescaled kernel window at location x and count how many data points fall inside it
  • 2nd interpretation: place a rescaled kernel window k around each data point x_n and sum up their influences at location x
  → Direct visualization of the density
• Issue: artificial discontinuities at the cube boundaries
  → a smoother k function (e.g., Gaussian) gives a smoother density model

Chaohui Wang Introduction to Machine Learning 41 / 63


Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Kernel Methods: Gaussian Kernel

• Gaussian kernel
  • Kernel function:

k(u) = 1/(2πh²)^{D/2} exp{ −‖u‖² / (2h²) }

K(x) = ∑_{n=1}^N k(x − x_n),   V = ∫ k(u) du = 1

  • Probability density estimate:

p(x) ≈ K(x)/(N V) = (1/N) ∑_{n=1}^N 1/(2πh²)^{D/2} exp{ −‖x − x_n‖² / (2h²) }
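A corresponding NumPy sketch of the Gaussian-kernel estimate (names are illustrative):

```python
import numpy as np

def gaussian_kde(x, data, h):
    """Gaussian kernel density estimate at query points x, with bandwidth h."""
    N, D = data.shape
    sq_dist = np.sum((x[:, None, :] - data[None, :, :]) ** 2, axis=-1)  # ||x - x_n||^2
    norm = (2.0 * np.pi * h**2) ** (D / 2.0)
    return np.exp(-sq_dist / (2.0 * h**2)).sum(axis=1) / (N * norm)

# Smaller h -> spikier estimate; larger h -> smoother estimate
rng = np.random.default_rng(2)
data = rng.normal(size=(500, 2))
queries = np.zeros((1, 2))
print(gaussian_kde(queries, data, h=0.3))
```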

Chaohui Wang Introduction to Machine Learning 42 / 63


Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Kernel Methods - General Principle

• In general, any kernel satisfying the following properties can be used:

k(u) ≥ 0,   ∫ k(u) du = 1

• Then

K(x) = ∑_{n=1}^N k(x − x_n),   V = ∫ k(u) du = 1

• Then we get the probability density estimate

p(x) ≈ K(x)/(N V) = (1/N) ∑_{n=1}^N k(x − x_n)

Chaohui Wang Introduction to Machine Learning 43 / 63



Probability Theory (review) Bayes Decision Theory Probability Density Estimation

K-Nearest Neighbor Density Estimation

• Basic idea: increase the volume V until the Kth closest data point is found
• Fix K, consider a hypersphere centered on x, and let it grow to a volume V̂(x,K) that includes K of the given N data points. Then:

p(x) ≈ K / (N V̂(x,K))

→ Note: strictly speaking, the model produced by K-NN is not a true density model, because its integral over all space diverges (e.g., consider K = 1 and x = x_j, i.e., x exactly on a data point x_j)
→ It is therefore often exploited in a relative manner, to compare between classes, e.g., KNN classification (to be seen in a moment)
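A small NumPy sketch of this estimator, using the standard volume formula for a D-dimensional hypersphere (function and variable names are my own):

```python
import numpy as np
from math import gamma, pi

def knn_density(x, data, K):
    """k-NN density estimate p(x) ~ K / (N * V_hat(x, K)), V_hat = volume of the K-NN ball."""
    N, D = data.shape
    dists = np.linalg.norm(x[:, None, :] - data[None, :, :], axis=-1)  # (Q, N)
    r = np.sort(dists, axis=1)[:, K - 1]                               # radius to K-th neighbour
    unit_ball = pi ** (D / 2) / gamma(D / 2 + 1)                       # volume of the unit D-ball
    V_hat = unit_ball * r ** D
    return K / (N * V_hat)

rng = np.random.default_rng(3)
data = rng.normal(size=(1000, 2))
print(knn_density(np.zeros((1, 2)), data, K=10))
```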

Chaohui Wang Introduction to Machine Learning 45 / 63


Probability Theory (review) Bayes Decision Theory Probability Density Estimation

K-Nearest Neighbor - Examples

Chaohui Wang Introduction to Machine Learning 46 / 63

Probability Theory (review) Bayes Decision Theory Probability Density Estimation

K-Nearest Neighbor Classification

• Recall Bayesian classification: posterior probability

p(C_j|x) = p(x|C_j) p(C_j) / p(x)

• Now we have

p(x) ≈ K / (N V̂(x,K)),   p(x|C_j) ≈ K_j(x,K) / (N_j V̂(x,K)),   p(C_j) ≈ N_j / N

→ p(C_j|x) ≈ [K_j(x,K) / (N_j V̂(x,K))] · (N_j / N) · (N V̂(x,K) / K) = K_j(x,K) / K
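A minimal sketch of the resulting decision rule, i.e. assign x to the class with the largest K_j(x,K) among its K nearest neighbours (array names are assumptions for the example):

```python
import numpy as np

def knn_classify(x, data, labels, K):
    """Predict class labels for query points x by majority vote among the K nearest samples."""
    dists = np.linalg.norm(x[:, None, :] - data[None, :, :], axis=-1)  # (Q, N)
    nearest = np.argsort(dists, axis=1)[:, :K]                         # indices of K nearest
    neighbour_labels = labels[nearest]                                 # (Q, K)
    # K_j(x, K)/K is maximised by the most frequent label among the K neighbours
    return np.array([np.bincount(row).argmax() for row in neighbour_labels])

rng = np.random.default_rng(4)
class0 = rng.normal(loc=-2.0, size=(50, 2))
class1 = rng.normal(loc=+2.0, size=(50, 2))
data = np.vstack([class0, class1])
labels = np.array([0] * 50 + [1] * 50)
print(knn_classify(np.array([[-1.5, -1.5], [2.5, 1.0]]), data, labels, K=5))
```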

Chaohui Wang Introduction to Machine Learning 47 / 63


Probability Theory (review) Bayes Decision Theory Probability Density Estimation

K-Nearest Neighbor Classification

Chaohui Wang Introduction to Machine Learning 48 / 63

Probability Theory (review) Bayes Decision Theory Probability Density Estimation

K-Nearest Neighbor Classification

• Results on an example data set
• K acts as a smoothing parameter
• Theoretical property: when N → ∞, the error rate of the 1-NN classifier is never more than twice the optimal error (obtained from the true conditional class distributions)
→ However, in real applications N is usually far too small for this asymptotic guarantee . . .

Chaohui Wang Introduction to Machine Learning 49 / 63


Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Mixture Models - Motivations

• A single parametric distribution is often not sufficient

Chaohui Wang Introduction to Machine Learning 50 / 63

Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Mixture of Gaussians (MoG)

• Sum of M individual Gaussian distributions
→ In the limit, every smooth distribution can be approximated in this way (if M is large enough)

p(x|θ) = ∑_{m=1}^M π_m p(x|θ_m),   π_m := p(l_n = m|θ_m)

→ Parameters of the MoG: θ = (π_1, µ_1, σ_1, π_2, µ_2, σ_2, . . . , π_M, µ_M, σ_M)

Chaohui Wang Introduction to Machine Learning 51 / 63


Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Mixture of Gaussians (MoG)

• Mixture of Gaussians (MoG):

p(x|θ) = ∑_{m=1}^M π_m p(x|θ_m)

• Prior of component m:

π_m = p(l_n = m|θ_m),   with 0 ≤ π_m ≤ 1 (∀m) and ∑_{m=1}^M π_m = 1

• Likelihood of x given the component m:

p(x|θ_m) = 1/(2πσ_m²)^{1/2} exp{ −(x − µ_m)² / (2σ_m²) }

• ∫ p(x) dx = 1
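A small NumPy sketch that evaluates, and samples from, such a 1-D mixture (the parameter values are made up for illustration):

```python
import numpy as np

def mog_pdf(x, pis, mus, sigmas):
    """Evaluate a 1-D Mixture of Gaussians: p(x|theta) = sum_m pi_m N(x | mu_m, sigma_m^2)."""
    x = np.asarray(x)[..., None]                                # broadcast over components
    comp = np.exp(-(x - mus) ** 2 / (2 * sigmas ** 2)) / np.sqrt(2 * np.pi * sigmas ** 2)
    return comp @ pis                                           # weighted sum over components

def mog_sample(n, pis, mus, sigmas, rng=np.random.default_rng(5)):
    """Sample by first drawing a component label l_n ~ pi, then x ~ N(mu_l, sigma_l^2)."""
    labels = rng.choice(len(pis), size=n, p=pis)
    return rng.normal(mus[labels], sigmas[labels])

pis = np.array([0.3, 0.7])        # mixture weights, sum to 1
mus = np.array([-2.0, 1.0])
sigmas = np.array([0.5, 1.0])
print(mog_pdf([-2.0, 0.0, 1.0], pis, mus, sigmas))
print(mog_sample(5, pis, mus, sigmas))
```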

Chaohui Wang Introduction to Machine Learning 52 / 63


Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Mixture of Gaussians (MoG)

Chaohui Wang Introduction to Machine Learning 53 / 63

Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Mixture of Multivariate Gaussians

Chaohui Wang Introduction to Machine Learning 54 / 63

Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Estimation of MoG

• Maximum Likelihood: there is no direct analytical solution

∂{− log L(θ)} / ∂µ_j = f(π_1, µ_1, Σ_1, π_2, µ_2, Σ_2, . . . , π_M, µ_M, Σ_M)

• Complex gradient function (non-linear mutual dependencies)
→ The optimization of one Gaussian depends on all the other Gaussians
• Iterative numerical optimization could be applied, but there is a simpler method, the Expectation-Maximization (EM) Algorithm
→ Note that its idea is widely used in CV-related fields

Chaohui Wang Introduction to Machine Learning 55 / 63


Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Preliminaries (1)

• Basic Strategy:
  • Model the unobserved component label via a hidden variable
  • Explore the probability that a training example is generated by each component

Chaohui Wang Introduction to Machine Learning 56 / 63


Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Preliminaries (2)

• Mixture Estimation with Labeled Data
  • When examples are labeled, we can estimate the Gaussians independently, e.g., using Maximum Likelihood:

l_i : the label for sample x_i
N : the total number of samples
N̂_j : the number of samples labeled j

π̂_j ← N̂_j / N,   µ̂_j ← (1/N̂_j) ∑_{n: l_n = j} x_n

Σ̂_j ← (1/N̂_j) ∑_{n: l_n = j} (x_n − µ̂_j)(x_n − µ̂_j)^T

• But we don't have such labels l_i.
→ We may use some clustering results at first, but then...
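A direct NumPy sketch of these labeled-data estimates (the function and array names are assumptions; every class label 0..M-1 is assumed to occur in labels):

```python
import numpy as np

def fit_labeled_gaussians(data, labels, M):
    """ML estimates (pi_j, mu_j, Sigma_j) per component when the labels l_i are known."""
    N, D = data.shape
    pis, mus, covs = [], [], []
    for j in range(M):
        X_j = data[labels == j]                  # samples labeled j
        N_j = len(X_j)
        mu_j = X_j.mean(axis=0)
        diff = X_j - mu_j
        pis.append(N_j / N)                      # pi_j = N_j / N
        mus.append(mu_j)                         # mu_j = mean of class-j samples
        covs.append(diff.T @ diff / N_j)         # Sigma_j = (1/N_j) sum (x - mu)(x - mu)^T
    return np.array(pis), np.array(mus), np.array(covs)
```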

Chaohui Wang Introduction to Machine Learning 57 / 63


Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Preliminaries (3)

• Idea: Mixture Estimation with "Soft" Assignments
  • Based on the mixture distribution parameter θ, we can evaluate the posterior probability that x_n was generated from a specific component j:

p(l_n = j|x_n, θ) = p(l_n = j, x_n|θ) / p(x_n|θ) = p(l_n = j, x_n|θ) / ∑_{m=1}^M π_m p(x_n|θ_m)

p(l_n = j, x_n|θ) = p(l_n = j|θ) p(x_n|l_n = j, θ) = π_j p(x_n|θ_j)

→ p(l_n = j|x_n, θ) = π_j p(x_n|θ_j) / ∑_{m=1}^M π_m p(x_n|θ_m)

Chaohui Wang Introduction to Machine Learning 58 / 63


Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Expectation-Maximization (EM) Algorithm

• E-Step: softly assign samples to mixture components

γ_j(x_n) ← π_j N(x_n|µ_j, Σ_j) / ∑_{k=1}^M π_k N(x_n|µ_k, Σ_k),   ∀ j = 1, . . . , M, n = 1, . . . , N

• M-Step: re-estimate the parameters (separately for each mixture component) based on the soft assignments

N̂_j ← ∑_{n=1}^N γ_j(x_n)   (soft number of samples labeled j)

π̂_j^new ← N̂_j / N

µ̂_j^new ← (1/N̂_j) ∑_{n=1}^N γ_j(x_n) x_n

Σ̂_j^new ← (1/N̂_j) ∑_{n=1}^N γ_j(x_n)(x_n − µ̂_j^new)(x_n − µ̂_j^new)^T

→ How to initialize the algorithm then?
Chaohui Wang Introduction to Machine Learning 59 / 63
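Putting the two steps together, a compact NumPy sketch of EM for a Mixture of Gaussians (function names, the random initialization of the means, and the small ridge on the covariances are my own choices for the example):

```python
import numpy as np

def gaussian(X, mu, cov):
    """Multivariate normal density N(x | mu, cov), evaluated at the rows of X."""
    D = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(cov)
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(cov))
    return np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1)) / norm

def em_mog(X, M, n_iters=100, seed=0):
    """EM for a MoG: alternate E-step (responsibilities) and M-step (parameter updates)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pis = np.full(M, 1.0 / M)
    mus = X[rng.choice(N, M, replace=False)]             # initialize means on random samples
    covs = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * M)
    for _ in range(n_iters):
        # E-step: gamma_j(x_n) proportional to pi_j * N(x_n | mu_j, Sigma_j)
        resp = np.stack([pis[j] * gaussian(X, mus[j], covs[j]) for j in range(M)], axis=1)
        resp /= resp.sum(axis=1, keepdims=True)           # (N, M)
        # M-step: re-estimate pi, mu, Sigma from the soft assignments
        Nj = resp.sum(axis=0)                             # soft counts N_hat_j
        pis = Nj / N
        mus = (resp.T @ X) / Nj[:, None]
        for j in range(M):
            diff = X - mus[j]
            # small ridge added to avoid singular covariances (cf. the implementation slide below)
            covs[j] = (resp[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(D)
    return pis, mus, covs
```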


Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Expectation-Maximization (EM) Algorithm

• Initialization:
  • Way 1: initialize the algorithm with a set of initial parameters, then conduct an E-step
  • Way 2: start with a set of initial weights (soft assignments), then do a first M-step

Chaohui Wang Introduction to Machine Learning 60 / 63


Probability Theory (review) Bayes Decision Theory Probability Density Estimation

EM Algorithm - Example

Chaohui Wang Introduction to Machine Learning 61 / 63

Probability Theory (review) Bayes Decision Theory Probability Density Estimation

EM Algorithm - Implementation

• One issue in practice: singularities in the estimation
→ Mixture components may collapse onto single data points
• Why? If component j is exactly centered on a data point x_n, this data point contributes an infinite term to the likelihood function as the component's variance shrinks to zero
• How? Introduce regularization, e.g., by enforcing a minimum width for the Gaussians: use (Σ + σ_min I)^{−1} instead of Σ^{−1}
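In code this amounts to a one-line change before inverting the covariance; a minimal sketch, where sigma_min is a hypothetical hyperparameter to be tuned:

```python
import numpy as np

def regularize_cov(cov, sigma_min=1e-3):
    """Enforce a minimum width for a Gaussian by adding sigma_min * I to its covariance."""
    D = cov.shape[0]
    return cov + sigma_min * np.eye(D)

# e.g. use np.linalg.inv(regularize_cov(Sigma)) instead of np.linalg.inv(Sigma)
print(regularize_cov(np.array([[1e-9, 0.0], [0.0, 2.0]])))
```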

Chaohui Wang Introduction to Machine Learning 62 / 63


Probability Theory (review) Bayes Decision Theory Probability Density Estimation

Gaussian Mixture Models - Applications

• Mixture models are used in many practical applications
→ wherever distributions with complex or unknown shapes need to be represented...
• Popular applications in Computer Vision, e.g., modeling distributions of pixel colors (see the sketch below):
  • Each pixel is one data point in, e.g., RGB space
  • Learn a MoG to represent the class-conditional densities
  • Use the learned models to classify other pixels
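A hedged sketch of this pipeline using scikit-learn's GaussianMixture (the synthetic pixel arrays, the two classes, and the number of components are assumptions for illustration, not part of the lecture):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-ins for (N, 3) arrays of RGB values from labeled foreground/background regions
rng = np.random.default_rng(6)
fg_pixels = rng.normal(loc=[200, 80, 80], scale=20, size=(500, 3))
bg_pixels = rng.normal(loc=[60, 120, 60], scale=25, size=(500, 3))

# Learn one MoG per class as its class-conditional density p(x | C_j)
fg_model = GaussianMixture(n_components=3, random_state=0).fit(fg_pixels)
bg_model = GaussianMixture(n_components=3, random_state=0).fit(bg_pixels)

# Classify new pixels by comparing class-conditional log-likelihoods (equal priors assumed)
new_pixels = rng.normal(loc=[180, 90, 90], scale=30, size=(10, 3))
is_foreground = fg_model.score_samples(new_pixels) > bg_model.score_samples(new_pixels)
print(is_foreground)
```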

Chaohui Wang Introduction to Machine Learning 63 / 63

