ECE595 / STAT598: Machine Learning I
Lecture 10: Minimum Probability of Error Rule
Spring 2020
Stanley Chan
School of Electrical and Computer Engineering, Purdue University
© Stanley Chan 2020. All Rights Reserved.
Overview
In linear discriminant analysis (LDA), there are generally two types of approaches:
Generative approach: Estimate the model, then define the classifier
Discriminative approach: Directly define the classifier
Outline
Generative Approaches
Lecture 9 Bayesian Decision Rules
Lecture 10 Evaluating Performance
Lecture 11 Parameter Estimation
Lecture 12 Bayesian Prior
Lecture 13 Connecting Bayesian and Linear Regression
Today’s Lecture
The Three Cases
Σ_i = σ²I (Last Lecture)
Σ_i = Σ
General Σ_i
Evaluating Performance
Probability of Error
Bayesian Decision Rule is also Minimum Error Rule
ROC Curve
Three Cases of Gaussians
Discriminant function of Gaussian:
g_i(x) = log p_{X|Y}(x | i) + log π_i
       = −(1/2)(x − µ_i)^T Σ_i^{−1}(x − µ_i) − (1/2) log|Σ_i| + log π_i.
Σ_i = σ²I: All Gaussians have the same covariance matrix; it is diagonal with the same variance in every dimension.
Σ_i = Σ: All Gaussians have the same covariance matrix, which can be anything.
Arbitrary Σ_i: Each Gaussian has its own positive semi-definite covariance matrix.
Last Lecture
Theorem
If Σ_i = σ²I, then the separating hyperplane is given by
g(x) = w^T x + w_0 = 0,
where
w = (µ_i − µ_j)/σ²,  and  w_0 = −(‖µ_i‖² − ‖µ_j‖²)/(2σ²) + log(π_i/π_j).
You tell me the two Gaussians: µ_i, µ_j, π_i, π_j, σ.
I return you a separating hyperplane g(x) = w^T x + w_0.
This is the best possible hyperplane according to the posterior distribution.
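As a concrete illustration, here is a minimal sketch (not part of the original slides; the function name and example numbers are made up) that turns µ_i, µ_j, π_i, π_j, σ into the hyperplane parameters w and w_0 from the theorem and classifies a point by the sign of g(x):

```python
import numpy as np

# Sketch: build the Case 1 hyperplane g(x) = w^T x + w_0 from the theorem,
# assuming Sigma_i = sigma^2 * I for both classes.
def case1_hyperplane(mu_i, mu_j, pi_i, pi_j, sigma):
    mu_i, mu_j = np.asarray(mu_i, float), np.asarray(mu_j, float)
    w = (mu_i - mu_j) / sigma**2
    w0 = -(mu_i @ mu_i - mu_j @ mu_j) / (2 * sigma**2) + np.log(pi_i / pi_j)
    return w, w0

# Hypothetical example: classify a point by the sign of g(x).
w, w0 = case1_hyperplane(mu_i=[2.0, 0.0], mu_j=[0.0, 0.0], pi_i=0.5, pi_j=0.5, sigma=1.0)
x = np.array([1.3, -0.4])
print("class i" if w @ x + w0 > 0 else "class j")
```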
Case 1: Σ_i = σ²I: Geometry
w = (µ_i − µ_j)/σ²,  and  x_0 = (µ_i + µ_j)/2 − (σ²/‖µ_i − µ_j‖²) log(π_i/π_j) (µ_i − µ_j).
The boundary g(x) = w^T(x − x_0) = 0 is the hyperplane passing through x_0 with normal vector w; with equal priors, x_0 is the midpoint between the two means.
Case 2: Σ_i = Σ
Make all Σ_i = Σ.
g_i(x) = −(1/2)(x − µ_i)^T Σ^{−1}(x − µ_i) − (1/2) log|Σ| + log π_i.
In this case, we can show that
g_i(x) = −(1/2)(x − µ_i)^T Σ^{−1}(x − µ_i) − (1/2) log|Σ| + log π_i     (drop −(1/2) log|Σ|: it is the same for every class)
       = −(1/2)(x^T Σ^{−1} x − 2µ_i^T Σ^{−1} x + µ_i^T Σ^{−1} µ_i) + log π_i     (drop x^T Σ^{−1} x: it is the same for every class)
       = µ_i^T Σ^{−1} x − (1/2) µ_i^T Σ^{−1} µ_i + log π_i
       = w_i^T x + w_{i0},
where w_i = Σ^{−1} µ_i and w_{i0} = −(1/2) µ_i^T Σ^{−1} µ_i + log π_i.
Case 2: Σ_i = Σ
The discriminant function is therefore:
g(x) = g_i(x) − g_j(x)
     = (w_i − w_j)^T x + (w_{i0} − w_{j0})
     = [Σ^{−1}(µ_i − µ_j)]^T x − (1/2)(µ_i^T Σ^{−1} µ_i − µ_j^T Σ^{−1} µ_j) + log(π_i/π_j),
so that w = Σ^{−1}(µ_i − µ_j) and w_0 = −(1/2)(µ_i^T Σ^{−1} µ_i − µ_j^T Σ^{−1} µ_j) + log(π_i/π_j).
If we want to write g(x) = w^T(x − x_0), we can show that
w = Σ^{−1}(µ_i − µ_j),
x_0 = (µ_i + µ_j)/2 − [log(π_i/π_j) / ((µ_i − µ_j)^T Σ^{−1}(µ_i − µ_j))] (µ_i − µ_j).
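A minimal numerical sketch (not from the slides; the function name and the example covariance are made up): compute w and x_0 exactly as above and classify by the sign of w^T(x − x_0).

```python
import numpy as np

# Sketch: boundary parameters for Case 2 (all classes share one covariance Sigma).
def case2_boundary(mu_i, mu_j, pi_i, pi_j, Sigma):
    mu_i, mu_j = np.asarray(mu_i, float), np.asarray(mu_j, float)
    diff = mu_i - mu_j
    w = np.linalg.solve(Sigma, diff)                     # Sigma^{-1} (mu_i - mu_j)
    x0 = (mu_i + mu_j) / 2 - (np.log(pi_i / pi_j) / (diff @ w)) * diff
    return w, x0

# Hypothetical example.
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
w, x0 = case2_boundary([2.0, 1.0], [0.0, 0.0], 0.6, 0.4, Sigma)
x = np.array([1.0, 0.2])
print("class i" if w @ (x - x0) > 0 else "class j")
```

Note that diff @ w equals (µ_i − µ_j)^T Σ^{−1}(µ_i − µ_j), the denominator in the x_0 formula.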
Case 2: Σ_i = Σ: Geometry
The discriminant function is g(x) = w^T(x − x_0), with
w = Σ^{−1}(µ_i − µ_j),
x_0 = (µ_i + µ_j)/2 − [log(π_i/π_j) / ((µ_i − µ_j)^T Σ^{−1}(µ_i − µ_j))] (µ_i − µ_j).
Case 3: Arbitrary Σ_i
This is the most general setting.
g_i(x) = −(1/2)(x − µ_i)^T Σ_i^{−1}(x − µ_i) − (1/2) log|Σ_i| + log π_i.
We can show that
g_i(x) = −(1/2) x^T Σ_i^{−1} x + µ_i^T Σ_i^{−1} x − (1/2) µ_i^T Σ_i^{−1} µ_i − (1/2) log|Σ_i| + log π_i
       = (1/2) x^T W_i x + w_i^T x + w_{i0},
where W_i = −Σ_i^{−1}, w_i = Σ_i^{−1} µ_i, and w_{i0} = −(1/2) µ_i^T Σ_i^{−1} µ_i − (1/2) log|Σ_i| + log π_i.
Therefore, the discriminant function is
g(x) = g_i(x) − g_j(x)
     = (1/2) x^T (W_i − W_j) x + (w_i − w_j)^T x + (w_{i0} − w_{j0}).
Case 3: Arbitrary Σ_i: Geometry
When the covariances differ, the decision boundary g(x) = 0 is quadratic rather than a hyperplane.
Case 3: Arbitrary Σ_i: Computing
On a computer, the quadratic classifier is not a big deal!
g(x) = g_i(x) − g_j(x)
     = (1/2) x^T (W_i − W_j) x + (w_i − w_j)^T x + (w_{i0} − w_{j0}).
Recall that the hypothesis function is
h(x) = 1, if g(x) > 0,
       0, if g(x) < 0,
       either, if g(x) = 0.
All you need is to check whether g(x) > 0 or g(x) < 0.
No need to obtain a closed form solution.
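Here is a minimal sketch of that idea (not from the slides; the helper names and example parameters are made up): evaluate g_i(x) and g_j(x) directly from the Gaussian parameters and decide by the sign of their difference, with no closed-form boundary.

```python
import numpy as np

# Sketch: evaluate g(x) = g_i(x) - g_j(x) numerically for arbitrary covariances
# and decide by its sign; no closed-form boundary is required.
def log_discriminant(x, mu, Sigma, pi):
    d = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * d @ np.linalg.solve(Sigma, d) - 0.5 * logdet + np.log(pi)

def classify(x, mu_i, Sigma_i, pi_i, mu_j, Sigma_j, pi_j):
    g = log_discriminant(x, mu_i, Sigma_i, pi_i) - log_discriminant(x, mu_j, Sigma_j, pi_j)
    return 1 if g > 0 else 0   # the "either" case g == 0 is resolved here in favor of class j

# Hypothetical example with two different covariances.
x = np.array([0.5, 1.0])
print(classify(x,
               np.array([1.0, 1.0]),  np.array([[1.0, 0.2], [0.2, 2.0]]), 0.5,
               np.array([-1.0, 0.0]), np.array([[0.5, 0.0], [0.0, 0.5]]), 0.5))
```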
Outline
Generative Approaches
Lecture 9 Bayesian Decision Rules
Lecture 10 Evaluating Performance
Lecture 11 Parameter Estimation
Lecture 12 Bayesian Prior
Lecture 13 Connecting Bayesian and Linear Regression
Today’s Lecture
The Three Cases
Σ_i = σ²I (Last Lecture)
Σ_i = Σ
General Σ_i
Evaluating Performance
Probability of Error
Bayesian Decision Rule is also Minimum Error Rule
ROC Curve
How Well Does the Bayesian Decision Rule Perform?
Gaussian is a model.
If the underlying true distribution is Gaussian, then the two class-conditional densities always overlap.
No matter what you do, there is always some classification error.
What if the Distributions Do Not Match?
If the true distributions do not overlap, then the overlap of the fitted Gaussians is not a problem.
In this lecture we assume the true distribution is indeed Gaussian.
Probability of Error
Suppose we have found a decision boundary, which is a threshold τ in 1D.
The probability of error is
P_e = P[X < τ and Y = 1] + P[X > τ and Y = 0].
Minimize the Probability of Error
Let us do some calculation:
P_e = P[X < τ and Y = 1] + P[X > τ and Y = 0]
    = P[X < τ | Y = 1] P[Y = 1] + P[X > τ | Y = 0] P[Y = 0]
    = π_1 P[X < τ | C_1] + π_0 P[X > τ | C_0]
    = π_1 ∫_{−∞}^{τ} (1/√(2πσ_1²)) e^{−(x−µ_1)²/(2σ_1²)} dx + π_0 ∫_{τ}^{∞} (1/√(2πσ_0²)) e^{−(x−µ_0)²/(2σ_0²)} dx
Can we minimize P_e by finding a good τ?
dP_e/dτ = π_1 (d/dτ) ∫_{−∞}^{τ} (1/√(2πσ_1²)) e^{−(x−µ_1)²/(2σ_1²)} dx + π_0 (d/dτ) ∫_{τ}^{∞} (1/√(2πσ_0²)) e^{−(x−µ_0)²/(2σ_0²)} dx
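Before differentiating, note that the two integrals are just Gaussian CDFs, so P_e(τ) is easy to evaluate numerically. A minimal sketch (not from the slides; the parameter values are made up), using scipy's normal CDF:

```python
import numpy as np
from scipy.stats import norm

# Sketch: P_e(tau) = pi1 * P[X < tau | Y=1] + pi0 * P[X > tau | Y=0],
# where each class-conditional is Gaussian, so each term is a normal CDF.
def prob_error(tau, mu0, sigma0, pi0, mu1, sigma1, pi1):
    miss = norm.cdf(tau, loc=mu1, scale=sigma1)               # P[X < tau | Y = 1]
    false_alarm = 1.0 - norm.cdf(tau, loc=mu0, scale=sigma0)  # P[X > tau | Y = 0]
    return pi1 * miss + pi0 * false_alarm

# Hypothetical example: sweep tau and locate the minimizer numerically.
taus = np.linspace(-2.0, 4.0, 601)
pe = [prob_error(t, mu0=0.0, sigma0=1.0, pi0=0.5, mu1=2.0, sigma1=1.0, pi1=0.5) for t in taus]
print("numerical minimizer:", taus[int(np.argmin(pe))])  # about 1.0 = (mu0 + mu1)/2 here
```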
Minimize the Classification Error
Fundamental Theorem of Calculus:
d/dτ ∫_{−∞}^{τ} f(x) dx = f(τ)
Therefore, in our problem,
(d/dτ) ∫_{−∞}^{τ} (1/√(2πσ_1²)) e^{−(x−µ_1)²/(2σ_1²)} dx = (1/√(2πσ_1²)) e^{−(τ−µ_1)²/(2σ_1²)},
(d/dτ) ∫_{τ}^{∞} (1/√(2πσ_0²)) e^{−(x−µ_0)²/(2σ_0²)} dx = −(1/√(2πσ_0²)) e^{−(τ−µ_0)²/(2σ_0²)}.
Special Case: σ_1 = σ_0 = σ
Let us assume that σ_1 = σ_0 = σ. Then,
dP_e/dτ = π_1 (d/dτ) ∫_{−∞}^{τ} (1/√(2πσ²)) e^{−(x−µ_1)²/(2σ²)} dx + π_0 (d/dτ) ∫_{τ}^{∞} (1/√(2πσ²)) e^{−(x−µ_0)²/(2σ²)} dx
        = π_1 (1/√(2πσ²)) e^{−(τ−µ_1)²/(2σ²)} − π_0 (1/√(2πσ²)) e^{−(τ−µ_0)²/(2σ²)} = 0
So finally, we can equate the two terms (the common factor 1/√(2πσ²) cancels) and get
π_1 e^{−(τ−µ_1)²/(2σ²)} = π_0 e^{−(τ−µ_0)²/(2σ²)}
log{π_1 e^{−(τ−µ_1)²/(2σ²)}} = log{π_0 e^{−(τ−µ_0)²/(2σ²)}}
τ = (µ_1 + µ_0)/2 − (σ²/(µ_1 − µ_0)) log(π_1/π_0).
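A minimal numerical check (not from the slides; the parameter values are made up): compute the closed-form τ and compare it with a brute-force grid search over P_e(τ).

```python
import numpy as np
from scipy.stats import norm

# Sketch: closed-form threshold tau* = (mu1 + mu0)/2 - sigma^2/(mu1 - mu0) * log(pi1/pi0),
# compared against a grid search over P_e(tau). Values below are hypothetical.
mu0, mu1, sigma, pi0, pi1 = 0.0, 2.0, 1.0, 0.3, 0.7
tau_star = (mu1 + mu0) / 2 - sigma**2 / (mu1 - mu0) * np.log(pi1 / pi0)

taus = np.linspace(-3.0, 5.0, 8001)
pe = pi1 * norm.cdf(taus, mu1, sigma) + pi0 * (1 - norm.cdf(taus, mu0, sigma))
print(tau_star, taus[int(np.argmin(pe))])  # the two values agree up to the grid resolution
```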
Optimality of Bayesian Decision Rule
Therefore,
τ = (µ_1 + µ_0)/2 − (σ²/(µ_1 − µ_0)) log(π_1/π_0),
and this means that
h(x) = 1, if x > (µ_1 + µ_0)/2 − (σ²/(µ_1 − µ_0)) log(π_1/π_0),
       0, if x < (µ_1 + µ_0)/2 − (σ²/(µ_1 − µ_0)) log(π_1/π_0),
       either, if x = (µ_1 + µ_0)/2 − (σ²/(µ_1 − µ_0)) log(π_1/π_0).
This is the exact same result as the Bayesian decision rule.
The Bayesian decision rule is optimal with respect to the posterior
The Bayesian decision rule is optimal with respect to the probability of error
Optimal ≠ error-free
You still have errors even with infinite training data
Two Levels of Optimality
Error due to Finite Training Samples
You need to estimate µ and Σ
More samples, better estimate
Controlled by Hoeffding inequality
Error due to Linear Non-Separability
Error even if you have infinite training data
Fundamental limit of the model
Gaussians are not linearly separable
Minimizing the error probability ≠ zero error
Risk and Probability of Error
Is Your “Optimal” Really Optimal?
P_e = P[X < τ | Y = 1] π_1 + P[X > τ | Y = 0] π_0.
Y = 1: disease present. Y = 0: normal.
X < τ : report no disease. X > τ : report disease.
P[X < τ | Y = 1]: There is a disease, but you cannot find it.
P[X > τ | Y = 0]: There is no disease, but you say there is.
Which one is more serious?
We need to define the risk:
R = C_{01} P[X < τ | Y = 1] π_1 + C_{10} P[X > τ | Y = 0] π_0,
where P[X < τ | Y = 1] is the miss probability and P[X > τ | Y = 0] is the false-alarm probability.
C_{01}: cost of a miss. C_{10}: cost of a false alarm.
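A minimal sketch (not from the slides; the cost values and Gaussian parameters are made up): weight the miss and false-alarm probabilities by C_{01} and C_{10} and see how the risk-minimizing threshold moves when misses are more expensive.

```python
import numpy as np
from scipy.stats import norm

# Sketch: R(tau) = C01 * P[X < tau | Y=1] * pi1 + C10 * P[X > tau | Y=0] * pi0.
def risk(tau, mu0, mu1, sigma, pi0, pi1, C01, C10):
    p_miss = norm.cdf(tau, loc=mu1, scale=sigma)               # P[X < tau | Y = 1]
    p_false_alarm = 1.0 - norm.cdf(tau, loc=mu0, scale=sigma)  # P[X > tau | Y = 0]
    return C01 * p_miss * pi1 + C10 * p_false_alarm * pi0

# Hypothetical example: missing a disease costs 10x a false alarm.
taus = np.linspace(-3.0, 5.0, 2001)
r = [risk(t, mu0=0.0, mu1=2.0, sigma=1.0, pi0=0.5, pi1=0.5, C01=10.0, C10=1.0) for t in taus]
print("risk-minimizing tau:", taus[int(np.argmin(r))])  # below (mu0 + mu1)/2: we report disease more readily
```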
False Alarm and Detection
R = C_{01} P[X < τ | Y = 1] π_1 + C_{10} P[X > τ | Y = 0] π_0, where the first conditional probability is P_M and the second is P_F.
P_M: probability of miss
P_D = 1 − P_M: probability of detection
P_F: probability of false alarm
Receiver Operating Characteristic (ROC) Curve
P_M, P_D, and P_F are all functions of τ.
If you change τ, they will change correspondingly.
We can plot P_F(τ) and P_D(τ) as τ changes.
This is called a Receiver Operating Characteristic (ROC) curve.
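A minimal sketch (not from the slides; the parameter values are made up): for two Gaussian class-conditionals, sweep τ and trace out (P_F(τ), P_D(τ)).

```python
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

# Sketch: each tau gives one point (PF, PD); sweeping tau traces the ROC curve.
mu0, mu1, sigma = 0.0, 2.0, 1.0
taus = np.linspace(-5.0, 7.0, 500)
PF = 1.0 - norm.cdf(taus, loc=mu0, scale=sigma)  # false alarm: P[X > tau | Y = 0]
PD = 1.0 - norm.cdf(taus, loc=mu1, scale=sigma)  # detection:   P[X > tau | Y = 1]

plt.plot(PF, PD)
plt.xlabel("P_F (false alarm)")
plt.ylabel("P_D (detection)")
plt.title("ROC for two Gaussian classes")
plt.show()
```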
Interpreting the ROC
A higher ROC curve is better: at the same false-alarm level, it gives a higher detection rate.
One classifier has one ROC curve; a single curve covers all values of τ.
Ways to improve the ROC:
Train with more samples so that you have less training error
Find a better model than the Gaussian
Find a better decision rule than the linear classifier
Reading List
Bayesian Decision Rule
Bishop, Pattern Recognition and Machine Learning, Chapter 4.1
Duda, Hart and Stork’s Pattern Classification, Chapter 2.1, 2.2, 2.6
Stanford CS229 Generative Algorithms: http://cs229.stanford.edu/notes/cs229-notes2.pdf
Probability of Error
Duda, Hart and Stork’s Pattern Classification, Chapter 2.7, 3.1.
Poor, Intro to Signal Estimation and Detection, Chapter 2.
ROC Curve
ECE645 Note: https://engineering.purdue.edu/ChanGroup/ECE645Notes/StudentLecture02.pdf
Appendix
Q&A 1: Is the ROC curve always concave?
No.
The ROC curve is concave when the decision rule comes from Neyman-Pearson. See V. Poor's Intro to Signal Estimation and Detection, Chapter 2.
You can always create a decision rule that performs terribly on one part and very well on the other part.
Most algorithms will generate ROC curves that show staircase effects.
It is important to report the entire ROC curve.
Q&A 2: Type 1 and Type 2 Errors
You have probably heard these two terms before. Below is an image I found on Wikipedia.
Q&A 2: Type 1 and Type 2 Errors
A typical naming issue between statistics and engineering
False Positive = Type 1 Error = False Alarm
False Negative = Type 2 Error = Miss
Both are probabilities
In the previous figure, we see that a decision rule is the location of the black line.
If you move the black line left or right, you will get a point on the ROC curve.
Q&A 3: What is the difference between the ROC and the Precision-Recall Curve (PRC)?
ROC: Receiver Operating Characteristic curve.
Detection: No. TP / (No. TP + No. FN)
False alarm: No. FP / (No. FP + No. TN)
The ROC curve always goes up.
PRC: Precision-Recall Curve.
Precision: No. TP / (No. TP + No. FP)
Recall: No. TP / (No. TP + No. FN)
The PRC always goes down.
There is generally no definitive answer to which of ROC or PRC is better. However, according to the following paper, ROC is better for datasets with balanced classes whereas PRC is better for skewed classes. https://www.biostat.wisc.edu/~page/rocpr.pdf
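A minimal sketch (hypothetical confusion counts): both curves are built from the same TP/FP/FN/TN counts; one threshold τ gives one point on each.

```python
# Hypothetical confusion counts at one particular threshold tau.
TP, FP, FN, TN = 80, 30, 20, 870

detection = recall = TP / (TP + FN)   # same quantity, two names
false_alarm = FP / (FP + TN)
precision = TP / (TP + FP)

print(f"ROC point: (PF={false_alarm:.3f}, PD={detection:.3f})")
print(f"PRC point: (recall={recall:.3f}, precision={precision:.3f})")
```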