© Stanley Chan 2020. All Rights Reserved.

ECE595 / STAT598: Machine Learning I
Lecture 10: Minimum Probability of Error Rule

Spring 2020

Stanley Chan
School of Electrical and Computer Engineering
Purdue University



Overview

In linear discriminant analysis (LDA), there are generally two types of approaches:

Generative approach: Estimate model, then define the classifier

Discriminative approach: Directly define the classifier


Outline

Generative Approaches

Lecture 9 Bayesian Decision Rules

Lecture 10 Evaluating Performance

Lecture 11 Parameter Estimation

Lecture 12 Bayesian Prior

Lecture 13 Connecting Bayesian and Linear Regression

Today's Lecture

The Three Cases
  Σi = σ²I (Last Lecture)
  Σi = Σ
  General Σi
Evaluating Performance
  Probability of Error
  Bayesian Decision Rule is also Minimum Error Rule
  ROC Curve


Three Cases of Gaussians

Discriminant function of a Gaussian:

g_i(x) = \log p_{X|Y}(x \mid i) + \log \pi_i
       = -\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \frac{1}{2}\log|\Sigma_i| + \log \pi_i.

Σi = σ²I:
  All Gaussians have the same covariance matrix
  The covariance matrix is diagonal, with the same variance in every coordinate

Σi = Σ:
  All Gaussians have the same covariance matrix
  The covariance matrix can be anything

Arbitrary Σi:
  Any positive semi-definite covariance matrix
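As a quick numerical sketch (my own illustration, not from the slides; the function name `gaussian_discriminant` is hypothetical), the discriminant g_i(x) above can be evaluated directly, and the predicted class is the one with the larger value:

```python
import numpy as np

def gaussian_discriminant(x, mu, Sigma, pi):
    """g_i(x) = -1/2 (x-mu)^T Sigma^{-1} (x-mu) - 1/2 log|Sigma| + log pi."""
    d = x - mu
    return (-0.5 * d @ np.linalg.solve(Sigma, d)
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(pi))

# Two made-up classes with equal priors; classify by the larger discriminant
mu0, mu1 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
Sigma = np.eye(2)
x = np.array([2.5, 2.8])
g0 = gaussian_discriminant(x, mu0, Sigma, 0.5)
g1 = gaussian_discriminant(x, mu1, Sigma, 0.5)
label = 1 if g1 > g0 else 0   # x is much closer to mu1
```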


Last Lecture

Theorem

If \Sigma_i = \sigma^2 I, then the separating hyperplane is given by

g(x) = w^T x + w_0 = 0,

where

w = \frac{\mu_i - \mu_j}{\sigma^2}, \qquad w_0 = -\frac{\|\mu_i\|^2 - \|\mu_j\|^2}{2\sigma^2} + \log\frac{\pi_i}{\pi_j}.

You tell me the two Gaussians: µi, µj, πi, πj, σ.

I return you a separating hyperplane

g(x) = w^T x + w_0.

This is the best possible hyperplane according to the posterior distribution.
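A minimal sketch of this recipe (my own illustration; `case1_hyperplane` is a hypothetical name): given the two Gaussians, return w and w_0. With equal priors the boundary passes through the midpoint of the two means:

```python
import numpy as np

def case1_hyperplane(mu_i, mu_j, pi_i, pi_j, sigma):
    """w and w0 for the Sigma_i = sigma^2 I separating hyperplane."""
    w = (mu_i - mu_j) / sigma**2
    w0 = -(mu_i @ mu_i - mu_j @ mu_j) / (2 * sigma**2) + np.log(pi_i / pi_j)
    return w, w0

mu_i, mu_j = np.array([2.0, 0.0]), np.array([0.0, 0.0])
w, w0 = case1_hyperplane(mu_i, mu_j, 0.5, 0.5, 1.0)
# With equal priors, g vanishes at the midpoint of the two means
mid = (mu_i + mu_j) / 2
g_mid = w @ mid + w0   # -> 0.0
```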


Case 1: Σi = σ2I : Geometry

w = \frac{\mu_i - \mu_j}{\sigma^2}, \qquad x_0 = \frac{\mu_i + \mu_j}{2} - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2}\left(\log\frac{\pi_i}{\pi_j}\right)(\mu_i - \mu_j).


Case 2: Σi = Σ

Make all Σi = Σ. Then

g_i(x) = -\frac{1}{2}(x - \mu_i)^T \Sigma^{-1} (x - \mu_i) - \frac{1}{2}\log|\Sigma| + \log \pi_i.

In this case, we can show that

g_i(x) = -\frac{1}{2}(x - \mu_i)^T \Sigma^{-1} (x - \mu_i) + \log \pi_i
       = -\frac{1}{2}\left(x^T \Sigma^{-1} x - 2\mu_i^T \Sigma^{-1} x + \mu_i^T \Sigma^{-1} \mu_i\right) + \log \pi_i
       = \underbrace{\mu_i^T \Sigma^{-1} x}_{w_i^T x} \; \underbrace{-\,\frac{1}{2}\mu_i^T \Sigma^{-1} \mu_i + \log \pi_i}_{w_{i0}},

where the -\frac{1}{2}\log|\Sigma| term and the x^T \Sigma^{-1} x term are crossed out because they are the same for every class i and so do not affect the comparison.


Case 2: Σi = Σ

The discriminant function is therefore

g(x) = g_i(x) - g_j(x)
     = (w_i - w_j)^T x + (w_{i0} - w_{j0})
     = \underbrace{\left[\Sigma^{-1}(\mu_i - \mu_j)\right]^T x}_{w^T x} \; \underbrace{-\,\frac{1}{2}\left(\mu_i^T \Sigma^{-1} \mu_i - \mu_j^T \Sigma^{-1} \mu_j\right) + \log\frac{\pi_i}{\pi_j}}_{w_0}.

If we want to write g(x) = w^T (x - x_0), we can show that

w = \Sigma^{-1}(\mu_i - \mu_j),
x_0 = \frac{\mu_i + \mu_j}{2} - \frac{\log(\pi_i/\pi_j)}{(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)}\,(\mu_i - \mu_j).
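A short sketch of these two formulas (my own illustration; `case2_boundary` and the covariance values are made up). By construction the boundary passes through x_0, and the two means land on opposite sides of it:

```python
import numpy as np

def case2_boundary(mu_i, mu_j, pi_i, pi_j, Sigma):
    """w and x0 such that g(x) = w^T (x - x0), for shared covariance Sigma."""
    diff = mu_i - mu_j
    w = np.linalg.solve(Sigma, diff)          # Sigma^{-1} (mu_i - mu_j)
    x0 = (mu_i + mu_j) / 2 - (np.log(pi_i / pi_j) / (diff @ w)) * diff
    return w, x0

Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
mu_i, mu_j = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
w, x0 = case2_boundary(mu_i, mu_j, 0.7, 0.3, Sigma)
# mu_i sits on the positive side of the boundary, mu_j on the negative side
g_i_side = w @ (mu_i - x0)   # > 0
g_j_side = w @ (mu_j - x0)   # < 0
```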


Case 2: Σi = Σ: Geometry

The discriminant function is g(x) = w^T (x - x_0), with

w = \Sigma^{-1}(\mu_i - \mu_j),
x_0 = \frac{\mu_i + \mu_j}{2} - \frac{\log(\pi_i/\pi_j)}{(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)}\,(\mu_i - \mu_j).


Case 3: Arbitrary Σi

This is the most general setting.

g_i(x) = -\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \frac{1}{2}\log|\Sigma_i| + \log \pi_i.

We can show that

g_i(x) = \underbrace{-\frac{1}{2} x^T \Sigma_i^{-1} x}_{\frac{1}{2} x^T W_i x} + \underbrace{\mu_i^T \Sigma_i^{-1} x}_{w_i^T x} \; \underbrace{-\,\frac{1}{2}\mu_i^T \Sigma_i^{-1} \mu_i - \frac{1}{2}\log|\Sigma_i| + \log \pi_i}_{w_{i0}},

where W_i = -\Sigma_i^{-1} and w_i = \Sigma_i^{-1}\mu_i. Nothing cancels here, because Σi differs from class to class.

Therefore, the discriminant function is quadratic in x:

g(x) = g_i(x) - g_j(x)
     = \frac{1}{2} x^T (W_i - W_j) x + (w_i - w_j)^T x + (w_{i0} - w_{j0}).


Case 3: Arbitrary Σi : Geometry


Case 3: Arbitrary Σi : Computing

A quadratic classifier is not a big deal on a computer!

g(x) = g_i(x) - g_j(x)
     = \frac{1}{2} x^T (W_i - W_j) x + (w_i - w_j)^T x + (w_{i0} - w_{j0}).

Recall: the hypothesis function is

h(x) = \begin{cases} 1, & \text{if } g(x) > 0, \\ 0, & \text{if } g(x) < 0, \\ \text{either}, & \text{if } g(x) = 0. \end{cases}

All you need is to check whether g(x) > 0 or g(x) < 0.

There is no need to obtain a closed-form expression for the boundary.
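A minimal sketch of that sign check (my own illustration; the class parameters are made up). Rather than expanding W_i, w_i, w_{i0}, we can simply evaluate the two Gaussian discriminants and compare:

```python
import numpy as np

def g(x, mu, Sigma, pi):
    """Gaussian discriminant g_i(x) for class (mu, Sigma, pi)."""
    d = x - mu
    return (-0.5 * d @ np.linalg.solve(Sigma, d)
            - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(pi))

def h(x, params_i, params_j):
    """Return 1 if g_i(x) > g_j(x), else 0 -- no closed form needed."""
    return 1 if g(x, *params_i) > g(x, *params_j) else 0

# Different covariances -> the implied boundary is quadratic,
# but the classifier never has to write it down.
ci = (np.array([0.0, 0.0]), np.diag([1.0, 1.0]), 0.5)
cj = (np.array([4.0, 0.0]), np.diag([4.0, 4.0]), 0.5)
label = h(np.array([0.5, 0.0]), ci, cj)   # near the tight class i
```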



How Well does Bayesian Decision Rule Perform?

Gaussian is a model.

If the underlying true distribution is Gaussian, then the two class-conditional densities always overlap.

No matter what you do, there is always some classification error.


What if the distributions do not match?

If the true distributions do not overlap, then the overlap of the fitted Gaussians is not a problem.

In this lecture we assume the true distribution is indeed Gaussian.


Probability of Error

Suppose we have found a decision boundary, which in 1D is a threshold τ.

The probability of error is

P_e = P[X < \tau \text{ and } Y = 1] + P[X > \tau \text{ and } Y = 0].


Minimize the Probability of Error

Let us do some calculation:

P_e = P[X < \tau \text{ and } Y = 1] + P[X > \tau \text{ and } Y = 0]
    = P[X < \tau \mid Y = 1]\,P[Y = 1] + P[X > \tau \mid Y = 0]\,P[Y = 0]
    = \pi_1\, P[X < \tau \mid C_1] + \pi_0\, P[X > \tau \mid C_0]
    = \pi_1 \int_{-\infty}^{\tau} \frac{1}{\sqrt{2\pi\sigma_1^2}}\, e^{-\frac{(x - \mu_1)^2}{2\sigma_1^2}}\, dx + \pi_0 \int_{\tau}^{\infty} \frac{1}{\sqrt{2\pi\sigma_0^2}}\, e^{-\frac{(x - \mu_0)^2}{2\sigma_0^2}}\, dx.

Can we minimize P_e by finding a good τ? Differentiate:

\frac{d}{d\tau} P_e = \pi_1 \frac{d}{d\tau}\int_{-\infty}^{\tau} \frac{1}{\sqrt{2\pi\sigma_1^2}}\, e^{-\frac{(x - \mu_1)^2}{2\sigma_1^2}}\, dx + \pi_0 \frac{d}{d\tau}\int_{\tau}^{\infty} \frac{1}{\sqrt{2\pi\sigma_0^2}}\, e^{-\frac{(x - \mu_0)^2}{2\sigma_0^2}}\, dx.
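To make the integrals concrete: each term is a Gaussian tail, so P_e can be evaluated with the standard normal CDF Φ. This is my own sketch (not from the slides); `prob_error` is a hypothetical name:

```python
import math

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def prob_error(tau, mu0, mu1, sigma0, sigma1, pi0, pi1):
    """P_e = pi1 * P[X < tau | Y=1] + pi0 * P[X > tau | Y=0]."""
    miss = Phi((tau - mu1) / sigma1)             # class 1 falls below tau
    false_alarm = 1 - Phi((tau - mu0) / sigma0)  # class 0 exceeds tau
    return pi1 * miss + pi0 * false_alarm

# Equal priors, unit variances, means 0 and 2; threshold at the midpoint
pe = prob_error(1.0, mu0=0.0, mu1=2.0, sigma0=1.0, sigma1=1.0, pi0=0.5, pi1=0.5)
# pe = Phi(-1) ~= 0.1587
```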


Minimize the Classification Error

Fundamental Theorem of Calculus:

\frac{d}{d\tau}\int_{-\infty}^{\tau} f(x)\, dx = f(\tau).

Therefore, in our problem,

\frac{d}{d\tau}\int_{-\infty}^{\tau} \frac{1}{\sqrt{2\pi\sigma_1^2}}\, e^{-\frac{(x - \mu_1)^2}{2\sigma_1^2}}\, dx = \frac{1}{\sqrt{2\pi\sigma_1^2}}\, e^{-\frac{(\tau - \mu_1)^2}{2\sigma_1^2}},

\frac{d}{d\tau}\int_{\tau}^{\infty} \frac{1}{\sqrt{2\pi\sigma_0^2}}\, e^{-\frac{(x - \mu_0)^2}{2\sigma_0^2}}\, dx = -\frac{1}{\sqrt{2\pi\sigma_0^2}}\, e^{-\frac{(\tau - \mu_0)^2}{2\sigma_0^2}}.


Special Case: σ1 = σ0 = σ

Let us assume that σ1 = σ0 = σ. Then

\frac{d}{d\tau} P_e = \pi_1 \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(\tau - \mu_1)^2}{2\sigma^2}} - \pi_0 \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(\tau - \mu_0)^2}{2\sigma^2}} = 0.

Canceling the common factor \frac{1}{\sqrt{2\pi\sigma^2}} and taking logs of both sides,

\log\left\{\pi_1\, e^{-\frac{(\tau - \mu_1)^2}{2\sigma^2}}\right\} = \log\left\{\pi_0\, e^{-\frac{(\tau - \mu_0)^2}{2\sigma^2}}\right\},

and solving for τ gives

\tau = \frac{\mu_1 + \mu_0}{2} - \frac{\sigma^2}{\mu_1 - \mu_0}\log\frac{\pi_1}{\pi_0}.
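As a sanity check on this formula (my own sketch, not part of the slides; the parameter values are made up), we can compute τ for unequal priors and confirm numerically that perturbing it only increases P_e:

```python
import math

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def prob_error(tau, mu0, mu1, sigma, pi0, pi1):
    """P_e for two equal-variance Gaussians split at threshold tau."""
    return pi1 * Phi((tau - mu1) / sigma) + pi0 * (1 - Phi((tau - mu0) / sigma))

mu0, mu1, sigma, pi0, pi1 = 0.0, 2.0, 1.0, 0.8, 0.2
tau_star = (mu1 + mu0) / 2 - sigma**2 / (mu1 - mu0) * math.log(pi1 / pi0)
# tau_star > midpoint: the boundary shifts toward the rarer class 1
pe_star = prob_error(tau_star, mu0, mu1, sigma, pi0, pi1)
```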


Optimality of Bayesian Decision Rule

Therefore,

\tau = \frac{\mu_1 + \mu_0}{2} - \frac{\sigma^2}{\mu_1 - \mu_0}\log\frac{\pi_1}{\pi_0},

and this means that

h(x) = \begin{cases} 1, & \text{if } x > \tau, \\ 0, & \text{if } x < \tau, \\ \text{either}, & \text{if } x = \tau. \end{cases}

This is exactly the same result as the Bayesian decision rule.

Bayesian decision is optimal in the posterior.
Bayesian decision is optimal in the probability of error.
Optimal ≠ error free.
You still have error even if you have infinite training data.


Two Levels of Optimality

Error due to Finite Training Samples
  You need to estimate µ and Σ
  More samples, better estimates
  Controlled by the Hoeffding inequality

Error due to Linear Non-Separability
  Error remains even if you have infinite training data
  Fundamental limit of the model
  Gaussians are not linearly separable
  Minimum error probability ≠ zero error


Risk and Probability of Error

Is Your "Optimal" Really Optimal?

P_e = P[X < \tau \mid Y = 1]\,\pi_1 + P[X > \tau \mid Y = 0]\,\pi_0.

Y = 1: disease present. Y = 0: normal.
X < τ: report no disease. X > τ: report disease.
P[X < τ | Y = 1]: There is a disease, but you cannot find it.
P[X > τ | Y = 0]: There is no disease, but you say there is.
Which one is more serious?

We need to define the risk:

\underbrace{R}_{\text{risk}} = C_{01}\,\underbrace{P[X < \tau \mid Y = 1]}_{\text{miss}}\,\pi_1 + C_{10}\,\underbrace{P[X > \tau \mid Y = 0]}_{\text{false alarm}}\,\pi_0.

C_{01}: cost of a miss. C_{10}: cost of a false alarm.


False Alarm and Detection

R = C_{01}\,\underbrace{P[X < \tau \mid Y = 1]}_{P_M}\,\pi_1 + C_{10}\,\underbrace{P[X > \tau \mid Y = 0]}_{P_F}\,\pi_0.

P_M: Probability of Miss
P_D = 1 − P_M: Probability of Detection
P_F: Probability of False Alarm


Receiver Operating Characteristic (ROC) Curve

P_M, P_D, and P_F are all functions of τ.

If you change τ, they will change correspondingly.

We can plot P_F(τ) against P_D(τ) as τ changes.

This is called a Receiver Operating Characteristic (ROC) curve.
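A minimal sketch of tracing the ROC (my own illustration; the Gaussian parameters are made up): sweep τ and record one (P_F(τ), P_D(τ)) point per threshold.

```python
import math

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def roc_point(tau, mu0=0.0, mu1=2.0, sigma=1.0):
    """P_F(tau) = P[X > tau | Y=0], P_D(tau) = P[X > tau | Y=1]."""
    p_f = 1 - Phi((tau - mu0) / sigma)
    p_d = 1 - Phi((tau - mu1) / sigma)
    return p_f, p_d

# Sweep the threshold; each tau gives one point on the ROC curve.
points = [roc_point(tau) for tau in [t / 10 for t in range(-30, 51)]]
# Here mu1 > mu0, so the curve lies above the diagonal: P_D >= P_F.
```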


Interpreting the ROC

A higher ROC is better: at the same level of false alarm, it gives a higher detection rate.

One classifier has one ROC: a single curve traces out all values of τ.

Ways to improve the ROC:
  Train with more samples so that you have less training error
  Find a better model than the Gaussian
  Find a better decision rule than the linear classifier


Reading List

Bayesian Decision Rule
  Bishop, Pattern Recognition and Machine Learning, Chapter 4.1
  Duda, Hart and Stork, Pattern Classification, Chapters 2.1, 2.2, 2.6
  Stanford CS229, Generative Algorithms: http://cs229.stanford.edu/notes/cs229-notes2.pdf

Probability of Error
  Duda, Hart and Stork, Pattern Classification, Chapters 2.7, 3.1
  Poor, Intro to Signal Estimation and Detection, Chapter 2

ROC Curve
  ECE645 Note: https://engineering.purdue.edu/ChanGroup/ECE645Notes/StudentLecture02.pdf


Appendix


Q&A 1 Is ROC curve always concave?

No.

The ROC curve is concave when the decision rule comes from Neyman-Pearson. See V. Poor's Intro to Signal Estimation and Detection, Chapter 2.

You can always create a decision rule that performs terribly on one part and very well on another part.

Most algorithms will generate ROC curves that show staircase effects.

It is important to report the entire ROC curve.


Q&A 2 Type 1 and Type 2 Error

You have probably heard these two terms before. Below is an image I found on Wikipedia.


Q&A 2 Type 1 and Type 2 Error

A typical naming difference between statistics and engineering:

False Positive = Type 1 Error = False Alarm
False Negative = Type 2 Error = Miss
Both are probabilities.

In the previous figure, the decision rule is the location of the black line.

If you move the black line left and right, you will trace out points on the ROC curve.


Q&A 3 What is the difference between ROC and Precision-Recall Curve (PRC)?

ROC: Receiver Operating Characteristic curve.
  Detection: No. TP / (No. TP + No. FN)
  False Alarm: No. FP / (No. FP + No. TN)
  The ROC curve always goes up.

PRC: Precision-Recall Curve.
  Precision: No. TP / (No. TP + No. FP)
  Recall: No. TP / (No. TP + No. FN)
  The PRC curve generally goes down.

There is generally no definitive answer to which of ROC or PRC is better. However, according to the following paper, ROC is better for datasets with balanced classes whereas PRC is better for skewed classes. https://www.biostat.wisc.edu/~page/rocpr.pdf
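A small sketch of why the two curves can tell different stories on skewed classes (my own illustration; the confusion-matrix counts are made up):

```python
def rates(tp, fp, tn, fn):
    """ROC axes (detection, false alarm) vs the PRC's precision axis."""
    detection = tp / (tp + fn)       # recall / P_D
    false_alarm = fp / (fp + tn)     # P_F
    precision = tp / (tp + fp)
    return detection, false_alarm, precision

# Skewed classes: 100 positives vs 10000 negatives (hypothetical counts)
d, f, p = rates(tp=80, fp=100, tn=9900, fn=20)
# detection = 0.8 and false_alarm = 0.01 look great on the ROC,
# but precision is only ~0.44: the PRC exposes the false positives
# that the tiny P_F hides when negatives dominate.
```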
