ECE595 / STAT598: Machine Learning I
Lecture 10: Minimum Probability of Error Rule
Spring 2020
Stanley Chan
School of Electrical and Computer Engineering, Purdue University
© Stanley Chan 2020. All Rights Reserved.
Overview
In linear discriminant analysis (LDA), there are generally two types of approaches:
Generative approach: Estimate the model, then define the classifier
Discriminative approach: Directly define the classifier
Outline
Generative Approaches
Lecture 9 Bayesian Decision Rules
Lecture 10 Evaluating Performance
Lecture 11 Parameter Estimation
Lecture 12 Bayesian Prior
Lecture 13 Connecting Bayesian and Linear Regression
Today’s Lecture
The Three Cases
Σ_i = σ²I (Last Lecture)
Σ_i = Σ
General Σ_i
Evaluating Performance
Probability of Error
Bayesian Decision Rule is also Minimum Error Rule
ROC Curve
Three Cases of Gaussians
Discriminant function of Gaussian:
g_i(x) = log p_{X|Y}(x | i) + log π_i
       = −(1/2)(x − µ_i)^T Σ_i^{−1}(x − µ_i) − (1/2) log|Σ_i| + log π_i.
Σ_i = σ²I: All Gaussians have the same covariance matrix; it is diagonal with the same variance in every dimension.
Σ_i = Σ: All Gaussians have the same covariance matrix, which can be anything.
Arbitrary Σ_i: Each Gaussian has its own positive semi-definite covariance matrix.
Last Lecture
Theorem
If Σ_i = σ²I, then the separating hyperplane is given by
g(x) = w^T x + w_0 = 0,
where
w = (µ_i − µ_j)/σ²,  and  w_0 = −(‖µ_i‖² − ‖µ_j‖²)/(2σ²) + log(π_i/π_j).
You tell me the two Gaussians: µ_i, µ_j, π_i, π_j, σ.
I return you a separating hyperplane g(x) = w^T x + w_0.
This is the best possible hyperplane according to the posterior distribution.
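As a concrete illustration, here is a minimal sketch (not part of the original slides; the function name and example numbers are made up) that turns µ_i, µ_j, π_i, π_j, σ into the hyperplane parameters w and w_0 from the theorem and classifies a point by the sign of g(x):

```python
import numpy as np

# Sketch: build the Case 1 hyperplane g(x) = w^T x + w_0 from the theorem,
# assuming Sigma_i = sigma^2 * I for both classes.
def case1_hyperplane(mu_i, mu_j, pi_i, pi_j, sigma):
    mu_i, mu_j = np.asarray(mu_i, float), np.asarray(mu_j, float)
    w = (mu_i - mu_j) / sigma**2
    w0 = -(mu_i @ mu_i - mu_j @ mu_j) / (2 * sigma**2) + np.log(pi_i / pi_j)
    return w, w0

# Hypothetical example: classify a point by the sign of g(x).
w, w0 = case1_hyperplane(mu_i=[2.0, 0.0], mu_j=[0.0, 0.0], pi_i=0.5, pi_j=0.5, sigma=1.0)
x = np.array([1.3, -0.4])
print("class i" if w @ x + w0 > 0 else "class j")
```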
Case 1: Σ_i = σ²I: Geometry
w = (µ_i − µ_j)/σ²,  and  x_0 = (µ_i + µ_j)/2 − (σ²/‖µ_i − µ_j‖²) log(π_i/π_j) (µ_i − µ_j).
The boundary g(x) = w^T(x − x_0) = 0 is the hyperplane passing through x_0 with normal vector w; with equal priors, x_0 is the midpoint between the two means.
Case 2: Σ_i = Σ
Make all Σ_i = Σ.
g_i(x) = −(1/2)(x − µ_i)^T Σ^{−1}(x − µ_i) − (1/2) log|Σ| + log π_i.
In this case, we can show that
g_i(x) = −(1/2)(x − µ_i)^T Σ^{−1}(x − µ_i) − (1/2) log|Σ| + log π_i     (drop −(1/2) log|Σ|: it is the same for every class)
       = −(1/2)(x^T Σ^{−1} x − 2µ_i^T Σ^{−1} x + µ_i^T Σ^{−1} µ_i) + log π_i     (drop x^T Σ^{−1} x: it is the same for every class)
       = µ_i^T Σ^{−1} x − (1/2) µ_i^T Σ^{−1} µ_i + log π_i
       = w_i^T x + w_{i0},
where w_i = Σ^{−1} µ_i and w_{i0} = −(1/2) µ_i^T Σ^{−1} µ_i + log π_i.
Case 2: Σ_i = Σ
The discriminant function is therefore:
g(x) = g_i(x) − g_j(x)
     = (w_i − w_j)^T x + (w_{i0} − w_{j0})
     = [Σ^{−1}(µ_i − µ_j)]^T x − (1/2)(µ_i^T Σ^{−1} µ_i − µ_j^T Σ^{−1} µ_j) + log(π_i/π_j),
so that w = Σ^{−1}(µ_i − µ_j) and w_0 = −(1/2)(µ_i^T Σ^{−1} µ_i − µ_j^T Σ^{−1} µ_j) + log(π_i/π_j).
If we want to write g(x) = w^T(x − x_0), we can show that
w = Σ^{−1}(µ_i − µ_j),
x_0 = (µ_i + µ_j)/2 − [log(π_i/π_j) / ((µ_i − µ_j)^T Σ^{−1}(µ_i − µ_j))] (µ_i − µ_j).
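A minimal numerical sketch (not from the slides; the function name and the example covariance are made up): compute w and x_0 exactly as above and classify by the sign of w^T(x − x_0).

```python
import numpy as np

# Sketch: boundary parameters for Case 2 (all classes share one covariance Sigma).
def case2_boundary(mu_i, mu_j, pi_i, pi_j, Sigma):
    mu_i, mu_j = np.asarray(mu_i, float), np.asarray(mu_j, float)
    diff = mu_i - mu_j
    w = np.linalg.solve(Sigma, diff)                     # Sigma^{-1} (mu_i - mu_j)
    x0 = (mu_i + mu_j) / 2 - (np.log(pi_i / pi_j) / (diff @ w)) * diff
    return w, x0

# Hypothetical example.
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
w, x0 = case2_boundary([2.0, 1.0], [0.0, 0.0], 0.6, 0.4, Sigma)
x = np.array([1.0, 0.2])
print("class i" if w @ (x - x0) > 0 else "class j")
```

Note that diff @ w equals (µ_i − µ_j)^T Σ^{−1}(µ_i − µ_j), the denominator in the x_0 formula.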
Case 2: Σ_i = Σ: Geometry
The discriminant function is g(x) = w^T(x − x_0), with
w = Σ^{−1}(µ_i − µ_j),
x_0 = (µ_i + µ_j)/2 − [log(π_i/π_j) / ((µ_i − µ_j)^T Σ^{−1}(µ_i − µ_j))] (µ_i − µ_j).
Case 3: Arbitrary Σ_i
This is the most general setting.
g_i(x) = −(1/2)(x − µ_i)^T Σ_i^{−1}(x − µ_i) − (1/2) log|Σ_i| + log π_i.
We can show that
g_i(x) = −(1/2) x^T Σ_i^{−1} x + µ_i^T Σ_i^{−1} x − (1/2) µ_i^T Σ_i^{−1} µ_i − (1/2) log|Σ_i| + log π_i
       = (1/2) x^T W_i x + w_i^T x + w_{i0},
where W_i = −Σ_i^{−1}, w_i = Σ_i^{−1} µ_i, and w_{i0} = −(1/2) µ_i^T Σ_i^{−1} µ_i − (1/2) log|Σ_i| + log π_i.
Therefore, the discriminant function is
g(x) = g_i(x) − g_j(x)
     = (1/2) x^T (W_i − W_j) x + (w_i − w_j)^T x + (w_{i0} − w_{j0}).
Case 3: Arbitrary Σ_i: Geometry
When the covariances differ, the decision boundary g(x) = 0 is quadratic rather than a hyperplane.
Case 3: Arbitrary Σ_i: Computing
On a computer, the quadratic classifier is not a big deal!
g(x) = g_i(x) − g_j(x)
     = (1/2) x^T (W_i − W_j) x + (w_i − w_j)^T x + (w_{i0} − w_{j0}).
Recall that the hypothesis function is
h(x) = 1, if g(x) > 0,
       0, if g(x) < 0,
       either, if g(x) = 0.
All you need is to check whether g(x) > 0 or g(x) < 0.
No need to obtain a closed form solution.
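Here is a minimal sketch of that idea (not from the slides; the helper names and example parameters are made up): evaluate g_i(x) and g_j(x) directly from the Gaussian parameters and decide by the sign of their difference, with no closed-form boundary.

```python
import numpy as np

# Sketch: evaluate g(x) = g_i(x) - g_j(x) numerically for arbitrary covariances
# and decide by its sign; no closed-form boundary is required.
def log_discriminant(x, mu, Sigma, pi):
    d = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * d @ np.linalg.solve(Sigma, d) - 0.5 * logdet + np.log(pi)

def classify(x, mu_i, Sigma_i, pi_i, mu_j, Sigma_j, pi_j):
    g = log_discriminant(x, mu_i, Sigma_i, pi_i) - log_discriminant(x, mu_j, Sigma_j, pi_j)
    return 1 if g > 0 else 0   # the "either" case g == 0 is resolved here in favor of class j

# Hypothetical example with two different covariances.
x = np.array([0.5, 1.0])
print(classify(x,
               np.array([1.0, 1.0]),  np.array([[1.0, 0.2], [0.2, 2.0]]), 0.5,
               np.array([-1.0, 0.0]), np.array([[0.5, 0.0], [0.0, 0.5]]), 0.5))
```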
Outline
Generative Approaches
Lecture 9 Bayesian Decision Rules
Lecture 10 Evaluating Performance
Lecture 11 Parameter Estimation
Lecture 12 Bayesian Prior
Lecture 13 Connecting Bayesian and Linear Regression
Today’s Lecture
The Three Cases
Σ_i = σ²I (Last Lecture)
Σ_i = Σ
General Σ_i
Evaluating Performance
Probability of Error
Bayesian Decision Rule is also Minimum Error Rule
ROC Curve
How Well Does the Bayesian Decision Rule Perform?
Gaussian is a model.
If the underlying true distribution is Gaussian, then the two class-conditional densities always overlap.
No matter what you do, there is always some classification error.
What if the Distributions Do Not Match?
If the true distributions do not overlap, then the overlap of the fitted Gaussians is not a problem.
In this lecture we assume the true distribution is indeed Gaussian.
Probability of Error
Suppose we have found a decision boundary, which is a threshold τ in 1D.
The probability of error is
P_e = P[X < τ and Y = 1] + P[X > τ and Y = 0].
Minimize the Probability of Error
Let us do some calculation:
P_e = P[X < τ and Y = 1] + P[X > τ and Y = 0]
    = P[X < τ | Y = 1] P[Y = 1] + P[X > τ | Y = 0] P[Y = 0]
    = π_1 P[X < τ | C_1] + π_0 P[X > τ | C_0]
    = π_1 ∫_{−∞}^{τ} (1/√(2πσ_1²)) e^{−(x−µ_1)²/(2σ_1²)} dx + π_0 ∫_{τ}^{∞} (1/√(2πσ_0²)) e^{−(x−µ_0)²/(2σ_0²)} dx
Can we minimize P_e by finding a good τ?
dP_e/dτ = π_1 (d/dτ) ∫_{−∞}^{τ} (1/√(2πσ_1²)) e^{−(x−µ_1)²/(2σ_1²)} dx + π_0 (d/dτ) ∫_{τ}^{∞} (1/√(2πσ_0²)) e^{−(x−µ_0)²/(2σ_0²)} dx
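Before differentiating, note that the two integrals are just Gaussian CDFs, so P_e(τ) is easy to evaluate numerically. A minimal sketch (not from the slides; the parameter values are made up), using scipy's normal CDF:

```python
import numpy as np
from scipy.stats import norm

# Sketch: P_e(tau) = pi1 * P[X < tau | Y=1] + pi0 * P[X > tau | Y=0],
# where each class-conditional is Gaussian, so each term is a normal CDF.
def prob_error(tau, mu0, sigma0, pi0, mu1, sigma1, pi1):
    miss = norm.cdf(tau, loc=mu1, scale=sigma1)               # P[X < tau | Y = 1]
    false_alarm = 1.0 - norm.cdf(tau, loc=mu0, scale=sigma0)  # P[X > tau | Y = 0]
    return pi1 * miss + pi0 * false_alarm

# Hypothetical example: sweep tau and locate the minimizer numerically.
taus = np.linspace(-2.0, 4.0, 601)
pe = [prob_error(t, mu0=0.0, sigma0=1.0, pi0=0.5, mu1=2.0, sigma1=1.0, pi1=0.5) for t in taus]
print("numerical minimizer:", taus[int(np.argmin(pe))])  # about 1.0 = (mu0 + mu1)/2 here
```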
Minimize the Classification Error
Fundamental Theorem of Calculus:
d/dτ ∫_{−∞}^{τ} f(x) dx = f(τ)
Therefore, in our problem,
(d/dτ) ∫_{−∞}^{τ} (1/√(2πσ_1²)) e^{−(x−µ_1)²/(2σ_1²)} dx = (1/√(2πσ_1²)) e^{−(τ−µ_1)²/(2σ_1²)},
(d/dτ) ∫_{τ}^{∞} (1/√(2πσ_0²)) e^{−(x−µ_0)²/(2σ_0²)} dx = −(1/√(2πσ_0²)) e^{−(τ−µ_0)²/(2σ_0²)}.
Special Case: σ_1 = σ_0 = σ
Let us assume that σ_1 = σ_0 = σ. Then,
dP_e/dτ = π_1 (d/dτ) ∫_{−∞}^{τ} (1/√(2πσ²)) e^{−(x−µ_1)²/(2σ²)} dx + π_0 (d/dτ) ∫_{τ}^{∞} (1/√(2πσ²)) e^{−(x−µ_0)²/(2σ²)} dx
        = π_1 (1/√(2πσ²)) e^{−(τ−µ_1)²/(2σ²)} − π_0 (1/√(2πσ²)) e^{−(τ−µ_0)²/(2σ²)} = 0
So finally, we can equate the two terms (the common factor 1/√(2πσ²) cancels) and get
π_1 e^{−(τ−µ_1)²/(2σ²)} = π_0 e^{−(τ−µ_0)²/(2σ²)}
log{π_1 e^{−(τ−µ_1)²/(2σ²)}} = log{π_0 e^{−(τ−µ_0)²/(2σ²)}}
τ = (µ_1 + µ_0)/2 − (σ²/(µ_1 − µ_0)) log(π_1/π_0).
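A minimal numerical check (not from the slides; the parameter values are made up): compute the closed-form τ and compare it with a brute-force grid search over P_e(τ).

```python
import numpy as np
from scipy.stats import norm

# Sketch: closed-form threshold tau* = (mu1 + mu0)/2 - sigma^2/(mu1 - mu0) * log(pi1/pi0),
# compared against a grid search over P_e(tau). Values below are hypothetical.
mu0, mu1, sigma, pi0, pi1 = 0.0, 2.0, 1.0, 0.3, 0.7
tau_star = (mu1 + mu0) / 2 - sigma**2 / (mu1 - mu0) * np.log(pi1 / pi0)

taus = np.linspace(-3.0, 5.0, 8001)
pe = pi1 * norm.cdf(taus, mu1, sigma) + pi0 * (1 - norm.cdf(taus, mu0, sigma))
print(tau_star, taus[int(np.argmin(pe))])  # the two values agree up to the grid resolution
```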
Optimality of Bayesian Decision Rule
Therefore,
τ = (µ_1 + µ_0)/2 − (σ²/(µ_1 − µ_0)) log(π_1/π_0),
and this means that
h(x) = 1, if x > (µ_1 + µ_0)/2 − (σ²/(µ_1 − µ_0)) log(π_1/π_0),
       0, if x < (µ_1 + µ_0)/2 − (σ²/(µ_1 − µ_0)) log(π_1/π_0),
       either, if x = (µ_1 + µ_0)/2 − (σ²/(µ_1 − µ_0)) log(π_1/π_0).
This is the exact same result as the Bayesian decision rule.
The Bayesian decision rule is optimal with respect to the posterior
The Bayesian decision rule is optimal with respect to the probability of error
Optimal ≠ error-free
You still have errors even with infinite training data
Two Levels of Optimality
Error due to Finite Training Samples
You need to estimate µ and Σ
More samples, better estimate
Controlled by Hoeffding inequality
Error due to Linear Non-Separability
Error even if you have infinite training data
Fundamental limit of the model
Gaussians are not linearly separable
Minimizing the error probability ≠ zero error
Risk and Probability of Error
Is Your “Optimal” Really Optimal?
P_e = P[X < τ | Y = 1] π_1 + P[X > τ | Y = 0] π_0.
Y = 1: disease present. Y = 0: normal.
X < τ : report no disease. X > τ : report disease.
P[X < τ | Y = 1]: There is a disease, but you cannot find it.
P[X > τ | Y = 0]: There is no disease, but you say there is.
Which one is more serious?
We need to define the risk:
R = C_{01} P[X < τ | Y = 1] π_1 + C_{10} P[X > τ | Y = 0] π_0,
where P[X < τ | Y = 1] is the miss probability and P[X > τ | Y = 0] is the false-alarm probability.
C_{01}: cost of a miss. C_{10}: cost of a false alarm.
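A minimal sketch (not from the slides; the cost values and Gaussian parameters are made up): weight the miss and false-alarm probabilities by C_{01} and C_{10} and see how the risk-minimizing threshold moves when misses are more expensive.

```python
import numpy as np
from scipy.stats import norm

# Sketch: R(tau) = C01 * P[X < tau | Y=1] * pi1 + C10 * P[X > tau | Y=0] * pi0.
def risk(tau, mu0, mu1, sigma, pi0, pi1, C01, C10):
    p_miss = norm.cdf(tau, loc=mu1, scale=sigma)               # P[X < tau | Y = 1]
    p_false_alarm = 1.0 - norm.cdf(tau, loc=mu0, scale=sigma)  # P[X > tau | Y = 0]
    return C01 * p_miss * pi1 + C10 * p_false_alarm * pi0

# Hypothetical example: missing a disease costs 10x a false alarm.
taus = np.linspace(-3.0, 5.0, 2001)
r = [risk(t, mu0=0.0, mu1=2.0, sigma=1.0, pi0=0.5, pi1=0.5, C01=10.0, C10=1.0) for t in taus]
print("risk-minimizing tau:", taus[int(np.argmin(r))])  # below (mu0 + mu1)/2: we report disease more readily
```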
False Alarm and Detection
R = C_{01} P[X < τ | Y = 1] π_1 + C_{10} P[X > τ | Y = 0] π_0, where the first conditional probability is P_M and the second is P_F.
P_M: probability of miss
P_D = 1 − P_M: probability of detection
P_F: probability of false alarm
Receiver Operating Characteristic (ROC) Curve
P_M, P_D, and P_F are all functions of τ.
If you change τ, they will change correspondingly.
We can plot P_F(τ) and P_D(τ) as τ changes.
This is called a Receiver Operating Characteristic (ROC) curve.
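A minimal sketch (not from the slides; the parameter values are made up): for two Gaussian class-conditionals, sweep τ and trace out (P_F(τ), P_D(τ)).

```python
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

# Sketch: each tau gives one point (PF, PD); sweeping tau traces the ROC curve.
mu0, mu1, sigma = 0.0, 2.0, 1.0
taus = np.linspace(-5.0, 7.0, 500)
PF = 1.0 - norm.cdf(taus, loc=mu0, scale=sigma)  # false alarm: P[X > tau | Y = 0]
PD = 1.0 - norm.cdf(taus, loc=mu1, scale=sigma)  # detection:   P[X > tau | Y = 1]

plt.plot(PF, PD)
plt.xlabel("P_F (false alarm)")
plt.ylabel("P_D (detection)")
plt.title("ROC for two Gaussian classes")
plt.show()
```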
Interpreting the ROC
A higher ROC curve is better: at the same false-alarm level, it gives a higher detection rate.
One classifier has one ROC curve; a single curve covers all values of τ.
Ways to improve the ROC:
Train with more samples so that you have less training error
Find a better model than the Gaussian
Find a better decision rule than the linear classifier
Reading List
Bayesian Decision Rule
Bishop, Pattern Recognition and Machine Learning, Chapter 4.1
Duda, Hart and Stork’s Pattern Classification, Chapter 2.1, 2.2, 2.6
Stanford CS229 Generative Algorithms: http://cs229.stanford.edu/notes/cs229-notes2.pdf
Probability of Error
Duda, Hart and Stork’s Pattern Classification, Chapter 2.7, 3.1.
Poor, Intro to Signal Estimation and Detection, Chapter 2.
ROC Curve
ECE645 Note: https://engineering.purdue.edu/ChanGroup/ECE645Notes/StudentLecture02.pdf
Appendix
Q&A 1: Is the ROC curve always concave?
No.
The ROC curve is concave when the decision rule comes from Neyman-Pearson. See V. Poor's Intro to Signal Estimation and Detection, Chapter 2.
You can always create a decision rule that performs terribly on one part and very well on the other part.
Most algorithms will generate ROC curves that show staircase effects.
It is important to report the entire ROC curve.
Q&A 2: Type 1 and Type 2 Errors
You have probably heard these two terms before. Below is an image I found on Wikipedia.
Q&A 2: Type 1 and Type 2 Errors
A typical naming issue between statistics and engineering
False Positive = Type 1 Error = False Alarm
False Negative = Type 2 Error = Miss
Both are probabilities
In the previous figure, we see that a decision rule is the location of the black line.
If you move the black line left or right, you will get a point on the ROC curve.
Q&A 3: What is the difference between the ROC and the Precision-Recall Curve (PRC)?
ROC: Receiver Operating Characteristic curve.
Detection: No. TP / (No. TP + No. FN)
False alarm: No. FP / (No. FP + No. TN)
The ROC curve always goes up.
PRC: Precision-Recall Curve.
Precision: No. TP / (No. TP + No. FP)
Recall: No. TP / (No. TP + No. FN)
The PRC always goes down.
There is generally no definitive answer to which of ROC or PRC is better. However, according to the following paper, ROC is better for datasets with balanced classes whereas PRC is better for skewed classes. https://www.biostat.wisc.edu/~page/rocpr.pdf
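A minimal sketch (hypothetical confusion counts): both curves are built from the same TP/FP/FN/TN counts; one threshold τ gives one point on each.

```python
# Hypothetical confusion counts at one particular threshold tau.
TP, FP, FN, TN = 80, 30, 20, 870

detection = recall = TP / (TP + FN)   # same quantity, two names
false_alarm = FP / (FP + TN)
precision = TP / (TP + FP)

print(f"ROC point: (PF={false_alarm:.3f}, PD={detection:.3f})")
print(f"PRC point: (recall={recall:.3f}, precision={precision:.3f})")
```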