240-650 Principles of Pattern Recognition
Chapter 2: Bayesian Decision Theory

Montri Karnjanadecha
[email protected]
http://fivedots.coe.psu.ac.th/~montri
Statistical Approach to Pattern Recognition
A Simple Example
• Suppose that we are given two classes, ω₁ and ω₂
  – P(ω₁) = 0.7
  – P(ω₂) = 0.3
  – No measurement is given
• Guessing
  – What shall we do to recognize a given input?
  – What is the best we can do statistically? Why?
A More Complicated Example
• Suppose that we are given two classes
  – A single measurement x
  – P(ω₁|x) and P(ω₂|x) are given graphically
A Bayesian Example
• Suppose that we are given two classes
  – A single measurement x
  – We are given p(x|ω₁) and p(x|ω₂) this time
A Bayesian Example – cont.
Bayesian Decision Theory
• Bayes formula

  $$p(x, \omega_j) = P(\omega_j \mid x)\,p(x) = p(x \mid \omega_j)\,P(\omega_j)$$

  $$P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\,P(\omega_j)}{p(x)}$$

• In case of two categories

  $$p(x) = \sum_{j=1}^{2} p(x \mid \omega_j)\,P(\omega_j)$$

• In English, it can be expressed as

  $$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$
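To make the formula concrete, here is a minimal Python sketch (not from the slides; the densities and priors are illustrative assumptions) that evaluates the posteriors for a two-category problem with Gaussian class-conditional densities:

```python
import numpy as np
from scipy.stats import norm

# Illustrative (assumed) priors and class-conditional densities p(x|w_j)
priors = np.array([0.7, 0.3])                    # P(w1), P(w2)
likelihoods = [norm(0.0, 1.0), norm(2.0, 1.0)]   # p(x|w1), p(x|w2)

def posteriors(x):
    """Bayes formula: P(w_j|x) = p(x|w_j) P(w_j) / p(x)."""
    joint = np.array([like.pdf(x) for like in likelihoods]) * priors
    evidence = joint.sum()                       # p(x) = sum_j p(x|w_j) P(w_j)
    return joint / evidence

print(posteriors(1.0))                           # the two posteriors sum to one
```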
Bayesian Decision Theory – cont.
• Posterior probability
  – P(ω_j|x) is the probability of the state of nature being ω_j given that feature value x has been measured
• Likelihood
  – p(x|ω_j) is the likelihood of ω_j with respect to x
• Evidence
  – The evidence factor p(x) can be viewed as a scaling factor that guarantees that the posterior probabilities sum to one
Bayesian Decision Theory – cont.
• Whenever we observe a particular x, the probability of error is

  $$P(\text{error} \mid x) = \begin{cases} P(\omega_1 \mid x) & \text{if we decide } \omega_2 \\ P(\omega_2 \mid x) & \text{if we decide } \omega_1 \end{cases}$$

• The average probability of error is given by

  $$P(\text{error}) = \int P(\text{error}, x)\,dx = \int P(\text{error} \mid x)\,p(x)\,dx$$
Bayesian Decision Theory – cont.
• Bayes decision rule

  Decide ω₁ if P(ω₁|x) > P(ω₂|x); otherwise decide ω₂

• Probability of error

  P(error|x) = min[P(ω₁|x), P(ω₂|x)]

• If we ignore the "evidence", the decision rule becomes:

  Decide ω₁ if p(x|ω₁) P(ω₁) > p(x|ω₂) P(ω₂); otherwise decide ω₂
Bayesian Decision Theory--continuous features
• Feature space
  – In general, an input can be represented by a vector x, a point in a d-dimensional Euclidean space R^d
• Loss function
  – The loss function states exactly how costly each action is, and is used to convert a probability determination into a decision
  – Written as λ(α_i|ω_j)
Loss Function
• λ(α_i|ω_j) describes the loss incurred for taking action α_i when the state of nature is ω_j
Conditional Risk
• Suppose we observe a particular x
• We take action α_i
• If the true state of nature is ω_j, by definition we will incur the loss λ(α_i|ω_j)
• We can minimize our expected loss by selecting the action that minimizes the conditional risk R(α_i|x):

  $$R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\,P(\omega_j \mid x)$$
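As a concrete illustration, a small sketch (the loss matrix and posteriors below are made-up values, not from the slides) that evaluates R(α_i|x) for every action and picks the minimum-risk one:

```python
import numpy as np

# Assumed loss matrix: loss[i, j] = lambda(alpha_i | w_j)
loss = np.array([[0.0, 2.0],
                 [1.0, 0.0]])
post = np.array([0.8, 0.2])          # P(w1|x), P(w2|x) for the observed x

cond_risk = loss @ post              # R(alpha_i|x) = sum_j lambda_ij P(w_j|x)
best = np.argmin(cond_risk)          # minimize the conditional risk
print(cond_risk, "-> take action", best + 1)
```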
Bayesian Decision Theory
• Suppose that there are c categories {ω₁, ω₂, ..., ω_c}
• Conditional risk

  $$R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\,P(\omega_j \mid x)$$

• Risk is the average expected loss

  $$R = \int R(\alpha(\mathbf{x}) \mid \mathbf{x})\,p(\mathbf{x})\,d\mathbf{x}$$
Bayesian Decision Theory
• Bayes decision rule
  – For a given x, select the action α_i for which the conditional risk is minimum
  – The resulting minimum overall risk is called the Bayes risk, denoted R*; it is the best performance that can be achieved:

  $$R^{*} = \min_i R(\alpha_i \mid \mathbf{x})$$
Two-Category Classification
• Let λ_ij = λ(α_i|ω_j)
• Conditional risk

  $$R(\alpha_1 \mid \mathbf{x}) = \lambda_{11} P(\omega_1 \mid \mathbf{x}) + \lambda_{12} P(\omega_2 \mid \mathbf{x})$$
  $$R(\alpha_2 \mid \mathbf{x}) = \lambda_{21} P(\omega_1 \mid \mathbf{x}) + \lambda_{22} P(\omega_2 \mid \mathbf{x})$$

• Fundamental decision rule

  Decide ω₁ if R(α₁|x) < R(α₂|x)
Two-Category Classification – cont.
• The decision rule can be written in several ways
  – Decide ω₁ if one of the following (equivalent) conditions is true:

  $$(\lambda_{21} - \lambda_{11})\,P(\omega_1 \mid \mathbf{x}) > (\lambda_{12} - \lambda_{22})\,P(\omega_2 \mid \mathbf{x})$$

  $$(\lambda_{21} - \lambda_{11})\,p(\mathbf{x} \mid \omega_1)\,P(\omega_1) > (\lambda_{12} - \lambda_{22})\,p(\mathbf{x} \mid \omega_2)\,P(\omega_2)$$

  $$\frac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)}$$

  The left-hand side of the last form is the likelihood ratio; these rules are equivalent (see the sketch below).
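A minimal sketch of the likelihood-ratio form of the rule (densities, losses, and priors are illustrative assumptions, not values from the slides):

```python
from scipy.stats import norm

p1, p2 = norm(0.0, 1.0), norm(2.0, 1.0)        # assumed p(x|w1), p(x|w2)
P1, P2 = 0.7, 0.3                              # assumed priors
l11, l12, l21, l22 = 0.0, 2.0, 1.0, 0.0        # assumed losses lambda_ij

# The threshold is fixed; only the likelihood ratio depends on x
threshold = (l12 - l22) / (l21 - l11) * (P2 / P1)

def decide(x):
    ratio = p1.pdf(x) / p2.pdf(x)              # likelihood ratio
    return "w1" if ratio > threshold else "w2"

print(decide(0.5), decide(1.8))
```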
Minimum-Error-Rate Classification
• A special case of the Bayes decision rule, with the following zero-one loss function:

  $$\lambda(\alpha_i \mid \omega_j) = \begin{cases} 0 & \text{if } i = j \\ 1 & \text{if } i \neq j \end{cases}$$

  – Assigns no loss to a correct decision
  – Assigns unit loss to any error
  – All errors are equally costly
Minimum-Error-Rate Classification
• Conditional risk

  $$R(\alpha_i \mid \mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\,P(\omega_j \mid \mathbf{x}) = \sum_{j \neq i} P(\omega_j \mid \mathbf{x}) = 1 - P(\omega_i \mid \mathbf{x})$$
Minimum-Error-Rate Classification
• We should select the ω_i that maximizes the posterior probability P(ω_i|x)
• For minimum error rate:

  Decide ω_i if P(ω_i|x) > P(ω_j|x) for all j ≠ i
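With zero-one loss the whole classifier reduces to an argmax over posteriors; a one-line sketch (the posterior vector is assumed to be already computed):

```python
import numpy as np

def classify(post):
    """Minimum-error-rate rule: decide the w_i with the largest P(w_i|x)."""
    return int(np.argmax(post)) + 1            # 1-based class index

print(classify(np.array([0.2, 0.5, 0.3])))     # -> 2
```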
Classifiers, Discriminant Functions, and Decision Surfaces
• There are many ways to represent pattern classifiers
• One of the most useful is in terms of a set of discriminant functions g_i(x), i = 1, ..., c
• The classifier assigns a feature vector x to class ω_i if

  g_i(x) > g_j(x) for all j ≠ i
The Multicategory Classifier
Classifiers, Discriminant Functions, and Decision Surfaces
• There are many equivalent discriminant functions
  – i.e., the classification results will be the same even though the functions are different
  – For example, if f is a monotonically increasing function, then g_i(x) and f(g_i(x)) give the same classification
Classifiers, Discriminant Functions, and Decision Surfaces
• Some discriminant functions are easier to understand or to compute than others
Decision Regions
• The effect of any decision rule is to divide the feature space into c decision regions R₁, ..., R_c
  – The regions are separated by decision boundaries, where ties occur among the largest discriminant functions

  If g_i(x) > g_j(x) for all j ≠ i, then x ∈ R_i
Decision Regions – cont.
Two-Category Case (Dichotomizer)
• The two-category case is a special case
  – Instead of two discriminant functions, a single one can be used:

  $$g(\mathbf{x}) = g_1(\mathbf{x}) - g_2(\mathbf{x})$$
  $$g(\mathbf{x}) = P(\omega_1 \mid \mathbf{x}) - P(\omega_2 \mid \mathbf{x})$$
  $$g(\mathbf{x}) = \ln \frac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} + \ln \frac{P(\omega_1)}{P(\omega_2)}$$

  Decide ω₁ if g(x) > 0; otherwise decide ω₂
The Normal Density
• Univariate Gaussian density

  $$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right]$$

• Mean

  $$\mu = E[x] = \int x\,p(x)\,dx$$

• Variance

  $$\sigma^2 = E[(x - \mu)^2] = \int (x - \mu)^2\,p(x)\,dx$$
The Normal Density
• Central Limit Theorem
– The aggregate effect of the sum of a large number of small, independent random disturbances will lead to a Gaussian distribution
– Gaussian is often a good model for the actual probability distribution
The Multivariate Normal Density
• Multivariate density (in d dimensions)

  $$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\left[-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^t \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right]$$

• Abbreviation

  $$p(\mathbf{x}) \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$
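A numerical sketch of the density formula (μ and Σ below are made-up values); scipy's multivariate_normal is used only as a cross-check:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])                # assumed mean vector
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])           # assumed covariance matrix
x = np.array([0.5, 0.5])

d = len(mu)
diff = x - mu
# Direct evaluation of the formula above
pdf = np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / \
      np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

assert np.isclose(pdf, multivariate_normal(mu, Sigma).pdf(x))
print(pdf)
```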
The Multivariate Normal Density
• Mean

  $$\boldsymbol{\mu} = E[\mathbf{x}] = \int \mathbf{x}\,p(\mathbf{x})\,d\mathbf{x}$$

• Covariance matrix

  $$\boldsymbol{\Sigma} = E[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^t] = \int (\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^t\,p(\mathbf{x})\,d\mathbf{x}$$

• The ij-th component of Σ

  $$\sigma_{ij} = E[(x_i - \mu_i)(x_j - \mu_j)]$$
Statistical Independence
• If x_i and x_j are statistically independent, then σ_ij = 0
• The covariance matrix then becomes a diagonal matrix whose off-diagonal elements are all zero
Whitening Transform
$$\mathbf{A}_w = \boldsymbol{\Phi} \boldsymbol{\Lambda}^{-1/2}$$

where Φ is the matrix whose columns are the orthonormal eigenvectors of Σ, and Λ is the diagonal matrix of the corresponding eigenvalues of Σ
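A numerical sketch of the whitening transform (the covariance below is a made-up example); after transforming, the covariance becomes the identity:

```python
import numpy as np

Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])                 # assumed covariance

eigvals, Phi = np.linalg.eigh(Sigma)           # Lambda (eigenvalues), Phi (orthonormal eigenvectors)
A_w = Phi @ np.diag(eigvals ** -0.5)           # A_w = Phi Lambda^{-1/2}

# A_w^t Sigma A_w should be (numerically) the identity matrix
print(np.round(A_w.T @ Sigma @ A_w, 10))
```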
Squared Mahalanobis Distance from x to μ

  $$r^2 = (\mathbf{x} - \boldsymbol{\mu})^t \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})$$

• Contours of constant density are hyperellipsoids of constant Mahalanobis distance
• Principal axes of the hyperellipsoids are given by the eigenvectors of Σ
• Lengths of the axes are determined by the eigenvalues of Σ
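A short sketch computing r² (μ and Σ are the same kind of made-up values as before):

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    """Squared Mahalanobis distance r^2 = (x - mu)^t Sigma^{-1} (x - mu)."""
    diff = x - mu
    return diff @ np.linalg.solve(Sigma, diff)   # solve() avoids an explicit inverse

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
print(mahalanobis_sq(np.array([1.0, 2.0]), mu, Sigma))
```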
Discriminant Functions for the Normal Density
• Minimum-error-rate classification can be achieved with the discriminant functions

  $$g_i(\mathbf{x}) = \ln p(\mathbf{x} \mid \omega_i) + \ln P(\omega_i)$$

• If the densities are multivariate normal, i.e., if p(x|ω_i) ~ N(μ_i, Σ_i), then we have:

  $$g_i(\mathbf{x}) = -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^t \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |\boldsymbol{\Sigma}_i| + \ln P(\omega_i)$$
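A direct sketch of this discriminant for one class (the parameters are illustrative; in practice μ_i, Σ_i, and P(ω_i) would be estimated from training data):

```python
import numpy as np

def gaussian_discriminant(x, mu, Sigma, prior):
    """g_i(x) = -1/2 (x-mu)^t Sigma^{-1} (x-mu) - d/2 ln 2pi - 1/2 ln|Sigma| + ln P(w_i)."""
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.solve(Sigma, diff)
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

x = np.array([1.0, 0.5])
g1 = gaussian_discriminant(x, np.zeros(2), np.eye(2), 0.7)
g2 = gaussian_discriminant(x, np.ones(2), np.eye(2), 0.3)
print("decide w1" if g1 > g2 else "decide w2")
```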
Discriminant Functions for the Normal Density
• Case 1: Σ_i = σ²I
  – Features are statistically independent and each feature has the same variance, σ²

  $$g_i(\mathbf{x}) = -\frac{\lVert \mathbf{x} - \boldsymbol{\mu}_i \rVert^2}{2\sigma^2} + \ln P(\omega_i)$$

  – where ||·|| denotes the Euclidean norm:

  $$\lVert \mathbf{x} - \boldsymbol{\mu}_i \rVert^2 = (\mathbf{x} - \boldsymbol{\mu}_i)^t (\mathbf{x} - \boldsymbol{\mu}_i)$$
Case 1: Σ_i = σ²I
Linear Discriminant Function
• It is not necessary to compute distances
  – Expanding the quadratic form (x − μ_i)^t(x − μ_i) yields

  $$g_i(\mathbf{x}) = -\frac{1}{2\sigma^2} \left[ \mathbf{x}^t\mathbf{x} - 2\boldsymbol{\mu}_i^t\mathbf{x} + \boldsymbol{\mu}_i^t\boldsymbol{\mu}_i \right] + \ln P(\omega_i)$$

  – The term x^t x is the same for all i
  – We therefore have the following linear discriminant function:

  $$g_i(\mathbf{x}) = \mathbf{w}_i^t \mathbf{x} + w_{i0}$$
Linear Discriminant Function
where

  $$\mathbf{w}_i = \frac{1}{\sigma^2}\,\boldsymbol{\mu}_i$$

and

  $$w_{i0} = -\frac{1}{2\sigma^2}\,\boldsymbol{\mu}_i^t \boldsymbol{\mu}_i + \ln P(\omega_i)$$

w_{i0} is the threshold or bias for the ith category.
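The weights come straight from the class means; a sketch with assumed parameters:

```python
import numpy as np

sigma2 = 1.0                                          # shared variance (assumed)
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]    # class means (assumed)
priors = [0.7, 0.3]

ws  = [mu / sigma2 for mu in mus]                     # w_i = mu_i / sigma^2
w0s = [-(mu @ mu) / (2 * sigma2) + np.log(p)          # w_i0 = -mu_i^t mu_i / (2 sigma^2) + ln P(w_i)
       for mu, p in zip(mus, priors)]

x = np.array([1.0, 0.5])
g = [w @ x + w0 for w, w0 in zip(ws, w0s)]            # linear discriminants g_i(x)
print("decide class", int(np.argmax(g)) + 1)
```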
Linear Machine
• A classifier that uses linear discriminant functions is called a linear machine
• Its decision surfaces are pieces of hyperplanes defined by the linear equations g_i(x) = g_j(x) for the two categories with the highest posterior probabilities. For our case this equation can be written as

  $$\mathbf{w}^t (\mathbf{x} - \mathbf{x}_0) = 0$$
Linear Machine
where

  $$\mathbf{w} = \boldsymbol{\mu}_i - \boldsymbol{\mu}_j$$

and

  $$\mathbf{x}_0 = \frac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \frac{\sigma^2}{\lVert \boldsymbol{\mu}_i - \boldsymbol{\mu}_j \rVert^2} \ln\frac{P(\omega_i)}{P(\omega_j)}\,(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$$

If P(ω_i) = P(ω_j), then the second term vanishes and the rule assigns x to the class of the nearest mean: it is called a minimum-distance classifier.
240-650: Chapter 2: Bayesian Decision Theory
46
Priors change -> decision boundaries shift
Case 2: Σ_i = Σ
• Covariance matrices for all of the classes are identical but otherwise arbitrary
• The cluster for the ith class is centered about μ_i
• Discriminant function:

  $$g_i(\mathbf{x}) = -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^t \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) + \ln P(\omega_i)$$

  The ln P(ω_i) term can be ignored if the prior probabilities are the same for all classes.
Case 2: Discriminant function
$$g_i(\mathbf{x}) = \mathbf{w}_i^t \mathbf{x} + w_{i0}$$

where

  $$\mathbf{w}_i = \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_i$$

and

  $$w_{i0} = -\frac{1}{2}\,\boldsymbol{\mu}_i^t \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_i + \ln P(\omega_i)$$
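Case 2 gives a linear machine as well; a sketch with an assumed shared covariance:

```python
import numpy as np

Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])                        # shared covariance (assumed)
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]    # class means (assumed)
priors = [0.5, 0.5]

Sinv = np.linalg.inv(Sigma)
ws  = [Sinv @ mu for mu in mus]                       # w_i = Sigma^{-1} mu_i
w0s = [-0.5 * mu @ Sinv @ mu + np.log(p)              # w_i0 = -1/2 mu_i^t Sigma^{-1} mu_i + ln P(w_i)
       for mu, p in zip(mus, priors)]

x = np.array([1.0, 0.5])
print("decide class", int(np.argmax([w @ x + w0 for w, w0 in zip(ws, w0s)])) + 1)
```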
For 2-category case
• If R_i and R_j are contiguous, the boundary between them has the equation

  $$\mathbf{w}^t (\mathbf{x} - \mathbf{x}_0) = 0$$

where

  $$\mathbf{w} = \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$$

and

  $$\mathbf{x}_0 = \frac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \frac{\ln\left[P(\omega_i)/P(\omega_j)\right]}{(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^t \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)}\,(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$$
Case 3: Σ_i arbitrary
• In general, the covariance matrices are different for each category
• The only term that can be dropped from g_i(x) is the (d/2) ln 2π term
Case 3: Σ_i arbitrary – cont.
The discriminant functions are

  $$g_i(\mathbf{x}) = \mathbf{x}^t \mathbf{W}_i \mathbf{x} + \mathbf{w}_i^t \mathbf{x} + w_{i0}$$

where

  $$\mathbf{W}_i = -\frac{1}{2}\,\boldsymbol{\Sigma}_i^{-1}$$

  $$\mathbf{w}_i = \boldsymbol{\Sigma}_i^{-1} \boldsymbol{\mu}_i$$

and

  $$w_{i0} = -\frac{1}{2}\,\boldsymbol{\mu}_i^t \boldsymbol{\Sigma}_i^{-1} \boldsymbol{\mu}_i - \frac{1}{2} \ln |\boldsymbol{\Sigma}_i| + \ln P(\omega_i)$$
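A sketch of the resulting quadratic discriminant (per-class means, covariances, and priors are assumed values):

```python
import numpy as np

def quadratic_discriminant(x, mu, Sigma, prior):
    """g_i(x) = x^t W_i x + w_i^t x + w_i0 with the case-3 definitions above."""
    Sinv = np.linalg.inv(Sigma)
    W  = -0.5 * Sinv                                  # W_i  = -1/2 Sigma_i^{-1}
    w  = Sinv @ mu                                    # w_i  = Sigma_i^{-1} mu_i
    w0 = (-0.5 * mu @ Sinv @ mu
          - 0.5 * np.log(np.linalg.det(Sigma))
          + np.log(prior))                            # w_i0
    return x @ W @ x + w @ x + w0

x = np.array([1.0, 0.5])
g1 = quadratic_discriminant(x, np.zeros(2), np.eye(2), 0.5)
g2 = quadratic_discriminant(x, np.ones(2), 2 * np.eye(2), 0.5)
print("decide w1" if g1 > g2 else "decide w2")
```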
Two-category case
• The decision surfaces are hyperquadrics (hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, …)
Example