1
Bayesian Decision Theory
Shyh-Kang Jeng
Department of Electrical Engineering / Graduate Institute of Communication / Graduate Institute of Networking and Multimedia, National Taiwan University
2
Basic Assumptions
The decision problem is posed in probabilistic terms
All of the relevant probability values are known
3
State of Nature
State of nature
– $\omega = \omega_1$ (sea bass) or $\omega = \omega_2$ (salmon)
A priori probability (prior)
– $P(\omega_1)$: the next fish is sea bass
– $P(\omega_2)$: the next fish is salmon
Decision rule to judge just one fish
– Decide $\omega_1$ if $P(\omega_1) > P(\omega_2)$; otherwise decide $\omega_2$
4
Class-Conditional Probability Density
5
Bayes Formula
$P(\omega_j|x) = \dfrac{p(x|\omega_j)\,P(\omega_j)}{p(x)}$, where $p(x) = \sum_{j=1}^{2} p(x|\omega_j)\,P(\omega_j)$
posterior = (likelihood × prior) / evidence
6
Posterior Probabilities
$P(\omega_1) = 2/3,\ P(\omega_2) = 1/3$
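A minimal Python sketch of how the Bayes formula turns the priors quoted on this slide into posteriors. The Gaussian class-conditional densities and their parameters (`params`) are assumptions made for illustration; the slides do not state them.

```python
import math

def gaussian_pdf(x, mean, std):
    """Univariate normal density, used here as a stand-in for p(x|w_j)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * std)

def posteriors(likelihoods, priors):
    """Bayes formula: P(w_j|x) = p(x|w_j) P(w_j) / p(x)."""
    joint = [l * p for l, p in zip(likelihoods, priors)]
    evidence = sum(joint)  # p(x) = sum_j p(x|w_j) P(w_j)
    return [j / evidence for j in joint]

priors = [2.0 / 3.0, 1.0 / 3.0]        # P(w1), P(w2) from the slide
params = [(11.0, 1.0), (13.0, 1.0)]    # hypothetical (mean, std) for w1, w2

def decide(x):
    """Return (index of the class with the larger posterior, posteriors)."""
    likes = [gaussian_pdf(x, m, s) for m, s in params]
    post = posteriors(likes, priors)
    return post.index(max(post)), post

if __name__ == "__main__":
    for x in (11.0, 12.0, 13.0):
        k, post = decide(x)
        print(x, [round(p, 3) for p in post], "decide w%d" % (k + 1))
```

At the point where the two assumed likelihoods are equal (x = 12 here), the posteriors reduce to the priors, so the decision falls back on the prior-only rule.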
7
Bayes Decision Rule
Probability of error
$P(error|x) = P(\omega_1|x)$ if we decide $\omega_2$; $P(error|x) = P(\omega_2|x)$ if we decide $\omega_1$
$P(error) = \int P(error, x)\,dx = \int P(error|x)\,p(x)\,dx$
Bayes decision rule
Decide $\omega_1$ if $P(\omega_1|x) > P(\omega_2|x)$; otherwise decide $\omega_2$
Or, decide $\omega_1$ if $p(x|\omega_1)P(\omega_1) > p(x|\omega_2)P(\omega_2)$; otherwise decide $\omega_2$
8
Bayes Decision Theory (1/3)
Categories: $\omega_1, \ldots, \omega_c$
Actions: $\alpha_1, \ldots, \alpha_a$
Loss functions: $\lambda(\alpha_i|\omega_j)$
Feature vector: $d$-component vector $\mathbf{x}$
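9
Bayes Decision Theory (2/3)
Bayes formula
$P(\omega_j|\mathbf{x}) = \dfrac{p(\mathbf{x}|\omega_j)\,P(\omega_j)}{p(\mathbf{x})}$, where $p(\mathbf{x}) = \sum_{j=1}^{c} p(\mathbf{x}|\omega_j)\,P(\omega_j)$
Conditional risk
$R(\alpha_i|\mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i|\omega_j)\,P(\omega_j|\mathbf{x})$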
10
Bayes Decision Theory (3/3)
Decision function $\alpha(\mathbf{x})$ assumes one of the values $\alpha_1, \ldots, \alpha_a$
Overall risk
$R = \int R(\alpha(\mathbf{x})|\mathbf{x})\,p(\mathbf{x})\,d\mathbf{x}$
Bayes decision rule: compute the conditional risk
$R(\alpha_i|\mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i|\omega_j)\,P(\omega_j|\mathbf{x}),\quad i = 1, \ldots, a$
then select the action $\alpha_i$ for which $R(\alpha_i|\mathbf{x})$ is minimum
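The minimum-risk rule can be sketched in a few lines of Python. The loss matrix and posterior values below are hypothetical and chosen only to show that the minimum-risk action need not be the most probable class.

```python
def conditional_risks(loss, posterior):
    """R(a_i|x) = sum_j loss[i][j] * P(w_j|x), one value per action."""
    return [sum(l_ij * p_j for l_ij, p_j in zip(row, posterior)) for row in loss]

def bayes_action(loss, posterior):
    """Return (index of the minimum-risk action, list of conditional risks)."""
    risks = conditional_risks(loss, posterior)
    best = min(range(len(risks)), key=risks.__getitem__)
    return best, risks

if __name__ == "__main__":
    posterior = [0.8, 0.2]            # hypothetical P(w1|x), P(w2|x)
    zero_one = [[0, 1], [1, 0]]       # symmetric loss: picks the larger posterior
    skewed = [[0, 10], [1, 0]]        # hypothetical: deciding w1 when w2 is true costs 10
    print(bayes_action(zero_one, posterior))
    print(bayes_action(skewed, posterior))
```

With the skewed loss the second action is selected even though the first class has the larger posterior, because a wrong first action is ten times as costly.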
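11
Two-Category Classification
Conditional risk, with $\lambda_{ij} = \lambda(\alpha_i|\omega_j)$
$R(\alpha_1|\mathbf{x}) = \lambda_{11}P(\omega_1|\mathbf{x}) + \lambda_{12}P(\omega_2|\mathbf{x})$
$R(\alpha_2|\mathbf{x}) = \lambda_{21}P(\omega_1|\mathbf{x}) + \lambda_{22}P(\omega_2|\mathbf{x})$
Decision rule: decide $\omega_1$ if
$(\lambda_{21} - \lambda_{11})\,p(\mathbf{x}|\omega_1)P(\omega_1) > (\lambda_{12} - \lambda_{22})\,p(\mathbf{x}|\omega_2)P(\omega_2)$
Likelihood ratio
$\dfrac{p(\mathbf{x}|\omega_1)}{p(\mathbf{x}|\omega_2)} > \dfrac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}}\,\dfrac{P(\omega_2)}{P(\omega_1)}$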
12
Minimum-Error-Rate Classification
If action $\alpha_i$ is taken and the true state is $\omega_j$, then the decision is correct if $i = j$ and in error if $i \ne j$
Error rate (the probability of error) is to be minimized
Symmetrical or zero-one loss function
$\lambda(\alpha_i|\omega_j) = 0$ if $i = j$, and $1$ if $i \ne j$; $\quad i, j = 1, \ldots, c$
Conditional risk
$R(\alpha_i|\mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i|\omega_j)\,P(\omega_j|\mathbf{x}) = \sum_{j \ne i} P(\omega_j|\mathbf{x}) = 1 - P(\omega_i|\mathbf{x})$
13
Minimum-Error-Rate Classification
14
Mini-max Criterion
To perform well over a range of prior probabilities
Minimize the maximum possible overall risk
– So that the worst risk for any value of the priors is as small as possible
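15
Mini-maximizing Risk
$R = \int_{\mathcal{R}_1} [\lambda_{11}P(\omega_1)p(\mathbf{x}|\omega_1) + \lambda_{12}P(\omega_2)p(\mathbf{x}|\omega_2)]\,d\mathbf{x} + \int_{\mathcal{R}_2} [\lambda_{21}P(\omega_1)p(\mathbf{x}|\omega_1) + \lambda_{22}P(\omega_2)p(\mathbf{x}|\omega_2)]\,d\mathbf{x}$
$R(P(\omega_1)) = \lambda_{22} + (\lambda_{12} - \lambda_{22})\int_{\mathcal{R}_1} p(\mathbf{x}|\omega_2)\,d\mathbf{x} + P(\omega_1)\left[(\lambda_{11} - \lambda_{22}) + (\lambda_{21} - \lambda_{11})\int_{\mathcal{R}_2} p(\mathbf{x}|\omega_1)\,d\mathbf{x} - (\lambda_{12} - \lambda_{22})\int_{\mathcal{R}_1} p(\mathbf{x}|\omega_2)\,d\mathbf{x}\right]$
$R_{mm} = \lambda_{22} + (\lambda_{12} - \lambda_{22})\int_{\mathcal{R}_1} p(\mathbf{x}|\omega_2)\,d\mathbf{x} = \lambda_{11} + (\lambda_{21} - \lambda_{11})\int_{\mathcal{R}_2} p(\mathbf{x}|\omega_1)\,d\mathbf{x}$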
16
Searching for Mini-max Boundary
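One way to search for a mini-max boundary numerically, sketched for a one-dimensional two-class problem with zero-one loss. The Gaussian parameters are hypothetical, and the sketch assumes a single threshold `t` with region R1 = {x < t}; under zero-one loss the mini-max condition is that the two conditional error integrals are equal.

```python
import math

def normal_cdf(x, mean, std):
    """Cumulative distribution function of N(mean, std^2)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def error_rates(t, mu1, s1, mu2, s2):
    """Conditional errors when deciding w1 for x < t and w2 otherwise."""
    e1 = 1.0 - normal_cdf(t, mu1, s1)  # integral over R2 of p(x|w1)
    e2 = normal_cdf(t, mu2, s2)        # integral over R1 of p(x|w2)
    return e1, e2

def minimax_threshold(mu1, s1, mu2, s2, tol=1e-10):
    """Bisection on [mu1, mu2] (assumes mu1 < mu2) for the t where e1 == e2."""
    lo, hi = mu1, mu2
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        e1, e2 = error_rates(mid, mu1, s1, mu2, s2)
        if e2 < e1:      # threshold too far left: move right
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    t = minimax_threshold(0.0, 1.0, 3.0, 2.0)   # hypothetical class parameters
    print(t, error_rates(t, 0.0, 1.0, 3.0, 2.0))
```

At the returned threshold the overall risk $P(\omega_1)e_1 + P(\omega_2)e_2$ no longer depends on the priors, which is the property the mini-max criterion asks for.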
17
Neyman-Pearson Criterion
Minimize the overall risk subject to a constraint
Example
– Minimize the total risk subject to $\int R(\alpha_i|\mathbf{x})\,d\mathbf{x} < \text{constant}$
18
Discriminant Functions
A classifier assigns $\mathbf{x}$ to class $\omega_i$ if $g_i(\mathbf{x}) > g_j(\mathbf{x})$ for all $j \ne i$, where $g_i(\mathbf{x})$ are called discriminant functions
A discriminant function for a Bayes classifier
$g_i(\mathbf{x}) = -R(\alpha_i|\mathbf{x})$
Two discriminant functions for minimum-error-rate classification
$g_i(\mathbf{x}) = \dfrac{p(\mathbf{x}|\omega_i)P(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x}|\omega_j)P(\omega_j)}$; $\quad g_i(\mathbf{x}) = \ln p(\mathbf{x}|\omega_i) + \ln P(\omega_i)$
19
Discriminant Functions
20
Two-Dimensional Two-Category Classifier
21
Dichotomizers
Place a pattern in one of only two categories
– cf. Polychotomizers
More common to define a single discriminant function
$g(\mathbf{x}) = g_1(\mathbf{x}) - g_2(\mathbf{x})$
Some particular forms
$g(\mathbf{x}) = P(\omega_1|\mathbf{x}) - P(\omega_2|\mathbf{x})$
$g(\mathbf{x}) = \ln\dfrac{p(\mathbf{x}|\omega_1)}{p(\mathbf{x}|\omega_2)} + \ln\dfrac{P(\omega_1)}{P(\omega_2)}$
22
Univariate Normal PDF
$p(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\dfrac{1}{2}\left(\dfrac{x - \mu}{\sigma}\right)^2\right]$, $\quad p(x) \sim N(\mu, \sigma^2)$
23
Distribution with Maximum Entropy and Central Limit Theorem
Entropy for discrete distribution
$H = -\sum_{i=1}^{m} P_i \log_2 P_i$ (bits)
Entropy for continuous distribution
$H(p(x)) = -\int p(x)\ln p(x)\,dx$ (nats)
Central limit theorem
– The aggregate effect of the sum of a large number of small, independent random disturbances will lead to a Gaussian distribution
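A short Python sketch of the two entropy definitions. The closed forms used for the continuous case are standard results rather than something stated on the slide: a Gaussian with standard deviation σ has entropy ½ ln(2πeσ²) nats, and a uniform density of width w has entropy ln w with variance w²/12. Comparing them at equal variance illustrates the maximum-entropy property named in the slide title.

```python
import math

def entropy_bits(probs):
    """Discrete entropy H = -sum_i P_i log2 P_i, skipping zero-probability terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def gaussian_entropy_nats(std):
    """Differential entropy of N(mu, std^2): 0.5 * ln(2 * pi * e * std^2)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * std * std)

def uniform_entropy_nats(std):
    """Differential entropy of a uniform density with the same standard deviation."""
    width = math.sqrt(12.0) * std   # variance of a width-w uniform is w^2 / 12
    return math.log(width)

if __name__ == "__main__":
    print(entropy_bits([0.5, 0.5]))          # fair coin
    print(entropy_bits([0.9, 0.1]))          # biased coin carries less information
    print(gaussian_entropy_nats(1.0), uniform_entropy_nats(1.0))
```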
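24
Multivariate Normal PDF
$p(\mathbf{x}) = \dfrac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}}\exp\left[-\dfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^t\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right]$, $\quad p(\mathbf{x}) \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$
$\boldsymbol{\mu} = E[\mathbf{x}]$: $d$-component mean vector
$\boldsymbol{\Sigma} = E[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^t]$: $d$-by-$d$ covariance matrix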
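25
Linear Combination of Gaussian Random Variables
$p(\mathbf{x}) \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma}),\ \mathbf{y} = \mathbf{A}^t\mathbf{x} \ \Rightarrow\ p(\mathbf{y}) \sim N(\mathbf{A}^t\boldsymbol{\mu}, \mathbf{A}^t\boldsymbol{\Sigma}\mathbf{A})$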
26
Whitening Transform
$\boldsymbol{\Phi}$: matrix whose columns are the orthonormal eigenvectors of $\boldsymbol{\Sigma}$
$\boldsymbol{\Lambda}$: diagonal matrix of the corresponding eigenvalues
Whitening transform
$\mathbf{A}_w = \boldsymbol{\Phi}\boldsymbol{\Lambda}^{-1/2}$, $\quad \mathbf{A}_w^t\boldsymbol{\Sigma}\mathbf{A}_w = \mathbf{I}$
27
Bivariate Gaussian PDF
28
Mahalanobis Distance
Squared Mahalanobis distance
$r^2 = (\mathbf{x} - \boldsymbol{\mu})^t\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})$
Volume of the hyperellipsoids of constant Mahalanobis distance $r$
$V = V_d\,|\boldsymbol{\Sigma}|^{1/2}\,r^d$
$V_d = \pi^{d/2}/(d/2)!$ for $d$ even; $\quad V_d = 2^d\,\pi^{(d-1)/2}\left(\frac{d-1}{2}\right)!\,/\,d!$ for $d$ odd
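The formulas above can be checked with a small Python sketch. To stay within the standard library it assumes a diagonal covariance matrix (so the inverse is elementwise); the function names and sample points are illustrative only.

```python
import math

def unit_ball_volume(d):
    """V_d: volume of the unit hypersphere in d dimensions (even/odd branches)."""
    if d % 2 == 0:
        return math.pi ** (d // 2) / math.factorial(d // 2)
    half = (d - 1) // 2
    return (2 ** d) * math.pi ** half * math.factorial(half) / math.factorial(d)

def mahalanobis_sq_diag(x, mean, variances):
    """r^2 = (x - mu)^t Sigma^{-1} (x - mu) for a diagonal Sigma."""
    return sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mean, variances))

def ellipsoid_volume(det_sigma, r, d):
    """V = V_d |Sigma|^{1/2} r^d."""
    return unit_ball_volume(d) * math.sqrt(det_sigma) * r ** d

if __name__ == "__main__":
    print([unit_ball_volume(d) for d in (1, 2, 3)])   # 2, pi, 4*pi/3
    print(mahalanobis_sq_diag((3.0, 2.0), (3.0, 6.0), (0.5, 2.0)))
    print(ellipsoid_volume(det_sigma=1.0, r=1.0, d=2))
```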
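29
Discriminant Functions for Normal Density
For minimum-error-rate classification
$g_i(\mathbf{x}) = \ln p(\mathbf{x}|\omega_i) + \ln P(\omega_i)$
For normal density
$g_i(\mathbf{x}) = -\dfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^t\boldsymbol{\Sigma}_i^{-1}(\mathbf{x} - \boldsymbol{\mu}_i) - \dfrac{d}{2}\ln 2\pi - \dfrac{1}{2}\ln|\boldsymbol{\Sigma}_i| + \ln P(\omega_i)$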
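30
Case 1: $\boldsymbol{\Sigma}_i = \sigma^2\mathbf{I}$
$g_i(\mathbf{x}) = -\dfrac{\|\mathbf{x} - \boldsymbol{\mu}_i\|^2}{2\sigma^2} + \ln P(\omega_i)$, where $\|\mathbf{x} - \boldsymbol{\mu}_i\|^2 = (\mathbf{x} - \boldsymbol{\mu}_i)^t(\mathbf{x} - \boldsymbol{\mu}_i)$
$g_i(\mathbf{x}) = -\dfrac{1}{2\sigma^2}\left[\mathbf{x}^t\mathbf{x} - 2\boldsymbol{\mu}_i^t\mathbf{x} + \boldsymbol{\mu}_i^t\boldsymbol{\mu}_i\right] + \ln P(\omega_i)$
Dropping the term $\mathbf{x}^t\mathbf{x}$, which is the same for every class:
$g_i(\mathbf{x}) = \mathbf{w}_i^t\mathbf{x} + w_{i0}$
$\mathbf{w}_i = \dfrac{1}{\sigma^2}\boldsymbol{\mu}_i$, $\quad w_{i0} = -\dfrac{1}{2\sigma^2}\boldsymbol{\mu}_i^t\boldsymbol{\mu}_i + \ln P(\omega_i)$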
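31
Decision Boundaries
$g_i(\mathbf{x}) = g_j(\mathbf{x}) \ \Rightarrow\ \mathbf{w}^t(\mathbf{x} - \mathbf{x}_0) = 0$
$\mathbf{w} = \boldsymbol{\mu}_i - \boldsymbol{\mu}_j$
$\mathbf{x}_0 = \dfrac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \dfrac{\sigma^2}{\|\boldsymbol{\mu}_i - \boldsymbol{\mu}_j\|^2}\ln\dfrac{P(\omega_i)}{P(\omega_j)}\,(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$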
32
Decision Boundaries when $P(\omega_i) = P(\omega_j)$
33
Decision Boundaries when $P(\omega_i)$ and $P(\omega_j)$ are unequal
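34
Case 2: $\boldsymbol{\Sigma}_i = \boldsymbol{\Sigma}$
$g_i(\mathbf{x}) = -\dfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^t\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_i) + \ln P(\omega_i)$
$g_i(\mathbf{x}) = -\dfrac{1}{2}\left[\mathbf{x}^t\boldsymbol{\Sigma}^{-1}\mathbf{x} - 2\boldsymbol{\mu}_i^t\boldsymbol{\Sigma}^{-1}\mathbf{x} + \boldsymbol{\mu}_i^t\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_i\right] + \ln P(\omega_i)$
Dropping the term $\mathbf{x}^t\boldsymbol{\Sigma}^{-1}\mathbf{x}$, which is the same for every class:
$g_i(\mathbf{x}) = \mathbf{w}_i^t\mathbf{x} + w_{i0}$
$\mathbf{w}_i = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_i$, $\quad w_{i0} = -\dfrac{1}{2}\boldsymbol{\mu}_i^t\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_i + \ln P(\omega_i)$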
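35
Decision Boundaries
$g_i(\mathbf{x}) = g_j(\mathbf{x}) \ \Rightarrow\ \mathbf{w}^t(\mathbf{x} - \mathbf{x}_0) = 0$
$\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$
$\mathbf{x}_0 = \dfrac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \dfrac{\ln[P(\omega_i)/P(\omega_j)]}{(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^t\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)}\,(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$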
36
Decision Boundaries
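37
Case 3: $\boldsymbol{\Sigma}_i$ = arbitrary
$g_i(\mathbf{x}) = \mathbf{x}^t\mathbf{W}_i\mathbf{x} + \mathbf{w}_i^t\mathbf{x} + w_{i0}$
$\mathbf{W}_i = -\dfrac{1}{2}\boldsymbol{\Sigma}_i^{-1}$, $\quad \mathbf{w}_i = \boldsymbol{\Sigma}_i^{-1}\boldsymbol{\mu}_i$
$w_{i0} = -\dfrac{1}{2}\boldsymbol{\mu}_i^t\boldsymbol{\Sigma}_i^{-1}\boldsymbol{\mu}_i - \dfrac{1}{2}\ln|\boldsymbol{\Sigma}_i| + \ln P(\omega_i)$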
38
Decision Boundaries for One-Dimensional Case
39
Decision Boundaries for Two-Dimensional Case
40
Decision Boundaries for Three-Dimensional Case (1/2)
41
Decision Boundaries for Three-Dimensional Case (2/2)
42
Decision Boundaries for Four Normal Distributions
43
Example: Decision Regions for Two-Dimensional Gaussian Data
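44
Example: Decision Regions for Two-Dimensional Gaussian Data
$\boldsymbol{\mu}_1 = \begin{pmatrix} 3 \\ 6 \end{pmatrix}$, $\boldsymbol{\Sigma}_1 = \begin{pmatrix} 1/2 & 0 \\ 0 & 2 \end{pmatrix}$; $\quad \boldsymbol{\mu}_2 = \begin{pmatrix} 3 \\ -2 \end{pmatrix}$, $\boldsymbol{\Sigma}_2 = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$
$\boldsymbol{\Sigma}_1^{-1} = \begin{pmatrix} 2 & 0 \\ 0 & 1/2 \end{pmatrix}$, $\quad \boldsymbol{\Sigma}_2^{-1} = \begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}$
$P(\omega_1) = P(\omega_2) = 0.5$
Decision boundary: $x_2 = 3.514 - 1.125x_1 + 0.1875x_1^2$, not passing through $\begin{pmatrix} 3 \\ 2 \end{pmatrix}$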
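The quoted boundary can be checked numerically. The sketch below evaluates the general Gaussian discriminant for both classes along the parabola and confirms that the two values agree up to the rounding of the constant 3.514; it relies on the covariances being diagonal, and the helper names are illustrative.

```python
import math

def g(x, mean, var, prior):
    """Gaussian discriminant g_i(x) for a diagonal covariance (variances in var)."""
    quad = sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mean, var))
    log_det = sum(math.log(v) for v in var)
    d = len(x)
    return -0.5 * quad - 0.5 * d * math.log(2.0 * math.pi) - 0.5 * log_det + math.log(prior)

MU1, VAR1 = (3.0, 6.0), (0.5, 2.0)    # class 1: mean and diagonal of Sigma_1
MU2, VAR2 = (3.0, -2.0), (2.0, 2.0)   # class 2: mean and diagonal of Sigma_2

def boundary_x2(x1):
    """Decision boundary quoted on the slide."""
    return 3.514 - 1.125 * x1 + 0.1875 * x1 ** 2

def gap(x):
    """g_1(x) - g_2(x); zero on the decision boundary, positive where w1 wins."""
    return g(x, MU1, VAR1, 0.5) - g(x, MU2, VAR2, 0.5)

if __name__ == "__main__":
    for x1 in (-2.0, 0.0, 3.0, 6.0):
        print(x1, boundary_x2(x1), gap((x1, boundary_x2(x1))))
    print("midpoint of the means:", gap((3.0, 2.0)))
```

The gap at the midpoint (3, 2) of the two means is ln 2, so that point lies inside the region for $\omega_1$ rather than on the boundary.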
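45
Bayes Decision Compared with Other Decision Strategies
$P(error) = P(\mathbf{x} \in \mathcal{R}_2, \omega_1) + P(\mathbf{x} \in \mathcal{R}_1, \omega_2) = \int_{\mathcal{R}_2} p(\mathbf{x}|\omega_1)P(\omega_1)\,d\mathbf{x} + \int_{\mathcal{R}_1} p(\mathbf{x}|\omega_2)P(\omega_2)\,d\mathbf{x}$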
46
Multicategory Case
Probability of being correct
$P(correct) = \sum_{i=1}^{c}\int_{\mathcal{R}_i} p(\mathbf{x}|\omega_i)P(\omega_i)\,d\mathbf{x}$
Bayes classifier maximizes this probability by choosing the regions so that the integrand is maximal for all $\mathbf{x}$
– No other partitioning can yield a smaller probability of error
47
Error Bounds for Normal Densities
Full calculation of the error probability is difficult for the Gaussian case
– Especially in high dimensions
– Discontinuous nature of the decision regions
Upper bound on the error can be obtained for the two-category case
– By approximating the error integral analytically
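48
Chernoff Bound
$\min[a, b] \le a^{\beta}b^{1-\beta}$ for $a, b \ge 0$ and $0 \le \beta \le 1$
$P(error|\mathbf{x}) = \min[P(\omega_1|\mathbf{x}), P(\omega_2|\mathbf{x})]$
$P(\omega_j|\mathbf{x}) = \dfrac{p(\mathbf{x}|\omega_j)P(\omega_j)}{p(\mathbf{x})}$, $\quad p(\mathbf{x}) = \sum_{j=1}^{2} p(\mathbf{x}|\omega_j)P(\omega_j)$
$P(error) \le P^{\beta}(\omega_1)P^{1-\beta}(\omega_2)\int p^{\beta}(\mathbf{x}|\omega_1)\,p^{1-\beta}(\mathbf{x}|\omega_2)\,d\mathbf{x}$
For normal densities, $\int p^{\beta}(\mathbf{x}|\omega_1)\,p^{1-\beta}(\mathbf{x}|\omega_2)\,d\mathbf{x} = e^{-k(\beta)}$
$k(\beta) = \dfrac{\beta(1-\beta)}{2}(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^t\left[(1-\beta)\boldsymbol{\Sigma}_1 + \beta\boldsymbol{\Sigma}_2\right]^{-1}(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1) + \dfrac{1}{2}\ln\dfrac{|(1-\beta)\boldsymbol{\Sigma}_1 + \beta\boldsymbol{\Sigma}_2|}{|\boldsymbol{\Sigma}_1|^{1-\beta}|\boldsymbol{\Sigma}_2|^{\beta}}$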
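49
Bhattacharyya Bound
Set $\beta = 1/2$
$P(error) \le \sqrt{P(\omega_1)P(\omega_2)}\int\sqrt{p(\mathbf{x}|\omega_1)\,p(\mathbf{x}|\omega_2)}\,d\mathbf{x} = \sqrt{P(\omega_1)P(\omega_2)}\;e^{-k(1/2)}$
$k(1/2) = \dfrac{1}{8}(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^t\left[\dfrac{\boldsymbol{\Sigma}_1 + \boldsymbol{\Sigma}_2}{2}\right]^{-1}(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1) + \dfrac{1}{2}\ln\dfrac{\left|\frac{\boldsymbol{\Sigma}_1 + \boldsymbol{\Sigma}_2}{2}\right|}{\sqrt{|\boldsymbol{\Sigma}_1||\boldsymbol{\Sigma}_2|}}$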
50
Chernoff Bound and Bhattacharyya Bound
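51
Example: Error Bounds for Gaussian Distribution
$\boldsymbol{\mu}_1 = \begin{pmatrix} 3 \\ 6 \end{pmatrix}$, $\boldsymbol{\Sigma}_1 = \begin{pmatrix} 1/2 & 0 \\ 0 & 2 \end{pmatrix}$; $\quad \boldsymbol{\mu}_2 = \begin{pmatrix} 3 \\ -2 \end{pmatrix}$, $\boldsymbol{\Sigma}_2 = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$
$\boldsymbol{\Sigma}_1^{-1} = \begin{pmatrix} 2 & 0 \\ 0 & 1/2 \end{pmatrix}$, $\quad \boldsymbol{\Sigma}_2^{-1} = \begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}$
$P(\omega_1) = P(\omega_2) = 0.5$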
52
Example: Error Bounds for Gaussian Distribution
Bhattacharyya bound
– $k(1/2) = 4.11157$
– $P(error) \le \sqrt{0.5 \times 0.5}\;e^{-4.11157} \approx 0.008191$
Chernoff bound
– 0.008190 by numerical searching
Error rate by numerical integration
– 0.0021
– Impractical for higher dimensions
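A minimal Python sketch that evaluates $k(1/2)$ and the resulting Bhattacharyya bound for the example parameters. It assumes diagonal covariance matrices (true for this example), so determinants and inverses reduce to products and elementwise divisions; `math.prod` requires Python 3.8 or later.

```python
import math

def bhattacharyya_k_diag(mu1, var1, mu2, var2):
    """k(1/2) for two Gaussians with diagonal covariances var1 and var2."""
    avg = [(a + b) / 2.0 for a, b in zip(var1, var2)]          # (Sigma1 + Sigma2) / 2
    quad = sum((m2 - m1) ** 2 / v for m1, m2, v in zip(mu1, mu2, avg))
    log_term = math.log(math.prod(avg) / math.sqrt(math.prod(var1) * math.prod(var2)))
    return quad / 8.0 + 0.5 * log_term

def bhattacharyya_bound(prior1, prior2, k_half):
    """P(error) <= sqrt(P(w1) P(w2)) * exp(-k(1/2))."""
    return math.sqrt(prior1 * prior2) * math.exp(-k_half)

if __name__ == "__main__":
    k = bhattacharyya_k_diag((3.0, 6.0), (0.5, 2.0), (3.0, -2.0), (2.0, 2.0))
    print(k, bhattacharyya_bound(0.5, 0.5, k))
```

For these parameters the quadratic term contributes exactly 4 and the log-determinant term contributes ½ ln 1.25.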
53
Signal Detection Theory
Internal signal in the detector $x$
– Has mean $\mu_2$ when external signal (pulse) is present
– Has mean $\mu_1$ when external signal is not present
– $p(x|\omega_i) \sim N(\mu_i, \sigma^2)$
54
Signal Detection Theory
Discriminability: $d' = \dfrac{|\mu_2 - \mu_1|}{\sigma}$
55
Four Probabilities
Hit: $P(x > x^*\,|\,x \in \omega_2)$
False alarm: $P(x > x^*\,|\,x \in \omega_1)$
Miss: $P(x < x^*\,|\,x \in \omega_2)$
Correct reject: $P(x < x^*\,|\,x \in \omega_1)$
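Under the Gaussian model of the internal signal, the four probabilities follow from the normal CDF. The sketch below uses hypothetical values for the means, σ, and the threshold `x_star`; sweeping `x_star` and plotting hit rate against false-alarm rate would trace out one ROC curve for a given d'.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def discriminability(mu1, mu2, sigma):
    """d' = |mu2 - mu1| / sigma."""
    return abs(mu2 - mu1) / sigma

def four_probabilities(x_star, mu1, mu2, sigma):
    """Return (hit, false_alarm, miss, correct_reject) for threshold x_star."""
    hit = 1.0 - normal_cdf((x_star - mu2) / sigma)           # P(x > x* | w2)
    false_alarm = 1.0 - normal_cdf((x_star - mu1) / sigma)   # P(x > x* | w1)
    return hit, false_alarm, 1.0 - hit, 1.0 - false_alarm

if __name__ == "__main__":
    mu1, mu2, sigma = 0.0, 2.0, 1.0   # hypothetical detector parameters
    print("d' =", discriminability(mu1, mu2, sigma))
    for x_star in (0.0, 1.0, 2.0):
        print(x_star, four_probabilities(x_star, mu1, mu2, sigma))
```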
56
Receiver Operating Characteristic (ROC)
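57
Bayes Decision Theory: Discrete Features
$\int p(\mathbf{x}|\omega_j)\,d\mathbf{x} \ \rightarrow\ \sum_{\mathbf{x}} P(\mathbf{x}|\omega_j)$
$P(\omega_j|\mathbf{x}) = \dfrac{P(\mathbf{x}|\omega_j)P(\omega_j)}{P(\mathbf{x})}$, $\quad P(\mathbf{x}) = \sum_{j=1}^{c} P(\mathbf{x}|\omega_j)P(\omega_j)$
Select action $\alpha^* = \arg\min_i R(\alpha_i|\mathbf{x})$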
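58
Independent Binary Features
$\mathbf{x} = (x_1, \ldots, x_d)^t$
$p_i = \Pr[x_i = 1|\omega_1]$, $\quad q_i = \Pr[x_i = 1|\omega_2]$
Assume conditional independence
$P(\mathbf{x}|\omega_1) = \prod_{i=1}^{d} p_i^{x_i}(1 - p_i)^{1 - x_i}$, $\quad P(\mathbf{x}|\omega_2) = \prod_{i=1}^{d} q_i^{x_i}(1 - q_i)^{1 - x_i}$
Likelihood ratio
$\dfrac{P(\mathbf{x}|\omega_1)}{P(\mathbf{x}|\omega_2)} = \prod_{i=1}^{d}\left(\dfrac{p_i}{q_i}\right)^{x_i}\left(\dfrac{1 - p_i}{1 - q_i}\right)^{1 - x_i}$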
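59
Discriminant Function
$g(\mathbf{x}) = \sum_{i=1}^{d}\left[x_i\ln\dfrac{p_i}{q_i} + (1 - x_i)\ln\dfrac{1 - p_i}{1 - q_i}\right] + \ln\dfrac{P(\omega_1)}{P(\omega_2)}$
$g(\mathbf{x}) = \sum_{i=1}^{d} w_i x_i + w_0$
$w_i = \ln\dfrac{p_i(1 - q_i)}{q_i(1 - p_i)}$, $\quad i = 1, \ldots, d$
$w_0 = \sum_{i=1}^{d}\ln\dfrac{1 - p_i}{1 - q_i} + \ln\dfrac{P(\omega_1)}{P(\omega_2)}$
Decide $\omega_1$ if $g(\mathbf{x}) > 0$ and $\omega_2$ if $g(\mathbf{x}) \le 0$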
60
Example: Three-Dimensional Binary Data
$P(\omega_1) = P(\omega_2) = 0.5$
$p_i = 0.8,\ q_i = 0.5,\ i = 1, 2, 3$
$w_i = \ln\dfrac{0.8(1 - 0.5)}{0.5(1 - 0.8)} = 1.3863$
$w_0 = \sum_{i=1}^{3}\ln\dfrac{1 - 0.8}{1 - 0.5} + \ln\dfrac{0.5}{0.5} = -2.75$
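The weights of the linear discriminant for independent binary features can be computed directly from the $p_i$, $q_i$ and the priors. The sketch below uses the values from this example; the function names are illustrative.

```python
import math

def binary_feature_weights(p, q, prior1, prior2):
    """Return (w, w0) of g(x) = sum_i w_i x_i + w0 for independent binary features."""
    w = [math.log(pi * (1.0 - qi) / (qi * (1.0 - pi))) for pi, qi in zip(p, q)]
    w0 = sum(math.log((1.0 - pi) / (1.0 - qi)) for pi, qi in zip(p, q))
    w0 += math.log(prior1 / prior2)
    return w, w0

def g(x, w, w0):
    """Linear discriminant: decide w1 if positive, w2 otherwise."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

if __name__ == "__main__":
    w, w0 = binary_feature_weights([0.8] * 3, [0.5] * 3, 0.5, 0.5)
    print([round(wi, 4) for wi in w], round(w0, 2))
    for x in ([1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]):
        print(x, round(g(x, w, w0), 4))
```

With these values a pattern needs at least two of its three features set to 1 before $g(\mathbf{x})$ becomes positive. Setting $p_3 = q_3 = 0.5$ makes the third feature uninformative, so its weight becomes 0.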
61
Example: Three-Dimensional Binary Data
$P(\omega_1) = P(\omega_2) = 0.5$
$p_i = 0.8,\ q_i = 0.5,\ i = 1, 2$; $\quad p_3 = q_3 = 0.5$
$w_i = \ln\dfrac{0.8(1 - 0.5)}{0.5(1 - 0.8)} = 1.3863,\ i = 1, 2$; $\quad w_3 = 0$
$w_0 = \sum_{i=1}^{2}\ln\dfrac{1 - 0.8}{1 - 0.5} + \ln\dfrac{0.5}{0.5} = -1.83$
62
Illustration of Missing Features
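63
Decision with Missing Features
$\mathbf{x} = [\mathbf{x}_g, \mathbf{x}_b]$
$P(\omega_i|\mathbf{x}_g) = \dfrac{p(\omega_i, \mathbf{x}_g)}{p(\mathbf{x}_g)} = \dfrac{\int p(\omega_i, \mathbf{x}_g, \mathbf{x}_b)\,d\mathbf{x}_b}{p(\mathbf{x}_g)}$
$= \dfrac{\int P(\omega_i|\mathbf{x}_g, \mathbf{x}_b)\,p(\mathbf{x}_g, \mathbf{x}_b)\,d\mathbf{x}_b}{p(\mathbf{x}_g)} = \dfrac{\int g_i(\mathbf{x})\,p(\mathbf{x})\,d\mathbf{x}_b}{\int p(\mathbf{x})\,d\mathbf{x}_b}$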
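64
Noisy Features
Noise model: $p(\mathbf{x}_b|\mathbf{x}_t)$; assume that, given $\mathbf{x}_t$, $\mathbf{x}_b$ is independent of $\omega_i$ and $\mathbf{x}_g$
$P(\omega_i|\mathbf{x}_g, \mathbf{x}_b) = \dfrac{\int p(\omega_i, \mathbf{x}_g, \mathbf{x}_b, \mathbf{x}_t)\,d\mathbf{x}_t}{p(\mathbf{x}_g, \mathbf{x}_b)}$
$p(\omega_i, \mathbf{x}_g, \mathbf{x}_b, \mathbf{x}_t) = P(\omega_i|\mathbf{x}_g, \mathbf{x}_b, \mathbf{x}_t)\,p(\mathbf{x}_g, \mathbf{x}_b, \mathbf{x}_t)$
$= P(\omega_i|\mathbf{x}_g, \mathbf{x}_t)\,p(\mathbf{x}_b|\mathbf{x}_g, \mathbf{x}_t)\,p(\mathbf{x}_g, \mathbf{x}_t)$, with $p(\mathbf{x}_b|\mathbf{x}_g, \mathbf{x}_t) = p(\mathbf{x}_b|\mathbf{x}_t)$
$P(\omega_i|\mathbf{x}_g, \mathbf{x}_b) = \dfrac{\int P(\omega_i|\mathbf{x}_g, \mathbf{x}_t)\,p(\mathbf{x}_g, \mathbf{x}_t)\,p(\mathbf{x}_b|\mathbf{x}_t)\,d\mathbf{x}_t}{\int p(\mathbf{x}_g, \mathbf{x}_t)\,p(\mathbf{x}_b|\mathbf{x}_t)\,d\mathbf{x}_t}$
$= \dfrac{\int g_i(\mathbf{x})\,p(\mathbf{x})\,p(\mathbf{x}_b|\mathbf{x}_t)\,d\mathbf{x}_t}{\int p(\mathbf{x})\,p(\mathbf{x}_b|\mathbf{x}_t)\,d\mathbf{x}_t}$, $\quad \mathbf{x} = [\mathbf{x}_g, \mathbf{x}_t]$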
65
Example of Statistical Dependence and Independence
$p(x_1, x_3) = p(x_1)\,p(x_3)$
66
Example of Causal Dependence
State of an automobile
– Temperature of engine
– Pressure of brake fluid
– Pressure of air in the tires
– Voltages in the wires
– Oil temperature
– Coolant temperature
– Speed of the radiator fan
67
Bayesian Belief Nets (Causal Networks)
68
Example: Belief Network for Fish
$P(a_3, b_1, x_2, c_3, d_2) = P(a_3)\,P(b_1)\,P(x_2|a_3, b_1)\,P(c_3|x_2)\,P(d_2|x_2)$
$= 0.25 \times 0.6 \times 0.4 \times 0.5 \times 0.4 = 0.012$
Simple Belief Network 1Simple Belief Network 1
)()|()|()|(
)|()|()|()(
),,,()(
,,
,,
aabbccd
cdbcaba
dcbad
ac b
cba
cba
PPPP
PPPP
PP
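The point of the last line is that the sums can be pushed inward and evaluated one variable at a time. A minimal Python sketch for the chain a → b → c → d, with hypothetical binary conditional probability tables, compares the brute-force sum over the joint with the nested form.

```python
from itertools import product

# Hypothetical conditional probability tables for binary variables.
P_a = [0.3, 0.7]                      # P(a)
P_b_a = [[0.9, 0.1], [0.2, 0.8]]      # P_b_a[a][b] = P(b|a)
P_c_b = [[0.6, 0.4], [0.5, 0.5]]      # P_c_b[b][c] = P(c|b)
P_d_c = [[0.7, 0.3], [0.1, 0.9]]      # P_d_c[c][d] = P(d|c)

def marginal_d_bruteforce():
    """P(d) = sum over a, b, c of the full joint P(a) P(b|a) P(c|b) P(d|c)."""
    out = [0.0, 0.0]
    for a, b, c, d in product(range(2), repeat=4):
        out[d] += P_a[a] * P_b_a[a][b] * P_c_b[b][c] * P_d_c[c][d]
    return out

def propagate(parent_dist, cpt):
    """Sum out the parent: P(child) = sum_parent P(child|parent) P(parent)."""
    n_child = len(cpt[0])
    return [sum(parent_dist[p] * cpt[p][k] for p in range(len(parent_dist)))
            for k in range(n_child)]

def marginal_d_chain():
    """Nested sums: first P(b), then P(c), then P(d)."""
    return propagate(propagate(propagate(P_a, P_b_a), P_c_b), P_d_c)

if __name__ == "__main__":
    print(marginal_d_bruteforce())
    print(marginal_d_chain())
```

Both routes give the same marginal, but the nested form touches each table once instead of enumerating every joint configuration.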
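70
Simple Belief Network 2
$P(\mathbf{h}) = \sum_{\mathbf{e},\mathbf{f},\mathbf{g}} P(\mathbf{e}, \mathbf{f}, \mathbf{g}, \mathbf{h})$
$= \sum_{\mathbf{e},\mathbf{f},\mathbf{g}} P(\mathbf{e})\,P(\mathbf{f}|\mathbf{e})\,P(\mathbf{g}|\mathbf{e})\,P(\mathbf{h}|\mathbf{f}, \mathbf{g})$
$= \sum_{\mathbf{f},\mathbf{g}} P(\mathbf{h}|\mathbf{f}, \mathbf{g})\sum_{\mathbf{e}} P(\mathbf{e})\,P(\mathbf{f}|\mathbf{e})\,P(\mathbf{g}|\mathbf{e})$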
71
Use of Bayes Belief Nets
Seek to determine some particular configuration of other variables
– Given the values of some of the variables (evidence)
Determine values of several query variables ($\mathbf{x}$) given the evidence of all other variables ($\mathbf{e}$)
$P(\mathbf{x}|\mathbf{e}) = \dfrac{P(\mathbf{x}, \mathbf{e})}{P(\mathbf{e})} = \alpha\,P(\mathbf{x}, \mathbf{e})$
72
Example
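73
Example
$P(x_1|c_1, b_2) = \dfrac{P(x_1, c_1, b_2)}{P(c_1, b_2)} = \alpha\sum_{\mathbf{a},\mathbf{d}} P(x_1, \mathbf{a}, b_2, c_1, \mathbf{d})$
$= \alpha\sum_{\mathbf{a},\mathbf{d}} P(\mathbf{a})\,P(b_2)\,P(x_1|\mathbf{a}, b_2)\,P(c_1|x_1)\,P(\mathbf{d}|x_1)$
$= \alpha\,P(b_2)\,P(c_1|x_1)\,[P(a_1)P(x_1|a_1, b_2) + P(a_2)P(x_1|a_2, b_2) + P(a_3)P(x_1|a_3, b_2) + P(a_4)P(x_1|a_4, b_2)]\,[P(d_1|x_1) + P(d_2|x_1)]$
$= \alpha \cdot 0.114$
$P(x_2|c_1, b_2) = \alpha \cdot 0.066$
$P(x_1|c_1, b_2) = 0.63$, $\quad P(x_2|c_1, b_2) = 0.37$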
74
Naïve Bayes' Rule (Idiot Bayes' Rule)
When the dependency relationships among the features are unknown, we generally take the simplest assumption
– Features are conditionally independent given the category
$P(\mathbf{x}|\mathbf{a}, \mathbf{b}) = P(\mathbf{x}|\mathbf{a})\,P(\mathbf{x}|\mathbf{b})$
– Often works quite well
75
Applications in Medical Diagnosis
Uppermost nodes represent a fundamental biological agent
– Such as the presence of a virus or bacteria
Intermediate nodes describe disease
– Such as flu or emphysema
Lowermost nodes describe the symptoms
– Such as high temperature or coughing
A physician enters measured values into the net and finds the most likely disease or cause
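76
Compound Bayesian Decision
$\boldsymbol{\omega} = (\omega(1), \ldots, \omega(n))^t$, where $\omega(i)$ takes one of the values $\omega_1, \ldots, \omega_c$
$\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$
$P(\boldsymbol{\omega}|\mathbf{X}) = \dfrac{p(\mathbf{X}|\boldsymbol{\omega})\,P(\boldsymbol{\omega})}{p(\mathbf{X})} = \dfrac{p(\mathbf{X}|\boldsymbol{\omega})\,P(\boldsymbol{\omega})}{\sum_{\boldsymbol{\omega}} p(\mathbf{X}|\boldsymbol{\omega})\,P(\boldsymbol{\omega})}$
Simplification
$p(\mathbf{X}|\boldsymbol{\omega}) = \prod_{i=1}^{n} p(\mathbf{x}_i|\omega(i))$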