Bayesian Decision Theory
Chapter 2 (Duda, Hart & Stork)
CS7616 - Pattern Recognition
Henrik I Christensen, Georgia Tech
Bayesian Decision Theory
• Design classifiers to recommend decisions that minimize some total expected "risk".
– The simplest risk is the classification error (i.e., all costs are equal).
– Typically, the risk includes the cost associated with different decisions.
Terminology
• State of nature ω (random variable):
– e.g., ω1 for sea bass, ω2 for salmon
• Probabilities P(ω1) and P(ω2) (priors):
– e.g., prior knowledge of how likely it is to get a sea bass or a salmon
• Probability density function p(x) (evidence):
– e.g., how frequently we will measure a pattern with feature value x (e.g., x corresponds to lightness)
Terminology (cont'd)
• Conditional probability density p(x/ωj) (likelihood):
– e.g., how frequently we will measure a pattern with feature value x given that the pattern belongs to class ωj
– e.g., the lightness distributions of the salmon and sea-bass populations
Terminology (cont'd)
• Conditional probability P(ωj/x) (posterior):
– e.g., the probability that the fish belongs to class ωj given measurement x.
Decision Rule Using Prior Probabilities
Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2
or P(error) = min[P(ω1), P(ω2)]
• Favours the most likely class.
• This rule makes the same decision every time.
– i.e., optimum if no other information is available
$$P(error) = \begin{cases} P(\omega_1) & \text{if we decide } \omega_2 \\ P(\omega_2) & \text{if we decide } \omega_1 \end{cases}$$
Decision Rule Using Conditional Probabilities
• Using Bayes' rule, the posterior probability of category ωj given measurement x is given by:

$$P(\omega_j/x) = \frac{p(x/\omega_j)\, P(\omega_j)}{p(x)} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$

where $p(x) = \sum_{j=1}^{2} p(x/\omega_j)\, P(\omega_j)$ (i.e., a scale factor ensuring the posteriors sum to 1).

Decide ω1 if P(ω1/x) > P(ω2/x); otherwise decide ω2
or
Decide ω1 if p(x/ω1)P(ω1) > p(x/ω2)P(ω2); otherwise decide ω2
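As a minimal sketch of this rule, the snippet below implements the two-class posterior decision assuming Gaussian class-conditional densities; the means, spreads, and priors are made-up illustration values, not numbers from the lecture.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical class-conditional densities for a 1-D "lightness" feature;
# all parameters are assumptions for illustration only.
priors = {"sea_bass": 2/3, "salmon": 1/3}            # P(omega_j)
likelihoods = {
    "sea_bass": norm(loc=7.0, scale=1.5),            # p(x/omega_1)
    "salmon":   norm(loc=4.0, scale=1.0),            # p(x/omega_2)
}

def posteriors(x):
    """P(omega_j/x) = p(x/omega_j) P(omega_j) / p(x)."""
    joint = {w: likelihoods[w].pdf(x) * priors[w] for w in priors}
    evidence = sum(joint.values())   # p(x) = sum_j p(x/omega_j) P(omega_j)
    return {w: v / evidence for w, v in joint.items()}

def decide(x):
    # Equivalent to comparing p(x/omega_j) P(omega_j) directly, since the
    # evidence is a common scale factor.
    post = posteriors(x)
    return max(post, key=post.get)

print(decide(5.0), posteriors(5.0))
```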
Decision Rule Using Conditional pdfs (cont'd)
[Figure: likelihoods p(x/ωj) and posteriors P(ωj/x) for priors P(ω1) = 2/3, P(ω2) = 1/3]
Probability of Error
• The probability of error is defined as:

$$P(error/x) = \begin{cases} P(\omega_1/x) & \text{if we decide } \omega_2 \\ P(\omega_2/x) & \text{if we decide } \omega_1 \end{cases}$$

or

$$P(error/x) = \min[P(\omega_1/x), P(\omega_2/x)]$$

• What is the average probability of error?

$$P(error) = \int_{-\infty}^{\infty} P(error, x)\, dx = \int_{-\infty}^{\infty} P(error/x)\, p(x)\, dx$$

• The Bayes rule is optimum, that is, it minimizes the average probability of error!
Where do Probabilities Come From?
• There are two competing answers to this question:
(1) Relative frequency (objective) approach.
– Probabilities can only come from experiments.
(2) Bayesian (subjective) approach.
– Probabilities may reflect degrees of belief and can be based on opinion.
Example (objective approach)
• Classify cars by whether they cost more or less than $50K:
– Classes: C1 if price > $50K, C2 if price ≤ $50K
– Feature: x, the height of a car
• Use Bayes' rule to compute the posterior probabilities:

$$P(C_i/x) = \frac{p(x/C_i)\, P(C_i)}{p(x)}$$

• We need to estimate p(x/C1), p(x/C2), P(C1), P(C2)
Example (cont'd)
• Collect data
– Ask drivers how much their car cost and measure its height.
• Determine the prior probabilities P(C1), P(C2)
– e.g., 1209 samples: #C1 = 221, #C2 = 988

$$P(C_1) = \frac{221}{1209} = 0.183 \qquad P(C_2) = \frac{988}{1209} = 0.817$$
Example (cont'd)
• Determine the class-conditional probabilities (likelihoods) p(x/Ci)
– Discretize car height into bins and use a normalized histogram
[Figure: normalized histograms of p(x/Ci) over the height bins]
Example (cont'd)
• Calculate the posterior probability for each bin, e.g., for x = 1.0:

$$P(C_1/x=1.0) = \frac{p(x=1.0/C_1)\, P(C_1)}{p(x=1.0/C_1)\, P(C_1) + p(x=1.0/C_2)\, P(C_2)} = \frac{0.2081 \times 0.183}{0.2081 \times 0.183 + 0.0597 \times 0.817} = 0.438$$

[Figure: posterior probabilities P(Ci/x) over the height bins]
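A sketch of this whole pipeline follows, using synthetic heights; only the counts 221 and 988 out of 1209 come from the slides, while the height distributions and bin edges are assumptions for illustration.

```python
import numpy as np

# Synthetic car heights (meters); in the example these come from the survey.
rng = np.random.default_rng(0)
h_c1 = rng.normal(1.4, 0.2, 221)                 # cars over $50K  (C1)
h_c2 = rng.normal(1.1, 0.2, 988)                 # cars under $50K (C2)

P_c1, P_c2 = 221 / 1209, 988 / 1209              # priors from sample counts

bins = np.linspace(0.5, 2.0, 16)                 # assumed height bins
lik_c1, _ = np.histogram(h_c1, bins=bins, density=True)   # p(x/C1)
lik_c2, _ = np.histogram(h_c2, bins=bins, density=True)   # p(x/C2)

def posterior_c1(x):
    """P(C1/x) for the bin containing height x."""
    i = np.clip(np.digitize(x, bins) - 1, 0, len(lik_c1) - 1)
    num = lik_c1[i] * P_c1
    den = num + lik_c2[i] * P_c2                 # evidence p(x)
    return num / den if den > 0 else 0.0

print(posterior_c1(1.0))
```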
A More General Theory
• Use more than one feature.
• Allow more than two categories.
• Allow actions other than classifying the input into one of the possible categories (e.g., rejection).
• Employ a more general error function (i.e., a "risk" function) by associating a "cost" (a "loss" function) with each error (i.e., each wrong action).
Terminology
• Features form a vector $\mathbf{x} \in R^d$
• A finite set of c categories ω1, ω2, …, ωc
• Bayes rule (i.e., using vector notation):

$$P(\omega_j/\mathbf{x}) = \frac{p(\mathbf{x}/\omega_j)\, P(\omega_j)}{p(\mathbf{x})}, \quad \text{where } p(\mathbf{x}) = \sum_{j=1}^{c} p(\mathbf{x}/\omega_j)\, P(\omega_j)$$

• A finite set of l actions α1, α2, …, αl
• A loss function λ(αi/ωj):
– the cost associated with taking action αi when the correct classification category is ωj
Conditional Risk (or Expected Loss)
• Suppose we observe x and take action αi.
• Suppose that the cost associated with taking action αi when ωj is the correct category is λ(αi/ωj).
• The conditional risk (or expected loss) of taking action αi is:

$$R(\alpha_i/\mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i/\omega_j)\, P(\omega_j/\mathbf{x})$$
Overall Risk
• Suppose α(x) is a general decision rule that determines which of the actions α1, α2, …, αl to take for every x; then the overall risk is defined as:

$$R = \int R(\alpha(\mathbf{x})/\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}$$

• The optimum decision rule is the Bayes rule.
Overall Risk (cont'd)
• The Bayes decision rule minimizes R by:
(i) Computing R(αi/x) for every αi given an x
(ii) Choosing the action αi with the minimum R(αi/x)
• The resulting minimum overall risk is called the Bayes risk and is the best (i.e., optimum) performance that can be achieved:

$$R^* = \min R$$
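A minimal sketch of steps (i) and (ii): given the posteriors for one x, compute each action's conditional risk and take the argmin. The loss matrix below, including a fixed-cost "reject" action, is an assumption for illustration.

```python
import numpy as np

# lam[i, j] = lambda(alpha_i/omega_j): cost of action alpha_i when the true
# category is omega_j (illustrative values; alpha_3 is a "reject" action).
lam = np.array([[0.0, 2.0],    # alpha_1: decide omega_1
                [1.0, 0.0],    # alpha_2: decide omega_2
                [0.3, 0.3]])   # alpha_3: reject

def bayes_action(post):
    """post: array [P(omega_1/x), P(omega_2/x)].
    R(alpha_i/x) = sum_j lam[i, j] * P(omega_j/x); return the argmin."""
    risks = lam @ post
    return int(np.argmin(risks)), risks

action, risks = bayes_action(np.array([0.55, 0.45]))
print(action, risks)  # near-even posteriors can make "reject" the cheapest action
```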
Example: Two-category classification
• Define:
– α1: decide ω1
– α2: decide ω2
– λij = λ(αi/ωj)
• The conditional risks are (c = 2):

$$R(\alpha_1/\mathbf{x}) = \lambda_{11} P(\omega_1/\mathbf{x}) + \lambda_{12} P(\omega_2/\mathbf{x})$$
$$R(\alpha_2/\mathbf{x}) = \lambda_{21} P(\omega_1/\mathbf{x}) + \lambda_{22} P(\omega_2/\mathbf{x})$$
Example: Two-category classification (cont'd)
• Minimum risk decision rule: decide ω1 if

$$R(\alpha_1/\mathbf{x}) < R(\alpha_2/\mathbf{x})$$

or

$$(\lambda_{21} - \lambda_{11})\, P(\omega_1/\mathbf{x}) > (\lambda_{12} - \lambda_{22})\, P(\omega_2/\mathbf{x})$$

or (i.e., using the likelihood ratio):

$$\underbrace{\frac{p(\mathbf{x}/\omega_1)}{p(\mathbf{x}/\omega_2)}}_{\text{likelihood ratio}} > \underbrace{\frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)}}_{\text{threshold}}$$
Special Case: Zero-One Loss Function
• Assign the same loss to all errors:

$$\lambda(\alpha_i/\omega_j) = \begin{cases} 0 & i = j \\ 1 & i \neq j \end{cases}$$

• The conditional risk corresponding to this loss function:

$$R(\alpha_i/\mathbf{x}) = \sum_{j \neq i} P(\omega_j/\mathbf{x}) = 1 - P(\omega_i/\mathbf{x})$$
Special Case: Zero-One Loss Function (cont'd)
• The decision rule becomes: decide ω1 if

$$P(\omega_1/\mathbf{x}) > P(\omega_2/\mathbf{x})$$

or

$$\frac{p(\mathbf{x}/\omega_1)}{p(\mathbf{x}/\omega_2)} > \frac{P(\omega_2)}{P(\omega_1)}$$

• In this case, the overall risk is the average probability of error!
Example
• Assuming zero-one loss:

Decide ω1 if $\frac{p(x/\omega_1)}{p(x/\omega_2)} > \theta_a$; otherwise decide ω2, where $\theta_a = \frac{P(\omega_2)}{P(\omega_1)}$

• Assuming general loss (assume λ12 > λ21):

Decide ω1 if $\frac{p(x/\omega_1)}{p(x/\omega_2)} > \theta_b$; otherwise decide ω2, where $\theta_b = \frac{P(\omega_2)(\lambda_{12} - \lambda_{22})}{P(\omega_1)(\lambda_{21} - \lambda_{11})}$

[Figure: likelihood ratio p(x/ω1)/p(x/ω2) with the thresholds θa and θb and the corresponding decision regions]
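A sketch of this likelihood-ratio test follows; the densities, priors, and loss values are assumptions chosen so that θb > θa, which shrinks the region where we decide ω1.

```python
import numpy as np
from scipy.stats import norm

p1, p2 = norm(0.0, 1.0), norm(2.0, 1.0)  # assumed p(x/omega_1), p(x/omega_2)
P1, P2 = 2/3, 1/3                        # assumed priors

theta_a = P2 / P1                        # zero-one loss threshold
lam = {"11": 0.0, "12": 4.0, "21": 1.0, "22": 0.0}  # assumed general losses
theta_b = P2 * (lam["12"] - lam["22"]) / (P1 * (lam["21"] - lam["11"]))

def decide(x, theta):
    ratio = p1.pdf(x) / p2.pdf(x)        # likelihood ratio
    return "omega_1" if ratio > theta else "omega_2"

x = 1.2
# A costlier lambda_12 raises the threshold, so a borderline x flips class:
print(decide(x, theta_a), decide(x, theta_b))
```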
Discriminant Functions
• A useful way to represent classifiers is through discriminant functions gi(x), i = 1, …, c, where a feature vector x is assigned to class ωi if:

gi(x) > gj(x) for all j ≠ i
Discriminants for the Bayes Classifier
• Assuming a general loss function:

gi(x) = −R(αi/x)

• Assuming the zero-one loss function:

gi(x) = P(ωi/x)
Discriminants for the Bayes Classifier (cont'd)
• Is the choice of gi unique?
– Replacing gi(x) with f(gi(x)), where f(·) is monotonically increasing, does not change the classification results.

$$g_i(\mathbf{x}) = P(\omega_i/\mathbf{x}) = \frac{p(\mathbf{x}/\omega_i)\, P(\omega_i)}{p(\mathbf{x})}$$
$$g_i(\mathbf{x}) = p(\mathbf{x}/\omega_i)\, P(\omega_i)$$
$$g_i(\mathbf{x}) = \ln p(\mathbf{x}/\omega_i) + \ln P(\omega_i)$$

we'll use this last form extensively!
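A quick numeric check of this equivalence, under assumed Gaussian likelihoods and priors: the three forms above differ only by a common scale factor or a monotone transform, so they share the same argmax.

```python
import numpy as np
from scipy.stats import norm

x = 5.0
priors = np.array([0.6, 0.4])                            # assumed P(omega_i)
liks = np.array([norm(4, 1).pdf(x), norm(7, 2).pdf(x)])  # assumed p(x/omega_i)

g_posterior = liks * priors / np.sum(liks * priors)  # g_i = P(omega_i/x)
g_joint     = liks * priors                          # g_i = p(x/omega_i) P(omega_i)
g_log       = np.log(liks) + np.log(priors)          # g_i = ln p + ln P

# Monotone transforms preserve the argmax, hence the decision.
assert np.argmax(g_posterior) == np.argmax(g_joint) == np.argmax(g_log)
```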
Case of two categories
• It is more common to use a single discriminant function (dichotomizer) instead of two:

Decide ω1 if g(x) > 0; otherwise decide ω2

• Examples:

$$g(\mathbf{x}) = P(\omega_1/\mathbf{x}) - P(\omega_2/\mathbf{x})$$
$$g(\mathbf{x}) = \ln \frac{p(\mathbf{x}/\omega_1)}{p(\mathbf{x}/\omega_2)} + \ln \frac{P(\omega_1)}{P(\omega_2)}$$
Decision Regions and Boundaries
• Decision rules divide the feature space into decision regions R1, R2, …, Rc, separated by decision boundaries.
• A decision boundary is defined by:

g1(x) = g2(x)
Discriminant Function for the Multivariate Gaussian Density
• Consider the following discriminant function:

$$g_i(\mathbf{x}) = \ln p(\mathbf{x}/\omega_i) + \ln P(\omega_i), \quad \text{with } p(\mathbf{x}/\omega_i) \sim N(\mu_i, \Sigma_i)$$
Multivariate Gaussian Density: Case I
• Σi = σ²I
– Features are statistically independent
– Each feature has the same variance
• The discriminant reduces to:

$$g_i(\mathbf{x}) = -\frac{\|\mathbf{x} - \mu_i\|^2}{2\sigma^2} + \ln P(\omega_i)$$

(the ln P(ωi) term favours the a-priori more likely category)
Multivariate Gaussian Density: Case I (cont'd)
• Expanding gi(x) yields a linear discriminant:

$$g_i(\mathbf{x}) = \mathbf{w}_i^T \mathbf{x} + w_{i0}, \quad \text{where } \mathbf{w}_i = \frac{\mu_i}{\sigma^2}, \quad w_{i0} = -\frac{\mu_i^T \mu_i}{2\sigma^2} + \ln P(\omega_i)$$
Multivariate Gaussian Density: Case I (cont'd)
• Properties of the decision boundary:
– It passes through the point

$$\mathbf{x}_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2} \ln \frac{P(\omega_i)}{P(\omega_j)}\, (\mu_i - \mu_j)$$

– It is orthogonal to the line linking the means.
– What happens when P(ωi) = P(ωj)? Then x0 is the midpoint of the means.
– If P(ωi) ≠ P(ωj), then x0 shifts away from the most likely category.
– If σ is very small, the position of the boundary is insensitive to P(ωi) and P(ωj).
Multivariate Gaussian Density: Case I (cont'd)
[Figures: 1-D, 2-D, and 3-D examples; if P(ωi) ≠ P(ωj), then x0 shifts away from the most likely category]
Multivariate Gaussian Density: Case I (cont'd)
• Minimum distance classifier
– When the P(ωi) are equal:

$$g_i(\mathbf{x}) = -\|\mathbf{x} - \mu_i\|^2$$

– i.e., assign x to the class with the maximum gi(x): the nearest mean.
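A minimal sketch of this minimum-distance rule, with two assumed class means:

```python
import numpy as np

means = np.array([[0.0, 0.0],    # assumed mu_1
                  [3.0, 1.0]])   # assumed mu_2

def classify_nearest_mean(x):
    """Case I, equal priors: g_i(x) = -||x - mu_i||^2; pick the max,
    i.e., the nearest class mean."""
    g = -np.sum((means - x) ** 2, axis=1)
    return int(np.argmax(g))

print(classify_nearest_mean(np.array([2.5, 0.5])))  # -> 1 (closer to mu_2)
```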
Multivariate Gaussian Density: Case II
• Σi = Σ (all classes share the same covariance). The discriminant becomes:

$$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x} - \mu_i)^T \Sigma^{-1} (\mathbf{x} - \mu_i) + \ln P(\omega_i)$$
Multivariate Gaussian Density: Case II (cont'd)
• Expanding gi(x) again yields a linear discriminant:

$$g_i(\mathbf{x}) = \mathbf{w}_i^T \mathbf{x} + w_{i0}, \quad \text{where } \mathbf{w}_i = \Sigma^{-1}\mu_i, \quad w_{i0} = -\frac{1}{2}\mu_i^T \Sigma^{-1} \mu_i + \ln P(\omega_i)$$

Multivariate Gaussian Density: Case II (cont'd)
• Properties of the hyperplane (decision boundary):
– It passes through x0.
– It is not orthogonal to the line linking the means.
– What happens when P(ωi) = P(ωj)? Then x0 is the midpoint of the means.
– If P(ωi) ≠ P(ωj), then x0 shifts away from the most likely category.
Multivariate Gaussian Density: Case II (cont'd)
[Figures: examples; if P(ωi) ≠ P(ωj), then x0 shifts away from the most likely category]
Multivariate Gaussian Density: Case II (cont'd)
• Mahalanobis distance classifier
– When the P(ωi) are equal:

$$g_i(\mathbf{x}) = -(\mathbf{x} - \mu_i)^T \Sigma^{-1} (\mathbf{x} - \mu_i)$$

– i.e., assign x to the class with the maximum gi(x): the mean with the minimum Mahalanobis distance.
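The Case II analogue of the sketch above, replacing squared Euclidean distance with Mahalanobis distance under an assumed shared covariance:

```python
import numpy as np

means = np.array([[0.0, 0.0], [3.0, 1.0]])   # assumed class means
sigma = np.array([[2.0, 0.5], [0.5, 1.0]])   # assumed shared covariance
sigma_inv = np.linalg.inv(sigma)

def classify_mahalanobis(x):
    """Case II, equal priors: g_i(x) = -(x - mu_i)^T Sigma^{-1} (x - mu_i)."""
    diffs = x - means
    g = -np.einsum('ij,jk,ik->i', diffs, sigma_inv, diffs)  # quadratic forms
    return int(np.argmax(g))

print(classify_mahalanobis(np.array([1.0, 1.0])))
```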
Multivariate Gaussian Density: Case III
• Σi = arbitrary. The discriminant is quadratic in x:

$$g_i(\mathbf{x}) = \mathbf{x}^T W_i \mathbf{x} + \mathbf{w}_i^T \mathbf{x} + w_{i0}, \quad \text{where } W_i = -\frac{1}{2}\Sigma_i^{-1}, \quad \mathbf{w}_i = \Sigma_i^{-1}\mu_i, \quad w_{i0} = -\frac{1}{2}\mu_i^T \Sigma_i^{-1}\mu_i - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$$

• The decision boundaries are hyperquadrics: e.g., hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, etc.
Example - Case III
• P(ω1) = P(ω2)
[Figure: quadratic decision boundary; the boundary does not pass through the midpoint of μ1, μ2]
Multivariate Gaussian Density: Case III (cont'd)
[Figure: non-linear decision boundaries]
Multivariate Gaussian Density: Case III (cont'd)
• More examples
[Figure: further examples of hyperquadric decision boundaries]
Error Bounds
• Exact error calculations can be difficult; it is easier to estimate error bounds! Start from

$$P(error) = \int P(error/x)\, p(x)\, dx, \quad \text{where } P(error/x) = \min[P(\omega_1/x), P(\omega_2/x)]$$
Error Bounds (cont'd)
• If the class-conditional distributions are Gaussian, then

$$P(error) \le P^{\beta}(\omega_1)\, P^{1-\beta}(\omega_2)\, e^{-k(\beta)}, \quad 0 \le \beta \le 1$$

where:

$$k(\beta) = \frac{\beta(1-\beta)}{2} (\mu_2 - \mu_1)^T \left[\beta \Sigma_1 + (1-\beta)\Sigma_2\right]^{-1} (\mu_2 - \mu_1) + \frac{1}{2} \ln \frac{\left|\beta \Sigma_1 + (1-\beta)\Sigma_2\right|}{|\Sigma_1|^{\beta}\, |\Sigma_2|^{1-\beta}}$$
Error Bounds (cont'd)
• The Chernoff bound corresponds to the β that minimizes e^{−k(β)}.
– This is a 1-D optimization problem, regardless of the dimensionality of the class-conditional densities.
[Figure: e^{−k(β)} vs β; the minimizer gives the tight Chernoff bound, other values of β give looser bounds]
Error Bounds (cont'd)
• Bhattacharyya bound
– Approximate the error bound using β = 0.5.
– Easier to compute than the Chernoff bound, but looser.
• The Chernoff and Bhattacharyya bounds will not be good bounds if the distributions are not Gaussian.
Example
• Bhattacharyya error bound (β = 0.5): k(0.5) = 4.06, giving

$$P(error) \le 0.0087$$
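A sketch of this computation from the Gaussian k(β) formula above; the means, covariances, and priors are assumed example values, not the ones behind the slide's k(0.5) = 4.06.

```python
import numpy as np

def k_beta(beta, mu1, mu2, s1, s2):
    """k(beta) for Gaussian class-conditional densities."""
    d = mu2 - mu1
    s = beta * s1 + (1 - beta) * s2
    quad = 0.5 * beta * (1 - beta) * d @ np.linalg.solve(s, d)
    logdet = 0.5 * np.log(np.linalg.det(s) /
                          (np.linalg.det(s1) ** beta * np.linalg.det(s2) ** (1 - beta)))
    return quad + logdet

mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 3.0])   # assumed means
s1, s2 = np.eye(2), np.array([[2.0, 0.5], [0.5, 2.0]])  # assumed covariances
P1 = P2 = 0.5                                           # assumed priors

k = k_beta(0.5, mu1, mu2, s1, s2)                       # Bhattacharyya: beta = 0.5
print("P(error) <=", P1 ** 0.5 * P2 ** 0.5 * np.exp(-k))
```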
Receiver Operating Characteristic (ROC) Curve
• Every classifier employs some kind of threshold.
• Changing the threshold affects the performance of the system.
• ROC curves can help us evaluate system performance for different thresholds.
e.g., the likelihood-ratio thresholds from the earlier example:

$$\theta_a = \frac{P(\omega_2)}{P(\omega_1)} \qquad \theta_b = \frac{P(\omega_2)(\lambda_{12} - \lambda_{22})}{P(\omega_1)(\lambda_{21} - \lambda_{11})}$$
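A sketch of how an ROC curve is traced by sweeping such a threshold, assuming Gaussian score distributions for the two classes:

```python
import numpy as np
from scipy.stats import norm

impostor, authentic = norm(0.0, 1.0), norm(2.0, 1.0)  # assumed score models

thresholds = np.linspace(-4.0, 6.0, 200)
fpr = impostor.sf(thresholds)    # P(score > t / I): false positive rate
tpr = authentic.sf(thresholds)   # P(score > t / A): true positive rate

# Each threshold gives one (FPR, TPR) operating point on the ROC curve.
for t in (0.0, 1.0, 2.0):
    i = int(np.argmin(np.abs(thresholds - t)))
    print(f"t={t:.1f}  FPR={fpr[i]:.3f}  TPR={tpr[i]:.3f}")
```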
Example: Person Authentication
• Authenticate a person using biometrics (e.g., fingerprints).
• There are two possible distributions (i.e., classes):
– Authentic (A) and Impostor (I)
[Figure: score distributions for I and A]
Example: Person Authentication (cont'd)
• Possible decisions:
– (1) correct acceptance (true positive): X belongs to A, and we decide A
– (2) incorrect acceptance (false positive): X belongs to I, and we decide A
– (3) correct rejection (true negative): X belongs to I, and we decide I
– (4) incorrect rejection (false negative): X belongs to A, and we decide I
[Figure: overlapping I and A score distributions, with regions labelled false positive, correct acceptance, correct rejection, and false negative]
Error vs Threshold
[Figure: error rates as a function of the decision threshold]
ROC
[Figure: ROC curve]
False Negatives vs Positives
[Figure: false negative rate vs false positive rate]
Next Lecture
• Linear Classification Methods
– Hastie et al., Chapter 4
• The paper list will be available by the weekend
– Bidding starts on Monday
Bayes Decision Theory: Case of Discrete Features
• Replace $\int p(\mathbf{x}/\omega_j)\, d\mathbf{x}$ with $\sum_{\mathbf{x}} P(\mathbf{x}/\omega_j)$
• See Section 2.9
Missing Features
• Consider a Bayes classifier trained on uncorrupted data.
• Suppose x = (x1, x2) is a test vector where x1 is missing and the measured value of x2 is x̂2; how can we classify it?
– If we set x1 equal to its average value, we will classify x as ω3
– But p(x̂2/ω2) is larger; maybe we should classify x as ω2?
Missing Features (cont'd)
• Suppose x = [xg, xb] (xg: good features, xb: bad/missing features)
• Derive the Bayes rule using the good features:

$$P(\omega_i/\mathbf{x}_g) = \frac{\int P(\omega_i/\mathbf{x}_g, \mathbf{x}_b)\, p(\mathbf{x}_g, \mathbf{x}_b)\, d\mathbf{x}_b}{\int p(\mathbf{x}_g, \mathbf{x}_b)\, d\mathbf{x}_b}$$

• i.e., marginalize the posterior probability over the bad features.
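A discretized sketch of this marginalization, with assumed joint Gaussian class-conditional densities over (xg, xb); the integral over the missing feature is approximated numerically on a grid.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

# Assumed priors and joint class-conditional densities over (x_g, x_b).
classes = {
    "omega_1": (0.5, mvn(mean=[0.0, 0.0], cov=np.eye(2))),
    "omega_2": (0.5, mvn(mean=[2.0, 2.0], cov=np.eye(2))),
}

def posterior_good_only(x_g, grid=np.linspace(-6.0, 8.0, 400)):
    """P(omega_i/x_g) proportional to P(omega_i) * integral of
    p(x_g, x_b/omega_i) over the bad feature x_b."""
    scores = {}
    for name, (prior, density) in classes.items():
        pts = np.column_stack([np.full_like(grid, x_g), grid])
        scores[name] = prior * np.trapz(density.pdf(pts), grid)
    z = sum(scores.values())
    return {name: s / z for name, s in scores.items()}

print(posterior_good_only(1.5))
```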
Compound Bayesian Decision Theory
• Sequential decision:
(1) Decide as each fish emerges.
• Compound decision:
(1) Wait for n fish to emerge.
(2) Make all n decisions jointly.
– Could improve performance when consecutive states of nature are not statistically independent.
Compound Bayesian Decision Theory (cont'd)
• Suppose Ω = (ω(1), ω(2), …, ω(n)) denotes the n states of nature, where each ω(i) can take one of c values ω1, ω2, …, ωc (i.e., c categories).
• Suppose P(Ω) is the prior probability of the n states of nature.
• Suppose X = (x1, x2, …, xn) are the n observed vectors.
Compound Bayesian Decision Theory (cont'd)
• The compound posterior is

$$P(\Omega/X) = \frac{p(X/\Omega)\, P(\Omega)}{p(X)}$$

i.e., consecutive states of nature may not be statistically independent! Assuming the observations are conditionally independent given the states, $p(X/\Omega) = \prod_{i=1}^{n} p(\mathbf{x}_i/\omega(i))$ is acceptable!
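A small sketch of a compound decision under these assumptions: a made-up Markov prior over state sequences (so consecutive states are dependent) combined with per-observation likelihoods, scoring every sequence jointly.

```python
import numpy as np
from itertools import product
from scipy.stats import norm

# Assumed setup: c = 2 categories, n = 3 observations.
lik = [norm(0.0, 1.0), norm(2.0, 1.0)]   # p(x/omega_1), p(x/omega_2)
init = np.array([0.5, 0.5])              # P(omega(1))
trans = np.array([[0.9, 0.1],            # P(omega(i+1)/omega(i)):
                  [0.1, 0.9]])           # states tend to repeat
X = [0.2, 1.8, 2.1]                      # assumed observations

scores = {}
for states in product(range(2), repeat=len(X)):
    p = init[states[0]]
    for a, b in zip(states, states[1:]):
        p *= trans[a, b]                 # prior P(Omega) (Markov, not i.i.d.)
    for x, s in zip(X, states):
        p *= lik[s].pdf(x)               # p(X/Omega) = prod_i p(x_i/omega(i))
    scores[states] = p

z = sum(scores.values())
best = max(scores, key=scores.get)
print(best, scores[best] / z)            # jointly most probable sequence
```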