TRANSCRIPT
Introduction to Advanced Probability for Graphical Models
CSC412, by Elliot Creager
Thursday, January 11, 2018
Presented by Jonathan Lorraine
*Many slides based on Kaustav Kundu's, Kevin Swersky's, Inmar Givoni's, Danny Tarlow's, and Jasper Snoek's slides, Sam Roweis's review of probability, Bishop's book, and some images from Wikipedia
Outline
• Basics
• Probability rules
• Exponential family models
• Maximum likelihood
• Conjugate Bayesian inference (time permitting)
Why Represent Uncertainty?
• The world is full of uncertainty
  – "What will the weather be like today?"
  – "Will I like this movie?"
  – "Is there a person in this image?"
• We're trying to build systems that understand and (possibly) interact with the real world
• We often can't prove something is true, but we can still ask how likely different outcomes are, or ask for the most likely explanation
• Sometimes probability gives a concise description of an otherwise complex phenomenon.
Why Use Probability to Represent Uncertainty?
• Write down simple, reasonable criteria that you'd want from a system of uncertainty (common-sense stuff), and you always get probability.
• Cox Axioms (Cox 1946); see Bishop, Section 1.2.3
• We will restrict ourselves to a relatively informal discussion of probability theory.
Notation
• A random variable X represents outcomes or states of the world.
• We will write p(x) to mean Probability(X = x)
• Sample space: the space of all possible outcomes (may be discrete, continuous, or mixed)
• p(x) is the probability mass (density) function
  – Assigns a number to each point in sample space
  – Non-negative, sums (integrates) to 1
  – Intuitively: how often does x occur, how much do we believe in x.
Joint Probability Distribution
• Prob(X = x, Y = y)
  – "Probability of X = x and Y = y"
  – p(x, y)

Conditional Probability Distribution
• Prob(X = x | Y = y)
  – "Probability of X = x given Y = y"
  – p(x|y) = p(x, y) / p(y)

Marginal Probability Distribution
• Prob(X = x), Prob(Y = y)
  – "Probability of X = x"
  – p(x) = \sum_y p(x, y) = \sum_y p(x|y) p(y)
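As a quick numeric sketch of these three definitions (the 2×2 joint table is made up for illustration):

```python
import numpy as np

# Hypothetical 2x2 joint distribution p(x, y): binary X (rows) and Y (columns).
joint = np.array([[0.1, 0.3],
                  [0.2, 0.4]])

# Marginal: p(x) = sum_y p(x, y)
p_x = joint.sum(axis=1)

# Marginal: p(y) = sum_x p(x, y)
p_y = joint.sum(axis=0)

# Conditional: p(x|y) = p(x, y) / p(y); column j holds p(x | y=j)
p_x_given_y = joint / p_y
```

Each column of `p_x_given_y` is itself a distribution over x, so the columns sum to 1.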
The Rules of Probability
• Sum Rule (marginalization / summing out):
  p(x) = \sum_y p(x, y)
  p(x_1) = \sum_{x_2} \cdots \sum_{x_N} p(x_1, x_2, \ldots, x_N)
• Product/Chain Rule:
  p(x, y) = p(y|x) p(x)
  p(x_1, \ldots, x_N) = p(x_1) p(x_2|x_1) \cdots p(x_N|x_1, \ldots, x_{N-1})
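A minimal numeric check of the sum and product rules together (the joint table is randomly generated, so the numbers themselves are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
# Arbitrary joint over two discrete variables with 3 and 4 states.
joint = rng.random((3, 4))
joint /= joint.sum()

p_x = joint.sum(axis=1, keepdims=True)   # sum rule: p(x)
p_y_given_x = joint / p_x                # product rule rearranged: p(y|x)

# Product rule reassembles the joint exactly: p(x, y) = p(x) p(y|x)
reconstructed = p_x * p_y_given_x
```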
Bayes' Rule
• One of the most important formulas in probability theory
• This gives us a way of "reversing" conditional probabilities
• Read as "Posterior = likelihood * prior / evidence"
  p(x|y) = \frac{p(y|x) p(x)}{p(y)} = \frac{p(y|x) p(x)}{\sum_{x'} p(y|x') p(x')}
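A tiny worked example of the formula (prior and likelihood numbers are made up):

```python
import numpy as np

prior = np.array([0.5, 0.3, 0.2])        # p(x) over 3 hypothetical states
likelihood = np.array([0.1, 0.7, 0.4])   # p(y|x) for one observed y

# Evidence: p(y) = sum_x' p(y|x') p(x')
evidence = (likelihood * prior).sum()

# Bayes' rule: posterior = likelihood * prior / evidence
posterior = likelihood * prior / evidence
```

Note how the observation "reverses" the ranking: state 1 had a smaller prior than state 0, but its larger likelihood makes it the posterior mode.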
Independence
• Two random variables are said to be independent iff their joint distribution factors:
  p(x, y) = p(y|x) p(x) = p(x|y) p(y) = p(x) p(y)
• Two random variables are conditionally independent given a third if they are independent after conditioning on the third:
  p(x, y|z) = p(x|y, z) p(y|z) = p(x|z) p(y|z) \quad \forall z
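A minimal sketch of the factorization condition: a joint built as the outer product of its marginals is independent by construction, so recomputing the marginals and re-multiplying recovers it exactly (marginal values are made up):

```python
import numpy as np

p_x = np.array([0.4, 0.6])
p_y = np.array([0.25, 0.75])

# Independent joint: p(x, y) = p(x) p(y)
joint = np.outer(p_x, p_y)

# Independence check: the joint equals the product of its own marginals.
factored = np.outer(joint.sum(axis=1), joint.sum(axis=0))
```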
Continuous Random Variables
• Outcomes are real values. Probability density functions define distributions.
  – E.g., p(x|\mu, \sigma) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left\{-\frac{1}{2\sigma^2}(x - \mu)^2\right\}
• Continuous joint distributions: replace sums with integrals, and everything holds
  – E.g., marginalization and conditional probability:
  P(x, z) = \int_y P(x, y, z) \, dy = \int_y P(x, z|y) P(y) \, dy
Summarizing Probability Distributions
• It is often useful to give summaries of distributions without defining the whole distribution (e.g., mean and variance)
• Mean:
  E[x] = \bar{x} = \int x \, p(x) \, dx
• Variance:
  \mathrm{var}(x) = \int (x - E[x])^2 \, p(x) \, dx = E[x^2] - E[x]^2
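The two variance formulas can be checked against each other numerically; a sketch using Riemann sums on a Gaussian density with assumed parameters μ = 2, σ = 1.5:

```python
import numpy as np

mu, sigma = 2.0, 1.5
x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 200_001)
dx = x[1] - x[0]
p = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

mean = np.sum(x * p) * dx                     # E[x] = int x p(x) dx
var = np.sum((x - mean) ** 2 * p) * dx        # int (x - E[x])^2 p(x) dx
second_moment = np.sum(x ** 2 * p) * dx       # E[x^2], for the second formula
```

Both `var` and `second_moment - mean**2` recover σ² = 2.25 up to discretization error.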
Exponential Family
• Family of probability distributions
• Many of the standard distributions belong to this family
  – Bernoulli, binomial/multinomial, Poisson, Normal (Gaussian), beta/Dirichlet, …
• Share many important properties
  – E.g., they have a conjugate prior (we'll get to that later; important for Bayesian statistics)
Definition
• The exponential family of distributions over x, given parameter η (eta), is the set of distributions of the form
  p(x|\eta) = h(x) \, g(\eta) \exp\{\eta^T u(x)\}
• x – scalar/vector, discrete/continuous
• η – "natural parameters"
• u(x) – some function of x (sufficient statistic)
• g(η) – normalizer:
  g(\eta) \int h(x) \exp\{\eta^T u(x)\} \, dx = 1
• h(x) – base measure (often constant)
Sufficient Statistics
• Vague definition: called so because they completely summarize a distribution.
• Less vague: they are the only part of the distribution that interacts with the parameters, and are therefore sufficient to estimate the parameters.
• Perhaps the number of times a coin came up heads, or the sum of the values' magnitudes.
Example 1: Bernoulli
• Binary random variable: X \in \{0, 1\}, \mu \in [0, 1]
• p(heads) = μ
• Coin toss
  p(x|\mu) = \mu^x (1 - \mu)^{1-x}
Example 1: Bernoulli
  p(x|\mu) = \mu^x (1 - \mu)^{1-x}
           = \exp\{x \ln \mu + (1 - x) \ln(1 - \mu)\}
           = (1 - \mu) \exp\left\{x \ln \frac{\mu}{1 - \mu}\right\}
• Matching this to p(x|\eta) = h(x) \, g(\eta) \exp\{\eta^T u(x)\}:
  \eta = \ln \frac{\mu}{1 - \mu}, \qquad \mu = \sigma(\eta) = \frac{1}{1 + e^{-\eta}}
  h(x) = 1, \qquad u(x) = x, \qquad g(\eta) = \sigma(-\eta) = 1 - \sigma(\eta)
• So
  p(x|\eta) = \sigma(-\eta) \exp(\eta x)
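A quick numeric check (function names are mine) that the natural-parameter form reproduces the standard Bernoulli pmf for an assumed μ = 0.3:

```python
import numpy as np

def bernoulli_standard(x, mu):
    # p(x|mu) = mu^x (1 - mu)^(1-x)
    return mu ** x * (1 - mu) ** (1 - x)

def bernoulli_natural(x, eta):
    # Exponential-family form: p(x|eta) = sigma(-eta) exp(eta x)
    return 1.0 / (1.0 + np.exp(eta)) * np.exp(eta * x)

mu = 0.3
eta = np.log(mu / (1 - mu))   # natural parameter eta = ln(mu / (1 - mu))
```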
Example 2: Multinomial
• p(value k) = \mu_k, with \mu_k \in [0, 1] and \sum_{k=1}^M \mu_k = 1
• For a single observation
  – die toss
  – Sometimes called Categorical
• For multiple observations
  – integer counts on N trials
  – Prob(1 came out 3 times, 2 came out once, …, 6 came out 7 times if I tossed a die 20 times)
  P(x_1, \ldots, x_M|\mu) = \frac{N!}{x_1! \cdots x_M!} \prod_{k=1}^M \mu_k^{x_k}, \qquad \sum_{k=1}^M x_k = N
Example 2: Multinomial (1 observation)
• Here x is a one-hot vector:
  P(x_1, \ldots, x_M|\mu) = \prod_{k=1}^M \mu_k^{x_k} = \exp\left\{\sum_{k=1}^M x_k \ln \mu_k\right\}
• Matching this to p(x|\eta) = h(x) \, g(\eta) \exp\{\eta^T u(x)\}:
  h(x) = 1, \qquad u(x) = x, \qquad g(\eta) = 1, \qquad \eta_k = \ln \mu_k
• So
  p(x|\eta) = \exp(\eta^T x)
• Parameters are not independent due to the constraint of summing to 1; there's a slightly more involved notation to address that, see Bishop 2.4
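A one-line numeric check of the single-observation form, with made-up categorical parameters and a one-hot x:

```python
import numpy as np

mu = np.array([0.2, 0.5, 0.3])      # categorical parameters, sum to 1
eta = np.log(mu)                    # natural parameters: eta_k = ln mu_k

x = np.array([0, 1, 0])             # one-hot observation: category 2 of 3
p_natural = np.exp(eta @ x)         # exponential-family form: exp(eta^T x)
p_standard = np.prod(mu ** x)       # standard form: prod_k mu_k^{x_k}
```

For a one-hot x, both expressions just pick out the probability of the observed category.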
Example 3: Normal (Gaussian) Distribution
• Gaussian (Normal):
  p(x|\mu, \sigma) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left\{-\frac{1}{2\sigma^2}(x - \mu)^2\right\}
Example 3: Normal (Gaussian) Distribution
• μ is the mean
• σ² is the variance
• Can verify these by computing integrals. E.g.,
  \int_{-\infty}^{\infty} x \, \frac{1}{\sqrt{2\pi}\sigma} \exp\left\{-\frac{1}{2\sigma^2}(x - \mu)^2\right\} dx = \mu
Example 3: Normal (Gaussian) Distribution
• Multivariate Gaussian:
  P(x|\mu, \Sigma) = |2\pi\Sigma|^{-1/2} \exp\left\{-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right\}
• x is now a vector
• μ is the mean vector
• Σ is the covariance matrix
Important Properties of Gaussians
• All marginals of a Gaussian are again Gaussian
• Any conditional of a Gaussian is Gaussian
• The product of two Gaussians is again Gaussian (up to normalization)
• Even the sum of two independent Gaussian RVs is a Gaussian.
• Beyond the scope of this tutorial, but very important: marginalization and conditioning rules for multivariate Gaussians.
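The closure under sums of independent RVs can be checked by sampling; a sketch with made-up parameters, where means and variances should add:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(loc=1.0, scale=2.0, size=1_000_000)   # samples from N(1, 4)
b = rng.normal(loc=3.0, scale=1.0, size=1_000_000)   # samples from N(3, 1)

# Sum of independent Gaussians: should behave like N(1 + 3, 4 + 1) = N(4, 5).
s = a + b
```

This only checks the first two moments, but for Gaussians those pin down the whole distribution.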
Exponential Family Representation
  p(x|\mu, \sigma) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{-\frac{1}{2\sigma^2}(x - \mu)^2\right\}
  = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{-\frac{1}{2\sigma^2}x^2 + \frac{\mu}{\sigma^2}x - \frac{1}{2\sigma^2}\mu^2\right\}
  = (2\pi)^{-1/2} (-2\eta_2)^{1/2} \exp\left(\frac{\eta_1^2}{4\eta_2}\right) \exp\{\eta_1 x + \eta_2 x^2\}
• Matching this to p(x|\eta) = h(x) \, g(\eta) \exp\{\eta^T u(x)\}:
  h(x) = (2\pi)^{-1/2}, \qquad g(\eta) = (-2\eta_2)^{1/2} \exp\left(\frac{\eta_1^2}{4\eta_2}\right)
  u(x) = (x, x^2)^T, \qquad \eta = (\eta_1, \eta_2)^T = \left(\frac{\mu}{\sigma^2}, -\frac{1}{2\sigma^2}\right)^T
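A numeric check (function names are mine) that the natural-parameter form above matches the standard Gaussian density, for assumed μ = 1, σ = 2:

```python
import numpy as np

def gaussian_standard(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def gaussian_natural(x, eta1, eta2):
    # h(x) g(eta) exp{eta^T u(x)} with u(x) = (x, x^2)
    h = (2 * np.pi) ** -0.5
    g = np.sqrt(-2 * eta2) * np.exp(eta1 ** 2 / (4 * eta2))
    return h * g * np.exp(eta1 * x + eta2 * x ** 2)

mu, sigma = 1.0, 2.0
eta1, eta2 = mu / sigma ** 2, -1.0 / (2 * sigma ** 2)
```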
Example: Maximum Likelihood for a 1D Gaussian
• Suppose we are given a dataset of samples of a Gaussian random variable X, D = {x_1, …, x_N}, and told that the variance of the data is σ²
• What is our best guess of μ?
• *Need to assume data is independent and identically distributed (i.i.d.)
[Figure: data points x_1, x_2, …, x_N on a line]
Example: Maximum Likelihood for a 1D Gaussian
• What is our best guess of μ?
• We can write down the likelihood function:
  p(d|\mu) = \prod_{i=1}^N p(x_i|\mu, \sigma) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma} \exp\left\{-\frac{1}{2\sigma^2}(x_i - \mu)^2\right\}
• We want to choose the μ that maximizes this expression
  – Take log, then basic calculus: differentiate w.r.t. μ, set the derivative to 0, solve for μ to get the sample mean
  \mu_{ML} = \frac{1}{N} \sum_{i=1}^N x_i
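A sketch of this estimate on synthetic data (true mean 5, known σ = 2, all numbers made up), with a grid search over the log-likelihood as a sanity check that the sample mean really is the maximizer:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0
data = rng.normal(loc=5.0, scale=sigma, size=500)

# Closed form: the ML estimate of mu is the sample mean.
mu_ml = data.mean()

def log_lik(mu):
    # Log of the Gaussian likelihood with known sigma.
    return np.sum(-0.5 * ((data - mu) / sigma) ** 2
                  - np.log(np.sqrt(2 * np.pi) * sigma))

grid = np.linspace(3.0, 7.0, 4001)
mu_grid = grid[np.argmax([log_lik(m) for m in grid])]
```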
ML Estimation of Model Parameters for the Exponential Family
  p(D|\eta) = p(x_1, \ldots, x_N) = \left(\prod_n h(x_n)\right) g(\eta)^N \exp\left\{\eta^T \sum_n u(x_n)\right\}
  \frac{\partial \ln p(D|\eta)}{\partial \eta} = \ldots, \text{ set to 0, solve for } \eta:
  -\nabla \ln g(\eta_{ML}) = \frac{1}{N} \sum_{n=1}^N u(x_n)
• Can in principle be solved to get an estimate for η.
• The solution for the ML estimator depends on the data only through the sum over u, which is therefore called the sufficient statistic.
• What we need to store in order to estimate parameters.
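For the Bernoulli case from earlier, this general equation can be solved explicitly: with u(x) = x and g(η) = σ(−η), we get −d/dη ln g(η) = σ(η), so the ML condition is σ(η_ML) = (1/N) Σ_n x_n. A sketch with made-up coin flips:

```python
import numpy as np

data = np.array([1, 0, 1, 1, 0, 1, 0, 1])        # made-up flips
suff_stat = data.mean()                          # (1/N) sum_n u(x_n): all we store
eta_ml = np.log(suff_stat / (1 - suff_stat))     # invert sigma(eta) = suff_stat

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))
```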
Bayesian Probabilities
  p(\theta|d) = \frac{p(d|\theta) \, p(\theta)}{p(d)}, \qquad p(d) = \int p(d|\theta) \, p(\theta) \, d\theta
• p(d|\theta) is the likelihood function
• p(\theta) is the prior probability of (or our prior belief over) θ
  – our beliefs over what models are likely or not before seeing any data
• p(d) is the normalization constant or partition function
• p(\theta|d) is the posterior distribution
  – Readjustment of our prior beliefs in the face of data
Example: Bayesian Inference for a 1D Gaussian
• Suppose we have a prior belief that the mean of some random variable X is μ_0, and the variance of our belief is σ_0²
• We are then given a dataset of samples of X, d = {x_1, …, x_N}, and somehow know that the variance of the data is σ²
• What is the posterior distribution over (our belief about the value of) μ?
Example: Bayesian Inference for a 1D Gaussian
• Remember from earlier:
  p(\mu|d) = \frac{p(d|\mu) \, p(\mu)}{p(d)}
• p(d|\mu) is the likelihood function:
  p(d|\mu) = \prod_{i=1}^N P(x_i|\mu, \sigma) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma} \exp\left\{-\frac{1}{2\sigma^2}(x_i - \mu)^2\right\}
• p(\mu) is the prior probability of (or our prior belief over) μ:
  p(\mu|\mu_0, \sigma_0) = \frac{1}{\sqrt{2\pi}\sigma_0} \exp\left\{-\frac{1}{2\sigma_0^2}(\mu - \mu_0)^2\right\}
Example: Bayesian Inference for a 1D Gaussian
  p(\mu|D) = \frac{p(D|\mu) \, p(\mu)}{p(D)} \propto p(D|\mu) \, p(\mu), \qquad p(\mu|D) = \mathrm{Normal}(\mu|\mu_N, \sigma_N)
where
  \mu_N = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\mu_{ML}, \qquad \frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}
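A sketch of these closed-form updates (prior hyperparameters and data are made up). The posterior mean interpolates between the prior mean and the ML estimate, and the posterior precision is the sum of the prior and data precisions:

```python
import numpy as np

mu0, sigma0 = 0.0, 1.0        # prior belief: N(mu0, sigma0^2)
sigma = 2.0                   # known data standard deviation
data = np.array([1.8, 2.2, 1.9, 2.1, 2.0])
N, mu_ml = len(data), data.mean()

# Posterior mean: convex combination of prior mean and ML estimate.
mu_N = (sigma ** 2 / (N * sigma0 ** 2 + sigma ** 2)) * mu0 \
     + (N * sigma0 ** 2 / (N * sigma0 ** 2 + sigma ** 2)) * mu_ml

# Posterior variance: precisions add.
var_N = 1.0 / (1.0 / sigma0 ** 2 + N / sigma ** 2)
```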
Example: Bayesian Inference for a 1D Gaussian
[Figure: the prior belief, the maximum-likelihood fit to the data x_1, x_2, …, x_N, and the resulting posterior distribution with mean μ_N and standard deviation σ_N]
Conjugate Priors
• Notice in the Gaussian parameter-estimation example that the functional form of the posterior was that of the prior (Gaussian)
• Priors that lead to that form are called "conjugate priors"
• For any member of the exponential family there exists a conjugate prior that can be written like
  p(\eta|\chi, \nu) = f(\chi, \nu) \, g(\eta)^\nu \exp\{\nu \eta^T \chi\}
• Multiply by the likelihood to obtain a posterior (up to normalization) of the form
  p(\eta|D, \chi, \nu) \propto g(\eta)^{\nu + N} \exp\left\{\eta^T \left(\sum_{n=1}^N u(x_n) + \nu\chi\right)\right\}
• Notice the addition to the sufficient statistic
• ν is the effective number of pseudo-observations.
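A sketch of this update for the Bernoulli likelihood, written directly in the (χ, ν) hyperparameterization above (numbers made up): ν plays the role of pseudo-observations and χ their average sufficient statistic, so conjugate updating is just count bookkeeping.

```python
import numpy as np

nu, chi = 4.0, 0.5                  # prior: 4 pseudo-flips, half of them heads
data = np.array([1, 1, 1, 0, 1])    # observed flips; u(x) = x for Bernoulli

# Posterior hyperparameters: nu -> nu + N, and nu*chi -> nu*chi + sum_n u(x_n).
nu_post = nu + len(data)
chi_post = (nu * chi + data.sum()) / nu_post
```

For the Bernoulli this is the familiar Beta-style update: prior pseudo-counts plus observed counts, renormalized.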