Some Surprises in the Theory of Generalized Belief Propagation
Jonathan Yedidia Mitsubishi Electric Research Labs (MERL)
Collaborators: Bill Freeman (MIT)
Yair Weiss (Hebrew University)
See “Constructing Free Energy Approximations and Generalized Belief Propagation Algorithms,” MERL TR2004-040, to be published in IEEE Trans. Info. Theory.
Outline
• Introduction to GBP
– Quick Review of Standard BP
– Free Energies
– Region Graphs and Valid Approximations
• Some Surprises
• Maxent Normal Approximations & a useful heuristic
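Factor Graphs
[Figure: factor graph with factor nodes A, B, C and variable nodes 1, 2, 3, 4]
$p(X) = \frac{1}{Z} \prod_a f_a(X_a)$ (Kschischang et al., 2001)
$p(x_1, x_2, x_3, x_4) = \frac{1}{Z} f_A(x_1, x_2, x_3) f_B(x_2, x_3, x_4) f_C(x_4)$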
Computing Marginal Probabilities
$p_S(X_S) = \sum_{X \setminus X_S} p(X)$
Fundamental for
• Decoding error-correcting codes
• Inference in Bayesian networks
• Computer vision
• Statistical physics of magnets
Non-trivial because of the huge number of terms in the sum.
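To make the cost of the sum concrete, here is a minimal sketch that computes marginals by brute-force enumeration for a four-variable model of the form p(x1, x2, x3, x4) = f_A(x1, x2, x3) f_B(x2, x3, x4) f_C(x4) / Z. The variables are assumed binary, and the factor tables and function names are made up for illustration.

```python
import itertools

# Hypothetical positive factor tables for binary variables (values are made up).
def f_A(x1, x2, x3):
    return 1.0 + x1 + 2 * x2 * x3

def f_B(x2, x3, x4):
    return 2.0 if x2 == x4 else 0.5 + x3

def f_C(x4):
    return 3.0 if x4 == 1 else 1.0

def unnormalized(x):
    x1, x2, x3, x4 = x
    return f_A(x1, x2, x3) * f_B(x2, x3, x4) * f_C(x4)

# All 2**4 joint states; this list doubles in size with every added variable.
states = list(itertools.product([0, 1], repeat=4))
Z = sum(unnormalized(x) for x in states)

def marginal(index):
    # Marginal of one variable (0-based index): sum p(X) over all other variables.
    probs = [0.0, 0.0]
    for x in states:
        probs[x[index]] += unnormalized(x) / Z
    return probs

p4 = marginal(3)
print(len(states), p4)
```

With N binary variables the enumeration visits 2^N states, which is why this direct approach is only usable for very small models.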
Error-correcting Codes
[Figure: graph of an error-correcting code; "+" marks the check nodes]
Marginal Probabilities = A posteriori bit probabilities
(Gallager, 1963; Tanner, 1981)
Statistical Physics
Marginal Probabilities = local magnetization
Standard Belief Propagation
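“beliefs” b are written in terms of “messages” m:
$b_i(x_i) \propto \prod_{a \in N(i)} m_{a \to i}(x_i)$
$b_a(X_a) \propto f_a(X_a) \prod_{i \in N(a)} \prod_{b \in N(i) \setminus a} m_{b \to i}(x_i)$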
The “belief” is the BP approximation of the marginal probability.
BP Message-update Rules
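Using $b_i(x_i) = \sum_{X_a \setminus x_i} b_a(X_a)$, we get
$m_{a \to i}(x_i) = \sum_{X_a \setminus x_i} f_a(X_a) \prod_{j \in N(a) \setminus i} \prod_{b \in N(j) \setminus a} m_{b \to j}(x_j)$
[Figure: message from factor node a to variable node i]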
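The update rule m_{a→i}(x_i) = Σ_{X_a \ x_i} f_a(X_a) Π_{j∈N(a)\i} Π_{b∈N(j)\a} m_{b→j}(x_j) can be sketched in a few lines. The example below is an illustration under stated assumptions, not the authors' implementation: the small tree-structured factor graph, its tables, and all names are hypothetical, variables are binary, and all messages are updated in parallel.

```python
import itertools
import math

# Hypothetical tree factor graph over binary variables:
# factor name -> (tuple of neighboring variables, table function).
factors = {
    "A": ((1, 2), lambda x1, x2: 2.0 if x1 == x2 else 1.0),
    "B": ((2, 3), lambda x2, x3: 1.0 if x2 == x3 else 3.0),
    "C": ((3,), lambda x3: 0.7 if x3 == 0 else 0.3),
}
variables = sorted({v for vs, _ in factors.values() for v in vs})
# N_var[i]: the factor nodes neighboring variable i.
N_var = {i: [a for a, (vs, _) in factors.items() if i in vs] for i in variables}

# m[(a, i)][x_i]: message from factor a to variable i, initialized uniformly.
m = {(a, i): [1.0, 1.0] for a, (vs, _) in factors.items() for i in vs}

def update(a, i):
    # Sum over all states of X_a, weighting each by f_a and by the messages
    # arriving at the other variables j of a from factors b other than a.
    vs, f = factors[a]
    new = [0.0, 0.0]
    for X in itertools.product([0, 1], repeat=len(vs)):
        term = f(*X)
        for j, xj in zip(vs, X):
            if j == i:
                continue
            for b in N_var[j]:
                if b != a:
                    term *= m[(b, j)][xj]
        new[X[vs.index(i)]] += term
    s = sum(new)
    return [v / s for v in new]  # normalize for numerical stability

for _ in range(10):  # a few parallel sweeps; enough for this small tree
    m = {(a, i): update(a, i) for (a, i) in m}

def belief(i):
    # b_i(x_i) is proportional to the product of incoming messages.
    b = [math.prod(m[(a, i)][x] for a in N_var[i]) for x in (0, 1)]
    s = sum(b)
    return [v / s for v in b]

def exact(i):
    # Brute-force marginal, for comparison only.
    p = [0.0, 0.0]
    for X in itertools.product([0, 1], repeat=len(variables)):
        x = dict(zip(variables, X))
        p[x[i]] += math.prod(f(*(x[v] for v in vs)) for vs, f in factors.values())
    s = sum(p)
    return [v / s for v in p]
```

Because this example graph is a tree, `belief(i)` agrees with `exact(i)` for every variable; on graphs with cycles the beliefs would only be approximations.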
Variational (Gibbs) Free Energy
Kullback-Leibler Distance:
$D(b \| p) = \sum_X b(X) \ln \frac{b(X)}{p(X)}$
“Boltzmann’s Law” (serves to define the “energy”):
$p(X) = \frac{1}{Z} \prod_a f_a(X_a) = \frac{1}{Z} \exp[-E(X)]$
$D(b \| p) = \sum_X b(X) E(X) + \sum_X b(X) \ln b(X) + \ln Z = U(b) - H(b) + \ln Z$
Variational Free Energy $G(b) = U(b) - H(b)$ is minimized when $b(X) = p(X)$.
So we’ve replaced the intractable problem of computing marginal probabilities with the even more intractable problem of minimizing a variational free energy over the space of possible beliefs.
But the point is that now we can introduce interesting approximations.
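The relation between the variational free energy and the Kullback-Leibler distance, G(b) = U(b) − H(b) = −ln Z + D(b‖p), can be checked numerically on a tiny model. This is a minimal sketch; the energies are random made-up values and the function names are hypothetical.

```python
import itertools
import math
import random

# Hypothetical energies E(X) for three binary variables (made-up values).
states = list(itertools.product([0, 1], repeat=3))
random.seed(0)
E = {X: random.uniform(-1.0, 1.0) for X in states}

Z = sum(math.exp(-E[X]) for X in states)
p = {X: math.exp(-E[X]) / Z for X in states}  # Boltzmann's law

def gibbs_free_energy(b):
    U = sum(b[X] * E[X] for X in states)                          # average energy U(b)
    H = -sum(b[X] * math.log(b[X]) for X in states if b[X] > 0)   # entropy H(b)
    return U - H

def kl(b, q):
    return sum(b[X] * math.log(b[X] / q[X]) for X in states if b[X] > 0)

# A random trial belief b(X), normalized to sum to one.
w = [random.random() for _ in states]
b = {X: wi / sum(w) for X, wi in zip(states, w)}

# G(b) exceeds -ln Z by exactly D(b || p), which is never negative.
gap = gibbs_free_energy(b) - (-math.log(Z))
```

Here `gap` equals `kl(b, p)`, and `gibbs_free_energy(p)` equals `-math.log(Z)`, so the minimum is attained at b = p.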
Region-based Approximations to the Variational Free Energy
Exact: $G[b(X)]$ (intractable)
Regions: $G[\{b_r(X_r)\}]$ (Kikuchi, 1951)
Defining a “Region”
A region r is a set of variable nodes $V_r$ and factor nodes $F_r$ such that if a factor node a belongs to $F_r$, all variable nodes neighboring a must belong to $V_r$.
[Figure: examples labeled “Regions” and “Not a Region”]
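Region Definitions
Region states: $X_r$
Region beliefs: $b_r(X_r)$
Region energy: $E_r(X_r) = -\sum_{a \in F_r} \ln f_a(X_a)$
Region average energy: $U_r(b_r) = \sum_{X_r} b_r(X_r) E_r(X_r)$
Region entropy: $H_r(b_r) = -\sum_{X_r} b_r(X_r) \ln b_r(X_r)$
Region free energy: $G_r(b_r) = U_r(b_r) - H_r(b_r)$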
“Valid” Approximations
Introduce a set of regions R, and a counting number $c_r$ for each region r in R, such that $c_r = 1$ for the largest regions, and for every factor node a and variable node i,
$\sum_{r \in R} c_r I(a \in F_r) = \sum_{r \in R} c_r I(i \in V_r) = 1$
where $I(\cdot)$ denotes an indicator function. Count every node once!
$G(\{b_r\}) = \sum_{r \in R} c_r G_r(b_r)$
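The "count every node once" condition is easy to test mechanically. The sketch below is a hedged illustration: the factor graph is hypothetical, and the region set is an assumed choice (one large region per factor with counting number 1, one single-variable region per variable with counting number 1 − d_i, where d_i is the number of factors neighboring i).

```python
# Hypothetical factor graph: factor name -> neighboring variable nodes.
factor_graph = {"A": {1, 2}, "B": {2, 3}, "C": {1, 3}, "D": {3}}
variables = sorted(set().union(*factor_graph.values()))

# Each region is (factor nodes F_r, variable nodes V_r, counting number c_r).
regions = [({a}, set(vs), 1) for a, vs in factor_graph.items()]
for i in variables:
    d_i = sum(i in vs for vs in factor_graph.values())
    regions.append((set(), {i}, 1 - d_i))

def is_region(F_r, V_r):
    # If a factor is in the region, all of its neighboring variables must be too.
    return all(factor_graph[a] <= V_r for a in F_r)

def is_valid(regions):
    # Every factor node and every variable node must be counted exactly once.
    factor_ok = all(sum(c for F, V, c in regions if a in F) == 1 for a in factor_graph)
    variable_ok = all(sum(c for F, V, c in regions if i in V) == 1 for i in variables)
    return factor_ok and variable_ok

print(is_valid(regions))
```

Dropping any one of the single-variable regions from this list breaks the variable condition, so `is_valid` then returns `False`.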
Entropy and Energy
• $\sum_{r \in R} c_r I(a \in F_r) = 1$: Counting each factor node once makes the approximate energy exact (if the beliefs are).
• $\sum_{r \in R} c_r I(i \in V_r) = 1$: Counting each variable node once makes the approximate entropy reasonable (at least the entropy is correct when all states are equiprobable).
Comments
• We could actually use different counting numbers for energy and entropy, which leads to “fractional BP” (Wiegerinck & Heskes, 2002) or “convexified free energy” (Wainwright et al., 2002) algorithms.
• Of course, we also need to impose normalization and consistency constraints on our free energy.
Methods to Generate Valid Region-based Approximations
[Figure: diagram relating the methods: Region Graphs, Cluster Variation Method (Kikuchi), Junction graphs (Aji-McEliece), Bethe, Junction trees]
(Bethe is an example of Kikuchi for factor graphs with no 4-cycles; Bethe is an example of Aji-McEliece for “normal” factor graphs.)
Example of a Region Graph
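[Figure: region graph with three rows of regions]
Top row: [A,C,1,2,4,5] [B,D,2,3,5,6] [C,E,4,5,7,8] [D,F,5,6,8,9]
Middle row: [2,5] [C,4,5] [D,5,6] [5,8]
Bottom row: [5]
[5] is a child of [2,5]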
Definition of a Region Graph
• Labeled, directed graph of regions.
• Arc may exist from region A to region B if B is a subset of A.
• Sub-graphs formed from regions containing a given node are connected.
• $c_r = 1 - \sum_{s \in A(r)} c_s$, where $A(r)$ is the set of ancestors of region r. (Mobius Function)
• We insist that $\forall a, i: \sum_{r \in R} c_r I(a \in F_r) = \sum_{r \in R} c_r I(i \in V_r) = 1.$
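Bethe Method (after Bethe, 1935)
Two sets of regions:
Large regions containing a single factor node a and all attached variable nodes: $c_r = 1$
Small regions containing a single variable node i: $c_r = 1 - d_i$, where $d_i$ is the number of factor nodes attached to i
[Figure: 3x3 grid of variable nodes 1 to 9 with factor nodes A to F]
Large regions: [A;1,2,4,5] [B;2,3,5,6] [C;4,5] [D;5,6] [E;4,5,7,8] [F;5,6,8,9]
Small regions: [1] [2] [3] [4] [5] [6] [7] [8] [9]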
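Bethe Approximation to Gibbs Free Energy
$G_{Bethe} = \sum_a \sum_{X_a} b_a(X_a) \ln \frac{b_a(X_a)}{f_a(X_a)} + \sum_i (1 - d_i) \sum_{x_i} b_i(x_i) \ln b_i(x_i)$
Equal to the exact Gibbs free energy when the factor graph is a tree because in that case,
$b(X) = \prod_a b_a(X_a) \prod_i [b_i(x_i)]^{1 - d_i}$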
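The claim that the Bethe free energy is exact on a tree can be checked numerically: evaluated at the exact marginals, it should equal −ln Z. The following is a minimal sketch on a hypothetical three-variable tree with made-up factor tables; none of the names come from the slides.

```python
import itertools
import math

# Hypothetical tree factor graph over binary variables (made-up tables).
factors = {
    "A": ((1, 2), lambda x1, x2: 2.0 if x1 == x2 else 1.0),
    "B": ((2, 3), lambda x2, x3: 1.0 if x2 == x3 else 3.0),
    "C": ((3,), lambda x3: 0.7 if x3 == 0 else 0.3),
}
variables = [1, 2, 3]
states = list(itertools.product([0, 1], repeat=len(variables)))

def weight(X):
    x = dict(zip(variables, X))
    return math.prod(f(*(x[v] for v in vs)) for vs, f in factors.values())

Z = sum(weight(X) for X in states)

def exact_marginal(vs):
    # Exact marginal over the variables in vs, keyed by their joint state.
    out = {}
    for X in states:
        x = dict(zip(variables, X))
        key = tuple(x[v] for v in vs)
        out[key] = out.get(key, 0.0) + weight(X) / Z
    return out

G_bethe = 0.0
for vs, f in factors.values():   # large regions, counting number 1
    b_a = exact_marginal(vs)
    G_bethe += sum(b * math.log(b / f(*key)) for key, b in b_a.items() if b > 0)
for i in variables:              # small regions, counting number 1 - d_i
    d_i = sum(i in vs for vs, _ in factors.values())
    b_i = exact_marginal((i,))
    G_bethe += (1 - d_i) * sum(b * math.log(b) for b in b_i.values() if b > 0)

print(G_bethe, -math.log(Z))
```

On a graph with cycles the two printed numbers would generally differ, since the factorization of b(X) used above no longer holds.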
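Minimizing the Bethe Free Energy
$L = G_{Bethe} + \sum_i \gamma_i \{ \sum_{x_i} b_i(x_i) - 1 \} + \sum_a \sum_{i \in N(a)} \sum_{x_i} \lambda_{ai}(x_i) \{ b_i(x_i) - \sum_{X_a \setminus x_i} b_a(X_a) \}$
$\frac{\partial L}{\partial b_i(x_i)} = 0 \Rightarrow b_i(x_i) \propto \exp\left( \frac{1}{d_i - 1} \sum_{a \in N(i)} \lambda_{ai}(x_i) \right)$
$\frac{\partial L}{\partial b_a(X_a)} = 0 \Rightarrow b_a(X_a) \propto \exp\left( -E_a(X_a) + \sum_{i \in N(a)} \lambda_{ai}(x_i) \right)$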
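Bethe = BP
Identify $\lambda_{ai}(x_i) = \ln \prod_{b \in N(i) \setminus a} m_{b \to i}(x_i)$
to obtain BP equations:
$b_i(x_i) \propto \prod_{a \in N(i)} m_{a \to i}(x_i)$
$b_a(X_a) \propto f_a(X_a) \prod_{i \in N(a)} \prod_{b \in N(i) \setminus a} m_{b \to i}(x_i)$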
Cluster Variation Method (Kikuchi, 1951)
Form a region graph with an arbitrary number of different sized regions. Start with largest regions, each with $c_r = 1$.
Then find intersection regions of the largest regions, discarding any regions that are sub-regions of other intersection regions.
Continue finding intersections of those intersection regions, etc.
All intersection regions obey $c_r = 1 - \sum_{s \in S(r)} c_s$, where $S(r)$ is the set of super-regions of region r.
Region Graph Created Using CVM
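[Figure: region graph with three rows of regions]
Top row: [A,C,1,2,4,5] [B,D,2,3,5,6] [C,E,4,5,7,8] [D,F,5,6,8,9]
Middle row: [2,5] [C,4,5] [D,5,6] [5,8]
Bottom row: [5]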
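The cluster variation construction can be sketched directly: start from the four largest regions of this example, take intersections level by level, and assign counting numbers with c_r = 1 − Σ_{s∈S(r)} c_s. This is an illustrative sketch, not the authors' code; regions are encoded as sets of single-character node labels (letters for factor nodes, digits for variable nodes), and all function and variable names are hypothetical.

```python
from itertools import combinations

# Largest regions of the example, as frozensets of node labels.
largest = [
    frozenset("AC1245"), frozenset("BD2356"),
    frozenset("CE4578"), frozenset("DF5689"),
]

def intersections(regions):
    # Pairwise intersections, discarding any that are sub-regions
    # of another intersection region.
    found = {r & s for r, s in combinations(regions, 2) if r & s}
    return [r for r in found if not any(r < other for other in found)]

levels = [largest]
while len(levels[-1]) > 1:
    nxt = intersections(levels[-1])
    if not nxt:
        break
    levels.append(nxt)

# Counting numbers: c_r = 1 - (sum of c_s over all super-regions s of r).
counting = {}
for level in levels:
    for r in level:
        counting[r] = 1 - sum(c for s, c in counting.items() if r < s)

# Validity check: every node should be counted exactly once.
nodes = set().union(*largest)
counts_per_node = {n: sum(c for r, c in counting.items() if n in r) for n in nodes}

for r, c in counting.items():
    print(sorted(r), c)
```

For this example the sketch yields four intersection regions with counting number −1 and a final region [5] with counting number 1, and every factor and variable node is counted once.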
Minimizing a Region Graph Free Energy
• Minimization is possible, but it may be awkward because of all the constraints that must be satisfied.
• We introduce generalized belief propagation algorithms whose fixed points are provably identical to the stationary points of the region graph free energy.
Generalized Belief Propagation
• Belief in a region is the product of:
– Local information (factors in region)
– Messages from parent regions
– Messages into descendant regions from parents who are not descendants.
• Message-update rules obtained by enforcing marginalization constraints.
[Figure: 3x3 grid of variable nodes 1 to 9 and its region graph, with regions 1245, 2356, 4578, 5689 in the top row, 25, 45, 56, 58 in the middle row, and 5 at the bottom]
$b_5 \propto m_{2 \to 5} \, m_{4 \to 5} \, m_{6 \to 5} \, m_{8 \to 5}$
$b_{45} \propto [f_{45}] [m_{12 \to 45} \, m_{78 \to 45} \, m_{2 \to 5} \, m_{6 \to 5} \, m_{8 \to 5}]$
$b_{1245} \propto [f_{12} f_{14} f_{25} f_{45}] [m_{36 \to 25} \, m_{78 \to 45} \, m_{6 \to 5} \, m_{8 \to 5}]$
Use Marginalization Constraints to Derive Message-Update Rules
$b_5(x_5) = \sum_{x_4} b_{45}(x_4, x_5)$
$m_{4 \to 5}(x_5) = \sum_{x_4} f_{45}(x_4, x_5) \, m_{12 \to 45}(x_4, x_5) \, m_{78 \to 45}(x_4, x_5)$
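As a worked step, the message-update rule for the message from region [4,5] to region [5] follows by substituting the belief expressions for regions [5] and [4,5] into the marginalization constraint and cancelling the messages that appear on both sides. The derivation below is a sketch up to normalization constants; the cancelled messages depend only on x_5, so they can be pulled out of the sum over x_4.

```latex
\begin{align*}
b_5(x_5) &= \sum_{x_4} b_{45}(x_4, x_5) \\
m_{2\to5}\, m_{4\to5}\, m_{6\to5}\, m_{8\to5}
  &\propto \sum_{x_4} f_{45}\, m_{12\to45}\, m_{78\to45}\, m_{2\to5}\, m_{6\to5}\, m_{8\to5} \\
m_{4\to5}(x_5)
  &\propto \sum_{x_4} f_{45}(x_4, x_5)\, m_{12\to45}(x_4, x_5)\, m_{78\to45}(x_4, x_5)
\end{align*}
```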
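(Mild) Surprise #1
• Region beliefs (even those given by Bethe/BP) may not be realizable as the marginals of any global belief.
[Figure: factor graph with variable nodes 1, 2, 3 and pairwise factor nodes a, b, c]
$b_a(x_1, x_2) = b_b(x_1, x_3) = \begin{pmatrix} .4 & .1 \\ .1 & .4 \end{pmatrix}, \quad b_c(x_2, x_3) = \begin{pmatrix} .1 & .4 \\ .4 & .1 \end{pmatrix}$
is a perfectly acceptable solution to BP, but cannot arise from any $b(x_1, x_2, x_3)$.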
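One way to see the contradiction (an argument added here for illustration, not taken from the slides): in any global distribution, x2 ≠ x3 forces x1 to differ from at least one of them, so P(x2 ≠ x3) ≤ P(x1 ≠ x2) + P(x1 ≠ x3). The sketch below checks that the three pairwise beliefs agree on their single-variable marginals yet violate this bound. It only tests local consistency and the bound; it does not run BP.

```python
# Pairwise beliefs from the example, indexed as b[x_first][x_second].
b_a = [[0.4, 0.1], [0.1, 0.4]]  # b_a(x1, x2)
b_b = [[0.4, 0.1], [0.1, 0.4]]  # b_b(x1, x3)
b_c = [[0.1, 0.4], [0.4, 0.1]]  # b_c(x2, x3)

def row_marginal(b):
    return [sum(row) for row in b]

def col_marginal(b):
    return [b[0][k] + b[1][k] for k in (0, 1)]

def prob_differ(b):
    return b[0][1] + b[1][0]

# Local consistency: each variable gets the same marginal from both beliefs
# that contain it.
marginals = {
    1: (row_marginal(b_a), row_marginal(b_b)),
    2: (col_marginal(b_a), row_marginal(b_c)),
    3: (col_marginal(b_b), col_marginal(b_c)),
}

# Union bound that any global distribution b(x1, x2, x3) must satisfy.
lhs = prob_differ(b_c)                      # P(x2 != x3)
rhs = prob_differ(b_a) + prob_differ(b_b)   # P(x1 != x2) + P(x1 != x3)
realizable_bound_holds = lhs <= rhs + 1e-12

print(lhs, rhs, realizable_bound_holds)
```

The beliefs give P(x2 ≠ x3) = 0.8 against a bound of 0.4, so no global distribution can have all three as marginals.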
(Minor) Surprise #2
• For some sets of beliefs (sometimes even when the marginal beliefs are the exactly correct ones), the Bethe entropy is negative!
Each large region has counting number 1, each small region has counting number -2, so if the beliefs are
$b(x_i) = \begin{pmatrix} .5 & .5 \end{pmatrix}, \quad b(x_i, x_j) = \begin{pmatrix} .5 & 0 \\ 0 & .5 \end{pmatrix}$
then the entropy is negative:
$H = 6 \ln 2 + 4(-2) \ln 2 = -2 \ln 2$
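The arithmetic can be reproduced in a few lines. The setting assumed here, inferred from the numbers 6, 4 and −2 rather than stated in the text, is four binary variables with one pairwise region for each of the six pairs, so that each variable belongs to three pairs.

```python
import math

def entropy(probabilities):
    # Shannon entropy in nats; zero-probability states contribute nothing.
    return -sum(p * math.log(p) for p in probabilities if p > 0)

# Assumed setting: four binary variables, all six pairs as large regions.
n_vars = 4
pairs = [(i, j) for i in range(n_vars) for j in range(i + 1, n_vars)]

b_pair = [0.5, 0.0, 0.0, 0.5]   # b(x_i, x_j), flattened
b_single = [0.5, 0.5]           # b(x_i)

c_pair = 1
c_single = 1 - (n_vars - 1)     # each variable sits in three pairs: 1 - 3 = -2

H_bethe = (len(pairs) * c_pair * entropy(b_pair)
           + n_vars * c_single * entropy(b_single))

print(H_bethe, -2 * math.log(2))
```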
(Serious) Surprise #3
• When there are no interactions, the minimum of the CVM free energy should correspond to the equiprobable global distribution, but sometimes it doesn’t!
Example
Fully connected pairwise model on N variables, using CVM and all triplets as the largest regions, with pairs and singlets as the intersection regions.
Region type   # of regions       Counting #
Triplets      N(N-1)(N-2)/6      1
Pairs         N(N-1)/2           3-N
Singlets      N                  (N-2)(N-3)/2
Two simple distributions
• Equiprobable distribution: each state is equally probable for all beliefs.
– e.g. $b(x_i, x_j) = \begin{pmatrix} .25 & .25 \\ .25 & .25 \end{pmatrix}$
– Each region gives $H_r = n_r \ln 2$, where $n_r$ is the number of variables in region r.
• Distribution obtained by marginalizing the global distribution that only allows all-zeros or all-ones. “Equipolarized distribution”
– e.g. $b(x_i, x_j) = \begin{pmatrix} .5 & 0 \\ 0 & .5 \end{pmatrix}$
– Each region gives $H_r = \ln 2$
Surprise #3 (cont.)
• Surprisingly, the “equipolarized” distribution can have a greater entropy than the equiprobable distribution for some valid CVM approximations.
• This is a serious problem, because if the model gets the wrong answer without any interactions, it can’t be expected to be correct with interactions.
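A short calculation shows where the all-triplets approximation goes wrong. The sketch below assumes binary variables and the standard region counts for a fully connected model on N variables: C(N,3) triplets with counting number 1, C(N,2) pairs with counting number 3 − N, and N singlets with counting number (N−2)(N−3)/2. The function name is hypothetical.

```python
import math

def cvm_triplet_entropies(N):
    # Region-based entropies (in nats) of the all-triplets CVM approximation.
    # Each entry: (number of regions, counting number, variables per region).
    region_types = [
        (math.comb(N, 3), 1, 3),                 # triplets
        (math.comb(N, 2), 3 - N, 2),             # pairs
        (N, (N - 2) * (N - 3) // 2, 1),          # singlets
    ]
    # Equiprobable beliefs: each region contributes n_r * ln 2.
    h_equiprobable = sum(n * c * size for n, c, size in region_types) * math.log(2)
    # Equipolarized beliefs: each region contributes ln 2.
    h_equipolarized = sum(n * c for n, c, _ in region_types) * math.log(2)
    return h_equiprobable, h_equipolarized

for N in range(3, 9):
    h_eq, h_pol = cvm_triplet_entropies(N)
    print(N, round(h_eq / math.log(2), 6), round(h_pol / math.log(2), 6))
```

In units of ln 2, the equiprobable entropy is always N, while the equipolarized entropy is the sum of the counting numbers, N(N² − 6N + 11)/6. That sum first exceeds N at N = 6, so from there on the approximate entropy prefers the equipolarized beliefs.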
The Fix: Consider only Maxent-Normal Approximations
• A maxent-normal approximation is one which gives a maximum of the entropy when the beliefs correspond to the equiprobable distribution.
• Bethe approximations are provably maxent-normal.
• Some other CVM and other region-graph approximations are also provably maxent-normal.
– Using quartets on square lattices
– Empirically, these approximations give good results
• Other valid CVM or region-graph approximations are not maxent-normal.
– Empirically, these approximations give poor results
A Very Simple Heuristic
• Sum all the counting numbers.
– If the sum is greater than N, the equipolarized distribution will have a greater entropy than the equiprobable distribution. (Too Much: Very Bad)
– If the sum is less than 0, the equipolarized distribution will have a negative entropy. (Too Little)
– If the sum equals 1, the equipolarized distribution will have the correct entropy. (Just Right)
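The heuristic amounts to a one-line sum and three comparisons. The sketch below applies it to three sets of counting numbers that appear in these slides; the function name and the "inconclusive" label for sums not covered by the three rules are assumptions added here.

```python
def heuristic(counting_numbers, n_variables):
    # Classify an approximation by the sum of its counting numbers.
    total = sum(counting_numbers)
    if total > n_variables:
        return "too much"
    if total < 0:
        return "too little"
    if total == 1:
        return "just right"
    return "inconclusive"  # not covered by the three rules (label assumed here)

# 3x3 lattice region graph: four large regions (c = 1), four intersection
# regions (c = -1) and the single-node region (c = 1); nine variables.
lattice = [1] * 4 + [-1] * 4 + [1]

# Pairwise regions on four fully connected variables: six pairs (c = 1),
# four single-variable regions (c = -2).
bethe_pairs = [1] * 6 + [-2] * 4

# All-triplets CVM on six variables: 20 triplets (c = 1), 15 pairs (c = -3),
# 6 singlets (c = 6).
triplets = [1] * 20 + [-3] * 15 + [6] * 6

print(heuristic(lattice, 9), heuristic(bethe_pairs, 4), heuristic(triplets, 6))
```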
10x10 Ising Spin Glass
Random interactions
Random fields
Conclusions
• Standard BP essentially equivalent to minimizing the Bethe free energy.
• Bethe method and cluster variation method are special cases of the more general region graph method for generating valid free energy approximations.
• GBP is essentially equivalent to minimizing region graph free energy.
• One should be careful to use maxent-normal approximations.
• Sometimes you can prove that your approximation is maxent-normal; and simple heuristics can be used to prove when it isn’t.