Structured Variational Methods for Distributed Inference in Networked Systems: Design and Analysis
Huaiyu Dai*, Senior Member, IEEE, Yanbing Zhang, Member, IEEE, and Juan Liu
Abstract
In this paper, a variational message passing framework is proposed for distributed inference in
networked systems. Based on this framework, structured variational methods are explored to take
advantage of both the simplicity of variational approximation (for inter-cluster processing) and
the quality of more accurate inference (for intra-cluster processing). To investigate the
convergence performance of our inference approach, we distinguish the inter- and intra-cluster
inference algorithms as vertex and edge processes respectively. Based on an analysis on the
intra-cluster inference procedure, the overall performance of structured variational methods,
modeled as a mixed vertex-edge process, is quantitatively characterized via a coupling approach.
The tradeoff between performance and complexity of this inference approach is also addressed.
Index Terms: convergence analysis, distributed inference, Markov chain, variational methods
Copyright (c) 2012 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].
H. Dai and J. Liu are with the Department of Electrical and Computer Engineering, NC State University, Raleigh, NC 27695 (Email: [email protected], [email protected]). Y. Zhang is with Broadcom Corporation, Matawan, NJ 07747 (Email: [email protected]). This work was done while he was with NC State University. This work was supported in part by the National Science Foundation under Grants CCF-0830462 and ECCS-1002258.
Part of this work was presented at the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Taipei, April 2009 [29], and the IEEE International Symposium on Information Theory (ISIT), Seoul, South Korea, June 2009 [30].
I. INTRODUCTION
Large-scale networked systems of intelligent devices are playing an increasingly important role
in protecting the nation's critical infrastructures as well as serving people’s needs; smart grid,
intelligent transportation, precision agriculture, and seamless surveillance are a few such
examples. In many such systems, there is a pressing need for automatic reasoning and inference
due to practical and economic considerations, and it is desirable that inference be conducted in a
distributed fashion. This motivates us to develop a general and flexible framework for automatic
inference in networked systems which can admit wide applications and provide desired tradeoff
between accuracy and efficiency, while allowing simple and distributed implementation.
Exact inference is known to be NP-hard [2], and is generally computationally intractable for many applications. Therefore, approximate methods are often resorted to in practice. One popular approach to approximate inference is sampling, of which the family of Markov chain Monte Carlo methods is noteworthy. Concerns about this approach include slow convergence, limited analytical tractability, and high computational complexity. Belief propagation (BP) algorithms [1] and their variants (such as consensus propagation [3], a special case of Gaussian BP) have also been widely studied in the literature. BP algorithms yield accurate inference on acyclic graphs, and continue to work well on loopy graphs with sufficient sparsity and symmetry. They are also amenable to distributed implementation. However, BP and related algorithms do not always converge on general cyclic graphs. They are also computationally intractable when continuous variables are involved (except for Gaussian distributions), and approximate methods such as particle filtering may be employed as a remedy in practice.
Variational methods [4] are an alternative for approximate inference. Being a deterministic
approach, they are often more efficient in computation, more amenable to analysis, and admit
wide applicability regarding the underlying models, whether acyclic or cyclic, discrete or
continuous. A message-passing algorithm for the mean-field (MF) inference was proposed for
conjugate-exponential models in (directed) Bayesian networks [5]. An implementation on the
factor graph can be found in [6]. In this paper, we derive a variational message passing framework for Markov random fields (MRF), which arguably offer advantages in modeling wireless networks. In particular, we formulate explicit message passing rules for distributions in the
exponential family, which covers a large class of probabilistic models. Relevant discussion is
given in Section II.
Among variational methods, the simple MF approach is the one most often considered, owing to its analytical and computational tractability; its inference accuracy is nonetheless limited because of the inherent assumption that the variables of interest are fully independent. Naturally, a richer structure for the variational distribution can be exploited for better inference quality, at the cost of increased complexity. Such approaches are named structured variational methods, or simply structured mean field (SMF). They have mainly been studied in the artificial intelligence area [7][8], and little consideration has been given to their application in real networks. In this work, we
further investigate exploiting substructures of networks to improve variational methods in real
systems. Thus the simplicity of variational methods (for inter-cluster processing) and the accuracy
of (approximately) exact inference algorithms (for intra-cluster processing) can be exploited
simultaneously, as detailed in Section III. In this study, BP is adopted as an approximation for
exact inference in intra-cluster processing, as it can be readily realized in a distributed form.
Meanwhile, our SMF framework can effectively control the cluster sizes (and even the topologies)
to ensure good performance for BP processing. In [24], an alternative approach of combining the
BP and MF inference is presented on the factor graph model: the whole set of factor nodes are
divided into two parts, with BP applied on one subset (in particular discrete variables) and MF on
the remaining part. One possible application of such an approach is to design iterative message-
passing algorithms jointly for different components of a communication system, such as joint
channel estimation and decoding, which is orthogonal to our study.
For the distributed inference algorithms mentioned above and studied in this work, stochastic weight matrices conforming to the underlying graphical structure (network topology) are typically employed. Hence the convergence of these algorithms is closely related to the
mixing time of a random walk on the corresponding graph. Random walks on graphs can be
categorized as vertex process-based or edge process-based. The essential difference between the two is that the former is a process on nodes that transitions along edges and is allowed to "backtrack", while the latter is a process on directed edges that transitions toward nodes and where "backtracking" is forbidden. As we will see, distributed algorithms derived from variational methods can be characterized by vertex processes, typically involving reversible Markov chains, while belief propagation and its variants correspond to edge processes, typically involving non-reversible Markov chains.
Even though quite a few techniques exist for analyzing the convergence of reversible Markov
chains, including spectral theory, conductance, canonical paths and multi-commodity flow (see [9]
and the references therein), few of them can be successfully applied to non-reversible cases. In
[10] a non-reversible random walk in the one-dimensional chain is analyzed through a direct
probabilistic approach. A study on the convergence properties of consensus propagation is given
in [3] through function mapping and matrix analysis; an explicit result on convergence time is
derived for the cycle, with conjectures given for higher-dimensional tori. Structured variational methods, as we will formulate in Section III, are in fact mixed vertex-edge processes involving hybrid Markov chains, which entails even more difficulty in analysis. In this paper, we use a "divide and conquer" strategy to investigate their performance: first we analyze the convergence of the
intra-cluster edge process, where we derive an upper bound on the mixing time and verify the
conjecture in [3] for the two-dimensional (2-d) torus; then we exploit the coupling technique [11]
to combine the results for edge and vertex processes to obtain a characterization on the overall
performance. Relevant contents are given in Section IV and Section V, respectively. As a result,
the performance-complexity tradeoff in structured variational methods is further addressed in
Section VI, together with some supporting simulation results.
The contributions of this work are summarized as follows. First, we derive a general and
scalable variational message-passing framework for Markov random fields, which admits wide
applicability concerning network size and topology, allows flexible tradeoff between performance
and complexity, and easily adapts to practical wireless networks. In particular, we obtain explicit
forms for variational message passing rules for probabilistic distributions in the exponential
family, which are simple and yet admit wide applications. Then, this framework is applied to a
clustered network (exemplified by Gaussian MRF), to realize a novel distributed inference
approach which can achieve a flexible balance between inference accuracy and computational
complexity. We also characterize the convergence behavior of the proposed inference algorithm on a 2-d torus, and in the process derive an upper bound on the mixing time of the intra-cluster BP inference process, which should be of independent interest. Our analytical methodologies are developed for general edge process-based and mixed random walks, and may find wider applicability.
II. VARIATIONAL MESSAGE PASSING IN MRF
A. System Model
Distributed inference in complex networked systems is often cast as probabilistic inference in
a graphical model. Well-known graphical models include Markov random fields, Bayesian
networks, and factor graphs. Associated with each graphical model is a family of distributions
which factorize according to the dependency structure of the underlying graph. Besides obvious
advantages in visual representation, graphical models also facilitate design and analysis of
distributed algorithms. Among existing graphical models, Markov random fields exhibit certain
modeling convenience for wireless networks, as they can be conveniently mapped to real
communication graphs, and often admit simpler forms for message-passing algorithms.
In this work we mainly consider a pairwise MRF¹ represented by an undirected graph $G=(V,E)$, where $V$ and $E$ denote the vertex and edge sets respectively, and each node $i\in V$ is associated with a random variable $X_i$ and an observation $y_i$. Define for each node a local potential function $\psi_i(X_i,y_i)$, and for each edge $(i,j)\in E$ a compatibility function $\psi_{ij}(X_i,X_j)$ [12]. The Hammersley-Clifford theorem [1] indicates that the posterior probability of the random vector $\mathbf X=\{X_i\}_{i\in V}$ given the observation vector $\mathbf y=\{y_i\}_{i\in V}$ admits the following product form:

$$p(\mathbf X\mid\mathbf y)\;\propto\;\prod_{(i,j)\in E}\psi_{ij}(X_i,X_j)\prod_{i\in V}\psi_i(X_i,y_i). \quad (1)$$
We also assume that the $\{\psi_i\}$ and $\{\psi_{ij}\}$ belong to the exponential family, i.e., take the following forms:

$$\psi_i(X_i,y_i)=\exp\big(\boldsymbol\theta_i^T\boldsymbol\eta_i(X_i)-g_i(\boldsymbol\theta_i)\big), \quad (2)$$
$$\psi_{ij}(X_i,X_j)=\exp\big(\boldsymbol\theta_{ij}^T\boldsymbol\eta_{ij}(X_i,X_j)-g_{ij}(\boldsymbol\theta_{ij})\big), \quad (3)$$

where the $\boldsymbol\theta$'s and $\boldsymbol\eta$'s are usually referred to as the natural parameters and sufficient statistics, respectively, and $g_i(\boldsymbol\theta_i)$ and $g_{ij}(\boldsymbol\theta_{ij})$ are functions of the $\boldsymbol\theta$'s only, independent of the $X$'s. The exponential family covers a large class of distributions of interest, such as Gaussian, Wishart, Gamma, Beta, and all discrete distributions.
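As a concrete illustration of the form (2) (a sketch, with the Gaussian as the running example, not part of the original text): for $\mathcal N(m,s^2)$ one can take $\boldsymbol\eta(x)=(x,x^2)$ and $\boldsymbol\theta=(m/s^2,\,-1/(2s^2))$, which reproduces the log-density exactly.

```python
import math

def gaussian_logpdf(x, m, s2):
    """Direct evaluation of the Gaussian log-density."""
    return -0.5 * math.log(2 * math.pi * s2) - (x - m) ** 2 / (2 * s2)

def gaussian_natural_params(m, s2):
    """Natural parameters theta for sufficient statistics eta(x) = (x, x^2)."""
    return (m / s2, -1.0 / (2 * s2))

def log_partition(m, s2):
    """g(theta), expressed here via the source parameters for brevity."""
    return m ** 2 / (2 * s2) + 0.5 * math.log(2 * math.pi * s2)

def expfam_logpdf(x, m, s2):
    """theta^T eta(x) - g(theta), the exponential-family form (2)."""
    t1, t2 = gaussian_natural_params(m, s2)
    return t1 * x + t2 * x ** 2 - log_partition(m, s2)
```

The two evaluations agree for any $(x,m,s^2)$, confirming that the Gaussian fits the template (2).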
The objective of distributed inference is to compute $P(\mathbf X\mid\mathbf Y=\mathbf y)$ (or, more generally, $P(S\mid\mathbf Y=\mathbf y)$ for some subset $S\subset\mathbf X$) in an efficient way, through local computation and communication only. This is particularly important for large-scale networked systems, where the observation data is widely distributed and each node may have limited computation and communication resources. Distributed inference algorithms can also find applications in data processing where communication is not a concern but computational complexity is: centralized processing incurs a computational complexity scaling exponentially with the size of $\mathbf X$, with additional complexity when a subset $S\subset\mathbf X$ is considered.
¹ An MRF with higher-order cliques can always be converted into an equivalent pairwise MRF [12].
B. Variational Inference
Modern variational methods in general refer to converting the original problem into an optimization problem (variational transformation), and seeking approximate solutions to the latter by approximating the objective function or restricting the feasible set of solutions (variational approximation). Solving the (relaxed) optimization problem typically results in a set of fixed-point equations, whose successive enforcement can (hopefully) lead to a solution.
Applying the variational approach to distributed inference, the original problem is first transformed into finding a distribution $Q(\mathbf X)$ that minimizes the KL divergence $KL\big(Q(\mathbf X)\,\|\,P(\mathbf X\mid\mathbf Y=\mathbf y)\big)$, or equivalently maximizes a tight lower bound on $\log P(\mathbf y)$ [4]:

$$L(Q)=H(Q)+E_Q\{\log P(\mathbf X,\mathbf y)\}, \quad (4)$$

where $H(Q)$ is the entropy of $Q(\mathbf X)$, and $E_Q\{\cdot\}$ stands for expectation with respect to (w.r.t.) $Q(\mathbf X)$. For analytical and computational tractability, the variational distribution $Q(\mathbf X)$ is often restricted to a class of distributions with simpler dependency structure (such as sub-graphs of the original graphical model). Thus far, the most fruitful applications of variational inference assume a fully factorized form $Q(\mathbf X)=\prod_i Q_i(X_i)$, referred to as the mean-field approach. When (4) is instantiated with this form and optimized w.r.t. each individual component, the following set of fixed-point equations is obtained for the optimal variational distribution (where $E_Q\{\cdot\mid X_i\}$ denotes conditional expectation given $X_i$, and $Z_i$ is a normalization constant):

$$\log Q_i(X_i)=E_Q\{\log P(\mathbf X\mid\mathbf y)\mid X_i\}-\log Z_i,\qquad \forall i. \quad (5)$$
The complexity and accuracy of variational inference depend on the inherent structure of the variational distribution $Q(\mathbf X)$. While the MF approach is attractive for its simplicity, its inference accuracy may not be satisfactory, as posterior correlations are not captured. In this paper, we consider a richer structure for the variational distributions and explore the tradeoff between performance and complexity.
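The identity behind (4), namely $\log P(\mathbf y)-L(Q)=KL\big(Q\,\|\,P(\cdot\mid\mathbf y)\big)$, can be verified numerically on a toy discrete model (the numbers below are arbitrary illustrations, not from the paper):

```python
import math

# Toy joint distribution over x in {0, 1} with the observation fixed:
# p_xy[x] = P(X = x, Y = y_obs).
p_xy = [0.1, 0.3]
p_y = sum(p_xy)                       # P(Y = y_obs)
post = [p / p_y for p in p_xy]        # P(X = x | Y = y_obs)

def elbo(q):
    """L(Q) = H(Q) + E_Q[log P(X, y)], cf. (4)."""
    h = -sum(qi * math.log(qi) for qi in q)
    return h + sum(qi * math.log(pi) for qi, pi in zip(q, p_xy))

def kl(q, p):
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

q = [0.5, 0.5]  # any trial variational distribution
# log P(y) - L(Q) equals KL(Q || P(.|y)), so L(Q) is a tight lower
# bound, achieved exactly when Q matches the posterior.
gap = math.log(p_y) - elbo(q)
assert abs(gap - kl(q, post)) < 1e-12
assert abs(elbo(post) - math.log(p_y)) < 1e-12
```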
C. Variational Message Passing Framework
In this section, we follow the general procedure of variational inference discussed in Section II.B to derive a message-passing framework for distributed inference in networked systems. Consider the system model (1), and rearrange $\boldsymbol\theta_{ij}^T\boldsymbol\eta_{ij}(X_i,X_j)$ in terms of $X_i$: $\boldsymbol\theta_{ij}^T\boldsymbol\eta_{ij}(X_i,X_j)=(\boldsymbol\theta'_{ij})^T\boldsymbol\eta'_{ij}(X_i)$, where $\boldsymbol\theta'_{ij}$ may be a function of $X_j$. Let $\tilde{\boldsymbol\eta}_i(X_i)$ be the union of the sufficient statistics $\boldsymbol\eta_i(X_i)$ and $\boldsymbol\eta'_{ij}(X_i)$. Then the corresponding terms in (2) and (3) can be rewritten as

$$\boldsymbol\theta_i^T\boldsymbol\eta_i(X_i)=\tilde{\boldsymbol\theta}_i^T\tilde{\boldsymbol\eta}_i(X_i), \quad (6)$$
$$\boldsymbol\theta_{ij}^T\boldsymbol\eta_{ij}(X_i,X_j)=(\boldsymbol\theta'_{ij})^T\boldsymbol\eta'_{ij}(X_i)=\tilde{\boldsymbol\theta}_{ij}^T\tilde{\boldsymbol\eta}_i(X_i), \quad (7)$$

where the newly obtained $\tilde{\boldsymbol\theta}_i$ and $\tilde{\boldsymbol\theta}_{ij}$² are named the extended natural parameters and $\tilde{\boldsymbol\eta}_i(X_i)$ the extended sufficient statistics.
It can be shown that the optimal mean-field approximation $Q_i^*$ dictated by (5) is also a member of the exponential family, with sufficient statistics $\tilde{\boldsymbol\eta}_i(X_i)$ and natural parameter

$$\boldsymbol\theta_{Q_i^*}=\tilde{\boldsymbol\theta}_i+\sum_{j\in\mathcal N_i}E_{Q^*\setminus Q_i^*}\{\tilde{\boldsymbol\theta}_{ij}\}, \quad (8)$$

where $\mathcal N_i$ is the set of neighboring nodes of node $i$, and $Q^*\setminus Q_i^*$ stands for the distribution $\prod_{k\neq i}Q_k^*(X_k)$. From (8), a simple message passing rule can be obtained:

Message passing: $\mathbf m_{i\leftarrow j}^{(n)}=E_{Q_j^{(n-1)}}\{\tilde{\boldsymbol\theta}_{ij}\}$; (9)
Parameter updating: $\boldsymbol\theta_i^{(n)}=\tilde{\boldsymbol\theta}_i+\sum_{j\in\mathcal N_i}\mathbf m_{i\leftarrow j}^{(n)}$. (10)

That is, in the $n$th iteration, the message from node $j$ to its neighbor $i$ is the expected value of the extended parameter of the corresponding edge compatibility function, $\tilde{\boldsymbol\theta}_{ij}$ (generally a function of $X_j$), w.r.t. the current variational approximation $Q_j^{(n-1)}$ (similarly, the message from node $i$ to its neighbor $j$ is $\mathbf m_{j\leftarrow i}^{(n)}=E_{Q_i^{(n-1)}}\{\tilde{\boldsymbol\theta}_{ji}\}$). In turn, node $i$ sums up all the messages from its neighbors, together with the extended parameter of its own potential function, to obtain an updated parameter for its variational distribution component. The iteration generally converges under mild conditions [4][13]. While conforming to general expressions in the literature, the above explicit message-passing and parameter-updating forms appear to be new.

² Note that there is a counterpart $\tilde{\boldsymbol\theta}_{ji}$ which abstracts the corresponding terms of $X_i$.
D. Gaussian MRF
For concreteness of discussion, we will particularly consider Gaussian graphical models in this study. Gaussian models are widely adopted in theory and practice in many areas, such as computer vision, oceanography, and wireless networks, and serve as good approximations in many scenarios thanks to the central limit theorem. Without loss of generality, consider that $\mathbf X$ in (1) is jointly Gaussian with zero mean and (positive definite) covariance matrix $\boldsymbol\Sigma_X$, abbreviated as $\mathbf X\sim\mathcal N(\mathbf 0,\boldsymbol\Sigma_X)$, where $\sigma_{S,i}^2=E[X_i^2]$ and $\sigma_{S,ij}=E[X_iX_j]$. The observation at each node is given by

$$y_i=H_ix_i+\nu_i,\qquad i=1,\dots,|V|, \quad (11)$$

where the channel gain $H_i$ is assumed known, and the noise $\nu_i\sim\mathcal N(0,\sigma_{N,i}^2)$ is independent across the network³. Given the observation vector $\mathbf y$, the posterior probability $P(\mathbf X\mid\mathbf y)$ is Gaussian distributed as⁴

$$P(\mathbf X\mid\mathbf y)\sim\mathcal N\big(\mathbf F^{-1}\mathbf H^T\boldsymbol\Xi^{-1}\mathbf y,\;\mathbf F^{-1}\big)\sim\mathcal N^{-1}\big(\mathbf H^T\boldsymbol\Xi^{-1}\mathbf y,\;\mathbf F\big), \quad (12)$$

where $\mathbf H=\operatorname{diag}(H_i)$, $\boldsymbol\Xi=\operatorname{diag}(\sigma_{N,i}^2)$, and $\mathbf F=[F_{ij}]_{|V|\times|V|}=\boldsymbol\Sigma_X^{-1}+\mathbf H^T\boldsymbol\Xi^{-1}\mathbf H$.

Consider approximating the posterior probability (12) by the MF variational distribution with $Q_i(X_i)\sim\mathcal N(\mu_{\mathrm{MF},i},\sigma_{\mathrm{MF},i}^2)$, where $\mu_{\mathrm{MF},i}$ and $\sigma_{\mathrm{MF},i}^2$ are the posterior mean and variance, respectively.

³ It is straightforward to extend the model to include more complex scenarios, such as observations and noise correlated in space.
⁴ $\mathcal N^{-1}(\boldsymbol\beta,\boldsymbol\Gamma)$ is the information parameterization of the Gaussian distribution $\mathcal N(\boldsymbol\mu,\boldsymbol\Sigma)$, with $\boldsymbol\Gamma=\boldsymbol\Sigma^{-1}$ and $\boldsymbol\beta=\boldsymbol\Sigma^{-1}\boldsymbol\mu$.
As an application of the variational message passing framework derived in the previous section, the following iterative form is obtained for the estimate of $\mu_{\mathrm{MF},i}$ (see Appendix A):

$$\mu_{\mathrm{MF},i}^{(n)}=\Big(H_iy_i/\sigma_{N,i}^2-\sum_{j\in\mathcal N_i}F_{ij}\,\mu_{\mathrm{MF},j}^{(n-1)}\Big)\Big/F_{ii}. \quad (13)$$

Collecting all node estimates into a vector of dimension $|V|$ leads to the following expression:

$$\boldsymbol\mu_{\mathrm{MF}}^{(n)}=\mathbf A\mathbf y+\mathbf B\hat{\mathbf P}_V\,\boldsymbol\mu_{\mathrm{MF}}^{(n-1)}, \quad (14)$$

where $\mathbf A$ and $\mathbf B$ are relevant coefficient matrices, and the stochastic matrix $\hat{\mathbf P}_V$ has entries

$$[\hat{\mathbf P}_V]_{ij}=\begin{cases}F_{ij}\Big/\displaystyle\sum_{j\in\mathcal N_i}F_{ij}, & j\in\mathcal N_i,\\[4pt] 0, & \text{otherwise}.\end{cases} \quad (15)$$

In the next section, the same variational message passing framework will be applied to derive the message exchange rules between clusters (viewed as "mega-nodes").
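As a quick numerical sanity check (a sketch with made-up values, not part of the original development), the mean-field iteration (13) can be run on a toy 4-node model; $\mathbf F$ is chosen diagonally dominant so the iteration converges, and at the fixed point the MF means solve $\mathbf F\boldsymbol\mu=\mathbf b$ exactly, a known property of Gaussian mean field.

```python
# Toy precision matrix F (diagonally dominant) and observation terms
# b_i = H_i y_i / sigma_{N,i}^2; all values are illustrative.
F = [[4.0, 1.0, 0.0, 1.0],
     [1.0, 4.0, 1.0, 0.0],
     [0.0, 1.0, 4.0, 1.0],
     [1.0, 0.0, 1.0, 4.0]]
b = [1.0, -2.0, 0.5, 3.0]

mu = [0.0] * 4
for _ in range(200):
    # Update (13): each node uses its neighbors' previous estimates.
    mu = [(b[i] - sum(F[i][j] * mu[j] for j in range(4) if j != i)) / F[i][i]
          for i in range(4)]

# At the fixed point, F mu = b, i.e., the MF means coincide with the
# exact posterior means.
residual = max(abs(sum(F[i][j] * mu[j] for j in range(4)) - b[i])
               for i in range(4))
assert residual < 1e-10
```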
III. STRUCTURED VARIATIONAL METHODS FOR DISTRIBUTED INFERENCE
Although attractive for its computational simplicity, the naive mean-field approach may not yield
sufficient accuracy or fast convergence due to the independence restriction on variational
distributions. A natural idea for improvement is to consider a variational distribution with richer
(and yet tractable) dependence structure, and integrate exact or more accurate probabilistic
inference algorithms with the mean field method to achieve a good tradeoff between accuracy
and complexity. As mentioned earlier, the application of structured variational methods to distributed inference in practical networks is largely unexplored, as is their quantitative performance analysis in this setting. In this section, we discuss their instantiation in the context of clustered wireless networks, and establish convergence. In the following two sections, we will characterize the convergence rate.
A. Overview of the SMF⁵ Method
The MF approximation corresponds to a totally disconnected graph where all the dependencies between the variables of the original model are removed. A natural idea to enrich the structure of $Q$ is to replace each node in the MF approximation by a "mega-node", i.e., to consider a class of variational distributions of the form $Q=\prod_{i=1}^{s}Q_i(\mathbf X_{C_i})$, where $C_1,\dots,C_s$ form a disjoint partition of all nodes (variables). This approach keeps the original dependency structure within the clusters while decoupling the connections between clusters. Accurate (or approximately accurate) inference is pursued within the clusters to improve performance, while the MF approximation is used across the clusters to maintain tractability. The tradeoff between accuracy and complexity can be realized through the construction of clusters, with the MF approach (clusters of size one) and exact inference (a single cluster) at the two extremes.
In particular, we adopt the belief propagation algorithm for intra-cluster reasoning, as it
yields accurate inference on acyclic graphs, and continues to work well on loopy graphs with
suitable sparsity or symmetry. An important reason to choose BP is that it can be readily
implemented in a message-passing style through the prominent sum-product algorithm or its
variants. However, intra-cluster processing in our SMF framework is not limited to BP. Other
exact inference algorithms such as the junction tree (JT) algorithm can be employed for better
convergence and wider applicability. It should be noted that the computational complexity of JT
grows exponentially with the size of the maximal clique in the cluster, so it is sensible to put a
limit on the cluster size when JT is applied. In both cases, our SMF framework provides a
platform to best reap the benefits of these high-accuracy inference algorithms in practice.
SMF also requires some overhead for clustering, which can be done before network setup and adjusted during network operation when necessary. We have designed a distributed clustering scheme for SMF, which endeavors to minimize the dependence (correlation) between the clusters and thus improve the algorithm performance. The effectiveness of the scheme is demonstrated in Section VI via simulations. Interested readers are referred to [14] for details.

⁵ In this paper, "structured variational method" and "structured mean field" are largely interchangeable; more specifically, the former refers to the methodology, while the latter is used for the scheme we develop.
B. SMF in Gaussian MRF
Consider the Gaussian MRF model presented in II.D, for which the messages and node beliefs are both Gaussian distributed. Assuming that in the $n$th iteration the message from node $i$ to $j$, $m_{j\leftarrow i}^{(n)}$, and the belief at node $i$, $b_i^{(n)}$, are parameterized by

$$m_{j\leftarrow i}^{(n)}(x)\sim\mathcal N^{-1}\big(\mu_{j\leftarrow i}^{(n)},\,\beta_{j\leftarrow i}^{(n)}\big)\quad\text{and}\quad b_i^{(n)}(x)\sim\mathcal N^{-1}\big(q_i^{(n)},\,W_i^{(n)}\big),$$

we can obtain the updating rules as (see Appendix B)

$$\mu_{j\leftarrow i}^{(n)}=\frac{\sigma_{S,ij}\Big(\alpha_i+\sum_{k\in\mathcal N_i\setminus\{j\}}\mu_{i\leftarrow k}^{(n-1)}\Big)}{\sigma_{S,j}^2\Big[1+\sigma_{S,i}^2(1-\rho_{ij}^2)\Big(V_i+\sum_{k\in\mathcal N_i\setminus\{j\}}\beta_{i\leftarrow k}^{(n-1)}\Big)\Big]},$$
$$\beta_{j\leftarrow i}^{(n)}=\frac{1}{\sigma_{S,j}^2(1-\rho_{ij}^2)}\left[1-\frac{\rho_{ij}^2}{1+\sigma_{S,i}^2(1-\rho_{ij}^2)\Big(V_i+\sum_{k\in\mathcal N_i\setminus\{j\}}\beta_{i\leftarrow k}^{(n-1)}\Big)}\right], \quad (16)$$

with $\rho_{ij}=\sigma_{S,ij}/(\sigma_{S,i}\sigma_{S,j})$,

$$\alpha_i=H_iy_i/\sigma_{N,i}^2,\qquad V_i=H_i^2/\sigma_{N,i}^2+(1-|\mathcal N_i|)/\sigma_{S,i}^2, \quad (17)$$

and

$$q_i^{(n)}=\alpha_i+\sum_{k\in\mathcal N_i}\mu_{i\leftarrow k}^{(n)},\qquad W_i^{(n)}=V_i+\sum_{k\in\mathcal N_i}\beta_{i\leftarrow k}^{(n)}. \quad (18)$$
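For illustration, a generic scalar Gaussian BP routine can be sketched in information form; this sketch uses a plain precision-matrix parameterization $(J,h)$ for $p(\mathbf x)\propto\exp(-\tfrac12\mathbf x^TJ\mathbf x+\mathbf h^T\mathbf x)$ rather than the paper's $(\sigma_S,\alpha_i,V_i)$ parameterization, but the message structure (aggregate the node term and all incoming messages except the recipient's, then combine all messages into beliefs as in (18)) is the same.

```python
def gaussian_bp(J, h, iters=50):
    """Generic scalar Gaussian BP in information form. Messages carry a
    (potential, precision) pair, mirroring the (mu, beta) pairs in the text."""
    n = len(h)
    nbrs = [[j for j in range(n) if j != i and J[i][j] != 0] for i in range(n)]
    pot = {(i, j): 0.0 for i in range(n) for j in nbrs[i]}
    prec = {(i, j): 0.0 for i in range(n) for j in nbrs[i]}
    for _ in range(iters):
        new_pot, new_prec = {}, {}
        for i in range(n):
            for j in nbrs[i]:
                # Aggregate the node term and all incoming messages except j's.
                P = J[i][i] + sum(prec[(k, i)] for k in nbrs[i] if k != j)
                m = h[i] + sum(pot[(k, i)] for k in nbrs[i] if k != j)
                new_prec[(i, j)] = -J[i][j] ** 2 / P
                new_pot[(i, j)] = -J[i][j] * m / P
        pot, prec = new_pot, new_prec
    # Node beliefs combine the node term with all incoming messages.
    W = [J[i][i] + sum(prec[(k, i)] for k in nbrs[i]) for i in range(n)]
    q = [h[i] + sum(pot[(k, i)] for k in nbrs[i]) for i in range(n)]
    return [qi / Wi for qi, Wi in zip(q, W)]  # belief means

# On a tree (here a 3-node chain) the belief means are exact; for this
# toy J and h, the exact posterior means solving J m = h are (1, 1, 1).
means = gaussian_bp([[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]],
                    [1.0, 0.0, 1.0])
```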
We proceed to discuss inter-cluster message updates. Consider a partitioning of the network nodes $C=\{C_i\}_{i=1}^{s}$, and denote by $\mathbf X_{C_i}$ the collection of all variables $X_\ell$ with $\ell\in C_i$. For each cluster $C_i$, define its Markov blanket (MB), $MB(C_i)$, as the set of nodes outside of $C_i$ that are connected to some nodes in $C_i$. In turn, those nodes in $C_i$ that are connected to some nodes in $MB(C_i)$ are called gateway nodes. A neighboring cluster which contains part of $MB(C_i)$ is named a Markov blanket cluster (MBC) of $C_i$, and the collection of MBCs is denoted $\mathcal N_{C_i}$. A conceptual illustration of MB and MBC is given in Figure 1. To apply the variational message passing rules derived in II.C, the posterior probability can be reformulated as (c.f. (1))

$$P(\mathbf X\mid\mathbf y)\;\propto\;\prod_{C_i\in C}\Psi_{C_i}(\mathbf X_{C_i},\mathbf y_{C_i})\prod_{(C_i,C_j)}\Psi_{C_i,C_j}(\mathbf X_{C_i},\mathbf X_{C_j}), \quad (19)$$

where $\Psi_{C_i}(\mathbf X_{C_i},\mathbf y_{C_i})=\prod_{i\in C_i}\psi_i(X_i,y_i)\prod_{(i,j)\in E:\,i,j\in C_i}\psi_{ij}(X_i,X_j)$ collects the node potentials and edge compatibility functions within cluster $C_i$, while $\Psi_{C_i,C_j}(\mathbf X_{C_i},\mathbf X_{C_j})=\prod_{(i,j)\in E:\,i\in C_i,\,j\in C_j}\psi_{ij}(X_i,X_j)$ collects the compatibility functions of the connecting edges between clusters $C_i$ and $C_j$.
As derived in Appendix C, the inter-cluster update can be readily obtained for a gateway node $i$ in cluster $C_i$ as

$$y_i^{(n)}=y_i-\frac{\sigma_{N,i}^2}{H_i}\sum_{j\in MB(C_i)\cap\mathcal N_i}F_{ij}\,\big(W_j^{(n-1)}\big)^{-1}q_j^{(n-1)}. \quad (20)$$
Expression (20) admits an interesting interpretation: gateway nodes use the intra-cluster estimates of their neighbors in the Markov blanket $MB(C_i)$ to "update" their observations, and exploit these "new" observations, which encode the messages from other parts of the network (propagated through intra- and inter-cluster processing), in the next round of intra-cluster inference. The executions of intra- and inter-cluster updating need not alternate one-for-one; in practice it is found advantageous to run intra-cluster inference more often than the inter-cluster updates.
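The cluster-boundary notions above (Markov blanket, gateway nodes, MBCs) translate directly into code; the graph and clusters below are hypothetical illustrations, not from the paper.

```python
# A toy undirected graph and a two-cluster partition, for illustration only.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (1, 4)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

clusters = [{0, 1, 2}, {3, 4, 5}]

def markov_blanket(c):
    """Nodes outside cluster c connected to some node in c."""
    return {v for u in c for v in adj[u]} - c

def gateway_nodes(c):
    """Nodes inside c connected to some node in MB(c)."""
    mb = markov_blanket(c)
    return {u for u in c if adj[u] & mb}

def mbc(c, all_clusters):
    """Neighboring clusters containing part of MB(c)."""
    mb = markov_blanket(c)
    return [d for d in all_clusters if d is not c and d & mb]
```

On this toy graph, `markov_blanket({0, 1, 2})` returns `{3, 4, 5}`, and every node of the first cluster is a gateway node because each touches the blanket.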
C. Convergence Analysis
Convergence of Gaussian BP in loopy graphs has been actively studied in the literature ([15][3][16] and references therein). While a full understanding is still lacking, various sufficient conditions have been found, among which the pairwise-normalizable condition⁶ is noteworthy [16]. Here we will assume that this condition is satisfied, so the intra-cluster BP is guaranteed to converge. We can see from (16) that the inverse-variance updating of BP within clusters stands alone. It is also observed in our study that the inverse-variance iteration converges much faster than the message-mean iteration, so we can let it run first until the variances are sufficiently low (as clarified in the proof of Lemma 3.1 below). In the following, we provide an alternative proof of the convergence of the mean iteration, assuming that the variances in intra-cluster BP have already converged to some small values. This approach can be easily extended to the analysis of the overall convergence of SMF, and facilitates the study of the convergence rate.

⁶ That is, there exists a decomposition of the form (1) where both the node potential and edge compatibility functions are valid Gaussian distributions.
The message-mean iteration in Equation (16) can be rewritten with the conventional parameter pair $(\bar\mu_{j\leftarrow i},\sigma_{j\leftarrow i}^2)$ as (c.f. Footnote 4)

$$\bar\mu_{j\leftarrow i}^{(n)}=G_{j\leftarrow i}\big(\bar{\boldsymbol\mu}^{(n-1)}\big)=\frac{\sigma_{S,ij}\Big(H_iy_i/\sigma_{N,i}^2+\sum_{k\in\mathcal N_i\setminus\{j\}}\beta_{i\leftarrow k}\,\bar\mu_{i\leftarrow k}^{(n-1)}\Big)}{\beta_{j\leftarrow i}\,\sigma_{S,j}^2\Big[1+\sigma_{S,i}^2(1-\rho_{ij}^2)\Big(V_i+\sum_{k\in\mathcal N_i\setminus\{j\}}\beta_{i\leftarrow k}\Big)\Big]}, \quad (21)$$

where $\bar{\boldsymbol\mu}^{(n-1)}=\big[\bar\mu_{j\leftarrow i}^{(n-1)}\big]\in\mathbb R^{2|E|\times 1}$, $(i,j)\in E$.
Lemma 3.1: $G_{j\leftarrow i}(\cdot)$ is a contraction mapping⁷.
Proof: Let

$$K_k^{i\to j}=\frac{\sigma_{S,i}^2(1-\rho_{ij}^2)\,\beta_{i\leftarrow k}}{1+\sigma_{S,i}^2(1-\rho_{ij}^2)\Big(V_i+\sum_{k'\in\mathcal N_i\setminus\{j\}}\beta_{i\leftarrow k'}\Big)},\qquad
\gamma_{ij}=\frac{\sigma_{S,ij}}{\beta_{j\leftarrow i}\,\sigma_{S,i}^2\sigma_{S,j}^2(1-\rho_{ij}^2)},$$

$$\tilde y_{ij}=\frac{\sigma_{S,ij}\,H_iy_i/\sigma_{N,i}^2}{\beta_{j\leftarrow i}\,\sigma_{S,j}^2\Big[1+\sigma_{S,i}^2(1-\rho_{ij}^2)\Big(V_i+\sum_{k\in\mathcal N_i\setminus\{j\}}\beta_{i\leftarrow k}\Big)\Big]}.$$

Then Equation (21) can be reformulated as

$$\bar\mu_{j\leftarrow i}^{(n)}=G_{j\leftarrow i}\big(\bar{\boldsymbol\mu}^{(n-1)}\big)=\tilde y_{ij}+\gamma_{ij}\sum_{k\in\mathcal N(i)\setminus\{j\}}K_k^{i\to j}\,\bar\mu_{i\leftarrow k}^{(n-1)}. \quad (22)$$

Define $\bar K^{i\to j}=\sum_{k\in\mathcal N(i)\setminus\{j\}}K_k^{i\to j}$, $\mathbf V=\operatorname{diag}\big(\gamma_{ij}\bar K^{i\to j}\big)\in\mathbb R^{2|E|\times 2|E|}$, and $\tilde{\mathbf y}=[\tilde y_{ij}]\in\mathbb R^{2|E|\times 1}$. Further define a stochastic matrix $\hat{\mathbf P}_E\in\mathbb R^{2|E|\times 2|E|}$ with entries

$$[\hat{\mathbf P}_E]_{e,e'}=\begin{cases}K_{s(e')}^{\,s(e)\to d(e)}\big/\bar K^{\,s(e)\to d(e)}, & s(e)=d(e')\ \text{but}\ s(e')\neq d(e),\\[4pt] 0, & \text{otherwise},\end{cases} \quad (23)$$

where $s(e)$ and $d(e)$ denote the source and destination node of edge $e$. The iteration (22) can be written in vector-matrix form as (c.f. (14))

$$\bar{\boldsymbol\mu}_{\mathrm{BP}}^{(n)}=\tilde{\mathbf y}+\mathbf V\hat{\mathbf P}_E\,\bar{\boldsymbol\mu}_{\mathrm{BP}}^{(n-1)}, \quad (24)$$

which leads to

$$\|G(\bar{\boldsymbol\mu})-G(\bar{\boldsymbol\mu}')\|_\infty=\big\|\mathbf V\hat{\mathbf P}_E(\bar{\boldsymbol\mu}-\bar{\boldsymbol\mu}')\big\|_\infty\le\big\|\mathbf V\hat{\mathbf P}_E\mathbf 1\big\|_\infty\|\bar{\boldsymbol\mu}-\bar{\boldsymbol\mu}'\|_\infty=\big\|\mathbf V\mathbf 1\big\|_\infty\|\bar{\boldsymbol\mu}-\bar{\boldsymbol\mu}'\|_\infty\le\lambda\,\|\bar{\boldsymbol\mu}-\bar{\boldsymbol\mu}'\|_\infty \quad (25)$$

with $\lambda<1$. The last inequality comes from the assumption that, for sufficiently large $n$, $\sigma_{S,ij}^2\big/\big[\sigma_{S,i}^2\sigma_{S,j}^2(1-\rho_{ij}^2)\,\beta_{j\leftarrow i}^{(n)}\big]<1$, so that every diagonal entry of $\mathbf V$ is strictly less than one. Thus $G(\bar{\boldsymbol\mu})$ is a maximum-norm contraction mapping⁷ and hence has a unique fixed point. This proves the convergence of the mean iteration in Gaussian belief propagation.

⁷ A contraction mapping on a metric space $(M,d)$ is a function $G$ from $M$ to itself with the property that there is some nonnegative real number $\lambda<1$ such that, for all $\bar{\boldsymbol\mu}$ and $\bar{\boldsymbol\mu}'$ in $M$, $d\big(G(\bar{\boldsymbol\mu}),G(\bar{\boldsymbol\mu}')\big)\le\lambda\,d(\bar{\boldsymbol\mu},\bar{\boldsymbol\mu}')$.
□
Based on Lemma 3.1, we have:
Theorem 3.1 In a Gaussian MRF, the structured variational method using Gaussian BP as the
intra-cluster inference algorithm converges.
Proof: Taking inter-cluster updating into account, the change in observations (20) is reflected only in $y_i^{(n)}$, which cancels out in $G_{j\leftarrow i}(\boldsymbol\mu)-G_{j\leftarrow i}(\boldsymbol\mu')$ (c.f. (22)). Hence the proof of Lemma 3.1 still applies, and the conclusion follows.
□
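The contraction argument above can be illustrated numerically: any affine iteration $\boldsymbol\mu\leftarrow\mathbf y+\mathbf M\boldsymbol\mu$ with $\|\mathbf M\|_\infty<1$ converges geometrically to a unique fixed point (Banach's fixed-point theorem). The matrix and vector below are arbitrary toy values, not from the paper.

```python
# Toy affine iteration mu <- y + M mu with max-norm ||M||_inf < 1.
M = [[0.2, -0.3, 0.1],
     [0.0, 0.4, -0.2],
     [0.3, 0.1, 0.2]]
y = [1.0, -1.0, 0.5]

row_sums = [sum(abs(a) for a in row) for row in M]
lam = max(row_sums)          # contraction modulus, cf. lambda in (25)
assert lam < 1

def step(mu):
    return [y[i] + sum(M[i][j] * mu[j] for j in range(3)) for i in range(3)]

mu = [0.0, 0.0, 0.0]
for _ in range(100):
    mu = step(mu)

# mu is (numerically) the unique fixed point: mu = y + M mu.
err = max(abs(mu[i] - step(mu)[i]) for i in range(3))
assert err < 1e-12
```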
IV. CONVERGENCE RATE OF INTRA-CLUSTER INFERENCE
Although belief propagation is successfully employed in SMF for intra-cluster inference, its performance is still not fully understood. We first analyze the performance of the intra-cluster algorithm in this section, where we derive an upper bound on its convergence time for the 2-d torus; we then utilize this result in Section V to investigate the overall performance of SMF. Our analysis is mainly focused on 2-d tori, which capture the essence of planar networks; extension to other models, such as geometric random graphs, will be considered in our future work.
A. Vertex, Edge and Mixed Process
Both updates in (14) and (24) involve stochastic matrices which define irreducible and aperiodic Markov chains. The matrix $\hat{\mathbf P}_V$ in (14) is a $|V|\times|V|$ matrix defined on the vertex set (c.f. (15)), while $\hat{\mathbf P}_E$ in (24) turns out to be a $2|E|\times 2|E|$ matrix defined on the set of directed edges (c.f. (23)). We refer to the evolution of the corresponding Markov chains in these two schemes as the vertex process and the edge process, respectively.
Figure 2 illustrates the distinction between a vertex process and an edge process. As (a)
shows, the states in a vertex process are represented by nodes (the circles), while the allowable
(two-way) transition between the states is determined by the undirected edges. In contrast, the
states in an edge process are represented by the directed edges (the arrows in (b)), and the
transitions are guided by the directions that the arrows point to. More specifically, the transition
can only occur between edges which are connected but not directly opposing each other (i.e., between $e'$ and $e$ such that $s(e)=d(e')$ but $s(e')\neq d(e)$), dictated by the rule in BP that the
message from one neighbor of a node contributes to the new messages sent to other neighbors
but not back to itself (c.f. (21)). For structured variational methods, we constrain the edge
process only within clusters, and employ the vertex process to exchange information between
clusters. This leads to a mixed vertex-edge process model as shown in (c).
For a Markov chain $\hat{\mathbf P}$ with stationary distribution $\boldsymbol\pi$, the mixing time is defined as

$$t_{\mathrm{mix}}(\epsilon)=\max_i\,\inf\Big\{t:\tfrac12\big\|\hat{\mathbf P}^t(i,\cdot)-\boldsymbol\pi\big\|_1\le\epsilon\Big\}, \quad (26)$$

where $\hat{\mathbf P}^t(i,\cdot)$ is the $i$-th row of the $t$-step transition matrix, and $\|\cdot\|_1$ stands for the $l_1$ norm. Essentially, $t_{\mathrm{mix}}(\epsilon)$ specifies the (worst-case) time that $\hat{\mathbf P}$ takes to converge to the $\epsilon$-vicinity of its stationary distribution, considering all possible initial states.
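Definition (26) can be evaluated directly for small chains. The sketch below computes $t_{\mathrm{mix}}(\epsilon)$ for the lazy random walk on a cycle, a toy vertex process (the chain is illustrative, not from the paper):

```python
# Lazy random walk on a cycle of N nodes; stationary distribution is uniform.
N = 8
P = [[0.0] * N for _ in range(N)]
for i in range(N):
    P[i][i] = 0.5
    P[i][(i - 1) % N] = 0.25
    P[i][(i + 1) % N] = 0.25
pi = [1.0 / N] * N

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]

def t_mix(eps):
    """Smallest t with worst-case total-variation distance <= eps, cf. (26)."""
    Pt = [row[:] for row in P]
    t = 1
    while True:
        worst = max(0.5 * sum(abs(Pt[i][j] - pi[j]) for j in range(N))
                    for i in range(N))
        if worst <= eps:
            return t
        Pt = matmul(Pt, P)
        t += 1

tm = t_mix(0.01)
```

A looser tolerance is reached no later than a tighter one, i.e., `t_mix(0.25) <= t_mix(0.01)`.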
The convergence behavior of vertex processes has been well studied in the context of reversible Markov chains. In particular, it is not difficult to prove that the mixing time of a reversible Markov chain on a 2-d $n\times n$ torus is $t_{\mathrm{mix}}=O(n^2)$ [17], which characterizes the convergence time of vertex processes and thus of the variational method. However, the performance of edge processes, which in general involve non-reversible Markov chains, is still largely unexplored; and to the best of our knowledge, there is no formal convergence discussion on the mixed model.
B. Convergence Rate of Edge Process

To facilitate our discussion, we adopt the following labeling to represent an edge process on an $n \times n$ torus, as shown in Figure 3. Specifically, with node $(1,1)$ on the bottom-left corner and node $(n,n)$ on the top-right corner, the four outgoing edges of node $(i,j)$ pointing in the East, North, West and South directions are respectively labeled $(i, j)$, $(j, -i)$, $(-i, -j)$ and $(-j, i)$. A state is only allowed to make a 90-degree turn with probability $q(n)/n$ (e.g., from $(i, j)$ to $(j, -(i+1))$ or $(-j, i+1)$), or to move forward with probability $1 - 2q(n)/n$ (e.g., from $(i, j)$ to $(i+1, j)$); it cannot backtrack (e.g., the transition from $(i, j)$ to $(-(i+1), -j)$ is forbidden).
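The turning rules above can be sketched in code as follows. The wrap-around helper `inc` reflects our reading of the modulo-2n convention in footnote 9 (a coordinate label cycles within its sign class), so treat it as an assumption rather than the paper's exact implementation.

```python
import random

def inc(x, n):
    """Advance a coordinate label by one step, wrapping within its sign
    class: positive labels cycle 1..n, negative labels cycle -n..-1
    (our reading of the modulo-2n convention in footnote 9)."""
    if x == n:
        return 1
    if x == -1:
        return -n
    return x + 1

def step(state, n, q, rng):
    """One step of the edge process: left or right turn with probability
    q/n each, forward move otherwise; backtracking never occurs."""
    s0, s1 = state
    u = rng.random()
    if u < q / n:            # turn left:  (s0, s1) -> (s1, -(s0 + 1))
        return s1, -inc(s0, n)
    if u < 2 * q / n:        # turn right: (s0, s1) -> (-s1, s0 + 1)
        return -s1, inc(s0, n)
    return inc(s0, n), s1    # forward:    (s0, s1) -> (s0 + 1, s1)

rng = random.Random(0)
n, q = 8, 2.0
state = (1, 1)               # start on the East edge of node (1, 1)
for _ in range(10_000):
    state = step(state, n, q, rng)
    # States always stay in {-n,...,-1,1,...,n}^2.
    assert state[0] != 0 and abs(state[0]) <= n
    assert state[1] != 0 and abs(state[1]) <= n
print(state)
```

One can check against the labeling above that a left turn maps an East edge to a North edge, North to West, West to South, and South to East, as expected.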
A conjecture is put forth in [3] that the convergence time of consensus propagation (or belief propagation) on a 2-d $n \times n$ torus is $O(n^{3/2})$. In this section, we verify this conjecture by deriving a slightly more general upper bound for the mixing time of edge processes on a 2-d torus, assuming a turning probability $q(n)/n$, where $q(n)$ satisfies $\lim_{n \to \infty} q(n) = \infty$ and $q(n)/n \le 1/3$. For a normal BP algorithm, we have $q(n)/n = c$ for some constant $c \le 1/3$.8 We begin by citing a result from [18], which applies to general Markov chains.
Lemma 4.1 For any irreducible and aperiodic Markov chain $\hat{P}$ with stationary distribution $\pi$,

$t_{\mathrm{mix}}(\epsilon) \le \left\lceil \log(\epsilon^{-1}) / \log\left( (1-c)^{-1} \right) + 1 \right\rceil t_{\mathrm{fill}}(c)$,   (27)

where $t_{\mathrm{fill}}(c) = \max_i \inf\left\{ t : \hat{P}^t(i, \cdot) \ge c\, \pi \right\}$, $0 < c < 1$.
As mentioned above, a two-tuple $(s_0, s_1) \in \{-n, \ldots, -1, 1, \ldots, n\}^2$ is used to represent the states of the edge process. It can be verified that the state evolution (whether horizontal or vertical) admits:

Moving forward: $s_0^{i+1} = s_0^i + 1$, $s_1^{i+1} = s_1^i$,   (28)
Turning left: $s_0^{i+1} = s_1^i$, $s_1^{i+1} = -(s_0^i + 1)$,   (29)
Turning right: $s_0^{i+1} = -s_1^i$, $s_1^{i+1} = s_0^i + 1$.   (30)
Without loss of generality, assume the random walk starts from state $(s_0^0, s_1^0) = (a, b)$. Let $T_0 = 0, T_1, T_2, \ldots$ be the time instances at which the random walk makes turns, and let $D_i \in \{\mathrm{L}, \mathrm{R}\}$ be the corresponding turning direction (Left or Right) at time $T_i$. Then for $t \in [T_k, T_{k+1})$, with $k$ being the total number of turns before time $t$, the destination state evolves as9

$(s_0^t, s_1^t) = \begin{cases} \left( s_1^{T_{k-1}} + t - T_k,\; -(s_0^{T_{k-1}} + T_k - T_{k-1}) \right) & D_k = \mathrm{L} \\ \left( -s_1^{T_{k-1}} + t - T_k,\; s_0^{T_{k-1}} + T_k - T_{k-1} \right) & D_k = \mathrm{R}. \end{cases}$   (31)

Clearly, the number of possible destination states grows with the number of turns made. For example, when $t \in [T_1, T_2)$,

$(s_0^t, s_1^t) = \begin{cases} \left( b + t - T_1,\; -(a + T_1) \right) & D_1 = \mathrm{L} \\ \left( -b + t - T_1,\; a + T_1 \right) & D_1 = \mathrm{R}; \end{cases}$   (32)

8 The reader is referred to [10][18] for the construction of non-reversible chains that mix even faster, where $q(n)$ is a constant. 9 All the summations in (31)-(35) should be understood as modulo-2n operations, taking values in $\{-n, \ldots, -1, 1, \ldots, n\}$.
and when $t \in [T_2, T_3)$,

$(s_0^t, s_1^t) = \begin{cases} \left( t - T_2 - T_1 - a,\; -(T_2 - T_1) - b \right) & (D_1, D_2) = (\mathrm{L}, \mathrm{L}) \\ \left( t - T_2 + T_1 + a,\; (T_2 - T_1) + b \right) & (D_1, D_2) = (\mathrm{L}, \mathrm{R}) \\ \left( t - T_2 + T_1 + a,\; -(T_2 - T_1) + b \right) & (D_1, D_2) = (\mathrm{R}, \mathrm{L}) \\ \left( t - T_2 - T_1 - a,\; (T_2 - T_1) - b \right) & (D_1, D_2) = (\mathrm{R}, \mathrm{R}). \end{cases}$   (33)

Generally, with $k$ total turns, the destination state has $2^k$ possibilities; when $k$ is even, it is given by

$s_0^t = (t - T_k) \pm (T_{k-1} - T_{k-2}) \pm \cdots \pm T_1 \pm a$,
$s_1^t = \pm (T_k - T_{k-1}) \pm \cdots \pm (T_2 - T_1) \pm b$;   (34)

and when $k$ is odd, it is given by

$s_0^t = \pm b + (t - T_k) \pm (T_{k-1} - T_{k-2}) \pm \cdots \pm (T_2 - T_1)$,
$s_1^t = \pm (T_k - T_{k-1}) \pm \cdots \pm (T_3 - T_2) \pm T_1 \pm a$,   (35)

where the plus/minus signs are determined by the turning directions.
By symmetry, the stationary distribution of this Markov chain is uniform, with probability $1/(4n^2)$ per state. As dictated by Lemma 4.1, we need to find the minimum time $t$ such that $\Pr\left( (s_0^t, s_1^t) = (x, y) \right) \ge c/n^2$ for any $(x, y) \in \{-n, \ldots, -1, 1, \ldots, n\}^2$ and some constant $c$, regardless of the initial state. Intuitively, both $s_0^t$ and $s_1^t$ in (34) and (35) are sums of independent geometric random variables, which allows us to examine the final-state probability via the Central Limit Theorem. This is done in Lemma 4.3 below, which requires a technical result, Lemma 4.2, to simplify the analysis.
Lemma 4.2 With high probability (w.h.p.),10 there exists a constant $c_1 > 0$ such that there are at least $\lceil q(n)^{3/2} \rceil$ positive odd integers (time indices) $i$ satisfying

$T_i - T_{i-1} \le \lfloor n / q(n)^{3/4} \rfloor$ and $T_{i+1} - T_i \le \lfloor n / q(n)^{3/4} \rfloor$

in the first $c_1 \sqrt{q(n)}\, n$ steps, where $\lfloor \cdot \rfloor$ denotes the floor function, which returns the closest integer smaller than the argument, and $\lceil \cdot \rceil$ denotes its counterpart, the ceiling function.

10 The probability approaches 1 as $n \to \infty$.
Proof: See Appendix D.
Lemma 4.2 essentially reveals some regularity in the turning times; in particular, some intervals between them are well bounded. Divide the whole turning-time index set $\{1, 2, \ldots\}$ into two subsets $S_1$ and $S_2$, where $S_1$ is the set of positive odd integers $i$ such that $T_i - T_{i-1} \le \lfloor n / q(n)^{3/4} \rfloor$ and $T_{i+1} - T_i \le \lfloor n / q(n)^{3/4} \rfloor$, as in Lemma 4.2, and $S_2$ contains the rest. For our purpose, it is sufficient to find a lower bound on the conditional probability $\Pr\left( (s_0^t, s_1^t) = (x, y) \mid T_{S_2} \right)$ for any given set $T_{S_2} = \{T_j\}_{j \in S_2}$.11 This allows us to focus on the more regular random variables $\left( \pm(T_i - T_{i-1}),\; \pm(T_{i+1} - T_i) \right)$ in (34) and (35) specified by $i \in S_1$. We thus can derive:
Lemma 4.3 There exists a constant $c > 0$ such that if $t \ge c_1 \sqrt{q(n)}\, n$,

$\Pr\left( (s_0^t, s_1^t) = (x, y) \mid T_{S_2} \right) \ge c / n^2$,   (36)

for any $(x, y) \in \{-n, \ldots, -1, 1, \ldots, n\}^2$ and any $T_{S_2} = \{T_j\}_{j \in S_2}$, w.h.p.
Proof: See Appendix E.
Combining Lemma 4.1 and Lemma 4.3, we obtain the following conclusion:

Theorem 4.1 On a 2-d $n \times n$ torus, the mixing time of an edge process with turning probability $q(n)/n$ is $O(\sqrt{q(n)}\, n)$ w.h.p.

As a result, the convergence time of consensus propagation [3] and of our intra-cluster BP inference (16) is $O(n^{3/2})$, with $q(n)/n = c$ for some constant $c$ in the worst case. This is verified in Figure 4, where the mixing times with $\epsilon = 0.01$ of the vertex process and the edge process (with $q(n)/n = 1/3$) are simulated. It is observed that the two curves fit well with $O(n^2)$ and $O(n^{3/2})$, respectively.

11 Our goal is then achieved by the total probability theorem.
V. CONVERGENCE RATE OF STRUCTURED VARIATIONAL METHODS

It has been shown in Section IV.A that the performance of structured variational methods is governed by a mixed vertex-edge process, or equivalently a hybrid Markov chain model. The complexity of this model precludes direct application of standard techniques in the literature. In this section, we explore the coupling technique [11] to analyze this model.
Coupling provides a simple and elegant way of bounding the mixing time, and is not tied to reversibility. Essentially, a coupling of Markov chains is a process $\{X_t, Y_t\}_{t \ge 0}$ with the property that both $\{X_t\}$ and $\{Y_t\}$ are Markov chains with the same transition matrix $\hat{P}$ of interest, but typically with different starting states. Once the two chains meet at one state, they stay together at all times after that, i.e., if $X_{t'} = Y_{t'}$, then $X_t = Y_t$ for $t \ge t'$. For starting states $x$ and $y$, let

$T^{x,y} = \inf\left\{ t : X_t = Y_t \mid X_0 = x, Y_0 = y \right\}$;   (37)

then the coupling time is defined as

$t_{\mathrm{couple}} = \max_{x,y} E\left( T^{x,y} \right)$,   (38)

which can serve as an upper bound for the mixing time according to the Coupling Lemma [11]:

$t_{\mathrm{mix}}(\epsilon) \le t_{\mathrm{couple}} \ln(\epsilon^{-1})$.   (39)
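To make the coupling idea concrete, the following toy sketch couples two lazy random walks on a cycle (a much simpler chain than the mixed process analyzed here) by letting exactly one chain move at a time until they meet. The chain and the particular coupling are our own illustrative choices.

```python
import random

def coupled_step(x, y, n, rng):
    """One step of a coupling for the lazy random walk on an n-cycle:
    while apart, exactly one of the two chains moves (+1 or -1; the
    other holds); once they meet, they move together forever.
    Marginally, each chain is the lazy walk (hold w.p. 1/2)."""
    if x == y:
        d = rng.choice((-1, 1)) if rng.random() < 0.5 else 0
        return (x + d) % n, (y + d) % n
    d = rng.choice((-1, 1))
    if rng.random() < 0.5:
        return (x + d) % n, y
    return x, (y + d) % n

def coupling_time(x0, y0, n, rng, t_max=1_000_000):
    """Steps until the two coupled chains first meet, c.f. T^{x,y} in (37)."""
    x, y, t = x0, y0, 0
    while x != y and t < t_max:
        x, y = coupled_step(x, y, n, rng)
        t += 1
    return t

rng = random.Random(1)
n = 16
# Monte Carlo estimate of E[T^{x,y}] for the worst (antipodal) start, c.f. (38).
est = sum(coupling_time(0, n // 2, n, rng) for _ in range(200)) / 200
print(est)
```

The difference of the two walks performs a simple random walk on the cycle, so the estimated coupling time scales on the order of $n^2$, which by (39) recovers the familiar $O(n^2)$ mixing bound for this vertex process.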
We assume an $n \times n$ torus is equally divided into $s^2$ clusters, each of size $(n/s) \times (n/s)$, and consider a vertex-edge process on it as indicated in Figure 2(c). Then, by investigating the coupling time of two random walks on such a clustered graph, we can obtain a characterization of the mixing time. First, we study how quickly an edge process can "escape" from a 2-d torus (or how long it can stay in the torus before hitting any outgoing edges on the boundaries), and obtain the following result:
Lemma 5.1 On a 2-d $n \times n$ torus, the average staying time of an edge process is upper-bounded by $O(\sqrt{q(n)}\, n)$ w.h.p.
Proof: See Appendix F.
Using this result, the mixing time of a mixed vertex-edge process can be characterized as
follows:
Theorem 5.1 On a 2-d $n \times n$ torus, the mixing time of a mixed vertex-edge process with equal cluster size $(n/s) \times (n/s)$ is $O(\sqrt{s}\, n^{3/2})$ w.h.p.

Proof: Suppose two random walks start from two randomly selected points in this clustered graph. The coupling process can be described as follows: first, the two random walks wander inside their respective clusters (edge process) until they hit the gateway nodes and exit the starting clusters. From then on, they roam over the network, repeatedly entering and leaving clusters, and finally arrive at the same cluster. This journey, at a high level, can be regarded as a vertex process on an $s \times s$ torus of "mega-vertices", which takes $O(s^2)$ steps to couple. At each mega-vertex (cluster), the average staying time is $O((n/s)^{3/2})$ according to Lemma 5.1. In addition, we need to consider the following scenario: even after the two walks reach the same cluster, one of them may leave early, by hitting gateway nodes before coupling with the other walk; the probability of such an event is denoted $p$. In this case, the above journey is repeated.
To evaluate $p$, let $a$ be the state of the random walk that entered the cluster earlier, $x$ the state of the second random walk that has just stepped into the cluster, and $z$ a boundary node of the cluster. It has been shown [19] that the probability that, starting from node $x$, a random walk hits node $z$ before it hits $a$ is given by the ratio of the effective resistance12 between $a$ and $x$ to that between $a$ and $z$:

$p = P_x\left( \tau_{\mathrm{hit},z} < \tau_{\mathrm{hit},a} \right) = \frac{R(a \leftrightarrow x)}{R(a \leftrightarrow z)}$.   (40)
12 The effective resistance between node i and j is the expected number of traversals in a random walk starting at i and ending in j [19].
Since both $x$ and $z$ are boundary states, and we have no further information about $a$ except to assume it is uniformly and randomly located inside the cluster, $R(a \leftrightarrow x)$ and $R(a \leftrightarrow z)$ are of the same order, and so $p$ is a constant.
The total time to couple the two random walks is then given by

$t_{\mathrm{couple}} = \sum_{i=1}^{\infty} i \left[ O(s^2)\, O\!\left( (n/s)^{3/2} \right) + O\!\left( (n/s)^{3/2} \right) \right] (1-p)\, p^{i-1}$
$\qquad\quad = O(s^2)\, O\!\left( (n/s)^{3/2} \right) / (1-p) + O\!\left( (n/s)^{3/2} \right) / (1-p)$,   (41)

where the first term gives the total roaming time among the clusters, while the second term corresponds to the staying time of the two random walks in the same cluster.

We thus have

$t_{\mathrm{mix}}(\epsilon) \le t_{\mathrm{couple}} \ln(\epsilon^{-1}) = O\!\left( s^2 (n/s)^{3/2} \right) = O\!\left( \sqrt{s}\, n^{3/2} \right)$.   (42) □
Note that Theorem 5.1 includes the previous results for the vertex process ($s = n$) and the edge process ($s = 1$) as two special cases.
VI. PERFORMANCE-COMPLEXITY TRADEOFF IN STRUCTURED VARIATIONAL METHODS

Theorem 5.1 indicates that the inference performance decreases with $s$ (i.e., improves with the cluster size). To see how clustering affects the message complexity, note that for the total of $s^2$ clusters, each cluster has $(n/s)^2$ nodes and thus $4(n/s)^2$ directed edges. On each directed edge, two metrics, the message mean and variance, are exchanged in the BP algorithm, except for the $4(n/s)$ outgoing edges that cross the cluster boundaries; on these cross-cluster edges, only the estimated means are needed for inter-cluster updating. So the message complexity per iteration in the SMF algorithm is given by

$s^2 \left[ 2 \left( 4(n/s)^2 - 4(n/s) \right) + 4(n/s) \right] = O\!\left( 8n^2 - 4ns \right)$.   (43)
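The count in (43) can be reproduced directly. The helper below is our own arithmetic sketch of the per-iteration message count; it assumes, as in the text, that intra-cluster directed edges carry two metrics while cross-boundary edges carry only a mean.

```python
def smf_messages_per_iter(n, s):
    """Messages per iteration in SMF on an n x n torus split into s^2
    clusters of (n/s) x (n/s) nodes, per (43)."""
    assert n % s == 0, "cluster side n/s must be an integer"
    m = n // s
    intra = 4 * m * m - 4 * m   # directed edges staying inside a cluster (2 metrics each)
    cross = 4 * m               # directed edges leaving the cluster (1 metric each)
    per_cluster = 2 * intra + cross
    return s * s * per_cluster  # equals 8*n*n - 4*n*s

print(smf_messages_per_iter(12, 3))  # 8*12*12 - 4*12*3 = 1008
```

The two extremes match the text: $s = 1$ (pure BP on the torus) gives $8n^2 - 4n$ messages, while $s = n$ (pure MF, single-node clusters) gives $4n^2$.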
Comparing (42) and (43), we can observe the performance-complexity tradeoff inherent in SMF: as the cluster size increases, more accurate inference is performed over more nodes; this results in faster convergence, but also inevitably increases the communication burden.
The inherent tradeoff of SMF is further explored through simulations. We consider Gaussian MRF estimation, and adopt a simulation setting similar to that in [20],13 with 150 nodes uniformly and randomly distributed in a unit plane. The estimation mean square error (MSE) is used as the comparison metric, defined by

$\mathrm{MSE} = \| \hat{\mathbf{x}}^m - \hat{\mathbf{x}} \|^2 / \| \hat{\mathbf{x}} \|^2$,   (44)

with $\hat{\mathbf{x}}^m$ denoting the estimation vector at the $m$-th iteration, and $\hat{\mathbf{x}}$ being the exact estimate (the MMSE solution).
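The normalized error metric (44) is straightforward to compute; the sketch below is a minimal NumPy version, with our own toy vectors standing in for the iterates and the MMSE solution.

```python
import numpy as np

def relative_mse(x_m, x_exact):
    """Relative estimation MSE per (44): ||x^m - x||^2 / ||x||^2."""
    return np.linalg.norm(x_m - x_exact) ** 2 / np.linalg.norm(x_exact) ** 2

# Toy example: a perturbed estimate against the exact solution.
x = np.array([1.0, 2.0, -1.0])
print(relative_mse(x + 0.1, x))
```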
Two clustering schemes are considered in the simulation. One is a centralized scheme [21],
which uses semi-definite programming to solve a relaxation of the min-cut problem14 with equal
cluster size constraint, and the other is our proposed distributed clustering algorithm in [14],
which seeks to approximate the min-cut solution in a greedy manner.
Figure 5 compares the convergence rate of BP, MF, and SMF – with the cluster size limit of 3
and 8, respectively. The results verify that BP converges the fastest. But we also observe that even
when the network is divided into very small clusters (of size 3), SMF can still significantly speed
up the convergence; in this case there is not much increase in computational complexity as
compared to MF. A reasonable cluster size (of 8) achieves almost indistinguishable performance
to that of BP. It is also observed that our distributed clustering scheme yields results close to the
centralized one, but with a much lower computational overhead.
Another important consideration in wireless networks is the energy consumption, which is
typically dominated by the communication energy. Comparison of the three approaches in this
aspect is illustrated in Figure 6, where the total number of exchanged messages till reaching the
13 see Section V.A in [20] 14 The edge weight is set proportional to the correlation between two end variables.
corresponding MSE is used as an indicator for communication energy. Note that compared to MF,
BP exchanges more messages per round but converges faster. It is interesting to observe that for
this simulation setting15 SMF consumes the least communication energy to obtain the same
estimation accuracy, which indicates its potential superiority in practice.
The above analysis and simulations indicate that an appropriate cluster size should be selected for SMF depending on the application, to achieve a good balance among estimation accuracy, convergence rate, computational complexity, and energy efficiency. How to select an appropriate cluster size in practice is worth further investigation. If the anticipated operational environment is stable, we may conduct off-line simulations or experiments such as those in Figures 4 and 5 to determine the best cluster size. In more dynamic scenarios, we can resort to a cross-layer approach to facilitate re-clustering. For example, when there is a demand for higher quality
of service from the application layer, the inference module will notify the clustering module to re-
cluster with a larger cluster size. This can be easily accommodated through our distributed
clustering scheme in [14], which has a procedure to merge clusters of smaller sizes into larger
ones till the designated conditions are satisfied. Similarly, a split of current clusters can be
initiated when a smaller cluster size is desired.
VII. CONCLUSIONS AND FUTURE WORK

In this paper, we develop a general variational message passing framework for distributed
inference in Markov random fields. Structured variational methods are explored to achieve a nice
tradeoff between system performance and complexity. We also investigate the convergence
performance of our proposed structured variational methods for distributed inference. We adopt
a direct probabilistic approach to analyze the convergence of an edge process which models the
intra-cluster processing, and devise a coupling process to characterize the overall performance
concerning a more complicated mixed vertex-edge process. We expect both the results obtained
and the analytical tools developed in this work can be applied to other similar problems in
15 This result should not be viewed as contradictory to (43), which only describes the scaling behavior.
wireless networks. In particular, the methodologies developed for the mixing behavior of the edge process and the mixed process do not seem to be limited to the Gaussian assumption, and should enjoy wider applicability.
In this study, the quality of distributed inference at convergence has been neglected. This is a
very challenging research topic, and any progress in this area will likely lead to significant
impact. Relevant works along this line include [15], [25] and [26]. Another direction for future
work is to further consider the impact of channel uncertainties and communication constraints
[26][27]. One recent work [28] is also worth mentioning, where a hidden variable is introduced
to decouple the dependence among observations, for the purpose of simplifying the optimal local
decision rules in distributed detection. The feasibility of incorporating this idea into our
variational processing framework deserves further exploration.
APPENDIX

A. Derivation of Gaussian Variational Message Passing Eq. (13)

Consider the information representation of the posterior distribution given in (12). We can find a factorization of the form (1), with the individual node potential functions and edge compatibility functions given by

$\psi_i(X_i, y_i) = \exp\left( v_i(y_i) X_i - \tfrac{1}{2} F_{ii} X_i^2 \right)$, and $\psi_{ij}(X_i, X_j) = \exp\left( -\tfrac{1}{2} [X_i\ X_j] \begin{bmatrix} 0 & F_{ij} \\ F_{ij} & 0 \end{bmatrix} \begin{bmatrix} X_i \\ X_j \end{bmatrix} \right)$,   (45)

where $v_i(y_i) = H_i y_i / \sigma_{N,i}^2$. Defining an extended sufficient statistics vector $\eta_i(X_i) = [X_i, X_i^2]^T$, the corresponding extended natural parameters in (6) and (7) are respectively given by

$\theta_i = \left[ v_i,\; -\tfrac{1}{2} F_{ii} \right]^T$ and $\theta_{ij} = \left[ -F_{ij} X_j,\; 0 \right]^T$.   (46)

The MF variational distribution $Q_i(X_i)$ can be parameterized with the same sufficient statistics, and the corresponding natural parameter $\theta_{i,v}$ is given by $\left[ \mu_{\mathrm{MF},i} / \sigma_{\mathrm{MF},i}^2,\; -1/(2\sigma_{\mathrm{MF},i}^2) \right]^T$. Applying the variational message passing rule and after some algebra, we obtain the updating form given in (13).
B. Derivation of Gaussian Belief Propagation Eq. (16) and (18)

The message passing rules of belief propagation are derived on a spanning tree of the network.16 In this setting, the posterior probability (12) assumes the following form, composed of the prior marginals $p_i(x_i)$ and pairwise joint distributions $p_{ij}(x_i, x_j)$, as well as the marginal conditional distributions $f_i(y_i \mid x_i)$ (viewed as functions of $x_i$):

$P(\mathbf{X} \mid \mathbf{y}) \propto \prod_{i \in V} f_i(y_i \mid x_i) \cdot \frac{\prod_{(i,j) \in E} p_{ij}(x_i, x_j)}{\prod_{i \in V} p_i(x_i)^{|\mathcal{N}_i| - 1}}$,   (47)

where $\mathcal{N}_i$ denotes the neighbor set of node $i$. In the above expression, $f_i(y_i \mid x_i) \sim \mathcal{N}^{-1}\!\left( H_i y_i / \sigma_{N,i}^2,\; H_i^2 / \sigma_{N,i}^2 \right)$, $p_i(x_i) \sim \mathcal{N}^{-1}\!\left( 0,\; 1/\sigma_{S,i}^2 \right)$, and $p_{ij}(x_i, x_j) \sim \mathcal{N}^{-1}(\mathbf{0}, \Omega_{ij})$, $(i,j) \in E$, where

$\Omega_{ij} = \frac{1}{1 - \rho_{ij}^2} \begin{bmatrix} 1/\sigma_{S,i}^2 & -\rho_{ij}/(\sigma_{S,i} \sigma_{S,j}) \\ -\rho_{ij}/(\sigma_{S,i} \sigma_{S,j}) & 1/\sigma_{S,j}^2 \end{bmatrix}$.

Note that with the information parameter representation, if $p_1(x) \sim \mathcal{N}^{-1}(\beta_1, \Sigma_1)$ and $p_2(x) \sim \mathcal{N}^{-1}(\beta_2, \Sigma_2)$ are two different distributions on the same Gaussian random vector $x$, then the product density

$p_{12}(x) \propto p_1(x)\, p_2(x) \sim \mathcal{N}^{-1}(\beta_{12}, \Sigma_{12})$   (48)

with $\beta_{12} = \beta_1 + \beta_2$ and $\Sigma_{12} = \Sigma_1 + \Sigma_2$. Similarly, the quotient $p_1(x) / p_2(x)$ produces an exponential quadratic form with parameters $(\beta_1 - \beta_2,\; \Sigma_1 - \Sigma_2)$, which defines a valid probability density when $\Sigma_1 - \Sigma_2$ is positive definite. Moreover, if $x_1$ and $x_2$ are two joint
16 Various forms have been developed in literature (e.g., [15]). We provide a derivation for the specific forms of (16) and (18) here for completeness, which will also facilitate the discussion in Section III.
Gaussian random vectors with distribution $p(x_1, x_2) \sim \mathcal{N}^{-1}\!\left( \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \right)$, the marginal distribution $p(x_1) \sim \mathcal{N}^{-1}(\hat{\beta}_1, \hat{\Sigma}_1)$ is also Gaussian, with information parameters given by

$\hat{\beta}_1 = \beta_1 - \Sigma_{12} \Sigma_{22}^{-1} \beta_2$,
$\hat{\Sigma}_1 = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}$.   (49)
Comparing (47) with the standard Hammersley-Clifford equation (1), we have the node potential function

$\psi_i(X_i) = \mathcal{N}^{-1}(\beta_i, V_i)$,   (50)

where the parameters are given in (17), and the edge compatibility function

$\psi_{ij}(X_i, X_j) = \mathcal{N}^{-1}(\mathbf{0}, \Omega_{ij})$.   (51)
Consider the general message updating and belief computation formulas in belief propagation [1][12]:

$m_{ij}^{(n)}(x_j) = \int_{x_i} \psi_{ij}(x_i, x_j)\, \psi_i(x_i) \prod_{k \in \mathcal{N}_i \setminus \{j\}} m_{ki}^{(n-1)}(x_i)\, dx_i$,  $\quad b_i^{(n)}(x_i) = \psi_i(x_i) \prod_{k \in \mathcal{N}_i} m_{ki}^{(n)}(x_i)$.

By applying the product rule (48) to the message updating equation, we have the integrand

$\psi_{ij}(x_i, x_j)\, \psi_i(x_i) \prod_{k \in \mathcal{N}_i \setminus \{j\}} m_{ki}^{(n-1)}(x_i) \sim \mathcal{N}^{-1}(\mathbf{I}_{ij}, \mathbf{J}_{ij})$,   (52)

where

$\mathbf{I}_{ij} = \begin{bmatrix} \beta_i + \sum_{k \in \mathcal{N}_i \setminus \{j\}} \beta_{ki}^{(n-1)} \\ 0 \end{bmatrix}$,   (53)

$\mathbf{J}_{ij} = \begin{bmatrix} V_i + \sum_{k \in \mathcal{N}_i \setminus \{j\}} V_{ki}^{(n-1)} + \dfrac{1}{(1 - \rho_{ij}^2)\sigma_{S,i}^2} & \dfrac{-\rho_{ij}}{(1 - \rho_{ij}^2)\sigma_{S,i}\sigma_{S,j}} \\ \dfrac{-\rho_{ij}}{(1 - \rho_{ij}^2)\sigma_{S,i}\sigma_{S,j}} & \dfrac{1}{(1 - \rho_{ij}^2)\sigma_{S,j}^2} \end{bmatrix}$.   (54)

Using the marginalization rule (49) to perform the integration over $x_i$, we can obtain the message updating rule (16). The belief updating rule (18) follows more directly from the product rule (48).
C. Derivation of Inter-cluster Variational Message Passing Eq. (20)

As in Appendix A, define the extended sufficient statistics for the cluster $C_i$ as

$\eta_{C_i}(\mathbf{X}_{C_i}) = \left[ \{X_i\},\; \{X_i^2\},\; \{X_i X_j\} \right]^T$, $\quad i, j \in C_i$,   (55)

where $\{X_i\}$ denotes the collection of all $X_i$'s in cluster $C_i$, and the other two terms are defined similarly. Referring to (19) and (45), the corresponding extended natural parameters of the cluster potential and cluster compatibility functions are given by

$\theta_{C_i} = \left[ \{v_i\},\; \{-\tfrac{1}{2} F_{ii}\},\; \{-F_{ij}\} \right]^T$, $i, j \in C_i$, and $\theta_{C_i C_j} = \left[ \{-F_{ij} X_j\},\; \{0\},\; \{0\} \right]^T$, $i \in C_i$, $j \in C_j$,   (56)

where the involved parameters are the same as in (46). Considering the structure of $\theta_{C_i C_j}$ above, we only need to focus on the first component when applying the variational message passing rule in Section II.C. In particular, the messages are exchanged between gateway nodes and their Markov blankets, taking the form $E_{Q_j^{(n-1)}}[\theta_{C_i C_j}]$ according to (9); then, following the spirit of (10), such messages are used together with the parameters $v_i = H_i y_i / \sigma_{N,i}^2$ of the gateway nodes to update the corresponding variational distributions. Recall that in our structured variational method, the variational distribution is assumed to take the form $Q = \prod_{i=1}^{s} Q_i(\mathbf{X}_{C_i})$, and Gaussian belief propagation is adopted for intra-cluster inference. Therefore, the variational message passing rule indicates the following interaction between neighboring clusters: the intra-cluster inference in the previous round provides a mean estimate for all the variables in the Markov blanket of cluster $C_i$, i.e., $E_{Q_j^{(n-1)}}[X_j] = \left( W_j^{(n-1)} \right)^{-1} q_j^{(n-1)}$, $j \in \mathcal{MB}(C_i)$ (c.f. (18)). This estimate is then used to construct a "new" observation $y_i^{(n)}$, as given in (20), to stimulate a new round of intra-cluster inference through (16)-(18).
D. Proof of Lemma 4.2

Notice that the differences $T_{i+1} - T_i$ are i.i.d. geometric random variables with parameter $2q(n)/n$. We can readily compute the probability

$p = \Pr\left( T_i - T_{i-1} \le \lfloor n/q(n)^{3/4} \rfloor,\; T_{i+1} - T_i \le \lfloor n/q(n)^{3/4} \rfloor \right) = \left( 1 - \left( 1 - 2q(n)/n \right)^{\lfloor n/q(n)^{3/4} \rfloor} \right)^2$,   (57)

which is non-vanishing as $n \to \infty$ (for all $q(n) = O(n)$). Choose a constant $c_2 \ge 1/p$ and consider a group of $\lceil 2c_2 q(n)^{3/2} \rceil$ positive odd integers. By the Chernoff bound, the probability that there are at least $\lceil q(n)^{3/2} \rceil$ positive odd integers $i$ satisfying $T_i - T_{i-1} \le \lfloor n/q(n)^{3/4} \rfloor$ and $T_{i+1} - T_i \le \lfloor n/q(n)^{3/4} \rfloor$ in such a group is at least $1 - \exp\left( -\tfrac{1}{2}\left( 1 - \tfrac{1}{2c_2 p} \right)^2 2c_2 p\, q(n)^{3/2} \right)$, which approaches 1 as $n$ goes to infinity. Finally, consider a constant $c_1 \ge 2(c_2 + 1)$; we claim that in the first $c_1 \sqrt{q(n)}\, n$ steps there are at least $\lceil 2c_2 q(n)^{3/2} \rceil$ turns w.h.p. To prove this claim, note that the random variable $T_{\lceil 2c_2 q(n)^{3/2} \rceil}$ is the sum of $\lceil 2c_2 q(n)^{3/2} \rceil$ i.i.d. geometric random variables, each with mean $n/(2q(n))$ and variance between $n^2/(12 q(n)^2)$ and $n^2/(4 q(n)^2)$. By Chebyshev's inequality,

$\Pr\left( T_{\lceil 2c_2 q(n)^{3/2} \rceil} \ge c_1 \sqrt{q(n)}\, n \right) \le \frac{\mathrm{var}\left( T_{\lceil 2c_2 q(n)^{3/2} \rceil} \right)}{\left( c_1 \sqrt{q(n)}\, n - E\, T_{\lceil 2c_2 q(n)^{3/2} \rceil} \right)^2} \le \frac{2c_2 q(n)^{3/2} \cdot n^2 / (4 q(n)^2)}{\left( (c_2 + 2) \sqrt{q(n)}\, n \right)^2} \to 0$.   (58)
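The probability in (57) can be checked numerically. The sketch below evaluates it under the illustrative choice $q(n) = \sqrt{n}$ (our own choice of a sequence satisfying $q(n) \to \infty$ and $q(n)/n \le 1/3$), confirming it stays bounded away from zero as $n$ grows.

```python
import math

def p_both_short(n, q):
    """Probability in (57): two consecutive inter-turn gaps (i.i.d.
    geometric with parameter 2q/n) are both at most floor(n / q^{3/4})."""
    m = math.floor(n / q ** 0.75)
    p_one = 1.0 - (1.0 - 2.0 * q / n) ** m   # P(Geom(2q/n) <= m)
    return p_one ** 2

# With q(n) = sqrt(n), the probability is non-vanishing in n.
for n in (10**3, 10**4, 10**5):
    q = n ** 0.5
    print(n, round(p_both_short(n, q), 4))
```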
E. Proof of Lemma 4.3

Recall that the differences $T_{i+1} - T_i$ are i.i.d. geometric random variables with parameter $2q(n)/n$. For turning times indexed by $S_1$, there is a useful property: it is known [22] that $\Pr\left( T_i = k \mid T_{i-1} = j,\; T_{i+1} = L \right) = \frac{1}{L - j - 1}$ for $k \in \{j+1, j+2, \ldots, L-1\}$. That is, when $T_{S_2}$ is fixed, for every $i \in S_1$ the component vector $\left( \pm(T_i - T_{i-1}),\; \pm(T_{i+1} - T_i) \right)$ in (34) and (35) is a uniform random vector on the space $\Theta = \left\{ -\lfloor n/q(n)^{3/4} \rfloor, \ldots, -1, 1, \ldots, \lfloor n/q(n)^{3/4} \rfloor \right\}^2$. From Lemma 4.2, we know that if $t \ge c_1 \sqrt{q(n)}\, n$, there are at least $\lceil q(n)^{3/2} \rceil$ such uniform random vectors on $\Theta$, with zero mean and covariance matrix $\Sigma$ whose entries are all on the order of $n^2 / q(n)^{3/2}$. Denote these random vectors as $\mathbf{X}_m(n)$, $m = 1, \ldots, \lceil q(n)^{3/2} \rceil$, and let $\mathbf{Y}_m(n) = \Sigma^{-1/2} \mathbf{X}_m(n) / q(n)^{3/4}$. Then from the Central Limit Theorem, the summation of the $\mathbf{Y}_m(n)$ converges to a standard multivariate normal distribution:

$\mathbf{S}_n = \sum_{m=1}^{\lceil q(n)^{3/2} \rceil} \mathbf{Y}_m(n) \xrightarrow{D} \mathcal{N}(\mathbf{0}, \mathbf{I}_{2 \times 2})$.   (59)

Clearly the probability $\Pr\left( c_3 \le \| \mathbf{S}_n \| \le c_4 \right)$ is non-vanishing for any constants $c_3 < c_4$, where $\| \cdot \|$ denotes the Euclidean distance. Noting that the entries of $q(n)^{3/4} \Sigma^{1/2}$ are on the order of $n$, there exists a constant $c_5$ such that $\Pr\left( \| q(n)^{3/4} \Sigma^{1/2} \mathbf{S}_n \| \le 2n \right) \ge c_5$. Also note that $\mathbf{S}_n$ is a zero-mean unimodal symmetric random vector. Therefore there exists a constant $c$ such that $\Pr\left( q(n)^{3/4} \Sigma^{1/2} \mathbf{S}_n = (x, y) \right) \ge 2c / n^2$. Now consider the sum of $\lceil q(n)^{3/2} \rceil$ i.i.d. random vectors which are uniform on $\Theta$ (those indexed by $S_1$ in (34) and (35), with the rest fixed), and denote it as $\mathbf{U}_n$; then $\mathbf{U}_n$ and $q(n)^{3/4} \Sigma^{1/2} \mathbf{S}_n$ have the same distribution. Therefore the lemma is proved.
F. Proof of Lemma 5.1

From the state representation of the edge process (c.f. Figure 3), we know that hitting an outgoing edge on the boundary corresponds to $s_0^t = n$ or $-n$. According to (34) and (35), $s_0^t$ is given by

$s_0^t = \begin{cases} (t - T_k) \pm (T_{k-1} - T_{k-2}) \pm \cdots \pm T_1 \pm a & k \text{ even} \\ \pm b + (t - T_k) \pm (T_{k-1} - T_{k-2}) \pm \cdots \pm (T_2 - T_1) & k \text{ odd}, \end{cases}$   (60)

which is a sum of $k$ signed geometric random variables with zero mean and variance on the order of $n^2 / q(n)^{3/2}$. We learn from the proof of Lemma 4.2 that in the first $c_1 \sqrt{q(n)}\, n$ steps there are $k = O(q(n)^{3/2})$ turns. Let us consider $s_0^t = n$, for which the local limit theorem ([23], page 10) can be invoked to get

$\pi(n) = \Pr(s_0^t = n) \ge \frac{1}{2\sqrt{ \left( n^2 / q(n)^{3/2} \right) q(n)^{3/2} }} \exp\left( -\frac{n^2}{2 \left( n^2 / q(n)^{3/2} \right) q(n)^{3/2}} \right)$,   (61)

or $\Pr(s_0^t = n) \ge 1/(2n\sqrt{e})$ as $n \to \infty$. Then by the Chernoff bound, the probability that the walk hits the boundary edges at least once before $t = c_6 \sqrt{q(n)}\, n$, for some constant $c_6$, is at least $1 - \exp\left( -\tfrac{1}{2} \left( 1 - 1/(c_7 \sqrt{q(n)}) \right)^2 c_7 \sqrt{q(n)} \right)$ (with $c_7 = c_6 / (2\sqrt{e})$), which approaches 1 as $n \to \infty$.
REFERENCES

[1] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufmann, 1988.
[2] G. Cooper, "Probabilistic inference using belief networks is NP-hard," Artificial Intelligence, vol. 42, pp. 393-405, 1990.
[3] C. C. Moallemi and B. Van Roy, "Consensus propagation," IEEE Transactions on Information Theory, vol. 52, no. 11, pp. 4753-4766, 2006.
[4] T. Jaakkola, "Tutorial on variational approximation methods," in Advanced Mean Field Methods: Theory and Practice. MIT Press, 2000.
[5] C. M. Bishop, J. M. Winn, and D. Spiegelhalter, "VIBES: A variational inference engine for Bayesian networks," Advances in Neural Information Processing Systems, 2002.
[6] J. Dauwels, "On variational message passing on factor graphs," Proc. International Symposium on Information Theory (ISIT), 2007.
[7] L. Saul and M. Jordan, "Exploiting tractable substructures in intractable networks," in Advances in Neural Information Processing Systems, vol. 8. MIT Press, 1996.
[8] E. P. Xing, M. I. Jordan, and S. Russell, "A generalized mean field algorithm for variational inference in exponential families," in Proceedings of the 19th Annual Conference on Uncertainty in Artificial Intelligence (UAI), 2003.
[9] D. Randall, "Rapidly mixing Markov chains with applications in computer science and physics," Computing in Science and Engineering, vol. 8, no. 2, March 2006.
[10] P. Diaconis, S. Holmes, and R. M. Neal, "Analysis of a non-reversible Markov chain sampler," Biometrics Unit, Cornell University, Tech. Rep. BU-1385-M, 1997.
[11] T. Lindvall, Lectures on the Coupling Method. Courier Dover Publications, 2002.
[12] J. Yedidia, W. Freeman, and Y. Weiss, "Understanding belief propagation and its generalizations," Technical Report TR-2001-22, Mitsubishi Electric Research Laboratories, 2001.
[13] M. J. Wainwright and M. I. Jordan, "Graphical models, exponential families, and variational inference," Foundations and Trends in Machine Learning, vol. 1, nos. 1-2, pp. 1-305, 2008.
[14] Y. Zhang and H. Dai, "Distributed network decomposition: a probabilistic greedy approach," 2010 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Dallas, TX, March 2010.
[15] Y. Weiss and W. Freeman, "Correctness of belief propagation in Gaussian graphical models of arbitrary topology," Neural Computation, vol. 13, no. 10, pp. 2173-2200, 2001.
[16] D. Malioutov, J. Johnson, and A. Willsky, "Walk-sums and belief propagation in Gaussian graphical models," Journal of Machine Learning Research, 2006.
[17] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah, "Mixing times for random walks on geometric random graphs," SIAM Workshop on Analytic Algorithmics and Combinatorics (ANALCO), Vancouver, Canada, January 2005.
[18] W. Li, H. Dai, and Y. Zhang, "Location aided fast distributed consensus in wireless networks," IEEE Transactions on Information Theory, vol. 56, no. 12, pp. 6208-6227, December 2010.
[19] D. A. Levin, Y. Peres, and E. L. Wilmer, Markov Chains and Mixing Times. American Mathematical Society, 2008.
[20] V. Delouille, R. Neelamani, and R. Baraniuk, "Robust distributed estimation using the embedded subgraphs algorithm," IEEE Transactions on Signal Processing, vol. 54, no. 8, 2006.
[21] K. Schloegel, G. Karypis, and V. Kumar, "Graph partitioning for high performance scientific simulations," CRPC Parallel Computing Handbook, 2000.
[22] G. Grimmett and D. Stirzaker, Probability and Random Processes. Oxford University Press, 2001.
[23] V. F. Kolchin, Random Graphs. Cambridge University Press, 1999.
[24] E. Riegler, G. E. Kirkelund, C. N. Manchon, and B. H. Fleury, "Merging belief propagation and the mean field approximation: A free energy approach," 2010 6th International Symposium on Turbo Codes and Iterative Information Processing (ISTC), pp. 256-260, September 2010.
[25] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky, "Tree-based reparameterization framework for analysis of sum-product and related algorithms," IEEE Transactions on Information Theory, vol. 49, no. 5, pp. 1120-1146, May 2003.
[26] A. T. Ihler, J. W. Fisher III, and A. S. Willsky, "Loopy belief propagation: Convergence and effects of message errors," Journal of Machine Learning Research, vol. 6, pp. 905-936, May 2005.
[27] O. P. Kreidl and A. S. Willsky, "Inference with minimal communication: A decision-theoretic variational approach," Advances in Neural Information Processing Systems, 2006.
[28] H. Chen, B. Chen, and P. K. Varshney, "A new framework for distributed detection with conditionally dependent observations," IEEE Transactions on Signal Processing, vol. 60, no. 3, pp. 1409-1419, March 2012.
[29] Y. Zhang and H. Dai, "Structured variational methods for distributed inference in wireless ad hoc and sensor networks," 2009 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Taipei, April 2009.
[30] Y. Zhang and H. Dai, "Structured variational methods for distributed inference: Convergence analysis and performance-complexity tradeoff," 2009 IEEE International Symposium on Information Theory (ISIT), Seoul, South Korea, June 2009.
Huaiyu Dai (M’03, SM’09) received the B.E. and M.S. degrees in Electrical Engineering from Tsinghua University, Beijing, China, in 1996 and 1998, respectively, and the Ph.D. degree in Electrical Engineering from Princeton University, Princeton, NJ in 2002.
He was with Bell Labs, Lucent Technologies, Holmdel, NJ, during summer 2000, and with AT&T Labs-Research, Middletown, NJ, during summer 2001. Currently he is an Associate Professor of Electrical and Computer Engineering at NC State University, Raleigh. His research interests are in the general areas of communication systems and networks, advanced signal processing for digital communications, and communication theory and information theory. His current research focuses on networked information processing and crosslayer design in wireless networks, cognitive radio networks, wireless security, and associated information-theoretic and computation-theoretic analysis.
He has served as an editor for the IEEE Transactions on Communications, the IEEE Transactions on Signal Processing, and the IEEE Transactions on Wireless Communications. He co-edited two special issues for EURASIP journals, on distributed signal processing techniques for wireless sensor networks and on multiuser information theory and related applications, respectively. He co-chairs the Signal Processing for Communications Symposium of IEEE Globecom 2013, the Communications Theory Symposium of IEEE ICC 2014, and the Wireless Communications Symposium of IEEE Globecom 2014.
Yanbing Zhang (M’09) received the B.E. and M.S. degrees in electronics engineering from Tsinghua University, Beijing, China, in 2001 and 2004, respectively, and the Ph.D. degree in electrical engineering from North Carolina State University, Raleigh, in 2009. Currently, he is a Staff Scientist in the Mobile Communications Group, Broadcom Corporation, Matawan, NJ. His research interests are in the general areas of wireless communications and networking and signal processing for wireless communications, with emphasis on cooperative communication and information processing in wireless networks.
Juan Liu received her B.S. degree in Information and Electronic Engineering from Zhejiang University, Hangzhou, China, in 2000, her M.S. degree in Information Engineering from Beijing University of Posts and Telecommunications, Beijing, China, in 2005, and her Ph.D. degree in Electronic Engineering from Tsinghua University, Beijing, China, in 2011. She is currently a postdoctoral researcher in the Department of Electrical and Computer Engineering, NC State University, Raleigh, NC. Her research interests are in wireless communications.
Figure 1 Markov blanket MB(C_i) (shaded nodes) and Markov blanket clusters (shaded clusters) for cluster C_i
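As an illustration of the notion in Figure 1 (a sketch not taken from the paper, using an assumed edge-list graph representation): in an undirected graphical model, the Markov blanket MB(C_i) of a cluster C_i is the set of nodes outside C_i that share an edge with some node in C_i.

```python
def markov_blanket(edges, cluster):
    """Return MB(cluster): nodes outside `cluster` adjacent to it,
    for an undirected graph given as a list of (u, v) edges."""
    cluster = set(cluster)
    blanket = set()
    for u, v in edges:
        # An edge crossing the cluster boundary contributes its
        # outside endpoint to the Markov blanket.
        if u in cluster and v not in cluster:
            blanket.add(v)
        if v in cluster and u not in cluster:
            blanket.add(u)
    return blanket

# Example: path graph 0-1-2-3-4-5 with cluster {2, 3}.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
print(sorted(markov_blanket(edges, {2, 3})))  # → [1, 4]
```

Conditioned on its Markov blanket, the cluster is independent of the rest of the network, which is what allows the inter-cluster (vertex) and intra-cluster (edge) processing to be separated.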
Figure 2 Vertex process (a), edge process (b), and mixed process (c)
Figure 3 State labeling of edge process on a 2-d torus
Figure 4 Mixing times of the vertex and edge processes
Figure 5 Mean square error of estimation versus time complexity
[Plot: mean square error (log scale, 10^-5 to 10^0) versus iteration number (0 to 60) for MF, BP, SMF with centralized clustering, and SMF with distributed clustering; curves annotated by cluster size.]
Figure 6 Mean square error of estimation versus message complexity
[Plot: mean square error (log scale, 10^-5 to 10^-1) versus number of exchanged messages (0 to 250) for BP, MF, SMF with centralized clustering, and SMF with distributed clustering; curves annotated by cluster size.]