Structured Variational Methods for Distributed Inference in Networked Systems: Design and Analysis
Huaiyu Dai*, Senior Member, IEEE, Yanbing Zhang, Member, IEEE, and Juan Liu
Abstract
In this paper, a variational message passing framework is proposed for distributed inference in
networked systems. Based on this framework, structured variational methods are explored to take
advantage of both the simplicity of variational approximation (for inter-cluster processing) and
the quality of more accurate inference (for intra-cluster processing). To investigate the
convergence performance of our inference approach, we distinguish the inter- and intra-cluster
inference algorithms as vertex and edge processes respectively. Based on an analysis on the
intra-cluster inference procedure, the overall performance of structured variational methods,
modeled as a mixed vertex-edge process, is quantitatively characterized via a coupling approach.
The tradeoff between performance and complexity of this inference approach is also addressed.
Index Terms: convergence analysis, distributed inference, Markov chain, variational methods
Copyright (c) 2012 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].
H. Dai and J. Liu are with the Department of Electrical and Computer Engineering, NC State University, Raleigh, NC 27695 (Email: [email protected], [email protected]). Y. Zhang is with Broadcom Corporation, Matawan, NJ 07747 (Email: [email protected]). This work was done while he was with NC State University. This work was supported in part by the National Science Foundation under Grants CCF-0830462 and ECCS-1002258.
Part of this work was presented at the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Taipei, April 2009 [29], and the IEEE International Symposium on Information Theory (ISIT), Seoul, South Korea, June 2009 [30].
I. INTRODUCTION
Large-scale networked systems of intelligent devices are playing an increasingly important role
in protecting the nation's critical infrastructures as well as serving people’s needs; smart grid,
intelligent transportation, precision agriculture, and seamless surveillance are a few such
examples. In many such systems, there is a pressing need for automatic reasoning and inference
due to practical and economic considerations, and it is desirable that inference be conducted in a
distributed fashion. This motivates us to develop a general and flexible framework for automatic
inference in networked systems which can admit wide applications and provide desired tradeoff
between accuracy and efficiency, while allowing simple and distributed implementation.
Exact inference is known to be NP-hard [2], and is generally computationally intractable for many applications. Therefore, approximate methods are often resorted to in practice. One popular approach to approximate inference is sampling, of which the family of Markov chain Monte Carlo methods is noteworthy. Concerns about this approach include slow convergence, limited analytical tractability, and high computational complexity. Belief propagation (BP) algorithms [1] and their variants (such as consensus propagation [3], a special case of Gaussian BP) have also been widely studied in the literature. BP algorithms yield accurate inference on acyclic graphs, and continue to work well on loopy graphs with sufficient sparsity and symmetry. They are also amenable to distributed implementation. However, BP and related algorithms do not always converge on general cyclic graphs. They are also computationally intractable when continuous variables are involved (except for Gaussian distributions), and approximate methods such as particle filtering may be employed as a remedy in practice.
Variational methods [4] are an alternative for approximate inference. Being a deterministic
approach, they are often more efficient in computation, more amenable to analysis, and admit
wide applicability regarding the underlying models, whether acyclic or cyclic, discrete or
continuous. A message-passing algorithm for the mean-field (MF) inference was proposed for
conjugate-exponential models in (directed) Bayesian networks [5]. An implementation on the
factor graph can be found in [6]. In this paper, we derive a variational message passing framework for Markov random fields (MRF), which arguably offer advantages in modeling wireless networks. In particular, we formulate explicit message passing rules for distributions in the
exponential family, which covers a large class of probabilistic models. Relevant discussion is
given in Section II.
Among variational methods, the simple MF approach is the one most often considered, owing to its analytical and computational tractability; its inference accuracy is nonetheless limited because of the inherent assumption that the variables of interest are fully independent. Naturally, a richer structure for the variational distribution can be exploited for better inference quality, at the cost of increased complexity. Such approaches are named structured variational methods, or simply structured mean field (SMF). They have mainly been studied in the artificial intelligence area [7][8], and little consideration has been given to their application in real networks. In this work, we
further investigate exploiting substructures of networks to improve variational methods in real
systems. Thus the simplicity of variational methods (for inter-cluster processing) and the accuracy
of (approximately) exact inference algorithms (for intra-cluster processing) can be exploited
simultaneously, as detailed in Section III. In this study, BP is adopted as an approximation for
exact inference in intra-cluster processing, as it can be readily realized in a distributed form.
Meanwhile, our SMF framework can effectively control the cluster sizes (and even the topologies)
to ensure good performance for BP processing. In [24], an alternative approach of combining the
BP and MF inference is presented on the factor graph model: the whole set of factor nodes are
divided into two parts, with BP applied on one subset (in particular discrete variables) and MF on
the remaining part. One possible application of such an approach is to design iterative message-
passing algorithms jointly for different components of a communication system, such as joint
channel estimation and decoding, which is orthogonal to our study.
For the distributed inference algorithms mentioned above and studied in this work, stochastic weight matrices conforming to the underlying graphical structure (network topology) are typically employed. Hence the convergence of these algorithms is closely related to the
mixing time of a random walk on the corresponding graph. Random walks on graphs can be
categorized as vertex process-based or edge process-based. The essential difference between the two is that the former is a process on nodes that transitions along edges and is allowed to "backtrack", while the latter is a process on directed edges that transitions toward nodes and where "backtracking" is forbidden. As we will see, distributed algorithms derived from variational methods can be characterized by vertex processes, typically involving reversible Markov chains, while belief propagation and its variants correspond to edge processes, typically involving non-reversible Markov chains.
Even though quite a few techniques exist for analyzing the convergence of reversible Markov
chains, including spectral theory, conductance, canonical paths and multi-commodity flow (see [9]
and the references therein), few of them can be successfully applied to non-reversible cases. In
[10] a non-reversible random walk in the one-dimensional chain is analyzed through a direct
probabilistic approach. A study on the convergence properties of consensus propagation is given
in [3] through function mapping and matrix analysis; an explicit result on convergence time is
derived for the cycle, with conjectures given for higher-dimensional tori. Structured variational methods, as we will formulate in Section III, are in fact mixed vertex-edge processes involving hybrid Markov chains, which entails even more difficulty in analysis. In this paper, we use a "divide and conquer" strategy to investigate their performance: first we analyze the convergence of the
intra-cluster edge process, where we derive an upper bound on the mixing time and verify the
conjecture in [3] for the two-dimensional (2-d) torus; then we exploit the coupling technique [11]
to combine the results for edge and vertex processes to obtain a characterization on the overall
performance. Relevant contents are given in Section IV and Section V, respectively. As a result,
the performance-complexity tradeoff in structured variational methods is further addressed in
Section VI, together with some supporting simulation results.
The contributions of this work are summarized as follows. First, we derive a general and
scalable variational message-passing framework for Markov random fields, which admits wide
applicability concerning network size and topology, allows flexible tradeoff between performance
and complexity, and easily adapts to practical wireless networks. In particular, we obtain explicit
forms for variational message passing rules for probabilistic distributions in the exponential
family, which are simple and yet admit wide applications. Then, this framework is applied to a
clustered network (exemplified by Gaussian MRF), to realize a novel distributed inference
approach which can achieve a flexible balance between inference accuracy and computational
complexity. We also characterize the convergence behavior of the proposed inference algorithm on a 2-d torus, and in the process derive an upper bound on the mixing time of the intra-cluster BP inference process, which should be of independent interest. Our analytical methodologies are developed for general edge process-based and mixed random walks, and may find wider applicability.
II. VARIATIONAL MESSAGE PASSING IN MRF
A. System Model
Distributed inference in complex networked systems is often cast as probabilistic inference in
a graphical model. Well-known graphical models include Markov random fields, Bayesian
networks, and factor graphs. Associated with each graphical model is a family of distributions
which factorize according to the dependency structure of the underlying graph. Besides obvious
advantages in visual representation, graphical models also facilitate design and analysis of
distributed algorithms. Among existing graphical models, Markov random fields exhibit certain
modeling convenience for wireless networks, as they can be conveniently mapped to real
communication graphs, and often admit simpler forms for message-passing algorithms.
In this work we mainly consider a pairwise MRF¹ represented by an undirected graph $G=(V,E)$, where $V$ and $E$ denote the vertex and edge sets respectively, and each node $i\in V$ is associated with a random variable $X_i$ and an observation $y_i$. Define for each node a local potential function $\psi_i(X_i,y_i)$, and for each edge $(i,j)\in E$ a compatibility function $\psi_{ij}(X_i,X_j)$ [12]. The Hammersley-Clifford theorem [1] indicates that the posterior probability of the random vector $\mathbf X=\{X_i\}_{i\in V}$ given the observation vector $\mathbf y=\{y_i\}_{i\in V}$ admits the following product form:

$$p(\mathbf X\mid\mathbf y)\;\propto\;\prod_{(i,j)\in E}\psi_{ij}(X_i,X_j)\prod_{i\in V}\psi_i(X_i,y_i). \quad (1)$$
We also assume that the $\{\psi_i\}$ and $\{\psi_{ij}\}$ belong to the exponential family, i.e., take the following forms:

$$\psi_i(X_i,y_i)=\exp\big(\boldsymbol\theta_i^T\boldsymbol\eta_i(X_i)-g_i(\boldsymbol\theta_i)\big), \quad (2)$$
$$\psi_{ij}(X_i,X_j)=\exp\big(\boldsymbol\theta_{ij}^T\boldsymbol\eta_{ij}(X_i,X_j)-g_{ij}(\boldsymbol\theta_{ij})\big), \quad (3)$$

where the $\boldsymbol\theta$'s and $\boldsymbol\eta$'s are usually referred to as the natural parameters and sufficient statistics, respectively, and $g_i(\boldsymbol\theta_i)$ and $g_{ij}(\boldsymbol\theta_{ij})$ are functions of the $\boldsymbol\theta$'s only, independent of the $X$'s. The exponential family covers a large class of distributions of interest, such as Gaussian, Wishart, Gamma, Beta, and all discrete distributions.
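As a concrete illustration of the form (2) (a sketch, with the Gaussian as the running example, not part of the original text): for $\mathcal N(m,s^2)$ one can take $\boldsymbol\eta(x)=(x,x^2)$ and $\boldsymbol\theta=(m/s^2,\,-1/(2s^2))$, which reproduces the log-density exactly.

```python
import math

def gaussian_logpdf(x, m, s2):
    """Direct evaluation of the Gaussian log-density."""
    return -0.5 * math.log(2 * math.pi * s2) - (x - m) ** 2 / (2 * s2)

def gaussian_natural_params(m, s2):
    """Natural parameters theta for sufficient statistics eta(x) = (x, x^2)."""
    return (m / s2, -1.0 / (2 * s2))

def log_partition(m, s2):
    """g(theta), expressed here via the source parameters for brevity."""
    return m ** 2 / (2 * s2) + 0.5 * math.log(2 * math.pi * s2)

def expfam_logpdf(x, m, s2):
    """theta^T eta(x) - g(theta), the exponential-family form (2)."""
    t1, t2 = gaussian_natural_params(m, s2)
    return t1 * x + t2 * x ** 2 - log_partition(m, s2)
```

The two evaluations agree for any $(x,m,s^2)$, confirming that the Gaussian fits the template (2).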
The objective of distributed inference is to compute $P(\mathbf X\mid\mathbf Y=\mathbf y)$ (or, more generally, $P(S\mid\mathbf Y=\mathbf y)$ for some subset $S\subset\mathbf X$) in an efficient way, through local computation and communication only. This is particularly important for large-scale networked systems, where the observation data is widely distributed and each node may have limited computation and communication resources. Distributed inference algorithms can also find applications in data processing where communication is not a concern but computational complexity is: centralized processing incurs a computational complexity scaling exponentially with the size of $\mathbf X$, with additional complexity when a subset $S\subset\mathbf X$ is considered.
¹ An MRF with higher-order cliques can always be converted into an equivalent pairwise MRF [12].
B. Variational Inference
Modern variational methods in general refer to converting the original problem into an optimization problem (variational transformation), and seeking approximate solutions to the latter by approximating the objective function or restricting the feasible set of solutions (variational approximation). Solving the (relaxed) optimization problem typically results in a set of fixed-point equations, whose successive enforcement can (hopefully) lead to a solution.
Applying the variational approach to distributed inference, the original problem is first transformed into finding a distribution $Q(\mathbf X)$ that minimizes the KL divergence $KL\big(Q(\mathbf X)\,\|\,P(\mathbf X\mid\mathbf Y=\mathbf y)\big)$, or equivalently maximizes a tight lower bound on $\log P(\mathbf y)$ [4]:

$$L(Q)=H(Q)+E_Q\{\log P(\mathbf X,\mathbf y)\}, \quad (4)$$

where $H(Q)$ is the entropy of $Q(\mathbf X)$, and $E_Q\{\cdot\}$ stands for expectation with respect to (w.r.t.) $Q(\mathbf X)$. For analytical and computational tractability, the variational distribution $Q(\mathbf X)$ is often restricted to a class of distributions with simpler dependency structure (such as sub-graphs of the original graphical model). Thus far, the most fruitful applications of variational inference assume a fully factorized form $Q(\mathbf X)=\prod_i Q_i(X_i)$, referred to as the mean-field approach. When (4) is instantiated with this form and optimized w.r.t. each individual component, the following set of fixed-point equations is obtained for the optimal variational distribution (where $E_Q\{\cdot\mid X_i\}$ denotes conditional expectation given $X_i$, and $Z_i$ is a normalization constant):

$$\log Q_i(X_i)=E_Q\{\log P(\mathbf X\mid\mathbf y)\mid X_i\}-\log Z_i,\qquad \forall i. \quad (5)$$
The complexity and accuracy of variational inference depend on the inherent structure of the variational distribution $Q(\mathbf X)$. While the MF approach is attractive for its simplicity, its inference accuracy may not be satisfactory, as posterior correlations are not captured. In this paper, we consider a richer structure for the variational distributions and explore the tradeoff between performance and complexity.
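The identity behind (4), namely $\log P(\mathbf y)-L(Q)=KL\big(Q\,\|\,P(\cdot\mid\mathbf y)\big)$, can be verified numerically on a toy discrete model (the numbers below are arbitrary illustrations, not from the paper):

```python
import math

# Toy joint distribution over x in {0, 1} with the observation fixed:
# p_xy[x] = P(X = x, Y = y_obs).
p_xy = [0.1, 0.3]
p_y = sum(p_xy)                       # P(Y = y_obs)
post = [p / p_y for p in p_xy]        # P(X = x | Y = y_obs)

def elbo(q):
    """L(Q) = H(Q) + E_Q[log P(X, y)], cf. (4)."""
    h = -sum(qi * math.log(qi) for qi in q)
    return h + sum(qi * math.log(pi) for qi, pi in zip(q, p_xy))

def kl(q, p):
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

q = [0.5, 0.5]  # any trial variational distribution
# log P(y) - L(Q) equals KL(Q || P(.|y)), so L(Q) is a tight lower
# bound, achieved exactly when Q matches the posterior.
gap = math.log(p_y) - elbo(q)
assert abs(gap - kl(q, post)) < 1e-12
assert abs(elbo(post) - math.log(p_y)) < 1e-12
```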
C. Variational Message Passing Framework
In this section, we follow the general procedure of variational inference discussed in Section II.B to derive a message-passing framework for distributed inference in networked systems. Consider the system model (1), and rearrange $\boldsymbol\theta_{ij}^T\boldsymbol\eta_{ij}(X_i,X_j)$ in terms of $X_i$: $\boldsymbol\theta_{ij}^T\boldsymbol\eta_{ij}(X_i,X_j)=(\boldsymbol\theta'_{ij})^T\boldsymbol\eta'_{ij}(X_i)$, where $\boldsymbol\theta'_{ij}$ may be a function of $X_j$. Let $\tilde{\boldsymbol\eta}_i(X_i)$ be the union of the sufficient statistics $\boldsymbol\eta_i(X_i)$ and $\boldsymbol\eta'_{ij}(X_i)$. Then the corresponding terms in (2) and (3) can be rewritten as

$$\boldsymbol\theta_i^T\boldsymbol\eta_i(X_i)=\tilde{\boldsymbol\theta}_i^T\tilde{\boldsymbol\eta}_i(X_i), \quad (6)$$
$$\boldsymbol\theta_{ij}^T\boldsymbol\eta_{ij}(X_i,X_j)=(\boldsymbol\theta'_{ij})^T\boldsymbol\eta'_{ij}(X_i)=\tilde{\boldsymbol\theta}_{ij}^T\tilde{\boldsymbol\eta}_i(X_i), \quad (7)$$

where the newly obtained $\tilde{\boldsymbol\theta}_i$ and $\tilde{\boldsymbol\theta}_{ij}$² are named the extended natural parameters and $\tilde{\boldsymbol\eta}_i(X_i)$ the extended sufficient statistics.
It can be shown that the optimal mean-field approximation $Q_i^*$ dictated by (5) is also a member of the exponential family, with sufficient statistics $\tilde{\boldsymbol\eta}_i(X_i)$ and natural parameter

$$\boldsymbol\theta_{Q_i^*}=\tilde{\boldsymbol\theta}_i+\sum_{j\in\mathcal N_i}E_{Q^*\setminus Q_i^*}\{\tilde{\boldsymbol\theta}_{ij}\}, \quad (8)$$

where $\mathcal N_i$ is the set of neighboring nodes of node $i$, and $Q^*\setminus Q_i^*$ stands for the distribution $\prod_{k\neq i}Q_k^*(X_k)$. From (8), a simple message passing rule can be obtained:

Message passing: $\mathbf m_{i\leftarrow j}^{(n)}=E_{Q_j^{(n-1)}}\{\tilde{\boldsymbol\theta}_{ij}\}$; (9)
Parameter updating: $\boldsymbol\theta_i^{(n)}=\tilde{\boldsymbol\theta}_i+\sum_{j\in\mathcal N_i}\mathbf m_{i\leftarrow j}^{(n)}$. (10)

That is, in the $n$th iteration, the message from node $j$ to its neighbor $i$ is the expected value of the extended parameter of the corresponding edge compatibility function, $\tilde{\boldsymbol\theta}_{ij}$ (generally a function of $X_j$), w.r.t. the current variational approximation $Q_j^{(n-1)}$ (similarly, the message from node $i$ to its neighbor $j$ is $\mathbf m_{j\leftarrow i}^{(n)}=E_{Q_i^{(n-1)}}\{\tilde{\boldsymbol\theta}_{ji}\}$). In turn, node $i$ sums up all the messages from its neighbors, together with the extended parameter of its own potential function, to obtain an updated parameter for its variational distribution component. The iteration generally converges under mild conditions [4][13]. While conforming to general expressions in the literature, the above explicit message-passing and parameter-updating forms appear to be new.

² Note that there is a counterpart $\tilde{\boldsymbol\theta}_{ji}$ which abstracts the corresponding terms of $X_i$.
D. Gaussian MRF
For concreteness of discussion, we will particularly consider Gaussian graphical models in this study. Gaussian models are widely adopted in theory and practice in many areas, such as computer vision, oceanography, and wireless networks, and serve as good approximations in many scenarios thanks to the central limit theorem. Without loss of generality, consider that $\mathbf X$ in (1) is jointly Gaussian with zero mean and (positive definite) covariance matrix $\boldsymbol\Sigma_X$, abbreviated as $\mathbf X\sim\mathcal N(\mathbf 0,\boldsymbol\Sigma_X)$, where $\sigma_{S,i}^2=E[X_i^2]$ and $\sigma_{S,ij}=E[X_iX_j]$. The observation at each node is given by

$$y_i=H_ix_i+\nu_i,\qquad i=1,\dots,|V|, \quad (11)$$

where the channel gain $H_i$ is assumed known, and the noise $\nu_i\sim\mathcal N(0,\sigma_{N,i}^2)$ is independent across the network³. Given the observation vector $\mathbf y$, the posterior probability $P(\mathbf X\mid\mathbf y)$ is Gaussian distributed as⁴

$$P(\mathbf X\mid\mathbf y)\sim\mathcal N\big(\mathbf F^{-1}\mathbf H^T\boldsymbol\Xi^{-1}\mathbf y,\;\mathbf F^{-1}\big)\sim\mathcal N^{-1}\big(\mathbf H^T\boldsymbol\Xi^{-1}\mathbf y,\;\mathbf F\big), \quad (12)$$

where $\mathbf H=\operatorname{diag}(H_i)$, $\boldsymbol\Xi=\operatorname{diag}(\sigma_{N,i}^2)$, and $\mathbf F=[F_{ij}]_{|V|\times|V|}=\boldsymbol\Sigma_X^{-1}+\mathbf H^T\boldsymbol\Xi^{-1}\mathbf H$.

Consider approximating the posterior probability (12) by the MF variational distribution with $Q_i(X_i)\sim\mathcal N(\mu_{\mathrm{MF},i},\sigma_{\mathrm{MF},i}^2)$, where $\mu_{\mathrm{MF},i}$ and $\sigma_{\mathrm{MF},i}^2$ are the posterior mean and variance, respectively.

³ It is straightforward to extend the model to include more complex scenarios, such as observations and noise correlated in space.
⁴ $\mathcal N^{-1}(\boldsymbol\beta,\boldsymbol\Gamma)$ is the information parameterization of the Gaussian distribution $\mathcal N(\boldsymbol\mu,\boldsymbol\Sigma)$, with $\boldsymbol\Gamma=\boldsymbol\Sigma^{-1}$ and $\boldsymbol\beta=\boldsymbol\Sigma^{-1}\boldsymbol\mu$.
As an application of the variational message passing framework derived in the previous section, the following iterative form is obtained for the estimate of $\mu_{\mathrm{MF},i}$ (see Appendix A):

$$\mu_{\mathrm{MF},i}^{(n)}=\Big(H_iy_i/\sigma_{N,i}^2-\sum_{j\in\mathcal N_i}F_{ij}\,\mu_{\mathrm{MF},j}^{(n-1)}\Big)\Big/F_{ii}. \quad (13)$$

Collecting all node estimates into a vector of dimension $|V|$ leads to the following expression:

$$\boldsymbol\mu_{\mathrm{MF}}^{(n)}=\mathbf A\mathbf y+\mathbf B\hat{\mathbf P}_V\,\boldsymbol\mu_{\mathrm{MF}}^{(n-1)}, \quad (14)$$

where $\mathbf A$ and $\mathbf B$ are relevant coefficient matrices, and the stochastic matrix $\hat{\mathbf P}_V$ has entries

$$[\hat{\mathbf P}_V]_{ij}=\begin{cases}F_{ij}\Big/\displaystyle\sum_{j\in\mathcal N_i}F_{ij}, & j\in\mathcal N_i,\\[4pt] 0, & \text{otherwise}.\end{cases} \quad (15)$$

In the next section, the same variational message passing framework will be applied to derive the message exchange rules between clusters (viewed as "mega-nodes").
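As a quick numerical sanity check (a sketch with made-up values, not part of the original development), the mean-field iteration (13) can be run on a toy 4-node model; $\mathbf F$ is chosen diagonally dominant so the iteration converges, and at the fixed point the MF means solve $\mathbf F\boldsymbol\mu=\mathbf b$ exactly, a known property of Gaussian mean field.

```python
# Toy precision matrix F (diagonally dominant) and observation terms
# b_i = H_i y_i / sigma_{N,i}^2; all values are illustrative.
F = [[4.0, 1.0, 0.0, 1.0],
     [1.0, 4.0, 1.0, 0.0],
     [0.0, 1.0, 4.0, 1.0],
     [1.0, 0.0, 1.0, 4.0]]
b = [1.0, -2.0, 0.5, 3.0]

mu = [0.0] * 4
for _ in range(200):
    # Update (13): each node uses its neighbors' previous estimates.
    mu = [(b[i] - sum(F[i][j] * mu[j] for j in range(4) if j != i)) / F[i][i]
          for i in range(4)]

# At the fixed point, F mu = b, i.e., the MF means coincide with the
# exact posterior means.
residual = max(abs(sum(F[i][j] * mu[j] for j in range(4)) - b[i])
               for i in range(4))
assert residual < 1e-10
```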
III. STRUCTURED VARIATIONAL METHODS FOR DISTRIBUTED INFERENCE
Although attractive for its computational simplicity, the naive mean-field approach may not yield
sufficient accuracy or fast convergence due to the independence restriction on variational
distributions. A natural idea for improvement is to consider a variational distribution with richer
(and yet tractable) dependence structure, and integrate exact or more accurate probabilistic
inference algorithms with the mean field method to achieve a good tradeoff between accuracy
and complexity. As mentioned earlier, the application of structured variational methods to distributed inference in practical networks is largely unexplored, as is their quantitative performance analysis in this setting. In this section, we discuss their instantiation in the context of clustered wireless networks, and establish convergence. In the following two sections, we will characterize the convergence rate.
A. Overview of the SMF⁵ Method
The MF approximation corresponds to a totally disconnected graph where all the dependencies between the variables of the original model are removed. A natural idea to enrich the structure of $Q$ is to replace each node in the MF approximation by a "mega-node", i.e., to consider a class of variational distributions of the form $Q=\prod_{i=1}^{s}Q_i(\mathbf X_{C_i})$, where $C_1,\dots,C_s$ form a disjoint partition of all nodes (variables). This approach keeps the original dependency structure within the clusters while decoupling the connections between clusters. Accurate (or approximately accurate) inference is pursued within the clusters to improve performance, while the MF approximation is used across the clusters to maintain tractability. The tradeoff between accuracy and complexity can be realized through the construction of clusters, with the MF approach (clusters of size one) and exact inference (a single cluster) at the two extremes.
In particular, we adopt the belief propagation algorithm for intra-cluster reasoning, as it
yields accurate inference on acyclic graphs, and continues to work well on loopy graphs with
suitable sparsity or symmetry. An important reason to choose BP is that it can be readily
implemented in a message-passing style through the prominent sum-product algorithm or its
variants. However, intra-cluster processing in our SMF framework is not limited to BP. Other
exact inference algorithms such as the junction tree (JT) algorithm can be employed for better
convergence and wider applicability. It should be noted that the computational complexity of JT
grows exponentially with the size of the maximal clique in the cluster, so it is sensible to put a
limit on the cluster size when JT is applied. In both cases, our SMF framework provides a
platform to best reap the benefits of these high-accuracy inference algorithms in practice.
SMF also requires some overhead for clustering, which can be done before network setup and adjusted during network operation when necessary. We have designed a distributed clustering scheme for SMF, which endeavors to minimize the dependence (correlation) between the clusters and thus improve the algorithm performance. The effectiveness of the scheme is demonstrated in Section VI via simulations. Interested readers are referred to [14] for details.

⁵ In this paper, "structured variational method" and "structured mean field" are largely interchangeable; more specifically, the former refers to the methodology, while the latter is used for the scheme we develop.
B. SMF in Gaussian MRF
Consider the Gaussian MRF model presented in II.D, for which the messages and node beliefs are both Gaussian distributed. Assuming that in the $n$th iteration the message from node $i$ to $j$, $m_{j\leftarrow i}^{(n)}$, and the belief at node $i$, $b_i^{(n)}$, are parameterized by

$$m_{j\leftarrow i}^{(n)}(x)\sim\mathcal N^{-1}\big(\mu_{j\leftarrow i}^{(n)},\,\beta_{j\leftarrow i}^{(n)}\big)\quad\text{and}\quad b_i^{(n)}(x)\sim\mathcal N^{-1}\big(q_i^{(n)},\,W_i^{(n)}\big),$$

we can obtain the updating rules as (see Appendix B)

$$\mu_{j\leftarrow i}^{(n)}=\frac{\sigma_{S,ij}\Big(\alpha_i+\sum_{k\in\mathcal N_i\setminus\{j\}}\mu_{i\leftarrow k}^{(n-1)}\Big)}{\sigma_{S,j}^2\Big[1+\sigma_{S,i}^2(1-\rho_{ij}^2)\Big(V_i+\sum_{k\in\mathcal N_i\setminus\{j\}}\beta_{i\leftarrow k}^{(n-1)}\Big)\Big]},$$
$$\beta_{j\leftarrow i}^{(n)}=\frac{1}{\sigma_{S,j}^2(1-\rho_{ij}^2)}\left[1-\frac{\rho_{ij}^2}{1+\sigma_{S,i}^2(1-\rho_{ij}^2)\Big(V_i+\sum_{k\in\mathcal N_i\setminus\{j\}}\beta_{i\leftarrow k}^{(n-1)}\Big)}\right], \quad (16)$$

with $\rho_{ij}=\sigma_{S,ij}/(\sigma_{S,i}\sigma_{S,j})$,

$$\alpha_i=H_iy_i/\sigma_{N,i}^2,\qquad V_i=H_i^2/\sigma_{N,i}^2+(1-|\mathcal N_i|)/\sigma_{S,i}^2, \quad (17)$$

and

$$q_i^{(n)}=\alpha_i+\sum_{k\in\mathcal N_i}\mu_{i\leftarrow k}^{(n)},\qquad W_i^{(n)}=V_i+\sum_{k\in\mathcal N_i}\beta_{i\leftarrow k}^{(n)}. \quad (18)$$
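For illustration, a generic scalar Gaussian BP routine can be sketched in information form; this sketch uses a plain precision-matrix parameterization $(J,h)$ for $p(\mathbf x)\propto\exp(-\tfrac12\mathbf x^TJ\mathbf x+\mathbf h^T\mathbf x)$ rather than the paper's $(\sigma_S,\alpha_i,V_i)$ parameterization, but the message structure (aggregate the node term and all incoming messages except the recipient's, then combine all messages into beliefs as in (18)) is the same.

```python
def gaussian_bp(J, h, iters=50):
    """Generic scalar Gaussian BP in information form. Messages carry a
    (potential, precision) pair, mirroring the (mu, beta) pairs in the text."""
    n = len(h)
    nbrs = [[j for j in range(n) if j != i and J[i][j] != 0] for i in range(n)]
    pot = {(i, j): 0.0 for i in range(n) for j in nbrs[i]}
    prec = {(i, j): 0.0 for i in range(n) for j in nbrs[i]}
    for _ in range(iters):
        new_pot, new_prec = {}, {}
        for i in range(n):
            for j in nbrs[i]:
                # Aggregate the node term and all incoming messages except j's.
                P = J[i][i] + sum(prec[(k, i)] for k in nbrs[i] if k != j)
                m = h[i] + sum(pot[(k, i)] for k in nbrs[i] if k != j)
                new_prec[(i, j)] = -J[i][j] ** 2 / P
                new_pot[(i, j)] = -J[i][j] * m / P
        pot, prec = new_pot, new_prec
    # Node beliefs combine the node term with all incoming messages.
    W = [J[i][i] + sum(prec[(k, i)] for k in nbrs[i]) for i in range(n)]
    q = [h[i] + sum(pot[(k, i)] for k in nbrs[i]) for i in range(n)]
    return [qi / Wi for qi, Wi in zip(q, W)]  # belief means

# On a tree (here a 3-node chain) the belief means are exact; for this
# toy J and h, the exact posterior means solving J m = h are (1, 1, 1).
means = gaussian_bp([[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]],
                    [1.0, 0.0, 1.0])
```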
We proceed to discuss inter-cluster message updates. Consider a partitioning of the network nodes $C=\{C_i\}_{i=1}^{s}$, and denote by $\mathbf X_{C_i}$ the collection of all variables $X_\ell$ with $\ell\in C_i$. For each cluster $C_i$, define its Markov blanket (MB), $MB(C_i)$, as the set of nodes outside of $C_i$ that are connected to some nodes in $C_i$. In turn, those nodes in $C_i$ that are connected to some nodes in $MB(C_i)$ are called gateway nodes. A neighboring cluster which contains part of $MB(C_i)$ is named a Markov blanket cluster (MBC) of $C_i$, and the collection of MBCs is denoted $\mathcal N_{C_i}$. A conceptual illustration of MB and MBC is given in Figure 1. To apply the variational message passing rules derived in II.C, the posterior probability can be reformulated as (c.f. (1))

$$P(\mathbf X\mid\mathbf y)\;\propto\;\prod_{C_i\in C}\Psi_{C_i}(\mathbf X_{C_i},\mathbf y_{C_i})\prod_{(C_i,C_j)}\Psi_{C_i,C_j}(\mathbf X_{C_i},\mathbf X_{C_j}), \quad (19)$$

where $\Psi_{C_i}(\mathbf X_{C_i},\mathbf y_{C_i})=\prod_{i\in C_i}\psi_i(X_i,y_i)\prod_{(i,j)\in E:\,i,j\in C_i}\psi_{ij}(X_i,X_j)$ collects the node potentials and edge compatibility functions within cluster $C_i$, while $\Psi_{C_i,C_j}(\mathbf X_{C_i},\mathbf X_{C_j})=\prod_{(i,j)\in E:\,i\in C_i,\,j\in C_j}\psi_{ij}(X_i,X_j)$ collects the compatibility functions of the connecting edges between clusters $C_i$ and $C_j$.
As derived in Appendix C, the inter-cluster update can be readily obtained for a gateway node $i$ in cluster $C_i$ as

$$y_i^{(n)}=y_i-\frac{\sigma_{N,i}^2}{H_i}\sum_{j\in MB(C_i)\cap\mathcal N_i}F_{ij}\,\big(W_j^{(n-1)}\big)^{-1}q_j^{(n-1)}. \quad (20)$$
Expression (20) admits an interesting interpretation: gateway nodes use the intra-cluster estimates of their neighbors in the Markov blanket $MB(C_i)$ to "update" their observations, and exploit these "new" observations, which encode the messages from other parts of the network (propagated through intra- and inter-cluster processing), in the next round of intra-cluster inference. The executions of intra- and inter-cluster updating need not alternate one-for-one; in practice it is found advantageous to run intra-cluster inference more often than the inter-cluster updates.
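The cluster-boundary notions above (Markov blanket, gateway nodes, MBCs) translate directly into code; the graph and clusters below are hypothetical illustrations, not from the paper.

```python
# A toy undirected graph and a two-cluster partition, for illustration only.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (1, 4)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

clusters = [{0, 1, 2}, {3, 4, 5}]

def markov_blanket(c):
    """Nodes outside cluster c connected to some node in c."""
    return {v for u in c for v in adj[u]} - c

def gateway_nodes(c):
    """Nodes inside c connected to some node in MB(c)."""
    mb = markov_blanket(c)
    return {u for u in c if adj[u] & mb}

def mbc(c, all_clusters):
    """Neighboring clusters containing part of MB(c)."""
    mb = markov_blanket(c)
    return [d for d in all_clusters if d is not c and d & mb]
```

On this toy graph, `markov_blanket({0, 1, 2})` returns `{3, 4, 5}`, and every node of the first cluster is a gateway node because each touches the blanket.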
C. Convergence Analysis
Convergence of Gaussian BP in loopy graphs has been actively studied in the literature ([15][3][16] and references therein). While a full understanding is still lacking, various sufficient conditions have been found, among which the pairwise-normalizable condition⁶ is noteworthy [16]. Here we will assume that this condition is satisfied, so the intra-cluster BP is guaranteed to converge. We can see from (16) that the inverse-variance updating of BP within clusters stands alone. It is also observed in our study that the inverse-variance iteration converges much faster than the message-mean iteration, so we can let it run first until the variances are sufficiently low (as clarified in the proof of Lemma 3.1 below). In the following, we provide an alternative proof of the convergence of the mean iteration, assuming that the variances in intra-cluster BP have already converged to some small values. This approach can be easily extended to the analysis of the overall convergence of SMF, and facilitates the study of the convergence rate.

⁶ That is, there exists a decomposition of the form (1) where both the node potential and edge compatibility functions are valid Gaussian distributions.
The message-mean iteration in Equation (16) can be rewritten with the conventional parameter pair $(\bar\mu_{j\leftarrow i},\sigma_{j\leftarrow i}^2)$ as (c.f. Footnote 4)

$$\bar\mu_{j\leftarrow i}^{(n)}=G_{j\leftarrow i}\big(\bar{\boldsymbol\mu}^{(n-1)}\big)=\frac{\sigma_{S,ij}\Big(H_iy_i/\sigma_{N,i}^2+\sum_{k\in\mathcal N_i\setminus\{j\}}\beta_{i\leftarrow k}\,\bar\mu_{i\leftarrow k}^{(n-1)}\Big)}{\beta_{j\leftarrow i}\,\sigma_{S,j}^2\Big[1+\sigma_{S,i}^2(1-\rho_{ij}^2)\Big(V_i+\sum_{k\in\mathcal N_i\setminus\{j\}}\beta_{i\leftarrow k}\Big)\Big]}, \quad (21)$$

where $\bar{\boldsymbol\mu}^{(n-1)}=\big[\bar\mu_{j\leftarrow i}^{(n-1)}\big]\in\mathbb R^{2|E|\times 1}$, $(i,j)\in E$.
Lemma 3.1: $G_{j\leftarrow i}(\cdot)$ is a contraction mapping⁷.
Proof: Let

$$K_k^{i\to j}=\frac{\sigma_{S,i}^2(1-\rho_{ij}^2)\,\beta_{i\leftarrow k}}{1+\sigma_{S,i}^2(1-\rho_{ij}^2)\Big(V_i+\sum_{k'\in\mathcal N_i\setminus\{j\}}\beta_{i\leftarrow k'}\Big)},\qquad
\gamma_{ij}=\frac{\sigma_{S,ij}}{\beta_{j\leftarrow i}\,\sigma_{S,i}^2\sigma_{S,j}^2(1-\rho_{ij}^2)},$$

$$\tilde y_{ij}=\frac{\sigma_{S,ij}\,H_iy_i/\sigma_{N,i}^2}{\beta_{j\leftarrow i}\,\sigma_{S,j}^2\Big[1+\sigma_{S,i}^2(1-\rho_{ij}^2)\Big(V_i+\sum_{k\in\mathcal N_i\setminus\{j\}}\beta_{i\leftarrow k}\Big)\Big]}.$$

Then Equation (21) can be reformulated as

$$\bar\mu_{j\leftarrow i}^{(n)}=G_{j\leftarrow i}\big(\bar{\boldsymbol\mu}^{(n-1)}\big)=\tilde y_{ij}+\gamma_{ij}\sum_{k\in\mathcal N(i)\setminus\{j\}}K_k^{i\to j}\,\bar\mu_{i\leftarrow k}^{(n-1)}. \quad (22)$$

Define $\bar K^{i\to j}=\sum_{k\in\mathcal N(i)\setminus\{j\}}K_k^{i\to j}$, $\mathbf V=\operatorname{diag}\big(\gamma_{ij}\bar K^{i\to j}\big)\in\mathbb R^{2|E|\times 2|E|}$, and $\tilde{\mathbf y}=[\tilde y_{ij}]\in\mathbb R^{2|E|\times 1}$. Further define a stochastic matrix $\hat{\mathbf P}_E\in\mathbb R^{2|E|\times 2|E|}$ with entries

$$[\hat{\mathbf P}_E]_{e,e'}=\begin{cases}K_{s(e')}^{\,s(e)\to d(e)}\big/\bar K^{\,s(e)\to d(e)}, & s(e)=d(e')\ \text{but}\ s(e')\neq d(e),\\[4pt] 0, & \text{otherwise},\end{cases} \quad (23)$$

where $s(e)$ and $d(e)$ denote the source and destination node of edge $e$. The iteration (22) can be written in vector-matrix form as (c.f. (14))

$$\bar{\boldsymbol\mu}_{\mathrm{BP}}^{(n)}=\tilde{\mathbf y}+\mathbf V\hat{\mathbf P}_E\,\bar{\boldsymbol\mu}_{\mathrm{BP}}^{(n-1)}, \quad (24)$$

which leads to

$$\|G(\bar{\boldsymbol\mu})-G(\bar{\boldsymbol\mu}')\|_\infty=\big\|\mathbf V\hat{\mathbf P}_E(\bar{\boldsymbol\mu}-\bar{\boldsymbol\mu}')\big\|_\infty\le\big\|\mathbf V\hat{\mathbf P}_E\mathbf 1\big\|_\infty\|\bar{\boldsymbol\mu}-\bar{\boldsymbol\mu}'\|_\infty=\big\|\mathbf V\mathbf 1\big\|_\infty\|\bar{\boldsymbol\mu}-\bar{\boldsymbol\mu}'\|_\infty\le\lambda\,\|\bar{\boldsymbol\mu}-\bar{\boldsymbol\mu}'\|_\infty \quad (25)$$

with $\lambda<1$. The last inequality comes from the assumption that, for sufficiently large $n$, $\sigma_{S,ij}^2\big/\big[\sigma_{S,i}^2\sigma_{S,j}^2(1-\rho_{ij}^2)\,\beta_{j\leftarrow i}^{(n)}\big]<1$, so that every diagonal entry of $\mathbf V$ is strictly less than one. Thus $G(\bar{\boldsymbol\mu})$ is a maximum-norm contraction mapping⁷ and hence has a unique fixed point. This proves the convergence of the mean iteration in Gaussian belief propagation.

⁷ A contraction mapping on a metric space $(M,d)$ is a function $G$ from $M$ to itself with the property that there is some nonnegative real number $\lambda<1$ such that, for all $\bar{\boldsymbol\mu}$ and $\bar{\boldsymbol\mu}'$ in $M$, $d\big(G(\bar{\boldsymbol\mu}),G(\bar{\boldsymbol\mu}')\big)\le\lambda\,d(\bar{\boldsymbol\mu},\bar{\boldsymbol\mu}')$.
□
Based on Lemma 3.1, we have:
Theorem 3.1 In a Gaussian MRF, the structured variational method using Gaussian BP as the
intra-cluster inference algorithm converges.
Proof: Taking inter-cluster updating into account, the change in observations (20) is reflected only in $y_i^{(n)}$, which cancels out in $G_{j\leftarrow i}(\boldsymbol\mu)-G_{j\leftarrow i}(\boldsymbol\mu')$ (c.f. (22)). Hence the proof of Lemma 3.1 still applies, and the conclusion follows.
□
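The contraction argument above can be illustrated numerically: any affine iteration $\boldsymbol\mu\leftarrow\mathbf y+\mathbf M\boldsymbol\mu$ with $\|\mathbf M\|_\infty<1$ converges geometrically to a unique fixed point (Banach's fixed-point theorem). The matrix and vector below are arbitrary toy values, not from the paper.

```python
# Toy affine iteration mu <- y + M mu with max-norm ||M||_inf < 1.
M = [[0.2, -0.3, 0.1],
     [0.0, 0.4, -0.2],
     [0.3, 0.1, 0.2]]
y = [1.0, -1.0, 0.5]

row_sums = [sum(abs(a) for a in row) for row in M]
lam = max(row_sums)          # contraction modulus, cf. lambda in (25)
assert lam < 1

def step(mu):
    return [y[i] + sum(M[i][j] * mu[j] for j in range(3)) for i in range(3)]

mu = [0.0, 0.0, 0.0]
for _ in range(100):
    mu = step(mu)

# mu is (numerically) the unique fixed point: mu = y + M mu.
err = max(abs(mu[i] - step(mu)[i]) for i in range(3))
assert err < 1e-12
```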
IV. CONVERGENCE RATE OF INTRA-CLUSTER INFERENCE
Although belief propagation is successfully employed in SMF for intra-cluster inference, its performance is still not fully understood. We first analyze the performance of the intra-cluster algorithm in this section, where we derive an upper bound on its convergence time for the 2-d torus; we then utilize this result in Section V to investigate the overall performance of SMF. Our analysis is mainly focused on 2-d tori, which capture the essence of planar networks; extension to other models, such as geometric random graphs, will be considered in our future work.
A. Vertex, Edge and Mixed Process
Both updates in (14) and (24) involve stochastic matrices which define irreducible and aperiodic Markov chains. The matrix $\hat{\mathbf P}_V$ in (14) is a $|V|\times|V|$ matrix defined on the vertex set (c.f. (15)), while $\hat{\mathbf P}_E$ in (24) turns out to be a $2|E|\times 2|E|$ matrix defined on the set of directed edges (c.f. (23)). We refer to the evolution of the corresponding Markov chains in these two schemes as the vertex process and the edge process, respectively.
Figure 2 illustrates the distinction between a vertex process and an edge process. As (a)
shows, the states in a vertex process are represented by nodes (the circles), while the allowable
(two-way) transition between the states is determined by the undirected edges. In contrast, the
states in an edge process are represented by the directed edges (the arrows in (b)), and the
transitions are guided by the directions that the arrows point to. More specifically, the transition
can only occur between edges which are connected but not directly opposing each other (i.e., between $e'$ and $e$ such that $s(e)=d(e')$ but $s(e')\neq d(e)$), dictated by the rule in BP that the
message from one neighbor of a node contributes to the new messages sent to other neighbors
but not back to itself (c.f. (21)). For structured variational methods, we constrain the edge
process only within clusters, and employ the vertex process to exchange information between
clusters. This leads to a mixed vertex-edge process model as shown in (c).
For a Markov chain $\hat{\mathbf P}$ with stationary distribution $\boldsymbol\pi$, the mixing time is defined as

$$t_{\mathrm{mix}}(\epsilon)=\max_i\,\inf\Big\{t:\tfrac12\big\|\hat{\mathbf P}^t(i,\cdot)-\boldsymbol\pi\big\|_1\le\epsilon\Big\}, \quad (26)$$

where $\hat{\mathbf P}^t(i,\cdot)$ is the $i$-th row of the $t$-step transition matrix, and $\|\cdot\|_1$ stands for the $l_1$ norm. Essentially, $t_{\mathrm{mix}}(\epsilon)$ specifies the (worst-case) time that $\hat{\mathbf P}$ takes to converge to the $\epsilon$-vicinity of its stationary distribution, considering all possible initial states.
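Definition (26) can be evaluated directly for small chains. The sketch below computes $t_{\mathrm{mix}}(\epsilon)$ for the lazy random walk on a cycle, a toy vertex process (the chain is illustrative, not from the paper):

```python
# Lazy random walk on a cycle of N nodes; stationary distribution is uniform.
N = 8
P = [[0.0] * N for _ in range(N)]
for i in range(N):
    P[i][i] = 0.5
    P[i][(i - 1) % N] = 0.25
    P[i][(i + 1) % N] = 0.25
pi = [1.0 / N] * N

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]

def t_mix(eps):
    """Smallest t with worst-case total-variation distance <= eps, cf. (26)."""
    Pt = [row[:] for row in P]
    t = 1
    while True:
        worst = max(0.5 * sum(abs(Pt[i][j] - pi[j]) for j in range(N))
                    for i in range(N))
        if worst <= eps:
            return t
        Pt = matmul(Pt, P)
        t += 1

tm = t_mix(0.01)
```

A looser tolerance is reached no later than a tighter one, i.e., `t_mix(0.25) <= t_mix(0.01)`.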
The convergence behavior of vertex processes has been well studied in the context of reversible Markov chains. In particular, it is not difficult to prove that the mixing time of a reversible Markov chain on a 2-d $n\times n$ torus is $t_{\mathrm{mix}}=O(n^2)$ [17], which characterizes the convergence time of vertex processes and thus of the variational method. However, the performance of edge processes, which in general involve non-reversible Markov chains, is still largely unexplored; and to the best of our knowledge, there is no formal convergence discussion on the mixed model.
B. Convergence Rate of Edge Process

To facilitate our discussion, we adopt the following labeling to represent an edge process on an $n \times n$ torus, as shown in Figure 3. Specifically, with node $(1,1)$ on the bottom-left corner and node $(n,n)$ on the top-right corner, the four outgoing edges of node $(i,j)$ pointing in the East, North, West and South directions are respectively labeled $(i, j)$, $(j, -i)$, $(-i, -j)$ and $(-j, i)$. A state is only allowed to make a 90-degree turn with probability $q(n)/n$ (e.g., from $(i, j)$ to $(j, -(i+1))$ or $(-j, i+1)$), or to move forward with probability $1 - 2q(n)/n$ (e.g., from $(i, j)$ to $(i+1, j)$); it cannot backtrack (e.g., the transition from $(i, j)$ to $(-(i+1), -j)$ is forbidden).
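The turning rules above can be sketched in code as follows. The wrap-around helper `inc` reflects our reading of the modulo-2n convention in footnote 9 (a coordinate label cycles within its sign class), so treat it as an assumption rather than the paper's exact implementation.

```python
import random

def inc(x, n):
    """Advance a coordinate label by one step, wrapping within its sign
    class: positive labels cycle 1..n, negative labels cycle -n..-1
    (our reading of the modulo-2n convention in footnote 9)."""
    if x == n:
        return 1
    if x == -1:
        return -n
    return x + 1

def step(state, n, q, rng):
    """One step of the edge process: left or right turn with probability
    q/n each, forward move otherwise; backtracking never occurs."""
    s0, s1 = state
    u = rng.random()
    if u < q / n:            # turn left:  (s0, s1) -> (s1, -(s0 + 1))
        return s1, -inc(s0, n)
    if u < 2 * q / n:        # turn right: (s0, s1) -> (-s1, s0 + 1)
        return -s1, inc(s0, n)
    return inc(s0, n), s1    # forward:    (s0, s1) -> (s0 + 1, s1)

rng = random.Random(0)
n, q = 8, 2.0
state = (1, 1)               # start on the East edge of node (1, 1)
for _ in range(10_000):
    state = step(state, n, q, rng)
    # States always stay in {-n,...,-1,1,...,n}^2.
    assert state[0] != 0 and abs(state[0]) <= n
    assert state[1] != 0 and abs(state[1]) <= n
print(state)
```

One can check against the labeling above that a left turn maps an East edge to a North edge, North to West, West to South, and South to East, as expected.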
A conjecture is put forth in [3] that the convergence time of consensus propagation (or belief propagation) on a 2-d $n \times n$ torus is $O(n^{3/2})$. In this section, we verify this conjecture by deriving a slightly more general upper bound for the mixing time of edge processes on a 2-d torus, assuming a turning probability $q(n)/n$, where $q(n)$ satisfies $\lim_{n \to \infty} q(n) = \infty$ and $q(n)/n \le 1/3$. For a normal BP algorithm, we have $q(n)/n = c$ for some constant $c \le 1/3$.8 We begin by citing a result from [18], which applies to general Markov chains.
Lemma 4.1 For any irreducible and aperiodic Markov chain $\hat{P}$ with stationary distribution $\pi$,

$t_{\mathrm{mix}}(\epsilon) \le \left\lceil \log(\epsilon^{-1}) / \log\left( (1-c)^{-1} \right) + 1 \right\rceil t_{\mathrm{fill}}(c)$,   (27)

where $t_{\mathrm{fill}}(c) = \max_i \inf\left\{ t : \hat{P}^t(i, \cdot) \ge c\, \pi \right\}$, $0 < c < 1$.
As mentioned above, a two-tuple $(s_0, s_1) \in \{-n, \ldots, -1, 1, \ldots, n\}^2$ is used to represent the states of the edge process. It can be verified that the state evolution (whether horizontal or vertical) admits:

Moving forward: $s_0^{i+1} = s_0^i + 1$, $s_1^{i+1} = s_1^i$,   (28)
Turning left: $s_0^{i+1} = s_1^i$, $s_1^{i+1} = -(s_0^i + 1)$,   (29)
Turning right: $s_0^{i+1} = -s_1^i$, $s_1^{i+1} = s_0^i + 1$.   (30)
Without loss of generality, assume the random walk starts from state $(s_0^0, s_1^0) = (a, b)$. Let $T_0 = 0, T_1, T_2, \ldots$ be the time instances at which the random walk makes turns, and let $D_i \in \{\mathrm{L}, \mathrm{R}\}$ be the corresponding turning direction (Left or Right) at time $T_i$. Then for $t \in [T_k, T_{k+1})$, with $k$ being the total number of turns before time $t$, the destination state evolves as9

$(s_0^t, s_1^t) = \begin{cases} \left( s_1^{T_{k-1}} + t - T_k,\; -(s_0^{T_{k-1}} + T_k - T_{k-1}) \right) & D_k = \mathrm{L} \\ \left( -s_1^{T_{k-1}} + t - T_k,\; s_0^{T_{k-1}} + T_k - T_{k-1} \right) & D_k = \mathrm{R}. \end{cases}$   (31)

Clearly, the number of possible destination states grows with the number of turns made. For example, when $t \in [T_1, T_2)$,

$(s_0^t, s_1^t) = \begin{cases} \left( b + t - T_1,\; -(a + T_1) \right) & D_1 = \mathrm{L} \\ \left( -b + t - T_1,\; a + T_1 \right) & D_1 = \mathrm{R}; \end{cases}$   (32)

8 The reader is referred to [10][18] for the construction of non-reversible chains that mix even faster, where $q(n)$ is a constant. 9 All the summations in (31)-(35) should be understood as modulo-2n operations, taking values in $\{-n, \ldots, -1, 1, \ldots, n\}$.
and when $t \in [T_2, T_3)$,

$(s_0^t, s_1^t) = \begin{cases} \left( t - T_2 - T_1 - a,\; -(T_2 - T_1) - b \right) & (D_1, D_2) = (\mathrm{L}, \mathrm{L}) \\ \left( t - T_2 + T_1 + a,\; (T_2 - T_1) + b \right) & (D_1, D_2) = (\mathrm{L}, \mathrm{R}) \\ \left( t - T_2 + T_1 + a,\; -(T_2 - T_1) + b \right) & (D_1, D_2) = (\mathrm{R}, \mathrm{L}) \\ \left( t - T_2 - T_1 - a,\; (T_2 - T_1) - b \right) & (D_1, D_2) = (\mathrm{R}, \mathrm{R}). \end{cases}$   (33)

Generally, with $k$ total turns, the destination state has $2^k$ possibilities; when $k$ is even, it is given by

$s_0^t = (t - T_k) \pm (T_{k-1} - T_{k-2}) \pm \cdots \pm T_1 \pm a$,
$s_1^t = \pm (T_k - T_{k-1}) \pm \cdots \pm (T_2 - T_1) \pm b$;   (34)

and when $k$ is odd, it is given by

$s_0^t = \pm b + (t - T_k) \pm (T_{k-1} - T_{k-2}) \pm \cdots \pm (T_2 - T_1)$,
$s_1^t = \pm (T_k - T_{k-1}) \pm \cdots \pm (T_3 - T_2) \pm T_1 \pm a$,   (35)

where the plus/minus signs are determined by the turning directions.
By symmetry, the stationary distribution of this Markov chain is uniform, with probability $1/(4n^2)$ per state. As dictated by Lemma 4.1, we need to find the minimum time $t$ such that $\Pr\left( (s_0^t, s_1^t) = (x, y) \right) \ge c/n^2$ for any $(x, y) \in \{-n, \ldots, -1, 1, \ldots, n\}^2$ and some constant $c$, regardless of the initial state. Intuitively, both $s_0^t$ and $s_1^t$ in (34) and (35) are sums of independent geometric random variables, which allows us to examine the final-state probability via the Central Limit Theorem. This is done in Lemma 4.3 below, which requires a technical result, Lemma 4.2, to simplify the analysis.
Lemma 4.2 With high probability (w.h.p.),10 there exists a constant $c_1 > 0$ such that there are at least $\lceil q(n)^{3/2} \rceil$ positive odd integers (time indices) $i$ satisfying

$T_i - T_{i-1} \le \lfloor n / q(n)^{3/4} \rfloor$ and $T_{i+1} - T_i \le \lfloor n / q(n)^{3/4} \rfloor$

in the first $c_1 \sqrt{q(n)}\, n$ steps, where $\lfloor \cdot \rfloor$ denotes the floor function, which returns the closest integer smaller than the argument, and $\lceil \cdot \rceil$ denotes its counterpart, the ceiling function.

10 The probability approaches 1 as $n \to \infty$.
Proof: See Appendix D.
Lemma 4.2 essentially reveals some regularity in the turning times; in particular, some intervals between them are well bounded. Divide the whole turning-time index set $\{1, 2, \ldots\}$ into two subsets $S_1$ and $S_2$, where $S_1$ is the set of positive odd integers $i$ such that $T_i - T_{i-1} \le \lfloor n / q(n)^{3/4} \rfloor$ and $T_{i+1} - T_i \le \lfloor n / q(n)^{3/4} \rfloor$, as in Lemma 4.2, and $S_2$ contains the rest. For our purpose, it is sufficient to find a lower bound on the conditional probability $\Pr\left( (s_0^t, s_1^t) = (x, y) \mid T_{S_2} \right)$ for any given set $T_{S_2} = \{T_j\}_{j \in S_2}$.11 This allows us to focus on the more regular random variables $\left( \pm(T_i - T_{i-1}),\; \pm(T_{i+1} - T_i) \right)$ in (34) and (35) specified by $i \in S_1$. We thus can derive:
Lemma 4.3 There exists a constant $c > 0$ such that if $t \ge c_1 \sqrt{q(n)}\, n$,

$\Pr\left( (s_0^t, s_1^t) = (x, y) \mid T_{S_2} \right) \ge c / n^2$,   (36)

for any $(x, y) \in \{-n, \ldots, -1, 1, \ldots, n\}^2$ and any $T_{S_2} = \{T_j\}_{j \in S_2}$, w.h.p.
Proof: See Appendix E.
Combining Lemma 4.1 and Lemma 4.3, we obtain the following conclusion:

Theorem 4.1 On a 2-d $n \times n$ torus, the mixing time of an edge process with turning probability $q(n)/n$ is $O(\sqrt{q(n)}\, n)$ w.h.p.

As a result, the convergence time of consensus propagation [3] and of our intra-cluster BP inference (16) is $O(n^{3/2})$, with $q(n)/n = c$ for some constant $c$ in the worst case. This is verified in Figure 4, where the mixing times with $\epsilon = 0.01$ of the vertex process and the edge process (with $q(n)/n = 1/3$) are simulated. It is observed that the two curves fit well with $O(n^2)$ and $O(n^{3/2})$, respectively.

11 Our goal is then achieved by the total probability theorem.
V. CONVERGENCE RATE OF STRUCTURED VARIATIONAL METHODS

It has been shown in Section IV.A that the performance of structured variational methods is governed by a mixed vertex-edge process, or equivalently a hybrid Markov chain model. The complexity of this model precludes direct application of standard techniques in the literature. In this section, we explore the coupling technique [11] to analyze this model.
Coupling provides a simple and elegant way of bounding the mixing time, and is not tied to reversibility. Essentially, a coupling of Markov chains is a process $\{X_t, Y_t\}_{t \ge 0}$ with the property that both $\{X_t\}$ and $\{Y_t\}$ are Markov chains with the same transition matrix $\hat{P}$ of interest, but typically with different starting states. Once the two chains meet at one state, they stay together at all times after that, i.e., if $X_{t'} = Y_{t'}$, then $X_t = Y_t$ for $t \ge t'$. For starting states $x$ and $y$, let

$T^{x,y} = \inf\left\{ t : X_t = Y_t \mid X_0 = x, Y_0 = y \right\}$;   (37)

then the coupling time is defined as

$t_{\mathrm{couple}} = \max_{x,y} E\left( T^{x,y} \right)$,   (38)

which can serve as an upper bound for the mixing time according to the Coupling Lemma [11]:

$t_{\mathrm{mix}}(\epsilon) \le t_{\mathrm{couple}} \ln(\epsilon^{-1})$.   (39)
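To make the coupling idea concrete, the following toy sketch couples two lazy random walks on a cycle (a much simpler chain than the mixed process analyzed here) by letting exactly one chain move at a time until they meet. The chain and the particular coupling are our own illustrative choices.

```python
import random

def coupled_step(x, y, n, rng):
    """One step of a coupling for the lazy random walk on an n-cycle:
    while apart, exactly one of the two chains moves (+1 or -1; the
    other holds); once they meet, they move together forever.
    Marginally, each chain is the lazy walk (hold w.p. 1/2)."""
    if x == y:
        d = rng.choice((-1, 1)) if rng.random() < 0.5 else 0
        return (x + d) % n, (y + d) % n
    d = rng.choice((-1, 1))
    if rng.random() < 0.5:
        return (x + d) % n, y
    return x, (y + d) % n

def coupling_time(x0, y0, n, rng, t_max=1_000_000):
    """Steps until the two coupled chains first meet, c.f. T^{x,y} in (37)."""
    x, y, t = x0, y0, 0
    while x != y and t < t_max:
        x, y = coupled_step(x, y, n, rng)
        t += 1
    return t

rng = random.Random(1)
n = 16
# Monte Carlo estimate of E[T^{x,y}] for the worst (antipodal) start, c.f. (38).
est = sum(coupling_time(0, n // 2, n, rng) for _ in range(200)) / 200
print(est)
```

The difference of the two walks performs a simple random walk on the cycle, so the estimated coupling time scales on the order of $n^2$, which by (39) recovers the familiar $O(n^2)$ mixing bound for this vertex process.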
We assume an $n \times n$ torus is equally divided into $s^2$ clusters, each of size $(n/s) \times (n/s)$, and consider a vertex-edge process on it as indicated in Figure 2(c). Then, by investigating the coupling time of two random walks on such a clustered graph, we can obtain a characterization of the mixing time. First, we study how quickly an edge process can "escape" from a 2-d torus (or how long it can stay in the torus before hitting any outgoing edges on the boundaries), and obtain the following result:
Lemma 5.1 On a 2-d $n \times n$ torus, the average staying time of an edge process is upper-bounded by $O(\sqrt{q(n)}\, n)$ w.h.p.
Proof: See Appendix F.
Using this result, the mixing time of a mixed vertex-edge process can be characterized as
follows:
Theorem 5.1 On a 2-d $n \times n$ torus, the mixing time of a mixed vertex-edge process with equal cluster size $(n/s) \times (n/s)$ is $O(\sqrt{s}\, n^{3/2})$ w.h.p.

Proof: Suppose two random walks start from two randomly selected points in this clustered graph. The coupling process can be described as follows: first, the two random walks wander inside their respective clusters (edge process) until they hit the gateway nodes and exit the starting clusters. From then on, they roam over the network, repeatedly entering and leaving clusters, and finally arrive at the same cluster. This journey, at a high level, can be regarded as a vertex process on an $s \times s$ torus of "mega-vertices", which takes $O(s^2)$ steps to couple. At each mega-vertex (cluster), the average staying time is $O((n/s)^{3/2})$ according to Lemma 5.1. In addition, we need to consider the following scenario: even after the two walks reach the same cluster, one of them may leave early, by hitting gateway nodes before coupling with the other walk; the probability of such an event is denoted $p$. In this case, the above journey is repeated.
To evaluate $p$, let $a$ be the state of the random walk that entered the cluster earlier, $x$ the state of the second random walk that has just stepped into the cluster, and $z$ a boundary node of the cluster. It has been shown [19] that the probability that, starting from node $x$, a random walk hits node $z$ before it hits $a$ is given by the ratio of the effective resistance12 between $a$ and $x$ to that between $a$ and $z$:

$p = P_x\left( \tau_{\mathrm{hit},z} < \tau_{\mathrm{hit},a} \right) = \frac{R(a \leftrightarrow x)}{R(a \leftrightarrow z)}$.   (40)
12 The effective resistance between node i and j is the expected number of traversals in a random walk starting at i and ending in j [19].
Since both $x$ and $z$ are boundary states, and we have no further information about $a$ except to assume it is uniformly and randomly located inside the cluster, $R(a \leftrightarrow x)$ and $R(a \leftrightarrow z)$ are of the same order, and so $p$ is a constant.
The total time to couple the two random walks is then given by

$t_{\mathrm{couple}} = \sum_{i=1}^{\infty} i \left[ O(s^2)\, O\!\left( (n/s)^{3/2} \right) + O\!\left( (n/s)^{3/2} \right) \right] (1-p)\, p^{i-1}$
$\qquad\quad = O(s^2)\, O\!\left( (n/s)^{3/2} \right) / (1-p) + O\!\left( (n/s)^{3/2} \right) / (1-p)$,   (41)

where the first term gives the total roaming time among the clusters, while the second term corresponds to the staying time of the two random walks in the same cluster.

We thus have

$t_{\mathrm{mix}}(\epsilon) \le t_{\mathrm{couple}} \ln(\epsilon^{-1}) = O\!\left( s^2 (n/s)^{3/2} \right) = O\!\left( \sqrt{s}\, n^{3/2} \right)$.   (42) □
Note that Theorem 5.1 includes the previous results for the vertex process ($s = n$) and the edge process ($s = 1$) as two special cases.
VI. PERFORMANCE-COMPLEXITY TRADEOFF IN STRUCTURED VARIATIONAL METHODS

Theorem 5.1 indicates that the inference performance decreases with $s$ (i.e., improves with the cluster size). To see how clustering affects the message complexity, note that for the total of $s^2$ clusters, each cluster has $(n/s)^2$ nodes and thus $4(n/s)^2$ directed edges. On each directed edge, two metrics, the message mean and variance, are exchanged in the BP algorithm, except for the $4(n/s)$ outgoing edges that cross the cluster boundaries; on these cross-cluster edges, only the estimated means are needed for inter-cluster updating. So the message complexity per iteration in the SMF algorithm is given by

$s^2 \left[ 2 \left( 4(n/s)^2 - 4(n/s) \right) + 4(n/s) \right] = O\!\left( 8n^2 - 4ns \right)$.   (43)
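The count in (43) can be reproduced directly. The helper below is our own arithmetic sketch of the per-iteration message count; it assumes, as in the text, that intra-cluster directed edges carry two metrics while cross-boundary edges carry only a mean.

```python
def smf_messages_per_iter(n, s):
    """Messages per iteration in SMF on an n x n torus split into s^2
    clusters of (n/s) x (n/s) nodes, per (43)."""
    assert n % s == 0, "cluster side n/s must be an integer"
    m = n // s
    intra = 4 * m * m - 4 * m   # directed edges staying inside a cluster (2 metrics each)
    cross = 4 * m               # directed edges leaving the cluster (1 metric each)
    per_cluster = 2 * intra + cross
    return s * s * per_cluster  # equals 8*n*n - 4*n*s

print(smf_messages_per_iter(12, 3))  # 8*12*12 - 4*12*3 = 1008
```

The two extremes match the text: $s = 1$ (pure BP on the torus) gives $8n^2 - 4n$ messages, while $s = n$ (pure MF, single-node clusters) gives $4n^2$.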
Comparing (42) and (43), we can observe the performance-complexity tradeoff inherent in SMF: as the cluster size increases, more accurate inference is performed over more nodes; this results in faster convergence, but also inevitably increases the communication burden.
The inherent tradeoff of SMF is further explored through simulations. We consider Gaussian MRF estimation, and adopt a simulation setting similar to that in [20],13 with 150 nodes uniformly and randomly distributed in a unit plane. The estimation mean square error (MSE) is used as the comparison metric, defined by

$\mathrm{MSE} = \| \hat{\mathbf{x}}^m - \hat{\mathbf{x}} \|^2 / \| \hat{\mathbf{x}} \|^2$,   (44)

with $\hat{\mathbf{x}}^m$ denoting the estimation vector at the $m$-th iteration, and $\hat{\mathbf{x}}$ being the exact estimate (the MMSE solution).
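The normalized error metric (44) is straightforward to compute; the sketch below is a minimal NumPy version, with our own toy vectors standing in for the iterates and the MMSE solution.

```python
import numpy as np

def relative_mse(x_m, x_exact):
    """Relative estimation MSE per (44): ||x^m - x||^2 / ||x||^2."""
    return np.linalg.norm(x_m - x_exact) ** 2 / np.linalg.norm(x_exact) ** 2

# Toy example: a perturbed estimate against the exact solution.
x = np.array([1.0, 2.0, -1.0])
print(relative_mse(x + 0.1, x))
```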
Two clustering schemes are considered in the simulation. One is a centralized scheme [21],
which uses semi-definite programming to solve a relaxation of the min-cut problem14 with equal
cluster size constraint, and the other is our proposed distributed clustering algorithm in [14],
which seeks to approximate the min-cut solution in a greedy manner.
Figure 5 compares the convergence rate of BP, MF, and SMF – with the cluster size limit of 3
and 8, respectively. The results verify that BP converges the fastest. But we also observe that even
when the network is divided into very small clusters (of size 3), SMF can still significantly speed
up the convergence; in this case there is not much increase in computational complexity as
compared to MF. A reasonable cluster size (of 8) achieves almost indistinguishable performance
to that of BP. It is also observed that our distributed clustering scheme yields results close to the
centralized one, but with a much lower computational overhead.
Another important consideration in wireless networks is the energy consumption, which is
typically dominated by the communication energy. Comparison of the three approaches in this
aspect is illustrated in Figure 6, where the total number of exchanged messages till reaching the
13 see Section V.A in [20] 14 The edge weight is set proportional to the correlation between two end variables.
corresponding MSE is used as an indicator for communication energy. Note that compared to MF,
BP exchanges more messages per round but converges faster. It is interesting to observe that for
this simulation setting15 SMF consumes the least communication energy to obtain the same
estimation accuracy, which indicates its potential superiority in practice.
The above analysis and simulations indicate that an appropriate cluster size should be selected for SMF depending on the application, to achieve a good balance among estimation accuracy, convergence rate, computational complexity, and energy efficiency. How to select an appropriate cluster size in practice is worth further investigation. If the anticipated operational environment is stable, we may conduct off-line simulations or experiments such as those in Figures 4 and 5 to determine the best cluster size. In more dynamic scenarios, we can resort to a cross-layer approach to facilitate re-clustering. For example, when there is a demand for higher quality
of service from the application layer, the inference module will notify the clustering module to re-
cluster with a larger cluster size. This can be easily accommodated through our distributed
clustering scheme in [14], which has a procedure to merge clusters of smaller sizes into larger
ones till the designated conditions are satisfied. Similarly, a split of current clusters can be
initiated when a smaller cluster size is desired.
VII. CONCLUSIONS AND FUTURE WORK

In this paper, we develop a general variational message passing framework for distributed
inference in Markov random fields. Structured variational methods are explored to achieve a nice
tradeoff between system performance and complexity. We also investigate the convergence
performance of our proposed structured variational methods for distributed inference. We adopt
a direct probabilistic approach to analyze the convergence of an edge process which models the
intra-cluster processing, and devise a coupling process to characterize the overall performance
concerning a more complicated mixed vertex-edge process. We expect both the results obtained
and the analytical tools developed in this work can be applied to other similar problems in
15 This result should not be viewed as contradictory to (43), which only describes the scaling behavior.
wireless networks. In particular, the methodologies developed for the mixing behavior of the edge process and the mixed process do not seem to be limited to the Gaussian assumption, and should enjoy wider applicability.
In this study, the quality of distributed inference at convergence has been neglected. This is a
very challenging research topic, and any progress in this area will likely lead to significant
impact. Relevant works along this line include [15], [25] and [26]. Another direction for future
work is to further consider the impact of channel uncertainties and communication constraints
[26][27]. One recent work [28] is also worth mentioning, where a hidden variable is introduced
to decouple the dependence among observations, for the purpose of simplifying the optimal local
decision rules in distributed detection. The feasibility of incorporating this idea into our
variational processing framework deserves further exploration.
APPENDIX

A. Derivation of Gaussian Variational Message Passing Eq. (13)

Consider the information representation of the posterior distribution given in (12). We can find a factorization of the form (1), with the individual node potential functions and edge compatibility functions given by

$\psi_i(X_i, y_i) = \exp\left( v_i(y_i) X_i - \tfrac{1}{2} F_{ii} X_i^2 \right)$, and $\psi_{ij}(X_i, X_j) = \exp\left( -\tfrac{1}{2} [X_i\ X_j] \begin{bmatrix} 0 & F_{ij} \\ F_{ij} & 0 \end{bmatrix} \begin{bmatrix} X_i \\ X_j \end{bmatrix} \right)$,   (45)

where $v_i(y_i) = H_i y_i / \sigma_{N,i}^2$. Defining an extended sufficient statistics vector $\eta_i(X_i) = [X_i, X_i^2]^T$, the corresponding extended natural parameters in (6) and (7) are respectively given by

$\theta_i = \left[ v_i,\; -\tfrac{1}{2} F_{ii} \right]^T$ and $\theta_{ij} = \left[ -F_{ij} X_j,\; 0 \right]^T$.   (46)

The MF variational distribution $Q_i(X_i)$ can be parameterized with the same sufficient statistics, and the corresponding natural parameter $\theta_{i,v}$ is given by $\left[ \mu_{\mathrm{MF},i} / \sigma_{\mathrm{MF},i}^2,\; -1/(2\sigma_{\mathrm{MF},i}^2) \right]^T$. Applying the variational message passing rule and after some algebra, we obtain the updating form given in (13).
B. Derivation of Gaussian Belief Propagation Eq. (16) and (18)

The message passing rules of belief propagation are derived on a spanning tree of the network.16 In this setting, the posterior probability (12) assumes the following form, composed of the prior marginals $p_i(x_i)$ and pairwise joint distributions $p_{ij}(x_i, x_j)$, as well as the marginal conditional distributions $f_i(y_i \mid x_i)$ (viewed as functions of $x_i$):

$P(\mathbf{X} \mid \mathbf{y}) \propto \prod_{i \in V} f_i(y_i \mid x_i) \cdot \frac{\prod_{(i,j) \in E} p_{ij}(x_i, x_j)}{\prod_{i \in V} p_i(x_i)^{|\mathcal{N}_i| - 1}}$,   (47)

where $\mathcal{N}_i$ denotes the neighbor set of node $i$. In the above expression, $f_i(y_i \mid x_i) \sim \mathcal{N}^{-1}\!\left( H_i y_i / \sigma_{N,i}^2,\; H_i^2 / \sigma_{N,i}^2 \right)$, $p_i(x_i) \sim \mathcal{N}^{-1}\!\left( 0,\; 1/\sigma_{S,i}^2 \right)$, and $p_{ij}(x_i, x_j) \sim \mathcal{N}^{-1}(\mathbf{0}, \Omega_{ij})$, $(i,j) \in E$, where

$\Omega_{ij} = \frac{1}{1 - \rho_{ij}^2} \begin{bmatrix} 1/\sigma_{S,i}^2 & -\rho_{ij}/(\sigma_{S,i} \sigma_{S,j}) \\ -\rho_{ij}/(\sigma_{S,i} \sigma_{S,j}) & 1/\sigma_{S,j}^2 \end{bmatrix}$.

Note that with the information parameter representation, if $p_1(x) \sim \mathcal{N}^{-1}(\beta_1, \Sigma_1)$ and $p_2(x) \sim \mathcal{N}^{-1}(\beta_2, \Sigma_2)$ are two different distributions on the same Gaussian random vector $x$, then the product density

$p_{12}(x) \propto p_1(x)\, p_2(x) \sim \mathcal{N}^{-1}(\beta_{12}, \Sigma_{12})$   (48)

with $\beta_{12} = \beta_1 + \beta_2$ and $\Sigma_{12} = \Sigma_1 + \Sigma_2$. Similarly, the quotient $p_1(x) / p_2(x)$ produces an exponential quadratic form with parameters $(\beta_1 - \beta_2,\; \Sigma_1 - \Sigma_2)$, which defines a valid probability density when $\Sigma_1 - \Sigma_2$ is positive definite. Moreover, if $x_1$ and $x_2$ are two joint
16 Various forms have been developed in literature (e.g., [15]). We provide a derivation for the specific forms of (16) and (18) here for completeness, which will also facilitate the discussion in Section III.
Gaussian random vectors with distribution $p(x_1, x_2) \sim \mathcal{N}^{-1}\!\left( \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \right)$, the marginal distribution $p(x_1) \sim \mathcal{N}^{-1}(\hat{\beta}_1, \hat{\Sigma}_1)$ is also Gaussian, with information parameters given by

$\hat{\beta}_1 = \beta_1 - \Sigma_{12} \Sigma_{22}^{-1} \beta_2$,
$\hat{\Sigma}_1 = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}$.   (49)
Comparing (47) with the standard Hammersley-Clifford equation (1), we have the node potential function

$\psi_i(X_i) = \mathcal{N}^{-1}(\beta_i, V_i)$,   (50)

where the parameters are given in (17), and the edge compatibility function

$\psi_{ij}(X_i, X_j) = \mathcal{N}^{-1}(\mathbf{0}, \Omega_{ij})$.   (51)
Consider the general message updating and belief computation formulas in belief propagation [1][12]:

$m_{ij}^{(n)}(x_j) = \int_{x_i} \psi_{ij}(x_i, x_j)\, \psi_i(x_i) \prod_{k \in \mathcal{N}_i \setminus \{j\}} m_{ki}^{(n-1)}(x_i)\, dx_i$,  $\quad b_i^{(n)}(x_i) = \psi_i(x_i) \prod_{k \in \mathcal{N}_i} m_{ki}^{(n)}(x_i)$.

By applying the product rule (48) to the message updating equation, we have the integrand

$\psi_{ij}(x_i, x_j)\, \psi_i(x_i) \prod_{k \in \mathcal{N}_i \setminus \{j\}} m_{ki}^{(n-1)}(x_i) \sim \mathcal{N}^{-1}(\mathbf{I}_{ij}, \mathbf{J}_{ij})$,   (52)

where

$\mathbf{I}_{ij} = \begin{bmatrix} \beta_i + \sum_{k \in \mathcal{N}_i \setminus \{j\}} \beta_{ki}^{(n-1)} \\ 0 \end{bmatrix}$,   (53)

$\mathbf{J}_{ij} = \begin{bmatrix} V_i + \sum_{k \in \mathcal{N}_i \setminus \{j\}} V_{ki}^{(n-1)} + \dfrac{1}{(1 - \rho_{ij}^2)\sigma_{S,i}^2} & \dfrac{-\rho_{ij}}{(1 - \rho_{ij}^2)\sigma_{S,i}\sigma_{S,j}} \\ \dfrac{-\rho_{ij}}{(1 - \rho_{ij}^2)\sigma_{S,i}\sigma_{S,j}} & \dfrac{1}{(1 - \rho_{ij}^2)\sigma_{S,j}^2} \end{bmatrix}$.   (54)

Using the marginalization rule (49) to perform the integration over $x_i$, we can obtain the message updating rule (16). The belief updating rule (18) follows more directly from the product rule (48).
C. Derivation of Inter-cluster Variational Message Passing Eq. (20)

As in Appendix A, define the extended sufficient statistics for the cluster $C_i$ as

$\eta_{C_i}(\mathbf{X}_{C_i}) = \left[ \{X_i\},\; \{X_i^2\},\; \{X_i X_j\} \right]^T$, $\quad i, j \in C_i$,   (55)

where $\{X_i\}$ denotes the collection of all $X_i$'s in cluster $C_i$, and the other two terms are defined similarly. Referring to (19) and (45), the corresponding extended natural parameters of the cluster potential and cluster compatibility functions are given by

$\theta_{C_i} = \left[ \{v_i\},\; \{-\tfrac{1}{2} F_{ii}\},\; \{-F_{ij}\} \right]^T$, $i, j \in C_i$, and $\theta_{C_i C_j} = \left[ \{-F_{ij} X_j\},\; \{0\},\; \{0\} \right]^T$, $i \in C_i$, $j \in C_j$,   (56)

where the involved parameters are the same as in (46). Considering the structure of $\theta_{C_i C_j}$ above, we only need to focus on the first component when applying the variational message passing rule in Section II.C. In particular, the messages are exchanged between gateway nodes and their Markov blankets, taking the form $E_{Q_j^{(n-1)}}[\theta_{C_i C_j}]$ according to (9); then, following the spirit of (10), such messages are used together with the parameters $v_i = H_i y_i / \sigma_{N,i}^2$ of the gateway nodes to update the corresponding variational distributions. Recall that in our structured variational method, the variational distribution is assumed to take the form $Q = \prod_{i=1}^{s} Q_i(\mathbf{X}_{C_i})$, and Gaussian belief propagation is adopted for intra-cluster inference. Therefore, the variational message passing rule indicates the following interaction between neighboring clusters: the intra-cluster inference in the previous round provides a mean estimate for all the variables in the Markov blanket of cluster $C_i$, i.e., $E_{Q_j^{(n-1)}}[X_j] = \left( W_j^{(n-1)} \right)^{-1} q_j^{(n-1)}$, $j \in \mathcal{MB}(C_i)$ (c.f. (18)). This estimate is then used to construct a "new" observation $y_i^{(n)}$, as given in (20), to stimulate a new round of intra-cluster inference through (16)-(18).
D. Proof of Lemma 4.2

Notice that the differences $T_{i+1} - T_i$ are i.i.d. geometric random variables with parameter $2q(n)/n$. We can readily compute the probability

$p = \Pr\left( T_i - T_{i-1} \le \lfloor n/q(n)^{3/4} \rfloor,\; T_{i+1} - T_i \le \lfloor n/q(n)^{3/4} \rfloor \right) = \left( 1 - \left( 1 - 2q(n)/n \right)^{\lfloor n/q(n)^{3/4} \rfloor} \right)^2$,   (57)

which is non-vanishing as $n \to \infty$ (for all $q(n) = O(n)$). Choose a constant $c_2 \ge 1/p$ and consider a group of $\lceil 2c_2 q(n)^{3/2} \rceil$ positive odd integers. By the Chernoff bound, the probability that there are at least $\lceil q(n)^{3/2} \rceil$ positive odd integers $i$ satisfying $T_i - T_{i-1} \le \lfloor n/q(n)^{3/4} \rfloor$ and $T_{i+1} - T_i \le \lfloor n/q(n)^{3/4} \rfloor$ in such a group is at least $1 - \exp\left( -\tfrac{1}{2}\left( 1 - \tfrac{1}{2c_2 p} \right)^2 2c_2 p\, q(n)^{3/2} \right)$, which approaches 1 as $n$ goes to infinity. Finally, consider a constant $c_1 \ge 2(c_2 + 1)$; we claim that in the first $c_1 \sqrt{q(n)}\, n$ steps there are at least $\lceil 2c_2 q(n)^{3/2} \rceil$ turns w.h.p. To prove this claim, note that the random variable $T_{\lceil 2c_2 q(n)^{3/2} \rceil}$ is the sum of $\lceil 2c_2 q(n)^{3/2} \rceil$ i.i.d. geometric random variables, each with mean $n/(2q(n))$ and variance between $n^2/(12 q(n)^2)$ and $n^2/(4 q(n)^2)$. By Chebyshev's inequality,

$\Pr\left( T_{\lceil 2c_2 q(n)^{3/2} \rceil} \ge c_1 \sqrt{q(n)}\, n \right) \le \frac{\mathrm{var}\left( T_{\lceil 2c_2 q(n)^{3/2} \rceil} \right)}{\left( c_1 \sqrt{q(n)}\, n - E\, T_{\lceil 2c_2 q(n)^{3/2} \rceil} \right)^2} \le \frac{2c_2 q(n)^{3/2} \cdot n^2 / (4 q(n)^2)}{\left( (c_2 + 2) \sqrt{q(n)}\, n \right)^2} \to 0$.   (58)
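The probability in (57) can be checked numerically. The sketch below evaluates it under the illustrative choice $q(n) = \sqrt{n}$ (our own choice of a sequence satisfying $q(n) \to \infty$ and $q(n)/n \le 1/3$), confirming it stays bounded away from zero as $n$ grows.

```python
import math

def p_both_short(n, q):
    """Probability in (57): two consecutive inter-turn gaps (i.i.d.
    geometric with parameter 2q/n) are both at most floor(n / q^{3/4})."""
    m = math.floor(n / q ** 0.75)
    p_one = 1.0 - (1.0 - 2.0 * q / n) ** m   # P(Geom(2q/n) <= m)
    return p_one ** 2

# With q(n) = sqrt(n), the probability is non-vanishing in n.
for n in (10**3, 10**4, 10**5):
    q = n ** 0.5
    print(n, round(p_both_short(n, q), 4))
```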
E. Proof of Lemma 4.3

Recall that the differences $T_{i+1} - T_i$ are i.i.d. geometric random variables with parameter $2q(n)/n$. For turning times indexed by $S_1$, there is a useful property: it is known [22] that $\Pr\left( T_i = k \mid T_{i-1} = j,\; T_{i+1} = L \right) = \frac{1}{L - j - 1}$ for $k \in \{j+1, j+2, \ldots, L-1\}$. That is, when $T_{S_2}$ is fixed, for every $i \in S_1$ the component vector $\left( \pm(T_i - T_{i-1}),\; \pm(T_{i+1} - T_i) \right)$ in (34) and (35) is a uniform random vector on the space $\Theta = \left\{ -\lfloor n/q(n)^{3/4} \rfloor, \ldots, -1, 1, \ldots, \lfloor n/q(n)^{3/4} \rfloor \right\}^2$. From Lemma 4.2, we know that if $t \ge c_1 \sqrt{q(n)}\, n$, there are at least $\lceil q(n)^{3/2} \rceil$ such uniform random vectors on $\Theta$, with zero mean and covariance matrix $\Sigma$ whose entries are all on the order of $n^2 / q(n)^{3/2}$. Denote these random vectors as $\mathbf{X}_m(n)$, $m = 1, \ldots, \lceil q(n)^{3/2} \rceil$, and let $\mathbf{Y}_m(n) = \Sigma^{-1/2} \mathbf{X}_m(n) / q(n)^{3/4}$. Then from the Central Limit Theorem, the summation of the $\mathbf{Y}_m(n)$ converges to a standard multivariate normal distribution:

$\mathbf{S}_n = \sum_{m=1}^{\lceil q(n)^{3/2} \rceil} \mathbf{Y}_m(n) \xrightarrow{D} \mathcal{N}(\mathbf{0}, \mathbf{I}_{2 \times 2})$.   (59)

Clearly the probability $\Pr\left( c_3 \le \| \mathbf{S}_n \| \le c_4 \right)$ is non-vanishing for any constants $c_3 < c_4$, where $\| \cdot \|$ denotes the Euclidean distance. Noting that the entries of $q(n)^{3/4} \Sigma^{1/2}$ are on the order of $n$, there exists a constant $c_5$ such that $\Pr\left( \| q(n)^{3/4} \Sigma^{1/2} \mathbf{S}_n \| \le 2n \right) \ge c_5$. Also note that $\mathbf{S}_n$ is a zero-mean unimodal symmetric random vector. Therefore there exists a constant $c$ such that $\Pr\left( q(n)^{3/4} \Sigma^{1/2} \mathbf{S}_n = (x, y) \right) \ge 2c / n^2$. Now consider the sum of $\lceil q(n)^{3/2} \rceil$ i.i.d. random vectors which are uniform on $\Theta$ (those indexed by $S_1$ in (34) and (35), with the rest fixed), and denote it as $\mathbf{U}_n$; then $\mathbf{U}_n$ and $q(n)^{3/4} \Sigma^{1/2} \mathbf{S}_n$ have the same distribution. Therefore the lemma is proved.
F. Proof of Lemma 5.1

From the state representation of the edge process (c.f. Figure 3), we know that hitting an outgoing edge on the boundary corresponds to $s_0^t = n$ or $-n$. According to (34) and (35), $s_0^t$ is given by

$s_0^t = \begin{cases} (t - T_k) \pm (T_{k-1} - T_{k-2}) \pm \cdots \pm T_1 \pm a & k \text{ even} \\ \pm b + (t - T_k) \pm (T_{k-1} - T_{k-2}) \pm \cdots \pm (T_2 - T_1) & k \text{ odd}, \end{cases}$   (60)

which is a sum of $k$ signed geometric random variables with zero mean and variance on the order of $n^2 / q(n)^{3/2}$. We learn from the proof of Lemma 4.2 that in the first $c_1 \sqrt{q(n)}\, n$ steps there are $k = O(q(n)^{3/2})$ turns. Let us consider $s_0^t = n$, for which the local limit theorem ([23], page 10) can be invoked to get

$\pi(n) = \Pr(s_0^t = n) \ge \frac{1}{2\sqrt{ \left( n^2 / q(n)^{3/2} \right) q(n)^{3/2} }} \exp\left( -\frac{n^2}{2 \left( n^2 / q(n)^{3/2} \right) q(n)^{3/2}} \right)$,   (61)

or $\Pr(s_0^t = n) \ge 1/(2n\sqrt{e})$ as $n \to \infty$. Then by the Chernoff bound, the probability that the walk hits the boundary edges at least once before $t = c_6 \sqrt{q(n)}\, n$, for some constant $c_6$, is at least $1 - \exp\left( -\tfrac{1}{2} \left( 1 - 1/(c_7 \sqrt{q(n)}) \right)^2 c_7 \sqrt{q(n)} \right)$ (with $c_7 = c_6 / (2\sqrt{e})$), which approaches 1 as $n \to \infty$.
REFERENCES

[1] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufmann, 1988.
[2] G. Cooper, "Probabilistic inference using belief networks is NP-hard," Artificial Intelligence, vol. 42, pp. 393-405, 1990.
[3] C. C. Moallemi and B. Van Roy, "Consensus propagation," IEEE Transactions on Information Theory, vol. 52, no. 11, pp. 4753-4766, 2006.
[4] T. Jaakkola, "Tutorial on variational approximation methods," in Advanced Mean Field Methods: Theory and Practice. MIT Press, 2000.
[5] C. M. Bishop, J. M. Winn, and D. Spiegelhalter, "VIBES: A variational inference engine for Bayesian networks," Advances in Neural Information Processing Systems, 2002.
[6] J. Dauwels, "On variational message passing on factor graphs," Proc. International Symposium on Information Theory (ISIT), 2007.
[7] L. Saul and M. Jordan, "Exploiting tractable substructures in intractable networks," in Advances in Neural Information Processing Systems, vol. 8. MIT Press, 1996.
[8] E. P. Xing, M. I. Jordan, and S. Russell, "A generalized mean field algorithm for variational inference in exponential families," in Proceedings of the 19th Annual Conference on Uncertainty in Artificial Intelligence (UAI), 2003.
[9] D. Randall, "Rapidly mixing Markov chains with applications in computer science and physics," Computing in Science and Engineering, vol. 8, no. 2, March 2006.
[10] P. Diaconis, S. Holmes, and R. M. Neal, "Analysis of a non-reversible Markov chain sampler," Biometrics Unit, Cornell University, Tech. Rep. BU-1385-M, 1997.
[11] T. Lindvall, Lectures on the Coupling Method. Courier Dover Publications, 2002.
[12] J. Yedidia, W. Freeman, and Y. Weiss, "Understanding belief propagation and its generalizations," Technical Report TR-2001-22, Mitsubishi Electric Research Laboratories, 2001.
[13] M. J. Wainwright and M. I. Jordan, "Graphical models, exponential families, and variational inference," Foundations and Trends in Machine Learning, vol. 1, nos. 1-2, pp. 1-305, 2008.
[14] Y. Zhang and H. Dai, "Distributed network decomposition: a probabilistic greedy approach," 2010 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Dallas, TX, March 2010.
[15] Y. Weiss and W. Freeman, "Correctness of belief propagation in Gaussian graphical models of arbitrary topology," Neural Computation, vol. 13, no. 10, pp. 2173-2200, 2001.
[16] D. Malioutov, J. Johnson, and A. Willsky, "Walk-sums and belief propagation in Gaussian graphical models," Journal of Machine Learning Research, 2006.
[17] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah, "Mixing times for random walks on geometric random graphs," SIAM Workshop on Analytic Algorithmics and Combinatorics (ANALCO), Vancouver, Canada, January 2005.
[18] W. Li, H. Dai, and Y. Zhang, "Location aided fast distributed consensus in wireless networks," IEEE Transactions on Information Theory, vol. 56, no. 12, pp. 6208-6227, December 2010.
[19] D. A. Levin, Y. Peres, and E. L. Wilmer, Markov Chains and Mixing Times. American Mathematical Society, 2008.
[20] V. Delouille, R. Neelamani, and R. Baraniuk, "Robust distributed estimation using the embedded subgraphs algorithm," IEEE Transactions on Signal Processing, vol. 54, no. 8, 2006.
[21] K. Schloegel, G. Karypis, and V. Kumar, "Graph partitioning for high performance scientific simulations," CRPC Parallel Computing Handbook, 2000.
[22] G. Grimmett and D. Stirzaker, Probability and Random Processes. Oxford University Press, 2001.
[23] V. F. Kolchin, Random Graphs. Cambridge University Press, 1999.
[24] E. Riegler, G. E. Kirkelund, C. N. Manchon, and B. H. Fleury, "Merging belief propagation and the mean field approximation: A free energy approach," 2010 6th International Symposium on Turbo Codes and Iterative Information Processing (ISTC), pp. 256-260, September 2010.
[25] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky, "Tree-based reparameterization framework for analysis of sum-product and related algorithms," IEEE Transactions on Information Theory, vol. 49, no. 5, pp. 1120-1146, May 2003.
[26] A. T. Ihler, J. W. Fisher III, and A. S. Willsky, "Loopy belief propagation: Convergence and effects of message errors," Journal of Machine Learning Research, vol. 6, pp. 905-936, May 2005.
[27] O. P. Kreidl and A. S. Willsky, "Inference with minimal communication: A decision-theoretic variational approach," Advances in Neural Information Processing Systems, 2006.
[28] H. Chen, B. Chen, and P. K. Varshney, "A new framework for distributed detection with conditionally dependent observations," IEEE Transactions on Signal Processing, vol. 60, no. 3, pp. 1409-1419, March 2012.
[29] Y. Zhang and H. Dai, "Structured variational methods for distributed inference in wireless ad hoc and sensor networks," 2009 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Taipei, April 2009.
[30] Y. Zhang and H. Dai, "Structured variational methods for distributed inference: Convergence analysis and performance-complexity tradeoff," 2009 IEEE International Symposium on Information Theory (ISIT), Seoul, South Korea, June 2009.
Huaiyu Dai (M’03, SM’09) received the B.E. and M.S. degrees in Electrical Engineering from Tsinghua University, Beijing, China, in 1996 and 1998, respectively, and the Ph.D. degree in Electrical Engineering from Princeton University, Princeton, NJ in 2002.
He was with Bell Labs, Lucent Technologies, Holmdel, NJ, during summer 2000, and with AT&T Labs-Research, Middletown, NJ, during summer 2001. Currently he is an Associate Professor of Electrical and Computer Engineering at NC State University, Raleigh. His research interests are in the general areas of communication systems and networks, advanced signal processing for digital communications, and communication theory and information theory. His current research focuses on networked information processing and crosslayer design in wireless networks, cognitive radio networks, wireless security, and associated information-theoretic and computation-theoretic analysis.
He has served as an editor for the IEEE Transactions on Communications, the IEEE Transactions on Signal Processing, and the IEEE Transactions on Wireless Communications. He co-edited two special issues for EURASIP journals, on distributed signal processing techniques for wireless sensor networks and on multiuser information theory and related applications, respectively. He co-chairs the Signal Processing for Communications Symposium of IEEE Globecom 2013, the Communications Theory Symposium of IEEE ICC 2014, and the Wireless Communications Symposium of IEEE Globecom 2014.
Yanbing Zhang (M’09) received the B.E. and M.S. degrees in electronics engineering from Tsinghua University, Beijing, China, in 2001 and 2004, respectively, and the Ph.D. degree in electrical engineering from North Carolina State University, Raleigh, in 2009. Currently, he is a Staff Scientist in the Mobile Communications Group, Broadcom Corporation, Matawan, NJ. His research interests are in the general areas of wireless communications and networking and signal processing for wireless communications, with emphasis on cooperative communication and information processing in wireless networks.
Juan Liu received her B.S. degree in Information and Electronic Engineering from Zhejiang University, Hangzhou, China, in 2000, her M.S. degree in Information Engineering from Beijing University of Posts and Telecommunications, Beijing, China, in 2005, and her Ph.D. degree in Electronic Engineering from Tsinghua University, Beijing, China, in 2011. She is currently a postdoctoral researcher in the Department of Electrical and Computer Engineering, NC State University, Raleigh, NC. Her research interests are in wireless communications.
Figure 1 Markov blanket MB(C_i) (shaded nodes) and Markov blanket clusters (shaded clusters) for cluster C_i
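As an illustration of the notion in Figure 1 (a sketch not taken from the paper, using an assumed edge-list graph representation): in an undirected graphical model, the Markov blanket MB(C_i) of a cluster C_i is the set of nodes outside C_i that share an edge with some node in C_i.

```python
def markov_blanket(edges, cluster):
    """Return MB(cluster): nodes outside `cluster` adjacent to it,
    for an undirected graph given as a list of (u, v) edges."""
    cluster = set(cluster)
    blanket = set()
    for u, v in edges:
        # An edge crossing the cluster boundary contributes its
        # outside endpoint to the Markov blanket.
        if u in cluster and v not in cluster:
            blanket.add(v)
        if v in cluster and u not in cluster:
            blanket.add(u)
    return blanket

# Example: path graph 0-1-2-3-4-5 with cluster {2, 3}.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
print(sorted(markov_blanket(edges, {2, 3})))  # → [1, 4]
```

Conditioned on its Markov blanket, the cluster is independent of the rest of the network, which is what allows the inter-cluster (vertex) and intra-cluster (edge) processing to be separated.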
Figure 2 Vertex process (a), edge process (b), and mixed process (c)
Figure 3 State labeling of edge process on a 2-d torus
Figure 4 Mixing times of the vertex and edge processes
Figure 5 Mean square error of estimation versus time complexity
[Plot: mean square error (log scale, 10^-5 to 10^0) versus iteration number (0 to 60) for MF, BP, SMF with centralized clustering, and SMF with distributed clustering; curves annotated by cluster size.]
Figure 6 Mean square error of estimation versus message complexity
[Plot: mean square error (log scale, 10^-5 to 10^-1) versus number of exchanged messages (0 to 250) for BP, MF, SMF with centralized clustering, and SMF with distributed clustering; curves annotated by cluster size.]