Fundamentals of Learning Algorithms in Boltzmann
Machines
by Mihaela G. Erbiceanu
M. Eng., "Gheorghe Asachi" Technical University, 1991
Project Submitted in Partial Fulfillment of the
Requirements for the Degree of
Master of Computing Science
in the
School of Computing Science
Faculty of Applied Sciences
© Mihaela G. Erbiceanu 2016
SIMON FRASER UNIVERSITY
Fall 2016
Approval
Name: Mihaela Erbiceanu
Degree: Master of Science (Computing Science)
Title: Fundamentals of Learning Algorithms in
Boltzmann Machines
Examining Committee:
Chair: Binay Bhattacharya, Professor
Petra Berenbrink, Senior Supervisor, Professor
Andrei Bulatov, Supervisor, Professor
Leonid Chindelevitch, External Examiner, Assistant Professor
Date Defended/Approved: September 7, 2016
Abstract
Boltzmann learning underlies an artificial neural network model known as the Boltzmann
machine, which extends and improves upon the Hopfield network model. The Boltzmann
machine model uses stochastic binary units and allows for hidden units that represent
latent variables. When noise is reduced via simulated annealing and uphill steps are
allowed via the Metropolis algorithm, the training algorithm increases the chances that,
at thermal equilibrium, the network settles on the best distribution of parameters. The
existence of an equilibrium distribution for an asynchronous Boltzmann machine is
analyzed with respect to temperature. Two families of learning algorithms,
which correspond to two different approaches to compute the statistics required for
learning, are presented. The learning algorithms based only on stochastic
approximations are traditionally slow. When variational approximations of the free
energy are used, like the mean field approximation or the Bethe approximation, the
performance of learning improves considerably. The principal contribution of the present
study is to provide, from a rigorous mathematical perspective, a unified framework for
these two families of learning algorithms in asynchronous Boltzmann machines.
Keywords: Boltzmann–Gibbs distribution, Gibbs free energy, asynchronous Boltzmann
machine, thermal equilibrium, data–dependent statistics, data–independent statistics,
stochastic approximation, variational method, mean field approximation, Bethe
approximation.
Dedication
This thesis is dedicated to my mother for her support, sacrifice, and constant love.
Acknowledgements
First and foremost, I would like to thank my supervisor Petra Berenbrink not only for
giving me the opportunity to work on this thesis under her supervision, but also for her
valuable feedback. I would also like to thank Andrei Bulatov, my second supervisor, and
my committee members, Leonid Chindelevitch and Binay Bhattacharya, for their support,
encouragement, and patience.
Table of Contents
Approval
Abstract
Dedication
Acknowledgements
Table of Contents
List of Tables
List of Figures
List of Acronyms

Chapter 1. Introduction
1.1 Motivation
1.2 Overview and roadmap
1.3 Related work
1.4 Connection to other disciplines

Chapter 2. Foundations
2.1 Boltzmann–Gibbs distribution
2.2 Markov random fields and Gibbs measures
2.3 Gibbs free energy
2.4 Connectionist networks
2.5 Hopfield networks
2.5.1 Hopfield network models
2.5.2 Convergence of the Hopfield network

Chapter 3. Variational methods for Markov networks
3.1 Pairwise Markov networks as exponential families
3.1.1 Basics of exponential families
3.1.2 Canonical representation of pairwise Markov networks
3.1.3 Mean parameterization of pairwise Markov networks
3.1.4 The role of transformations between parameterizations
3.2 The energy functional
3.3 Gibbs free energy revisited
3.3.1 Hamiltonian and Plefka expansion
3.3.2 The Gibbs free energy as a variational energy
3.4 Mean field approximation
3.4.1 The mean field energy functional
3.4.2 Maximizing the energy functional: fixed–point characterization
3.4.3 Maximizing the energy functional: the naïve mean field algorithm
3.5 Bethe approximation
3.5.1 The Bethe free energy
3.5.2 The Bethe–Gibbs free energy
3.5.3 The relationship between belief propagation fixed–points and Bethe free energy
3.5.4 Belief optimization

Chapter 4. Introduction to Boltzmann Machines
4.1 Definitions
4.2 Modelling the underlying structure of an environment
4.3 Representation of a Boltzmann Machine as an energy–based model
4.4 How a Boltzmann Machine models data
4.5 General dynamics of Boltzmann Machines
4.6 The biological interpretation of the model

Chapter 5. The Mathematical Theory of Learning Algorithms for Boltzmann Machines
5.1 Problem description
5.2 Phases of a learning algorithm in a Boltzmann Machine
5.3 Learning algorithms based on approximate maximum likelihood
5.3.1 Learning by minimizing the KL–divergence of Gibbs measures
5.3.2 Collecting the statistics required for learning
5.4 The equilibrium distribution of a Boltzmann machine
5.5 Learning algorithms based on variational approaches
5.5.1 Using variational free energies to compute the statistics required for learning
5.5.2 Learning by naïve mean field approximation
5.6 Unlearning and relearning in Boltzmann Machines

Chapter 6. Conclusions
6.1 Summary of what has been done
6.2 Future directions

References
Appendix A: Mathematical notations
Appendix B: Probability theory and statistics
Appendix C: Finite Markov chains
List of Tables
Table 1  Distributions of interest in asynchronous Boltzmann machine learning
Table 2  Transition probability matrices for asynchronous symmetric Boltzmann machines
List of Figures
Figure 1  a) A fully–connected Boltzmann machine with three visible nodes and four hidden nodes; b) A layered Boltzmann machine with one visible layer and two hidden layers.
List of Acronyms
SFU Simon Fraser University
LAC Library and Archives Canada
BO belief optimization
BP belief propagation
CD contrastive divergence
KL–divergence Kullback–Leibler divergence
LBP loopy belief propagation
MCMC Markov chain Monte Carlo
ML maximum likelihood
Chapter 1. Introduction
1.1 Motivation
Boltzmann machines are a particular class of artificial neural networks that have been
extensively studied because of the interesting properties of the associated learning algorithms.
In this context, learning for Boltzmann machines means “acquiring a particular behavior by
observing it” [1]. The machine is named after Ludwig Boltzmann, who discovered the
fundamental law governing the equilibrium state of a gas. The distribution of the molecules of an
ideal gas among the various energy states is called the Boltzmann–Gibbs distribution. Geoffrey
Hinton and Terrence Sejnowski adopted this distribution as the stochastic update rule of a new
network, which they named the “Boltzmann Machine”.
From a purely theoretical point of view, a Boltzmann machine is a generalization of a Hopfield
network in which the units update their states according to a stochastic decision rule and which
allows the presence of hidden units.
From a graphical model point of view, a Boltzmann machine is a binary pairwise Markov random
field in which every node is endowed with a non–linear activation function similar to an
activation model for neurons. As a graphical model, a Boltzmann machine has both a structural
component, encoded by the pattern of edges in the underlying graph, and a parametric
component, encoded by the potentials associated with sets of edges in the underlying graph.
The particularities of the Boltzmann machine model that make it suitable for pattern recognition
tasks are due more to its parameterization than to its conditional independence structure.
Most of the interest in Boltzmann machines has come from the neural network field, where a
particular type of Boltzmann machine – the layered Boltzmann machine – is considered a deep
neural network. The learning algorithms for Boltzmann machines have mostly been created to
train this kind of neural network.
Boltzmann machines are theoretically intriguing because of the locality and Hebbian1 nature of
their training algorithm, and because of their parallelism and the resemblance of their dynamics
to simple physical processes [2]. There is, however, one drawback in the use of learning
process in Boltzmann machines: the process is computationally very expensive. The
1 Hebbian theory is a theory in neuroscience that proposes an explanation for the adaptation of neurons in the
brain during the learning process. See Section 2.4 for more information.
computational complexity of the exact algorithm is exponential in the number of neurons
because it involves the computation of the partition function of the Boltzmann–Gibbs
distribution, which requires a sum over all states of the network, of which there are
exponentially many. If a
learning algorithm uses an approximate inference method to compute the partition function of
the Boltzmann–Gibbs distribution, then the learning process can be made efficient enough to be
useful for practical problems.
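As an illustration of this cost (our own sketch, not code from the thesis; the quadratic energy function, the random weights, and the temperature are arbitrary choices), the exact partition function can only be computed by enumerating all 2^n states, which is infeasible beyond small n:

```python
import itertools
import math

import numpy as np

def partition_function(W, b, T=1.0):
    """Exact partition function Z of a binary network with energy
    E(s) = -0.5 * s^T W s - b^T s, summed over all 2^n states."""
    n = len(b)
    Z = 0.0
    for bits in itertools.product([0, 1], repeat=n):  # 2^n terms
        s = np.array(bits, dtype=float)
        E = -0.5 * s @ W @ s - b @ s
        Z += math.exp(-E / T)
    return Z

rng = np.random.default_rng(0)
n = 10                      # 2^10 = 1024 states; the count doubles with each unit
W = rng.normal(size=(n, n))
W = (W + W.T) / 2           # symmetric weights
np.fill_diagonal(W, 0.0)    # no self-connections
b = rng.normal(size=n)
Z = partition_function(W, b)
```

Approximate inference methods exist precisely to avoid this enumeration.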
Based on the approach employed to compute the statistics required for learning, the Boltzmann
machine learning algorithms are divided into two groups or families. One family of algorithms
uses only stochastic approximations; the other family uses both variational approximations and
stochastic approximations to compute the statistics.
The goal of this paper is to present, from a rigorous mathematical perspective, a unified
framework for the two families of learning algorithms in asynchronous Boltzmann machines.
Precursors of our
approach are: Sussmann, who elaborated in 1988-1989 a rigorous mathematical analysis of the
original asynchronous Boltzmann machine learning algorithm [1,3]; Welling and Teh who, in
2002, reviewed the inference algorithms in Boltzmann machines with emphasis on the
advanced mean field methods and the loopy belief propagation [4]; Salakhutdinov, who, in
2008, in the context of arbitrary Markov random fields, reviewed some Monte Carlo based
methods for estimating partition functions, as well as the variational framework for obtaining
deterministic approximations or upper bounds for the log partition function [5]. During the 1990s
and early 2000s the subject of learning algorithms for Boltzmann machines was somewhat
neglected by the research community. However, a promising new method to train deep neural
networks proposed in 2006 by Hinton et al. [6] has caused a resurgence of interest in this
subject. Despite its ups and downs as a research subject, a considerable number of papers on
the learning algorithms for Boltzmann machines have been published. Some of these papers
proposed refinements for existing algorithms; others proposed new algorithms or even
completely new approaches.
However, as far as we know, there has not yet been a documented effort to gather in one place,
with a consistent set of definitions and notations, and built on a unified framework of concepts,
proofs, and interpretation of results, the mathematical foundations of the main families of
Boltzmann machine learning algorithms. By approaching the topic of this paper from a computer
science theoretical perspective, but without omitting the intuition behind, we intend to fill this
void and to help other interested parties to obtain a good understanding of the intricacies and
limitations of Boltzmann machines and their learning algorithms.
1.2 Overview and roadmap
This paper consists of six chapters and three appendices and is organized as follows.
In Chapter 1, we present an introduction to the topic of this paper and our goals in covering it.
We also include a brief history of Boltzmann machine learning and what connections it has with
other disciplines.
In Chapter 2, we introduce the Boltzmann–Gibbs distribution as the main source of inspiration
for Boltzmann machine. We also review the main concepts and results from Markov random
field theory that are subsequently used in this paper. Then we introduce the Gibbs free energy
and its intrinsic relationship with the Boltzmann–Gibbs distribution. Furthermore we introduce
the precursors of Boltzmann machine: the connectionist networks and the Hopfield networks.
Because the asynchronous Hopfield network represents the limiting case of the asynchronous
Boltzmann machine as the “temperature” parameter 𝐓 → 0, we cover the dynamics and
convergence of Hopfield networks as well as their learning algorithms.
In Chapter 3, we start by introducing the basics of variational methodology. We also explain
how, in certain conditions, the Gibbs free energy can be viewed as a variational energy. Then
we review the main concepts and results regarding two classes of variational methods that are
used by Boltzmann machine learning algorithms to approximate the free energy of a Markov
random field: the mean field approximation and the Bethe approximation.
In Chapter 4, we introduce the asynchronous Boltzmann machine. We provide a detailed
description of its functionality, from formal definitions, the modelling of the underlying
environment, the energy–based representation, and the data representation, to its general
dynamics, without omitting the intuition behind its concepts and algorithms. We end this
chapter with the
biological interpretation of the model as it was given by Hinton.
In Chapter 5 we start by formally defining the process of learning and justifying why the
Boltzmann machine learning algorithms have two phases. Then we present two categories of
learning algorithms for asynchronous Boltzmann machines: those based on Monte Carlo
methods and those based on variational approximations of the free energy, specifically the
mean field approximation and the Bethe approximation. For each category we present the
derivation and analysis of the original algorithm. Other important algorithms from each category
are introduced by presenting their differences and/or improvements compared to the original
algorithm. Finally, we cover the processes of unlearning and relearning in asynchronous
Boltzmann machines.
Finally, Chapter 6 contains a very brief summary and outlook.
In Appendix A we introduce the mathematical notations used throughout this paper.
In Appendix B we review the main concepts from probability theory and statistics that are
necessary for a good understanding of this paper.
In Appendix C we review the main concepts regarding finite Markov chains that are necessary
for a good understanding of this paper.
1.3 Related work
In 1982, Hopfield showed that a network of symmetrically–coupled binary threshold units has a
simple quadratic energy function that governs its dynamic behavior [7]. When the nodes are
deterministically updated one at a time, the network settles to an energy minimum and Hopfield
suggested using these minima to store content–addressable memories.
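This behavior can be sketched as follows (a minimal illustration under our own conventions, with ±1 units and random symmetric weights, not Hopfield's original notation): asynchronous threshold updates never increase the quadratic energy, so the network settles into a minimum.

```python
import numpy as np

def energy(W, s):
    """Hopfield energy E(s) = -0.5 * s^T W s for states s in {-1, +1}^n."""
    return -0.5 * s @ W @ s

def hopfield_settle(W, s, sweeps=20):
    """Asynchronous deterministic threshold updates; with symmetric W and
    zero diagonal, each update never increases E, so the dynamics converge
    to a local energy minimum (a stored pattern, in Hopfield's usage)."""
    s = s.copy()
    for _ in range(sweeps):
        for i in range(len(s)):
            s[i] = 1 if W[i] @ s >= 0 else -1
    return s

rng = np.random.default_rng(1)
n = 8
W = rng.normal(size=(n, n))
W = (W + W.T) / 2           # symmetry is required for convergence
np.fill_diagonal(W, 0.0)    # no self-connections
s0 = rng.choice([-1, 1], size=n)
s_final = hopfield_settle(W, s0)   # energy(W, s_final) <= energy(W, s0)
```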
Hinton and Sejnowski realized that the energy function can be viewed as an indirect way of
defining a probability distribution over all the binary configurations of the network and that, if the
right stochastic updating rule is used, the dynamics eventually produces samples from the
Boltzmann–Gibbs distribution [8-9]. This discovery led them to invent in 1983 the “Boltzmann
Machine” [8-9]. Furthermore, if a Boltzmann machine is divided into a set of visible nodes whose
states are externally forced or “clamped” at the data and a disjoint set of hidden nodes, the
stochastic updating produces samples from the posterior distribution over configurations of the
hidden nodes given the current data [8,10-11]. Ackley, Hinton, and Sejnowski proposed a
learning algorithm that performs maximum likelihood learning of the weights that define the
hidden nodes and uses sequential Gibbs sampling to approach the posterior distribution [10].
This new algorithm is known in literature as the original learning algorithm/procedure for
(asynchronous) Boltzmann machines. Inspired by Kirkpatrick, Gelatt, and Vecchi [12], Hinton
and Sejnowski used simulated annealing from a high initial “temperature” to a final
“temperature” of 1 to speed up convergence to the stationary distribution. They demonstrated
that this was a feasible way of learning the weights in small networks. However, the original
learning procedure was still much too slow to be practical for learning large, multilayer
Boltzmann machines [13]. The simplicity and locality of original learning procedure for
Boltzmann machines led to much interest, but the settling time required getting samples from
the right distribution and the high noise in the estimates made learning slow and unreliable [5].
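The two ingredients just described can be sketched as follows, under our own conventions (0/1 units, unit temperature, and sample arrays we assume have already been collected): the stochastic update rule whose stationary distribution is the Boltzmann–Gibbs distribution, and the maximum likelihood weight update contrasting clamped ("positive") and free-running ("negative") statistics.

```python
import numpy as np

def gibbs_sweep(W, b, s, T=1.0, rng=None):
    """One asynchronous sweep of the stochastic update rule: unit i turns
    on with probability sigmoid(gap_i / T), where gap_i is the energy
    difference between s_i = 0 and s_i = 1. Iterated, this Markov chain
    produces samples from the Boltzmann-Gibbs distribution."""
    rng = rng or np.random.default_rng()
    s = s.copy()
    for i in range(len(s)):
        gap = W[i] @ s - W[i, i] * s[i] + b[i]   # exclude the self term
        p_on = 1.0 / (1.0 + np.exp(-gap / T))
        s[i] = 1 if rng.random() < p_on else 0
    return s

def boltzmann_weight_update(clamped, free, lr=0.01):
    """Maximum likelihood gradient step lr * (<s s^T>_clamped - <s s^T>_free).
    `clamped` holds states sampled with the visible units fixed to data;
    `free` holds states of the freely running machine."""
    pos = clamped.T @ clamped / len(clamped)
    neg = free.T @ free / len(free)
    dW = lr * (pos - neg)
    np.fill_diagonal(dW, 0.0)   # no self-connections
    return dW
```

The update is local and Hebbian: it depends only on the co-activation statistics of each pair of units, which is the property the text highlights.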
During the following two decades the researchers tried to improve the learning speed of
Boltzmann machine by using various approaches.
In 1992 Neal improved the original learning procedure by using persistent Markov chains [14].
Neal did not explicitly use simulated annealing. However, the persistent Markov chains
implement it implicitly, provided that the weights have small initial values. Neal showed that
persistent Markov chains work quite well for training a Boltzmann machine on a fairly small data
set [14]. For large data sets, however, it is much more efficient to update the weights after each
small mini–batch of training examples [13].
The first efficient learning procedure for large–scale asynchronous Boltzmann machines used
an extremely limited architecture named Restricted Boltzmann Machine. This architecture
together with its learning procedure was first proposed by Smolensky in 1986 [15] and it was
designed to make inference tractable [13].
In 1987, in an attempt to reduce the time required by the sampling process, Peterson and
Anderson [16-17] replaced Gibbs sampling with a simple mean field method that approximates
a stationary distribution by replacing stochastic binary values with deterministic real–valued
probabilities. More sophisticated deterministic approximation methods were investigated by
Galland in 1990 [18-19], Kappen and Rodriguez in 1998 [20-21], and Tanaka in 1998 [22-23]
but none of these approximations worked very well for learning for reasons that were not well
understood at the time [13]. Similar deterministic approximation methods were studied
intensively in the 1990s in the context of learning directed graphical models [24-27]. In 2010
Salakhutdinov interpreted these results and provided a possible explanation of the limited
success of using deterministic approximation methods for learning in asynchronous Boltzmann
machines [13].
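The simple mean field idea can be sketched as a fixed–point iteration (our own illustrative version; the damping factor and iteration count are arbitrary choices): each stochastic binary unit is replaced by a deterministic real–valued probability m_i, updated as sigmoid(Σ_j w_ij m_j + b_i).

```python
import numpy as np

def naive_mean_field(W, b, iters=200, damping=0.5):
    """Iterate the mean field equations m_i <- sigmoid(sum_j W_ij m_j + b_i),
    with damping for stability, starting from the uninformative point 0.5.
    The fixed point approximates the marginal probabilities of the units."""
    m = np.full(len(b), 0.5)
    for _ in range(iters):
        m_new = 1.0 / (1.0 + np.exp(-(W @ m + b)))
        m = damping * m + (1.0 - damping) * m_new
    return m
```

This trades the noisy, slow sampling of Gibbs updates for a cheap deterministic computation, at the cost of the approximation errors discussed in the text.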
Because variational methods typically scale well to large applications, during the 2000s
extensive research was done on obtaining deterministic approximations [28-29] or deterministic
upper bounds [30-32] on the log partition function of an arbitrary discrete Markov random field.
The Bethe approximation first made its appearance in the field of approximate inference and
error correcting decoding in [33-34] under the names TAP approximation and cavity method.
The relation between belief propagation and the Bethe approximation was further clarified in
[28-29,35-36] where it was shown that belief propagation, even when applied to loopy graphs,
has fixed–points at the stationary points of the Bethe free energy. In 2003 Welling and Teh
proposed a new algorithm, named belief optimization, to minimize the Bethe free energy
directly, as an alternative to the fixed–point equations of belief propagation [4].
In 2002 Hinton proposed a new learning algorithm for the asynchronous Boltzmann machine:
contrastive divergence learning. In his view this new algorithm works as a better approximation
to the maximum likelihood learning used in the original learning algorithm [37]. The most
attractive aspect of this new algorithm is that it allows Restricted Boltzmann Machines with
millions of parameters to achieve state–of–the–art performance on a large collaborative filtering
task [38].
The newest variant of asynchronous Boltzmann machine, called Deep Boltzmann Machine, is a
deep multilayer Boltzmann machine that was proposed in 2009 by Salakhutdinov and Hinton
[39]. Its learning algorithm was designed to incorporate both bottom–up and top–down
feedback, allowing a better propagation of uncertainty about ambiguous inputs [39].
The dynamics of the synchronous Boltzmann machine were first studied in the 1970s by Little
and Shaw [40-41]. A comprehensive study of synchronous Boltzmann machines and their
learning algorithms was done by Viveros in her PhD thesis in 2001 [42].
1.4 Connection to other disciplines
We previously mentioned that the learning algorithms for Boltzmann machines have been
intensively used for training deep neural networks. A deep neural network is an artificial neural
network with multiple hidden layers of units between the input and output layers which is
capable of learning the underlying constraints that characterize a domain simply by being shown
examples from the domain [10-11,43]. Computational deep learning is closely related to a class
of theories of brain development named neocortical development that was proposed by
cognitive neuroscientists in the early 1990s. Neocortical development is a major focus in
neurobiology, not only from a purely developmental standpoint, but also because understanding
how neocortex develops provides important insight into mature neocortical organization and
function. This shows how Boltzmann machines and their learning algorithms are connected with
neurobiology and, generally, with the field of cognitive sciences.
When applied to Boltzmann machines and their learning algorithms, the emphasis on
mathematical technique and rigor employed by theoretical computing science becomes an
invaluable research asset.
Chapter 2. Foundations
2.1 Boltzmann–Gibbs distribution
In statistical mechanics and mathematics, the Boltzmann–Gibbs distribution (also called
the Boltzmann distribution or the Gibbs distribution) is a certain distribution function or
probability measure for the distribution of the states of a system. The Boltzmann distribution is
named after Ludwig Boltzmann who first formulated it in 1868 during his studies of statistical
mechanics of gases in thermal equilibrium [44]. The distribution was later investigated
extensively, in its modern generic form, by Josiah Willard Gibbs (1902) [45]. It underpins the
concept of the canonical ensemble by providing the underlying distribution. In more general
mathematical settings, the Boltzmann–Gibbs distribution is also known as the Gibbs measure.
In statistical mechanics the Boltzmann–Gibbs distribution is an intrinsic characteristic of isolated
(or nearly–isolated) systems of fixed composition that are in thermal equilibrium (i.e., equilibrium
with respect to energy exchange). The most general case of such a system is the canonical
ensemble. Before we define the concept of canonical ensemble, we need to define a concept
that is employed by its definition, namely the heat bath.
Definition 2.1:
In thermodynamics, a heat bath is a system 𝐵 which is in contact with a many–particle system 𝐴
such that:
𝐴 and 𝐵 can exchange energy, but not particles;
𝐵 is at equilibrium and has temperature 𝐓;
𝐵 is much larger than 𝐴, so that its contact with 𝐴 does not affect its equilibrium state.
Definition 2.2:
In statistical mechanics, a canonical ensemble is the statistical ensemble that represents the
possible states of a mechanical system in thermal equilibrium with a heat bath at some fixed
temperature.
From previous definitions we can infer that the states of the system 𝐴, which plays the role of a
canonical ensemble, will differ in their total energy as a consequence of the energy exchange
with the system 𝐵, which plays the role of a heat bath. The principal thermodynamic variable of
the canonical ensemble, determining the probability distribution of states, is the absolute
temperature 𝐓. In general, the canonical ensemble 𝑋 assigns to each distinct microstate 𝑥 a
probability 𝐏(𝑋 = 𝑥), i.e., the probability of the random variable 𝑋 taking the value 𝑥, given by
the following exponential:

$$\mathbf{P}(X = x) = \exp\!\left(\frac{F - E(x)}{\mathbf{k} \cdot \mathbf{T}}\right) \qquad (2.1)$$

where: 𝐸(𝑥) is the energy of the microstate 𝑥; 𝐹 is the Helmholtz free energy; 𝐓 is the absolute
temperature of the system; and 𝐤 is Boltzmann's constant. 𝐸 is a function that maps the
space of states to ℝ and is interpreted as the energy of state 𝑥. For a given ensemble the
Helmholtz free energy is a constant.
If the canonical ensemble has 𝑚 states accessible to the system of interest, indexed by
{1, 2, …, 𝑚}, then equation (2.1) can be rewritten as:

$$\mathbf{P}_i = \frac{\exp\!\left(\dfrac{-E_i}{\mathbf{k} \cdot \mathbf{T}}\right)}{\sum_{j=1}^{m} \exp\!\left(\dfrac{-E_j}{\mathbf{k} \cdot \mathbf{T}}\right)} \qquad (2.2)$$

where: 𝐏ᵢ is the probability of state 𝑖; 𝐸ᵢ is the energy of state 𝑖; 𝐤 is Boltzmann's
constant; 𝐓 is the absolute temperature of the system; and 𝑚 is the number of states of the
canonical ensemble.
An alternative but equivalent formulation for the canonical ensemble uses the canonical
partition function (or normalization constant) 𝑍 rather than the free energy:

$$Z(\mathbf{T}) = \exp\!\left(\frac{-F}{\mathbf{k} \cdot \mathbf{T}}\right) \qquad (2.3)$$

For a canonical ensemble with 𝑚 states, if we know the energies of the states accessible to the
system of interest, we can calculate the canonical partition function 𝑍 as follows:

$$Z(\mathbf{T}) = \sum_{j=1}^{m} \exp\!\left(\frac{-E_j}{\mathbf{k} \cdot \mathbf{T}}\right) \qquad (2.4)$$
By introducing the canonical partition function defined by equations (2.3) and (2.4) into
equations (2.1) and (2.2), respectively, we obtain:
$$\mathbf{P}(X = x) = \frac{1}{Z(\mathbf{T})} \cdot \exp\!\left(\frac{-E(x)}{\mathbf{k} \cdot \mathbf{T}}\right) \qquad (2.5)$$

and, respectively:

$$\mathbf{P}_i = \frac{1}{Z(\mathbf{T})} \cdot \exp\!\left(\frac{-E_i}{\mathbf{k} \cdot \mathbf{T}}\right) \qquad (2.6)$$
In a system with local (finite–range) interactions, the canonical ensemble’s distribution
maximizes the entropy density for a given expected energy density, or equivalently, minimizes
the free energy density. The distribution shows that states with lower energy will always have a
higher probability of being occupied than the states with higher energy.
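Equation (2.6) is straightforward to evaluate numerically. The following sketch (our own illustration with arbitrary energies, working in units where k = 1) also uses the standard trick of subtracting the minimum energy before exponentiating, which leaves the distribution unchanged but avoids overflow:

```python
import numpy as np

def boltzmann_probs(energies, T=1.0, k=1.0):
    """Boltzmann-Gibbs probabilities P_i = exp(-E_i/(kT)) / Z, as in eq. (2.6).

    Shifting all energies by E.min() multiplies numerator and denominator
    by the same constant, so the probabilities are unchanged."""
    E = np.asarray(energies, dtype=float)
    w = np.exp(-(E - E.min()) / (k * T))
    return w / w.sum()

p = boltzmann_probs([0.0, 1.0, 2.0], T=1.0)
# Lower-energy states receive higher probability; probabilities sum to 1.
# At very high T the distribution flattens toward uniform.
```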
The Boltzmann–Gibbs distribution is often used to describe the distribution of particles, such as
atoms in binary alloys or molecules in a gas, over energy states accessible to them. If we have
a system consisting of a finite number of particles, the probability of a particle being in state 𝑖 is
practically the probability that, if we pick a random particle from that system and check what
state it is in, we will find it in state 𝑖. This probability is equal to the number of particles in
state 𝑖 divided by the total number of particles in the system, i.e., the fraction of particles
that occupy state 𝑖. Formula (2.7) gives the fraction of particles in state 𝑖 as a function of the
state’s energy:
𝐏_i = 𝑛_i / 𝑛 = exp(−𝐸_i / (𝐤 ∙ 𝐓)) / ∑_{j=1}^{m} exp(−𝐸_j / (𝐤 ∙ 𝐓))   (2.7)
where 𝑛 is the total number of particles in the system and 𝑛𝑖 is the number of particles in state 𝑖.
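Equations (2.2), (2.4), and (2.7) can be checked numerically. The sketch below is illustrative only: the state energies and the value of 𝐤 ∙ 𝐓 are arbitrary choices, not quantities from the text. It computes the partition function and the Boltzmann–Gibbs probabilities for a small set of states and confirms that lower-energy states receive higher probability:

```python
import math

def boltzmann_probabilities(energies, kT=1.0):
    # Boltzmann factors exp(-E_i / (k*T)) for each state, as in equation (2.2).
    weights = [math.exp(-E / kT) for E in energies]
    # Canonical partition function Z(T), equation (2.4).
    Z = sum(weights)
    return [w / Z for w in weights]

# Three states with arbitrary illustrative energies.
probs = boltzmann_probabilities([0.0, 1.0, 2.0], kT=1.0)
# The probabilities sum to 1, and the probability decreases as the energy grows.
```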
In infinite systems, the total energy is no longer a finite number and cannot be used in the
traditional construction of the probability distribution of a canonical ensemble. The traditional
approach, followed by statistical physicists, of studying the thermodynamic limit of the energy
function as the size of a finite system approaches infinity, had not been very useful. Looking for
an alternative approach, the researchers discovered that, when the energy function of an infinite
system can be written as a sum of terms that each involves only variables from a finite
subsystem, the notion of Gibbs measure provides a framework to directly study such systems
(instead of taking the limit of finite systems).
Definition 2.3:
In physics, a probability measure is a Gibbs measure if the conditional probabilities it induces on
each finite subsystem satisfy the following consistency condition: if all degrees of freedom
outside the finite subsystem are frozen, the canonical ensemble for the subsystem subject to
these boundary conditions matches the probabilities in the Gibbs measure conditional on the
frozen degrees of freedom.
2.2 Markov random fields and Gibbs measures
We have seen that the Gibbs measure has a natural relationship with physics: it was born
to describe the behavior of a system whose interactions between particles can be described by a
form of energy. Moreover, the Gibbs measure can be applied successfully to systems outside its
domain of origin, sometimes even without introducing notions specific to physics into the
probabilistic definitions of those systems. Examples of such systems are: Hopfield networks,
Markov random fields, and Markov logic networks. All these systems exploit the following
general principle derived from Boltzmann’s and Gibbs’s work: a network consisting of a large
number of units, with each unit interacting with neighbouring units, will approach at equilibrium a
canonical distribution given by the equations (2.5) and (2.6). This expanded applicability of
the Gibbs measure has been made possible by a fundamental mathematical result known as the
Hammersley–Clifford theorem or the fundamental theorem of random fields. In this section we
present how computer scientists adapted the physicists’ definition of the Gibbs measure for
graphical models. We also present the Hammersley–Clifford theorem and its consequences
with respect to the special class of Markov random fields that is the Boltzmann machine.
Dobrushin showed in [46] that there are two different ways to define configurations
of points on a structure that mathematically resemble a lattice; he called these configurations
“random fields”. One way is based on the formulation of statistical mechanics of Gibbs and is
generally accepted as the simplest useful mathematical model of a discrete gas (also called
lattice gas) [46]. The other way, introduced by Dobrushin himself, is that of Markov random
fields. Dobrushin’s formulation has no apparent connection with physics, being instead based
on the natural way of extending the notion of a Markov process [46].
A Markov process is a stochastic model that has the Markov property, i.e., the conditional
probability distribution of future states of the process (conditional on both past and present
states) depends only upon the present state, not on the sequence of events that preceded it. A
special case of Markov process is the Markov chain.
A Markov chain is a discrete–time Markov process with a countable or finite state space.
A Markov random field, also called Markov network, extends the Markov chain to two or more
dimensions or to random variables defined for an interconnected network of items; therefore, it
may be considered a generalization of a Markov chain in multiple dimensions.
In this paper we use the term Markov random field to designate a random field that
models an interconnected network of items. In a Markov chain, each state depends only on the
previous state in time, whereas in a Markov random field each state depends only on its
neighbors in any of multiple directions. Hence, a Markov random field may be visualized as a
field or graph of random variables, where the distribution of each random variable depends on
the neighboring variables which it is connected with. Thus, in a Markov random field the Markov
property becomes a local property rather than a temporal property.
Any graphical model can be seen as a “marriage” between probability theory and graph theory.
A consequence of this relationship is the existence of two equivalent characterizations of the
family of probability distributions associated with an undirected graph: one algebraic that
involves the concept of factorization and one graph–theory specific that involves the concept of
reachability [27,47]. For Markov random fields, the concepts of reachability and factorization
are identified with conditional independence and factor graph representation, respectively. The
Hammersley–Clifford theorem shows that these two ways of defining a random field are
equivalent, which further translates into equivalence between Markov random fields and Gibbs
measures. Before we present the Hammersley–Clifford theorem, we formally introduce the
concepts it operates with: Markov random field and Gibbs measure. We use the
notations for univariate and multivariate random variables specified in Appendix A.
Definition 2.4:
Given an undirected graph 𝐺 = (𝑉, 𝐸), a set of random variables X = (𝑋𝑣)𝑣∈𝑉 indexed by 𝑉 form
a Markov random field with respect to 𝐺 if they satisfy the Markov property expressed in either
one of the following forms:
Pairwise Markov Property: Any two non–adjacent variables are conditionally independent
given all other variables:
𝑋𝑢 ⊥ 𝑋𝑣 | X𝑉−{𝑢,𝑣} if {𝑢, 𝑣} ∉ 𝐸 (2.8)
Local Markov Property: A variable is conditionally independent of all other variables given its
neighbors:
𝑋𝑣 ⊥ X𝑉−cl(𝑣) | Xne(𝑣) (2.9)
where ne(𝑣), also called the Markov blanket, is the set of neighbors of 𝑣 and cl(𝑣) = ne(𝑣) ∪
{𝑣} is the closed neighborhood of 𝑣.
Global Markov Property: Any two subsets of variables are conditionally independent given a
separating subset:
XA ⊥ XB | XS (2.10)
where every path from a node in 𝐴 to a node in 𝐵 passes through 𝑆.
Generally, these three expressions of the Markov property are not equivalent. The local Markov
property is stronger than the pairwise one but weaker than the global one.
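The pairwise Markov property can be verified directly on a small example. The sketch below uses arbitrary illustrative potentials: it builds the joint distribution of a three-node chain 𝑋₁ – 𝑋₂ – 𝑋₃ by brute force and checks that the non-adjacent variables 𝑋₁ and 𝑋₃ are conditionally independent given 𝑋₂:

```python
import itertools
import math

# Positive pairwise potentials on the chain edges {1,2} and {2,3} (arbitrary values).
phi12 = {(a, b): math.exp(0.8 if a == b else -0.8) for a in (0, 1) for b in (0, 1)}
phi23 = {(b, c): math.exp(0.5 if b == c else -0.5) for b in (0, 1) for c in (0, 1)}

# Joint distribution P(x1, x2, x3) proportional to phi12 * phi23.
joint = {x: phi12[x[0], x[1]] * phi23[x[1], x[2]]
         for x in itertools.product((0, 1), repeat=3)}
Z = sum(joint.values())
joint = {x: p / Z for x, p in joint.items()}

def prob(fixed):
    # Marginal probability that the coordinates in `fixed` take the given values.
    return sum(p for x, p in joint.items()
               if all(x[i] == v for i, v in fixed.items()))

# Check P(x1, x3 | x2) = P(x1 | x2) * P(x3 | x2) for every assignment.
max_dev = 0.0
for x1, x2, x3 in itertools.product((0, 1), repeat=3):
    p2 = prob({1: x2})
    lhs = prob({0: x1, 1: x2, 2: x3}) / p2
    rhs = (prob({0: x1, 1: x2}) / p2) * (prob({1: x2, 2: x3}) / p2)
    max_dev = max(max_dev, abs(lhs - rhs))
```

Because the joint factorizes over the two edges, the conditional independence holds exactly, up to floating-point error.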
Definitions 2.5:
A probability distribution 𝐏(X) = 𝐏(𝑋1, 𝑋2, … , 𝑋𝑛) on an undirected graph 𝐺 = (𝑉, 𝐸) with |𝑉| = 𝑛
is called a Gibbs distribution or Gibbs measure if it can be factorized into potentials defined on
cliques that cover all the nodes and edges of 𝐺.
A potential function or sufficient statistic is a function defined on the set of configurations of a
clique (i.e., a setting of values for all the nodes in the clique) that associates a positive real
number with each configuration. Hence, for every subset of nodes Xc ⊆ 𝑉 that form a clique, we
associate a positive potential 𝜙𝑐 = 𝜙𝑐(Xc).
In this paper we will refer equivalently to the nodes of 𝐺 that form the clique Xc and the random
variables that correspond to those nodes. Before formulating the Gibbs measure let us
introduce the following notations:
CG = {Xc1 , Xc2 , … , Xcd} = {Xcj ∶ 1 ≤ 𝑗 ≤ 𝑑, 𝑑 ≤ 𝑛} represents a set of 𝑑 cliques that cover the
edges and nodes of the underlying graph 𝐺;
ΦG = {𝜙c1 , 𝜙c2 , … , 𝜙cd} = {𝜙cj ∶ 1 ≤ 𝑗 ≤ 𝑑, 𝑑 = |CG|} represents the set of potential functions
or clique potentials that correspond to CG;
There is a one–to–one correspondence between CG and ΦG, i.e., 𝜙cj = 𝜙cj(Xcj). Therefore, it
should be generally understood that, when iterating over CG, we also iterate over ΦG.
The Gibbs measure is precisely the joint probability distribution of all the nodes in the graph
𝐺 = (𝑉, 𝐸) and is obtained by taking the product over the clique potentials:
𝐏(X) = (1 / 𝑍) ∙ ∏_{X_c ∈ C_G} 𝜙_c(X_c)   (2.11)
where:
𝑍 ≡ 𝑍(𝐏) = ∑_X ∏_{X_c ∈ C_G} 𝜙_c(X_c)   (2.12)
𝐴 ≡ 𝐴(𝐏) = log(𝑍(𝐏)) ≡ ln(𝑍(𝐏)) = ln(𝑍)   (2.13)
where 𝑍, called the partition function, is a constant chosen to ensure that the distribution 𝐏 is
normalized. If the distribution 𝐏 belongs to the exponential family, it is more practical to work
with the logarithm, specifically the natural logarithm, of the partition function 𝑍. By definition, the
cumulant function 𝐴 is the natural logarithm of Z.
The set CG is often taken to be the set of all maximal cliques of the graph 𝐺, i.e., the set of
cliques that are not properly contained within any other clique. This condition can be imposed
without loss of generality because any representation based on non–maximal cliques can
always be converted into one based on maximal cliques by redefining the potential function on a
maximal clique to be the product over the potential functions on the subsets of that clique.
However, the factorization of a Markov random field is of particular value when CG consists of
more than the maximal cliques, as is the case for factor graphs.
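Equations (2.11) and (2.12) can be illustrated with a brute-force computation. In the sketch below, the graph, the cliques, and the potential values are arbitrary illustrative choices; it builds the Gibbs measure of a three-node binary model from edge-clique potentials:

```python
import itertools

# Edge cliques of a small graph and their (positive) potential tables.
potentials = {
    (0, 1): {(0, 0): 2.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0},
    (1, 2): {(0, 0): 1.5, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 1.5},
}

def gibbs_measure(potentials, n):
    # P(x) = (1/Z) * product over cliques of phi_c(x_c), equation (2.11).
    unnorm = {}
    for x in itertools.product((0, 1), repeat=n):
        p = 1.0
        for clique, phi in potentials.items():
            p *= phi[tuple(x[i] for i in clique)]
        unnorm[x] = p
    Z = sum(unnorm.values())  # partition function, equation (2.12)
    return {x: p / Z for x, p in unnorm.items()}, Z

P, Z = gibbs_measure(potentials, 3)
```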
Definition 2.6:
Given a factorization of a function:
𝑔 ∶ ℝ^n → ℝ, 𝑔(𝑋_1, 𝑋_2, … , 𝑋_n) = ∏_{j=1}^{m} 𝑓_j(𝑆_j)   (2.14)
where 𝑆_j ⊆ {𝑋_1, 𝑋_2, … , 𝑋_n}
The corresponding factor graph 𝐺 = (𝑋, 𝐹, 𝐸) is a bipartite graph that consists of: variable nodes
𝑋 = {𝑋1, 𝑋2, … , 𝑋𝑛}, factor nodes 𝐹 = {𝑓1, … 𝑓𝑚}, and edges 𝐸. The edges depend on the
factorization as follows: there is an undirected edge between factor 𝑓𝑗 and variable 𝑋𝑖 if and only
if 𝑋𝑖 is an argument of 𝑓𝑗, i.e., 𝑋𝑖 ∈ 𝑆𝑗.
Factor graphs allow a finer–grained specification of factorization properties by explicitly
representing potential functions for non–maximal cliques. We observe that a factor graph has
only node potentials and pairwise potentials. Generally, if the potential functions in a Markov
random field are defined over single variables or pairs of variables, then the Markov random
field is referred to as a pairwise Markov network. More precisely, a pairwise Markov network over a
graph 𝐺 = (𝑉, 𝐸) is a Markov random field associated with a set of node potentials and a set of
edge potentials as described by the equation (2.15):
ΦG = {𝜙(𝑋i) ∶ 𝑋i ∈ 𝑉, 1 ≤ 𝑖 ≤ 𝑛} ∪ {𝜙(𝑋i, 𝑋𝑗) ∶ {𝑖, 𝑗} ∈ 𝐸, 1 ≤ 𝑖, 𝑗 ≤ 𝑛} (2.15)
A factor graph is a pairwise Markov network whose nodes and edges are endowed with special
meanings that originate in the function it factorizes. We will return to the relationship
between Markov random fields and factor graphs in Section 3.5.
One important property of Markov random fields is that the potential functions ΦG need not have
any obvious or direct relation to marginal or conditional distributions defined over the graph
cliques.
Theorem 2.1 (Hammersley–Clifford):
A probability distribution that has a positive mass or density satisfies the Markov property with
respect to an undirected graph if and only if it is a Gibbs distribution in that graph.
The proof of this theorem is outside the scope of this paper. A rigorous mathematical proof can
be found in [48]. The Hammersley–Clifford theorem gives the necessary and sufficient
conditions under which a Gibbs measure is equivalent to a Markov random field.
Consequently, any positive probability measure that satisfies a Markov property is a Gibbs
measure for an appropriate choice of (locally defined) energy function.
The learning algorithms in a pairwise Markov network like the Boltzmann machine require
computing statistical quantities (e.g., likelihoods and probabilities) and information–theoretic
quantities (e.g., mutual information and conditional entropies) on the underlying graphical
model. These types of computational tasks in a graphical model are called inference or
probabilistic inference. Furthermore, the learning algorithms are built on inference algorithms
and allow parameters and structures to be estimated from data. However, exact inference for
large–scale Markov random fields is intractable. Therefore, to achieve a scalable learning
algorithm, approximate methods are required.
One popular source of approximate methods is the Markov chain Monte Carlo (MCMC)
framework. The main problem with the MCMC approach is that convergence times can be long
and it can be difficult to diagnose convergence.
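As a concrete instance of the MCMC approach, the sketch below runs a simple Gibbs sampler on a tiny pairwise binary network; the weights, thresholds, and pseudo-temperature are arbitrary illustrative values, not quantities from the text. Each unit is repeatedly resampled from its conditional distribution given the current states of the others:

```python
import math
import random

random.seed(0)
W = [[0.0, 1.0, -0.5],
     [1.0, 0.0, 0.8],
     [-0.5, 0.8, 0.0]]  # symmetric weights, zero diagonal
theta = [0.1, -0.2, 0.0]
T = 1.0  # pseudo-temperature

def gibbs_sweep(sigma):
    # Resample each binary unit from its conditional distribution
    # given the current states of the other units.
    for i in range(len(sigma)):
        net = sum(W[j][i] * sigma[j] for j in range(len(sigma))) - theta[i]
        p_on = 1.0 / (1.0 + math.exp(-net / T))
        sigma[i] = 1 if random.random() < p_on else 0
    return sigma

sigma = [random.randint(0, 1) for _ in range(3)]
samples = [tuple(gibbs_sweep(sigma)) for _ in range(1000)]
```

After a burn-in period, the empirical frequencies of the sampled configurations approximate the Gibbs measure of the network; diagnosing when that point has been reached is exactly the convergence difficulty mentioned above.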
An alternative to MCMC is the variational framework whose goal is to convert the probabilistic
inference problem into an optimization problem. The best known variational algorithm used in
Boltzmann machine learning is the mean field approximation, which searches for the best
distribution that assumes independence among all the nodes and then uses it to approximate the
true posterior distribution over hidden variables.
Another alternative to MCMC is the belief propagation (BP) framework. BP is a message
passing algorithm for performing inference on tree–like graphs. The discovery of the relationship
between belief propagation and the Bethe free energy led to the so–called Bethe approximation of
the free energy, which in turn produced a new class of learning algorithms for the Boltzmann machine.
2.3 Gibbs free energy
The third millennium has brought exciting progress in understanding computationally
hard problems in computer science by using a variety of concepts and methods from statistical
physics. One of these concepts is the Gibbs free energy. In this section we start by briefly
introducing the Gibbs free energy as a thermodynamic potential; then we explain how this
energy can be accommodated to describe a Markov random field. In subsequent development
we use the term temperature to designate the absolute temperature of a canonical ensemble or,
generally, of a thermodynamic system, and the term pseudo–temperature to designate the
“temperature” of a Markov random field, i.e., a parameter which models the thermal noise
injected into the system. The majority of theoretical results reviewed in this section come from
[49] and [4].
The Gibbs free energy, originally called the “available energy”, was developed in the 1870s by
Josiah Willard Gibbs, who described it in [50] as:
the greatest amount of mechanical work which can be obtained from a given quantity of a certain substance in a given initial state, without increasing its total volume or allowing heat to pass to or from external bodies, except such as at the close of the processes are left in their initial condition.
The Gibbs free energy is one of the four thermodynamic potentials used in the chemical
thermodynamics of reactions and non–cyclic processes. The other three thermodynamic
potentials are: internal energy, enthalpy, and Helmholtz free energy. In this paper we are only
interested in the internal energy, the Helmholtz free energy, and the Gibbs free energy.
Generally, energy is a concept which takes into account the physical nature of a system. The
exact (true) energy 𝐸 is usually unknown, but the mean (internal) energy 𝑈 is usually known –
for example, when it is determined by external factors such as a thermostat.
The internal energy 𝑈 is a thermodynamic potential that can be thought of as the energy
contained within a system or, equivalently, the energy required to create a system in the
absence of changes in temperature or volume.
If the system is created in an environment of temperature 𝐓, then some of the energy can be
obtained by spontaneous heat transfer from the environment to the system. The amount of this
spontaneous energy transfer is 𝐓 ∙ 𝑆, where 𝐓 represents the temperature and 𝑆 represents the
final entropy of the system. The Helmholtz free energy 𝐹 is then a measure of the amount of
energy required to create a system once the spontaneous energy transfer to the system from
the environment is accounted for:
𝐹 = 𝑈 − 𝐓 ∙ 𝑆   (2.16)
where: 𝑈 is the internal energy; 𝑆 is the entropy; and 𝐓 is the temperature of the system. At low
temperatures, the Helmholtz free energy is dominated by the energy, while at high
temperatures, the entropy dominates it. The Helmholtz free energy is commonly used for
systems held at constant volume. Moreover, for a system at constant temperature and volume, the
Helmholtz free energy is minimized at equilibrium.
If the system is created from a very small volume, in order to "create room" for the system, an
additional amount of work P ∙ V must be done, where P represents the absolute pressure and V
represents the final volume of the system. As discussed in defining the Helmholtz free energy,
an environment at constant temperature 𝐓 will contribute an amount 𝐓 ∙ 𝑆 to the system,
reducing the overall investment necessary for creating the system. The Gibbs free energy 𝐺 is
then the net energy contribution for a system created in an environment of temperature 𝐓 from a
negligible initial volume:
𝐺 = 𝑈 − 𝐓 ∙ 𝑆 + P ∙ V   (2.17)
where: 𝑈 is the internal energy; 𝑆 is the entropy; 𝐓 is the temperature; P is the absolute
pressure; and V is the final volume of the system. For a system at constant pressure and
temperature, the Gibbs free energy is minimized at equilibrium.
In the context of Markov random fields, energy is a scalar quantity used to represent the state
and the parameters of the system in certain conditions. Similarly to a thermodynamic system,
the true energy of a Markov random field is unknown. The true energy of a Markov random field
at equilibrium is referred to as the true free energy and corresponds to the true joint probability
distribution 𝐏(𝑋_1, … , 𝑋_n) of the random field. If the true joint probability distribution has a positive
mass or density, then, according to Theorem 2.1, it is a Boltzmann–Gibbs distribution.
The internal energy 𝑈 of a Markov random field 𝐏(𝑋_1, … , 𝑋_n) is defined as the expected value
of the exact energy 𝐸 of the system:
𝑈_𝐏 = 𝐄_𝐏[𝐸(𝑋_1, … , 𝑋_n)] = ∑_{𝑋_1,…,𝑋_n} 𝐏(𝑋_1, … , 𝑋_n) ∙ 𝐸(𝑋_1, … , 𝑋_n)   (2.18)
The entropy 𝑆 of a Markov random field 𝐏(𝑋_1, … , 𝑋_n) is defined as the expected value of the
logarithm of the inverse of the probability distribution 𝐏 of the system (equations (B26) and
(B27) from Appendix B):
𝑆_𝐏 = 𝐄_𝐏[ln(1 / 𝐏(𝑋_1, … , 𝑋_n))] = −∑_{𝑋_1,…,𝑋_n} 𝐏(𝑋_1, … , 𝑋_n) ∙ ln(𝐏(𝑋_1, … , 𝑋_n))   (2.19)
The exact Gibbs free energy can be thought of as a mathematical construction designed so that
its minimization leads to the Boltzmann–Gibbs distribution given by the equation (2.5) [49]. In
order to define the exact Gibbs free energy, we write the equation (2.5) for a Markov random
field as:
𝐏(𝑋_1, … , 𝑋_n) = (1 / 𝑍) ∙ exp(−𝐸(𝑋_1, … , 𝑋_n) / 𝐓)   (2.20)
where 𝐸(𝑋1, … , 𝑋𝑛) is the true energy of the Markov random field (adjusted with the Boltzmann’s
constant) and 𝐓 is the pseudo–temperature of the Markov random field.
By definition, the exact Gibbs free energy denoted 𝐺exact is the following function of the true joint
probability function 𝐏(𝑋1, … , 𝑋𝑛):
𝐺_exact[𝐏(𝑋_1, … , 𝑋_n)] = 𝑈_𝐏 − 𝐓 ∙ 𝑆_𝐏   (2.21)
where: 𝑈_𝐏 is given by the equation (2.18); 𝑆_𝐏 is given by the equation (2.19); and 𝐓 is the
pseudo–temperature of the system.
We note the absence from equation (2.21) of the term P ∙ V that appears in equation (2.17). This
absence is explained by the fact that the pressure and volume parameters of a thermodynamic
system have no correspondent in a Markov random field. Hence, the exact Gibbs free
energy of a Markov random field is apparently identical to the Helmholtz free energy.
However, there is a difference of nuance between them: while the Helmholtz free energy is just
the value 𝑈𝐏 − 𝐓 ∙ 𝑆𝐏 computed at equilibrium, the Gibbs free energy is a function that computes
the expression 𝑈𝐏 − 𝐓 ∙ 𝑆𝐏 for any state of the network after applying some constraints [49].
At equilibrium, the exact Gibbs free energy is equal to the Helmholtz free energy, which is given
by the formula:
𝐹 = −𝐓 ∙ ln(𝑍) (2.22)
where 𝑍 is the partition function of the Markov random field [49].
It can be shown that, if we minimize 𝐺𝑒𝑥𝑎𝑐𝑡 given by (2.21) with respect to 𝐏(𝑋1, … , 𝑋𝑛) and
enforce, via a Lagrange multiplier, the constraint of 𝐏 being a probability distribution, then we
recover, as desired, the Boltzmann–Gibbs distribution.
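This minimization property can be checked numerically on a toy state space. In the sketch below, the state energies and the pseudo-temperature are arbitrary illustrative values; it evaluates 𝐺 = 𝑈 − 𝐓 ∙ 𝑆 from equations (2.18), (2.19), and (2.21) and confirms that the Boltzmann–Gibbs distribution attains the Helmholtz free energy −𝐓 ∙ ln(𝑍) of equation (2.22), while randomly drawn distributions never score below it:

```python
import math
import random

energies = [0.0, 1.0, 2.5]  # arbitrary state energies
T = 1.2                     # arbitrary pseudo-temperature

def gibbs_free_energy(Q):
    # G[Q] = U_Q - T * S_Q, from equations (2.18), (2.19), and (2.21).
    U = sum(q * E for q, E in zip(Q, energies))
    S = -sum(q * math.log(q) for q in Q if q > 0.0)
    return U - T * S

# Boltzmann-Gibbs distribution and Helmholtz free energy F = -T * ln(Z).
Z = sum(math.exp(-E / T) for E in energies)
boltzmann = [math.exp(-E / T) / Z for E in energies]
F = -T * math.log(Z)

# Evaluate G on random candidate distributions over the same states.
random.seed(1)
others = []
for _ in range(200):
    raw = [random.random() + 1e-9 for _ in energies]
    total = sum(raw)
    others.append(gibbs_free_energy([r / total for r in raw]))
```

The gap 𝐺[Q] − 𝐹 equals 𝐓 times the Kullback–Leibler divergence between Q and the Boltzmann–Gibbs distribution, which is why no candidate can dip below 𝐹.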
Different types of constraints can be imposed on various probabilities that characterize the
Markov random field and each such scenario “produces” a Gibbs free energy. By minimizing a
Gibbs free energy with respect to the probabilities that are constrained, we obtain self–
consistent equations that must be obeyed at equilibrium [49].
In general, a given system can have more than “one Gibbs free energy” depending on what
constraints are applied and over what probabilities. If the full joint probability is constrained, then
we obtain the exact Gibbs free energy denoted 𝐺𝑒𝑥𝑎𝑐𝑡. If some or all marginal probabilities are
constrained, then we obtain an approximate Gibbs free energy denoted 𝐺. The mean field free
energy and the Bethe free energy, which we are going to introduce in Chapter 3, are both Gibbs
free energies. The advantage of working in a Markov random field with a Gibbs free energy instead
of a Boltzmann–Gibbs distribution is that it is much easier to come up with ideas for
approximations [49].
2.4 Connectionist networks
In order to emphasize their brain–style computational properties, Hinton has
characterized Boltzmann machines as connectionist networks, specifically symmetrical
connectionist networks with hidden units. Their counterparts with respect to the presence of
hidden units are the Hopfield networks. Before diving into the world of Boltzmann
machines, we briefly present their “ancestors”: the connectionist networks and the
Hopfield networks.
Connectionism is a set of approaches in the field of cognitive science that models mental or
behavioral phenomena as the emergent processes of interconnected networks of simple units.
The central connectionist principle is that mental phenomena are better described in terms of
brain–style computation than in terms of rule–based symbol manipulation. However,
the connectionist architectures are not meant to duplicate the physiology of the human brain,
but rather to receive inspiration from known facts about how the brain works [51]. There are
many forms of connectionism, but the most common form uses artificial neural network models.
Connectionist models typically consist of many simple neuron–like processing elements called
units that interact by using weighted connections. The connections between units can be
symmetrical or asymmetrical, depending on whether they have the same weight in both
directions or not. Each unit has a state or activity level that is determined by the input received
from other units in the network. There are many possible encodings of these states within this
general framework: 1/0, +1/-1, on/off. When the effective values of the states of the units are not
important for the argument we are trying to make, we refer to them as on/off. One common, simplifying
assumption is that the combined effects of the rest of the network on the ith unit are mediated by
a single scalar quantity. This quantity, which is called the total input of unit i and denoted neti, is
a linear function of the activity levels of the units that provide input to unit 𝑖:
net_i = ∑_j 𝜎_j ⋅ 𝑤_ji − 𝜃_i   (2.23)
where: 𝜎𝑗 is the state of the jth unit; 𝑤𝑗𝑖 is the weight on the connection from the jth to the ith unit;
and 𝜃𝑖 is the threshold of the ith unit.
An external input vector can be supplied to the network by clamping the states of some units or
by adding an input term to the total input of some units. By taking into consideration the external
input, the total input of unit 𝑖 is computed with the formula:
net_i = ∑_j 𝜎_j ⋅ 𝑤_ji + 𝐼_i − 𝜃_i   (2.24)
where: 𝜎𝑗 is the state of the jth unit; 𝑤𝑗𝑖 is the weight on the connection from the jth to the ith unit;
𝜃𝑖 is the threshold of the ith unit; and 𝐼𝑖 is the external input received by the ith unit.
The threshold term can be eliminated by giving every unit an extra input connection whose
activity level is always on. The weight on this special connection is the negative of the threshold,
and it can be learned in just the same way as the other weights.
The capacity of a network to change over time is expressed at unit level by the concept of
activation. At any time, a unit in the network has an activation, which is a numerical value
intended to represent some aspect of the unit and is often called the state of the unit. The
activation of a unit spreads to all the other units connected to it. Typically the state of a unit is
described as a function, called the activation function, of the total input that it receives from its
input units. Usually the activation function is non–linear, but it can be linear as well.
For units with discrete nonnegative states, the activation function typically has value 0 or 1.
For units with continuous nonnegative states a typical activation function is the logistic sigmoid
defined by the formula (2.25).
For units with discrete bipolar states, the typical activation function has value -1 or 1.
For units with continuous positive and negative states a typical activation function is the
hyperbolic tangent defined by the formula (2.26).
States 0/1: 𝜎_i = sigm(net_i) = 1 / (1 + exp(−net_i))   (2.25)
States -1/1: 𝜎_i = tanh(net_i) = (exp(net_i) − exp(−net_i)) / (exp(net_i) + exp(−net_i)) = (exp(2 ∙ net_i) − 1) / (exp(2 ∙ net_i) + 1)   (2.26)
where neti is the input of the ith unit and 𝜎𝑖 is the state of the same unit.
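The two activation functions above translate directly into code; the function names `sigm` and `tanh_act` below are merely illustrative:

```python
import math

def sigm(net):
    # Logistic sigmoid, equation (2.25): maps the net input to a state in (0, 1).
    return 1.0 / (1.0 + math.exp(-net))

def tanh_act(net):
    # Hyperbolic tangent, equation (2.26): maps the net input to a state in (-1, 1).
    return (math.exp(2.0 * net) - 1.0) / (math.exp(2.0 * net) + 1.0)
```

The two forms are related by tanh(x) = 2 ∙ sigm(2x) − 1, which is one way to see that the 0/1 and -1/1 conventions differ only by a rescaling of states.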
All the long–term knowledge in a connectionist model is encoded by the locations and the
weights of the connections, so learning consists of changing the weights or adding or removing
connections. The short–term knowledge of the model is normally encoded by the states of the
units, but some models also have fast–changing temporary weights or thresholds that can be
used to encode temporary contexts or bindings [51].
2.5 Hopfield networks
Historically the Boltzmann machine was preceded by a simpler connectionist model
invented by John Hopfield in 1982. Hopfield’s network is not only a precursor of the Boltzmann
machine; it also represents the limiting case of the asynchronous Boltzmann machine as the
pseudo–temperature parameter 𝐓 → 0. The network proposed by Hopfield in [7] and expanded
in [52] is a symmetrical connectionist network without hidden units whose main purpose is to
store memories as distributed patterns of activity. Hopfield, who is a physicist, got the idea of a
network acting as an associative memory by studying the dynamics of a physical system whose
state space is dominated by a substantial number of locally stable states to which the system is
attracted. He regarded these numerous locally stable states as associative memory or content
addressable memory. Before we present the Hopfield network, we are going to briefly introduce
the ideas of associative memory and Hebbian learning which are used by the learning
algorithms of both Hopfield network and Boltzmann machine.
Inspired by the associative nature of biological memory, Hebb proposed in 1949 a simple model
for the neuron that captures the idea of associative memory [2]. Hebb’s theory is often
summarized by Siegrid Löwel's phrase: "neurons wire together if they fire together" [53]. We are
going to present the intuition behind Hebb’s theory by using an example. Let us imagine that the
weights between neurons whose activities are positively correlated are increased:
d𝑤_ij / dt = corr(𝜎_i, 𝜎_j)   (2.27)
where corr(𝜎_i, 𝜎_j) is the correlation coefficient between the states 𝜎_i and 𝜎_j.
Let us also imagine the following two scenarios:
when stimulus 𝑖 is present – for instance, a bell ringing – the activity of neuron 𝑖 increases;
neuron 𝑗 is associated with another stimulus 𝑗 – for instance, the sight of a teacher coming
to the classroom carrying a register.
If these two stimuli – first a person formally dressed and carrying a register and second a ringing
bell – co–occur in the environment, then the Hebbian learning rule will increase the weights 𝑤𝑖𝑗
and 𝑤𝑗𝑖. This means that when, on a later occasion, stimulus 𝑗 occurs in isolation, making the
activity of 𝜎𝑗 large, the positive weight from 𝑗 to 𝑖 will cause neuron 𝑖 to be also activated. Thus,
the response to the sight of a formally dressed person carrying a register is an automatic
association with the bell ringing sound. Hence, we would expect to hear a bell ringing. We could
call this "pattern completion". No instructor is required for this associative memory to work and
no signal is needed to indicate that a correlation has been detected or that an association
should be made. Thus, the unsupervised local learning algorithm and the unsupervised local
activity rule spontaneously produce the associative memory.
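A discrete-time version of the correlation rule (2.27) can be sketched as follows; the learning rate and the simple product form 𝜎_i ∙ 𝜎_j, a common stand-in for the correlation between bipolar states, are illustrative assumptions rather than the exact rule from the text:

```python
def hebbian_step(w, sigma, eta=0.1):
    # Strengthen w_ij when units i and j are co-active; weaken it when
    # their bipolar states disagree (illustrative discrete Hebbian step).
    n = len(sigma)
    for i in range(n):
        for j in range(n):
            if i != j:
                w[i][j] += eta * sigma[i] * sigma[j]
    return w

# Units 0 and 1 co-active, unit 2 anti-correlated with both.
w = [[0.0] * 3 for _ in range(3)]
w = hebbian_step(w, [1, 1, -1])
```

Because the update is symmetric in i and j, the learned weight matrix stays symmetric, which is exactly the property the Hopfield network relies on.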
2.5.1 Hopfield network models
In his influential paper [7] Hopfield proposed a model that was later called the binary Hopfield
network. Later he generalized the original model and published in [52] a new model called the
continuous Hopfield network. In [52] Hopfield also explained the relationship between the two
models. Because a binary Hopfield network becomes a Boltzmann machine with the addition of
noise in updating, we give a detailed presentation of the binary model and only a brief
presentation of the continuous model. We also briefly present the relation between the stable
states of the Hopfield models.
2.5.1.1 The binary Hopfield model
Architecture:
A binary Hopfield network consists of 𝑛 processing devices called neurons or units. Each unit 𝑖
has two activation levels or states: off or not firing, usually represented as 𝜎𝑖 = 0, and on or
firing at maximum rate, usually represented as 𝜎𝑖 = 1. An alternative representation of the off/on
states uses the bipolar elements -1 and +1.
In the Hopfield network there are weights associated with the connections between units. All
these weights are organized in a matrix W = (𝑤_ij), 1 ≤ 𝑖, 𝑗 ≤ 𝑛, called the weight matrix or the
correlation matrix. The strength of the connection between two units 𝑖 and 𝑗 is called the weight and is
denoted 𝑤𝑖𝑗. The units are connected through symmetric, bidirectional connections, so 𝑤𝑖𝑗 =
𝑤𝑗𝑖. If two units 𝑖 and 𝑗 are not connected, then 𝑤𝑖𝑗 = 0. If they are connected, then 𝑤𝑖𝑗 > 0 or
𝑤𝑖𝑗 < 0. There are no self–connections, so 𝑤𝑖𝑖 = 0 for all 𝑖 ∈ {1,2,… , 𝑛}.
The activity of unit 𝑖, denoted neti, represents the total input that the unit receives from other
units and is computed either with the equation (2.23) or with the equation (2.24), depending on
the presence of the external input. Unless otherwise stated, we consider the external input 𝐼𝑖 for
each unit 𝑖 to be 0. The units are binary threshold units, i.e., for each unit 𝑖 there is a fixed
threshold 𝜃𝑖 ≥ 0. We can think of the threshold of unit 𝑖 as the weight of a special connection
from a virtual unit “0”, whose activity is permanently on, towards unit 𝑖. We formally express this
relation as: 𝜃𝑖 = −𝑤𝑖0. Then, if we include the threshold in the computation of the activity of the
unit, the equation (2.23) becomes:
net_i = ∑_{j=1}^{n} 𝑤_ji ∙ 𝜎_j − 𝜃_i = ∑_{j=0}^{n} 𝑤_ji ∙ 𝜎_j   for 1 ≤ 𝑖 ≤ 𝑛   (2.28)
The instantaneous state of a model composed of 𝑛 units is specified by a configuration or state
vector 𝝈 whose elements represent the states of the units: 𝝈 = (𝜎1, 𝜎2, … , 𝜎𝑛).
Global energy:
Hopfield realized that, when the weight matrix W is symmetric, the network can be characterized
by a global energy function [7]. Moreover, each configuration of the network can also be
characterized by an energy value. The global energy of the network is a sum of contributions
from each unit and is computed with the following formula:
𝐸 = −(1/2) ∙ ∑_{i=1}^{n} ∑_{j=1}^{n} 𝑤_ij ∙ 𝜎_i ∙ 𝜎_j + ∑_{i=1}^{n} 𝜎_i ∙ 𝜃_i   (2.29)
This simple quadratic energy function makes it possible for each unit to compute locally how its
state affects the global energy.
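The quadratic energy (2.29) is straightforward to compute; the sketch below is illustrative, with arbitrary example weights:

```python
def hopfield_energy(w, sigma, theta):
    # Global energy E = -1/2 * sum_ij w_ij * sigma_i * sigma_j
    #                 + sum_i theta_i * sigma_i, equation (2.29).
    n = len(sigma)
    quadratic = sum(w[i][j] * sigma[i] * sigma[j]
                    for i in range(n) for j in range(n))
    return -0.5 * quadratic + sum(theta[i] * sigma[i] for i in range(n))

# Two mutually excitatory units: the configuration with both units on
# has the lowest energy.
w = [[0.0, 1.0], [1.0, 0.0]]
theta = [0.0, 0.0]
```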
Update rule:
The state of the model system changes in time as a consequence of each unit 𝑖 readjusting its
state. While the selection of the unit to be updated could be a stochastic process (taking place
at a mean rate 𝑟 > 0 for each unit) or a deterministic process (being part of a predefined
sequence), the update itself is always a deterministic process. Each selected unit evaluates whether its activity is above or below zero (because we included the threshold in the computation of the unit's activity) and updates its state according to the following “threshold rules”:
States 0/1:
σ_i → 0 if net_i = ∑_{j=1}^{n} w_ji ∙ σ_j − θ_i ≤ 0
σ_i → 1 if net_i = ∑_{j=1}^{n} w_ji ∙ σ_j − θ_i > 0
(2.30)
States -1/1:
σ_i → −1 if net_i = ∑_{j=1}^{n} w_ji ∙ σ_j − θ_i ≤ 0
σ_i → 1 if net_i = ∑_{j=1}^{n} w_ji ∙ σ_j − θ_i > 0
(2.31)
Equivalently, the update rule can be formulated as: update each unit to whichever of its two
states gives the lowest global energy. The updates may be synchronous or asynchronous and,
because the network has feedback (i.e., every unit’s output is an input to other units), an order
for the updates to occur has to be specified.
Synchronous (parallel) updates: firstly all units compute their activities (neti)1≤𝑖≤𝑛, and
secondly they update their states (𝜎𝑖)1≤𝑖≤𝑛 simultaneously.
There are a few drawbacks to this update strategy. Firstly, if the units make their decisions simultaneously, the energy could go up. Secondly, with simultaneous parallel updating we can get oscillations, which always have a period of two. However, if the updates occur in parallel but with random timing, the oscillations are usually destroyed.
Asynchronous (sequential) updates: one unit at a time computes its activity neti and
updates its state 𝜎𝑖.
When units are randomly chosen to update, the global energy 𝐸 of the network will either
lower its value or stay the same. Under repeated sequential updating the network will
eventually converge to a state which is a local minimum in the global energy function.
Thus, if a state is a local minimum in the global energy function, it is a stable state for the
network.
Learning rule:
The first goal of the Hopfield network is to store the input data or desired memories – this is
what we call the store phase. The desired memories are represented as a set with 𝑚 elements,
each element being a 𝑛–dimensional binary vector.
The second goal is that, given the initial configuration of a Hopfield network, the network is capable of retrieving or recalling one particular configuration or stored memory from all the memories stored in the network – this is what we call the recall phase.
In general, the initial configuration is a noisy version of one stored memory. The learning rule is intended to make a set of desired memories become stored memories, i.e., stable states of the Hopfield network's activity rule. In order to understand how the Hopfield network learns a set of desired memories, we first present the information storage rules and then prove that the stored memories are stable states of the Hopfield network.
Information storage algorithm:
We start by observing that each desired memory represents a possible configuration of the
network:
σ^(s) = (σ_1^(s), σ_2^(s), …, σ_n^(s)) for all s ∈ {1, 2, …, m} (2.32)
Hopfield demonstrated that the capacity 𝑚 of a totally connected network with 𝑛 units under his
storage rule is only about 0.15𝑛 memories [7]. Also, if all the desired memories are known, the
matrix W does not change in time. Hence, it can be determined in advance.
Hopfield proposed the following rules for computing the weights of a network whose purpose is to store a given set of m desired memories. In both cases the factor 1/m ensures that |w_ij| ≤ 1.

States 0/1: w_ij = (1/m) ∙ ∑_{s=1}^{m} (2∙σ_i^(s) − 1) ∙ (2∙σ_j^(s) − 1) for 1 ≤ i, j ≤ n
w_ii = 0 for 1 ≤ i ≤ n
(2.33)

States -1/1: w_ij = (1/m) ∙ ∑_{s=1}^{m} σ_i^(s) ∙ σ_j^(s) for 1 ≤ i, j ≤ n
w_ii = 0 for 1 ≤ i ≤ n
(2.34)
There is another way to compute the weight matrix W. The algorithm starts from a matrix W with all elements equal to zero. For each binary vector σ that represents a desired memory, the weight w_ij between any two units i and j is incremented by a quantity Δw_ij:
𝑤𝑖𝑗 ← 𝑤𝑖𝑗 + Δ𝑤𝑖𝑗 for 1 ≤ 𝑖, 𝑗 ≤ 𝑛 (2.35)
where Δ𝑤𝑖𝑗 is computed with the following formulae:
States 0/1: Δw_ij = 4 ∙ (σ_i − 1/2) ∙ (σ_j − 1/2) for 1 ≤ i, j ≤ n
States -1/1: Δw_ij = σ_i ∙ σ_j for 1 ≤ i, j ≤ n
(2.36)

The rules (2.35) and (2.36) are applied to the whole matrix m times, once for each desired memory. After these steps each weight w_ij has an integer value in the range [−m, m]. Finally, the weight matrix W may be normalized by multiplying it by the factor 1/m.
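The storage rule for -1/1 states can be sketched as follows. This is a minimal illustration under our own naming, with the desired memories given as rows of a NumPy array:

```python
import numpy as np

def store_memories(patterns):
    """Hopfield storage rule for -1/1 states (eq. 2.34):
    w_ij = (1/m) * sum_s sigma_i^(s) * sigma_j^(s), with w_ii = 0."""
    m, n = patterns.shape
    W = patterns.T @ patterns / m   # accumulate Delta_w over memories, normalize by 1/m
    np.fill_diagonal(W, 0.0)        # no self-connections
    return W
```

The resulting matrix is symmetric with zero diagonal, which is exactly the condition required for the convergence proof in Section 2.5.2.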
Once the matrix W has been computed, the desired memories have become stored memories. Now we need to prove that the stored memories are stable states of the Hopfield network. We present the proof only for the case when the states of the units are represented as 0/1; the proof for the -1/1 representation is similar.
In order to prove that the stored memories (σ^(s))_{1≤s≤m} are stable states for the Hopfield network, we start by computing net_i^(s), the activity of some unit i when the network is in the s-th stored memory:

net_i^(s) = ∑_{j=0}^{n} w_ji ∙ σ_j^(s) = ∑_{j=0}^{n} σ_j^(s) ∙ ∑_{u=1}^{m} (2∙σ_i^(u) − 1) ∙ (2∙σ_j^(u) − 1) (2.37)

net_i^(s) = ∑_{u=1}^{m} (2∙σ_i^(u) − 1) ∙ [∑_{j=0}^{n} σ_j^(s) ∙ (2∙σ_j^(u) − 1)]

In the equation (2.37) the mean value of the bracketed term is 0 unless s = u, in which case the mean value is n/2. This pseudo-orthogonality leads to:

net_i^(s) = ∑_{j=0}^{n} w_ji ∙ σ_j^(s) ≅ ⟨net_i^(s)⟩ = (n/2) ∙ (2∙σ_i^(s) − 1) for 1 ≤ i ≤ n (2.38)

The equation (2.38) shows that ⟨net_i^(s)⟩ is positive when σ_i^(s) = 1 and negative when σ_i^(s) = 0. The s-th stored state would therefore always be stable under Hopfield's algorithm, were it not for the noise coming from the s ≠ u terms.
2.5.1.2 The continuous Hopfield model and its relation with the binary Hopfield model
Let us consider a binary Hopfield network where the set of possible states σ_i of unit i is {V_i^0, V_i^1}, where V_i^0 ∈ ℝ, V_i^1 ∈ ℝ, V_i^0 < V_i^1, and 1 ≤ i ≤ n.
Let us also consider another Hopfield network, identical to the first one except for the following aspects:
the output variable V_i of unit i is a continuous and monotonically increasing function of the instantaneous input net_i of the same unit;
the output variable V_i of unit i has the range V_i^0 ≤ V_i ≤ V_i^1;
the input–output relation is described by a sigmoid function with horizontal asymptotes V_i^0 and V_i^1.
In the second network the sigmoid function acts as an activation function and the output V_i of unit i plays the role of the state σ_i of unit i in the first network:

σ_i ≡ V_i for 1 ≤ i ≤ n (2.39)

If V_i^0 and V_i^1 are 0 and 1 respectively, then an appropriate activation function for the second network is the logistic sigmoid and the output of unit i is computed with the formula (2.40). If V_i^0 and V_i^1 are -1 and 1 respectively, then an appropriate activation function for the second network is the hyperbolic tangent and the output of unit i is computed with the formula (2.41).

{V_i^0, V_i^1} = {0, 1}: V_i = sigm(net_i) = 1 / (1 + exp(−∑_{j=1}^{n} w_ji ∙ σ_j + θ_i)) (2.40)

{V_i^0, V_i^1} = {−1, 1}: V_i = tanh(net_i) = tanh(∑_{j=1}^{n} w_ji ∙ σ_j − θ_i) (2.41)
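The two activation functions (2.40) and (2.41) can be written directly from the formulas. This is only a sketch; the helper names are ours:

```python
import numpy as np

def logistic_output(W, theta, sigma, i):
    """Continuous unit output for states {0,1} (eq. 2.40): V_i = sigm(net_i)."""
    net = W[:, i] @ sigma - theta[i]
    return 1.0 / (1.0 + np.exp(-net))

def tanh_output(W, theta, sigma, i):
    """Continuous unit output for states {-1,1} (eq. 2.41): V_i = tanh(net_i)."""
    net = W[:, i] @ sigma - theta[i]
    return np.tanh(net)
```

At net_i = 0 the logistic output is 0.5 and the tanh output is 0, the midpoints of the two ranges; for large |net_i| the outputs approach the asymptotes V_i^0 and V_i^1.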
Each unit updates its state as if it were the only unit in the network. The updates may also be synchronous or asynchronous and the learning rule is similar to the learning rule of the binary network. We observe that the binary Hopfield network is a special case of the continuous Hopfield network. The continuous model has the same flow properties in its continuous space that the binary model does in its discrete space. It can, therefore, be used as a content-addressable memory or for any other computational task for which an energy function is essential.
The relation between the stable states of the two models
For a given weight matrix W, the stable states of the continuous system have a simple
correspondence with the stable states of the binary system. The discrete algorithm searches for
minimal states at the corners of the hypercube, i.e., corners that are lower than adjacent
corners. Since the global energy of the model is a linear function of a single unit state along any
cube edge, the energy minima (or maxima) for the discrete space with, for instance, activities
𝜎𝑖 ∈ {0,1} are exactly the same corners as the energy minima (or maxima) for the continuous
case with activities 0 ≤ Vi ≤ 1 [52].
2.5.2 Convergence of the Hopfield network
Hopfield claimed that his original model behaves as an associative memory when the state
space flow generated by the algorithm is characterized by a set of stable fixed–points such that
each stable point represents a nominally assigned memory. He proved that the stored
memories are stable under the asynchronous update rule and, moreover, that the asynchronous update rule of a Hopfield network is able to take a partial or corrupted memory and perform pattern completion or error correction to restore the original memory [7,50]. The proof relies on
an essential feature of the store–recall operation: the state space flow algorithm converges to
stable states. The flow convergence to stable states is guaranteed by a mathematical condition
imposed on the weight matrix W: to be symmetric and to have zero diagonal elements. Here we
present a sketch of the proof for the case of a binary Hopfield network with asynchronous
updates. The proof for the continuous Hopfield network is similar.
Claim: The binary threshold update rules (2.30) and (2.31) cause the network to settle to a
minimum of the global energy function.
Proof: The proof follows the construction of an appropriate energy function 𝐸 (equation (2.29))
that is always decreased by any state change produced by the algorithm.
First we introduce the concept of energy gap; then we compute the energy gap of some unit 𝑖,
where 1 ≤ 𝑖 ≤ 𝑛. The energy gap of unit 𝑖 represents the change Δ𝐸𝑖 in global energy 𝐸 due to
changing the state of the unit 𝑖 by Δ𝜎𝑖 and keeping all the other units unchanged. In order to
compute Δ𝐸𝑖, we rewrite the equation (2.29) by separating the contribution of unit 𝑖 from the
contributions of all the other units:
E = −(1/2) ∙ ∑_{i=1}^{n} ∑_{j=1}^{n} w_ij ∙ σ_j ∙ σ_i + ∑_{i=1}^{n} σ_i ∙ θ_i

E = (−(1/2) ∙ ∑_{k=1,k≠i}^{n} ∑_{j=1,j≠i}^{n} w_kj ∙ σ_j ∙ σ_k + ∑_{k=1,k≠i}^{n} σ_k ∙ θ_k) + (−∑_{j=1,j≠i}^{n} w_ij ∙ σ_j ∙ σ_i + σ_i ∙ θ_i)

E = (−(1/2) ∙ ∑_{k=1,k≠i}^{n} ∑_{j=1,j≠i}^{n} w_kj ∙ σ_j ∙ σ_k + ∑_{k=1,k≠i}^{n} σ_k ∙ θ_k) + (−∑_{j=1,j≠i}^{n} w_ij ∙ σ_j + θ_i) ∙ σ_i (2.42)
In the equation (2.42) the content of the first parenthesis does not depend on the state of unit i. Consequently, the first parenthesis cancels in the computation of ΔE_i.
States 0/1:
ΔE_i = E(σ_i = 0) − E(σ_i = 1) = −(∑_{j=1,j≠i}^{n} w_ij ∙ σ_j − θ_i) ∙ Δσ_i (2.43)

States -1/1:

ΔE_i = E(σ_i = −1) − E(σ_i = 1) = −(∑_{j=1,j≠i}^{n} w_ij ∙ σ_j − θ_i) ∙ Δσ_i (2.44)
According to the equation (2.23), the content of the parenthesis in both equations (2.43) and
(2.44) is exactly neti. Hence, the equations (2.43) and (2.44) can be compactly written as:
Δ𝐸𝑖 = −neti ∙ Δ𝜎𝑖 (2.45)
According to the update rules (2.30) and (2.31), Δσ_i is positive (the state changes from 0 to 1, respectively from -1 to 1) only when net_i is positive, and is negative (the state changes from 1 to 0, respectively from 1 to -1) only when net_i is negative. Therefore, any change in the global energy E under the algorithm is negative; in other words, the global energy E is a monotonically decreasing function. Moreover, for a given set of weights W and a given set of thresholds (θ_i)_{1≤i≤n} the global energy E is bounded both below and above. Hence, the iteration of the algorithm must lead to stable states that do not change further with time.
The following algorithm describes the dynamics of a trained Hopfield network that uses the
representation of states as 0/1, starts from a given configuration, and converges to a stable
configuration. If the states are represented as -1/1, step 4 of Algorithm 2.1 needs to be modified correspondingly.
Algorithm 2.1: Hopfield Network Dynamics
Given: a trained network W and an initial configuration 𝜎
begin
Step 1: repeat
Step 2: choose a unit 𝑖 at random with mean rate 𝑟 > 0
Step 3: compute the activity of the unit 𝑖: neti
Step 4: if neti > 0 and 𝜎𝑖 = 0 then set: 𝜎𝑖 = 1
if neti < 0 and 𝜎𝑖 = 1 then set: 𝜎𝑖 = 0
A unit that changes its state as described above becomes “satisfied”.
If a state change is not necessary, then the unit is already satisfied.
Step 5: until the current configuration is stable
A configuration is stable when all the units are satisfied.
end
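Algorithm 2.1 can be sketched as follows for 0/1 states. The random unit selection of Step 2 is approximated here by repeated random sweeps, and the names and NumPy representation are our own:

```python
import numpy as np

def run_dynamics(W, theta, sigma, rng, max_sweeps=100):
    """Asynchronous Hopfield dynamics (Algorithm 2.1, 0/1 states): repeatedly
    pick units in random order and update each toward lower energy, stopping
    when a full sweep leaves every unit satisfied."""
    sigma = sigma.copy()
    n = len(sigma)
    for _ in range(max_sweeps):
        changed = False
        for i in rng.permutation(n):            # Step 2: one unit at a time
            net = W[:, i] @ sigma - theta[i]    # Step 3: activity of unit i
            new_state = 1 if net > 0 else 0     # Step 4: threshold rule (2.30)
            if new_state != sigma[i]:
                sigma[i] = new_state
                changed = True
        if not changed:                         # Step 5: configuration stable
            return sigma
    return sigma
```

For example, with three mutually excitatory units (unit thresholds 0.5), the corrupted state (1, 1, 0) settles to the stored state (1, 1, 1), while (0, 0, 0) is itself stable.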
In a Hopfield network the weight matrix W contains simultaneously many memories. We refer to
the process of incorporating all these memories into the network’s weights as training. The
training process is described by the equations (2.33) and (2.34). A trained Hopfield network
converges to a stable configuration that generally depends on the initial configuration of the
network. This means that the stored memories or stable points can be individually reconstructed
from partial information in an initial state of the network.
If the stable points describe a simple flow in which nearby points in state space tend to remain
close during the flow (i.e., a non–mixing flow), then initial states that are close (in Hamming
distance) to a particular stable state and far from all others will tend to terminate in that nearby
stable state [52]. If the initial state is ambiguous (i.e., not particularly close to any stable state),
then the flow is not entirely deterministic and the system responds to that ambiguous state by a
statistical choice between the memory states it most resembles [7].
States near a particular stable point contain partial information about the memory assigned to
that stable point. From an initial state of partial information about a memory, a final stable state
with all the information of the memory is found. The memory is reached not by knowing an
address, but rather by supplying in the initial state some subpart of the memory. Any subpart of
adequate size will do – the memory is truly addressable by content rather than location.
Because the Hopfield network uses its local energy minima to store memories, when the system is started near some local minimum, the desired behavior of the network is to fall into that local minimum, not to find the global minimum.
Chapter 3. Variational methods for Markov networks
Variational methods are used as approximation methods in a wide variety of settings.
They have become very popular, since they typically scale well to large applications. The name
variational method refers to a general strategy in which the problem to be solved is expanded to
include additional parameters that increase the degrees of freedom over which the optimization
is performed and which must be fit to the problem at hand. Each choice of these new
parameters, called variational parameters, gives an approximate answer to the original problem.
The best approximation is usually obtained by optimizing the variational parameters. In this way
the “expansion” of the original problem is actually a modality to convert a complex problem into
a simpler problem, where the simpler problem is generally characterized by a decoupling of the
degrees of freedom in the original problem [27]. Throughout this chapter, we use the standard
terminology for graphical models and we concentrate on Markov networks.
3.1 Pairwise Markov networks as exponential families
In this section we take a look at the parameterization of a pairwise Markov network, i.e., its representation as a member of a parameterized family of probability distributions, namely an exponential family. Our approach is justified by the
fact that the particularities of the Boltzmann machine model are due to its parameterization and
not to its conditional independence structure. Therefore, we start by defining the concept of
exponential family together with a few related concepts. Then we apply them to obtain an
exponential form for a pairwise Markov network. Next we define the concept of canonical
parameters and we introduce the canonical representations for pairwise Markov networks. Then
we define mean parameters and we introduce the mean parameterization for pairwise Markov
networks. We end this section by exploring the role of mean parameters in inference problems.
The majority of theoretical results presented in this section are taken from [47].
3.1.1 Basics of exponential families
In Section 2.2 we defined Markov networks in terms of products of potential functions (equations
(2.11) to (2.13) and (2.15)). In this section we are going to see that, in an exponential family
setting, these products become additive decompositions.
Let us consider a pairwise Markov network defined over a graph G = (V, E) and associated with a set of random variables X = (X_1, …, X_n) where n = |V|. Without restricting the generality, let us assume that each random variable X_i, associated with node i ∈ V, is Bernoulli, i.e., takes its “spin” values/states from I = {0,1}.
Let Φ_G = (φ_j)_{1≤j≤d} be a collection of d potential functions, such that φ_j : I^n → ℝ for all j ∈ {1, 2, …, d}. Here d is the number of cliques that cover the edges and nodes of G; these cliques are in one-to-one correspondence with the potential functions φ_j.
Given the vector of potential functions ΦG, we associate to it a vector of canonical or
exponential parameters: 𝐖 = (𝑊𝑗)1≤𝑗≤𝑑. If the vector of potential functions ΦG is fixed, then
each parameter vector W indexes a particular probability distribution 𝐏W of the family.
For each fixed X ∈ I𝑛, we use ⟨W,ΦG⟩ to denote the Euclidean inner product in ℝ𝑑 between the
vectors W and ΦG:
⟨W, Φ_G⟩ = ∑_{j=1}^{d} W_j ∙ φ_j (3.1)
With these notations, the exponential family associated with the set of potential functions ΦG
and the set of canonical parameters W consists of the following parameterized collection of
probability density functions:
P_W(X) = exp(⟨W, Φ_G⟩ − A(W)) (3.2)

where: A(W) = ln ∑_{X∈I^n} exp(⟨W, Φ_G⟩) (3.3)
We are particularly interested in canonical parameters W that belong to the set:
𝛀 = {W ∈ ℝ𝑑 ∶ 𝑨(W) < +∞} (3.4)
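For a small binary model, the cumulant function (3.3) and the density (3.2) can be computed by brute-force enumeration over I^n. This is a sketch under our own naming, with the potential functions passed as callables:

```python
import itertools
import math

def cumulant(weights, phis, n):
    """A(W) = ln sum_{X in I^n} exp(<W, Phi_G(X)>)  (eq. 3.3)."""
    return math.log(sum(
        math.exp(sum(w * phi(X) for w, phi in zip(weights, phis)))
        for X in itertools.product((0, 1), repeat=n)))

def density(weights, phis, n, X):
    """P_W(X) = exp(<W, Phi_G(X)> - A(W))  (eq. 3.2)."""
    inner = sum(w * phi(X) for w, phi in zip(weights, phis))
    return math.exp(inner - cumulant(weights, phis, n))
```

Subtracting A(W) normalizes the family: the densities sum to 1 over I^n. The enumeration is exponential in n, which is precisely why the forward mapping is hard in high dimensions.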
The following notions are important in subsequent development:
Regular families: An exponential family for which the domain Ω is an open set is known as a
regular family.
Minimal: It is typical to define an exponential family with a vector of potential functions ΦG
such that there is no nonzero vector W ∈ ℝ𝑑 such that ⟨W,ΦG⟩ is equal to a constant.
This condition gives rise to a so–called minimal representation, in which there is a unique
canonical parameter vector W associated with each probability distribution 𝐏.
Overcomplete: Instead of a minimal representation, it can be convenient to use a non–
minimal or overcomplete representation, in which there exist linear combinations ⟨W,ΦG⟩
that are equal to a constant. In this case, there actually exists an entire affine subset of
parameter vectors W, each associated with the same distribution.
3.1.2 Canonical representation of pairwise Markov networks
The potential functions of a pairwise Markov network, as described by the equation (2.15), are
either node potentials or edge potentials. Therefore we can differentiate between the node-specific canonical parameters Θ = (θ_i)_{i∈V} and the edge-specific canonical parameters w = (w_ij)_{{i,j}∈E}. This leads us to a new representation of the pairwise Markov network's canonical parameters W = (w, Θ). This new representation has dimension d = |V| + |E| and will be of particular importance in Chapter 4 and in Chapter 5. Hence, the exponential form of a pairwise Markov network and the corresponding cumulant function are:

P_W(X) = P_W(X_1, …, X_n) = exp(∑_{i∈V} θ_i ∙ X_i + ∑_{{i,j}∈E} w_ij ∙ X_i ∙ X_j − A(W)) (3.5)

where:

A(W) = ln(∑_{X∈I^n} exp(∑_{i∈V} θ_i ∙ X_i + ∑_{{i,j}∈E} w_ij ∙ X_i ∙ X_j)) (3.6)
The exponential form of a pairwise Markov network given by the equations (3.5) and (3.6) is a
regular minimal representation. The representation is regular because the sums from the
equation (3.5) are finite for all choices of W ∈ ℝ𝑑 and the domain Ω is the full space ℝ𝑑. The
representation is minimal because there is no nontrivial inner product ⟨W,ΦG⟩ equal to a
constant.
An alternative canonical representation of pairwise Markov networks, named the standard
overcomplete representation, uses the indicator functions 𝕀𝒊;𝒔 and 𝕀𝒊𝒋;𝒔𝒕 as potential functions.
Each pairing of a node 𝑖 ∈ 𝑉 and a state 𝑠 ∈ I yields a node–specific indicator function 𝕀𝒊;𝒔 with
an associated vector of canonical parameters Θi = (𝜃𝑖;𝑠)𝑠∈I.
𝕀_{i;s}(X_i) = {1 if X_i = s; 0 otherwise} for all i ∈ V, s ∈ I (3.7)

Each pairing of an edge {i, j} ∈ E with a pair of states (s, t) ∈ I × I yields an edge-specific indicator function 𝕀_{ij;st}, as well as an associated canonical parameter w_{ij;st} ∈ ℝ:

𝕀_{ij;st}(X_i, X_j) = {1 if X_i = s and X_j = t; 0 otherwise} for all {i, j} ∈ E, (s, t) ∈ I × I (3.8)
The indicator functions given by the equations (3.7) and (3.8) together with their associated
canonical parameters define an exponential family with dimension 𝑑 = 2 ∙ |𝑉 | + 4 ∙ |𝐸|. Hence,
the exponential form of a pairwise Markov network with indicator functions given by the
equations (3.7) and (3.8) is:
P_W(X) = exp(∑_{i∈V, s∈I} 𝕀_{i;s}(X_i) ∙ θ_{i;s} + ∑_{{i,j}∈E, (s,t)∈I×I} 𝕀_{ij;st}(X_i, X_j) ∙ w_{ij;st} − A(W)) (3.9)

where:

A(W) = ln(∑_{X∈I^n} exp(∑_{i∈V, s∈I} 𝕀_{i;s}(X_i) ∙ θ_{i;s} + ∑_{{i,j}∈E, (s,t)∈I×I} 𝕀_{ij;st}(X_i, X_j) ∙ w_{ij;st})) (3.10)
(3.10)
The exponential form of a pairwise Markov network given by the equations (3.9) and (3.10) is
regular and overcomplete. Like the previous representation, the cumulant function 𝐴 is
everywhere finite, so that the family is regular. In contrast to the previous representation, this
representation is overcomplete because the indicator functions satisfy various linear relations,
like for instance: ∑ 𝕀𝒊;𝒔(𝑋𝑖)𝑠∈I = 1 for all 𝑋𝑖 ∈ I.
3.1.3 Mean parameterization of pairwise Markov networks
So far, we have characterized a pairwise Markov network by its vector of canonical parameters
W ∈ Ω. It turns out that any exponential family, particularly a pairwise Markov network, has an
alternative parameterization in terms of a vector of mean parameters.
Let 𝐏 be a given probability distribution that is a member of an exponential family and whose
collection of potential functions is ΦG = (𝜙𝑗)1≤𝑗≤𝑑. Here all the potential functions are indexed by
j, not only the node-related ones: φ_j = φ_j(X) = φ_j(X_1, …, X_n). Then the mean parameter μ_j associated with the potential function φ_j, where j ∈ {1, 2, …, d}, is defined by the expectation:

μ_j ≝ E_P[φ_j(X)] = ∑_{X∈I^n} φ_j(X) ∙ P(X) (3.11)
Thus, given an arbitrary probability distribution 𝐏 from an exponential family, we defined a vector
𝛍 ≝ (𝜇1, … , 𝜇𝑑) of 𝑑 mean parameters such that there is one mean parameter 𝜇𝑗 for each
potential function 𝜙j. We also define the set ℳ that contains all realizable mean parameters,
i.e., all possible mean vectors μ that can be traced out as the underlying distribution 𝐏 is varied:
𝓜= {𝛍 ∈ ℝ𝑑 ∶ ∃𝐏 such that 𝐄𝐏[𝜙j(X)] = 𝜇𝑗 for all 𝑗 ∈ {1,2…𝑑}} (3.12)
If the exponential family is a pairwise Markov network with indicator functions given by the
equations (3.7) and (3.8), then the collection of potential functions ΦG takes the form:
𝚽𝐆 = {𝕀𝒊;𝒔(𝑋𝑖) ∶ 𝑖 ∈ 𝑉, 𝑠 ∈ I} ∪ {𝕀𝒊𝒋;𝒔𝒕(𝑋𝑖, 𝑋𝑗) ∶ {𝑖, 𝑗} ∈ 𝐸, (𝑠, 𝑡) ∈ I × I } (3.13)
The corresponding mean parameter vector μ ∈ ℝ𝑑, where 𝑑 = 2 ∙ |𝑉 | + 4 ∙ |𝐸|, consists of
marginal probabilities over singleton variables and marginal probabilities over pairs of variables
that correspond to graph edges:
𝛍 = {𝜇𝑖;𝑠 ∶ 𝑖 ∈ 𝑉, 𝑠 ∈ I} ∪ {𝜇𝑖𝑗;𝑠𝑡 ∶ {𝑖, 𝑗} ∈ 𝐸, (𝑠, 𝑡) ∈ I × I } (3.14)
where: 𝜇𝑖;𝑠 = 𝐄𝐏[𝕀𝒊;𝒔(𝑋𝑖)] = 𝐏[𝑋𝑖 = s] for all 𝑖 ∈ 𝑉, 𝑠 ∈ I (3.15)
and: 𝜇𝑖𝑗;𝑠𝑡 = 𝐄𝐏[𝕀𝒊𝒋;𝒔𝒕(𝑋𝑖, 𝑋𝑗)] = 𝐏[𝑋𝑖 = 𝑠, 𝑋𝑗 = t] for all {𝑖, 𝑗} ∈ 𝐸, (𝑠, 𝑡) ∈ I × I (3.16)
The corresponding set ℳ is known as the marginal polytope associated with the graph 𝐺 and is
denoted ℳ(𝐺). Explicitly, ℳ(𝐺) is given by:
ℳ(G) = {μ ∈ ℝ^d : ∃P such that eq. (3.15) holds ∀i ∈ V, ∀s ∈ I, and eq. (3.16) holds ∀{i,j} ∈ E, ∀(s,t) ∈ I × I} (3.17)
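The mean parameters (3.15) and (3.16) of the overcomplete family are simply node and edge marginals; for a small model they can be read off any distribution by enumeration. This is a sketch with our own naming, where P maps a configuration in I^n to its probability:

```python
import itertools

def mean_parameters(P, n, edges):
    """Mean parameters of the standard overcomplete family:
    mu_{i;s} = P[X_i = s]  (eq. 3.15) and
    mu_{ij;st} = P[X_i = s, X_j = t]  (eq. 3.16)."""
    mu_node = {(i, s): 0.0 for i in range(n) for s in (0, 1)}
    mu_edge = {(i, j, s, t): 0.0 for (i, j) in edges
               for s in (0, 1) for t in (0, 1)}
    for X in itertools.product((0, 1), repeat=n):
        p = P(X)
        for i in range(n):
            mu_node[(i, X[i])] += p      # indicator I_{i;s} fires when X_i = s
        for (i, j) in edges:
            mu_edge[(i, j, X[i], X[j])] += p
    return mu_node, mu_edge
```

Collecting such vectors over all distributions P traces out the marginal polytope ℳ(G) of equation (3.17).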
3.1.4 The role of transformations between parameterizations
Various statistical computations, among them marginalization and maximum likelihood
estimation, can be understood as transforming from one parameterization to the other.
The computation of the forward mapping, from canonical parameters 𝐖 ∈ 𝛀 to mean
parameters 𝛍 ∈𝓜, can be viewed as a fundamental class of inference problems in exponential
family models and is extremely difficult for many high–dimensional exponential families.
The computation of the backward mapping, from mean parameters μ ∈ ℳ to canonical parameters W ∈ Ω, also has a natural statistical interpretation. In particular, suppose that we are
given a set of 𝑚 samples 𝕏 of a multivariate random variable X = (𝑋1, … , 𝑋𝑛):
𝕏 = (X^(1), …, X^(m))^T where X^(j) = (X_1^(j), …, X_n^(j)) for 1 ≤ j ≤ m (3.18)
The samples are drawn independently from an exponential family member 𝐏W(X) where the
parameter W is unknown. If the goal is to estimate W, the classical principle of maximum
likelihood dictates obtaining an estimate W by maximizing the likelihood of the data, or
equivalently (after taking logarithms and rescaling), maximizing the quantity:
ℒ(W, 𝕏) = (1/m) ∙ ∑_{j=1}^{m} ln(P_W(X^(j))) = ⟨W, μ̂⟩ − A(W) (3.19)

where:

μ̂ = Ê[Φ_G(X)] = (1/m) ∙ ∑_{j=1}^{m} Φ_G(X^(j)) (3.20)

is the vector of empirical mean parameters defined by the data 𝕏. The maximum likelihood estimate Ŵ is chosen to achieve the maximum of this objective function. Generally, computing Ŵ is another challenging problem, since the objective function involves the cumulant function A. Under suitable conditions, the maximum likelihood estimate is unique and is specified by the stationarity condition:

E_Ŵ[Φ_G(X)] = μ̂ (3.21)

Finding the unique solution to this equation is equivalent to computing the backward mapping μ̂ → W and is generally computationally intensive.
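The empirical mean parameters (3.20) are just sample averages of the potential functions. A minimal sketch, with names of our own choosing:

```python
def empirical_means(samples, phis):
    """mu_hat = (1/m) * sum_j Phi_G(X^(j))  (eq. 3.20):
    one sample average per potential function."""
    m = len(samples)
    return [sum(phi(X) for X in samples) / m for phi in phis]
```

The maximum likelihood condition (3.21) then asks for a canonical parameter vector under which the model's expected potentials match these sample averages.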
3.2 The energy functional
In this section we introduce the concept of energy functional as a variational method for
Markov random fields. In physics, the energy functional is the total energy of a certain system,
as a functional of the system's state. In the context of Boltzmann machine, the energy functional
acts as an alternative to Boltzmann–Gibbs distribution in the sense that it is more advantageous
to maximize the energy functional instead of computing the partition function for the Boltzmann–
Gibbs distribution. The majority of theoretical results presented in this section are taken from
[27,47,52].
Let us consider that we are given some complicated probabilistic system which is modelled by a Markov random field with n nodes and random variables X_1, X_2, …, X_n. We introduce the notation P̃ for the unnormalized measure of the probability distribution P that describes the Markov random field. We rewrite the equation (2.11) in a way that highlights the unnormalized measure of P:

P(X_1, X_2, …, X_n) = (1/Z) ∙ ∏_{X_c∈C_G} φ_c(X_c) = P̃(X_1, X_2, …, X_n) / Z (3.22)

where: P̃(X_1, X_2, …, X_n) = ∏_{X_c∈C_G} φ_c(X_c) (3.23)

and: Z = ∑_{X_1,X_2,…,X_n} ∏_{X_c∈C_G} φ_c(X_c) = ∑_{X_1,X_2,…,X_n} P̃(X_1, X_2, …, X_n) (3.24)
(3.24)
Our goal is to construct an approximation for the joint distribution 𝐏; we are going to name this
new distribution 𝐐. In order to reach this goal, we are going to employ the following strategy:
instead of looking for a distribution equivalent to 𝐏, we are looking for a distribution reasonably
close to 𝐏. Moreover, we want to make sure that we can perform inference efficiently in the
given Markov random field by using Q. Therefore, instead of choosing a priori a single distribution Q, we first choose a family of approximating distributions ℚ = {Q_i : 1 ≤ i ≤ n}, and then let the optimization machinery choose a particular member of this family.
In our journey to finding a decent approximation for 𝐏 we are going to use the Kullback–Leibler
divergence or KL–divergence which is defined in Appendix B (equation (B29)). If we explicitly
write the expectation with respect to 𝐐 in the definition of KL(𝐐||𝐏), then we obtain the following
equation:
KL(Q||P) = KL(Q(X_1, …, X_n)||P(X_1, …, X_n)) = ∑_{X_1,…,X_n} Q(X_1, …, X_n) ∙ ln(Q(X_1, …, X_n) / P(X_1, …, X_n)) (3.25)
where: 𝐏 is the “true” joint distribution from which the data was generated; 𝐐 is a distribution
from a certain family of distributions that, more or less, approximate 𝐏; and
KL(𝐐(𝑋1, … , 𝑋𝑛)||𝐏(𝑋1, … , 𝑋𝑛)) is the KL–divergence of 𝐐 and 𝐏.
We observe that the computation of KL–divergence from the equation (3.25) involves an
intractable operation which is the explicit summation over all possible instantiations of 𝑋1, … , 𝑋𝑛.
However, since we know from the equations (3.22) and (3.23) what P(X_1, …, X_n) and P̃(X_1, X_2, …, X_n) look like, we can exploit this fact to rewrite the KL-divergence in a simpler form. Before we present this simplified form of the KL-divergence, we need to introduce a few concepts related to energy and entropy in Markov random fields.
The entropy of 𝑋1, … , 𝑋𝑛 with respect to 𝐐 is given by the equation (B27) from Appendix B:
S_Q(X_1, …, X_n) = −∑_{X_1,…,X_n} Q(X_1, …, X_n) ∙ ln(Q(X_1, …, X_n)) (3.26)
Definition 3.1:
The energy functional F[P̃, Q] of two probability distributions P and Q is defined in connection with the unnormalized measure P̃(X_1, X_2, …, X_n) of P as:

F[P̃, Q] = E_Q[ln(P̃(X_1, …, X_n))] + S_Q(X_1, …, X_n) (3.27)

where E_Q[ln(P̃(X_1, …, X_n))] represents the expectation with respect to Q of the logarithm of the unnormalized measure of P and S_Q(X_1, …, X_n) denotes the entropy of X_1, …, X_n with respect to Q.
Equivalent forms of the energy functional can be obtained by expanding the expectation in the first term and substituting the entropy in the second term of the equation (3.27):

F[P̃, Q] = E_Q[ln(∏_{X_c∈C_G} φ_c(X_c))] + S_Q(X_1, …, X_n)

F[P̃, Q] = E_Q[∑_{X_c∈C_G} ln(φ_c(X_c))] + S_Q(X_1, …, X_n)

F[P̃, Q] = ∑_{X_c∈C_G} E_Q[ln(φ_c(X_c))] + S_Q(X_1, …, X_n) (3.28)

F[P̃, Q] = ∑_{X_c∈C_G} E_Q[ln(φ_c(X_c))] − ∑_{X_1,…,X_n} Q(X_1, …, X_n) ∙ ln(Q(X_1, …, X_n))
The energy functional contains two terms:
The first term, called the energy term, involves expectations with respect to Q of the logarithms of the factors φ_c. Here each factor φ_c appears as a separate term. Thus, if the factors φ_c are small, as is the case for the Boltzmann machine, each expectation deals with relatively few variables. The difficulty of dealing with these expectations depends on the properties of the distribution Q. Assuming that inference is “easy” in Q, we should be able to evaluate such expectations relatively easily.
The second term, called the entropy term, is the entropy of Q. The choice of Q determines whether this term is tractable, i.e., whether we can evaluate it.
The following theorem clarifies the relationship between the KL–divergence and the energy
functional. The proof of this theorem is outside the scope of this paper. However, a proof can be
found in [54].
Theorem 3.1:
The KL-divergence of the probability distributions Q and P can be calculated using the formula:

KL(Q(X_1, …, X_n)||P(X_1, …, X_n)) = −F[P̃, Q] + ln(Z(P)) (3.29)

where F[P̃, Q] is the energy functional given by Definition 3.1 and Z(P) is the partition function of the probability distribution P.
Equivalently, the equation (3.29) can be written using free energies as in [4]:

KL(Q(X_1, …, X_n)||P(X_1, …, X_n)) = F[Q] − F[P] (3.30)
where 𝐹[𝐐] is a variational free energy and 𝐹[𝐏] is the true free energy of the Markov random
field.
The variational free energy is equal to the negative of the energy functional F[P̃, Q] and, generally, is a Gibbs free energy (equation (2.21)). The exact (true) free energy is the Helmholtz free energy (equation (2.22)). Without restricting the generality, in this section we can assume that the constant pseudo-temperature at equilibrium is 1. Therefore, the variational free energy and the true free energy can be written as:

F[Q] = −F[P̃, Q] (3.31)

F[P] = −ln(Z(P)) (3.32)
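Theorem 3.1 can be verified numerically for a small model by enumerating I^n. This is a sketch with our own naming, where P_tilde is the unnormalized measure and Q any distribution over I^n:

```python
import itertools
import math

def kl_vs_energy_functional(P_tilde, Q, n):
    """Return (KL(Q||P), -F[P~,Q] + ln Z) for distributions over I^n, where
    F[P~,Q] = E_Q[ln P~] + S_Q  (eq. 3.27) and P = P~ / Z  (eq. 3.22)."""
    states = list(itertools.product((0, 1), repeat=n))
    Z = sum(P_tilde(X) for X in states)
    kl = sum(Q(X) * math.log(Q(X) / (P_tilde(X) / Z))
             for X in states if Q(X) > 0)
    energy = sum(Q(X) * math.log(P_tilde(X)) for X in states if Q(X) > 0)
    entropy = -sum(Q(X) * math.log(Q(X)) for X in states if Q(X) > 0)
    return kl, -(energy + entropy) + math.log(Z)
```

The two returned values agree for any choice of Q, which is exactly why maximizing the energy functional is equivalent to minimizing the KL-divergence.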
If we substitute the equations (3.28) into the equation (3.29) we obtain the following equivalent forms of the KL-divergence of the probability distributions Q and P:

KL(Q||P) = −∑_{X_c∈C_G} E_Q[ln(φ_c(X_c))] − S_Q(X_1, …, X_n) + ln(Z(P)) (3.33)

KL(Q||P) = −∑_{X_c∈C_G} E_Q[ln(φ_c(X_c))] + ∑_{X_1,…,X_n} Q(X_1, …, X_n) ∙ ln(Q(X_1, …, X_n)) + ln(Z(P))
We observe that the term ln(𝑍(𝐏)) in the equations (3.29) and (3.33) doesn’t depend on 𝐐. Hence, if we want to minimize KL(𝐐||𝐏) with respect to 𝐐, we just need to minimize the first two terms of the right–hand side of the equations (3.33), which means we need to either maximize the energy functional 𝐹[𝐏̃, 𝐐] (equations (3.28)) or minimize the variational free energy 𝐹[𝐐] (equation (3.31)).
To summarize, instead of searching directly for a good approximation 𝐐 of the true probability 𝐏, we need to solve one of the following equivalent optimization problems:
– maximize the energy functional 𝐹[𝐏̃, 𝐐];
– minimize the variational free energy 𝐹[𝐐];
– minimize the KL–divergence KL(𝐐||𝐏).
Choosing among these problems depends on the specifics of the problem we try to solve. Importantly, both the energy functional and the variational free energy involve expectations under 𝐐. By choosing approximations 𝐐 that allow for efficient inference, we can both evaluate the energy functional (or, equivalently, the variational free energy) and optimize it effectively.
Moreover, the KL–divergence is always nonnegative and becomes zero if and only if 𝐐 and 𝐏 are equal. The proof of this claim can be found in [54]:
KL(𝐐||𝐏) ≥ 0   (3.34)
and: KL(𝐐||𝐏) = 0 if and only if 𝐐 = 𝐏   (3.35)
Then, from the equations (3.29) and (3.30) we can infer that:
ln(𝑍(𝐏)) ≥ 𝐹[𝐏̃, 𝐐]   (3.36)
and: 𝐹[𝐏] ≤ 𝐹[𝐐]   (3.37)
The inequalities (3.36) and (3.37), together with the equation (3.32), are significant because they provide bounds involving the variational free energy and the energy functional:
– The variational free energy 𝐹[𝐐] is an upper bound for the true free energy 𝐹[𝐏] for any choice of 𝐐. Hence the result of the optimization problem “minimize 𝐹[𝐐]” is an upper bound for 𝐹[𝐏].
– The energy functional 𝐹[𝐏̃, 𝐐] is a lower bound for ln(𝑍(𝐏)) = −𝐹[𝐏] for any choice of 𝐐. Hence the result of the optimization problem “maximize 𝐹[𝐏̃, 𝐐]” is a lower bound for ln(𝑍(𝐏)).
Therefore, instead of directly computing the true partition function 𝑍(𝐏), we can look for a decent approximation 𝐐 of 𝐏. Depending on the type of optimization employed (minimization or maximization), we obtain an upper bound on 𝐹[𝐏] = −ln(𝑍(𝐏)) or a lower bound on ln(𝑍(𝐏)), either of which yields a one–sided approximation of 𝑍(𝐏).
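These relationships can be checked numerically on a toy model. The sketch below (all weights and marginals are hypothetical, chosen only for illustration) enumerates a three-variable Boltzmann–Gibbs distribution 𝐏, picks a fully factored 𝐐, and verifies (3.30), (3.34), (3.37), and the lower bound on ln 𝑍(𝐏):

```python
import itertools
import math

# Hypothetical 3-node pairwise model on {0,1}: E(x) = -sum w_ij x_i x_j + sum theta_i x_i
w = {(0, 1): 0.8, (1, 2): -0.5}
theta = [0.2, -0.3, 0.1]

def energy(x):
    return (-sum(wij * x[i] * x[j] for (i, j), wij in w.items())
            + sum(t * xi for t, xi in zip(theta, x)))

states = list(itertools.product([0, 1], repeat=3))
Z = sum(math.exp(-energy(x)) for x in states)            # partition function
P = {x: math.exp(-energy(x)) / Z for x in states}        # Boltzmann-Gibbs distribution

q = [0.6, 0.4, 0.5]                                      # marginals Q(X_i = 1) of a factored Q
Q = {x: math.prod(qi if xi else 1 - qi for qi, xi in zip(q, x)) for x in states}

kl = sum(Q[x] * math.log(Q[x] / P[x]) for x in states)   # KL(Q || P)
F_P = -math.log(Z)                                       # true free energy, eq. (3.32)
energy_functional = (sum(Q[x] * -energy(x) for x in states)      # E_Q[ln P~]
                     - sum(Q[x] * math.log(Q[x]) for x in states))  # + S_Q
F_Q = -energy_functional                                 # variational free energy, eq. (3.31)

assert kl >= 0                                  # (3.34)
assert abs(kl - (F_Q - F_P)) < 1e-12            # (3.30)
assert F_Q >= F_P                               # F[Q] upper-bounds F[P]
assert energy_functional <= math.log(Z)         # F[P~,Q] lower-bounds ln Z(P)
```

The assertions hold for any choice of the factored marginals `q`, which is exactly the content of the bounds above.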
3.3 Gibbs free energy revisited
In Section 2.3 we introduced the concept of Gibbs free energy in Markov random fields by
analogy with the homologue concept from thermodynamics. In this section we firstly introduce
the concepts of Hamiltonian and Plefka expansion; then we present two approaches for defining
a variational Gibbs free energy in a Markov random field. Before we start, we mention that we
use the notation (A2) from Appendix A to designate by X a multivariate random variable that
represents all the random variables 𝑋1, … , 𝑋𝑛 of a Markov random field and the notation (A3)
from Appendix A to designate by X−i all the random variables from X except 𝑋𝑖.
3.3.1 Hamiltonian and Plefka expansion
Hamiltonian mechanics is a theory developed as a reformulation of classical mechanics that predicts the same outcomes as Newtonian classical mechanics. It uses a different mathematical formalism, providing a more abstract understanding of the theory, and contributed to the subsequent formulation of statistical mechanics and quantum mechanics. The Hamiltonian is an operator, introduced by Hamiltonian mechanics, that in most cases corresponds to the total energy of the system. For example, the Hamiltonian of a closed system is the sum of the kinetic energies of all the particles plus the potential energy of the particles associated with the system.
The Plefka expansion is an approximate method to compute free energies in physical systems. The method, originally applied to classical spin systems, can be applied to any model in which a transition from a “trivial” disordered phase to an ordered phase occurs as some initially small parameter is varied [55]. That parameter need not be the inverse temperature [55]. In spin glass theory, the Plefka expansion is a “high–temperature” expansion of the ordinary free energy of the system. Concretely, it is a Taylor expansion of the ordinary free energy with respect to the inverse temperature such that the resulting free energy is valid in both the high–temperature and low–temperature phases of the system [55].
The concepts of Hamiltonian and Plefka expansion can be extended to Markov random fields.
We consider a pairwise Markov random field with binary random variables 𝑋1, … , 𝑋𝑛 defined by
the equations (2.11) to (2.13) and whose joint probability distribution 𝐏 is a Boltzmann–Gibbs
distribution described by the equation (2.5). The canonical parameters 𝐖 = (𝐰, 𝚯) of the Markov random field are given by the equation (3.5). The derivation by Plefka is particularly suitable for this type of Markov network since it does not regard the parameters 𝑤𝑖𝑗 as random quantities and hence does not require averaging over them. In spin glass theory the parameters 𝑤𝑖𝑗 are generally regarded as random variables representing random interactions, and their properties are analyzed in the thermodynamic limit so that they do not depend on a particular realization of 𝑤𝑖𝑗 [22]. In Markov random field theory, by contrast, the parameters 𝑤𝑖𝑗 are given and fixed, and hence in principle they cannot be thought of as random variables [22].
In Plefka’s argument [56-57], we associate with a given Markov random field the Hamiltonian H(𝛼), where 𝛼 is the expansion parameter:
H(𝛼) = −𝛼 ∙ (1/2) ∙ Σ_{{𝑖,𝑗}∈𝐸} 𝑤𝑖𝑗 ∙ 𝑋𝑖 ∙ 𝑋𝑗 − Σ_{𝑖∈𝑉} 𝜃𝑖 ∙ 𝑋𝑖   (3.38)
The free energy 𝐹 corresponding to the Hamiltonian H(𝛼) is given by the following formula [56-57]:
−𝛽 ∙ 𝐹[𝛼, 𝛽, {𝜇𝑖}𝑖∈𝑉] = ln(Tr exp(−𝛽 ∙ H(𝛼))) − 𝛽 ∙ Σ_{𝑖∈𝑉} 𝜃𝑖 ∙ 𝜇𝑖   (3.39)
where:
𝛽 = 1/(𝐤 ∙ 𝐓)   (3.40)
and 𝜇𝑖 = 𝐄[𝑋𝑖] for all 𝑖 ∈ 𝑉   (3.41)
Here 𝐓 is the absolute temperature of the system; 𝐤 is Boltzmann’s constant; 𝜇𝑖 is the mean value (magnetization) of the random variable 𝑋𝑖; and Tr denotes the trace, i.e., the sum over all configurations. Generally, in Markov networks Boltzmann’s constant 𝐤 can be taken equal to 1.
The Plefka expansion of the free energy 𝐹 given by the equation (3.39) is obtained by suppressing the dependence of 𝐹 on 𝛽 and {𝜇𝑖}𝑖∈𝑉 and then expanding 𝐹 into a power series in 𝛼 as follows [56-57]:
𝐹[𝛼] = 𝐹[0] + Σ_{𝑛=1}^{∞} (𝛼^𝑛/𝑛!) ∙ (𝜕^𝑛𝐹/𝜕𝛼^𝑛)|_{𝛼=0} = 𝐹[0] + 𝛼 ∙ 𝐹′[0] + (1/2) ∙ 𝛼² ∙ 𝐹′′[0] + ⋯   (3.42)
where the derivatives with respect to 𝛼, 𝐹′[𝛼] = 𝜕𝐹/𝜕𝛼, 𝐹′′[𝛼] = 𝜕²𝐹/𝜕𝛼², and so on, should be taken with 𝜇𝑖 fixed, for all 𝑖 ∈ 𝑉.
The coefficients of the Plefka expansion up to the second order are the following [56-57]:
𝐹[0] = (1/2) ∙ Σ_{𝑖∈𝑉} [(1 + 𝜇𝑖) ∙ ln((1 + 𝜇𝑖)/2) + (1 − 𝜇𝑖) ∙ ln((1 − 𝜇𝑖)/2)]   (3.43)
𝐹′[0] = −(1/2) ∙ Σ_{{𝑖,𝑗}∈𝐸} 𝑤𝑖𝑗 ∙ 𝜇𝑖 ∙ 𝜇𝑗   (3.44)
𝐹′′[0] = −(1/2) ∙ Σ_{{𝑖,𝑗}∈𝐸} 𝑤𝑖𝑗² ∙ (1 − 𝜇𝑖²) ∙ (1 − 𝜇𝑗²)   (3.45)
Together with the equations (3.43) to (3.45), the equation (3.42) becomes:
𝛽 ∙ 𝐹[𝛼] = (1/2) ∙ Σ_{𝑖∈𝑉} [(1 + 𝜇𝑖) ∙ ln((1 + 𝜇𝑖)/2) + (1 − 𝜇𝑖) ∙ ln((1 − 𝜇𝑖)/2)] − (𝛽 ∙ 𝛼/2) ∙ Σ_{{𝑖,𝑗}∈𝐸} 𝑤𝑖𝑗 ∙ 𝜇𝑖 ∙ 𝜇𝑗 − (𝛽 ∙ 𝛼/2)² ∙ Σ_{{𝑖,𝑗}∈𝐸} 𝑤𝑖𝑗² ∙ (1 − 𝜇𝑖²) ∙ (1 − 𝜇𝑗²) + O(𝛼³)   (3.46)
Since 𝐇(𝛼 = 1) is the original Hamiltonian under consideration, leaving the convergence problem aside and neglecting the higher–order terms O(𝛼³), setting 𝛼 = 1 in the equation (3.46) yields the true free energy of the Markov random field: 𝐹[𝛼 = 1] ≡ 𝐹[𝐏] [56-57]. The free energy obtained by truncating the Plefka expansion of the ordinary free energy is a Gibbs free energy as well.
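To make the truncation concrete, the following sketch evaluates the zeroth-, first-, and second-order contributions of (3.46) for ±1-valued variables with 𝛼 = 𝛽 = 1 and zero thresholds (the couplings are hypothetical, not taken from this thesis). At 𝜇 = 0 the first-order term vanishes, leaving only the entropy term −𝑛∙ln 2 and the second-order correction:

```python
import math

# Hypothetical couplings w_ij on the edges of a small graph; variables take values in {-1,+1}
edges = {(0, 1): 0.5, (1, 2): -0.3, (0, 2): 0.2}

def plefka_free_energy(mu, beta=1.0, alpha=1.0):
    """Second-order truncation of the Plefka expansion (3.46), with all theta_i = 0."""
    # zeroth order: negative entropy of independent +/-1 variables with means mu_i
    f0 = 0.5 * sum((1 + m) * math.log((1 + m) / 2) + (1 - m) * math.log((1 - m) / 2)
                   for m in mu)
    # first order: mean field energy term
    f1 = -(beta * alpha / 2) * sum(w * mu[i] * mu[j] for (i, j), w in edges.items())
    # second order: correction depending on the squared couplings
    f2 = -(beta * alpha / 2) ** 2 * sum(w ** 2 * (1 - mu[i] ** 2) * (1 - mu[j] ** 2)
                                        for (i, j), w in edges.items())
    return f0 + f1 + f2   # equals beta * F[alpha] up to O(alpha^3)

mu = [0.0, 0.0, 0.0]
bF = plefka_free_energy(mu)
# at mu = 0: entropy term gives -3*ln 2, first order vanishes, second order is -(1/4)*sum(w^2)
assert abs(bF - (-3 * math.log(2) - 0.25 * sum(w ** 2 for w in edges.values()))) < 1e-12
```

This is only a sketch of the truncated expansion; in practice the {𝜇𝑖} would themselves be chosen by extremizing this expression.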
3.3.2 The Gibbs free energy as a variational energy
We assume we are given the Markov random field described in Section 3.3.1 and we denote by X the set of its random variables. We are also given a proper subset Y of X together with 𝒫, the set of marginal probabilities with respect to 𝐏 of all the variables belonging to Y. Our task is to define a Gibbs free energy for this Markov random field by performing a partial constrained minimization over a distribution 𝐐 of a certain form such that the marginals of 𝐏 corresponding to the variables included in Y are preserved in 𝐐. The term partial signals the fact that only some of the random variables 𝑋1, … , 𝑋𝑛 are constrained.
Formally this optimization task is represented as:
Given: X = (𝑋1, … , 𝑋𝑛) ⊃ Y = {𝑋𝑖1, 𝑋𝑖2, … , 𝑋𝑖𝑚} = {𝑋𝑖𝑗}1≤𝑗≤𝑚   (3.47)
such that: ∀𝑗 ∈ {1, … , 𝑚}, ∃𝑘 ∈ {1, … , 𝑛} such that 𝑋𝑘 ≡ 𝑋𝑖𝑗   (3.48)
and given: 𝒫 = {𝑝1, 𝑝2, … , 𝑝𝑚} = {𝑝𝑗}1≤𝑗≤𝑚   (3.49)
where:
𝑝𝑗 = MARG(𝐏, 𝑋𝑖𝑗) = Σ_{X−𝑖𝑗} 𝐏(𝑋1, … , 𝑋𝑘, … , 𝑋𝑛), with 𝑋𝑖𝑗 ≡ 𝑋𝑘 cf. (3.48)   (3.50)
Construct: 𝐺[{𝑝𝑗}𝑗] = min_𝐐 {𝐹[𝐐] ∶ MARG(𝐐, 𝑋𝑖𝑗) = 𝑝𝑗, 1 ≤ 𝑗 ≤ 𝑚}   (3.51)
such that: 𝐐(Y) = (1/𝑍) ∙ exp(−𝐸(Y)) = 𝐏(Y)   (3.52)
In order to accomplish this task, we could follow any of the following approaches:
Approach 1:
Step 1.1: We obtain a Gibbs free energy by truncating up to the nth–order term the
Plefka expansion of the ordinary free energy (formulae (3.42) and (3.46)).
Step 1.2: We obtain another Gibbs free energy, which generally is a variational Gibbs
free energy, by minimizing the Gibbs free energy obtained in Step 1.1 over the
parameters {𝑝𝑗}𝑗.
Approach 2:
Step 2.1: We obtain a Gibbs free energy by using the formula (2.21).
Step 2.2: We obtain another Gibbs free energy, which generally is a variational Gibbs
free energy, by minimizing the Gibbs free energy obtained in Step 2.1 over the
parameters {𝑝𝑗}𝑗.
In this section we are going to exemplify Approach 1. Step 1.1 is explained in Section 3.3.1 so
we are going to show only how to perform Step 1.2. In order to achieve this goal, we follow the
work of Welling and Teh [4].
The natural way to enforce the constraints on the marginals is by employing a set of Lagrange
multipliers {𝜆𝑗}𝑗 and incorporating them in the approximation of the free energy 𝐹 obtained in
Step 1.1:
𝐹[𝐐] ← 𝐹[𝐐] − Σ_𝑗 𝜆𝑗 ∙ (MARG(𝐐, 𝑋𝑖𝑗) − 𝑝𝑗)   (3.53)
We minimize 𝐹[𝐐] over 𝐐 in terms of the Lagrange multipliers {𝜆𝑗}𝑗 and the parameters {𝑝𝑗}𝑗.
The solution obtained is again a Boltzmann–Gibbs distribution, but with a modified energy which
includes additional bias terms:
𝐸({𝑋𝑖𝑗}𝑗) → 𝐸({𝑋𝑖𝑗}𝑗) − Σ_𝑗 𝜆𝑗 ∙ 𝑋𝑖𝑗   (3.54)
After inserting the expression (3.54) into the free energy given by (3.53) and minimizing over the Lagrange multipliers {𝜆𝑗}𝑗, we find the values of {𝜆𝑗}𝑗 as a function of the parameters {𝑝𝑗}𝑗. The resulting Gibbs free energy is:
𝐺[{𝑝𝑗}𝑗] = min_{{𝜆𝑗}𝑗} {Σ_𝑗 𝜆𝑗 ∙ 𝑝𝑗 − ln 𝑍({𝜆𝑗}𝑗)}   (3.55)
where 𝑍 ({𝜆𝑗}𝑗) is the normalizing constant for the Boltzmann–Gibbs distribution with energy
defined by the equation (3.54). The equation (3.55) is known as the Legendre transform
between {𝜆𝑗}𝑗 and {𝑝𝑗}𝑗. By shifting the Lagrange multipliers as follows:
𝜆𝑗′ = 𝜆𝑗 + 𝜃𝑗 (3.56)
we can pull the contribution of the thresholds to the Gibbs free energy out of the Legendre
transform and obtain another form of the resulted Gibbs free energy:
𝐺[{𝑝𝑗}𝑗] = −Σ_𝑗 𝜃𝑗 ∙ 𝑝𝑗 + min_{{𝜆𝑗′}𝑗} {Σ_𝑗 𝜆𝑗′ ∙ 𝑝𝑗 − ln 𝑍′({𝜆𝑗′}𝑗)}   (3.57)
where 𝑍′ is the partition function of the modified Boltzmann–Gibbs distribution with all the
thresholds {𝜃𝑗}𝑗 set to zero:
𝑍′({𝜆𝑗′}𝑗) = Σ_{{𝑋𝑖𝑗}} exp(−Σ_{{𝑗,𝑙}} 𝑤𝑗𝑙 ∙ 𝑋𝑖𝑗 ∙ 𝑋𝑖𝑙 − Σ_𝑗 𝜆𝑗′ ∙ 𝑋𝑖𝑗)   (3.58)
Various variational Gibbs free energies can be obtained by following this approach. For
instance, the mean field free energy is the Gibbs free energy obtained by truncating the Plefka
expansion of the free energy (equation (3.46)) in the first order and minimizing it with respect to
single node marginals.
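For a single ±1-valued node with no couplings, the Legendre transform (3.55) can be carried out in closed form: 𝑍′(𝜆′) = 2∙cosh(𝜆′), stationarity gives tanh(𝜆′) = 𝑝, and 𝐺[𝑝] reduces to the negative entropy of a node with mean 𝑝, i.e., the zeroth-order Plefka term (3.43). A quick numerical check of this identity (a sketch, not part of the thesis):

```python
import math

def gibbs_free_energy(p):
    """G[p] = extremum over lam of {lam*p - ln Z'(lam)} for one +/-1 node, Z'(lam) = 2*cosh(lam)."""
    # stationarity: d/dlam (lam*p - ln 2cosh(lam)) = p - tanh(lam) = 0  =>  lam = atanh(p)
    lam = math.atanh(p)
    return lam * p - math.log(2 * math.cosh(lam))

def neg_entropy(p):
    """Negative entropy of a +/-1 node with mean p, cf. the zeroth-order Plefka term (3.43)."""
    return 0.5 * ((1 + p) * math.log((1 + p) / 2) + (1 - p) * math.log((1 - p) / 2))

# the Legendre transform reproduces the (negative) single-node entropy
for p in (-0.7, 0.0, 0.3, 0.9):
    assert abs(gibbs_free_energy(p) - neg_entropy(p)) < 1e-12
```

This one-node case shows in miniature how the multipliers {𝜆𝑗}𝑗 trade places with the marginals {𝑝𝑗}𝑗 under the Legendre transform.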
3.4 Mean field approximation
The mean field approximation is a variational approximation of the true free energy, or,
equivalently, of the energy functional, over a computationally tractable family ℚ of simple
distributions:
ℚ = {𝐐𝐢 ∶ 1 ≤ 𝑖 ≤ 𝑛}   (3.59)
The mean field approximation of the true free energy is called the mean field free energy and is a Gibbs free energy. In this section our goal is to obtain an expression for the mean field free energy by maximizing the energy functional. In Section 3.5 we will show how to obtain an equivalent expression for the mean field free energy in relation to the Bethe free energy.
The fact that the distributions 𝐐 are tractable comes with a cost: they are not generally
sufficiently expressive to capture all the information of the true probability distribution 𝐏. Before
we present the simplest mean field algorithm, often called naïve mean field, we introduce the
following notations:
We use the notation (A2) from Appendix A to denote a multivariate random variable that
represents either all the random variables of the Markov random field X = (𝑋1, 𝑋2, … , 𝑋𝑛), or
all the random variables belonging to a specific clique Xc whose potential 𝜙𝑐 would appear in
the joint probability distribution 𝐏 of the random field.
We use the notation (A3) from Appendix A to designate by X−i all the random variables
from X except 𝑋𝑖.
We use the notation 𝐄Y~𝐐 to designate the expectation of probability 𝐐 with respect to all the
random variables of the given Markov random field “contained” in the multivariate random
variable Y. A few examples of this notation are: 𝐄X−i~𝐐 and 𝐄Xc~𝐐 .
3.4.1 The mean field energy functional
The naïve mean field algorithm looks for the distribution 𝐐 closest to the true distribution 𝐏 in
terms of KL(𝐐||𝐏) inside the class of distributions representable as a product of independent
marginals:
𝐐(𝑋1, … , 𝑋𝑛) = Π_{1≤𝑖≤𝑛} 𝐐(𝑋𝑖)   (3.60)
A few observations should be made regarding the equation (3.60).
On the one hand, approximating 𝐏 by a fully factored distribution assumes that all the variables 𝑋1, … , 𝑋𝑛 of 𝐏 are independent of each other in 𝐐. Consequently, this approximation doesn’t capture any of the dependencies that exist in 𝐏 between the variables belonging to a clique Xc, for all Xc ∈ CG, i.e., the dependencies reflected by the clique potentials 𝜙𝑐(Xc) from the equation (3.23).
On the other hand, this distribution is computationally attractive since we can evaluate any query on 𝐐 by a product over terms that involve only the variables in the scope of the query (i.e., the set of variables that appear in the query). Moreover, to represent 𝐐 we only need to describe the marginal probabilities of each of the variables 𝑋1, … , 𝑋𝑛.
In machine learning literature the marginal probabilities 𝐐(𝑋𝑖), where 1 ≤ 𝑖 ≤ 𝑛, are usually
called mean parameters and denoted 𝝁𝒊 [4,5,16-23,27,47].
Before we derive the mean field algorithm, we are going to formulate the energy functional in a
slightly different way. We do this by incorporating the formula (3.60) into the formulae (3.28):
𝐹[𝐏̃, 𝐐] = Σ_{Xc∈CG} 𝐄X~𝐐[ln(𝜙𝑐(Xc))] − Σ_{𝑋1,…,𝑋𝑛} Π_{1≤𝑖≤𝑛} 𝐐(𝑋𝑖) ∙ ln(Π_{1≤𝑖≤𝑛} 𝐐(𝑋𝑖))   (3.61)
In the equation (3.61) the first term of the energy functional is itself a sum of terms of the form 𝐄X~𝐐[ln(𝜙𝑐(Xc))]. In order to evaluate these terms, we can use the equation (3.60) to compute the clique marginal 𝐐(Xc) as a product of single–node marginals, allowing the evaluation of each term to be performed in time linear in the number of random variables of the clique Xc:
𝐐(Xc) = Π_{𝑋𝑖∈Xc} 𝐐(𝑋𝑖) for all Xc ∈ CG, 𝜙𝑐 ∈ ΦG   (3.62)
Then the overall cost of evaluating the first term of (3.61) is linear in the description size of the factors 𝜙𝑐 of 𝐏. For now we cannot expect to do much better.
𝐄X~𝐐[ln(𝜙𝑐(Xc))] = Σ_{xc} 𝐐(xc) ∙ ln(𝜙𝑐(xc)) = Σ_{xc} (Π_{𝑋𝑖∈Xc} 𝐐(𝑥𝑖)) ∙ ln(𝜙𝑐(xc))   (3.63)
where the sum runs over all assignments xc to the clique variables Xc.
The second term of the energy functional in the equation (3.61) is the entropy of 𝑋1, … , 𝑋𝑛 with respect to 𝐐 and, for a fully factored distribution 𝐐, is also decomposable as follows (the proof of this claim can be found in [54]):
𝑆𝐐(𝑋1, … , 𝑋𝑛) = −Σ_{𝑋1,…,𝑋𝑛} Π_{1≤𝑖≤𝑛} 𝐐(𝑋𝑖) ∙ ln(Π_{1≤𝑖≤𝑛} 𝐐(𝑋𝑖))
 = −Σ_{𝑋1,…,𝑋𝑛} Π_{1≤𝑖≤𝑛} 𝐐(𝑋𝑖) ∙ Σ_{1≤𝑖≤𝑛} ln(𝐐(𝑋𝑖))
 = Σ_{1≤𝑖≤𝑛} 𝑆𝐐(𝑋𝑖)   (3.64)
We substitute the appropriate quantities given by the equations (3.63) and (3.64) into the equation (3.61). Finally, the energy functional and the corresponding variational free energy for a fully factored distribution 𝐐 are given by the following formula:
𝐹[𝐏̃, 𝐐] = −𝐹𝑀𝐹[𝐐] = Σ_{Xc∈CG} Σ_{xc} (Π_{𝑋𝑖∈Xc} 𝐐(𝑥𝑖)) ∙ ln(𝜙𝑐(xc)) + Σ_{1≤𝑖≤𝑛} 𝑆𝐐(𝑋𝑖)   (3.65)
where 𝐹𝑀𝐹[𝐐] is the mean field free energy.
The formula (3.65) shows that the energy functional for a fully factored distribution can be written as a sum of expectations, each expectation being defined over a small set of variables; each such set corresponds to a clique potential 𝜙𝑐(Xc) in 𝐏. The complexity of evaluating this form of the energy functional depends on the size of the factors 𝜙𝑐(Xc) in 𝐏 and not on the topology of the Markov network. Thus, the energy functional can be represented and manipulated effectively, even in Markov networks that would require exponential time for exact inference.
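The decomposition above can be verified on a toy pairwise network (hypothetical potentials, chosen only for illustration) by evaluating the energy functional twice: once by brute-force expectation over all joint states, and once clique-by-clique using only the single-node marginals of 𝐐:

```python
import itertools
import math

# Hypothetical pairwise network on 3 binary nodes: one potential table per edge clique
phi = {(0, 1): {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 0.5, (1, 1): 3.0},
       (1, 2): {(0, 0): 1.5, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 0.8}}
q = [0.7, 0.4, 0.6]                      # factored marginals Q(X_i = 1)

def q1(i, xi):                           # single-node marginal Q(x_i)
    return q[i] if xi else 1 - q[i]

# brute force: F[P~,Q] = E_Q[sum_c ln phi_c] + S_Q, summing over all 2^3 joint states
brute = 0.0
for x in itertools.product([0, 1], repeat=3):
    Qx = math.prod(q1(i, x[i]) for i in range(3))
    brute += Qx * (sum(math.log(t[(x[i], x[j])]) for (i, j), t in phi.items())
                   - math.log(Qx))

# clique-local evaluation: each expectation needs only the clique's own marginals,
# and the entropy decomposes into single-node entropies
local = sum(q1(i, a) * q1(j, b) * math.log(t[(a, b)])
            for (i, j), t in phi.items() for (a, b) in t)
local += -sum(q1(i, 0) * math.log(q1(i, 0)) + q1(i, 1) * math.log(q1(i, 1))
              for i in range(3))

assert abs(brute - local) < 1e-12
```

The clique-local form touches each potential entry once, which is what makes the energy functional cheap to evaluate even when the joint state space is exponentially large.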
3.4.2 Maximizing the energy functional: fixed–point characterization
In Section 3.2 we showed that, instead of searching for a good approximation 𝐐 of the true
probability 𝐏, we could use a variational approach to either maximize the energy functional or
minimize either the corresponding variational free energy or the KL–divergence. Each of these
approaches transforms the original problem – approximate inference in a Markov random field –
into an optimization problem. An interesting aspect of these optimization problems is the fact
that, instead of approximating the objective, they approximate the optimization space. This is done by starting with a class of distributions:
ℚ = {𝐐𝐢 = 𝐐(𝑋𝑖) ∶ 1 ≤ 𝑖 ≤ 𝑛}   (3.66)
that generally doesn’t contain the true distribution 𝐏. Then, the distribution of this class that complies with the type of optimization performed and with the imposed constraints represents an approximation of the true probability of the underlying Markov network.
A formal description of the optimization problem “maximization of energy functional” follows.
Problem Mean Field Approximation:
Given: ℚ = {𝐐𝐢 = 𝐐(𝑋𝑖) ∶ 1 ≤ 𝑖 ≤ 𝑛}
Find: 𝐐 ∈ ℚ
by maximizing: 𝐹[𝐏̃, 𝐐]
subject to: 𝐐(𝑋1, … , 𝑋𝑛) = Π_𝑖 𝐐(𝑋𝑖) and Σ_{𝑥𝑖} 𝐐(𝑥𝑖) = 1 for all 𝑖
The following theorem and corollaries provide a set of fixed–point equations that characterize
the stationary points of Mean Field Approximation. These theoretical results are taken from
[54] and adapted to our notations and conventions. We provide the proofs only for Theorem 3.2
and Corollary 3.5. We also provide our interpretation of these theoretical results.
Theorem 3.2:
The marginal distribution 𝐐(𝑋𝑖) is a local maximum of Mean Field Approximation given
{𝐐(𝑋𝑗)}𝑗≠𝑖 if and only if:
𝐐(𝑥𝑖) = (1/𝑍𝑖) ∙ exp{Σ_{𝜙𝑐∈ΦG} 𝐄X~𝐐[ln 𝜙𝑐 | 𝑥𝑖]}
or, equivalently:   (3.67)
𝐐(𝑥𝑖) = (1/𝑍𝑖) ∙ exp{Σ_{Xc∈CG} 𝐄X~𝐐[ln(𝜙𝑐(Xc)) | 𝑥𝑖]}
where 𝑍𝑖 is a local normalizing constant and 𝐄X~𝐐[ln 𝜙𝑐 | 𝑥𝑖] is the conditional expectation for a given value 𝑥𝑖 of 𝑋𝑖:
𝐄X~𝐐[ln 𝜙𝑐 | 𝑥𝑖] = Σ_{xc} 𝐐(xc | 𝑥𝑖) ∙ ln(𝜙𝑐(xc))
where the sum runs over all assignments xc to Xc.
Proof: The proof of this theorem relies on proving the fixed–point characterization of the
individual marginal 𝐐(𝑋𝑖) in terms of the other components 𝐐(𝑋1),…, 𝐐(𝑋𝑖−1), 𝐐(𝑋𝑖+1),…, 𝐐(𝑋𝑛)
as specified in the equation (3.67).
We first consider the restriction of the objective 𝐹[𝐏̃, 𝐐] to those terms that involve 𝐐(𝑋𝑖):
𝐹𝑖[𝐐] = Σ_{Xc∈CG, 𝑋𝑖∈Xc} 𝐄Xc~𝐐[ln(𝜙𝑐(Xc))] + 𝑆𝐐(𝑋𝑖)   (3.68)
To optimize 𝐐(𝑋𝑖), we define the Lagrangian that consists of all the terms in 𝐹[𝐏̃, 𝐐] that involve 𝐐(𝑋𝑖):
𝐿𝑖[𝐐] = Σ_{Xc∈CG, 𝑋𝑖∈Xc} 𝐄Xc~𝐐[ln(𝜙𝑐(Xc))] + 𝑆𝐐(𝑋𝑖) + 𝜆 ∙ (Σ_{𝑥𝑖} 𝐐(𝑥𝑖) − 1)
where 𝜆 is a Lagrange multiplier that corresponds to the constraint that 𝐐(𝑋𝑖) is a distribution.
We now take derivatives with respect to 𝐐(𝑥𝑖). The following result, whose proof we do not
provide, plays an important role in the remainder of the derivation.
Lemma 3.3:
If 𝐐(X) = Π_𝑖 𝐐(𝑋𝑖), then for any function 𝑓 with scope 𝒰:
𝜕𝐄𝒰~𝐐[𝑓(𝒰)] / 𝜕𝐐(𝑥𝑖) = 𝐄𝒰~𝐐[𝑓(𝒰) | 𝑥𝑖]
Using Lemma 3.3 and standard derivatives of entropies, we see that:
𝜕𝐿𝑖/𝜕𝐐(𝑥𝑖) = Σ_{Xc∈CG} 𝐄X~𝐐[ln(𝜙𝑐(Xc)) | 𝑥𝑖] − ln 𝐐(𝑥𝑖) − 1 + 𝜆
Setting the derivative to 0 and rearranging terms we get that:
ln 𝐐(𝑥𝑖) = 𝜆 − 1 + Σ_{Xc∈CG} 𝐄X~𝐐[ln(𝜙𝑐(Xc)) | 𝑥𝑖]
We take exponents of both sides and renormalize; because 𝜆 is constant relative to 𝑥𝑖, it drops
out in the renormalization, so that we obtain the formula (3.67).
Theorem 3.2 shows only that the solution of the equation (3.67) is a stationary point of the objective (3.68). To prove that it is a maximum, we note that the equation (3.68) is a sum of two terms: Σ_{Xc∈CG} 𝐄X~𝐐[ln(𝜙𝑐(Xc))], which is linear in 𝐐(𝑋𝑖) given all the other components 𝐐(𝑋𝑗), 𝑗 ≠ 𝑖; and 𝑆𝐐(𝑋𝑖), which is a concave function of 𝐐(𝑋𝑖). As a whole, given the other components of 𝐐, the function 𝐹𝑖 is concave in 𝐐(𝑋𝑖) and, therefore, has a unique global optimum, which is easily verified to be the solution of the equation (3.67) rather than any of the extremal points [54].
The following two corollaries help characterize the stationary points of Mean Field
Approximation.
Corollary 3.4:
The distribution 𝐐 is a stationary point of Mean Field Approximation if and only if, for each 𝑋𝑖,
the equation (3.67) holds.
Corollary 3.5:
In the mean field approximation, 𝐐(𝑥𝑖) is locally optimal only if:
𝐐(𝑥𝑖) = (1/𝑍𝑖) ∙ exp{𝐄X−i~𝐐[ln 𝐏(𝑥𝑖 | X−i)]}   (3.69)
where 𝑍𝑖 is a normalizing constant.
Proof: We recall that 𝐏̃ = Π_{𝜙𝑐∈ΦG} 𝜙𝑐 = Π_{Xc∈CG} 𝜙𝑐(Xc) is the unnormalized measure defined by ΦG and CG. Due to the linearity of expectation we have:
Σ_{𝜙𝑐∈ΦG} 𝐄X~𝐐[ln 𝜙𝑐 | 𝑥𝑖] = 𝐄X~𝐐[ln 𝐏̃(𝑋𝑖, X−i) | 𝑥𝑖]
Because 𝐐 is a product of marginals, we can rewrite 𝐐(X−i | 𝑥𝑖) as 𝐐(X−i) and get that:
𝐄X~𝐐[ln 𝐏̃(𝑋𝑖, X−i) | 𝑥𝑖] = 𝐄X−i~𝐐[ln 𝐏̃(𝑥𝑖, X−i)]
Using properties of conditional distributions, it follows that:
𝐏̃(𝑥𝑖, X−i) = 𝑍 ∙ 𝐏(𝑥𝑖, X−i) = 𝑍 ∙ 𝐏(X−i) ∙ 𝐏(𝑥𝑖 | X−i)
We conclude that:
Σ_{𝜙𝑐∈ΦG} 𝐄X~𝐐[ln 𝜙𝑐 | 𝑥𝑖] = 𝐄X−i~𝐐[ln 𝐏(𝑥𝑖 | X−i)] + 𝐄X−i~𝐐[ln(𝐏(X−i) ∙ 𝑍)]
Plugging this equality into the update equation (3.67) we get that:
𝐐(𝑥𝑖) = (1/𝑍𝑖) ∙ exp{𝐄X−i~𝐐[ln 𝐏(𝑥𝑖 | X−i)]} ∙ exp{𝐄X−i~𝐐[ln(𝐏(X−i) ∙ 𝑍)]}
The term ln(𝐏(X−i) ∙ 𝑍) does not depend on the value of 𝑥𝑖. Moreover, multiplying a marginal by a constant factor does not change the joint distribution: since the distribution is renormalized at the end to sum to 1, the constant is absorbed into the normalizing function. Therefore, the constant term can simply be ignored, and the formula we obtain is exactly the formula (3.69).
Corollary 3.5 shows that the marginal of 𝑋𝑖 in 𝐐, i.e., 𝐐(𝑋𝑖), is the geometric average of the
conditional probability of 𝑥𝑖 given all other variables in the domain. The average is based on the
probability that 𝐐 assigns to all possible assignments to the variables in the domain. In this
sense, the mean field approximation requires that the marginal of 𝑋𝑖 be “consistent” with the
marginals of other variables [54].
Comparatively, the marginal of 𝑋𝑖 in 𝐏 can be represented as an arithmetic average:
𝐏(𝑥𝑖) = Σ_{𝑥−𝑖} 𝐏(𝑥−i) ∙ 𝐏(𝑥𝑖 | 𝑥−i) = 𝐄X−i~𝐏[𝐏(𝑥𝑖 | X−i)]   (3.70)
In general, the geometric average tends to produce marginals that are more sharply peaked than the original marginals in 𝐏. More significantly, the expectation in the equation (3.69) is relative to the approximating distribution 𝐐, while the expectation in the equation (3.70) is relative to the true distribution 𝐏. This should not be interpreted as meaning that 𝐐 approximates the marginals of 𝐏 well [54].
3.4.3 Maximizing the energy functional: the naïve mean field algorithm
We start by observing that, if a clique with potential function 𝜙𝑐 and set of nodes Xc doesn’t contain the variable 𝑋𝑖, i.e., 𝑋𝑖 ∉ Xc, then:
𝐄X~𝐐[ln(𝜙𝑐(Xc)) | 𝑥𝑖] = 𝐄Xc~𝐐[ln(𝜙𝑐(Xc))] when 𝑋𝑖 ∉ Xc   (3.71)
Hence, expectation terms on such factors are independent of the value of 𝑋𝑖. Consequently, we can absorb them into the normalization constant 𝑍𝑖 and obtain the following simplification.
Corollary 3.6:
In the mean field approximation 𝐐(𝑥𝑖) is locally optimal only if:
𝐐(𝑥𝑖) = (1/𝑍𝑖) ∙ exp{Σ_{Xc∈CG, 𝑋𝑖∈Xc} 𝐄Xc−{𝑋𝑖}~𝐐[ln(𝜙𝑐(Xc − {𝑋𝑖}, 𝑥𝑖))]}   (3.72)
where 𝑍𝑖 is the normalization constant.
The equation (3.72) shows that 𝐐(𝑥𝑖) has to be consistent with the expectation of the potentials in which it appears. The characterization of Corollary 3.6 is very useful for converting the fixed–point equations (3.67) into an algorithm that maximizes 𝐹[𝐏̃, 𝐐]. All the terms on the right–hand side of the equation (3.72) involve expectations of variables other than 𝑋𝑖 and do not depend on the choice of 𝐐(𝑋𝑖). We can achieve equality simply by evaluating the exponential terms for each value 𝑥𝑖, normalizing the results to sum to 1, and then assigning them to 𝐐(𝑋𝑖). As a consequence, we reach the optimal value of 𝐐(𝑋𝑖) in one step.
The last statement should be interpreted with some care. The resulting value for 𝐐(𝑋𝑖) is its
optimal value given the choice of all other marginals. Thus, this step optimizes the function
𝐹[��, 𝐐] relative only to one single coordinate in the space – the marginal of 𝐐(𝑋𝑖). To optimize
the function in its entirety, we need to optimize relative to all the coordinates. We can embed
this step in an iterated coordinate ascent algorithm, which repeatedly maximizes a single
marginal at a time, given fixed choices to all of the others. The result is Algorithm 3.1.
Algorithm 3.1: Naïve Mean Field Approximation
Given: CG, ΦG, 𝐐𝟎 // the initial choice of 𝐐
begin
Step 1: 𝐐 ← 𝐐𝟎
Step 2: 𝑢𝑛𝑝𝑟𝑜𝑐𝑒𝑠𝑠𝑒𝑑 ← X = (𝑋1, 𝑋2, … , 𝑋𝑛)
Step 3: while 𝑢𝑛𝑝𝑟𝑜𝑐𝑒𝑠𝑠𝑒𝑑 ≠ ∅ do
Step 4:   choose 𝑋𝑖 from 𝑢𝑛𝑝𝑟𝑜𝑐𝑒𝑠𝑠𝑒𝑑
Step 5:   𝐐𝐨𝐥𝐝(𝑋𝑖) ← 𝐐(𝑋𝑖)
Step 6:   for 𝑥𝑖 ∈ val(𝑋𝑖) do // iterate over all possible values of the random variable 𝑋𝑖
Step 7:     𝐐(𝑥𝑖) ← exp(Σ_{Xc∈CG, 𝑋𝑖∈Xc} 𝐄Xc−{𝑋𝑖}~𝐐[ln 𝜙𝑐(Xc − {𝑋𝑖}, 𝑥𝑖)])
          end for // 𝑥𝑖
Step 8:   normalize 𝐐(𝑋𝑖) to sum to 1
Step 9:   if 𝐐𝐨𝐥𝐝(𝑋𝑖) ≠ 𝐐(𝑋𝑖) then
Step 10:    𝑢𝑛𝑝𝑟𝑜𝑐𝑒𝑠𝑠𝑒𝑑 ← 𝑢𝑛𝑝𝑟𝑜𝑐𝑒𝑠𝑠𝑒𝑑 ∪ (⋃_{Xc∈CG, 𝑋𝑖∈Xc} Xc)
Step 11:  𝑢𝑛𝑝𝑟𝑜𝑐𝑒𝑠𝑠𝑒𝑑 ← 𝑢𝑛𝑝𝑟𝑜𝑐𝑒𝑠𝑠𝑒𝑑 − {𝑋𝑖}
          end while
          return 𝐐
end
Importantly, a single optimization doesn’t usually suffice; a subsequent modification to another
marginal 𝐐(𝑋𝑗) may result in a different optimal parameterization for 𝐐(𝑋𝑖). Therefore, the
algorithm repeats these steps until convergence. A key property of the coordinate ascent
procedure is that each step leads to an increase in the energy functional. Hence, any iteration of
Algorithm 3.1 results in a better approximation of the true distribution 𝐏.
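For a binary pairwise network, the coordinate updates of Algorithm 3.1 take a particularly simple closed form: with ln 𝜙𝑖𝑗 = 𝑤𝑖𝑗∙𝑥𝑖∙𝑥𝑗 and ln 𝜙𝑖 = −𝜃𝑖∙𝑥𝑖, the update (3.72) reduces to a logistic function of the local field. The sketch below (hypothetical weights; a simplified round-robin sweep instead of the unprocessed-set bookkeeping of Algorithm 3.1) runs the coordinate ascent and checks that each sweep does not decrease the energy functional:

```python
import math

# Hypothetical pairwise model on {0,1}: ln phi_ij = w_ij*xi*xj, ln phi_i = -theta_i*xi
w = {(0, 1): 1.2, (1, 2): -0.7, (2, 3): 0.9}
theta = [0.1, -0.4, 0.3, 0.0]
n = len(theta)
nbrs = {i: [] for i in range(n)}
for (i, j), wij in w.items():
    nbrs[i].append((j, wij))
    nbrs[j].append((i, wij))

q = [0.5] * n                                     # Q(X_i = 1); the initial choice Q0

def energy_functional(q):
    """F[P~,Q] = E_Q[ln P~] + S_Q for the fully factored Q, cf. eq. (3.65)."""
    F = (sum(wij * q[i] * q[j] for (i, j), wij in w.items())
         - sum(t * qi for t, qi in zip(theta, q)))
    for qi in q:                                  # entropy decomposes per node, eq. (3.64)
        for p in (qi, 1 - qi):
            if p > 0:
                F -= p * math.log(p)
    return F

prev = energy_functional(q)
for sweep in range(100):
    for i in range(n):                            # coordinate ascent via update (3.72)
        field = -theta[i] + sum(wij * q[j] for j, wij in nbrs[i])
        q[i] = 1 / (1 + math.exp(-field))         # normalized exp of the two local values
    cur = energy_functional(q)
    assert cur >= prev - 1e-12                    # each sweep cannot decrease F[P~,Q]
    if abs(cur - prev) < 1e-10:
        break
    prev = cur
```

At convergence the marginals satisfy the fixed-point equations: each `q[i]` equals the logistic function of its own local field.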
Theorem 3.7:
Algorithm 3.1 is guaranteed to converge. Moreover, the distribution returned by the algorithm is a stationary point of 𝐹[𝐏̃, 𝐐], subject to the constraint that 𝐐(X) = Π_{1≤𝑖≤𝑛} 𝐐(𝑋𝑖) is a distribution.
The proof of Theorem 3.7 is outside the scope of this paper; it can be found in [54]. The distribution returned by Algorithm 3.1 is a stationary point of the energy functional, so in principle it could be a local maximum, a local minimum, or a saddle point. However, local minima and saddle points are not stable convergence points of the algorithm, in the sense that a small perturbation of 𝐐 followed by optimization will lead to a better convergence point [54]. Because the algorithm is unlikely to land precisely on such an unstable point and remain stuck there, in practice the convergence points of the algorithm are local maxima, though not necessarily global maxima [54].
3.5 Bethe approximation
The Bethe approximation is an approximation of the free energy similar in spirit to the mean field approximation. It reduces the problem of computing the partition function in a Markov random field to that of solving a set of non–linear equations – the Bethe fixed–point equations [58]. In this section we start by introducing the Bethe free energy and its
fixed–point equations [58]. In this section we start by introducing the Bethe free energy and its
“close relative” the Bethe–Gibbs free energy. Then, we describe briefly the belief propagation
(BP) algorithm and we present the theoretical result due to Yedidia et al. [28,35] that establishes
the connection between BP fixed–points and Bethe free energy. Because this result is
considered fundamental for approximate inference research, we include its original proof. We
end this section by introducing a new approximate inference algorithm, due to Welling and Teh
[4], which is based on the Bethe free energy and called belief optimization (BO).
We assume that we are given a pairwise Markov network with binary variables described by the
equations (2.11) to (2.13) and (2.15). We rewrite the equation (2.11) by taking into consideration
the fact that the potential functions, given by the equation (2.15), are either node potentials or
edge potentials:
𝐏(X) = (1/𝑍) ∙ Π_{Xc∈CG} 𝜙𝑐(Xc) = (1/𝑍) ∙ Π_𝑖 𝜙𝑖(𝑋i) ∙ Π_{𝑖,𝑗} 𝜙𝑖𝑗(𝑋i, 𝑋𝑗)   (3.73)
where: 𝜙𝑖(𝑋i) is the local “evidence” for node 𝑖; 𝜙𝑖𝑗(𝑋i, 𝑋𝑗) is the clique potential that corresponds to the edge connecting the nodes 𝑖 and 𝑗; and 𝑍 is the partition function. Any fixed evidence node is subsumed into our definition of 𝜙𝑖(𝑋i).
We denote by 𝑝𝑖 the marginal probabilities over singleton variables and by 𝑝𝑖𝑗 the pairwise
marginal probabilities over pairs of variables that correspond to edges in the underlying graph:
𝑝𝑖 = MARG(𝐏, 𝑋𝑖) = Σ_{X−i} 𝐏(𝑋1, … , 𝑋𝑖, … , 𝑋𝑛)   (3.74)
𝑝𝑖𝑗 = MARG(𝐏, 𝑋𝑖𝑋𝑗) = Σ_{X−{𝑋𝑖,𝑋𝑗}} 𝐏(𝑋1, … , 𝑋𝑖, … , 𝑋𝑗, … , 𝑋𝑛)
The majority of theoretical results presented in this section come from [4, 28-29, 35-36].
3.5.1 The Bethe free energy
The original formula for the Bethe free energy, proposed by Yedidia et al. in [28-29,35-36], relies
on a minimal canonical representation for the Markov network. An alternative form of the Bethe
free energy that relies on the mean parameterization of the Markov network was introduced by
Wainwright et al. in [47].
The Bethe free energy is the Gibbs free energy obtained by truncating the Plefka expansion of
the free energy (equation (3.46)) in the second order and minimizing it with respect to single
node marginals and pairwise marginals. Unlike the mean field free energy, which depends only
on approximate marginals at single nodes, the Bethe free energy depends on approximate
marginals at single nodes as well as approximate marginals on edges.
To understand the relationship between the Bethe free energy and the mean field free energy,
we define a “close relative” of both by imposing additional constraints on the Bethe free energy.
The Gibbs free energy resulted from this additional optimization is called the Bethe–Gibbs free
energy. To distinguish between the Bethe free energy and the Bethe–Gibbs free energy, we
denote the former by 𝒢𝛽 and the latter by 𝐺𝛽.
We assume that we work under a set of hypotheses similar to the ones described by (3.47) to (3.52), except that Y = X. The constraints are represented by the set of all 𝑝𝑖 and the set of all 𝑝𝑖𝑗 defined in (3.74). Formally, the Bethe free energy is defined as:
𝒢𝛽[{𝑝𝑖}, {𝑝𝑖𝑗}] = min𝐐{𝐹[𝐐] ∶ MARG(𝐐, 𝑋𝑖) = 𝑝𝑖 and MARG(𝐐, 𝑋𝑖𝑋𝑗) = 𝑝𝑖𝑗} (3.75)
where MARG(𝐐, 𝑋𝑖) denotes the singleton marginal probability of 𝐐 with respect to 𝑋𝑖 and
MARG(𝐐, 𝑋𝑖𝑋𝑗) denotes the pairwise marginal probability of 𝐐 with respect to the variables 𝑋𝑖
and 𝑋𝑗, whose corresponding nodes are connected in the Markov network.
In order to compute the Bethe free energy of a binary pairwise Markov network, in this section
we follow Approach 2 described in Section 3.3.2. As previously mentioned, we can assume
that the constant pseudo–temperature at equilibrium is 1. Therefore:
𝒢𝛽[𝐏(𝑋1, … , 𝑋𝑛)] = 𝑈𝛽 − 𝑆𝛽 (3.76)
In Section 2.3 we learned that the energy of a pairwise Markov network is a quadratic function of the states, so the internal energy given by the formula (2.18) can be computed exactly in terms of {𝑝𝑖} and {𝑝𝑖𝑗} [4]:
𝑈𝛽 = 𝐄𝐏[𝐸[{𝑝𝑖}, {𝑝𝑖𝑗}]] = 𝐸[{𝑝𝑖}, {𝑝𝑖𝑗}]   (3.77)
where:
𝐸[{𝑝𝑖}, {𝑝𝑖𝑗}] = −Σ_{{𝑖,𝑗}} 𝑝𝑖𝑗 ∙ 𝑤𝑖𝑗 + Σ_𝑖 𝑝𝑖 ∙ 𝜃𝑖
This means that the computation of the Bethe free energy (3.76) requires an approximation only
for the entropy term (equation (2.19)). The idea is that we want to correct the mean field
approximation which overestimates the entropy due to its assumption that all nodes are
independent. The natural next step is to take pairwise dependencies into account. But just
adding all pairwise entropy contributions to the mean field approximation would clearly over–
count the entropy contributions at the nodes. Correcting for this over–counting gives the
following approximation to the entropy [4]:
𝑆𝛽[{𝑝𝑖}, {𝑝𝑖𝑗}] = Σ_𝑖 𝑆𝑖 + Σ_{{𝑖,𝑗}} (𝑆𝑖𝑗 − 𝑆𝑖 − 𝑆𝑗) = Σ_𝑖 𝑆𝑖 ∙ (1 − 𝑑𝑖) + Σ_{{𝑖,𝑗}} 𝑆𝑖𝑗   (3.78)
where: 𝑑𝑖 is the degree of node 𝑖, i.e., the number of neighbors of node 𝑖; 𝑆𝑖 is the mean field
entropy for node 𝑖; and 𝑆𝑖𝑗 is the pairwise entropy. The mean field entropy 𝑆𝑖 and the pairwise
entropy 𝑆𝑖𝑗 can be written as:
S_i = -\left(p_i \cdot \ln p_i + (1 - p_i) \cdot \ln(1 - p_i)\right) \quad (3.79)

S_{ij} = -\left(p_{ij} \cdot \ln p_{ij} + (p_{ij} + 1 - p_i - p_j) \cdot \ln(p_{ij} + 1 - p_i - p_j) + (p_i - p_{ij}) \cdot \ln(p_i - p_{ij}) + (p_j - p_{ij}) \cdot \ln(p_j - p_{ij})\right) \quad (3.80)
The Bethe free energy is obtained by substituting (3.77) and (3.79)–(3.80) into (2.21):
\mathcal{G}_\beta[\{p_i\},\{p_{ij}\}] = E[\{p_i\},\{p_{ij}\}] - S_\beta[\{p_i\},\{p_{ij}\}] \quad (3.81)

\mathcal{G}_\beta = -\sum_{\{i,j\}} p_{ij} \cdot w_{ij} + \sum_{i} p_i \cdot \theta_i + \sum_{i} (1 - d_i) \cdot \left(p_i \cdot \ln p_i + (1 - p_i) \cdot \ln(1 - p_i)\right) + \sum_{\{i,j\}} \left(p_{ij} \cdot \ln p_{ij} + (p_{ij} + 1 - p_i - p_j) \cdot \ln(p_{ij} + 1 - p_i - p_j) + (p_i - p_{ij}) \cdot \ln(p_i - p_{ij}) + (p_j - p_{ij}) \cdot \ln(p_j - p_{ij})\right) \quad (3.82)
The expression (3.78) for the entropy is exact when the underlying graph is a tree. Since the
expression (3.77) for the energy is exact for general Boltzmann machines, it is also exact on
Boltzmann trees. Consequently, the Bethe free energy (3.82) is exact on trees [4]. If the
underlying graph has loops, then the distribution corresponding to the energy given by (3.82) is
not always a properly normalized probability distribution [4]. Therefore, the Bethe free energy is
not necessarily an upper bound for the true free energy 𝐹 [4]; in other words, it does not fall into
the category of variational free energies characterized by (3.36) and (3.37). So, when can we
expect the Bethe free energy to be a good approximation of the free energy of the system? The
above argument suggests that this should be the case when the graph is “close to a tree”, i.e.,
when the graph contains few short loops. If the underlying graph has tight loops, evidence
impinging on one node can travel around these loops and return to the original node,
causing it to be over–counted [4].
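The pieces above — the exact energy (3.77), the corrected entropy (3.78)–(3.80), and their combination in (3.82) — can be evaluated directly. A minimal Python sketch, on a hypothetical small network whose marginals and parameters are given as dictionaries (the data layout is illustrative, not prescribed by the text):

```python
import math

def bethe_free_energy(w, theta, p, pij, neighbors):
    """Bethe free energy (3.82) of a binary pairwise Markov network.

    w[(i, j)]    : pairwise weight for edge {i, j}      -- illustrative layout
    theta[i]     : threshold of unit i
    p[i]         : singleton marginal p_i = P(X_i = 1)
    pij[(i, j)]  : pairwise marginal p_ij = P(X_i = 1, X_j = 1)
    neighbors[i] : set of neighbors of node i (degree d_i)
    """
    xlogx = lambda x: x * math.log(x) if x > 0 else 0.0
    # internal energy (3.77): exact in terms of {p_i}, {p_ij}
    energy = -sum(pij[e] * w[e] for e in w) + sum(p[i] * theta[i] for i in p)
    # entropy (3.78): mean field terms (3.79) weighted by (1 - d_i) ...
    s_single = {i: -(xlogx(p[i]) + xlogx(1 - p[i])) for i in p}
    entropy = sum((1 - len(neighbors[i])) * s_single[i] for i in p)
    # ... plus pairwise entropies (3.80)
    for (i, j) in w:
        q = pij[(i, j)]
        entropy -= (xlogx(q) + xlogx(q + 1 - p[i] - p[j])
                    + xlogx(p[i] - q) + xlogx(p[j] - q))
    return energy - entropy   # (3.76): G_beta = U_beta - S_beta

# single edge, zero parameters, independent uniform marginals:
# a tree, so the Bethe free energy is exact (-ln Z = -ln 4)
g = bethe_free_energy({(0, 1): 0.0}, {0: 0.0, 1: 0.0},
                      {0: 0.5, 1: 0.5}, {(0, 1): 0.25},
                      {0: {1}, 1: {0}})
```

On this one-edge example (a tree) the returned value coincides with the exact free energy, as the section states.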
3.5.2 The Bethe–Gibbs free energy
In order to improve the approximation of the free energy, the Bethe free energy has been
studied in connection with a well–known free energy: the mean field free energy. Welling and
Teh proved that the mean field free energy is a small weight expansion of the Bethe free energy
[4], which suggests that the Bethe free energy should be accurate for small weights and should
improve on the mean field energy [4]. In this section we use a different approach to explore the
relationship between the Bethe free energy and the mean field free energy: via the Bethe–Gibbs
free energy.
We recall the mean field approximation of the free energy (equation (3.65)) and we observe that
the expression of the entropy 𝑆𝐐(𝑋𝑖) is the same as the mean field entropy 𝑆𝑖 given by (3.79).
We also note that the independent marginal 𝐐(𝑋𝑖) corresponds to 𝑝𝑖 given by (3.74) and the
clique potential 𝜙𝑐(Xc) corresponds to 𝜙𝑖𝑗(𝑋i, 𝑋𝑗) given by (3.73). Hence, the mean field free
energy can be written as:
F_{MF}[\mathbf{Q}] = -\sum_{X_1,\dots,X_n} \left(\prod_{X_i \in \mathbf{X}_c} \mathbf{Q}(X_i)\right) \cdot \ln \phi_c(\mathbf{X}_c) - \sum_{1 \le i \le n} S_{\mathbf{Q}}(X_i)

F_{MF}(\{p_i\}) = -\sum_{i,j} p_i \cdot p_j \cdot \ln \phi_{ij}(X_i, X_j) + \sum_{i} \left(p_i \cdot \ln p_i + (1 - p_i) \cdot \ln(1 - p_i)\right) \quad (3.83)
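For comparison, the mean field free energy (3.83) depends only on the singleton marginals. A minimal sketch, under our own simplifying assumption (not made in the text) that ln φ_ij is supplied only for the state in which both variables are on:

```python
import math

def mean_field_free_energy(p, logphi):
    """Mean field free energy (3.83) of a binary pairwise Markov network.

    p[i]           : independent singleton marginal p_i = Q(X_i = 1)
    logphi[(i, j)] : ln(phi_ij) for the both-on state; we assume, as a
                     simplification, that phi_ij = 1 on the other states.
    """
    xlogx = lambda x: x * math.log(x) if x > 0 else 0.0
    energy = -sum(p[i] * p[j] * logphi[(i, j)] for (i, j) in logphi)
    # negative mean field entropy, summed over all nodes
    return energy + sum(xlogx(p[i]) + xlogx(1 - p[i]) for i in p)
```

With a trivial potential (logphi = 0) and uniform marginals this agrees with the Bethe value on the same one-edge network, since both are exact there.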
Next, we convert the Bethe free energy given by equations (3.81)–(3.82) into a more
constrained Gibbs free energy called the Bethe–Gibbs free energy. We do this by imposing
additional constraints on the Bethe free energy 𝒢𝛽: we minimize 𝒢𝛽 with respect to the
parameters {𝑝𝑖𝑗} and then solve for {𝑝𝑖𝑗} exactly in terms of the parameters {𝑝𝑖}.
The minimization is done, as usual, by taking the derivatives of the Bethe free energy with
respect to {𝑝𝑖𝑗} and setting them to zero:
\frac{\partial \mathcal{G}_\beta}{\partial p_{ij}} = -w_{ij} + \ln\left(\frac{p_{ij} \cdot (p_{ij} + 1 - p_i - p_j)}{(p_i - p_{ij}) \cdot (p_j - p_{ij})}\right) = 0 \quad (3.84)
This can be simplified to a quadratic equation:
\alpha_{ij} \cdot p_{ij}^2 - (1 + \alpha_{ij} \cdot p_i + \alpha_{ij} \cdot p_j) \cdot p_{ij} + (1 + \alpha_{ij}) \cdot p_i \cdot p_j = 0 \quad (3.85)
where we have defined:
\alpha_{ij} = \exp(w_{ij}) - 1 \quad (3.86)
In addition to this equation we have to make sure that 𝑝𝑖𝑗 satisfies the following bounds:
max(0, 𝑝𝑖 + 𝑝𝑗 − 1) ≤ 𝑝𝑖𝑗 ≤ min(𝑝𝑖, 𝑝𝑗) (3.87)
These bounds can be understood by noting that, by definition, none of the probabilities
𝑝𝑖𝑗, 𝑝𝑖 − 𝑝𝑖𝑗, 𝑝𝑗 − 𝑝𝑖𝑗, and 𝑝𝑖𝑗 + 1 − 𝑝𝑖 − 𝑝𝑗 appearing in (3.80) can be negative. The following
theorem guarantees the desired unique solution for {𝑝𝑖𝑗}.
62
Theorem 3.8:
There is exactly one solution to the quadratic equation (3.85) that minimizes the Bethe free
energy and satisfies the bounds (3.87). The analytic expression of this solution is:
p_{ij} = \frac{1}{2\,\alpha_{ij}} \left(Q_{ij} - \sqrt{Q_{ij}^2 - 4\,\alpha_{ij}\,(1 + \alpha_{ij})\, p_i\, p_j}\right), \quad \text{where: } Q_{ij} = 1 + \alpha_{ij} \cdot p_i + \alpha_{ij} \cdot p_j \quad (3.88)
Moreover, the parameters 𝑝𝑖𝑗 will never actually saturate one of the bounds.
The proof of this theorem is outside the scope of this paper. A proof can be found in [4].
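The closed form (3.86)–(3.88) is easy to evaluate numerically. A sketch; the guard for 𝑤𝑖𝑗 ≈ 0, where 𝛼𝑖𝑗 vanishes and 𝑝𝑖𝑗 reduces to the independent product 𝑝𝑖 ∙ 𝑝𝑗, is our own addition to avoid division by zero:

```python
import math

def pairwise_marginal(p_i, p_j, w_ij):
    """Unique admissible root (3.88) of the quadratic (3.85)."""
    a = math.exp(w_ij) - 1.0              # alpha_ij, (3.86)
    if abs(a) < 1e-12:                    # w_ij ~ 0: independent marginals
        return p_i * p_j
    q = 1.0 + a * (p_i + p_j)             # Q_ij in (3.88)
    pij = (q - math.sqrt(q * q - 4.0 * a * (1.0 + a) * p_i * p_j)) / (2.0 * a)
    # per Theorem 3.8 the root lies strictly inside the bounds (3.87)
    assert max(0.0, p_i + p_j - 1.0) < pij < min(p_i, p_j)
    return pij
```

A quick sanity check: substituting the returned 𝑝𝑖𝑗 back into (3.84) recovers exp(𝑤𝑖𝑗) for the log-ratio, confirming stationarity.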
Thus, by inserting the expression for {𝑝𝑖𝑗} given by (3.88) into the Bethe free energy given by
(3.82), we obtain the analytic expression of the Bethe–Gibbs free energy 𝐺𝛽 (also called the
Gibbs free energy in the Bethe approximation). Rather than the full analytic expression for
𝐺𝛽[{𝑝𝑖}], we give a simpler expression that highlights how 𝐺𝛽 depends on {𝑝𝑖} and how that
dependency arises:
G_\beta[\{p_i\}] = \min_{\{p_{ij}\}} \mathcal{G}_\beta[\{p_i\},\{p_{ij}\}] = \mathcal{G}_\beta[\{p_i\},\{p_{ij}(p_i, p_j)\}] \quad (3.89)
We observe that the mean field free energy 𝐹𝑀𝐹({𝑝𝑖}) given by (3.83) and the Bethe–Gibbs free
energy 𝐺𝛽[{𝑝𝑖}] given by (3.89) are similar in spirit, so they might behave similarly in
approximate inference algorithms concerned with singleton marginals. In Section 3.5.4 we will
elaborate upon this topic.
3.5.3 The relationship between belief propagation fixed–points and Bethe free energy
The belief propagation (BP) or the sum–product algorithm is an efficient local message passing
algorithm for exact inference on trees or, generally, on graphs without cycles. The BP algorithm
is guaranteed to converge to the correct marginal posterior probabilities in tree–like graphical
models. The BP algorithm applied to a graph with loops is called loopy belief propagation (LBP).
The LBP algorithm remains well–defined and, in some cases, gives good approximate answers,
while in other cases gives poor results or fails to converge.
Yedidia, Freeman, and Weiss [28,35] established that, in a factor graph (see Definition 2.6),
there is a one–to–one correspondence between the fixed–points of BP algorithms and the
stationary points of the Bethe free energy. They also showed that the BP algorithms can only
converge to a fixed–point that is also a stationary point of the Bethe approximation to the free
energy. Their discovery, which has been heralded as a major breakthrough for belief
propagation in general, not only clarified the nature of the Bethe approximation of the free
energy, but also opened the way to construct more sophisticated message passing algorithms
based on improvements made to Bethe’s approximation.
The theoretical result of Yedidia et al. [28,35] is applicable not only to factor graphs, but to all
types of graphical models. The justification of this statement relies on two facts. The first fact is
that all types of graphical models have the following property: they can be converted, before
doing inference, into a pairwise Markov network, through a suitable clustering of nodes into
larger nodes [54]. The second fact is that the pairwise Markov network a factor graph is
converted into and the factor graph itself have the same joint probability distribution [59].
Therefore, without loss of generality, we can use a pairwise Markov network as the underlying
graphical model for the BP algorithm. In this section we give only a brief description of BP;
detailed presentations of BP can be found in [59-60]. We assume that we work under the
hypotheses and notations given by (3.73). We use the standard set of notations for BP
algorithms and, when applicable, we provide the correspondence to our own notation. We use
the notation 𝑑𝑖 for the degree of node 𝑖.
The standard BP update rules are applicable to the message that node 𝑖 sends to node 𝑗
denoted 𝑚𝑖𝑗 and to the belief of node 𝑖 denoted 𝑏𝑖:
m_{ij}(X_j) \leftarrow \alpha \cdot \sum_{X_i} \phi_{ij}(X_i, X_j) \cdot \phi_i(X_i) \cdot \prod_{k \in \mathrm{ne}(i) \setminus \{j\}} m_{ki}(X_i) \quad (3.90)

b_i(X_i) \leftarrow \alpha \cdot \phi_i(X_i) \cdot \prod_{k \in \mathrm{ne}(i)} m_{ki}(X_i) \quad (3.91)
where 𝛼 denotes a normalization constant and ne(𝑖) denotes the Markov blanket of node 𝑖.
The belief 𝑏𝑖(𝑋𝑖) is obtained by multiplying all incoming messages to node 𝑖 by the local
evidence. If the belief 𝑏𝑖(𝑋𝑖) is normalized, then it approximates the marginal probability
𝑝𝑖 = 𝑝𝑖(𝑋𝑖) given by (3.74).
The belief 𝑏𝑖𝑗(𝑋i, 𝑋𝑗) at the pair of connected nodes 𝑋i and 𝑋𝑗 is defined as the product of the
local potentials and all incoming messages to the pair of nodes:
b_{ij}(X_i, X_j) \leftarrow \alpha \cdot \psi_{ij}(X_i, X_j) \cdot \prod_{k \in \mathrm{ne}(i) \setminus \{j\}} m_{ki}(X_i) \cdot \prod_{l \in \mathrm{ne}(j) \setminus \{i\}} m_{lj}(X_j) \quad (3.92)

where: \psi_{ij}(X_i, X_j) \equiv \phi_{ij}(X_i, X_j) \cdot \phi_i(X_i) \cdot \phi_j(X_j) \quad (3.93)
If the belief 𝑏𝑖𝑗(𝑋i, 𝑋𝑗) is normalized, then it approximates the marginal probability 𝑝𝑖𝑗 = 𝑝𝑖𝑗(𝑋𝑖𝑋𝑗)
given by (3.74). Generally the beliefs 𝑏𝑖(𝑋𝑖) and 𝑏𝑖𝑗(𝑋i, 𝑋𝑗) are approximate marginals. However,
they become the exact marginals when the underlying graph contains no cycles [13].
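The update rules (3.90)–(3.91) can be sketched for binary states as follows; the data layout (dictionaries of local and pairwise potentials) and the fixed sweep count are our own choices, not prescribed by the text:

```python
def loopy_bp(phi, phi_pair, edges, n, sweeps=50):
    """Parallel message updates (3.90) and beliefs (3.91) on a binary
    pairwise Markov network with nodes 0..n-1 (states 0/1).

    phi[i][x]               : local evidence phi_i(x)    -- illustrative layout
    phi_pair[(i, j)][x][y]  : pairwise potential phi_ij(x, y)
    """
    def pot(i, j, xi, xj):  # symmetric lookup of phi_ij
        return (phi_pair[(i, j)][xi][xj] if (i, j) in phi_pair
                else phi_pair[(j, i)][xj][xi])
    ne = {i: set() for i in range(n)}
    for (i, j) in edges:
        ne[i].add(j)
        ne[j].add(i)
    # messages m[i][j][x_j], initialized uniformly
    m = {i: {j: [0.5, 0.5] for j in ne[i]} for i in range(n)}
    for _ in range(sweeps):
        new = {i: {} for i in range(n)}
        for i in range(n):
            for j in ne[i]:
                msg = []
                for xj in (0, 1):
                    s = 0.0
                    for xi in (0, 1):
                        prod = phi[i][xi]
                        for k in ne[i] - {j}:     # all neighbors except j
                            prod *= m[k][i][xi]
                        s += pot(i, j, xi, xj) * prod
                    msg.append(s)
                z = msg[0] + msg[1]
                new[i][j] = [msg[0] / z, msg[1] / z]   # alpha normalizes
        m = new
    # beliefs (3.91): local evidence times all incoming messages
    b = {}
    for i in range(n):
        bi = [phi[i][0], phi[i][1]]
        for k in ne[i]:
            bi = [bi[0] * m[k][i][0], bi[1] * m[k][i][1]]
        z = bi[0] + bi[1]
        b[i] = [bi[0] / z, bi[1] / z]
    return b
```

On a two-node tree the beliefs reproduce the exact marginals, as the preceding paragraph states for cycle-free graphs.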
Theorem 3.9 (Yedidia et al. [28,35]):
Let {𝑚𝑖𝑗} be a set of BP messages and let {𝑏𝑖𝑗, 𝑏𝑖} be the beliefs calculated from those
messages. Then the beliefs are fixed–points of the BP algorithm if and only if they are
zero–gradient points of the Bethe free energy 𝒢𝛽 subject to the following normalization and
marginalization constraints:
\sum_{X_i} b_i(X_i) = 1 \quad \text{and} \quad \sum_{X_i} b_{ij}(X_i, X_j) = b_j(X_j) \quad (3.94)
Proof: We start by writing the Bethe free energy 𝒢𝛽 (3.82) in terms of beliefs:
\mathcal{G}_\beta(\{b_i\},\{b_{ij}\}) = \sum_{i,j} \sum_{X_i, X_j} b_{ij}(X_i, X_j) \cdot \left[\ln b_{ij}(X_i, X_j) - \ln \psi_{ij}(X_i, X_j)\right] - \sum_{i} (d_i - 1) \cdot \sum_{X_i} b_i(X_i) \cdot \left[\ln b_i(X_i) - \ln \phi_i(X_i)\right] \quad (3.95)
To prove the claim " → " we add the following Lagrange multipliers to form a Lagrangian 𝐿:
𝜆𝑖𝑗(𝑋𝑗) is the multiplier corresponding to the constraint that 𝑏𝑖𝑗(𝑋i, 𝑋𝑗) marginalizes down to
𝑏𝑗(𝑋𝑗);
𝜉𝑖𝑗 , 𝜉𝑖 are multipliers corresponding to the normalization constraints.
The equation \partial L / \partial b_{ij}(X_i, X_j) = 0 gives:

\ln b_{ij}(X_i, X_j) = \ln \psi_{ij}(X_i, X_j) + \lambda_{ij}(X_j) + \lambda_{ji}(X_i) + \xi_{ij} - 1

The equation \partial L / \partial b_i(X_i) = 0 gives:

(d_i - 1) \cdot \left(\ln b_i(X_i) + 1\right) = \ln \phi_i(X_i) + \sum_{j \in \mathrm{ne}(i)} \lambda_{ji}(X_i) + \xi_i

Setting:

\lambda_{ij}(X_j) = \ln \prod_{k \in \mathrm{ne}(j) \setminus \{i\}} m_{kj}(X_j)
and using the marginalization constraints (3.94), we find that the stationary conditions on the
Lagrangian are equivalent to the BP fixed–point conditions.
To prove the claim " ← ", suppose that we are given 𝑏𝑖, 𝑏𝑖𝑗, and 𝜆𝑖𝑗(𝑋𝑗) that correspond to a
zero–gradient point, and set:

m_{ij}(X_j) = \frac{b_j(X_j)}{\exp(\lambda_{ij}(X_j))}
Because 𝑏𝑖, 𝑏𝑖𝑗, and 𝜆𝑖𝑗(𝑋𝑗) satisfy the stationarity conditions, 𝑚𝑖𝑗 defined in this way must be a
fixed–point of BP.
Since both sides of Theorem 3.9 are valid, there is a one–to–one correspondence between the
fixed–points of the BP algorithm and the stationary points of the Bethe free energy.
The consequences of Theorem 3.9 differ between tree–like graphs and loopy graphs.
In tree–like graphs the fixed–points of the BP algorithm are the global minima of the Bethe free
energy [28]. This comes as a natural effect of the fact that the BP algorithm performs exact
inference in graphs without loops, so the Bethe free energy is minimal for exact marginals.
In loopy graphs the situation is more complicated. In [61] Heskes showed that the stable fixed–
points of the LBP algorithm are local minima of the Bethe free energy. He also showed that the
converse is not necessarily true: minima of the Bethe free energy can be unstable fixed–points
of LBP [61]. Furthermore, in [62] Heskes derived sufficient conditions for uniqueness of a LBP
fixed–point. By using a particular Boltzmann machine as a counter–example, Heskes showed
that the uniqueness of a LBP fixed–point does not guarantee the convergence of the LBP
algorithm to that fixed–point [62].
In [58] Shin proposed an alternative to LBP that fixes its convergence issue via a double-loop
scheme. His solution applies to arbitrary binary graphical models with 𝑛 nodes and maximum
degree O(log 𝑛) in the underlying graph. Shin’s algorithm is a message passing algorithm that
solves the Bethe fixed–point equations in a polynomial number of bitwise operations and is
considered the first fully polynomial–time approximation scheme for the LBP fixed–point
computation in Markov random fields [58].
We end this section by rewriting the expression (3.83) of the mean field free energy in terms of
beliefs:
F_{MF}(\{b_i\}) = -\sum_{i,j} b_i(X_i) \cdot b_j(X_j) \cdot \ln \phi_{ij}(X_i, X_j) + \sum_{i} b_i(X_i) \cdot \left[\ln b_i(X_i) - \ln \phi_i(X_i)\right] \quad (3.96)
3.5.4 Belief optimization
Unlike the mean field free energy and the Bethe–Gibbs free energy, which include only first
order terms 𝑝𝑖(𝑋𝑖), the Bethe free energy includes first–order terms 𝑝𝑖(𝑋𝑖) as well as second–
order terms 𝑝𝑖𝑗(𝑋i, 𝑋𝑗). Unlike the mean field free energy, which is minimized in the primal
variables {𝑝𝑖}, the Bethe free energy can be minimized in both the primal space and the dual
space. Usually the Bethe free energy is minimized in the dual space by using messages, which
are a combination of the dual variables {𝜆𝑖𝑗(𝑋𝑗)}. The process of minimizing the Bethe free
energy in the primal space, i.e., in terms of the posterior probability distributions, is similar to the
mean field free energy minimization. The approximate inference algorithm that employs this
type of minimization for the Bethe free energy is named belief optimization and represents an
alternative to the fixed–point equations of belief propagation [4].
In order to derive the fixed–point equations that solve the marginals {𝑝𝑖} for the Bethe free
energy, we follow a familiar recipe: first we compute derivatives of the Bethe free energy given
by (3.89) with respect to {𝑝𝑖} and then we equate them to zero:
G_\beta[\{p_i\}] = \min_{\{p_{ij}\}} \mathcal{G}_\beta[\{p_i\},\{p_{ij}\}] = \mathcal{G}_\beta[\{p_i\},\{p_{ij}(p_i, p_j)\}]

\frac{dG_\beta}{dp_i} = \frac{\partial \mathcal{G}_\beta}{\partial p_i} + \sum_{j \in \mathrm{ne}(i)} \frac{\partial \mathcal{G}_\beta}{\partial p_{ij}} \cdot \frac{\partial p_{ij}}{\partial p_i} \quad (3.97)
where ne(𝑖) denotes the Markov blanket of unit 𝑖.
We recall that in 𝐺𝛽[{𝑝𝑖}] the pairwise marginals {𝑝𝑖𝑗} are defined so as to minimize 𝒢𝛽; in
other words, \partial \mathcal{G}_\beta / \partial p_{ij} = 0. Therefore, (3.97) becomes:

\frac{dG_\beta}{dp_i} = \frac{\partial \mathcal{G}_\beta}{\partial p_i} \quad (3.98)
The equation (3.98) shows that, under the current assumptions, the Bethe–Gibbs free energy
𝐺𝛽 and the Bethe free energy 𝒢𝛽 have the same fixed–points. In order to solve the gradient, we
use the analytic expression (3.82) of 𝒢𝛽.
\frac{\partial \mathcal{G}_\beta}{\partial p_i} = \theta_i + \ln\left(\frac{(1 - p_i)^{d_i - 1} \cdot \prod_{j \in \mathrm{ne}(i)} (p_i - p_{ij})}{p_i^{d_i - 1} \cdot \prod_{j \in \mathrm{ne}(i)} (p_{ij} + 1 - p_i - p_j)}\right) \quad (3.99)
Equating these derivatives to zero gives the following set of fixed–point equations for the
Bethe–Gibbs free energy 𝐺𝛽 and, equivalently, the Bethe free energy [4]:

p_i = \mathrm{sigm}\left(-\theta_i + \sum_{j \in \mathrm{ne}(i)} \ln\left(\frac{p_i \cdot (p_{ij} + 1 - p_i - p_j)}{(1 - p_i) \cdot (p_i - p_{ij})}\right)\right) \quad \text{for all } i \in V \quad (3.100)
Regardless of whether they are run sequentially or in parallel, the fixed–point equations (3.100)
are not guaranteed to decrease the Bethe free energy 𝒢𝛽, or to converge at all.
As with the mean field approximation, we may achieve a decrease of the Bethe free energy by
optimizing it with respect to a single coordinate at a time, i.e., by temporarily fixing all
neighboring marginals 𝑝𝑗 and minimizing over the marginal 𝑝𝑖 of the central node. The resulting
value of the singleton marginal 𝑝𝑖 is optimal given the choice of all other singleton marginals.
To minimize the function in its entirety, we need to minimize with respect to all the coordinates
{𝑝𝑖}. One way to achieve this goal is an iterated coordinate descent algorithm, which repeatedly
minimizes a single marginal at a time, given fixed choices for all of the others.
Another way to perform the global minimization is to perform gradient descent on all the
coordinates {𝑝𝑖} simultaneously while enforcing the constraint that they stay within the interval
[0,1].
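One possible realization of the sequential fixed-point iteration (3.100), with the pairwise marginals recomputed from (3.88) after every update. A sketch only: as cautioned above, convergence is not guaranteed in general, and the update schedule and iteration count below are illustrative assumptions:

```python
import math

def sigm(x):
    return 1.0 / (1.0 + math.exp(-x))

def belief_optimization(w, theta, neighbors, iters=200):
    """Sequential sweep of the fixed-point equations (3.100)."""
    p = {i: 0.5 for i in neighbors}       # start from uninformative marginals

    def pij(i, j):
        # admissible root (3.88) of the quadratic (3.85)
        wij = w[(i, j)] if (i, j) in w else w[(j, i)]
        a = math.exp(wij) - 1.0                       # alpha_ij, (3.86)
        if abs(a) < 1e-12:
            return p[i] * p[j]                        # w_ij ~ 0: independence
        q = 1.0 + a * (p[i] + p[j])                   # Q_ij
        return (q - math.sqrt(q * q - 4.0 * a * (1.0 + a) * p[i] * p[j])) / (2.0 * a)

    for _ in range(iters):
        for i in neighbors:
            s = -theta[i]
            for j in neighbors[i]:
                q = pij(i, j)
                s += math.log(p[i] * (q + 1.0 - p[i] - p[j])
                              / ((1.0 - p[i]) * (p[i] - q)))
            p[i] = sigm(s)                            # update rule (3.100)
    return p
```

With all weights zero the units decouple and each marginal settles at sigm(−θᵢ), which is a quick consistency check on the sign conventions of (3.100).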
Chapter 4. Introduction to Boltzmann Machines
A Boltzmann machine is a parallel computational organization, or network, that is well
suited to constraint satisfaction tasks involving large numbers of “weak” constraints. A weak
constraint is a goal criterion that need not be satisfied by every solution; in other words, it is not
an all–or–none criterion. In some problem domains, such as finding the most plausible
interpretation of an image, it frequently happens that even the best possible solution violates
some constraints. In these cases a variation of weak constraints is used, specifically weak
constraints that incur a cost when violated. The quality of a solution is then determined by the
total cost of all the constraints that it violates [10-11,43].
4.1 Definitions
Structurally a Boltzmann machine is a symmetrical connectionist network with hidden
units; therefore, its structure follows the general structure of a connectionist network described
in Section 2.4. Hinton characterized the Boltzmann machine as “a generalization of a Hopfield
network in which the units update their states according to a stochastic decision rule” [51]. The
majority of the following definitions are taken from [1,3,43] and adapted as needed for
consistency. Our focus in Chapter 4 and Chapter 5 is the asynchronous Boltzmann machine,
which we refer to simply as the Boltzmann machine.
Definition 4.1:
A Boltzmann machine is a neural network that satisfies certain properties. Formally a Boltzmann
machine 𝐁𝐌 is a four–tuple:
\mathbf{BM} = (\mathcal{N}, \mathcal{G}, W, \Theta) \quad (4.1)

comprising:
a finite set 𝓝 of primitive computing elements called units or neurons;
Without restricting the generality we assume that 𝒩 is indexed by the set {1,2,…𝑛} for 𝒏 = |𝒩|.
To make the formulae more readable, in subsequent development we refer equivalently to 𝒩
and {1,2,…𝑛}.
an undirected graph (𝓝,𝓖) called the connectivity graph, where:
𝓖 = { {𝑖, 𝑗} ∶ (𝑖, 𝑗) ∈ 𝒩 ×𝒩, 𝑖 ≠ 𝑗} (4.2)
a collection W = (W_{ij})_{\{i,j\} \in \mathcal{G}} of real numbers called the weights or synaptic weights; each
weight W_{ij} is associated with an edge \{i, j\};
a collection \Theta = (\theta_j)_{j \in \mathcal{N}} of real numbers called the thresholds; each threshold \theta_j is
associated with a unit;
and satisfying the following properties:
a unit is always in one of two activation levels or states designated as on/off or 1/0 or 1/-1;
a unit adopts these activation levels as a probabilistic function of the activation levels of its
neighboring units and the weights on its edges to them;
the weights W_{ij} on the edges are symmetric, having the same strength in both directions:

W_{ij} = W_{ji} \quad (4.3)

the weights W_{ij} on the edges can take on real values of either sign;
a unit being on or off is taken to mean that the system currently accepts or rejects some
elemental hypothesis about the domain;
the weight on an edge represents a weak pairwise constraint between two hypotheses:
a positive weight indicates that the two hypotheses tend to support one another; if one is
currently accepted, accepting the other should be more likely;
a negative weight suggests, other things being equal, that the two hypotheses should
not both be accepted.
The following notions are important in subsequent development:
The terms link and connection equally denote an edge {𝑖, 𝑗} ∈ 𝓖 of the connectivity graph,
where 1 ≤ 𝑖, 𝑗 ≤ 𝑛;
𝐈 = {0,1} denotes the set of possible activation levels or states for a unit;
Hopfield represented the states of his model with -1 and 1 because his model was derived
from a physical system called a spin glass, in which spins are either down or up. Provided the
units have thresholds, models that use the representation -1 and 1 for their states can be
translated into models that represent their states with 0 and 1 and have different thresholds
[43]. In Section 2.5 we showed a similar translation for a Hopfield network (equations (2.33)
and (2.34)).
𝝈𝐢 ∈ I denotes the activation level of unit 𝑖, ∀𝑖 ∈ 𝒩;
ℝ^𝓖 denotes the set of all families of weights W;
ℝ^𝓝 denotes the set of all families of thresholds Θ;
The connectivity graph (𝒩, 𝒢) can be extended to (𝓝, 𝓖′) as follows:

\mathcal{G}' = \{\, \{i, j\} : \{i, j\} \in \mathcal{G} \ \text{or} \ (i = 0 \ \text{and} \ j \in \mathcal{N}) \,\} \quad (4.4)

and:

\mathbb{R}^{\mathcal{G}'} \overset{\mathrm{def}}{=} \mathbb{R}^{\mathcal{G}} \times \mathbb{R}^{\mathcal{N}} \quad (4.5)
Parameters
The parameters or extended weights 𝐖 are a collection of real numbers defined as:

\mathbf{W} \overset{\mathrm{def}}{=} (W, \Theta) = (w_{ij})_{\{i,j\} \in \mathcal{G}'} \in \mathbb{R}^{\mathcal{G}'} \quad (4.6)

where:

w_{ij} \overset{\mathrm{def}}{=} \begin{cases} W_{ij}, & \text{if } \{i, j\} \in \mathcal{G} \\ -\theta_j, & \text{if } \{i, j\} \in \mathcal{G}' - \mathcal{G} \end{cases} \quad (4.7)
The number of weights W_{ij} is at most n(n-1)/2, which corresponds to (𝒩, 𝒢) being
a complete undirected graph on 𝑛 vertices or units.
The number of extended weights w_{ij} is at most n(n-1)/2 + n = n(n+1)/2, which corresponds
to (𝒩, 𝒢′) being a complete undirected graph on 𝑛 + 1 vertices or units.
Definition 4.2:
If we incorporate into Definition 4.1 the concept of parameters according with the formulae (4.6)
and (4.7), then we obtain the following equivalent definition of a Boltzmann machine:

\mathbf{BM} = (\mathcal{N}, \mathcal{G}, (W, \Theta)) = (\mathcal{N}, \mathcal{G}', \mathbf{W}) \quad (4.8)
Configurations
A function 𝝈:𝓝⟶ 𝐈, 𝝈(𝒊) ≝ 𝝈𝐢 is called an 𝐼–valued configuration of 𝒩.
A specification of activation levels (𝝈𝐢)𝒊∈𝓝 of all the units 𝑖 ∈ 𝒩 represents a configuration or
a global state of 𝐁𝐌. A configuration of 𝐁𝐌 can also be seen as a particular combination of
hypotheses about the domain. The set of all possible configurations of 𝐁𝐌 represents the
configuration space of 𝐁𝐌 and is written 𝐈𝓝.
Net input of configuration towards unit
Generally, the net input of a configuration 𝜎 towards a unit 𝑖 ∈ 𝒩, also called the activation
potential of unit 𝑖, is defined by equation (2.23). Adapting equation (2.23) to the current
conventions and notations, we obtain:

\mathrm{net}_i \equiv \mathrm{net}(i, \sigma) = -\theta_i + \sum_{j \in \mathcal{G}(i)} W_{ji} \cdot \sigma_j = \sum_{j \in \mathcal{G}'(i)} w_{ji} \cdot \sigma_j \quad (4.9)

where: 𝒢(𝑖) = {𝑗 ∶ {𝑗, 𝑖} ∈ 𝒢}; 𝒢′(𝑖) = {𝑗 ∶ 𝑗 ∈ {0} ∪ 𝒩 and {𝑗, 𝑖} ∈ 𝒢′}; and 𝜎𝑗 is the projection of
𝜎 onto the jth component of I^𝒩.
The net input of a configuration 𝜎 towards a specific unit can be seen as a mapping from a
pair (configuration, unit) to ℝ. A mapping from the configuration alone to ℝ is called a
Hamiltonian. Formally, a Hamiltonian H is an element of 𝓗(𝓝), where ℋ(𝒩) denotes the set of
all real–valued functions defined on I^𝒩:

\mathcal{H}(\mathcal{N}) = \{\, \mathrm{H} \mid \mathrm{H} : \mathrm{I}^{\mathcal{N}} \longrightarrow \mathbb{R} \,\} \quad (4.10)

Clearly ℋ(𝒩) is a linear space of dimension 2^𝑛.
Probability functions on the configuration space
Let 𝓟(𝓝) denote the set of all probability functions on the configuration space I^𝒩. 𝒫(𝒩) is a
simplex of dimension 2^𝑛 − 1 in ℝ^{I^𝒩}:

\mathcal{P}(\mathcal{N}) = \{\, \mathbf{P} \mid \mathbf{P} : \mathrm{I}^{\mathcal{N}} \longrightarrow [0,1],\ \sum_{\sigma \in \mathrm{I}^{\mathcal{N}}} \mathbf{P}(\sigma) = 1 \,\} \quad (4.11)
Let 𝓟+(𝓝) denote the interior of the simplex 𝒫(𝒩), i.e., the set of those 𝐏 ∈ 𝒫(𝒩) that are
nondegenerate, in the sense that 𝐏(𝜎) ≠ 0 for all 𝜎 ∈ I𝒩:
𝓟+(𝓝) = {𝐏 ∈ 𝒫(𝒩) | 𝐏(𝜎) ≠ 0 for all 𝜎 ∈ I𝒩} (4.12)
Gibbs measure associated to a Hamiltonian
Any element H ∈ ℋ(𝒩) gives rise to a probability distribution 𝐆_H ∈ 𝓟+(𝓝), named the Gibbs
measure associated to the Hamiltonian H and defined by:

\mathbf{G}_{\mathrm{H}}(\sigma) = \frac{\exp(\mathrm{H}(\sigma))}{Z(\mathrm{H})} \quad (4.13)

where 𝑍(H) is the normalization constant needed to make the probabilities add up to 1, i.e.:

Z(\mathrm{H}) = \sum_{\sigma \in \mathrm{I}^{\mathcal{N}}} \exp(\mathrm{H}(\sigma)) \quad (4.14)
Clearly the set of Gibbs measures on I𝒩 is exactly 𝒫+(𝒩).
Two Hamiltonians H1 ≠ H2 give rise to the same Gibbs measure if and only if they differ by a
constant. Let 𝓗𝟎(𝓝) be the quotient space of ℋ(𝒩) modulo the constants. Then the
function 𝒇𝟎 defined by the equation (4.15) is well defined and bijective:
𝒇𝟎 ∶ 𝓗𝟎(𝓝)⟶ 𝓟+(𝓝), 𝒇𝟎(𝐇) = 𝐆𝐇 (4.15)
Quadratic Hamiltonian
For any 𝑖 ∈ 𝒩, let 𝜎𝑖 denote the projection of 𝜎 onto the ith component of I^𝒩.
Let also 𝜎0 denote the function identically equal to 1: 𝜎0 ≝ 1.
Then to any pair (W, Θ) = 𝐖 ∈ ℝ^{𝒢′} we associate a function H_{(W,Θ)} ≝ H_𝐖, named the
Hamiltonian of 𝐖 = (W, Θ) and defined by:

\mathrm{H}_{(W,\Theta)} : \mathrm{I}^{\mathcal{N}} \longrightarrow \mathbb{R}, \quad \mathrm{H}_{(W,\Theta)}(\sigma) = \sum_{\{i,j\} \in \mathcal{G},\, i<j} \sigma_i \cdot \sigma_j \cdot W_{ij} - \sum_{j \in \mathcal{N}} \sigma_j \cdot \theta_j \quad (4.16)

equivalent to:

\mathrm{H}_{(W,\Theta)}(\sigma) = \frac{1}{2} \cdot \sum_{\{i,j\} \in \mathcal{G}} \sigma_i \cdot \sigma_j \cdot W_{ij} - \sum_{j \in \mathcal{N}} \sigma_j \cdot \theta_j \quad (4.17)

equivalent to:

\mathrm{H}_{\mathbf{W}} : \mathrm{I}^{\mathcal{N}} \longrightarrow \mathbb{R}, \quad \mathrm{H}_{\mathbf{W}}(\sigma) = \sum_{\{i,j\} \in \mathcal{G}',\, i<j} \sigma_i \cdot \sigma_j \cdot w_{ij} \quad (4.18)

Any Hamiltonian H which is of the form H_𝐖 for some 𝐖 ∈ ℝ^{𝒢′} is called a quadratic
Hamiltonian.
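The equivalence of the forms (4.16) and (4.18) — folding the thresholds into weights w_{0j} = −θ_j attached to a constant unit σ_0 = 1, as in (4.7) — can be checked directly. A sketch with illustrative data structures:

```python
def hamiltonian(sigma, W, theta):
    """Form (4.16): quadratic term over edges minus the linear threshold term.
    sigma is a tuple of 0/1 states; W maps edges (i, j) with i < j to weights."""
    quad = sum(sigma[i] * sigma[j] * w for (i, j), w in W.items())
    lin = sum(sigma[j] * theta[j] for j in range(len(sigma)))
    return quad - lin

def hamiltonian_extended(sigma, W, theta):
    """Form (4.18): extend sigma with sigma_0 = 1 and fold the thresholds
    into weights w_0j = -theta_j; index -1 stands in for the extra unit 0."""
    w = dict(W)
    for j in range(len(sigma)):
        w[(-1, j)] = -theta[j]
    ext = (1,) + tuple(sigma)            # sigma_0 = 1 prepended
    return sum(ext[i + 1] * ext[j + 1] * wij for (i, j), wij in w.items())
```

The two functions agree on every configuration, which is exactly the content of the equivalence between (4.16) and (4.18).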
Partition function and cumulant function
The partition function associated to a Hamiltonian H is the function 𝒁 defined by:

Z : \mathcal{H}(\mathcal{N}) \longrightarrow \mathbb{R}, \quad Z(\mathrm{H}) = \sum_{\sigma \in \mathrm{I}^{\mathcal{N}}} \exp(\mathrm{H}(\sigma)) \quad (4.19)

The partition function of a quadratic Hamiltonian H_𝐖 is denoted by the simplified notation:

Z(\mathbf{W}) \overset{\mathrm{def}}{=} Z(\mathrm{H}_{\mathbf{W}}) = \sum_{\sigma \in \mathrm{I}^{\mathcal{N}}} \exp(\mathrm{H}_{\mathbf{W}}(\sigma)) \quad (4.20)
The partition function 𝑍 of a quadratic Hamiltonian HW is well defined and strictly convex on
ℋ0(𝒩) and, generally, intractable. Therefore, we need to approximate it and this is where
the cumulant function helps.
By definition, the cumulant function 𝑨 of a quadratic Hamiltonian H_𝐖 is the natural logarithm
of the partition function associated to that Hamiltonian:

A(\mathbf{W}) \overset{\mathrm{def}}{=} A(\mathrm{H}_{\mathbf{W}}) = \ln Z(\mathbf{W}) \quad (4.21)
When trying to approximate a probability distribution, it is more important to get the
probabilities correct for events that happen frequently than for rare events. One way to
accomplish this objective is to operate with logarithms of probabilities instead of directly with
probabilities.
Quadratic Gibbs measure
We introduce a simplified notation for the Gibbs measure G_{H_𝐖} associated to a Hamiltonian
H_𝐖, which itself is associated to a given set of parameters 𝐖 = (W, Θ) ∈ ℝ^{𝒢′}:

\mathbf{G}_{\mathbf{W}}(\sigma) \overset{\mathrm{def}}{=} \mathbf{G}_{\mathrm{H}_{\mathbf{W}}}(\sigma) = \frac{\exp(\mathrm{H}_{\mathbf{W}}(\sigma))}{Z(\mathrm{H}_{\mathbf{W}})} = \frac{\exp(\mathrm{H}_{\mathbf{W}}(\sigma))}{Z(\mathbf{W})} \quad \text{for all } \sigma \in \mathrm{I}^{\mathcal{N}} \quad (4.22)
A quadratic Gibbs measure on I𝒩 associated to a connectivity graph (𝒩,𝒢) is a probability
function 𝐏 ∈ 𝒫(𝒩) that satisfies the following property:
∃W = (W, Θ) ∈ ℝ𝓖′ such that 𝐏 ≡ 𝐆𝐖
We introduce the notation 𝐆𝟐(𝓝,𝓖) to designate the set of all quadratic Gibbs measures on
I𝒩:
𝐆𝟐(𝓝,𝓖) = {𝐏 ∈ 𝒫(𝒩) ∶ ∃W = (W, Θ) ∈ ℝ𝓖′ such that 𝐏 ≡ 𝐆𝐖} (4.23)
Clearly, quadratic Gibbs measures on I^𝒩 are very special, since they are parameterized by
𝐖 ∈ ℝ^{𝒢′}, i.e., by at most n(n+1)/2 parameters, whereas 𝒫(𝒩) has dimension 2^𝑛 − 1 and
n(n+1)/2 ≪ 2^𝑛 − 1. If we consider, in addition, probability distributions that are marginals of
quadratic Gibbs measures, we get all the Gibbs measures on I^𝒩.
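For small 𝑛, the partition function (4.20) and the quadratic Gibbs measure (4.22) can be computed by brute-force enumeration of all 2^𝑛 configurations. A didactic sketch only (exponential in 𝑛, so usable just for toy networks):

```python
import itertools
import math

def gibbs_measure(W, theta, n):
    """Quadratic Gibbs measure G_W (4.22) over all 2**n configurations."""
    def H(sigma):  # quadratic Hamiltonian (4.16)
        return (sum(sigma[i] * sigma[j] * w for (i, j), w in W.items())
                - sum(sigma[j] * theta[j] for j in range(n)))
    configs = list(itertools.product((0, 1), repeat=n))
    Z = sum(math.exp(H(s)) for s in configs)          # partition function (4.20)
    return {s: math.exp(H(s)) / Z for s in configs}   # normalized, (4.22)

# a single edge with weight 1 and zero thresholds
G = gibbs_measure({(0, 1): 1.0}, [0.0, 0.0], 2)
```

On this example only the configuration (1, 1) has nonzero energy, so its probability is e/(3 + e) and the measure sums to 1, as (4.22) requires.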
4.2 Modelling the underlying structure of an environment
By differentiating their roles in the learning process, Hinton partitioned the units of a
Boltzmann machine into two functional groups: a nonempty set of visible units and a possibly
empty set of hidden units. This is how Hinton explained, in [10], the reason for this partition:
Suppose that the environment directly and completely determines the states of a subset of the units (called the "visible" units), but leaves the network to determine the states of the remaining "hidden" units. The aim of the learning is to use the hidden units to create a model of the structure implicit in the ensemble of binary state vectors that the environment determines on the visible units.
A more detailed justification for the differentiation between units is given by Hinton in [10]. He
considers a parallel network like Boltzmann machine a “pattern completion device such that a
subset of the units are “clamped” into their on or off state and the weights in the network then
complete the pattern by determining the states of the remaining units” [10]. Hinton gives the
example of a network that has one unit for each component of the environmental input
vector; such a network is capable of learning only a limited set of binary vectors. He uses this
example to explain why his assumption about pattern completion has strong limitations and how
these limits can be transcended: by using extra units whose states do not correspond to
components in the vectors to be learned [43]:
The weights on connections to these extra units can be used to represent complex interactions that cannot be expressed as pairwise correlations between the components of the environmental input vectors.
He calls these extra units hidden units and the units that are used to specify the patterns visible
units. In [43] Hinton gives the following intuitive explanation for the separation of units in two
classes:
The visible units are the interface between the network and the environment that specifies vectors for it to learn or asks it to complete a partial vector. The hidden units are where the network can build its own internal representations.
Formally this split of 𝒩 can be described as:

\mathcal{N} = \mathcal{V} \cup \mathcal{H} \quad \text{and} \quad \mathcal{V} \cap \mathcal{H} = \emptyset \quad (4.24)

where 𝓥 represents the set of visible units and 𝓗 represents the set of hidden units.
Let 𝒎 be the number of units in 𝒱 and 𝒍 the number of units in ℋ:

n = m + l, \quad m = |\mathcal{V}|, \quad l = |\mathcal{H}| \quad (4.25)

Theoretically, the structure of an environment can be specified by giving the probability
distribution over all 2^𝑚 states of the visible units. In practice, the network is said to have a
perfect model of the environment if it achieves exactly the same probability distribution over
these 2^𝑚 states when it is running freely at thermal equilibrium, with all units unclamped so that
there is no environmental input [10].
We can regard I^𝒩 as the Cartesian product of I^𝒱 and I^ℋ, and each configuration 𝜎 ∈ I^𝒩 as a
pair of configurations over the visible and hidden units, respectively:

\mathrm{I}^{\mathcal{N}} = \mathrm{I}^{\mathcal{V}} \times \mathrm{I}^{\mathcal{H}} \quad (4.26)

\sigma = (v, h) \quad \text{for } v \in \mathrm{I}^{\mathcal{V}} \text{ and } h \in \mathrm{I}^{\mathcal{H}} \quad (4.27)

If 𝐏 ∈ 𝒫(𝒩), we use 𝐌𝐀𝐑𝐆(𝐏,𝒱) to denote the marginal of the probability distribution 𝐏 with
respect to the variables 𝜎𝑖 such that 𝜎𝑖 = 𝑣𝑖 for all 𝑖 ∈ 𝒱, i.e., the measure given by:

\mathrm{MARG}(\mathbf{P}, \mathcal{V})(v) = \sum_{h \in \mathrm{I}^{\mathcal{H}}} \mathbf{P}(v, h) \quad \text{for } v \in \mathrm{I}^{\mathcal{V}} \quad (4.28)
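The marginalization (4.28) is a plain sum over hidden configurations; a sketch, with the joint distribution stored as a dictionary keyed by full configurations (an illustrative layout):

```python
import itertools

def marginal_visible(P, m, l):
    """MARG(P, V) as in (4.28): sum the joint P(v, h) over all hidden
    configurations h. The first m components of a configuration are
    visible, the last l are hidden."""
    marg = {}
    for v in itertools.product((0, 1), repeat=m):
        marg[v] = sum(P[v + h] for h in itertools.product((0, 1), repeat=l))
    return marg
```

For example, the uniform joint over 3 units (2 visible, 1 hidden) marginalizes to the uniform distribution over the 4 visible states.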
Given a connectivity graph (𝒩, 𝒢), we introduce the notation 𝐆𝟐(𝓥,𝓗, 𝓖) to designate the set of
all probability measures 𝐐 on 𝒱 that are marginals of some quadratic Gibbs measure 𝐏 ∈
G2(𝒩, 𝒢), i.e., satisfy 𝐐 ≡ MARG(𝐏, 𝒱) for some 𝐏 ∈ G2(𝒩, 𝒢). We also recall the definition (4.23)
of 𝐆𝟐(𝓝,𝓖), i.e., the set of all quadratic Gibbs measures on I𝒩.
\mathbf{G}_2(\mathcal{N}, \mathcal{G}) = \{\, \mathbf{P} \in \mathcal{P}(\mathcal{N}) : \exists \mathbf{W} \in \mathbb{R}^{\mathcal{G}'} \ \text{such that} \ \mathbf{P} \equiv \mathbf{G}_{\mathbf{W}} \,\}

\mathbf{G}_2(\mathcal{V}, \mathcal{H}, \mathcal{G}) = \{\, \mathbf{Q} \in \mathcal{P}(\mathcal{V}) : \exists \mathbf{P} \in \mathbf{G}_2(\mathcal{N}, \mathcal{G}) \ \text{such that} \ \mathbf{Q} \equiv \mathrm{MARG}(\mathbf{P}, \mathcal{V}) \,\} \quad (4.29)
The following theorem mentioned in [3] establishes a relation between 𝒫+(𝒱) and G2(𝒱,ℋ, 𝒢).
Theorem 4.1:
Let (𝒩, 𝒢) be the full connectivity graph of a 𝐁𝐌. Using the notations (4.25), assume that:

l \geq 2^m - \frac{1}{2} \cdot (m^2 + m) - 1 \quad (4.30)

Then:

\mathbf{G}_2(\mathcal{V}, \mathcal{H}, \mathcal{G}) = \mathcal{P}^{+}(\mathcal{V})
This means that every nondegenerate probability distribution on I𝒱 can be realized as a
marginal of a distribution on I𝒩 with a quadratic Hamiltonian.
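The bound (4.30) grows quickly with the number of visible units; a one-line helper makes this concrete:

```python
def hidden_units_bound(m):
    """Lower bound of Theorem 4.1 on the number of hidden units l that
    suffices for realizing any nondegenerate distribution on the states
    of m visible units: l >= 2**m - (m**2 + m)/2 - 1."""
    return 2 ** m - (m * m + m) // 2 - 1
```

For instance, three visible units require at least one hidden unit, while four visible units already require at least five.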
In view of this result, if we are trying to model a probability distribution 𝐐 on I𝒱, it is not too much
of a restriction to assume that 𝐐 is a marginal of some quadratic Gibbs measure on some larger
set 𝒩 = 𝒱 ∪ℋ of units. In particular, if we only look at the visible units, then the equilibrium
behavior of a Boltzmann machine is a marginal of a quadratic Gibbs measure. Moreover, every
quadratic Gibbs measure arises from a Boltzmann machine and then Theorem 4.1 implies that
every nondegenerate probability measure on I𝒱 arises as the behavior of some Boltzmann
machine, possibly with hidden units.
The connectivity graph (𝒩, 𝒢) of a Boltzmann machine proposed by Hinton in [8-11,43] is a
general undirected graph, which means that, as a graphical model, a general Boltzmann
machine represents a pairwise Markov random field. Its connectivity graph can be any
undirected graph, in particular a complete undirected graph that was described by Hinton as a
fully–connected Boltzmann machine. However, the majority of research on Boltzmann machines
has been done on a particular type of graph, specifically a graph that can be “decomposed” into
layers. This particular graph structure is named in the machine learning literature the generic
Boltzmann machine or simply the Boltzmann machine. Because the topic of this paper –
learning algorithms for Boltzmann machines – originates in the field of machine learning, we
adhere to this concept of Boltzmann machine.
The generic Boltzmann machine has one layer that contains all the fully interconnected
visible units and at least one layer of fully interconnected hidden units. If the hidden units are
distributed among multiple layers, only one hidden layer is connected, specifically fully
connected, with the visible layer. The other hidden layers are interconnected: each hidden layer
is fully connected with the layer “below” and with the layer “above” (if one exists). The visible
layer is placed “below” the first hidden layer. Figure 1 illustrates two Boltzmann machine
configurations: the fully–connected Boltzmann machine and the generic Boltzmann machine.
Figure 1 a) A fully–connected Boltzmann machine with three visible nodes and four hidden nodes; b) A layered Boltzmann machine with one visible layer and two hidden layers.
4.3 Representation of a Boltzmann Machine as an energy–based model
As constraint satisfaction networks, the Boltzmann machines should be well equipped to
deal with tasks that involve a large number of weak constraints. However, what we have learned
about them so far doesn’t endow them with such qualities. Specifically, the hidden units, seen as
hidden latent causes, are not good at modelling constraints between variables. Hidden ancestral
variables, i.e., the variables corresponding to the hidden units, may be good for modelling some
types of correlation, but they cannot be used to decrease variance. A better way to model
constraints is to use an energy–based model that associates high energies with data vectors
that violate constraints.
Inspired by a variant of Hopfield’s network, which we described in Section 2.5, Hinton showed
that there exists an expression for the energy of a configuration of the network such that,
under certain circumstances, the individual units act so as to minimize the global energy. In [10]
Hinton explained the importance of the energy of a parallel system like the Boltzmann machine:
it represents the degree of violation of the constraints between hypotheses and consequently
determines the dynamics of the search. He also formulated the following postulates or
assumptions about the energy, which he later used to derive the main properties of the
probabilistic system that is the Boltzmann machine.
Postulate 1:
There is a potential energy function over states of the whole system which is a function 𝑓(𝐏(𝜎))
of the probability of a state 𝜎.
This is equivalent to saying that, given any input, a particular state or configuration 𝜎 of a
Boltzmann machine has exactly one probability. It does not, for instance, have a probability of
0.3 and also a probability of 0.5.
Postulate 2:
The potential energy function is additive for independent systems. Since the probability for a
combination of states of independent systems is multiplicative, it follows that:
𝑓(𝐏(𝜎)) + 𝑓(𝐏(𝜎′)) = 𝑓(𝐏(𝜎)𝐏(𝜎′))
The only function (up to the choice of the constant 𝑘) that satisfies this functional equation is:
𝑓(𝐏(𝜎)) = 𝑘 ∙ ln𝐏(𝜎)
To make more probable states have lower energy, the real–valued constant 𝑘 must be negative.
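The additivity requirement and the sign of 𝑘 can be checked numerically; the following is a minimal sketch, with arbitrary example probabilities and the illustrative choice 𝑘 = −1:

```python
import math

k = -1.0  # negative, so that more probable states get lower energy

def f(p):
    """Potential energy as a function of state probability: f(P) = k * ln P."""
    return k * math.log(p)

# Arbitrary probabilities of states of two independent systems.
p, q = 0.3, 0.5
# Additivity for independent systems: f(P(s)) + f(P(s')) = f(P(s) * P(s')).
assert math.isclose(f(p) + f(q), f(p * q))
# The more probable state has the lower energy.
assert f(0.5) < f(0.3)
```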
Postulate 3:
The part of the potential energy contributed by a single unit can be computed from information
available to the unit.
Only potential energies symmetrical in all pairs of units have this property, since in this case a
unit can "deduce" its effect on other units from their effect on itself.
Under the previous assumptions, the individual units of a Boltzmann machine can be made to
act so as to minimize the global energy. If some of the units are clamped into particular states to
represent a particular input, the system will then try to find the minimum energy configuration
that is compatible with that input. Thus, the energy of a configuration can be interpreted as the
extent to which that combination of hypotheses fails to fit the input and violates the constraints
implicit in the problem domain. So, in minimizing energy, the system is maximizing the extent to
which a perceptual interpretation fits the data and satisfies the constraints. Consequently the
system evolves towards interpretations of that input that increasingly satisfy the constraints of
the problem domain [10].
Using the previous notations, the global energy of the system, also referred to as the energy of the
configuration 𝜎 of the system, is defined as:

E(σ) = −( ∑_{ {i,j}∈𝒢, i<j } σ_i ∙ σ_j ∙ w_ij − ∑_{i∈𝒩} σ_i ∙ θ_i )     (4.31)

or:

E(σ) = −( (1/2) ∙ ∑_{ {i,j}∈𝒢 } σ_i ∙ σ_j ∙ w_ij − ∑_{i∈𝒩} σ_i ∙ θ_i )

equivalent to:

E(σ) = −H_(W,Θ)(σ) = −H_W(σ)     (4.32)
If we represent the configuration σ as an n–dimensional column vector, the weights W as an n × n
symmetric matrix, and the thresholds Θ as an n–dimensional column vector, then the energy of
configuration σ can be written in matrix form as:

E(σ) = −( (1/2) ∙ σ^T W σ − Θ^T σ )     (4.33)
We observe that the global energy defined by the equations (4.31) belongs to the Hamiltonian
family given by the definition (4.10). Moreover, the global energy is the negative of the quadratic
Hamiltonian defined by the equations (4.16) and (4.17).
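To make (4.31) and (4.33) concrete, the following sketch computes the energy both ways and checks that they agree; the network size, weights, and thresholds below are arbitrary illustrative choices, not values from the text:

```python
import numpy as np

def energy(sigma, W, theta):
    """Energy of configuration sigma, eq. (4.33): E = -((1/2) s^T W s - theta^T s).

    W must be symmetric with a zero diagonal; sigma is a 0/1 vector.
    """
    sigma = np.asarray(sigma, dtype=float)
    return -(0.5 * sigma @ W @ sigma - theta @ sigma)

def energy_pairwise(sigma, W, theta):
    """Same energy via the pairwise sum of eq. (4.31), counting each edge once."""
    n = len(sigma)
    pair = sum(sigma[i] * sigma[j] * W[i, j]
               for i in range(n) for j in range(i + 1, n))
    return -(pair - sum(sigma[i] * theta[i] for i in range(n)))

# Tiny example network with arbitrary random parameters.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
W = (W + W.T) / 2          # enforce symmetry
np.fill_diagonal(W, 0.0)   # no self-connections
theta = rng.normal(size=4)
s = np.array([1, 0, 1, 1])

# The half-sum over all ordered pairs equals the sum over unordered pairs.
assert np.isclose(energy(s, W, theta), energy_pairwise(s, W, theta))
```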
In Section 2.5 we presented the Hopfield update rule: switch each randomly selected unit into
whichever of its two states yields the lower total energy given the current configuration of the
network. If the Boltzmann machine operated according to the Hopfield update rule, then it
would be no different from a multilayer perceptron that follows the same rule, and it would
suffer from the standard weakness of gradient descent methods: it could get
stuck at a local minimum that is not a global minimum [10-11]. This is an inevitable
consequence of only allowing jumps to states of lower energy, the so–called “downhill moves”.
Unlike the Boltzmann machine, the Hopfield network doesn’t suffer from this weakness
because its local energy minima are used to store memories. Therefore, if the Hopfield network
is started near some local minimum, the desired behavior is to fall into that local minimum and
not to find the global minimum.
Hinton realized that, if jumps to higher energy states occasionally occurred, it would be possible
for the system to break out of local minima, but it was not clear to him how the system would
then behave and also when the uphill steps should be allowed [43]. Therefore, in order to
escape the local minima in a Boltzmann machine, Hinton advanced the following idea: make the
binary units stochastic and add thermal noise to the global energy such that, occasionally, it
would lead to uphill steps. Hence, Hinton proposed that the stochastic unit should update its
state based on its previous state according to the following rule: the ith unit of a configuration σ
at time t outputs the state 0 or the state 1 with probability:

P(σ_i) = 1 / (1 + exp(ΔE_i / T))     (4.34)
where: 𝐓 is the pseudo–temperature, i.e., a parameter which models the thermal noise injected
into the system; and Δ𝐸𝑖 is the energy gap between the current state and the previous state of
the ith unit of a configuration 𝜎.
Hinton used a simulated annealing algorithm to guide the reduction in the level of thermal noise.
He studied experimentally the effect of thermal noise over transition probabilities and came up
with an annealing schedule that starts with a higher pseudo–temperature and gradually reduces
it to a pseudo–temperature of 1. He based his annealing schedule on the following
observations: at low pseudo–temperatures there is a strong bias in favor of states with low
energy, but the time required to reach equilibrium may be longer; at higher pseudo–
temperatures the bias is not so favorable, but equilibrium is reached faster [43]. According to
Hinton, this technique cannot guarantee that a global minimum will be found, but it can
guarantee that a nearly global minimum will be found with high probability.
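Hinton’s actual schedules were tuned experimentally; the following is only a generic sketch of such a schedule, starting high and decaying geometrically toward the final pseudo–temperature of 1, with the starting value and decay rate chosen arbitrarily:

```python
def annealing_schedule(t_start=10.0, t_final=1.0, decay=0.9):
    """Yield a decreasing sequence of pseudo-temperatures ending at t_final."""
    T = t_start
    while T > t_final:
        yield T
        T *= decay        # geometric cooling step
    yield t_final         # finish exactly at the final pseudo-temperature

schedule = list(annealing_schedule())
assert schedule[0] == 10.0 and schedule[-1] == 1.0
# The schedule never increases the pseudo-temperature.
assert all(a >= b for a, b in zip(schedule, schedule[1:]))
```

Any decreasing schedule fits the description in the text; the geometric form is just a common, simple choice.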
Later, Hinton refined his original update rule by adopting, during each annealing stage, i.e.,
when the pseudo–temperature is kept constant, a variant of the Metropolis algorithm. He also
proposed a simplified version of the update rule (4.34): “if the energy gap between the on and
off states of the ith unit of a configuration 𝜎 is 𝛥𝐸𝑖′, then, regardless of the previous state of that
unit, set the unit to 1 with a probability given by formula” (4.35):
P(σ_i = 1) = 1 / (1 + exp( (E(σ_i = 1) − E(σ_i = 0)) / T ))

P(σ_i = 1) = 1 / (1 + exp( −(E(σ_i = 0) − E(σ_i = 1)) / T ))

P(σ_i = 1) = 1 / (1 + exp( −ΔE_i′ / T ))     (4.35)

where: ΔE_i′ = E(σ_i = 0) − E(σ_i = 1).

Hinton found inspiration in Boltzmann’s work, specifically the principle that a network consisting
of a large number of units, with each unit interacting with neighbouring units, will approach a
canonical distribution at equilibrium given by the Boltzmann–Gibbs distribution. Although the
development of Boltzmann machines has been motivated by ideas from statistical physics, they
are nevertheless neural networks. Therefore, the following two differences, which we mentioned
in the context of Markov random fields, should be carefully noted. Firstly, in neural networks the
parameter 𝐓 plays the role of a pseudo–temperature that has no physical meaning. Secondly, in
neural networks Boltzmann’s constant 𝐤 can be taken equal to 1.
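The update rules (4.34)/(4.35) reduce to a logistic function of the unit’s energy gap. A minimal sketch, with an arbitrary toy network and with `prob_on`/`update_unit` as hypothetical helper names:

```python
import math
import random

def prob_on(delta_e, T):
    """P(sigma_i = 1) = 1 / (1 + exp(-DeltaE_i' / T)), eq. (4.35)."""
    return 1.0 / (1.0 + math.exp(-delta_e / T))

def update_unit(sigma, i, W, theta, T, rng=random):
    """Stochastically set unit i; for energy (4.31) the gap
    DeltaE_i' = E(sigma_i = 0) - E(sigma_i = 1) equals the net input to i."""
    net = sum(W[i][j] * sigma[j] for j in range(len(sigma)) if j != i) - theta[i]
    sigma[i] = 1 if rng.random() < prob_on(net, T) else 0

# Degenerate checks: a huge positive gap makes the unit (almost) surely on.
assert prob_on(50.0, 1.0) > 0.999999
assert prob_on(-50.0, 1.0) < 1e-6

# One stochastic update on an arbitrary 3-unit network.
sigma = [0, 1, 0]
W = [[0.0, 2.0, -1.0], [2.0, 0.0, 0.5], [-1.0, 0.5, 0.0]]
theta = [0.0, 0.0, 0.0]
random.seed(0)
update_unit(sigma, 0, W, theta, T=1.0)
assert sigma[0] in (0, 1)
```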
The local nature of the update rules (4.34) and (4.35) ensures that raising the noise level is
equivalent to decreasing all the energy gaps between configurations, so in thermal equilibrium
the relative probability of two configurations 𝜎 and 𝜎′ is determined solely by their energy
difference and follows a Boltzmann distribution:
P(σ) / P(σ′) = exp( −(E(σ) − E(σ′)) / T )     (4.36)
where: 𝜎 and 𝜎′ are two configurations of a Boltzmann machine; 𝐏(𝜎) is the probability of the
Boltzmann machine to have the configuration 𝜎; 𝐸(𝜎) is the energy of the configuration 𝜎; and
𝐓 is the pseudo–temperature.
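On a network small enough to enumerate, relation (4.36) can be checked directly against the exact Boltzmann–Gibbs distribution; all parameters below are arbitrary illustrative choices:

```python
import itertools
import math

import numpy as np

rng = np.random.default_rng(1)
n, T = 4, 1.5
W = rng.normal(size=(n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
theta = rng.normal(size=n)

def E(s):
    """Energy (4.33) of a 0/1 configuration s."""
    s = np.asarray(s, float)
    return -(0.5 * s @ W @ s - theta @ s)

states = list(itertools.product([0, 1], repeat=n))
Z = sum(math.exp(-E(s) / T) for s in states)          # partition function
P = {s: math.exp(-E(s) / T) / Z for s in states}      # Boltzmann distribution

s1, s2 = states[3], states[9]
# Eq. (4.36): the probability ratio depends only on the energy difference.
assert math.isclose(P[s1] / P[s2], math.exp(-(E(s1) - E(s2)) / T))
assert math.isclose(sum(P.values()), 1.0)
```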
Hinton’s justification for this heuristic was the fact that energy barriers are what prevent a
system from reaching equilibrium rapidly at low pseudo–temperature and, if the energy barriers
can be suppressed or at least surpassed, equilibrium can be achieved rapidly at a pseudo–
temperature at which the distribution strongly favors the lower minima [43]. However, in Hinton’s
opinion, the energy barriers cannot be permanently removed because they correspond to states
that violate the constraints and the energies of these states must be kept high to prevent the
system from settling into them. Thereby, a solution to surpass the energy barriers between low–
lying states was needed. Hinton realized that, in a complex system with a high–dimensional
state space like the Boltzmann machine, the energy barriers between low–lying states are
highly degenerate, so the number of ways of getting from one low–lying state to another is an
exponential function of the height of the barrier one is willing to cross [43]. Thus, the effect of
either one of the update rules (4.34) and (4.35) is the opening of an enormous variety of paths
for escaping from a local minimum and, even though each path by itself is unlikely, it is highly
probable that the system would cross the energy barrier between two low–lying states [43].
4.4 How a Boltzmann Machine models data
As a binary pairwise Markov network, a Boltzmann machine operates with probabilities,
specifically it associates to each input a probability distribution over the output. Hence, we are
looking at two categories of data that are essential for this type of network: environmental data
and probability distributions associated with the underlying graphical model.
The input or environmental data consist of a set of binary vectors. Each input vector is
mapped one–to–one to the set of visible units, in this way producing a configuration over the
visible units. The problem that we need to address regarding these configurations is to fit a
model that will assign a probability to every possible configuration over the visible units. The
formulae (4.25) show that there are 2^m such configurations. Knowing this probability
distribution would allow us to decide whether other binary vectors come from the same
distribution. The network is said to have a perfect model of the environment if it achieves
exactly the same probability distribution over these 2^m configurations when it is running
freely at thermal equilibrium with no environmental input.
In order to allow the network to approach thermal equilibrium, Hinton makes the
assumptions that each environmental input vector persists for long enough and that any
structure in the sequence of environmental vectors is ignored [11]. The
distribution over all visible configurations 𝑣 is nothing else than the marginal distribution over
all the configurations 𝜎 of the network.
The probability distributions associated with the underlying graphical model can be divided
into three categories: joint configuration probabilities, conditional probabilities, and
marginals. To define these distributions, we follow an approach similar to the one we
used in Section 4.3 for the individual unit.
Joint configuration probabilities:
Let us consider the configuration 𝜎 of visible units 𝑣 and hidden units ℎ given by the
equation (4.27). Such a configuration is often called the “joint configuration” of 𝑣 and ℎ.
The probability of the joint configuration 𝜎 is related to the energy of that configuration,
which is given by the equations (4.31). Therefore, we start by making 𝑣 and ℎ explicit in
(4.31):

E(σ) = E(v, h) = −∑_{ {i,j}∈𝒢, i<j } σ_i ∙ σ_j ∙ w_ij + ∑_{i∈𝒩} σ_i ∙ θ_i     (4.37)
In Section 4.3 we have learned that, in a Boltzmann machine, the energy of a
configuration 𝜎 can be seen as a real function defined on I𝒩; therefore, according to the
definition (4.10), it belongs to the Hamiltonian family. Moreover, the energy of a configuration
𝜎 is the negative of a quadratic Hamiltonian. Then, the logical steps we need to follow to
obtain the expression of the probability of a joint configuration 𝜎 = (𝑣, ℎ) are the same
steps we followed in Section 4.1 to define the Gibbs measure associated to a
Hamiltonian. We start by looking at how the energies of joint configurations are related to
their probabilities and we identify two ways in which they are logically connected:
In one way we can define the probability of a joint configuration 𝜎 = (𝑣, ℎ) by using
an exponential model similar to the one used in the definition (4.13):
𝐏(𝑣, ℎ) ∝ exp(−𝐸(𝑣, ℎ)) (4.38)
Alternatively, we can define the probability of a joint configuration 𝜎 = (𝑣, ℎ) to be the
probability of finding the network in that particular joint configuration after we have
updated all of the stochastic binary units many times. Thus, the probability of a joint
configuration over both visible and hidden units depends on the energy of that joint
configuration compared with the energy of all other joint configurations. This
approach also follows the definition (4.13).
To comply with both requirements, we need to specify a normalization factor in the first
definition that is compatible with the second definition and this is exactly the partition
function defined by (4.14):
Z = ∑_{ u∈I^𝒱, g∈I^ℋ } exp(−E(u, g))     (4.39)
Therefore, the probability of a joint configuration (𝑣, ℎ) of visible and hidden units is
defined as:
P(v, h) = exp(−E(v, h)) / Z = exp(−E(v, h)) / ∑_{ u∈I^𝒱, g∈I^ℋ } exp(−E(u, g))     (4.40)
Conditional probabilities:
The conditional distributions over hidden and visible units are given by:
P(h_j = 1 | v, h_−j) = sigm( ∑_{i∈𝒱} v_i ∙ w_ij + ∑_{m∈ℋ−{j}} h_m ∙ w_mj − θ_j )     (4.41)

P(v_i = 1 | h, v_−i) = sigm( ∑_{j∈ℋ} h_j ∙ w_ji + ∑_{k∈𝒱−{i}} v_k ∙ w_ki − θ_i )     (4.42)
where 𝑥−𝑖 denotes a vector 𝑥 with the ith component 𝑥𝑖 omitted and sigm is the logistic
function.
Marginal probabilities:
The probability of a configuration 𝑣 of the visible units is the sum of the probabilities of all
the joint configurations that contain it; it is identical to the marginal distribution over
the configuration 𝑣 of visible units. It is computed with the following formula:

P(v) = ∑_{h∈I^ℋ} exp(−E(v, h)) / ∑_{ u∈I^𝒱, g∈I^ℋ } exp(−E(u, g))     (4.43)
The formulae (4.40) to (4.43) show that all the distributions specific to a generic Boltzmann
machine are intractable. The main reason for their intractability is the computation of the
partition function 𝑍 given by the equation (4.39).
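The source of the intractability is visible in a brute–force evaluation of (4.43): both sums range over exponentially many configurations. A toy–sized sketch, with sizes and parameters chosen arbitrarily:

```python
import itertools
import math

import numpy as np

n_vis, n_hid, T = 3, 2, 1.0
n = n_vis + n_hid
rng = np.random.default_rng(2)
W = rng.normal(size=(n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
theta = rng.normal(size=n)

def E(s):
    """Energy (4.33) of a 0/1 configuration s = (v, h)."""
    s = np.asarray(s, float)
    return -(0.5 * s @ W @ s - theta @ s)

def p_visible(v):
    """Eq. (4.43): 2^|H| terms in the numerator, 2^n terms in Z."""
    num = sum(math.exp(-E(tuple(v) + h) / T)
              for h in itertools.product([0, 1], repeat=n_hid))
    Z = sum(math.exp(-E(s) / T)
            for s in itertools.product([0, 1], repeat=n))
    return num / Z

# The 2^m visible-configuration probabilities form a distribution.
total = sum(p_visible(v) for v in itertools.product([0, 1], repeat=n_vis))
assert math.isclose(total, 1.0)
```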
4.5 General dynamics of Boltzmann Machines
Hinton investigated the behaviour of the Boltzmann machine under either one of the
update rules (4.34) and (4.35) in two scenarios: when approaching the thermal equilibrium and
after the thermal equilibrium was reached. To help understand the concept of thermal
equilibrium, he suggested an intuitive way to think about it which is inspired by the idea behind
the equation (2.7):
Imagine a huge ensemble of systems that all have exactly the same energy function; then the probability of a global state of the ensemble is just the fraction of the systems that have the corresponding state.
Similarly, reaching thermal equilibrium in a Boltzmann machine does not mean that the system
has settled down into the lowest energy configuration, but that the probability distribution over
configurations settles down to the stationary distribution. Hinton came up with an algorithm to
describe the dynamics of the Boltzmann machine. We are going to give firstly an intuitive
description of the algorithm (Algorithm 4.1) and then we are going to present the algorithm
formally (Algorithm 4.2).
Algorithm 4.1: Boltzmann Machine Dynamics 1
Given: W , 𝐓
begin
Step 1: Start with any distribution over all the identical units.
We could start with all the units in the same state or with an equal number of
units in each possible state.
Step 2: Keep applying the stochastic update rule (4.34) to pick the next state
for each randomly selected individual unit.
Step 3: After running the units stochastically in the right way, the network may
eventually reach a situation where the fraction of units in each state remains
constant.
This is the stationary distribution that physicists call thermal equilibrium.
Step 4: After reaching the thermal equilibrium, any given unit keeps changing
its state but the fraction of units in each state does not change. In other words,
once equilibrium has been reached, the number of units that “leave” a
configuration at each time step will be equal to the number of units that
“enter” that configuration.
end
We start our approach to formally define the dynamics of the Boltzmann machine by
establishing the assumptions the network works under:
the pseudo–temperature is 𝐓;
time is discrete and is represented by the set of nonnegative integers 𝒯 = {0,1,2,… };
at each time 𝑡 ∈ 𝒯 one unit 𝑖 ∈ 𝒩 is selected at random for a possible update.
We observe that, if a Boltzmann machine has the weights and thresholds varying in time,
deterministically or stochastically, then the Boltzmann machine becomes a particular case of a
time–varying neural network, whose definition follows.
Definition 4.3:
A time–varying Boltzmann machine 𝐓𝐕𝐁𝐌 is a 𝐁𝐌 that has a fixed 𝒩, a fixed 𝒢, and whose
parameters W = (W, Θ) can vary in time, either deterministically or stochastically.
Precisely, a 𝐓𝐕𝐁𝐌 is a four–tuple (𝒩, 𝒢, W, Θ) where W = {W(t)}_{t∈𝒯} and Θ = {Θ(t)}_{t∈𝒯} are
ℝ^𝒢–valued and ℝ^𝒩–valued stochastic processes, respectively.
With respect to Definition 4.3, we make the following remarks:
If 𝐁𝐌 is a 𝐓𝐕𝐁𝐌, then the net input computed according to the equation (4.9) by using the
weights W(t) and the thresholds Θ(t) is called the net input at time t and is denoted
net_t(i, σ).
The unit 𝑖 selected for update at moment 𝑡 and denoted 𝑖(𝑡) finds out what its new state is
going to be by computing two quantities: its net input at time 𝑡 and the probability given by
the update rule.
The update rules (4.34) and (4.35) become time–varying as well, so they need to be
changed to reflect the time factor. However, before we modify the general update rule (4.34)
to reflect the time factor, we need to rewrite the rule to highlight the states of the ith unit at
two consecutive moments: 𝑡 − 1 and 𝑡.
Claim: Given a configuration 𝜎 and a unit 𝑖, the update rule (4.34) can be written as:
P(σ_i(t) | σ_i(t − 1)) = 1 / (1 + exp( −(2 ∙ σ_i(t) − 1) ∙ net(i, σ(t − 1)) / T ))     (4.44)
Proof: We start by looking at the denominator in the formula (4.34), specifically at the term Δ𝐸𝑖.
We know from the equation (4.32) that the energy of a configuration is the negative of the
quadratic Hamiltonian of that configuration, so we are focusing our attention on the quadratic
Hamiltonian given by the equation (4.16).
Given a configuration 𝜎 and a unit 𝑖, we can split the quadratic Hamiltonian associated to the
configuration 𝜎 into two terms such that one term reflects the contribution of the unit 𝑖 and other
term H′ incorporates the contribution of all the units except 𝑖. We observe that the term that
reflects the contribution of unit 𝑖 to the quadratic Hamiltonian is related to the net input to unit 𝑖
defined by the equation (4.9). Therefore, we can write:
H_W(σ) = net(i, σ) ∙ σ_i + H′     (4.45)
In particular, if σ^(i) denotes the configuration obtained from σ by switching the value of the ith
unit, then we can compute the variation of the quadratic Hamiltonian corresponding to this
operation:

ΔH = H_W(σ^(i)) − H_W(σ) = net(i, σ^(i)) ∙ σ_i^(i) + H′ − net(i, σ) ∙ σ_i − H′

ΔH = net(i, σ^(i)) ∙ σ_i^(i) − net(i, σ) ∙ σ_i     (4.46)
A few observations have to be made with respect to the equation (4.46).
Firstly, the net input of a configuration towards a unit (equation (4.9)) doesn’t depend on the
state of that unit. This means that two configurations that differ only in one and the same
unit have exactly the same net input to that unit, assuming that the parameters of the
network are the same. This fact translates to:
net(i, σ^(i)) = net(i, σ)     (4.47)

Accordingly, the equation (4.46) can be rewritten as:

ΔH = net(i, σ) ∙ (σ_i^(i) − σ_i) = net(i, σ) ∙ Δσ_i     (4.48)
Secondly, switching the value of the ith hypothesis of a configuration that is also an I–valued
configuration, where I = {0,1}, is the same as applying the following formula to that
hypothesis:
σ_i = 1 − σ_i^(i) ⇔ σ_i^(i) = 1 − σ_i     (4.49)

Accordingly, the equation (4.48) can be written in an equivalent form using only the new
state σ_i^(i) given by the equation (4.49):

ΔH = net(i, σ) ∙ (σ_i^(i) − 1 + σ_i^(i)) = net(i, σ) ∙ (2 ∙ σ_i^(i) − 1)     (4.50)
Thirdly, we compute ΔE_i:

ΔE_i = E(σ^(i)) − E(σ) = −H_W(σ^(i)) + H_W(σ) = −ΔH

ΔE_i = −net(i, σ) ∙ Δσ_i = −(2 ∙ σ_i^(i) − 1) ∙ net(i, σ)     (4.51)
If we substitute (4.51) into (4.34) we obtain:

P(σ_i(t)) = 1 / (1 + exp( −(2 ∙ σ_i(t) − 1) ∙ net(i, σ(t − 1)) / T ))     (4.52)
The state of the ith unit at time 𝑡 depends only on its previous state. Therefore, the equation
(4.52) becomes exactly (4.44).
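The identity (4.51), ΔE_i = −(2 ∙ σ_i^(i) − 1) ∙ net(i, σ), can be verified numerically by comparing it with a direct energy difference on a toy network (parameters arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
W = rng.normal(size=(n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
theta = rng.normal(size=n)

def E(s):
    """Energy (4.33) of a 0/1 configuration s."""
    s = np.asarray(s, float)
    return -(0.5 * s @ W @ s - theta @ s)

def net(i, s):
    """Net input to unit i; it does not depend on s[i] (cf. eq. (4.47))."""
    return sum(W[i, j] * s[j] for j in range(n) if j != i) - theta[i]

s = rng.integers(0, 2, size=n)
for i in range(n):
    flipped = s.copy()
    flipped[i] = 1 - flipped[i]          # sigma^(i): flip the i-th unit
    dE = E(flipped) - E(s)               # direct energy difference
    assert np.isclose(dE, -(2 * flipped[i] - 1) * net(i, s))
```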
After we incorporate all the remarks regarding Definition 4.3 into the formula (4.44), we obtain
the following update rule for a 𝐓𝐕𝐁𝐌. This is the rule used by Algorithm 4.2.
P(σ_{i(t)}(t) | σ_{i(t)}(t − 1)) = 1 / (1 + exp( −(2 ∙ σ_{i(t)}(t) − 1) ∙ net_{t−1}(i(t), σ(t − 1)) / T ))     (4.53)
Definition 4.4:
A Boltzmann Machine Dynamics (BMD) on a network 𝐁𝐌/𝐓𝐕𝐁𝐌 is a Markov chain {σ(t)}_{t∈𝒯}
with state space I^𝒩 and whose transitions occur according to the following algorithm:
Algorithm 4.2: Boltzmann Machine Dynamics 2
Given: W , 𝐓
begin
Step 1: repeat
Step 2: at each time t ∈ 𝒯 one unit i(t) is chosen at random from the set
𝒩 ∪ {0} with the probability 1/(n + 1)
Step 3: if the unit 𝑖(𝑡) is the “0” unit then set: 𝜎(𝑡) = 𝜎(𝑡 − 1)
else
Step 4: compute the net input to unit i(t):
x = net_{t−1}(i(t), σ)
Step 5: generate the candidate state y for the unit i(t):
y = 1 − σ(t − 1)_{i(t)}
Step 6: if y = 0 then compute the probability: P = 1 / (1 + exp(x / T))
else compute the probability: P = 1 / (1 + exp(−x / T))
Step 7: if 𝑥 ∙ (2𝑦 − 1) > 0 then accept y as the state of unit 𝑖(𝑡):
𝜎(𝑡)𝑖(𝑡) = 𝑦
else accept y as the state of unit 𝑖(𝑡) with probability 𝐏:
if random(0,1) < 𝐏 then 𝜎(𝑡)𝑖(𝑡) = 𝑦
Step 8: 𝜎(𝑡)𝑗 = 𝜎(𝑡 − 1)𝑗 for any 𝑗 ≠ 𝑖(𝑡)
Step 9: until stopping criterion true
end
In Algorithm 4.2 the notation random(0,1) denotes a sample from a uniform distribution 𝒰[0,1].
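Algorithm 4.2 can be transcribed almost line by line; the sketch below uses a fixed number of steps as the stopping criterion and arbitrary toy parameters, both illustrative choices:

```python
import math
import random

def bmd_step(sigma, W, theta, T, rng):
    """One transition of the Boltzmann Machine Dynamics Markov chain."""
    n = len(sigma)
    i = rng.randrange(n + 1) - 1          # pick from N ∪ {0}, each w.p. 1/(n+1)
    if i < 0:                             # the "0" unit: keep the configuration
        return sigma
    # Step 4: net input to unit i given the current configuration.
    x = sum(W[i][j] * sigma[j] for j in range(n) if j != i) - theta[i]
    y = 1 - sigma[i]                      # Step 5: candidate state (flip unit i)
    # Step 6: acceptance probability for the candidate state.
    p = 1.0 / (1.0 + math.exp(-x / T)) if y == 1 else 1.0 / (1.0 + math.exp(x / T))
    new = list(sigma)                     # Step 8: all other units are copied
    # Step 7: downhill moves are accepted outright, uphill moves w.p. p.
    if x * (2 * y - 1) > 0 or rng.random() < p:
        new[i] = y
    return new

# Illustrative run on a tiny arbitrary network.
rng = random.Random(0)
n = 4
W = [[0.0 if i == j else 0.5 for j in range(n)] for i in range(n)]
theta = [0.1] * n
sigma = [0, 1, 0, 1]
for _ in range(1000):                     # "until stopping criterion": fixed steps
    sigma = bmd_step(sigma, W, theta, T=1.0, rng=rng)
assert all(s in (0, 1) for s in sigma)
```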
Furthermore, we consider Boltzmann machines with inputs and their corresponding dynamics.
Specifically we clamp certain units at certain values of the activation levels and do not allow
them to switch. The set of units that are clamped at a particular time 𝑡 will be allowed to depend
on t. Suppose we choose a subset 𝒞 ⊆ 𝒩, which may depend on t, in which case we denote it
by 𝒞(t) ⊆ 𝒩. Also suppose we have an “external input,” i.e., a process v = {v(t)}_{t∈𝒯} such that
each v(t) takes values in I^𝒞(t).
Definition 4.5:
A Boltzmann Machine Dynamics with Inputs (BMDI) is a BMD process that follows Algorithm
4.2 except that in Step 2 the choice of i(t) is limited to the set (𝒩 ∪ {0}) − 𝒞(t); consequently
the probability of a particular unit being chosen is 1/|(𝒩 ∪ {0}) − 𝒞(t)| = 1/(n + 1 − |𝒞(t)|).
Therefore, Step 2 of the algorithm looks like this:
Step 2: at each time t ∈ 𝒯 one unit i(t) is chosen at random from the set
(𝒩 ∪ {0}) − 𝒞(t) with the probability 1/(n + 1 − |𝒞(t)|)
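Only the unit–selection step changes between BMD and BMDI; a sketch of the modified Step 2, with the clamped set passed in explicitly (the interface and function name are illustrative choices):

```python
import random

def pick_unit_bmdi(n, clamped, rng):
    """Choose i(t) uniformly from (N ∪ {0}) − C(t), prob. 1/(n+1−|C(t)|).

    Units are 0..n-1 and -1 stands for the "0" unit (never clamped here).
    """
    candidates = [-1] + [i for i in range(n) if i not in clamped]
    return rng.choice(candidates)

rng = random.Random(0)
clamped = {0, 2}                    # clamp units 0 and 2 to their current states
picks = {pick_unit_bmdi(5, clamped, rng) for _ in range(200)}
assert clamped.isdisjoint(picks)    # clamped units are never selected for update
```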
4.6 The biological interpretation of the model
In this section we present Hinton’s original argumentation to support the idea that
Boltzmann machines bear resemblance to the brain; therefore, it is worth studying them. We
start by presenting some of the facts that make the Boltzmann machine belong to the same
general class of computation devices as the brain. Then we present some irreconcilable
differences between Boltzmann machine and the cortex. Both categories of arguments (pro and
contra) had been presented by Hinton in [10]. Before we present the arguments, we need to
define two concepts, native to physiology, which we are going to use.
An action potential is a short–lasting event in which the electrical membrane potential of a cell
rapidly rises and falls, following a consistent trajectory; in other words, it is a propagated impulse.
An electrotonic potential is a non–propagated local potential, resulting from a local change in
ionic conductance (e.g. synaptic or sensory that engenders a local current); when it spreads
along a stretch of membrane, it becomes exponentially smaller (decrement).
Similitudes between Boltzmann machine and the cortex:
The cerebral cortex is relatively uniform in structure.
Different areas of cerebral cortex are specialized for processing information from
different sensory modalities such as: visual cortex, auditory cortex, and somatosensory
cortex. Other areas are specialized for motor functions. However, all these cortical areas
have a similar anatomical organization and are more similar to each other in
cytoarchitecture than they are to any other part of the brain [10].
Many problems in vision, speech recognition, associative recall, and motor control can
be formulated as searches. The similarity between different areas of cerebral cortex
suggests that the same kind of massively parallel searches may be performed in many
different cortical areas [10].
Differences between Boltzmann machine and the cortex:
Binary states and action potentials
The simple binary units which are components of Boltzmann machines are not literal
models of cortical neurons. According to Hinton, the assumptions that the binary units
change their states asynchronously and use a probabilistic decision rule seem
closer to reality than a model with synchronous deterministic updating [10].
The energy gap for a binary unit has a role similar to that played by the membrane
potential for a neuron: both are the sum of excitatory and inhibitory inputs and both are
used to determine the output state. However, the cortical neurons produce action
potentials, which are brief spikes that propagate down axons, rather than binary outputs.
When an action potential reaches a synapse, the signal it produces in the postsynaptic
neuron rises to a maximum and then exponentially decays with the time constant of the
membrane (typically around five milliseconds for neurons in cerebral cortex). The effect
of a single spike on the postsynaptic cell body may be further broadened by electrotonic
transmission down the dendrite to the cell body [10].
The energy gap represents the summed input from all the recently active binary units. If
the average time between updates is identified with the average duration of a
postsynaptic potential, then the binary pulse between updates can be considered an
approximation to the postsynaptic potential. Although the shape of a single binary pulse
differs significantly from a postsynaptic potential, the sum of a large number of stochastic
pulses is independent of the shape of the individual pulses and depends only on their
amplitudes and durations [10]. According to Hinton, for large networks having the large
fan–ins typical of cerebral cortex (around 10000), the binary approximation may not be
too bad [10].
Implementing pseudo–temperature in units
The membrane potential of a neuron is graded, but if it exceeds a fairly sharp threshold,
an action potential is produced, followed by a refractory period lasting several
milliseconds, during which another action potential cannot be elicited. If Gaussian noise
is added to the membrane potential, then even if the total synaptic input is below
threshold, there is a finite probability that the membrane potential will reach threshold
[10]. The amplitude of the Gaussian noise determines the width of the sigmoidal
probability distribution for the unit to fire during a short time interval and it therefore plays
the role of pseudo–temperature in the model [10].
According to Hinton, a cumulative Gaussian is a very good approximation to the
required probability distribution but it might be difficult to implement because the units in
the network should be arranged in such a way that all of them have the same amplitude
of noise [10].
Asymmetry and time–delays
In a generic Boltzmann machine all the connections are symmetrical. This assumption is
not always true for neurons in the cerebral cortex. However, if the constraints of a
problem are inherently symmetrical and if the network, on average, approximates the
required symmetrical connectivity, then random asymmetries in a large network will be
reflected as an increase in the Gaussian noise in each unit [10]. Hinton proposes the
following experiment to see why random asymmetry acts as Gaussian noise:
Consider a symmetrical network in which pairs of units are linked by two equal one–way connections, one in each direction. Then perform the following operation on all pairs of these one–way connections: remove one of the connections and double the strength of the other. Provided the choice of which connection to remove is made randomly, this operation will not alter the expected value of the input to a unit from the other units. On average, it will “see” half as many other units but with twice the weight. So if a unit has a large fan–in, it will be able to make a good unbiased estimate of what its total input would have been if the links had not been cut. However, the use of fewer, larger weights will increase the variance of the energy gap and will thus act as added noise.
Experimentally Hinton came to the conclusion that time–delays act like added noise as
well. His experimental results have been confirmed mathematically for first order
constraints, provided the fan–in is large and the weights are small compared with the
energy gaps [10].
Chapter 5. The Mathematical Theory of Learning Algorithms for Boltzmann Machines
One of the most interesting aspects of the Boltzmann machine formalization is that it
leads to a domain–independent learning algorithm [10]. Intuitively, learning for Boltzmann
machines means "acquiring a particular behavior by observing it” [3], i.e., progressively
adjusting the connection strengths between units in such a way that the whole network develops
an internal model which captures the underlying structure of the environment [10]. The goal of
learning in Boltzmann machines is rather different from other learning algorithms like, for
instance, backpropagation learning. Rather than learning a non–linear model from inputs to
outputs, the goal of learning in the classical asynchronous Boltzmann machine is to improve the
network’s model of the structure of the environment by choosing the parameters of the network
such that the stochastic behaviour observed on the visible units when the network is free–
running closely models that observed in the environment [43].
5.1 Problem description
The formal definition of the learning process we present is inspired by Sussmann’s work
[1,3] but at the same time reflects our understanding of this family of algorithms. Before we
formalize the learning process, we lay out the context it operates in. Suppose we are given a
Boltzmann machine 𝐁𝐌 = (𝒩, 𝒢, W, Θ) with |𝒩| = n and with the set of random variables
associated to the units denoted X = (𝑋1, 𝑋2, … , 𝑋𝑛). In this way we establish the connection
between the Boltzmann machine learning and the Markov networks discussed in Chapter 2 and
Chapter 3. Thus, a configuration 𝜎 of 𝐁𝐌 is nothing else than an instantiation of the set of
random variables X of the underlying Markov network.
According to the definition (4.23), the true probability distribution 𝐏 of a joint configuration in a
Boltzmann machine is, in fact, the Gibbs measure 𝐆_𝐖 (equation (4.13)) associated to the
Hamiltonian 𝐇_𝐖 (equation (4.18)), which itself would be associated to the parameters 𝐖 of the
network at thermal equilibrium, if they were known. Because the partition function of a
Boltzmann machine is generally intractable, all these quantities – 𝐖, 𝐇_𝐖, 𝐆_𝐖, and 𝐏 – cannot
be determined exactly. Therefore, we resort to their approximations, which in principle are Ŵ,
𝐇_Ŵ, 𝐆_Ŵ, and P̂. In order to make some proofs easier to grasp, we might use more suggestive
94
notations for some of these variables. If that is the case, we will specify, if applicable, the
correspondence between the notations.
Definition 5.1:
Given a Boltzmann machine 𝐁𝐌 = (𝒩, 𝒢′, W) and a sequence of random configurations 𝜎 ∈ I𝒩, distributed according to a probability 𝐏̃, which are presented to the network as inputs at various times, a learning process 𝓛 is a sequence of pairs (W, 𝜎) ∈ ℝ𝒢′ × I𝒩 that satisfies the following property: the parameters W converge to a value W̃ such that the corresponding Gibbs measure 𝐆W̃ is the same as the observable distribution 𝐏̃ of the configurations 𝜎 presented to the network:

lim_{𝑡→∞} W = W̃ such that 𝐆W̃ = 𝐏̃ (5.1)
A variant of the learning process has 𝒩 split into two disjoint sets 𝒱 (visible units) and ℋ (hidden units). Thus, the observable distribution is a probability distribution over I𝒱 and the learning process evolves in ℝ𝒢′ × I𝒱. Definition 5.2 characterizes this scenario.
Definition 5.2:
Given a Boltzmann machine 𝐁𝐌 = (𝒩, 𝒢′, W) and a sequence of random data vectors 𝑣 ∈ I𝒱, distributed according to a probability 𝐏̃, that are presented to the network as inputs at various times, a learning process 𝓛 is a sequence of pairs (W, 𝑣) ∈ ℝ𝒢′ × I𝒱 that satisfies the following property: the parameters W converge to a value W̃ such that the marginal of the corresponding Gibbs measure 𝐆W̃ over the variables 𝑣 ∈ I𝒱 is the same as the observable distribution 𝐏̃ of the visible vectors 𝑣 presented to the network:

lim_{𝑡→∞} W = W̃ such that MARG(𝐆W̃, 𝒱) = 𝐏̃ (5.2)
In reality, MARG(𝐆𝐖, 𝒱) and 𝐏̃ are not equal, so Definition 5.2 rather expresses a desired goal of the learning process. Therefore, the aim of asynchronous Boltzmann machine learning becomes to reduce the difference between MARG(𝐆𝐖, 𝒱) and 𝐏̃ by performing gradient descent in the parameter space on a suitable measure of their difference.
The environment imposes the distribution 𝐏̃ over the network by clamping the visible units, which means the following:
each member of I𝒱 is probabilistically selected using 𝐏̃; the probability of selecting 𝑣 is 𝐏̃(𝑣);
the selected members of I𝒱 are presented to the network sequentially;
each selected vector 𝑣 is tested by running the Boltzmann machine for a time unit long enough for the network to reach thermal equilibrium;
in each time unit the following two steps take place:
Step 1: all the units are updated;
Step 2: the visible units are reset to 𝑣.
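The clamping procedure can be sketched in code as follows; this is only an illustrative sketch, and the helper names (`clamp_and_run`, `update_unit`) and the dictionary encodings are assumptions of the sketch, not notation from the thesis:

```python
import random

def clamp_and_run(q, hidden_ids, visible_ids, update_unit, steps=100):
    """Sketch of the clamping procedure: draw a visible vector v with
    probability q(v), then repeatedly update all units and re-clamp the
    visible units back to v after every sweep."""
    # Probabilistically select a visible vector v using the distribution q
    vectors, probs = zip(*q.items())
    v = random.choices(vectors, weights=probs)[0]
    # Initialize the network state: visibles set to v, hiddens random
    state = {i: v[k] for k, i in enumerate(visible_ids)}
    state.update({j: random.randint(0, 1) for j in hidden_ids})
    for _ in range(steps):                       # one "time unit"
        for unit in visible_ids + hidden_ids:    # Step 1: update all units
            state[unit] = update_unit(state, unit)
        for k, i in enumerate(visible_ids):      # Step 2: reset visibles to v
            state[i] = v[k]
    return v, state
```

In a full implementation `update_unit` would be the stochastic Metropolis update; here it is left as a parameter.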
We introduce the following definitions and notations for the probability distributions that play a
role in asynchronous Boltzmann machine learning. Then, we summarize them in Table 1.
Let 𝐏𝐓(σ) = 𝐏𝐓(𝑣, ℎ) be the free running equilibrium distribution at pseudo–temperature 𝐓.
Let 𝐏𝐓(ℎ|𝑣) be the probability of the free running network, at thermal equilibrium, that the
hidden units are set to ℎ given that the visible units are set to 𝑣 on the very same time step.
Let 𝐩𝐓(𝑣) be the probability distribution over the states of the visible units when the network
in thermal equilibrium is running freely at pseudo–temperature 𝐓.
Let 𝐪(𝑣) be the environmentally imposed probability distribution over the state vectors 𝑣 of
visible units.
Let 𝐐𝐓(ℎ|𝑣) be the probability that vector ℎ will occur on the hidden units when 𝑣 is clamped
on the visible units and the network of hidden units is allowed to run at pseudo–temperature
𝐓.
Let 𝐐𝐓(𝜎) = 𝐐𝐓(𝑣, ℎ) be the probability of observing the global state 𝜎 over multiple runs in
which successive vectors 𝑣 are clamped with probability 𝐪(𝑣).
𝐏𝐓 represents the probability described as 𝐆𝐖 in Definition 5.2.
𝐩𝐓 represents the probability described as MARG(𝐆𝐖, 𝒱) in Definition 5.2.
𝐪 represents the probability described as 𝐏̃ in Definition 5.2.
Table 1 Distributions of interest in asynchronous Boltzmann machine learning

Equilibrium (true) distribution 𝐏𝐓 ≡ 𝐆𝐖 (defined on the whole state space 𝒩 = 𝒱 ∪ ℋ):
    visible units clamped: 𝐩𝐓(𝑣)
    free–running: 𝐏𝐓(𝜎)
Environmental (data) distribution 𝐐𝐓 (defined on the state space 𝒩 and observable on the visible space 𝒱):
    visible units clamped: 𝐪(𝑣)
    free–running: 𝐐𝐓(𝜎)
There is a subtle difference between the conditional distribution 𝐏𝐓(ℎ|𝑣), which refers to the free
process, and 𝐐𝐓(ℎ|𝑣), which refers to the clamped process. During the free process, the visible
units are allowed to change on every time step; therefore, 𝐏𝐓(ℎ|𝑣) quantifies the probability that
the network arrives at configuration (𝑣, ℎ) on the very same time step. During the clamped
process, the visible units are initially set to 𝑣 and only the network of hidden units is allowed to
freely run; therefore, 𝐐𝐓(ℎ|𝑣) quantifies the probability that the network of hidden units arrives at
configuration ℎ in a time step following the initial time step when the visible units have been
clamped.
A formal description of the learning problem in Boltzmann machines is presented below.
Problem Boltzmann Machine Learning:
Given: 𝒩, 𝒢′, and the split of the units into visible and hidden: 𝒩 = 𝒱 ∪ ℋ;
a set of data vectors 𝑣 ∈ I𝒱 used as inputs at various times;
the vectors 𝑣 are distributed according to an observable distribution 𝐪;
𝐪/𝐐𝐓 belong to an exponential family and have parameters W;
𝐪 = 𝐩𝐓.
Find: the best possible W̃ close to W.
Subject to:
1. a 𝐁𝐌 with visible units 𝑣 and hidden units ℎ runs on all its 𝑛 units according to a probability 𝐏𝐓 ≡ 𝐆𝐖 that has parameters W;
2. W̃ converges to W.
The learning algorithms for Boltzmann machines build on approximate inference algorithms for pairwise Markov networks. Based on the approach employed to perform approximate inference, the learning algorithms for Boltzmann machines can be divided into two groups or families:
one family uses approximate maximum likelihood methods;
the other family uses variational methods to compute the free energies.
5.2 Phases of a learning algorithm in a Boltzmann Machine
By performing learning, the Boltzmann machine captures the underlying structure of its environment and becomes capable of performing various pattern completion tasks. In one type of task, the network must complete a pattern from any sufficiently large part of it, without knowing in advance which part must be completed. In another type of task, the network knows in advance which parts of the pattern will be given as input and which parts will have to be completed as output. These two pattern completion paradigms lead to the presence of two phases in the learning procedure, one phase corresponding to each paradigm.
Before we study these phases, we introduce the following parameters:
𝛿 ∈ ℝ, 𝛿 > 0 is a constant of proportionality called the learning rate;
𝑝𝑎 ∈ ℕ, 𝑝𝑎 > 0 is the number of patterns shown to the network;
𝑒𝑝 ∈ ℕ, 𝑒𝑝 > 0 is the number of learning cycles (epochs) during which the algorithm sees all
the patterns. An epoch, which is a complete pass through a given dataset, should not be
confused with an iteration, which is simply one update of the neural net model’s parameters.
A suggestive designation of these phases belongs to Sussmann, who called them the "hallucinating phase" and the "learning phase", respectively [1,3]. Generally, during a learning phase, a pattern 𝑣 ∈ I𝒱 is "taught" by clamping the units 𝑖 ∈ 𝒱 so that their activation levels 𝜎𝑖 are the same as 𝑣𝑖, and by allowing the hidden units to evolve according to the Metropolis dynamics. In this phase the weights are adjusted according to the Hebb rule, that is, at each step each weight 𝑤𝑖𝑗 is incremented by the quantity:

Δ𝑤𝑖𝑗 = 𝛿 ∙ (2𝜎𝑖 − 1) ∙ (2𝜎𝑗 − 1) (5.3)

where 𝛿 is the learning rate.
During the hallucinating phase the whole network evolves following the Metropolis dynamics. The adjustment that takes place in this phase is similar to the one from the learning phase, except that now the quantity added to each weight 𝑤𝑖𝑗 is the negative of the quantity in (5.3).
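For binary units the factor (2𝜎 − 1) maps {0, 1} to {−1, +1}, so the increment (5.3) is +𝛿 when the two units agree and −𝛿 when they disagree. A one-function sketch covering both phases (the function name and the boolean flag are illustrative):

```python
def phase_increment(sigma_i, sigma_j, delta=0.05, learning_phase=True):
    """Weight change for one step: equation (5.3) in the learning phase,
    and the same quantity with opposite sign in the hallucinating phase.
    For binary units, (2*s - 1) maps {0, 1} onto {-1, +1}."""
    d = delta * (2 * sigma_i - 1) * (2 * sigma_j - 1)
    return d if learning_phase else -d
```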
Now that we know that the learning algorithm has two phases, how are these phases linked temporally? The answer is that the learning and hallucinating phases should alternate.
Hinton justifies the alternation of phases by using a well–known method for identifying the parameters of an unknown probability distribution: maximum likelihood estimation [10]. Hinton calls these two phases "collecting data–independent statistics" and "collecting data–dependent statistics", or the "negative phase" and the "positive phase", respectively.
According to Hinton, we can formulate the learning problem as one of minimizing the distance between two Gibbs measures: the environmental measure 𝐪 and the measure 𝐩𝐓 = MARG(𝐆𝐖, 𝒱), where 𝐆𝐖 describes the behavior of the network at equilibrium. The gradient of this distance, regarded as a function of the parameters, is then a difference of two terms: one term is the mean of the product (2𝜎𝑖 − 1) ∙ (2𝜎𝑗 − 1) with respect to 𝐪; the other term is the mean of the same quantity with respect to 𝐆𝐖. Hinton's claim is that the positive phase approximately computes the first term and the negative phase approximately computes the second term.
Sussmann gives another justification for alternating the phases of the learning procedure [1,3]. He claims that the learning procedure cannot consist of the learning phase alone because, if it did, the weights would blow up. Sussmann's explanation is that, during the learning phase, the network is doing the "correct" thing (i.e., the configurations 𝜎 = (𝑣, ℎ) where 𝑣 ∈ I𝒱 have "correct" values), because it has been forced to by clamping the visible units 𝑖 ∈ 𝒱 at values that correspond to a desired pattern. Hence, whatever the network is doing should be reinforced. If a particular product (2𝜎𝑖 − 1) ∙ (2𝜎𝑗 − 1) happens to be positive, it means that the net "wants" 𝜎𝑖 and 𝜎𝑗 to be "in sync"; hence, the weight 𝑤𝑖𝑗 should be increased to make this more likely. This means that the connection between the units 𝑖 and 𝑗 should be made more "excitatory" by making 𝑤𝑖𝑗 more positive, e.g., by adding to it the positive number Δ𝑤𝑖𝑗. Similarly, if the product (2𝜎𝑖 − 1) ∙ (2𝜎𝑗 − 1) is negative, 𝑤𝑖𝑗 should be decreased, and once again this is achieved by adding the negative number Δ𝑤𝑖𝑗 to it.
Furthermore, if the learning algorithm had only the learning phase, then some weights would keep increasing. Indeed, assume that the weights are updated at every step of the learning process, and consider a pair of visible units 𝑖 and 𝑗. If we only performed the learning phase as outlined above, then after 𝑒𝑝 × 𝑝𝑎 steps the weight 𝑤𝑖𝑗 would become:

𝑤𝑖𝑗 + 𝛿 ∙ 𝑒𝑝 ∙ 𝑝𝑎 ∙ ⟨(2𝜎𝑖 − 1) ∙ (2𝜎𝑗 − 1)⟩ (5.4)

where ⟨(2𝜎𝑖 − 1) ∙ (2𝜎𝑗 − 1)⟩ represents the sample mean of the product (2𝜎𝑖 − 1) ∙ (2𝜎𝑗 − 1) over the sample consisting of the 𝑝𝑎 patterns 𝑣(1), 𝑣(2), …, 𝑣(𝑝𝑎) used in the training.
If we assume that the patterns 𝑣(1), 𝑣(2), …, 𝑣(𝑝𝑎) are independent and identically distributed, or, more generally, that the Markov process {𝑣(𝑘)}𝑘 is ergodic, then the sample mean for the pair of visible units 𝑖 and 𝑗 will converge almost surely to the expected value of (2𝜎𝑖 − 1) ∙ (2𝜎𝑗 − 1) with respect to the measure 𝐪. Unless this expected value happens to vanish, the weight 𝑤𝑖𝑗 will blow up as 𝑡 → +∞. Therefore, Sussmann concludes that something else has to be done to prevent this from happening, and that something could very well be the alternating presence of the hallucinating phase.
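Sussmann's blow-up argument is easy to illustrate numerically: with only learning-phase updates, the weight for a visible pair grows linearly with the number of epochs, exactly as in equation (5.4). The function name and pattern encoding below are hypothetical:

```python
def learning_only_weight(patterns, epochs, delta=0.1, w0=0.0):
    """Accumulate only learning-phase increments for one visible pair (i, j).
    After epochs * len(patterns) steps the weight equals
    w0 + delta * ep * pa * <(2s_i-1)(2s_j-1)>, equation (5.4)."""
    w = w0
    for _ in range(epochs):
        for (si, sj) in patterns:       # each pattern gives the pair's states
            w += delta * (2 * si - 1) * (2 * sj - 1)
    return w
```

With patterns whose sample mean of (2𝜎𝑖 − 1)(2𝜎𝑗 − 1) is nonzero, the weight grows without bound as the number of epochs increases.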
5.3 Learning algorithms based on approximate maximum likelihood
One way to find the parameters in the Boltzmann Machine Learning problem is by means of maximum likelihood estimation. Under the maximum likelihood principle there is only a single data set (namely the one that is actually observed), and the parameters are set to the value that maximizes the likelihood 𝐏(data | params), i.e., the parameters are chosen such that the probability of the observed data set is maximized. A variant of this principle, very well suited for exponential models, maximizes the log likelihood log 𝐏(data | params).
According to Definition 5.2, the goal of asynchronous Boltzmann machine learning is to minimize the difference between MARG(𝐆𝐖, 𝒱) and 𝐏̃, which translates into minimizing the difference between 𝐩𝐓 and 𝐪. That is equivalent to maximizing the log likelihood of generating the environmental distribution 𝐐𝐓 when the network is running freely at equilibrium [43]. Regardless of the path chosen – maximizing the log likelihood of 𝐐𝐓 or minimizing the difference between 𝐩𝐓 and 𝐪 – the end result is the same: W̃. The path we follow in this paper to obtain W̃ is to minimize the difference between 𝐪 and 𝐩𝐓, where the difference is expressed by their KL–divergence KL(𝐪||𝐩𝐓). The KL–divergence of two probability distributions is always nonnegative and becomes zero if and only if those distributions are equal (equations (3.34) and (3.35)).
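Since the objective KL(𝐪||𝐩𝐓) is taken over the finite set of visible vectors, it can be computed directly whenever both distributions are available explicitly; a minimal sketch (the dictionary representation of a distribution is an assumption of the sketch):

```python
import math

def kl_divergence(q, p):
    """KL(q||p) = sum_v q(v) * ln(q(v)/p(v)).
    Nonnegative, and zero exactly when q == p; terms with q(v) = 0
    contribute nothing and are skipped."""
    return sum(qv * math.log(qv / p[v]) for v, qv in q.items() if qv > 0)
```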
5.3.1 Learning by minimizing the KL–divergence of Gibbs measures
The aim of this section is to present the generic Boltzmann machine learning algorithm proposed by Ackley, Hinton, and Sejnowski in [11] (see also [43]). In essence, the generic learning algorithm proposed by Ackley et al. in [11,43] locally computes the difference between two statistics and uses the result to update the "local" parameters. One statistic is the expectation with respect to the data distribution, i.e., the environmentally imposed distribution 𝐐𝐓; the other statistic is the expectation with respect to the true distribution, i.e., the Gibbs measure 𝐏𝐓. We will introduce the formulae for these expectations later in this section.
In order to derive a measure of how effectively the weights in the network are being used for
modelling the environment, Ackley et al. have made the assumption that there is no structure in
the sequential order of the environmentally clamped vectors. However, Ackley et al. admitted
that this is not a realistic assumption and a more realistic assumption would be that the
complete structure of the ensemble of the environmentally clamped vectors can be specified by
giving the probability of each of the 2𝑚 vectors over the 𝑚 visible units [11,43].
The version of the generic Boltzmann machine learning algorithm we present is inspired by
[42,61] and reflects our understanding of this important algorithm. We start by evaluating the
effect that clamping a data vector onto the visible units has over a hidden unit. In order to
accomplish this, we need to establish a new relationship between the vectors 𝑣 and ℎ.
Claim: Consider a configuration 𝜎 = (𝑣, ℎ) where 𝜎 ∈ I𝒩, 𝑣 ∈ I𝒱, and ℎ ∈ Iℋ. Then 𝑣 and ℎ are orthogonal in I𝒩.
Proof: In order to compute the Euclidean inner product between 𝑣 and ℎ in I𝒩, we need to represent both 𝑣 and ℎ as configurations in I𝒩. We do this by "padding" a data (visible) vector 𝑣 ∈ I𝒱 with zeros, up to the dimension 𝑛 of a configuration 𝜌 ∈ I𝒩, such that:

∀𝑖 ∈ 𝒱, 𝜌𝑖 = 𝑣𝑖 and ∀𝑗 ∈ 𝒩 − 𝒱 = ℋ, 𝜌𝑗 = 0 (5.5)

We apply the same "padding" operation, up to the dimension 𝑛 of a configuration 𝜏, to any hidden vector ℎ ∈ Iℋ such that:

∀𝑗 ∈ ℋ, 𝜏𝑗 = ℎ𝑗 and ∀𝑖 ∈ 𝒩 − ℋ = 𝒱, 𝜏𝑖 = 0 (5.6)

Then the inner product 𝜌 ∙ 𝜏ᵀ = 𝜏 ∙ 𝜌ᵀ is zero because, at every index, at least one of the two components is zero: the nonzero entries of 𝜌 and 𝜏 occupy disjoint index sets. Therefore, 𝜌 and 𝜏 are orthogonal, which leads to 𝑣 and ℎ being orthogonal in I𝒩.
Moreover, a configuration 𝜎 over 𝑣 and ℎ can be represented either as the concatenation of 𝑣 and ℎ or as the sum of the "expanded" versions 𝜌 of 𝑣 and 𝜏 of ℎ:

𝜎 = (𝑣, ℎ) = 𝜌 + 𝜏 ≝ 𝑣 + ℎ (5.7)

A consequence of equation (5.7) is that equation (4.28) can be rewritten as:

𝐩𝐓(𝑣) = ∑_{ℎ∈Iℋ} 𝐏𝐓(𝑣, ℎ) = ∑_{ℎ∈Iℋ} 𝐏𝐓(𝑣 + ℎ) for 𝑣 ∈ I𝒱 (5.8)
We now evaluate the activation of a hidden unit 𝑖 ∈ ℋ due to the clamping of the visible units 𝑣. We do this by distinguishing between the contribution of the hidden units and the contribution of the visible units to the net input of that unit:

net(𝑖, 𝜎) = ∑_{𝑗∈ℋ, 𝑗≠𝑖} ℎ𝑗 ∙ 𝑤𝑗𝑖 + (∑_{𝑗∈𝒱, 𝑗≠𝑖} 𝑣𝑗 ∙ 𝑤𝑖𝑗 − 𝜃𝑖) (5.9)

The terms inside the bracket in (5.9) do not depend on ℎ. Moreover, when the visible units 𝑣 are clamped, the content of the bracket is constant; its negative, denoted 𝛉𝐢 and called the effective threshold of unit 𝑖, acts as a threshold for unit 𝑖 of the subnet ℋ:

𝛉𝐢 = 𝜃𝑖 − ∑_{𝑗∈𝒱, 𝑗≠𝑖} 𝑣𝑗 ∙ 𝑤𝑖𝑗 (5.10)

Then:

net(𝑖, 𝜎) = ∑_{𝑗∈ℋ, 𝑗≠𝑖} ℎ𝑗 ∙ 𝑤𝑗𝑖 − 𝛉𝐢 (5.11)
The subnet ℋ behaves like a Boltzmann machine with its own interconnecting weights W and
thresholds (𝛉𝐢)𝑖∈ℋ. This means that, in principle, we know the probability of any particular state
or configuration ℎ of ℋ simply because it will be determined by a Boltzmann–Gibbs distribution.
To use this fact we need to know the relationship between the internal energy of subnet ℋ
operating with effective thresholds (𝛉𝐢)𝑖∈ℋ and the energy of the whole network 𝒩 of 𝐁𝐌. The
next theorem makes this relationship explicit by means of an algebraic identity.
Theorem 5.1 (Jones [63]):
The energy of the whole network of 𝐁𝐌 can be computed as the sum of the internal energy of the subnet ℋ in state ℎ when the vector 𝑣 is clamped and the internal energy of the subnet 𝒱 in state 𝑣 when it is completely disconnected from the units of ℋ. Formally, given:

𝐸ℋ(ℎ|𝑣) = −(1/2) ∙ ∑_{𝑗∈ℋ} ℎ𝑗 ∙ ∑_{𝑖∈ℋ, 𝑖≠𝑗} ℎ𝑖 ∙ 𝑤𝑖𝑗 + ∑_{𝑗∈ℋ} 𝛉𝐣 ∙ ℎ𝑗 (5.12)

and

𝐸𝒱(𝑣) = −(1/2) ∙ ∑_{𝑗∈𝒱} 𝑣𝑗 ∙ ∑_{𝑖∈𝒱, 𝑖≠𝑗} 𝑣𝑖 ∙ 𝑤𝑖𝑗 + ∑_{𝑗∈𝒱} 𝜃𝑗 ∙ 𝑣𝑗 (5.13)

then:

𝐸(𝜎) = 𝐸ℋ(ℎ|𝑣) + 𝐸𝒱(𝑣), where 𝜎 = 𝑣 + ℎ (5.14)
Proof: We start from the energy of a joint configuration given by equation (4.37), rewrite the pair sum symmetrically, and split 𝜎 = 𝑣 + ℎ step by step:

𝐸(𝑣, ℎ) = −∑_{𝑖∈𝒩} 𝜎𝑖 ∙ ∑_{𝑗∈𝒩, 𝑗>𝑖} 𝜎𝑗 ∙ 𝑤𝑖𝑗 + ∑_{𝑖∈𝒩} 𝜎𝑖 ∙ 𝜃𝑖

𝐸(𝑣, ℎ) = −(1/2) ∙ ∑_{𝑖∈𝒩} 𝜎𝑖 ∙ ∑_{𝑗∈𝒩, 𝑗≠𝑖} 𝜎𝑗 ∙ 𝑤𝑖𝑗 + ∑_{𝑖∈𝒩} 𝜎𝑖 ∙ 𝜃𝑖

𝐸(𝑣, ℎ) = −(1/2) ∙ ∑_{𝑖∈𝒩} 𝜎𝑖 ∙ ∑_{𝑗∈𝒩, 𝑗≠𝑖} (𝑣𝑗 + ℎ𝑗) ∙ 𝑤𝑖𝑗 + ∑_{𝑖∈𝒩} 𝜎𝑖 ∙ 𝜃𝑖

𝐸(𝑣, ℎ) = −(1/2) ∙ ∑_{𝑖∈𝒩} 𝜎𝑖 ∙ ∑_{𝑗∈𝒱, 𝑗≠𝑖} 𝑣𝑗 ∙ 𝑤𝑖𝑗 − (1/2) ∙ ∑_{𝑖∈𝒩} 𝜎𝑖 ∙ ∑_{𝑗∈ℋ, 𝑗≠𝑖} ℎ𝑗 ∙ 𝑤𝑖𝑗 + ∑_{𝑖∈𝒩} 𝜎𝑖 ∙ 𝜃𝑖

𝐸(𝑣, ℎ) = −(1/2) ∙ ∑_{𝑗∈𝒱} 𝑣𝑗 ∙ ∑_{𝑖∈𝒩, 𝑖≠𝑗} 𝜎𝑖 ∙ 𝑤𝑖𝑗 − (1/2) ∙ ∑_{𝑗∈ℋ} ℎ𝑗 ∙ ∑_{𝑖∈𝒩, 𝑖≠𝑗} 𝜎𝑖 ∙ 𝑤𝑖𝑗 + ∑_{𝑖∈𝒩} 𝜎𝑖 ∙ 𝜃𝑖

𝐸(𝑣, ℎ) = −(1/2) ∙ ∑_{𝑗∈𝒱} 𝑣𝑗 ∙ ∑_{𝑖∈𝒩, 𝑖≠𝑗} (𝑣𝑖 + ℎ𝑖) ∙ 𝑤𝑖𝑗 − (1/2) ∙ ∑_{𝑗∈ℋ} ℎ𝑗 ∙ ∑_{𝑖∈𝒩, 𝑖≠𝑗} (𝑣𝑖 + ℎ𝑖) ∙ 𝑤𝑖𝑗 + ∑_{𝑖∈𝒩} (𝑣𝑖 + ℎ𝑖) ∙ 𝜃𝑖

𝐸(𝑣, ℎ) = (−(1/2) ∙ ∑_{𝑗∈𝒱} 𝑣𝑗 ∙ ∑_{𝑖∈𝒱, 𝑖≠𝑗} 𝑣𝑖 ∙ 𝑤𝑗𝑖 − (1/2) ∙ ∑_{𝑗∈𝒱} 𝑣𝑗 ∙ ∑_{𝑖∈ℋ, 𝑖≠𝑗} ℎ𝑖 ∙ 𝑤𝑗𝑖) + (−(1/2) ∙ ∑_{𝑗∈ℋ} ℎ𝑗 ∙ ∑_{𝑖∈𝒱, 𝑖≠𝑗} 𝑣𝑖 ∙ 𝑤𝑗𝑖 − (1/2) ∙ ∑_{𝑗∈ℋ} ℎ𝑗 ∙ ∑_{𝑖∈ℋ, 𝑖≠𝑗} ℎ𝑖 ∙ 𝑤𝑗𝑖) + (∑_{𝑗∈𝒱} 𝑣𝑗 ∙ 𝜃𝑗 + ∑_{𝑗∈ℋ} ℎ𝑗 ∙ 𝜃𝑗)

We observe that, due to the weight symmetry and the commutative and distributive laws of multiplication over addition, the second term and the third term of the last formula are identical. Therefore:

𝐸(𝑣, ℎ) = −(1/2) ∙ ∑_{𝑗∈𝒱} 𝑣𝑗 ∙ ∑_{𝑖∈𝒱, 𝑖≠𝑗} 𝑣𝑖 ∙ 𝑤𝑗𝑖 − (1/2) ∙ ∑_{𝑗∈ℋ} ℎ𝑗 ∙ ∑_{𝑖∈ℋ, 𝑖≠𝑗} ℎ𝑖 ∙ 𝑤𝑗𝑖 − ∑_{𝑗∈ℋ} ℎ𝑗 ∙ ∑_{𝑖∈𝒱, 𝑖≠𝑗} 𝑣𝑖 ∙ 𝑤𝑗𝑖 + ∑_{𝑗∈𝒱} 𝑣𝑗 ∙ 𝜃𝑗 + ∑_{𝑗∈ℋ} ℎ𝑗 ∙ 𝜃𝑗

𝐸(𝑣, ℎ) = (−(1/2) ∙ ∑_{𝑗∈𝒱} 𝑣𝑗 ∙ ∑_{𝑖∈𝒱, 𝑖≠𝑗} 𝑣𝑖 ∙ 𝑤𝑗𝑖 + ∑_{𝑗∈𝒱} 𝑣𝑗 ∙ 𝜃𝑗) − (1/2) ∙ ∑_{𝑗∈ℋ} ℎ𝑗 ∙ ∑_{𝑖∈ℋ, 𝑖≠𝑗} ℎ𝑖 ∙ 𝑤𝑗𝑖 + ∑_{𝑗∈ℋ} ℎ𝑗 ∙ (𝜃𝑗 − ∑_{𝑖∈𝒱, 𝑖≠𝑗} 𝑣𝑖 ∙ 𝑤𝑗𝑖)

We observe that the content of the first bracket is exactly 𝐸𝒱(𝑣) given by equation (5.13), and the content of the last bracket is exactly 𝛉𝐣 given by equation (5.10). Therefore:

𝐸(𝑣, ℎ) = 𝐸𝒱(𝑣) + (−(1/2) ∙ ∑_{𝑗∈ℋ} ℎ𝑗 ∙ ∑_{𝑖∈ℋ, 𝑖≠𝑗} ℎ𝑖 ∙ 𝑤𝑗𝑖 + ∑_{𝑗∈ℋ} ℎ𝑗 ∙ 𝛉𝐣)

The content of the remaining bracket is exactly 𝐸ℋ(ℎ|𝑣) given by equation (5.12). Thus, we obtain exactly equation (5.14):

𝐸(𝜎) = 𝐸(𝑣, ℎ) = 𝐸𝒱(𝑣) + 𝐸ℋ(ℎ|𝑣)
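The identity (5.14) is easy to verify by brute force on a small network. The sketch below uses an arbitrary four-unit machine (two visible, two hidden) with illustrative weights and thresholds; none of these values come from the thesis:

```python
import itertools

# Illustrative machine: units 0,1 visible; units 2,3 hidden.
w = {(0, 1): 0.5, (0, 2): -1.0, (1, 3): 0.7, (2, 3): 0.3}  # symmetric, no self-loops
theta = [0.1, -0.2, 0.4, 0.0]
V, H = [0, 1], [2, 3]

def W(i, j):
    return w.get((i, j), w.get((j, i), 0.0))

def E(sigma):  # equation (4.37): energy of the whole network
    return (-sum(sigma[i] * sigma[j] * W(i, j)
                 for i in range(4) for j in range(i + 1, 4))
            + sum(sigma[i] * theta[i] for i in range(4)))

def E_V(v):    # equation (5.13): visible subnet, disconnected from H
    return -v[0] * v[1] * W(0, 1) + theta[0] * v[0] + theta[1] * v[1]

def E_H(h, v): # equation (5.12): hidden subnet, effective thresholds (5.10)
    eff = [theta[j] - sum(v[a] * W(j, V[a]) for a in range(2)) for j in H]
    return -h[0] * h[1] * W(2, 3) + eff[0] * h[0] + eff[1] * h[1]

# Equation (5.14): E(sigma) = E_H(h|v) + E_V(v) for every configuration
for v in itertools.product((0, 1), repeat=2):
    for h in itertools.product((0, 1), repeat=2):
        assert abs(E(list(v) + list(h)) - (E_H(h, v) + E_V(v))) < 1e-12
```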
With respect to Theorem 5.1, we remark that 𝐸𝒱(𝑣) is constant when 𝑣 is clamped on 𝒱. This makes the calculation of the probability of any particular vector ℎ on the hidden units particularly straightforward. Therefore, we take a closer look at 𝐐𝐓(ℎ|𝑣), i.e., the probability that vector ℎ will occur on the hidden units when 𝑣 is clamped on the visible units and ℋ is allowed to run at pseudo–temperature 𝐓. Intuitively, the only effect of 𝑣 on ℎ is to cause the hidden units to run with the effective thresholds 𝛉𝐣 given by equation (5.10) instead of their regular thresholds 𝜃𝑗. Under these circumstances, 𝐐𝐓(ℎ|𝑣) is governed by the same type of distribution as the network itself, which in our case is the Boltzmann–Gibbs distribution.
Corollary 5.1:
𝐐𝐓(ℎ|𝑣) is proportional to the probability 𝐏𝐓(𝑣, ℎ) of the joint configuration 𝜎 = (𝑣, ℎ):

𝐐𝐓(ℎ|𝑣) = 𝛼(𝑣, 𝐓) ∙ 𝐏𝐓(𝑣, ℎ) (5.15)

where 𝜎 = (𝑣, ℎ) is a configuration of the network, 𝐓 is the pseudo–temperature, and 𝛼(𝑣, 𝐓) is a positive constant depending only on 𝑣 and 𝐓.
Proof: A consequence of Theorem 5.1 is that, in a Boltzmann machine with the visible units clamped to 𝑣 and with hidden units ℎ, 𝐐𝐓(ℎ|𝑣) is governed by the Boltzmann–Gibbs distribution given by equation (2.5). The energy corresponding to 𝐐𝐓(ℎ|𝑣) according to equation (2.5) can be obtained from equation (5.14). Therefore, we can write:

𝐐𝐓(ℎ|𝑣) = (1/𝑍ℋ) ∙ exp(−𝐸ℋ(ℎ|𝑣)/𝐓) = (1/𝑍ℋ) ∙ exp((−𝐸(𝑣, ℎ) + 𝐸𝒱(𝑣))/𝐓)

𝐐𝐓(ℎ|𝑣) = (1/𝑍ℋ) ∙ exp(−𝐸(𝑣, ℎ)/𝐓) ∙ exp(𝐸𝒱(𝑣)/𝐓)

where 𝑍ℋ is an appropriate normalization constant for the distribution 𝐐𝐓(ℎ|𝑣). Hence:

𝐐𝐓(ℎ|𝑣) = ((𝑍/𝑍ℋ) ∙ exp(𝐸𝒱(𝑣)/𝐓)) ∙ ((1/𝑍) ∙ exp(−𝐸(𝑣, ℎ)/𝐓))

where 𝑍 is the partition function for the true distribution 𝐏𝐓(𝑣, ℎ).
In the previous formula both 𝑍 and 𝑍ℋ are in essence constants, despite the fact that their computation is intractable. We also observe that the first factor depends only on 𝑣 and 𝐓, which are both constant with respect to ℎ. Moreover, the second factor is exactly 𝐏𝐓(𝑣, ℎ) given by equation (4.40). If we denote the first factor by 𝛼(𝑣, 𝐓), then we obtain exactly the expression (5.15):

𝐐𝐓(ℎ|𝑣) = 𝛼(𝑣, 𝐓) ∙ 𝐏𝐓(𝑣, ℎ)
The following theorem is essential for the Boltzmann machine learning algorithm. The theorem
gives the relationship between 𝐐𝐓(𝑣, ℎ) and 𝐏𝐓(𝑣, ℎ) in terms of the observable probability 𝐪(𝑣)
and the marginal probability 𝐩𝐓(𝑣). It shows that, in the particular case of the Boltzmann–Gibbs
distribution, this relationship has a simple ratio form.
There is a bit of history around this theorem. In the original derivation of the learning rule for the asynchronous Boltzmann machine, Ackley et al. assumed, without making direct appeal to the form of the underlying distribution, that at thermal equilibrium the probability of a hidden state given a visible state is the same regardless of how the visible units arrived there (clamped or free–running) [11,43]. However, for systems with a distribution different from Boltzmann–Gibbs, such as a synchronous Boltzmann machine, this property fails and the relationship is much more complicated [42,61]. This means that the classical arguments supporting Theorem 5.2 are logically inadequate, although the conclusion is correct [42]. The missing piece of the original proof was identified, and the logic of the original derivation clarified, by Jones in [63].
Theorem 5.2 (Jones [63]):
If the true distribution 𝐏𝐓 over the whole network of 𝐁𝐌 is described by the Boltzmann–Gibbs distribution given by equation (2.5), then the environmental distribution 𝐐𝐓 is given by:

𝐐𝐓(𝑣, ℎ) = (𝐪(𝑣)/𝐩𝐓(𝑣)) ∙ 𝐏𝐓(𝑣, ℎ) (5.16)

Proof: We start from equation (5.15) and sum over all possible ℎ ∈ Iℋ, taking into consideration the fact that 𝛼(𝑣, 𝐓) is independent of ℎ:

∑_{ℎ∈Iℋ} 𝐐𝐓(ℎ|𝑣) = ∑_{ℎ∈Iℋ} 𝛼(𝑣, 𝐓) ∙ 𝐏𝐓(𝑣, ℎ) = 𝛼(𝑣, 𝐓) ∙ ∑_{ℎ∈Iℋ} 𝐏𝐓(𝑣, ℎ)

The sum of probabilities on the left side is 1, and the sum on the right side is exactly the marginal MARG(𝐏𝐓, 𝒱)(𝑣) = 𝐩𝐓(𝑣) of the true distribution over the states of the visible units. Moreover, in Section 4.1, when constructing the Gibbs measure associated to a Hamiltonian, we assumed that the corresponding Gibbs measure is a nondegenerate probability distribution, so we can divide by it without restriction. These observations lead to:

1 = 𝛼(𝑣, 𝐓) ∙ 𝐩𝐓(𝑣) ⇔ 𝛼(𝑣, 𝐓) = 1/𝐩𝐓(𝑣)

Based on our definition of 𝐐𝐓(𝑣, ℎ) in Section 5.1, we can write:

𝐐𝐓(𝑣, ℎ) = 𝐐𝐓(ℎ|𝑣) ∙ 𝐪(𝑣) ⇔ 𝐐𝐓(ℎ|𝑣) = 𝐐𝐓(𝑣, ℎ)/𝐪(𝑣)

If we substitute 𝛼(𝑣, 𝐓) and 𝐐𝐓(ℎ|𝑣) in equation (5.15), we obtain exactly equation (5.16):

𝐐𝐓(𝑣, ℎ)/𝐪(𝑣) = 𝐏𝐓(𝑣, ℎ)/𝐩𝐓(𝑣) ⇔ 𝐩𝐓(𝑣) ∙ 𝐐𝐓(𝑣, ℎ) = 𝐪(𝑣) ∙ 𝐏𝐓(𝑣, ℎ)
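Theorem 5.2, together with Corollary 5.1 behind it, can be verified numerically on a toy machine with two visible units and one hidden unit, computing 𝐏𝐓 exactly by brute-force enumeration. The weights, thresholds, and the environmental distribution below are arbitrary illustrative choices, not values from the thesis:

```python
import itertools
import math

# Toy machine (illustrative values): visible units 0,1; hidden unit 2.
w = {(0, 1): 0.4, (0, 2): -0.8, (1, 2): 0.6}   # symmetric weights, no self-loops
theta = [0.1, -0.3, 0.2]
T = 1.5                                        # pseudo-temperature

def W(i, j):
    return w.get((i, j), w.get((j, i), 0.0))

def E(s):  # full-network energy, equation (4.37)
    return (-sum(s[i] * s[j] * W(i, j) for i in range(3) for j in range(i + 1, 3))
            + sum(s[i] * theta[i] for i in range(3)))

states = list(itertools.product((0, 1), repeat=3))
Z = sum(math.exp(-E(s) / T) for s in states)
P = {s: math.exp(-E(s) / T) / Z for s in states}           # true distribution P_T
p = {v: P[v + (0,)] + P[v + (1,)]                          # marginal p_T (eq. 5.8)
     for v in itertools.product((0, 1), repeat=2)}
q = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}   # environmental q (arbitrary)

# Equation (5.16): Q_T(v, h) = q(v)/p_T(v) * P_T(v, h)
Q = {v + (h,): q[v] * P[v + (h,)] / p[v] for v in q for h in (0, 1)}

# Cross-check against the clamped hidden subnet of Corollary 5.1:
# Q_T(h|v) is Boltzmann-Gibbs with the effective threshold (5.10).
for v in q:
    eff = theta[2] - v[0] * W(2, 0) - v[1] * W(2, 1)
    zH = 1.0 + math.exp(-eff / T)              # hidden partition function, h in {0,1}
    for h in (0, 1):
        assert abs(Q[v + (h,)] - q[v] * math.exp(-h * eff / T) / zH) < 1e-12
```

The assertion passing for every (𝑣, ℎ) confirms that the two routes to 𝐐𝐓(𝑣, ℎ) agree, as Theorem 5.2 states for Boltzmann–Gibbs distributions.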
Lemma 5.3:
The partial derivatives of the KL–divergence between the observable probability 𝐪(𝑣) and the marginal probability 𝐩𝐓(𝑣) with respect to the parameters W of the network are computed with the following formulae:

∂KL(𝐪||𝐩𝐓)/∂𝑤𝑖𝑗 = −∑_{(𝑣,ℎ)∈I𝒩} (𝐪(𝑣)/𝐩𝐓(𝑣)) ∙ ∂𝐏𝐓(𝑣, ℎ)/∂𝑤𝑖𝑗 (5.17)

∂KL(𝐪||𝐩𝐓)/∂𝜃𝑖 = −∑_{(𝑣,ℎ)∈I𝒩} (𝐪(𝑣)/𝐩𝐓(𝑣)) ∙ ∂𝐏𝐓(𝑣, ℎ)/∂𝜃𝑖 (5.18)

Proof: Before we start the proof, we recall that, per Definition 5.2, the learning process computes the parameters W of the network such that lim_{𝑡→∞} W = W̃. This means that the partial derivatives of KL(𝐪||𝐩𝐓) should be computed with respect to the parameters W̃ = (W̃, Θ̃). However, to keep the text as readable as possible, we are going to use the parameters W = (W, Θ) in our computation, but with the meaning of W̃. We start from the definition of the KL–divergence (equations (B29) and (B30) from Appendix B):

KL(𝐪||𝐩𝐓) = ∑_{𝑣∈I𝒱} 𝐪(𝑣) ∙ ln(𝐪(𝑣)/𝐩𝐓(𝑣))

We compute the partial derivative of KL(𝐪||𝐩𝐓) with respect to the weights 𝑤𝑖𝑗, taking into consideration that 𝐪(𝑣) is an environmentally imposed probability distribution, so it does not depend on the parameters of the network:

∂KL(𝐪||𝐩𝐓)/∂𝑤𝑖𝑗 = ∑_{𝑣∈I𝒱} ∂(𝐪(𝑣) ∙ ln(𝐪(𝑣)/𝐩𝐓(𝑣)))/∂𝑤𝑖𝑗 = ∑_{𝑣∈I𝒱} 𝐪(𝑣) ∙ ∂(ln(𝐪(𝑣)/𝐩𝐓(𝑣)))/∂𝑤𝑖𝑗 = ∑_{𝑣∈I𝒱} 𝐪(𝑣) ∙ (𝐩𝐓(𝑣)/𝐪(𝑣)) ∙ ∂(𝐪(𝑣)/𝐩𝐓(𝑣))/∂𝑤𝑖𝑗

∂KL(𝐪||𝐩𝐓)/∂𝑤𝑖𝑗 = ∑_{𝑣∈I𝒱} 𝐩𝐓(𝑣) ∙ 𝐪(𝑣) ∙ ∂(1/𝐩𝐓(𝑣))/∂𝑤𝑖𝑗 = −∑_{𝑣∈I𝒱} 𝐩𝐓(𝑣) ∙ 𝐪(𝑣) ∙ (1/𝐩𝐓(𝑣)²) ∙ ∂𝐩𝐓(𝑣)/∂𝑤𝑖𝑗 = −∑_{𝑣∈I𝒱} (𝐪(𝑣)/𝐩𝐓(𝑣)) ∙ ∂𝐩𝐓(𝑣)/∂𝑤𝑖𝑗

Furthermore, we substitute 𝐩𝐓(𝑣) with its expression given by equation (5.8):

∂KL(𝐪||𝐩𝐓)/∂𝑤𝑖𝑗 = −∑_{𝑣∈I𝒱} (𝐪(𝑣)/𝐩𝐓(𝑣)) ∙ ∂(∑_{ℎ∈Iℋ} 𝐏𝐓(𝑣, ℎ))/∂𝑤𝑖𝑗 = −∑_{𝑣∈I𝒱} (𝐪(𝑣)/𝐩𝐓(𝑣)) ∙ ∑_{ℎ∈Iℋ} ∂𝐏𝐓(𝑣, ℎ)/∂𝑤𝑖𝑗

Earlier we proved that 𝑣 and ℎ are orthogonal (equation (5.7)), which translates into the following identity between sums:

∑_{𝑣∈I𝒱} ∑_{ℎ∈Iℋ} = ∑_{(𝑣,ℎ)∈I𝒩}

Applying this identity to the previous expression of the partial derivative yields exactly formula (5.17):

∂KL(𝐪||𝐩𝐓)/∂𝑤𝑖𝑗 = −∑_{𝑣∈I𝒱} ∑_{ℎ∈Iℋ} (𝐪(𝑣)/𝐩𝐓(𝑣)) ∙ ∂𝐏𝐓(𝑣, ℎ)/∂𝑤𝑖𝑗 = −∑_{(𝑣,ℎ)∈I𝒩} (𝐪(𝑣)/𝐩𝐓(𝑣)) ∙ ∂𝐏𝐓(𝑣, ℎ)/∂𝑤𝑖𝑗

The proof of formula (5.18) is very similar to the proof of formula (5.17), with the weights 𝑤𝑖𝑗 replaced by the thresholds 𝜃𝑖.
Before presenting the most important result of the learning rule derivation for the asynchronous symmetric Boltzmann machine, we introduce the expectations mentioned at the beginning of this section. Given a configuration 𝜎 = (𝑣, ℎ) of the network, we denote by 𝑞𝑖𝑗 the expectation with respect to the data distribution 𝐐𝐓, i.e., the probability, averaged over all environmental inputs and measured at equilibrium, that the 𝑖th and the 𝑗th units are both on. We denote by 𝑝𝑖𝑗 the expectation with respect to the true distribution 𝐏𝐓, i.e., the probability, measured at equilibrium under the true distribution, that the 𝑖th and the 𝑗th units are both on.

𝑞𝑖𝑗 = ∑_{𝜎∈I𝒩} 𝜎𝑖 ∙ 𝜎𝑗 ∙ 𝐐𝐓(𝜎) if 𝑖 ≠ 𝑗 (5.19)

𝑝𝑖𝑗 = ∑_{𝜎∈I𝒩} 𝜎𝑖 ∙ 𝜎𝑗 ∙ 𝐏𝐓(𝜎) if 𝑖 ≠ 𝑗 (5.20)

Similarly, we denote by 𝑞𝑖 the probability, averaged over all environmental inputs and measured at equilibrium, that the 𝑖th unit is on, and by 𝑝𝑖 the probability, measured at equilibrium under the true distribution, that the 𝑖th unit is on.

𝑞𝑖 = ∑_{𝜎∈I𝒩} 𝜎𝑖 ∙ 𝐐𝐓(𝜎) (5.21)

𝑝𝑖 = ∑_{𝜎∈I𝒩} 𝜎𝑖 ∙ 𝐏𝐓(𝜎) (5.22)
Theorem 5.4 (Gradient Descent for asynchronous symmetric Boltzmann machines):
The partial derivatives of the KL–divergence between the environmental probability 𝐪(𝑣) and the marginal of the true probability 𝐩𝐓(𝑣) with respect to the symmetric weights 𝑤𝑖𝑗 and the thresholds 𝜃𝑖, respectively, are given by the following formulae:

∂KL(𝐪||𝐩𝐓)/∂𝑤𝑖𝑗 = −(1/𝐓) ∙ (𝑞𝑖𝑗 − 𝑝𝑖𝑗) (5.23)

∂KL(𝐪||𝐩𝐓)/∂𝜃𝑖 = (1/𝐓) ∙ (𝑞𝑖 − 𝑝𝑖) (5.24)

Proof: By Lemma 5.3, it is sufficient to determine ∂𝐏𝐓(𝜎)/∂𝑤𝑖𝑗 and ∂𝐏𝐓(𝜎)/∂𝜃𝑖. In order to compute these partial derivatives, we need the expression of the true probability distribution of the network at equilibrium. In Chapter 4 we learned that the joint distribution of a Boltzmann machine is a Boltzmann–Gibbs distribution given by equation (4.40). However, we still need to prove that equation (4.40) also represents the equilibrium distribution of the Boltzmann machine, which we do in Section 5.3.2. In the rest of this section we assume that the equilibrium distribution of the Boltzmann machine is given by the following version of equation (4.40), which takes into consideration the pseudo–temperature 𝐓:

𝐏𝐓(𝜎) = 𝐏𝐓(𝑣, ℎ) = 𝐏𝐓(𝑣 + ℎ) = (1/𝑍) ∙ exp(−𝐸(𝜎)/𝐓) (5.25)

We observe that both the numerator and the denominator of 𝐏𝐓(𝜎) depend on the weights 𝑤𝑖𝑗 and on the thresholds 𝜃𝑖:

∂𝐏𝐓(𝜎)/∂𝑤𝑖𝑗 = −(1/𝑍²) ∙ exp(−𝐸(𝜎)/𝐓) ∙ ∂𝑍/∂𝑤𝑖𝑗 − (1/(𝑍 ∙ 𝐓)) ∙ exp(−𝐸(𝜎)/𝐓) ∙ ∂𝐸(𝜎)/∂𝑤𝑖𝑗

∂𝐏𝐓(𝜎)/∂𝑤𝑖𝑗 = −𝐏𝐓(𝜎) ∙ ((1/𝑍) ∙ ∂𝑍/∂𝑤𝑖𝑗 + (1/𝐓) ∙ ∂𝐸(𝜎)/∂𝑤𝑖𝑗) (5.26)

and, analogously,

∂𝐏𝐓(𝜎)/∂𝜃𝑖 = −𝐏𝐓(𝜎) ∙ ((1/𝑍) ∙ ∂𝑍/∂𝜃𝑖 + (1/𝐓) ∙ ∂𝐸(𝜎)/∂𝜃𝑖) (5.27)

From formulae (5.26) and (5.27), it remains to determine ∂𝑍/∂𝑤𝑖𝑗 and ∂𝐸(𝜎)/∂𝑤𝑖𝑗, respectively ∂𝑍/∂𝜃𝑖 and ∂𝐸(𝜎)/∂𝜃𝑖. We recall the expression of 𝐸(𝜎) given by equation (4.37) and the expression of 𝑍 given by equation (4.39):

𝐸(𝜎) = −∑_{{𝑖,𝑗}∈𝒢, 𝑖<𝑗} 𝜎𝑖 ∙ 𝜎𝑗 ∙ 𝑤𝑖𝑗 + ∑_{𝑖∈𝒩} 𝜎𝑖 ∙ 𝜃𝑖

𝑍 = ∑_{𝜎∈I𝒩} exp(−𝐸(𝜎)/𝐓) = ∑_{𝜎∈I𝒩} exp((∑_{{𝑖,𝑗}∈𝒢, 𝑖<𝑗} 𝜎𝑖 ∙ 𝜎𝑗 ∙ 𝑤𝑖𝑗 − ∑_{𝑖∈𝒩} 𝜎𝑖 ∙ 𝜃𝑖)/𝐓)

The partial derivatives of 𝐸(𝜎) with respect to the weights 𝑤𝑖𝑗 and the thresholds 𝜃𝑖, respectively, are:

∂𝐸(𝜎)/∂𝑤𝑖𝑗 = −𝜎𝑖 ∙ 𝜎𝑗 (5.28)

∂𝐸(𝜎)/∂𝜃𝑖 = 𝜎𝑖 (5.29)

The partial derivatives of 𝑍 are:

∂𝑍/∂𝑤𝑖𝑗 = ∑_{𝜎∈I𝒩} exp(−𝐸(𝜎)/𝐓) ∙ (𝜎𝑖 ∙ 𝜎𝑗/𝐓) = (𝑍/𝐓) ∙ ∑_{𝜎∈I𝒩} 𝜎𝑖 ∙ 𝜎𝑗 ∙ 𝐏𝐓(𝜎) (5.30)

∂𝑍/∂𝜃𝑖 = ∑_{𝜎∈I𝒩} exp(−𝐸(𝜎)/𝐓) ∙ (−𝜎𝑖/𝐓) = −(𝑍/𝐓) ∙ ∑_{𝜎∈I𝒩} 𝜎𝑖 ∙ 𝐏𝐓(𝜎) (5.31)

We substitute formulae (5.28) and (5.30) into formula (5.26), and formulae (5.29) and (5.31) into formula (5.27). Writing 𝜎′ for the summation variable to distinguish it from the fixed configuration 𝜎:

∂𝐏𝐓(𝜎)/∂𝑤𝑖𝑗 = −(𝐏𝐓(𝜎)/𝐓) ∙ (∑_{𝜎′∈I𝒩} 𝜎′𝑖 ∙ 𝜎′𝑗 ∙ 𝐏𝐓(𝜎′) − 𝜎𝑖 ∙ 𝜎𝑗) (5.32)

∂𝐏𝐓(𝜎)/∂𝜃𝑖 = (𝐏𝐓(𝜎)/𝐓) ∙ (∑_{𝜎′∈I𝒩} 𝜎′𝑖 ∙ 𝐏𝐓(𝜎′) − 𝜎𝑖) (5.33)

We observe that in formula (5.32) the first term inside the bracket is exactly 𝑝𝑖𝑗 given by formula (5.20). Similarly, in formula (5.33) the first term inside the bracket is exactly 𝑝𝑖 given by formula (5.22). Thus:

∂𝐏𝐓(𝜎)/∂𝑤𝑖𝑗 = −(𝐏𝐓(𝜎)/𝐓) ∙ (𝑝𝑖𝑗 − 𝜎𝑖 ∙ 𝜎𝑗) (5.34)

∂𝐏𝐓(𝜎)/∂𝜃𝑖 = (𝐏𝐓(𝜎)/𝐓) ∙ (𝑝𝑖 − 𝜎𝑖) (5.35)

Furthermore, we substitute formula (5.34) into formula (5.17) and formula (5.35) into formula (5.18):

∂KL(𝐪||𝐩𝐓)/∂𝑤𝑖𝑗 = ∑_{(𝑣,ℎ)∈I𝒩} (𝐪(𝑣)/𝐩𝐓(𝑣)) ∙ (𝐏𝐓(𝑣, ℎ)/𝐓) ∙ (𝑝𝑖𝑗 − 𝜎𝑖 ∙ 𝜎𝑗) (5.36)

∂KL(𝐪||𝐩𝐓)/∂𝜃𝑖 = −∑_{(𝑣,ℎ)∈I𝒩} (𝐪(𝑣)/𝐩𝐓(𝑣)) ∙ (𝐏𝐓(𝑣, ℎ)/𝐓) ∙ (𝑝𝑖 − 𝜎𝑖) (5.37)

We use formula (5.16) to substitute the expression (𝐪(𝑣)/𝐩𝐓(𝑣)) ∙ 𝐏𝐓(𝑣, ℎ) with 𝐐𝐓(𝑣, ℎ) in formula (5.36); the formula we obtain is exactly formula (5.23):

∂KL(𝐪||𝐩𝐓)/∂𝑤𝑖𝑗 = (1/𝐓) ∙ ∑_{(𝑣,ℎ)∈I𝒩} 𝐐𝐓(𝑣, ℎ) ∙ (𝑝𝑖𝑗 − 𝜎𝑖 ∙ 𝜎𝑗) = (1/𝐓) ∙ (𝑝𝑖𝑗 ∙ ∑_{𝜎∈I𝒩} 𝐐𝐓(𝜎) − ∑_{𝜎∈I𝒩} 𝜎𝑖 ∙ 𝜎𝑗 ∙ 𝐐𝐓(𝜎)) = (1/𝐓) ∙ (𝑝𝑖𝑗 − 𝑞𝑖𝑗) = −(1/𝐓) ∙ (𝑞𝑖𝑗 − 𝑝𝑖𝑗)

Finally, we use formula (5.16) in the same way in formula (5.37); the formula we obtain is exactly formula (5.24):

∂KL(𝐪||𝐩𝐓)/∂𝜃𝑖 = −(1/𝐓) ∙ ∑_{(𝑣,ℎ)∈I𝒩} 𝐐𝐓(𝑣, ℎ) ∙ (𝑝𝑖 − 𝜎𝑖) = (1/𝐓) ∙ (∑_{𝜎∈I𝒩} 𝜎𝑖 ∙ 𝐐𝐓(𝜎) − 𝑝𝑖 ∙ ∑_{𝜎∈I𝒩} 𝐐𝐓(𝜎)) = (1/𝐓) ∙ (𝑞𝑖 − 𝑝𝑖)
With respect to Theorem 5.4 we remark that the formulae (5.23) and (5.24) show that the
process of reaching thermal equilibrium ensures that the joint activity of any two units contains
all the information required for changing the weight between them in order to give the network a
better model of its environment [43]. Specifically, the joint activity of any two units encodes
information explicitly about those units and encodes information implicitly about all the other
weights in the network [43]. The formulae (5.23) and (5.24) also show that the learning rule does not depend on what kind of units the pair comprises: both visible, both hidden, or one visible and one hidden.
In practice, to minimize KL(𝐪||𝐩𝐓), it is sufficient to observe 𝑞𝑖𝑗, 𝑝𝑖𝑗, 𝑞𝑖, and 𝑝𝑖 at thermal equilibrium and to update each weight and each threshold by gradient descent on the formulae (5.23) and (5.24):

$$\Delta w_{ij} = \delta \cdot (q_{ij} - p_{ij}) \tag{5.38}$$

$$\Delta\theta_i = -\delta \cdot (q_i - p_i) \tag{5.39}$$

where 𝛿 is a constant learning rate.
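The update step above is a one-line operation once the equilibrium statistics are available. A minimal sketch in Python (not from the thesis; the function and argument names are illustrative, and the statistics are assumed to have been estimated already):

```python
import numpy as np

def boltzmann_update(W, theta, q_pair, p_pair, q_unit, p_unit, delta=0.1):
    """One gradient-descent step on KL(q || p_T) from equilibrium statistics.

    q_pair/p_pair: n x n matrices of clamped/free pairwise statistics q_ij, p_ij.
    q_unit/p_unit: length-n vectors of clamped/free unit statistics q_i, p_i.
    """
    dW = delta * (q_pair - p_pair)             # Delta w_ij = delta * (q_ij - p_ij)
    np.fill_diagonal(dW, 0.0)                  # no self-connections
    W = W + (dW + dW.T) / 2.0                  # keep the weight matrix symmetric
    theta = theta - delta * (q_unit - p_unit)  # Delta theta_i = -delta * (q_i - p_i)
    return W, theta
```

Symmetrizing the increment keeps the weight matrix consistent with the symmetric-connection hypothesis used throughout this chapter.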
Another possibility is to incorporate the annealing process into the learning rate as follows:

$$\delta = \frac{1}{\mathbf{T}} \tag{5.40}$$
We end this section with high–level pseudocode of the generic learning algorithm. We present this algorithm by emphasizing the aspects related to the learning rule we derived. At this time we do not go into details regarding the update process (specifically, how the units are selected for update) or the collection of statistics. However, we mention that, for a given pattern, the selection of a unit to be updated is in principle similar to the selection performed in a Hopfield network (Section 2.5.1); alternatively, it can be a stochastic process (taking place at a given mean rate 𝑟 > 0 for each unit) or a deterministic process (part of a predefined sequence). In Section 5.3.3 we will present various strategies to update the units as well as to collect the statistics 𝑞𝑖𝑗 and 𝑝𝑖𝑗.
Algorithm 5.1: Generic Boltzmann Machine Learning
Given: n × n weight matrix W ; n × 1 threshold vector Θ
       a training set of 𝑝𝑎 data vectors: {𝑣(𝑘)}1≤𝑘≤𝑝𝑎
       the number of learning cycles: 𝑒𝑝
begin
Step 1: initialize the weights W and the thresholds Θ
For an arbitrary number of learning cycles:
Step 2: for e = 1 to 𝑒𝑝 do
For each one of the patterns to be learned:
Step 3:   for k = 1 to 𝑝𝑎 do
Clamping phase:
Step 4:     present and clamp the pattern 𝑣(𝑘)
UPDATE PROCESS START
Randomly pick a hidden unit to update its value:
Step 5:     choose at random a hidden unit ℎ𝑖 from the set ℋ
Lower the pseudo–temperature following a schedule:
Step 6:     anneal ℎ𝑖
At the final pseudo–temperature estimate the correlations:
Step 7:     collect statistics 𝑞𝑖𝑗
UPDATE PROCESS END
Free–running phase:
Step 8:     present the pattern 𝑣(𝑘) but do not clamp it
UPDATE PROCESS START
Randomly pick a visible or hidden unit to update its value:
Step 9:     choose at random a unit 𝜎𝑖 from the set 𝒩
Lower the pseudo–temperature following a schedule:
Step 10:    anneal 𝜎𝑖
At the final pseudo–temperature estimate the correlations:
Step 11:    collect statistics 𝑝𝑖𝑗
UPDATE PROCESS END
Update the weights and the thresholds for any pair of connected units 𝑖 ≠ 𝑗 such that at least one unit has been updated:
Step 12:    Δ𝑤𝑖𝑗 = 𝛿 ∙ (𝑞𝑖𝑗 − 𝑝𝑖𝑗) for 𝑖 ≠ 𝑗
            𝑤𝑖𝑗 ← 𝑤𝑖𝑗 + Δ𝑤𝑖𝑗
            Δ𝜃𝑖 = −𝛿 ∙ (𝑞𝑖 − 𝑝𝑖)
            𝜃𝑖 ← 𝜃𝑖 + Δ𝜃𝑖
          end for //k
        end for //e
end
return W, Θ
The generic Boltzmann machine learning algorithm runs slowly, partly because of the time
required to reach thermal equilibrium and partly because the learning is driven by the difference
between two noisy variables, so these variables must be sampled for a long time at thermal
equilibrium to reduce the noise [64]. If we could achieve the same simple relationships between
log probabilities and weights in a deterministic system, the learning process would be much
faster. We will explore this idea in Section 5.5.
5.3.2 Collecting the statistics required for learning
In the previous section we saw that the generic learning algorithm computes the difference
between two expectations or statistics denoted by us as 𝑞𝑖𝑗 (equation (5.19)) and 𝑝𝑖𝑗 (equation
(5.20)). To evaluate the complexity of the exact computation of these expectations, we rewrite
the equation (5.19) by substituting 𝐐𝐓(𝑣, ℎ) with its definition given in Section 5.1:
$$q_{ij} = \sum_{\sigma\in I_{\mathcal{N}}} \sigma_i \cdot \sigma_j \cdot \mathbf{Q}_{\mathbf{T}}(\sigma) = \sum_{\sigma=(v,h)\in I_{\mathcal{N}}} \sigma_i \cdot \sigma_j \cdot \mathbf{Q}_{\mathbf{T}}(v,h)$$

$$q_{ij} = \sum_{\sigma=(v,h)\in I_{\mathcal{N}}} \sigma_i \cdot \sigma_j \cdot \mathbf{Q}_{\mathbf{T}}(h \mid v) \cdot \mathbf{q}(v) \tag{5.41}$$
In the equation (5.41), $\mathbf{q}(v)$ can be approximated by its empirical distribution $\hat{\mathbf{q}}(v)$, whose computation is tractable:

$$\mathbf{q}(v) \cong \hat{\mathbf{q}}(v) = \frac{1}{pa} \sum_{k=1}^{pa} \prod_{u=1}^{m} \mathbb{I}_{u;\,v_u^{(k)}}(v_u) \tag{5.42}$$

where: $pa$ is the number of data vectors (patterns); $\{v^{(k)}\}_{1\le k\le pa}$ is the training set of data vectors; $\mathbb{I}_{u;\,v_u^{(k)}}(v_u)$ is the indicator function given by the equation (3.7); and $m$ is the number of visible units (equation (4.25)).
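For a finite training set, the product of indicator functions in (5.42) simply counts how many training vectors coincide with each visible configuration 𝑣. A minimal sketch (not from the thesis; the function name and the toy training set are illustrative):

```python
from collections import Counter

def empirical_distribution(patterns):
    """Empirical distribution q_hat(v) over visible configurations:
    the fraction of training vectors equal to each v (equation (5.42))."""
    pa = len(patterns)
    counts = Counter(tuple(v) for v in patterns)
    return {v: c / pa for v, c in counts.items()}

# toy training set of pa = 4 binary visible vectors
patterns = [(1, 0), (1, 0), (0, 1), (1, 1)]
q_hat = empirical_distribution(patterns)
# q_hat[(1, 0)] == 0.5
```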
A simplified analysis of Algorithm 5.1 shows that the complexity of computing 𝑞𝑖𝑗 depends on
the complexity of computing 𝐐𝐓(ℎ|𝑣), which is exponential in the number of hidden units 𝑙
(equation (4.25)). Also the complexity of computing 𝑝𝑖𝑗 depends on the complexity of computing
𝐏𝐓(𝑣, ℎ) which is exponential in the total number of units (visible and hidden) 𝑛 = 𝑚 + 𝑙.
Consequently, both computations (𝑞𝑖𝑗 and 𝑝𝑖𝑗) are intractable. Later in this section we will see
why the analysis of Algorithm 5.1 is actually much more complicated.
Therefore, in order to compute the parameters of the network, we need an approximation of the environmental distribution as well as estimates of the statistics under both the environmental distribution and the true distribution. We have already seen how the approximation of the environmental distribution is performed (equation (5.42)). Now we concentrate our attention on the estimation tasks. Both estimation tasks can be accomplished by using the MCMC framework. In essence, an MCMC algorithm performs sampling from a probability distribution by constructing a Markov chain that has the desired distribution as its equilibrium distribution. Then, after a number of steps, the algorithm uses the state of the chain as a sample of the desired distribution.
In this section we present three categories of sampling algorithms that are used to estimate the
data–dependent statistics 𝑞𝑖𝑗 and/or the data–independent statistics 𝑝𝑖𝑗 in a Boltzmann
machine: Gibbs sampling, persistent Markov chains, and contrastive divergence.
5.3.2.1 Gibbs sampling
Hinton and Sejnowski used an MCMC sampling approach for estimating the data–dependent
statistics 𝑞𝑖𝑗 by clamping a training vector on the visible units, initializing the hidden units to
random binary states, and using sequential Gibbs sampling of the hidden units to approach the
posterior distribution [11,43]. They estimated the data–independent statistics 𝑝𝑖𝑗 in the same
way, but with the randomly initialized visible units included in the sequential Gibbs sampling
[11,43].
In Gibbs sampling, each variable draws a sample from its posterior distribution given the current
states of the other variables [65]. Before explaining how Gibbs sampling actually works, we
recall the notation X−i and its meaning as the set of all the random variables from X except 𝑋𝑖
(3.73). Given a joint probability distribution 𝐏 of 𝑛 binary random variables X = (𝑋1, 𝑋2, … , 𝑋𝑛), Gibbs sampling of 𝐏 is done through a sequence of 𝑛 sampling sub–steps of the following form, such that the new value for 𝑋𝑖 is used straight away in subsequent sampling steps:

$$X_i \sim \mathbf{P}(X_i \mid X_{-i} = x_{-i}) \quad \text{for } 1 \le i \le n \tag{5.43}$$

or, equivalently for binary variables:

$$X_i^{(t+1)} = \begin{cases} 1, & \text{if } u \le \mathbf{P}(X_i = 1 \mid X_{-i}^{(t)} = x_{-i}) \\ 0, & \text{otherwise} \end{cases}$$
where 𝑥−𝑖 represents the evidence of the corresponding random variables X−i and 𝑢 is a sample
from a uniform distribution 𝒰[0,1]. After these 𝑛 samples have been obtained, a step of the
chain is completed, yielding a sample of X whose distribution converges to 𝐏(X) as the number
of steps goes to ∞, under some conditions. A sufficient condition for convergence of a finite–
state Markov chain is that it is aperiodic and irreducible (see Theorem C.7 in Appendix C).
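The sub-step structure of (5.43) can be sketched generically: one sweep visits every variable once, and each new value is used immediately. A minimal illustration (not from the thesis; `cond_prob` is an assumed caller-supplied function returning P(X_i = 1 | X_{-i})):

```python
import random

def gibbs_sweep(x, cond_prob, rng=None):
    """One full Gibbs sweep over n binary variables (equation (5.43)).

    cond_prob(i, x) must return P(X_i = 1 | X_{-i} = x_{-i}); the new
    value of each variable is used immediately in later sub-steps."""
    rng = rng or random.Random()
    for i in range(len(x)):
        u = rng.random()                   # u ~ Uniform[0, 1)
        x[i] = 1 if u <= cond_prob(i, x) else 0
    return x
```

Repeating such sweeps yields a chain whose distribution converges, under the aperiodicity and irreducibility conditions mentioned above, to the joint distribution P.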
Let us consider a configuration 𝜎 = (𝑣, ℎ) of 𝐁𝐌 and denote by 𝜎−𝑖 the set of values associated with all units except the ith unit. In order to perform Gibbs sampling in 𝐁𝐌, we need to compute and sample from 𝐏(𝜎𝑖|𝜎−𝑖) as follows:

$$\mathbf{P}(\sigma) = \mathbf{P}(\sigma_i, \sigma_{-i}) = \frac{\exp(-E(\sigma_i, \sigma_{-i}))}{Z}$$

$$\mathbf{P}(\sigma_i = 1 \mid \sigma_{-i}) = \frac{\exp(-E(\sigma_i = 1, \sigma_{-i}))}{\exp(-E(\sigma_i = 1, \sigma_{-i})) + \exp(-E(\sigma_i = 0, \sigma_{-i}))}$$

$$\mathbf{P}(\sigma_i = 1 \mid \sigma_{-i}) = \frac{1}{1 + \dfrac{\exp(-E(\sigma_i = 0, \sigma_{-i}))}{\exp(-E(\sigma_i = 1, \sigma_{-i}))}}$$

$$\mathbf{P}(\sigma_i = 1 \mid \sigma_{-i}) = \frac{1}{1 + \exp(E(\sigma_i = 1, \sigma_{-i}) - E(\sigma_i = 0, \sigma_{-i}))}$$

$$\mathbf{P}(\sigma_i = 1 \mid \sigma_{-i}) = \frac{1}{1 + \exp\left(-\sum_{j\in\mathcal{N}} \sigma_j \cdot w_{ji} + \theta_i\right)}$$

$$\mathbf{P}(\sigma_i = 1 \mid \sigma_{-i}) = \mathrm{sigm}\left(\sum_{j\in\mathcal{N}} \sigma_j \cdot w_{ji} - \theta_i\right) \tag{5.44}$$
Our task is to use formula (5.44) to sample in the positive phase ℎ from 𝐐𝐓(ℎ|𝑣) and to sample
in the negative phase both 𝑣 and ℎ from 𝐏𝐓(𝑣, ℎ), i.e., 𝐏𝐓(𝑣|ℎ) and 𝐏𝐓(ℎ|𝑣).
Therefore, in the positive phase we run a Markov chain for $\mathbf{Q}_{\mathbf{T}}$ and sample $h$ according to the following formula:

$$\mathbf{Q}_{\mathbf{T}}(h_i = 1 \mid v, h_{-i}) = \mathrm{sigm}\left(\sum_{j\in\mathcal{V}} v_j \cdot w_{ji} + \sum_{m\in\mathcal{H}-\{i\}} h_m \cdot w_{mi} - \theta_i\right) \tag{5.45}$$

In the negative phase we run a Markov chain for $\mathbf{P}_{\mathbf{T}}$ and sample $h$ and $v$ according to the following formulae:

$$\mathbf{P}_{\mathbf{T}}(h_i = 1 \mid v, h_{-i}) = \mathrm{sigm}\left(\sum_{j\in\mathcal{V}} v_j \cdot w_{ji} + \sum_{m\in\mathcal{H}-\{i\}} h_m \cdot w_{mi} - \theta_i\right) \tag{5.46}$$

$$\mathbf{P}_{\mathbf{T}}(v_i = 1 \mid v_{-i}, h) = \mathrm{sigm}\left(\sum_{j\in\mathcal{H}} h_j \cdot w_{ji} + \sum_{k\in\mathcal{V}-\{i\}} v_k \cdot w_{ki} - \theta_i\right)$$
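Formulae (5.44)–(5.46) all have the same shape: a logistic function of the unit's net input. A minimal sketch of one such Gibbs update for a single unit (not from the thesis; W is assumed symmetric with zero diagonal, and the function names are illustrative):

```python
import math
import random

def sigm(x):
    return 1.0 / (1.0 + math.exp(-x))

def gibbs_update_unit(sigma, W, theta, i, rng=None):
    """Resample unit i from P(sigma_i = 1 | sigma_{-i}) per formula (5.44)."""
    rng = rng or random.Random()
    net = sum(sigma[j] * W[j][i] for j in range(len(sigma)) if j != i)
    p_on = sigm(net - theta[i])
    sigma[i] = 1 if rng.random() <= p_on else 0
    return sigma
```

Restricting the sum over j to the visible units (or to the other hidden units) gives exactly the specialized forms (5.45) and (5.46).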
For any iteration of learning, two separate Markov chains are run for every data vector: one chain is used to estimate 𝑞𝑖𝑗 and another chain is run to estimate 𝑝𝑖𝑗. This makes the algorithm computationally expensive because, before taking samples, we must wait until each Markov chain reaches its stationary distribution, and this process can require a very large number of steps, with no known foolproof method to determine whether equilibrium has been reached. A further disadvantage is the large variance of the estimated gradient: in general, the samples from the stationary distribution have high variance since they come from all over the model's distribution [37].
The Markov chains used in the negative phase of Gibbs sampling do not depend on the training
data; therefore, they do not have to be restarted for each new data vector 𝑣. This observation
has been exploited in persistent MCMC estimators [39, 66].
5.3.2.2 Using persistent Markov chains to estimate the data–independent statistics
Neal in [14] and Tieleman in [66] proposed a different way to estimate the data–independent
statistics 𝑝𝑖𝑗: a stochastic approximation procedure (SAP). SAP belongs to the class of
stochastic approximation algorithms of the Robbins–Monro type [67-69,39].
To understand how SAP works, we assume that, for any 𝑖 ≠ 𝑗, the data–dependent statistics 𝑞𝑖𝑗 are available to us at any time and that we maintain a set of 𝑀 sample points {𝜎(1)(𝑡), 𝜎(2)(𝑡), … , 𝜎(𝑀)(𝑡)}.
We augment our notation to include the parameters W of the network in the specification of the joint probability distribution 𝐏𝐓, as well as in the expression of the data–independent statistics 𝑝𝑖𝑗 (equation (5.20)):
$$p_{ij}(W) = \sum_{\sigma\in I_{\mathcal{N}}} \sigma_i \cdot \sigma_j \cdot \mathbf{P}_{\mathbf{T}}(\sigma; W) \tag{5.47}$$
Next, we consider a 𝐓𝐕𝐁𝐌 = (𝒩, 𝒢,W) given by Definition 4.3 and let W(𝑡) be the parameters
of the 𝐓𝐕𝐁𝐌 at the moment 𝑡 ∈ 𝒯 and 𝜎(𝑡) be the configuration of the 𝐓𝐕𝐁𝐌 at the same
moment of time. Then, W(𝑡) and 𝜎(𝑡) are updated sequentially as follows:
1. Given 𝜎(𝑡 − 1), a new state 𝜎(𝑡) is sampled from a transition operator 𝑇W(𝜎(𝑡 − 1) → 𝜎(𝑡)) that leaves 𝐏𝐓(W) invariant.
2. Having W(𝑡 − 1) and 𝜎(𝑡), the data–independent statistics 𝑝𝑖𝑗 are updated according to the formulae (5.46) to reflect the changes that affected 𝜎 (from 𝑡 − 1 to 𝑡).
3. Based on the new value of 𝑝𝑖𝑗, a new parameter matrix W(𝑡) is obtained with the formula (5.38).
The transition operator 𝑇W(𝜎(𝑡 − 1) → 𝜎(𝑡)) is defined by the blocked Gibbs updates given by
the formulae (5.46). Precise sufficient conditions that guarantee almost sure convergence to an
asymptotically stable point are given in [68-70]. One necessary condition requires the learning rate to decrease with time, i.e.:

$$\sum_{t=0}^{\infty} \delta_t = \infty \quad \text{and} \quad \sum_{t=0}^{\infty} \delta_t^2 < \infty \tag{5.48}$$

This condition can be trivially satisfied by setting $\delta_t = \frac{1}{t}$. Typically, in practice, the sequence
{|W(𝑡)|}𝑡∈𝒯 is bounded, and the Markov chain governed by the transition operator 𝑇W is ergodic.
Together with the condition on the learning rate, this ensures almost sure convergence [39]. The
pseudocode of SAP is presented below.
Algorithm 5.2: Stochastic Approximation
Given: n × n weight matrix W ; n × 1 threshold vector Θ
       all possible 𝑞𝑖𝑗 for any 𝑖 ≠ 𝑗
begin
Step 1: initialize W(0) and 𝑀 fantasy particles: {𝜎(1)(0), … , 𝜎(𝑀)(0)}
Step 2: for t = 1 to 𝒯 do
Step 3:   for k = 1 to 𝑀 do
Step 4:     sample 𝜎(𝑘)(𝑡) given 𝜎(𝑘)(𝑡 − 1) using the transition
            operator 𝑇W(𝜎(𝑘)(𝑡 − 1) → 𝜎(𝑘)(𝑡))
          end for //k
Step 5:   update W(𝑡) = W(𝑡 − 1) + 𝛿𝑡 ∙ (𝑞𝑖𝑗 − 𝑝𝑖𝑗)
Step 6:   decrease 𝛿𝑡
        end for //t
end
The intuition behind why this procedure works is the following: as the learning rate becomes
sufficiently small compared with the mixing rate of the Markov chain, this “persistent” chain will
always stay very close to the stationary distribution even if it is only run for a few MCMC
updates per parameter update. Samples from the persistent chain will be highly correlated for
successive parameter updates, but again, if the learning rate is sufficiently small the chain will
mix before the parameters have changed enough to significantly alter the value of the estimator
[39]. Many persistent chains can be run in parallel. The current state of each of these chains is
usually denoted as a “fantasy particle” [39].
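The persistent-chain idea can be sketched compactly: the fantasy particles survive across parameter updates, and each update advances every chain by a single Gibbs sweep under the current parameters. A minimal illustration (not from the thesis; thresholds are omitted for brevity and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def pairwise_stats(particles):
    """Average sigma_i * sigma_j over the particles (estimate of p_ij)."""
    P = np.asarray(particles, dtype=float)
    return P.T @ P / len(P)

def sap_learning(q_pair, n, n_particles=10, steps=50):
    """Stochastic approximation (Algorithm 5.2): the fantasy particles
    persist across parameter updates instead of being restarted."""
    W = np.zeros((n, n))
    particles = rng.integers(0, 2, size=(n_particles, n))
    for t in range(1, steps + 1):
        for p in particles:                 # one Gibbs sweep per particle
            for i in range(n):
                net = p @ W[:, i] - p[i] * W[i, i]
                p[i] = rng.random() <= 1.0 / (1.0 + np.exp(-net))
        p_pair = pairwise_stats(particles)
        delta_t = 1.0 / t                   # Robbins-Monro schedule (5.48)
        W += delta_t * (q_pair - p_pair)    # Step 5 of Algorithm 5.2
        np.fill_diagonal(W, 0.0)
    return W
```

Because the particles are never reset, only one sweep per parameter update is needed, in contrast with the repeated burn-in of plain Gibbs sampling.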
One important observation is that the process of running persistent Markov chains to produce the data–independent statistics 𝑝𝑖𝑗 is interleaved with the learning process itself. Consequently, the analysis of Algorithm 5.1 becomes more difficult because it can no longer be viewed as an outer loop around the inner loop represented by the statistics gathering.
5.3.2.3 Contrastive divergence (CD)
In [37] Hinton proposed a simple and effective alternative to maximum likelihood (ML) learning
that eliminates almost all of the computation required to get samples from the equilibrium
distribution and also eliminates much of the variance that masks the gradient signal. This
method, named contrastive divergence, follows the gradient of a different function than ML
learning.
ML learning follows the log likelihood gradient by minimizing the KL–divergence:

$$\mathrm{KL}(\mathbf{q}\|\mathbf{p}_{\mathbf{T}}) \equiv \mathrm{KL}(\mathbf{q}^{0}\|\mathbf{p}_{\mathbf{T}}^{\infty}) = \sum_{v\in I_{\mathcal{V}}} \mathbf{q}^{0}(v) \cdot \ln\frac{\mathbf{q}^{0}(v)}{\mathbf{p}_{\mathbf{T}}^{\infty}(v;W)} \tag{5.49}$$
CD learning approximately follows the gradient of the difference of two divergences [37]:

$$\mathrm{CD}_k = \mathrm{KL}(\mathbf{q}^{0}\|\mathbf{p}_{\mathbf{T}}^{\infty}) - \mathrm{KL}(\mathbf{q}^{k}\|\mathbf{p}_{\mathbf{T}}^{\infty}) \tag{5.50}$$

where: $\mathbf{q}^{0}$ is the data distribution over the visible vectors; $\mathbf{q}^{k}$ is the distribution over the “k–step” reconstructions of the data vectors generated by $k > 0$ full steps of Gibbs sampling; and $\mathbf{p}_{\mathbf{T}}^{\infty}$ is the equilibrium distribution of the network. In particular, $\mathbf{q}^{k}$ could even be $\mathbf{q}^{1}$, the distribution over the “one–step” reconstructions.
The CD algorithm is fueled by the contrast between the statistics collected when the input is a
real training example and when the input is a chain sample [71]. The intuitive motivation behind
CD is that we would like the Markov chain that is implemented by Gibbs sampling to leave the
initial distribution 𝐪𝟎 over the visible variables unaltered.
In CD learning, we start the Markov chain at the data distribution 𝐪𝟎 and run the chain for a
small number of steps. Instead of updating the parameters only after running the chain to
equilibrium, we simply run the chain for 𝑘 full steps and update the parameters. Then we keep
running the chain to equilibrium and, when there, update the parameters again. Thus, we
reduce the tendency of the chain to wander away from the initial distribution after 𝑘 steps.
Because 𝐪𝒌 is 𝑘 steps closer to the equilibrium distribution than 𝐪𝟎, we are guaranteed that
KL(𝐪𝟎||𝐩𝐓∞) exceeds KL(𝐪𝐤||𝐩𝐓∞) unless 𝐪𝟎 equals 𝐪𝐤. Consequently, CD (equation (5.50))
can never be negative. Also, for Markov chains in which all transitions have nonzero probability,
𝐪𝟎 = 𝐪𝐤 implies 𝐪𝟎 = 𝐩𝐓∞, because, if the distribution does not change at all on the first step, it
must already be at equilibrium, so CD can be zero only if the model is perfect.
In [72] Carreira–Perpinan and Hinton showed that, in general, CD provides biased estimates. However, the bias seems to be small, as their experiments comparing CD and ML showed. They also showed that, for almost all data distributions, the fixed–points of CD are not fixed–points of ML, and vice versa. Finally, they proposed a new and more effective approach to collecting statistics for Boltzmann machine learning: use CD to perform most of the learning, followed by a short run of ML to clean up the solution.
CD learning is well–suited for Restricted Boltzmann Machines due to the fact that they are the
only Boltzmann machines with tractable conditional distributions 𝐏𝐓(ℎ|𝑣) and 𝐐𝐓(ℎ|𝑣).
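For a restricted Boltzmann machine, where the conditional distributions factorize over the layers, one CD-1 parameter update can be sketched as follows. This is an illustrative reading of the procedure, not the thesis's own code: the temperature is fixed at 1, thresholds are omitted for brevity, and the names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, v0, lr=0.1):
    """One CD-1 step for an RBM: contrast the data-driven statistics with
    the statistics of one-step reconstructions.

    W: m x l weight matrix; v0: batch of data vectors (batch x m)."""
    ph0 = sigm(v0 @ W)                         # positive phase: Q(h = 1 | v0)
    h0 = (rng.random(ph0.shape) <= ph0).astype(float)
    pv1 = sigm(h0 @ W.T)                       # reconstruct the visible units
    v1 = (rng.random(pv1.shape) <= pv1).astype(float)
    ph1 = sigm(v1 @ W)                         # hidden probabilities at step 1
    grad = (v0.T @ ph0 - v1.T @ ph1) / v0.shape[0]
    return W + lr * grad
```

The contrast between the data-driven term and the reconstruction-driven term is exactly what "fuels" the CD algorithm in the sense described above.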
5.4 The equilibrium distribution of a Boltzmann machine
To develop an understanding of Boltzmann machines it is necessary to be able to
determine the equilibrium distribution. In general, the equilibrium distribution of a stochastic
process is related to the structure of the associated transition probability matrix. If the transition
probabilities are known, then it becomes possible to compute the equilibrium distribution. In an
asynchronous Boltzmann machine with 𝑛 units the transitions between configurations or so–
called state transitions generate a finite Markov chain whose state space contains 2𝑛 global
states and whose transitions between global states are performed such that only one unit may change its state at a time, according to the update rule (4.34) or its temporal version (4.44). A consequence of these update rules is the implicit satisfaction of the Markov local property for the associated Markov chain. Moreover, the transition probability matrix of the associated Markov chain can be computed from the parameters of the model.
In this section we are interested in establishing the existence of a unique equilibrium distribution
for a Boltzmann machine when 𝐓 > 0 and in determining the behavior of a Boltzmann machine
when 𝐓 = 0. In presenting the chain of logical arguments that relate Markov processes and
Boltzmann machines we have been inspired by the work of Viveros [42].
In order to construct the 2𝑛 × 2𝑛 transition probability matrix of the Markov chain associated to a Boltzmann machine, we need to define the transition probability 𝐏𝐓(𝜎(𝑡) | 𝜎(𝑡 − 1)) of a global state transition 𝜎(𝑡 − 1) → 𝜎(𝑡). The transition probability matrix 𝐏𝐓(𝜎(𝑡) | 𝜎(𝑡 − 1)) is a rather sparse matrix. If the updating sequence is random, the matrix will have 𝑛 + 1 non–zero entries per row. One of the non–zero entries is on the diagonal and represents the probability that the particular unit selected for updating does not change its state, in which case the configuration 𝜎 remains unchanged. Each of the other non–zero entries represents the probability that the corresponding unit will change its state according to the update rule (4.34) or its temporal version (4.44); these entries correspond to transitions to one of the 𝑛 states that differ from 𝜎 by a change in the state of a single unit [42].
However, if the updating sequence is more orderly and each unit has a predetermined turn to
update, the number of non–zero entries per row decreases further. For example, one way to
update a layered Boltzmann machine is layer–by–layer and sequentially inside a layer. Thus, in
each row of the transition probability matrix there will be exactly two components that are not
zero: one to the “left” and one to the “right” of the unit to be updated.
Therefore, different updating regimes lead to different transition probability matrices and consequently to different dynamics. Before we define the transition probability matrix 𝐏𝐓(𝜎(𝑡) | 𝜎(𝑡 − 1)), we recall its source of inspiration: the transition probability for a single unit in one time step, given by the formula (4.44). The formula (4.44) also shows that, in one time step, every transition from one configuration to another configuration that differs from the first in at most one position has non–zero probability.
Definition 5.3:
The transition probability 𝐏𝐓(𝜎(𝑡) | 𝜎(𝑡 − 1)) of a global state transition 𝜎(𝑡 − 1) → 𝜎(𝑡) in an asynchronous Boltzmann machine is given by the formula (5.51):

$$\mathbf{P}_{\mathbf{T}}(\sigma(t) \mid \sigma(t-1)) = \begin{cases} \dfrac{1}{1 + \exp\left(-\dfrac{(2\sigma_i(t)-1)\cdot \mathrm{net}_t(i,\sigma(t-1))}{\mathbf{T}}\right)} & \text{if there is exactly one } i \text{ such that } \sigma_i(t) \ne \sigma_i(t-1) \\[3ex] 1 - \displaystyle\sum_{\rho(t)\ne\sigma(t)} \mathbf{P}_{\mathbf{T}}(\rho(t) \mid \sigma(t-1)) & \text{if } \sigma_i(t) = \sigma_i(t-1) \text{ for all } i \\[2ex] 0 & \text{otherwise} \end{cases} \tag{5.51}$$
We can obtain a more readable form of the formula (5.51) by using the notations $\sigma(t) \equiv \sigma$ and $\sigma(t-1) \equiv \tau$. Then the transition matrix $\mathbf{P}_{\mathbf{T}}(\sigma(t) \mid \sigma(t-1))$ given by (5.51) becomes $\mathbf{P}_{\mathbf{T}}(\sigma \mid \tau)$ given by (5.52):

$$\mathbf{P}_{\mathbf{T}}(\sigma \mid \tau) = \begin{cases} \dfrac{1}{1 + \exp\left(-\dfrac{(2\sigma_i-1)\cdot \mathrm{net}(i,\tau)}{\mathbf{T}}\right)} & \text{if there is exactly one } i \text{ such that } \sigma_i \ne \tau_i \\[3ex] 1 - \displaystyle\sum_{\rho\ne\sigma} \mathbf{P}_{\mathbf{T}}(\rho \mid \tau) & \text{if } \sigma_i = \tau_i \text{ for all } i \\[2ex] 0 & \text{otherwise} \end{cases} \tag{5.52}$$
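For a small machine, the 2^n × 2^n matrix of (5.52) can be built explicitly and checked to be stochastic. A sketch (not from the thesis), assuming each unit is selected for update with uniform probability 1/n, as in the proof of Theorem 5.8 below; names are illustrative:

```python
import itertools
import numpy as np

def transition_matrix(W, theta, T=1.0):
    """Build the 2^n x 2^n transition matrix of (5.52), assuming uniform
    1/n selection of the unit to update."""
    n = len(theta)
    states = list(itertools.product([0, 1], repeat=n))
    index = {s: k for k, s in enumerate(states)}
    P = np.zeros((2 ** n, 2 ** n))
    for tau in states:
        for i in range(n):
            net = sum(tau[j] * W[j][i] for j in range(n) if j != i) - theta[i]
            flipped = tau[:i] + (1 - tau[i],) + tau[i + 1:]
            # probability that the selected unit i moves to the flipped value
            p_flip = 1.0 / (1.0 + np.exp(-(2 * flipped[i] - 1) * net / T))
            P[index[tau], index[flipped]] += p_flip / n
        # diagonal: probability that the configuration remains unchanged
        P[index[tau], index[tau]] = 1.0 - P[index[tau]].sum()
    return P
```

Each row then has at most n + 1 non-zero entries, exactly as described above.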
In order to establish the existence or non–existence of an equilibrium distribution of a Markov chain in terms of the properties of its associated transition probability matrix, we take into consideration the fact that the transition probability matrix is stochastic, i.e., a special case of a non–negative matrix. Furthermore, the Perron–Frobenius theorem [73-74] details the precise range of possibilities for the eigenvalues and eigenvectors of non–negative, irreducible matrices. This result is of particular importance for us because the properties of the eigenvalues of the transition probability matrix are the only factors that influence the nature and existence of an equilibrium distribution for a Markov process [42].
We introduce the concepts of aperiodicity and irreducibility. A finite Markov chain is aperiodic if no state of it is periodic with period 𝑘 > 1; a state has period 𝑘 if one can only return to it at times that are multiples of 𝑘. A finite Markov chain is irreducible if one can reach any state from any state in finite time with non–zero probability. We provide without proof the following theorem, which asserts the existence and uniqueness of the equilibrium distribution for a category of transition probability matrices that includes, as we shall see, the transition probability matrices associated to asynchronous Boltzmann machines.

Theorem 5.5 (Existence of a unique equilibrium distribution for stochastic, irreducible, aperiodic matrices)
Let 𝐏 be a 𝑑 × 𝑑 stochastic, irreducible, and aperiodic matrix. Then the equilibrium distribution for the associated Markov process exists and is given by the left eigenvector corresponding to the eigenvalue 𝜆 = 1, which is an eigenvalue of 𝐏 since 𝐏 is stochastic.
The proof of this theorem can be found in [42]. In order to establish the existence of a unique
equilibrium distribution for an asynchronous Boltzmann machine, we are looking at the
irreducibility and aperiodicity properties of the transition probability matrix. Two cases are in
essence considered: when the pseudo–temperature is strictly positive and when the pseudo–temperature is zero. Moreover, within the case 𝐓 > 0, the scenario 𝐓 → 0 demands special
consideration. Therefore, we treat it separately. Table 2 gives a classification of all possible
transition probability matrices for asynchronous Boltzmann machines and how they position
themselves with respect to irreducibility and aperiodicity.
Table 2 Transition probability matrices for asynchronous symmetric Boltzmann machines

  𝐓 > 0: irreducible and aperiodic              → equilibrium distribution exists
  𝐓 = 0: irreducible and periodic, or reducible → no equilibrium distribution
𝐓 > 0
We want to prove the existence of an equilibrium distribution for asynchronous symmetric
Boltzmann machines. To accomplish this, first we prove a few helper lemmas and theorems,
then we prove the most significant result of this section: Theorem 5.10.
Lemma 5.6:
Given an asynchronous Boltzmann machine with 𝑛 units, any state can be visited from any
other state with positive probability in at most 𝑛 steps.
Proof: Each configuration of a Boltzmann machine with 𝑛 units can be viewed as a vector of length 𝑛, each component being the state of a unit. In an asynchronous Boltzmann machine one unit updates at each time step and, since 𝐓 > 0, the update is described by the equation (4.44). Moreover, every unit has a positive probability of changing its state.
A consequence of the update rule is that the Hamming distance between two successive configurations is no greater than 1. Thus, any two configurations are at most Hamming distance 𝑛 apart, and so the worst case requires that 𝑛 units change state. Therefore there is a positive probability that this can occur [42].
Theorem 5.7:
The transition probability matrix (5.51) – (5.52) of an asynchronous Boltzmann machine is
irreducible.
Proof: Let $\mathbf{P} = (p_{ij})_{1\le i,j\le 2^n}$ be the $2^n \times 2^n$ transition probability matrix $\mathbf{P}_{\mathbf{T}}(\sigma(t) \mid \sigma(t-1))$ given by (5.51). From Lemma 5.6 we have that any configuration has a positive probability of being visited from any other configuration in at most 𝑛 steps. For every configuration 𝜎 ∈ I𝒩 we denote by 𝑢 the index of the row corresponding to 𝜎 in 𝐏. Therefore, according to the definition of an irreducible matrix, for every pair of configurations 𝜎, 𝜏 ∈ I𝒩 identified by their row indices 𝑢 and 𝑣, there exists an integer 𝑟 ≥ 1 such that $p_{uv}^{(r)} > 0$; by Lemma 5.6 we may take 𝑟 ≤ 𝑛. Hence the transition probability matrix 𝐏𝐓 defined by (5.51) is an irreducible matrix [42].
Theorem 5.8:
The transition probability matrix (5.51) – (5.52) of an asynchronous Boltzmann machine is
aperiodic.
Proof: For any given configuration 𝜎 ∈ I𝒩 we shall prove that 𝐏𝐓(𝜎|𝜎) > 0. Given any configuration 𝜎 there are 𝑛 + 1 possible transitions, one of them being to remain in the current configuration. The 𝑛 other possible transitions lead to configurations at Hamming distance 1 from 𝜎. Let 𝜏 be one of these 𝑛 configurations. The probability that the ith unit of the configuration 𝜎 outputs 𝜏𝑖 is given by the equation (4.44):

$$\mathbf{P}(\tau_i \mid \sigma) = \frac{1}{1 + \exp\left(-\dfrac{(2\tau_i - 1)\cdot \mathrm{net}(i,\sigma)}{\mathbf{T}}\right)}$$

When 𝐓 > 0, we have 0 < 𝐏(𝜏𝑖|𝜎) < 1 for all these 𝑛 configurations. For any other configuration 𝜌, i.e., one whose Hamming distance to 𝜎 is greater than 1, the transition probability is zero.
Now we need to take into consideration another probability, which we have ignored until now because it has no direct effect on learning: the probability that the ith unit is selected for update. Usually this probability distribution is a uniform distribution over the set of units, which means that the probability of the ith unit being selected for update is $\frac{1}{n}$. This probability distribution is imposed by the environment, so it is independent of 𝐏.
This being said, we can compute 𝐏(𝜎|𝜎):

$$\mathbf{P}(\sigma \mid \sigma) = 1 - \sum_{\tau\in I_{\mathcal{N}}-\{\sigma\},\, i\in\mathcal{N}} \frac{1}{n}\cdot\mathbf{P}(\tau_i \mid \sigma) = 1 - \frac{1}{n}\cdot\sum_{\tau\in I_{\mathcal{N}}-\{\sigma\},\, i\in\mathcal{N}} \mathbf{P}(\tau_i \mid \sigma) > 0$$
Hence, for all 𝜎 ∈ I𝒩 we have that 𝐏(𝜎|𝜎) > 0. According to Lemma C.1 in Appendix C, we conclude that the transition probability matrix defined by (5.51) – (5.52) is aperiodic.
We remark that in this proof we have also proved that the transition probability matrix given by (5.51) – (5.52) is reflexive [42].
Theorem 5.9:
For the asynchronous transition probability matrix defined by (5.51) – (5.52) the equilibrium
distribution exists and is given by the left eigenvector corresponding to the eigenvalue 𝜆 = 1.
Proof: When 𝐓 > 0 the transition probability matrix given by (5.51) – (5.52) is stochastic. From
Theorem 5.7 and Theorem 5.8 we know that this transition probability matrix is also irreducible
and aperiodic. Therefore, from Theorem 5.5 there exists a unique equilibrium distribution given
by the left eigenvector of the matrix given by (5.51) – (5.52) corresponding to the eigenvalue
𝜆 = 1 [42].
We have proved that in general for any weight matrix the equilibrium distribution of an
asynchronous Boltzmann machine with 𝐓 > 0 is given by the left eigenvector of the transition
matrix (5.51) – (5.52) corresponding to the eigenvalue 𝜆 = 1. If the system is allowed to stabilize
at a given pseudo–temperature, we have proved the existence of an equilibrium distribution. If
we slowly lower the pseudo–temperature, allowing the system to restabilize at the new
equilibrium distribution, then as 𝐓 → 0 the distribution will tend to a uniform distribution over the
optimal set of configurations 𝜎, which we call 𝑂𝑝𝑡. Roughly speaking, the asynchronous
Boltzmann machine converges asymptotically to the set of globally optimal states 𝑂𝑝𝑡 ⊆ I𝒩 that
minimize the energy function given by (4.31) [42]. The following theorem states these facts
more formally.
Theorem 5.10 (Asynchronous weight–symmetric equilibrium distribution):
Let the transition probabilities in a Boltzmann machine be given by (5.51) – (5.52). Then:

1. There exists a unique equilibrium distribution 𝐏𝐓(𝜎) for all 𝐓 > 0 whose components are given by:

$$\mathbf{P}_{\mathbf{T}}(\sigma) = \lim_{k\to\infty} \mathbf{P}_{\mathbf{T}}(\sigma(k) = \sigma) = \frac{\exp\left(-\frac{E(\sigma)}{\mathbf{T}}\right)}{Z(\mathbf{T})}, \quad \text{where } Z(\mathbf{T}) = \sum_{\tau\in I_{\mathcal{N}}} \exp\left(-\frac{E(\tau)}{\mathbf{T}}\right) \tag{5.53}$$

2. As 𝐓 → 0 the stationary distribution converges to a uniform distribution over the set of optimal states, i.e.:

$$\lim_{\mathbf{T}\to 0}\left(\lim_{k\to\infty} \mathbf{P}_{\mathbf{T}}(\sigma(k) = \sigma)\right) = \frac{1}{|Opt|}\cdot\mathbb{I}_{Opt}(\sigma) \tag{5.54}$$

where $\mathbb{I}_{Opt}$ is the characteristic function of $Opt$, i.e., $\mathbb{I}_{Opt}$ takes the value one for any $\sigma \in Opt$ and zero elsewhere.
The first part asserts that the equilibrium distribution at any pseudo–temperature 𝐓 is the
Boltzmann–Gibbs distribution. The second part implies that as 𝐓 → 0 the equilibrium distribution
tends to a uniform distribution over the set 𝑂𝑝𝑡 of minimal energy states 𝜎.
Proof: From Theorem 5.9 the transition probability matrix defined by (5.51) – (5.52) has a
unique equilibrium distribution given by its left eigenvector and the corresponding eigenvalue
𝜆 = 1.
Our approach is to use Proposition C.5 from Appendix C to prove that 𝐏𝐓(𝜎) is the equilibrium distribution. Specifically, if for all 𝜎, 𝜏 ∈ I𝒩 there exist numbers 𝐏𝐓(𝜎), 𝐏𝐓(𝜏) such that:

$$\mathbf{P}_{\mathbf{T}}(\tau \mid \sigma)\cdot\mathbf{P}_{\mathbf{T}}(\sigma) = \mathbf{P}_{\mathbf{T}}(\sigma \mid \tau)\cdot\mathbf{P}_{\mathbf{T}}(\tau) \tag{5.55}$$

then 𝐏𝐓(𝜎) is the equilibrium distribution. We have to show that 𝐏𝐓(𝜎) given by (5.53) together with 𝐏𝐓(𝜎|𝜏) given by (5.52) satisfy (5.55). We distinguish two cases:

1. If 𝜎 = 𝜏, then (5.55) is satisfied by the definitions (5.52) and (5.53).
2. If 𝜎 ≠ 𝜏, then we have:
$$\mathbf{P}_{\mathbf{T}}(\sigma)\cdot\mathbf{P}_{\mathbf{T}}(\tau \mid \sigma) = \mathbf{P}_{\mathbf{T}}(\sigma)\cdot\mathbf{P}(\tau_i \mid \sigma)$$

$$\mathbf{P}_{\mathbf{T}}(\sigma)\cdot\mathbf{P}_{\mathbf{T}}(\tau \mid \sigma) = \frac{\exp\left(-\frac{E(\sigma)}{\mathbf{T}}\right)}{Z(\mathbf{T})}\cdot\frac{1}{1 + \exp\left(-\frac{(2\tau_i-1)\cdot \mathrm{net}(i,\sigma)}{\mathbf{T}}\right)} \tag{5.56}$$

By multiplying the right hand side of (5.56) by:

$$\exp\left(\frac{(2\tau_i-1)\cdot \mathrm{net}(i,\sigma)}{\mathbf{T}}\right)\cdot\exp\left(\frac{-(2\tau_i-1)\cdot \mathrm{net}(i,\sigma)}{\mathbf{T}}\right) = 1$$

we obtain:

$$\mathbf{P}_{\mathbf{T}}(\sigma)\cdot\mathbf{P}_{\mathbf{T}}(\tau \mid \sigma) = \frac{\exp\left(-\frac{E(\sigma)}{\mathbf{T}}\right)}{Z(\mathbf{T})}\cdot\frac{\exp\left(\frac{(2\tau_i-1)\cdot \mathrm{net}(i,\sigma)}{\mathbf{T}}\right)}{1 + \exp\left(\frac{(2\tau_i-1)\cdot \mathrm{net}(i,\sigma)}{\mathbf{T}}\right)} \tag{5.57}$$
The asynchronous Boltzmann machine is symmetric, so the equation (4.51) holds:

$$\mathrm{net}(i,\sigma)\cdot\Delta\sigma_i = \mathrm{net}(i,\sigma)\cdot(\tau_i - \sigma_i) = -\Delta E_i = E(\sigma) - E(\tau)$$

The equations (4.47) and (4.49) also hold for 𝜎 and 𝜏 because their Hamming distance is 1:

$$\mathrm{net}(i,\tau) = \mathrm{net}(i,\sigma) \qquad \tau_i = 1 - \sigma_i$$

We rewrite (5.57) by substituting 𝜏𝑖 as specified above and using the equality between the net inputs $\mathrm{net}(i,\tau)$ and $\mathrm{net}(i,\sigma)$. We obtain:
$$\mathbf{P}_{\mathbf{T}}(\sigma)\cdot\mathbf{P}_{\mathbf{T}}(\tau \mid \sigma) = \frac{\exp\left(-\frac{E(\tau) + \mathrm{net}(i,\sigma)\cdot\Delta\sigma_i}{\mathbf{T}}\right)}{Z(\mathbf{T})}\cdot\frac{\exp\left(\frac{(-2\sigma_i+1)\cdot \mathrm{net}(i,\tau)}{\mathbf{T}}\right)}{1 + \exp\left(\frac{(-2\sigma_i+1)\cdot \mathrm{net}(i,\tau)}{\mathbf{T}}\right)}$$

$$\mathbf{P}_{\mathbf{T}}(\sigma)\cdot\mathbf{P}_{\mathbf{T}}(\tau \mid \sigma) = \frac{\exp\left(-\frac{E(\tau)}{\mathbf{T}}\right)\cdot\exp\left(-\frac{\mathrm{net}(i,\sigma)\cdot(\tau_i-\sigma_i)}{\mathbf{T}}\right)}{Z(\mathbf{T})}\cdot\frac{\exp\left(-\frac{(2\sigma_i-1)\cdot \mathrm{net}(i,\tau)}{\mathbf{T}}\right)}{1 + \exp\left(-\frac{(2\sigma_i-1)\cdot \mathrm{net}(i,\tau)}{\mathbf{T}}\right)}$$

$$\mathbf{P}_{\mathbf{T}}(\sigma)\cdot\mathbf{P}_{\mathbf{T}}(\tau \mid \sigma) = \frac{\exp\left(-\frac{E(\tau)}{\mathbf{T}}\right)}{Z(\mathbf{T})}\cdot\frac{1}{1 + \exp\left(-\frac{(2\sigma_i-1)\cdot \mathrm{net}(i,\tau)}{\mathbf{T}}\right)}\cdot\exp\left(-\frac{\mathrm{net}(i,\sigma)\cdot(\tau_i+\sigma_i-1)}{\mathbf{T}}\right)$$

$$\mathbf{P}_{\mathbf{T}}(\sigma)\cdot\mathbf{P}_{\mathbf{T}}(\tau \mid \sigma) = \mathbf{P}_{\mathbf{T}}(\tau)\cdot\mathbf{P}_{\mathbf{T}}(\sigma \mid \tau)\cdot\exp(0) = \mathbf{P}_{\mathbf{T}}(\tau)\cdot\mathbf{P}_{\mathbf{T}}(\sigma \mid \tau)$$
Consequently, according to Proposition C.5 from Appendix C, 𝐏𝐓(𝜎) is the equilibrium distribution.
We omit the details of the proof for the second part of Theorem 5.10. However, as mentioned by Viveros [42], apart from a change of notation, the proof is exactly Theorem 8.1, p. 134, and Corollary 2.1, p. 18, of [75].
The hypothesis of symmetric weights in a Boltzmann machine simplifies the case 𝐓 > 0 considerably because it enables one to infer that the detailed balance condition holds, and this leads immediately to simple closed formulae for the equilibrium distribution [42].
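The detailed balance argument can be checked numerically on a small machine: the Boltzmann–Gibbs distribution of (5.53) should satisfy (5.55) for every single-flip transition. A sketch (not from the thesis), assuming a symmetric weight matrix and the energy convention E(σ) = −Σ_{i<j} w_ij·σ_i·σ_j + Σ_i θ_i·σ_i with T = 1; names are illustrative:

```python
import itertools
import math

def energy(sigma, W, theta):
    n = len(sigma)
    e = -sum(W[i][j] * sigma[i] * sigma[j]
             for i in range(n) for j in range(i + 1, n))
    return e + sum(theta[i] * sigma[i] for i in range(n))

def check_detailed_balance(W, theta, T=1.0):
    """Verify (5.55) for all single-flip pairs of a small Boltzmann machine."""
    n = len(theta)
    states = list(itertools.product([0, 1], repeat=n))
    Z = sum(math.exp(-energy(s, W, theta) / T) for s in states)
    P = {s: math.exp(-energy(s, W, theta) / T) / Z for s in states}

    def flip_prob(tau, sigma, i):
        # probability of the single-unit transition tau -> sigma per (5.52)
        net = sum(tau[j] * W[j][i] for j in range(n) if j != i) - theta[i]
        return 1.0 / (1.0 + math.exp(-(2 * sigma[i] - 1) * net / T))

    for tau in states:
        for i in range(n):
            sigma = tau[:i] + (1 - tau[i],) + tau[i + 1:]
            lhs = P[tau] * flip_prob(tau, sigma, i)
            rhs = P[sigma] * flip_prob(sigma, tau, i)
            assert abs(lhs - rhs) < 1e-12
    return True
```

Any choice of symmetric weights and thresholds should pass the check, which is exactly what Theorem 5.10 asserts.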
𝐓 → 0
In the limiting case when 𝐓 → 0 the transition probability matrix for asynchronous Boltzmann
machines is given by taking the limit of the equation (5.52) as 𝐓 → 0:

𝐏𝐓(𝜎|𝜏) = { lim𝐓→0 𝑔(𝑖) / (1 + exp(−(2 ∙ 𝜎𝑖 − 1) ∙ net(𝑖, 𝜏)/𝐓))   if ∃ only one 𝑖 such that 𝜎𝑖 ≠ 𝜏𝑖
            1 − lim𝐓→0 ∑𝜌≠𝜎 𝐏𝐓(𝜌|𝜏)                               if 𝜎𝑖 = 𝜏𝑖 for all 𝑖
            0                                                      otherwise } (5.58)
where 𝑔(𝑖) denotes the environmental distribution used for the selection of the unit 𝑖 to be
updated. Usually 𝑔 is a uniform distribution, so the probability that the 𝑖th unit is selected for
update is 1/𝑛.
The limiting behavior of the transition probability matrix as 𝐓 → 0 is determined by the limiting
transition probability for a particular unit 𝑖:
lim𝐓→0 1 / (1 + exp(−(2 ∙ 𝜎𝑖 − 1) ∙ net(𝑖, 𝜏)/𝐓)) = { 0    if Δ𝐸 = −(2 ∙ 𝜎𝑖 − 1) ∙ net(𝑖, 𝜏) > 0
                                                      1/2  if Δ𝐸 = −(2 ∙ 𝜎𝑖 − 1) ∙ net(𝑖, 𝜏) = 0
                                                      1    if Δ𝐸 = −(2 ∙ 𝜎𝑖 − 1) ∙ net(𝑖, 𝜏) < 0 } (5.59)
The equation (5.59) gives the limiting transition probability 𝜏𝑖 → 𝜎𝑖 as 𝐓 → 0 for a particular unit 𝑖
in terms of its activation, or net input. We learned that, in principle, there are 𝑛 + 1 non–zero
entries in any row of the 2^𝑛 × 2^𝑛 transition probability matrix. But is it possible that some of
these entries become zero? If that happens, what effect does it have on the transition probability
matrix as 𝐓 → 0? To answer these questions, suppose that a configuration 𝜏 has 𝑘 units with zero
activation, where 1 ≤ 𝑘 ≤ 𝑛:
∃ 𝑖(1), 𝑖(2), …, 𝑖(𝑘) such that net(𝑖(𝑗), 𝜏) = 0 for any 1 ≤ 𝑗 ≤ 𝑘 (5.60)

In the transition 𝜏 → 𝜎, the units 𝜎𝑖(1), …, 𝜎𝑖(𝑘) can take the values 0 or 1 with equal probability.
Because of the zero activation of these units, the limit (5.59) corresponding to them is always 1/2,
so the corresponding matrix entries are 𝑔(𝑖) ∙ 1/2 = 1/(2𝑛). Thus, as 𝐓 → 0, in the row
corresponding to 𝜏 in the 2^𝑛 × 2^𝑛 transition probability matrix there will be 𝑘 ≤ 𝑛 non–zero
entries having the value 1/(2𝑛), and the remainder of the entries must sum to 1 − 𝑘/(2𝑛) [42].
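The three-way limit in (5.59) can be written directly as a small function. A minimal sketch (the function names and test values are illustrative choices, not taken from the thesis), together with a comparison against the positive-temperature sigmoid it is the limit of:

```python
import math

def limit_transition_prob(sigma_i, net_i_tau):
    """Limiting single-unit transition probability as T -> 0, as in (5.59)."""
    delta_E = -(2 * sigma_i - 1) * net_i_tau  # energy change of the proposed flip
    if delta_E > 0:
        return 0.0   # uphill move: never taken in the limit
    if delta_E == 0:
        return 0.5   # zero activation: fair coin flip
    return 1.0       # downhill move: always taken

def sigmoid_T(sigma_i, net_i_tau, T):
    """The positive-temperature transition probability that (5.59) is the limit of."""
    return 1.0 / (1.0 + math.exp(-(2 * sigma_i - 1) * net_i_tau / T))

assert limit_transition_prob(1, 2.0) == 1.0   # turning a unit on along its net input
assert limit_transition_prob(0, 2.0) == 0.0   # turning it off against its net input
assert limit_transition_prob(1, 0.0) == 0.5   # zero activation
# at small T the sigmoid is already close to its limiting value
assert abs(sigmoid_T(1, 2.0, 0.01) - 1.0) < 1e-6
assert abs(sigmoid_T(0, 2.0, 0.01) - 0.0) < 1e-6
```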
𝐓 = 0
The case 𝐓 = 0 is a special limiting case in which the dynamics become deterministic and the
model approaches the Hopfield model. Provided that only one unit updates at a time, the network
settles to a local energy minimum, similarly to a Hopfield network (see Section 2.5.2). Here there
might be cycles or fixed–points but, in the strict sense, an
equilibrium distribution does not exist. Since the networks are deterministic, their “transition
probability matrices” are not transition probability matrices in the strict sense. However, these
“transition probability matrices” can be either reducible or irreducible and periodic [42].
5.5 Learning algorithms based on variational approaches
In Section 5.3 we learned how to find the parameters of the Boltzmann Machine
Learning problem by means of maximum likelihood estimation. In this section we use a different,
variational approach in order to pick a tractable variational parameterization for a Boltzmann
machine. The true probability distribution of a Boltzmann machine cannot be computed exactly,
regardless of the form in which it is expressed: joint, conditional, or marginal. The goal of
variational learning in a Boltzmann machine is to approximate the true conditional probability of
the hidden variables given the data vectors over the visible variables, and to use this
approximation in the learning rules (5.38) and (5.39) to replace the data–dependent statistics.
Here we recall a consequence of Theorem 5.1, specifically the equation (5.12): in a Boltzmann
machine with the visible units 𝑣 clamped and with hidden units ℎ, the subnet ℋ itself behaves like
a Boltzmann machine with its own interconnecting weights W and thresholds (𝛉𝐢)𝑖∈ℋ. In other
words, the only effect of 𝑣 on ℎ is to cause the hidden units ℎ to run with the effective thresholds
𝛉𝐣 given by the equation (5.10) instead of their regular thresholds 𝜃𝑗. Therefore, the true
conditional probability distribution 𝐏𝐓(ℎ|𝑣) is governed by the following Boltzmann–Gibbs
distribution:
𝐏𝐓(ℎ|𝑣) = 𝐏ℋ(ℎ|𝑣) = exp(−𝐸ℋ(ℎ|𝑣)) / 𝑍ℋ (5.61)

𝐏𝐓(ℎ|𝑣) = exp(∑𝑗∈ℋ ℎ𝑗 ∙ ∑𝑖∈ℋ,𝑖<𝑗 ℎ𝑖 ∙ 𝑤𝑖𝑗 − ∑𝑗∈ℋ 𝛉𝐣 ∙ ℎ𝑗) / 𝑍ℋ

where:

𝑍ℋ = ∑ℎ∈Iℋ exp(∑𝑗∈ℋ ℎ𝑗 ∙ ∑𝑖∈ℋ,𝑖<𝑗 ℎ𝑖 ∙ 𝑤𝑖𝑗 − ∑𝑗∈ℋ 𝛉𝐣 ∙ ℎ𝑗) (5.62)
We augment the notation of the true probability distribution 𝐏𝐓 to include the parameters of the
network. In this section we are specifically interested in the parameters corresponding to a
mean parameterization of a pairwise Markov network (equations (3.11), (3.12), and (3.15)):
𝐏𝐓(𝑣, ℎ) = 𝐏𝐓(𝑣, ℎ; μ) and 𝐏𝐓(ℎ|𝑣) = 𝐏𝐓(ℎ|𝑣; μ) (5.63)
5.5.1 Using variational free energies to compute the statistics required for learning
In this section we establish the connection between the Boltzmann machine variational learning
and the approximations of the free energies discussed in Chapter 3.
As mentioned in the previous section, variational learning is concerned with the true conditional
distribution 𝐏𝐓(ℎ|𝑣). Concretely, variational learning means that we have to choose a conditional
probability distribution 𝐐(ℎ|𝑣) from a family ℚ(ℎ|𝑣; 𝜆) of approximating conditional probability
distributions that are described by the variational parameters 𝜆. Generally, the Markov network
representing 𝐐 is not the same as the Markov network representing 𝐏𝐓 but rather a sub–graph
of it.
From the family of approximating distributions ℚ(ℎ|𝑣; 𝜆), we choose a particular distribution 𝐐 by
minimizing the KL–divergence KL(ℚ||𝐏𝐓) given by the equation (3.25) with respect to the
variational parameters 𝜆. Then, the particular distribution 𝐐(ℎ|𝑣; 𝜆∗) that corresponds to the
values 𝜆∗ of the variational parameters that resulted from the KL–divergence minimization is
considered the best approximation of 𝐏𝐓(ℎ|𝑣) in the family ℚ(ℎ|𝑣; 𝜆).
𝐐 = ℚ(ℎ|𝑣; 𝜆∗) where: 𝜆∗ = argmin𝜆 KL(ℚ(ℎ|𝑣; 𝜆) || 𝐏𝐓(ℎ|𝑣)) (5.64)
One simple justification for using the KL–divergence as a measure of approximation accuracy is
that it yields the best lower bound on the probability of the evidence 𝐩𝐓(𝑣) within the family of
approximations ℚ(ℎ|𝑣; 𝜆). The probability of the evidence is the same as the probability
distribution over the visible units.
To prove this claim, we first recall a form of Jensen’s inequality used in the context of probability
theory: if X is a random variable and φ is a concave function, then:

φ(𝐄[X]) ≥ 𝐄[φ(X)] (5.65)

Since the logarithm is concave, bounding the log likelihood, i.e., the logarithm of the probability
of the evidence, with Jensen’s inequality we obtain:
ln 𝐩𝐓(𝑣) = ln MARG(𝐏𝐓, 𝒱) = ln ∑ℎ∈Iℋ 𝐏𝐓(𝑣, ℎ) = ln ∑ℎ∈Iℋ 𝐐(ℎ|𝑣) ∙ 𝐏𝐓(𝑣, ℎ)/𝐐(ℎ|𝑣)

ln ∑ℎ∈Iℋ 𝐐(ℎ|𝑣) ∙ 𝐏𝐓(𝑣, ℎ)/𝐐(ℎ|𝑣) ≥ ∑ℎ∈Iℋ 𝐐(ℎ|𝑣) ∙ ln(𝐏𝐓(𝑣, ℎ)/𝐐(ℎ|𝑣))

ln 𝐩𝐓(𝑣) ≥ ∑ℎ∈Iℋ 𝐐(ℎ|𝑣) ∙ ln(𝐏𝐓(𝑣, ℎ)/𝐐(ℎ|𝑣)) (5.66)
The inequality (5.66) can be interpreted in this way: its right–hand side is a lower bound of its
left–hand side, which means that we have found a lower bound for ln 𝐩𝐓(𝑣). Moreover, the
difference between the left–hand side and the right–hand side of (5.66) is exactly the KL–
divergence KL(𝐐(ℎ|𝑣) || 𝐏𝐓(ℎ|𝑣)) [27]:

KL(𝐐(ℎ|𝑣) || 𝐏𝐓(ℎ|𝑣)) = ln 𝐩𝐓(𝑣) − ∑ℎ∈Iℋ 𝐐(ℎ|𝑣) ∙ ln(𝐏𝐓(𝑣, ℎ)/𝐐(ℎ|𝑣)) ≥ 0 (5.67)
Hence, by choosing 𝜆∗ according to (5.64), we obtain the tightest lower bound for ln 𝐩𝐓(𝑣) [27]:
KL(𝐐(ℎ|𝑣; 𝜆∗) || 𝐏𝐓(ℎ|𝑣)) = ln 𝐩𝐓(𝑣) − ∑ℎ∈Iℋ 𝐐(ℎ|𝑣; 𝜆∗) ∙ ln(𝐏𝐓(𝑣, ℎ)/𝐐(ℎ|𝑣; 𝜆∗)) (5.68)

ln 𝐩𝐓(𝑣) ≥ ∑ℎ∈Iℋ 𝐐(ℎ|𝑣; 𝜆∗) ∙ ln(𝐏𝐓(𝑣, ℎ)/𝐐(ℎ|𝑣; 𝜆∗))
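The chain (5.66)–(5.67) can be verified numerically for a small hidden layer. A minimal sketch, assuming an arbitrary unnormalized joint over three hidden units and an arbitrary factorized approximation Q (all numbers are illustrative, generated from a fixed seed):

```python
import itertools
import math
import random

random.seed(1)
nh = 3  # number of hidden units; the visible vector v is held fixed (clamped)
hs = list(itertools.product([0, 1], repeat=nh))

# an arbitrary unnormalized joint P_T(v, h) over the hidden configurations for this v
joint = {h: math.exp(random.uniform(-2, 2)) for h in hs}
p_v = sum(joint.values())                # evidence: p_T(v) = sum_h P_T(v, h)
post = {h: joint[h] / p_v for h in hs}   # true conditional P_T(h | v)

# an arbitrary factorized approximation Q(h | v) with means mu
mu = [random.uniform(0.05, 0.95) for _ in range(nh)]
def Q(h):
    return math.prod(m if b else 1 - m for m, b in zip(mu, h))

# right-hand side of (5.66), and the KL-divergence appearing in (5.67)
elbo = sum(Q(h) * math.log(joint[h] / Q(h)) for h in hs)
kl = sum(Q(h) * math.log(Q(h) / post[h]) for h in hs)

assert elbo <= math.log(p_v) + 1e-12          # the Jensen lower bound (5.66)
assert kl >= 0.0
assert abs(math.log(p_v) - elbo - kl) < 1e-9  # the gap is exactly the KL (5.67)
```

The last assertion is the content of (5.67): the slack in the lower bound equals KL(𝐐(ℎ|𝑣) || 𝐏𝐓(ℎ|𝑣)), so minimizing the KL–divergence tightens the bound.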
Furthermore, Theorem 3.1 taught us that the KL–divergence of two probability distributions 𝐐
and 𝐏 is related to the variational free energy of 𝐐 and to the energy functional 𝐹[𝐏, 𝐐]:

KL(𝐐(ℎ|𝑣) || 𝐏𝐓(ℎ|𝑣)) = −𝐹[𝐏(ℎ|𝑣), 𝐐(ℎ|𝑣)] + ln 𝑍(𝐏𝐓(ℎ|𝑣)) (5.69)
KL(𝐐(ℎ|𝑣) || 𝐏𝐓(ℎ|𝑣)) = 𝐹[𝐐(ℎ|𝑣)] + ln 𝑍(𝐏𝐓(ℎ|𝑣))

where: 𝐹[𝐏, 𝐐] is the energy functional of 𝐏(ℎ|𝑣) and 𝐐(ℎ|𝑣); 𝐹[𝐐] is the variational free energy
of 𝐐(ℎ|𝑣); and 𝑍(𝐏𝐓(ℎ|𝑣)) is the partition function of the conditional of the true probability
distribution 𝐏𝐓(ℎ|𝑣).
Using (5.69), the KL–divergence employed by (5.64) can be written as:
KL(ℚ(ℎ|𝑣; 𝜆)||𝐏𝐓(ℎ|𝑣)) = 𝐹[ℚ(ℎ|𝑣; 𝜆)] + ln𝑍(𝐏𝐓(ℎ|𝑣)) (5.70)
Using (5.70) and the fact that the true probability distribution 𝐏𝐓 doesn’t depend on the
variational parameter 𝜆, the optimization problem (5.64) can be reformulated as:
𝐐 = ℚ(ℎ|𝑣; 𝜆∗) where: 𝜆∗ = argmin𝜆 {𝐹[ℚ(ℎ|𝑣; 𝜆)] + ln 𝑍(𝐏𝐓(ℎ|𝑣))} = argmin𝜆 {𝐹[ℚ(ℎ|𝑣; 𝜆)]} (5.71)
The optimization problem (5.71) shows the connection between variational Boltzmann machine
learning and variational free energies and, at the same time, suggests a path to follow in a
learning algorithm. When the mean field free energy 𝐹𝑀𝐹[𝐐] plays the role of the free energy
𝐹[𝐐] in (5.70), the learning algorithm is called naïve mean field learning. When the Bethe–Gibbs
free energy 𝐺𝛽[𝐐] plays the role of the free energy 𝐹[𝐐] in (5.70), the learning algorithm is
called belief optimization learning.
Variational approaches like the mean field approximation and the Bethe approximation can be
used in Boltzmann machine learning only in the positive phase. These variational
approximations and, generally, any variational approach cannot be used in the negative phase
because the minus sign in the Boltzmann machine learning rules would cause variational
learning to change the parameters so as to maximize the divergence between the
approximating and true distributions instead of minimizing it [39]. Therefore, the data–
independent expectations should still be estimated by using a sampling algorithm like
Algorithm 5.2.
5.5.2 Learning by naïve mean field approximation
In the naïve mean field approximation, we try to find a factorized distribution 𝐐(ℎ|𝑣) that best
describes the true posterior distribution 𝐏𝐓(ℎ|𝑣). The true posterior distribution 𝐏𝐓(ℎ|𝑣) is
replaced by an approximate posterior 𝐐(ℎ|𝑣) and the parameters of the network are updated to
follow the gradient of the KL–divergence between 𝐐(ℎ|𝑣) and 𝐏𝐓(ℎ|𝑣).
The particular distribution we choose for 𝐐(ℎ|𝑣; μ) is the most general factorized distribution for
binary variables, which has the form:
𝐐𝐌𝐅(ℎ|𝑣; μ) = ∏𝑖∈ℋ 𝜇𝑖^ℎ𝑖 ∙ (1 − 𝜇𝑖)^(1−ℎ𝑖) (5.72)

where μ = {𝜇𝑖}𝑖∈ℋ are the variational parameters and the product is taken over the hidden units.
In order to form the KL–divergence between the fully factorized 𝐐𝐌𝐅 distribution and the 𝐏𝐓
distribution given by the equation (5.61), we use the fact that, under the distribution 𝐐𝐌𝐅, ℎ𝑖 and
ℎ𝑗 are independent random variables with mean values 𝜇𝑖 and 𝜇𝑗, respectively. Thus, we obtain:

KL(𝐐𝐌𝐅(ℎ|𝑣; μ) || 𝐏𝐓(ℎ|𝑣)) = ∑𝑖∈ℋ [𝜇𝑖 ∙ ln 𝜇𝑖 + (1 − 𝜇𝑖) ∙ ln(1 − 𝜇𝑖)] − ∑𝑖∈ℋ 𝜇𝑖 ∙ ∑𝑗∈ℋ,𝑗<𝑖 𝜇𝑗 ∙ 𝑤𝑗𝑖 + ∑𝑖∈ℋ 𝛉𝐢 ∙ 𝜇𝑖 + ln 𝑍ℋ (5.73)
In order to derive the learning rule for the mean field learning algorithm, we employ the same
approach we used for the generic learning algorithm, that is, we minimize the KL–divergence
KL(𝐐𝐌𝐅(ℎ|𝑣) || 𝐏𝐓(ℎ|𝑣)). Concretely, we derive the mean field fixed–point equations by taking
the gradient of the KL–divergence given by the equation (5.73) with respect to 𝜇𝑖 for all 𝑖 ∈ ℋ.
We note that 𝑍ℋ doesn’t depend on the variational parameters. Thus, we obtain:

∂KL(𝐐𝐌𝐅(ℎ|𝑣; μ) || 𝐏𝐓(ℎ|𝑣)) / ∂𝜇𝑖 = − ∑𝑗∈ne(𝑖) 𝜇𝑗 ∙ 𝑤𝑗𝑖 + 𝛉𝐢 + ln(𝜇𝑖 / (1 − 𝜇𝑖)) (5.74)
where ne(𝑖) denotes the Markov blanket of unit 𝑖.
If we equate (5.74) to zero then we obtain the “mean field fixed–point equations”:
𝜇𝑖 = sigm(∑𝑗∈ne(𝑖) 𝜇𝑗 ∙ 𝑤𝑗𝑖 − 𝛉𝐢) for all 𝑖 ∈ ℋ (5.75)
The equations (5.75) are solved iteratively for a fixed–point solution. Note that each variational
parameter 𝜇𝑖 updates its value based on a sum across the variational parameters 𝜇𝑗 within its
Markov blanket.
In Section 3.4 we learned how to solve the naïve mean field approximation problem by using the
type of optimization “maximize the energy functional”. In this section we have learned how to
solve the same problem by using the type of optimization “minimize the KL–divergence”. As we
mentioned in Section 3.2, for a given problem, these approaches are equivalent. Therefore, the
equations (3.67) and (5.75) should lead to essentially the same solutions, up to a small
numerical error. Moreover, the convergence of one set of equations implies the convergence of
the other set as well. Theorem 3.7 guarantees the convergence of the mean field fixed–point
equations (3.67). Consequently, the mean field fixed–point equations (5.75) are also convergent.
When the mean field fixed–point equations (5.75) are run sequentially, i.e., we fix 𝜇−𝑖 and we
minimize over 𝜇𝑖, the KL–divergence is convex in 𝜇𝑖 and the corresponding equation (5.75) finds
the minimum in one step. Thus, this procedure can be interpreted as coordinate descent in
{𝜇𝑖}i∈ℋ and each step is guaranteed to decrease the KL–divergence. One drawback of this
procedure is that it could suffer from slow convergence or entrapment in local minima.
Alternatively, all the {𝜇𝑖}i∈ℋ can be updated in parallel, which does not guarantee a decrease of
the cost function at each iteration, but may converge faster. In practice, one often observes
oscillatory behavior, which can be counteracted by damping the updates.
Finally, one can use any gradient based optimization technique to minimize over all the nodes
{𝜇𝑖}i∈ℋ simultaneously, making sure all {𝜇𝑖}i∈ℋ remain between 0 and 1 [4].
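The parallel, damped scheme just described can be sketched as follows. The network size, random weights, damping factor, and tolerances below are illustrative choices; the update itself is the fixed-point equation (5.75) applied to all means at once:

```python
import math
import random

random.seed(2)
n = 5
w = [[0.0] * n for _ in range(n)]  # symmetric weights, zero diagonal
for i in range(n):
    for j in range(i + 1, n):
        w[i][j] = w[j][i] = random.uniform(-1, 1)
theta = [random.uniform(-1, 1) for _ in range(n)]

def sigm(x):
    return 1.0 / (1.0 + math.exp(-x))

def mean_field(mu, alpha=0.5, tol=1e-10, max_iter=10000):
    """Damped parallel updates of the fixed-point equations (5.75)."""
    for _ in range(max_iter):
        new = [sigm(sum(w[i][j] * mu[j] for j in range(n)) - theta[i])
               for i in range(n)]
        # damping counteracts the oscillations that pure parallel updates can produce
        new = [alpha * a + (1 - alpha) * b for a, b in zip(new, mu)]
        if max(abs(a - b) for a, b in zip(new, mu)) < tol:
            return new
        mu = new
    return mu

mu = mean_field([random.random() for _ in range(n)])
# at a fixed point every mu_i reproduces itself under (5.75)
for i in range(n):
    assert abs(mu[i] - sigm(sum(w[i][j] * mu[j] for j in range(n)) - theta[i])) < 1e-6
```

With `alpha = 1.0` the sketch reduces to the undamped parallel scheme; lowering `alpha` trades speed for stability.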
Peterson and Anderson compared the mean field approximation to Gibbs sampling on a set of
test cases and found that it ran 10–30 times faster, while yielding a roughly equivalent level of
accuracy [16,27].
For large, densely connected, weakly interacting systems the mean field approximation tends to
be accurate: the cumulative effect of all nodes behaves as a “rigid” (mean) field, which acts as an
additional bias term, resulting in an approximately factorized distribution. There are cases,
however, in which the mean field approximation is known to be less accurate. The factorized
mean field distribution is clearly unimodal, and can therefore never represent multimodal
posterior distributions accurately. In particular, the KL–divergence KL(𝐐𝐌𝐅||𝐏𝐓) penalizes states
with small posterior probability but non–vanishing probability under the mean field distribution
much harder than the other way around. The result of this asymmetry in the KL–divergence is
that the mean field distribution will choose to represent only one mode, ignoring the other ones.
A typical situation where we expect multiple modes in the posterior is when there is not a lot of
evidence clamped on the observation nodes [4]. Consider for instance the situation when the
thresholds are given by:
𝜃𝑖 = −(1/2) ∙ ∑𝑗∈𝒱,𝑗≠𝑖 𝑤𝑖𝑗 (5.76)
in which case there is symmetry in the system – switching all the nodes (ℎ𝑖 → 1 − ℎ𝑖) leaves all
the probabilities invariant. This implies that there are at least two modes. In general, we expect
many more modes, and the mean field distribution can only capture one. Moreover, when the
interactions are strong, we expect these modes to be concentrated on one state, with little
fluctuation around them. The marginals predicted by mean field would therefore be close to
either 1 or 0 (they are polarized), while the true marginal posterior probabilities are 1/2 due to the
symmetry [4]. One way to overcome some of the difficulties mentioned above is to use more
structured variational distributions 𝐐 and again minimize the KL–divergence [4].
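The mode-seeking behavior described above can be demonstrated on a two-unit example. A minimal sketch, assuming a bimodal target distribution concentrated on (0,0) and (1,1) (the particular probabilities and grid resolution are illustrative): minimizing KL(𝐐𝐌𝐅||𝐏𝐓) over a factorized Q yields polarized means locked onto a single mode rather than the symmetric (0.5, 0.5) that would try to cover both modes:

```python
import math

# a symmetric bimodal target: mass concentrated on the two modes (0,0) and (1,1)
P = {(0, 0): 0.48, (1, 1): 0.48, (0, 1): 0.02, (1, 0): 0.02}

def Q(h, mu):
    """Factorized distribution over two binary units with means mu = (mu1, mu2)."""
    return math.prod(m if b else 1 - m for m, b in zip(mu, h))

def kl(mu):
    """KL(Q || P) for the factorized approximation with means mu."""
    return sum(Q(h, mu) * math.log(Q(h, mu) / P[h]) for h in P)

# crude grid search over the two variational parameters
grid = [i / 200 for i in range(1, 200)]
best = min(((m1, m2) for m1 in grid for m2 in grid), key=kl)

# the optimum is polarized onto a single mode rather than hedging at (0.5, 0.5)
assert kl(best) < kl((0.5, 0.5))
assert min(best) > 0.9 or max(best) < 0.1
```

By the h → 1 − h symmetry of this target, the two polarized optima have identical KL values; the search simply returns one of them, which is exactly the single-mode behavior discussed in the text.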
We end this section with a high–level pseudocode of the mean field learning algorithm.
During each clamping (positive) phase an algorithm similar to Algorithm 3.1, which performs
minimization instead of maximization, is executed to solve the fixed–point equations (5.75), and
the solution obtained for the variational parameters {𝜇𝑖}i∈ℋ is used to approximate the data–
dependent statistics. During each free–running (negative) phase an algorithm similar to
Algorithm 5.2 is executed and the data–independent statistics 𝑝𝑖𝑗 and 𝑝𝑖 are estimated. Then
the parameters W of the network are updated according to the following rules:

Δ𝑤𝑖𝑗 = −𝛿 ∙ (𝜇𝑖 ∙ 𝜇𝑗 − 𝑝𝑖𝑗) (5.77)
Δ𝜃𝑖 = 𝛿 ∙ (𝜇𝑖 − 𝑝𝑖) (5.78)

where 𝛿 is the learning rate.
Algorithm 5.3: Mean Field Boltzmann Machine Learning
Given: n x n weight matrix W ; n x 1 threshold vector Θ
       a training set of 𝑝𝑎 data vectors: {𝑣(𝑘)}1≤𝑘≤𝑝𝑎
       the number of learning cycles: 𝑒𝑝
       the number of mean field steps: 𝑚𝑓
       the number of Markov chains: 𝑀
begin
  Step 1: initialize W(0) and 𝑀 fantasy particles: {𝜎(1)(0), …, 𝜎(𝑀)(0)}
  For each learning cycle:
  Step 2: for e = 1 to 𝑒𝑝 do
    For each one of the patterns to be learned:
    Step 3: for k = 1 to 𝑝𝑎 do
      Clamping phase:
      Step 4: present and clamp the pattern 𝑣(𝑘)
      START ALGORITHM 3.1
      Step 5: randomly initialize μ = {𝜇𝑖}𝑖∈ℋ and run 𝑚𝑓 updates
              until convergence:
              𝜇𝑖 = sigm(∑𝑗∈ne(𝑖) 𝜇𝑗 ∙ 𝑤𝑖𝑗 − 𝛉𝐢) for all 𝑖 ∈ ℋ
      END ALGORITHM 3.1
      Step 6: set: μ(𝑘) = {𝜇𝑖(𝑘)}𝑖∈ℋ
      Free–running phase:
      Step 7: present the pattern 𝑣(𝑘) but do not clamp it
      START ALGORITHM 5.2
      […]
      Step 8: collect the statistics 𝑝𝑖𝑗 and 𝑝𝑖
      END ALGORITHM 5.2
      Update the weights and thresholds for any pair of connected
      units 𝑖 ≠ 𝑗 such that at least one unit has been updated:
      Step 9: Δ𝑤𝑖𝑗 = −𝛿 ∙ (𝜇𝑖 ∙ 𝜇𝑗 − 𝑝𝑖𝑗) for 𝑖 ≠ 𝑗
              𝑤𝑖𝑗 ← 𝑤𝑖𝑗 + Δ𝑤𝑖𝑗
              Δ𝜃𝑖 = 𝛿 ∙ (𝜇𝑖 − 𝑝𝑖)
              𝜃𝑖 ← 𝜃𝑖 + Δ𝜃𝑖
    end for //k
  end for //e
end
return W and Θ
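A compact runnable sketch of the algorithm on a toy network follows. Two deliberate simplifications, not part of Algorithm 5.3, are labeled in the comments: the sampling-based negative phase (Algorithm 5.2) is replaced by exact enumeration of the model statistics, which is feasible only for tiny networks, and the updates use the classical positive-minus-negative sign convention of Ackley, Hinton, and Sejnowski, which performs likelihood ascent under the energy convention E(σ) = −∑𝑖<𝑗 𝑤𝑖𝑗 σ𝑖 σ𝑗 + ∑𝑖 𝜃𝑖 σ𝑖 assumed here. All sizes and rates are illustrative:

```python
import itertools
import math
import random

random.seed(3)
V, H = [0, 1], [2, 3]              # visible and hidden unit indices
n = len(V) + len(H)
patterns = [(1, 1), (0, 0)]        # toy training set over the visible units
delta, epochs, mf_steps = 0.2, 200, 30

w = [[0.0] * n for _ in range(n)]  # symmetric weights, zero diagonal
theta = [0.0] * n
states = list(itertools.product([0, 1], repeat=n))

def sigm(x):
    return 1.0 / (1.0 + math.exp(-x))

def energy(s):
    return (-sum(w[i][j] * s[i] * s[j] for i in range(n) for j in range(i + 1, n))
            + sum(theta[i] * s[i] for i in range(n)))

def model_stats():
    """Exact negative-phase statistics by enumeration (stand-in for Algorithm 5.2)."""
    Z = sum(math.exp(-energy(s)) for s in states)
    P = [math.exp(-energy(s)) / Z for s in states]
    p1 = [sum(p * s[i] for p, s in zip(P, states)) for i in range(n)]
    p2 = [[sum(p * s[i] * s[j] for p, s in zip(P, states)) for j in range(n)]
          for i in range(n)]
    return p1, p2

def marginal_v(v):
    """p(v): total probability of all states whose visible part equals v."""
    Z = sum(math.exp(-energy(s)) for s in states)
    return sum(math.exp(-energy(s)) for s in states if s[:len(V)] == v) / Z

before = [marginal_v(v) for v in patterns]
for _ in range(epochs):
    for v in patterns:
        # positive phase: clamp v, solve the fixed-point equations (5.75) for the hidden means
        mu = [0.0] * n
        for i, b in zip(V, v):
            mu[i] = float(b)
        for i in H:
            mu[i] = random.random()
        for _ in range(mf_steps):
            for i in H:
                mu[i] = sigm(sum(w[i][j] * mu[j] for j in range(n) if j != i) - theta[i])
        # negative phase: exact model statistics instead of sampling
        p1, p2 = model_stats()
        # positive-minus-negative updates (likelihood ascent convention)
        for i in range(n):
            theta[i] -= delta * (mu[i] - p1[i])
            for j in range(i + 1, n):
                upd = delta * (mu[i] * mu[j] - p2[i][j])
                w[i][j] += upd
                w[j][i] += upd
after = [marginal_v(v) for v in patterns]
assert sum(after) > sum(before)  # training raised the total probability of the data
```

The sketch preserves the two-phase structure of Algorithm 5.3 while remaining small enough to verify end to end; on a realistic network the enumeration in `model_stats` and `marginal_v` must of course be replaced by the sampling procedures of Section 5.3.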
5.6 Unlearning and relearning in Boltzmann Machines
The concept of “unlearning” in a connectionist network is closely related to the concept
of “reverse learning” in neuroscience. Crick and Mitchison proposed a model of reverse learning
that compares the process of dream sleeping or the REM phase of sleep to an offline computer.
According to the model, we dream in order to forget and this involves a process of “reverse
learning” or “unlearning” [76].
A simulation of reverse learning was performed by Hopfield, Feinstein, and Palmer [77] who
independently had been studying ways to improve the associative storage capacity of simple
networks of binary processors. In their algorithm an input is presented to the network as an
initial condition and the system evolves by falling into a nearby local energy minimum. However,
not all local energy minima represent stored information. In creating the desired minima, they
accidentally create other spurious minima, and to eliminate these they use "unlearning": The
learning procedure is applied with reverse sign to the states found after starting from random
initial conditions. Following this procedure, the performance of the system in accessing stored
states was found to be improved [43].
The reverse learning model proposed by Crick and Mitchison and the unlearning algorithm
proposed by Hopfield et al. have an interesting relationship with the learning algorithm
proposed by Hinton. The two phases of Hinton’s learning algorithm resemble the learning and
unlearning procedures. In the positive phase, Hebbian learning with a positive coefficient occurs,
during which information in the environment is captured by the weights. During the negative
phase the system randomly samples states according to their Boltzmann distribution and
Hebbian learning occurs with a negative coefficient. However, these two phases need not be
implemented in the manner suggested by Crick and Mitchison. For instance, during negative
phase the average co–occurrences could be computed without making any changes to the
weights. These averages could then be used as a baseline for making changes during positive
phase; that is, the co–occurrences during negative phase could be computed and the baseline
subtracted before each permanent weight change. Hence, an alternative but equivalent
proposal for the function of dream sleep is to recalibrate the baseline for plasticity – the break–
even point which determines whether a synaptic weight is incremented or decremented. This
would be safer than making permanent weight decrements to synaptic weights during sleep and
solves the problem of deciding how much "unlearning" to do [43].
Hinton’s learning algorithm refines Crick’s and Mitchison's interpretation of why two phases are
needed. He considered a hidden unit deep within the network and wanted to know how its
connections with other units should be changed to best capture regularity present in the
environment. He started by observing that, if the unit does not receive direct input from the
environment, the hidden unit has no way to determine whether the information it receives from
neighboring units is ultimately caused by structure in the environment or is entirely a result of
the other weights. Hinton compared this scenario with a "folie à deux" where two parts of the
network each construct a model of the other and ignore the external environment [43]. He
realized that the contributions of internal and external sources can be separated by comparing
the co–occurrences in the positive phase with similar information collected in the absence of
environmental input; in this way the negative phase acts as a control condition. Moreover,
because of the special properties of equilibrium, it is possible to subtract off this purely internal
contribution and use the difference to update the weights. His conclusion was that the role of
two phases is to make the system maximally responsive to regularities present in the
environment and to prevent the system from using its capacity to model internally generated
regularities [43].
A network like the Boltzmann machine can experience some form of damage. Hinton studied
the behavior of the network, specifically the distributed representations constructed by the
learning rule, under such circumstances. He observed that the network uses distributed
representations among the intermediate units when it learns the associations. His interpretation
of this fact was that, because many of the weights are involved in encoding several different
associations and each association is encoded in many weights, if a weight is changed because
of some form of damage, it will affect several different energy minima and all of them will require
the same change in the weight to restore them to their previous depths. So, in relearning any of
the associations, there should be a positive transfer effect which tends to restore the others.
Hinton observed that this effect is actually rather weak and easily masked, so it can only be
seen clearly if the network is retrained on most of the original associations. His conclusion was
that the associations constructed by the learning rule are resistant to minor damage and exhibit
rapid relearning after major damage. Moreover, the relearning process can bring back
associations that are not practiced during the relearning and are only randomly related to the
associations that are practiced [43].
Chapter 6. Conclusions
6.1 Summary of what has been done
This thesis addresses several aspects of the theory of Boltzmann machines. Our principal goal
has been to provide, from a rigorous mathematical perspective, a unified framework for two
well–known classes of learning algorithms in asynchronous Boltzmann machines: based on
Monte Carlo methods and based on variational approximations of the free energy.
The second chapter focused on the foundation of knowledge necessary to understand the
subsequent chapters. We chose to introduce the Boltzmann–Gibbs distribution from both a
physicist’s and a computer scientist’s perspective to allow the concept of energy, which also
originates in physics, to settle on solid ground. We introduced the pairwise Markov random fields and
explained their relationship with the Boltzmann–Gibbs distribution. We also introduced the
Gibbs free energy as a convenient replacement for the Boltzmann–Gibbs distribution when the
goal is to perform approximate inference in a Markov random field. Then, we proceeded to
introduce the ancestors of Boltzmann machine: the connectionist networks and the Hopfield
networks. While we gave only a high–level overview of the connectionist networks, we gave a
quite detailed presentation of the Hopfield network. We justified the attention granted to the
Hopfield network by the fact that it is not just an ancestor of the Boltzmann machine, but it is a
Boltzmann machine itself, as we explained in Chapter 5.
The third chapter built the infrastructure of knowledge necessary to understand the subsequent
chapters. The topic of interest in this chapter was energy, and the motivation behind it is the
relationship between the equilibrium distribution and the free energy in a Markov random field.
Estimating the distribution of a Markov random field is an expensive process and there is no
foolproof method to determine whether the equilibrium has been reached. Some of the
difficulties encountered when operating with distributions do not exist anymore when operating
with energies. Furthermore, we introduced a number of Gibbs free energies: the mean field free
energy and the Bethe–Gibbs free energy, which are variational free energies, and the Bethe
free energy. These energies have been purposely defined and analyzed as potential candidates
for the true free energy of a Markov random field. Then we presented an approximate inference
algorithm – belief optimization – that is based on the Bethe–Gibbs approximation of the free
energy and that could potentially be used in Boltzmann machine learning.
The goal of the fourth chapter was to present in detail every important aspect of the Boltzmann
machine except learning. From various sources we synthesized a rigorous definition of the
Boltzmann machine model together with all the associated concepts. We introduced the concept
of true energy and explained its intrinsic relationship with various forms of the true probability
distribution. We described the algorithmic aspects of the dynamics of the asynchronous
Boltzmann machine. In Chapter 5 we returned to this topic and presented it from a different
perspective. To allow the reader to intuitively understand the “inner life” of the Boltzmann
machine, we presented the biological interpretation of the model as was given by its creator
Geoffrey Hinton.
The fifth chapter is the core of this thesis. It was dedicated entirely to learning algorithms in an
asynchronous Boltzmann machine. We defined formally the learning process following the same
rigorous approach as in Chapter 4. We justified, from different angles, the necessity of two
phases in a learning algorithm. Following Hinton’s terminology, we called them the positive and
the negative phase, respectively. Currently there are two equivalent ways to approach learning:
maximizing the likelihood of the parameters or minimizing the KL–divergence of Gibbs measures.
We chose to use the KL–divergence approach for all the learning algorithms we presented.
Then we introduced the class of learning algorithms based on approximate maximum likelihood.
This class contains the generic learning algorithm due to Hinton and Sejnowski. We provided a
very detailed analysis of the generic learning algorithm including the missing piece from the
original algorithm, which was identified by Jones. The class of algorithms based on approximate
maximum likelihood was completed with the introduction of three sampling algorithms used to
collect the statistics during both positive and negative phase: Gibbs sampling, stochastic
approximation using persistent Markov chains, and contrastive divergence. We summarized the
main steps of the generic learning algorithm in a high–level pseudocode and discussed the
factors that influenced its complexity. The collection of statistics for the generic learning
algorithm is conditioned on the thermal equilibrium. To understand the dynamics of the
Boltzmann machine from a probabilistic point of view, we provided a deep analysis of the
equilibrium distribution as a function of the pseudo–temperature. Furthermore, we introduced
the class of learning algorithms based on variational approaches discussed in Chapter 3 and we
explained the connection between the approximations of the free energies and the learning
process. We provided a detailed analysis of the mean field learning and the connections it has
with algorithms introduced previously: mean field approximation and stochastic approximation.
We summarized the main steps of the mean field learning algorithm in a high–level pseudocode
and discussed the factors that influenced its complexity. Finally, we introduced the concepts of
unlearning and relearning and gave the intuition behind them as was explained by their creator
Geoffrey Hinton.
6.2 Future directions
There are a few open questions or directions to explore inspired by ideas presented in this
thesis:
1. an algorithm to detect when an asynchronous symmetric Boltzmann machine reached its
equilibrium distribution;
2. an explicit formula for the equilibrium distribution and a learning algorithm for a
Boltzmann machine with asymmetric weights;
3. is it possible to extend the learning algorithm to higher order Markov processes?
4. an improvement to the Boltzmann machine model itself that would lead to better and
faster learning algorithms;
5. a breakthrough in Boltzmann machine learning?
Solutions to some of these questions would represent a considerable improvement on the
current state of knowledge.
One idea to improve the model is to find link(s) between the energy of a thermodynamic system
with respect to pressure and volume (equation (2.17)) and some aspect(s) of the cognitive
features and/or processes of human brain that, importantly, can be represented in an artificial
neural network like the Boltzmann machine. If these connections existed and had been reflected
in the Boltzmann machine model, they could be “consumed” directly by new learning algorithms
or indirectly by new optimization algorithms that perform approximate inference in the underlying
Markov random field.
References
1. Sussmann, H. J. (1988, December). Learning algorithms for Boltzmann machines. In Proceedings of
the 27th IEEE Conference on Decision and Control (pp. 786-791). IEEE.
2. Hebb, D. O. (1949). The organization of behavior: A neuropsychological theory. New York, 4.
3. Sussmann, H. J. (1989). The mathematical theory of learning algorithms for Boltzmann machines. In
Proceedings of the International Joint Conference on Neural Networks (IJCNN) (pp. 431-437). IEEE.
4. Welling, M., & Teh, Y. W. (2003). Approximate inference in Boltzmann machines. Artificial
Intelligence, 143(1), 19-50.
5. Salakhutdinov, R. (2008). Learning and evaluating Boltzmann machines (p. 31). Technical Report
UTML TR 2008-002, Department of Computer Science, University of Toronto.
6. Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural
computation, 18(7), 1527-1554.
7. Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational
abilities. Proceedings of the national academy of sciences, 79(8), 2554-2558.
8. Hinton, G. E., & Sejnowski, T. J. (1983, June). Optimal perceptual inference. In Proceedings of the
IEEE conference on Computer Vision and Pattern Recognition (pp. 448-453). IEEE New York.
9. Fahlman, S. E., Hinton, G. E., & Sejnowski, T. J. (1983). Massively parallel architectures for AI:
NETL, THISTLE, and BOLTZMANN machines. In Proceedings of AAAI-83 (pp. 109-113).
10. Hinton, G. E., Sejnowski, T. J., & Ackley, D. H. (1984). Boltzmann machines: Constraint satisfaction
networks that learn. Pittsburgh, PA: Carnegie-Mellon University, Department of Computer Science.
11. Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines.
Cognitive science, 9(1), 147-169.
12. Kirkpatrick, S. (1984). Optimization by simulated annealing: Quantitative studies. Journal of statistical
physics, 34(5-6), 975-986.
13. Salakhutdinov, R., & Hinton, G. (2012). An efficient learning procedure for deep Boltzmann machines.
Neural computation, 24(8), 1967-2006.
14. Neal, R. M. (1992). Connectionist learning of belief networks. Artificial intelligence, 56(1), 71-113.
15. Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony theory
(No. CU-CS-321-86). University of Colorado at Boulder, Department of Computer Science.
16. Peterson, C. (1987). A mean field theory learning algorithm for neural networks. Complex systems, 1,
995-1019.
17. Peterson, C., & Hartman, E. (1989). Explorations of the mean field theory learning algorithm. Neural
Networks, 2(6), 475-494.
18. Galland, C. C., & Hinton, G. E. (1990). Discovering high order features with mean field modules. In
Advances in neural information processing systems (pp. 509-515).
19. Galland, C. (1992). Learning in deterministic Boltzmann machine networks.
20. Kappen, H. J., & Rodriguez, F. B. (1998). Boltzmann machine learning using mean field theory and
linear response correction. Advances in neural information processing systems, 280-286.
21. Kappen, H. J., & Rodríguez, F. B. (1997). Mean field approach to learning in Boltzmann machines.
Pattern Recognition Letters, 18(11), 1317-1322.
22. Tanaka, T. (1998). Mean-field theory of Boltzmann machine learning. Physical Review E, 58(2), 2302.
23. Tanaka, T. (1999). A theory of mean field approximation. Advances in Neural Information Processing
Systems, 351-360.
24. Zemel, R. S. (1993). A minimum description length framework for unsupervised learning (Doctoral
dissertation, University of Toronto).
25. Hinton, G. E., & Zemel, R. S. (1994). Autoencoders, minimum description length, and Helmholtz free
energy. Advances in neural information processing systems, 3-3.
26. Neal, R. M., & Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental, sparse, and
other variants. In Learning in graphical models (pp. 355-368). Springer Netherlands.
27. Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational
methods for graphical models. Machine learning, 37(2), 183-233.
28. Yedidia, J. S., Freeman, W. T., & Weiss, Y. (2003). Understanding belief propagation and its
generalizations. Exploring artificial intelligence in the new millennium, 8, 236-239.
29. Yedidia, J. S., Freeman, W. T., & Weiss, Y. (2005). Constructing free-energy approximations and
generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51(7), 2282-
2312.
30. Wainwright, M. J., Jaakkola, T. S., & Willsky, A. S. (2005). A new class of upper bounds on the log
partition function. IEEE Transactions on Information Theory, 51(7), 2313-2335.
31. Wainwright, M. J., & Jordan, M. I. (2006). Log-determinant relaxation for approximate inference in
discrete Markov random fields. IEEE Transactions on Signal Processing, 54(6), 2099-2109.
32. Globerson, A., & Jaakkola, T. S. (2007). Approximate inference using conditional entropy
decompositions. In AISTATS (pp. 130-138).
33. Kabashima, Y., & Saad, D. (1998). Belief propagation vs. TAP for decoding corrupted messages.
EPL (Europhysics Letters), 44(5), 668.
34. Opper, M., & Winther, O. (1996). Mean field approach to Bayes learning in feed-forward neural
networks. Physical review letters, 76(11), 1964.
35. Yedidia, J. S., Freeman, W. T., & Weiss, Y. (2000, December). Generalized belief propagation. In
NIPS (Vol. 13, pp. 689-695).
36. Yedidia, J. S., Freeman, W. T., & Weiss, Y. (2001). Bethe free energy, Kikuchi approximations, and
belief propagation algorithms. Advances in neural information processing systems, 13.
37. Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural
computation, 14(8), 1771-1800.
38. Salakhutdinov, R., & Hinton, G. E. (2007, March). Learning a Nonlinear Embedding by Preserving
Class Neighbourhood Structure. In AISTATS (pp. 412-419).
39. Salakhutdinov, R., & Hinton, G. E. (2009, April). Deep Boltzmann Machines. In AISTATS (Vol. 1, p.
3).
40. Little, W. A. (1974). The existence of persistent states in the brain. In From High-Temperature
Superconductivity to Microminiature Refrigeration (pp. 145-164). Springer US.
41. Little, W. A., & Shaw, G. L. (1978). Analytic study of the memory storage capacity of a neural
network. Mathematical biosciences, 39(3-4), 281-290.
42. Viveros, U. X. I. (2001). The Synchronous Boltzmann Machine (Doctoral dissertation, University of
London).
43. Hinton, G. E., & Sejnowski, T. J. (1986). Learning and relearning in Boltzmann machines. Parallel
distributed processing: Explorations in the microstructure of cognition, 1, 282-317.
44. Boltzmann, L. (2012). Theoretical physics and philosophical problems: Selected writings (Vol. 5).
Springer Science & Business Media.
45. Gibbs, J. W. (1928). The collected works of J. Willard Gibbs (Vol. 1). H. A. Bumstead, & W.
R. Longley (Eds.). Longmans, Green and Company.
46. Dobrushin, R. L. (1968). Description of a random field by means of conditional probabilities, with
applications. Teor. Veroyatnost. i Primenen, 13.
47. Wainwright, M. J., & Jordan, M. I. (2008). Graphical models, exponential families, and variational
inference. Foundations and Trends® in Machine Learning, 1(1-2), 1-305.
48. Spitzer, F. (1971). Markov random fields and Gibbs ensembles. The American Mathematical Monthly,
78(2), 142-154.
49. Yedidia, J. (2001). An idiosyncratic journey beyond mean field theory. Advanced mean field methods:
Theory and practice, 21-36.
50. Gibbs, J. W. (1873). A method of geometrical representation of the thermodynamic properties of
substances by means of surfaces. Connecticut Academy of Arts and Sciences.
51. Hinton, G. E. (1989). Connectionist learning procedures. Artificial intelligence, 40(1), 185-234.
52. Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like
those of two-state neurons. Proceedings of the national academy of sciences, 81(10), 3088-3092.
53. Lowel, S., & Singer, W. (1992). Selection of intrinsic horizontal connections in the visual cortex by
correlated neuronal activity. Science, 255(5041), 209.
54. Koller, D., & Friedman, N. (2009). Probabilistic graphical models: principles and techniques. MIT
press.
55. Georges, A., & Yedidia, J. S. (1991). How to expand around mean-field theory using high-
temperature expansions. Journal of Physics A: Mathematical and General, 24(9), 2173.
56. Plefka, T. (1982). Convergence condition of the TAP equation for the infinite-ranged Ising spin glass
model. Journal of Physics A: Mathematical and general, 15(6), 1971.
57. Plefka, T. (2006). Expansion of the Gibbs potential for quantum many-body systems: General
formalism with applications to the spin glass and the weakly nonideal Bose gas. Physical Review E,
73(1), 016129.
58. Shin, J. (2012). Complexity of Bethe Approximation. In AISTATS (pp. 1037-1045).
59. Weiss, Y., & Freeman, W. T. (2001). On the optimality of solutions of the max-product belief-
propagation algorithm in arbitrary graphs. IEEE Transactions on Information Theory, 47(2), 736-744.
60. Bishop, C. M. (2006). Pattern Recognition and Machine Learning (Information Science and Statistics). Springer.
61. Heskes, T. (2002). Stable fixed points of loopy belief propagation are local minima of the Bethe free
energy. In Advances in neural information processing systems (pp. 343-350).
62. Heskes, T. (2004). On the uniqueness of loopy belief propagation fixed points. Neural Computation,
16(11), 2379-2413.
63. Jones, A. (1996). A lacuna in the theory of asynchronous Boltzmann machine learning. Simpósio
Brasileiro de Redes Neurais, 19-27.
64. Hinton, G. E. (1989). Deterministic Boltzmann learning performs steepest descent in weight-space.
Neural computation, 1(1), 143-150.
65. Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian
restoration of images. IEEE Transactions on pattern analysis and machine intelligence, (6), 721-741.
66. Tieleman, T. (2008, July). Training restricted Boltzmann machines using approximations to the
likelihood gradient. In Proceedings of the 25th international conference on Machine learning (pp.
1064-1071). ACM.
67. Robbins, H., & Monro, S. (1951). A stochastic approximation method. The annals of mathematical
statistics, 400-407.
68. Younes, L. (1989). Parametric inference for imperfectly observed Gibbsian fields. Probability theory
and related fields, 82(4), 625-645.
69. Younes, L. (1999). On the convergence of Markovian stochastic algorithms with rapidly decreasing
ergodicity rates. Stochastics: An International Journal of Probability and Stochastic Processes, 65(3-
4), 177-228.
70. Yuille, A. L. (2006). The convergence of contrastive divergences. Department of Statistics, UCLA.
71. Bengio, Y. (2009). Learning deep architectures for AI. Foundations and trends® in Machine Learning,
2(1), 1-127.
72. Carreira-Perpinan, M. A., & Hinton, G. (2005, January). On Contrastive Divergence Learning. In
AISTATS (Vol. 10, pp. 33-40).
73. Gantmacher, F. R., & Brenner, J. L. (2005). Applications of the Theory of Matrices. Courier
Corporation.
74. Gantmacher, F. R. (1959). Matrix theory, vol.1 and 2. New York.
75. Aarts, E., & Korst, J. (1988). Simulated annealing and Boltzmann machines.
76. Crick, F., & Mitchison, G. (1983). The function of dream sleep. Nature, 304(5922), 111-114.
77. Hopfield, J. J., Feinstein, D. I., & Palmer, R. G. (1983). ‘Unlearning’ has a stabilizing effect in
collective memories.
78. Levin, D. A., Peres, Y., & Wilmer, E. L. (2009). Markov chains and mixing times. American
Mathematical Soc.
Appendix A: Mathematical notations
In this appendix we present the main conventions and notations used throughout this paper.
We use |A| to denote the cardinality of a finite set A.
We denote matrices by uppercase bold roman letters, such as 𝐀.
We use a superscript T to denote the transpose of a matrix or vector.
Without restricting the generality, we assume that a set of processing units (neurons) 𝒩 is
indexed by the set of natural numbers {1,2,… , 𝑛} for 𝑛 = |𝒩| ∈ ℕ. We also make the
convention that "0" denotes some object not belonging to 𝒩.
A random variable is generally denoted 𝑋 (typeface italic uppercase x).
A univariate (scalar) random variable is denoted in the same way as a general random variable.
An individual observation of a scalar random variable is denoted by 𝓍 (script italic lowercase x). A set comprising 𝑚 observations of a scalar random variable 𝑋 is denoted by 𝕩 (double–struck lowercase x) and is written as:
𝕩 ≡ (𝑥^{(1)}, 𝑥^{(2)}, … , 𝑥^{(𝑚)}) (A1)
A multivariate random variable is denoted by X (typeface uppercase x):
X ≡ (𝑋1, 𝑋2, … , 𝑋𝑛)^T (A2)
We use the notation X−i to designate all the random variables from X except 𝑋𝑖, i.e.:
X−i = (𝑋1, … , 𝑋𝑖−1, 𝑋𝑖+1, … , 𝑋𝑛) (A3)
We use the symbol ⊥ to represent the conditional independence of random variables.
Example: 𝐴 ⊥ 𝐵 | 𝐶.
We use bold uppercase letters to designate probability distributions. Examples: 𝐏, 𝐐.
We use the accent “bar” to denote a candidate approximation of an unknown probability distribution. Example: the probability distribution 𝐏̄ is an approximation of the probability distribution 𝐏.
We use the accent “tilde” to denote the unnormalized measure of a probability distribution.
Example: 𝐏̃ is the unnormalized measure of the probability distribution 𝐏.
We use the accent “hat” to identify the collection of canonical parameters corresponding to bi–dimensional cliques (edges) in a pairwise Markov network. The collection of all the canonical parameters is represented with the same letter as the previous collection but without the “hat”. Example: the first collection is 𝐖̂; the second collection is 𝐖.
Appendix B: Probability theory and statistics
The definitions and theoretical results presented in this appendix are taken from the books [54]
and [78]. They are notions from probability theory and statistics that have been used in the
previous sections.
In probability and statistics, a Random (Stochastic) Variable, usually written 𝑋, is a variable
whose value is subject to variations due to chance (i.e., randomness, in a mathematical sense);
otherwise its values are numerical outcomes of a random phenomenon or experiment. As
opposed to other mathematical variables, a random variable conceptually does not have a
single, fixed value (even if unknown); rather, it can take on a set of possible different values,
each with an associated probability. Based on the number of values that constitute the random
variable associated to a statistical unit, the random variables are classified into two categories:
univariate and multivariate.
A Univariate Random Variable or Random Scalar is a single variable whose value is unknown,
either because the value has not yet occurred, or because there is imperfect knowledge of its
value. Normally a random scalar is a real number.
A Multivariate Random Variable or Random Vector, usually written X, is a list of mathematical
variables whose value for each of them is unknown, either because the value has not yet
occurred, or because there is imperfect knowledge of its value. The individual variables in a
random vector are grouped together because there may be correlations among them – often
they represent different properties of an individual statistical unit (e.g. a particular person, event,
etc.). Normally each element of a random vector is a real number.
In mathematics, a moment is a specific quantitative measure of the shape of a set of points. The
moments of a random variable (or of its distribution) are expected values of powers or related
functions of the random variable. The first moment, also called mean, is a measure of the
center or location of a random variable or distribution. The second moment of a random variable
is also called variance and its square root is called standard deviation. The variance and
standard deviation are measures of the scale or spread of a random variable or distribution.
𝝈–algebra:
Given a set Ω, a 𝜎–algebra is a collection ℱ of subsets satisfying the following conditions:
Ω ∈ ℱ;
if 𝐴1, 𝐴2, … are elements of ℱ, then ⋃_{𝑖=1}^{∞} 𝐴𝑖 ∈ ℱ;
if 𝐴 ∈ ℱ, then 𝐴^𝐶 = Ω − 𝐴 ∈ ℱ.
Probability space:
A probability space is a three–tuple (Ω, ℱ, 𝑝) in which the three components are:
Sample space: A nonempty set Ω called the sample space, which represents all possible
outcomes.
Event space: A collection ℱ of subsets of Ω, called the event space. The elements of ℱ
are called events.
If Ω is discrete, then ℱ is usually the collection of all subsets of Ω: ℱ = pow(Ω). If Ω is
continuous, then ℱ is usually a 𝜎–algebra on Ω.
Probability function: A function 𝑝 ∶ ℱ ⟶ ℝ that assigns probabilities to the events of ℱ
and satisfies the requirements of a probability measure over Ω as specified below.
An outcome is the result of a single execution of the model. Once the probability space is
established, it is assumed that “nature” makes its move and selects a single outcome ω from
the sample space Ω. All the events in ℱ that contain the selected outcome ω (recall that each
event is a subset of Ω) are said to “have occurred”. The selection performed by “nature” is done
in such a way that, if the experiment were to be repeated an infinite number of times, the
relative frequencies of occurrence of each of the events would coincide with the probabilities
prescribed by the function 𝑝.
Borel 𝝈–algebra:
If the sample space Ω is a countable set, the 𝜎–algebra of events is usually taken to be pow(Ω). If Ω is ℝ^𝑑, then the Borel 𝜎–algebra is the smallest 𝜎–algebra containing all open sets.
Probability measure:
Given a probability space, a probability measure is a non–negative function 𝐏 defined on events
and satisfying the following:
𝐏(Ω) = 1;
for any sequence of events 𝐵1, 𝐵2, … which are disjoint, meaning 𝐵𝑖 ∩ 𝐵𝑗 = ∅ for 𝑖 ≠ 𝑗:
𝐏(⋃_{𝑖=1}^{∞} 𝐵𝑖) = ∑_{𝑖=1}^{∞} 𝐏(𝐵𝑖) (B1)
Probability distribution:
If Ω is a countable set, a probability distribution (or sometimes simply a probability) on Ω is a
function 𝑝 ∶ Ω ⟶ [0, 1] such that:
∑_{ω∈Ω} 𝑝(ω) = 1 (B2)
We will abuse notation and write, for any subset 𝐴 ⊂ Ω, 𝑝(𝐴) = ∑_{ω∈𝐴} 𝑝(ω). The set function 𝐴 ⟼ 𝑝(𝐴) is a probability measure.
Measurable function:
A function 𝑓: Ω ⟶ ℝ is called measurable if 𝑓−1(𝐵) is an event for all open sets 𝐵.
Density function:
If Ω = 𝐷 is an open subset of ℝ^𝑑 and 𝑓 : 𝐷 ⟶ [0, ∞) is a measurable function satisfying ∫_𝐷 𝑓(𝑥) 𝑑𝑥 = 1, then 𝑓 is called a density function.
Given a density function, the following set function defined for Borel sets 𝐵 is a probability
measure:
𝜇𝑓(𝐵) = ∫_𝐵 𝑓(𝑥) 𝑑𝑥 (B3)
Random variable:
Given a probability space, a random variable 𝑋 is a measurable function defined on Ω. We write
{𝑋 ∈ 𝐴} as shorthand for the set:
𝑋−1(𝐴) = {ω ∈ Ω ∶ 𝑋(ω) ∈ 𝐴} (B4)
Distribution of a random variable:
The distribution of a random variable 𝑋 is the probability measure 𝜇𝑋 on ℝ defined for Borel set
𝐵 by:
𝜇𝑋(𝐵) = 𝐏({𝑋 ∈ 𝐵}) = 𝐏{𝑋 ∈ 𝐵} (B5)
Types of random variables:
We call a random variable X discrete if there is a finite or countable set 𝑆, called the support of
𝑋, such that 𝜇𝑋(𝑆) = 1. In this case, the following function is a probability distribution on 𝑆:
𝑝𝑋(𝑎) = 𝐏{𝑋 = 𝑎} (B6)
We call a random variable 𝑋 absolutely continuous if there is a density function 𝑓 on ℝ such that:
𝜇𝑋(𝐴) = ∫_𝐴 𝑓(𝑥) 𝑑𝑥 (B7)
Mean or expectation:
For a discrete random variable 𝑋, the mean or expectation 𝐄(𝑋) can be computed by the
following formula whose sum has at most countably many non–zero terms:
𝐄𝐏(𝑋) = 𝐄𝐏[𝑋] = 𝐄[𝑋] = ∑_{𝑥∈ℝ} 𝑥 ∙ 𝐏(𝑋 = 𝑥) (B8)
For an absolutely continuous random variable 𝑋, the expectation 𝐄(𝑋) is computed by the
formula:
𝐄𝑓(𝑋) = 𝐄𝑓[𝑋] = 𝐄[𝑋] = ∫_ℝ 𝑥 ∙ 𝑓𝑋(𝑥) 𝑑𝑥 (B9)
Variance:
The variance of a random variable 𝑋 is defined by:
𝐕𝐚𝐫(𝑋) = 𝐕𝐚𝐫[𝑋] = 𝐄[(𝑋 − 𝐄[𝑋])𝟐] (B10)
If 𝑋 is a random variable, 𝑔 ∶ ℝ ⟶ ℝ is a function, and 𝑌 = 𝑔(𝑋) is a function of 𝑋, then the
expectation 𝐄[𝑌] can be computed via the formulae:
𝐄[𝑌] = ∫ 𝑔(𝑥) ∙ 𝑓(𝑥) 𝑑𝑥, if 𝑋 is continuous with density 𝑓
𝐄[𝑌] = ∑_{𝑥∈𝑆} 𝑔(𝑥) ∙ 𝑝𝑋(𝑥), if 𝑋 is discrete with support 𝑆 (B11)
Standard deviation:
The standard deviation of a random variable 𝑋 is defined as the (nonnegative) square root of its
variance:
𝛔𝑋 = √𝐕𝐚𝐫[𝑋] (B12)
Covariance:
The covariance between two jointly distributed real–valued random variables 𝑋 and 𝑌 with finite
variances is defined as:
𝐜𝐨𝐯(𝑋, 𝑌) = 𝐄[(𝑋 − 𝐄[𝑋]) ∙ (𝑌 − 𝐄[𝑌])] = 𝐄[𝑋 ∙ 𝑌] − 𝐄[𝑋] ∙ 𝐄[𝑌] (B13)
Correlation coefficient:
The population correlation coefficient between two random variables 𝑋 and 𝑌 with expected values 𝐄[𝑋] and 𝐄[𝑌] and standard deviations 𝛔𝑋 and 𝛔𝑌 is defined as:
𝝆𝑿,𝒀 = 𝐜𝐨𝐫𝐫(𝑋, 𝑌) = 𝐜𝐨𝐯(𝑋, 𝑌) / (𝛔𝑋 ∙ 𝛔𝑌) (B14)
The sample correlation coefficient between two data sets 𝐗 = {x1, … , x𝑛} and 𝐘 = {𝑦1, … , y𝑛},
each of them containing 𝑛 values, is defined as:
𝒓𝑿,𝒀 = 𝐜𝐨𝐫𝐫(𝐗, 𝐘) = (∑_𝑖 𝑥𝑖 ∙ 𝑦𝑖 − 𝑛 ∙ 𝑥̄ ∙ 𝑦̄) / ((𝑛 − 1) ∙ 𝐬𝐗 ∙ 𝐬𝐘) (B15)
where 𝐬𝐗 and 𝐬𝐘 represent the sample standard deviations of 𝐗 and 𝐘, respectively, and 𝑥̄ and 𝑦̄ represent the sample means of 𝐗 and 𝐘, respectively.
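As an illustrative aside (not part of the original text), formula (B15) can be checked numerically. The Python sketch below uses the equivalent centered form 𝐜𝐨𝐯(𝐗, 𝐘)/(𝐬𝐗 ∙ 𝐬𝐘); the function name and the toy data are our own.

```python
import math

def sample_corr(xs, ys):
    """Sample correlation coefficient as in (B15), computed via the
    equivalent centered form cov(X, Y) / (s_X * s_Y) with n - 1 divisors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    return cov / (sx * sy)

# Perfectly linearly related samples give correlation 1 (or -1 if decreasing).
assert abs(sample_corr([1, 2, 3, 4], [2, 4, 6, 8]) - 1.0) < 1e-12
assert abs(sample_corr([1, 2, 3, 4], [8, 6, 4, 2]) + 1.0) < 1e-12
```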
Independence:
Fix a probability space and a probability measure 𝐏. Two events 𝐴 and 𝐵 are independent if:
𝐏(𝐴 ∩ 𝐵) = 𝐏(𝐴) ∙ 𝐏(𝐵) (B16)
Events 𝐴1, 𝐴2, … are independent if for any indices 𝑖1, 𝑖2, … , 𝑖𝑟:
𝐏(𝐴𝑖1 ∩ 𝐴𝑖2 ∩ … ∩ 𝐴𝑖𝑟) = 𝐏(𝐴𝑖1) ∙ 𝐏(𝐴𝑖2) ∙ … ∙ 𝐏(𝐴𝑖𝑟) (B17)
Random variables 𝑋1, 𝑋2, … are independent if for all Borel sets 𝐵1, 𝐵2, … the events {𝑋1 ∈
𝐵1}, {𝑋2 ∈ 𝐵2},… are independent.
Proposition B.1: If 𝑋 and 𝑌 are independent random variables such that 𝐕𝐚𝐫(𝑋) and 𝐕𝐚𝐫(𝑌) exist, then:
𝐕𝐚𝐫[𝑋 + 𝑌] = 𝐕𝐚𝐫[𝑋] + 𝐕𝐚𝐫[𝑌] (B18)
Theorem B.2 (Markov’s inequality):
For a non–negative random variable 𝑋 and any 𝑎 > 0:
𝐏{𝑋 > 𝑎} ≤ 𝐄(𝑋) / 𝑎 (B19)
Convergence in probability:
A sequence of random variables (𝑋𝑡) converges in probability to a random variable 𝑋 if:
lim_{𝑡→∞} 𝐏{|𝑋𝑡 − 𝑋| > 𝜀} = 0 for all 𝜀 > 0 (B20)
This is denoted by: 𝑋𝑡 →^{𝑝𝑟} 𝑋.
Theorem B.3 (Convergence for sequence of random variables):
Let (𝑌𝑡) be a sequence of random variables and 𝑌 be a random variable such that:
𝐏{ lim_{𝑛→∞} 𝑌𝑛 = 𝑌 } = 1 (B21)
Bounded Convergence:
If there is a constant 𝑘 ≥ 0 independent of 𝑛 such that |𝑌𝑛| < 𝑘 for all 𝑛 ∈ ℕ, then:
lim_{𝑛→∞} 𝐄[𝑌𝑛] = 𝐄[𝑌] (B22)
Dominated Convergence:
If there is a random variable 𝑍 such that 𝐄[|𝑍|] < ∞ and 𝐏{|𝑌𝑛| ≤ |𝑍|} = 1 for all 𝑛 ∈ ℕ, then:
lim_{𝑛→∞} 𝐄[𝑌𝑛] = 𝐄[𝑌] (B23)
Monotone Convergence:
If 𝐏{𝑌𝑛 ≤ 𝑌𝑛+1} = 1 for all 𝑛 ∈ ℕ, then:
lim_{𝑛→∞} 𝐄[𝑌𝑛] = 𝐄[𝑌] (B24)
Entropy of a univariate random variable:
Let 𝐏(𝑋) be a distribution over a univariate random variable 𝑋. The entropy of 𝑋 is defined as:
S𝐏(𝑋) = 𝐄[log(1/𝐏(𝑋))] = ∑_𝑋 𝐏(𝑋) ∙ log(1/𝐏(𝑋)) = − ∑_𝑋 𝐏(𝑋) ∙ log 𝐏(𝑋) (B25)
where we treat 0 ∙ log(1/0) = 0 because lim_{𝜀→0} 𝜀 ∙ log(1/𝜀) = 0.
The entropy can be viewed as a measure of our uncertainty about the value of 𝑋.
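As a small numerical illustration (ours, not the original text), definition (B25) can be evaluated directly; the sketch below uses base-2 logarithms, so the result is measured in bits, and applies the 0 ∙ log(1/0) = 0 convention.

```python
import math

def entropy(p):
    """Shannon entropy of a discrete distribution p (a list of probabilities),
    using the convention 0 * log(1/0) = 0 from definition (B25)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# A fair coin carries one bit of uncertainty; a deterministic variable none;
# a uniform distribution on 4 outcomes carries two bits.
assert entropy([0.5, 0.5]) == 1.0
assert entropy([1.0, 0.0]) == 0.0
assert entropy([0.25, 0.25, 0.25, 0.25]) == 2.0
```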
Entropy of a multivariate random variable:
The previous definition extends naturally to multivariate random variables. Let 𝐏(𝑋1, … , 𝑋𝑛) be a distribution over random variables 𝑋1, … , 𝑋𝑛. Then the joint entropy of 𝑋1, … , 𝑋𝑛 is:
S𝐏(𝑋1, … , 𝑋𝑛) = 𝐄[log(1/𝐏(𝑋1, … , 𝑋𝑛))] = ∑_{𝑋1,…,𝑋𝑛} 𝐏(𝑋1, … , 𝑋𝑛) ∙ log(1/𝐏(𝑋1, … , 𝑋𝑛)) (B26)
S𝐏(𝑋1, … , 𝑋𝑛) = − ∑_{𝑋1,…,𝑋𝑛} 𝐏(𝑋1, … , 𝑋𝑛) ∙ log 𝐏(𝑋1, … , 𝑋𝑛) (B27)
Distance between distributions:
There are situations when we want to compare two distributions. For instance, we might want to approximate a distribution by one with desired qualities, e.g., a simpler representation or one that is more efficient to reason with; we also want to evaluate the quality of a candidate approximation.
Another example is in the context of learning a distribution from data, where we want to
compare the learned distribution to the “true” distribution from which the data was generated.
Therefore, we want to construct a distance measure 𝑑 that evaluates the distance between two
distributions. There are some properties we might wish for in such a measure:
Positivity: 𝑑(𝐏,𝐐) is always nonnegative and is zero if and only if 𝐏 = 𝐐.
Symmetry: 𝑑(𝐏, 𝐐) = 𝑑(𝐐, 𝐏).
Triangle inequality: for any three distributions 𝐏,𝐐, 𝐑 we have that:
𝑑(𝐏,𝐑) ≤ 𝑑(𝐏,𝐐) + 𝑑(𝐐, 𝐑) (B28) A distance measure that satisfies these three properties is called a distance metric.
Relative entropy and KL–divergence:
Let 𝐐 and 𝐏 be two distributions over random variables 𝑋1, … , 𝑋𝑛. The relative entropy of 𝐐 and
𝐏 is:
KL(𝐐(𝑋1 … 𝑋𝑛) || 𝐏(𝑋1 … 𝑋𝑛)) = 𝐄𝐐[log(𝐐(𝑋1, … , 𝑋𝑛) / 𝐏(𝑋1, … , 𝑋𝑛))] (B29)
KL(𝐐(𝑋1 … 𝑋𝑛) || 𝐏(𝑋1 … 𝑋𝑛)) = ∑_{𝑋1,…,𝑋𝑛} 𝐐(𝑋1, … , 𝑋𝑛) ∙ log(𝐐(𝑋1, … , 𝑋𝑛) / 𝐏(𝑋1, … , 𝑋𝑛)) (B30)
When the set of variables is clear from context we use the shorthand definition: KL(𝐐||𝐏). This
measure is often known as the Kullback–Leibler divergence or KL–divergence.
The relative entropy measures the additional cost imposed by using a wrong distribution 𝐐 instead of the true distribution 𝐏. Thus, 𝐐 is close to 𝐏 in the sense of relative entropy if this cost is small. The additional cost of using the wrong distribution is always nonnegative, and the relative entropy is zero if and only if the two distributions are identical.
Unfortunately, positivity is the only property of distance metrics that the relative entropy satisfies; it satisfies neither symmetry nor the triangle inequality.
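The asymmetry of the KL–divergence is easy to exhibit numerically. The following sketch (ours; natural logarithms, distributions given as lists) evaluates definition (B30) and confirms positivity, the zero-iff-identical property, and the failure of symmetry.

```python
import math

def kl(q, p):
    """KL(Q || P) for discrete distributions given as equal-length lists,
    following definition (B30); assumes p[i] > 0 wherever q[i] > 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

q = [0.5, 0.5]
p = [0.9, 0.1]
# KL is zero for identical distributions, positive otherwise, and asymmetric.
assert kl(p, p) == 0.0
assert kl(q, p) > 0
assert kl(q, p) != kl(p, q)
```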
Appendix C: Finite Markov chains
The notions presented in this appendix have been used in the previous sections and the
majority of them are taken from the book [78].
A finite Markov chain is a process which moves among the elements of a finite set Ω in the following manner: when 𝑥 ∈ Ω is the current position of the process, the next position is chosen according to a fixed probability distribution 𝑃(𝑥, ·). We formally define this type of process and present some of its properties.
Finite Markov chain:
A sequence of random variables 𝑋0, 𝑋1, … is a finite Markov chain with finite state space Ω and transition matrix 𝑃 if for all 𝑥, 𝑦 ∈ Ω, all 𝑡 ≥ 1, and all events 𝐻𝑡−1 = ⋂_{𝑠=0}^{𝑡−1} {𝑋𝑠 = 𝑥𝑠} satisfying 𝐏(𝐻𝑡−1 ∩ {𝑋𝑡 = 𝑥}) > 0, we have:
𝐏{𝑋𝑡+1 = 𝑦 | 𝐻𝑡−1 ∩ {𝑋𝑡 = 𝑥}} = 𝐏{𝑋𝑡+1 = 𝑦 | 𝑋𝑡 = 𝑥} = 𝑃(𝑥, 𝑦) (C1)
Equation (C1) illustrates how the Markov chain explores the space in a local fashion. Often
called the Markov local property, equation (C1) means that the conditional probability of
proceeding from state 𝑥 to state 𝑦 is the same, no matter what sequence 𝑥0, 𝑥1, … 𝑥𝑡−1 of states
precedes the current state 𝑥. This is exactly why the |Ω| × |Ω| matrix 𝑃 suffices to describe the
transitions.
Let the distribution 𝑃(𝑥, ·) be the 𝑥-th row of the transition matrix 𝑃. Thus, 𝑃 is stochastic, that is, its entries are all non–negative and:
∑_{𝑦∈Ω} 𝑃(𝑥, 𝑦) = 1 (C2)
Let (𝑋1, 𝑋2, … ) be a finite Markov chain with state space Ω and transition matrix 𝑃, and let the
row vector 𝜇𝑡 be the distribution of 𝑋𝑡:
𝜇𝑡(𝑥) = 𝐏{𝑋𝑡 = 𝑥} for all 𝑥 ∈ Ω
By conditioning on the possible predecessors of the (𝑡 + 1)st state, we see that:
𝜇𝑡+1(𝑦) = ∑_{𝑥∈Ω} 𝐏{𝑋𝑡 = 𝑥} ∙ 𝑃(𝑥, 𝑦) = ∑_{𝑥∈Ω} 𝜇𝑡(𝑥) ∙ 𝑃(𝑥, 𝑦) for all 𝑦 ∈ Ω
Rewriting this in vector form gives:
𝜇𝑡+1 = 𝜇𝑡 ∙ 𝑃 for 𝑡 ≥ 0 (C3)
hence:
𝜇𝑡 = 𝜇0 ∙ 𝑃^𝑡 for 𝑡 ≥ 0 (C4)
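The evolution rule 𝜇𝑡+1 = 𝜇𝑡 ∙ 𝑃 of equation (C3) is a single vector–matrix product. The following Python sketch (ours; the 2-state transition matrix is a hypothetical example) advances a distribution a few steps and checks that it remains a probability distribution.

```python
def step(mu, P):
    """One step of (C3): advance a distribution (row vector) through P."""
    n = len(mu)
    return [sum(mu[x] * P[x][y] for x in range(n)) for y in range(n)]

# Hypothetical 2-state chain.
P = [[0.9, 0.1],
     [0.2, 0.8]]
mu = [1.0, 0.0]           # start deterministically in state 0
for _ in range(3):
    mu = step(mu, P)      # mu_t = mu_0 * P^t, as in equation (C4)
# Each mu_t is again a probability distribution.
assert abs(sum(mu) - 1.0) < 1e-12
assert abs(mu[0] - 0.781) < 1e-9
```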
Irreducible finite Markov chain:
A Markov chain with transition matrix 𝑃 is called irreducible if for any two states 𝑥, 𝑦 ∈ Ω, there
exists an integer 𝑡 (possibly depending on 𝑥 and 𝑦) such that 𝑃𝑡(𝑥, 𝑦) > 0.
This means that it is possible to get from any state to any other state (not necessarily in one step) using only transitions of positive probability.
Lemma C.1: A finite irreducible Markov chain with state space Ω and transition matrix 𝑃 = (𝑝𝑖𝑗)_{1≤𝑖,𝑗≤𝑚} is aperiodic if there exists a state 𝑥𝑗 ∈ Ω, where 1 ≤ 𝑗 ≤ 𝑚, such that 𝑝𝑗𝑗 > 0.
Period of a state:
Let 𝑇(𝑥) = {𝑡 ≥ 1 : 𝑃^𝑡(𝑥, 𝑥) > 0} be the set of times when it is possible for the chain to return to the starting position 𝑥. The period of state 𝑥 is defined to be the greatest common divisor of 𝑇(𝑥). The chain is called aperiodic if every state has period 1.
Stationary distribution:
A stationary distribution of a Markov chain 𝑃 is a probability 𝜋 on Ω invariant under right multiplication by 𝑃, which means:
𝜋 = 𝜋 ∙ 𝑃 (C5)
In this case, 𝜋 is the long–term limiting distribution of the Markov chain. Clearly, if 𝜋 is a stationary distribution and 𝜇0 = 𝜋, i.e., the chain is started in a stationary distribution, then 𝜇𝑡 = 𝜋 for all 𝑡 ≥ 0. Equation (C5) can be rewritten element–wise as:
𝜋(𝑦) = ∑_{𝑥∈Ω} 𝜋(𝑥) ∙ 𝑃(𝑥, 𝑦) for all 𝑦 ∈ Ω (C6)
Under mild restrictions, stationary distributions of finite Markov chains exist and are unique, and the chains converge to them.
There is a difference between multiplying a row vector by 𝑃 on the right and a column vector by
𝑃 on the left: the former advances a distribution by one step of the chain, while the latter gives
the expectation of a function on states, one step of the chain later.
Hitting and stopping time:
For x ∈ Ω, we define the hitting time for 𝑥 to be the first time at which the chain visits state 𝑥:
𝜏𝑥 = min{𝑡 ≥ 0 : 𝑋𝑡 = 𝑥} (C7)
For situations where only a visit to 𝑥 at a positive time will do, we also define:
𝜏𝑥^+ = min{𝑡 ≥ 1 : 𝑋𝑡 = 𝑥} (C8)
We call 𝜏𝑥^+ the first return time when 𝑋0 = 𝑥.
A stopping time 𝜏 for (𝑋𝑡) is a {0, 1, … } ∪ {∞}–valued random variable such that, for each 𝑡, the event {𝜏 = 𝑡} is determined by 𝑋0, 𝑋1, … , 𝑋𝑡. If 𝜏 is a stopping time, then an immediate consequence of the definition and the Markov property is:
𝐏𝑥0{(𝑋𝜏+1, 𝑋𝜏+2, … , 𝑋𝑙) ∈ 𝐴 | 𝜏 = 𝑘 and (𝑋1, … , 𝑋𝑘) = (𝑥1, … , 𝑥𝑘)} = 𝐏𝑥𝑘{(𝑋1, … , 𝑋𝑙) ∈ 𝐴} (C9)
for any 𝐴 ⊂ Ω𝑙. This is referred to as the strong Markov property. Informally, we say that the
chain “starts afresh” at a stopping time.
Lemma C.2: For any states 𝑥 and 𝑦 of an irreducible chain:
𝐄𝑥(𝜏𝑦^+) < ∞ (C10)
Theorem C.3 (Existence of a stationary distribution):
Let 𝑃 be the transition matrix of an irreducible Markov chain. Then the following are true:
there exists a probability distribution 𝜋 on Ω such that 𝜋 = 𝜋 ∙ 𝑃 and 𝜋(𝑥) > 0 for all 𝑥 ∈ Ω; (C11)
𝜋(𝑥) = 1 / 𝐄𝑥(𝜏𝑥^+) (C12)
Theorem C.4 (Uniqueness of the stationary distribution):
Let 𝑃 be the transition matrix of an irreducible Markov chain. There exists a unique probability
distribution 𝜋 satisfying: 𝜋 = 𝜋 ∙ 𝑃.
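As a hedged numerical companion to Theorems C.3 and C.4 (ours, not part of the original text), the unique stationary distribution of a small irreducible, aperiodic chain can be approximated by repeatedly applying (C3), a power-iteration sketch; the example matrix is hypothetical.

```python
def stationary(P, iters=10_000):
    """Approximate the unique stationary distribution of an irreducible,
    aperiodic finite chain by iterating mu <- mu * P (power iteration)."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[x] * P[x][y] for x in range(n)) for y in range(n)]
    return mu

P = [[0.9, 0.1],
     [0.2, 0.8]]
pi = stationary(P)
# pi satisfies pi = pi * P; for this chain pi = (2/3, 1/3).
assert abs(pi[0] - 2 / 3) < 1e-6
assert abs(pi[1] - 1 / 3) < 1e-6
```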
Reversibility and Time Reversals:
Suppose a probability 𝜋 on Ω satisfies the detailed balance equation:
𝜋(𝑥) ∙ 𝑃(𝑥, 𝑦) = 𝜋(𝑦) ∙ 𝑃(𝑦, 𝑥) for all 𝑥, 𝑦 ∈ Ω (C13) Proposition C.5: Let 𝑃 be the transition matrix of a Markov chain with state space Ω. Any
distribution 𝜋 satisfying the detailed balance equations is stationary for 𝑃.
Checking detailed balance is often the simplest way to verify that a particular distribution is stationary. Furthermore, when the detailed balance equation holds:
𝜋(𝑥0) ∙ 𝑃(𝑥0, 𝑥1) ∙ … ∙ 𝑃(𝑥𝑛−1, 𝑥𝑛) = 𝜋(𝑥𝑛) ∙ 𝑃(𝑥𝑛, 𝑥𝑛−1) ∙ … ∙ 𝑃(𝑥1, 𝑥0) (C14)
We can rewrite the previous equation in the following suggestive form:
𝐏𝜋{𝑋0 = 𝑥0, … , 𝑋𝑛 = 𝑥𝑛} = 𝐏𝜋{𝑋0 = 𝑥𝑛, … , 𝑋𝑛 = 𝑥0} (C15)
In other words, if a chain (𝑋𝑡) satisfies the detailed balance equation and has stationary initial distribution, then the distribution of (𝑋0, 𝑋1, … , 𝑋𝑛) is the same as the distribution of (𝑋𝑛, 𝑋𝑛−1, … , 𝑋0).
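Checking detailed balance as in (C13) amounts to comparing 𝜋(𝑥)𝑃(𝑥, 𝑦) against 𝜋(𝑦)𝑃(𝑦, 𝑥) over all pairs. A minimal Python sketch (ours; the example chain is hypothetical, and every two-state chain is reversible with respect to its stationary distribution):

```python
def detailed_balance(pi, P, tol=1e-12):
    """Check the detailed balance equation (C13) for every pair of states."""
    n = len(pi)
    return all(abs(pi[x] * P[x][y] - pi[y] * P[y][x]) <= tol
               for x in range(n) for y in range(n))

P = [[0.9, 0.1],
     [0.2, 0.8]]
# (2/3, 1/3) is stationary for P and satisfies detailed balance;
# the uniform distribution does not.
assert detailed_balance([2 / 3, 1 / 3], P)
assert not detailed_balance([0.5, 0.5], P)
```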
Reversible finite Markov chain:
A chain satisfying the detailed balance equation is called reversible. The time reversal of an irreducible Markov chain with transition matrix 𝑃 and stationary distribution 𝜋 is the chain with transition matrix:
𝑃̂(𝑥, 𝑦) = 𝜋(𝑦) ∙ 𝑃(𝑦, 𝑥) / 𝜋(𝑥) (C16)
The stationary equation 𝜋 = 𝜋 ∙ 𝑃 implies that 𝑃̂ is a stochastic matrix. We write (𝑋̂𝑡) for the time–reversed chain of (𝑋𝑡) and 𝑃̂ for the transition matrix of (𝑋̂𝑡).
Proposition C.6: Let (𝑋𝑡) be an irreducible Markov chain with transition matrix 𝑃 and stationary distribution 𝜋. Then 𝜋 is stationary for 𝑃̂ and for any 𝑥0, 𝑥1, … , 𝑥𝑡 ∈ Ω we have:
𝐏𝜋{𝑋0 = 𝑥0, … , 𝑋𝑡 = 𝑥𝑡} = 𝐏̂𝜋{𝑋0 = 𝑥𝑡, … , 𝑋𝑡 = 𝑥0} (C17)
Observe that if a chain with transition matrix 𝑃 is reversible, then 𝑃̂ = 𝑃.
Theorem C.7 (Markov Chain Convergence):
Suppose that a Markov chain 𝑃 is irreducible and aperiodic, with stationary distribution 𝜋. Then there exist constants 𝛼 ∈ (0, 1) and 𝐶 > 0 such that:
max_{𝑥∈Ω} ||𝑃^𝑡(𝑥, ·) − 𝜋||_{𝑇𝑉} ≤ 𝐶 ∙ 𝛼^𝑡 (C18)
where ||𝜇 − 𝜈||_{𝑇𝑉} represents the total variation distance between two probability distributions 𝜇 and 𝜈 on Ω and is defined as:
||𝜇 − 𝜈||_{𝑇𝑉} = max_{𝐴⊂Ω} |𝜇(𝐴) − 𝜈(𝐴)| (C19)
This theorem implies that the “long–term” fractions of time a finite irreducible aperiodic Markov
chain spends in each state coincide with the chain’s stationary distribution.
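The geometric decay promised by (C18) can be observed directly. In the sketch below (ours; for discrete distributions the total variation distance equals half the L1 distance, a standard identity), the distance from 𝜋 shrinks at every step of a hypothetical two-state chain.

```python
def tv(mu, nu):
    """Total variation distance (C19); for discrete distributions it
    equals half the L1 distance between the probability vectors."""
    return 0.5 * sum(abs(m - n) for m, n in zip(mu, nu))

P = [[0.9, 0.1],
     [0.2, 0.8]]
pi = [2 / 3, 1 / 3]        # stationary distribution of P
mu = [1.0, 0.0]            # start in state 0
dists = []
for _ in range(5):
    mu = [sum(mu[x] * P[x][y] for x in range(2)) for y in range(2)]
    dists.append(tv(mu, pi))
# Geometric decay as in (C18): each step strictly shrinks the distance.
assert all(dists[i + 1] < dists[i] for i in range(4))
```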
Ergodicity:
A Markov chain is said to be ergodic if there exists a positive integer 𝑇0 such that, for all pairs of
states 𝑖 and 𝑗 in the Markov chain, if the chain is started at time 0 in state 𝑖, then for all 𝑡 > 𝑇0,
the probability of being in state 𝑗 at time 𝑡 is greater than 0.
For a Markov chain to be ergodic two technical conditions are required of its states and its
transition matrix: irreducibility and aperiodicity. Informally, irreducibility ensures that there is a
sequence of transitions of non–zero probability from any state to any other, while aperiodicity
ensures that the states are not partitioned into sets such that all state transitions occur cyclically
from one set to another.
Theorem C.8 (Ergodic Theorem):
Let 𝑓 be a real–valued function defined on Ω. If (𝑋𝑡) is an irreducible Markov chain with stationary distribution 𝜋, then for any starting distribution 𝜇, the following holds:
𝐏𝜇{ lim_{𝑡→∞} (1/𝑡) ∙ ∑_{𝑠=0}^{𝑡−1} 𝑓(𝑋𝑠) = 𝐄𝜋[𝑓] } = 1 (C20)
where 𝐄𝜋[𝑓] = ∑_{𝑥∈Ω} 𝑓(𝑥) ∙ 𝜋(𝑥) is computed as in formula (B8).
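A simulation sketch (ours; hypothetical two-state chain, 𝑓 the indicator of state 0) illustrates (C20): the time average of 𝑓 along a single trajectory approaches 𝐄𝜋[𝑓] = 𝜋(0).

```python
import random

def time_average(P, steps, seed=1):
    """Sample one trajectory of a 2-state chain and return the time average
    of f(X_s) with f = indicator of state 0, as in the ergodic theorem (C20)."""
    random.seed(seed)
    x, total = 0, 0
    for _ in range(steps):
        total += (x == 0)
        # From state x, move to state 0 with probability P[x][0], else to 1.
        x = 0 if random.random() < P[x][0] else 1
    return total / steps

P = [[0.9, 0.1],
     [0.2, 0.8]]
# The stationary distribution is (2/3, 1/3), so the time average tends to 2/3.
avg = time_average(P, 100_000)
assert abs(avg - 2 / 3) < 0.03
```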
Markov chain Monte Carlo (MCMC):
Problem: Given an irreducible transition matrix 𝑃, there is a unique stationary distribution 𝜋 on Ω satisfying 𝜋 = 𝜋 ∙ 𝑃.
Answer: The existence and uniqueness of the solution of this problem are ensured by Theorems C.3 and C.4, respectively.
Inverse problem: Given a probability distribution 𝜋 on Ω, can we find a transition matrix 𝑃 for which 𝜋 is its stationary distribution?
Answer: Yes, we can. The solution involves a method of sampling from a given probability distribution called Markov chain Monte Carlo.
MCMC uses Markov chains to sample. A random sample from a finite set Ω means a uniformly random selection from Ω, i.e., a selection such that each element has the same chance 1/|Ω| of being chosen.
Suppose 𝜋 is a probability distribution on Ω. If a Markov chain (𝑋𝑡) with stationary distribution 𝜋
can be constructed, then, for 𝑡 large enough, the distribution of (𝑋𝑡) is close to 𝜋.
Metropolis chains:
Problem: Given a probability distribution 𝜋 on Ω and some Markov chain with state space Ω and an arbitrary stationary distribution, can the chain be modified so that the new chain has the stationary distribution 𝜋?
Answer: Yes, it can. The Metropolis algorithm solves this problem.
We distinguish two cases: symmetric base chain and general base chain. In both cases we are
given an arbitrary probability distribution 𝜋 on Ω and a base Markov chain (𝑋𝑡) with transition
matrix ψ. We want to construct a new chain (𝑌𝑡) starting from the base chain (𝑋𝑡) and modifying
its transitions such that the stationary distribution of the new chain is 𝜋. The new chain is called
the Metropolis chain. The stationary distribution of (𝑋𝑡) or, equivalently, the transition matrix ψ is also referred to as the proposal distribution.
Let (𝑋𝑡) have a symmetric transition matrix ψ. This implies that (𝑋𝑡) is reversible with
respect to the uniform distribution on Ω. The Metropolis chain is executed as follows.
It starts from the initial state of the base chain and evolves as follows: when at state 𝑥, a
candidate state 𝑦 is generated from the distribution ψ(𝑥,·). The state 𝑦 is “accepted” with
probability 𝑎(𝑥, 𝑦), which means that the next state of the new chain is 𝑦, or the state 𝑦 is
“rejected” with probability 1 − 𝑎(𝑥, 𝑦), which means that the next state of the new chain
remains at 𝑥. The acceptance probability 𝑎(𝑥, 𝑦) is:
𝑎(𝑥, 𝑦) = 1 ∧ (𝜋(𝑦)/𝜋(𝑥)) = min(1, 𝜋(𝑦)/𝜋(𝑥)) (C21)
Therefore, the Metropolis chain for a probability 𝜋 and a symmetric transition matrix ψ is
defined by the following transition matrix:
𝑃(𝑥, 𝑦) = ψ(𝑥, 𝑦) ∙ [1 ∧ 𝜋(𝑦)/𝜋(𝑥)], if 𝑦 ≠ 𝑥
𝑃(𝑥, 𝑥) = 1 − ∑_{𝑧 ≠ 𝑥} ψ(𝑥, 𝑧) ∙ [1 ∧ 𝜋(𝑧)/𝜋(𝑥)] (C22)
A very important feature of the Metropolis chain is that it depends only on the ratios 𝜋(𝑦)/𝜋(𝑥). Frequently 𝜋(𝑥) has the form ℎ(𝑥)/𝑍, where the function ℎ ∶ Ω ⟶ [0, ∞) is known and 𝑍 = ∑_{𝑥∈Ω} ℎ(𝑥) is a normalizing constant. It may be difficult to compute 𝑍 explicitly, especially if Ω is large. Because the Metropolis chain depends only on the ratios ℎ(𝑦)/ℎ(𝑥), it is not necessary to compute the constant 𝑍 in order to simulate the chain.
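As an illustration of this point, here is a minimal Metropolis sampler; the four-state Ω and the weights ℎ are invented for the example, and the proposal is uniform over Ω (hence symmetric). The code never computes 𝑍, yet the empirical frequencies approach ℎ(𝑥)/𝑍 = (0.125, 0.25, 0.5, 0.125):

```python
import random
from collections import Counter

# Target: pi(x) proportional to h(x) on Omega = {0,1,2,3}; Z is never computed.
h = {0: 1.0, 1: 2.0, 2: 4.0, 3: 1.0}
omega = list(h)

def metropolis_step(x, rng):
    # Symmetric proposal psi: a uniformly random state (psi(x, y) = 1/4).
    y = rng.choice(omega)
    a = min(1.0, h[y] / h[x])          # acceptance probability (C21)
    return y if rng.random() < a else x

rng = random.Random(0)
x, counts, n = 0, Counter(), 200_000
for _ in range(n):
    x = metropolis_step(x, rng)
    counts[x] += 1

# Empirical frequencies: should approach h(x)/Z = h(x)/8.
freq = {s: counts[s] / n for s in omega}
```

Only the ratio ℎ(𝑦)/ℎ(𝑥) appears in the acceptance step, exactly as noted above.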
Let (𝑋𝑡) have a general transition matrix ψ; that is, ψ is irreducible but not necessarily symmetric.
The Metropolis chain is executed as follows. It starts from the initial state of the base
chain and evolves as follows: when at state 𝑥, generate a state 𝑦 from the distribution
ψ(𝑥,·). Then move to 𝑦 with probability 𝑎(𝑥, 𝑦) and remain at 𝑥 with the probability
1 − 𝑎(𝑥, 𝑦). The acceptance probability 𝑎(𝑥, 𝑦) is:
𝑎(𝑥, 𝑦) = 1 ∧ (𝜋(𝑦) ∙ ψ(𝑦, 𝑥) / (𝜋(𝑥) ∙ ψ(𝑥, 𝑦))) = min(1, 𝜋(𝑦) ∙ ψ(𝑦, 𝑥) / (𝜋(𝑥) ∙ ψ(𝑥, 𝑦))) (C23)
Therefore, the Metropolis chain for a probability 𝜋 and a general transition matrix ψ is
defined by the following transition matrix:
𝑃(𝑥, 𝑦) = ψ(𝑥, 𝑦) ∙ [1 ∧ (𝜋(𝑦) ∙ ψ(𝑦, 𝑥) / (𝜋(𝑥) ∙ ψ(𝑥, 𝑦)))], if 𝑦 ≠ 𝑥
𝑃(𝑥, 𝑥) = 1 − ∑_{𝑧 ≠ 𝑥} ψ(𝑥, 𝑧) ∙ [1 ∧ (𝜋(𝑧) ∙ ψ(𝑧, 𝑥) / (𝜋(𝑥) ∙ ψ(𝑥, 𝑧)))] (C24)
The transition matrix 𝑃 defines a reversible Markov chain with stationary distribution 𝜋.
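Reversibility can be verified directly from (C23)–(C24) on a small example. In the sketch below, 𝜋 and the non-symmetric proposal ψ are arbitrary choices; the code builds the Metropolis transition matrix 𝑃 and checks detailed balance 𝜋(𝑥)𝑃(𝑥, 𝑦) = 𝜋(𝑦)𝑃(𝑦, 𝑥):

```python
# Arbitrary target pi and a non-symmetric irreducible proposal psi on {0, 1, 2}.
pi = [0.2, 0.3, 0.5]
psi = [[0.1, 0.6, 0.3],
       [0.4, 0.2, 0.4],
       [0.5, 0.3, 0.2]]
n = 3

# Build the Metropolis transition matrix per (C23)-(C24).
P = [[0.0] * n for _ in range(n)]
for x in range(n):
    for y in range(n):
        if y != x:
            a = min(1.0, (pi[y] * psi[y][x]) / (pi[x] * psi[x][y]))
            P[x][y] = psi[x][y] * a
    P[x][x] = 1.0 - sum(P[x][y] for y in range(n) if y != x)

# Detailed balance: pi(x) P(x,y) == pi(y) P(y,x) for all x, y.
balanced = all(abs(pi[x] * P[x][y] - pi[y] * P[y][x]) < 1e-12
               for x in range(n) for y in range(n))
```

Detailed balance implies stationarity, so 𝜋 ∙ 𝑃 = 𝜋 holds as well.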
Glauber dynamics (Gibbs sampler):
In general, let 𝑉 and 𝑆 be finite sets and Ω be a subset of 𝑆𝑉: Ω ⊆ 𝑆𝑉. 𝑉 can be seen as the
vertex set of a graph and 𝑆 can be seen as the set of state values for any vertex in the graph.
The elements of 𝑆𝑉 are called configurations and can be “visualized” as labeling the vertices of
𝑉 with elements of 𝑆.
Problem: Given a probability distribution 𝜋 on a space of configurations Ω ⊆ 𝑆𝑉, can we find a
Markov chain for which 𝜋 is its stationary distribution?
Answer: Yes, we can. The Glauber dynamics algorithm solves this problem.
Let 𝜋 be a probability distribution whose support is Ω. For a configuration 𝜎 ∈ Ω and a vertex
𝑣 ∈ 𝑉 let Ω(𝜎, 𝑣) be the set of configurations agreeing with 𝜎 everywhere except possibly at 𝑣:
Ω(𝜎, 𝑣) = {𝜏 ∈ Ω ∶ 𝜏(𝑤) = 𝜎(𝑤) for all 𝑤 ∈ 𝑉, 𝑤 ≠ 𝑣} (C25)
The (single-site) Glauber dynamics for 𝜋 is a reversible Markov chain with state (configuration) space Ω, stationary distribution 𝜋, and transition probabilities defined by the distribution 𝜋 conditioned on the set Ω(𝜎, 𝑣) as follows:
𝜋𝜎,𝑣(𝜏) = 𝜋(𝜏 | Ω(𝜎, 𝑣)) = 𝜋(𝜏)/𝜋(Ω(𝜎, 𝑣)), if 𝜏 ∈ Ω(𝜎, 𝑣)
𝜋𝜎,𝑣(𝜏) = 0, if 𝜏 ∉ Ω(𝜎, 𝑣) (C26)
In words, the Glauber chain moves from a configuration 𝜎 ≡ 𝑋𝑡 to a configuration 𝜏 ≡ 𝑋𝑡+1 as
follows:
• a vertex 𝑣 is chosen uniformly at random from 𝑉;
• a new configuration 𝜏 ∈ Ω is chosen according to the probability measure 𝜋 conditioned on the set of configurations that agree with 𝜎 everywhere except possibly at 𝑣.
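A minimal sketch of this update rule, on an invented two-vertex example with 𝑆 = {0, 1} and arbitrary weights ℎ: the code computes the single-site Glauber transition probabilities from (C26) and confirms reversibility with respect to 𝜋:

```python
from itertools import product

# Toy setting: V = {0, 1}, S = {0, 1}, Omega = S^V (all four configurations).
# Arbitrary unnormalized weights h define pi(sigma) = h(sigma) / Z.
h = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}
omega = list(product([0, 1], repeat=2))
Z = sum(h.values())
pi = {s: h[s] / Z for s in omega}

def glauber_P(x, y):
    # Transition probability of single-site Glauber dynamics:
    # pick v uniformly (prob. 1/2 each), then resample x(v) from pi
    # conditioned on Omega(x, v), as in (C26).
    if sum(x[v] != y[v] for v in range(2)) > 1:
        return 0.0                       # more than one site would change
    p = 0.0
    for v in range(2):
        if all(x[w] == y[w] for w in range(2) if w != v):
            block = [t for t in omega
                     if all(t[w] == x[w] for w in range(2) if w != v)]
            p += 0.5 * pi[y] / sum(pi[t] for t in block)
    return p

# The Glauber chain is reversible with respect to pi.
balanced = all(abs(pi[x] * glauber_P(x, y) - pi[y] * glauber_P(y, x)) < 1e-12
               for x in omega for y in omega)
```

Reversibility is immediate from (C26): for 𝜏 and 𝜎 differing only at 𝑣, both 𝜋(𝜎)𝜋𝜎,𝑣(𝜏) and 𝜋(𝜏)𝜋𝜏,𝑣(𝜎) equal 𝜋(𝜎)𝜋(𝜏)/𝜋(Ω(𝜎, 𝑣)).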
Comparing Glauber dynamics and Metropolis chains:
Suppose that 𝜋 is a probability distribution on the state space 𝑆𝑉, where 𝑆 is a finite set and 𝑉 is
the vertex set of a graph. On one hand, we can always define the Glauber chain as just
described. On the other hand, suppose that we have a chain which picks a vertex 𝑣 at random
and has some mechanism for updating its configuration 𝜎 at 𝑣. This chain may not have
stationary distribution 𝜋, but it can be modified by the Metropolis rule to obtain a Metropolis
chain with stationary distribution 𝜋. The Metropolis chain obtained in this way can be very
similar to the Glauber chain, but may not coincide exactly.