Fundamentals of Learning Algorithms in Boltzmann
Machines
by Mihaela G. Erbiceanu
M. Eng., "Gheorghe Asachi" Technical University, 1991
Project Submitted in Partial Fulfillment of the
Requirements for the Degree of
Master of Computing Science
in the
School of Computing Science
Faculty of Applied Sciences
© Mihaela G. Erbiceanu 2016
SIMON FRASER UNIVERSITY
Fall 2016
Approval
Name: Mihaela Erbiceanu
Degree: Master of Science (Computing Science)
Title: Fundamentals of Learning Algorithms in
Boltzmann Machines
Examining Committee:
Chair: Binay Bhattacharya, Professor
Petra Berenbrink, Senior Supervisor, Professor
Andrei Bulatov, Supervisor, Professor
Leonid Chindelevitch, External Examiner, Assistant Professor
Date Defended/Approved: September 7, 2016
Abstract
Boltzmann learning underlies an artificial neural network model known as the Boltzmann
machine, which extends and improves upon the Hopfield network model. The Boltzmann
machine model uses stochastic binary units and allows for hidden units that represent
latent variables. When noise is reduced via simulated annealing and uphill steps are
allowed via the Metropolis algorithm, the training algorithm increases the chances that,
at thermal equilibrium, the network settles on the best distribution of parameters. The
existence of an equilibrium distribution for an asynchronous Boltzmann machine is
analyzed with respect to temperature. Two families of learning algorithms,
which correspond to two different approaches to compute the statistics required for
learning, are presented. The learning algorithms based only on stochastic
approximations are traditionally slow. When variational approximations of the free
energy are used, like the mean field approximation or the Bethe approximation, the
performance of learning improves considerably. The principal contribution of the present
study is to provide, from a rigorous mathematical perspective, a unified framework for
these two families of learning algorithms in asynchronous Boltzmann machines.
Keywords: Boltzmann–Gibbs distribution, Gibbs free energy, asynchronous Boltzmann
machine, thermal equilibrium, data–dependent statistics, data–independent statistics,
stochastic approximation, variational method, mean field approximation, Bethe
approximation.
Dedication
This thesis is dedicated to my mother for her support, sacrifice, and constant love.
Acknowledgements
First and foremost, I would like to thank my supervisor Petra Berenbrink not only for
giving me the opportunity to work on this thesis under her supervision, but also for her
valuable feedback. I would also like to thank Andrei Bulatov, my second supervisor, and
my committee members, Leonid Chindelevitch and Binay Bhattacharya, for their support,
encouragement, and patience.
Table of Contents
Approval
Abstract
Dedication
Acknowledgements
Table of Contents
List of Tables
List of Figures
List of Acronyms

Chapter 1. Introduction
1.1 Motivation
1.2 Overview and roadmap
1.3 Related work
1.4 Connection to other disciplines

Chapter 2. Foundations
2.1 Boltzmann–Gibbs distribution
2.2 Markov random fields and Gibbs measures
2.3 Gibbs free energy
2.4 Connectionist networks
2.5 Hopfield networks
2.5.1 Hopfield network models
2.5.2 Convergence of the Hopfield network

Chapter 3. Variational methods for Markov networks
3.1 Pairwise Markov networks as exponential families
3.1.1 Basics of exponential families
3.1.2 Canonical representation of pairwise Markov networks
3.1.3 Mean parameterization of pairwise Markov networks
3.1.4 The role of transformations between parameterizations
3.2 The energy functional
3.3 Gibbs free energy revisited
3.3.1 Hamiltonian and Plefka expansion
3.3.2 The Gibbs free energy as a variational energy
3.4 Mean field approximation
3.4.1 The mean field energy functional
3.4.2 Maximizing the energy functional: fixed–point characterization
3.4.3 Maximizing the energy functional: the naïve mean field algorithm
3.5 Bethe approximation
3.5.1 The Bethe free energy
3.5.2 The Bethe–Gibbs free energy
3.5.3 The relationship between belief propagation fixed–points and Bethe free energy
3.5.4 Belief optimization

Chapter 4. Introduction to Boltzmann Machines
4.1 Definitions
4.2 Modelling the underlying structure of an environment
4.3 Representation of a Boltzmann Machine as an energy–based model
4.4 How a Boltzmann Machine models data
4.5 General dynamics of Boltzmann Machines
4.6 The biological interpretation of the model

Chapter 5. The Mathematical Theory of Learning Algorithms for Boltzmann Machines
5.1 Problem description
5.2 Phases of a learning algorithm in a Boltzmann Machine
5.3 Learning algorithms based on approximate maximum likelihood
5.3.1 Learning by minimizing the KL–divergence of Gibbs measures
5.3.2 Collecting the statistics required for learning
5.4 The equilibrium distribution of a Boltzmann machine
5.5 Learning algorithms based on variational approaches
5.5.1 Using variational free energies to compute the statistics required for learning
5.5.2 Learning by naïve mean field approximation
5.6 Unlearning and relearning in Boltzmann Machines

Chapter 6. Conclusions
6.1 Summary of what has been done
6.2 Future directions

References
Appendix A: Mathematical notations
Appendix B: Probability theory and statistics
Appendix C: Finite Markov chains
List of Tables
Table 1  Distributions of interest in asynchronous Boltzmann machine learning
Table 2  Transition probability matrices for asynchronous symmetric Boltzmann machines
List of Figures
Figure 1  a) A fully–connected Boltzmann machine with three visible nodes and four hidden nodes; b) A layered Boltzmann machine with one visible layer and two hidden layers.
List of Acronyms
SFU Simon Fraser University
LAC Library and Archives Canada
BO belief optimization
BP belief propagation
CD contrastive divergence
KL–divergence Kullback–Leibler divergence
LBP loopy belief propagation
MCMC Markov chain Monte Carlo
ML maximum likelihood
Chapter 1. Introduction
1.1 Motivation
Boltzmann machines are a particular class of artificial neural networks that have been
extensively studied because of the interesting properties of the associated learning algorithms.
In this context, learning for Boltzmann machines means “acquiring a particular behavior by
observing it” [1]. The machine is named after Ludwig Boltzmann, who discovered the
fundamental law governing the equilibrium state of a gas. The distribution of the molecules of an
ideal gas among the various energy states is called the Boltzmann–Gibbs distribution. Geoffrey
Hinton and Terrence Sejnowski adopted this distribution as the stochastic update rule of a new
network, which they named the “Boltzmann Machine”.
From a purely theoretical point of view, a Boltzmann machine is a generalization of a Hopfield
network in which the units update their states according to a stochastic decision rule and which
allows the presence of hidden units.
From a graphical model point of view, a Boltzmann machine is a binary pairwise Markov random
field in which every node is endowed with a non–linear activation function similar to an
activation model for neurons. As a graphical model, a Boltzmann machine has both a structural
component, encoded by the pattern of edges in the underlying graph, and a parametric
component, encoded by the potentials associated with sets of edges in the underlying graph.
The particularities of the Boltzmann machine model that make it suitable for pattern recognition
tasks are due more to its parameterization than to its conditional independence structure.
Most of the interest in Boltzmann machines has come from the neural network field, where a
particular type of Boltzmann machine – the layered Boltzmann machine – is considered a deep
neural network. The learning algorithms for Boltzmann machines have mostly been created to
train this kind of neural network.
Boltzmann machines are theoretically intriguing because of the locality and Hebbian1 nature of
their training algorithm, and because of their parallelism and the resemblance of their dynamics
to simple physical processes [2]. There is, however, one drawback in the use of learning
process in Boltzmann machines: the process is computationally very expensive. The
1 Hebbian theory is a theory in neuroscience that proposes an explanation for the adaptation of neurons in the
brain during the learning process. See Section 2.4 for more information.
computational complexity of the exact algorithm is exponential in the number of neurons
because it involves the computation of the partition function of the Boltzmann–Gibbs
distribution, which requires a sum over all states of the network, of which there are
exponentially many. If a
learning algorithm uses an approximate inference method to compute the partition function of
the Boltzmann–Gibbs distribution, then the learning process can be made efficient enough to be
useful for practical problems.
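As an illustration of this cost (our own sketch, not code from the thesis; the quadratic energy function, the random weights, and the temperature are arbitrary choices), the exact partition function can only be computed by enumerating all 2^n states, which is infeasible beyond small n:

```python
import itertools
import math

import numpy as np

def partition_function(W, b, T=1.0):
    """Exact partition function Z of a binary network with energy
    E(s) = -0.5 * s^T W s - b^T s, summed over all 2^n states."""
    n = len(b)
    Z = 0.0
    for bits in itertools.product([0, 1], repeat=n):  # 2^n terms
        s = np.array(bits, dtype=float)
        E = -0.5 * s @ W @ s - b @ s
        Z += math.exp(-E / T)
    return Z

rng = np.random.default_rng(0)
n = 10                      # 2^10 = 1024 states; the count doubles with each unit
W = rng.normal(size=(n, n))
W = (W + W.T) / 2           # symmetric weights
np.fill_diagonal(W, 0.0)    # no self-connections
b = rng.normal(size=n)
Z = partition_function(W, b)
```

Approximate inference methods exist precisely to avoid this enumeration.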
Based on the approach employed to compute the statistics required for learning, the Boltzmann
machine learning algorithms are divided into two groups or families. One family of algorithms
uses only stochastic approximations; the other family uses both variational approximations and
stochastic approximations to compute the statistics.
The goal of this paper is to present, from a rigorous mathematical perspective, a unified
framework for the two families of learning algorithms in asynchronous Boltzmann machines.
Precursors of our
approach are: Sussmann, who elaborated in 1988-1989 a rigorous mathematical analysis of the
original asynchronous Boltzmann machine learning algorithm [1,3]; Welling and Teh who, in
2002, reviewed the inference algorithms in Boltzmann machines with emphasis on the
advanced mean field methods and the loopy belief propagation [4]; Salakhutdinov, who, in
2008, in the context of arbitrary Markov random fields, reviewed some Monte Carlo based
methods for estimating partition functions, as well as the variational framework for obtaining
deterministic approximations or upper bounds for the log partition function [5]. During the 1990s
and early 2000s the subject of learning algorithms for Boltzmann machines was somewhat
neglected by the research community. However, a promising new method to train deep neural
networks proposed in 2006 by Hinton et al. [6] has caused a resurgence of interest in this
subject. Despite its ups and downs as a research subject, a considerable number of papers on
the learning algorithms for Boltzmann machines have been published. Some of these papers
proposed refinements for existing algorithms; others proposed new algorithms or even
completely new approaches.
However, as far as we know, there has not yet been a documented effort to gather in one place,
with a consistent set of definitions and notations, and built on a unified framework of concepts,
proofs, and interpretation of results, the mathematical foundations of the main families of
Boltzmann machine learning algorithms. By approaching the topic of this paper from a computer
science theoretical perspective, but without omitting the intuition behind, we intend to fill this
void and to help other interested parties to obtain a good understanding of the intricacies and
limitations of Boltzmann machines and their learning algorithms.
1.2 Overview and roadmap
This paper consists of six chapters and three appendices and is organized as follows.
In Chapter 1, we present an introduction to the topic of this paper and our goals in covering it.
We also include a brief history of Boltzmann machine learning and what connections it has with
other disciplines.
In Chapter 2, we introduce the Boltzmann–Gibbs distribution as the main source of inspiration
for Boltzmann machine. We also review the main concepts and results from Markov random
field theory that are subsequently used in this paper. Then we introduce the Gibbs free energy
and its intrinsic relationship with the Boltzmann–Gibbs distribution. Furthermore we introduce
the precursors of Boltzmann machine: the connectionist networks and the Hopfield networks.
Because the asynchronous Hopfield network represents the limiting case of the asynchronous
Boltzmann machine as the “temperature” parameter 𝐓 → 0, we cover the dynamics and
convergence of Hopfield networks as well as their learning algorithms.
In Chapter 3, we start by introducing the basics of variational methodology. We also explain
how, in certain conditions, the Gibbs free energy can be viewed as a variational energy. Then
we review the main concepts and results regarding two classes of variational methods that are
used by Boltzmann machine learning algorithms to approximate the free energy of a Markov
random field: the mean field approximation and the Bethe approximation.
In Chapter 4, we introduce the asynchronous Boltzmann machine. We provide a detailed
description of its functionality, from formal definitions, the modelling of the underlying
environment, the energy–based representation, and the data representation, to its general
dynamics, without omitting the intuition behind its concepts and algorithms. We end this
chapter with the
biological interpretation of the model as it was given by Hinton.
In Chapter 5 we start by formally defining the process of learning and justifying why the
Boltzmann machine learning algorithms have two phases. Then we present two categories of
learning algorithms for asynchronous Boltzmann machines: those based on Monte Carlo
methods and those based on variational approximations of the free energy, specifically the
mean field approximation and the Bethe approximation. For each category we present the
derivation and analysis of the original algorithm. Other important algorithms from each category
are introduced by presenting their differences and/or improvements compared to the original
algorithm. Finally, we cover the processes of unlearning and relearning in asynchronous
Boltzmann machines.
Finally, Chapter 6 contains a very brief summary and outlook.
In Appendix A we introduce the mathematical notations used throughout this paper.
In Appendix B we review the main concepts from probability theory and statistics that are
necessary for a good understanding of this paper.
In Appendix C we review the main concepts regarding finite Markov chains that are necessary
for a good understanding of this paper.
1.3 Related work
In 1982, Hopfield showed that a network of symmetrically–coupled binary threshold units has a
simple quadratic energy function that governs its dynamic behavior [7]. When the nodes are
deterministically updated one at a time, the network settles to an energy minimum and Hopfield
suggested using these minima to store content–addressable memories.
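This behavior can be sketched as follows (a minimal illustration under our own conventions, with ±1 units and random symmetric weights, not Hopfield's original notation): asynchronous threshold updates never increase the quadratic energy, so the network settles into a minimum.

```python
import numpy as np

def energy(W, s):
    """Hopfield energy E(s) = -0.5 * s^T W s for states s in {-1, +1}^n."""
    return -0.5 * s @ W @ s

def hopfield_settle(W, s, sweeps=20):
    """Asynchronous deterministic threshold updates; with symmetric W and
    zero diagonal, each update never increases E, so the dynamics converge
    to a local energy minimum (a stored pattern, in Hopfield's usage)."""
    s = s.copy()
    for _ in range(sweeps):
        for i in range(len(s)):
            s[i] = 1 if W[i] @ s >= 0 else -1
    return s

rng = np.random.default_rng(1)
n = 8
W = rng.normal(size=(n, n))
W = (W + W.T) / 2           # symmetry is required for convergence
np.fill_diagonal(W, 0.0)    # no self-connections
s0 = rng.choice([-1, 1], size=n)
s_final = hopfield_settle(W, s0)   # energy(W, s_final) <= energy(W, s0)
```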
Hinton and Sejnowski realized that the energy function can be viewed as an indirect way of
defining a probability distribution over all the binary configurations of the network and that, if the
right stochastic updating rule is used, the dynamics eventually produces samples from the
Boltzmann–Gibbs distribution [8-9]. This discovery led them to invent in 1983 the “Boltzmann
Machine” [8-9]. Furthermore, if a Boltzmann machine is divided into a set of visible nodes whose
states are externally forced or “clamped” at the data and a disjoint set of hidden nodes, the
stochastic updating produces samples from the posterior distribution over configurations of the
hidden nodes given the current data [8,10-11]. Ackley, Hinton, and Sejnowski proposed a
learning algorithm that performs maximum likelihood learning of the weights that define the
hidden nodes and uses sequential Gibbs sampling to approach the posterior distribution [10].
This new algorithm is known in literature as the original learning algorithm/procedure for
(asynchronous) Boltzmann machines. Inspired by Kirkpatrick, Gelatt, and Vecchi [12], Hinton
and Sejnowski used simulated annealing from a high initial “temperature” to a final
“temperature” of 1 to speed up convergence to the stationary distribution. They demonstrated
that this was a feasible way of learning the weights in small networks. However, the original
learning procedure was still much too slow to be practical for learning large, multilayer
Boltzmann machines [13]. The simplicity and locality of original learning procedure for
Boltzmann machines led to much interest, but the settling time required getting samples from
the right distribution and the high noise in the estimates made learning slow and unreliable [5].
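The two ingredients just described can be sketched as follows, under our own conventions (0/1 units, unit temperature, and sample arrays we assume have already been collected): the stochastic update rule whose stationary distribution is the Boltzmann–Gibbs distribution, and the maximum likelihood weight update contrasting clamped ("positive") and free-running ("negative") statistics.

```python
import numpy as np

def gibbs_sweep(W, b, s, T=1.0, rng=None):
    """One asynchronous sweep of the stochastic update rule: unit i turns
    on with probability sigmoid(gap_i / T), where gap_i is the energy
    difference between s_i = 0 and s_i = 1. Iterated, this Markov chain
    produces samples from the Boltzmann-Gibbs distribution."""
    rng = rng or np.random.default_rng()
    s = s.copy()
    for i in range(len(s)):
        gap = W[i] @ s - W[i, i] * s[i] + b[i]   # exclude the self term
        p_on = 1.0 / (1.0 + np.exp(-gap / T))
        s[i] = 1 if rng.random() < p_on else 0
    return s

def boltzmann_weight_update(clamped, free, lr=0.01):
    """Maximum likelihood gradient step lr * (<s s^T>_clamped - <s s^T>_free).
    `clamped` holds states sampled with the visible units fixed to data;
    `free` holds states of the freely running machine."""
    pos = clamped.T @ clamped / len(clamped)
    neg = free.T @ free / len(free)
    dW = lr * (pos - neg)
    np.fill_diagonal(dW, 0.0)   # no self-connections
    return dW
```

The update is local and Hebbian: it depends only on the co-activation statistics of each pair of units, which is the property the text highlights.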
During the following two decades the researchers tried to improve the learning speed of
Boltzmann machine by using various approaches.
In 1992 Neal improved the original learning procedure by using persistent Markov chains [14].
Neal did not explicitly use simulated annealing. However, the persistent Markov chains
implement it implicitly, provided that the weights have small initial values. Neal showed that
persistent Markov chains work quite well for training a Boltzmann machine on a fairly small data
set [14]. For large data sets, however, it is much more efficient to update the weights after each
small mini–batch of training examples [13].
The first efficient learning procedure for large–scale asynchronous Boltzmann machines used
an extremely limited architecture named Restricted Boltzmann Machine. This architecture
together with its learning procedure was first proposed by Smolensky in 1986 [15] and it was
designed to make inference tractable [13].
In 1987, in an attempt to reduce the time required by the sampling process, Peterson and
Anderson [16-17] replaced Gibbs sampling with a simple mean field method that approximates
a stationary distribution by replacing stochastic binary values with deterministic real–valued
probabilities. More sophisticated deterministic approximation methods were investigated by
Galland in 1990 [18-19], Kappen and Rodriguez in 1998 [20-21], and Tanaka in 1998 [22-23]
but none of these approximations worked very well for learning for reasons that were not well
understood at the time [13]. Similar deterministic approximation methods were studied
intensively in the 1990s in the context of learning directed graphical models [24-27]. In 2010
Salakhutdinov interpreted these results and provided a possible explanation of the limited
success of using deterministic approximation methods for learning in asynchronous Boltzmann
machines [13].
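The simple mean field idea can be sketched as a fixed–point iteration (our own illustrative version; the damping factor and iteration count are arbitrary choices): each stochastic binary unit is replaced by a deterministic real–valued probability m_i, updated as sigmoid(Σ_j w_ij m_j + b_i).

```python
import numpy as np

def naive_mean_field(W, b, iters=200, damping=0.5):
    """Iterate the mean field equations m_i <- sigmoid(sum_j W_ij m_j + b_i),
    with damping for stability, starting from the uninformative point 0.5.
    The fixed point approximates the marginal probabilities of the units."""
    m = np.full(len(b), 0.5)
    for _ in range(iters):
        m_new = 1.0 / (1.0 + np.exp(-(W @ m + b)))
        m = damping * m + (1.0 - damping) * m_new
    return m
```

This trades the noisy, slow sampling of Gibbs updates for a cheap deterministic computation, at the cost of the approximation errors discussed in the text.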
Because variational methods typically scale well to large applications, during the 2000s
extensive research was done on obtaining deterministic approximations [28-29] or deterministic
upper bounds [30-32] on the log partition function of an arbitrary discrete Markov random field.
The Bethe approximation first made its appearance in the field of approximate inference and
error correcting decoding in [33-34] under the names TAP approximation and cavity method.
The relation between belief propagation and the Bethe approximation was further clarified in
[28-29,35-36] where it was shown that belief propagation, even when applied to loopy graphs,
has fixed–points at the stationary points of the Bethe free energy. In 2003 Welling and Teh
proposed a new algorithm, named belief optimization, to minimize the Bethe free energy
directly, as an alternative to the fixed–point equations of belief propagation [4].
In 2002 Hinton proposed a new learning algorithm for the asynchronous Boltzmann machine:
contrastive divergence learning. In his view this new algorithm works as a better approximation
to the maximum likelihood learning used in the original learning algorithm [37]. The most
attractive aspect of this new algorithm is that it allows Restricted Boltzmann Machines with
millions of parameters to achieve state–of–the–art performance on a large collaborative filtering
task [38].
The newest variant of asynchronous Boltzmann machine, called Deep Boltzmann Machine, is a
deep multilayer Boltzmann machine that was proposed in 2009 by Salakhutdinov and Hinton
[39]. Its learning algorithm was designed to incorporate both bottom–up and top–down
feedback, allowing a better propagation of uncertainty about ambiguous inputs [39].
The dynamics of the synchronous Boltzmann machine were first studied in the 1970s by Little
and Shaw [40-41]. A comprehensive study of synchronous Boltzmann machines and their
learning algorithms was done by Viveros in her PhD thesis in 2001 [42].
1.4 Connection to other disciplines
We previously mentioned that the learning algorithms for Boltzmann machines have been
intensively used for training deep neural networks. A deep neural network is an artificial neural
network with multiple hidden layers of units between the input and output layers which is
capable of learning the underlying constraints that characterize a domain simply by being shown
examples from the domain [10-11,43]. Computational deep learning is closely related to a class
of theories of brain development named neocortical development that was proposed by
cognitive neuroscientists in the early 1990s. Neocortical development is a major focus in
neurobiology, not only from a purely developmental standpoint, but also because understanding
how neocortex develops provides important insight into mature neocortical organization and
function. This shows how Boltzmann machines and their learning algorithms are connected with
neurobiology and, generally, with the field of cognitive sciences.
When applied to Boltzmann machines and their learning algorithms, the emphasis on
mathematical technique and rigor employed by theoretical computing science becomes an
invaluable research asset.
Chapter 2. Foundations
2.1 Boltzmann–Gibbs distribution
In statistical mechanics and mathematics, the Boltzmann–Gibbs distribution (also called
the Boltzmann distribution or the Gibbs distribution) is a certain distribution function or
probability measure for the distribution of the states of a system. The Boltzmann distribution is
named after Ludwig Boltzmann who first formulated it in 1868 during his studies of statistical
mechanics of gases in thermal equilibrium [44]. The distribution was later investigated
extensively, in its modern generic form, by Josiah Willard Gibbs (1902) [45]. It underpins the
concept of the canonical ensemble by providing the underlying distribution. In more general
mathematical settings, the Boltzmann–Gibbs distribution is also known as the Gibbs measure.
In statistical mechanics the Boltzmann–Gibbs distribution is an intrinsic characteristic of isolated
(or nearly–isolated) systems of fixed composition that are in thermal equilibrium (i.e., equilibrium
with respect to energy exchange). The most general case of such a system is the canonical
ensemble. Before we define the concept of canonical ensemble, we need to define a concept
that is employed by its definition, namely the heat bath.
Definition 2.1:
In thermodynamics, a heat bath is a system 𝐵 which is in contact with a many–particle system 𝐴
such that:
𝐴 and 𝐵 can exchange energy, but not particles;
𝐵 is at equilibrium and has temperature 𝐓;
𝐵 is much larger than 𝐴, so that its contact with 𝐴 does not affect its equilibrium state.
Definition 2.2:
In statistical mechanics, a canonical ensemble is the statistical ensemble that represents the
possible states of a mechanical system in thermal equilibrium with a heat bath at some fixed
temperature.
From previous definitions we can infer that the states of the system 𝐴, which plays the role of a
canonical ensemble, will differ in their total energy as a consequence of the energy exchange
with the system 𝐵, which plays the role of a heat bath. The principal thermodynamic variable of
the canonical ensemble, determining the probability distribution of states, is the absolute
temperature 𝐓. In general, the canonical ensemble 𝑋 assigns to each distinct microstate 𝑥 a
probability 𝐏(𝑋 = 𝑥), i.e., the probability of the random variable 𝑋 taking the value 𝑥, given by
the following exponential:

$$\mathbf{P}(X = x) = \exp\!\left(\frac{F - E(x)}{\mathbf{k} \cdot \mathbf{T}}\right) \qquad (2.1)$$

where: 𝐸(𝑥) is the energy of the microstate 𝑥; 𝐹 is the Helmholtz free energy; 𝐓 is the absolute
temperature of the system; and 𝐤 is Boltzmann's constant. 𝐸 is a function that maps the
space of states to ℝ and is interpreted as the energy of state 𝑥. For a given ensemble the
Helmholtz free energy is a constant.
If the canonical ensemble has 𝑚 states accessible to the system of interest, indexed by
{1, 2, …, 𝑚}, then equation (2.1) can be rewritten as:

$$\mathbf{P}_i = \frac{\exp\!\left(\dfrac{-E_i}{\mathbf{k} \cdot \mathbf{T}}\right)}{\sum_{j=1}^{m} \exp\!\left(\dfrac{-E_j}{\mathbf{k} \cdot \mathbf{T}}\right)} \qquad (2.2)$$

where: 𝐏ᵢ is the probability of state 𝑖; 𝐸ᵢ is the energy of state 𝑖; 𝐤 is Boltzmann's
constant; 𝐓 is the absolute temperature of the system; and 𝑚 is the number of states of the
canonical ensemble.
An alternative but equivalent formulation for the canonical ensemble uses the canonical
partition function (or normalization constant) 𝑍 rather than the free energy:

$$Z(\mathbf{T}) = \exp\!\left(\frac{-F}{\mathbf{k} \cdot \mathbf{T}}\right) \qquad (2.3)$$

For a canonical ensemble with 𝑚 states, if we know the energies of the states accessible to the
system of interest, we can calculate the canonical partition function 𝑍 as follows:

$$Z(\mathbf{T}) = \sum_{j=1}^{m} \exp\!\left(\frac{-E_j}{\mathbf{k} \cdot \mathbf{T}}\right) \qquad (2.4)$$
By introducing the canonical partition function defined by equations (2.3) and (2.4) into
equations (2.1) and (2.2), respectively, we obtain:
$$\mathbf{P}(X = x) = \frac{1}{Z(\mathbf{T})} \cdot \exp\!\left(\frac{-E(x)}{\mathbf{k} \cdot \mathbf{T}}\right) \qquad (2.5)$$

and, respectively:

$$\mathbf{P}_i = \frac{1}{Z(\mathbf{T})} \cdot \exp\!\left(\frac{-E_i}{\mathbf{k} \cdot \mathbf{T}}\right) \qquad (2.6)$$
In a system with local (finite–range) interactions, the canonical ensemble’s distribution
maximizes the entropy density for a given expected energy density, or equivalently, minimizes
the free energy density. The distribution shows that states with lower energy will always have a
higher probability of being occupied than the states with higher energy.
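Equation (2.6) is straightforward to evaluate numerically. The following sketch (our own illustration with arbitrary energies, working in units where k = 1) also uses the standard trick of subtracting the minimum energy before exponentiating, which leaves the distribution unchanged but avoids overflow:

```python
import numpy as np

def boltzmann_probs(energies, T=1.0, k=1.0):
    """Boltzmann-Gibbs probabilities P_i = exp(-E_i/(kT)) / Z, as in eq. (2.6).

    Shifting all energies by E.min() multiplies numerator and denominator
    by the same constant, so the probabilities are unchanged."""
    E = np.asarray(energies, dtype=float)
    w = np.exp(-(E - E.min()) / (k * T))
    return w / w.sum()

p = boltzmann_probs([0.0, 1.0, 2.0], T=1.0)
# Lower-energy states receive higher probability; probabilities sum to 1.
# At very high T the distribution flattens toward uniform.
```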
The Boltzmann–Gibbs distribution is often used to describe the distribution of particles, such as
atoms in binary alloys or molecules in a gas, over energy states accessible to them. If we have
a system consisting of a finite number of particles, the probability of a particle being in state 𝑖 is
practically the probability that, if we pick a random particle from that system and check what
state it is in, we will find it in state 𝑖. This probability is equal to the number of particles in
state 𝑖 divided by the total number of particles in the system, i.e., the fraction of particles
that occupy state 𝑖. Formula (2.7) gives the fraction of particles in state 𝑖 as a function of the
state’s energy:
𝐏_i = 𝑛_i / 𝑛 = exp(−𝐸_i / (𝐤 ∙ 𝐓)) / ∑_{j=1}^{m} exp(−𝐸_j / (𝐤 ∙ 𝐓))   (2.7)
where 𝑛 is the total number of particles in the system and 𝑛𝑖 is the number of particles in state 𝑖.
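Equations (2.2), (2.4), and (2.7) can be checked numerically. The sketch below is illustrative only: the state energies and the value of 𝐤 ∙ 𝐓 are arbitrary choices, not quantities from the text. It computes the partition function and the Boltzmann–Gibbs probabilities for a small set of states and confirms that lower-energy states receive higher probability:

```python
import math

def boltzmann_probabilities(energies, kT=1.0):
    # Boltzmann factors exp(-E_i / (k*T)) for each state, as in equation (2.2).
    weights = [math.exp(-E / kT) for E in energies]
    # Canonical partition function Z(T), equation (2.4).
    Z = sum(weights)
    return [w / Z for w in weights]

# Three states with arbitrary illustrative energies.
probs = boltzmann_probabilities([0.0, 1.0, 2.0], kT=1.0)
# The probabilities sum to 1, and the probability decreases as the energy grows.
```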
In infinite systems, the total energy is no longer a finite number and cannot be used in the
traditional construction of the probability distribution of a canonical ensemble. The traditional
approach, followed by statistical physicists, of studying the thermodynamic limit of the energy
function as the size of a finite system approaches infinity, had not been very useful. Looking for
an alternative approach, the researchers discovered that, when the energy function of an infinite
system can be written as a sum of terms that each involves only variables from a finite
subsystem, the notion of Gibbs measure provides a framework to directly study such systems
(instead of taking the limit of finite systems).
Definition 2.3:
In physics, a probability measure is a Gibbs measure if the conditional probabilities it induces on
each finite subsystem satisfy the following consistency condition: if all degrees of freedom
outside the finite subsystem are frozen, the canonical ensemble for the subsystem subject to
these boundary conditions matches the probabilities in the Gibbs measure conditional on the
frozen degrees of freedom.
2.2 Markov random fields and Gibbs measures
We have seen that the Gibbs measure has a natural relationship with physics: it was born
to describe the behavior of a system whose interactions between particles can be described by a
form of energy. Moreover, the Gibbs measure can be applied successfully to systems outside its
domain of origin, sometimes even without introducing notions specific to physics into the
probabilistic definitions of those systems. Examples of such systems are: Hopfield networks,
Markov random fields, and Markov logic networks. All these systems exploit the following
general principle derived from Boltzmann’s and Gibbs’s work: a network consisting of a large
number of units, with each unit interacting with neighbouring units, will approach at equilibrium a
canonical distribution given by the equations (2.5) and (2.6). This expanded applicability of
the Gibbs measure has been made possible by a fundamental mathematical result known as the
Hammersley–Clifford theorem or the fundamental theorem of random fields. In this section we
present how computer scientists adapted the physicists’ definition of the Gibbs measure for
graphical models. We also present the Hammersley–Clifford theorem and its consequences
with respect to the special class of Markov random fields that is the Boltzmann machine.
Dobrushin showed in [46] that there are two different ways to define configurations
of points on a structure that mathematically resemble a lattice; he called these configurations
“random fields”. One way is based on the formulation of statistical mechanics of Gibbs and is
generally accepted as the simplest useful mathematical model of a discrete gas (also called
lattice gas) [46]. The other way, introduced by Dobrushin himself, is that of Markov random
fields. Dobrushin’s formulation has no apparent connection with physics, being instead based
on the natural way of extending the notion of a Markov process [46].
A Markov process is a stochastic model that has the Markov property, i.e., the conditional
probability distribution of future states of the process (conditional on both past and present
states) depends only upon the present state, not on the sequence of events that preceded it. A
special case of Markov process is the Markov chain.
A Markov chain is a discrete–time Markov process with a countable or finite state space.
A Markov random field, also called Markov network, extends the Markov chain to two or more
dimensions or to random variables defined for an interconnected network of items; therefore, it
may be considered a generalization of a Markov chain in multiple dimensions.
In this paper we use the term Markov random field to designate a random field that
models an interconnected network of items. In a Markov chain, each state depends only on the
previous state in time, whereas in a Markov random field each state depends only on its
neighbors in any of multiple directions. Hence, a Markov random field may be visualized as a
field or graph of random variables, where the distribution of each random variable depends on
the neighboring variables which it is connected with. Thus, in a Markov random field the Markov
property becomes a local property rather than a temporal property.
Any graphical model can be seen as a “marriage” between probability theory and graph theory.
A consequence of this relationship is the existence of two equivalent characterizations of the
family of probability distributions associated with an undirected graph: one algebraic that
involves the concept of factorization and one graph–theory specific that involves the concept of
reachability [27,47]. For Markov random fields, the concepts of reachability and factorization
are identified with conditional independence and factor graph representation, respectively. The
Hammersley–Clifford theorem shows that these two ways of defining a random field are
equivalent, which further translates into equivalence between Markov random fields and Gibbs
measures. Before we present the Hammersley–Clifford theorem, we formally introduce the
concepts it operates with: Markov random field and Gibbs measure. We use the
notations for univariate and multivariate random variables specified in Appendix A.
Definition 2.4:
Given an undirected graph 𝐺 = (𝑉, 𝐸), a set of random variables X = (𝑋𝑣)𝑣∈𝑉 indexed by 𝑉 form
a Markov random field with respect to 𝐺 if they satisfy the Markov property expressed in either
one of the following forms:
Pairwise Markov Property: Any two non–adjacent variables are conditionally independent
given all other variables:
𝑋𝑢 ⊥ 𝑋𝑣 | X𝑉−{𝑢,𝑣} if {𝑢, 𝑣} ∉ 𝐸 (2.8)
Local Markov Property: A variable is conditionally independent of all other variables given its
neighbors:
𝑋𝑣 ⊥ X𝑉−cl(𝑣) | Xne(𝑣) (2.9)
where ne(𝑣), also called the Markov blanket, is the set of neighbors of 𝑣 and cl(𝑣) = ne(𝑣) ∪
{𝑣} is the closed neighborhood of 𝑣.
Global Markov Property: Any two subsets of variables are conditionally independent given a
separating subset:
XA ⊥ XB | XS (2.10)
where every path from a node in 𝐴 to a node in 𝐵 passes through 𝑆.
Generally, these three expressions of the Markov property are not equivalent. The local Markov
property is stronger than the pairwise one but weaker than the global one.
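The pairwise Markov property can be verified directly on a small example. The sketch below uses arbitrary illustrative potentials: it builds the joint distribution of a three-node chain 𝑋₁ – 𝑋₂ – 𝑋₃ by brute force and checks that the non-adjacent variables 𝑋₁ and 𝑋₃ are conditionally independent given 𝑋₂:

```python
import itertools
import math

# Positive pairwise potentials on the chain edges {1,2} and {2,3} (arbitrary values).
phi12 = {(a, b): math.exp(0.8 if a == b else -0.8) for a in (0, 1) for b in (0, 1)}
phi23 = {(b, c): math.exp(0.5 if b == c else -0.5) for b in (0, 1) for c in (0, 1)}

# Joint distribution P(x1, x2, x3) proportional to phi12 * phi23.
joint = {x: phi12[x[0], x[1]] * phi23[x[1], x[2]]
         for x in itertools.product((0, 1), repeat=3)}
Z = sum(joint.values())
joint = {x: p / Z for x, p in joint.items()}

def prob(fixed):
    # Marginal probability that the coordinates in `fixed` take the given values.
    return sum(p for x, p in joint.items()
               if all(x[i] == v for i, v in fixed.items()))

# Check P(x1, x3 | x2) = P(x1 | x2) * P(x3 | x2) for every assignment.
max_dev = 0.0
for x1, x2, x3 in itertools.product((0, 1), repeat=3):
    p2 = prob({1: x2})
    lhs = prob({0: x1, 1: x2, 2: x3}) / p2
    rhs = (prob({0: x1, 1: x2}) / p2) * (prob({1: x2, 2: x3}) / p2)
    max_dev = max(max_dev, abs(lhs - rhs))
```

Because the joint factorizes over the two edges, the conditional independence holds exactly, up to floating-point error.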
Definitions 2.5:
A probability distribution 𝐏(X) = 𝐏(𝑋1, 𝑋2, … , 𝑋𝑛) on an undirected graph 𝐺 = (𝑉, 𝐸) with |𝑉| = 𝑛
is called a Gibbs distribution or Gibbs measure if it can be factorized into potentials defined on
cliques that cover all the nodes and edges of 𝐺.
A potential function or sufficient statistic is a function defined on the set of configurations of a
clique (i.e., a setting of values for all the nodes in the clique) that associates a positive real
number with each configuration. Hence, for every subset of nodes Xc ⊆ 𝑉 that form a clique, we
associate a positive potential 𝜙𝑐 = 𝜙𝑐(Xc).
In this paper we will refer equivalently to the nodes of 𝐺 that form the clique Xc and the random
variables that correspond to those nodes. Before formulating the Gibbs measure let us
introduce the following notations:
CG = {Xc1 , Xc2 , … , Xcd} = {Xcj ∶ 1 ≤ 𝑗 ≤ 𝑑, 𝑑 ≤ 𝑛} represents a set of 𝑑 cliques that cover the
edges and nodes of the underlying graph 𝐺;
ΦG = {𝜙c1 , 𝜙c2 , … , 𝜙cd} = {𝜙cj ∶ 1 ≤ 𝑗 ≤ 𝑑, 𝑑 = |CG|} represents the set of potential functions
or clique potentials that correspond to CG;
There is a one–to–one correspondence between CG and ΦG, i.e., 𝜙cj = 𝜙cj(Xcj). Therefore, it
should be generally understood that, when iterating over CG, we also iterate over ΦG.
The Gibbs measure is precisely the joint probability distribution of all the nodes in the graph
𝐺 = (𝑉, 𝐸) and is obtained by taking the product over the clique potentials:
𝐏(X) = (1 / 𝑍) ∙ ∏_{X_c ∈ C_G} 𝜙_c(X_c)   (2.11)
where:
𝑍 ≡ 𝑍(𝐏) = ∑_X ∏_{X_c ∈ C_G} 𝜙_c(X_c)   (2.12)
𝐴 ≡ 𝐴(𝐏) = log(𝑍(𝐏)) ≡ ln(𝑍(𝐏)) = ln(𝑍)   (2.13)
where 𝑍, called the partition function, is a constant chosen to ensure that the distribution 𝐏 is
normalized. If the distribution 𝐏 belongs to the exponential family, it is more practical to work
with the logarithm, specifically the natural logarithm, of the partition function 𝑍. By definition, the
cumulant function 𝐴 is the natural logarithm of Z.
The set CG is often taken to be the set of all maximal cliques of the graph 𝐺, i.e., the set of
cliques that are not properly contained within any other clique. This condition can be imposed
without loss of generality because any representation based on non–maximal cliques can
always be converted into one based on maximal cliques by redefining the potential function on a
maximal clique to be the product over the potential functions on the subsets of that clique.
However, the factorization of a Markov random field is of particular value when CG consists of
more than the maximal cliques, as is the case for factor graphs.
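Equations (2.11) and (2.12) can be illustrated with a brute-force computation. In the sketch below, the graph, the cliques, and the potential values are arbitrary illustrative choices; it builds the Gibbs measure of a three-node binary model from edge-clique potentials:

```python
import itertools

# Edge cliques of a small graph and their (positive) potential tables.
potentials = {
    (0, 1): {(0, 0): 2.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0},
    (1, 2): {(0, 0): 1.5, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 1.5},
}

def gibbs_measure(potentials, n):
    # P(x) = (1/Z) * product over cliques of phi_c(x_c), equation (2.11).
    unnorm = {}
    for x in itertools.product((0, 1), repeat=n):
        p = 1.0
        for clique, phi in potentials.items():
            p *= phi[tuple(x[i] for i in clique)]
        unnorm[x] = p
    Z = sum(unnorm.values())  # partition function, equation (2.12)
    return {x: p / Z for x, p in unnorm.items()}, Z

P, Z = gibbs_measure(potentials, 3)
```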
Definition 2.6:
Given a factorization of a function:
𝑔 ∶ ℝ^n → ℝ, 𝑔(𝑋_1, 𝑋_2, … , 𝑋_n) = ∏_{j=1}^{m} 𝑓_j(𝑆_j)   (2.14)
where 𝑆_j ⊆ {𝑋_1, 𝑋_2, … , 𝑋_n}
The corresponding factor graph 𝐺 = (𝑋, 𝐹, 𝐸) is a bipartite graph that consists of: variable nodes
𝑋 = {𝑋1, 𝑋2, … , 𝑋𝑛}, factor nodes 𝐹 = {𝑓1, … 𝑓𝑚}, and edges 𝐸. The edges depend on the
factorization as follows: there is an undirected edge between factor 𝑓𝑗 and variable 𝑋𝑖 if and only
if 𝑋𝑖 is an argument of 𝑓𝑗, i.e., 𝑋𝑖 ∈ 𝑆𝑗.
Factor graphs allow a finer–grained specification of factorization properties by explicitly
representing potential functions for non–maximal cliques. We observe that a factor graph has
only node potentials and pairwise potentials. Generally, if the potential functions in a Markov
random field are defined over single variables or pairs of variables, then the Markov random
field is referred to as a pairwise Markov network. More precisely, a pairwise Markov network over a
graph 𝐺 = (𝑉, 𝐸) is a Markov random field associated with a set of node potentials and a set of
edge potentials as described by the equation (2.15):
ΦG = {𝜙(𝑋i) ∶ 𝑋i ∈ 𝑉, 1 ≤ 𝑖 ≤ 𝑛} ∪ {𝜙(𝑋i, 𝑋𝑗) ∶ {𝑖, 𝑗} ∈ 𝐸, 1 ≤ 𝑖, 𝑗 ≤ 𝑛} (2.15)
A factor graph is a pairwise Markov network whose nodes and edges are endowed with special
meanings that originate in the function it factorizes. We will return to the relationship
between Markov random fields and factor graphs in Section 3.5.
One important property of Markov random fields is that the potential functions ΦG need not have
any obvious or direct relation to marginal or conditional distributions defined over the graph
cliques.
Theorem 2.1 (Hammersley–Clifford):
A probability distribution that has a positive mass or density satisfies the Markov property with
respect to an undirected graph if and only if it is a Gibbs distribution in that graph.
The proof of this theorem is outside the scope of this paper. A rigorous mathematical proof can
be found in [48]. The Hammersley–Clifford theorem gives the necessary and sufficient
conditions under which a Gibbs measure is equivalent to a Markov random field.
Consequently, any positive probability measure that satisfies a Markov property is a Gibbs
measure for an appropriate choice of (locally defined) energy function.
The learning algorithms in a pairwise Markov network like the Boltzmann machine require
computing statistical quantities (e.g., likelihoods and probabilities) and information–theoretic
quantities (e.g., mutual information and conditional entropies) on the underlying graphical
model. These types of computational tasks in a graphical model are called inference or
probabilistic inference. Furthermore, the learning algorithms are built on inference algorithms
and allow parameters and structures to be estimated from data. However, exact inference for
large–scale Markov random fields is intractable. Therefore, to achieve a scalable learning
algorithm, approximate methods are required.
One popular source of approximate methods is the Markov chain Monte Carlo (MCMC)
framework. The main problem with the MCMC approach is that convergence times can be long
and it can be difficult to diagnose convergence.
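As a concrete instance of the MCMC approach, the sketch below runs a simple Gibbs sampler on a tiny pairwise binary network; the weights, thresholds, and pseudo-temperature are arbitrary illustrative values, not quantities from the text. Each unit is repeatedly resampled from its conditional distribution given the current states of the others:

```python
import math
import random

random.seed(0)
W = [[0.0, 1.0, -0.5],
     [1.0, 0.0, 0.8],
     [-0.5, 0.8, 0.0]]  # symmetric weights, zero diagonal
theta = [0.1, -0.2, 0.0]
T = 1.0  # pseudo-temperature

def gibbs_sweep(sigma):
    # Resample each binary unit from its conditional distribution
    # given the current states of the other units.
    for i in range(len(sigma)):
        net = sum(W[j][i] * sigma[j] for j in range(len(sigma))) - theta[i]
        p_on = 1.0 / (1.0 + math.exp(-net / T))
        sigma[i] = 1 if random.random() < p_on else 0
    return sigma

sigma = [random.randint(0, 1) for _ in range(3)]
samples = [tuple(gibbs_sweep(sigma)) for _ in range(1000)]
```

After a burn-in period, the empirical frequencies of the sampled configurations approximate the Gibbs measure of the network; diagnosing when that point has been reached is exactly the convergence difficulty mentioned above.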
An alternative to MCMC is the variational framework whose goal is to convert the probabilistic
inference problem into an optimization problem. The best known variational algorithm used in
Boltzmann machine learning is the mean field approximation, which searches for the best
distribution that assumes independence among all the nodes and then uses it to approximate the
true posterior distribution over hidden variables.
Another alternative to MCMC is the belief propagation (BP) framework. BP is a message
passing algorithm for performing inference on tree–like graphs. The discovery of the relationship
between belief propagation and the Bethe free energy led to the so–called Bethe approximation of
the free energy, which in turn produced a new class of learning algorithms for the Boltzmann machine.
2.3 Gibbs free energy
The third millennium has brought exciting progress in understanding computationally
hard problems in computer science by using a variety of concepts and methods from statistical
physics. One of these concepts is the Gibbs free energy. In this section we start by briefly
introducing the Gibbs free energy as a thermodynamic potential; then we explain how this
energy can be accommodated to describe a Markov random field. In subsequent development
we use the term temperature to designate the absolute temperature of a canonical ensemble or,
generally, of a thermodynamic system, and the term pseudo–temperature to designate the
“temperature” of a Markov random field, i.e., a parameter which models the thermal noise
injected into the system. The majority of theoretical results reviewed in this section come from
[49] and [4].
The Gibbs free energy, originally called the “available energy”, was developed in the 1870s by
Josiah Willard Gibbs, who described it in [50] as:
the greatest amount of mechanical work which can be obtained from a given quantity of a certain substance in a given initial state, without increasing its total volume or allowing heat to pass to or from external bodies, except such as at the close of the processes are left in their initial condition.
The Gibbs free energy is one of the four thermodynamic potentials used in the chemical
thermodynamics of reactions and non–cyclic processes. The other three thermodynamic
potentials are: internal energy, enthalpy, and Helmholtz free energy. In this paper we are only
interested in the internal energy, the Helmholtz free energy, and the Gibbs free energy.
Generally, energy is a concept which takes into account the physical nature of a system. The
exact (true) energy 𝐸 is usually unknown, but the mean (internal) energy 𝑈 is usually known –
for example, when it is determined by external factors such as a thermostat.
The internal energy 𝑈 is a thermodynamic potential that can be thought of as the energy
contained within a system or, equivalently, the energy required to create a system in the
absence of changes in temperature or volume.
If the system is created in an environment of temperature 𝐓, then some of the energy can be
obtained by spontaneous heat transfer from the environment to the system. The amount of this
spontaneous energy transfer is 𝐓 ∙ 𝑆, where 𝐓 represents the temperature and 𝑆 represents the
final entropy of the system. The Helmholtz free energy 𝐹 is then a measure of the amount of
energy required to create a system once the spontaneous energy transfer to the system from
the environment is accounted for:
𝐹 = 𝑈 − 𝐓 ∙ 𝑆   (2.16)
where: 𝑈 is the internal energy; 𝑆 is the entropy; and 𝐓 is the temperature of the system. At low
temperatures, the Helmholtz free energy is dominated by the energy, while at high
temperatures, the entropy dominates it. The Helmholtz free energy is commonly used for
systems held at constant volume. Moreover, for a system at constant temperature and volume, the
Helmholtz free energy is minimized at equilibrium.
If the system is created from a very small volume, in order to "create room" for the system, an
additional amount of work P ∙ V must be done, where P represents the absolute pressure and V
represents the final volume of the system. As discussed in defining the Helmholtz free energy,
an environment at constant temperature 𝐓 will contribute an amount 𝐓 ∙ 𝑆 to the system,
reducing the overall investment necessary for creating the system. The Gibbs free energy 𝐺 is
then the net energy contribution for a system created in an environment of temperature 𝐓 from a
negligible initial volume:
𝐺 = 𝑈 − 𝐓 ∙ 𝑆 + P ∙ V   (2.17)
where: 𝑈 is the internal energy; 𝑆 is the entropy; 𝐓 is the temperature; P is the absolute
pressure; and V is the final volume of the system. For a system at constant pressure and
temperature, the Gibbs free energy is minimized at equilibrium.
In the context of Markov random fields, energy is a scalar quantity used to represent the state
and the parameters of the system in certain conditions. Similarly to a thermodynamic system,
the true energy of a Markov random field is unknown. The true energy of a Markov random field
at equilibrium is referred to as the true free energy and corresponds to the true joint probability
distribution 𝐏(𝑋_1, … , 𝑋_n) of the random field. If the true joint probability distribution has a positive
mass or density, then, according to Theorem 2.1, it is a Boltzmann–Gibbs distribution.
The internal energy 𝑈 of a Markov random field 𝐏(𝑋_1, … , 𝑋_n) is defined as the expected value
of the exact energy 𝐸 of the system:
𝑈_𝐏 = 𝐄_𝐏[𝐸(𝑋_1, … , 𝑋_n)] = ∑_{𝑋_1,…,𝑋_n} 𝐏(𝑋_1, … , 𝑋_n) ∙ 𝐸(𝑋_1, … , 𝑋_n)   (2.18)
The entropy 𝑆 of a Markov random field 𝐏(𝑋_1, … , 𝑋_n) is defined as the expected value of the
logarithm of the inverse of the probability distribution 𝐏 of the system (equations (B26) and
(B27) from Appendix B):
𝑆_𝐏 = 𝐄_𝐏[ln(1 / 𝐏(𝑋_1, … , 𝑋_n))] = −∑_{𝑋_1,…,𝑋_n} 𝐏(𝑋_1, … , 𝑋_n) ∙ ln(𝐏(𝑋_1, … , 𝑋_n))   (2.19)
The exact Gibbs free energy can be thought of as a mathematical construction designed so that
its minimization leads to the Boltzmann–Gibbs distribution given by the equation (2.5) [49]. In
order to define the exact Gibbs free energy, we write the equation (2.5) for a Markov random
field as:
𝐏(𝑋_1, … , 𝑋_n) = (1 / 𝑍) ∙ exp(−𝐸(𝑋_1, … , 𝑋_n) / 𝐓)   (2.20)
where 𝐸(𝑋1, … , 𝑋𝑛) is the true energy of the Markov random field (adjusted with the Boltzmann’s
constant) and 𝐓 is the pseudo–temperature of the Markov random field.
By definition, the exact Gibbs free energy denoted 𝐺exact is the following function of the true joint
probability function 𝐏(𝑋1, … , 𝑋𝑛):
𝐺_exact[𝐏(𝑋_1, … , 𝑋_n)] = 𝑈_𝐏 − 𝐓 ∙ 𝑆_𝐏   (2.21)
where: 𝑈_𝐏 is given by the equation (2.18); 𝑆_𝐏 is given by the equation (2.19); and 𝐓 is the
pseudo–temperature of the system.
We note the absence from equation (2.21) of the term P ∙ V that appears in equation (2.17). This
absence is explained by the fact that the pressure and volume parameters of a thermodynamic
system have no correspondent in a Markov random field. Hence, the exact Gibbs free
energy of a Markov random field is apparently identical to the Helmholtz free energy.
However, there is a difference of nuance between them: while the Helmholtz free energy is just
the value 𝑈𝐏 − 𝐓 ∙ 𝑆𝐏 computed at equilibrium, the Gibbs free energy is a function that computes
the expression 𝑈𝐏 − 𝐓 ∙ 𝑆𝐏 for any state of the network after applying some constraints [49].
At equilibrium, the exact Gibbs free energy is equal to the Helmholtz free energy, which is given
by the formula:
𝐹 = −𝐓 ∙ ln(𝑍) (2.22)
where 𝑍 is the partition function of the Markov random field [49].
It can be shown that, if we minimize 𝐺𝑒𝑥𝑎𝑐𝑡 given by (2.21) with respect to 𝐏(𝑋1, … , 𝑋𝑛) and
enforce, via a Lagrange multiplier, the constraint of 𝐏 being a probability distribution, then we
recover, as desired, the Boltzmann–Gibbs distribution.
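This minimization property can be checked numerically on a toy state space. In the sketch below, the state energies and the pseudo-temperature are arbitrary illustrative values; it evaluates 𝐺 = 𝑈 − 𝐓 ∙ 𝑆 from equations (2.18), (2.19), and (2.21) and confirms that the Boltzmann–Gibbs distribution attains the Helmholtz free energy −𝐓 ∙ ln(𝑍) of equation (2.22), while randomly drawn distributions never score below it:

```python
import math
import random

energies = [0.0, 1.0, 2.5]  # arbitrary state energies
T = 1.2                     # arbitrary pseudo-temperature

def gibbs_free_energy(Q):
    # G[Q] = U_Q - T * S_Q, from equations (2.18), (2.19), and (2.21).
    U = sum(q * E for q, E in zip(Q, energies))
    S = -sum(q * math.log(q) for q in Q if q > 0.0)
    return U - T * S

# Boltzmann-Gibbs distribution and Helmholtz free energy F = -T * ln(Z).
Z = sum(math.exp(-E / T) for E in energies)
boltzmann = [math.exp(-E / T) / Z for E in energies]
F = -T * math.log(Z)

# Evaluate G on random candidate distributions over the same states.
random.seed(1)
others = []
for _ in range(200):
    raw = [random.random() + 1e-9 for _ in energies]
    total = sum(raw)
    others.append(gibbs_free_energy([r / total for r in raw]))
```

The gap 𝐺[Q] − 𝐹 equals 𝐓 times the Kullback–Leibler divergence between Q and the Boltzmann–Gibbs distribution, which is why no candidate can dip below 𝐹.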
Different types of constraints can be imposed on various probabilities that characterize the
Markov random field and each such scenario “produces” a Gibbs free energy. By minimizing a
Gibbs free energy with respect to the probabilities that are constrained, we obtain self–
consistent equations that must be obeyed at equilibrium [49].
In general, a given system can have more than “one Gibbs free energy” depending on what
constraints are applied and over what probabilities. If the full joint probability is constrained, then
we obtain the exact Gibbs free energy denoted 𝐺𝑒𝑥𝑎𝑐𝑡. If some or all marginal probabilities are
constrained, then we obtain an approximate Gibbs free energy denoted 𝐺. The mean field free
energy and the Bethe free energy, which we are going to introduce in Chapter 3, are both Gibbs
free energies. The advantage of working in a Markov random field with a Gibbs free energy instead
of a Boltzmann–Gibbs distribution is that it is much easier to come up with ideas for
approximations [49].
2.4 Connectionist networks
In order to emphasize their brain–style computational properties, Hinton has
characterized Boltzmann machines as connectionist networks, specifically symmetrical
connectionist networks with hidden units. Their counterparts with respect to the presence of
hidden units are the Hopfield networks. Before diving into the world of Boltzmann
machines, we briefly present their “ancestors”: the connectionist networks and the
Hopfield networks.
Connectionism is a set of approaches in the field of cognitive science that models mental or
behavioral phenomena as the emergent processes of interconnected networks of simple units.
The central connectionist principle is that mental phenomena are better described in terms of
brain–style computation than in terms of rule–based symbol manipulation. However,
the connectionist architectures are not meant to duplicate the physiology of the human brain,
but rather to receive inspiration from known facts about how the brain works [51]. There are
many forms of connectionism, but the most common form uses artificial neural network models.
Connectionist models typically consist of many simple neuron–like processing elements called
units that interact by using weighted connections. The connections between units can be
symmetrical or asymmetrical, depending on whether they have the same weight in both
directions or not. Each unit has a state or activity level that is determined by the input received
from other units in the network. There are many possible encodings of these states within this
general framework: 1/0, +1/-1, on/off. When the effective values of the states of the units are not
important for the argument we are trying to make, we refer to them as on/off. One common, simplifying
assumption is that the combined effects of the rest of the network on the ith unit are mediated by
a single scalar quantity. This quantity, which is called the total input of unit i and denoted neti, is
a linear function of the activity levels of the units that provide input to unit 𝑖:
net_i = ∑_j 𝜎_j ⋅ 𝑤_ji − 𝜃_i   (2.23)
where: 𝜎𝑗 is the state of the jth unit; 𝑤𝑗𝑖 is the weight on the connection from the jth to the ith unit;
and 𝜃𝑖 is the threshold of the ith unit.
An external input vector can be supplied to the network by clamping the states of some units or
by adding an input term to the total input of some units. By taking into consideration the external
input, the total input of unit 𝑖 is computed with the formula:
net_i = ∑_j 𝜎_j ⋅ 𝑤_ji + 𝐼_i − 𝜃_i   (2.24)
where: 𝜎𝑗 is the state of the jth unit; 𝑤𝑗𝑖 is the weight on the connection from the jth to the ith unit;
𝜃𝑖 is the threshold of the ith unit; and 𝐼𝑖 is the external input received by the ith unit.
The threshold term can be eliminated by giving every unit an extra input connection whose
activity level is always on. The weight on this special connection is the negative of the threshold,
and it can be learned in just the same way as the other weights.
The capacity of a network to change over time is expressed at unit level by the concept of
activation. At any time, a unit in the network has an activation, which is a numerical value
intended to represent some aspect of the unit and is often called the state of the unit. The
activation of a unit spreads to all the other units connected to it. Typically the state of a unit is
described as a function, called the activation function, of the total input that it receives from its
input units. Usually the activation function is non–linear, but it can be linear as well.
For units with discrete nonnegative states, the activation function typically has value 0 or 1.
For units with continuous nonnegative states a typical activation function is the logistic sigmoid
defined by the formula (2.25).
For units with discrete bipolar states, the typical activation function has value -1 or 1.
For units with continuous positive and negative states a typical activation function is the
hyperbolic tangent defined by the formula (2.26).
States 0/1: 𝜎_i = sigm(net_i) = 1 / (1 + exp(−net_i))   (2.25)
States -1/1: 𝜎_i = tanh(net_i) = (exp(net_i) − exp(−net_i)) / (exp(net_i) + exp(−net_i)) = (exp(2 ∙ net_i) − 1) / (exp(2 ∙ net_i) + 1)   (2.26)
where neti is the input of the ith unit and 𝜎𝑖 is the state of the same unit.
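The two activation functions above translate directly into code; the function names `sigm` and `tanh_act` below are merely illustrative:

```python
import math

def sigm(net):
    # Logistic sigmoid, equation (2.25): maps the net input to a state in (0, 1).
    return 1.0 / (1.0 + math.exp(-net))

def tanh_act(net):
    # Hyperbolic tangent, equation (2.26): maps the net input to a state in (-1, 1).
    return (math.exp(2.0 * net) - 1.0) / (math.exp(2.0 * net) + 1.0)
```

The two forms are related by tanh(x) = 2 ∙ sigm(2x) − 1, which is one way to see that the 0/1 and -1/1 conventions differ only by a rescaling of states.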
All the long–term knowledge in a connectionist model is encoded by the locations and the
weights of the connections, so learning consists of changing the weights or adding or removing
connections. The short–term knowledge of the model is normally encoded by the states of the
units, but some models also have fast–changing temporary weights or thresholds that can be
used to encode temporary contexts or bindings [51].
2.5 Hopfield networks
Historically the Boltzmann machine was preceded by a simpler connectionist model
invented by John Hopfield in 1982. Hopfield’s network is not only a precursor of the Boltzmann
machine; it also represents the limiting case of the asynchronous Boltzmann machine as the
pseudo–temperature parameter 𝐓 → 0. The network proposed by Hopfield in [7] and expanded
in [52] is a symmetrical connectionist network without hidden units whose main purpose is to
store memories as distributed patterns of activity. Hopfield, who is a physicist, got the idea of a
network acting as an associative memory by studying the dynamics of a physical system whose
state space is dominated by a substantial number of locally stable states to which the system is
attracted. He regarded these numerous locally stable states as associative memory or content
addressable memory. Before we present the Hopfield network, we are going to briefly introduce
the ideas of associative memory and Hebbian learning which are used by the learning
algorithms of both Hopfield network and Boltzmann machine.
Inspired by the associative nature of biological memory, Hebb proposed in 1949 a simple model
for the neuron that captures the idea of associative memory [2]. Hebb’s theory is often
summarized by Siegrid Löwel's phrase: "neurons wire together if they fire together" [53]. We are
going to present the intuition behind Hebb’s theory by using an example. Let us imagine that the
weights between neurons whose activities are positively correlated are increased:
d𝑤_ij / dt = corr(𝜎_i, 𝜎_j)   (2.27)
where corr(𝜎_i, 𝜎_j) is the correlation coefficient between the states 𝜎_i and 𝜎_j.
Let us also imagine the following two scenarios:
when stimulus 𝑖 is present – for instance, a bell ringing – the activity of neuron 𝑖 increases;
neuron 𝑗 is associated with another stimulus 𝑗 – for instance, the sight of a teacher coming
to the classroom carrying a register.
If these two stimuli – first a person formally dressed and carrying a register and second a ringing
bell – co–occur in the environment, then the Hebbian learning rule will increase the weights 𝑤𝑖𝑗
and 𝑤𝑗𝑖. This means that when, on a later occasion, stimulus 𝑗 occurs in isolation, making the
activity of 𝜎𝑗 large, the positive weight from 𝑗 to 𝑖 will cause neuron 𝑖 to be also activated. Thus,
the response to the sight of a formally dressed person carrying a register is an automatic
association with the bell ringing sound. Hence, we would expect to hear a bell ringing. We could
call this "pattern completion". No instructor is required for this associative memory to work and
no signal is needed to indicate that a correlation has been detected or that an association
should be made. Thus, the unsupervised local learning algorithm and the unsupervised local
activity rule spontaneously produce the associative memory.
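A discrete-time version of the correlation rule (2.27) can be sketched as follows; the learning rate and the simple product form 𝜎_i ∙ 𝜎_j, a common stand-in for the correlation between bipolar states, are illustrative assumptions rather than the exact rule from the text:

```python
def hebbian_step(w, sigma, eta=0.1):
    # Strengthen w_ij when units i and j are co-active; weaken it when
    # their bipolar states disagree (illustrative discrete Hebbian step).
    n = len(sigma)
    for i in range(n):
        for j in range(n):
            if i != j:
                w[i][j] += eta * sigma[i] * sigma[j]
    return w

# Units 0 and 1 co-active, unit 2 anti-correlated with both.
w = [[0.0] * 3 for _ in range(3)]
w = hebbian_step(w, [1, 1, -1])
```

Because the update is symmetric in i and j, the learned weight matrix stays symmetric, which is exactly the property the Hopfield network relies on.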
2.5.1 Hopfield network models
In his influential paper [7] Hopfield proposed a model that was later called the binary Hopfield
network. Later he generalized the original model and published in [52] a new model called the
continuous Hopfield network. In [52] Hopfield also explained the relationship between the two
models. Because a binary Hopfield network becomes a Boltzmann machine with the addition of
noise in updating, we give a detailed presentation of the binary model and only a brief
presentation of the continuous model. We also briefly present the relation between the stable
states of the Hopfield models.
2.5.1.1 The binary Hopfield model
Architecture:
A binary Hopfield network consists of 𝑛 processing devices called neurons or units. Each unit 𝑖
has two activation levels or states: off or not firing, usually represented as 𝜎𝑖 = 0, and on or
firing at maximum rate, usually represented as 𝜎𝑖 = 1. An alternative representation of the off/on
states uses the bipolar elements -1 and +1.
In the Hopfield network there are weights associated with the connections between units. All
these weights are organized in a matrix W = (𝑤_ij), 1 ≤ 𝑖, 𝑗 ≤ 𝑛, called the weight matrix or the
correlation matrix. The strength of the connection between two units 𝑖 and 𝑗 is called the weight and is
denoted 𝑤𝑖𝑗. The units are connected through symmetric, bidirectional connections, so 𝑤𝑖𝑗 =
𝑤𝑗𝑖. If two units 𝑖 and 𝑗 are not connected, then 𝑤𝑖𝑗 = 0. If they are connected, then 𝑤𝑖𝑗 > 0 or
𝑤𝑖𝑗 < 0. There are no self–connections, so 𝑤𝑖𝑖 = 0 for all 𝑖 ∈ {1,2,… , 𝑛}.
The activity of unit 𝑖, denoted neti, represents the total input that the unit receives from other
units and is computed either with the equation (2.23) or with the equation (2.24), depending on
the presence of the external input. Unless otherwise stated, we consider the external input 𝐼𝑖 for
each unit 𝑖 to be 0. The units are binary threshold units, i.e., for each unit 𝑖 there is a fixed
threshold 𝜃𝑖 ≥ 0. We can think of the threshold of unit 𝑖 as the weight of a special connection
from a virtual unit “0”, whose activity is permanently on, towards unit 𝑖. We formally express this
relation as: 𝜃𝑖 = −𝑤𝑖0. Then, if we include the threshold in the computation of the activity of the
unit, the equation (2.23) becomes:
net_i = ∑_{j=1}^{n} 𝑤_ji ∙ 𝜎_j − 𝜃_i = ∑_{j=0}^{n} 𝑤_ji ∙ 𝜎_j   for 1 ≤ 𝑖 ≤ 𝑛   (2.28)
The instantaneous state of a model composed of 𝑛 units is specified by a configuration or state
vector 𝝈 whose elements represent the states of the units: 𝝈 = (𝜎1, 𝜎2, … , 𝜎𝑛).
Global energy:
Hopfield realized that, when the weight matrix W is symmetric, the network can be characterized
by a global energy function [7]. Moreover, each configuration of the network can also be
characterized by an energy value. The global energy of the network is a sum of contributions
from each unit and is computed with the following formula:
𝐸 = −(1/2) ∙ ∑_{i=1}^{n} ∑_{j=1}^{n} 𝑤_ij ∙ 𝜎_i ∙ 𝜎_j + ∑_{i=1}^{n} 𝜎_i ∙ 𝜃_i   (2.29)
This simple quadratic energy function makes it possible for each unit to compute locally how its
state affects the global energy.
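The quadratic energy (2.29) is straightforward to compute; the sketch below is illustrative, with arbitrary example weights:

```python
def hopfield_energy(w, sigma, theta):
    # Global energy E = -1/2 * sum_ij w_ij * sigma_i * sigma_j
    #                 + sum_i theta_i * sigma_i, equation (2.29).
    n = len(sigma)
    quadratic = sum(w[i][j] * sigma[i] * sigma[j]
                    for i in range(n) for j in range(n))
    return -0.5 * quadratic + sum(theta[i] * sigma[i] for i in range(n))

# Two mutually excitatory units: the configuration with both units on
# has the lowest energy.
w = [[0.0, 1.0], [1.0, 0.0]]
theta = [0.0, 0.0]
```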
Update rule:
The state of the model system changes in time as a consequence of each unit 𝑖 readjusting its
state. While the selection of the unit to be updated could be a stochastic process (taking place
at a mean rate 𝑟 > 0 for each unit) or a deterministic process (being part of a predefined
sequence), the update itself is always a deterministic process. Each selected unit evaluates whether its activity is above or below zero (because we included the threshold in the computation of the unit's activity) and updates its state according to the following “threshold rules”:
States 0/1:
σ_i → 0 if net_i = ∑_{j=1}^{n} w_ji ∙ σ_j − θ_i ≤ 0
σ_i → 1 if net_i = ∑_{j=1}^{n} w_ji ∙ σ_j − θ_i > 0
(2.30)
States -1/1:
σ_i → −1 if net_i = ∑_{j=1}^{n} w_ji ∙ σ_j − θ_i ≤ 0
σ_i → 1 if net_i = ∑_{j=1}^{n} w_ji ∙ σ_j − θ_i > 0
(2.31)
Equivalently, the update rule can be formulated as: update each unit to whichever of its two
states gives the lowest global energy. The updates may be synchronous or asynchronous and,
because the network has feedback (i.e., every unit’s output is an input to other units), an order
for the updates to occur has to be specified.
Synchronous (parallel) updates: firstly all units compute their activities (neti)1≤𝑖≤𝑛, and
secondly they update their states (𝜎𝑖)1≤𝑖≤𝑛 simultaneously.
There are a few drawbacks to this update strategy. Firstly, if the units make their decisions simultaneously, the energy could go up. Secondly, with simultaneous parallel updating we can get oscillations, which always have a period of two. However, if the updates occur in parallel but with random timing, the oscillations are usually destroyed.
Asynchronous (sequential) updates: one unit at a time computes its activity neti and
updates its state 𝜎𝑖.
When units are randomly chosen to update, the global energy 𝐸 of the network will either
lower its value or stay the same. Under repeated sequential updating the network will
eventually converge to a state which is a local minimum in the global energy function.
Thus, if a state is a local minimum in the global energy function, it is a stable state for the
network.
Learning rule:
The first goal of the Hopfield network is to store the input data or desired memories – this is
what we call the store phase. The desired memories are represented as a set with 𝑚 elements,
each element being a 𝑛–dimensional binary vector.
The second goal is that, given the initial configuration of a Hopfield network, the network is capable of retrieving or recalling one particular configuration or stored memory from all the memories stored in the network – this is what we call the recall phase.
In general, the initial configuration is a noisy version of one stored memory. The learning rule is intended to make a set of desired memories become stored memories, i.e., stable states of the Hopfield network's activity rule. In order to understand how the Hopfield network learns a set of desired memories, we first present the information storage rules and then prove that the stored memories are stable states of the Hopfield network.
Information storage algorithm:
We start by observing that each desired memory represents a possible configuration of the
network:
σ^(s) = (σ_1^(s), σ_2^(s), …, σ_n^(s)) for all s ∈ {1, 2, …, m} (2.32)
Hopfield demonstrated that the capacity 𝑚 of a totally connected network with 𝑛 units under his
storage rule is only about 0.15𝑛 memories [7]. Also, if all the desired memories are known, the
matrix W does not change in time. Hence, it can be determined in advance.
Hopfield proposed the following rules for computing the weights of a network whose purpose is to store a given set of m desired memories. In both cases the factor 1/m ensures that |w_ij| ≤ 1.

States 0/1: w_ij = (1/m) ∙ ∑_{s=1}^{m} (2∙σ_i^(s) − 1) ∙ (2∙σ_j^(s) − 1) for 1 ≤ i, j ≤ n
w_ii = 0 for 1 ≤ i ≤ n
(2.33)

States -1/1: w_ij = (1/m) ∙ ∑_{s=1}^{m} σ_i^(s) ∙ σ_j^(s) for 1 ≤ i, j ≤ n
w_ii = 0 for 1 ≤ i ≤ n
(2.34)
There is another way to compute the weight matrix W. The algorithm starts from a matrix W with all elements equal to zero. For each binary vector σ that represents a desired memory, the weight w_ij between any two units i and j is incremented by a quantity Δw_ij:
𝑤𝑖𝑗 ← 𝑤𝑖𝑗 + Δ𝑤𝑖𝑗 for 1 ≤ 𝑖, 𝑗 ≤ 𝑛 (2.35)
where Δ𝑤𝑖𝑗 is computed with the following formulae:
States 0/1: Δw_ij = 4 ∙ (σ_i − 1/2) ∙ (σ_j − 1/2) for 1 ≤ i, j ≤ n
States -1/1: Δw_ij = σ_i ∙ σ_j for 1 ≤ i, j ≤ n
(2.36)

The rules (2.35) and (2.36) are applied to the whole matrix m times, once for each desired memory. After these steps each weight w_ij has an integer value in the range [−m, m]. Finally, the weight matrix W may be normalized by multiplying it by the factor 1/m.
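The storage rule for -1/1 states can be sketched as follows. This is a minimal illustration under our own naming, with the desired memories given as rows of a NumPy array:

```python
import numpy as np

def store_memories(patterns):
    """Hopfield storage rule for -1/1 states (eq. 2.34):
    w_ij = (1/m) * sum_s sigma_i^(s) * sigma_j^(s), with w_ii = 0."""
    m, n = patterns.shape
    W = patterns.T @ patterns / m   # accumulate Delta_w over memories, normalize by 1/m
    np.fill_diagonal(W, 0.0)        # no self-connections
    return W
```

The resulting matrix is symmetric with zero diagonal, which is exactly the condition required for the convergence proof in Section 2.5.2.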
Once the matrix W has been computed, the desired memories have become stored memories. Now we need to prove that the stored memories are stable states of the Hopfield network. We present the proof only for the case when the states of the units are represented as 0/1; the proof for the -1/1 representation is similar.
In order to prove that the stored memories (σ^(s))_{1≤s≤m} are stable states for the Hopfield network, we start by computing net_i^(s), the activity of some unit i when the network is in the s-th stored memory:

net_i^(s) = ∑_{j=0}^{n} w_ji ∙ σ_j^(s) = ∑_{j=0}^{n} σ_j^(s) ∙ ∑_{u=1}^{m} (2∙σ_i^(u) − 1) ∙ (2∙σ_j^(u) − 1) (2.37)

net_i^(s) = ∑_{u=1}^{m} (2∙σ_i^(u) − 1) ∙ [∑_{j=0}^{n} σ_j^(s) ∙ (2∙σ_j^(u) − 1)]

In the equation (2.37) the mean value of the bracketed term is 0 unless s = u, in which case the mean value is n/2. This pseudo-orthogonality leads to:

net_i^(s) = ∑_{j=0}^{n} w_ji ∙ σ_j^(s) ≅ ⟨net_i^(s)⟩ = (n/2) ∙ (2∙σ_i^(s) − 1) for 1 ≤ i ≤ n (2.38)

The equation (2.38) shows that ⟨net_i^(s)⟩ is positive when σ_i^(s) = 1 and negative when σ_i^(s) = 0. The s-th stored state would therefore always be stable under Hopfield's algorithm, were it not for the noise coming from the s ≠ u terms.
2.5.1.2 The continuous Hopfield model and its relation with the binary Hopfield model
Let us consider a binary Hopfield network where the set of possible states σ_i of unit i is {V_i^0, V_i^1}, where V_i^0 ∈ ℝ, V_i^1 ∈ ℝ, V_i^0 < V_i^1, and 1 ≤ i ≤ n.
Let us also consider another Hopfield network, identical to the first one except for the following aspects:
the output variable V_i of unit i is a continuous and monotonically increasing function of the instantaneous input net_i of the same unit;
the output variable V_i of unit i has the range V_i^0 ≤ V_i ≤ V_i^1;
the input–output relation is described by a sigmoid function with horizontal asymptotes V_i^0 and V_i^1.
In the second network the sigmoid function acts as an activation function and the output V_i of unit i plays the role of the state σ_i of unit i in the first network:

σ_i ≡ V_i for 1 ≤ i ≤ n (2.39)

If V_i^0 and V_i^1 are 0 and 1 respectively, then an appropriate activation function for the second network is the logistic sigmoid and the output of unit i is computed with the formula (2.40). If V_i^0 and V_i^1 are -1 and 1 respectively, then an appropriate activation function for the second network is the hyperbolic tangent and the output of unit i is computed with the formula (2.41).

{V_i^0, V_i^1} = {0, 1}: V_i = sigm(net_i) = 1 / (1 + exp(−∑_{j=1}^{n} w_ji ∙ σ_j + θ_i)) (2.40)

{V_i^0, V_i^1} = {−1, 1}: V_i = tanh(net_i) = tanh(∑_{j=1}^{n} w_ji ∙ σ_j − θ_i) (2.41)
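The two activation functions (2.40) and (2.41) can be written directly from the formulas. This is only a sketch; the helper names are ours:

```python
import numpy as np

def logistic_output(W, theta, sigma, i):
    """Continuous unit output for states {0,1} (eq. 2.40): V_i = sigm(net_i)."""
    net = W[:, i] @ sigma - theta[i]
    return 1.0 / (1.0 + np.exp(-net))

def tanh_output(W, theta, sigma, i):
    """Continuous unit output for states {-1,1} (eq. 2.41): V_i = tanh(net_i)."""
    net = W[:, i] @ sigma - theta[i]
    return np.tanh(net)
```

At net_i = 0 the logistic output is 0.5 and the tanh output is 0, the midpoints of the two ranges; for large |net_i| the outputs approach the asymptotes V_i^0 and V_i^1.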
Each unit updates its state as if it were the only unit in the network. The updates may also be synchronous or asynchronous and the learning rule is similar to the learning rule of the binary network. We observe that the binary Hopfield network is a special case of the continuous Hopfield network. The continuous model has the same flow properties in its continuous space that the binary model does in its discrete space. It can, therefore, be used as a content-addressable memory or for any other computational task for which an energy function is essential.
The relation between the stable states of the two models
For a given weight matrix W, the stable states of the continuous system have a simple
correspondence with the stable states of the binary system. The discrete algorithm searches for
minimal states at the corners of the hypercube, i.e., corners that are lower than adjacent
corners. Since the global energy of the model is a linear function of a single unit state along any
cube edge, the energy minima (or maxima) for the discrete space with, for instance, activities
𝜎𝑖 ∈ {0,1} are exactly the same corners as the energy minima (or maxima) for the continuous
case with activities 0 ≤ Vi ≤ 1 [52].
2.5.2 Convergence of the Hopfield network
Hopfield claimed that his original model behaves as an associative memory when the state
space flow generated by the algorithm is characterized by a set of stable fixed–points such that
each stable point represents a nominally assigned memory. He proved that the stored
memories are stable under the asynchronous update rule and, moreover, that the asynchronous update rule of a Hopfield network is able to take a partial or corrupted memory and perform pattern completion or error correction to restore the original memory [7,50]. The proof relies on
an essential feature of the store–recall operation: the state space flow algorithm converges to
stable states. The flow convergence to stable states is guaranteed by a mathematical condition
imposed on the weight matrix W: to be symmetric and to have zero diagonal elements. Here we
present a sketch of the proof for the case of a binary Hopfield network with asynchronous
updates. The proof for the continuous Hopfield network is similar.
Claim: The binary threshold update rules (2.30) and (2.31) cause the network to settle to a
minimum of the global energy function.
Proof: The proof follows the construction of an appropriate energy function 𝐸 (equation (2.29))
that is always decreased by any state change produced by the algorithm.
First we introduce the concept of energy gap; then we compute the energy gap of some unit 𝑖,
where 1 ≤ 𝑖 ≤ 𝑛. The energy gap of unit 𝑖 represents the change Δ𝐸𝑖 in global energy 𝐸 due to
changing the state of the unit 𝑖 by Δ𝜎𝑖 and keeping all the other units unchanged. In order to
compute Δ𝐸𝑖, we rewrite the equation (2.29) by separating the contribution of unit 𝑖 from the
contributions of all the other units:
E = −(1/2) ∙ ∑_{i=1}^{n} ∑_{j=1}^{n} w_ij ∙ σ_j ∙ σ_i + ∑_{i=1}^{n} σ_i ∙ θ_i

E = (−(1/2) ∙ ∑_{k=1,k≠i}^{n} ∑_{j=1,j≠i}^{n} w_kj ∙ σ_j ∙ σ_k + ∑_{k=1,k≠i}^{n} σ_k ∙ θ_k) + (−∑_{j=1,j≠i}^{n} w_ij ∙ σ_j ∙ σ_i + σ_i ∙ θ_i)

E = (−(1/2) ∙ ∑_{k=1,k≠i}^{n} ∑_{j=1,j≠i}^{n} w_kj ∙ σ_j ∙ σ_k + ∑_{k=1,k≠i}^{n} σ_k ∙ θ_k) + (−∑_{j=1,j≠i}^{n} w_ij ∙ σ_j + θ_i) ∙ σ_i (2.42)
In the equation (2.42) the content of the first parenthesis does not depend on the state of unit i. Consequently, the first parenthesis cancels in the computation of ΔE_i.
States 0/1:
ΔE_i = E(σ_i = 0) − E(σ_i = 1) = −(∑_{j=1,j≠i}^{n} w_ij ∙ σ_j − θ_i) ∙ Δσ_i (2.43)

States -1/1:

ΔE_i = E(σ_i = −1) − E(σ_i = 1) = −(∑_{j=1,j≠i}^{n} w_ij ∙ σ_j − θ_i) ∙ Δσ_i (2.44)
According to the equation (2.23), the content of the parenthesis in both equations (2.43) and
(2.44) is exactly neti. Hence, the equations (2.43) and (2.44) can be compactly written as:
Δ𝐸𝑖 = −neti ∙ Δ𝜎𝑖 (2.45)
According to the update rules (2.30) and (2.31), Δσ_i is positive (the state changes from 0 to 1, respectively from -1 to 1) only when net_i is positive, and is negative (the state changes from 1 to 0, respectively from 1 to -1) only when net_i is negative. Therefore, any change in the global energy E under the algorithm is negative; in other words, the global energy E is a monotonically decreasing function. Moreover, for a given set of weights W and a given set of thresholds (θ_i)_{1≤i≤n} the global energy E is bounded both below and above. Hence, the iteration of the algorithm must lead to stable states that do not change further with time.
The following algorithm describes the dynamics of a trained Hopfield network that uses the
representation of states as 0/1, starts from a given configuration, and converges to a stable
configuration. If the states are represented as -1/1, step 4 of Algorithm 2.1 needs to be modified correspondingly.
Algorithm 2.1: Hopfield Network Dynamics
Given: a trained network W and an initial configuration 𝜎
begin
Step 1: repeat
Step 2: choose a unit 𝑖 at random with mean rate 𝑟 > 0
Step 3: compute the activity of the unit 𝑖: neti
Step 4: if neti > 0 and 𝜎𝑖 = 0 then set: 𝜎𝑖 = 1
if neti < 0 and 𝜎𝑖 = 1 then set: 𝜎𝑖 = 0
A unit that changes its state as described above becomes “satisfied”.
If a state change is not necessary, then the unit is already satisfied.
Step 5: until the current configuration is stable
A configuration is stable when all the units are satisfied.
end
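Algorithm 2.1 can be sketched as follows for 0/1 states. The random unit selection of Step 2 is approximated here by repeated random sweeps, and the names and NumPy representation are our own:

```python
import numpy as np

def run_dynamics(W, theta, sigma, rng, max_sweeps=100):
    """Asynchronous Hopfield dynamics (Algorithm 2.1, 0/1 states): repeatedly
    pick units in random order and update each toward lower energy, stopping
    when a full sweep leaves every unit satisfied."""
    sigma = sigma.copy()
    n = len(sigma)
    for _ in range(max_sweeps):
        changed = False
        for i in rng.permutation(n):            # Step 2: one unit at a time
            net = W[:, i] @ sigma - theta[i]    # Step 3: activity of unit i
            new_state = 1 if net > 0 else 0     # Step 4: threshold rule (2.30)
            if new_state != sigma[i]:
                sigma[i] = new_state
                changed = True
        if not changed:                         # Step 5: configuration stable
            return sigma
    return sigma
```

For example, with three mutually excitatory units (unit thresholds 0.5), the corrupted state (1, 1, 0) settles to the stored state (1, 1, 1), while (0, 0, 0) is itself stable.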
In a Hopfield network the weight matrix W contains simultaneously many memories. We refer to
the process of incorporating all these memories into the network’s weights as training. The
training process is described by the equations (2.33) and (2.34). A trained Hopfield network
converges to a stable configuration that generally depends on the initial configuration of the
network. This means that the stored memories or stable points can be individually reconstructed
from partial information in an initial state of the network.
If the stable points describe a simple flow in which nearby points in state space tend to remain
close during the flow (i.e., a non–mixing flow), then initial states that are close (in Hamming
distance) to a particular stable state and far from all others will tend to terminate in that nearby
stable state [52]. If the initial state is ambiguous (i.e., not particularly close to any stable state),
then the flow is not entirely deterministic and the system responds to that ambiguous state by a
statistical choice between the memory states it most resembles [7].
States near a particular stable point contain partial information about the memory assigned to
that stable point. From an initial state of partial information about a memory, a final stable state
with all the information of the memory is found. The memory is reached not by knowing an
address, but rather by supplying in the initial state some subpart of the memory. Any subpart of
adequate size will do – the memory is truly addressable by content rather than location.
Because the Hopfield network uses its local energy minima to store memories, when the system is started near some local minimum, the desired behavior of the network is to fall into that local minimum, not to find the global minimum.
Chapter 3. Variational methods for Markov networks
Variational methods are used as approximation methods in a wide variety of settings.
They have become very popular, since they typically scale well to large applications. The name
variational method refers to a general strategy in which the problem to be solved is expanded to
include additional parameters that increase the degrees of freedom over which the optimization
is performed and which must be fit to the problem at hand. Each choice of these new
parameters, called variational parameters, gives an approximate answer to the original problem.
The best approximation is usually obtained by optimizing the variational parameters. In this way
the “expansion” of the original problem is actually a modality to convert a complex problem into
a simpler problem, where the simpler problem is generally characterized by a decoupling of the
degrees of freedom in the original problem [27]. Throughout this chapter, we use the standard
terminology for graphical models and we concentrate on Markov networks.
3.1 Pairwise Markov networks as exponential families
In this section we take a look at the parameterization of a pairwise Markov network, i.e., its representation as a member of a parameterized family of probability distributions, namely an exponential family. Our approach is justified by the
fact that the particularities of the Boltzmann machine model are due to its parameterization and
not to its conditional independence structure. Therefore, we start by defining the concept of
exponential family together with a few related concepts. Then we apply them to obtain an
exponential form for a pairwise Markov network. Next we define the concept of canonical
parameters and we introduce the canonical representations for pairwise Markov networks. Then
we define mean parameters and we introduce the mean parameterization for pairwise Markov
networks. We end this section by exploring the role of mean parameters in inference problems.
The majority of theoretical results presented in this section are taken from [47].
3.1.1 Basics of exponential families
In Section 2.2 we defined Markov networks in terms of products of potential functions (equations
(2.11) to (2.13) and (2.15)). In this section we are going to see that, in an exponential family
setting, these products become additive decompositions.
Let us consider a pairwise Markov network defined over a graph G = (V, E) and associated with a set of random variables X = (X_1, …, X_n) where n = |V|. Without restricting the generality, let us assume that each random variable X_i, associated with node i ∈ V, is Bernoulli, i.e., takes its “spin” values/states from I = {0,1}.
Let Φ_G = (φ_j)_{1≤j≤d} be a collection of d potential functions, such that φ_j : I^n → ℝ for all j ∈ {1, 2, …, d}. Here d is the number of cliques that cover the edges and nodes of G; these cliques are in one-to-one correspondence with the potential functions φ_j.
Given the vector of potential functions ΦG, we associate to it a vector of canonical or
exponential parameters: 𝐖 = (𝑊𝑗)1≤𝑗≤𝑑. If the vector of potential functions ΦG is fixed, then
each parameter vector W indexes a particular probability distribution 𝐏W of the family.
For each fixed X ∈ I𝑛, we use ⟨W,ΦG⟩ to denote the Euclidean inner product in ℝ𝑑 between the
vectors W and ΦG:
⟨W, Φ_G⟩ = ∑_{j=1}^{d} W_j ∙ φ_j (3.1)
With these notations, the exponential family associated with the set of potential functions ΦG
and the set of canonical parameters W consists of the following parameterized collection of
probability density functions:
P_W(X) = exp(⟨W, Φ_G⟩ − A(W)) (3.2)

where: A(W) = ln ∑_{X∈I^n} exp(⟨W, Φ_G⟩) (3.3)
We are particularly interested in canonical parameters W that belong to the set:
𝛀 = {W ∈ ℝ𝑑 ∶ 𝑨(W) < +∞} (3.4)
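For a small binary model, the cumulant function (3.3) and the density (3.2) can be computed by brute-force enumeration over I^n. This is a sketch under our own naming, with the potential functions passed as callables:

```python
import itertools
import math

def cumulant(weights, phis, n):
    """A(W) = ln sum_{X in I^n} exp(<W, Phi_G(X)>)  (eq. 3.3)."""
    return math.log(sum(
        math.exp(sum(w * phi(X) for w, phi in zip(weights, phis)))
        for X in itertools.product((0, 1), repeat=n)))

def density(weights, phis, n, X):
    """P_W(X) = exp(<W, Phi_G(X)> - A(W))  (eq. 3.2)."""
    inner = sum(w * phi(X) for w, phi in zip(weights, phis))
    return math.exp(inner - cumulant(weights, phis, n))
```

Subtracting A(W) normalizes the family: the densities sum to 1 over I^n. The enumeration is exponential in n, which is precisely why the forward mapping is hard in high dimensions.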
The following notions are important in subsequent development:
Regular families: An exponential family for which the domain Ω is an open set is known as a
regular family.
Minimal: It is typical to define an exponential family with a vector of potential functions ΦG
such that there is no nonzero vector W ∈ ℝ𝑑 such that ⟨W,ΦG⟩ is equal to a constant.
This condition gives rise to a so–called minimal representation, in which there is a unique
canonical parameter vector W associated with each probability distribution 𝐏.
Overcomplete: Instead of a minimal representation, it can be convenient to use a non–
minimal or overcomplete representation, in which there exist linear combinations ⟨W,ΦG⟩
that are equal to a constant. In this case, there actually exists an entire affine subset of
parameter vectors W, each associated with the same distribution.
3.1.2 Canonical representation of pairwise Markov networks
The potential functions of a pairwise Markov network, as described by the equation (2.15), are
either node potentials or edge potentials. Therefore we can differentiate between the node-specific canonical parameters Θ = (θ_i)_{i∈V} and the edge-specific canonical parameters w = (w_ij)_{{i,j}∈E}. This leads us to a new representation of the pairwise Markov network's canonical parameters W = (w, Θ). This new representation has dimension d = |V| + |E| and will be of particular importance in Chapter 4 and in Chapter 5. Hence, the exponential form of a pairwise Markov network and the corresponding cumulant function are:

P_W(X) = P_W(X_1, …, X_n) = exp(∑_{i∈V} θ_i ∙ X_i + ∑_{{i,j}∈E} w_ij ∙ X_i ∙ X_j − A(W)) (3.5)

where:

A(W) = ln(∑_{X∈I^n} exp(∑_{i∈V} θ_i ∙ X_i + ∑_{{i,j}∈E} w_ij ∙ X_i ∙ X_j)) (3.6)
The exponential form of a pairwise Markov network given by the equations (3.5) and (3.6) is a
regular minimal representation. The representation is regular because the sums from the
equation (3.5) are finite for all choices of W ∈ ℝ𝑑 and the domain Ω is the full space ℝ𝑑. The
representation is minimal because there is no nontrivial inner product ⟨W,ΦG⟩ equal to a
constant.
An alternative canonical representation of pairwise Markov networks, named the standard
overcomplete representation, uses the indicator functions 𝕀𝒊;𝒔 and 𝕀𝒊𝒋;𝒔𝒕 as potential functions.
Each pairing of a node 𝑖 ∈ 𝑉 and a state 𝑠 ∈ I yields a node–specific indicator function 𝕀𝒊;𝒔 with
an associated vector of canonical parameters Θi = (𝜃𝑖;𝑠)𝑠∈I.
𝕀_{i;s}(X_i) = {1 if X_i = s; 0 otherwise} for all i ∈ V, s ∈ I (3.7)

Each pairing of an edge {i, j} ∈ E with a pair of states (s, t) ∈ I × I yields an edge-specific indicator function 𝕀_{ij;st}, as well as an associated canonical parameter w_{ij;st} ∈ ℝ:

𝕀_{ij;st}(X_i, X_j) = {1 if X_i = s and X_j = t; 0 otherwise} for all {i, j} ∈ E, (s, t) ∈ I × I (3.8)
The indicator functions given by the equations (3.7) and (3.8) together with their associated
canonical parameters define an exponential family with dimension 𝑑 = 2 ∙ |𝑉 | + 4 ∙ |𝐸|. Hence,
the exponential form of a pairwise Markov network with indicator functions given by the
equations (3.7) and (3.8) is:
P_W(X) = exp(∑_{i∈V, s∈I} 𝕀_{i;s}(X_i) ∙ θ_{i;s} + ∑_{{i,j}∈E, (s,t)∈I×I} 𝕀_{ij;st}(X_i, X_j) ∙ w_{ij;st} − A(W)) (3.9)

where:

A(W) = ln(∑_{X∈I^n} exp(∑_{i∈V, s∈I} 𝕀_{i;s}(X_i) ∙ θ_{i;s} + ∑_{{i,j}∈E, (s,t)∈I×I} 𝕀_{ij;st}(X_i, X_j) ∙ w_{ij;st})) (3.10)
(3.10)
The exponential form of a pairwise Markov network given by the equations (3.9) and (3.10) is
regular and overcomplete. Like the previous representation, the cumulant function 𝐴 is
everywhere finite, so that the family is regular. In contrast to the previous representation, this
representation is overcomplete because the indicator functions satisfy various linear relations,
like for instance: ∑ 𝕀𝒊;𝒔(𝑋𝑖)𝑠∈I = 1 for all 𝑋𝑖 ∈ I.
3.1.3 Mean parameterization of pairwise Markov networks
So far, we have characterized a pairwise Markov network by its vector of canonical parameters
W ∈ Ω. It turns out that any exponential family, particularly a pairwise Markov network, has an
alternative parameterization in terms of a vector of mean parameters.
Let 𝐏 be a given probability distribution that is a member of an exponential family and whose
collection of potential functions is ΦG = (𝜙𝑗)1≤𝑗≤𝑑. Here all the potential functions are indexed by
j, not only the node-related ones: φ_j = φ_j(X) = φ_j(X_1, …, X_n). Then the mean parameter μ_j associated with the potential function φ_j, where j ∈ {1, 2, …, d}, is defined by the expectation:

μ_j ≝ E_P[φ_j(X)] = ∑_{X∈I^n} φ_j(X) ∙ P(X) (3.11)
Thus, given an arbitrary probability distribution 𝐏 from an exponential family, we defined a vector
𝛍 ≝ (𝜇1, … , 𝜇𝑑) of 𝑑 mean parameters such that there is one mean parameter 𝜇𝑗 for each
potential function 𝜙j. We also define the set ℳ that contains all realizable mean parameters,
i.e., all possible mean vectors μ that can be traced out as the underlying distribution 𝐏 is varied:
𝓜= {𝛍 ∈ ℝ𝑑 ∶ ∃𝐏 such that 𝐄𝐏[𝜙j(X)] = 𝜇𝑗 for all 𝑗 ∈ {1,2…𝑑}} (3.12)
If the exponential family is a pairwise Markov network with indicator functions given by the
equations (3.7) and (3.8), then the collection of potential functions ΦG takes the form:
𝚽𝐆 = {𝕀𝒊;𝒔(𝑋𝑖) ∶ 𝑖 ∈ 𝑉, 𝑠 ∈ I} ∪ {𝕀𝒊𝒋;𝒔𝒕(𝑋𝑖, 𝑋𝑗) ∶ {𝑖, 𝑗} ∈ 𝐸, (𝑠, 𝑡) ∈ I × I } (3.13)
The corresponding mean parameter vector μ ∈ ℝ𝑑, where 𝑑 = 2 ∙ |𝑉 | + 4 ∙ |𝐸|, consists of
marginal probabilities over singleton variables and marginal probabilities over pairs of variables
that correspond to graph edges:
𝛍 = {𝜇𝑖;𝑠 ∶ 𝑖 ∈ 𝑉, 𝑠 ∈ I} ∪ {𝜇𝑖𝑗;𝑠𝑡 ∶ {𝑖, 𝑗} ∈ 𝐸, (𝑠, 𝑡) ∈ I × I } (3.14)
where: 𝜇𝑖;𝑠 = 𝐄𝐏[𝕀𝒊;𝒔(𝑋𝑖)] = 𝐏[𝑋𝑖 = s] for all 𝑖 ∈ 𝑉, 𝑠 ∈ I (3.15)
and: 𝜇𝑖𝑗;𝑠𝑡 = 𝐄𝐏[𝕀𝒊𝒋;𝒔𝒕(𝑋𝑖, 𝑋𝑗)] = 𝐏[𝑋𝑖 = 𝑠, 𝑋𝑗 = t] for all {𝑖, 𝑗} ∈ 𝐸, (𝑠, 𝑡) ∈ I × I (3.16)
The corresponding set ℳ is known as the marginal polytope associated with the graph 𝐺 and is
denoted ℳ(𝐺). Explicitly, ℳ(𝐺) is given by:
ℳ(G) = {μ ∈ ℝ^d : ∃P such that eq. (3.15) holds ∀i ∈ V, ∀s ∈ I, and eq. (3.16) holds ∀{i,j} ∈ E, ∀(s,t) ∈ I × I} (3.17)
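The mean parameters (3.15) and (3.16) of the overcomplete family are simply node and edge marginals; for a small model they can be read off any distribution by enumeration. This is a sketch with our own naming, where P maps a configuration in I^n to its probability:

```python
import itertools

def mean_parameters(P, n, edges):
    """Mean parameters of the standard overcomplete family:
    mu_{i;s} = P[X_i = s]  (eq. 3.15) and
    mu_{ij;st} = P[X_i = s, X_j = t]  (eq. 3.16)."""
    mu_node = {(i, s): 0.0 for i in range(n) for s in (0, 1)}
    mu_edge = {(i, j, s, t): 0.0 for (i, j) in edges
               for s in (0, 1) for t in (0, 1)}
    for X in itertools.product((0, 1), repeat=n):
        p = P(X)
        for i in range(n):
            mu_node[(i, X[i])] += p      # indicator I_{i;s} fires when X_i = s
        for (i, j) in edges:
            mu_edge[(i, j, X[i], X[j])] += p
    return mu_node, mu_edge
```

Collecting such vectors over all distributions P traces out the marginal polytope ℳ(G) of equation (3.17).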
3.1.4 The role of transformations between parameterizations
Various statistical computations, among them marginalization and maximum likelihood
estimation, can be understood as transforming from one parameterization to the other.
The computation of the forward mapping, from canonical parameters 𝐖 ∈ 𝛀 to mean
parameters 𝛍 ∈𝓜, can be viewed as a fundamental class of inference problems in exponential
family models and is extremely difficult for many high–dimensional exponential families.
The computation of the backward mapping, from mean parameters μ ∈ ℳ to canonical parameters W ∈ Ω, also has a natural statistical interpretation. In particular, suppose that we are
given a set of 𝑚 samples 𝕏 of a multivariate random variable X = (𝑋1, … , 𝑋𝑛):
𝕏 = (X^(1), …, X^(m))^T where X^(j) = (X_1^(j), …, X_n^(j)) for 1 ≤ j ≤ m (3.18)
The samples are drawn independently from an exponential family member 𝐏W(X) where the
parameter W is unknown. If the goal is to estimate W, the classical principle of maximum
likelihood dictates obtaining an estimate W by maximizing the likelihood of the data, or
equivalently (after taking logarithms and rescaling), maximizing the quantity:
ℒ(W, 𝕏) = (1/m) ∙ ∑_{j=1}^{m} ln(P_W(X^(j))) = ⟨W, μ̂⟩ − A(W) (3.19)

where:

μ̂ = Ê[Φ_G(X)] = (1/m) ∙ ∑_{j=1}^{m} Φ_G(X^(j)) (3.20)

is the vector of empirical mean parameters defined by the data 𝕏. The maximum likelihood estimate Ŵ is chosen to achieve the maximum of this objective function. Generally, computing Ŵ is another challenging problem, since the objective function involves the cumulant function A. Under suitable conditions, the maximum likelihood estimate is unique and is specified by the stationarity condition:

E_Ŵ[Φ_G(X)] = μ̂ (3.21)

Finding the unique solution to this equation is equivalent to computing the backward mapping μ̂ → W and is generally computationally intensive.
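The empirical mean parameters (3.20) are just sample averages of the potential functions. A minimal sketch, with names of our own choosing:

```python
def empirical_means(samples, phis):
    """mu_hat = (1/m) * sum_j Phi_G(X^(j))  (eq. 3.20):
    one sample average per potential function."""
    m = len(samples)
    return [sum(phi(X) for X in samples) / m for phi in phis]
```

The maximum likelihood condition (3.21) then asks for a canonical parameter vector under which the model's expected potentials match these sample averages.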
3.2 The energy functional
In this section we introduce the concept of energy functional as a variational method for
Markov random fields. In physics, the energy functional is the total energy of a certain system,
as a functional of the system's state. In the context of Boltzmann machine, the energy functional
acts as an alternative to Boltzmann–Gibbs distribution in the sense that it is more advantageous
to maximize the energy functional instead of computing the partition function for the Boltzmann–
Gibbs distribution. The majority of theoretical results presented in this section are taken from
[27,47,52].
Let us consider that we are given some complicated probabilistic system which is modelled by a Markov random field with n nodes and random variables X_1, X_2, …, X_n. We introduce the notation P̃ for the unnormalized measure of the probability distribution P that describes the Markov random field. We rewrite the equation (2.11) in a way that highlights the unnormalized measure of P:

P(X_1, X_2, …, X_n) = (1/Z) ∙ ∏_{X_c∈C_G} φ_c(X_c) = P̃(X_1, X_2, …, X_n) / Z (3.22)

where: P̃(X_1, X_2, …, X_n) = ∏_{X_c∈C_G} φ_c(X_c) (3.23)

and: Z = ∑_{X_1,X_2,…,X_n} ∏_{X_c∈C_G} φ_c(X_c) = ∑_{X_1,X_2,…,X_n} P̃(X_1, X_2, …, X_n) (3.24)
(3.24)
Our goal is to construct an approximation for the joint distribution 𝐏; we are going to name this
new distribution 𝐐. In order to reach this goal, we are going to employ the following strategy:
instead of looking for a distribution equivalent to 𝐏, we are looking for a distribution reasonably
close to 𝐏. Moreover, we want to make sure that we can perform inference efficiently in the
given Markov random field by using Q. Therefore, instead of choosing a priori a single distribution Q, we first choose a family of approximating distributions ℚ = {Q_i : 1 ≤ i ≤ n}, and then let the optimization machinery choose a particular member of this family.
In our journey to finding a decent approximation for 𝐏 we are going to use the Kullback–Leibler
divergence or KL–divergence which is defined in Appendix B (equation (B29)). If we explicitly
write the expectation with respect to 𝐐 in the definition of KL(𝐐||𝐏), then we obtain the following
equation:
KL(Q||P) = KL(Q(X_1, …, X_n)||P(X_1, …, X_n)) = ∑_{X_1,…,X_n} Q(X_1, …, X_n) ∙ ln(Q(X_1, …, X_n) / P(X_1, …, X_n)) (3.25)
where: 𝐏 is the “true” joint distribution from which the data was generated; 𝐐 is a distribution
from a certain family of distributions that, more or less, approximate 𝐏; and
KL(𝐐(𝑋1, … , 𝑋𝑛)||𝐏(𝑋1, … , 𝑋𝑛)) is the KL–divergence of 𝐐 and 𝐏.
We observe that the computation of KL–divergence from the equation (3.25) involves an
intractable operation which is the explicit summation over all possible instantiations of 𝑋1, … , 𝑋𝑛.
However, since we know from the equations (3.22) and (3.23) what P(X_1, …, X_n) and P̃(X_1, X_2, …, X_n) look like, we can exploit this fact to rewrite the KL-divergence in a simpler form. Before we present this simplified form of the KL-divergence, we need to introduce a few concepts related to energy and entropy in Markov random fields.
The entropy of 𝑋1, … , 𝑋𝑛 with respect to 𝐐 is given by the equation (B27) from Appendix B:
S_Q(X_1, …, X_n) = −∑_{X_1,…,X_n} Q(X_1, …, X_n) ∙ ln(Q(X_1, …, X_n)) (3.26)
Definition 3.1:
The energy functional F[P̃, Q] of two probability distributions P and Q is defined in connection with the unnormalized measure P̃(X_1, X_2, …, X_n) of P as:

F[P̃, Q] = E_Q[ln(P̃(X_1, …, X_n))] + S_Q(X_1, …, X_n) (3.27)

where E_Q[ln(P̃(X_1, …, X_n))] represents the expectation with respect to Q of the logarithm of the unnormalized measure of P and S_Q(X_1, …, X_n) denotes the entropy of X_1, …, X_n with respect to Q.
Equivalent forms of the energy functional can be obtained by expanding the expectation in the first term and substituting the entropy in the second term of the equation (3.27):

F[P̃, Q] = E_Q[ln(∏_{X_c∈C_G} φ_c(X_c))] + S_Q(X_1, …, X_n)

F[P̃, Q] = E_Q[∑_{X_c∈C_G} ln(φ_c(X_c))] + S_Q(X_1, …, X_n)

F[P̃, Q] = ∑_{X_c∈C_G} E_Q[ln(φ_c(X_c))] + S_Q(X_1, …, X_n) (3.28)

F[P̃, Q] = ∑_{X_c∈C_G} E_Q[ln(φ_c(X_c))] − ∑_{X_1,…,X_n} Q(X_1, …, X_n) ∙ ln(Q(X_1, …, X_n))
The energy functional contains two terms:
The first term, called the energy term, involves expectations with respect to Q of the logarithms of the factors φ_c. Here each factor φ_c appears as a separate term. Thus, if the factors φ_c are small, as is the case for the Boltzmann machine, each expectation deals with relatively few variables. The difficulty of dealing with these expectations depends on the properties of the distribution Q. Assuming that inference is “easy” in Q, we should be able to evaluate such expectations relatively easily.
The second term, called the entropy term, is the entropy of Q. The choice of Q determines whether this term is tractable, i.e., whether we can evaluate it.
The following theorem clarifies the relationship between the KL–divergence and the energy
functional. The proof of this theorem is outside the scope of this paper. However, a proof can be
found in [54].
Theorem 3.1:
The KL-divergence of the probability distributions Q and P can be calculated using the formula:

KL(Q(X_1, …, X_n)||P(X_1, …, X_n)) = −F[P̃, Q] + ln(Z(P)) (3.29)

where F[P̃, Q] is the energy functional given by Definition 3.1 and Z(P) is the partition function of the probability distribution P.
Equivalently, the equation (3.29) can be written using free energies as in [4]:

KL(Q(X_1, …, X_n)||P(X_1, …, X_n)) = F[Q] − F[P] (3.30)
where 𝐹[𝐐] is a variational free energy and 𝐹[𝐏] is the true free energy of the Markov random
field.
The variational free energy is equal to the negative of the energy functional F[P̃, Q] and, generally, is a Gibbs free energy (equation (2.21)). The exact (true) free energy is the Helmholtz free energy (equation (2.22)). Without restricting the generality, in this section we can assume that the constant pseudo-temperature at equilibrium is 1. Therefore, the variational free energy and the true free energy can be written as:

F[Q] = −F[P̃, Q] (3.31)

F[P] = −ln(Z(P)) (3.32)
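Theorem 3.1 can be verified numerically for a small model by enumerating I^n. This is a sketch with our own naming, where P_tilde is the unnormalized measure and Q any distribution over I^n:

```python
import itertools
import math

def kl_vs_energy_functional(P_tilde, Q, n):
    """Return (KL(Q||P), -F[P~,Q] + ln Z) for distributions over I^n, where
    F[P~,Q] = E_Q[ln P~] + S_Q  (eq. 3.27) and P = P~ / Z  (eq. 3.22)."""
    states = list(itertools.product((0, 1), repeat=n))
    Z = sum(P_tilde(X) for X in states)
    kl = sum(Q(X) * math.log(Q(X) / (P_tilde(X) / Z))
             for X in states if Q(X) > 0)
    energy = sum(Q(X) * math.log(P_tilde(X)) for X in states if Q(X) > 0)
    entropy = -sum(Q(X) * math.log(Q(X)) for X in states if Q(X) > 0)
    return kl, -(energy + entropy) + math.log(Z)
```

The two returned values agree for any choice of Q, which is exactly why maximizing the energy functional is equivalent to minimizing the KL-divergence.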
If we substitute the equations (3.28) into the equation (3.29) we obtain the following equivalent forms of the KL-divergence of the probability distributions Q and P:

KL(Q||P) = −∑_{X_c∈C_G} E_Q[ln(φ_c(X_c))] − S_Q(X_1, …, X_n) + ln(Z(P)) (3.33)

KL(Q||P) = −∑_{X_c∈C_G} E_Q[ln(φ_c(X_c))] + ∑_{X_1,…,X_n} Q(X_1, …, X_n) ∙ ln(Q(X_1, …, X_n)) + ln(Z(P))
We observe that the term ln(𝑍(𝐏)) in the equations (3.29) and (3.33) doesn’t depend on 𝐐. Hence, if we want to minimize KL(𝐐||𝐏) with respect to 𝐐, we just need to minimize the first two terms of the right–hand side of the equations (3.33), which means we need to either maximize the energy functional 𝐹[𝐏̃, 𝐐] (equations (3.28)) or minimize the variational free energy 𝐹[𝐐] (equation (3.31)).
To summarize, instead of searching directly for a good approximation 𝐐 of the true probability 𝐏, we need to solve one of the following equivalent optimization problems:
– maximize the energy functional 𝐹[𝐏̃, 𝐐];
– minimize the variational free energy 𝐹[𝐐];
– minimize the KL–divergence KL(𝐐||𝐏).
Choosing among these problems depends on the specifics of the problem we try to solve. Importantly, both the energy functional and the variational free energy involve expectations under 𝐐. By choosing approximations 𝐐 that allow for efficient inference, we can both evaluate the energy functional (or, equivalently, the variational free energy) and optimize it effectively.
Moreover, the KL–divergence is always nonnegative and becomes zero if and only if 𝐐 and 𝐏 are equal. The proof of this claim can be found in [54]:
KL(𝐐||𝐏) ≥ 0   (3.34)
and: KL(𝐐||𝐏) = 0 if and only if 𝐐 = 𝐏   (3.35)
Then, from the equations (3.29) and (3.30) we can infer that:
ln(𝑍(𝐏)) ≥ 𝐹[𝐏̃, 𝐐]   (3.36)
and: 𝐹[𝐏] ≤ 𝐹[𝐐]   (3.37)
The inequalities (3.36) and (3.37), together with the equation (3.32), are significant because they provide bounds involving the variational free energy and the energy functional:
– The variational free energy 𝐹[𝐐] is an upper bound for the true free energy 𝐹[𝐏] for any choice of 𝐐. Hence the result of the optimization problem “minimize 𝐹[𝐐]” is an upper bound for 𝐹[𝐏].
– The energy functional 𝐹[𝐏̃, 𝐐] is a lower bound for ln(𝑍(𝐏)) = −𝐹[𝐏] for any choice of 𝐐. Hence the result of the optimization problem “maximize 𝐹[𝐏̃, 𝐐]” is a lower bound for ln(𝑍(𝐏)).
Therefore, instead of directly computing the true partition function 𝑍(𝐏), we can look for a decent approximation 𝐐 of 𝐏. Depending on the type of optimization employed (minimization or maximization), we obtain an upper bound on 𝐹[𝐏] = −ln(𝑍(𝐏)) or a lower bound on ln(𝑍(𝐏)), either of which yields a one–sided approximation of 𝑍(𝐏).
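These relationships can be checked numerically on a toy model. The sketch below (all weights and marginals are hypothetical, chosen only for illustration) enumerates a three-variable Boltzmann–Gibbs distribution 𝐏, picks a fully factored 𝐐, and verifies (3.30), (3.34), (3.37), and the lower bound on ln 𝑍(𝐏):

```python
import itertools
import math

# Hypothetical 3-node pairwise model on {0,1}: E(x) = -sum w_ij x_i x_j + sum theta_i x_i
w = {(0, 1): 0.8, (1, 2): -0.5}
theta = [0.2, -0.3, 0.1]

def energy(x):
    return (-sum(wij * x[i] * x[j] for (i, j), wij in w.items())
            + sum(t * xi for t, xi in zip(theta, x)))

states = list(itertools.product([0, 1], repeat=3))
Z = sum(math.exp(-energy(x)) for x in states)            # partition function
P = {x: math.exp(-energy(x)) / Z for x in states}        # Boltzmann-Gibbs distribution

q = [0.6, 0.4, 0.5]                                      # marginals Q(X_i = 1) of a factored Q
Q = {x: math.prod(qi if xi else 1 - qi for qi, xi in zip(q, x)) for x in states}

kl = sum(Q[x] * math.log(Q[x] / P[x]) for x in states)   # KL(Q || P)
F_P = -math.log(Z)                                       # true free energy, eq. (3.32)
energy_functional = (sum(Q[x] * -energy(x) for x in states)      # E_Q[ln P~]
                     - sum(Q[x] * math.log(Q[x]) for x in states))  # + S_Q
F_Q = -energy_functional                                 # variational free energy, eq. (3.31)

assert kl >= 0                                  # (3.34)
assert abs(kl - (F_Q - F_P)) < 1e-12            # (3.30)
assert F_Q >= F_P                               # F[Q] upper-bounds F[P]
assert energy_functional <= math.log(Z)         # F[P~,Q] lower-bounds ln Z(P)
```

The assertions hold for any choice of the factored marginals `q`, which is exactly the content of the bounds above.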
3.3 Gibbs free energy revisited
In Section 2.3 we introduced the concept of Gibbs free energy in Markov random fields by
analogy with the homologue concept from thermodynamics. In this section we firstly introduce
the concepts of Hamiltonian and Plefka expansion; then we present two approaches for defining
a variational Gibbs free energy in a Markov random field. Before we start, we mention that we
use the notation (A2) from Appendix A to designate by X a multivariate random variable that
represents all the random variables 𝑋1, … , 𝑋𝑛 of a Markov random field and the notation (A3)
from Appendix A to designate by X−i all the random variables from X except 𝑋𝑖.
3.3.1 Hamiltonian and Plefka expansion
Hamiltonian mechanics is a theory developed as a reformulation of classical mechanics that predicts the same outcomes as Newtonian classical mechanics. It uses a different mathematical formalism, providing a more abstract understanding of the theory, and contributed to the subsequent formulation of statistical mechanics and quantum mechanics. The Hamiltonian is an operator, introduced by Hamiltonian mechanics, that in most cases corresponds to the total energy of the system. For example, the Hamiltonian of a closed system is the sum of the kinetic energies of all the particles plus the potential energy of the particles associated with the system.
The Plefka expansion is an approximate method to compute free energies in physical systems. The method, originally applied to classical spin systems, can be applied to any model in which a transition from a “trivial” disordered phase to an ordered phase occurs as some initially small parameter is varied [55]. That parameter need not be the inverse temperature [55]. In spin glass theory, the Plefka expansion is a “high–temperature” expansion of the ordinary free energy of the system. Concretely, it is a Taylor expansion of the ordinary free energy with respect to the inverse temperature such that the resulting free energy is valid in both the high–temperature and low–temperature phases of the system [55].
The concepts of Hamiltonian and Plefka expansion can be extended to Markov random fields.
We consider a pairwise Markov random field with binary random variables 𝑋1, … , 𝑋𝑛 defined by
the equations (2.11) to (2.13) and whose joint probability distribution 𝐏 is a Boltzmann–Gibbs
distribution described by the equation (2.5). The canonical parameters 𝐖 = (𝐰, 𝚯) of the Markov random field are given by the equation (3.5). The derivation by Plefka is particularly suitable for this type of Markov network since it does not regard the parameters 𝑤𝑖𝑗 as random quantities and hence does not require averaging over them. In spin glass theory the parameters 𝑤𝑖𝑗 are generally regarded as random variables representing random interactions, and their properties are analyzed in the thermodynamic limit so that they do not depend on a particular realization of 𝑤𝑖𝑗 [22]. In Markov random field theory, by contrast, the parameters 𝑤𝑖𝑗 are given and fixed, and hence in principle they cannot be thought of as random variables [22].
In Plefka’s argument [56-57], we associate with a given Markov random field the Hamiltonian H(𝛼), where 𝛼 is the expansion parameter:
H(𝛼) = −𝛼 ∙ (1/2) ∙ Σ_{{𝑖,𝑗}∈𝐸} 𝑤𝑖𝑗 ∙ 𝑋𝑖 ∙ 𝑋𝑗 − Σ_{𝑖∈𝑉} 𝜃𝑖 ∙ 𝑋𝑖   (3.38)
The free energy 𝐹 corresponding to the Hamiltonian H(𝛼) is given by the following formula [56-57]:
−𝛽 ∙ 𝐹[𝛼, 𝛽, {𝜇𝑖}𝑖∈𝑉] = ln(Tr exp(−𝛽 ∙ H(𝛼))) − 𝛽 ∙ Σ_{𝑖∈𝑉} 𝜃𝑖 ∙ 𝜇𝑖   (3.39)
where:
𝛽 = 1/(𝐤 ∙ 𝐓)   (3.40)
and 𝜇𝑖 = 𝐄[𝑋𝑖] for all 𝑖 ∈ 𝑉   (3.41)
Here 𝐓 is the absolute temperature of the system; 𝐤 is Boltzmann’s constant; 𝜇𝑖 is the mean value (magnetization) of the random variable 𝑋𝑖; and Tr denotes the trace, i.e., the sum over all configurations. Generally, in Markov networks Boltzmann’s constant 𝐤 can be taken equal to 1.
The Plefka expansion of the free energy 𝐹 given by the equation (3.39) is obtained by suppressing the dependence of 𝐹 on 𝛽 and {𝜇𝑖}𝑖∈𝑉 and then expanding 𝐹 into a power series in 𝛼 as follows [56-57]:
𝐹[𝛼] = 𝐹[0] + Σ_{𝑛=1}^{∞} (𝛼^𝑛/𝑛!) ∙ (𝜕^𝑛𝐹/𝜕𝛼^𝑛)|_{𝛼=0} = 𝐹[0] + 𝛼 ∙ 𝐹′[0] + (1/2) ∙ 𝛼² ∙ 𝐹′′[0] + ⋯   (3.42)
where the derivatives with respect to 𝛼, 𝐹′[𝛼] = 𝜕𝐹/𝜕𝛼, 𝐹′′[𝛼] = 𝜕²𝐹/𝜕𝛼², and so on, should be taken with 𝜇𝑖 fixed, for all 𝑖 ∈ 𝑉.
The coefficients of the Plefka expansion up to the second order are the following [56-57]:
𝐹[0] = (1/2) ∙ Σ_{𝑖∈𝑉} [(1 + 𝜇𝑖) ∙ ln((1 + 𝜇𝑖)/2) + (1 − 𝜇𝑖) ∙ ln((1 − 𝜇𝑖)/2)]   (3.43)
𝐹′[0] = −(1/2) ∙ Σ_{{𝑖,𝑗}∈𝐸} 𝑤𝑖𝑗 ∙ 𝜇𝑖 ∙ 𝜇𝑗   (3.44)
𝐹′′[0] = −(1/2) ∙ Σ_{{𝑖,𝑗}∈𝐸} 𝑤𝑖𝑗² ∙ (1 − 𝜇𝑖²) ∙ (1 − 𝜇𝑗²)   (3.45)
Together with the equations (3.43) to (3.45), the equation (3.42) becomes:
𝛽 ∙ 𝐹[𝛼] = (1/2) ∙ Σ_{𝑖∈𝑉} [(1 + 𝜇𝑖) ∙ ln((1 + 𝜇𝑖)/2) + (1 − 𝜇𝑖) ∙ ln((1 − 𝜇𝑖)/2)] − (𝛽 ∙ 𝛼/2) ∙ Σ_{{𝑖,𝑗}∈𝐸} 𝑤𝑖𝑗 ∙ 𝜇𝑖 ∙ 𝜇𝑗 − (𝛽 ∙ 𝛼/2)² ∙ Σ_{{𝑖,𝑗}∈𝐸} 𝑤𝑖𝑗² ∙ (1 − 𝜇𝑖²) ∙ (1 − 𝜇𝑗²) + O(𝛼³)   (3.46)
Since 𝐇(𝛼 = 1) is the original Hamiltonian under consideration, leaving the convergence problem aside and neglecting the higher–order terms O(𝛼³), setting 𝛼 = 1 in the equation (3.46) yields the true free energy of the Markov random field: 𝐹[𝛼 = 1] ≡ 𝐹[𝐏] [56-57]. The free energy obtained by truncating the Plefka expansion of the ordinary free energy is a Gibbs free energy as well.
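To make the truncation concrete, the following sketch evaluates the zeroth-, first-, and second-order contributions of (3.46) for ±1-valued variables with 𝛼 = 𝛽 = 1 and zero thresholds (the couplings are hypothetical, not taken from this thesis). At 𝜇 = 0 the first-order term vanishes, leaving only the entropy term −𝑛∙ln 2 and the second-order correction:

```python
import math

# Hypothetical couplings w_ij on the edges of a small graph; variables take values in {-1,+1}
edges = {(0, 1): 0.5, (1, 2): -0.3, (0, 2): 0.2}

def plefka_free_energy(mu, beta=1.0, alpha=1.0):
    """Second-order truncation of the Plefka expansion (3.46), with all theta_i = 0."""
    # zeroth order: negative entropy of independent +/-1 variables with means mu_i
    f0 = 0.5 * sum((1 + m) * math.log((1 + m) / 2) + (1 - m) * math.log((1 - m) / 2)
                   for m in mu)
    # first order: mean field energy term
    f1 = -(beta * alpha / 2) * sum(w * mu[i] * mu[j] for (i, j), w in edges.items())
    # second order: correction depending on the squared couplings
    f2 = -(beta * alpha / 2) ** 2 * sum(w ** 2 * (1 - mu[i] ** 2) * (1 - mu[j] ** 2)
                                        for (i, j), w in edges.items())
    return f0 + f1 + f2   # equals beta * F[alpha] up to O(alpha^3)

mu = [0.0, 0.0, 0.0]
bF = plefka_free_energy(mu)
# at mu = 0: entropy term gives -3*ln 2, first order vanishes, second order is -(1/4)*sum(w^2)
assert abs(bF - (-3 * math.log(2) - 0.25 * sum(w ** 2 for w in edges.values()))) < 1e-12
```

This is only a sketch of the truncated expansion; in practice the {𝜇𝑖} would themselves be chosen by extremizing this expression.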
3.3.2 The Gibbs free energy as a variational energy
We assume we are given the Markov random field described in Section 3.3.1 and we denote by X the set of its random variables. We are also given a proper subset Y of X together with 𝒫, the set of marginal probabilities with respect to 𝐏 of all the variables belonging to Y. Our task is to define a Gibbs free energy for this Markov random field by performing a partial constrained minimization over a distribution 𝐐 of a certain form such that the marginals of 𝐏 corresponding to the variables included in Y are preserved in 𝐐. The term partial signals the fact that only some of the random variables 𝑋1, … , 𝑋𝑛 are constrained.
Formally this optimization task is represented as:
Given: X = (𝑋1, … , 𝑋𝑛) ⊃ Y = {𝑋𝑖1, 𝑋𝑖2, … , 𝑋𝑖𝑚} = {𝑋𝑖𝑗}1≤𝑗≤𝑚   (3.47)
such that: ∀𝑗 ∈ {1, … , 𝑚}, ∃𝑘 ∈ {1, … , 𝑛} such that 𝑋𝑘 ≡ 𝑋𝑖𝑗   (3.48)
and given: 𝒫 = {𝑝1, 𝑝2, … , 𝑝𝑚} = {𝑝𝑗}1≤𝑗≤𝑚   (3.49)
where:
𝑝𝑗 = MARG(𝐏, 𝑋𝑖𝑗) = Σ_{X−𝑖𝑗} 𝐏(𝑋1, … , 𝑋𝑘, … , 𝑋𝑛), with 𝑋𝑖𝑗 ≡ 𝑋𝑘 cf. (3.48)   (3.50)
Construct: 𝐺[{𝑝𝑗}𝑗] = min_𝐐 {𝐹[𝐐] ∶ MARG(𝐐, 𝑋𝑖𝑗) = 𝑝𝑗, 1 ≤ 𝑗 ≤ 𝑚}   (3.51)
such that: 𝐐(Y) = (1/𝑍) ∙ exp(−𝐸(Y)) = 𝐏(Y)   (3.52)
In order to accomplish this task, we could follow any of the following approaches:
Approach 1:
Step 1.1: We obtain a Gibbs free energy by truncating up to the nth–order term the
Plefka expansion of the ordinary free energy (formulae (3.42) and (3.46)).
Step 1.2: We obtain another Gibbs free energy, which generally is a variational Gibbs
free energy, by minimizing the Gibbs free energy obtained in Step 1.1 over the
parameters {𝑝𝑗}𝑗.
Approach 2:
Step 2.1: We obtain a Gibbs free energy by using the formula (2.21).
Step 2.2: We obtain another Gibbs free energy, which generally is a variational Gibbs
free energy, by minimizing the Gibbs free energy obtained in Step 2.1 over the
parameters {𝑝𝑗}𝑗.
In this section we are going to exemplify Approach 1. Step 1.1 is explained in Section 3.3.1 so
we are going to show only how to perform Step 1.2. In order to achieve this goal, we follow the
work of Welling and Teh [4].
The natural way to enforce the constraints on the marginals is by employing a set of Lagrange
multipliers {𝜆𝑗}𝑗 and incorporating them in the approximation of the free energy 𝐹 obtained in
Step 1.1:
𝐹[𝐐] ← 𝐹[𝐐] − Σ_𝑗 𝜆𝑗 ∙ (MARG(𝐐, 𝑋𝑖𝑗) − 𝑝𝑗)   (3.53)
We minimize 𝐹[𝐐] over 𝐐 in terms of the Lagrange multipliers {𝜆𝑗}𝑗 and the parameters {𝑝𝑗}𝑗.
The solution obtained is again a Boltzmann–Gibbs distribution, but with a modified energy which
includes additional bias terms:
𝐸({𝑋𝑖𝑗}𝑗) → 𝐸({𝑋𝑖𝑗}𝑗) − Σ_𝑗 𝜆𝑗 ∙ 𝑋𝑖𝑗   (3.54)
After inserting the expression (3.54) into the free energy given by (3.53) and minimizing over the Lagrange multipliers {𝜆𝑗}𝑗, we find the values of {𝜆𝑗}𝑗 as a function of the parameters {𝑝𝑗}𝑗. The resulting Gibbs free energy is:
𝐺[{𝑝𝑗}𝑗] = min_{{𝜆𝑗}𝑗} {Σ_𝑗 𝜆𝑗 ∙ 𝑝𝑗 − ln 𝑍({𝜆𝑗}𝑗)}   (3.55)
where 𝑍 ({𝜆𝑗}𝑗) is the normalizing constant for the Boltzmann–Gibbs distribution with energy
defined by the equation (3.54). The equation (3.55) is known as the Legendre transform
between {𝜆𝑗}𝑗 and {𝑝𝑗}𝑗. By shifting the Lagrange multipliers as follows:
𝜆𝑗′ = 𝜆𝑗 + 𝜃𝑗 (3.56)
we can pull the contribution of the thresholds to the Gibbs free energy out of the Legendre
transform and obtain another form of the resulted Gibbs free energy:
𝐺[{𝑝𝑗}𝑗] = −Σ_𝑗 𝜃𝑗 ∙ 𝑝𝑗 + min_{{𝜆𝑗′}𝑗} {Σ_𝑗 𝜆𝑗′ ∙ 𝑝𝑗 − ln 𝑍′({𝜆𝑗′}𝑗)}   (3.57)
where 𝑍′ is the partition function of the modified Boltzmann–Gibbs distribution with all the
thresholds {𝜃𝑗}𝑗 set to zero:
𝑍′({𝜆𝑗′}𝑗) = Σ_{{𝑋𝑖𝑗}} exp(−Σ_{{𝑗,𝑙}} 𝑤𝑗𝑙 ∙ 𝑋𝑖𝑗 ∙ 𝑋𝑖𝑙 − Σ_𝑗 𝜆𝑗′ ∙ 𝑋𝑖𝑗)   (3.58)
Various variational Gibbs free energies can be obtained by following this approach. For
instance, the mean field free energy is the Gibbs free energy obtained by truncating the Plefka
expansion of the free energy (equation (3.46)) in the first order and minimizing it with respect to
single node marginals.
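For a single ±1-valued node with no couplings, the Legendre transform (3.55) can be carried out in closed form: 𝑍′(𝜆′) = 2∙cosh(𝜆′), stationarity gives tanh(𝜆′) = 𝑝, and 𝐺[𝑝] reduces to the negative entropy of a node with mean 𝑝, i.e., the zeroth-order Plefka term (3.43). A quick numerical check of this identity (a sketch, not part of the thesis):

```python
import math

def gibbs_free_energy(p):
    """G[p] = extremum over lam of {lam*p - ln Z'(lam)} for one +/-1 node, Z'(lam) = 2*cosh(lam)."""
    # stationarity: d/dlam (lam*p - ln 2cosh(lam)) = p - tanh(lam) = 0  =>  lam = atanh(p)
    lam = math.atanh(p)
    return lam * p - math.log(2 * math.cosh(lam))

def neg_entropy(p):
    """Negative entropy of a +/-1 node with mean p, cf. the zeroth-order Plefka term (3.43)."""
    return 0.5 * ((1 + p) * math.log((1 + p) / 2) + (1 - p) * math.log((1 - p) / 2))

# the Legendre transform reproduces the (negative) single-node entropy
for p in (-0.7, 0.0, 0.3, 0.9):
    assert abs(gibbs_free_energy(p) - neg_entropy(p)) < 1e-12
```

This one-node case shows in miniature how the multipliers {𝜆𝑗}𝑗 trade places with the marginals {𝑝𝑗}𝑗 under the Legendre transform.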
3.4 Mean field approximation
The mean field approximation is a variational approximation of the true free energy, or,
equivalently, of the energy functional, over a computationally tractable family ℚ of simple
distributions:
ℚ = {𝐐𝐢 ∶ 1 ≤ 𝑖 ≤ 𝑛}   (3.59)
The mean field approximation of the true free energy is called the mean field free energy and is a Gibbs free energy. In this section our goal is to obtain an expression for the mean field free energy by maximizing the energy functional. In Section 3.5 we will show how to obtain an equivalent expression for the mean field free energy in relation to the Bethe free energy.
The fact that the distributions 𝐐 are tractable comes with a cost: they are not generally
sufficiently expressive to capture all the information of the true probability distribution 𝐏. Before
we present the simplest mean field algorithm, often called naïve mean field, we introduce the
following notations:
We use the notation (A2) from Appendix A to denote a multivariate random variable that
represents either all the random variables of the Markov random field X = (𝑋1, 𝑋2, … , 𝑋𝑛), or
all the random variables belonging to a specific clique Xc whose potential 𝜙𝑐 would appear in
the joint probability distribution 𝐏 of the random field.
We use the notation (A3) from Appendix A to designate by X−i all the random variables
from X except 𝑋𝑖.
We use the notation 𝐄Y~𝐐 to designate the expectation of probability 𝐐 with respect to all the
random variables of the given Markov random field “contained” in the multivariate random
variable Y. A few examples of this notation are: 𝐄X−i~𝐐 and 𝐄Xc~𝐐 .
3.4.1 The mean field energy functional
The naïve mean field algorithm looks for the distribution 𝐐 closest to the true distribution 𝐏 in
terms of KL(𝐐||𝐏) inside the class of distributions representable as a product of independent
marginals:
𝐐(𝑋1, … , 𝑋𝑛) = Π_{1≤𝑖≤𝑛} 𝐐(𝑋𝑖)   (3.60)
A few observations should be made regarding the equation (3.60).
On the one hand, approximating 𝐏 by a fully factored distribution assumes that all the variables 𝑋1, … , 𝑋𝑛 of 𝐏 are independent of each other in 𝐐. Consequently, this approximation doesn’t capture any of the dependencies that exist in 𝐏 between the variables belonging to a clique Xc, for all Xc ∈ CG, i.e., the dependencies reflected by the clique potentials 𝜙𝑐(Xc) from the equation (3.23).
On the other hand, this distribution is computationally attractive since we can evaluate any query on 𝐐 by a product over terms that involve only the variables in the scope of the query (i.e., the set of variables that appear in the query). Moreover, to represent 𝐐 we only need to describe the marginal probabilities of each of the variables 𝑋1, … , 𝑋𝑛.
In machine learning literature the marginal probabilities 𝐐(𝑋𝑖), where 1 ≤ 𝑖 ≤ 𝑛, are usually
called mean parameters and denoted 𝝁𝒊 [4,5,16-23,27,47].
Before we derive the mean field algorithm, we are going to formulate the energy functional in a
slightly different way. We do this by incorporating the formula (3.60) into the formulae (3.28):
𝐹[𝐏̃, 𝐐] = Σ_{Xc∈CG} 𝐄X~𝐐[ln(𝜙𝑐(Xc))] − Σ_{𝑋1,…,𝑋𝑛} Π_{1≤𝑖≤𝑛} 𝐐(𝑋𝑖) ∙ ln(Π_{1≤𝑖≤𝑛} 𝐐(𝑋𝑖))   (3.61)
In the equation (3.61) the first term of the energy functional is itself a sum of terms of the form 𝐄X~𝐐[ln(𝜙𝑐(Xc))]. In order to evaluate these terms, we can use the equation (3.60) to compute the clique marginal 𝐐(Xc) as a product of single–node marginals, allowing the evaluation of each term to be performed in time linear in the number of random variables of the clique Xc:
𝐐(Xc) = Π_{𝑋𝑖∈Xc} 𝐐(𝑋𝑖) for all Xc ∈ CG, 𝜙𝑐 ∈ ΦG   (3.62)
Then the overall cost of evaluating the first term of (3.61) is linear in the description size of the factors 𝜙𝑐 of 𝐏. For now we cannot expect to do much better.
𝐄X~𝐐[ln(𝜙𝑐(Xc))] = Σ_{xc} 𝐐(xc) ∙ ln(𝜙𝑐(xc)) = Σ_{xc} (Π_{𝑋𝑖∈Xc} 𝐐(𝑥𝑖)) ∙ ln(𝜙𝑐(xc))   (3.63)
where the sum runs over all assignments xc to the clique variables Xc.
The second term of the energy functional in the equation (3.61) is the entropy of 𝑋1, … , 𝑋𝑛 with respect to 𝐐 and, for a fully factored distribution 𝐐, is also decomposable as follows (the proof of this claim can be found in [54]):
𝑆𝐐(𝑋1, … , 𝑋𝑛) = −Σ_{𝑋1,…,𝑋𝑛} Π_{1≤𝑖≤𝑛} 𝐐(𝑋𝑖) ∙ ln(Π_{1≤𝑖≤𝑛} 𝐐(𝑋𝑖))
 = −Σ_{𝑋1,…,𝑋𝑛} Π_{1≤𝑖≤𝑛} 𝐐(𝑋𝑖) ∙ Σ_{1≤𝑖≤𝑛} ln(𝐐(𝑋𝑖))
 = Σ_{1≤𝑖≤𝑛} 𝑆𝐐(𝑋𝑖)   (3.64)
We substitute the appropriate quantities given by the equations (3.63) and (3.64) into the equation (3.61). Finally, the energy functional and the corresponding variational free energy for a fully factored distribution 𝐐 are given by the following formula:
𝐹[𝐏̃, 𝐐] = −𝐹𝑀𝐹[𝐐] = Σ_{Xc∈CG} Σ_{xc} (Π_{𝑋𝑖∈Xc} 𝐐(𝑥𝑖)) ∙ ln(𝜙𝑐(xc)) + Σ_{1≤𝑖≤𝑛} 𝑆𝐐(𝑋𝑖)   (3.65)
where 𝐹𝑀𝐹[𝐐] is the mean field free energy.
The formula (3.65) shows that the energy functional for a fully factored distribution can be written as a sum of expectations, each expectation being defined over a small set of variables; each such set corresponds to a clique potential 𝜙𝑐(Xc) in 𝐏. The complexity of evaluating this form of the energy functional depends on the size of the factors 𝜙𝑐(Xc) in 𝐏 and not on the topology of the Markov network. Thus, the energy functional can be represented and manipulated effectively, even in Markov networks that would require exponential time for exact inference.
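The decomposition above can be verified on a toy pairwise network (hypothetical potentials, chosen only for illustration) by evaluating the energy functional twice: once by brute-force expectation over all joint states, and once clique-by-clique using only the single-node marginals of 𝐐:

```python
import itertools
import math

# Hypothetical pairwise network on 3 binary nodes: one potential table per edge clique
phi = {(0, 1): {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 0.5, (1, 1): 3.0},
       (1, 2): {(0, 0): 1.5, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 0.8}}
q = [0.7, 0.4, 0.6]                      # factored marginals Q(X_i = 1)

def q1(i, xi):                           # single-node marginal Q(x_i)
    return q[i] if xi else 1 - q[i]

# brute force: F[P~,Q] = E_Q[sum_c ln phi_c] + S_Q, summing over all 2^3 joint states
brute = 0.0
for x in itertools.product([0, 1], repeat=3):
    Qx = math.prod(q1(i, x[i]) for i in range(3))
    brute += Qx * (sum(math.log(t[(x[i], x[j])]) for (i, j), t in phi.items())
                   - math.log(Qx))

# clique-local evaluation: each expectation needs only the clique's own marginals,
# and the entropy decomposes into single-node entropies
local = sum(q1(i, a) * q1(j, b) * math.log(t[(a, b)])
            for (i, j), t in phi.items() for (a, b) in t)
local += -sum(q1(i, 0) * math.log(q1(i, 0)) + q1(i, 1) * math.log(q1(i, 1))
              for i in range(3))

assert abs(brute - local) < 1e-12
```

The clique-local form touches each potential entry once, which is what makes the energy functional cheap to evaluate even when the joint state space is exponentially large.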
3.4.2 Maximizing the energy functional: fixed–point characterization
In Section 3.2 we showed that, instead of searching for a good approximation 𝐐 of the true
probability 𝐏, we could use a variational approach to either maximize the energy functional or
minimize either the corresponding variational free energy or the KL–divergence. Each of these
approaches transforms the original problem – approximate inference in a Markov random field –
into an optimization problem. An interesting aspect of these optimization problems is the fact
that, instead of approximating the objective, they approximate the optimization space. This is done by starting with a class of distributions:
ℚ = {𝐐𝐢 = 𝐐(𝑋𝑖) ∶ 1 ≤ 𝑖 ≤ 𝑛}   (3.66)
that generally doesn’t contain the true distribution 𝐏. Then, the distribution of this class that complies with the type of optimization performed and with the imposed constraints represents an approximation of the true probability of the underlying Markov network.
A formal description of the optimization problem “maximization of energy functional” follows.
Problem Mean Field Approximation:
Given: ℚ = {𝐐𝐢 = 𝐐(𝑋𝑖) ∶ 1 ≤ 𝑖 ≤ 𝑛}
Find: 𝐐 ∈ ℚ
by maximizing: 𝐹[𝐏̃, 𝐐]
subject to: 𝐐(𝑋1, … , 𝑋𝑛) = Π_𝑖 𝐐(𝑋𝑖) and Σ_{𝑥𝑖} 𝐐(𝑥𝑖) = 1 for all 𝑖
The following theorem and corollaries provide a set of fixed–point equations that characterize
the stationary points of Mean Field Approximation. These theoretical results are taken from
[54] and adapted to our notations and conventions. We provide the proofs only for Theorem 3.2
and Corollary 3.5. We also provide our interpretation of these theoretical results.
Theorem 3.2:
The marginal distribution 𝐐(𝑋𝑖) is a local maximum of Mean Field Approximation given
{𝐐(𝑋𝑗)}𝑗≠𝑖 if and only if:
𝐐(𝑥𝑖) = (1/𝑍𝑖) ∙ exp{Σ_{𝜙𝑐∈ΦG} 𝐄X~𝐐[ln 𝜙𝑐 | 𝑥𝑖]}
or, equivalently:   (3.67)
𝐐(𝑥𝑖) = (1/𝑍𝑖) ∙ exp{Σ_{Xc∈CG} 𝐄X~𝐐[ln(𝜙𝑐(Xc)) | 𝑥𝑖]}
where 𝑍𝑖 is a local normalizing constant and 𝐄X~𝐐[ln 𝜙𝑐 | 𝑥𝑖] is the conditional expectation for a given value 𝑥𝑖 of 𝑋𝑖:
𝐄X~𝐐[ln 𝜙𝑐 | 𝑥𝑖] = Σ_{xc} 𝐐(xc | 𝑥𝑖) ∙ ln(𝜙𝑐(xc))
where the sum runs over all assignments xc to Xc.
Proof: The proof of this theorem relies on proving the fixed–point characterization of the
individual marginal 𝐐(𝑋𝑖) in terms of the other components 𝐐(𝑋1),…, 𝐐(𝑋𝑖−1), 𝐐(𝑋𝑖+1),…, 𝐐(𝑋𝑛)
as specified in the equation (3.67).
We first consider the restriction of the objective 𝐹[𝐏̃, 𝐐] to those terms that involve 𝐐(𝑋𝑖):
𝐹𝑖[𝐐] = Σ_{Xc∈CG, 𝑋𝑖∈Xc} 𝐄Xc~𝐐[ln(𝜙𝑐(Xc))] + 𝑆𝐐(𝑋𝑖)   (3.68)
To optimize 𝐐(𝑋𝑖), we define the Lagrangian that consists of all the terms in 𝐹[𝐏̃, 𝐐] that involve 𝐐(𝑋𝑖):
𝐿𝑖[𝐐] = Σ_{Xc∈CG, 𝑋𝑖∈Xc} 𝐄Xc~𝐐[ln(𝜙𝑐(Xc))] + 𝑆𝐐(𝑋𝑖) + 𝜆 ∙ (Σ_{𝑥𝑖} 𝐐(𝑥𝑖) − 1)
where 𝜆 is a Lagrange multiplier that corresponds to the constraint that 𝐐(𝑋𝑖) is a distribution.
We now take derivatives with respect to 𝐐(𝑥𝑖). The following result, whose proof we do not
provide, plays an important role in the remainder of the derivation.
Lemma 3.3:
If 𝐐(X) = Π_𝑖 𝐐(𝑋𝑖), then for any function 𝑓 with scope 𝒰:
𝜕𝐄𝒰~𝐐[𝑓(𝒰)] / 𝜕𝐐(𝑥𝑖) = 𝐄𝒰~𝐐[𝑓(𝒰) | 𝑥𝑖]
Using Lemma 3.3 and standard derivatives of entropies, we see that:
𝜕𝐿𝑖/𝜕𝐐(𝑥𝑖) = Σ_{Xc∈CG} 𝐄X~𝐐[ln(𝜙𝑐(Xc)) | 𝑥𝑖] − ln 𝐐(𝑥𝑖) − 1 + 𝜆
Setting the derivative to 0 and rearranging terms we get that:
ln 𝐐(𝑥𝑖) = 𝜆 − 1 + Σ_{Xc∈CG} 𝐄X~𝐐[ln(𝜙𝑐(Xc)) | 𝑥𝑖]
We take exponents of both sides and renormalize; because 𝜆 is constant relative to 𝑥𝑖, it drops
out in the renormalization, so that we obtain the formula (3.67).
Theorem 3.2 shows only that the solution of the equation (3.67) is a stationary point of the objective (3.68). To prove that it is a maximum, we note that the equation (3.68) is a sum of two terms: Σ_{Xc∈CG} 𝐄X~𝐐[ln(𝜙𝑐(Xc))], which is linear in 𝐐(𝑋𝑖) given all the other components 𝐐(𝑋𝑗), 𝑗 ≠ 𝑖; and 𝑆𝐐(𝑋𝑖), which is a concave function of 𝐐(𝑋𝑖). As a whole, given the other components of 𝐐, the function 𝐹𝑖 is concave in 𝐐(𝑋𝑖) and, therefore, has a unique global optimum, which is easily verified to be the solution of the equation (3.67) rather than any of the extremal points [54].
The following two corollaries help characterize the stationary points of Mean Field
Approximation.
Corollary 3.4:
The distribution 𝐐 is a stationary point of Mean Field Approximation if and only if, for each 𝑋𝑖,
the equation (3.67) holds.
Corollary 3.5:
In the mean field approximation, 𝐐(𝑥𝑖) is locally optimal only if:
𝐐(𝑥𝑖) = (1/𝑍𝑖) ∙ exp{𝐄X−i~𝐐[ln 𝐏(𝑥𝑖 | X−i)]}   (3.69)
where 𝑍𝑖 is a normalizing constant.
Proof: We recall that 𝐏̃ = Π_{𝜙𝑐∈ΦG} 𝜙𝑐 = Π_{Xc∈CG} 𝜙𝑐(Xc) is the unnormalized measure defined by ΦG and CG. Due to the linearity of expectation we have:
Σ_{𝜙𝑐∈ΦG} 𝐄X~𝐐[ln 𝜙𝑐 | 𝑥𝑖] = 𝐄X~𝐐[ln 𝐏̃(𝑋𝑖, X−i) | 𝑥𝑖]
Because 𝐐 is a product of marginals, we can rewrite 𝐐(X−i | 𝑥𝑖) as 𝐐(X−i) and get that:
𝐄X~𝐐[ln 𝐏̃(𝑋𝑖, X−i) | 𝑥𝑖] = 𝐄X−i~𝐐[ln 𝐏̃(𝑥𝑖, X−i)]
Using properties of conditional distributions, it follows that:
𝐏̃(𝑥𝑖, X−i) = 𝑍 ∙ 𝐏(𝑥𝑖, X−i) = 𝑍 ∙ 𝐏(X−i) ∙ 𝐏(𝑥𝑖 | X−i)
We conclude that:
Σ_{𝜙𝑐∈ΦG} 𝐄X~𝐐[ln 𝜙𝑐 | 𝑥𝑖] = 𝐄X−i~𝐐[ln 𝐏(𝑥𝑖 | X−i)] + 𝐄X−i~𝐐[ln(𝐏(X−i) ∙ 𝑍)]
Plugging this equality into the update equation (3.67) we get that:
𝐐(𝑥𝑖) = (1/𝑍𝑖) ∙ exp{𝐄X−i~𝐐[ln 𝐏(𝑥𝑖 | X−i)]} ∙ exp{𝐄X−i~𝐐[ln(𝐏(X−i) ∙ 𝑍)]}
The term ln(𝐏(X−i) ∙ 𝑍) does not depend on the value of 𝑥𝑖. Moreover, multiplying a marginal by a constant factor does not change the joint distribution: since the distribution is renormalized at the end to sum to 1, the constant is absorbed into the normalizing function. Therefore, the constant term can simply be ignored, and the formula we obtain is exactly the formula (3.69).
Corollary 3.5 shows that the marginal of 𝑋𝑖 in 𝐐, i.e., 𝐐(𝑋𝑖), is the geometric average of the
conditional probability of 𝑥𝑖 given all other variables in the domain. The average is based on the
probability that 𝐐 assigns to all possible assignments to the variables in the domain. In this
sense, the mean field approximation requires that the marginal of 𝑋𝑖 be “consistent” with the
marginals of other variables [54].
Comparatively, the marginal of 𝑋𝑖 in 𝐏 can be represented as an arithmetic average:
𝐏(𝑥𝑖) = Σ_{𝑥−𝑖} 𝐏(𝑥−i) ∙ 𝐏(𝑥𝑖 | 𝑥−i) = 𝐄X−i~𝐏[𝐏(𝑥𝑖 | X−i)]   (3.70)
In general, the geometric average tends to produce marginals that are more sharply peaked than the original marginals in 𝐏. More significantly, the expectation in the equation (3.69) is relative to the approximating distribution 𝐐, while the expectation in the equation (3.70) is relative to the true distribution 𝐏. This should not be interpreted as meaning that 𝐐 approximates the marginals of 𝐏 well [54].
3.4.3 Maximizing the energy functional: the naïve mean field algorithm
We start by observing that, if a clique with potential function 𝜙𝑐 and set of nodes Xc doesn’t contain the variable 𝑋𝑖, i.e., 𝑋𝑖 ∉ Xc, then:
𝐄X~𝐐[ln(𝜙𝑐(Xc)) | 𝑥𝑖] = 𝐄Xc~𝐐[ln(𝜙𝑐(Xc))] when 𝑋𝑖 ∉ Xc   (3.71)
Hence, expectation terms on such factors are independent of the value of 𝑋𝑖. Consequently, we can absorb them into the normalization constant 𝑍𝑖 and obtain the following simplification.
Corollary 3.6:
In the mean field approximation 𝐐(𝑥𝑖) is locally optimal only if:
𝐐(𝑥𝑖) = (1/𝑍𝑖) ∙ exp{Σ_{Xc∈CG, 𝑋𝑖∈Xc} 𝐄Xc−{𝑋𝑖}~𝐐[ln(𝜙𝑐(Xc − {𝑋𝑖}, 𝑥𝑖))]}   (3.72)
where 𝑍𝑖 is the normalization constant.
The equation (3.72) shows that 𝐐(𝑥𝑖) has to be consistent with the expectation of the potentials in which it appears. The characterization of Corollary 3.6 is very useful for converting the fixed–point equations (3.67) into an algorithm that maximizes 𝐹[𝐏̃, 𝐐]. All the terms on the right–hand side of the equation (3.72) involve expectations of variables other than 𝑋𝑖 and do not depend on the choice of 𝐐(𝑋𝑖). We can achieve equality simply by evaluating the exponential terms for each value 𝑥𝑖, normalizing the results to sum to 1, and then assigning them to 𝐐(𝑋𝑖). As a consequence, we reach the optimal value of 𝐐(𝑋𝑖) in one step.
The last statement should be interpreted with some care. The resulting value for 𝐐(𝑋𝑖) is its
optimal value given the choice of all other marginals. Thus, this step optimizes the function
𝐹[��, 𝐐] relative only to one single coordinate in the space – the marginal of 𝐐(𝑋𝑖). To optimize
the function in its entirety, we need to optimize relative to all the coordinates. We can embed
this step in an iterated coordinate ascent algorithm, which repeatedly maximizes a single
marginal at a time, given fixed choices to all of the others. The result is Algorithm 3.1.
Algorithm 3.1: Naïve Mean Field Approximation
Given: CG, ΦG, 𝐐𝟎 // the initial choice of 𝐐
begin
Step 1: 𝐐 ← 𝐐𝟎
Step 2: 𝑢𝑛𝑝𝑟𝑜𝑐𝑒𝑠𝑠𝑒𝑑 ← X = (𝑋1, 𝑋2, … , 𝑋𝑛)
Step 3: while 𝑢𝑛𝑝𝑟𝑜𝑐𝑒𝑠𝑠𝑒𝑑 ≠ ∅ do
Step 4:   choose 𝑋𝑖 from 𝑢𝑛𝑝𝑟𝑜𝑐𝑒𝑠𝑠𝑒𝑑
Step 5:   𝐐𝐨𝐥𝐝(𝑋𝑖) ← 𝐐(𝑋𝑖)
Step 6:   for 𝑥𝑖 ∈ val(𝑋𝑖) do // iterate over all possible values of the random variable 𝑋𝑖
Step 7:     𝐐(𝑥𝑖) ← exp(Σ_{Xc∈CG, 𝑋𝑖∈Xc} 𝐄Xc−{𝑋𝑖}~𝐐[ln 𝜙𝑐(Xc − {𝑋𝑖}, 𝑥𝑖)])
          end for // 𝑥𝑖
Step 8:   normalize 𝐐(𝑋𝑖) to sum to 1
Step 9:   if 𝐐𝐨𝐥𝐝(𝑋𝑖) ≠ 𝐐(𝑋𝑖) then
Step 10:    𝑢𝑛𝑝𝑟𝑜𝑐𝑒𝑠𝑠𝑒𝑑 ← 𝑢𝑛𝑝𝑟𝑜𝑐𝑒𝑠𝑠𝑒𝑑 ∪ (⋃_{Xc∈CG, 𝑋𝑖∈Xc} Xc)
Step 11:  𝑢𝑛𝑝𝑟𝑜𝑐𝑒𝑠𝑠𝑒𝑑 ← 𝑢𝑛𝑝𝑟𝑜𝑐𝑒𝑠𝑠𝑒𝑑 − {𝑋𝑖}
          end while
          return 𝐐
end
Importantly, a single optimization doesn’t usually suffice; a subsequent modification to another
marginal 𝐐(𝑋𝑗) may result in a different optimal parameterization for 𝐐(𝑋𝑖). Therefore, the
algorithm repeats these steps until convergence. A key property of the coordinate ascent
procedure is that each step leads to an increase in the energy functional. Hence, any iteration of
Algorithm 3.1 results in a better approximation of the true distribution 𝐏.
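For a binary pairwise network, the coordinate updates of Algorithm 3.1 take a particularly simple closed form: with ln 𝜙𝑖𝑗 = 𝑤𝑖𝑗∙𝑥𝑖∙𝑥𝑗 and ln 𝜙𝑖 = −𝜃𝑖∙𝑥𝑖, the update (3.72) reduces to a logistic function of the local field. The sketch below (hypothetical weights; a simplified round-robin sweep instead of the unprocessed-set bookkeeping of Algorithm 3.1) runs the coordinate ascent and checks that each sweep does not decrease the energy functional:

```python
import math

# Hypothetical pairwise model on {0,1}: ln phi_ij = w_ij*xi*xj, ln phi_i = -theta_i*xi
w = {(0, 1): 1.2, (1, 2): -0.7, (2, 3): 0.9}
theta = [0.1, -0.4, 0.3, 0.0]
n = len(theta)
nbrs = {i: [] for i in range(n)}
for (i, j), wij in w.items():
    nbrs[i].append((j, wij))
    nbrs[j].append((i, wij))

q = [0.5] * n                                     # Q(X_i = 1); the initial choice Q0

def energy_functional(q):
    """F[P~,Q] = E_Q[ln P~] + S_Q for the fully factored Q, cf. eq. (3.65)."""
    F = (sum(wij * q[i] * q[j] for (i, j), wij in w.items())
         - sum(t * qi for t, qi in zip(theta, q)))
    for qi in q:                                  # entropy decomposes per node, eq. (3.64)
        for p in (qi, 1 - qi):
            if p > 0:
                F -= p * math.log(p)
    return F

prev = energy_functional(q)
for sweep in range(100):
    for i in range(n):                            # coordinate ascent via update (3.72)
        field = -theta[i] + sum(wij * q[j] for j, wij in nbrs[i])
        q[i] = 1 / (1 + math.exp(-field))         # normalized exp of the two local values
    cur = energy_functional(q)
    assert cur >= prev - 1e-12                    # each sweep cannot decrease F[P~,Q]
    if abs(cur - prev) < 1e-10:
        break
    prev = cur
```

At convergence the marginals satisfy the fixed-point equations: each `q[i]` equals the logistic function of its own local field.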
Theorem 3.7:
Algorithm 3.1 is guaranteed to converge. Moreover, the distribution returned by the algorithm is a stationary point of 𝐹[𝐏̃, 𝐐], subject to the constraint that 𝐐(X) = Π_{1≤𝑖≤𝑛} 𝐐(𝑋𝑖) is a distribution.
The proof of Theorem 3.7 is outside the scope of this paper; it can be found in [54]. The distribution returned by Algorithm 3.1 is a stationary point of the energy functional, so in principle it could be a local maximum, a local minimum, or a saddle point. However, local minima and saddle points are not stable convergence points of the algorithm, in the sense that a small perturbation of 𝐐 followed by optimization will lead to a better convergence point [54]. Because the algorithm is unlikely to land precisely on such an unstable point and remain stuck there, in practice the convergence points of the algorithm are local maxima, though not necessarily global maxima [54].
3.5 Bethe approximation
The Bethe approximation is an approximation of the free energy similar in spirit to the mean field approximation. It reduces the problem of computing the partition function in a Markov random field to that of solving a set of non–linear equations – the Bethe fixed–point equations [58]. In this section we start by introducing the Bethe free energy and its
fixed–point equations [58]. In this section we start by introducing the Bethe free energy and its
“close relative” the Bethe–Gibbs free energy. Then, we describe briefly the belief propagation
(BP) algorithm and we present the theoretical result due to Yedidia et al. [28,35] that establishes
the connection between BP fixed–points and Bethe free energy. Because this result is
considered fundamental for approximate inference research, we include its original proof. We
end this section by introducing a new approximate inference algorithm, due to Welling and Teh
[4], which is based on the Bethe free energy and called belief optimization (BO).
We assume that we are given a pairwise Markov network with binary variables described by the
equations (2.11) to (2.13) and (2.15). We rewrite the equation (2.11) by taking into consideration
the fact that the potential functions, given by the equation (2.15), are either node potentials or
edge potentials:
𝐏(X) = (1/𝑍) ∙ Π_{Xc∈CG} 𝜙𝑐(Xc) = (1/𝑍) ∙ Π_𝑖 𝜙𝑖(𝑋i) ∙ Π_{𝑖,𝑗} 𝜙𝑖𝑗(𝑋i, 𝑋𝑗)   (3.73)
where: 𝜙𝑖(𝑋i) is the local “evidence” for node 𝑖; 𝜙𝑖𝑗(𝑋i, 𝑋𝑗) is the clique potential that corresponds to the edge connecting the nodes 𝑖 and 𝑗; and 𝑍 is the partition function. Any fixed evidence node is subsumed into our definition of 𝜙𝑖(𝑋i).
We denote by 𝑝𝑖 the marginal probabilities over singleton variables and by 𝑝𝑖𝑗 the pairwise
marginal probabilities over pairs of variables that correspond to edges in the underlying graph:
𝑝𝑖 = MARG(𝐏, 𝑋𝑖) = Σ_{X−i} 𝐏(𝑋1, … , 𝑋𝑖, … , 𝑋𝑛)   (3.74)
𝑝𝑖𝑗 = MARG(𝐏, 𝑋𝑖𝑋𝑗) = Σ_{X−{𝑋𝑖,𝑋𝑗}} 𝐏(𝑋1, … , 𝑋𝑖, … , 𝑋𝑗, … , 𝑋𝑛)
The majority of theoretical results presented in this section come from [4, 28-29, 35-36].
3.5.1 The Bethe free energy
The original formula for the Bethe free energy, proposed by Yedidia et al. in [28-29,35-36], relies
on a minimal canonical representation for the Markov network. An alternative form of the Bethe
free energy that relies on the mean parameterization of the Markov network was introduced by
Wainwright et al. in [47].
The Bethe free energy is the Gibbs free energy obtained by truncating the Plefka expansion of
the free energy (equation (3.46)) in the second order and minimizing it with respect to single
node marginals and pairwise marginals. Unlike the mean field free energy, which depends only
on approximate marginals at single nodes, the Bethe free energy depends on approximate
marginals at single nodes as well as approximate marginals on edges.
To understand the relationship between the Bethe free energy and the mean field free energy,
we define a “close relative” of both by imposing additional constraints on the Bethe free energy.
The Gibbs free energy resulted from this additional optimization is called the Bethe–Gibbs free
energy. To distinguish between the Bethe free energy and the Bethe–Gibbs free energy, we
denote the former by 𝒢𝛽 and the latter by 𝐺𝛽.
We assume that we work under a set of hypotheses similar to the ones described by (3.47) to (3.52), except that Y = X. The constraints are represented by the set of all 𝑝𝑖 and the set of all 𝑝𝑖𝑗 defined in (3.74). Formally, the Bethe free energy is defined as:
𝒢𝛽[{𝑝𝑖}, {𝑝𝑖𝑗}] = min𝐐{𝐹[𝐐] ∶ MARG(𝐐, 𝑋𝑖) = 𝑝𝑖 and MARG(𝐐, 𝑋𝑖𝑋𝑗) = 𝑝𝑖𝑗} (3.75)
where MARG(𝐐, 𝑋𝑖) denotes the singleton marginal probability of 𝐐 with respect to 𝑋𝑖 and
MARG(𝐐, 𝑋𝑖𝑋𝑗) denotes the pairwise marginal probability of 𝐐 with respect to the variables 𝑋𝑖
and 𝑋𝑗, whose corresponding nodes are connected in the Markov network.
In order to compute the Bethe free energy of a binary pairwise Markov network, in this section
we follow Approach 2 described in Section 3.3.2. As previously mentioned, we can assume
that the constant pseudo–temperature at equilibrium is 1. Therefore:
𝒢𝛽[𝐏(𝑋1, … , 𝑋𝑛)] = 𝑈𝛽 − 𝑆𝛽 (3.76)
In Section 2.3 we learned that the energy of a pairwise Markov network is a quadratic function of the states, so the internal energy given by the formula (2.18) can be computed exactly in terms of {𝑝𝑖} and {𝑝𝑖𝑗} [4]:
𝑈𝛽 = 𝐄𝐏[𝐸[{𝑝𝑖}, {𝑝𝑖𝑗}]] = 𝐸[{𝑝𝑖}, {𝑝𝑖𝑗}]   (3.77)
where:
𝐸[{𝑝𝑖}, {𝑝𝑖𝑗}] = −Σ_{{𝑖,𝑗}} 𝑝𝑖𝑗 ∙ 𝑤𝑖𝑗 + Σ_𝑖 𝑝𝑖 ∙ 𝜃𝑖
This means that the computation of the Bethe free energy (3.76) requires an approximation only
for the entropy term (equation (2.19)). The idea is that we want to correct the mean field
approximation which overestimates the entropy due to its assumption that all nodes are
independent. The natural next step is to take pairwise dependencies into account. But just
adding all pairwise entropy contributions to the mean field approximation would clearly over–
count the entropy contributions at the nodes. Correcting for this over–counting gives the
following approximation to the entropy [4]:
𝑆𝛽[{𝑝𝑖}, {𝑝𝑖𝑗}] = Σ_𝑖 𝑆𝑖 + Σ_{{𝑖,𝑗}} (𝑆𝑖𝑗 − 𝑆𝑖 − 𝑆𝑗) = Σ_𝑖 𝑆𝑖 ∙ (1 − 𝑑𝑖) + Σ_{{𝑖,𝑗}} 𝑆𝑖𝑗   (3.78)
where: 𝑑𝑖 is the degree of node 𝑖, i.e., the number of neighbors of node 𝑖; 𝑆𝑖 is the mean field
entropy for node 𝑖; and 𝑆𝑖𝑗 is the pairwise entropy. The mean field entropy 𝑆𝑖 and the pairwise
entropy 𝑆𝑖𝑗 can be written as:
S_i = -\left(p_i \cdot \ln p_i + (1 - p_i) \cdot \ln(1 - p_i)\right) \quad (3.79)

S_{ij} = -\left(p_{ij} \cdot \ln p_{ij} + (p_{ij} + 1 - p_i - p_j) \cdot \ln(p_{ij} + 1 - p_i - p_j) + (p_i - p_{ij}) \cdot \ln(p_i - p_{ij}) + (p_j - p_{ij}) \cdot \ln(p_j - p_{ij})\right) \quad (3.80)
The Bethe free energy is obtained by substituting (3.77) and (3.79)–(3.80) into (2.21):
\mathcal{G}_\beta[\{p_i\},\{p_{ij}\}] = E[\{p_i\},\{p_{ij}\}] - S_\beta[\{p_i\},\{p_{ij}\}] \quad (3.81)

\mathcal{G}_\beta = -\sum_{\{i,j\}} p_{ij} \cdot w_{ij} + \sum_{i} p_i \cdot \theta_i + \sum_{i} (1 - d_i) \cdot \left(p_i \cdot \ln p_i + (1 - p_i) \cdot \ln(1 - p_i)\right) + \sum_{\{i,j\}} \left(p_{ij} \cdot \ln p_{ij} + (p_{ij} + 1 - p_i - p_j) \cdot \ln(p_{ij} + 1 - p_i - p_j) + (p_i - p_{ij}) \cdot \ln(p_i - p_{ij}) + (p_j - p_{ij}) \cdot \ln(p_j - p_{ij})\right) \quad (3.82)
The expression (3.78) for the entropy is exact when the underlying graph is a tree. Since the
expression (3.77) for the energy is exact for general Boltzmann machines, it is also exact on
Boltzmann trees. Consequently, the Bethe free energy (3.82) is exact on trees [4]. If the
underlying graph has loops, then the distribution corresponding to the energy given by (3.82) is
not always a properly normalized probability distribution [4]. Therefore, the Bethe free energy is
not necessarily an upper bound for the true free energy 𝐹 [4]; in other words, it does not fall into
the category of variational free energies characterized by (3.36) and (3.37). So, when can we
expect the Bethe free energy to be a good approximation of the free energy of the system? The
above argument suggests that this should be the case when the graph is “close to a tree”, i.e.,
when the graph contains few short loops. If the underlying graph has tight loops, evidence
impinging on one node can travel around these loops and return to the original node,
causing it to be over–counted [4].
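The pieces above — the exact energy (3.77), the corrected entropy (3.78)–(3.80), and their combination in (3.82) — can be evaluated directly. A minimal Python sketch, on a hypothetical small network whose marginals and parameters are given as dictionaries (the data layout is illustrative, not prescribed by the text):

```python
import math

def bethe_free_energy(w, theta, p, pij, neighbors):
    """Bethe free energy (3.82) of a binary pairwise Markov network.

    w[(i, j)]    : pairwise weight for edge {i, j}      -- illustrative layout
    theta[i]     : threshold of unit i
    p[i]         : singleton marginal p_i = P(X_i = 1)
    pij[(i, j)]  : pairwise marginal p_ij = P(X_i = 1, X_j = 1)
    neighbors[i] : set of neighbors of node i (degree d_i)
    """
    xlogx = lambda x: x * math.log(x) if x > 0 else 0.0
    # internal energy (3.77): exact in terms of {p_i}, {p_ij}
    energy = -sum(pij[e] * w[e] for e in w) + sum(p[i] * theta[i] for i in p)
    # entropy (3.78): mean field terms (3.79) weighted by (1 - d_i) ...
    s_single = {i: -(xlogx(p[i]) + xlogx(1 - p[i])) for i in p}
    entropy = sum((1 - len(neighbors[i])) * s_single[i] for i in p)
    # ... plus pairwise entropies (3.80)
    for (i, j) in w:
        q = pij[(i, j)]
        entropy -= (xlogx(q) + xlogx(q + 1 - p[i] - p[j])
                    + xlogx(p[i] - q) + xlogx(p[j] - q))
    return energy - entropy   # (3.76): G_beta = U_beta - S_beta

# single edge, zero parameters, independent uniform marginals:
# a tree, so the Bethe free energy is exact (-ln Z = -ln 4)
g = bethe_free_energy({(0, 1): 0.0}, {0: 0.0, 1: 0.0},
                      {0: 0.5, 1: 0.5}, {(0, 1): 0.25},
                      {0: {1}, 1: {0}})
```

On this one-edge example (a tree) the returned value coincides with the exact free energy, as the section states.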
3.5.2 The Bethe–Gibbs free energy
In order to improve the approximation of the free energy, the Bethe free energy has been
studied in connection with a well–known free energy: the mean field free energy. Welling and
Teh proved that the mean field free energy is a small weight expansion of the Bethe free energy
[4], which suggests that the Bethe free energy should be accurate for small weights and should
improve on the mean field energy [4]. In this section we use a different approach to explore the
relationship between the Bethe free energy and the mean field free energy: via the Bethe–Gibbs
free energy.
We recall the mean field approximation of the free energy (equation (3.65)) and we observe that
the expression of the entropy 𝑆𝐐(𝑋𝑖) is the same as the mean field entropy 𝑆𝑖 given by (3.79).
We also note that the independent marginal 𝐐(𝑋𝑖) corresponds to 𝑝𝑖 given by (3.74) and the
clique potential 𝜙𝑐(Xc) corresponds to 𝜙𝑖𝑗(𝑋i, 𝑋𝑗) given by (3.73). Hence, the mean field free
energy can be written as:
F_{MF}[\mathbf{Q}] = -\sum_{X_1,\dots,X_n} \left(\prod_{X_i \in \mathbf{X}_c} \mathbf{Q}(X_i)\right) \cdot \ln \phi_c(\mathbf{X}_c) - \sum_{1 \le i \le n} S_{\mathbf{Q}}(X_i)

F_{MF}(\{p_i\}) = -\sum_{i,j} p_i \cdot p_j \cdot \ln \phi_{ij}(X_i, X_j) + \sum_{i} \left(p_i \cdot \ln p_i + (1 - p_i) \cdot \ln(1 - p_i)\right) \quad (3.83)
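For comparison, the mean field free energy (3.83) depends only on the singleton marginals. A minimal sketch, under our own simplifying assumption (not made in the text) that ln φ_ij is supplied only for the state in which both variables are on:

```python
import math

def mean_field_free_energy(p, logphi):
    """Mean field free energy (3.83) of a binary pairwise Markov network.

    p[i]           : independent singleton marginal p_i = Q(X_i = 1)
    logphi[(i, j)] : ln(phi_ij) for the both-on state; we assume, as a
                     simplification, that phi_ij = 1 on the other states.
    """
    xlogx = lambda x: x * math.log(x) if x > 0 else 0.0
    energy = -sum(p[i] * p[j] * logphi[(i, j)] for (i, j) in logphi)
    # negative mean field entropy, summed over all nodes
    return energy + sum(xlogx(p[i]) + xlogx(1 - p[i]) for i in p)
```

With a trivial potential (logphi = 0) and uniform marginals this agrees with the Bethe value on the same one-edge network, since both are exact there.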
Next, we convert the Bethe free energy given by equations (3.81)–(3.82) into a more
constrained Gibbs free energy called the Bethe–Gibbs free energy. We do this by imposing
additional constraints on the Bethe free energy 𝒢𝛽: we minimize 𝒢𝛽 with respect to the
parameters {𝑝𝑖𝑗} and then solve for {𝑝𝑖𝑗} exactly in terms of the parameters {𝑝𝑖}.
The minimization is done, as usual, by taking the derivatives of the Bethe free energy with
respect to {𝑝𝑖𝑗} and setting them to zero:
\frac{\partial \mathcal{G}_\beta}{\partial p_{ij}} = -w_{ij} + \ln\left(\frac{p_{ij} \cdot (p_{ij} + 1 - p_i - p_j)}{(p_i - p_{ij}) \cdot (p_j - p_{ij})}\right) = 0 \quad (3.84)
This can be simplified to a quadratic equation:
\alpha_{ij} \cdot p_{ij}^2 - (1 + \alpha_{ij} \cdot p_i + \alpha_{ij} \cdot p_j) \cdot p_{ij} + (1 + \alpha_{ij}) \cdot p_i \cdot p_j = 0 \quad (3.85)
where we have defined:
\alpha_{ij} = \exp(w_{ij}) - 1 \quad (3.86)
In addition to this equation we have to make sure that 𝑝𝑖𝑗 satisfies the following bounds:
max(0, 𝑝𝑖 + 𝑝𝑗 − 1) ≤ 𝑝𝑖𝑗 ≤ min(𝑝𝑖, 𝑝𝑗) (3.87)
These bounds can be understood by noting that, by definition, none of the probabilities
𝑝𝑖𝑗, 𝑝𝑖 − 𝑝𝑖𝑗, 𝑝𝑗 − 𝑝𝑖𝑗, and 𝑝𝑖𝑗 + 1 − 𝑝𝑖 − 𝑝𝑗 appearing in (3.80) can be negative. The following
theorem guarantees the desired unique solution for {𝑝𝑖𝑗}.
62
Theorem 3.8:
There is exactly one solution to the quadratic equation (3.85) that minimizes the Bethe free
energy and satisfies the bounds (3.87). The analytic expression of this solution is:
p_{ij} = \frac{1}{2\,\alpha_{ij}} \left(Q_{ij} - \sqrt{Q_{ij}^2 - 4\,\alpha_{ij}\,(1 + \alpha_{ij})\, p_i\, p_j}\right), \quad \text{where: } Q_{ij} = 1 + \alpha_{ij} \cdot p_i + \alpha_{ij} \cdot p_j \quad (3.88)
Moreover, the parameters 𝑝𝑖𝑗 will never actually saturate one of the bounds.
The proof of this theorem is outside the scope of this paper. A proof can be found in [4].
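The closed form (3.86)–(3.88) is easy to evaluate numerically. A sketch; the guard for 𝑤𝑖𝑗 ≈ 0, where 𝛼𝑖𝑗 vanishes and 𝑝𝑖𝑗 reduces to the independent product 𝑝𝑖 ∙ 𝑝𝑗, is our own addition to avoid division by zero:

```python
import math

def pairwise_marginal(p_i, p_j, w_ij):
    """Unique admissible root (3.88) of the quadratic (3.85)."""
    a = math.exp(w_ij) - 1.0              # alpha_ij, (3.86)
    if abs(a) < 1e-12:                    # w_ij ~ 0: independent marginals
        return p_i * p_j
    q = 1.0 + a * (p_i + p_j)             # Q_ij in (3.88)
    pij = (q - math.sqrt(q * q - 4.0 * a * (1.0 + a) * p_i * p_j)) / (2.0 * a)
    # per Theorem 3.8 the root lies strictly inside the bounds (3.87)
    assert max(0.0, p_i + p_j - 1.0) < pij < min(p_i, p_j)
    return pij
```

A quick sanity check: substituting the returned 𝑝𝑖𝑗 back into (3.84) recovers exp(𝑤𝑖𝑗) for the log-ratio, confirming stationarity.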
Thus, by inserting the expression for {𝑝𝑖𝑗} given by (3.88) into the Bethe free energy given by
(3.82), we obtain the analytic expression of the Bethe–Gibbs free energy 𝐺𝛽 (also called the
Gibbs free energy in the Bethe approximation). Rather than the full analytic expression for
𝐺𝛽[{𝑝𝑖}], we give a simpler expression that highlights how 𝐺𝛽 depends on {𝑝𝑖} and how that
dependency arises:
G_\beta[\{p_i\}] = \min_{\{p_{ij}\}} \mathcal{G}_\beta[\{p_i\},\{p_{ij}\}] = \mathcal{G}_\beta[\{p_i\},\{p_{ij}(p_i, p_j)\}] \quad (3.89)
We observe that the mean field free energy 𝐹𝑀𝐹({𝑝𝑖}) given by (3.83) and the Bethe–Gibbs free
energy 𝐺𝛽[{𝑝𝑖}] given by (3.89) are similar in spirit, so they might behave similarly in
approximate inference algorithms concerned with singleton marginals. In Section 3.5.4 we will
elaborate upon this topic.
3.5.3 The relationship between belief propagation fixed–points and Bethe free energy
The belief propagation (BP) or the sum–product algorithm is an efficient local message passing
algorithm for exact inference on trees or, generally, on graphs without cycles. The BP algorithm
is guaranteed to converge to the correct marginal posterior probabilities in tree–like graphical
models. The BP algorithm applied to a graph with loops is called loopy belief propagation (LBP).
The LBP algorithm remains well–defined and, in some cases, gives good approximate answers,
while in other cases gives poor results or fails to converge.
Yedidia, Freeman, and Weiss [28,35] established that, in a factor graph (see Definition 2.6),
there is a one–to–one correspondence between the fixed–points of BP algorithms and the
stationary points of the Bethe free energy. They also showed that the BP algorithms can only
converge to a fixed–point that is also a stationary point of the Bethe approximation to the free
energy. Their discovery, which has been heralded as a major breakthrough for belief
propagation in general, not only clarified the nature of the Bethe approximation of the free
energy, but also opened the way to construct more sophisticated message passing algorithms
based on improvements made to Bethe’s approximation.
The theoretical result of Yedidia et al. [28,35] is applicable not only to factor graphs, but to all
types of graphical models. The justification of this statement relies on two facts. The first fact is
that all types of graphical models have the following property: they can be converted, before
doing inference, into a pairwise Markov network, through a suitable clustering of nodes into
larger nodes [54]. The second fact is that the pairwise Markov network a factor graph is
converted into and the factor graph itself have the same joint probability distribution [59].
Therefore, without loss of generality, we can use a pairwise Markov network as the underlying
graphical model for the BP algorithm. In this section we give only a brief description of BP;
detailed presentations of BP can be found in [59-60]. We assume that we work under the
hypotheses and notations given by (3.73). We use the standard set of notations for BP
algorithms and, when applicable, we provide the correspondence to our own notation. We use
the notation 𝑑𝑖 for the degree of node 𝑖.
The standard BP update rules are applicable to the message that node 𝑖 sends to node 𝑗
denoted 𝑚𝑖𝑗 and to the belief of node 𝑖 denoted 𝑏𝑖:
m_{ij}(X_j) \leftarrow \alpha \cdot \sum_{X_i} \phi_{ij}(X_i, X_j) \cdot \phi_i(X_i) \cdot \prod_{k \in \mathrm{ne}(i) \setminus \{j\}} m_{ki}(X_i) \quad (3.90)

b_i(X_i) \leftarrow \alpha \cdot \phi_i(X_i) \cdot \prod_{k \in \mathrm{ne}(i)} m_{ki}(X_i) \quad (3.91)
where 𝛼 denotes a normalization constant and ne(𝑖) denotes the Markov blanket of node 𝑖.
The belief 𝑏𝑖(𝑋𝑖) is obtained by multiplying all incoming messages to node 𝑖 by the local
evidence. If the belief 𝑏𝑖(𝑋𝑖) is normalized, then it approximates the marginal probability
𝑝𝑖 = 𝑝𝑖(𝑋𝑖) given by (3.74).
The belief 𝑏𝑖𝑗(𝑋i, 𝑋𝑗) at the pair of connected nodes 𝑋i and 𝑋𝑗 is defined as the product of the
local potentials and all incoming messages to the pair of nodes:
b_{ij}(X_i, X_j) \leftarrow \alpha \cdot \psi_{ij}(X_i, X_j) \cdot \prod_{k \in \mathrm{ne}(i) \setminus \{j\}} m_{ki}(X_i) \cdot \prod_{l \in \mathrm{ne}(j) \setminus \{i\}} m_{lj}(X_j) \quad (3.92)

where: \psi_{ij}(X_i, X_j) \equiv \phi_{ij}(X_i, X_j) \cdot \phi_i(X_i) \cdot \phi_j(X_j) \quad (3.93)
If the belief 𝑏𝑖𝑗(𝑋i, 𝑋𝑗) is normalized, then it approximates the marginal probability 𝑝𝑖𝑗 = 𝑝𝑖𝑗(𝑋𝑖𝑋𝑗)
given by (3.74). Generally the beliefs 𝑏𝑖(𝑋𝑖) and 𝑏𝑖𝑗(𝑋i, 𝑋𝑗) are approximate marginals. However,
they become the exact marginals when the underlying graph contains no cycles [13].
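The update rules (3.90)–(3.91) can be sketched for binary states as follows; the data layout (dictionaries of local and pairwise potentials) and the fixed sweep count are our own choices, not prescribed by the text:

```python
def loopy_bp(phi, phi_pair, edges, n, sweeps=50):
    """Parallel message updates (3.90) and beliefs (3.91) on a binary
    pairwise Markov network with nodes 0..n-1 (states 0/1).

    phi[i][x]               : local evidence phi_i(x)    -- illustrative layout
    phi_pair[(i, j)][x][y]  : pairwise potential phi_ij(x, y)
    """
    def pot(i, j, xi, xj):  # symmetric lookup of phi_ij
        return (phi_pair[(i, j)][xi][xj] if (i, j) in phi_pair
                else phi_pair[(j, i)][xj][xi])
    ne = {i: set() for i in range(n)}
    for (i, j) in edges:
        ne[i].add(j)
        ne[j].add(i)
    # messages m[i][j][x_j], initialized uniformly
    m = {i: {j: [0.5, 0.5] for j in ne[i]} for i in range(n)}
    for _ in range(sweeps):
        new = {i: {} for i in range(n)}
        for i in range(n):
            for j in ne[i]:
                msg = []
                for xj in (0, 1):
                    s = 0.0
                    for xi in (0, 1):
                        prod = phi[i][xi]
                        for k in ne[i] - {j}:     # all neighbors except j
                            prod *= m[k][i][xi]
                        s += pot(i, j, xi, xj) * prod
                    msg.append(s)
                z = msg[0] + msg[1]
                new[i][j] = [msg[0] / z, msg[1] / z]   # alpha normalizes
        m = new
    # beliefs (3.91): local evidence times all incoming messages
    b = {}
    for i in range(n):
        bi = [phi[i][0], phi[i][1]]
        for k in ne[i]:
            bi = [bi[0] * m[k][i][0], bi[1] * m[k][i][1]]
        z = bi[0] + bi[1]
        b[i] = [bi[0] / z, bi[1] / z]
    return b
```

On a two-node tree the beliefs reproduce the exact marginals, as the preceding paragraph states for cycle-free graphs.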
Theorem 3.9 (Yedidia et al. [28,35]):
Let {𝑚𝑖𝑗} be a set of BP messages and let {𝑏𝑖𝑗, 𝑏𝑖} be the beliefs calculated from those
messages. Then the beliefs are fixed–points of the BP algorithm if and only if they are
zero–gradient points of the Bethe free energy 𝒢𝛽 subject to the following normalization and
marginalization constraints:
\sum_{X_i} b_i(X_i) = 1 \quad \text{and} \quad \sum_{X_i} b_{ij}(X_i, X_j) = b_j(X_j) \quad (3.94)
Proof: We start by writing the Bethe free energy 𝒢𝛽 (3.82) in terms of beliefs:
\mathcal{G}_\beta(\{b_i\},\{b_{ij}\}) = \sum_{i,j} \sum_{X_i, X_j} b_{ij}(X_i, X_j) \cdot \left[\ln b_{ij}(X_i, X_j) - \ln \psi_{ij}(X_i, X_j)\right] - \sum_{i} (d_i - 1) \cdot \sum_{X_i} b_i(X_i) \cdot \left[\ln b_i(X_i) - \ln \phi_i(X_i)\right] \quad (3.95)
To prove the claim " → " we add the following Lagrange multipliers to form a Lagrangian 𝐿:
𝜆𝑖𝑗(𝑋𝑗) is the multiplier corresponding to the constraint that 𝑏𝑖𝑗(𝑋i, 𝑋𝑗) marginalizes down to
𝑏𝑗(𝑋𝑗);
𝜉𝑖𝑗 , 𝜉𝑖 are multipliers corresponding to the normalization constraints.
The equation \partial L / \partial b_{ij}(X_i, X_j) = 0 gives:

\ln b_{ij}(X_i, X_j) = \ln \psi_{ij}(X_i, X_j) + \lambda_{ij}(X_j) + \lambda_{ji}(X_i) + \xi_{ij} - 1

The equation \partial L / \partial b_i(X_i) = 0 gives:

(d_i - 1) \cdot \left(\ln b_i(X_i) + 1\right) = \ln \phi_i(X_i) + \sum_{j \in \mathrm{ne}(i)} \lambda_{ji}(X_i) + \xi_i

Setting:

\lambda_{ij}(X_j) = \ln \prod_{k \in \mathrm{ne}(j) \setminus \{i\}} m_{kj}(X_j)
and using the marginalization constraints (3.94), we find that the stationary conditions on the
Lagrangian are equivalent to the BP fixed–point conditions.
To prove the claim " ← ", suppose that we are given 𝑏𝑖, 𝑏𝑖𝑗, and 𝜆𝑖𝑗(𝑋𝑗) that correspond to a
zero–gradient point, and set:

m_{ij}(X_j) = \frac{b_j(X_j)}{\exp(\lambda_{ij}(X_j))}
Because 𝑏𝑖, 𝑏𝑖𝑗, and 𝜆𝑖𝑗(𝑋𝑗) satisfy the stationarity conditions, 𝑚𝑖𝑗 defined in this way must be a
fixed–point of BP.
Since both sides of Theorem 3.9 are valid, there is a one–to–one correspondence between the
fixed–points of the BP algorithm and the stationary points of the Bethe free energy.
The consequences of Theorem 3.9 differ between tree–like graphs and loopy graphs.
In tree–like graphs the fixed–points of the BP algorithm are the global minima of the Bethe free
energy [28]. This comes as a natural effect of the fact that the BP algorithm performs exact
inference in graphs without loops, so the Bethe free energy is minimal for exact marginals.
In loopy graphs the situation is more complicated. In [61] Heskes showed that the stable fixed–
points of the LBP algorithm are local minima of the Bethe free energy. He also showed that the
converse is not necessarily true: minima of the Bethe free energy can be unstable fixed–points
of LBP [61]. Furthermore, in [62] Heskes derived sufficient conditions for uniqueness of a LBP
fixed–point. By using a particular Boltzmann machine as a counter–example, Heskes showed
that the uniqueness of a LBP fixed–point does not guarantee the convergence of the LBP
algorithm to that fixed–point [62].
In [58] Shin proposed an alternative to LBP that fixes its convergence issue via a double-loop
scheme. His solution applies to arbitrary binary graphical models with 𝑛 nodes and maximum
degree O(log 𝑛) in the underlying graph. Shin’s algorithm is a message passing algorithm that
solves the Bethe fixed–point equations in a polynomial number of bitwise operations and is
considered the first fully polynomial–time approximation scheme for the LBP fixed–point
computation in Markov random fields [58].
We end this section by rewriting the expression (3.83) of the mean field free energy in terms of
beliefs:
F_{MF}(\{b_i\}) = -\sum_{i,j} b_i(X_i) \cdot b_j(X_j) \cdot \ln \phi_{ij}(X_i, X_j) + \sum_{i} b_i(X_i) \cdot \left[\ln b_i(X_i) - \ln \phi_i(X_i)\right] \quad (3.96)
3.5.4 Belief optimization
Unlike the mean field free energy and the Bethe–Gibbs free energy, which include only first
order terms 𝑝𝑖(𝑋𝑖), the Bethe free energy includes first–order terms 𝑝𝑖(𝑋𝑖) as well as second–
order terms 𝑝𝑖𝑗(𝑋i, 𝑋𝑗). Unlike the mean field free energy, which is minimized in the primal
variables {𝑝𝑖}, the Bethe free energy can be minimized in both the primal space and the dual
space. Usually the Bethe free energy is minimized in the dual space by using messages, which
are a combination of the dual variables {𝜆𝑖𝑗(𝑋𝑗)}. The process of minimizing the Bethe free
energy in the primal space, i.e., in terms of the posterior probability distributions, is similar to the
mean field free energy minimization. The approximate inference algorithm that employs this
type of minimization for the Bethe free energy is named belief optimization and represents an
alternative to the fixed–point equations of belief propagation [4].
In order to derive the fixed–point equations that solve the marginals {𝑝𝑖} for the Bethe free
energy, we follow a familiar recipe: first we compute derivatives of the Bethe free energy given
by (3.89) with respect to {𝑝𝑖} and then we equate them to zero:
G_\beta[\{p_i\}] = \min_{\{p_{ij}\}} \mathcal{G}_\beta[\{p_i\},\{p_{ij}\}] = \mathcal{G}_\beta[\{p_i\},\{p_{ij}(p_i, p_j)\}]

\frac{dG_\beta}{dp_i} = \frac{\partial \mathcal{G}_\beta}{\partial p_i} + \sum_{j \in \mathrm{ne}(i)} \frac{\partial \mathcal{G}_\beta}{\partial p_{ij}} \cdot \frac{\partial p_{ij}}{\partial p_i} \quad (3.97)
where ne(𝑖) denotes the Markov blanket of unit 𝑖.
We recall that in 𝐺𝛽[{𝑝𝑖}] the pairwise marginals {𝑝𝑖𝑗} are defined so as to minimize 𝒢𝛽; in
other words, \partial \mathcal{G}_\beta / \partial p_{ij} = 0. Therefore, (3.97) becomes:

\frac{dG_\beta}{dp_i} = \frac{\partial \mathcal{G}_\beta}{\partial p_i} \quad (3.98)
The equation (3.98) shows that, under the current assumptions, the Bethe–Gibbs free energy
𝐺𝛽 and the Bethe free energy 𝒢𝛽 have the same fixed–points. In order to solve the gradient, we
use the analytic expression (3.82) of 𝒢𝛽.
\frac{\partial \mathcal{G}_\beta}{\partial p_i} = \theta_i + \ln\left(\frac{(1 - p_i)^{d_i - 1} \cdot \prod_{j \in \mathrm{ne}(i)} (p_i - p_{ij})}{p_i^{d_i - 1} \cdot \prod_{j \in \mathrm{ne}(i)} (p_{ij} + 1 - p_i - p_j)}\right) \quad (3.99)
Equating these derivatives to zero gives the following set of fixed–point equations for the
Bethe–Gibbs free energy 𝐺𝛽 and, equivalently, the Bethe free energy [4]:

p_i = \mathrm{sigm}\left(-\theta_i + \sum_{j \in \mathrm{ne}(i)} \ln\left(\frac{p_i \cdot (p_{ij} + 1 - p_i - p_j)}{(1 - p_i) \cdot (p_i - p_{ij})}\right)\right) \quad \text{for all } i \in V \quad (3.100)
Regardless of whether they are run sequentially or in parallel, the fixed–point equations (3.100)
are not guaranteed to decrease the Bethe free energy 𝒢𝛽, or to converge at all.
As with the mean field approximation, we may achieve a decrease of the Bethe free energy by
optimizing it with respect to a single coordinate at a time, i.e., by temporarily fixing all
neighboring marginals 𝑝𝑗 and minimizing over the marginal 𝑝𝑖 of the central node. The resulting
value of the singleton marginal 𝑝𝑖 is optimal given the choice of all other singleton marginals.
To minimize the function in its entirety, we need to minimize with respect to all the coordinates
{𝑝𝑖}. One way to achieve this goal is an iterated coordinate descent algorithm, which repeatedly
minimizes a single marginal at a time, given fixed choices for all of the others.
Another way to perform the global minimization is to perform gradient descent on all the
coordinates {𝑝𝑖} simultaneously while enforcing the constraint that they stay within the interval
[0,1].
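One possible realization of the sequential fixed-point iteration (3.100), with the pairwise marginals recomputed from (3.88) after every update. A sketch only: as cautioned above, convergence is not guaranteed in general, and the update schedule and iteration count below are illustrative assumptions:

```python
import math

def sigm(x):
    return 1.0 / (1.0 + math.exp(-x))

def belief_optimization(w, theta, neighbors, iters=200):
    """Sequential sweep of the fixed-point equations (3.100)."""
    p = {i: 0.5 for i in neighbors}       # start from uninformative marginals

    def pij(i, j):
        # admissible root (3.88) of the quadratic (3.85)
        wij = w[(i, j)] if (i, j) in w else w[(j, i)]
        a = math.exp(wij) - 1.0                       # alpha_ij, (3.86)
        if abs(a) < 1e-12:
            return p[i] * p[j]                        # w_ij ~ 0: independence
        q = 1.0 + a * (p[i] + p[j])                   # Q_ij
        return (q - math.sqrt(q * q - 4.0 * a * (1.0 + a) * p[i] * p[j])) / (2.0 * a)

    for _ in range(iters):
        for i in neighbors:
            s = -theta[i]
            for j in neighbors[i]:
                q = pij(i, j)
                s += math.log(p[i] * (q + 1.0 - p[i] - p[j])
                              / ((1.0 - p[i]) * (p[i] - q)))
            p[i] = sigm(s)                            # update rule (3.100)
    return p
```

With all weights zero the units decouple and each marginal settles at sigm(−θᵢ), which is a quick consistency check on the sign conventions of (3.100).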
Chapter 4. Introduction to Boltzmann Machines
A Boltzmann machine is a parallel computational organization, or network, that is well
suited to constraint satisfaction tasks involving large numbers of “weak” constraints. A weak
constraint is a goal criterion that need not be satisfied by every solution; in other words, it is not
an all–or–none criterion. In some problem domains, such as finding the most plausible
interpretation of an image, it frequently happens that even the best possible solution violates
some constraints. In these cases a variation of weak constraints is used, specifically weak
constraints that incur a cost when violated. The quality of a solution is then determined by the
total cost of all the constraints that it violates [10-11,43].
4.1 Definitions
Structurally a Boltzmann machine is a symmetrical connectionist network with hidden
units; therefore, its structure follows the general structure of a connectionist network described
in Section 2.4. Hinton characterized the Boltzmann machine as “a generalization of a Hopfield
network in which the units update their states according to a stochastic decision rule” [51]. The
majority of the following definitions are taken from [1,3,43] and adapted as needed for
consistency. Our focus in Chapter 4 and Chapter 5 is the asynchronous Boltzmann machine,
which we refer to simply as the Boltzmann machine.
Definition 4.1:
A Boltzmann machine is a neural network that satisfies certain properties. Formally a Boltzmann
machine 𝐁𝐌 is a four–tuple:
\mathbf{BM} = (\mathcal{N}, \mathcal{G}, W, \Theta) \quad (4.1)

comprising:
a finite set 𝓝 of primitive computing elements called units or neurons;
Without restricting the generality we assume that 𝒩 is indexed by the set {1,2,…𝑛} for 𝒏 = |𝒩|.
To make the formulae more readable, in subsequent development we refer equivalently to 𝒩
and {1,2,…𝑛}.
an undirected graph (𝓝,𝓖) called the connectivity graph, where:
𝓖 = { {𝑖, 𝑗} ∶ (𝑖, 𝑗) ∈ 𝒩 ×𝒩, 𝑖 ≠ 𝑗} (4.2)
a collection W = (W_{ij})_{\{i,j\} \in \mathcal{G}} of real numbers called the weights or synaptic weights; each
weight W_{ij} is associated with an edge \{i, j\};
a collection \Theta = (\theta_j)_{j \in \mathcal{N}} of real numbers called the thresholds; each threshold \theta_j is
associated with a unit;
and satisfying the following properties:
a unit is always in one of two activation levels or states designated as on/off or 1/0 or 1/-1;
a unit adopts these activation levels as a probabilistic function of the activation levels of its
neighboring units and the weights on its edges to them;
the weights W_{ij} on the edges are symmetric, having the same strength in both directions:

W_{ij} = W_{ji} \quad (4.3)

the weights W_{ij} on the edges can take on real values of either sign;
a unit being on or off is taken to mean that the system currently accepts or rejects some
elemental hypothesis about the domain;
the weight on an edge represents a weak pairwise constraint between two hypotheses:
a positive weight indicates that the two hypotheses tend to support one another; if one is
currently accepted, accepting the other should be more likely;
a negative weight suggests, other things being equal, that the two hypotheses should
not both be accepted.
The following notions are important in subsequent development:
The terms link and connection equally denote an edge {𝑖, 𝑗} ∈ 𝓖 of the connectivity graph,
where 1 ≤ 𝑖, 𝑗 ≤ 𝑛;
𝐈 = {0,1} denotes the set of possible activation levels or states for a unit;
Hopfield represented the states of his model with -1 and 1 because his model was derived
from a physical system called a spin glass, in which spins are either down or up. Provided the
units have thresholds, models that use the representation -1 and 1 for their states can be
translated into models that represent their states with 0 and 1 and have different thresholds
[43]. In Section 2.5 we showed a similar translation for a Hopfield network (equations (2.33)
and (2.34)).
𝝈𝐢 ∈ I denotes the activation level of unit 𝑖, ∀𝑖 ∈ 𝒩;
ℝ^𝓖 denotes the set of all families of weights W;
ℝ^𝓝 denotes the set of all families of thresholds Θ;
The connectivity graph (𝒩, 𝒢) can be extended to (𝓝, 𝓖′) as follows:

\mathcal{G}' = \{\, \{i, j\} : \{i, j\} \in \mathcal{G} \ \text{or} \ (i = 0 \ \text{and} \ j \in \mathcal{N}) \,\} \quad (4.4)

and:

\mathbb{R}^{\mathcal{G}'} \overset{\mathrm{def}}{=} \mathbb{R}^{\mathcal{G}} \times \mathbb{R}^{\mathcal{N}} \quad (4.5)
Parameters
The parameters or extended weights 𝐖 are a collection of real numbers defined as:

\mathbf{W} \overset{\mathrm{def}}{=} (W, \Theta) = (w_{ij})_{\{i,j\} \in \mathcal{G}'} \in \mathbb{R}^{\mathcal{G}'} \quad (4.6)

where:

w_{ij} \overset{\mathrm{def}}{=} \begin{cases} W_{ij}, & \text{if } \{i, j\} \in \mathcal{G} \\ -\theta_j, & \text{if } \{i, j\} \in \mathcal{G}' - \mathcal{G} \end{cases} \quad (4.7)
The number of weights W_{ij} is at most n(n-1)/2, which corresponds to (𝒩, 𝒢) being
a complete undirected graph on 𝑛 vertices or units.
The number of extended weights w_{ij} is at most n(n-1)/2 + n = n(n+1)/2, which corresponds
to (𝒩, 𝒢′) being a complete undirected graph on 𝑛 + 1 vertices or units.
Definition 4.2:
If we incorporate into Definition 4.1 the concept of parameters according with the formulae (4.6)
and (4.7), then we obtain the following equivalent definition of a Boltzmann machine:

\mathbf{BM} = (\mathcal{N}, \mathcal{G}, (W, \Theta)) = (\mathcal{N}, \mathcal{G}', \mathbf{W}) \quad (4.8)
Configurations
A function 𝝈:𝓝⟶ 𝐈, 𝝈(𝒊) ≝ 𝝈𝐢 is called an 𝐼–valued configuration of 𝒩.
A specification of activation levels (𝝈𝐢)𝒊∈𝓝 of all the units 𝑖 ∈ 𝒩 represents a configuration or
a global state of 𝐁𝐌. A configuration of 𝐁𝐌 can also be seen as a particular combination of
hypotheses about the domain. The set of all possible configurations of 𝐁𝐌 represents the
configuration space of 𝐁𝐌 and is written 𝐈𝓝.
Net input of configuration towards unit
Generally, the net input of a configuration 𝜎 towards a unit 𝑖 ∈ 𝒩, also called the activation
potential of unit 𝑖, is defined by equation (2.23). Adapting equation (2.23) to the current
conventions and notations, we obtain:

\mathrm{net}_i \equiv \mathrm{net}(i, \sigma) = -\theta_i + \sum_{j \in \mathcal{G}(i)} W_{ji} \cdot \sigma_j = \sum_{j \in \mathcal{G}'(i)} w_{ji} \cdot \sigma_j \quad (4.9)

where: 𝒢(𝑖) = {𝑗 ∶ {𝑗, 𝑖} ∈ 𝒢}; 𝒢′(𝑖) = {𝑗 ∶ 𝑗 ∈ {0} ∪ 𝒩 and {𝑗, 𝑖} ∈ 𝒢′}; and 𝜎𝑗 is the projection of
𝜎 onto the jth component of I^𝒩.
The net input of a configuration 𝜎 towards a specific unit can be seen as a mapping from a
pair (configuration, unit) to ℝ. A mapping from the configuration alone to ℝ is called a
Hamiltonian. Formally, a Hamiltonian H is an element of 𝓗(𝓝), where ℋ(𝒩) denotes the set of
all real–valued functions defined on I^𝒩:

\mathcal{H}(\mathcal{N}) = \{\, \mathrm{H} \mid \mathrm{H} : \mathrm{I}^{\mathcal{N}} \longrightarrow \mathbb{R} \,\} \quad (4.10)

Clearly ℋ(𝒩) is a linear space of dimension 2^𝑛.
Probability functions on the configuration space
Let 𝓟(𝓝) denote the set of all probability functions on the configuration space I^𝒩. 𝒫(𝒩) is a
simplex of dimension 2^𝑛 − 1 in ℝ^{I^𝒩}:

\mathcal{P}(\mathcal{N}) = \{\, \mathbf{P} \mid \mathbf{P} : \mathrm{I}^{\mathcal{N}} \longrightarrow [0,1],\ \sum_{\sigma \in \mathrm{I}^{\mathcal{N}}} \mathbf{P}(\sigma) = 1 \,\} \quad (4.11)
Let 𝓟+(𝓝) denote the interior of the simplex 𝒫(𝒩), i.e., the set of those 𝐏 ∈ 𝒫(𝒩) that are
nondegenerate, in the sense that 𝐏(𝜎) ≠ 0 for all 𝜎 ∈ I𝒩:
𝓟+(𝓝) = {𝐏 ∈ 𝒫(𝒩) | 𝐏(𝜎) ≠ 0 for all 𝜎 ∈ I𝒩} (4.12)
Gibbs measure associated to a Hamiltonian
Any element H ∈ ℋ(𝒩) gives rise to a probability distribution 𝐆_H ∈ 𝓟+(𝓝), named the Gibbs
measure associated to the Hamiltonian H and defined by:

\mathbf{G}_{\mathrm{H}}(\sigma) = \frac{\exp(\mathrm{H}(\sigma))}{Z(\mathrm{H})} \quad (4.13)

where 𝑍(H) is the normalization constant needed to make the probabilities add up to 1, i.e.:

Z(\mathrm{H}) = \sum_{\sigma \in \mathrm{I}^{\mathcal{N}}} \exp(\mathrm{H}(\sigma)) \quad (4.14)
Clearly the set of Gibbs measures on I𝒩 is exactly 𝒫+(𝒩).
Two Hamiltonians H1 ≠ H2 give rise to the same Gibbs measure if and only if they differ by a
constant. Let 𝓗𝟎(𝓝) be the quotient space of ℋ(𝒩) modulo the constants. Then the
function 𝒇𝟎 defined by the equation (4.15) is well defined and bijective:
𝒇𝟎 ∶ 𝓗𝟎(𝓝)⟶ 𝓟+(𝓝), 𝒇𝟎(𝐇) = 𝐆𝐇 (4.15)
Quadratic Hamiltonian
For any 𝑖 ∈ 𝒩, let 𝜎𝑖 denote the projection of 𝜎 onto the ith component of I^𝒩.
Let also 𝜎0 denote the function identically equal to 1: 𝜎0 ≝ 1.
Then to any pair (W, Θ) = 𝐖 ∈ ℝ^{𝒢′} we associate a function H_{(W,Θ)} ≝ H_𝐖, named the
Hamiltonian of 𝐖 = (W, Θ) and defined by:

\mathrm{H}_{(W,\Theta)} : \mathrm{I}^{\mathcal{N}} \longrightarrow \mathbb{R}, \quad \mathrm{H}_{(W,\Theta)}(\sigma) = \sum_{\{i,j\} \in \mathcal{G},\, i<j} \sigma_i \cdot \sigma_j \cdot W_{ij} - \sum_{j \in \mathcal{N}} \sigma_j \cdot \theta_j \quad (4.16)

equivalent to:

\mathrm{H}_{(W,\Theta)}(\sigma) = \frac{1}{2} \cdot \sum_{\{i,j\} \in \mathcal{G}} \sigma_i \cdot \sigma_j \cdot W_{ij} - \sum_{j \in \mathcal{N}} \sigma_j \cdot \theta_j \quad (4.17)

equivalent to:

\mathrm{H}_{\mathbf{W}} : \mathrm{I}^{\mathcal{N}} \longrightarrow \mathbb{R}, \quad \mathrm{H}_{\mathbf{W}}(\sigma) = \sum_{\{i,j\} \in \mathcal{G}',\, i<j} \sigma_i \cdot \sigma_j \cdot w_{ij} \quad (4.18)

Any Hamiltonian H which is of the form H_𝐖 for some 𝐖 ∈ ℝ^{𝒢′} is called a quadratic
Hamiltonian.
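The equivalence of the forms (4.16) and (4.18) — folding the thresholds into weights w_{0j} = −θ_j attached to a constant unit σ_0 = 1, as in (4.7) — can be checked directly. A sketch with illustrative data structures:

```python
def hamiltonian(sigma, W, theta):
    """Form (4.16): quadratic term over edges minus the linear threshold term.
    sigma is a tuple of 0/1 states; W maps edges (i, j) with i < j to weights."""
    quad = sum(sigma[i] * sigma[j] * w for (i, j), w in W.items())
    lin = sum(sigma[j] * theta[j] for j in range(len(sigma)))
    return quad - lin

def hamiltonian_extended(sigma, W, theta):
    """Form (4.18): extend sigma with sigma_0 = 1 and fold the thresholds
    into weights w_0j = -theta_j; index -1 stands in for the extra unit 0."""
    w = dict(W)
    for j in range(len(sigma)):
        w[(-1, j)] = -theta[j]
    ext = (1,) + tuple(sigma)            # sigma_0 = 1 prepended
    return sum(ext[i + 1] * ext[j + 1] * wij for (i, j), wij in w.items())
```

The two functions agree on every configuration, which is exactly the content of the equivalence between (4.16) and (4.18).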
Partition function and cumulant function
The partition function associated to a Hamiltonian H is the function 𝒁 defined by:

Z : \mathcal{H}(\mathcal{N}) \longrightarrow \mathbb{R}, \quad Z(\mathrm{H}) = \sum_{\sigma \in \mathrm{I}^{\mathcal{N}}} \exp(\mathrm{H}(\sigma)) \quad (4.19)

The partition function of a quadratic Hamiltonian H_𝐖 is denoted by the simplified notation:

Z(\mathbf{W}) \overset{\mathrm{def}}{=} Z(\mathrm{H}_{\mathbf{W}}) = \sum_{\sigma \in \mathrm{I}^{\mathcal{N}}} \exp(\mathrm{H}_{\mathbf{W}}(\sigma)) \quad (4.20)
The partition function 𝑍 of a quadratic Hamiltonian HW is well defined and strictly convex on
ℋ0(𝒩) and, generally, intractable. Therefore, we need to approximate it and this is where
the cumulant function helps.
By definition, the cumulant function 𝑨 of a quadratic Hamiltonian H_𝐖 is the natural logarithm
of the partition function associated to that Hamiltonian:

A(\mathbf{W}) \overset{\mathrm{def}}{=} A(\mathrm{H}_{\mathbf{W}}) = \ln Z(\mathbf{W}) \quad (4.21)
When trying to approximate a probability distribution, it is more important to get the
probabilities correct for events that happen frequently than for rare events. One way to
accomplish this objective is to operate with logarithms of probabilities instead of directly with
probabilities.
Quadratic Gibbs measure
We introduce a simplified notation for the Gibbs measure G_{H_𝐖} associated to a Hamiltonian
H_𝐖, which itself is associated to a given set of parameters 𝐖 = (W, Θ) ∈ ℝ^{𝒢′}:

\mathbf{G}_{\mathbf{W}}(\sigma) \overset{\mathrm{def}}{=} \mathbf{G}_{\mathrm{H}_{\mathbf{W}}}(\sigma) = \frac{\exp(\mathrm{H}_{\mathbf{W}}(\sigma))}{Z(\mathrm{H}_{\mathbf{W}})} = \frac{\exp(\mathrm{H}_{\mathbf{W}}(\sigma))}{Z(\mathbf{W})} \quad \text{for all } \sigma \in \mathrm{I}^{\mathcal{N}} \quad (4.22)
A quadratic Gibbs measure on I𝒩 associated to a connectivity graph (𝒩,𝒢) is a probability
function 𝐏 ∈ 𝒫(𝒩) that satisfies the following property:
∃W = (W, Θ) ∈ ℝ𝓖′ such that 𝐏 ≡ 𝐆𝐖
We introduce the notation 𝐆𝟐(𝓝,𝓖) to designate the set of all quadratic Gibbs measures on
I𝒩:
𝐆𝟐(𝓝,𝓖) = {𝐏 ∈ 𝒫(𝒩) ∶ ∃W = (W, Θ) ∈ ℝ𝓖′ such that 𝐏 ≡ 𝐆𝐖} (4.23)
Clearly, quadratic Gibbs measures on I^𝒩 are very special, since they are parameterized by
𝐖 ∈ ℝ^{𝒢′}, i.e., by at most n(n+1)/2 parameters, whereas 𝒫(𝒩) has dimension 2^𝑛 − 1 and
n(n+1)/2 ≪ 2^𝑛 − 1. If we consider, in addition, probability distributions that are marginals of
quadratic Gibbs measures, we get all the Gibbs measures on I^𝒩.
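For small 𝑛, the partition function (4.20) and the quadratic Gibbs measure (4.22) can be computed by brute-force enumeration of all 2^𝑛 configurations. A didactic sketch only (exponential in 𝑛, so usable just for toy networks):

```python
import itertools
import math

def gibbs_measure(W, theta, n):
    """Quadratic Gibbs measure G_W (4.22) over all 2**n configurations."""
    def H(sigma):  # quadratic Hamiltonian (4.16)
        return (sum(sigma[i] * sigma[j] * w for (i, j), w in W.items())
                - sum(sigma[j] * theta[j] for j in range(n)))
    configs = list(itertools.product((0, 1), repeat=n))
    Z = sum(math.exp(H(s)) for s in configs)          # partition function (4.20)
    return {s: math.exp(H(s)) / Z for s in configs}   # normalized, (4.22)

# a single edge with weight 1 and zero thresholds
G = gibbs_measure({(0, 1): 1.0}, [0.0, 0.0], 2)
```

On this example only the configuration (1, 1) has nonzero energy, so its probability is e/(3 + e) and the measure sums to 1, as (4.22) requires.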
4.2 Modelling the underlying structure of an environment
By differentiating their roles in the learning process, Hinton partitioned the units of a
Boltzmann machine into two functional groups: a nonempty set of visible units and a possibly
empty set of hidden units. This is how Hinton explained, in [10], the reason for this partition:
Suppose that the environment directly and completely determines the states of a subset of the units (called the "visible" units), but leaves the network to determine the states of the remaining "hidden" units. The aim of the learning is to use the hidden units to create a model of the structure implicit in the ensemble of binary state vectors that the environment determines on the visible units.
A more detailed justification for the differentiation between units is given by Hinton in [10]. He
considers a parallel network like Boltzmann machine a “pattern completion device such that a
subset of the units are “clamped” into their on or off state and the weights in the network then
complete the pattern by determining the states of the remaining units” [10]. Hinton gives the
example of a network that has one unit for each component of the environmental input
vector; such a network is capable of learning only a limited set of binary vectors. He uses this
example to explain why his assumption about pattern completion has strong limitations and how
these limits can be transcended: by using extra units whose states do not correspond to
components in the vectors to be learned [43]:
The weights on connections to these extra units can be used to represent complex interactions that cannot be expressed as pairwise correlations between the components of the environmental input vectors.
He calls these extra units hidden units and the units that are used to specify the patterns visible
units. In [43] Hinton gives the following intuitive explanation for the separation of units in two
classes:
The visible units are the interface between the network and the environment that specifies vectors for it to learn or asks it to complete a partial vector. The hidden units are where the network can build its own internal representations.
Formally this split of 𝒩 can be described as:

\mathcal{N} = \mathcal{V} \cup \mathcal{H} \quad \text{and} \quad \mathcal{V} \cap \mathcal{H} = \emptyset \quad (4.24)

where 𝓥 represents the set of visible units and 𝓗 represents the set of hidden units.
Let 𝒎 be the number of units in 𝒱 and 𝒍 the number of units in ℋ:

n = m + l, \quad m = |\mathcal{V}|, \quad l = |\mathcal{H}| \quad (4.25)

Theoretically, the structure of an environment can be specified by giving the probability
distribution over all 2^𝑚 states of the visible units. In practice, the network is said to have a
perfect model of the environment if it achieves exactly the same probability distribution over
these 2^𝑚 states when it is running freely at thermal equilibrium, with all units unclamped so that
there is no environmental input [10].
We can regard I^𝒩 as the Cartesian product of I^𝒱 and I^ℋ, and each configuration 𝜎 ∈ I^𝒩 as a
pair of configurations over the visible and hidden units, respectively:

\mathrm{I}^{\mathcal{N}} = \mathrm{I}^{\mathcal{V}} \times \mathrm{I}^{\mathcal{H}} \quad (4.26)

\sigma = (v, h) \quad \text{for } v \in \mathrm{I}^{\mathcal{V}} \text{ and } h \in \mathrm{I}^{\mathcal{H}} \quad (4.27)

If 𝐏 ∈ 𝒫(𝒩), we use 𝐌𝐀𝐑𝐆(𝐏,𝒱) to denote the marginal of the probability distribution 𝐏 with
respect to the variables 𝜎𝑖 such that 𝜎𝑖 = 𝑣𝑖 for all 𝑖 ∈ 𝒱, i.e., the measure given by:

\mathrm{MARG}(\mathbf{P}, \mathcal{V})(v) = \sum_{h \in \mathrm{I}^{\mathcal{H}}} \mathbf{P}(v, h) \quad \text{for } v \in \mathrm{I}^{\mathcal{V}} \quad (4.28)
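The marginalization (4.28) is a plain sum over hidden configurations; a sketch, with the joint distribution stored as a dictionary keyed by full configurations (an illustrative layout):

```python
import itertools

def marginal_visible(P, m, l):
    """MARG(P, V) as in (4.28): sum the joint P(v, h) over all hidden
    configurations h. The first m components of a configuration are
    visible, the last l are hidden."""
    marg = {}
    for v in itertools.product((0, 1), repeat=m):
        marg[v] = sum(P[v + h] for h in itertools.product((0, 1), repeat=l))
    return marg
```

For example, the uniform joint over 3 units (2 visible, 1 hidden) marginalizes to the uniform distribution over the 4 visible states.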
Given a connectivity graph (𝒩, 𝒢), we introduce the notation 𝐆𝟐(𝓥,𝓗, 𝓖) to designate the set of
all probability measures 𝐐 on 𝒱 that are marginals of some quadratic Gibbs measure 𝐏 ∈
G2(𝒩, 𝒢), i.e., satisfy 𝐐 ≡ MARG(𝐏, 𝒱) for some 𝐏 ∈ G2(𝒩, 𝒢). We also recall the definition (4.23)
of 𝐆𝟐(𝓝,𝓖), i.e., the set of all quadratic Gibbs measures on I𝒩.
\mathbf{G}_2(\mathcal{N}, \mathcal{G}) = \{\, \mathbf{P} \in \mathcal{P}(\mathcal{N}) : \exists \mathbf{W} \in \mathbb{R}^{\mathcal{G}'} \ \text{such that} \ \mathbf{P} \equiv \mathbf{G}_{\mathbf{W}} \,\}

\mathbf{G}_2(\mathcal{V}, \mathcal{H}, \mathcal{G}) = \{\, \mathbf{Q} \in \mathcal{P}(\mathcal{V}) : \exists \mathbf{P} \in \mathbf{G}_2(\mathcal{N}, \mathcal{G}) \ \text{such that} \ \mathbf{Q} \equiv \mathrm{MARG}(\mathbf{P}, \mathcal{V}) \,\} \quad (4.29)
The following theorem mentioned in [3] establishes a relation between 𝒫+(𝒱) and G2(𝒱,ℋ, 𝒢).
Theorem 4.1:
Let (𝒩, 𝒢) be the full connectivity graph of a 𝐁𝐌. Using the notations (4.25), assume that:

l \geq 2^m - \frac{1}{2} \cdot (m^2 + m) - 1 \quad (4.30)

Then:

\mathbf{G}_2(\mathcal{V}, \mathcal{H}, \mathcal{G}) = \mathcal{P}^{+}(\mathcal{V})
This means that every nondegenerate probability distribution on I𝒱 can be realized as a
marginal of a distribution on I𝒩 with a quadratic Hamiltonian.
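The bound (4.30) grows quickly with the number of visible units; a one-line helper makes this concrete:

```python
def hidden_units_bound(m):
    """Lower bound of Theorem 4.1 on the number of hidden units l that
    suffices for realizing any nondegenerate distribution on the states
    of m visible units: l >= 2**m - (m**2 + m)/2 - 1."""
    return 2 ** m - (m * m + m) // 2 - 1
```

For instance, three visible units require at least one hidden unit, while four visible units already require at least five.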
In view of this result, if we are trying to model a probability distribution 𝐐 on I𝒱, it is not too much
of a restriction to assume that 𝐐 is a marginal of some quadratic Gibbs measure on some larger
set 𝒩 = 𝒱 ∪ℋ of units. In particular, if we only look at the visible units, then the equilibrium
behavior of a Boltzmann machine is a marginal of a quadratic Gibbs measure. Moreover, every
quadratic Gibbs measure arises from a Boltzmann machine and then Theorem 4.1 implies that
every nondegenerate probability measure on I𝒱 arises as the behavior of some Boltzmann
machine, possibly with hidden units.
The connectivity graph (𝒩, 𝒢) of a Boltzmann machine proposed by Hinton in [8-11,43] is a
general undirected graph, which means that, as a graphical model, a general Boltzmann
machine represents a pairwise Markov random field. Its connectivity graph can be any
undirected graph, in particular a complete undirected graph that was described by Hinton as a
fully–connected Boltzmann machine. However, the majority of research on Boltzmann machines
has been done on a particular type of graph, specifically a graph that can be “decomposed” into
layers. This particular graph structure is named in the machine learning literature the generic
Boltzmann machine or simply the Boltzmann machine. Because the topic of this paper –
learning algorithms for Boltzmann machines – originates in the field of machine learning, we
adhere to this concept of Boltzmann machine.
The generic Boltzmann machine has one layer that contains all the fully interconnected
visible units and at least one layer of fully interconnected hidden units. If the hidden units are
distributed among multiple layers, only one hidden layer is connected, specifically fully
connected, with the visible layer. The other hidden layers are interconnected: each hidden layer
is fully connected with the layer “below” and with the layer “above” (if one exists). The visible
layer is placed “below” the first hidden layer. Figure 1 illustrates two Boltzmann machine
configurations: the fully–connected Boltzmann machine and the generic Boltzmann machine.
Figure 1 a) A fully–connected Boltzmann machine with three visible nodes and four hidden nodes; b) A layered Boltzmann machine with one visible layer and two hidden layers.
4.3 Representation of a Boltzmann Machine as an energy–based model
As constraint satisfaction networks, the Boltzmann machines should be well equipped to
deal with tasks that involve a large number of weak constraints. However, what we have learned
about them so far doesn’t endow them with such qualities. Specifically, the hidden units, seen as
hidden latent causes, are not good at modelling constraints between variables. Hidden ancestral
variables, i.e., the variables corresponding to the hidden units, may be good for modelling some
types of correlation, but they cannot be used to decrease variance. A better way to model
constraints is to use an energy–based model that associates high energies with data vectors
that violate constraints.
Inspired by a variant of Hopfield’s network, which we described in Section 2.5, Hinton showed
that there exists an expression for the energy of a configuration of the network such that,
under certain circumstances, the individual units act so as to minimize the global energy. In [10]
Hinton explained the importance of the energy of a parallel system like the Boltzmann machine:
it represents the degree of violation of the constraints between hypotheses and consequently
determines the dynamics of the search. He also formulated the following postulates or
assumptions about the energy, which he later used to derive the main properties of the
probabilistic system that is the Boltzmann machine.
Postulate 1:
There is a potential energy function over states of the whole system which is a function 𝑓(𝐏(𝜎))
of the probability of a state 𝜎.
This is equivalent to saying that, given any input, a particular state or configuration 𝜎 of a
Boltzmann machine has exactly one probability. It does not, for instance, have a probability of
0.3 and also a probability of 0.5.
Postulate 2:
The potential energy function is additive for independent systems. Since the probability for a
combination of states of independent systems is multiplicative, it follows that:
𝑓(𝐏(𝜎)) + 𝑓(𝐏(𝜎′)) = 𝑓(𝐏(𝜎)𝐏(𝜎′))
The only function (up to the choice of the constant 𝑘) that satisfies this functional equation is:
𝑓(𝐏(𝜎)) = 𝑘 ∙ ln𝐏(𝜎)
To make more probable states have lower energy, the real–valued constant 𝑘 must be negative.
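The additivity requirement and the sign of 𝑘 can be checked numerically; the following is a minimal sketch, with arbitrary example probabilities and the illustrative choice 𝑘 = −1:

```python
import math

k = -1.0  # negative, so that more probable states get lower energy

def f(p):
    """Potential energy as a function of state probability: f(P) = k * ln P."""
    return k * math.log(p)

# Arbitrary probabilities of states of two independent systems.
p, q = 0.3, 0.5
# Additivity for independent systems: f(P(s)) + f(P(s')) = f(P(s) * P(s')).
assert math.isclose(f(p) + f(q), f(p * q))
# The more probable state has the lower energy.
assert f(0.5) < f(0.3)
```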
Postulate 3:
The part of the potential energy contributed by a single unit can be computed from information
available to the unit.
Only potential energies symmetrical in all pairs of units have this property, since in this case a
unit can "deduce" its effect on other units from their effect on itself.
Under the previous assumptions, the individual units of a Boltzmann machine can be made to
act so as to minimize the global energy. If some of the units are clamped into particular states to
represent a particular input, the system will then try to find the minimum energy configuration
that is compatible with that input. Thus, the energy of a configuration can be interpreted as the
extent to which that combination of hypotheses fails to fit the input and violates the constraints
implicit in the problem domain. So, in minimizing energy, the system is maximizing the extent to
which a perceptual interpretation fits the data and satisfies the constraints. Consequently the
system evolves towards interpretations of that input that increasingly satisfy the constraints of
the problem domain [10].
Using the previous notations, the global energy of the system, also referred to as the energy of the
configuration 𝜎 of the system, is defined as:

E(σ) = −( ∑_{ {i,j}∈𝒢, i<j } σ_i ∙ σ_j ∙ w_ij − ∑_{i∈𝒩} σ_i ∙ θ_i )     (4.31)

or:

E(σ) = −( (1/2) ∙ ∑_{ {i,j}∈𝒢 } σ_i ∙ σ_j ∙ w_ij − ∑_{i∈𝒩} σ_i ∙ θ_i )

equivalent to:

E(σ) = −H_(W,Θ)(σ) = −H_W(σ)     (4.32)
If we represent the configuration σ as an n–dimensional column vector, the weights W as an n × n
symmetric matrix, and the thresholds Θ as an n–dimensional column vector, then the energy of
configuration σ can be written in matrix form as:

E(σ) = −( (1/2) ∙ σ^T W σ − Θ^T σ )     (4.33)
We observe that the global energy defined by the equations (4.31) belongs to the Hamiltonian
family given by the definition (4.10). Moreover, the global energy is the negative of the quadratic
Hamiltonian defined by the equations (4.16) and (4.17).
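To make (4.31) and (4.33) concrete, the following sketch computes the energy both ways and checks that they agree; the network size, weights, and thresholds below are arbitrary illustrative choices, not values from the text:

```python
import numpy as np

def energy(sigma, W, theta):
    """Energy of configuration sigma, eq. (4.33): E = -((1/2) s^T W s - theta^T s).

    W must be symmetric with a zero diagonal; sigma is a 0/1 vector.
    """
    sigma = np.asarray(sigma, dtype=float)
    return -(0.5 * sigma @ W @ sigma - theta @ sigma)

def energy_pairwise(sigma, W, theta):
    """Same energy via the pairwise sum of eq. (4.31), counting each edge once."""
    n = len(sigma)
    pair = sum(sigma[i] * sigma[j] * W[i, j]
               for i in range(n) for j in range(i + 1, n))
    return -(pair - sum(sigma[i] * theta[i] for i in range(n)))

# Tiny example network with arbitrary random parameters.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
W = (W + W.T) / 2          # enforce symmetry
np.fill_diagonal(W, 0.0)   # no self-connections
theta = rng.normal(size=4)
s = np.array([1, 0, 1, 1])

# The half-sum over all ordered pairs equals the sum over unordered pairs.
assert np.isclose(energy(s, W, theta), energy_pairwise(s, W, theta))
```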
In Section 2.5 we presented the Hopfield update rule: switch each randomly selected unit into
whichever of its two states yields the lower total energy given the current configuration of the
network. If the Boltzmann machine operated according to the Hopfield update rule, then it
would be no different from a multilayer perceptron that follows the same rule, and it would
suffer from the standard weakness of gradient descent methods: it could get
stuck at a local minimum that is not a global minimum [10-11]. This is an inevitable
consequence of only allowing jumps to states of lower energy, the so–called “downhill moves”.
Unlike the Boltzmann machine, the Hopfield network doesn’t suffer from this weakness
because its local energy minima are used to store memories. Therefore, if the Hopfield network
is started near some local minimum, the desired behavior is to fall into that local minimum and
not to find the global minimum.
Hinton realized that, if jumps to higher energy states occasionally occurred, it would be possible
for the system to break out of local minima, but it was not clear to him how the system would
then behave and also when the uphill steps should be allowed [43]. Therefore, in order to
escape the local minima in a Boltzmann machine, Hinton advanced the following idea: make the
binary units stochastic and add thermal noise to the global energy such that, occasionally, it
would lead to uphill steps. Hence, Hinton proposed that the stochastic unit should update its
state based on its previous state according to the following rule: the ith unit of a configuration σ
at time t outputs the state 0 or the state 1 with probability:

P(σ_i) = 1 / (1 + exp(ΔE_i / T))     (4.34)
where: 𝐓 is the pseudo–temperature, i.e., a parameter which models the thermal noise injected
into the system; and Δ𝐸𝑖 is the energy gap between the current state and the previous state of
the ith unit of a configuration 𝜎.
Hinton used a simulated annealing algorithm to guide the reduction in the level of thermal noise.
He studied experimentally the effect of thermal noise over transition probabilities and came up
with an annealing schedule that starts with a higher pseudo–temperature and gradually reduces
it to a pseudo–temperature of 1. He based his annealing schedule on the following
observations: at low pseudo–temperatures there is a strong bias in favor of states with low
energy, but the time required to reach equilibrium may be longer; at higher pseudo–
temperatures the bias is not so favorable, but equilibrium is reached faster [43]. According to
Hinton, this technique cannot guarantee that a global minimum will be found, but it can
guarantee that a nearly global minimum will be found with high probability.
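Hinton’s actual schedules were tuned experimentally; the following is only a generic sketch of such a schedule, starting high and decaying geometrically toward the final pseudo–temperature of 1, with the starting value and decay rate chosen arbitrarily:

```python
def annealing_schedule(t_start=10.0, t_final=1.0, decay=0.9):
    """Yield a decreasing sequence of pseudo-temperatures ending at t_final."""
    T = t_start
    while T > t_final:
        yield T
        T *= decay        # geometric cooling step
    yield t_final         # finish exactly at the final pseudo-temperature

schedule = list(annealing_schedule())
assert schedule[0] == 10.0 and schedule[-1] == 1.0
# The schedule never increases the pseudo-temperature.
assert all(a >= b for a, b in zip(schedule, schedule[1:]))
```

Any decreasing schedule fits the description in the text; the geometric form is just a common, simple choice.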
Later, Hinton refined his original update rule by adopting, during each annealing stage, i.e.,
when the pseudo–temperature is kept constant, a variant of the Metropolis algorithm. He also
proposed a simplified version of the update rule (4.34): “if the energy gap between the on and
off states of the ith unit of a configuration 𝜎 is 𝛥𝐸𝑖′, then, regardless of the previous state of that
unit, set the unit to 1 with a probability given by formula” (4.35):
P(σ_i = 1) = 1 / (1 + exp( (E(σ_i = 1) − E(σ_i = 0)) / T ))

P(σ_i = 1) = 1 / (1 + exp( −(E(σ_i = 0) − E(σ_i = 1)) / T ))

P(σ_i = 1) = 1 / (1 + exp( −ΔE_i′ / T ))     (4.35)

where: ΔE_i′ = E(σ_i = 0) − E(σ_i = 1).

Hinton found inspiration in Boltzmann’s work, specifically the principle that a network consisting
of a large number of units, with each unit interacting with neighbouring units, will approach a
canonical distribution at equilibrium given by the Boltzmann–Gibbs distribution. Although the
development of Boltzmann machines has been motivated by ideas from statistical physics, they
are nevertheless neural networks. Therefore, the following two differences, which we mentioned
in the context of Markov random fields, should be carefully noted. Firstly, in neural networks the
parameter 𝐓 plays the role of a pseudo–temperature that has no physical meaning. Secondly, in
neural networks Boltzmann’s constant 𝐤 can be taken equal to 1.
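The update rules (4.34)/(4.35) reduce to a logistic function of the unit’s energy gap. A minimal sketch, with an arbitrary toy network and with `prob_on`/`update_unit` as hypothetical helper names:

```python
import math
import random

def prob_on(delta_e, T):
    """P(sigma_i = 1) = 1 / (1 + exp(-DeltaE_i' / T)), eq. (4.35)."""
    return 1.0 / (1.0 + math.exp(-delta_e / T))

def update_unit(sigma, i, W, theta, T, rng=random):
    """Stochastically set unit i; for energy (4.31) the gap
    DeltaE_i' = E(sigma_i = 0) - E(sigma_i = 1) equals the net input to i."""
    net = sum(W[i][j] * sigma[j] for j in range(len(sigma)) if j != i) - theta[i]
    sigma[i] = 1 if rng.random() < prob_on(net, T) else 0

# Degenerate checks: a huge positive gap makes the unit (almost) surely on.
assert prob_on(50.0, 1.0) > 0.999999
assert prob_on(-50.0, 1.0) < 1e-6

# One stochastic update on an arbitrary 3-unit network.
sigma = [0, 1, 0]
W = [[0.0, 2.0, -1.0], [2.0, 0.0, 0.5], [-1.0, 0.5, 0.0]]
theta = [0.0, 0.0, 0.0]
random.seed(0)
update_unit(sigma, 0, W, theta, T=1.0)
assert sigma[0] in (0, 1)
```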
The local nature of the update rules (4.34) and (4.35) ensures that raising the noise level is
equivalent to decreasing all the energy gaps between configurations, so in thermal equilibrium
the relative probability of two configurations 𝜎 and 𝜎′ is determined solely by their energy
difference and follows a Boltzmann distribution:
P(σ) / P(σ′) = exp( −(E(σ) − E(σ′)) / T )     (4.36)
where: 𝜎 and 𝜎′ are two configurations of a Boltzmann machine; 𝐏(𝜎) is the probability of the
Boltzmann machine to have the configuration 𝜎; 𝐸(𝜎) is the energy of the configuration 𝜎; and
𝐓 is the pseudo–temperature.
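On a network small enough to enumerate, relation (4.36) can be checked directly against the exact Boltzmann–Gibbs distribution; all parameters below are arbitrary illustrative choices:

```python
import itertools
import math

import numpy as np

rng = np.random.default_rng(1)
n, T = 4, 1.5
W = rng.normal(size=(n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
theta = rng.normal(size=n)

def E(s):
    """Energy (4.33) of a 0/1 configuration s."""
    s = np.asarray(s, float)
    return -(0.5 * s @ W @ s - theta @ s)

states = list(itertools.product([0, 1], repeat=n))
Z = sum(math.exp(-E(s) / T) for s in states)          # partition function
P = {s: math.exp(-E(s) / T) / Z for s in states}      # Boltzmann distribution

s1, s2 = states[3], states[9]
# Eq. (4.36): the probability ratio depends only on the energy difference.
assert math.isclose(P[s1] / P[s2], math.exp(-(E(s1) - E(s2)) / T))
assert math.isclose(sum(P.values()), 1.0)
```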
Hinton’s justification for this heuristic was the fact that energy barriers are what prevent a
system from reaching equilibrium rapidly at low pseudo–temperature and, if the energy barriers
can be suppressed or at least surpassed, equilibrium can be achieved rapidly at a pseudo–
temperature at which the distribution strongly favors the lower minima [43]. However, in Hinton’s
opinion, the energy barriers cannot be permanently removed because they correspond to states
that violate the constraints and the energies of these states must be kept high to prevent the
system from settling into them. Thereby, a solution to surpass the energy barriers between low–
lying states was needed. Hinton realized that, in a complex system with a high–dimensional
state space like the Boltzmann machine, the energy barriers between low–lying states are
highly degenerate, so the number of ways of getting from one low–lying state to another is an
exponential function of the height of the barrier one is willing to cross [43]. Thus, the effect of
either one of the update rules (4.34) and (4.35) is the opening of an enormous variety of paths
for escaping from a local minimum and, even though each path by itself is unlikely, it is highly
probable that the system would cross the energy barrier between two low–lying states [43].
4.4 How a Boltzmann Machine models data
As a binary pairwise Markov network, a Boltzmann machine operates with probabilities,
specifically it associates to each input a probability distribution over the output. Hence, we are
looking at two categories of data that are essential for this type of network: environmental data
and probability distributions associated with the underlying graphical model.
The input or environmental data consist of a set of binary vectors. Each input vector is
mapped one–to–one to the set of visible units, in this way producing a configuration over the
visible units. The problem that we need to address regarding these configurations is to fit a
model that will assign a probability to every possible configuration over the visible units. The
formulae (4.25) show that there are 2^m such configurations. Knowing this probability
distribution would allow us to decide whether other binary vectors come from the same
distribution. The network is said to have a perfect model of the environment if it achieves
exactly the same probability distribution over these 2^m configurations when it is running
freely at thermal equilibrium with no environmental input.
In order to allow the network to approach thermal equilibrium, Hinton makes the
assumptions that each environmental input vector persists for long enough and that any
structure in the sequence of environmental vectors is ignored [11]. The
distribution over all visible configurations 𝑣 is nothing else than the marginal distribution over
all the configurations 𝜎 of the network.
The probability distributions associated with the underlying graphical model can be divided
into three categories: joint configuration probabilities, conditional probabilities, and
marginals. To define these distributions, we follow an approach similar to the one we
used in Section 4.3 for the individual unit.
Joint configuration probabilities:
Let us consider the configuration 𝜎 of visible units 𝑣 and hidden units ℎ given by the
equation (4.27). Such a configuration is often called the “joint configuration” of 𝑣 and ℎ.
The probability of the joint configuration 𝜎 is related to the energy of that configuration,
which is given by the equations (4.31). Therefore, we start by making 𝑣 and ℎ explicit in
(4.31):

E(σ) = E(v, h) = −∑_{ {i,j}∈𝒢, i<j } σ_i ∙ σ_j ∙ w_ij + ∑_{i∈𝒩} σ_i ∙ θ_i     (4.37)
In Section 4.3 we have learned that, in a Boltzmann machine, the energy of a
configuration 𝜎 can be seen as a real function defined on I𝒩; therefore, according to the
definition (4.10), it belongs to the Hamiltonian family. Moreover, the energy of a configuration
𝜎 is the negative of a quadratic Hamiltonian. Then, the logical steps we need to follow to
obtain the expression of the probability of a joint configuration 𝜎 = (𝑣, ℎ) are the same
steps we followed in Section 4.1 to define the Gibbs measure associated to a
Hamiltonian. We start by looking at how the energies of joint configurations are related to
their probabilities and we identify two ways in which they are logically connected:
In one way we can define the probability of a joint configuration 𝜎 = (𝑣, ℎ) by using
an exponential model similar to the one used in the definition (4.13):
𝐏(𝑣, ℎ) ∝ exp(−𝐸(𝑣, ℎ)) (4.38)
Alternatively, we can define the probability of a joint configuration 𝜎 = (𝑣, ℎ) to be the
probability of finding the network in that particular joint configuration after we have
updated all of the stochastic binary units many times. Thus, the probability of a joint
configuration over both visible and hidden units depends on the energy of that joint
configuration compared with the energy of all other joint configurations. This
approach also follows the definition (4.13).
To comply with both requirements, we need to specify a normalization factor in the first
definition that is compatible with the second definition and this is exactly the partition
function defined by (4.14):
Z = ∑_{ u∈I^𝒱, g∈I^ℋ } exp(−E(u, g))     (4.39)
Therefore, the probability of a joint configuration (𝑣, ℎ) of visible and hidden units is
defined as:
P(v, h) = exp(−E(v, h)) / Z = exp(−E(v, h)) / ∑_{ u∈I^𝒱, g∈I^ℋ } exp(−E(u, g))     (4.40)
Conditional probabilities:
The conditional distributions over hidden and visible units are given by:
P(h_j = 1 | v, h_−j) = sigm( ∑_{i∈𝒱} v_i ∙ w_ij + ∑_{m∈ℋ−{j}} h_m ∙ w_mj − θ_j )     (4.41)

P(v_i = 1 | h, v_−i) = sigm( ∑_{j∈ℋ} h_j ∙ w_ji + ∑_{k∈𝒱−{i}} v_k ∙ w_ki − θ_i )     (4.42)
where 𝑥−𝑖 denotes a vector 𝑥 with the ith component 𝑥𝑖 omitted and sigm is the logistic
function.
Marginal probabilities:
The probability of a configuration 𝑣 of the visible units is the sum of the probabilities of all
the joint configurations that contain it; it is identical to the marginal distribution over
the configuration 𝑣 of visible units. It is computed with the following formula:

P(v) = ∑_{h∈I^ℋ} exp(−E(v, h)) / ∑_{ u∈I^𝒱, g∈I^ℋ } exp(−E(u, g))     (4.43)
The formulae (4.40) to (4.43) show that all the distributions specific to a generic Boltzmann
machine are intractable. The main reason for their intractability is the computation of the
partition function 𝑍 given by the equation (4.39).
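The source of the intractability is visible in a brute–force evaluation of (4.43): both sums range over exponentially many configurations. A toy–sized sketch, with sizes and parameters chosen arbitrarily:

```python
import itertools
import math

import numpy as np

n_vis, n_hid, T = 3, 2, 1.0
n = n_vis + n_hid
rng = np.random.default_rng(2)
W = rng.normal(size=(n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
theta = rng.normal(size=n)

def E(s):
    """Energy (4.33) of a 0/1 configuration s = (v, h)."""
    s = np.asarray(s, float)
    return -(0.5 * s @ W @ s - theta @ s)

def p_visible(v):
    """Eq. (4.43): 2^|H| terms in the numerator, 2^n terms in Z."""
    num = sum(math.exp(-E(tuple(v) + h) / T)
              for h in itertools.product([0, 1], repeat=n_hid))
    Z = sum(math.exp(-E(s) / T)
            for s in itertools.product([0, 1], repeat=n))
    return num / Z

# The 2^m visible-configuration probabilities form a distribution.
total = sum(p_visible(v) for v in itertools.product([0, 1], repeat=n_vis))
assert math.isclose(total, 1.0)
```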
4.5 General dynamics of Boltzmann Machines
Hinton investigated the behaviour of the Boltzmann machine under either one of the
update rules (4.34) and (4.35) in two scenarios: when approaching the thermal equilibrium and
after the thermal equilibrium was reached. To help understand the concept of thermal
equilibrium, he suggested an intuitive way to think about it which is inspired by the idea behind
the equation (2.7):
Imagine a huge ensemble of systems that all have exactly the same energy function; then the probability of a global state of the ensemble is just the fraction of the systems that have the corresponding state.
Similarly, reaching thermal equilibrium in a Boltzmann machine does not mean that the system
has settled down into the lowest energy configuration, but that the probability distribution over
configurations settles down to the stationary distribution. Hinton came up with an algorithm to
describe the dynamics of the Boltzmann machine. We are going to give firstly an intuitive
description of the algorithm (Algorithm 4.1) and then we are going to present the algorithm
formally (Algorithm 4.2).
Algorithm 4.1: Boltzmann Machine Dynamics 1
Given: W , 𝐓
begin
Step 1: Start with any distribution over all the identical units.
We could start with all the units in the same state or with an equal number of
units in each possible state.
Step 2: Keep applying the stochastic update rule (4.34) to pick the next state
for each randomly selected individual unit.
Step 3: After running the units stochastically in the right way, the network may
eventually reach a situation where the fraction of units in each state remains
constant.
This is the stationary distribution that physicists call thermal equilibrium.
Step 4: After reaching the thermal equilibrium, any given unit keeps changing
its state but the fraction of units in each state does not change. In other words,
once equilibrium has been reached, the number of units that “leave” a
configuration at each time step will be equal to the number of units that
“enter” that configuration.
end
We start our approach to formally define the dynamics of the Boltzmann machine by
establishing the assumptions the network works under:
the pseudo–temperature is 𝐓;
time is discrete and is represented by the set of nonnegative integers 𝒯 = {0,1,2,… };
at each time 𝑡 ∈ 𝒯 one unit 𝑖 ∈ 𝒩 is selected at random for a possible update.
We observe that, if a Boltzmann machine has the weights and thresholds varying in time,
deterministically or stochastically, then the Boltzmann machine becomes a particular case of a
time–varying neural network, whose definition follows.
Definition 4.3:
A time–varying Boltzmann machine 𝐓𝐕𝐁𝐌 is a 𝐁𝐌 that has a fixed 𝒩, a fixed 𝒢, and whose
parameters W = (W, Θ) can vary in time, either deterministically or stochastically.
Precisely, a 𝐓𝐕𝐁𝐌 is a four–tuple (𝒩, 𝒢, W, Θ) where W = {W(t)}_{t∈𝒯} and Θ = {Θ(t)}_{t∈𝒯} are
ℝ^𝒢–valued and ℝ^𝒩–valued stochastic processes, respectively.
With respect to Definition 4.3, we make the following remarks:
If 𝐁𝐌 is a 𝐓𝐕𝐁𝐌, then the net input computed according to the equation (4.9) by using the
weights W(t) and the thresholds Θ(t) is called the net input at time t and is denoted
net_t(i, σ).
The unit 𝑖 selected for update at moment 𝑡 and denoted 𝑖(𝑡) finds out what its new state is
going to be by computing two quantities: its net input at time 𝑡 and the probability given by
the update rule.
The update rules (4.34) and (4.35) become time–varying as well, so they need to be
changed to reflect the time factor. However, before we modify the general update rule (4.34)
to reflect the time factor, we need to rewrite the rule to highlight the states of the ith unit at
two consecutive moments: 𝑡 − 1 and 𝑡.
Claim: Given a configuration 𝜎 and a unit 𝑖, the update rule (4.34) can be written as:
P(σ_i(t) | σ_i(t − 1)) = 1 / (1 + exp( −(2 ∙ σ_i(t) − 1) ∙ net(i, σ(t − 1)) / T ))     (4.44)
Proof: We start by looking at the denominator in the formula (4.34), specifically at the term Δ𝐸𝑖.
We know from the equation (4.32) that the energy of a configuration is the negative of the
quadratic Hamiltonian of that configuration, so we are focusing our attention on the quadratic
Hamiltonian given by the equation (4.16).
Given a configuration 𝜎 and a unit 𝑖, we can split the quadratic Hamiltonian associated to the
configuration 𝜎 into two terms such that one term reflects the contribution of the unit 𝑖 and other
term H′ incorporates the contribution of all the units except 𝑖. We observe that the term that
reflects the contribution of unit 𝑖 to the quadratic Hamiltonian is related to the net input to unit 𝑖
defined by the equation (4.9). Therefore, we can write:
H_W(σ) = net(i, σ) ∙ σ_i + H′     (4.45)
In particular, if σ^(i) denotes the configuration obtained from σ by switching the value of the ith
unit, then we can compute the variation of the quadratic Hamiltonian corresponding to this
operation:

ΔH = H_W(σ^(i)) − H_W(σ) = net(i, σ^(i)) ∙ σ_i^(i) + H′ − net(i, σ) ∙ σ_i − H′

ΔH = net(i, σ^(i)) ∙ σ_i^(i) − net(i, σ) ∙ σ_i     (4.46)
A few observations have to be made with respect to the equation (4.46).
Firstly, the net input of a configuration towards a unit (equation (4.9)) doesn’t depend on the
state of that unit. This means that two configurations that differ only in one and the same
unit have exactly the same net input to that unit, assuming that the parameters of the
network are the same. This fact translates to:
net(i, σ^(i)) = net(i, σ)     (4.47)

Accordingly, the equation (4.46) can be rewritten as:

ΔH = net(i, σ) ∙ (σ_i^(i) − σ_i) = net(i, σ) ∙ Δσ_i     (4.48)
Secondly, switching the value of the ith hypothesis of a configuration that is also an I–valued
configuration, where I = {0,1}, is the same as applying the following formula to that
hypothesis:
σ_i = 1 − σ_i^(i) ⇔ σ_i^(i) = 1 − σ_i     (4.49)

Accordingly, the equation (4.48) can be written in an equivalent form using only the new
state σ_i^(i) given by the equation (4.49):

ΔH = net(i, σ) ∙ (σ_i^(i) − 1 + σ_i^(i)) = net(i, σ) ∙ (2 ∙ σ_i^(i) − 1)     (4.50)
Thirdly, we compute ΔE_i:

ΔE_i = E(σ^(i)) − E(σ) = −H_W(σ^(i)) + H_W(σ) = −ΔH

ΔE_i = −net(i, σ) ∙ Δσ_i = −(2 ∙ σ_i^(i) − 1) ∙ net(i, σ)     (4.51)
If we substitute (4.51) into (4.34) we obtain:

P(σ_i(t)) = 1 / (1 + exp( −(2 ∙ σ_i(t) − 1) ∙ net(i, σ(t − 1)) / T ))     (4.52)
The state of the ith unit at time 𝑡 depends only on its previous state. Therefore, the equation
(4.52) becomes exactly (4.44).
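The identity (4.51), ΔE_i = −(2 ∙ σ_i^(i) − 1) ∙ net(i, σ), can be verified numerically by comparing it with a direct energy difference on a toy network (parameters arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
W = rng.normal(size=(n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
theta = rng.normal(size=n)

def E(s):
    """Energy (4.33) of a 0/1 configuration s."""
    s = np.asarray(s, float)
    return -(0.5 * s @ W @ s - theta @ s)

def net(i, s):
    """Net input to unit i; it does not depend on s[i] (cf. eq. (4.47))."""
    return sum(W[i, j] * s[j] for j in range(n) if j != i) - theta[i]

s = rng.integers(0, 2, size=n)
for i in range(n):
    flipped = s.copy()
    flipped[i] = 1 - flipped[i]          # sigma^(i): flip the i-th unit
    dE = E(flipped) - E(s)               # direct energy difference
    assert np.isclose(dE, -(2 * flipped[i] - 1) * net(i, s))
```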
After we incorporate all the remarks regarding Definition 4.3 into the formula (4.44), we obtain
the following update rule for a 𝐓𝐕𝐁𝐌. This is the rule used by Algorithm 4.2.
P(σ_{i(t)}(t) | σ_{i(t)}(t − 1)) = 1 / (1 + exp( −(2 ∙ σ_{i(t)}(t) − 1) ∙ net_{t−1}(i(t), σ(t − 1)) / T ))     (4.53)
Definition 4.4:
A Boltzmann Machine Dynamics (BMD) on a network 𝐁𝐌/𝐓𝐕𝐁𝐌 is a Markov chain {σ(t)}_{t∈𝒯}
with state space I^𝒩 and whose transitions occur according to the following algorithm:
Algorithm 4.2: Boltzmann Machine Dynamics 2
Given: W , 𝐓
begin
Step 1: repeat
Step 2: at each time t ∈ 𝒯 one unit i(t) is chosen at random from the set
𝒩 ∪ {0} with the probability 1/(n + 1)
Step 3: if the unit 𝑖(𝑡) is the “0” unit then set: 𝜎(𝑡) = 𝜎(𝑡 − 1)
else
Step 4: compute the net input to unit i(t):
x = net_{t−1}(i(t), σ)
Step 5: generate the candidate state y for the unit i(t):
y = 1 − σ(t − 1)_{i(t)}
Step 6: if y = 0 then compute the probability: P = 1 / (1 + exp(x / T))
else compute the probability: P = 1 / (1 + exp(−x / T))
Step 7: if 𝑥 ∙ (2𝑦 − 1) > 0 then accept y as the state of unit 𝑖(𝑡):
𝜎(𝑡)𝑖(𝑡) = 𝑦
else accept y as the state of unit 𝑖(𝑡) with probability 𝐏:
if random(0,1) < 𝐏 then 𝜎(𝑡)𝑖(𝑡) = 𝑦
Step 8: 𝜎(𝑡)𝑗 = 𝜎(𝑡 − 1)𝑗 for any 𝑗 ≠ 𝑖(𝑡)
Step 9: until stopping criterion true
end
In Algorithm 4.2 the notation random(0,1) denotes a sample from a uniform distribution 𝒰[0,1].
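Algorithm 4.2 can be transcribed almost line by line; the sketch below uses a fixed number of steps as the stopping criterion and arbitrary toy parameters, both illustrative choices:

```python
import math
import random

def bmd_step(sigma, W, theta, T, rng):
    """One transition of the Boltzmann Machine Dynamics Markov chain."""
    n = len(sigma)
    i = rng.randrange(n + 1) - 1          # pick from N ∪ {0}, each w.p. 1/(n+1)
    if i < 0:                             # the "0" unit: keep the configuration
        return sigma
    # Step 4: net input to unit i given the current configuration.
    x = sum(W[i][j] * sigma[j] for j in range(n) if j != i) - theta[i]
    y = 1 - sigma[i]                      # Step 5: candidate state (flip unit i)
    # Step 6: acceptance probability for the candidate state.
    p = 1.0 / (1.0 + math.exp(-x / T)) if y == 1 else 1.0 / (1.0 + math.exp(x / T))
    new = list(sigma)                     # Step 8: all other units are copied
    # Step 7: downhill moves are accepted outright, uphill moves w.p. p.
    if x * (2 * y - 1) > 0 or rng.random() < p:
        new[i] = y
    return new

# Illustrative run on a tiny arbitrary network.
rng = random.Random(0)
n = 4
W = [[0.0 if i == j else 0.5 for j in range(n)] for i in range(n)]
theta = [0.1] * n
sigma = [0, 1, 0, 1]
for _ in range(1000):                     # "until stopping criterion": fixed steps
    sigma = bmd_step(sigma, W, theta, T=1.0, rng=rng)
assert all(s in (0, 1) for s in sigma)
```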
Furthermore, we consider Boltzmann machines with inputs and their corresponding dynamics.
Specifically we clamp certain units at certain values of the activation levels and do not allow
them to switch. The set of units that are clamped at a particular time 𝑡 will be allowed to depend
on t. Suppose we choose a subset 𝒞 ⊆ 𝒩, which may depend on t, in which case we denote it
by 𝒞(t) ⊆ 𝒩. Also suppose we have an “external input,” i.e., a process v = {v(t)}_{t∈𝒯} such that
each v(t) takes values in I^𝒞(t).
Definition 4.5:
A Boltzmann Machine Dynamics with Inputs (BMDI) is a BMD process that follows Algorithm
4.2 except that in Step 2 the choice of i(t) is limited to the set (𝒩 ∪ {0}) − 𝒞(t); consequently
the probability of a particular unit being chosen is 1/|(𝒩 ∪ {0}) − 𝒞(t)| = 1/(n + 1 − |𝒞(t)|).
Therefore, Step 2 of the algorithm looks like this:
Step 2: at each time t ∈ 𝒯 one unit i(t) is chosen at random from the set
(𝒩 ∪ {0}) − 𝒞(t) with the probability 1/(n + 1 − |𝒞(t)|)
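Only the unit–selection step changes between BMD and BMDI; a sketch of the modified Step 2, with the clamped set passed in explicitly (the interface and function name are illustrative choices):

```python
import random

def pick_unit_bmdi(n, clamped, rng):
    """Choose i(t) uniformly from (N ∪ {0}) − C(t), prob. 1/(n+1−|C(t)|).

    Units are 0..n-1 and -1 stands for the "0" unit (never clamped here).
    """
    candidates = [-1] + [i for i in range(n) if i not in clamped]
    return rng.choice(candidates)

rng = random.Random(0)
clamped = {0, 2}                    # clamp units 0 and 2 to their current states
picks = {pick_unit_bmdi(5, clamped, rng) for _ in range(200)}
assert clamped.isdisjoint(picks)    # clamped units are never selected for update
```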
4.6 The biological interpretation of the model
In this section we present Hinton’s original argumentation to support the idea that
Boltzmann machines bear resemblance to the brain; therefore, it is worth studying them. We
start by presenting some of the facts that make the Boltzmann machine belong to the same
general class of computation devices as the brain. Then we present some irreconcilable
differences between Boltzmann machine and the cortex. Both categories of arguments (pro and
contra) had been presented by Hinton in [10]. Before we present the arguments, we need to
define two concepts, native to physiology, which we are going to use.
An action potential is a short–lasting event in which the electrical membrane potential of a cell
rapidly rises and falls, following a consistent trajectory; in other words, it is a propagated impulse.
An electrotonic potential is a non–propagated local potential, resulting from a local change in
ionic conductance (e.g. synaptic or sensory that engenders a local current); when it spreads
along a stretch of membrane, it becomes exponentially smaller (decrement).
Similitudes between Boltzmann machine and the cortex:
The cerebral cortex is relatively uniform in structure.
Different areas of cerebral cortex are specialized for processing information from
different sensory modalities such as: visual cortex, auditory cortex, and somatosensory
cortex. Other areas are specialized for motor functions. However, all these cortical areas
have a similar anatomical organization and are more similar to each other in
cytoarchitecture than they are to any other part of the brain [10].
Many problems in vision, speech recognition, associative recall, and motor control can
be formulated as searches. The similarity between different areas of cerebral cortex
suggests that the same kind of massively parallel searches may be performed in many
different cortical areas [10].
Differences between Boltzmann machine and the cortex:
Binary states and action potentials
The simple binary units which are components of Boltzmann machines are not literal
models of cortical neurons. According to Hinton, the assumptions that the binary units
change their states asynchronously and use a probabilistic decision rule seem
closer to reality than a model with synchronous deterministic updating [10].
The energy gap for a binary unit has a role similar to that played by the membrane
potential for a neuron: both are the sum of excitatory and inhibitory inputs and both are
used to determine the output state. However, the cortical neurons produce action
potentials, which are brief spikes that propagate down axons, rather than binary outputs.
When an action potential reaches a synapse, the signal it produces in the postsynaptic
neuron rises to a maximum and then exponentially decays with the time constant of the
membrane (typically around five milliseconds for neurons in cerebral cortex). The effect
of a single spike on the postsynaptic cell body may be further broadened by electrotonic
transmission down the dendrite to the cell body [10].
The energy gap represents the summed input from all the recently active binary units. If
the average time between updates is identified with the average duration of a
postsynaptic potential, then the binary pulse between updates can be considered an
approximation to the postsynaptic potential. Although the shape of a single binary pulse
differs significantly from a postsynaptic potential, the sum of a large number of stochastic
pulses is independent of the shape of the individual pulses and depends only on their
amplitudes and durations [10]. According to Hinton, for large networks having the large
fan–ins typical of cerebral cortex (around 10000), the binary approximation may not be
too bad [10].
Implementing pseudo–temperature in units
The membrane potential of a neuron is graded, but if it exceeds a fairly sharp threshold,
an action potential is produced, followed by a refractory period lasting several
milliseconds, during which another action potential cannot be elicited. If Gaussian noise
is added to the membrane potential, then even if the total synaptic input is below
threshold, there is a finite probability that the membrane potential will reach threshold
[10]. The amplitude of the Gaussian noise determines the width of the sigmoidal
probability distribution for the unit to fire during a short time interval and it therefore plays
the role of pseudo–temperature in the model [10].
According to Hinton, a cumulative Gaussian is a very good approximation to the
required probability distribution but it might be difficult to implement because the units in
the network should be arranged in such a way that all of them have the same amplitude
of noise [10].
Asymmetry and time–delays
In a generic Boltzmann machine all the connections are symmetrical. This assumption is
not always true for neurons in the cerebral cortex. However, if the constraints of a
problem are inherently symmetrical and if the network, on average, approximates the
required symmetrical connectivity, then random asymmetries in a large network will be
reflected as an increase in the Gaussian noise in each unit [10]. Hinton proposes the
following experiment to see why random asymmetry acts as Gaussian noise:
Consider a symmetrical network in which pairs of units are linked by two equal one–way connections, one in each direction. Then perform the following operation on all pairs of these one–way connections: remove one of the connections and double the strength of the other. Provided the choice of which connection to remove is made randomly, this operation will not alter the expected value of the input to a unit from the other units. On average, it will “see” half as many other units but with twice the weight. So if a unit has a large fan–in, it will be able to make a good unbiased estimate of what its total input would have been if the links had not been cut. However, the use of fewer, larger weights will increase the variance of the energy gap and will thus act as added noise.
Experimentally Hinton came to the conclusion that time–delays act like added noise as
well. His experimental results have been confirmed mathematically for first order
constraints, provided the fan–in is large and the weights are small compared with the
energy gaps [10].
Chapter 5. The Mathematical Theory of Learning Algorithms for Boltzmann Machines
One of the most interesting aspects of the Boltzmann machine formalization is that it
leads to a domain–independent learning algorithm [10]. Intuitively, learning for Boltzmann
machines means "acquiring a particular behavior by observing it” [3], i.e., progressively
adjusting the connection strengths between units in such a way that the whole network develops
an internal model which captures the underlying structure of the environment [10]. The goal of
learning in Boltzmann machines is rather different from other learning algorithms like, for
instance, backpropagation learning. Rather than learning a non–linear model from inputs to
outputs, the goal of learning in the classical asynchronous Boltzmann machine is to improve the
network’s model of the structure of the environment by choosing the parameters of the network
such that the stochastic behaviour observed on the visible units when the network is free–
running closely models that observed in the environment [43].
5.1 Problem description
The formal definition of the learning process we present is inspired by Sussmann’s work
[1,3] but at the same time reflects our understanding of this family of algorithms. Before we
formalize the learning process, we lay out the context it operates in. Suppose we are given a
Boltzmann machine 𝐁𝐌 = (𝒩, 𝒢, W, Θ) with |𝒩| = n and with the set of random variables
associated to the units denoted X = (𝑋1, 𝑋2, … , 𝑋𝑛). In this way we establish the connection
between the Boltzmann machine learning and the Markov networks discussed in Chapter 2 and
Chapter 3. Thus, a configuration 𝜎 of 𝐁𝐌 is nothing else than an instantiation of the set of
random variables X of the underlying Markov network.
According to the definition (4.23), the true probability distribution 𝐏 of a joint configuration in a
Boltzmann machine is, in fact, the Gibbs measure 𝐆_𝐖 (equation (4.13)) associated to the
Hamiltonian 𝐇_𝐖 (equation (4.18)), which itself would be associated to the parameters 𝐖 of the
network at thermal equilibrium, if they were known. Because the partition function of a
Boltzmann machine is generally intractable, all these quantities – 𝐖, 𝐇_𝐖, 𝐆_𝐖, and 𝐏 – cannot
be determined exactly. Therefore, we resort to their approximations, which in principle are Ŵ,
𝐇_Ŵ, 𝐆_Ŵ, and P̂. In order to make some proofs easier to grasp, we might use more suggestive
94
notations for some of these variables. If that is the case, we will specify, if applicable, the
correspondence between the notations.
Definition 5.1:
Given a Boltzmann machine 𝐁𝐌 = (𝒩, 𝒢′, W) and a sequence of random configurations 𝜎 ∈ I𝒩, distributed according to a probability 𝐏̃, which are presented to the network as inputs at various times, a learning process 𝓛 is a sequence of pairs (W, 𝜎) ∈ ℝ𝒢′ × I𝒩 that satisfies the following property: the parameters W converge to a value W̃ such that the corresponding Gibbs measure 𝐆W̃ is the same as the observable distribution 𝐏̃ of the configurations 𝜎 presented to the network:

lim_{𝑡→∞} W = W̃ such that 𝐆W̃ = 𝐏̃ (5.1)
A variant of the learning process has 𝒩 split into two disjoint sets 𝒱 (visible units) and ℋ (hidden units). Thus, the observable distribution is a probability distribution over I𝒱 and the learning process evolves in ℝ𝒢′ × I𝒱. Definition 5.2 characterizes this scenario.
Definition 5.2:
Given a Boltzmann machine 𝐁𝐌 = (𝒩, 𝒢′, W) and a sequence of random data vectors 𝑣 ∈ I𝒱, distributed according to a probability 𝐏̃, that are presented to the network as inputs at various times, a learning process 𝓛 is a sequence of pairs (W, 𝑣) ∈ ℝ𝒢′ × I𝒱 that satisfies the following property: the parameters W converge to a value W̃ such that the marginal of the corresponding Gibbs measure 𝐆W̃ over the variables 𝑣 ∈ I𝒱 is the same as the observable distribution 𝐏̃ of the visible vectors 𝑣 presented to the network:

lim_{𝑡→∞} W = W̃ such that MARG(𝐆W̃, 𝒱) = 𝐏̃ (5.2)
In reality, MARG(𝐆𝐖, 𝒱) and 𝐏̃ are not equal, so Definition 5.2 rather expresses a desired goal of the learning process. Therefore, the aim of asynchronous Boltzmann machine learning becomes to reduce the difference between MARG(𝐆𝐖, 𝒱) and 𝐏̃ by performing gradient descent in the parameter space on a suitable measure of their difference.
The environment imposes the distribution 𝐏̃ over the network by clamping the visible units, which means the following:
each member of I𝒱 is probabilistically selected using 𝐏̃; the probability of selecting 𝑣 is 𝐏̃(𝑣);
the selected members of I𝒱 are presented to the network sequentially;
each selected vector 𝑣 is tested by running the Boltzmann machine for a time unit long enough for the network to reach thermal equilibrium;
in each time unit the following two steps take place:
Step 1: all the units are updated;
Step 2: the visible units are reset to 𝑣.
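The clamping procedure can be sketched in code as follows; this is only an illustrative sketch, and the helper names (`clamp_and_run`, `update_unit`) and the dictionary encodings are assumptions of the sketch, not notation from the thesis:

```python
import random

def clamp_and_run(q, hidden_ids, visible_ids, update_unit, steps=100):
    """Sketch of the clamping procedure: draw a visible vector v with
    probability q(v), then repeatedly update all units and re-clamp the
    visible units back to v after every sweep."""
    # Probabilistically select a visible vector v using the distribution q
    vectors, probs = zip(*q.items())
    v = random.choices(vectors, weights=probs)[0]
    # Initialize the network state: visibles set to v, hiddens random
    state = {i: v[k] for k, i in enumerate(visible_ids)}
    state.update({j: random.randint(0, 1) for j in hidden_ids})
    for _ in range(steps):                       # one "time unit"
        for unit in visible_ids + hidden_ids:    # Step 1: update all units
            state[unit] = update_unit(state, unit)
        for k, i in enumerate(visible_ids):      # Step 2: reset visibles to v
            state[i] = v[k]
    return v, state
```

In a full implementation `update_unit` would be the stochastic Metropolis update; here it is left as a parameter.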
We introduce the following definitions and notations for the probability distributions that play a
role in asynchronous Boltzmann machine learning. Then, we summarize them in Table 1.
Let 𝐏𝐓(σ) = 𝐏𝐓(𝑣, ℎ) be the free running equilibrium distribution at pseudo–temperature 𝐓.
Let 𝐏𝐓(ℎ|𝑣) be the probability of the free running network, at thermal equilibrium, that the
hidden units are set to ℎ given that the visible units are set to 𝑣 on the very same time step.
Let 𝐩𝐓(𝑣) be the probability distribution over the states of the visible units when the network
in thermal equilibrium is running freely at pseudo–temperature 𝐓.
Let 𝐪(𝑣) be the environmentally imposed probability distribution over the state vectors 𝑣 of
visible units.
Let 𝐐𝐓(ℎ|𝑣) be the probability that vector ℎ will occur on the hidden units when 𝑣 is clamped
on the visible units and the network of hidden units is allowed to run at pseudo–temperature
𝐓.
Let 𝐐𝐓(𝜎) = 𝐐𝐓(𝑣, ℎ) be the probability of observing the global state 𝜎 over multiple runs in
which successive vectors 𝑣 are clamped with probability 𝐪(𝑣).
𝐏𝐓 represents the probability described as 𝐆𝐖 in Definition 5.2.
𝐩𝐓 represents the probability described as MARG(𝐆𝐖, 𝒱) in Definition 5.2.
𝐪 represents the probability described as 𝐏̃ in Definition 5.2.
Table 1 Distributions of interest in asynchronous Boltzmann machine learning

Equilibrium (true) distribution 𝐏𝐓 ≡ 𝐆𝐖 (defined on the whole state space 𝒩 = 𝒱 ∪ ℋ):
    visible units clamped: 𝐩𝐓(𝑣)
    free–running: 𝐏𝐓(𝜎)
Environmental (data) distribution 𝐐𝐓 (defined on the state space 𝒩 and observable on the visible space 𝒱):
    visible units clamped: 𝐪(𝑣)
    free–running: 𝐐𝐓(𝜎)
There is a subtle difference between the conditional distribution 𝐏𝐓(ℎ|𝑣), which refers to the free
process, and 𝐐𝐓(ℎ|𝑣), which refers to the clamped process. During the free process, the visible
units are allowed to change on every time step; therefore, 𝐏𝐓(ℎ|𝑣) quantifies the probability that
the network arrives at configuration (𝑣, ℎ) on the very same time step. During the clamped
process, the visible units are initially set to 𝑣 and only the network of hidden units is allowed to
freely run; therefore, 𝐐𝐓(ℎ|𝑣) quantifies the probability that the network of hidden units arrives at
configuration ℎ in a time step following the initial time step when the visible units have been
clamped.
A formal description of the learning problem in Boltzmann machines is presented below.
Problem Boltzmann Machine Learning:
Given: 𝒩, 𝒢′, and the split of the units into visible and hidden: 𝒩 = 𝒱 ∪ ℋ;
a set of data vectors 𝑣 ∈ I𝒱 used as inputs at various times;
the vectors 𝑣 are distributed according to an observable distribution 𝐪;
𝐪/𝐐𝐓 belong to an exponential family and have parameters W;
𝐪 = 𝐩𝐓.
Find: the best possible W̃ close to W.
Subject to:
1. a 𝐁𝐌 with visible units 𝑣 and hidden units ℎ runs on all its 𝑛 units according to a probability 𝐏𝐓 ≡ 𝐆𝐖 that has parameters W;
2. W̃ converges to W.
The learning algorithms for Boltzmann machines build on approximate inference algorithms for pairwise Markov networks. Based on the approach employed to perform approximate inference, the learning algorithms for Boltzmann machines can be divided into two groups or families:
one family uses approximate maximum likelihood methods;
the other family uses variational methods to compute the free energies.
5.2 Phases of a learning algorithm in a Boltzmann Machine
By performing learning, the Boltzmann machine captures the underlying structure of its environment and becomes capable of performing various pattern completion tasks. In one type of task, the network must complete a pattern from any sufficiently large part of it, without knowing in advance which part must be completed. In another type of task, the network knows in advance which parts of the pattern will be given as input and which parts will have to be completed as output. These two pattern completion paradigms lead to the presence of two phases in the learning procedure, one phase corresponding to each paradigm.
Before we study these phases, we introduce the following parameters:
𝛿 ∈ ℝ, 𝛿 > 0 is a constant of proportionality called the learning rate;
𝑝𝑎 ∈ ℕ, 𝑝𝑎 > 0 is the number of patterns shown to the network;
𝑒𝑝 ∈ ℕ, 𝑒𝑝 > 0 is the number of learning cycles (epochs) during which the algorithm sees all
the patterns. An epoch, which is a complete pass through a given dataset, should not be
confused with an iteration, which is simply one update of the neural net model’s parameters.
A suggestive designation of these phases belongs to Sussmann, who called them the "hallucinating phase" and the "learning phase", respectively [1,3]. Generally, during a learning phase, a pattern 𝑣 ∈ I𝒱 is "taught" by clamping the units 𝑖 ∈ 𝒱 so that their activation levels 𝜎𝑖 are the same as 𝑣𝑖, and by allowing the hidden units to evolve according to the Metropolis dynamics. In this phase the weights are adjusted according to the Hebb rule, that is, at each step each weight 𝑤𝑖𝑗 is incremented by the quantity:

Δ𝑤𝑖𝑗 = 𝛿 ∙ (2𝜎𝑖 − 1) ∙ (2𝜎𝑗 − 1) (5.3)

where 𝛿 is the learning rate.
During the hallucinating phase the whole network evolves following the Metropolis dynamics. The adjustment that takes place in this phase is similar to the one from the learning phase, except that now the quantity added to each weight 𝑤𝑖𝑗 is the negative of the quantity in (5.3).
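For binary units the factor (2𝜎 − 1) maps {0, 1} to {−1, +1}, so the increment (5.3) is +𝛿 when the two units agree and −𝛿 when they disagree. A one-function sketch covering both phases (the function name and the boolean flag are illustrative):

```python
def phase_increment(sigma_i, sigma_j, delta=0.05, learning_phase=True):
    """Weight change for one step: equation (5.3) in the learning phase,
    and the same quantity with opposite sign in the hallucinating phase.
    For binary units, (2*s - 1) maps {0, 1} onto {-1, +1}."""
    d = delta * (2 * sigma_i - 1) * (2 * sigma_j - 1)
    return d if learning_phase else -d
```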
Now that we know that the learning algorithm has two phases, how are these phases linked temporally? The answer is that the learning and hallucinating phases should alternate.
Hinton justifies the alternation of phases by using a well–known method for identifying the parameters of an unknown probability distribution: maximum likelihood estimation [10]. Hinton calls these two phases "collecting data–independent statistics" and "collecting data–dependent statistics", or the "negative phase" and the "positive phase", respectively.
According to Hinton, we can formulate the learning problem as one of minimizing the distance between two Gibbs measures: the environmental measure 𝐪 and the measure 𝐩𝐓 = MARG(𝐆𝐖, 𝒱), where 𝐆𝐖 describes the behavior of the network at equilibrium. The gradient of this distance, regarded as a function of the parameters, is then a difference of two terms: one term is the mean of the product (2𝜎𝑖 − 1) ∙ (2𝜎𝑗 − 1) with respect to 𝐪; the other term is the mean of the same quantity with respect to 𝐆𝐖. Hinton's claim is that the positive phase approximately computes the first term and the negative phase approximately computes the second term.
Sussmann gives another justification for alternating the phases of the learning procedure [1,3]. He claims that the learning procedure cannot consist of the learning phase alone because, if it did, the weights would blow up. Sussmann's explanation is that, during the learning phase, the network is doing the "correct" thing (i.e., the configurations 𝜎 = (𝑣, ℎ) where 𝑣 ∈ I𝒱 have "correct" values), because it has been forced to by clamping the visible units 𝑖 ∈ 𝒱 at values that correspond to a desired pattern. Hence, whatever the network is doing should be reinforced. If a particular product (2𝜎𝑖 − 1) ∙ (2𝜎𝑗 − 1) happens to be positive, it means that the net "wants" 𝜎𝑖 and 𝜎𝑗 to be "in sync"; hence, the weight 𝑤𝑖𝑗 should be increased to make this more likely. This means that the connection between the units 𝑖 and 𝑗 should be made more "excitatory" by making 𝑤𝑖𝑗 more positive, e.g., by adding to it the positive number Δ𝑤𝑖𝑗. Similarly, if the product (2𝜎𝑖 − 1) ∙ (2𝜎𝑗 − 1) is negative, 𝑤𝑖𝑗 should be decreased, and once again this is achieved by adding the negative number Δ𝑤𝑖𝑗 to it.
Furthermore, if the learning algorithm had only the learning phase, then some weights would keep increasing. Indeed, assume that the weights are updated at every step of the learning process, and consider a pair of visible units 𝑖 and 𝑗. If we only performed the learning phase as outlined above, then after 𝑒𝑝 × 𝑝𝑎 steps the weight 𝑤𝑖𝑗 would become:

𝑤𝑖𝑗 + 𝛿 ∙ 𝑒𝑝 ∙ 𝑝𝑎 ∙ ⟨(2𝜎𝑖 − 1) ∙ (2𝜎𝑗 − 1)⟩ (5.4)

where ⟨(2𝜎𝑖 − 1) ∙ (2𝜎𝑗 − 1)⟩ represents the sample mean of the product (2𝜎𝑖 − 1) ∙ (2𝜎𝑗 − 1) over the sample consisting of the 𝑝𝑎 patterns 𝑣(1), 𝑣(2), …, 𝑣(𝑝𝑎) used in the training.
If we assume that the patterns 𝑣(1), 𝑣(2), …, 𝑣(𝑝𝑎) are independent and identically distributed, or, more generally, that the Markov process {𝑣(𝑘)}𝑘 is ergodic, then the sample mean for the pair of visible units 𝑖 and 𝑗 will converge almost surely to the expected value of (2𝜎𝑖 − 1) ∙ (2𝜎𝑗 − 1) with respect to the measure 𝐪. Unless this expected value happens to vanish, the weight 𝑤𝑖𝑗 will blow up as 𝑡 → +∞. Therefore, Sussmann concludes that something else has to be done to prevent this from happening, and that something could very well be the alternating presence of the hallucinating phase.
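Sussmann's blow-up argument is easy to illustrate numerically: with only learning-phase updates, the weight for a visible pair grows linearly with the number of epochs, exactly as in equation (5.4). The function name and pattern encoding below are hypothetical:

```python
def learning_only_weight(patterns, epochs, delta=0.1, w0=0.0):
    """Accumulate only learning-phase increments for one visible pair (i, j).
    After epochs * len(patterns) steps the weight equals
    w0 + delta * ep * pa * <(2s_i-1)(2s_j-1)>, equation (5.4)."""
    w = w0
    for _ in range(epochs):
        for (si, sj) in patterns:       # each pattern gives the pair's states
            w += delta * (2 * si - 1) * (2 * sj - 1)
    return w
```

With patterns whose sample mean of (2𝜎𝑖 − 1)(2𝜎𝑗 − 1) is nonzero, the weight grows without bound as the number of epochs increases.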
5.3 Learning algorithms based on approximate maximum likelihood
One way to find the parameters in the Boltzmann Machine Learning problem is by means of maximum likelihood estimation. Under the maximum likelihood principle there is only a single data set (namely the one that is actually observed), and the parameters are set to the value that maximizes the likelihood 𝐏(data | params), i.e., the parameters are chosen such that the probability of the observed data set is maximized. A variant of this principle, very well suited for exponential models, maximizes the log likelihood log 𝐏(data | params).
According to Definition 5.2, the goal of asynchronous Boltzmann machine learning is to minimize the difference between MARG(𝐆𝐖, 𝒱) and 𝐏̃, which translates into minimizing the difference between 𝐩𝐓 and 𝐪. That is equivalent to maximizing the log likelihood of generating the environmental distribution 𝐐𝐓 when the network is running freely at equilibrium [43]. Regardless of the path chosen – maximizing the log likelihood of 𝐐𝐓 or minimizing the difference between 𝐩𝐓 and 𝐪 – the end result is the same: W̃. The path we follow in this paper to obtain W̃ is to minimize the difference between 𝐪 and 𝐩𝐓, where the difference is expressed by their KL–divergence KL(𝐪||𝐩𝐓). The KL–divergence of two probability distributions is always nonnegative and becomes zero if and only if those distributions are equal (equations (3.34) and (3.35)).
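Since the objective KL(𝐪||𝐩𝐓) is taken over the finite set of visible vectors, it can be computed directly whenever both distributions are available explicitly; a minimal sketch (the dictionary representation of a distribution is an assumption of the sketch):

```python
import math

def kl_divergence(q, p):
    """KL(q||p) = sum_v q(v) * ln(q(v)/p(v)).
    Nonnegative, and zero exactly when q == p; terms with q(v) = 0
    contribute nothing and are skipped."""
    return sum(qv * math.log(qv / p[v]) for v, qv in q.items() if qv > 0)
```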
5.3.1 Learning by minimizing the KL–divergence of Gibbs measures
The aim of this section is to present the generic Boltzmann machine learning algorithm proposed by Ackley, Hinton, and Sejnowski in [11] (see also [43]). In essence, the generic learning algorithm proposed by Ackley et al. in [11,43] locally computes the difference between two statistics and uses the result to update the "local" parameters. One statistic is the expectation with respect to the data distribution, i.e., the environmentally imposed distribution 𝐐𝐓; the other statistic is the expectation with respect to the true distribution, i.e., the Gibbs measure 𝐏𝐓. We will introduce the formulae for these expectations later in this section.
In order to derive a measure of how effectively the weights in the network are being used for
modelling the environment, Ackley et al. have made the assumption that there is no structure in
the sequential order of the environmentally clamped vectors. However, Ackley et al. admitted
that this is not a realistic assumption and a more realistic assumption would be that the
complete structure of the ensemble of the environmentally clamped vectors can be specified by
giving the probability of each of the 2𝑚 vectors over the 𝑚 visible units [11,43].
The version of the generic Boltzmann machine learning algorithm we present is inspired by
[42,61] and reflects our understanding of this important algorithm. We start by evaluating the
effect that clamping a data vector onto the visible units has over a hidden unit. In order to
accomplish this, we need to establish a new relationship between the vectors 𝑣 and ℎ.
Claim: Consider a configuration 𝜎 = (𝑣, ℎ) where 𝜎 ∈ I𝒩, 𝑣 ∈ I𝒱, and ℎ ∈ Iℋ. Then 𝑣 and ℎ are orthogonal in I𝒩.
Proof: In order to compute the Euclidean inner product between 𝑣 and ℎ in I𝒩, we need to represent both 𝑣 and ℎ as configurations in I𝒩. We do this by "padding" a data (visible) vector 𝑣 ∈ I𝒱 with zeros, up to the dimension 𝑛 of a configuration 𝜌 ∈ I𝒩, such that:

∀𝑖 ∈ 𝒱, 𝜌𝑖 = 𝑣𝑖 and ∀𝑗 ∈ 𝒩 − 𝒱 = ℋ, 𝜌𝑗 = 0 (5.5)

We apply the same "padding" operation, up to the dimension 𝑛 of a configuration 𝜏, to any hidden vector ℎ ∈ Iℋ such that:

∀𝑗 ∈ ℋ, 𝜏𝑗 = ℎ𝑗 and ∀𝑖 ∈ 𝒩 − ℋ = 𝒱, 𝜏𝑖 = 0 (5.6)

Then the inner product 𝜌 ∙ 𝜏ᵀ = 𝜏 ∙ 𝜌ᵀ is zero because, at every index, at least one of the two components is zero: the nonzero entries of 𝜌 and 𝜏 occupy disjoint index sets. Therefore, 𝜌 and 𝜏 are orthogonal, which leads to 𝑣 and ℎ being orthogonal in I𝒩.
Moreover, a configuration 𝜎 over 𝑣 and ℎ can be represented either as the concatenation of 𝑣 and ℎ or as the sum of the "expanded" versions 𝜌 of 𝑣 and 𝜏 of ℎ:

𝜎 = (𝑣, ℎ) = 𝜌 + 𝜏 ≝ 𝑣 + ℎ (5.7)

A consequence of equation (5.7) is that equation (4.28) can be rewritten as:

𝐩𝐓(𝑣) = ∑_{ℎ∈Iℋ} 𝐏𝐓(𝑣, ℎ) = ∑_{ℎ∈Iℋ} 𝐏𝐓(𝑣 + ℎ) for 𝑣 ∈ I𝒱 (5.8)
We now evaluate the activation of a hidden unit 𝑖 ∈ ℋ due to the clamping of the visible units 𝑣. We do this by distinguishing between the contribution of the hidden units and the contribution of the visible units to the net input of that unit:

net(𝑖, 𝜎) = ∑_{𝑗∈ℋ, 𝑗≠𝑖} ℎ𝑗 ∙ 𝑤𝑗𝑖 + (∑_{𝑗∈𝒱, 𝑗≠𝑖} 𝑣𝑗 ∙ 𝑤𝑖𝑗 − 𝜃𝑖) (5.9)

The terms inside the bracket in (5.9) do not depend on ℎ. Moreover, when the visible units 𝑣 are clamped, the content of the bracket is constant; its negative, denoted 𝛉𝐢 and called the effective threshold of unit 𝑖, acts as a threshold for unit 𝑖 of the subnet ℋ:

𝛉𝐢 = 𝜃𝑖 − ∑_{𝑗∈𝒱, 𝑗≠𝑖} 𝑣𝑗 ∙ 𝑤𝑖𝑗 (5.10)

Then:

net(𝑖, 𝜎) = ∑_{𝑗∈ℋ, 𝑗≠𝑖} ℎ𝑗 ∙ 𝑤𝑗𝑖 − 𝛉𝐢 (5.11)
The subnet ℋ behaves like a Boltzmann machine with its own interconnecting weights W and
thresholds (𝛉𝐢)𝑖∈ℋ. This means that, in principle, we know the probability of any particular state
or configuration ℎ of ℋ simply because it will be determined by a Boltzmann–Gibbs distribution.
To use this fact we need to know the relationship between the internal energy of subnet ℋ
operating with effective thresholds (𝛉𝐢)𝑖∈ℋ and the energy of the whole network 𝒩 of 𝐁𝐌. The
next theorem makes this relationship explicit by means of an algebraic identity.
Theorem 5.1 (Jones [63]):
The energy of the whole network of 𝐁𝐌 can be computed as the sum of the internal energy of the subnet ℋ in state ℎ when the vector 𝑣 is clamped and the internal energy of the subnet 𝒱 in state 𝑣 when it is completely disconnected from the units of ℋ. Formally, given:

𝐸ℋ(ℎ|𝑣) = −(1/2) ∙ ∑_{𝑗∈ℋ} ℎ𝑗 ∙ ∑_{𝑖∈ℋ, 𝑖≠𝑗} ℎ𝑖 ∙ 𝑤𝑖𝑗 + ∑_{𝑗∈ℋ} 𝛉𝐣 ∙ ℎ𝑗 (5.12)

and

𝐸𝒱(𝑣) = −(1/2) ∙ ∑_{𝑗∈𝒱} 𝑣𝑗 ∙ ∑_{𝑖∈𝒱, 𝑖≠𝑗} 𝑣𝑖 ∙ 𝑤𝑖𝑗 + ∑_{𝑗∈𝒱} 𝜃𝑗 ∙ 𝑣𝑗 (5.13)

then:

𝐸(𝜎) = 𝐸ℋ(ℎ|𝑣) + 𝐸𝒱(𝑣), where 𝜎 = 𝑣 + ℎ (5.14)
Proof: We start from the energy of a joint configuration given by equation (4.37), rewrite the pair sum symmetrically, and split 𝜎 = 𝑣 + ℎ step by step:

𝐸(𝑣, ℎ) = −∑_{𝑖∈𝒩} 𝜎𝑖 ∙ ∑_{𝑗∈𝒩, 𝑗>𝑖} 𝜎𝑗 ∙ 𝑤𝑖𝑗 + ∑_{𝑖∈𝒩} 𝜎𝑖 ∙ 𝜃𝑖

𝐸(𝑣, ℎ) = −(1/2) ∙ ∑_{𝑖∈𝒩} 𝜎𝑖 ∙ ∑_{𝑗∈𝒩, 𝑗≠𝑖} 𝜎𝑗 ∙ 𝑤𝑖𝑗 + ∑_{𝑖∈𝒩} 𝜎𝑖 ∙ 𝜃𝑖

𝐸(𝑣, ℎ) = −(1/2) ∙ ∑_{𝑖∈𝒩} 𝜎𝑖 ∙ ∑_{𝑗∈𝒩, 𝑗≠𝑖} (𝑣𝑗 + ℎ𝑗) ∙ 𝑤𝑖𝑗 + ∑_{𝑖∈𝒩} 𝜎𝑖 ∙ 𝜃𝑖

𝐸(𝑣, ℎ) = −(1/2) ∙ ∑_{𝑖∈𝒩} 𝜎𝑖 ∙ ∑_{𝑗∈𝒱, 𝑗≠𝑖} 𝑣𝑗 ∙ 𝑤𝑖𝑗 − (1/2) ∙ ∑_{𝑖∈𝒩} 𝜎𝑖 ∙ ∑_{𝑗∈ℋ, 𝑗≠𝑖} ℎ𝑗 ∙ 𝑤𝑖𝑗 + ∑_{𝑖∈𝒩} 𝜎𝑖 ∙ 𝜃𝑖

𝐸(𝑣, ℎ) = −(1/2) ∙ ∑_{𝑗∈𝒱} 𝑣𝑗 ∙ ∑_{𝑖∈𝒩, 𝑖≠𝑗} 𝜎𝑖 ∙ 𝑤𝑖𝑗 − (1/2) ∙ ∑_{𝑗∈ℋ} ℎ𝑗 ∙ ∑_{𝑖∈𝒩, 𝑖≠𝑗} 𝜎𝑖 ∙ 𝑤𝑖𝑗 + ∑_{𝑖∈𝒩} 𝜎𝑖 ∙ 𝜃𝑖

𝐸(𝑣, ℎ) = −(1/2) ∙ ∑_{𝑗∈𝒱} 𝑣𝑗 ∙ ∑_{𝑖∈𝒩, 𝑖≠𝑗} (𝑣𝑖 + ℎ𝑖) ∙ 𝑤𝑖𝑗 − (1/2) ∙ ∑_{𝑗∈ℋ} ℎ𝑗 ∙ ∑_{𝑖∈𝒩, 𝑖≠𝑗} (𝑣𝑖 + ℎ𝑖) ∙ 𝑤𝑖𝑗 + ∑_{𝑖∈𝒩} (𝑣𝑖 + ℎ𝑖) ∙ 𝜃𝑖

𝐸(𝑣, ℎ) = (−(1/2) ∙ ∑_{𝑗∈𝒱} 𝑣𝑗 ∙ ∑_{𝑖∈𝒱, 𝑖≠𝑗} 𝑣𝑖 ∙ 𝑤𝑗𝑖 − (1/2) ∙ ∑_{𝑗∈𝒱} 𝑣𝑗 ∙ ∑_{𝑖∈ℋ, 𝑖≠𝑗} ℎ𝑖 ∙ 𝑤𝑗𝑖) + (−(1/2) ∙ ∑_{𝑗∈ℋ} ℎ𝑗 ∙ ∑_{𝑖∈𝒱, 𝑖≠𝑗} 𝑣𝑖 ∙ 𝑤𝑗𝑖 − (1/2) ∙ ∑_{𝑗∈ℋ} ℎ𝑗 ∙ ∑_{𝑖∈ℋ, 𝑖≠𝑗} ℎ𝑖 ∙ 𝑤𝑗𝑖) + (∑_{𝑗∈𝒱} 𝑣𝑗 ∙ 𝜃𝑗 + ∑_{𝑗∈ℋ} ℎ𝑗 ∙ 𝜃𝑗)

We observe that, due to the weight symmetry and the commutative and distributive laws of multiplication over addition, the second term and the third term of the last formula are identical. Therefore:

𝐸(𝑣, ℎ) = −(1/2) ∙ ∑_{𝑗∈𝒱} 𝑣𝑗 ∙ ∑_{𝑖∈𝒱, 𝑖≠𝑗} 𝑣𝑖 ∙ 𝑤𝑗𝑖 − (1/2) ∙ ∑_{𝑗∈ℋ} ℎ𝑗 ∙ ∑_{𝑖∈ℋ, 𝑖≠𝑗} ℎ𝑖 ∙ 𝑤𝑗𝑖 − ∑_{𝑗∈ℋ} ℎ𝑗 ∙ ∑_{𝑖∈𝒱, 𝑖≠𝑗} 𝑣𝑖 ∙ 𝑤𝑗𝑖 + ∑_{𝑗∈𝒱} 𝑣𝑗 ∙ 𝜃𝑗 + ∑_{𝑗∈ℋ} ℎ𝑗 ∙ 𝜃𝑗

𝐸(𝑣, ℎ) = (−(1/2) ∙ ∑_{𝑗∈𝒱} 𝑣𝑗 ∙ ∑_{𝑖∈𝒱, 𝑖≠𝑗} 𝑣𝑖 ∙ 𝑤𝑗𝑖 + ∑_{𝑗∈𝒱} 𝑣𝑗 ∙ 𝜃𝑗) − (1/2) ∙ ∑_{𝑗∈ℋ} ℎ𝑗 ∙ ∑_{𝑖∈ℋ, 𝑖≠𝑗} ℎ𝑖 ∙ 𝑤𝑗𝑖 + ∑_{𝑗∈ℋ} ℎ𝑗 ∙ (𝜃𝑗 − ∑_{𝑖∈𝒱, 𝑖≠𝑗} 𝑣𝑖 ∙ 𝑤𝑗𝑖)

We observe that the content of the first bracket is exactly 𝐸𝒱(𝑣) given by equation (5.13), and the content of the last bracket is exactly 𝛉𝐣 given by equation (5.10). Therefore:

𝐸(𝑣, ℎ) = 𝐸𝒱(𝑣) + (−(1/2) ∙ ∑_{𝑗∈ℋ} ℎ𝑗 ∙ ∑_{𝑖∈ℋ, 𝑖≠𝑗} ℎ𝑖 ∙ 𝑤𝑗𝑖 + ∑_{𝑗∈ℋ} ℎ𝑗 ∙ 𝛉𝐣)

The content of the remaining bracket is exactly 𝐸ℋ(ℎ|𝑣) given by equation (5.12). Thus, we obtain exactly equation (5.14):

𝐸(𝜎) = 𝐸(𝑣, ℎ) = 𝐸𝒱(𝑣) + 𝐸ℋ(ℎ|𝑣)
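The identity (5.14) is easy to verify by brute force on a small network. The sketch below uses an arbitrary four-unit machine (two visible, two hidden) with illustrative weights and thresholds; none of these values come from the thesis:

```python
import itertools

# Illustrative machine: units 0,1 visible; units 2,3 hidden.
w = {(0, 1): 0.5, (0, 2): -1.0, (1, 3): 0.7, (2, 3): 0.3}  # symmetric, no self-loops
theta = [0.1, -0.2, 0.4, 0.0]
V, H = [0, 1], [2, 3]

def W(i, j):
    return w.get((i, j), w.get((j, i), 0.0))

def E(sigma):  # equation (4.37): energy of the whole network
    return (-sum(sigma[i] * sigma[j] * W(i, j)
                 for i in range(4) for j in range(i + 1, 4))
            + sum(sigma[i] * theta[i] for i in range(4)))

def E_V(v):    # equation (5.13): visible subnet, disconnected from H
    return -v[0] * v[1] * W(0, 1) + theta[0] * v[0] + theta[1] * v[1]

def E_H(h, v): # equation (5.12): hidden subnet, effective thresholds (5.10)
    eff = [theta[j] - sum(v[a] * W(j, V[a]) for a in range(2)) for j in H]
    return -h[0] * h[1] * W(2, 3) + eff[0] * h[0] + eff[1] * h[1]

# Equation (5.14): E(sigma) = E_H(h|v) + E_V(v) for every configuration
for v in itertools.product((0, 1), repeat=2):
    for h in itertools.product((0, 1), repeat=2):
        assert abs(E(list(v) + list(h)) - (E_H(h, v) + E_V(v))) < 1e-12
```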
With respect to Theorem 5.1, we remark that 𝐸𝒱(𝑣) is constant when 𝑣 is clamped on 𝒱. This makes the calculation of the probability of any particular vector ℎ on the hidden units particularly straightforward. Therefore, we take a closer look at 𝐐𝐓(ℎ|𝑣), i.e., the probability that vector ℎ will occur on the hidden units when 𝑣 is clamped on the visible units and ℋ is allowed to run at pseudo–temperature 𝐓. Intuitively, the only effect of 𝑣 on ℎ is to cause the hidden units to run with the effective thresholds 𝛉𝐣 given by equation (5.10) instead of their regular thresholds 𝜃𝑗. Under these circumstances, 𝐐𝐓(ℎ|𝑣) is governed by the same type of distribution as the network itself, which in our case is the Boltzmann–Gibbs distribution.
Corollary 5.1:
𝐐𝐓(ℎ|𝑣) is proportional to the probability 𝐏𝐓(𝑣, ℎ) of the joint configuration 𝜎 = (𝑣, ℎ):

𝐐𝐓(ℎ|𝑣) = 𝛼(𝑣, 𝐓) ∙ 𝐏𝐓(𝑣, ℎ) (5.15)

where 𝜎 = (𝑣, ℎ) is a configuration of the network, 𝐓 is the pseudo–temperature, and 𝛼(𝑣, 𝐓) is a positive constant depending only on 𝑣 and 𝐓.
Proof: A consequence of Theorem 5.1 is that, in a Boltzmann machine with the visible units clamped to 𝑣 and with hidden units ℎ, 𝐐𝐓(ℎ|𝑣) is governed by the Boltzmann–Gibbs distribution given by equation (2.5). The energy corresponding to 𝐐𝐓(ℎ|𝑣) according to equation (2.5) can be obtained from equation (5.14). Therefore, we can write:

𝐐𝐓(ℎ|𝑣) = (1/𝑍ℋ) ∙ exp(−𝐸ℋ(ℎ|𝑣)/𝐓) = (1/𝑍ℋ) ∙ exp((−𝐸(𝑣, ℎ) + 𝐸𝒱(𝑣))/𝐓)

𝐐𝐓(ℎ|𝑣) = (1/𝑍ℋ) ∙ exp(−𝐸(𝑣, ℎ)/𝐓) ∙ exp(𝐸𝒱(𝑣)/𝐓)

where 𝑍ℋ is an appropriate normalization constant for the distribution 𝐐𝐓(ℎ|𝑣). Hence:

𝐐𝐓(ℎ|𝑣) = ((𝑍/𝑍ℋ) ∙ exp(𝐸𝒱(𝑣)/𝐓)) ∙ ((1/𝑍) ∙ exp(−𝐸(𝑣, ℎ)/𝐓))

where 𝑍 is the partition function for the true distribution 𝐏𝐓(𝑣, ℎ).
In the previous formula both 𝑍 and 𝑍ℋ are in essence constants, despite the fact that their computation is intractable. We also observe that the first factor depends only on 𝑣 and 𝐓, which are both constant with respect to ℎ. Moreover, the second factor is exactly 𝐏𝐓(𝑣, ℎ) given by equation (4.40). If we denote the first factor by 𝛼(𝑣, 𝐓), then we obtain exactly the expression (5.15):

𝐐𝐓(ℎ|𝑣) = 𝛼(𝑣, 𝐓) ∙ 𝐏𝐓(𝑣, ℎ)
The following theorem is essential for the Boltzmann machine learning algorithm. The theorem
gives the relationship between 𝐐𝐓(𝑣, ℎ) and 𝐏𝐓(𝑣, ℎ) in terms of the observable probability 𝐪(𝑣)
and the marginal probability 𝐩𝐓(𝑣). It shows that, in the particular case of the Boltzmann–Gibbs
distribution, this relationship has a simple ratio form.
There is a bit of history around this theorem. In the original derivation of the learning rule for the asynchronous Boltzmann machine, Ackley et al. assumed, without making direct appeal to the form of the underlying distribution, that at thermal equilibrium the probability of a hidden state given a visible state is the same regardless of how the visible units arrived there (clamped or free–running) [11,43]. However, for systems with a distribution different from Boltzmann–Gibbs, such as a synchronous Boltzmann machine, this property fails and the relationship is much more complicated [42,61]. This means that the classical arguments supporting Theorem 5.2 are logically inadequate, although the conclusion is correct [42]. The missing piece of the original proof was identified, and the logic of the original derivation clarified, by Jones in [63].
Theorem 5.2 (Jones [63]):
If the true distribution 𝐏𝐓 over the whole network of 𝐁𝐌 is described by the Boltzmann–Gibbs distribution given by equation (2.5), then the environmental distribution 𝐐𝐓 is given by:

𝐐𝐓(𝑣, ℎ) = (𝐪(𝑣)/𝐩𝐓(𝑣)) ∙ 𝐏𝐓(𝑣, ℎ) (5.16)

Proof: We start from equation (5.15) and sum over all possible ℎ ∈ Iℋ, taking into consideration the fact that 𝛼(𝑣, 𝐓) is independent of ℎ:

∑_{ℎ∈Iℋ} 𝐐𝐓(ℎ|𝑣) = ∑_{ℎ∈Iℋ} 𝛼(𝑣, 𝐓) ∙ 𝐏𝐓(𝑣, ℎ) = 𝛼(𝑣, 𝐓) ∙ ∑_{ℎ∈Iℋ} 𝐏𝐓(𝑣, ℎ)

The sum of probabilities on the left side is 1, and the sum on the right side is exactly the marginal MARG(𝐏𝐓, 𝒱)(𝑣) = 𝐩𝐓(𝑣) of the true distribution over the states of the visible units. Moreover, in Section 4.1, when constructing the Gibbs measure associated to a Hamiltonian, we assumed that the corresponding Gibbs measure is a nondegenerate probability distribution, so we can divide by it without restriction. These observations lead to:

1 = 𝛼(𝑣, 𝐓) ∙ 𝐩𝐓(𝑣) ⇔ 𝛼(𝑣, 𝐓) = 1/𝐩𝐓(𝑣)

Based on our definition of 𝐐𝐓(𝑣, ℎ) in Section 5.1, we can write:

𝐐𝐓(𝑣, ℎ) = 𝐐𝐓(ℎ|𝑣) ∙ 𝐪(𝑣) ⇔ 𝐐𝐓(ℎ|𝑣) = 𝐐𝐓(𝑣, ℎ)/𝐪(𝑣)

If we substitute 𝛼(𝑣, 𝐓) and 𝐐𝐓(ℎ|𝑣) in equation (5.15), we obtain exactly equation (5.16):

𝐐𝐓(𝑣, ℎ)/𝐪(𝑣) = 𝐏𝐓(𝑣, ℎ)/𝐩𝐓(𝑣) ⇔ 𝐩𝐓(𝑣) ∙ 𝐐𝐓(𝑣, ℎ) = 𝐪(𝑣) ∙ 𝐏𝐓(𝑣, ℎ)
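Theorem 5.2, together with Corollary 5.1 behind it, can be verified numerically on a toy machine with two visible units and one hidden unit, computing 𝐏𝐓 exactly by brute-force enumeration. The weights, thresholds, and the environmental distribution below are arbitrary illustrative choices, not values from the thesis:

```python
import itertools
import math

# Toy machine (illustrative values): visible units 0,1; hidden unit 2.
w = {(0, 1): 0.4, (0, 2): -0.8, (1, 2): 0.6}   # symmetric weights, no self-loops
theta = [0.1, -0.3, 0.2]
T = 1.5                                        # pseudo-temperature

def W(i, j):
    return w.get((i, j), w.get((j, i), 0.0))

def E(s):  # full-network energy, equation (4.37)
    return (-sum(s[i] * s[j] * W(i, j) for i in range(3) for j in range(i + 1, 3))
            + sum(s[i] * theta[i] for i in range(3)))

states = list(itertools.product((0, 1), repeat=3))
Z = sum(math.exp(-E(s) / T) for s in states)
P = {s: math.exp(-E(s) / T) / Z for s in states}           # true distribution P_T
p = {v: P[v + (0,)] + P[v + (1,)]                          # marginal p_T (eq. 5.8)
     for v in itertools.product((0, 1), repeat=2)}
q = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}   # environmental q (arbitrary)

# Equation (5.16): Q_T(v, h) = q(v)/p_T(v) * P_T(v, h)
Q = {v + (h,): q[v] * P[v + (h,)] / p[v] for v in q for h in (0, 1)}

# Cross-check against the clamped hidden subnet of Corollary 5.1:
# Q_T(h|v) is Boltzmann-Gibbs with the effective threshold (5.10).
for v in q:
    eff = theta[2] - v[0] * W(2, 0) - v[1] * W(2, 1)
    zH = 1.0 + math.exp(-eff / T)              # hidden partition function, h in {0,1}
    for h in (0, 1):
        assert abs(Q[v + (h,)] - q[v] * math.exp(-h * eff / T) / zH) < 1e-12
```

The assertion passing for every (𝑣, ℎ) confirms that the two routes to 𝐐𝐓(𝑣, ℎ) agree, as Theorem 5.2 states for Boltzmann–Gibbs distributions.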
Lemma 5.3:
The partial derivatives of the KL–divergence between the observable probability 𝐪(𝑣) and the marginal probability 𝐩𝐓(𝑣) with respect to the parameters W of the network are computed with the following formulae:

∂KL(𝐪||𝐩𝐓)/∂𝑤𝑖𝑗 = −∑_{(𝑣,ℎ)∈I𝒩} (𝐪(𝑣)/𝐩𝐓(𝑣)) ∙ ∂𝐏𝐓(𝑣, ℎ)/∂𝑤𝑖𝑗 (5.17)

∂KL(𝐪||𝐩𝐓)/∂𝜃𝑖 = −∑_{(𝑣,ℎ)∈I𝒩} (𝐪(𝑣)/𝐩𝐓(𝑣)) ∙ ∂𝐏𝐓(𝑣, ℎ)/∂𝜃𝑖 (5.18)

Proof: Before we start the proof, we recall that, per Definition 5.2, the learning process computes the parameters W of the network such that lim_{𝑡→∞} W = W̃. This means that the partial derivatives of KL(𝐪||𝐩𝐓) should be computed with respect to the parameters W̃ = (W̃, Θ̃). However, to keep the text as readable as possible, we are going to use the parameters W = (W, Θ) in our computation, but with the meaning of W̃. We start from the definition of the KL–divergence (equations (B29) and (B30) from Appendix B):

KL(𝐪||𝐩𝐓) = ∑_{𝑣∈I𝒱} 𝐪(𝑣) ∙ ln(𝐪(𝑣)/𝐩𝐓(𝑣))

We compute the partial derivative of KL(𝐪||𝐩𝐓) with respect to the weights 𝑤𝑖𝑗, taking into consideration that 𝐪(𝑣) is an environmentally imposed probability distribution, so it does not depend on the parameters of the network:

∂KL(𝐪||𝐩𝐓)/∂𝑤𝑖𝑗 = ∑_{𝑣∈I𝒱} ∂(𝐪(𝑣) ∙ ln(𝐪(𝑣)/𝐩𝐓(𝑣)))/∂𝑤𝑖𝑗 = ∑_{𝑣∈I𝒱} 𝐪(𝑣) ∙ ∂(ln(𝐪(𝑣)/𝐩𝐓(𝑣)))/∂𝑤𝑖𝑗 = ∑_{𝑣∈I𝒱} 𝐪(𝑣) ∙ (𝐩𝐓(𝑣)/𝐪(𝑣)) ∙ ∂(𝐪(𝑣)/𝐩𝐓(𝑣))/∂𝑤𝑖𝑗

∂KL(𝐪||𝐩𝐓)/∂𝑤𝑖𝑗 = ∑_{𝑣∈I𝒱} 𝐩𝐓(𝑣) ∙ 𝐪(𝑣) ∙ ∂(1/𝐩𝐓(𝑣))/∂𝑤𝑖𝑗 = −∑_{𝑣∈I𝒱} 𝐩𝐓(𝑣) ∙ 𝐪(𝑣) ∙ (1/𝐩𝐓(𝑣)²) ∙ ∂𝐩𝐓(𝑣)/∂𝑤𝑖𝑗 = −∑_{𝑣∈I𝒱} (𝐪(𝑣)/𝐩𝐓(𝑣)) ∙ ∂𝐩𝐓(𝑣)/∂𝑤𝑖𝑗

Furthermore, we substitute 𝐩𝐓(𝑣) with its expression given by equation (5.8):

∂KL(𝐪||𝐩𝐓)/∂𝑤𝑖𝑗 = −∑_{𝑣∈I𝒱} (𝐪(𝑣)/𝐩𝐓(𝑣)) ∙ ∂(∑_{ℎ∈Iℋ} 𝐏𝐓(𝑣, ℎ))/∂𝑤𝑖𝑗 = −∑_{𝑣∈I𝒱} (𝐪(𝑣)/𝐩𝐓(𝑣)) ∙ ∑_{ℎ∈Iℋ} ∂𝐏𝐓(𝑣, ℎ)/∂𝑤𝑖𝑗

Earlier we proved that 𝑣 and ℎ are orthogonal (equation (5.7)), which translates into the following identity between sums:

∑_{𝑣∈I𝒱} ∑_{ℎ∈Iℋ} = ∑_{(𝑣,ℎ)∈I𝒩}

Applying this identity to the previous expression of the partial derivative yields exactly formula (5.17):

∂KL(𝐪||𝐩𝐓)/∂𝑤𝑖𝑗 = −∑_{𝑣∈I𝒱} ∑_{ℎ∈Iℋ} (𝐪(𝑣)/𝐩𝐓(𝑣)) ∙ ∂𝐏𝐓(𝑣, ℎ)/∂𝑤𝑖𝑗 = −∑_{(𝑣,ℎ)∈I𝒩} (𝐪(𝑣)/𝐩𝐓(𝑣)) ∙ ∂𝐏𝐓(𝑣, ℎ)/∂𝑤𝑖𝑗

The proof of formula (5.18) is very similar to the proof of formula (5.17), with the weights 𝑤𝑖𝑗 replaced by the thresholds 𝜃𝑖.
Before presenting the most important result of the learning rule derivation for the asynchronous symmetric Boltzmann machine, we introduce the expectations mentioned at the beginning of this section. Given a configuration 𝜎 = (𝑣, ℎ) of the network, we denote by 𝑞𝑖𝑗 the expectation with respect to the data distribution 𝐐𝐓, i.e., the probability, averaged over all environmental inputs and measured at equilibrium, that the 𝑖th and the 𝑗th units are both on. We denote by 𝑝𝑖𝑗 the expectation with respect to the true distribution 𝐏𝐓, i.e., the probability, measured at equilibrium under the true distribution, that the 𝑖th and the 𝑗th units are both on.

𝑞𝑖𝑗 = ∑_{𝜎∈I𝒩} 𝜎𝑖 ∙ 𝜎𝑗 ∙ 𝐐𝐓(𝜎) if 𝑖 ≠ 𝑗 (5.19)

𝑝𝑖𝑗 = ∑_{𝜎∈I𝒩} 𝜎𝑖 ∙ 𝜎𝑗 ∙ 𝐏𝐓(𝜎) if 𝑖 ≠ 𝑗 (5.20)

Similarly, we denote by 𝑞𝑖 the probability, averaged over all environmental inputs and measured at equilibrium, that the 𝑖th unit is on, and by 𝑝𝑖 the probability, measured at equilibrium under the true distribution, that the 𝑖th unit is on.

𝑞𝑖 = ∑_{𝜎∈I𝒩} 𝜎𝑖 ∙ 𝐐𝐓(𝜎) (5.21)

𝑝𝑖 = ∑_{𝜎∈I𝒩} 𝜎𝑖 ∙ 𝐏𝐓(𝜎) (5.22)
Theorem 5.4 (Gradient Descent for asynchronous symmetric Boltzmann machines):
The partial derivatives of the KL–divergence between the environmental probability 𝐪(𝑣) and the marginal of the true probability 𝐩𝐓(𝑣) with respect to the symmetric weights 𝑤𝑖𝑗 and the thresholds 𝜃𝑖, respectively, are given by the following formulae:

∂KL(𝐪||𝐩𝐓)/∂𝑤𝑖𝑗 = −(1/𝐓) ∙ (𝑞𝑖𝑗 − 𝑝𝑖𝑗) (5.23)

∂KL(𝐪||𝐩𝐓)/∂𝜃𝑖 = (1/𝐓) ∙ (𝑞𝑖 − 𝑝𝑖) (5.24)

Proof: By Lemma 5.3, it is sufficient to determine ∂𝐏𝐓(𝜎)/∂𝑤𝑖𝑗 and ∂𝐏𝐓(𝜎)/∂𝜃𝑖. In order to compute these partial derivatives, we need the expression of the true probability distribution of the network at equilibrium. In Chapter 4 we learned that the joint distribution of a Boltzmann machine is a Boltzmann–Gibbs distribution given by equation (4.40). However, we still need to prove that equation (4.40) also represents the equilibrium distribution of the Boltzmann machine, which we do in Section 5.3.2. In the rest of this section we assume that the equilibrium distribution of the Boltzmann machine is given by the following version of equation (4.40), which takes into consideration the pseudo–temperature 𝐓:

𝐏𝐓(𝜎) = 𝐏𝐓(𝑣, ℎ) = 𝐏𝐓(𝑣 + ℎ) = (1/𝑍) ∙ exp(−𝐸(𝜎)/𝐓) (5.25)

We observe that both the numerator and the denominator of 𝐏𝐓(𝜎) depend on the weights 𝑤𝑖𝑗 and on the thresholds 𝜃𝑖:

∂𝐏𝐓(𝜎)/∂𝑤𝑖𝑗 = −(1/𝑍²) ∙ exp(−𝐸(𝜎)/𝐓) ∙ ∂𝑍/∂𝑤𝑖𝑗 − (1/(𝑍 ∙ 𝐓)) ∙ exp(−𝐸(𝜎)/𝐓) ∙ ∂𝐸(𝜎)/∂𝑤𝑖𝑗

∂𝐏𝐓(𝜎)/∂𝑤𝑖𝑗 = −𝐏𝐓(𝜎) ∙ ((1/𝑍) ∙ ∂𝑍/∂𝑤𝑖𝑗 + (1/𝐓) ∙ ∂𝐸(𝜎)/∂𝑤𝑖𝑗) (5.26)

and, analogously,

∂𝐏𝐓(𝜎)/∂𝜃𝑖 = −𝐏𝐓(𝜎) ∙ ((1/𝑍) ∙ ∂𝑍/∂𝜃𝑖 + (1/𝐓) ∙ ∂𝐸(𝜎)/∂𝜃𝑖) (5.27)

From formulae (5.26) and (5.27), it remains to determine ∂𝑍/∂𝑤𝑖𝑗 and ∂𝐸(𝜎)/∂𝑤𝑖𝑗, respectively ∂𝑍/∂𝜃𝑖 and ∂𝐸(𝜎)/∂𝜃𝑖. We recall the expression of 𝐸(𝜎) given by equation (4.37) and the expression of 𝑍 given by equation (4.39):

𝐸(𝜎) = −∑_{{𝑖,𝑗}∈𝒢, 𝑖<𝑗} 𝜎𝑖 ∙ 𝜎𝑗 ∙ 𝑤𝑖𝑗 + ∑_{𝑖∈𝒩} 𝜎𝑖 ∙ 𝜃𝑖

𝑍 = ∑_{𝜎∈I𝒩} exp(−𝐸(𝜎)/𝐓) = ∑_{𝜎∈I𝒩} exp((∑_{{𝑖,𝑗}∈𝒢, 𝑖<𝑗} 𝜎𝑖 ∙ 𝜎𝑗 ∙ 𝑤𝑖𝑗 − ∑_{𝑖∈𝒩} 𝜎𝑖 ∙ 𝜃𝑖)/𝐓)

The partial derivatives of 𝐸(𝜎) with respect to the weights 𝑤𝑖𝑗 and the thresholds 𝜃𝑖, respectively, are:

∂𝐸(𝜎)/∂𝑤𝑖𝑗 = −𝜎𝑖 ∙ 𝜎𝑗 (5.28)

∂𝐸(𝜎)/∂𝜃𝑖 = 𝜎𝑖 (5.29)

The partial derivatives of 𝑍 are:

∂𝑍/∂𝑤𝑖𝑗 = ∑_{𝜎∈I𝒩} exp(−𝐸(𝜎)/𝐓) ∙ (𝜎𝑖 ∙ 𝜎𝑗/𝐓) = (𝑍/𝐓) ∙ ∑_{𝜎∈I𝒩} 𝜎𝑖 ∙ 𝜎𝑗 ∙ 𝐏𝐓(𝜎) (5.30)

∂𝑍/∂𝜃𝑖 = ∑_{𝜎∈I𝒩} exp(−𝐸(𝜎)/𝐓) ∙ (−𝜎𝑖/𝐓) = −(𝑍/𝐓) ∙ ∑_{𝜎∈I𝒩} 𝜎𝑖 ∙ 𝐏𝐓(𝜎) (5.31)

We substitute formulae (5.28) and (5.30) into formula (5.26), and formulae (5.29) and (5.31) into formula (5.27). Writing 𝜎′ for the summation variable to distinguish it from the fixed configuration 𝜎:

∂𝐏𝐓(𝜎)/∂𝑤𝑖𝑗 = −(𝐏𝐓(𝜎)/𝐓) ∙ (∑_{𝜎′∈I𝒩} 𝜎′𝑖 ∙ 𝜎′𝑗 ∙ 𝐏𝐓(𝜎′) − 𝜎𝑖 ∙ 𝜎𝑗) (5.32)

∂𝐏𝐓(𝜎)/∂𝜃𝑖 = (𝐏𝐓(𝜎)/𝐓) ∙ (∑_{𝜎′∈I𝒩} 𝜎′𝑖 ∙ 𝐏𝐓(𝜎′) − 𝜎𝑖) (5.33)

We observe that in formula (5.32) the first term inside the bracket is exactly 𝑝𝑖𝑗 given by formula (5.20). Similarly, in formula (5.33) the first term inside the bracket is exactly 𝑝𝑖 given by formula (5.22). Thus:

∂𝐏𝐓(𝜎)/∂𝑤𝑖𝑗 = −(𝐏𝐓(𝜎)/𝐓) ∙ (𝑝𝑖𝑗 − 𝜎𝑖 ∙ 𝜎𝑗) (5.34)

∂𝐏𝐓(𝜎)/∂𝜃𝑖 = (𝐏𝐓(𝜎)/𝐓) ∙ (𝑝𝑖 − 𝜎𝑖) (5.35)

Furthermore, we substitute formula (5.34) into formula (5.17) and formula (5.35) into formula (5.18):

∂KL(𝐪||𝐩𝐓)/∂𝑤𝑖𝑗 = ∑_{(𝑣,ℎ)∈I𝒩} (𝐪(𝑣)/𝐩𝐓(𝑣)) ∙ (𝐏𝐓(𝑣, ℎ)/𝐓) ∙ (𝑝𝑖𝑗 − 𝜎𝑖 ∙ 𝜎𝑗) (5.36)

∂KL(𝐪||𝐩𝐓)/∂𝜃𝑖 = −∑_{(𝑣,ℎ)∈I𝒩} (𝐪(𝑣)/𝐩𝐓(𝑣)) ∙ (𝐏𝐓(𝑣, ℎ)/𝐓) ∙ (𝑝𝑖 − 𝜎𝑖) (5.37)

We use formula (5.16) to substitute the expression (𝐪(𝑣)/𝐩𝐓(𝑣)) ∙ 𝐏𝐓(𝑣, ℎ) with 𝐐𝐓(𝑣, ℎ) in formula (5.36); the formula we obtain is exactly formula (5.23):

∂KL(𝐪||𝐩𝐓)/∂𝑤𝑖𝑗 = (1/𝐓) ∙ ∑_{(𝑣,ℎ)∈I𝒩} 𝐐𝐓(𝑣, ℎ) ∙ (𝑝𝑖𝑗 − 𝜎𝑖 ∙ 𝜎𝑗) = (1/𝐓) ∙ (𝑝𝑖𝑗 ∙ ∑_{𝜎∈I𝒩} 𝐐𝐓(𝜎) − ∑_{𝜎∈I𝒩} 𝜎𝑖 ∙ 𝜎𝑗 ∙ 𝐐𝐓(𝜎)) = (1/𝐓) ∙ (𝑝𝑖𝑗 − 𝑞𝑖𝑗) = −(1/𝐓) ∙ (𝑞𝑖𝑗 − 𝑝𝑖𝑗)

Finally, we use formula (5.16) in the same way in formula (5.37); the formula we obtain is exactly formula (5.24):

∂KL(𝐪||𝐩𝐓)/∂𝜃𝑖 = −(1/𝐓) ∙ ∑_{(𝑣,ℎ)∈I𝒩} 𝐐𝐓(𝑣, ℎ) ∙ (𝑝𝑖 − 𝜎𝑖) = (1/𝐓) ∙ (∑_{𝜎∈I𝒩} 𝜎𝑖 ∙ 𝐐𝐓(𝜎) − 𝑝𝑖 ∙ ∑_{𝜎∈I𝒩} 𝐐𝐓(𝜎)) = (1/𝐓) ∙ (𝑞𝑖 − 𝑝𝑖)
With respect to Theorem 5.4 we remark that the formulae (5.23) and (5.24) show that the
process of reaching thermal equilibrium ensures that the joint activity of any two units contains
all the information required for changing the weight between them in order to give the network a
better model of its environment [43]. Specifically, the joint activity of any two units encodes
information explicitly about those units and encodes information implicitly about all the other
weights in the network [43]. The formulae (5.23) and (5.24) also show that the learning rule does not depend on what kind of units the pair comprises: both visible, both hidden, or one visible and one hidden.
In practice, to minimize KL(𝐪||𝐩𝐓), it is sufficient to observe 𝑞𝑖𝑗, 𝑝𝑖𝑗, 𝑞𝑖, and 𝑝𝑖 at thermal equilibrium and to update each weight and each threshold by gradient descent on the formulae (5.23) and (5.24):

$$\Delta w_{ij} = \delta \cdot (q_{ij} - p_{ij}) \tag{5.38}$$

$$\Delta\theta_i = -\delta \cdot (q_i - p_i) \tag{5.39}$$

where 𝛿 is a constant learning rate.
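The update step above is a one-line operation once the equilibrium statistics are available. A minimal sketch in Python (not from the thesis; the function and argument names are illustrative, and the statistics are assumed to have been estimated already):

```python
import numpy as np

def boltzmann_update(W, theta, q_pair, p_pair, q_unit, p_unit, delta=0.1):
    """One gradient-descent step on KL(q || p_T) from equilibrium statistics.

    q_pair/p_pair: n x n matrices of clamped/free pairwise statistics q_ij, p_ij.
    q_unit/p_unit: length-n vectors of clamped/free unit statistics q_i, p_i.
    """
    dW = delta * (q_pair - p_pair)             # Delta w_ij = delta * (q_ij - p_ij)
    np.fill_diagonal(dW, 0.0)                  # no self-connections
    W = W + (dW + dW.T) / 2.0                  # keep the weight matrix symmetric
    theta = theta - delta * (q_unit - p_unit)  # Delta theta_i = -delta * (q_i - p_i)
    return W, theta
```

Symmetrizing the increment keeps the weight matrix consistent with the symmetric-connection hypothesis used throughout this chapter.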
Another possibility is to incorporate the annealing process into the learning rate as follows:

$$\delta = \frac{1}{\mathbf{T}} \tag{5.40}$$
We end this section with high–level pseudocode of the generic learning algorithm. We present this algorithm by emphasizing the aspects related to the learning rule we derived. At this time we do not go into details regarding the update process (specifically, how the units are selected for update) or the collection of statistics. However, we mention that, for a given pattern, the selection of a unit to be updated is in principle similar to the selection performed in a Hopfield network (Section 2.5.1); alternatively, it can be a stochastic process (taking place at a given mean rate 𝑟 > 0 for each unit) or a deterministic process (part of a predefined sequence). In Section 5.3.3 we will present various strategies to update the units as well as to collect the statistics 𝑞𝑖𝑗 and 𝑝𝑖𝑗.
Algorithm 5.1: Generic Boltzmann Machine Learning
Given: n × n weight matrix W ; n × 1 threshold vector Θ
       a training set of 𝑝𝑎 data vectors: {𝑣(𝑘)}1≤𝑘≤𝑝𝑎
       the number of learning cycles: 𝑒𝑝
begin
Step 1: initialize the weights W and the thresholds Θ
For an arbitrary number of learning cycles:
Step 2: for e = 1 to 𝑒𝑝 do
For each one of the patterns to be learned:
Step 3:   for k = 1 to 𝑝𝑎 do
Clamping phase:
Step 4:     present and clamp the pattern 𝑣(𝑘)
UPDATE PROCESS START
Randomly pick a hidden unit to update its value:
Step 5:     choose at random a hidden unit ℎ𝑖 from the set ℋ
Lower the pseudo–temperature following a schedule:
Step 6:     anneal ℎ𝑖
At the final pseudo–temperature estimate the correlations:
Step 7:     collect statistics 𝑞𝑖𝑗
UPDATE PROCESS END
Free–running phase:
Step 8:     present the pattern 𝑣(𝑘) but do not clamp it
UPDATE PROCESS START
Randomly pick a visible or hidden unit to update its value:
Step 9:     choose at random a unit 𝜎𝑖 from the set 𝒩
Lower the pseudo–temperature following a schedule:
Step 10:    anneal 𝜎𝑖
At the final pseudo–temperature estimate the correlations:
Step 11:    collect statistics 𝑝𝑖𝑗
UPDATE PROCESS END
Update the weights and the thresholds for any pair of connected units 𝑖 ≠ 𝑗 such that at least one unit has been updated:
Step 12:    Δ𝑤𝑖𝑗 = 𝛿 ∙ (𝑞𝑖𝑗 − 𝑝𝑖𝑗) for 𝑖 ≠ 𝑗
            𝑤𝑖𝑗 ← 𝑤𝑖𝑗 + Δ𝑤𝑖𝑗
            Δ𝜃𝑖 = −𝛿 ∙ (𝑞𝑖 − 𝑝𝑖)
            𝜃𝑖 ← 𝜃𝑖 + Δ𝜃𝑖
          end for //k
        end for //e
end
return W, Θ
The generic Boltzmann machine learning algorithm runs slowly, partly because of the time
required to reach thermal equilibrium and partly because the learning is driven by the difference
between two noisy variables, so these variables must be sampled for a long time at thermal
equilibrium to reduce the noise [64]. If we could achieve the same simple relationships between
log probabilities and weights in a deterministic system, the learning process would be much
faster. We will explore this idea in Section 5.5.
5.3.2 Collecting the statistics required for learning
In the previous section we saw that the generic learning algorithm computes the difference
between two expectations or statistics denoted by us as 𝑞𝑖𝑗 (equation (5.19)) and 𝑝𝑖𝑗 (equation
(5.20)). To evaluate the complexity of the exact computation of these expectations, we rewrite
the equation (5.19) by substituting 𝐐𝐓(𝑣, ℎ) with its definition given in Section 5.1:
$$q_{ij} = \sum_{\sigma\in I_{\mathcal{N}}} \sigma_i \cdot \sigma_j \cdot \mathbf{Q}_{\mathbf{T}}(\sigma) = \sum_{\sigma=(v,h)\in I_{\mathcal{N}}} \sigma_i \cdot \sigma_j \cdot \mathbf{Q}_{\mathbf{T}}(v,h)$$

$$q_{ij} = \sum_{\sigma=(v,h)\in I_{\mathcal{N}}} \sigma_i \cdot \sigma_j \cdot \mathbf{Q}_{\mathbf{T}}(h \mid v) \cdot \mathbf{q}(v) \tag{5.41}$$
In the equation (5.41), $\mathbf{q}(v)$ can be approximated by its empirical distribution $\hat{\mathbf{q}}(v)$, whose computation is tractable:

$$\mathbf{q}(v) \cong \hat{\mathbf{q}}(v) = \frac{1}{pa} \sum_{k=1}^{pa} \prod_{u=1}^{m} \mathbb{I}_{u;\,v_u^{(k)}}(v_u) \tag{5.42}$$

where: $pa$ is the number of data vectors (patterns); $\{v^{(k)}\}_{1\le k\le pa}$ is the training set of data vectors; $\mathbb{I}_{u;\,v_u^{(k)}}(v_u)$ is the indicator function given by the equation (3.7); and $m$ is the number of visible units (equation (4.25)).
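For a finite training set, the product of indicator functions in (5.42) simply counts how many training vectors coincide with each visible configuration 𝑣. A minimal sketch (not from the thesis; the function name and the toy training set are illustrative):

```python
from collections import Counter

def empirical_distribution(patterns):
    """Empirical distribution q_hat(v) over visible configurations:
    the fraction of training vectors equal to each v (equation (5.42))."""
    pa = len(patterns)
    counts = Counter(tuple(v) for v in patterns)
    return {v: c / pa for v, c in counts.items()}

# toy training set of pa = 4 binary visible vectors
patterns = [(1, 0), (1, 0), (0, 1), (1, 1)]
q_hat = empirical_distribution(patterns)
# q_hat[(1, 0)] == 0.5
```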
A simplified analysis of Algorithm 5.1 shows that the complexity of computing 𝑞𝑖𝑗 depends on
the complexity of computing 𝐐𝐓(ℎ|𝑣), which is exponential in the number of hidden units 𝑙
(equation (4.25)). Also the complexity of computing 𝑝𝑖𝑗 depends on the complexity of computing
𝐏𝐓(𝑣, ℎ) which is exponential in the total number of units (visible and hidden) 𝑛 = 𝑚 + 𝑙.
Consequently, both computations (𝑞𝑖𝑗 and 𝑝𝑖𝑗) are intractable. Later in this section we will see
why the analysis of Algorithm 5.1 is actually much more complicated.
Therefore, in order to compute the parameters of the network, we need an approximation of the environmental distribution as well as estimates of the statistics under both the environmental distribution and the true distribution. We have already seen how the approximation of the environmental distribution is performed (equation (5.42)). Now we concentrate our attention on the estimation tasks. Both estimation tasks can be accomplished by using the MCMC framework. In essence, an MCMC algorithm performs sampling from a probability distribution by constructing a Markov chain that has the desired distribution as its equilibrium distribution. Then, after a number of steps, the algorithm uses the state of the chain as a sample of the desired distribution.
In this section we present three categories of sampling algorithms that are used to estimate the
data–dependent statistics 𝑞𝑖𝑗 and/or the data–independent statistics 𝑝𝑖𝑗 in a Boltzmann
machine: Gibbs sampling, persistent Markov chains, and contrastive divergence.
5.3.2.1 Gibbs sampling
Hinton and Sejnowski used an MCMC sampling approach for estimating the data–dependent
statistics 𝑞𝑖𝑗 by clamping a training vector on the visible units, initializing the hidden units to
random binary states, and using sequential Gibbs sampling of the hidden units to approach the
posterior distribution [11,43]. They estimated the data–independent statistics 𝑝𝑖𝑗 in the same
way, but with the randomly initialized visible units included in the sequential Gibbs sampling
[11,43].
In Gibbs sampling, each variable draws a sample from its posterior distribution given the current
states of the other variables [65]. Before explaining how Gibbs sampling actually works, we
recall the notation X−i and its meaning as the set of all the random variables from X except 𝑋𝑖
(3.73). Given a joint probability distribution 𝐏 of 𝑛 binary random variables X = (𝑋1, 𝑋2, … , 𝑋𝑛), Gibbs sampling of 𝐏 is done through a sequence of 𝑛 sampling sub–steps of the following form, such that the new value for 𝑋𝑖 is used straight away in subsequent sampling steps:

$$X_i \sim \mathbf{P}(X_i \mid X_{-i} = x_{-i}) \quad \text{for } 1 \le i \le n \tag{5.43}$$

or, equivalently for binary variables:

$$X_i^{(t+1)} = \begin{cases} 1, & \text{if } u \le \mathbf{P}(X_i = 1 \mid X_{-i}^{(t)} = x_{-i}) \\ 0, & \text{otherwise} \end{cases}$$
where 𝑥−𝑖 represents the evidence of the corresponding random variables X−i and 𝑢 is a sample
from a uniform distribution 𝒰[0,1]. After these 𝑛 samples have been obtained, a step of the
chain is completed, yielding a sample of X whose distribution converges to 𝐏(X) as the number
of steps goes to ∞, under some conditions. A sufficient condition for convergence of a finite–
state Markov chain is that it is aperiodic and irreducible (see Theorem C.7 in Appendix C).
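The sub-step structure of (5.43) can be sketched generically: one sweep visits every variable once, and each new value is used immediately. A minimal illustration (not from the thesis; `cond_prob` is an assumed caller-supplied function returning P(X_i = 1 | X_{-i})):

```python
import random

def gibbs_sweep(x, cond_prob, rng=None):
    """One full Gibbs sweep over n binary variables (equation (5.43)).

    cond_prob(i, x) must return P(X_i = 1 | X_{-i} = x_{-i}); the new
    value of each variable is used immediately in later sub-steps."""
    rng = rng or random.Random()
    for i in range(len(x)):
        u = rng.random()                   # u ~ Uniform[0, 1)
        x[i] = 1 if u <= cond_prob(i, x) else 0
    return x
```

Repeating such sweeps yields a chain whose distribution converges, under the aperiodicity and irreducibility conditions mentioned above, to the joint distribution P.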
Let us consider a configuration 𝜎 = (𝑣, ℎ) of 𝐁𝐌 and denote by 𝜎−𝑖 the set of values associated with all units except the ith unit. In order to perform Gibbs sampling in 𝐁𝐌, we need to compute and sample from 𝐏(𝜎𝑖|𝜎−𝑖) as follows:

$$\mathbf{P}(\sigma) = \mathbf{P}(\sigma_i, \sigma_{-i}) = \frac{\exp(-E(\sigma_i, \sigma_{-i}))}{Z}$$

$$\mathbf{P}(\sigma_i = 1 \mid \sigma_{-i}) = \frac{\exp(-E(\sigma_i = 1, \sigma_{-i}))}{\exp(-E(\sigma_i = 1, \sigma_{-i})) + \exp(-E(\sigma_i = 0, \sigma_{-i}))}$$

$$\mathbf{P}(\sigma_i = 1 \mid \sigma_{-i}) = \frac{1}{1 + \dfrac{\exp(-E(\sigma_i = 0, \sigma_{-i}))}{\exp(-E(\sigma_i = 1, \sigma_{-i}))}}$$

$$\mathbf{P}(\sigma_i = 1 \mid \sigma_{-i}) = \frac{1}{1 + \exp(E(\sigma_i = 1, \sigma_{-i}) - E(\sigma_i = 0, \sigma_{-i}))}$$

$$\mathbf{P}(\sigma_i = 1 \mid \sigma_{-i}) = \frac{1}{1 + \exp\left(-\sum_{j\in\mathcal{N}} \sigma_j \cdot w_{ji} + \theta_i\right)}$$

$$\mathbf{P}(\sigma_i = 1 \mid \sigma_{-i}) = \mathrm{sigm}\left(\sum_{j\in\mathcal{N}} \sigma_j \cdot w_{ji} - \theta_i\right) \tag{5.44}$$
Our task is to use formula (5.44) to sample in the positive phase ℎ from 𝐐𝐓(ℎ|𝑣) and to sample
in the negative phase both 𝑣 and ℎ from 𝐏𝐓(𝑣, ℎ), i.e., 𝐏𝐓(𝑣|ℎ) and 𝐏𝐓(ℎ|𝑣).
Therefore, in the positive phase we run a Markov chain for $\mathbf{Q}_{\mathbf{T}}$ and sample $h$ according to the following formula:

$$\mathbf{Q}_{\mathbf{T}}(h_i = 1 \mid v, h_{-i}) = \mathrm{sigm}\left(\sum_{j\in\mathcal{V}} v_j \cdot w_{ji} + \sum_{m\in\mathcal{H}-\{i\}} h_m \cdot w_{mi} - \theta_i\right) \tag{5.45}$$

In the negative phase we run a Markov chain for $\mathbf{P}_{\mathbf{T}}$ and sample $h$ and $v$ according to the following formulae:

$$\mathbf{P}_{\mathbf{T}}(h_i = 1 \mid v, h_{-i}) = \mathrm{sigm}\left(\sum_{j\in\mathcal{V}} v_j \cdot w_{ji} + \sum_{m\in\mathcal{H}-\{i\}} h_m \cdot w_{mi} - \theta_i\right) \tag{5.46}$$

$$\mathbf{P}_{\mathbf{T}}(v_i = 1 \mid v_{-i}, h) = \mathrm{sigm}\left(\sum_{j\in\mathcal{H}} h_j \cdot w_{ji} + \sum_{k\in\mathcal{V}-\{i\}} v_k \cdot w_{ki} - \theta_i\right)$$
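Formulae (5.44)–(5.46) all have the same shape: a logistic function of the unit's net input. A minimal sketch of one such Gibbs update for a single unit (not from the thesis; W is assumed symmetric with zero diagonal, and the function names are illustrative):

```python
import math
import random

def sigm(x):
    return 1.0 / (1.0 + math.exp(-x))

def gibbs_update_unit(sigma, W, theta, i, rng=None):
    """Resample unit i from P(sigma_i = 1 | sigma_{-i}) per formula (5.44)."""
    rng = rng or random.Random()
    net = sum(sigma[j] * W[j][i] for j in range(len(sigma)) if j != i)
    p_on = sigm(net - theta[i])
    sigma[i] = 1 if rng.random() <= p_on else 0
    return sigma
```

Restricting the sum over j to the visible units (or to the other hidden units) gives exactly the specialized forms (5.45) and (5.46).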
For any iteration of learning, two separate Markov chains are run for every data vector: one chain is used to estimate 𝑞𝑖𝑗 and another chain is run to estimate 𝑝𝑖𝑗. This makes the algorithm computationally expensive because, before taking samples, we must wait until each Markov chain reaches its stationary distribution, and this process can require a very large number of steps, with no known foolproof method to determine whether equilibrium has been reached. A further disadvantage is the large variance of the estimated gradient: in general, the samples from the stationary distribution have high variance since they come from all over the model's distribution [37].
The Markov chains used in the negative phase of Gibbs sampling do not depend on the training
data; therefore, they do not have to be restarted for each new data vector 𝑣. This observation
has been exploited in persistent MCMC estimators [39, 66].
5.3.2.2 Using persistent Markov chains to estimate the data–independent statistics
Neal in [14] and Tieleman in [66] proposed a different way to estimate the data–independent
statistics 𝑝𝑖𝑗: a stochastic approximation procedure (SAP). SAP belongs to the class of
stochastic approximation algorithms of the Robbins–Monro type [67-69,39].
To understand how SAP works, we assume that, for any 𝑖 ≠ 𝑗, the data–dependent statistics 𝑞𝑖𝑗 are available to us at any time and that we maintain a set of 𝑀 sample points {𝜎(1)(𝑡), 𝜎(2)(𝑡), … , 𝜎(𝑀)(𝑡)}.
We augment our notation to include the parameters W of the network in the specification of the joint probability distribution 𝐏𝐓, as well as in the expression of the data–independent statistics 𝑝𝑖𝑗 (equation (5.20)):
$$p_{ij}(W) = \sum_{\sigma\in I_{\mathcal{N}}} \sigma_i \cdot \sigma_j \cdot \mathbf{P}_{\mathbf{T}}(\sigma; W) \tag{5.47}$$
Next, we consider a 𝐓𝐕𝐁𝐌 = (𝒩, 𝒢,W) given by Definition 4.3 and let W(𝑡) be the parameters
of the 𝐓𝐕𝐁𝐌 at the moment 𝑡 ∈ 𝒯 and 𝜎(𝑡) be the configuration of the 𝐓𝐕𝐁𝐌 at the same
moment of time. Then, W(𝑡) and 𝜎(𝑡) are updated sequentially as follows:
1. Given 𝜎(𝑡 − 1), a new state 𝜎(𝑡) is sampled from a transition operator 𝑇W(𝜎(𝑡 − 1) → 𝜎(𝑡)) that leaves 𝐏𝐓(W) invariant.
2. Having W(𝑡 − 1) and 𝜎(𝑡), the data–independent statistics 𝑝𝑖𝑗 are updated according to the formulae (5.46) to reflect the changes that affected 𝜎 (from 𝑡 − 1 to 𝑡).
3. Based on the new value of 𝑝𝑖𝑗, a new parameter matrix W(𝑡) is obtained with the formula (5.38).
The transition operator 𝑇W(𝜎(𝑡 − 1) → 𝜎(𝑡)) is defined by the blocked Gibbs updates given by
the formulae (5.46). Precise sufficient conditions that guarantee almost sure convergence to an
asymptotically stable point are given in [68-70]. One necessary condition requires the learning rate to decrease with time, i.e.:

$$\sum_{t=0}^{\infty} \delta_t = \infty \quad \text{and} \quad \sum_{t=0}^{\infty} \delta_t^2 < \infty \tag{5.48}$$

This condition can be trivially satisfied by setting $\delta_t = \frac{1}{t}$. Typically, in practice, the sequence
{|W(𝑡)|}𝑡∈𝒯 is bounded, and the Markov chain governed by the transition operator 𝑇W is ergodic.
Together with the condition on the learning rate, this ensures almost sure convergence [39]. The
pseudocode of SAP is presented below.
Algorithm 5.2: Stochastic Approximation
Given: n × n weight matrix W ; n × 1 threshold vector Θ
       all possible 𝑞𝑖𝑗 for any 𝑖 ≠ 𝑗
begin
Step 1: initialize W(0) and 𝑀 fantasy particles: {𝜎(1)(0), … , 𝜎(𝑀)(0)}
Step 2: for t = 1 to 𝒯 do
Step 3:   for k = 1 to 𝑀 do
Step 4:     sample 𝜎(𝑘)(𝑡) given 𝜎(𝑘)(𝑡 − 1) using the transition
            operator 𝑇W(𝜎(𝑘)(𝑡 − 1) → 𝜎(𝑘)(𝑡))
          end for //k
Step 5:   update W(𝑡) = W(𝑡 − 1) + 𝛿𝑡 ∙ (𝑞𝑖𝑗 − 𝑝𝑖𝑗)
Step 6:   decrease 𝛿𝑡
        end for //t
end
The intuition behind why this procedure works is the following: as the learning rate becomes
sufficiently small compared with the mixing rate of the Markov chain, this “persistent” chain will
always stay very close to the stationary distribution even if it is only run for a few MCMC
updates per parameter update. Samples from the persistent chain will be highly correlated for
successive parameter updates, but again, if the learning rate is sufficiently small the chain will
mix before the parameters have changed enough to significantly alter the value of the estimator
[39]. Many persistent chains can be run in parallel. The current state of each of these chains is
usually denoted as a “fantasy particle” [39].
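The persistent-chain idea can be sketched compactly: the fantasy particles survive across parameter updates, and each update advances every chain by a single Gibbs sweep under the current parameters. A minimal illustration (not from the thesis; thresholds are omitted for brevity and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def pairwise_stats(particles):
    """Average sigma_i * sigma_j over the particles (estimate of p_ij)."""
    P = np.asarray(particles, dtype=float)
    return P.T @ P / len(P)

def sap_learning(q_pair, n, n_particles=10, steps=50):
    """Stochastic approximation (Algorithm 5.2): the fantasy particles
    persist across parameter updates instead of being restarted."""
    W = np.zeros((n, n))
    particles = rng.integers(0, 2, size=(n_particles, n))
    for t in range(1, steps + 1):
        for p in particles:                 # one Gibbs sweep per particle
            for i in range(n):
                net = p @ W[:, i] - p[i] * W[i, i]
                p[i] = rng.random() <= 1.0 / (1.0 + np.exp(-net))
        p_pair = pairwise_stats(particles)
        delta_t = 1.0 / t                   # Robbins-Monro schedule (5.48)
        W += delta_t * (q_pair - p_pair)    # Step 5 of Algorithm 5.2
        np.fill_diagonal(W, 0.0)
    return W
```

Because the particles are never reset, only one sweep per parameter update is needed, in contrast with the repeated burn-in of plain Gibbs sampling.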
One important observation is that the process of running persistent Markov chains to produce the data–independent statistics 𝑝𝑖𝑗 is interleaved with the learning process itself. Consequently, the analysis of Algorithm 5.1 becomes more difficult because it can no longer be viewed as an outer loop around the inner loop represented by the statistics gathering.
5.3.2.3 Contrastive divergence (CD)
In [37] Hinton proposed a simple and effective alternative to maximum likelihood (ML) learning
that eliminates almost all of the computation required to get samples from the equilibrium
distribution and also eliminates much of the variance that masks the gradient signal. This
method, named contrastive divergence, follows the gradient of a different function than ML
learning.
ML learning follows the log likelihood gradient by minimizing the KL–divergence:

$$\mathrm{KL}(\mathbf{q}\|\mathbf{p}_{\mathbf{T}}) \equiv \mathrm{KL}(\mathbf{q}^{0}\|\mathbf{p}_{\mathbf{T}}^{\infty}) = \sum_{v\in I_{\mathcal{V}}} \mathbf{q}^{0}(v) \cdot \ln\frac{\mathbf{q}^{0}(v)}{\mathbf{p}_{\mathbf{T}}^{\infty}(v;W)} \tag{5.49}$$
CD learning approximately follows the gradient of the difference of two divergences [37]:

$$\mathrm{CD}_k = \mathrm{KL}(\mathbf{q}^{0}\|\mathbf{p}_{\mathbf{T}}^{\infty}) - \mathrm{KL}(\mathbf{q}^{k}\|\mathbf{p}_{\mathbf{T}}^{\infty}) \tag{5.50}$$

where: $\mathbf{q}^{0}$ is the data distribution over the visible vectors; $\mathbf{q}^{k}$ is the distribution over the “k–step” reconstructions of the data vectors generated by $k > 0$ full steps of Gibbs sampling; and $\mathbf{p}_{\mathbf{T}}^{\infty}$ is the equilibrium distribution of the network. In particular, $\mathbf{q}^{k}$ could even be $\mathbf{q}^{1}$, the distribution over the “one–step” reconstructions.
The CD algorithm is fueled by the contrast between the statistics collected when the input is a
real training example and when the input is a chain sample [71]. The intuitive motivation behind
CD is that we would like the Markov chain that is implemented by Gibbs sampling to leave the
initial distribution 𝐪𝟎 over the visible variables unaltered.
In CD learning, we start the Markov chain at the data distribution 𝐪𝟎 and run the chain for a
small number of steps. Instead of updating the parameters only after running the chain to
equilibrium, we simply run the chain for 𝑘 full steps and update the parameters. Then we keep
running the chain to equilibrium and, when there, update the parameters again. Thus, we
reduce the tendency of the chain to wander away from the initial distribution after 𝑘 steps.
Because 𝐪𝒌 is 𝑘 steps closer to the equilibrium distribution than 𝐪𝟎, we are guaranteed that
KL(𝐪𝟎||𝐩𝐓∞) exceeds KL(𝐪𝐤||𝐩𝐓∞) unless 𝐪𝟎 equals 𝐪𝐤. Consequently, CD (equation (5.50))
can never be negative. Also, for Markov chains in which all transitions have nonzero probability,
𝐪𝟎 = 𝐪𝐤 implies 𝐪𝟎 = 𝐩𝐓∞, because, if the distribution does not change at all on the first step, it
must already be at equilibrium, so CD can be zero only if the model is perfect.
In [72] Carreira–Perpinan and Hinton showed that, in general, CD provides biased estimates. However, the bias seems to be small, as their experiments comparing CD and ML showed. They also showed that, for almost all data distributions, the fixed–points of CD are not fixed–points of ML, and vice versa. Finally, they proposed a new and more effective approach to collecting statistics for Boltzmann machine learning: use CD to perform most of the learning, followed by a short run of ML to clean up the solution.
CD learning is well–suited for Restricted Boltzmann Machines due to the fact that they are the
only Boltzmann machines with tractable conditional distributions 𝐏𝐓(ℎ|𝑣) and 𝐐𝐓(ℎ|𝑣).
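For a restricted Boltzmann machine, where the conditional distributions factorize over the layers, one CD-1 parameter update can be sketched as follows. This is an illustrative reading of the procedure, not the thesis's own code: the temperature is fixed at 1, thresholds are omitted for brevity, and the names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, v0, lr=0.1):
    """One CD-1 step for an RBM: contrast the data-driven statistics with
    the statistics of one-step reconstructions.

    W: m x l weight matrix; v0: batch of data vectors (batch x m)."""
    ph0 = sigm(v0 @ W)                         # positive phase: Q(h = 1 | v0)
    h0 = (rng.random(ph0.shape) <= ph0).astype(float)
    pv1 = sigm(h0 @ W.T)                       # reconstruct the visible units
    v1 = (rng.random(pv1.shape) <= pv1).astype(float)
    ph1 = sigm(v1 @ W)                         # hidden probabilities at step 1
    grad = (v0.T @ ph0 - v1.T @ ph1) / v0.shape[0]
    return W + lr * grad
```

The contrast between the data-driven term and the reconstruction-driven term is exactly what "fuels" the CD algorithm in the sense described above.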
5.4 The equilibrium distribution of a Boltzmann machine
To develop an understanding of Boltzmann machines it is necessary to be able to
determine the equilibrium distribution. In general, the equilibrium distribution of a stochastic
process is related to the structure of the associated transition probability matrix. If the transition
probabilities are known, then it becomes possible to compute the equilibrium distribution. In an
asynchronous Boltzmann machine with 𝑛 units the transitions between configurations or so–
called state transitions generate a finite Markov chain whose state space contains 2𝑛 global
states and whose transitions between global states are performed such that only one unit may change its state at a time, according to the update rule (4.34) or its temporal version (4.44). A consequence of these update rules is the implicit satisfaction of the Markov local property for the associated Markov chain. Moreover, the transition probability matrix of the associated Markov chain can be computed from the parameters of the model.
In this section we are interested in establishing the existence of a unique equilibrium distribution
for a Boltzmann machine when 𝐓 > 0 and in determining the behavior of a Boltzmann machine
when 𝐓 = 0. In presenting the chain of logical arguments that relate Markov processes and
Boltzmann machines we have been inspired by the work of Viveros [42].
In order to construct the 2𝑛 × 2𝑛 transition probability matrix of the Markov chain associated to a Boltzmann machine, we need to define the transition probability 𝐏𝐓(𝜎(𝑡) | 𝜎(𝑡 − 1)) of a global state transition 𝜎(𝑡 − 1) → 𝜎(𝑡). The transition probability matrix 𝐏𝐓(𝜎(𝑡) | 𝜎(𝑡 − 1)) is a rather sparse matrix. If the updating sequence is random, the matrix will have 𝑛 + 1 non–zero entries per row. One of the non–zero entries is on the diagonal and represents the probability that the particular unit selected for updating does not change its state, in which case the configuration 𝜎 remains unchanged. Each of the other non–zero entries represents the probability that the corresponding unit will change its state according to the update rule (4.34) or its temporal version (4.44); these entries correspond to transitions to one of the 𝑛 states that differ from 𝜎 by a change in the state of a single unit [42].
However, if the updating sequence is more orderly and each unit has a predetermined turn to
update, the number of non–zero entries per row decreases further. For example, one way to
update a layered Boltzmann machine is layer–by–layer and sequentially inside a layer. Thus, in
each row of the transition probability matrix there will be exactly two components that are not
zero: one to the “left” and one to the “right” of the unit to be updated.
Therefore, different updating regimes lead to different transition probability matrices and consequently to different dynamics. Before we define the transition probability matrix 𝐏𝐓(𝜎(𝑡) | 𝜎(𝑡 − 1)), we recall its source of inspiration: the transition probability for a single unit in one time step, given by the formula (4.44). The formula (4.44) also shows that, in one time step, every transition from one configuration to another configuration that differs from the first in at most one position has non–zero probability.
Definition 5.3:
The transition probability 𝐏𝐓(𝜎(𝑡) | 𝜎(𝑡 − 1)) of a global state transition 𝜎(𝑡 − 1) → 𝜎(𝑡) in an asynchronous Boltzmann machine is given by the formula (5.51):

$$\mathbf{P}_{\mathbf{T}}(\sigma(t) \mid \sigma(t-1)) = \begin{cases} \dfrac{1}{1 + \exp\left(-\dfrac{(2\sigma_i(t)-1)\cdot \mathrm{net}_t(i,\sigma(t-1))}{\mathbf{T}}\right)} & \text{if there is exactly one } i \text{ such that } \sigma_i(t) \ne \sigma_i(t-1) \\[3ex] 1 - \displaystyle\sum_{\rho(t)\ne\sigma(t)} \mathbf{P}_{\mathbf{T}}(\rho(t) \mid \sigma(t-1)) & \text{if } \sigma_i(t) = \sigma_i(t-1) \text{ for all } i \\[2ex] 0 & \text{otherwise} \end{cases} \tag{5.51}$$
We can obtain a more readable form of the formula (5.51) by using the notations $\sigma(t) \equiv \sigma$ and $\sigma(t-1) \equiv \tau$. Then the transition matrix $\mathbf{P}_{\mathbf{T}}(\sigma(t) \mid \sigma(t-1))$ given by (5.51) becomes $\mathbf{P}_{\mathbf{T}}(\sigma \mid \tau)$ given by (5.52):

$$\mathbf{P}_{\mathbf{T}}(\sigma \mid \tau) = \begin{cases} \dfrac{1}{1 + \exp\left(-\dfrac{(2\sigma_i-1)\cdot \mathrm{net}(i,\tau)}{\mathbf{T}}\right)} & \text{if there is exactly one } i \text{ such that } \sigma_i \ne \tau_i \\[3ex] 1 - \displaystyle\sum_{\rho\ne\sigma} \mathbf{P}_{\mathbf{T}}(\rho \mid \tau) & \text{if } \sigma_i = \tau_i \text{ for all } i \\[2ex] 0 & \text{otherwise} \end{cases} \tag{5.52}$$
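For a small machine, the 2^n × 2^n matrix of (5.52) can be built explicitly and checked to be stochastic. A sketch (not from the thesis), assuming each unit is selected for update with uniform probability 1/n, as in the proof of Theorem 5.8 below; names are illustrative:

```python
import itertools
import numpy as np

def transition_matrix(W, theta, T=1.0):
    """Build the 2^n x 2^n transition matrix of (5.52), assuming uniform
    1/n selection of the unit to update."""
    n = len(theta)
    states = list(itertools.product([0, 1], repeat=n))
    index = {s: k for k, s in enumerate(states)}
    P = np.zeros((2 ** n, 2 ** n))
    for tau in states:
        for i in range(n):
            net = sum(tau[j] * W[j][i] for j in range(n) if j != i) - theta[i]
            flipped = tau[:i] + (1 - tau[i],) + tau[i + 1:]
            # probability that the selected unit i moves to the flipped value
            p_flip = 1.0 / (1.0 + np.exp(-(2 * flipped[i] - 1) * net / T))
            P[index[tau], index[flipped]] += p_flip / n
        # diagonal: probability that the configuration remains unchanged
        P[index[tau], index[tau]] = 1.0 - P[index[tau]].sum()
    return P
```

Each row then has at most n + 1 non-zero entries, exactly as described above.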
In order to establish the existence or non–existence of an equilibrium distribution of a Markov chain in terms of the properties of its associated transition probability matrix, we take into consideration the fact that the transition probability matrix is stochastic, i.e., a special case of a non–negative matrix. Furthermore, the Perron–Frobenius theorem [73-74] details the precise range of possibilities for the eigenvalues and eigenvectors of non–negative, irreducible matrices. This result is of particular importance for us because the properties of the eigenvalues of the transition probability matrix are the only factors that influence the nature and existence of an equilibrium distribution for a Markov process [42].
We introduce the concepts of aperiodicity and irreducibility. A finite Markov chain is aperiodic if no state of it is periodic with period 𝑘 > 1; a state has period 𝑘 if one can only return to it at times that are multiples of 𝑘. A finite Markov chain is irreducible if one can reach any state from any state in finite time with non–zero probability. We provide without proof the following theorem, which asserts the existence and uniqueness of the equilibrium distribution for a category of transition probability matrices that includes, as we shall see, the transition probability matrices associated to asynchronous Boltzmann machines.

Theorem 5.5 (Existence of a unique equilibrium distribution for stochastic, irreducible, aperiodic matrices)
Let 𝐏 be a 𝑑 × 𝑑 stochastic, irreducible, and aperiodic matrix. Then the equilibrium distribution for the associated Markov process exists and is given by the left eigenvector corresponding to the eigenvalue 𝜆 = 1, which is an eigenvalue of 𝐏 since 𝐏 is stochastic.
The proof of this theorem can be found in [42]. In order to establish the existence of a unique
equilibrium distribution for an asynchronous Boltzmann machine, we are looking at the
irreducibility and aperiodicity properties of the transition probability matrix. Two cases are in
essence considered: when the pseudo–temperature is strictly positive and when the pseudo–temperature is zero. Moreover, within the case 𝐓 > 0, the scenario 𝐓 → 0 demands special
consideration. Therefore, we treat it separately. Table 2 gives a classification of all possible
transition probability matrices for asynchronous Boltzmann machines and how they position
themselves with respect to irreducibility and aperiodicity.
Table 2 Transition probability matrices for asynchronous symmetric Boltzmann machines

  𝐓 > 0: irreducible and aperiodic              → equilibrium distribution exists
  𝐓 = 0: irreducible and periodic, or reducible → no equilibrium distribution
𝐓 > 0
We want to prove the existence of an equilibrium distribution for asynchronous symmetric
Boltzmann machines. To accomplish this, first we prove a few helper lemmas and theorems,
then we prove the most significant result of this section: Theorem 5.10.
Lemma 5.6:
Given an asynchronous Boltzmann machine with 𝑛 units, any state can be visited from any
other state with positive probability in at most 𝑛 steps.
Proof: Each configuration of a Boltzmann machine with 𝑛 units can be viewed as a vector of length 𝑛, each component being the state of a unit. In an asynchronous Boltzmann machine one unit updates at each time step and, since 𝐓 > 0, the update is described by the equation (4.44). Moreover, every unit has a positive probability of changing its state.
A consequence of the update rule is that the Hamming distance between two successive configurations is no greater than 1. Thus, any two configurations are at most Hamming distance 𝑛 apart, and so the worst case requires that 𝑛 units change state. Therefore there is a positive probability that this can occur [42].
Theorem 5.7:
The transition probability matrix (5.51) – (5.52) of an asynchronous Boltzmann machine is
irreducible.
Proof: Let $\mathbf{P} = (p_{ij})_{1\le i,j\le 2^n}$ be the $2^n \times 2^n$ transition probability matrix $\mathbf{P}_{\mathbf{T}}(\sigma(t) \mid \sigma(t-1))$ given by (5.51). From Lemma 5.6 we have that any configuration has a positive probability of being visited from any other configuration in at most 𝑛 steps. For every configuration 𝜎 ∈ I𝒩 we denote by 𝑢 the index of the row corresponding to 𝜎 in 𝐏. Therefore, according to the definition of an irreducible matrix, for every pair of configurations 𝜎, 𝜏 ∈ I𝒩 identified by their row indices 𝑢 and 𝑣, there exists an integer 𝑟 ≥ 1 such that $p_{uv}^{(r)} > 0$; by Lemma 5.6 we may take 𝑟 ≤ 𝑛. Hence the transition probability matrix 𝐏𝐓 defined by (5.51) is an irreducible matrix [42].
Theorem 5.8:
The transition probability matrix (5.51) – (5.52) of an asynchronous Boltzmann machine is
aperiodic.
Proof: For any given configuration 𝜎 ∈ I𝒩 we shall prove that 𝐏𝐓(𝜎|𝜎) > 0. Given any configuration 𝜎 there are 𝑛 + 1 possible transitions, one of them being to remain in the current configuration. The 𝑛 other possible transitions lead to configurations at Hamming distance 1 from 𝜎. Let 𝜏 be one of these 𝑛 configurations. The probability that the ith unit of the configuration 𝜎 outputs 𝜏𝑖 is given by the equation (4.44):

$$\mathbf{P}(\tau_i \mid \sigma) = \frac{1}{1 + \exp\left(-\dfrac{(2\tau_i - 1)\cdot \mathrm{net}(i,\sigma)}{\mathbf{T}}\right)}$$

When 𝐓 > 0, we have 0 < 𝐏(𝜏𝑖|𝜎) < 1 for all these 𝑛 configurations. For any other configuration 𝜌, i.e., one whose Hamming distance to 𝜎 is greater than 1, the transition probability is zero.
Now we need to take into consideration another probability, which we have ignored until now because it has no direct effect on learning: the probability that the ith unit is selected for update. Usually this probability distribution is a uniform distribution over the set of units, which means that the probability of the ith unit being selected for update is $\frac{1}{n}$. This probability distribution is imposed by the environment, so it is independent of 𝐏.
This being said, we can compute 𝐏(𝜎|𝜎):

$$\mathbf{P}(\sigma \mid \sigma) = 1 - \sum_{\tau\in I_{\mathcal{N}}-\{\sigma\},\, i\in\mathcal{N}} \frac{1}{n}\cdot\mathbf{P}(\tau_i \mid \sigma) = 1 - \frac{1}{n}\cdot\sum_{\tau\in I_{\mathcal{N}}-\{\sigma\},\, i\in\mathcal{N}} \mathbf{P}(\tau_i \mid \sigma) > 0$$
Hence, for all 𝜎 ∈ I𝒩 we have that 𝐏(𝜎|𝜎) > 0. According to Lemma C.1 in Appendix C, we conclude that the transition probability matrix defined by (5.51) – (5.52) is aperiodic.
We remark that in this proof we have also proved that the transition probability matrix given by (5.51) – (5.52) is reflexive [42].
Theorem 5.9:
For the asynchronous transition probability matrix defined by (5.51) – (5.52) the equilibrium
distribution exists and is given by the left eigenvector corresponding to the eigenvalue 𝜆 = 1.
Proof: When 𝐓 > 0 the transition probability matrix given by (5.51) – (5.52) is stochastic. From
Theorem 5.7 and Theorem 5.8 we know that this transition probability matrix is also irreducible
and aperiodic. Therefore, from Theorem 5.5 there exists a unique equilibrium distribution given
by the left eigenvector of the matrix given by (5.51) – (5.52) corresponding to the eigenvalue
𝜆 = 1 [42].
We have proved that in general for any weight matrix the equilibrium distribution of an
asynchronous Boltzmann machine with 𝐓 > 0 is given by the left eigenvector of the transition
matrix (5.51) – (5.52) corresponding to the eigenvalue 𝜆 = 1. If the system is allowed to stabilize
at a given pseudo–temperature, we have proved the existence of an equilibrium distribution. If
we slowly lower the pseudo–temperature, allowing the system to restabilize at the new
equilibrium distribution, then as 𝐓 → 0 the distribution will tend to a uniform distribution over the
optimal set of configurations 𝜎, which we call 𝑂𝑝𝑡. Roughly speaking, the asynchronous
Boltzmann machine converges asymptotically to the set of globally optimal states 𝑂𝑝𝑡 ⊆ I𝒩 that
minimize the energy function given by (4.31) [42]. The following theorem states these facts
more formally.
Theorem 5.10 (Asynchronous weight–symmetric equilibrium distribution):
Let the transition probabilities in a Boltzmann machine be given by (5.51) – (5.52). Then:

1. There exists a unique equilibrium distribution 𝐏𝐓(𝜎) for all 𝐓 > 0 whose components are given by:

$$\mathbf{P}_{\mathbf{T}}(\sigma) = \lim_{k\to\infty} \mathbf{P}_{\mathbf{T}}(\sigma(k) = \sigma) = \frac{\exp\left(-\frac{E(\sigma)}{\mathbf{T}}\right)}{Z(\mathbf{T})}, \quad \text{where } Z(\mathbf{T}) = \sum_{\tau\in I_{\mathcal{N}}} \exp\left(-\frac{E(\tau)}{\mathbf{T}}\right) \tag{5.53}$$

2. As 𝐓 → 0 the stationary distribution converges to a uniform distribution over the set of optimal states, i.e.:

$$\lim_{\mathbf{T}\to 0}\left(\lim_{k\to\infty} \mathbf{P}_{\mathbf{T}}(\sigma(k) = \sigma)\right) = \frac{1}{|Opt|}\cdot\mathbb{I}_{Opt}(\sigma) \tag{5.54}$$

where $\mathbb{I}_{Opt}$ is the characteristic function of $Opt$, i.e., $\mathbb{I}_{Opt}$ takes the value one for any $\sigma \in Opt$ and zero elsewhere.
The first part asserts that the equilibrium distribution at any pseudo–temperature 𝐓 is the
Boltzmann–Gibbs distribution. The second part implies that as 𝐓 → 0 the equilibrium distribution
tends to a uniform distribution over the set 𝑂𝑝𝑡 of minimal energy states 𝜎.
Proof: From Theorem 5.9 the transition probability matrix defined by (5.51) – (5.52) has a
unique equilibrium distribution given by its left eigenvector and the corresponding eigenvalue
𝜆 = 1.
Our approach is to use Proposition C.5 from Appendix C to prove that 𝐏𝐓(𝜎) is the equilibrium distribution. Specifically, if for all 𝜎, 𝜏 ∈ I𝒩 there exist numbers 𝐏𝐓(𝜎), 𝐏𝐓(𝜏) such that:

$$\mathbf{P}_{\mathbf{T}}(\tau \mid \sigma)\cdot\mathbf{P}_{\mathbf{T}}(\sigma) = \mathbf{P}_{\mathbf{T}}(\sigma \mid \tau)\cdot\mathbf{P}_{\mathbf{T}}(\tau) \tag{5.55}$$

then 𝐏𝐓(𝜎) is the equilibrium distribution. We have to show that 𝐏𝐓(𝜎) given by (5.53) together with 𝐏𝐓(𝜎|𝜏) given by (5.52) satisfy (5.55). We distinguish two cases:

1. If 𝜎 = 𝜏, then (5.55) is satisfied by the definitions (5.52) and (5.53).
2. If 𝜎 ≠ 𝜏, then we have:
$$\mathbf{P}_{\mathbf{T}}(\sigma)\cdot\mathbf{P}_{\mathbf{T}}(\tau \mid \sigma) = \mathbf{P}_{\mathbf{T}}(\sigma)\cdot\mathbf{P}(\tau_i \mid \sigma)$$

$$\mathbf{P}_{\mathbf{T}}(\sigma)\cdot\mathbf{P}_{\mathbf{T}}(\tau \mid \sigma) = \frac{\exp\left(-\frac{E(\sigma)}{\mathbf{T}}\right)}{Z(\mathbf{T})}\cdot\frac{1}{1 + \exp\left(-\frac{(2\tau_i-1)\cdot \mathrm{net}(i,\sigma)}{\mathbf{T}}\right)} \tag{5.56}$$

By multiplying the right hand side of (5.56) by:

$$\exp\left(\frac{(2\tau_i-1)\cdot \mathrm{net}(i,\sigma)}{\mathbf{T}}\right)\cdot\exp\left(\frac{-(2\tau_i-1)\cdot \mathrm{net}(i,\sigma)}{\mathbf{T}}\right) = 1$$

we obtain:

$$\mathbf{P}_{\mathbf{T}}(\sigma)\cdot\mathbf{P}_{\mathbf{T}}(\tau \mid \sigma) = \frac{\exp\left(-\frac{E(\sigma)}{\mathbf{T}}\right)}{Z(\mathbf{T})}\cdot\frac{\exp\left(\frac{(2\tau_i-1)\cdot \mathrm{net}(i,\sigma)}{\mathbf{T}}\right)}{1 + \exp\left(\frac{(2\tau_i-1)\cdot \mathrm{net}(i,\sigma)}{\mathbf{T}}\right)} \tag{5.57}$$
The asynchronous Boltzmann machine is symmetric, so the equation (4.51) holds:

$$\mathrm{net}(i,\sigma)\cdot\Delta\sigma_i = \mathrm{net}(i,\sigma)\cdot(\tau_i - \sigma_i) = -\Delta E_i = E(\sigma) - E(\tau)$$

The equations (4.47) and (4.49) also hold for 𝜎 and 𝜏 because their Hamming distance is 1:

$$\mathrm{net}(i,\tau) = \mathrm{net}(i,\sigma) \qquad \tau_i = 1 - \sigma_i$$

We rewrite (5.57) by substituting 𝜏𝑖 as specified above and using the equality between the net inputs $\mathrm{net}(i,\tau)$ and $\mathrm{net}(i,\sigma)$. We obtain:
$$\mathbf{P}_{\mathbf{T}}(\sigma)\cdot\mathbf{P}_{\mathbf{T}}(\tau \mid \sigma) = \frac{\exp\left(-\frac{E(\tau) + \mathrm{net}(i,\sigma)\cdot\Delta\sigma_i}{\mathbf{T}}\right)}{Z(\mathbf{T})}\cdot\frac{\exp\left(\frac{(-2\sigma_i+1)\cdot \mathrm{net}(i,\tau)}{\mathbf{T}}\right)}{1 + \exp\left(\frac{(-2\sigma_i+1)\cdot \mathrm{net}(i,\tau)}{\mathbf{T}}\right)}$$

$$\mathbf{P}_{\mathbf{T}}(\sigma)\cdot\mathbf{P}_{\mathbf{T}}(\tau \mid \sigma) = \frac{\exp\left(-\frac{E(\tau)}{\mathbf{T}}\right)\cdot\exp\left(-\frac{\mathrm{net}(i,\sigma)\cdot(\tau_i-\sigma_i)}{\mathbf{T}}\right)}{Z(\mathbf{T})}\cdot\frac{\exp\left(-\frac{(2\sigma_i-1)\cdot \mathrm{net}(i,\tau)}{\mathbf{T}}\right)}{1 + \exp\left(-\frac{(2\sigma_i-1)\cdot \mathrm{net}(i,\tau)}{\mathbf{T}}\right)}$$

$$\mathbf{P}_{\mathbf{T}}(\sigma)\cdot\mathbf{P}_{\mathbf{T}}(\tau \mid \sigma) = \frac{\exp\left(-\frac{E(\tau)}{\mathbf{T}}\right)}{Z(\mathbf{T})}\cdot\frac{1}{1 + \exp\left(-\frac{(2\sigma_i-1)\cdot \mathrm{net}(i,\tau)}{\mathbf{T}}\right)}\cdot\exp\left(-\frac{\mathrm{net}(i,\sigma)\cdot(\tau_i+\sigma_i-1)}{\mathbf{T}}\right)$$

$$\mathbf{P}_{\mathbf{T}}(\sigma)\cdot\mathbf{P}_{\mathbf{T}}(\tau \mid \sigma) = \mathbf{P}_{\mathbf{T}}(\tau)\cdot\mathbf{P}_{\mathbf{T}}(\sigma \mid \tau)\cdot\exp(0) = \mathbf{P}_{\mathbf{T}}(\tau)\cdot\mathbf{P}_{\mathbf{T}}(\sigma \mid \tau)$$
Consequently, according to Proposition C.5 from Appendix C, 𝐏𝐓(𝜎) is the equilibrium distribution.
We omit the details of the proof for the second part of Theorem 5.10. However, as mentioned by Viveros [42], apart from a change of notation, the proof is exactly Theorem 8.1, p. 134, and Corollary 2.1, p. 18, of [75].
The hypothesis of symmetric weights in a Boltzmann machine simplifies the case 𝐓 > 0 considerably because it enables one to infer that the detailed balance condition holds, and this leads immediately to simple closed formulae for the equilibrium distribution [42].
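The detailed balance argument can be checked numerically on a small machine: the Boltzmann–Gibbs distribution of (5.53) should satisfy (5.55) for every single-flip transition. A sketch (not from the thesis), assuming a symmetric weight matrix and the energy convention E(σ) = −Σ_{i<j} w_ij·σ_i·σ_j + Σ_i θ_i·σ_i with T = 1; names are illustrative:

```python
import itertools
import math

def energy(sigma, W, theta):
    n = len(sigma)
    e = -sum(W[i][j] * sigma[i] * sigma[j]
             for i in range(n) for j in range(i + 1, n))
    return e + sum(theta[i] * sigma[i] for i in range(n))

def check_detailed_balance(W, theta, T=1.0):
    """Verify (5.55) for all single-flip pairs of a small Boltzmann machine."""
    n = len(theta)
    states = list(itertools.product([0, 1], repeat=n))
    Z = sum(math.exp(-energy(s, W, theta) / T) for s in states)
    P = {s: math.exp(-energy(s, W, theta) / T) / Z for s in states}

    def flip_prob(tau, sigma, i):
        # probability of the single-unit transition tau -> sigma per (5.52)
        net = sum(tau[j] * W[j][i] for j in range(n) if j != i) - theta[i]
        return 1.0 / (1.0 + math.exp(-(2 * sigma[i] - 1) * net / T))

    for tau in states:
        for i in range(n):
            sigma = tau[:i] + (1 - tau[i],) + tau[i + 1:]
            lhs = P[tau] * flip_prob(tau, sigma, i)
            rhs = P[sigma] * flip_prob(sigma, tau, i)
            assert abs(lhs - rhs) < 1e-12
    return True
```

Any choice of symmetric weights and thresholds should pass the check, which is exactly what Theorem 5.10 asserts.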
𝐓 → 0
In the limiting case when 𝐓 → 0 the transition probability matrix for asynchronous Boltzmann
machines is given by taking the limit of the equation (5.52) as 𝐓 → 0:

𝐏𝐓(𝜎|𝜏) = { lim𝐓→0 𝑔(𝑖) / (1 + exp(−(2 ∙ 𝜎𝑖 − 1) ∙ net(𝑖, 𝜏)/𝐓))   if ∃ only one 𝑖 such that 𝜎𝑖 ≠ 𝜏𝑖
            1 − lim𝐓→0 ∑𝜌≠𝜎 𝐏𝐓(𝜌|𝜏)                               if 𝜎𝑖 = 𝜏𝑖 for all 𝑖
            0                                                      otherwise } (5.58)
where 𝑔(𝑖) denotes the environmental distribution used for the selection of the unit 𝑖 to be
updated. Usually 𝑔 is a uniform distribution, so the probability that the 𝑖th unit is selected for
update is 1/𝑛.
The limiting behavior of the transition probability matrix as 𝐓 → 0 is determined by the limiting
transition probability for a particular unit 𝑖:
lim𝐓→0 1 / (1 + exp(−(2 ∙ 𝜎𝑖 − 1) ∙ net(𝑖, 𝜏)/𝐓)) = { 0    if Δ𝐸 = −(2 ∙ 𝜎𝑖 − 1) ∙ net(𝑖, 𝜏) > 0
                                                      1/2  if Δ𝐸 = −(2 ∙ 𝜎𝑖 − 1) ∙ net(𝑖, 𝜏) = 0
                                                      1    if Δ𝐸 = −(2 ∙ 𝜎𝑖 − 1) ∙ net(𝑖, 𝜏) < 0 } (5.59)
The equation (5.59) gives the limiting transition probability 𝜏𝑖 → 𝜎𝑖 as 𝐓 → 0 for a particular unit 𝑖
in terms of its activation, or net input. We learned that, in principle, there are 𝑛 + 1 non–zero
entries in any row of the 2^𝑛 × 2^𝑛 transition probability matrix. But is it possible that some of
these entries become zero? If that happens, what effect does it have on the transition probability
matrix as 𝐓 → 0? To answer these questions, suppose that a configuration 𝜏 has 𝑘 units with zero
activation, where 1 ≤ 𝑘 ≤ 𝑛:
∃ 𝑖(1), 𝑖(2), …, 𝑖(𝑘) such that net(𝑖(𝑗), 𝜏) = 0 for any 1 ≤ 𝑗 ≤ 𝑘 (5.60)

In the transition 𝜏 → 𝜎, the units 𝜎𝑖(1), …, 𝜎𝑖(𝑘) can take the values 0 or 1 with equal probability.
Because of the zero activation of these units, the limit (5.59) corresponding to them is always 1/2,
so the corresponding matrix entries are 𝑔(𝑖) ∙ 1/2 = 1/(2𝑛). Thus, as 𝐓 → 0, in the row
corresponding to 𝜏 in the 2^𝑛 × 2^𝑛 transition probability matrix there will be 𝑘 ≤ 𝑛 non–zero
entries having the value 1/(2𝑛), and the remainder of the entries must sum to 1 − 𝑘/(2𝑛) [42].
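The three-way limit in (5.59) can be written directly as a small function. A minimal sketch (the function names and test values are illustrative choices, not taken from the thesis), together with a comparison against the positive-temperature sigmoid it is the limit of:

```python
import math

def limit_transition_prob(sigma_i, net_i_tau):
    """Limiting single-unit transition probability as T -> 0, as in (5.59)."""
    delta_E = -(2 * sigma_i - 1) * net_i_tau  # energy change of the proposed flip
    if delta_E > 0:
        return 0.0   # uphill move: never taken in the limit
    if delta_E == 0:
        return 0.5   # zero activation: fair coin flip
    return 1.0       # downhill move: always taken

def sigmoid_T(sigma_i, net_i_tau, T):
    """The positive-temperature transition probability that (5.59) is the limit of."""
    return 1.0 / (1.0 + math.exp(-(2 * sigma_i - 1) * net_i_tau / T))

assert limit_transition_prob(1, 2.0) == 1.0   # turning a unit on along its net input
assert limit_transition_prob(0, 2.0) == 0.0   # turning it off against its net input
assert limit_transition_prob(1, 0.0) == 0.5   # zero activation
# at small T the sigmoid is already close to its limiting value
assert abs(sigmoid_T(1, 2.0, 0.01) - 1.0) < 1e-6
assert abs(sigmoid_T(0, 2.0, 0.01) - 0.0) < 1e-6
```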
𝐓 = 0
The case 𝐓 = 0 is a special limiting case in which the dynamics become deterministic and the
model approaches the Hopfield model. Provided that only one unit updates at a time, the network
settles to a local energy minimum, similarly to a Hopfield network (see Section 2.5.2). Here there
might be cycles or fixed–points but, in the strict sense, an
equilibrium distribution does not exist. Since the networks are deterministic, their “transition
probability matrices” are not transition probability matrices in the strict sense. However, these
“transition probability matrices” can be either reducible or irreducible and periodic [42].
5.5 Learning algorithms based on variational approaches
In Section 5.3 we learned how to find the parameters of the Boltzmann Machine
Learning problem by means of maximum likelihood estimation. In this section we use a different,
variational approach in order to pick a tractable variational parameterization for a Boltzmann
machine. The true probability distribution of a Boltzmann machine cannot be computed exactly,
regardless of the form in which it is expressed: joint, conditional, or marginal. The goal of
variational learning in a Boltzmann machine is to approximate the true conditional probability of
the hidden variables given the data vectors over the visible variables, and to use this
approximation in the learning rules (5.38) and (5.39) to replace the data–dependent statistics.
Here we recall a consequence of Theorem 5.1, specifically the equation (5.12): in a Boltzmann
machine with the visible units 𝑣 clamped and with hidden units ℎ, the subnet ℋ itself behaves like
a Boltzmann machine with its own interconnecting weights W and thresholds (𝛉𝐢)𝑖∈ℋ. In other
words, the only effect of 𝑣 on ℎ is to cause the hidden units ℎ to run with the effective thresholds
𝛉𝐣 given by the equation (5.10) instead of their regular thresholds 𝜃𝑗. Therefore, the true
conditional probability distribution 𝐏𝐓(ℎ|𝑣) is governed by the following Boltzmann–Gibbs
distribution:
𝐏𝐓(ℎ|𝑣) = 𝐏ℋ(ℎ|𝑣) = exp(−𝐸ℋ(ℎ|𝑣)) / 𝑍ℋ (5.61)

𝐏𝐓(ℎ|𝑣) = exp(∑𝑗∈ℋ ℎ𝑗 ∙ ∑𝑖∈ℋ,𝑖<𝑗 ℎ𝑖 ∙ 𝑤𝑖𝑗 − ∑𝑗∈ℋ 𝛉𝐣 ∙ ℎ𝑗) / 𝑍ℋ

where:

𝑍ℋ = ∑ℎ∈Iℋ exp(∑𝑗∈ℋ ℎ𝑗 ∙ ∑𝑖∈ℋ,𝑖<𝑗 ℎ𝑖 ∙ 𝑤𝑖𝑗 − ∑𝑗∈ℋ 𝛉𝐣 ∙ ℎ𝑗) (5.62)
We augment the notation of the true probability distribution 𝐏𝐓 to include the parameters of the
network. In this section we are specifically interested in the parameters corresponding to a
mean parameterization of a pairwise Markov network (equations (3.11), (3.12), and (3.15)):
𝐏𝐓(𝑣, ℎ) = 𝐏𝐓(𝑣, ℎ; μ) and 𝐏𝐓(ℎ|𝑣) = 𝐏𝐓(ℎ|𝑣; μ) (5.63)
5.5.1 Using variational free energies to compute the statistics required for learning
In this section we establish the connection between the Boltzmann machine variational learning
and the approximations of the free energies discussed in Chapter 3.
As mentioned in the previous section, variational learning is concerned with the true conditional
distribution 𝐏𝐓(ℎ|𝑣). Concretely, variational learning means that we have to choose a conditional
probability distribution 𝐐(ℎ|𝑣) from a family ℚ(ℎ|𝑣; 𝜆) of approximating conditional probability
distributions that are described by the variational parameters 𝜆. Generally, the Markov network
representing 𝐐 is not the same as the Markov network representing 𝐏𝐓 but rather a sub–graph
of it.
From the family of approximating distributions ℚ(ℎ|𝑣; 𝜆), we choose a particular distribution 𝐐 by
minimizing the KL–divergence KL(ℚ||𝐏𝐓) given by the equation (3.25) with respect to the
variational parameters 𝜆. Then, the particular distribution 𝐐(ℎ|𝑣; 𝜆∗) that corresponds to the
values 𝜆∗ of the variational parameters that resulted from the KL–divergence minimization is
considered the best approximation of 𝐏𝐓(ℎ|𝑣) in the family ℚ(ℎ|𝑣; 𝜆).
𝐐 = ℚ(ℎ|𝑣; 𝜆∗) where: 𝜆∗ = argmin𝜆 KL(ℚ(ℎ|𝑣; 𝜆) || 𝐏𝐓(ℎ|𝑣)) (5.64)
One simple justification for using the KL–divergence as a measure of approximation accuracy is
that it yields the best lower bound on the probability of the evidence 𝐩𝐓(𝑣) within the family of
approximations ℚ(ℎ|𝑣; 𝜆). The probability of the evidence is the same as the probability
distribution over the visible units.
To prove this claim, we first recall a form of Jensen’s inequality used in the context of probability
theory: if X is a random variable and φ is a concave function, then:

φ(𝐄[X]) ≥ 𝐄[φ(X)] (5.65)

Since the logarithm is concave, bounding the log likelihood, i.e., the logarithm of the probability
of the evidence, with Jensen’s inequality we obtain:
ln 𝐩𝐓(𝑣) = ln MARG(𝐏𝐓, 𝒱) = ln ∑ℎ∈Iℋ 𝐏𝐓(𝑣, ℎ) = ln ∑ℎ∈Iℋ 𝐐(ℎ|𝑣) ∙ 𝐏𝐓(𝑣, ℎ)/𝐐(ℎ|𝑣)

ln ∑ℎ∈Iℋ 𝐐(ℎ|𝑣) ∙ 𝐏𝐓(𝑣, ℎ)/𝐐(ℎ|𝑣) ≥ ∑ℎ∈Iℋ 𝐐(ℎ|𝑣) ∙ ln(𝐏𝐓(𝑣, ℎ)/𝐐(ℎ|𝑣))

ln 𝐩𝐓(𝑣) ≥ ∑ℎ∈Iℋ 𝐐(ℎ|𝑣) ∙ ln(𝐏𝐓(𝑣, ℎ)/𝐐(ℎ|𝑣)) (5.66)
The inequality (5.66) can be interpreted in this way: its right–hand side is a lower bound of its
left–hand side, which means that we have found a lower bound for ln 𝐩𝐓(𝑣). Moreover, the
difference between the left–hand side and the right–hand side of (5.66) is exactly the KL–
divergence KL(𝐐(ℎ|𝑣) || 𝐏𝐓(ℎ|𝑣)) [27]:

KL(𝐐(ℎ|𝑣) || 𝐏𝐓(ℎ|𝑣)) = ln 𝐩𝐓(𝑣) − ∑ℎ∈Iℋ 𝐐(ℎ|𝑣) ∙ ln(𝐏𝐓(𝑣, ℎ)/𝐐(ℎ|𝑣)) ≥ 0 (5.67)
Hence, by choosing 𝜆∗ according to (5.64), we obtain the tightest lower bound for ln 𝐩𝐓(𝑣) [27]:
KL(𝐐(ℎ|𝑣; 𝜆∗) || 𝐏𝐓(ℎ|𝑣)) = ln 𝐩𝐓(𝑣) − ∑ℎ∈Iℋ 𝐐(ℎ|𝑣; 𝜆∗) ∙ ln(𝐏𝐓(𝑣, ℎ)/𝐐(ℎ|𝑣; 𝜆∗)) (5.68)

ln 𝐩𝐓(𝑣) ≥ ∑ℎ∈Iℋ 𝐐(ℎ|𝑣; 𝜆∗) ∙ ln(𝐏𝐓(𝑣, ℎ)/𝐐(ℎ|𝑣; 𝜆∗))
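The chain (5.66)–(5.67) can be verified numerically for a small hidden layer. A minimal sketch, assuming an arbitrary unnormalized joint over three hidden units and an arbitrary factorized approximation Q (all numbers are illustrative, generated from a fixed seed):

```python
import itertools
import math
import random

random.seed(1)
nh = 3  # number of hidden units; the visible vector v is held fixed (clamped)
hs = list(itertools.product([0, 1], repeat=nh))

# an arbitrary unnormalized joint P_T(v, h) over the hidden configurations for this v
joint = {h: math.exp(random.uniform(-2, 2)) for h in hs}
p_v = sum(joint.values())                # evidence: p_T(v) = sum_h P_T(v, h)
post = {h: joint[h] / p_v for h in hs}   # true conditional P_T(h | v)

# an arbitrary factorized approximation Q(h | v) with means mu
mu = [random.uniform(0.05, 0.95) for _ in range(nh)]
def Q(h):
    return math.prod(m if b else 1 - m for m, b in zip(mu, h))

# right-hand side of (5.66), and the KL-divergence appearing in (5.67)
elbo = sum(Q(h) * math.log(joint[h] / Q(h)) for h in hs)
kl = sum(Q(h) * math.log(Q(h) / post[h]) for h in hs)

assert elbo <= math.log(p_v) + 1e-12          # the Jensen lower bound (5.66)
assert kl >= 0.0
assert abs(math.log(p_v) - elbo - kl) < 1e-9  # the gap is exactly the KL (5.67)
```

The last assertion is the content of (5.67): the slack in the lower bound equals KL(𝐐(ℎ|𝑣) || 𝐏𝐓(ℎ|𝑣)), so minimizing the KL–divergence tightens the bound.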
Furthermore, Theorem 3.1 taught us that the KL–divergence of two probability distributions 𝐐
and 𝐏 is related to the variational free energy of 𝐐 and to the energy functional 𝐹[𝐏, 𝐐]:

KL(𝐐(ℎ|𝑣) || 𝐏𝐓(ℎ|𝑣)) = −𝐹[𝐏(ℎ|𝑣), 𝐐(ℎ|𝑣)] + ln 𝑍(𝐏𝐓(ℎ|𝑣)) (5.69)
KL(𝐐(ℎ|𝑣) || 𝐏𝐓(ℎ|𝑣)) = 𝐹[𝐐(ℎ|𝑣)] + ln 𝑍(𝐏𝐓(ℎ|𝑣))

where: 𝐹[𝐏, 𝐐] is the energy functional of 𝐏(ℎ|𝑣) and 𝐐(ℎ|𝑣); 𝐹[𝐐] is the variational free energy
of 𝐐(ℎ|𝑣); and 𝑍(𝐏𝐓(ℎ|𝑣)) is the partition function of the conditional of the true probability
distribution 𝐏𝐓(ℎ|𝑣).
Using (5.69), the KL–divergence employed by (5.64) can be written as:
KL(ℚ(ℎ|𝑣; 𝜆)||𝐏𝐓(ℎ|𝑣)) = 𝐹[ℚ(ℎ|𝑣; 𝜆)] + ln𝑍(𝐏𝐓(ℎ|𝑣)) (5.70)
Using (5.70) and the fact that the true probability distribution 𝐏𝐓 doesn’t depend on the
variational parameter 𝜆, the optimization problem (5.64) can be reformulated as:
𝐐 = ℚ(ℎ|𝑣; 𝜆∗) where: 𝜆∗ = argmin𝜆 {𝐹[ℚ(ℎ|𝑣; 𝜆)] + ln 𝑍(𝐏𝐓(ℎ|𝑣))} = argmin𝜆 {𝐹[ℚ(ℎ|𝑣; 𝜆)]} (5.71)
The optimization problem (5.71) shows the connection between variational Boltzmann machine
learning and variational free energies and, at the same time, suggests a path to follow in a
learning algorithm. When the mean field free energy 𝐹𝑀𝐹[𝐐] plays the role of the free energy
𝐹[𝐐] in (5.70), the learning algorithm is called naïve mean field learning. When the Bethe–Gibbs
free energy 𝐺𝛽[𝐐] plays the role of the free energy 𝐹[𝐐] in (5.70), the learning algorithm is
called belief optimization learning.
Variational approaches like the mean field approximation and the Bethe approximation can be
used in Boltzmann machine learning only in the positive phase. These variational
approximations and, generally, any variational approach cannot be used in the negative phase
because the minus sign in the Boltzmann machine learning rules would cause variational
learning to change the parameters so as to maximize the divergence between the
approximating and true distributions instead of minimizing it [39]. Therefore, the data–
independent expectations should still be estimated by using a sampling algorithm like
Algorithm 5.2.
5.5.2 Learning by naïve mean field approximation
In the naïve mean field approximation, we try to find a factorized distribution 𝐐(ℎ|𝑣) that best
describes the true posterior distribution 𝐏𝐓(ℎ|𝑣). The true posterior distribution 𝐏𝐓(ℎ|𝑣) is
replaced by an approximate posterior 𝐐(ℎ|𝑣) and the parameters of the network are updated to
follow the gradient of the KL–divergence between 𝐐(ℎ|𝑣) and 𝐏𝐓(ℎ|𝑣).
The particular distribution we choose for 𝐐(ℎ|𝑣; μ) is the most general factorized distribution for
binary variables, which has the form:
𝐐𝐌𝐅(ℎ|𝑣; μ) = ∏𝑖∈ℋ 𝜇𝑖^ℎ𝑖 ∙ (1 − 𝜇𝑖)^(1−ℎ𝑖) (5.72)

where μ = {𝜇𝑖}𝑖∈ℋ are the variational parameters and the product is taken over the hidden units.
In order to form the KL–divergence between the fully factorized 𝐐𝐌𝐅 distribution and the 𝐏𝐓
distribution given by the equation (5.61), we use the fact that, under the distribution 𝐐𝐌𝐅, ℎ𝑖 and
ℎ𝑗 are independent random variables with mean values 𝜇𝑖 and 𝜇𝑗, respectively. Thus, we obtain:

KL(𝐐𝐌𝐅(ℎ|𝑣; μ) || 𝐏𝐓(ℎ|𝑣)) = ∑𝑖∈ℋ [𝜇𝑖 ∙ ln 𝜇𝑖 + (1 − 𝜇𝑖) ∙ ln(1 − 𝜇𝑖)] − ∑𝑖∈ℋ 𝜇𝑖 ∙ ∑𝑗∈ℋ,𝑗<𝑖 𝜇𝑗 ∙ 𝑤𝑗𝑖 + ∑𝑖∈ℋ 𝛉𝐢 ∙ 𝜇𝑖 + ln 𝑍ℋ (5.73)
In order to derive the learning rule for the mean field learning algorithm, we employ the same
approach we used for the generic learning algorithm, that is, we minimize the KL–divergence
KL(𝐐𝐌𝐅(ℎ|𝑣) || 𝐏𝐓(ℎ|𝑣)). Concretely, we derive the mean field fixed–point equations by taking
the gradient of the KL–divergence given by the equation (5.73) with respect to 𝜇𝑖 for all 𝑖 ∈ ℋ.
We note that 𝑍ℋ doesn’t depend on the variational parameters. Thus, we obtain:

∂KL(𝐐𝐌𝐅(ℎ|𝑣; μ) || 𝐏𝐓(ℎ|𝑣)) / ∂𝜇𝑖 = − ∑𝑗∈ne(𝑖) 𝜇𝑗 ∙ 𝑤𝑗𝑖 + 𝛉𝐢 + ln(𝜇𝑖 / (1 − 𝜇𝑖)) (5.74)
where ne(𝑖) denotes the Markov blanket of unit 𝑖.
If we equate (5.74) to zero then we obtain the “mean field fixed–point equations”:
𝜇𝑖 = sigm(∑𝑗∈ne(𝑖) 𝜇𝑗 ∙ 𝑤𝑗𝑖 − 𝛉𝐢) for all 𝑖 ∈ ℋ (5.75)
The equations (5.75) are solved iteratively for a fixed–point solution. Note that each variational
parameter 𝜇𝑖 updates its value based on a sum across the variational parameters 𝜇𝑗 within its
Markov blanket.
In Section 3.4 we learned how to solve the naïve mean field approximation problem by using the
type of optimization “maximize the energy functional”. In this section we have learned how to
solve the same problem by using the type of optimization “minimize the KL–divergence”. As we
mentioned in Section 3.2, for a given problem, these approaches are equivalent. Therefore, the
equations (3.67) and (5.75) should lead to essentially the same solutions, up to a small
numerical error. Moreover, the convergence of one set of equations implies the convergence of
the other set as well. Theorem 3.7 guarantees the convergence of the mean field fixed–point
equations (3.67). Consequently, the mean field fixed–point equations (5.75) are also convergent.
When the mean field fixed–point equations (5.75) are run sequentially, i.e., we fix 𝜇−𝑖 and we
minimize over 𝜇𝑖, the KL–divergence is convex in 𝜇𝑖 and the corresponding equation (5.75) finds
the minimum in one step. Thus, this procedure can be interpreted as coordinate descent in
{𝜇𝑖}i∈ℋ and each step is guaranteed to decrease the KL–divergence. One drawback of this
procedure is that it could suffer from slow convergence or entrapment in local minima.
Alternatively, all the {𝜇𝑖}i∈ℋ can be updated in parallel, which does not guarantee a decrease of
the cost function at each iteration, but may converge faster. In practice, one often observes
oscillatory behavior, which can be counteracted by damping the updates.
Finally, one can use any gradient based optimization technique to minimize over all the nodes
{𝜇𝑖}i∈ℋ simultaneously, making sure all {𝜇𝑖}i∈ℋ remain between 0 and 1 [4].
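The parallel, damped scheme just described can be sketched as follows. The network size, random weights, damping factor, and tolerances below are illustrative choices; the update itself is the fixed-point equation (5.75) applied to all means at once:

```python
import math
import random

random.seed(2)
n = 5
w = [[0.0] * n for _ in range(n)]  # symmetric weights, zero diagonal
for i in range(n):
    for j in range(i + 1, n):
        w[i][j] = w[j][i] = random.uniform(-1, 1)
theta = [random.uniform(-1, 1) for _ in range(n)]

def sigm(x):
    return 1.0 / (1.0 + math.exp(-x))

def mean_field(mu, alpha=0.5, tol=1e-10, max_iter=10000):
    """Damped parallel updates of the fixed-point equations (5.75)."""
    for _ in range(max_iter):
        new = [sigm(sum(w[i][j] * mu[j] for j in range(n)) - theta[i])
               for i in range(n)]
        # damping counteracts the oscillations that pure parallel updates can produce
        new = [alpha * a + (1 - alpha) * b for a, b in zip(new, mu)]
        if max(abs(a - b) for a, b in zip(new, mu)) < tol:
            return new
        mu = new
    return mu

mu = mean_field([random.random() for _ in range(n)])
# at a fixed point every mu_i reproduces itself under (5.75)
for i in range(n):
    assert abs(mu[i] - sigm(sum(w[i][j] * mu[j] for j in range(n)) - theta[i])) < 1e-6
```

With `alpha = 1.0` the sketch reduces to the undamped parallel scheme; lowering `alpha` trades speed for stability.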
Peterson and Anderson compared the mean field approximation to Gibbs sampling on a set of
test cases and found that it ran 10–30 times faster, while yielding a roughly equivalent level of
accuracy [16,27].
For large, densely connected, weakly interacting systems the mean field approximation tends to
be accurate: the cumulative effect of all nodes behaves as a “rigid” (mean) field, which acts as an
additional bias term, resulting in an approximately factorized distribution. There are cases,
however, in which the mean field approximation is known to be less accurate. The factorized
mean field distribution is clearly unimodal, and can therefore never represent multimodal
posterior distributions accurately. In particular, the KL–divergence KL(𝐐𝐌𝐅||𝐏𝐓) penalizes states
with small posterior probability but non–vanishing probability under the mean field distribution
much harder than the other way around. The result of this asymmetry in the KL–divergence is
that the mean field distribution will choose to represent only one mode, ignoring the other ones.
A typical situation where we expect multiple modes in the posterior is when there is not a lot of
evidence clamped on the observation nodes [4]. Consider for instance the situation when the
thresholds are given by:
𝜃𝑖 = −(1/2) ∙ ∑𝑗∈𝒱,𝑗≠𝑖 𝑤𝑖𝑗 (5.76)
in which case there is symmetry in the system – switching all the nodes (ℎ𝑖 → 1 − ℎ𝑖) leaves all
the probabilities invariant. This implies that there are at least two modes. In general, we expect
many more modes, and the mean field distribution can only capture one. Moreover, when the
interactions are strong, we expect these modes to be concentrated on one state, with little
fluctuation around them. The marginals predicted by mean field would therefore be close to
either 1 or 0 (they are polarized), while the true marginal posterior probabilities are 1/2 due to the
symmetry [4]. One way to overcome some of the difficulties mentioned above is to use more
structured variational distributions 𝐐 and again minimize the KL–divergence [4].
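The mode-seeking behavior described above can be demonstrated on a two-unit example. A minimal sketch, assuming a bimodal target distribution concentrated on (0,0) and (1,1) (the particular probabilities and grid resolution are illustrative): minimizing KL(𝐐𝐌𝐅||𝐏𝐓) over a factorized Q yields polarized means locked onto a single mode rather than the symmetric (0.5, 0.5) that would try to cover both modes:

```python
import math

# a symmetric bimodal target: mass concentrated on the two modes (0,0) and (1,1)
P = {(0, 0): 0.48, (1, 1): 0.48, (0, 1): 0.02, (1, 0): 0.02}

def Q(h, mu):
    """Factorized distribution over two binary units with means mu = (mu1, mu2)."""
    return math.prod(m if b else 1 - m for m, b in zip(mu, h))

def kl(mu):
    """KL(Q || P) for the factorized approximation with means mu."""
    return sum(Q(h, mu) * math.log(Q(h, mu) / P[h]) for h in P)

# crude grid search over the two variational parameters
grid = [i / 200 for i in range(1, 200)]
best = min(((m1, m2) for m1 in grid for m2 in grid), key=kl)

# the optimum is polarized onto a single mode rather than hedging at (0.5, 0.5)
assert kl(best) < kl((0.5, 0.5))
assert min(best) > 0.9 or max(best) < 0.1
```

By the h → 1 − h symmetry of this target, the two polarized optima have identical KL values; the search simply returns one of them, which is exactly the single-mode behavior discussed in the text.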
We end this section with a high–level pseudocode of the mean field learning algorithm.
During each clamping (positive) phase an algorithm similar to Algorithm 3.1, which performs
minimization instead of maximization, is executed to solve the fixed–point equations (5.75), and
the solution obtained for the variational parameters {𝜇𝑖}i∈ℋ is used to approximate the data–
dependent statistics. During each free–running (negative) phase an algorithm similar to
Algorithm 5.2 is executed and the data–independent statistics 𝑝𝑖𝑗 and 𝑝𝑖 are estimated. Then
the parameters W of the network are updated according to the following rules:

Δ𝑤𝑖𝑗 = −𝛿 ∙ (𝜇𝑖 ∙ 𝜇𝑗 − 𝑝𝑖𝑗) (5.77)
Δ𝜃𝑖 = 𝛿 ∙ (𝜇𝑖 − 𝑝𝑖) (5.78)

where 𝛿 is the learning rate.
Algorithm 5.3: Mean Field Boltzmann Machine Learning
Given: n x n weight matrix W ; n x 1 threshold vector Θ
       a training set of 𝑝𝑎 data vectors: {𝑣(𝑘)}1≤𝑘≤𝑝𝑎
       the number of learning cycles: 𝑒𝑝
       the number of mean field steps: 𝑚𝑓
       the number of Markov chains: 𝑀
begin
  Step 1: initialize W(0) and 𝑀 fantasy particles: {𝜎(1)(0), …, 𝜎(𝑀)(0)}
  For each learning cycle:
  Step 2: for e = 1 to 𝑒𝑝 do
    For each one of the patterns to be learned:
    Step 3: for k = 1 to 𝑝𝑎 do
      Clamping phase:
      Step 4: present and clamp the pattern 𝑣(𝑘)
      START ALGORITHM 3.1
      Step 5: randomly initialize μ = {𝜇𝑖}𝑖∈ℋ and run 𝑚𝑓 updates
              until convergence:
              𝜇𝑖 = sigm(∑𝑗∈ne(𝑖) 𝜇𝑗 ∙ 𝑤𝑖𝑗 − 𝛉𝐢) for all 𝑖 ∈ ℋ
      END ALGORITHM 3.1
      Step 6: set: μ(𝑘) = {𝜇𝑖(𝑘)}𝑖∈ℋ
      Free–running phase:
      Step 7: present the pattern 𝑣(𝑘) but do not clamp it
      START ALGORITHM 5.2
      […]
      Step 8: collect the statistics 𝑝𝑖𝑗 and 𝑝𝑖
      END ALGORITHM 5.2
      Update the weights and thresholds for any pair of connected
      units 𝑖 ≠ 𝑗 such that at least one unit has been updated:
      Step 9: Δ𝑤𝑖𝑗 = −𝛿 ∙ (𝜇𝑖 ∙ 𝜇𝑗 − 𝑝𝑖𝑗) for 𝑖 ≠ 𝑗
              𝑤𝑖𝑗 ← 𝑤𝑖𝑗 + Δ𝑤𝑖𝑗
              Δ𝜃𝑖 = 𝛿 ∙ (𝜇𝑖 − 𝑝𝑖)
              𝜃𝑖 ← 𝜃𝑖 + Δ𝜃𝑖
    end for //k
  end for //e
end
return W and Θ
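A compact runnable sketch of the algorithm on a toy network follows. Two deliberate simplifications, not part of Algorithm 5.3, are labeled in the comments: the sampling-based negative phase (Algorithm 5.2) is replaced by exact enumeration of the model statistics, which is feasible only for tiny networks, and the updates use the classical positive-minus-negative sign convention of Ackley, Hinton, and Sejnowski, which performs likelihood ascent under the energy convention E(σ) = −∑𝑖<𝑗 𝑤𝑖𝑗 σ𝑖 σ𝑗 + ∑𝑖 𝜃𝑖 σ𝑖 assumed here. All sizes and rates are illustrative:

```python
import itertools
import math
import random

random.seed(3)
V, H = [0, 1], [2, 3]              # visible and hidden unit indices
n = len(V) + len(H)
patterns = [(1, 1), (0, 0)]        # toy training set over the visible units
delta, epochs, mf_steps = 0.2, 200, 30

w = [[0.0] * n for _ in range(n)]  # symmetric weights, zero diagonal
theta = [0.0] * n
states = list(itertools.product([0, 1], repeat=n))

def sigm(x):
    return 1.0 / (1.0 + math.exp(-x))

def energy(s):
    return (-sum(w[i][j] * s[i] * s[j] for i in range(n) for j in range(i + 1, n))
            + sum(theta[i] * s[i] for i in range(n)))

def model_stats():
    """Exact negative-phase statistics by enumeration (stand-in for Algorithm 5.2)."""
    Z = sum(math.exp(-energy(s)) for s in states)
    P = [math.exp(-energy(s)) / Z for s in states]
    p1 = [sum(p * s[i] for p, s in zip(P, states)) for i in range(n)]
    p2 = [[sum(p * s[i] * s[j] for p, s in zip(P, states)) for j in range(n)]
          for i in range(n)]
    return p1, p2

def marginal_v(v):
    """p(v): total probability of all states whose visible part equals v."""
    Z = sum(math.exp(-energy(s)) for s in states)
    return sum(math.exp(-energy(s)) for s in states if s[:len(V)] == v) / Z

before = [marginal_v(v) for v in patterns]
for _ in range(epochs):
    for v in patterns:
        # positive phase: clamp v, solve the fixed-point equations (5.75) for the hidden means
        mu = [0.0] * n
        for i, b in zip(V, v):
            mu[i] = float(b)
        for i in H:
            mu[i] = random.random()
        for _ in range(mf_steps):
            for i in H:
                mu[i] = sigm(sum(w[i][j] * mu[j] for j in range(n) if j != i) - theta[i])
        # negative phase: exact model statistics instead of sampling
        p1, p2 = model_stats()
        # positive-minus-negative updates (likelihood ascent convention)
        for i in range(n):
            theta[i] -= delta * (mu[i] - p1[i])
            for j in range(i + 1, n):
                upd = delta * (mu[i] * mu[j] - p2[i][j])
                w[i][j] += upd
                w[j][i] += upd
after = [marginal_v(v) for v in patterns]
assert sum(after) > sum(before)  # training raised the total probability of the data
```

The sketch preserves the two-phase structure of Algorithm 5.3 while remaining small enough to verify end to end; on a realistic network the enumeration in `model_stats` and `marginal_v` must of course be replaced by the sampling procedures of Section 5.3.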
5.6 Unlearning and relearning in Boltzmann Machines
The concept of “unlearning” in a connectionist network is closely related to the concept
of “reverse learning” in neuroscience. Crick and Mitchison proposed a model of reverse learning
that compares the process of dream sleeping or the REM phase of sleep to an offline computer.
According to the model, we dream in order to forget and this involves a process of “reverse
learning” or “unlearning” [76].
A simulation of reverse learning was performed by Hopfield, Feinstein, and Palmer [77] who
independently had been studying ways to improve the associative storage capacity of simple
networks of binary processors. In their algorithm an input is presented to the network as an
initial condition and the system evolves by falling into a nearby local energy minimum. However,
not all local energy minima represent stored information. In creating the desired minima, they
accidentally create other spurious minima, and to eliminate these they use "unlearning": The
learning procedure is applied with reverse sign to the states found after starting from random
initial conditions. Following this procedure, the performance of the system in accessing stored
states was found to be improved [43].
The reverse learning model proposed by Crick and Mitchison and the unlearning algorithm
proposed by Hopfield et al. have an interesting relationship with the learning algorithm
proposed by Hinton. The two phases of Hinton’s learning algorithm resemble the learning and
unlearning procedures. In the positive phase, Hebbian learning with a positive coefficient occurs,
during which information in the environment is captured by the weights. During the negative
phase the system randomly samples states according to their Boltzmann distribution and
Hebbian learning occurs with a negative coefficient. However, these two phases need not be
implemented in the manner suggested by Crick and Mitchison. For instance, during negative
phase the average co–occurrences could be computed without making any changes to the
weights. These averages could then be used as a baseline for making changes during positive
phase; that is, the co–occurrences during negative phase could be computed and the baseline
subtracted before each permanent weight change. Hence, an alternative but equivalent
proposal for the function of dream sleep is to recalibrate the baseline for plasticity – the break–
even point which determines whether a synaptic weight is incremented or decremented. This
would be safer than making permanent weight decrements to synaptic weights during sleep and
solves the problem of deciding how much "unlearning" to do [43].
Hinton’s learning algorithm refines Crick’s and Mitchison's interpretation of why two phases are
needed. He considered a hidden unit deep within the network and wanted to know how its
connections with other units should be changed to best capture regularity present in the
environment. He started by observing that, if the unit does not receive direct input from the
environment, the hidden unit has no way to determine whether the information it receives from
neighboring units is ultimately caused by structure in the environment or is entirely a result of
the other weights. Hinton compared this scenario with a "folie à deux" where two parts of the
network each construct a model of the other and ignore the external environment [43]. He
realized that the contributions of internal and external sources can be separated by comparing
the co–occurrences in the positive phase with similar information collected in the absence of
environmental input; in this way the negative phase acts as a control condition. Moreover,
because of the special properties of equilibrium, it is possible to subtract off this purely internal
contribution and use the difference to update the weights. His conclusion was that the role of
two phases is to make the system maximally responsive to regularities present in the
environment and to prevent the system from using its capacity to model internally generated
regularities [43].
A network like the Boltzmann machine can experience some form of damage. Hinton studied
the behavior of the network, specifically the distributed representations constructed by the
learning rule, under such circumstances. He observed that the network uses distributed
representations among the intermediate units when it learns the associations. His interpretation
of this fact was that, because many of the weights are involved in encoding several different
associations and each association is encoded in many weights, if a weight is changed because
of some form of damage, it will affect several different energy minima and all of them will require
the same change in the weight to restore them to their previous depths. So, in relearning any of
the associations, there should be a positive transfer effect which tends to restore the others.
Hinton observed that this effect is actually rather weak and easily masked, so it can only be
seen clearly if the network is retrained on most of the original associations. His conclusion was
that the associations constructed by the learning rule are resistant to minor damage and exhibit
rapid relearning after major damage. Moreover, the relearning process can bring back
associations that are not practiced during the relearning and are only randomly related to the
associations that are practiced [43].
Chapter 6. Conclusions
6.1 Summary of what has been done
This thesis addresses several aspects of the theory of Boltzmann machines. Our principal goal
has been to provide, from a rigorous mathematical perspective, a unified framework for two
well–known classes of learning algorithms in asynchronous Boltzmann machines: based on
Monte Carlo methods and based on variational approximations of the free energy.
The second chapter focused on the foundation of knowledge necessary to understand the
subsequent chapters. We chose to introduce the Boltzmann–Gibbs distribution from both a
physicist’s and a computer scientist’s perspective to allow the concept of energy, which also
originates in physics, to settle on solid ground. We introduced the pairwise Markov random fields and
explained their relationship with the Boltzmann–Gibbs distribution. We also introduced the
Gibbs free energy as a convenient replacement for the Boltzmann–Gibbs distribution when the
goal is to perform approximate inference in a Markov random field. Then, we proceeded to
introduce the ancestors of Boltzmann machine: the connectionist networks and the Hopfield
networks. While we gave only a high–level overview of the connectionist networks, we gave a
quite detailed presentation of the Hopfield network. We justified the attention granted to the
Hopfield network by the fact that it is not just an ancestor of the Boltzmann machine, but it is a
Boltzmann machine itself, as we explained in Chapter 5.
The third chapter built the infrastructure of knowledge necessary to understand the subsequent
chapters. The topic of interest in this chapter was energy, and the motivation behind it is the
relationship between the equilibrium distribution and the free energy in a Markov random field.
Estimating the distribution of a Markov random field is an expensive process and there is no
foolproof method to determine whether the equilibrium has been reached. Some of the
difficulties encountered when operating with distributions do not exist anymore when operating
with energies. Furthermore, we introduced a number of Gibbs free energies: the mean field free
energy and the Bethe–Gibbs free energy, which are variational free energies, and the Bethe
free energy. These energies have been purposely defined and analyzed as potential candidates
for the true free energy of a Markov random field. Then we presented an approximate inference
algorithm – belief optimization – that is based on the Bethe–Gibbs approximation of the free
energy and that could potentially be used in Boltzmann machine learning.
The goal of the fourth chapter was to present in detail every important aspect of the Boltzmann
machine except learning. From various sources we synthesized a rigorous definition of the
Boltzmann machine model together with all the associated concepts. We introduced the concept
of true energy and explained its intrinsic relationship with various forms of the true probability
distribution. We described the algorithmic aspects of the dynamics of the asynchronous
Boltzmann machine. In Chapter 5 we returned to this topic and presented it from a different
perspective. To allow the reader to intuitively understand the “inner life” of the Boltzmann
machine, we presented the biological interpretation of the model as was given by its creator
Geoffrey Hinton.
The fifth chapter is the core of this thesis. It was dedicated entirely to learning algorithms in an
asynchronous Boltzmann machine. We defined formally the learning process following the same
rigorous approach as in Chapter 4. We justified, from different angles, the necessity of two
phases in a learning algorithm. Following Hinton’s terminology, we called them the positive and
the negative phase, respectively. Currently there are two equivalent ways to approach learning:
maximizing the likelihood of the parameters or minimizing the KL–divergence of Gibbs measures.
We chose to use the KL–divergence approach for all the learning algorithms we presented.
Then we introduced the class of learning algorithms based on approximate maximum likelihood.
This class contains the generic learning algorithm due to Hinton and Sejnowski. We provided a
very detailed analysis of the generic learning algorithm including the missing piece from the
original algorithm, which was identified by Jones. The class of algorithms based on approximate
maximum likelihood was completed with the introduction of three sampling algorithms used to
collect the statistics during both positive and negative phase: Gibbs sampling, stochastic
approximation using persistent Markov chains, and contrastive divergence. We summarized the
main steps of the generic learning algorithm in a high–level pseudocode and discussed the
factors that influenced its complexity. The collection of statistics for the generic learning
algorithm is conditioned on the thermal equilibrium. To understand the dynamics of the
Boltzmann machine from a probabilistic point of view, we provided a deep analysis of the
equilibrium distribution as a function of the pseudo–temperature. Furthermore, we introduced
the class of learning algorithms based on variational approaches discussed in Chapter 3 and we
explained the connection between the approximations of the free energies and the learning
process. We provided a detailed analysis of the mean field learning and the connections it has
with algorithms introduced previously: mean field approximation and stochastic approximation.
We summarized the main steps of the mean field learning algorithm in a high–level pseudocode
and discussed the factors that influenced its complexity. Finally, we introduced the concepts of
unlearning and relearning and gave the intuition behind them as was explained by their creator
Geoffrey Hinton.
6.2 Future directions
There are a few open questions or directions to explore inspired by ideas presented in this
thesis:
1. an algorithm to detect when an asynchronous symmetric Boltzmann machine reached its
equilibrium distribution;
2. an explicit formula for the equilibrium distribution and a learning algorithm for a
Boltzmann machine with asymmetric weights;
3. is it possible to extend the learning algorithm to higher order Markov processes?
4. an improvement to the Boltzmann machine model itself that would lead to better and
faster learning algorithms;
5. a breakthrough in Boltzmann machine learning?
Solutions to some of these questions would represent a considerable improvement on the
current state of knowledge.
One idea to improve the model is to find link(s) between the energy of a thermodynamic system
with respect to pressure and volume (equation (2.17)) and some aspect(s) of the cognitive
features and/or processes of human brain that, importantly, can be represented in an artificial
neural network like the Boltzmann machine. If these connections existed and had been reflected
in the Boltzmann machine model, they could be “consumed” directly by new learning algorithms
or indirectly by new optimization algorithms that perform approximate inference in the underlying
Markov random field.
References
1. Sussmann, H. J. (1988, December). Learning algorithms for Boltzmann machines. In Proceedings of
the 27th IEEE Conference on Decision and Control (pp. 786-791). IEEE.
2. Hebb, D. O. (1949). The organization of behavior: A neuropsychological theory. New York, 4.
3. Sussmann, H. J. (1989). The mathematical theory of learning algorithms for Boltzmann machines. In
Proceedings of the International Joint Conference on Neural Networks (IJCNN) (pp. 431-437). IEEE.
4. Welling, M., & Teh, Y. W. (2003). Approximate inference in Boltzmann machines. Artificial
Intelligence, 143(1), 19-50.
5. Salakhutdinov, R. (2008). Learning and evaluating Boltzmann machines (p. 31). Technical Report
UTML TR 2008-002, Department of Computer Science, University of Toronto.
6. Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural
computation, 18(7), 1527-1554.
7. Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational
abilities. Proceedings of the national academy of sciences, 79(8), 2554-2558.
8. Hinton, G. E., & Sejnowski, T. J. (1983, June). Optimal perceptual inference. In Proceedings of the
IEEE conference on Computer Vision and Pattern Recognition (pp. 448-453). IEEE New York.
9. Fahlman, S. E., Hinton, G. E., & Sejnowski, T. J. (1983). Massively parallel architectures for AI:
NETL, THISTLE, and BOLTZMANN machines. In Proceedings of AAAI-83 (pp. 109-113).
10. Hinton, G. E., Sejnowski, T. J., & Ackley, D. H. (1984). Boltzmann machines: Constraint satisfaction
networks that learn. Pittsburgh, PA: Carnegie-Mellon University, Department of Computer Science.
11. Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines.
Cognitive science, 9(1), 147-169.
12. Kirkpatrick, S. (1984). Optimization by simulated annealing: Quantitative studies. Journal of statistical
physics, 34(5-6), 975-986.
13. Salakhutdinov, R., & Hinton, G. (2012). An efficient learning procedure for deep Boltzmann machines.
Neural computation, 24(8), 1967-2006.
14. Neal, R. M. (1992). Connectionist learning of belief networks. Artificial intelligence, 56(1), 71-113.
15. Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony theory
(No. CU-CS-321-86). University of Colorado at Boulder, Department of Computer Science.
16. Peterson, C. (1987). A mean field theory learning algorithm for neural networks. Complex systems, 1,
995-1019.
17. Peterson, C., & Hartman, E. (1989). Explorations of the mean field theory learning algorithm. Neural
Networks, 2(6), 475-494.
18. Galland, C. C., & Hinton, G. E. (1990). Discovering high order features with mean field modules. In
Advances in neural information processing systems (pp. 509-515).
19. Galland, C. (1992). Learning in deterministic Boltzmann machine networks.
20. Kappen, H. J., & Rodriguez, F. B. (1998). Boltzmann machine learning using mean field theory and
linear response correction. Advances in neural information processing systems, 280-286.
21. Kappen, H. J., & Rodríguez, F. B. (1997). Mean field approach to learning in Boltzmann machines.
Pattern Recognition Letters, 18(11), 1317-1322.
22. Tanaka, T. (1998). Mean-field theory of Boltzmann machine learning. Physical Review E, 58(2), 2302.
23. Tanaka, T. (1999). A theory of mean field approximation. Advances in Neural Information Processing
Systems, 351-360.
24. Zemel, R. S. (1993). A minimum description length framework for unsupervised learning (Doctoral
dissertation, University of Toronto).
25. Hinton, G. E., & Zemel, R. S. (1994). Autoencoders, minimum description length, and Helmholtz free
energy. Advances in neural information processing systems, 3-3.
26. Neal, R. M., & Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental, sparse, and
other variants. In Learning in graphical models (pp. 355-368). Springer Netherlands.
27. Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational
methods for graphical models. Machine learning, 37(2), 183-233.
28. Yedidia, J. S., Freeman, W. T., & Weiss, Y. (2003). Understanding belief propagation and its
generalizations. Exploring artificial intelligence in the new millennium, 8, 236-239.
29. Yedidia, J. S., Freeman, W. T., & Weiss, Y. (2005). Constructing free-energy approximations and
generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51(7), 2282-
2312.
30. Wainwright, M. J., Jaakkola, T. S., & Willsky, A. S. (2005). A new class of upper bounds on the log
partition function. IEEE Transactions on Information Theory, 51(7), 2313-2335.
31. Wainwright, M. J., & Jordan, M. I. (2006). Log-determinant relaxation for approximate inference in
discrete Markov random fields. IEEE Transactions on Signal Processing, 54(6), 2099-2109.
32. Globerson, A., & Jaakkola, T. S. (2007). Approximate inference using conditional entropy
decompositions. In AISTATS (pp. 130-138).
33. Kabashima, Y., & Saad, D. (1998). Belief propagation vs. TAP for decoding corrupted messages.
EPL (Europhysics Letters), 44(5), 668.
34. Opper, M., & Winther, O. (1996). Mean field approach to Bayes learning in feed-forward neural
networks. Physical review letters, 76(11), 1964.
35. Yedidia, J. S., Freeman, W. T., & Weiss, Y. (2000, December). Generalized belief propagation. In
NIPS (Vol. 13, pp. 689-695).
36. Yedidia, J. S., Freeman, W. T., & Weiss, Y. (2001). Bethe free energy, Kikuchi approximations, and
belief propagation algorithms. Advances in neural information processing systems, 13.
37. Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural
computation, 14(8), 1771-1800.
38. Salakhutdinov, R., & Hinton, G. E. (2007, March). Learning a Nonlinear Embedding by Preserving
Class Neighbourhood Structure. In AISTATS (pp. 412-419).
39. Salakhutdinov, R., & Hinton, G. E. (2009, April). Deep Boltzmann Machines. In AISTATS (Vol. 1, p.
3).
40. Little, W. A. (1974). The existence of persistent states in the brain. In From High-Temperature
Superconductivity to Microminiature Refrigeration (pp. 145-164). Springer US.
41. Little, W. A., & Shaw, G. L. (1978). Analytic study of the memory storage capacity of a neural
network. Mathematical biosciences, 39(3-4), 281-290.
42. Viveros, U. X. I. (2001). The Synchronous Boltzmann Machine (Doctoral dissertation, University of
London).
43. Hinton, G. E., & Sejnowski, T. J. (1986). Learning and relearning in Boltzmann machines. Parallel
distributed processing: Explorations in the microstructure of cognition, 1, 282-317.
44. Boltzmann, L. (2012). Theoretical physics and philosophical problems: Selected writings (Vol. 5).
Springer Science & Business Media.
45. Gibbs, J. W. (1928). The collected works of J. Willard Gibbs (Vol. 1). H. A. Bumstead, & W.
R. Longley (Eds.). Longmans, Green and Company.
46. Dobrushin, R. L. (1968). Description of a random field by means of conditional probabilities, with
applications. Teor. Veroyatnost. i Primenen, 13.
47. Wainwright, M. J., & Jordan, M. I. (2008). Graphical models, exponential families, and variational
inference. Foundations and Trends® in Machine Learning, 1(1-2), 1-305.
48. Spitzer, F. (1971). Markov random fields and Gibbs ensembles. The American Mathematical Monthly,
78(2), 142-154.
49. Yedidia, J. (2001). An idiosyncratic journey beyond mean field theory. Advanced mean field methods:
Theory and practice, 21-36.
50. Gibbs, J. W. (1873). A method of geometrical representation of the thermodynamic properties of
substances by means of surfaces. Connecticut Academy of Arts and Sciences.
51. Hinton, G. E. (1989). Connectionist learning procedures. Artificial intelligence, 40(1), 185-234.
52. Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like
those of two-state neurons. Proceedings of the national academy of sciences, 81(10), 3088-3092.
53. Lowel, S., & Singer, W. (1992). Selection of intrinsic horizontal connections in the visual cortex by
correlated neuronal activity. Science, 255(5041), 209.
54. Koller, D., & Friedman, N. (2009). Probabilistic graphical models: principles and techniques. MIT
press.
55. Georges, A., & Yedidia, J. S. (1991). How to expand around mean-field theory using high-
temperature expansions. Journal of Physics A: Mathematical and General, 24(9), 2173.
56. Plefka, T. (1982). Convergence condition of the TAP equation for the infinite-ranged Ising spin glass
model. Journal of Physics A: Mathematical and general, 15(6), 1971.
57. Plefka, T. (2006). Expansion of the Gibbs potential for quantum many-body systems: General
formalism with applications to the spin glass and the weakly nonideal Bose gas. Physical Review E,
73(1), 016129.
58. Shin, J. (2012). Complexity of Bethe Approximation. In AISTATS (pp. 1037-1045).
59. Weiss, Y., & Freeman, W. T. (2001). On the optimality of solutions of the max-product belief-
propagation algorithm in arbitrary graphs. IEEE Transactions on Information Theory, 47(2), 736-744.
60. Bishop, C. M. (2006). Pattern Recognition and Machine Learning (Information Science and Statistics). Springer.
61. Heskes, T. (2002). Stable fixed points of loopy belief propagation are local minima of the Bethe free
energy. In Advances in neural information processing systems (pp. 343-350).
62. Heskes, T. (2004). On the uniqueness of loopy belief propagation fixed points. Neural Computation,
16(11), 2379-2413.
63. Jones, A. (1996). A lacuna in the theory of asynchronous Boltzmann machine learning. Simpósio
Brasileiro de Redes Neurais, 19-27.
64. Hinton, G. E. (1989). Deterministic Boltzmann learning performs steepest descent in weight-space.
Neural computation, 1(1), 143-150.
65. Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian
restoration of images. IEEE Transactions on pattern analysis and machine intelligence, (6), 721-741.
66. Tieleman, T. (2008, July). Training restricted Boltzmann machines using approximations to the
likelihood gradient. In Proceedings of the 25th international conference on Machine learning (pp.
1064-1071). ACM.
67. Robbins, H., & Monro, S. (1951). A stochastic approximation method. The annals of mathematical
statistics, 400-407.
68. Younes, L. (1989). Parametric inference for imperfectly observed Gibbsian fields. Probability theory
and related fields, 82(4), 625-645.
69. Younes, L. (1999). On the convergence of Markovian stochastic algorithms with rapidly decreasing
ergodicity rates. Stochastics: An International Journal of Probability and Stochastic Processes, 65(3-
4), 177-228.
70. Yuille, A. L. (2006). The convergence of contrastive divergences. Department of Statistics, UCLA.
71. Bengio, Y. (2009). Learning deep architectures for AI. Foundations and trends® in Machine Learning,
2(1), 1-127.
72. Carreira-Perpinan, M. A., & Hinton, G. (2005, January). On Contrastive Divergence Learning. In
AISTATS (Vol. 10, pp. 33-40).
73. Gantmacher, F. R., & Brenner, J. L. (2005). Applications of the Theory of Matrices. Courier
Corporation.
74. Gantmacher, F. R. (1959). Matrix theory, vol.1 and 2. New York.
75. Aarts, E., & Korst, J. (1988). Simulated annealing and Boltzmann machines.
76. Crick, F., & Mitchison, G. (1983). The function of dream sleep. Nature, 304(5922), 111-114.
77. Hopfield, J. J., Feinstein, D. I., & Palmer, R. G. (1983). ‘Unlearning’ has a stabilizing effect in
collective memories.
78. Levin, D. A., Peres, Y., & Wilmer, E. L. (2009). Markov chains and mixing times. American
Mathematical Soc.
Appendix A: Mathematical notations
In this appendix we present the main conventions and notations used throughout this paper.
We use |A| to denote the cardinality of a finite set A.
We denote matrices by uppercase bold roman letters, such as 𝐀.
We use a superscript T to denote the transpose of a matrix or vector.
Without restricting the generality, we assume that a set of processing units (neurons) 𝒩 is
indexed by the set of natural numbers {1,2,… , 𝑛} for 𝑛 = |𝒩| ∈ ℕ. We also make the
convention that "0" denotes some object not belonging to 𝒩.
A random variable is generally denoted 𝑋 (typeface italic uppercase x).
A univariate (scalar) random variable is denoted in the same way as a general random variable.
An individual observation of a scalar random variable is denoted by 𝓍 (script italic lowercase x). A set comprising 𝑚 observations of a scalar random variable 𝑋 is denoted by 𝕩 (double–struck lowercase x) and is written as:
𝕩 ≡ (𝑥^{(1)}, 𝑥^{(2)}, … , 𝑥^{(𝑚)}) (A1)
A multivariate random variable is denoted by X (typeface uppercase x):
X ≡ (𝑋1, 𝑋2, … , 𝑋𝑛)^T (A2)
We use the notation X−i to designate all the random variables from X except 𝑋𝑖, i.e.:
X−i = (𝑋1, … , 𝑋𝑖−1, 𝑋𝑖+1, … , 𝑋𝑛) (A3)
We use the symbol ⊥ to represent the conditional independence of random variables.
Example: 𝐴 ⊥ 𝐵 | 𝐶.
We use bold uppercase letters to designate probability distributions. Examples: 𝐏, 𝐐.
We use the accent “bar” to denote a candidate approximation of an unknown probability distribution. Example: the probability distribution 𝐏̄ is an approximation of the probability distribution 𝐏.
We use the accent “tilde” to denote the unnormalized measure of a probability distribution.
Example: 𝐏̃ is the unnormalized measure of the probability distribution 𝐏.
We use the accent “hat” to identify the collection of canonical parameters corresponding to bi–dimensional cliques (edges) in a pairwise Markov network. The collection of all the canonical parameters is represented with the same letter as the previous collection but without the “hat”. Example: the first collection is 𝐖̂; the second collection is 𝐖.
Appendix B: Probability theory and statistics
The definitions and theoretical results presented in this appendix are taken from the books [54]
and [78]. They are notions from probability theory and statistics that have been used in the
previous sections.
In probability and statistics, a Random (Stochastic) Variable, usually written 𝑋, is a variable
whose value is subject to variations due to chance (i.e., randomness, in a mathematical sense);
otherwise its values are numerical outcomes of a random phenomenon or experiment. As
opposed to other mathematical variables, a random variable conceptually does not have a
single, fixed value (even if unknown); rather, it can take on a set of possible different values,
each with an associated probability. Based on the number of values that constitute the random
variable associated to a statistical unit, the random variables are classified into two categories:
univariate and multivariate.
A Univariate Random Variable or Random Scalar is a single variable whose value is unknown,
either because the value has not yet occurred, or because there is imperfect knowledge of its
value. Normally a random scalar is a real number.
A Multivariate Random Variable or Random Vector, usually written X, is a list of mathematical
variables whose value for each of them is unknown, either because the value has not yet
occurred, or because there is imperfect knowledge of its value. The individual variables in a
random vector are grouped together because there may be correlations among them – often
they represent different properties of an individual statistical unit (e.g. a particular person, event,
etc.). Normally each element of a random vector is a real number.
In mathematics, a moment is a specific quantitative measure of the shape of a set of points. The
moments of a random variable (or of its distribution) are expected values of powers or related
functions of the random variable. The first moment, also called mean, is a measure of the
center or location of a random variable or distribution. The second moment of a random variable
is also called variance and its square root is called standard deviation. The variance and
standard deviation are measures of the scale or spread of a random variable or distribution.
𝝈–algebra:
Given a set Ω, a 𝜎–algebra is a collection ℱ of subsets satisfying the following conditions:
Ω ∈ ℱ;
if 𝐴1, 𝐴2, … are elements of ℱ, then ⋃_{𝑖=1}^{∞} 𝐴𝑖 ∈ ℱ;
if 𝐴 ∈ ℱ, then 𝐴^𝐶 = Ω − 𝐴 ∈ ℱ.
Probability space:
A probability space is a three–tuple (Ω, ℱ, 𝑝) in which the three components are:
Sample space: A nonempty set Ω called the sample space, which represents all possible
outcomes.
Event space: A collection ℱ of subsets of Ω, called the event space. The elements of ℱ
are called events.
If Ω is discrete, then ℱ is usually the collection of all subsets of Ω: ℱ = pow(Ω). If Ω is
continuous, then ℱ is usually a 𝜎–algebra on Ω.
Probability function: A function 𝑝 ∶ ℱ ⟶ ℝ that assigns probabilities to the events of ℱ
and satisfies the requirements of a probability measure over Ω as specified below.
An outcome is the result of a single execution of the model. Once the probability space is
established, it is assumed that “nature” makes its move and selects a single outcome ω from
the sample space Ω. All the events in ℱ that contain the selected outcome ω (recall that each
event is a subset of Ω) are said to “have occurred”. The selection performed by “nature” is done
in such a way that, if the experiment were to be repeated an infinite number of times, the
relative frequencies of occurrence of each of the events would coincide with the probabilities
prescribed by the function 𝑝.
Borel 𝝈–algebra:
If the sample space Ω is a countable set, the 𝜎–algebra of events is usually taken to be pow(Ω). If Ω is ℝ^𝑑, then the Borel 𝜎–algebra is the smallest 𝜎–algebra containing all open sets.
Probability measure:
Given a probability space, a probability measure is a non–negative function 𝐏 defined on events
and satisfying the following:
𝐏(Ω) = 1;
for any sequence of events 𝐵1, 𝐵2, … which are disjoint, meaning 𝐵𝑖 ∩ 𝐵𝑗 = ∅ for 𝑖 ≠ 𝑗:
𝐏(⋃_{𝑖=1}^{∞} 𝐵𝑖) = ∑_{𝑖=1}^{∞} 𝐏(𝐵𝑖) (B1)
Probability distribution:
If Ω is a countable set, a probability distribution (or sometimes simply a probability) on Ω is a
function 𝑝 ∶ Ω ⟶ [0, 1] such that:
∑_{ω∈Ω} 𝑝(ω) = 1 (B2)
We will abuse notation and write, for any subset 𝐴 ⊂ Ω, 𝑝(𝐴) = ∑_{ω∈𝐴} 𝑝(ω). The set function 𝐴 ⟼ 𝑝(𝐴) is a probability measure.
Measurable function:
A function 𝑓: Ω ⟶ ℝ is called measurable if 𝑓−1(𝐵) is an event for all open sets 𝐵.
Density function:
If Ω = 𝐷 is an open subset of ℝ^𝑑 and 𝑓 : 𝐷 ⟶ [0, ∞) is a measurable function satisfying ∫_𝐷 𝑓(𝑥) 𝑑𝑥 = 1, then 𝑓 is called a density function.
Given a density function, the following set function defined for Borel sets 𝐵 is a probability
measure:
𝜇𝑓(𝐵) = ∫_𝐵 𝑓(𝑥) 𝑑𝑥 (B3)
Random variable:
Given a probability space, a random variable 𝑋 is a measurable function defined on Ω. We write
{𝑋 ∈ 𝐴} as shorthand for the set:
𝑋−1(𝐴) = {ω ∈ Ω ∶ 𝑋(ω) ∈ 𝐴} (B4)
Distribution of a random variable:
The distribution of a random variable 𝑋 is the probability measure 𝜇𝑋 on ℝ defined for Borel set
𝐵 by:
𝜇𝑋(𝐵) = 𝐏({𝑋 ∈ 𝐵}) = 𝐏{𝑋 ∈ 𝐵} (B5)
Types of random variables:
We call a random variable X discrete if there is a finite or countable set 𝑆, called the support of
𝑋, such that 𝜇𝑋(𝑆) = 1. In this case, the following function is a probability distribution on 𝑆:
𝑝𝑋(𝑎) = 𝐏{𝑋 = 𝑎} (B6)
We call a random variable 𝑋 absolutely continuous if there is a density function 𝑓 on ℝ such that:
𝜇𝑋(𝐴) = ∫_𝐴 𝑓(𝑥) 𝑑𝑥 (B7)
Mean or expectation:
For a discrete random variable 𝑋, the mean or expectation 𝐄(𝑋) can be computed by the
following formula whose sum has at most countably many non–zero terms:
𝐄𝐏(𝑋) = 𝐄𝐏[𝑋] = 𝐄[𝑋] = ∑_{𝑥∈ℝ} 𝑥 ∙ 𝐏(𝑋 = 𝑥) (B8)
For an absolutely continuous random variable 𝑋, the expectation 𝐄(𝑋) is computed by the
formula:
𝐄𝑓(𝑋) = 𝐄𝑓[𝑋] = 𝐄[𝑋] = ∫_ℝ 𝑥 ∙ 𝑓𝑋(𝑥) 𝑑𝑥 (B9)
Variance:
The variance of a random variable 𝑋 is defined by:
𝐕𝐚𝐫(𝑋) = 𝐕𝐚𝐫[𝑋] = 𝐄[(𝑋 − 𝐄[𝑋])𝟐] (B10)
If 𝑋 is a random variable, 𝑔 ∶ ℝ ⟶ ℝ is a function, and 𝑌 = 𝑔(𝑋) is a function of 𝑋, then the
expectation 𝐄[𝑌] can be computed via the formulae:
𝐄[𝑌] = ∫ 𝑔(𝑥) ∙ 𝑓(𝑥) 𝑑𝑥, if 𝑋 is continuous with density 𝑓
𝐄[𝑌] = ∑_{𝑥∈𝑆} 𝑔(𝑥) ∙ 𝑝𝑋(𝑥), if 𝑋 is discrete with support 𝑆 (B11)
Standard deviation:
The standard deviation of a random variable 𝑋 is defined as the (nonnegative) square root of its
variance:
𝛔𝑋 = √𝐕𝐚𝐫[𝑋] (B12)
Covariance:
The covariance between two jointly distributed real–valued random variables 𝑋 and 𝑌 with finite
variances is defined as:
𝐜𝐨𝐯(𝑋, 𝑌) = 𝐄[(𝑋 − 𝐄[𝑋]) ∙ (𝑌 − 𝐄[𝑌])] = 𝐄[𝑋 ∙ 𝑌] − 𝐄[𝑋] ∙ 𝐄[𝑌] (B13)
Correlation coefficient:
The population correlation coefficient between two random variables 𝑋 and 𝑌 with expected values 𝐄[𝑋] and 𝐄[𝑌] and standard deviations 𝛔𝑋 and 𝛔𝑌 is defined as:
𝝆𝑿,𝒀 = 𝐜𝐨𝐫𝐫(𝑋, 𝑌) = 𝐜𝐨𝐯(𝑋, 𝑌) / (𝛔𝑋 ∙ 𝛔𝑌) (B14)
The sample correlation coefficient between two data sets 𝐗 = {x1, … , x𝑛} and 𝐘 = {𝑦1, … , y𝑛},
each of them containing 𝑛 values, is defined as:
𝒓𝑿,𝒀 = 𝐜𝐨𝐫𝐫(𝐗, 𝐘) = (∑_𝑖 𝑥𝑖 ∙ 𝑦𝑖 − 𝑛 ∙ 𝑥̄ ∙ 𝑦̄) / ((𝑛 − 1) ∙ 𝐬𝐗 ∙ 𝐬𝐘) (B15)
where 𝐬𝐗 and 𝐬𝐘 represent the sample standard deviations of 𝐗 and 𝐘, respectively, and 𝑥̄ and 𝑦̄ represent the sample means of 𝐗 and 𝐘, respectively.
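As an illustrative aside (not part of the original text), formula (B15) can be checked numerically. The Python sketch below uses the equivalent centered form 𝐜𝐨𝐯(𝐗, 𝐘)/(𝐬𝐗 ∙ 𝐬𝐘); the function name and the toy data are our own.

```python
import math

def sample_corr(xs, ys):
    """Sample correlation coefficient as in (B15), computed via the
    equivalent centered form cov(X, Y) / (s_X * s_Y) with n - 1 divisors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    return cov / (sx * sy)

# Perfectly linearly related samples give correlation 1 (or -1 if decreasing).
assert abs(sample_corr([1, 2, 3, 4], [2, 4, 6, 8]) - 1.0) < 1e-12
assert abs(sample_corr([1, 2, 3, 4], [8, 6, 4, 2]) + 1.0) < 1e-12
```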
Independence:
Fix a probability space and a probability measure 𝐏. Two events 𝐴 and 𝐵 are independent if:
𝐏(𝐴 ∩ 𝐵) = 𝐏(𝐴) ∙ 𝐏(𝐵) (B16)
Events 𝐴1, 𝐴2, … are independent if for any indices 𝑖1, 𝑖2, … , 𝑖𝑟:
𝐏(𝐴𝑖1 ∩ 𝐴𝑖2 ∩ … ∩ 𝐴𝑖𝑟) = 𝐏(𝐴𝑖1) ∙ 𝐏(𝐴𝑖2) ∙ … ∙ 𝐏(𝐴𝑖𝑟) (B17)
Random variables 𝑋1, 𝑋2, … are independent if for all Borel sets 𝐵1, 𝐵2, … the events {𝑋1 ∈
𝐵1}, {𝑋2 ∈ 𝐵2},… are independent.
Proposition B.1: If 𝑋 and 𝑌 are independent random variables such that 𝐕𝐚𝐫(𝑋) and 𝐕𝐚𝐫(𝑌) exist, then:
𝐕𝐚𝐫[𝑋 + 𝑌] = 𝐕𝐚𝐫[𝑋] + 𝐕𝐚𝐫[𝑌] (B18)
Theorem B.2 (Markov’s inequality):
For a non–negative random variable 𝑋 and any 𝑎 > 0:
𝐏{𝑋 > 𝑎} ≤ 𝐄(𝑋) / 𝑎 (B19)
Convergence in probability:
A sequence of random variables (𝑋𝑡) converges in probability to a random variable 𝑋 if:
lim_{𝑡→∞} 𝐏{|𝑋𝑡 − 𝑋| > 𝜀} = 0 for all 𝜀 > 0 (B20)
This is denoted by: 𝑋𝑡 →^{𝑝𝑟} 𝑋.
Theorem B.3 (Convergence for sequence of random variables):
Let (𝑌𝑡) be a sequence of random variables and 𝑌 be a random variable such that:
𝐏{ lim_{𝑛→∞} 𝑌𝑛 = 𝑌 } = 1 (B21)
Bounded Convergence:
If there is a constant 𝑘 ≥ 0 independent of 𝑛 such that |𝑌𝑛| < 𝑘 for all 𝑛 ∈ ℕ, then:
lim_{𝑛→∞} 𝐄[𝑌𝑛] = 𝐄[𝑌] (B22)
Dominated Convergence:
If there is a random variable 𝑍 such that 𝐄[|𝑍|] < ∞ and 𝐏{|𝑌𝑛| ≤ |𝑍|} = 1 for all 𝑛 ∈ ℕ, then:
lim_{𝑛→∞} 𝐄[𝑌𝑛] = 𝐄[𝑌] (B23)
Monotone Convergence:
If 𝐏{𝑌𝑛 ≤ 𝑌𝑛+1} = 1 for all 𝑛 ∈ ℕ, then:
lim_{𝑛→∞} 𝐄[𝑌𝑛] = 𝐄[𝑌] (B24)
Entropy of a univariate random variable:
Let 𝐏(𝑋) be a distribution over a univariate random variable 𝑋. The entropy of 𝑋 is defined as:
S𝐏(𝑋) = 𝐄[log(1/𝐏(𝑋))] = ∑_𝑋 𝐏(𝑋) ∙ log(1/𝐏(𝑋)) = − ∑_𝑋 𝐏(𝑋) ∙ log 𝐏(𝑋) (B25)
where we treat 0 ∙ log(1/0) = 0 because lim_{𝜀→0} 𝜀 ∙ log(1/𝜀) = 0.
The entropy can be viewed as a measure of our uncertainty about the value of 𝑋.
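As a small numerical illustration (ours, not the original text), definition (B25) can be evaluated directly; the sketch below uses base-2 logarithms, so the result is measured in bits, and applies the 0 ∙ log(1/0) = 0 convention.

```python
import math

def entropy(p):
    """Shannon entropy of a discrete distribution p (a list of probabilities),
    using the convention 0 * log(1/0) = 0 from definition (B25)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# A fair coin carries one bit of uncertainty; a deterministic variable none;
# a uniform distribution on 4 outcomes carries two bits.
assert entropy([0.5, 0.5]) == 1.0
assert entropy([1.0, 0.0]) == 0.0
assert entropy([0.25, 0.25, 0.25, 0.25]) == 2.0
```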
Entropy of a multivariate random variable:
The previous definition extends naturally to multivariate random variables. Let 𝐏(𝑋1, … , 𝑋𝑛) be a distribution over random variables 𝑋1, … , 𝑋𝑛. Then the joint entropy of 𝑋1, … , 𝑋𝑛 is:
S𝐏(𝑋1, … , 𝑋𝑛) = 𝐄[log(1/𝐏(𝑋1, … , 𝑋𝑛))] = ∑_{𝑋1,…,𝑋𝑛} 𝐏(𝑋1, … , 𝑋𝑛) ∙ log(1/𝐏(𝑋1, … , 𝑋𝑛)) (B26)
S𝐏(𝑋1, … , 𝑋𝑛) = − ∑_{𝑋1,…,𝑋𝑛} 𝐏(𝑋1, … , 𝑋𝑛) ∙ log 𝐏(𝑋1, … , 𝑋𝑛) (B27)
Distance between distributions:
There are situations when we want to compare two distributions. For instance, we might want to approximate a distribution by one with desired qualities, e.g., a simpler representation or one that is more efficient to reason with; we also want to evaluate the quality of a candidate approximation.
Another example is in the context of learning a distribution from data, where we want to
compare the learned distribution to the “true” distribution from which the data was generated.
Therefore, we want to construct a distance measure 𝑑 that evaluates the distance between two
distributions. There are some properties we might wish for in such a measure:
Positivity: 𝑑(𝐏,𝐐) is always nonnegative and is zero if and only if 𝐏 = 𝐐.
Symmetry: 𝑑(𝐏, 𝐐) = 𝑑(𝐐, 𝐏).
Triangle inequality: for any three distributions 𝐏,𝐐, 𝐑 we have that:
𝑑(𝐏,𝐑) ≤ 𝑑(𝐏,𝐐) + 𝑑(𝐐, 𝐑) (B28) A distance measure that satisfies these three properties is called a distance metric.
Relative entropy and KL–divergence:
Let 𝐐 and 𝐏 be two distributions over random variables 𝑋1, … , 𝑋𝑛. The relative entropy of 𝐐 and
𝐏 is:
KL(𝐐(𝑋1 … 𝑋𝑛) || 𝐏(𝑋1 … 𝑋𝑛)) = 𝐄𝐐[log(𝐐(𝑋1, … , 𝑋𝑛) / 𝐏(𝑋1, … , 𝑋𝑛))] (B29)
KL(𝐐(𝑋1 … 𝑋𝑛) || 𝐏(𝑋1 … 𝑋𝑛)) = ∑_{𝑋1,…,𝑋𝑛} 𝐐(𝑋1, … , 𝑋𝑛) ∙ log(𝐐(𝑋1, … , 𝑋𝑛) / 𝐏(𝑋1, … , 𝑋𝑛)) (B30)
When the set of variables is clear from context we use the shorthand definition: KL(𝐐||𝐏). This
measure is often known as the Kullback–Leibler divergence or KL–divergence.
The relative entropy measures the additional cost imposed by using a wrong distribution 𝐐 instead of the true distribution 𝐏. Thus, 𝐐 is close to 𝐏 in the sense of relative entropy if this cost is small. The additional cost of using the wrong distribution is always nonnegative, and the relative entropy is zero if and only if the two distributions are identical.
Unfortunately, positivity is the only property of distance metrics that the relative entropy satisfies; it satisfies neither symmetry nor the triangle inequality.
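The asymmetry of the KL–divergence is easy to exhibit numerically. The following sketch (ours; natural logarithms, distributions given as lists) evaluates definition (B30) and confirms positivity, the zero-iff-identical property, and the failure of symmetry.

```python
import math

def kl(q, p):
    """KL(Q || P) for discrete distributions given as equal-length lists,
    following definition (B30); assumes p[i] > 0 wherever q[i] > 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

q = [0.5, 0.5]
p = [0.9, 0.1]
# KL is zero for identical distributions, positive otherwise, and asymmetric.
assert kl(p, p) == 0.0
assert kl(q, p) > 0
assert kl(q, p) != kl(p, q)
```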
Appendix C: Finite Markov chains
The notions presented in this appendix have been used in the previous sections and the
majority of them are taken from the book [78].
A finite Markov chain is a process which moves among the elements of a finite set Ω in the following manner: when 𝑥 ∈ Ω is the current position of the process, the next position is chosen according to a fixed probability distribution 𝑃(𝑥, ·). We formally define this type of process and present some of its properties.
Finite Markov chain:
A sequence of random variables 𝑋0, 𝑋1, … is a finite Markov chain with finite state space Ω and transition matrix 𝑃 if for all 𝑥, 𝑦 ∈ Ω, all 𝑡 ≥ 1, and all events 𝐻𝑡−1 = ⋂_{𝑠=0}^{𝑡−1} {𝑋𝑠 = 𝑥𝑠} satisfying 𝐏(𝐻𝑡−1 ∩ {𝑋𝑡 = 𝑥}) > 0, we have:
𝐏{𝑋𝑡+1 = 𝑦 | 𝐻𝑡−1 ∩ {𝑋𝑡 = 𝑥}} = 𝐏{𝑋𝑡+1 = 𝑦 | 𝑋𝑡 = 𝑥} = 𝑃(𝑥, 𝑦) (C1)
Equation (C1) illustrates how the Markov chain explores the space in a local fashion. Often
called the Markov local property, equation (C1) means that the conditional probability of
proceeding from state 𝑥 to state 𝑦 is the same, no matter what sequence 𝑥0, 𝑥1, … 𝑥𝑡−1 of states
precedes the current state 𝑥. This is exactly why the |Ω| × |Ω| matrix 𝑃 suffices to describe the
transitions.
Let the distribution 𝑃(𝑥, ·) be the 𝑥-th row of the transition matrix 𝑃. Thus, 𝑃 is stochastic, that is, its entries are all non–negative and:
∑_{𝑦∈Ω} 𝑃(𝑥, 𝑦) = 1 (C2)
Let (𝑋1, 𝑋2, … ) be a finite Markov chain with state space Ω and transition matrix 𝑃, and let the
row vector 𝜇𝑡 be the distribution of 𝑋𝑡:
𝜇𝑡(𝑥) = 𝐏{𝑋𝑡 = 𝑥} for all 𝑥 ∈ Ω
By conditioning on the possible predecessors of the (𝑡 + 1)st state, we see that:
𝜇𝑡+1(𝑦) = ∑_{𝑥∈Ω} 𝐏{𝑋𝑡 = 𝑥} ∙ 𝑃(𝑥, 𝑦) = ∑_{𝑥∈Ω} 𝜇𝑡(𝑥) ∙ 𝑃(𝑥, 𝑦) for all 𝑦 ∈ Ω
Rewriting this in vector form gives:
𝜇𝑡+1 = 𝜇𝑡 ∙ 𝑃 for 𝑡 ≥ 0 (C3)
hence:
𝜇𝑡 = 𝜇0 ∙ 𝑃^𝑡 for 𝑡 ≥ 0 (C4)
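The evolution rule 𝜇𝑡+1 = 𝜇𝑡 ∙ 𝑃 of equation (C3) is a single vector–matrix product. The following Python sketch (ours; the 2-state transition matrix is a hypothetical example) advances a distribution a few steps and checks that it remains a probability distribution.

```python
def step(mu, P):
    """One step of (C3): advance a distribution (row vector) through P."""
    n = len(mu)
    return [sum(mu[x] * P[x][y] for x in range(n)) for y in range(n)]

# Hypothetical 2-state chain.
P = [[0.9, 0.1],
     [0.2, 0.8]]
mu = [1.0, 0.0]           # start deterministically in state 0
for _ in range(3):
    mu = step(mu, P)      # mu_t = mu_0 * P^t, as in equation (C4)
# Each mu_t is again a probability distribution.
assert abs(sum(mu) - 1.0) < 1e-12
assert abs(mu[0] - 0.781) < 1e-9
```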
Irreducible finite Markov chain:
A Markov chain with transition matrix 𝑃 is called irreducible if for any two states 𝑥, 𝑦 ∈ Ω, there
exists an integer 𝑡 (possibly depending on 𝑥 and 𝑦) such that 𝑃𝑡(𝑥, 𝑦) > 0.
This means that it is possible to get from any state to any other state (not necessarily in one step) using only transitions of positive probability.
Lemma C.1: A finite irreducible Markov chain with state space Ω and transition matrix 𝑃 = (𝑝𝑖𝑗)_{1≤𝑖,𝑗≤𝑚} is aperiodic if there exists a state 𝑥𝑗 ∈ Ω, where 1 ≤ 𝑗 ≤ 𝑚, such that 𝑝𝑗𝑗 > 0.
Period of a state:
Let 𝑇(𝑥) = {𝑡 ≥ 1 : 𝑃^𝑡(𝑥, 𝑥) > 0} be the set of times when it is possible for the chain to return to the starting position 𝑥. The period of state 𝑥 is defined to be the greatest common divisor of 𝑇(𝑥). The chain is called aperiodic if every state has period 1.
Stationary distribution:
A stationary distribution of a Markov chain 𝑃 is a probability 𝜋 on Ω invariant under right multiplication by 𝑃, which means:
𝜋 = 𝜋 ∙ 𝑃 (C5)
In this case, 𝜋 is the long–term limiting distribution of the Markov chain. Clearly, if 𝜋 is a stationary distribution and 𝜇0 = 𝜋, i.e., the chain is started in a stationary distribution, then 𝜇𝑡 = 𝜋 for all 𝑡 ≥ 0. Equation (C5) can be rewritten element–wise as:
𝜋(𝑦) = ∑_{𝑥∈Ω} 𝜋(𝑥) ∙ 𝑃(𝑥, 𝑦) for all 𝑦 ∈ Ω (C6)
Under mild restrictions, stationary distributions of finite Markov chains exist and are unique, and the chains converge to them.
There is a difference between multiplying a row vector by 𝑃 on the right and a column vector by
𝑃 on the left: the former advances a distribution by one step of the chain, while the latter gives
the expectation of a function on states, one step of the chain later.
Hitting and stopping time:
For x ∈ Ω, we define the hitting time for 𝑥 to be the first time at which the chain visits state 𝑥:
𝜏𝑥 = min{𝑡 ≥ 0 : 𝑋𝑡 = 𝑥} (C7)
For situations where only a visit to 𝑥 at a positive time will do, we also define:
𝜏𝑥^+ = min{𝑡 ≥ 1 : 𝑋𝑡 = 𝑥} (C8)
We call 𝜏𝑥^+ the first return time when 𝑋0 = 𝑥.
A stopping time 𝜏 for (𝑋𝑡) is a {0, 1, … } ∪ {∞}–valued random variable such that, for each 𝑡, the event {𝜏 = 𝑡} is determined by 𝑋0, 𝑋1, … , 𝑋𝑡. If 𝜏 is a stopping time, then an immediate consequence of the definition and the Markov property is:
𝐏𝑥0{(𝑋𝜏+1, 𝑋𝜏+2, … , 𝑋𝑙) ∈ 𝐴 | 𝜏 = 𝑘 and (𝑋1, … , 𝑋𝑘) = (𝑥1, … , 𝑥𝑘)} = 𝐏𝑥𝑘{(𝑋1, … , 𝑋𝑙) ∈ 𝐴} (C9)
for any 𝐴 ⊂ Ω𝑙. This is referred to as the strong Markov property. Informally, we say that the
chain “starts afresh” at a stopping time.
Lemma C.2: For any states 𝑥 and 𝑦 of an irreducible chain:
𝐄𝑥(𝜏𝑦^+) < ∞ (C10)
Theorem C.3 (Existence of a stationary distribution):
Let 𝑃 be the transition matrix of an irreducible Markov chain. Then the following are true:
there exists a probability distribution 𝜋 on Ω such that 𝜋 = 𝜋 ∙ 𝑃 and 𝜋(𝑥) > 0 for all 𝑥 ∈ Ω; (C11)
𝜋(𝑥) = 1 / 𝐄𝑥(𝜏𝑥^+) (C12)
Theorem C.4 (Uniqueness of the stationary distribution):
Let 𝑃 be the transition matrix of an irreducible Markov chain. There exists a unique probability
distribution 𝜋 satisfying: 𝜋 = 𝜋 ∙ 𝑃.
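As a hedged numerical companion to Theorems C.3 and C.4 (ours, not part of the original text), the unique stationary distribution of a small irreducible, aperiodic chain can be approximated by repeatedly applying (C3), a power-iteration sketch; the example matrix is hypothetical.

```python
def stationary(P, iters=10_000):
    """Approximate the unique stationary distribution of an irreducible,
    aperiodic finite chain by iterating mu <- mu * P (power iteration)."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[x] * P[x][y] for x in range(n)) for y in range(n)]
    return mu

P = [[0.9, 0.1],
     [0.2, 0.8]]
pi = stationary(P)
# pi satisfies pi = pi * P; for this chain pi = (2/3, 1/3).
assert abs(pi[0] - 2 / 3) < 1e-6
assert abs(pi[1] - 1 / 3) < 1e-6
```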
Reversibility and Time Reversals:
Suppose a probability 𝜋 on Ω satisfies the detailed balance equation:
𝜋(𝑥) ∙ 𝑃(𝑥, 𝑦) = 𝜋(𝑦) ∙ 𝑃(𝑦, 𝑥) for all 𝑥, 𝑦 ∈ Ω (C13) Proposition C.5: Let 𝑃 be the transition matrix of a Markov chain with state space Ω. Any
distribution 𝜋 satisfying the detailed balance equations is stationary for 𝑃.
Checking detailed balance is often the simplest way to verify that a particular distribution is stationary. Furthermore, when the detailed balance equation holds:
𝜋(𝑥0) ∙ 𝑃(𝑥0, 𝑥1) ∙ … ∙ 𝑃(𝑥𝑛−1, 𝑥𝑛) = 𝜋(𝑥𝑛) ∙ 𝑃(𝑥𝑛, 𝑥𝑛−1) ∙ … ∙ 𝑃(𝑥1, 𝑥0) (C14)
We can rewrite the previous equation in the following suggestive form:
𝐏𝜋{𝑋0 = 𝑥0, … , 𝑋𝑛 = 𝑥𝑛} = 𝐏𝜋{𝑋0 = 𝑥𝑛, … , 𝑋𝑛 = 𝑥0} (C15)
In other words, if a chain (𝑋𝑡) satisfies the detailed balance equation and has stationary initial distribution, then the distribution of (𝑋0, 𝑋1, … , 𝑋𝑛) is the same as the distribution of (𝑋𝑛, 𝑋𝑛−1, … , 𝑋0).
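Checking detailed balance as in (C13) amounts to comparing 𝜋(𝑥)𝑃(𝑥, 𝑦) against 𝜋(𝑦)𝑃(𝑦, 𝑥) over all pairs. A minimal Python sketch (ours; the example chain is hypothetical, and every two-state chain is reversible with respect to its stationary distribution):

```python
def detailed_balance(pi, P, tol=1e-12):
    """Check the detailed balance equation (C13) for every pair of states."""
    n = len(pi)
    return all(abs(pi[x] * P[x][y] - pi[y] * P[y][x]) <= tol
               for x in range(n) for y in range(n))

P = [[0.9, 0.1],
     [0.2, 0.8]]
# (2/3, 1/3) is stationary for P and satisfies detailed balance;
# the uniform distribution does not.
assert detailed_balance([2 / 3, 1 / 3], P)
assert not detailed_balance([0.5, 0.5], P)
```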
Reversible finite Markov chain:
A chain satisfying the detailed balance equation is called reversible. The time reversal of an irreducible Markov chain with transition matrix 𝑃 and stationary distribution 𝜋 is the chain with transition matrix:
𝑃̂(𝑥, 𝑦) = 𝜋(𝑦) ∙ 𝑃(𝑦, 𝑥) / 𝜋(𝑥) (C16)
The stationary equation 𝜋 = 𝜋 ∙ 𝑃 implies that 𝑃̂ is a stochastic matrix. We write (𝑋̂𝑡) for the time–reversed chain of (𝑋𝑡) and 𝑃̂ for the transition matrix of (𝑋̂𝑡).
Proposition C.6: Let (𝑋𝑡) be an irreducible Markov chain with transition matrix 𝑃 and stationary distribution 𝜋. Then 𝜋 is stationary for 𝑃̂ and for any 𝑥0, 𝑥1, … , 𝑥𝑡 ∈ Ω we have:
𝐏𝜋{𝑋0 = 𝑥0, … , 𝑋𝑡 = 𝑥𝑡} = 𝐏̂𝜋{𝑋0 = 𝑥𝑡, … , 𝑋𝑡 = 𝑥0} (C17)
Observe that if a chain with transition matrix 𝑃 is reversible, then 𝑃̂ = 𝑃.
Theorem C.7 (Markov Chain Convergence):
Suppose that a Markov chain 𝑃 is irreducible and aperiodic, with stationary distribution 𝜋. Then there exist constants 𝛼 ∈ (0, 1) and 𝐶 > 0 such that:
max_{𝑥∈Ω} ||𝑃^𝑡(𝑥, ·) − 𝜋||_{𝑇𝑉} ≤ 𝐶 ∙ 𝛼^𝑡 (C18)
where ||𝜇 − 𝜈||_{𝑇𝑉} represents the total variation distance between two probability distributions 𝜇 and 𝜈 on Ω and is defined as:
||𝜇 − 𝜈||_{𝑇𝑉} = max_{𝐴⊂Ω} |𝜇(𝐴) − 𝜈(𝐴)| (C19)
This theorem implies that the “long–term” fractions of time a finite irreducible aperiodic Markov
chain spends in each state coincide with the chain’s stationary distribution.
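The geometric decay promised by (C18) can be observed directly. In the sketch below (ours; for discrete distributions the total variation distance equals half the L1 distance, a standard identity), the distance from 𝜋 shrinks at every step of a hypothetical two-state chain.

```python
def tv(mu, nu):
    """Total variation distance (C19); for discrete distributions it
    equals half the L1 distance between the probability vectors."""
    return 0.5 * sum(abs(m - n) for m, n in zip(mu, nu))

P = [[0.9, 0.1],
     [0.2, 0.8]]
pi = [2 / 3, 1 / 3]        # stationary distribution of P
mu = [1.0, 0.0]            # start in state 0
dists = []
for _ in range(5):
    mu = [sum(mu[x] * P[x][y] for x in range(2)) for y in range(2)]
    dists.append(tv(mu, pi))
# Geometric decay as in (C18): each step strictly shrinks the distance.
assert all(dists[i + 1] < dists[i] for i in range(4))
```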
Ergodicity:
A Markov chain is said to be ergodic if there exists a positive integer 𝑇0 such that, for all pairs of
states 𝑖 and 𝑗 in the Markov chain, if the chain is started at time 0 in state 𝑖, then for all 𝑡 > 𝑇0,
the probability of being in state 𝑗 at time 𝑡 is greater than 0.
For a Markov chain to be ergodic two technical conditions are required of its states and its
transition matrix: irreducibility and aperiodicity. Informally, irreducibility ensures that there is a
sequence of transitions of non–zero probability from any state to any other, while aperiodicity
ensures that the states are not partitioned into sets such that all state transitions occur cyclically
from one set to another.
Theorem C.8 (Ergodic Theorem):
Let 𝑓 be a real–valued function defined on Ω. If (𝑋𝑡) is an irreducible Markov chain with stationary distribution 𝜋, then for any starting distribution 𝜇, the following holds:
𝐏𝜇{ lim_{𝑡→∞} (1/𝑡) ∙ ∑_{𝑠=0}^{𝑡−1} 𝑓(𝑋𝑠) = 𝐄𝜋[𝑓] } = 1 (C20)
where 𝐄𝜋[𝑓] = ∑_{𝑥∈Ω} 𝑓(𝑥) ∙ 𝜋(𝑥) is computed as in formula (B8).
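A simulation sketch (ours; hypothetical two-state chain, 𝑓 the indicator of state 0) illustrates (C20): the time average of 𝑓 along a single trajectory approaches 𝐄𝜋[𝑓] = 𝜋(0).

```python
import random

def time_average(P, steps, seed=1):
    """Sample one trajectory of a 2-state chain and return the time average
    of f(X_s) with f = indicator of state 0, as in the ergodic theorem (C20)."""
    random.seed(seed)
    x, total = 0, 0
    for _ in range(steps):
        total += (x == 0)
        # From state x, move to state 0 with probability P[x][0], else to 1.
        x = 0 if random.random() < P[x][0] else 1
    return total / steps

P = [[0.9, 0.1],
     [0.2, 0.8]]
# The stationary distribution is (2/3, 1/3), so the time average tends to 2/3.
avg = time_average(P, 100_000)
assert abs(avg - 2 / 3) < 0.03
```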
Markov chain Monte Carlo (MCMC):
Problem: Given an irreducible transition matrix 𝑃, there is a unique stationary distribution 𝜋 on Ω satisfying 𝜋 = 𝜋 ∙ 𝑃.
Answer: The existence and uniqueness of the solution of this problem are ensured by Theorems C.3 and C.4, respectively.
Inverse problem: Given a probability distribution 𝜋 on Ω, can we find a transition matrix 𝑃 for which 𝜋 is its stationary distribution?
Answer: Yes, we can. The solution involves a method of sampling from a given probability distribution called Markov chain Monte Carlo.
MCMC uses Markov chains to sample. A random sample from a finite set Ω means a uniformly random selection from Ω, i.e., a selection such that each element has the same chance 1/|Ω| of being chosen.
Suppose 𝜋 is a probability distribution on Ω. If a Markov chain (𝑋𝑡) with stationary distribution 𝜋
can be constructed, then, for 𝑡 large enough, the distribution of (𝑋𝑡) is close to 𝜋.
Metropolis chains:
Problem: Given a probability distribution 𝜋 on Ω and some Markov chain with state space Ω and an arbitrary stationary distribution, can the chain be modified so that the new chain has the stationary distribution 𝜋?
Answer: Yes, it can. The Metropolis algorithm solves this problem.
We distinguish two cases: symmetric base chain and general base chain. In both cases we are
given an arbitrary probability distribution 𝜋 on Ω and a base Markov chain (𝑋𝑡) with transition
matrix ψ. We want to construct a new chain (𝑌𝑡) starting from the base chain (𝑋𝑡) and modifying
its transitions such that the stationary distribution of the new chain is 𝜋. The new chain is called
the Metropolis chain. The stationary distribution of (𝑋𝑡) or, equivalently, the transition matrix ψ is also referred to as the proposal distribution.
Let (𝑋𝑡) have a symmetric transition matrix ψ. This implies that (𝑋𝑡) is reversible with
respect to the uniform distribution on Ω. The Metropolis chain is executed as follows.
It starts from the initial state of the base chain and evolves as follows: when at state 𝑥, a
candidate state 𝑦 is generated from the distribution ψ(𝑥,·). The state 𝑦 is “accepted” with
probability 𝑎(𝑥, 𝑦), which means that the next state of the new chain is 𝑦, or the state 𝑦 is
“rejected” with probability 1 − 𝑎(𝑥, 𝑦), which means that the next state of the new chain
remains at 𝑥. The acceptance probability 𝑎(𝑥, 𝑦) is:
𝑎(𝑥, 𝑦) = 1 ∧ (𝜋(𝑦)/𝜋(𝑥)) = min(1, 𝜋(𝑦)/𝜋(𝑥)) (C21)
Therefore, the Metropolis chain for a probability 𝜋 and a symmetric transition matrix ψ is
defined by the following transition matrix:
𝑃(𝑥, 𝑦) = ψ(𝑥, 𝑦) ∙ [1 ∧ 𝜋(𝑦)/𝜋(𝑥)], if 𝑦 ≠ 𝑥
𝑃(𝑥, 𝑥) = 1 − ∑_{𝑧 ≠ 𝑥} ψ(𝑥, 𝑧) ∙ [1 ∧ 𝜋(𝑧)/𝜋(𝑥)] (C22)
A very important feature of the Metropolis chain is that it depends only on the ratios 𝜋(𝑦)/𝜋(𝑥). Frequently 𝜋(𝑥) has the form ℎ(𝑥)/𝑍, where the function ℎ ∶ Ω ⟶ [0, ∞) is known and 𝑍 = ∑_{𝑥∈Ω} ℎ(𝑥) is a normalizing constant. It may be difficult to compute 𝑍 explicitly, especially if Ω is large. Because the Metropolis chain depends only on the ratios ℎ(𝑦)/ℎ(𝑥), it is not necessary to compute the constant 𝑍 in order to simulate the chain.
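As an illustration of this point, here is a minimal Metropolis sampler; the four-state Ω and the weights ℎ are invented for the example, and the proposal is uniform over Ω (hence symmetric). The code never computes 𝑍, yet the empirical frequencies approach ℎ(𝑥)/𝑍 = (0.125, 0.25, 0.5, 0.125):

```python
import random
from collections import Counter

# Target: pi(x) proportional to h(x) on Omega = {0,1,2,3}; Z is never computed.
h = {0: 1.0, 1: 2.0, 2: 4.0, 3: 1.0}
omega = list(h)

def metropolis_step(x, rng):
    # Symmetric proposal psi: a uniformly random state (psi(x, y) = 1/4).
    y = rng.choice(omega)
    a = min(1.0, h[y] / h[x])          # acceptance probability (C21)
    return y if rng.random() < a else x

rng = random.Random(0)
x, counts, n = 0, Counter(), 200_000
for _ in range(n):
    x = metropolis_step(x, rng)
    counts[x] += 1

# Empirical frequencies: should approach h(x)/Z = h(x)/8.
freq = {s: counts[s] / n for s in omega}
```

Only the ratio ℎ(𝑦)/ℎ(𝑥) appears in the acceptance step, exactly as noted above.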
Let (𝑋𝑡) have a general transition matrix ψ; that is, ψ is irreducible but not necessarily symmetric.
The Metropolis chain is executed as follows. It starts from the initial state of the base
chain and evolves as follows: when at state 𝑥, generate a state 𝑦 from the distribution
ψ(𝑥,·). Then move to 𝑦 with probability 𝑎(𝑥, 𝑦) and remain at 𝑥 with the probability
1 − 𝑎(𝑥, 𝑦). The acceptance probability 𝑎(𝑥, 𝑦) is:
𝑎(𝑥, 𝑦) = 1 ∧ (𝜋(𝑦) ∙ ψ(𝑦, 𝑥) / (𝜋(𝑥) ∙ ψ(𝑥, 𝑦))) = min(1, 𝜋(𝑦) ∙ ψ(𝑦, 𝑥) / (𝜋(𝑥) ∙ ψ(𝑥, 𝑦))) (C23)
Therefore, the Metropolis chain for a probability 𝜋 and a general transition matrix ψ is
defined by the following transition matrix:
𝑃(𝑥, 𝑦) = ψ(𝑥, 𝑦) ∙ [1 ∧ (𝜋(𝑦) ∙ ψ(𝑦, 𝑥) / (𝜋(𝑥) ∙ ψ(𝑥, 𝑦)))], if 𝑦 ≠ 𝑥
𝑃(𝑥, 𝑥) = 1 − ∑_{𝑧 ≠ 𝑥} ψ(𝑥, 𝑧) ∙ [1 ∧ (𝜋(𝑧) ∙ ψ(𝑧, 𝑥) / (𝜋(𝑥) ∙ ψ(𝑥, 𝑧)))] (C24)
The transition matrix 𝑃 defines a reversible Markov chain with stationary distribution 𝜋.
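Reversibility can be verified directly from (C23)–(C24) on a small example. In the sketch below, 𝜋 and the non-symmetric proposal ψ are arbitrary choices; the code builds the Metropolis transition matrix 𝑃 and checks detailed balance 𝜋(𝑥)𝑃(𝑥, 𝑦) = 𝜋(𝑦)𝑃(𝑦, 𝑥):

```python
# Arbitrary target pi and a non-symmetric irreducible proposal psi on {0, 1, 2}.
pi = [0.2, 0.3, 0.5]
psi = [[0.1, 0.6, 0.3],
       [0.4, 0.2, 0.4],
       [0.5, 0.3, 0.2]]
n = 3

# Build the Metropolis transition matrix per (C23)-(C24).
P = [[0.0] * n for _ in range(n)]
for x in range(n):
    for y in range(n):
        if y != x:
            a = min(1.0, (pi[y] * psi[y][x]) / (pi[x] * psi[x][y]))
            P[x][y] = psi[x][y] * a
    P[x][x] = 1.0 - sum(P[x][y] for y in range(n) if y != x)

# Detailed balance: pi(x) P(x,y) == pi(y) P(y,x) for all x, y.
balanced = all(abs(pi[x] * P[x][y] - pi[y] * P[y][x]) < 1e-12
               for x in range(n) for y in range(n))
```

Detailed balance implies stationarity, so 𝜋 ∙ 𝑃 = 𝜋 holds as well.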
Glauber dynamics (Gibbs sampler):
In general, let 𝑉 and 𝑆 be finite sets and Ω be a subset of 𝑆𝑉: Ω ⊆ 𝑆𝑉. 𝑉 can be seen as the
vertex set of a graph and 𝑆 can be seen as the set of state values for any vertex in the graph.
The elements of 𝑆𝑉 are called configurations and can be “visualized” as labeling the vertices of
𝑉 with elements of 𝑆.
Problem: Given a probability distribution 𝜋 on a space of configurations Ω ⊆ 𝑆𝑉, can we find a
Markov chain for which 𝜋 is its stationary distribution?
Answer: Yes, we can. The Glauber dynamics algorithm solves this problem.
Let 𝜋 be a probability distribution whose support is Ω. For a configuration 𝜎 ∈ Ω and a vertex
𝑣 ∈ 𝑉 let Ω(𝜎, 𝑣) be the set of configurations agreeing with 𝜎 everywhere except possibly at 𝑣:
Ω(𝜎, 𝑣) = {𝜏 ∈ Ω ∶ 𝜏(𝑤) = 𝜎(𝑤) for all 𝑤 ∈ 𝑉, 𝑤 ≠ 𝑣} (C25)
The (single-site) Glauber dynamics for 𝜋 is a reversible Markov chain with state (configuration) space Ω, stationary distribution 𝜋, and transition probabilities defined by the distribution 𝜋 conditioned on the set Ω(𝜎, 𝑣) as follows:
𝜋𝜎,𝑣(𝜏) = 𝜋(𝜏 | Ω(𝜎, 𝑣)) = 𝜋(𝜏)/𝜋(Ω(𝜎, 𝑣)), if 𝜏 ∈ Ω(𝜎, 𝑣)
𝜋𝜎,𝑣(𝜏) = 0, if 𝜏 ∉ Ω(𝜎, 𝑣) (C26)
In words, the Glauber chain moves from a configuration 𝜎 ≡ 𝑋𝑡 to a configuration 𝜏 ≡ 𝑋𝑡+1 as
follows:
• a vertex 𝑣 is chosen uniformly at random from 𝑉;
• a new configuration 𝜏 ∈ Ω is chosen according to the probability measure 𝜋 conditioned on the set of configurations that agree with 𝜎 everywhere except possibly at 𝑣.
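A minimal sketch of this update rule, on an invented two-vertex example with 𝑆 = {0, 1} and arbitrary weights ℎ: the code computes the single-site Glauber transition probabilities from (C26) and confirms reversibility with respect to 𝜋:

```python
from itertools import product

# Toy setting: V = {0, 1}, S = {0, 1}, Omega = S^V (all four configurations).
# Arbitrary unnormalized weights h define pi(sigma) = h(sigma) / Z.
h = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}
omega = list(product([0, 1], repeat=2))
Z = sum(h.values())
pi = {s: h[s] / Z for s in omega}

def glauber_P(x, y):
    # Transition probability of single-site Glauber dynamics:
    # pick v uniformly (prob. 1/2 each), then resample x(v) from pi
    # conditioned on Omega(x, v), as in (C26).
    if sum(x[v] != y[v] for v in range(2)) > 1:
        return 0.0                       # more than one site would change
    p = 0.0
    for v in range(2):
        if all(x[w] == y[w] for w in range(2) if w != v):
            block = [t for t in omega
                     if all(t[w] == x[w] for w in range(2) if w != v)]
            p += 0.5 * pi[y] / sum(pi[t] for t in block)
    return p

# The Glauber chain is reversible with respect to pi.
balanced = all(abs(pi[x] * glauber_P(x, y) - pi[y] * glauber_P(y, x)) < 1e-12
               for x in omega for y in omega)
```

Reversibility is immediate from (C26): for 𝜏 and 𝜎 differing only at 𝑣, both 𝜋(𝜎)𝜋𝜎,𝑣(𝜏) and 𝜋(𝜏)𝜋𝜏,𝑣(𝜎) equal 𝜋(𝜎)𝜋(𝜏)/𝜋(Ω(𝜎, 𝑣)).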
Comparing Glauber dynamics and Metropolis chains:
Suppose that 𝜋 is a probability distribution on the state space 𝑆𝑉, where 𝑆 is a finite set and 𝑉 is
the vertex set of a graph. On one hand, we can always define the Glauber chain as just
described. On the other hand, suppose that we have a chain which picks a vertex 𝑣 at random
and has some mechanism for updating its configuration 𝜎 at 𝑣. This chain may not have
stationary distribution 𝜋, but it can be modified by the Metropolis rule to obtain a Metropolis
chain with stationary distribution 𝜋. The Metropolis chain obtained in this way can be very
similar to the Glauber chain, but may not coincide exactly.