Deep neural networks, Tensor networks, and Renormalization group
Submitted By:
HAFIZ ARSLAN HASHIM
2016−12−0005
Supervised By: DR. BABAR AHMED QURESHI
MS Thesis
MAY 2018
Department of Physics
SYED BABAR ALI SCHOOL OF SCIENCE AND ENGINEERING
LAHORE UNIVERSITY OF MANAGEMENT SCIENCES
ABSTRACT
Neural networks are among the most successful systems used to accomplish machine learning tasks, in particular in the context of unsupervised learning and pattern/feature recognition. Deep neural algorithms consist of multiple layers of operational nodes where the output from each layer acts as an input for the next, deeper layer. Deep neural networks (DNNs) have been used in physical applications as well, in particular for finding ground states of complicated many-body systems. Despite their success in practical machine learning applications, there is little theoretical understanding as to why these algorithms work so efficiently. A physical-process-based understanding of neural networks will not only allow for better algorithms for many-body physical systems but will also help in the construction of better DNN algorithms for specific tasks.
During the MS thesis work, we have explored various models that can connect DNNs to physical systems, with the hope of gaining insight into both the working of DNNs and the corresponding physical systems. Recently, it has been proposed that DNNs may be related to the variational renormalization group of Kadanoff, and a map between the two was established. We studied this relationship between Restricted Boltzmann Machines (RBMs) and the renormalization group for spin systems with many examples. Another avenue which we have investigated is the relationship of DNNs to tensor network (TN) models. The correspondence between TNs and DNNs will allow quantifying the expressibility of DNNs for industrial datasets as well as for quantum states. Our main goal is to study this possibility and understand how renormalization and entanglement emerge in this context.
DEDICATION AND ACKNOWLEDGEMENTS
First of all, I would like to thank Allah Almighty for giving me the strength to do this thesis. Afterwards, I would like to say that this work would not have been possible without the guidance of my supervisor, the support of my family, and the help of my friends.
I am very grateful to my supervisor, Dr. Babar Ahmed Qureshi. At the time of enrolling for the thesis, he appreciated my interest in a particular topic and fully supported and guided me throughout its duration. Without him, I would at least have had to change my MS thesis topic. Because of his guidance and patience, I have been able to complete my research work under his supervision.
Moreover, I am thankful to all of my friends at LUMS, Lahore, and Rahim Yar Khan, especially Hassaan Wasalat, Fawad Masood, Junaid Saif Khan, Asif Nawaz, Hassaan Ahmed, Waqar Ahmed, Usman Rasheed, Yasir Iqbal, Yasir Abbass and Shania, for all kinds of support and encouragement. I should also express my gratitude towards the staff of my department, especially Mr. Arshad Maral, for his support.
Finally, I would like to gratefully acknowledge the prayers and encouragement of myfamily and parents throughout my studies.
AUTHOR’S DECLARATION
LAHORE UNIVERSITY OF MANAGEMENT SCIENCES
Department of Physics
CERTIFICATE
I hereby recommend that the thesis prepared under my supervision by Hafiz Arslan Hashim on Deep neural networks, Tensor networks, and Renormalization group be accepted in partial fulfilment of the requirements for the MS degree.
Supervisor: Dr. Babar Ahmed Qureshi
Co-supervisor: Dr. Adam Zaman Chaudhry
__________________________ __________________________
Recommendation of Thesis Defense Committee:
Dr. Ata ul Haq ————————————————————Name Signature Date
Dr. Ammar Ahmed Khan ——————————————-Name Signature Date
TABLE OF CONTENTS
Page
List of Figures vii
1 Introduction 1
1.1 Restricted Boltzmann machine . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Deep neural network in renormalization group perspective . . . . . . . . . 4
1.3 Correspondence between RBM and Tensor Networks (MPS) . . . . . . . . . 5
2 Machine Learning Basics 8
2.1 Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.1 The Task, T . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.2 The Performance measure, P . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.3 The Experience, E . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Gradient descent algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.1 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.1 Conditional Log-Likelihood . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5 Logistic Regression or Classification . . . . . . . . . . . . . . . . . . . . . . . 16
2.6 Overfitting, Underfitting, and Regularization . . . . . . . . . . . . . . . . . 17
2.7 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.8 Deep Feed Forward networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.9 Unsupervised Learning for Deep Neural Networks . . . . . . . . . . . . . . 21
2.10 Energy-based models and Restricted Boltzmann Machine . . . . . . . . . . 22
2.10.1 Energy based models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.10.2 Hidden Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.10.3 Conditional Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.10.4 Boltzmann Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.10.5 Restricted Boltzmann Machine . . . . . . . . . . . . . . . . . . . . . . 25
2.10.6 Gibbs Sampling in RBMs . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.11 Deep Belief Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3 Tensor Networks 31
3.1 Necessity of Tensor Network? . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 Theory of Tensor Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.1 Tensors and tensor networks in tensor network notation . . . . . . 37
3.2.2 Wave function as a set of small tensors . . . . . . . . . . . . . . . . . 39
3.2.3 Entanglement entropy and Area-law . . . . . . . . . . . . . . . . . . 41
3.2.4 Proven instances and violations of Area-law . . . . . . . . . . . . . . 42
3.2.5 Entanglement spectra . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3 Matrix Product States (MPS) . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.1 Some properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.2 Singular value decomposition . . . . . . . . . . . . . . . . . . . . . . 45
3.3.3 MPS construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3.4 Gauge degrees of freedom . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.4 1-D Projected Entangled Pair States (PEPS) . . . . . . . . . . . . . . . . . . 49
3.5 Examples of MPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4 Mapping between Variational RG and RBM 54
4.1 Renormalization group (RG) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.1.1 1D Ising model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.1.2 2D Ising model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2 Variational RG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3 Overview of RBM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.4 Mapping between Variational RG and RBM . . . . . . . . . . . . . . . . . . 63
4.5 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.5.1 Ising model in 1D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.5.2 Ising model in 2D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5 Correspondence between restricted Boltzmann machine and Tensor network states 68
5.1 Transformation of an RBM to TNS . . . . . . . . . . . . . . . . . . . . . . . . 68
5.1.1 Direct transformation from RBM to MPS . . . . . . . . . . . . . . . 69
5.1.2 Optimum MPS representation of an RBM . . . . . . . . . . . . . . . 71
5.1.3 Inference of RBM to MPS mapping . . . . . . . . . . . . . . . . . . . 74
5.2 Representation of TNS as RBM: sufficient and necessary conditions . . . . 77
5.2.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.3 Implication of the RBM-TNS correspondence . . . . . . . . . . . . . . . . . . 81
5.3.1 RBM optimization by using tensor network methods . . . . . . . . . 81
5.3.2 Unsupervised learning in entanglement perspective . . . . . . . . . 82
5.3.3 Entanglement: a measure of effectiveness of deep learning as compared to shallow ones . . . . . . . . . . . . . . . . . . . . . . . . . 83
A MPS Examples in Mathematica 85
A.1 W State . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
A.2 GHZ State . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
B Renormalization Group Example and Code description 87
B.1 1D Ising model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
B.2 Training DBN for the 2D Ising model . . . . . . . . . . . . . . . . . . . . . . 90
Bibliography 91
LIST OF FIGURES
FIGURE Page
1.1 The structure of the Restricted Boltzmann machine. . . . . . . . . . . . . . . . . . 4
1.2 Transformation of an RBM to MPS: (a) Graphical notation of the RBM defined by Eq. 1.2. (b) The matrix product state (MPS) in graphical notation. Here the dangling links correspond to the physical variables vi and A(i) is a three-index tensor. The thickness of the horizontal link between tensors shows the virtual bond dimension. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1 Gradient descent: initialize w randomly, take the gradient at this point, and update the next point according to the gradient. If the gradient is negative, as shown here, then the updated value of w will be greater by the amount α∂wJ. If the gradient is positive, then the next value of w will be smaller by the same amount. The GD algorithm takes small steps in this way and reaches the minimum. . . . 14
2.2 Sigmoid function: this function outputs a smooth value between 0 and 1. . 17
2.3 Overfitting and underfitting: (a) the underfitting problem occurs because the data are more complex than a linear model; (b) a cubic model is appropriate for the provided data; and (c) overfitting: a polynomial of degree 5 is used as a model, which is more complex than the given data. . . . . . . . . . . . . . . . . . 18
2.4 Effect of λ on the model: (a) the λ value is too large, which makes w → 0; (b) a moderate value of λ is used; (c) when λ → 0, w becomes very large and the model will be more complex than the data. . . . . . . . . . . . . . . . . . . . . . 18
2.5 Neuron’s simple model: (a) perceptron: takes weighted binary input and outputs a binary value; (b) sigmoid neuron: it also takes weighted input but outputs a smooth real value between 0 and 1. . . . . . . . . . . . . . . . . . . . . 19
2.6 Deep feed forward neural network: network consists of l layers and each
layer has sn units. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.7 Boltzmann machine: an undirected graphical model. . . . . . . . . . . . . . . 24
2.8 Restricted Boltzmann machine (RBM): no intra-layer connection but hid-
den and visible units can interact with each other. . . . . . . . . . . . . . . . . 26
2.9 Softplus function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.10 Markov chain: starting with an example vt sampled from the empirical probability distribution, sample the hidden layer given the visible layer and then sample the visible layer given the hidden layer; this process goes on and on. . . 28
2.11 Deep Belief Network: defined as a generative model whose generative path runs from top to bottom with distributions P, while the Q distributions extract multiple features from the input and construct an abstract representation. The top two layers define an RBM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1 Tensor Network representation of a state: the tensor is the elementary building block of a tensor network, and a tensor network is a representation of a quantum state (we use the graphical interpretation for a tensor network, which will become obvious in the coming sections). . . . . . . . . . . . . . . . . . 33
3.2 Tensor network diagram examples: (a) Matrix Product State (MPS) for 4 sites with open boundary conditions (OBC); and (b) Projected Entangled Pair State (PEPS) for a 4×4 lattice with OBC. . . . . . . . . . . . . . . . . . . . . . . 34
3.3 Area Law: the entanglement entropy S of the reduced system A scales with the boundary of the system, not with its volume. . . . . . . . . . . . . . . . . . 35
3.4 The physical state in a small manifold of Hilbert space: the set of states which obey the area law occupies an exponentially small corner of the gigantic Hilbert space. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.5 Tensor representation by diagram: here we use superscript for physical
index as shown in (d). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.6 Tensor graphical notation: these tensor diagrams correspond to Eqs. 3.1, 3.2, 3.3, and 3.4; notice that in (b) and (d) the bond between tensors B and C is thick, which corresponds to a higher bond dimension as compared to the other links between the tensors. In practice, we can split and merge any number of indices; in this case, the two indices y, z are merged. There are two and three open indices in diagrams (a) and (b) respectively, while diagrams (c) and (d) have no open indices. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.7 1D lattice with PBC: trace of the product of 8 matrices or lattice of 8 parti-
cles with PBC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.8 TN representation of a wave function: (a) is an MPS, (b) is a PEPS, and (c) is some tensor network which fulfils the requirement. . . . . . . . . . . . . . . . . 40
3.9 Area-law in PEPS: reduced states |A(α)⟩ (2×2) and |B(α)⟩ from the 4×4 PEPS. Each broken bond has dimension D, and each contributes log D to the entanglement entropy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.10 By increasing the bond dimension D of a TN state, one can explore a larger region of Hilbert space. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.11 (a) Infinite-sized MPS with 1- and 2-site unit cells. (b) An efficient way to contract a finite-sized MPS, which can also be applied to an infinite-sized MPS with any boundary condition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.12 Singular Value Decomposition (SVD) Eq. 3.13 in TN notation: here I ≡ {i1, i2, ... in} and J ≡ { j1, j2, ... jn}. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.13 Pictorial presentation of SVD performed in Eq. 3.16. . . . . . . . . . . . . . . . 46
3.14 Complete construction of an MPS by SVD. . . . . . . . . . . . . . . . . . . . . . 47
3.15 AKLT state: each physical site represents a spin-1, which is replaced by two spin-1/2 degrees of freedom (called auxiliary spins). Each right auxiliary spin-1/2 on a site i is entangled with the left spin-1/2 at site i+1. A linear projection operator defined on the auxiliary spins maps them to the physical spins. . . . 50
4.1 The RG decimation scheme: every spin has 4 nearest neighbours, and we sum over half of the spins. The resulting lattice is the same as the original but rotated by an angle of 45◦. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2 In the coarse-graining process, as a result of removing degrees of freedom we get a high degree of connectivity. Yellow lines show the nearest-neighbour couplings v1v2+v2v3+v3v4+v4v1, green lines represent the next-nearest neighbours v1v3+v2v4, and blue connections depict the four-spin product v1v3v2v4. . . . . . 58
4.3 RG flow diagram for 2D Ising model: there are three fixed points, two are
stable (K = 0,∞) and one is unstable (Kc). Kc is a phase transition point. . . . 59
4.4 Block spin transformation: (a) 2×2 blocks are defined on the physical spins vi to marginalize; (b) depicts the effective spins hi after marginalizing the physical degrees of freedom; (c) a side view of the RG procedure is shown: repeated application of RG transformations produces a series of ensembles one over the other. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.5 RG and deep learning aspects of the 1D Ising model: (a) a coarse-graining process by the renormalization transformation of the ferromagnetic 1D Ising model. After every RG iteration, half of the spins are marginalized and the lattice spacing doubles. At each level the system is replaced by a new system with relatively fewer degrees of freedom and new couplings K’s. By the RG flow equation, the couplings at the previous layer can provide the couplings for the next layer. (b) The RG coarse-graining can also be performed by a deep learning architecture where the weights/parameters between the n and n−1 hidden layers are given by K(n−1). . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.6 Deep neural network for the 2D Ising model: (a) a four-layered DBN is constructed with layer sizes of 1600, 400, 100, and 25 spins. The network is trained on samples generated from the 2D Ising model with J = 0.405. (b) The Effective Receptive Field (ERF) of the top layer: it encodes the effect of the input spins on a particular spin in the given layer. Each image is of size 40×40 and depicts the ERF of a particular spin in the top layer. (c) The ERF gets larger as we move from the bottom to the top layer of the network, which is consistent with successive block spin transformations. (d) Similarly, the ERF for the third layer with 100 spins. (e) Three samples generated from the trained network. . 66
5.1 Correspondence between RBM and TNS: (a) RBM representation as an undirected graphical model as defined by Eq. 5.1. The blue circles denote the units v, called visible units, and the gray circles labelled h are called hidden units. They interact with each other through links denoted as solid lines. (b) MPS described by Eq. 5.5. Each dark blue dot represents a 3-indexed tensor A(i). From now on we use hollow circles to denote RBM units and filled ones to indicate tensors. Undirected lines in the RBM represent the link weights, and lines in the MPS denote the bond indices; the thickness of a bond expresses the bond dimension. RBM and TNS are both used to represent complex multi-variable functions, and both have the ability to describe any function with arbitrary precision given unlimited resources (unlimited hidden variables or bond dimensions). However, with limited resources they represent two overlapping but independent regions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.2 Stepwise mapping of an RBM to an MPS: (a) An MPS description of the RBM given in Fig. 5.1(a). The light blue circles express the diagonal tensors Γ(i)v at the visible units defined by Eq. 5.2, and the gray circles denote Γ(j)h at the hidden units as defined by Eq. 5.3. The orange diamonds express the matrix M(ij), described in Eq. 5.4. (b) The RBM is divided into nv pieces. Corresponding to each long-range link, we put an identity tensor (red ellipse) to subdivide M(ij) into two matrices. (c) An MPS is obtained from the RBM by summing up all the hidden units belonging to each piece in (b). The number of cuts (links) made by the dashed vertical line is equal to the bond degrees of freedom of the MPS. (d) The matrix M(41) corresponding to the long-range connection is broken into a product of two matrices M1, M2, represented by the light pink diamonds. The red ellipse shows the product of the two identity matrices. . . . . . . . . . . . . 71
5.3 The optimum MPS representation of an RBM: (a)-(e) depict the stepwise construction. The set Xg is denoted by a light yellow rectangle/triangle and the set Yg by a dark blue rectangle. The set Zg, which provides the conditional independence of Xg and Yg, is represented as a light green rectangle. When the set Zg is given, the RBM function, which is interpreted as a probability, can be written as a product of two functions, one depending on Xg and the other on Yg. The variables in Zg are defined as the virtual bond of the MPS. The light gray lines show the connections included in the previous tensor. The connections being considered in the current step are denoted by dotted lines, and these are represented as Gt in Algo. 5.1.2. (f) The resulting MPS. . . . . . . . . . . . . . 73
5.4 (a) RBM after summing out the entire set of hidden units. (b) The curved lines represent the connections between visible units through hidden units. The whole system is split into two parts Xg and Yg, and the second is further split as Yg = Y1 ∪ Y2, where Y1 contains the visible variables that are directly connected to Xg. An alternative way is also expressed. (c) The MPS provided by this method has a smaller bond dimension as compared to Fig. 5.2. . . . . . 76
5.5 TNS to RBM transformation: graphical representation of (a) Eq. 5.12 and
(b) Eq. 5.13 and 5.14. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.6 The matrix A defined in Eq. 5.19 has a special form, represented by the dashed box. The blue dots denote identity matrices, and the square boxes represent the left and right matrices. To obtain the RBM parameters we apply the transformation of MPS to RBM according to Eq. 5.20. . . . . . . . . . . . . 81
5.7 The RBM architecture which is used to represent the cluster state. It has local connections; each hidden unit is connected with three visible units. . . . . . . 82
5.8 Effectiveness of a deep network as compared to a shallow network: (a) an RBM and (b) a DBM, both with the same nv (blue circles), nh = 3nv, and 9nv connections. The approach discussed in Sec. 5.1 can be used to represent both architectures as an MPS, with bond dimension D = 2⁴ for the DBM and D = 2² for the RBM. The dashed rectangle depicts the minimum number of visible units required to be fixed in order to cut the system into two subsystems. The bond dimension shows that a DBM can encode more entanglement than an RBM when an equal number of hidden units and parameters is provided. . . . 84
B.1 Tracing out even degrees of freedom. . . . . . . . . . . . . . . . . . . . . . . . . . 88
B.2 RG flow in coupling space; it depicts the stable and unstable fixed points. . . 89
B.3 RG flow: (a) coupling space in a different domain, from 0 to 1; (b) in the presence of an external magnetic field h. The arrows show the flow direction, and the blue lines between the two limits (K = 0,∞) depict the flow; these end up on the vertical axis h where K = 0. “×” signs on the vertical axis at K = 0 represent the fixed points. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
CHAPTER 1

INTRODUCTION
The human brain is jam-packed with neurons, and these neurons are connected together by links. A neuron is the fundamental unit of information processing in the brain. A network of these neurons receives input in the form of signals and generates an output after passing through its middle, or so-called hidden, layers. The brain learns by changing the connection strength between neurons. It is good at recognizing or identifying things but finds difficulty in multiplying two matrices, doing complex integration, etc. Conventional computers, in contrast, are good at performing a lot of calculations, such as multiplying a huge set of numbers, but are very weak at recognizing things. It turns out that the ability of recognition can be put into computers by mimicking the human brain. The set of methods and techniques in machine learning which mimic the brain is called artificial neural networks (ANNs), sometimes simply called neural networks. An ANN consists of layers of neurons, from few to many. In general, neural networks with three or fewer layers (input, hidden, and output) are called shallow neural networks, while neural networks with more than three layers are called deep neural networks (DNNs). DNNs are more powerful than shallow networks and have produced groundbreaking advances in automated translation, speech and image recognition, etc. The difference between a shallow and a deep network is more than just the number of layers. To make a shallow network learn, one has to do feature engineering: transforming the input in such a way as to have an accurate input-to-output mapping. This increases human labor and makes some tasks almost impossible. On the other hand, deep networks find a good representation of raw data by themselves, which enables them to learn better and perform well on the tasks
which shallow networks can’t do. This remarkable strength of deep networks does not come free of cost. Deep networks often require billions of parameters to learn, which makes them computationally expensive as compared to shallow networks. But today’s computers are good enough to simulate deep networks with many layers.
Despite the remarkable success of DNNs, there is very little theoretical understanding of how DNNs work. Why are DNNs better than shallow networks? What is the underlying mechanism which makes them find a good representation of data? Recently, Carleo and Troyer [1] made a great attempt to use an ANN as a representation of the quantum state of a many-body system. They showed that the restricted Boltzmann machine, a generative artificial neural network model commonly known as an RBM, can be used as a variational wave function for a quantum many-body system at or away from equilibrium. Dong-Ling Deng [2] showed that an RBM can be used to construct topological states. He also studied the quantum entanglement properties of RBMs and proved that short- and long-range RBM states follow the area and volume law respectively [3]. Giacomo Torlai and Roger G. Melko [4] showed that one can study the whole thermodynamics of a statistical system by using an RBM. Many different efforts have been made in this regard. These developments lead us to some important questions about the representational power of ANNs in the physics context. Is the standard variational wave function approximation less expressive than an RBM? Can we use RBMs to study systems at criticality? In this study, we address these questions in detail and try to understand the underlying working of DNNs and their expressive power for deep learning tasks as well as for quantum states.
The RG is mathematical machinery that allows systematic investigation of the changes of a physical system as observed at different length scales. We will see how the coarse-graining process in real-space RG is related to DNN algorithms. To be more specific, we will study an exact mapping between variational RG and the RBM [5]. Tensor network methods are widely used in quantum many-body systems; the relationship between RBMs and tensor networks will also be closely examined, and we will try to quantify the expressive power of RBMs from a quantum information perspective [6].
1.1 Restricted Boltzmann machine
The neural network used in this regard is the restricted Boltzmann machine, in short RBM. The RBM was developed by the machine learning community with motivation from statistical physics. It is a generative stochastic neural network, used to automatically detect the inherent patterns in unlabelled data by reconstructing its input. Instead
of outputting the right answer, it constantly learns to construct data that is as close as possible to the data it is being trained with. An RBM can learn its input’s probability distribution in an unsupervised or supervised manner; therefore, labelled data is not an obligation to train it. This property makes RBMs a perfect fit for tasks like feature learning, topic modelling, dimensionality reduction, classification, and recommender systems, and one can create DNNs by training them layer by layer in a greedy manner. The RBM is a building block that can be used to create DNNs, for instance deep belief networks (DBNs). It contains one hidden layer and one visible layer; the connections are only between units of consecutive layers, in other words there is no intra-layer connection, as shown in Fig. 1.1. Due to this restriction we call it a restricted Boltzmann machine. These connections between units form a bipartite graph, i.e. the pair of layers has symmetric connections. That means we take a sample and update the states of the hidden layer, and then we reconstruct the visible units by updating the hidden units again. We
go both ways to update the weights instead of one way. The RBM is a member of the family of energy-based models; the energy of this network is given as

E(v,h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i W_{i,j} h_j .   (1.1)
Similar to the Boltzmann distribution in statistical physics, where configurations with lower energy have higher probability, RBM learning corresponds to finding an energy function that maps low-energy configurations to the desired values and higher energies to the inaccurate values. The goal is to minimize the cross-entropy loss and maximize the product of probabilities assigned to some training set. The energy of a configuration of the pair of boolean vectors v and h is shown in Eq. 1.1: v corresponds to the visible units, h corresponds to the hidden units, and a, b are the biases of v and h respectively. The system starts by producing random values and over time it comes to thermal equilibrium, producing data whose distribution is close to that of the original dataset. This doesn’t mean that the system will eventually settle down into the lowest-energy configuration possible, but rather that the probability distribution over all configurations settles down. RBMs are two-layered neural networks that form the building blocks of deep belief networks (DBNs). An RBM extracts higher-level features from low-level features; training them incrementally and stacking them creates a DBN.
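The energy in Eq. 1.1 and the "both ways" Gibbs updates described above can be sketched in a few lines of code. This is an illustrative sketch rather than the code used in this thesis; the function names (rbm_energy, sample_hidden) and the toy sizes are ours.

```python
import numpy as np

def rbm_energy(v, h, a, b, W):
    """Energy of a joint configuration (v, h), Eq. 1.1:
    E(v,h) = -sum_i a_i v_i - sum_j b_j h_j - sum_{i,j} v_i W_ij h_j."""
    return -(a @ v) - (b @ h) - (v @ W @ h)

def sample_hidden(v, b, W, rng):
    """One Gibbs half-step: sample h given v using
    p(h_j = 1 | v) = sigmoid(b_j + sum_i v_i W_ij)."""
    p = 1.0 / (1.0 + np.exp(-(b + v @ W)))
    return (rng.random(p.shape) < p).astype(int)

rng = np.random.default_rng(0)
nv, nh = 4, 3                          # toy layer sizes
v = rng.integers(0, 2, nv)             # a random binary visible vector
a, b = np.zeros(nv), np.zeros(nh)      # biases
W = 0.1 * rng.standard_normal((nv, nh))
h = sample_hidden(v, b, W, rng)        # update hidden given visible
print(rbm_energy(v, h, a, b, W))
```

Sampling the visible layer given the hidden one is the mirror-image half-step, which is how the alternating reconstruction proceeds.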
Figure 1.1: The structure of the Restricted Boltzmann machine.
1.2 Deep neural network in renormalization group perspective
The renormalization group (RG) allows theoretical understanding of divergent problems in statistical physics and quantum field theory. In statistical physics, the “real-space” renormalization group enables one to deal with phase transitions and produces a neat table of critical exponents. Other types of renormalization group procedures exist, but the idea is common to all RG studies: re-expressing the parameters {K} which define the problem in terms of other parameters {K′} (perhaps simpler) while keeping constant those physical aspects of the problem which are of interest. This may be achieved through some kind of iterative coarse-graining of degrees of freedom at short distances while studying physics at long distances. The goal is to extract the relevant features which describe the system at larger length scales while marginalizing over short-distance irrelevant features. In the RG procedure, starting with a fine-grained system, one marginalizes fluctuations at the microscopic level and proceeds towards the macroscopic level to obtain a coarse-grained system. During this sequence, the features needed to describe the behaviour of the system at the macroscopic level are dubbed relevant operators, while irrelevant operators have a diminishing effect on the system at the macroscopic scale.
For most problems it is very difficult to carry out the renormalization group exactly, so the physics community has developed many variational schemes to approximate the RG procedure. One such variational scheme, belonging to the real-space class, was introduced by Kadanoff to execute RG on a spin system. Kadanoff's RG scheme introduces unknown coupling parameters which couple the coarse-grained system to the fine-grained system. The mapping is carried out by defining auxiliary or hidden spins and coupling them to the physical spins through the unknown parameters. The free energy Fh, dependent on the coupling parameters, is determined for the coarse-grained system from the coupled system by marginalizing or integrating out the physical spins. The coupling parameters are chosen (variationally) in such a way that the difference between the free energies ∆F = Fh − Fv of the hidden spin system and the physical spin system becomes minimal. Achieving ∆F ≈ 0 ensures that the long-distance physical features are encoded in the coarse-grained system. The mathematical structure of the procedure carried out in this way defines an RG transformation which maps the physical spin system into the hidden spin system while preserving the long-distance physics. The next round of renormalization is carried out by serving the hidden spins as input.
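The flavour of iterated coarse-graining can be illustrated with a short numerical sketch. This is not Kadanoff's variational scheme itself, but the exact decimation RG for the 1D Ising chain, the simplest case where marginalizing over spins can be done in closed form:

```python
import math

def decimate(K):
    """One exact RG step for the 1D Ising chain: summing out every other
    spin renormalizes the coupling as K -> K' = 0.5 * ln(cosh(2K))."""
    return 0.5 * math.log(math.cosh(2.0 * K))

# Iterating the map: any finite coupling flows towards the K = 0
# (high-temperature, disordered) fixed point as short-distance
# fluctuations are marginalized away step by step.
K = 1.0
flow = [K]
for _ in range(6):
    K = decimate(K)
    flow.append(K)
```

Each application of `decimate` plays the role of one coarse-graining layer: the renormalized coupling summarizes the marginalized microscopic spins.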
One perspective from which to understand the success of DNNs is the renormalization group (RG). DNNs can be viewed as an iterative coarse-graining process, where each successive layer in the network learns a higher-level abstract feature representation of the data. As already said, the output from the previous layer becomes the input for the next layer. The input layer receives low-level features in the form of raw data, and the next layer learns relevant and more abstract features. This process continues until the raw data becomes an abstract concept. By this feature-extraction mechanism, DNNs learn relevant features and at the same time ignore irrelevant ones.
This suggests that DNN algorithms may be applying an RG-like scheme to learn relevant features from data. The ANN model we use to exhibit the relation between RG and DNNs is the RBM. We will see that the RBM has an exact map to variational RG [5].
1.3 Correspondence between RBM and Tensor Networks (MPS)
Solving the quantum many-body problem is a challenging task, as the Hilbert space corresponding to the system grows exponentially. This exponential complexity is the result of the non-trivial correlations encoded in the system. To tackle such a difficult problem, variational approaches and sampling methods are usually adopted. Sampling methods include Monte Carlo, while the variational approach includes many examples, from a straightforward mean-field approximation to more involved methods such as those based on tensor network states and, as a recent development, neural network states [1]. An efficient description of the required quantum many-body state is the first and crucial step of the variational method. A description of a quantum many-body state is called efficient if the number of parameters needed to distinguish those quantum states grows polynomially with the number of particles or degrees of freedom in the system. Having found an efficient description, one can determine the variational parameters by combining it with powerful learning methods, such as gradient descent.
Figure 1.2: Transformation of RBM to MPS: (a) graphical notation of the RBM defined by Eq. 1.2. (b) The matrix product state (MPS) in graphical notation. Here the dangling links correspond to the physical variables v_i and A^{(i)} is a three-index tensor. The thickness of the horizontal link between tensors shows the virtual bond dimension.
ANNs are renowned for representing complex probability distributions and complex correlations, and have recently found prominent success in many AI applications through the popularity of DL methods. Numerical results show that an RBM, trained by a reinforcement learning technique, produces quality solutions for a wide variety of many-body models [1]. The quantum state in the RBM representation, which is efficient, can be written (without the normalization factor) by considering Fig. 1.1, with bias vectors a, b for the visible and hidden layer respectively:
Ψ_RBM(v) = ∑_h e^{−E(v,h)} = ∏_i e^{a_i v_i} ∏_j (1 + e^{b_j + ∑_i v_i W_{ij}}),   (1.2)
where the energy E(v,h) of the network is given in Eq. 1.1.
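As a numerical sanity check of Eq. 1.2 (an illustrative sketch with random, made-up parameters, not a trained machine), the analytic sum over hidden units can be compared against a brute-force sum over all 2^nh hidden configurations:

```python
import numpy as np

# Random (made-up) RBM parameters for illustration.
rng = np.random.default_rng(0)
nv, nh = 4, 3                        # numbers of visible and hidden units
a = rng.normal(size=nv)              # visible biases
b = rng.normal(size=nh)              # hidden biases
W = rng.normal(size=(nv, nh))        # visible-hidden couplings

def psi_rbm(v):
    """Unnormalized amplitude of Eq. 1.2 (hidden units summed analytically)."""
    v = np.asarray(v, dtype=float)
    return np.exp(a @ v) * np.prod(1.0 + np.exp(b + v @ W))

def psi_brute(v):
    """The same amplitude by explicitly summing e^{-E(v,h)} over all 2^nh h."""
    v = np.asarray(v, dtype=float)
    total = 0.0
    for idx in range(2 ** nh):
        h = np.array([(idx >> j) & 1 for j in range(nh)], dtype=float)
        total += np.exp(a @ v + b @ h + v @ W @ h)
    return total

v = [1, 0, 1, 1]
```

The product form is what makes the RBM amplitude cheap to evaluate: the cost is linear in nh instead of exponential.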
On the other hand, Tensor Network (TN) methods provide a parametrization of a wave function which represents a quantum state, exponential in its degrees of freedom, with polynomial resources. Fig. 1.2(b) shows the simplest TN state: the matrix product state (MPS). The MPS parametrized representation of a wave function of n_v physical spins or variables is given as

Ψ_MPS(v) = Tr ∏_i A^{(i)}[v_i].   (1.3)
Here A^{(i)} is a tensor with three indices, as depicted in Fig. 1.2(b). The physical variables v_i are represented by vertical dangling bonds. For a specific value of v_i, A^{(i)}[v_i] is a matrix. The matrix dimension, commonly known as the virtual bond dimension of the MPS, is depicted by the thickness of the bond between two tensors in Fig. 1.2(b). After specifying the values of the v_i, the rest of the process is just contracting all virtual bonds; taking the trace of the resulting matrix gives a number. This number represents the coefficient (probability amplitude) of the specified physical state. An MPS can enhance its capability to represent complex multivariable functions by increasing the bond dimension.
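The contraction just described can be sketched numerically (random tensors, used for illustration only):

```python
import numpy as np

# Random MPS tensors for illustration; index order of each A^{(i)} is
# (physical v_i, left bond, right bond).
rng = np.random.default_rng(1)
nv, D = 6, 4                          # number of spins, virtual bond dimension
A = [rng.normal(size=(2, D, D)) for _ in range(nv)]

def mps_amplitude(v):
    """Coefficient of configuration v as in Eq. 1.3: fix each physical index,
    multiply the resulting D x D matrices, and take the trace."""
    M = np.eye(D)
    for site, vi in enumerate(v):
        M = M @ A[site][vi]
    return float(np.trace(M))

amp = mps_amplitude([0, 1, 1, 0, 1, 0])
```

The cost per amplitude is O(n_v D^3), polynomial in the number of spins, even though the full state has 2^{n_v} coefficients.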
Over the previous few decades, a concrete understanding of TNs in both the theoretical and numerical regimes has been established, with many applications. We can now also interpret the effectiveness of TNs. It is based on the area law of entanglement entropy (EE): the EE grows with the size of the boundary separating two subsystems. Low-energy states of gapped Hamiltonians follow this area law. As a consequence, the number of degrees of freedom required to define a physical state (in which we are interested) is exponentially smaller than the entire set of degrees of freedom. TN methods were developed to describe such comparatively low-entanglement states and have accomplished exceptional success over the previous few years.
RBM and TN are similar in their mathematical structure, in particular when expressed in the graphical language shown in Fig. 1.2(a) and (b). The working of DNN methods consists of finding good representations or extracting features, a tiny fraction of the data compared to the input; examples include PCA, autoencoders, etc. This provokes us to look for a guiding principle from the viewpoint of quantum information: a principle that can be used to quantify the representation power of ANNs for deep learning as well as for physics problems. We will also see that the RBM can be represented as a TN and vice versa; in fact, they are equivalent [6].
The main resources used for this MS thesis are the following: the 2nd chapter is based on the book "Deep Learning" [7] and the paper "Learning Deep Architectures for AI" [8]. The primary resources for the 3rd chapter are the tensor network papers [9, 10]. The 4th and 5th chapters are based on the papers [5] and [6] respectively.
CHAPTER 2
MACHINE LEARNING BASICS
Deep learning is a particular type of machine learning. The basic principles of machine learning are crucial to understanding deep learning well. This chapter is devoted to explaining the basic building blocks of machine learning and a particular type of deep learning model, namely energy-based models.
2.1 Learning Algorithms
A machine learning algorithm can be defined loosely in just five words, "using data to answer questions", or better, we can split the definition into two parts, "using data" and "answer questions". These two pieces broadly outline the two sides of a machine learning algorithm, both equally important. Using data is what we refer to as training, and answering questions is referred to as making predictions or inference. A conventional definition of a learning algorithm is provided by Tom Mitchell: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." For example, if we are asked to write a learning algorithm to classify emails as spam or not spam, what will be the task T, performance measure P and experience E? In this problem, classifying emails as spam/not spam is the task T, E will be watching classified examples labelled by us as spam/not spam, and P can be the number of emails correctly classified by the learning algorithm. In the upcoming sections we describe E, P and T with examples.
2.1.1 The Task, T
There are a few AI tasks which can be hand-coded or explicitly programmed, such as finding the shortest distance between two points. Some tasks are very difficult to solve because they require intelligent behaviour, and there is no way to perform them by explicit programming. Machine learning not only enables us to solve such problems but also increases our understanding of the principles and basis of intelligence.
The task is somewhat different from learning; developing the ability to do a task is another way to define learning. The "task" is simply a job. For example, if we want a helicopter to fly itself, then flying is the task. We can program the helicopter to learn to fly, but it is very difficult to write directly a program (which specifies the process of flying step by step) for flying.
A task is the particular way in which a machine learning system processes an example: a set of features, provided to the machine learning system for processing, which qualitatively characterizes some event or object. We use a vector x ∈ ℝ^n to denote an example, and x_i represents a feature of the example. A few machine learning tasks are listed below:
• Classification: Classification is simply the process of taking some kind of input x and mapping it to one of k discrete labels like true/false or 0/1. In a classification task, the learning algorithm is required to produce a function f : ℝ^n → {1, ..., k}, i.e. a classifier which takes an input and outputs a discrete label. An example of a binary classification task is predicting whether a tumour is malignant or benign, where each example is a feature vector x consisting of the patient's age, tumour size, etc., and the output is 0/1 or a probability distribution over these two classes. Another example is object recognition, where the input vector is an image and the output is a label which identifies the object in the image.
• Regression: In this machine learning task, the computer program learns to predict a real output y given some input x, where y is continuous. To solve this type of task, the learning algorithm is required to output a function f : ℝ^n → ℝ. An example of a regression task is a housing price prediction model, where we are given a set of examples ({x, y}) and the computer program predicts a real output on some unseen inputs.
• Density or probability estimation: In this task, the density estimation problem, a learning algorithm needs to learn a function p_model : ℝ^n → ℝ. Here p_model is interpreted as a probability mass function (if x is discrete) or a probability density function (if x is continuous) on the example space. It is a bit tricky to define a performance measure for such a task; we will explain this in an upcoming section. The learning algorithm has to learn the underlying structure of the data; it has to know where the example space is dense and where it is sparse. Density estimation captures the distribution just by seeing examples. In this type of task we desire to approximate the true/empirical probability distribution p_true and hope that p_model ≈ p_true. After learning, we can generate new examples, which also belong to the given example space, from the approximated probability distribution.
We can define many other machine learning tasks as well, for example denoising, synthesis and sampling, anomaly detection, imputation of missing values, classification with missing values, etc.
2.1.2 The Performance measure, P
After defining our task T we need to figure out how to measure the performance P of the learning algorithm; that is, we need a metric. The performance measure depends upon the task being carried out by the learning algorithm. For example, in classification problems, accuracy is a commonly used metric. The accuracy of a classifier is defined as the number of correct predictions divided by the total number of data points. This begs the question, though: which data do we use to compute the accuracy? What we are really interested in is how well our model will perform on new, unseen data examples. We could compute the accuracy on the data used during the fitting process of the classifier. However, as that data was used to train it, the classifier's performance there will not be indicative of how well it can generalize to unseen data. For this reason it is the usual convention to split the data into two sets, a training set and a test set. We train or fit the classifier on the training set, make predictions on the labelled test set, and compare these predictions with the known labels. The resulting agreement defines the accuracy of the classifier. We can also use the error rate as a performance metric and refer to it as the 0−1 loss: the loss is 0 if an example is correctly classified, and 1 otherwise.
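The accuracy and 0-1 loss just described amount to a few lines of code (the labels below are made up for illustration):

```python
def accuracy(y_true, y_pred):
    """Fraction of correct predictions, e.g. on a held-out test set."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def error_rate(y_true, y_pred):
    """Average 0-1 loss: 0 for a correct prediction, 1 otherwise."""
    return 1.0 - accuracy(y_true, y_pred)

y_test = [1, 0, 1, 1, 0]      # made-up true labels of a test set
y_hat  = [1, 0, 0, 1, 0]      # made-up classifier predictions
```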
Defining a performance metric for a particular task is not straightforward; in fact, it is a very difficult problem itself. For instance, in the density estimation task we know the quantity we would like to measure, but measuring it is infeasible. It makes no sense to use metrics like accuracy and error rate for a probability estimation task; we need some other type of performance measure which outputs a continuous value for each data example. The common practice is to report the average log-probability the model assigns to the examples. Computation of the actual distribution, in which every point in the space is assigned a probability value, is often intractable.
2.1.3 The Experience, E
After defining the task T and performance measure P, let us drill into the experience E in the machine learning process. Usually, there are two major types of machine learning algorithms, supervised and unsupervised. This categorization is based on what sort of experience a machine learning algorithm is allowed to undergo during the learning process. Almost all learning algorithms experience a whole dataset, i.e. a collection of many, many examples. The UCI repository [11] hosts many publicly available datasets, for example the Adult dataset, which consists of 48842 examples where each example has 14 features like age, occupation, capital-gain, capital-loss, etc.
• Unsupervised: learning algorithms learn properties of the structure of the dataset by experiencing the whole dataset. In deep learning, our goal is to approximate the true probability distribution through which the dataset is generated. One example of such a task is density estimation. Clustering is an unsupervised task, in which the learning system divides the dataset into clusters or groups of similar examples; k-means is an example of a clustering algorithm.
• Supervised: learning algorithms require datasets with labelled examples, meaning each example has a label or target value, as a feature, which we need to predict. For example, the MNIST dataset [12] is widely used for hand-written digit recognition systems; it contains 60,000 examples of hand-written digits as 28×28 grayscale pictures with true labels.
The difference between the two types is clear: supervised learning algorithms have supervision in terms of target values, while unsupervised learning algorithms have no supervision. In the former case, the learning algorithm finds optimum parameters that give us a good mapping from input x to output y. In the latter case, the learning algorithm, given an unlabelled dataset, emphasizes learning the underlying probability distribution of the data.
Although we have divided learning algorithms into two categories, the distinction is not sharp, because supervised tasks can be learned by unsupervised learning technologies and vice versa. It is nevertheless often helpful to roughly categorize learning algorithms. In practice, regression and classification tasks are considered supervised, while density estimation is considered unsupervised.
We can have other categories of learning algorithms as well. For example, in semi-supervised learning we have a partially labelled dataset, i.e. some examples are labelled and the remaining are not. Another example is reinforcement learning, in which the learning system does not experience a whole dataset and has a feedback mechanism.
There are various ways to organize a dataset; a common one is the design matrix. In this representation of the data we have rows and columns: each row corresponds to an example and each column corresponds to a feature. In a design matrix, all examples have an equal size/number of features, but this is not always the case. For instance, if the dataset consists of pictures and each picture has different dimensions, then we have to find another way to represent the data.
2.2 Linear regression
In the linear regression problem, the goal or "task T" is to develop a system which takes an example in the form of a vector x ∈ ℝ^n as input and predicts a scalar y ∈ ℝ as output. In the linear regression task, as the name implies, our model is a linear function of the input. Let h_w be our model, which predicts the value that should be the output for a given example:

h_w(x) = wᵀx = ∑_{i=1}^{n} w_i x_i,   (2.1)

where w ∈ ℝ^n is a vector of parameters. In the above equation, each weight w_i in the weighted sum of the input x corresponds to the effect of a particular feature, say x_i, on the output y. So, mathematically, learning consists of finding optimum parameters w such that the error between the value predicted by our model h_w and the true value y is minimized. The performance measure P can be the mean squared error on the test set. It should be measured on unseen examples; for that purpose we define a design matrix X^(test) of m_test examples with target values y^(test). The idea is that we train the model on the training dataset {X^(train), Y^(train)}, and the test dataset {X^(test), Y^(test)} tells us whether or not the model has learned good parameters for generalization:
J(w) = (1/(2 m_test)) ∑_i (h_i^(test) − y_i^(test))²,   (2.2)

where J is our mean squared error function and m_test is the size of the test dataset.
To find the optimum parameters we need to design a learning algorithm that learns or improves the parameters by reducing J while gaining experience of the training dataset. One obvious way of doing this is just to minimize J by solving ∇_w J = 0 on the training dataset. One can solve this gradient equation analytically and get the following:

w = (XᵀX)⁻¹XᵀY.   (2.3)

The above solution is nice and elegant, but it has limitations: the inverse (XᵀX)⁻¹ must exist and the feature space should not be too large. The normal equation solution, Eq. 2.3, is preferred when we have a small dataset or feature space; otherwise we need some other way to find the optimum parameters, such as gradient-based optimization. That algorithm is the topic of the next section.
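A minimal sketch of the normal-equation solution of Eq. 2.3 on made-up, noise-free data (the pseudo-inverse is used as a guard in case XᵀX is singular):

```python
import numpy as np

# Toy training data generated from y = 1 + 2*x1 - 3*x2 (noise-free, so the
# recovered weights are exact); a column of ones absorbs the bias term.
rng = np.random.default_rng(2)
X_raw = rng.normal(size=(20, 2))
X = np.hstack([np.ones((20, 1)), X_raw])
y = X @ np.array([1.0, 2.0, -3.0])

# Closed-form normal-equation solution of Eq. 2.3; the pseudo-inverse
# behaves sensibly even when X^T X happens to be singular.
w = np.linalg.pinv(X.T @ X) @ X.T @ y
```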
2.3 Gradient descent algorithm
Optimization techniques are central to machine learning algorithms. Optimization is defined as maximizing or minimizing some function, and the function to be optimized in a learning algorithm is J(w). Almost every learning algorithm uses some sort of optimization method to find optimum parameters. Most of them use the gradient descent (GD) algorithm, in which the derivative is used to move towards a minimum on the parameter surface. The algorithm consists of just two steps: take the derivative of the error function J(w) and modify the parameters accordingly, as shown in Fig. 2.1. Let us take a concrete example and consider a linear regression model with parameters w. The GD algorithm updates the parameters as

w_j := w_j − α ∂J(w_0, w_1, ..., w_N)/∂w_j ;   ∀ j ∈ {0, 1, 2, ..., N}   (2.4)

where α is the learning rate. Here all the w_j should be updated simultaneously. The GD algorithm does not guarantee reaching the global minimum; it can get stuck in a local minimum. In that case, one can add additional terms to the GD algorithm, such as a momentum term.
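The update rule of Eq. 2.4 applied to the linear regression cost can be sketched as follows (toy noise-free data; the learning rate α = 0.5 is an arbitrary illustrative choice):

```python
import numpy as np

# Batch gradient descent (Eq. 2.4) on the mean-squared-error cost of a
# one-feature linear model; toy data drawn from y = 0.5 + 3x without noise.
rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=50)
y = 0.5 + 3.0 * x
X = np.stack([np.ones_like(x), x], axis=1)   # bias column + feature

w = np.zeros(2)
alpha = 0.5                                   # learning rate (illustrative)
for _ in range(500):
    grad = X.T @ (X @ w - y) / len(y)         # dJ/dw for J = (1/2m)||Xw - y||^2
    w = w - alpha * grad                      # simultaneous update of all w_j
```

Because the cost here is convex, the iteration converges to the same weights the normal equation would give.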
2.3.1 Stochastic Gradient Descent
Stochastic gradient descent is a very important extension of the gradient descent algorithm; almost all of deep learning is powered by it. A big dataset is necessary for the best generalization, but at the same time it is computationally expensive.
Figure 2.1: Gradient descent: initialize w randomly, take the gradient at this point, and update to the next point according to the gradient. If the gradient is negative, as shown here, then the updated value of w will be greater by the amount α ∂_w J. If the gradient is positive, the next value of w will be smaller by the same amount. The GD algorithm takes small steps in this way and reaches the minimum.
The idea is to decompose the cost function into a sum over training examples of some per-example cost function. For instance, the negative conditional log-likelihood of the data can be expressed as

J(w) = (1/m) ∑_{i=1}^{m} L(x^(i), y^(i), w),   (2.5)

where L is the per-example loss, defined as L(x, y, w) = −log P(y | x, w). For optimization, GD needs to compute

∂J(w)/∂w = (1/m) ∑_{i=1}^{m} ∂L(x^(i), y^(i), w)/∂w.   (2.6)
The computational cost increases linearly with the size of the data, i.e. O(m). Notice that the gradient, after this decomposition, is an expectation of per-example gradients, so the expectation can be estimated from a small set of examples. In particular, on each step of training we uniformly sample a small training set (minibatch) B = {x^(1), ..., x^(m′)}, where m′ is much smaller than the total training size. The gradient estimate is computed as

q = (1/m′) ∑_{i=1}^{m′} ∂L(x^(i), y^(i), w)/∂w   (2.7)

and the parameter update as

w := w − αq,

where α is the learning rate.
In general, the gradient descent algorithm is slow and unreliable, and it can get stuck in local minima. But it is still useful because it provides sufficiently low test error in a reasonable amount of time. The per-step cost of SGD is independent of the training size m, and it converges very quickly; it thus provides a cost of training a model that is essentially constant as a function of m.
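A minibatch SGD sketch on the same kind of toy linear model (the batch size and learning rate are arbitrary illustrative choices):

```python
import numpy as np

# Stochastic gradient descent: each update uses a minibatch B of m' examples
# to estimate the full gradient (Eqs. 2.5-2.7). Toy model: y = 0.5 + 3x.
rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, size=1000)
y = 0.5 + 3.0 * x
X = np.stack([np.ones_like(x), x], axis=1)

w = np.zeros(2)
alpha, m_prime = 0.1, 32                        # illustrative choices
for _ in range(2000):
    idx = rng.integers(0, len(y), size=m_prime)  # uniformly sampled minibatch
    Xb, yb = X[idx], y[idx]
    q = Xb.T @ (Xb @ w - yb) / m_prime           # minibatch gradient estimate
    w = w - alpha * q                            # w <- w - alpha q
```

Note that each step touches only m′ = 32 examples, so the per-step cost does not grow with the dataset size.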
2.4 Maximum Likelihood Estimation
Point estimation is a way to predict some quantity of interest, a parameter or a set of parameters, in a single attempt. Sometimes we estimate a function f(x) (assuming there exists a relation between input x and output y such as y = f(x) + ε) instead of a parameter of a function. Besides these approaches, there is a principle from which we can derive particular functions that are good estimators for different models. The most common one is Maximum Likelihood (ML). Consider a dataset X of m examples drawn independently from an unknown but true probability distribution P_data(x), and let P_model(x; w) be a family of distributions, parametrized by w, over the same space. It estimates the true probability P_data(x) by providing a mapping from any configuration x to a real number. The ML estimator for w is defined as
between a configuration x to a real number. The ML estimator for w is defined as
wML = max iw
mize Pmodel (X; w)
= max iw
mizem∏
i=1Pmodel(x(i); w). (2.8)
In practice, the product of probabilities is inconvenient for many reasons, such as numerical underflow. Taking the log of the above expression leads to a sum of log-probabilities, which is easy to compute and has the same maximizer as the product of probabilities:

w_ML = argmax_w ∑_{i=1}^{m} log P_model(x^(i); w).   (2.9)
The above equation can be transformed into an expectation with respect to the empirical distribution P_data by dividing by m:

w_ML = argmax_w ⟨log P_model(x; w)⟩_{x∼P_data}.   (2.10)
One way to think about maximum likelihood estimation is as minimizing the difference between the true/empirical distribution P_data and the estimated distribution P_model, measured by the Kullback-Leibler (KL) divergence:
D_KL(P_data ∥ P_model) = ⟨log P_data(x) − log P_model(x)⟩_{x∼P_data}.   (2.11)
Notice that the first term depends only on the data-generating process, not on the model, so we can train the model by just minimizing

−⟨log P_model(x)⟩_{x∼P_data},   (2.12)

which is the same as maximizing Eq. 2.10. Thus minimizing the negative log-likelihood is equivalent to maximizing the likelihood. We will use this to train the restricted Boltzmann machine.
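The equivalence can be illustrated for a Bernoulli model (a made-up toy dataset), where the ML estimate is known in closed form to be the sample mean:

```python
import math

# Bernoulli data; the ML estimate of theta minimizes the average negative
# log-likelihood (Eq. 2.12), which for this model is the sample mean (0.7 here).
data = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]

def nll(theta):
    """Average negative log-likelihood of the data under Bernoulli(theta)."""
    return -sum(x * math.log(theta) + (1 - x) * math.log(1 - theta)
                for x in data) / len(data)

# A grid search over theta confirms the minimizer is the sample mean.
grid = [i / 100 for i in range(1, 100)]
theta_ml = min(grid, key=nll)
```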
2.4.1 Conditional Log-Likelihood
We can estimate the conditional probability P(y | x; w) by generalizing maximum likelihood. The conditional probability is fundamental to supervised learning algorithms; P(y | x; w) is interpreted as the probability of y given x, parametrized by w. The estimator is given by

w_ML = argmax_w ∑_{i=1}^{m} log P_model(y^(i) | x^(i); w),   (2.13)

where y^(i) is the target value of the i-th example.
2.5 Logistic Regression or Classification
The linear regression problem is all about predicting a real value using a linear model, but there are many problems in which one needs to predict discrete values such as spam/not spam, tumour/not tumour, etc. We cannot simply use the linear regression model here with a threshold such as output = 1 if h(x) > 0.5, otherwise 0. For this classification problem we use the sigmoid function, which provides a smooth value from 0 to 1. The sigmoid function is given as

σ(x) = 1 / (1 + e^{−x}).   (2.14)

Consider a dataset X consisting of m examples with target values y, where the task is to predict a binary output. We use the sigmoid function as our model and interpret its output as the probability of the label being 1. For this model the cost function cannot be the same as before, since the squared error would not be a convex function of the parameters. So we define a cost function from the log-likelihood as

J(w) = −(1/m) [ ∑_{i=1}^{m} y^(i) log h_w(x^(i)) + (1 − y^(i)) log(1 − h_w(x^(i))) ],   (2.15)
Figure 2.2: Sigmoid function: this function outputs a smooth value between 0 and 1.
where

h_w(x) = 1 / (1 + e^{−wᵀx}).   (2.16)

We use the gradient descent algorithm to optimize this cost function and get exactly the same update form as for linear regression (Eq. 2.4), but with a different model. So the algorithm is exactly the same as Eq. 2.4: take the derivative and update the values of the parameters simultaneously.
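A pure-Python sketch of logistic regression trained with the gradient descent update of Eq. 2.4 on the log-loss of Eq. 2.15 (toy 1-D separable data, label 1 iff x > 0):

```python
import math

# Logistic regression on a toy 1-D dataset, trained by batch gradient descent.
data = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b, alpha = 0.0, 0.0, 0.5
for _ in range(2000):
    gw = gb = 0.0
    for x, y in data:
        err = sigmoid(w * x + b) - y       # derivative of the per-example log-loss
        gw += err * x / len(data)
        gb += err / len(data)
    w, b = w - alpha * gw, b - alpha * gb  # simultaneous parameter update

predictions = [1 if sigmoid(w * x + b) > 0.5 else 0 for x, _ in data]
```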
2.6 Overfitting, Underfitting, and Regularization
What distinguishes a learning algorithm from a pure optimization problem is its ability to generalize to new unseen examples, i.e. its performance on unseen data. In the training process we emphasize reducing the training error, but we care more about the test error. The characteristics of a good machine learning algorithm are its capability to make:
• the training error small.
• the difference between test and training error small.
If a model does not respect the first property, it is said to be underfitting, and the violation of the second property leads to overfitting, as shown in Fig. 2.3. We can control the fitting ability of a model by shrinking or expanding its hypothesis space: the space of functions which can be the solution. For instance, the set of all linear functions h = b + wx is included in the hypothesis space of linear regression; by increasing the degree of the polynomial we expand the hypothesis space, as in h = b + w1x + w2x². So linear models have less capacity compared to quadratic ones.
How can one decide the optimum capacity for a particular model? The principle of parsimony (extended to statistical learning theory by Vapnik [13]) can be used in this regard; it states that "among competing hypotheses that explain known observations equally well, we should choose the simplest one." In particular, we make the capacity of the model sufficiently large and keep regularizing the learning parameters, modifying the error function as

J(w) = (1/(2 m_train)) ∑_i (h_i^(train) − y_i^(train))² + λ wᵀw,   (2.17)

where λ is the regularization constant. It restricts the values of the learning parameters w to an interval and prevents the model from over/underfitting, as shown in Fig. 2.4. The training process consists of minimizing J(w), so if we make the value of λ very large then w will be very small, and vice versa.

Figure 2.3: Overfitting and underfitting: (a) the underfitting problem occurs because the data is more complex than a linear model; (b) a cubic model is appropriate for the provided data; (c) overfitting, where a polynomial of degree 5 is used as a model which is more complex than the given data.
Figure 2.4: Effect of λ on the model: (a) the value of λ is too large, which forces w → 0; (b) a moderate value of λ is used; (c) when λ → 0, w becomes very large and the model will be more complex than the data.
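The effect of λ in Eq. 2.17 can be sketched with the closed-form ridge solution w = (XᵀX + λI)⁻¹Xᵀy (a standard result, stated here without derivation; the data are made up):

```python
import numpy as np

# Closed-form ridge solution illustrating how the regularization constant
# lambda in Eq. 2.17 shrinks the learned weights.
rng = np.random.default_rng(5)
X = rng.normal(size=(30, 3))
y = X @ np.array([4.0, -2.0, 1.0])          # noise-free toy targets

def ridge(lam):
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

w_small = ridge(1e-6)    # lambda -> 0: (almost) the unregularized solution
w_large = ridge(1e4)     # very large lambda: weights driven towards zero
```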
2.7 Neural Networks
Biologically inspired neural networks are a programming paradigm which allows a computer to learn from observed data, and deep learning consists of powerful techniques for learning in the neural network domain. To understand neural networks, first look at a single Perceptron. A Perceptron takes a weighted binary input and outputs a binary value: if the weighted sum ∑_{i=1}^{N} w_i x_i is greater than or equal to some value, call it the threshold, then the output is 1, otherwise 0, as shown in Fig. 2.5(a). A model which takes a weighted input but outputs a smooth real value between 0 and 1 is called a sigmoid neuron. A sigmoid neuron uses the sigmoid function defined in Eq. 2.14, whose plot is shown in Fig. 2.5(b).
Figure 2.5: Simple models of a neuron: (a) Perceptron: takes a weighted binary input and outputs a binary value; (b) sigmoid neuron: also takes a weighted input but outputs a smooth real value between 0 and 1.
A network of Perceptrons can be formed by arranging layers of these basic units. Each unit in a layer takes a weighted input from the previous layer and generates an output which acts as input for the next layer; this network is called a Multi-Layer Perceptron (MLP). Notice that the sigmoid neuron differs from the Perceptron only in its output function, and a network of sigmoid neurons can be constructed in a similar way. In practice, the neural network with sigmoid output functions (activations) is more useful than the Perceptron because many interesting functions which we want to approximate are non-linear. A useful way to think about neural networks is as function approximators. The architecture we have discussed is usually used in a supervised fashion, so target values y are given. Thus, the task of the neural network here is to find the input-to-output mapping y = f(X; W) by finding optimum weights W (here W is a matrix). We will discuss these networks in the next section.
In the case of unsupervised learning, different architectures of the network are used, for example sigmoid belief networks and Deep Belief Networks (DBNs) from the deep generative architecture domain. After training, these graphical models are used to generate examples similar to the training data, and the training task in these models is to find the joint probability distribution over the network. Here we will explain just the DBN and its building block, the RBM.
Neural networks consist of many layers; the number of layers is called the depth of the network, and the number of neurons in a layer is called its width. Questions like how to choose the depth and width of a network for a particular problem are still not well understood theoretically. The selection of these parameters, the learning rate, and a few others (called hyperparameters) can be done by experimentation on the particular data.
2.8 Deep Feedforward Networks
These networks are used to approximate some function f and find a good mapping between input x and output y. The information flows from the input x through the function f, which consists of many layers, to the output y. There is no feedback mechanism in the model; if a feedback mechanism is defined within the network itself, it is called a recurrent neural network. These networks are powerful at learning non-linear functions and have a large capacity. It is observed that the depth of the network is crucial for learning complex and highly non-linear functions; a simple example is the XOR function, which cannot be learned by linear models. Deep feedforward networks consist of many layers of representations. The layers in between the input layer and the output layer are called hidden layers h_i (here we use h for hidden layer, not hypothesis). The network model can be represented as an acyclic graph, as shown in Fig. 2.6. For example, if a network consists of two layers h^(1) and h^(2) connected in a chain, then the function approximator is f(x) = h^(2)(h^(1)(x)).
Figure 2.6: Deep feedforward neural network: the network consists of l layers and each layer has s_n units.
A usual set of equations for these feedforward multi-layer neural networks is the following:

h^k = sigm(b^k + W^k h^{k−1}),    (2.18)
where x = h^0, b^k is a bias vector, W^k is the weight matrix for layer k, and sigm is the sigmoid function, applied element-wise to a given vector. The top layer h^l is used to predict and is combined with the true target values y into a convex loss function L(h^l, y). The output layer can have a different activation function from the hidden layers, for example the softmax function,
h^l_i = e^{b^l_i + W^l_i h^{l−1}} / ∑_j e^{b^l_j + W^l_j h^{l−1}}.    (2.19)
Here h^l_i is positive and the units in the layer sum to 1, so it can be interpreted as P(Y = i | x), where Y is the target class corresponding to the input vector x. The loss of this unit is L(h^l, y) = − log P(Y = y | x) = − log h^l_y. Training comprises minimizing this cost; the total regularized cost of a network which consists of l layers and N_Q output units is given as
J(W) = −(1/m) [ ∑_{i=1}^{m} ∑_{q=1}^{N_Q} y^{(i)}_q log(h_W(x^{(i)}))_q + (1 − y^{(i)}_q) log(1 − h_W(x^{(i)}))_q ] + (λ/2m) ∑_{p=1}^{l−1} ∑_{i=1}^{s_p} ∑_{j=1}^{s_{p+1}} (W^p_{j,i})^2,    (2.20)
where s_n is the number of units in the n-th layer and λ is the regularization parameter.
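A minimal numpy sketch may make Eqs. 2.18–2.20 concrete. All function names, layer sizes, and parameter values below are illustrative assumptions, not taken from any code used in the thesis:

```python
import numpy as np

def sigm(z):
    # element-wise logistic sigmoid, as in Eq. 2.18
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Eq. 2.19: positive outputs that sum to 1
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x, weights, biases):
    # h^k = sigm(b^k + W^k h^{k-1}), with a softmax output layer
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigm(b + W @ h)
    return softmax(biases[-1] + weights[-1] @ h)

def cost(p, y, weights, lam, m=1):
    # Eq. 2.20: cross-entropy over output units plus an L2 weight penalty
    ce = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)) / m
    reg = lam / (2 * m) * sum(np.sum(W ** 2) for W in weights)
    return ce + reg
```

For a single example, `forward` produces a probability vector over the N_Q output classes, and `cost` evaluates J(W) with regularization parameter λ.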
This cost function can be optimized by an algorithm called backpropagation, which is an efficient way to calculate derivatives. Training a deep network is usually a difficult task due to various challenges: gradient-based learning from randomly initialized weights tends to give poor generalization because it gets stuck in local minima, and another problem is the vanishing gradient, where the error surface for the deeper layers becomes flat and the gradient approaches zero.
2.9 Unsupervised Learning for Deep Neural Networks
Layer-wise unsupervised learning has been a key component behind the success of learning deep architectures so far. If the gradient in a deep architecture becomes less useful as it propagates toward the input layer, it is plausible that a learning signal established at a single layer could be used to move that layer's parameters in the right direction, removing the dependency on the unreliable gradient direction provided by supervised learning. Moreover, the advantage of using unsupervised learning algorithms at each level of a deep neural network is that it breaks the task into sub-tasks corresponding to various levels of abstraction, each of which extracts implicit features from the input distribution. The first layer of the deep architecture learns these salient features, but, due to the limited capacity of the layer, they are dubbed low-level features. This layer becomes the input for the next layer; by the same principle, higher-level features are computed by the second layer, and this process goes on until a high-level abstraction emerges from the input. This criterion also keeps learning local to each layer. These strategies lead us to discuss deep generative models.
2.10 Energy-based Models and Restricted Boltzmann Machine
The Restricted Boltzmann machine is an energy-based model and the building block of DBNs. We will discuss here the mathematical concepts needed to understand these models, including Gibbs sampling and Contrastive Divergence (CD).
2.10.1 Energy based models
Energy-based models borrow the concepts from statistical physics. In these models, a
number is assigned to every configuration (state) of the system and that number is called
the energy of the state. The learning task here is to adjust the energy function to have
desired properties. The energy E is used to define the probability of a particular configuration as

P(v) = e^{−E(v)} / Z,    (2.21)

where Z = ∑_v e^{−E(v)} is the partition function, so a low-energy configuration has high probability.
The above expression also shows that the energy acts in the domain of log probability. For any exponential-family distribution it is easy to calculate conditional probability distributions; we will see this for the RBM.
A particular form of energy, a sum of terms, is used in the product-of-experts formulation. Each term corresponds to an "expert" f_i, with energy E(v) = ∑_i f_i(v):
22
CHAPTER 2. MACHINE LEARNING BASICS
P(v) ∝ ∏_i P_i(v) ∝ ∏_i e^{−f_i(v)}.    (2.22)
Here we denote the input, or visible layer, by v. Each expert can be thought of as a detector of improbable structures/configurations of the input v, or similarly as imposing a constraint on v. For instance, if f_i can take two values, it has a small value for configurations that satisfy the constraint and a large value for those that do not. We can compute the gradient of log P(v) in Eq. 2.22 by using one instantiation of the CD algorithm.
2.10.2 Hidden Variables
In many tasks of interest, an input v consists of many components v_i, but we do not want to observe all of these, or the aim is to increase the capacity of the model; both can be accomplished by introducing hidden variables into the model. For the observed part v and hidden part h,

P(v,h) = e^{−E(v,h)} / Z,    (2.23)

but we observe only v, so

P(v) = ∑_h e^{−E(v,h)} / Z.    (2.24)

This formulation can be restored to the previous one, Eq. 2.21, by introducing the free energy F(v) = − log ∑_h e^{−E(v,h)}, so that

P(v) = e^{−F(v)} / Z,    (2.25)

but now Z = ∑_v e^{−F(v)}.
Thus the free energy F is the result, in the logarithmic domain, of summing out the energies E over the hidden configurations.
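For a model small enough to enumerate, the equivalence of Eq. 2.24 and Eq. 2.25 can be verified by brute force. The joint energy chosen here is an arbitrary illustrative one, not a model from the thesis:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
M = rng.normal(size=(2, 2))                 # arbitrary coupling, for illustration

def E(v, h):
    # toy joint energy E(v, h)
    return -np.array(v) @ M @ np.array(h)

def F(v, hidden_configs):
    # free energy: F(v) = -log sum_h exp(-E(v, h))
    return -np.log(sum(np.exp(-E(v, h)) for h in hidden_configs))

vs = list(product([0, 1], repeat=2))
hs = list(product([0, 1], repeat=2))
Z = sum(np.exp(-E(v, h)) for v in vs for h in hs)               # from Eq. 2.23
marg = {v: sum(np.exp(-E(v, h)) for h in hs) / Z for v in vs}   # Eq. 2.24
via_F = {v: np.exp(-F(v, hs)) / Z for v in vs}                  # Eq. 2.25
```

The two dictionaries agree configuration by configuration, and the partition function computed from e^{−F(v)} matches the one computed from e^{−E(v,h)}.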
2.10.3 Conditional Models
Learning algorithms that involve the partition function are very difficult to train, because all configurations must be accounted for in each iteration of training. If the task is to predict a class y given an input v, then it is sufficient to consider just the configurations of the output y for each input v; usually y belongs to a small set of discrete values. The conditional distribution is given as

P(y | v) = e^{−E(v,y)} / ∑_{y′} e^{−E(v,y′)}.    (2.26)
In these models the training process, in particular the gradient of the conditional log-likelihood, is easy (in terms of computational cost) to compute. These concepts are used to implement a type of RBM called the Discriminative RBM [14]; we will code this model in our example. These models give rise to a conditional function that picks a y for a provided input, which is the purpose of the application. Indeed, when y consists of a small set of values, P(y | v) is always computable, because the normalization of the energy function E requires only the possible values of y.
2.10.4 Boltzmann Machine
The Boltzmann machine is an extension of the Hopfield network. This specific type of energy-based model is composed of hidden variables, and the introduction of a restriction in this model leads to an efficient and widely used model called the RBM. The energy of the Boltzmann machine is given as follows:

E(v,h) = −a^T v − b^T h − h^T W v − v^T U v − h^T M h.    (2.27)

Here W is the weight matrix between v and h, U contains the connections among the v's, and similarly M connects the hidden variables h's. The vectors a, b are bias vectors, and the intra-layer matrices U and M are symmetric. The structure of the Boltzmann machine is shown in Fig. 2.7.
Figure 2.7: Boltzmann machine: an undirected graphical model.
This model involves quadratic terms in h, which is why an analytical calculation of the free energy is not possible. A stochastic estimate can be obtained by Markov Chain Monte Carlo (MCMC) sampling. Starting from Eq. 2.24, the gradient of the log-likelihood is

∂ log P(v)/∂t = − ∑_h P(h | v) ∂E(v,h)/∂t + ∑_{v,h} P(v,h) ∂E(v,h)/∂t,    (2.28)
where t ∈ {a_i, b_j, W_{ij}, U_{ij}, M_{ij}}. Notice that the partial derivative itself is easy to compute, so if there is a procedure to sample from P(h | v) and P(v,h), then the gradient of the log-likelihood can be estimated stochastically. In 1986, Hinton and others [15] introduced the terminology of positive phase and negative phase. In the sampling process, the positive phase corresponds to sampling h while v is clamped to an input vector; sampling (v,h) from the model itself is called the negative phase. In practice only approximate sampling can be done, for example by constructing an MCMC chain through an iterative process. Hinton also introduced an MCMC method based on Gibbs sampling. Gibbs sampling for N joint random variables can be done in a series of N sampling sub-steps. Consider a set S = (S_1, · · · , S_N) of N random variables; the sampling sub-steps are

S_i ∼ P(S_i | S_{−i} = s_{−i}),    (2.29)

where S_i is the variable we want to sample and S_{−i} are the remaining N − 1 variables. A step of the Markov chain is completed after sampling all N variables. Under some conditions, e.g. aperiodicity and irreducibility, the Markov chain converges to P(S) after infinitely many steps. Gibbs sampling can be performed on the Boltzmann machine if we denote all the visible and hidden units by s and calculate P(s_i | s_{−i}); we will see in the next section that this conditional probability distribution is just a sigmoid function taking input from the other neurons s_{−i}.
Notice that the computational cost of this gradient is very high, as we need to run two Markov chains for each example. This was the reason for the downfall of the Boltzmann machine in the late 80's. In the last decade, the machine learning community found a way to train the Boltzmann machine: it turns out that short Markov chains are also useful, and this is the principle of CD.
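A Gibbs sub-step, Eq. 2.29, only needs the single-site conditional P(S_i | S_{−i}). For a joint distribution small enough to tabulate, a sketch looks as follows; the three-variable joint here is randomly generated, purely for illustration:

```python
import numpy as np
from itertools import product

def conditional(joint, s, i):
    # P(S_i = 1 | S_{-i} = s_{-i}) from an explicit joint table
    s0, s1 = list(s), list(s)
    s0[i], s1[i] = 0, 1
    return joint[tuple(s1)] / (joint[tuple(s0)] + joint[tuple(s1)])

def gibbs_step(joint, s, rng):
    # one Markov-chain step: resample each of the N variables in turn
    s = list(s)
    for i in range(len(s)):
        s[i] = int(rng.random() < conditional(joint, s, i))
    return tuple(s)

rng = np.random.default_rng(0)
states = list(product([0, 1], repeat=3))
w = rng.random(len(states))
joint = dict(zip(states, w / w.sum()))      # normalized toy joint P(S)
```

Iterating `gibbs_step` produces a chain whose stationary distribution is the tabulated P(S); for the Boltzmann machine the table is replaced by the sigmoid conditionals discussed in the next section.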
2.10.5 Restricted Boltzmann Machine
The Restricted Boltzmann Machine (RBM) is fundamental to Deep Belief Networks (DBNs) because a DBN is just a stack of RBMs. The RBM is a special type of Boltzmann machine with the restriction that no intra-layer connections exist in either the visible or the hidden layer, and it can be trained efficiently. Due to this restriction, the hidden units are independent of each other when v is given, and the visible units are independent of each other when h is given. The two layers, hidden and visible, interact with each other through a weight matrix W, as shown in Fig. 2.8. The energy function becomes simpler under this restriction:
Figure 2.8: Restricted Boltzmann machine (RBM): no intra-layer connections, but hidden and visible units can interact with each other.
E(v,h) = −a^T v − b^T h − h^T W v.    (2.30)
The probability distribution can be calculated by the same formula given in Eq. 2.21. The purpose of training an RBM is to approximate the probability distribution over the observed variables v. The conditional probability distribution can be expressed as a factorial distribution:
P(h | v) = e^{a^T v + b^T h + h^T W v} / ∑_{h′} e^{a^T v + b^T h′ + h′^T W v}
= ∏_i e^{b_i h_i + h_i W_i v} / ∏_i ∑_{h′_i} e^{b_i h′_i + h′_i W_i v}
= ∏_i P(h_i | v).    (2.31)
In the usual models (where h_i is binary), we end up with the familiar neuron activation function,

P(h_i = 1 | v) = e^{b_i + W_i v} / (1 + e^{b_i + W_i v}) = sigm(b_i + W_i v),    (2.32)
where W_i corresponds to the i-th row of W. A similar calculation can be done for the probability of v when h is given, with the same result,

P(v | h) = ∏_j P(v_j | h),    (2.33)

and for binary units

P(v_j = 1 | h) = sigm(a_j + W^T_{·j} h).    (2.34)
Here W_{·j} corresponds to the j-th column of W. Binary units are useful when the task is to
approximate a binomial distribution, such as gray-scale pictures.

Figure 2.9: Softplus function.

For continuous-valued datasets, Gaussian input units are recommended. The RBM is less efficient for some distributions than the general Boltzmann machine, but RBMs can still approximate any discrete distribution when sufficiently many hidden units are provided. Interestingly, it can be shown that adding extra hidden units to an already-trained RBM can always improve results.
We can write the probability distribution in terms of the free energy; in that form the training process is more intuitive. Starting from the marginal distribution,

P(v) = ∑_h P(v,h) = ∑_h e^{−E(v,h)} / Z = e^{a^T v + ∑_j softplus(b_j + W_j v)} / Z = e^{−F(v)} / Z,    (2.35)
where softplus(x) = log(1 + e^x) is shown in Fig. 2.9, and the free energy is

F(v) = −a^T v − ∑_j softplus(b_j + W_j v).    (2.36)
From the above equation it is clear that training adjusts the rows of W, along with their biases, so that the softplus terms tend to be high on the training data. The free energy expression thus lets us see what the RBM needs to do to give certain inputs high probability, i.e. to make the training set more likely by making P(v) large on the training dataset. To train the RBM we need to compute the gradient of the log-likelihood, Eq. 2.28.
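Eq. 2.36 can likewise be verified against the definition F(v) = −log ∑_h e^{−E(v,h)} for a small RBM. The parameters below are random, for illustration only:

```python
import numpy as np
from itertools import product

def softplus(x):
    return np.log1p(np.exp(x))

def free_energy_closed(v, W, a, b):
    # Eq. 2.36: F(v) = -a.v - sum_j softplus(b_j + W_j v)
    return -(a @ v) - softplus(b + W @ v).sum()

def free_energy_brute(v, W, a, b):
    # F(v) = -log sum_h exp(-E(v, h)), with E from Eq. 2.30
    total = 0.0
    for t in product([0, 1], repeat=len(b)):
        h = np.array(t)
        total += np.exp(a @ v + b @ h + h @ (W @ v))
    return -np.log(total)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
a, b = rng.normal(size=4), rng.normal(size=3)
v = np.array([1.0, 1.0, 0.0, 1.0])
```

The closed form follows because the sum over h factorizes: ∑_h e^{−E(v,h)} = e^{a·v} ∏_j (1 + e^{b_j + W_j v}).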
2.10.6 Gibbs Sampling in RBMs
Sampling from an RBM is useful in the learning process, as it is used to estimate the gradient of the log-likelihood, and the generated samples can be used to inspect whether the RBM has captured the underlying data distribution. As we will see, sampling from an RBM also allows us to sample from a DBN.

Figure 2.10: Markov chain: starting with an example v_t sampled from the empirical probability distribution, sample the hidden layer given the visible layer and then the visible layer given the hidden layer; this process goes on and on.
Gibbs sampling in the unrestricted Boltzmann machine is not efficient because the number of sub-steps in the Gibbs chain is proportional to the number of units in the network. In the RBM, by contrast, we have an analytical expression for the positive phase because of the factorization, and the visible and hidden layers can each be sampled in one of two alternating steps of the chain. Starting with an example v_t from the given dataset, sample h_1 with probability P(h_1 | v_t), then repeat the step with the hidden layer h_1 given to sample v_1; this process is shown in Fig. 2.10. Experiments show that starting the Markov chain from an example of the dataset makes the convergence of the chain faster. This also makes sense, as we want the probability distribution of our model to be similar to the empirical probability distribution of the data.
In the Contrastive Divergence (CD) algorithm we use this approach and take the negative sample after just k steps of the Markov chain; the CD algorithm is given in Algorithm 1, here with k = 1. As far as the code implementation of the RBM is concerned, we have used the Matlab code given at [18].
2.11 Deep Belief Networks
A deep belief network (DBN) is a stack of RBMs which models the joint distribution over the visible layer and many hidden layers [17]. The probability over a DBN which consists of a visible layer and l hidden layers is given by

P(v, h^1, · · · , h^l) = ( ∏_{k=0}^{l−2} P(h^k | h^{k+1}) ) P(h^{l−1}, h^l),    (2.37)
Algorithm 1 Parameter update procedure for a binomial RBM.
1: RBMupdate(v_1, α, W, a, b)
2: v_1: given example from the empirical distribution
3: α: learning rate
4: W: weight matrix, N_hidden × N_visible, for the RBM
5: a: bias vector for visible units
6: b: bias vector for hidden units
7: Notation: H(h_2 = 1 | v_2) is the vector containing elements H(h_{2i} = 1 | v_2)
8: for all hidden units i do
9:   compute H(h_{1i} = 1 | v_1) from sigm(b_i + ∑_j W_{ij} v_{1j})
10:  sample h_{1i} ∈ {0,1} from H(h_{1i} | v_1)
11: end for
12: for all visible units j do
13:  compute P(v_{2j} = 1 | h_1) from sigm(a_j + ∑_i W_{ij} h_{1i})
14:  sample v_{2j} ∈ {0,1} from P(v_{2j} = 1 | h_1)
15: end for
16: for all hidden units i do
17:  compute H(h_{2i} = 1 | v_2) from sigm(b_i + ∑_j W_{ij} v_{2j})
18: end for
19: W ← W + α(h_1 v_1^T − H(h_2 = 1 | v_2) v_2^T)
20: a ← a + α(v_1 − v_2)
21: b ← b + α(h_1 − H(h_2 = 1 | v_2))
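Algorithm 1 translates almost line by line into numpy. This is only a sketch for concreteness; the thesis itself relies on the Matlab implementation of [18]:

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def rbm_update(v1, alpha, W, a, b, rng):
    """One CD-1 parameter update for a binomial RBM (Algorithm 1).

    W is n_hidden x n_visible; a and b are the visible and hidden biases.
    """
    # lines 8-11: positive phase, sample h1 ~ P(h | v1)
    q1 = sigm(b + W @ v1)
    h1 = (rng.random(q1.shape) < q1).astype(float)
    # lines 12-15: one Gibbs step back to the visible layer
    p2 = sigm(a + W.T @ h1)
    v2 = (rng.random(p2.shape) < p2).astype(float)
    # lines 16-18: H(h2 = 1 | v2), used as probabilities without sampling
    q2 = sigm(b + W @ v2)
    # lines 19-21: parameter updates
    W = W + alpha * (np.outer(h1, v1) - np.outer(q2, v2))
    a = a + alpha * (v1 - v2)
    b = b + alpha * (h1 - q2)
    return W, a, b
```

Calling `rbm_update` repeatedly over the training examples implements CD-1 training of the RBM.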
where v = h^0, P(h^{l−1}, h^l) is the distribution of the RBM at the top, and P(h^k | h^{k+1}) is the conditional probability of h^k given h^{k+1}. The DBN structure is depicted in Fig. 2.11. The conditional probability distributions and the RBM at the top define the generative model. From now on we use Q for the approximate (or, at the top, exact) posterior of this generative model, which is used for inference and training. The posterior Q is equal to the true distribution P at the top, because the top two layers form an RBM; it is approximate for the other layers. For a code implementation of the DBN, the interested reader can use [18], where a DBN is trained on the MNIST dataset; the code is easy to modify for a particular purpose.
Figure 2.11: A Deep Belief Network is defined as a generative model; the generative path is from top to bottom with distributions P, while the Q distributions extract multiple features from the input and construct an abstract representation. The top two layers define an RBM.
CHAPTER 3

TENSOR NETWORKS
Tensor networks (TNs) were originally motivated by systems in condensed matter physics, high energy physics, and perhaps also quantum chemistry, and have produced several benchmark results in many directions. TNs have become an essential computational tool in the study of quantum many-body systems. They have also opened up new directions, such as their relation to the AdS/CFT correspondence in quantum gravity and the holographic principle. Nowadays, TNs form a rapidly growing field in which researchers aim to study complex quantum systems from the point of view of their entanglement, trying to understand what entanglement theory can teach us about the structure of these systems and how we can use it to understand them better.
In all these fields we have many-particle systems whose particles interact with each other quantum mechanically. So, in a general setting, we cannot describe these systems by simply studying the quantum mechanics of a single particle; we have to consider all the particles together, the way they interact, and the joint quantum state they form. If we really want to address this problem fully, we have to consider the quantum correlations between particles, essentially the entanglement between them. A problem like that, where we have no extra information, is a daunting task: in a many-particle system the number of degrees of freedom grows exponentially with the number of particles. It may seem that there is no structure at all, but of course that is not the case. In real matter there is structure: particles which are close by interact strongly, while particles which are distant from each other interact very weakly. So there is some notion of locality and spatial structure. And
it is really this interplay which allows us to say something about how these kinds of systems behave, and leads us to the fact that there is some structure in these systems: the type of entanglement they display. This makes the entanglement still very special, despite being complex quantum entanglement, and is what allows us to study these systems from the point of view of entanglement.
For most of the matter around us we do not really have to think much about the quantum interactions, as entanglement is not that important in most systems. Many systems are usually described by what is called mean-field theory, which basically neglects entanglement; one can think of entanglement as a correction on top of something which is not entangled, or not entangled in a non-trivial way. This approach is vastly successful and describes most of the systems around us. One reason could be that many of these systems are at high temperatures compared to the relevant interaction strengths. At high temperature states become quite mixed, and in that case entanglement becomes less and less important in the mixed state. We are not interested in systems like that. Here we would like to look at systems where the quantum correlations play an essential role, which we cannot understand by neglecting entanglement. We are most interested in the low-temperature regime, essentially because at higher temperatures the state of the system becomes more mixed and less entangled; a state at infinite temperature, i.e. a maximally mixed state, is hardly entangled at all and has no structure, everything is completely random. Therefore we naturally expect that the lower we choose the temperature, the more quantum the system will behave. So, to see the most interesting physics, we should look at very low temperatures, or ideally at the ground state.
We will focus on a simpler kind of system, one also closer to quantum information theory, which still displays many of the features we are interested in, especially those which make quantum matter special: quantum spin systems on a lattice.
Understanding a many-body system quantum mechanically may be the most challenging and difficult task in many-body physics. For example, phase transition phenomena outside Landau's regime have also proven very hard to grasp, involving new exotic phases of matter. A few cases are the topological phases of matter, where a particular structure of entanglement spreads throughout the system, and quantum spin liquids, phases of matter that occur without breaking any symmetry.
The usual approach to understanding these systems consists of proposing toy models
Figure 3.1: Tensor network representation of a state: the tensor is the elementary building block of a tensor network, and a tensor network is a representation of a quantum state (we use the graphical notation for tensor networks, which will become obvious in the coming sections).
that are supposed to replicate the interactions relevant for the observed physics; for instance, the Hubbard model for the specific case of high-temperature superconductors. After proposing such a simple model, and excluding some special cases where these models have an exact solution, one has to rely on numerical techniques to solve the problem.
In this chapter we will explain the basic concepts of tensor network methods, focusing mainly on the matrix product state (MPS). After defining the MPS, we will discuss some of its properties. Two methods of constructing the MPS will be presented, namely the singular value decomposition and Projected Entangled Pair States (PEPS). In the end, we will discuss some examples for concreteness.
3.1 The Necessity of Tensor Networks
One can ask an obvious question: why do we need tensor networks at all, in spite of the various numerical methods available to study strongly interacting systems? There is no unique answer to this question; there could be several reasons for the importance of tensor networks, but here we mention four major ones:
• New limitations for classical simulations: All the numerical techniques developed so far have their own limitations. To name a few: exact diagonalization of a quantum Hamiltonian is limited to very small systems, say 12 or 14 spin-1/2 particles on a regular computer, so studying quantum phase transitions is out of reach with this method; mean-field theory is restricted to incorporating
Figure 3.2: Tensor network diagram examples: (a) a Matrix Product State (MPS) for 4 sites with open boundary conditions (OBC); and (b) a Projected Entangled Pair State (PEPS) for a 4×4 lattice with OBC.
the effect of classical correlations, not quantum correlations; one cannot study frustrated or fermionic quantum spin systems using quantum Monte Carlo due to the sign problem; and the strength of DFT depends on the modelling of the correlations and exchange interactions between electrons. These are just a few examples.
TN methods have their limitations as well, but these limitations are very different from those of existing classical methods: they concern the structure and amount of entanglement in the state of the quantum system. Within these new boundaries, one can simulate a range of models with a classical computer.
• Graphical language for (many-body) physics: The TN description of a quantum state is entirely different from the usual one: TN states encode the structure of the quantum entanglement. Instead of considering a messy set of equations, we deal with tensor network diagrams, as shown in Fig. 3.2. It has been realized that this diagrammatic approach provides a natural language to describe quantum states of matter, even those which cannot be expressed in Landau's picture, such as topologically-ordered states and quantum spin liquids. This new language for quantum many-body systems provides an intuitive and visual description of a system, and with it new ideas and results.
• Entanglement structure: The usual approach to describing a quantum state does not allow one to visualize the structure and amount of entanglement among its constituents. The structure of the entanglement is expected to depend on the dimensionality of the system, i.e. it will be different for a 1D system, a 2D system, and so forth. It also depends on the state of the system, for example at criticality, and on its correlation length. The usual way of representing a quantum state does not allow one to retrieve such properties. So, it is nice to have a way of describing quantum states where these properties and this information are clear and easy to access.

Figure 3.3: Area law: the entanglement entropy S of the reduced system A scales with the boundary of the system, not with its volume.
To a certain extent, one can think of a TN state as a quantum state written in a particular entanglement representation. Different states of the system have different representations, and the effective lattice geometry in which the quantum state really lives emerges as a result of the correlations in the network. At this point this is a subtle property, but in practice, building on this fascinating idea, a number of works have proposed that the pattern of entanglement occurring in a quantum state gives rise to a geometry and curvature (and hence gravity). This property of TNs has also established a link between machine learning and holographic geometry: it has been proposed that a geometry which respects holography can appear from deep learning when we train on the entanglement features of a quantum many-body state [19]. In this connection, entanglement is encoded in a neural network (of course motivated by TNs). Here we simply note that the language of TNs is clearly the right one in which to pursue such connections.
• Exponentially large Hilbert space ("curse of dimensionality"): This is the major and most important answer to the question of why TNs are a natural description of quantum many-body states. One of the biggest hurdles to the numerical and theoretical study of quantum many-body systems is the curse of dimensionality, i.e. the Hilbert space of quantum states grows exponentially. In general,
this curse puts limits on the efficient description of states (by efficiency we mean a description whose size grows polynomially with the number of particles), and their study becomes intractable. For example, the size of the Hilbert space is 2^N for a system of N spin-1/2 particles, so representing a quantum state in the usual way is inefficient. One can imagine the size of the Hilbert space for a system with on the order of Avogadro's number of particles.
Fortunately, a number of important Hamiltonians in nature have local interactions among the particles of the system. As a consequence, very few quantum states are more relevant than the others. To be more specific, one can prove that the ground state and low-energy eigenstates of gapped Hamiltonians (those with a finite difference between the energies of the ground state and the excited states) with local interactions obey the area law for the entanglement entropy [20], as shown in Fig. 3.3. This is a remarkable property: it says that the entanglement entropy of a reduced system scales with the area of the system, not with its volume, as opposed to a state chosen randomly from the Hilbert space, which follows a volume law. So we can use this law as a guide for the low-energy states of realistic Hamiltonians: because they are deliberately restricted by locality, they must obey the entanglement area law. In addition, the size of the manifold containing these states is exponentially small [21], a corner of the huge Hilbert space, as depicted in Fig. 3.4. If one aims to study the relevant states from this corner, then it is better to devise a tool which targets this manifold instead of exploring the whole Hilbert space. The good news is that there is a family of TN states that targets this relevant corner of Hilbert space. That is why it is natural to devise renormalization group (RG) methods that keep the relevant degrees of freedom while ignoring the irrelevant ones, and which are for this reason based on TN states.
3.2 Theory of Tensor Networks
In this section we will introduce the diagrammatic TN representation and define TN states mathematically as well as diagrammatically. We will also discuss the computational complexity of TNs and explain the relation between entanglement and bond dimension.
Figure 3.4: The physical states live in a small manifold of Hilbert space: the manifold containing the states which obey the area law is an exponentially small corner of the gigantic Hilbert space.
Figure 3.5: Tensor representation by diagrams: here we use a superscript for the physical index, as shown in (d).
3.2.1 Tensors and tensor networks in tensor network notation
A tensor is defined, for our purposes, as a multidimensional array of complex numbers. The rank of a tensor is its number of indices; hence tensors of rank 0, 1, and 2 are a scalar, a vector, and a matrix, respectively, as shown in Fig. 3.5. Here we represent a tensor as a bubble, and the number of legs attached to the bubble is the rank of the tensor.
Index contraction is one of the most important operations when dealing with TNs. It amounts to summing over all repeated indices shared by a collection of tensors. For example, the product of two matrices,

F_{i,j} = ∑_{k=1}^{D} A_{i,k} B_{k,j},    (3.1)

is a sum over the repeated index k, which can take D possible values. The same is true for a collection of tensors, as in

E_{i,j,k} = ∑_{x,y,z=1}^{D} A_{x,i} B_{x,j,k,y,z} C_{y,z},    (3.2)
Figure 3.6: Tensor graphical notation: these tensor diagrams correspond to Eqs. 3.1, 3.2, 3.3, and 3.4. Notice that in (b) and (d) the bond between tensors B and C is thick, which corresponds to a higher bond dimension compared to the other links between the tensors; in practice we can split and merge any number of indices, and here the two indices y, z are merged. There are two and three open indices in diagrams (a) and (b) respectively, while diagrams (c) and (d) have no open indices.
where the repeated indices x, y, z can in general take different ranges of values; for simplicity we let all of them take D values. In Eq. 3.1 the result of the contraction is a tensor with open indices i and j, the indices left over after the contraction. Usually the outcome of a contraction is a tensor. A tensor network (TN) is a collection of tensors with some contraction pattern; contraction over all of its indices results in a scalar, while contraction over only some of its indices yields a tensor. For example,
F = ∑_{k=1}^{D} A_k B_k,    (3.3)

E = ∑_{x,y,z=1}^{D} A_x B_{x,y,z} C_{y,z},    (3.4)
where F and E in Eq. 3.3 and Eq. 3.4 are complex numbers resulting from the contraction of all indices. In practice, we will see that any number of indices can be split and merged without changing the value associated with a particular index. The graphical notation for the tensor networks in Eqs. 3.1, 3.2, 3.3, and 3.4 is given in Fig. 3.6.
Manipulating TNs using diagrams is quite easy compared to manipulating a complicated set of equations. For example, the 1D lattice of 8 particles with periodic boundary conditions (PBC) is shown in TN notation in Fig. 3.7; in mathematical language it is the trace of a product of 8 matrices. One can compare the language of TN diagrams to Feynman diagrams in QFT: they are intuitive, allow visualization, are easy to use, and many properties, such as the cyclic property or PBC on a lattice, are evident in the diagram. Having described the graphical notation of TNs, we will use it throughout the rest of the chapter.
Figure 3.7: 1D lattice with PBC: trace of the product of 8 matrices, or a lattice of 8 particles with PBC.
One important question should be mentioned here, which is also a criterion for evaluating how efficient a TN representation of a state is: is there a way to contract the TN efficiently? The importance of the question lies in the fact that if we cannot contract a TN efficiently, then that TN is of little practical use. Also, there is no unique way to contract a TN: whether one starts from the center of the network or uses the left corner of the network as a starting point, the end result will be the same, but the number of operations required depends on the order in which the network indices are contracted. An efficient TN has at least one order in which the network contraction is optimal.
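A hypothetical example makes the cost difference concrete: contracting a small chain A (m × n), B (n × n), v (n,) in two different orders gives the same answer at very different operation counts. The sizes below are illustrative.

```python
import numpy as np

m, n = 5, 200
A = np.random.rand(m, n)
B = np.random.rand(n, n)
v = np.random.rand(n)

# Order 1: (A B) first, costing ~ m*n*n + m*n multiplications.
left = (A @ B) @ v
# Order 2: (B v) first, costing ~ n*n + m*n multiplications -- far cheaper here.
right = A @ (B @ v)

# The result is order-independent; only the operation count differs.
assert np.allclose(left, right)
```

For larger networks, `np.einsum_path` can search for a good contraction order automatically.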
3.2.2 Wave function as a set of small tensors
Let us represent a quantum state in the TN notation. Consider a quantum system of N particles, each with p levels in some individual basis |i_r⟩, where r = 1, 2, ..., N labels the particles and i_r = 1, 2, ..., p. For instance, p = 2 for a spin-1/2 particle. The wave function |Ψ⟩ of this system can be written as

|\Psi\rangle = \sum_{i_1,i_2,...,i_N} C_{i_1,i_2,...,i_N} |i_1\rangle \otimes |i_2\rangle \otimes \cdots \otimes |i_N\rangle, (3.5)

where C_{i_1,i_2,...,i_N} can take p^N complex values. The quantum many-body state |Ψ⟩ consists of tensor products of the individual states of each particle.
The big set of coefficients C_{i_1,i_2,...,i_N} can be thought of as a single big tensor with p^N entries: its rank is N, with indices i_1, i_2, ..., i_N, and each index can take p values. So the number of coefficients needed to describe the wave function is exponential in the system size. The so-called "curse of dimensionality" appears again. One of the main purposes of the TN approach is to tackle this problem and provide a way to represent a quantum state efficiently. In particular, the number of coefficients required to specify a quantum state
Figure 3.8: TN representation of a wave function: (a) is an MPS, (b) is a PEPS, and (c) is some other tensor network that fulfils the same purpose.
should be polynomial in the system size N. TN methods meet this challenge by providing the correct description of the expected entanglement properties of the quantum state. This is attained by replacing the big tensor C by much smaller ones, as shown in Fig. 3.8. The total number of parameters g_tot required to specify a quantum state is the sum of the parameters g(t) of the individual tensors, which in turn depend on the rank of each tensor and the number of values each of its indices can take:
g_{tot} = \sum_{t=1}^{N_{tens}} g(t) = \sum_{t=1}^{N_{tens}} O\left( \prod_{a_t=1}^{rank(t)} D(a_t) \right), (3.6)
where rank(t) is the number of indices of tensor t and D(a_t) is the number of values index a_t can take. Let us define D_t as the maximum of the numbers D(a_t) for a given tensor. Then

g_{tot} = \sum_{t=1}^{N_{tens}} O\left( D_t^{rank(t)} \right) = O\left( \mathrm{poly}(N)\,\mathrm{poly}(D) \right), (3.7)
where D is the maximum of the D_t, and we have assumed that the number of indices a tensor can have is bounded by some constant.
One example, given in Fig. 3.8(a), is an MPS with PBC; it will be discussed in detail later. In an MPS the number of parameters is just O(N p D^2), if we assume that the open indices can take up to p values and the remaining ones up to D values. Still, there are p^N coefficients (corresponding to a rank-N tensor) after contracting the TN. But here the magic comes in: in the TN description these p^N coefficients are not independent; in fact, they are obtained by the contraction of a specific TN and hence have a structure. This structure is a result
Figure 3.9: Area-law in PEPS: reduced states |A(α)⟩ (2×2) and |B(α)⟩ from a 4×4 PEPS. Each broken bond has dimension D and contributes log D to the entanglement entropy.
of the extra degrees of freedom required to "glue" the small tensors together into a TN. These degrees of freedom have a crucial physical meaning: they represent the structure of quantum entanglement in the state |Ψ⟩, and the number of values they can take is a quantitative measure of the entanglement or quantum correlations in |Ψ⟩. These degrees of freedom, or indices, are called bonds, and the number of values a particular bond can take is referred to as its bond dimension. The bond dimension D of a TN is defined as the maximum of all the bond dimensions in the network. In the next section we will discuss the relationship between bonds and entanglement in a TN representation.
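The parameter counting above can be sketched numerically: build a random PBC MPS and contract it into the full coefficient tensor. The values N, p, D below are illustrative, not from the text.

```python
import numpy as np

N, p, D = 8, 2, 3
mps = [np.random.rand(p, D, D) for _ in range(N)]   # one rank-3 tensor per site

n_mps_params = sum(t.size for t in mps)             # N * p * D**2 = 144
n_full = p**N                                       # 256 dependent coefficients

# Contract the PBC MPS into the full coefficient tensor C_{i1...iN}:
# each coefficient is the trace of a product of N matrices.
C = np.zeros((p,) * N)
for idx in np.ndindex(*(p,) * N):
    M = np.eye(D)
    for site, i in enumerate(idx):
        M = M @ mps[site][i]
    C[idx] = np.trace(M)
```

Already at N = 8 the MPS uses fewer numbers (144 vs 256), and the gap grows exponentially with N while the MPS parameter count grows only linearly.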
3.2.3 Entanglement entropy and Area-law
Let us consider an example to understand how entanglement in a TN representation is related to the bond dimension. Suppose we are given a state in TN representation as shown in Fig. 3.9; this is a PEPS with bond dimension D. Let us estimate the entanglement entropy (EE) of a region of length L, labelling all the boundary-crossing indices between |A(α)⟩ and |B(α)⟩ collectively as α. Obviously, if each index that crosses the boundary can take up to D different values, then α can take D^{4L} values. The reduced density matrix for system A can be written as

\rho_A = \sum_{\alpha,\alpha'} X_{\alpha,\alpha'} |A(\alpha)\rangle \langle A(\alpha')|, (3.8)

where X_{\alpha,\alpha'} \equiv \langle B(\alpha')|B(\alpha)\rangle. The rank of the reduced density matrix is at most
D^{4L}, whether we consider system A or B. Further, the EE of the reduced system A is given by

S(\rho_A) = -\mathrm{tr}(\rho_A \log \rho_A). (3.9)

S(\rho_A) is upper bounded by the log of the rank of \rho_A, so the EE in this case satisfies

S(\rho_A) \le 4L \log D, (3.10)
which is an upper-bound version of "the area-law" for the EE. We can also deduce from this equation that every broken bond index contributes log D to the total entropy.

Let us discuss the area-law for different possible states. If the given state is just a product state, then its bond dimension is D = 1 and the EE shows no entanglement, S = 0. This is a general result for a TN: if the bond dimension is trivial, there is no entanglement. The second case is D > 1, which already satisfies the area law; increasing D only changes the multiplicative factor. Therefore, to change the scaling of the entropy, the structure or geometric pattern of the TN has to be changed, because L depends on the geometry. The conclusion of the above is that the entanglement depends on the bond dimension D, while the way the bonds are connected together defines a geometric pattern. That is why different families of TN states with the same D have entirely different entanglement properties. Notice that by fixing the value of D greater than one we can achieve both computational efficiency and quantum correlations. A larger region of the Hilbert space can be explored by increasing the bond dimension, as shown in Fig. 3.10. The entropy given in Eq. 3.9 is called the von Neumann entanglement entropy. There are other types of entropies to quantify entanglement; the whole family of EEs is called the Renyi entropies,
S_\alpha(\rho_A) = \frac{1}{1-\alpha} \log \mathrm{tr}(\rho_A^\alpha), (3.11)

for \alpha > 0. For \alpha \to 1, this formula reduces to the von Neumann entropy.
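Both entropies can be computed directly from the Schmidt coefficients of a bipartite state. The sketch below uses a random state; the subsystem dimensions dA, dB are illustrative choices.

```python
import numpy as np

dA, dB = 4, 4
psi = np.random.rand(dA, dB) + 1j * np.random.rand(dA, dB)
psi /= np.linalg.norm(psi)

s = np.linalg.svd(psi, compute_uv=False)   # Schmidt coefficients
probs = s**2                               # eigenvalues of rho_A, sum to 1

S_vn = -np.sum(probs * np.log(probs))      # von Neumann entropy, Eq. 3.9
alpha = 2.0
S_renyi = np.log(np.sum(probs**alpha)) / (1 - alpha)   # Renyi entropy, Eq. 3.11

# The entropy is bounded by the log of the Schmidt rank, as in Eq. 3.10.
assert S_vn <= np.log(len(s)) + 1e-9
```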
3.2.4 Proven instances and violations of Area-law
One can make a general categorization of which systems obey the area-law by stating that most systems with gapped Hamiltonians follow the area-law while gapless systems do not. There are examples of gapless Hamiltonians that follow the area-law, but the statement is true in general. For instance, free bosons obey the area law even at criticality, as do gapless models in dimension greater than one. For gapped free fermionic models, whose Hamiltonians are expressed as quadratic polynomials of creation
Figure 3.10: By increasing the bond dimension D of a TN state one can explore a larger region of Hilbert space.
and annihilation operators, the area-law holds for arbitrary lattice systems in any dimension. More importantly, MPS and PEPS also follow the area-law; as discussed, the efficient representation of a quantum state by TN methods is due to this area-law. There are many other examples, but in general any gapped Hamiltonian with a unique ground state respects the area-law.
Systems at criticality, or gapless systems, do not respect the area-law. There are correlations in the system at every length scale, hence the correlation decay is not exponential and the area-law is violated. But the corrections to the area-law are small. Conformal Field Theory (CFT) predicts that

S(\rho_A) = \frac{c}{3} \log\frac{l}{a} + C, (3.12)

where l is the length of the region, a is the lattice spacing, c is the central charge, and C > 0 is a constant [22]. So in fact the entropy is of the order of the size of the region, not of its boundary:

S(\rho_A) = O\left( \log(|A|) \right), (3.13)
where |A| is the size of region A. The entropy diverges logarithmically with the system size. Free fermionic systems at the critical point violate the area-law, and the area-law scaling for higher-dimensional critical systems is still unknown.
3.2.5 Entanglement spectra
Often the entire spectrum of \rho_A is more useful than the single number provided by the entanglement entropy. In fact, more information is revealed if the entanglement Hamiltonian is considered, and the information provided by the Renyi entropies is the same as that given by the entanglement spectrum of the reduced system \rho_A. Given a state \rho_A, the entanglement Hamiltonian H_A is defined as
ρA = e−HA . (3.14)
In fact, important information about a system is revealed by the entire entanglement spectrum; for example, one can extract information about the universality class of the system. The entanglement Hamiltonian is receiving notable attention in the context of boundary theories and topological systems.
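As a minimal sketch of Eq. 3.14, the entanglement spectrum is the set of eigenvalues of H_A = -log(ρ_A); the random state below is an illustrative stand-in for a physical one.

```python
import numpy as np

dA, dB = 4, 4
psi = np.random.rand(dA, dB)
psi /= np.linalg.norm(psi)

rho_A = psi @ psi.T                 # reduced density matrix of subsystem A
evals = np.linalg.eigvalsh(rho_A)   # eigenvalues of rho_A (sum to 1)
nz = evals[evals > 1e-12]
spectrum = -np.log(nz)              # entanglement spectrum: eigenvalues of H_A

S = -np.sum(nz * np.log(nz))        # the entropy is one number from this spectrum
```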
3.3 Matrix Product States (MPS)
After establishing the notation and some key concepts, we will discuss some fundamental TNs for quantum many-body systems with strong interactions among their parts. We will discuss one-dimensional systems first.
An MPS is a natural (and efficient) choice for representing the low-energy states of 1D quantum systems, more precisely of physically realistic quantum systems. In this section we first discuss some properties of MPS; after that we construct the MPS using the singular value decomposition, and motivate and define this TN in two different ways. Then some analytical examples of this TN, and the complexity tackled by MPS, will be discussed. At the end, some simple characteristics of MPS and of operators in MPS form will be explained.
3.3.1 Some properties
• MPS are dense: Any quantum state can be represented by an MPS; by increasing the bond dimension D one can explore a larger region of the many-body Hilbert space. All states in the Hilbert space can be covered by MPS, but the required bond dimension D grows exponentially with the system size. Nevertheless, an MPS can represent the low-energy states of a gapped local Hamiltonian to arbitrary accuracy with a finite value of D [23]. For critical systems, however, D diverges polynomially with the system size. This is shown in Fig. 3.10.
• 1D translational symmetry and thermodynamic limit: In general, a finite-sized MPS is not itself symmetric under translation, because all the tensors can be different from each other. To take the thermodynamic limit one can choose a unit cell in the array of tensors. If all the tensors are the same, then the MPS is translationally invariant under a shift by any number of tensors; the unit cell could also consist of two or three tensors. The idea is shown pictorially in Fig. 3.11(a).
Figure 3.11: (a) Infinite-sized MPS with 1- and 2-site unit cells. (b) An efficient way to contract a finite-sized MPS, which can also be applied to infinite-sized MPS with any boundary condition.
• 1D area-law: MPS respect the area-law of entanglement entropy for 1D systems. In particular, the entanglement entropy of a reduced system is bounded by a constant, i.e. S(L) = O(log D). Intuitively, to cut out a reduced system we need to cut two bonds, independent of the size of the reduced system. Strictly, S(L) ∼ const for L ≫ 1, which is the property of the ground state of a gapped Hamiltonian of a 1D system.
• Exponential decay of correlations: MPS always have a finite correlation length, which means that correlations decay exponentially in an MPS. That is why they are not fit for critical systems and cannot reproduce the scale-invariant properties of systems showing a power-law decay of correlations.
• Efficient contraction of expectation values: The scalar product of a finite-sized MPS can be contracted in time O(N p D^3), and an infinite-sized MPS requires O(p D^3) per site. The contraction strategy is shown in Fig. 3.11(b). The same kind of manipulation can be used to calculate expectation values of operators.
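The "ladder" contraction of Fig. 3.11(b) can be sketched for an open-boundary MPS: the bra-ket bond pair is carried left to right, each step costing O(p D^3), so the full scalar product is O(N p D^3). Sizes below are illustrative, and real tensors are used so the bra needs no conjugation.

```python
import numpy as np

N, p, D = 10, 2, 4
mps = [np.random.rand(p, 1, D)]
mps += [np.random.rand(p, D, D) for _ in range(N - 2)]
mps.append(np.random.rand(p, D, 1))

# Accumulate the ladder: E -> sum_i A_i^T E A_i at every site. Each step is a
# pair of D x D matrix products, never the exponentially large full state.
E = np.ones((1, 1))
for A in mps:
    E = sum(A[i].T @ E @ A[i] for i in range(p))
norm_sq = float(E[0, 0])   # <Psi|Psi>, a sum of squared coefficients
```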
3.3.2 Singular value decomposition
One of the main purposes of TN states for quantum many-body systems is to represent states which reside in the physically relevant corner of a huge Hilbert space. These are low-entanglement states. The main idea behind this statement is finding a low-rank approximation of a big matrix, a task that can be achieved by the singular value decomposition (SVD) of the matrix. The SVD of a high-rank matrix C_{i_1,i_2,...,i_n; j_1,j_2,...,j_m} is given by

C_{i_1,...,i_n; j_1,...,j_m} = \sum_\alpha U_{i_1,...,i_n,\alpha} S_{\alpha,\alpha} V^\dagger_{j_1,...,j_m,\alpha}, (3.15)
where U and V are unitary matrices and S is a diagonal matrix containing non-negative values, called the singular matrix. U and V are not unique, but S is always unique for a particular matrix. The TN notation of this decomposition is shown in Fig. 3.12. The SVD is also a way to calculate the EE between two systems, and the singular matrix S is used for this purpose.
Figure 3.12: Singular Value Decomposition (SVD) of Eq. 3.15 in TN notation: here I ≡ {i_1, i_2, ..., i_n} and J ≡ {j_1, j_2, ..., j_m}.
As already said, a tensor is just a container which contains a certain set of values and allows access to these values by indexing. There is no harm in splitting and merging these indices, because it does not matter how the values are arranged.
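Merging indices and truncating the SVD can be sketched as follows; the tensor shape and the kept rank chi are illustrative choices.

```python
import numpy as np

T = np.random.rand(4, 4, 4, 4)   # e.g. C_{i1,i2;j1,j2}
M = T.reshape(16, 16)            # merge (i1,i2) and (j1,j2) into two fat indices

U, S, Vh = np.linalg.svd(M, full_matrices=False)

chi = 8                          # keep only the chi largest singular values
M_trunc = U[:, :chi] @ np.diag(S[:chi]) @ Vh[:chi, :]

# The Frobenius error equals the norm of the discarded singular values.
err = np.linalg.norm(M - M_trunc)
assert np.isclose(err, np.linalg.norm(S[chi:]))
```

This is the workhorse behind every truncation in the rest of the chapter: the discarded singular values are exactly the discarded entanglement weight.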
3.3.3 MPS construction
Suppose we are given a general N-particle state |\Psi\rangle = \sum_{i_1,i_2,...,i_N} \Psi_{i_1,i_2,...,i_N} |i_1\rangle \otimes |i_2\rangle \otimes \cdots \otimes |i_N\rangle in which each particle is a p-dimensional system. The state is completely identified by the knowledge of the complex coefficients \Psi_{i_1,...,i_N}.
After separating the first index from the rest and applying the SVD, we get the decomposition

|\Psi\rangle = \sum_j \lambda_j |L_j\rangle |R_j\rangle, (3.16)

where {|L_j\rangle} and {|R_j\rangle} are orthonormal bases and the \lambda_j are the Schmidt weights. In TN notation this process is shown in Fig. 3.13.
Figure 3.13: Pictorial presentation of SVD performed in Eq. 3.16.
Figure 3.14: Complete construction of an MPS by SVD.
The Renyi entropy is given in Eq. 3.11; note that S_0 is simply the logarithm of the number of non-zero Schmidt weights. These weights form the singular matrix, which consists of the singular values of the decomposition. Hence the singular values represent the entanglement structure across the applied cut.
We can apply consecutive SVDs along each cut, thus splitting the whole tensor into small local tensors X, with the singular matrices λ quantifying the entanglement over each cut, as shown in Fig. 3.14. Now contract the singular matrix λ^[i] into the local tensor X^[i]; consequently we have a general form of a quantum many-body state |Ψ⟩:

[TN diagram of the resulting chain of local tensors] (3.17)

The graphical representation given in Eq. 3.17 is a matrix product state (MPS). This construction is generic and contains the same number of parameters, arranged in a much more complicated fashion. In fact, the usefulness of this construction is not obvious yet, but it will become clear in a moment.
However, states with short-range entanglement across the cut have only a few non-zero Schmidt weights; let us call D the number of non-zero Schmidt weights. For these states, the MPS form enables us to truncate the λ matrices. To be more specific, any state that respects an area-law with S_0 < log c across any bipartition, where c is some constant, can be expressed exactly using an MPS with just O(dNc^2) parameters. For many relevant states, a von Neumann entropy S_1 = O(1) is enough to ensure an arbitrarily good approximation by an MPS of poly(N) bond dimension.
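The construction of Fig. 3.14 can be sketched directly: split a state vector into an MPS by successive SVDs, absorbing each singular matrix into the remainder. N and p below are illustrative, and no truncation is applied so the reconstruction is exact.

```python
import numpy as np

N, p = 6, 2
psi = np.random.rand(p**N)
psi /= np.linalg.norm(psi)

tensors, rest, Dl = [], psi.reshape(1, -1), 1
for site in range(N - 1):
    M = rest.reshape(Dl * p, -1)          # group (left bond, physical index)
    U, S, Vh = np.linalg.svd(M, full_matrices=False)
    Dr = len(S)
    tensors.append(U.reshape(Dl, p, Dr))  # local tensor X^[site]
    rest = np.diag(S) @ Vh                # carry lambda into the remainder
    Dl = Dr
tensors.append(rest.reshape(Dl, p, 1))    # last site

# Contracting the MPS back must reproduce the original coefficients.
out = tensors[0]
for A in tensors[1:]:
    out = np.einsum('...a,aib->...ib', out, A)
assert np.allclose(out.reshape(-1), psi)
```

Truncation would simply keep the largest few singular values at each step, at the cost of a controlled error given by the discarded weights.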
In an MPS, all the tensors are rank-3 tensors; the dangling indices are called physical indices, while the contracted indices are called bond or virtual indices. Sometimes it is convenient to consider PBC, and to handle periodic states an MPS is commonly modified from Eq. 3.17 to
\left| \Psi\left[ A^{[1]}, A^{[2]}, ..., A^{[N]} \right] \right\rangle = \sum_{i_1,i_2,...,i_N} \mathrm{Tr}\left[ A^{[1]}_{i_1} A^{[2]}_{i_2} \cdots A^{[N]}_{i_N} \right] |i_1, i_2, ..., i_N\rangle, (3.18)
and for translationally invariant states all the A's are the same:

\left| \Psi[A] \right\rangle = \sum_{i_1,i_2,...,i_N} \mathrm{Tr}\left[ A_{i_1} A_{i_2} \cdots A_{i_N} \right] |i_1, i_2, ..., i_N\rangle. (3.19)
This representation may look intimidating, but it is really simple: one has to specify the physical indices, and the rest of the task is just matrix multiplication. This state in TN notation is given as

[TN diagram of the periodic MPS] (3.20)
Let us discuss the characteristics of the set of A matrices. These matrices are left-normalized, i.e.

\sum_i A_i^\dagger A_i = I. (3.21)

An MPS built from only left-normalized matrices is called left-canonical. Clearly, there is nothing special about slicing the huge tensor starting from the left; one can apply the SVD from right to left, and the resulting MPS is called right-canonical, as it satisfies

\sum_i A_i A_i^\dagger = I. (3.22)
One can also choose the centre of the big tensor as a starting point. All these choices represent the same state, because the representation is not unique. One can get rid of this non-uniqueness by fixing the gauge.
3.3.4 Gauge degrees of freedom
There are various ways to write an MPS for an arbitrary quantum state. All the approaches have their own advantages and disadvantages. It should be noticed that MPS
are not unique, and this implies the existence of gauge degrees of freedom. Suppose two consecutive sets of matrices B_i and B_{i+1} share a common column/row dimension D. Then the MPS is unchanged for any invertible D×D matrix X under the transformation

B_i \to B_i X, \qquad B_{i+1} \to X^{-1} B_{i+1}. (3.23)
Fixing the gauge makes calculations very simple; all the standard constructions of MPS are special cases of particular gauge choices.
Since the matrices can be huge, one needs to restrict their size to some D to make the MPS simulatable on a computer. This is doable without compromising much on the description of a state defined in one dimension. Often, in the canonical representation of an MPS, the eigenvalues of the reduced density matrix of the system decrease exponentially. It is then possible to keep only the D eigenvectors of the reduced density matrix (Eq. 3.16) with the largest singular values; D sets the order of precision in which we are interested.
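The gauge freedom of Eq. 3.23 can be verified numerically: inserting X X^{-1} on a bond leaves every physical coefficient of the MPS unchanged. N, p, D and the bond chosen below are illustrative.

```python
import numpy as np

N, p, D = 6, 2, 3
mps = [np.random.rand(p, D, D) for _ in range(N)]

def coefficients(tensors):
    """All p^N coefficients of a PBC MPS via traces of matrix products."""
    C = np.empty((p,) * N)
    for idx in np.ndindex(*(p,) * N):
        M = np.eye(D)
        for site, i in enumerate(idx):
            M = M @ tensors[site][i]
        C[idx] = np.trace(M)
    return C

X = np.random.rand(D, D) + np.eye(D)   # generically invertible
gauged = list(mps)
gauged[2] = np.einsum('iab,bc->iac', mps[2], X)                  # B_i -> B_i X
gauged[3] = np.einsum('ab,ibc->iac', np.linalg.inv(X), mps[3])   # X^{-1} B_{i+1}

assert np.allclose(coefficients(mps), coefficients(gauged))
```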
3.4 1D Projected Entangled Pair States (PEPS)
Another way to construct an MPS is to consider it as a special case of a PEPS. This construction begins by laying out entangled pair states |φ⟩, such as Bell states, on a certain lattice and performing a linear map P on each pair of auxiliary spins:

[TN diagram of the PEPS construction] (3.24)

where

[diagram of the entangled pair |φ⟩] (3.25)
is a particular entangled pair that we chose. It is an easy task to show that this construction leads to an MPS by considering |\phi\rangle = \sum_{k=0}^{d-1} |kk\rangle. The linear map P can be written as

P = \sum_{i,\mu,\nu} A_{i,\mu,\nu} |i\rangle \langle \mu\nu|. (3.26)
The tensor A is precisely the MPS tensor we defined above: applying the projection operators to the auxiliary spins, which form entangled pairs, results in the exact construction of an MPS:

P^{(1)} \otimes P^{(2)} |\phi\rangle_{2,3} = \sum_{i_1,i_2,\mu_1,\nu_1,\mu_2,\nu_2,k} A^{(1)}_{i_1,\mu_1,\nu_1} A^{(2)}_{i_2,\mu_2,\nu_2} |i_1 i_2\rangle \langle \mu_1 \nu_1 \mu_2 \nu_2 | \left( I \otimes |kk\rangle \otimes I \right)
= \sum_{i_1,i_2,\mu_1,\nu_1,\nu_2} A^{(1)}_{i_1,\mu_1,\nu_1} A^{(2)}_{i_2,\nu_1,\nu_2} |i_1 i_2\rangle \langle \mu_1 \nu_2 |. (3.27)
Hence the two definitions are the same, and one passes between them by applying local unitaries to the auxiliary/virtual bond indices of A, or equivalently by using a different maximally entangled pair in the PEPS construction.
3.5 Examples of MPS
Let us take some concrete examples to make MPS less abstract.
AKLT State
The most interesting quantum many-body state for studying correlations is the AKLT state; it is the ground state of the Hamiltonian

H = \sum_i \mathbf{S}_i \cdot \mathbf{S}_{i+1} + \frac{1}{3}\left( \mathbf{S}_i \cdot \mathbf{S}_{i+1} \right)^2. (3.28)
Figure 3.15: AKLT state: each physical site carries a spin-1, which is replaced by two spin-1/2 degrees of freedom (called auxiliary spins). Each right auxiliary spin-1/2 on site i is entangled with the left spin-1/2 at site i+1. A linear projection operator defined on the auxiliary spins maps them to the physical spins.
Here the spins have S = 1. The ground state of this Hamiltonian can be constructed using PEPS, as depicted in Fig. 3.15. Each physical spin-1 is replaced by two spin-1/2 particles; out of the four possible states we take only the triplets to represent the S = 1 states:

|+\rangle = |\uparrow\uparrow\rangle,
|0\rangle = \frac{|\uparrow\downarrow\rangle + |\downarrow\uparrow\rangle}{\sqrt{2}}, (3.29)
|-\rangle = |\downarrow\downarrow\rangle.
On consecutive sites, adjacent spin-1/2 particles are paired in the singlet state

\frac{|\uparrow\downarrow\rangle - |\downarrow\uparrow\rangle}{\sqrt{2}}. (3.30)
This state can be represented by an MPS with bond dimension D = 2. In the description in terms of the 2L auxiliary spin-1/2 particles on a one-dimensional lattice of length L, any state can be written as

|\Psi\rangle = \sum_{\mu\nu} c_{\mu\nu} |\mu\nu\rangle, (3.31)

where |\mu\rangle = |\mu_1, \cdots, \mu_L\rangle and |\nu\rangle = |\nu_1, \cdots, \nu_L\rangle denote the first and second spin-1/2 particles on each site. The singlet bond between sites k and k+1 is

|\sigma^{[k]}\rangle = \sum_{\nu_k, \mu_{k+1}} \sigma_{\nu_k \mu_{k+1}} |\nu_k\rangle |\mu_{k+1}\rangle, (3.32)

defining the 2×2 matrix

\sigma = \begin{pmatrix} 0 & \frac{1}{\sqrt{2}} \\ -\frac{1}{\sqrt{2}} & 0 \end{pmatrix}. (3.33)
The state with singlets on all the bonds is

|\Psi_\sigma\rangle = \sum_{\mu\nu} \sigma_{\nu_1\mu_2} \sigma_{\nu_2\mu_3} \cdots \sigma_{\nu_{L-1}\mu_L} \sigma_{\nu_L\mu_1} |\mu\nu\rangle, (3.34)

for PBC. For OBC the first and last spin-1/2 particles remain unpaired, and the factor \sigma_{\nu_L\mu_1} is dropped.
Now we map the auxiliary spins-1/2, |\mu_k\rangle |\nu_k\rangle \in \{|\uparrow\rangle, |\downarrow\rangle\}^{\otimes 2}, to the physical spins |i\rangle \in \{|+\rangle, |0\rangle, |-\rangle\}. Following the PEPS construction of the previous section, we need a projection operator which maps the auxiliary Hilbert space to the physical one. We define P = \sum_{i,\mu,\nu} M^i_{\mu\nu} |i\rangle \langle \mu\nu| through three 2×2 matrices, one for each value of i:

M^+ = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}, \quad M^0 = \begin{pmatrix} 0 & \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & 0 \end{pmatrix}, \quad M^- = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}. (3.35)
The map onto the spin-1 chain is given by

\sum_{i's,\mu,\nu} M^{(1)}_{i_1,\mu_1,\nu_1} M^{(2)}_{i_2,\mu_2,\nu_2} \cdots M^{(L)}_{i_L,\mu_L,\nu_L} |i_1, i_2, ..., i_L\rangle \langle \mu\nu|, (3.36)
and applying this map to |\Psi_\sigma\rangle results in

|\Psi\rangle = \sum_{i's} \mathrm{Tr}\left[ M_{i_1}\sigma M_{i_2}\sigma \cdots M_{i_L}\sigma \right] |i_1, i_2, ..., i_L\rangle. (3.37)

For simplicity we introduce A_i = M_i \sigma; hence the AKLT state is

|\Psi\rangle = \sum_{i's} \mathrm{Tr}\left[ A_{i_1} A_{i_2} \cdots A_{i_L} \right] |i_1, i_2, ..., i_L\rangle. (3.38)
Product State
Let

A_0 = \begin{pmatrix} 1 \end{pmatrix}, \quad A_1 = \begin{pmatrix} 0 \end{pmatrix}. (3.39)

This set of matrices produces the state |00\cdots0\rangle.
W State
To get the W state for n particles we define the matrices

A_0 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad A_1 = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}, (3.40)
with periodic boundary conditions

[TN diagram of the periodic MPS with boundary matrix X] (3.41)
Notice that A_0^2 = A_0, A_0 A_1 = A_1, and A_1^2 = 0. Defining a matrix X such that \mathrm{Tr}[A_1 X] = 1 (and \mathrm{Tr}[X] = 0, so that the all-zero term vanishes), the MPS represents the state

|W\rangle = \sum_{k=1}^{N} |00\cdots0\,1_k\,00\cdots0\rangle. (3.42)
Mathematica code is given in Appendix A.1.
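An equivalent check can be sketched in Python (the Mathematica version is in the appendix). The text defines X only through Tr[A_1 X] = 1; the particular X below is an illustrative choice satisfying that condition together with Tr[X] = 0, and n = 5 is illustrative.

```python
import numpy as np

n = 5
A = [np.eye(2), np.array([[0, 1], [0, 0]])]   # A_0, A_1 of Eq. 3.40
X = np.array([[0, 0], [1, 0]])                # illustrative boundary matrix

# Amplitude of each basis state: Tr[A_{i1} ... A_{in} X]
amps = {}
for bits in np.ndindex(*(2,) * n):
    Mprod = np.eye(2)
    for b in bits:
        Mprod = Mprod @ A[b]
    amps[bits] = np.trace(Mprod @ X)

# Only basis states with exactly one '1' get amplitude 1, as in Eq. 3.42.
for bits, a in amps.items():
    assert np.isclose(a, 1.0 if sum(bits) == 1 else 0.0)
```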
GHZ State
The set of matrices A for the GHZ state is given as

A_0 = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}, \quad A_1 = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}, (3.43)

and the equivalent MPS representation via PEPS follows if we define |\phi\rangle = |00\rangle + |11\rangle and P = |0\rangle\langle00| + |1\rangle\langle11|; then we get

|GHZ\rangle = |00\cdots0\rangle + |11\cdots1\rangle. (3.44)
Mathematica code is given in Appendix A.2.
CHAPTER 4

MAPPING BETWEEN VARIATIONAL RG AND RBM
This chapter explains the relation and mapping between the renormalization group (RG) and the restricted Boltzmann machine (RBM). We begin with RG and show how theories at different length scales are constructed by the coarse-graining process in RG. In particular, RG transformations on block-spin lattices in 1D as well as in 2D will be explained. After that, we give a detailed explanation of the mapping between variational RG and RBM, considering the Ising model in 1D and 2D as examples.
4.1 Renormalization group (RG)
Renormalization group theory is one of the fundamental ideas on which the theoretical structure of Quantum Field Theory (QFT) is based. This suggests that the essential ideas of RG can also be viewed from the point of view of statistical physics and condensed matter physics.
In the early days, RG was a subject of high-energy physics, such as renormalized Quantum Electrodynamics; later on it was realized that RG ideas can be applied to statistical and condensed matter physics as well. In the high-energy physics domain, RG methods are performed in momentum space, hence called momentum-space RG; real-space RG ideas, on the other hand, are used in statistical physics. From now on we will only discuss real-space RG methods.
The RG is a framework, or a set of ideas, for tackling problems in which fluctuations occur in a system at all length scales, for example systems at criticality. At the
critical point of the system our previous approach, Mean Field Theory (MFT), breaks down, because in the mean-field model we suppose that a particle interacts with all other particles in the system with equal strength. Usually we account for nearest-neighbour interactions, and the treatment gets complicated as longer-range interactions are included. The RG approach uses a very interesting set of conceptual ideas, such as scale invariance: the singularities that appear at critical points are connected to behaviour of the statistical system that is the same at all length scales, so at criticality the system has correlations at all length scales. That scale invariance is expressed by the free energy function or Hamiltonian of the system, which in turn consists of components that do not depend on the length scale; as a result, the Hamiltonian or free energy of the system does not change with a change of length scale. This "unchanging" response is expressed as being at a fixed point. Repeated application to the system of a transformation which changes its length scale but keeps its free energy unchanged is called an RG transformation, or renormalization.

It is observed that near criticality many different systems have identical quantities describing the critical behaviour (critical exponents). These systems fall into the same universality class, meaning that systems of entirely different origins behave identically near the critical point and hence have the same critical exponents. For example, the para-ferromagnet and liquid-gas phase transitions belong to the same universality class.
The point of this representation, as opposed to the previous one, is that in several important instances one may be interested not only in what these universal properties are, but also in the explicit temperature or phase diagram of the system as a function of external parameters as well as temperature. And if we have some idea how the microscopic degrees of freedom interact in our chosen system, then we should be able to solve the partition function (for the Ising model or a similar lattice model) and obtain its singularities. For example, for a particular system someone may be interested in the critical temperature where the phase transition occurs, as well as the full phase diagram and critical behaviour.
No matter what the motivation is, all RG methods provide a set of mathematical equations that define a renormalization flow in some complex parameter space, and these flows tell us a lot about the physical problem at hand, which is the strength of RG theory. But these methods are difficult to control quantitatively as the number of parameters becomes large. Despite this challenge, RG provides powerful and advanced concepts such as scaling and universality.
4.1.1 1D Ising model
Let us begin with a model that does not show critical behaviour but is exactly solvable. The 1D Ising model consists of physical spins v_i lying on a lattice with some lattice constant. When no external magnetic field is present, the Hamiltonian of the system looks like

H = -K \sum_i v_i v_{i+1}, (4.1)
where K is the coupling constant that favours aligned spin configurations. There are various schemes to perform an RG transformation; one is to marginalize over half of the spins, which doubles the lattice constant, with K^{(1)} the coupling of the new coarse-grained system. We can coarse-grain this system again, which constitutes the next round of RG. Let us denote the set of coupling constants K^{(0)}, K^{(1)}, \cdots, K^{(n)}, which define the interactions between spins after each RG transformation. The output of the RG procedure is an elegant recursive relation between the couplings, which defines a flow:

\tanh K^{(n)} = \tanh^2 K^{(n-1)}. (4.2)

Here K = K^{(0)}, and the detailed RG transformation for the 1D Ising model is given in Appendix D.
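The recursion of Eq. 4.2 can be iterated in a few lines; the starting coupling K = 1.0 and the number of steps are illustrative. Any finite starting coupling flows monotonically to the trivial fixed point K = 0, consistent with the absence of a phase transition in 1D.

```python
import numpy as np

# Eq. 4.2: tanh K^(n) = tanh^2 K^(n-1), inverted with arctanh.
K = 1.0
flow = [K]
for _ in range(10):
    K = np.arctanh(np.tanh(K) ** 2)
    flow.append(K)

# The coupling shrinks at every step and heads to the K = 0 fixed point.
assert all(a > b for a, b in zip(flow, flow[1:]))
```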
4.1.2 2D Ising model
Now consider a system that does have critical behaviour. The first step in the RG procedure is decimation/coarse-graining; there are many renormalization schemes to achieve this, one of which is shown in Fig. 4.1. To perform this step, first notice that every spin has four nearest neighbours; we arrange the partition function in such a way that every decimated spin shows up in only one Boltzmann factor:

Z = \sum_{v_1, v_2, \cdots} \cdots\, e^{K v_5 (v_1+v_2+v_3+v_4)}\, e^{K v_6 (v_2+v_3+v_7+v_8)} \cdots. (4.3)
After summing over every other spin, we have to find the Kadanoff transformation such that the summed partition function looks exactly like the unsummed partition function. But this time the task is not as simple as in the 1D Ising model: we have to include all types of couplings, as shown in Fig. 4.2, otherwise there is no solution. A single K will not work; we have to include all three possible couplings {K_1, K_2, K_3}:
Figure 4.1: The RG decimation scheme: every spin has 4 nearest neighbours and we sum over half of the spins. The resulting lattice is the same as the original but rotated by 45°.
e^{K v_5 (v_1+v_2+v_3+v_4)} + e^{-K v_5 (v_1+v_2+v_3+v_4)} = g(K)\, e^{\frac{1}{2} K_1 (v_1v_2+v_2v_3+v_3v_4+v_4v_1) + K_2 (v_1v_3+v_2v_4) + K_3 v_1v_2v_3v_4}. (4.4)
Inserting all possible values of (v_1, v_2, v_3, v_4), we obtain four equations with four unknowns. Their solutions are:

K_1 = \frac{1}{4} \ln\cosh(4K),
K_2 = \frac{1}{8} \ln\cosh(4K),
K_3 = \frac{1}{8} \ln\cosh(4K) - \frac{1}{2} \ln\cosh(2K),
g(K) = 2\, [\cosh(2K)]^{1/2}\, [\cosh(4K)]^{1/8}. (4.5)
Combining Eq. 4.4 with the partially summed partition function produces

Z(K, N) = \sum_{N \text{ spins}} e^{K \sum'_{i,j} v_i v_j}
= [g(K)]^{N/2} \sum_{N/2 \text{ spins}} e^{K_1 \sum'_{i,j} v_i v_j + K_2 \sum''_{l,m} v_l v_m + K_3 \sum'''_{p,q,r,t} v_p v_q v_r v_t}, (4.6)
where the second sum is over the remaining N/2 spins. The single-primed sum is over nearest neighbours, the double-primed over next-nearest neighbours, and the triple-primed over the four-spin products around each square. Notice that we obtained a complicated connectivity
Figure 4.2: In the coarse-graining process, removing degrees of freedom yields a high degree of connectivity. Yellow lines show the nearest-neighbour couplings v_1v_2 + v_2v_3 + v_3v_4 + v_4v_1, green lines represent the next-nearest neighbours v_1v_3 + v_2v_4, and the blue connections depict the four-spin product v_1v_2v_3v_4.
(as shown in Fig. 4.2) after removing the degrees of freedom. This is the major complication in the RG treatment, and it motivates the variational renormalization techniques we will see in the next section.
These complicated couplings do not allow us to express the partition function in the form of a Kadanoff transformation, so we cannot perform the RG calculation exactly. We need to approximate the couplings somehow to proceed further. One possible solution would be to neglect K_2 and K_3, but the resulting equation is the same as for the 1D Ising model and predicts no phase transition. For a better approximation, at least K_2 must be included in the calculation. One appropriate solution is a mean-field-like approximation in which the next-nearest couplings are incorporated into the nearest ones as K_1 \sum'_{i,j} v_i v_j + K_2 \sum''_{l,m} v_l v_m \approx K'(K_1, K_2) \sum'_{i,j} v_i v_j. This approximation enables us to describe the partially summed Z in the form of the unsummed Z:

Z(K, N) = [g(K)]^{N/2}\, Z[K'(K_1, K_2), N/2].
Let us define the free energy per particle as f(K) = N^{-1} \ln Z(K, N); using the value of g(K), this leads to

f(K') = 2 f(K) - \ln\left\{ 2\, [\cosh(2K)]^{1/2} [\cosh(4K)]^{1/8} \right\}. (4.7)
Consider the energy when the system is highly ordered, i.e. all spins are aligned. For a 2D square lattice with N/2 spins, there are N nearest-neighbour and also N next-nearest-neighbour bonds. So K_1 \sum'_{i,j} v_i v_j = N K_1 and K_2 \sum''_{l,m} v_l v_m = N K_2 when all spins are parallel. We therefore estimate K' \approx K_1 + K_2. Using Eq. 4.5 we get
K′ = 3
8lncosh(4K). (4.8)
So we now have a recursion relation for the renormalization; notice that Eq. 4.8 has a nontrivial "fixed point". A finite value $K_c$ exists where
$$K_c = \tfrac{3}{8} \ln\cosh(4K_c).$$
Indeed,
$$K_c = 0.50698.$$
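As a quick numerical check (our own sketch, not part of the thesis calculations), the code below iterates the recursion of Eq. 4.8 and locates its nontrivial fixed point by bisection; the helper names are illustrative.

```python
import math

def rg_step(k):
    """One step of the recursion in Eq. 4.8: K' = (3/8) ln cosh(4K)."""
    return 0.375 * math.log(math.cosh(4.0 * k))

def fixed_point(lo=0.2, hi=1.0, tol=1e-12):
    """Locate the nontrivial fixed point K_c of rg_step by bisection on
    g(K) = rg_step(K) - K, which changes sign on [lo, hi]."""
    g = lambda k: rg_step(k) - k
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(lo) * g(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

Starting the iteration just below $K_c$ drives the coupling towards $0$ (the disordered sink), and just above $K_c$ towards $\infty$ (the ordered sink), matching Fig. 4.3.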
Figure 4.3: RG flow diagram for the 2D Ising model: there are three fixed points, two stable ($K = 0, \infty$) and one unstable ($K_c$). $K_c$ is the phase transition point.
If the renormalization starts near $K_c$, Fig. 4.3 depicts that the iterations move away from $K_c$. This non-trivial fixed point is unstable and often called a "source"; the other two fixed points are called "sinks".

This RG treatment of the 2D Ising model shows that, even with rudimentary approximations, RG methods are powerful enough to extract qualitative results. Before going to the next section, we summarize a few observations made in this topic. The first is the high degree of connectivity, which produces cooperation between variables and is a cause of the phase transition. That topology also leads to very complicated interactions as one sums over more and more variables. One can appreciate how complicated the interactions become by considering the integration over 5 spins in the given 2D lattice, as opposed to the 1D case. One cannot generate such a high degree of connectivity in the 1D Ising model, and that model exhibits no phase transition.
Indeed, if we neglect the complicated interactions then the RG process becomes very easy, but it leads to wrong predictions. One way to think about this is that removing degrees of freedom takes us into a higher-dimensional space of couplings, in this case $K_1, K_2$ and $K_3$, and the resulting partition function $Z$ depends on all of these coupling parameters. One has to consider all imaginable couplings to compute the partition function by the RG process, and that results in a transformation on a higher-dimensional coupling space, i.e. $\{K'\} = R\{K\}$. It is only by approximation that we restrict the flow diagram to a one-dimensional parameter space. For more detailed calculations see Refs. [24][25][26].
The difficulties discussed above motivate us to consider a numerical technique to perform the RG transformation. We can use variational methods to approximate the exact transformation, and this is the topic of the next section.
4.2 Variational RG
In the previous section, RG theory was explained through two simple examples, and we have seen that the major difficulty in an RG transformation is taking account of all the possible couplings. The variational approach was introduced to tackle this difficulty and make renormalization tractable.
Consider an ensemble of $N$ spins $v = \{v_i\}$ sitting on some lattice; each spin can take the two values $\pm 1$, and the index $i$ denotes the position of the spin on the lattice. At thermal equilibrium, the fundamental formula of statistical mechanics relates the probability of a particular configuration of the system to its energy:
$$P(v) = \frac{e^{-H(v)}}{Z}, \tag{4.9}$$
$$Z = \mathrm{Tr}_v\, e^{-H(v)} \equiv \sum_{v_1, v_2, \cdots, v_N = \pm 1} e^{-H(v)}, \tag{4.10}$$
where $H(v)$ is the Hamiltonian of the system and $Z$ is the partition function.
For convenience, set the temperature $T = 1$. Usually, as we have seen for the 2D Ising model, all possible Hamiltonians are parameterized by all imaginable couplings $\mathbf{K} = \{K\}$. For instance, there are three orders of interaction $\{K_1, K_2, K_3\}$ in Eq. 4.6. The order of the interactions depends on the topology of the system; the most general form of interaction is
$$H[v] = -\sum_i K_i v_i - \sum_{i,j} K_{i,j} v_i v_j - \sum_{i,j,k} K_{i,j,k} v_i v_j v_k + \cdots. \tag{4.11}$$
Given the partition function $Z$, the free energy is defined in the usual way,
$$F^v = -\log Z = -\log\big(\mathrm{Tr}_v\, e^{-H(v)}\big). \tag{4.12}$$
The idea on which RG is based is: given a fine-grained description of the system, find a coarse-grained one by summing over short-distance fluctuations. The new coarse-grained system consists of new and relatively few $M < N$ degrees of freedom $h = \{h_j\}$. In this process, the new coarse-grained description of the system has a larger characteristic length scale (lattice spacing) than the previous one. One can easily rescale the system to
Figure 4.4: Block spin transformation: (a) $2\times 2$ blocks are defined on the physical spins $v_i$ to be marginalized; (b) depicts the effective spins $h_i$ after marginalizing the physical degrees of freedom; (c) a side view of the RG procedure is shown: repeated application of RG transformations produces a series of ensembles one on top of the other.
have the same characteristic length. For instance, the block spin RG transformation defined by Kadanoff is shown in Fig. 4.4, where the "auxiliary" spins $h_i$ represent the state of a $2\times 2$ local block of "physical" spins $v_i$. Under this renormalization scheme the lattice spacing doubles after each RG step.
In this way, RG allows us to replace an ensemble consisting of $N$ degrees of freedom $v$ with an ensemble composed of fewer degrees of freedom $h$. The new system has the same form of $Z$ with new coupling parameters $\mathbf{K}' = \{K'\}$, and the Hamiltonian $H^{RG}[h]$ defined on this new system is parameterized by these couplings, where $\{K'\}$ include the effect of the couplings in the fine-grained system:
$$H^{RG}[h] = -\sum_i K'_i h_i - \sum_{i,j} K'_{i,j} h_i h_j - \sum_{i,j,k} K'_{i,j,k} h_i h_j h_k + \cdots, \tag{4.13}$$
where the hidden spins $h$ are coupled with each other through $\mathbf{K}'$. In simple terms, this renormalization transformation is just a mapping between couplings, $\mathbf{K} \to \mathbf{K}'$. The exact mapping depends strongly on the chosen RG scheme, and it is often difficult to solve this problem analytically.
Kadanoff introduced a variational approach to the RG scheme. In his proposed scheme, the coarse-graining is performed by introducing a projection function $T_\chi(v,h)$ that is parameterized by variational parameters $\{\chi\}$ and encodes the interactions between the hidden spins $h$ and the physical spins $v$. After coupling $v$ and $h$, one can sum over the physical/visible spins to obtain a coarse-grained description of the
fine-grained system completely in terms of $h$. Naturally, the function $T_\chi$ defines a Hamiltonian for $h$ through
$$e^{-H^{RG}_\chi[h]} = \mathrm{Tr}_v\, e^{T_\chi(v,h) - H(v)}. \tag{4.14}$$
The free energy of the coarse-grained system is defined in a similar way,
$$F^h_\chi = -\log\big(\mathrm{Tr}_h\, e^{-H^{RG}_\chi(h)}\big). \tag{4.15}$$
So far we have not paid attention to the variational parameters $\chi$ of $T_\chi$. An exact renormalization procedure preserves the physics at large length scales; to ensure this quantitatively, the free energy of the coarse-grained system should equal that of the fine-grained one. So, choose $\chi$ in such a way that $\Delta F = F^h_\chi - F^v$ vanishes. For an exact transformation under any RG scheme,
$$\mathrm{Tr}_h\, e^{T_\chi(v,h)} = 1. \tag{4.16}$$
Notice that the two conditions are in fact equivalent:
$$\Delta F = 0 \iff \mathrm{Tr}_h\, e^{T_\chi(v,h)} = 1. \tag{4.17}$$
Usually it is hard to carry out the exact transformation by choosing an optimal $\chi$, so various optimization and variational techniques have been proposed to determine $\chi$ by minimizing $\Delta F$.
4.3 Overview of RBM
Before going into the details of the mapping between RBM and RG, let us introduce some convenient notation for the RBM. The true probability distribution is $P(v)$ over $N$ visible units, and $M$ is the number of hidden units. The energy function is defined as $E(v,h) = -\left(a^T v + b^T h + h^T W v\right)$, matching the sign convention of chapter 5. The set of variational parameters is $\chi = \{a_i, W_{ij}, b_j\}$, and the joint probability distribution $P_\chi$ is
$$P_\chi(v,h) = \frac{e^{-E(v,h)}}{Z}, \tag{4.18}$$
and the marginal distributions are given by
$$P_\chi(v) = \sum_h P_\chi(v,h) = \mathrm{Tr}_h\, P_\chi(v,h), \tag{4.19}$$
$$P_\chi(h) = \sum_v P_\chi(v,h) = \mathrm{Tr}_v\, P_\chi(v,h). \tag{4.20}$$
Let us define a "variational" RBM in terms of Hamiltonians, for later convenience. For the hidden units:
$$P_\chi(h) = \frac{e^{-H^{RBM}_\chi[h]}}{Z}, \tag{4.21}$$
and for the visible units:
$$P_\chi(v) = \frac{e^{-H^{RBM}_\chi[v]}}{Z}. \tag{4.22}$$
The training of an RBM is based on maximizing the log-likelihood by gradient methods. A detailed explanation of the training of RBMs, deep neural networks (DNNs) and deep belief networks (DBNs) is given in chapter 2.
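To make the notation concrete, here is a minimal brute-force sketch (our own illustration, assuming binary $\{0,1\}$ units and the sign convention above) that enumerates every state of a tiny RBM and computes the marginals of Eqs. 4.19-4.20 exactly.

```python
import itertools, math

def weight(v, h, a, b, W):
    """Unnormalized Boltzmann weight e^{-E(v,h)} with
    E(v,h) = -(a.v + b.h + h^T W v); W is (hidden x visible)."""
    e = sum(a[i] * v[i] for i in range(len(v)))
    e += sum(b[j] * h[j] for j in range(len(h)))
    e += sum(h[j] * W[j][i] * v[i]
             for j in range(len(h)) for i in range(len(v)))
    return math.exp(e)

def marginals(a, b, W):
    """Exact P_chi(v) and P_chi(h) (Eqs. 4.19-4.20) by enumerating
    every joint configuration of a binary {0,1} RBM."""
    nv, nh = len(a), len(b)
    vs = list(itertools.product((0, 1), repeat=nv))
    hs = list(itertools.product((0, 1), repeat=nh))
    Z = sum(weight(v, h, a, b, W) for v in vs for h in hs)
    Pv = {v: sum(weight(v, h, a, b, W) for h in hs) / Z for v in vs}
    Ph = {h: sum(weight(v, h, a, b, W) for v in vs) / Z for h in hs}
    return Pv, Ph
```

This brute-force enumeration is only feasible for a handful of units, which is exactly why the variational and RG viewpoints of the next section matter.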
4.4 Mapping between Variational RG and RBM
In an RBM the energy of a configuration of visible and hidden units is given by $E(v,h)$. In variational RG, a similar role is played by $T_\chi(v,h)$, which encodes the coupling between visible and hidden units. We claim, and will prove, the following relation between the two:
$$T(v,h) = -E(v,h) + H[v], \tag{4.23}$$
where $H[v]$ is the Hamiltonian that describes the data and encodes the probability distribution $P(v)$ of the data, as given in Eq. 4.11. This claimed relation defines the mapping between variational RG and a particular type of DNN, the RBM.
Using this relation it is easy to show that $H^{RG}_\chi[h]$, which defines the coarse-grained system in variational RG, also describes the hidden spin variables of the RBM. Put the other way around, the marginal distribution $P_\chi(h)$ over the hidden spin variables of the RBM is the Boltzmann distribution weighted by $H^{RG}_\chi[h]$. To prove this, let us begin with Eq. 4.14, normalize the equation by $Z$, and substitute our claimed relation Eq. 4.23:
$$\frac{e^{-H^{RG}_\chi[h]}}{Z} = \mathrm{Tr}_v\, \frac{e^{-E(v,h)}}{Z} = P_\chi(h). \tag{4.24}$$
Putting Eq. 4.21 into the above expression, we conclude
$$H^{RG}_\chi[h] = H^{RBM}_\chi[h]. \tag{4.25}$$
Using these results, variational RG can be viewed in the language of probability theory. The projection operator $T_\chi(v,h)$ can be interpreted as the variationally approximated conditional probability of the hidden layer given the visible layer. To obtain an explicit expression, take the exponential of Eq. 4.23 and use the joint and marginal (visible) probability distributions. The desired result is
$$e^{T(v,h)} = P_\chi(h \mid v)\, e^{H[v] - H^{RBM}_\chi[v]}. \tag{4.26}$$
It shows that equality between the variationally approximated Hamiltonian $H^{RBM}_\chi[v]$ of the visible units of the RBM and the Hamiltonian $H[v]$ that describes the data ensures an exact RG procedure, i.e. $\mathrm{Tr}_h\, e^{T_\chi(v,h)} = 1$. So $H^{RBM}_\chi[v] = H[v]$ also implies that $T(v,h)$ yields the exact conditional probability distribution and that the two probabilities, the model distribution $P_\chi(v)$ and the true distribution $P(v)$, are equal. In machine learning terms, the gradient of the log-likelihood approaches zero.
The exact RG transformation is carried out by various variational approximation schemes; machine learning, on the other hand, approximates by maximizing the log-likelihood. Thus RG and deep neural networks each provide a well-defined variational scheme for the coarse-graining process. Lastly, the equivalence of the two approaches does not depend on the specific form of the energy $E(v,h)$, so it also holds for any Boltzmann machine.
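The identity in Eq. 4.24 is algebraic, but a brute-force check makes it concrete. The sketch below (our own, with a toy nearest-neighbour chain standing in for the data Hamiltonian $H[v]$) verifies that substituting $T = -E + H$ into Eq. 4.14 reproduces $\mathrm{Tr}_v\, e^{-E(v,h)}$ for every hidden configuration.

```python
import itertools, math

def energy(v, h, a, b, W):
    """RBM energy E(v,h) = -(a.v + b.h + h^T W v); W is (hidden x visible)."""
    return -(sum(a[i] * v[i] for i in range(len(v)))
             + sum(b[j] * h[j] for j in range(len(h)))
             + sum(h[j] * W[j][i] * v[i]
                   for j in range(len(h)) for i in range(len(v))))

def H_data(v):
    """Toy data Hamiltonian H[v] (an arbitrary illustrative choice):
    a nearest-neighbour chain."""
    return -sum(v[i] * v[i + 1] for i in range(len(v) - 1))

def mapping_holds(a, b, W):
    """Check Eq. 4.24: with T(v,h) = -E(v,h) + H[v],
    Tr_v e^{T(v,h) - H(v)} equals Tr_v e^{-E(v,h)} for every h."""
    nv, nh = len(a), len(b)
    for h in itertools.product((0, 1), repeat=nh):
        lhs = sum(math.exp((-energy(v, h, a, b, W) + H_data(v)) - H_data(v))
                  for v in itertools.product((0, 1), repeat=nv))
        rhs = sum(math.exp(-energy(v, h, a, b, W))
                  for v in itertools.product((0, 1), repeat=nv))
        if abs(lhs - rhs) > 1e-9 * rhs:
            return False
    return True
```

The check holds for any choice of $H[v]$, reflecting the fact that the equivalence does not depend on the specific data Hamiltonian or energy function.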
4.5 Examples
To illustrate the relation between the RG and deep learning approaches, it is a good idea to discuss some simple examples. The first (the 1D Ising model introduced earlier in this chapter, with detailed calculations given in Appendix D) we have explained from the point of view of physics; now we consider it from the perspective of deep learning. After that we apply the deep learning variational approach to the 2D Ising model.
4.5.1 Ising model in 1D
For the 1D Ising model, the coarse-graining by the RG transformation is shown in Fig. 4.5(a). We consider nearest-neighbour couplings $K^{(0)}$ in the fine-grained system, and $K^{(n)}$ are the couplings in the coarse-grained system after $n$ RG transformations. If we start with a large value of $K$, we flow towards weak coupling, as is clear from the RG flow equation, Eq. 4.2.
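Assuming Eq. 4.2 is the standard 1D decimation recursion $K' = \tfrac{1}{2}\ln\cosh(2K)$, the following sketch (ours, not from the thesis) generates the sequence of couplings $K^{(0)}, K^{(1)}, \ldots$, one per layer of Fig. 4.5, and exhibits the flow towards weak coupling.

```python
import math

def decimate(k):
    """One decimation of the 1D Ising chain: summing out every other spin
    gives the standard recursion K' = (1/2) ln cosh(2K)."""
    return 0.5 * math.log(math.cosh(2.0 * k))

def flow(k0, steps):
    """Couplings K^(0), K^(1), ..., K^(steps), one per layer of Fig. 4.5."""
    ks = [k0]
    for _ in range(steps):
        ks.append(decimate(ks[-1]))
    return ks
```

Even a large initial coupling collapses towards zero within a few layers, which is the RG statement that the 1D model has no phase transition.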
Figure 4.5: RG and deep learning aspects of the 1D Ising model: (a) coarse-graining by the renormalization transformation of the ferromagnetic 1D Ising model. After every RG iteration, half of the spins are marginalized and the lattice spacing doubles. At each level the system is replaced by a new system with relatively fewer degrees of freedom and new couplings $K$. By the RG flow equation, the couplings at the previous layer provide the couplings for the next layer. (b) The RG coarse-graining can also be performed by a deep learning architecture where the weights/parameters between the $(n-1)$-th and $n$-th hidden layers are given by $K^{(n-1)}$.
The RG transformation can be used as a guide for a deep architecture, as shown in Fig. 4.5. The spins at the $n$-th layer of the DNN can be interpreted as the coarse-grained spins obtained by implementing the RG transformation on the $(n-1)$-th layer. It should be noted that the first two layers of the DNN shown in Fig. 4.5(b) form an "effective" 1D spin chain similar to the actual one. So decimating every other spin in the actual spin chain is equivalent to marginalizing/summing over spins in the layer below in the DNN. This suggests that the hidden layer of the DNN is also described by the RG-transformed Hamiltonian when $K^{(1)}$ are the local short-range couplings, and the same holds for every layer of the DNN architecture in the coarse-graining process.
This simple example gives some idea of how to construct a DNN, in particular how to select the depth and width of the network, and requires no calculations. These two network parameters are called hyper-parameters, and the deep learning community has no clean guiding principle for choosing them; this shows the value of the mapping between DNNs and RG. It tells us nothing, however, about the half of the spins that are not connected to the hidden layer.
Figure 4.6: Deep neural network for the 2D Ising model: (a) a four-layered DBN is constructed with layer sizes 1600, 400, 100, and 25 spins. The network is trained on samples generated from the 2D Ising model with $J = 0.405$. (b) The effective receptive field (ERF), which encodes the effect of the input spins on a particular spin in a given layer, for the top layer. Each image is $40\times 40$ and depicts the ERF of a particular spin in the top layer. (c) The ERF gets larger as we move from the bottom to the top layer of the network, consistent with successive block spin transformations. (d) Similarly, the ERF for the third layer with 100 spins. (e) Three samples generated from the trained network.
4.5.2 Ising model in 2D
In Sec. 4.1 we discussed the renormalization of the Ising model in two dimensions. Now we apply deep learning techniques to the two-dimensional Ising model with nearest-neighbour ferromagnetic coupling. The model is defined by the Hamiltonian
$$H[v] = -J \sum_{\langle ij\rangle} v_i v_j. \tag{4.27}$$
The two-dimensional Ising model exhibits a phase transition at $J/k_B T = 0.4407$, near which the system can be coarse-grained (Kadanoff block spin transformation) because of the divergence of the correlation length.
Motivated by the relation between DNNs and variational RG, we trained on data generated from the two-dimensional Ising model at $J = 0.405$. Using the standard Monte Carlo technique, 50,000 samples were generated and used as input to the four-layered 1600-400-100-25 DBN of Fig. 4.6(a). Moreover, we employed L1 regularization and used the contrastive divergence method for training. The regularization term drives most of the weights towards zero and prevents the model from overfitting; in practice, it enforces local interactions between the visible and hidden spins, Fig. 4.6(b),(d). Had we used a convolutional neural network, it would have taken care of the locality of the interactions by definition.
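The thesis uses a standard Monte Carlo technique; a minimal Metropolis sampler of the kind that could generate such samples looks like the sketch below (our own; the lattice size, sweep count and seed are illustrative choices, not those used for the actual dataset).

```python
import math, random

def metropolis_ising_2d(L, J, sweeps, seed=0):
    """Metropolis sampler for the 2D Ising model H = -J sum_<ij> v_i v_j
    on an L x L periodic lattice; returns one spin configuration."""
    rng = random.Random(seed)
    s = [[rng.choice((-1, 1)) for _ in range(L)] for _ in range(L)]
    for _ in range(sweeps):
        for _ in range(L * L):
            i, j = rng.randrange(L), rng.randrange(L)
            nb = (s[(i + 1) % L][j] + s[(i - 1) % L][j]
                  + s[i][(j + 1) % L] + s[i][(j - 1) % L])
            dE = 2.0 * J * s[i][j] * nb  # energy cost of flipping spin (i, j)
            if dE <= 0 or rng.random() < math.exp(-dE):
                s[i][j] = -s[i][j]
    return s
```

Repeating this with different seeds, after enough equilibration sweeps, would produce a training set analogous to the 50,000 samples described above.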
The DNN architecture is similar to the $2\times 2$ block spin renormalization procedure, which implies that the DNN is performing a coarse-graining procedure. Notice that the coupling range within a block is the same in each layer, while it increases from the bottom layer to the upper ones. The features of Kadanoff block spin renormalization emerge from the deep architecture, which implies that the DNN is effectively employing block spin renormalization. Moreover, the DNN reproduces quality samples even though the compression ratio is 64. The coding details are provided in Appendix B.2.
CHAPTER 5
CORRESPONDENCE BETWEEN RESTRICTED BOLTZMANN MACHINE AND TENSOR NETWORK STATES
In chapter 2 we discussed machine learning basics and energy-based models from deep learning, more specifically Boltzmann machines and DBNs. The RBM is an important building block of deep learning models and finds a wide range of applications. In chapter 4 we established the relation between RBM and RG. In the current chapter we construct a bridge between tensor network states (TNS) and RBMs. TNS are extensively used in quantum many-body physics. We will study an efficient algorithm to convert an RBM into the widely used TNS, and also the difficulties and conditions involved in moving back from a TNS to a specific RBM architecture. Moreover, we can quantify the expressive power of an RBM on a complex dataset by using the concept of entanglement entropy (EE) in TNS. Exploring TNS and the extent of their entanglement may lead us to a guiding principle for designing deep architectures. Conversely, an RBM can describe a quantum many-body system with relatively fewer parameters than a TNS, which enables us to simulate such systems more efficiently on classical computers.
5.1 Transformation of an RBM to TNS
This section is devoted to building a connection between RBM and TNS. The importance of the TNS representation of an RBM is that it provides an upper limit on the EE the RBM can express; structural information about the RBM alone is enough to estimate this bound. First we discuss an easy
Figure 5.1: Correspondence between RBM and TNS: (a) RBM represented as an undirected graphical model, as defined by Eq. 5.1. The blue circles denote the visible units $v$ and the gray circles the hidden units $h$; they interact with each other through links drawn as solid lines. (b) MPS described by Eq. 5.5. Each dark blue dot represents a 3-index tensor $A^{(i)}$. From now on we use hollow circles to denote RBM units and filled ones to denote tensors. Undirected lines in the RBM represent the link weights, while lines in the MPS denote the bond indices, with the thickness of a bond expressing its dimension. RBM and TNS are both used to represent complex multi-variable functions, and both can describe any function with arbitrary precision given unlimited resources (unlimited hidden variables or bond dimensions). Given limited resources, however, they represent two overlapping but independent regions.
and illustrative method, then present more elegant approaches that yield optimal bounds on the TNS bond dimensions. Code is provided in [27]. Before going into details, we mention that we follow the notation used in [1]. In that paper the probability distribution of the visible units is interpreted as a quantum wave function. With
$$E(v,h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i W_{ij} h_j,$$
where $v = \{v_i\}$ and $h = \{h_j\}$ take the values 0 and 1 and the parameters $\{a_i, b_j, W_{ij}\}$ are complex numbers, the unnormalized quantum wave function is expressed as
$$\Psi_{RBM}(v) = \sum_h e^{-E(v,h)} = \prod_i e^{a_i v_i} \prod_j \big(1 + e^{b_j + \sum_i v_i W_{ij}}\big). \tag{5.1}$$
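The closed form in Eq. 5.1 follows from summing each $h_j$ over $\{0,1\}$ independently. The sketch below (ours, using real parameters for simplicity) verifies the product formula against the brute-force hidden sum for a tiny RBM.

```python
import itertools, math

def psi_brute(v, a, b, W):
    """Sum over all hidden configurations h in {0,1}^nh of
    exp(sum_i a_i v_i + sum_j b_j h_j + sum_ij v_i W_ij h_j)."""
    nv, nh = len(a), len(b)
    total = 0.0
    for h in itertools.product((0, 1), repeat=nh):
        e = (sum(a[i] * v[i] for i in range(nv))
             + sum(b[j] * h[j] for j in range(nh))
             + sum(v[i] * W[i][j] * h[j]
                   for i in range(nv) for j in range(nh)))
        total += math.exp(e)
    return total

def psi_product(v, a, b, W):
    """Closed form of Eq. 5.1:
    prod_i e^{a_i v_i} * prod_j (1 + e^{b_j + sum_i v_i W_ij})."""
    val = math.exp(sum(a[i] * v[i] for i in range(len(a))))
    for j in range(len(b)):
        val *= 1.0 + math.exp(b[j] + sum(v[i] * W[i][j] for i in range(len(a))))
    return val
```

The brute-force sum is exponential in the number of hidden units, while the product form is linear; this factorization is exactly what the tensor decomposition below exploits.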
The common RBM architecture has dense connectivity between visible and hidden units, but here we use sparse connections (as shown in Fig. 5.1) to keep things clear. The approach we use, however, is general for any RBM architecture.
5.1.1 Direct transformation from RBM to MPS
Let us start with a simple and intuitive example of transforming the RBM architecture given in Fig. 5.1 into a Matrix Product State (MPS). Defining the hidden and visible units as the virtual
and physical indices, respectively, is the first step of converting the RBM to an MPS. To achieve this, we separate out the Boltzmann weights and represent the biases at the individual vertices with $\Gamma^{(i)}_v$ and $\Gamma^{(j)}_h$. Each connection between unit $j$ of the hidden layer and unit $i$ of the visible layer is encoded in a matrix $M^{(ij)}$:
$$\Gamma^{(i)}_v = \mathrm{diag}(1, e^{a_i}), \tag{5.2}$$
$$\Gamma^{(j)}_h = \mathrm{diag}(1, e^{b_j}), \tag{5.3}$$
$$M^{(ij)} = \begin{pmatrix} 1 & 1 \\ 1 & e^{W_{ij}} \end{pmatrix}. \tag{5.4}$$
This gives the TNS representation of the RBM in Eq. 5.1; the process is depicted in Fig. 5.2(a). The MPS-parameterized wave function with $n_v$ physical degrees of freedom is given by
$$\Psi_{MPS}(v) = \mathrm{Tr} \prod_i A^{(i)}. \tag{5.5}$$
The second step is to map the TNS of Fig. 5.2(a) into an MPS. First we divide the graph into $n_v$ pieces, as depicted in Fig. 5.2, where $n_v$ is the number of visible units. Each piece contains one visible unit, and the assignment of hidden units to each region is arbitrary. After that, contract all the units in each region, which is equivalent to summing over the hidden units, and merge all the external links of each piece into a virtual bond; this yields the MPS shown in Fig. 5.2(c). The dimension of each virtual bond is determined by the number of links incorporated into that bond.
Notice that we treat long-range connections, which cross more than one vertical cut, in a different manner. In our current example the only long-range connection is from $h_1$ to $v_4$, as shown in Fig. 5.2(b). The matrix $M^{(14)}$ defining this connection is broken into any two $2\times 2$ matrices subject to the constraint $M^{(14)} = M_1 \cdot M_2$, and these are included in the virtual bonds of the MPS. This represents the long-range connection as two short-range connections defined by $M_1$ and $M_2$, as shown in Fig. 5.2(d). $M_1$ and $M_2$ are then merged into the local tensors $A^{(2)}$ and $A^{(3)}$ (corresponding to $v_2$ and $v_3$), respectively. The extended connection doubles the bond degrees of freedom of every tensor it passes through. The bond dimension of a particular tensor in the MPS is set by the number of links $n$ cut by the vertical dotted line, i.e. $D = 2^n$.
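Any factorization satisfying $M^{(14)} = M_1 \cdot M_2$ is admissible; the simplest choice pushes the whole matrix into one bond and an identity into the other. A minimal sketch (ours; plain nested lists stand in for tensors):

```python
def split_long_range(M):
    """Factor a long-range bond matrix as M = M1 . M2.
    The simplest admissible choice: M1 = M and M2 = identity."""
    return [row[:] for row in M], [[1.0, 0.0], [0.0, 1.0]]

def matmul2(A, B):
    """Product of two 2x2 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]
```

In the construction of Fig. 5.2(d), $M_1$ and $M_2$ would then be absorbed into the local tensors the connection passes through.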
Figure 5.2: Stepwise mapping of an RBM to an MPS: (a) A TNS description of the RBM given in Fig. 5.1(a). The light blue circles express the diagonal tensors $\Gamma^{(i)}_v$ at the visible units, defined by Eq. 5.2, and the gray circles denote $\Gamma^{(j)}_h$ at the hidden units, defined by Eq. 5.3. The orange diamonds express the matrices $M^{(ij)}$ described in Eq. 5.4. (b) The RBM is divided into $n_v$ pieces. At each long-range link, an identity tensor (red ellipse) is inserted to subdivide $M^{(ij)}$ into two matrices. (c) An MPS is obtained from the RBM by summing up all the hidden units belonging to each piece in (b). The number of links cut by a dashed vertical line equals the bond degrees of freedom of the MPS. (d) The matrix $M^{(14)}$ corresponding to the long-range connection is broken into a product of two matrices $M_1, M_2$, represented by light pink diamonds. The red ellipse shows the product of two identity matrices.
It should be highlighted that the mapping obtained by this approach is not unique, because it depends on the assignment of the hidden units in the network. Regardless of this dependence on the placement of the hidden units, all the different MPS are equivalent. The approach also induces redundancy (in terms of degrees of freedom) in the local tensors. By performing a canonical transformation on the local tensors $A$, one can obtain a unique form of the MPS [28, 29]. We will discuss the redundancy of the MPS later in this chapter.
5.1.2 Optimum MPS representation of an RBM
The method given in the previous subsection is easy to understand but not optimal. Now let us consider a method that provides an MPS description with optimal bond degrees of freedom. As already said, the RBM is an undirected probabilistic graphical model. We single out $X_g$ and $Y_g$ as two sets of variables which are conditionally independent
Algorithm 2: Direct transformation from RBM to MPS.
1: RBM2MPS1($W$, $a$, $b$)
2: $W$: $N_{hidden} \times N_{visible}$ weight matrix of the RBM.
3: $a$: bias vector for the visible units.
4: $b$: bias vector for the hidden units.
5: Output: an MPS with local tensors $A^{(i)}$, $i = 1, 2, \cdots, n_v$.
6: Divide the RBM into $n_v$ slices $S_i$, $i = 1, 2, \cdots, n_v$; each slice $S_i$ has one visible unit $v_i$ and some hidden units.
7: for all visible units $i$ do
8:   Set $T_i = \emptyset$ (carries the tensors in $S_i$).
9:   Construct $\Gamma^{(i)}_v$ using Eq. 5.2.
10:  for all $h_j \in S_i$ do
11:    Construct $\Gamma^{(j)}_h$ using Eq. 5.3.
12:    Add $\Gamma^{(j)}_h$ to $T_i$.
13:  end for
14: end for
15: for all non-zero weights do
16:   Construct $M^{(ij)}$ using Eq. 5.4.
17:   Break $M^{(ij)}$ into a product of matrices and put these matrices into the corresponding slices (Fig. 5.2(d)).
18: end for
19: for all visible units $i$ do
20:   $A^{(i)} \leftarrow$ sum over all internal indices of the tensors belonging to $T_i$.
21: end for
given a set of variables $Z_g$. The statement is written as
$$X_g \perp Y_g \mid Z_g. \tag{5.6}$$
For a given separation into $X_g$ and $Y_g$, we determine the set $Z_g$ with the minimum number of elements satisfying Eq. 5.6. The set $Z_g$ can then serve as the MPS virtual bond after the mapping. The bond dimension between the two separated sets $X_g$ and $Y_g$ is determined by the size of $Z_g$, denoted $|Z_g|$:
$$D = 2^{|Z_g|}. \tag{5.7}$$
The algorithm for the optimal translation by conditional independence is given in Algorithm 3. We begin from the left and progress towards the right, constructing the tensors one by one with minimal virtual bond dimension. The virtual bonds created in this fashion carry the degrees of freedom of hidden or visible units of the RBM.
Figure 5.3: The optimal MPS representation of an RBM: (a)-(e) depict the stepwise construction. The set $X_g$ is denoted by a light yellow rectangle/triangle and the set $Y_g$ by a dark blue rectangle. The set $Z_g$, which provides the conditional independence of $X_g$ and $Y_g$, is represented by a light green rectangle. When the set $Z_g$ is given, the RBM function, interpreted as a probability, can be written as a product of two functions, one depending on $X_g$ and the other on $Y_g$. The variables in $Z_g$ become the virtual bond of the MPS. The light gray lines show connections already included in a previous tensor. The connections being considered in the current step are drawn as dotted lines; these form $G_t$ in Algorithm 3. (f) The resulting MPS.
To demonstrate the translation algorithm, let us apply the method to the RBM given in Fig. 5.1(a). We begin with Fig. 5.3(a), so $X_g = \{v_1\}$ and $Y_g = \{v_2, \cdots, v_6\}$. It is easy to see that the minimal set $Z_g = \{h_1\}$ or $Z_g = \{v_1\}$ is enough to satisfy the condition in Eq. 5.6. Assume we take $Z_g = \{v_1\}$; this implies $G_t = H_t = \emptyset$, and the initial tensor $A^{(1)}$ of the MPS can be specified as an identity matrix that simply duplicates the visible unit $v_1$ onto its virtual bond on the right.
Moving forward, we include $v_2$ in the set $X_g$, so $X_g = \{v_1, v_2\}$; the remaining visible variables are $Y_g = \{v_3, \cdots, v_6\}$, as shown in Fig. 5.3(b). Notice that there are four possible minimal sets $Z_g$; we choose, for example, $Z_g = \{h_1, h_2\}$. All the connections to be merged into the tensor $A^{(2)}$ are shown by dotted lines in Fig. 5.3(b) and form $G_t$ in Algorithm 3; here $H_t = \emptyset$. The set $Z_g = \{h_1, h_2\}$ becomes the right bond of $A^{(2)}$, while the left virtual bond of $A^{(2)}$ is the same as the right virtual bond of $A^{(1)}$.
73
CHAPTER 5. CORRESPONDENCE BETWEEN RESTRICTED BOLTZMANNMACHINE AND TENSOR NETWORK STATES
Now move to Fig. 5.3(c). The light gray lines represent interactions already accounted for in previous tensors. This is the third iteration of Algorithm 3, and the update $X_g = Z'_g \cup \{v_i\}$ gives $X_g = \{h_1, h_2\} \cup \{v_3\}$. Here we have several options for a minimal set $Z_g$ of size $|Z_g| = 2$; we choose $Z_g = \{v_3, v_4\}$. The set $G_t$ contains all the connections between $Z_g$ and $\{h_1, h_2\}$. All these couplings are included in the tensor $A^{(3)}$, and its right virtual bond consists of $Z_g$. No hidden unit needs to be summed, so $H_t = \emptyset$.

In Fig. 5.3(d), $Z_g$ is composed of $v_5$ and $h_4$. We observe that $H_t = \{h_3\}$ cannot interact with the whole set $Y_g$ when $Z_g$ is given, so $h_3$ should be traced out during the construction of $A^{(4)}$.
During the construction of the MPS, each interaction in the RBM is considered only once. Hence we end up with an MPS consisting of 6 tensors, with the degrees of freedom of the virtual bonds labelled in Fig. 5.3(f). Notice that there is no need to run the algorithm numerically to obtain the optimal virtual bond dimensions of the resulting MPS. Furthermore, this algorithm works for any undirected graphical model.
5.1.3 Inference of RBM to MPS mapping
The translation of an RBM to a TNS implies that the expressive power of an RBM can be quantified using tools from TNS, in particular the concept of EE. To define the EE as a function of $\Psi$ for an RBM or MPS, split the set of visible units into two subsets, $X_g$ and $Y_g$. The EE between $X_g$ and $Y_g$ is given by
$$S = -\mathrm{Tr}\, \rho \ln \rho, \tag{5.8}$$
where $\rho$ is the reduced density matrix,
$$\rho_{v_x, v'_x} = \sum_{v_y} \Psi^*(v'_x, v_y)\, \Psi(v_x, v_y). \tag{5.9}$$
Here $v_y$ runs over the configurations of the visible units in $Y_g$, and $v_x$ over those in $X_g$. The EE quantifies the information content of $\Psi$. The EE of an MPS is easy to calculate, and it turns out that $S$ is bounded by the bond dimension $D$, i.e. $S \le \ln D$. For a more detailed discussion of the EE, we refer to Chapter 3.
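For the smallest nontrivial bipartition (one binary visible unit on each side of the cut), the EE of Eqs. 5.8-5.9 can be computed in closed form from the $2\times 2$ reduced density matrix. A sketch (ours, assuming a real wave function) that also illustrates the bound $S \le \ln D$ with $D = 2$:

```python
import math

def entanglement_entropy_2x2(psi):
    """EE of a real pure state psi[x][y], one binary visible unit on each
    side of the cut. Builds the reduced density matrix of Eq. 5.9,
    rho_{x x'} = sum_y psi[x][y] psi[x'][y] (normalized), and applies
    Eq. 5.8 via the quadratic formula for the 2x2 eigenvalues."""
    r00 = sum(psi[0][y] ** 2 for y in range(2))
    r11 = sum(psi[1][y] ** 2 for y in range(2))
    r01 = sum(psi[0][y] * psi[1][y] for y in range(2))
    tr = r00 + r11
    r00, r11, r01 = r00 / tr, r11 / tr, r01 / tr
    disc = math.sqrt((r00 - r11) ** 2 + 4.0 * r01 ** 2)
    lam = ((r00 + r11 + disc) / 2.0, (r00 + r11 - disc) / 2.0)
    return -sum(l * math.log(l) for l in lam if l > 1e-15)
```

A product state gives $S = 0$, while a maximally entangled (Bell-like) state saturates the bound at $S = \ln 2$.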
To evaluate the expressive power of an RBM, we first need to find its MPS representation with optimal bond dimensions. The bond dimension obtained by the direct approach of Sec. 5.1.1 is higher than what is actually needed. For instance, as shown in Fig. 5.2(c), the bond dimension of the second tensor is $D = 8$, but the two visible units
Algorithm 3: RBM to MPS mapping with optimal bond dimension.
1: RBM2MPS1($W$, $a$, $b$)
2: $W$: $N_{hidden} \times N_{visible}$ weight matrix of the RBM.
3: $a$: bias vector for the visible units.
4: $b$: bias vector for the hidden units.
5: Output: an MPS with local tensors $A^{(i)}$, $i = 1, 2, \cdots, n_v$.
6: $G_s = \{(i,j) \mid W(i,j) \neq 0\}$ (the whole connection graph).
7: $H_s = \{j \mid (i,j) \in G_s\}$ (the entire set of hidden units).
8: $Z'_g = \emptyset$ (left virtual bond).
9: for all visible units $i$ do
10:  Set $G_t = \emptyset$ (connections to be incorporated in $A^{(i)}$).
11:  Set $H_t = \emptyset$ (hidden variables to be summed in $A^{(i)}$).
12:  $X_g = Z'_g \cup \{v_i\}$.
13:  $Y_g = \{v_{i+1}, \cdots, v_{n_v}\}$ (remaining physical variables).
14:  Find the minimal set $Z_g$ from $G_s$ satisfying Eq. 5.6.
15:  for all $h_j \in H_s$ do
16:    if $h_j$ is disconnected from $(Y_g \setminus Z_g)$ then
17:      Move $j$ from $H_s$ to $H_t$ ($h_j$ will be summed up in $A^{(i)}$).
18:    end if
19:  end for
20:  for all $(p, q) \in G_s$ do
21:    if $v_p$ and $h_q$ belong to $X_g \cup Z_g \cup H_t$ then
22:      Move $(p, q)$ from $G_s$ to $G_t$ (the connection between $v_p$ and $h_q$ will be incorporated in $A^{(i)}$).
23:    end if
24:  end for
25:  $A^{(i)}_{Z',Z}[v_i] = \sum_{\{h_j \in H_t\}} e^{a_i v_i + \sum_{(p,q)\in G_t} v_p W_{pq} h_q + \sum_{j \in H_t} b_j h_j}$.
26:  $Z'_g \leftarrow Z_g$.
27: end for
on the left are enough for the conditional independence property to hold, and these two variables have only four degrees of freedom. So $D = 8$ is more than is needed to capture the entanglement. The method given in Sec. 5.1.2 is better: it does not depend on the assignment of the hidden units and provides the optimal bond dimension.
Fig. 5.4(a) depicts the RBM of Fig. 5.1 after summing out the entire set of hidden variables in the RBM graph. The curved lines represent the couplings between visible variables mediated by the hidden variables. If we split the visible variable set into two parts $X_g$ and $Y_g = Y_1 \cup Y_2$, where $Y_1$ contains the visible variables that have a direct connection to $X_g$ and $Y_2$ is composed of the remaining visible units, Eq. 5.1 can be represented as
Figure 5.4: (a) The RBM after summing out the entire set of hidden units; the curved lines represent the connections between visible units mediated by hidden units. The whole system is split into two parts $X_g$ and $Y_g$, and the second is further split as $Y_g = Y_1 \cup Y_2$, where $Y_1$ contains the visible variables directly connected to $X_g$. (b) The alternative splitting, in which $X_g$ is also divided. (c) The MPS provided by this method has smaller bond dimensions than that of Fig. 5.2.
$$\Psi_{RBM}(v) = \psi(v_x, v_{y_1})\, \phi(v_{y_1}, v_{y_2}). \tag{5.10}$$
In this form the RBM wave function factorizes into a product in which the visible variables in $X_g$ and $Y_2$ appear in separate factors. The EE between the regions $X_g$ and $Y_g$ is therefore bounded by the number of visible units in $Y_1$, denoted $|Y_1|$, and the dimension of the bond at the separation between $X_g$ and $Y_g$ is $D = 2^{|Y_1|}$. Alternatively, we can also divide $X_g$ into two sets and apply the same argument, as shown in Fig. 5.4. Hence the EE between $X_g$ and $Y_g$ is bounded by $S_{max} = \min(|X_1|, |Y_1|) \ln 2$.
The resulting MPS (shown in Fig. 5.4(c)) has a tighter bound on the bond dimension than that of Fig. 5.3(c). An even tighter bound can be attained by finding the minimal separating set of variables, whether visible or hidden, that factorizes the two functions. Programming code implementing these mappings is given in [27].
The bond degrees of freedom of the emerging MPS govern the maximum EE between
the visible variables. Thus, the EE gives a precise measure of the expressiveness of an RBM
based merely on its architecture. These bond dimensions are easy to
anticipate. By canonicalization, the MPS bond dimension can be reduced further.
Likewise, by arranging the visible units in two dimensions, the RBM can be mapped
to a PEPS. An equivalent method was used in [30] to transform MERA into PEPS. This
procedure is convenient for RBMs trained on a two-dimensional grid (for example,
images).
If we denote by m the upper bound on the number of units that have direct connections
across the cut in the RBM, then the upper limit of the EE scales as
S_max ∼ m V^{(d−1)/d}, (5.11)
where V is the volume and d the dimension of the space on which the TNS
is defined. Hence, an RBM with sparse connectivity respects the area law. In contrast,
for an RBM with dense connections the region cut by the bipartition extends over the
entire system, so m ∼ V^{1/d} and thus S_max ∼ V. This
implies that a densely connected RBM can represent highly entangled
quantum states that violate the area law, with only a polynomial increase in the number of
parameters with system size, whereas an MPS or PEPS representation of such a state
requires an exponential increase in the number of parameters with the size of the system
[31, 32]. This justifies variational computations on quantum systems using RBM
functions [1]. We will discuss this further later in this chapter.
The mapping (RBM to MPS) used so far can be extended to a general Boltzmann
machine without imposing any restriction.
5.2 Representation of TNS as RBM: sufficient and necessary conditions
Now we turn to the reverse process, i.e., the transformation of a TNS into a given RBM
architecture. We only consider a fixed architecture because, apart from that, any
function can be reproduced by an RBM with exponentially large resources [33]. At the
end of this section we present a practical approach for this mapping and demonstrate
an example.
Let's take the MPS given in Fig. 5.1(b) and find its parametric representation as an
RBM with the architecture given in Fig. 5.5(a). The MPS has 6 sites, and the RBM
architecture has nh = 4 hidden units. Note that the target RBM architecture is
specified in advance, and we want to know whether the conversion is possible for that
architecture. The hidden layer factorizes into a
product of 4 tensors: one tensor, defined on h1, has dimensions 2×2×2×2 and all
the remaining tensors have dimensions 2×2×2. We require the product of these four
tensors to equal the MPS:
Tr ∏_i A^{(i)}[v_i] = T^{(1)}_{v1 v2 v3 v4} T^{(2)}_{v2 v3 v4} T^{(3)}_{v3 v4 v5} T^{(4)}_{v4 v5 v6}. (5.12)
Taking the logarithm of this equation yields 2^{n_v} = 64 linear equations for the
40 tensor parameters (16 + 8 + 8 + 8 entries), so this system of linear equations is
overdetermined. We need a unique solution of this system for the translation of the
MPS to the RBM. If these equations have no solution, then the architecture of the
RBM should be changed. We can balance the number of equations against the number of
parameters by varying the number of connections and hidden units.
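The counting of equations versus unknowns can be sketched as follows. This is a hedged helper (not in the thesis): each hidden tensor contributes 2^{deg(h)} entries as unknowns, while matching the MPS amplitude on every visible configuration gives 2^{n_v} equations.

```python
# Hypothetical counting helper for the linear system obtained by taking the
# log of Eq. 5.12: unknowns are the tensor entries, one equation per visible
# configuration.
def count_system(nv, hidden_degrees):
    unknowns = sum(2 ** d for d in hidden_degrees)  # entries of all T^(j)
    equations = 2 ** nv                             # one per configuration
    return equations, unknowns

# Architecture of Fig. 5.5(a): 6 visible units; hidden degrees 4, 3, 3, 3
eqs, unk = count_system(6, [4, 3, 3, 3])
```

For this architecture the system has 64 equations in 40 unknowns, hence overdetermined.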
Assuming one has a unique solution of Eq. 5.12, the next step is to decompose
each tensor into an RBM (recall that each tensor carries one hidden unit). For instance,
T^{(2)}_{v2 v3 v4} is decomposed as
T^{(2)}_{v2 v3 v4} = Σ_{h2 ∈ {0,1}} e^{h2 b2 + Σ_{k ∈ {2,3,4}} v_k (W_{k2} h2 + a_{k2})}, (5.13)
where a_{kj} denotes the component of the bias of the kth visible variable shared
with the jth hidden variable. The bias a_k of the kth visible variable is a_k = Σ_j a_{kj}.
As shown in Fig. 5.5(b), this three-index tensor has 7 parameters, which are to be
determined by solving 2^3 = 8 equations. The number of variational parameters
grows linearly with the index/order of the tensor T, but the number of equations
grows exponentially with the order of the tensor. Typically, Eq. 5.13 has no unique
solution because the system is overdetermined; the solution exists only in exceptional cases.
In practice, T^{(2)} can be decomposed at minimal rank by using the tensor rank
decomposition [34]:
T^{(2)}_{v2 v3 v4} = Σ_{h2 ∈ {0,1}} X_{v2 h2} Y_{v3 h2} Z_{v4 h2}. (5.14)
Here X, Y, Z are all 2×2 matrices, since the hidden and visible units are all binary;
a binary hidden unit forces the rank of T^{(2)} to be 2. The tensor rank also depends on the
underlying field: a generic 2×2×2 tensor admits a rank-2 decomposition
over the complex field, but the same tensor decomposed over the real field can have rank
3, and obtaining the rank of an arbitrary tensor is not an easy task [34]. The
higher-order SVD [35] provides a lower bound on the rank of the tensor in the form of
the decomposition of the core tensor. If the rank is greater than 2, the condition that
the hidden units be binary cannot be satisfied.
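The structure of Eq. 5.14 can be verified numerically. A minimal sketch (with randomly chosen positive factor matrices, an assumption for illustration): build the 2×2×2 tensor from X, Y, Z and check it against the explicit sum over the binary hidden unit.

```python
import numpy as np

# Rank-2 tensor of Eq. 5.14: T_{v2 v3 v4} = sum_{h2} X_{v2 h2} Y_{v3 h2} Z_{v4 h2}
rng = np.random.default_rng(0)
X, Y, Z = rng.uniform(0.5, 2.0, (3, 2, 2))  # positive entries, as for an RBM

# contract the shared hidden index h with einsum
T = np.einsum('ih,jh,kh->ijk', X, Y, Z)

# element-wise check against the explicit sum over the hidden unit h2
T_loop = np.zeros((2, 2, 2))
for h in range(2):
    T_loop += np.multiply.outer(np.multiply.outer(X[:, h], Y[:, h]), Z[:, h])
```

Any tensor of this form is reproduced exactly by a single binary hidden unit, which is precisely the rank-2 condition discussed above.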
Figure 5.5: TNS to RBM transformation: graphical representation of (a) Eq. 5.12 and (b) Eqs. 5.13 and 5.14.
After decomposing the tensor, each matrix can be factorized into a product of three
matrices; for instance, the matrix X, by using Eqs. 5.2–5.4,
X = [[x11, x12], [x21, x22]] = x11 [[1, 0], [0, x21/x11]] [[1, 1], [1, x11 x22/(x12 x21)]] [[1, 0], [0, x12/x11]], (5.15)
and

W22 = ln( x11 x22 / (x12 x21) ), (5.16)
a22 = ln( x21 / x11 ), (5.17)
b22 = ln( x12 / x11 ). (5.18)
Here b_{kj} is a partial bias, analogous to a_{kj}, interpreted as the component of the bias
of the kth hidden variable shared with the jth visible variable; the bias b_k of the kth
hidden variable is b_k = Σ_j b_{kj}. By this method, all the tensors in Eq. 5.12 can be
written as an RBM.
Hence, the necessary and sufficient condition for an MPS to have an RBM representation
is that both Eqs. 5.12 and 5.13 are uniquely solvable. For an RBM with just a single
hidden variable, Eq. 5.12 simply reinterprets the MPS as the wave function itself. The
decomposition of the tensor in Eq. 5.13 as a rank-2 tensor is the more difficult condition to meet.
By varying the parameters of the RBM (connections or hidden units), one can find
an architecture for which both equations have a unique solution. This is equivalent to the
mathematical statement that an RBM can describe any function when exponential
resources (parameters) are provided.
In practice, an appropriate way to directly determine whether a TNS admits a specific
RBM description is to test the factorization property defined in Eq. 5.10.
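This factorization test can be sketched in code. A hedged illustration (not the thesis code): if Ψ(v) = ψ(v_x, v_{y1}) φ(v_{y1}, v_{y2}), then for every fixed configuration of y1 the slice Ψ[v_x, :, v_{y2}] is an outer product, i.e. a rank-1 matrix, which we detect via its singular values.

```python
import numpy as np

def factorizes(psi_xyz, tol=1e-10):
    """True if psi_xyz[v_x, v_y1, v_y2] factorizes as in Eq. 5.10,
    i.e. every fixed-y1 slice is rank 1 (a single singular value)."""
    for y1 in range(psi_xyz.shape[1]):
        s = np.linalg.svd(psi_xyz[:, y1, :], compute_uv=False)
        if (s[1:] > tol * s[0]).any():   # more than one singular value
            return False
    return True

# a state built to factorize, from hypothetical factors psi and phi
rng = np.random.default_rng(1)
psi_f, phi_f = rng.random((4, 2)), rng.random((2, 4))
Psi = np.einsum('xy,yz->xyz', psi_f, phi_f)   # Psi(x,y1,y2) = psi*phi
```

A generic (non-factorizing) tensor fails the test, while the constructed Psi passes it.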
5.2.1 Examples
Having discussed a sufficient condition for one- or two-dimensional tensor networks
(MPS or PEPS) to be described as an RBM, various interesting physical quantum or thermal
states can be transformed under this mapping: for instance, the toric code model and the
statistical Ising model in the presence of an external field.
A sufficient mathematical condition for an MPS to be representable as an RBM
is the following:
A_{ij}[v] = L_{iv} R_{vj}. (5.19)
This constraint applies to every tensor in the MPS, where L and R are left and right matrices
of dimension 2×2. The product RL of the two matrices (dashed box shown in Fig. 5.6)
can be replaced by a hidden unit of an RBM. The bias of the visible variable
is halved because it is distributed over two consecutive boxes. From Eqs. 5.2–5.4,
the RL matrix can be written in terms of RBM parameters as follows:
RL = [[1, 0], [0, e^{a/2}]] [[1, 1], [1, e^w]] [[1, 0], [0, e^b]] [[1, 1], [1, e^w]] [[1, 0], [0, e^{a/2}]]. (5.20)
Here the weights and biases are chosen equal for simplicity; this decomposition is
not unique.
One example is the Ising model in one dimension. Its partition function, with
coupling constant K and external field H, is
Z = Σ_{s_i} e^{K Σ_{⟨i,j⟩} s_i s_j + H Σ_i s_i}.
Figure 5.6: The matrix A defined in Eq. 5.19 has a special form, represented by the dashed box. The blue dots denote identity matrices and the square boxes represent the left and right matrices. To obtain the RBM parameters, we apply the MPS-to-RBM transformation according to Eq. 5.20.
The s_i are Ising spins, each taking the two values ±1, and can be transformed to binary
variables as v_i = (s_i + 1)/2. The one-dimensional Ising model is easy to represent as an MPS,
as shown in Fig. 5.6. The RL matrix on an individual bond can be defined as
RL = [[e^{K+H}, e^{−K}], [e^{−K}, e^{K−H}]]. (5.21)
Combining the above two equations for RL yields the parameters of the RBM. This
procedure can be extended to two-dimensional tensor networks, for example PEPS, where
each hidden unit connects to two visible units. In contrast to the one-dimensional
case, the visible biases become a/4 and H becomes H/2.
These explorations show that a very modest sparse RBM, with equal numbers of
hidden and visible units in the 1D case (nh = 2nv in 2D), is enough to replicate the
thermal probability distribution of the Ising model. Notice that this approach does not
depend on the couplings and is valid even at criticality.
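That the RL matrix of Eq. 5.21 reproduces the Ising thermal distribution can be checked numerically. A minimal sketch (not the thesis code): RL is the transfer matrix of the chain, so Z = Tr(RL^N) for a periodic chain of N spins, which we compare against a brute-force sum over spin configurations.

```python
import itertools
import numpy as np

K, H, N = 0.7, 0.3, 8      # arbitrary couplings and a small chain

# RL matrix of Eq. 5.21, in the basis (s = +1, s = -1)
RL = np.array([[np.exp(K + H), np.exp(-K)],
               [np.exp(-K),    np.exp(K - H)]])
Z_transfer = np.trace(np.linalg.matrix_power(RL, N))

# brute-force partition function, periodic boundary conditions
Z_brute = 0.0
for spins in itertools.product([1, -1], repeat=N):
    bonds = sum(spins[i] * spins[(i + 1) % N] for i in range(N))
    Z_brute += np.exp(K * bonds + H * sum(spins))
```

The two values agree to machine precision, confirming that the Boltzmann weights of the chain are exactly encoded.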
5.3 Implication of the RBM-TNS correspondence
5.3.1 RBM optimization by using tensor network methods
Analogous to TNS, neural networks, including RBM functions, commonly contain
redundant degrees of freedom: RBMs with entirely distinct architectures
can represent the same function. Well-known tensor network approaches can therefore be used
to remove this redundancy from an RBM.
In one dimension, for instance, the canonicalization method of MPS can be used to
find an optimal RBM. The first step is to map the RBM to an MPS using the
method discussed in Sec. 5.1. The MPS can then be canonicalized so that each tensor has
minimal bond dimension, by truncating approximately zero singular vectors; this procedure
also fixes the gauge of the MPS to some extent. Lastly, the simplified TNS is translated back
to an RBM using the method discussed in Sec. 5.2. The resulting RBM equals the previous
one, but is optimal.
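The truncation step can be sketched on a single cut. A hedged illustration (not the thesis code): for a 4-qubit GHZ state, the Schmidt (singular value) decomposition across the central cut has only two nonzero values, so the minimal bond dimension there is D = 2 even if the state is stored with a larger, padded bond dimension.

```python
import numpy as np

# 4-qubit GHZ state: (|0000> + |1111>) / sqrt(2)
n = 4
ghz = np.zeros(2 ** n)
ghz[0] = ghz[-1] = 1 / np.sqrt(2)

# Schmidt decomposition across the middle cut: reshape into a matrix
# with left-half configurations as rows, right-half as columns
s = np.linalg.svd(ghz.reshape(2 ** (n // 2), -1), compute_uv=False)

# truncate approximately zero singular values to get the minimal bond dim
D_min = int((s > 1e-12).sum())
```

Canonicalization performs exactly this SVD-and-truncate step on every bond of the MPS.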
To illustrate this optimization process, let's take the RBM representation of the
cluster state discussed in [2]. The architecture is shown in Fig. 5.7. This RBM
architecture can be mapped to an MPS with D = 4 (each partition breaks 2 links). The
Figure 5.7: The RBM architecture used to represent the cluster state. It has local connections; each hidden unit is connected to three visible units.
canonical transformation of this MPS results in a D = 2 MPS. Translating the D = 2
MPS back, the derived RBM has a hidden layer in which every hidden variable is
linked with two visible variables. A higher-dimensional RBM can also be optimized by
translating it into a PEPS, whose bond dimension can be reduced using the SVD.
5.3.2 Unsupervised learning in entanglement perspective
A natural outcome of the relationship between RBM and TNS is a quantum-entanglement
point of view on unsupervised probabilistic models. According to a theorem in machine
learning [33], adding a hidden unit to an RBM improves the model, so one
can approximate any function to any accuracy, but unlimited resources are needed. One
can estimate the required resources (number of hidden units/parameters) by introducing the
concept of EE for a real dataset or, equivalently, the effective bond degrees of freedom of a
TNS.
What do we mean by the EE of a real-valued dataset? A probability amplitude
Ψ(v) = √P(v) (as used in quantum mechanics) can be defined from P(v), where
P(v) represents the probability distribution over the dataset. This definition of EE for
real-valued data is meaningful: like classical information measures, it accounts for the
complexity of the data. Bringing the concept of EE to machine learning offers a
practical and convenient way to quantify the challenges of unsupervised learning, and it
further assists in modeling as well as understanding quantum many-body systems.
These implications apply to machine learning generative models motivated
by quantum mechanics, where the authors showed that a squared wave function can be used
to model the probability [36].
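This notion of dataset EE can be computed directly. A minimal sketch (not the thesis code, with an illustrative product distribution): take Ψ(v) = √P(v), bipartition the bits of v, and compute the entropy of the squared Schmidt coefficients across the cut.

```python
import numpy as np

def dataset_ee(P, n_left, n_bits):
    """Entanglement entropy of amplitude sqrt(P) across a bit bipartition."""
    psi = np.sqrt(P).reshape(2 ** n_left, 2 ** (n_bits - n_left))
    s = np.linalg.svd(psi, compute_uv=False)
    lam = s[s > 1e-15] ** 2              # squared Schmidt coefficients
    return float(-np.sum(lam * np.log(lam)))

# product distribution over 4 bits: P(v) = p(v1) p(v2) p(v3) p(v4)
p = np.array([0.3, 0.7])
P_product = np.einsum('a,b,c,d->abcd', p, p, p, p).ravel()
ee = dataset_ee(P_product, 2, 4)
```

For a product distribution the amplitude matrix is rank 1 and the EE vanishes; correlations between the two halves of the data raise it, up to ln 2 per fully correlated bit pair.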
Consider a dataset of natural images: typically the pixels couple strongly only to their
neighbours, which implies that the EE introduced above is comparatively small.
As a result, there is no need for an RBM with dense connections to model
such a dataset. In fact, D. C. Mocanu and others have shown [37] that an RBM with
sparse connections performs better than a fully connected one. Further, the entanglement
distribution varies from point to point in space. Entanglement quantification can thus be used
as a guiding principle in designing a neural network architecture.
One more advantage of the proposed quantum entanglement for real-valued datasets
is that the connection between RBM and TNS may allow one to bring strategies
from quantum mechanics straight into machine learning. For instance, it is easy to
determine the upper bound on the EE of an RBM by transforming it into a TNS and counting the
bond dimension. When training directly on the dataset with a TNS, one can use the EE to
quantify the difficulty of the learning task [38].
5.3.3 Entanglement: a measure of effectiveness of deep learning as compared to shallow ones
The mapping we discussed in Sec. 5.1 is more general and can be performed on
a general Boltzmann machine without restrictions on the graph. More specifically,
applying this method to the deep Boltzmann machine (DBM) explains the difference
between deep and shallow RBMs.
Both architectures, RBM and DBM, are shown in Fig. 5.8, with the same number of
visible units nv, hidden units nh = 3nv, and connections 9nv. As opposed to the RBM, the
DBM architecture consists of multiple layers, with connections distributed among
the consecutive layers as depicted in Fig. 5.8(b). Following Sec. 5.1.3, the entire system
can be divided into two by fixing a set of units. The bond dimension for the DBM is D = 16 and
for the RBM it is D = 4. Hence, the DBM is capable of representing more complex functions
than the RBM when the same number of parameters is given.
Figure 5.8: Effectiveness of a deep network as compared to a shallow network: (a) an RBM and (b) a DBM, both with the same nv (blue circles), nh = 3nv, and number of connections 9nv. The approach discussed in Sec. 5.1 can be used to represent both architectures as an MPS, with bond dimension D = 2^4 for the DBM and D = 2^2 for the RBM. The dashed rectangle depicts the minimum number of visible units that must be fixed in order to cut the system into two subsystems. The bond dimensions show that a DBM can encode more entanglement than an RBM when an equal number of hidden units and parameters is provided.
In general, deep hidden units create long-range effective links between visible
units in a DBM, and therefore more entanglement capacity than an RBM with an
equal number of weights and hidden variables. One can analyze and compare the
expressive power of various neural architectures by employing the mapping to TNS.
APPENDIX A

MPS EXAMPLES IN MATHEMATICA
A.1 W State
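The original Mathematica notebook appears here as a figure. An equivalent sketch in Python (the tensor convention below is one common bond-dimension-2 choice for the W state, assumed for illustration): contract a 3-site MPS with open boundary vectors and compare with the W state (|100⟩ + |010⟩ + |001⟩)/√3.

```python
import itertools
import numpy as np

# MPS matrices for the W state: M[0] keeps the "no excitation yet" state,
# M[1] raises it exactly once (M[1] @ M[1] = 0)
M = {0: np.eye(2),
     1: np.array([[0.0, 1.0],
                  [0.0, 0.0]])}
l, r = np.array([1.0, 0.0]), np.array([0.0, 1.0])  # boundary vectors

n = 3
psi = np.zeros(2 ** n)
for idx, bits in enumerate(itertools.product([0, 1], repeat=n)):
    amp = l.copy()
    for b in bits:            # contract the chain left to right
        amp = amp @ M[b]
    psi[idx] = amp @ r        # nonzero only for exactly one excitation
psi /= np.linalg.norm(psi)
```

The contraction yields amplitude 1/√3 exactly on the configurations 001, 010, and 100, and zero elsewhere.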
A.2 GHZ State
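As above, the Mathematica code appears as a figure; a Python analogue (with a standard diagonal-tensor convention, assumed for illustration): the GHZ state has a bond-dimension-2 MPS with periodic (trace) boundary conditions.

```python
import itertools
import numpy as np

# MPS matrices for the GHZ state: diagonal projectors onto spin up/down,
# so only all-equal configurations survive the trace
M = {0: np.diag([1.0, 0.0]),
     1: np.diag([0.0, 1.0])}

n = 4
psi = np.zeros(2 ** n)
for idx, bits in enumerate(itertools.product([0, 1], repeat=n)):
    prod = np.eye(2)
    for b in bits:
        prod = prod @ M[b]
    psi[idx] = np.trace(prod)   # periodic (trace) boundary conditions
psi /= np.linalg.norm(psi)
```

Only the all-0 and all-1 configurations give a nonzero trace, reproducing (|0000⟩ + |1111⟩)/√2.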
APPENDIX B

RENORMALIZATION GROUP EXAMPLE AND CODE DESCRIPTION
B.1 1D Ising model
The procedure we are going to discuss is quite general and applies to systems with
nearest-neighbour interactions. Several of the ideas in RG theory can be demonstrated
with the 1D Ising model. Let's begin with that physical system, despite the fact that it
shows no criticality or phase transition. When no external magnetic field is present,
the partition function Z is given by
Z(K, N) = Σ_{v1,v2,...,vN = ±1} e^{K(··· + v1 v2 + v2 v3 + v3 v4 + ···)}, (B.1)
where K = βJ. The RG is a procedure in which we get rid of some degrees of freedom
by tracing them out. This is entirely different from the MFT approach, where very few
degrees of freedom are removed explicitly. In particular, summing over the even-numbered
spins in Eq. B.1 gives
Z(K, N) = Σ_{v1,v3,v5,··· = ±1} ∏_{i=1}^{N/2} ( e^{K(v_i + v_{i+2})} + e^{−K(v_i + v_{i+2})} ). (B.2)
By doing this, we have eliminated half of the degrees of freedom, as shown in Fig. B.1.
The next step in RG theory is to express the partially summed Z as the (original)
partition function of an Ising model with N/2 degrees of freedom and possibly different
Figure B.1: Tracing out even degrees of freedom.
coupling constants K′. If this rescaling/coarse-graining is possible, then we have a
recursion relation for Z. The recursion relation allows us to compute Z for the system
after another rescaling. We only have to perform this procedure once; the resulting
set of recursive equations enables us to move anywhere in coupling space. In particular,
we require a function g(K) and a new coupling constant K′ such that
e^{K(v+v′)} + e^{−K(v+v′)} = g(K) e^{K′ v v′} for all v, v′ = ±1. If we succeed in finding g(K), then
Z(K, N) = [g(K)]^{N/2} Z(K′, N/2), (B.3)
which is the recursion relation we aimed for; it is called the Kadanoff transformation.
To find g(K) and K′, put all four combinations of the two
spins into the Kadanoff transformation relation; this leads to two equations in two
unknowns, with solutions
K′ = (1/2) ln(cosh(2K)), g(K) = 2√(cosh(2K)). (B.4)
Define ln Z = N f(K); multiplied by −k_B T, N f(K) gives the free energy. With the help of
the above equations we get

f(K′) = 2 f(K) − ln(2√(cosh(2K))). (B.5)
Eqs. B.4 and B.5 are called the RG equations. They satisfy group properties and perform
the renormalization.
Now we discuss the results and conceptual ideas of renormalization flows: consecutive
RG transformations generate a flow in coupling space. First note that the new coarse-grained
coupling K′ generated by these RG transformation equations is always less than K. To
find the fixed points, take the limit K → 0 in Eq. B.4, which gives K′ ≈ K², and at the other
end, K → ∞, we get K′ ≈ K − (1/2) ln 2. The two points K = 0, ∞ are fixed points, but the first
is stable and the second unstable, as shown in Fig. B.2.
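Both the recursion B.3 and the flow toward K = 0 can be checked numerically. A minimal sketch (not the thesis code): the exact partition function of a periodic chain is Z = Tr(T^N) with transfer matrix T = [[e^K, e^{−K}], [e^{−K}, e^K]], against which we test one decimation step, and then we iterate Eq. B.4 to watch K shrink.

```python
import numpy as np

def Z_exact(K, N):
    """Exact Z of a periodic 1D Ising chain via the transfer matrix."""
    T = np.array([[np.exp(K), np.exp(-K)],
                  [np.exp(-K), np.exp(K)]])
    return np.trace(np.linalg.matrix_power(T, N))

def rg_step(K):
    """One decimation step, Eq. B.4."""
    return 0.5 * np.log(np.cosh(2 * K)), 2.0 * np.sqrt(np.cosh(2 * K))

K, N = 0.9, 8
K1, g = rg_step(K)
lhs = Z_exact(K, N)
rhs = g ** (N / 2) * Z_exact(K1, N // 2)   # recursion of Eq. B.3

# the flow: K decreases monotonically toward the stable fixed point K = 0
flow = [K]
for _ in range(6):
    flow.append(rg_step(flow[-1])[0])
```

The recursion holds exactly for the periodic chain, and a few iterations drive any finite K close to the stable fixed point at zero coupling.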
Figure B.2: RG flow in coupling space, depicting the stable and unstable fixed points.
As K varies between 0 and ∞, we can change variables slightly by using the identity
tanh(K′) = (e^{2K′} − 1)/(e^{2K′} + 1), which turns Eq. B.4 into

tanh(K′) = tanh²(K). (B.6)
This change of variables also changes the domain of the RG flow diagram, as shown in
Fig. B.3.
Figure B.3: RG flow: (a) coupling space in a different domain, from 0 to 1; (b) in the presence of an external magnetic field h. The arrows show the flow direction, and the blue lines between the two limits (K = 0, ∞) depict the flow, ending on the vertical h axis where K = 0. The “×” signs on the vertical axis at K = 0 represent the fixed points.
If we introduce an external magnetic field h, then we have one more RG transformation
equation corresponding to it. The flow diagram in the presence of h is shown in Fig.
B.3. From the flow diagram one can see that, starting from any (h, K), as we rescale
the system we essentially flow to independent spins in an effective magnetic field.
The RG is still useful and illustrative even when no critical behavior exists in the
system. The RG transformation does not change the long-distance physics, so ξ must
be the same, although the lattice spacing is increased by a factor of 2 (in this example).
At larger values of K, the correlation length is also larger. Under one step of the RG
transformation the correlation length becomes ξ′ = ξ/2, and for
l steps ξ(h, e^{−K}) = 2^l ξ(2^l h, 2^{l/2} e^{−K}).
B.2 Training DBN for the 2D Ising model
The Matlab code used for the training is given in the supplementary material of [16].
First, Ising model samples are generated, and only the unsupervised learning phase is
considered. The RBMs are trained layer by layer for 200 epochs, with momentum 0.5 and
mini-batches of 100 examples. Regularization is performed with a weight cost of 0.0002.
The effective receptive field (ERF) is a method to estimate the effect of a spin in the
visible layer on a spin in a hidden layer. We denote the ERF matrix of layer l by r^{(l)},
with l = 0 for the visible layer. It is computed as r^{(1)} = W^{(1)} and iterated as
r^{(l)} = r^{(l−1)} W^{(l)}.
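The ERF recursion is a chain of matrix products. A hedged sketch (not the Matlab code of [16]; the weight matrices below are random stand-ins for trained layers):

```python
import numpy as np

def effective_receptive_fields(weights):
    """ERF matrices r(1) = W(1), r(l) = r(l-1) @ W(l) for each layer."""
    erf, out = weights[0], [weights[0]]
    for W in weights[1:]:
        erf = erf @ W       # propagate visible influence one layer deeper
        out.append(erf)
    return out

# hypothetical two-layer DBN weights: 16 visible -> 8 hidden -> 4 hidden
rng = np.random.default_rng(2)
Ws = [rng.standard_normal((16, 8)), rng.standard_normal((8, 4))]
r1, r2 = effective_receptive_fields(Ws)
```

Row i of r^{(l)} then gives the cumulative influence of visible spin i on each unit of layer l.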
BIBLIOGRAPHY
[1] Carleo, G., Troyer, M., 2017. Solving the quantum many-body problem with artificial
neural networks. Science 355, 602–606. https://doi.org/10.1126/science.aag2302
[2] Deng, D.-L., Li, X., Das Sarma, S., 2017. Machine learning topological states. Physi-
cal Review B 96. https://doi.org/10.1103/PhysRevB.96.195145
[3] Deng, D.-L., Li, X., Das Sarma, S., 2017. Quantum Entanglement in Neural Network
States. Phys. Rev. X 7, 021021. https://doi.org/10.1103/PhysRevX.7.021021
[4] Torlai, G., Melko, R.G., 2016. Learning thermodynamics with Boltzmann machines.
Phys. Rev. B 94, 165134. https://doi.org/10.1103/PhysRevB.94.165134
[5] Mehta, P., Schwab, D.J., 2014. An exact mapping between the Variational Renor-
malization Group and Deep Learning. arXiv:1410.3831 [cond-mat, stat].
[6] Chen, J., Cheng, S., Xie, H., Wang, L., Xiang, T., 2018. Equivalence of restricted
Boltzmann machines and tensor network states. Phys. Rev. B 97, 085104.
https://doi.org/10.1103/PhysRevB.97.085104
[7] Goodfellow, I., Bengio, Y., Courville, A., 2016. Deep Learning. The MIT Press,
Cambridge, Massachusetts.
[8] Bengio, Y., 2009. Learning Deep Architectures for AI. MAL 2, 1–127.
https://doi.org/10.1561/2200000006 .
[9] Orus, R., 2014. A Practical Introduction to Tensor Networks: Matrix Product
States and Projected Entangled Pair States. Annals of Physics 349, 117–158.
https://doi.org/10.1016/j.aop.2014.06.013 .
[10] Bridgeman, J.C., Chubb, C.T., 2017. Hand-waving and Interpretive Dance: An
Introductory Course on Tensor Networks. Journal of Physics A: Mathematical
and Theoretical 50, 223001. https://doi.org/10.1088/1751-8121/aa6dc3 .
[11] UCI Machine Learning Repository [WWW Document], n.d. URL
https://archive.ics.uci.edu/ml/index.php (accessed 4.25.18).
[12] MNIST handwritten digit database, Yann LeCun, Corinna Cortes and Chris
Burges [WWW Document], n.d. URL http://yann.lecun.com/exdb/mnist/ (ac-
cessed 4.25.18).
[13] Vapnik, V., 2000. The Nature of Statistical Learning Theory, 2nd ed, Information
Science and Statistics. Springer-Verlag, New York.
[14] Larochelle, H., Bengio, Y., 2008. Classification Using Discriminative Restricted
Boltzmann Machines, in: Proceedings of the 25th International Conference
on Machine Learning, ICML ’08. ACM, New York, NY, USA, pp. 536–543.
https://doi.org/10.1145/1390156.1390224.
[15] Ackley, D.H., Hinton, G.E., Sejnowski, T.J., 1985. A learning algorithm for Boltz-
mann machines. Cognitive Science 9, 147–169. https://doi.org/10.1016/S0364-
0213(85)80012-4.
[16] Hinton, G.E., Salakhutdinov, R.R., 2006. Reducing the Dimension-
ality of Data with Neural Networks. Science 313, 504–507.
https://doi.org/10.1126/science.1127647.
[17] Hinton, G.E., Osindero, S., Teh, Y.-W., 2006. A Fast Learning Algo-
rithm for Deep Belief Nets. Neural Computation 18, 1527–1554.
https://doi.org/10.1162/neco.2006.18.7.1527.
[18] Google Code Archive - Long-term storage for Google Code Project Hosting. [WWW
Document], n.d. URL https://code.google.com/archive/p/matrbm/.
[19] You, Y.-Z., Yang, Z., Qi, X.-L., 2018. Machine learning spatial ge-
ometry from entanglement features. Phys. Rev. B 97, 045153.
https://doi.org/10.1103/PhysRevB.97.045153.
[20] Vidal, G., Latorre, J.I., Rico, E., Kitaev, A., 2003. Entanglement
in Quantum Critical Phenomena. Phys. Rev. Lett. 90, 227902.
https://doi.org/10.1103/PhysRevLett.90.227902
[21] Poulin, D., Qarry, A., Somma, R.D., Verstraete, F., 2011. Quantum simulation
of time-dependent Hamiltonians and the convenient illusion of Hilbert space.
Physical Review Letters 106. https://doi.org/10.1103/PhysRevLett.106.170501.
[22] Vidal, G., Latorre, J.I., Rico, E., Kitaev, A., 2003. Entanglement
in Quantum Critical Phenomena. Phys. Rev. Lett. 90, 227902.
https://doi.org/10.1103/PhysRevLett.90.227902.
[23] White, S.R., 1992. Density matrix formulation for quantum renormalization groups.
Phys. Rev. Lett. 69, 2863–2866. https://doi.org/10.1103/PhysRevLett.69.2863.
[24] Cardy, J., 1996. Scaling and Renormalization in Statistical Physics. Cambridge
University Press.
[25] Chandler, D., 1987. Introduction to Modern Statistical Mechanics.
[26] Kardar, M., 2007. Statistical Physics of Fields [WWW Document]. Cambridge Core.
https://doi.org/10.1017/CBO9780511815881 .
[27] Chen, J., 2018. rbm2mps: MPS representation of RBM.
https://github.com/yzcj105/rbm2mps.
[28] Orus, R., 2014. A practical introduction to tensor networks: Matrix product
states and projected entangled pair states. Annals of Physics 349, 117–158.
https://doi.org/10.1016/j.aop.2014.06.013 .
[29] Schollwock, U., 2011. The density-matrix renormalization group in the age of matrix
product states. Annals of Physics, January 2011 Special Issue 326, 96–192.
https://doi.org/10.1016/j.aop.2010.09.012 .
[30] Barthel, T., Kliesch, M., Eisert, J., 2010. Real-Space Renormaliza-
tion Yields Finite Correlations. Phys. Rev. Lett. 105, 010502.
https://doi.org/10.1103/PhysRevLett.105.010502.
[31] Evenbly, G., Vidal, G., 2014. Class of Highly Entangled Many-Body
States that can be Efficiently Simulated. Phys. Rev. Lett. 112, 240502.
https://doi.org/10.1103/PhysRevLett.112.240502.
[32] Vidal, G., 2007. Entanglement Renormalization. Phys. Rev. Lett. 99, 220405.
https://doi.org/10.1103/PhysRevLett.99.220405.
[33] Le Roux, N., Bengio, Y., 2008. Representational Power of Restricted Boltzmann
Machines and Deep Belief Networks. Neural Computation 20, 1631–1649.
https://doi.org/10.1162/neco.2008.04-07-510.
[34] Tensor rank decomposition, 2018. Wikipedia.
[35] Kolda, T., Bader, B., 2009. Tensor Decompositions and Applications. SIAM Rev. 51,
455–500. https://doi.org/10.1137/07070111X.
[36] Han, Z.-Y., Wang, J., Fan, H., Wang, L., Zhang, P., 2017. Unsupervised Gener-
ative Modeling Using Matrix Product States. arXiv:1709.01662 [cond-mat,
physics:quant-ph, stat].
[37] Mocanu, D.C., Mocanu, E., Nguyen, P.H., Gibescu, M., Liotta, A., 2016. A topolog-
ical insight into restricted Boltzmann machines. Mach Learn 104, 243–270.
https://doi.org/10.1007/s10994-016-5570-z.
[38] Stoudenmire, E., Schwab, D.J., 2016. Supervised Learning with Tensor Networks,
in: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (Eds.), Ad-
vances in Neural Information Processing Systems 29. Curran Associates, Inc.,
pp. 4799–4807.