Rapid Introduction to Machine Learning/Deep Learning
Hyeong In Choi
Seoul National University
Lecture 5b: Markov random field (MRF)
November 13, 2015
Table of contents
1. Objectives of Lecture 5b
2. Markov random field (MRF)
   2.1. Basics of MRF
   2.2. Boltzmann machine
   2.3. Restricted Boltzmann machine (RBM)
1. Objectives of Lecture 5b
Objective 1
Learn the minimal MRF formalism that is necessary for understanding deep neural network pretraining using the restricted Boltzmann machine
Objective 2
Learn how the probability structure is encoded in an MRF, especially the energy-based formalism of the Boltzmann machine
Objective 3
Learn some basic formalism of the restricted Boltzmann machine
2. Markov random field (MRF)
2.1. Basics of MRF
Terminology
$G$: undirected graph (not necessarily a tree) in which each node represents a random variable
Let $X_i$ be the random variable represented by node $i$, and let $x_i$ be the value of $X_i$ [we frequently confuse node $i$ with $x_i$]
$x = (x_1, \cdots, x_n)$: the list of the values of all random variables
The joint probability is denoted by
$$P(x) = P(x_1, \cdots, x_n)$$
For each node $x_i$, let $N(x_i)$ be the neighborhood of $x_i$, i.e. $N(x_i)$ is the set of nodes connected to $x_i$
Definition of MRF
We say $P(x)$ satisfies the Markov property if
$$P(X_i = x_i \mid X_j = x_j \text{ for } j \neq i) = P(X_i = x_i \mid X_j = x_j \text{ for } x_j \in N(x_i))$$
$G$ together with $P(x)$ satisfying the Markov property is called a Markov random field (MRF)
Proposition
Let $G$ be an MRF. Let $A, B, C$ be mutually disjoint sets of nodes of $G$. Assume $A$ and $B$ are separated by $C$, meaning that every path from a node in $A$ to a node in $B$ passes through some node in $C$. Then
$$P(A, B \mid C) = P(A \mid C)\,P(B \mid C),$$
i.e. $A$ and $B$ are conditionally independent given $C$. The converse is obviously true.
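To make the separation property concrete, here is a minimal numerical check, not from the lecture: a hypothetical 3-node chain $A - C - B$ (so $C$ separates $A$ from $B$) with made-up positive potentials, where brute-force enumeration confirms $P(A, B \mid C) = P(A \mid C)\,P(B \mid C)$.

```python
import itertools

# Hypothetical 3-node chain MRF: A - C - B (C separates A from B).
# Unnormalized P(a, b, c) = psi1(a, c) * psi2(c, b); all variables binary.
psi1 = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 0.5, (1, 1): 3.0}
psi2 = {(0, 0): 1.5, (0, 1): 0.7, (1, 0): 2.2, (1, 1): 1.0}

# Exact joint by enumerating all 2^3 configurations
joint = {}
for a, b, c in itertools.product([0, 1], repeat=3):
    joint[(a, b, c)] = psi1[(a, c)] * psi2[(c, b)]
Z = sum(joint.values())
joint = {k: v / Z for k, v in joint.items()}

def marginal(a=None, b=None, c=None):
    """Sum the joint over the unspecified (None) variables."""
    return sum(p for (ka, kb, kc), p in joint.items()
               if (a is None or ka == a)
               and (b is None or kb == b)
               and (c is None or kc == c))

# Check P(A, B | C) = P(A | C) * P(B | C) for every configuration
for a, b, c in itertools.product([0, 1], repeat=3):
    p_c = marginal(c=c)
    lhs = joint[(a, b, c)] / p_c                       # P(a, b | c)
    rhs = (marginal(a=a, c=c) / p_c) * (marginal(b=b, c=c) / p_c)
    assert abs(lhs - rhs) < 1e-12
print("A and B are conditionally independent given C")
```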
Example: [graph figure from the slide omitted in the transcript]
Gibbs distributions
Definition
A clique is a set of nodes every node of which is connected to every other node in the set; a maximal clique is a clique that cannot be enlarged by adding another node
A probability distribution $P(x)$ is called a Gibbs distribution if it is of the form
$$P(x) = \prod_{c \in C} \psi_c(x_c),$$
where $C$ is the set of maximal cliques and $\psi_c$ is a non-negative function of $x_c$, where $x_c$ is the list of variables in the clique $c$
Example
maximal cliques: $c_1 = \{x_1, x_2, x_3\}$, $c_2 = \{x_2, x_3, x_4\}$, $c_3 = \{x_3, x_5\}$
$$P(x) = \psi_1(x_1, x_2, x_3)\,\psi_2(x_2, x_3, x_4)\,\psi_3(x_3, x_5)$$
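As an illustration of the Gibbs form, the following sketch (not from the lecture) assigns made-up random potentials to the three maximal cliques above, enumerates all $2^5$ binary configurations, and normalizes the product of clique potentials explicitly (the slide's $\psi_c$ can be taken to absorb the normalization).

```python
import itertools
import numpy as np

# The three maximal cliques of the example graph (0-based indices of x1..x5)
cliques = [(0, 1, 2), (1, 2, 3), (2, 4)]

# Made-up non-negative potentials: one random positive table per clique,
# indexed by the clique's binary configuration.
rng = np.random.default_rng(0)
psis = [rng.uniform(0.1, 2.0, size=(2,) * len(c)) for c in cliques]

def unnormalized(x):
    """Product of clique potentials evaluated at configuration x."""
    p = 1.0
    for c, psi in zip(cliques, psis):
        p *= psi[tuple(x[i] for i in c)]
    return p

# Normalize by brute force over all 2^5 configurations
configs = list(itertools.product([0, 1], repeat=5))
Z = sum(unnormalized(x) for x in configs)
P = {x: unnormalized(x) / Z for x in configs}
print(f"sum of P over all configurations: {sum(P.values()):.6f}")  # 1.0
```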
Theorem (Clifford-Hammersley)
Assume $P(x) > 0$ for all $x$. If $P(x)$ is a Gibbs distribution, then $G$ is an MRF
2.2. Boltzmann machine
$G$: graph
$x_i \in \{1, -1\}$ or $x_i \in \{0, 1\}$
$E$: Energy
$$E(x) = -\sum_{i \sim j} \omega_{ij} x_i x_j - \sum_i b_i x_i$$
$\sum_{i \sim j}$ means the sum over adjacent nodes $i$ and $j$ for $i < j$
$P$: Probability
$$P(x) = \frac{1}{Z} \exp(-\lambda E(x)),$$
where $Z$ is the partition function given by
$$Z = \sum_x \exp(-\lambda E(x))$$
[We usually set $\lambda = 1$]
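A minimal sketch, not from the lecture, of a Boltzmann machine on a hypothetical triangle graph: the weights $\omega_{ij}$ and biases $b_i$ below are made-up values, and the exact distribution is obtained by enumerating all spin configurations with $\lambda = 1$.

```python
import itertools
import numpy as np

# Hypothetical small graph: a triangle on nodes 0, 1, 2 with
# symmetric weights w_ij on edges and biases b_i on nodes.
edges = [(0, 1), (0, 2), (1, 2)]          # adjacent pairs with i < j
w = {(0, 1): 0.8, (0, 2): -0.5, (1, 2): 0.3}
b = np.array([0.1, -0.2, 0.0])

def energy(x):
    """E(x) = -sum_{i~j} w_ij x_i x_j - sum_i b_i x_i."""
    pair = sum(w[e] * x[e[0]] * x[e[1]] for e in edges)
    return -pair - float(b @ x)

# Exact distribution: enumerate all 2^3 spin configurations (lambda = 1)
configs = [np.array(s) for s in itertools.product([-1, 1], repeat=3)]
Z = sum(np.exp(-energy(x)) for x in configs)
for x in configs:
    print(x, f"P = {np.exp(-energy(x)) / Z:.4f}")
```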
2.3. Restricted Boltzmann machine (RBM)
Notation
$x = (x_1, \cdots, x_d)$ (visible units)
$h = (h_1, \cdots, h_n)$ (hidden units)
$(x, h) = (x_1, \cdots, x_d, h_1, \cdots, h_n)$
Energy
$$E(x, h) = -\sum_{i,j} w_{ij} h_i x_j - \sum_j b_j x_j - \sum_i c_i h_i$$
Probability
$$P(x, h) = \frac{1}{Z} \exp(-E(x, h))$$
$$Z = \sum_{x, h} \exp(-E(x, h)) \quad \left[\,= \int \exp(-E(x, h)) \text{ in the continuous case}\,\right]$$
Note
The lower the energy, the higher the probability
If $w_{ij} > 0$, it is more likely that $x_j$ and $h_i$ have the same sign
If $w_{ij} < 0$, it is more likely that $x_j$ and $h_i$ have opposite signs
If $b_j > 0$, it is more likely that $x_j > 0$, and so on
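The energy function itself is a one-liner; the sketch below (illustration only) uses made-up dimensions and randomly drawn parameters $W$, $b$, $c$.

```python
import numpy as np

# Hypothetical dimensions and parameters for illustration:
# d visible units x, n hidden units h, weights W (n x d),
# visible biases b (d,), hidden biases c (n,).
rng = np.random.default_rng(1)
d, n = 4, 3
W = rng.normal(scale=0.5, size=(n, d))
b = rng.normal(scale=0.1, size=d)
c = rng.normal(scale=0.1, size=n)

def rbm_energy(x, h):
    """E(x, h) = -h^T W x - b^T x - c^T h."""
    return -(h @ W @ x) - (b @ x) - (c @ h)

x = np.array([1, 0, 1, 1])
h = np.array([0, 1, 1])
print(f"E(x, h) = {rbm_energy(x, h):.4f}")
```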
Probabilities of RBM
Write
$$E(x, h) = -h^T W x - b^T x - c^T h,$$
where $W = (w_{ij})$, $h = [h_1, \cdots, h_n]^T$, $x = [x_1, \cdots, x_d]^T$
$$P(x, h) = \frac{1}{Z} \exp(h^T W x + b^T x + c^T h)$$
$$P(x) = \sum_h P(x, h)$$
$$P(h) = \sum_x P(x, h)$$
$P(h \mid x)$
Given $x$, $h_i$ and $h_j$ are separated, i.e. conditionally independent. Thus
$$P(h \mid x) = P(h_1, \cdots, h_n \mid x) = \prod_i P(h_i \mid x)$$
Remark
This fact can be proved directly as follows. Let $W_i = W_{i\bullet}$ be the $i$th row of $W$. Then
$$h^T W x = \sum_i h_i W_i x.$$
Thus
$$P(h \mid x) = \frac{\exp(h^T W x + b^T x + c^T h)}{\sum_h \exp(h^T W x + b^T x + c^T h)}$$
$$= \frac{\prod_i \exp(h_i W_i x + c_i h_i)}{\sum_{h_1, \cdots, h_n} \prod_i \exp(h_i W_i x + c_i h_i)}$$
$$= \frac{\prod_i \exp(h_i W_i x + c_i h_i)}{\prod_i \sum_{h_i} \exp(h_i W_i x + c_i h_i)}$$
$$= \prod_i \frac{\frac{1}{Z} \exp(h_i W_i x + b^T x + c_i h_i)}{\frac{1}{Z} \sum_{h_i} \exp(h_i W_i x + b^T x + c_i h_i)}$$
$$= \prod_i \frac{P(x, h_i)}{\sum_{h_i} P(x, h_i)}$$
$$= \prod_i P(h_i \mid x)$$
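The factorization can also be checked numerically. The sketch below (not in the lecture, with made-up parameters) computes $P(h \mid x)$ for a tiny binary RBM by enumeration and compares it against the product of per-unit conditionals, each also obtained by enumeration.

```python
import itertools
import numpy as np

# Tiny binary RBM with made-up parameters: 3 visible, 2 hidden units
rng = np.random.default_rng(2)
d, n = 3, 2
W = rng.normal(size=(n, d))
b = rng.normal(size=d)
c = rng.normal(size=n)

def unnorm(x, h):
    """exp(-E(x, h)) = exp(h^T W x + b^T x + c^T h)."""
    return np.exp(h @ W @ x + b @ x + c @ h)

x = np.array([1, 0, 1])
hs = [np.array(t) for t in itertools.product([0, 1], repeat=n)]
denom = sum(unnorm(x, h) for h in hs)          # proportional to P(x)

def p_hi_given_x(i, v):
    """P(h_i = v | x): sum P(h | x) over the other hidden units."""
    return sum(unnorm(x, h) for h in hs if h[i] == v) / denom

for h in hs:
    lhs = unnorm(x, h) / denom                 # P(h | x)
    rhs = np.prod([p_hi_given_x(i, h[i]) for i in range(n)])
    assert np.isclose(lhs, rhs)
print("P(h|x) factorizes over the hidden units")
```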
Special Case: binary neurons
Assume $x_j \in \{0, 1\}$, $h_i \in \{0, 1\}$. Then
$$P(h_i \mid x) = \frac{\exp(h_i W_i x + c_i h_i)}{\sum_{h_i} \exp(h_i W_i x + c_i h_i)}$$
Thus
$$P(h_i = 1 \mid x) = \frac{\exp(W_i x + c_i)}{1 + \exp(W_i x + c_i)} = \mathrm{sigm}(W_i x + c_i)$$
By symmetry,
$$P(x_j = 1 \mid h) = \mathrm{sigm}(W_{\bullet j}^T h + b_j),$$
where $W_{\bullet j}$ is the $j$th column of $W$. Now
$$P(x) = \sum_h P(x, h)$$
$$= \frac{e^{b^T x}}{Z} \prod_i \sum_{h_i} \exp(h_i W_i x + c_i h_i)$$
$$= \frac{e^{b^T x}}{Z} \prod_i \left[1 + \exp(W_i x + c_i)\right]$$
$$= \frac{e^{b^T x}}{Z} \exp \sum_i \log(1 + \exp(W_i x + c_i))$$
$$= \frac{1}{Z} \exp\left[b^T x + \sum_i \log(1 + \exp(W_i x + c_i))\right]$$
$$= \frac{1}{Z} \exp\left[b^T x + \sum_i \mathrm{softplus}(W_i x + c_i)\right],$$
where $\mathrm{softplus}(t) = \log(1 + e^t)$
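A sketch of these formulas in code, again with made-up parameters: the two sigmoid conditionals and the unnormalized $\log P(x)$ in softplus form, verified against a brute-force sum over all hidden configurations.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
d, n = 3, 2                        # made-up sizes for illustration
W = rng.normal(size=(n, d))
b = rng.normal(size=d)
c = rng.normal(size=n)

def sigm(t):
    return 1.0 / (1.0 + np.exp(-t))

def softplus(t):
    return np.log1p(np.exp(t))

def p_h_given_x(x):
    """Vector of P(h_i = 1 | x) = sigm(W_i x + c_i)."""
    return sigm(W @ x + c)

def p_x_given_h(h):
    """Vector of P(x_j = 1 | h) = sigm(W_{.j}^T h + b_j)."""
    return sigm(W.T @ h + b)

def log_p_x_unnorm(x):
    """log(Z * P(x)) = b^T x + sum_i softplus(W_i x + c_i)."""
    return b @ x + softplus(W @ x + c).sum()

# Sanity check: the softplus form equals the brute-force sum over h
x = np.array([1, 0, 1])
brute = sum(np.exp(h @ W @ x + b @ x + c @ h)
            for h in (np.array(t) for t in itertools.product([0, 1], repeat=n)))
assert np.isclose(np.exp(log_p_x_unnorm(x)), brute)
print("softplus form of P(x) matches direct enumeration")
```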
Example: Ising model
$x_i \in \{1, -1\}$
Configuration $x$:
$$x = \{x_1, \cdots, x_i, \cdots, x_n\}$$
There are $2^n$ configurations
Hamiltonian (Energy)
$$H = -\sum_{i \sim j} h_{ij} x_i x_j - \sum_i b_i x_i$$
$\sum_{i \sim j}$ means the sum over adjacent nodes $i$ and $j$ for $i < j$
Probability of configuration $x$
$$P(x) \sim \exp(-\lambda H)$$
$$\lambda = \frac{1}{k_B T}$$
$k_B$: Boltzmann constant (usually set to be 1)
$T$: temperature
If most $x_i$ are aligned in the same direction, the energy (Hamiltonian) tends to be smaller, thus the probability is bigger
The Ising model is an idealized "magnet" model
Partition function
$$Z = \sum_x \exp(-\lambda H(x))$$
Thus
$$P(x) = \frac{1}{Z} \exp(-\lambda H)$$
Due to the large number of configurations ($2^n$), it is impractical to compute $Z$
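To see the blow-up concretely, here is a sketch (illustration only, with a hypothetical 2x2 grid and made-up couplings) that computes $Z$ by brute force; the same loop over $2^n$ configurations is exactly what becomes infeasible for realistic $n$.

```python
import itertools
import numpy as np

# Hypothetical 2x2 grid Ising model: 4 spins, bonds between grid neighbors
edges = [(0, 1), (2, 3), (0, 2), (1, 3)]   # horizontal and vertical bonds
h_ij = {e: 1.0 for e in edges}             # made-up uniform coupling
b = np.zeros(4)                            # no external field
lam = 1.0                                  # lambda = 1 / (k_B T), k_B = T = 1

def hamiltonian(x):
    """H = -sum_{i~j} h_ij x_i x_j - sum_i b_i x_i."""
    return -sum(h_ij[e] * x[e[0]] * x[e[1]] for e in edges) - float(b @ x)

# Brute-force partition function: feasible only because 2^4 = 16 here;
# for an n-spin system this sum has 2^n terms.
configs = [np.array(s) for s in itertools.product([-1, 1], repeat=4)]
Z = sum(np.exp(-lam * hamiltonian(x)) for x in configs)
aligned = np.array([1, 1, 1, 1])
print(f"Z = {Z:.3f}, P(all spins aligned) = "
      f"{np.exp(-lam * hamiltonian(aligned)) / Z:.4f}")
```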