TRANSCRIPT
Lorena Álvarez Pérez Machine Learning Group (MLG)
March 5, 2018
Structured Probabilistic
Models for Deep Learning
Lectures slides for Chapter 16 of Deep Learning www.deeplearningbook.org
Ian Goodfellow
Deep Learning Book Chapter 16: Graphical Models March 5, 2018
Index
0. Overview
1. The Challenge of Unstructured Modeling
2. Using Graphs to Describe Model Structure
3. Sampling from Graphical Models
4. Advantages of Structured Modeling
5. Learning about Dependencies
6. Inference and Approximate Inference
7. The Deep Learning Approach to Structured Probabilistic Models
Role of structured probabilistic models in deep learning
• For many tasks (other than classification), a full representation of the probability distribution over the variables is needed, e.g., denoising, missing value imputation, sampling, etc.
• Structured models (also known as probabilistic graphical models, PGMs) provide compact representations compared to full (unstructured) probability distributions
March 5, 2018 3/58
Deep Learning Book: Chapter 16
0. Overview
What are structured probabilistic models?
• A way of describing probability distributions, using a graph to describe which variables interact with each other directly
• Graph is used in the sense of graph theory: vertices connected to one another by edges
• Because their structure is described by a graph, they are called graphical models
In deep learning, different model structures, learning algorithms and inference procedures are used!
March 5, 2018 4/58
Deep Learning Book: Chapter 16
0. Overview
• The goal of deep learning is to scale machine learning to the kinds of challenges needed to solve artificial intelligence – e.g., understand natural images, audio waveforms representing speech, etc.
• The classification task of machine learning has a limited goal
– Classifiers take input from a rich, high-dimensional distribution and summarize it with a categorical label
– They discard most of the input
– They produce a single output (or a probability distribution over values of that single output)
It is possible to ask probabilistic models to do many
other tasks!
March 5, 2018 5/58
Deep Learning Book: Chapter 16
1. The Challenge of Unstructured Modeling
Probabilistic models for other tasks
• They are more expensive than classification
• They require producing many output values
• They require a complete understanding of the structure of the entire input, without ignoring sections of it
• Some tasks are: 1) Density estimation
2) Denoising
3) Missing value imputation
4) Sampling
March 5, 2018 6/58 Deep Learning Book: Chapter 16
1. The Challenge of Unstructured Modeling
Example: Probabilistic modeling of natural images Generate new samples from a distribution p(x)
March 5, 2018 7/58 Deep Learning Book: Chapter 16
1. The Challenge of Unstructured Modeling
Intractability of rich distribution • It is a challenging task, both computationally and statistically
• Consider a 32x32x3 binary image – There are 2^3072 possible images
• If we have n discrete variables with k possible values each, the naïve approach of representing p(x) requires storing a table with k^n values!!
This is not feasible!
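To make the blow-up concrete, a quick back-of-the-envelope check in plain Python (the function name `table_size` is just illustrative):

```python
# Size of the naive lookup table for p(x): k^n entries for
# n discrete variables with k possible values each.

def table_size(n_vars, n_values):
    """Entries in a full joint table over n_vars variables, n_values states each."""
    return n_values ** n_vars

# A 32x32x3 binary image has 3072 binary variables.
n_images = table_size(32 * 32 * 3, 2)   # 2**3072 possible images
print(len(str(n_images)))               # 925 decimal digits
```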
March 5, 2018 8/58 Deep Learning Book: Chapter 16
1. The Challenge of Unstructured Modeling
Intractability of rich distribution • Memory • Statistical efficiency • Runtime: Cost of inference
– If we have p(x) and need to infer p(x1) or p(x2|x1), this requires summing across the entire table
• Runtime: Cost of sampling
Table-based approach models every possible interaction between variables, but usually variables influence each other only indirectly
March 5, 2018 9/58 Deep Learning Book: Chapter 16
1. The Challenge of Unstructured Modeling
Direct and indirect interaction
• Consider modeling finishing times in a relay race. The team has three runners: Alice, Bob, and Carol.
• Alice hands the baton to Bob, and Bob hands it to Carol, who finishes the lap
– Alice's finishing time does not depend on anyone else's, Bob's finishing time depends on Alice's, and Carol's depends on Bob's
– Carol's finishing time depends only indirectly on Alice's
– If we already know Bob's finishing time, we will not be able to better estimate Carol's finishing time by finding out what Alice's finishing time was
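This conditional independence can be checked numerically. The sketch below builds a joint distribution from hypothetical CPDs for the chain t0 → t1 → t2 (the probability values are made up for illustration) and verifies that, once Bob's time is known, Alice's time carries no extra information about Carol's:

```python
import itertools

# Hypothetical discrete CPDs over 2 time values (0 = "fast", 1 = "slow").
p_t0 = {0: 0.6, 1: 0.4}
p_t1_given_t0 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}   # p(t1 | t0)
p_t2_given_t1 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}   # p(t2 | t1)

# Joint distribution implied by the chain t0 -> t1 -> t2.
joint = {(a, b, c): p_t0[a] * p_t1_given_t0[a][b] * p_t2_given_t1[b][c]
         for a, b, c in itertools.product([0, 1], repeat=3)}

def cond_t2(t1, t0=None):
    """p(t2 = 1 | t1 [, t0]) computed directly from the joint table."""
    keep = [(k, v) for k, v in joint.items()
            if k[1] == t1 and (t0 is None or k[0] == t0)]
    return sum(v for k, v in keep if k[2] == 1) / sum(v for _, v in keep)

# Knowing Alice's time adds nothing once Bob's time is known:
for t1 in (0, 1):
    assert abs(cond_t2(t1, t0=0) - cond_t2(t1, t0=1)) < 1e-12
```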
March 5, 2018 10/58
2. Using Graphs to Describe Model Structure
2.1 Directed Models
Deep Learning Book: Chapter 16
Using graphs to describe model structure • Each node represents a random variable
• Each edge represents a direct interaction - These direct interactions imply other indirect interactions
- But only direct interactions need to be represented
• Graphical models can be largely divided into two categories: 1) Models based on directed acyclic graphs
2) Models based on undirected graphs
March 5, 2018 11/58
2. Using Graphs to Describe Model Structure
2.1 Directed Models
Deep Learning Book: Chapter 16
Directed Graphical Model • Also called a belief network or Bayesian network
• In the relay race example…
Bob’s finishing time t1 depends on Alice’s finishing time t0 and Carol’s finishing time t2 depends on Bob’s finishing time t1
What does the arrow represent?
March 5, 2018 12/58
2. Using Graphs to Describe Model Structure
2.1 Directed Models
Deep Learning Book: Chapter 16
t0 t1 t2
Alice Bob Carol
Meaning of directed edges
Drawing an arrow from "a" to "b" means we define a conditional probability distribution (CPD) over "b", with "a" as one of the variables on the right side of the conditioning bar
– i.e., the distribution over "b" depends on the value of "a"
March 5, 2018 13/58
2. Using Graphs to Describe Model Structure
2.1 Directed Models
Deep Learning Book: Chapter 16
a b p(b|a)
Formal directed graphical model
• A directed acyclic graph G defined on variables x needs:
– A set of vertices, which represent the random variables in the model
– A set of local CPDs: p(x_i | Pa_G(x_i)), where Pa_G(x_i) denotes the parents of x_i in G
– The probability distribution over x is given by:

p(x) = ∏_i p(x_i | Pa_G(x_i))

– In the relay race example:

p(t0, t1, t2) = p(t0) p(t1 | t0) p(t2 | t1)

March 5, 2018 14/58
2. Using Graphs to Describe Model Structure
2.1 Directed Models
Deep Learning Book: Chapter 16
Savings achieved by the directed model
• As an example: if t0, t1, and t2 are discrete with 100 values each, a single joint table would require 999,999 values
– By making tables for only the conditional probabilities, we need only 19,899 values (a reduction by a factor of more than 50)
• The cost of a single table for modeling n discrete variables, each having k values, is O(k^n)
• If m is the maximum number of variables appearing (on either side of the conditioning bar) in a single CPD, the cost of the tables for the directed model is O(k^m)
– As long as m << n, very dramatic savings are obtained!
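A minimal sketch of this parameter count (the helper names are illustrative):

```python
def full_table_params(n, k):
    """A single joint table needs k**n entries; one is fixed by normalization."""
    return k ** n - 1

def cpd_params(parent_counts, k):
    """Each CPD p(x_i | parents) needs (k - 1) free entries per parent configuration."""
    return sum((k - 1) * k ** m for m in parent_counts)

# Relay race: t0 has no parents, t1 and t2 have one parent each; k = 100.
print(full_table_params(3, 100))      # 999999
print(cpd_params([0, 1, 1], 100))     # 99 + 9900 + 9900 = 19899
```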
March 5, 2018 15/58
2. Using Graphs to Describe Model Structure
2.1 Directed Models
Deep Learning Book: Chapter 16
O(k^n)
O(k^m)
• Also known as Markov random fields or Markov networks
• They use graphs whose edges are undirected, and they have no CPDs
– Directed models work best when influence clearly flows in one direction
– Undirected models work best when influence has no clear direction, or is best modeled as flowing in both directions
Let’s see an example!
March 5, 2018 16/58
2. Using Graphs to Describe Model Structure
2.2 Undirected models
Deep Learning Book: Chapter 16
Example: The health undirected model
• Consider a model over three binary variables:
• Whether or not you are sick (h_y)
• Whether or not your coworker is sick (h_c)
• Whether or not your roommate is sick (h_r)
• Let us assume your coworker and roommate do not know each other, so it is very unlikely that one of them will give a cold to the other (we do not model it)
• There is no clear directionality either, so we use an undirected model
March 5, 2018 17/58
2. Using Graphs to Describe Model Structure
2.2 Undirected models
Deep Learning Book: Chapter 16
h_y
h_c
h_r
Example: The health undirected graph
• You and your roommate may infect each other with a cold
• You and your coworker may do the same
• Let us assume your roommate and coworker do not know each other
March 5, 2018 18/58
2. Using Graphs to Describe Model Structure
2.2 Undirected models
Deep Learning Book: Chapter 16
Nodes: h_c, h_y, h_r; edges: h_y-h_c and h_y-h_r
Does your roommate have a cold?
Do you have a cold?
Does your coworker have a cold?
Formal undirected graphical model
• An undirected probabilistic graphical model is defined on a graph G
– For each clique C in the graph, a factor φ(C) (also called a clique potential) measures the affinity of the variables for being in each of their joint states
– Together, these factors define an unnormalized probability distribution
March 5, 2018 19/58
2. Using Graphs to Describe Model Structure
2.2 Undirected Models
Deep Learning Book: Chapter 16
(A clique C is a subset of nodes all connected to each other.)

p̃(x) = ∏_{C∈G} φ(C)
Example: This graph (with five cliques) implies that
March 5, 2018 20/58
2. Using Graphs to Describe Model Structure
2.2 Undirected Models
Deep Learning Book: Chapter 16
a b c
d e f
p(a, b, c, d, e, f) = (1/Z) φ_{a,b}(a, b) φ_{b,c}(b, c) φ_{a,d}(a, d) φ_{b,e}(b, e) φ_{e,f}(e, f)
• The unnormalized probability distribution p̃(x):
– It is guaranteed to be non-negative
– It is not guaranteed to sum or integrate to 1
• To obtain a valid probability distribution, we must normalize (the result is known as a Gibbs distribution)
• Obviously:
– Z is a constant when the φ functions are constants
– If the φ functions have parameters, then Z is a function of those parameters
March 5, 2018 21/58
2. Using Graphs to Describe Model Structure
2.3 The Partition Function
Deep Learning Book: Chapter 16
p(x) = (1/Z) p̃(x), where p̃(x) = ∏_{C∈G} φ(C)

Z = ∫ p̃(x) dx (a sum for discrete x): the partition function
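A tiny worked example of the partition function, assuming a hypothetical single-clique model over two binary variables (the potential `phi` is made up for illustration):

```python
import itertools
import math

# Hypothetical clique potential: larger when a and b agree.
def phi(a, b):
    return math.exp(2.0 if a == b else 0.0)

states = list(itertools.product([0, 1], repeat=2))
p_tilde = {s: phi(*s) for s in states}          # unnormalized p~(x)
Z = sum(p_tilde.values())                       # partition function (a sum here)
p = {s: v / Z for s, v in p_tilde.items()}      # normalized Gibbs distribution

assert abs(sum(p.values()) - 1.0) < 1e-12       # now a valid distribution
```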
Intractability of Z
• Since Z is an integral or sum over all possible values of x, it is often intractable to compute
• In order to compute a normalized probability in an undirected model:
– The model structure and the definitions of the φ functions must be conducive to computing Z efficiently
– In deep learning, Z is intractable, and we must resort to approximations
March 5, 2018 22/58
2. Using Graphs to Describe Model Structure
2.3 The Partition Function
Deep Learning Book: Chapter 16
Difference between directed & undirected
• Directed models are:
– Defined directly in terms of probability distributions
• Undirected models are:
– Defined more loosely, in terms of φ functions that must then be converted into probability distributions
– The domain of the variables has a dramatic effect on the kind of probability distribution a given set of φ functions corresponds to
March 5, 2018 23/58
2. Using Graphs to Describe Model Structure
2.3 The Partition Function
Deep Learning Book: Chapter 16
• Many interesting theoretical results about undirected graphs depend on the assumption that p̃(x) > 0 for all x
• We can enforce this using an energy-based model
– E(x) is known as the energy function
– Since exp(z) > 0 for all z, no energy function will result in a probability of zero for any state of x
• Any distribution of the form p̃(x) = exp(−E(x)) is referred to as a Boltzmann distribution
March 5, 2018 24/58
2. Using Graphs to Describe Model Structure
2.4 Energy-Based Models
Deep Learning Book: Chapter 16
p̃(x) > 0 ∀x, where p̃(x) = ∏_{C∈G} φ(C)

p̃(x) = exp(−E(x)), with E(x) the energy function; exp(z) > 0 ∀z
• Cliques in the undirected graph correspond to factors in the unnormalized probability function
– Since exp(a) exp(b) = exp(a + b), different cliques in the undirected graph correspond to different terms of the energy function
• Exponentiation makes each term of the energy function correspond to a factor for a different clique
• i.e., an energy-based model is a special kind of Markov network
• This graph (with five cliques) implies that:
March 5, 2018 25/58
2. Using Graphs to Describe Model Structure
2.4 Energy-Based Models
Deep Learning Book: Chapter 16
exp(a)exp(b) = exp(a+ b)
a b c
d e f
E(a, b, c, d, e, f) = E_{a,b}(a, b) + E_{b,c}(b, c) + E_{a,d}(a, d) + E_{b,e}(b, e) + E_{e,f}(e, f)

The φ functions are obtained by setting each φ to the exponential of the corresponding negative energy: φ_{a,b}(a, b) = exp(−E_{a,b}(a, b))
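The correspondence between energy terms and clique factors can be verified numerically. The sketch below uses hypothetical pairwise energies for the cliques {a, b} and {b, c}:

```python
import itertools
import math

# Hypothetical pairwise energies for two cliques of a chain a - b - c.
E_ab = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 0.0}
E_bc = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5}

def energy(a, b, c):
    # Total energy is the sum of the per-clique terms.
    return E_ab[(a, b)] + E_bc[(b, c)]

def phi_ab(a, b):
    # Factor = exponential of the corresponding negative energy.
    return math.exp(-E_ab[(a, b)])

def phi_bc(b, c):
    return math.exp(-E_bc[(b, c)])

# exp(-(E1 + E2)) == exp(-E1) * exp(-E2): energy terms become clique factors.
for a, b, c in itertools.product([0, 1], repeat=3):
    assert abs(math.exp(-energy(a, b, c)) - phi_ab(a, b) * phi_bc(b, c)) < 1e-12
```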
Separation in undirected models • Identifying conditional independences is very simple
– Conditional independence implied by the graph is called separation
• A set of variables A is separated from variables B given a third set of variables S if the graph structure implies that A is independent from B given S
• If two variables "a" and "b" are connected by a path involving only unobserved variables, then they are not separated
– If no path exists between them, or all paths contain an observed variable, then they are separated
March 5, 2018 27/58
2. Using Graphs to Describe Model Structure
2.5 Separation and D-separation
Deep Learning Book: Chapter 16
Separation in undirected models: Example • “b” is shaded to indicate that it is observed
• “b” blocks path from “a” to “c”, so “a” and “c” are separated given “b”
• There is an active path from “a” to “d”, so “a” and “d” are not separated given “b”
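Separation can be tested mechanically: delete the observed nodes and check reachability. A minimal sketch, where the edge set (a-b, b-c, a-d) is one plausible reading of the slide's figure:

```python
from collections import deque

def separated(edges, A, B, observed):
    """A and B are separated given `observed` in an undirected graph iff
    every path between them passes through an observed variable."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Remove observed nodes, then test reachability from A to B.
    frontier = deque(a for a in A if a not in observed)
    seen = set()
    while frontier:
        u = frontier.popleft()
        if u in B:
            return False
        if u in seen:
            continue
        seen.add(u)
        frontier.extend(w for w in adj.get(u, ())
                        if w not in observed and w not in seen)
    return True

edges = [("a", "b"), ("b", "c"), ("a", "d")]
assert separated(edges, {"a"}, {"c"}, {"b"})        # b blocks the only a-c path
assert not separated(edges, {"a"}, {"d"}, {"b"})    # the a-d path stays active
```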
March 5, 2018 28/58
2. Using Graphs to Describe Model Structure
2.5 Separation and D-separation
Deep Learning Book: Chapter 16
a
c
d
b
Separation in directed models
• In the context of directed graphs, these separation concepts are called d-separation
• D-separation is defined the same way as separation for undirected graphs:
– A set of variables A is d-separated from a set of variables B given a third set of variables S if the graph structure implies that A is independent of B given S
• Two variables are dependent if there is an active path between them; if not, they are d-separated
• In directed models, determining whether a path is active is more complicated
March 5, 2018 29/58
2. Using Graphs to Describe Model Structure
2.5 Separation and D-separation
Deep Learning Book: Chapter 16
All active paths (of length 2) in directed models between "a" and "b"
March 5, 2018 30/58
2. Using Graphs to Describe Model Structure
2.5 Separation and D-separation
Deep Learning Book: Chapter 16
(Figure: three cases of length-2 paths: a chain, as in the relay race; a common cause s; and a V-structure or collider, which produces the explaining away effect)
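The activation rules for these three length-2 path types can be written down directly. A small sketch (the simplified `path_active` helper is illustrative and only covers paths of length 2):

```python
def path_active(kind, s_observed, s_descendant_observed=False):
    """Is the length-2 path through middle node s active?
    kind: 'chain' (a->s->b), 'common_cause' (a<-s->b), 'collider' (a->s<-b)."""
    if kind in ("chain", "common_cause"):
        # Observing the middle node blocks chains and common causes.
        return not s_observed
    if kind == "collider":
        # Explaining away: a collider path is active only when s
        # (or a descendant of s) is observed.
        return s_observed or s_descendant_observed
    raise ValueError(kind)

assert not path_active("chain", s_observed=True)     # relay race: t1 blocks
assert path_active("common_cause", s_observed=False)
assert path_active("collider", s_observed=True)      # explaining away
assert not path_active("collider", s_observed=False)
```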
March 5, 2018
2. Using Graphs to Describe Model Structure
2.6 Converting between Undirected and Directed Graphs
Deep Learning Book: Chapter 16
• No probabilistic model is inherently directed or undirected
– Some models are most easily described using a directed graph, others using an undirected graph
• Directed models and undirected models both have their advantages and disadvantages
– The choice will partially depend on which probability distribution we wish to describe
– Which approach can capture the most independences in the probability distribution, or which approach uses the fewest edges
• Every probability distribution can be represented by either a directed model or by an undirected model
– Worst case: the "complete graph"
31/58
"Complete graphs"
• Directed models:
– Any directed acyclic graph where we impose some ordering on the random variables
– Each variable has all of the variables that precede it in the ordering as its ancestors in the graph
• Undirected models:
– A graph containing a single clique encompassing all of the variables
March 5, 2018 32/58
2. Using Graphs to Describe Model Structure
2.6 Converting between Undirected and Directed Graphs
Deep Learning Book: Chapter 16
Undirected model Directed model
Not useful because they do not imply any independences!
Converting a directed model into an undirected model
• We need to create a new graph U
• Looking at the directed graph D:
– For every pair of variables "x" and "y", we add an undirected edge connecting "x" and "y" to U if there is a directed edge between them in D, or if "x" and "y" are both parents of a third variable "z"
• The resulting graph U is known as a moralized graph
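A minimal sketch of moralization, assuming the directed graph is given as a child-to-parents mapping (this representation is a simplifying assumption):

```python
import itertools

def moralize(parents):
    """Moralized undirected graph U from a directed graph D given as
    {child: set of parents}; edges are returned as frozensets of node pairs."""
    edges = set()
    for child, ps in parents.items():
        for p in ps:
            edges.add(frozenset((p, child)))        # keep every original edge
        for p, q in itertools.combinations(sorted(ps), 2):
            edges.add(frozenset((p, q)))            # "marry" co-parents
    return edges

# Collider a -> c <- b: moralization adds the undirected edge a-b.
U = moralize({"c": {"a", "b"}})
print(sorted(sorted(e) for e in U))   # [['a', 'b'], ['a', 'c'], ['b', 'c']]
```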
March 5, 2018 33/58
2. Using Graphs to Describe Model Structure
2.6 Converting between Undirected and Directed Graphs
Deep Learning Book: Chapter 16
Examples of converting directed models to undirected models
March 5, 2018 34/58
2. Using Graphs to Describe Model Structure
2.6 Converting between Undirected and Directed Graphs
Deep Learning Book: Chapter 16
Undirected model
Directed model
Converting an undirected model to a directed model
• A loop is a sequence of variables connected by undirected edges, with the last variable in the sequence connected back to the first
• A chord is a connection between any two non-consecutive variables in the sequence defining a loop
• We cannot create a directed model if the graph has chordless loops of length four or greater
– Solution: add edges to triangulate long loops (the new graph is known as a chordal or triangulated graph)
• Finally, it is necessary to assign directions to the edges
– No directed cycles are allowed!
March 5, 2018 35/58
2. Using Graphs to Describe Model Structure
2.6 Converting between Undirected and Directed Graphs
Deep Learning Book: Chapter 16
Examples of converting an undirected model to a directed one
March 5, 2018 36/58
2. Using Graphs to Describe Model Structure
2.6 Converting between Undirected and Directed Graphs
Deep Learning Book: Chapter 16
Undirected model Directed model
No chordless loops of length greater than three are allowed!
Edges are added to triangulate long loops
Directions are assigned to the edges (no directed cycles are allowed!)
• Factor graphs resolve an ambiguity in the graphical representation of standard undirected models
– The ambiguity arises because it is not clear whether each clique actually has a corresponding factor whose scope encompasses the entire clique
• A factor graph is a graphical representation of an undirected model that consists of a bipartite undirected graph
– Some of the nodes are drawn as circles
• They correspond to the random variables in the standard undirected model
– The rest of the nodes are drawn as squares
• They correspond to the factors of the unnormalized probability distribution
• A variable and a factor are connected if the variable is one of the arguments to the factor
– No factor may be connected to another factor in the graph, nor may a variable be connected to a variable
March 5, 2018 37/58
2. Using Graphs to Describe Model Structure
2.7 Factor graphs
Deep Learning Book: Chapter 16
Example of how a factor graph can resolve ambiguity
March 5, 2018 38/58
2. Using Graphs to Describe Model Structure
2.7 Factor graphs
Deep Learning Book: Chapter 16
Undirected graph: is this three pairwise potentials, or one potential over three variables?
This factor graph has one factor over all three variables
This factor graph has three factors (each over only two variables)
March 5, 2018 39/58
3. Sampling from Graphical Models
3.1 Direct models
Deep Learning Book: Chapter 16
• In directed graphical models, the ancestral sampling method can produce samples from the joint distribution represented by the model
• How does ancestral sampling work?
– Sort the variables into a topological ordering, so that for all i and j, j is greater than i if x_i is a parent of x_j
– The variables can then be sampled in this order: first x_1, then x_2, and so on, up to x_n
• It does not support every conditional sampling operation
– Sampling from a subset of the variables in a directed graphical model, given other variables, requires that all of the conditioning variables come earlier than the variables to be sampled in the ordering
x_1 ∼ P(x_1), x_2 ∼ P(x_2 | Pa_G(x_2)), …, x_n ∼ P(x_n | Pa_G(x_n))
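Ancestral sampling for the relay race can be sketched directly; the CPDs below (uniform leg times of 9 to 11 units added at each handoff) are hypothetical:

```python
import random

random.seed(0)

# Hypothetical relay-race CPDs: each leg adds a positive increment.
def sample_t0():
    return random.uniform(9.0, 11.0)

def sample_t1(t0):
    return t0 + random.uniform(9.0, 11.0)

def sample_t2(t1):
    return t1 + random.uniform(9.0, 11.0)

def ancestral_sample():
    # Topological order t0, t1, t2: parents are always sampled first.
    t0 = sample_t0()
    t1 = sample_t1(t0)
    t2 = sample_t2(t1)
    return t0, t1, t2

samples = [ancestral_sample() for _ in range(1000)]
# Finishing times respect the chain structure in every sample.
assert all(t0 < t1 < t2 for t0, t1, t2 in samples)
```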
March 5, 2018 40/58
3. Sampling from Graphical Models
3.2 Undirected models
Deep Learning Book: Chapter 16
• Ancestral sampling is applicable only to directed models
• We can sample from undirected models by converting them to directed models
– But this involves solving intractable problems
– Or introducing so many edges that the resulting directed model becomes intractable
• Gibbs sampling: the conceptually simplest approach for drawing samples from an undirected graph
– Suppose we have a graphical model over an n-dimensional vector of random variables x
– We iteratively visit each variable x_i and draw a sample conditioned on all the other variables, i.e., from p(x_i | x_−i)
– Asymptotically, after many repetitions, the process converges to sampling from the correct distribution
• It is difficult to determine when the samples have reached a sufficiently accurate approximation of the desired distribution
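A minimal Gibbs sampler for a hypothetical two-variable model whose single potential favours agreement (exact marginals are available here, so convergence can be checked):

```python
import math
import random

random.seed(1)

# Pairwise potential favouring agreement between two binary variables.
def phi(a, b):
    return math.exp(1.0 if a == b else 0.0)

def conditional(i, x):
    """p(x_i = 1 | x_-i) for the two-variable model."""
    other = x[1 - i]
    w1 = phi(1, other)
    w0 = phi(0, other)
    return w1 / (w0 + w1)

def gibbs(n_steps):
    x = [0, 0]
    samples = []
    for _ in range(n_steps):
        for i in (0, 1):                 # visit each variable in turn
            x[i] = 1 if random.random() < conditional(i, x) else 0
        samples.append(tuple(x))
    return samples

samples = gibbs(20000)
agree = sum(a == b for a, b in samples) / len(samples)
# Exact p(x0 = x1) is 2e / (2e + 2) = e / (e + 1) ≈ 0.731.
```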
March 5, 2018 41/58
4. Advantages of Structured Modeling
Deep Learning Book: Chapter 16
• Structured modeling dramatically reduces the cost of representing probability distributions, as well as of learning and inference
– By assuming each node has a tabular distribution given its parents, the memory, sampling, and inference costs are now exponential in the number of variables in the factor with the largest scope
• For many interesting models, this is very small
• e.g., RBMs: all factor scopes are of size 2 or 1
– Previously, these costs were exponential in the total number of nodes
– Statistically, it is much easier to estimate this manageable number of parameters
March 5, 2018 42/58
5. Learning about Dependencies
Deep Learning Book: Chapter 16
• A good generative model needs to accurately capture the distribution over the visible variables v
– The different elements of v are highly dependent on each other
– In deep learning, these dependencies are modeled by introducing latent variables h
– A good model of v that did not contain any latent variables would need to have:
• A very large number of parents per node in a Bayesian network
• A very large number of cliques in a Markov network
Highly costly in both the computational and the statistical sense!
March 5, 2018 43/58
5. Learning about Dependencies
Deep Learning Book: Chapter 16
• When the model is intended to capture dependencies between visible variables with direct connections, it is usually infeasible to connect all variables
– The graph must be designed to connect those variables that are tightly coupled and omit edges between other variables
• Structure learning algorithms perform a greedy search over candidate structures
• Using latent variables instead of adaptive structure avoids discrete searches and multiple rounds of training:
– Use one graph structure
– Many latent variables
– Dense connections of latent variables to observed variables
– The parameters learn that each latent variable interacts strongly with only a small subset of observed variables
March 5, 2018 44/58
6. Inference and approximate inference
Deep Learning Book: Chapter 16
Inference
• We ask questions about how variables are related to each other
– e.g., given a set of medical tests, we can ask what disease a patient might have
– In a latent variable model, we may want to extract features E[h | v] describing the observed variables v
– We solve such problems in order to perform other tasks
• We often want to compute p(h | v) or E[h | v] (and sometimes the marginal p(v))
• These are inference problems:
– Predict some variables given other variables
– Predict distributions over some variables given values of other variables
March 5, 2018 45/58
6. Inference and approximate inference
Deep Learning Book: Chapter 16
Intractability of Inference • For most interesting deep models, the inference problems are
intractable – Even when we use a structured graphical model to simplify them
• Graph structures allow us to represent complicated high-dimensional distributions with a reasonable number of parameters
– But the resulting graphs are usually not restrictive enough to allow efficient inference
• Computing the marginal probability is #P-hard
– Problems in NP require determining whether a problem has a solution and, if so, finding it
– Problems in #P require counting the number of solutions
• This motivates the use of approximate inference in deep learning
– It is usually formulated as variational inference:
• Approximate the true distribution p(h | v) by another distribution q(h | v) that is as close to the true one as possible
March 5, 2018 46/58
7. The Deep Learning Approach to Structured Probabilistic Models
Deep Learning Book: Chapter 16
• Deep learning does not necessarily involve especially deep graphical models
• The main differences in structured probabilistic models in deep learning – Depth – Proportion of observed to latent variables – Latent semantics (meaning of a latent variable) – Connectivity and inference algorithm – Intractability and approximation
March 5, 2018 47/58
7. The Deep Learning Approach to Structured Probabilistic Models
Deep Learning Book: Chapter 16
Depth of a graphical model
• A latent variable h_i is at depth j if the shortest path from h_i to an observed variable is j steps
• The depth of the model is the greatest depth of any such h_i
• This kind of depth is different from the depth induced by the computational graph
• Many generative models used for deep learning have no latent variables (or only one layer of them), but use deep computational graphs to define the conditional distributions within the model
March 5, 2018 48/58
7. The Deep Learning Approach to Structured Probabilistic Models
Deep Learning Book: Chapter 16
Proportion of observed/latent variables
• Deep learning models typically have more latent variables than observed variables
– They always make use of distributed representations
• Even shallow models have a single large layer of latent variables
• Complicated non-linear interactions between variables are accomplished via indirect connections that flow through multiple latent variables
• By contrast, traditional graphical models contain mostly variables that are observed (i.e., few latent variables)
March 5, 2018 49/58
7. The Deep Learning Approach to Structured Probabilistic Models
Deep Learning Book: Chapter 16
Latent variable semantics • Latent variables are designed differently in deep learning • In traditional graphical models, they are designed with specific
semantics in mind – Topic of a document, intelligence of a student, disease causing a
patient’s symptoms, etc.
• In deep learning, latent variables are not designed with any specific semantics ahead of time
– The training algorithm is free to invent the concepts it needs to model a particular dataset
– The latent variables are not easy to interpret after the fact
March 5, 2018 50/58
7. The Deep Learning Approach to Structured Probabilistic Models
Deep Learning Book: Chapter 16
Connectivity
• Deep graphical models have large groups of units connected to other large groups of units
– Interactions between two groups can be described by a single matrix
• Traditional graphical models have few connections and the choice of connections for each variable may be individually designed – The design of the model structure is tightly linked to the choice of
inference algorithm
March 5, 2018 51/58
7. The Deep Learning Approach to Structured Probabilistic Models
Deep Learning Book: Chapter 16
Inference
• Traditional graphical models are typically designed so that exact inference stays tractable
– When this is too limiting, a popular approximate approach is loopy belief propagation
– Both approaches work well with sparsely connected graphs
• Models used in deep learning are not sparsely connected, and instead use either Gibbs sampling or variational inference
• Rather than simplifying the model until exact inference is feasible, we make the model as complex as needed, as long as we can compute a gradient
March 5, 2018 52/58
7. The Deep Learning Approach to Structured Probabilistic Models
Deep Learning Book: Chapter 16
The restricted Boltzmann machine (RBM) • Quintessential example of how graphical models are used for
deep learning • RBM itself is not a deep model
– It has a single layer of latent units that may be used to learn a representation for the input
– RBMs can be used to build many deeper models (Chapter 20)
March 5, 2018
7. The Deep Learning Approach to Structured Probabilistic Models
Deep Learning Book: Chapter 16
• A general Boltzmann machine can have arbitrary connections
• Restricted Boltzmann machine (RBM): the first layer is called the visible or input layer, and the second is the hidden layer
– A bipartite undirected graph
• Used for dimensionality reduction, classification, or feature learning
There are no direct interactions between any two visible units, or between any two hidden units (hence "restricted")
53/58
March 5, 2018 54/58
7. The Deep Learning Approach to Structured Probabilistic Models
Deep Learning Book: Chapter 16
RBM Characteristics • Units are organized into large groups called layers • Connectivity between layers is described by a matrix • Connectivity is relatively dense • The model is designed to allow efficient Gibbs sampling • Learn latent variables whose semantics are not specified by the
designer
March 5, 2018
7. The Deep Learning Approach to Structured Probabilistic Models
Deep Learning Book: Chapter 16
Canonical RBM
• An energy-based model with binary visible and hidden units
– The model is divided into groups of units v and h, and the interaction between them is described by the matrix W:

E(v, h) = −b^T v − c^T h − v^T W h

(b, c, and W are unconstrained, real-valued, learnable parameters)

• The restrictions on the RBM structure yield the properties:

p(h | v) = ∏_i p(h_i | v)   and   p(v | h) = ∏_i p(v_i | h)
55/58
March 5, 2018 56/58
7. The Deep Learning Approach to Structured Probabilistic Models
Deep Learning Book: Chapter 16
Example: For the binary RBM, we obtain

P(h_i = 1 | v) = σ(v^T W_{:,i} + c_i)
P(h_i = 0 | v) = 1 − σ(v^T W_{:,i} + c_i)

– Together these properties allow for block Gibbs sampling, which alternates between sampling all of h simultaneously and all of v simultaneously
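Block Gibbs sampling for a tiny binary RBM can be sketched in plain Python; the parameters below are hypothetical, and the conditionals follow the factorized sigmoid form described on this slide (with hidden bias c and visible bias b):

```python
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Tiny binary RBM: 3 visible, 2 hidden units, hypothetical parameters.
W = [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]]   # W[i][j] links v_i and h_j
b = [0.0, 0.1, -0.1]                          # visible biases
c = [0.2, -0.2]                               # hidden biases

def sample_h_given_v(v):
    # p(h_j = 1 | v) = sigma(c_j + sum_i v_i W_ij); the h_j are
    # conditionally independent given v, so sample them all at once.
    return [1 if random.random() < sigmoid(c[j] + sum(v[i] * W[i][j]
                                                      for i in range(3)))
            else 0 for j in range(2)]

def sample_v_given_h(h):
    # p(v_i = 1 | h) = sigma(b_i + sum_j W_ij h_j); likewise for the v_i.
    return [1 if random.random() < sigmoid(b[i] + sum(W[i][j] * h[j]
                                                      for j in range(2)))
            else 0 for i in range(3)]

def block_gibbs(n_steps, v=None):
    v = v or [0, 0, 0]
    for _ in range(n_steps):   # alternate: all of h at once, then all of v
        h = sample_h_given_v(v)
        v = sample_v_given_h(h)
    return v, h
```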
March 5, 2018 57/58
7. The Deep Learning Approach to Structured Probabilistic Models
Deep Learning Book: Chapter 16
Example: Samples from a trained RBM and its weights (model trained on MNIST data)
Corresponding weight vectors
• Each column is a separate Gibbs sampling process
• Each row represents the output after another 1,000 steps of Gibbs sampling (successive samples are highly correlated)
March 5, 2018 58/58
7. The Deep Learning Approach to Structured Probabilistic Models
Deep Learning Book: Chapter 16
How to put RBMs into practice? A TensorFlow implementation of the restricted Boltzmann machine (RBM) and of an autoencoder with layerwise pretraining:
https://github.com/Cospel/rbm-ae-tf
Thank you very much for
your attention!