Tutorial on Bayesian learning and related methods

A pre-seminar for Simon Godsill’s talk

Simon Wilson

Trinity College Dublin

Probability: the basics (1)

Probability: the mathematics of describing uncertain quantities;

P(A) is the probability that the event A occurs;

Rules of probability are:

1. 0 ≤ P(A) ≤ 1;
2. If A and B are mutually exclusive then P(A or B) = P(A) + P(B);
3. P(A and B) = P(A) P(B | A), where P(B | A) means the probability of B given that A has occurred.

Two events A and B are called independent if P(B | A) = P(B), and so P(A and B) = P(A) P(B).

Probability: the basics (2)

A random variable is real-valued and its value is uncertain. It is denoted by a capital letter, e.g. $X$, and its value by a small letter $x$;

Random variables can be discrete or continuous;

If discrete, $X$ is described by its probability mass function, $p_X(x) = P(X = x)$:

$$p_X(x) \ge 0, \qquad \sum_{\forall x} p_X(x) = 1.$$

If continuous, the probability density function $p_X(x)$ is used: $p_X(x) \ge 0$ and it has the property that

$$P(a < X < b) = \int_a^b p_X(x)\,dx;$$

The cumulative distribution function $F_X(x)$ is $P(X \le x)$:

$$F_X(x) = \begin{cases} \sum_{s \le x} p_X(s), & \text{if } X \text{ is discrete,} \\ \int_{s \le x} p_X(s)\,ds, & \text{if } X \text{ is continuous.} \end{cases}$$

Probability: the basics (3)

The expected value or mean of a random variable is:

$$E(X) = \begin{cases} \sum_{\forall x} x\,p_X(x), & \text{if } X \text{ is discrete;} \\ \int_{\forall x} x\,p_X(x)\,dx, & \text{if } X \text{ is continuous.} \end{cases}$$

It’s the ‘average’ value of $X$, the ‘centre of gravity’ of the distribution;

The variance of $X$ is $E((X - E(X))^2)$:

$$\mathrm{Var}(X) = \begin{cases} \sum_{\forall x} (x - E(X))^2\,p_X(x), & \text{if } X \text{ is discrete;} \\ \int_{\forall x} (x - E(X))^2\,p_X(x)\,dx, & \text{if } X \text{ is continuous.} \end{cases}$$

It’s a measure of how variable the value of $X$ can be;

The standard deviation is $\sqrt{\mathrm{Var}(X)}$; it has the same units of measurement as $X$ and $E(X)$.

Probability: the basics (4)

Examples of discrete random variable distributions are the Bernoulli, binomial and Poisson:

$$p_X(x \mid p) = p^x (1 - p)^{1 - x}, \quad x \in \{0, 1\};$$

$$p_X(x \mid n, p) = \binom{n}{x} p^x (1 - p)^{n - x}, \quad x \in \{0, 1, \ldots, n\};$$

$$p_X(x \mid \lambda) = \frac{\lambda^x}{x!} e^{-\lambda}, \quad x = 0, 1, 2, \ldots$$

Are these familiar?

Good models for many physical phenomena;

Note that they are all defined in terms of parameters (p, n, λ); we think of these as conditional distributions of X given the parameter.

Probability: the basics (5)

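Examples of continuous random variable distributions are the exponential and normal (or Gaussian):

$$p_X(x \mid \mu) = \frac{1}{\mu} e^{-x/\mu}, \quad x \ge 0;$$

$$p_X(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right), \quad x \in \mathbb{R}.$$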

Are these familiar?

The normal distribution occurs in many places (and will in the seminar, repeatedly);

For the normal distribution, $\mu$ is its mean and $\sigma^2$ is its variance;

Some normal pdf plots

Probability: the basics (6)

If we have two random variables X and Y then we can define the joint pmf/pdf $p(x, y)$:

For discrete X, Y, $p(x, y) = P(X = x \text{ and } Y = y)$;

For continuous X, Y,

$$\int_a^b \int_c^d p(x, y)\,dx\,dy = P(c < X < d,\ a < Y < b);$$

The laws of probability show that $p_X(x) = \int_{\forall y} p(x, y)\,dy$;

For discrete X and Y, the conditional distribution of X given Y = y is

$$P(X = x \mid Y = y) = \frac{p(x, y)}{p_Y(y)}.$$

X and Y are called independent if $P(X = x \mid Y = y) = P(X = x)$ and $P(Y = y \mid X = x) = P(Y = y)$;

In this case, $p(x, y) = p_X(x)\,p_Y(y)$.

Probability: two important laws (1)

Two urns: urn I has 3 red and 3 blue balls, urn II has 2 red and 4 green balls;

I flip a fair coin (so P(H) = P(T) = 1/2). If H then pick a ball from urn I else pick one from urn II.

What is P(R) = P(red ball picked)? By laws of probability:

$$\begin{aligned} P(R) &= P((H \text{ and } R) \text{ or } (T \text{ and } R)) \\ &= P(H \text{ and } R) + P(T \text{ and } R) \\ &= P(H)\,P(R \mid H) + P(T)\,P(R \mid T) \\ &= \sum_{y \in \{H, T\}} P(y)\,P(R \mid y) \\ &= 1/2 \times 3/6 + 1/2 \times 2/6 = 5/12. \end{aligned}$$

Probability: two important laws (2)

Two urns: urn I has 3 red and 3 blue balls, urn II has 2 red and 4 green balls;

I flip a fair coin. If H then pick a ball from urn I else pick one from urn II.

Now I tell you that I picked a red ball. What is the chance that I flipped a H? This is P(H | R):

$$P(H \mid R) = \frac{P(H \text{ and } R)}{P(R)} = \frac{P(H)\,P(R \mid H)}{P(R)} = \frac{1/2 \times 3/6}{5/12} = 3/5.$$

Note that if I do not tell you the ball colour, the chance of a H is P(H) = 1/2.

Observing R has allowed you to learn about how likely H is.
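The two urn calculations can be checked with exact arithmetic. The sketch below is only an illustration: the dictionary names `p_coin` and `p_red_given` are made up for this example and are not part of the slides.

```python
from fractions import Fraction

# Probabilities of the coin flip (fair coin).
p_coin = {"H": Fraction(1, 2), "T": Fraction(1, 2)}
# P(red | coin): urn I (heads) holds 3 red out of 6, urn II (tails) holds 2 red out of 6.
p_red_given = {"H": Fraction(3, 6), "T": Fraction(2, 6)}

# P(R) = sum over coin outcomes of P(coin) * P(R | coin).
p_red = sum(p_coin[c] * p_red_given[c] for c in p_coin)

# P(H | R) = P(H) * P(R | H) / P(R).
p_heads_given_red = p_coin["H"] * p_red_given["H"] / p_red

print(p_red)              # 5/12
print(p_heads_given_red)  # 3/5
```

Using `Fraction` rather than floats keeps the results in the same form as the hand calculation.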

Probability: two important laws (3)

The first equation is an example of the partition law.

In terms of random variables, we write that for any two random variables X and Y:

$$p_X(x) = \sum_{\forall y} p_{X|Y}(x \mid Y = y)\,p_Y(y),$$

where $p_{X|Y}(x \mid Y = y)$ is the conditional pmf of X given Y.

Probability: two important laws (4)

The second equation is an example of Bayes’ law.

It can be written as:

$$p_{Y|X}(y \mid X = x) = \frac{p_Y(y)\,p_{X|Y}(x \mid Y = y)}{p_X(x)},$$

which is often written (by the partition law):

$$p_{Y|X}(y \mid X = x) = \frac{p_Y(y)\,p_{X|Y}(x \mid Y = y)}{\sum_{\forall y} p_Y(y)\,p_{X|Y}(x \mid Y = y)}.$$

We also see Bayes’ law written as:

$$p_{Y|X}(y \mid X = x) \propto p_Y(y)\,p_{X|Y}(x \mid Y = y).$$

$\sum$ is replaced by $\int$ if Y is continuous.

What is Monte Carlo simulation?

Monte Carlo simulation means generating a sequence of values from a probability distribution;

Usually done by computer;

To Monte Carlo simulate from a probability distribution $p_X(x)$ means to generate a sequence of values $x_1, x_2, \ldots, x_N$ such that:

The values are independent;

If X is discrete, the proportion of values equal to x converges to $p_X(x)$, ∀x, as N → ∞;

If X is continuous, the proportion of values in the interval (a, b) converges to $\int_a^b p_X(x)\,dx$, ∀a, b, as N → ∞;

E.g. simulation of a die: 4, 2, 5, 5, 1, 6, 3, 4, 2, 1, . . .

What about 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, . . .?

Pseudo-random numbers

Computers are logic machines ⇒ should be no good at Monte Carlo simulation!

This is true: the best we can do is generate deterministic sequences of numbers;

These sequences have many of the properties of ‘really’ random sequences;

For most purposes they are indistinguishable from using ‘the real thing’;

They can also be generated very quickly (≈ $10^7$ / second);

The basis of Monte Carlo methods is random numbers: these are uniformly distributed between 0 and 1 ($p_X(x) = 1$, $0 \le x \le 1$);

There are many algorithms for generating deterministic sequences that look like random numbers:

These are called pseudo-random numbers;
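To make the idea of a deterministic sequence that looks random concrete, here is a minimal sketch of one classic family of algorithms, the linear congruential generator. The slides do not say which algorithm is meant, so this choice, the class name and the constants (a widely published set) are assumptions for illustration only. Real software uses more sophisticated generators.

```python
class LinearCongruentialGenerator:
    """Minimal pseudo-random number generator returning values in [0, 1)."""

    def __init__(self, seed=12345):
        # Multiplier, increment and modulus: illustrative constants only.
        self.a = 1664525
        self.c = 1013904223
        self.m = 2**32
        self.state = seed % self.m

    def random(self):
        # Deterministic update: the next state depends only on the current one.
        self.state = (self.a * self.state + self.c) % self.m
        return self.state / self.m


gen = LinearCongruentialGenerator(seed=2024)
sample = [gen.random() for _ in range(100_000)]
sample_mean = sum(sample) / len(sample)  # should be close to 0.5 for Uniform(0, 1)
```

Starting two generators from the same seed reproduces exactly the same sequence, which is what makes the numbers "pseudo" random.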

Methods of generating other distributions

There are many methods of generating values from other probability distributions: discrete, normal, exponential, etc;

All of these rely on a supply of pseudo-random numbers;

Many computer packages are able to Monte Carlo simulate from many distributions: MATLAB, R, even Excel!

Example: discrete probability distributions

Let X be tomorrow’s weather, X ∈ {sunny, cloudy, rainy};

Suppose P(X = S) = 0.2, P(X = C) = 0.3, P(X = R) = 0.5;

We can Monte Carlo simulate this distribution as follows:

Generate a (pseudo-) random number u;

If u < 0.2 then X = S;

if 0.2 ≤ u < 0.5 then X = C;

if u ≥ 0.5 then X = R.

This idea is called the inverse distribution method and works for all discrete distributions.
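The steps above can be sketched in a few lines. The function name `simulate_discrete` is hypothetical, and Python's built-in `random` module stands in for the supply of pseudo-random numbers.

```python
import random
from collections import Counter

def simulate_discrete(values, probs, u):
    """Return the first value whose cumulative probability exceeds u."""
    cumulative = 0.0
    for value, prob in zip(values, probs):
        cumulative += prob
        if u < cumulative:
            return value
    return values[-1]  # guards against rounding when u is very close to 1

values = ["S", "C", "R"]
probs = [0.2, 0.3, 0.5]

rng = random.Random(1)
draws = [simulate_discrete(values, probs, rng.random()) for _ in range(100_000)]
proportions = {v: n / len(draws) for v, n in Counter(draws).items()}
print(proportions)  # each proportion should be close to the corresponding probability
```

The same function works for any discrete distribution: only `values` and `probs` change.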

Bayes law is the basis for learning

In the urn problem, observing R tells you something about the coin flip but does not tell you if it’s H or T with certainty;

The question is then: how “certain” can I be that the flip is a H? Or T?

Bayes’ law allowed us to compute how certain, as a probability, in terms of probabilities that we know.

This situation occurs everywhere in data analysis and is the basis of statistical inference (or statistical learning);

Bayesian statistical inference defines what we learn through a probability distribution on the quantity of interest;

Often this is defined through Bayes’ law.

Slightly more complicated example (1)

What is the temperature in this room?

For simplicity, let’s assume that it’s constant all over the room.

I have a thermometer and it measures 18.1 °C;

Is that the “real” temperature in the room? Why not?

I have another identical make of thermometer. It measures 18.1 °C as well.

Should I be more certain about the value of the real temperature now?

If yes then by how much?

What if the second thermometer had read 18.4 °C?

Slightly more complicated example (2)

What does Bayesian inference say about how to answer this question?

Let $T$ be the true (and unknown) temperature in the room;

Let $x_1$ and $x_2$ be the temperature measurements;

Our state of knowledge about $T$ is defined by $p_T(t \mid x_1, x_2)$;

By Bayes’ law:

$$p_T(t \mid x_1) = \frac{p_T(t)\,p(x_1 \mid T = t)}{p(x_1)} \propto p_T(t)\,p(x_1 \mid T = t);$$

$$p_T(t \mid x_1, x_2) \propto p_T(t)\,p(x_1, x_2 \mid T = t).$$

Slightly more complicated example (3)

pT (t) represents what we think T is before we measure it;

This is known as the prior distribution;

For example, we are pretty sure that 0 ≤ T ≤ 40; one possibility is a uniform distribution on this range:

$$p_T(t) = \frac{1}{40}, \quad 0 \le t \le 40.$$

Another is a normal distribution with mean as our best guess (say 20 °C) and a standard deviation of 10 (so that 0 ≤ T ≤ 40 with high probability).

Two possible priors for T

Slightly more complicated example (4)

$p(x_1, x_2 \mid T = t)$ describes what we measure given the true temperature is $t$;

One reasonable model is that what we measure is normally distributed with mean $T$ and a variance $\sigma^2$;

Here we assume that we know $\sigma^2$ (it says how accurate our thermometer is; let’s say $\sigma^2 = 0.3^2$);

$$p(x_1 \mid T = t) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(x_1 - t)^2}.$$

Also assume that the two measurements are independent given $T$, so that:

$$p(x_1, x_2 \mid T = t) = p(x_1 \mid T = t)\,p(x_2 \mid T = t) = \frac{1}{2\pi\sigma^2}\, e^{-\frac{1}{2\sigma^2}\left[(x_1 - t)^2 + (x_2 - t)^2\right]}.$$

Distribution of x1 when T = 18

Slightly more complicated example (5)

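In Bayes’ law, the variable is $t$, so we should actually think of $p(x_1 \mid T = t)$ and $p(x_1, x_2 \mid T = t)$ as a function of $t$;

This is called the likelihood; with $\sigma^2 = 0.3^2$ we have:

$$p(x_1 \mid T = t) = \frac{1}{\sqrt{2\pi \times 0.3^2}}\, e^{-\frac{1}{2 \times 0.3^2}(x_1 - t)^2}.$$

Also assume that the two measurements are independent given $T$, so that:

$$p(x_1, x_2 \mid T = t) = p(x_1 \mid T = t)\,p(x_2 \mid T = t) = \frac{1}{2\pi \times 0.3^2}\, e^{-\frac{1}{2 \times 0.3^2}\left[(x_1 - t)^2 + (x_2 - t)^2\right]}.$$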

Likelihood for x1 = 18.1

Likelihood for x1 = 18.1, x2 = 18.1

Simon Wilson (Trinity College Dublin) Tutorial on Bayesian learning and related methods A pre-seminar for Simon Godsill’s talk26 / 58

Likelihood for x1 = 18.1, x2 = 18.4


Slightly more complicated example (6)

Bayes’ law gave us:

$$p_T(t \mid x_1, x_2) \propto p_T(t) \times p(x_1, x_2 \mid T = t);$$

$p_T(t \mid x_1, x_2)$ is called the posterior distribution;

Plots of $p_T(t) \times p(x_1, x_2 \mid T = t)$ next.

$p_T(t) \times p(x_1 = 18.1 \mid T = t)$ and $p_T(t) \times p(x_1 = x_2 = 18.1 \mid T = t)$

$p_T(t) \times p(x_1 = x_2 = 18.1 \mid T = t)$ and $p_T(t) \times p(x_1 = 18.1, x_2 = 18.4 \mid T = t)$

Slightly more complicated example (7)

Now all that is missing is the constant that relates $p_T(t \mid x_1, x_2)$ to $p_T(t) \times p(x_1, x_2 \mid T = t)$;

Bayes’ law:

$$p_T(t \mid x_1, x_2) = \frac{p_T(t) \times p(x_1, x_2 \mid T = t)}{p(x_1, x_2)}$$

tells you that this is

$$1/p(x_1, x_2) = 1 \Big/ \int_0^\infty p_T(t) \times p(x_1, x_2 \mid T = t)\,dt;$$

$p(x_1, x_2)$ is just the integral of the function plotted on the last slides;

It’s there to ensure that $\int_0^\infty p_T(t \mid x_1, x_2)\,dt = 1$;

$1/p(18.1, 18.1) = 55.6$, $1/p(18.1, 18.4) = 68.7$.

$p(t \mid x_1 = x_2 = 18.1)$ and $p(t \mid x_1 = 18.1, x_2 = 18.4)$
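Because the thermometer problem has a single unknown, the posterior can be approximated by evaluating prior × likelihood on a fine grid of temperatures and normalising numerically. The sketch below is one way to do this, not the method behind the plots. It assumes the uniform prior on [0, 40] and a measurement standard deviation of 0.3 (reading the slide's variance as 0.3²). The function name `grid_posterior` and the grid size are choices made for this illustration.

```python
import numpy as np

def grid_posterior(measurements, sigma=0.3, t_min=0.0, t_max=40.0, n_grid=40_001):
    """Posterior density of T on a grid, using a uniform prior on [t_min, t_max]."""
    t = np.linspace(t_min, t_max, n_grid)
    dt = t[1] - t[0]
    prior = np.full_like(t, 1.0 / (t_max - t_min))
    # Likelihood: product of independent normal densities, viewed as a function of t.
    likelihood = np.ones_like(t)
    for x in measurements:
        likelihood *= np.exp(-(x - t) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    unnormalised = prior * likelihood
    evidence = unnormalised.sum() * dt  # numerical approximation to p(x1, x2)
    return t, unnormalised / evidence

t, post_same = grid_posterior([18.1, 18.1])
_, post_diff = grid_posterior([18.1, 18.4])
dt = t[1] - t[0]

# Posterior means and one posterior standard deviation, again by numerical integration.
mean_same = (t * post_same).sum() * dt
mean_diff = (t * post_diff).sum() * dt
sd_same = np.sqrt(((t - mean_same) ** 2 * post_same).sum() * dt)
```

With these assumptions the posterior is centred on the average of the two readings, and its spread is narrower than that of a single measurement.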

Stochastic processes

Many of the processes that we want to model occur over space and time:

Audio signals;

Rainfall;

Financial time series;

Image and video data;

...

These are modelled probabilistically by stochastic processes.

Stochastic processes in discrete time

A very common subset is processes that evolve at discrete points in time $t = 1, 2, 3, \ldots, T$;

Then $X_1, X_2, \ldots, X_T$ are the values of the process at these times;

In general, we then have to define a probability distribution on $(X_1, \ldots, X_T)$;

This is in general difficult because we have to define a T-dimensional function $p(x_1, \ldots, x_T)$;

We can exploit properties of the process to make the model simpler to define:

Many processes obey what is called the Markov property;

This means that the distribution of $X_t$ only depends on the value of $X_{t-1}$;

If $X_1, X_2, \ldots$ obeys this property then it’s called a Markov chain.

Markov chains

For a Markov chain:

$$p(x_1, \ldots, x_T) = p(x_1)\,p(x_2 \mid x_1)\,p(x_3 \mid x_2) \cdots p(x_T \mid x_{T-1}).$$

So $p(x_1, \ldots, x_T)$ is defined in terms of simple one-dimensional distributions;

If $p(x_t \mid x_{t-1})$ is independent of $t$ then we just need to define $p(x_1)$ and $p(x_t \mid x_{t-1})$;

If $x_t$ is discrete-valued (say $x_t \in \{1, 2, \ldots, S\}$) then $p(x_t \mid x_{t-1})$ is defined in a matrix:

$$P = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1S} \\ p_{21} & p_{22} & \cdots & p_{2S} \\ \vdots & \vdots & \ddots & \vdots \\ p_{S1} & p_{S2} & \cdots & p_{SS} \end{pmatrix},$$

where $p_{ij} = P(X_t = j \mid X_{t-1} = i)$. Hence each row in $P$ sums to 1.

The weather each day is: sunny (S), cloudy (C) or rainy (R);

Xt is the weather on day t;

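Suppose the weather on day $t$ depends on the weather on day $t - 1$, but is independent of earlier days;

It is then a Markov chain and suppose (with rows and columns in the order S, C, R)

$$P = \begin{pmatrix} 0.3 & 0.5 & 0.2 \\ 0.25 & 0.5 & 0.25 \\ 0.4 & 0.3 & 0.3 \end{pmatrix},$$

e.g. $P(x_t = S \mid x_{t-1} = R) = p_{31} = 0.4$;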

Monte Carlo simulation of our Markov chain

This is easy to do;

Define x1 (let’s make it x1 = R);

Simulate x2 given x1 (using the R row of P); suppose we generate x2 = C;

Simulate x3 given x2 (using the C row of P);

and so on.

40 days of weather
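The simulation recipe can be sketched as follows. The function name `simulate_chain` and the seed are arbitrary choices for this illustration. Each step draws the next state from the row of P belonging to the current state.

```python
import random

STATES = ["S", "C", "R"]
# Transition matrix from the weather example; row = today's weather, column = tomorrow's.
P = {
    "S": [0.3, 0.5, 0.2],
    "C": [0.25, 0.5, 0.25],
    "R": [0.4, 0.3, 0.3],
}

def simulate_chain(first_state, n_days, rng):
    """Simulate n_days of weather, starting from first_state on day 1."""
    chain = [first_state]
    for _ in range(n_days - 1):
        row = P[chain[-1]]  # row of P for the current state
        chain.append(rng.choices(STATES, weights=row)[0])
    return chain

rng = random.Random(0)
weather = simulate_chain("R", 40, rng)
print("".join(weather))
```

Changing the seed gives a different, equally valid, 40-day sequence.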

A random walk

Markov chains can have Xt continuous;

Example: $X_t$ is normally distributed with mean $X_{t-1}$ and a variance $\sigma^2$;

This is known as a random walk;

On next page is a simulation with $X_1 = 0$ and two values of $\sigma^2$.

A normal random walk

Autoregressive processes

Random walks have the property that they ‘wander away’ from 0 (they are non-stationary);

Many physical processes tend to stay around a mean value (they are stationary);

An autoregressive process is a simple case: $X_t$ is normally distributed with mean $\theta X_{t-1}$ and a variance $\sigma^2$, where $-1 < \theta < 1$;

Higher order autoregressive processes are also very common, e.g. $X_t$ has mean $\theta_1 X_{t-1} + \theta_2 X_{t-2}$, etc.;

These are a simple model for an audio signal.


First order autoregressive processes (with σ = 1)

Third order autoregressive process: $E(X_t) = 0.7 X_{t-1} + 0.1 X_{t-2} + 0.15 X_{t-3}$
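Random walks and autoregressive processes can be simulated with the same loop, since a random walk is the special case θ = 1 of a first order process. This is a minimal sketch. The function name `simulate_ar`, the zero start-up values and the parameter values are assumptions for illustration, and the first simulated value is drawn around 0 rather than fixed at 0.

```python
import numpy as np

def simulate_ar(coeffs, sigma, n_steps, rng):
    """Simulate X_t ~ N(coeffs[0]*X_{t-1} + coeffs[1]*X_{t-2} + ..., sigma^2)."""
    order = len(coeffs)
    x = np.zeros(n_steps + order)  # the first `order` entries are zero start-up values
    for t in range(order, n_steps + order):
        # x[t-order:t][::-1] lists the previous values, most recent first.
        mean = np.dot(coeffs, x[t - order:t][::-1])
        x[t] = rng.normal(mean, sigma)
    return x[order:]

rng = np.random.default_rng(0)
random_walk = simulate_ar([1.0], 1.0, 5000, rng)       # theta = 1: random walk
ar1 = simulate_ar([0.7], 1.0, 5000, rng)               # theta = 0.7: stationary AR(1)
ar3 = simulate_ar([0.7, 0.1, 0.15], 1.0, 5000, rng)    # third order example
```

For a stationary first order process the long-run variance is σ²/(1 − θ²), so the sample variance of `ar1` should be close to 1/(1 − 0.49) ≈ 1.96, whereas the random walk has no such limit.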

Properties of Markov chains

Vast literature in probability theory on properties of Markov chains;

Here, we concentrate on one property;

Suppose I start our weather chain on day 1;

What is the weather on day $t + 1$? This is described by $P(x_{t+1} \mid x_1)$;

It turns out that the matrix of these probabilities is

$$P^t = P \times P \times \cdots \times P \quad \text{(matrix multiplication, } t \text{ factors)}.$$

Properties of Markov chains

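Here is $P^t$ for $t = 2, 20, 200$:

$$P^2 = \begin{pmatrix} 0.30 & 0.46 & 0.24 \\ 0.30 & 0.45 & 0.25 \\ 0.32 & 0.44 & 0.24 \end{pmatrix}, \quad P^{20} = \begin{pmatrix} 0.30 & 0.45 & 0.25 \\ 0.30 & 0.45 & 0.25 \\ 0.30 & 0.45 & 0.25 \end{pmatrix}, \quad P^{200} = \begin{pmatrix} 0.30 & 0.45 & 0.25 \\ 0.30 & 0.45 & 0.25 \\ 0.30 & 0.45 & 0.25 \end{pmatrix}$$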
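These matrix powers are easy to reproduce. The sketch below uses NumPy's `numpy.linalg.matrix_power` on the weather transition matrix. The variable names are arbitrary.

```python
import numpy as np

# Weather transition matrix, rows and columns in the order S, C, R.
P = np.array([
    [0.30, 0.50, 0.20],
    [0.25, 0.50, 0.25],
    [0.40, 0.30, 0.30],
])

for t in (2, 20, 200):
    print(f"t = {t}")
    print(np.round(np.linalg.matrix_power(P, t), 3))

P_200 = np.linalg.matrix_power(P, 200)
```

Printing to three decimals shows the rows becoming identical as t grows, which is the convergence visible in the matrices above.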


Stationary distributions

So as you look further into the future:

The probability that you are in each state converges to a value;

This probability is the same regardless of your current state (so the chain ‘forgets’ where it was at the start);

This is called the stationary distribution of the chain (in this case (0.30, 0.45, 0.25));

Not all Markov chains have such a distribution;

Lots of theory on conditions for which it does happen; this includes most chains that one uses in practice;

Simulating the weather Markov chain: proportion of sunny days

Markov chains for Monte Carlo simulation

Note that the proportion of sunny days in the simulation converges to the stationary probability of 0.302;

So there is a (very long-winded!) way to simulate from the distribution (0.30, 0.45, 0.25):

Start this Markov chain in any of the 3 states;

Monte Carlo simulate the chain for a ‘long’ time;

The state of the chain at the end of the long simulation will have (approximately) the distribution (0.30, 0.45, 0.25);

Hold this thought for later!

Audio reconstruction

Think of an audio signal as a discrete process in time $X_1, X_2, \ldots$;

We have some audio data that has been corrupted:

CD or vinyl record scratches, tape degradation;

Telephone call over a noisy line;

So what we observe are not the $X_t$ but a corrupted version $Y_1, \ldots, Y_T$;

We want to recover the original ‘true’ audio signal $X_1, \ldots, X_T$ from $Y_1, \ldots, Y_T$;

Hidden Markov models

A simple model for this process is as follows:

Our real audio signal is a stochastic process like the AR model;

What we actually observe, $Y_t$, is normally distributed with mean $X_t$;

The normal distribution models the noise or degradation of the signal;

The $Y_t$’s are independent of each other given the $X_t$’s;

$$Y_t \mid X_t, \sigma_Y^2 \sim N(X_t, \sigma_Y^2)$$

$$X_t \mid X_{t-1}, \sigma_X^2 \sim N(\theta X_{t-1}, \sigma_X^2).$$

This is an example of a Hidden Markov Model (HMM):

There is a hidden Markov chain $X_1, X_2, \ldots$;

We observe $Y_t$ that are $X_t$ plus some noise, and are independent given the $X_t$;

Xt – an AR process

Xt and Yt

Yt alone
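Data of the kind shown in these plots can be generated directly from the two model equations. This is a sketch under assumed parameter values (θ = 0.9, σ_X = 1, σ_Y = 0.5), which are not taken from the slides. The function name `simulate_hmm` and the starting distribution of the first value are also assumptions.

```python
import numpy as np

def simulate_hmm(theta, sigma_x, sigma_y, n_steps, rng):
    """Simulate a hidden AR(1) signal X and noisy observations Y of it."""
    x = np.zeros(n_steps)
    x[0] = rng.normal(0.0, sigma_x)                   # assumed starting distribution
    for t in range(1, n_steps):
        x[t] = rng.normal(theta * x[t - 1], sigma_x)  # X_t | X_{t-1} ~ N(theta X_{t-1}, sigma_x^2)
    y = x + rng.normal(0.0, sigma_y, size=n_steps)    # Y_t | X_t ~ N(X_t, sigma_y^2)
    return x, y

rng = np.random.default_rng(1)
x, y = simulate_hmm(theta=0.9, sigma_x=1.0, sigma_y=0.5, n_steps=20_000, rng=rng)
noise_sd = np.std(y - x)  # should be close to sigma_y
```

In the inference problem only `y` would be available. `x` is kept here so that the reconstruction target is known.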

Bayesian inference (1)

Our data are the Yt ’s;

The likelihood is:

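$$p(y_1, \ldots, y_T \mid x_1, \ldots, x_T, \sigma_Y^2) = \prod_{t=1}^{T} \frac{1}{\sqrt{2\pi\sigma_Y^2}} \exp\left(-\frac{1}{2\sigma_Y^2}(y_t - x_t)^2\right).$$

We want to learn about the $X_t$ and the 3 parameters $(\sigma_Y^2, \theta, \sigma_X^2)$;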

Bayesian inference (2)

The prior distribution for the Xt is (say) an AR process:

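$$p(x_1, \ldots, x_T \mid \theta, \sigma_X^2) = p(x_1) \prod_{t=2}^{T} \frac{1}{\sqrt{2\pi\sigma_X^2}}\, e^{-(x_t - \theta x_{t-1})^2 / 2\sigma_X^2}.$$

We have a prior for the 3 parameters, $p(\sigma_Y^2, \theta, \sigma_X^2)$;

Then Bayes’ law gives us:

$$p(x_1, \ldots, x_T, \sigma_X^2, \theta, \sigma_Y^2 \mid y_1, \ldots, y_T) = \frac{p(y_1, \ldots, y_T \mid x_1, \ldots, x_T, \sigma_Y^2)\; p(x_1, \ldots, x_T \mid \theta, \sigma_X^2)\; p(\sigma_Y^2, \theta, \sigma_X^2)}{p(y_1, \ldots, y_T)}$$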

Bayesian inference (3)

What is the denominator?

$$p(y_1, \ldots, y_T) = \int p(y_1, \ldots, y_T \mid x_1, \ldots, x_T, \sigma_Y^2)\; p(x_1, \ldots, x_T \mid \theta, \sigma_X^2) \times p(\sigma_Y^2, \theta, \sigma_X^2)\; dx_1 \cdots dx_T\, d\sigma_Y^2\, d\theta\, d\sigma_X^2,$$

a (T + 3)-dimensional integral;

This is in general impossible to compute numerically: too big a problem!

What can we do?

For this problem, there are some algorithms that can compute the means of the $X_t$, and approximate the posterior;

These break down when we consider more realistic models for $X_t$ and $Y_t \mid X_t$;

Can we Monte Carlo simulate from $p(x_1, \ldots, x_T, \sigma_X^2, \theta, \sigma_Y^2 \mid y_1, \ldots, y_T)$?

Markov chain Monte Carlo simulation

MC simulation from high-dimensional distributions is also very difficult;

However, it is possible to simulate from a Markov chain with stationary distribution that is the posterior;

So we simulate values of $(x_1, \ldots, x_T, \sigma_X^2, \theta, \sigma_Y^2)$ according to a certain Markov chain;

After we simulate values for a ‘long time’, we are sampling from the posterior.

This is known as MCMC (Markov chain Monte Carlo);

MCMC methods are a module in themselves!
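As a small taste of the idea, the sketch below applies one common MCMC algorithm, random-walk Metropolis, to the one-dimensional thermometer posterior rather than to the audio model. The slides do not name a specific algorithm, so this choice is an assumption, as are the uniform prior on [0, 40], the measurement standard deviation of 0.3, the proposal step size, the starting value and the function names.

```python
import math
import random

def log_unnormalised_posterior(t, measurements, sigma=0.3):
    """log of p_T(t) * p(x1, x2 | T = t), up to a constant, for a uniform prior on [0, 40]."""
    if not 0.0 <= t <= 40.0:
        return -math.inf
    return -sum((x - t) ** 2 for x in measurements) / (2 * sigma**2)

def metropolis(measurements, n_samples, step_sd=0.3, start=20.0, seed=0):
    """Random-walk Metropolis: a Markov chain whose stationary distribution is the posterior."""
    rng = random.Random(seed)
    t = start
    log_p = log_unnormalised_posterior(t, measurements)
    samples = []
    for _ in range(n_samples):
        proposal = t + rng.gauss(0.0, step_sd)  # symmetric random-walk proposal
        log_p_new = log_unnormalised_posterior(proposal, measurements)
        # Accept with probability min(1, ratio); the constant 1/p(x1, x2) cancels in the ratio.
        if rng.random() < math.exp(min(0.0, log_p_new - log_p)):
            t, log_p = proposal, log_p_new
        samples.append(t)
    return samples

samples = metropolis([18.1, 18.4], n_samples=50_000)
kept = samples[5_000:]  # discard the early part of the chain
posterior_mean = sum(kept) / len(kept)
posterior_sd = math.sqrt(sum((s - posterior_mean) ** 2 for s in kept) / len(kept))
```

Only ratios of the unnormalised posterior are needed, so the normalising integral never has to be computed. That is the feature that makes this approach usable in high dimensions.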

I hope that this has given you an idea of some of the methodsused for Bayesian inference;

Remember: seminar is on Friday at 12pm in this room.

