conditional & joint probability

Conditional & Joint ProbabilityA brief digression back to joint probability:

i.e. both events O and H occur

Again, we can express joint probability in terms of their separate conditional and unconditional probabilities

This key result turns out to be exceedingly useful!

Conditional ProbabilityConverting expressions of joint probability

We can therefore express everything only in terms of reciprocal conditional and unconditional probabilities:

The intersection operator makes no assertion regarding order:

This is usually expressed in a slightly rearranged form…

Conditional ProbabilityBayes’ theorem expresses the essence of inference

We can think of this as allowing us to compute the probability of some hidden event H given that some observable event O has occurred, provided we know the probability of

the observed event O assuming that hidden event H has occurred

Bayes’ theorem is a recipe for problems involving conditional probability

Conditional ProbabilityNormalizing the probabilities

For convenience, we often replace the probability of the observed event O with the sum over all possible values of H of the joint probabilities of O and H. Whew! But consider that if we now calculated Pr{H|O} for every H, the sum of these would be

one, which is just as a probability should behave…

This summing of the expression in numerator “normalizes” the probabilities

Conditional Probability

Bayes’ theorem is so important that each part of this recipe has a special name

Bayes’ theorem as a recipe for inference

The posterior

Think of this perhaps as the evidence for some specific model H given the set of observations O.We are making an inference about

H on the basis of O

The priorOur best guess about H before any observation is

made. Often we will make neutral assumptions, resulting in an uninformative prior. But priors can also come from the posterior results from some earlier experiment. How best to choose priors is

an essential element of Bayesian analysis

The likelihood modelWe’ve seen already that the probability of an

observation given a hidden parameter is really a likelihood. Choosing a likelihood model is akin to

proposing some process, H, by which the observation, O, might have come about

The observable probabilityGenerally, care must be taken to ensure that our observables

have no uncertainty, otherwise they are really hidden!!!

Many paths give rise to the same sequence XThe Backward Algorithm

∑𝜋𝑃 (𝑥 ,𝜋)P(x) =

We would often like to know the total probability of some sequence:

But wait! Didn’t we already solve this problem using the forward algorithm!?

Sometimes, the trip isn’t about the destination. Stick with me!

Well, yes, but we’re going to solve it again by iterating backwards through the sequence instead of forwards

Defining the backward variableThe Backward Algorithm

Since we are effectively stepping “backwards” through the event sequence, this is formulated as a statement of conditional probability

rather than in terms of joint probability as are forward variables

bk(i) = P(xi+1 … xL | pi = k) “The backward variable for state k at position i”

“the probability of the sequence from the end to the symbol at position i, given that the path at position i is k”

What if we had in our possession all of the backward variables for 1, the first position of the sequence?

The Backward Algorithm

We’ll obtain these “initial position” backward variables in a manner directly analogous to the method in the forward algorithm…

To get the overall probability of the sequence, we would need only sum the backward variables for each state (after

correcting for the probability of the initial transition from Start)

∑𝒍𝒆𝒍 (𝒙𝟏)𝒃𝒍 (𝟏)𝒂𝑺→ 𝒍P(X) =

A recursive definition for backward variablesThe Backward Algorithm

As always with a dynamic programming algorithm, we recursively define the variables.. But this time in terms of

their own values at later positions in the sequence…

The termination condition is satisfied and the basis case provided by virtue of the fact that sequences are finite and in this case we must eventually (albeit implicitly) come to the End state

∑𝒍e l ( xi ) ·𝒃𝒍 (𝒊) ·𝒂𝒌→ 𝒍bk(i-1) =

bk(i) = P(xi+1 … xL | pi = k)

If you understand forward, you already understand backward!


Initialization: bk(L) = 1 for all states k

We usually don’t need the termination step, except to check that the result is the same as from forward… So why bother with the backward algorithm in the first place?

Recursion (i = L-1,…,1):

Termination: ∑

𝒍𝒆𝒍 (𝒙𝟏)𝒃𝒍 (𝟏)𝒂𝒔𝒕𝒂𝒓𝒕→ 𝒍P(X) =

∑𝒍❑bk(i) = el(xi + 1) · bl(i +1) ·


The backward algorithm takes its name from its backwards iteration through the sequence

S0.1 0.9

State “+” State “-”

A: 0.30

C: 0.25

G: 0.15

T: 0.300.5 0.4

0.5

0.6

A: 0.20

C: 0.35

G: 0.25

T: 0.20

The probability of a sequence summed across all possible state paths

End

A C G

_

1

1

0.2

0.6 * 0.15 * 1

0.4 * 0.25 * 1

S = 0.19

0.6 * 0.25 * 0.2

0.4 * 0.35 * 0.19

S = 0.0566

0.5 * 0.25 * 0.2

0.5 * 0.35 * 0.19

S = 0.05825

0.1 * 0.3 * 0.058250.9 * 0.2 * 0.0566

S = 0.0119355

P(x) =

0.0119355

0.5 * 0.25 * 1

0.5 * 0.15 * 1

S = 0.2

The most probable state pathThe Decoding Problem

The Viterbi algorithm will calculate the most probable state path, thereby allowing us to “decode” the state path in cases

where the true state path is hidden

The Viterbi algorithm generally does a good job of recovering the state path…

p*= argmax P(x, p)p

_64622311445316146343245452316262524425613233524355442454134246666215124646526536662616666426446665162436615266416215651SFFFFFFFFFFFFFLLLLFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLFFFFFFFFFSFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL

Sequence

True state path

Viterbi patha.k.a MPSP

p*= argmax P(x, p | θ)p

Limitations of the most probable state pathThe Decoding Problem

…but the most probable state path might not always be the best choice for further inference on the sequence

This is the posterior probability of state k at position i when sequence x is known

P(pi = k|x)

• There may be other paths, sometimes several, that result in probabilities nearly as good as the MPSP

• The MPSP tells us about the probability of the entire path, but it doesn’t actually tell us what the most probable state might be for some particular observation xi

• More specifically, we might want to know:

“the probability that observation xi resulted from being in state k, given the observed sequence x”

The approach is a little bit indirect….Calculating the Posterior Probabilities

We can sometimes ask a slightly different or related question, and see if that gets us closer to the result we seek

Does anything here look like something we may have seen before??

P(x,pi = k)

Maybe we can first say something about the probability of producing the entire observed sequence with observation xi

resulting from having been in state k…..

P(x1…xi, pi = k)·P(xi+1…xL|x1…xi, pi = k)

=

=

P(x1…xi, pi = k)·P(xi+1…xL|pi = k)

Calculating the Posterior Probabilities

These terms are exactly our now familiar forward and backward variables!

P(x,pi = k)=

P(x1…xi, pi = k)·P(xi+1…xL|pi = k)

Limitations of the most probable state path

=

fk(i)·bk(i)

Calculating the Posterior Probabilities

OK, but what can we really do with these posterior probabilities?

P(x,pi = k)

Putting it all together using Baye’s theorem

fk(i)·bk(i) P(pi = k|x) =

P(x)We now have all of the necessary ingredients required to apply

Baye’s theorem… We can now therefore find the posterior probability of being in state k for each position i in the sequence!

We need only run both the forward and the backward algorithms to generate the values we need.

Python: we probably want to store our forward and backward values as instance variables rather than method variables

Remember, we can get P(x) directly from either the forward or backward algorithm!

P(x|pi = k)·P(pi = k)

Making use of the posterior probabilities

Posterior Decoding

Two primary applications of the posterior state path probabilities

In some scenarios, the overall path might not even be a permitted path through the model!

• We can define an alternative state path to the most probable state path:

pi = argmax P(pi = k | x)k

˰

This alternative to Viterbi decoding is most useful when we are most interested in the what the state might be at some particular point or

points. It’s possible that the overall path suggested by this might not be particularly likely

Plotting the posterior probabilitiesPosterior Decoding

Key: true path / Viterbi path / posterior path

_44263143366636346665525516166566666224264513226235262443416SFFFFFFFFLLLLLLLLLLLFFLLLLLFFLLLLLLLLFFFFFFFFFFFFFFFFFFFLLLL

SFFFFFFFFFLLLLLLLLLLLLLLLLLLLLLLLLLLFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFLLLLLLLLLLLLFFFFLLLLLLLLLLLLFFFFFFFFFFFFFFFFFFFFFFF

Since we know individual probabilities at each position we can easily plot the posterior probabilities

myHMM.generate(60,143456789)

Plotting the posterior probabilities with matplotlib

Posterior Decoding

Assuming that we have list variables self._x containing the range of the sequence and self._y containing the posterior probabilities…

Note: you may need to convert the log_float probabilities back to normal floats I found it more convenient to just define the __float__() method in

log_float

from pylab import * # this line at the beginning of the file . . . class HMM(object): . . . # all your other stuff

def show_posterior(self): if self._x and self._y: plot (self._x, self._y) show() return

conditional & joint probability

Documents

terms of joint probability

hidden event h

total probability

joint probabilities

event sequence

observable event o

h occuragain

basis of o