
CS 6243 Machine Learning Markov Chain and Hidden Markov Models

Upload: dara

Post on 26-Jan-2016








Page 1: CS 6243 Machine Learning

CS 6243 Machine Learning

Markov Chain and Hidden Markov Models

Page 2: CS 6243 Machine Learning


Outline

• Background on probability

• Hidden Markov models
  – Algorithms
  – Applications

Page 3: CS 6243 Machine Learning

Probability Basics

• Definition (informal)
  – Probabilities are numbers assigned to events that indicate “how likely” it is that the event will occur when a random experiment is performed
  – A probability law for a random experiment is a rule that assigns probabilities to the events in the experiment
  – The sample space S of a random experiment is the set of all possible outcomes

Page 4: CS 6243 Machine Learning

Probabilistic Calculus

• All probabilities are between 0 and 1: 0 ≤ P(A) ≤ 1

• If A, B are mutually exclusive:
  – P(A ∪ B) = P(A) + P(B)

• Thus: P(not(A)) = P(Aᶜ) = 1 – P(A)


Page 5: CS 6243 Machine Learning

Conditional probability

• The joint probability of two events A and B, P(A ∩ B), or simply P(A, B), is the probability that events A and B occur at the same time.

• The conditional probability P(A | B) is the probability that A occurs given that B occurred.

P(A | B) = P(A ∩ B) / P(B)

<=> P(A ∩ B) = P(A | B) P(B)

<=> P(A ∩ B) = P(B | A) P(A)

Page 6: CS 6243 Machine Learning


• Roll a die
  – If I tell you the number is less than 4
  – What is the probability of an even number?

• P(d = even | d < 4) = P(d = even ∩ d < 4) / P(d < 4)
  = P(d = 2) / P(d = 1, 2, or 3) = (1/6) / (3/6) = 1/3
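The die example can be checked by enumerating the sample space. The following is a minimal sketch, assuming a fair six-sided die; the helper names `prob` and `cond_prob` are illustrative and not part of the slides.

```python
from fractions import Fraction

# Sample space of a fair six-sided die; every outcome has probability 1/6.
outcomes = range(1, 7)

def prob(event):
    """Probability of an event, given as a predicate over outcomes."""
    return Fraction(sum(1 for d in outcomes if event(d)), len(outcomes))

def cond_prob(event_a, event_b):
    """P(A | B) = P(A and B) / P(B)."""
    return prob(lambda d: event_a(d) and event_b(d)) / prob(event_b)

def is_even(d):
    return d % 2 == 0

def less_than_4(d):
    return d < 4

p = cond_prob(is_even, less_than_4)
print(p)  # 1/3
```

Exact fractions are used so that the result matches the hand calculation without rounding.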

Page 7: CS 6243 Machine Learning


• A and B are independent iff:

P(A | B) = P(A)

P(B | A) = P(B)

(These two constraints are logically equivalent)

• Therefore, if A and B are independent:

P(A ∩ B) = P(A) P(B)

Page 8: CS 6243 Machine Learning


• Are the events d = even and d < 4 independent?
  – P(d = even and d < 4) = 1/6
  – P(d = even) = ½
  – P(d < 4) = ½
  – ½ * ½ = ¼ > 1/6, so they are not independent

• If your die actually has 8 faces, will the events d = even and d < 5 be independent?

• Are “even in first roll” and “even in second roll” independent?

• For a playing card, are the suit and rank independent?

Page 9: CS 6243 Machine Learning

Theorem of total probability

• Let B1, B2, …, BN be mutually exclusive events whose union equals the sample space S. We refer to these sets as a partition of S.

• An event A can be represented as:

A = A ∩ S = A ∩ (B1 ∪ B2 ∪ … ∪ BN) = (A ∩ B1) ∪ (A ∩ B2) ∪ … ∪ (A ∩ BN)

• Since B1, B2, …, BN are mutually exclusive,

P(A) = P(A ∩ B1) + P(A ∩ B2) + … + P(A ∩ BN)

• And therefore

P(A) = P(A|B1)*P(B1) + P(A|B2)*P(B2) + … + P(A|BN)*P(BN)

= Σi P(A | Bi) * P(Bi)   (exhaustive conditionalization)


Page 10: CS 6243 Machine Learning


• A loaded die:
  – P(6) = 0.5
  – P(1) = … = P(5) = 0.1

• Prob of even number?
  P(even) = P(even | d < 6) * P(d < 6) + P(even | d = 6) * P(d = 6)
          = 2/5 * 0.5 + 1 * 0.5
          = 0.7

Page 11: CS 6243 Machine Learning

Another example

• A box of dice:
  – 99% fair
  – 1% loaded
    • P(6) = 0.5
    • P(1) = … = P(5) = 0.1
  – Randomly pick a die and roll, P(6)?

• P(6) = P(6 | F) * P(F) + P(6 | L) * P(L)
  – 1/6 * 0.99 + 0.5 * 0.01 = 0.17

Page 12: CS 6243 Machine Learning

Bayes theorem

• P(A ∩ B) = P(B) * P(A | B) = P(A) * P(B | A)

==> P(A | B) = P(B | A) * P(A) / P(B)

  – P(A | B): posterior probability
  – P(B | A): conditional probability (likelihood)
  – P(A): prior of A
  – P(B): prior of B (normalizing constant)

This is known as Bayes Theorem or Bayes Rule, and is (one of) the most useful relations in probability and statistics

Bayes Theorem is definitely the fundamental relation in Statistical Pattern Recognition

Page 13: CS 6243 Machine Learning

Bayes theorem (cont’d)

• Given B1, B2, …, BN, a partition of the sample space S. Suppose that event A occurs; what is the probability of event Bj?

• P(Bj | A) = P(A | Bj) * P(Bj) / P(A)

= P(A | Bj) * P(Bj) / Σk P(A | Bk) * P(Bk)

  – P(Bj | A): posterior probability
  – P(A | Bj): likelihood
  – P(Bj): prior of Bj
  – Σk P(A | Bk) * P(Bk): normalizing constant (theorem of total probability)

Bj: different models / hypotheses

Having observed A, should you choose the model that maximizes P(Bj | A) or P(A | Bj)? It depends on how much you know about Bj!

Page 14: CS 6243 Machine Learning


• A test for a rare disease claims that it will report positive for 99.5% of people with the disease, and negative 99.9% of the time for those without.

• The disease is present in the population at 1 in 100,000

• What is P(disease | positive test)?
  – P(D|+) = P(+|D)P(D)/P(+) = 0.01

• What is P(disease | negative test)?
  – P(D|-) = P(-|D)P(D)/P(-) = 5e-8
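The two posteriors can be reproduced with Bayes theorem, expanding the normalizing constant by the theorem of total probability over {disease, no disease}. This is a small sketch; the function name `posterior` and the variable names are illustrative.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """P(H | E) via Bayes theorem with the two-event partition {H, not H}."""
    # Normalizing constant P(E), by the theorem of total probability.
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    return likelihood * prior / evidence

p_disease = 1 / 100_000
sensitivity = 0.995   # P(+ | D)
specificity = 0.999   # P(- | not D)

p_d_given_pos = posterior(p_disease, sensitivity, 1 - specificity)
p_d_given_neg = posterior(p_disease, 1 - sensitivity, specificity)

print(f"P(D | +) = {p_d_given_pos:.4f}")   # roughly 0.01
print(f"P(D | -) = {p_d_given_neg:.1e}")   # roughly 5e-8
```

Even with a very accurate test, the posterior after a positive result stays near 1% because the prior is so small.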

Page 15: CS 6243 Machine Learning

Another example

• We’ve talked about the casino’s box of dice: 99% fair, 1% loaded (50% chance of a six)

• We said that if we randomly pick a die and roll it, we have a 17% chance of getting a six

• If we get 3 sixes in a row, what’s the chance that the die is loaded?

• How about 5 sixes in a row?

Page 16: CS 6243 Machine Learning

• P(loaded | 666) = P(666 | loaded) * P(loaded) / P(666)
  = 0.5^3 * 0.01 / (0.5^3 * 0.01 + (1/6)^3 * 0.99) = 0.21

• P(loaded | 66666) = P(66666 | loaded) * P(loaded) / P(66666)
  = 0.5^5 * 0.01 / (0.5^5 * 0.01 + (1/6)^5 * 0.99) = 0.71
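The same calculation for any number of consecutive sixes, as a short sketch. The function name `p_loaded_given_sixes` and its default arguments are illustrative; the numbers come from the box-of-dice example.

```python
def p_loaded_given_sixes(n, p_loaded=0.01, p6_loaded=0.5, p6_fair=1 / 6):
    """Posterior probability that the die is loaded after n sixes in a row."""
    num = p6_loaded ** n * p_loaded                   # P(n sixes | loaded) P(loaded)
    den = num + p6_fair ** n * (1 - p_loaded)         # P(n sixes), total probability
    return num / den

for n in (3, 5):
    print(n, round(p_loaded_given_sixes(n), 2))
```

Each extra six multiplies the odds in favor of the loaded die by 0.5 / (1/6) = 3, which is why the posterior grows quickly with n.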

Page 17: CS 6243 Machine Learning

Simple probabilistic models for DNA sequences

• Assume nature generates a type of DNA sequence as follows:
  1. A box of dice, each with four faces: {A,C,G,T}
  2. Select a die suitable for the type of DNA
  3. Roll it, append the symbol to a string.
  4. Repeat 3, until all symbols have been generated.

• Given a string, say X = “GATTCCAA…”, and two dice
  – M1 has the distribution pA=pC=pG=pT=0.25.
  – M2 has the distribution pA=pT=0.20, pC=pG=0.30

• What is the probability of the sequence being generated by M1 or M2?

Page 18: CS 6243 Machine Learning

Model selection by maximum likelihood criterion


• P(X | M1) = P(x1,x2,…,xn | M1)
  = Πi=1..n P(xi|M1)
  = 0.25^8 = 1.53e-5

• P(X | M2) = P(x1,x2,…,xn | M2)
  = Πi=1..n P(xi|M2)
  = 0.2^5 * 0.3^3 = 8.64e-6

P(X|M1) / P(X|M2) = Πi P(xi|M1)/P(xi|M2) = (0.25/0.2)^5 (0.25/0.3)^3

Log likelihood ratio (LLR):

LLR = Σi log(P(xi|M1)/P(xi|M2))
  = nA SA + nC SC + nG SG + nT ST
  = 5 * log(1.25) + 3 * log(0.833) = 0.57

where Si = log(P(i | M1) / P(i | M2)), i = A, C, G, T
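A sketch of the likelihood and LLR computation for the 8-symbol string GATTCCAA under the two dice. It assumes natural logarithms, which is what reproduces the 0.57 above; the function names are illustrative.

```python
import math
from collections import Counter

M1 = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
M2 = {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20}

def likelihood(seq, model):
    """P(X | M) for independent positions: product of per-symbol probabilities."""
    return math.prod(model[s] for s in seq)

def log_likelihood_ratio(seq, m1, m2):
    """LLR = sum over symbols of n_s * log(P(s | M1) / P(s | M2))."""
    counts = Counter(seq)
    return sum(n * math.log(m1[s] / m2[s]) for s, n in counts.items())

X = "GATTCCAA"
print(likelihood(X, M1))                         # about 1.53e-5
print(likelihood(X, M2))                         # about 8.64e-6
print(round(log_likelihood_ratio(X, M1, M2), 2))
```

A positive LLR favors M1; working with counts per symbol mirrors the nA SA + nC SC + nG SG + nT ST form.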

Page 19: CS 6243 Machine Learning

Model selection by maximum a posteriori probability criterion

• Take the prior probabilities of M1 and M2 into consideration if known

log (P(M1|X) / P(M2|X))

= LLR + log(P(M1)) – log(P(M2))

= nA SA + nC SC + nG SG + nT ST + log(P(M1)) – log(P(M2))

• If P(M1) ~ P(M2), results will be similar to the LLR test

Page 20: CS 6243 Machine Learning

Markov models for DNA sequences

We have assumed independence of nucleotides in different positions - unrealistic in biology

Page 21: CS 6243 Machine Learning

Example: CpG islands

• CpG - 2 adjacent nucleotides, same strand (not base-pair; “p” stands for the phosphodiester bond of the DNA backbone)

• In mammalian promoter regions, CpG is more frequent than in other regions of the genome
  – CpG islands often mark gene-rich regions

Page 22: CS 6243 Machine Learning

CpG islands

• CpG Islands
  – More CpG than elsewhere
  – More C & G than elsewhere, too
  – Typical length: a few 100s to a few 1000s bp

• Questions
  – Is a short sequence (say, 200 bp) a CpG island or not?
  – Given a long sequence (say, 10-100 kb), find CpG islands?

Page 23: CS 6243 Machine Learning

Markov models

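• A sequence of random variables is a k-th order Markov chain if, for all i, the i-th value is independent of all but the previous k values:

P(xi | x1, x2, …, xi-1) = P(xi | xi-k, …, xi-1)

• First order (k=1): P(xi | x1, …, xi-1) = P(xi | xi-1)
• Second order: P(xi | x1, …, xi-1) = P(xi | xi-2, xi-1)

• 0th order: P(xi | x1, …, xi-1) = P(xi) (independence)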

Page 24: CS 6243 Machine Learning

First order Markov model

Page 25: CS 6243 Machine Learning

A 1st order Markov model for CpG islands

• Essentially a finite state automaton (FSA)
• Transitions are probabilistic (instead of deterministic)

• 4 states: A, C, G, T
• 16 transitions: ast = P(xi = t | xi-1 = s)
• Begin/End states

Page 26: CS 6243 Machine Learning

Probability of emitting sequence x

Page 27: CS 6243 Machine Learning

Probability of a sequence

• What’s the probability of ACGGCTA in this model?

P(A) * P(C|A) * P(G|C) … P(A|T)

= aBA aAC aCG …aTA

• Equivalent: follow the path in the automaton, and multiply the transition probabilities on the path

Page 28: CS 6243 Machine Learning


• Estimate the parameters of the model
  – CpG+ model: count the transition frequencies from known CpG islands
  – CpG- model: also count the transition frequencies from sequences without CpG islands
  – ast = #(s→t) / #(s→·), where #(s→·) is the number of transitions out of s
  – This gives two transition tables: a+st and a-st


Page 29: CS 6243 Machine Learning

Discrimination / Classification

• Given a sequence, is it CpG island or not?

• Log likelihood ratio (LLR)

βCG = log2(a+CG / a-CG) = log2(0.274/0.078) = 1.812

βBA = log2(a+BA / a-BA) = log2(0.591/1.047) = -0.825

Page 30: CS 6243 Machine Learning



• S(X) = βBA + βAC +βCG +βGG +βGC +βCG +βGA + βAC +βCG +βGT +βTC +βCG

= βBA + 2βAC +4βCG +βGG +βGC +βGA +βGT +βTC

= -0.825 + 2*.419 + 4*1.812+.313 +.461 - .624 - .730 + .573

= 7.25
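The score above can be reproduced by summing per-transition log-odds along the sequence. In this sketch the sequence ACGGCGACGTCG is inferred from the order of the β terms in S(X), "B" stands for the begin state, and only the β values quoted on the slide are included; the function name `score` is illustrative.

```python
# Log-odds beta_st = log2(a+_st / a-_st), copied from the worked example.
beta = {
    ("B", "A"): -0.825, ("A", "C"): 0.419, ("C", "G"): 1.812,
    ("G", "G"): 0.313,  ("G", "C"): 0.461, ("G", "A"): -0.624,
    ("G", "T"): -0.730, ("T", "C"): 0.573,
}

def score(seq):
    """Sum of log-odds over consecutive symbol pairs, starting from begin state B."""
    path = "B" + seq
    return sum(beta[(s, t)] for s, t in zip(path, path[1:]))

X = "ACGGCGACGTCG"  # inferred from the transitions listed in S(X)
print(round(score(X), 2))  # 7.25
```

A positive score means the CpG+ model explains the sequence better than the CpG- model.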

Page 31: CS 6243 Machine Learning

CpG island scores

Figure 3.2 (Durbin book) The histogram of length-normalized scores for all the sequences. CpG islands are shown with dark grey and non-CpG with light grey.

Page 32: CS 6243 Machine Learning


• Q1: Given a short sequence, is it more likely from the CpG+ model or the CpG- model?

• Q2: Given a long sequence, where are the CpG islands (if any)?
  – Approach 1: score (e.g.) 100 bp windows
    • Pro: simple
    • Con: arbitrary, fixed length, inflexible
  – Approach 2: combine +/- models.

Page 33: CS 6243 Machine Learning

Combined model

• Given a long sequence, predict which state each position is in. (states are hidden: Hidden Markov model)

Page 34: CS 6243 Machine Learning

Hidden Markov Model (HMM)

• Introduced in the 70’s for speech recognition
• Have been shown to be good models for biosequences
  – Alignment
  – Gene prediction
  – Protein domain analysis
  – …

• Observed sequence data that can be modeled by a Markov chain
  – State path unknown
  – Model parameters known or unknown

• Observed data: emission sequence X = (x1x2…xn)

• Hidden data: state sequence Π = (π1π2…πn)

Page 35: CS 6243 Machine Learning

Hidden Markov model (HMM)

Definition: A hidden Markov model (HMM) is a five-tuple
• Alphabet = { b1, b2, …, bM }
• Set of states Q = { 1, ..., K }
• Transition probabilities between any two states
  aij = transition prob from state i to state j
  ai1 + … + aiK = 1, for all states i = 1…K
• Start probabilities a0i
  a01 + … + a0K = 1
• Emission probabilities within each state
  ek(b) = P(xi = b | πi = k)
  ek(b1) + … + ek(bM) = 1, for all states k = 1…K




Page 36: CS 6243 Machine Learning

HMM for the Dishonest Casino

A casino has two dice:
• Fair die
  P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
• Loaded die
  P(1) = P(2) = P(3) = P(4) = P(5) = 1/10
  P(6) = 1/2

The casino player switches back and forth between the fair and loaded die once in a while

Page 37: CS 6243 Machine Learning

The dishonest casino model


Transition probabilities:
  aFF = 0.95   aFL = 0.05
  aLF = 0.05   aLL = 0.95

Emission probabilities:
  Fair:   eF(1) = eF(2) = eF(3) = eF(4) = eF(5) = eF(6) = 1/6
  Loaded: eL(1) = eL(2) = eL(3) = eL(4) = eL(5) = 1/10, eL(6) = 1/2

Page 38: CS 6243 Machine Learning

Simple scenario

• You don’t know the probabilities
• The casino player lets you observe which die he/she uses every time
  – The “state” of each roll is known

• Training (parameter estimation)
  – How often does the casino player switch dice?
  – How “loaded” is the loaded die?
  – Simply count the frequency with which each face appeared and the frequency of die switching
  – May add pseudo-counts if the number of observations is small
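Counting-based training for the known-state scenario could look like the sketch below. The rolls and state labels are made up for illustration, and the function name `estimate_parameters` and the pseudo-count of 1 are assumptions rather than anything prescribed by the slides.

```python
from collections import Counter

def estimate_parameters(rolls, states, faces=range(1, 7),
                        state_names=("F", "L"), pseudocount=1.0):
    """Estimate transition and emission probabilities from labeled rolls."""
    trans_counts = Counter(zip(states, states[1:]))   # (from_state, to_state)
    emit_counts = Counter(zip(states, rolls))         # (state, face)
    a, e = {}, {}
    for s in state_names:
        # Normalize counts (plus pseudo-counts) out of state s.
        t_total = sum(trans_counts[(s, t)] + pseudocount for t in state_names)
        for t in state_names:
            a[(s, t)] = (trans_counts[(s, t)] + pseudocount) / t_total
        e_total = sum(emit_counts[(s, f)] + pseudocount for f in faces)
        for f in faces:
            e[(s, f)] = (emit_counts[(s, f)] + pseudocount) / e_total
    return a, e

# Hypothetical labeled observations: F = fair die, L = loaded die.
rolls = [1, 6, 6, 2, 6, 3, 4, 6, 6, 5]
states = "FLLLLFFLLF"
a, e = estimate_parameters(rolls, states)
print(a[("F", "L")], e[("L", 6)])
```

The pseudo-counts keep unseen transitions and faces from getting probability zero when the training data is small.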


Page 39: CS 6243 Machine Learning

More complex scenarios

• The “state” of each roll is unknown:
  – You are given the results of a series of rolls
  – You don’t know which number is generated by which die

• You may or may not know the parameters
  – How “loaded” is the loaded die
  – How frequently the casino player switches dice


Page 40: CS 6243 Machine Learning

The three main questions on HMMs

1. Decoding

GIVEN a HMM M, and a sequence x,
FIND the sequence π of states that maximizes P(x, π | M)

2. Evaluation

GIVEN a HMM M, and a sequence x,
FIND P(x | M) [or P(x) for simplicity]

3. Learning

GIVEN a HMM M with unspecified transition/emission probs., and a sequence x,
FIND parameters θ = (ei(.), aij) that maximize P(x | θ)

Sometimes written as P(x, θ) for simplicity.

Page 41: CS 6243 Machine Learning

Question # 1 – Decoding


A HMM with parameters. And a sequence of rolls by the casino player



What portion of the sequence was generated with the fair die, and what portion with the loaded die?

This is the DECODING question in HMMs

Page 42: CS 6243 Machine Learning

A parse of a sequence

Given a sequence x = x1……xN, and a HMM with K states,

A parse of x is a sequence of states π = π1, ……, πN

[Figure: state trellis over the positions x1, x2, x3, …]





Page 43: CS 6243 Machine Learning

Probability of a parse

Given a sequence x = x1……xN

and a parse π = π1, ……, πN

To find how likely the parse is (given our HMM):

P(x, π) = P(x1, …, xN, π1, ……, πN)

= P(xN, πN | πN-1) P(xN-1, πN-1 | πN-2) …… P(x2, π2 | π1) P(x1, π1)

= P(xN | πN) P(πN | πN-1) …… P(x2 | π2) P(π2 | π1) P(x1 | π1) P(π1)

= a0π1 aπ1π2 …… aπN-1πN eπ1(x1) …… eπN(xN)





Page 44: CS 6243 Machine Learning


• What’s the probability of
  π = Fair, Fair, Fair, Fair, Load, Load, Load, Load, Fair, Fair
  X = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4?

P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6

P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10, P(6|L) = 1/2

Page 45: CS 6243 Machine Learning


• What’s the probability of
  π = Fair, Fair, Fair, Fair, Load, Load, Load, Load, Fair, Fair
  X = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4?

P = ½ * P(1 | F) P(Fi+1 | Fi) … P(5 | F) P(Li+1 | Fi) P(6 | L) P(Li+1 | Li) … P(4 | F)
  = ½ x 0.95^7 x 0.05^2 x (1/6)^6 x (1/10)^2 x (1/2)^2 = 5 x 10^-11
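The joint probability of a sequence and a given parse is a product of one start term, the transitions along the path, and the emissions. A sketch for the casino model follows; the start probability of ½ for each die is taken from the leading ½ in the calculation, and the dictionary layout and function name are illustrative.

```python
start = {"F": 0.5, "L": 0.5}
trans = {("F", "F"): 0.95, ("F", "L"): 0.05,
         ("L", "F"): 0.05, ("L", "L"): 0.95}
emit = {
    "F": {s: 1 / 6 for s in range(1, 7)},
    "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}

def joint_probability(x, path):
    """P(x, pi): start term, then one transition and one emission per position."""
    p = start[path[0]] * emit[path[0]][x[0]]
    for i in range(1, len(x)):
        p *= trans[(path[i - 1], path[i])] * emit[path[i]][x[i]]
    return p

x = [1, 2, 1, 5, 6, 2, 1, 6, 2, 4]
path = "FFFFLLLLFF"
print(f"{joint_probability(x, path):.1e}")  # about 5e-11
```

For long sequences this product underflows quickly, which is one reason the later recursions are often run in log space.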

Page 46: CS 6243 Machine Learning


• Parse (path) is unknown. What to do?
• Alternative algorithms:
  – Most probable single path (Viterbi algorithm)
  – Sequence of most probable states (Forward-backward algorithm)

Page 47: CS 6243 Machine Learning

The Viterbi algorithm

• Goal: to find π* = argmaxπ P(x, π)

• This is equivalent to finding π* = argmaxπ log(P(x, π))


Page 48: CS 6243 Machine Learning

The Viterbi algorithm

• Find a path with the following objective:
  – Maximize the product of transition and emission probabilities
  – Equivalently, maximize the sum of log probabilities

P(s|F) = 1/6, for s in [1..6]
P(s|L) = 1/10, for s in [1..5]
P(6|L) = 1/2

Edge weight (symbol independent): log of the transition probability

Node weight (depends on symbols in seq): log of the emission probability

Page 49: CS 6243 Machine Learning

The Viterbi algorithm




x1 x2 x3 … xi xi+1 … xn-1 xn

VF(i+1) = rF(xi+1) + max { VF(i) + wFF, VL(i) + wLF }
  Weight for the best parse of (x1…xi+1), with xi+1 emitted by state F

VL(i+1) = rL(xi+1) + max { VF(i) + wFL, VL(i) + wLL }
  Weight for the best parse of (x1…xi+1), with xi+1 emitted by state L

wFF = log(aFF)

rF(xi+1) = log(eF(xi+1))

Page 50: CS 6243 Machine Learning

Recursion from FSA directly





Log space:

rF(s) = -1.8, s = 1…6
rL(6) = -0.7
rL(s) = -2.3, s = 1…5

VF(i+1) = rF(xi+1) + max { VL(i) + wLF, VF(i) + wFF }

VL(i+1) = rL(xi+1) + max { VL(i) + wLL, VF(i) + wFL }

Probability space:

P(s|F) = 1/6, s = 1…6
P(6|L) = ½
P(s|L) = 1/10, s = 1…5

PF(i+1) = eF(xi+1) max { PL(i) aLF, PF(i) aFF }

PL(i+1) = eL(xi+1) max { PL(i) aLL, PF(i) aFL }

Page 51: CS 6243 Machine Learning

In general: more states / symbols

• Alphabet = { b1, b2, …, bM }

• Set of states Q = { 1, ..., K }

• States are completely connected.
  – K^2 transition probabilities (some may be 0)
  – Each state has M emission probabilities (some may be 0)

Vl(i+1) = rl(xi+1) + max k=1..K (Vk(i) + wkl)

[Figure: states 1, 2, …, K at position xi, each connected to state l at position xi+1]

Page 52: CS 6243 Machine Learning

The Viterbi Algorithm

Similar to “aligning” a set of states to a sequence

Time: O(K^2 N)

Space: O(KN)

[Figure: dynamic-programming table with one row per state (State 1, …) and one column per position x1 … xN]

Vl(i+1) = rl(xi+1) + max k=1..K (Vk(i) + wkl)


Page 53: CS 6243 Machine Learning

The Viterbi Algorithm (in log space)

Input: x = x1……xN

Initialization:
  V0(0) = 0 (zero in subscript is the start state.)
  Vl(0) = -inf, for all l > 0 (0 in parentheses is the imaginary first position)

Iteration:
  for each i
    for each l
      Vl(i) = rl(xi) + maxk (wkl + Vk(i-1))   // rj(xi) = log(ej(xi)), wkj = log(akj)
      Ptrl(i) = argmaxk (wkl + Vk(i-1))

Termination:
  Prob(x, π*) = exp{maxk Vk(N)}

Traceback:
  πN* = argmaxk Vk(N)
  πi-1* = Ptrπi*(i)
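The log-space recursion can be sketched in a few lines for the dishonest casino. This is one possible implementation, not the course's reference code: the start state is folded into a `start` distribution assumed to be ½ for each die, and the names `viterbi`, `V` and `ptr` are illustrative.

```python
import math

states = ("F", "L")
start = {"F": 0.5, "L": 0.5}   # assumption: equal start probabilities
trans = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.05, "L": 0.95}}
emit = {
    "F": {s: 1 / 6 for s in range(1, 7)},
    "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}

def viterbi(x):
    """Return (log P(x, pi*), pi*) for a list of rolls x."""
    # V[i][l]: best log score of any parse of x[0..i] that ends in state l.
    V = [{l: math.log(start[l]) + math.log(emit[l][x[0]]) for l in states}]
    ptr = [{}]
    for i in range(1, len(x)):
        V.append({})
        ptr.append({})
        for l in states:
            best_k = max(states, key=lambda k: V[i - 1][k] + math.log(trans[k][l]))
            ptr[i][l] = best_k
            V[i][l] = (math.log(emit[l][x[i]])
                       + V[i - 1][best_k] + math.log(trans[best_k][l]))
    # Traceback from the best final state.
    last = max(states, key=lambda k: V[-1][k])
    path = [last]
    for i in range(len(x) - 1, 0, -1):
        path.append(ptr[i][path[-1]])
    path.reverse()
    return V[-1][last], "".join(path)

score, path = viterbi([1, 2, 6, 6, 6, 6, 6, 6, 3, 4])
print(path, math.exp(score))
```

The table `V` has K entries per position and each entry looks at K predecessors, which gives the O(K^2 N) time and O(KN) space noted earlier.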

Page 54: CS 6243 Machine Learning

The Viterbi Algorithm (in prob space)

Input: x = x1……xN

Initialization:
  P0(0) = 1 (zero in subscript is the start state.)
  Pl(0) = 0, for all l > 0 (0 in parentheses is the imaginary first position)

Iteration:
  for each i
    for each l
      Pl(i) = el(xi) maxk (akl Pk(i-1))
      Ptrl(i) = argmaxk (akl Pk(i-1))

Termination:
  Prob(x, π*) = maxk Pk(N)

Traceback:
  πN* = argmaxk Pk(N)
  πi-1* = Ptrπi*(i)

Page 55: CS 6243 Machine Learning
Page 56: CS 6243 Machine Learning

CpG islands

• Data: 41 human sequences, including 48 CpG islands of about 1 kbp each

• Viterbi:
  – Found 46 of 48
  – plus 121 “false positives”

• Post-processing:
  – merge within 500 bp
  – discard < 500 bp
  – Found 46/48
  – 67 false positives

Page 57: CS 6243 Machine Learning

Problems with Viterbi decoding

• Most probable path not necessarily the only interesting one
  – Single optimal vs multiple sub-optimal
    • What if there are many sub-optimal paths with slightly lower probabilities?
  – Global optimal vs local optimal
    • What’s best globally may not be the best for each position


Page 58: CS 6243 Machine Learning


• The dishonest casino
• Say x = 12341623162616364616234161221341
• Most probable path: π* = FF……F
• However: marked letters are more likely to be L than unmarked letters
• Another way to interpret the problem
  – With Viterbi, every position is assigned a single label
  – Confidence level for each assignment?

Page 59: CS 6243 Machine Learning

Posterior decoding

• Viterbi finds the path with the highest probability

• We want to know P(πi = k | x)

• In order to do posterior decoding, we need to know P(x) and P(πi = k, x), since

P(πi = k | x) = P(πi = k, x) / P(x)

• Computing P(x) and P(x, πi = k) is called the evaluation problem

• The solution: Forward-backward algorithm

Page 60: CS 6243 Machine Learning

Probability of a sequence

• P(X | M): prob that X can be generated by M
• Sometimes simply written as P(X)
• May be written as P(X | M, θ) or P(X | θ) to emphasize that we are looking for θ to optimize the likelihood (discussed later in learning)

• Not equal to the probability of a path P(X, π)
  – Many possible paths can generate X, each with a probability
  – P(X) = Σπ P(X, π) = Σπ P(X | π) P(π)
  – How to compute without summing over all possible paths (exponentially many)?
    • Dynamic programming

Page 61: CS 6243 Machine Learning

The forward algorithm

• Define fk(i) = P(x1…xi, πi=k)
  – Implicitly: sum over all possible paths for x1…xi-1

P(x) = Σk fk(n) = Σk P(x1…xn, πn=k)

Page 62: CS 6243 Machine Learning
























Page 63: CS 6243 Machine Learning

The forward algorithm



Page 64: CS 6243 Machine Learning

The forward algorithm

We can compute fk(i) for all k, i, using dynamic programming!

Initialization:
  f0(0) = 1
  fk(0) = 0, for all k > 0

Iteration:
  fk(i) = ek(xi) Σj fj(i-1) ajk

Termination:
  Prob(x) = Σk fk(N)
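A direct transcription of the forward recursion for the casino model, as a sketch. The start state is folded into a `start` distribution assumed to be ½ for each die, and the function name `forward` and the list-of-dicts table are illustrative choices.

```python
states = ("F", "L")
start = {"F": 0.5, "L": 0.5}   # assumption: equal start probabilities
trans = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.05, "L": 0.95}}
emit = {
    "F": {s: 1 / 6 for s in range(1, 7)},
    "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}

def forward(x):
    """Return (P(x), f) where f[i][k] = P(x[0..i], state at position i is k)."""
    f = [{k: start[k] * emit[k][x[0]] for k in states}]
    for i in range(1, len(x)):
        f.append({
            k: emit[k][x[i]] * sum(f[i - 1][j] * trans[j][k] for j in states)
            for k in states
        })
    return sum(f[-1].values()), f

px, f = forward([1, 2, 1, 5, 6, 2, 1, 6, 2, 4])
print(px)
```

Plain probabilities are fine for short sequences; for long ones the values underflow, so practical implementations rescale each column or work with logarithms.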

Page 65: CS 6243 Machine Learning

Relation between Forward and Viterbi

VITERBI (in prob space)

Initialization:
  P0(0) = 1
  Pk(0) = 0, for all k > 0

Iteration:
  Pk(i) = ek(xi) maxj Pj(i-1) ajk

Termination:
  Prob(x, π*) = maxk Pk(N)

FORWARD

Initialization:
  f0(0) = 1
  fk(0) = 0, for all k > 0

Iteration:
  fk(i) = ek(xi) Σj fj(i-1) ajk

Termination:
  Prob(x) = Σk fk(N)

Page 66: CS 6243 Machine Learning

Posterior decoding

• Viterbi finds the path with the highest probability

• We want to know P(πi = k | x)

• In order to do posterior decoding, we need to know P(x) and P(πi = k, x), since

P(πi = k | x) = P(πi = k, x) / P(x)

  – P(x): have just shown how to compute this
  – P(πi = k, x): need to know how to compute this

Page 67: CS 6243 Machine Learning



Page 68: CS 6243 Machine Learning

The backward algorithm

• Define bk(i) = P(xi+1…xn | πi=k)

  – Implicitly: sum over all possible paths for xi+1…xn



Page 69: CS 6243 Machine Learning


This does not include the emission probability of xi



Page 70: CS 6243 Machine Learning

The forward-backward algorithm

• Compute fk(i) for each state k and position i

• Compute bk(i) for each state k and position i

• Compute P(x) = Σk fk(N)

• Compute P(πi=k | x) = fk(i) * bk(i) / P(x)
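The four steps can be sketched as follows for the casino model. The backward recursion used here, bk(i) = Σl akl el(xi+1) bl(i+1) with bk(N) = 1, is the standard textbook form and is an assumption, since the recursion itself did not survive on the slides; the start distribution of ½ each and all function names are likewise illustrative.

```python
states = ("F", "L")
start = {"F": 0.5, "L": 0.5}   # assumption: equal start probabilities
trans = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.05, "L": 0.95}}
emit = {
    "F": {s: 1 / 6 for s in range(1, 7)},
    "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}

def forward(x):
    """f[i][k] = P(x[0..i], state at i is k)."""
    f = [{k: start[k] * emit[k][x[0]] for k in states}]
    for i in range(1, len(x)):
        f.append({k: emit[k][x[i]] * sum(f[i - 1][j] * trans[j][k] for j in states)
                  for k in states})
    return f

def backward(x):
    """b[i][k] = P(x[i+1..] | state at i is k); no explicit end state is modeled."""
    b = [{k: 1.0 for k in states}]
    for i in range(len(x) - 2, -1, -1):
        nxt = b[0]
        b.insert(0, {k: sum(trans[k][l] * emit[l][x[i + 1]] * nxt[l] for l in states)
                     for k in states})
    return b

def posterior(x):
    """P(state at i is k | x) = f_k(i) * b_k(i) / P(x) for every position i."""
    f, b = forward(x), backward(x)
    px = sum(f[-1].values())
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(len(x))]

post = posterior([1, 2, 6, 6, 6, 6, 6, 6, 3, 4])
print([round(p["L"], 2) for p in post])
```

Each entry of `post` sums to 1 over the states, so it can be read directly as a per-position confidence level.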

Page 71: CS 6243 Machine Learning

P(πi=k | x) = fk(i) * bk(i) / P(x)

  – fk(i) * bk(i) = P(πi=k, x): the prob of x, with the constraint that xi was generated by state k
  – fk(i): forward probabilities
  – bk(i): backward probabilities

Space: O(KN)

Time: O(K^2 N)

Page 72: CS 6243 Machine Learning

What’s P(πi=k | x) good for?

• For each position, you can assign a probability (in [0, 1]) to the states that the system might be in at that point – confidence level

• Assign each symbol to the most-likely state according to this probability rather than the state on the most-probable path – posterior decoding

π̂i = argmaxk P(πi = k | x)

Page 73: CS 6243 Machine Learning

Posterior decoding for the dishonest casino

If P(fair) > 0.5, the roll is more likely to be generated by a fair die than a loaded die

Page 74: CS 6243 Machine Learning

Posterior decoding for another dishonest casino

In this example, Viterbi predicts that all rolls were from the fair die.

Page 75: CS 6243 Machine Learning

CpG islands again

• Data: 41 human sequences, including 48 CpG islands of about 1 kbp each

• Viterbi:
  – Found 46 of 48, plus 121 “false positives”
  – After post-processing: 46/48, 67 false positives

• Posterior decoding:
  – Same 2 false negatives, plus 236 false positives
  – After post-processing: 46/48, 83 false positives

Post-processing: merge within 500 bp; discard < 500 bp

Page 76: CS 6243 Machine Learning

What if a new genome comes?

We just sequenced the porcupine genome

We know CpG islands play the same role in this genome

However, we do not have many known CpG islands for porcupines

We suspect the frequency and characteristics of CpG islands are quite different in porcupines

How do we adjust the parameters in our model?


Page 77: CS 6243 Machine Learning


• When the state path is known
  – We’ve already done that
  – Estimate parameters from labeled data (known CpG and non-CpG)
  – “Supervised” learning

• When the state path is unknown
  – Estimate parameters without labeled data
  – “Unsupervised” learning

Page 78: CS 6243 Machine Learning

Basic idea

1. Estimate our “best guess” on the model parameters θ

2. Use θ to predict the unknown labels

3. Re-estimate a new set of θ

4. Repeat 2 & 3

Multiple ways

Page 79: CS 6243 Machine Learning

Viterbi training

1. Estimate our “best guess” on the model parameters θ

2. Find the Viterbi path using current θ

3. Re-estimate a new set of θ based on the Viterbi path

– Count transitions/emissions on those paths, getting new θ

4. Repeat 2 & 3 until convergence

Page 80: CS 6243 Machine Learning

Baum-Welch training

1. Estimate our “best guess” on the model parameters θ

2. Find P(πi=k | x,θ) using forward-backward algorithm

3. Re-estimate a new set of θ based on all possible paths
  • For example, according to Viterbi, pos i is in state k and pos (i+1) is in state l
    – This contributes 1 count towards the frequency that transition k→l is used
  • In Baum-Welch, pos i has some prob of being in state k and pos (i+1) has some prob of being in state l. This transition is counted only partially, according to the prob of this transition

4. Repeat 2 & 3 until convergence

Page 81: CS 6243 Machine Learning

Probability that a transition is used



i i+1

Page 82: CS 6243 Machine Learning

Estimated # of k→l transitions
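The formulas for these two slides did not survive extraction, so the sketch below uses the standard textbook expressions as an assumption: the probability that transition k→l is used at position i is fk(i) akl el(xi+1) bl(i+1) / P(x), and the estimated number of k→l transitions is the sum of that quantity over i. The casino parameters, the ½ start distribution, and all function names are illustrative.

```python
states = ("F", "L")
start = {"F": 0.5, "L": 0.5}   # assumption: equal start probabilities
trans = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.05, "L": 0.95}}
emit = {
    "F": {s: 1 / 6 for s in range(1, 7)},
    "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}

def forward(x):
    f = [{k: start[k] * emit[k][x[0]] for k in states}]
    for i in range(1, len(x)):
        f.append({k: emit[k][x[i]] * sum(f[i - 1][j] * trans[j][k] for j in states)
                  for k in states})
    return f

def backward(x):
    b = [{k: 1.0 for k in states}]   # b_k(N) = 1, no explicit end state
    for i in range(len(x) - 2, -1, -1):
        nxt = b[0]
        b.insert(0, {k: sum(trans[k][l] * emit[l][x[i + 1]] * nxt[l] for l in states)
                     for k in states})
    return b

def expected_transitions(x):
    """A[k][l]: expected number of k->l transitions given x and current parameters."""
    f, b = forward(x), backward(x)
    px = sum(f[-1].values())
    A = {k: {l: 0.0 for l in states} for k in states}
    for i in range(len(x) - 1):
        for k in states:
            for l in states:
                A[k][l] += f[i][k] * trans[k][l] * emit[l][x[i + 1]] * b[i + 1][l] / px
    return A

def reestimate_transitions(A):
    """M-step for transitions: normalize the expected counts out of each state."""
    return {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}

x = [1, 6, 6, 6, 2, 3, 6, 6, 4, 5]
A = expected_transitions(x)
new_trans = reestimate_transitions(A)
print(new_trans)
```

The partial counts over all state pairs add up to the number of adjacent position pairs, which is a handy sanity check; emission counts can be accumulated the same way from fk(i) bk(i) / P(x).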

Page 83: CS 6243 Machine Learning

Viterbi vs Baum-Welch training

• Viterbi training
  – Returns a single path
  – Each position labeled with a fixed state
  – Each transition counts one
  – Each emission also counts one

• Baum-Welch training
  – Does not return a single path
  – Considers the prob that each transition is used and the prob that a symbol is generated by a certain state
  – They only contribute partial counts

Page 84: CS 6243 Machine Learning

Viterbi vs Baum-Welch training

• Both are guaranteed to converge

• Baum-Welch improves the likelihood of the data in each iteration: P(X)
  – True EM (expectation-maximization)

• Viterbi improves the probability of the most probable path in each iteration: P(X, π*)
  – EM-like

Page 85: CS 6243 Machine Learning

Expectation-maximization (EM)

• Baum-Welch algorithm is a special case of the expectation-maximization (EM) algorithm, a widely used technique in statistics for learning parameters from unlabeled data

• E-step: compute the expectation (e.g. prob for each pos to be in a certain state)

• M-step: maximum-likelihood parameter estimation

• Recall: clustering

Page 86: CS 6243 Machine Learning
Page 87: CS 6243 Machine Learning

HMM summary

• Viterbi – best single path

• Forward – sum over all paths

• Backward – similar

• Baum-Welch – training via EM and forward-backward

• Viterbi training – another “EM”, but Viterbi-based

Page 88: CS 6243 Machine Learning

Modular design of HMM

• HMM can be designed modularly

• Each module has its own begin / end states (silent, i.e. no emission)

• Each module communicates with other modules only through begin/end states

Page 89: CS 6243 Machine Learning


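[Figure: a CpG+ module (nucleotide states such as C+, G+, T+ between silent states B+ and E+) connected to a CpG- module (nucleotide states A-, C-, G-, T- with silent begin state B-)]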


HMM modules and non-HMM modules can be mixed

Page 90: CS 6243 Machine Learning

HMM applications

• Gene finding

• Character recognition

• Speech recognition: a good tutorial on course website

• Machine translation

• Many others

Page 91: CS 6243 Machine Learning

• Typed word recognition, assume all characters are separated.

• Character recognizer outputs probability of the image being particular character, P(image|character).








Word recognition example(1).

Hidden state    Observation

http://www.cedar.buffalo.edu/~govind/cs661

Page 92: CS 6243 Machine Learning

• Hidden states of HMM = characters.

• Observations = typed images of characters segmented from the image. Note that there is an infinite number of possible observations

• Observation probabilities = character recognizer scores.

•Transition probabilities will be defined differently in two subsequent models.

Word recognition example(2).

B = { bi(v) } = { P(v | si) }



Page 93: CS 6243 Machine Learning

• If lexicon is given, we can construct separate HMM models for each lexicon word.

Amherst a m h e r s t

Buffalo b u f f a l o

0.5 0.03

• Here recognition of a word image is equivalent to the problem of evaluating a few HMM models.
• This is an application of the Evaluation problem.

Word recognition example(3).

0.4 0.6


Page 94: CS 6243 Machine Learning

• We can construct a single HMM for all words.
• Hidden states = all characters in the alphabet.
• Transition probabilities and initial probabilities are calculated from a language model.
• Observations and observation probabilities are as before.

[Figure: HMM whose states are characters (a, m, h, e, …, b, v)]

• Here we have to determine the best sequence of hidden states, the one that most likely produced the word image.
• This is an application of the Decoding problem.

Word recognition example(4).


Page 95: CS 6243 Machine Learning

• The structure of hidden states is chosen.

• Observations are feature vectors extracted from vertical slices.

• Probabilistic mapping from hidden state to feature vectors:
  1. Use a mixture of Gaussian models
  2. Quantize the feature vector space.

Character recognition with HMM example.


Page 96: CS 6243 Machine Learning

• The structure of hidden states: s1 → s2 → s3

• Observation = number of islands in the vertical slice.

(Matrix rows are states s1, s2, s3; observation columns are 1, 2, 3 islands.)

• HMM for character ‘A’:

Transition probabilities {aij} =
  .8  .2   0
   0  .8  .2
   0   0   1

Observation probabilities {bjk} =
  .9  .1   0
  .1  .8  .1
  .9  .1   0

• HMM for character ‘B’:

Transition probabilities {aij} =
  .8  .2   0
   0  .8  .2
   0   0   1

Observation probabilities {bjk} =
  .9  .1   0
   0  .2  .8
  .6  .4   0

Exercise: character recognition with HMM(1)


Page 97: CS 6243 Machine Learning

• Suppose that after character image segmentation the following sequence of island numbers in 4 slices was observed: { 1, 3, 2, 1}

• Which HMM is more likely to generate this observation sequence, the HMM for ‘A’ or the HMM for ‘B’?

Exercise: character recognition with HMM(2)


Page 98: CS 6243 Machine Learning

Consider the likelihood of generating the given observation for each possible sequence of hidden states:

• HMM for character ‘A’:

Hidden state sequence   Transition probabilities   Observation probabilities
s1 → s1 → s2 → s3       .8 * .2 * .2             * .9 * 0 * .8 * .9  = 0
s1 → s2 → s2 → s3       .2 * .8 * .2             * .9 * .1 * .8 * .9 = 0.0020736
s1 → s2 → s3 → s3       .2 * .2 * 1              * .9 * .1 * .1 * .9 = 0.000324

Total = 0.0023976

• HMM for character ‘B’:

Hidden state sequence   Transition probabilities   Observation probabilities
s1 → s1 → s2 → s3       .8 * .2 * .2             * .9 * 0 * .2 * .6  = 0
s1 → s2 → s2 → s3       .2 * .8 * .2             * .9 * .8 * .2 * .6 = 0.0027648
s1 → s2 → s3 → s3       .2 * .2 * 1              * .9 * .8 * .4 * .6 = 0.006912

Total = 0.0096768

Exercise: character recognition with HMM(3)
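The exercise totals can be checked by enumerating hidden-state paths. Following the worked table, this sketch assumes every path starts in s1 and ends in s3; the matrix layout (rows s1..s3, observation columns 1..3 islands) and the names `TRANS`, `EMIT` and `likelihood` are illustrative.

```python
from itertools import product

# Shared left-to-right transition matrix (rows/columns: s1, s2, s3).
TRANS = [[0.8, 0.2, 0.0],
         [0.0, 0.8, 0.2],
         [0.0, 0.0, 1.0]]

# Observation matrices: rows are states, columns are 1, 2, 3 islands.
EMIT = {
    "A": [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.9, 0.1, 0.0]],
    "B": [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8], [0.6, 0.4, 0.0]],
}

def likelihood(emit, obs):
    """Sum P(obs, path) over all paths that start in s1 and end in s3."""
    total = 0.0
    for middle in product(range(3), repeat=len(obs) - 2):
        path = (0, *middle, 2)
        p = emit[path[0]][obs[0] - 1]
        for prev, cur, o in zip(path, path[1:], obs[1:]):
            p *= TRANS[prev][cur] * emit[cur][o - 1]
        total += p
    return total

obs = [1, 3, 2, 1]
scores = {name: likelihood(emit, obs) for name, emit in EMIT.items()}
print(scores)
print(max(scores, key=scores.get))  # B
```

Enumeration is fine for four slices; for longer observation sequences the forward algorithm gives the same sum without listing every path.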


Page 99: CS 6243 Machine Learning

HMM for gene finding

• Foundation for most gene finders

• Include many knowledge-based fine-tunes and GHMM extensions

• We’ll only discuss basic ideas

Page 100: CS 6243 Machine Learning

Gene structure

[Figure: gene structure, 5’ to 3’ along the DNA: intergenic region, exon1, intron1, exon2, intron2, exon3; the mature mRNA is shown below the DNA]

Exon: coding
Intron: non-coding
Intergenic: non-coding


Page 101: CS 6243 Machine Learning

Transcription

(where genetic information is stored)

(for making mRNA)


Template strand: 3’-TGCATCTGCATATCTCGGATC-5’


Coding strand and mRNA have the same sequence, except that T’s in DNA are replaced by U’s in mRNA.

DNA-RNA pair:

A=U, C=G

T=A, G=C
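The pairing rules can be applied mechanically to the template strand shown above. In this sketch the function name `transcribe_template` is illustrative, and the input is read in the 3’→5’ direction as written, so the output is the mRNA in the 5’→3’ direction.

```python
# DNA base on the template strand -> RNA base it pairs with.
PAIR = {"A": "U", "C": "G", "T": "A", "G": "C"}

def transcribe_template(template_3_to_5):
    """Build the mRNA (5'->3') complementary to a template strand given 3'->5'."""
    return "".join(PAIR[base] for base in template_3_to_5)

mrna = transcribe_template("TGCATCTGCATATCTCGGATC")
print(mrna)  # ACGUAGACGUAUAGAGCCUAG
```

Replacing U with T in the result gives the coding strand, which is the sense in which coding strand and mRNA carry the same sequence.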

Page 102: CS 6243 Machine Learning


• The sequence of codons is translated to a sequence of amino acids

• Gene: -GCT TGT TTA CGA ATT-
• mRNA: -GCU UGU UUA CGA AUU-
• Peptide: -Ala - Cys - Leu - Arg - Ile-

• Start codon: AUG
  – Also codes for Met
• Stop codons: UGA, UAA, UAG
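A toy translation of the example gene, as a sketch. The codon table here is deliberately partial: it contains only the codons that appear on this slide, and the function name `translate` is illustrative.

```python
# Partial codon table: only the codons mentioned on the slide.
CODON_TABLE = {
    "GCU": "Ala", "UGU": "Cys", "UUA": "Leu", "CGA": "Arg", "AUU": "Ile",
    "AUG": "Met",                       # also the start codon
    "UGA": None, "UAA": None, "UAG": None,  # stop codons
}

def translate(coding_dna):
    """Transcribe the coding strand (T -> U) and translate codon by codon."""
    mrna = coding_dna.replace("T", "U")
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid is None:          # stop codon ends translation
            break
        peptide.append(amino_acid)
    return peptide

print("-".join(translate("GCTTGTTTACGAATT")))  # Ala-Cys-Leu-Arg-Ile
```

A full implementation would use all 64 codons; looking up a codon outside this partial table raises a KeyError.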

Page 103: CS 6243 Machine Learning

The Genetic Code

[Table: the genetic code, with codons indexed by letter position]

Page 104: CS 6243 Machine Learning

Finding genes









As the coding/non-coding length ratio decreases, exon prediction becomes more complex

Page 105: CS 6243 Machine Learning

Gene prediction in prokaryote

• Finding long ORFs (open reading frames)
• An ORF may not contain stop codons
  – Average ORF length = 64/3 codons
  – Expect one 300 bp ORF per 36 kbp
  – Actual ORF length ~ 1000 bp

• Codon biases
  – Some triplets are used more frequently than others
  – Codon third position biases
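One way to arrive at figures close to the ones above is to assume (this is my reading, not stated on the slide) a random sequence in which all 64 codons are equally likely and 3 of them are stops. The number of codons before the first stop is then geometric with stop probability 3/64.

```python
p_stop = 3 / 64   # three stop codons out of 64, assuming uniform random codons

# Mean number of codons up to and including the first stop.
mean_orf_codons = 1 / p_stop

def prob_no_stop(n_codons):
    """Probability that n consecutive random codons contain no stop codon."""
    return (1 - p_stop) ** n_codons

# A 300 bp ORF is 100 codons; if roughly one attempt in 1/p succeeds and each
# attempt is charged 300 bp, one such ORF is expected per 300 / p bases.
p_100 = prob_no_stop(100)
bp_per_300bp_orf = 300 / p_100

print(round(mean_orf_codons, 1))      # 21.3
print(round(bp_per_300bp_orf / 1000)) # about 36 (kbp)
```

Real genes with ORFs around 1000 bp are therefore far longer than chance would predict, which is what makes long ORFs a useful signal in prokaryotes.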

Page 106: CS 6243 Machine Learning

HMM for eukaryote gene finding

• Basic idea is the same: the distribution of nucleotides is different in exons and other regions
  – Alone won’t work very well

• More signals are needed

• How to combine all the signals together?

exon1 exon2 exon3intron1 intron2

5’ 3’





Splicing donor: GT

Splicing acceptor: AG STOP



Page 107: CS 6243 Machine Learning

Simplest model

• Exon length may not be an exact multiple of 3
• Basically have to triple the number of states to remember the excess number of bases in the previous state

[Figure: three states: intergenic (4 emission probabilities), exon (64 triplet emission probabilities), intron (4 emission probabilities)]

Actually more accurate at the di-amino-acid level, i.e. 2 codons. Many methods use a 5th-order Markov model for all regions

Page 108: CS 6243 Machine Learning

More detailed model


Init exon


Term exon

Internal Exon

Single exon

Page 109: CS 6243 Machine Learning


• START, STOP are PWMs
• Including start and stop codons and surrounding bases

Init exon: 5’-UTR, START, CDS

Term exon: CDS, STOP, 3’-UTR

CDS: coding sequence

Page 110: CS 6243 Machine Learning

Sub-model for intron

• Sequence logos: an informative display of PWMs
• Within each column, relative height represents probability
• Height of each column reflects “information content”

Intron: splice donor, intron body, splice acceptor

Page 111: CS 6243 Machine Learning
Page 112: CS 6243 Machine Learning

Duration modeling

• For any sub-path, the probability consists of two components
  – The product of emission probabilities
    • Depends on symbols and state path
  – The product of transition probabilities
    • Depends on state path

Page 113: CS 6243 Machine Learning

Duration modeling

• Model a stretch of DNA for which the distribution does not change for a certain length

• The simplest model implies that

P(length = L) = p^(L-1) (1-p)

  – i.e., length follows a geometric distribution
  – Not always appropriate




Duration: the number of times that a state is used consecutively without visiting other states



Page 114: CS 6243 Machine Learning

Duration models



s ss



s ss


Negative binomial

Min, then geometric




Page 115: CS 6243 Machine Learning

Explicit duration modeling


P(A | I) = 0.3
P(C | I) = 0.2
P(G | I) = 0.2
P(T | I) = 0.3



Exon Intergenic

Empirical intron length distribution

Generalized HMM. Often used in gene finders

Page 116: CS 6243 Machine Learning

Explicit duration modeling

• For each position j and each state i
  – Need to consider the transition from all previous positions

• Time: O(N^2 K^2)
• N can be 10^8

x1 x2 x3 ………………………………………..xN





Page 117: CS 6243 Machine Learning

Speedup GHMM

• Restrict maximum duration length to be L
  – O(LNK^2)

• However, intergenic regions and introns can be quite long
  – L can be 10^5

• Compromise: explicit duration for exons only, geometric for all other states

• Pre-compute all possible starting points of ORFs
  – For init exon: ATG
  – For internal/terminal exon: splice donor signal (GT)

Page 118: CS 6243 Machine Learning

Genscan model

Page 119: CS 6243 Machine Learning

Approaches to gene finding

• Homology
  – BLAST, BLAT, etc.

• Ab initio
  – Genscan, Glimmer, Fgenesh, GeneMark, etc.
  – Each one has been tuned towards certain organisms

• Hybrids
  – Twinscan, SLAM
  – Use pair-HMM, or pre-compute scores for potential coding regions based on alignment

• None are perfect; they are never used alone in practice

Page 120: CS 6243 Machine Learning

Current status

• More accurate on internal exons

• Determining boundaries of init and term exons is hard

• Biased towards multiple-exon genes

• Alternative splicing is hard

• Non-coding RNA is hard

Page 121: CS 6243 Machine Learning

• State of the Art:
  – predictions ~ 60% similar to real proteins
  – ~80% if database similarity used
  – lab verification still needed, still expensive

Page 122: CS 6243 Machine Learning

HMM wrap up

• We’ve talked about
  – Probability, mainly Bayes Theorem
  – Markov models
  – Hidden Markov models
  – HMM parameter estimation given state path
  – Decoding given HMM and parameters
    • Viterbi
    • F-B
  – Learning
    • Baum-Welch (Expectation-Maximization)
    • Viterbi

Page 123: CS 6243 Machine Learning

HMM wrap up

• We’ve also talked about
  – Extension to gHMMs
  – gHMM for gene finding

• We did not talk about
  – Higher-order Markov models
  – How to escape from local optima in learning