Cédric Notredame (21/10/2015) — Uncovering Sequence Mysteries with Hidden Markov Models
TRANSCRIPT
Cédric Notredame (21/04/23)
Uncovering Sequence Mysteries with Hidden Markov Models
Cédric Notredame
Our Scope
Understand the principle of HMMs
Understand HOW HMMs are used in Biology
Look once Under the Hood
Outline
- Reminder of Bayesian probabilities
- Application to gene prediction
- Application to TM (transmembrane) predictions
- HMMs and Markov Chains
- Application to domain / protein family prediction
- Future applications
Conditional Probabilities and Bayes' Theorem
I now send you an essay which I have found among the papers of our deceased friend Mr Bayes, and which, in my opinion, has great merit... In an introduction which he has writ to this Essay, he says, that his design at first in thinking on the subject of it was, to find out a method by which we might judge concerning the probability that an event has to happen, in given circumstances, upon supposition that we know nothing concerning it but that, under the same circumstances, it has happened a certain number of times, and failed a certain other number of times.
Bayes
“The Durbin…”
What is a Probabilistic Model?
Dice = Probabilistic Model
- Each possible outcome has a probability (1/6)
Biological questions:
- What kind of dice would generate coding DNA?
- Non-coding DNA?
Which Parameters?
Dice = Probabilistic Model. Parameters: the probability of each outcome.
- A priori estimation: 1/6 for each number
OR
- Through observation: measure frequencies over a large number of events
Which Parameters?
Model: Intra/Extra-cell Proteins. Parameters: the probability of each outcome.
1- Make a set of Inside proteins using annotation
2- Make a set of Outside proteins using annotation
3- COUNT frequencies on the two sets
Model accuracy depends on the training set.
Maximum Likelihood Models
Model: Intra/Extra-cell Proteins
1- Make a training set
2- Count frequencies
Model accuracy depends on the training set.
Maximum Likelihood Model: the model probability MAXIMISES the data probability.
Maximum Likelihood Models
Model: Intra/Extra-cell Proteins
The model probability MAXIMISES the data probability AND the data probability MAXIMISES the model probability:
P(Model | Data) is maximised
P(Data | Model) is maximised
(| means GIVEN!)
Maximum Likelihood Model
Maximum Likelihood Models
Model: Intra/Extra-cell Proteins
Data: 11121112221212122121112221112121112211111
P(Coin | Data) < P(Dice | Data)
Maximum Likelihood Model
Conditional Probabilities
Conditional Probabilities
The probability that something happens IF something else ALSO happens:
P(Win Lottery | Participation)
Conditional Probability
The probability that something happens IF something else ALSO happens.
Dice 1 (fair): P(6 | Dice 1) = 1/6
Dice 2 (Loaded!): P(6 | Dice 2) = 1/2
Joint Probability
The probability that something happens AND something else ALSO happens. The comma means AND.
P(6 | D1) = 1/6, P(6 | D2) = 1/2
P(6, D2) = P(6 | D2) * P(D2) = 1/2 * 1/100
Joint Probability
Question: what is the probability of making a 6, given that the loaded dice is used 1% of the time?
P(6) = P(6, DF) + P(6, DL) = P(6 | DF) * P(DF) + P(6 | DL) * P(DL) = 1/6 * 0.99 + 1/2 * 0.01 = 0.17
(0.167 for an unloaded dice)
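The arithmetic above is small enough to check in a few lines. A minimal Python sketch of the law-of-total-probability computation, with the slide's 1% loaded-dice mix:

```python
# Law of total probability for the occasionally dishonest casino:
# the loaded dice (DL) is used 1% of the time.
p_fair, p_loaded = 0.99, 0.01        # P(DF), P(DL)
p6_fair, p6_loaded = 1 / 6, 0.5      # P(6|DF), P(6|DL)

# P(6) = P(6|DF)*P(DF) + P(6|DL)*P(DL)
p6 = p6_fair * p_fair + p6_loaded * p_loaded
print(round(p6, 3))  # 0.17
```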
Joint Probability
P(6) = P(6, DF) + P(6, DL) = P(6 | DF) * P(DF) + P(6 | DL) * P(DL) = 1/6 * 0.99 + 1/2 * 0.01 = 0.17 (0.167 for an unloaded dice)
Unsuspected heterogeneity in the training set → inaccurate parameter estimation.
Bayes' Theorem
X: Model or Data or any Event; Y: Model or Data or any Event
P(Xi | Y) = P(Y | Xi) * P(Xi) / Σi [ P(Y | Xi) * P(Xi) ]
Bayes' Theorem
X: Model or Data or any Event; Y: Model or Data or any Event
XT = X + X̄ (X and its complement)
P(Y) = P(Y, X) + P(Y, X̄)
P(X | Y) = P(Y | X) * P(X) / [ P(Y | X) * P(X) + P(Y | X̄) * P(X̄) ]
Bayes' Theorem
X: Model or Data or any Event; Y: Model or Data or any Event
P(X | Y) = P(Y | X) * P(X) / P(Y)
P(Y | X) * P(X): probability of observing Y AND X simultaneously.
'Remove' P(Y) to get P(X | Y): the probability of observing X IF Y is fulfilled.
Bayes' Theorem
X: Model or Data or any Event; Y: Model or Data or any Event
P(X | Y) = P(X, Y) / P(Y)
P(X, Y): probability of observing Y and X simultaneously; P(X | Y): probability of observing X IF Y is fulfilled. 'Remove' P(Y) to get P(X | Y).
Using Bayes' Theorem
Question: the dice gave three 6s in a row. IS IT LOADED?!
We will use Bayes' theorem to test our belief: if the dice were loaded (the model), what would be the probability of this model given the data (three 6s in a row)?
Using Bayes' Theorem
Question: the dice gave three 6s in a row. IS IT LOADED?!
P(D1) = 0.99, P(D2) = 0.01, P(6 | D1) = 1/6, P(6 | D2) = 1/2
The Occasionally Dishonest Casino…
Using Bayes' Theorem
Question: the dice gave three 6s in a row. IS IT LOADED?!
With X: D2 and Y: 6³, apply P(X | Y) = P(Y | X) * P(X) / P(Y):
P(D2 | 6³) = P(6³ | D2) * P(D2) / [ P(6³ | D1) * P(D1) + P(6³ | D2) * P(D2) ]
(the denominator sums '6³ with D1' and '6³ with D2')
P(D1) = 0.99, P(D2) = 0.01, P(6 | D1) = 1/6, P(6 | D2) = 1/2
Using Bayes' Theorem
Question: the dice gave three 6s in a row. IS IT LOADED?!
P(D2 | 6³) = P(6³ | D2) * P(D2) / [ P(6³ | D1) * P(D1) + P(6³ | D2) * P(D2) ] = 0.21
Probably NOT.
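A quick sketch of the same Bayes computation, using the priors and likelihoods from the slide:

```python
# Posterior probability that the loaded dice (D2) produced three 6s in a row.
p_d1, p_d2 = 0.99, 0.01          # priors
lik_d1 = (1 / 6) ** 3            # P(6^3 | D1)
lik_d2 = 0.5 ** 3                # P(6^3 | D2)

posterior = lik_d2 * p_d2 / (lik_d1 * p_d1 + lik_d2 * p_d2)
print(round(posterior, 2))  # 0.21
```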
Posterior Probability
Question: the dice gave three 6s in a row. IS IT LOADED?!
P(D2 | 6³) = P(6³ | D2) * P(D2) / [ P(6³ | D1) * P(D1) + P(6³ | D2) * P(D2) ] = 0.21
0.21 is a posterior probability: it was estimated AFTER the data was obtained.
P(6³ | D2) is the likelihood of the hypothesis.
Debunking Headlines
"50% of the crimes are committed by Migrants."
Question: are 50% of the Migrants Criminals?
P(Migrant) = 0.1, P(Criminal) = 0.0001, P(M | C) = 0.5
P(C | M) = P(M | C) * P(C) / P(M) = 0.5 * 0.0001 / 0.1 = 0.0005
NO: only 0.05% of Migrants are Criminals (NOT 50%!)
Debunking Headlines
"50% of Gene Promoters contain TATA."
Question: is TATA a good gene predictor?
P(T) = 0.1, P(P) = 0.0001, P(T | P) = 0.5
P(P | T) = P(T | P) * P(P) / P(T) = 0.5 * 0.0001 / 0.1 = 0.0005
NO.
Bayes' Theorem
Bayes' theorem reveals the trade-off between
Sensitivity: finding ALL the genes, and
Specificity: finding ONLY genes.
TATA = high sensitivity / low specificity
Markov Chains
What is a Markov Chain?
Simple chain: one dice
- Each roll is the same
- A roll does not depend on the previous one
Markov chain: two dice
- You only use ONE dice at a time: the fair one OR the loaded one
- The dice you roll only depends on the previous roll
What is a Markov Chain?
Biological sequences tend to behave like Markov chains.
Question/Example: is it possible to tell whether my sequence is a CpG island?
What is a Markov Chain?
Question: identify CpG island sequences.
Old-fashioned solution:
- Slide a window of arbitrary size (the Captain's height!)
- Measure the % of CpG
- Plot it along the sequence
- Decide
Sliding Window Methods
[Figure: a sliding window moved along the sequence; the % CpG is averaged within each window]
What is a Markov Chain?
Question: identify CpG island sequences.
Bayesian solution:
- Build a CpG Markov chain
- Run the sequence through the chain
- What is the likelihood of the chain producing the sequence?
[Diagram: a four-state Markov chain with states A, C, G, T and transitions between every pair of states]
Transition probabilities: the probability of a transition from one state to another, e.g. from G to C:
A_GC = P(x_i = C | x_(i-1) = G)
P(sequence) = P(x_L, x_(L-1), x_(L-2), …, x_1)
Remember: P(X, Y) = P(X | Y) * P(Y)
P(sequence) = P(x_L | x_(L-1)) * P(x_(L-1) | x_(L-2)) * … * P(x_1)
In the Markov chain, x_L only depends on x_(L-1).
P(sequence) = P(x_L | x_(L-1)) * P(x_(L-1) | x_(L-2)) * … * P(x_1)
P(sequence) = P(x_1) * Π_(i=2..L) A_(x_(i-1) x_i)
with A_GC = P(x_i = C | x_(i-1) = G)
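The product formula above can be sketched directly. A toy example with made-up transition values (a real model would use frequencies counted on a training set):

```python
# P(seq) = P(x1) * product over i>=2 of A[x_{i-1}][x_i]
# The numbers below are invented for illustration only.
P1 = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}              # P(x1)
A = {p: {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2} for p in "ACGT"}

def seq_proba(seq):
    p = P1[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= A[prev][cur]
    return p

print(seq_proba("GCGC"))  # 0.25 * 0.3 * 0.3 * 0.3
```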
[Diagram: the A/C/G/T chain with an added state B]
Arbitrary Beginning and End states can be added to the chain.
By convention, only the Beginning state is added.
[Diagram: the A/C/G/T chain with Beginning state B and End state E]
Adding an End state with a transition probability T defines length probabilities:
P(all the sequences of length L) = T * (1 - T)^(L-1)
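The geometric length distribution implied by the end-state transition T can be checked numerically (T = 0.2 here is an arbitrary value): it sums to 1 over all lengths, as a probability distribution must.

```python
# P(length = L) = T * (1 - T)^(L - 1): a geometric distribution over L >= 1.
T = 0.2  # end-state transition probability (arbitrary toy value)
total = sum(T * (1 - T) ** (L - 1) for L in range(1, 10_000))
print(round(total, 6))  # ~1.0
```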
[Diagram: the A/C/G/T chain with states B and E]
The transitions are probabilities: the sum of the probabilities of all the possible sequences of all possible lengths is 1.
Using Markov Chains to Predict
What is a Prediction?
Given a sequence, we want to know the probability that this sequence is a CpG island.
1- We need a training set: CpG+ sequences and CpG- sequences.
2- We will measure the transition frequencies, and treat them like probabilities.
What is a Prediction?
Is my sequence a CpG island?
2- We will measure the transition frequencies, and treat them like probabilities:
A+_GC = N+_GC / Σ_X N+_GX
= ratio between the number of GC transitions and all the transitions G→X.
Transition GC: a G followed by a C, e.g. in GCCGCTGCGCGA
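Counting transition frequencies as described can be sketched in a few lines; on the example string GCCGCTGCGCGA, 4 of the 5 G→x transitions are G→C:

```python
from collections import Counter

def transition_freqs(seqs):
    # Count x_{i-1} -> x_i pairs over the training set, then normalise
    # per 'from' nucleotide so the counts become probabilities.
    counts = Counter()
    for s in seqs:
        counts.update(zip(s, s[1:]))
    totals = Counter()
    for (prev, _nxt), n in counts.items():
        totals[prev] += n
    return {pair: n / totals[pair[0]] for pair, n in counts.items()}

A = transition_freqs(["GCCGCTGCGCGA"])
print(A[("G", "C")])  # 4 of the 5 G->x transitions: 0.8
```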
What is a Prediction?
Is my sequence a CpG island?
2- We will measure the transition frequencies, and treat them like probabilities:

+ model (rows: x_i ; columns: x_(i-1) — each column sums to ~1)
      A     C     G     T
A   0.18  0.17  0.16  0.08
C   0.27  0.36  0.33  0.35
G   0.42  0.27  0.37  0.38
T   0.12  0.18  0.12  0.18

- model (rows: x_i ; columns: x_(i-1))
      A     C     G     T
A   0.30  0.32  0.25  0.17
C   0.21  0.30  0.25  0.24
G   0.28  0.08  0.30  0.29
T   0.21  0.30  0.20  0.29
What is a Prediction?
Is my sequence a CpG island?
3- Evaluate the probability for each of these models to generate our sequence:
P(seq | M+) = Π_(i=1..L) A+_(x_(i-1) x_i)
P(seq | M-) = Π_(i=1..L) A-_(x_(i-1) x_i)
(using the + and - transition tables)
Using the Log Odd
Is my sequence a CpG island?
4- Measure the log odd:
S(seq) = (1/LEN) * log2[ P(seq | M+) / P(seq | M-) ] = (1/LEN) * Σ_i log2( A+_(x_(i-1), x_i) / A-_(x_(i-1), x_i) )
Log odd: a confrontation of the two models. Log2 gives a value in bits (standard). Dividing by LEN gives a less spread-out score distribution.
Using the Log Odd
Is my sequence a CpG island?
4- Measure the log odd:
S(seq) = (1/LEN) * Σ_i log2( A+_(x_(i-1), x_i) / A-_(x_(i-1), x_i) )
Positive: more likely than NOT to be CpG.
Negative: more likely NOT to be CpG.
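The log-odd score can be sketched with the + and - tables from the slides (values as printed, 2 decimal places, stored here as A[previous][next] so each inner dict sums to ~1). A CpG-rich string scores positive, an AT-rich one negative:

```python
import math

# Transition tables A[prev][next], transcribed from the slides (2 d.p.).
A_PLUS = {
    "A": {"A": 0.18, "C": 0.27, "G": 0.42, "T": 0.12},
    "C": {"A": 0.17, "C": 0.36, "G": 0.27, "T": 0.18},
    "G": {"A": 0.16, "C": 0.33, "G": 0.37, "T": 0.12},
    "T": {"A": 0.08, "C": 0.35, "G": 0.38, "T": 0.18},
}
A_MINUS = {
    "A": {"A": 0.30, "C": 0.21, "G": 0.28, "T": 0.21},
    "C": {"A": 0.32, "C": 0.30, "G": 0.08, "T": 0.30},
    "G": {"A": 0.25, "C": 0.25, "G": 0.30, "T": 0.20},
    "T": {"A": 0.17, "C": 0.24, "G": 0.29, "T": 0.29},
}

def log_odds(seq):
    # S(seq) = (1/LEN) * sum_i log2( A+[x_{i-1}][x_i] / A-[x_{i-1}][x_i] )
    s = sum(math.log2(A_PLUS[p][c] / A_MINUS[p][c])
            for p, c in zip(seq, seq[1:]))
    return s / len(seq)

print(log_odds("CGCGCG") > 0, log_odds("ATATAT") < 0)  # True True
```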
Using the Log Odd
Is my sequence a CpG island?
5- Plot the score distribution.
[Figure: histogram of the number of sequences against the score in bits, centred on 0]
Using the Log Odd
Is my sequence a CpG island?
5- Plot the score distribution.
[Figure: overlapping score distributions]
Things can go wrong: bad training set, bad parameter estimation.
Using the Log Odd
Is my sequence a CpG island?
- The Markov chain is a good discriminator.
- Problem: what to do with long sequences that are partly CpG and partly NON-CpG? How can we make a prediction nucleotide per nucleotide?
- We want to uncover the HIDDEN boundaries.
Hidden Markov Models
Hidden Markov Model: Switching Dice
- If you are cheating, you want to switch dice WITHOUT telling!
- The MODEL switch is HIDDEN.
Simple chain: one dice
- Each roll is the same
- A roll does not depend on the previous one
Markov chain: two dice
- You only use ONE dice at a time: the fair one OR the loaded one
- The dice you roll only depends on the previous roll
Using HMMs
Question: I want to find the CpG boundaries.
The chain had four symbols: A, G, C, T.
The model has eight states: A+, A-, G+, G-, C+, C-, T+, T-.
There is no one-to-one correspondence between symbols and states: the state of each symbol is hidden. An A can be either in A+ or A-.
Using HMMs
Question: I want to find the CpG boundaries.
1- Define the model topology:
A+ G+ C+ T+
A- G- C- T-
EVERY transition is possible; e.g. C+ to G- costs more.
Using HMMs
Question: I want to find the CpG boundaries.
2- Parameterise the model: count frequencies (the + and - transition tables shown earlier).
We also need the + to - (and - to +) transition probabilities.
Using HMMs
Question: I want to find the CpG boundaries.
3- FORCE the model to emit your sequence: Viterbi.
One can use the model to emit any sequence. This sequence is named a PATH (π) because it is a walk through the model:
π = G+ C+ G+ C+ T+ C+ C+ C- C- G- T- …
Using HMMs
Question: I want to find the CpG boundaries.
3- FORCE the model to emit your sequence: Viterbi.
The path with the occasionally dishonest casino:
- Switch dice: transition, A_(F,L) = P(π_i = L | π_(i-1) = F)
- Roll the dice: emission; the state L emits a symbol with a probability, P(emit 6 with L) = E_L(6) = P(x_i = 6 | π_i = L) = 0.5
Two states: Fair and Loaded.
Emissions of each state with their probabilities:
Fair:   1: 0.16, 2: 0.16, 3: 0.16, 4: 0.16, 5: 0.16, 6: 0.16
Loaded: 1: 0.10, 2: 0.10, 3: 0.10, 4: 0.10, 5: 0.10, 6: 0.50
Roll the dice: emission, P(emit 6 with L) = E_L(6) = P(x_i = 6 | π_i = L) = 0.5
Switch dice: transition, A_(F,L) = P(π_i = L | π_(i-1) = F)
[Diagram: the eight-state CpG model A+, G+, C+, T+, A-, G-, C-, T-]
8 STATES, 1 EMISSION per state.
Using HMMs
Question: I want to find the CpG boundaries.
3- FORCE the model to emit your sequence: Viterbi.
The path:
- goes from state to state with a probability, A_(G+,C+) = P(π_i = C+ | π_(i-1) = G+)
- in each state it EMITS a symbol with probability 1: P(emit G) = E_G+(G) = P(x_i = G | π_i = G+) = 1
Using HMMs
Question: I want to find the CpG boundaries.
3- FORCE the model to emit your sequence: Viterbi.
We are interested in the joint probability of the PATH π (a chain of states G+, C-, …) with our sequence X:
P(X, π) = A_(0,π_1) * Π_(i=1..L) E_(π_i)(x_i) * A_(π_i, π_(i+1))
Using HMMs
Question: I want to find the CpG boundaries.
3- FORCE the model to emit your sequence: Viterbi.
P(X, π) = A_(0,π_1) * Π_(i=1..L) E_(π_i)(x_i) * A_(π_i, π_(i+1))
π = C+ G- C- G+
X = C  G  C  G
P(X, π) = A_(0,C+) * 1 * A_(C+,G-) * 1 * A_(G-,C-) * 1 * A_(C-,G+) * 1
Using HMMs
Question: I want to find the CpG boundaries.
3- FORCE the model to emit your sequence: Viterbi.
A_(0,C+) * 1 * A_(C+,G-) * 1 * A_(G-,C-) * 1 * A_(C-,G+) * 1 is the probability of ONE path: this is NOT a prediction.
To make a prediction we must identify the best-scoring path:
π* = argmax_π P(x, π)
Using HMMs
Question: I want to find the CpG boundaries.
3- FORCE the model to emit your sequence: Viterbi.
To make a prediction we must identify the best-scoring path:
π* = argmax_π P(x, π)
We do this recursively with the VITERBI algorithm.
[Figure: the Viterbi lattice. For each symbol of the sequence (G, C, G, A, …) there is a column with the eight states A+, G+, C+, T+, A-, G-, C-, T-; each state keeps a pointer to the best previous state (e.g. G+ → C+ → G- → …)]
[Figure: trace back. Starting from the best final state and following the pointers backwards through the lattice recovers the best path, e.g. G+ C+ G- A- G- C-]
Viterbi
Initialisation: V_0(0) = 1, V_k(0) = 0 for every k > 0
Recursion (i = 1..L):
  V_l(i) = E_l(x_i) * max_k ( V_k(i-1) * A_(k,l) )
  ptr_i(l) = argmax_k ( V_k(i-1) * A_(k,l) )
Termination: P(x, π*) = max_k ( V_k(L) * A_(k,0) )
- k and l are two states
- V_k(i): score of the best path for x_1 … x_i that finishes in state k at position i
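The recursion can be sketched compactly on the occasionally dishonest casino (two states; the switching probabilities 0.95/0.05 and 0.10/0.90 and the 50/50 start are assumptions, not values from the slides). Real implementations work in log space to avoid underflow; plain products are fine for a toy sequence:

```python
STATES = ["F", "L"]
A = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}  # assumed
E = {"F": {r: 1 / 6 for r in "123456"},
     "L": {**{r: 0.1 for r in "12345"}, "6": 0.5}}
START = {"F": 0.5, "L": 0.5}

def viterbi(rolls):
    # V[l] = E_l(x_i) * max_k V_prev[k] * A[k][l], with back-pointers.
    V = [{l: START[l] * E[l][rolls[0]] for l in STATES}]
    ptr = []
    for x in rolls[1:]:
        col, back = {}, {}
        for l in STATES:
            best_k = max(STATES, key=lambda k: V[-1][k] * A[k][l])
            col[l] = E[l][x] * V[-1][best_k] * A[best_k][l]
            back[l] = best_k
        V.append(col)
        ptr.append(back)
    path = [max(STATES, key=lambda k: V[-1][k])]   # best final state
    for back in reversed(ptr):                     # trace back
        path.append(back[path[-1]])
    return "".join(reversed(path))

print(viterbi("1234512366666"))  # fair prefix, then a loaded run of 6s
```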
Initialisation: V_0(0) = 1, V_k(0) = 0 for every k > 0 (k and l are two states)
Recursion (i = 1..L): V_l(i) = E_l(x_i) * max_k ( V_k(i-1) * A_(k,l) )
Multiplying probabilities can cause an underflow problem. Usually, probability multiplications are replaced with log additions:
log(a * b) = log(a) + log(b)
Using HMMs
Question: I want to know the probability of my sequence given the model.
In theory, you must sum over ALL the possible paths. In practice: π* is a good approximation.
Using HMMs
Question: I want to know the probability of my sequence given the model.
π* is a good approximation, but… the Forward algorithm gives the exact value of P(x).
Viterbi
Initialisation: V_0(0) = 1, V_k(0) = 0 for every k > 0 (k and l are two states)
Recursion (i = 1..L): V_l(i) = E_l(x_i) * max_k ( V_k(i-1) * A_(k,l) )
Termination: P(x, π*) = max_k ( V_k(L) * A_(k,0) )

Forward
Initialisation: F_0(0) = 1, F_k(0) = 0 for every k > 0
Recursion (i = 1..L): F_l(i) = E_l(x_i) * Σ_k ( F_k(i-1) * A_(k,l) )
Termination: P(x) = Σ_k ( F_k(L) * A_(k,0) )
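The Forward recursion is Viterbi with max replaced by a sum. A sketch on the same two-state casino model (transition and start values are assumptions); as a sanity check, P(x) summed over all length-1 sequences is 1:

```python
A = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}  # assumed
E = {"F": {r: 1 / 6 for r in "123456"},
     "L": {**{r: 0.1 for r in "12345"}, "6": 0.5}}
START = {"F": 0.5, "L": 0.5}

def forward(rolls):
    # F_l(i) = E_l(x_i) * sum_k F_k(i-1) * A[k][l]; P(x) = sum_k F_k(L)
    F = {l: START[l] * E[l][rolls[0]] for l in A}
    for x in rolls[1:]:
        F = {l: E[l][x] * sum(F[k] * A[k][l] for k in A) for l in A}
    return sum(F.values())

print(round(sum(forward(r) for r in "123456"), 6))  # 1.0
```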
[Figure: the same lattice traversed by both algorithms. Viterbi keeps the MAX over the incoming transitions at each state; Forward sums them.]
Posterior Decoding of Hidden Markov Models
Why Posterior Decoding?
- Viterbi is BRUTAL!!! It does not associate individual predictions with a probability.
Question: what is the probability that nucleotide 1300 really is a CpG boundary?
ANSWER: the Backward algorithm.
Posterior Decoding?
Question: what is the probability that nucleotide 1300 really is a CpG boundary?
P(X, π_i = l): the probability of the sequence X WITH position i in state l.
Posterior Decoding
P(x, π_i = l) = P(x_1 … x_i, π_i = l) * P(x_(i+1) … x_L | π_i = l)
The first factor is computed by the Forward algorithm, the second by the Backward algorithm.
Forward
Initialisation: F_0(0) = 1, F_k(0) = 0 for every k > 0
Recursion (i = 1..L): F_l(i) = E_l(x_i) * Σ_k ( F_k(i-1) * A_(k,l) )
Termination: P(x) = Σ_k ( F_k(L) * A_(k,0) )

Backward
Initialisation: B_k(L) = A_(k,0) for every k
Recursion (i = L-1..1): B_l(i) = Σ_k ( A_(l,k) * E_k(x_(i+1)) * B_k(i+1) )
Termination: P(x) = Σ_k ( A_(0,k) * E_k(x_1) * B_k(1) )
Forward recursion (i = 1..L): F_l(i) = E_l(x_i) * Σ_k ( F_k(i-1) * A_(k,l) )
Backward recursion (i = L-1..1): B_l(i) = Σ_k ( A_(l,k) * E_k(x_(i+1)) * B_k(i+1) )
P(π_i = l, X) = F_l(i) * B_l(i)
P(π_i = l, X) = P(π_i = l | X) * P(X), therefore
P(π_i = l | X) = F_l(i) * B_l(i) / P(X)
with P(X) obtained from the termination of either algorithm.
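Forward and Backward combine into a posterior probability per position. A sketch on the casino model (transition and start values are assumptions; no explicit end state, so B is initialised to 1):

```python
A = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}  # assumed
E = {"F": {r: 1 / 6 for r in "123456"},
     "L": {**{r: 0.1 for r in "12345"}, "6": 0.5}}
START = {"F": 0.5, "L": 0.5}

def posteriors(x):
    # P(pi_i = l | x) = F_l(i) * B_l(i) / P(x)
    F = [{l: START[l] * E[l][x[0]] for l in A}]
    for c in x[1:]:
        F.append({l: E[l][c] * sum(F[-1][k] * A[k][l] for k in A) for l in A})
    B = [{l: 1.0 for l in A}]   # no end state: B at the last position is 1
    for c in reversed(x[1:]):
        B.insert(0, {l: sum(A[l][k] * E[k][c] * B[0][k] for k in A) for l in A})
    px = sum(F[-1].values())
    return [{l: F[i][l] * B[i][l] / px for l in A} for i in range(len(x))]

post = posteriors("1266666621")
print(round(post[4]["L"], 2))  # loaded-state posterior mid-run of 6s
```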
[Figure: the posterior probability P(π_i = l | X) plotted along the sequence, compared with a sliding-window profile]
Free from the sliding window of arbitrary size!!!
Posterior decoding is less sensitive to the parameterisation of the model.
Training HMMs
Training HMMs?
Case 1 - A set of annotated data: parameters can be estimated on this data, where the PATH is known.
Case 2 - NO annotated data, just a model:
- Parameterise the model so that P(Model | Data) is maximal
- Start with random parameters
- Iterate using Baum-Welch, Viterbi training, or EM
Training HMMs is difficult!!!
What Matters About Hidden Markov Models
HMMs and Markov Chains
Bayes' Theorem
- Markov chain: when there is no hidden state.
- Hidden Markov Model: when a nucleotide can be in different HIDDEN states.
Three Algorithms for HMMs
Viterbi: makes the state assignments; predicts.
Forward: evaluates the sequence probability under the considered model.
Backward and posterior decoding: evaluate the probability of the prediction; window-free.
Applications of HMMs
What To Do with an HMM?
Transmembrane domain predictions
www.cbs.dtu.dk/services/TMHMM/
What To Do with an HMM?
RNA structure prediction / fold recognition
SCFG: Stochastic Context-Free Grammars (Sean Eddy)
What To Do with an HMM?
Gene Prediction
State-of-the-art methods use HMMs.
GeneMark: prokaryotes
GenScan: eukaryotes

GeneMark
A Typical HMM for Coding DNA
[Diagram: a begin state S, 64 codon states and an end state E. Each state emits one codon, e.g. W emits TGG with probability 1.00; the arcs between codon states carry transition probabilities, e.g. from G (GGG): GGG 0.02, →A 0.00, →T 0.6, →C 0.38]
Emission: codon frequency. Transition: dipeptide frequency.
GeneMark HMM
An HMM of order 5: the 6th nucleotide depends on the 5 previous ones.
P(seq GGG-TGG | Model) = P(GGG) * P(GGG→TGG) * P(TGG)
This takes into account codon bias AND dipeptide composition.
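An order-k chain conditions each symbol on the k previous ones (GeneMark uses k = 5). A minimal sketch with k = 2 and invented probabilities, just to show the scoring scheme:

```python
def order_k_proba(seq, k, cond, initial):
    # P(seq) = P(first k-mer) * product_{i>k} P(x_i | previous k symbols)
    p = initial[seq[:k]]
    for i in range(k, len(seq)):
        p *= cond[(seq[i - k:i], seq[i])]
    return p

initial = {"GG": 0.1}                                          # toy P(x1 x2)
cond = {("GG", "T"): 0.3, ("GT", "G"): 0.2, ("TG", "G"): 0.4}  # toy values
print(order_k_proba("GGTGG", 2, cond, initial))  # 0.1 * 0.3 * 0.2 * 0.4
```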
What To Do with an HMM?
Family and domain identification: Pfam, SMART, Prosite Profiles
What To Do with an HMM?
Bayesian phylogenetic inference
[Figure: a phylogenetic tree relating chite, wheat, trybr and mouse sequences]
morphbank.ebc.uu.se/mrbayes/manual.php
What To Do with an HMM?
Metabolic Networks: Bayesian Networks
www.cs.huji.ac.il/~nirf/
Collections of Domain HMMs
What is a Domain HMM?
SAM, HMMER, PFtools
[Figure: a profile HMM with its emission probabilities]
Using Domain HMMs
Question: I want to compare my HMM with all the sequences in SwissProt.
Very similar to dynamic programming; requires an adapted Viterbi: the pair-HMM.
Using Domain HMMs
Question: what are the available collections of pre-computed HMMs?
InterPro unites many collections.
InterPro: The Idea of Domains
InterPro: A Federation of Databases
Using InterPro: Asking a Question
Which domains does the oncogene FosB contain?
Finding Domains
- How can I be sure that the domain prediction for my protein is real?
Use the EMBnet pfscan.

Using EMBnet pfscan
Posterior Decoding with EMBnet pfscan
An important position that is well conserved in our sequence.
[Figure: posterior vs prior profiles]
The Inside of Pfam
A Typical Pfam Domain
The HMMER Package
Going Further: Building and Using HMMs
HMMER2: hmmer.wustl.edu/ — used to create and distribute Pfam
PFtools: www.isrec.isb-sib.ch/ftp-server/pftools/ — used to create and distribute Prosite
SAM T02: www.cse.ucsc.edu/research/compbio/sam.html
EMBOSS Online
www.hgmp.mrc.ac.uk/SOFTWARE/EMBOSS
Jemboss: a Java applet interacting with an EMBOSS server

HMMER

EMBASSY (HMMER)
In the End: Markov Uncovered
HMMs and Markov Chains
- Domain collections
- Gene prediction
- Bayesian phylogenetic inference
- Profile HMMs / generalized profiles
- Interactive tools