How to find foreign genes? Markov Models AAAA: 10% AAAC: 15% AAAG: 40% AAAT: 35% AAA AAC AAG AAT ACA . . . TTG TTT Training Set Building the model

Upload: winifred-hamilton

Post on 18-Dec-2015


TRANSCRIPT

Page 1

How to find foreign genes? Markov Models

AAAA: 10%

AAAC: 15%

AAAG: 40%

AAAT: 35%

AAA AAC AAG AAT ACA . . . TTG TTT

Training Set

Building the model

Page 2

How to find foreign genes? Markov Models

Probability of the next base (columns) given the preceding three bases (rows):

       A     C     G     T
AAA   0.10  0.15  0.40  0.35
AAC   0.25  0.45  0.25  0.05
AAG   0.25  0.20  0.30  0.25
AAT   0.25  0.20  0.30  0.25
ACA   0.15  0.20  0.25  0.40
. . .
TTG   0.20  0.50  0.05  0.25
TTT   0.10  0.55  0.25  0.10

Candidate gene

AAAACAA…

P(A | AAA) = 0.10

3rd order Markov model
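As a rough illustration of how the table could be applied to the candidate gene, the sketch below multiplies the lookups for the visible bases AAAACAA. It only contains the rows shown on the slide (the elided rows stay elided), the names `TRANSITIONS`, `next_base_probability` and `score_prefix` are made up for this example, and it assumes the probability of the first three bases themselves is not modelled.

```python
# Rows copied from the slide; the rows hidden behind ". . ." are deliberately left out.
BASES = "ACGT"
TRANSITIONS = {
    "AAA": (0.10, 0.15, 0.40, 0.35),
    "AAC": (0.25, 0.45, 0.25, 0.05),
    "AAG": (0.25, 0.20, 0.30, 0.25),
    "AAT": (0.25, 0.20, 0.30, 0.25),
    "ACA": (0.15, 0.20, 0.25, 0.40),
    "TTG": (0.20, 0.50, 0.05, 0.25),
    "TTT": (0.10, 0.55, 0.25, 0.10),
}


def next_base_probability(context, base):
    """Look up P(base | previous three bases) in the table."""
    return TRANSITIONS[context][BASES.index(base)]


def score_prefix(sequence):
    """Multiply the conditional probabilities of every base after the first three.

    Assumption: the first 3-mer is taken as given, so it contributes no factor.
    """
    probability = 1.0
    for i in range(3, len(sequence)):
        probability *= next_base_probability(sequence[i - 3:i], sequence[i])
    return probability


first_step = next_base_probability("AAA", "A")  # the 0.10 shown on the slide
# Lookups used: AAA->A, AAA->C, AAC->A, ACA->A
prefix_score = score_prefix("AAAACAA")
print(first_step, prefix_score)
```

Scoring further into the gene would need the rows that the slide omits, so the sketch stops at the bases that are visible.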

Page 3

Markov Chains

A traffic light considered as a sequence of states

A trivial Markov chain: the transition probability between the states is always 1

$P_{gy} = 1$

$P_{yr} = 1$

$P_{rg} = 1$

Page 4

Markov Chains

A traffic light considered as a sequence of states

If we watch our traffic light, it will emit a string of states.

In the case of a simple Markov model, the state labels (e.g. green, red, yellow) are the observable outputs of the process.
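To make "emitting a string of states" concrete, here is a minimal sketch of the trivial traffic light, where every transition has probability 1. The names `NEXT_STATE` and `emit_states` are invented for the example.

```python
# Deterministic transitions of the trivial chain: each one happens with probability 1.
NEXT_STATE = {"green": "yellow", "yellow": "red", "red": "green"}


def emit_states(start, count):
    """Return `count` observed states (count >= 1), beginning with `start`."""
    states = [start]
    for _ in range(count - 1):
        states.append(NEXT_STATE[states[-1]])
    return states


observed = emit_states("green", 7)
print(" ".join(observed))  # green yellow red green yellow red green
```

Because the state labels are the outputs, watching the light and listing the states are the same thing here.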

Page 5

Markov Chains

An occasionally malfunctioning traffic light!

The Markov property: the probability of observing a given next state depends only on the current state!

$P_{gy} = 1$

$P_{yr} = 0.9$

$P_{rg} = 0.85$

$P_{ry} = 0.15$

$P_{yg} = 0.10$
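One way to hold these numbers in code is a nested dictionary keyed by the current state. The sketch below checks that the probabilities leaving each state sum to 1 and then draws a random string of states; the helper names, the single-letter state labels and the fixed seed are assumptions made only for this illustration.

```python
import random

# Transition probabilities from the slide: TRANSITIONS[current][next]
TRANSITIONS = {
    "g": {"y": 1.0},
    "y": {"r": 0.9, "g": 0.10},
    "r": {"g": 0.85, "y": 0.15},
}


def check_rows(transitions, tol=1e-9):
    """True if the probabilities leaving every state sum to 1."""
    return all(abs(sum(row.values()) - 1.0) < tol for row in transitions.values())


def simulate(start, count, rng):
    """Emit `count` states; each draw depends only on the current state."""
    states = [start]
    for _ in range(count - 1):
        row = TRANSITIONS[states[-1]]
        states.append(rng.choices(list(row), weights=list(row.values()))[0])
    return "".join(states)


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
sample = simulate("g", 20, rng)
print(check_rows(TRANSITIONS), sample)
```

Note that `simulate` never looks further back than `states[-1]`, which is the Markov property in code form.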

Page 6

Markov Chains

The Markov Property

$a_{st} = P(x_i = t \mid x_{i-1} = s)$

English Translation:

The transition probability $a_{st}$ from state s to state t is equal to the probability that the i-th state is t, given that the immediately preceding state ($x_{i-1}$) was s.

This is a form of conditional probability.

Page 7

Markov Chain

Now we can consider the probability of an observed sequence!

An occasionally malfunctioning traffic light!!

Page 8

Markov Chains

What is the probability of a chain of events x?

$P(x) = P(x_L, x_{L-1}, \ldots, x_1)$

English Translation:

The probability of observing the sequence of states x is equal to the probability that the L-th state was whatever it was ($x_L$) AND the (L-1)-th state was whatever it was ($x_{L-1}$), AND so on down to the first state.

This is a form of joint probability.

Page 9

Markov Chains

What is the probability of a chain of events x?

$P(x) = P(x_L, x_{L-1}, \ldots, x_1)$

$= P(x_L \mid x_{L-1}, \ldots, x_1) \, P(x_{L-1} \mid x_{L-2}, \ldots, x_1) \cdots P(x_1)$

This is because $P(X, Y) = P(X \mid Y) \, P(Y)$

English Translation:

The probability of events X AND Y happening is equal to the probability of X happening given that Y has already happened, times the probability of event Y.

Page 10

Markov Chains

What is the probability of a chain of events x?

$P(x) = P(x_L \mid x_{L-1}, \ldots, x_1) \, P(x_{L-1} \mid x_{L-2}, \ldots, x_1) \cdots P(x_1)$

But remember, the key property of a Markov chain is that the probability of symbol $x_i$ depends ONLY on the value of the preceding symbol $x_{i-1}$! Therefore:

$P(x) = P(x_L \mid x_{L-1}) \, P(x_{L-1} \mid x_{L-2}) \cdots P(x_2 \mid x_1) \, P(x_1)$

$P(x) = P(x_1) \prod_{i=2}^{L} a_{x_{i-1} x_i}$
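The product formula can be tried out on the malfunctioning traffic light. The sketch below is only an illustration: the transition values come from the earlier slide, while the initial distribution `INITIAL` (we start watching when the light is green) and the function name `sequence_probability` are assumptions, since the slides do not give P(x1).

```python
# Transition probabilities of the malfunctioning light: TRANSITIONS[previous][current]
TRANSITIONS = {
    "g": {"y": 1.0},
    "y": {"r": 0.9, "g": 0.10},
    "r": {"g": 0.85, "y": 0.15},
}

# Assumption: observation always starts on green, so P(x1 = g) = 1.
INITIAL = {"g": 1.0}


def sequence_probability(states, initial):
    """P(x) = P(x1) times the product of a[x(i-1)][x(i)] for i = 2..L."""
    probability = initial.get(states[0], 0.0)
    for previous, current in zip(states, states[1:]):
        # A transition missing from the table has probability 0.
        probability *= TRANSITIONS[previous].get(current, 0.0)
    return probability


p_normal = sequence_probability("gyrg", INITIAL)    # 1 * 1 * 0.9 * 0.85
p_glitch = sequence_probability("gygy", INITIAL)    # 1 * 1 * 0.10 * 1
p_impossible = sequence_probability("gr", INITIAL)  # green never jumps to red
print(p_normal, p_glitch, p_impossible)
```

For long sequences the product becomes very small, so summing logarithms of the factors is a common alternative to multiplying them directly.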

Page 11

Markov Chains

How about nucleic acid sequences?

There is no reason why nucleic acid sequences found in an organism cannot be modeled using Markov chains.

[Figure: the four states A, C, G, T]

Page 12

Markov Model

What do we need to probabilistically model DNA sequences?

States: A, C, G, T

Transition probabilities

The states are the same for all organisms, so the transition probabilities are the model parameters we need to estimate.

Page 13

Parameter estimation

AAAA: 10%

AAAC: 15%

AAAG: 40%

AAAT: 35%

AAA AAC AAG AAT ACA . . . TTG TTT

Training Set

Building the Markov Model

This is a maximum likelihood approach to parameter estimation. Such procedures maximize the overall probability of the training set data.
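For a Markov chain, the maximum likelihood estimate of each transition probability is the observed frequency: count how often each base follows each context and divide by how often the context occurs. The sketch below does this for a 3rd order model; `estimate_transitions` is a made-up name and the two training sequences are toy data invented for the example, not taken from the slides.

```python
from collections import Counter, defaultdict


def estimate_transitions(training_sequences, order=3):
    """Estimate P(next base | previous `order` bases) by counting (order+1)-mers."""
    counts = defaultdict(Counter)
    for sequence in training_sequences:
        for i in range(order, len(sequence)):
            counts[sequence[i - order:i]][sequence[i]] += 1

    model = {}
    for context, next_counts in counts.items():
        total = sum(next_counts.values())
        # Relative frequency = maximum likelihood estimate; unseen bases get 0.
        model[context] = {base: next_counts[base] / total for base in "ACGT"}
    return model


training_set = ["AAAGAAAT", "AAAGAAAC"]  # toy data, purely illustrative
model = estimate_transitions(training_set)
print(model["AAA"])  # {'A': 0.0, 'C': 0.25, 'G': 0.5, 'T': 0.25}
```

With plain counting, any transition that never appears in the training set gets probability 0, so a real training set has to be large enough to cover the contexts of interest.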

Page 14

Markov Model

Which model best explains a newly observed sequence?

Each organism will have different transition probability parameters, so you can ask "was the sequence more likely to be generated by model A or model B?"

[Figure: two four-state diagrams (A, C, G, T), one for Organism A and one for Organism B]

Page 15

Markov Model

Which model best explains a newly observed sequence?

A commonly used metric for discrimination using Markov chains is the log-odds ratio:

$S(x) = \log \frac{P(x \mid \text{model A})}{P(x \mid \text{model B})} = \sum_{i=1}^{L} \log \frac{a^{A}_{x_{i-1} x_i}}{a^{B}_{x_{i-1} x_i}}$
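A small sketch of the log-odds score for first-order models follows. Everything numeric in it is hypothetical: `MODEL_A` is an invented GC-rich model and `MODEL_B` an invented uniform one. It also assumes both models give the first symbol the same probability, so that term cancels and the sum runs over the remaining transitions, and it uses natural logarithms, which the slide does not specify.

```python
import math

BASES = "ACGT"

# Hypothetical models, MODEL[previous][current]; not taken from the slides.
MODEL_A = {s: {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1} for s in BASES}  # GC-rich
MODEL_B = {s: {t: 0.25 for t in BASES} for s in BASES}                  # uniform


def log_odds(sequence, model_a, model_b):
    """Sum log(aA / aB) over every consecutive pair of symbols.

    Assumption: the first symbol is equally likely under both models,
    so its term is log(1) = 0 and is skipped.
    """
    score = 0.0
    for previous, current in zip(sequence, sequence[1:]):
        score += math.log(model_a[previous][current] / model_b[previous][current])
    return score


gc_score = log_odds("GCGC", MODEL_A, MODEL_B)  # positive: looks like model A
at_score = log_odds("ATAT", MODEL_A, MODEL_B)  # negative: looks like model B
print(gc_score, at_score)
```

A positive score favours model A and a negative score favours model B, which is how a candidate gene could be flagged as looking more like one organism than another.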