evolutionary models for multiple sequence alignment cbb/cs 261 b. majoros

33
Evolutionary Models for Multiple Sequence Alignment CBB/CS 261 B. Majoros

Upload: cathleen-chase

Post on 02-Jan-2016

219 views

Category:

Documents


3 download

TRANSCRIPT

Evolutionary Models for Multiple Sequence Alignment

CBB/CS 261

B. Majoros

Part I

Evolutionary Sequence Models

The Utility of Evolutionary ModelsThe Utility of Evolutionary Models

Evolutionary sequence models make use of the assumption that natural selection operates more strongly on some genomic features than others (i.e., functional versus non-functional DNA elements), resulting in a detectable bias in sequence conservation for the features of interest.

More generally, conservation patterns may differ between levels of DNA organization (i.e., amino acids in coding segments, versus individual nucleotides in conserved noncoding elements).

(Siepel et al., 2005)

push

...

Recall: Multivariate HMMsRecall: Multivariate HMMs

CCATATAATCCCAGGCTCCGCTTCA

AGATATCATCAGATGCTACGTATCT

ACATAATATCCGAGGCTCCGCTTCG

Each state emits N residues, one per track—i.e., a column of a multiple alignment.

Thus, each state must have a model for the joint distribution of the tracks (to represent emission probabilities).

Q: how might we model dependencies between tracks?

N tracks

single emission from a single state

Non-independence of SequencesNon-independence of SequencesDue to their common ancestry, genomic sequences for related taxa are not independent. We can control for that non-independence by explicitly modeling their dependence structure using a phylogenetic tree:

We will see later that a phylogenetic tree (or “phylogeny”) can be interpreted as a special type of Bayesian network, in which sequence conservation probabilities are expressed as a function of the branch lengths.

Branch lengths represent evolutionary distance, which conflates the distinct phenomena of elapsed time and substitution rate (as well as selection and drift).

PhyloHMMsPhyloHMMsA PhyloHMM is a discrete multivariate HMM in which each state qi has an associated evolution model i describing the expected rates and patterns of evolution in the class of features represented by that state.

(Siepel et al., 2005)

Evaluating the Emission ProbabilityEvaluating the Emission Probability

G

E

A B C D

F

P(G)

P(E|G)

P(A|E)

P(C|F

)P(B|E)

P(F|G)

P(D|F)

P(A,B,C,D) = P(G)P(E G)P(A E)P(B E)P(F G)P(C F)P(D F)G ,E ,F

P(observables) = P(root) P(v parent(v))nonroot

v

∏ ⎛

⎜ ⎜ ⎜

⎟ ⎟ ⎟unobservables

...A...

...B...

...C...

...D...

human

chimp

mouse

rat

observables

unobservables

alignmentBayesian network

The likelihood can be computed using a recursion known as Felsenstein’s pruning algorithm:

Lu(a) =

δ (u,a) if u is a leaf

Lc (b)P(c = b | u = a)b∈α

∑c∈

children(u)

∏ otherwise

⎨ ⎪ ⎪

⎩ ⎪ ⎪

P(c=b|u=a) = probability of observing b in the child, given that we observe a in the parent. We can model this using a matrix of substitution probabilities, parameterized by the evolutionary time t that has passed between the ancestor and descendant taxa:

A Recursion for the Emission LikelihoodA Recursion for the Emission Likelihood

P(t) =

pA→A pA→C pA→G pA→T

pC→A pC→C pC→G pC→T

pG→A pG→C pG→G pG→T

pT →A pT →C pT →G pT →T

⎢ ⎢ ⎢ ⎢

⎥ ⎥ ⎥ ⎥

ancestor

descendant A C G T A

C G

T

=P(descendents of u|u=a).

P(t)

Q

Substitution Matrix vs. Rate MatrixSubstitution Matrix vs. Rate Matrix

There is an important distinction between a substitution matrix and an instantaneous rate matrix.

= Substitution matrix. Gives the probabilities of substitutions for a specific branch length, t (“time”).

= Instantaneous rate matrix. Gives instantaneous rates of substitutions (not parameterized by time).

Given Q and a set of phylogeny branch lengths {ti}, we can compute a substitution matrix P(ti) for each branch...

One Q, Many P(t)’sOne Q, Many P(t)’s

P(t1)

P(t) = eQt

P(t2 )

P(t3)€

P(t5 )

P(t6 )€

P(t4 )

human

kangaroo

dolphin

whale

Substitution models are typically based on continuous-time Markov chains. The

Markov property for CTMCs states that: )()()( stst PPP =+

QPPP

P

IPPPPPP

)()0()(

lim)(

)()()(lim

)()(lim

)(

0

00

tt

tt

t

ttt

t

ttt

dt

td

t

tt

=Δ−Δ

=

Δ−Δ

−Δ+=

→Δ

→Δ→Δ

We can derive an instantaneous rate matrix Q from P(t), where we make use of the fact that P(0)=I.

Continuous-time Markov Chains (CTMCs)Continuous-time Markov Chains (CTMCs)

does not depend on t

...!2!

)(22

0

+++=== ∑∞

=

ttI

nt

etn

nnt Q

QQ

P Q

eQt (the “matrix exponential”) denotes a Taylor expansion, which we can solve via spectral (eigenvector) decomposition:

12

1

)( −

⎥⎥⎥⎥

⎢⎢⎢⎢

= GGP

nt

t

t

e

ee

λ

λ

O12

1

⎥⎥⎥

⎢⎢⎢

= GGQ

λλ

O

The solution to this differential equation is:

Reversibility:

ij(πiPij(t)=πjPji(t))

where πi is the equilibrium frequency of base i.

Desirable Properties of Substitution MatricesDesirable Properties of Substitution Matrices

Transition/Transversion Discrimination:

Using different parameters for transition and transversion rates.Transitions: purinepurine,

Transversions: purinepyrimidine

pyrimidinepyrimidine

(“detailed balance”)

⎥⎥⎥

⎢⎢⎢

−−

−−

=αααααααααααα

JKQ

⎥⎥⎥

⎢⎢⎢

−−

−−

=

βαβββααβββαβ

PK 2Q⎥⎥⎥

⎢⎢⎢

−−

−−

=

GCA

TCA

TGA

TGC

FEL

απαπαπαπαπαπαπαπαπαπαπαπ

Q

⎥⎥⎥

⎢⎢⎢

−−

−−

=

GCA

TCA

TGA

TGC

HKY

βπαπβπβπβπαπαπβπβπβπαπβπ

Q⎥⎥⎥

⎢⎢⎢

−−

−−

=

GCA

TCA

TGA

TGC

REV

τπωπχπτπκπαπωπκπβπχπαπβπ

Q

Jukes-Cantor: Kimura: Felsenstein:

Hasegawa, Kishino, Yano: General reversible model:

Some Common Forms for QSome Common Forms for Q

these assume uniform equilibrium frequencies!

Part II

Multiple Alignment

Recall: Pairwise alignment with PHMMsRecall: Pairwise alignment with PHMMs

• Emission probabilities assess similarity between aligned residues

• Transition probabilities can be used to penalize gaps

• Viterbi decoding finds the optimal alignment

I D

M δ

ε

δ

ε

1-ε

1-2δ

δ = gap open ε = gap extend

BLO

SU

M62

Pair HMM Triple HMM Quadruple HMM ?Pair HMM Triple HMM Quadruple HMM ?

pair HMM

triple HMM

quadruple HMM

(Holmes & Bruno, 2001)

aligns 2 species

aligns 3 species

aligns 4 species

#gallons of water in the Pacific ocean

for 100bp sequences...

Progressive Alignment: One Pair at a TimeProgressive Alignment: One Pair at a Time

sequence-to-sequence alignment

profile-to-profile alignment

-AAGTAGCATACG--GGCA-ATT-AACT--CATACG--CCCAGATTT-AGT--CATACGGG--CA-ATCT-AGT--CGTATGGGGGCA-GTT

AAGTAGCATACGGGCA-ATTAACT--CATACGCCCAGATT

TAGTCATACGGG--CAATCTAGTCGTATGGGGGCAGTT

AAGTAGCATACGGGCAATT

AACTCATACGCCCAGATT

TAGTCATACGGGCAATC

TAGTCGTATGGGGGCAGTT

• First, align the leaves of the tree (using a Pair HMM)

• Then align ancestral taxa, using either a “consensus” sequence for ancestors, or averaging over all pairs of leaf residues

A more principled approach: model the ancestral sequences explicitly, using a probabilistic evolutionary model...

P S

J K L

A B C D E F G H I

M N

R

O

TQAB-C--D-E---FG----HI

Networks of ResiduesNetworks of Residues

The multiple alignment problem is precisely the problem of inferring the network of residue homologies—i.e., the evolutionary history of each base.

Problem:Align the sequences ABC, DE, FG, and HI.

Solution:

Building the NetworkBuilding the Network

AB

C

D E

FG

H I

ABC DE FG HI

Building the NetworkBuilding the Network

AB

C

D E

FG

H I

ABC DE FG HI

FG--HI

Building the NetworkBuilding the Network

AB

C

D E

FG

H I F G H I

M N O

ABC DE FG HI

unobservables

observables

FG--HI

Building the NetworkBuilding the Network

AB

C

D E

FG

H I F G H I

M N O

ABC DE FG HI

FG--HI

ABC-DE

Building the NetworkBuilding the Network

AB

C

D E

FG

H I

J K L

A B C D E

F G H I

M N O

ABC DE FG HI

FG--HI

ABC-DE

Building the NetworkBuilding the Network

AB

C

D E

FG

H I

J K L

A B C D E

F G H I

M N O

JK

L

M N O

ABC DE FG HI

Building the NetworkBuilding the Network

AB

C

D E

FG

H I

J K L

A B C D E

F G H I

M N O

JK

L

M N O

ABC DE FG HI

JK-L---MNO

Building the NetworkBuilding the Network

P S

J K L

A B C D E F G H I

M N

AB

C

D E

FG

H I

R

O

T

J K L

A B C D E

F G H I

M N O

JK

L

M N O

ABC DE FG HI

Q

JK-L---MNO

Building the NetworkBuilding the Network

P S

J K L

A B C D E F G H I

M N

AB

C

D E

FG

H I

R

O

T

J K L

A B C D E

F G H I

M N O

JK

L

M N O

ABC DE FG HI

QAB-C--D-E---FG----HI

Evaluating Emission ProbabilitiesEvaluating Emission Probabilities

JK

L

AB

CD

E

F G H I

M N O

X

P(B,D,G,H) = P(X)P(K X)P(B K)P(DK)P(N X)P(G N)P(H N)X ,K ,N

P(observables) = P(root) P(v parent(v))nonroot

v

∏ ⎛

⎜ ⎜ ⎜

⎟ ⎟ ⎟unobservables

K

BD

G H

N

XP(X)

P(N

|X)

P(H|N)P(

G|N

)

P(K|X)

P(D|K

)

P(B|K)

= a sequence S

= (B,H), a “Branch-HMM” (transducer) B describing the evolutionary process whereby the child evolves from the parent, and the actual indel history H which is a specific realization of this process (a “draw”)

Sampling AlignmentsSampling Alignments

Sampling of alignments proceeds by sampling pairwise “branch alignments” (or “indel histories”) H that live within the yellow squares. An indel history is simply a path through a Pair HMM.

Sampling branch alignments is simple: just sample from a PHMM via Forward or Backward:

P(y k y k −1,i, j,S1,S2) =

Bi+1, j +1,y kPt (y k y k −1)Pe (si,1,s j,2 y k )

Bi, j,y k−1

if y k ∈QM

Bi, j +1,y kPt (y k y k −1)Pe (−,s j ,2 y k )

Bi, j,y k−1

if y k ∈QI

Bi+1, j ,y kPt (y k y k −1)Pe (si,1,−y k )

Bi, j,y k−1

if y k ∈QD

⎪ ⎪ ⎪

⎪ ⎪ ⎪

Posterior Alignment MatrixPosterior Alignment Matrix

sequence 1...

sequ

ence

2..

.

pixel intensity = posterior probability of a match in that cell

(posterior probability: conditional on the full input sequences)

Banding: Reduce the Search SpaceBanding: Reduce the Search Space

Block Rearrangements are a Problem!Block Rearrangements are a Problem!

Translocation Inversion

Duplication

The simple case:

• Optimal MSA computation is intractible in the general case

• Progressive alignment is more tractible, but is greedy

• Iterative refinement attempts to undo greedy decisions

• PairHMM’s provide a principled way to perform pairwise steps

• Felsenstein’s algorithm computes likelihoods on phylogenies

• Substitution models can use continuous-time Markov chains

• Large-scale rearrangements are a problem

• Banding can improve alignment speed

• Optimal MSA computation is intractible in the general case

• Progressive alignment is more tractible, but is greedy

• Iterative refinement attempts to undo greedy decisions

• PairHMM’s provide a principled way to perform pairwise steps

• Felsenstein’s algorithm computes likelihoods on phylogenies

• Substitution models can use continuous-time Markov chains

• Large-scale rearrangements are a problem

• Banding can improve alignment speed

SummarySummary