life exists beyond gtr - ben kaehler

© Ben Kaehler 2013 Life Exists Beyond GTR Ben Kaehler John Curtin School of Medical Research Australian National University

Upload: australian-bioinformatics-network

Post on 11-May-2015

Page 1: Life Exists Beyond GTR - Ben Kaehler

© Ben Kaehler 2013

Life Exists Beyond GTR

Ben Kaehler John Curtin School of Medical Research

Australian National University

Page 2: Life Exists Beyond GTR - Ben Kaehler

Overview

• Modelling evolution

• Relaxing some assumptions

• Quantifying plausibility

Page 3: Life Exists Beyond GTR - Ben Kaehler

Motivation

GTR Overestimates Genetic Distance

Page 4: Life Exists Beyond GTR - Ben Kaehler

Programming in Science

• Use source control

• Write unit tests

• Read this http://software-carpentry.org/

Page 5: Life Exists Beyond GTR - Ben Kaehler

Setup

• Consider only genetic data (DNA or protein sequences)

• Take “genes” to be orthologous sequences

• Assume that each position (nucleotide or codon) in a gene has evolved on the same bifurcating tree

• Assume that each position in a gene has evolved independently under identical processes

Page 6: Life Exists Beyond GTR - Ben Kaehler

.org

Knight et al. (2007). PyCogent: a toolkit for making sense from sequence. Genome Biol 8, R171.

Page 7: Life Exists Beyond GTR - Ben Kaehler

Sequences in PyCogent

• Loading and saving sequence collections

• Mapping between DNA and protein sequences for alignment

• Filtering by position and for gaps

Page 8: Life Exists Beyond GTR - Ben Kaehler

The Substitution Process

• Evolution along each branch is a continuous-time, time-homogeneous Markov process

• Markov process has no memory

• Between speciation events substitution rates are constant

(Thanks Gavin)

Page 9: Life Exists Beyond GTR - Ben Kaehler

Some Maths

• Marginal probability vector: $\pi = (\pi_A, \pi_C, \pi_G, \pi_T)$

• Transition probability matrix:

$$P = \begin{pmatrix}
P(A|A) & P(C|A) & P(G|A) & P(T|A) \\
P(A|C) & P(C|C) & P(G|C) & P(T|C) \\
P(A|G) & P(C|G) & P(G|G) & P(T|G) \\
P(A|T) & P(C|T) & P(G|T) & P(T|T)
\end{pmatrix}$$

• Markov process: $\pi(t) = \pi(0) P(t)$

• Substitution rate matrix $Q$: $P(t) = e^{Qt}$
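A minimal sketch of the two relations above, $P(t) = e^{Qt}$ and $\pi(t) = \pi(0)P(t)$, in plain Python. The rate matrix `Q`, the starting composition `pi0`, and the helper names `mat_mul` and `expm` are made up for illustration; the exponential is a truncated Taylor series with scaling and squaring, not whatever PyCogent uses internally.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def expm(q, t, terms=20, squarings=10):
    """Approximate e^{Qt}: Taylor series on Qt / 2**squarings, then square back up."""
    n = len(q)
    scale = t / 2 ** squarings
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in result]
    for k in range(1, terms):
        # term becomes a^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[r + s for r, s in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


# Hypothetical rate matrix, states ordered A, C, G, T; every row sums to zero.
Q = [[-0.9, 0.2, 0.5, 0.2],
     [0.1, -0.8, 0.1, 0.6],
     [0.4, 0.2, -0.8, 0.2],
     [0.1, 0.5, 0.1, -0.7]]
pi0 = [0.1, 0.2, 0.3, 0.4]   # hypothetical starting base composition

P = expm(Q, 0.5)             # row i holds P(. | i) after time t = 0.5
pi_t = [sum(pi0[i] * P[i][j] for i in range(4)) for j in range(4)]
```

Each row of `P` is a probability distribution over the end state given the start state, which is why `pi0` multiplies from the left.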

Page 10: Life Exists Beyond GTR - Ben Kaehler

Submodels

• It is common practice to place constraints on Q to reduce the parameter space and impose

• computational tractability,

• desirable phylogenetic properties, and

• interpretability

• The most general time-reversible model is called GTR (aka REV)

• time reversibility means statistical agnosticism to direction of time

• time reversibility implies stationarity

Page 11: Life Exists Beyond GTR - Ben Kaehler

Stationarity and Time Reversibility

• For all t: $\pi(t) = \pi(0) = \pi$, so $\pi Q = 0$

• This means the base composition of every gene must be approximately equal

• GTR has the desirable properties that:

• it can be reconstructed from just two genes and

• it doesn’t matter where you put the root
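A small numerical check of the two properties, assuming the usual GTR parameterisation $q_{ij} = s_{ij}\pi_j$ with symmetric exchangeabilities $s_{ij}$. The values of `pi` and `exch` and the function name `gtr_q` are invented for the example.

```python
STATES = "ACGT"


def gtr_q(pi, exch):
    """Build Q with q_ij = s_ij * pi_j for i != j; diagonals make rows sum to zero."""
    q = [[0.0] * 4 for _ in range(4)]
    for i, a in enumerate(STATES):
        for j, b in enumerate(STATES):
            if i != j:
                key = "".join(sorted(a + b))   # "GA" and "AG" share one parameter
                q[i][j] = exch[key] * pi[j]
        q[i][i] = -sum(q[i])                   # q[i][i] is still 0.0 at this point
    return q


# Hypothetical base composition and exchangeabilities.
pi = [0.1, 0.2, 0.3, 0.4]
exch = {"AC": 1.0, "AG": 4.0, "AT": 0.7, "CG": 1.2, "CT": 3.5, "GT": 1.0}
Q = gtr_q(pi, exch)

# Stationarity: every component of pi Q should vanish.
pi_q = [sum(pi[i] * Q[i][j] for i in range(4)) for j in range(4)]

# Time reversibility (detailed balance): pi_i q_ij == pi_j q_ji for all i, j.
imbalance = max(abs(pi[i] * Q[i][j] - pi[j] * Q[j][i])
                for i in range(4) for j in range(4))
```

Detailed balance holds by construction here, and summing it over i gives $\pi Q = 0$, which is the sense in which time reversibility implies stationarity.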

Page 12: Life Exists Beyond GTR - Ben Kaehler

Constraints on Q

The Felsenstein Hierarchy. Pachter, L., & Sturmfels, B. (Eds.). (2005). Algebraic statistics for computational biology (Vol. 13). Cambridge University Press.

[Figure: diagram of nested substitution models: GTR, TN93, F84, HKY85, F81, CS05, SYM, K3ST, K80, JC69]

Page 15: Life Exists Beyond GTR - Ben Kaehler

Fitting Models in PyCogent

• Fit HKY85, GTR

• PyCogent parameters relate to the matrices above

• Generalising models
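To illustrate how a model's parameters map onto Q, here is a sketch of HKY85 written as a constrained rate matrix. This is not PyCogent code: `hky85_q` is a hypothetical helper, and `kappa` is simply the conventional name for the transition/transversion rate ratio.

```python
STATES = "ACGT"
TRANSITIONS = {"AG", "GA", "CT", "TC"}   # purine<->purine, pyrimidine<->pyrimidine


def hky85_q(pi, kappa):
    """HKY85: q_ij = kappa * pi_j for transitions, pi_j for transversions."""
    q = [[0.0] * 4 for _ in range(4)]
    for i, a in enumerate(STATES):
        for j, b in enumerate(STATES):
            if i != j:
                rate = kappa if a + b in TRANSITIONS else 1.0
                q[i][j] = rate * pi[j]
        q[i][i] = -sum(q[i])             # rows of a rate matrix sum to zero
    return q


pi = [0.1, 0.2, 0.3, 0.4]                # hypothetical base composition
Q_hky = hky85_q(pi, kappa=4.0)
Q_f81 = hky85_q(pi, kappa=1.0)           # kappa = 1 collapses HKY85 to F81
```

Generalising runs the other way: freeing the single `kappa` into separate exchangeabilities for each pair of bases climbs the hierarchy towards GTR.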

Page 16: Life Exists Beyond GTR - Ben Kaehler

Non-Stationarity

• We introduce non-stationarity by allowing Q to vary independently of π

• The base composition is no longer uniquely determined by Q so it is free to vary across the phylogeny

• There are issues regarding identifiability

• The location of the root matters

• But life exists beyond GTR
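A short derivation of why the composition becomes free to vary, using only the relations $\pi(t) = \pi(0)P(t)$ and $P(t) = e^{Qt}$ from earlier in the talk:

```latex
\frac{d\pi(t)}{dt} = \pi(t)\,Q
\quad\Longrightarrow\quad
\pi(t) = \pi(0)\,e^{Qt},
\qquad
\pi(t) = \pi(0)\ \text{for all } t \iff \pi(0)\,Q = 0 .
```

So once the root composition $\pi(0)$ is no longer tied to Q through $\pi(0)Q = 0$, each branch drifts towards the stationary distribution of its own Q, and where the process starts (the root) changes the answer.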

Page 17: Life Exists Beyond GTR - Ben Kaehler

Identifiability

• Nature gives us frequencies for each possible column:

Dog        A C G T A C G T …
Pangolin   A A A A C C C C …
Rhino      A A A A A A A A …
FalseVamp  A A A A A A A A …
TombBat    A A A A A A A A …

(4^5 possible columns for five sequences)

• If there are two sets of parameters that fit the observed frequencies equally well, we say that the model is not identifiable

• GTR is identifiable for two or more sequences
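The "frequencies for each possible column" are all the data a model of this kind ever sees. A sketch of tallying them, using three of the taxon names from the slide with made-up sequences:

```python
from collections import Counter

# Toy alignment: invented sequences, one string per taxon, all the same length.
alignment = {
    "Dog":      "ACGTACGTAA",
    "Pangolin": "AAAACCCCAA",
    "TombBat":  "AAAAAAAAAA",
}

names = sorted(alignment)                       # fix the taxon order within a column
columns = zip(*(alignment[name] for name in names))
counts = Counter("".join(col) for col in columns)

length = len(alignment[names[0]])
freqs = {col: n / length for col, n in counts.items()}

possible = 4 ** len(names)                      # number of distinct columns
```

Two parameter sets that predict the same value for every one of the `possible` column probabilities cannot be told apart by any amount of data, which is what non-identifiability means here.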

Page 18: Life Exists Beyond GTR - Ben Kaehler

Consistency

• If the maximum likelihood (ML) estimate of a model converges to the true model as sequence length increases, we say that the estimate is consistent

• If we do not constrain Q,

• the ML estimates are consistent for three or more sequences

• a sensible, continuous-time model that achieves provable consistency but is more general than this non-stationary model would be difficult to devise

• some mild constraints on P and hence Q are still necessary

Page 19: Life Exists Beyond GTR - Ben Kaehler

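The Three Taxon Topology

[Figure: a three-taxon tree. Three branches leave a root with base composition $\pi(0)$: the branch to Dog carries $Q_D t_D$, the branch to Pangolin carries $Q_P t_P$, and the branch to TombBat carries $Q_T t_T$.]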
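Assuming the three branches meet at the root as drawn, the probability of one alignment column $(d, p, b)$ for (Dog, Pangolin, TombBat) sums over the unobserved root state $r$. The subscripted $P_X$ are shorthand introduced here, not notation from the slides:

```latex
\Pr(d, p, b) = \sum_{r \in \{A, C, G, T\}}
  \pi_r(0)\,[P_D]_{rd}\,[P_P]_{rp}\,[P_T]_{rb},
\qquad P_X = e^{Q_X t_X} \quad (X \in \{D, P, T\}).
```

The likelihood of an alignment is the product of this quantity over columns, since positions are assumed to evolve independently under identical processes.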

Page 20: Life Exists Beyond GTR - Ben Kaehler

Fitting Non-Stationary Models in PyCogent

• Demonstration of model generalisation

• Fit GTR and General

Page 21: Life Exists Beyond GTR - Ben Kaehler

Mild Constraints

• For the General model to be identifiable (and consistent), its Ps must be Reconstructible from Rows

• If the Ps are Reconstructible from Rows, you can’t relabel states at internal nodes to achieve the same likelihood values

• We can check a criterion that implies that a matrix is Reconstructible from Rows: Diagonal Largest in Column

Page 22: Life Exists Beyond GTR - Ben Kaehler

Diagonal Largest in Column

• Check Diagonal Largest in Column

• Availability in PyCogent is forthcoming
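Until that lands, the check itself is short. A sketch that reads the criterion literally: every diagonal entry of P must be strictly larger than every other entry in its column. The strict inequality and the function name `is_dlc` are assumptions made here, and both matrices are invented.

```python
def is_dlc(p):
    """True if each diagonal entry strictly exceeds all other entries in its column."""
    n = len(p)
    return all(p[j][j] > p[i][j]
               for j in range(n) for i in range(n) if i != j)


# Hypothetical transition matrix for a short branch: mostly no change.
P_short = [[0.91, 0.03, 0.03, 0.03],
           [0.02, 0.90, 0.02, 0.06],
           [0.05, 0.02, 0.91, 0.02],
           [0.02, 0.05, 0.02, 0.91]]

# The same matrix with the A and C rows swapped, i.e. states relabelled at the
# start of the branch. It is still a valid stochastic matrix, but it fails DLC.
P_relabelled = [P_short[1], P_short[0], P_short[2], P_short[3]]

short_ok = is_dlc(P_short)
relabelled_ok = is_dlc(P_relabelled)
```

The relabelled matrix is the kind of alternative explanation the criterion rules out: it would otherwise give the same likelihood with the internal states permuted.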

Page 23: Life Exists Beyond GTR - Ben Kaehler

Quantifying Plausibility

• Likelihood ratio tests between rungs on the Felsenstein Hierarchy have been used to justify the use of more general models

• Best of a bad lot is still bad

• We can use parametric bootstrap to at least outright reject implausible models

• We use the G-statistic to quantify goodness-of-fit for the bootstraps
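For reference, the standard form of the G-statistic, written with $O_i$ for the observed count of column pattern $i$ and $E_i$ for its expected count under the fitted model (symbols chosen here, not taken from the slides):

```latex
G = 2 \sum_{i \,:\, O_i > 0} O_i \ln\!\left(\frac{O_i}{E_i}\right)
```

Patterns that never occur contribute nothing, and G is zero only when the fitted model reproduces the observed column counts exactly.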

Page 24: Life Exists Beyond GTR - Ben Kaehler

Parametric Bootstrap

• A really simple parametric bootstrap
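A sketch of the procedure on a deliberately tiny case, not the talk's PyCogent demonstration: two sequences under JC69, chosen because the fit has a closed form (the ML mismatch probability is the observed mismatch proportion, capped at 3/4). All function names and the counts are invented. The steps are fit, compute G, simulate from the fitted model, refit each replicate, and compare.

```python
import math
import random
from collections import Counter

BASES = "ACGT"
PAIRS = [a + b for a in BASES for b in BASES]   # the 16 possible two-taxon columns


def fit_jc69_pair(counts):
    """ML mismatch probability for two sequences under JC69."""
    n = sum(counts.values())
    mismatches = sum(c for pair, c in counts.items() if pair[0] != pair[1])
    return min(mismatches / n, 0.75)


def expected_probs(p):
    """JC69 column probabilities: uniform composition, total mismatch probability p."""
    return {pair: (1 - p) / 4 if pair[0] == pair[1] else p / 12 for pair in PAIRS}


def g_statistic(counts, probs):
    """G = 2 * sum(O * ln(O / E)) over the patterns that were observed."""
    n = sum(counts.values())
    return 2 * sum(o * math.log(o / (n * probs[pair]))
                   for pair, o in counts.items() if o > 0)


def bootstrap_p_value(counts, replicates=200, seed=0):
    """Return observed G and the fraction of simulated G values at least as large."""
    rng = random.Random(seed)
    n = sum(counts.values())
    probs = expected_probs(fit_jc69_pair(counts))
    observed = g_statistic(counts, probs)
    weights = [probs[pair] for pair in PAIRS]
    hits = 0
    for _ in range(replicates):
        sim = Counter(rng.choices(PAIRS, weights=weights, k=n))   # simulate
        sim_g = g_statistic(sim, expected_probs(fit_jc69_pair(sim)))  # refit
        hits += sim_g >= observed
    return observed, hits / replicates


# Hypothetical AT-rich data: JC69 assumes uniform composition, so it should fit badly.
observed_counts = Counter({"AA": 400, "TT": 400, "CC": 50, "GG": 50,
                           "AT": 40, "TA": 40, "CG": 10, "GC": 10})
g_obs, p_value = bootstrap_p_value(observed_counts)
```

A small `p_value` says that data simulated from the fitted model almost never look as bad as the real data, which is grounds to reject the model outright rather than merely rank it against its neighbours.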

Page 25: Life Exists Beyond GTR - Ben Kaehler

Life Exists Beyond GTR

Page 26: Life Exists Beyond GTR - Ben Kaehler

Application: Genetic Distance

• Genetic distance is fundamental in the field of molecular evolution

• Common to use the Expected Number of Substitutions (ENS) along a branch: $d_{ENS} = -\int_0^t \pi(s)\,ds\ \mathrm{diag}(Q)$

• For stationary models π is constant so $d_{ENS} = -\pi\,\mathrm{diag}(Q)\,t$

• PyCogent automatically calibrates Q so that $-\pi\,\mathrm{diag}(Q) = 1$

• Which means that in PyCogent, t, the branch length, always equals ENS, but only for stationary models

Page 27: Life Exists Beyond GTR - Ben Kaehler

Doing it Right

• How to extract Q, π, and t in PyCogent

• The Van Loan method for exponential integration

• Availability in PyCogent is forthcoming
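In the meantime, a plain-Python sketch of the calculation. It uses Van Loan's block-matrix identity: the top-right block of $\exp\!\big(\begin{pmatrix} Q & I \\ 0 & 0 \end{pmatrix} t\big)$ equals $\int_0^t e^{Qs}\,ds$, so ENS is $-\pi(0)\int_0^t e^{Qs}\,ds\ \mathrm{diag}(Q)$. All function names, the F81-style Q, and the Taylor-series exponential are choices made for this example; extracting Q, π and t from a fitted PyCogent model is not shown.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def expm(a, terms=20, squarings=12):
    """Approximate e^A: Taylor series on A / 2**squarings, then square back up."""
    n = len(a)
    scaled = [[x / 2 ** squarings for x in row] for row in a]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[r + s for r, s in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def integral_of_expm(q, t):
    """Integral of e^{Qs} over [0, t], read off the top-right block (Van Loan)."""
    n = len(q)
    block = [[0.0] * (2 * n) for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            block[i][j] = q[i][j] * t    # top-left block: Q t
        block[i][n + i] = t              # top-right block: I t
    return [row[n:] for row in expm(block)[:n]]


def expected_substitutions(pi0, q, t):
    """d_ENS = -pi(0) * (integral of e^{Qs} over [0, t]) * diag(Q)."""
    n = len(q)
    integral = integral_of_expm(q, t)
    occupancy = [sum(pi0[i] * integral[i][j] for i in range(n)) for j in range(n)]
    return -sum(occupancy[j] * q[j][j] for j in range(n))


# Hypothetical F81-style process: q_ij = pi_j for i != j, so pi is stationary.
pi = [0.1, 0.2, 0.3, 0.4]
Q = [[pi[j] if i != j else pi[i] - 1.0 for j in range(4)] for i in range(4)]
scale = -sum(pi[i] * Q[i][i] for i in range(4))   # calibrate so -pi diag(Q) = 1
Q = [[x / scale for x in row] for row in Q]

t = 0.5
ens_stationary = expected_substitutions(pi, Q, t)                 # equals t
ens_from_all_A = expected_substitutions([1.0, 0.0, 0.0, 0.0], Q, t)  # exceeds t
```

With the calibrated Q, starting from π gives ENS equal to the branch length, while starting from a different composition on the same Q and t does not, so t cannot be read as a genetic distance once the model is non-stationary.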

Page 28: Life Exists Beyond GTR - Ben Kaehler

Conclusion

• The PyCogent API provides programmatic access to leading-edge phylogenetic tools

• Python is a great language for writing any extensions you need

• We will put it all together in this afternoon’s workshop