© Ben Kaehler 2013
Life Exists Beyond GTR
Ben Kaehler John Curtin School of Medical Research
Australian National University
Overview
• Modelling evolution
• Relaxing some assumptions
• Quantifying plausibility
Motivation
GTR Overestimates Genetic Distance
Programming in Science
• Use source control
• Write unit tests
• Read this http://software-carpentry.org/
Setup
• Consider only genetic data (DNA or protein sequences)
• Take “genes” to be orthologous sequences
• Assume that each position (nucleotide or codon) in a gene has evolved on the same bifurcating tree
• Assume that each position in a gene has evolved independently under identical processes
Knight et al. (2007). PyCogent: a toolkit for making sense from sequence. Genome Biol 8, R171.
Sequences in PyCogent
• Loading and saving sequence collections
• Mapping between DNA and protein sequences for alignment
• Filtering by position and for gaps
The Substitution Process
• Evolution along each branch is a continuous-time, time-homogeneous Markov process
• Markov process has no memory
• Between speciation events substitution rates are constant
(Thanks Gavin)
Some Maths
• Marginal probability vector: $\pi = \begin{pmatrix} \pi_A & \pi_C & \pi_G & \pi_T \end{pmatrix}$
• Transition probability matrix:
$$P = \begin{pmatrix} P(A|A) & P(C|A) & P(G|A) & P(T|A) \\ P(A|C) & P(C|C) & P(G|C) & P(T|C) \\ P(A|G) & P(C|G) & P(G|G) & P(T|G) \\ P(A|T) & P(C|T) & P(G|T) & P(T|T) \end{pmatrix}$$
• Markov process: $\pi(t) = \pi(0) P(t)$
• Substitution rate matrix: $P(t) = e^{Qt}$
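As a minimal sketch of the relations on this slide, the snippet below builds a rate matrix Q, approximates P(t) = e^{Qt}, and propagates a starting composition π(0). The rate values, the starting composition, and the helper names `mat_mul` and `mat_exp` are illustrative assumptions, not taken from the talk or from PyCogent. The exponential uses a truncated Taylor series with scaling and squaring, chosen only to keep the example free of dependencies.

```python
def mat_mul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def mat_exp(m, terms=20, squarings=10):
    """Approximate e^m: Taylor series on m / 2**squarings, then repeated squaring."""
    n = len(m)
    scaled = [[x / 2.0 ** squarings for x in row] for row in m]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[r + s for r, s in zip(res_row, term_row)]
                  for res_row, term_row in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


# Hypothetical substitution rates, states ordered A, C, G, T (row = from, column = to).
rates = [[0.0, 0.2, 0.5, 0.1],
         [0.1, 0.0, 0.2, 0.6],
         [0.4, 0.1, 0.0, 0.2],
         [0.2, 0.5, 0.1, 0.0]]
# Each diagonal entry of Q makes its row sum to zero.
Q = [[-sum(row) if i == j else row[j] for j in range(4)] for i, row in enumerate(rates)]

t = 0.5
P = mat_exp([[q * t for q in row] for row in Q])  # P(t) = e^{Qt}
pi0 = [0.1, 0.2, 0.3, 0.4]                        # assumed pi(0)
pi_t = mat_mul([pi0], P)[0]                       # pi(t) = pi(0) P(t)

print([round(sum(row), 6) for row in P])  # each row of P is a probability distribution
print([round(p, 4) for p in pi_t])
```

Because nothing constrains this Q, π(t) differs from π(0): the base composition drifts along the branch.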
Submodels
• It is common practice to place constraints on Q to reduce the parameter space and impose
• computational tractability,
• desirable phylogenetic properties, and
• interpretability
• The most general time-reversible model is called GTR (also known as REV)
• time reversibility means statistical agnosticism to direction of time
• time reversibility implies stationarity
Stationarity and Time Reversibility
• For all t: $\pi(t) = \pi(0) = \pi$, so $\pi Q = 0$
• This means the base composition of every gene must be approximately equal
• GTR has the desirable properties that:
• it can be reconstructed from just two genes and
• it doesn’t matter where you put the root
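Both properties can be checked numerically. The sketch below assumes the usual GTR parameterisation, q_ij = s_ij π_j with symmetric exchangeabilities s_ij. The exchangeability values and the base composition are made up for illustration. It verifies stationarity (πQ = 0) and time reversibility (detailed balance, π_i q_ij = π_j q_ji).

```python
from itertools import combinations

bases = "ACGT"
pi = [0.1, 0.2, 0.3, 0.4]  # assumed stationary base composition
# Assumed symmetric exchangeabilities, one per unordered pair of bases.
exch = {("A", "C"): 1.0, ("A", "G"): 4.0, ("A", "T"): 0.5,
        ("C", "G"): 0.7, ("C", "T"): 3.0, ("G", "T"): 1.2}

Q = [[0.0] * 4 for _ in range(4)]
for i, j in combinations(range(4), 2):
    s = exch[(bases[i], bases[j])]
    Q[i][j] = s * pi[j]  # rate from base i to base j
    Q[j][i] = s * pi[i]  # rate from base j to base i
for i in range(4):
    Q[i][i] = -sum(Q[i])  # diagonal is still zero here, so each row now sums to zero

# Stationarity: every component of pi Q should vanish.
pi_Q = [sum(pi[i] * Q[i][j] for i in range(4)) for j in range(4)]

# Time reversibility (detailed balance): pi_i q_ij == pi_j q_ji for every pair.
reversible = all(abs(pi[i] * Q[i][j] - pi[j] * Q[j][i]) < 1e-12
                 for i, j in combinations(range(4), 2))

print(pi_Q, reversible)
```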
Constraints on Q
The Felsenstein Hierarchy. Pachter, L., & Sturmfels, B. (Eds.). (2005). Algebraic statistics for computational biology (Vol. 13). Cambridge University Press.
[Figure: the Felsenstein Hierarchy, showing the models GTR, TN93, F84, HKY85, F81, CS05, SYM, K3ST, K80, and JC69]
Fitting Models in PyCogent
• Fit HKY85, GTR
• PyCogent parameters relate to the matrices above
• Generalising models
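To illustrate how a model parameter maps onto Q, here is a sketch of an uncalibrated HKY85 rate matrix built from a base composition π and a transition/transversion ratio κ. The function name `hky85_rate_matrix` and the numeric values are assumptions made for this example. This is not PyCogent code and does not use PyCogent's parameter names.

```python
bases = "ACGT"
purines = {"A", "G"}


def is_transition(x, y):
    """A transition swaps two purines (A, G) or two pyrimidines (C, T)."""
    return (x in purines) == (y in purines)


def hky85_rate_matrix(pi, kappa):
    """Uncalibrated HKY85: q_ij = kappa * pi_j for transitions, pi_j for transversions."""
    q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(bases):
        for j, y in enumerate(bases):
            if i != j:
                q[i][j] = (kappa if is_transition(x, y) else 1.0) * pi[j]
        q[i][i] = -sum(q[i])  # diagonal makes the row sum to zero
    return q


pi = [0.1, 0.2, 0.3, 0.4]                  # assumed base composition, order A, C, G, T
Q_hky = hky85_rate_matrix(pi, kappa=4.0)
Q_f81 = hky85_rate_matrix(pi, kappa=1.0)   # kappa = 1 collapses HKY85 to F81
```

Fixing κ = 1 steps down a rung of the hierarchy to F81. Freeing more exchangeability parameters steps up towards GTR.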
Non-Stationarity
• We introduce non-stationarity by allowing Q to vary independently of π
• The base composition is no longer uniquely determined by Q so it is free to vary across the phylogeny
• There are issues regarding identifiability
• The location of the root matters
• But life exists beyond GTR
Identifiability
• Nature gives us frequencies for each possible column:
Dog       A C G T A C G T …
Pangolin  A A A A C C C C …
Rhino     A A A A A A A A …
FalseVamp A A A A A A A A …
TombBat   A A A A A A A A …
(× 4^5 possible columns)
• If there are two sets of parameters that fit the observed frequencies equally well, we say that the model is not identifiable
• GTR is identifiable for two or more sequences
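The column frequencies mentioned on this slide are simple to tabulate. The sketch below counts how often each column pattern occurs in a toy alignment. The sequences are made-up data, and three taxa are used instead of five to keep the output small.

```python
from collections import Counter

# Toy alignment (hypothetical data): one string per taxon, all the same length.
alignment = {
    "Dog":      "ACGTACGTAA",
    "Pangolin": "AAAACCCCAA",
    "Rhino":    "AAAAAAAAAA",
}

taxa = sorted(alignment)                                  # fixed taxon order
columns = list(zip(*(alignment[name] for name in taxa)))  # one tuple per position
counts = Counter("".join(col) for col in columns)         # pattern -> occurrences
freqs = {pattern: n / len(columns) for pattern, n in counts.items()}

for pattern, freq in sorted(freqs.items()):
    print(pattern, freq)
```

With n taxa there are 4^n possible patterns. A model is fitted to these frequencies, so two parameter sets that predict the same frequencies cannot be told apart.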
Consistency
• If the maximum likelihood (ML) estimate of a model converges to the true model as sequence length increases, we say that the estimate is consistent
• If we do not constrain Q,
• the ML estimates are consistent for three or more sequences
• a sensible, continuous-time model that achieves provable consistency but is more general than this non-stationary model would be difficult to devise
• some mild constraints on P and hence Q are still necessary
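The Three Taxon Topology
[Figure: a three-taxon tree with root distribution $\pi(0)$ and one branch each to Dog, Pangolin, and TombBat, labelled $Q_D t_D$, $Q_P t_P$, and $Q_T t_T$ respectively]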
Fitting Non-Stationary Models in PyCogent
• Demonstration of model generalisation
• Fit GTR and General
Mild Constraints
• For the General model to be identifiable (and consistent), its Ps must be Reconstructible from Rows
• If the Ps are Reconstructible from Rows, you can’t relabel states at internal nodes to achieve the same likelihood values
• We can check a criterion that implies that a matrix is Reconstructible from Rows: Diagonal Largest in Column
Diagonal Largest in Column
• Check Diagonal Largest in Column
• Availability in PyCogent is forthcoming
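A Diagonal Largest in Column check is short enough to sketch here, reading the criterion as "each diagonal entry is strictly larger than every other entry in its column". The function name `is_dlc` and the example matrices are assumptions for illustration. This is not the forthcoming PyCogent implementation.

```python
def is_dlc(p):
    """Return True if every diagonal entry is strictly the largest in its column."""
    n = len(p)
    return all(p[j][j] > p[i][j] for j in range(n) for i in range(n) if i != j)


# A transition probability matrix close to the identity satisfies the criterion.
p_good = [[0.7, 0.1, 0.1, 0.1],
          [0.1, 0.7, 0.1, 0.1],
          [0.1, 0.1, 0.7, 0.1],
          [0.1, 0.1, 0.1, 0.7]]

# Swapping the first two rows relabels two states, and the criterion fails.
p_swapped = [p_good[1], p_good[0], p_good[2], p_good[3]]

print(is_dlc(p_good), is_dlc(p_swapped))
```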
Quantifying Plausibility
• Likelihood ratio tests between rungs on the Felsenstein Hierarchy have been used to justify the use of more general models
• Best of a bad lot is still bad
• We can at least use a parametric bootstrap to reject implausible models outright
• We use the G-statistic to quantify goodness-of-fit for the bootstraps
Parametric Bootstrap
• A really simple parametric bootstrap
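One way such a bootstrap might look, as a deliberately simplified sketch: compute the G-statistic, G = 2 Σ O ln(O/E), for the observed column counts, then simulate data sets from the fitted column probabilities and see how often the simulated G is at least as large. The function names, counts, and probabilities are hypothetical. A fuller parametric bootstrap would refit the model to every simulated alignment, which this sketch skips by holding the fitted probabilities fixed.

```python
import math
import random


def g_statistic(observed, expected):
    """G = 2 * sum(O * ln(O / E)), skipping categories with no observations."""
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


def parametric_bootstrap(observed, probs, replicates=200, seed=0):
    """Return (observed G, fraction of simulated G values at least as large)."""
    rng = random.Random(seed)
    n = sum(observed)
    expected = [n * p for p in probs]
    g_obs = g_statistic(observed, expected)
    exceed = 0
    for _ in range(replicates):
        # Simulate n columns from the fitted pattern probabilities.
        draws = rng.choices(range(len(probs)), weights=probs, k=n)
        simulated = [0] * len(probs)
        for d in draws:
            simulated[d] += 1
        if g_statistic(simulated, expected) >= g_obs:
            exceed += 1
    return g_obs, exceed / replicates


fitted_probs = [0.4, 0.3, 0.2, 0.1]  # hypothetical fitted pattern probabilities
g_good, p_good = parametric_bootstrap([40, 30, 20, 10], fitted_probs)
g_bad, p_bad = parametric_bootstrap([10, 20, 30, 40], fitted_probs)
print(g_good, p_good)
print(g_bad, p_bad)
```

A small bootstrap proportion means the observed data look unlike anything the fitted model generates, which is grounds for rejecting the model outright.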
Life Exists Beyond GTR
Application: Genetic Distance
• Genetic distance is fundamental in the field of molecular evolution
• Common to use the Expected Number of Substitutions (ENS) along a branch: $d_{ENS} = -\int_0^t \pi(s)\,ds \;\mathrm{diag}(Q)$
• For stationary models π is constant so $d_{ENS} = -\pi\,\mathrm{diag}(Q)\,t$
• PyCogent automatically calibrates Q so that $-\pi\,\mathrm{diag}(Q) = 1$
• Which means that in PyCogent, t, the branch length, always equals ENS, but only for stationary models
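The calibration step can be sketched in a few lines. The example below assumes an F81-style rate matrix (q_ij = π_j off the diagonal), for which π is stationary. The helper names `calibrate` and `stationary_ens` are made up for illustration and are not PyCogent functions.

```python
def calibrate(q, pi):
    """Rescale q so that -pi . diag(q) equals 1."""
    scale = -sum(p * q[i][i] for i, p in enumerate(pi))
    return [[x / scale for x in row] for row in q]


def stationary_ens(q, pi, t):
    """ENS for a stationary process: -pi . diag(q) * t."""
    return -sum(p * q[i][i] for i, p in enumerate(pi)) * t


pi = [0.1, 0.2, 0.3, 0.4]  # assumed stationary base composition
# F81-style rates: q_ij = pi_j for i != j, so the diagonal is pi_i - 1.
Q_raw = [[pi[j] - 1.0 if i == j else pi[j] for j in range(4)] for i in range(4)]
Q = calibrate(Q_raw, pi)

t = 0.3
print(stationary_ens(Q_raw, pi, t))  # before calibration, ENS is not t
print(stationary_ens(Q, pi, t))      # after calibration, ENS equals t up to rounding
```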
Doing it Right
• How to extract Q, π, and t in PyCogent
• The Van Loan method for exponential integration
• Availability in PyCogent is forthcoming
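For non-stationary processes the ENS needs the integral of π(s) = π(0) e^{Qs} over the branch. A sketch of the Van Loan approach follows: the exponential of the block matrix [[Qt, It], [0, 0]] carries ∫₀ᵗ e^{Qs} ds in its top-right block. The helper names, the F81-style Q, and the starting compositions are assumptions for illustration. The matrix exponential is a plain Taylor series with scaling and squaring, used only to keep the example self-contained. This is not the forthcoming PyCogent code.

```python
def mat_mul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def mat_exp(m, terms=20, squarings=10):
    """Approximate e^m: Taylor series on m / 2**squarings, then repeated squaring."""
    n = len(m)
    scaled = [[x / 2.0 ** squarings for x in row] for row in m]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[r + s for r, s in zip(res_row, term_row)]
                  for res_row, term_row in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def expected_substitutions(pi0, q, t):
    """ENS = -pi(0) [integral of e^{Qs} over 0..t] diag(Q), via Van Loan's block matrix."""
    n = len(q)
    # Block matrix [[Q t, I t], [0, 0]]; the top-right block of its exponential is the integral.
    block = [[q[i][j] * t for j in range(n)] + [t * float(i == j) for j in range(n)]
             for i in range(n)]
    block += [[0.0] * (2 * n) for _ in range(n)]
    integral = [row[n:] for row in mat_exp(block)[:n]]
    occupancy = mat_mul([pi0], integral)[0]  # integral of pi(s) ds over [0, t]
    return -sum(w * q[i][i] for i, w in enumerate(occupancy))


pi = [0.1, 0.2, 0.3, 0.4]
scale = 1.0 - sum(p * p for p in pi)
# Calibrated F81-style Q (an assumed example) whose stationary distribution is pi.
Q = [[(pi[j] - (i == j)) / scale for j in range(4)] for i in range(4)]

ens_stationary = expected_substitutions(pi, Q, 1.0)
ens_nonstationary = expected_substitutions([0.7, 0.1, 0.1, 0.1], Q, 1.0)
print(ens_stationary, ens_nonstationary)
```

When the process starts at its stationary distribution, the result matches the branch length t. From any other starting composition it does not, which is why t cannot be read as ENS for non-stationary models.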
Conclusion
• The PyCogent API provides programmatic access to leading-edge phylogenetic tools
• Python is a great language for writing any extensions you need
• We will put it all together in this afternoon’s workshop