maximum likelihood phylogenetic inferencedownload.xuebalib.com/xuebalib.com.29391.pdf · lows...

9
Maximum Likelihood Phylogenetic Inference A Dhar and VN Minin, University of Washington, Seattle, WA, USA r 2016 Elsevier Inc. All rights reserved. Glossary Dynamic programming A general method for solving complex problems by breaking them down into a collection of simpler subproblems that are similar in structure to the original problem. Hessian matrix A square matrix of second order partial derivatives of a real-valued function. It is often needed within numerical optimization techniques. Hill climbing A generic optimization strategy that relies on local searches when trying to nd the global maximum of a given function. Hill-climbing procedures incrementally modify variables in an optimization problem and check to see if the modied variables have achieved a higher function value (i.e., climbed the hill). Law of total probability A probability formula that states how to calculate the probability of an event given a partition of the sample space. Metric space A set of objects (i.e., real numbers, phylogenies, etc.) equipped with a distance between any two elements. Reversible substitution model A Markov substitution model that, if started at the equilibrium distribution, can be run backwards in time, with the resulting backward Markov model following the same probability law as the original forward model. Simulated annealing A probabilistic method for approximating the global maximum of a given function. It is a useful technique that avoids getting stuck in local maxima. Stationary distribution A marginal probability distribution over the states of a substitution model that can be interpreted as a long-run steady-state distribution as evolution occurs over time. Substitution model A model that species probabilities of state changes for DNA (or amino acid) sequence data along the branches of a given phylogeny. Introduction Maximum likelihood estimation is an extremely popular statistical inference framework that is used to estimate the parameters in a probabilistic data generating model. This conceptually simple method provides parameter estimates that have good statistical properties. Before delving into maximum likelihood techniques for phylogenetic tree reconstruction, we present a simple example of maximum likelihood estimation to illustrate how this procedure works. Suppose we have a coin that has some unknown prob- ability p of landing on heads. We are interested in estimating this unknown probability based on the outcomes of observed coin ips. For example, one could imagine ipping this coin independently 10 times, resulting in an observed sequence of HTTHTHTHTT, denoted by D. The maximum likelihood principle suggests that the best guess for p is obtained by choosing p so that the probability of observing the sequence HTTHTHTHTT is the highest. The probability of observing the sequence HTTHTHTHTT from this coin is: LðpÞ¼ PrðD; pÞ¼ pð1 pÞð1 pÞpð1 pÞpð1 pÞ pð1 pÞð1 pÞ¼ p 4 ð1 pÞ 6 ½1 We are able to take a product of p and 1 p terms because of the independence of the coin ips. Equation [1] is referred to as the likelihood given parameter p. Because we are picking the value of p that makes L(p) the largest, this likelihood function is treated as a function of p, not as a function of the data D. From basic calculus, we know that nding the p that maximizes a differentiable function w.r.t. p is equivalent to solving the equation L(p) ¼ 0, which is: 2ð1 þ pÞ 5 p 3 ð2 þ 5pÞ¼ 0 ½2 This results in three roots p ¼ 0, 0.4, 1. A quick inspection of the likelihood function, depicted in Figure 1, evaluated at all three roots shows that p ¼ 0.4 is the maximum likelihood estimate of p. This is an intuitive estimate because it is just the proportion of heads observed in the sequence HTTHTHTHTT. It turns out that this maximum likelihood estimator (MLE) of p tends to the true, unknown value of p as we observe more and more coin ips this is called consistency in statistics. Also, this estimator uses available data in the most optimal way, again as we observe more and more coin ips this is called efciency in statistics. Under fairly general and reasonable regularity conditions, consistency and efciency hold for a large class of MLEs (van der Vaart, 1998). This easy-to- understand estimation principle along with the associated optimality properties for a wide class of likelihood models make maximum likelihood an attractive procedure for many parameter estimation problems, including the problem of estimating phylogenetic relationships from molecular sequence data. Phylogenetic Likelihood Likelihood Description Throughout this presentation, we restrict our attention to DNA sequence data for simplicity (although the methods we de- scribe are compatible with other discrete character datasets as Encyclopedia of Evolutionary Biology, Volume 2 doi:10.1016/B978-0-12-800049-6.00207-9 499

Upload: others

Post on 20-Jan-2021

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Maximum Likelihood Phylogenetic Inferencedownload.xuebalib.com/xuebalib.com.29391.pdf · lows directly from the first independence assumption given above. Thus, we can find the

En

Maximum Likelihood Phylogenetic InferenceA Dhar and VN Minin, University of Washington, Seattle, WA, USA

r 2016 Elsevier Inc. All rights reserved.

GlossaryDynamic programming A general method for solvingcomplex problems by breaking them down into a collectionof simpler subproblems that are similar in structure to theoriginal problem.Hessian matrix A square matrix of second order partialderivatives of a real-valued function. It is often neededwithin numerical optimization techniques.Hill climbing A generic optimization strategy that relieson local searches when trying to find the global maximumof a given function. Hill-climbing procedures incrementallymodify variables in an optimization problem and check tosee if the modified variables have achieved a higher functionvalue (i.e., ‘climbed the hill’).Law of total probability A probability formula that stateshow to calculate the probability of an event given apartition of the sample space.

cyclopedia of Evolutionary Biology, Volume 2 doi:10.1016/B978-0-12-800049-6

Metric space A set of objects (i.e., real numbers, phylogenies,etc.) equipped with a distance between any two elements.Reversible substitution model A Markov substitutionmodel that, if started at the equilibrium distribution, can berun backwards in time, with the resulting backward Markovmodel following the same probability law as the originalforward model.Simulated annealing A probabilistic method forapproximating the global maximum of a given function. It isa useful technique that avoids getting stuck in local maxima.Stationary distribution A marginal probabilitydistribution over the states of a substitution model that canbe interpreted as a long-run steady-state distribution asevolution occurs over time.Substitution model A model that specifies probabilities ofstate changes for DNA (or amino acid) sequence data alongthe branches of a given phylogeny.

Introduction

Maximum likelihood estimation is an extremely popularstatistical inference framework that is used to estimate theparameters in a probabilistic data generating model. Thisconceptually simple method provides parameter estimates thathave good statistical properties. Before delving into maximumlikelihood techniques for phylogenetic tree reconstruction, wepresent a simple example of maximum likelihood estimationto illustrate how this procedure works.

Suppose we have a coin that has some unknown prob-ability p of landing on heads. We are interested in estimatingthis unknown probability based on the outcomes of observedcoin flips. For example, one could imagine flipping this coinindependently 10 times, resulting in an observed sequence ofHTTHTHTHTT, denoted by D. The maximum likelihoodprinciple suggests that the best guess for p is obtained bychoosing p so that the probability of observing the sequenceHTTHTHTHTT is the highest. The probability of observing thesequence HTTHTHTHTT from this coin is:

LðpÞ ¼ PrðD; pÞ ¼ pð1� pÞð1� pÞpð1� pÞpð1� pÞ

pð1� pÞð1� pÞ ¼ p4ð1� pÞ6 ½1�

We are able to take a product of p and 1� p terms becauseof the independence of the coin flips. Equation [1] is referredto as the likelihood given parameter p. Because we are pickingthe value of p that makes L(p) the largest, this likelihoodfunction is treated as a function of p, not as a function of thedata D. From basic calculus, we know that finding the p thatmaximizes a differentiable function w.r.t. p is equivalent to

solving the equation L′(p)¼0, which is:

2ð�1þ pÞ5p3ð�2þ 5pÞ ¼ 0 ½2�

This results in three roots p¼0, 0.4, 1. A quick inspectionof the likelihood function, depicted in Figure 1, evaluated atall three roots shows that p¼0.4 is the maximum likelihoodestimate of p. This is an intuitive estimate because it is just theproportion of heads observed in the sequence HTTHTHTHTT.It turns out that this maximum likelihood estimator (MLE) ofp tends to the true, unknown value of p as we observe moreand more coin flips this is called consistency in statistics. Also,this estimator uses available data in the most optimal way,again as we observe more and more coin flips — this is calledefficiency in statistics. Under fairly general and reasonableregularity conditions, consistency and efficiency hold for alarge class of MLEs (van der Vaart, 1998). This easy-to-understand estimation principle along with the associatedoptimality properties for a wide class of likelihood modelsmake maximum likelihood an attractive procedure for manyparameter estimation problems, including the problem ofestimating phylogenetic relationships from molecularsequence data.

Phylogenetic Likelihood

Likelihood Description

Throughout this presentation, we restrict our attention to DNAsequence data for simplicity (although the methods we de-scribe are compatible with other discrete character datasets as

.00207-9 499

Page 2: Maximum Likelihood Phylogenetic Inferencedownload.xuebalib.com/xuebalib.com.29391.pdf · lows directly from the first independence assumption given above. Thus, we can find the

0.0 0.2 0.4 0.6 0.8 1.0

0.0000

0.0004

0.0008

0.0012

p

L(p) p = 0 p = 0.4 p = 1

Figure 1 The likelihood of observing 4 heads in 10 coin tosses as a function of the unknown probability of heads p in a single toss. Thelikelihood function p4(1� p)6 is maximized at p¼0.4, which corresponds to the observed proportion of heads. The red dashed lines representcritical points found by differentiating the likelihood function and setting it equal to 0.

x

y z

A T C T

t3 t4

t1

t5 t6

t2

Figure 2 An example phylogenetic tree. Letters x,y,z represent theunobserved internal node states where x is associated with the rootnode, τex specifies the tree topology, and tex¼ (t1,t2,…,t6) denotes thevector of branch lengths. Given the tree topology τex and branchlengths tex, we can calculate the likelihood of observing the nucleotidevector (A,T,C,T).

500 Maximum Likelihood Phylogenetic Inference

well). Suppose we observe m aligned sequences of DNA (po-tentially corresponding to m distinct species), where each se-quence has nucleotide observations recorded at n distinct sites.Gaps in the alignment are usually treated as missing data, butmore accurate treatment of insertions and deletions is possible(Redelings and Suchard, 2005; Liu et al., 2012). Just as weconsidered the likelihood of the observed coin flips as afunction of the unknown parameter p, here we will examinethe likelihood of the DNA sequence data as a function of theunknown tree topology and branch lengths. Given a treetopology with branch lengths, we can use a substitution modelto calculate the probabilities of state changes along the bran-ches of the tree; substitution models are described in greatdetail in the ‘Models and Model Selection’ article containedwithin this encyclopedia. Specifically if t denotes a branchlength on a tree, substitution models allow us to calculatepij(t), which denotes the probability of going from state i tostate j on a branch of length t, where i,jA{A,G,C,T}. Note thatbranch lengths are commonly measured in expected numberof substitutions per site, not in clock time, because estimatingsubstitution rates and branch lengths in units of clock timerequires additional information about branching and/orsampling times in the phylogeny (Drummond et al., 2006).

Two assumptions are made that are crucial to the rest of theanalysis (Felsenstein, 2004):

1. Evolution at different sites (on a given tree) is independent.2. Conditional on the internal node states, evolution pro-

ceeds independently on different branches of thephylogeny.

Let L(τ,t,θ) be the likelihood corresponding to the m� nDNA sequence alignment matrix y for a given tree topology τwith branch length vector t and substitution model parametervector θ. We can write the likelihood as:

Lðτ,t,θÞ ¼ Prðy; τ,t,θÞ ¼ ∏n

i ¼ 1Pðyi; τ,t,θÞ ¼ ∏

n

i ¼ 1Liðτ,t,θÞ ½3�

where yi is the m� 1 vector of observed nucleotides at the i'thsite and Li(τ,t,θ) is the site i likelihood. This factorization fol-lows directly from the first independence assumption givenabove. Thus, we can find the likelihood of the whole sequencematrix by finding the likelihoods for each of the n sites. Sup-pose we observed the nucleotide vector (A,T,C,T) at a par-ticular site (assuming there were only m¼4 alignedsequences). We will use the example tree τex given in Figure 2to help illustrate how to calculate the likelihood at this site.

Using the tree τex, we can deconstruct the likelihood of thisnucleotide vector in the following manner:

PrðA,T,C,T; τex,tex,θÞ ¼∑x∑y∑zPrðA,T,C,T,x,y,z; τex,tex,θÞ

¼∑x∑y∑zπxpxyðt1Þpxzðt2ÞpyAðt3ÞpyTðt4ÞpzCðt5ÞpzTðt6Þ ½4�

where the summations are over the elements in {A,G,C,T}. Weobtain eqn [4] by conditioning on the internal node states and

Page 3: Maximum Likelihood Phylogenetic Inferencedownload.xuebalib.com/xuebalib.com.29391.pdf · lows directly from the first independence assumption given above. Thus, we can find the

Maximum Likelihood Phylogenetic Inference 501

by invoking the assumption of independent evolution acrossbranches. Note that, following common practice, we assumedthat the initial distribution at the root of the phylogeny is π¼(πA, πG, πC, πT)

T, the stationary distribution of the substitutionmodel (Page and Holmes, 2009). Looking at eqn [4], it is clearthat we will need to take a sum over 43¼64 probabilities. Foronly m¼4 terminal nodes, this computation is reasonable; asm grows, the computation becomes problematic because thesum will involve 4m�1 terms over m� 1 internal nodes. In thenext section, we describe an algorithm that can efficientlycalculate this site likelihood by eliminating redundantcomputations.

Likelihood Computation

First presented by Felsenstein (1973), the pruning algorithm isa standard technique used to efficiently compute phylogeneticlikelihoods. This algorithm is a particular type of dynamicprogramming technique and takes advantage of the distribu-tive law of algebra to achieve efficiency gains. For example, theexpression abþ acþ adþ ae, where a, b, c, d, e are scalars, re-quires 7 computations (4 multiplications and 3 additions). Bynoting the scalar a appears in all 4 multiplications, we can usethe distributive law to rewrite the above expression as a(bþ cþ dþ e), which requires 4 computations (3 additionsand 1 multiplication). Thus by applying the distributive law toexpressions that contain sums of product terms, like in eqn [4],we can reduce the number of computations required toevaluate them.

For the phylogenetic site likelihood given in eqn [4], wecan utilize the distributive law by pushing the summations asfar right as possible:

PðA,T,C,Tjτex,tex,θÞ ¼∑x∑y∑zπxpxyðt1Þpxzðt2ÞpyAðt3ÞpyTðt4ÞpzCðt5Þ

pzTðt6Þ ¼∑xπx

�∑ypxyðt1ÞpyAðt3ÞpyTðt4Þ

��∑zpxzðt2ÞpzCðt5ÞpzTðt6Þ

½5�

This formulation suggests calculating∑ypxyðt1ÞpyAðt3ÞpyTðt4Þand ∑zpxzðt2ÞpzCðt5ÞpzTðt6Þ first, caching these intermediateresults for all possible values x, and then computing the finalsum over x. Note that we can visualize this procedure as tra-versing τex bottom-up because the sums over y and z areevaluated before the sum over x. Instead of naively summingover 43¼64 terms, this reformulation requires a sum over4� 3¼12 terms. For an arbitrary number of terminal nodes m,a similar reformulation will reduce the number of compu-tations from being exponential in m to being linear in m.

The nested, bottom-up nature of the above computationleads naturally to a recursion for calculating site likelihoods.We define L(k, i) to be the conditional likelihood of a subtreewith root node k being in state i. Conceptually, it is the like-lihood of the observed terminal nodes below node k, con-ditional on node k being in state i. For example, in τex, L(y, i)represents the likelihood of observing (A, T) below node y,conditional on node y being in state i. For any internal node k(in state i) with children v,w and corresponding branch lengths

tv,tw, Felsenstein (1973) defines the recursion to be:

Lðk,iÞ ¼ ∑s1AfA,G,C,Tg

pis1 ðtvÞLðv,s1Þ" #

∑s2AfA,G,C,Tg

pis2 ðtwÞLðw,s2Þ" #

½6�

for all states iA{A,G,C,T}. This recursion uses the conditionallikelihoods calculated for the children of node k to computethe conditional likelihoods for node k itself.

The derivation of the above recursion is based on the as-sumption of independent evolution across different lineagesand the law of total probability. The former justifies de-composition of L(k, i) into a product of the two terms, en-closed in square brackets, in eqn [6], where each termrepresents a conditional likelihood component from one ofthe two lineages below node k. The component from child v iscomputed by summing over all possible states (s1A{A,G,C,T})to which state i could have changed to and for each possiblestate computes the probability of changing to that state (i.e.,pis1ðtvÞ) times the probability of everything that is observedbelow node v, given that node v's state is s1. A similar rea-soning can be used to describe the component correspondingthe other child of k — node w.

To turn the recursion in eqn [6] into an algorithm, we needto initialize the recursion by defining the L(k, i) values for allterminal nodes. For any given terminal node k, L(k, i) is set to1 when state i is the observed state and 0 otherwise. Thisinitialization process can be adjusted to account for situationswhen data are either missing or partially observed at theterminal nodes (Felsenstein, 2004). Once all of the terminalnode L(k, i)'s are initialized, we continue calculating con-ditional likelihoods for internal nodes up the tree, computingthem only if the corresponding conditional likelihoods fortheir children have been computed. The procedure ends aftercalculating the conditional likelihoods at the root (L(root, i),for all i); the resulting site likelihood is computed as∑iπiLðroot,iÞ, where iA{A,G,C,T}.

Root Invariance and the Pulley Principle

Before discussing the specifics of maximum likelihood phyl-ogeny estimation, it is essential to understand the types ofphylogenetic trees that will be estimated. Even though it mayseem like we are estimating rooted phylogenies, it turns outthat under our assumption of substitution model reversibilityand without further assumptions or external informationmaximum likelihood methods can only estimate unrootedtrees. This result is a direct consequence of the Pulley Principle,first discussed by Felsenstein (1981). Under the assumptionsof a reversible substitution model, unconstrained branchlengths, and a root nucleotide distribution at equilibrium, thePulley Principle states that the root may be placed anywhereon the tree without affecting the likelihood. This implies thatthe root is unidentifiable using likelihood methods for phyl-ogeny estimation because the likelihood is invariant to theplacement of the root. In fact, we are not estimating a singlerooted tree, but an equivalence class of rooted trees that cor-responds to a unique unrooted tree (Felsenstein, 1981). InFigure 3, we present two rooted trees that lie in the sameequivalence class; the unrooted tree that corresponds to this

Page 4: Maximum Likelihood Phylogenetic Inferencedownload.xuebalib.com/xuebalib.com.29391.pdf · lows directly from the first independence assumption given above. Thus, we can find the

x

y z

A T C T

x

Ay

T z

C T

a

y z

A

T C

T

Note: 0 ≤ a ≤ t3.

t1 + t2

t1 + t2

t3

t4

t4

t4t1

t6

t6

t5

t5

t2

t3

t3 – a

t5

t6

Figure 3 Three trees with equivalent likelihood values under theassumptions of the Pulley Principle. The Pulley Principle states thatthe root of a tree may be placed anywhere on the tree withoutaffecting the likelihood value. The two rooted trees (top) are containedwithin a larger equivalence class of rooted trees that uniquelycorresponds to the given unrooted tree (bottom).

502 Maximum Likelihood Phylogenetic Inference

equivalence class is shown below the two rooted trees. As weshift the root node associated with x, the tree topology andbranch length parameters change, but the likelihood valueremains the same.

Likelihood Maximization

Branch Length Optimization

Now that we have described how phylogenetic likelihoods canbe computed, we will begin discussing how to maximize thesefunctions with respect to the tree topology τ, branch lengths t,and substitution model parameters θ. Phylogenetic maximumlikelihood algorithms proceed by iterating between two majoralgorithmic steps: (1) for a given tree topology, find optimalbranch lengths and substitution model parameters; (2) obtaina tree topology that maximizes the likelihood given branchlengths and substitution model parameters. We start with thecontinuous optimization problem.

The problem of optimizing the phylogenetic likelihoodfunction, or equivalently the log-likelihood, over branchlengths and substitution model parameters falls into a class ofnonlinear, non-convex optimization problems. This meansthat no existing optimization algorithm can guarantee to solvethis problem. However, in practice, this problem can be solvedby a myriad of hill-climbing optimization methods, includingthe Newton–Raphson method and the expectation–maximization (EM) algorithm. Although computational in-gredients for the Newton–Raphson (Schadt et al., 1998; Ken-ney and Gu, 2012) and the EM algorithm (Holmes and Rubin,2002; Hobolth and Jensen, 2005) are available, often simplermethods are preferred. For example, many implementations ofphylogenetic maximum likelihood estimation update branch

lengths one at a time, rather than jointly (Guindon and Gas-cuel, 2003).

In addition to lack of optimization convergence guarantees,there is no theory that says that a phylogenetic likelihood willhave a unique maximum, although multiple maxima arerarely seen in practice (Felsenstein, 2004). Steel (1994) pro-vides an example that shows maximum likelihood branchlengths for a given tree are not necessarily unique. Chor et al.(2000) found sequence alignments that have multiple branchlength optima on the same tree. In contrast, Rogers andSwofford (1999) performed numerous simulation studies andconcludes that it's extremely unlikely that maximum likeli-hood estimation results in multiple local optima for a givenphylogenetic tree. Given these intriguing results, it is not sur-prising that studying properties of the phylogenetic likelihoodfunction remains an active research area.

Tree Topology Search

Because there exist finitely many tree topologies, we could, inprinciple, optimize branch lengths and substitution modelparameters for every possible tree topology and choose the treethat had the highest likelihood value as the maximum likeli-hood tree. Unfortunately, the set of possible topologies is ex-tremely large (Felsenstein, 1981) so naively searching over thistree space is computationally infeasible. Various heuristics areused to find the topology that has the highest likelihood. All ofthese methods use local modifications of the previously visitedtree topologies to find a new tree with a higher likelihood. Forexample, early methods, such as the PHYLIP (Felsenstein,1989) and PAUP� (Swofford, 2003) packages, traverse the treetopology space greedily by comparing the likelihood valuesbetween these modified trees and by choosing the topologythat increases the likelihood the most; the procedure will endif there are not any trees that increase the likelihood. Whilethis local tree search is faster than the exhaustive search over allpossible trees, it is still inefficient because it has to optimizebranch lengths and evaluate likelihoods for all of the rejectedtrees (Guindon and Gascuel, 2003).

Many other tree search heuristics have been introduced toimprove upon the aforementioned hill-climbing methods.Salter and Pearl (2001) proposed a stochastic search algorithmthat uses simulated annealing to move through tree space. Thisstochastic search was found to be faster and less likely to be-come trapped in local optima, when compared to PHYLIP andPAUP, for several simulated and real data examples. The im-provement in speed is largely due to the fact that stochasticsearch algorithms gradually optimize branch lengths and othermodel parameters as the tree search goes on and avoid the fulloptimization steps used within hill-climbing methods forevery candidate tree (Salter and Pearl, 2001). In addition,Guindon and Gascuel (2003) presented a simple hill-climbingalgorithm that adjusts the tree topology and branch lengthssimultaneously. By performing joint optimization, this pro-cedure tends not to get stuck at local modes of the likelihoodand produces extremely accurate estimates of the tree topology(Guindon and Gascuel, 2003). Although these optimizationstrategies work well in many practical situations, it is import-ant to keep in mind that all local hill-climbing optimization

Page 5: Maximum Likelihood Phylogenetic Inferencedownload.xuebalib.com/xuebalib.com.29391.pdf · lows directly from the first independence assumption given above. Thus, we can find the

Maximum Likelihood Phylogenetic Inference 503

methods are prone to getting stuck in local maxima and maynot return the true MLE phylogeny as a result.

Consistency of Maximum Likelihood Estimates

An important question of interest is whether the maximumlikelihood phylogenetic estimation process is able to re-construct the true phylogeny as we gather more and more data.More precisely, as we collect data at more and more sites for afixed set of sequences, will the maximum likelihood phyl-ogeny estimates tend to the true phylogenetic tree? As webriefly mentioned before, consistency holds for a wide class ofMLEs under sufficiently broad regularity conditions (van derVaart, 1998). Despite this, consistency for maximum likeli-hood phylogenies has been hard to establish due to thecomplex nature of the parameter space.

Felsenstein (1973) suggested that consistency could beproven for maximum likelihood phylogeny estimates by usinga modified version of a general consistency proof found inWald (1949), although the proof was never explicitly given.Chang (1996) presented one of the earliest proofs of consist-ency, but did not consider the branch length parameters in hissetup. RoyChoudhury et al. (2015) provide a complete proofof consistency by verifying the Wald conditions in this setting;their proof is dependent on a previously constructed metricspace for phylogenetic trees (Billera et al., 2001). Thus, Roy-Choudhury et al. (2015) show that consistency will hold formaximum likelihood phylogenies assuming the correct modelof evolution is used. A discussion regarding phylogenetic MLEconsistency under model misspecification can be found inFelsenstein (2004).

Bootstrap for Phylogenies

In the process of phylogenetic estimation, it is natural to desirean estimate of the uncertainty surrounding the reconstructedphylogeny. In this section, we explain how the nonparametricbootstrap, a general statistical technique for assessing samplingvariability of an estimator (Efron, 1979), can be used to assignconfidence to phylogenetic MLEs. Before getting into the de-tails of the bootstrap procedure, it is instructive to imaginewhat would be a statistically ideal way to estimate phylogeneticuncertainty. Suppose we could repeatedly ‘rerun’ evolution

A A G T C AG C T A A GT C A T T TT A G C T C

Observed sequence align

C T A C T T A T G CA C G A G A C G T TT G T T G T C G A AT G T T A C A A G G

Bootstrap sample #1

Figure 4 An example of bootstrap sampling for sequence alignment data. Wcolumns) with replacement from the observed sequence alignment (top). We

along the same true phylogenetic history to obtain replicatedmolecular sequence alignments. Then, estimating phylogeniesbased on these alignments via maximum likelihood would tellus something about the sampling variability of our estimationprocedure. For example, if the phylogenies obtained from thereplicated reruns of evolution were nearly identical, we wouldconclude that our phylogenetic estimation procedure is veryprecise, enjoying low sampling variability. The idea behind thebootstrap is to get around rerunning evolution, which is clearlyinfeasible, by resampling the observed data.

Bootstrap Description

The bootstrap procedure starts by generating B replicate data-sets. Each replicate dataset is obtained by sampling n align-ment sites with replacement (i.e., sampling columns) from theobserved alignment. Maximum likelihood estimation is thenapplied to each of these B bootstrapped sequence alignments,and uncertainty is assessed by summarizing the similaritiesbetween the bootstrapped phylogenies (Felsenstein, 1985).Figure 4 displays some bootstrap datasets drawn from anobserved sequence alignment; note that each dataset consistsof randomly sampled columns from this observed alignment.While bootstrap sampling is conceptually simple, summar-izing bootstrapped trees is not an easy task so we devote thenext subsection to this topic. For a more detailed account ofbootstrap sampling, please see the other article containedwithin this encyclopedia.

Summarizing Bootstrapped Phylogenies

We are interested in understanding how to summarize thesimilarities among the B bootstrap trees because a high degreeof similarity corresponds to a low degree of variability. Forinstance, if a particular tree split – a bipartition of the speciesset – appears in most of the bootstrap trees, then we will bemore confident about that tree split being in the true phyl-ogeny. One way to keep track of which tree splits appear mostoften in the bootstrap trees is to construct a majority-ruleconsensus tree – a tree that is constructed from tree splits thatappear in a majority of the bootstrap trees. This is done by firstenumerating all of the tree splits that occur on the bootstraptrees and then retaining only those splits that appear in morethan 50% of the trees. We can then construct the consensustree using this remaining collection of splits. In building the

T C T CG T C AG A G TA G G G

ment (m = 4, n = 10)

C A A A T A A C C TT G C G C G C T A AA T C T G T C A T TG C A T G T A G G C

Bootstrap sample #2

e obtain bootstrap datasets by randomly sampling n¼10 sites (i.e.,provide two examples of possible bootstrap datasets (bottom).

Page 6: Maximum Likelihood Phylogenetic Inferencedownload.xuebalib.com/xuebalib.com.29391.pdf · lows directly from the first independence assumption given above. Thus, we can find the

C

D E

FB

F

E C

DB

D

B F

CE

C

E B

DF

E

F D

CB

Tree collection

Tree splits Counts33211

Number of appearances of each tree split

C

D F

EB

60% 60%

Majority-rule consensus tree

{C, D} |{B, E, F }{B, C, D} |{E, F}{B, D} |{C, E, F}{B, D, E} |{C, F}{C, E} |{B, D, F}

Figure 5 Consensus tree example with five species {B,C,D,E,F}. At the top of the figure, we display a sample collection of trees from which we'llbuild a consensus tree. A table enumerating all possible tree splits from this collection and their respective counts is given in the middle of thefigure. At the bottom of the figure, we display the only consensus tree that can be constructed from the majority tree splits, which are {C,D}|{B,E,F}and {B,C,D}|{E,F}. Note that we label each tree split with the corresponding percentage of the tree collection it appears in.

504 Maximum Likelihood Phylogenetic Inference

consensus tree, the procedure always avoids putting two splitsthat might conflict on the same tree because there will alwaysbe at least one bootstrap tree where the two splits coexist(Felsenstein, 1985). Thus, it is guaranteed that a consensus treecan be built from these tree splits. Once the tree is constructed,it is common to label each split with the percentage of boot-strap trees it appears in. This helps us understand which partsof the consensus tree we should have strong or weak con-fidence in. Figure 5 illustrates how this tree building processworks by constructing a majority-rule consensus tree for asimple collection of trees. For a more comprehensive treat-ment of consensus trees, please see the other article containedwithin this encyclopedia.

Maximum Likelihood and More Complex Models ofEvolution

So far, we have limited our discussion of phylogenetic max-imum likelihood estimation to the very basic models of mo-lecular evolution. However, assuming more complex modelsdoes not significantly change the maximum likelihood ma-chinery. For example, it is widely recognized that when mod-eling evolution of molecular sequences it is important toaccount for a possibility of different substitution rates across

sites (Yang, 1994). Therefore, most implementations of phylo-genetic maximum likelihood estimation include models thatdeal with such rate heterogeneity. Model extensions that relaxthe assumption of site independence (Hobolth, 2008) and thatimpose constraints on substitution rates across branches of thephylogeny (Rambaut and Bromham, 1998) are also possible.Increasing model complexity may eventually lead to a situationwhere maximum likelihood estimation ceases to produce aunique solution, so studying identifiability of models becomesan important avenue of research (Rhodes and Sullivant, 2012).

Example: Primate Phylogeny Estimation

In this section, we present an example of maximum likelihoodphylogeny estimation applied to an alignment of mitochon-drial DNA sequences from seven different primate species:human, chimpanzee, bonobo, gorilla, Bornean orangutan,Sumatran orangutan, and gibbon (Yang et al., 1998). Thelength of the alignment is 9993 sites. We are interested inunderstanding the ancestral relationships among these species.We use a general time-reversible substitution model, with agamma distribution used to model rate variation across sites.After finding the maximum likelihood phylogeny estimateusing the PhyML package (Guindon et al., 2010), we perform

Page 7: Maximum Likelihood Phylogenetic Inferencedownload.xuebalib.com/xuebalib.com.29391.pdf · lows directly from the first independence assumption given above. Thus, we can find the

Borneanorangutan

Sumatranorangutan

Gibbon

Gorilla

Human

Chimpanzee

Bonobo

100%

100%

100%

100%

Full alignment

Borneanorangutan

Sumatranorangutan

Gibbon

Gorilla

Human

Chimpanzee

Bonobo

100%

87%

95%

98%

Subsampled alignment

Figure 6 Maximum likelihood phylogenies with corresponding bootstrap percentages for the primate example using both the full sequencealignment and a randomly subsampled alignment containing 500 sites.

Maximum Likelihood Phylogenetic Inference 505

the bootstrap and obtain 1000 bootstrapped trees. We carryout this procedure on the full sequence alignment and on analignment that we constructed by randomly subsampling 500sites from the original alignment.

Figure 6 displays the maximum likelihood trees with cor-responding labeled bootstrap percentages; note that the ma-jority-rule consensus trees have been omitted from Figure 6 asthey coincide with the maximum likelihood trees in this ex-ample. However, maximum likelihood trees and bootstrapconsensus trees could be different, which usually happenswhen maximum likelihood phylogenetic inference is notvery precise. In this example, maximum likelihood trees ofboth the full and subsampled alignment share the sametopology and have similarly sized bootstrap percentages.However, notice that artificially reducing the size of the datayields more uncertainty reflected in lower bootstrap support ofinternal branches of the reconstructed phylogeny on therighthand side of Figure 6. Our analysis suggests an acceptedrelationship among the primates, with bonobos and chim-panzees being most closely related to humans (Mailund et al.,2014).

See also: Consensus Methods, Phylogenetic. Distance-BasedPhylogenetic Inference. Molecular Evolution, Models of.Phylogenetic Invariants. Support Measures, Phylogenetic Tree

References

Billera, L.J., Holmes, S.P., Vogtmann, K., 2001. Geometry of the space ofphylogenetic trees. Advances in Applied Mathematics 27 (4), 733–767.

Chang, J.T., 1996. Full reconstruction of Markov models on evolutionary trees:Identifiability and consistency. Mathematical Biosciences 137 (1), 51–73.

Chor, B., Hendy, M.D., Holland, B.R., Penny, D., 2000. Multiple maxima oflikelihood in phylogenetic trees: An analytic approach. Molecular Biology andEvolution 17 (10), 1529–1541.

Drummond, A.J., Ho, S.Y., Phillips, M.J., Rambaut, A., et al., 2006. Relaxedphylogenetics and dating with confidence. PLOS Biology 4 (5), 699.

Efron, B., 1979. Bootstrap methods: Another look at the jackknife. Annals ofStatistics 7 (1), 1–26.

Felsenstein, J., 1973. Maximum likelihood and minimum-steps methods forestimating evolutionary trees from data on discrete characters. SystematicZoology 22 (3), 240–249.

Felsenstein, J., 1981. Evolutionary trees from DNA sequences: A maximumlikelihood approach. Journal of Molecular Evolution 17 (6), 368–376.

Felsenstein, J., 1985. Confidence limits on phylogenies: An approach using thebootstrap. Evolution 39, 783–791.

Felsenstein, J., 1989. PHYLIP-phylogeny inference package (version 3.2). Cladistics5, 164–166.

Felsenstein, J., 2004. Inferring Phylogenies. Sunderland, MA: Sinauer Associates.Guindon, S., Dufayard, J.-F., Lefort, V., et al., 2010. New algorithms and methods to

estimate maximum-likelihood phylogenies: Assessing the performance of PhyML3.0. Systematic Biology 59 (3), 307–321.

Guindon, S., Gascuel, O., 2003. A simple, fast, and accurate algorithm to estimatelarge phylogenies by maximum likelihood. Systematic Biology 52 (5), 696–704.

Hobolth, A., 2008. A Markov chain Monte Carlo expectation maximization algorithm forstatistical analysis of DNA sequence evolution with neighbor-dependent substitutionrates. Journal of Computational and Graphical Statistics 17 (1), 138–162.

Hobolth, A., Jensen, J.L., 2005. Statistical inference in evolutionary models of DNAsequences via the EM algorithm. Statistical Applications in Genetics andMolecular Biology 4 (1), 1–22.

Holmes, I., Rubin, G., 2002. An expectation maximization algorithm for traininghidden substitution models. Journal of Molecular Biology 317 (5), 753–764.

Kenney, T., Gu, H., 2012. Hessian calculation for phylogenetic likelihood based onthe pruning algorithm and its applications. Statistical Applications in Geneticsand Molecular Biology 11 (4), 1–46.

Liu, K., Warnow, T.J., Holder, M.T., 2012. Sate-II: Very fast and accuratesimultaneous estimation of multiple sequence alignments and phylogenetic trees.Systematic Biology 61 (1), 90–106.

Mailund, T., Munch, K., Schierup, M.H., 2014. Lineage sorting in apes. AnnualReview of Genetics 48, 519–535.

Page 8: Maximum Likelihood Phylogenetic Inferencedownload.xuebalib.com/xuebalib.com.29391.pdf · lows directly from the first independence assumption given above. Thus, we can find the

506 Maximum Likelihood Phylogenetic Inference

Page, R.D., Holmes, E.C., 2009. Molecular Evolution: A Phylogenetic Approach. NewJersey, NJ: John Wiley & Sons.

Rambaut, A., Bromham, L., 1998. Estimating divergence dates from molecularsequences. Molecular Biology and Evolution 15 (4), 442–448.

Redelings, B.D., Suchard, M.A., 2005. Joint Bayesian estimation of alignment andphylogeny. Systematic Biology 54 (3), 401–418.

Rhodes, J.A., Sullivant, S., 2012. Identifiability of large phylogenetic mixture models.Bulletin of Mathematical Biology 74 (1), 212–231.

Rogers, J.S., Swofford, D.L., 1999. Multiple local maxima for likelihoods ofphylogenetic trees: A simulation study. Molecular Biology and Evolution 16 (8),1079–1085.

RoyChoudhury, A., Willis, A., Bunge, J., 2015. Consistency of a phylogenetic treemaximum likelihood estimator. Journal of Statistical Planning and Inference 161,73–80.

Salter, L.A., Pearl, D.K., 2001. Stochastic search strategy for estimation of maximumlikelihood phylogenetic trees. Systematic Biology 50 (1), 7–17.

Schadt, E.E., Sinsheimer, J.S., Lange, K., 1998. Computational advances in maximumlikelihood methods for molecular phylogeny. Genome Research 8 (3), 222–233.

Steel, M., 1994. The maximum likelihood point for a phylogenetic tree is notunique. Systematic Biology 43 (4), 560–564.

Swofford, D.L., 2003. PAUP*. Phylogenetic Analysis Using Parsimony (* and OtherMethods). Version 4. Sunderland, MA: Sinauer Associates.

van der Vaart, A., 1998. Asymptotic Statistics. Cambridge Series in Statistical andProbabilistic Mathematics. Cambridge: Cambridge University Press.

Wald, A., 1949. Note on the consistency of the maximum likelihood estimate.Annals of Mathematical Statistics 20 (4), 595–601.

Yang, Z., 1994. Maximum likelihood phylogenetic estimation from DNA sequenceswith variable rates over sites: Approximate methods. Journal of MolecularEvolution 39 (3), 306–314.

Yang, Z., Nielsen, R., Hasegawa, M., 1998. Models of amino acid substitution andapplications to mitochondrial protein evolution. Molecular Biology and Evolution15 (12), 1600–1611.

Relevant Websites

http://www.atgc-montpellier.fr/phyml/ATGC.

http://sco.h-its.org/exelixis/web/software/raxml/Heidelberg Institute for Theoretical Studies.

http://evolution.gs.washington.edu/phylip/University of Washington.

Page 9: Maximum Likelihood Phylogenetic Inferencedownload.xuebalib.com/xuebalib.com.29391.pdf · lows directly from the first independence assumption given above. Thus, we can find the

本文献由“学霸图书馆-文献云下载”收集自网络,仅供学习交流使用。

学霸图书馆(www.xuebalib.com)是一个“整合众多图书馆数据库资源,

提供一站式文献检索和下载服务”的24 小时在线不限IP

图书馆。

图书馆致力于便利、促进学习与科研,提供最强文献下载服务。

图书馆导航:

图书馆首页 文献云下载 图书馆入口 外文数据库大全 疑难文献辅助工具