HIGH PERFORMANCE, BAYESIAN‐BASED PHYLOGENETIC INFERENCE FRAMEWORK
By
Xizhou Feng
Bachelor of Engineering
China Textile University, 1993
Master of Science Tsinghua University, 1996
————————————————————————
Submitted in Partial Fulfillment of the Requirements
for the Degree of Doctor of Philosophy in the
Department of Computer Science and Engineering
College of Engineering and Information Technology
University of South Carolina
2006
Major Professor Chairman, Examining Committee
Committee Member Committee Member
Committee Member Dean of The Graduate School
Acknowledgements
During the course of my graduate study, I have been fortunate to receive advice, support,
and encouragement from many people. Foremost is the debt of gratitude that I owe to my
thesis advisors, Professor Duncan A. Buell and Professor Kirk W. Cameron. Not only
was Duncan responsible for introducing me to this interesting and fruitful field, he also
provided me with inspiring guidance, great patience, and never-ending encouragement during
the past several years. I especially thank Professor Kirk W. Cameron for his invaluable
mentoring, insightful advising, and constant investment. Kirk guided me into the exciting
field of systems study, and provided opportunities and support to conduct quality
research work in several cutting-edge areas.
I thank Professor Manton Matthews for his years of academic advising and for serving on
my advisory committee. His guidance and support made it possible for me to explore
various fields in computer science and engineering.
I thank Professor John R. Rose and Professor Peter Waddell for their valuable
suggestions in this research work. The discussions and collaborative work with John and
Peter generated some important ideas which have been included in this thesis.
I appreciate Professor Austin L. Hughes for serving on my advisory committee and
providing critical opinions that led me to rethink and significantly improve
this dissertation.
I also thank the faculty and staff in the Department of Computer Science and Engineering for
providing me one of the most wonderful training programs in the world.
Finally, I thank my family for their love and support during the hard time of
completing my dissertation.
This dissertation is dedicated to my wife Rong, my son Kevin, and my daughter
Katherine.
Abstract
Comparative analyses of biological data rely on a phylogenetic tree that describes the
evolutionary relationship of the organisms studied. By combining the Markov Chain
Monte Carlo (MCMC) method with likelihood-based assessment of phylogenies,
Bayesian phylogenetic inferences incorporate complex statistical models into the process
of phylogenetic tree estimation. This combination can be used to address a number of
complex questions in evolutionary biology. However, Bayesian analyses are
computationally expensive because they almost invariably require high dimensional
integrations over unknown parameters. Thoroughly investigating and exploiting the
power of the Bayesian approach requires a high performance computing framework.
Otherwise one cannot tackle the computational challenges of Bayesian phylogenetic
inference for large phylogeny problems.
This dissertation extended the existing Bayesian phylogenetic inference framework in
three aspects: 1) Exploring various strategies to improve the performance of the MCMC
sampling method; 2) Developing high performance, parallel algorithms for Bayesian
phylogenetic inference; and 3) Combining data uncertainty and model uncertainty in
Bayesian phylogenetic inference. We implemented all these extensions in PBPI, a
software package for parallel Bayesian phylogenetic inference.
We validated the PBPI implementation using a simulation study, a common method
used in phylogenetics and other scientific disciplines. The simulation results showed that
PBPI can estimate the model trees accurately given a sufficient number of sequences and
correct models.
We evaluated the computational speed of PBPI using simulated datasets on a
Terascale computing facility and observed significant performance improvements. On a
single processor, PBPI ran up to 19 times faster than the current leading Bayesian
phylogenetic inference program with the same quality of output. On 64 processors, PBPI
achieved a 46-fold parallel speedup on average. Combining both sequential improvement
and parallel computation, PBPI can speed up current Bayesian phylogenetic inference by
up to 870 times.
Table of Contents
Dedication ........................................................................................................................... ii
Acknowledgements............................................................................................................ iii
Abstract ............................................................................................................................... v
List of Tables ................................................................................................................... xiii
List of Figures .................................................................................................................. xiv
Chapter 1 Introduction ........................................................................................................ 1
1.1 Phylogeny and its applications.................................................................................. 1
1.2 Phylogenetic inference.............................................................................................. 2
1.3 The challenges .......................................................................................................... 5
1.3.1 Searching a complex tree space ......................................................................... 5
1.3.2 Developing realistic evolutionary models ......................................................... 6
1.3.3 Dealing with incomplete and unequal data distribution .................................... 7
1.3.4 Resolving conflicts among different methods and data sources........................ 8
1.4 Bayesian phylogenetic inference and its issues ........................................................ 8
1.5 Motivation............................................................................................................... 10
1.6 Research objectives and contributions.................................................................... 11
1.7 Organization of this dissertation ............................................................................. 12
Chapter 2 Background ...................................................................................................... 14
2.1 Representations of phylogenetic trees .................................................................... 14
2.2 Methods for phylogenetic inference ....................................................................... 19
2.2.1 Sequence-based methods and genome-based methods.................................. 19
2.2.2 Distance-, MP-, ML- and BP-based methods .................................................. 20
2.2.3 Tree search strategies....................................................................................... 21
2.3 High performance computing phylogenetic inference methods ............................. 22
2.4 Bayesian phylogenetic inference ............................................................................ 23
2.4.1 Introduction...................................................................................................... 23
2.4.2 The Bayesian framework ................................................................................. 25
2.4.3 Components of Bayesian phylogenetic inference............................................ 27
2.4.4 Likelihood, prior and posterior probability...................................................... 27
2.4.5 Empirical and hierarchical Bayesian analysis.................................................. 28
2.5 Models of molecular evolution ............................................................................... 29
2.5.1 The substitution rate matrix................................................................................. 29
2.5.2 Properties of the substitution rate matrix ......................................................... 31
2.5.3 The general time reversible (GTR) model ....................................................... 32
2.5.4 Rate heterogeneity among different sites......................................................... 34
2.5.5 Other more realistic evolutionary models........................................................ 35
2.6 Likelihood function and its evaluation ................................................................... 35
2.6.1 The likelihood function.................................................................................... 35
2.6.2 Felsenstein’s algorithm for likelihood evaluation............................................ 37
2.7 Optimizations of likelihood computation ............................................................... 39
2.7.1 Sequence packing............................................................................................. 39
2.7.2 Likelihood local update.................................................................................... 39
2.7.3 Tree balance ..................................................................................................... 41
2.8 Markov Chain Monte Carlo methods ..................................................................... 41
2.8.1 The Metropolis-Hasting algorithm .................................................................. 41
2.8.2 Exploring the posterior distribution ................................................................. 43
2.8.3 The issues......................................................................................................... 44
2.9 Summary of the posterior distribution .................................................................... 46
2.9.1 Summary of the phylogenetic trees.................................................................. 46
2.9.2 Summary of the model parameters .................................................................. 46
2.10 Chapter summary .................................................................................................. 47
Chapter 3 Improved Monte Carlo Strategies .................................................................... 49
3.1 Introduction............................................................................................................. 49
3.2 Observations ........................................................................................................... 50
3.3 Strategy #1: reducing stickiness using variable proposal step length..................... 53
3.4 Strategy #2: reducing sampling intervals using multipoint MCMC....................... 55
3.5 Strategy #3: improving mixing rate with parallel tempering.................................. 57
3.6 Proposal algorithms for phylogenetic models......................................................... 60
3.6.1 Basic tree mutation operators........................................................................... 61
3.6.2 Basic tree branch length proposal methods ..................................................... 62
3.6.3 Propose new parameters .................................................................................. 63
3.6.4 Co-propose topology and branch length .......................................................... 63
3.7 Extended proposal algorithms for phylogenetic models......................................... 63
3.7.1 Extended tree mutation operator...................................................................... 64
3.7.2 Multiple-tree-merge operator........................................................................... 64
3.7.3 Backbone-slide-and-scale operator .................................................................. 65
3.8 Chapter summary .................................................................................................... 66
Chapter 4 Parallel Bayesian Phylogenetic Inference ........................................................ 68
4.1 The need for parallel Bayesian phylogenetic inference.......................................... 68
4.2 TAPS: a tree-based abstraction of parallel systems ................................................. 69
4.3 Performance models for parallel algorithms........................................................... 71
4.4 Concurrencies in Bayesian phylogenetic inference ................................................ 74
4.5 Issues of parallel Bayesian phylogenetic inference ................................................ 75
4.6 Parallel algorithms for Bayesian phylogenetic inference ....................................... 77
4.6.1 Task decomposition and assignment ............................................................... 77
4.6.2 Synchronization and communication............................................................... 79
4.6.3 Load balancing................................................................................................. 80
4.6.4 Symmetric MCMC algorithm.......................................................................... 80
4.6.5 Asymmetric MCMC algorithm........................................................................ 83
4.7 Justifying the correctness of the parallel algorithms............................................... 83
4.8 Chapter summary .................................................................................................... 84
Chapter 5 Validation and Verification.............................................................................. 86
5.1 Introduction............................................................................................................. 86
5.2 Experimental methodology..................................................................................... 89
5.2.1 The model trees................................................................................................ 89
5.2.2 The simulated datasets ..................................................................................... 90
5.2.3 The accuracy metrics ....................................................................................... 90
5.2.4 Tested programs and their run configurations ................................................. 92
5.2.5 The computing platforms................................................................................. 93
5.3 Results on model tree FUSO024............................................................................. 94
5.3.1 The overall accuracy of results ........................................................................ 94
5.3.2 Further analysis................................................................................................ 96
5.3.3 PBPI stability ................................................................................................. 100
5.4 Results on model tree BURK050.......................................................................... 103
5.5 Chapter summary .................................................................................................. 105
Chapter 6 Performance Evaluation ................................................................................. 107
6.1 Introduction........................................................................................................... 107
6.2 Experimental methodology................................................................................... 108
6.3 The sequential performance of PBPI .................................................................... 110
6.3.1 The execution time of PBPI and MrBayes .................................................... 110
6.3.2 The quality of the tree samples drawn by PBPI............................................. 111
6.3.3 The execution time of PBPI and MrBayes .................................................... 112
6.4 Parallel speedup for fixed problem size................................................................ 115
6.5 Scalability analysis................................................................................................ 119
6.6 Parallel speedup with scaled workload ................................................................. 121
6.6.1 Scalability with different problem sizes ........................................................ 121
6.6.2 Scalability with the number of chains............................................................ 122
6.7 Chapter summary .................................................................................................. 123
Chapter 7 Summary and Future Work ............................................................................ 124
7.1 The big picture ...................................................................................................... 124
7.2 Future work........................................................................................................... 127
Bibliography ................................................................................................................... 129
List of Tables
Table 1 - 1: The number of unrooted bifurcating trees as a function of taxa ..................... 5
Table 5 - 1: The four model trees used in experiments..................................................... 89
Table 5 - 2: PBPI run configurations for validation and verification ............................... 95
Table 5 - 3: The number of datasets where the model tree FUSO024 is found in the
maximum probability tree, the 95% credible set of trees and the 50% majority
consensus tree. A total of 5 datasets are used in each case................................... 96
Table 5 - 4: The average distances between the model tree FUSO024 and the maximum
probability tree, the 95% credible set of trees and the 50% majority consensus tree.
A total of 5 datasets are used in each case. ........................................................... 96
Table 5 - 5: The topological distances between the model tree FUSO024 and the
maximum probability tree, the 95% credible set of trees and the 50% majority
consensus tree for datasets with 10,000 characters. Datasets are simulated under
the JC69 model. .................................................................................................... 97
Table 5 - 6: The average distances between the model tree BURK050 and the maximum
probability tree, the 95% credible set of tree and the 50% majority consensus tree.
A total of 5 datasets were used in each case. ...................................................... 103
Table 6 - 1: Benchmark dataset used in the evaluation .................................................. 109
Table 6 - 2: Sequential execution time of PBPI and MrBayes ....................................... 110
List of Figures
Figure 1 - 1: The procedure of a phylogenetic inference.................................................... 4
Figure 2 - 1: Phylogenetic trees of 12 primate mitochondrial DNA sequences.............. 15
Figure 2 - 2: The NEWICK representation of the primate phylogenetic tree................... 16
Figure 2 - 3: The nontrivial bipartitions of the primate phylogenetic tree........................ 17
Figure 2 - 4: A phylogenetic tree with support values for each clade ............................. 18
Figure 2 - 5: The transition diagram and transition matrix of nucleotides ....................... 30
Figure 2 - 6: The Felsenstein algorithm for likelihood evaluation .................................. 38
Figure 2 - 7: Illustration of likelihood local update .......................................................... 40
Figure 2 - 8: The tree-balance algorithm .......................................................................... 41
Figure 2 - 9: Metropolis-Hasting algorithm...................................................................... 42
Figure 3 - 1: A target distribution with three modes......................................................... 50
Figure 3 - 2: Distribution approximated using Metropolis MCMC methods ................... 51
Figure 3 - 3: Samples drawn using Metropolis MCMC method ...................................... 52
Figure 3 - 4: Illustration of state moves ............................................................................ 54
Figure 3 - 5: Approximated distribution using variable step length MCMC.................... 55
Figure 3 - 6: The multipoint MCMC ................................................................................ 56
Figure 3 - 7: A family of tempered distributions with different temperatures................. 58
Figure 3 - 8: The Metropolis-coupled MCMC algorithm................................................. 59
Figure 3 - 9: The extended-tree-mutation method ........................................................... 64
Figure 3 - 10: The multiple-tree-merge method ............................................................... 65
Figure 3 - 11: The backbone slide and scale method........................................................ 66
Figure 4 - 1: An illustration of TAPS ............................................................................... 70
Figure 4 - 2: Speedup under fixed workload .................................................................... 73
Figure 4 - 3: The procedure of a generic Bayesian phylogenetic inference ..................... 75
Figure 4 - 4: Mapping 8 chains to a 4 x 4 grid, where the length of each sequence is 2000 ....... 78
Figure 4 - 5: The symmetric parallel MCMC algorithm................................................... 82
Figure 5 - 1: The procedure of a simulation method for accuracy assessment................. 88
Figure 5 - 2: Run configuration for MrBayes ................................................................... 93
Figure 5 - 3: The phylogram of the model tree FUSO024................................................ 98
Figure 5 - 4: The MPP tree estimated from dataset fuso024_L10000_jc69_D001 ...... 99
Figure 5 - 5: Estimation variances in 10 individual runs ................................................ 100
Figure 5 - 6: The phylogram of the model tree BURK050............................................. 101
Figure 5 - 7: The MPP tree estimated from dataset burk050_L10000_jc69_D001.nex. 102
Figure 5 - 8: The posterior distribution of the top 50 most probable trees ..................... 104
Figure 5 - 9: The topological distances distribution of the top 50 most probable trees.. 105
Figure 6 - 1: Different speedup values computed by wall clock time and user time...... 108
Figure 6 - 2: Log likelihood plot of the tree samples drawn by PBPI and MrBayes...... 111
Figure 6 - 3: The consensus tree estimated by PBPI ...................................................... 113
Figure 6 - 4: The consensus tree estimated by MrBayes ................................................ 114
Figure 6 - 5: Parallel speedup of PBPI for dataset FUSO024_L10000 ......................... 116
Figure 6 - 6: Parallel speedup of PBPI for dataset ARCH107_L1000 ........................... 117
Figure 6 - 7: Parallel speedup of PBPI for dataset BACK218_L10000 ......................... 117
Figure 6 - 8: The consensus tree estimated by PBPI on 64 processors........................... 118
Figure 6 - 9: Parallel speedup with different number of taxa ......................................... 122
Chapter 1
Introduction
1.1 Phylogeny and its applications
All life on the earth, both present and past, is believed to be descended from a common
ancestor. The descending pattern or evolutionary relationship among species or
organisms, or the relatedness of their genes, is usually described by a phylogeny, a tree or
network structure, with edge length representing the evolutionary divergence along
different lineages. In a phylogeny, all existing organisms are placed on its “leaves” and
ancestral organisms are placed at its “branches,” or internal nodes.
Since all biological phenomena are the result of evolution, most biological studies
have to be conducted in the light of evolution and require information on phylogeny to
interpret data [1]. Thus, phylogenies play important roles not only in evolutionary
biology, genetics and genomics, but also in modern pharmaceutical research, drug
discovery, agricultural plant improvement, disease control studies (detection, prevention
and prediction) and other biology-related fields. The importance of phylogeny in
scientific research and human society has never been made more clear than by the
ambitious “Tree of Life” project initiated by the US National Science Foundation, which
aims to assemble a phylogeny for all 1.7 million described species (ATOL) to benefit
society and science [2].
The applications of phylogenies span a wide range of fields, both in industry and
science. Several examples follow:
• Identifying, organizing and classifying organisms [3, 4];
• Interpreting and understanding the organization and evolution of genomes [5, 6];
• Identifying and characterizing newly discovered pathogens [7];
• Reconstructing the evolution and radiation of life on the earth [8, 9]; and
• Identifying mutations most likely associated with diseases [10].
1.2 Phylogenetic inference
Phylogeny describes the pattern of evolutionary history among a group of taxa. But history
only happens once, and we must use the clues it left behind to reconstruct actual
events. One of the fundamental tasks of phylogenetic inference is to approximate the
“true” phylogenetic tree for a group of taxa using a set of evolutionary evidence in which
the phylogenetic signals reside.
Various kinds of data are used in phylogenetic inference, but recently DNA/RNA
molecular sequences have become the most common. There are three reasons:
1) DNA sequences are the inheritance materials of all organisms on the earth;
2) Mathematical models of molecular evolution are feasible and can be improved
incrementally;
3) Huge numbers of genomic sequences have been generated and are publicly
accessible.
The third reason is the most important for the rapid advancement of phylogenetic
inference using genomic data. Worldwide genome projects, such as the Human Genome
Project (HGP) [11], have generated an ever-increasing amount of biological data. These
data are publicly accessible through several government-supported database efforts, such
as GenBank[12], EMBL[13], DDBJ[14], and Swiss-Prot[15]. On August 22, 2005, the
public collections of DNA and RNA sequences provided by GenBank, EMBL, and DDBJ
reached 100 gigabases (i.e., 100,000,000,000 bases), representing genes and genomes of
over 165,000 organisms. Those massive, complex data sets already generated—and those
yet to be generated—have been fueling the emergence or renaissance of a few
interdisciplinary fields, including large scale phylogenetic analysis of genomic data.
The problem of phylogenetic inference using genomic (molecular) sequences is
formalized as follows:
Given an aligned character matrix $X = (x_{ij})_{N \times M}$ for a set of N taxa, each
taxon being represented by an M-character sequence, with $x_{ij}$ denoting the character
of the i-th taxon at the j-th site of its sequence, phylogenetic inference typically
seeks to answer two basic questions:
1) What is the phylogenetic tree (or model) that “best” explains the evolutionary
relations among these taxa?
2) With how much confidence is a particular tree expected to be “correct”?
Every phylogenetic method can output a phylogenetic tree that the method views
as the “best” tree according to certain optimization criteria. However, given the inherent
complexities of biological evolution and some unrealistic assumptions in phylogenetic
inference, each inference method usually not only produces a tree but also provides
a measure of the confidence in that tree. Bootstrapping and Bayesian posterior
probability (discussed later) are two common statistical tools that provide such confidence
measures.
As shown in Figure 1-1, a phylogenetic inference is usually preceded by multiple
sequence alignment and model selection to generate its input. Most phylogenetic
methods rely on some phylogenetic tree as their input as well. To reduce the errors
produced by the interdependence among multiple alignment, model selection and
phylogenetic inference, several iterations of alignment, selection, and inference may be required.
[Figure 1-1 flowchart: Collect Data → Retrieve Homologous Sequences → Align Multiple
Sequences → Select Model of Evolution → Aligned Data Matrix → Phylogenetic Inference →
Phylogenetic Tree(s) → Assess Confidence → “Best” tree with measures of support →
Hypothesis Testing]
Figure 1 - 1: The procedure of a phylogenetic inference
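The iterative loop of Figure 1-1 can be sketched in Python. This is only an illustrative skeleton: the functions `align`, `select_model`, and `infer_tree` are hypothetical placeholders standing in for real alignment, model-selection, and inference tools, not part of any actual package.

```python
# Toy stand-ins so the sketch is runnable; real tools would be called here.
def align(seqs, guide_tree):
    """Multiple sequence alignment; a guide tree may refine the result."""
    return [s.upper() for s in seqs]

def select_model(alignment):
    """Pick an evolutionary model that best fits the aligned data."""
    return "JC69"

def infer_tree(alignment, model):
    """Phylogenetic inference; returns a NEWICK-style string (placeholder)."""
    return "(" + ",".join(sorted(alignment)) + ");"

def run_pipeline(sequences, max_rounds=5):
    """Iterate alignment -> model selection -> tree inference until the
    inferred tree stops changing (a simplified sketch of Figure 1-1)."""
    guide_tree = None
    for _ in range(max_rounds):
        alignment = align(sequences, guide_tree)
        model = select_model(alignment)
        tree = infer_tree(alignment, model)
        if tree == guide_tree:   # converged: the tree no longer changes
            break
        guide_tree = tree
    return tree, model
```

In practice each round is expensive, which is one reason the inference step itself must be fast.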
1.3 The challenges
Though there have been significant advances in phylogenetic inference in the past several
decades, large scale phylogenetic inference is still a challenging problem.
1.3.1 Searching a complex tree space
The biggest challenge of phylogenetic inference is the growth in the number of unrooted
trees, described by

$Z = \prod_{i=3}^{N} (2i - 5)$  (1-1)

Here Z denotes the number of possible tree topologies and N denotes the number of
taxa. Table 1-1 shows the number of unrooted trees corresponding to the number of taxa.
For example, the tree space for 100 taxa contains $1.70 \times 10^{182}$ unrooted trees. Searching
this space to find the best tree is computationally impractical. Most optimization-based
phylogenetic methods, such as maximum parsimony and maximum likelihood, are NP-
hard problems. Many heuristic strategies for tree searching have been studied, but much
work remains to be done to improve these methods [16].
Table 1 - 1: The number of unrooted bifurcating trees as a function of taxa

Number of taxa    Number of unrooted trees
3                 1
10                2.03 × 10^6
50                2.84 × 10^74
100               1.70 × 10^182
1000              1.93 × 10^2860
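Equation (1-1) is simple to evaluate directly, which makes it easy to reproduce the entries of Table 1-1. A minimal sketch (the function name is illustrative):

```python
def num_unrooted_trees(n_taxa):
    """Number of unrooted bifurcating tree topologies for n_taxa >= 3,
    via Z = product over i = 3..N of (2i - 5)  (Equation 1-1)."""
    z = 1
    for i in range(3, n_taxa + 1):
        z *= 2 * i - 5
    return z

print(num_unrooted_trees(10))   # 2027025, i.e. about 2.03e6 as in Table 1-1
```

Because the product grows super-exponentially, exhaustive search is hopeless beyond a handful of taxa, which motivates the heuristic and sampling-based strategies discussed later.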
1.3.2 Developing realistic evolutionary models
Most phylogenetic methods explicitly or implicitly assume a model of genomic sequence
evolution and use such a model to estimate the rate of evolution, calculate pair-wise
distance, or compute the likelihood of a given phylogeny. The process of genomic
sequence evolution is shaped by two factors: mutation and selection. Mutations
are errors incurred during DNA replication; they create genetic diversity among
populations, while natural selection steers the direction of evolution. Mutations take
several forms, including substitution, recombination, duplication, insertion, deletion, and
inversion [17]. At the same time, mutations are constrained by the geometric, physical
and chemical structures of nucleotides, amino acids, codons, protein secondary structures,
and protein tertiary structures [18].
Though phylogenetic signals exist in all kinds of mutation events, most evolutionary
models only consider substitution events because it is either difficult or computationally
intractable to integrate other events into the models used by phylogenetic analysis [19,
20]. With increasing computational power, researchers have relaxed some early
assumptions in evolutionary models and proposed more realistic models, such as
allowing rate variation across sites [21], considering the effect of insertion and deletion,
and combining secondary structure information [22-24]. Given multiple possible models,
it is necessary for the phylogenetic inference approach to select a model that best fits the
data. This approach should also be robust enough to give a correct tree even when some
assumptions have been violated.
Besides the complexity of modeling single type sequence evolution, the need for
combined analysis of multiple datasets with different data types and sources requires
some unified model which is both mathematically founded and biologically meaningful
[25, 26].
1.3.3 Dealing with incomplete and unequal data distribution
The imperfect processes of sampling, sequencing, and alignment may introduce various
kinds of noise into an available data set. Bias or errors in multiple sequence alignment
are the source of most noise because: 1) most multiple sequence alignment methods
depend on a "correct" phylogeny to guide the alignment process; and 2) it is necessary to
search across trees to find the overall optimum. It is possible to refine the alignment by
repeating the procedure of "multiple alignment, model selection, phylogenetic inference,"
but it is always dangerous to assume the alignment is "perfect".
To assess the reliability or sensitivity of a phylogeny inferred from uncertain data, the
bootstrap approach [28] was suggested by Felsenstein [29] and further refined by Efron et
al. [30]. Bootstrapping requires repeating the phylogenetic inference procedure many
times (typically on the order of 1000 times [23]) on derived datasets obtained by
resampling the original data with replacement.
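The resampling step itself can be sketched as follows. This is a minimal illustration of ours, not code from the dissertation; it assumes the alignment is given as a list of equal-length sequence strings (rows = taxa, columns = sites).

```python
import random

def bootstrap_replicate(alignment, rng=random):
    """Draw one bootstrap replicate by resampling alignment columns
    (sites) with replacement; rows (taxa) are kept intact."""
    n_sites = len(alignment[0])
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[c] for c in cols) for seq in alignment]
```

Each replicate is then fed to the full tree-inference procedure; the support of a clade is the fraction of replicates in which that clade reappears.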
The usefulness of phylogenetic inference methods is also limited by the sparse and
uneven distribution of sequence data among species and the uncertainty inherent in the
available data. Some species have been sequenced for many genes; a few genes have
been sequenced for many species; but most of the potential data available for
phylogenetic purposes is still missing [31, 32].
1.3.4 Resolving conflicts among different methods and data sources
Researchers usually represent a species with one or more genes in phylogeny
reconstruction. However, a gene tree is not the same as a species tree [23]. Phylogenetic
trees constructed from different genes or different data types (e.g., morphological data vs.
molecular data) may differ. These conflicts may come from improper model
assumptions or tree-building approaches.
1.4 Bayesian phylogenetic inference and its issues
This dissertation aims to extend the framework of Bayesian phylogenetic inference to
achieve high performance on large phylogeny problems. By combining several factors
into a comprehensive probability model and integrating out unknown parameters via
marginal probability distributions, Bayesian analysis has the potential to incorporate
complex (i.e., realistic) models and existing knowledge into phylogenetic inference.
However, like other methods when they were first introduced, Bayesian phylogenetic
inference generated both excitement and debate.
Supporters of the Bayesian approach claim that Bayesian phylogenetic methods have
at least two advantages over traditional phylogenetic methods [33-36]:
1) A Bayesian phylogenetic analysis directly produces both a tree estimate and a
measure of uncertainty for the groups on the estimated tree [10, 37, 38]. The
uncertainty is measured by a quantity called the Bayesian posterior probability, which
is approximated by the percentage of occurrences of a group among the tree samples
generated by MCMC (Markov Chain Monte Carlo) methods [39-41].
9
2) Bayesian methods can implement very complex models of sequence evolution,
because a well-designed MCMC can traverse various highly probable regions of
the tree space instead of remaining in a single region that is locally optimal
but not necessarily globally optimal [37].
However, with more thorough investigation, Bayesian phylogenetic inference also
raises several hotly debated issues [34, 36, 42], summarized below:
1) Some Bayesian analyses yield findings that conflict with those from other approaches,
such as maximum parsimony (MP) and maximum likelihood (ML) [43, 44].
Highly debated topics include: "How meaningful are Bayesian support values?"
[45]; "Do Bayesian support values reflect the probability of being true?" [46]; and
"Overcredibility of molecular phylogenies obtained by Bayesian phylogenetics"
[47]. The supporters' claim that the Bayesian posterior probability of a tree is "the
probability that the estimated tree is correct under the correct model" [10] is itself
highly debatable, and a convincing interpretation is necessary to reconcile these
debates.
2) One cornerstone of Bayesian phylogenetic inference is posterior probability
approximation using Markov Chain Monte Carlo (MCMC). Shortly after MCMC
methods were introduced, they were expected to be more efficient than traditional ML
with bootstrapping [41]. However, experience shows that the chains must run
much longer than previously expected to converge to a correct approximation
[48]. More seriously, research shows that the MCMC method may give
misleading “posterior probability” under certain conditions [42, 49], for example
on a mixture of trees [50].
In spite of the above and other issues, Bayesian analysis has still gained wide
acceptance since it was introduced into phylogenetics [8, 51-57].
1.5 Motivation
Given the challenges described above, both positive and negative, it is necessary to
investigate Bayesian phylogenetic inference more thoroughly. Given the stochastic nature
of molecular evolution, statistical analyses such as Bayesian methods do have the potential
to provide a unified framework that combines multiple data sources and existing
knowledge in phylogenetic inference.
Some of the debates about Bayesian phylogenetic inference are due to insufficient
understanding or implementation of the method, especially the MCMC algorithm. An
improper MCMC implementation risks stopping at local optima and cannot cross
low-probability zones to reach other optimal modes. Therefore, we need to explore
improved MCMC strategies to develop a more reliable, more efficient implementation.
One barrier to extensive investigation of Bayesian methods is that the method itself
is time consuming. Given hundreds of taxa and complex models, a complete MCMC-
based Bayesian analysis may run for several months to obtain a solution. A similar situation
occurred when the maximum likelihood method was first introduced. However, as
computing systems became more powerful and better algorithms were
developed, the maximum likelihood method came into wide use. The same may
happen for Bayesian-based phylogenetic methods.
1.6 Research objectives and contributions
This dissertation aims to develop a high performance framework for Bayesian
phylogenetic inference. The following summarizes the research objectives and
contributions of this dissertation.
1) Developing a high performance computing framework for Bayesian phylogenetic
inference. In this dissertation, we investigate technologies and platforms for
Bayesian phylogenetic inference and abstract different computing platforms into
the TAPS (Tree-based Abstraction of Parallel System) model. Based on this
model, we developed parallel MCMC algorithms for Bayesian phylogenetic
inference and implemented them in the PBPI (Parallel Bayesian Phylogenetic
Inference) program. Both analytical modeling and numerical simulation show that
PBPI achieves roughly linear speedup for datasets of different problem sizes.
This means a Bayesian phylogenetic inference that would take several months with
earlier methods can be finished in several hours using parallel algorithms on mid-sized
Beowulf-like clusters.
2) Developing better MCMC strategies for Bayesian phylogenetic inference. In this
dissertation, we proposed and implemented several MCMC strategies for
exploring the posterior probability distribution of the phylogenetic model. By
using variable proposal step lengths, we made the MCMC chain cross high energy
barriers (i.e., low-probability regions) and overcome "stickiness" around locally
optimal regions. By introducing directional search within each proposal step, we
improved the quality of each proposal and shortened the sampling intervals, thereby
reducing the total number of generations needed to produce an acceptable distribution.
To improve the mixing rate of the chain, we also implemented a class of
population-based MCMC methods which used multiple chains to explore the
search space more efficiently. We demonstrated that classical MCMC methods
risk generating misleading posterior probability on some models; by using an
improved MCMC framework, this risk was reduced. Various novel algorithms
and MCMC strategies were implemented in this research.
3) Accommodating data uncertainty in phylogenetic inference with data resampling
in the MCMC. We extended Bayesian phylogenetic inference to include data
noise in the inference procedure and showed that ML with bootstrapping can be
viewed as a special case of generic Bayesian phylogenetic inference. We showed
that Bayesian posterior probability and the bootstrap support value measure two kinds
of phylogenetic uncertainty: the former refers to multiple possible models for
the same dataset; the latter refers to the robustness of a tree on a specific dataset.
Both uncertainties can be assessed jointly by incorporating data resampling during
a single MCMC run.
1.7 Organization of this dissertation
This dissertation includes three parts.
The first part consists of Chapters 1 and 2, which present background, methods, and
results in the field of Bayesian phylogenetic inference. In this chapter we introduce the
phylogenetic inference problem, its applications, and its challenges. We also provide a
short review of positive and negative views of Bayesian phylogenetic methods. In
Chapter 2, we review various phylogenetic approaches and recent advances in high
performance computing for solving large phylogeny problems.
The second part includes Chapters 3 and 4 in which we describe our extended, high
performance, Bayesian phylogenetic inference framework. In Chapter 3, we demonstrate
the weaknesses of traditional MCMC methods and propose how to overcome these
weaknesses using improved MCMC algorithms. In Chapter 4, we describe our parallel
Bayesian phylogenetic inference framework. We first discuss general models and
methods for parallelizing Bayesian phylogenetic inference, which serve as the
foundation for introducing high performance computing support to the phylogenetic
inference problem. Then we present an implementation of parallel Metropolis-coupled
MCMC and numerical results.
The third part consists of Chapters 5 and 6, where we provide performance evaluation
of the Bayesian method and our implementations. Using simulated datasets under several
model trees, we verified that our implementation not only outputs correct results but
also runs faster, in both its sequential and parallel versions, than MrBayes [58],
the most popular Bayesian phylogenetic inference program currently available. Our
results also demonstrate that the accuracy of the Bayesian phylogenetic method is
very good under current models of evolution.
Finally, in Chapter 7, we summarize the results, conclusions and contributions from
this dissertation and outline future research.
Chapter 2
Background
2.1 Representations of phylogenetic trees
A phylogenetic tree is a graph representation of the evolutionary relationships among a set
of species or organisms. Since species are organized in a hierarchical classification in
taxonomy, each species at a leaf node of the tree is called a taxon (plural taxa) in phylogenetic
inference. A phylogenetic tree is usually represented by a binary tree in which each tree
node is connected to at most three other nodes, but it can be represented by a
multifurcating tree when some parts of the tree cannot be fully resolved [59-62].
Each internal branch of the tree maps to a divergence event in evolution and divides the
taxa into two groups. Each group is called a clade, and each taxon in a clade shares a
common ancestor with the other taxa in the clade. When branch lengths are specified, each
length is proportional to the time since the two groups of taxa diverged from their
most recent common ancestor. A phylogenetic tree can be rooted or unrooted, depending on
whether a unique node is chosen as the least common ancestor of all taxa. Determining
the "true" root for a group of taxa is usually impractical, so unrooted trees are most
commonly used in phylogenetic inference.
[Figure panels (a)-(d) omitted: cladograms and phylograms of the 12 primate taxa Tarsius syrichta, Lemur catta, Saimiri sciureus, Hylobates, Pongo, Gorilla, Homo sapiens, Pan, M. sylvanus, M. fascicularis, Macaca fuscata, and M. mulatta; scale bar 0.1]
Figure 2 - 1: Phylogenetic trees of 12 primates mitochondrial DNA sequences
Figure 2-1 shows phylogenetic trees of 12 primate mitochondrial DNA sequences.
The tree was constructed with MrBayes from 898 DNA characters under the JC69 model.
Figures 2-1 (a) and (b) are cladograms, which provide topological information only.
Figures 2-1 (c) and (d) are phylograms, which provide both the branching order and the
divergence times.
The NEWICK format representation of the phylogenetic tree [63, 64] in Figure 2-1 is
shown in Figure 2-2.
#NEXUS BEGIN TREES; TRANSLATE 1 Tarsius_syrichta, 2 Lemur_catta, 3 Homo_sapiens, 4 Pan, 5 Gorilla, 6 Pongo, 7 Hylobates, 8 Macaca_fuscata, 9 M_mulatta, 10 M_fascicularis, 11 M_sylvanus, 12 Saimiri_sciureus ; UTREE * PRIMATE = (1,2,(12,((7,(6,(5,(3,4)))),(11,(10,(8,9)))))); ENDBLOCK;
Figure 2 - 2: The NEWICK representation of the primate phylogenetic tree
To make the NEWICK representation unique, we define the signature of an unrooted
tree as the NEWICK form that satisfies two requirements:
1) The root of the tree is fixed at the internal node that has the taxon with the smallest
label as one of its children; and
2) The children of each internal node are ordered lexicographically by their labels.
For example, the signature of the above tree is:
(1,2,((((((3,4),5),6),7),(((8,9),10),11)),12))
Using the tree signature, we can easily test the equality of two trees in the same way
as string comparison.
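A canonical signature can be computed recursively. The sketch below is our illustration, not code from the dissertation: trees are assumed to be nested tuples of integer leaf labels, children are ordered by the smallest leaf label they contain, and the re-rooting step for unrooted trees is omitted.

```python
def _signature(node):
    """Return (signature string, smallest leaf label) for a subtree."""
    if isinstance(node, int):          # a leaf: the label itself
        return str(node), node
    # order children by the smallest label in each subtree
    sigs = sorted((_signature(c) for c in node), key=lambda s: s[1])
    return "(" + ",".join(s for s, _ in sigs) + ")", sigs[0][1]

def tree_signature(tree):
    """Canonical NEWICK-like string; equal topologies compare equal."""
    return _signature(tree)[0]
```

For example, `tree_signature(((3,4),(2,1)))` and `tree_signature(((1,2),(4,3)))` both yield `"((1,2),(3,4))"`, so topology equality reduces to string comparison.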
When a distance between two trees, rather than equality, is preferred in practice, a
phylogenetic tree can also be treated as a hierarchical set of bipartitions. Each branch in the
phylogenetic tree divides the set of taxa into one bipartition. For example, the complete
set of nontrivial bipartitions (i.e., bipartitions in which each part has at least two nodes)
for the primate phylogenetic tree shown in Figure 2-2 is listed in Figure 2-3.
(1,2) | (3,4,5,6,7,8,9,10,11,12)
(1,2,12) | (3,4,5,6,7,8,9,10,11)
(3,4) | (1,2,5,6,7,8,9,10,11,12)
(3,4,5) | (1,2,6,7,8,9,10,11,12)
(3,4,5,6) | (1,2,7,8,9,10,11,12)
(3,4,5,6,7) | (1,2,8,9,10,11,12)
(8,9) | (1,2,3,4,5,6,7,10,11,12)
(8,9,10) | (1,2,3,4,5,6,7,11,12)
(8,9,10,11) | (1,2,3,4,5,6,7,12)
Figure 2 - 3: The nontrivial bipartitions of the primate phylogenetic tree
Like the signature of a phylogenetic tree, we can view each bipartition as a signature
of its corresponding tree node and thus compare two nodes from two different
phylogenetic trees over the same group of taxa. The total number of bipartitions that
appear in exactly one of the two trees is defined as the Robinson and Foulds topological
distance between the trees [24], a distance widely used in tree comparisons.
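This definition translates directly into code. The following sketch is our illustration (not from the dissertation); it assumes trees encoded as nested tuples with comparable, hashable leaf labels, collects the nontrivial bipartitions of each tree, and counts the symmetric difference.

```python
def bipartitions(tree, taxa):
    """Set of nontrivial bipartitions induced by a tree's internal edges.
    Each split is stored as its canonical (lexicographically smaller) side."""
    all_taxa = frozenset(taxa)
    splits = set()

    def clade(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        leaves = frozenset().union(*(clade(c) for c in node))
        if 1 < len(leaves) < len(all_taxa) - 1:   # both sides have >= 2 taxa
            splits.add(min(leaves, all_taxa - leaves, key=sorted))
        return leaves

    clade(tree)
    return splits

def robinson_foulds(t1, t2, taxa):
    """Number of bipartitions present in exactly one of the two trees."""
    return len(bipartitions(t1, taxa) ^ bipartitions(t2, taxa))
```

Because both sides of a split map to the same canonical set, the root's two child clades are counted once, matching the unrooted-tree definition.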
[Figure omitted: the 12-primate phylogram of Figure 2-1 with clade support values between 0.91 and 1.00]
Figure 2 - 4: A phylogenetic tree with support values for each clade
The support for a given phylogenetic tree is usually assessed with bootstrapping
[65] or Bayesian posterior probability [66]. In both methods, a consensus tree is
commonly used to summarize common structures among a group of trees sampled using
MCMC (Markov Chain Monte Carlo) or computed from bootstrapped datasets. In
either case, the occurrences of each bipartition are counted and the bipartition
frequencies are shown on the phylogram, as in Figure 2-4. The consensus tree is also
used to combine trees estimated from different genes or datasets for the same group of taxa.
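The frequency counting behind such support values can be sketched as below; this is a minimal illustration of ours (not from the dissertation), again over trees encoded as nested tuples of leaf labels.

```python
from collections import Counter

def _clades(node, found):
    """Recursively collect the leaf set of every subtree into `found`."""
    if not isinstance(node, tuple):
        return frozenset([node])
    leaves = frozenset().union(*(_clades(c, found) for c in node))
    found.append(leaves)
    return leaves

def clade_support(sampled_trees):
    """Fraction of sampled trees (MCMC samples or bootstrap trees)
    that contain each multi-taxon clade, as plotted on a consensus tree."""
    counts = Counter()
    for tree in sampled_trees:
        found = []
        _clades(tree, found)
        counts.update(c for c in found if len(c) > 1)
    n = len(sampled_trees)
    return {clade: cnt / n for clade, cnt in counts.items()}
```

Clades above a chosen frequency threshold (e.g. 0.5 for a majority-rule consensus) are retained and labeled with their support.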
When the individual trees have different but overlapping sets of taxa, a supertree
replaces the consensus tree as the summary output [67].
Considering the possibility of horizontal gene transfer, a phylogenetic network serves
as an alternative representation of the evolutionary relationships among a group of taxa [68].
2.2 Methods for phylogenetic inference
Various methods have been developed to build phylogenetic trees from different kinds of
data. These methods can be classified by: 1) the data type used in tree estimation; 2) the
criteria to define an “optimal” tree; and 3) the tree search strategies.
2.2.1 Sequence-based methods and genome-based methods
Currently, molecular sequences and whole genome features are the two major data types
used in phylogenetic inference [69]:
1) Sequence-based methods use one or more gene alignments to estimate the
phylogenetic tree. Phylogenetic inference with multiple gene alignments
has become common in recent years. The supermatrix [70] and supertree [71]
methods are the two major approaches to handling combined data such as multiple
gene alignments. Both approaches rely on standard sequence-based
phylogenetic inference methods.
2) Genome-based methods use phylogenetic signals contained in gene content
[72-74] or gene order [75, 76] to estimate the phylogenetic tree. Phylogenetic
inference using whole-genome features has attracted researchers' attention recently,
and much effort has been devoted to formulating distance metrics and
probabilistic models. An overview of genome-based methods is provided by
Delsuc et al. [69].
2.2.2 Distance-, MP-, ML- and BP-based methods
There are four major criteria to define an “optimal” tree: distance, maximum parsimony
(MP), maximum likelihood (ML), and Bayesian posterior probability (BP). Comparisons
among these methods are reviewed in [33, 62, 77].
Briefly, distance-based methods are much faster than the other three but
have potential weaknesses, including: 1) information loss in converting sequences
into a distance matrix; and 2) inconsistency for data sets with large distances.
MP and ML are both optimization-based methods which break the tree estimation
process into two major components: scoring a given tree and searching the tree (or trees)
with best scores. MP uses the minimum number of mutations that could produce a given
tree as the score. ML uses the likelihood of the given tree under an explicit evolutionary
model as the score. MP runs much faster than ML because: 1) evaluating the number of
mutations requires much less computation than evaluating the likelihood;
and 2) MP does not need to optimize branch lengths. Drawbacks of MP include: 1)
multiple (or too many) trees may have the same MP score while only one of them is true;
and 2) MP is subject to the "long-branch attraction" problem [78], since it does not
account for the fact that the number of mutations varies across branches.
Both ML and BP are likelihood-based methods which explicitly use a probabilistic
model of molecular evolution. Their major difference is that ML uses point estimates for the
unknown parameters while BP integrates out the unknown parameters via their marginal
distributions. BP has been suggested as a faster alternative to ML with bootstrapping [41];
however, this argument needs further justification [79]. Whether BP should be classified
as an optimization-based method is questionable, since theoretically BP requires more
computation than ML in order to find the probabilities of all modes of the posterior
distribution. As ML is conjectured to be NP-hard, BP is at least as difficult as
ML. Therefore, we put BP in a new category of phylogenetic methods: sampling-based
methods.
2.2.3 Tree search strategies
Any phylogenetic inference method relies on one or more tree search strategies once the
"optimal" criterion is formulated. We divide tree search strategies into the following
categories:
1) Clustering methods [23]: a clustering method builds the tree using a sequence of
clustering operations; examples include UPGMA [80] and neighbor-joining [81]. A
clustering method runs much faster than the other methods. Its limitation is that it
produces only one tree, which may not be globally optimal.
2) Exact search [77]: this method examines every possible tree to locate the "best"
tree. Exact search can be further divided into exhaustive search and branch-and-
bound search. Exhaustive search enumerates all possible trees for evaluation.
Considering the huge number of possible trees described in Chapter 1,
exhaustive search is practical only for small data sizes. Branch-and-bound prunes the
search space by deleting trees that score lower than a preset bound (or
threshold): the stricter the bound, the more the space is pruned. Like
exhaustive search, branch-and-bound is limited to small problem sizes.
3) Deterministic heuristic search: the tree space is not completely randomly
distributed; there is a certain order in it. A heuristic search attempts to
exploit this order to find the "best" or a near-"best" tree. Commonly used
deterministic search strategies include stepwise addition, local rearrangement, and
global rearrangement [64, 77]. One potential problem of deterministic heuristic
search is that it does not guarantee a globally optimal solution.
4) Stochastic search: by introducing random moves, a stochastic search may
avoid local optima and move toward the global optimum. Three stochastic
algorithms are used in phylogenetic inference: simulated annealing [82, 83],
genetic algorithms [84-86], and MCMC [40, 41, 87, 88].
5) Divide and conquer: a large problem can be solved by dividing the original
problem into a set of smaller problems, solving each of them separately, and then
merging the solutions to obtain the solution for the original problem. The
disk-covering method (DCM) [89], quartet puzzling [90], and supertrees [67]
are used in phylogenetic inference.
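The core of stochastic searches such as simulated annealing and MCMC is a Metropolis-style acceptance rule. The generic sketch below is ours, not code from the dissertation; `score` and `propose` are hypothetical user-supplied callbacks (in a phylogenetic setting, a log-likelihood and a random tree rearrangement). Uphill moves are always accepted; downhill moves are accepted with probability exp(delta/temp), which lets the search escape local optima.

```python
import math
import random

def stochastic_search(state, score, propose, steps=1000, temp=1.0, rng=random):
    """Metropolis-style stochastic search: `score` returns a log-score
    (e.g. log-likelihood) and `propose` a random neighbor of the state."""
    best, best_score = state, score(state)
    cur, cur_score = best, best_score
    for _ in range(steps):
        cand = propose(cur)
        delta = score(cand) - cur_score
        # accept uphill moves always, downhill moves with prob. exp(delta/temp)
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            cur, cur_score = cand, cur_score + delta
            if cur_score > best_score:
                best, best_score = cur, cur_score
    return best, best_score
```

Simulated annealing additionally lowers `temp` over time; plain MCMC keeps it fixed and uses the visited states as samples rather than tracking only the best one.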
2.3 High performance computing for phylogenetic inference
As phylogenetic inference scales to larger problems and parallel processing becomes
common, high performance computing support for phylogenetic inference is needed. Such
support includes algorithm tuning, parallel algorithm design,
and parallel platform deployment.
Algorithm tuning seeks alternative approaches for the computation-intensive parts of
phylogenetic inference. One common technique for likelihood-based phylogenetic
methods is to avoid frequent optimization of the branch lengths, because this optimization
process takes O(N^2) likelihood calculations. This technique has been used in [85, 86, 91,
92].
Besides algorithmic improvement and exploration, parallel processing offers the
possibility of reducing the computation time from several months to several hours in an
efficient and immediate manner. Several parallel implementations of widely used
phylogenetic inference methods have been developed recently, among them parallel
fastDNAml [93, 94], parallel TREE-PUZZLE [95], a parallel genetic algorithm for ML
[96], GRAPPA [97], and parallel MCMC algorithms [98, 99]. We note that most
phylogenetic inference methods contain multiple levels of concurrency and can be run
in an embarrassingly parallel manner.
2.4 Bayesian phylogenetic inference
2.4.1 Introduction
As described in the previous chapter, the task of phylogenetic inference includes two
major steps: 1) constructing a phylogenetic tree that maps the evolutionary relationships
among a group of taxa, and 2) assessing the confidence in the estimated tree given the
observed data. Various methods are available for building the phylogenetic tree, some
of them based on a probabilistic model of molecular evolution. Due to the stochastic
nature of molecular evolution and the complicated mechanisms that affect the evolutionary
process, almost every phylogenetic method has to deal with uncertainties caused by
unknown parameters. Also, the fact that multiple phylogenetic trees are possible for the
same group of taxa must be considered in applications that explicitly use a phylogeny
as the basis of study.
Using a comprehensive probabilistic model, Bayesian analysis provides a
methodology to describe relationships among all variables under consideration. Bayesian
phylogenetic inference can learn the phylogenetic model from observed data based on a
quantity called the posterior probability. The posterior probability of a phylogenetic model
Ψ = (T, τ, θ) can be interpreted as the probability with which this phylogenetic model is
correct.
Bayesian phylogenetic inference shares some similarities with maximum likelihood
estimation [10, 33]: both explicitly use a model of molecular evolution and a
formalization of the likelihood function. However, the underlying methodologies are
quite different. First, the Bayesian approach deals with parameter uncertainty by
integrating over all possible values a parameter might assume, while maximum
likelihood estimation uses a point estimate. Second, Bayesian analysis
requires specifying prior distributions for the parameters of a phylogenetic model, which
provides an advantage for incorporating existing knowledge but also invites criticism,
since the prior distributions are often unknown. Finally, Bayesian analysis outputs the
posterior probabilities of trees and clades as a measure of confidence in the
estimated results. Therefore, Bayesian phylogenetic inference is considered a faster
alternative to maximum likelihood estimation with bootstrap resampling [41].
Though the idea of Bayesian phylogenetic inference emerged at almost the same
time as the maximum likelihood method [100], the computation of the Bayesian posterior
probability of a phylogeny was not feasible until Markov Chain Monte Carlo methods were
implemented for phylogenetic inference by three independent research groups [87, 101-
103] in 1996. Bayesian phylogenetic inference became widely used after the method of
computing posterior probability was described [10, 33, 39-41, 87, 104, 105] and several
phylogenetic inference programs (BAMBE [106] and MrBayes [58]) became publicly
available.
Despite some obvious benefits and ever-increasing applications, Bayesian
phylogenetic inference has been hotly debated on several issues including the amount of
bias caused by inappropriate prior probability, the interpretation of Bayesian posterior
probability [46], and the accuracy of Bayesian clade support [34, 36, 42, 45]. This calls
for further examination of the power and performance of Bayesian phylogenetic analysis,
and therefore a need for improved and faster implementations of current Bayesian
phylogenetic methods.
2.4.2 The Bayesian framework
A phylogenetic model Ψ = (T, τ, θ) consists of three components: a tree structure (T)
that represents the evolutionary relationships of the set of organisms under study, a vector of
branch lengths (τ) that maps the divergence times along different lineages, and a model
of molecular evolution (θ) that approximates how the characters at each site evolve
over time along the tree. In the Bayesian framework, both the observed data X and the
parameters of the phylogenetic model Ψ are treated as random variables. The joint
distribution of the data and the model can then be set up as follows:
\[ P(X, \Psi) = P(X \mid \Psi)\, P(\Psi) \tag{2-1} \]
Once the data are known, Bayes' theorem can be used to compute the posterior probability
of the model:
\[ P(\Psi \mid X) = \frac{P(X \mid \Psi)\, P(\Psi)}{P(X)} \tag{2-2} \]
Here, P(X | Ψ) is called the likelihood (the probability of the data given the model),
P(Ψ) is called the prior probability of the model (the unconditional probability of the
model without any knowledge of the observed data), and P(X) is the unconditional
probability of the data. For the continuous case, P(X) is computed by
\[ P(X) = \int P(X \mid \Psi)\, P(\Psi)\, d\Psi \tag{2-3} \]
For the discrete case, P(X) is computed by
\[ P(X) = \sum_{\Psi_i} P(X \mid \Psi_i)\, P(\Psi_i) \tag{2-4} \]
Since P(X) is just a normalizing constant, the computation of (2 - 3) or (2 - 4) is not
needed in practical inference.
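A minimal numeric illustration of the discrete case (an example of ours, not from the dissertation): given the likelihood of the data under each candidate model and a prior over models, the posterior is the normalized product.

```python
def posterior(likelihoods, priors):
    """Discrete Bayes rule: P(model_i | X) = P(X | model_i) P(model_i) / P(X),
    where P(X) = sum_i P(X | model_i) P(model_i) is the evidence."""
    joint = [l * p for l, p in zip(likelihoods, priors)]
    evidence = sum(joint)          # normalizing constant P(X)
    return [j / evidence for j in joint]
```

With equal priors, the posterior is simply the normalized likelihoods; e.g. likelihoods (0.2, 0.1) yield posteriors (2/3, 1/3), regardless of the evidence's absolute value.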
The posterior probability distribution of the phylogenetic model can be written as
\[ P(T_i, \tau, \theta \mid X) = \frac{P(X \mid T_i, \tau, \theta)\, P(T_i, \tau, \theta)}{\sum_{T_j} \iint P(X \mid T_j, \tau, \theta)\, P(T_j, \tau, \theta)\, d\tau\, d\theta} \tag{2-5} \]
This distribution is the basis of Bayesian phylogenetic inference; useful
information can be obtained from it. For example, the posterior probability
of a phylogenetic tree T_i can be computed as
\[ P(T_i \mid X) = \iint P(T_i, \tau, \theta \mid X)\, d\tau\, d\theta \tag{2-6} \]
Similarly, the posterior probability of the i-th component of the parameter θ in the
evolutionary model can be summarized by
\[ P(\theta_i \mid X) = \sum_{T_j} \iint P(T_j, \tau, \theta_i, \theta \setminus \theta_i \mid X)\, d\tau\, d(\theta \setminus \theta_i) \tag{2-7} \]
Here, θ_i is the i-th component of the parameter θ and θ\θ_i denotes the remaining
components of θ.
2.4.3 Components of Bayesian phylogenetic inference
A complete Bayesian phylogenetic inference consists of four major components:
(1) Formulating the phylogenetic model P(X | T_i, τ, θ);
(2) Choosing a proper prior probability P(T_i, τ, θ);
(3) Approximating the posterior probability distribution of phylogenetic models;
(4) Inferring characteristics from the posterior probability distribution.
We briefly describe the second component in this section; the other three components
will be described in the following sections.
2.4.4 Likelihood, prior and posterior probability
Bayesian theory as shown in (2 - 2) can be expressed informally in English as:
\[ \text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}} \tag{2-8} \]
This formula indicates that by observing some new evidence (i.e., the data X), our
starting belief (the prior probability P(Ψ)) is converted into a new belief (the
posterior probability P(Ψ | X)). The prior and posterior probabilities are
connected through the likelihood, the probability with which the evidence would be
observed.
A phylogenetic model is a hypothesis about how the data evolved. Hypotheses cannot
be observed directly, so both the prior and the posterior should be interpreted as
measures of confidence in a model rather than as frequencies [107].
A major concern in Bayesian analysis is how to choose the prior. The prior probability has
the potential to incorporate existing knowledge about phylogenetic models into the current
analysis, but it is also controversial, since the choice of prior
distribution can be subjective. Two approaches are often used for choosing the prior
probability: using a non-informative prior (or flat prior, which treats every hypothesis
as equally probable), or using knowledge obtained from past experience. In Bayesian
phylogenetic inference, the prior probability on phylogenetic models can also be introduced
as constraints to prune the parameter search space.
The posterior probability of a phylogenetic model (for example, a phylogenetic tree)
can be interpreted as the probability with which this model would be correctly estimated for
a set of random data simulated from it. The accuracy of the posterior probability
will be adversely affected by the use of an improper hypothesis [108].
2.4.5 Empirical and hierarchical Bayesian analysis
The comprehensive posterior distribution P(T_i, τ, θ | X) requires knowledge of uncertain
parameters that are not of interest in the current analysis (e.g., branch
lengths or model parameters). In addition to exploring P(T_i, τ, θ | X) directly, two
alternative approximations are used in practice to accommodate these uncertain parameters [109].
The first method is called empirical Bayesian analysis, which uses a point estimate to
eliminate one of the integrals in P(T_i, τ, θ | X). For example, we estimate the best-fit
parameters θ* and then approximate equation (2 - 6) as
\[ P(T_i \mid X) = \iint P(T_i, \tau, \theta \mid X)\, d\tau\, d\theta \approx \int P(T_i, \tau \mid \theta^{*}, X)\, d\tau \tag{2-9} \]
The second method is called hierarchical Bayesian analysis, which takes the posterior
probability of the phylogenetic tree as the integral over all possible combinations of
branch lengths and model parameters. The hierarchical Bayesian analysis can be written
as
\[ P(T_i \mid X) = \frac{P(X \mid T_i)\, P(T_i)}{\sum_{T_j} P(X \mid T_j)\, P(T_j)} \tag{2-10} \]
\[ P(X \mid T_i) = \iint P(X \mid T_i, \tau, \theta)\, d\tau\, d\theta \tag{2-11} \]
2.5 Models of molecular evolution
As shown in the previous section, Bayesian phylogenetic inference explicitly uses
phylogenetic models and likelihood functions for phylogeny estimation. Though
Bayesian phylogenetic inference can in principle be applied to various data types,
including molecular sequences [58, 87, 102], morphological features, gene order [104],
genomic content, and combined data [25, 26, 56, 110], here we limit our discussion to
molecular sequences.
2.5.1 The substitution rate matrix
Though phylogenetic signals exist in various mutation events that can be observed by
sequence comparison, most phylogenetic methods consider only substitution events,
because other events are either difficult to model mathematically or lead to
computationally intractable models.
P(t) = \begin{pmatrix}
p_{AA}(t) & p_{AT}(t) & p_{AC}(t) & p_{AG}(t) \\
p_{TA}(t) & p_{TT}(t) & p_{TC}(t) & p_{TG}(t) \\
p_{CA}(t) & p_{CT}(t) & p_{CC}(t) & p_{CG}(t) \\
p_{GA}(t) & p_{GT}(t) & p_{GC}(t) & p_{GG}(t)
\end{pmatrix}

Figure 2 - 5: The transition diagram and transition matrix of nucleotides
DNA sequences for phylogenetic inference are treated as an aligned character matrix.
Each site can take multiple states. For nucleotides, the number of states is 4; for amino
acids, it is 20; for codons (triplets of nucleotides), it is 64 (or 61 if stop codons are
excluded). The character at each site can change from one state to another stochastically.
The probability p_{ab}(t) with which a site changes from state a to state b after a time
interval t is determined by a molecular substitution model. Figure 2-5 shows the
transition diagram of nucleotides and the corresponding transition matrix.
Molecular substitution can be modeled as a continuous-time Markov process whose
state space is the set of character states [111]. This Markov process, described by a
transition matrix P(t) = (p_{ij}(t)), is determined by an instantaneous substitution rate
matrix Q. The substitution rate matrix is independent of time and is defined as

Q \equiv \lim_{\Delta t \to 0} \frac{P(\Delta t) - I}{\Delta t}  (2 - 12)

Once the rate matrix Q is known, the transition matrix P(t) can be computed by

P(t) = e^{Qt}  (2 - 13)
To compute P(t) with equation (2 - 13), we first decompose Q as

Q = U \Gamma U^{-1}  (2 - 14)

In (2 - 14), \Gamma is a diagonal matrix with the eigenvalues of Q as its diagonal entries,

\Gamma = \begin{pmatrix}
\lambda_1 & 0 & \cdots & 0 \\
0 & \lambda_2 & \cdots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & \lambda_N
\end{pmatrix} = diag\{\lambda_1, \lambda_2, \ldots, \lambda_N\}  (2 - 15)

U is the matrix consisting of the eigenvectors of Q, in the same order as \Gamma, and
U^{-1} is the inverse matrix of U. Applying (2 - 14) and (2 - 15) to (2 - 13), P(t) can be
calculated by

P(t) = U e^{\Gamma t} U^{-1} = U \cdot diag\{e^{\lambda_1 t}, \ldots, e^{\lambda_N t}\} \cdot U^{-1}  (2 - 16)
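Equation (2 - 16) can be checked numerically. The following sketch (a minimal illustration using NumPy, with the Jukes-Cantor rate matrix as a simple example; it is not code from any phylogenetic package) computes P(t) by eigendecomposition and compares the result with the closed-form Jukes-Cantor transition probability:

```python
import numpy as np

# Jukes-Cantor rate matrix: every off-diagonal rate is 1/4 and rows sum
# to zero, so the total substitution rate per unit time is 3/4.
Q = np.full((4, 4), 0.25)
np.fill_diagonal(Q, -0.75)

t = 0.5
lam, U = np.linalg.eig(Q)                            # Q = U Gamma U^-1, eq (2-14)
P = U @ np.diag(np.exp(lam * t)) @ np.linalg.inv(U)  # P(t) = U e^{Gamma t} U^-1, eq (2-16)
P = P.real

# Each row of P(t) is a probability distribution over the four states,
# and the diagonal matches the closed-form value 1/4 + 3/4 e^{-t}.
assert np.allclose(P.sum(axis=1), 1.0)
assert np.isclose(P[0, 0], 0.25 + 0.75 * np.exp(-t))
```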
2.5.2 Properties of the substitution rate matrix
Suppose there are S possible states at each site; then the substitution rate matrix can be
written as

Q = (q_{ij}) = \begin{pmatrix}
q_{11} & q_{12} & \cdots & q_{1S} \\
q_{21} & q_{22} & \cdots & q_{2S} \\
\vdots & \vdots & \ddots & \vdots \\
q_{S1} & q_{S2} & \cdots & q_{SS}
\end{pmatrix} .  (2 - 17)
We also denote the stationary frequency distribution of the states by \pi = (\pi_1, \pi_2, \ldots, \pi_S).
Then the following properties hold for Q and \pi:

q_{ij} \geq 0 \quad (i \neq j)  (2 - 18)

q_{ii} = -\sum_{j, j \neq i} q_{ij}  (2 - 19)

\sum_i \pi_i = 1  (2 - 20)

\pi Q = 0  (2 - 21)

Qt = \left( \frac{Q}{\alpha} \right) \cdot (\alpha t) \quad (\alpha \neq 0).  (2 - 22)
Property (2-21) is the result of the stationarity assumption for the Markov chain, i.e.,
\pi P(t) = \pi. Property (2-22) indicates that the substitution rate and the evolutionary time
are confounded [111]; therefore it is impossible to distinguish between the mutation rate
and the divergence time. The substitution rate can be fixed by assuming the total number
of mutation events per unit time is constant, i.e.,

\sum_i \pi_i \left( \sum_{j, j \neq i} q_{ij} \right) = c

From equation (2 - 19), this constraint can be simplified as

-\sum_i \pi_i q_{ii} = c  (2 - 23)
2.5.3 The general time reversible (GTR) model
There are 12 substitution rate parameters and 4 state frequency parameters in a general
substitution rate matrix; 11 of them are free parameters due to the constraints of (2 - 20),
(2 - 21), and (2 - 23). Various models with fewer parameters have been proposed by
making additional assumptions. Some widely used nucleotide substitution models include
the Jukes-Cantor model (JC69) [112], the Kimura model (K2P) [113], the Felsenstein
models (F81 and F84) [114], the HKY model [115], and the general time reversible model
(GTR) [116]. Details of these models and methods to calculate their transition
probabilities are discussed by Swofford et al. [77], Yang [117], and other researchers
[18, 118].
The GTR model adds the time-reversibility assumption to the substitution rate matrix,
which requires

\pi_i q_{ij} = \pi_j q_{ji}  (2 - 24)

or

\frac{q_{ij}}{\pi_j} = \frac{q_{ji}}{\pi_i} = \alpha_{ij} \quad (\alpha_{ij} = \alpha_{ji}).  (2 - 25)
Therefore, the nucleotide substitution rate matrix for the GTR model (with states in the
order A, T, C, G) can be simplified as

Q_{GTR} = \begin{pmatrix}
- & a\pi_T & b\pi_C & c\pi_G \\
a\pi_A & - & d\pi_C & e\pi_G \\
b\pi_A & d\pi_T & - & f\pi_G \\
c\pi_A & e\pi_T & f\pi_C & -
\end{pmatrix}  (2 - 26)

where the diagonal entries are set so that each row sums to zero, as required by (2 - 19).
By introducing another matrix, \Pi = diag\{\pi_1, \pi_2, \ldots, \pi_S\}, it is easy to verify that

\Pi \cdot Q_{GTR} = Q_{GTR}^{T} \cdot \Pi  (2 - 27)

Further, we have

\Pi^{1/2} \cdot Q_{GTR} \cdot \Pi^{-1/2} = \Pi^{-1/2} \cdot Q_{GTR}^{T} \cdot \Pi^{1/2} = \left( \Pi^{1/2} \cdot Q_{GTR} \cdot \Pi^{-1/2} \right)^{T}  (2 - 28)

Equation (2 - 28) states that the substitution rate matrix Q_{GTR} is similar to the symmetric
matrix \Pi^{1/2} \cdot Q_{GTR} \cdot \Pi^{-1/2}. Therefore, all eigenvalues of Q_{GTR} are real numbers, and
Q_{GTR} can be decomposed as

Q_{GTR} = \Pi^{-1/2} U \Gamma U^{-1} \Pi^{1/2}  (2 - 29)

Here U and \Gamma hold the eigenvectors and eigenvalues of \Pi^{1/2} \cdot Q_{GTR} \cdot \Pi^{-1/2}, respectively.
Equation (2 - 29) reduces the task of solving the eigensystem of a non-symmetric matrix
to solving the eigensystem of a symmetric matrix.
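This reduction can be sketched numerically as follows (a minimal illustration using NumPy; the stationary frequencies and exchangeability rates are made-up values, not estimates from any dataset): the symmetrized matrix of equation (2 - 28) is diagonalized with a symmetric eigensolver, and the transition matrix is then rebuilt as in (2 - 16) and (2 - 29).

```python
import numpy as np

pi = np.array([0.3, 0.2, 0.25, 0.25])              # stationary frequencies
a, b, c, d, e, f = 1.0, 2.0, 1.0, 1.0, 2.0, 1.0    # exchangeability rates
R = np.array([[0, a, b, c],
              [a, 0, d, e],
              [b, d, 0, f],
              [c, e, f, 0]])
Q = R * pi                                          # q_ij = alpha_ij * pi_j, eq (2-25)
np.fill_diagonal(Q, -Q.sum(axis=1))                 # rows sum to zero, eq (2-19)

P_half = np.diag(np.sqrt(pi))
P_half_inv = np.diag(1.0 / np.sqrt(pi))
S = P_half @ Q @ P_half_inv                         # symmetric matrix, eq (2-28)
lam, U = np.linalg.eigh(S)                          # real eigenvalues guaranteed

def transition_matrix(t):
    # P(t) = Pi^{-1/2} U e^{Gamma t} U^T Pi^{1/2}, combining (2-16) and (2-29)
    return P_half_inv @ U @ np.diag(np.exp(lam * t)) @ U.T @ P_half

P = transition_matrix(0.3)
assert np.allclose(P.sum(axis=1), 1.0)              # rows are distributions
assert np.allclose(pi @ P, pi)                      # pi is stationary, eq (2-21)
```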
An additional benefit of the GTR model is that, under it, the likelihood value of a
phylogenetic tree is independent of the position of the tree's root [114]. Therefore, we
can freely change the root position without changing the likelihood value of the tree.
2.5.4 Rate heterogeneity among different sites
In the previous discussion, the molecular substitution models were derived for a single
homologous site. Because mutation events are constrained by the physical and chemical
structures of DNA and protein molecules, and filtered by natural selection, the
substitution rate varies greatly among different genes, different codon positions, and
different gene regions [17]. Rate heterogeneity among sites is accommodated by
including an additional relative rate coefficient r in the substitution rate matrix, i.e.,

P(t) = e^{rQt}  (2 - 30)
There are several possible ways to determine r [77]:
(1) Assigning a different r and a different substitution rate matrix to different partitions
of the dataset (by gene or by position in the codon) [77];
(2) Assuming r at each site is drawn independently from a distribution; this distribution
can be continuous (such as the gamma distribution [119-121] or a log-normal
distribution) or discrete (assuming several rate categories, each with a separate
probability of being chosen);
(3) Assuming some fraction of the sites is invariable (i.e., r = 0) while the others mutate
at a constant rate [77];
(4) Combining several methods, for example, the “invariant + gamma” model.
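Option (2) with a discrete gamma approximation can be sketched as follows (a median-of-category variant using SciPy's gamma quantile function; the shape value and category count are illustrative choices, not values from this dissertation):

```python
from scipy.stats import gamma

def discrete_gamma_rates(alpha, k=4):
    # Take the median of each of k equal-probability categories of a
    # Gamma distribution with shape alpha and mean 1, then rescale so
    # the mean rate over the categories is exactly 1.
    quantiles = [(2 * i + 1) / (2.0 * k) for i in range(k)]
    rates = gamma.ppf(quantiles, a=alpha, scale=1.0 / alpha)
    return rates * k / rates.sum()

rates = discrete_gamma_rates(0.5)   # shape 0.5: strong rate heterogeneity
```

Each site is then evaluated under every category rate, and the site likelihoods are averaged with equal category weights, approximating the integral of equation (2 - 33) below.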
35
2.5.5 Other more realistic evolutionary models
A considerable amount of effort has targeted the development of more realistic
evolutionary models. Felsenstein and Churchill proposed a hidden Markov model (HMM)
to accommodate rate variation along the sequences [122]. Similarly, the assumption that
the rate should be the same in all branches of the tree also needs to be relaxed, and a
variety of methods have been proposed.
The gaps within alignments provide important phylogenetic signals; however, they are
often neglected or removed by common phylogenetic inference methods and packages.
Some models have been proposed to incorporate gaps into evolutionary models for
phylogenetic inference. These developments include the fragment substitution model
proposed by Thorne et al. [123] and the tree-HMM approach of Mitchison and Durbin
[23, 123, 124]. In the future, incorporation of rate variation correlated with
three-dimensional structure may be needed [21].
2.6 Likelihood function and its evaluation
Evaluating the likelihood of the data under a given model is a key component in Bayesian
phylogenetic inference and maximum likelihood estimation. Most computation time in
likelihood-based phylogenetic inference methods is spent in likelihood evaluation.
2.6.1 The likelihood function
The likelihood of a specific phylogenetic model Ψ is proportional to the probability of
observing the dataset X = \{x_{iu}\} given the phylogenetic model Ψ. Here we assume N is
the number of taxa and M is the sequence length. Each site x_u = (x_{1u}, x_{2u}, \ldots, x_{Nu}) is an
individual observation. The probability of observing the nucleotide pattern at a site
depends on the phylogenetic model Ψ, which includes a phylogenetic tree T, a vector of
branch lengths \tau = (\tau_1, \tau_2, \ldots, \tau_{2N-3}), and an evolutionary model θ.
As described in the previous section, the model of molecular evolution gives the
probability of a mutation from the i-th state to the j-th state at a site over a finite
period of time t. This transition probability is computed as

p_{ij}(t) = p(s' = j \mid s = i, t, \theta),  (2 - 31)

Here i is the starting state, j is the ending state, and t is the divergence time (i.e., the
length of the branch). p_{ij}(t) is computed by equation (2 - 13) when the substitution rate
matrix Q = \{q_{ij}\} is known. The substitution rate matrix is determined by θ, the
parameters of the evolutionary model.
The probability of observing the data at a site u given the phylogenetic tree is a sum
over all possible state assignments at the internal nodes of the tree, which is computed by

L(x_u \mid T, \tau, \theta) = \sum_{a_{(N+1)u}, \ldots, a_{(2N-1)u}} \pi_{a_{(2N-1)u}} \left( \prod_{i=N+1}^{2N-2} p(a_{iu} \mid a_{\alpha(i)u}, \tau, \theta) \right) \times \left( \prod_{i=1}^{N} p(x_{iu} \mid a_{\alpha(i)u}, \tau, \theta) \right)  (2 - 32)

In the above equation, \alpha(i) denotes the immediate ancestral node of node i, a_{iu} denotes
the residue state at node i for site u, and x_{iu} denotes the residue at the u-th site of the
i-th sequence.
When rate heterogeneity across sites is considered, and we assume the rate at a site
follows a distribution f(r \mid \alpha) with shape parameter α (e.g., the gamma distribution),
equation (2 - 32) is replaced by

L(x_u \mid T, \tau, \theta, \alpha) = \int_0^{\infty} \left[ \sum_{a_{(N+1)u}, \ldots, a_{(2N-1)u}} \pi_{a_{(2N-1)u}} \left( \prod_{i=N+1}^{2N-2} p(a_{iu} \mid a_{\alpha(i)u}, r\tau, \theta) \right) \times \left( \prod_{i=1}^{N} p(x_{iu} \mid a_{\alpha(i)u}, r\tau, \theta) \right) \right] f(r \mid \alpha) \, dr  (2 - 33)

Equation (2 - 33) can be approximated by replacing the continuous gamma distribution
with a discrete gamma distribution [125].
Assuming the observations at each site are independent, the likelihood of observing the
entire sequence is

L(X \mid T, \tau, \theta) = \prod_{u=1}^{M} L(x_u \mid T, \tau, \theta)  (2 - 34)

Generally, the logarithmic form of the likelihood is used, because the likelihood itself is
a very small number.
2.6.2 Felsenstein’s algorithm for likelihood evaluation
The probabilities given by (2 - 32) and (2 - 34) can be computed by traversing the tree in
post order using the algorithm proposed by Felsenstein [114]. Let p(L_u^k \mid a) denote the
probability of all the leaves below node k when the state at site u on node k is a, and let
L_u denote the likelihood of all leaves at site u.
The Felsenstein algorithm is shown in Figure 2-6. Starting from the leaves,
Felsenstein’s algorithm continuously prunes the subtrees through steps 9-14 until no
nodes other than the root remain.
Without redundant computation, Felsenstein’s algorithm needs about NMS(2S + 1)
multiplication operations in step 14. The memory space requirements vary with the
implementation. If only the site likelihood values at the most recently visited nodes are
saved, 16NS^2 + 24MS + NM bytes of memory are needed: the first term stores the
transition matrix for each branch (there are 2N - 2 branches in total for a rooted tree); the
second term stores the site likelihoods of the current node and its two children; and the
third term stores the data matrix. However, this scheme forces the algorithm to
re-compute the site likelihoods of all nodes even if only a small portion of the tree has
changed between two adjacent likelihood evaluations.
Compute-Node-Likelihood(k)
1.  Compute-Transition-Matrix(P^(k), τ_k)
2.  If Leaf-Node(k)
3.  Then
4.    For u ← 1; u ≤ M; u ← u + 1
5.      If a = x_{ku}
6.      Then p(L_u^k | a) ← 1
7.      Else p(L_u^k | a) ← 0
8.  Else
9.    i ← k.leftChild; j ← k.rightChild
10.   Compute-Transition-Matrix(P^(i), τ_i)
11.   Compute-Transition-Matrix(P^(j), τ_j)
12.   For u ← 1; u ≤ M; u ← u + 1
13.     Foreach a ∈ Set-of-States
14.       p(L_u^k | a) ← (Σ_b p(L_u^i | b) · p(b | a, τ_i)) · (Σ_c p(L_u^j | c) · p(c | a, τ_j))

Compute-Tree-Likelihood(T)
15. ln L ← 0
16. Compute-Node-Likelihood(2N - 1)
17. L_u ← 0
18. For u ← 1; u ≤ M; u ← u + 1
19.   L_u ← Σ_a π_a · p(L_u^{2N-1} | a)
20.   ln L ← ln L + ln(L_u)

Figure 2 - 6: The Felsenstein algorithm for likelihood evaluation
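The pruning recursion of Figure 2-6 can be sketched compactly in executable form (a minimal illustration under the Jukes-Cantor model; the toy tree and sequences are invented for the example and are not data from this dissertation):

```python
import numpy as np

STATES = "ACGT"

def jc_transition(t):
    # Jukes-Cantor transition matrix P(t): closed-form diagonal and
    # off-diagonal entries for branch length t.
    same = 0.25 + 0.75 * np.exp(-4.0 * t / 3.0)
    diff = 0.25 - 0.25 * np.exp(-4.0 * t / 3.0)
    return np.where(np.eye(4, dtype=bool), same, diff)

def conditional(node, u):
    # p(L_u^k | a) for every state a at node k (steps 4-14 of Figure 2-6).
    if isinstance(node, str):                # leaf: observed sequence
        vec = np.zeros(4)
        vec[STATES.index(node[u])] = 1.0
        return vec
    (left, t_l), (right, t_r) = node         # internal: two (child, branch) pairs
    return (jc_transition(t_l) @ conditional(left, u)) * \
           (jc_transition(t_r) @ conditional(right, u))

def log_likelihood(tree, m):
    pi = np.full(4, 0.25)                    # steps 15-20: sum over root states
    return float(sum(np.log(pi @ conditional(tree, u)) for u in range(m)))

# Toy rooted tree: ((seq1:0.1, seq2:0.1):0.2, seq3:0.3)
tree = (((("ACGT", 0.1), ("ACGT", 0.1)), 0.2), ("ACCT", 0.3))
lnL = log_likelihood(tree, 4)
```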
2.7 Optimizations of likelihood computation
The likelihood evaluation can be optimized in several ways between two adjacent
evaluations of tree likelihood.
2.7.1 Sequence packing
Repeated site patterns are common in the real datasets used in phylogenetic inference.
The length of the sequences can be cut down by packing the sites with the same pattern,
which speeds up the likelihood computation. For example, if there are w columns
consisting entirely of state “a” in the dataset, we calculate the likelihood of this pattern
once and raise it to the power of w. Through sequence packing, equation (2 - 34) is
replaced by

L(X \mid T, \tau, \theta) = \prod_{p=1}^{P} L(x_p \mid T, \tau, \theta)^{w_p}  (2 - 35)

Here P is the total number of site patterns, w_p is the weight of pattern p (i.e., the number
of sites with pattern p), and L(x_p \mid T, \tau, \theta) is the site likelihood of pattern p.
Sequence packing can reduce the likelihood computation by a factor of 1 - P/M, where
P is the number of unique site patterns and M is the number of characters.
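The pattern counting behind equation (2 - 35) can be sketched with a made-up toy alignment (each alignment column becomes a tuple, and identical columns are collapsed into a pattern with weight w_p):

```python
from collections import Counter

alignment = ["ACGTACGA",
             "ACGTACGA",
             "ACCTACGA"]
columns = list(zip(*alignment))    # one tuple per alignment column
weights = Counter(columns)         # pattern -> number of sites, w_p
patterns = list(weights)           # the P unique site patterns

print(len(columns), "sites packed into", len(patterns), "patterns")
# The log-likelihood is then computed as sum over patterns of
# w_p * ln L_p instead of one term per site.
```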
2.7.2 Likelihood local update
Since in most MCMC algorithms the phylogenetic model changes continuously, and
the change between two adjacent generations is small, if we record the nodes affected by
a change in the parameter values and trace them back to the root, then only the conditional
probabilities of the nodes on the back-tracing path need to be recomputed; all other parts
of the computation remain the same. We call this shortcut a local update of the
likelihood. Figure 2-7 shows how the local update works. The local update can reduce the
average number of nodes to be evaluated from N to (1/2) log_2 N. The disadvantage is that
all the conditional probability values from the previous computation must be kept in
memory, increasing the memory requirement to 16NS^2 + 8NMS + NM bytes: the first
term stores the transition matrix for each branch; the second term stores the likelihoods of
all internal nodes; and the third term stores the data matrix (we use a multiplier of 8 for
likelihood and transition probability values since they are stored as double-precision
numbers).
When N is large, the local update scheme can speed up the likelihood evaluation
remarkably. For this reason, most Bayesian inference programs adopt the local update
scheme despite the additional memory requirement. However, smart memory
management is required to preserve the local update property without frequent data copy
operations.
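The bookkeeping for a local update can be sketched with an invented parent-pointer structure (a toy illustration, not the actual data layout of any inference program): after a proposal changes a branch, only the ancestors of the changed node are flagged for recomputation.

```python
class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.dirty = True      # dirty until its likelihood is computed

def mark_dirty(node):
    # Trace back from the changed node to the root (Figure 2-7); only
    # these nodes are recomputed at the next likelihood evaluation.
    while node is not None:
        node.dirty = True
        node = node.parent

root = Node()
a = Node(root)
b = Node(a)
for n in (root, a, b):
    n.dirty = False            # pretend a full evaluation just finished
mark_dirty(b)                  # the branch below b changed
assert root.dirty and a.dirty and b.dirty
```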
If branch a-b has been changed, only the nodes on the path from node a to the root need to be recomputed to obtain the likelihood of the tree.
Figure 2 - 7: Illustration of likelihood local update
2.7.3 Tree balance
After multiple mutation operations, the current tree may become imbalanced: one subtree
of the root is much deeper than the other. We define the depth of a subtree as the
maximum number of internal nodes on a path from the root of the subtree to any leaf
node in the subtree. As discussed above, the average number of nodes that must be
re-computed in a local update is about half the depth of the root; reducing the depth of
the root can therefore speed up the computation because there are fewer nodes to
recompute. The tree-balancing algorithm is shown in Figure 2-8.
2.8 Markov Chain Monte Carlo methods
2.8.1 The Metropolis-Hasting algorithm
Though Bayesian analysis of phylogeny provides a direct, formal approach to dealing
with uncertainty in phylogenetic inference using sophisticated statistical models, the
computation required by the integrations over unknown parameters is a major obstacle.
tree-balance(T)
1.  root ← T.root
2.  gap ← depth(root.leftChild) − depth(root.rightChild)
3.  If abs(gap) < 2
4.  Then return
5.  Else node ← root
6.  For i ← 0; i ≤ abs(gap)/2; i ← i + 1
7.    If depth(node.leftChild) > depth(node.rightChild)
8.    Then node ← node.leftChild
9.    Else node ← node.rightChild
10. Tree-Reroot(T, node)

Figure 2 - 8: The tree-balance algorithm
Until the advancement of computing technologies and the introduction of Markov Chain
Monte Carlo methods, Bayesian phylogenetic inference was not feasible.
Markov Chain Monte Carlo (MCMC) refers to a class of methods that simulate random
variables from a target distribution known up to a normalizing constant. The basic idea
of MCMC methods is first to construct a Markov chain that has the space of the
parameters to be estimated as its state space and the posterior probability distribution of
the parameters as its stationary distribution, and then to simulate the chain and treat the
realization as a (hopefully large and representative) sample from the posterior probability
of the parameters of interest. Two major strategies can be used to construct Markov
chains for exploring a posterior distribution: the Metropolis-Hasting algorithm [126, 127]
and the Gibbs sampler [128].
When applied to phylogenetic inference, the Metropolis-Hasting algorithm can be
described as follows (Figure 2-9):
Metropolis-Hasting algorithm
1. t ← 0; Ψ^(0) ← Ψ_0
2. Repeat steps 3-9:
3.   Draw a sample Ψ′ from q(· | Ψ^(t))
4.   Draw a random variable u from a uniform distribution U(0, 1)
5.   Compute α(Ψ^(t), Ψ′)
6.   If u ≤ α(Ψ^(t), Ψ′)
7.   Then Ψ^(t+1) ← Ψ′
8.   Else Ψ^(t+1) ← Ψ^(t)
9.   t ← t + 1

Figure 2 - 9: The Metropolis-Hasting algorithm
In the above algorithm, α(Ψ^(t), Ψ′) is called the acceptance probability, and its
definition distinguishes different MCMC algorithms. In the original Metropolis algorithm
[126], the acceptance probability is

\alpha(\Psi^{(t)}, \Psi') = \min\left(1, \frac{\pi(\Psi' \mid X)}{\pi(\Psi^{(t)} \mid X)}\right)  (2 - 36)

Hasting [127] extended the original Metropolis algorithm by allowing an asymmetric
proposal probability q(\Psi' \mid \Psi^{(t)}) and introduced a new transition kernel

\alpha(\Psi^{(t)}, \Psi') = \min\left(1, \frac{\pi(\Psi' \mid X)}{\pi(\Psi^{(t)} \mid X)} \cdot \frac{q(\Psi^{(t)} \mid \Psi')}{q(\Psi' \mid \Psi^{(t)})}\right) .  (2 - 37)
The proposal probability can take any form that satisfies q(· | Ψ) > 0; the choice of
q(· | Ψ) may affect the convergence rate of the Markov chain.
In (2 - 36) and (2 - 37), π(Ψ | X) is the posterior distribution of phylogenetic models,
which is proportional to the product of the likelihood and the prior probability. As the
form of the acceptance probability shows, only the likelihood ratio and the prior ratio
between the current sample Ψ^(t) and the candidate sample Ψ′ are needed to decide
whether to accept the proposal; computation of the normalizing constant in (2 - 2) is
unnecessary.
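The accept/reject step of equations (2 - 36) and (2 - 37) can be sketched generically in log space (the target and proposal below are illustrative stand-ins for a one-dimensional parameter, not the phylogenetic posterior):

```python
import math
import random

def mh_step(x, log_post, propose, log_q):
    # One Metropolis-Hasting transition, eq (2-37), computed in log space
    # because likelihoods are typically extremely small numbers.
    y = propose(x)
    log_r = (log_post(y) - log_post(x)) + (log_q(y, x) - log_q(x, y))
    if random.random() < math.exp(min(0.0, log_r)):
        return y          # accept the candidate
    return x              # reject: stay at the current state

# Example target: a standard normal, up to its normalizing constant (which
# cancels in the ratio). The proposal is a symmetric uniform window, so
# the q-terms cancel and this reduces to the Metropolis rule, eq (2-36).
random.seed(0)
log_post = lambda x: -0.5 * x * x
propose = lambda x: x + random.uniform(-1.0, 1.0)
log_q = lambda a, b: 0.0

chain = [0.0]
for _ in range(20000):
    chain.append(mh_step(chain[-1], log_post, propose, log_q))
```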
2.8.2 Exploring the posterior distribution
The direct objective of MCMC in Bayesian analysis is to calculate the integrals appearing
in the marginal distributions shown in equations (2 - 5) and (2 - 6). Thus an MCMC
method plays the same role as the Monte Carlo integral approximation method. According
to the law of large numbers, the variance of a Monte Carlo integral approximation is
proportional to 1/N, regardless of the dimensionality of the state space of the target
distribution [129]. Note that N here is the number of samples, and the variance decreases
as N increases. Directly drawing samples from a complex, high-dimensional space is
difficult; Metropolis-based MCMC provides an effective sampling mechanism by
evolving a Markov chain.
In theory, a Markov chain constructed with the Metropolis-Hasting algorithm will
converge to a stationary distribution if the chain is irreducible and aperiodic and
possesses a stationary distribution, given that the chain runs long enough [130, 131]. The
irreducibility property requires that the chain have a positive probability of moving from
any state to any other state in a finite number of time steps, i.e.,

P(\Psi(t+s) = \Psi_j \mid \Psi(t) = \Psi_i) > 0  (2 - 38)

Here \Psi_i and \Psi_j are any pair of states in the state space, t is the current time, and s is
the number of time steps needed to move from \Psi_i to \Psi_j.
If the chain is irreducible, then the chain can reach any state after a sufficiently large
number of time steps no matter what the starting state is. It is intuitive that all tree
proposal methods shown in Chapter 3 guarantee the irreducible property of the MCMC
chain. Thus MCMC is a promising method for phylogenetic inference.
2.8.3 The issues
Equation (2 - 38) does not provide any information regarding how large s must be. As the
length of any MCMC chain used in a real analysis is limited, there is a risk that some
states are never reached before the chain is terminated. One fundamental reason is that
the samples generated using Metropolis-Hasting algorithms are dependent samples; as a
result, samples at nearby time steps are correlated. Samples drawn using MCMC tend to
stick around a local mode of the target distribution. Due to such "stickiness", the chain
may mix extremely slowly: it may take a huge (perhaps infinite) number of time steps for
the chain to move from one mode to another.
Increasing the mixing rate of an MCMC chain may improve the quality of the
posterior distribution approximated by the chain. However, if the chain moves too fast,
the acceptance ratio (the ratio of the number of accepted proposals to the total number of
proposals) becomes very low and a large percentage of the computation is wasted; if the
chain moves too slowly, the acceptance ratio is high but it may take an extraordinarily
large number of time steps for the chain to converge [132]. Neither is satisfactory.
The quality of the posterior distribution sampled by an MCMC sampler is critical to
the accuracy of the conclusions summarized from that distribution. If the approximated
distribution \hat{\pi}(\Psi) deviates from the real distribution \pi(\Psi), the conclusions based on
\hat{\pi}(\Psi) may be completely misleading. For example, a poorly implemented Markov chain
may be trapped at a local optimum \Psi_{local}; the samples generated from this chain then
give an extremely high posterior probability to \Psi_{local}, which may be far from the truth.
Therefore, several practical issues exist for the original Metropolis-Hasting algorithm
and have to be addressed in an implementation:
1) choosing appropriate proposal steps;
2) making the chain move quickly when exploring a posterior distribution in a high-
dimensional parameter space;
3) preventing the chain from being trapped at local optima; and
4) deciding when to halt the chain.
2.9 Summary of the posterior distribution
Once the posterior probability has been approximated with MCMC samplers, various
kinds of information can be summarized from the resulting posterior distribution.
2.9.1 Summary of the phylogenetic trees
The posterior probability of phylogenetic trees can be summarized in several ways:
I. Summarizing with the posterior probability of trees. The frequency of occurrence of a
phylogeny in the samples can be interpreted as an approximation of its posterior
probability. By ranking the trees in order of their posterior probability, the 99%
credible set of trees can be obtained. Among these trees, the one with the maximum
posterior probability is called the MPP tree.
II. Summarizing with the posterior probability of clades. A clade is the group of taxa
on one side of the bipartition induced by an internal branch. As with the posterior
probability of a phylogenetic tree, the frequency of a clade in the samples can be
interpreted as its posterior probability. Using the clade posterior probability
distribution and a specific consensus rule, a consensus tree can be constructed as a
summary of the posterior probabilities of the clades.
III. Summarizing with the likelihood values. Though seldom used in practice, the
samples can also be summarized using their likelihood values, as in maximum
likelihood methods.
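Summary II can be sketched by counting clades over a handful of invented tree samples (each sampled tree is represented here simply as a set of taxon sets; the trees and taxa are made up for illustration):

```python
from collections import Counter

samples = [
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AB"), frozenset("ABD")},
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("CD"), frozenset("ACD")},
]
counts = Counter(clade for tree in samples for clade in tree)
n = len(samples)
posterior = {clade: counts[clade] / n for clade in counts}

# A majority-rule consensus keeps the clades whose approximated
# posterior probability exceeds 0.5.
consensus = [clade for clade, p in posterior.items() if p > 0.5]
```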
2.9.2 Summary of the model parameters
When model parameters are included in the state of the MCMC, a Bayesian analysis also
outputs the parameter samples drawn during the MCMC run. Model parameters and some
evolutionary characteristics (e.g., the tree length, or the distance between the two
partitions separated by an internal branch) can then be summarized from the approximated
posterior distribution using conventional statistical methods.
2.10 Chapter summary
This chapter provides an overview of phylogenetic inference methods and the framework
of Bayesian phylogenetic inference. There are various competing phylogenetic methods
to build, from a dataset, a phylogenetic tree that provides clues to the evolutionary
history of a set of taxa. These methods may use different sources of data and different
optimality criteria for choosing the best estimate, but their common objective is the same:
estimating the correct tree or, if that is impossible, making the estimate as close to the
true tree as possible.
As phylogenetic trees become larger and larger (up to thousands or tens of thousands
of taxa), advanced algorithms and high-performance implementations become critical to
guarantee that the estimated trees are sufficiently close to the true tree. In this
dissertation, we choose Bayesian methods as a candidate for inferring large phylogenies.
Bayesian phylogenetic inference is founded on the likelihood function of a dataset
under some phylogenetic model. It is a twin of maximum likelihood estimation; both
require an explicit probabilistic model of evolution.
The computational complexity of Bayesian phylogenetic inference is handled by a
family of Markov Chain Monte Carlo methods, which can be viewed either as sampling
methods or as stochastic search methods, depending on the interest of the study and the
context.
Using MCMC, the posterior distributions of both the phylogenetic trees and the
parameters of the model of evolution can be approximated, and conventional statistical
procedures can be applied to summarize the variables of interest.
Chapter 3
Improved Monte Carlo Strategies
3.1 Introduction
As described in Chapter 2, Bayesian phylogenetic inference relies on the posterior
distribution approximated by Markov Chain Monte Carlo (MCMC) methods. The quality
of the posterior distribution sampled by an MCMC sampler is critical to the accuracy of
the conclusions. If the approximated distribution deviates from the real distribution, the
conclusions may be completely misleading, as observed in the literature [49, 50].
Though the MCMC method plays a critical role in Bayesian phylogenetic inference,
there are few studies of its performance due to its computational expense. Developing
better MCMC strategies and studying their performance are necessary for two reasons:
1) Some experimental results indicate that Bayesian phylogenetic inference using
MCMC (BMCMC) produces misleading posterior probabilities for trees and clades
under some evolutionary scenarios. It is not clear whether the MCMC implementation
or some deeper aspect of Bayesian phylogenetic inference is responsible for such
discrepancies.
2) There is no theoretical or formal way to detect or guarantee that an MCMC run has
converged to the correct posterior distribution when the chain stops after a certain
number of time steps.
Thus, improved MCMC strategies are required for more robust and more efficient
Bayesian phylogenetic inference. This chapter presents our observations on, and
improvements to, MCMC strategies for use in Bayesian phylogenetic inference.
3.2 Observations
To illustrate the problem in MCMC implementation, we use the Metropolis-Hasting
algorithm to approximate the distribution shown in Figure 3-1. The target distribution has
the analytical form

f(x) = \begin{cases} c \sum_{i=1}^{3} a_i e^{-(x-\mu_i)^2 / (2\sigma_i^2)} & x \in [0,1] \\ 0 & x \notin [0,1] \end{cases}  (3 - 1)

where a_1 = 0.5, \mu_1 = 0.1, \sigma_1 = 0.04, a_2 = 1.0, \mu_2 = 0.5, \sigma_2 = 0.04, a_3 = 0.5, \mu_3 = 0.9,
\sigma_3 = 0.04, and c = 0.5. The modes of this target distribution are at 0.1, 0.5, and 0.9.

Figure 3 - 1: A target distribution with three modes
We use the delta method described in the previous section to propose a candidate
sample point, namely

x(t+1) = x(t) + \lambda \cdot (u - 0.5),  (3 - 2)

where u is drawn from a uniform distribution U(0, 1) and \lambda is the proposal step length.
Figure 3 - 2: Distributions approximated using the Metropolis algorithm: (a) λ = 0.7, x_0 = 0.5; (b) λ = 0.2, x_0 = 0.5
The target distribution seems simple, but it is difficult to approximate accurately
using an MCMC chain constructed with the original Metropolis-Hasting algorithm.
Figure 3-2 shows two approximations, using λ = 0.7 and λ = 0.2. Both chains start at
x(0) = 0.5.
Though all three modes appear in the approximation shown in Figure 3-2 (a), the
shape of each mode differs slightly from the target distribution. In Figure 3-2 (b), the
approximation shows only one mode; the other two modes have disappeared. Neither
shows the expected distribution of Figure 3-1.
Figure 3 - 3: Samples drawn at each time step using the Metropolis MCMC method: (a) λ = 0.7, x_0 = 0.5; (b) λ = 0.2, x_0 = 0.5
Figure 3-3 shows the samples drawn at each time step during the above two
approximations. We observe that for larger proposal steps the chain mixes faster, while
for smaller proposal steps the chain "sticks" around a local mode.
For the above example, though the target distribution is known and simple, it is
nontrivial to choose a proposal step parameter that achieves an efficient MCMC chain.
Since we know little about the shape of the posterior distribution of a phylogenetic
model, we have to be cautious in interpreting summaries of posterior distributions
sampled using MCMC methods. At the same time, we must develop more robust MCMC
methods and investigate their performance in practical phylogenetic inference.
3.3 Strategy #1: reducing stickiness using variable proposal step length
In the previous section, we discussed the risk that a Markov chain constructed using the
Metropolis-Hasting algorithm may be trapped at a local mode and fail to explore the
desired distribution. Changing the proposal step length may improve the mixing
properties of the underlying Markov chain.
According to the irreducibility requirement, to approximate the target distribution
correctly, at time t the chain must have a positive probability of moving from one state
\Psi_i to any other state \Psi_j in a finite number of time steps s, i.e.,

P(\Psi(t+s) = \Psi_j \mid \Psi(t) = \Psi_i) > 0 .  (3 - 3)
Under ideal circumstances, the chain can move from one state to any other state within
one step (as shown in Figure 3-4); according to a proof of the MCMC algorithm [130],
such a chain will approximate the distribution accurately. However, this situation is rare;
in practice, the chain needs to traverse some intermediate states to reach another state. If
the transition probabilities between the intermediate states are smaller than some
threshold, the probability given in (3 - 3) may be close to 0, which means the target state
will, in practice, never be reached, and the theoretical approximation deviates greatly
from the realized approximation.
Therefore, we propose a variable-step-length MCMC, which draws the step length
from a chosen distribution (for example, a uniform distribution) and uses this step length
to propose a new candidate state. Using a variable step length, the chain can move
between different states more freely and overcome its "stickiness" to a local mode.
Figure 3-5 shows the resulting approximation of the target distribution of Figure 3-1.
The distribution of the samples is close to the target distribution, both in the number of
modes and in the shape of each mode.
The number of possible states of phylogenetic trees, even for a small number of taxa, is extraordinarily large. However, the distance between any two states is less than the number of taxa, even using the simplest tree proposal method, such as NNI (Nearest Neighbor Interchange). We used the extended tree mutation operator for the variable step length proposal in phylogenetic inference; this algorithm is shown in Section 3.7.1.
Figure 3 - 4: Illustration of state moves
3.4 Strategy #2: reducing sampling intervals using multipoint MCMC
Choosing a good proposal mechanism is very difficult in phylogenetic inference.
Variable step length MCMC can reduce the risk of becoming trapped in local optima, but it introduces another issue: a low acceptance rate. A chain must try many proposals to accept one successful candidate, which usually requires a large sampling interval.
One strategy is to propose multiple sample candidates and consider their combined
effects in deciding the next move of the Markov chain. We call this strategy multipoint
MCMC. In this dissertation, we implemented a variant of multipoint MCMC proposed by
Liu et al. [132]. Figure 3-6 illustrates the process of multipoint MCMC. This algorithm
includes 4 steps:
1) Propose K samples X_1, X_2, ..., X_K from the distribution of X ;
2) Select a candidate Y from X_1, X_2, ..., X_K according to the probabilities of X_1, X_2, ..., X_K ;
Figure 3 - 5: Approximated distribution using variable step length MCMC
3) Propose K samples Y_1, Y_2, ..., Y_K from the distribution of Y ;
4) Accept Y with acceptance ratio

r = min{1, [w(X_1, X) + w(X_2, X) + ... + w(X_K, X)] / [w(Y_1, Y) + w(Y_2, Y) + ... + w(Y_K, Y)]} (3 - 4)

and reject it with probability 1 − r.
In equation (3 - 4), w(X, Y) = π(X) T(X, Y) λ(X, Y), where T(X, Y) is an arbitrary proposal function and λ(X, Y) is an arbitrary, symmetric, non-negative function.
In our implementation, we chose

w(X_i, X) = ln[π(X_i)/π(X*)] = ln L(X_i) − ln L(X*) (3 - 5)

and

w(Y_i, Y) = ln[π(Y_i)/π(X*)] = ln L(Y_i) − ln L(X*) . (3 - 6)
Figure 3 - 6: The multipoint MCMC
Thus, the acceptance ratio r becomes:

r = [ln L(X_1) + ln L(X_2) + ... + ln L(X_K)] / [ln L(Y_1) + ln L(Y_2) + ... + ln L(Y_K)] . (3 - 7)
We can use a similar technique [132] to prove that the above algorithm will correctly
approximate the posterior distribution of phylogenetic models.
Though multipoint MCMC allows the chain to keep moving with a large step size, one potential issue is that if the step length is still small and the distribution has multiple modes, multipoint MCMC may fail just as often as the classical Metropolis algorithm. By combining multipoint MCMC with variable step length to draw candidate samples with different step sizes, we can overcome this issue.
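The four steps above can be sketched on the same kind of toy density. In the sketch below (an illustration, not the PBPI weighting), the proposal is symmetric, so the weight w(y, x) reduces to the unnormalized target density, which is a simplifying assumption.

```python
import math
import random

def target(x):
    """Unnormalized two-mode toy density (modes at 0.2 and 0.8)."""
    return (math.exp(-0.5 * ((x - 0.2) / 0.05) ** 2)
            + math.exp(-0.5 * ((x - 0.8) / 0.05) ** 2))

def multipoint_step(x, rng, K=5, delta=0.8):
    """One multipoint (multiple-try) update: propose K candidates, select one
    in proportion to its weight, build the reverse set, then accept/reject."""
    ys = [x + rng.uniform(-delta, delta) for _ in range(K)]   # step 1
    wy = [target(y) for y in ys]
    total = sum(wy)
    if total <= 0.0:
        return x
    r, acc, Y = rng.uniform(0.0, total), 0.0, ys[-1]          # step 2
    for y, w in zip(ys, wy):
        acc += w
        if r <= acc:
            Y = y
            break
    xs = [Y + rng.uniform(-delta, delta) for _ in range(K - 1)] + [x]  # step 3
    wx = sum(target(z) for z in xs)
    if wx > 0.0 and rng.random() < min(1.0, total / wx):      # step 4
        return Y
    return x

rng = random.Random(2)
x, samples = 0.5, []
for _ in range(20000):
    x = multipoint_step(x, rng)
    samples.append(x)
```

Because K candidates are weighed per cycle, the chain can keep a large step size and still accept moves frequently.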
3.5 Strategy #3: improving mixing rate with parallel tempering
As shown in Section 3-2, a target distribution may contain multiple modes which are
separated by high energy barriers, or low probability regions. In phylogenetic inference,
such regions could be phylogenetic models with low likelihood scores. If a proposal
mechanism fails to draw candidate samples in regions which are separated from current
states by low probability regions, the chain may seem to converge, but the approximation
is far from complete.
One strategy is to use an augmented distribution Π = {π_i(x)} (i = 1, ..., m), which consists of multiple “tempered” distributions, each distribution having a different temperature T_i. Increasing T_i will result in a flatter distribution, given a heating schema like

π_i(x) = π_0(x)^(1/(1+T_i)) . (3 - 8)
Figure 3-7 shows four tempered distributions based on the target distribution given in
Figure 3-1. The temperatures of the four distributions are 0.0, 1.0, 3.0, and 8.0.
Intuitively, Metropolis algorithms can approximate a flatter distribution accurately. This
was verified in our simulation.
Figure 3 - 7: A family of tempered distributions with different temperatures
Metropolis-coupled MCMC, first proposed by Geyer [133, 134], is also called parallel tempering, exchange Monte Carlo, or (MC)³. This strategy has been adopted in MrBayes [58]. The idea of Metropolis-coupled MCMC is to run several chains in parallel, each chain having a different stationary distribution π_i(Ψ), with index swap operations conducted in place of the temperature transitions of simulated annealing. The chain with distribution π_1(Ψ) is used in sampling and is called the cold chain. The other chains are used to improve the mixing of the chains and are called heated chains. The Metropolis-coupled MCMC algorithm is shown in Figure 3-8.
An alternative is to combine the parallel step and the swap step into a super step and conduct a swap step at every generation. A parallel version of Metropolis-coupled MCMC was implemented in this dissertation and will be described in Chapter 4.
There are three related questions when applying Metropolis-coupled MCMC in
Bayesian phylogenetic inference:
1) How many chains are needed?
2) Which heating schema should be used?
3) Will Metropolis-coupled MCMC fail?
How many chains to use is an empirical issue. Usually, more chains provide more chances to improve the mixing rate and avoid local optima, but more chains also incur a higher computational cost. Parallel computing can keep the total time from increasing as the number of chains increases. We observed the benefit of choosing the number of chains according to the heating schema and the target distribution.

Metropolis-coupled MCMC
1. t ← 0
2. For i ← 1 to m
3.   Ψ_i^(t) ← Ψ_i^0
4. While (stop-condition-not-met)
5.   Draw random variable u_1 from U(0,1)
6.   If u_1 ≤ α_0
7.   Then do-classical-MCMC in parallel
8.   Else // do a chain swap operation
9.     Choose two chains i and j
10.    Compute α_s = min{1, [π_i(Ψ_j^(t)) π_j(Ψ_i^(t))] / [π_i(Ψ_i^(t)) π_j(Ψ_j^(t))]}
11.    Draw random variable u_2 from U(0,1)
12.    If u_2 ≤ α_s
13.    Then swap-index-temperature(i, j)
14.  t ← t + 1

Figure 3 - 8: The Metropolis-coupled MCMC algorithm
Assuming two adjacent chains with temperatures T_i and T_{i+1}, there is a rough relationship among the chain temperatures, the log likelihood, and the desired acceptance ratio,

α ≈ [π(Ψ_{i+1})^(1/(1+T_i)) · π(Ψ_i)^(1/(1+T_{i+1}))] / [π(Ψ_i)^(1/(1+T_i)) · π(Ψ_{i+1})^(1/(1+T_{i+1}))] . (3 - 9)

From (3 - 9), we have

(1/(1+T_i) − 1/(1+T_{i+1})) · Δln L ≈ −ln α . (3 - 10)
Here, ln LΔ is the typical difference of the logarithmic form of the likelihood which can
be obtained by averaging the differences between random samples and the one estimated
using maximum likelihood; α is the lower bound of the acceptance ratio of the chains.
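Relation (3-10) can be turned into a simple recipe for spacing the temperatures: given the typical log-likelihood difference and a desired lower bound on the swap acceptance ratio, solve for the next inverse heating factor. The function below is an illustrative sketch of that computation, not part of PBPI.

```python
import math

def temperature_ladder(n_chains, delta_lnL, alpha, T1=0.0):
    """Chain temperatures spaced so adjacent swaps keep an acceptance ratio of
    roughly alpha, via 1/(1+T_{i+1}) = 1/(1+T_i) + ln(alpha)/delta_lnL (3-10)."""
    temps = [T1]
    for _ in range(n_chains - 1):
        inv = 1.0 / (1.0 + temps[-1]) + math.log(alpha) / delta_lnL
        if inv <= 0.0:
            raise ValueError("delta_lnL too small for another chain at this alpha")
        temps.append(1.0 / inv - 1.0)
    return temps

# e.g. a typical log-likelihood spread of 20 units and a 50% target acceptance
print(temperature_ladder(4, delta_lnL=20.0, alpha=0.5))
```

The larger Δln L is (as it tends to be with many taxa), the closer together the temperatures must sit to keep the swap acceptance ratio above α, which is consistent with the low acceptance ratios we observed for large problems.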
During our experiments, we observed that for phylogenetic inference problems with a
large number of taxa, Metropolis-coupled MCMC may have very low acceptance ratios.
3.6 Proposal algorithms for phylogenetic models
A phylogenetic model Ψ = (T, τ, θ) includes three components: a tree topology T ; a vector of branch lengths τ = (τ_1, ..., τ_n), where n is the number of branches; and the parameters of an evolutionary model θ = (θ_1, ..., θ_m), where m is the number of parameters describing the model. Thus, a phylogenetic model lies in the space Ω = T × τ × θ, which consists of discrete sub-spaces separated by all possible tree topologies. Another characteristic of this space is that m varies with different model assumptions (for example, m = 1 for the JC69 model; m = 6 for the HKY model; and m = 10 for the GTR model).
An MCMC method is essentially a sampler which draws dependent samples from the target distribution using one or more Markov chains constructed with the Metropolis-Hastings algorithm or its variants. We break an MCMC sampler into two parts: 1) generate a proposal for the next move, and 2) design a transition function to guide the move. In this section, we discuss how to generate a proposal in the space of phylogenetic models, i.e. Ω = T × τ × θ.
For one state in the phylogenetic model space Ω, there is one discrete random variable, T , and there are n + m continuous random variables τ_1, ..., τ_n, θ_1, ..., θ_m. Gibbs sampling strategies provide various mechanisms to update each component either randomly or systematically [132]. The random-scan Gibbs sampler chooses a component c at random, draws a sample from π(· | Ψ_{−c}^(t)), and leaves the other components unchanged at time step t . A systematic-scan Gibbs sampler updates every component in order within one super step. Therefore, given the current phylogenetic model state Ψ^(t), we can propose a new state by updating T , τ_i, and θ_i separately.
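The two scan strategies can be sketched as follows. The toy posterior below stands in for the continuous components (branch lengths and model parameters), with the discrete topology update omitted; this is an illustration of Gibbs-style component updates, not the PBPI implementation.

```python
import math
import random

def update_component(state, c, rng, log_post, step=0.5):
    """Metropolis-within-Gibbs: perturb component c only, accept by the
    posterior ratio, and leave all other components unchanged."""
    cand = list(state)
    cand[c] += rng.uniform(-step, step)
    log_r = log_post(cand) - log_post(state)
    if log_r >= 0 or rng.random() < math.exp(log_r):
        return cand
    return state

def random_scan(state, rng, log_post):
    """Choose one component c at random and update it."""
    return update_component(state, rng.randrange(len(state)), rng, log_post)

def systematic_scan(state, rng, log_post):
    """One super step: update every component in order."""
    for c in range(len(state)):
        state = update_component(state, c, rng, log_post)
    return state

# toy posterior: three independent standard-normal components standing in
# for continuous model components (an illustrative assumption)
log_post = lambda s: -0.5 * sum(v * v for v in s)
rng = random.Random(11)
state, draws = [3.0, -3.0, 3.0], []
for _ in range(4000):
    state = systematic_scan(state, rng, log_post)
    draws.append(state)
```

Either scan leaves the target invariant; random scan is cheaper per step, while a systematic super step touches every component once per cycle.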
3.6.1 Basic tree mutation operators
A new tree topology can be generated by mutating current topology through three basic
tree operators: NNI (Nearest Neighbor Interchanges), SPR (Subtree Pruning-Regrafting),
and TBR (Tree Bisection-Reconnection) [77].
1) NNI changes a tree topology locally. Any internal branch b = (u, v) of a tree topology connects four subtrees (A , B , C , and D ), where u and v are the labels of the two nodes connected by b . Assuming the original tree topology is (A, B | C, D), two additional tree topologies (A, C | B, D) and (A, D | B, C) are obtained by swapping one subtree of node u and one subtree of node v .
2) SPR prunes a subtree T_s from the current tree T and then attaches T_s to a branch of the pruned tree T \ T_s. Assuming the number of leaves in T is n and the number of leaves in T_s is m , then 2(n − m) − 3 topologies may be obtained from a single SPR operation.
3) TBR partitions a tree T into two subtrees (T_A and T_B) along a branch, and chooses one branch a from T_A and another branch b from T_B. A new topology can then be obtained by connecting branch a and branch b . Assuming the number of leaves on T_A is n_a and the number of leaves on T_B is n_b, a single TBR operation can result in (2n_a − 3)(2n_b − 3) new topologies.
It can be shown that NNI, SPR, and TBR, each used separately, can change any tree topology into any other tree topology with a finite number of mutation operations. This is true under the condition that all proposed moves are accepted; it may not be true if the move is subject to selection according to its likelihood score, because a high energy barrier (or low probability region) will constrain the move to the neighborhood of some local optimum.
3.6.2 Basic tree branch length proposal methods
The length of a branch can be any real number within the interval (0, ∞). Two proposal methods can be used to propose new branch lengths: the scaling and delta methods.
3.6.2.1 The scaling method
Denote the current branch length τ_0. A new branch length is generated as τ* = τ_0 · e^(λ(u − 0.5)), where u is drawn from a uniform distribution U(0, 1). When u < 0.5, the branch is shortened; when u > 0.5, the branch length is extended. The parameter λ controls the range of the proposed branch length.
3.6.2.2 The delta method
The delta method adds a perturbation to the current branch length as τ* = τ_0 + λ(u − 0.5), where u is drawn from a uniform distribution U(0, 1), and λ controls the step length.
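Both proposals are one-liners, sketched below. The reflection used to keep the delta proposal positive is an added assumption for the sketch, not part of the text.

```python
import math
import random

def scale_branch(tau0, lam, rng):
    """Scaling method: tau* = tau0 * exp(lam * (u - 0.5)); always positive."""
    u = rng.random()
    return tau0 * math.exp(lam * (u - 0.5))

def delta_branch(tau0, lam, rng):
    """Delta method: tau* = tau0 + lam * (u - 0.5); reflecting at zero to keep
    the length positive is an assumption made here, not taken from the text."""
    u = rng.random()
    return abs(tau0 + lam * (u - 0.5))

rng = random.Random(7)
print(scale_branch(0.1, 2.0, rng), delta_branch(0.1, 0.05, rng))
```

Note that the scaling proposal is asymmetric; in a Metropolis-Hastings sampler the proposal ratio τ*/τ_0 then enters the acceptance probability.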
3.6.3 Propose new parameters
Basically, the methods for proposing branch lengths can also be used to propose a new parameter θ . Appropriate prior distributions can be used to increase the acceptance ratio. For example, the frequencies of the nucleotide states can be drawn from a Dirichlet distribution.
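A Dirichlet draw for the four nucleotide frequencies can be generated from normalized Gamma variates using only the standard library; the concentration values below are arbitrary illustrative choices.

```python
import random

def dirichlet(alphas, rng):
    """Dirichlet sample via normalized Gamma draws (standard construction)."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

# propose frequencies for (A, C, G, T); the concentrations are illustrative
freqs = dirichlet([10.0, 10.0, 10.0, 10.0], random.Random(42))
```

Larger concentration values keep the proposed frequencies near 1/4 each; smaller values spread the proposals more widely over the simplex.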
3.6.4 Co-propose topology and branch length
Tree topology and branch lengths can be updated simultaneously using a single tree mutation. Two methods have been proposed in the past: the traversal profile method [40] and the LOCAL method [41].
3.7 Extended proposal algorithms for phylogenetic models
This section presents our extension of the basic proposal algorithm described in previous
sections.
3.7.1 Extended tree mutation operator
As discussed in Section 3.3, in order to avoid local optima in MCMC sampling, we need to design tree mutation operators that can move rapidly through the tree space. The idea is to combine multiple basic tree mutations within one operator, which we name the extended tree mutation operator. We use a parameter D to control the maximum number of basic tree mutations in an extended tree mutation operator. An extended tree mutation operator works as follows:
extended-tree-mutation(T, D)
1. Draw d from a uniform distribution U(1, D)
2. T_0 ← T
3. For i ← 1 to d
4.   Draw k from the distribution f(k), k = 1..3
5.   If k = 1 Then T_i ← Tree-NNI(T_{i−1})
6.   If k = 2 Then T_i ← Tree-SPR(T_{i−1})
7.   If k = 3 Then T_i ← Tree-TBR(T_{i−1})
8. Return T_d

Figure 3 - 9: The extended-tree-mutation method

We can choose a proper distribution f(k) to control the percentages of NNI, SPR, and TBR. The choice of the parameter D depends on the number of leaves of the tree topology. An extended tree mutation can discover as many as N^D topologies, where N is the number of tree topologies that can be explored by a single basic tree operator.

3.7.2 Multiple-tree-merge operator
Most tree search methods search for the optimal tree along a single trajectory. Larger spaces can be explored using multiple independent trajectories. After some training period under the likelihood function, each tree on those independent trajectories may contain some
subtrees which are partially optimal. Merging these optimal subtrees can result in a good proposal for the next move; this is one of the basic ideas of the genetic algorithm. We introduce it here as another tree proposal operator. The multiple-tree-merge operator, shown in Figure 3-10, merges subtrees from several “good” candidates into a new candidate. For K = 2, the multiple-tree-merge operator becomes the crossover operator used in GAML [135].

multiple-tree-merge(T_1, ..., T_K)
1. Draw k from a uniform distribution U(1, K)
2. T_0 ← T_k
3. For i ← 1 to K
4.   If i ≠ k Then
5.     Select a subtree T_s from T_i
6.     For each leaf node in T_s, prune it from T_0 ; denote the pruned tree T_0 \ T_s
7.     Attach T_s to T_0 \ T_s at a random branch; replace T_0 with the new tree
8. Return T_0

Figure 3 - 10: The multiple-tree-merge method

3.7.3 Backbone-slide-and-scale operator
For any two internal nodes u and v on an unrooted tree, there exists a path from node u to node v . We call this path the backbone u-v. Assuming there are a total of n internal nodes (including nodes u and v ) on the backbone, the backbone connects n + 2 subtrees. Label the subtrees hanging from the internal nodes 1 to n in the order they are visited from node u to node v ; label the other subtree of node u as subtree 0, and the other subtree of node v as subtree n + 1. Arrange subtree 0, subtree n + 1, and all internal nodes along one vertical line, and denote by y_i the distance from internal node i to subtree 0. We then randomly choose one internal node k and slide it along the backbone by drawing a random number y from a uniform distribution U(0, y_{n+1}); the value of y decides the new position of node k . Finally, we scale the length of the backbone.
The backbone-slide-and-scale method, shown in Figure 3-11, is an extension of the LOCAL method proposed by Larget et al. [41], to which it reduces when the backbone includes only two nodes.
3.8 Chapter summary
MCMC is the cornerstone of Bayesian phylogenetic inference. The proper
implementation of MCMC is critical to the correctness of Bayesian phylogenetic
Step 1: Construct the backbone. Step 2: Slide subtree 3. Step 3: Scale the backbone length.
Figure 3 - 11: The backbone slide and scale method
inference. In theory, an MCMC chain constructed using the Metropolis-Hastings algorithm will visit every state after a sufficiently large number of time steps. In practice, many chains cannot efficiently mix between two states separated by low probability regions. We analyzed the danger that MCMC outputs misleading approximations and proposed several strategies to overcome those pitfalls. The key idea is to design a transition kernel which can move from one state to any other state within a limited number of steps without being blocked by high energy barriers.
We implemented this idea as improved MCMC strategies and extended proposal methods. Using a variable proposal step length, we can bring two distant states close to each other. Using multipoint MCMC, we can improve the quality of candidate states and reduce the sampling intervals. Using population-based MCMC, we can expand the search range of the MCMC algorithm. By introducing the above-described proposal methods and MCMC strategies, the number of steps needed for the chain to jump from one state to any other state is greatly shortened; therefore, the chain can cross valleys much more easily.
These described strategies and proposal methods are implemented in PBPI, our high
performance implementation of Bayesian phylogenetic inference.
Chapter 4
Parallel Bayesian Phylogenetic Inference
4.1 The need for parallel Bayesian phylogenetic inference
Large phylogenies deepen our understanding of biological evolution and diversity.
With the rapid accumulation of genomic data through various genome sequencing
projects, constructing large phylogenies across the tree of life is becoming a reality.
Simulation studies indicate that the accuracy of a phylogenetic method can be improved
by adding more taxa and including more characters [136].
Bayesian inference of large phylogenies is a computationally intensive process. Consider a realistic problem: estimating the phylogeny of 200 aligned amino acid sequences with 3500 characters using a model that allows five different rates across sites (N = 200, M = 5000, S = 20, and K = 5). Assume we use a Metropolis-coupled MCMC algorithm with 5 chains, each chain lasting 100,000,000 generations, and that we use a local update schema in the implementation. Then we need at least on the order of 10^10 bytes of memory space and at least on the order of 10^17 multiplication operations. To be competitive in analytic quality, more complicated models are desirable; this, together with an exponential growth rate in the number of sequenced taxa, makes for a growing computational demand. Even now these demands exceed the ability of a single-CPU computer and require a longer computation time than is reasonable, hence the motivation for a parallel implementation of Bayesian phylogenetic inference to reduce computation time.
4.2 TAPS: a tree-based abstraction of parallel systems
A parallel system uses a collection of processing elements that communicate and
cooperate to solve large problems faster [137]. Modern computer systems effectively
exploit various hardware parallelisms to gain raw performance at different levels ranging
from instruction, architecture, vector, processor core, microprocessor chip, SMP node,
cluster, and grid. In the past three decades, we have observed tremendous performance
increases and cost decreases in microprocessors, storage, and networking. Beowulf
clusters with hundreds of nodes are common in major universities and research institutes.
The grid, as a new infrastructure, makes sharing geographically-distributed computing
resources a reality.
Parallel algorithm design and analysis relies on an abstract model of parallel
computation to model the key attributes of physical parallel systems. Existing models
include PRAM (parallel random access machine) [138], BSP (bulk synchronous parallel)
[139], LogP [140], and their variants. None of them can be directly applied to grid
systems or clusters of heterogeneous clusters. Therefore, we use TAPS, a tree-based
abstraction of ubiquitous parallel systems, as the model guiding our design and analysis
of parallel algorithms for Bayesian phylogenetic inference.
70
As shown in Figure 4-1, TAPS represents a parallel system as a rooted tree, where all physical processors are located at the leaves and clusters of computing resources are represented as internal nodes. The root is the largest organization available to the
user. Each leaf node has its independent processing unit ( P ), memory space ( M ) and
network interface ( N ). The internal node is a virtual organization which includes an
interconnection network and a collection of computing resources which could be a
physical processing unit or a lower level virtual organization. Each edge of the tree
represents a communication link with fixed bandwidth and latency. Each node (labeled k )
on the tree incurs an overhead o_k when communicating with other nodes. Similar to the LogP and log_nP models [141], the communication cost between a pair of nodes i and j is modeled as

C_{i,j} = O_i + L + O_j . (4 - 1)
Figure 4 - 1: An illustration of TAPS
Here O_i = Σ_k o_k, where k ranges over the nodes on the path from node i to r , the root of the smallest subtree shared by nodes i and j ; L = Σ_e l_e, where e ranges over the edges on the same path. Both O and L are system characteristics and vary with message size.
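Equation (4-1) can be evaluated on a small TAPS tree. The sketch below uses made-up node names and costs (all values are illustrative assumptions) and follows the definitions of O_i and L above: leaves i and j sit under virtual organizations A and B, which share root C.

```python
# a toy TAPS tree in the style of Figure 4-1 (all names and costs illustrative)
parent = {"i": "A", "j": "B", "A": "C", "B": "C", "C": None}
overhead = {"i": 1.0, "j": 1.0, "A": 2.0, "B": 2.0, "C": 4.0}   # o_k per node
latency = {("i", "A"): 1.0, ("j", "B"): 1.0, ("A", "C"): 5.0, ("B", "C"): 5.0}  # l_e per edge

def path_to_shared_root(a, b):
    """Nodes from a up to r, the root of the smallest subtree containing a and b."""
    ancestors_b = set()
    n = b
    while n is not None:
        ancestors_b.add(n)
        n = parent[n]
    path, n = [], a
    while n is not None:
        path.append(n)
        if n in ancestors_b:   # reached r
            break
        n = parent[n]
    return path

def comm_cost(i, j):
    """C_ij = O_i + L + O_j per (4-1), summing overheads and latencies to r."""
    path_i = path_to_shared_root(i, j)
    path_j = path_to_shared_root(j, i)
    O_i = sum(overhead[k] for k in path_i)
    O_j = sum(overhead[k] for k in path_j)
    L = sum(latency.get((x, y), latency.get((y, x), 0.0))
            for x, y in zip(path_i, path_i[1:]))
    return O_i + L + O_j

print(comm_cost("i", "j"))
```

Communication between leaves under the same low-level organization skips the expensive upper links, which is exactly the locality TAPS is meant to expose.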
TAPS provides a hierarchical view of a general parallel system which clusters
heterogeneous, distributed computing resources into one virtual platform for parallel
programs. Communications between processing units within a lower level virtual
organization (e.g. an SMP node) have smaller latency and overhead than communications
between nodes that belong to different lower level virtual organizations (e.g. processing
units on two different SMP nodes or two Beowulf clusters). A real system represented by TAPS is our departmental grid, which consists of several clusters with different kinds of computing nodes, several SMP systems, and a cluster of loosely connected Linux/Unix workstations. This grid is an MPMD system, using MPI (message passing interface) as its communication mechanism and NFS as its global storage. In this chapter, we discuss
parallel Bayesian phylogenetic inference algorithms based on a system modeled by TAPS.
4.3 Performance models for parallel algorithms
The performance of a parallel algorithm is described by two metrics: speedup and
scalability. Speedup quantifies the performance improvement for a given workload.
Scalability characterizes how speedup varies with the number of processing units and the
size of workload.
The speedup S of workload W on N processing units using algorithm A is defined as

S_A(W, N) = T_0(W) / T_A(W, N) . (4 - 2)

Here T_0 is the execution time for workload W on a single processing unit using an optimal sequential algorithm; T_A(W, N) is the time to solution for workload W on N processing units using algorithm A , which is defined as the maximum of the N individual execution times on the processing units. Ignoring the communication cost and parallelization overhead, Kruskal and Weiss [142] have shown that for independent subtasks, T_A(W, N) can be approximated as

T_A(W, N) = (T_0(W) / N) · (1 + 2σ√(log N)) . (4 - 3)

Here σ is the standard deviation of the workload per process and indicates the load imbalance.
Not all workload can be executed in parallel. If we assume that a fraction α of the workload has to be executed sequentially, then (4-3) should be modified as

T_A(W, N) = T_0(W) · (α + (1 − α)(1 + 2σ√(log N)) / N) . (4 - 4)
Parallel algorithms on realistic systems always incur parallel overhead and communication latency. We assume the parallel overhead depends on N and is approximated by β log N. We further assume communication breaks the whole execution time into K super steps, each super step consisting of one computation phase and one communication phase. Following the abstraction in the last section, the communication cost can be approximated as K(O + L(M)) log N', where M is the message size in bytes and N' is the number of processing units involved in the communication phase.
Thus, considering all the above factors, the speedup can be approximated as

S_A(W, N) = N / [1 + (N − 1)α + (1 − α) · 2σ√(log N) + (N/W)(β log N + K(O + L(M)) log N')] . (4 - 5)
The above formula indicates the difficulty of scaling speedup with a large number of processors for a fixed workload. Figure 4-2 illustrates the differences among the real speedup, the ideal speedup, and Amdahl's law (speedup without communication cost) [143].
In realistic scientific computing, such as Bayesian phylogenetic inference, the
workload is not fixed. According to Gustafson’s law [144], the percentage of sequential
execution of the workload may decrease by increasing problem size, thus an improved
speedup can be achieved.
Figure 4 - 2: Speedup under fixed workload
In summary, there are two kinds of speedup scalability: strong scalability for fixed problem size and weak scalability for varying problem size. The former indicates, for a given problem, how fast we can get the solution; the latter indicates, given a time limit, how big a problem we can solve. Analyzing the performance of parallel algorithms should consider both kinds of speedup scalability.
Equation (4-5) also shows that fixed workload speedup can be improved by: 1)
reducing load imbalances across all processing units; 2) reducing communication
frequency; 3) reducing message size per communication; and 4) reducing the number of
processors involved in communication. These principles have been applied in our
implementations of parallel Bayesian phylogenetic algorithms.
4.4 Concurrencies in Bayesian phylogenetic inference
The procedure of a generic Bayesian phylogenetic inference, shown in Figure 4-3, exposes multiple levels of concurrency:
1) Multiple independent runs for chain convergence detection (lines 7-17);
2) MCMC chains for better chain mixing (lines 8-17);
3) Multiple data partitions for improved model or combined data (lines 10-13);
4) Multiple rate categories for rate variations across sites (lines 12-13); and
5) Multiple sites across the sequence.
Levels 1-4 are conditional on the settings of the Bayesian analysis. Thus, a parallel algorithm for Bayesian phylogenetic inference should be flexible and automatically exploit the parallelism available for the current analysis settings.
Considering the existence of multiple levels of concurrency in a Bayesian analysis and their tolerance for communication latency, we can map each level to a virtual organization represented by TAPS (shown in Figure 4-1). For example, we can compute site likelihoods in parallel at the SMP node level and run multiple chains at the cluster level.
4.5 Issues of parallel Bayesian phylogenetic inference
Parallelizing Bayesian phylogenetic inference brings two advantages: speeding up the
computation and providing the memory space needed for a competitive biological
analysis of current data. Since generating the Markov chain accounts for the vast bulk of
the computation, our parallelization will focus on the MCMC algorithm. A single chain
MCMC is essentially a sequential program, since the state at time t + 1 is dependent on the state at time t . Multiple dependent chains may increase the mixing rate of the chains
The procedure of a generic Bayesian analysis
1. Read-Dataset
2. Set-Assumption
3. For run ← 1 to number-of-runs
4.   For chain ← 1 to number-of-chains
5.     Set-starting-model
6. For time-step ← 1 to maximum-time-step
7.   For run ← 1 to number-of-runs
8.     For chain ← 1 to number-of-chains
9.       Propose-candidate-model
10.      For partition ← 1 to number-of-partitions
11.        For site ← 1 to number-of-sites
12.          For rate ← 1 to number-of-rates
13.            Calculate-Site-Likelihood
14.      Compute-Tree-Likelihood
15.      Make-Accept/Reject-Decision
16.      Update-Chain-State
17.    Exchange-State-between-chains
18.  Detect-Chain-Convergence
19. Analyze-Samples

Figure 4 - 3: The procedure of a generic Bayesian phylogenetic inference
and also increase the parallel granularity of Bayesian computation.
One way to parallelize a Metropolis-Hastings MCMC is to parallelize the likelihood evaluation. Another is to run multiple chains and sample each one after the burn-in stage. This method may involve many random starting points, which provide the advantage of exploring the space through independent initial trajectories; however, it also carries the danger that the burn-in stage may not always be cleared. A single Metropolis-coupled MCMC run can be parallelized at the chain level, but chain-to-chain communications are needed frequently. However, parallelizing only at the chain level will not use all the available resources, especially when memory is the limiting factor. Multiple-try Metropolis methods are easy to parallelize and may reduce the whole computation by using a shorter chain to get the same result as a long chain. For illustrative purposes, we focus on Metropolis-coupled MCMC in the next chapter.
An important issue in the parallelization of Metropolis-coupled MCMC is load balancing. The load balancing issue comes from the fact that when a local update schema is used, different chains will reevaluate different numbers of nodes. More seriously, local update schemas are only available for topology and branch length changes; when parameters such as the global rate matrix change, the likelihood needs to be evaluated across the whole tree, so all nodes need to be reevaluated.
Other issues that must be considered in a parallel algorithm are how to synchronize
the processors, how to reduce the number of communications, and how to reduce the
message length of each communication.
4.6 Parallel algorithms for Bayesian phylogenetic inference
This section presents a parallel implementation of MCMC algorithms in Bayesian
phylogenetic inference. The MCMC strategy chosen here is Metropolis-coupled MCMC. As described in Chapter 3, there are two variants: simulated tempering MCMC and parallel tempering MCMC. Both methods build h companion chains whose distributions are π_i(x) = π_0(x)^(1/(1+T_i)), where π_0 is the target distribution, π_i is a tempered distribution, and T_i is the temperature of the i-th chain for i = 1, ..., h . The cold chain (T_1 = 0) is for sampling from the target distribution, and the heated chains (chains with T_i > 0) help bridge subspaces of the sampling space separated by high energy barriers. Running multiple chains in parallel can improve the efficiency of the chain, so fewer time steps are required to approximate the target distribution at a given accuracy. But it also increases the computation per time step by a factor of h . Parallel implementations keep the execution time per time step unchanged, or even reduce it, when more chains are used.
With a few modifications, the algorithms presented in this chapter are applicable to
other MCMC strategies used by Bayesian phylogenetic inference or Bayesian
computation in scientific problems.
4.6.1 Task decomposition and assignment
As discussed in previous sections, there are two natural approaches to exploiting
parallelism in Metropolis-coupled MCMC: chain-level parallelization and subsequence-
level parallelization. Chain-level parallelization divides chains among processors; each
processor is responsible for one or more chains and communications between different
chains are conducted every cycle. Subsequence-level parallelization divides the whole
sequence among processors; each processor is responsible for a segment of the sequence
and communications contribute to computing the global likelihood by collecting local
likelihoods from all processors. Our implementation combines these two approaches
together and maps the computation task of one cycle into a two dimensional grid
topology.
The processor pool is arranged as an r × c two-dimensional Cartesian grid. The data set is split into c segments, and each column is assigned one segment. The chains are divided into r groups, and each row is assigned one group of chains. When c = 1, the arrangement becomes chain-level parallel; when r = 1, the arrangement becomes subsequence-level parallel. Figure 4-4 illustrates how to map 8 chains onto a 4 × 4 grid, where the length of the sequences is 2000.
[Grid diagram: the 16 processors P11..P44 form a 4 × 4 grid; row 1 holds chains {1,2}, row 2 chains {3,4}, row 3 chains {5,6}, and row 4 chains {7,8}; columns 1-4 hold sites 1..500, 501..1000, 1001..1500, and 1501..2000 of the 2000-site sequences.]

Figure 4-4: Mapping 8 chains onto a 4 × 4 grid, where the length of each sequence is 2000
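The mapping from processors to grid cells, chain groups, and sequence segments can be sketched as follows; the even-division scheme and row-major rank ordering are assumptions for illustration, not PBPI's exact bookkeeping:

```python
def grid_map(rank, r, c, n_chains, seq_len):
    """Map an MPI-style rank to a (row, col) cell of an r x c grid and report
    which chains and which sequence sites that processor owns.

    Assumes chains divide evenly into r groups and sites into c segments
    (the last segment absorbs any remainder)."""
    row, col = divmod(rank, c)                 # row-major rank ordering
    chains_per_row = n_chains // r
    chains = list(range(row * chains_per_row, (row + 1) * chains_per_row))
    seg = seq_len // c
    start = col * seg
    end = seq_len if col == c - 1 else start + seg
    return row, col, chains, (start, end)

# Reproduce the Figure 4-4 layout: 8 chains, a 4 x 4 grid, 2000 sites.
for rank in (0, 5, 15):
    row, col, chains, span = grid_map(rank, 4, 4, 8, 2000)
    print(f"P{row+1}{col+1}: chains {chains}, sites {span[0]+1}..{span[1]}")
```

Setting c = 1 collapses the grid to pure chain-level parallelism and r = 1 to pure subsequence-level parallelism, matching the two limiting cases described above.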
4.6.2 Synchronization and communication
We use two sets of random number generators (RNG-1 and RNG-2) to synchronize the
processors in the grid. RNG-1 is used for row-wise synchronization: the processors on
the same row share the same seed for RNG-1, but different rows use different seeds.
RNG-2 is used for grid-wise communication: all processors in the grid topology share
the same seed for RNG-2.
On each row, RNG-1 is used to generate the proposal state and draw random
variables from the uniform distribution. Since the same seed is used, the processors on
the same row always generate the same proposal, and make the same decision on whether
or not to accept the proposal. During each cycle, only one collective communication is
needed to gather the global likelihood and broadcast it to all processors on the same row.
The MPI_ALLREDUCE function can be used to fulfill this task. Each communication
only needs to communicate twice as many double precision values as the number of
chains on the row, that is, the local likelihood and the global likelihood. Since different
rows use different seeds for RNG-1, the chains on them can traverse different states.
RNG-2 is used to choose which two chains should attempt a swap and to draw the
probability of accepting the swap operation.
When the two chosen chains are located on different rows, peer-to-peer
communications are required between the nodes on these two rows. In each chain swap step, the
indices of the chains, not their state information, are swapped. An index swap operation
changes the temperature of the chains being swapped; the cold chain may thus jump from one
chain to another. Index swapping reduces the communication volume needed by chain
swapping to a minimum.
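Index swapping can be sketched in a few lines; the dictionary-based bookkeeping here is illustrative, not PBPI's data structure:

```python
def swap_indices(temp_of, i, j):
    """Swap the temperature assignments of chains i and j in place.

    temp_of maps chain id -> temperature index. Only these two small values
    move between processors; the bulky tree and parameter state of each
    chain stays where it is."""
    temp_of[i], temp_of[j] = temp_of[j], temp_of[i]

# Four chains; temperature index 0 marks the cold chain.
temp_of = {0: 0, 1: 1, 2: 2, 3: 3}
swap_indices(temp_of, 0, 2)                   # an accepted swap of chains 0 and 2
cold_chain = min(temp_of, key=temp_of.get)    # the cold chain has jumped to chain 2
```

Exchanging two integers instead of a full tree plus model parameters is what keeps the inter-row swap communication minimal.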
4.6.3 Load balancing
The processors on the same row always have a balanced load if the differences between
the lengths of the subsequences on each column are small enough. However, the
imbalance among different rows is unavoidable, since we cannot predict the
instantaneous behavior of a given chain within a time step. Some techniques are available
to reduce the imbalance if it impacts performance significantly.
The first technique is to synchronize the proposal type across all chains. We use RNG-2
to control how a new candidate state is proposed; this prevents one chain from
performing a cheap local update while another chain performs an expensive global update.
The second technique is to select a swap proposal probability that adjusts the interval
between two swap steps.
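Proposal-type synchronization via a shared-seed generator can be sketched as follows; the proposal menu and its weights are hypothetical:

```python
import random

# Hypothetical proposal menu and weights, shared by every chain.
PROPOSALS = (("local-NNI", 0.6), ("branch-length", 0.3), ("global-SPR", 0.1))

def choose_proposal(rng):
    """Map a single uniform draw to a proposal type via cumulative weights."""
    u = rng.random()
    acc = 0.0
    for name, weight in PROPOSALS:
        acc += weight
        if u <= acc:
            return name
    return PROPOSALS[-1][0]   # guard against floating-point round-off

# All chains seed their copy of the shared generator identically, so every
# chain picks the same move type in a given cycle; no row is stuck in a slow
# global update while the others finish a cheap local one.
picks = [choose_proposal(random.Random(2006)) for _ in range(8)]
```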
4.6.4 Symmetric MCMC algorithm
Until now, all the parallel strategies we have discussed are based on an assumption that
any two chains chosen to conduct a swap step need to be synchronized. The whole
algorithm, which we refer to as the symmetric parallel MCMC algorithm, is provided in
Figure 4-5.
Because the above symmetric parallel MCMC algorithm is the same as sequential
Metropolis-coupled MCMC and maintains all statistical properties required for exploring
the posterior probability, its correctness is guaranteed.
The symmetric parallel MCMC algorithm avoids the need for frequent inter-processor
communication through the use of common knowledge: if all processors have the same
notion of when to swap, who will swap, and what the next move is, then they can
compute their own tasks autonomously. Here, common knowledge is encoded in two
random number generators.
Symmetric parallel MCMC algorithm
 1. Init-Run-Setting
 2. Init-MCMC
 3. t ← 0
 4. While (t < maximum-generations) do
 5.   Draw u1 from U(0,1) using RNG-2
 6.   If u1 ≤ α0 Then                        // α0: swap probability
 7.     Do-swap-step
 8.   Else Do-parallel-step
 9.   If sample-step(t) Then sample-cool-chain
10.   t ← t + 1

Init-Run-Setting
 1. Initialize the MPI environment
 2. Collect resource information
 3. Read the run configuration
 4. Map chains onto processors
 5. Compute the grid coordinate (r, c) of the current processor
 6. If me = Head-Node Then Read-Dataset
 7. Scatter/broadcast the dataset to the processors
 8. Set the seeds for RNG-1 and RNG-2

Init-MCMC
 1. Compress the sequence data
 2. For each chain on the current processor:
 3.   Set the temperature of the chain
 4.   Build a starting tree
 5.   Set the length of each branch randomly
 6.   Choose parameters for the models
 7.   Compute the local likelihood using local data
 8.   Sum the global likelihood across each row

Do-swap-step
 1. Choose chains i and j globally using RNG-2
 2. Compute r(i) and r(j), the row indices of chains i and j
 3. If r(i) = r(j) Then Intra-processor-chain-swap
 4. Else Inter-processor-chain-swap

Intra-processor-chain-swap
 1. If r = r(i) Then
 2.   Compute α_s ← min(1, [π_i(Ψ_j^(t)) · π_j(Ψ_i^(t))] / [π_i(Ψ_i^(t)) · π_j(Ψ_j^(t))])
 3.   Draw u2 from U(0,1) using RNG-2
 4.   If u2 ≤ α_s Then Swap chains i and j
 5.   Else do nothing

Inter-processor-chain-swap
 1. If r = r(i) or r = r(j) Then
 2.   Exchange the temperature and likelihood of chains i and j
 3.   Compute α_s ← min(1, [π_i(Ψ_j^(t)) · π_j(Ψ_i^(t))] / [π_i(Ψ_i^(t)) · π_j(Ψ_j^(t))])
 4.   Draw u2 from U(0,1) using RNG-2
 5.   If u2 ≤ α_s Then Swap chains i and j
 6.   Else do nothing

Do-parallel-step
 1. For each chain on the current processor:
 2.   Draw u3 from U(0,1) using RNG-1
 3.   Map u3 to a proposal type
 4.   Propose a new state Ψ
 5.   Compute the local likelihood
 6.   Sum the global likelihood across each row
 7. For each chain on the current processor:
 8.   Compute α(Ψ^(t), Ψ) ← min(1, [π(Ψ|D) · q(Ψ^(t)|Ψ)] / [π(Ψ^(t)|D) · q(Ψ|Ψ^(t))])
 9.   Draw u4 from U(0,1) using RNG-1
10.   If u4 ≤ α(Ψ^(t), Ψ) Then Ψ^(t+1) ← Ψ
11.   Else Ψ^(t+1) ← Ψ^(t)

Figure 4-5: The symmetric parallel MCMC algorithm
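A toy, single-process sketch of the two-generator scheme is shown below; the bimodal target, seeds, swap probability, and step counts are illustrative and not taken from PBPI. Two replicas of the same row, given the same RNG-1 seed, traverse identical states:

```python
import math
import random

def log_target(x):
    """Toy bimodal target: an equal mixture of N(-4, 1) and N(4, 1)."""
    return math.log(math.exp(-0.5 * (x + 4.0) ** 2)
                    + math.exp(-0.5 * (x - 4.0) ** 2))

def run_row(seed1, seed2, T, steps=300):
    """One grid row: a single heated Metropolis chain driven by RNG-1.

    RNG-2 is drawn once per cycle by every row, mirroring the globally
    shared stream that decides when a swap cycle occurs; the swaps
    themselves are elided in this sketch, but the draw keeps the streams
    of all rows aligned."""
    rng1 = random.Random(seed1)   # row-wise stream: proposals, accept tests
    rng2 = random.Random(seed2)   # grid-wise stream: swap-cycle decisions
    x = 0.0
    for _ in range(steps):
        if rng2.random() < 0.1:   # every row agrees this is a swap cycle
            continue
        prop = x + rng1.gauss(0.0, 1.0)
        log_a = (log_target(prop) - log_target(x)) / (1.0 + T)
        if rng1.random() < math.exp(min(0.0, log_a)):
            x = prop
    return x

# Two processors on the same row (same RNG-1 seed) stay in lockstep, so no
# state is exchanged between them, only local likelihood terms.
a = run_row(seed1=7, seed2=99, T=0.0)
b = run_row(seed1=7, seed2=99, T=0.0)
```

Because the two replicas are byte-for-byte deterministic given the seeds, `a == b` holds exactly, which is the property the symmetric algorithm relies on to let columns of a row compute autonomously.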
4.6.5 Asymmetric MCMC algorithm
To further reduce the negative effect of imbalance between different chains, an
asymmetric MCMC algorithm is used. The basic idea is to introduce a processor as the
coordinator node. This node is used to coordinate the communication between different
rows; it does not participate in the likelihood evaluation. After each cycle, the head of
each row sends the state information for its chains to the coordinator and retrieves
information from it when a swap step is proposed. The asymmetric MCMC algorithm is
similar to the shared memory algorithm, but the coordinator can perform other functions,
such as convergence detection and sampling output.
Compared to the symmetric MCMC algorithm, the asymmetric MCMC algorithm
wastes one processor. Thus, when the number of rows in the grid topology is not large,
the symmetric MCMC algorithm is suggested.
4.7 Justifying the correctness of the parallel algorithms
As shown in Chapter 5, we can validate the correctness and accuracy of our algorithm
and implementation using a simulation study. This section provides a brief justification that
our proposed algorithms are correct. Our justification is based on two assumptions:
1) The sequential MCMC algorithm is correct and accurate; and
2) There are no correlations between two independent random number generators.
The first assumption is reasonable because the objective of our parallel
algorithm is to reduce the execution time of the sequential algorithm, not to invent a new
algorithm; thus, if the parallel algorithm is equivalent to the sequential algorithm, the
parallel algorithm is correct. The second assumption holds provided that the random
number generators are properly implemented.
Our algorithms exploit two major forms of parallelism. The first, or sequence-level,
parallelism partitions the dataset into c segments. Since we use the same RNG-1 seed within
each row, all nodes in the same row generate the same proposal and conduct the same
computation, except that each works on a different segment. This parallel computation is
equivalent to breaking one loop into several parts, and its results are exactly the same as
those of the sequential computation.
The second, or chain-level, parallelism runs a subset of the chains in each row. According
to the second assumption, if we draw random numbers from two independent random number
generators and merge them into a single stream, the resulting stream is statistically
equivalent to a single stream drawn from one generator. Therefore, using two random
number generators for swapping and sampling in our parallel algorithms is equivalent to
using a single random number generator. In other words, our parallel algorithms leave
every statistical property of the samples unchanged.
From the above justification, we conclude that our parallel algorithms are equivalent to
the sequential algorithm. If the sequential algorithm is correct, our parallel algorithms are
correct.
4.8 Chapter summary
In this chapter, we provided a framework for implementing Bayesian phylogenetic
inference in the context of a high performance computing environment. We proposed
using TAPS (Tree-based Abstraction for Parallel System) to model the heterogeneous,
multiple level organization of modern parallel computing systems. We discussed the
multilevel parallelism in Bayesian phylogenetic inference and how to exploit it in
practical implementation.
We described a parallel implementation of Bayesian phylogenetic inference
using MCMC, which we call PBPI. PBPI organizes the processors in a 2D grid topology
to exploit both chain-level and subsequence-level parallelism, and uses two sets of
random number generators to synchronize the processors in the grid and to reduce the
overhead caused by communication and imbalance.
The memory space is distributed in our algorithm; the duplicated data is limited
primarily to the input dataset, which is relatively small compared with the ongoing
likelihood data. PBPI can make inferring large phylogenies, which require huge
memory space and many compute cycles, feasible and fast.
We argued that, under the assumptions that the sequential MCMC algorithm is
correct and the random number generators are well implemented, PBPI generates
results equivalent to those of sequential MCMC for the same dataset. This justification has been
confirmed in simulation studies; those results are presented in Chapters 5 and 6.
Chapter 5
Validation and Verification
5.1 Introduction
In this chapter, we validate the PBPI framework and verify its accuracy in phylogenetic
inference using a simulation study. The performance of a phylogenetic method is usually
evaluated with respect to multiple criteria: consistency, efficiency, robustness, and
computational speed [145-147]. This chapter focuses on the first three criteria, which
characterize the accuracy of a phylogenetic method. The next chapter evaluates the
computational performance, i.e., how much performance can be improved by using PBPI
instead of other Bayesian phylogenetic methods.
In the field of performance studies of phylogenetic methods, consistency is the ability
to estimate the correct phylogenetic tree given an unlimited amount of data, while
efficiency is the ability to quickly converge to the correct tree as more data become
available. Both consistency and efficiency are evaluated under an ideal situation, i.e., one
in which the evolutionary model and all its parameters are exactly known during the
phylogenetic inference. In practice, a phylogenetic method is also evaluated for robustness:
the ability of the method to estimate the correct tree when one or more of its assumptions
are violated. All three criteria are critical for judging the performance of a phylogenetic method.
As Hillis pointed out [147], simulation, known species, statistical analysis and
congruence studies are major techniques to assess the accuracy of a phylogenetic method.
Among these techniques, simulation is used most frequently, especially in estimating
large phylogenies [89, 148-150].
The procedure of a simulation study (see Figure 5-1) involves the following steps:
(1) Choosing a model tree T_m;
(2) Simulating a dataset X_i under a model of evolution Ψ, guided by the model tree T_m;
(3) Feeding the dataset to a phylogenetic method M to estimate a tree T_i (or a set of trees);
(4) Computing the distance between the estimate T_i and the model tree T_m;
(5) Repeating steps (2)-(4) a sufficiently large number of times to make statistical
assessments of the phylogenetic method.
As pointed out by some researchers, simulation studies suffer from several types of
bias [136, 148]; one of them is the choice of the model tree. The parameters of the model
tree, such as the branching patterns, branch lengths, number of taxa, and model of
evolution, may affect the simulation results significantly.
In a simulation study, consistency is measured by including a sufficiently large number
of characters in the simulated dataset, while efficiency is studied by investigating how
accuracy changes as more and more characters are included. Finally, robustness is
examined by simulating the dataset under one model and making the estimation under
another model.
[Flowchart: a model tree T_m and a model of evolution drive dataset simulation to produce dataset D_i; the phylogenetic method (with its own model of evolution) estimates a tree T_i; tree comparison yields the distance d(T_i, T_m); the loop runs for i = 1..MAX_REPEATS and feeds an accuracy assessment producing accuracy metrics.]

Figure 5-1: The procedure of a simulation method for accuracy assessment
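The five steps above can be sketched as a driver loop; every component here is a hypothetical stub standing in for the real tools (the model tree, SEQ-GEN, PBPI, and the tree comparison), so only the control flow is meaningful:

```python
import random

def simulate_dataset(model_tree, rng):
    """Step (2): 'simulate' a dataset; here just the tree plus random noise."""
    return {"tree": model_tree, "noise": rng.random()}

def estimate_tree(dataset):
    """Step (3): a perfect estimator would return the model tree; this stub
    injects one spurious bipartition with small probability."""
    tree = set(dataset["tree"])
    if dataset["noise"] < 0.2:
        tree.add("spurious-split")
    return tree

def tree_distance(t1, t2):
    """Step (4): placeholder topological distance (splits in exactly one tree)."""
    return len(set(t1) ^ set(t2))

def simulation_study(model_tree, repeats, seed=1):
    """Steps (1)-(5): repeat simulate/estimate/compare, then summarize."""
    rng = random.Random(seed)
    dists = [tree_distance(estimate_tree(simulate_dataset(model_tree, rng)),
                           model_tree)
             for _ in range(repeats)]
    return sum(dists) / repeats   # accuracy summary, e.g. average distance

avg_distance = simulation_study({"AB|CDE", "ABC|DE"}, repeats=50)
```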
5.2 Experimental methodology
We use the procedure shown in Figure 5-1 to guide experimental design in validating and
verifying the PBPI framework and its implementation.
5.2.1 The model trees
We chose several phylogenetic trees published by RDP-II (release 8.1)1 as the model
trees. These trees are constructed from the small-subunit prokaryotic ribosomal RNA
sequence alignments released by the Ribosomal Database Project II [151]; they were
built using the WEIGHBOR (Weighted Neighbor Joining) program [152] from
distance matrices computed by the PAUP program [63]. Table 5-1 shows the four model trees we
used to present our results. The phylograms of these trees are shown with the results.
1 The URL to download model trees used in the chapter is:
http://rdp8.cme.msu.edu/download/SSU_rRNA/trees/release_8.1_trees/.
Table 5-1: The four model trees used in experiments

Model Tree   Number of Taxa   RDP-II Filename
FUSO024      24               Fusobacteriaceae.tree
BURK050      50               Burkholderiaceae_and_Alcaligenaceae.tree
ARCH107      107              Archaea.newick
BACK218      218              Backbone.newick
5.2.2 The simulated datasets
For each model tree listed in Table 5-1, we chose three sequence lengths: 1000, 5000,
and 10,000 characters. For each combination of model tree and number of characters, we
simulated 5 datasets under the JC69 model [112] and another 5 datasets under the K2P
model [113] using a sequence simulation program, SEQ-GEN [153]. For the K2P model,
we set the transition/transversion parameter to 2.0. Thus, we simulated a total of
5 × 2 × 3 = 30 datasets for each model tree. We label these datasets using the model tree
name, sequence length, model type, and dataset repeat index. For example, the dataset
“back218_L10000_jc69_D003.nex” is the 3rd dataset generated (D003), with a sequence
length of 10,000, under the JC69 model, for the model tree BACK218. To assess the
performance of a phylogenetic method statistically, many more datasets might be needed;
for the validation and verification purposes of this dissertation, however, the above
datasets are adequate.
5.2.3 The accuracy metrics
There are two issues in choosing the accuracy metrics: 1) quantifying the topological
distance between the estimated tree and the model tree; and 2) summarizing the
simulation results.
For the first issue, we use Robinson and Foulds' measure of topological distance [154].
This distance equals the number of unique taxon bipartitions found in only one
of the two trees but not both. Chapter 2 illustrates how to calculate the bipartitions of a
phylogenetic tree. Taking the model tree as the ground truth, an estimated tree may
differ from the model tree in two ways: it may miss some bipartitions included in the
model tree, or it may introduce novel bipartitions that are not found in the model tree
[89]. Missed bipartitions are also called false negative bipartitions, while novel
bipartitions are called false positive bipartitions. When both the model tree and the
estimated tree are fully resolved, the number of false negative bipartitions equals the
number of false positive bipartitions, and their sum equals the Robinson-Foulds
topological distance.
For simplicity, we use half of the Robinson-Foulds distance in our discussions. The
rationales are: 1) the raw distance has no direct biological meaning; and 2) if the model tree
and the estimated tree have a Robinson-Foulds distance of 2, then we can transform
the estimated tree into the model tree with one SPR (Subtree Pruning and Regrafting [77])
move.
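Given each tree as a set of taxon bipartitions, the half R-F distance reduces to a set operation; the encoding below is a sketch, not PBPI's implementation:

```python
def half_rf_distance(splits_a, splits_b):
    """Half of the Robinson-Foulds distance between two fully resolved
    unrooted trees, each given as a set of non-trivial taxon bipartitions.

    The R-F distance counts bipartitions found in exactly one of the two
    trees; for fully resolved trees, false negatives equal false positives,
    so halving the distance counts each topological difference once."""
    return len(splits_a ^ splits_b) // 2

# Five-taxon example (each frozenset encodes one side of a bipartition);
# the two trees differ by a single rearrangement.
tree_a = {frozenset("AB"), frozenset("ABC")}
tree_b = {frozenset("AB"), frozenset("ABD")}
d = half_rf_distance(tree_a, tree_b)
```

A half distance of 1 means one SPR (in this case even one NNI) move separates the two topologies, which is the property used to justify the halved metric.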
The second issue is specific to Bayesian phylogenetic methods because, quite unlike
optimization-based methods, Bayesian methods usually generate a large number of tree
samples. As discussed in Chapter 2, three methods can be used to summarize these tree
samples:
1. The maximum posterior probability tree (T_MPP, the MPP tree): output the tree
with the highest frequency of occurrence as an estimate of the true tree.
2. The 95% credible set of trees: output the set of trees whose cumulative
occurrence is estimated to be more than 95% of the total number of samples.
3. The majority consensus tree: calculate the frequency of each taxon bipartition
and build a consensus tree from the bipartitions whose occurrence
frequencies are above a certain threshold (for example, 50%).
In this chapter, we use all these three summary methods and assess the accuracy of
the PBPI based on the following six metrics:
(1) P(T_m = T_MPP): the percentage of runs in which the true tree (i.e., the model tree)
is recovered as the MPP tree.
(2) P(T_m ∈ T_CTS): the percentage of runs in which the true tree is recovered in the
95% credible set of trees.
(3) P(T_m = T_CON): the percentage of runs in which the true tree is recovered as the
majority consensus tree.
(4) d(T_m, T_MPP): the average topological distance between the true tree and the
MPP tree.
(5) d(T_m, T_CTS): the average topological distance between the true tree and the
credible tree set.
(6) d(T_m, T_CON): the average topological distance between the true tree and the
majority consensus tree.
In the above metrics, because the 95% credible set may include more than one tree,
we use the smallest distance between the model tree and any individual tree in the
credible set. If the computed distance equals 0, then the model tree has been found.
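The three summaries can be sketched from a list of sampled topologies; the bipartition encoding and the toy samples below are illustrative:

```python
from collections import Counter

def summarize_samples(trees, credible=0.95, cutoff=0.5):
    """Summarize sampled topologies the three ways used in Section 5.2.3.

    `trees` is a list of sampled topologies, each encoded as a frozenset of
    bipartitions so that identical topologies compare equal."""
    n = len(trees)
    counts = Counter(trees)
    # 1. MPP tree: the most frequently sampled topology.
    mpp = counts.most_common(1)[0][0]
    # 2. Credible set: most probable trees whose cumulative mass reaches 95%.
    cts, mass = [], 0.0
    for tree, c in counts.most_common():
        cts.append(tree)
        mass += c / n
        if mass >= credible:
            break
    # 3. Majority consensus: bipartitions occurring in more than `cutoff`
    #    of the samples (returned here as a set of splits, not a tree).
    split_counts = Counter(s for t in trees for s in t)
    consensus = frozenset(s for s, c in split_counts.items() if c / n > cutoff)
    return mpp, cts, consensus

# Toy posterior sample: topology t1 drawn 8 times, t2 twice.
t1 = frozenset({frozenset("AB"), frozenset("ABC")})
t2 = frozenset({frozenset("AB"), frozenset("ABD")})
mpp, cts, consensus = summarize_samples([t1] * 8 + [t2] * 2)
```

With these samples the MPP tree is t1, the 95% credible set is [t1, t2] (t1 alone carries only 80% of the mass), and the majority consensus keeps exactly the splits of t1.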
5.2.4 Tested programs and their run configurations
We tested PBPI, a software package for parallel Bayesian Phylogenetic Inference that
implements the framework in Chapters 3 and 4. For comparison purposes, we also tested
MrBayes Version 3.1 on part of the dataset.
PBPI supports numerous features which can be configured with an XML file. The
configurations of PBPI used in the experiments are given in Table 5-2.
We run PBPI in parallel on 4, 8, or 16 nodes, using different numbers of nodes for
different problem sizes to achieve optimal use of CPU time. For each case tested, we run
4 MCMC chains with a parallel tempering scheme and spread each chain across 1, 2, or 4
processors. Each run lasts 1,000,000 generations.
We record both the CPU time (system time + user time) and the wall time on each
node during each run; we use the maximum wall time as the execution time for each run.
We summarize tree samples using metrics provided in Section 5.2.3. We used the
sumt feature provided in PBPI.
In our experiments, we ran sequential versions of MrBayes using the run
configuration given in Figure 5-2. This configuration corresponds to a run with 4 MCMC
chains under JC69 model, each chain lasting 1,000,000 generations.
We did not use the parallel version of MrBayes because it showed no noticeable
change in execution time when run on 1, 2, 4, and 8 processors.
lset nst=1 rates=equal;
prset statefreqpr=fixed(0.25,0.25,0.25,0.25);
mcmc ngen=1000000 nchains=4 nrun=1 samplefreq=100
     printfreq=10000 swapfreq=10;

Figure 5-2: Run configuration for MrBayes

5.2.5 The computing platforms
We ran experiments on three systems: NICK at the University of South Carolina, DORI in
the SCAPE laboratory at Virginia Tech, and SystemX at Virginia Tech. NICK is a 76-node
Intel Xeon-based cluster; each node has two dual-core 3.2 GHz Intel Xeon CPUs and a
total of 4 GB of memory. DORI is an 8-node AMD Opteron-based dual-core cluster; each
node has two dual-core 1.8 GHz AMD Opteron 265 CPUs and a total of 4 GB of memory.
SystemX is a terascale computing facility with 1100 Apple Xserve G5 cluster nodes; each
node has two 2.3 GHz PowerPC 970FX processors and 4 GB of ECC DDR400 (PC3200)
RAM. Though the three systems give similar results, we report the results collected on
SystemX because it is a stable production system.
5.3 Results on model tree FUSO024
5.3.1 The overall accuracy of results
The overall measured accuracy results of PBPI for model tree FUSO024 are shown in
Tables 5-3 and 5-4. In Table 5-3, we show how many times the model tree has been
found as the maximum probability tree (MPP), or the 50% majority consensus tree
(CON), or in the 95% credible set of trees (CTS). In Table 5-4, we show the average
topological distance (½ R-F topological distance) between the model tree and the
estimated trees. The results indicate that the accuracy of PBPI improves as more
characters are included in the analysis. Using sequences of 10,000 characters, the model
tree was found in all 5 datasets under the JC69 model (an ideal situation) and in 4 of the
5 datasets under the K2P model (where some assumptions are violated). These results
also imply that the 95% credible set of trees provides better accuracy than the majority
consensus tree and the maximum probability tree.
Table 5-2: PBPI run configurations for validation and verification

Parameter            Value     Meaning
model                JC69      Use JC69 as the model of evolution
nrun                 1         Run once for each dataset
number_of_chains     4         The number of MCMC chains for each run
multipleTry          disabled  Disable the multiple-try MCMC feature
maximum_generation   1000000   The maximum number of MCMC generations
sample_interval      100       How frequently to sample states
print_interval       10000     How frequently to print states
nburnin              0         Start sampling from the beginning
rng1::seed           time      Use the current time as the seed of RNG-1
rng2::seed           time      Use the current time as the seed of RNG-2
mcmc::bootstrap      disabled  Disable in-chain MCMC resampling
exchange             enabled   Enable chain-to-chain exchange
exchange_interval    10        How frequently to perform exchanges
recombine            disabled  Disable chain-to-chain recombination
variablestep         disabled  Disable variable proposal steps
stochasticNNI        enabled   Enable stochastic NNI proposals
branch               enabled   Enable stochastic branch proposals
Stochastic SPR       enabled   Enable stochastic SPR proposals
Stochastic TBR       enabled   Enable stochastic TBR proposals
Sequence parallel    enabled   Enable sequence-level parallelism
#chain per group     1         One chain per row on the grid
Chain parallel       enabled   Enable chain-level parallelism
num_partitions       2         Distribute sequences across 2 nodes
5.3.2 Further analysis
Results shown in Tables 5-3 and 5-4 verify that the PBPI implementation is valid for
phylogenetic inference and able to find the correct tree with desirable accuracy.

Table 5-3: The number of datasets in which the model tree FUSO024 is found as the maximum probability tree, in the 95% credible set of trees, and as the 50% majority consensus tree. A total of 5 datasets are used in each case.

                 JC69 Model        K2P Model
# of characters  MPP  CTS  CON     MPP  CTS  CON
1000             0    5    0       0    3    0
5000             1    3    1       1    4    0
10000            3    5    2       2    4    2

Table 5-4: The average distances between the model tree FUSO024 and the maximum probability tree, the 95% credible set of trees, and the 50% majority consensus tree. A total of 5 datasets are used in each case.

                 JC69 Model        K2P Model
# of characters  MPP  CTS  CON     MPP  CTS  CON
1000             1.6  0.0  1.8     2.6  0.6  2.6
5000             1.4  0.4  1.4     1.0  0.2  1.2
10000            0.4  0.0  0.6     0.8  0.2  0.8

In this section, we examine each estimated tree that differs from the model tree to find
what caused the estimation errors. Table 5-5 provides the estimates of PBPI for each
dataset. The results indicate that the MPP trees estimated from datasets #1 and #4, and
the consensus trees estimated from datasets #1, #2, and #4, differ from the model
tree. Examination has verified that all five of these trees have the same topology. So we
only compare the MPP tree estimated from dataset #1 with the model tree.
We provide the phylograms of the model tree and the MPP tree in Figures 5-3 and 5-4,
respectively. These phylograms are drawn using the TreeView program [155]. These two
figures indicate that the only difference between the model tree FUSO024 and the MPP
tree lies in the placement of three taxa (fus.necph2, fus.necph3, and af044948). In the model
tree, the relation among these three taxa is:
(fus.necph2:0.0088,(fus.necph3:0.0063,af044948:0.0000):0.0005).
In the first occurrence of the MPP tree, the relationship is estimated as:
((fus.necph2:0.0090,fus.necph3:0.0061):0.0000,af044948:0.0000).
These two trees are in fact topologically equivalent once zero-length branches are taken into account.
Table 5-5: The topological distances between the model tree FUSO024 and the maximum probability tree, the 95% credible set of trees, and the 50% majority consensus tree for datasets with 10,000 characters. Datasets are simulated under the JC69 model.

Dataset                    MPP  CTS  CON
fuso024_L10000_jc69_D001   1    0    1
fuso024_L10000_jc69_D002   0    0    1
fuso024_L10000_jc69_D003   0    0    0
fuso024_L10000_jc69_D004   1    0    1
fuso024_L10000_jc69_D005   0    0    0
The above examination further verifies the correctness of PBPI and indicates that
the accuracy measurements shown in Tables 5-3 and 5-4 would improve significantly
if zero-length branches were accounted for in the topological distance calculation.
[Phylogram of the 24 taxa in FUSO024; scale bar 0.1.]

Figure 5-3: The phylogram of the model tree FUSO024
[Phylogram of the estimated tree; scale bar 0.1.]

Figure 5-4: The MPP tree estimated from dataset fuso024_L10000_jc69_D001
5.3.3 PBPI stability
The above accuracy results were obtained from a single run on each dataset. To
demonstrate that PBPI produces stable estimates, we examined 10 individual runs
on the dataset fuso024_L10000_jc69_D001. The resulting topological distances between the
model tree and the three summarized trees are shown in Figure 5-5, where the x-axis is the
index of each run and the y-axis is the ½ R-F topological distance. To show that stability
is not affected by the number of processors, the first 5 runs used 4 processors and
the second 5 runs used 8 processors. The results show that 8 runs obtained the same
estimate, while the other 2 runs estimated one branch differently. As discussed in
Section 5.3.2, the topological distances shown here should be corrected by including the
zero-length branches in the topological calculation; after the correction, all 10 trees are
equivalent to the model tree. These empirical results confirm that the estimates
obtained from PBPI are stable.
[Chart: PBPI estimation stability; x-axis: index of runs (1-10); y-axis: ½ R-F topological distance (0-2.5); series: MPP, CTS, CON.]

Figure 5-5: Estimation variances in 10 individual runs
[Phylogram of the 50 taxa in BURK050; scale bar 0.1.]

Figure 5-6: The phylogram of the model tree BURK050
5.4 Results on model tree BURK050
In Table 5-6, we show the average topological distances between the model tree
BURK050 and the maximum probability tree (MPP), the 95% credible set of trees (CTS),
and the 50% majority consensus tree (CON). The results indicate that the accuracy
improves as more characters are included in the analysis. After correcting for the
zero-length-branch problem, we conclude that PBPI is a consistent phylogenetic method.
The model tree BURK050 and the maximum probability tree estimated from
the dataset burk050_L10000_jc69_D001.nex by PBPI are provided in Figures 5-6 and 5-7.
This MPP tree has three bipartitions that differ from the model tree; the locations of these
differences are circled in Figure 5-7. All of them are artifacts of the R-F distance
metric, which takes no account of zero or near-zero branch lengths.
Table 5-6: The average distances between the model tree BURK050 and the maximum probability tree, the 95% credible set of trees, and the 50% majority consensus tree. A total of 5 datasets were used in each case.

                 JC69 Model         K2P Model
# of characters  MPP   CTS  CON     MPP   CTS  CON
1000             20.8  4.2  13.4    24    4    13.2
5000             5.2   1.6  6.6     5.6   1.2  6.2
10000            3     0.8  4.2     3.4   0.4  4
The summary of tree samples includes 625 trees in the 95% credible tree set, most of
which have at most 5 bipartitions different from the model tree. The posterior probability
of the MPP tree for the examined dataset is about 0.6%. The posterior probabilities of the
first 50 trees in the credible set are shown in Figure 5-8, and the "uncorrected" (see the
discussion in Section 5.3.2) topological distance distributions of these trees are shown in
Figure 5-9. These two figures indicate that PBPI has the capability to find multiple
statistically equivalent trees that are close to the model tree. However, as we use only
one model tree to generate the data, the posterior probabilities of these tree samples
require careful interpretation.
[Chart: probability distribution of the top 50 most probable trees; x-axis: index of the trees (0-50); y-axis: probability approximated from tree samples (0-0.007).]

Figure 5-8: The posterior distribution of the top 50 most probable trees
5.5 Chapter summary
In this chapter, we validated the correctness of the PBPI implementation and measured its
accuracy using a simulation study. We randomly chose several phylogenetic trees
published by the RDP-II project as the model trees, simulated a group of datasets with
different numbers of characters under the JC69 and K2P models, and analyzed these
datasets using PBPI. The experimental results showed that PBPI estimated the correct
trees, or equivalent trees, for all datasets with 10,000 characters. We therefore conclude that
PBPI is both correct and consistent for inferring phylogenies under ideal situations.
The results also indicate that when some of the assumptions are violated (for
example, when there is a transition/transversion bias in the simulated data), PBPI still
achieves estimates that are very close to the model trees.
[Chart: topological distances between the model tree and the top 50 most probable trees; x-axis: index of the trees (0-50); y-axis: topological distance (0-6).]

Figure 5-9: The distribution of topological distances of the top 50 most probable trees
We provided detailed results on the model trees FUSO024 and BURK050. Multiple
runs on the same dataset indicated that PBPI produced stable estimations, though the
method itself was a stochastic algorithm.
We also showed the probability distributions and topological distance distributions of
the first 50 trees in the credible tree sets. The results demonstrated the possibility that
there were multiple, statistically-equivalent phylogenetic trees for the same dataset; this
signified a potential advantage of Bayesian methods over other optimization-based
methods which produce only one tree as the estimation.
We measured the execution time of both PBPI and MrBayes; the results presented in
Chapter 6 showed that for the same dataset and similar run configurations, PBPI ran
much faster than MrBayes, both in sequential and in parallel. On a production HPC system,
SystemX, PBPI reduced the execution time 5 to 10 times when running on a single node,
and hundreds of times when running on 32 nodes.
Our experiments also showed a couple of issues with MCMC-based Bayesian
phylogenetic inference, including constraints caused by limited numeric precision in
floating point number representation and the failure of Metropolis-coupled MCMC to
infer large phylogenies (>500 taxa). These limitations call for further improvement of
Bayesian phylogenetic inference, and we leave them as future work.
Chapter 6
Performance Evaluation
6.1 Introduction
In Chapter 5, we provided verification and accuracy measurements of our PBPI
implementation. In this chapter we further evaluate the computational performance of
PBPI. As discussed in Chapter 4, the computational performance of PBPI is studied in
terms of parallel speedup and scalability.
The performance of a parallel algorithm is measured by speedup or efficiency. The
speedup of a parallel algorithm using p processors is defined as

    S_P(p) = T_0 / T(p).    (6-1)

Here T_0 is the running time of the fastest known sequential algorithm on one processor
for the same problem, and T(p) is the running time of the parallel algorithm on p
processors.
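As a sketch, equation (6-1) amounts to a simple ratio of measured run times; the timing values below are hypothetical, not measurements from SystemX:

```python
# Speedup per equation (6-1): S_P(p) = T_0 / T(p).
def speedup(t_sequential, t_parallel):
    return t_sequential / t_parallel

t0 = 563.0                                        # hypothetical one-processor time (s)
parallel_times = {4: 150.0, 16: 40.0, 64: 12.0}   # hypothetical T(p) values

for p, tp in sorted(parallel_times.items()):
    print(p, round(speedup(t0, tp), 1))
```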
There are a number of possibilities for timing analysis, and different methods will
give disparate results. Figure 6-1 shows the different values under different timing
methods (user time versus wall time) for the same example. In this dissertation, we chose
wall clock time, that is, the elapsed time between the start and the end of a specific run.
Wall clock time makes the speedup smaller than that computed with other timing
methods, such as user time, because it includes such negative effects as
communication overhead, idle time caused by imbalance, and synchronization. One
disadvantage of using wall clock time to measure speedup is that the same program may
give different results in a non-dedicated environment.
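The distinction can be seen directly in Python, where `time.perf_counter` measures wall-clock time and `time.process_time` measures CPU time charged to the process; the `sleep` call stands in for communication or idle time:

```python
import time

def busy(n):
    # CPU-bound work, counted by both clocks.
    total = 0
    for i in range(n):
        total += i * i
    return total

wall_start = time.perf_counter()
cpu_start = time.process_time()
busy(200_000)
time.sleep(0.1)            # idle time: counted by the wall clock only
wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start

print(wall > cpu)          # wall time includes the idle period
```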
Figure 6 - 1: Different speedup values computed by wall clock time and user time (x-axis: number of processors, 0 to 32; y-axis: speedup; series: wall clock time, user time)
6.2 Experimental methodology
We evaluated the implementation of PBPI on SystemX, a terascale computing facility
at Virginia Tech. This system has 1100 Apple Xserve G5 cluster nodes, each with
dual 2.3 GHz PowerPC 970FX processors and 4 GB of ECC DDR400 (PC3200)
RAM. The cluster nodes were connected with SilverStorm 91200 InfiniBand-based
switches with a bidirectional port speed of 10 Gbps (billions of bits per second).
We used a subset of the datasets described in Chapter 5 as the benchmark datasets for
performance evaluations. Given each model tree, we generated two datasets, with
1000 and 10,000 characters respectively. These datasets are listed in Table 6-1.
We analyzed the benchmark datasets under the JC69 model. We set the maximum
generations for the MCMC run at 200,000. All other analysis configurations were the
same as those used in Chapter 5.
In our evaluation, we benchmarked the execution time of the sequential version of PBPI
and MrBayes using the same dataset on a single node. Then we profiled the execution
time of PBPI running in parallel mode and calculated the speedup values under various
run settings (different numbers of processors, different grid topologies, and different
problem sizes).
To get the performance data, we executed 5 runs for each case. The average execution
time over 5 runs was used to compute the speedup. Since PBPI is a stochastic algorithm,
we also showed the variance of all measured numbers. We used wall clock time in the
timing analysis. For parallel executions, as each node may take a different execution
time, we used the recorded maximum values of execution time for all evaluations.

Table 6 - 1: Benchmark datasets used in the evaluation

Dataset          # of taxa   # of characters   # of patterns
FUSO024_1000        24          1000              402
FUSO024_10000       24         10000             1972
BURK050_1000        50          1000              432
BURK050_10000       50         10000             2429
ARCH107_1000       107          1000             1000
ARCH107_10000      107         10000             9996
BACK218_1000       218          1000             1000
BACK218_10000      218         10000            10000
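The timing methodology above (per run, the maximum over nodes; then the mean and spread over the 5 runs) can be sketched as follows; the node timings are invented for illustration:

```python
from statistics import mean, stdev

# node_times[r][i]: wall-clock time of node i in run r (hypothetical values)
node_times = [
    [101.0, 104.5, 103.2],
    [ 99.8, 102.1, 100.9],
    [103.4, 101.7, 105.0],
    [100.2, 100.9,  99.5],
    [102.8, 104.1, 103.3],
]

# The slowest node determines each run's time.
run_times = [max(run) for run in node_times]

avg_time = mean(run_times)     # used to compute the speedup
spread = stdev(run_times)      # reported as the spread across runs
print(round(avg_time, 2), round(spread, 2))
```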
6.3 The sequential performance of PBPI
Since MrBayes is one of the most widely used Bayesian phylogenetic inference programs,
we compared the performance of PBPI against MrBayes. The version of MrBayes we
tested was 3.1.
6.3.1 The execution time of PBPI and MrBayes
Table 6-2 provides the profiled execution time of PBPI and MrBayes on the benchmark
dataset running on a single node on SystemX. The measured results indicated that when
both programs run in sequential mode, PBPI runs 5 to 19 times faster than MrBayes,
depending on the problem size. The larger the problem size, the larger the performance
improvement of PBPI over MrBayes. Also, as the table shows, the variances were less than
6% during our experiments. We credit this performance increase to improved memory
management and a local update scheme in the likelihood calculation.

Table 6 - 2: Sequential execution time of PBPI and MrBayes

Dataset           T_PBPI (seconds)   T_MrBayes (seconds)   S = T_MrBayes / T_PBPI
fuso024_L1000       102.8±2.2           605.4±0.3             5.8±2%
fuso024_L10000      563±28             2765.2±1.6             4.7±5%
burk050_L1000       169.8±4.8          1403.3±0.5             8.0±3%
burk050_L10000      903.2±56.2         7257.0±28.1            7.5±6%
arch107_L1000       643.2±33.8         6796.5±10.5           10.0±5%
arch107_L10000     6407.6±346.4       66130.7±145.2           9.8±5%
back218_L1000       836.0±41.0        13913.5±20.5           15.8±5%
back218_L10000     7978.8±197.0      156233.3±333.3          19.1±3%
6.3.2 The quality of the tree samples drawn by PBPI
Since PBPI uses a different implementation than MrBayes, and since both are
stochastic algorithms, an important concern arises about the quality of the MCMC
samples. In other words, was the quality of the MCMC samples drawn by PBPI as good as
that of the samples drawn by MrBayes?
We answered such questions in two ways. First, we compared the log likelihood plots
of both programs. Second, we compared the summaries of both tree samples.
Figure 6-2 plots the log likelihood of the tree samples for the dataset FUSO024_1000
drawn in the first 5000 generations by PBPI and MrBayes. Both programs reached the
same level of stationary equilibrium at about the same time. This plot demonstrated that
tree samples drawn by PBPI and MrBayes were similar.
Figure 6 - 2: Log likelihood plot of the tree samples drawn by PBPI and MrBayes (x-axis: generations, 0 to 5000; y-axis: -lnL; series: PBPI, MrBayes)

The consensus trees, with the posterior probability of each bipartition summarized from
PBPI and MrBayes for the dataset FUSO024_1000, are shown in Figures 6-3 and 6-4.
Both trees have the same topology as the model tree (shown in Figure 5-3), which is used
to simulate the dataset. The differences between the consensus tree estimated by
MrBayes and PBPI lay in the posterior probability values of three bipartitions (groups).
For example, in the tree estimated by PBPI, the posterior probability of the group
(AF044948, Fus.necph3) was 0.75, while in the tree estimated by MrBayes it was 0.92.
As noted in Chapter 2, the interpretation of the posterior probability of a bipartition is
far from clear and is difficult to verify. Therefore, since both programs constructed the
true tree correctly, we concluded that the quality of the MCMC chain evolved by PBPI
was at least as good as that of the chain evolved by MrBayes.
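The bipartition posterior probabilities shown in these consensus trees are simply the frequency of each group among the sampled trees. A sketch, using an illustrative frozenset encoding of groups and invented samples:

```python
from collections import Counter

# Each sampled tree is represented by its set of groups (illustrative).
samples = [
    {frozenset({"Fus.necph3", "AF044948"}), frozenset({"Prg.modest", "Prg.maris"})},
    {frozenset({"Fus.necph3", "AF044948"}), frozenset({"Prg.modest", "Prg.maris"})},
    {frozenset({"Fus.necph2", "Fus.necph3"}), frozenset({"Prg.modest", "Prg.maris"})},
    {frozenset({"Fus.necph3", "AF044948"}), frozenset({"Prg.modest", "Prg.maris"})},
]

counts = Counter(group for tree in samples for group in tree)
group = frozenset({"Fus.necph3", "AF044948"})
print(counts[group] / len(samples))   # posterior probability estimate
```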
6.3.3 Summary of the sequential comparison
Based on empirical results in this section, we concluded that the PBPI implementation
achieved the same quality and accuracy as MrBayes, but PBPI ran much faster than
MrBayes. In fact, the sequential version of PBPI was the same program as the parallel
version of PBPI, except that we set the number of computing processors to 1. In the
following sections, we compare the speedup of the parallel implementation against PBPI
run in sequential mode.
Figure 6 - 3: The consensus tree estimated by PBPI
(tree drawing with branch lengths, scale bar 0.1; the posterior probability of each bipartition is shown at the corresponding node)
(tree drawing with branch lengths, scale bar 0.1; the posterior probability of each bipartition is shown at the corresponding node)
Figure 6 - 4: The consensus tree estimated by MrBayes
6.4 Parallel speedup for fixed problem size
We used benchmark datasets FUSO024_10000, ARCH107_1000, and BACK218_10000
as fixed problems to investigate the parallel speedup of PBPI. These three datasets
represented three kinds of problems:
• FUSO024_10000: problems with a small number of taxa and long sequences of
characters;
• ARCH107_1000: problems with a medium number of taxa and short sequences
of characters;
• BACK218_10000: problems with a large number of taxa and long sequences of
characters.
For each dataset, we ran 4 chains with a parallel tempering scheme, each chain lasting
200,000 generations. We scaled the number of processors from 4 to 64 and calculated the
average, maximum, and minimum values of the speedup. The results are shown in
Figures 6-5, 6-6, and 6-7.
The results indicated that for all three benchmark datasets, PBPI achieved roughly
linear speedup, with measurement errors ranging from -16% to 12%. For
FUSO024_L10000 dataset, the maximum speedup was 28.1 and the minimum speedup
was 25.6 when using 64 nodes. For ARCH107_L1000 dataset, the maximum speedup
was 24.7 and the minimum speedup was 23.0 when using 64 nodes. For the BACK218_L10000
dataset, the maximum speedup was 50.6 and the minimum speedup was 43.6 when using 64
nodes. We also observed that when using a small number of processors (<8), the
ARCH107_L1000 dataset had a larger speedup than the FUSO024_L10000 dataset.
When the number of processors was increased, this relationship reversed.
Combining the performance improvements from both sequential optimization and parallel
speedup, PBPI ran up to 874 times faster using 64 processors than MrBayes using a
single processor on the same dataset on a system similar to SystemX.
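The combined figure is the product of the sequential advantage over MrBayes (Table 6-2) and PBPI's own parallel speedup; the parallel value below is an assumed average for the 64-processor runs, used only to illustrate the arithmetic:

```python
sequential_ratio = 19.1    # T_MrBayes / T_PBPI for back218_L10000 (Table 6-2)
parallel_speedup = 45.8    # assumed average PBPI speedup on 64 processors

combined = sequential_ratio * parallel_speedup
print(round(combined))     # on the order of the reported ~874x
```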
To demonstrate that PBPI running in parallel mode would generate statistically
equivalent tree samples, we summarized the tree samples output by parallel MCMC runs
and compared them with the ground truth: the model tree. The results matched our
informal argument and the simulation study in Chapter 5. Figure 6-8 shows the
tree estimated using 64 processors. After one re-rooting operation (which does not
change the topology of the corresponding unrooted tree), the only difference between the
estimated tree and the model tree was that the subtree (Fus.necph2, (Fus.necph3,
AF044948)) in the model tree became ((Fus.necph2, Fus.necph3), AF044948). Since the
tree edge between AF044948 and (Fus.necph2, Fus.necph3) had zero length in both trees,
these two trees were effectively the same.
Figure 6 - 5: Parallel speedup of PBPI for dataset FUSO024_L10000 (x-axis: number of processors, 4 to 64; y-axis: speedup; series: average, maximum, minimum)
Figure 6 - 6: Parallel speedup of PBPI for dataset ARCH107_L1000 (x-axis: number of processors, 4 to 64; y-axis: speedup; series: average, maximum, minimum)
Figure 6 - 7: Parallel speedup of PBPI for dataset BACK218_L10000 (x-axis: number of processors, 4 to 64; y-axis: speedup; series: average, maximum, minimum)
(tree drawing with branch lengths, scale bar 0.1; the posterior probability of each bipartition is shown at the corresponding node)
Figure 6 - 8: The consensus tree estimated by PBPI on 64 processors
6.5 Scalability analysis
Before we discuss the scaled parallel speedup and impacts of grid topology, we will
analyze scalability of PBPI using equation (4-5). The workload of sequential MCMC
algorithms (without considering memory latency in this section) is approximately
0 1 2 3logotherlocal update global update TT T
W T k hm n k hmn k− −
= ≈ + + . (6- 2)
Here, ik , for 1,...,3i = is constant, n is the number of taxa, m is the number of
characters patterns, and h is the number of chains. Similarly, the computation time of
the parallel MCMC algorithm can be approximated as
2 31
5 64
1 24 5
6 3 7
( log )( ) log log
log
p
T TT
T TT
k hm n k hmn cp hT p k kp h c
k m n k k
+≈ + +
+ + +. (6- 3)
Here, ik for 1,...,7i = , is constant, p is the number of processors, and c is the number of
chains per row. The terms in the right part of this equation represent real computation
time ( 1T ), row collective communication time ( 2T ), column collective communication
time ( 3T ), imbalance overhead ( 4T ), sequential overhead ( 5T ), and parallel overhead ( 6T ),
respectively. The speedup of the parallel algorithm can be predicted as
( )
( )4 5 6 3 7
1 2 3
1( , , ; ) .log log log
1og / /
S n m h p pcp hk k k m n k k
p h chmn k l n n k k hmn
⎛ ⎞⎜ ⎟⎜ ⎟⎜ ⎟≈ ⎜ ⎟⎛ ⎞ ⎛ ⎞⎜ ⎟+ + + +⎜ ⎟ ⎜ ⎟
⎝ ⎠ ⎝ ⎠⎜ ⎟+ ⋅⎜ ⎟+ +⎝ ⎠
(6- 4)
From equation (6-4), we make the following observations:
1) The problem size (W ) is determined by n , m , and h . According to equation (6-4),
the larger the problem size, the bigger the speedup that can be obtained.
2) It is possible to remove the row collective communication overhead (k_5 log(h/c)),
since no row collective communication is needed in the swap step, and to decrease the
column collective communication overhead (k_4 log(cp/h)). We discussed three
chain-swapping algorithms, together with analysis and empirical results, in an earlier
paper [156].
3) Communication overhead (k_4 log(cp/h) + k_5 log(h/c)) and imbalance (k_6 m log n) are
major factors that influence performance scalability. For a small problem size,
communication overhead is important. For a larger problem size, imbalance is the
major obstacle.
4) Assume we have synchronized the proposal types using RNG-2 and also balanced the
tree through a tree re-rooting operation. Then the imbalance overhead is mainly caused by
the sample mean of multiple random samples from U(0, log n). This means that small
exchange steps result in large imbalance. One way to reduce the imbalance overhead is to
enlarge the exchange step. Another way is to use asymmetric algorithms to decouple the
interaction of different chains.
5) When p and h are fixed, running more chains for each row may result in larger
column size and thus larger row communication overhead.
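Observation 4 above can be checked with a small simulation: if per-generation work is drawn from U(0, log n), the relative gap between the slowest chain and the mean shrinks as the exchange interval grows (the model and all values here are illustrative):

```python
import math
import random

random.seed(42)

def relative_imbalance(num_chains, interval, n=218, trials=2000):
    """Average (max - mean) / mean of per-chain work between exchange steps."""
    total = 0.0
    for _ in range(trials):
        work = [sum(random.uniform(0, math.log(n)) for _ in range(interval))
                for _ in range(num_chains)]
        m = sum(work) / num_chains
        total += (max(work) - m) / m
    return total / trials

short = relative_imbalance(num_chains=4, interval=1)
long_interval = relative_imbalance(num_chains=4, interval=100)
print(short > long_interval)   # longer exchange intervals reduce imbalance
```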
6.6 Parallel speedup with scaled workload
With a fixed problem size, parallel speedup eventually levels off once the number of
processors reaches some threshold. Beyond that point, there is no further performance gain
from using more processors. This limitation is governed by Amdahl's law [143].
However, additional speedup can be obtained by scaling the problem size. Though
speedup obtained from scaling the problem size will not reduce the execution time further,
it does have the advantage of improving the accuracy and precision of the solution.
As implied in equation (6-4), we can scale the Bayesian phylogenetic inference
problem size in three dimensions: 1) the number of taxa; 2) the number of characters; and
3) the number of MCMC chains. Though it is an interesting problem, how the results of
Bayesian phylogenetic inference are affected by scaling the problem in these three
dimensions is beyond the scope of this dissertation. In this section, we focus on the
impact of the problem size on the speedup achieved by PBPI.
6.6.1 Scalability with different problem sizes
The speedup curves of PBPI for datasets FUSO024_L10000, ARCH107_1000, and
BACK218_L10000 are shown in Figure 6-9.
As shown in Chapter 2, the amount of computation required in a likelihood evaluation
is proportional to the number of unique site patterns (bounded by the number of
characters); increasing the number of characters linearly increases the execution time.
Equation (6-4) indicates that speedup scales very well with the number of characters.
On the other hand, increasing the number of taxa increases the per-update computation
only at the rate of log N (where N is the number of taxa). Thus, though the number of
taxa in dataset FUSO024_L10000 was much smaller than that in ARCH107_L1000,
FUSO024_L10000 achieved a slightly larger speedup than ARCH107_L1000.
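The site-pattern compression this refers to can be sketched directly: identical alignment columns collapse into one pattern with a count, so the likelihood is evaluated per unique pattern rather than per character (the toy alignment below is invented):

```python
from collections import Counter

# Toy alignment: 3 taxa x 8 characters (one sequence per row).
alignment = [
    "AAGGTTCA",
    "AAGGTCCA",
    "ATGGTTCA",
]

# Each site (column) becomes a pattern string; identical sites collapse.
patterns = Counter("".join(col) for col in zip(*alignment))

print(len(alignment[0]), len(patterns))   # 8 characters, 6 unique patterns
```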
We also observed that BACK218_L10000 achieved about 2 times greater speedup
than ARCH107 and FUSO024 using 64 nodes. This meant we could scale the number of
processors to at least 128 nodes and still maintain the speedup trend.
Figure 6 - 9: Parallel speedup of PBPI for different problem sizes (x-axis: number of processors, 4 to 64; y-axis: speedup; series: BACK218_L10000, FUSO024_L10000, ARCH107_L1000)
6.6.2 Scalability with the number of chains
Increasing the number of chains in a dataset is equivalent to increasing the problem size.
From equation (6-4), the direct conclusion is that using more chains will lead to greater
speedup. A larger number of chains can also increase the interval between two swap steps
executed on a given chain, so imbalance overhead can also be reduced. However, since
Metropolis-coupled MCMC samples only the cold chain, introducing too many chains may
lead to diminishing returns. On the other hand, some argue that multiple chains provide
multiple starting points to explore the tree space. Further experiments are needed to
decide which argument is true.
6.7 Chapter summary
In this chapter, we presented experimental evaluation of the parallel performance of PBPI
on a terascale computing system, SystemX. Comparing PBPI with MrBayes in
sequential mode on different benchmark datasets showed that PBPI runs 5 to 19 times
faster than MrBayes while achieving similar results. We measured the speedup of
PBPI on several benchmark datasets on SystemX; it achieved roughly linear speedup
when using 4 to 64 processors. This lack of slowdown indicated that PBPI has the
capability to scale up to hundreds of processors.
The memory space was distributed in our algorithms; the duplicated data was limited
primarily to the input dataset, which was relatively small compared with the data likely to
be encountered in ongoing modeling. Our algorithms could make inferring large
phylogenies, which require huge memory space and compute cycles, feasible and fast.
Combining both sequential and parallel speedup, PBPI may run 874 times faster than
MrBayes (version 3.1) even when using only 64 nodes. PBPI also exploits parallelism at
the bootstrapping and dataset levels. This means a Bayesian phylogenetic inference that
once took several months to complete can now be finished in less than half a day. This
performance improvement makes large-scale investigation of Bayesian phylogenies a
reality.
In future work, we will extend PBPI to support more evolutionary models of new
data types.
Chapter 7
Summary and Future Work
7.1 The big picture
In this work, we studied and extended the framework of Bayesian phylogenetic inference
using Markov Chain Monte Carlo (MCMC) methods. Bayesian analysis is of interest to
scientists in fields including biology, statistics, and computer science. Various issues
have been identified for current Bayesian phylogenetic inference methods and
implementations:
1) As the evolutionary process happens only once, there is only one phylogenetic tree
corresponding to the evolutionary process for a group of taxa. This concept has been
held among biologists for a long time. A convincing interpretation of Bayesian
posterior probability of phylogenetic models within a biological and statistical context
is therefore required.
2) Most, if not all, Bayesian phylogenetic inferences use an MCMC method to generate
samples from the posterior distribution, which lies in a complex, high-dimensional space.
Previous MCMC methods may fail to explore the posterior distribution accurately
due to slow convergence, high stickiness, and local optima.
3) MCMC methods are computationally expensive. A comprehensive Bayesian
phylogenetic inference of hundreds of taxa under complex evolutionary models may
take months to finish. This disadvantage limits the application of Bayesian
phylogenetic inference to very large phylogenetic problems and also hinders further
investigations of the behavior of Bayesian phylogenetic inference methods.
4) Discrepancies between the confidence support values obtained from Bayesian
phylogenetic inference and those values obtained from traditional phylogenetic
methods are highly debatable. Without identifying the inherent causes of such
discrepancies, there is little justification for preferring one method over the other. It is
even worse to dispute one method using a conclusion drawn from the other method
but without acknowledging that different assumptions are being used by the two
methods.
5) The increasing size of datasets is making Bayesian phylogenetic inference a grand
challenge. Phylogenetic analysis of tens of thousands of taxa with hundreds of genes
may deepen our understanding about biological evolution and diversity. Such
phylogenetic analyses are better conducted within a statistically sound probabilistic
framework that has the ability to incorporate existing knowledge and comprehensive
models. Bayesian analysis is promising, but some breakthroughs are needed to make
it practical.
Our work attempts to provide solutions for some of the above issues through various
efforts:
1) For the correct interpretation of Bayesian posterior probability of phylogenetic
models, we revisited the Bayesian phylogenetic inference framework and reviewed
different options for Bayesian phylogenetic methods. We found that the Bayesian
posterior probability distribution of phylogenetic models is highly correlated with the
likelihood ranking of those models. The likelihood ratio of two phylogenetic models
roughly equals the posterior probability ratio of the two models. Therefore, the
posterior probability of a phylogenetic model reflects how strongly the model is
supported by the data.
2) To make MCMC methods more robust and more efficient, we proposed several
extended tree mutation operators which vary the step length to explore a larger region of
the phylogenetic model state space. We also studied several MCMC strategies for
Bayesian phylogenetic inference that can effectively increase the mixing rate of the
MCMC chains, making the chains converge faster with less danger of being trapped in
regions separated by a high energy barrier.
3) To reduce the computation time of current Bayesian analysis and make Bayesian
phylogenetic inference feasible for solving large phylogenies which need long
computation time and have large memory requirements, we developed and implemented
PBPI as a high performance implementation of the Bayesian phylogenetic inference
method. The PBPI code can run on a wide range of parallel computing platforms.
4) Using a simulation study, we measured the accuracy of PBPI on several model trees
with different numbers of taxa. The empirical results show that PBPI is a consistent
phylogenetic method which can, given enough data, estimate all non-zero-length
branches correctly.
5) We also evaluated the performance of PBPI and compared it with MrBayes, one of
the leading Bayesian phylogenetic inference programs. In sequential mode, PBPI runs up
to 19 times faster than MrBayes for some of our tested datasets. In parallel mode, PBPI
achieves an average 46× speedup on 64 processors. Results also indicate that PBPI has the
capability to scale up to hundreds of processors given a proper problem size.
6) To further resolve the discrepancies between Bayesian posterior probability and
Bootstrap support value, we introduce in-chain resampling MCMC (IR-MCMC) methods
which combine data uncertainty and model uncertainty into a single confidence support.
Our analysis clarifies some misleading explanations of the bootstrap support and the
Bayesian posterior probability. The experimental results indicate that IR-MCMC can
capture the data variance present in the input dataset and incorporate data uncertainty
into an extended version of the Bayesian posterior probability.
In this research, we developed open-source software which will be made available to
the public for use in further research. Though this work is focused on Bayesian
phylogenetic inference, many ideas in this work are general and can be applied in solving
other problems.
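The relationship in point 1 above can be illustrated numerically: under a uniform prior over trees, the evidence term cancels and the posterior ratio of two trees equals their likelihood ratio (the log-likelihood values below are invented for illustration):

```python
import math

log_like = {"tree1": -12.4, "tree2": -15.1}   # hypothetical log likelihoods
prior = {"tree1": 0.5, "tree2": 0.5}          # uniform prior over the two trees

# Unnormalized posteriors; the shared evidence term cancels in the ratio.
post = {t: math.exp(log_like[t]) * prior[t] for t in log_like}

posterior_ratio = post["tree1"] / post["tree2"]
likelihood_ratio = math.exp(log_like["tree1"] - log_like["tree2"])

print(abs(posterior_ratio - likelihood_ratio) < 1e-9)
```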
7.2 Future work
This work is only the first step in building a comprehensive framework. In follow-on
work, we hope to extend this framework as follows:
1) Analyzing the performance of Bayesian phylogenetic methods more thoroughly
through theoretical and experimental study. We may need to develop a set of benchmark
datasets and models for performance analysis. We also need to run the benchmark under
an HPC environment and automate the benchmarking process. Considering the increasing
interest in Bayesian phylogenetic inference but insufficient investigation of the
performance of this method, a more comprehensive study of Bayesian phylogenetic
inference could clarify some of the confusion.
2) Developing a more robust, effective MCMC framework to support advanced
Bayesian analysis. There is no evidence to show that Metropolis-coupled MCMC always
approximates the posterior distribution correctly.
3) Extending Bayesian analysis to support more data types and models. As a general
statistical framework, the Bayesian approach has the potential to solve phylogenetic
problems in which uncertainties exist and need to be accommodated. In addition to
dealing with complex models for DNA sequences, Bayesian analyses can be used to
handle novel data types such as gene order [75] and genome contents [72-74, 166].
4) Developing a formal framework to assemble subtrees generated in Bayesian analysis
to a supertree with confidence support on each clade. One advantage of Bayesian analysis
is that the inference always provides measures of clade support. However, current
supertree approaches do not use such information.
5) To resolve the discrepancies between Bayesian posterior probability and bootstrap
support values, we extended Bayesian phylogenetic inference to include both
uncertainties and proposed using IR-MCMC (in-chain resampling MCMC) to estimate
the effects of both data uncertainty and model uncertainty. The experimental results
indicate that IR-MCMC can include data uncertainties into an extended version of
Bayesian posterior probability. As a future work, we need more thorough theoretical and
experimental investigations of the IR-MCMC method.
Bibliography
[1] Hillis, D. M., "Biology Recapitulates Phylogeny," Science, vol. 276, pp. 218-219,
1997.
[2] NSF, "Assembling the Tree of Life (ATOL)," 2003.
[3] Owen, R. J., "Helicobacter - species classification and identification," British
Medical Bulletin, vol. 54, pp. 17-30, 1998.
[4] Pennisi, E., "Modernizing the Tree of Life," Science, vol. 300, pp. 1692-1697,
2003.
[5] Graur, D. and Li, W.-H., Fundamentals of Molecular Evolution. Sunderland,
MA: Sinauer, 1991.
[6] Eisen, J. A. and Fraser, C. M., "Phylogenomics: Intersection of Evolution and
Genomics," Science, vol. 300, pp. 1706-1707, 2003.
[7] Nichol S. T., Spiropoulou C. F., Morzunov S, Rollin P. E., Ksiazek T. G.,
Feldmann H, Sanchez A, Childs J, Zaki S, Peters C. J., "Genetic identification of
a hantavirus associated with an outbreak of acute respiratory illness," Science,
vol. 262, pp. 914-917, 1993.
[8] Murphy, W. J., Eizirik, E., O'Brien, S. J., et al., "Resolution of the Early Placental
Mammal Radiation Using Bayesian Phylogenetics," Science, vol. 294, pp. 2348-
2351, 2001.
[9] Webster, A. J., Payne, R. J. H., and Pagel, M., "Molecular Phylogenies Link Rates
of Evolution and Speciation," Science, vol. 301, p. 478, 2003.
[10] Huelsenbeck, J. P., Ronquist, F., Nielsen, R., et al., "Bayesian Inference of
Phylogeny and Its Impact on Evolutionary Biology," Science, vol. 294, pp. 2310-
2314, 2001.
[11] Venter, J. C., Adams, M. D., Myers, E. W., et al., "The Sequence of the Human
Genome," Science, vol. 291, pp. 1304-1351, 2001.
[12] Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., et al., "GenBank," Nucl. Acids
Res., vol. 33, pp. D34-38, 2005.
[13] Stoesser, G., Baker, W., Van Den Broek, A., et al., "The EMBL Nucleotide
Sequence Database," Nucl. Acids Res., vol. 30, pp. 21-26, 2002.
[14] Tateno, Y., Miyazaki, S., Ota, M., et al., "DNA Data Bank of Japan (DDBJ) in
collaboration with mass sequencing teams," Nucl. Acids Res., vol. 28, pp. 24-26,
2000.
[15] Bairoch, A. and Apweiler, R., "The SWISS-PROT protein sequence database and
its supplement TrEMBL in 2000," Nucl. Acids Res., vol. 28, pp. 45-48, 2000.
[16] NSF, "Tree of Life Workshop III: Developing the Technology and Infrastructure
Needed for Assembling the Tree of Life," University of Texas, Austin December
2000.
[17] Li, W.-H., Molecular Evolution. Sunderland, MA: Sinauer Associates, 1997.
[18] Lio, P. and Goldman, N., "Models of Molecular Evolution and Phylogeny,"
Genome Res., vol. 8, pp. 1233-1244, 1998.
[19] Giribet, G. and Wheeler, W. C., "On Gaps," Molecular Phylogenetics and
Evolution, vol. 13, pp. 132-143, 1999.
[20] Wheeler, W., "Optimization alignment: the end of multiple sequence alignment in
phylogenetics?," Cladistics, vol. 12, pp. 1-9, 1996.
[21] Felsenstein, J., "Taking Variation of Evolutionary Rates Between Sites into
Account in Inferring Phylogenies," Journal of Molecular Evolution, vol. 53, pp.
0447-0455, 2001.
[22] Lunter, G., Miklos, I., Drummond, A., et al., "Bayesian phylogenetic inference
under a statistical insertion-deletion model," in Algorithms in Bioinformatics,
Proceedings, vol. 2812, Lecture Notes in Bioinformatics. Berlin: Springer-Verlag
Berlin 2003, pp. 228-244.
[23] Durbin, R., Eddy, S., Krogh, A., et al., Biological Sequence Analysis:
Probabilistic Models of Proteins and Nucleic Acids: Cambridge University Press,
1998.
[24] Robinson, D. M., Jones, D. T., Kishino, H., et al., "Protein evolution with
dependence among codons due to tertiary structure," Molecular Biology and
Evolution, vol. 20, pp. 1692-1704, 2003.
[25] Nylander, J. A. A., Ronquist, F., Huelsenbeck, J. P., et al., "Bayesian
phylogenetic analysis of combined data," Systematic Biology, vol. 53, pp. 47-67,
2004.
[26] Buckley, T. R., Arensburger, P., Simon, C., et al., "Combined data, Bayesian
phylogenetics, and the origin of the New Zealand cicada genera," Systematic
Biology, vol. 51, pp. 4-18, 2002.
[27] Gusfield, D., Algorithms on Strings, Trees, and Sequences: Computer Science and
Computational Biology: Cambridge University Press, 1997.
[28] Efron, B. and Tibshirani, R., An Introduction to the Bootstrap. London: Chapman
& Hall, 1993.
[29] Felsenstein, J., "Confidence limits on phylogenies: an approach using the
bootstrap," Evolution, vol. 39, pp. 783-791, 1985.
[30] Efron, B., Halloran, E., and Holmes, S., "Bootstrap confidence levels for
phylogenetic trees," PNAS, vol. 93, pp. 7085-7090, 1996.
[31] Sanderson, M. J. and Driskell, A. C., "The challenge of constructing large
phylogenetic trees," Trends in Plant Science, vol. 8, pp. 374-379, 2003.
[32] Nielsen, R., "Mutations as missing data: Inferences on the ages and distributions
of nonsynonymous and synonymous mutations," Genetics, vol. 159, pp. 401-411,
2001.
[33] Holder, M. and Lewis, P. O., "Phylogeny estimation: traditional and Bayesian
approaches," Nature Rev. Genet., vol. 4, pp. 275-284, 2003.
[34] Randle, C. P., Mort, M. E., and Crawford, D. J., "Bayesian inference of
phylogenetics revisited: developments and concerns," Taxon, vol. 54, pp. 9-15,
2005.
[35] Beaumont, M. A. and Rannala, B., "The Bayesian revolution in genetics," Nature
Reviews Genetics, vol. 5, pp. 251-261, 2004.
[36] Huelsenbeck, J. P., Larget, B., Miller, R. E., et al., "Potential applications and
pitfalls of Bayesian inference of phylogeny," Systematic Biology, vol. 51, pp. 673-
688, 2002.
[37] Lewis, P. O., "Phylogenetic systematics turns over a new leaf," Trends in Ecology
and Evolution, vol. 16, pp. 30-37, 2001.
[38] Huelsenbeck, J. P., Ronquist, F., Nielsen, R., et al., "Evolution - Bayesian
inference of phylogeny and its impact on evolutionary biology," Science, vol. 294,
pp. 2310-2314, 2001.
[39] Newton, M. A., Mau, B., and Larget, B., "Markov chain Monte Carlo for the
Bayesian analysis of evolutionary trees from aligned molecular sequences," in
Statistics in molecular biology, vol. 33: Institute of Mathematical Statistics, 1999.
[40] Mau, B., Newton, M. A., and Larget, B., "Bayesian phylogenetic inference via
Markov chain Monte Carlo methods," Biometrics, vol. 55, pp. 1-12, 1999.
[41] Larget, B. and Simon, D. L., "Markov chain Monte Carlo algorithms for the
Bayesian analysis of phylogenetic trees," Molecular Biology and Evolution, vol.
16, pp. 750-759, 1999.
[42] Pickett, K. M. and Randle, C. P., "The persistence of clade prior influence in
Bayesian phylogenetic analyses," Cladistics-the International Journal of the Willi
Hennig Society, vol. 20, pp. 602-602, 2004.
[43] Erixon, P., Svennblad, B., Britton, T., et al., "Reliability of Bayesian posterior
probabilities and bootstrap frequencies in phylogenetics," Systematic Biology, vol.
52, pp. 665-673, 2003.
[44] Douady, C. J., Delsuc, F., Boucher, Y., et al., "Comparison of Bayesian and
maximum likelihood bootstrap measures of phylogenetic reliability," Molecular
Biology and Evolution., vol. 20, pp. 248-254, 2003.
[45] Simmons, M. P., Pickett, K. M., and Miya, M., "How Meaningful Are Bayesian
Support Values?," Molecular Biology and Evolution, vol. 21, pp. 188-199, 2004.
[46] Pickett, K. M., Simmons, M. P., and Randle, C. P., "Do Bayesian support values
reflect probability of the truth?," Cladistics-the International Journal of the Willi
Hennig Society, vol. 20, pp. 92-93, 2004.
[47] Suzuki, Y., Glazko, G. V., and Nei, M., "Overcredibility of molecular
phylogenies obtained by Bayesian phylogenetics," PNAS, vol. 99, pp. 16138-
16143, 2002.
[48] Lewis, P. O. and Swofford, D. L., "Back to the future: Bayesian inference arrives
in phylogenetics," Trends in Ecology and Evolution, vol. 16, pp. 600-601, 2001.
[49] Goloboff, P. A. and Pol, D., "Cases in which Bayesian phylogenetic analysis will
be positively misleading," Cladistics-the International Journal of the Willi
Hennig Society, vol. 20, pp. 83-84, 2004.
[50] Mossel, E. and Vigoda, E., "Phylogenetic MCMC Algorithms Are Misleading on
Mixtures of Trees," Science, vol. 309, pp. 2207-2209, 2005.
[51] Bergmann, P. J. and Russell, A. P., "Application of Bayesian inference in the
phylogenetic analysis of multiple data sources: Intraspecific systematics of the
turnip-tailed gecko," Integrative and Comparative Biology, vol. 43, pp. 830-830,
2003.
[52] Glenner, H., Hansen, A. J., Sorensen, M. V., et al., "Bayesian inference of the
metazoan phylogeny: A combined molecular and morphological approach (vol 14,
pg 1644, 2004)," Current Biology, vol. 15, pp. 392-393, 2005.
[53] Voglmayr, H., Rietmuller, A., Goker, M., et al., "Phylogenetic relationships of
Plasmopara, Bremia and other genera of downy mildew pathogens with pyriform
haustoria based on Bayesian analysis of partial LSU rDNA sequence data,"
Mycological Research, vol. 108, pp. 1011-1024, 2004.
[54] Simmons, M. P. and Miya, M., "Efficiently resolving the basal clades of a
phylogenetic tree using Bayesian and parsimony approaches: A case study using
mitogenomic data from 100 higher teleost fishes," Cladistics-the International
Journal of the Willi Hennig Society, vol. 20, pp. 96-97, 2004.
[55] Geuten, K., Smets, E., Schols, P., et al., "Conflicting phylogenies of balsaminoid
families and the polytomy in Ericales: combining data in a Bayesian framework,"
Molecular Phylogenetics and Evolution, vol. 31, pp. 711-729, 2004.
[56] Glenner, H., Hansen, A. J., Sorensen, M. V., et al., "Bayesian inference of the
metazoan phylogeny: A combined molecular and morphological approach,"
Current Biology, vol. 14, pp. 1644-1649, 2004.
[57] Schmitt, I., Lumbsch, H. T., and Sochting, U., "Phylogeny of the lichen genus
Placopsis and its allies based on Bayesian analyses of nuclear and mitochondrial
sequences," Mycologia, vol. 95, pp. 827-835, 2003.
[58] Huelsenbeck, J. P. and Ronquist, F., "MRBAYES: Bayesian inference of
phylogenetic trees," Bioinformatics, vol. 17, pp. 754-755, 2001.
[59] Swofford, D. L., "PAUP*: Phylogenetic Analysis Using Parsimony and other
methods," 2002.
[60] Nei, M. and Kumar, S., Molecular Evolution and Phylogenetics. Oxford ; New
York: Oxford University Press, 2000.
[61] Page, R. D. M. and Holmes, E. C., Molecular Evolution: a Phylogenetic
Approach. Oxford ; Malden, MA: Blackwell Science, 1998.
[62] Felsenstein, J., Inferring Phylogenies. Sunderland, Mass.: Sinauer Associates,
2004.
[63] Swofford, D. L., "PAUP*: Phylogenetic Analysis Using Parsimony and Other
Methods," Sinauer Associates, Sunderland, MA, 2000.
[64] Felsenstein, J., "PHYLIP (Phylogeny Inference Package)," 1980.
[65] Felsenstein, J., "Confidence-Limits on Phylogenies - an Approach Using the
Bootstrap," Evolution, vol. 39, pp. 783-791, 1985.
[66] Rannala, B. and Yang, Z. H., "Probability distribution of molecular evolutionary
trees: A new method of phylogenetic inference," Journal of Molecular Evolution,
vol. 43, pp. 304-311, 1996.
[67] Sanderson, M. J., Purvis, A., and Henze, C., "Phylogenetic supertrees: assembling
the trees of life," Trends in Ecology and Evolution, vol. 13, pp. 105-109, 1998.
[68] Strimmer, K., Goldman, N., and von Haeseler, A., "Bayesian probabilities and
quartet puzzling," Molecular Biology and Evolution, vol. 14, pp. 210-211, 1997.
[69] Delsuc, F., Brinkmann, H., and Philippe, H., "Phylogenomics and the
reconstruction of the tree of life," Nature Reviews Genetics, vol. 6, pp. 361-375,
2005.
[70] Driskell, A. C., et al., "Prospects for building the tree of life from large sequence
databases," Science, vol. 306, pp. 1172-1174, 2004.
[71] Bininda-Emonds, O. R. P., Gittleman, J. L., and Steel, M. A., "The (Super)tree of
life: Procedures, problems, and prospects," Annual Review of Ecology and
Systematics, vol. 33, pp. 265-289, 2002.
[72] Gu, X. and Zhang, H., "Genome phylogenetic analysis based on extended gene
contents," Molecular Biology and Evolution, vol. 21, pp. 1401-1408, 2004.
[73] Snel, B., Bork, P., and Huynen, M. A., "Genome phylogeny based on gene
content," Nature Genet., vol. 21, pp. 108-110, 1999.
[74] Huson, D. H. and Steel, M., "Phylogenetic trees based on gene content,"
Bioinformatics, vol. 20, pp. 2044-2049, 2004.
[75] Blanchette, M., Kunisawa, T., and Sankoff, D., "Gene order breakpoint evidence
in animal mitochondrial phylogeny," Journal of Molecular Evolution, vol. 49, pp.
193-203, 1999.
[76] Sankoff, D., "Gene order comparisons for phylogenetic inference: Evolution of
the mitochondrial genome," Proc. Natl Acad. Sci. USA, vol. 89, pp. 6575-6579,
1992.
[77] Swofford, D. L., Olsen, G. J., Waddell , P. J., et al., "Phylogenetic inference," in
Molecular Systematics, Hillis, D. M., Moritz, C., and Mable, B. K., Eds., 2nd ed.
Sunderland, MA: Sinauer Associates, 1996, pp. 407-514.
[78] Felsenstein, J., "Cases in which parsimony and compatibility methods will be
positively misleading.," Syst. Zool., vol. 27, pp. 401 - 410, 1978.
[79] Douady, C. J., Delsuc, F., Boucher, Y., et al., "Comparison of Bayesian and
Maximum Likelihood Bootstrap Measures of Phylogenetic Reliability,"
Molecular Biology and Evolution, vol. 20, pp. 248-254, 2003.
[80] Sokal, R. R. and Michener, C. D., "A statistical method for evaluating systematic
relationships," University of Kansas Scientific Bulletin, vol. 28, pp. 1409-1438,
1958.
[81] Saitou, N. and Nei, M., "The neighbor-joining method: a new method for
reconstructing phylogenetic trees," Molecular Biology and Evolution, vol. 4, pp.
406-425, 1987.
[82] Salter, L. A. and Pearl, D. K., "Stochastic search strategy for estimation of
maximum likelihood phylogenetic trees," Systematic Biology, vol. 50, pp. 7-17,
2001.
[83] Barker, D., "LVB: parsimony and simulated annealing in the search for
phylogenetic trees," Bioinformatics, vol. 20, pp. 274-275, 2004.
[84] Katoh, K., Kuma, K.-I., and Miyata, T., "Genetic Algorithm-Based Maximum-
Likelihood Analysis for Molecular Phylogeny," Journal of Molecular Evolution,
vol. 53, pp. 477-484, 2001.
[85] Lemmon, A. R. and Milinkovitch, M. C., "The metapopulation genetic algorithm:
An efficient solution for the problem of large phylogeny estimation," PNAS, vol.
99, pp. 10516-10521, 2002.
[86] Lewis, P. O., "A genetic algorithm for maximum-likelihood phylogeny inference
using nucleotide sequence data," Molecular Biology and Evolution, vol. 15, pp.
277-283, 1998.
[87] Yang, Z. H. and Rannala, B., "Bayesian phylogenetic inference using DNA
sequences: A Markov Chain Monte Carlo method," Molecular Biology and
Evolution, vol. 14, pp. 717-724, 1997.
[88] Huelsenbeck, J. and Ronquist, F., "MRBAYES: Bayesian inference of
phylogenetic trees," Bioinformatics, vol. 17, pp. 754-755, 2001.
[89] Huson, D. H., Nettles, S. M., and Warnow, T. J., "Disk-covering, a fast-
converging method for phylogenetic tree reconstruction," Journal of
Computational Biology, vol. 6, pp. 369-386, 1999.
[90] Strimmer, K. and Von Haeseler, A., "Quartet puzzling: A quartet maximum-
likelihood method for reconstructing tree topologies," Molecular Biology and
Evolution, vol. 13, pp. 964-969, 1996.
[91] Stamatakis, A., Ludwig, T., and Meier, H., "RAxML-III: a fast program for
maximum likelihood-based inference of large phylogenetic trees," Bioinformatics,
vol. 21, pp. 456-463, 2005.
[92] Olsen, G., Matsuda, H., Hagstrom, R., et al., "fastDNAml: a tool for construction
of phylogenetic trees of DNA sequences using maximum likelihood," Computer
Applications in the Biosciences, vol. 10, pp. 41-48, 1994.
[93] Keane, T. M., Naughton, T. J., et al., "Distributed phylogeny reconstruction by
maximum likelihood," Bioinformatics, vol. 21, pp. 969-974, 2005.
[94] Stewart, C. A., Hart, D., Berry, D. K., et al.,
"Parallel implementation and performance of fastDNAml - a program for
maximum likelihood phylogenetic inference," presented at Supercomputing 2001,
2001.
[95] Schmidt, H. A., Strimmer, K., Vingron, M., et al., "TREE-PUZZLE: maximum
likelihood phylogenetic analysis using quartets and parallel computing,"
Bioinformatics, vol. 18, pp. 502-504, 2002.
[96] Brauer, M. J., Holder, M. T., Dries, L. A., et al., "Genetic algorithms and parallel
processing in maximum-likelihood phylogeny inference," Molecular Biology and
Evolution, vol. 19, pp. 1717-1726, 2002.
[97] Moret, B. M. E., Bader, D. A., and Warnow, T., "High-performance algorithm
engineering for computational phylogenetics," Journal of Supercomputing, vol.
22, pp. 99-110, 2002.
[98] Feng, X. Z., Buell, D. A., Rose, J. R., et al., "Parallel algorithms for Bayesian
phylogenetic inference," Journal of Parallel and Distributed Computing, vol. 63,
pp. 707-718, 2003.
[99] Altekar, G., Dwarkadas, S., Huelsenbeck, J. P., et al., "Parallel metropolis coupled
Markov chain Monte Carlo for Bayesian phylogenetic inference," Bioinformatics,
vol. 20, pp. 407-415, 2004.
[100] Felsenstein, J., "Statistical inference and the estimation of phylogenies," Ph.D.
dissertation, Chicago: University of Chicago, 1968.
[101] Li, S., "Phylogenetic tree construction using Markov chain Monte Carlo," Ph.D.
dissertation, Columbus: Ohio State University, 1996.
[102] Rannala, B. and Yang, Z., "Probability Distribution of Molecular Evolutionary
Trees: A New Method of Phylogenetic Inference," Journal of Molecular
Evolution, vol. 43, pp. 304-311, 1996.
[103] Mau, B., "Bayesian phylogenetic inference via Markov chain Monte Carlo
methods," Ph.D. dissertation, Madison: University of Wisconsin, 1996.
[104] Larget, B., Simon, D. L., and Kadane, J. B., "Bayesian phylogenetic inference
from animal mitochondrial genome arrangements," Journal of the Royal
Statistical Society Series B-Statistical Methodology, vol. 64, pp. 681-693, 2002.
[105] Thorne, J., Kishino, H., and Painter, I., "Estimating the rate of evolution of the
rate of molecular evolution," Molecular Biology and Evolution, vol. 15, pp.
1647-1657, 1998.
[106] Simon, D. L. and Larget, B., "Bayesian analysis in molecular biology and
evolution (BAMBE)," Department of Mathematics and Computer Science,
Duquesne University, Pittsburgh, 1998.
[107] Eddy, S. R., "What is Bayesian statistics?," Nature Biotechnology, vol. 22, pp. 1177-1178,
2004.
[108] Huelsenbeck, J. P. and Rannala, B., "Frequentist properties of Bayesian posterior
probabilities of phylogenetic trees under simple and complex substitution
models," Systematic Biology, vol. 53, pp. 904-913, 2004.
[109] Huelsenbeck, J. P. and Bollback, J. P., "Empirical and hierarchical Bayesian
estimation of ancestral states," Systematic Biology, vol. 50, pp. 351-366, 2001.
[110] Ronquist, F. and Huelsenbeck, J. P., "MrBayes 3: Bayesian phylogenetic
inference under mixed models," Bioinformatics, vol. 19, pp. 1572-1574, 2003.
[111] Waterman, M. S., Introduction to Computational Biology: Maps, Sequences, and
Genomes, 1st ed. London ; New York, NY: Chapman & Hall, 1995.
[112] Jukes, T. H. and Cantor, C. R., "Evolution of protein molecules," in Mammalian
Protein Metabolism, MUNRO, H. N., Ed. New York: Academic Press, 1969, pp.
21-132.
[113] Kimura, M., "A simple method for estimating evolutionary rates of base
substitutions through comparative studies of nucleotide sequences," Journal of
Molecular Evolution, vol. 16, pp. 111-120, 1980.
[114] Felsenstein, J., "Evolutionary trees from DNA sequences: a maximum likelihood
approach," Journal of Molecular Evolution, vol. 17, pp. 368-376, 1981.
[115] Hasegawa, M., Kishino, H., and Yano, T., "Dating of the human-ape splitting by a
molecular clock of mitochondrial DNA," Journal of Molecular Evolution, vol. 22,
pp. 160-174, 1985.
[116] Yang, Z., "Estimating the pattern of nucleotide substitution," Journal of
Molecular Evolution, vol. 39, pp. 105-111, 1994.
[117] Yang, Z., "PAML: a program package for phylogenetic analysis by maximum
likelihood," Computer Applications in the Biosciences, vol. 13, pp. 555-556, 1997.
[118] Huelsenbeck, J. P. and Crandall, K. A., "Phylogeny estimation and hypothesis
testing using maximum likelihood," Annual Review of Ecology and Systematics,
vol. 28, pp. 437-466, 1997.
[119] Yang, Z., "Maximum-likelihood estimation of phylogeny from DNA sequences
when substitution rates differ over sites," Molecular Biology and Evolution, vol.
10, pp. 1396-1401, 1993.
[120] Jin, L. and Nei, M., "Limitations of the evolutionary parsimony method of
phylogenetic analysis [published erratum appears in Molecular Biology and
Evolution 1990 Mar;7(2):201]," Molecular Biology and Evolution, vol. 7, pp. 82-
102, 1990.
[121] Yang, Z., "Maximum likelihood phylogenetic estimation from DNA sequences
with variable rates over sites: approximate methods.," Journal of Molecular
Evolution, vol. 39, pp. 306 - 314, 1994.
[122] Felsenstein, J. and Churchill, G., "A Hidden Markov Model approach to variation
among sites in rate of evolution," Molecular Biology and Evolution, vol. 13, pp.
93-104, 1996.
[123] Thorne, J., Kishino, H., and Felsenstein, J., "An evolutionary model for maximum
likelihood alignment of DNA sequences.," Journal of Molecular Evolution, vol.
33, pp. 114-24, 1991.
[124] Mitchison, G. J. and Durbin, R. M., "Tree-based maximal likelihood substitution
matrices and hidden Markov models," Journal of Molecular Evolution, vol. 41, pp.
1139–1151, 1995.
[125] Yang, Z. and Kumar, S., "Approximate methods for estimating the pattern of
nucleotide substitution and the variation of substitution rates among sites,"
Molecular Biology and Evolution, vol. 13, pp. 650-659, 1996.
[126] Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., et al., "Equations of state
calculations by fast computing machines," J. Chem. Phys., vol. 21, pp. 1087-1091,
1953.
[127] Hastings, W. K., "Monte Carlo sampling methods using Markov chains and their
application," Biometrika, vol. 57, pp. 97-109, 1970.
[128] Geman, S. and Geman, D., "Stochastic Relaxation, Gibbs Distributions, and the
Bayesian Restoration of Images," IEEE Transactions on Pattern Analysis and
Machine Intelligence, vol. 6, pp. 721-741, 1984.
[129] Besag, J. and Green, P. J., "Spatial Statistics and Bayesian Computation," Journal
of the Royal Statistical Society Series B-Methodological, vol. 55, pp. 25-37, 1993.
[130] Tierney, L., "Markov-Chains for Exploring Posterior Distributions," Annals of
Statistics, vol. 22, pp. 1701-1728, 1994.
[131] Tierney, L., "Markov-Chains for Exploring Posterior Distributions - Rejoinder,"
Annals of Statistics, vol. 22, pp. 1758-1762, 1994.
[132] Liu, J. S., "Monte Carlo Strategies in Scientific Computing," in Springer Series in
Statistics: Springer, 2001.
[133] Geyer, C. J., "Markov chain Monte Carlo maximum likelihood," presented at
Computing Science and Statistics: the 23rd symposium on the interface, Fairfax,
1991.
[134] Geyer, C. J. and Thompson, E. A., "Annealing Markov-Chain Monte-Carlo with
Applications to Ancestral Inference," Journal of the American Statistical
Association, vol. 90, pp. 909-920, 1995.
[135] Lewis, P., "A genetic algorithm for maximum-likelihood phylogeny inference
using nucleotide sequence data," Molecular Biology and Evolution, vol. 15, pp.
277-283, 1998.
[136] Hillis, D. M., "Inferring complex phylogenies," Nature, vol. 383, pp. 130-131,
1996.
[137] Almasi, G. S. and Gottlieb, A., Highly Parallel Computing, 2nd ed. Redwood City,
California: Benjamin/Cummings, 1989.
[138] Aggarwal, A., Chandra, A. K., and Snir, M., "Communication complexity of
PRAMs," Theoretical Computer Science, vol. 71, pp. 3-28, 1990.
[139] Valiant, L. G., "A Bridging Model for Parallel Computation," Communications of
the ACM, vol. 33, pp. 103-111, 1990.
[140] Culler, D. E., Karp, R. M., Patterson, D., et al., "LogP: a practical model of
parallel computation," Communications of the ACM, vol. 39, pp. 78-85, 1996.
[141] Cameron, K. W. and Ge, R., "Predicting and Evaluating Distributed
Communication Performance," presented at 16th High Performance Computing,
Networking and Storage Conference (SC 2004), Pittsburgh, PA, 2004.
[142] Kruskal, C. P. and Weiss, A., "Allocating Independent Subtasks on Parallel
Processors," IEEE Transactions on Software Engineering, vol. 11, pp. 1001-1016,
1985.
[143] Amdahl, G. M., "Validity of the Single Processor Approach to Achieving Large-
Scale Computing Capabilities," presented at AFIPS Spring Joint Computer
Conference, Reston, VA, 1967.
[144] Gustafson, J., "Reevaluating Amdahl's Law," Communications of the ACM, vol.
31, pp. 532-533, 1988.
[145] Penny, D., Hendy, M. D., and Steel, M. A., "Progress with Methods for
Constructing Evolutionary Trees," Trends in Ecology & Evolution, vol. 7, pp. 73-
79, 1992.
[146] Hillis, D. M. and Huelsenbeck, J. P., "Assessing Molecular Phylogenies - Reply,"
Science, vol. 267, pp. 255-256, 1995.
[147] Hillis, D. M., "Approaches for Assessing Phylogenetic Accuracy," Systematic
Biology, vol. 44, pp. 3-16, 1995.
[148] Huelsenbeck, J. P., "Performance of Phylogenetic Methods in Simulation,"
Systematic Biology, vol. 44, pp. 17-48, 1995.
[149] Guindon, S. and Gascuel, O., "A simple, fast, and accurate algorithm to estimate
large phylogenies by maximum likelihood," Systematic Biology, vol. 52, pp. 696-
704, 2003.
[150] Hillis, D. M., "Inferring complex phylogenies," Nature, vol. 383, pp. 130-131,
1996.
[151] Cole, J. R., Chai, B., Marsh, T. L., et al., "The Ribosomal Database Project (RDP-
II): previewing a new autoaligner that allows regular updates and the new
prokaryotic taxonomy," Nucl. Acids Res., vol. 31, pp. 442-443, 2003.
[152] Bruno, W. J., Socci, N. D., and Halpern, A. L., "Weighted neighbor joining: A
likelihood-based approach to distance-based phylogeny reconstruction,"
Molecular Biology and Evolution, vol. 17, pp. 189-197, 2000.
[153] Rambaut, A. and Grassly, N. C., "Seq-Gen: An application for the Monte Carlo
simulation of DNA sequence evolution along phylogenetic trees," Computer
Applications in the Biosciences, vol. 13, pp. 235-238, 1997.
[154] Robinson, D. F. and Foulds, L. R., "Comparison of phylogenetic trees,"
Mathematical Biosciences, vol. 53, pp. 131-147, 1981.
[155] Page, R. D. M., "TREEVIEW: An application to display phylogenetic trees on
personal computers," Computer Applications in the Biosciences, vol. 12, pp. 357-358,
1996.
[156] Feng, X., Buell, D. A., Rose, J. R., et al., "Parallel algorithms for Bayesian
phylogenetic inference," Journal of Parallel and Distributed Computing, vol. 63,
pp. 707-718, 2003.
[157] Efron, B., "Bootstrap methods: another look at the jackknife," Annals of Statistics,
vol. 7, pp. 1-26, 1979.
[158] Felsenstein, J., "Estimating Effective Population-Size from Samples of Sequences
- a Bootstrap Monte-Carlo Integration Method," Genetical Research, vol. 60, pp.
209-220, 1992.
[159] Felsenstein, J., "Phylogenies from Molecular Sequences - Inference and
Reliability," Annual Review of Genetics, vol. 22, pp. 521-565, 1988.
[160] Farris, J. S., Albert, V. A., Kallersjo, M., et al., "Parsimony jackknifing
outperforms neighbor-joining," Cladistics, vol. 12, pp. 99-124, 1996.
[161] Hillis, D. M. and Bull, J. J., "An Empirical-Test of Bootstrapping as a Method for
Assessing Confidence in Phylogenetic Analysis," Systematic Biology, vol. 42, pp.
182-192, 1993.
[162] Murphy, W. J., et al., "Resolution of the early placental mammal radiation using
Bayesian phylogenetics," Science, vol. 294, pp. 2348-2351, 2001.
[163] Wilcox, T. P., Zwickl, D. J., Heath, T. A., et al., "Phylogenetic relationships of
the dwarf boas and a comparison of Bayesian and bootstrap measures of
phylogenetic support," Molecular Phylogenetics and Evolution, vol. 25, pp. 361-
371, 2002.
[164] Misawa, K. and Nei, M., "Reanalysis of Murphy et al.'s Data Gives Various
Mammalian Phylogenies and Suggests Overcredibility of Bayesian Trees,"
Journal of Molecular Evolution, vol. 57, pp. S290-S296, 2003.
[165] Alfaro, M. E., Zoller, S., and Lutzoni, F., "Bayes or Bootstrap? A Simulation
Study Comparing the Performance of Bayesian Markov Chain Monte Carlo
Sampling and Bootstrapping in Assessing Phylogenetic Confidence," Molecular
Biology and Evolution, vol. 20, pp. 255-266, 2003.
[166] Hughes, A. L., Ekollu, V., Friedman, R., et al., "Gene family content-based
phylogeny of prokaryotes: The effect of criteria for inferring homology,"
Systematic Biology, vol. 54, pp. 268-276, 2005.