Inferring Phylogenies - GBV

Inferring Phylogenies. Joseph Felsenstein, University of Washington. Sinauer Associates, Inc., Publishers, Sunderland, Massachusetts. Technische Universität Darmstadt, Fachbereich 10 (Biologie), Bibliothek, Schnittspahnstraße 10, D-64287 Darmstadt. Inv.-Nr.


Contents

Preface xix

1 Parsimony methods 1
    A simple example 1
    Evaluating a particular tree 1
    Rootedness and unrootedness 4
    Methods of rooting the tree 6
    Branch lengths 8
    Unresolved questions 9

2 Counting evolutionary changes 11
    The Fitch algorithm 11
    The Sankoff algorithm 13
    Connection between the two algorithms 16
    Using the algorithms when modifying trees 16
    Views 16
    Using views when a tree is altered 17
    Further economies 18

3 How many trees are there? 19
    Rooted bifurcating trees 20
    Unrooted bifurcating trees 24
    Multifurcating trees 25
    Unrooted trees with multifurcations 28
    Tree shapes 28
    Rooted bifurcating tree shapes 29
    Rooted multifurcating tree shapes 30
    Unrooted shapes 32
    Labeled histories 35
    Perspective 36

4 Finding the best tree by heuristic search 37
    Nearest-neighbor interchanges 38
    Subtree pruning and regrafting 41
    Tree bisection and reconnection 44
    Other tree rearrangement methods 44
    Tree-fusing 44
    Genetic algorithms 44
    Tree windows and sectorial search 46
    Speeding up rearrangements 46
    Sequential addition 47
    Star decomposition 48
    Tree space 48
    Search by reweighting of characters 51
    Simulated annealing 52
    History 53

5 Finding the best tree by branch and bound 54
    A nonbiological example 54
    Finding the optimal solution 57
    NP-hardness 57
    Branch and bound methods 60
    Phylogenies: Despair and hope 60
    Branch and bound for parsimony 61
    Improving the bound 64
    Using still-absent states 64
    Using compatibility 64
    Rules limiting the search 65

6 Ancestral states and branch lengths 67
    Reconstructing ancestral states 67
    Accelerated and delayed transformation 70
    Branch lengths 70

7 Variants of parsimony 73
    Camin-Sokal parsimony 73
    Parsimony on an ordinal scale 74
    Dollo parsimony 75
    Polymorphism parsimony 76
    Unknown ancestral states 78
    Multiple states and binary coding 78
    Dollo parsimony and multiple states 80
    Polymorphism parsimony and multiple states 81
    Transformation series analysis 81
    Weighting characters 82
    Successive weighting and nonlinear weighting 83
    Successive weighting 83
    Nonsuccessive algorithms 84

8 Compatibility 87
    Testing compatibility 88
    The Pairwise Compatibility Theorem 89
    Cliques of compatible characters 91
    Finding the tree from the clique 92
    Other cases where cliques can be used 94
    Where cliques cannot be used 94
    Perfect phylogeny 95
    Using compatibility on molecules anyway 96

9 Statistical properties of parsimony 97
    Likelihood and parsimony 97
    The weights 100
    Unweighted parsimony 100
    Limitations of this justification of parsimony 101
    Farris's proofs 102
    No common mechanism 103
    Likelihood and compatibility 105
    Parsimony versus compatibility 107
    Consistency and parsimony 107
    Character patterns and parsimony 107
    Observed numbers of the patterns 110
    Observed fractions of the patterns 110
    Expected fractions of the patterns 111
    Inconsistency 113
    When inconsistency is not a problem 114
    The nucleotide sequence case 115
    Other situations where consistency is guaranteed 117
    Does a molecular clock guarantee consistency? 118
    The Farris zone 120
    Some perspective 121

10 A digression on history and philosophy 123
    How phylogeny algorithms developed 123
    Sokal and Sneath 123
    Edwards and Cavalli-Sforza 125
    Camin and Sokal and parsimony 128
    Eck and Dayhoff and molecular parsimony 130
    Fitch and Margoliash popularize distance matrix methods 131
    Wilson and Le Quesne introduce compatibility 133
    Jukes and Cantor and molecular distances 134
    Farris and Kluge and unordered parsimony 134
    Fitch and molecular parsimony 136
    Further work 136
    What about Willi Hennig and Walter Zimmermann? 136
    Different philosophical frameworks 138
    Hypothetico-deductive 138
    Logical parsimony 140
    Logical probability? 142
    Criticisms of statistical inference 143
    The irrelevance of classification 145

11 Distance matrix methods 147
    Branch lengths and times 147
    The least squares methods 148
    Least squares branch lengths 148
    Finding the least squares tree topology 153
    The statistical rationale 153
    Generalized least squares 154
    Distances 155
    The Jukes-Cantor model - an example 156
    Why correct for multiple changes? 158
    Minimum evolution 159
    Clustering algorithms 161
    UPGMA and least squares 161
    A clustering algorithm 162
    An example 162
    UPGMA on nonclocklike trees 165
    Neighbor-joining 166
    Performance 168
    Using neighbor-joining with other methods 169
    Relation of neighbor-joining to least squares 169
    Weighted versions of neighbor-joining 170
    Other approximate distance methods 171
    Distance Wagner method 171
    A related family 171
    Minimizing the maximum discrepancy 172
    Two approaches to error in trees 172
    A puzzling formula 173
    Consistency and distance methods 174
    A limitation of distance methods 175

12 Quartets of species 176
    The four point metric 177
    The split decomposition 178
    Related methods 182
    Short quartets methods 182
    The disk-covering method 183
    Challenges for the short quartets and DCM methods 185
    Three-taxon statement methods 186
    Other uses of quartets with parsimony 188
    Consensus supertrees 189
    Neighborliness 191
    De Soete's search method 192
    Quartet puzzling and searching tree space 193
    Perspective 194

13 Models of DNA evolution 196
    Kimura's two-parameter model 196
    Calculation of the distance 198
    The Tamura-Nei model, F84, and HKY 200
    The general time-reversible model 204
    Distances from the GTR model 206
    The general 12-parameter model 210
    LogDet distances 211
    Other distances 213
    Variance of distance 214
    Rate variation between sites or loci 215
    Different rates at different sites 215
    Distances with known rates 216
    Distribution of rates 216
    Gamma- and lognormally distributed rates 217
    Distances from gamma-distributed rates 217
    Models with nonindependence of sites 221

14 Models of protein evolution 222
    Amino acid models 222
    The Dayhoff model 222
    Other empirically-based models 223
    Models depending on secondary structure 225
    Codon-based models 225
    Inequality of synonymous and nonsynonymous substitutions 227
    Protein structure and correlated change 228

15 Restriction sites, RAPDs, AFLPs, and microsatellites 230
    Restriction sites 230
    Nei and Tajima's model 230
    Distances based on restriction sites 233
    Issues of ascertainment 234
    Parsimony for restriction sites 235
    Modeling restriction fragments 236
    Parsimony with restriction fragments 239
    RAPDs and AFLPs 239
    The issue of dominance 240
    Unresolved problems 240
    Microsatellite models 241
    The one-step model 241
    Microsatellite distances 242
    A Brownian motion approximation 244
    Models with constraints on array size 246
    Multi-step and heterogeneous models 246
    Snakes and Ladders 246
    Complications 247

16 Likelihood methods 248
    Maximum likelihood 248
    An example 249
    Computing the likelihood of a tree 251
    Economizing on the computation 253
    Handling ambiguity and error 255
    Unrootedness 256
    Finding the maximum likelihood tree 256
    Inferring ancestral sequences 259
    Rates varying among sites 260
    Hidden Markov models 262
    Autocorrelation of rates 264
    HMMs for other aspects of models 265
    Estimating the states 265
    Models with clocks 266
    Relaxing molecular clocks 266
    Models for relaxed clocks 267
    Covarions 268
    Empirical approaches to change of rates 269
    Are ML estimates consistent? 269
    Comparability of likelihoods 270
    A nonexistent proof? 270
    A simple proof 271
    Misbehavior with the wrong model 272
    Better behavior with the wrong model 274

17 Hadamard methods 275
    The edge length spectrum and conjugate spectrum 279
    The closest tree criterion 281
    DNA models 284
    Computational effort 285
    Extensions of Hadamard methods 286

18 Bayesian inference of phylogenies 288
    Bayes' theorem 288
    Bayesian methods for phylogenies 289
    Markov chain Monte Carlo methods 292
    The Metropolis algorithm 292
    Its equilibrium distribution 293
    Bayesian MCMC 294
    Bayesian MCMC for phylogenies 295
    Priors 295
    Proposal distributions 296
    Computing the likelihoods 298
    Summarizing the posterior 299
    Priors on trees 300
    Controversies over Bayesian inference 301
    Universality of the prior 301
    Flat priors and doubts about them 301
    Applications of Bayesian methods 304

19 Testing models, trees, and clocks 307
    Likelihood and tests 307
    Likelihood ratios near asymptopia 308
    Multiple parameters 309
    Some parameters constrained, some not 310
    Conditions 310
    Curvature or height? 311
    Interval estimates 311
    Testing assertions about parameters 311
    Coins in a barrel 313
    Evolutionary rates instead of coins 314
    Choosing among nonnested hypotheses: AIC and BIC 315
    An example using the AIC criterion 317
    The problem of multiple topologies 318
    LRTs and single branches 319
    Interior branch tests 320
    Interior branch tests using parsimony 321
    A multiple-branch counterpart of interior branch tests 322
    Testing the molecular clock 322
    Parsimony-based methods 322
    Distance-based methods 323
    Likelihood-based methods 323
    The relative rate test 324
    Simulation tests based on likelihood 328
    Further literature 329
    More exact tests and confidence intervals 329
    Tests for three species with a clock 329
    Bremer support 330
    Zander's conditional probability of reconstruction 331
    More generalized confidence sets 332

20 Bootstrap, jackknife, and permutation tests 335
    The bootstrap and the jackknife 335
    Bootstrapping and phylogenies 337
    The delete-half jackknife 339
    The bootstrap and jackknife for phylogenies 340
    The multiple-tests problem 342
    Independence of characters 342
    Identical distribution - a problem? 343
    Invariant characters and resampling methods 344
    Biases in bootstrap and jackknife probabilities 346
    P values in a simple normal case 346
    Methods of reducing the bias 349
    The drug testing analogy 352
    Alternatives to P values 355
    Probabilities of trees 356
    Using tree distances 356
    Jackknifing species 357
    Parametric bootstrapping 357
    Advantages and disadvantages of the parametric bootstrap 358
    Permutation tests 358
    Permuting species within characters 359
    Permuting characters 361
    Skewness of tree length distribution 362

21 Paired-sites tests 364
    An example 365
    Multiple trees 369
    The SH test 369
    Other multiple-comparison tests 371
    Testing other parameters 372
    Perspective 372

22 Invariants 373
    Symmetry invariants 374
    Three-species invariants 376
    Lake's linear invariants 378
    Cavender's quadratic invariants 380
    The K invariants 380
    The L invariants 381
    Generalization of Cavender's L invariants 382
    Drolet and Sankoff's k-state quadratic invariants 385
    Clock invariants 385
    General methods for finding invariants 386
    Fourier transform methods 386
    Gröbner bases and other general methods 387
    Expressions for all the 3ST invariants 387
    Finding all invariants empirically 387
    All linear invariants 388
    Special cases and extensions 389
    Invariants and evolutionary rates 389
    Testing invariants 389
    What use are invariants? 390

23 Brownian motion and gene frequencies 391
    Brownian motion 391
    Likelihood for a phylogeny 392
    What likelihood to compute? 395
    Assuming a clock 399
    The REML approach 400
    Multiple characters and Kronecker products 402
    Pruning the likelihood 404
    Maximizing the likelihood 406
    Inferring ancestral states 408
    Squared-change parsimony 409
    Gene frequencies and Brownian motion 410
    Using approximate Brownian motion 411
    Distances from gene frequencies 412
    A more exact likelihood method 413
    Gene frequency parsimony 413

24 Quantitative characters 415
    Neutral models of quantitative characters 416
    Changes due to natural selection 419
    Selective correlation 419
    Covariances of multiple characters in multiple lineages 420
    Selection for an optimum 420
    Brownian motion and selection 422
    Correcting for correlations 422
    Punctuational models 424
    Inferring phylogenies and correlations 425
    Chasing a common optimum 426
    The character-coding "problem" 426
    Continuous-character parsimony methods 428
    Manhattan metric parsimony 428
    Other parsimony methods 429
    Threshold models 429

25 Comparative methods 432
    An example with discrete states 432
    An example with continuous characters 433
    The contrasts method 435
    Correlations between characters 436
    When the tree is not completely known 437
    Inferring change in a branch 438
    Sampling error 439
    The standard regression and other variations 442
    Generalized least squares 442
    Phylogenetic autocorrelation 442
    Transformations of time 442
    Should we use the phylogeny at all? 443
    Paired-lineage tests 443
    Discrete characters 444
    Ridley's method 444
    Concentrated-changes tests 445
    A paired-lineages test 446
    Methods using likelihood 446
    Advantages of the likelihood approach 448
    Molecular applications 448

26 Coalescent trees 450
    Kingman's coalescent 454
    Bugs in a box - an analogy 460
    Effect of varying population size 460
    Migration 461
    Effect of recombination 464
    Coalescents and natural selection 467
    Neuhauser and Krone's method 468

27 Likelihood calculations on coalescents 470
    The basic equation 470
    Using accurate genealogies - a reverie 471
    Two random sampling methods 473
    A Metropolis-Hastings method 473
    Griffiths and Tavaré's method 476
    Bayesian methods 482
    MCMC for a variety of coalescent models 482
    Single-tree methods 484
    Slatkin and Maddison's method 484
    Fu's method 484
    Summary-statistic methods 485
    Watterson's method 485
    Other summary-statistic methods 486
    Testing for recombination 486

28 Coalescents and species trees 488
    Methods of inferring the species phylogeny 490
    Reconciled tree parsimony approaches 492
    Likelihood 493

29 Alignment, gene families, and genomics 496
    Alignment 497
    Why phylogenies are important 497
    Parsimony method 497
    Approximations and progressive alignment 500
    Probabilistic models 502
    Bishop and Thompson's method 502
    The minimum message length method 502
    The TKF model 503
    Multibase insertions and deletions 506
    TreeHMMs 507
    Trees 507
    Inferring the alignment 509
    Gene families 509
    Reconciled trees 509
    Reconstructing duplications 511
    Rooting unrooted trees 512
    A likelihood analysis 514
    Comparative genomics 515
    Tandemly repeated genes 515
    Inversions 516
    Inversions in trees 516
    Inversions, transpositions, and translocations 516
    Breakpoint and neighbor-coding approximations 517
    Synteny 517
    Probabilistic models 518
    Genome signature methods 519

30 Consensus trees and distances between trees 521
    Consensus trees 521
    Strict consensus 521
    Majority-rule consensus 523
    Adams consensus tree 524
    A dismaying result 525
    Consensus using branch lengths 526
    Other consensus tree methods 526
    Consensus subtrees 528
    Distances between trees 528
    The symmetric difference 528
    The quartets distance 530
    The nearest-neighbor interchange distance 530
    The path-length-difference metric 531
    Distances using branch lengths 531
    Are these distances truly distances? 533
    Consensus trees and distances 534
    Trees significantly the same? different? 534
    What do consensus trees and tree distances tell us? 535
    The total evidence debate 536
    A modest proposal 537

31 Biogeography, hosts, and parasites 539
    Component compatibility 540
    Brooks parsimony 541
    Event-based parsimony methods 543
    Relation to tree reconciliation 545
    Randomization tests 545
    Statistical inference 546

32 Phylogenies and paleontology 547
    Stratigraphic indices 548
    Stratophenetics 549
    Stratocladistics 549
    Controversies 552
    A not-quite-likelihood method 553
    Stratolikelihood 553
    Making a full likelihood method 554
    More realistic fossilization models 554
    Fossils within species: Sequential sampling 555
    Between species 557

33 Tests based on tree shape 559
    Using the topology only 559
    Imbalance at the root 560
    Harding's probabilities of tree shapes 561
    Tests from shapes 562
    Measures of overall asymmetry 563
    Choosing a powerful test 564
    Tests using times 564
    Lineage plots 565
    Likelihood formulas 567
    Other likelihood approaches 569
    Other statistical approaches 569
    A time transformation 570
    Characters and key innovations 571
    Work remaining 571

34 Drawing trees 573
    Issues in drawing rooted trees 574
    Placement of interior nodes 574
    Shapes of lineages 576
    Unrooted trees 578
    The equal-angle algorithm 578
    n-Body algorithms 580
    The equal-daylight algorithm 582
    Challenges 584

35 Phylogeny software 585
    Trees, records, and pointers 585
    Declaring records 586
    Traversing the tree 587
    Unrooted tree data structures 589
    Tree file formats 590
    Widely used phylogeny programs and packages 591

References 595

Index 644