systematics - bio 615...systematics - bio 615 4 mcmc can get stuck in suboptimal parameter space...

Systematics - Bio 615

1

Derek S. Sikes University of Alaska

Bayesian Phylogenetic Inference Outline

1.  Introduction, history

2.  Advantages over ML

3.  Bayes’ Rule

4.  The Priors

5.  Marginal vs Joint estimation

6.  MCMC

7.  Posteriors vs Bootstrap values

Bayesian Phylogenetic Inference - MCMC

Posterior Probabilities are determined by integrating the product of the prior probabilities and the model over all possible parameter values

The likelihood functions are too complex to integrate analytically

MCMC is used instead - Markov chain Monte Carlo dates from the work of Metropolis (1953) - an algorithm that does not seek the maximum likelihood parameter values

MCMC relies on marginal estimation (volume under the posterior-probability curve is estimated) rather than joint estimation (seeking the highest point)

This volume is estimated using a search procedure analogous to exploring tree-space for the optimal tree(s)

MCMC explores parameter space (not tree-space) and MCMC is not seeking just the top (optimal) portion but instead will describe the entire space


Parameter space

MCMC seeks the area under the probability surface

MCMC can start anywhere

MCMC is much like our mountaineer in the fog analogy - except the mountaineer is trying to map the entire mountain range rather than just find the tallest peak!

The MCMC algorithm follows a set of rules and takes ‘steps’ by making changes to the parameters including the topology and the branch lengths



2

If the changes improve the likelihood the ‘step’ is taken

If the changes would make the likelihood worse the step is only taken under certain conditions - when the ratio between the current likelihood and the new (worse) likelihood is larger than a number drawn at random

This allows the MCMC algorithm to traverse valleys - although it can still get “stuck” in suboptimal regions of parameter space

Bayesian Phylogenetic Inference - MCMC Bayesian Phylogenetic Inference - MCMC Steps: 100 1,000 10,000

Eventually, if run long enough, the MCMC method will approximate the posterior distribution

Flat surface

Two - peak surface

The MCMC chain visits regions of parameter space in proportion to their probabilities

Samples of trees and parameters are taken from the chain at regular intervals - typically every 100 (or rarely 1000) steps

The longer the chain is allowed to run the closer the approximation becomes

Bayesian Phylogenetic Inference - MCMC The sample includes trees with low, moderate,

and high likelihood scores - but proportionally more high scoring trees

The frequency a tree appears in the sample of a well-run MCMC search is an estimate of its posterior probability

eg a MCMC run of 2,000,000 steps (or generations) that is sampled every 100 steps will yield a final sample of 20,000 trees


These 20,000 trees can be used to build a majority rule consensus tree (after the “burn-in” trees have been discarded - see next slides)

The proportion a clade is represented in the sample is its posterior probability

The MCMC chain starts sampling in a region with very poor posterior probabilities because starting values are set arbitrarily or chosen randomly

Bayesian Phylogenetic Inference - MCMC History or trace plot plot of -lnL over 2 million MCMC generations


3

History or trace plot of -lnL over first 50,000 MCMC generations

“Burnin” - discard these samples - they do not correspond to posterior probabilities

The chain has converged or burned-in

- it has reached stationarity

All values should be estimated from post-burnin samples of the MCMC chain

Typically it is safe to use 20% as a cut-off for how many trees to discard as burnin (eg sample 5000 trees & discard the first 1000 as burnin)

However, examination of the trace plot is a more accurate method to assess convergence


Good mixing during the MCMC run

Mixing is the speed with which the chain covers the region of interest in the target distribution

Convergence within Run

Autocorrelation time

where ρk(θ) is the autocorrelation in the MCMC samples for a lag of k generations

Effective sample size (ESS)

where n is the total sample size (number of generations)

Good mixing when t is small and e large – Tracer can calculate ESS for parameters of interest from a Bayesian run

Sam

pled

val

ue

Target distribution

Too modest proposals Acceptance rate too high Poor mixing

Too bold proposals Acceptance rate too low Poor mixing

Moderately bold proposals Acceptance rate intermediate Good mixing

Convergence among Runs

  Tree topology:   Compare clade probabilities (split frequencies)   Average standard deviation of split frequencies

above some cut-off (min. 10 % in at least one run). Should go to 0 as runs converge.

  Continuous variables   Potential scale reduction factor (PSRF). Compares

variance within and between runs. Should approach 1 as runs converge.


4

MCMC can get stuck in suboptimal parameter space Best to do multiple MCMC searches and compare

Leaché, A.D. and Reeder, T.W. (2002) Molecular systematics of the Eastern Fence lizard (Sceloporus undulates): A comparison of parsimony, likelihood, and Bayesian approaches. Systematic Biology 51: 44-68.

Run ‘A’ never got unstuck…

Another solution to prevent getting stuck in suboptimal space:

Metropolis-coupled Markov chain Monte Carlo MCMCMC (or MC3)

Several chains are run simultaneously

One cold chain which is sampled and multiple (usually 3) heated chains that are not sampled but act as scouts - looking for better regions of parameter space (what does this remind you of?)


Even with MCMCMC one should still do several independent runs and compare the results

Prepare a regression plot of the posterior probabilities for each clade between different searches - should be highly correlated

Also note: The trees visited by the MCMC algorithm are highly auto-correlated - this is why we don’t sample every tree - this is why we skip lots of trees and sample

every 100 (or rarely every 1,000 or 10,000) trees

Bayesian Phylogenetic Inference - MCMC C

lade

pro

babi

lity

in a

naly

sis

2

Clade probability in analysis 1

Another issue is how many generations should one specify?

The more the better: some rules of thumb - 1. For data exploration (not publication) do as many as can

be done in 0.5-1.5 days (overnight)

2. For final results do as many as possible (at least over a weekend - 3 nights)

3. Run multiple chains and compare results:

- if they differ then none of the chains were run long enough

Bayesian Phylogenetic Inference - MCMC 1. Convergence - when the MCMC run has burned in, aka stationarity

2. Mixing - evidence the MCMC run isn’t getting stuck in one portion of parameter space (we want good mixing)

3. MCMCMC - multiple chains run simultaneously to help prevent the cold, sampled, chain from getting stuck

4.  MCMC generation count - longer is better (see rules of thumb), 0.5 million steps is a good place to start but 2 million, or more, is recommended

5.  Sampling the MCMC run - a minimum of every one tree per 100 steps

6.  Comparing different runs - plot a regression of the posterior probabilities for each branch - should be highly correlated if both runs sampled the same region of parameter space

MCMC - Summary of important issues


5

Bayesian Phylogenetic Inference Outline

1.  Introduction, history

2.  Advantages over ML

3.  Bayes’ Rule

4.  The Priors

5.  Marginal vs Joint estimation

6.  MCMC

7.  Posteriors vs Bootstrap values

Priors can matter - their influence is under investigation - “noninformative” priors may have an influence on the posteriors - huge Bayesian literature on this subject

Bayesian Phylogenetic Inference - Priors

Zwickl, D. and Holder, M. (2004) Model parameterization, prior distributions, and the general time-reversible model in bayesian phylogenetics. Syst Biol 53: 877-888.

Default model settings: MRBAYES 3.0b4 !

Parameter Options Current Setting ! ------------------------------------------------------------------ ! Tratiopr Beta/Fixed Beta(1.0,1.0)! Revmatpr Dirichlet/Fixed Dirichl(1.0,1.0,1.0,1.0,1.0,1.0)! Aamodelpr Fixed/Mixed Fixed(Poisson)! Omegapr Dirichlet/Fixed Dirichlet(1.0,1.0)! Ny98omega1pr Beta/Fixed Beta(1.0,1.0)! Ny98omega3pr Uniform/Exponential/Fixed Exponential(1.0)! M3omegapr Exponential/Fixed Exponential! Codoncatfreqs Dirichlet/Fixed Dirichlet(1.0,1.0,1.0)! Statefreqpr Dirichlet/Fixed Dirichlet! Ratepr Fixed/Variable=Dirichlet Fixed! Shapepr Uniform/Exponential/Fixed Uniform(0.1,50.0)! Ratecorrpr Uniform/Fixed Uniform(-1.0,1.0)! Pinvarpr Uniform/Fixed Uniform(0.0,1.0)! Covswitchpr Uniform/Exponential/Fixed Uniform(0.0,100.0)! Symmetricbetapr Uniform/Exponential/Fixed Fixed(Infinity)! Topologypr Uniform/Constraints Uniform! Brlenspr Unconstrained/Clock Unconstrained:Exp(10.0)! Speciationpr Uniform/Exponential/Fixed Uniform(0.0,10.0)! Extinctionpr Uniform/Exponential/Fixed Uniform(0.0,10.0)! Sampleprob <number> 1.00! Thetapr Uniform/Exponential/Fixed Uniform(0.0,10.0)! ------------------------------------------------------------------ !

Bayesian Phylogenetic Inference - Priors

Recall that PP are easily interpreted as:

The probability the branch is true given the model, the priors, the branch lengths and the data

Bootstrap values are less easily interpreted…

And using Maximum Likelihood they are difficult to obtain for large datasets (if a good ML search takes you 1 day, bootstrapping will take you 100 days - or you’ll need a parallel processor with 100 CPUs - a “super cluster”)

Posterior Probabilities and Bootstrap Values

Huelsenbeck & Rannala (2004) list 3 common interpretations

1. Probability that a clade is correct (accuracy)

2. Robustness of the results to perturbation (repeatability / precision)

3. Probability of incorrectly rejecting a hypothesis of monophyly (1-P) : probability of getting that much evidence if, in fact, the group did not exist

Huelsenbeck, J.P. and Rannala, B. (2004) Frequentist properties of Bayesian posterior probabilities of phylogenetic trees under simple and complex substitution models. Systematic Biology 53: 904-913.

Bootstrap - interpretation

It was noticed repeatedly that Posterior Probabilities tend to be higher than Bootstrap Values

It was also noticed that there was no clear /strong relationship between the two values - correlations were weak (but strongest between ML-BS & BI-PP)

The higher PP expected from theory - we knew that BS are biased downward - perhaps PP are less biased and thus more likely to correspond to true measures of accuracy?



6

Leaché, A.D. and Reeder, T.W. (2002) Molecular systematics of the Eastern Fence lizard (Sceloporus undulates): A comparison of parsimony, likelihood, and Bayesian approaches. Systematic Biology 51: 44-68.

PP values of 100% correspond to BS values from 50% - 100%

“We show that Bayesian posterior probabilities are significantly higher than corresponding nonparametric bootstrap frequencies for true clades, but also that erroneous conclusions will be made more often. These errors are strongly accentuated when the models used for analyses are underparameterized. When data are analyzed under the correct model, nonparametric bootstrapping is conservative. Bayesian posterior probabilities are also conservative in this respect, but less so.”

Erixon, P., Svennblad, B., Britton, T. and Oxelman, B. (2003) Reliability of Bayesian posterior probabilities and bootstrap frequencies in phylogenetics. Syst Biol 52: 665-673.




84% BS = 95% P

91% PP =95% P



Simulations indicate

But both subject to error:

Errors: Type I - failure to reject a false hypothesis

Type II - rejection of a true hypothesis



Bootstraps less likely to make Error I (but more likely to make Error II) - more likely to reject a true hypothesis - “over-rejects” = conservative

Posterior Probabilities less likely to make Error II (but more likely to make Error I) - more likely to accept a false hypothesis - “under-rejects”??? (no! still conservative)

Huelsenbeck, J. and Rannala, B. (2004) Frequentist properties of bayesian posterior probabilities of phylogenetic trees under simple and complex substitution models. Syst Biol 53: 904-913.


Huelsenbeck & Rannala (2004) indicate that past simulations of data to test Bayesian methods have not been done properly

-  a proper simulation study compares data that agrees with the assumptions of the model and contrasts this “best case scenario” with data that violate one or more assumptions

-  These authors simulated best-case data for Bayesian methods by using the priors to generate the trees and parameter space in which to simulate data


7



Huelsenbeck & Rannala (2004) wanted to answer the question

What does the posterior probability of a phylogenetic tree mean?

Best case scenarios - the data fit the model perfectly

Data evolved using GTR+G but analyzed with too simple model

When the model is too simple (underfit) the PP are too high - error of accepting false hypotheses

Even JC69 does better than GTR if ASRV is accommodated

Data evolved using JC69 but analyzed with too complex model When the model is too complex (overfit) the PP are only slightly too low - error of rejecting true hypotheses

Thus - better to be too complex than too simple - use at least GTR+G to be safe


Credible sets of trees

Bayesian Inference provides a collection of trees assembled by their probabilities to total 95%

If the assumptions of the model are not violated this set has a ~ 95% probability of containing the true tree



The answer to their question:

What does the posterior probability of a phylogenetic tree mean?

“The posterior probability of a phylogenetic tree is the probability that the tree is correct, assuming that the model is correct. On the other hand, all bets are off when the assumptions of a Bayesian analysis are not satisfied,…” [italics added]



8

What about a comparison with bootstrap values?

PP far more biased due to underfitting than BS

Underfit model Correct model Posterior Probabilities and Bootstrap Values

Although Posterior Probabilities correctly measure the probability of a clade being true when the model isn’t violated…

The Bayesian method appears to be more sensitive to model violation than ML bootstrapping

Recommendation: use the most complex model available: GTR+G for Bayesian Inference (or GTR+I+G)



Priors - noninformative vs informative

example: 3 urns urn 1 - full of coins, 80% are biased coins urn 2 - full of coins, 10% are biased coins urn 3 - empty and always has been

You are given a coin and asked to come up with a prior probability for it being from urn 1, urn 2, or urn 3

Obviously the prior for urn 3 is zero! You might assign a P = 50% for urn 1 and urn 2, then flip your coin (gather data) & use Bayes’ Rule to obtain a posterior probability - see how the data change your prior (modify your beliefs in light of new evidence)

Posterior Probabilities and Bootstrap Values Recent study (Pickett &

Randle 2005) which indicates a disturbing but potentially very interesting problem

Clade priors appear to be correlated with their size - smallest and largest clades have the highest prior probability - midsized clades have the lowest

Picket, K.M. and Randle, C.P. (2005) Strange bayes indeed: uniform topological priors imply non-uniform clade priors. Molecular Phylogenetics and Evolution 34: 203-211.


Example - 5 taxa (A-E) yield 105 bifurcating rooted trees

A clade of 3 taxa (A,B,C) is in only 9 of the 105 trees

A clade of 2 taxa (A,B) is in 15 of the 105 trees

Prior to seeing any data we would expect a clade of (A,B) to be 1.7 times as likely as a clade of (A,B,C)

Although the trees are all equiprobable a priori, all the clades aren’t


Low BS values for mid-sized clades - an artifact? Saturation?


9



All branch support measures are significantly correlated with clade priors

Brandley et al. (2006) found the answer -

Clade priors are only a problem if you have very few data

-which is obviously more of a problem for morphological data!

Brandley, M.C., Leaché, A.D., Warren, D.L. and McGuire, J.A. (2006). Are unequal clade priors problematic for Bayesian phylogenetics? Systematic Biology 55: 138-146.

MCMC parameter space steps / generations convergence stationarity burnin history or generation plot mixing MCMCMC cold and heated chains Type I error Type II error clade priors Bayes factors

Terms - from lecture & readings What is the major difference between what the MCMC algorithm is doing and what a traditional tree searching (eg ML) algorithm does?"

Why should one run MCMC searches for a long long time? "

What can be done to help prevent getting stuck in suboptimal space?"

Why do we throw out 20% or more of the initial samples taken from a MCMC search?"

Why do we sample only every 100th tree (or 1 every 1000, or 10000) from a MCMC run?"

What are three common interpretations of bootstrap values? What is the interpretation of a posterior probability?"

Study questions

Why would we expect PP to be more strongly correlated with ML-BS values than with MP-BS values?"

Erixon et al (2003) made some important statements about PP and BS - what were two of their key findings?"

One of Erixon et alʼs findings was studied in depth by Huelsenbeck & Rannala (2005) -which finding was this and what were the conclusions and recommendations from this study?"

What question did Huelsenbeck & Rannala (2005) set out to answer and what was the answer?"

Picket & Randle (2005) noted an interesting and (if true) disturbing point regarding all branch support measures - what was it? They provided no correction for BS & Jackknifing but what correction did they suggest for PP?"

Study questions

systematics - bio 615...systematics - bio 615 4 mcmc can get stuck in suboptimal parameter space...

Documents