Upload: merry-harmon

Post on 20-Jan-2016

Page 1: Lab3: Bayesian phylogenetic Inference and MCMC Department of Bioinformatics & Biostatistics, SJTU

Lab3: Bayesian phylogenetic Inference and MCMC

Department of Bioinformatics & Biostatistics, SJTU

Page 2: Lab3: Bayesian phylogenetic Inference and MCMC Department of Bioinformatics & Biostatistics, SJTU

Topics

Phylogenetics
Bayesian inference and MCMC: overview
Bayesian model testing
MrBayes tutorial and application
– Nexus file
– Configuration of the process
– How to execute the process
– Analyzing the results

Page 3: Lab3: Bayesian phylogenetic Inference and MCMC Department of Bioinformatics & Biostatistics, SJTU

Phylogenetics

Greek: phylon (tribe, race) + genesis (origin)

Broad definition: the history of how species arise, evolve, and go extinct

Narrow definition: inferring the relationships among extant (living) species

In this lab we use the narrow definition

Page 4: Lab3: Bayesian phylogenetic Inference and MCMC Department of Bioinformatics & Biostatistics, SJTU

Infer relationships among three species:

Outgroup:

Page 5: Lab3: Bayesian phylogenetic Inference and MCMC Department of Bioinformatics & Biostatistics, SJTU

Three possible trees (topologies):

[Figure: the three candidate trees, labeled A, B, and C]
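The count of three is an instance of a general formula: n taxa admit (2n-3)!! rooted bifurcating topologies and (2n-5)!! unrooted ones. The sketch below uses hypothetical function names that are not part of the lab. It treats the three species as a rooted tree, or equivalently as an unrooted tree of four taxa once the outgroup is included.

```python
def double_factorial(k: int) -> int:
    """Return k * (k - 2) * (k - 4) * ... down to 1 or 2 (1 if k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def n_rooted_topologies(n_taxa: int) -> int:
    """Number of rooted, bifurcating, labeled topologies: (2n - 3)!!."""
    if n_taxa < 2:
        raise ValueError("need at least 2 taxa")
    return double_factorial(2 * n_taxa - 3)


def n_unrooted_topologies(n_taxa: int) -> int:
    """Number of unrooted, bifurcating, labeled topologies: (2n - 5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    return double_factorial(2 * n_taxa - 5)


if __name__ == "__main__":
    for n in (3, 4, 10):
        print(n, n_rooted_topologies(n), n_unrooted_topologies(n))
```

The rapid growth of these counts is why the posterior over trees cannot be enumerated for realistic data sets.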

Page 6: Lab3: Bayesian phylogenetic Inference and MCMC Department of Bioinformatics & Biostatistics, SJTU

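[Figure: the prior distribution over trees A, B, and C (probability axis up to 1.0) is combined with the data (observations) to give the posterior distribution over the same trees (probability axis up to 1.0)]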

Page 7: Lab3: Bayesian phylogenetic Inference and MCMC Department of Bioinformatics & Biostatistics, SJTU

What is needed for inference?

A probabilistic model of evolution
Prior distribution on the parameters of the model
Data
A method for calculating the posterior distribution for the model, prior distribution and data

Page 9: Lab3: Bayesian phylogenetic Inference and MCMC Department of Bioinformatics & Biostatistics, SJTU

Model: topology + branch lengths

Parameters

topology ($\tau$)
branch lengths ($v_i$) (expected amount of change)

$\theta = (\tau, v)$

[Figure: unrooted tree with tips A, B, C, D and branch lengths $v_1, \dots, v_5$]

Page 10: Lab3: Bayesian phylogenetic Inference and MCMC Department of Bioinformatics & Biostatistics, SJTU

Model: molecular evolution

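Parameters

instantaneous rate matrix (Jukes-Cantor)

$$
Q = \begin{array}{c|cccc}
    & [A] & [C] & [G] & [T] \\ \hline
[A] & -   & 1   & 1   & 1   \\
[C] & 1   & -   & 1   & 1   \\
[G] & 1   & 1   & -   & 1   \\
[T] & 1   & 1   & 1   & -
\end{array}
$$

(each diagonal entry, shown as $-$, equals minus the sum of the other entries in its row, so every row sums to zero)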
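To connect the rate matrix to branch lengths, the sketch below computes the transition probabilities P(v) = exp(Qv) for the Jukes-Cantor model and checks them against the closed-form solution. It assumes Q is rescaled (off-diagonals 1/3, diagonal -1) so that a branch length v is the expected number of substitutions per site. The truncated Taylor series and all function names are illustrative choices, not the lab's method.

```python
import math


def jc_rate_matrix():
    """Jukes-Cantor Q, scaled to one expected substitution per unit length."""
    return [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]


def mat_mul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def mat_exp(q, v, terms=40):
    """exp(Q * v) by a truncated Taylor series (fine for short branches)."""
    qv = [[q[i][j] * v for j in range(4)] for i in range(4)]
    result = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    term = [row[:] for row in result]
    for n in range(1, terms):
        # term_n = term_(n-1) * (Q v) / n
        term = [[x / n for x in row] for row in mat_mul(term, qv)]
        result = [[result[i][j] + term[i][j] for j in range(4)]
                  for i in range(4)]
    return result


def jc_closed_form(v):
    """Closed-form JC probabilities: (no change, change to one specific base)."""
    decay = math.exp(-4.0 * v / 3.0)
    return 0.25 + 0.75 * decay, 0.25 - 0.25 * decay


if __name__ == "__main__":
    p = mat_exp(jc_rate_matrix(), 0.1)
    print(p[0][0], p[0][1], jc_closed_form(0.1))
```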

Page 12: Lab3: Bayesian phylogenetic Inference and MCMC Department of Bioinformatics & Biostatistics, SJTU

Priors on parameters

Topology
– All unique topologies have equal probabilities
Branch lengths
– Exponential prior puts more weight on small branch lengths; approximately uniform on transition probabilities

Page 14: Lab3: Bayesian phylogenetic Inference and MCMC Department of Bioinformatics & Biostatistics, SJTU

Data

$X$: the data (alignment)

Taxon   Characters
A       ACG TTA TTA AAT TGT CCT CTT TTC AGA
B       ACG TGT TTC GAT CGT CCT CTT TTC AGA
C       ACG TGT TTA GAC CGA CCT CGG TTA AGG
D       ACA GGA TTA GAT CGT CCG CTT TTC AGA

Page 16: Lab3: Bayesian phylogenetic Inference and MCMC Department of Bioinformatics & Biostatistics, SJTU

Bayes’ Theorem

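$$
f(\theta \mid X) = \frac{p(\theta)\, l(X \mid \theta)}{\int p(\theta)\, l(X \mid \theta)\, d\theta}
$$

$f(\theta \mid X)$: posterior distribution
$p(\theta)$: prior distribution
$l(X \mid \theta)$: likelihood function
$\int p(\theta)\, l(X \mid \theta)\, d\theta$: normalizing constant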

Page 17: Lab3: Bayesian phylogenetic Inference and MCMC Department of Bioinformatics & Biostatistics, SJTU

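[Figure: posterior probability distribution $f(\theta \mid X)$. Vertical axis: posterior probability. Horizontal axis: parameter space (high-dimensional, drawn in one dimension), divided into regions for tree 1, tree 2, and tree 3]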

Page 18: Lab3: Bayesian phylogenetic Inference and MCMC Department of Bioinformatics & Biostatistics, SJTU

tree 1 tree 2 tree 3

20% 48% 32%

We can focus on any parameter of interest (there are no nuisance parameters) by marginalizing the posterior over the other parameters (integrating out the uncertainty in the other parameters)

(Percentages denote marginal probability distribution on trees)

Page 19: Lab3: Bayesian phylogenetic Inference and MCMC Department of Bioinformatics & Biostatistics, SJTU

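Marginal probabilities

Joint probabilities (rows: branch length vectors, columns: trees) with marginal probabilities in the last row and column:

                          tree 1   tree 2   tree 3   marginal
branch length vector 1     0.10     0.07     0.12     0.29
branch length vector 2     0.05     0.22     0.06     0.33
branch length vector 3     0.05     0.19     0.14     0.38
marginal                   0.20     0.48     0.32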
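Marginalizing a discrete joint distribution is just summing over the dimension you do not care about. The sketch below redoes the arithmetic for the joint probabilities on this slide (trees by branch length vectors), assuming the rows are read in the order 1, 2, 3. The dictionary layout and function names are illustrative only.

```python
# Joint probabilities from the slide: JOINT[branch_length_vector][tree].
JOINT = {
    "v1": {"tree 1": 0.10, "tree 2": 0.07, "tree 3": 0.12},
    "v2": {"tree 1": 0.05, "tree 2": 0.22, "tree 3": 0.06},
    "v3": {"tree 1": 0.05, "tree 2": 0.19, "tree 3": 0.14},
}


def marginal_over_branch_lengths(joint):
    """Sum out the branch length vectors, leaving a distribution over trees."""
    trees = {}
    for row in joint.values():
        for tree, prob in row.items():
            trees[tree] = trees.get(tree, 0.0) + prob
    return trees


def marginal_over_trees(joint):
    """Sum out the trees, leaving a distribution over branch length vectors."""
    return {vector: sum(row.values()) for vector, row in joint.items()}


if __name__ == "__main__":
    for name, prob in marginal_over_branch_lengths(JOINT).items():
        print(name, round(prob, 2))
    for name, prob in marginal_over_trees(JOINT).items():
        print(name, round(prob, 2))
```

In an MCMC analysis the same marginalization happens implicitly: counting how often each tree appears in the sample ignores the branch lengths that were sampled with it.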

Page 20: Lab3: Bayesian phylogenetic Inference and MCMC Department of Bioinformatics & Biostatistics, SJTU

How to estimate the posterior?

Analytical calculation? Impossible, except for very simple examples

Random sampling of parameter space? Also impossible: computationally infeasible

Dependent sampling using the MCMC technique? Yes, you got it!

Page 21: Lab3: Bayesian phylogenetic Inference and MCMC Department of Bioinformatics & Biostatistics, SJTU

Metropolis-Hastings Sampling

Assume that the current state has parameter values $\theta$

Consider a move to a state with parameter values $\theta^*$ according to proposal density $q$

Accept the move with probability

$$
r = \min\left(1,\; \frac{p(\theta^*)}{p(\theta)} \times \frac{l(X \mid \theta^*)}{l(X \mid \theta)} \times \frac{q(\theta \mid \theta^*)}{q(\theta^* \mid \theta)}\right)
$$

(prior ratio × likelihood ratio × proposal ratio)
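A minimal sketch of this acceptance rule on a one-parameter toy problem: the branch length v separating taxa A and B of the example alignment, under the Jukes-Cantor model with an exponential prior of rate 10. The sliding-window proposal reflected at zero, the window width, the starting value, and every function name are assumptions made for illustration. MrBayes uses its own set of proposals.

```python
import math
import random

SEQ_A = "ACG TTA TTA AAT TGT CCT CTT TTC AGA".replace(" ", "")
SEQ_B = "ACG TGT TTC GAT CGT CCT CTT TTC AGA".replace(" ", "")
N_SITES = len(SEQ_A)
N_DIFF = sum(a != b for a, b in zip(SEQ_A, SEQ_B))


def log_prior(v, rate=10.0):
    """Log density of the exponential prior on the branch length."""
    return math.log(rate) - rate * v


def log_likelihood(v):
    """Jukes-Cantor log likelihood of the two-sequence alignment."""
    decay = math.exp(-4.0 * v / 3.0)
    p_same = 0.25 * (0.25 + 0.75 * decay)   # site pattern with identical bases
    p_diff = 0.25 * (0.25 - 0.25 * decay)   # site pattern with different bases
    if v <= 0.0 or p_diff <= 0.0:
        return float("-inf")
    return (N_SITES - N_DIFF) * math.log(p_same) + N_DIFF * math.log(p_diff)


def metropolis_hastings(n_gen, window=0.1, start=0.1, seed=1):
    """Return the sampled branch lengths and the acceptance rate."""
    rng = random.Random(seed)
    v, samples, accepted = start, [], 0
    for _ in range(n_gen):
        # Sliding window reflected at 0: symmetric, so the proposal ratio is 1.
        v_star = abs(v + rng.uniform(-window, window))
        log_prior_ratio = log_prior(v_star) - log_prior(v)
        log_lik_ratio = log_likelihood(v_star) - log_likelihood(v)
        log_proposal_ratio = 0.0
        log_r = min(0.0, log_prior_ratio + log_lik_ratio + log_proposal_ratio)
        if rng.random() < math.exp(log_r):
            v, accepted = v_star, accepted + 1
        samples.append(v)
    return samples, accepted / n_gen


if __name__ == "__main__":
    samples, rate = metropolis_hastings(20000)
    kept = samples[2000:]
    print("differences:", N_DIFF, "of", N_SITES)
    print("posterior mean:", sum(kept) / len(kept), "acceptance rate:", rate)
```

Working with logarithms avoids numerical underflow, which matters once the likelihood is a product over hundreds of sites.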

Page 22: Lab3: Bayesian phylogenetic Inference and MCMC Department of Bioinformatics & Biostatistics, SJTU

Sampling Principles

For a complex model, you typically have many “proposal” or “update” mechanisms (“moves”)

Each mechanism changes one or a few parameters

At each step (generation of the chain) one mechanism is chosen randomly according to some predetermined probability distribution

It makes sense to try changing ‘more difficult’ parameters (such as topology in a phylogenetic analysis) more often
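Choosing a move at each generation can be sketched as a weighted random draw. The move names and weights below are hypothetical and are not MrBayes defaults; they only illustrate giving the topology move a larger share of the attempts.

```python
import random
from collections import Counter

# Hypothetical relative weights: topology is proposed most often.
MOVE_WEIGHTS = {
    "topology": 5.0,
    "branch_length": 3.0,
    "substitution_rates": 1.0,
    "gamma_shape": 1.0,
}


def choose_move(rng, weights=MOVE_WEIGHTS):
    """Pick one move name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


if __name__ == "__main__":
    rng = random.Random(42)
    counts = Counter(choose_move(rng) for _ in range(10000))
    print(counts.most_common())
```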

Page 23: Lab3: Bayesian phylogenetic Inference and MCMC Department of Bioinformatics & Biostatistics, SJTU

Application example

Analysis of 85 insect taxa based on 18S rDNA

Page 24: Lab3: Bayesian phylogenetic Inference and MCMC Department of Bioinformatics & Biostatistics, SJTU

Model parameters 1

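General Time Reversible (GTR) substitution model

$$
Q = \begin{pmatrix}
-            & \pi_C r_{AC} & \pi_G r_{AG} & \pi_T r_{AT} \\
\pi_A r_{AC} & -            & \pi_G r_{CG} & \pi_T r_{CT} \\
\pi_A r_{AG} & \pi_C r_{CG} & -            & \pi_T r_{GT} \\
\pi_A r_{AT} & \pi_C r_{CT} & \pi_G r_{GT} & -
\end{pmatrix}
$$

[Figure: tree with tips A, B, C, D and branch lengths $v_1, \dots, v_5$, annotated "topology" and "branch lengths", with a parameter list indexed $1, 2, \dots, n$]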

Page 25: Lab3: Bayesian phylogenetic Inference and MCMC Department of Bioinformatics & Biostatistics, SJTU

Model parameters 2

Gamma-shaped rate variation across sites

Page 26: Lab3: Bayesian phylogenetic Inference and MCMC Department of Bioinformatics & Biostatistics, SJTU

Priors on parameters

Topology
– all unique topologies have equal probability
Branch lengths
– exponential prior (exp(10) has rate 10, so the prior mean is 1/10 = 0.1)
State frequencies $(\pi_A, \pi_C, \pi_G, \pi_T)$
– Dirichlet prior: Dir(1,1,1,1)
Rates (revmat) $(r_{AC}, r_{AG}, r_{AT}, r_{CG}, r_{CT}, r_{GT})$
– Dirichlet prior: Dir(1,1,1,1,1,1)
Shape of gamma-distribution of rates
– Uniform prior: Uni(0,100)
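To get a feel for these priors, the sketch below draws from each of them with the Python standard library. A Dirichlet draw is obtained by normalizing independent gamma variates. The dictionary keys and function names are illustrative, and the topology prior (uniform over topologies) is left out.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw from Dir(alphas) by normalizing independent Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def sample_priors(rng):
    """One joint draw of the continuous parameters from their priors."""
    return {
        "branch_length": rng.expovariate(10.0),           # exp(10), mean 0.1
        "state_freqs": sample_dirichlet([1.0] * 4, rng),  # Dir(1,1,1,1)
        "revmat": sample_dirichlet([1.0] * 6, rng),       # Dir(1,1,1,1,1,1)
        "gamma_shape": rng.uniform(0.0, 100.0),           # Uni(0,100)
    }


if __name__ == "__main__":
    rng = random.Random(7)
    print(sample_priors(rng))
    lengths = [rng.expovariate(10.0) for _ in range(20000)]
    print("mean branch length under the prior:", sum(lengths) / len(lengths))
```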

Page 28: Lab3: Bayesian phylogenetic Inference and MCMC Department of Bioinformatics & Biostatistics, SJTU

burn-in

stationary phase sampled with thinning (rapid mixing essential)
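Discarding the burn-in and thinning the rest amounts to a single slice over the recorded chain. The burn-in fraction and thinning interval below are arbitrary illustrative values, not recommendations.

```python
def discard_burnin_and_thin(chain, burnin_fraction=0.25, thin=10):
    """Drop the first burnin_fraction of the chain, then keep every thin-th state."""
    start = int(len(chain) * burnin_fraction)
    return chain[start::thin]


if __name__ == "__main__":
    chain = list(range(100))  # stand-in for 100 recorded generations
    print(discard_burnin_and_thin(chain))
```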

Page 29: Lab3: Bayesian phylogenetic Inference and MCMC Department of Bioinformatics & Biostatistics, SJTU

Majority rule consensus tree from sampled trees

Frequencies represent the posterior probability of the clades

Probability of clade being true given data, model, and prior (and given that the MCMC sample is OK)
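The clade frequencies can be sketched as simple counting. Here each sampled tree is represented, as a simplifying assumption, by the set of clades it contains, and the ten-tree sample is made up for illustration.

```python
from collections import Counter


def clade_frequencies(sampled_trees):
    """Fraction of sampled trees that contain each clade."""
    counts = Counter()
    for clades in sampled_trees:
        counts.update(set(clades))
    return {clade: n / len(sampled_trees) for clade, n in counts.items()}


def majority_rule_clades(sampled_trees, threshold=0.5):
    """Clades present in more than `threshold` of the sampled trees."""
    freqs = clade_frequencies(sampled_trees)
    return {clade: f for clade, f in freqs.items() if f > threshold}


# Hypothetical post-burn-in sample: 6 trees group A+B, 3 group A+C, 1 groups B+C.
SAMPLE = (
    [{frozenset("AB")}] * 6
    + [{frozenset("AC")}] * 3
    + [{frozenset("BC")}] * 1
)

if __name__ == "__main__":
    for clade, freq in majority_rule_clades(SAMPLE).items():
        print(sorted(clade), freq)
```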

Page 30: Lab3: Bayesian phylogenetic Inference and MCMC Department of Bioinformatics & Biostatistics, SJTU

Mean and 95% credibility interval for model parameters
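Such a summary can be computed directly from the post-burn-in samples of a parameter. The sketch below assumes an equal-tailed interval taken from sorted samples by index rounding. MrBayes may compute its intervals differently.

```python
import statistics


def mean_and_interval(samples, mass=0.95):
    """Return (mean, lower, upper) with an equal-tailed credibility interval."""
    ordered = sorted(samples)
    tail = (1.0 - mass) / 2.0
    last = len(ordered) - 1
    lower = ordered[round(tail * last)]
    upper = ordered[round((1.0 - tail) * last)]
    return statistics.fmean(ordered), lower, upper


if __name__ == "__main__":
    samples = list(range(1001))  # stand-in for sampled parameter values
    print(mean_and_interval(samples))
```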

Page 31: Lab3: Bayesian phylogenetic Inference and MCMC Department of Bioinformatics & Biostatistics, SJTU

MrBayes tutorial

Introduction/examples

Page 32: Lab3: Bayesian phylogenetic Inference and MCMC Department of Bioinformatics & Biostatistics, SJTU

Nexus format input file

Input: Nexus format or, more accurately, Nexus(ish)
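As an illustration of the input format, the sketch below writes the four-taxon example alignment from the Data slide as a Nexus data block. The function name is hypothetical, and the block uses only the basic `dimensions`, `format`, and `matrix` commands.

```python
ALIGNMENT = {
    "A": "ACG TTA TTA AAT TGT CCT CTT TTC AGA",
    "B": "ACG TGT TTC GAT CGT CCT CTT TTC AGA",
    "C": "ACG TGT TTA GAC CGA CCT CGG TTA AGG",
    "D": "ACA GGA TTA GAT CGT CCG CTT TTC AGA",
}


def to_nexus(alignment):
    """Return the alignment as the text of a minimal Nexus data block."""
    seqs = {name: seq.replace(" ", "") for name, seq in alignment.items()}
    lengths = {len(seq) for seq in seqs.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    lines = [
        "#NEXUS",
        "",
        "begin data;",
        f"  dimensions ntax={len(seqs)} nchar={lengths.pop()};",
        "  format datatype=dna missing=? gap=-;",
        "  matrix",
    ]
    lines += [f"    {name:<10}{seq}" for name, seq in seqs.items()]
    lines += ["  ;", "end;"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_nexus(ALIGNMENT))
```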

Page 36: Lab3: Bayesian phylogenetic Inference and MCMC Department of Bioinformatics & Biostatistics, SJTU

Running MrBayes

① Use execute to bring data in a Nexus file into MrBayes

② Set the model and priors using lset and prset

③ Run the chain using mcmc

④ Summarize the parameter samples using sump

⑤ Summarize the tree samples using sumt

Note that MrBayes 3.1 runs two independent analyses by default
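The five steps can also be collected in a batch file. The sketch below builds such a `mrbayes` block as text. The file name `lab3.nex`, the run length, and the burn-in are placeholders, and the model settings (`nst=6 rates=gamma`, an exp(10) branch length prior) merely mirror the application example. Check every option against the MrBayes manual for your version before running it.

```python
def mrbayes_batch(nexus_file, ngen=10000, samplefreq=10, burnin=250):
    """Return the text of a Nexus file holding one mrbayes command block."""
    commands = [
        "set autoclose=yes nowarn=yes",            # do not prompt in batch mode
        f"execute {nexus_file}",                   # step 1: read the data
        "lset nst=6 rates=gamma",                  # step 2: GTR + gamma model
        "prset brlenspr=unconstrained:exp(10.0)",  # step 2: branch length prior
        f"mcmc ngen={ngen} samplefreq={samplefreq}",  # step 3: run the chain
        f"sump burnin={burnin}",                   # step 4: parameter summary
        f"sumt burnin={burnin}",                   # step 5: tree summary
    ]
    body = "\n".join(f"  {command};" for command in commands)
    return f"#NEXUS\n\nbegin mrbayes;\n{body}\nend;\n"


if __name__ == "__main__":
    print(mrbayes_batch("lab3.nex"))
```

With these placeholder values the run records 1000 samples. In MrBayes 3.1 the `burnin` option counts samples rather than generations, so 250 discards the first quarter.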

Page 37: Lab3: Bayesian phylogenetic Inference and MCMC Department of Bioinformatics & Biostatistics, SJTU

Convergence Diagnostics

By default performs two independent analyses starting from different random trees (mcmc nruns=2)

Average standard deviation of clade frequencies calculated and presented during the run (mcmc mcmcdiagn=yes diagnfreq=1000) and written to file (.mcmc)

Standard deviation of each clade frequency and potential scale reduction for branch lengths calculated with sumt

Potential scale reduction calculated for all substitution model parameters with sump
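The first diagnostic can be sketched as follows: for every clade seen in any run, take the standard deviation of its frequency across runs, then average over clades. The use of the sample standard deviation, the inclusion of every observed clade, and the made-up frequencies are all assumptions. MrBayes may differ in these details.

```python
import math
import statistics


def average_sd_of_clade_frequencies(run_frequencies):
    """Average, over clades, of the across-run SD of the clade frequency."""
    clades = set()
    for freqs in run_frequencies:
        clades.update(freqs)
    sds = [
        statistics.stdev([freqs.get(clade, 0.0) for freqs in run_frequencies])
        for clade in clades
    ]
    return sum(sds) / len(sds)


# Hypothetical clade frequencies from two independent runs.
RUN_1 = {"A+B": 0.60, "C+D": 0.90}
RUN_2 = {"A+B": 0.64, "C+D": 0.90, "A+C": 0.02}

if __name__ == "__main__":
    print(average_sd_of_clade_frequencies([RUN_1, RUN_2]))
```

The value shrinks toward zero as independent runs converge on the same clade frequencies.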

Page 38: Lab3: Bayesian phylogenetic Inference and MCMC Department of Bioinformatics & Biostatistics, SJTU

Bayes’ theorem

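$$
f(\theta \mid X) = \frac{f(\theta)\, f(X \mid \theta)}{\int f(\theta)\, f(X \mid \theta)\, d\theta} = \frac{f(\theta)\, f(X \mid \theta)}{f(X)}
$$

$f(X)$: marginal likelihood (of the model)

We have implicitly conditioned on a model:

$$
f(\theta \mid X, M) = \frac{f(\theta \mid M)\, f(X \mid \theta, M)}{f(X \mid M)}
$$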

Page 39: Lab3: Bayesian phylogenetic Inference and MCMC Department of Bioinformatics & Biostatistics, SJTU

Bayesian Model Choice

Posterior model odds:

$$
\frac{f(M_1)\, f(X \mid M_1)}{f(M_0)\, f(X \mid M_0)}
$$

Bayes factor:

$$
B_{10} = \frac{f(X \mid M_1)}{f(X \mid M_0)}
$$

Page 40: Lab3: Bayesian phylogenetic Inference and MCMC Department of Bioinformatics & Biostatistics, SJTU

Bayesian Model Choice

The normalizing constant in Bayes’ theorem, the marginal likelihood of the model, f(X) or f(X|M), can be used for model choice

f(X|M) can be estimated by taking the harmonic mean of the likelihood values from the MCMC run (MrBayes will do this automatically with ‘sump’)

Any models can be compared: nested, non-nested, data-derived

No correction for number of parameters

Can prefer a simpler model over a more complex model
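The harmonic mean estimate can be sketched on the log scale, since raw likelihoods underflow for real alignments. Given sampled log likelihoods, the estimate is log N minus the log-sum-exp of the negated values. The function name is illustrative. The harmonic mean estimator is known to be unstable, so treat its output as a rough guide.

```python
import math


def log_harmonic_mean(log_likelihoods):
    """Log of the harmonic mean of the likelihoods, computed stably."""
    negated = [-x for x in log_likelihoods]
    largest = max(negated)
    # log of sum(1 / L_i), using the log-sum-exp trick
    log_sum = largest + math.log(sum(math.exp(x - largest) for x in negated))
    return math.log(len(negated)) - log_sum


if __name__ == "__main__":
    # Stand-in values; in practice these come from the post-burn-in samples.
    print(log_harmonic_mean([-1000.2, -1001.5, -999.8, -1000.9]))
```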

Page 41: Lab3: Bayesian phylogenetic Inference and MCMC Department of Bioinformatics & Biostatistics, SJTU

Bayes Factor Comparisons

Interpretation of the Bayes factor

2ln(B10)    B10          Evidence against M0
0 to 2      1 to 3       Not worth more than a bare mention
2 to 6      3 to 20      Positive
6 to 10     20 to 150    Strong
> 10        > 150        Very strong
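Reading the table can be sketched as a small lookup that takes the log marginal likelihoods of the two models, for example the harmonic mean estimates reported by `sump`. The function name, the handling of values that fall exactly on a boundary, and the example numbers are assumptions.

```python
def evidence_against_m0(log_ml_m1, log_ml_m0):
    """Return (2 ln B10, verbal category) from two log marginal likelihoods."""
    two_ln_b10 = 2.0 * (log_ml_m1 - log_ml_m0)
    if two_ln_b10 < 0.0:
        label = "Favours M0 (swap the models to read the table)"
    elif two_ln_b10 < 2.0:
        label = "Not worth more than a bare mention"
    elif two_ln_b10 < 6.0:
        label = "Positive"
    elif two_ln_b10 < 10.0:
        label = "Strong"
    else:
        label = "Very strong"
    return two_ln_b10, label


if __name__ == "__main__":
    # Hypothetical log marginal likelihoods for M1 and M0.
    print(evidence_against_m0(-1000.0, -1004.0))
```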
