

Page 1:

PAML: Phylogenetic Analysis by Maximum Likelihood

Ziheng Yang

Department of Biology, University College London

http://abacus.gene.ucl.ac.uk/

Page 2:

Plan

• Overview of PAML, things it can do, and especially things that other programs don’t do.

• An example (of detecting amino acids under positive selection)

• The trouble with sliding windows

Page 3:

PAML programs, currently in ver. 3.15

• PAML programs are written in ANSI C. Executables are provided for MS Windows and Mac OS X. Source code can be compiled for Unix and other platforms.

• Free for academics (and everybody else).

• Sequential, not parallelized.

• Old-style command-line programs, with no GUI, no menu, no mice.

• Yang’s theorem: Every version of PAML has bugs.

Page 4:

PAML programs

• baseml: ML under nucleotide-based models
• basemlg: continuous-gamma, for bases (Yang 1993)
• codeml: ML for amino acids & codons
• evolver: the world’s best named simulation program
• yn00: dN and dS estimation using Y&N2000
• chi2: $\chi^2$ critical values and p values
• pamp: parsimony calculations (Yang and Kumar 1996)
• mcmctree: species divergence times, soft bounds, relaxed clocks (Yang & Rannala 2006)

Page 5:

PAML docs & examples

• doc in doc/: pamlDOC.pdf, pamlFAQ.pdf, pamlHistory.txt

• examples/ are provided with README files

• Apologies for poor support. Bug reports can come to my mailbox. Questions should go to the PAML discussion group: http://www.rannala.org/gsf

Page 6:

Major weaknesses

• Poor tree search

• Poor user interface

Major strength

• Many models implemented in the likelihood framework.

Page 7:

Uses of PAML (i)

Maximum likelihood parameter estimation and likelihood ratio tests of hypotheses (such as the molecular clock or rate variation among sites) under a number of substitution models based on nucleotides, amino acids, and codons.

Most of the nucleotide-based models are also available in PAUP. Most of the models are available in MrBayes?

Page 8:

Uses of PAML (ii)

Likelihood (empirical Bayes) reconstruction of ancestral nucleotide, amino acid, or codon sequences.

This is the same as parsimony reconstruction except that it accounts for different branch lengths and different rates of change between states.

Yang, Z., S. Kumar, and M. Nei. 1995. Genetics 141:1641-1650.
Koshi, J. M., and R. A. Goldstein. 1996. J. Mol. Evol. 42:313-320.
Pupko, T., I. Pe’er, R. Shamir, and D. Graur. 2000. Mol. Biol. Evol. 17:890-896.

Page 9:

Uses of PAML (iii)

• Combined analysis of heterogeneous data sets.
• MrBayes has implemented more powerful models of this kind (Nylander et al. 2004. Syst. Biol. 53:47-67).
• These should make the following debates unnecessary:
  • combined analysis (total evidence) vs. separate analysis
  • supertree vs. supermatrix

Yang, Z. 1996. Maximum-likelihood models for combined analyses of multiple sequence data. J. Mol. Evol. 42:587-596.
Pupko, T., D. Huchon, Y. Cao et al. 2002. Combining multiple data sets in a likelihood analysis: which models are the best? Mol. Biol. Evol. 19:2294-2307.

Page 10:

Uses of PAML (iv)

Likelihood ratio test of the clock and likelihood estimation of species divergences under clock and relaxed-clock models (baseml & codeml)

Bayesian estimation of species divergence times using soft bounds and relaxed molecular clocks (mcmctree), similar to Jeff Thorne’s multidivtime.

Rambaut, A., and L. Bromham. 1998. Mol. Biol. Evol. 15:442-448.
Yoder, A. D., and Z. Yang. 2000. Mol. Biol. Evol. 17:1081-1090.
Yang, Z., and A. D. Yoder. 2003. Syst. Biol. 52:705-716.
Yang, Z., and B. Rannala. 2006. Mol. Biol. Evol. 23:212-226.
Rannala, B., and Z. Yang. In preparation.

Page 11:

Uses of PAML (v): Codon substitution models & detection of selection in protein-coding genes (codeml)

• Branch models to test positive selection on lineages on the tree (Yang 1998. Mol. Biol. Evol. 15:568-573)

• Site models to test positive selection affecting individual sites (Nielsen & Yang 1998. Genetics 148:929-936; Yang et al. 2000. Genetics 155:431-449)

• Branch-site models to detect positive selection at a few sites on a particular lineage (Yang & Nielsen 2002. Mol. Biol. Evol. 19:908-917; Yang et al. 2005. Mol. Biol. Evol. 22:1107-1118; Zhang, J., R. Nielsen, and Z. Yang. 2005. Mol. Biol. Evol. 22:2472-2479)

Page 12:

MacCallum, C., and E. Hill. 2006. Being positive about selection. PLoS Biol 4:e87.

PLoS Biol is receiving and rejecting too many manuscripts that use the McDonald-Kreitman (M&K) test and paml/codeml to detect positive selection.

Their main criterion right now is that the manuscript should include experimental verification to justify publication in such high-profile journals.

Page 13:

LRT of amino acid sites under positive selection

$H_0$: there are no sites at which $\omega > 1$
$H_1$: there are such sites

Compare $2\Delta\ell = 2(\ell_1 - \ell_0)$ with a $\chi^2$ distribution

(Nielsen & Yang 1998. Genetics 148:929-936; Yang, Nielsen, Goldman & Pedersen 2000. Genetics 155:431-449)

Page 14:

Models M1a & M2a

M1a (neutral)
Site class:       0                1
Proportion:       $p_0$            $p_1$
$\omega$ ratio:   $\omega_0 < 1$   $\omega_1 = 1$

M2a (selection)
Site class:       0                1                2
Proportion:       $p_0$            $p_1$            $p_2$
$\omega$ ratio:   $\omega_0 < 1$   $\omega_1 = 1$   $\omega_2 > 1$

Modified from Nielsen & Yang (1998), where $\omega_0 = 0$ is fixed.
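As a rough sketch of how the M1a vs. M2a comparison might be set up, the Python snippet below builds the text of a codeml control file that requests both site models in one run. The option names are those of the codeml control file as described in pamlDOC.pdf; the file names `mhc.phy`, `mhc.tree`, and `mhc_sites.out` are hypothetical placeholders, and the chosen settings (for example the F3x4 codon frequencies) are assumptions to be checked against the documentation of the installed version.

```python
def make_codeml_ctl(seqfile, treefile, outfile, nssites=(1, 2)):
    """Build the text of a codeml control file for a site-model analysis.

    The file names are supplied by the caller; the ones used below are
    hypothetical placeholders, not files shipped with PAML.
    """
    options = [
        ("seqfile", seqfile),    # codon alignment
        ("treefile", treefile),  # tree topology
        ("outfile", outfile),    # main results file
        ("noisy", 3),
        ("verbose", 0),
        ("runmode", 0),          # 0: user-supplied tree
        ("seqtype", 1),          # 1: codon sequences
        ("CodonFreq", 2),        # 2: F3x4 codon frequencies (assumed choice)
        ("model", 0),            # 0: same omega on all branches
        ("NSsites", " ".join(str(m) for m in nssites)),  # 1: M1a, 2: M2a
        ("icode", 0),            # 0: universal genetic code
        ("fix_kappa", 0),        # estimate kappa
        ("kappa", 2),            # starting value for kappa
        ("fix_omega", 0),        # estimate omega
        ("omega", 0.5),          # starting value for omega
        ("cleandata", 0),
    ]
    return "".join(f"{name:>10} = {value}\n" for name, value in options)


ctl_text = make_codeml_ctl("mhc.phy", "mhc.tree", "mhc_sites.out")
print(ctl_text)
```

Saved as `codeml.ctl` in the working directory, text like this is what codeml reads when started without arguments; the log likelihoods reported for the two models can then be compared with the likelihood ratio test.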

Page 15:

Human MHC Class I data: 192 alleles, 270 codons

Model             $\ell$       Parameter estimates
M1a (neutral)     -7,490.99    $p_0 = 0.830$, $\omega_0 = 0.041$; $p_1 = 0.170$, $\omega_1 = 1$
M2a (selection)   -7,231.15    $p_0 = 0.776$, $\omega_0 = 0.058$; $p_1 = 0.140$, $\omega_1 = 1$; $p_2 = 0.084$, $\omega_2 = 5.389$

Likelihood ratio test of positive selection: $2\Delta\ell = 2 \times 259.84 = 519.68$, $P < 0.001$, d.f. = 2
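The arithmetic of this test can be checked with a few lines of Python. This is a minimal sketch, assuming the two tabulated values are log likelihoods (entered as negative numbers) and using the fact that, for 2 degrees of freedom, the $\chi^2$ tail probability has the closed form $e^{-x/2}$; the function name `lrt_chi2_df2` is made up for the illustration.

```python
import math


def lrt_chi2_df2(lnl_null, lnl_alt):
    """Return (2 * delta_lnL, p-value) for an LRT with 2 degrees of freedom.

    For 2 d.f. the chi-square survival function is exp(-x / 2),
    so no statistics library is needed.
    """
    stat = 2.0 * (lnl_alt - lnl_null)
    p_value = math.exp(-stat / 2.0) if stat > 0 else 1.0
    return stat, p_value


# Log likelihoods for the MHC data: M1a (null) and M2a (alternative).
stat, p = lrt_chi2_df2(lnl_null=-7490.99, lnl_alt=-7231.15)
print(f"2*delta_lnL = {stat:.2f}, p = {p:.3g}")
```

For other degrees of freedom the closed form no longer applies, and a $\chi^2$ routine such as PAML's own chi2 program or a statistics library would be used instead.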

Page 16:

Posterior probabilities for MHC (M2a)

[Figure: posterior probabilities for the MHC data under M2a; axes run from 0 to 1.]

Page 17:

25 sites identified under M2a
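The posterior probabilities behind such a list come from Bayes' rule applied to the site classes: the posterior for class $k$ at a site with data $x$ is $p_k f(x \mid \omega_k) / \sum_j p_j f(x \mid \omega_j)$. The sketch below illustrates only that step. The proportions are the M2a estimates for the MHC data, but the per-site likelihoods in `f_site` are made-up numbers; in a real analysis they are computed on the tree by codeml, and the slides do not state the cutoff used to call the 25 sites.

```python
def class_posteriors(proportions, site_likelihoods):
    """Posterior probability of each site class for one site (Bayes' rule).

    proportions      -- class proportions p_k, used as the prior (sum to 1)
    site_likelihoods -- f(x | omega_k), the likelihood of the site's data
                        under each class (hypothetical values below)
    """
    joint = [p * f for p, f in zip(proportions, site_likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]


# M2a proportions estimated for the MHC data: p0, p1, p2.
p = [0.776, 0.140, 0.084]

# Made-up conditional likelihoods for a single variable site.
f_site = [1.0e-9, 4.0e-8, 6.0e-7]

post = class_posteriors(p, f_site)
print([round(x, 3) for x in post])
```

A site is a candidate for positive selection when the last entry, the posterior for the class with $\omega_2 > 1$, is high.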

Page 18:

There are a few wrong ways for detecting selection, one of which is sliding windows.

Page 19:

Sliding window analysis (mouse-rat BRCA1)

[Figure: sliding-window plot of dN, dS, and dN/dS ($\omega$) against position (0 to 1500).]

Page 20:

Sliding window analysis (fake data)

[Figure: sliding-window plots of dN, dS, and dN/dS ($\omega$) against position (0 to 1500) for simulated data sets.]

Page 21:

Two trends in sliding window analysis

• Both dS and dN fluctuate smoothly (because consecutive windows overlap)

• dS fluctuates more than dN (because there are fewer silent than replacement sites)

A sliding-window plot may be useful for displaying trends that are known to exist, but it is misleading if used to detect trends.
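The first trend is easy to reproduce with simulated data. The following is a minimal sketch, not the analysis behind the slides: every site changes independently with the same probability, so there is no real trend along the sequence, yet overlapping windows give a smooth curve in which neighbouring values are almost perfectly correlated. The sequence length, rate, and window width are assumed values chosen for illustration.

```python
import random


def sliding_means(values, width, step=1):
    """Mean of `values` in each window of `width` sites, moved by `step`."""
    return [
        sum(values[start:start + width]) / width
        for start in range(0, len(values) - width + 1, step)
    ]


def lag1_autocorrelation(xs):
    """Sample correlation between each window value and the next one."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    return cov / var


random.seed(1)
n_sites, rate = 1500, 0.2  # assumed values; every site has the same rate
changed = [1 if random.random() < rate else 0 for _ in range(n_sites)]

windows = sliding_means(changed, width=100, step=1)
print(len(windows), round(lag1_autocorrelation(windows), 3))
```

Because adjacent windows share 99 of their 100 sites, the peaks and valleys in such a plot reflect the window overlap rather than any regional difference in rate.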

Page 22:

Orthodox statistical analysis
• formulate a biological hypothesis
• design the experiment & collect data
• test whether the data are compatible with the hypothesis

The more-common way of data analysis in biology
• a large amount of data, no a priori hypothesis
• filter and plot data to identify “unexpected” patterns
• test the patterns using statistical tests

Page 23:

Acknowledgment (sliding-window analysis)

Karl Schmid