paml lab slides: review some keynotes part 1: paml introduction

PAML lab slides:

Review some keynotes

Part 1: PAML Introduction

Part 2: Real data exercises

1.  LRT is accurate and powerful, even in small data sets

2.  Parameter estimation is more difficult, and can be very sensitive to assumptions

3.  Bayes site prediction is very difficult and requires much more data than LRT and parameter estimation

1. Site prediction is NOT reliable from a few, similar sequences

2. Site prediction can be sensitive to assumed ω dist.

4.  There is an optimal window of sequence divergence

5.  Adding lineages is the most efficient way to improve power and accuracy

6.  We recommend robustness analyses:

1. Judicious use multiple models (M0-M3, M1a-M2a, M7-M8)

2. Evaluate sensitivity to model parameters (incl. tree!)

3. Attempt to identify “consensus sets of sites”

7. Check for and discard results from local optima

Review some Keynotes

PAML (Phylogenetic Analysis by Maximum Likelihood)

A program package by Ziheng Yang

(Demonstration by Joseph Bielawski)


What does PAML do?

Features include:

•  estimating synonymous and nonsynonymous rates

•  testing hypotheses concerning dN/dS rate ratios

•  various amino acid-based likelihood analysis

•  ancestral sequence reconstruction (DNA, codon, or AAs)

•  various clock models

•  simulating nucleotide, codon, or AA sequence data sets

•  and more ……


Downloading PAML

PAML download files are at:

ftp://abacus.gene.ucl.ac.uk/pub/paml/

For windows, Macs, and Unix/Linux


Programs in the package baseml basemlg codeml evolver yn00 chi2

for bases continuous-gamma for bases aaml (for amino acids) & codonml (for codons) simulation, tree distances dN and dS by YN00 chi square table

pamp mcmctree

parsimony (Yang and Kumar 1996)

Bayes MCMC tree (Yang & Rannala 1997). Slow


Running PAML programs

1.  Sequence data file

2.  Tree file

3.  Control file (*.ctl)


Running PAML programs: the sequence file

4 20 sequence_1 TCATT CTATC TATCG TGATG sequence_2 TCATT CTATC TATCG TGATG sequence_3 TCATT CTATC TATCG TGATG sequence_4 TCATT CTATC TATCG TGATG 4 20 sequence_1TCATTCTATCTATCGTGATG sequence_2TCATTCTATCTATCGTGATG sequence_3TCATTCTATCTATCGTGATG sequence_4TCATTCTATCTATCGTGATG

Format = plain text in “PHYLIP” format


Running PAML programs: the tree file

Format = parenthetical notation

((((1 , 2), 3), 4), 5)

This is a rooted tree




(((1 , 2), 3), 4, 5)

This is an unrooted tree




Examples: (((1,2),3),4,5);

((((1,2),3),4),5);

(((1:0.1, 2:0.2):0.8, 3:0.3):0.7, 4:0.4, 5:0.5);

(((Human:0.1, Chimpanzee:0.2):0.8, Gorilla:0.3):0.7, Orangutan:0.4, Gibbon:0.5);


codeml.ctl

Running PAML programs: the “*.ctl” file


seqfile = seqfile.txt * sequence data filename treefile = tree.txt * tree structure file name outfile = results.txt * main result file name noisy = 9 * 0,1,2,3,9: how much rubbish on the screen verbose = 1 * 1:detailed output runmode = 0 * 0:user defined tree seqtype = 1 * 1:codons CodonFreq = 2 * 0:equal, 1:F1X4, 2:F3X4, 3:F61 model = 0 * 0:one omega ratio for all branches NSsites = 0 * 0:one omega ratio (M0 in Tables 2 and 4) * 1:neutral (M1 in Tables 2 and 4) * 2:selection (M2 in Tables 2 and 4) * 3:discrete (M3 in Tables 2 and 4) * 7:beta (M7 in Tables 2 and 4) * 8:beta&w; (M8 in Tables 2 and 4) icode = 0 * 0:universal code fix_kappa = 0 * 1:kappa fixed, 0:kappa to be estimated kappa = 2 * initial or fixed kappa fix_omega = 0 * 1:omega fixed, 0:omega to be estimated omega = 5 * initial omega *set ncatG for models M3, M7, and M8!!! *ncatG = 3 * # of site categories for M3 in Table 4 *ncatG = 10 * # of site categories for M7 and M8 in Table 4


Exercises:

Maximum Likelihood Methods for Detecting Adaptive Protein Evolution

Joseph P. Bielawski and Ziheng Yang

in

Statistical methods in Molecular Evolution (R. Nielsen, ed.), Springer Verlag Series in Statistics in Health and Medicine. New York, New York.


Exercises:

Method/model program dataset

1 Pair-wise ML method codeml Drosophila GstD1

2 Pair-wise ML method codeml Drosophila GstD1

3 M0 and “branch models” codeml Ldh gene family

4 M0 and “site models” codeml HIV-2 nef genes


Exercise 1: ML estimation of the dN/dS (ω) ratio “by hand” for GstD1

Dataset:

GstD1 genes of Drosophila melanogaster and D. simulans (600 codons).

Objective: Use codeml to evaluate the likelihood the GstD1 sequences for a variety of fixed ω values. 1- Plot log-likelihood scores against the values of ω and determine the maximum likelihood estimate of ω. 2- Check your finding by running codeml’s hill-climbing algorithm.


seqfile = seqfile.txt * sequence data filename

outfile = results.txt * main result file name

noisy = 9 * 0,1,2,3,9: how much rubbish on the screen

verbose = 1 * 1:detailed output

runmode = -2 * -2:pairwise

seqtype = 1 * 1:codons

CodonFreq = 3 * 0:equal, 1:F1X4, 2:F3X4, 3:F61

model = 0 *

NSsites = 0 *

icode = 0 * 0:universal code

fix_kappa = 0 * 1:kappa fixed, 0:kappa to be estimated

kappa = 2 * initial or fixed kappa

fix_omega = 1 * 1:omega fixed, 0:omega to be estimated

omega = 0.001 * 1st fixed omega value [CHANGE THIS]

*alternate fixed omega values

*omega = 0.005 * 2nd fixed value

*omega = 0.01 * 3rd fixed value

*omega = 0.05 * 4th fixed value







Part 2: Real data exercises Exercise 1

0.05 0.2

2.0

0.1

1.6

0.001

0.005

0.01 0.4

0.8

-795-790-785-780-775-770-765-760-755-750

0.001 0.01 0.1 1 10

Fixed value of ω

Likelihood score

Plot results: likelihood score vs. omega (log scale)


Exercise 1

If you forget what to do, there is a

“step-by-step” guide on the course web site.

There is also a “HelpFile” to help you to get what you need from the output of codeml


Exercise 2: Investigating the sensitivity of the dN/dS ratio to assumptions

Dataset:

GstD1 genes of Drosophila melanogaster and D. simulans (600 codons).

Objective: 1- Test effect of transition / transversion ratio (κ ) 2- Test effect of codon frequencies (πI’s ) 3- Determine which assumptions yield the largest and smallest values of S, and what is the effect on ω



“CodonFreq=” is used to specify the equilibrium codon frequencies Fequal: - 1/61 each for the standard genetic code - CodonFreq = 0 - number of parameters in the model = 0 F3x4: - caluclated from the average nucleotide frequencies at the three codon positions - CodonFreq = 2 - number of parameters in the model = 9 F61 - also called “ftable”; empirical estimate of each codon frequency - CodonFreq = 3 - number of parameters in the model = 61


AAA → CAA

AAA → ACA

AAA → AAC

Example: A → C

Target codon (nucleotide)

CAA ACA AAC NP

No bias 1/61 1/61 1/61 0

F1×4 πc(πA)2 πc(πA)2 πc(πA)2 3

MG πc1 πc

2 πc3 9

F3×4 (GY) 9

F61 (GY) πCAA πACA πAAC 61

!C1! A

2! A3 ! A

1!C2! A

3 ! A1! A

2!C3

seqfile = seqfile.txt * sequence data filename

outfile = results.txt * main result file name

noisy = 9 * 0,1,2,3,9: how much rubbish on the screen

verbose = 1 * 1:detailed output

runmode = -2 * -2:pairwise

seqtype = 1 * 1:codons

CodonFreq = 0 * 0:equal, 1:F1X4, 2:F3X4, 3:F61 [CHANGE THIS]

model = 0 *

NSsites = 0 *

icode = 0 * 0:universal code

fix_kappa = 1 * 1:kappa fixed, 0:kappa to be estimated [CHANGE THIS]

kappa = 1 * fixed or initial value

fix_omega = 0 * 1:omega fixed, 0:omega to be estimated

omega = 0.5 * initial omega value

* Codon bias = none (equal); Ts/Tv bias = none (fixed at 1)

* Codon bias = none (equal); Ts/Tv bias = Yes (estimate by ML)

* Codon bias = yes (F3x4); Ts/Tv bias = none (fixed at 1)

* Codon bias = yes (F3x4); Ts/Tv bias = Yes (estimate by ML)

* Codon bias = yes (F61); Ts/Tv bias = none (fixed at 1)

* Codon bias = yes (F61); Ts/Tv bias = Yes (estimate by ML)


Exercise 2 Part 2: Real data exercises

Further details for about the assumptions tested in ACTIVITY 2 Assumption set 1: (Codon bias = none; Ts/Tv bias = none) CodonFreq=0; kappa=1; fix_kappa=1 Assumption set 2: (Codon bias = none; Ts/Tv bias = Yes) CodonFreq=0; kappa=1; fix_kappa=0 Assumption set 3: (Codon bias = yes [F3x4]; Ts/Tv bias = none) CodonFreq=2; kappa=1; fix_kappa=1 Assumption set 4: (Codon bias = yes [F3x4]; Ts/Tv bias = Yes) CodonFreq=2; kappa=1; fix_kappa=0 Assumption set 5: (Codon bias = yes [F61]; Ts/Tv bias = none) CodonFreq=3; kappa=1; fix_kappa=1 Assumption set 6: (Codon bias = yes [F61]; Ts/Tv bias = Yes) CodonFreq=3; kappa=1; fix_kappa=0

Table E2: Estimation of dS and dN between Drosophila melanogaster and D. simulans GstD1 genes Assumptions κ S N dS dN ω ℓ Fequal + κ = 1 1.0 ? ? ? ? ? ? Fequal + κ = estimated ? ? ? ? ? ? ? F3×4 + κ = 1 1.0 ? ? ? ? ? ? F3×4 + κ = estimated ? ? ? ? ? ? ? F61 + κ = 1 1.0 ? ? ? ? ? ? F61 + κ = estimated ? ? ? ? ? ? ? κ = transition/transversion rate ratio

S = number of synonymous sites

N = number of nonsynonymous sites

ω = dN/dS

ℓ = log likelihood score

Complete this table (If you forget what to do, there is a “step-by-step” guide on the course web-site.)


Exercise 3: Test hypotheses about molecular evolution of Ldh

Dataset:

The Ldh gene family is an important model system for molecular evolution of isozyme multigene families. The rate of evolution is known to have increased in Ldh-C following the gene duplication event

Objective: Use LRTs to evaluate the following hypotheses: 1- The mutation rate of Ldh-C has increased relative to Ldh-A,

2- A burst of positive selection for functional divergence occurred following the duplication event that gave rise to Ldh-C

3- There was a long term shift in selective constraints following the duplication event that gave rise to Ldh-C


Mus Cricetinae

C1

Gallus Sceloporus

C1 C1

C1 Rattus Sus Homo

C1 C1 C1

C1 C0

A0 A0

A0

A0

A1 A1

A1 A1 A1 A1

A1 A1 A1

Sus Rabbit Mus Rattus

Ldh-A

Ldh-C Gene duplication

event

H0: ωA0 = ωA1 = ωC1 = ωC0 H1: ωA0 = ωA1 = ωC1 ≠ ωC0 H2: ωA0 = ωA1 ≠ ωC1 = ωC0 H3: ωA0 ≠ ωA1 ≠ ωC1 = ωC0


seqfile = seqfile.txt * sequence data filename treefile = tree.txt * tree structure file name [CHANGE THIS] outfile = results.txt * main result file name noisy = 9 * 0,1,2,3,9: how much rubbish on the screen verbose = 1 * 1:detailed output runmode = 0 * 0:user defined tree seqtype = 1 * 1:codons CodonFreq = 2 * 0:equal, 1:F1X4, 2:F3X4, 3:F61 model = 0 * 0:one omega ratio for all branches * 1:separate omega for each branch * 2:user specified dN/dS ratios for branches NSsites = 0 * icode = 0 * 0:universal code fix_kappa = 0 * 1:kappa fixed, 0:kappa to be estimated kappa = 2 * initial or fixed kappa fix_omega = 0 * 1:omega fixed, 0:omega to be estimated omega = 0.2 * initial omega *H0 in Table 3: *model = 0 *(X02152Hom,U07178Sus,(M22585rab,((NM017025Rat,U13687Mus), *(((AF070995C,(X04752Mus,U07177Rat)),(U95378Sus,U13680Hom)),(X53828OG1, * U28410OG2))))); *H1 in Table 3: *model = 2 *(X02152Hom,U07178Sus,(M22585rab,((NM017025Rat,U13687Mus),(((AF070995C, *(X04752Mus,U07177Rat)),(U95378Sus,U13680Hom))#1,(X53828OG1,U28410OG2)) * ))); *H2 in Table 3: *model = 2 * (X02152Hom,U07178Sus,(M22585rab,((NM017025Rat,U13687Mus),(((AF070995C * #1,(X04752Mus #1,U07177Rat #1)#1)#1,(U95378Sus #1,U13680Hom #1) * #1)#1,(X53828OG1,U28410OG2))))); *H3 in Table 3: *model = 2 * (X02152Hom,U07178Sus,(M22585rab,((NM017025Rat,U13687Mus),(((AF070995C * #1,(X04752Mus #1,U07177Rat #1)#1)#1,(U95378Sus #1,U13680Hom #1) * #1)#1,(X53828OG1 #2,U28410OG2 #2)#2))));


Table E3: Parameter estimates under models of variable ω ratios among lineages and LRTs of their fit to the Ldh-A and Ldh-C gene family. Models ωA0 ωA1 ωC1 ωC0 ℓ LRT H0: ωA0 = ωA1 = ωC1 = ωC0 ? = ωA.0 = ωA.0 = ωA.0 ? ? H1: ωA0 = ωA1 = ωC1 ≠ ωC0 ? = ωA.0 = ωA.0 ? ? ? H2: ωA0 = ωA1 ≠ ωC1 = ωC0 ? = ωA.0 ? = ωC.1 ? ? H3: ωA0 ≠ ωA1 ≠ ωC1 = ωC0 ? ? ? = ωC.1 ? ? The topology and branch specific ω ratios are presented in Figure 5. H0 v H1: df = 1 H0 v H2: df = 1 H2 v H3: df = 1



!

"df =1, #= 0.052 = 3.841

Exercise 4: Testing for adaptive evolution in the nef gene of human HIV-2 (Start tonight, but finish as homework)

Dataset:

44 nef alleles from a study population of 37 HIV-2 infected people living in Lisbon, Portugal. The nef gene in HIV-2 has received less attention than HIV-1, presumably because HIV-2 is associated with reduced virulence and pathogenicity relative to HIV-1

Objectives: 1- Learn to use LRTs to test for sites evolving under positive selection in the nef gene. 2- If you find significant evidence for positive selection, then identify the involved sites by using empirical Bayes methods.


H0: uniform selective pressure among sites (M0) H1: variable selective pressure among sites (M3)

Compare 2Δl = 2(l1 - l0) with a χ2 distribution

00.10.20.30.40.50.60.70.80.91

Model 0

= 0.65 ω̂00.10.20.30.40.50.60.70.80.91

Model 3

= 0.01 = 0.90 = 5.55 ω̂ ω̂ω̂


H0: variable selective pressure but NO positive selection (M1) H1: variable selective pressure with positive selection (M2)


00.10.20.30.40.50.60.70.80.91

Model 2a

= 0.05 ω = 1 = 3.0 ω̂ω̂0

0.1

0.2

0.3

0.4

0.5

0.6Model 1a

ω = 1 = 0.06 ω̂


0 0.2 0.4 0.6 0.8 1 ω ratio

Site

s

M7: beta M8: beta&ω

0 0.2 0.4 0.6 0.8 1 ω ratio

Site

s >1

H0: Beta distributed variable selective pressure (M7) H1: Beta plus positive selection (M8)




seqfile = seqfile.txt * sequence data filename * treefile = treefile_M0.txt * SET THIS for tree file with ML branch lengths under M0 * treefile = treefile_M1.txt * SET THIS for tree file with ML branch lengths under M1 * treefile = treefile_M2.txt * SET THIS for tree file with ML branch lengths under M2 * treefile = treefile_M3.txt * SET THIS for tree file with ML branch lengths under M3 * treefile = treefile_M7.txt * SET THIS for tree file with ML branch lengths under M7 * treefile = treefile_M8.txt * SET THIS for tree file with ML branch lengths under M8 outfile = results.txt * main result file name noisy = 9 * lots of rubbish on the screen verbose = 1 * detailed output runmode = 0 * user defined tree seqtype = 1 * codons CodonFreq = 2 * F3X4 for codon ferquencies model = 0 * one omega ratio for all branches * NSsites = 0 * SET THIS for M0 * NSsites = 1 * SET THIS for M1 * NSsites = 2 * SET THIS for M2 * NSsites = 3 * SET THIS for M3 * NSsites = 7 * SET THIS for M7 * NSsites = 8 * SET THIS for M8 icode = 0 * universal code fix_kappa = 1 * kappa fixed * kappa = 4.43491 * SET THIS to fix kappa at MLE under M0 * kappa = 4.39117 * SET THIS to fix kappa at MLE under M1 * kappa = 5.08964 * SET THIS to fix kappa at MLE under M2 * kappa = 4.89033 * SET THIS to fix kappa at MLE under M3 * kappa = 4.22750 * SET THIS to fix kappa at MLE under M7 * kappa = 4.87827 * SET THIS to fix kappa at MLE under M8 fix_omega = 0 * omega to be estimated omega = 5 * initial omega * ncatG = 3 * SET THIS for 3 site categories under M3 * ncatG = 10 * SET THIS for 10 of site categories under M7 and M8 fix_blength = 2 * fixed branch lengths from tree file

Table E4: Parameter estimates and likelihood scores under models of variable ω ratios among sites for HIV-2 nef genes. Nested model pairs dN/dS b Parameter estimates c PSS d ℓ M0: one-ratio (1) a ? ω = ? N.A. ? M3: discrete (5) ? p0, = ?, p1, = ?, (p2 = ?)

ω0 = ?, ω1 = ?, ω2 = ? ? (?) ?

M1: neutral (1) ? p0 = ?, (p1 = ?)

ω0 = ?, (ω1 = 1) N.A. ?

M2: selection (3) ? p0= ?, p1 = ?, (p2 = ?) ω0 = ?, (ω1 = 1), ω2 = ?

? (?) ?

M7: beta (2) ? p = ?, q = ? N.A. ? M8: beta&ω (4) ? p0 = ? (p1 = ?)

p = ?, q = ?, ω = ? ? (?) ?

a The number after the model code, in parentheses, is the number of free parameters in the ω distribution. b This dN/dS ratio is an average over all sites in the HIV-2 nef gene alignment. c Parameters in parentheses are not free parameters. d PSS is the number of positive selection sites (NEB). The first number is the PSS with posterior probabilities > 50%. The second number (in parentheses) is the PSS with posterior probabilities > 95%.

NOTE: Codeml now implements models M1a and M2a !


0 0.2 0.4 0.6 0.8

1

1 11 21 31 41 51 61 71 81 91 101 111 121 131 141 151 161 171 181 191 201 211 221 231 241 251 Amino acid sites in the HIV-2 nef gene

ω0 = 0.034 ω1 = 0.74 ω2 = 2.50

Reproduce this plot


Major weaknesses:

•  Poor tree search

•  Poor user interface

Major strength:

•  Sophisticated likelihood models

Wrap - up

paml lab slides: review some keynotes part 1: paml introduction

Documents