Lecture 1: Overview of Phylogenetic methods and applications
Allan Wilson
Charles Darwin and Alfred Russel Wallace
Evolution as descent with modification, implying relationships between organisms by unbroken genetic lines
Phylogenetics seeks to determine these genetic relationships
Darwin’s sketch: the first phylogenetic tree?
[Portraits: Charles Darwin; Alfred Russel Wallace]
Opalized lower jaw of the monotreme Steropodon
Modern therians (2)
Archaic therians (2)
Eupantotheres (2)
Spalacotheriids (2)
Eutriconodonts (1)
Morganuconodonts (1)
Cynodonts (0)
Interpretation of morphological characters is often subjective, so open to personal biases
e.g. Jaw rotation: weak (0), moderate (1), strong (2) as indicated by vertical wear facets on molars. Hu et al. (Nature, 1997) and Ji et al. (Nature, 1999) coded Steropodon (1) and (2) respectively, helping to account for their alternative placements of monotremes
Hu et al.
Ji et al.
Deoxyribonucleic acid (DNA): Watson, Crick, Wilkins and Franklin
Early Molecular phylogenetics
- Immunological distances
- DNA-DNA hybridization
Without access to the actual sequences, it is difficult to apply corrections or statistical significance tests to these measures
Hominid phylogeny from DNA
Phylogenetics is now dominated by the clearly defined 4 nucleotides and 20 amino acids
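Purines: A, G
Pyrimidines: C, T
Transitions: changes within a class (A↔G, C↔T)
Transversions: changes between classes (purine↔pyrimidine)
[Figure axis: Millions of years]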
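The purine/pyrimidine distinction can be sketched in a few lines of Python. This is only an illustration: the function names and the two short example sequences are assumptions chosen for the sketch, not part of the lecture.

```python
# Minimal sketch: classify nucleotide differences as transitions or transversions.
# Function names and example sequences are illustrative assumptions.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(a, b):
    """Return 'identical', 'transition' or 'transversion' for two bases."""
    a, b = a.upper(), b.upper()
    if a == b:
        return "identical"
    # A transition stays within one chemical class (purine or pyrimidine).
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def count_ti_tv(seq1, seq2):
    """Count (transitions, transversions) between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    kinds = [classify_substitution(a, b) for a, b in zip(seq1, seq2)]
    return kinds.count("transition"), kinds.count("transversion")


if __name__ == "__main__":
    print(count_ti_tv("ACATGTG", "ACGTCAG"))
```

Dividing the first count by the second gives a crude observed ti/tv ratio; model-based estimates correct for multiple changes at the same site, which this sketch does not attempt.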
Tree terminology
[Figure: a rooted tree and an unrooted tree relating Taxon 1 to Taxon 8, annotated with the terms below]
internode
node
internal edge/branch
external edge/branch
paraphyly
polyphyly
bifurcating
polytomy
outgroup
Sister taxa
ingroup
Overview of phylogenetic procedure - by example
1. Biological problem (the question)
2. Which data to obtain (data sampling)
3. Finding the best tree (search strategy)
4. Defining the best tree (optimality criterion)
1. Biological problem (the question)
What is the relationship of the extinct American cheetah (Miracinonyx trumani) to other cats?
Two main sister-group hypotheses
A. Cheetahs (Acinonyx jubatus): limb, skull and vertebral morphology
B. Pumas (Felis concolor): geography; early fossils are less cheetah-like
See Barnett et al. (Curr. Biol., 2005)
2. Which data to obtain (data sampling)
Mitochondrial (mt) DNA
1. High mtDNA copy number is important because ancient DNA is degraded
2. Inferring relatively recent (2-10 million year) divergences, so substantial sequence variation is required
[Figure: observed divergence plotted against time for three classes of marker]
mt control region: best < 2 million years
mt protein/RNA coding: best 2-25 million years
Nuclear protein-coding: best > 25 million years
Mitochondrial partial NADH1 alignment for birds
#Nexus
Begin DATA;
Dimensions ntax=29 nchar=10692;
Format datatype=dna gap=-;
Matrix
Tinamou AACTATCTATTCATATCCTTATCATACATCATTCCTATTCTTATTGCA..
Emu AACCATCTCACTATATCACTCTCCTATGCAATCCCCATTCTAATCGCA..
Cassowary AACCACCTCACCATATCCCTGTCCTATGCAATCCCAATTCTAATCGCA..
Kiwi AACTACCTCACTATATCACTATCATATGTCATCCCAATTCTGATTGCA..
Rhea AACTACCTAATTATGTCCCTGTCATATGCTATCCCAATTCTAATCGCA..
Ostrich ACACACCTGACTATAGCACTCTCATACGCTGTTCCAATCCTAATTGCA..
Chicken AACCTTCTAATCATAACCTTATCCTATATTCTCCCCATCCTAATCGCC..
BrushTurkey AAACACCTCATCATATCCCTATCCTATGTTCTCCCAATTTTAATCGCC..
MagpieGoose AATCACCTCATTATAACCCTATCGTATGCCATCCCAATCCTAATCGCC..
Duck AGCTACCTCATTATATCCCTCCTATACGCCATCCCCATTCTAATCGCC..
Broadbill ACTAACCTTACCATATCCCTATCCTACGCCATCCCCGTCCTAGTTGCC..
Flycatcher ACCCACCTCATTATATCACTATCCTATGCCGTACCCATCCTAATTGCT..
ZebraFinch ATTAACCTCATCATAGCCCTCTCCTATGCCCTCCCAATCCTGATCGCA..
Rook GTCAACCTCATTATAGCACTTTCTTATGCTATCCCTATTCTAATCGCC..
Oystercatcher ACCTATCTCATTATATCCCTATCCTATGCCATCCCAATCCTGATCGCA..
Turnstone ACCTACTTCATCATATCCCTATCCTATGCAATCCCAATTCTAATTGCA..
Penguin GCTCACTTAGCCATATCCCTATCCTATGCCATCCCAATCCTCATTGCA..
Albatross ACCTATCTTGTCATGTCCCTATCATATGCCATCCCAATCCTAATCGCC..
;
End;
Tree reconstruction: tree-building methods by type of data
Clustering algorithms (faster)
  Distances: Unweighted pair group method with arithmetic means (UPGMA), Neighbour-joining (NJ)
Optimality criteria (slower)
  Distances: Minimum evolution (ME)
  Discrete (e.g. nucleotides): Maximum parsimony (MP), Maximum likelihood (ML)
Information loss often means loss of statistical power
Number of possible trees (where n is the number of taxa)
Unrooted trees: (2n-5) × (2n-7) × … × 3 × 1
Rooted trees: (2n-3) × (2n-5) × … × 3 × 1
For the 11-taxon cat phylogeny
Unrooted = 17 × 15 × 13 × 11 × 9 × 7 × 5 × 3 × 1 = 34,459,425
Rooted = Unrooted × (2n-3) = 34,459,425 × 19 = 654,729,075
An exhaustive search will examine all trees, but is not practical for n > 12
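The tree counts above can be checked with a short Python sketch. The function names are assumptions made for this illustration; the arithmetic is just the product of odd numbers given on the slide.

```python
# Minimal sketch: number of bifurcating trees for n taxa.
# Function names are illustrative assumptions.
def n_unrooted_trees(n):
    """(2n-5) x (2n-7) x ... x 3 x 1, defined here for n >= 3 taxa."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for k in range(3, 2 * n - 5 + 1, 2):  # odd numbers 3, 5, ..., 2n-5
        count *= k
    return count


def n_rooted_trees(n):
    """A root can be placed on any of the 2n-3 branches of an unrooted tree."""
    return n_unrooted_trees(n) * (2 * n - 3)


if __name__ == "__main__":
    for n in (4, 8, 11, 12, 20):
        print(n, n_unrooted_trees(n), n_rooted_trees(n))
```

Printing a few values of n shows how quickly the counts grow, which is why exhaustive searches stop being practical at around a dozen taxa.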
3. Finding the best tree (search strategy)
Reducing the time for searching “tree space”
Heuristic search
Only a small amount of tree-space is searched and there is no guarantee of finding the optimal tree - can be trapped in local maxima
[Figure: tree-space landscape marking the starting point (X), local optima and the global optimum]
Find an initial tree, and move within near-by tree-space, discarding worse alternatives
Branch and Bound search
As trees are built and branches added, if the addition of a taxon to a particular branch results in a tree-length greater than a previously determined upper bound for the tree, then this topology and all those derived from it are ignored and the search continues with a new placement for that taxon
Branch and bound guarantees finding globally optimal trees
4. Defining the best tree (optimality criteria)
Distance methods
Absolute distance matrix
                 1   2   3   4   5   6   7   8   9  10  11
 1 Mongoose      -
 2 Hyena       156   -
 3 Sabretooth  207 147   -
 4 Am.Cheetah  192 140 159   -
 5 Lion        186 134 148 131   -
 6 Tiger       160 143 132 111  64   -
 7 Puma        194 139 162  70 124 100   -
 8 House.Cat   206 133 163 124 118 100 117   -
 9 Cheetah     192 139 162 108 127 109  96 110   -
10 Ocelot      206 123 165 116 116  98 111  98 113   -
11 Jaguarundi  204 147 177 123 143 121 101 119 128 131   -
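As a rough illustration of how a clustering method starts from such a matrix, the sketch below ranks the pairs by distance; the smallest entry is the pair a UPGMA-style procedure would join first. The variable and function names are assumptions, and this shows only the first step, not a full UPGMA or NJ implementation.

```python
# Minimal sketch: rank taxon pairs in a lower-triangular distance matrix.
# Names are illustrative assumptions; values are copied from the matrix above.
NAMES = ["Mongoose", "Hyena", "Sabretooth", "Am.Cheetah", "Lion", "Tiger",
         "Puma", "House.Cat", "Cheetah", "Ocelot", "Jaguarundi"]

LOWER = [
    [],
    [156],
    [207, 147],
    [192, 140, 159],
    [186, 134, 148, 131],
    [160, 143, 132, 111, 64],
    [194, 139, 162, 70, 124, 100],
    [206, 133, 163, 124, 118, 100, 117],
    [192, 139, 162, 108, 127, 109, 96, 110],
    [206, 123, 165, 116, 116, 98, 111, 98, 113],
    [204, 147, 177, 123, 143, 121, 101, 119, 128, 131],
]


def ranked_pairs(names, lower):
    """Return (distance, taxon_a, taxon_b) tuples, smallest distance first."""
    pairs = []
    for i, row in enumerate(lower):
        for j, dist in enumerate(row):
            pairs.append((dist, names[j], names[i]))
    return sorted(pairs)


if __name__ == "__main__":
    for dist, a, b in ranked_pairs(NAMES, LOWER)[:3]:
        print(f"{a} - {b}: {dist}")
```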
Early phenetics (distance/similarity) studies would note that taxon X and taxon Z are the most similar
Taxon X
Taxon Y
Taxon Z
Taxon Y  TCAGCTA
Taxon X  ACATGTG
Taxon Z  ACGTCAG
XZ = 3 differences
YZ = 5 differences
XY = 4 differences
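The pairwise difference counts can be reproduced with a few lines of Python. The function name is an assumption; the sequences are those of Taxon X, Y and Z.

```python
# Minimal sketch: count pairwise differences between aligned sequences.
from itertools import combinations

SEQUENCES = {
    "Y": "TCAGCTA",
    "X": "ACATGTG",
    "Z": "ACGTCAG",
}


def n_differences(seq1, seq2):
    """Number of aligned positions at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq1, seq2))


if __name__ == "__main__":
    for a, b in combinations(sorted(SEQUENCES), 2):
        print(a, b, n_differences(SEQUENCES[a], SEQUENCES[b]))
```

On similarity alone X and Z would be grouped, which is the phenetic conclusion the lecture goes on to contrast with the cladistic one.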
With an outgroup added (character 3 set apart):
Taxon Y   TC A GCTA
Taxon X   AC A TGTG
Taxon Z   AC G TCAG
Outgroup  AA G TCTG
Cladistic methods, rather than being concerned with similarity, are concerned with the nature of changes (apomorphies)
symplesiomorphy
synapomorphy
autapomorphy
Synapomorphies are shared derived characters and so are considered to define clades (relationship groupings)
Taxon X
Taxon Y
Taxon Z
Outgroup
Maximum Parsimony: chooses the tree topology that minimises the number of changes required
7 steps (MP tree)
Taxon Z
Taxon X
Taxon Y
Outgroup
8 step sub-optimal phenetic tree
* Character 3 changes G to A: a single change (synapomorphy) on the MP tree, but two changes (homoplasy) on the phenetic tree
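The step counts for the two trees can be checked with a small parsimony sketch. It uses Fitch's counting procedure, which the slides do not name, so treat the choice of algorithm, the nested-tuple tree encoding and all function names as assumptions made for illustration.

```python
# Minimal sketch: parsimony tree length by Fitch counting.
# Trees are nested 2-tuples of taxon names (an assumed encoding).
SEQS = {
    "Y": "TCAGCTA",
    "X": "ACATGTG",
    "Z": "ACGTCAG",
    "Out": "AAGTCTG",
}

MP_TREE = ((("X", "Y"), "Z"), "Out")        # X and Y are sister taxa
PHENETIC_TREE = ((("X", "Z"), "Y"), "Out")  # X and Z grouped by similarity


def fitch_site(tree, states):
    """Return (possible state set, number of changes) for one character."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    lset, lsteps = fitch_site(left, states)
    rset, rsteps = fitch_site(right, states)
    common = lset & rset
    if common:
        return common, lsteps + rsteps
    # No shared state: one change is needed on this part of the tree.
    return lset | rset, lsteps + rsteps + 1


def tree_length(tree, seqs):
    """Sum the per-character change counts over all aligned sites."""
    n_sites = len(next(iter(seqs.values())))
    total = 0
    for i in range(n_sites):
        states = {name: seq[i] for name, seq in seqs.items()}
        total += fitch_site(tree, states)[1]
    return total


if __name__ == "__main__":
    print("MP tree:", tree_length(MP_TREE, SEQS))
    print("Phenetic tree:", tree_length(PHENETIC_TREE, SEQS))
```

Only character 3 distinguishes the two topologies: every other character needs one change on either tree.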
Maximum Likelihood: The explanation that makes the observed outcome the most likely
First use in phylogenetics: Cavalli-Sforza and Edwards (1967) for gene frequency data; Felsenstein (1981) for DNA sequences
L = Pr(D|H)
Probability of the data, given an hypothesis
The hypothesis is a tree topology, its branch-lengths and a model under which the data evolved
[Figure: four-taxon tree with tip states A, A, G, G; values shown on the figure: 0.5 substitutions per site, 0.5, 0.4, 0.4, 0.6]
Model of rate change, e.g. Hasegawa-Kishino-Yano (HKY, 1985): 4 base frequencies and a transition/transversion (ti/tv) ratio
[Figure: the same four-taxon tree repeated for each of the 16 possible combinations of states (A, C, G, T) at the two internal nodes]
Sum the probabilities for each of the 16 internal node combinations to get the likelihood for this single nucleotide site
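The likelihood of a tree is the product of the site likelihoods. Taken as natural logs, the site likelihoods can be summed to give the log likelihood:
lnL = lnL(1) + lnL(2) + ... + lnL(N), where L(i) is the likelihood of site i and N is the number of sites
The tree with the highest lnL (equivalently, the lowest -lnL) is the ML tree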
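The site-likelihood calculation can be sketched for a four-taxon tree by summing over the 16 internal-node combinations. To keep it short, this sketch swaps in the simpler Jukes-Cantor (JC69) model, with equal base frequencies and a single rate, rather than a model with a ti/tv ratio. The branch lengths in the example and all function names are assumptions, not values taken from the lecture's calculation.

```python
# Minimal sketch: likelihood of one site on the tree ((t1,t2),(t3,t4)) under JC69.
# JC69 is a simplifying assumption; branch lengths below are hypothetical.
import math
from itertools import product

BASES = "ACGT"


def jc69_prob(i, j, t):
    """Probability that base i is replaced by base j along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def site_likelihood(tips, tip_branches, internal_branch):
    """Sum over all 16 state combinations (u, v) at the two internal nodes."""
    t1, t2, t3, t4 = tips
    b1, b2, b3, b4 = tip_branches
    total = 0.0
    for u, v in product(BASES, repeat=2):
        total += (0.25  # equal base frequencies under JC69
                  * jc69_prob(u, t1, b1) * jc69_prob(u, t2, b2)
                  * jc69_prob(u, v, internal_branch)
                  * jc69_prob(v, t3, b3) * jc69_prob(v, t4, b4))
    return total


def log_likelihood(columns, tip_branches, internal_branch):
    """lnL of an alignment: the sum of the log site likelihoods."""
    return sum(math.log(site_likelihood(c, tip_branches, internal_branch))
               for c in columns)


if __name__ == "__main__":
    branches = (0.5, 0.5, 0.4, 0.4)  # hypothetical external branch lengths
    print(site_likelihood("AAGG", branches, 0.6))
    print(log_likelihood(["AAGG", "AGAG"], branches, 0.6))
```

A full ML analysis would also optimise the branch lengths and model parameters for each candidate tree, which is the main reason the method is slow.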
• ML is computationally intensive (slow)
• If branch-lengths are long, such that substitutions occur multiple times along the same branch for the same site, ML will be more consistent than MP – if the evolutionary process is sufficiently well modelled.
Bayesian Inference: The explanation with the highest posterior probability
First use in phylogenetics: Li (1996, PhD thesis), Rannala and Yang (1996)
Pr(H|D) = Pr(H) × Pr(D|H) / Pr(D)
Bayes’ Theorem
Posterior probability, the probability of the hypothesis given the data
Prior probability, the probability of the hypothesis on previous knowledge
Likelihood function, probability of the data given the hypothesis
Unconditional probability of the data, a normalizing constant ensuring the posterior probabilities sum to 1.00
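The roles of the four terms in Bayes' theorem can be illustrated numerically. In this sketch the three candidate hypotheses, their priors and their log likelihoods are all made-up values, and the function name is an assumption.

```python
# Minimal sketch: posterior probabilities for a small set of hypotheses.
# Priors and log likelihoods are hypothetical numbers for illustration.
import math


def posterior(priors, log_likelihoods):
    """Return Pr(H|D) for each hypothesis from Pr(H) and ln Pr(D|H)."""
    # Subtract the largest lnL before exponentiating to avoid underflow;
    # the shared factor cancels when the weights are normalised.
    best = max(log_likelihoods)
    weights = [p * math.exp(lnl - best)
               for p, lnl in zip(priors, log_likelihoods)]
    norm = sum(weights)  # plays the role of Pr(D), the normalizing constant
    return [w / norm for w in weights]


if __name__ == "__main__":
    flat_prior = [1 / 3, 1 / 3, 1 / 3]
    print(posterior(flat_prior, [-100.0, -101.0, -103.0]))
```

With a flat prior the ranking of hypotheses follows the likelihoods; an informative prior can change that ranking.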
Bayesian inference in phylogenetics is essentially a likelihood method, but may more closely reflect the way humans think.
• It is informed by prior knowledge (e.g. fossil data)
• Emphasis is placed on Pr(H|D) instead of Pr(D|H)
Markov chain Monte Carlo (MCMC) is used to approximate Bayesian posterior probabilities (BPP) over thousands to millions of generations
[Figure: an MCMC chain moving among Tree 1, Tree 2 and Tree 3 over generations 1-6; each proposed new state is either accepted or rejected]
BPP(tree 1) = 4/6
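The idea that a BPP is the fraction of generations spent on a tree can be sketched with a toy Metropolis sampler over three tree hypotheses. The unnormalised weights (prior × likelihood) are hypothetical, chosen so that the long-run value for tree 1 is 4/6. Real phylogenetic MCMC also samples branch lengths and model parameters, which this sketch omits.

```python
# Minimal sketch: Metropolis sampling over a handful of discrete tree hypotheses.
# Weights are hypothetical unnormalised posteriors (prior x likelihood).
import random


def metropolis_tree_sampler(weights, generations, seed=0):
    """Return the fraction of generations spent on each tree."""
    rng = random.Random(seed)
    n = len(weights)
    current = 0
    visits = [0] * n
    for _ in range(generations):
        # Symmetric proposal: pick one of the other trees uniformly.
        proposal = rng.choice([k for k in range(n) if k != current])
        accept_prob = min(1.0, weights[proposal] / weights[current])
        if rng.random() < accept_prob:
            current = proposal  # new state accepted
        visits[current] += 1    # a rejected proposal re-counts the current tree
    return [v / generations for v in visits]


if __name__ == "__main__":
    print(metropolis_tree_sampler([4.0, 1.0, 1.0], generations=200_000))
```

A six-generation chain, as in the figure, gives only a coarse estimate; longer chains bring the visit frequencies closer to the posterior probabilities.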
Posterior probabilities are integrated over all trees in the posterior distribution – providing density distributions rather than the optimization of likelihood
[Figure: a flat prior for a parameter value (e.g. the proportion of invariant sites), plotted from 0 to 1.0, beside the posterior distribution for the proportion of invariant sites over the same range]
The American cheetah is related to the puma; its morphological similarity to the cheetah is due to convergence
[Figure: two trees of Mongoose, Hyena, Sabretooth, Am.Cheetah, Puma, Jaguarundi, Cheetah, Cat, Ocelot, Lion and Tiger, with a group labelled "American felids"]
Maximum parsimony and neighbour-joining (distance) cladogram
Maximum likelihood and Bayesian inference phylogram (scale: 0.05 substitutions/site)
Applications:
The tree of life and inferring our origins
146 gene phylogeny: Delsuc et al. (Nature, 2006)
Little evidence from fossils
Identifying selection
ACA GAG CGC  Threonine - Glutamic acid - Arginine
ACG GAG AGC  Threonine - Glutamic acid - Serine
Synonymous (S) substitution: ACA to ACG (threonine unchanged)
Non-synonymous (N) substitution: CGC to AGC (arginine to serine)
Decreased dN/dS suggests purifying selection
Increased dN/dS suggests positive selection
The dN/dS ratio can be estimated along branches of phylogenetic trees (e.g. Guindon et al. PNAS, 2004)
Here dN/dS is indicated by branch width
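The synonymous/non-synonymous distinction in the codon example can be sketched as follows. The code builds the standard genetic code table and labels each changed codon. It only classifies observed differences and is not a dN/dS estimator, which would also need the numbers of synonymous and non-synonymous sites. Function and variable names are assumptions.

```python
# Minimal sketch: label codon differences as synonymous or non-synonymous.
# Uses the standard genetic code; names are illustrative assumptions.
CODE_BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(CODE_BASES)
    for j, b in enumerate(CODE_BASES)
    for k, c in enumerate(CODE_BASES)
}


def classify_codon_changes(seq1, seq2):
    """Return (codon1, codon2, kind) for each codon that differs."""
    if len(seq1) != len(seq2) or len(seq1) % 3 != 0:
        raise ValueError("sequences must be aligned, in-frame coding DNA")
    changes = []
    for pos in range(0, len(seq1), 3):
        c1, c2 = seq1[pos:pos + 3], seq2[pos:pos + 3]
        if c1 == c2:
            continue
        same_amino_acid = CODON_TABLE[c1] == CODON_TABLE[c2]
        changes.append((c1, c2, "synonymous" if same_amino_acid else "non-synonymous"))
    return changes


if __name__ == "__main__":
    for change in classify_codon_changes("ACAGAGCGC", "ACGGAGAGC"):
        print(change)
```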
Cohen (Molec. Biol. Evol., 2002) found increased positive selection at binding sites in the MHC proteins of estuarine fish Fundulus heteroclitus populations subject to severe chemical pollution.
Non-synonymous/synonymous ratios for peptide binding regions and non-peptide binding regions
Positive selection at binding sites provides high MHC variability with which to confront new pathogenic threats.
MHC (Major histocompatibility complex) binds antigens and presents them to T-cells as part of the immune response.
Mhc class II B with inferred locations of population-specific amino acid changes for Gloucester and Hot Spot.
Fish from the Hot Spot and Gloucester populations are genetically adapted to severe chemical pollution and show novel patterns of DNA substitution at the Mhc class II B locus, including strong signals of positive selection at inferred antigen-binding sites.
Stanhope et al. (Infect. Genet. Evol., 2004)
Severe Acute Respiratory Syndrome coronavirus (SARS-CoV) has a recombinant history with lineages of types I and III coronavirus
Using more sophisticated models of sequence evolution, Holmes and Rambaut (Phil. Trans. Roy. Soc. B, 2004) could not reject a single history across the SARS genome
Understanding sequence evolution, and the biases that may result from models (which are necessarily simplifications), is of vital importance in phylogenetic inference
[Figure: coronavirus phylogeny showing groups I, II and III and the position of SARS-TOR2]
Host-parasite coevolution/co-speciation
Caliciviruses infect diverse mammalian hosts and include Norovirus, the major cause of food-borne viral gastroenteritis in humans.
Host switching by caliciviruses is rare, although pigs have strains from co-speciation (artiodactyl strain) and host switching (carnivoran strain).
• Etherington et al. (J. Gen. Virol., 2006)
[Figure labels: carnivoran strains; artiodactyl strains]
Fig (Ficus) and fig wasp mutualism is reflected by co-speciation patterns: Machado et al. (PNAS, 2006)
Biogeography: vicariance and dispersal
From: Sanmartín and Ronquist (Syst. Biol., 2004)
Most frequent area cladograms: mapping taxa onto landmasses (Africa, S. South America, Australia, New Zealand)
Many plants: follow wind dispersal patterns (e.g. southern beech, cushion herb)
Many land animals: follow continental break-up (e.g. marsupial mammals, midges)
Conservation genetics: Amur leopard (Panthera pardus orientalis)
Relict population of 25-40 individuals in the Russian Far East.
Nuclear microsatellites and mtDNA: Uphyrkina et al. (J. Hered., 2002)
• validates subspecies distinctiveness
• extreme reduction in genetic diversity in the wild
• captive population genetically mixed with the Chinese subspecies
Macroevolutionary inference
[Timeline: Cretaceous, 65 Ma, Tertiary, Present]
Does the 65 Ma meteor impact (Alvarez et al. Science, 1980) fully explain the “great reptile extinction” and the rise of modern birds and mammals?
Molecular clock: DNA/protein divergence between organisms is a function of time
[Figure: relative diversity (axis 0 to 0.12) across four intervals: E-M Cret (144-83 Ma), CMP (83-71 Ma), E-M MAA (71-68 Ma) and L. MAA (68-65 Ma), ending at the K/T boundary; timeline marks at 95 Ma and 65 Ma]
Megafaunal extinctions (human induced or climate change)
[Images: Bison (Lascaux, France); Macrauchenia]
[Timeline markers: arrival of humans in North America; last glacial maximum]
The distribution of coalescence events over time on the tree allows inference of relative population size