Lecture 1: Overview of Phylogenetic methods and applications Allan Wilson

Upload: branden-hines

Post on 28-Dec-2015

Lecture 1: Overview of Phylogenetic methods and applications

Allan Wilson

Charles Darwin and Alfred Russel Wallace

Evolution as descent with modification, implying relationships between organisms by unbroken genetic lines

Phylogenetics seeks to determine these genetic relationships

[Figures: Charles Darwin; Alfred Russel Wallace; Darwin’s sketch: the first phylogenetic tree?]

[Figure: opalized lower jaw of the monotreme Steropodon]

Modern therians (2)
Archaic therians (2)
Eupantotheres (2)
Spalacotheriids (2)
Eutriconodonts (1)
Morganucodonts (1)
Cynodonts (0)

Interpretation of morphological characters is often subjective, so open to personal biases

e.g. Jaw rotation: weak (0), moderate (1), strong (2) as indicated by vertical wear facets on molars. Hu et al. (Nature, 1997) and Ji et al. (Nature, 1999) coded Steropodon (1) and (2) respectively, helping to account for their alternative placements of monotremes

[Figure: alternative placements of monotremes in the trees of Hu et al. and Ji et al.]

Deoxyribonucleic acid (DNA): Watson, Crick, Wilkins and Franklin

Early Molecular phylogenetics

- Immunological distances

- DNA-DNA hybridization

Without access to the actual sequences, it is difficult to apply corrections or statistical significance tests to these data

Hominid phylogeny from DNA

Phylogenetics is now dominated by the clearly defined 4 nucleotides and 20 amino acids

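Purines: A, G

Pyrimidines: C, T

Transitions: substitutions within a class (A to G, C to T, and the reverse)

Transversions: substitutions between a purine and a pyrimidine

[Figure: hominid phylogeny from DNA, with a time axis in millions of years]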
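The purine/pyrimidine classes can be turned into a small classifier. This is a minimal sketch, not taken from the lecture: the function names are hypothetical, and it assumes gap-free DNA strings of equal length.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(a: str, b: str) -> str:
    """Label a pair of aligned bases as identical, a transition or a transversion."""
    a, b = a.upper(), b.upper()
    valid = PURINES | PYRIMIDINES
    if a not in valid or b not in valid:
        raise ValueError(f"expected A, C, G or T, got {a!r} and {b!r}")
    if a == b:
        return "identical"
    # Same chemical class on both sides means a transition.
    same_class = (a in PURINES) == (b in PURINES)
    return "transition" if same_class else "transversion"


def count_ti_tv(seq1: str, seq2: str) -> tuple[int, int]:
    """Count (transitions, transversions) between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    ti = tv = 0
    for a, b in zip(seq1, seq2):
        kind = classify_substitution(a, b)
        if kind == "transition":
            ti += 1
        elif kind == "transversion":
            tv += 1
    return ti, tv
```

Dividing the first count by the second gives a raw, uncorrected ti/tv ratio for a pair of sequences.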

Tree terminology

[Figure: a rooted tree and an unrooted tree of Taxon 1 to Taxon 8, with labels for node, internode, internal edge/branch and external edge/branch]

[Figure: example trees. Terms illustrated: paraphyly, polyphyly, bifurcating, polytomy, outgroup, ingroup, sister taxa]

Overview of phylogenetic procedure - by example

1. Biological problem (the question)

2. Which data to obtain (data sampling)

3. Finding the best tree (search strategy)

4. Defining the best tree (optimality criterion)

1. Biological problem (the question)

What is the relationship of the extinct American Cheetah (Miracinonyx trumani) to other cats?

Two main sister group hypotheses

A. Cheetahs (Acinonyx jubatus): Limb, skull, vertebrae morphology

B. Pumas (Felis concolor): Geography, early fossils less cheetah-like

See Barnett et al. (Curr. Biol., 2005)

2. Which data to obtain (data sampling)

Mitochondrial (mt) DNA

1. High mtDNA copy number is important because ancient DNA is degraded

2. Inferring relatively recent (2-10 million year) divergences, so substantial sequence variation is required

[Figure: observed divergence plotted against time]

mt Protein/RNA coding, best 2-25 million years

mt control region, best < 2 million years

Nuclear protein-coding, best > 25 million years

Mitochondrial partial NADH1 alignment for birds

#Nexus
Begin DATA;
Dimensions ntax=29 nchar=10692;
Format datatype=dna gap=-;
Matrix
Tinamou        AACTATCTATTCATATCCTTATCATACATCATTCCTATTCTTATTGCA..
Emu            AACCATCTCACTATATCACTCTCCTATGCAATCCCCATTCTAATCGCA..
Cassowary      AACCACCTCACCATATCCCTGTCCTATGCAATCCCAATTCTAATCGCA..
Kiwi           AACTACCTCACTATATCACTATCATATGTCATCCCAATTCTGATTGCA..
Rhea           AACTACCTAATTATGTCCCTGTCATATGCTATCCCAATTCTAATCGCA..
Ostrich        ACACACCTGACTATAGCACTCTCATACGCTGTTCCAATCCTAATTGCA..
Chicken        AACCTTCTAATCATAACCTTATCCTATATTCTCCCCATCCTAATCGCC..
BrushTurkey    AAACACCTCATCATATCCCTATCCTATGTTCTCCCAATTTTAATCGCC..
MagpieGoose    AATCACCTCATTATAACCCTATCGTATGCCATCCCAATCCTAATCGCC..
Duck           AGCTACCTCATTATATCCCTCCTATACGCCATCCCCATTCTAATCGCC..
Broadbill      ACTAACCTTACCATATCCCTATCCTACGCCATCCCCGTCCTAGTTGCC..
Flycatcher     ACCCACCTCATTATATCACTATCCTATGCCGTACCCATCCTAATTGCT..
ZebraFinch     ATTAACCTCATCATAGCCCTCTCCTATGCCCTCCCAATCCTGATCGCA..
Rook           GTCAACCTCATTATAGCACTTTCTTATGCTATCCCTATTCTAATCGCC..
Oystercatcher  ACCTATCTCATTATATCCCTATCCTATGCCATCCCAATCCTGATCGCA..
Turnstone      ACCTACTTCATCATATCCCTATCCTATGCAATCCCAATTCTAATTGCA..
Penguin        GCTCACTTAGCCATATCCCTATCCTATGCCATCCCAATCCTCATTGCA..
Albatross      ACCTATCTTGTCATGTCCCTATCATATGCCATCCCAATCCTAATCGCC..
;
End;

Tree reconstruction

Tree-building method            | Type of data: distances                                                             | Type of data: discrete (e.g. nucleotides)
Clustering algorithm (faster)   | Unweighted pair group method with arithmetic means (UPGMA); Neighbour-joining (NJ)  |
Optimality criterion (slower)   | Minimum evolution (ME)                                                              | Maximum parsimony (MP); Maximum likelihood (ML)

Converting sequences to distances loses information, and often statistical power

3. Finding the best tree (search strategy)

Number of possible trees (where n is the number of taxa)

Unrooted trees: (2n-5) × (2n-7) × … × 3 × 1

Rooted trees: (2n-3) × (2n-5) × … × 3 × 1

For the 11-taxon cat phylogeny

Unrooted = 17 × 15 × 13 × 11 × 9 × 7 × 5 × 3 × 1 = 34,459,425

Rooted = Unrooted × (2n-3) = 654,729,075

An exhaustive search will examine all trees, but is not practical for n > 12
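The tree counts can be checked with a few lines of Python. This is a minimal sketch, assuming strictly bifurcating trees with labelled tips; the function names are hypothetical.

```python
def odd_product(m: int) -> int:
    """Product of the odd integers 1 × 3 × 5 × ... × m (m must be odd and >= 1)."""
    result = 1
    for k in range(3, m + 1, 2):
        result *= k
    return result


def num_unrooted_trees(n: int) -> int:
    """Number of unrooted bifurcating trees for n >= 3 labelled taxa."""
    if n < 3:
        raise ValueError("need at least 3 taxa for an unrooted tree")
    return odd_product(2 * n - 5)


def num_rooted_trees(n: int) -> int:
    """Number of rooted bifurcating trees for n >= 2 labelled taxa."""
    if n < 2:
        raise ValueError("need at least 2 taxa for a rooted tree")
    return odd_product(2 * n - 3)


if __name__ == "__main__":
    for n in (4, 11, 20):
        print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

The rapid growth in these counts is why exhaustive searches stop being practical at around a dozen taxa.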

Reducing the time for searching “tree space”

Heuristic search

Only a small amount of tree-space is searched and there is no guarantee of finding the optimal tree - can be trapped in local maxima

[Figure: tree-space landscape marking the starting point, local optima and the global optimum]

Find an initial tree, and move within near-by tree-space, discarding worse alternatives
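The risk of being trapped at a local optimum can be shown with a toy hill-climbing search. This sketch is only an analogy: a one-dimensional list of made-up scores stands in for tree space, whereas real heuristic searches move between trees using rearrangements.

```python
def hill_climb(scores: list[float], start: int) -> int:
    """Greedy local search: step to the better neighbour until none improves.

    Returns the index at which the search stops.
    """
    pos = start
    while True:
        neighbours = [i for i in (pos - 1, pos + 1) if 0 <= i < len(scores)]
        if not neighbours:
            return pos
        best = max(neighbours, key=lambda i: scores[i])
        if scores[best] <= scores[pos]:
            return pos  # no neighbour is better: a local (maybe global) optimum
        pos = best


# Hypothetical landscape: a local peak at index 2 and the global peak at index 6.
landscape = [1, 3, 5, 4, 2, 6, 9, 7]

if __name__ == "__main__":
    print(hill_climb(landscape, start=0))  # stops on the local peak
    print(hill_climb(landscape, start=4))  # reaches the global peak
```

Running the search from several starting points is one common way to reduce, but not remove, the chance of missing the global optimum.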

Branch and Bound search

As trees are built and branches added, if the addition of a taxon to a particular branch results in a tree-length greater than a previously determined upper bound for the tree, then this topology and all those derived from it are ignored and the search continues with a new placement for that taxon

Branch and bound guarantees finding globally optimal trees

[Figure: tree-space landscape marking the starting point, local optima and the global optimum]

4. Defining the best tree (optimality criteria)

Distance methods

Absolute distance matrix

                 1    2    3    4    5    6    7    8    9   10   11
 1 Mongoose      -
 2 Hyena       156    -
 3 Sabretooth  207  147    -
 4 Am.Cheetah  192  140  159    -
 5 Lion        186  134  148  131    -
 6 Tiger       160  143  132  111   64    -
 7 Puma        194  139  162   70  124  100    -
 8 House.Cat   206  133  163  124  118  100  117    -
 9 Cheetah     192  139  162  108  127  109   96  110    -
10 Ocelot      206  123  165  116  116   98  111   98  113    -
11 Jaguarundi  204  147  177  123  143  121  101  119  128  131    -
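As a rough illustration of how such a matrix is read, the sketch below stores the lower triangle from the slide and looks up the taxon with the smallest distance to a given cat. The helper names are hypothetical, and picking the nearest taxon is a purely phenetic summary, not a tree-building method.

```python
TAXA = ["Mongoose", "Hyena", "Sabretooth", "Am.Cheetah", "Lion", "Tiger",
        "Puma", "House.Cat", "Cheetah", "Ocelot", "Jaguarundi"]

# Lower triangle of the absolute distance matrix, one row per taxon.
LOWER = [
    [],
    [156],
    [207, 147],
    [192, 140, 159],
    [186, 134, 148, 131],
    [160, 143, 132, 111, 64],
    [194, 139, 162, 70, 124, 100],
    [206, 133, 163, 124, 118, 100, 117],
    [192, 139, 162, 108, 127, 109, 96, 110],
    [206, 123, 165, 116, 116, 98, 111, 98, 113],
    [204, 147, 177, 123, 143, 121, 101, 119, 128, 131],
]


def distance(a: str, b: str) -> int:
    """Look up the distance between two taxa in the symmetric matrix."""
    i, j = TAXA.index(a), TAXA.index(b)
    if i == j:
        return 0
    if i < j:
        i, j = j, i  # the lower triangle is indexed [larger][smaller]
    return LOWER[i][j]


def closest(taxon: str) -> str:
    """Return the other taxon with the smallest distance to `taxon`."""
    others = [t for t in TAXA if t != taxon]
    return min(others, key=lambda t: distance(taxon, t))


if __name__ == "__main__":
    nearest = closest("Am.Cheetah")
    print(nearest, distance("Am.Cheetah", nearest))
```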

Early phenetics (distance/similarity) studies would note that taxon X and taxon Z are the most similar

[Figure: tree of Taxon X, Taxon Y and Taxon Z]

Taxon Y  TCAGCTA
Taxon X  ACATGTG
Taxon Z  ACGTCAG

XZ = 3 differences
YZ = 5 differences
XY = 4 differences
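The pairwise counts for the three example sequences can be reproduced with a short sketch. The function name is hypothetical, and the count is an uncorrected number of differing sites.

```python
from itertools import combinations

SEQS = {
    "Y": "TCAGCTA",
    "X": "ACATGTG",
    "Z": "ACGTCAG",
}


def n_differences(seq1: str, seq2: str) -> int:
    """Number of aligned sites at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq1, seq2))


if __name__ == "__main__":
    for a, b in combinations(sorted(SEQS), 2):
        print(a + b, n_differences(SEQS[a], SEQS[b]))
```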

Taxon Y   TC A GCTA
Taxon X   AC A TGTG
Taxon Z   AC G TCAG
Outgroup  AA G TCTG

Cladistic methods, rather than being concerned with similarity, are concerned with the nature of changes (apomorphies)

symplesiomorphy

synapomorphy

autapomorphy

Synapomorphies are shared derived characters and so are considered to define clades (relationship groupings)

Maximum Parsimony: chooses the tree topology that minimises the number of changes required

[Figure: two trees of Taxon X, Taxon Y, Taxon Z and the outgroup. The MP tree requires 7 steps; the sub-optimal phenetic tree requires 8 steps. Asterisks mark where character 3 changes G to A: a single change (synapomorphy) on the MP tree, repeated changes (homoplasy) on the phenetic tree.]
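The step counts for the two trees can be reproduced with Fitch's algorithm for unordered characters. This is a minimal sketch: the lecture does not say which counting algorithm was used, so the choice of Fitch's method, the nested-tuple tree encoding and the names are all assumptions.

```python
SEQS = {
    "Y": "TCAGCTA",
    "X": "ACATGTG",
    "Z": "ACGTCAG",
    "Out": "AAGTCTG",
}


def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one site.

    `tree` is a taxon name or a 2-tuple of subtrees; `states` maps each
    taxon name to its base at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    lset, lcost = fitch_site(left, states)
    rset, rcost = fitch_site(right, states)
    common = lset & rset
    if common:
        return common, lcost + rcost
    # No shared state: one extra change is needed at this node.
    return lset | rset, lcost + rcost + 1


def tree_length(tree, seqs) -> int:
    """Total parsimony length of a tree, summed over all sites."""
    n_sites = len(next(iter(seqs.values())))
    total = 0
    for i in range(n_sites):
        states = {name: seq[i] for name, seq in seqs.items()}
        total += fitch_site(tree, states)[1]
    return total


mp_tree = ((("X", "Y"), "Z"), "Out")        # groups X with Y
phenetic_tree = ((("X", "Z"), "Y"), "Out")  # groups the most similar pair, X and Z

if __name__ == "__main__":
    print(tree_length(mp_tree, SEQS), tree_length(phenetic_tree, SEQS))
```

The one-step difference comes from character 3, which needs a single change on the tree that groups X with Y and two changes on the tree that groups X with Z.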

Maximum Likelihood: The explanation that makes the observed outcome the most likely

First use in phylogenetics: Cavalli-Sforza and Edwards (1967) for gene frequency data; Felsenstein (1981) for DNA sequences

L = Pr(D|H)

Probability of the data, given an hypothesis

The hypothesis is a tree topology, its branch-lengths and a model under which the data evolved

Model of rate change e.g. Hasegawa-Kishino-Yano (HKY, 1985): 4 base frequencies and a transition/transversion (ti/tv) ratio

[Figure: a four-taxon tree with tip states A, A, G and G and branch lengths in substitutions per site (values shown: 0.5, 0.5, 0.4, 0.4, 0.6), redrawn for each of the 16 combinations of states at the two internal nodes]

Sum the probabilities for each of the 16 internal node combinations to get the likelihood for this single nucleotide site
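The sum over the 16 internal-node combinations can be sketched in Python. To stay short, this example swaps in the simpler Jukes-Cantor (JC69) model, with equal base frequencies and a single rate, in place of the model named on the slide. The tree shape ((1,2),(3,4)), the assignment of branch lengths to branches and the function names are assumptions made for illustration.

```python
import math
from itertools import product

BASES = "ACGT"


def jc69_prob(a: str, b: str, t: float) -> float:
    """JC69 probability of base a being replaced by base b over branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def site_likelihood(tips, terminal_lengths, internal_length) -> float:
    """Likelihood of one site on the four-taxon tree ((1,2),(3,4)).

    Sums over all 16 state combinations (x, y) at the two internal nodes.
    """
    t1, t2, t3, t4 = tips
    b1, b2, b3, b4 = terminal_lengths
    total = 0.0
    for x, y in product(BASES, repeat=2):
        total += (0.25                                 # equal base frequency at node x
                  * jc69_prob(x, t1, b1) * jc69_prob(x, t2, b2)
                  * jc69_prob(x, y, internal_length)   # internal branch
                  * jc69_prob(y, t3, b3) * jc69_prob(y, t4, b4))
    return total


# Hypothetical assignment of the slide's branch lengths to branches.
TERMINAL = (0.5, 0.5, 0.4, 0.4)
INTERNAL = 0.6

if __name__ == "__main__":
    print(site_likelihood("AAGG", TERMINAL, INTERNAL))
```

A useful sanity check on any such implementation is that the likelihoods of all 256 possible tip patterns sum to 1.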

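The likelihood of a tree is the product of the site likelihoods. Taken as natural logs, the site likelihoods can be summed to give the log likelihood:

lnL = lnL(1) + lnL(2) + … + lnL(N), where L(i) is the likelihood of site i and N is the number of sites

The tree with the highest lnL (equivalently, the lowest -lnL) is the ML tree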

• ML is computationally intensive (slow)

• If branch-lengths are long, such that substitutions occur multiple times along the same branch for the same site, ML will be more consistent than MP – if the evolutionary process is sufficiently well modelled.
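One practical reason for working with log likelihoods is numerical: multiplying thousands of small site likelihoods underflows to zero in floating point, while the sum of logs stays finite. A minimal sketch with made-up site likelihoods (the function name is hypothetical):

```python
import math


def log_likelihood(site_likelihoods) -> float:
    """Sum of natural logs of the per-site likelihoods."""
    return sum(math.log(p) for p in site_likelihoods)


if __name__ == "__main__":
    # Hypothetical data: 2,000 sites, each with likelihood 0.001.
    sites = [1e-3] * 2000
    print(math.prod(sites))       # the raw product underflows
    print(log_likelihood(sites))  # the log likelihood remains usable
```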

Bayesian Inference: The explanation with the highest posterior probability

First use in phylogenetics: Li (1996, PhD thesis), Rannala and Yang (1996)

Bayes’ Theorem

Pr(H|D) = Pr(H) × Pr(D|H) / Pr(D)

Pr(H|D): posterior probability, the probability of the hypothesis given the data

Pr(H): prior probability, the probability of the hypothesis on previous knowledge

Pr(D|H): likelihood function, the probability of the data given the hypothesis

Pr(D): unconditional probability of the data, a normalizing constant ensuring the posterior probabilities sum to 1.00
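Bayes' theorem can be applied directly when the set of hypotheses is small enough to enumerate. The sketch below uses three candidate trees with made-up priors and likelihoods; the numbers and names are hypothetical and serve only to show how Pr(D) normalizes the result.

```python
def posterior(priors: dict, likelihoods: dict) -> dict:
    """Posterior Pr(H|D) for each hypothesis H from Pr(H) and Pr(D|H)."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(joint.values())  # Pr(D), the normalizing constant
    return {h: j / evidence for h, j in joint.items()}


# Hypothetical values: a flat prior over three trees and invented likelihoods.
priors = {"Tree 1": 1 / 3, "Tree 2": 1 / 3, "Tree 3": 1 / 3}
likelihoods = {"Tree 1": 0.008, "Tree 2": 0.003, "Tree 3": 0.001}

if __name__ == "__main__":
    for tree, p in posterior(priors, likelihoods).items():
        print(tree, round(p, 3))
```

With realistic numbers of taxa the hypotheses cannot be enumerated, so Pr(D) cannot be computed this way.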

Bayesian inference in phylogenetics is essentially a likelihood method, but may more closely reflect the way humans think.

• It is informed by prior knowledge (e.g. fossil data)

• Emphasis is placed on Pr(H|D) instead of Pr(D|H)

Markov chain Monte Carlo (MCMC) is used to approximate Bayesian posterior probabilities (BPP) over 1,000s to 1,000,000s of generations

[Figure: an MCMC chain moving among Tree 1, Tree 2 and Tree 3 over generations 1 to 6, with each proposed new state either accepted or rejected]

BPP(tree 1) = 4/6
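The idea of estimating a BPP as the fraction of generations spent on a tree can be sketched with a toy Metropolis sampler. Everything here is an assumption for illustration: only three trees exist, each has a made-up unnormalized posterior weight, and proposals pick one of the other two trees at random. Real phylogenetic MCMC also samples branch lengths and model parameters.

```python
import random
from collections import Counter


def metropolis_trees(weights: dict, n_generations: int, seed: int = 0) -> list:
    """Sample tree labels with probability proportional to `weights`."""
    rng = random.Random(seed)
    trees = list(weights)
    current = trees[0]
    samples = []
    for _ in range(n_generations):
        # Symmetric proposal: any other tree, chosen uniformly.
        proposal = rng.choice([t for t in trees if t != current])
        accept_prob = min(1.0, weights[proposal] / weights[current])
        if rng.random() < accept_prob:
            current = proposal       # new state accepted
        samples.append(current)      # a rejected proposal repeats the old state
    return samples


def bpp(samples: list) -> dict:
    """Fraction of sampled generations spent on each tree."""
    counts = Counter(samples)
    return {tree: count / len(samples) for tree, count in counts.items()}


if __name__ == "__main__":
    weights = {"Tree 1": 4.0, "Tree 2": 1.0, "Tree 3": 1.0}
    print(bpp(metropolis_trees(weights, 20000, seed=1)))
```

With these weights the chain should spend roughly two thirds of its generations on Tree 1, and the estimate tightens as more generations are sampled.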

Posterior probabilities are integrated over all trees in the posterior distribution – providing density distributions rather than the optimization of likelihood

[Figure: a flat prior for a parameter value (e.g. proportion of invariant sites) and the posterior for the proportion of invariant sites, both plotted over the range 0 to 1.0]

The American cheetah is related to the puma; its morphological similarity to the cheetah is due to convergence

[Figure: two trees of Mongoose, Hyena, Sabretooth, Am.Cheetah, Puma, Jaguarundi, Cheetah, Cat, Ocelot, Lion and Tiger, with the American felids bracketed. One panel: maximum parsimony and neighbour-joining (distance) cladogram. Other panel: maximum likelihood and Bayesian inference phylogram, scale bar 0.05 substitutions/site.]

Applications:

The tree of life and inferring our origins

146 gene phylogeny: Delsuc et al. (Nature, 2006)

Little evidence from fossils

Identifying selection

ACA GAG CGC: Threonine - Glutamic acid - Arginine

ACG GAG AGC: Threonine - Glutamic acid - Serine

Synonymous (S) substitution: ACA/ACG (threonine unchanged)

Non-synonymous (N) substitution: CGC/AGC (arginine versus serine)

Decreased dN/dS suggests purifying selection

Increased dN/dS suggests positive selection

The dN/dS ratio can be estimated along branches of phylogenetic trees (e.g. Guindon et al. PNAS, 2004)

Here dN/dS is indicated by branch width
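The distinction between synonymous and non-synonymous changes can be checked against the standard genetic code. This is a minimal sketch with hypothetical names. It only labels codon differences; a real dN/dS estimate also normalizes by the numbers of synonymous and non-synonymous sites and corrects for multiple hits.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_codon_changes(seq1: str, seq2: str) -> list:
    """List (codon1, codon2, kind) for each codon that differs between two
    aligned, in-frame coding sequences."""
    if len(seq1) != len(seq2) or len(seq1) % 3 != 0:
        raise ValueError("sequences must be aligned, in frame and of equal length")
    changes = []
    for i in range(0, len(seq1), 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        if c1 == c2:
            continue
        same_aa = CODON_TABLE[c1] == CODON_TABLE[c2]
        changes.append((c1, c2, "synonymous" if same_aa else "non-synonymous"))
    return changes


if __name__ == "__main__":
    print(classify_codon_changes("ACAGAGCGC", "ACGGAGAGC"))
```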

Cohen (Molec. Biol. Evol., 2002) found increased positive selection at binding sites in the MHC proteins of estuarine fish Fundulus heteroclitus populations subject to severe chemical pollution.

Non-synonymous/synonymous ratios for peptide binding regions and non-peptide binding regions

Positive selection at binding sites provides high MHC variability with which to confront new pathogenic threats.

MHC (Major histocompatibility complex) binds antigens and presents them to T-cells as part of the immune response.

Mhc class II B with inferred locations of population-specific amino acid changes for Gloucester and Hot Spot.

Fish from the Hot Spot and Gloucester populations are genetically adapted to severe chemical pollution and show novel patterns of DNA substitution at the Mhc class II B locus, including strong signals of positive selection at inferred antigen-binding sites.

Stanhope et al. (Infect. Genet. Evol., 2004)

Severe Acute Respiratory Syndrome coronavirus (SARS-CoV) has a recombinant history with lineages of types I and III coronavirus

Using more sophisticated models of sequence evolution, Holmes and Rambaut (Phil. Trans. Roy. Soc. B, 2004) could not reject a single history across the SARS genome

Understanding sequence evolution, and the biases that may result from models (which are necessarily simplifications), is of vital importance in phylogenetic inference

[Figure: coronavirus phylogeny showing groups I, II and III and SARS-TOR2]

Host-Parasite coevolution/co-speciation

Caliciviruses infect diverse mammalian hosts and include Norovirus, the major cause of food-borne viral gastroenteritis in humans.

Host switching by caliciviruses is rare, although pigs have strains from co-speciation (artiodactyl strain) and host switching (carnivoran strain).

• Etherington et al. (J. Gen Virol, 2006)

[Figure: calicivirus tree showing carnivoran strains and artiodactyl strains]

Fig (Ficus) and fig wasp mutualism is reflected by co-speciation patterns: Machado et al. (PNAS, 2006)

Biogeography: vicariance and dispersal

Most frequent area cladograms: mapping taxa onto landmasses

From: SanMartin and Ronquist (Syst. Biol. 2004)

Landmasses: Africa, S. South America, Australia, New Zealand

Many plants: follows wind dispersal patterns

Many land animals: follows continental break-up

Taxa shown: southern beech, cushion herb, marsupial mammals, midges

Conservation genetics: Amur leopard (Panthera pardus orientalis)

Relict population of 25-40 individuals in the Russian Far East.

Nuclear microsatellites and mtDNA: Uphyrkina et al. (J. Hered., 2002)

• validates subspecies distinctiveness

• extreme reduction in genetic diversity in the wild

• captive population genetically mixed with the Chinese subspecies

Macroevolutionary inference

[Figure: timeline from the Cretaceous through the Tertiary to the present, with the Cretaceous/Tertiary boundary at 65 Ma]

Does the 65 Ma meteor impact (Alvarez et al. Science, 1980) fully explain the “great reptile extinction” and the rise of modern birds and mammals?

Molecular clock: DNA/protein divergence between organisms is a function of time

[Figure: relative diversity (axis 0 to 0.12) in four time bins before the K/T boundary: 144-83 Ma (E-M Cret), 83-71 Ma (CMP), 71-68 Ma (E-M MAA) and 68-65 Ma (L. MAA); time axis marked at 95 Ma and 65 Ma]

Bison (Lascaux, France)

Megafaunal extinctions (human induced or climate change)

Macrauchenia

Arrival of humans in North America

Last glacial maximum

The distribution of coalescence events over time on the tree allows inference of relative population size