Post on 03-Jan-2016

Phylogenetic Reconstruction

A simple way to understand evolution…

DATA: Alignment of gene sequences

How can we transform this information into a historical context?

Phylogeny inference

1. Distance-based methods
- Pairwise distance matrix
- Adjust tree branch lengths to fit the distance matrix (e.g. least squares, neighbor joining)

2. Character-based methods
- Parsimony
- Maximum likelihood or model-based evolution

In 1866, Ernst Haeckel coined the word “phylogeny” and presented phylogenetic trees for most known groups of living organisms.

Surf the tree of life at: http://tolweb.org/tree/phylogeny.html

The Tree of Life project

What is a tree?

A tree consists of nodes connected by branches.

Terminal nodes represent sequences or organisms for which we have data. Each is typically called an “Operational Taxonomic Unit” or OTU.

Internal nodes represent hypothetical ancestors.

The ancestor of all the sequences is the root of the tree.

A tree is a mathematical structure which is used to model the actual evolutionary history of a group of sequences or organisms, i.e. an evolutionary hypothesis.

Types of Trees: Rooted vs. Unrooted

                     Branches   Nodes
Rooted    Interior   M - 2      M - 1
          Total      2M - 2     2M - 1
Unrooted  Interior   M - 3      M - 2
          Total      2M - 3     2M - 2

M is the number of OTUs.

The number of rooted and unrooted trees:

Number of OTUs   Possible number of rooted trees   Possible number of unrooted trees
 2               1                                 1
 3               3                                 1
 4               15                                3
 5               105                               15
 6               945                               105
 7               10395                             945
 8               135135                            10395
 9               2027025                           135135
10               34459425                          2027025

OTU: Operational Taxonomic Unit
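The counts in the table follow the standard double-factorial formulas for bifurcating trees: (2M-5)!! unrooted trees and (2M-3)!! rooted trees. A minimal sketch, with function names chosen here only for illustration:

```python
def count_unrooted_trees(m: int) -> int:
    """Unrooted bifurcating trees for m OTUs: 3 * 5 * ... * (2m - 5)."""
    count = 1
    for k in range(3, 2 * m - 4, 2):
        count *= k
    return count


def count_rooted_trees(m: int) -> int:
    """Rooting adds one more branch to choose from, so it equals the unrooted count for m + 1 OTUs."""
    return count_unrooted_trees(m + 1)


for m in range(2, 11):
    print(m, count_rooted_trees(m), count_unrooted_trees(m))
```

The loop prints one row per number of OTUs, in the same column order as the table.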

Types of Trees

Bifurcating vs. multifurcating (polytomy)

Polytomies: Soft vs. Hard
• Soft: designate a lack of information about the order of divergence.
• Hard: the hypothesis that multiple divergences occurred simultaneously.

Types of Trees

Trees: only one path between any pair of nodes
Networks: more than one path between any pair of nodes

A shorthand for trees: the Newick format

Tree with tips 1 2 3 4 5 6:
(((1,2),((3,4),5)),6)

Tree with tips 1 2 3 4:
((1,2),(3,4))
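To make the nesting concrete, here is a small sketch that reads a topology-only Newick string (no branch lengths, no whitespace) into nested tuples. The function names are illustrative and this is not a full Newick parser:

```python
def parse_newick(text: str):
    """Parse a topology-only Newick string into nested tuples of tip names."""
    text = text.strip().rstrip(";")   # the trailing ";" of full Newick is optional here
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                          # skip "("
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1                      # skip ","
                children.append(parse_node())
            pos += 1                          # skip ")"
            return tuple(children)
        start = pos                           # a tip: read up to the next delimiter
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        return text[start:pos]

    return parse_node()


def tips(tree):
    """List the tip names of a parsed tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in tips(child)]


tree = parse_newick("(((1,2),((3,4),5)),6)")
print(tree)
print(tips(tree))
```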

Comments on Trees

•Trees give insights into underlying data

•Identical trees can appear differently depending upon the method of display

•Information may be lost when creating the tree. The tree is not the underlying data.

[Figure: trees with tips A, B and C drawn in different layouts]

Different kinds of trees can be used to depict different aspects of evolutionary history

1. Cladogram: simply shows relative recency of common ancestry

2. Additive trees: a cladogram with branch lengths, also called phylograms and metric trees

3. Ultrametric trees (dendrograms): a special kind of additive tree in which the tips of the tree are all equidistant from the root

Making trees according to morphological features

Ridley New Scientist (Dec. 1983) 100, 647-51

A - GCTTGTCCGTTACGAT
B - ACTTGTCTGTTACGAT
C - ACTTGTCCGAAACGAT
D - ACTTGACCGTTTCCTT
E - AGATGACCGTTTCGAT
F - ACTACACCCTTATGAG

Given a multiple alignment, how do we construct the tree?

Distance methods

General Method:
• Evolutionary distances are computed for all pairs of taxa.
• A phylogenetic tree is constructed by considering the relationships among these distance data (fitting a tree to the matrix).

Logic: Evolutionary distance is a tree metric and hence defines a tree.

Methods we’ll talk about:
• UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
• Neighbor Joining

Distance methods

Metric distances must obey 4 rules:

Non-negativity: d(a,b) >= 0
Symmetry: d(a,b) = d(b,a)
Triangle Inequality: d(a,c) <= d(a,b) + d(b,c)
Distinctness: d(a,b) = 0 iff a = b
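The four rules can be checked mechanically on any distance matrix. The sketch below is illustrative only: the helper names are invented here, and the example matrix is the lower-triangular UPGMA matrix used later in these slides.

```python
from itertools import permutations


def full_matrix(labels, lower_rows):
    """Expand a lower-triangular matrix (rows for the 2nd, 3rd, ... label) into d[a][b]."""
    d = {a: {b: 0 for b in labels} for a in labels}
    for i, row in enumerate(lower_rows, start=1):
        for j, value in enumerate(row):
            d[labels[i]][labels[j]] = d[labels[j]][labels[i]] = value
    return d


def is_metric(labels, d):
    """Check non-negativity, symmetry, distinctness and the triangle inequality."""
    for a in labels:
        for b in labels:
            if d[a][b] < 0 or d[a][b] != d[b][a]:
                return False
            if (d[a][b] == 0) != (a == b):
                return False
    return all(d[a][c] <= d[a][b] + d[b][c] for a, b, c in permutations(labels, 3))


labels = "ABCDEF"
d = full_matrix(labels, [[2], [4, 4], [6, 6, 6], [6, 6, 6, 4], [8, 8, 8, 8, 8]])
print(is_metric(labels, d))
```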

Ultrametric Trees

[Figure: example ultrametric tree with tips a, b and c]

Construction of a distance tree using clustering with the Unweighted Pair Group Method with Arithmetic Mean (UPGMA)

A - GCTTGTCCGTTACGAT
B - ACTTGTCTGTTACGAT
C - ACTTGTCCGAAACGAT
D - ACTTGACCGTTTCCTT
E - AGATGACCGTTTCGAT
F - ACTACACCCTTATGAG

First, construct a distance matrix:

      A  B  C  D  E
 B    2
 C    4  4
 D    6  6  6
 E    6  6  6  4
 F    8  8  8  8  8

From http://www.icp.ucl.ac.be/~opperd/private/upgma.html

UPGMA

Choose the most similar pair, cluster them together and calculate the new distance matrix.

First round

A and B are the most similar pair (distance 2), so they are clustered and their distances to the other OTUs are averaged:

dist(A,B),C = (distAC + distBC) / 2 = 4
dist(A,B),D = (distAD + distBD) / 2 = 6
dist(A,B),E = (distAE + distBE) / 2 = 6
dist(A,B),F = (distAF + distBF) / 2 = 8

      A,B  C  D  E
 C    4
 D    6    6
 E    6    6  4
 F    8    8  8  8

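Second round

D and E (distance 4) are clustered:

      A,B  C  D,E
 C    4
 D,E  6    6
 F    8    8  8

Third round

(A,B) and C (distance 4) are clustered:

      AB,C  D,E
 D,E  6
 F    8     8

Fourth round

(AB,C) and (D,E) (distance 6) are clustered:

      ABC,DE
 F    8

Fifth round

F joins (ABC,DE) at distance 8, which completes the tree.

Note that this method identifies the root of the tree.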
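The rounds above can be sketched in a few lines of Python. This is an illustrative implementation written for these notes, not the code behind the slides. Clusters are named by their Newick-style string, and ties (such as the two distance-4 pairs in the second round) are broken by order of appearance, so the merge order may differ from the slides while the final tree is the same.

```python
from itertools import combinations


def upgma(labels, lower_rows):
    """Return the list of (merged cluster, node height) produced by UPGMA."""
    dist = {}
    for i, row in enumerate(lower_rows, start=1):
        for j, value in enumerate(row):
            dist[frozenset((labels[i], labels[j]))] = value
    sizes = {name: 1 for name in labels}      # cluster name -> number of OTUs in it
    merges = []
    while len(sizes) > 1:
        # choose the most similar pair of clusters
        a, b = min(combinations(sizes, 2), key=lambda pair: dist[frozenset(pair)])
        height = dist[frozenset((a, b))] / 2  # the new node sits halfway between them
        merged = f"({a},{b})"
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        for other in sizes:
            # size-weighted mean = arithmetic mean over all original OTU pairs
            dist[frozenset((merged, other))] = (
                size_a * dist[frozenset((a, other))]
                + size_b * dist[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
        merges.append((merged, height))
    return merges


merges = upgma("ABCDEF", [[2], [4, 4], [6, 6, 6], [6, 6, 6, 4], [8, 8, 8, 8, 8]])
for cluster, height in merges:
    print(height, cluster)
```

The last cluster printed is the rooted tree, with the root placed at half the largest distance.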

http://www.genpat.uu.se/mtDB/

A tree of human mitochondria sequences

• The mitochondrial genome has 16,500 base-pairs.

• In 2000, Gyllensten and colleagues sequenced the mitochondrial genomes of 53 people of diverse geographical, racial and linguistic backgrounds.

• A molecular clock seems to hold: these sequences diverge at a rate of 1.7 × 10^-8 substitutions per site per year.

Ingman, M., Kaessmann, H., Pääbo, S. & Gyllensten, U. (2000) Nature 408: 708-713.

The deepest branches lead exclusively to sub-Saharan mtDNAs, with the second branch containing both Africans and non-Africans.

sub-Sahara mtDNA

A tree of 86 mitochondrial sequences. Downloaded from http://www.genpat.uu.se/mtDB/sequences.html and analyzed using MEGA, method: UPGMA

Rooting the tree with an outgroup

Ingman, M., Kaessmann, H., Pääbo, S. & Gyllensten, U. (2000) Nature 408: 708-713.

Root

Outgroup

Phylogeny based upon the molecular clock

• Evidence for a human mitochondrial origin in Africa: African sequence diversity is twice as large as that of non-Africans.

• Gyllensten and colleagues estimate that the divergence of Africans and non-Africans occurred about 52,000 ± 28,000 years ago.

Ingman, M., Kaessmann, H., Pääbo, S. & Gyllensten, U. (2000) Nature 408: 708-713.

UPGMA assumes a molecular clock

• The UPGMA clustering method is very sensitive to unequal evolutionary rates (it assumes that the evolutionary rate is the same for all branches).

• Clustering works only if the data are ultrametric.

• Ultrametric distances are defined by the satisfaction of the 'three-point condition'.

The three-point condition: for any three taxa A, B and C, the two greatest distances are equal.

UPGMA fails when rates of evolution are not constant

A tree in which the evolutionary rates are not equal:

      A   B   C   D   E
 B    5
 C    4   7
 D    7  10   7
 E    6   9   6   5
 F    8  11   8   9   8

From http://www.icp.ucl.ac.be/~opperd/private/upgma.html

(Neighbor joining will get the right tree in this case.)
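A short check of the three-point condition on the two distance matrices from these slides shows why UPGMA is safe on the first and not on the second. The helper names are illustrative:

```python
from itertools import combinations


def full_matrix(labels, lower_rows):
    """Expand a lower-triangular matrix into a symmetric lookup d[a][b]."""
    d = {a: {b: 0 for b in labels} for a in labels}
    for i, row in enumerate(lower_rows, start=1):
        for j, value in enumerate(row):
            d[labels[i]][labels[j]] = d[labels[j]][labels[i]] = value
    return d


def is_ultrametric(labels, d):
    """Three-point condition: in every triple the two greatest distances are equal."""
    for a, b, c in combinations(labels, 3):
        _, middle, greatest = sorted((d[a][b], d[a][c], d[b][c]))
        if middle != greatest:
            return False
    return True


labels = "ABCDEF"
clock_like = full_matrix(labels, [[2], [4, 4], [6, 6, 6], [6, 6, 6, 4], [8, 8, 8, 8, 8]])
unequal = full_matrix(labels, [[5], [4, 7], [7, 10, 7], [6, 9, 6, 5], [8, 11, 8, 9, 8]])

print(is_ultrametric(labels, clock_like), is_ultrametric(labels, unequal))

# UPGMA would start from the closest pair of the second matrix
first_pair = min(combinations(labels, 2), key=lambda p: unequal[p[0]][p[1]])
print(first_pair)
```

In the second matrix the closest pair is A and C, so UPGMA would cluster those two first.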

Neighbors

[Figure: unrooted tree with tips A, B, C and D; terminal branch lengths a, b, c and d; internal branch length x]

A and B are neighbors because they are connected through a single internal node.

C and D are also neighbors, but A and D are not neighbors.

The Four Point Condition

[Figure: unrooted tree with tips A, B, C and D; terminal branch lengths a, b, c and d; internal branch length x]

dAC + dBD = dAD + dBC = a + b + c + d + 2x = dAB + dCD + 2x

dAB + dCD < dAC + dBD
dAB + dCD < dAD + dBC

(left-hand side: neighbors; right-hand side: non-neighbors)

The 4-point condition can be used to identify neighbors. It basically states that neighbors are closer than non-neighbors.
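As a sketch of how the condition is used, the code below computes the three pairing sums for a quartet, picks the pairing with the smallest sum as the neighbors, and tests whether a whole matrix is additive (the two largest sums equal for every quartet). The matrix is the unequal-rates matrix from the UPGMA slides; function names are illustrative.

```python
from itertools import combinations


def full_matrix(labels, lower_rows):
    """Expand a lower-triangular matrix into a symmetric lookup d[a][b]."""
    d = {a: {b: 0 for b in labels} for a in labels}
    for i, row in enumerate(lower_rows, start=1):
        for j, value in enumerate(row):
            d[labels[i]][labels[j]] = d[labels[j]][labels[i]] = value
    return d


def pairing_sums(d, a, b, c, e):
    """The three ways to split four taxa into two pairs, with their distance sums."""
    return {
        ((a, b), (c, e)): d[a][b] + d[c][e],
        ((a, c), (b, e)): d[a][c] + d[b][e],
        ((a, e), (b, c)): d[a][e] + d[b][c],
    }


def neighbors(d, a, b, c, e):
    """The pairing with the smallest sum groups the neighbors."""
    sums = pairing_sums(d, a, b, c, e)
    return min(sums, key=sums.get)


def is_additive(labels, d):
    """Four-point condition for every quartet: the two largest sums are equal."""
    for quartet in combinations(labels, 4):
        _, middle, largest = sorted(pairing_sums(d, *quartet).values())
        if middle != largest:
            return False
    return True


labels = "ABCDEF"
unequal = full_matrix(labels, [[5], [4, 7], [7, 10, 7], [6, 9, 6, 5], [8, 11, 8, 9, 8]])
print(neighbors(unequal, "A", "B", "C", "D"))
print(is_additive(labels, unequal))
```

For the quartet A, B, C, D the sums are 12, 14 and 14, so A pairs with B even though A and C have the smallest single distance.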

Neighbor Joining (Saitou and Nei, 1987)

An algorithm for finding the shortest tree.

Start with a star (no hierarchical structure).

[Figure: star tree; the accompanying equation gives the length of the tree in terms of the pairwise distances and the number of OTUs]
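The slides do not spell out the selection rule, so the sketch below uses the commonly cited form of the neighbor-joining criterion, Q(i,j) = (n - 2) d(i,j) - sum_k d(i,k) - sum_k d(j,k), and shows only the first step: choosing which pair to join out of the star. It is applied to the unequal-rates matrix from the UPGMA slides; the helper names are illustrative.

```python
from itertools import combinations


def full_matrix(labels, lower_rows):
    """Expand a lower-triangular matrix into a symmetric lookup d[a][b]."""
    d = {a: {b: 0 for b in labels} for a in labels}
    for i, row in enumerate(lower_rows, start=1):
        for j, value in enumerate(row):
            d[labels[i]][labels[j]] = d[labels[j]][labels[i]] = value
    return d


def q_matrix(labels, d):
    """Neighbor-joining criterion for every pair; the smallest Q is joined first."""
    n = len(labels)
    totals = {a: sum(d[a][b] for b in labels) for a in labels}
    return {
        (a, b): (n - 2) * d[a][b] - totals[a] - totals[b]
        for a, b in combinations(labels, 2)
    }


labels = "ABCDEF"
unequal = full_matrix(labels, [[5], [4, 7], [7, 10, 7], [6, 9, 6, 5], [8, 11, 8, 9, 8]])
q = q_matrix(labels, unequal)
best = min(q.values())
best_pairs = sorted(pair for pair, value in q.items() if value == best)
print(best, best_pairs)
```

On this matrix the smallest Q is shared by the pairs A, B and D, E, whereas the smallest raw distance is between A and C. A full implementation would join one of the best pairs, recompute the distances to the new node, and repeat.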

Character state methods

MAXIMUM PARSIMONY

Logic: Examine each column in the multiple alignment of the sequences. Examine all possible trees and choose among them according to some optimality criterion.

Method we’ll talk about:
• Maximum parsimony

Maximum Parsimony

Simpler hypotheses are preferable to more complicated ones, and ad hoc hypotheses should be avoided whenever possible (Occam’s Razor).

Thus, find the tree that requires the smallest number of evolutionary changes.

    0123456789012345
W - ACTTGACCCTTACGAT
X - AGCTGGCCCTGATTAC
Y - AGTTGACCATTACGAT
Z - AGCTGGTCCTGATGAC

[Figure: tree with tips W, Y, X and Z]

Start by classifying the sites:

             123456789012345678901
Mouse        CTTCGTTGGATCAGTTTGATA
Rat          CCTCGTTGGATCATTTTGATA
Dog          CTGCTTTGGATCAGTTTGAAC
Human        CCGCCTTGGATCAGTTTGAAC
------------------------------------
Invariant    *  * ******** *****
Variant       ** *        *     **
------------------------------------
Informative   **                **
Non-inform.      *        *

Maximum Parsimony

[Figure: the alignment with sites 2, 3 and 5 marked; for each of these sites (Site 5, Site 2, Site 3) the bases of Mouse, Rat, Dog and Human are placed on the three possible unrooted trees, with candidate ancestral bases at the internal nodes]

             123456789012345678901
Mouse        CTTCGTTGGATCAGTTTGATA
Rat          CCTCGTTGGATCATTTTGATA
Dog          CTGCTTTGGATCAGTTTGAAC
Human        CCGCCTTGGATCAGTTTGAAC
Informative   **                **

[Figure: the three possible unrooted trees for Mouse, Rat, Dog and Human, supported by 3, 0 and 1 informative sites respectively]
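The site classification and the tree scores can be reproduced with a short sketch. The number of changes per site is counted here with Fitch's small-parsimony procedure, which is one standard way to do it and is not named on the slides. The order of the three trees in `TREES` is chosen for this example and need not match the order of the figure.

```python
from collections import Counter

ALIGNMENT = {
    "Mouse": "CTTCGTTGGATCAGTTTGATA",
    "Rat":   "CCTCGTTGGATCATTTTGATA",
    "Dog":   "CTGCTTTGGATCAGTTTGAAC",
    "Human": "CCGCCTTGGATCAGTTTGAAC",
}
# The three unrooted four-taxon trees, written as a pair of pairs.
TREES = [
    (("Mouse", "Rat"), ("Dog", "Human")),
    (("Mouse", "Dog"), ("Rat", "Human")),
    (("Mouse", "Human"), ("Rat", "Dog")),
]


def classify(column):
    """Invariant, informative (at least two bases each seen twice) or non-informative."""
    counts = Counter(column)
    if len(counts) == 1:
        return "invariant"
    repeated = sum(1 for n in counts.values() if n >= 2)
    return "informative" if repeated >= 2 else "non-informative"


def fitch_changes(tree, bases):
    """Minimum number of changes one site needs on a tree (Fitch's procedure)."""
    def visit(node):
        if isinstance(node, str):
            return {bases[node]}, 0
        (left, n_left), (right, n_right) = visit(node[0]), visit(node[1])
        if left & right:
            return left & right, n_left + n_right
        return left | right, n_left + n_right + 1   # disjoint sets cost one change
    return visit(tree)[1]


names = list(ALIGNMENT)
columns = list(zip(*ALIGNMENT.values()))
site_classes = Counter(classify(col) for col in columns)
tree_lengths = [
    sum(fitch_changes(tree, dict(zip(names, col))) for col in columns)
    for tree in TREES
]
print(dict(site_classes))
print(tree_lengths)
```

The tree with the smallest total is the most parsimonious one. Invariant and non-informative sites add the same number of changes to every tree, which is why only the informative sites decide between them.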

Maximum Parsimony

The situation is more complicated when there are more than four units.

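[Figure: two six-taxon examples. First tree: tips C, T, G, T, A, A with candidate ancestral sets (CT), (GT), (AGT), T, (AT). Second tree: tips T, T, A, A, G, C with candidate ancestral sets T, A, (AG), (TAG), (TAGC)]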

Maximum Parsimony

Problems with maximum parsimony:
• Only uses “informative” sites
• Long branches “attract”

Maximum Likelihood Analysis

•Same as Maximum Parsimony except that the different nucleotide substitutions are not considered to have equal probability.

•All possible unrooted trees are evaluated. (Same for Parsimony)

•Each column of the alignment is processed. (Same for Parsimony)

•The substitution A -> T will have a different probability than the substitution G -> C.

•Start with a frequency distribution table that specifies the probability of one base being substituted for another base.

•See probabilities of nucleotide substitution. (Table 6.5 pg 275)

•The probability that the unrooted tree predicts each column of the alignment is calculated.

•Probabilities for each column are multiplied together for each tree (equivalently, their logarithms are summed).

•The unrooted tree with the highest probability is chosen.

Maximum Likelihood Example

•Four sequences are compared (w, x, y and z)

•All unrooted trees are shown

•In this example we will examine the first unrooted tree.

Sequence w ACGCGTTGGG
Sequence x ACGCGTTGGG
Sequence y ACGCAATGAA
Sequence z ACACAGGGAA

[Figure: the three possible unrooted trees for w, x, y and z]

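Maximum Likelihood Example Continued

•L(Tree x) = L0 * L1 * L2 * L3 * L4 * L5 * L6

•L0: base probability of the nucleotide at node 0 (0.25)

•L1: probability of the nucleotide changing from the value at node 0 to the value at node 1.

•L2: probability of the nucleotide changing from the value at node 0 to the value at node 2.

•L3: probability of the nucleotide changing from the value at node 1 to the value at its leaf (T).

•L4, L5, L6: probability of the nucleotide changing to the value at the leaf.

[Figure: the first tree drawn with root node 0 and internal nodes 1 and 2; the leaves w, x, y and z carry the bases T, T, A and G of the column; the branches are labelled L1 to L6 and the root L0]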

Maximum Likelihood Example Continued

•There are 64 likelihood trees to evaluate: (number of bases) ^ (number of internal nodes), or 4^3.

•We will show the evaluation of TTG against the first unrooted tree for column TTAG.

•Determine values for L0, … L6. Values are determined by looking up probabilities in the transition probability table.

•Probability of L2 is T->G

•Probability of L5 is G->A

•Probability of L3 is T->T

•Determine combined probability L0 * L1 * L2 * … * L6

[Figure: the same tree with nodes 0, 1 and 2 assigned T, T and G]

Maximum Likelihood Example Continued

•Determine probability for combination TGG

•Determine probability for the other 62 combinations.

•Sum all the trees together. L(Tree) = L(Tree1) + L(Tree2) + … + L(Tree64)

•Move to next column and repeat the same procedure.

•Once all columns are complete, multiply the column probabilities together (or sum their logarithms). This is the likelihood of the first unrooted tree.

•Continue this process for the other unrooted trees.

•Pick the unrooted tree with the highest probability. This is the most likely unrooted tree.

[Figure: the same tree with nodes 0, 1 and 2 assigned T, G and G]
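The procedure can be sketched by brute force for one tree. Two things in the code are assumptions made only for illustration: the substitution table (0.7 for no change, 0.1 for each change, so every row sums to 1), which stands in for the textbook table the slides refer to, and the layout of the tree (w and x attached to node 1, y and z attached to node 2), which matches the T->T and G->A examples above.

```python
import math
from itertools import product

BASES = "ACGT"
P_SAME, P_CHANGE = 0.7, 0.1   # hypothetical table; 0.7 + 3 * 0.1 = 1


def p(a, b):
    """Probability that base a is replaced by base b along one branch."""
    return P_SAME if a == b else P_CHANGE


def site_likelihood(w, x, y, z):
    """Sum L0 * L1 * ... * L6 over the 4^3 assignments of nodes 0, 1 and 2."""
    total = 0.0
    for n0, n1, n2 in product(BASES, repeat=3):
        total += (
            0.25                      # L0: base probability at node 0
            * p(n0, n1) * p(n0, n2)   # L1, L2: node 0 to nodes 1 and 2
            * p(n1, w) * p(n1, x)     # L3, L4: node 1 to leaves w and x
            * p(n2, y) * p(n2, z)     # L5, L6: node 2 to leaves y and z
        )
    return total


def log_likelihood(seq_w, seq_x, seq_y, seq_z):
    """Columns are treated as independent, so their log-likelihoods add up."""
    columns = zip(seq_w, seq_x, seq_y, seq_z)
    return sum(math.log(site_likelihood(*col)) for col in columns)


print(site_likelihood("T", "T", "A", "G"))
print(log_likelihood("ACGCGTTGGG", "ACGCGTTGGG", "ACGCAATGAA", "ACACAGGGAA"))
```

To compare trees, the same calculation would be repeated with the leaves attached as in the other two unrooted trees, and the tree with the highest log-likelihood kept.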

IN VITRO EVOLUTION BY MEANS OF PCR

Conclusion

•Phylogenetic Prediction can be used for more than Evolutionary Distance

–Verification of Taxonomy

–Identification of unknowns

–Techniques work for genetic and non-genetic data (Fatty Acid).

•Use multiple methods for verification

–Pick at least two different types of methods from Parsimony, Distance and Likelihood.

–If the analyses are in agreement, there is a higher level of confidence that the analysis is correct.

BOOTSTRAPPING

How confident are we in this tree?

A statistical method that can be used to place confidence intervals on phylogenies

Bootstrapping

Resampling from the Data

Original data:

human_myoglobin        -GLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHL ...
pig_myoglobin          -GLSDGEWQLVLNVWGKVEADVAGHGQEVLIRLFKGHPETLEKFDKFKHL ...
horse_myoglobin        -GLSDGEWQQVLNVWGKVEADIAGHGQEVLIRLFTGHPETLEKFDKFKHL ...
common_seal_myoglobin  -GLSEGEWQLVLNVWGKVEADLAGHGQDVLIRLFKGHPETLEKFDKFKHL ...
sperm_whale_myoglobin  MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHL ...
sea_hare_myoglobin     -SLSAAEADLAGKSWAPVFANKDANGDAFLVALFEKFPDSANFFADFKG- ...

Pick with replacement

Resampled data number 1:

human_myoglobin        LQKWDQKHNVHTEFGAEELQGDKLSWKKLDQGKKVVKKELGLDEDEWLGE
pig_myoglobin          LQKWDQKHNVHTEFGAEELQGDKLSWKKLDQGKKVVKKELGLDEDEWLGE
horse_myoglobin        LQKWDQTHNVHTEFGAEELQGDKLSWKTLDQGKKVVTKELGQDEDEWLGE
common_seal_myoglobin  LQKWEQKHNVHTEFGADELQGDKLSWKKLDQGKKVVKKELGLDEDDWLGE
sperm_whale_myoglobin  LQRWEQKHHVHTEFAADELQGDKLSWKKLDQGRKVVKKELGLDEDDWLGE
sea_hare_myoglobin     LDDWADENKSNSNFAAAELDANFASAPELNDGDKVAEKFAALNNAAWAAN

Repeat 99 more times (or 999, 9,999, ...)
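Picking with replacement means drawing whole alignment columns at random until the resampled alignment is as long as the original. A minimal sketch, using the short four-species alignment from the parsimony slides to keep it readable; the function name and the fixed seed are illustrative choices:

```python
import random

ALIGNMENT = {
    "Mouse": "CTTCGTTGGATCAGTTTGATA",
    "Rat":   "CCTCGTTGGATCATTTTGATA",
    "Dog":   "CTGCTTTGGATCAGTTTGAAC",
    "Human": "CCGCCTTGGATCAGTTTGAAC",
}


def bootstrap_alignment(alignment, rng):
    """Draw as many columns as the alignment has, with replacement."""
    n_sites = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    # the same column indices are used for every sequence, so columns stay intact
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}


rng = random.Random(0)          # fixed seed only to make the example repeatable
replicates = [bootstrap_alignment(ALIGNMENT, rng) for _ in range(100)]
for name, seq in replicates[0].items():
    print(name, seq)
```

Each replicate would then be given to the same tree-building method as the original data.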

Given the following tree, estimate the confidence of the two internal branches.

[Figure: tree of Chimpanzee, Gorilla, Human, Orang-utan and Gibbon]

[Figure: the three alternative topologies recovered from the resampled data, found in 41/100, 28/100 and 31/100 of the trees]

Estimating Confidence from the Resamplings

1. Of the 100 trees:

In 100 of the 100 trees, gibbon and orang-utan are split from the rest.

In 41 of the 100 trees, chimp and gorilla are split from the rest.

2. Upon the original tree we superimpose bootstrap values:

[Figure: the original tree with the bootstrap values 100 and 41 on its two internal branches]
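Counting the support can be sketched as follows. Each replicate tree is reduced to the set of groups it separates from the rest. The 41 trees with chimp and gorilla together come from the slide, while the groupings assigned to the other 28 and 31 trees are assumed here purely for illustration.

```python
def make_tree(*groups):
    """Represent a tree by the groups of taxa that its internal branches split off."""
    return {frozenset(group) for group in groups}


outer = ("Gibbon", "Orang-utan")
replicates = (
    [make_tree(outer, ("Chimpanzee", "Gorilla"))] * 41
    + [make_tree(outer, ("Human", "Chimpanzee"))] * 28   # assumed grouping
    + [make_tree(outer, ("Human", "Gorilla"))] * 31      # assumed grouping
)


def support(trees, group):
    """Percentage of replicate trees that contain the given group."""
    hits = sum(1 for tree in trees if frozenset(group) in tree)
    return 100 * hits // len(trees)


print(support(replicates, outer))
print(support(replicates, ("Chimpanzee", "Gorilla")))
```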

THE TREE OF LIFE

Relationships between 16S ribosomal RNAs

[Figure labels: distant relationships, close relationships; bacteria, eukaryotes, archaea]

The three domains of Life as identified by phylogenetic analysis of the highly conserved 16S ribosomal RNA

(Woese and Fox 1977)

Where is the root of the tree of life?

(by definition there is no outgroup)

An ancient gene duplication can root a tree

Graur & Li. Fundamentals of Molecular Evolution (1999)

[Figure labels: Gene duplication; Speciation of 3 and 1-2; Speciation of 1 and 2; Root of 1,2,3; Outgroups for A2; Outgroups for A1]

Graur & Li. Fundamentals of Molecular Evolution (1999)

The root of the tree of life as inferred from EF-Tu and EF-G

Both trees show Archaea and Eucarya as sister taxa

Mn-dependent transcriptional regulator

Horizontal Gene Transfer

(Tatusov, 1996)

eubacteria

archaea

What is the origin of the mitochondria?

http://www.mitomap.org/

The endosymbiotic theory

The evidence:

• Both mitochondria and chloroplasts can arise only from preexisting mitochondria and chloroplasts. They cannot be formed in a cell that lacks them because nuclear genes encode only some of the proteins of which they are made.

• Both mitochondria and chloroplasts have their own genome.

• Both genomes consist of a single circular molecule of DNA.

• There are no histones associated with the DNA.

The Mitochondria sit with the proteobacteria in the tree of life

Gray MW Nature. 1998 Nov 12;396(6707):109-10.

mitochondrial (MT)

Small-subunit (SSU) ribosomal RNA tree

mitochondrion

chloroplast

Lack mitochondria (?)

Andersson SG Nature 1998 Nov 12;396(6707):133-40

The genome sequence of Rickettsia prowazekii and the origin of mitochondria.

Mitochondrial ribosomal proteins are most similar to those of R. prowazekii

Andersson SG Nature 1998 Nov 12;396(6707):133-40

Mitochondrial proteins involved in ATP synthesis are most similar to those of R. prowazekii

Andersson SG Nature 1998 Nov 12;396(6707):133-40

Mitochondria derive from α-purple bacteria

Chloroplasts derive from cyanobacteria

Graur & Li. Fundamentals of Molecular Evolution (1999)

The tree of life with mitochondria and chloroplast endosymbiotic events

(Doolittle, 1999)

Horizontal transfer is a dominant feature of the “tree” of life

(Doolittle, 1999)
