Intro to Phylogenetic Trees
Lecture 6
Sections 7.1, 7.2 in Durbin et al.; Chapter 17 in Gusfield
Slides by Shlomo Moran and by Ydo Wexler. Modifications by Benny Chor
Evolution
The Tree of Life
Source: Alberts et al.
Tree of life: a better picture
Primate evolution
Historical Note
• Until the mid-1950s, phylogenies were constructed by experts based on their opinion (subjective criteria).
• Since then, the focus has been on objective criteria for constructing phylogenetic trees.
  • Thousands of articles in the last decades.
• Important for many aspects of biology:
  • Classification
  • Understanding biological mechanisms
Morphological vs. Molecular
• Classical phylogenetic analysis uses morphological features: number of legs, lengths of legs, etc.
• Modern biological methods allow the use of molecular features:
  • Gene sequences
  • Protein sequences
• Analysis is based on homologous sequences (e.g., globins) in different species.
Morphological topology
[Figure: tree topology based on morphological features; labels not recoverable]
From sequences to a phylogenetic tree
Rat      QEPGGLVVPPTDA
Rabbit   QEPGGMVVPPTDA
Gorilla  QEPGGLVVPPTDA
Cat      REPGGLVVPPTEG
Mitochondrial topology
[Figure: tree over the following taxa; group labels not recoverable]
Donkey, Horse, Indian rhino, White rhino, Grey seal, Harbor seal, Dog, Cat, Blue whale, Fin whale, Sperm whale, Hippopotamus, Sheep, Cow, Alpaca, Pig
Little red flying fox, Ryukyu flying fox, Horseshoe bat, Japanese pipistrelle, Long-tailed bat, Jamaican fruit-eating bat
Asiatic shrew, Long-clawed shrew
Mole, Small Madagascar hedgehog, Aardvark, Elephant, Armadillo, Rabbit, Pika, Tree shrew, Bonobo, Chimpanzee, Man, Gorilla, Sumatran orangutan, Bornean orangutan, Common gibbon, Barbary ape, Baboon
White-fronted capuchin, Slow loris, Squirrel, Dormouse, Cane-rat, Guinea pig, Mouse, Rat, Vole, Hedgehog, Gymnure, Bandicoot, Wallaroo, Opossum, Platypus
Nuclear topology
Round Eared Bat
Flying Fox
Hedgehog
Mole
Pangolin
Whale
Hippo
Cow
Pig
Cat
Dog
Horse
Rhino
Rat
Capybara
Rabbit
Flying Lemur
Tree Shrew
Human
Galago
Sloth
Hyrax
Dugong
Elephant
Aardvark
Elephant Shrew
Opossum
Kangaroo
Theory of Evolution
• Basic idea:
  • Speciation events lead to the creation of different species.
  • Speciation is caused by physical separation into groups where different genetic variants become dominant.
• Any two species share a (possibly distant) common ancestor.
Phylogenetic trees
• Leaves: current day species
• Nodes: hypothetical most recent common ancestors
• Edge lengths: “time” from one speciation to the next
[Figure: tree with leaves Aardvark, Bison, Chimp, Dog, Elephant]
Types of Trees
A natural model to consider is that of rooted trees.
[Figure: rooted tree; the root is labeled "Common Ancestor"]
Types of trees
An unrooted tree represents the same phylogeny without the root node.
Depending on the model, data from current day species does not distinguish between different placements of the root.
[Figure: Tree a, Tree b, Tree c]
Positioning Roots in Unrooted Trees
• We can estimate the position of the root by introducing an outgroup:
  • a set of species that are definitely distant from all the species of interest
[Figure: tree over Aardvark, Bison, Chimp, Dog, Elephant, with Falcon as the outgroup and the proposed root marked]
Type of Reconstruction
• Distance-based
  • Input is a matrix of distances between species.
  • A distance can be the fraction of residues on which two sequences disagree, or the alignment score between them, or …
• Character-based
  • Examine all characters (AAs or DNA bases).
  • Do not “summarize” sequences or pairs of sequences by a single number.
  • Major methods: Parsimony; Likelihood.
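As an illustration of the first distance mentioned above (fraction of disagreeing residues), here is a minimal Python sketch. The function names `p_distance` and `distance_matrix` are made up for this example, and the input is the four aligned fragments from the "From sequences to a phylogenetic tree" slide; real analyses would usually start from a proper alignment and may correct for multiple substitutions.

```python
from itertools import combinations

def p_distance(s: str, t: str) -> float:
    """Fraction of aligned positions at which the two sequences disagree."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(a != b for a, b in zip(s, t))
    return mismatches / len(s)

# Aligned protein fragments from the slide (illustrative input only).
SEQS = {
    "Rat": "QEPGGLVVPPTDA",
    "Rabbit": "QEPGGMVVPPTDA",
    "Gorilla": "QEPGGLVVPPTDA",
    "Cat": "REPGGLVVPPTEG",
}

def distance_matrix(seqs):
    """Symmetric dictionary d[x, y] of pairwise p-distances."""
    names = sorted(seqs)
    d = {(x, x): 0.0 for x in names}
    for x, y in combinations(names, 2):
        d[x, y] = d[y, x] = p_distance(seqs[x], seqs[y])
    return d
```

On these fragments Rat and Gorilla are identical, Rat and Rabbit differ at one position out of 13, and Rat and Cat at three.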
Two Approaches to Tree Construction
We start with distance-based methods, considering the following question:
Given a set of species (leaves in a supposed tree), and distances between them, construct a phylogeny which best “fits” the distances.
Exact solution: Additive sets
Given a set M of L objects with an L×L distance matrix:
• d(i,i) = 0, and for i ≠ j, d(i,j) > 0.
• d(i,j) = d(j,i).
• For all i,j,k it holds that d(i,k) ≤ d(i,j) + d(j,k).
Can we construct a weighted tree which realizes these distances?
Additive Distances (cont.)
We say that the set of distances M over L objects is additive if there is a tree T, with L of its nodes corresponding to the L objects and with positive weights on the edges, such that for all i,j: d(i,j) = dT(i,j), the length of the path from i to j in T.
Note: Sometimes the tree is required to be binary, and then the edge weights are only required to be non-negative.
Distances for three objects are always additive
For L=3, there is always a (unique) tree with one internal node (by simple linear algebra):
d(i,j) = a + b
d(i,k) = a + c
d(j,k) = b + c
[Figure: leaves i, j, k joined at internal node m by edges of lengths a, b, c]
Thus
c = d(k,m) = (1/2)[d(i,k) + d(j,k) - d(i,j)] ≥ 0
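The three equations above can be solved directly. The following is a small sketch (the function name `three_leaf_tree` is hypothetical) that returns the three edge lengths; the triangle inequality is what guarantees they are non-negative.

```python
def three_leaf_tree(d_ij: float, d_ik: float, d_jk: float):
    """Edge lengths (a, b, c) from leaves i, j, k to the internal node m.

    Solves d(i,j) = a + b, d(i,k) = a + c, d(j,k) = b + c.
    """
    a = (d_ij + d_ik - d_jk) / 2
    b = (d_ij + d_jk - d_ik) / 2
    c = (d_ik + d_jk - d_ij) / 2
    return a, b, c
```

For example, distances 15, 25 and 30 give edge lengths 5, 10 and 20.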
How about four objects?
Not all distance matrices with 4 objects are additive, even if they satisfy the triangle inequality. E.g., no tree realizes these distances:
     i  j  k  l
i    0  2  2  2
j       0  2  2
k          0  3
l             0
The Four Points Condition
Theorem: A set M of distances is additive iff any subset of four objects can be labeled i,j,k,l so that:
d(i,k) + d(j,l) = d(i,l) + d(k,j) ≥ d(i,j) + d(k,l)
[Figure: tree on the four leaves i, j, k, l]
We call (i,j),(k,l) the “split” of {i,j,k,l}.
Proof: By inspecting the figure, additivity ⇒ 4 points condition.
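The condition says that, of the three pairwise sums for a quartet, the two largest must be equal. A minimal Python sketch of the brute-force check over all quartets follows; the function names are hypothetical, the input is assumed to be a symmetric matrix given as a list of lists, and a small tolerance is used for floating-point data. The first test matrix is the four-object counterexample from the lecture; the second is a five-object matrix that is realized by a tree.

```python
from itertools import combinations

def four_point_ok(d, quartet, tol=1e-9):
    """True if the two largest of the three pairwise sums are equal."""
    i, j, k, l = quartet
    sums = sorted([d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]])
    return abs(sums[2] - sums[1]) <= tol

def is_additive(d, tol=1e-9):
    """Check the four points condition on every quartet: O(L^4) quartets."""
    n = len(d)
    return all(four_point_ok(d, q, tol) for q in combinations(range(n), 4))

# Objects i, j, k, l: all distances are 2 except d(k,l) = 3.
NON_ADDITIVE = [
    [0, 2, 2, 2],
    [2, 0, 2, 2],
    [2, 2, 0, 3],
    [2, 2, 3, 0],
]

# Objects A..E: distances realized by a tree.
ADDITIVE = [
    [0, 2, 7, 4, 7],
    [2, 0, 7, 4, 7],
    [7, 7, 0, 7, 6],
    [4, 4, 7, 0, 7],
    [7, 7, 6, 7, 0],
]
```

For the counterexample the three sums are 5, 4 and 4, so the two largest differ and the check fails, even though the triangle inequality holds.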
4P Condition ⇒ Additivity
Induction on the number of objects, L. For L ≤ 3 the condition is empty and a tree exists. Consider L=4. Denote
B = d(i,k) + d(j,l) = d(i,l) + d(j,k) ≥ d(i,j) + d(k,l) = A
Let y = (B - A)/2 ≥ 0 (the length of the internal edge).
Then the tree should look as follows. We want to find the distances a, b, c and f.
[Figure: leaves i and j attached to internal node m by edges a and b; leaves k and l attached to internal node n by edges c and f; m and n joined by an edge of length y]
Again, an instance of linear algebra.
Tree construction for L=4
Construct the tree by the given distances as follows:
1. Construct a tree for {i,j,k}, with internal vertex m.
2. Add vertex n on the edge (m,k), with d(m,n) = y.
3. Add edge (n,l), with c + f = d(k,l).
[Figure: the same tree, with edge lengths a, b, c, f and internal edge y]
Remains to prove: d(i,l) = dT(i,l) and d(j,l) = dT(j,l).
Proof for L=4
By the 4 points condition and the definition of y:
d(i,l) = d(i,j) + d(k,l) + 2y - d(k,j) = a + y + f = dT(i,l)
(the middle equality holds since d(i,j), d(k,l) and d(k,j) are realized by the tree).
d(j,l) = dT(j,l) is proved similarly.
Splits Approach to Proof: Intuition
[Figure: quartet with leaves i, j, k, l]
Suppose the 4 points condition holds with strict inequality, >, for every four leaves.
This defines a (2,2) partition of every quartet. We can use the 4 points condition to show that all quartets are consistent.
This in turn is used to construct the tree (homework assignment).
Finally, show that the tree distances agree with the original distances using linear algebra.
Linear Algebraic Approach: Induction
• Remove the L-th object from the set.
• By induction, there is a tree, T’, for {1,2,…,L-1}.
• For each pair of labeled nodes (i,j) in T’, let aij, bij, cij be defined by the following figure:
[Figure: leaves i, j and L joined at internal node mij by edges of lengths aij, bij, cij]
cij = (1/2)[d(i,L) + d(j,L) - d(i,j)]
Induction step:
• Pick i and j that minimize cij.
• T is constructed by adding L (and possibly mij) to T’, as in the figure. Then d(i,L) = dT(i,L) and d(j,L) = dT(j,L).
• Remains to prove: for each k ≠ i,j: d(k,L) = dT(k,L).
[Figure: T’ with L attached at mij on the path from i to j]
Induction step (cont.)
• Let k ≠ i,j be an arbitrary node in T’, and let n be the branching point of k in the path from i to j.
• By the minimality of cij, (i,j),(k,L) is not a split of {i,j,k,L}.
• Assume WLOG that (i,L),(j,k) is a split of {i,j,k,L}.
[Figure: as before, with k attached at node n on the path from i to j]
Induction step (end)
Since (i,L),(j,k) is a split, by the 4 points condition
d(L,k) = d(i,k) + d(L,j) - d(i,j)
d(i,k) = dT(i,k) and d(i,j) = dT(i,j) by induction, and d(L,j) = dT(L,j) by the construction.
Hence d(L,k) = dT(L,k). QED
From Additive Distance to a Tree
By following the proof, the four point condition can be used to construct a tree from a distance matrix, or to decide that there is no such tree (namely, that the distance is not additive).
But this algorithm goes over all quartets, resulting in O(L^4) steps for L species (too slow).
The most popular method for constructing trees for additive sets uses the neighbor joining approach.
Constructing additive trees: The neighbor joining problem
• Let i, j be sisters (neighboring leaves) in a tree, let k be their father, and let m be any other vertex.
• Using the equation
  d(k,m) = [d(i,m) + d(j,m) - d(i,j)]/2
  we can compute the distances from k to all other leaves.
This suggests the following method to construct a tree from an additive distance matrix:
1. Find sisters i,j in the tree.
2. Replace i,j by their father, k, and recursively construct a tree T for the smaller set.
3. Add i,j as children of k in T.
Neighbor Finding
How can we find, from distances alone, a pair of sisters (neighboring leaves)? The closest nodes are not necessarily neighboring leaves.
[Figure: tree with leaves A, B, C, D]
Next, we show a way to find neighbors from distances.
Neighbor Finding: Saitou & Nei method
For a leaf i, let ri = the sum of d(i,m) over all leaves m.
Definition: Let i, j be two leaves (out of L leaves in T). Then their divergence is
D(i,j) = d(i,j) - (ri + rj)/(L - 2)
Theorem (Saitou & Nei): Assume d is additive, with all tree edge weights positive. If D(i,j) is minimal (among all pairs of leaves), then i and j are sister taxa in the tree.
[Figure: tree with leaves i, j, k, l, an internal node m, and subtrees T1, T2]
The proof is rather involved, and will be skipped (no tears pls).
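A minimal sketch of the divergence criterion, assuming a symmetric matrix given as a list of lists with at least three leaves; the function names are hypothetical. The four-leaf data are the distances used in the lecture's sisters-identification example, where the closest pair (A,B) is not a pair of sisters; the five-leaf matrix is additive as well.

```python
def divergences(d):
    """D(i,j) = d(i,j) - (r_i + r_j)/(L - 2) for every pair i < j."""
    n = len(d)
    r = [sum(row) for row in d]  # r_i: sum of distances from leaf i
    return {(i, j): d[i][j] - (r[i] + r[j]) / (n - 2)
            for i in range(n) for j in range(i + 1, n)}

def min_divergence_pair(d):
    """Pair of leaves with minimal divergence (first one found on ties)."""
    div = divergences(d)
    return min(div, key=div.get)

# Leaves A, B, C, D; the sister pairs in the underlying tree are (A,C) and (B,D).
FOUR = [
    [0, 15, 25, 34],
    [15, 0, 30, 31],
    [25, 30, 0, 49],
    [34, 31, 49, 0],
]

# Leaves A..E; the sister pairs are (A,B) and (C,E).
FIVE = [
    [0, 2, 7, 4, 7],
    [2, 0, 7, 4, 7],
    [7, 7, 0, 7, 6],
    [4, 4, 7, 0, 7],
    [7, 7, 6, 7, 0],
]
```

On `FOUR` the smallest raw distance is d(A,B) = 15, yet the minimal divergence (-64) is attained by the two sister pairs. On `FIVE` the minimum (-12) is attained by (C,E).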
A simpler neighbor finding method
Select an arbitrary (fixed) node r.
• For each pair of labeled nodes (i,j), let C(i,j) be defined by the following expression (also see figure):
  C(i,j) = (1/2)[d(i,r) + d(j,r) - d(i,j)]
[Figure: C(i,j) is the length of the path from r to the point where the paths to i and j branch]
Claim: Let i, j be such that C(i,j) is maximized. Then i and j are neighboring leaves.
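Sisters Identification: Example
[Figure: tree with leaves A, B, C, D and edge weights 5, 4, 6, 20, 25]
C(i,j) = (1/2)[d(i,r) + d(j,r) - d(i,j)]
Select arbitrarily r = A.
C(B,C) = (15+25-30)/2 = 5
C(B,D) = (15+34-31)/2 = 8
C(C,D) = (25+34-49)/2 = 5
Claim: Let i, j be such that C(i,j) is maximized. Then i and j are neighboring leaves.
Here C(B,D) is maximal, so B and D are neighboring leaves.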
Neighbor Joining Algorithm
• Set M to contain all leaves, and select a root r. |M| = L.
• If L = 2, return a tree of two vertices.
Iteration:
• Choose i,j such that C(i,j) is maximal.
• Create a new vertex k, and update distances:
  d(i,k) = [d(i,j) + d(i,r) - d(j,r)]/2
  d(j,k) = d(i,j) - d(i,k)
  for each other node m: d(k,m) = (1/2)[d(i,m) + d(j,m) - d(i,j)]
• Remove i,j, and add k to M.
• Recursively construct a tree on the smaller set.
• When done, add i,j as children of k, at distances d(i,k) and d(j,k).
[Figure: sisters i, j with father k, and another node m]
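The steps above can be sketched in Python as follows. This is one possible reading of the slide, not a reference implementation: the recursion is unrolled into a loop, the distance matrix is assumed to be a dict of dicts (including zero self-distances), new internal vertices get the made-up names `n0`, `n1`, … (assumed not to clash with leaf names), and the tree is returned as a list of weighted edges. `tree_distance` is a helper added only to check the result.

```python
from itertools import combinations

def neighbor_joining(d, root):
    """Build a tree from an additive distance dict d[u][v]; returns (u, v, length) edges."""
    d = {u: dict(row) for u, row in d.items()}  # work on a copy
    edges, fresh = [], 0
    while len(d) > 2:
        candidates = [x for x in d if x != root]
        # Choose i, j maximizing C(i,j) = [d(i,r) + d(j,r) - d(i,j)] / 2.
        i, j = max(combinations(candidates, 2),
                   key=lambda p: (d[root][p[0]] + d[root][p[1]] - d[p[0]][p[1]]) / 2)
        k = f"n{fresh}"  # hypothetical name for the new father vertex
        fresh += 1
        d_ik = (d[i][j] + d[i][root] - d[j][root]) / 2
        d_jk = d[i][j] - d_ik
        edges += [(i, k, d_ik), (j, k, d_jk)]
        # Distances from k to every remaining node m.
        new_row = {m: (d[i][m] + d[j][m] - d[i][j]) / 2 for m in d if m not in (i, j)}
        del d[i], d[j]
        for row in d.values():
            row.pop(i, None)
            row.pop(j, None)
        for m, dist in new_row.items():
            d[m][k] = dist
        new_row[k] = 0.0
        d[k] = new_row
    u, v = d  # exactly two nodes remain
    edges.append((u, v, d[u][v]))
    return edges

def tree_distance(edges, a, b):
    """Length of the path between a and b in the tree given by `edges`."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    stack = [(a, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == b:
            return dist
        stack.extend((nxt, node, dist + w) for nxt, w in adj[node] if nxt != parent)
    raise ValueError("nodes are not connected")

NAMES = "ABCDE"
MATRIX = [
    [0, 2, 7, 4, 7],
    [2, 0, 7, 4, 7],
    [7, 7, 0, 7, 6],
    [4, 4, 7, 0, 7],
    [7, 7, 6, 7, 0],
]
DIST = {a: {b: float(MATRIX[x][y]) for y, b in enumerate(NAMES)}
        for x, a in enumerate(NAMES)}
```

On the additive matrix `MATRIX`, calling `neighbor_joining(DIST, root="A")` yields seven edges whose path lengths reproduce every entry of the matrix. The naive maximum search makes each iteration quadratic, matching the O(L^3) bound for the simple implementation.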
Complexity of Neighbor Joining Algorithm
Naive implementation:
• Initialization: O(L^2) to compute the C(i,j)’s.
• Each iteration:
  • O(L) to update C(i,k) for the new node k and every remaining node i.
  • O(L^2) to find the maximal C(i,j).
• Total of O(L^3).
Using a heap to store the C(i,j)’s:
• Initialization: O(L^2) to compute and heapify the C(i,j)’s.
• Each iteration:
  • O(1) to find the maximal C(i,j).
  • O(L log L) to delete {C(m,i), C(m,j)} and add C(m,k) for all vertices m.
• Total of O(L^2 log L).
(Implementation details are omitted.)
Reconstructing Trees from Additive Matrices
     A  B  C  D  E
A    0  2  7  4  7
B       0  7  4  7
C          0  7  6
D             0  7
E                0
[Figure: the corresponding tree on leaves A, B, C, D, E, with edge weights 1, 1, 1, 2, 2, 3, 3]
Given a distance matrix constituting an additive metric, the topology of the corresponding additive tree is unique.
Q: Do we have to test additivity before running NJ?
A: That would be bad news, as the test takes O(L^4) time! Fortunately, it is not needed: by Saitou-Nei, if the matrix is additive, NJ will construct the correct tree. The algorithm need not know anything about the matrix in advance.
NJ Algorithm: Example
Let ri be the sum of distances from i to every other node:
ri = sum over j of d(i,j)
Here, we use the divergence,
D(i,j) = d(i,j) - (ri + rj)/(L - 2)
• Identify i,j as neighbours if their divergence is minimal.
• Combine i,j into a new node u.
• Update the distance matrix.
• If only 3 nodes are left, finish.
[Figure: tree with leaves i, j, k, l, m, n and edge lengths 0.1, 0.1, 0.1, 0.4, 0.4]
Distance Matrix
     A  B  C  D
A    0  2  3  6
B    2  0  3  5
C    3  3  0  6
D    6  5  6  0
rA = 11, rB = 10, rC = 11, rD = 17
D(A,B) = -8.5
D(A,C) = -8
D(A,D) = -8
D(B,C) = -7.5
D(B,D) = -8.5
D(C,D) = -8
[Figure: A and B joined at a new node U]
Distance Matrix
     U    C    D
U    0    3    5.5
C    3    0    6
D    5.5  6    0
rU = 8.5, rC = 9, rD = 11.5
X(U,C) = -5.75
X(U,D) = -4.5
X(C,D) = -4.25
[Figure: U (joining A and B) and C joined at a new node Y]
Distance Matrix
     Y    D
Y    0    5.6
D    5.6  0
[Figure: final tree, with A and B joined at U, U and C joined at Y, and Y and D joined at Z]
Reconstructing Trees from non-Additive Matrices
Almost Additive Matrix
• A distance matrix d’ is almost additive with respect to a tree T with additive distances d if
  max over i,j of |d(i,j) - d’(i,j)| < (1/2) min over edges e of l(e),
  where l(e) is the length of edge e in T.
• Atteson: If d’ is almost additive with respect to a tree T, then the output of NJ is a tree T’ with the same topology as T.
Distance Matrix
[Figure: distance matrix]
Unrooted Tree - NJ
[Figure: unrooted tree with the root position marked]
Output - NJ
Branch length is proportional to distance.
N-J Method produces an Unrooted, Additive tree
Neighbor-Joining Method: An Example
What is required for the neighbor joining method? A distance matrix.
0. Distance Matrix
PAM       Spinach   Rice   Mosquito  Monkey  Human
Spinach     0.0     84.9    105.6     90.8    86.3
Rice       84.9      0.0    117.8    122.4   122.6
Mosquito  105.6    117.8      0.0     84.7    80.8
Monkey     90.8    122.4     84.7      0.0     3.3
Human      86.3    122.6     80.8      3.3     0.0
1. First Step
The PAM distance 3.3 (Monkey - Human) is the minimum, so Monkey and Human are joined into a new node, Mon-Hum.
[Figure: Monkey and Human joined at Mon-Hum; Spinach, Mosquito and Rice not yet attached]
2. Calculation of New Distances
After we have joined two species in a subtree, we have to compute the distances from every other node to the new subtree. We do this with a simple average of distances:
Dist[Spinach, MonHum] = (Dist[Spinach, Monkey] + Dist[Spinach, Human])/2 = (90.8 + 86.3)/2 = 88.55
[Figure: Monkey and Human joined at Mon-Hum, with Spinach as the other node]
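The averaging step can be written as a small helper. This is a sketch under stated assumptions: `merge_by_average` is a hypothetical name, the matrix is held as a dict of dicts, and the plain average is the update rule used in this worked example (it differs from the additive update d(k,m) = [d(i,m) + d(j,m) - d(i,j)]/2 given earlier).

```python
def merge_by_average(dist, a, b, merged):
    """Return a new distance dict in which nodes a and b are replaced by `merged`."""
    others = [x for x in dist if x not in (a, b)]
    new = {x: {y: dist[x][y] for y in others} for x in others}
    new[merged] = {merged: 0.0}
    for x in others:
        avg = (dist[x][a] + dist[x][b]) / 2  # simple average of the two old distances
        new[x][merged] = new[merged][x] = avg
    return new

SPECIES = ["Spinach", "Rice", "Mosquito", "Monkey", "Human"]
ROWS = [
    [0.0, 84.9, 105.6, 90.8, 86.3],
    [84.9, 0.0, 117.8, 122.4, 122.6],
    [105.6, 117.8, 0.0, 84.7, 80.8],
    [90.8, 122.4, 84.7, 0.0, 3.3],
    [86.3, 122.6, 80.8, 3.3, 0.0],
]
PAM = {s: dict(zip(SPECIES, row)) for s, row in zip(SPECIES, ROWS)}
```

Calling `merge_by_average(PAM, "Monkey", "Human", "MonHum")` gives 88.55 for Spinach, 122.5 for Rice and 82.75 for Mosquito, which round to the entries of the next table.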
3. Next Cycle
PAM       Spinach   Rice   Mosquito  MonHum
Spinach     0.0     84.9    105.6     88.6
Rice       84.9      0.0    117.8    122.5
Mosquito  105.6    117.8      0.0     82.8
MonHum     88.6    122.5     82.8      0.0
[Figure: Mosquito joined with Mon-Hum at Mos-(Mon-Hum); Spinach and Rice not yet attached]
4. Penultimate Cycle
PAM         Spinach   Rice   MosMonHum
Spinach       0.0     84.9     97.1
Rice         84.9      0.0    120.2
MosMonHum    97.1    120.2      0.0
[Figure: Spinach and Rice joined at Spin-Rice, next to the subtree Mos-(Mon-Hum)]
5. Last Joining
PAM         SpinRice  MosMonHum
SpinRice       0.0      108.7
MosMonHum    108.7        0.0
[Figure: Spin-Rice and Mos-(Mon-Hum) joined at (Spin-Rice)-(Mos-(Mon-Hum))]
The result: Unrooted Neighbor-Joining Tree
[Figure: unrooted tree with leaves Human, Monkey, Mosquito, Rice, Spinach]
Dangers of Paralogs
[Figure: gene tree with a gene duplication followed by speciation events; leaves 1A, 2A, 3A, 3B, 2B, 1B]
If we happen to consider genes 1A, 2B, and 3A of species 1, 2, 3, we get a wrong tree that does not represent the phylogeny of the species.
Distance Based Reconstruction: We now move to character based methods
Character-based methods for constructing phylogenies
In this approach, trees are constructed by comparing the characters of the corresponding species. Characters may be morphological (teeth structures, hip joint) or molecular (homologous DNA sequences). The most popular approaches are maximum parsimony (MP) and maximum likelihood (ML).
In both methods, we assume independence of characters (no interactions). Each method has a well defined objective function. The goal is to find the tree or trees that optimize (maximize or minimize) the respective function.
1. Maximum Parsimony
[Figure: example tree with sequences at the leaves and internal nodes, and the number of substitutions (2, 1, 1) marked on the edges; sequence labels not recoverable]
Here, total #substitutions = 4
Example Continued
There are many trees possible. For example:
[Figure: left tree, with substitutions 1, 1, 1 marked on the edges] Total #substitutions = 3
[Figure: right tree, with substitutions 1, 1, 2 marked on the edges] Total #substitutions = 4
The left tree is preferred over the right tree.
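Example With One Letter
• Suppose we have five species, such that three have ‘C’ and two have ‘T’ at a specified position.
• A minimal tree has one evolutionary change:
[Figure: tree with leaves C, C, C, T, T and internal nodes labeled C and T; the single change between T and C is marked on one edge]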
Extension to Many Letters
• What is the parsimony score of the following tree?
[Figure: tree with leaves Aardvark, Bison, Chimp, Dog, Elephant]
A: CAGGTA
B: CAGACA
C: CGGGTA
D: TGCACT
E: TGCGTA
Weighted Parsimony Scores
• Each change from character a to character b is weighted by a score c(a,b).
• Traditional parsimony is the special case c(a,a) = 0 and c(a,b) = 1 for a ≠ b.
Evaluating Weighted Parsimony Scores
Each position is independent and computed by itself. Use dynamic programming on a given tree.
• S(k,a) is the minimum score of the subtree rooted at k when k has character a.
• If k is a node with children i and j, then
  S(k,a) = minx(S(i,x)+c(a,x)) + miny(S(j,y)+c(a,y))
[Figure: node k with score S(k,a), and its children i and j with scores S(i,x) and S(j,y)]
Evaluating Parsimony Scores
Dynamic programming on a given tree.
Initialization:
• For each leaf i, set S(i,a) = 0 if i is labeled by a, otherwise S(i,a) = ∞.
Iteration:
• If k is a node with children i and j, then S(k,a) = minx(S(i,x)+c(a,x)) + miny(S(j,y)+c(a,y)).
Termination:
• The cost of the tree is minx S(r,x), where r is the root.
Comment:
To reconstruct an optimal assignment, we need to keep in each node k and for each character a the two characters x, y that bring about the minimum when k has character a.
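The recurrence translates almost line by line into code. Below is a minimal sketch for a single position; the tree encoding (a leaf is a one-character string, an internal node is a pair of subtrees) and the names `sankoff` and `unit_cost` are assumptions made for this example. It returns only the cost, without the traceback described in the comment above.

```python
import math

def unit_cost(a, b):
    """Traditional parsimony: 0 for no change, 1 for any substitution."""
    return 0 if a == b else 1

def sankoff(tree, alphabet, cost=unit_cost):
    """Minimum weighted parsimony cost of one position on a given binary tree."""
    def scores(node):
        if isinstance(node, str):
            # Initialization: a leaf costs 0 for its own label, infinity otherwise.
            return {a: (0 if a == node else math.inf) for a in alphabet}
        left, right = (scores(child) for child in node)
        # Iteration: S(k,a) = min_x(S(i,x)+c(a,x)) + min_y(S(j,y)+c(a,y)).
        return {a: min(left[x] + cost(a, x) for x in alphabet)
                   + min(right[y] + cost(a, y) for y in alphabet)
                for a in alphabet}
    # Termination: best score over all characters at the root.
    return min(scores(tree).values())

# Five species, three with 'C' and two with 'T' (tree shapes chosen for illustration).
GROUPED = ((("C", "C"), "C"), ("T", "T"))
MIXED = (("C", "T"), (("C", "T"), "C"))
```

With unit costs, the tree that groups the three C's and the two T's needs one change, while the mixed arrangement needs two. Each internal node does work quadratic in the alphabet size.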
Cost of Evaluating Parsimony for binary trees
If there are n nodes, m characters, and k possible values for each character, then the complexity is O(nmk^2).
Of course, we still need to search over possible trees and find the best one. One usually resorts to heuristic search techniques.
2. Perfect Phylogeny
Data on species is given by a Character State Matrix. Cell (p,i) has value j iff character i of object (species) p has state j.
Goal: constructing an evolution tree for the species.
          Character
Object   c1  c2  c3  c4  c5
A         1   1   2   0   0
B         2   0   1   2   1
C         3   2   3   3   1
D         0   3   4   1   0
E         1   1   0   0   1
Motivation: Evolution Tree