PHYLOGENETIC TREES
Introduction to Computational Biology
CIS 786
With Dr. Barry Cohen
Tuesday, May 7, 2001
Paul Wood
Yanchun Song
Chaowei Sun
What is a Phylogenetic Tree?
• Phylogenetic trees represent the similarity or dissimilarity among living things, both existing and extinct, across a set of characteristics or features.
• The similarity of molecular and physical systems provides compelling evidence that all life on earth arose from a common ancestry.
Carl R. Woese, Interpreting the universal phylogenetic tree, Proc. Natl. Acad. Sci. USA, Vol. 97, Issue 15, 8392-8396, July 18, 2000
http://www.pnas.org/cgi/content/full/97/15/8392
Why do we study Phylogenetic Trees?
• Shall I COMPARE thee to a summer's day?
– W. Shakespeare, Sonnet 18
• There is a SIMILARITY between Homer and Hesiod, between Æschylus and Euripides…
– P. Shelley, Prometheus Unbound
• Life all around me…All in the loom, and oh what PATTERNS! Woodlands, meadows,…
– E. L. Masters, Spoon River Anthology
• If the foolish call them “flowers”/Need the wiser tell? // If the savants “CLASSIFY” them/It is just as well.
– E. Dickinson, Part 1: Life, XCIV
…because humans need to… fill in blanks…
…and understand in our own language…
What are some applications of “phylogenetic” trees?
Computational Linguistics
• Manning, Christopher D. and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, Massachusetts, 1999. http://www.aclweb.org/archive/fsnlp-ch1.pdf
Archaeological Statistics
• Archaeological Statistics: Brief Bibliography http://ad.trafficmp.com/tmpad/banner/itrack.asp?rv=3.0&id=16&nojs=1
Broad Historical and Technical Overview
• Discriminant Analysis and Clustering, Panel on Discriminant Analysis, Classification, and Clustering, Committee on Applied and Theoretical Statistics, Board on Mathematical Sciences, Commission on Physical Sciences, Mathematics, and Resources, National Research Council, National Academy Press, Washington, D.C., 1988. http://www.ulib.org/webRoot/Books/National_Academy_Press_Books/discrim_analysis/discr001.htm
Phylogenetic trees are used to study locations, migrations, lives, health & cultures of populations.
[Figure labels: Velda, Helena, Tara, Katrina, Ursula, Xenia, Jasmine]
http://www.oxfordancestors.com/daughters.html
Phylogenetic trees are used to study physical & genetic variability and the evolution of species.
http://www.oxfordancestors.com/daughters.html
Which areas of the genome provide mutation data to create phylogenetic trees?
• Y-Chromosome
• Mitochondrial Control Region
• Autosomes
How do we get data for computational biology?
STEP 1: Eukaryotic biochemical protocol is… kind of like washing greasy dishes!
[Figure: extraction flow diagram. Labels: homogenize; add detergent (sodium dodecyl sulphate, SDS) and phenol; genetic material vs. insoluble protein; remove upper phase; add cesium chloride; spin 40 hrs @ 40,000 RPM; concentration gradient with low-, medium- and high-weight bands; RNA and Cs shown in the gradient.]
How do we get sequence data?
STEP 2: Cut up DNA using one of “two” methods… &
STEP 3: Label fragments using one of “two” methods…
2a: Sanger (Dideoxy)
2b: Maxam-Gilbert
[Figure: sequencing flow diagram. Labels: EtOH; restriction enzymes; ~4 reactions (for each method); labeling options 3a and 3b (32-phosphate, fluorescent dye); gel electrophoresis; autoradiography; fluorescence spectroscopy.]
What is the rate of evolutionary change… or… how many mutations can we expect?
• Estimates vary depending upon assessment method and location within the genome
• “…134 independent mtDNA lineages spanning 327 generations found ~2.5 mutations per site per 1000 yrs.”
– Parsons TJ, Muniec DS, Sullivan K, Woodyatt N, Alliston-Greiner R, Wilson MR, Berry DL, Holland KA, Weedn VW, Gill P, Holland MM. A high observed substitution rate in the human mitochondrial DNA control region. Nat Genet 1997 Apr; 15(4):363-8. Armed Forces DNA Identification Laboratory, Armed Forces Institute of Pathology, Rockville, Maryland 20850, USA. http://www.mhrc.net/mitochondria.htm
– M. O. Dayhoff, R. M. Schwartz, and B. C. Orcutt (1978). A model of evolutionary change in proteins. In: Atlas of Protein Sequence and Structure, M. O. Dayhoff (Ed.), National Biomedical Research Foundation, Vol. 5, Suppl. 3, chapter 22, 345-352.
What do sequence data and input files typically look like?
PHYLIP INPUT FILE (SEQUENCE)
263 282
1 AY053096 cacgggagct …variable region... 282
2 AY053097 cacgggagct …variable region... 282
3 AY053098 cacgggagct …variable region... 282
.
263

MEGA INPUT FILE (SEQUENCE)
!Domain=Data property=Coding CodonStart=1;
#W._Pygmy_(1)_{African} TTC TTT CAT GGG
#W._Pygmy_(6)_{African} ... ... ... ...
#Kung_(7)_{African} ... .C. ... ... .T.
#Kung_(9)_{African} ... ... ... ... ...
#Kung_(10)_{African} ... ... ... ... ...
#Kung_(13)_{African} ... ... .G. ... ...
DISTANCE MATRIX
    A   B   C   D   E
B   2
C   4   4
D   6   6   6
E   6   6   6   4
F   8   8   8   8   8
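As a minimal sketch of how a lower-triangular table like this one might be read into a program, the Python below expands it into a symmetric lookup. The function name `parse_lower_triangle` and the whitespace-separated text layout are assumptions made for illustration; they are not the PHYLIP or MEGA distance-file formats.

```python
def parse_lower_triangle(text):
    """Expand a lower-triangular distance table into a symmetric dict of dicts."""
    rows = [line.split() for line in text.strip().splitlines()]
    cols = rows[0]                                   # column labels, e.g. A..E
    labels = [cols[0]] + [row[0] for row in rows[1:]]  # A, then the row labels B..F
    dist = {a: {b: 0.0 for b in labels} for a in labels}
    for row in rows[1:]:
        name, values = row[0], row[1:]
        # The k-th value in a row is the distance to the k-th column label.
        for col, value in zip(cols, values):
            dist[name][col] = dist[col][name] = float(value)
    return labels, dist


MATRIX = """
A B C D E
B 2
C 4 4
D 6 6 6
E 6 6 6 4
F 8 8 8 8 8
"""

labels, dist = parse_lower_triangle(MATRIX)
print(labels)
print(dist["E"]["D"], dist["D"]["E"])
```

Storing both `dist[a][b]` and `dist[b][a]` costs a little memory but keeps later lookups free of ordering checks.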
What are some of the major classifications of algorithms & software applications?
Count of Software Applications by Type and Platform
Type                               Unix/Source Code   DOS   Windows   Mac   VMS
General-purpose                    6                  5     5         3     3
Parsimony                          12                 12    13        5     3
Distance matrix                    27                 21    20        15    4
Compute distances                  22                 16    17        14    6
Maximum likelihood                 23                 5     13        14    5
Quartets methods                   7                  5     0         4     1
Artificial Intelligence            1                  0     0         0     0
Invariants                         2                  2     2         2     2
Tree rearrangement                 4                  2     3         5     1
Recombination                      9                  2     1         2     0
Bootstrapping and other measures   16                 15    9         11    2
Clocks, dating, and stratigraphy   10                 2     6         10    0

PHYLIP, PAUP & MEGA are represented across most categories. PHYLIP is the most widely distributed and used. PAUP is most frequently cited in publications. MEGA has a nice GUI and is user friendly.
http://evolution.genetics.washington.edu/phylip/software.html
Two Types of Data
• Distance-based:
– The input is a matrix of distances between the species (e.g., the alignment score between them or the fraction of residues they agree on).
• Character-based:
– Examine each character (e.g., a base in a specific position in the DNA) separately.
Pairwise Distance
• Model of Jukes and Cantor
– Each base in the DNA sequence has an equal chance of mutating, and when it does, it is replaced by one of the other three nucleotides with equal probability.
• Distance $d_{ij}$:
– The fraction f of sites u where residues $x^i_u$ and $x^j_u$ differ (presupposing an alignment of the two sequences).
T. H. Jukes and C. Cantor, Mammalian Protein Metabolism, Chapter Evolution of protein molecules, pages 21-132, Academic Press, New York, 1969
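To make the definition concrete, here is a minimal Python sketch that computes the fraction f of differing sites for two aligned sequences. It also applies the Jukes-Cantor correction d = −(3/4) ln(1 − 4f/3), which is the standard distance under this model although it is not written out on the slide. The function names and the two toy sequences are illustrative assumptions, and gap characters are not treated specially.

```python
import math


def p_distance(seq_a, seq_b):
    """Fraction f of aligned sites at which the two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differing = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differing / len(seq_a)


def jukes_cantor(f):
    """Jukes-Cantor corrected distance for an observed fraction f of differing sites."""
    if f >= 0.75:
        return math.inf          # the correction is undefined at saturation
    return -0.75 * math.log(1.0 - 4.0 * f / 3.0)


# Hypothetical aligned toy sequences that differ at 2 of 10 sites.
f = p_distance("ACGTACGTAC", "ACGTTCGTAA")
d = jukes_cantor(f)
print(f, d)
```

The corrected distance is always at least as large as f, because it accounts for sites that may have changed more than once.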
Clustering Method: UPGMA
• UPGMA: Unweighted Pair Group Method with Arithmetic Mean
• $D_{i,j}$ between two clusters of species $C_i$ and $C_j$:
\[ D_{i,j} = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{q \in C_j} d(p, q) \]
where $d(p, q)$ is the distance function between species, $n_i = |C_i|$ and $n_j = |C_j|$.
http://www.math.tau.ac.il/~rshamir/algmb/00/scribe00/html/lec08/node21.html
Algorithm
• Initialization:
– Initialize n clusters with the given species, one species per cluster.
– Size of each cluster: $n_i \leftarrow 1$; assign a leaf for each species.
• Iteration:
– Find the pair i, j with minimal $D_{i,j}$.
– Create a new cluster (ij), which has $n_{(ij)} = n_i + n_j$ members.
– Connect i and j to the new node (ij), which is placed at height $D_{i,j}/2$ (for two leaves, each branch has length $D_{i,j}/2$).
– Compute the distance from (ij) to all other clusters as a weighted average of the distances from its components:
\[ D_{(ij),k} = \frac{n_i}{n_i + n_j} D_{i,k} + \frac{n_j}{n_i + n_j} D_{j,k} \]
– Replace the columns and rows of clusters i and j in D with cluster (ij), with $D_{(ij),k}$ computed as above.
• Termination:
– Repeat until there is only one cluster left.
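The steps above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name `upgma`, the dict-of-dicts input, and the use of nested tuples as cluster names are all assumptions, ties are broken by iteration order, and the pair search is a plain scan rather than an optimized one.

```python
from itertools import combinations


def upgma(dist):
    """dist: symmetric dict of dicts. Returns (root, merges) with merges as (i, j, height)."""
    D = {a: dict(row) for a, row in dist.items()}    # working copy
    size = {a: 1 for a in D}                         # n_i <- 1 for every leaf
    merges = []
    while len(D) > 1:
        # Find the pair of clusters with minimal D[i][j].
        i, j = min(combinations(D, 2), key=lambda p: D[p[0]][p[1]])
        d_ij = D[i][j]
        new = (i, j)                                 # the new cluster (ij)
        others = [k for k in D if k != i and k != j]
        # Weighted average of the distances from the two components.
        row = {k: (size[i] * D[i][k] + size[j] * D[j][k]) / (size[i] + size[j])
               for k in others}
        del D[i], D[j]
        for k in others:
            del D[k][i], D[k][j]
            D[k][new] = row[k]
        D[new] = row
        size[new] = size[i] + size[j]
        merges.append((i, j, d_ij / 2))              # new node sits at height D_ij / 2
    (root,) = D
    return root, merges


# Six-species example matrix (lower triangle, rows B..F against columns A..E).
LOWER = {"B": [2], "C": [4, 4], "D": [6, 6, 6], "E": [6, 6, 6, 4], "F": [8, 8, 8, 8, 8]}
names = "ABCDEF"
dist = {a: {} for a in names}
for row_name, values in LOWER.items():
    for col_name, value in zip(names, values):
        dist[row_name][col_name] = dist[col_name][row_name] = float(value)

root, merges = upgma(dist)
for left, right, height in merges:
    print(left, right, height)
```

Because two pairs tie at distance 4 in this matrix, the order of the second and third merges depends on the tie-breaking rule, but the resulting tree is the same.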
UPGMA Example
    A   B   C   D   E
B   2
C   4   4
D   6   6   6
E   6   6   6   4
F   8   8   8   8   8
http://www.icp.ucl.ac.be/~opperd/private/upgma.html
UPGMA Example (cont’d)
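Original matrix:
    A   B   C   D   E
B   2
C   4   4
D   6   6   6
E   6   6   6   4
F   8   8   8   8   8

After merging A and B:
    A,B C   D   E
C   4
D   6   6
E   6   6   4
F   8   8   8   8

\[ D_{(ij),k} = \frac{n_i}{n_i + n_j} D_{i,k} + \frac{n_j}{n_i + n_j} D_{j,k} \]
$D_{(A,B),C} = (D_{AC} + D_{BC}) / 2 = 4$
$D_{(A,B),D} = (D_{AD} + D_{BD}) / 2 = 6$
$D_{(A,B),E} = (D_{AE} + D_{BE}) / 2 = 6$
$D_{(A,B),F} = (D_{AF} + D_{BF}) / 2 = 8$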
http://www.icp.ucl.ac.be/~opperd/private/upgma.html
UPGMA Example (cont’d)
After merging D and E:
      A,B   C     D,E
C     4
D,E   6     6
F     8     8     8

After merging A,B and C:
      AB,C  D,E
D,E   6
F     8     8

After merging AB,C and D,E:
      ABC,DE
F     8
http://www.icp.ucl.ac.be/~opperd/private/upgma.html
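The slides stop at the last reduced matrix, so the following Python sketch shows one way to turn the merge sequence read off those matrices into a tree with branch lengths, written in Newick notation. The function name `merges_to_newick` and the concatenated cluster names (for example "AB") are assumptions for illustration; each new node is placed at half the merge distance, and a branch length is the difference between parent and child heights.

```python
# Merge sequence read off the reduced matrices: (left, right, distance at merge).
MERGES = [
    ("A", "B", 2),
    ("D", "E", 4),
    ("AB", "C", 4),
    ("ABC", "DE", 6),
    ("ABCDE", "F", 8),
]


def merges_to_newick(merges):
    """Convert UPGMA merges into a Newick string with branch lengths."""
    height = {}    # cluster name -> node height (leaves are at height 0)
    subtree = {}   # cluster name -> Newick text for that cluster
    name = None
    for left, right, distance in merges:
        node_height = distance / 2
        parts = []
        for child in (left, right):
            child_height = height.get(child, 0.0)
            child_text = subtree.get(child, child)
            parts.append(f"{child_text}:{node_height - child_height:g}")
        name = left + right
        height[name] = node_height
        subtree[name] = "(" + ",".join(parts) + ")"
    return subtree[name] + ";"


newick = merges_to_newick(MERGES)
print(newick)
```

Every leaf in the resulting tree lies at the same distance (4) from the root, which is what a constant-rate clustering method produces.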
Additivity
• Given a tree, its edge lengths are said to be additive if the distance between any pair of leaves is the sum of the lengths of the edges on the path connecting them.
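A small Python sketch can illustrate additivity by building the leaf-to-leaf distances directly from edge lengths. The four-leaf tree below, with leaves a, b, c, d and internal nodes x, y, is a hypothetical example; its topology and the helper name `leaf_distances` are assumptions chosen for illustration.

```python
from itertools import combinations

# Hypothetical unrooted tree: a and c hang off x, b and d hang off y.
EDGES = {
    ("a", "x"): 0.1,
    ("c", "x"): 0.4,
    ("x", "y"): 0.1,
    ("b", "y"): 0.1,
    ("d", "y"): 0.4,
}


def leaf_distances(edges, leaves):
    """Distance between each pair of leaves = sum of edge lengths on the connecting path."""
    adjacency = {}
    for (u, v), length in edges.items():
        adjacency.setdefault(u, []).append((v, length))
        adjacency.setdefault(v, []).append((u, length))

    def path_length(source, target):
        # Depth-first walk; in a tree there is exactly one path between two nodes.
        stack = [(source, None, 0.0)]
        while stack:
            node, parent, total = stack.pop()
            if node == target:
                return total
            for neighbor, length in adjacency[node]:
                if neighbor != parent:
                    stack.append((neighbor, node, total + length))
        raise ValueError("nodes are not connected")

    return {(p, q): path_length(p, q) for p, q in combinations(leaves, 2)}


D = leaf_distances(EDGES, ["a", "b", "c", "d"])
for pair, value in D.items():
    print(pair, round(value, 3))
```

A distance matrix built this way is additive by construction; real data are usually only approximately additive.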
The idea of Neighbor-joining
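• Distance of i from the rest of the tree:
\[ u_i = \sum_{k \neq i} \frac{D_{i,k}}{n - 2} \]
• To find neighboring nodes i and j, minimize:
\[ D_{i,j} - (u_i + u_j) \]
• Branch lengths from i and j to the new node (ij):
\[ D_{i,(ij)} = \frac{1}{2} \left( D_{i,j} + u_i - u_j \right) \]
\[ D_{j,(ij)} = \frac{1}{2} \left( D_{i,j} + u_j - u_i \right) \]
[Figure: example tree with branch lengths 0.1, 0.1, 0.1, 0.4, 0.4 and node labels i, j, k, l, m, n]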
R. Durbin, et al, Additivity and neighbour-joining, Biological Sequence Analysis, p. 169-173, Cambridge Univ. Press, 1999.
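The Python sketch below illustrates why the corrected criterion is needed. It uses a hypothetical additive matrix for four leaves in which a and c share a parent and b and d share a parent, with short branches 0.1 to a and b, long branches 0.4 to c and d, and an internal edge of 0.1. The leaf names, the topology and the helper names are assumptions for illustration. Picking the smallest raw distance selects a pair that are not neighbors, while minimizing D(i,j) − (u_i + u_j) selects a true neighbor pair.

```python
# Additive distances for the hypothetical tree ((a:0.1, c:0.4), (b:0.1, d:0.4))
# with an internal edge of length 0.1.
D = {
    ("a", "b"): 0.3, ("a", "c"): 0.5, ("a", "d"): 0.6,
    ("b", "c"): 0.6, ("b", "d"): 0.5, ("c", "d"): 0.9,
}
leaves = ["a", "b", "c", "d"]


def d(p, q):
    """Symmetric lookup into the pair table."""
    return D[(p, q)] if (p, q) in D else D[(q, p)]


def u(i):
    """Average-like distance of leaf i from the rest: sum of D[i][k] over k != i, divided by n - 2."""
    return sum(d(i, k) for k in leaves if k != i) / (len(leaves) - 2)


closest = min(D, key=D.get)
nj_pick = min(D, key=lambda pair: D[pair] - (u(pair[0]) + u(pair[1])))
print("smallest raw distance:", closest)
print("neighbor-joining choice:", nj_pick)
```

In this example the two true neighbor pairs tie under the corrected criterion, so either may be reported.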
Algorithm: Neighbor-Joining
• Initialization:
– Define T to be the set of leaf nodes, one for each given sequence, and put n = |T|.
• Iteration:
– For each node i in T, compute $u_i = \sum_{k \neq i} D_{i,k} / (n - 2)$.
– Choose a pair i, j in T for which $D_{i,j} - (u_i + u_j)$ is minimal.
– Join i and j to a new cluster k = (ij). Calculate the branch lengths from i and j to the new node k as: $D_{i,k} = \frac{1}{2}(D_{i,j} + u_i - u_j)$, $D_{j,k} = \frac{1}{2}(D_{i,j} + u_j - u_i)$.
– Compute the distances between k and each other cluster: $D_{k,m} = \frac{1}{2}(D_{i,m} + D_{j,m} - D_{i,j})$, $m \in T$.
– Remove i and j from T and add k; update n = |T|.
• Termination:
– When T consists of only two nodes i and j, connect the remaining nodes by a branch of length $D_{i,j}$.
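A compact Python sketch of the neighbor-joining loop described above follows. The function name `neighbor_joining`, the dict-of-dicts input, the generated internal node names (`node0`, `node1`, ...) and the four-leaf test matrix are assumptions for illustration; the update formulas are the ones listed in the algorithm, and ties are broken by iteration order.

```python
from itertools import combinations


def neighbor_joining(dist):
    """dist: symmetric dict of dicts. Returns a list of (node, node, branch_length) edges."""
    D = {a: dict(row) for a, row in dist.items()}    # working copy of T and its distances
    edges = []
    next_id = 0
    while len(D) > 2:
        n = len(D)
        u = {i: sum(D[i][k] for k in D if k != i) / (n - 2) for i in D}
        # Choose the pair minimizing D[i][j] - (u_i + u_j).
        i, j = min(combinations(D, 2),
                   key=lambda p: D[p[0]][p[1]] - u[p[0]] - u[p[1]])
        k = f"node{next_id}"
        next_id += 1
        d_ij = D[i][j]
        edges.append((i, k, 0.5 * (d_ij + u[i] - u[j])))
        edges.append((j, k, 0.5 * (d_ij + u[j] - u[i])))
        others = [m for m in D if m != i and m != j]
        row = {m: 0.5 * (D[i][m] + D[j][m] - d_ij) for m in others}
        del D[i], D[j]
        for m in others:
            del D[m][i], D[m][j]
            D[m][k] = row[m]
        D[k] = row
    i, j = D                                         # two nodes remain
    edges.append((i, j, D[i][j]))
    return edges


# Hypothetical additive distances for four leaves.
PAIRS = {("a", "b"): 0.3, ("a", "c"): 0.5, ("a", "d"): 0.6,
         ("b", "c"): 0.6, ("b", "d"): 0.5, ("c", "d"): 0.9}
dist = {x: {} for x in "abcd"}
for (p, q), value in PAIRS.items():
    dist[p][q] = dist[q][p] = value

edges = neighbor_joining(dist)
for a, b, length in edges:
    print(a, b, round(length, 3))
```

For additive input such as this, the recovered branch lengths match the tree that generated the distances.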
MEGA 2
• Molecular Evolutionary Genetics Analysis
• Provides tools for exploring and analyzing DNA and protein sequences from evolutionary perspectives
History of MEGA
• MEGA 1
DOS-Based
• MEGA 2
User-friendly interface
Windows
Macintosh
Sun Workstation
Linux
Input
• Character Sequence
– DNA/RNA
– Protein
• Distance Matrix
• Import data from other formats: PHYLIP, XML, etc.
Methods and Algorithms
• Methods for constructing phylogenetic trees from molecular data:
1. UPGMA Method
2. Neighbor-Joining (NJ) Method
3. Minimum Evolution (ME) Method
4. Maximum Parsimony (MP) Method
Unweighted Pair Group Method with Arithmetic Mean - UPGMA
• Assumes a constant rate of evolution
• Sequential clustering method
• Produces a rooted tree
• Edge lengths represent time as measured by a molecular clock
Neighbor-Joining - NJ
• No assumption of a constant rate of evolution
• Finds neighbors sequentially so as to minimize the total length of the tree
• Produces an unrooted tree
• Root: can be placed at the midpoint of the longest route connecting two taxa in the tree
Minimum Evolution - ME
• Finds a topology with the smallest sum of branch lengths
• Time-consuming: in principle the sum of branch lengths has to be evaluated for every topology
Maximum Parsimony - MP
• Finds a topology that requires the smallest number of changes (substitutions)
• For each topology, sums up the total number of substitutions over all sites
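Counting the substitutions that a fixed topology requires can be sketched in Python with Fitch's small-parsimony procedure. The slides do not name a counting method, so Fitch's algorithm is used here as a standard choice. The function names, the nested-tuple tree encoding and the toy alignment are assumptions for illustration, and the sketch handles only binary trees.

```python
def fitch_count(tree, states):
    """Minimum number of changes for one site on a binary tree of nested tuples."""
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):            # leaf: its observed character
            return {states[node]}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:                           # children can agree: no change needed
            return common
        changes += 1                         # children disagree: one substitution
        return left | right

    visit(tree)
    return changes


def parsimony_score(tree, alignment):
    """Sum the per-site Fitch counts over every column of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_count(tree, {name: seq[site] for name, seq in alignment.items()})
        for site in range(n_sites)
    )


# Hypothetical aligned toy sequences.
ALIGNMENT = {"A": "AAG", "B": "AAG", "C": "ACG", "D": "TCG"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
score_1 = parsimony_score(tree_1, ALIGNMENT)
score_2 = parsimony_score(tree_2, ALIGNMENT)
print(score_1, score_2)
```

Maximum parsimony would prefer the topology with the lower score; finding that topology still requires searching over candidate trees.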
Comparison
                 Computational Method
Data type        Optimality criterion     Clustering algorithm
Distance         Minimum Evolution        UPGMA, Neighbor-Joining
Characters       Parsimony
Comparison – Cont’d
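• UPGMA, Neighbor-Joining
- Fast, polynomial time (O(n^2) for UPGMA with an efficient implementation; standard Neighbor-Joining is O(n^3)), so suitable for large datasets
- Result depends upon the order in which we add sequences to the tree
• Minimum Evolution, Maximum Parsimony
- Time consuming, NP-Complete
- Use an explicit function relating the trees to the data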
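To give a feel for why exhaustive optimality-criterion searches become expensive, this short Python sketch counts the unrooted bifurcating topologies for n taxa using the standard formula (2n − 5)!! = 1 · 3 · 5 · … · (2n − 5). The function name is an assumption for illustration, and the formula is a well-known result rather than something stated on the slides.

```python
def num_unrooted_trees(n_taxa):
    """Number of distinct unrooted bifurcating tree topologies for n_taxa labeled leaves."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for factor in range(3, 2 * n_taxa - 4, 2):
        count *= factor
    return count


for n in (4, 5, 6, 10, 20):
    print(n, num_unrooted_trees(n))
```

The count grows faster than exponentially, which is why practical programs rely on heuristic searches for these criteria.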