The Genome Access Course
Phylogenetic Analysis
Phylogenetics
• Developed by Willi Hennig (Grundzüge einer Theorie der phylogenetischen Systematik, 1950; Phylogenetic Systematics, 1966)
What is the ancestral sequence?
• pfeffer
• pepper
• (pf/p)e(ff/pp)er
Evolutionary Trees
• A tree is a connected, acyclic 2D graph
• Leaf: Taxon
• Node: Vertex
• Branch: Edge
• Tree length = sum of all branch lengths
• Phylogenetic trees are usually treated as binary (bifurcating) trees
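A minimal sketch of the definitions above, assuming a hypothetical `Node` class in which each node stores the length of the branch leading to it. It only illustrates "tree length = sum of all branch lengths" on a small rooted binary tree; the names and the example lengths are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node: a leaf (taxon) has no children."""
    name: str
    length: float = 0.0            # length of the branch leading to this node
    children: list = field(default_factory=list)


def tree_length(node):
    """Sum of all branch lengths in the subtree rooted at `node`."""
    return node.length + sum(tree_length(child) for child in node.children)


def leaves(node):
    """Names of the taxa (leaves) below `node`, left to right."""
    if not node.children:
        return [node.name]
    return [name for child in node.children for name in leaves(child)]


# A rooted binary tree with three taxa: ((A, B), C)
tree = Node("root", 0.0, [
    Node("AB", 1.0, [Node("A", 2.0), Node("B", 3.0)]),
    Node("C", 4.0),
])

print(leaves(tree))        # taxa at the tips
print(tree_length(tree))   # 1 + 2 + 3 + 4
```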
A Generic Tree
Evolutionary Trees
• Rooted
– common ancestor
– unique path to any leaf
– directed
• Unrooted
– root could be placed anywhere
– fewer possible than rooted
Rooted Tree generated by DRAWGRAM (PHYLIP)
Unrooted Tree generated by DRAWTREE (PHYLIP)
Possible Evolutionary Trees
Rooted: (2n-3)! / (2^(n-2) (n-2)!)
Unrooted: (2n-5)! / (2^(n-3) (n-3)!)
Taxa (n)   Rooted      Unrooted
2          1           1
3          3           1
4          15          3
5          105         15
6          945         105
7          10395       945
8          135135      10395
9          2027025     135135
10         34459425    2027025
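The counts follow directly from the two factorial formulas. The sketch below evaluates them with integer arithmetic; the function names are illustrative, and the unrooted count for n = 2 is set to 1 by convention, since the formula itself is undefined there.

```python
from math import factorial


def rooted_trees(n):
    """Number of rooted binary trees for n taxa: (2n-3)! / (2^(n-2) (n-2)!)."""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def unrooted_trees(n):
    """Number of unrooted binary trees for n taxa: (2n-5)! / (2^(n-3) (n-3)!)."""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    if n == 2:
        return 1                   # convention: a single branch joins the two taxa
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in range(2, 11):
    print(n, rooted_trees(n), unrooted_trees(n))
```

Note that the unrooted count for n taxa equals the rooted count for n-1 taxa, which is why the two columns are shifted copies of each other.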
Genes vs. Species
• Sequences show gene relationships, but phylogenetic histories may be different for gene and species
• Genes evolve at different speeds
• Horizontal gene transfer
Methods for Phylogenetic Analysis
• Character-State
– Maximum Parsimony
– Maximum Likelihood
• Genetic Distance
– Fitch & Margoliash
– Neighbor-Joining
– Unweighted Pair Group
Phylogenetic Software
• PHYLIP
• PAUP (Available in GCG)
• TREE-PUZZLE
• PhyloBLAST
• Felsenstein maintains an extensive list of programs on the PHYLIP site
PHYLIP Programs
• dnapars/protpars
• dnadist/protdist
• dnaml (use fastDNAml instead)
• neighbor
• fitch/kitsch
• drawtree/drawgram
Maximum Parsimony
• Most common method
• Allows use of all evolutionary information
• Build and score all possible trees
• Each node is a transformation in a character state
• Minimize tree length
• Best tree requires the fewest changes to derive all sequences
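One way to score a candidate tree is Fitch's small-parsimony algorithm, which counts the minimum number of state changes per site on a fixed binary tree. The slides do not name a specific scoring algorithm, so the following is only a sketch under that assumption; trees are written as nested 2-tuples of taxon names, and the toy alignment is made up.

```python
def fitch_changes(tree, states):
    """Minimum number of changes for one character on a binary tree.

    tree:   nested 2-tuples whose leaves are taxon names (strings)
    states: dict mapping taxon name -> character state at this site
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):          # leaf: its observed state
            return {states[node]}
        left = visit(node[0])
        right = visit(node[1])
        common = left & right
        if common:                         # children agree: no change needed
            return common
        changes += 1                       # children disagree: one change
        return left | right

    visit(tree)
    return changes


def parsimony_length(tree, alignment):
    """Tree length under parsimony: changes summed over all sites."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_changes(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(n_sites)
    )


alignment = {"A": "AAG", "B": "AAG", "C": "CAT", "D": "CAT"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(parsimony_length(tree_1, alignment))   # fewer changes: more parsimonious
print(parsimony_length(tree_2, alignment))
```

Scoring every possible tree this way quickly becomes infeasible as the number of taxa grows, which is why real programs rely on heuristic searches.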
Which is the more parsimonious tree?
[Figure: two candidate trees with 3 nodes each, one with 9 node crossings and one with 8 node crossings]
Maximum Likelihood
• Reconstruction using an explicit evolutionary model
• The likelihood is calculated separately for each nucleotide site. The product of the per-site likelihoods gives the overall likelihood of the observed data.
• Demanding computationally
• Slowest method
• Use to test (or improve) an existing tree
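To illustrate the "product of per-site likelihoods" idea in maximum likelihood, the sketch below handles the simplest possible case: two aligned sequences separated by a single branch of length d, under the Jukes-Cantor (JC69) model. The choice of model and the grid search are assumptions made for illustration; a real program evaluates the likelihood over a whole tree and a richer model.

```python
import math


def jc69_site_likelihood(x, y, d):
    """Likelihood of observing bases x and y at one site, d substitutions/site apart."""
    e = math.exp(-4.0 * d / 3.0)
    p = 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e
    return 0.25 * p                      # 0.25 = equal base frequencies under JC69


def log_likelihood(seq1, seq2, d):
    """Log of the product of the per-site likelihoods (sum of logs)."""
    return sum(math.log(jc69_site_likelihood(x, y, d))
               for x, y in zip(seq1, seq2))


seq1 = "ACGTACGTAC"
seq2 = "ACGTACGTTT"                      # differs from seq1 at 2 of 10 sites

# Crude search: try branch lengths 0.001 .. 2.000 and keep the most likely one.
grid = [i / 1000 for i in range(1, 2001)]
best_d = max(grid, key=lambda d: log_likelihood(seq1, seq2, d))
print(best_d)
```

Summing logarithms instead of multiplying raw likelihoods avoids numerical underflow on long alignments.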
Clustering Algorithms
• Use distances to calculate phylogenetic trees
• Trees are based on the relative numbers of similarities and differences between sequences
• A distance matrix is constructed by computing pairwise distances for all sequences
• Clustering links successively more distant taxa
DNA Distances
• Distances between pairs of DNA sequences are relatively simple to compute as the sum of all base pair differences between the two sequences
• Can only work for pairs of sequences that are similar enough to be aligned
• All base changes are considered equal
• Insertion/deletions are generally given a larger weight than replacements (gap penalties).
• Possible to correct for multiple substitutions at a single site, which is common in distant relationships and for rapidly evolving sites.
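A minimal sketch of a DNA distance matrix: the uncorrected distance is the fraction of differing sites, and the Jukes-Cantor formula is used here as one example of a correction for multiple substitutions. Both the choice of correction and the treatment of gaps are assumptions; this sketch simply skips gapped columns rather than weighting them, and the sequences are made up.

```python
import math


def p_distance(a, b):
    """Fraction of aligned, ungapped sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p):
    """Correct an observed proportion p for multiple hits at the same site."""
    if p >= 0.75:
        return math.inf                  # saturated: correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(seqs, correct=True):
    """Pairwise distances for all sequences, as a dict of dicts."""
    names = list(seqs)
    matrix = {a: {b: 0.0 for b in names} for a in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            d = p_distance(seqs[a], seqs[b])
            matrix[a][b] = matrix[b][a] = jukes_cantor(d) if correct else d
    return matrix


seqs = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACGAAC-A"}
for name, row in distance_matrix(seqs).items():
    print(name, row)
```

The corrected distance is always at least as large as the observed one, and the gap between them widens for more distant pairs.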
Amino Acid Distances
• More difficult to compute
• Substitutions have differing effects on structure
• Some substitutions require more than one DNA mutation
• Use replacement frequencies (PAM, BLOSUM)
Fitch & Margoliash
• 3 sequences are combined at a time to define branches and calculate their length
• Additive branch lengths
• Accurate for short branches
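For three sequences joined at a single internal node, additive branch lengths can be solved directly from the three pairwise distances. The sketch below shows that three-point calculation only, not the full Fitch & Margoliash procedure; the distances are made-up example values.

```python
def three_point_branches(d_ab, d_ac, d_bc):
    """Branch lengths a, b, c from a central node to taxa A, B, C.

    Solves the additive system  a + b = d_ab,  a + c = d_ac,  b + c = d_bc.
    """
    a = (d_ab + d_ac - d_bc) / 2
    b = (d_ab + d_bc - d_ac) / 2
    c = (d_ac + d_bc - d_ab) / 2
    return a, b, c


a, b, c = three_point_branches(d_ab=22, d_ac=39, d_bc=41)
print(a, b, c)
print(a + b, a + c, b + c)   # reproduces the input distances
```

With more than three sequences, the remaining taxa are typically pooled into a composite third group and the same calculation is repeated.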
Neighbor Joining
• Most common method of tree construction
• Distance matrix adjusted for each taxon depending on its rate of evolution
• Performs well in simulation studies
• Most efficient computationally
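The rate adjustment can be sketched as a single neighbor-joining step: each distance is corrected by the row sums of the two taxa involved, and the pair with the smallest adjusted value is joined. The function name, the `frozenset`-keyed distance table, and the five-taxon example distances are illustrative assumptions.

```python
from itertools import combinations


def nj_pick_pair(names, d):
    """One neighbor-joining step: choose the pair to join and its branch lengths.

    d maps frozenset({x, y}) -> distance between taxa x and y.
    """
    n = len(names)
    # Row sums: total distance from each taxon to all others.
    r = {i: sum(d[frozenset((i, k))] for k in names if k != i) for i in names}
    # Rate-adjusted matrix: Q(i, j) = (n - 2) d(i, j) - r(i) - r(j)
    q = {(i, j): (n - 2) * d[frozenset((i, j))] - r[i] - r[j]
         for i, j in combinations(names, 2)}
    i, j = min(q, key=q.get)
    d_ij = d[frozenset((i, j))]
    # Branch lengths from i and j to the new internal node.
    len_i = d_ij / 2 + (r[i] - r[j]) / (2 * (n - 2))
    return (i, j), (len_i, d_ij - len_i)


names = ["a", "b", "c", "d", "e"]
raw = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
       ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
       ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
d = {frozenset(pair): dist for pair, dist in raw.items()}

pair, lengths = nj_pick_pair(names, d)
print(pair, lengths)
```

In this example the smallest raw distance is between d and e, yet the adjusted matrix selects a different pair, which shows the effect of the rate correction. A full implementation would replace the joined pair with the new node, update the distances, and repeat until the tree is resolved.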
UPGMA – Unweighted Pair Group Method Using Arithmetic Averages
• Simplest method
• Calculates branch lengths between most closely related sequences
• Averages distance to next sequence or cluster
• Predicts a position for the root
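A compact UPGMA sketch, assuming distances are stored in a dict keyed by `frozenset` pairs and clusters are represented as nested tuples. It repeatedly joins the closest pair, averages distances to the remaining clusters weighted by cluster size, and places each join at half the distance between the joined clusters, which is what fixes the root. The four-taxon distances are made up.

```python
from itertools import combinations


def upgma(names, d):
    """Return (tree, root_height); tree is a nested tuple of taxon names.

    d maps frozenset({x, y}) -> distance between taxa x and y.
    """
    dist = dict(d)
    size = {name: 1 for name in names}
    height = {name: 0.0 for name in names}
    clusters = list(names)
    while len(clusters) > 1:
        # Join the two closest clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: dist[frozenset(p)])
        new = (a, b)
        size[new] = size[a] + size[b]
        height[new] = dist[frozenset((a, b))] / 2
        # Size-weighted average distance from the new cluster to every other one.
        for c in clusters:
            if c not in (a, b):
                dist[frozenset((new, c))] = (
                    size[a] * dist[frozenset((a, c))]
                    + size[b] * dist[frozenset((b, c))]
                ) / size[new]
        clusters = [c for c in clusters if c not in (a, b)] + [new]
    return clusters[0], height[clusters[0]]


raw = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
       ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}
d = {frozenset(pair): dist for pair, dist in raw.items()}

tree, root_height = upgma(["A", "B", "C", "D"], d)
print(tree, root_height)
```

Because every join is placed at half the inter-cluster distance, UPGMA implicitly assumes that all lineages evolve at the same rate.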
Phylogenetic Complications
• Errors
• Loss of function
• Convergent evolution
• Lateral gene transfer
Validation
• Use several different algorithms and data sets
• NJ methods generate one tree, possibly supporting a tree built by parsimony or maximum likelihood
• Bootstrapping
– Perturb data and note effect on tree
– Repeat many times
– If the tree is unchanged in ~90% of replicates, its correctness is supported
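The bootstrap loop can be sketched as follows: alignment columns are resampled with replacement, the analysis is repeated on each replicate, and the fraction of replicates that agree with the original result is reported. To keep the example self-contained, a hypothetical `closest_pair` function stands in for a real tree-building step; the alignment, replicate count, and seed are made up.

```python
import random


def resample_columns(alignment, rng):
    """Perturb the data: draw alignment columns at random, with replacement."""
    n_sites = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


def closest_pair(alignment):
    """Stand-in for tree building: the pair of taxa with the fewest differences."""
    names = sorted(alignment)

    def differences(pair):
        return sum(x != y for x, y in zip(alignment[pair[0]], alignment[pair[1]]))

    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    return min(pairs, key=differences)


def bootstrap_support(alignment, replicates=200, seed=0):
    """Fraction of replicates that recover the grouping found in the original data."""
    rng = random.Random(seed)
    reference = closest_pair(alignment)
    hits = sum(closest_pair(resample_columns(alignment, rng)) == reference
               for _ in range(replicates))
    return reference, hits / replicates


alignment = {"A": "ACGTACGTAC", "B": "ACGTACGTTT", "C": "ACGAACCTTA"}
print(bootstrap_support(alignment))
```

In practice the same resampling is applied to a full tree-building method, and support is reported for each branch rather than for a single grouping.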
Are there bugs in our genome?
N-acetylneuraminate lyase
The End