Genomics and Personalized Care in Health Systems
Lecture 3 Sequence Alignment
Leming Zhou
School of Health and Rehabilitation Sciences
Department of Health Information Management

Outline
• Pairwise sequence alignment
• Multiple sequence alignment
• Phylogenetic tree

Similarity Search
• Find statistically significant matches to a protein or DNA sequence of interest.
• Obtain information on the inferred function of the gene
• Sequence identity/similarity is a quantitative measurement of the number of nucleotides or amino acids that are identical/similar in two aligned sequences
  – Calculated from a sequence alignment
  – Can be expressed as a percentage
  – In proteins, some residues are chemically similar but not identical
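
A minimal sketch of how percent identity could be calculated from two aligned rows. The function name and the choice of denominator (every alignment column, including gapped ones) are assumptions made for illustration; other conventions divide by the shorter sequence or by ungapped columns only.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of alignment columns in which both rows hold the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have the same length")
    # Ignore columns that are gaps in both rows (they carry no information).
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if not (x == gap and y == gap)]
    if not columns:
        return 0.0
    identical = sum(1 for x, y in columns if x == y and x != gap)
    return 100.0 * identical / len(columns)


# Alignment from the "DNA Alignment Score" slide: 13 identical columns out of 17.
identity = percent_identity("CGAAGACTTGAGCTGAT", "CGCAGACATGA-CTGAC")
```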

Sequence Alignment
• A linear, one-to-one correspondence between some of the symbols in one sequence and some of the symbols in another sequence
  – Four possible outcomes in aligning two sequences
    • Identity; mismatch; gap in one sequence; gap in the other sequence
• May be DNA or protein sequences.

Evolutionary Basis of Alignment
• The simplest molecular mechanisms of evolution are substitution, insertion, and deletion
• If a sequence alignment represents the evolutionary relationship of two sequences, residues that are aligned but do not match represent substitutions
• Residues that are aligned with a gap in the other sequence represent insertions or deletions

Alignment Algorithms
• Sequences often contain highly conserved regions
• These regions can be used for an initial alignment

Alignments
• Two sequences
  Seq 1: ACGGACT
  Seq 2: ATCGGATCT
• There may be multiple ways of creating the alignment. Which alignment is the best?

A - C - G G - A C T
|   |   |       | |
A T C G G A T - C T

A T C G G A T C T
|   | | |     | |
A - C G G - A C T

Optimal vs. Correct Alignment
• For a given group of sequences, there is no single “correct” alignment, only an alignment that is “optimal” according to some set of calculations
• This is partly due to:
  – the complexity of the problem,
  – limitations of the scoring systems used,
  – our limited understanding of life and evolution
• Success of the alignment will depend on the similarity of the sequences. If sequence variation is great, it will be very difficult to find an optimal alignment

Optimal Alignment
• Every alignment has a score
• Choose the alignment with the highest score
• Must choose an appropriate scoring function
• The scoring function is based on an evolutionary model with insertions, deletions, and substitutions
• Use a substitution score matrix, which contains an entry for every amino acid pair

Gaps
• Positions at which a letter is paired with a null are called gaps. Gap scores are typically negative.
• Since a single mutational event may cause the insertion or deletion of more than one residue, the presence of a gap is ascribed more significance than the length of the gap.
• Biologically, it should in general be easier for a sequence to accept a different residue in a position than to have parts of the sequence deleted or inserted. Gaps (insertions/deletions) should therefore be rarer than point mutations (substitutions)

Gaps in Sequence Alignment
• Gaps can occur
  – Before the first character of a string
  – Inside a string
  – After the last character of a string

CTGCGGG---GGTAAT
  ||||    || ||
--GCGG-AGAGG-AA-

Gap penalties
• There is no suitable theory for gap penalties.
• The simplest gap penalty is a constant penalty for each gap
• The most common type of gap penalty is the affine gap penalty: g = a + bx
  – a is the gap opening penalty
  – b is the gap extension penalty
  – x is the number of gapped-out residues.
• A contiguous block of residues is more likely to be inserted or deleted in one event than several separate residues
• The scoring scheme should therefore penalize opening a new gap more than extending an existing one
• Typical values, e.g. a = 10 and b = 1 for BLAST.
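
A small sketch of the affine formula g = a + bx, using the slide's example values a = 10 and b = 1 as defaults. The function name is an assumption made for illustration. It shows why one long gap costs less than several short gaps covering the same number of residues.

```python
def affine_gap_penalty(length: int, gap_open: float = 10.0, gap_extend: float = 1.0) -> float:
    """Penalty g = a + b*x for a single gap spanning `length` residues."""
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0.0  # no gap, no penalty
    return gap_open + gap_extend * length


# One gap of 5 residues versus two gaps of 3 and 2 residues.
one_long_gap = affine_gap_penalty(5)                              # 10 + 5
two_short_gaps = affine_gap_penalty(3) + affine_gap_penalty(2)    # (10 + 3) + (10 + 2)
```

With these values the single gap is penalized 15 and the two separate gaps 25, which matches the idea that opening a gap should cost more than extending one.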

Pairwise Sequence Alignment

Pairwise Alignment
• The process of lining up two sequences to achieve maximal levels of identity or conservation, for the purpose of assessing the degree of similarity and the possibility of homology
• It is used to
  – Decide if two genes are related structurally or functionally
    • Find the similarities between two sequences with the same evolutionary background
  – Identify domains or motifs that are shared between proteins
  – Analyze genomes
    • Identify genes, search large databases, determine overlaps of sequences (DNA assembly)

DNA and Protein Sequences
• DNA alphabet: {A, C, G, T}+
  – Four discrete possibilities: a position is either a match or a mismatch
• Protein alphabet: {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}+
  – 20 possibilities which fall into several categories
  – Residues can be similar without being identical
• In some cases, protein sequence is more informative
  – Codons are degenerate: changes in the third position often do not alter the amino acid that is specified
• In some cases, DNA alignments are appropriate
  – To confirm the identity of a cDNA; to study noncoding regions of DNA; to study DNA polymorphisms, …

Translating a DNA Sequence into Proteins
• DNA sequences can be translated into protein, and then used in pairwise alignments
• One DNA sequence can be translated into six potential proteins (three reading frames on each strand)

Frames read from the top strand:     5' CAT CAA    5' ATC AAC    5' TCA ACT
Frames read from the bottom strand:  5' GTG GGT    5' TGG GTA    5' GGG TAG

5' CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3'
3' GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5'
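
The six reading frames can be sketched as follows. This is an illustrative sketch only: the function names are assumptions, and it stops at splitting each frame into codons. Turning codons into amino acids would additionally need a genetic code lookup table, which is omitted here.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, written 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def codons(dna: str, offset: int) -> list[str]:
    """Split a strand into complete codons, starting at `offset` (0, 1, or 2)."""
    return [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]


def six_frames(dna: str) -> dict[str, list[str]]:
    """Codons for frames +1, +2, +3 (given strand) and -1, -2, -3 (opposite strand)."""
    forward = dna.upper()
    reverse = reverse_complement(forward)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = codons(forward, offset)
        frames[f"-{offset + 1}"] = codons(reverse, offset)
    return frames


# The sequence shown on the slide.
frames = six_frames("CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC")
```

The first two codons of each frame reproduce the six fragments listed on the slide (CAT CAA, ATC AAC, TCA ACT, GTG GGT, TGG GTA, GGG TAG).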

DNA Alignment Score

CGAAGACTTGAGCTGAT
|| |||| ||| ||||
CGCAGACATGA-CTGAC

(Each column is a match, a mismatch, or a gap.)

$S = \sum (\text{identity, mismatches, gap penalties})$
$\text{Score} = \max(S)$

Alignment Scoring Scheme
• Possible scoring scheme:
  – match: +5
  – mismatch: –3
  – indel: –4
• Example:

G  A  A  T  T  C  A  G  T  T  A
|     |     |  |     |        |
G  G  A  -  T  C  -  G  -  -  A
+5 -3 +5 -4 +5 +5 -4 +5 -4 -4 +5

S = 5 – 3 + 5 – 4 + 5 + 5 – 4 + 5 – 4 – 4 + 5 = 11

• Match/mismatch scores as a matrix:

     A   C   G   T
A    5
C   -3   5
G   -3  -3   5
T   -3  -3  -3   5
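
The scoring scheme above can be applied column by column. The sketch below assumes the slide's values (match +5, mismatch -3, indel -4); the function name is chosen here for illustration.

```python
def score_alignment(row_a: str, row_b: str,
                    match: int = 5, mismatch: int = -3, indel: int = -4,
                    gap: str = "-") -> int:
    """Sum the per-column scores of an existing pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == gap or y == gap:
            total += indel        # a residue paired with a gap
        elif x == y:
            total += match        # identical residues
        else:
            total += mismatch     # different residues
    return total


# The example alignment from the slide.
example_score = score_alignment("GAATTCAGTTA", "GGA-TC-G--A")
```

For the slide's example this gives 11, the same value as the hand calculation.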

Amino Acid Sequence Alignment
• No exact match/mismatch scores
• Match state score calculated by table lookup
• Lookup table is a substitution matrix (or scoring matrix)

Substitution Matrix
• A substitution matrix contains values proportional to the probability that amino acid i mutates into amino acid j, for all pairs of amino acids.
• Substitution matrices are constructed by assembling a large and diverse sample of verified pairwise alignments (or multiple sequence alignments) of amino acids.
• Substitution matrices should reflect the true probabilities of mutations occurring through a period of evolution.
• The two major types of substitution matrices are Point Accepted Mutation (PAM) and BLOcks SUbstitution Matrix (BLOSUM).

Sequence Alignment Algorithms
• Dynamic Programming:
  – Needleman-Wunsch Global Alignment (1970)
  – Smith-Waterman Local Alignment (1981)
  – Guaranteed to find the best-scoring alignment
  – Slow, especially when used to compare a query against a large database
• Heuristics
  – FASTA, BLAST: heuristic approximations to Smith-Waterman
  – Fast, with results comparable to the Smith-Waterman algorithm

Dynamic Programming
• Solves optimization problems by dividing the problem into smaller subproblems whose solutions are stored and reused
• Sequence alignment has the optimal substructure property
  – Subproblem: alignment of prefixes of the two sequences
  – Each subproblem is computed once and stored in a matrix
• Optimal score: built upon the optimal alignment computed to that point
• Aligns two sequences beginning at the ends, attempting to align all possible pairs of characters
  – Alignment contains matches, mismatches and gaps
  – Scoring scheme for matches, mismatches, gaps
  – Highest set of scores defines the optimal alignment between the sequences
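
A compact sketch of Needleman-Wunsch global alignment, filling the matrix over prefixes and then tracing back. It assumes a linear gap cost and the match/mismatch/indel values from the scoring-scheme slide (+5, -3, -4) rather than an affine penalty; the function name is an illustrative choice. When several alignments tie for the best score, the traceback returns just one of them.

```python
def needleman_wunsch(a: str, b: str, match: int = 5, mismatch: int = -3, indel: int = -4):
    """Return (best score, aligned a, aligned b) for a global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning the prefixes a[:i] and b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * indel
    for j in range(1, m + 1):
        score[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + indel,     # gap in b
                              score[i][j - 1] + indel)     # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                row_a.append(a[i - 1])
                row_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + indel:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


best, top, bottom = needleman_wunsch("GAATTCAGTTA", "GGATCGA")
```

For the two sequences from the scoring example, the best global score under this scheme is 11, the score of the alignment shown on that slide.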

The Big O Notation
• Computational complexity of an algorithm is how its execution time increases as the problem is made larger (e.g. more sequences to align)
• The big-O notation
  – If we have a problem of size n, then an algorithm takes O(n) time if the time increases linearly with n. If the algorithm needs time proportional to the square of n, then it is O(n^2)
  – More examples; here c is a constant:
    • O(c) utopian
    • O(log n) excellent
    • O(n) very good
    • O(n^2) not so good
    • O(n^3) pretty bad
    • O(c^n) disaster

Drawbacks to DP Approaches
• Compute intensive
• Memory intensive
• Complexity of DP algorithm
  – Time O(nm); space O(nm)
    • where n, m are the lengths of the two sequences.
  – When only the score is needed, space complexity can be reduced to O(n) by not storing the entries of the dynamic programming table that are no longer needed for the computation (keep the current row and the previous row only)
• A fast heuristic (BLAST) will be discussed next week
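
The row-by-row space saving can be sketched as below. The sketch assumes a linear gap cost with the earlier example values (+5, -3, -4) and returns the score only: with just two rows kept, the alignment itself cannot be traced back. The function name is an illustrative choice.

```python
def nw_score_two_rows(a: str, b: str, match: int = 5, mismatch: int = -3, indel: int = -4) -> int:
    """Global alignment score using only the previous and current DP rows."""
    # Put the shorter sequence along the row so each row is as small as possible.
    if len(b) > len(a):
        a, b = b, a
    previous = [j * indel for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * indel] + [0] * len(b)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(previous[j - 1] + sub,   # diagonal move
                             previous[j] + indel,     # gap in b
                             current[j - 1] + indel)  # gap in a
        previous = current  # the older row is no longer needed
    return previous[-1]


two_row_score = nw_score_two_rows("GAATTCAGTTA", "GGATCGA")
```

Time is still proportional to nm; only the memory drops to one row plus its predecessor.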

Two Sequences
>gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA
ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGA
GGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGC
AGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATG
CTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGC
TCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGAT
CCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCA
CCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCA
CTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACT
GGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC
>gi|17985948|ref|NM_033234.1| Rattus norvegicus hemoglobin, beta (Hbb), mRNA
TGCTTCTGACATAGTTGTGTTGACTCACAAACTCAGAAACAGACACCATGGTGCACCTGACTGATGCTGA
GAAGGCTGCTGTTAATGGCCTGTGGGGAAAGGTGAACCCTGATGATGTTGGTGGCGAGGCCCTGGGCAGG
CTGCTGGTTGTCTACCCTTGGACCCAGAGGTACTTTGATAGCTTTGGGGACCTGTCCTCTGCCTCTGCTA
TCATGGGTAACCCTAAGGTGAAGGCCCATGGCAAGAAGGTGATAAACGCCTTCAATGATGGCCTGAAACA
CTTGGACAACCTCAAGGGCACCTTTGCTCATCTGAGTGAACTCCACTGTGACAAGCTGCATGTGGATCCT
GAGAACTTCAGGCTCCTGGGCAATATGATTGTGATTGTGTTGGGCCACCACCTGGGCAAGGAATTCACCC
CCTGTGCACAGGCTGCCTTCCAGAAGGTGGTGGCTGGAGTGGCCAGTGCCCTGGCTCACAAGTACCACTA
AACCTCTTTTCCTGCTCTTGTCTTTGTGCAATGGTCAATTGTTCCCAAGAGAGCATCTGTCAGTTGTTGT
CAAAATGACAAAGACCTTTGAAAATCTGTCCTACTAATAAAAGGCATTTACTTTCACTGC

Pairwise Sequence Alignment
• FASTA: http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi
• DNA vs. DNA comparison
• Default parameters:
  – Match: +5
  – Mismatch: –4
  – Gap open penalty: –12
  – Gap extension penalty: –4
• BLAST search will be covered next week

Multiple Sequence Alignment

Multiple Sequence Alignment
• Multiple sequence alignment (MSA) is a generalization of pairwise sequence alignment: instead of aligning two sequences, n (>2) sequences are aligned simultaneously
• A multiple sequence alignment is obtained by inserting gaps (“-”) into the sequences such that the resulting sequences all have length L and can be arranged in a matrix of n rows and L columns, where each column represents a homologous position
• MSA applies to both DNA and protein sequences

Why Do We Need MSA?
• MSA can help to develop a sequence “fingerprint” (motif) which allows the identification of distantly related members of a protein family
• Formulate and test hypotheses about protein 3-D structure
• MSA can help us to reveal biological facts about proteins, e.g. how protein function has changed or what evolutionary pressure is acting on a gene
• Crucial for genome sequencing:
  – Random fragments of a large molecule are sequenced and those that overlap are found by a multiple sequence alignment program.
• To establish homology for phylogenetic analyses
• Identify homologous sequences in other organisms

Multiple Sequence Alignment
• Difficulty: introducing multiple sequences increases the number of combinations of matches, mismatches, and gaps
• In pairwise alignments, one has a 2D matrix with the sequences on each axis. The number of operations required to locate the best “path” through the matrix is approximately proportional to the product of the lengths of the two sequences
• A possible general method would be to extend the pairwise alignment method into a simultaneous N-wise alignment, using a DP algorithm in N dimensions. Algorithmically, this is not difficult to do

Example
fly GAKKVIISAP SAD.APM..F VCGVNLDAYK PDMKVVSNAS CTTNCLAPLA
human GAKRVIISAP SAD.APM..F VMGVNHEKYD NSLKIISNAS CTTNCLAPLA
plant GAKKVIISAP SAD.APM..F VVGVNEHTYQ PNMDIVSNAS CTTNCLAPLA
bacterium GAKKVVMTGP SKDNTPM..F VKGANFDKY. AGQDIVSNAS CTTNCLAPLA
yeast GAKKVVITAP SS.TAPM..F VMGVNEEKYT SDLKIVSNAS CTTNCLAPLA
archaeon GADKVLISAP PKGDEPVKQL VYGVNHDEYD GE.DVVSNAS CTTNSITPVA
fly KVINDNFEIV EGLMTTVHAT TATQKTVDGP SGKLWRDGRG AAQNIIPAST
human KVIHDNFGIV EGLMTTVHAI TATQKTVDGP SGKLWRDGRG ALQNIIPAST
plant KVVHEEFGIL EGLMTTVHAT TATQKTVDGP SMKDWRGGRG ASQNIIPSST
bacterium KVINDNFGII EGLMTTVHAT TATQKTVDGP SHKDWRGGRG ASQNIIPSST
yeast KVINDAFGIE EGLMTTVHSL TATQKTVDGP SHKDWRGGRT ASGNIIPSST
archaeon KVLDEEFGIN AGQLTTVHAY TGSQNLMDGP NGKP.RRRRA AAENIIPTST
fly GAAKAVGKVI PALNGKLTGM AFRVPTPNVS VVDLTVRLGK GASYDEIKAK
human GAAKAVGKVI PELNGKLTGM AFRVPTANVS VVDLTCRLEK PAKYDDIKKV
plant GAAKAVGKVL PELNGKLTGM AFRVPTSNVS VVDLTCRLEK GASYEDVKAA
bacterium GAAKAVGKVL PELNGKLTGM AFRVPTPNVS VVDLTVRLEK AATYEQIKAA
yeast GAAKAVGKVL PELQGKLTGM AFRVPTVDVS VVDLTVKLNK ETTYDEIKKV
archaeon GAAQAATEVL PELEGKLDGM AIRVPVPNGS ITEFVVDLDD DVTESDVNAA

MSA
• How do we generate a multiple alignment? Given a pairwise alignment, just add the third sequence, then the fourth, and so on, until all have been aligned. Does it work?
• It is not self-evident how these sequences are to be aligned together.
• It depends not only on the various alignment parameters but also on the order in which sequences are added to the multiple alignment

Dynamic Programming for MSA
• Dynamic programming with two sequences
  – Relatively easy to code
  – Guaranteed to obtain the optimal alignment
• An extension of the pairwise sequence alignment
  – Alignment of K sequences
    • K(K-1)/2 possible sequence comparisons
• Alignment algorithms operate in a similar manner to pairwise alignment, but now the distance matrix is K-dimensional and the weight function compares K letters

Time Complexity of Optimal MSA
• Space complexity (hyperlattice size): O(n^k) for k sequences each n long.
• Computing a hyperlattice node: O(2^k).
• Time complexity: O(2^k n^k).
• Finding the optimal solution takes time exponential in k; the problem is NP-hard.
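
To get a feel for these growth rates, the counts can simply be evaluated for a few values of k. This is a rough sketch that plugs numbers into the expressions above and ignores constant factors; the function name and the chosen sequence length are illustrative assumptions.

```python
def msa_dp_cost(n: int, k: int) -> tuple[int, int]:
    """Return (lattice nodes, node updates) for exact DP on k sequences of length n."""
    nodes = n ** k            # size of the k-dimensional hyperlattice
    work = (2 ** k) * nodes   # about 2^k predecessor checks per node
    return nodes, work


# Illustrative only: sequences of length 100, for 2 to 5 sequences.
growth = {k: msa_dp_cost(100, k) for k in range(2, 6)}
```

Each extra sequence multiplies the lattice size by n, which is why exact DP is only practical for a very small number of sequences.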

Heuristics for Optimal MSA
• Reduction of space and time
• Heuristic alignment: not guaranteed to be optimal
• Alignment provides a limit to the volume within which optimal alignments are likely to be found
• Heuristics:
  – Progressive alignments (ClustalW)

Progressive Alignment
• Works by progressive alignment: it aligns a pair of sequences, then aligns the next one onto the first pair
• The most closely related sequences are aligned first, and then additional sequences and groups of sequences are added, guided by the initial alignments
• Uses alignment scores to produce a guide tree
• Aligns the sequences sequentially, guided by the relationships indicated by the tree
  – If the order is wrong and distantly related sequences are merged too soon, errors in the alignment may occur and propagate
• Gap penalties can be adjusted based on the specific sequences

CLUSTALW
• http://www.ebi.ac.uk/clustalw/
• Perform pairwise alignments of all sequences
• Use alignment scores to produce a guide tree
• Align sequences sequentially, guided by the tree
• Enhanced dynamic programming is used to align sequences
• Genetic distance is determined by the number of mismatches divided by the number of matches
• Gaps are added to an existing profile in progressive methods
• CLUSTALW incorporates a statistical model in order to place gaps where they are most likely to occur
ClustalW MSA Procedure (figure)
• All pairwise alignments of S1, S2, S3, S4 give a similarity matrix
• Cluster analysis of the similarity matrix gives a dendrogram (axis: distance) that groups S1 with S3 and S2 with S4
• Multiple alignment steps:
  1. Aligning S1 and S3
  2. Aligning S2 and S4
  3. Aligning (S1,S3) with (S2,S4)
From Higgins (1991) and Thompson (1994).
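
The first two stages of this procedure (pairwise distances, then deciding which pairs to merge first) can be sketched roughly as follows. This is not ClustalW's actual implementation: the distance follows the slide's description (mismatches divided by matches), the toy sequences are invented and already the same length so no pairwise alignment step is needed, and ranking pairs by distance stands in for real guide-tree construction.

```python
from itertools import combinations


def mismatch_match_distance(row_a: str, row_b: str, gap: str = "-") -> float:
    """Mismatches divided by matches, over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(row_a, row_b) if x != gap and y != gap]
    matches = sum(1 for x, y in pairs if x == y)
    mismatches = len(pairs) - matches
    if matches == 0:
        return float("inf")  # nothing in common: treat as maximally distant
    return mismatches / matches


def merge_order(aligned: dict[str, str]) -> list[tuple[str, str]]:
    """All sequence pairs, closest first (a crude stand-in for a guide tree)."""
    distance = {(p, q): mismatch_match_distance(aligned[p], aligned[q])
                for p, q in combinations(sorted(aligned), 2)}
    return sorted(distance, key=distance.get)


# Hypothetical toy sequences, chosen so that S1/S3 and S2/S4 are the closest pairs.
toy = {"S1": "ACGTACGT", "S2": "TTGTCCGT", "S3": "ACGTACGA", "S4": "TTGTCCGA"}
order = merge_order(toy)
```

With these toy sequences the two closest pairs are (S1, S3) and (S2, S4), mirroring the merge order in the figure.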

Three Protein Sequences
>sp|P25454|RAD51_YEAST DNA repair protein RAD51 OS=Saccharomyces cerevisiae GN=RAD51 PE=1 SV=1
MSQVQEQHISESQLQYGNGSLMSTVPADLSQSVVDGNGNGSSEDIEATNGSGDGGGLQEQAEAQGEMEDEAYDEAALGSFVPIEKLQVNGITMADVKKLRESGLHTAEAVAYAPRKDLLEIKGISEAKADKLLNEAARLVPMGFVTAADFHMRRSELICLTTGSKNLDTLLGGGVETGSITELFGEFRTGKSQLCHTLAVTCQIPLDIGGGEGKCLYIDTEGTFRPVRLVSIAQRFGLDPDDALNNVAYARAYNADHQLRLLDAAAQMMSESRFSLIVVDSVMALYRTDFSGRGELSARQMHLAKFMRALQRLADQFGVAVVVTNQVVAQVDGGMAFNPDPKKPIGGNIMAHSSTTRLGFKKGKGCQRLCKVVDSPCLPEAECVFAIYEDGVGDPREEDE
>sp|P25453|DMC1_YEAST Meiotic recombination protein DMC1 OS=Saccharomyces cerevisiae GN=DMC1 PE=1 SV=1
MSVTGTEIDSDTAKNILSVDELQNYGINASDLQKLKSGGIYTVNTVLSTTRRHLCKIKGLSEVKVEKIKEAAGKIIQVGFIPATVQLDIRQRVYSLSTGSKQLDSILGGGIMTMSITEVFGEFRCGKTQMSHTLCVTTQLPREMGGGEGKVAYIDTEGTFRPERIKQIAEGYELDPESCLANVSYARALNSEHQMELVEQLGEELSSGDYRLIVVDSIMANFRVDYCGRGELSERQQKLNQHLFKLNRLAEEFNVAVFLTNQVQSDPGASALFASADGRKPIGGHVLAHASATRILLRKGRGDERVAKLQDSPDMPEKECVYVIGEKGITDSSD
>sp|P48295|RECA_STRVL Protein recA OS=Streptomyces violaceus GN=recA PE=3 SV=1
MAGTDREKALDAALAQIERQFGKGAVMRMGDRTQEPIEVISTGSTALDIALGVGGLPRGRVVEIYGPESSGKTTLTLHAVANAQKAGGQVAFVDAEHALDPEYAKKLGVDIDNLILSQPDNGEQALEIVDMLVRSGALDLIVIDSVAALVPRAEIEGEMGDSHVGLQARLMSQALRKITSALNQSKTTAIFINQLREKIGVMFGSPETTTGGRALKFYASVRLDIRRIETLKDGTDAVGNRTRVKVVKNKVAPPFKQAEFDILYGQGISREGGLIDMGVEHGFVRKAGAWYTYEGDQLGQGKENARNFLKDNPDLADEIERKIKEKLGVGVRPDAAKAEAATDAAAADTAGTDDAAKSVPAPASKTAKATKATAVKS

An Alignment from ClustalW
sp|P25454|RAD51_YEAST MSQVQEQHISESQLQYGNGSLMSTVPADLSQSVVDGNGNGSSEDIEATNG 50
sp|P25453|DMC1_YEAST ---------------------MSVTGTEIDSDTAKN-------------- 15
sp|P48295|RECA_STRVL ------------MAGTDREKALDAALAQIERQFGKG-------------- 24
:... :::. . ..
sp|P25454|RAD51_YEAST SGDGGGLQEQAEAQGEMEDEAYDEAALGSFVPIEKLQVNGITMADVKKLR 100
sp|P25453|DMC1_YEAST -----------------------------ILSVDELQNYGINASDLQKLK 36
sp|P48295|RECA_STRVL -------------------------------AVMRMGDRTQEPIEVISTG 43
.: .: :: .
sp|P25454|RAD51_YEAST ESGLHTAEAVAYAPRKDLLEIKG-ISEAKADKLLNEAARLVPMG----FV 145
sp|P25453|DMC1_YEAST SGGIYTVNTVLSTTRRHLCKIKG-LSEVKVEKIKEAAGKIIQVG----FI 81
sp|P48295|RECA_STRVL STALDIALGVGGLPRGRVVEIYGPESSGKTTLTLHAVANAQKAGGQVAFV 93
. .: . * .* : :* * *. *. . ... * *:
sp|P25454|RAD51_YEAST TAADFHMRRSELICLTTGSKNLDTLLGGGVETGSITELFGEFRTGKSQLC 195
sp|P25453|DMC1_YEAST PATVQLDIRQRVYSLSTGSKQLDSILGGGIMTMSITEVFGEFRCGKTQMS 131
sp|P48295|RECA_STRVL DAEHALDPEYAKKLGVDIDNLILSQPDNGEQALEIVDML--VRSGALDLI 141
* . .: : : ..* : .*.::: .* * ::

Phylogenetic Analysis

Evolution
• At the molecular level, evolution is a process of mutation with selection.
• Molecular evolution is the study of changes in genes and proteins throughout different branches of the tree of life.
• Phylogeny is the inference of evolutionary relationships.
• Traditionally, phylogeny relied on the comparison of morphological features between organisms. Today, molecular sequence data are also used for phylogenetic analyses.

Phylogenetic Trees
• Phylogenetic trees are trees that describe the “relations” among species (genes, sequences)
  – Evolutionary relationships are shown as branches
    • The most closely related sequences are drawn as neighboring branches
  – Length and nesting reflect the degree of similarity between any two items (sequences, species, etc.)
• Objective of phylogenetic analysis: determine branch lengths and figure out how the tree should be drawn
  – Dependent upon good multiple sequence alignment programs
  – Group sequences with similar patterns of substitutions

Uses of Phylogenetic Analysis
• Phylogeny can answer questions such as:
  – How many genes are related to the gene I am working on?
  – Are humans really closest to chimps and gorillas?
  – How related are chicken, dog, mouse to zebrafish?
  – Where and when did HIV originate?
  – What is the history of life on earth?
• Given a set of genes, determine genes likely to have equivalent functions
• Follow changes occurring in a rapidly changing species
• Example: influenza
  – Study rapidly changing genes in the influenza genome, predict next year’s strain, and develop the flu vaccine accordingly

Difficulties With Phylogenetic Analysis
• Horizontal or lateral transfer of genetic material (for instance through viruses) makes it difficult to determine the phylogenetic origin of some evolutionary events
• Genes under selective pressure can evolve rapidly, masking earlier changes that had occurred phylogenetically
• Two sites within the compared sequences may be evolving at different rates
• Rearrangements of genetic material can lead to false conclusions
• Duplicated genes can evolve along separate pathways, leading to different functions

Rooted Trees
• One sequence (root) defined to be the common ancestor of all other sequences
• Root chosen as a sequence thought to have branched off earliest
• A rooted tree specifies the evolutionary path for each sequence
• A tree can be rooted using an outgroup (that is, a sequence known to be distantly related to all other sequences).
[Figure: rooted tree with nodes numbered 1 to 9, drawn along a time axis from past to present. Source: http://www.ncbi.nlm.nih.gov/About/primer/phylo.html]

Unrooted Tree
• Indicates evolutionary relationships without revealing the location of the oldest ancestor
[Figure: unrooted tree with nodes numbered 1 to 8. Source: http://www.ncbi.nlm.nih.gov/About/primer/phylo.html]

4 Steps of Phylogenetic Analysis
• Molecular phylogenetic analysis may be described in four steps:
  – Selection of sequences for analysis
  – Multiple sequence alignment
  – Tree building
  – Tree evaluation

Selection of Sequences (1/2)
• For phylogeny, DNA can be more informative.
  – Protein-coding sequences have synonymous and nonsynonymous substitutions. Thus, some DNA changes do not have corresponding protein changes.
  – Some substitutions in a DNA sequence alignment can be directly observed: single nucleotide substitutions, sequential substitutions, coincidental substitutions.
  – Additional mutational events can be inferred by analysis of ancestral sequences. These changes include parallel substitutions, convergent substitutions, and back substitutions.
  – Pseudogenes and noncoding regions may be analyzed using DNA

Selection of Sequences (2/2)
• For phylogeny, protein sequences are also often used.
  – Proteins have 20 states (amino acids) instead of only four for DNA, so there is a stronger phylogenetic signal.
  – Nucleotides are unordered characters: any one nucleotide can change to any other in one step.
  – An ordered character must pass through one or more intermediate states before reaching the final state.
  – Amino acid sequences are partially ordered character states: there is a variable number of states between the starting value and the final value.

Multiple Sequence Alignment
• The fundamental basis of a phylogenetic tree is a multiple sequence alignment.
  – Confirm that all sequences are homologous
  – Adjust gap creation and extension penalties as needed to optimize the alignment
  – Restrict phylogenetic analysis to regions of the multiple sequence alignment for which data are available for all sequences (species)

Building Tree
• Two tree-building methods: distance-based and character-based
  – Distance-based methods involve a distance metric, such as the number of amino acid changes between the sequences, or a distance score
  – Character-based methods include maximum parsimony and maximum likelihood
• In both distance-based and character-based methods for building a tree, the starting point is a multiple sequence alignment

Maximum Parsimony
• Predicts the evolutionary tree by minimizing the number of steps required to generate the observed variation
• For each position, the phylogenetic trees requiring the smallest number of evolutionary changes to produce the observed sequence changes are identified
• Columns representing greater variation dominate the analysis
• Trees producing the smallest number of changes for all sequence positions are identified
• Time-consuming algorithm
• Only works well if the sequences have strong sequence similarity

Maximum Parsimony Example
1 A A G A G T G C A
2 A G C C G T G C G
3 A G A T A T C C A
4 A G A G A T C C G
• Four sequences, three possible unrooted trees
[Figure: the three unrooted trees. Tree 1 pairs (1,2) with (3,4); Tree 2 pairs (1,3) with (2,4); Tree 3 pairs (1,4) with (2,3).]

Maximum Parsimony Example
• Some sites are informative, others are not
• A site is informative if there are at least two different kinds of letters at the site, each of which is represented in at least two of the sequences
• Only informative sites are considered
1 A A G A G T G C A
2 A G C C G T G C G
3 A G A T A T C C A
4 A G A G A T C C G
Three informative columns
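
The informative-site rule can be checked mechanically. A short sketch, with a function name chosen here for illustration; column indices are 0-based, so the slide's columns 5, 7, and 9 appear as 4, 6, and 8.

```python
from collections import Counter


def informative_sites(sequences: list[str]) -> list[int]:
    """0-based columns with at least two letters that each occur in two or more sequences."""
    sites = []
    for index, column in enumerate(zip(*sequences)):
        counts = Counter(column)
        # Letters that appear in at least two of the sequences at this column.
        shared = [letter for letter, count in counts.items() if count >= 2]
        if len(shared) >= 2:
            sites.append(index)
    return sites


# The four sequences from the slide.
example = ["AAGAGTGCA", "AGCCGTGCG", "AGATATCCA", "AGAGATCCG"]
sites = informative_sites(example)
```

For the slide's four sequences this returns three columns, in agreement with the "three informative columns" noted above.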

Maximum Parsimony Example
Informative columns only:
1 G G A
2 G G G
3 A C A
4 A C G
[Figure: for each informative column (Col 1, Col 2, Col 3), the three candidate trees are drawn with the branches that require a substitution marked.]
# of Changes:
Tree 1: 4
Tree 2: 5
Tree 3: 6
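
The change counts can be reproduced with Fitch's small-parsimony counting, which is one standard way to count the minimum number of substitutions on a fixed tree; the slides do not say which procedure was used, so treat this as a sketch. Trees are written as nested tuples of 0-based sequence indices, and the pairings for Trees 1 to 3 are assumed to be (1,2)|(3,4), (1,3)|(2,4), and (1,4)|(2,3).

```python
def fitch(tree, column):
    """Return (possible states, minimum changes) for one alignment column on a tree."""
    if isinstance(tree, int):                 # leaf: the observed letter, no changes
        return {column[tree]}, 0
    left_states, left_changes = fitch(tree[0], column)
    right_states, right_changes = fitch(tree[1], column)
    common = left_states & right_states
    if common:                                # children agree: no extra change needed
        return common, left_changes + right_changes
    return left_states | right_states, left_changes + right_changes + 1


def tree_length(tree, sequences, sites):
    """Total changes over the chosen columns."""
    return sum(fitch(tree, [seq[i] for seq in sequences])[1] for i in sites)


SEQUENCES = ["AAGAGTGCA", "AGCCGTGCG", "AGATATCCA", "AGAGATCCG"]
INFORMATIVE = [4, 6, 8]                       # 0-based informative columns
TREES = {
    "Tree 1": ((0, 1), (2, 3)),
    "Tree 2": ((0, 2), (1, 3)),
    "Tree 3": ((0, 3), (1, 2)),
}
lengths = {name: tree_length(tree, SEQUENCES, INFORMATIVE) for name, tree in TREES.items()}
```

Under these assumptions the totals are 4, 5, and 6 changes, so Tree 1 is the most parsimonious, consistent with the counts on the slide.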

Distance Methods
• Look at the number of changes between each pair in a group of sequences
• Identify the tree that positions neighbors correctly and has branch lengths reproducing the original data as closely as possible
• Distance score counted as:
  – # of mismatched positions in the alignment
  – # of sequence positions changed to generate the second sequence
• Success depends on the degree to which the distances are additive on a predicted evolutionary tree

Example of Distance Analysis
• Consider the alignment:
A ACGCGTTGGGCGATGGCAAC
B ACGCGTTGGGCGACGGTAAT
C ACGCATTGAATGATGATAAT
D ACACATTGAGTGATAATAAT
• Calculate distances (# of differences)
• Using this information, a tree can be drawn:
[Figure: tree with leaves C, D, A, B]
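
The distance calculation for this alignment can be sketched as a count of differing positions for every pair of sequences. The function names are illustrative choices.

```python
from itertools import combinations


def count_differences(row_a: str, row_b: str) -> int:
    """Number of aligned positions at which the two sequences differ."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(1 for x, y in zip(row_a, row_b) if x != y)


def distance_table(aligned: dict[str, str]) -> dict[tuple[str, str], int]:
    """Pairwise difference counts for every pair of sequences."""
    return {(p, q): count_differences(aligned[p], aligned[q])
            for p, q in combinations(sorted(aligned), 2)}


ALIGNMENT = {
    "A": "ACGCGTTGGGCGATGGCAAC",
    "B": "ACGCGTTGGGCGACGGTAAT",
    "C": "ACGCATTGAATGATGATAAT",
    "D": "ACACATTGAGTGATAATAAT",
}
table = distance_table(ALIGNMENT)
```

The smallest distances are A to B and C to D (3 differences each), while every A or B versus C or D comparison differs at 6 to 8 positions, so a tree built from these distances would be expected to join A with B and C with D.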

Maximum Likelihood (ML)
• A likelihood is calculated for the probability of each residue in an alignment, based upon some model of the substitution process.
• A maximum likelihood method constructs a phylogenetic tree from DNA sequences whose likelihood is a maximum. This corresponds to the tree that makes the data the most probable evolutionary outcome
  – Calculates the likelihood of a tree given an alignment
  – Probability of each tree is the product of mutation rates in each branch
  – Likelihoods given by each column are multiplied to give the likelihood of the tree
• This approach requires an explicit model of evolution, which is both a strength and a weakness because the results depend on the model used
• This method can also be very computationally expensive
• Can only be done for a handful of sequences

Which Method to Choose?
• Depends upon the sequences that are being compared
  – Strong sequence similarity:
    • Maximum parsimony
  – Clearly recognizable sequence similarity:
    • Distance methods
  – All others:
    • Maximum likelihood
• Best to choose at least two approaches
• Compare the results: if they are similar, you can have more confidence

Evaluating Trees
• The main criteria by which the accuracy of a phylogenetic tree is assessed are consistency, efficiency, and robustness.
• Bootstrapping is a commonly used approach to measuring the robustness of a tree topology
  – Given a branching order, how consistently does an algorithm find that branching order in a randomly resampled version of the original data set?
  – To bootstrap, make an artificial dataset by randomly sampling columns, with replacement, from your multiple sequence alignment.
  – Make the dataset the same size as the original.
  – Do 100 bootstrap replicates.
  – Observe the percent of cases in which the assignment of clades in the original tree is supported by the bootstrap replicates
  – >70% is considered significant
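
The resampling step can be sketched as below. Columns are drawn with replacement until the replicate has as many columns as the original alignment. The tree-building step is left as a caller-supplied function (`find_clades` is a hypothetical placeholder, not part of any particular package) that returns the set of clades found in a replicate.

```python
import random


def bootstrap_replicate(alignment: list[str], rng: random.Random) -> list[str]:
    """Resample alignment columns with replacement, keeping the original width."""
    width = len(alignment[0])
    picks = [rng.randrange(width) for _ in range(width)]
    return ["".join(row[i] for i in picks) for row in alignment]


def clade_support(alignment, clade, find_clades, replicates: int = 100, seed: int = 0) -> float:
    """Percent of bootstrap replicates whose tree contains `clade`.

    `find_clades` is a hypothetical callable: it takes an alignment and returns
    the set of clades (e.g. frozensets of sequence indices) in the tree it builds.
    """
    rng = random.Random(seed)
    hits = sum(1 for _ in range(replicates)
               if clade in find_clades(bootstrap_replicate(alignment, rng)))
    return 100.0 * hits / replicates


# One replicate of a small alignment, using a fixed seed for repeatability.
original = ["ACGCGTTGGG", "ACGCGTTGGG", "ACGCATTGAA", "ACACATTGAG"]
replicate = bootstrap_replicate(original, random.Random(1))
```

Every column of a replicate is a copy of some column of the original, and some original columns will typically appear more than once while others are left out.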

MEGA 5: Molecular Evolutionary Genetics Analysis
• http://www.megasoftware.net/
• Human, mouse, rat, and zebrafish CFTR gene
• Multiple sequence alignment by ClustalW
• Build a tree using Maximum Parsimony
• The obtained phylogenetic tree (leaf labels as they appear in the figure):
  NM_021050.2| Mouse CFTR
  XM_001062374.1| Rat CFTR
  NM_001044883.1| Zebrafish CFTR
  NM_000492.3| Human CFTR

Homework 2
• Retrieve the BRCA1 gene in human (Homo sapiens), mouse (Mus musculus), cow (Bos taurus), and dog (Canis lupus familiaris)
• Use the FASTA program to perform all-against-all pairwise sequence alignments
• Create a multiple sequence alignment with ClustalW using the web server
• Build phylogenetic trees using different methods (such as Neighbor Joining, minimum evolution, UPGMA, and maximum parsimony, as implemented in MEGA)