genomics and personalized care in health systems lecture 3 sequence alignment

Genomics and Personalized Care in Health Systems

Lecture 3 Sequence Alignment

Leming Zhou

School of Health and Rehabilitation Sciences

Department of Health Information Management


Outline

• Pairwise sequence alignment
• Multiple sequence alignment
• Phylogenetic tree


Similarity Search

• Find statistically significant matches to a protein or DNA sequence of interest.
• Obtain information on the inferred function of the gene
• Sequence identity/similarity is a quantitative measurement of the number of nucleotides or amino acids that are identical or similar in two aligned sequences
– Calculated from a sequence alignment
– Can be expressed as a percentage
– In proteins, some residues are chemically similar but not identical
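As a minimal sketch of how percent identity can be computed from an existing alignment: the function name `percent_identity` and the choice of the full alignment length as the denominator are assumptions for illustration (other conventions divide by the shorter sequence length or by the number of non-gap columns).

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns in which both sequences carry the same residue.

    Assumption: the denominator is the alignment length, gaps included.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )
    return 100.0 * identical / len(aligned_a)


# Hypothetical toy alignment: 4 identical columns out of 6.
print(round(percent_identity("ACGT-A", "ACCTGA"), 1))
```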


Sequence Alignment

• A linear, one-to-one correspondence between some of the symbols in one sequence and some of the symbols in another sequence
– Four possible outcomes at each position when aligning two sequences
• Identity; mismatch; gap in one sequence; gap in the other sequence
• May be DNA or protein sequences.


Evolutionary Basis of Alignment

• The simplest molecular mechanisms of evolution are substitution, insertion, and deletion
• If a sequence alignment represents the evolutionary relationship of two sequences, residues that are aligned but do not match represent substitutions
• Residues that are aligned with a gap in the other sequence represent insertions or deletions


Alignment Algorithms

• Sequences often contain highly conserved regions
• These regions can be used for an initial alignment


Alignments

• Two sequences
Seq 1: ACGGACT
Seq 2: ATCGGATCT
• There may be multiple ways of creating the alignment. Which alignment is the best?

Seq 1: A - C - G G - A C T
       |   |   |       | |
Seq 2: A T C G G A T - C T

Seq 2: A T C G G A T C T
       |   | | |     | |
Seq 1: A - C G G - A C T


Optimal vs. Correct Alignment

• For a given group of sequences, there is no single “correct” alignment, only an alignment that is “optimal” according to some set of calculations
• This is partly due to:
– the complexity of the problem,
– limitations of the scoring systems used,
– our limited understanding of life and evolution
• Success of the alignment will depend on the similarity of the sequences. If sequence variation is great, it will be very difficult to find an optimal alignment


Optimal Alignment

• Every alignment has a score
• Choose the alignment with the highest score
• Must choose an appropriate scoring function
• Scoring function is based on an evolutionary model with insertions, deletions, and substitutions
• Use a substitution score matrix, which contains an entry for every amino acid pair


Gaps

• Positions at which a letter is paired with a null are called gaps. Gap scores are typically negative.
• Since a single mutational event may cause the insertion or deletion of more than one residue, the presence of a gap is ascribed more significance than the length of the gap.
• Biologically, it should in general be easier for a sequence to accept a different residue in a position than to have parts of the sequence deleted or inserted. Insertions and deletions (gaps) should therefore be rarer than point mutations (substitutions)


Gaps in Sequence Alignment

• Gaps can occur
– Before the first character of a string
– Inside a string
– After the last character of a string

CTGCGGG---GGTAAT
  ||||    || ||
--GCGG-AGAGG-AA-


Gap penalties

• There is no suitable theory for gap penalties.
• The simplest gap penalty is a constant penalty for each gap
• The most common type of gap penalty is the affine gap penalty: g = a + bx
– a is the gap opening penalty
– b is the gap extension penalty
– x is the number of gapped-out residues.
• A contiguous block of residues is more likely to be inserted or deleted in one event than several separate residues
• The scoring scheme should therefore penalize opening a new gap more than extending an existing one
• Typical values, e.g. a = 10 and b = 1 for BLAST.
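The affine formula g = a + bx can be illustrated with a short sketch. The function name `affine_gap_penalty` is hypothetical; the defaults a = 10 and b = 1 are the typical values quoted on the slide.

```python
def affine_gap_penalty(length, gap_open=10, gap_extend=1):
    """Penalty g = a + b*x for one gap of `length` residues (0 if there is no gap)."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


# One gap of three residues versus three separate one-residue gaps.
one_long_gap = affine_gap_penalty(3)
three_short_gaps = 3 * affine_gap_penalty(1)
print(one_long_gap, three_short_gaps)
```

Under these values a single three-residue gap costs 13, while three isolated one-residue gaps cost 33, which is how the scheme favors contiguous insertions or deletions.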

Pairwise Sequence Alignment


Pairwise Alignment

• The process of lining up two sequences to achieve maximal levels of identity or conservation for the purpose of assessing the degree of similarity and the possibility of homology
• It is used to
– Decide if two genes are related structurally or functionally
• Find the similarities between two sequences with the same evolutionary background
– Identify domains or motifs that are shared between proteins
– Analyze genomes
• Identify genes, search large databases, determine overlaps of sequences (DNA assembly)


DNA and Protein Sequences

• DNA alphabet: {A, C, G, T}+
– Four discrete possibilities: it is either a match or a mismatch
• Protein alphabet: {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}+
– 20 possibilities which fall into several categories
– Residues can be similar without being identical
• In some cases, protein sequence is more informative
– Codons are degenerate: changes in the third position often do not alter the amino acid that is specified
• In some cases, DNA alignments are appropriate
– To confirm the identity of a cDNA; to study noncoding regions of DNA; to study DNA polymorphisms, …


Translating a DNA Sequence into Proteins

• DNA sequences can be translated into protein, and then used in pairwise alignments
• One DNA sequence can be translated into six potential proteins (three reading frames on each strand)

Forward-strand frames: 5’ CAT CAA   5’ ATC AAC   5’ TCA ACT

5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’

Reverse-strand frames: 5’ GTG GGT   5’ TGG GTA   5’ GGG TAG
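The six reading frames can be sketched in a few lines. The helper names (`reverse_complement`, `translate`, `six_frames`) are hypothetical, and the codon table is the standard genetic code built from the usual TCAG ordering; stop codons are written as `*`.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order map onto AMINO.
CODON_TABLE = {
    "".join(codon): AMINO[i] for i, codon in enumerate(product(BASES, repeat=3))
}


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Three frames on the given strand, then three on the reverse complement."""
    strands = (dna, reverse_complement(dna))
    return [translate(s[offset:]) for s in strands for offset in range(3)]


seq = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"
for frame in six_frames(seq):
    print(frame)
```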


DNA Alignment Score

CGAAGACTTGAGCTGAT
|| |||| ||| ||||
CGCAGACATGA-CTGAC

S = Σ(identities, mismatches, gap penalties)

Score = Max(S)

(Figure labels: Match, Mismatch, Gap)


Alignment Scoring Scheme

• Possible scoring scheme:
– match: +5
– mismatch: -3
– indel: -4
• Example:

G  A  A  T  T  C  A  G  T  T  A
|     |     |  |     |        |
G  G  A  -  T  C  -  G  -  -  A
+5 -3 +5 -4 +5 +5 -4 +5 -4 -4 +5

S = 5 - 3 + 5 - 4 + 5 + 5 - 4 + 5 - 4 - 4 + 5 = 11

• Scoring matrix:

     A    C    G    T
A    5
C   -3    5
G   -3   -3    5
T   -3   -3   -3    5
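The worked example above can be checked with a short sketch that scores an already-aligned pair column by column. The function name `score_alignment` is hypothetical; the scores are the ones given in the scheme (match +5, mismatch -3, indel -4).

```python
def score_alignment(top, bottom, match=5, mismatch=-3, indel=-4, gap="-"):
    """Sum per-column scores of two aligned, equal-length sequences."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(top, bottom):
        if x == gap or y == gap:
            total += indel      # a residue paired with a gap
        elif x == y:
            total += match      # identity
        else:
            total += mismatch   # substitution
    return total


print(score_alignment("GAATTCAGTTA", "GGA-TC-G--A"))
```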


Amino Acid Sequence Alignment

• No exact match/mismatch scores
• Match state score calculated by table lookup
• Lookup table is a substitution matrix (or scoring matrix)


Substitution Matrix

• A substitution matrix contains values proportional to the probability that amino acid i mutates into amino acid j for all pairs of amino acids.
• Substitution matrices are constructed by assembling a large and diverse sample of verified pairwise alignments (or multiple sequence alignments) of amino acids.
• Substitution matrices should reflect the true probabilities of mutations occurring through a period of evolution.
• The two major types of substitution matrices are Point Accepted Mutation (PAM) and BLOcks SUbstitution Matrix (BLOSUM).


Sequence Alignment Algorithms

• Dynamic Programming:
– Needleman-Wunsch Global Alignment (1970)
– Smith-Waterman Local Alignment (1981)
• Guaranteed to find the best-scoring alignment
• Slow, especially when used to compare a sequence against a large database
• Heuristics
– FASTA, BLAST: heuristic approximations to Smith-Waterman
• Fast, with results comparable to the Smith-Waterman algorithm


Dynamic Programming

• Solves optimization problems by dividing the problem into overlapping subproblems and reusing their stored solutions
• Sequence alignment has the optimal substructure property
– Subproblem: alignment of prefixes of two sequences
– Each subproblem is computed once and stored in a matrix
• Optimal score: built upon the optimal alignment computed to that point
• Aligns two sequences beginning at the ends, attempting to align all possible pairs of characters
– Alignment contains matches, mismatches, and gaps
– Scoring scheme for matches, mismatches, gaps
– Highest set of scores defines the optimal alignment between the sequences
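A minimal sketch of global alignment by dynamic programming (Needleman-Wunsch with a linear gap penalty) may help make the prefix subproblems concrete. The function name and the default scores (match +5, mismatch -3, gap -4) are assumptions chosen for illustration; ties in the traceback are broken in favor of the diagonal move, so other equally good alignments may exist.

```python
def needleman_wunsch(a, b, match=5, mismatch=-3, gap=-4):
    """Return (optimal score, aligned a, aligned b) for a global alignment."""
    n, m = len(a), len(b)
    # F[i][j] = best score for aligning the prefixes a[:i] and b[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          F[i - 1][j] + gap,     # gap in b
                          F[i][j - 1] + gap)     # gap in a

    # Traceback from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("GAATTCAGTTA", "GGATCGA")
print(score)
print(top)
print(bottom)
```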


The Big O Notation

• Computational complexity of an algorithm is how its execution time increases as the problem is made larger (e.g. more sequences to align)
• The big-O notation
– If we have a problem size n, then an algorithm takes O(n) time if the time increases linearly with n. If the algorithm needs time proportional to the square of n, then it is O(n^2)
– More examples, where c is a constant:
• O(c) utopian
• O(log n) excellent
• O(n) very good
• O(n^2) not so good
• O(n^3) pretty bad
• O(c^n) disaster


Drawbacks to DP Approaches

• Compute intensive
• Memory intensive
• Complexity of DP Algorithm
– Time O(nm); space O(nm)
• where n, m are the lengths of the two sequences.
– Space complexity can be reduced to O(n) by not storing the entries of the dynamic programming table that are no longer needed for the computation (keep the current row and the previous row only)
• A fast heuristic (BLAST) will be discussed next week
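The two-row trick can be sketched as follows. This is an illustrative version that returns only the optimal global alignment score; the function name and default scores (match +5, mismatch -3, gap -4) are assumptions, and recovering the alignment itself in linear space needs extra machinery that is not shown here.

```python
def alignment_score_two_rows(a, b, match=5, mismatch=-3, gap=-4):
    """Optimal global alignment score using O(len(b)) memory."""
    prev = [j * gap for j in range(len(b) + 1)]   # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)           # first column: a[:i] against gaps
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + s,        # diagonal
                          prev[j] + gap,          # gap in b
                          curr[j - 1] + gap)      # gap in a
        prev = curr                               # older rows are discarded
    return prev[-1]


print(alignment_score_two_rows("ACGT", "ACGT"))
```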


Two Sequences

>gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA

ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGA

GGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGC

AGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATG

CTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGC

TCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGAT

CCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCA

CCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCA

CTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACT

GGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC

>gi|17985948|ref|NM_033234.1| Rattus norvegicus hemoglobin, beta (Hbb), mRNA

TGCTTCTGACATAGTTGTGTTGACTCACAAACTCAGAAACAGACACCATGGTGCACCTGACTGATGCTGA

GAAGGCTGCTGTTAATGGCCTGTGGGGAAAGGTGAACCCTGATGATGTTGGTGGCGAGGCCCTGGGCAGG

CTGCTGGTTGTCTACCCTTGGACCCAGAGGTACTTTGATAGCTTTGGGGACCTGTCCTCTGCCTCTGCTA

TCATGGGTAACCCTAAGGTGAAGGCCCATGGCAAGAAGGTGATAAACGCCTTCAATGATGGCCTGAAACA

CTTGGACAACCTCAAGGGCACCTTTGCTCATCTGAGTGAACTCCACTGTGACAAGCTGCATGTGGATCCT

GAGAACTTCAGGCTCCTGGGCAATATGATTGTGATTGTGTTGGGCCACCACCTGGGCAAGGAATTCACCC

CCTGTGCACAGGCTGCCTTCCAGAAGGTGGTGGCTGGAGTGGCCAGTGCCCTGGCTCACAAGTACCACTA

AACCTCTTTTCCTGCTCTTGTCTTTGTGCAATGGTCAATTGTTCCCAAGAGAGCATCTGTCAGTTGTTGT

CAAAATGACAAAGACCTTTGAAAATCTGTCCTACTAATAAAAGGCATTTACTTTCACTGC


Pairwise Sequence Alignment

• FASTA: http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi
• DNA vs. DNA comparison
• Default parameters:
– Match: +5
– Mismatch: -4
– Gap open penalty: -12
– Gap extension penalty: -4
• BLAST search will be covered next week


Multiple Sequence Alignment


Multiple Sequence Alignment

• Multiple sequence alignment (MSA) is a generalization of pairwise sequence alignment: instead of aligning two sequences, n (>2) sequences are aligned simultaneously
• A multiple sequence alignment is obtained by inserting gaps (“-”) into the sequences such that the resulting sequences all have length L and can be arranged in a matrix of n rows and L columns, where each column represents a homologous position
• MSA applies to both DNA and protein sequences


Why Do We Need MSA?

• MSA can help to develop a sequence “fingerprint” which allows the identification of members of a distantly related protein family (motifs)
• Formulate and test hypotheses about protein 3-D structure
• MSA can help us to reveal biological facts about proteins, e.g. how protein function has changed or the evolutionary pressure acting on a gene
• Crucial for genome sequencing:
– Random fragments of a large molecule are sequenced and those that overlap are found by a multiple sequence alignment program.
• To establish homology for phylogenetic analyses
• Identify homologous sequences in other organisms


Multiple Sequence Alignment

• Difficulty: introducing multiple sequences increases the number of combinations of matches, mismatches, and gaps
• In pairwise alignments, one has a 2D matrix with the sequences on each axis. The number of operations required to locate the best “path” through the matrix is approximately proportional to the product of the lengths of the two sequences
• A possible general method would be to extend the pairwise alignment method into a simultaneous N-wise alignment, using a DP algorithm in N dimensions. Algorithmically, this is not difficult to do


Example

fly GAKKVIISAP SAD.APM..F VCGVNLDAYK PDMKVVSNAS CTTNCLAPLA

human GAKRVIISAP SAD.APM..F VMGVNHEKYD NSLKIISNAS CTTNCLAPLA

plant GAKKVIISAP SAD.APM..F VVGVNEHTYQ PNMDIVSNAS CTTNCLAPLA

bacterium GAKKVVMTGP SKDNTPM..F VKGANFDKY. AGQDIVSNAS CTTNCLAPLA

yeast GAKKVVITAP SS.TAPM..F VMGVNEEKYT SDLKIVSNAS CTTNCLAPLA

archaeon GADKVLISAP PKGDEPVKQL VYGVNHDEYD GE.DVVSNAS CTTNSITPVA

fly KVINDNFEIV EGLMTTVHAT TATQKTVDGP SGKLWRDGRG AAQNIIPAST

human KVIHDNFGIV EGLMTTVHAI TATQKTVDGP SGKLWRDGRG ALQNIIPAST

plant KVVHEEFGIL EGLMTTVHAT TATQKTVDGP SMKDWRGGRG ASQNIIPSST

bacterium KVINDNFGII EGLMTTVHAT TATQKTVDGP SHKDWRGGRG ASQNIIPSST

yeast KVINDAFGIE EGLMTTVHSL TATQKTVDGP SHKDWRGGRT ASGNIIPSST

archaeon KVLDEEFGIN AGQLTTVHAY TGSQNLMDGP NGKP.RRRRA AAENIIPTST

fly GAAKAVGKVI PALNGKLTGM AFRVPTPNVS VVDLTVRLGK GASYDEIKAK

human GAAKAVGKVI PELNGKLTGM AFRVPTANVS VVDLTCRLEK PAKYDDIKKV

plant GAAKAVGKVL PELNGKLTGM AFRVPTSNVS VVDLTCRLEK GASYEDVKAA

bacterium GAAKAVGKVL PELNGKLTGM AFRVPTPNVS VVDLTVRLEK AATYEQIKAA

yeast GAAKAVGKVL PELQGKLTGM AFRVPTVDVS VVDLTVKLNK ETTYDEIKKV

archaeon GAAQAATEVL PELEGKLDGM AIRVPVPNGS ITEFVVDLDD DVTESDVNAA


MSA

• How do we generate a multiple alignment? Given a pairwise alignment, just add the third, then the fourth, and so on, until all have been aligned. Does it work?
• It is not self-evident how these sequences are to be aligned together.
• It depends not only on the various alignment parameters but also on the order in which sequences are added to the multiple alignment


Dynamic Programming for MSA

• Dynamic programming with two sequences
– Relatively easy to code
– Guaranteed to obtain optimal alignment
• An extension of the pairwise sequence alignment
– Alignment of K sequences
• K(K-1)/2 possible sequence comparisons
• Alignment algorithms operate in a similar manner to pairwise alignment, but now the distance matrix is K-dimensional and the weight function compares K letters


Time Complexity of Optimal MSA

• Space complexity (hyperlattice size): O(n^k) for k sequences, each n long.
• Computing a hyperlattice node: O(2^k).
• Time complexity: O(2^k n^k).
• Finding the optimal solution takes time exponential in k (non-polynomial; the problem is NP-hard).


Heuristics for Optimal MSA

• Reduction of space and time
• Heuristic alignment: not guaranteed to be optimal
• Alignment provides a limit to the volume within which optimal alignments are likely to be found
• Heuristics:
– Progressive alignments (ClustalW)


Progressive Alignment

• Works by progressive alignment: it aligns a pair of sequences, then aligns the next one onto the first pair
• The most closely related sequences are aligned first, and then additional sequences and groups of sequences are added, guided by the initial alignments
• Uses alignment scores to produce a guide tree
• Aligns the sequences sequentially, guided by the relationships indicated by the tree
– If the order is wrong and distantly related sequences are merged too soon, errors in the alignment may occur and propagate
• Gap penalties can be adjusted based on the specific sequences


CLUSTALW

• http://www.ebi.ac.uk/clustalw/
• Perform pairwise alignments of all sequences
• Use alignment scores to produce a guide tree
• Align sequences sequentially, guided by the tree
• Enhanced dynamic programming is used to align sequences
• Genetic distance is determined by the number of mismatches divided by the number of matches
• Gaps are added to an existing profile in progressive methods
• CLUSTALW incorporates a statistical model in order to place gaps where they are most likely to occur


ClustalW MSA Procedure

Figure: all pairwise alignments of S1 to S4 → similarity matrix → cluster analysis (by distance) → dendrogram → multiple alignment.

Multiple alignment steps:
1. Aligning S1 and S3
2. Aligning S2 and S4
3. Aligning (S1,S3) with (S2,S4).

From Higgins (1991) and Thompson (1994).
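The way a distance matrix determines the order of progressive alignment can be sketched with simple average-linkage clustering. This is an illustration only, not ClustalW's actual guide-tree method; the function name `guide_tree_order` and the distance values are hypothetical, chosen so that the merge order matches the three steps listed in the figure.

```python
from itertools import combinations


def guide_tree_order(names, dist):
    """Return the sequence of cluster merges under average-linkage clustering.

    dist maps frozenset({a, b}) to the pairwise distance between a and b.
    """
    def average_distance(c1, c2):
        pairs = [(x, y) for x in c1 for y in c2]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        # Merge the two clusters that are closest on average.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges


# Hypothetical pairwise distances between four sequences.
d = {
    frozenset(("S1", "S3")): 1, frozenset(("S2", "S4")): 2,
    frozenset(("S1", "S2")): 6, frozenset(("S1", "S4")): 7,
    frozenset(("S2", "S3")): 8, frozenset(("S3", "S4")): 9,
}
for step, (left, right) in enumerate(guide_tree_order(["S1", "S2", "S3", "S4"], d), 1):
    print(step, left, right)
```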


Three Protein Sequences

>sp|P25454|RAD51_YEAST DNA repair protein RAD51 OS=Saccharomyces cerevisiae GN=RAD51 PE=1 SV=1

MSQVQEQHISESQLQYGNGSLMSTVPADLSQSVVDGNGNGSSEDIEATNGSGDGGGLQEQAEAQGEMEDEAYDEAALGSFVPIEKLQVNGITMADVKKLRESGLHTAEAVAYAPRKDLLEIKGISEAKADKLLNEAARLVPMGFVTAADFHMRRSELICLTTGSKNLDTLLGGGVETGSITELFGEFRTGKSQLCHTLAVTCQIPLDIGGGEGKCLYIDTEGTFRPVRLVSIAQRFGLDPDDALNNVAYARAYNADHQLRLLDAAAQMMSESRFSLIVVDSVMALYRTDFSGRGELSARQMHLAKFMRALQRLADQFGVAVVVTNQVVAQVDGGMAFNPDPKKPIGGNIMAHSSTTRLGFKKGKGCQRLCKVVDSPCLPEAECVFAIYEDGVGDPREEDE

>sp|P25453|DMC1_YEAST Meiotic recombination protein DMC1 OS=Saccharomyces cerevisiae GN=DMC1 PE=1 SV=1

MSVTGTEIDSDTAKNILSVDELQNYGINASDLQKLKSGGIYTVNTVLSTTRRHLCKIKGLSEVKVEKIKEAAGKIIQVGFIPATVQLDIRQRVYSLSTGSKQLDSILGGGIMTMSITEVFGEFRCGKTQMSHTLCVTTQLPREMGGGEGKVAYIDTEGTFRPERIKQIAEGYELDPESCLANVSYARALNSEHQMELVEQLGEELSSGDYRLIVVDSIMANFRVDYCGRGELSERQQKLNQHLFKLNRLAEEFNVAVFLTNQVQSDPGASALFASADGRKPIGGHVLAHASATRILLRKGRGDERVAKLQDSPDMPEKECVYVIGEKGITDSSD

>sp|P48295|RECA_STRVL Protein recA OS=Streptomyces violaceus GN=recA PE=3 SV=1

MAGTDREKALDAALAQIERQFGKGAVMRMGDRTQEPIEVISTGSTALDIALGVGGLPRGRVVEIYGPESSGKTTLTLHAVANAQKAGGQVAFVDAEHALDPEYAKKLGVDIDNLILSQPDNGEQALEIVDMLVRSGALDLIVIDSVAALVPRAEIEGEMGDSHVGLQARLMSQALRKITSALNQSKTTAIFINQLREKIGVMFGSPETTTGGRALKFYASVRLDIRRIETLKDGTDAVGNRTRVKVVKNKVAPPFKQAEFDILYGQGISREGGLIDMGVEHGFVRKAGAWYTYEGDQLGQGKENARNFLKDNPDLADEIERKIKEKLGVGVRPDAAKAEAATDAAAADTAGTDDAAKSVPAPASKTAKATKATAVKS


An Alignment from ClustalW

sp|P25454|RAD51_YEAST MSQVQEQHISESQLQYGNGSLMSTVPADLSQSVVDGNGNGSSEDIEATNG 50

sp|P25453|DMC1_YEAST ---------------------MSVTGTEIDSDTAKN-------------- 15

sp|P48295|RECA_STRVL ------------MAGTDREKALDAALAQIERQFGKG-------------- 24

:... :::. . ..

sp|P25454|RAD51_YEAST SGDGGGLQEQAEAQGEMEDEAYDEAALGSFVPIEKLQVNGITMADVKKLR 100

sp|P25453|DMC1_YEAST -----------------------------ILSVDELQNYGINASDLQKLK 36

sp|P48295|RECA_STRVL -------------------------------AVMRMGDRTQEPIEVISTG 43

.: .: :: .

sp|P25454|RAD51_YEAST ESGLHTAEAVAYAPRKDLLEIKG-ISEAKADKLLNEAARLVPMG----FV 145

sp|P25453|DMC1_YEAST SGGIYTVNTVLSTTRRHLCKIKG-LSEVKVEKIKEAAGKIIQVG----FI 81

sp|P48295|RECA_STRVL STALDIALGVGGLPRGRVVEIYGPESSGKTTLTLHAVANAQKAGGQVAFV 93

. .: . * .* : :* * *. *. . ... * *:

sp|P25454|RAD51_YEAST TAADFHMRRSELICLTTGSKNLDTLLGGGVETGSITELFGEFRTGKSQLC 195

sp|P25453|DMC1_YEAST PATVQLDIRQRVYSLSTGSKQLDSILGGGIMTMSITEVFGEFRCGKTQMS 131

sp|P48295|RECA_STRVL DAEHALDPEYAKKLGVDIDNLILSQPDNGEQALEIVDML--VRSGALDLI 141

* . .: : : ..* : .*.::: .* * ::

Phylogenetic Analysis



Evolution

• At the molecular level, evolution is a process of mutation with selection.
• Molecular evolution is the study of changes in genes and proteins throughout different branches of the tree of life.
• Phylogeny is the inference of evolutionary relationships.
• Traditionally, phylogeny relied on the comparison of morphological features between organisms. Today, molecular sequence data are also used for phylogenetic analyses.


Phylogenetic Trees

• Phylogenetic trees are trees that describe the “relations” among species (genes, sequences)
– Evolutionary relationships are shown as branches
• Sequences most closely related are drawn as neighboring branches
– Length and nesting reflect the degree of similarity between any two items (sequences, species, etc.)
• Objective of phylogenetic analysis: determine branch lengths and figure out how the tree should be drawn
– Dependent upon good multiple sequence alignment programs
– Group sequences with similar patterns of substitutions


Uses of Phylogenetic Analysis

• Phylogeny can answer questions such as:
– How many genes are related to the gene I am working on?
– Are humans really closest to chimps and gorillas?
– How related are chicken, dog, and mouse to zebrafish?
– Where and when did HIV originate?
– What is the history of life on earth?
• Given a set of genes, determine genes likely to have equivalent functions
• Follow changes occurring in a rapidly changing species
• Example: influenza
– Study rapidly changing genes in the influenza genome, predict next year’s strain, and develop the flu vaccine accordingly


Difficulties With Phylogenetic Analysis

• Horizontal or lateral transfer of genetic material (for instance through viruses) makes it difficult to determine the phylogenetic origin of some evolutionary events
• Genes under selective pressure can evolve rapidly, masking earlier changes that had occurred phylogenetically
• Two sites within comparative sequences may be evolving at different rates
• Rearrangements of genetic material can lead to false conclusions
• Duplicated genes can evolve along separate pathways, leading to different functions


Rooted Trees

• One sequence (root) is defined to be the common ancestor of all other sequences
• Root is chosen as a sequence thought to have branched off earliest
• A rooted tree specifies an evolutionary path for each sequence
• A tree can be rooted using an outgroup (that is, a sequence known to be distantly related to all other sequences).

Figure: a rooted tree with nodes numbered 1 to 9, drawn along a time axis from past to present (source: http://www.ncbi.nlm.nih.gov/About/primer/phylo.html)


Unrooted Tree

• Indicates evolutionary relationships without revealing the location of the oldest ancestry

Figure: an unrooted tree with nodes numbered 1 to 8 (source: http://www.ncbi.nlm.nih.gov/About/primer/phylo.html)


4 Steps of Phylogenetic Analysis

• Molecular phylogenetic analysis may be described in four steps:
– Selection of sequences for analysis
– Multiple sequence alignment
– Tree building
– Tree evaluation



Selection of Sequences (1/2)

• For phylogeny, DNA can be more informative.
– Protein-coding sequences have synonymous and nonsynonymous substitutions. Thus, some DNA changes do not have corresponding protein changes.
– Some substitutions in a DNA sequence alignment can be directly observed: single nucleotide substitutions, sequential substitutions, coincidental substitutions.
– Additional mutational events can be inferred by analysis of ancestral sequences. These changes include parallel substitutions, convergent substitutions, and back substitutions.
– Pseudogenes and noncoding regions may be analyzed using DNA


Selection of Sequences (2/2)

• For phylogeny, protein sequences are also often used.
– Proteins have 20 states (amino acids) instead of only four for DNA, so there is a stronger phylogenetic signal.
– Nucleotides are unordered characters: any one nucleotide can change to any other in one step.
– An ordered character must pass through one or more intermediate states before reaching the final state.
– Amino acid sequences are partially ordered character states: there is a variable number of states between the starting value and the final value.


Multiple Sequence Alignment

• The fundamental basis of a phylogenetic tree is a multiple sequence alignment.
– Confirm that all sequences are homologous
– Adjust gap creation and extension penalties as needed to optimize the alignment
– Restrict phylogenetic analysis to regions of the multiple sequence alignment for which data are available for all sequences (species)


Building Tree

• Two tree-building methods: distance-based and character-based
– Distance-based methods involve a distance metric, such as the number of amino acid changes between the sequences, or a distance score
– Character-based methods include maximum parsimony and maximum likelihood
• In both distance-based and character-based methods for building a tree, the starting point is a multiple sequence alignment


Maximum Parsimony

• Predicts the evolutionary tree by minimizing the number of steps required to generate the observed variation
• For each position, the phylogenetic trees requiring the smallest number of evolutionary changes to produce the observed sequence changes are identified
• Columns representing greater variation dominate the analysis
• Trees producing the smallest number of changes for all sequence positions are identified
• Time-consuming algorithm
• Only works well if the sequences have strong sequence similarity


Maximum Parsimony Example

1 A A G A G T G C A
2 A G C C G T G C G
3 A G A T A T C C A
4 A G A G A T C C G

• Four sequences, three possible unrooted trees
– Tree 1: ((1,2),(3,4))
– Tree 2: ((1,3),(2,4))
– Tree 3: ((1,4),(2,3))


Maximum Parsimony Example

• Some sites are informative, others are not
• A site is informative if there are at least two different kinds of letters at the site, each of which is represented in at least two of the sequences
• Only informative sites are considered

1 A A G A G T G C A

2 A G C C G T G C G

3 A G A T A T C C A

4 A G A G A T C C G

Three informative columns


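Maximum Parsimony Example

The three informative columns (Col 1, Col 2, Col 3):

1 G G A
2 G G G
3 A C A
4 A C G

Figure: each of the three unrooted trees is drawn for Col 1, Col 2, and Col 3, with each substitution marked on the branch where it occurs.

# of changes:
Tree 1: 4
Tree 2: 5
Tree 3: 6

Tree 1 requires the fewest changes and is therefore the most parsimonious.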
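The counts in this example can be reproduced with a short sketch. The function names are hypothetical; informative sites follow the definition given on the slide, and the changes on each four-sequence tree are counted with Fitch's set-intersection rule, a standard way to count the minimum number of substitutions on a fixed tree (the slides do not name the counting procedure).

```python
def informative_sites(seqs):
    """Columns with at least two letters that each occur in at least two sequences."""
    sites = []
    for col in range(len(seqs[0])):
        counts = {}
        for s in seqs:
            counts[s[col]] = counts.get(s[col], 0) + 1
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            sites.append(col)
    return sites


def quartet_changes(seqs, pair1, pair2, sites):
    """Minimum substitutions on the unrooted tree (pair1 | pair2) over the given sites."""
    total = 0
    for col in sites:
        node_sets = []
        for x, y in (pair1, pair2):
            sx, sy = {seqs[x][col]}, {seqs[y][col]}
            if sx & sy:
                node_sets.append(sx & sy)
            else:
                node_sets.append(sx | sy)
                total += 1                 # one change inside this pair
        if not (node_sets[0] & node_sets[1]):
            total += 1                     # one change on the internal branch
    return total


seqs = ["AAGAGTGCA", "AGCCGTGCG", "AGATATCCA", "AGAGATCCG"]
sites = informative_sites(seqs)
trees = {
    "Tree 1": ((0, 1), (2, 3)),
    "Tree 2": ((0, 2), (1, 3)),
    "Tree 3": ((0, 3), (1, 2)),
}
for name, (p1, p2) in trees.items():
    print(name, quartet_changes(seqs, p1, p2, sites))
```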


Distance Methods

• Look at the number of changes between each pair in a group of sequences
• Identify the tree that positions neighbors correctly and whose branch lengths reproduce the original data as closely as possible
• Distance score counted as:
– # of mismatched positions in the alignment
– # of sequence positions changed to generate the second sequence
• Success depends on the degree to which the distances are additive on a predicted evolutionary tree


Example of Distance Analysis

• Consider the alignment:

A ACGCGTTGGGCGATGGCAAC
B ACGCGTTGGGCGACGGTAAT
C ACGCATTGAATGATGATAAT
D ACACATTGAGTGATAATAAT

• Calculate distances (# of differences)
• Using this information, a tree can be drawn

Figure: tree with leaves C, D, A, and B.
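The pairwise distances for this alignment can be computed directly by counting mismatched positions. A minimal sketch (the function name `count_differences` is hypothetical):

```python
from itertools import combinations


def count_differences(a, b):
    """Number of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


alignment = {
    "A": "ACGCGTTGGGCGATGGCAAC",
    "B": "ACGCGTTGGGCGACGGTAAT",
    "C": "ACGCATTGAATGATGATAAT",
    "D": "ACACATTGAGTGATAATAAT",
}
distances = {
    (p, q): count_differences(alignment[p], alignment[q])
    for p, q in combinations(alignment, 2)
}
for pair, d in distances.items():
    print(pair, d)
```

The smallest distances are between A and B and between C and D (3 differences each), so a distance method would be expected to place those pairs as neighbors.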


Maximum Likelihood (ML)

• A likelihood is calculated for the probability of each residue in an alignment, based upon some model of the substitution process.
• A maximum likelihood method constructs, from DNA sequences, the phylogenetic tree whose likelihood is a maximum. This corresponds to the tree that makes the data the most probable evolutionary outcome
– Calculates the likelihood of a tree given an alignment
– Probability of each tree is the product of mutation rates in each branch
– Likelihoods given by each column are multiplied to give the likelihood of the tree
• This approach requires an explicit model of evolution, which is both a strength and a weakness because the results depend on the model used
• This method can also be very computationally expensive
• Can only be done for a handful of sequences


Which Method to Choose?

• Depends upon the sequences that are being compared
– Strong sequence similarity:
• Maximum parsimony
– Clearly recognizable sequence similarity
• Distance methods
– All others:
• Maximum likelihood
• Best to choose at least two approaches
• Compare the results: if they are similar, you can have more confidence


Evaluating Trees

• The main criteria by which the accuracy of a phylogenetic tree is assessed are consistency, efficiency, and robustness.
• Bootstrapping is a commonly used approach to measuring the robustness of a tree topology
– Given a branching order, how consistently does an algorithm find that branching order in a randomly permuted version of the original data set?
– To bootstrap, make an artificial dataset by randomly sampling columns (with replacement) from your multiple sequence alignment.
– Make the dataset the same size as the original.
– Do 100 bootstrap replicates.
– Observe the percentage of cases in which the assignment of clades in the original tree is supported by the bootstrap replicates
– >70% is considered significant
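The column-resampling step of the bootstrap can be sketched as below. Only the construction of replicate alignments is shown; building a tree from each replicate and tallying clade support would be done by a tree-building program. The function name `bootstrap_alignment` and the toy alignment are hypothetical.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the original width."""
    length = len(alignment[0])
    columns = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[c] for c in columns) for seq in alignment]


alignment = ["AAGAGTGCA", "AGCCGTGCG", "AGATATCCA", "AGAGATCCG"]
rng = random.Random(0)  # fixed seed so that the replicates are reproducible
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]
print(len(replicates), "replicates of width", len(replicates[0][0]))
```

Each replicate has the same number of sequences and columns as the original, but some original columns appear more than once and others not at all.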


MEGA 5: Molecular Evolutionary Genetics Analysis

• http://www.megasoftware.net/

• Human, mouse, rat, and zebrafish CFTR gene
• Multiple sequence alignment by ClustalW
• Build a tree using Maximum Parsimony
• The obtained phylogenetic tree

Figure: the obtained tree, with leaves in this order:
NM_021050.2| Mouse CFTR
XM_001062374.1| Rat CFTR
NM_001044883.1| Zebrafish CFTR
NM_000492.3| Human CFTR


Homework 2

• Retrieve the BRCA1 gene in human (Homo sapiens), mouse (Mus musculus), cow (Bos taurus), and dog (Canis lupus familiaris)
• Use the FASTA program to perform all-against-all pairwise sequence alignments
• Create a multiple sequence alignment with ClustalW using the web server
• Build phylogenetic trees using different methods (such as Neighbor Joining, minimum evolution, UPGMA, and maximum parsimony implemented in MEGA)
