Multiple sequence alignment – Lesson 3
Post on 21-Dec-2015
TRANSCRIPT
1
Multiple sequence alignment
Lesson 3
2
1. What is a multiple sequence alignment?
3
Multiple sequence alignment
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--
Similar to pairwise alignment, but n sequences are aligned instead of just 2.
4
Multiple sequence alignment
MSA = Multiple Sequence Alignment
Each row represents an individual sequence
Each column represents the ‘same’ position
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--
5
Multiple sequence alignment
Homo sapiens
Pan troglodytes
Mus musculus
Canis familiaris
Gallus gallus
Anopheles gambiae
Drosophila melanogaster
Caenorhabditis elegans
Arabidopsis thaliana
Rattus norvegicus
6
Histone H4 protein
7
Multiple sequence alignment
NADH dehydrogenase subunit 4
Histone H4 protein 4
► Which is better: a pairwise alignment of two sequences, or the same pair of rows taken from an MSA?
8
2. How MSAs are computed
9
Alignment – Dynamic Programming
There is a dynamic programming algorithm for n sequences, similar to the one used for pairwise alignment.
Complexity: O(L^n) for n sequences of length L, i.e. exponential in the number of sequences.
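A rough sketch of why this does not scale, assuming all n sequences have the same length L so that the dynamic programming table has (L + 1)^n cells. The function name `dp_cells` is made up for this illustration.

```python
def dp_cells(seq_len, n_seqs):
    """Number of cells in an n-dimensional DP table: (L + 1) ** n."""
    return (seq_len + 1) ** n_seqs


# Sequences of length 100: the table explodes as sequences are added.
for n in (2, 3, 5, 10):
    print(n, dp_cells(100, n))
```

Two sequences of length 100 need about ten thousand cells; ten such sequences already need more than 10^20.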
10
Alignment methods
This complexity is not practical, therefore heuristics are used:
• Progressive/hierarchical alignment (Clustal)
• Iterative alignment (MAFFT, MUSCLE)
11
Progressive alignment
First step:
Compute the pairwise alignments for all against all (10 pairwise alignments for the 5 sequences A, B, C, D, E). The similarities are converted to distances and stored in a table:
     A    B    C    D    E
A
B    8
C    15   17
D    16   14   10
E    32   31   31   32
12
Second step:
Cluster the sequences to create a tree (guide tree):
• represents the order in which pairs of sequences are to be aligned
• similar sequences are neighbors in the tree
• distant sequences are distant from each other in the tree
[Figure: guide tree over the sequences A, B, C, D, E, built from the pairwise distance table]
The guide tree is imprecise and is NOT the tree which truly describes the evolutionary relationship between the sequences!
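A minimal sketch of how a guide tree could be derived from the distance table, assuming simple average-linkage clustering (UPGMA-style). The slides do not say which clustering method is used, so the merge order printed here need not match the tree drawn on the slide; the function names are invented for this example.

```python
from itertools import combinations

# Pairwise distances copied from the table of the first step.
DIST = {
    ("A", "B"): 8,
    ("A", "C"): 15, ("B", "C"): 17,
    ("A", "D"): 16, ("B", "D"): 14, ("C", "D"): 10,
    ("A", "E"): 32, ("B", "E"): 31, ("C", "E"): 31, ("D", "E"): 32,
}


def leaf_distance(a, b):
    return DIST[(a, b)] if (a, b) in DIST else DIST[(b, a)]


def cluster_distance(c1, c2):
    """Average distance over all leaf pairs of two clusters."""
    total = sum(leaf_distance(a, b) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def guide_tree_order(leaves):
    """Return the merges (cluster1, cluster2, distance) in the order they happen."""
    clusters = [(leaf,) for leaf in leaves]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        merges.append((c1, c2, cluster_distance(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


for c1, c2, d in guide_tree_order("ABCDE"):
    print(c1, c2, d)
```

The merge order is also the order in which a progressive aligner would align sequences and profiles.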
13
Third step:
1. Align the most similar (neighboring) pairs
[Figure: guide tree over A, B, C, D, E; the neighboring leaves are aligned as sequence-to-sequence pairs]
14
Third step:
2. Align pairs of pairs
[Figure: guide tree over A, B, C, D, E, with nodes labeled "sequence" and "profile"]
15
Third step:
[Figure: guide tree over A, B, C, D, E, with E labeled as a sequence and the rest as a profile]
Main disadvantages:
• Sub-optimal tree topology
• Misalignments resulting from globally aligning pairs of sequences.
16
Iterative alignment
[Figure: loop linking the pairwise distance table, the guide tree and the MSA for the sequences A, B, C, D, E]
Iterate until the MSA does not change (convergence)
17
3. MSA – What is it good for?
A. Conserved positions
B. Consensus
C. Patterns
D. Profiles
E. Much more…
18
3. MSA – What is it good for?
A. Conserved positions
B. Consensus
C. Patterns
D. Profiles
E. Much more…
19
Consensus sequence
ATCTTGT
AACTTGT
AACTTCT
AACTTGT
A consensus sequence holds the most frequent character of the alignment at each column
20
Consensus sequence – an example
TACGAT
TATAAT
TATAAT
GATACT
TATGTT
TATGTT
The -10 region of six promoters. There are many variants to the “consensus.”
TACGAT
TATAAT
TATAAT
GATACT
TATGAT
TATGTT
21
Consensus sequence – an example
TACGAT
TATAAT
TATAAT
GATACT
TATGAT
TATGTT
TATAAT
1. Strict majority.*
* In case of equal frequencies, choose one according to alphabetical order.
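The strict-majority rule, including the alphabetical tie-break, can be sketched as follows. The function name is made up for this illustration, and the sequences are assumed to be already aligned and of equal length.

```python
from collections import Counter


def strict_majority_consensus(sequences):
    """Most frequent character per column; ties are broken alphabetically."""
    consensus = []
    for column in zip(*sequences):
        counts = Counter(column)
        best = max(counts.values())
        # min() picks the alphabetically first among the most frequent characters.
        consensus.append(min(c for c, n in counts.items() if n == best))
    return "".join(consensus)


promoters = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
print(strict_majority_consensus(promoters))  # TATAAT
```

Column 4 is the tie case: A and G both occur three times, so A is chosen.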
22
Consensus sequence – an example
Had we searched the region upstream of genes for this consensus, we would have identified only 2 out of the 6 sequences. So we will miss many cases.
By chance, we expect a “hit” every 4,096 bp.
TACGAT
TATAAT
TATAAT
GATACT
TATGAT
TATGTT
TATAAT
23
Consensus sequence – an example
We can search while allowing 1 mismatch.
We would have identified 3 out of the 6 sequences, so we will miss fewer cases.
By chance, we expect a “hit” every ~200 bp → more “noise”.
TACGAT
TATAAT
TATAAT
GATACT
TATGAT
TATGTT
TATAAT
24
Consensus sequence – an example
We can search while allowing 2 mismatches.
We would have identified all 6 sequences, so we won’t miss any.
By chance, we expect a “hit” every ~30 bp → A LOT OF “noise”.
TACGAT
TATAAT
TATAAT
GATACT
TATGAT
TATGTT
TATAAT
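The numbers on the last three slides can be checked with a short sketch. It assumes random DNA with all four bases equally likely, which is what the "every 4,096 bp" figure implies; the function names are invented for this example.

```python
from math import comb


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def hits(consensus, sequences, max_mismatches):
    """How many sequences are within max_mismatches of the consensus."""
    return sum(hamming(consensus, s) <= max_mismatches for s in sequences)


def expected_spacing(length, max_mismatches, alphabet=4):
    """Average distance (bp) between chance hits in uniform random DNA."""
    matching = sum(comb(length, k) * (alphabet - 1) ** k
                   for k in range(max_mismatches + 1))
    return alphabet ** length / matching


promoters = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
for k in (0, 1, 2):
    print(k, hits("TATAAT", promoters, k), round(expected_spacing(6, k)))
```

This gives 2, 3 and 6 identified sequences with a chance hit about every 4096, 216 and 27 bp, in line with the rounded values on the slides.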
25
Consensus sequence – an example
2. Majority only when it is a clear case. In the remaining cases – use wildcards.
Y = Pyrimidine
R = Purine
N = Any nucleotide
TACGAT
TATAAT
TATAAT
GATACT
TATGAT
TATGTT
TATRNT
26
Reminder: Purines & Pyrimidines
Y = Pyrimidine (C or T)
R = Purine (A or G)
N = Any nucleotide
27
Consensus sequence – an example
Had we searched the region upstream of genes with the redundant consensus, we would have identified 4/6 sequences.
By chance, we expect a “hit” every ~500 bp.
TACGAT
TATAAT
TATAAT
GATACT
TATGAT
TATGTT
TATRNT
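A small sketch of searching with the redundant consensus, again assuming uniform random DNA for the chance estimate. Only the three wildcard codes used on the slides (R, Y, N) are included in the lookup table, and the function names are illustrative.

```python
# IUPAC codes used on the slides: R = purine, Y = pyrimidine, N = any nucleotide.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "N": "ACGT"}


def matches(consensus, seq):
    """True if every base of seq is allowed by the consensus code at that position."""
    return len(consensus) == len(seq) and all(
        base in IUPAC[code] for code, base in zip(consensus, seq))


def expected_spacing(consensus):
    """Average distance (bp) between chance hits in uniform random DNA."""
    p = 1.0
    for code in consensus:
        p *= len(IUPAC[code]) / 4
    return 1 / p


promoters = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
print(sum(matches("TATRNT", s) for s in promoters))  # 4
print(expected_spacing("TATRNT"))                    # 512.0
```

The exact chance spacing is 512 bp, which the slide rounds to ~500 bp.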
28
Consensus sequence – an example
There is always a tradeoff between sensitivity and specificity.
Sensitivity: the fraction of actual positives that are predicted as positive, TP / (TP + FN).
Specificity: the fraction of actual negatives that are predicted as negative, TN / (TN + FP).
TATRNT TATAAT
29
Consensus sequence – an example
Sensitivity: the fraction of actual positives that are predicted as positive, TP / (TP + FN)
Specificity: the fraction of actual negatives that are predicted as negative, TN / (TN + FP)
Permissive consensus: higher sensitivity, lower specificity (more true positives, more false positives ↔ fewer true negatives, fewer false negatives)
Nonpermissive consensus: higher specificity, lower sensitivity (fewer true positives, fewer false positives ↔ more true negatives, more false negatives)
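The tradeoff can be made concrete with the usual formulas, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). In this sketch the true-positive counts come from the six promoters above (2 found by TATAAT, 4 by TATRNT). The negative counts are an assumption: 4096 random non-promoter windows, of which about 1 matches TATAAT and about 8 match TATRNT by chance.

```python
def sensitivity(tp, fn):
    """Fraction of actual positives that are predicted as positive."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of actual negatives that are predicted as negative."""
    return tn / (tn + fp)


# Nonpermissive consensus TATAAT: finds 2 of 6 promoters, ~1 chance hit in 4096 windows.
print(sensitivity(tp=2, fn=4), specificity(tn=4095, fp=1))
# Permissive consensus TATRNT: finds 4 of 6 promoters, ~8 chance hits in 4096 windows.
print(sensitivity(tp=4, fn=2), specificity(tn=4088, fp=8))
```

The permissive consensus doubles the sensitivity at the cost of a slightly lower specificity.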
30
3. MSA – What is it good for?
A. Conserved positions
B. Consensus
C. Patterns
D. Profiles
E. Much more…
31
Patterns
TACGAT
TATAAT
TATAAT
GATACT
TATGAT
TATGTT
[TG]-A-[TC]-[GA]-[CTA]-T
Patterns are more informative than consensus sequences.
A pattern specifies, for each position, the characters allowed at that position.
32
Patterns - syntax
• The standard IUPAC one-letter codes.
• ‘x’ : any amino acid.
• ‘[]’ : residues allowed at the position.
• ‘{}’ : residues forbidden at the position.
• ‘()’ : repetition of a pattern element is indicated in parentheses: x(n) or x(n,m) gives the number or range of repetitions.
• ‘-’ : separates each pattern element.
• ‘<’ : indicates an N-terminal restriction of the pattern.
• ‘>’ : indicates a C-terminal restriction of the pattern.
• ‘.’ : the period ends the pattern.
33
Patterns
• W-x(9,11)-[FYV]-[FYW]-x(6,7)-[GSTNE]
x(9,11): any amino acid, between 9 and 11 times
[FYV]: F or Y or V
WOPLASDFGYVWPPPLAWS
ROPLASDFGYVWPPPLAWS
WOPLASDFGYVWPPPLSQQQ
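One way to try such a pattern is to translate it into a regular expression. This is a simplified sketch covering only the syntax elements listed on the syntax slide; `prosite_to_regex` is a name made up here and not part of any library. The example sequences are split into three on the assumption that the long string on the slide is three sequences run together.

```python
import re

ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, lo, hi, end = m.groups()
        if core == "x":
            core = "."                         # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"     # forbidden residues
        repeat = ""
        if lo:
            repeat = "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(parts)


regex = prosite_to_regex("W-x(9,11)-[FYV]-[FYW]-x(6,7)-[GSTNE]")
for seq in ("WOPLASDFGYVWPPPLAWS", "ROPLASDFGYVWPPPLAWS", "WOPLASDFGYVWPPPLSQQQ"):
    print(seq, bool(re.search(regex, seq)))
```

Only the first sequence matches: the second lacks the leading W, and the third ends in Q where one of G, S, T, N, E is required.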
34
3. MSA – What is it good for?
A. Conserved positions
B. Consensus
C. Patterns
D. Profiles
E. Much more…
35
Profile = PSSM = Position Specific Score Matrix
ACCCAA
AACCGG
AACCTT
     1      2      3      4      5      6
A    1      0.67   0      0      0.33   0.33
C    0      0.33   1      1      0      0
G    0      0      0      0      0.33   0.33
T    0      0      0      0      0.33   0.33
36
Profiles / PSSMs
P(AACCAA) = 1 × 0.67 × 1 × 1 × 0.33 × 0.33
P(GACCAA) = 0
Sequences with higher probabilities → higher chance of being related to the PSSM.
     1      2      3      4      5      6
A    1      0.67   0      0      0.33   0.33
C    0      0.33   1      1      0      0
G    0      0      0      0      0.33   0.33
T    0      0      0      0      0.33   0.33
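A minimal sketch of building this matrix from the three aligned sequences and scoring a candidate. It uses plain column frequencies, as on the slide, with no pseudocounts; the function names are made up for the illustration.

```python
def build_pssm(alignment, alphabet="ACGT"):
    """One dict per column, mapping each base to its frequency in that column."""
    n = len(alignment)
    return [{base: column.count(base) / n for base in alphabet}
            for column in zip(*alignment)]


def probability(pssm, seq):
    """Product of the per-position frequencies of seq under the PSSM."""
    p = 1.0
    for column, base in zip(pssm, seq):
        p *= column[base]
    return p


pssm = build_pssm(["ACCCAA", "AACCGG", "AACCTT"])
print(probability(pssm, "AACCAA"))  # 2/27, about 0.074
print(probability(pssm, "GACCAA"))  # 0.0, G never occurs in column 1
```

The slide's 1 × 0.67 × 1 × 1 × 0.33 × 0.33 is the same product with rounded frequencies.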
37
Searching with PSSM
One compares each n-mer to the profile and computes the probabilities. Sequences with probabilities > threshold are considered as hits.
GACGGTACGTAGCGGAGCGACCAA
Compute the probability of the first 6-mer
     1      2      3      4      5      6
A    1      0.67   0      0      0.33   0.33
C    0      0.33   1      1      0      0
G    0      0      0      0      0.33   0.33
T    0      0      0      0      0.33   0.33
38
Searching with PSSM
6-mers with probabilities > threshold are considered as hits.
GACGGTACGTAGCGGAGCGACCAA
[Figure: a 6-mer window slides along the sequence one position at a time, giving the probabilities P1, P2, P3, P4, …]
     1      2      3      4      5      6
A    1      0.67   0      0      0.33   0.33
C    0      0.33   1      1      0      0
G    0      0      0      0      0.33   0.33
T    0      0      0      0      0.33   0.33
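The sliding-window search can be sketched as below. The matrix values are the slide's, written as exact fractions; the `scan` function, the threshold of 0.05 and the second test sequence are assumptions made for the illustration.

```python
# PSSM from the slide (0.67 and 0.33 written as 2/3 and 1/3).
PSSM = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 2 / 3, "C": 1 / 3, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
    {"A": 1 / 3, "C": 0.0, "G": 1 / 3, "T": 1 / 3},
    {"A": 1 / 3, "C": 0.0, "G": 1 / 3, "T": 1 / 3},
]


def scan(seq, pssm, threshold):
    """Return (start, window, probability) for every window scoring above threshold."""
    width = len(pssm)
    found = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        p = 1.0
        for column, base in zip(pssm, window):
            p *= column[base]
        if p > threshold:
            found.append((start, window, p))
    return found


print(scan("GACGGTACGTAGCGGAGCGACCAA", PSSM, 0.0))  # no window scores above 0
print(scan("GGAACCAAGG", PSSM, 0.05))               # one hit: AACCAA at position 2
```

With raw frequencies every window of the slide's sequence hits at least one zero column, including the final GACCAA. In practice pseudocounts are often added to the matrix so that a single unseen base does not force the probability to zero.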
39
Profile-pattern-consensus
multiple alignment:
AACTTG
AAGTCG
CACTTC
consensus: AACTTG (strict majority), NANTNN (with wildcards)
pattern: [AC]-A-[GC]-T-[TC]-[GC]
profile:
     1      2      3      4      5
A    0.66   1      0      0      …
T    0      0      0      1      …
C    0.33   0      0.66   0      …
G    0      0      0.33   0      …
40
4. HMM: Hidden Markov Models
41
Definitions & Uses
• A probabilistic model which deals with sequences of symbols.
Uses: inferring hidden states.
• Originally used in speech recognition (the symbols being phonemes)
• Useful in biology – the sequence of symbols being the DNA/proteins.
42
Markov Chains
• A sequence of random variables X1, X2, … where each present state depends only on the previous state.
• Weather example:
The weather on day x depends only on day x-1:
• We can easily compute the probability of: Sunny Sunny Rainy Sunny Sunny
43
Markov Chains
• Similarly we can assume a DNA sequence is Markovian
• ACGGTA… (vertical or horizontal!)
• These conditional probabilities can be illustrated as follows (in DNA):
[Figure: transition diagram with the four states A, T, C, G connected by arrows]
• Each arrow has a transition probability: $P_{CA} = P(x_i = A \mid x_{i-1} = C)$
• Thus, the probability of a sequence x will be:
$$P(x) = P(x_L, x_{L-1}, \ldots, x_1) = P(x_1) \prod_{i=2}^{L} P_{x_{i-1} x_i}$$
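The probability of a sequence under a Markov chain is the probability of its first symbol times the product of the transition probabilities along it. A short sketch follows; the weather probabilities are hypothetical, since the values from the slide's figure are not in the transcript, and `chain_probability` is an illustrative name.

```python
def chain_probability(seq, initial, transition):
    """P(x) = P(x_1) * product over i >= 2 of P(x_i | x_{i-1})."""
    p = initial[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= transition[prev][cur]
    return p


# Hypothetical weather model (values assumed for the example).
initial = {"Sunny": 0.5, "Rainy": 0.5}
transition = {
    "Sunny": {"Sunny": 0.8, "Rainy": 0.2},
    "Rainy": {"Sunny": 0.4, "Rainy": 0.6},
}
days = ["Sunny", "Sunny", "Rainy", "Sunny", "Sunny"]
print(chain_probability(days, initial, transition))  # 0.5 * 0.8 * 0.2 * 0.4 * 0.8
```

The same function works for DNA by passing a string such as "ACGGTA" together with initial and transition tables over A, C, G and T.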
44
Hidden Markov Models
• The state sequence itself follows a simple Markov chain. But:
• In an HMM it is no longer possible to know the state by looking at the symbols – the state is hidden.
[Figure: chain of hidden states S1 … Si-1, Si, Si+1 … Sn linked by transition probabilities P; each state Si emits an observation Ki with emission probability B]
45
The weather HMM example
• In this weather example only the actions are observable and the weather is hidden:
46
HMM formalities
• {S, K, Π, P, B}
• S : {s1…sN} are the values for the hidden states
• K : {k1…kM} are the values for the observations
• The hidden states emit/generate the symbols (observations)
• Π = {Πi} are the initial state probabilities
• P = {Pij} are the state transition probabilities
• B = {bik} are the emission probabilities
[Figure: chain of hidden states S1 … Si-1, Si, Si+1 … Sn linked by transition probabilities P; each state Si emits an observation Ki with emission probability B]
47
Another HMM example – the dishonest casino
• In a casino, they use a fair die most of the time, but occasionally switch to an unfair die. The switch between dice can be represented by an HMM:
FAIR: 1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6
UNFAIR: 1: 1/10, 2: 1/10, 3: 1/10, 4: 1/10, 5: 1/10, 6: 1/2
[Figure: the two states FAIR and UNFAIR with transition probabilities 0.95, 0.05, 0.9 and 0.1]
48
Dishonest casino - continued
• The symbols (observations) are the sequence of rolls:
3 5 6 2 1 4 6 3 6…
• What is hidden?
Whether the die is fair or unfair:
f f f f u u u f f
The hidden state sequence is a Markov chain.
In addition, we have:
• Emission probabilities:
Given a state, there are 6 possible symbols, each with an emission probability.
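Inferring the hidden fair/unfair sequence from the rolls is typically done with the Viterbi algorithm, which the slides do not spell out. The sketch below makes two assumptions beyond the slides: both dice are equally likely at the start, and the four transition probabilities are assigned as FAIR→FAIR 0.95, FAIR→UNFAIR 0.05, UNFAIR→UNFAIR 0.9, UNFAIR→FAIR 0.1, since the transcript lists the values without their arrows.

```python
import math

STATES = ("F", "U")                       # fair, unfair
START = {"F": 0.5, "U": 0.5}              # assumed initial probabilities
TRANS = {"F": {"F": 0.95, "U": 0.05},     # assumed assignment of the slide's values
         "U": {"F": 0.1, "U": 0.9}}
EMIT = {
    "F": {roll: 1 / 6 for roll in range(1, 7)},
    "U": {**{roll: 1 / 10 for roll in range(1, 6)}, 6: 1 / 2},
}


def viterbi(rolls):
    """Most probable hidden state path for a non-empty list of rolls (log space)."""
    scores = {s: math.log(START[s]) + math.log(EMIT[s][rolls[0]]) for s in STATES}
    back = []
    for roll in rolls[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_scores[s] = (scores[prev] + math.log(TRANS[prev][s])
                             + math.log(EMIT[s][roll]))
        back.append(pointers)
        scores = new_scores
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi([3, 5, 6, 2, 1, 4, 6, 3, 6]))
print(viterbi([6] * 20))
```

A long run of sixes is decoded as the unfair die, while a run without sixes is decoded as the fair one.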
49
HMM of MSA
• MSA can be represented by an HMM
– Insertion of A/C/G/T
– Match or Mismatch
– Deletion
51
HMM of MSA can get more complex…
52
Questions where HMMs are used:
• Does this sequence belong to a particular family?
• Can we identify regions in a sequence (for instance – alpha helices, beta sheets)?
• Pairwise/multiple sequence alignment
• Searching databases for protein families (building profiles).