protein sequence comparison · implementing the dp algorithm for sequences 1)build a nxm alignment...
TRANSCRIPT
![Page 1: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/1.jpg)
Protein Sequence Comparison
![Page 2: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/2.jpg)
Why do we want to align protein
sequences?
• If two sequences align well, the corresponding
proteins are homologous; they probably share the
same structure and/or function
• Sequence Alignment is a Tool for organizing the
protein sequence space
detection of homologous proteins
build evolutionary history
![Page 3: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/3.jpg)
Alignment Methods
• Rigorous algorithms
- Needleman-Wunsch (global)
- Smith-Waterman (local)
• Rapid heuristics
- FASTA
- BLAST
![Page 4: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/4.jpg)
What is sequence alignment?
• Given two sequences of letters and a scoring
scheme for evaluating letter matching, find the
optimal pairing of letters from one sequence to
the other.
• Different alignments:
Favors identity Favors similarity
ACCTAGGC ACCTAGGC
AC-T-GG ACT-GG
Gaps
![Page 5: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/5.jpg)
Aligning Biological Sequences
• Aligning DNA: 4 letter alphabet
ACGTTGGC
AC-T-GG
• Aligning protein: 20 letter alphabet
MCYTSWGC
MC-T-WG
![Page 6: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/6.jpg)
Computing Cost
The computational complexity of aligning two
sequences when gaps are allowed anywhere is
exponential in the length of the sequences being
aligned.
Computer science offers a solution
for reducing the running time:
Dynamic Programming
![Page 7: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/7.jpg)
Dynamic Programming (DP) Concept
A problem with overlapping sub-problems and optimal sub-structures can be solved using the following algorithm:
(1) break the problem into smaller sub-problems
(2) solve these problems optimally using this 3-step procedure recursively
(3) use these optimal solutions to construct an optimal solution to the original problem
![Page 8: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/8.jpg)
DP and Sequence Alignment
Key idea:
The score of the optimal alignment that ends at a given
pair of positions in the sequences is the score of the best
alignment previous to these positions plus the score of
aligning these two positions.
![Page 9: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/9.jpg)
Test all alignments that can lead to i aligned with j
i
j
?
DP and Sequence Alignment
j
i
?
![Page 10: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/10.jpg)
Find alignment with best [previous score + score(i,j)]i
j
?
DP and Sequence Alignment
j
i
Best alignment that ends at (i,j)
![Page 11: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/11.jpg)
Implementing the DP algorithm for sequences
1) Build a NxM alignment matrix A such that
A(i,j) is the optimal score for alignments
up to the pair (i,j)
2) Find the best score in A
3) Track back through the matrix to get
the optimal alignment of S1 and S2.
Aligning 2 sequence S1 and S2 of lengths N and M:
![Page 12: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/12.jpg)
Example 1
Sequence 1: ATGCTGC
Sequence 2: AGCC
Score(i,j) = 10 if i=j, 0 otherwise
no gap penalty
![Page 13: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/13.jpg)
Example 1
0C
0C
0G
00000010A
CGTCGTA
1) Initialize
![Page 14: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/14.jpg)
Example 1
0C
0C
100G
00000010A
CGTCGTA
2) Propagate
![Page 15: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/15.jpg)
Example 1
0C
0C
20100G
00000010A
CGTCGTA
2) Propagate
![Page 16: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/16.jpg)
Example 1
0C
100C
1020101020100G
00000010A
CGTCGTA
2) Propagate
![Page 17: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/17.jpg)
Example 1
303010100C
3020203010100C
1020101020100G
00000010A
CGTCGTA
2) Propagate
![Page 18: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/18.jpg)
Example 1
4030303010100C
3020203010100C
1020101020100G
00000010A
CGTCGTA
3) Trace Back
ATGCTGC
AXGCXXC
Alignment: Score: 40
![Page 19: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/19.jpg)
Mathematical Formulation
+−−−
+−−−
−−
+=
−≤≤
−≤≤
k2ik1
k2jk1
W1)jk,1A(i
,Wk)1j1,A(i
1),j1,A(i
j)Score(i,j)A(i,
max
maxmax
Wk: penalty for a gap of size k
Global alignment (Needleman-Wunsch):
![Page 20: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/20.jpg)
Complexity
1) The computing time required to fill in the
alignment matrix is O(NM(N+M)), where
N and M are the lengths of the 2 sequences
2) This can be reduced to O(NM) by storing
the best score for each row and column.
True if gap penalty is linear!
![Page 21: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/21.jpg)
Example 2
Alignments:
High Score: 30301010100C
202010100G
102010100G
0001010A
CGTAA
AATGC
AG GC
AATGC
A GGC
AATGC
AGGC
AATG C
A GGC
AATG C
A GGC
![Page 22: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/22.jpg)
Example 2 with Gap
Alignments:
High Score: 28281088-2C
1618108-2G
818810-2G
-2-2-2810A
CGTAA
AATGC
AG GC
AATGC
A GGC
AATGC
AGGC
Gap cost: -2
![Page 23: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/23.jpg)
Complexity (2)
1) The traceback routine can be quite costly in computing
time if all possible optimal paths are required, since
there may be many branches.
2) Usually, an arbitrary choice is made about which
branch to follow. Then computing time is O(max(N,M))
By simply following pointers.
![Page 24: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/24.jpg)
The Scoring Scheme
• Scores are usually stored in a “weight” matrix or “substitution” matrix or “matching” matrix.
• Defining the “proper” matrix is still an active area of research
• Usually, start from known, reliable alignment. Compute fi, the frequency of occurrence of residue type i, and qij, the probability that residue types i and j are aligned; score is computed as:
=
ji
ij
ijff
qS log
![Page 25: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/25.jpg)
1121-3-2-3-1-3-3-2-2-3-4-4-2-3-4-3-3-2W
273-1-1-1-1-2-22-1-2-3-2-3-2-3-2-2-2Y
136-1000-3-3-1-3-3-3-3-3-2-4-2-2-2F
-3-1-14131-2-3-3-2-2-3-3-30-2-2-2-1V
-2-103422-2-2-3-2-3-4-3-4-1-3-2-2-1L
-3-101241-3-3-3-3-3-3-3-4-1-3-2-2-1I
-1-10-2215-1-1-20-2-3-2-3-1-2-1-1-1M
-3-2-3-3-2-3-152-111-10-2-1-100-3K
-3-2-3-3-2-3-125010-20-2-1-2-1-1-3R
-22-1-2-3-3-2-1080011-2-2-20-1-3H
-2-1-3-2-2-301105200-2-1-100-3Q
-3-2-3-3-3-3-21002520-2-1-100-4E
-4-3-3-3-4-3-3-1-2-10261-1-2-110-3D
-4-2-3-3-3-3-200-100160-2-201-3N
-2-3-30-4-4-3-2-2-2-2-2-1-260-210-3G
-3-2-2-2-1-1-1-1-1-2-1-1-2-104-1-110A
-4-3-4-2-3-3-2-1-2-2-1-1-1-1-2-171-1-3P
-3-2-2-2-2-2-10-1000101-1141-1T
-3-2-2-2-2-2-10-1-1000101-114-1S
-2-2-2-1-1-1-1-3-3-3-3-4-3-3-30-3-1-19C
WYFVLIMKRHQEDNGAPTSC
Example of a Scoring matrix
![Page 26: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/26.jpg)
Gap penalty
• Most common model:
WN = G0 + N * G1
WN : gap penalty for a gap of size N
G0 : cost of opening a gap
G1 : cost of extending the gap by one
N : size of the gap
![Page 27: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/27.jpg)
Global versus Local Alignment
• Global alignment finds best arrangement that
maximizes total score
• Local alignment identifies highest scoring
subsequences, sometimes at the expense of the
overall score
Local alignment algorithm is just a variation
of the global alignment algorithm!
![Page 28: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/28.jpg)
Modifications for local alignment
1) The scoring matrix has negative values for mismatches
2) The minimum score for any (i,j) in the alignment matrix is 0.
3) The best score is found anywhere in the filled alignment matrix
These 3 modifications cause the algorithm to search for matching sub-sequences which are not penalized by other regions (modif. 2), with minimal poor matches (modif 1), which can occur anywhere (modif 3).
![Page 29: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/29.jpg)
Mathematical Formulation
Wk: penalty for a gap of size k
Local alignment (Smith Waterman):
+−−−
+−−−
−−
+=
−≤≤
−≤≤
0
,max
maxmax
k2ik1
k2jk1
W1)jk,1A(i
,Wk)1j1,A(i
1),j1,A(i
j)Score(i,j)A(i,
![Page 30: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/30.jpg)
Global versus Local Alignment
100000S
001000N
000310C
000120C
000001A
SGTCCA
Match: +1; Mismatch: -2; Gap: -1
1-10-1-2-3S
001-1-2-3N
-1-1-131-3C
-2-2-212-3C
-3-3-3-3-31A
SGTCCA
Global: ACCTGS
ACC-NS
Local: ACC
ACC
![Page 31: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/31.jpg)
Heuristic methods
• O(NM) is too slow for database search
• Heuristic methods based on frequency of shared
subsequences
• Usually look for ungapped small sequences
FASTA, BLAST
![Page 32: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/32.jpg)
FASTA
• Create hash table of short words of the query sequence (from 2 to 6 characters)
• Scan database and look for matches in the query hash table
• Extend good matches empirically
WordP
…
Word3
Word2
Word1
SeqN…Seq7Seq6Seq5Seq4Seq3Seq2Seq1
![Page 33: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/33.jpg)
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.htmlTutorial:
BLAST
1) Break query sequence and database sequences into words
2) Search for matches (even not perfect) that scores at least T
3) Extend matches, and look for alignment that scores at least S
![Page 34: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/34.jpg)
Summary
• Dynamic programming finds the optimal alignment
between two sequences in a computing time
proportional to NxM, where N and M are the
sequence lengths
• Critical user choices are the scoring matrix, the gap
penalties, and the algorithm (local or global)
![Page 35: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/35.jpg)
Statistics of Sequence Alignment
![Page 36: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/36.jpg)
Significance
• We have found that the score of the alignment between two sequences is S.
Question: What is the “significance” of this score?
• Otherwise stated, what is the probability P that the alignment of two random sequences has a score at least equal to S ?
• P is the P-value, and is considered a measure of statistical significance.
If P is small, the initial alignment is significant.
![Page 37: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/37.jpg)
A given experiment may yield the event A or the event not(A) with probabilities p,
and q=1-p, respectively. If the experiment is repeated N times and X is the number
of time A is observed, then the probability that X takes the value k is given by:
Basic Statistics: Binomial Distribution
kNkpp
k
NkXp
−−
== )1()(
)!(!
!
kNk
N
k
N
−=
With the binomial coefficient:
0
0.02
0.04
0.06
0.08
0.1
0.12
0.14
0.16
0.18
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Properties:
E(X)=Np
Var(X)=Np(1-p)
N=40; p=0.2
![Page 38: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/38.jpg)
The Poisson distribution is a discrete distribution usually defined over a volume
or time interval.
Given a process with expected number of success λ in the given interval, the
probability of observing exactly X success is given by:
Basic Statistics: Poisson Distribution
( )!
exp)(
XXp
X λλ
−=
λλλλ=8
0
0.02
0.04
0.06
0.08
0.1
0.12
0.14
0.16
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
Properties:
E(X)=λ
Var(X)=λ
![Page 39: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/39.jpg)
A normal distribution in a variable X with mean µ and variance σ2 is a statistical
distribution with probability function:
Basic Statistics: Normal Distribution
−−=
2
2
2
)(exp
2
1)(
σ
µ
πσ
XXp
0
0.02
0.04
0.06
0.08
0.1
0.12
0.14
0.16
0.18
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
N=40; p=0.2
The normal distribution is the limiting
case of a binomial distribution P(n,N,p)
with:
)1(2 pNp
Np
−=
=
σ
µ
![Page 40: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/40.jpg)
A extreme value distribution in a variable X is a statistical distribution with
probability function:
Basic Statistics: Extreme Value
Distribution
−−
−=
b
xa
b
xa
bXp expexp
1)(
a=0; b=1
-10 -8 -6 -4 -2 0 2 4 6 8 10
Note the presence of a long tail
![Page 41: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/41.jpg)
BLAST Input
![Page 42: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/42.jpg)
BLAST results
![Page 43: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/43.jpg)
BLAST results (2)
![Page 44: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/44.jpg)
Statistics of Protein Sequence Alignment
• Statistics of global alignment:
Unfortunately, not much is known! Statistics based on Monte
Carlo simulations (shuffle one sequence and recompute
alignment to get a distribution of scores)
• Statistics of local alignment
Well understood for ungapped alignment. Same theory
probably apply to gapped-alignment
![Page 45: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/45.jpg)
Statistics of Protein Sequence Alignment
What is a local alignment ?
“Pair of equal length segments, one from each sequence, whose scores
can not be improved by extension or trimming. These are called
high-scoring pairs, or HSP”
http://www.people.virginia.edu/~wrp/cshl98/Altschul/Altschul-1.html
![Page 46: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/46.jpg)
The E-value for a sequence alignment
-10 -8 -6 -4 -2 0 2 4 6 8 10
S
The expected number of
HSP with score at least S is
given by:
HSP scores follow an extreme value distribution, characterized
by two parameters, K and λ.
( )SKmnE λ−= exp
m, n : sequence lengths
E : E-value
![Page 47: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/47.jpg)
The Bit Score of a sequence alignment
( )( )2ln
ln'
KSS
−=
λ
Raw scores have little meaning without knowledge of the scoring
scheme used for the alignment, or equivalently of the parameters
K and λ.
Scores can be normalized according to:
S’ is the bit score of the alignment.
The E-value can be expressed as:
'2 SmnE
−=
![Page 48: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/48.jpg)
The P-value of a sequence alignment
( ) ( )ESscorewithHSPrandomP −=≥ exp0
( ) ( )!
expX
EESscorewithHSPrandomXP
X
−=≥
The number of random HSP with score greater of equal to S follows a
Poisson distribution:
(E: E-value)
Then:
( ) ( )ESscorewithHSPrandomleastatPPval −−=≥= exp11
Note: when E <<1, P ≈E
![Page 49: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/49.jpg)
Database search, where database contains NS sequences
corresponding to NR residues:
1) All sequences are a priori equally likely to be related to the query:
2) Longer sequences are more likely to be related to the query:
BLAST reports EDB2
The database E-value for a sequence
alignment
( )SKmnNE SDB λ−= exp
( )SKmNE RDB λ−= exp2
![Page 50: Protein Sequence Comparison · Implementing the DP algorithm for sequences 1)Build a NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)](https://reader033.vdocuments.site/reader033/viewer/2022060609/6060c8ca0e3c677abb1aed4a/html5/thumbnails/50.jpg)
Summary
• Statistics on local sequence alignment are defined by:
– Raw score
– Bit score (normalized score)
– E-value
– P-value
• Statistics on database search:
- database E-value