Calculating substitution matrices
• http://www.techfak.uni-bielefeld.de/bcd/Curric/PrwAli/nodeD.html#wm5
Two models for sequence alignment: one random (R) and one match (M).
The random model assumes that letter a occurs independently with some frequency $q_a$, so the probability of the two sequences is just the product of the probabilities of each amino acid:
$P(x,y \mid R) = \prod_i q_{x_i} \prod_j q_{y_j}$
Odds ratio
• The match model aligns residues with a joint probability $p_{ab}$
– $P(x,y \mid M) = \prod_i p_{x_i y_i}$
• The ratio of match to random is known as the odds ratio:
$P(x,y \mid M) / P(x,y \mid R) = \prod_i \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}$
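The two models and their odds ratio can be sketched in a few lines of Python. This is a minimal illustration, assuming an ungapped alignment of two equal-length sequences; the two-letter alphabet and the values in `q` and `p` are made up for the example and do not come from any real substitution matrix.

```python
from math import prod

def prob_random(x, y, q):
    """P(x, y | R): each residue is drawn independently with frequency q[a]."""
    return prod(q[a] for a in x) * prod(q[b] for b in y)

def prob_match(x, y, p):
    """P(x, y | M): each aligned pair (x_i, y_i) has joint probability p[(a, b)]."""
    return prod(p[(a, b)] for a, b in zip(x, y))

def odds_ratio(x, y, p, q):
    """Ratio of the match-model probability to the random-model probability."""
    return prob_match(x, y, p) / prob_random(x, y, q)

# Hypothetical toy frequencies (not real amino-acid data).
q = {"A": 0.5, "B": 0.5}
p = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}

ratio = odds_ratio("AAB", "ABB", p, q)
print(ratio)  # a value above 1 favours the match model
```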
Log odds ratio
• $s(a, b) = \log \frac{p_{ab}}{q_a q_b}$
• $S = \sum_i s(x_i, y_i)$
• The second equation is the sum of individual scores for each aligned pair of residues. The first equation gives the entries of a matrix known as a score or substitution matrix; for proteins it is a 20 × 20 matrix (e.g. BLOSUM, PAM).
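As a rough sketch of how the log-odds scores turn the odds ratio into a sum, the snippet below builds a tiny score matrix and scores an ungapped alignment. The base of the logarithm is not specified on the slide; base 2 is assumed here, and the frequencies are invented toy values.

```python
from math import log2

def log_odds_matrix(p, q):
    """s(a, b) = log2(p_ab / (q_a * q_b)) for every residue pair in p."""
    return {(a, b): log2(p_ab / (q[a] * q[b])) for (a, b), p_ab in p.items()}

def alignment_score(x, y, s):
    """S = sum over i of s(x_i, y_i) for two ungapped, equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(s[(a, b)] for a, b in zip(x, y))

# Hypothetical toy frequencies (not real amino-acid data).
q = {"A": 0.5, "B": 0.5}
p = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}

s = log_odds_matrix(p, q)
S = alignment_score("AAB", "ABB", s)
print(s[("A", "A")], s[("A", "B")], S)
```

Matches score positive and mismatches negative because matched pairs are more likely under the match model than under the random model, and the total S is the logarithm of the odds ratio.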
Significance of scores using alignment algorithms
• Calculate a raw score
– Sum of scores for each letter-to-letter and letter-to-null position
• Calculate a bit score
– Normalizes for the scoring system used
• Calculate an E-value
– Calculated from the bit score to account for the probability that the hit arose by chance
Raw score
• Calculated from substitution matrices (PAM, BLOSUM) and gap costs
• There are substitution matrices for nucleotides also:
– States, D.J., Gish, W. & Altschul, S.F. (1991) "Improved sensitivity of nucleic acid database searches using application-specific scoring matrices." Methods 3:66-70.
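A raw score for a finished alignment can be sketched as follows. This is an illustration only: it assumes the alignment is given as two equal-length strings with `-` marking a null position, charges a flat cost `d` per gap position, and uses a made-up integer score table rather than PAM or BLOSUM.

```python
def raw_score(aligned_x, aligned_y, s, d):
    """Sum letter-to-letter scores from s and subtract d for each gap position."""
    if len(aligned_x) != len(aligned_y):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(aligned_x, aligned_y):
        if a == "-" or b == "-":
            total -= d            # letter aligned to a null position
        else:
            total += s[(a, b)]    # letter aligned to a letter
    return total

# Hypothetical score table for a two-letter alphabet.
s = {("A", "A"): 2, ("B", "B"): 2, ("A", "B"): -1, ("B", "A"): -1}

score = raw_score("AAB-B", "ABBAB", s, d=3)
print(score)
```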
Bit score
• $S' = (\lambda S - \ln K) / \ln 2$
• $\lambda$ and $K$ are parameters dependent upon the scoring system (substitution matrix and gap costs) employed
– Karlin, S. & Altschul, S.F. (1990) "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes." Proc. Natl. Acad. Sci. USA 87:2264-2268.
– http://www.ncbi.nlm.nih.gov/BLAST/matrix_info.html#lambda
• Gap costs: the standard cost associated with a gap of length g
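The bit-score normalization, S' = (lambda · S − ln K) / ln 2, is a one-line calculation once lambda and K are known. The sketch below uses placeholder values for `LAMBDA` and `K`; they are assumptions chosen for illustration, and real values have to be looked up for the specific matrix and gap costs in use.

```python
from math import log

def bit_score(raw_score, lam, K):
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - log(K)) / log(2)

# Illustrative parameter values only; they depend on the scoring system.
LAMBDA, K = 0.267, 0.041

print(round(bit_score(100, LAMBDA, K), 1))
```

Because lambda and K absorb the scale of the scoring system, bit scores obtained with different matrices can be compared directly, which raw scores cannot.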
Gap costs
• Can be linear, like we did in our matrix:
$\gamma(g) = -gd$
• Can be an “affine” score, most prevalent now:
$\gamma(g) = -d - (g-1)e$
where d is called the gap-open penalty and e is called the gap-extension penalty. The gap-extension penalty e is usually less than d, so long insertions and deletions are penalized less than they would be under a linear gap cost.
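The difference between the two gap models is easy to see numerically. The following sketch implements both formulas; the penalty values in the example (d = 11, e = 1) are assumptions picked for illustration.

```python
def linear_gap(g, d):
    """Linear gap cost: gamma(g) = -g * d."""
    return -g * d

def affine_gap(g, d, e):
    """Affine gap cost: gamma(g) = -d - (g - 1) * e, with no cost for g = 0."""
    if g <= 0:
        return 0
    return -d - (g - 1) * e

# Compare the two models for gaps of increasing length.
for g in (1, 2, 5, 10):
    print(g, linear_gap(g, d=11), affine_gap(g, d=11, e=1))
```

With these values a gap of length 10 costs 110 under the linear model but only 20 under the affine model, which is why affine costs tolerate long insertions and deletions better.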
E-value
• $E = N / 2^{S'}$
• This is an approximation for the number (E) of distinct HSPs (high-scoring segment pairs) with normalized score at least S' expected to occur by chance when two random protein sequences of sufficient lengths m and n are compared
• $N = mn$ (search space size)
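A small sketch of the E-value formula, E = mn / 2^S', follows. The function name and the example lengths are chosen here for illustration.

```python
def e_value(bit_score, m, n):
    """Expected number of chance HSPs scoring at least bit_score: E = m*n / 2**S'."""
    return m * n / 2 ** bit_score

# Each extra bit of score halves the expected number of chance hits.
for bits in (10, 11, 12):
    print(bits, e_value(bits, m=32, n=32))
```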
Database searching
• If a protein is compared to a whole database, n is the database length in residues
• The equation can be converted to:
– $S' = \log_2(N/E)$
• If a protein of length 250 is compared to a protein database of $5 \times 10^7$ residues, then $N = 1.25 \times 10^{10}$, and to achieve a marginally significant E-value of 0.05 a normalized score of $\log_2(1.25 \times 10^{10} / 0.05) \approx 38$ bits is necessary
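The inverted formula, S' = log2(N/E), can be checked by plugging in numbers. In the sketch below, a 250-residue query needs roughly 37.9 bits for E = 0.05 against a database of 5 × 10^7 residues, and roughly 34.5 bits against one of 5 × 10^6 residues; the function name is an assumption made for this example.

```python
from math import log2

def required_bits(m, n, e):
    """Bit score needed for an E-value of e in a search space of size m * n."""
    return log2(m * n / e)

for db_residues in (5e6, 5e7):
    bits = required_bits(250, db_residues, 0.05)
    print(f"{db_residues:.0e} residues: {bits:.1f} bits")
```

A tenfold larger database adds log2(10), about 3.3 bits, to the score required for the same E-value.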
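Significance of E-value
• The E-value is an expected number of chance hits, so it is non-negative and can exceed 1; it is not a probability confined between 0 and 1
• The lower the E-value, the more significant the match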
• Note that the E-value depends on the length of the query sequence through $N = mn$
– For the same bit score and the same database, a query of 100 amino acids gets half the E-value of a query of 200 amino acids