Lecture 14
Bioinformatics
Mauro Ceccanti‡ and Alberto Paoluzzi†
†Dip. Informatica e Automazione – Università “Roma Tre”
‡Dip. Medicina Clinica – Università “La Sapienza”

Lecture 14: Longest Common Subsequence
- Longest common subsequence (LCS) problem
- BLAST (Basic Local Alignment Search Tool)
- FASTA (FAST Alignment)
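
Dynamic Programming Approach
LCS: Longest Common Subsequence

Let $X, Y \in \mathit{Seq}$ be the sequences to compare, and let $X_i$, $Y_j$ be the prefixes made of their first $i$ and $j$ characters, respectively. The integer function

$$\mathit{LCS} : \mathit{Seq} \times \mathit{Seq} \to \mathit{Nat}$$

gives the length of the longest common subsequence of any two (sub)sequences, as follows:

$$
\mathit{LCS}(X_i, Y_j) =
\begin{cases}
0 & \text{if } i = 0 \text{ or } j = 0 \\
\mathit{LCS}(X_{i-1}, Y_{j-1}) + 1 & \text{if } x_i = y_j \\
\max(\mathit{LCS}(X_i, Y_{j-1}),\ \mathit{LCS}(X_{i-1}, Y_j)) & \text{if } x_i \neq y_j
\end{cases}
$$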
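
Longest common subsequence
LCS function defined

[Figure: the sequences $X_i$ and $Y_j$, their prefixes $X_{i-1}$ and $Y_{j-1}$, and their last characters $x_i$ and $y_j$, for the two cases below.]

If $x_i = y_j$: $\mathit{LCS}(X_i, Y_j) = \mathit{LCS}(X_{i-1}, Y_{j-1}) + 1$
If $x_i \neq y_j$: $\mathit{LCS}(X_i, Y_j) = \max(\mathit{LCS}(X_i, Y_{j-1}),\ \mathit{LCS}(X_{i-1}, Y_j))$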

Recursive implementation
Just write down the recursive equations above in Python:

    def cls(X, Y):
        i, j = len(X), len(Y)
        if i == 0 or j == 0:
            return 0
        elif X[i-1] == Y[j-1]:
            return cls(X[:i-1], Y[:j-1]) + 1
        else:
            return max(cls(X[:i], Y[:j-1]), cls(X[:i-1], Y[:j]))

    print(cls("BASKETBALL", "BASEBALL"))   # prints 8

OK!

    print(cls("ABRACADABRA", "SUPERCALIFRAGILISTICESPIRALIDOSO"))

VERY long execution time... WHY?

... because of recursion nonlinearity
The execution time is exponential in the sequence lengths.

A recursion is said to be linear if the right-hand side of its definition contains at most one recursive function call.

- Nonlinear recursion: $\binom{n}{k} = \binom{n-1}{k} + \binom{n-1}{k-1}$, complexity $O(2^n)$

    def binomial(n, k):
        # assumes 0 <= k <= n
        if k == 0 or n == k: return 1
        else: return binomial(n-1, k) + binomial(n-1, k-1)

- Linear recursion: $\binom{n}{k} = \binom{n-1}{k-1} \times \frac{n}{k}$, complexity $O(n)$

    def binomial(n, k):
        # assumes 0 <= k <= n; integer division is exact here
        if k == 0 or n == k: return 1
        else: return binomial(n-1, k-1) * n // k
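
To make the cost of nonlinear recursion concrete, the sketch below counts how many calls the naive recursive LCS performs. The helper name `count_calls` is hypothetical (it is not part of the lecture), and it passes indices instead of slicing the strings, an assumption made only to keep the counter simple. When the two sequences share no characters, every call takes the two-call branch.

```python
def count_calls(X, Y):
    """Return (LCS length, number of recursive calls) for the naive recursion."""
    calls = 0

    def rec(i, j):
        nonlocal calls
        calls += 1
        if i == 0 or j == 0:
            return 0
        if X[i - 1] == Y[j - 1]:
            return rec(i - 1, j - 1) + 1           # linear branch: one call
        return max(rec(i, j - 1), rec(i - 1, j))   # nonlinear branch: two calls

    return rec(len(X), len(Y)), calls


# Worst case: no common characters, so every call takes the nonlinear branch.
for n in (2, 4, 8):
    length, calls = count_calls("A" * n, "B" * n)
    print(n, length, calls)
```

For these inputs the call count grows from 11 (n = 2) to 139 (n = 4) to 25739 (n = 8), while two identical strings of length 3 need only 4 calls, because the matching case recurses once.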

Memoization technique

In computing, “memoization” is an optimization technique used primarily to speed up computer programs by having function calls avoid repeating the calculation of results for previously processed inputs.

- This technique of saving values that have already been calculated is frequently used
- Memoization is a means of lowering a function’s time cost in exchange for space cost; that is, memoized functions become optimized for speed in exchange for a higher use of computer memory space
- An efficient LCS procedure requires saving the solutions to one level of subproblems in a table, so that the solutions are available to the next level of subproblems
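
As an illustration of the idea, the recursive definition can be memoized directly with the standard library's `functools.lru_cache`. This is a minimal sketch under stated assumptions, not the lecture's own implementation: the name `lcs_length_memo` is hypothetical, and the inner function works on indices so that each pair (i, j) is computed only once.

```python
from functools import lru_cache


def lcs_length_memo(X, Y):
    """Length of a longest common subsequence, top-down with a cache."""

    @lru_cache(maxsize=None)          # remembers the result for every (i, j)
    def rec(i, j):
        if i == 0 or j == 0:
            return 0
        if X[i - 1] == Y[j - 1]:
            return rec(i - 1, j - 1) + 1
        return max(rec(i, j - 1), rec(i - 1, j))

    return rec(len(X), len(Y))


print(lcs_length_memo("BASKETBALL", "BASEBALL"))
print(lcs_length_memo("ABRACADABRA", "SUPERCALIFRAGILISTICESPIRALIDOSO"))
```

At most (m+1)(n+1) distinct pairs are evaluated, so the second call returns immediately. For very long sequences the recursion depth (up to m+n) can exceed Python's default limit, which is one reason to prefer an explicit table.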

Length of the Longest Common Subsequence
Computing the function LCS : Seq × Seq → Nat with memoization

    def LCS(X, Y):
        m, n = len(X), len(Y)
        # An (m+1) times (n+1) matrix
        C = [[0] * (n+1) for i in range(m+1)]
        for i in range(1, m+1):
            for j in range(1, n+1):
                if X[i-1] == Y[j-1]:
                    C[i][j] = C[i-1][j-1] + 1
                else:
                    C[i][j] = max(C[i][j-1], C[i-1][j])
        return C
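
When only the length is needed, the table does not have to be kept whole, since each row depends only on the previous one. The following variant is a sketch under that assumption; the name `lcs_length_two_rows` is hypothetical and the function is not part of the lecture. It trades away the ability to backtrack in exchange for memory proportional to the length of Y.

```python
def lcs_length_two_rows(X, Y):
    """LCS length using only the previous and the current table row."""
    prev = [0] * (len(Y) + 1)                  # row i-1 of the full table
    for x in X:
        curr = [0]                             # column 0 is always 0
        for j, y in enumerate(Y, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(curr[j - 1], prev[j]))
        prev = curr
    return prev[len(Y)]


print(lcs_length_two_rows("AATCC", "ACACG"))
```

The returned value equals the lower-right entry `C[m][n]` of the full matrix.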

Usage example: LCS function

    >>> X = "AATCC"
    >>> Y = "ACACG"
    >>> m = len(X)
    >>> n = len(Y)
    >>> C = LCS(X, Y)

    >>> print(C)
    [[0, 0, 0, 0, 0, 0],
     [0, 1, 1, 1, 1, 1],
     [0, 1, 1, 2, 2, 2],
     [0, 1, 1, 2, 2, 2],
     [0, 1, 2, 2, 3, 3],
     [0, 1, 2, 2, 3, 3]]

Usage example: LCS function

    >>> X = "ATGGCCTGGAC"
    >>> Y = "ATCCGGACC"
    >>> m = len(X)
    >>> n = len(Y)
    >>> C = LCS(X, Y)

    >>> print(C)
    [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
     [0, 1, 1, 1, 1, 1, 1, 1, 1, 1],
     [0, 1, 2, 2, 2, 2, 2, 2, 2, 2],
     [0, 1, 2, 2, 2, 3, 3, 3, 3, 3],
     [0, 1, 2, 2, 2, 3, 4, 4, 4, 4],
     [0, 1, 2, 3, 3, 3, 4, 4, 5, 5],
     [0, 1, 2, 3, 4, 4, 4, 4, 5, 6],
     [0, 1, 2, 3, 4, 4, 4, 4, 5, 6],
     [0, 1, 2, 3, 4, 5, 5, 5, 5, 6],
     [0, 1, 2, 3, 4, 5, 6, 6, 6, 6],
     [0, 1, 2, 3, 4, 5, 6, 7, 7, 7],
     [0, 1, 2, 3, 4, 5, 6, 7, 8, 8]]

Reading out an LCS
Backtracking on the table from the lower-right corner

    def backTrack(C, X, Y, i, j):
        if i == 0 or j == 0:
            return ""
        elif X[i-1] == Y[j-1]:
            return backTrack(C, X, Y, i-1, j-1) + X[i-1]
        else:
            if C[i][j-1] > C[i-1][j]:
                return backTrack(C, X, Y, i, j-1)
            else:
                return backTrack(C, X, Y, i-1, j)

Usage example: backTrack function

    >>> X = "AATCC"
    >>> Y = "ACACG"
    >>> m = len(X)
    >>> n = len(Y)
    >>> C = LCS(X, Y)

    >>> print("Some LCS: '%s'" % backTrack(C, X, Y, m, n))
    Some LCS: 'AAC'

Usage example: backTrack function

    >>> X = "ATGGCCTGGAC"
    >>> Y = "ATCCGGACC"
    >>> m = len(X)
    >>> n = len(Y)
    >>> C = LCS(X, Y)

    >>> print("Some LCS: '%s'" % backTrack(C, X, Y, m, n))
    Some LCS: 'ATCCGGAC'

Reading out all LCSs

    def backTrackAll(C, X, Y, i, j):
        if i == 0 or j == 0:
            return set([""])
        elif X[i-1] == Y[j-1]:
            return set([Z + X[i-1]
                        for Z in backTrackAll(C, X, Y, i-1, j-1)])
        else:
            R = set()
            if C[i][j-1] >= C[i-1][j]:
                R.update(backTrackAll(C, X, Y, i, j-1))
            if C[i-1][j] >= C[i][j-1]:
                R.update(backTrackAll(C, X, Y, i-1, j))
            return R

Usage example: backTrackAll function

    >>> X = "AATCC"
    >>> Y = "ACACG"
    >>> m = len(X)
    >>> n = len(Y)
    >>> C = LCS(X, Y)

    >>> print("All LCSs: %s" % sorted(backTrackAll(C, X, Y, m, n)))
    All LCSs: ['AAC', 'ACC']

Usage example: backTrackAll function

    >>> X = "ATGGCCTGGAC"
    >>> Y = "ATCCGGACC"
    >>> m = len(X)
    >>> n = len(Y)
    >>> C = LCS(X, Y)

    >>> print("All LCSs: %s" % sorted(backTrackAll(C, X, Y, m, n)))
    All LCSs: ['ATCCGGAC']

BLAST program
Comparison of nucleotide or protein sequences

- The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences
- The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches
- BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families
- BLAST makes it easy to examine a large group of potential gene candidates

BLAST
How to do batch BLAST jobs

- BLAST makes it easy to examine a large group of potential gene candidates
- Most likely these are isolated as amplified products from a library of some sort
- There is no need to manually cut and paste 100 sequences into the BLAST web pages
- Using the BLAST web pages it is possible to input "batches" of sequences into one form and retrieve the results
- There are two methods to do batch BLAST jobs
- The first is through the web interface and the second is using the standalone BLAST binaries and downloaded NCBI databases

TUTORIAL
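
A batch is typically submitted as a single multi-FASTA text, one `>name` header line followed by the sequence for each record. The sketch below formats such a batch and builds, without running it, a command line for the standalone tools. It assumes the BLAST+ program `blastn` is installed; the function names, the clone names, the file names and the database name `my_db` are all hypothetical and used for illustration only.

```python
def format_batch_fasta(records, width=60):
    """Format {name: sequence} as one multi-FASTA string."""
    lines = []
    for name, seq in records.items():
        lines.append(">" + name)                    # header line of the record
        for start in range(0, len(seq), width):     # wrap long sequences
            lines.append(seq[start:start + width])
    return "\n".join(lines) + "\n"


def blastn_command(query_path, db_name, out_path):
    """Build (but do not run) a BLAST+ blastn command as an argument list."""
    return ["blastn", "-query", query_path, "-db", db_name,
            "-out", out_path, "-outfmt", "6"]       # 6 = tabular output


# Hypothetical clone names and sequences, for illustration only.
batch = format_batch_fasta({"clone_001": "ATGGCCTGGAC",
                            "clone_002": "ATCCGGACC"})
print(batch)
print(" ".join(blastn_command("batch.fasta", "my_db", "hits.tsv")))
```

The same multi-FASTA text can be pasted into the web form, which is the first of the two methods listed above.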

BLAST
Example

- BLAST paper
- QuickStart: Example-Driven Web-Based BLAST Tutorial

FASTA
Example

FASTA stands for FAST-ALL, reflecting the fact that it can be used for a fast protein comparison or a fast nucleotide comparison.

- This program achieves a high level of sensitivity for similarity searching at high speed
- This is achieved by performing optimised searches for local alignments using a substitution matrix
- The high speed of this program is achieved by using the observed pattern of word hits to identify potential matches before attempting the more time-consuming optimised search
- The trade-off between speed and sensitivity is controlled by the ktup parameter, which specifies the size of the word
- Increasing the ktup decreases the number of background hits
- Not every word hit is investigated; instead, the program initially looks for segments containing several nearby hits
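
The word-hit stage described above can be sketched as follows. This is a simplified illustration, not the FASTA program itself: it only counts exact matches of words of length `ktup` and groups them by diagonal (query position minus subject position), so that several hits on the same diagonal indicate a candidate segment. The function name `word_hits_by_diagonal` is hypothetical, and scoring with a substitution matrix is left out.

```python
from collections import defaultdict


def word_hits_by_diagonal(query, subject, ktup=2):
    """Count exact ktup-length word matches on each diagonal (i - j)."""
    # Index every word of length ktup in the query by its start positions.
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)

    # Scan the subject; each shared word is a hit on diagonal i - j.
    diagonals = defaultdict(int)
    for j in range(len(subject) - ktup + 1):
        for i in index.get(subject[j:j + ktup], ()):
            diagonals[i - j] += 1
    return dict(diagonals)


hits1 = word_hits_by_diagonal("ATGGCCTGGAC", "ATCCGGACC", ktup=1)
hits3 = word_hits_by_diagonal("ATGGCCTGGAC", "ATCCGGACC", ktup=3)
print(sum(hits1.values()), sum(hits3.values()))
```

With `ktup=1` the two example sequences produce 26 hits spread over many diagonals; with `ktup=3` only 2 hits remain, both on diagonal 3, where the shared segment GGAC lies. This is the sense in which a larger ktup reduces background hits at the price of sensitivity.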

FASTA Web services
Both REST and SOAP web service interfaces are exposed

REST: Sample clients are provided for a number of programming languages.
SOAP: RPC/encoded SOAP service