september 2004 Page 1
Similarity searches in biological sequence databases
Volker Flegel
september 2004 Page 2
Outline
Keyword search in databases• General concept
Examples• SRS http://srs.ebi.ac.uk• Entrez http://www.ncbi.nih.gov/Entrez/index.html• Expasy http://www.expasy.uniprot.org/search/textSearch.shtml
Similarity searches in databases• Goal• Definitions• Alignment visualisation• Alignment algorithms
Examples• FASTA• BLAST and its gory details
september 2004 Page 3
Keyword search
Accessing database entries• Each database uses its own specific access methods
Several kinds of search possibilities according to the data stored– Identification number (unique)– Authors– Keywords, ...
Biological sequence databases• Use a unique identification number to retrieve a specific sequence• This identification number must remain constant accross the database
releases
• Genbank / EMBL / DDBJaccession.version
• Swiss-Protaccession and id (Note: id may change)
september 2004 Page 4
Genbank entry example
LOCUS AF455746_1 80 aa PRI 08-JAN-2002DEFINITION ubiquitin-conjugating enzyme [Homo sapiens].ACCESSION AAL58874PID g18087414VERSION AAL58874.1 GI:18087414DBSOURCE locus AF455746 accession AF455746.1KEYWORDS .SOURCE human.ORGANISM Homo sapiens
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE 1 (residues 1 to 80)AUTHORS Poloumienko,A.TITLE Exon-intron structure of the mammalian ubiquitin-conjugating enzyme
(HR6A) genesJOURNAL Unpublished
COMMENT Method: conceptual translation supplied by author.FEATURES Location/Qualifiers
source 1..80/organism="Homo sapiens"/db_xref="taxon:9606"/chromosome="X"/cell_line="MCF-7"
Protein 1..80/product="ubiquitin-conjugating enzyme"
CDS 1..80/gene="HR6A"/coded_by="join(AF455746.1:<1..64,AF455746.1:1057..1145,AF455746.1:1594..>1680)"
ORIGIN 1 teeypnkppt vrfvskmfhp nvyadgsicl dilqnrwspt ydvssiltsi qslldepnpn61 spansqaaql yqenkreyek
//
september 2004 Page 5
SwissProt entry exampleID UBCA_HUMAN STANDARD; PRT; 152 AA.AC P49459;DT 01-FEB-1996 (Rel. 33, Created)DT 01-FEB-1996 (Rel. 33, Last sequence update)DT 16-OCT-2001 (Rel. 40, Last annotation update)DE Ubiquitin-conjugating enzyme E2-17 kDa (EC 6.3.2.19)DE (Ubiquitin-protein ligase) (Ubiquitin carrier protein) (HR6A).GN UBE2A.OS Homo sapiens (Human).OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;OC Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.OX NCBI_TaxID=9606;RN [1]RP SEQUENCE FROM N.A.RX MEDLINE=92020951; PubMed=1717990;RA Koken M.H.M., Reynolds P., Jaspers-Dekker I., Prakash L., Prakash S.,RA Bootsma D., Hoeijmakers J.H.J.;RT "Structural and functional conservation of two human homologs of theRT yeast DNA repair gene RAD6.";RL Proc. Natl. Acad. Sci. U.S.A. 88:8865-8869(1991).(...)DR EMBL; M74524; AAA35981.1; -.DR HSSP; P25865; 2AAK.DR MIM; 312180; -.DR InterPro; IPR000608; UBQ_conjugat.DR Pfam; PF00179; UQ_con; 1.DR SMART; SM00212; UBCc; 1.DR PROSITE; PS00183; UBIQUITIN_CONJUGAT_1; 1.DR PROSITE; PS50127; UBIQUITIN_CONJUGAT_2; 1.KW Ubiquitin conjugation; Ligase; Multigene family.FT BINDING 88 88 UBIQUITIN (BY SIMILARITY).SQ SEQUENCE 152 AA; 17243 MW; 7A86173D5FAE6DE1 CRC64;
MSTPARRRLM RDFKRLQEDP PAGVSGAPSE NNIMVWNAVI FGPEGTPFGD GTFKLTIEFTEEYPNKPPTV RFVSKMFHPN VYADGSICLD ILQNRWSPTY DVSSILTSIQ SLLDEPNPNSPANSQAAQLY QENKREYEKR VSAIVEQSWR DC
//
september 2004 Page 6
Similarity searches
Concept• Generalisation (asymmetric) of a pairwise comparison
sequence sequence
sequence
database
Query Subject
Pairwise alignment
Similarity searches
database database
Database vs. database
september 2004 Page 7
Theoretical considerations
Similar to those of pairwise comparison • Sequence divergence is due to evolutionary mechanisms• Sequence similarity allows information extrapolation:
Sequence history and originBiological function3D structure
Alignement types• Global Alignment between the complete sequence A and the
complete sequence B• Local Alignment between a sub-sequence of A and a sub-
sequence of B
Computer implementation (Algorithms)• Dynamic programing• Global Needleman-Wunsch• Local Smith-Waterman
september 2004 Page 8
Problems to solve
Similarity search mechanism• A pairwise comparison is done successively between the query and every
sequence of the database
Obstacles• The complexity of the task is proportional to the size of the database
Extremely long running time of the searchDifficult biological interpretation of the results
Solutions• Reduce search time by using more powerful computers• Reduce search time by using newer and faster algorithms (heuristics)• Sort and analyse the resulting alignments using statistical methods
september 2004 Page 9
Definitions
QuerySequence that is being compared against the database.
Subject Sequence of the database that matches the query.
Exact algorithm An exact algorithm is guaranteed to find the best alignment, or at least one of the best in case of a tie.
Heuristic algorithm A heuristic algorithm is not guaranteed to find the best alignment.But good ones often do, and much quicker than exact ones.
september 2004 Page 10
Some more definitions
Identity Proportion of pairs of identical residues between two aligned sequences.Generally expressed as a percentage.This value strongly depends on how the two sequences are aligned.
SimilarityProportion of pairs of similar residues between two aligned sequences.If two residues are similar is determined by a substitution matrix.This value also depends strongly on how the two sequences are aligned, as well as on the substitution matrix used.
Homology Two sequences are homologous if and only if they have a common ancestor.There is no such thing as a level of homology ! (It's either yes or no)
• Homologous sequences do not necessarily serve the same function...
• ... Nor are they always highly similar: structure may be conserved while sequence is not.
september 2004 Page 11
Alignment score
Amino acid substitution matrices • Example: PAM250• Most used: Blosum62
Raw score of an alignment
TPEA¦| |APGA
TPEA¦| |APGA
Score = 1 = 9+ 6 + 0 + 2
september 2004 Page 12
Insertions and deletions
Gap penalties
• Opening a gap penalizes an alignment score• Each extension of a gap penalizes the alignment's score• The gap opening penalty is in general higher than the gap extension
penalties (simulating evolutionary behavior)
• The raw score of a gapped alignment is the sum of all amino acid substitutions from which we subtract the gap opening and extensionpenalties.
Seq A GARFIELDTHE----CAT||||||||||| |||
Seq B GARFIELDTHELASTCAT
Seq A GARFIELDTHE----CAT||||||||||| |||
Seq B GARFIELDTHELASTCAT
gap
gap openinggap extension
september 2004 Page 13
Alignment visualisation
Matrix - Text - DotplotAn alignment is a path through a graphDotPlot: Graphical view in 2 dimensions
Visual aid to identify regions of similarity
Address: www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
Tissue-Type plasminogen Activator
Urokinase-Type plasm
inogen Activator
Seq B A-CA-CA| || |
Seq A ACCAAC-
Seq B A-CA-CA| || |
Seq A ACCAAC-
Seq B ACA--CA|
Seq A A-CCAAC
Seq B ACA--CA|
Seq A A-CCAAC
september 2004 Page 14
Optimal alignment extension
How to extend optimaly an optimal alignment• An optimal alignment up to positions i and j can be extended in 3 ways.• Keeping the best of the 3 guarantees an extended optimal alignment.
Seq A a1 a2 a3 ... ai-1 ai
Seq B b1 b2 b3 ... bj-1 bj
Seq A a1 a2 a3 ... ai-1 ai
Seq B b1 b2 b3 ... bj-1 bj
• We have the optimal alignment extended from i and j by one residue.
Seq A a1 a2 a3 ... ai-1 ai
Seq B b1 b2 b3 ... bj-1 bj
Seq A a1 a2 a3 ... ai-1 ai
Seq B b1 b2 b3 ... bj-1 bj
ai+1
bj+1
ai+1
bj+1
Score = Scoreij + Substi+1 j+1
Seq A a1 a2 a3 ... ai-1 ai
Seq B b1 b2 b3 ... bj-1 bj
Seq A a1 a2 a3 ... ai-1 ai
Seq B b1 b2 b3 ... bj-1 bj
ai+1
-
ai+1
-Score = Scoreij - gap
Seq A a1 a2 a3 ... ai-1 ai
Seq B b1 b2 b3 ... bj-1 bj
Seq A a1 a2 a3 ... ai-1 ai
Seq B b1 b2 b3 ... bj-1 bj
-
bj+1
-
bj+1
Score = Scoreij - gap
september 2004 Page 15
Exact algorithms (Needleman-Wunsch / Smith - Waterman)
Simple example (Needleman-Wunsch)
• Scoring system:Match score: 2Mismatch score: -1Gap penalty: -2
Note• We have to keep track of the origin of the score for each element in the
matrix.This allows to build the alignment by traceback when the matrix has been completely filled out.
• Computation time is proportional to the size of sequences (n x m).
G A T T A
0 -2 -4 -6 -8 -10
G -2
A -4
A -6
T -8
T -10
C -12
G A T T A
0 -2 -4 -6 -8 -10
G -2 2 0 -2 -4 -6
A -4 0 4
A -6
T -8
T -10
C -12
0 - 2
0 - 2
2 + 2
G A T T A
0 -2 -4 -6 -8 -10
G -2 2 0 -2 -4 -6
A -4 0 4 2 0 -2
A -6 -2 2 3 1 2
T -8 -4 0 4 5 3
T -10 -6 -2 2 6 4
C -12 -8 -4 0 4 5
F(i-1,j) F(i,j)
s(xi,yj)
F(i-1,j-1)
-d
F(i,j-1)
-d
F(i,j): score at position i, js(xi,yj): match or mismatch score (or substitution matrix
value) for residues xi and yjd: gap penalty (positive value)
GA-TTA|| ||GAATTC
GA-TTA|| ||GAATTC
september 2004 Page 16
Heuristic algorithms
Faster but less sensitive• They use the dynamic programming approach like exact algorithms• They try to limit its use to sequences which seem interesting
The heuristic part of the algorithm tries to make a clever guess at which sequences would produce an interesting alignment.
FASTA • Developped by Lipman and Pearson in 1985• Tries to find sequences having identical words (or k-tuples = k
consecutive residues) in common on a same diagonal.• Compares the query sequentially to all those sequences in the database.
Blast • Developped by Altschul et al. in 1990• The most used and cited bioinformatics tool in biology• Online tutorial: www.ncbi.nlm.nih.gov/Education/BLASTinfo/tut1.html
september 2004 Page 17
A Blast for each query
Different programs are available according to the type of query
Program Query Database
blastp protein protein
blastn nucleotide nucleotide
blastx
protein
nucleotide
protein
tblastn
protein protein
nucleotide
tblastx
protein
nucleotide
protein
nucleotide
VS
VS
VS
VS
VS
september 2004 Page 18
Access to Blast
Web access • Numerous web sites offer access to Blast servers
NCBI (USA) where the Blast program was created– Provide access to all Blast options and numerous databases– User interface not very intuitive
• URL: www.ncbi.nlm.nih.gov/BLAST
EMBnet (i.e. Swiss node located in Lausanne at the SIB)– Several servers across the world– Provide access to all Blast options– Provide a simplified and an advanced user interface– Wide choice of databases
• URL: www.ch.embnet.org/software/bBLAST.html (Simple user interface)
www.ch.embnet.org/software/aBLAST.html (Advanced user interface)
september 2004 Page 19
Blast: the gory details
Blast algorithm: creating a list of similar words
RELQuery
RSLRSL
AAAAACAAD
YYY
AAAAACAAD
YYY
List of all possible words with3 amino acid residues
...
ACT
RSL
TVF
ACT
RSL
TVF
List of words matching thequery with a score > T
score > T
......
LKPLKP
LKPLKP
score < T
A substitution matrix is used to compute the word scoresA substitution matrix is used to compute the word scores
september 2004 Page 20
Blast: the gory details
ACT
RSL
TVF
ACT
RSL
TVF
List of words matching thequery with a score > T
......
Blast algorithm: eliminating sequences without word hits
ACTACTACT
RSL
RSL TVF
RSLRSL
RSLRSL TVFTVF
Database sequences
List of sequences containing words similar to the query (hits)
List of sequences containing words similar to the query (hits)
Search forexact matches
september 2004 Page 21
Blast: the gory details (The End)
Blast algorithm: extension of hits
Database sequence
Query A
Ungapped extension if:• 2 "Hits" are on the same diagonal
but at a distance less than A
Database sequence
Query A
Extension using dynamic programming• limited to a restricted region
september 2004 Page 22
Statistical evaluation of results
Alignments are evaluated according to their score• Raw score
It's the sum of the amino acid substitution scores and gap penalties (gap opening and gap extension)Depends on the scoring system (substitution matrix, etc.)Different alignments should not be compared based only on the raw score
• Normalised scoreIs independent of the scoring systemAllows the comparison of different alignmentsUnits: expressed in bits
september 2004 Page 23
Statistical evaluation of results
Statistics derived from the scores• p-value
Probability that an alignment with this score occurs by chance in a database of this sizeThe closer the p-value is towards 0, the better the alignment
• e-valueNumber of matches with this score one can expect to find by chance in a database of this sizeThe closer the e-value is towards 0, the better the alignment
• Relationship between e-value and p-value:In a database containing N sequencese = p x N
100%
0%
N
0
september 2004 Page 24
Low complexity regions
• Regions with a high frequency of only a few type of residues (= low complexity regions) may produce high scoring but biological uninteresting alignments, e.g. polyserine
• Such regions are, by default, filtered out by Blast. They appear masked with 'X' in the alignment.They are not taken into account for score computation
september 2004 Page 25
Basic Blast on EMBnet
www.ch.embnet.org/software/bBLAST.html
Select the type of query
Select the substitution matrix to use
Select your input type:Either a raw sequence or an accession or id number, as well as the database from which blast should retrieve your query
Select the protein database to search with either blastp, blastx
Select the nucleotide database to search with either blastn, tblastn, tblastx
september 2004 Page 26
Advanced Blast on EMBnet
www.ch.embnet.org/software/aBLAST.html
• Greater choice of databases to search• Advanced Blast parameter modification
september 2004 Page 27
Search results
Graphical visualisation and description of alignment scores
september 2004 Page 28
Search results
Alignment example
• Normalised score, raw score and e-value• Percentage of identical aligned residues, percentage of aligned residues
having a positive score in the substitution matrix• Alignment (local) between the query and the database sequence. The
middle line shows if a residue is conserved or not• Low complexity region is masked with a series of 'X'
september 2004 Page 29
Search results
Search details (at the bottom of the results)
• Size of the database searched• Scoring system parameters• Details about the number of hits
found
september 2004 Page 30
Conclusions
Blast: the most used database search tool • Fast and very reliable even for a heuristic algorithm• Does not necessarily find the best alignment, but most of the
time it finds the best matching sequences in the database• Easy to use with default parameters• Solid statistical framework for the evaluation of scores
but... • The biologist's expertise is still essential to the analysis of the
results !Tips and tricks
• For coding sequences always search at the protein level• Mask low complexity regions• Use a substitution matrix adapted to the expected divergence of
the searched sequences (nevertheless most of the time BLOSUM62 works well)
• If there are only matches to a limited region of your query, cutout that region and rerun the search with the remaining part of your query