![Page 1: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/1.jpg)
1
CAP5510 – Bioinformatics
Database Searches for Biological Sequences or Imperfect Alignments
Tamer Kahveci
CISE Department
University of Florida
![Page 2: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/2.jpg)
2
Goals
• Understand how major heuristic methods for sequence comparison work
  – FASTA
  – BLAST
• Understand how search results are evaluated
![Page 3: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/3.jpg)
3
What is Database Search?
• Find a particular (usually short) sequence in a database of sequences (or one huge sequence).
• The problem is identical to local sequence alignment, but on a much larger scale.
• We must also have some idea of the significance of a database hit.
  – Databases always return some kind of hit; how much attention should be paid to the result?
• A similar problem is the global alignment of two large sequences.
• General idea: good alignments contain high-scoring regions.
![Page 4: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/4.jpg)
4
Imperfect Alignment
• What is an imperfect alignment?
  – The result may not be optimal.
• Why imperfect alignment?
  – Finding the optimal alignment is usually too costly in terms of time and memory.
![Page 5: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/5.jpg)
5
Database Search Methods
• Hash table based methods
  – FASTA family
    • FASTP, FASTA, TFASTA, FASTX, FASTY
  – BLAST family
    • BLASTP, BLASTN, TBLAST, BLASTX, BLAT, BLASTZ, MegaBLAST, PsiBLAST, PhiBLAST
  – Others
    • FLASH, PatternHunter, SSAHA, SENSEI, WABA, GLASS
• Suffix tree based methods
  – Mummer, AVID, Reputer, MGA, QUASAR
![Page 6: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/6.jpg)
6
History of sequence searching
• 1970: Needleman-Wunsch (NW)
• 1981: Smith-Waterman (SW)
• 1985: FASTP (FASTA followed in 1988)
• 1990: BLAST
![Page 7: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/7.jpg)
7
Hash Table
![Page 8: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/8.jpg)
8
Hash Table
• K-gram = subsequence of length K
• A^K entries
  – A is the alphabet size
• Linear time construction
• Constant lookup time
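
The k-gram table described above can be sketched in a few lines of Python. This is only an illustration: the function name `build_kgram_index` is made up for this example, and a Python dictionary stores just the k-grams that actually occur rather than reserving all A^K slots.

```python
from collections import defaultdict

def build_kgram_index(sequence, k):
    """Map every k-gram of `sequence` to the list of its 0-based start positions."""
    index = defaultdict(list)
    # One pass over the sequence: linear-time construction.
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index

index = build_kgram_index("TGAGTGCGA", 2)
print(index["GA"])  # start positions of the 2-gram "GA"
```

Looking up a k-gram is then a single dictionary access, which is the constant-time lookup the slide refers to.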
![Page 9: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/9.jpg)
9
FASTP
Lipman & Pearson, 1985
![Page 10: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/10.jpg)
10
FASTP
• Three-phase algorithm
  1. Find short good matches using k-grams
     – K = 1 or 2
  2. Find start and end positions for good matches
  3. Use DP to align good matches
![Page 11: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/11.jpg)
11
FASTP: Phase 1 (1)

| position  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|-----------|---|---|---|---|---|---|---|---|---|----|----|
| protein 1 | n | c | s | p | t | a | . | . | . | .  | .  |
| protein 2 | . | . | . | . | . | a | c | s | p | r  | k  |

| amino acid | position in protein A | position in protein B | offset (pos A - pos B) |
|------------|-----------------------|-----------------------|------------------------|
| a          | 6                     | 6                     | 0                      |
| c          | 2                     | 7                     | -5                     |
| k          | -                     | 11                    |                        |
| n          | 1                     | -                     |                        |
| p          | 4                     | 9                     | -5                     |
| r          | -                     | 10                    |                        |
| s          | 3                     | 8                     | -5                     |
| t          | 5                     | -                     |                        |

Note the common offset for the 3 amino acids c, s and p. A possible alignment can be quickly found:

    protein 1  n c s p t a
                 | | |
    protein 2  a c s p r k
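
The offset table in this example can be reproduced with a short Python sketch. It is a simplified illustration with k = 1, not the FASTP implementation; the function name `offset_counts` and the `start_a`/`start_b` parameters (used to reproduce the slide's 1-based positions) are assumptions made for this example.

```python
from collections import Counter, defaultdict

def offset_counts(seq_a, seq_b, k=1, start_a=1, start_b=1):
    """Count how often each offset (pos A - pos B) is shared by a k-gram."""
    # Lookup table: k-gram -> positions in protein A.
    index = defaultdict(list)
    for i in range(len(seq_a) - k + 1):
        index[seq_a[i:i + k]].append(start_a + i)
    # Scan protein B and record the offset of every shared k-gram.
    counts = Counter()
    for j in range(len(seq_b) - k + 1):
        for pos_a in index.get(seq_b[j:j + k], ()):
            counts[pos_a - (start_b + j)] += 1
    return counts

# Protein 1 occupies positions 1-6, protein 2 occupies positions 6-11.
counts = offset_counts("ncspta", "acsprk", k=1, start_a=1, start_b=6)
print(counts.most_common(1))  # the most frequent offset and its count
```

The most frequent offset marks the diagonal on which c, s and p line up.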
![Page 12: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/12.jpg)
12
FASTP: Phase 1 (2)
• Similar to dot plot
• Offsets range from 1-m to n-1
• Each offset is scored as
  – # matches - # mismatches
• Diagonals (offsets) with large score show local similarities
• How does it depend on k?
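
As a rough illustration of the diagonal scoring rule above, the following Python sketch scores every offset as matches minus mismatches over the overlapping positions. It compares single letters (k = 1) and uses 0-based offsets `d = i - j`; the function name `diagonal_scores` is hypothetical, and FASTP itself works on k-gram hits rather than a full scan like this.

```python
def diagonal_scores(a, b):
    """Score each diagonal d = i - j as (# matches) - (# mismatches)."""
    n, m = len(a), len(b)
    scores = {}
    for d in range(1 - m, n):          # offsets run from 1-m to n-1
        score = 0
        for j in range(m):
            i = j + d                  # position in `a` on this diagonal
            if 0 <= i < n:
                score += 1 if a[i] == b[j] else -1
        scores[d] = score
    return scores

scores = diagonal_scores("acsprk", "ncsp")
best = max(scores, key=scores.get)
print(best, scores[best])  # diagonal with the largest score
```

This full scan costs O(nm) time; the k-gram lookup table avoids that by touching only the diagonals that contain at least one k-gram match.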
![Page 13: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/13.jpg)
13
FASTP: Phase 2
• 5 best diagonal runs are found
• Rescore these 5 regions using PAM250.
  – Initial score
• Indels are not considered yet
![Page 14: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/14.jpg)
14
FASTP: Phase 3
• Sort the aligned regions in descending order of score
• Optimize these alignments using Needleman-Wunsch
• Report the results
![Page 15: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/15.jpg)
15
FASTP - Discussion
• Results are not optimal. Why?
• How does performance compare to Smith-Waterman?
• What is the impact of k?
• How does this idea work for DNA?
  – K = 4 or 6 for DNA
![Page 16: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/16.jpg)
16
FASTA – Improvement Over FASTP
Pearson 1995
![Page 17: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/17.jpg)
17
FASTA (1)
• Phase 2: Choose 10 best diagonal runs instead of 5
![Page 18: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/18.jpg)
18
FASTA (2)
• Phase 2.5
  – Eliminate diagonals that score less than some given threshold.
  – Combine matches to find longer matches. This incurs a join penalty similar to a gap penalty.
![Page 19: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/19.jpg)
19
FASTA Variations
• TFASTX and TFASTY: protein query against a DNA library translated in all reading frames
• FASTX, FASTY: DNA query translated in all reading frames against a protein database
![Page 20: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/20.jpg)
20
BLAST
Altschul, Gish, Miller, Myers, Lipman, 1990
![Page 21: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/21.jpg)
21
BLAST (or BLASTP)
• BLAST – Basic Local Alignment Search Tool
• An approximation of Smith-Waterman
• Designed for database searches
  – Short query sequence against long database sequence or a database of many sequences
• Sacrifices search sensitivity for speed
![Page 22: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/22.jpg)
22
BLAST Algorithm (1)
• Eliminate low complexity regions from the query sequence.
  – Replace them with X (protein) or N (DNA)
• Hash table on query sequence.
  – K = 3 for proteins
  – Example: the query MCGPFILGTYC gives the 3-grams MCG, CGP, and so on
![Page 23: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/23.jpg)
23
BLAST Algorithm (2)
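• For each k-gram find all k-grams that align with score at least cutoff T using BLOSUM62
  – 20^k candidates
  – ~50 on the average per k-gram
  – ~50n for the entire query
• Build hash table
• Example: the query PQGMCGPFILGTYC gives the 3-grams PQG, QGM, and so on
  – Scores against PQG: PQG 18, PEG 15, PRG 14, PSG 13, PQA 12
  – T = 13: PQG, PEG, PRG and PSG are kept; PQA (12) falls below the cutoff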
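
The neighborhood example for PQG can be checked with a small Python sketch. It is deliberately incomplete: `BLOSUM62_EXCERPT` holds only the handful of BLOSUM62 entries needed for the five words on the slide, and the candidate list is given by hand instead of enumerating all 20^3 words. The function names are made up for this example.

```python
# Excerpt of BLOSUM62: only the pairs needed for the slide's example.
BLOSUM62_EXCERPT = {
    ("P", "P"): 7, ("Q", "Q"): 5, ("G", "G"): 6,
    ("Q", "E"): 2, ("Q", "R"): 1, ("Q", "S"): 0, ("G", "A"): 0,
}

def pair_score(x, y):
    """Symmetric lookup in the excerpt (raises KeyError for missing pairs)."""
    if (x, y) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(x, y)]
    return BLOSUM62_EXCERPT[(y, x)]

def word_score(query_word, candidate):
    """Ungapped score of two equal-length words, position by position."""
    return sum(pair_score(q, c) for q, c in zip(query_word, candidate))

def neighbors(query_word, candidates, cutoff):
    """Keep the candidates that score at least `cutoff` against the query word."""
    scored = [(w, word_score(query_word, w)) for w in candidates]
    return [(w, s) for w, s in scored if s >= cutoff]

print(neighbors("PQG", ["PQG", "PEG", "PRG", "PSG", "PQA"], cutoff=13))
```

A full implementation would load the complete matrix and generate candidates systematically; the filtering step itself stays the same.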
![Page 24: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/24.jpg)
24
BLAST Algorithm (3)
• Sequentially scan the database and locate each k-gram in the hash table
• Each match is a seed for an ungapped alignment.
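
A minimal sketch of the scanning step, assuming exact k-gram matches only (the protein version would also insert the neighborhood words into the table). The function name `find_seeds` and the toy database string are invented for illustration.

```python
from collections import defaultdict

def find_seeds(query, database, k):
    """Return (query_pos, db_pos) pairs where the same k-gram occurs in both."""
    # Hash table built on the query, as in the previous steps.
    table = defaultdict(list)
    for q in range(len(query) - k + 1):
        table[query[q:q + k]].append(q)
    # Sequential scan of the database, one lookup per position.
    seeds = []
    for d in range(len(database) - k + 1):
        for q in table.get(database[d:d + k], ()):
            seeds.append((q, d))
    return seeds

seeds = find_seeds("MCGPFILGTYC", "AAMCGPWWILGTAA", k=3)
print(seeds)
```

Seeds with the same value of `db_pos - query_pos` lie on the same diagonal, which is what the next step relies on.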
![Page 25: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/25.jpg)
25
BLAST Algorithm (4)
• HSP (High Scoring Pair) = A match between a query word and the database
• Find a “hit”: Two non-overlapping HSPs on a diagonal within distance A
• Extend the hit until the score drops more than a threshold value, X, below the best score seen so far
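
The extension step can be sketched as follows. This is a simplified, assumption-laden illustration: it extends a single exact seed (not a pair of HSPs), uses a flat match/mismatch score instead of a substitution matrix, and stops in each direction once the running score has dropped more than `x_drop` below the best score seen. The function name and parameters are hypothetical.

```python
def extend_ungapped(query, db, q_start, d_start, k, x_drop, match=1, mismatch=-1):
    """Extend an exact k-gram seed left and right without gaps.

    Returns (best_score, query_start, query_end) with query_end exclusive.
    """
    def score(a, b):
        return match if a == b else mismatch

    best = current = k * match            # the seed itself is an exact match
    best_q_start, best_q_end = q_start, q_start + k

    # Extend to the right.
    qi, di = q_start + k, d_start + k
    while qi < len(query) and di < len(db):
        current += score(query[qi], db[di])
        qi += 1
        di += 1
        if current > best:
            best, best_q_end = current, qi
        elif best - current > x_drop:
            break

    # Extend to the left, starting again from the best score so far.
    current = best
    qi, di = q_start - 1, d_start - 1
    while qi >= 0 and di >= 0:
        current += score(query[qi], db[di])
        if current > best:
            best, best_q_start = current, qi
        elif best - current > x_drop:
            break
        qi -= 1
        di -= 1

    return best, best_q_start, best_q_end

print(extend_ungapped("TTACGTACGAA", "GGACGTACGCC", 3, 3, k=3, x_drop=2))
```

A larger `x_drop` lets the extension walk through a short run of mismatches before giving up, at the cost of more work per seed.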
![Page 26: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/26.jpg)
26
BLAST Algorithm (5)
• Keep only the extended matches that have a score of at least S.
• Determine the statistical significance of the result
![Page 27: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/27.jpg)
27
What is Statistical Significance?
13 : 15
13 : 15
• Two one-on-one games, two scores.
• Which result is more significant?
• Expected: maybe a random result.
• Unexpected: significant; it may carry real meaning.
![Page 28: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/28.jpg)
28
Statistical Significance
• E-value: The expected number of matches with score at least S
• $E = K\,m\,n\,e^{-\lambda S}$
  – m, n: sequence lengths
  – S: alignment score
  – K, $\lambda$: normalization parameters
• P-value: The probability of having at least one match with score at least S
  – $P = 1 - e^{-E}$
• The smaller these values are, the more significant the result
• http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/glossary2.html
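
The two formulas above translate directly into Python. In this sketch the values passed for `K` and `lam` are placeholders chosen only for illustration; real values depend on the scoring matrix and gap penalties and are not given on the slide.

```python
import math

def e_value(score, m, n, K, lam):
    """Expected number of matches with score at least `score`: K*m*n*exp(-lam*score)."""
    return K * m * n * math.exp(-lam * score)

def p_value(e):
    """Probability of at least one such match: 1 - exp(-E)."""
    return -math.expm1(-e)   # numerically stable form of 1 - exp(-e)

# Placeholder parameters (assumptions, not taken from the slides).
K, lam = 0.1, 0.3
for s in (30, 50, 70):
    e = e_value(s, m=500, n=1_000_000, K=K, lam=lam)
    print(s, e, p_value(e))
```

For small E the two quantities are nearly equal, since 1 - e^{-E} is approximately E when E is close to zero.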
![Page 29: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/29.jpg)
29
BLAST - Analysis
• K (k-gram)
  – Lower: more sensitive. Slower.
• T (neighbor cutoff)
  – Lower: Find distant neighbors. Introduces noise.
• X (extension cutoff)
  – Higher: lower chance of getting stuck in a local minimum. Slower.
![Page 30: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/30.jpg)
30
Sample Query
• http://www.ncbi.nlm.nih.gov/BLAST/
I D R A M S A A R G V F E R G D W S L S S P A K R K A V L N K L A D L M E A H A E E L A L L E T L D T G K P I R H S L R D D I P G A A R A I R W Y A E A I D K V Y G E V A T T S S H E L A M I V R E P V G V I A A I V P W N F P L L L T C W K L G P A L A A G N S V I L K P S E K S P L S A I R L A G L A K E A G L P D G V L N V V T G F G H E A G Q A L S R H N D I D A I A F T G S T R T G K Q L L K D A G D S N M K R V W L E A G G K S A N I V F A D C P D L Q Q A A S A T A A G I F Y N Q G Q V C I A G T R L L L E E S I A D E F L A L L K Q Q A Q N W Q P G H P L D P A T T M G T L I D C A H A D S V H S F I R E G E S K G Q L L L D G R N A G L A A A I G P T I F V D V D P N A S L S R E E I F G P V L V V T R F T S E E Q A L Q L A N D S Q Y G L G A A V W T R D L S R A H R M S R R L K A G S V F V N N Y N D G D M T V P F G G Y K Q S G N G R D K S L H A L E K F T E L K T I W I
Dhal_ecoli
![Page 31: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/31.jpg)
31
BLASTN
• BLAST for nucleic acids
• K = 11
• Exact match instead of neighborhood search.
![Page 32: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/32.jpg)
32
BLAST Variations
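| Program | Query        | Target       | Type     |
|---------|--------------|--------------|----------|
| BLASTP  | Protein      | Protein      | Gapped   |
| BLASTN  | Nucleic acid | Nucleic acid | Gapped   |
| BLASTX  | Nucleic acid | Protein      | Gapped   |
| TBLASTN | Protein      | Nucleic acid | Gapped   |
| TBLASTX | Nucleic acid | Nucleic acid | Ungapped |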
![Page 33: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/33.jpg)
33
Even More Variations
– PsiBLAST (iterative)
– BLAT, BLASTZ, MegaBLAST
– FLASH, PatternHunter, SSAHA, SENSEI, WABA, GLASS
– Main differences are
  • Seed choice (k, gapped seeds)
  • Additional data structures
![Page 34: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/34.jpg)
34
Suffix Trees
![Page 35: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/35.jpg)
35
Suffix Tree
• Tree structure that contains all suffixes of the input sequence
• Example: the suffixes of TGAGTGCGA
  – TGAGTGCGA
  – GAGTGCGA
  – AGTGCGA
  – GTGCGA
  – TGCGA
  – GCGA
  – CGA
  – GA
  – A
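
To make the idea concrete, here is a small Python sketch that inserts every suffix of the example sequence into a nested-dictionary trie and then tests whether a pattern occurs. Note the assumption: this is a naive, uncompressed suffix trie with quadratic construction cost, not a true linear-time suffix tree; the function names are invented for this example.

```python
def build_suffix_trie(text):
    """Insert every suffix of `text` into a trie of nested dictionaries."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root

def contains(trie, pattern):
    """Walk down the trie one letter at a time: O(m) for an m-letter pattern."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("TGAGTGCGA")
print(contains(trie, "GTGC"), contains(trie, "GCA"))
```

Every substring of the text is a prefix of some suffix, which is why a single walk from the root answers the query.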
![Page 36: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/36.jpg)
36
Suffix Tree Example
![Page 37: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/37.jpg)
37
Suffix Tree Analysis
• O(n) space and construction time
  – 10n to 70n space usage reported
• O(m) search time for m-letter sequence
• Good for
  – Small data
  – Exact matches
![Page 38: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/38.jpg)
38
Suffix Array
• 5 bytes per letter
• O(m log n) search time
• Better space usage
• Slower search
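
A suffix array and its O(m log n) search can be sketched as below. The construction here simply sorts the suffixes, which is much slower than the specialized algorithms used in practice; it is an assumption made to keep the example short. Function names are hypothetical.

```python
def build_suffix_array(text):
    """Start positions of all suffixes, in lexicographic order of the suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """Binary search for the block of suffixes that start with `pattern`."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                       # first suffix whose prefix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:                       # first suffix whose prefix > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])

text = "TGAGTGCGA"
sa = build_suffix_array(text)
print(sa)
print(find_occurrences(text, sa, "GA"))
```

Each of the O(log n) binary-search steps compares at most m letters, which gives the O(m log n) search time listed above.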
![Page 39: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/39.jpg)
39
Mummer
![Page 40: CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments](https://reader035.vdocuments.site/reader035/viewer/2022062517/568134c4550346895d9be7a7/html5/thumbnails/40.jpg)
40
Other Sequence Comparison Tools
• Reputer, MGA, AVID
• QUASAR (suffix array)