Sequence comparison techniques
Heuristic approach: BLAST and FASTA
SEQUENCE COMPARISON TECHNIQUE
Ms. Ruchi Yadav, Lecturer
Amity Institute of Biotechnology, Amity University
Lucknow (UP)
SEQUENCE COMPARISON TECHNIQUE
• Pairwise Alignment
  • Local Alignment (Smith-Waterman Algorithm)
  • Global Alignment (Needleman-Wunsch Algorithm)
• Multiple Alignment
• Heuristic Methods
Rather than struggling to find the optimal alignment, we can save a lot of time by using heuristic algorithms.
• Execution is much faster.
• They may completely miss the optimal alignment.
• Examples: FASTA and BLAST.
HEURISTIC METHODS
Problem of Dynamic Programming
• Dynamic programming (D.P.) computes scores over large areas of the matrix that contribute nothing to the optimal alignment.
• FASTA focuses on the diagonal area.
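D.P. score matrix for GGATCGA (rows) against GAATTCAGTTA (columns):
     G  A  A  T  T  C  A  G  T  T  A
G    1  1  1  1  1  1  1  1  1  1  1
G    1  1  1  1  1  1  1  2  2  2  2
A    1  2  2  2  2  2  2  2  2  2  2
T    1  2  2  3  3  3  3  3  3  3  3
C    1  2  2  3  3  4  4  4  4  4  4
G    1  2  2  3  3  4  4  5  5  5  5
A    1  2  3  3  3  4  5  5  5  5  6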
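The numbers in the matrix above are consistent with a simple cumulative match count: each cell holds the largest number of matches obtainable between the two prefixes. The following minimal Python sketch assumes that scoring (+1 per match, no mismatch or gap penalty); this is read off the numbers, not stated on the slide, and `match_count_matrix` is a hypothetical name. It shows that full D.P. visits every one of the m x n cells.

```python
def match_count_matrix(seq, query):
    """Fill the full D.P. matrix: cell (i, j) holds the best number of
    matches obtainable between seq[:i+1] and query[:j+1]."""
    m, n = len(seq), len(query)
    # One extra row and column of zeros stand for the empty prefixes.
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = score[i - 1][j - 1] + (seq[i - 1] == query[j - 1])
            score[i][j] = max(diag, score[i - 1][j], score[i][j - 1])
    # Drop the zero border so rows line up with the bases of seq.
    return [row[1:] for row in score[1:]]


matrix = match_count_matrix("GGATCGA", "GAATTCAGTTA")
for base, row in zip("GGATCGA", matrix):
    print(base, *row)
```

Every cell is computed even though the best path stays close to a few diagonals, which is the waste the heuristic methods try to avoid.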
HEURISTIC
• Heuristic: a good local alignment should contain some exactly matching subsequences.
• FASTA focuses on these areas.
HEURISTIC METHODS: FASTA AND BLAST
FASTA
• First fast sequence-searching algorithm for comparing a query sequence against a database.
BLAST
• Basic Local Alignment Search Tool
• Improvement on FASTA: search speed, ease of use, statistical rigor.
FASTA ALGORITHM
(a) Find runs of identical words. Identify regions shared by the two sequences that have the highest density of single identities (ktup=1) or two consecutive identities (ktup=2).
(b) Re-score using a PAM matrix. The longest diagonals are scored again using the PAM-250 matrix (or another matrix). The best scores are saved as “init1” scores.
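Step (b) can be pictured as finding the best-scoring ungapped run on one diagonal. The sketch below is only an illustration: `toy_score` (+5 match, -4 mismatch) is an assumed stand-in for a real PAM-250 lookup, and `init1_on_diagonal` is a hypothetical helper name, not part of the FASTA program.

```python
def toy_score(a, b):
    """Stand-in for a PAM-250 lookup (assumption): +5 match, -4 mismatch."""
    return 5 if a == b else -4


def init1_on_diagonal(query, subject, offset, score=toy_score):
    """Best-scoring ungapped run on one diagonal.

    offset = (position in subject) - (position in query), 0-based.
    Returns (best_score, query_start, query_end) with the end exclusive.
    """
    best = (0, 0, 0)
    running, start = 0, None
    for i in range(len(query)):
        j = i + offset
        if not 0 <= j < len(subject):
            continue                # this query position is off the diagonal
        if running <= 0:            # restart the run once it stops paying off
            running, start = 0, i
        running += score(query[i], subject[j])
        if running > best[0]:
            best = (running, start, i + 1)
    return best


print(init1_on_diagonal("GATTACA", "CGATCACA", offset=1))
```

Running this for each of the top diagonals and keeping the highest result would give a score playing the role of “init1”.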
[Figure: “init1”, ktup=2]
FASTA ALGORITHM
(c) Join segments using gaps and eliminate other segments. Long neighboring diagonals are joined. The score for this joined region is “initn”. This score may be lower because of the gap penalty.
(d) Use D.P. to create the optimal alignment. Construct an optimal alignment of the query sequence and the library sequence (Smith-Waterman algorithm). This score is reported as the optimized score.
[Figure: FASTA alignments, “initn”]
FASTA ALGORITHM - FIND RUNS OF IDENTICAL WORDS
A lookup table showing the positions of each word of length k, or k-tuple, is constructed for each sequence.
The relative positions of each word in the two sequences are then calculated by subtracting the position in the first sequence from that in the second.
Words that have the same offset position are in phase and reveal a region of alignment between the two sequences.
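The lookup table and the offset calculation described above can be sketched in a few lines of Python. The function names `build_lookup` and `offset_counts` are hypothetical, positions are 1-based, and the offset is taken as (position in the second sequence) minus (position in the first), following the text.

```python
from collections import Counter, defaultdict


def build_lookup(seq, k):
    """Map each k-tuple to the 1-based positions where it starts in seq."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i + 1)
    return dict(table)


def offset_counts(query, subject, k):
    """Count shared k-tuples per offset (subject position - query position)."""
    lookup = build_lookup(query, k)
    counts = Counter()
    for j in range(len(subject) - k + 1):
        for i in lookup.get(subject[j:j + k], []):
            counts[(j + 1) - i] += 1
    return counts


query, subject = "GAATTCAGTTA", "GGATCGA"
print(build_lookup(query, 1))
print(offset_counts(query, subject, 2))
```

Offsets with many shared words correspond to diagonals worth re-scoring; only the query needs to be indexed, and each library sequence is then scanned once.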
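LOOK-UP TABLE
FASTA - ALGORITHM - Use look-up table
Query:    G A A T T C A G T T A
Sequence: G G A T C G A
Look-up table (positions of each base in the query):
Q   Location
A   2, 3, 7, 11
C   6
G   1, 8
T   4, 5, 9, 10
Dot matrix (query across the top, sequence down the side; * marks a match):
     1  2  3  4  5  6  7  8  9 10 11
     G  A  A  T  T  C  A  G  T  T  A
G    *  .  .  .  .  .  .  *  .  .  .
G    *  .  .  .  .  .  .  *  .  .  .
A    .  *  *  .  .  .  *  .  .  .  *
T    .  .  .  *  *  .  .  .  *  *  .
C    .  .  .  .  .  *  .  .  .  .  .
G    *  .  .  .  .  .  .  *  .  .  .
A    .  *  *  .  .  .  *  .  .  .  *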
FASTA - ALGORITHM
• Apply dynamic programming in a restricted band around the best-scoring alignment to look for an alignment that scores higher.
• The width of this band is a parameter.
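A rough sketch of dynamic programming restricted to a band, assuming a Smith-Waterman style local score with illustrative values (match +1, mismatch -1, linear gap -2). The name `banded_local_score` and the `center`/`width` parameters are hypothetical; FASTA's own implementation and scoring differ in detail.

```python
def banded_local_score(query, subject, center, width,
                       match=1, mismatch=-1, gap=-2):
    """Smith-Waterman local score restricted to a diagonal band.

    Only cells with |(j - i) - center| <= width are computed, where i indexes
    query and j indexes subject (1-based); cells outside the band count as 0.
    """
    prev = {}                       # previous row: column -> score
    best = 0
    for i in range(1, len(query) + 1):
        curr = {}
        lo = max(1, i + center - width)
        hi = min(len(subject), i + center + width)
        for j in range(lo, hi + 1):
            s = match if query[i - 1] == subject[j - 1] else mismatch
            curr[j] = max(0,
                          prev.get(j - 1, 0) + s,    # diagonal move
                          prev.get(j, 0) + gap,      # gap in subject
                          curr.get(j - 1, 0) + gap)  # gap in query
            best = max(best, curr[j])
        prev = curr
    return best


print(banded_local_score("ACGT", "TTACGT", center=2, width=0))
```

A band that misses the true diagonal finds nothing, which is why the band is centered on the best region found in the earlier steps.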
FASTA - COMPLEXITY
Steps 1 and 2 (select the 10 best diagonal runs):
• Let n be the length of a sequence from the database.
• O(n), because Step 1 only uses the look-up table.
• O(n) << O(mn), with m, n = 100 to 200.
Partial D.P.:
• The cost depends on the size of the restricted area, which is smaller than O(mn).
Therefore, FASTA is faster than full D.P.
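To put rough numbers on this comparison, the sketch below counts the cells filled by full D.P. and by a band around one diagonal. The band half-width of 16 is an arbitrary illustrative value, and both function names are hypothetical.

```python
def full_cells(m, n):
    """Cells filled by full dynamic programming."""
    return m * n


def band_cells(m, n, center, width):
    """Cells (i, j) with |(j - i) - center| <= width, 1 <= i <= m, 1 <= j <= n."""
    total = 0
    for i in range(1, m + 1):
        lo = max(1, i + center - width)
        hi = min(n, i + center + width)
        total += max(0, hi - lo + 1)
    return total


m = n = 200
print(full_cells(m, n), band_cells(m, n, center=0, width=16))
```

For sequences of length 200, the band covers only a small fraction of the full matrix, and the saving grows with sequence length because the band scales linearly while the matrix scales quadratically.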
[Figures, axes s and t, one per FASTA step:]
STEP 1: FINDING SEEDS
STEP 2: RE-SCORING SEGMENTS, KEEPING TOP 10
STEP 3: ELIMINATING UNLIKELY SEGMENTS
STEP 4: FINDING THE BEST ALIGNMENT
VERSIONS OF FASTA
FASTA compares a query protein sequence to a protein sequence library to find similar sequences. FASTA also compares a DNA sequence to a DNA sequence library.
TFASTA compares a query protein sequence to a DNA sequence library, after translating the DNA sequence library in all six reading frames.
FASTX and FASTY translate a query DNA sequence in all three forward reading frames and compare the three translations to a protein sequence database.
TFASTX and TFASTY compare a query protein sequence to a DNA sequence database, translating each DNA sequence in all six possible reading frames.
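The six-frame translation these programs rely on can be sketched as follows. This is a minimal illustration using the standard genetic code only; `six_frames` and the frame labels (+1 to +3, -1 to -3) are hypothetical conventions, and the real programs additionally handle frameshifts and alternative codon tables.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Frames +1..+3 on the given strand, -1..-3 on the reverse complement."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {f"+{k + 1}": translate(dna[k:]) for k in range(3)}
    frames.update({f"-{k + 1}": translate(rc[k:]) for k in range(3)})
    return frames


print(six_frames("ATGGCC"))
```

Programs that translate only the forward frames would use just the three "+" entries.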
BLAST
Basic Local Alignment Search Tool
Publications:
• Ungapped BLAST: Altschul et al., 1990
• Gapped BLAST, PSI-BLAST: Altschul et al., 1997
Altschul et al. 1990, 1994, 1997
• Heuristic method for local alignment
• Designed specifically for database searches
• Based on the same assumption as FASTA: good alignments contain short stretches of exact matches
BASIC LOCAL ALIGNMENT SEARCH TOOL (BLAST)
Input:
• Query (target) sequence: DNA, RNA, or protein
• Scoring scheme: gap penalties, substitution matrix for proteins, identity/mismatch scores for DNA/RNA
• Word length W: typically W=3 for proteins and W=11 for DNA/RNA
Output:
• Statistically significant matches
BLAST ALGORITHM PARAMETERS
ALGORITHM OF BLAST
There are three distinct steps:
• Step 1: Query preprocessing
• Step 2: Scan the database for hits
• Step 3: Extension of hits
BLAST - ALGORITHM
Step 1: Query preprocessing
• Compile the list of words of length w from the query: w=3 for proteins, w=11 for DNA sequences. A query of length L yields at most L-w+1 words.
• Create neighborhood words for each query word.
• Words scoring below the threshold are not pursued further.
[Figure: query word and its neighborhood words]
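Query preprocessing can be sketched as below. To keep the example small it uses a DNA alphabet with a toy scoring function (+2 match, -1 mismatch) and a threshold of 3; these are assumptions for illustration, whereas BLAST scores protein words with a substitution matrix. The function names are hypothetical.

```python
from itertools import product


def query_words(query, w):
    """All L - w + 1 overlapping words of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def neighborhood(word, alphabet, threshold, score):
    """Words of the same length whose score against `word` is >= threshold."""
    hits = []
    for cand in product(alphabet, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, cand))
        if total >= threshold:
            hits.append("".join(cand))
    return hits


def toy_score(a, b):
    """Toy scoring (assumption): +2 match, -1 mismatch."""
    return 2 if a == b else -1


words = query_words("GAATTC", 3)
neighbors = neighborhood("GAA", "ACGT", threshold=3, score=toy_score)
print(words)
print(neighbors)
```

With this scoring, a word with one mismatch scores 3 and is kept, while a word with two mismatches scores 0 and falls below the threshold.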
BLAST - ALGORITHM
Step 2: Scan the database for hits
• For each word list, identify all exact matches with the database sequences.
[Figure: Step 1 maps each query word to its neighborhood word list; Step 2 matches the word list against sequences in the DB]
• Steps 1 and 2 serve the same purpose as in FASTA.
[Figure: Sequence 1 against Sequence 2]
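The scan for exact word matches might look like the following sketch. The database contents are made up, `scan_for_hits` is a hypothetical name, and only the exact query words are indexed here; a neighborhood word list would simply add more keys to the same dictionary.

```python
def scan_for_hits(word_positions, database, w):
    """Yield (seq_id, query_pos, subject_pos) for every exact word match.

    word_positions maps each word in the word list to the 0-based query
    positions it was generated from.
    """
    for seq_id, subject in database.items():
        for j in range(len(subject) - w + 1):
            for i in word_positions.get(subject[j:j + w], ()):
                yield seq_id, i, j


query, w = "GAATTC", 3
word_positions = {}
for i in range(len(query) - w + 1):
    word_positions.setdefault(query[i:i + w], []).append(i)

database = {"seq1": "CCGAATAC", "seq2": "TTTTCA"}   # made-up sequences
hits = list(scan_for_hits(word_positions, database, w))
print(hits)
```

Each hit records where the word sits in both sequences, which is the seed that the extension step starts from.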
STEP 3: EXTENSION OF THE HITS
• Every hit that has been generated is now extended in both directions, without gaps, to determine whether it is part of a longer segment pair with a higher score.
• HSP (high-scoring segment pair): if the extended segment pair has a score greater than or equal to S (a parameter of the program), it is called an HSP.
• MSP (maximal segment pair): for each sequence in the database, the best-scoring HSP is called the MSP.
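Ungapped extension and the HSP/MSP definitions can be sketched as follows. The stopping rule (give up once the running score drops more than `x_drop` below the best seen) and the toy scoring (+2 match, -3 mismatch) are assumptions added for the example; the slides only state that hits are extended in both directions without gaps. All function names are hypothetical.

```python
def toy_score(a, b):
    """Toy scoring (assumption): +2 match, -3 mismatch."""
    return 2 if a == b else -3


def extend_hit(query, subject, qpos, spos, w, score, x_drop=5):
    """Extend a word hit in both directions without gaps.

    Returns (score, query_start, query_end, subject_start), end exclusive.
    """
    total = sum(score(query[qpos + k], subject[spos + k]) for k in range(w))

    def grow(step, qi, si):
        best = running = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            running += score(query[qi], subject[si])
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > x_drop:
                break               # score has dropped too far; stop here
            qi, si = qi + step, si + step
        return best, best_len

    right, right_len = grow(+1, qpos + w, spos + w)
    left, left_len = grow(-1, qpos - 1, spos - 1)
    return (total + left + right, qpos - left_len,
            qpos + w + right_len, spos - left_len)


def find_hsps(query, subject, hits, w, score, s_cutoff):
    """Keep extended hits scoring >= S; the best of them is the MSP."""
    hsps = {extend_hit(query, subject, q, s, w, score) for q, s in hits}
    hsps = sorted(h for h in hsps if h[0] >= s_cutoff)
    msp = hsps[-1] if hsps else None
    return hsps, msp


query, subject = "AAGATTACAGG", "CCGATTACATT"
hsps, msp = find_hsps(query, subject, [(4, 4), (3, 3)], 3, toy_score, s_cutoff=10)
print(hsps, msp)
```

Two hits on the same diagonal extend to the same segment pair here, so collecting the results in a set reports it once.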
[Figures: high-scoring segment pair (HSP); maximal segment pair (MSP)]
[Figures, axes s and t, one per BLAST step:]
STEP 2: EXTRACTING SEEDS
STEP 3: FINDING HSPS
STEP 4: COMBINING HSPS
BLAST
BASIC BLAST
• nucleotide blast: search a nucleotide database using a nucleotide query. Algorithms: blastn, megablast, discontiguous megablast
• protein blast: search a protein database using a protein query. Algorithms: blastp, psi-blast, phi-blast
• blastx: search a protein database using a translated nucleotide query
• tblastn: search a translated nucleotide database using a protein query
• tblastx: search a translated nucleotide database using a translated nucleotide query
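The choice among the basic programs depends only on the query type and the database type, so it can be written as a small lookup. This is a convenience sketch; the type labels and the `choose_program` helper are hypothetical and not part of any BLAST interface.

```python
# (query type, database type) -> basic BLAST program
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("translated nucleotide", "protein"): "blastx",
    ("protein", "translated nucleotide"): "tblastn",
    ("translated nucleotide", "translated nucleotide"): "tblastx",
}


def choose_program(query_type, database_type):
    """Return the basic BLAST program for a query/database combination."""
    try:
        return BLAST_PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError(f"no basic BLAST program for {query_type} query "
                         f"against {database_type} database") from None


print(choose_program("protein", "translated nucleotide"))
```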
SPECIALIZED BLAST
• Make specific primers with Primer-BLAST
• Search trace archives
• Find conserved domains in your sequence (cds)
• Find sequences with similar conserved domain architecture (cdart)
• Search sequences that have gene expression profiles (GEO)
• Search immunoglobulins (IgBLAST)
• Search for SNPs (snp)
• Screen sequence for vector contamination (vecscreen)
• Align two (or more) sequences using BLAST (bl2seq)
• Search protein or nucleotide targets in PubChem BioAssay
• Search SRA transcript and genomic libraries
• Constraint Based Protein Multiple Alignment Tool
• Needleman-Wunsch Global Sequence Alignment Tool
BLAST DATABASES
DATABASES AVAILABLE ON BLAST WEB SERVER
OPTIONS AND PARAMETER SETTINGS AVAILABLE ON THE BLAST SERVER