sequence comparison techniques

SEQUENCE COMPARISON TECHNIQUE. Ms. Ruchi Yadav, Lecturer, Amity Institute of Biotechnology, Amity University, Lucknow (UP)

Upload: ruchibioinfo

Post on 11-May-2015


DESCRIPTION

Heuristic approach: BLAST and FASTA

TRANSCRIPT

Page 1: Sequence comparison techniques

SEQUENCE COMPARISON TECHNIQUE

Ms. Ruchi Yadav, Lecturer

Amity Institute of Biotechnology, Amity University

Lucknow (UP)

Page 2: Sequence comparison techniques

SEQUENCE COMPARISON TECHNIQUE

Pairwise Alignment

Local Alignment (Smith-Waterman Algorithm)

Global Alignment (Needleman-Wunsch Algorithm)

Multiple Alignment

Heuristic Methods

Rather than struggling to find the optimal alignment, we may save a lot of time by employing heuristic algorithms.

Execution time is much faster.

May completely miss the optimal alignment.

Examples: FASTA and BLAST.

Page 3: Sequence comparison techniques

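HEURISTIC METHODS

Problem of Dynamic Programming

D.P. computes scores across large areas of the matrix that contribute nothing to the optimal alignment.

FASTA focuses on the diagonal area.

Score matrix (row labels at the right, column labels along the bottom):

6 5 5 5 5 4 3 3 3 2 1   A

5 5 5 5 4 4 3 3 2 2 1   G

4 4 4 4 4 4 3 3 2 2 1   C

3 3 3 3 3 3 3 3 2 2 1   T

2 2 2 2 2 2 2 2 2 2 1   A

2 2 2 2 1 1 1 1 1 1 1   G

1 1 1 1 1 1 1 1 1 1 1   G

A T T G A C T T A A G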

Page 4: Sequence comparison techniques

HEURISTIC

Heuristic: a good local alignment should contain some exactly matching subsequence.

FASTA focuses on this area.

Page 5: Sequence comparison techniques

HEURISTIC METHODS: FASTA AND BLAST

FASTA: the first fast sequence searching algorithm for comparing a query sequence against a database.

BLAST: Basic Local Alignment Search Tool.

Improvement over FASTA: search speed, ease of use, statistical rigor.

Page 6: Sequence comparison techniques

FASTA ALGORITHM

(a) Find runs of identical words. Identify regions shared by the two sequences that have the highest density of single identities (ktup=1) or two consecutive identities (ktup=2).

(b) Re-score using the PAM matrix. The longest diagonals are scored again using the PAM-250 matrix (or another matrix). The best scores are saved as “init1” scores.
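Step (b) can be sketched as a search for the best-scoring ungapped run along one diagonal. This is a minimal illustration under stated assumptions, not FASTA's actual code: the names `simple_score` and `best_segment_on_diagonal` are hypothetical, and a plain match/mismatch score stands in for the PAM-250 lookup.

```python
def simple_score(a, b, match=2, mismatch=-1):
    # Hypothetical stand-in for a PAM-250 lookup.
    return match if a == b else mismatch


def best_segment_on_diagonal(query, subject, offset, score=simple_score):
    """Best ungapped run on the diagonal where subject index = query index + offset.

    Returns (score, (start, end)) in query coordinates, end exclusive.
    """
    best, best_span = 0, (0, 0)
    current, current_start = 0, 0
    first = max(0, -offset)
    last = min(len(query), len(subject) - offset)
    for i in range(first, last):
        s = score(query[i], subject[i + offset])
        if current <= 0:
            # A non-positive running score cannot help: restart the run here.
            current, current_start = s, i
        else:
            current += s
        if current > best:
            best, best_span = current, (current_start, i + 1)
    return best, best_span


init1, span = best_segment_on_diagonal("ACGTT", "TACGA", offset=1)
print(init1, span)
```

In this toy case the run ACG on diagonal +1 is kept as the best segment; a real search would repeat this for each of the top diagonals.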

Page 7: Sequence comparison techniques

FASTA ALGORITHM

“init1” ktup=2

Page 8: Sequence comparison techniques

FASTA ALGORITHM

(c) Join segments using gaps and eliminate other segments. Long diagonals that are neighbors are joined. The score for this joined region is “initn”. This score may be lower because of the gap penalty.

(d) Use DP to create the optimal alignment. Construct an optimal alignment of the query sequence and the library sequence (Smith-Waterman algorithm). This score is reported as the optimized score.
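The joining in step (c) can be sketched as chaining compatible segments and charging a penalty for each join. This is only an illustration: the segment tuple layout, the function name `initn_score`, and the constant `join_penalty` are assumptions, not FASTA's actual scoring rule.

```python
def initn_score(segments, join_penalty=2):
    """Best total score of a chain of non-overlapping segments.

    Each segment is (q_start, q_end, s_start, s_end, score) with exclusive ends.
    Segment a may precede segment b if a ends before b starts in both sequences.
    Every join costs join_penalty (an assumed constant gap penalty).
    """
    segs = sorted(segments)
    best = [seg[4] for seg in segs]  # best chain score ending at each segment
    for b, (q_start, _q_end, s_start, _s_end, score) in enumerate(segs):
        for a in range(b):
            a_q_end, a_s_end = segs[a][1], segs[a][3]
            if a_q_end <= q_start and a_s_end <= s_start:
                best[b] = max(best[b], best[a] + score - join_penalty)
    return max(best, default=0)


segments = [(0, 5, 0, 5, 10), (7, 12, 8, 13, 8), (3, 6, 20, 23, 4)]
print(initn_score(segments))
```

The third segment overlaps the first in the query, so it cannot be chained and is effectively eliminated, while the first two are joined at the cost of one penalty.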

Page 9: Sequence comparison techniques

FASTA ALIGNMENTS

“initn”

Page 10: Sequence comparison techniques

FASTA ALGORITHM - FIND RUNS OF IDENTICAL WORDS

A lookup table showing the positions of each word of length k, or k-tuple, is constructed for each sequence.

The relative positions of each word in the two sequences are then calculated by subtracting the position in the first sequence from that in the second.

Words that have the same offset position are in phase and reveal a region of alignment between the two sequences.
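The lookup-and-offset idea above can be sketched as follows. This is a minimal sketch, not FASTA's implementation; the function names `build_lookup` and `offset_counts` are hypothetical, and positions are 1-based to match the slides.

```python
from collections import Counter, defaultdict


def build_lookup(seq, k):
    """Map each k-tuple to the list of 1-based positions where it starts."""
    table = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        table[seq[pos:pos + k]].append(pos + 1)
    return table


def offset_counts(first, second, k):
    """Count shared k-tuples per offset (position in second minus position in first)."""
    table_first = build_lookup(first, k)
    table_second = build_lookup(second, k)
    counts = Counter()
    for word, positions_first in table_first.items():
        for p1 in positions_first:
            for p2 in table_second.get(word, ()):
                counts[p2 - p1] += 1
    return counts


counts = offset_counts("ACGTAC", "TTACGT", k=2)
print(counts.most_common(1))
```

The offset with the most shared words marks the diagonal on which the two sequences are in phase.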

Page 11: Sequence comparison techniques

LOOK-UP TABLE

Page 12: Sequence comparison techniques

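FASTA - ALGORITHM - Use look-up table

Query:    G A A T T C A G T T A

Sequence: G G A T C G A

Look-up table (positions of each residue in the query):

Q    Location

A    2, 3, 7, 11

C    6

G    1, 8

T    4, 5, 9, 10

Dot matrix (figure): rows are the residues of the sequence and columns are the query positions 1 to 11. A dot marks each match, giving 4 dots in each A row, 2 in each G row, 1 in the C row and 4 in the T row.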
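As a check on the worked example, the look-up table for ktup=1 and the dot-matrix hits can be rebuilt from the two sequences. This is a small sketch; the variable names are illustrative.

```python
query = "GAATTCAGTTA"
sequence = "GGATCGA"

# Look-up table for ktup=1: 1-based positions of each residue in the query.
lookup = {}
for pos, residue in enumerate(query, start=1):
    lookup.setdefault(residue, []).append(pos)

# Dot-matrix hits: (row, column) pairs where a sequence residue matches a query residue.
dots = [
    (row, col)
    for row, residue in enumerate(sequence, start=1)
    for col in lookup.get(residue, [])
]

for residue in sorted(lookup):
    print(residue, lookup[residue])
print(len(dots), "dots")
```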

Page 13: Sequence comparison techniques

FASTA - ALGORITHM

Use dynamic programming in a restricted band around the best-scoring initial region to look for an alignment that scores higher than that region.

The width of this band is a parameter.
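A banded version of Smith-Waterman can be sketched as below. This is a minimal sketch under stated assumptions: the function name, the `center` and `width` parameters, the linear gap penalty, and the match/mismatch scores are all illustrative choices, not FASTA's actual settings. The function also counts the cells it fills, so the saving over the full matrix is visible.

```python
def banded_smith_waterman(a, b, center=0, width=None, match=2, mismatch=-1, gap=-2):
    """Local alignment score, filling only cells with |(j - i) - center| <= width.

    width=None fills the whole matrix (ordinary Smith-Waterman).
    Returns (best score, number of cells computed).
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    cells = 0
    for i in range(1, rows):
        if width is None:
            lo, hi = 1, cols - 1
        else:
            lo = max(1, i + center - width)
            hi = min(cols - 1, i + center + width)
        for j in range(lo, hi + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Cells outside the band stay at 0, so they never extend an alignment.
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
            cells += 1
    return best, cells


full = banded_smith_waterman("GAATTCAGTTA", "GGATCGA")
band = banded_smith_waterman("GAATTCAGTTA", "GGATCGA", center=0, width=2)
print("full:", full, "banded:", band)
```

The banded score can never exceed the full score, which is the price of the speed-up: an alignment that leaves the band is missed.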

Page 14: Sequence comparison techniques

FASTA - COMPLEXITY

Complexity of Steps 1 and 2 (selecting the best 10 diagonal runs):

Let n be the length of a sequence from the DB.

The cost is O(n), because Step 1 just uses the look-up table.

O(n) << O(mn) for m, n = 100 to 200.

Page 15: Sequence comparison techniques

FASTA - COMPLEXITY

Computing the partial D.P.: the cost depends on the size of the restricted area and is less than O(mn).

Therefore, FASTA is faster than full D.P.

The width of this band is a parameter.

Page 16: Sequence comparison techniques

STEP 1: FINDING SEEDS

(Figure; axes: s and t.)

Page 17: Sequence comparison techniques

STEP 2: RE-SCORING SEGMENTS, KEEPING TOP 10

(Figure; axes: s and t.)

Page 18: Sequence comparison techniques

STEP 3: ELIMINATING UNLIKELY SEGMENTS

(Figure; axes: s and t.)

Page 19: Sequence comparison techniques

STEP 4: FINDING THE BEST ALIGNMENT

(Figure; axes: s and t.)

Page 20: Sequence comparison techniques

VERSIONS OF FASTA

FASTA compares a query protein sequence to a protein sequence library to find similar sequences. FASTA also compares a DNA sequence to a DNA sequence library.

TFASTA compares a query protein sequence to a DNA sequence library, after translating the DNA sequence library in all six reading frames.

FASTX and FASTY translate a query DNA sequence in all three forward reading frames and compare all three frames to a protein sequence database.

TFASTX and TFASTY compare a query protein sequence to a DNA sequence database, translating each DNA sequence in all six possible reading frames.
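Translation in all six reading frames, as used by the translated searches above, can be sketched as follows. This is a minimal sketch assuming the standard genetic code and an unambiguous ACGT alphabet; the function names are illustrative and not taken from the FASTA package.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate complete codons from the start of dna; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(dna):
    """Three forward frames followed by three reverse-complement frames."""
    rc = reverse_complement(dna)
    return [translate(dna[f:]) for f in range(3)] + [translate(rc[f:]) for f in range(3)]


for frame in six_frames("ATGGCCTAA"):
    print(frame)
```

The first three entries correspond to the forward frames used by FASTX and FASTY; all six correspond to the library translation described for TFASTA, TFASTX and TFASTY.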

Page 21: Sequence comparison techniques

BLAST

Basic Local Alignment Search Tool (Altschul et al. 1990, 1994, 1997)

Publications:

Ungapped BLAST: Altschul et al., 1990

Gapped BLAST, PSI-BLAST: Altschul et al., 1997

Heuristic method for local alignment.

Designed specifically for database searches.

Based on the same assumption as FASTA: that good alignments contain short lengths of exact matches.

Page 22: Sequence comparison techniques

BASIC LOCAL ALIGNMENT SEARCH TOOL (BLAST)

Input:

Query (target) sequence: DNA, RNA or protein

Scoring scheme: gap penalties, substitution matrix for proteins, identity/mismatch scores for DNA/RNA

Word length W: typically W=3 for proteins and W=11 for DNA/RNA

Output:

Statistically significant matches

Page 23: Sequence comparison techniques

BLAST ALGORITHM PARAMETERS

Page 24: Sequence comparison techniques

ALGORITHM OF BLAST

There are three distinct steps:

Step 1: Query preprocessing

Step 2: Scan the database for hits

Step 3: Extension of hits

Page 25: Sequence comparison techniques

BLAST - ALGORITHM

Step 1: Query preprocessing.

Create neighbourhood words for each query word. A query of length L contains at most L-w+1 words of length w.

(Figure labels: Query Word, Neighborhood words.)

Page 26: Sequence comparison techniques

BLAST - ALGORITHM

Step 1: Query preprocessing. Make a list of words of length 3 for a protein query (word length 11 is used for DNA sequences).

Page 27: Sequence comparison techniques

BLAST - QUERY PREPROCESSING

Compile the list of high-scoring short words from the query. The query word length is 3. Words scoring below the threshold are not pursued further.
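Query preprocessing can be sketched as below. This is only an illustration: `toy_score` and its similarity groups are hypothetical stand-ins for a real substitution matrix, and the threshold values are arbitrary, so the resulting neighborhoods will differ from those BLAST would produce.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Hypothetical similarity groups, for illustration only (not a real substitution matrix).
GROUPS = ["ILVM", "FWY", "KRH", "DE", "ST", "NQ"]


def toy_score(a, b):
    if a == b:
        return 4
    if any(a in g and b in g for g in GROUPS):
        return 1
    return -2


def query_words(query, w=3):
    """All L-w+1 overlapping words of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def neighborhood(word, threshold):
    """All words of the same length that score at least threshold against word."""
    result = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(toy_score(a, b) for a, b in zip(word, candidate))
        if score >= threshold:
            result.append("".join(candidate))
    return result


print(query_words("PQGEFG"))
print(neighborhood("PQG", threshold=9))
```

Raising the threshold shrinks each neighborhood and speeds up the scan, at the cost of sensitivity.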

Page 28: Sequence comparison techniques

BLAST - ALGORITHM

Step 2: Scan the database for hits.

For each word in the list, identify all exact matches in the DB sequences.

(Figure: the query word and its neighborhood word list from Step 1 are matched against the sequences in the DB in Step 2; the example DB sequences are labelled Sequence 1 and Sequence 2.)

The purpose of Steps 1 and 2 is the same as in FASTA.
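The scan for hits can be sketched as a dictionary lookup over every word of a database sequence. This is a minimal sketch: the function names are hypothetical, positions are 0-based, and for brevity the index holds only the exact query words, whereas a full word list would also map each neighborhood word to its query position.

```python
def build_word_index(query, w=3):
    """Map each query word to the 0-based query positions where it starts."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    return index


def find_hits(word_index, db_seq, w=3):
    """Return (query_pos, db_pos) for every exact word match in db_seq."""
    hits = []
    for j in range(len(db_seq) - w + 1):
        for i in word_index.get(db_seq[j:j + w], ()):
            hits.append((i, j))
    return hits


index = build_word_index("PQGEFG")
print(find_hits(index, "AAPQGEFA"))
```

Each database sequence is read once, so the scan is linear in the size of the database for a fixed word list.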

Page 29: Sequence comparison techniques

STEP 3: EXTENSION OF THE HITS

Every hit that has been generated is now extended in both directions, without gaps, to determine whether it may be part of a longer segment pair with a higher score.

Page 30: Sequence comparison techniques

STEP 3: EXTENSION OF THE HITS

HSP (High-scoring Segment Pair): if the extended segment pair has a score greater than or equal to S (set as a parameter of the program), it is called an HSP.

MSP (Maximal Segment Pair): in a comparison, for every sequence in the database, the best-scoring HSP is called the MSP.
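Ungapped extension and the HSP/MSP selection can be sketched as follows. This is a minimal sketch under stated assumptions: the match/mismatch scores and the `x_drop` stopping rule (stop once the running score falls more than `x_drop` below the best seen) are illustrative choices, and the function names are hypothetical.

```python
def extend_hit(query, subject, i, j, w, match=1, mismatch=-1, x_drop=3):
    """Extend the hit query[i:i+w] / subject[j:j+w] in both directions, without gaps.

    Returns (score, query_start, query_end) with query_end exclusive.
    """
    def score(a, b):
        return match if a == b else mismatch

    best = current = sum(score(query[i + k], subject[j + k]) for k in range(w))
    best_end = w    # extent to the right of i that gave the best score
    best_start = 0  # extent to the left of i that gave the best score

    k = w
    while i + k < len(query) and j + k < len(subject):  # extend to the right
        current += score(query[i + k], subject[j + k])
        k += 1
        if current > best:
            best, best_end = current, k
        elif best - current > x_drop:
            break

    current = best
    k = 1
    while i - k >= 0 and j - k >= 0:  # extend to the left
        current += score(query[i - k], subject[j - k])
        if current > best:
            best, best_start = current, k
        elif best - current > x_drop:
            break
        k += 1

    return best, i - best_start, i + best_end


def hsps_and_msp(extensions, s_cutoff):
    """Keep extensions scoring at least S as HSPs; the best HSP is the MSP."""
    hsps = [e for e in extensions if e[0] >= s_cutoff]
    return hsps, max(hsps, default=None)


ext = extend_hit("ACGTACGT", "TTGTACGA", i=2, j=2, w=3)
print(ext)
print(hsps_and_msp([ext, (3, 0, 3)], s_cutoff=4))
```

Here `hsps_and_msp` is meant to be called once per database sequence, matching the definition of the MSP given above.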

Page 31: Sequence comparison techniques

HIGH-SCORING SEGMENT PAIR (HSP)

Page 32: Sequence comparison techniques

MAXIMAL SEGMENT PAIR (MSP)

Page 33: Sequence comparison techniques

STEP 2: EXTRACTING SEEDS

(Figure; axes: s and t.)

Page 34: Sequence comparison techniques

STEP 3: FINDING HSPS

(Figure; axes: s and t.)

Page 35: Sequence comparison techniques

STEP 4: COMBINING HSPS

(Figure; axes: s and t.)

Page 36: Sequence comparison techniques

BLAST

Page 37: Sequence comparison techniques

BASIC BLAST

nucleotide blast: Search a nucleotide database using a nucleotide query. Algorithms: blastn, megablast, discontiguous megablast

protein blast: Search a protein database using a protein query. Algorithms: blastp, psi-blast, phi-blast

blastx: Search a protein database using a translated nucleotide query

tblastn: Search a translated nucleotide database using a protein query

tblastx: Search a translated nucleotide database using a translated nucleotide query

Page 38: Sequence comparison techniques

SPECIALIZED BLAST

• Make specific primers with Primer-BLAST
• Search trace archives
• Find conserved domains in your sequence (cds)
• Find sequences with similar conserved domain architecture (cdart)
• Search sequences that have gene expression profiles (GEO)
• Search immunoglobulins (IgBLAST)
• Search for SNPs (snp)
• Screen sequence for vector contamination (vecscreen)
• Align two (or more) sequences using BLAST (bl2seq)
• Search protein or nucleotide targets in PubChem BioAssay
• Search SRA transcript and genomic libraries
• Constraint Based Protein Multiple Alignment Tool
• Needleman-Wunsch Global Sequence Alignment Tool

Page 39: Sequence comparison techniques

BLAST DATABASES

Page 40: Sequence comparison techniques

DATABASES AVAILABLE ON BLAST WEB SERVER

Page 41: Sequence comparison techniques

DATABASES AVAILABLE ON BLAST WEB SERVER

Page 42: Sequence comparison techniques

OPTIONS AND PARAMETER SETTINGS AVAILABLE ON THE BLAST SERVER