BLAST Workshop
Maya Schushan, June 2009
Workshop OUTLINE
Part 1:
• Introduction and motivation
• How does BLAST work?
Part 2:
• BLAST programs
• Sequence databases
• Work Steps
• Extract and analyze results
Why BLAST? Finding homologues
• Homology: similarity between sequences that results from a common ancestor.
• Sequences that look alike probably have the same function and structure.
• Use a sequence as a search query in order to find homologous sequences in a database.
• Save time! Exploit the knowledge you have about the homologues to draw conclusions about your query.
• Rule of thumb: sequences with more than 25% identity (proteins) or 70% identity (nucleotides) are considered homologous.
• Finding homologues helps to identify sequence motifs.
• Finding homologues shows which regions are evolutionarily conserved, and therefore important for function and/or structure.
• Homologues can be used to construct phylogenetic trees and understand the evolution of the sequence's family.
• Function can be inferred for a novel sequence by learning from the data already available for homologous sequences.
• Homologues show whether your protein sequence has a solved structure (or whether a close homologue has one).
How does BLAST work?
Before we can understand how BLAST works, we first have to understand the principles of sequence alignment.
What Is An Alignment?
• Comparing 2 (pairwise) or more (multiple) sequences.
• Searching for a series of identical or similar characters in the sequences.
VLSPADKTNVKAAWAKVGAHAAGHG
||| |    |   |||| |  ||||
VLSEAEWQLVLHVWAKVEADVAGHG
An alignment is the process of lining up 2 or more sequences to achieve the maximum level of identity, in order to find homologies.
[Figure: T C A T G and C A T T G placed against each other in alternative ways; which one is the right alignment?]
S = ACTG, T = AGT. Three possible alignments:
S’ = AC_TG    S’ = ACTG    S’ = ACTG
T’ = A_GT_    T’ = AGT_    T’ = _AGT
Good: identical characters (match).
Bad: different characters (mismatch) or a gap (InDel).
• Each pair of characters gets a value, depending on its identity.
• The similarity score of the alignment is the sum of the pair values.
Example: Aligning Two Globins
Human Hemoglobin (HH): VLSPADKTNVKAAWGKVGAHAGYEG
Sperm Whale Myoglobin (SWM): VLSEGEWQLVLHVWAKVEADVAGHG
General Alignment Methodology
Example: Aligning Two Globins
(HH) VLSPADKTNVKAAWGKVGAHAGYEG
(SWM) VLSEGEWQLVLHVWAKVEADVAGHG
No gaps:
• Percent identity: 36
• Percent similarity: 40
Example: Aligning Two Globins
(HH)  VLSPADKTNVKAAWGKVGAH-AGYEG
(SWM) VLSEGEWQLVLHVWAKVEADVAGH-G
With gaps:
• Gaps: 2
• Percent identity: 45.833 (instead of 36 without gaps)
• Percent similarity: 54.167 (instead of 40 without gaps)
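The identity figures for the two globin alignments can be reproduced with a short Python sketch. It assumes that percent identity is the number of identical columns divided by the number of columns that contain no gap, which is the convention that yields 45.833 for the gapped alignment; the function name `percent_identity` is hypothetical.

```python
def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Percent of gap-free columns in which both sequences carry the same letter."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    # Assumed convention: columns containing a gap are left out of the count.
    columns = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    identical = sum(1 for x, y in columns if x == y)
    return 100.0 * identical / len(columns)


ungapped = percent_identity("VLSPADKTNVKAAWGKVGAHAGYEG",
                            "VLSEGEWQLVLHVWAKVEADVAGHG")
gapped = percent_identity("VLSPADKTNVKAAWGKVGAH-AGYEG",
                          "VLSEGEWQLVLHVWAKVEADVAGH-G")
print(round(ungapped, 3), round(gapped, 3))  # 36.0 45.833
```

Percent similarity would need a scoring matrix to decide which non-identical pairs count as similar, so it is left out of this sketch.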
Alignment Scoring
1. Assume an independent mutation model (each position is scored independently of the others).
2. Score at each position:
– Positive if the same/similar
– Negative if different or a gap
3. The score of an alignment is the sum of the position scores.
Scoring Matrix
• An n × n matrix: n = 4 for DNA, n = 20 for proteins
• Each matrix entry defines the score for observing the two letters aligned:
– Positive if the pair is likely to be observed (identical letters or a likely substitution)
– Negative otherwise
    A   G   C   T
A   1
G  -5   1
C  -5  -5   1
T  -5  -5  -5   1
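As a minimal sketch of how such a matrix turns an alignment into a score, the Python below sums the column scores using the values of the matrix above (1 on the diagonal, -5 elsewhere). The gap penalty of -5 is an assumption made for illustration, since the matrix itself does not define one; the three candidate alignments are the ACTG/AGT examples shown earlier in the slides.

```python
GAP = "_"
GAP_PENALTY = -5  # assumption: the matrix above does not specify a gap score


def pair_score(x: str, y: str) -> int:
    """Score one alignment column: 1 for a match, -5 for a mismatch or a gap."""
    if x == GAP or y == GAP:
        return GAP_PENALTY
    return 1 if x == y else -5


def alignment_score(a: str, b: str) -> int:
    """Sum the column scores of two aligned sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    return sum(pair_score(x, y) for x, y in zip(a, b))


candidates = [("AC_TG", "A_GT_"), ("ACTG", "AGT_"), ("ACTG", "_AGT")]
scores = [alignment_score(a, b) for a, b in candidates]
print(scores)  # [-13, -8, -20]
```

Under these assumed values the second candidate scores best, which illustrates how the choice of alignment follows from the scoring scheme.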
DNA scoring matrices
• Transitions – purine to purine or pyrimidine to pyrimidine (4 possibilities)
• Transversions – purine to pyrimidine or pyrimidine to purine (8 possibilities)
• By chance alone, transversions should occur twice as often as transitions.
• In practice, transitions are more frequent than transversions.
DNA scoring matrices
From (rows) / To (columns):
    A   G   C   T
A   2
G  -4   2
C  -6  -6   2
T  -6  -6  -4   2
Legend: match = 2, transition = -4, transversion = -6
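The matrix above can be generated from the purine/pyrimidine classes rather than typed in by hand. The sketch below is one way to do that in Python; the three score values are read off the matrix, while the function and constant names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

MATCH, TRANSITION, TRANSVERSION = 2, -4, -6  # values read off the matrix above


def substitution_score(x: str, y: str) -> int:
    """Score a pair of bases as a match, a transition or a transversion."""
    if x == y:
        return MATCH
    # A transition stays inside one chemical class; a transversion switches class.
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return TRANSITION if same_class else TRANSVERSION


bases = "AGCT"
matrix = {(x, y): substitution_score(x, y) for x in bases for y in bases}
transitions = sum(1 for s in matrix.values() if s == TRANSITION)
transversions = sum(1 for s in matrix.values() if s == TRANSVERSION)

print(matrix[("A", "G")], matrix[("C", "T")], matrix[("A", "C")])  # -4 -4 -6
print(transitions, transversions)  # 4 8
```

Counting ordered pairs gives 4 transitions and 8 transversions, matching the "4 possibilities" and "8 possibilities" quoted for the two classes of substitution.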
Protein scoring matrices
• Observation: some substitutions are more frequent than others, e.g., between chemically similar amino acids
• As for DNA, protein matrices define the probabilities of change between the different amino acids
• Popular matrices are based on empirical data: PAM & BLOSUM
PAM Matrices
• PAM matrices are based on sequences with at least 85% identity.
• The changes are “accepted” by natural selection.
• 1 PAM unit: 1 accepted point mutation per 100 residues.
• Multiplying PAM1 by itself gives higher-order PAM matrices that are suitable for larger evolutionary distances.
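The idea of multiplying PAM1 by itself can be sketched with a toy example. The 2 × 2 matrix below is made up purely for illustration (a real PAM1 is a 20 × 20 mutation probability matrix, and the scoring matrices used in practice are log-odds transforms of these probabilities); only the repeated multiplication is the point here.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Multiply m by itself until the n-th power is reached (n >= 1)."""
    result = m
    for _ in range(n - 1):
        result = mat_mul(result, m)
    return result


# Hypothetical two-letter "PAM1": each row gives the probabilities that a letter
# stays the same or changes over one unit of evolutionary distance.
TOY_PAM1 = [[0.99, 0.01],
            [0.02, 0.98]]

toy_pam2 = mat_power(TOY_PAM1, 2)
toy_pam250 = mat_power(TOY_PAM1, 250)
print(toy_pam2[0])
print(toy_pam250[0])
```

Each row still sums to 1 after any number of multiplications, and the off-diagonal probabilities grow with the power, which is why higher PAM numbers suit more distant sequences.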
BLOSUM Matrices
• Based on the BLOCKS database.
• Low BLOSUM numbers for distant sequences, high BLOSUM numbers for similar sequences.
• BLOSUMn is derived from blocks in which sequences sharing at least n percent identity are clustered together. Generally:
BLOSUM62 for general use
BLOSUM80 for close relations
BLOSUM45 for distant relations
Protein scoring matrices
Roughly equivalent matrices, from closer to more distant sequences:
PAM100 = BLOSUM90 (closer sequences)
PAM120 = BLOSUM80
PAM160 = BLOSUM60
PAM200 = BLOSUM52
PAM250 = BLOSUM45 (distant sequences)
How do we calculate gap scores?
- The same substitution scores are applied to gapped and ungapped local alignments.
- Appropriate gap scores have been selected over the years by trial and error; these are the default gap scores.
- If you wish to apply a different scoring matrix, there is no guarantee that the gap scores will remain appropriate!
- A large penalty for opening a gap and a much smaller one for extending it are most effective.
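A large opening penalty combined with a small extension penalty is usually called an affine gap penalty. The sketch below assumes one common convention, in which a gap of length k costs `gap_open + k * gap_extend`; the values 11 and 1 are illustrative assumptions, and both function names are hypothetical.

```python
from itertools import groupby


def affine_gap_penalty(length: int, gap_open: int = 11, gap_extend: int = 1) -> int:
    """Penalty for one gap: an opening cost plus an extension cost per gapped position."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


def total_gap_penalty(aligned: str, gap_char: str = "-",
                      gap_open: int = 11, gap_extend: int = 1) -> int:
    """Sum the affine penalties of every run of gap characters in one aligned sequence."""
    total = 0
    for char, run in groupby(aligned):
        if char == gap_char:
            total += affine_gap_penalty(len(list(run)), gap_open, gap_extend)
    return total


print(total_gap_penalty("AC---GT-A"))  # 26
print(affine_gap_penalty(4), 4 * affine_gap_penalty(1))  # 15 48
```

The second line shows the design intent: one gap of four positions (15) is penalised far less than four separate single-position gaps (48), so the scoring favours few, longer gaps.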
Alignment score
• The final score of the alignment is the sum of the positive scores and the penalty scores:
+ Number of identities (scoring matrix)
+ Number of similarities (scoring matrix)
- Number of gap insertions (gap penalties)
- Number of gap extensions (gap penalties)
BLAST (Basic Local Alignment Search Tool)
• Goal: a fast search for homologues in a huge database
• The underlying hypothesis: when two sequences are similar, there are short ungapped regions of high similarity between them
• The heuristic:
1. Discard irrelevant sequences
2. Perform exact local alignment only with the remaining sequences
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990) “Basic local alignment search tool.” J. Mol. Biol. 215: 403-410.
Searching a sequence database
• Idea: to find sequences homologous to a sequence of interest, compute its pairwise alignment against all known sequences in a database, and detect the best-scoring, significant homologues.
• Query sequence – the sequence with which we are searching
• Hit – a sequence found in the database, suspected to be homologous
The parameters:
W: Word size – find W-mers in the target/query; 2-3 for amino acids, 6-11 for nucleotides.
T: Threshold – focus on word pairs scoring > T; usually 11-13.
X: Drop-off – stop extending when the loss is > X.
S: Score – the final score of the segment pair.
The algorithm:
1. Align a query sequence with the database.
2. Find “hits”: short word pairs of length W with an ungapped alignment score of at least T.
3. Extend the alignments until the score drops more than X below the best score so far. This step consumes most of the processing time (>90%).
How do we discard irrelevant sequences quickly?
• Divide the database into words of length w (default: w = 3 for protein and w = 7 for DNA)
• Save the words in a look-up table that can be searched quickly
Example: WTDFGYPAILKGGTAC
WTD, TDF, DFG, FGY, GYP, …
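The word table can be sketched in a few lines of Python. This is only an illustration of the idea, not how BLAST stores its index: the record identifiers `rec1` and `rec2`, the second sequence, and the function names are all hypothetical.

```python
from collections import defaultdict


def words(seq: str, w: int):
    """All overlapping words of length w, in order of their offset."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def build_lookup(database: dict, w: int):
    """Map every word to the (record id, offset) pairs where it occurs."""
    table = defaultdict(list)
    for record_id, seq in database.items():
        for offset, word in enumerate(words(seq, w)):
            table[word].append((record_id, offset))
    return table


print(words("WTDFGYPAILKGGTAC", 3)[:5])  # ['WTD', 'TDF', 'DFG', 'FGY', 'GYP']

table = build_lookup({"rec1": "WTDFGYPAILKGGTAC", "rec2": "AAFGYPW"}, 3)
print(table["FGY"])  # [('rec1', 3), ('rec2', 2)]
```

Looking up a word is then a single dictionary access, which is what makes it cheap to find every database position that shares a word with the query.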
BLAST: discarding sequences
• When the user enters a query sequence, it is also divided into words.
• Search the database for consecutive neighboring words.
• Neighbor words are defined according to a scoring matrix (e.g., BLOSUM62 for proteins) with a certain cutoff level.
Example: for the word GFB, candidate neighbor words and their scores are GFC (20), GPC (11), WAC (5).
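The neighbor-word idea can be sketched as follows: enumerate every possible word of the same length and keep those whose score against the query word reaches the cutoff. To stay self-contained, the sketch uses a toy scoring scheme (+5 for a match, -1 for a mismatch) instead of BLOSUM62, so its scores do not correspond to the numbers in the example above; the function names and the cutoff of 9 are assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_score(x: str, y: str) -> int:
    """Toy substitution score (NOT BLOSUM62): +5 for a match, -1 for a mismatch."""
    return 5 if x == y else -1


def neighbor_words(word: str, threshold: int) -> dict:
    """All words of the same length whose score against `word` is at least `threshold`."""
    neighbors = {}
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(toy_score(a, b) for a, b in zip(word, candidate))
        if score >= threshold:
            neighbors["".join(candidate)] = score
    return neighbors


neighbors = neighbor_words("GFC", threshold=9)
print(len(neighbors))  # 58
print(neighbors["GFC"], neighbors["GPC"], "WAC" in neighbors)  # 15 9 False
```

With this toy scheme the neighborhood holds the word itself plus every word differing at exactly one position (1 + 3 × 19 = 58); with a real matrix the neighborhood depends on which substitutions score well.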
[Figure: the query plotted against a database record, with neighbor-word hits marked and the distance A between two hits on the same diagonal]
• Look for a seed: hits on the same diagonal which can be connected.
• At least 2 hits on the same diagonal, with a distance smaller than a predetermined cutoff (A).
• This is the filtering stage: many unrelated hits are filtered out, saving lots of time!
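The seed criterion can be sketched in Python. A hit is represented here as a (query offset, database offset) pair, and its diagonal is taken to be the database offset minus the query offset; that representation, the function name, and the simplification of ignoring whether two hits overlap are all assumptions of this sketch.

```python
from collections import defaultdict


def two_hit_seeds(hits, max_distance: int):
    """Pairs of consecutive hits on one diagonal that are closer than max_distance."""
    by_diagonal = defaultdict(list)
    for query_pos, db_pos in hits:
        by_diagonal[db_pos - query_pos].append(query_pos)

    seeds = []
    for diagonal, query_positions in by_diagonal.items():
        query_positions.sort()
        # Compare each hit with the next one along the same diagonal.
        for first, second in zip(query_positions, query_positions[1:]):
            if second - first < max_distance:
                seeds.append((diagonal, first, second))
    return sorted(seeds)


hits = [(2, 10), (9, 17), (40, 48), (5, 30), (20, 21)]
print(two_hit_seeds(hits, max_distance=10))  # [(8, 2, 9)]
```

Only the first two hits qualify: they share diagonal 8 and lie 7 positions apart. The third hit is on the same diagonal but too far away, and the remaining hits are alone on their diagonals, so they are filtered out.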
Try to extend the alignment
• Stop extending when the score of the alignment drops more than X below the maximal score obtained so far
• Discard segments with score < S
ASKIOPLLWLAASFLHNEQAPALSDAN
JWQEOPLWPLAASOIHLFACNSIFYAS
[Figure: successive extensions of the aligned segment, labelled Score=15, Score=17 and Score=14]
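The drop-off rule can be sketched as an ungapped extension to the right of a seed. The sketch uses a toy scoring scheme (+2 for a match, -1 for a mismatch), so it does not reproduce the scores of 15, 17 and 14 shown for this example; a fuller version would also extend to the left. The function names, the start offsets and the drop-off value of 3 are assumptions.

```python
def toy_score(x: str, y: str) -> int:
    """Toy scoring for illustration only: +2 for a match, -1 for a mismatch."""
    return 2 if x == y else -1


def extend_right(query: str, subject: str, q_start: int, s_start: int, x_drop: int):
    """Extend without gaps; stop once the score is more than x_drop below the best so far.

    Returns the best score and the length of the segment that achieved it.
    """
    score = best = best_len = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        score += toy_score(query[q_start + i], subject[s_start + i])
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score > x_drop:
            break
    return best, best_len


query = "ASKIOPLLWLAASFLHNEQAPALSDAN"
subject = "JWQEOPLWPLAASOIHLFACNSIFYAS"
best, length = extend_right(query, subject, q_start=4, s_start=4, x_drop=3)
print(best, length)  # 12 9
print(query[4:4 + length], subject[4:4 + length])  # OPLLWLAAS OPLWPLAAS
```

The extension survives the two mismatches in the middle because the score never falls more than 3 below its running maximum, and it stops only after the long run of mismatches that follows the shared AAS.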
Two-Hit Gapped BLAST
The new gapped BLAST algorithm:
1. Start with the two-hit method:
(a) find two hits with a score higher than T, within a distance A;
(b) invoke an ungapped extension on the second hit.
2. If the HSP (high-scoring segment pair) generated has a sufficiently high score:
(a) trigger a gapped extension;
(b) if the final score has a significant E-value, report the gapped alignment.
The result – local alignment
• The result of BLAST will be a series of local alignments between the query and the different hits found
The scoring system
• BLAST uses BLOSUM62 as the scoring matrix to perform the alignment (default).
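E-value
• To assess the bit score we calculate the E-value:
E-value = the expected number of HSPs with a score of at least S:
$E = K M N e^{-\lambda S}$
(M and N are the lengths of the query and of the database; K and λ are statistical parameters of the scoring system.)
• For each score S there is a specific E-value.
• The smaller the E-value, the better the score.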
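The E-value expression, E = K·M·N·e^(-λS), is easy to evaluate directly. In the sketch below the values of K and λ are placeholders chosen only for illustration (the real values depend on the scoring matrix and gap penalties), and the query and database lengths are made up.

```python
import math


def e_value(score: float, query_len: int, db_len: int, k: float, lam: float) -> float:
    """Expected number of HSPs scoring at least `score`: K * M * N * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * score)


K, LAMBDA = 0.1, 0.3  # placeholder statistical parameters, not taken from BLAST

for s in (20, 40, 60):
    print(s, e_value(s, query_len=250, db_len=1_000_000, k=K, lam=LAMBDA))
```

Two properties are worth noticing: E falls exponentially as the score rises, and it grows linearly with the size of the database, so the same score is less significant in a larger search space.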
E-value
Theoretically, we could trust any result with an E-value ≤ 1.
In practice, BLAST uses estimations:
• E-values of 10^-4 and lower indicate a significant homology.
• E-values between 10^-4 and 10^-2 should be checked (similar domains, maybe non-homologous).
• E-values between 10^-2 and 1 do not indicate a good homology.
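These rules of thumb translate into a small helper for triaging hits. The function name and the wording of the returned labels are hypothetical, the handling of the exact boundary values is a choice made for this sketch, and the last branch (E-values above 1) is an assumption, since the rules above do not cover it.

```python
def interpret_e_value(e: float) -> str:
    """Classify an E-value with the rule-of-thumb cutoffs 1e-4, 1e-2 and 1."""
    if e <= 1e-4:
        return "significant homology"
    if e <= 1e-2:
        return "check manually (similar domains, maybe non-homologous)"
    if e <= 1:
        return "no good indication of homology"
    return "not significant"  # assumption: the rules above stop at E = 1


for e in (1e-30, 5e-3, 0.5, 10):
    print(e, "->", interpret_e_value(e))
```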
PSI-BLAST
Step 1:
1. Run a standard protein-protein BLAST search (BLOSUM62).
2. Build a position-specific scoring matrix (PSSM) according to the multiple sequence alignment (MSA) of the results with low E-values.
Step 2:
1. Run a BLAST search using the PSSM to evaluate the alignment: PSSM vs. DB instead of seq vs. DB.
2. Update the PSSM according to the new results.
3. Go back to the beginning of step 2, or stop.
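The "build a PSSM from the MSA" step can be sketched as counting letter frequencies in every alignment column. This is a deliberately simplified illustration: the toy alignment and the three-letter alphabet are made up, and the refinements a real PSI-BLAST profile applies (sequence weighting, pseudocounts, conversion to log-odds scores) are left out.

```python
from collections import Counter


def column_frequencies(msa, alphabet: str):
    """One dict per alignment column, mapping each letter to its relative frequency."""
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(row[col] for row in msa)
        profile.append({letter: counts[letter] / len(msa) for letter in alphabet})
    return profile


toy_msa = ["ADL", "ADL", "LDL", "ADA"]  # hypothetical aligned hits
profile = column_frequencies(toy_msa, alphabet="ADL")
print(profile[0])  # {'A': 0.75, 'D': 0.0, 'L': 0.25}
print(profile[1])  # {'A': 0.0, 'D': 1.0, 'L': 0.0}
```

The second column is fully conserved, so a later search scored with this profile would strongly prefer D at that position, while the first column tolerates either A or L.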
PSI-BLAST: the difference
• The score for aligning a letter with a pattern position is given by the matrix itself!
• The matrix has the length of the original sequence (L × 20).
• There is no theory for deriving gap costs, so the gap scores are the same as in the first iteration.
    1   2   3   4   5   6   7   8   9
A  .1  .3  .3  .2  .3  .1  .8   0  .3
D  .3  .3  .6  .2  .4  .2  .2  .1  .3
L  .6  .4  .1  .6  .3  .7   0  .9  .4
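To make position-specific scoring concrete, the sketch below loads the three-letter example matrix above and uses it in two ways: to read off the most probable letter at each position, and to compute the probability the matrix assigns to a whole sequence. Treating the entries as independent per-position probabilities that are multiplied together is an assumption of this sketch, and the function names are hypothetical.

```python
import math

# Rows of the example matrix above: probability of each letter at positions 1-9.
PSSM = {
    "A": [0.1, 0.3, 0.3, 0.2, 0.3, 0.1, 0.8, 0.0, 0.3],
    "D": [0.3, 0.3, 0.6, 0.2, 0.4, 0.2, 0.2, 0.1, 0.3],
    "L": [0.6, 0.4, 0.1, 0.6, 0.3, 0.7, 0.0, 0.9, 0.4],
}


def consensus(pssm) -> str:
    """The most probable letter at every position."""
    length = len(next(iter(pssm.values())))
    return "".join(max(pssm, key=lambda letter: pssm[letter][i]) for i in range(length))


def sequence_probability(seq: str, pssm) -> float:
    """Product of the per-position probabilities of the letters in seq."""
    return math.prod(pssm[letter][i] for i, letter in enumerate(seq))


print(consensus(PSSM))  # LLDLDLALL
print(sequence_probability("LLDLDLALL", PSSM))
print(sequence_probability("LLDLDLAAL", PSSM))  # 0.0
```

The last line shows what "each position has its own pattern" means in practice: A is the favoured letter at position 7 but is never observed at position 8, so placing it there drives the probability to zero. Practical profiles add pseudocounts to avoid such hard zeros.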
The power of PSI-BLAST:
1. A much more sensitive scoring system: each position has its own pattern probabilities.
2. Different weight to conserved positions.
3. Important motifs are bounded
4. Lowers the level of random noise.
5. Finding distant relatives.
Let's sum up…
- BLAST is a fast way to find homologues.
- There is no analytic theory that estimates the statistical significance of gapped alignments.
- Gap scores have been selected by trial and error; when a different scoring matrix is applied, there is no guarantee that the gap scores remain appropriate.
- PSI-BLAST finds weak homologues fast.