Searching Biological Sequence Databases
Using BLAST to Search Sequence Databases
Cédric Notredame
Outline
-Evolution and Sequence Similarity
-The inside of BLAST
-Using BLAST
-Adapting BLAST to your needs
-Searching Protein Domains with BLAST
-Digging Genomes
Two Minutes of the Evolutionary Clock…
An Alignment is a STORY
Ancestral sequence:
ADKPKRPLSAYMLWLN
Two copies of the ancestor:
ADKPKRPLSAYMLWLN    ADKPKRPLSAYMLWLN
Mutations + Selection
Sequences observed today:
ADKPKRPKPRLSAYMLWLN    ADKPRRPLS-YMLWLN
The alignment tells the story (Mutation, Insertion, Deletion):
ADKPRRP---LS-YMLWLN
ADKPKRPKPRLSAYMLWLN
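The story told by the alignment can be read column by column. The sketch below is only an illustration: the function name `describe_columns` and the three category labels are assumptions, not part of the slides. It takes the two aligned sequences shown above and counts identical columns, substitutions (point mutations) and gap columns (insertions or deletions).

```python
def describe_columns(top: str, bottom: str) -> dict:
    """Count what happened in each column of a pairwise alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    counts = {"identical": 0, "substituted": 0, "gap": 0}
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            counts["gap"] += 1          # an insertion or a deletion
        elif a == b:
            counts["identical"] += 1    # residue conserved in both sequences
        else:
            counts["substituted"] += 1  # a point mutation
    return counts


counts = describe_columns("ADKPRRP---LS-YMLWLN", "ADKPKRPKPRLSAYMLWLN")
print(counts)
```

A gap column alone cannot say whether one sequence gained residues or the other lost them; that is why the slide labels both Insertion and Deletion.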
How Do Sequences Evolve ?
In a structure, each Amino Acid plays a Special Role
OmpR, C-terminal Domain
In the core, SIZE MATTERS
On the surface, CHARGE MATTERS
Why Does It Make Sense To Align Sequences ?
Same Sequence
Same Function
Same 3D Fold
Same Origin
How Can We Compare Sequences ? The Twilight Zone
[Figure: % sequence identity (up to 100) plotted against alignment length. Above roughly 30% identity: similar sequence, similar structure (same 3D fold). Below roughly 30%: the Twilight Zone, different sequence, structure ????]
Different molecular clocks for different proteins--another prediction
A few Basic Definitions
A few Definitions
Query : Your sequence
Subject: A sequence from the database against which you search
Heuristic: Algorithm that does not guarantee the optimal solution
Other Important Definitions
Identity: Proportion of IDENTICAL residues between two sequences. Depends on the alignment. Unit: the % identity
Homology: Sequences SIMILAR enough are sometimes HOMOLOGOUS. HOMOLOGY = COMMON ANCESTOR. Unit: Yes or No! DIFFERENT sequences can also be homologous
Similarity: Proportion of SIMILAR residues. Two residues are similar if their substitution score is higher than 0. Depends on the matrix. Unit: the % similarity
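A minimal sketch of the two percentages, under stated assumptions: `POSITIVE_PAIRS` lists only a handful of non-identical pairs that score above 0 in BLOSUM62 (a real search uses the full 20 x 20 matrix), identical residues are counted as similar because every diagonal score is positive, and the percentages are taken over gap-free columns, which is one convention among several.

```python
# Partial list of non-identical residue pairs with a positive BLOSUM62 score.
# Assumption: only a few pairs are listed; a real matrix has 20 x 20 entries.
POSITIVE_PAIRS = {
    frozenset("KR"): 2,
    frozenset("DE"): 2,
    frozenset("IL"): 2,
    frozenset("ST"): 1,
    frozenset("FY"): 3,
}


def percent_identity_and_similarity(a: str, b: str):
    """Return (% identity, % similarity) over the gap-free columns."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    identical = sum(x == y for x, y in columns)
    similar = sum(x == y or frozenset((x, y)) in POSITIVE_PAIRS
                  for x, y in columns)
    n = len(columns)
    return 100.0 * identical / n, 100.0 * similar / n


identity, similarity = percent_identity_and_similarity(
    "ADKPRRP---LS-YMLWLN", "ADKPKRPKPRLSAYMLWLN")
print(f"{identity:.1f}% identity, {similarity:.1f}% similarity")
```

Changing the alignment or the matrix changes both numbers, which is the dependence the definitions point out.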
More Important Definitions
Hit: A sequence that matches your sequence and is reported by BLAST.
E-Value: Expectation value. How many times would you expect to find such a hit by chance only?
Depends on the alignment. Depends on the matrix. Depends on the database. Sensitive to low complexity regions.
Rule of thumb: must be lower than 0.0001 to mean something
A Good Hit Is Something You Would Not Expect by Chance
What is BLAST ?
BLAST
Basic Local Alignment Search Tool
BLAST is a program designed for RAPIDLY comparing your sequence with every sequence in a database and REPORTING the most SIMILAR sequences
Database Search
1-Query
2-Comparison Engine: LOCAL Alignment
3-Database
4-Statistical Evaluation (E-Value)
PROBLEM: LOCAL ALIGNMENT (SW) TOO SLOW
Database Search
[Figure: a query (Q) searched against the database with SW and with BLAST; the hits are listed with their scores and E-values]
BLAST
BLAST is a Heuristic Smith and Waterman
Basic Local Alignment Search Tool
BLAST = 3 STEPS
1-Decide who will be compared
This is where Blast SAVES TIME
This is where it LOSES HITS
Most BLAST parameters refer to this step
2-Check the most promising Hits
3-Compute the E-value of the most interesting Hits
BLAST: A Bit of History
Heuristic Algorithms
Smith and Waterman
• Exact Local Dynamic Programming, 1981
FASTA
• Lipman and Pearson, 1985
• Looks for similar words (k-tup) on the same diagonal.
• Compares the sequences one by one…
BLAST
• Altschul et al., 1990
• The most widely cited tool in Biology
• www.ncbi.nlm.nih.gov/Education/BLASTinfo/tut1.html
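For reference, the exact local dynamic programming that FASTA and BLAST approximate can be sketched in a few lines. This is a simplified illustration, not the slides' code: the match, mismatch and linear gap values are assumptions, and only the best local score is returned, not the alignment itself.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (Smith-Waterman)."""
    prev = [0] * (len(b) + 1)   # previous row of the DP matrix
    best = 0
    for x in a:
        curr = [0]              # first column is always 0
        for j, y in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if x == y else mismatch)
            # A local alignment may restart anywhere, hence the 0.
            score = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


print(smith_waterman_score("XXACGTXX", "YYACGTYY"))
```

Every cell of the matrix is filled, so the cost grows with the product of the two lengths. That is the price BLAST avoids by only looking where words match.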
The Inside of BLAST
Inside BLAST
Step 1: finding the worthy words
Query word: REL
List of all the 3AA words that can be found in the database: AAA, AAC, AAD, ..., YYY
Words with a score > T: ACT, RSL, TVF, ...
Words with a score < T: LKP, ...
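Step 1 can be sketched as follows. The scoring table is a toy: it holds a few BLOSUM62-like values for the residues of the query word REL, every other pair is given -1, and the threshold T = 8 is chosen only so that RSL passes and LKP fails as on the slide. All names (`SCORES`, `neighborhood`) are assumptions made for the example.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Toy substitution scores (assumption): unlisted pairs score -1.
SCORES = {
    ("R", "R"): 5, ("E", "E"): 5, ("L", "L"): 4,
    ("R", "K"): 2, ("E", "D"): 2, ("E", "Q"): 2, ("E", "S"): 0,
    ("L", "I"): 2, ("L", "M"): 2,
}


def pair_score(a: str, b: str) -> int:
    return SCORES.get((a, b), SCORES.get((b, a), -1))


def word_score(w1: str, w2: str) -> int:
    return sum(pair_score(a, b) for a, b in zip(w1, w2))


def neighborhood(query_word: str, threshold: int) -> list:
    """All words of the same length scoring > threshold against query_word."""
    candidates = ("".join(w)
                  for w in product(AMINO_ACIDS, repeat=len(query_word)))
    return sorted(w for w in candidates
                  if word_score(query_word, w) > threshold)


worthy = neighborhood("REL", threshold=8)
print(len(worthy), "worthy words, including RSL:", "RSL" in worthy)
```

Raising T shrinks the list, which makes the search faster and less sensitive; lowering it does the opposite.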
Inside BLAST
List of « interesting » words (score > T): ACT, RSL, TVF, ...
Step 2: Eliminate the database sequences that do not contain any interesting word
Look for « interesting » words in the sequences within the database
Keep the sequences containing interesting words (Hits)
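A naive sketch of Step 2, assuming a database small enough to scan with plain string search. The sequence names and contents are invented for the example, and real implementations index the words instead of scanning once per word.

```python
def find_word_hits(words, database):
    """Return {name: [(word, position), ...]} for sequences with >= 1 word."""
    hits = {}
    for name, sequence in database.items():
        found = []
        for word in words:
            start = sequence.find(word)
            while start != -1:                    # record every occurrence
                found.append((word, start))
                start = sequence.find(word, start + 1)
        if found:                                 # sequences without hits are dropped
            hits[name] = sorted(found, key=lambda hit: hit[1])
    return hits


# Hypothetical database: seq2 contains none of the interesting words.
database = {"seq1": "MKACTGGRSL", "seq2": "GGGGPPPP", "seq3": "TVFAARSLK"}
hits = find_word_hits(["ACT", "RSL", "TVF"], database)
print(hits)
```

This is where time is saved and where hits are lost: a true homologue that shares no word above T with the query is never examined again.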
Inside BLAST: the end
Step 3: Extension of the Hits
[Figure: dot plot of the Query against a Database sequence]
• 2 "Hits" on the same diagonal distant by less than X
• Extension by limited Dynamic Programming
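The two-hit condition can be sketched as below. A hit is assumed to be a pair (query position, database position); two hits lie on the same diagonal when the difference between those positions is the same. The function name and the sample hits are hypothetical, and the extension itself is not shown.

```python
from collections import defaultdict


def two_hit_seeds(hits, max_distance):
    """Pairs of neighbouring hits on one diagonal, less than max_distance apart.

    hits: iterable of (query_position, subject_position).
    Returns a sorted list of (diagonal, first_query_pos, second_query_pos).
    """
    by_diagonal = defaultdict(list)
    for query_pos, subject_pos in hits:
        by_diagonal[subject_pos - query_pos].append(query_pos)
    seeds = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        for first, second in zip(positions, positions[1:]):
            if second - first < max_distance:
                seeds.append((diagonal, first, second))
    return sorted(seeds)


# Hypothetical hits: three on diagonal 10, one isolated on diagonal 37.
hits = [(0, 10), (5, 15), (3, 40), (30, 40)]
print(two_hit_seeds(hits, max_distance=10))
```

Only the seeds returned here would go on to the limited dynamic programming; isolated hits are ignored.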
The Statistics in BLAST
BLAST Statistics: Raw Score
Evaluation of the score
• Raw Score: Sum of the substitution scores and gap penalties. Not very informative.
BLAST Statistics: P-Values
Derived Statistics
• p-value: Probability of finding an alignment with such a score, by chance. The lower, the better.
Just as the sum of a large number of independent identically distributed (i.i.d.) random variables tends to a normal distribution, the maximum of a large number of i.i.d. random variables tends to an extreme value distribution.
[Figure: normal distribution compared with the extreme value distribution (Gumbel)]
P-Value: Probability that a random alignment obtains a score greater than or equal to X
K must be calibrated with the database composition
Lambda is calibrated with the matrix being used
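The formula shown on the slide did not survive in this transcript. For orientation, the standard Karlin-Altschul form is given below; the symbols m (query length) and n (database length) are assumptions about notation, while K, lambda and X are the quantities named above.

```latex
% E: number of chance alignments expected to score at least X
E = K \, m \, n \, e^{-\lambda X}
% P: probability of seeing at least one such alignment by chance
P(S \ge X) = 1 - e^{-E} = 1 - \exp\!\left(-K \, m \, n \, e^{-\lambda X}\right)
```

The double exponential is the tail of the extreme value (Gumbel) distribution mentioned above.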
BLAST Statistics: E-Values
Derived Statistics
• E-value: Number of alignments expected by chance
The lower, the better: <0.00001
For values lower than 0.0001, E-Value ~ P-Value
The E-Values are easier to compare than P-Values
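A small numerical check of the two statements above, assuming the usual relation P = 1 - exp(-E) between the two statistics (the helper name `p_from_e` is hypothetical).

```python
import math


def p_from_e(e_value: float) -> float:
    """P-value corresponding to an E-value, assuming P = 1 - exp(-E)."""
    return -math.expm1(-e_value)   # expm1 keeps precision for tiny E


for e in (10.0, 1.0, 0.1, 1e-4):
    print(f"E = {e:<8g} P = {p_from_e(e):.6g}")
```

For small values the two numbers are practically equal. For poor hits P saturates near 1 while E keeps growing, which is one reason E-values are easier to compare.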
BLAST Statistics: Bit-Score
• Bit Score: Evaluates the amount of information in the alignment. Makes it possible to compare alignments.
BLAST Statistics: Booby Trap!
The E-Value depends on N, the database size.
If N increases, some Hits can be lost
P31383 Vs YEAST
P31383 Vs UniProt
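The effect of the database size can be sketched with the bit score, assuming the standard normalisation S' = (lambda * S - ln K) / ln 2 and E = m * n * 2^(-S'). The query length, the bit score of 45 and the two database sizes are made-up values chosen for illustration; they are not the actual sizes of YEAST or UniProt.

```python
import math


def bit_score(raw_score: float, K: float, lam: float) -> float:
    """Normalised score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def evalue_from_bits(bits: float, query_length: float,
                     database_length: float) -> float:
    """E = m * n * 2^(-S'), with m and n the query and database lengths."""
    return query_length * database_length * 2.0 ** (-bits)


QUERY_LENGTH = 300  # hypothetical query length, in residues
for label, size in (("small database", 3e6), ("large database", 3e10)):
    print(f"{label}: E = {evalue_from_bits(45, QUERY_LENGTH, size):.3g}")
```

The alignment and its bit score are identical in both searches; only N changes, and the same hit moves from below 0.0001 to well above it.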
The Many Flavors of BLAST
Database Against Database:« Farm-Blast »
Ideal for finding Orthologues
Genome 1
Genome 2
The Classics: 1 Sequence Vs A Sequence Db
Program   Query                    Database
blastp    protein                  protein
blastn    nucleotide               nucleotide
tblastn   protein                  nucleotide -> protein
blastx    nucleotide -> protein    protein
tblastx   nucleotide -> protein    nucleotide -> protein
(nucleotide -> protein: the nucleotide sequence is translated into protein before the comparison)
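The table of classic programs can be turned into a small lookup. This is only a convenience sketch; the function name `choose_program` is an assumption and nothing here calls BLAST itself.

```python
# (query type, database type) -> classic BLAST programs that accept the pair
PROGRAMS = {
    ("protein", "protein"): ["blastp"],
    ("nucleotide", "nucleotide"): ["blastn", "tblastx"],
    ("protein", "nucleotide"): ["tblastn"],   # database is translated
    ("nucleotide", "protein"): ["blastx"],    # query is translated
}


def choose_program(query_type: str, database_type: str) -> list:
    """Return the classic programs usable for this query/database pair."""
    try:
        return PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError("types must be 'protein' or 'nucleotide'") from None


print(choose_program("protein", "nucleotide"))
```

For a nucleotide query against a nucleotide database there are two candidates: blastn compares the DNA directly, tblastx compares the translations of both.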
The Many Flavors of BLAST
Program      Query     Database
Psi-blast    protein   protein
RPS-blast    protein   Domain
DART-blast   protein   protein
mega-blast   DNA       Large DNA
If your Sequence is a Protein
If your Sequence is made of DNA
BLASTing with DNA: Asking the right question.
Keeping an Eye on the Public Servers.
Using BLAST: The Basic Way
Database Search
Database Search Result = Prediction
Protein X IS or IS NOT homologous to the QUERY.
Submitting your Query
Understanding the BLAST Output
Graphic Display
Hit List
Alignments
Understanding the Graphic Display
Understanding the Hit List
Understanding the Alignments
Low Complexity
Low Complexity Regions
Regions with a single residue repeated many times (like the AFGP) can produce meaningless alignments.
The statistics expect ALL the regions to look the same « on average ».
By default, BLAST replaces these regions with Xs
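The idea of masking can be sketched with a sliding window and Shannon entropy. This is a stand-in written for illustration, not the filter BLAST actually runs; the window length and the entropy cutoff are arbitrary assumptions.

```python
import math
from collections import Counter


def shannon_entropy(segment: str) -> float:
    """Entropy of the residue composition, in bits."""
    total = len(segment)
    return -sum(count / total * math.log2(count / total)
                for count in Counter(segment).values())


def mask_low_complexity(sequence: str, window: int = 12,
                        min_entropy: float = 2.2) -> str:
    """Replace every window whose entropy is below min_entropy with Xs."""
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


print(mask_low_complexity("AAT" * 6))               # repetitive: masked
print(mask_low_complexity("ACDEFGHIKLMNPQRSTVWY"))  # diverse: unchanged
```

A masked region can no longer seed a word hit, so it cannot inflate the score of an unrelated sequence that merely shares the same repeat.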
Reproducing The Experiment
Everything you need to know to reproduce your search is at the bottom.
BLAST searches are notoriously difficult to reproduce
Database Searches: A few Guidelines
DataBase Search According to Pearson
Using Weak Matches To Identify Domains
RNA Recognition Motif
Three Short-Sighted Witnesses are more Informative than a single eagle-eyed witness
Using BLAST: Troubleshooting
[Figure: hits matching Domain 1 and hits matching Domain 2 of the query, with No Overlap between them]
Advanced Blast on the EMBnet
www.ch.embnet.org/software/aBLAST.html
• More choice on the databases
• Change all the parameters
Adapting BLAST To your Problem
Domain-Flavored BLAST
Psi-BLAST
BLAST latest Flavor
PSI-BLAST
-Position Specific Iterated Version of BLAST.
-Uses Profiles.
-More Sensitive.
Psi-BLAST Iteration
[Figure, shown in three steps: aligned hits with conserved C positions, one of them sometimes replaced by S]
BLAST PSSM or weight matrix
     M   Y   C   E   Q   U   E   N   C   E   S
..
A    0   2  -1   0   0   0   0  -1   0  -1   3
S   -1  -1  -1   0  -1   0   0   0   5  -1  -1
C   -1  -1  10   1  -1   0   0   5   5   4  -1
..
Y   -1   6  -1  -1  -1   0  -1  -1  -1  -1  -1
V   -1   1  -1  -1  -1   0  -1  -1  -1   1  -1
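Scoring with a PSSM can be sketched with the five rows visible in the matrix above (A, S, C, Y, V). The other residues are not in the transcript and are left out rather than guessed, so the sketch rejects them; the function name `pssm_score` is hypothetical.

```python
# Rows copied from the matrix above: one score per position (11 columns).
PSSM = {
    "A": [0, 2, -1, 0, 0, 0, 0, -1, 0, -1, 3],
    "S": [-1, -1, -1, 0, -1, 0, 0, 0, 5, -1, -1],
    "C": [-1, -1, 10, 1, -1, 0, 0, 5, 5, 4, -1],
    "Y": [-1, 6, -1, -1, -1, 0, -1, -1, -1, -1, -1],
    "V": [-1, 1, -1, -1, -1, 0, -1, -1, -1, 1, -1],
}
PROFILE_LENGTH = 11


def pssm_score(segment: str) -> int:
    """Sum the position-specific scores of an 11-residue segment."""
    if len(segment) != PROFILE_LENGTH:
        raise ValueError("segment must have exactly 11 residues")
    unknown = set(segment) - set(PSSM)
    if unknown:
        raise ValueError(f"no row available for: {sorted(unknown)}")
    return sum(PSSM[residue][position]
               for position, residue in enumerate(segment))


print(pssm_score("AYCAAAACCCA"))
```

Unlike a substitution matrix, the same residue is worth a different amount at each position: a C scores 10 in the third column and -1 in the first.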
Asking a Question With Psi-BLAST
Is the Leghemoglobin related to the Human Hemoglobin ?
Which Domain Organisation For Your Protein: (Reverse PSI-BLAST)
Asking a Question With RPS-BLAST
PSI-BLAST: Discovering Domains
RPS-BLAST: Which KNOWN Domain in my protein ?
Domain Database
Sequence
Asking a Question With RPS-BLAST
False Hits caused by the domain low complexity (see E-values)
RPS-BLAST: Filtering Or Not Filtering Low Complexity
How Many Proteins Have the same Domain Structure as Mine ? (CDART)
Asking a Question With CDART
CDART:
Conserved Domain Architecture Retrieval Tool
Finds the proteins that contain the same domains as your protein.
Asking a Question With CDART
PSI-BLAST: Discovering Domains
RPS-BLAST: Which known Domain in my protein ?
CDART:
-Which domains are COMMONLY ASSOCIATED with the domain I am interested in ?
-Which proteins have the SAME DOMAIN ORGANIZATION as my protein ?
Filtering:
-By Domain
-By Species
-I want to find all the insect proteins containing a Jun/Fos organisation.
Asking a Question With CDART
-I want to see all the insect proteins containing a Jun/Fos organisation.
Genome-Flavored BLAST
MegaBLAST = Longer Words
Standard blastn with a long word size
Faster BUT Less sensitive
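The trade-off can be sketched by counting exact word matches between two DNA sequences for two word sizes. The sizes 11 and 28 are used here as commonly quoted blastn and MegaBLAST defaults, which is an assumption since the slides give no numbers; the 60-base sequence and the positions of the two substitutions are invented.

```python
def shared_words(a: str, b: str, word_size: int) -> int:
    """Number of positions in b where a word of a starts an exact match."""
    words_a = {a[i:i + word_size] for i in range(len(a) - word_size + 1)}
    return sum(1 for j in range(len(b) - word_size + 1)
               if b[j:j + word_size] in words_a)


def substitute(sequence: str, position: int) -> str:
    """Change one base (to A, or to C when the base already is A)."""
    new_base = "C" if sequence[position] == "A" else "A"
    return sequence[:position] + new_base + sequence[position + 1:]


# Hypothetical 60-base sequence and a copy carrying two substitutions.
original = "ATGGCGTACGTTAGCCTAGGATCCGATTACGGCTAAGCTTGCAGTCCATGGTACCGATAG"
mutated = substitute(substitute(original, 20), 40)

for word_size in (11, 28):
    print(word_size, shared_words(original, mutated, word_size))
```

With substitutions every 20 bases, no stretch of 28 identical bases is left to act as a seed, while plenty of 11-base words survive. Long words are fast and well suited to near-identical sequences, and they miss the more divergent ones.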
The NcBi BlAsT GEnoMe SecTion is MesSy
Makes it possible to select predicted proteomes
Venter-BLAST
When it comes to BLASTing Eukaryotic Genomes:
WWW.ENSEMBL.ORG
Asking a Question With ENSEMBL-BLAST
ENSEMBL:
WHERE are the genes coding for homologues of my protein located ?
CONCLUSION
Searching Databases
-BLAST is a fast approximation of the full local dynamic programming. It is convenient for scanning databases.
-BLAST computes the statistical significance of the alignments (E-Value, P-Value).
-The main pitfalls to avoid are low complexity regions
Searching Databases
-USE Psi-Blast to find remote homologues
-USE blastp, the best educated blast, to discover the function of your protein
-USE RPS-Blast to find domains in your protein (Interpro for EBI)
-USE ENSEMBL-Blast for the human Genome
A few Extra Resources
Tuning BLAST