8/13/2019 Identification of Proteins Through Mass Spectrometry Databases
-
Proteomics and Mass Spectrometry
Proteome: the complete set of proteins in a cell
Current methodologies: 2D gel electrophoresis, protein microarrays, fluorescence microscopy, mass spectrometry, chromatography, nuclear magnetic resonance, microfluidics, microchips
Mass spectrometry is an important technique in molecular and cell biology
Recent advances automate the mass spectrometry workflow: excision of protein spots, enzymatic digestion, acquisition of mass spectra, and database searching
Techniques for modified proteins and for quantification have been developed
-
Servers available for Protein Identification through Mass Spectrometry
For PMF: ASCQ_ME, Bupid, Mascot, MassSearch, MS-Fit (Protein Prospector), PepMAPPER, ProFound (Prowl), Mowse, PeptideSearch
For Sequence Query: Mascot, MS-Seq (Protein Prospector), TagIdent
For MS/MS Ion Search: Inspect, Mascot, MS-Seq (Protein Prospector), Omssa, PepFrag (Prowl), PepProbe, RAId_DbS, Sonar (Knexus), X!Tandem (The GPM)
-
Mascot
Software search engine
Uses mass spectrometry data
Mascot is unique
Widely used
Freely available online from Matrix Science
A license is required for in-house use
-
Mascot Server
Gives excellent results with peak lists from instruments manufactured by:
Agilent, Bruker, Thermo Scientific, Waters, AB Sciex, Shimadzu
Reasons for in-house use:
Data sets that exceed the 1200 spectrum limit
Confidentiality
Automation
Adding and editing modifications, enzymes, quantitation methods, etc.
The time taken by a search depends on the number of processors.
-
Three proven ways of using mass spectrometry data
Peptide mass fingerprint
Uses the molecular masses of the peptides resulting from digestion of a protein by a specific enzyme.
Sequence query
Mass values combined with amino acid sequence or composition data.
MS/MS Ions Search
Uninterpreted MS/MS data from a single peptide or from a complete LC-MS/MS run.
-
Peptide Mass Fingerprint
-
Peptide Mass Fingerprint
Peak picking
Find a utility to convert the raw data into a peak list
Mass values matter most
Get as many peptide masses as possible in the range 1000 to 3500 Da
To perform a search
Paste your peak list or upload it as a file
Enter values for the search parameters
After submission, you receive the results: a list of matching proteins.
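The idea behind a peptide mass fingerprint can be sketched in a few lines. This is a minimal illustration, not Mascot's algorithm: it assumes trypsin (cleavage after K or R, but not before P), no missed cleavages, no modifications, singly protonated monoisotopic masses, and a simple count of matched peaks instead of a probability based score. The function names and the example protein are hypothetical.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # H2O added to the sum of residues
PROTON = 1.00728   # charge carrier for [M+H]+


def tryptic_digest(protein):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        is_last = i == len(protein) - 1
        if aa in "KR" and not is_last and protein[i + 1] != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def mh_plus(peptide):
    """Monoisotopic [M+H]+ value of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


def count_matches(peaks, protein, tol_da=0.2):
    """Number of experimental peaks within tol_da of any calculated peptide mass."""
    calculated = [mh_plus(p) for p in tryptic_digest(protein)]
    return sum(any(abs(peak - m) <= tol_da for m in calculated) for peak in peaks)


protein = "MAGKTEPTIDERLVK"          # hypothetical sequence
peaks = [mh_plus("MAGK") + 0.05, mh_plus("LVK") - 0.05, 2500.0]
print(tryptic_digest(protein), count_matches(peaks, protein))
```

A real search repeats this comparison for every database entry and turns the number of matches into a score that accounts for chance.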
-
Peptide Mass Fingerprinting
Fast, simple analysis
High sensitivity
Needs a database of protein sequences, not ESTs
The sequence must be present in the database
Not good for mixtures
Start with Swiss-Prot
A protein hit is significant if its expect value is below 0.05
-
MS/MS Ions Search
-
MS/MS Ions Search
Works for a single protein or a complex mixture
Use chromatography to regulate the flow of peptides into the mass spectrometer.
Select peptides one at a time using the first stage of mass analysis. Each isolated peptide is then induced to fragment. The second stage of mass analysis is used to collect an MS/MS spectrum.
Software determines which peptide sequence in the database gives the best match.
The degree of matching is scored.
-
Fragment ion structures
Peptide molecular ions fragment at preferred locations along the backbone.
The major peaks are usually b and y ions.
The fragmentation pattern depends on the ionization technique, the mass analyser, and the peptide structure.
If peptides fragmented cleanly, we wouldn't need a database search: there would be a ladder of peaks for each ion series.
Fragmentation is rarely perfect.
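The "ladder of peaks" can be made concrete by calculating the ideal b and y ion series for a peptide. This is a minimal sketch that assumes singly charged fragments, monoisotopic masses, and no modifications; the function name fragment_ions is hypothetical.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    b_i holds the first i residues (N-terminal fragment);
    y_i holds the last i residues plus water (C-terminal fragment).
    """
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append(sum(RESIDUE_MASS[aa] for aa in peptide[:i]) + PROTON)
        y_ions.append(sum(RESIDUE_MASS[aa] for aa in peptide[-i:]) + WATER + PROTON)
    return b_ions, y_ions


b_ions, y_ions = fragment_ions("PEPTIDE")
for i, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{i} = {b:.4f}   y{i} = {y:.4f}")
```

Neighbouring peaks in one series differ by exactly one residue mass, which is why a complete ladder could be read as a sequence. Real spectra have gaps and extra peaks, so a database search is still needed.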
-
MS/MS ion search
The results are complicated to report
The report lists a series of proteins and the peptide matches that have been assigned to them.
The report uses a pop-up window to show the alternative peptide matches.
The top match has a high score.
-
MS/MS ion search
Easily automated
Searches can be slow with:
No enzyme specificity
Several variable modifications
A large dataset
A large database
MS/MS is peptide identification; proteins are inferred from the peptide matches.
-
Sequence Query
-
Sequence tag search
Even if the quality of the spectrum is poor, it is often possible to pick out a minimum of four clean peaks.
A few residues of amino acid sequence are interpreted from them.
Mann and Wilm realized that this very short stretch of amino acid sequence might provide sufficient specificity for identification if it was combined with the fragment ion mass values that enclose it, the peptide mass, and the enzyme specificity.
Picking out a good tag requires both luck and experience.
Requires interpretation of the spectrum
Usually manual, hence not high throughput
The tag has to be called correctly
-
Peptide Sequence tag
The standard sequence tag is obsolete.
It is easier to skip the interpretation step and pass the peak list to the search engine.
Rapid search times
Error tolerant
-
Search parameters
Name, Email and Search Title
The name and email are saved as a browser cookie. If Mascot security is enabled, this information is taken from the user database.
The email address is used for sending results.
-
Databases
Swiss-Prot (~500000 entries)
Best annotated database, ideal for PMF
NCBI nr and UniRef100 (~19000000 entries)
Large, comprehensive, best choice for MS/MS
EST databases (>400000000 entries in translation)
Huge, not advisable for PMF
Single genome databases
Not suitable for PMF
cRAP and Contaminants
-
Database
Choose the right database
In Mascot 2.3 and later, you can select multiple databases.
You cannot mix AA and DNA databases.
Comprehensive database repositories such as NCBI and EBI let you download nr, GenBank, Swiss-Prot, EMBL, TrEMBL, etc.
When searching a single organism, always include a database of common contaminants.
If you are interested in a bacterium or plant, try comprehensive protein databases, e.g. NCBInr and UniRef100.
-
Nucleic Acid Databases
Mascot always performs a 6 frame translation.
It translates the entire sequence and does not look for a start codon to begin.
When a stop codon is encountered, it leaves a gap.
It uses the correct genetic code, as long as the taxonomy is known.
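A six frame translation of this kind can be sketched as follows. This is an illustration only, not Mascot's code: it assumes the standard genetic code for every sequence (whereas the slide says the code is chosen from the taxonomy), marks stop codons with "*", and treats each stop as a gap by splitting the translation into segments. All function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate from the first base onwards; no search for a start codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame(dna):
    """Three forward frames followed by three reverse-complement frames."""
    dna = dna.upper()
    return [translate(strand[offset:])
            for strand in (dna, reverse_complement(dna))
            for offset in range(3)]


def segments(translation):
    """Split a translation at stop codons, leaving a gap at each stop."""
    return [s for s in translation.split("*") if s]


print(six_frame("ATGGCC"))
print(segments(translate("ATGTAAGGC")))
```

Because translation starts at the first base of each frame, peptides from partial or unannotated reading frames can still be matched.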
-
Taxonomy
Speeds up the search
Simpler report
Keep indexes up to date
Check the stats file for each database.
If the correct protein from the correct species is not in the database, don't specify a very narrow taxonomy.
-
Enzyme
First choice: set allowed missed cleavage sites to zero
Choose a setting of 1 or 2 when you're not sure about your sample
A higher number increases the number of calculated peptide masses.
Use no enzyme only in exceptional cases, and never for PMF
The enzyme list is user configurable.
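The effect of the missed cleavage setting on the number of calculated peptides can be illustrated with a small sketch. It assumes trypsin (cleavage after K or R, but not before P); the function names and the example sequence are hypothetical.

```python
def cleave(protein):
    """Fully cleaved tryptic fragments: after K or R, but not before P."""
    fragments, start = [], 0
    for i, aa in enumerate(protein):
        is_last = i == len(protein) - 1
        if aa in "KR" and not is_last and protein[i + 1] != "P":
            fragments.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        fragments.append(protein[start:])
    return fragments


def peptides_with_missed(protein, max_missed):
    """All peptides spanning up to max_missed uncleaved sites."""
    fragments = cleave(protein)
    peptides = []
    for i in range(len(fragments)):
        for missed in range(max_missed + 1):
            if i + missed < len(fragments):
                peptides.append("".join(fragments[i:i + missed + 1]))
    return peptides


protein = "MAGKTEPTIDERLVK"   # hypothetical sequence with two cleavage sites
for max_missed in (0, 1, 2):
    print(max_missed, len(peptides_with_missed(protein, max_missed)))
```

Every extra allowed missed cleavage adds candidate peptides for each database entry, which is why a higher setting increases the chance of random matches.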
-
Modifications
Fixed modifications
Variable, post-translational modifications
Display all modifications
Keep the number of variable modifications small
Some modifications are worse than others
Mods that affect a terminus are less of a problem, e.g. Pyro-glu
Mods that apply to residues with a high fractional abundance and at any position are a big problem, e.g. Phospho (ST)
-
Modifications
Post-translational
Phosphorylation, acetylation
Artifacts
Oxidation, acetylation
Derivatization
Alkylation of cysteine
Sequence variants
Errors, SNPs, other variants
Take the complete list from Unimod.
The cysteine modification depends on the alkylation agent: iodoacetamide (carbamidomethyl), iodoacetic acid (carboxymethyl), or MMTS (methylthio).
-
Phosphorylation
Site heterogeneity
Poor ionization efficiency
3 fragmentation channels
Intact fragments
Neutral loss of HPO3 (80 Da)
Neutral loss of H3PO4 (98 Da)
Can occur at S, T, or Y (~16% of residues)
-
Protein mass
Mass of the intact protein in kDa.
If this field is left blank, there is no restriction on protein mass
Slow down the search a little.
-
Tolerance
Peptide tolerance: error window on experimental peptide mass values
MS/MS tolerance: error window on fragment ion mass values
Units: percentage, milli-mass units, parts per million, or Daltons.
The protein/peptide view includes a graph of the mass errors for fragment ions.
Specifying too tight a peptide tolerance is a common reason for failing to get a match.
A more appropriate tolerance in MS/MS would be +/- 0.3
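The four tolerance units can be compared by converting each to an absolute window in Daltons. This is a minimal sketch; the unit labels "Da", "mmu", "ppm" and "%" and both function names are assumptions made for illustration, not identifiers taken from Mascot.

```python
def tolerance_in_da(mass, value, unit):
    """Convert a tolerance to an absolute window in Da for a given mass."""
    if unit == "Da":
        return value
    if unit == "mmu":            # milli-mass units: thousandths of a Dalton
        return value / 1000.0
    if unit == "ppm":            # parts per million of the mass
        return mass * value / 1e6
    if unit == "%":              # percentage of the mass
        return mass * value / 100.0
    raise ValueError(f"unknown tolerance unit: {unit}")


def within_tolerance(observed, calculated, value, unit):
    """True if the observed mass lies inside the window around the calculated mass."""
    return abs(observed - calculated) <= tolerance_in_da(calculated, value, unit)


print(tolerance_in_da(1000.0, 10, "ppm"))       # window for a 1000 Da peptide
print(within_tolerance(1000.004, 1000.0, 5, "ppm"))
print(within_tolerance(1000.02, 1000.0, 5, "ppm"))
```

Relative units (ppm, %) widen the window as the mass grows, while absolute units (Da, mmu) keep it constant, so the right choice depends on how the instrument's error behaves.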
-
Mass type
Average or monoisotopic.
Monoisotopic: calculated from the most abundant natural isotopes; the first peak of the isotope distribution.
Average mass is the chemical mass, the centre of gravity of the isotope distribution.
The difference is approximately 0.06%.
If you get this setting wrong, the mass errors will be very large.
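The size of the difference can be checked by calculating both masses from the elemental composition of a peptide. This is a sketch under stated assumptions: an unmodified peptide, standard residue compositions, and atomic masses rounded as shown. The names COMPOSITION, MONO, AVERAGE and peptide_mass are hypothetical.

```python
# Residue elemental compositions as (C, H, N, O, S) counts.
COMPOSITION = {
    "G": (2, 3, 1, 1, 0), "A": (3, 5, 1, 1, 0), "S": (3, 5, 1, 2, 0),
    "P": (5, 7, 1, 1, 0), "V": (5, 9, 1, 1, 0), "T": (4, 7, 1, 2, 0),
    "C": (3, 5, 1, 1, 1), "L": (6, 11, 1, 1, 0), "I": (6, 11, 1, 1, 0),
    "N": (4, 6, 2, 2, 0), "D": (4, 5, 1, 3, 0), "Q": (5, 8, 2, 2, 0),
    "K": (6, 12, 2, 1, 0), "E": (5, 7, 1, 3, 0), "M": (5, 9, 1, 1, 1),
    "H": (6, 7, 3, 1, 0), "F": (9, 9, 1, 1, 0), "R": (6, 12, 4, 1, 0),
    "Y": (9, 9, 1, 2, 0), "W": (11, 10, 2, 1, 0),
}
# Atomic masses in the same (C, H, N, O, S) order.
MONO = (12.0, 1.00782503, 14.00307401, 15.99491462, 31.97207069)
AVERAGE = (12.0107, 1.00794, 14.0067, 15.9994, 32.065)


def peptide_mass(peptide, atom_masses):
    """Neutral peptide mass: sum of residue atoms plus one water."""
    counts = [0, 0, 0, 0, 0]
    for aa in peptide:
        for i, n in enumerate(COMPOSITION[aa]):
            counts[i] += n
    counts[1] += 2   # two hydrogens from water
    counts[3] += 1   # one oxygen from water
    return sum(n * m for n, m in zip(counts, atom_masses))


mono = peptide_mass("PEPTIDE", MONO)
average = peptide_mass("PEPTIDE", AVERAGE)
print(f"monoisotopic {mono:.4f}  average {average:.4f}  "
      f"difference {100 * (average - mono) / mono:.3f}%")
```

For a peptide of about 800 Da the two values differ by roughly 0.5 Da, far more than a typical peptide tolerance, which is why the wrong setting causes matches to fail.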
-
Data (PMF)
The Query window is used for mass values when there is no data file.
The data format is auto-detected.
A list of mass values, one per line. If a second value is present, it is assumed to be intensity. Any further values on the same line are ignored.
Mascot also supports other peak list formats:
Applied Biosystems Data Explorer (.pkm)
Bruker Analysis AutoXecute data report
Bruker XML
mzData (1.05)
mzML
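The simple list format described above (mass first, optional intensity second, anything further ignored) can be read with a few lines of code. This is a minimal sketch of such a parser, not Mascot's own; the function name parse_peak_list is hypothetical, and whitespace-separated values are assumed.

```python
def parse_peak_list(text):
    """Parse one peak per line into (mass, intensity) tuples.

    The first value is the mass, an optional second value is the intensity,
    and any further values on the line are ignored. Blank lines are skipped.
    """
    peaks = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        mass = float(fields[0])
        intensity = float(fields[1]) if len(fields) > 1 else None
        peaks.append((mass, intensity))
    return peaks


example = "1000.5 200 7\n\n1500.25\n"
print(parse_peak_list(example))
```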
-
Data (MS/MS)
The format cannot be auto-detected, and must be specified.
Instrument
Type of instrument used to acquire the data.
This setting determines which fragment ion series will be used for scoring.
-
Report
AUTO displays only protein hits with significant scores, plus one additional hit after the significance cutoff.
-
Final tip
Beware of
Narrowing the taxonomy
Reducing mass tolerances
Removing modifications
Selecting spectra or mass values
Set search parameters using standard samples
-
Types of Summary Report
-
Scoring and statistics
The result is a list of proteins.
Some matches are not statistically significant.
In this example, the score threshold for the search is 76, and the top scoring match is 47.
The area shaded green indicates random, meaningless matches.
-
Probability based scoring
Scoring tests whether the match is random or not.
Probability P: the probability that the observed match is a random event.
A real match, which is not random, has a very low probability.
Reject anything with a probability greater than a chosen threshold.
The Mascot score is -10log10(P)
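The relationship between probability and score can be tried out directly. This sketch uses the conventional form of the score, -10 times the base 10 logarithm of P, so that a lower probability of a random match gives a higher score. The function names are hypothetical, and how Mascot arrives at P is not shown.

```python
import math


def score_from_probability(p):
    """Score = -10 * log10(P): unlikely random matches get high scores."""
    return -10.0 * math.log10(p)


def probability_from_score(score):
    """Inverse of the score: P = 10 ** (-score / 10)."""
    return 10.0 ** (-score / 10.0)


for p in (0.05, 1e-3, 1e-5):
    print(f"P = {p:g}  score = {score_from_probability(p):.1f}")
```

Every 10 points of score correspond to a tenfold drop in the probability that the match is random.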
-
Significance thresholds
The threshold is calculated from the number of trials
For a 1 in 20 chance of a random match among 500000 trials: P=1/(20x500000)
Standard score
MudPIT score
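The threshold arithmetic can be worked through as a short calculation. The sketch assumes the reading that the factor 20 is a 1 in 20 (0.05) significance level and 500000 is the number of trials, and it converts the resulting probability to a score with -10 times log10; the function name threshold_score is hypothetical.

```python
import math


def threshold_score(significance, n_trials):
    """Score a match must exceed to be significant after n_trials comparisons."""
    p_threshold = significance / n_trials    # e.g. 0.05 / 500000 = 1 / (20 x 500000)
    return -10.0 * math.log10(p_threshold)


print(round(threshold_score(0.05, 500000), 1))     # threshold for 500000 trials
print(round(threshold_score(0.05, 19000000), 1))   # a larger database raises the threshold
```

Because the threshold grows with the number of trials, the same match can be significant against a small database and not significant against a very large one.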
-
Expectation value
The number of times you could expect to get this score or better by chance
E=Pthreshold*(10**((Sthreshold-score)/10))
A completely random match has an expectation value of 1 or more
The better the match, the smaller the expectation value.
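The expectation value formula can be evaluated for a few scores. The sketch below uses the formula exactly as given on the slide; the default Pthreshold of 0.05 is an assumption, and the function name is hypothetical. The first example reuses the figures from the scoring slide (threshold 76, top score 47).

```python
def expectation_value(score, s_threshold, p_threshold=0.05):
    """E = Pthreshold * 10 ** ((Sthreshold - score) / 10)."""
    return p_threshold * 10 ** ((s_threshold - score) / 10)


print(expectation_value(47, 76))   # well above 1: consistent with a random match
print(expectation_value(76, 76))   # exactly at the threshold: E equals Pthreshold
print(expectation_value(86, 76))   # 10 points above the threshold: ten times smaller
```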
-
Error tolerant search
Finds new matches by introducing mass shifts
Take query 218: the observed mass difference could correspond to either carbamidomethylation or carboxymethylation at the N-terminus.
Since the sample was alkylated with iodoacetamide, carbamidomethylation is very believable: it is a known artefact of over-alkylation.
-
Phosphorylation site localization
For confident site localization: Ascore, PTM score and MD-score
MD-score: the score difference between the top two matches
Depends on the fragmentation technique
The ability to distinguish sites increases with increasing distance between them
The MD-score does not require complex computation
-
Validation (Decoy)
False discovery rate
The most reliable approach is a decoy database
Decoy entries can be in a separate database or concatenated to the target entries
-
Decoy
Search a decoy database
Very simple
Repeat the search against the decoy database
Matches that are found in the decoy database are false positives.
It isn't useful when there is only a small number of spectra.
-
Decoy
A utility to create a decoy database
A reversed or randomised sequence of the same length is automatically generated and tested.
The average amino acid composition of the random sequences is the same as that of the target database.
The matches and scores for the decoy sequences are recorded separately in the result file.
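The decoy idea can be sketched as follows. This is an illustration under simplifying assumptions, not the Mascot utility: decoys are made by reversing or shuffling each target sequence (shuffling keeps the composition of each individual protein, a stricter condition than matching the database average), and the false discovery rate is estimated as decoy matches divided by target matches from two separate searches at the same threshold. All function names are hypothetical.

```python
import random


def reversed_decoy(sequence):
    """Decoy of the same length and composition: the reversed sequence."""
    return sequence[::-1]


def shuffled_decoy(sequence, seed=0):
    """Decoy of the same length and composition: residues in random order."""
    residues = list(sequence)
    random.Random(seed).shuffle(residues)
    return "".join(residues)


def estimate_fdr(decoy_matches, target_matches):
    """Estimated false discovery rate for separate target and decoy searches."""
    if target_matches == 0:
        return 0.0
    return decoy_matches / target_matches


print(reversed_decoy("PEPTIDEK"))
print(shuffled_decoy("PEPTIDEK"))
print(estimate_fdr(decoy_matches=5, target_matches=100))
```

With only a handful of spectra the counts are too small for this ratio to be meaningful, which is the limitation noted for small data sets.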
-
Mascot Daemon
Automates the submission of data files
Batch mode
Real-time monitor mode
Follow-up tasks
-
Mascot Distiller
Accesses all of the popular data formats
Produces high quality peak lists
Submits and reviews Mascot search results
Performs de novo sequencing and interprets sequence tags for tag searches
-