BIOINF 4120 Bioinformatics 2
- Structures and Systems -
Oliver Kohlbacher Summer 2012
17. Protein Identification
Overview
• Peptide fragment spectra
  • Mass spectrometry
  • Fragmentation mechanisms
  • Comparison of spectra
• Peptide ID by database search
  • Problem definition
  • X!Tandem
• Protein inference
  • Problem definition
  • Algorithms
  • ProteinProphet
Shotgun Proteomics
Key ideas
• Separation of whole proteins is possible but difficult, hence digestion is preferred
• Usually: trypsin, which cuts after K and R and yields peptides suitable for MS (positive charge at the end)
• Separate peptides; this is easy
• Identify proteins through peptides
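The cleavage rule stated above (cut after K and R) can be sketched as a small in-silico digest. This is a minimal illustration only: the function name `tryptic_digest` is hypothetical, and the sketch deliberately ignores refinements such as missed cleavages or any sequence-context exceptions.

```python
import re

def tryptic_digest(protein):
    """Split a protein sequence after every K or R (simplified trypsin rule)."""
    # Each peptide is a run of non-K/R residues closed by one K or R;
    # a C-terminal remainder without K/R is kept as the last peptide.
    return re.findall(r"[^KR]*[KR]|[^KR]+$", protein)

# Toy sequence, not taken from a real protein
peptides = tryptic_digest("MGKHPRAS")
print(peptides)
```

Every peptide except possibly the last one ends in K or R, which is the "positive charge at the end" property mentioned in the key ideas.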
[Figure: shotgun proteomics workflow: proteins are digested into a peptide digest, and the peptides are then separated]
Tandem Mass Spectrometry
• MS can be done in two stages: the first stage separates ions by m/z
• Ions of interest are then selected, trapped, fragmented by CID, and analyzed by a second MS stage
• These tandem mass spectra (MS/MS spectra) allow the identification of the peptides quantified in the first MS stage
http://www.nature.com/nrd/journal/v2/n2/full/nrd1011.html
Peptide Fragmentation
• Collision-induced dissociation (CID) allows the fragmentation of molecules through collision with a neutral gas
• The gas molecules transfer their kinetic energy to the analytes
• Bond cleavages occur, resulting in characteristic fragment ions
• Peptides fragment preferentially along the peptide backbone
• This gives rise to several series of fragment ions, of which b and y ions are the most common
[Figure: example fragment spectrum; x-axis: m/z (0 to 1000), y-axis: % intensity]
Peptide Sequencing
• From the series of b/y ions (ladders) one can reconstruct the peptide sequence
[Figure: MS/MS spectrum of the peptide SGEFLEEDELK, shown without and with annotation of the b ions (b3 to b9), the y ions (y2 to y9), and the precursor [M+2H]2+; x-axis: m/z (0 to 1000), y-axis: % intensity]
Ladders
• The b and y ions form two series of peaks, the so-called b- and y-ladders
• The distance between adjacent b ions or adjacent y ions corresponds to the mass of an amino acid
• Walking the peaks thus yields the mass corresponding to each amino acid and in turn the sequence (at least in theory!)
http://www.nature.com/nmeth/journal/v1/n3/fig_tab/nmeth725_F2.html
[Figure: example peptide IYEVEGMR]
Amino Acid Masses
AA   Chemical formula  Monoisotopic [Da]  Average [Da]
Ala  C3H5ON             71.03711           71.0788
Arg  C6H12ON4          156.10111          156.1875
Asn  C4H6O2N2          114.04293          114.1038
Asp  C4H5O3N           115.02694          115.0886
Cys  C3H5ONS           103.00919          103.1388
Glu  C5H7O3N           129.04259          129.1155
Gln  C5H8O2N2          128.05858          128.1307
Gly  C2H3ON             57.02146           57.0519
His  C6H7ON3           137.05891          137.1411
Ile  C6H11ON           113.08406          113.1594
Leu  C6H11ON           113.08406          113.1594
Lys  C6H12ON2          128.09496          128.1741
Met  C5H9ONS           131.04049          131.1926
Phe  C9H9ON            147.06841          147.1766
Pro  C5H7ON             97.05276           97.1167
Ser  C3H5O2N            87.03203           87.0782
Thr  C4H7O2N           101.04768          101.1051
Trp  C11H10ON2         186.07931          186.2132
Tyr  C9H9O2N           163.06333          163.1760
Val  C5H9ON             99.06841           99.1326
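As a hedged illustration of how the residue masses translate into b- and y-ladders, the sketch below computes singly charged fragment m/z values for a peptide. The monoisotopic residue masses are those of the table; the proton and water masses and the function name `fragment_ladders` are assumptions added for this example, and modifications and higher charge states are ignored.

```python
# Monoisotopic residue masses [Da], taken from the amino acid mass table.
RESIDUE_MASS = {
    "A": 71.03711, "R": 156.10111, "N": 114.04293, "D": 115.02694,
    "C": 103.00919, "E": 129.04259, "Q": 128.05858, "G": 57.02146,
    "H": 137.05891, "I": 113.08406, "L": 113.08406, "K": 128.09496,
    "M": 131.04049, "F": 147.06841, "P": 97.05276, "S": 87.03203,
    "T": 101.04768, "W": 186.07931, "Y": 163.06333, "V": 99.06841,
}
PROTON = 1.007276   # assumed proton mass [Da]
WATER = 18.010565   # assumed monoisotopic mass of H2O [Da]

def fragment_ladders(peptide):
    """Return (b_ions, y_ions): singly charged m/z values, shortest fragment first."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for k in range(1, len(peptide)):
        # b_k: first k residues plus one proton
        b_ions.append(sum(masses[:k]) + PROTON)
        # y_k: last k residues plus water plus one proton
        y_ions.append(sum(masses[-k:]) + WATER + PROTON)
    return b_ions, y_ions

b_ions, y_ions = fragment_ladders("IYEVEGMR")
# Adjacent ladder peaks differ by exactly one residue mass
print(b_ions[1] - b_ions[0])
```

The difference between neighbouring b ions (or y ions) equals a residue mass, which is exactly the property used when walking the peaks of a ladder.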
Amino Acid Masses
• Leu and Ile (L/I) are structural isomers
• They thus have identical mass and cannot be distinguished!
• Fragments with the same mass are called isobaric
• Gln and Lys (Q/K) have nearly identical masses: 128.05858 Da and 128.09496 Da
• For low-resolution instruments they are indistinguishable, too
AA   Chemical formula  Monoisotopic [Da]  Average [Da]
Leu  C6H11ON           113.08406          113.1594
Ile  C6H11ON           113.08406          113.1594
Gln  C5H8O2N2          128.05858          128.1307
Lys  C6H12ON2          128.09496          128.1741
Peptide Identification
[Figure: database search workflow. The LC-MS/MS experiment yields experimental spectra (fragment m/z values; axes: m/z vs. RT). The sequence database (example entry: Q9NSC5|HOME3_HUMAN, Homer protein homolog 3, Homo sapiens) yields theoretical spectra (theoretical fragment m/z values from suitable peptides). Experimental and theoretical spectra are compared and the hits are scored, for example:
1 QRESTATDILQK 18.77
2 EIEEDSLEGLKK 14.78
3 GIEDDLMDLIKK 12.63]
X!Tandem
• Many different search engines have been proposed that implement this basic database search strategy
• They differ in their speed, availability, and quality
• Internally, the differences mainly concern the scoring, preprocessing, and search data structures
• Here we will discuss the X!Tandem algorithm
  • Proposed by Craig and Beavis in 2003
  • We can only discuss the very core of the algorithm; some of the additional tricks and tweaks are beyond the scope of this lecture
• There is an additional lecture (BIOINF 4399B “Computational Proteomics and Metabolomics”) discussing many of these issues in more detail
• http://www.thegpm.org/tandem/instructions.html
Craig, R. and Beavis, R.C. (2003) Rapid Commun. Mass Spectrom., 17, 2310–2316.
Scoring Spectra
2. Compare the theoretical spectra of all candidate peptides to the experimental spectrum S
[Figure: experimental spectrum S next to several theoretical spectra; axes: m/z vs. intensity]
Find overlapping masses
[Figure: experimental spectrum S (intensity, 0 to 100 %) aligned with an exemplified theoretical spectrum (peak heights 0 or 1)]
To find overlapping masses, a maximal fragment mass tolerance window needs to be set (for ion traps this is usually 0.5 Da)
X!Tandem’s Dot Product
• Reduce the experimental spectrum to only those peaks that match peaks in the theoretical spectrum
• Calculate the dot product (dp) using the ion intensities and the number of matching ions:
$$dp = \sum_{i=1}^{n} I_i P_i$$
  • $I_i$ … fragment ion intensities from the experimental spectrum
  • $P_i \in \{0, 1\}$ … peak predicted or not in the theoretical spectrum
[Figure: experimental spectrum reduced to the matching peaks; y-axis: intensity (0 to 100 %)]
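A minimal sketch of the matching and scoring step, assuming the dot product is the sum of experimental intensities I_i over all peaks that have a theoretical peak within the fragment mass tolerance (P_i = 1) and zero otherwise. The function names and the toy peak lists are hypothetical; the actual X!Tandem implementation includes preprocessing steps not shown here.

```python
def matched_intensities(experimental, theoretical_mz, tolerance=0.5):
    """Keep the intensities of experimental peaks that match a theoretical m/z.

    experimental   : list of (mz, intensity) pairs
    theoretical_mz : list of predicted fragment m/z values
    tolerance      : maximal fragment mass tolerance in Da (0.5 Da for ion traps)
    """
    kept = []
    for mz, intensity in experimental:
        # P_i = 1 if any theoretical peak lies within the tolerance window
        if any(abs(mz - t) <= tolerance for t in theoretical_mz):
            kept.append(intensity)
    return kept

def dot_product(experimental, theoretical_mz, tolerance=0.5):
    """Sum of I_i * P_i over all experimental peaks."""
    return sum(matched_intensities(experimental, theoretical_mz, tolerance))

# Toy data: two of the three experimental peaks have a theoretical partner
spectrum = [(175.1, 40.0), (300.0, 10.0), (114.2, 25.0)]
theoretical = [114.09, 175.12, 500.0]
print(dot_product(spectrum, theoretical))
```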
Survival Function and E-value
• Let x represent the dot product score for the experimental spectrum S and a theoretical spectrum
• p(x) is calculated from the frequency histogram (counts of PSMs per score bin)
• With f(x), the number of PSMs that are given the score x, p(x) is calculated as
$$p(x) = \frac{f(x)}{N}$$
with N … total number of PSMs
[Figure: example of a frequency histogram; x-axis: random variable, y-axis: frequency]
Fenyö and Beavis, Anal. Chem. 2003, 75, 768-774
Survival Function and E-value
• The survival function s(x) for a discrete stochastic score probability distribution p(x) is defined as
$$s(x) = P(X > x) = \sum_{x' > x} p(x')$$
where P(X > x) is the probability of obtaining a score greater than x by random matches in a database.
[Figure: p(x) plotted against ln(x); the valid PSM is marked]
Fenyö and Beavis, Anal. Chem. 2003, 75, 768-774
Survival Function and E-value
• With the survival function s(x), we can calculate the E-value e(x), indicating the number of PSMs that are expected to have scores of x or better:
$$e(x) = n \, s(x)$$
where n is the number of sequences in the database
• Now, each PSM can be ranked according to e(x)
[Figure: p(x) plotted against ln(x); the valid PSM is marked]
Fenyö and Beavis, Anal. Chem. 2003, 75, 768-774
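The chain from score histogram to E-value can be sketched as follows, assuming p(x) = f(x)/N, s(x) = P(X > x), and e(x) = n·s(x). The function names and the integer-binned toy scores are hypothetical, and this sketch uses the empirical distribution directly rather than any fitted or extrapolated model.

```python
from collections import Counter

def score_distribution(scores):
    """p(x) = f(x) / N from a list of (binned) PSM scores."""
    counts = Counter(scores)          # f(x): number of PSMs per score bin
    n_total = len(scores)             # N: total number of PSMs
    return {x: f / n_total for x, f in counts.items()}

def survival(p, x):
    """s(x) = P(X > x): probability mass of all scores greater than x."""
    return sum(prob for score, prob in p.items() if score > x)

def e_value(p, x, n_sequences):
    """e(x) = n * s(x), with n the number of sequences in the database."""
    return n_sequences * survival(p, x)

# Toy example: ten binned scores for one spectrum against a database
scores = [1, 1, 2, 2, 2, 3, 4, 4, 5, 9]
p = score_distribution(scores)
print(survival(p, 4), e_value(p, 4, n_sequences=1000))
```

A low E-value for the top-scoring PSM means that few random matches would be expected to score as well, so the PSMs can be ranked by e(x).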
X!Tandem Hyperscore
• The hyperscore (HS) is calculated by multiplying the dot product with the factorials of the numbers of assigned b and y ions:
$$HS = \left( \sum_{i=1}^{n} I_i P_i \right) \cdot N_b! \cdot N_y!$$
• The use of the factorials is based on the hypergeometric distribution that is assumed for matches of product ions
[Figure: spectrum with assigned b and y ions; y-axis: intensity (0 to 100 %)]
Fenyö and Beavis, Anal. Chem. 2003, 75, 768-774
• If p(x) is now plotted as a function of the log(hyperscore), the valid PSM is much better separated from the bulk of incorrect assignments
[Figure: p(x) plotted against ln(x); the valid PSM is marked]
http://www.proteomesoftware.com/pdf_files/XTandem_edited.pdf
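A short sketch of the hyperscore, assuming it is the dot product multiplied by the factorials of the numbers of matched b and y ions, and that its natural logarithm is what gets plotted. The function names are hypothetical.

```python
from math import factorial, log

def hyperscore(dot_product, n_b, n_y):
    """Dot product times N_b! times N_y! (N_b, N_y: matched b and y ions)."""
    return dot_product * factorial(n_b) * factorial(n_y)

def log_hyperscore(dot_product, n_b, n_y):
    """Natural log of the hyperscore, the quantity used on the x-axis."""
    return log(hyperscore(dot_product, n_b, n_y))

# Same dot product, but more matched ions gives a much larger hyperscore
print(hyperscore(65.0, 1, 1), hyperscore(65.0, 3, 2))
```

Because factorials grow quickly, a PSM with many matched b and y ions is pushed far away from PSMs that reach a similar dot product through only a few intense peaks.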
One Hit Wonders
• In many cases, proteins are identified through a single peptide-spectrum match (PSM) only
• These ‘single hit wonders’ have long been considered problematic: a single false PSM can lead to a wrongly identified protein
• In fact, the so-called ‘Paris guidelines’ for data deposition in proteomics recommend only reporting identifications for which at least two peptides have been identified
• This also became known as the ‘two peptide rule’
• Obviously, just dropping the majority of PSMs is inadequate to address this problem
• Question:
  • How large is the error rate in the identifications?
  • Which identifications can be trusted?
Bradshaw RA, Burlingame AL, Carr S, Aebersold R. Mol Cell Proteomics 2006, 5:787-8
http://www.mcponline.org/site/misc/ParisReport_Final.xhtml
Target-decoy databases
[Figure: design of decoy sequences and separation of target and decoy results]
Although different decoy database designs produce very similar results, the most frequently used approaches are the reversed and pseudo-reversed decoy databases
Elias and Gygi, Nature Methods, Vol. 4, No. 3, March 2007
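FDR Calculation
• General equation for FDR calculation:
$$FDR = \frac{FP}{FP + TP}$$
with FP and TP the numbers of false and true positives above the score threshold
There are two ways to calculate FDRs based on target-decoy search results:
• Käll et al. suggest
$$FDR = \frac{N_{decoy}}{N_{target}}$$
(Käll et al., J. Proteome Res. 2008, 7, 29–34)
• Zhang et al. suggest
$$FDR = \frac{2 \, N_{decoy}}{N_{decoy} + N_{target}}$$
(Zhang et al., J. Proteome Res. 2007, 6(9), 3549–3557)
with N_decoy and N_target the numbers of decoy and target PSMs above the score threshold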
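The two target-decoy estimates can be sketched in a few lines. This example assumes the commonly quoted forms FDR = N_decoy / N_target (attributed to Käll et al.) and FDR = 2·N_decoy / (N_decoy + N_target) (attributed to Zhang et al.); the function names, the PSM representation as (score, is_decoy) pairs, and the toy scores are hypothetical.

```python
def count_hits(psms, threshold):
    """Count target and decoy PSMs with a score at or above the threshold.

    psms: list of (score, is_decoy) pairs, higher score = better match
    """
    n_target = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    n_decoy = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return n_target, n_decoy

def fdr_decoy_over_target(n_target, n_decoy):
    """Assumed Kall-style estimate: decoy hits divided by target hits."""
    return n_decoy / n_target

def fdr_doubled_decoys(n_target, n_decoy):
    """Assumed Zhang-style estimate: twice the decoy hits over all hits."""
    return 2 * n_decoy / (n_target + n_decoy)

# Toy search result: four targets and two decoys
psms = [(50, False), (45, False), (40, True), (38, False), (30, False), (20, True)]
n_target, n_decoy = count_hits(psms, threshold=35)
print(fdr_decoy_over_target(n_target, n_decoy), fdr_doubled_decoys(n_target, n_decoy))
```

Both functions assume at least one hit above the threshold; lowering the threshold admits more decoys and thus raises either estimate.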
Other Search Engines
• OMSSA
  • Open-source package
  • Fast
• SEQUEST
  • One of the commercial standard packages
  • Commercial software (Thermo Fisher Scientific, http://www.thermofisher.com/)
• Mascot from Matrix Science
  • Mascot is one of the most popular search engines
  • Commercial software (http://www.matrixscience.com/)
• Phenyx
  • Commercial software
  • Colinge et al., Proteomics (2003), 3(8):1454-1463.
• InsPecT
  • Very fast open-source search engine designed for the identification of posttranslational modifications
  • Tanner et al., J Proteome Res. (2005), 4(4):1287-95.
• MyriMatch
  • Open source
  • Tabb et al., J Proteome Res. (2007), 6(2):654-61.
• …
Identifying Proteins
• Identification methods so far only identify peptide-spectrum matches (PSMs)
  • Search a database
  • Return a ranked list of PSMs with associated scores
• PSM false discovery rates (FDRs) can be computed through a target-decoy approach
• An FDR of 1% would mean that 1% of the PSMs with a score above the threshold are expected to be incorrect
• Note that this is per se a statement at the PSM level, not at the peptide or protein level!
Identifying Proteins
• Each PSM above the threshold contributes
  • a match of a spectrum to a peptide
  • a match of a peptide to a protein
• Peptides are not necessarily unique!
• The length distribution of observed peptides deviates from the theoretical distribution: short peptides (length 6 and shorter) are usually not observed
Swaney DL, Wenger CD, Coon JJ, J. Proteome Res. 2010, 9, 1323-1329.
Uniqueness
• If we are interested in proteomics (in contrast to peptide identification in metabolomics, MHC ligandomics etc.), we want to quantify proteins
• Non-unique peptide sequences can stem from different proteins
• Obviously, uniqueness depends on the chosen database
• Uniqueness becomes more likely for longer peptide sequences
• Reasons for non-uniqueness
  • Chance hits
  • Different isoforms
  • Conserved regions shared within a protein family
Uniqueness
• Uniqueness depends on the size of the database
• Searching an appropriate (non-redundant) database is thus preferable
• Reference databases (SwissProt) usually contain few degenerate (non-unique) tryptic peptides above a mass of 750 Da
• Problem: isoforms of proteins/splice variants!
Nesvizhskii AI, Aebersold R, Mol Cell Proteomics 2005;4:1419-1440.
Protein Isoforms
• NextProt Release 3.0
  • 20,110 human proteins
  • 35,978 sequences resulting from alternative isoforms
• On average 2.75 different splice variants for each protein sequence
• Some proteins have a much larger number of variants
• Resolving the different isoforms is only possible if peptides crossing the right exon boundaries are observed
NextProt Release 3.0, 2011-12-09, http://www.nextprot.org/db/statistics/release?viewas=numbers
Protein Isoforms
• Phosphodiesterase 9A has 16 documented isoforms
• Peptides stemming from the second half of the sequence are entirely indistinguishable between isoforms
http://www.nextprot.org/db/entry/NX_O76083/structures
Protein Families
• Sequence coverage is often poor in large-scale studies: many proteins are identified through very few peptides only
• In prokaryotes, typically over 90% of the identified peptides are unique in the whole proteome
• In eukaryotes in particular, the large number of paralogs leads to significant sequence identity between different proteins that are not isoforms
• In eukaryotes, the share of unique identified peptides can thus easily drop below 50% (Gupta & Pevzner, 2009)
Parsimony-Based Inference
• Idea: find the smallest set of proteins explaining all observed peptides
• If all peptides mapping to one protein family can be explained by a single protein, then it is quite likely that only this protein is present (but this need not be the case)
• Basically: applying Occam’s razor to the dataset, i.e., finding the simplest explanation possible (maximum parsimony)
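Finding the smallest protein set that explains all peptides is a set cover problem. The sketch below uses a greedy heuristic, which is an assumption made for illustration: it is not guaranteed to return the true minimum and is not necessarily the strategy used by any particular tool. The function name and the toy protein-to-peptide mapping are hypothetical.

```python
def greedy_parsimony(protein_to_peptides):
    """Greedily pick proteins until every observed peptide is explained.

    protein_to_peptides: dict mapping protein name -> set of observed peptides
    Returns the list of chosen proteins in the order they were picked.
    """
    uncovered = set().union(*protein_to_peptides.values())
    chosen = []
    while uncovered:
        # Pick the protein explaining the most still-unexplained peptides;
        # ties are broken alphabetically to keep the result deterministic.
        best = max(sorted(protein_to_peptides),
                   key=lambda prot: len(protein_to_peptides[prot] & uncovered))
        chosen.append(best)
        uncovered -= protein_to_peptides[best]
    return chosen

# Toy example: A alone explains everything that B and C explain
mapping = {"A": {"pep1", "pep2", "pep3"}, "B": {"pep1", "pep2"}, "C": {"pep3"}}
print(greedy_parsimony(mapping))
```

In the toy example only A is reported, although the data would also be compatible with B and C being present.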
Parsimony-Based Inference
• Scenarios for different proteins given a set of observed peptides
• Distinct proteins do not share peptides
• Differentiable proteins can be distinguished by at least one distinct peptide
• Indistinguishable proteins share all peptides
• Subset proteins contain only peptides that are also contained in a single other protein
• Subsumable proteins contain only peptides that are also contained in other proteins, but not all in any single one of them
Nesvizhskii AI, Aebersold R, Mol Cell Proteomics 2005;4:1419-1440.
Protein Ambiguity Groups
Example:
[Figure: proteins A, B, and C with the observed peptides mapped to them]
• Note that even though the presence of A is sufficient to explain all observed peptides, this does not automatically imply the absence of B and C
• The data is explained equally well by the presence of A, A + B, A + C, B + C, or A + B + C
• The set of proteins sharing one or multiple peptides is often referred to as a protein ambiguity group
Parsimony-Based Inference
• Maximum parsimony inference results in a minimal list of proteins
• It thus contains all distinct and differentiable proteins of a protein ambiguity group
• It does not contain any subsumable or subset proteins
• In the previous example, A would be sufficient to explain the observed peptides; B and C would not be reported
[Figure: proteins A, B, and C from the previous example]
Significance of Inferred Hits
• What is the meaning of a PSM for a protein identification?
  • FDR is calculated on the PSM level
  • 1% FDR means that one in 100 PSMs is expected to be incorrect and may thus yield an incorrect protein identification
• This does not mean that there is also an FDR of 1% on the protein level!
• In particular in large-scale studies (tens of thousands of spectra), protein FDRs are much higher than peptide FDRs
  • PSMs for a large number of (mostly) identical samples
  • The number of correctly identified proteins does not increase significantly with the number of spectra (it is always the same proteins being identified; additional correct PSMs do not increase the number of proteins)
  • The number of false positives increases with the number of PSMs (they yield hits to random proteins, so initially mostly novel false positives!)
Protein FDRs
• Error rates increase when going from peptides to proteins
• Correct peptide IDs tend to group into a small set of correct proteins
• Incorrect IDs are semi-random and scatter over the whole protein database
A. Nesvizhskii, J. Proteomics (2010), 73:2092-2123
ProteinProphet
• ProteinProphet is an open-source software tool for protein inference and currently one of the standard tools in the area
• Key ideas
  • Maximum parsimony approaches to compile protein lists
  • Reporting of protein ambiguity groups
  • Protein probability estimation: estimate the probability that a given protein is correctly identified given all evidence for it
Nesvizhskii, et al., Anal. Chem. (2003), 75, 4646-4658
PeptideProphet
• Peptide Probability Estimates (PPE)
  • Computed by PeptideProphet
  • Converts search engine scores into a probability (1 - posterior error probability)
  • Similar ideas have been discussed in the context of consensus identification
• PeptideProphet uses expectation maximization to compute a mixture model of the score distributions of correct and incorrect PSMs
• Given a PSM and a search engine score, we can thus compute a probability p that the PSM is correct
• In contrast to a (raw) score, PPEs are a simple way to determine the trust in each individual PSM
Keller, et al., Anal. Chem. (2002), 74, 5383-5392
Protein Probability Estimates
• Given the PPEs, we can easily compute the probability for each of the induced protein IDs
• Assuming all peptides are unique, we can compute the probability P for a protein identification as 1 minus the probability of all peptide identifications inducing this protein being wrong
• We could do this on the peptide level quite simply as follows:
$$P = 1 - \prod_i (1 - p_i)$$
with probabilities $p_i$ for the peptide identification of peptide i being correct
• However, we also need to consider that multiple spectra can give evidence for the same peptide
Protein Probability Estimates
• We thus need to consider probabilities for each PSM independently
• Each PSM is assigned a PPE by PeptideProphet
• The probability that a protein is not present in a sample despite its PSMs depends on the probabilities $p(+|D_i^j)$ for the peptide ID of peptide i based on the observed data (spectrum) j being correct
• We can thus compute P based on the PPEs of all PSMs:
$$P = 1 - \prod_i \prod_j \left(1 - p(+|D_i^j)\right)$$
Nesvizhskii, et al., Anal. Chem. (2003), 75, 4646-4658
Protein Probability Estimates
• There are a few problems with this:
  • PSMs are not independent
    There is a high probability for multiple spectra of the same peptide to hit the same incorrect ID if the spectra are of high quality but do not match the database (e.g., due to post-translational modification)
  • Ambiguous peptide-protein matches
    If a peptide matches multiple proteins, its evidence cannot simply be shared across these proteins
Protein Probability Estimates
• A simple way to deal with multiple PSMs is to
  • Include each peptide just once
  • Consider only the PSM with the best PPE of all PSMs to the same peptide: $p_i = \max_j p(+|D_i^j)$
• P would then be computed as follows:
$$P = 1 - \prod_i \left(1 - \max_j p(+|D_i^j)\right)$$
• This procedure yields a more conservative estimate of protein probabilities
ProteinProphet
Example:
>gi|125910|sp|P02754.3|LACB_BOVIN
MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDL
EILLQKWENGECAQKKIIAEKTKIPAVFKIDALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQCLVR
TPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI
VYVEELKPTPEGDLEILLQK : p = 0.81
LSFNPTQLEEQCHI : p = 0.48
LSFNPTQLEEQCHI : p = 0.65 (max = 0.65)
TPEVDDEALEK : p = 0.91
P(LACB_BOVIN) = 1 - (1 - 0.81) (1 - 0.91) (1 - 0.65) = 0.99
After: Nesvizhskii, et al., Anal. Chem. (2003), 75, 4646-4658
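The LACB_BOVIN example can be reproduced with a few lines, assuming each peptide contributes only its best PPE and the protein probability is 1 minus the product of the per-peptide error probabilities. The function name and the dictionary layout are hypothetical.

```python
def protein_probability(psm_probs):
    """P = 1 - prod_i (1 - max_j p_ij) over all peptides i of one protein.

    psm_probs: dict mapping peptide sequence -> list of PPEs of its PSMs
    """
    p_all_wrong = 1.0
    for ppes in psm_probs.values():
        # Each peptide counts once, with the best PPE among its PSMs
        p_all_wrong *= 1.0 - max(ppes)
    return 1.0 - p_all_wrong

# PSMs of the LACB_BOVIN example
lacb = {
    "VYVEELKPTPEGDLEILLQK": [0.81],
    "LSFNPTQLEEQCHI": [0.48, 0.65],
    "TPEVDDEALEK": [0.91],
}
print(round(protein_probability(lacb), 2))  # 0.99
```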
Sibling Peptides
• Correct assignments tend to cluster to the same proteins
• Incorrect assignments tend to be hits to proteins with no other assigned peptides
• As a result, the computed PPEs, while correct in the context of the whole dataset, need to be corrected for an accurate estimate in the context of their source protein
• ProteinProphet introduces the notion of sibling peptides
  • Sibling peptides are peptides hitting the same protein
  • Rather than counting them, ProteinProphet defines the number of sibling peptides $NSP_i$ for a peptide i as the sum of the PPEs:
$$NSP_i = \sum_{m \neq i} p_m$$
where the sum runs over all other peptides m hitting the same protein as i, and the PPEs $p_m$ are the maximum values for a given peptide reached in the dataset
Sibling Peptides
Example:
>gi|125910|sp|P02754.3|LACB_BOVIN
MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDL
EILLQKWENGECAQKKIIAEKTKIPAVFKIDALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQCLVR
TPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI
VYVEELKPTPEGDLEILLQK : p = 0.81
LSFNPTQLEEQCHI : p = 0.48
LSFNPTQLEEQCHI : p = 0.65 (max = 0.65)
TPEVDDEALEK : p = 0.91
NSP(VYV…) = 0.91 + 0.65 = 1.56
NSP(TPE…) = 0.65 + 0.81 = 1.46
NSP(LSF…) = 0.91 + 0.81 = 1.72
After: Nesvizhskii, et al., Anal. Chem. (2003), 75, 4646-4658
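The NSP values of the example follow directly from the definition: for each peptide, sum the best PPEs of all other peptides hitting the same protein. The sketch below assumes the input already holds the maximum PPE per peptide; the function name is hypothetical.

```python
def nsp_values(best_ppe):
    """NSP_i = sum of the best PPEs of all sibling peptides m != i.

    best_ppe: dict mapping peptide -> maximum PPE of that peptide,
              for all peptides hitting one protein
    """
    total = sum(best_ppe.values())
    # Subtracting the peptide's own PPE leaves the sum over its siblings
    return {peptide: total - p for peptide, p in best_ppe.items()}

# Best PPEs per peptide from the LACB_BOVIN example
lacb_best = {"VYV": 0.81, "LSF": 0.65, "TPE": 0.91}
for peptide, nsp in nsp_values(lacb_best).items():
    print(peptide, round(nsp, 2))
```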
Sibling Peptides
• Intuitively, one would trust identifications with a high NSP more than those with a low NSP (more evidence per protein)
• We can thus refine PPEs in the context of the source protein as follows:
$$p(+|D, NSP) = \frac{p(+|D)\, p(NSP|+)}{p(+|D)\, p(NSP|+) + p(-|D)\, p(NSP|-)}$$
with
• p(NSP|+) and p(NSP|-) being the probabilities of having a particular NSP value for correct/incorrect assignments
• p(+|D) and p(-|D) being the uncorrected probabilities for the peptide assignment being correct/incorrect
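Sibling Peptides
• Values for p(NSP|+) and p(NSP|-) can be computed for the whole dataset
• NSP values are binned and counted for correct and incorrect assignments:
$$p(NSP|+) = \frac{1}{N\, p(+)} \sum_{i:\, NSP_i \in \text{bin}} p(+|D_i), \qquad p(NSP|-) = \frac{1}{N\, (1 - p(+))} \sum_{i:\, NSP_i \in \text{bin}} p(-|D_i)$$
where N is the total number of peptide assignments and p(+) is the prior probability of a peptide identification being correct
• p(+) can be computed by summation over all peptide identifications of the dataset:
$$p(+) = \frac{1}{N} \sum_i p(+|D_i)$$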
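One pass of the NSP correction can be sketched as follows. The sketch assumes that p(NSP|+) and p(NSP|-) are estimated per NSP bin as the summed p(+|D) and p(-|D) of the assignments in that bin, normalized by N·p(+) and N·(1 - p(+)), and that the adjusted probability follows from Bayes' rule. The function name, the bin edges, and the toy data are hypothetical, and the prior p(+) is assumed to lie strictly between 0 and 1.

```python
from bisect import bisect_right

def nsp_adjust(p_pos, nsp, bin_edges):
    """Return NSP-adjusted probabilities p(+|D, NSP) for all assignments.

    p_pos[i]  : unadjusted p(+|D) of assignment i
    nsp[i]    : NSP value of assignment i
    bin_edges : ascending NSP bin boundaries (illustrative choice)
    """
    n = len(p_pos)
    prior = sum(p_pos) / n                       # p(+)
    bins = [bisect_right(bin_edges, value) for value in nsp]
    n_bins = len(bin_edges) + 1
    like_pos = [0.0] * n_bins                    # p(NSP|+) per bin
    like_neg = [0.0] * n_bins                    # p(NSP|-) per bin
    for p, b in zip(p_pos, bins):
        like_pos[b] += p / (n * prior)
        like_neg[b] += (1.0 - p) / (n * (1.0 - prior))
    adjusted = []
    for p, b in zip(p_pos, bins):
        numerator = p * like_pos[b]
        adjusted.append(numerator / (numerator + (1.0 - p) * like_neg[b]))
    return adjusted

# Toy data: two assignments with many siblings, two with almost none
print(nsp_adjust([0.9, 0.8, 0.2, 0.1], [1.5, 1.6, 0.0, 0.1], bin_edges=[1.0]))
```

In the toy data the assignments in the high-NSP bin are pushed up and those in the low-NSP bin are pushed down; in the iterative scheme the adjusted values would be fed back to recompute the NSP values until convergence.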
NSP Distributions
• NSP distributions can be determined using expectation maximization
• As a first guess, unadjusted p(+|D) values are used to compute an estimated NSP value for each assignment
• Applying EM then yields adjusted probabilities; this is repeated until convergence has been reached
• NSP distributions depend on the dataset and the dataset size
[Figure: NSP distribution for datasets of varying size]
• squares: single run of a low-complexity sample
• circles: four runs of the same sample
• triangles: 22 runs
Nesvizhskii, et al., Anal. Chem. (2003), 75, 4646-4658
Influence of NSP Correction
• NSP correction yields better predictions of protein probabilities
• The figure shows the predicted vs. true protein probabilities with and without NSP
• Different lines correspond to different datasets
• Dotted line: perfect prediction
[Figure: predicted vs. true protein probabilities with and without NSP correction]
Nesvizhskii, et al., Anal. Chem. (2003), 75, 4646-4658
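Protein Ambiguity
• Shared peptides within a protein ambiguity group (PAG) cause issues as well
• Their probabilities can be distributed over their potential source proteins through a weighting scheme based on the protein probabilities:
$$w_i^n = \frac{P_n}{\sum_{n'} P_{n'}}, \qquad P_n = 1 - \prod_i \left(1 - w_i^n\, p_i\right)$$
where the sum runs over all proteins n' that contain peptide i
• Weights $w_i^n$ are again estimated iteratively using an EM-like algorithm
[Figure: peptide 1 (probability p1) maps to protA (PA) and protB (PB) with weights w1A and w1B; peptide 2 (probability p2) maps only to protB with weight w2B]
Nesvizhskii, et al., Anal. Chem. (2003), 75, 4646-4658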
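The iterative weighting can be sketched with the two-peptide, two-protein layout of the figure. The update rules used here are assumptions made for illustration: each weight is the protein's probability divided by the summed probabilities of all proteins sharing the peptide, and each protein probability is 1 minus the product of (1 - weight · peptide probability). The function name, the fixed iteration count, and the example probabilities are hypothetical.

```python
def resolve_shared_peptides(peptide_probs, peptide_to_proteins, n_iter=50):
    """Alternate between protein probabilities and peptide weights.

    peptide_probs       : dict peptide -> probability p_i
    peptide_to_proteins : dict peptide -> list of proteins containing it
    Returns (protein_prob, weights) with weights keyed by (peptide, protein).
    """
    proteins = sorted({prot for prots in peptide_to_proteins.values() for prot in prots})
    # Start by splitting every peptide equally among its candidate proteins
    weights = {(pep, prot): 1.0 / len(prots)
               for pep, prots in peptide_to_proteins.items() for prot in prots}
    protein_prob = {}
    for _ in range(n_iter):
        # Protein probabilities from the currently weighted peptide evidence
        for prot in proteins:
            p_absent = 1.0
            for pep, prots in peptide_to_proteins.items():
                if prot in prots:
                    p_absent *= 1.0 - weights[(pep, prot)] * peptide_probs[pep]
            protein_prob[prot] = 1.0 - p_absent
        # Redistribute each peptide according to the protein probabilities
        for pep, prots in peptide_to_proteins.items():
            total = sum(protein_prob[prot] for prot in prots)
            if total > 0:
                for prot in prots:
                    weights[(pep, prot)] = protein_prob[prot] / total
    return protein_prob, weights

probs, weights = resolve_shared_peptides(
    {"peptide1": 0.9, "peptide2": 0.8},
    {"peptide1": ["protA", "protB"], "peptide2": ["protB"]},
)
print(probs, weights)
```

Because protB is additionally supported by its own peptide 2, the shared peptide 1 is increasingly attributed to protB, and the probability of protA shrinks accordingly.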
References
Papers:
• Craig, R. and Beavis, R.C. (2003) Rapid Commun. Mass Spectrom., 17, 2310–2316.
• Colinge J, Bennett KL. Introduction to Computational Proteomics. PLoS Comput Biol 3:e114. (http://dx.doi.org/10.1371/journal.pcbi.0030114)
• Nesvizhskii AI, Aebersold R, Interpretation of Shotgun Proteomics Data, Mol Cell Proteomics 2005;4:1419-1440
• Nesvizhskii, Keller, Kolker, Aebersold, A Statistical Model for Identifying Proteins by Tandem Mass Spectrometry, Anal. Chem. 2003, 75, 4646-4658.
• Keller, Nesvizhskii, Kolker, Aebersold, Empirical Statistical Model to Estimate the Accuracy of Peptide Identifications Made by MS/MS and Database Search, Anal. Chem. 2002, 74, 5383-5392
Links:
• ProteinProphet: http://proteinprophet.sourceforge.net
• OMSSA online server: http://pubchem.ncbi.nlm.nih.gov/omssa/
• MASCOT online server: http://www.matrixscience.com/search_form_select.html
• Peptide Atlas, a database of peptide spectra and identifications: http://www.peptideatlas.org/