
BIOINF 4120 Bioinformatics 2

- Structures and Systems -

Oliver Kohlbacher, Summer 2012

17. Protein Identification

Overview

•  Peptide fragment spectra
   •  Mass spectrometry
   •  Fragmentation mechanisms
   •  Comparison of spectra

•  Peptide ID by database search
   •  Problem definition
   •  X!Tandem

•  Protein inference
   •  Problem definition
   •  Algorithms
   •  ProteinProphet

Shotgun Proteomics

Key ideas
•  Separation of whole proteins is possible but difficult, hence digestion is preferred
•  Usually: trypsin, which cuts after K and R and yields peptides suitable for MS (positive charge at the end)
•  Separate peptides; this is easy
•  Identify proteins through peptides

[Figure: shotgun proteomics workflow. Proteins are digested into a peptide digest, and the resulting peptides are then separated.]

Tandem Mass Spectrometry

•  MS can be done in two stages: the first stage separates ions by m/z
•  Selected ions are then trapped, undergo CID, and are analyzed by a second MS stage
•  These tandem mass spectra or MS/MS spectra allow the identification of the peptides quantified in the first MS stage

http://www.nature.com/nrd/journal/v2/n2/full/nrd1011.html

Peptide Fragmentation

•  Collision-induced dissociation (CID) allows the fragmentation of molecules through collision with a neutral gas
•  The gas molecules transfer their kinetic energy to the analytes
•  Bond cleavages occur, resulting in characteristic fragment ions
•  Peptides fragment preferentially along the peptide backbone
•  This gives rise to several series of fragment ions, of which b and y ions are the most common

[Figure: fragment spectrum, intensity (%) vs. m/z (0 to 1000)]

Peptide Sequencing

•  From the series of b/y ions (ladders) one can reconstruct the peptide sequence

[Figure: annotated MS/MS spectrum of the peptide SGEFLEEDELK, intensity (%) vs. m/z (0 to 1000), with labeled b ions (b3 to b9), y ions (y2 to y5, y7 to y9), and the precursor [M+2H]2+]

Ladders

•  The b and y ions each form a series of peaks, the so-called b- and y-ladders
•  The distance between adjacent b ions or adjacent y ions corresponds to the mass of one amino acid residue
•  Walking along the peaks thus yields the residue masses and in turn the sequence, at least in theory!

http://www.nature.com/nmeth/journal/v1/n3/fig_tab/nmeth725_F2.html

[Figure: b- and y-ladders for the peptide IYEVEGMR]
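The ladder walk can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name `read_ladder`, the tolerance of 0.02 Da, and the example m/z values (computed here for the first b ions of IYEVEGMR) are not from the lecture, and a complete, noise-free, singly charged ladder is assumed.

```python
# Monoisotopic residue masses in Da (subset of the lecture's mass table).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
    "L/I": 113.08406,  # Leu and Ile have identical mass
    "E": 129.04259, "M": 131.04049, "R": 156.10111, "Y": 163.06333,
}

def read_ladder(ladder_mz, tol=0.02):
    """Translate the differences between adjacent ladder peaks into residues.

    ladder_mz: ascending m/z values of one ion series (singly charged).
    Returns one entry per gap; '?' marks a gap that fits no residue mass.
    """
    residues = []
    for lower, upper in zip(ladder_mz, ladder_mz[1:]):
        delta = upper - lower
        hits = [aa for aa, mass in RESIDUE_MASS.items() if abs(mass - delta) <= tol]
        residues.append(hits[0] if hits else "?")
    return residues

# Hypothetical b1..b5 peaks of IYEVEGMR (assumed values, for illustration only).
b_ladder = [114.09134, 277.15467, 406.19726, 505.26567, 634.30826]
print(read_ladder(b_ladder))
```

Real spectra have missing peaks, noise peaks, and mixed ion series, which is why the slide adds "at least in theory".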

Amino Acid Masses

AA    Chemical formula   Monoisotopic [Da]   Average [Da]
Ala   C3H5ON             71.03711            71.0788
Arg   C6H12ON4           156.10111           156.1875
Asn   C4H6O2N2           114.04293           114.1038
Asp   C4H5O3N            115.02694           115.0886
Cys   C3H5ONS            103.00919           103.1388
Glu   C5H7O3N            129.04259           129.1155
Gln   C5H8O2N2           128.05858           128.1307
Gly   C2H3ON             57.02146            57.0519
His   C6H7ON3            137.05891           137.1411
Ile   C6H11ON            113.08406           113.1594
Leu   C6H11ON            113.08406           113.1594
Lys   C6H12ON2           128.09496           128.1741
Met   C5H9ONS            131.04049           131.1926
Phe   C9H9ON             147.06841           147.1766
Pro   C5H7ON             97.05276            97.1167
Ser   C3H5O2N            87.03203            87.0782
Thr   C4H7O2N            101.04768           101.1051
Trp   C11H10ON2          186.07931           186.2132
Tyr   C9H9O2N            163.06333           163.1760
Val   C5H9ON             99.06841            99.1326
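With the monoisotopic residue masses, theoretical fragment m/z values follow directly. The sketch below assumes singly charged, unmodified b and y ions only; the function name `fragment_ions` and the constants for the proton and water masses are additions for illustration, not part of the slides.

```python
PROTON = 1.007276   # mass of a proton in Da
WATER = 18.010565   # monoisotopic mass of H2O in Da

# Monoisotopic residue masses in Da, one-letter codes, from the mass table.
MONO = {
    "A": 71.03711, "R": 156.10111, "N": 114.04293, "D": 115.02694,
    "C": 103.00919, "E": 129.04259, "Q": 128.05858, "G": 57.02146,
    "H": 137.05891, "I": 113.08406, "L": 113.08406, "K": 128.09496,
    "M": 131.04049, "F": 147.06841, "P": 97.05276, "S": 87.03203,
    "T": 101.04768, "W": 186.07931, "Y": 163.06333, "V": 99.06841,
}

def fragment_ions(peptide):
    """Return (b, y): singly charged m/z lists b1..b(n-1) and y1..y(n-1)."""
    masses = [MONO[aa] for aa in peptide]
    n = len(masses)
    # b_i: first i residues plus one proton
    b = [sum(masses[:i]) + PROTON for i in range(1, n)]
    # y_i: last i residues plus water plus one proton
    y = [sum(masses[n - i:]) + WATER + PROTON for i in range(1, n)]
    return b, y

b_ions, y_ions = fragment_ions("SGEFLEEDELK")
print([round(mz, 3) for mz in b_ions])
print([round(mz, 3) for mz in y_ions])
```

A useful sanity check is that b_i and y_(n-i) always add up to the peptide mass plus two protons.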

Amino Acid Masses

•  Leu and Ile (L/I) are structural isomers
•  They thus have identical mass and cannot be distinguished!
•  Fragments with the same mass are called isobaric
•  Gln and Lys (Q/K) have nearly identical masses: 128.05858 Da and 128.09496 Da
•  For low-resolution instruments they are indistinguishable, too

AA    Chemical formula   Monoisotopic [Da]   Average [Da]
Leu   C6H11ON            113.08406           113.1594
Ile   C6H11ON            113.08406           113.1594
Gln   C5H8O2N2           128.05858           128.1307
Lys   C6H12ON2           128.09496           128.1741

Peptide Identification

[Figure: database search workflow]
•  LC-MS/MS experiment: experimental spectra (m/z vs. RT) yield fragment m/z values
•  Sequence db (example entry: Q9NSC5|HOME3_HUMAN Homer protein homolog 3 - Homo sapiens (Human)): theoretical fragment m/z values are computed from suitable peptides (theoretical spectra)
•  Compare experimental and theoretical fragment m/z values
•  Score hits, for example:

Rank  Peptide        Score
1     QRESTATDILQK   18.77
2     EIEEDSLEGLKK   14.78
3     GIEDDLMDLIKK   12.63

X!Tandem

•  Many different search engines have been proposed that implement this basic database search strategy
•  They differ in their speed, availability, and quality
•  Internally, the differences mainly concern the scoring, preprocessing, and search data structures
•  Here we will discuss the X!Tandem algorithm
   •  Proposed by Craig and Beavis in 2003
   •  We can only discuss the very core of the algorithm; some of the additional tricks and tweaks are beyond the scope of this lecture
•  There is an additional lecture (BIOINF 4399B “Computational Proteomics and Metabolomics”) discussing many of these issues in more detail
•  http://www.thegpm.org/tandem/instructions.html

Craig, R. and Beavis, R.C. (2003) Rapid Commun. Mass Spectrom., 17, 2310–2316.

Scoring Spectra

2.  Compare the theoretical spectra of all candidate peptides to the experimental spectrum S

[Figure: experimental spectrum S (intensity vs. m/z) next to a series of theoretical spectra (m/z)]

Find overlapping masses

[Figure: experimental spectrum S (intensity, 0 to 100 %) compared with an exemplified theoretical spectrum (peak present = 1, absent = 0)]

To find overlapping masses, a maximal fragment mass tolerance window needs to be set (for ion traps this is usually 0.5 Da)
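Matching within a tolerance window can be sketched as follows. The function name `match_peaks`, the peak representation as (m/z, intensity) tuples, and the example values are assumptions for illustration; the default tolerance of 0.5 Da is the ion trap value quoted above.

```python
from bisect import bisect_left

def match_peaks(experimental, theoretical_mz, tol=0.5):
    """Return the experimental (m/z, intensity) peaks that lie within
    `tol` Da of at least one theoretical fragment m/z."""
    theo = sorted(theoretical_mz)
    matched = []
    for mz, intensity in experimental:
        k = bisect_left(theo, mz)
        # Only the nearest theoretical peak on either side can be in range.
        neighbours = theo[max(k - 1, 0):k + 1]
        if any(abs(mz - t) <= tol for t in neighbours):
            matched.append((mz, intensity))
    return matched

# Hypothetical peaks, for illustration only.
spectrum = [(100.2, 10.0), (200.9, 5.0), (300.0, 7.0)]
predicted = [100.0, 200.0, 300.4]
print(match_peaks(spectrum, predicted))
```

Sorting the theoretical peaks once and using binary search keeps the comparison fast when many candidate peptides are scored against the same spectrum.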

X!Tandem’s Dot Product

•  Reduce the experimental spectrum to only those peaks that match peaks in the theoretical spectrum
•  Calculate the dot product (dp), using the ion intensities and the number of matching ions:

$dp = \sum_{i=1}^{n} I_i P_i$

•  $I_i$ … fragment ion intensities from the experimental spectrum
•  $P_i \in \{0, 1\}$ … whether peak i is predicted in the theoretical spectrum (1) or not (0)
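A minimal sketch of this dot product, assuming dp is the sum of I_i * P_i over the experimental peaks, with I_i the experimental intensity and P_i = 1 if peak i matches a theoretical fragment within the tolerance (0 otherwise). The function name and the peak representation are illustrative; X!Tandem's actual preprocessing and intensity normalization are not reproduced here.

```python
def dot_product(experimental, theoretical_mz, tol=0.5):
    """dp = sum_i I_i * P_i over all experimental (m/z, intensity) peaks."""
    dp = 0.0
    for mz, intensity in experimental:
        # P_i is 1 if some theoretical fragment lies within the tolerance.
        p = 1 if any(abs(mz - t) <= tol for t in theoretical_mz) else 0
        dp += intensity * p
    return dp

# Hypothetical peaks, for illustration only.
spectrum = [(100.2, 10.0), (200.9, 5.0), (300.0, 7.0)]
print(dot_product(spectrum, [100.0, 200.0, 300.4]))
```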

Survival Function and e-value

Fenyö and Beavis, Anal. Chem. 2003, 75, 768-774

•  Let x represent the dot product score for the experimental spectrum S and a theoretical spectrum
•  p(x) is calculated from the frequency histogram (counts of PSMs per score bin)
•  With f(x), the number of PSMs that are given the score x, p(x) is calculated as

$p(x) = \frac{f(x)}{N}$, with N … total number of PSMs

[Figure: example of a frequency histogram (frequency vs. random variable)]


Survival function and e-value

•  The survival function s(x) for a discrete stochastic score probability distribution p(x) is defined as

$s(x) = P(X > x) = \sum_{x' > x} p(x')$

where P(X > x) is the probability of obtaining a score greater than x by random matches in a database.

[Figure: p(x) vs. ln(x), with the valid PSM marked]


Survival function and e-value

•  With the survival function s(x), we can calculate the E-value e(x), indicating the number of PSMs that are expected to have scores of x or better:

$e(x) = n \cdot s(x)$

where n is the number of sequences in the database

•  Now, each PSM can be ranked according to e(x)

[Figure: p(x) vs. ln(x), with the valid PSM marked]
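The chain from score histogram to E-value can be sketched as below, assuming p(x) = f(x) / N, s(x) = P(X > x), and e(x) = n * s(x). The function names and the toy scores are illustrative. The sketch uses the empirical distribution only and does not extrapolate beyond the observed scores.

```python
from collections import Counter

def survival_function(scores):
    """Return a dict x -> s(x) = P(X > x) for each distinct (binned) score."""
    counts = Counter(scores)   # f(x): number of PSMs with score x
    total = len(scores)        # N: total number of PSMs
    survival, above = {}, 0
    for x in sorted(counts, reverse=True):
        survival[x] = above / total   # fraction of scores strictly above x
        above += counts[x]
    return survival

def e_value(x, survival, n_sequences):
    """Expected number of random PSMs scoring better than x."""
    return n_sequences * survival[x]

# Toy binned scores, for illustration only.
s = survival_function([1, 1, 2, 3])
print(s, e_value(2, s, 1000))
```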

X!Tandem Hyperscore

•  The hyperscore (HS) is calculated by multiplying the dot product with the factorials of the number of assigned b and y ions:

$HS = \left( \sum_{i=1}^{n} I_i P_i \right) \cdot N_b! \cdot N_y!$

•  The use of the factorials is based on the hypergeometric distribution that is assumed for matches of product ions

Fenyö and Beavis, Anal. Chem. 2003, 75, 768-774

http://www.proteomesoftware.com/pdf_files/XTandem_edited.pdf
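As a small illustration, the hyperscore can be computed from a dot product and the counts of matched b and y ions, assuming HS = dp * N_b! * N_y!. The function name and the example numbers are hypothetical.

```python
from math import factorial, log

def hyperscore(dot_product, n_b, n_y):
    """HS = dp * N_b! * N_y! for n_b matched b ions and n_y matched y ions."""
    return dot_product * factorial(n_b) * factorial(n_y)

# Hypothetical match: dp = 17.0 with 3 matched b ions and 2 matched y ions.
hs = hyperscore(17.0, 3, 2)
print(hs, log(hs))
```

Because the factorials grow quickly, hyperscores are usually inspected on a logarithmic scale.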

[Figure: p(x) vs. ln(x) for hyperscores, with the valid PSM marked]

•  If p(x) is plotted as a function of the log-transformed hyperscores, the valid PSM is much better separated from the bulk of incorrect assignments

One Hit Wonders

•  In many cases, proteins are identified through a single peptide-spectrum match (PSM) only
•  These ‘single hit wonders’ have long been considered problematic: a single false PSM can lead to a wrongly identified protein
•  In fact, the so-called ‘Paris guidelines’ for data deposition in proteomics recommend only reporting identifications for which at least two peptides have been identified
•  This also became known as the ‘two peptide rule’
•  Obviously, just dropping the majority of PSMs is inadequate to address this problem
•  Question:
   •  How large is the error rate in the identifications?
   •  Which identifications can be trusted?

Bradshaw RA, Burlingame AL, Carr S, Aebersold R. Mol Cell Proteomics 2006, 5:787-8

http://www.mcponline.org/site/misc/ParisReport_Final.xhtml

Target-decoy databases

Elias and Gygi, Nature Methods, Vol. 4, No. 3, March 2007

[Figure: design of decoy sequences and separation of target and decoy results]

Although different decoy database designs produce very similar results, the most frequently used approaches are the reversed and pseudo-reversed decoy databases
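The two decoy designs can be sketched as follows. The sketch assumes that "pseudo-reversed" means reversing each tryptic peptide while keeping its C-terminal K or R in place, and it uses a simplified trypsin rule (cleave after every K and R, no proline exception). Function names and the toy sequence are illustrative.

```python
import re

def reversed_decoy(protein):
    """Reverse the whole protein sequence."""
    return protein[::-1]

def pseudo_reversed_decoy(protein):
    """Reverse each tryptic peptide but keep its C-terminal K/R in place."""
    # Simplified digestion: split after every K or R.
    peptides = re.findall(r"[^KR]*[KR]|[^KR]+$", protein)
    decoy = []
    for pep in peptides:
        if pep[-1] in "KR":
            decoy.append(pep[:-1][::-1] + pep[-1])
        else:
            decoy.append(pep[::-1])   # C-terminal peptide without K/R
    return "".join(decoy)

# Toy sequence, for illustration only.
print(reversed_decoy("ACDKEFGRHI"))
print(pseudo_reversed_decoy("ACDKEFGRHI"))
```

The pseudo-reversed design preserves the peptide masses and cleavage sites of the target database, so target and decoy peptides compete for the same precursor masses.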

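FDR Calculation

•  General equation for FDR calculation

$FDR = \frac{FP}{FP + TP}$

There are two ways to calculate FDRs based on target-decoy search results, where $N_{target}$ and $N_{decoy}$ are the numbers of target and decoy PSMs above the score threshold:

•  Käll et al. suggest

$FDR = \frac{N_{decoy}}{N_{target}}$

(Käll et al., J. Proteome Res. 2008, 7, 29–34)

•  Zhang et al. suggest

$FDR = \frac{2 \cdot N_{decoy}}{N_{decoy} + N_{target}}$

(Zhang et al., J. Proteome Res. 2007, 6(9), 3549–3557)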
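Both estimates can be computed from a scored PSM list. The sketch assumes the commonly quoted forms FDR = N_decoy / N_target (attributed to Käll et al.) and FDR = 2 * N_decoy / (N_decoy + N_target) (attributed to Zhang et al.); the function names, the (score, is_decoy) representation, and the toy data are illustrative.

```python
def count_above(psms, threshold):
    """psms: iterable of (score, is_decoy); a higher score is better.
    Returns (n_target, n_decoy) at or above the threshold."""
    psms = list(psms)
    n_target = sum(1 for score, decoy in psms if score >= threshold and not decoy)
    n_decoy = sum(1 for score, decoy in psms if score >= threshold and decoy)
    return n_target, n_decoy

def fdr_decoy_over_target(n_target, n_decoy):
    """Ratio of decoy to target hits (form attributed to Käll et al.)."""
    return n_decoy / n_target

def fdr_doubled_decoys(n_target, n_decoy):
    """Twice the decoy hits over all hits (form attributed to Zhang et al.)."""
    return 2 * n_decoy / (n_target + n_decoy)

# Toy PSM list, for illustration only.
psms = [(10, False), (9, False), (8, True), (7, False), (6, False), (5, True)]
n_t, n_d = count_above(psms, threshold=6)
print(fdr_decoy_over_target(n_t, n_d), fdr_doubled_decoys(n_t, n_d))
```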

Other Search Engines

•  OMSSA
   •  Open-source package
   •  Fast
•  SEQUEST
   •  One of the commercial standard packages
   •  Commercial software (Thermo Fisher Scientific, http://www.thermofisher.com/)
•  Mascot from Matrix Science
   •  Mascot is one of the most popular search engines
   •  Commercial software (http://www.matrixscience.com/)
•  Phenyx
   •  Commercial software
   •  Colinge et al., Proteomics (2003), 3(8):1454-1463.
•  InsPecT
   •  Very fast open-source search engine designed for the identification of posttranslational modifications
   •  Tanner et al., J Proteome Res. (2005), 4(4):1287-95.
•  MyriMatch
   •  Open source
   •  Tabb et al., J Proteome Res. (2007), 6(2):654-61.
•  …

Identifying Proteins

•  Identification methods so far only identify peptide-spectrum matches (PSMs)
   •  Search a database
   •  Return a ranked list of PSMs with associated scores
•  PSM false discovery rates (FDRs) can be computed through a target-decoy approach
•  An FDR of 1% would mean that 1% of the PSMs with a score above the threshold are expected to be incorrect
•  Note that this is per se a statement on the individual PSM, not per peptide or protein!

Identifying Proteins

•  Each PSM above the threshold contributes
   •  a match of a spectrum to a peptide
   •  a match of a peptide to a protein
•  Peptides are not necessarily unique!
•  The length distribution of observed peptides deviates from the theoretical distribution: short peptides (length 6 and shorter) are usually not observed

Danielle L. Swaney; Craig D. Wenger; Joshua J. Coon; J. Proteome Res. 2010, 9, 1323-1329.

Uniqueness

•  If we are interested in proteomics (in contrast to peptide identification in metabolomics, MHC ligandomics, etc.), we want to quantify proteins
•  Non-unique peptide sequences can stem from different proteins
•  Obviously, uniqueness depends on the chosen database
•  Uniqueness becomes more likely for longer peptide sequences
•  Reasons for non-uniqueness
   •  Chance hits
   •  Different isoforms
   •  Conserved regions shared within a protein family

Uniqueness

•  Uniqueness depends on the size of the database
•  Searching an appropriate (non-redundant) database is thus preferable
•  Reference databases (SwissProt) usually contain few degenerate (non-unique) tryptic peptides above a mass of 750 Da
•  Problem: isoforms of proteins/splice variants!

Nesvizhskii A I, Aebersold R, Mol Cell Proteomics 2005;4:1419-1440.

Uniqueness

Qeli & Ahrens, Nature Biotechnology 28, 647–650 (2010)


Protein Isoforms

•  NextProt Release 3.0
   •  20,110 human proteins
   •  35,978 sequences resulting from alternative isoforms
•  On average 2.75 different splice variants for each protein sequence
•  Some proteins have a much larger number of variants
•  Resolving the different isoforms is only possible if peptides crossing the right exon boundaries are observed

NextProt Release 3.0, 2011-12-09, http://www.nextprot.org/db/statistics/release?viewas=numbers

Protein Isoforms

•  Phosphodiesterase 9A has 16 documented isoforms
•  Peptides stemming from the second half of the sequence are entirely indistinguishable between isoforms

http://www.nextprot.org/db/entry/NX_O76083/structures

Protein Isoforms

Nesvizhskii A I , Aebersold R Mol Cell Proteomics 2005;4:1419-1440.


Protein Families

•  Sequence coverage is often poor in large-scale studies: many proteins are identified through very few peptides only
•  In prokaryotes, typically over 90% of the identified peptides are unique in the whole proteome
•  In particular in eukaryotes, the large number of orthologs leads to significant sequence identity between different proteins that are not isoforms
•  In eukaryotes, the number of unique identified peptides can thus easily drop below 50% (Gupta & Pevzner, 2009)


Parsimony-Based Inference

•  Idea: find the smallest set of proteins explaining all observed peptides
•  If all peptides mapping to one protein family can be explained by a single protein, then it is quite likely that only this protein is present (but this need not be the case)
•  Basically: applying Occam’s razor to the dataset, i.e., finding the simplest explanation possible (maximum parsimony)
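Finding the smallest explaining protein set is an instance of the set cover problem, so exact solutions are expensive for large datasets. The following is a greedy approximation, not necessarily the method used by any particular tool; the function name and the toy mapping are assumptions for illustration.

```python
def greedy_parsimony(protein_to_peptides):
    """Greedy set cover: repeatedly pick the protein that explains the most
    still-unexplained peptides. Approximates the minimal protein list.

    protein_to_peptides: dict protein -> set of observed peptides.
    """
    unexplained = set().union(*protein_to_peptides.values())
    chosen = []
    while unexplained:
        # Sorting the names makes ties deterministic (alphabetical order).
        best = max(sorted(protein_to_peptides),
                   key=lambda p: len(protein_to_peptides[p] & unexplained))
        chosen.append(best)
        unexplained -= protein_to_peptides[best]
    return chosen

# Toy example: A alone explains every observed peptide.
observed = {"A": {"p1", "p2", "p3"}, "B": {"p1", "p2"}, "C": {"p3"}}
print(greedy_parsimony(observed))
```

The greedy choice is not guaranteed to be minimal in general, and ties between equally good proteins are broken arbitrarily, which is one reason ambiguity groups are reported rather than single proteins.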

Parsimony-Based Inference

•  Scenarios for different proteins given a set of observed peptides
   •  Distinct proteins do not share peptides
   •  Differentiable proteins can be distinguished by at least one distinct peptide
   •  Indistinguishable proteins share all peptides
   •  Subset proteins contain only peptides also contained in one other protein
   •  Subsumable proteins contain only peptides that are also contained in several other proteins, which together account for all of them

Protein Ambiguity Groups

Example:

[Figure: three proteins A, B, and C with shared peptides]

•  Note that even though the presence of A is sufficient to explain all observed peptides, this does not automatically imply the absence of B and C
•  The data is explained equally well by the presence of A, A + B, A + C, B + C, or A + B + C
•  The set of proteins sharing one or multiple peptides is often referred to as a protein ambiguity group

Parsimony-Based Inference

•  Maximum parsimony inference results in a minimal list of proteins
•  It thus retains all distinct and differentiable proteins, since each of them is supported by at least one peptide that no other protein explains
•  It does not contain any subsumable or subset proteins
•  In the previous example, A would be sufficient to explain the observed peptides; B and C would not be reported

[Figure: proteins A, B, and C from the previous example]

Reporting of PAGs

Significance of Inferred Hits

•  What is the meaning of a PSM for a protein identification?
   •  FDR is calculated on the PSM level
   •  1% FDR means that one in 100 PSMs is expected to be incorrect and can thus yield an incorrect protein identification
•  This does not mean that there is also an FDR of 1% on the protein level!
•  In particular in large-scale studies (tens of thousands of spectra), protein FDRs are much higher than peptide FDRs
   •  PSMs for a large number of (mostly) identical samples
   •  The number of correctly identified proteins does not increase significantly with the number of spectra (it is always the same proteins being identified; additional correct PSMs do not increase the number of proteins)
   •  The number of false positives increases with the number of PSMs (false PSMs yield hits to random proteins, so initially mostly novel false positives!)

Protein FDRs

•  Error rates increase when going from peptides to proteins
•  Correct peptide IDs tend to group into a small set of correct proteins
•  Incorrect IDs are semi-random and scatter over the whole protein database

A. Nesvizhskii, J. Proteomics (2010), 73:2092-2123
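A back-of-the-envelope calculation shows how quickly the error rate can grow. The sketch assumes the worst case in which every false PSM hits a different, otherwise unidentified protein; the function name and all numbers are hypothetical and only meant to illustrate the trend.

```python
def protein_fdr(n_psms, psm_fdr, n_true_proteins):
    """Worst-case protein-level FDR if each false PSM introduces
    one additional, incorrect protein identification."""
    false_proteins = n_psms * psm_fdr
    return false_proteins / (false_proteins + n_true_proteins)

# Hypothetical sample with 500 truly present, identified proteins.
for n_psms in (10_000, 100_000):
    print(n_psms, round(protein_fdr(n_psms, 0.01, 500), 3))
```

With 10,000 PSMs at 1% PSM FDR, roughly 100 false proteins join 500 correct ones (about 17% protein FDR); with ten times more spectra of the same sample the correct set stays the same while the false set keeps growing.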

ProteinProphet

•  ProteinProphet is an open-source software tool for protein inference and currently one of the standard tools in the area
•  Key ideas
   •  Maximum parsimony approaches to compile protein lists
   •  Reporting of protein ambiguity groups
   •  Protein probability estimation: estimate the probability that a given protein is correctly identified given all evidence for it

Nesvizhskii, et al., Anal. Chem. (2003), 75, 4646-4658

ProteinProphet - Overview

PeptideProphet

•  Peptide Probability Estimates (PPE)
   •  Computed by PeptideProphet
   •  Converts search engine scores into a probability (1 - posterior error probability)
•  Similar ideas have been discussed in the context of consensus identification
•  PeptideProphet uses expectation maximization to fit a mixture model of the score distributions of correct and incorrect PSMs
•  Given a PSM and a search engine score, we can thus compute the probability that the PSM is correct
•  In contrast to a (raw) score, PPEs are a simple way to determine the trust in each individual PSM
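The mixture-model idea can be illustrated with a deliberately simplified sketch: expectation maximization for two one-dimensional Gaussians, one for incorrect and one for correct PSMs. The choice of two Gaussians, all function names, and the simulated scores are assumptions for illustration; PeptideProphet's actual score model is more elaborate.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def posterior_correct(x, weight, mu, sigma):
    """Posterior probability that a PSM with score x is correct."""
    incorrect = weight[0] * normal_pdf(x, mu[0], sigma[0])
    correct = weight[1] * normal_pdf(x, mu[1], sigma[1])
    return correct / (incorrect + correct)

def fit_two_gaussians(scores, n_iter=100):
    """EM for a two-component mixture; component 0 = incorrect, 1 = correct."""
    mu = [min(scores), max(scores)]
    spread = (max(scores) - min(scores)) / 4 or 1.0
    sigma = [spread, spread]
    weight = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of the "correct" component for each score
        resp = [posterior_correct(x, weight, mu, sigma) for x in scores]
        # M-step: re-estimate weight, mean, and standard deviation
        for k, r in enumerate(([1 - p for p in resp], resp)):
            total = sum(r)
            weight[k] = total / len(scores)
            mu[k] = sum(ri * x for ri, x in zip(r, scores)) / total
            var = sum(ri * (x - mu[k]) ** 2 for ri, x in zip(r, scores)) / total
            sigma[k] = max(math.sqrt(var), 1e-6)
    return weight, mu, sigma

# Simulated scores: 300 incorrect PSMs around 0, 100 correct PSMs around 6.
random.seed(0)
scores = [random.gauss(0, 1) for _ in range(300)] + [random.gauss(6, 1) for _ in range(100)]
weight, mu, sigma = fit_two_gaussians(scores)
print(mu, posterior_correct(7.0, weight, mu, sigma))
```

The posterior returned by `posterior_correct` plays the role of the PPE: it depends on the score and on how the two populations are distributed in the dataset at hand.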

Protein Probability Estimates

•  Given the PPEs, we can easily compute the probability for each of the induced protein IDs
•  Assuming all peptides are unique, we can compute the probability P for a protein identification as 1 minus the probability of all peptide identifications inducing this protein being wrong
•  We could do this on the peptide level quite simply as follows:

$P = 1 - \prod_i (1 - p_i)$

with probabilities $p_i$ for the peptide identification of peptide i being correct

•  However, we also need to consider multiple evidence: different spectra can give evidence for the same peptide

Protein Probability Estimates

•  We thus need to consider probabilities for each PSM independently
•  Each PSM is assigned a PPE by PeptideProphet
•  The probability that a protein is not present in a sample despite its PSMs depends on the probabilities $p(+|D_i^j)$ for the peptide ID of peptide i based on the observed data (spectrum) j being correct
•  We can thus compute P based on the PPEs of all PSMs:

$P = 1 - \prod_i \prod_j \left(1 - p(+|D_i^j)\right)$


•  There are a few problems with this:
   •  PSMs are not independent: there is a high probability for multiple spectra of the same peptide to hit the same incorrect ID if the spectra are of high quality but do not match the database (e.g., due to post-translational modification)
   •  Ambiguous peptide-protein matches: if a peptide matches multiple proteins, its evidence cannot simply be shared across these proteins


•  A simple way to deal with multiple PSMs is to
   •  Include each peptide just once
   •  Consider only the PSM with the best PPE of all PSMs to the same peptide: $p_i = \max_j p(+|D_i^j)$
•  P would then be computed as follows:

$P = 1 - \prod_i \left(1 - \max_j p(+|D_i^j)\right)$

•  This procedure yields a more conservative estimate of protein probabilities

ProteinProphet

After: Nesvizhskii, et al., Anal. Chem. (2003), 75, 4646-4658

Example:

>gi|125910|sp|P02754.3|LACB_BOVIN
MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDL
EILLQKWENGECAQKKIIAEKTKIPAVFKIDALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQCLVR
TPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI

VYVEELKPTPEGDLEILLQK : p = 0.81
LSFNPTQLEEQCHI : p = 0.48
LSFNPTQLEEQCHI : p = 0.65 (max = 0.65)
TPEVDDEALEK : p = 0.91

P(LACB_BOVIN) = 1 - (1 - 0.81) (1 - 0.91) (1 - 0.65) = 0.99
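The LACB_BOVIN example can be reproduced with a few lines of Python, assuming the best PPE per peptide is kept and P = 1 - prod_i (1 - p_i). The function name and the (peptide, probability) input format are illustrative choices.

```python
from math import prod

def protein_probability(psm_probabilities):
    """psm_probabilities: list of (peptide, PPE) pairs for one protein.
    Keeps the best PPE per peptide, then returns 1 - prod_i (1 - p_i)."""
    best = {}
    for peptide, p in psm_probabilities:
        best[peptide] = max(p, best.get(peptide, 0.0))
    return 1.0 - prod(1.0 - p for p in best.values())

psms = [
    ("VYVEELKPTPEGDLEILLQK", 0.81),
    ("LSFNPTQLEEQCHI", 0.48),
    ("LSFNPTQLEEQCHI", 0.65),
    ("TPEVDDEALEK", 0.91),
]
print(round(protein_probability(psms), 2))
```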

Sibling Peptides

•  Correct assignments tend to cluster to the same proteins
•  Incorrect assignments tend to be hits to proteins with no other assigned peptides
•  As a result, the computed PPEs, while correct in the context of the whole dataset, need to be corrected for an accurate estimate in the context of their source protein
•  ProteinProphet introduces the notion of sibling peptides
   •  Sibling peptides are peptides hitting the same protein
   •  Rather than counting them, ProteinProphet defines the number of sibling peptides $NSP_i$ for a peptide i as the sum of the PPEs:

$NSP_i = \sum_{m \neq i} p_m$

where the sum runs over all peptides m hitting the same protein as i, and the PPEs $p_m$ are the maximum values for a given peptide reached in the dataset

Sibling Peptides

Example:

>gi|125910|sp|P02754.3|LACB_BOVIN
MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDL
EILLQKWENGECAQKKIIAEKTKIPAVFKIDALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQCLVR
TPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI

VYVEELKPTPEGDLEILLQK : p = 0.81
LSFNPTQLEEQCHI : p = 0.48
LSFNPTQLEEQCHI : p = 0.65 (max = 0.65)
TPEVDDEALEK : p = 0.91

NSP(VYV…) = 0.91 + 0.65 = 1.56
NSP(TPE…) = 0.65 + 0.81 = 1.46
NSP(LSF…) = 0.91 + 0.81 = 1.72
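The NSP values of the example follow from summing the best PPEs of all other peptides of the same protein. A minimal sketch, with an illustrative function name and the peptides abbreviated as on the slide:

```python
def sibling_scores(best_ppe):
    """best_ppe: dict peptide -> best PPE, all peptides of one protein.
    Returns dict peptide -> NSP (sum of the PPEs of its siblings)."""
    total = sum(best_ppe.values())
    return {peptide: total - p for peptide, p in best_ppe.items()}

nsp = sibling_scores({"VYV": 0.81, "LSF": 0.65, "TPE": 0.91})
print({peptide: round(value, 2) for peptide, value in nsp.items()})
```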

Sibling Peptides

•  Intuitively, one would trust identifications with a high NSP more than those with a low NSP (more evidence per protein)
•  We can thus refine PPEs in the context of the source protein as follows:

$p(+|D, NSP) = \frac{p(+|D) \, p(NSP|+)}{p(+|D) \, p(NSP|+) + p(-|D) \, p(NSP|-)}$

with
•  p(NSP|+) and p(NSP|-) being the probabilities of having a particular NSP value for correct/incorrect assignments
•  p(+|D) and p(-|D) being the uncorrected probabilities for the peptide assignment being correct/incorrect
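The refinement is a direct application of Bayes' rule. The sketch below assumes the adjusted probability is p(+|D) p(NSP|+) divided by the sum p(+|D) p(NSP|+) + p(-|D) p(NSP|-), with p(-|D) = 1 - p(+|D); the function name and the example likelihoods are hypothetical.

```python
def nsp_adjusted(p_correct, p_nsp_given_correct, p_nsp_given_incorrect):
    """Adjust a PPE by how typical its NSP value is for correct
    versus incorrect assignments."""
    numerator = p_correct * p_nsp_given_correct
    denominator = numerator + (1.0 - p_correct) * p_nsp_given_incorrect
    return numerator / denominator

# A PPE of 0.5 whose NSP value is four times more typical of correct hits.
print(nsp_adjusted(0.5, 0.8, 0.2))
```

If an NSP value is equally likely for correct and incorrect assignments, the PPE is left unchanged.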

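Sibling Peptides

•  Values for p(NSP|+) and p(NSP|-) can be computed for the whole dataset
•  NSP values are binned and counted for correct and incorrect assignments

$p(NSP|+) = \frac{1}{N \, p(+)} \sum_{i:\, NSP_i \in \text{bin}(NSP)} p(+|D_i)$

$p(NSP|-) = \frac{1}{N \, (1 - p(+))} \sum_{i:\, NSP_i \in \text{bin}(NSP)} p(-|D_i)$

where N is the total number of peptide assignments and p(+) is the prior probability of a peptide identification being correct

•  p(+) can be computed by summation over all peptide identifications of the dataset:

$p(+) = \frac{1}{N} \sum_i p(+|D_i)$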

NSP Distributions

•  NSP distributions can be determined using expectation maximization
•  As a first guess, unadjusted p(+|D) values are used to compute an estimated NSP value for each assignment
•  Applying EM then yields adjusted probabilities; this is repeated until convergence has been reached
•  NSP distributions depend on the dataset and the dataset size

[Figure: NSP distribution for datasets of varying size]
•  squares: single run of a low-complexity sample
•  circles: four runs of the same sample
•  triangles: 22 runs

Influence of NSP Correction

•  NSP correction yields better predictions of protein probabilities

[Figure: predicted vs. true protein probabilities with and without NSP]
•  Different lines correspond to different datasets
•  Dotted line: perfect prediction

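Protein Ambiguity

•  Shared peptides within a PAG cause issues as well
•  Their probabilities can be distributed over their potential source proteins through a weighting scheme based on the protein probabilities:

$w_i^n = \frac{P_n}{\sum_{n'} P_{n'}}$, with $P_n = 1 - \prod_i \left(1 - w_i^n \, p_i\right)$

where the sum runs over all proteins n' containing peptide i

•  Weights $w_i^n$ are again estimated iteratively using an EM-like algorithm

[Figure: peptide 1 (probability p1) maps to protA and protB with weights w1A and w1B; peptide 2 (probability p2) maps to protB with weight w2B; the proteins have probabilities PA and PB]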
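The iterative weighting can be sketched for the two-peptide, two-protein situation of the figure. The update rule assumed here is w_i^n = P_n / (sum of P_n' over all proteins sharing peptide i) together with P_n = 1 - prod_i (1 - w_i^n p_i); the function name, the fixed number of iterations, and the probabilities 0.9 and 0.8 are illustrative assumptions, not values from the lecture.

```python
from math import prod

def apportion(peptide_prob, peptide_to_proteins, n_iter=200):
    """Iteratively share peptide evidence among candidate proteins.

    peptide_prob: dict peptide -> PPE
    peptide_to_proteins: dict peptide -> list of proteins containing it
    Returns (P, w): protein probabilities and weights w[(peptide, protein)].
    """
    proteins = sorted({n for ps in peptide_to_proteins.values() for n in ps})
    # Start by splitting every peptide evenly among its proteins.
    w = {(i, n): 1.0 / len(ps)
         for i, ps in peptide_to_proteins.items() for n in ps}
    P = {}
    for _ in range(n_iter):
        # Protein probabilities from the currently weighted peptide evidence
        for n in proteins:
            P[n] = 1.0 - prod(1.0 - w[i, n] * peptide_prob[i]
                              for i, ps in peptide_to_proteins.items() if n in ps)
        # Re-weight each peptide in proportion to the protein probabilities
        for i, ps in peptide_to_proteins.items():
            total = sum(P[n] for n in ps)
            for n in ps:
                w[i, n] = P[n] / total if total > 0 else 1.0 / len(ps)
    return P, w

P, w = apportion({"pep1": 0.9, "pep2": 0.8},
                 {"pep1": ["A", "B"], "pep2": ["B"]})
print({n: round(p, 3) for n, p in P.items()})
```

In this toy case the shared peptide is attributed almost entirely to protein B, which also has independent evidence, so protein A ends up with a probability close to zero.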

Protein Ambiguity Group



References

Papers:
•  Craig, R. and Beavis, R.C. (2003) Rapid Commun. Mass Spectrom., 17, 2310–2316.
•  Colinge J, Bennett KL. Introduction to Computational Proteomics. PLoS Comput Biol 3:e114. (http://dx.doi.org/10.1371/journal.pcbi.0030114)
•  Nesvizhskii A I, Aebersold R, Interpretation of Shotgun Proteomics Data, Mol Cell Proteomics 2005;4:1419-1440
•  Nesvizhskii, Keller, Kolker, Aebersold, A Statistical Model for Identifying Proteins by Tandem Mass Spectrometry, Anal. Chem. 2003, 75, 4646-4658.
•  Keller, Nesvizhskii, Kolker, Aebersold, Empirical Statistical Model to Estimate the Accuracy of Peptide Identifications Made by MS/MS and Database Search, Anal. Chem. 2002, 74, 5383-5392

Links:
•  ProteinProphet: http://proteinprophet.sourceforge.net
•  OMSSA online server: http://pubchem.ncbi.nlm.nih.gov/omssa/
•  MASCOT online server: http://www.matrixscience.com/search_form_select.html
•  Peptide Atlas, a database of peptide spectra and identifications: http://www.peptideatlas.org/
