BIOINF 4120 Bioinformatics 2 - Structures and Systems -
Oliver Kohlbacher, Summer 2012
17. Protein Identification


Overview

•  Peptide fragment spectra
   •  Mass spectrometry
   •  Fragmentation mechanisms
   •  Comparison of spectra
•  Peptide ID by database search
   •  Problem definition
   •  X!Tandem
•  Protein inference
   •  Problem definition
   •  Algorithms
   •  ProteinProphet

Shotgun Proteomics

Key ideas
•  Separation of whole proteins is possible but difficult, hence digestion is preferred
•  Usually trypsin is used: it cuts after K and R and yields peptides suitable for MS (positive charge at the C-terminal end)
•  Separating peptides is easy
•  Proteins are then identified through their peptides
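The digestion rule above can be sketched in a few lines. This is a minimal illustration that cleaves after every K and R, as stated on the slide; the function name `tryptic_digest` is hypothetical, and refinements such as missed cleavages are deliberately left out.

```python
def tryptic_digest(sequence):
    """Split a protein sequence after every K and R (simplified trypsin rule)."""
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        if residue in "KR":
            # The cleavage site belongs to the peptide that ends here.
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Trailing peptide without a C-terminal K/R (protein C-terminus).
        peptides.append(sequence[start:])
    return peptides


print(tryptic_digest("MGLSDGEWQLVLNVWGKASEDLKFDK"))
```

Every peptide except possibly the last one ends in K or R, which is what gives tryptic peptides their charge-carrying C-terminal residue.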

[Figure: shotgun proteomics workflow. Proteins are digested into a peptide digest (e.g. MGLSDGEWQLVLNVWGK, HPGDFGADAQGAMSK, ASEDLK, FDK), and the peptides are then separated.]

Tandem Mass Spectrometry

•  MS can be done in two stages: the first stage separates ions by m/z
•  Ions of interest are then selected, trapped, fragmented by CID, and analyzed by a second MS stage
•  These tandem mass spectra or MS/MS spectra allow the identification of the peptides quantified in the first MS stage

http://www.nature.com/nrd/journal/v2/n2/full/nrd1011.html

Peptide Fragmentation

•  Collision-induced dissociation (CID) allows the fragmentation of molecules through collision with a neutral gas
•  The gas molecules transfer their kinetic energy to the analytes
•  Bond cleavages occur, resulting in characteristic fragment ions
•  Peptides fragment preferentially along the peptide backbone
•  This gives rise to several series of fragment ions, of which b and y ions are the most common

[Figure: peptide fragment spectrum; x-axis m/z (0 to 1000), y-axis % intensity]

Peptide Sequencing

•  From the series of b/y ions (ladders) one can reconstruct the peptide sequence

[Figure: MS/MS spectrum of the peptide SGEFLEEDELK annotated with b ions (b3 to b9), y ions (y2 to y9) and the precursor [M+2H]2+; x-axis m/z (0 to 1000), y-axis % intensity]

Ladders

•  The b and y ions form two series of peaks, the so-called b- and y-ladders
•  The distance between adjacent b ions or adjacent y ions corresponds to the mass of one amino acid residue
•  Walking along the peaks thus yields the residue masses and in turn the sequence (at least in theory!)

[Figure: annotated fragment spectrum of the peptide IYEVEGMR]
http://www.nature.com/nmeth/journal/v1/n3/fig_tab/nmeth725_F2.html

Amino Acid Masses

AA    Chemical formula   Monoisotopic [Da]   Average [Da]
Ala   C3H5ON             71.03711            71.0788
Arg   C6H12ON4           156.10111           156.1875
Asn   C4H6O2N2           114.04293           114.1038
Asp   C4H5O3N            115.02694           115.0886
Cys   C3H5ONS            103.00919           103.1388
Glu   C5H7O3N            129.04259           129.1155
Gln   C5H8O2N2           128.05858           128.1307
Gly   C2H3ON             57.02146            57.0519
His   C6H7ON3            137.05891           137.1411
Ile   C6H11ON            113.08406           113.1594
Leu   C6H11ON            113.08406           113.1594
Lys   C6H12ON2           128.09496           128.1741
Met   C5H9ONS            131.04049           131.1926
Phe   C9H9ON             147.06841           147.1766
Pro   C5H7ON             97.05276            97.1167
Ser   C3H5O2N            87.03203            87.0782
Thr   C4H7O2N            101.04768           101.1051
Trp   C11H10ON2          186.07931           186.2132
Tyr   C9H9O2N            163.06333           163.1760
Val   C5H9ON             99.06841            99.1326

Amino Acid Masses

•  Leu and Ile (L/I) are structural isomers
•  They thus have identical mass and cannot be distinguished!
•  Fragments with the same mass are called isobaric
•  Gln and Lys (Q/K) have nearly identical masses: 128.05858 Da and 128.09496 Da (a difference of about 0.036 Da)
•  For low-resolution instruments they are indistinguishable, too

AA    Chemical formula   Monoisotopic [Da]   Average [Da]
Leu   C6H11ON            113.08406           113.1594
Ile   C6H11ON            113.08406           113.1594
Gln   C5H8O2N2           128.05858           128.1307
Lys   C6H12ON2           128.09496           128.1741
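To make the ladder idea concrete, the following sketch computes singly charged b- and y-ion m/z values from the monoisotopic residue masses tabulated above and then reads residues back from the gaps between adjacent peaks. The proton and water masses (1.007276 Da and 18.010565 Da) and all function names are assumptions added for this illustration; they are not part of the slides.

```python
# Monoisotopic residue masses in Da, taken from the amino acid mass table.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276   # assumed mass of the charge-carrying proton
WATER = 18.010565   # assumed monoisotopic mass of H2O


def b_ions(peptide):
    """Singly charged b-ion m/z values b1 .. b(n-1) (N-terminal fragments)."""
    total, ladder = 0.0, []
    for residue in peptide[:-1]:
        total += MONO[residue]
        ladder.append(total + PROTON)
    return ladder


def y_ions(peptide):
    """Singly charged y-ion m/z values y1 .. y(n-1) (C-terminal fragments)."""
    total, ladder = WATER, []
    for residue in reversed(peptide[1:]):
        total += MONO[residue]
        ladder.append(total + PROTON)
    return ladder


def read_ladder(mz_values, tolerance=0.02):
    """Translate gaps between consecutive ladder peaks into residue letters."""
    residues = []
    for lower, upper in zip(mz_values, mz_values[1:]):
        gap = upper - lower
        hits = sorted(aa for aa, mass in MONO.items() if abs(mass - gap) <= tolerance)
        residues.append("/".join(hits) or "?")
    return residues


print(read_ladder(b_ions("IYEVEGMR")))
```

With a tight tolerance, a gap of 113.08406 Da is reported as `I/L` because the two residues are isobaric; with a tolerance of 0.5 Da, as for a low-resolution instrument, a gap near 128 Da comes back as `K/Q`.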

Peptide Identification

[Figure: database search workflow. An LC-MS/MS experiment yields experimental spectra (fragment m/z values over retention time, RT). Theoretical fragment m/z values are computed for suitable peptides from a sequence database (example entry: Q9NSC5|HOME3_HUMAN, Homer protein homolog 3, Homo sapiens). Experimental and theoretical spectra are compared and the hits are scored.]

Example of scored hits:
1   QRESTATDILQK   18.77
2   EIEEDSLEGLKK   14.78
3   GIEDDLMDLIKK   12.63

X!Tandem

•  Many different search engines have been proposed that implement this basic database search strategy
•  They differ in their speed, availability, and quality
•  Internally, the differences mainly concern the scoring, preprocessing, and search data structures
•  Here we will discuss the X!Tandem algorithm
   •  Proposed by Craig and Beavis in 2003
   •  We can only discuss the very core of the algorithm; some of the additional tricks and tweaks are beyond the scope of this lecture
   •  There is an additional lecture (BIOINF 4399B "Computational Proteomics and Metabolomics") discussing many of these issues in more detail
   •  http://www.thegpm.org/tandem/instructions.html

Craig, R. and Beavis, R.C. (2003) Rapid Commun. Mass Spectrom., 17, 2310–2316.

Scoring Spectra

2.  Compare the theoretical spectra of all candidate peptides to the experimental spectrum S

[Figure: experimental spectrum next to several theoretical spectra; x-axis m/z, y-axis intensity]

Find Overlapping Masses

[Figure: experimental spectrum S (intensity, 0 to 100 %) aligned with an exemplified theoretical spectrum (0/1)]

To find overlapping masses, a maximal fragment mass tolerance window needs to be set (for ion traps this is usually 0.5 Da)

X!Tandem's Dot Product

•  Reduce the experimental spectrum to only those peaks that match peaks in the theoretical spectrum
•  Calculate the dot product (dp) using the ion intensities and the number of matching ions:

   $dp = \sum_{i=1}^{n} I_i P_i$

   where $I_i$ are the fragment ion intensities from the experimental spectrum and $P_i$ is 1 if peak i is predicted in the theoretical spectrum and 0 otherwise

[Figure: experimental spectrum reduced to the matching peaks; y-axis intensity (0 to 100 %)]
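A minimal sketch of this scoring step, assuming that a peak counts as predicted when any theoretical m/z lies within the fragment mass tolerance (0.5 Da, the ion-trap value mentioned earlier). The function name and the peak-list representation are hypothetical; X!Tandem's actual preprocessing is not reproduced here.

```python
def dot_product(experimental, theoretical_mz, tolerance=0.5):
    """Sum the intensities of experimental peaks that match a theoretical peak.

    experimental   -- list of (m/z, intensity) tuples
    theoretical_mz -- list of predicted fragment m/z values
    Returns the dot product and the number of matched peaks.
    """
    score = 0.0
    matched = 0
    for mz, intensity in experimental:
        # P_i is 1 if the peak is predicted within the tolerance window, else 0.
        predicted = any(abs(mz - t) <= tolerance for t in theoretical_mz)
        if predicted:
            score += intensity
            matched += 1
    return score, matched


spectrum = [(175.1, 40.0), (300.0, 10.0), (304.2, 100.0)]
print(dot_product(spectrum, [175.119, 304.16]))
```

Because the theoretical spectrum only contributes 0 or 1 per peak, the dot product reduces to the summed intensity of the matched peaks.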

Survival Function and E-value

•  Let x represent the dot product score for the experimental spectrum S and a theoretical spectrum
•  p(x) is calculated from the frequency histogram (counts of PSMs per score bin)
•  With f(x), the number of PSMs that are given the score x, p(x) is calculated as

   $p(x) = \frac{f(x)}{N}$, with N the total number of PSMs

[Figure: example of a frequency histogram; x-axis random variable, y-axis frequency]

Fenyö and Beavis, Anal. Chem. 2003, 75, 768-774

Survival Function and E-value

•  The survival function s(x) for a discrete stochastic score probability distribution p(x) is defined as

   $s(x) = P(X > x) = \sum_{x' > x} p(x')$

   where P(X > x) is the probability of obtaining a score greater than x by random matches in a database.

[Figure: p(x) plotted against ln(x), with the valid PSM marked]

Fenyö and Beavis, Anal. Chem. 2003, 75, 768-774

Survival Function and E-value

•  With the survival function s(x), we can calculate the E-value e(x), indicating the number of PSMs that are expected to have scores of x or better:

   $e(x) = n \cdot s(x)$

   where n is the number of sequences in the database
•  Now, each PSM can be ranked according to e(x)

[Figure: p(x) plotted against ln(x), with the valid PSM marked]

Fenyö and Beavis, Anal. Chem. 2003, 75, 768-774
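The chain from score histogram to E-value can be sketched as follows. This is an illustration of the definitions only: the function names are hypothetical and scores are assumed to be already binned.

```python
from collections import Counter


def score_distribution(scores):
    """p(x) = f(x) / N, where f(x) counts the PSMs in score bin x."""
    counts = Counter(scores)
    total = len(scores)
    return {x: f / total for x, f in counts.items()}


def survival(p, x):
    """s(x) = P(X > x): probability mass of all bins with a score above x."""
    return sum(prob for score, prob in p.items() if score > x)


def e_value(p, x, n_sequences):
    """e(x) = n * s(x) for n scored sequences."""
    return n_sequences * survival(p, x)


p = score_distribution([1, 1, 1, 1, 2, 2, 2, 3, 3, 4])
print(survival(p, 2), e_value(p, 3, 1000))
```

An empirical survival function is zero beyond the highest observed score, so a practical implementation has to extrapolate the tail of the distribution in some way; that step is not shown here.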

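X!Tandem Hyperscore

•  The hyperscore (HS) is calculated by multiplying the dot product with the factorials of the numbers of assigned b and y ions:

   $HS = \left( \sum_{i} I_i P_i \right) \cdot N_b! \cdot N_y!$

   where $I_i$ are the fragment ion intensities, $P_i$ is 1 if peak i is predicted in the theoretical spectrum and 0 otherwise, and $N_b$ and $N_y$ are the numbers of assigned b and y ions
•  The use of the factorials is based on the hypergeometric distribution that is assumed for matches of product ions

Fenyö and Beavis, Anal. Chem. 2003, 75, 768-774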
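As a small worked example of the hyperscore definition, the sketch below multiplies a given dot product with the two factorials. The function name is hypothetical and the input values are made up for illustration.

```python
import math


def hyperscore(dot_product, n_b, n_y):
    """Dot product multiplied by the factorials of the matched b- and y-ion counts."""
    return dot_product * math.factorial(n_b) * math.factorial(n_y)


# A dot product of 140 with 3 matched b ions and 2 matched y ions:
hs = hyperscore(140.0, 3, 2)
print(hs, math.log(hs))
```

The factorials grow quickly with the number of matched ions, which is why hyperscores are usually inspected on a logarithmic scale.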

•  If p(x) is now plotted as a function of the log(hyperscore), the valid PSM is much better separated from the bulk of incorrect assignments

[Figure: p(x) plotted against ln(x) for hyperscores, with the valid PSM marked]

http://www.proteomesoftware.com/pdf_files/XTandem_edited.pdf

One Hit Wonders

•  In many cases, proteins are identified through a single peptide-spectrum match (PSM) only
•  These 'one hit wonders' have long been considered problematic: a single false PSM can lead to a wrongly identified protein
•  In fact, the so-called 'Paris guidelines' for data deposition in proteomics recommend only reporting identifications for which at least two peptides have been identified
•  This also became known as the 'two peptide rule'
•  Obviously, just dropping the majority of PSMs is inadequate to address this problem
•  Questions:
   •  How large is the error rate in the identifications?
   •  Which identifications can be trusted?

Bradshaw RA, Burlingame AL, Carr S, Aebersold R. Mol Cell Prot 2006, 5:787-8
http://www.mcponline.org/site/misc/ParisReport_Final.xhtml

Target-Decoy Databases

[Figure: design of decoy sequences and separation of target and decoy results]

Although different decoy database designs produce very similar results, the most frequently used approaches are the reversed and pseudo-reversed decoy databases

Elias and Gygi, Nature Methods, Vol. 4, No. 3, March 2007
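The two decoy designs named above can be sketched as follows. The reversed decoy is simply the reversed protein sequence. For the pseudo-reversed decoy, this sketch assumes the common definition in which each tryptic peptide is reversed while its C-terminal K or R stays in place; that definition and both function names are assumptions of this illustration.

```python
def reversed_decoy(protein):
    """Reverse the complete protein sequence."""
    return protein[::-1]


def pseudo_reversed_decoy(protein):
    """Reverse every tryptic peptide but keep its C-terminal K/R in place."""
    parts = []
    start = 0
    for i, residue in enumerate(protein):
        if residue in "KR":
            parts.append(protein[start:i][::-1] + residue)
            start = i + 1
    # Remaining C-terminal peptide without a cleavage site.
    parts.append(protein[start:][::-1])
    return "".join(parts)


print(reversed_decoy("MSTAREQPIFSTR"))
print(pseudo_reversed_decoy("MSTAREQPIFSTR"))
```

Both designs preserve the length and amino acid composition of the target sequence; the pseudo-reversed variant additionally preserves the positions of the tryptic cleavage sites and thus the peptide mass distribution.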

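FDR Calculation

•  General equation for FDR calculation:

   $FDR = \frac{FP}{FP + TP}$

There are two ways to calculate FDRs based on target-decoy search results, where $N_{target}$ and $N_{decoy}$ are the numbers of target and decoy PSMs above the score threshold:
•  Käll et al. suggest

   $FDR = \frac{N_{decoy}}{N_{target}}$

•  Zhang et al. suggest

   $FDR = \frac{2 \cdot N_{decoy}}{N_{decoy} + N_{target}}$

(Käll et al., J Proteome Res 2008, 7, 29–34)
(Zhang et al., J Proteome Res 2007;6(9):3549–3557)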
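Both target-decoy estimates can be computed from a scored PSM list as sketched below. The formulas used are the commonly cited forms attributed to the two papers (decoys divided by targets, and twice the decoys divided by all hits); they and the function names should be read as assumptions of this sketch.

```python
def count_hits(psms, threshold):
    """Count target and decoy PSMs scoring at or above the threshold.

    psms -- list of (score, is_decoy) tuples
    """
    n_target = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    n_decoy = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return n_target, n_decoy


def fdr_decoy_over_target(n_target, n_decoy):
    """Estimate attributed to Kall et al.: decoy hits divided by target hits."""
    return n_decoy / n_target


def fdr_doubled_decoy(n_target, n_decoy):
    """Estimate attributed to Zhang et al.: twice the decoy hits over all hits."""
    return 2 * n_decoy / (n_target + n_decoy)


psms = [(10, False), (9, False), (8, False), (7, False),
        (6.5, True), (6, False), (5, False), (4, True)]
n_target, n_decoy = count_hits(psms, threshold=6)
print(fdr_decoy_over_target(n_target, n_decoy), fdr_doubled_decoy(n_target, n_decoy))
```

The first form treats the decoy count as an estimate of the false hits among the targets; the second assumes that each decoy hit is matched by one false target hit and reports the error over the combined list.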

Other Search Engines

•  OMSSA
   •  Open-source package
   •  Fast
•  SEQUEST
   •  One of the commercial standard packages
   •  Commercial software (Thermo Fisher Scientific, http://www.thermofisher.com/)
•  Mascot from Matrix Science
   •  Mascot is one of the most popular search engines
   •  Commercial software (http://www.matrixscience.com/)
•  Phenyx
   •  Commercial software
   •  Colinge et al., Proteomics (2003), 3(8):1454-1463.
•  InsPecT
   •  Very fast open-source search engine designed for the identification of post-translational modifications
   •  Tanner et al., J Proteome Res. (2005), 4(4):1287-95.
•  MyriMatch
   •  Open source
   •  Tabb et al., J Proteome Res. (2007), 6(2):654-61.
•  …

Identifying Proteins

•  Identification methods so far only identify peptide-spectrum matches (PSMs)
   •  Search a database
   •  Return a ranked list of PSMs with associated scores
•  PSM false discovery rates (FDRs) can be computed through a target-decoy approach
•  An FDR of 1% would mean that 1% of the PSMs with a score above the threshold are expected to be incorrect
•  Note that this is per se a statement on the individual PSM, not per peptide or protein!

Identifying Proteins

•  Each PSM above the threshold contributes
   •  a match of a spectrum to a peptide
   •  a match of a peptide to a protein
•  Peptides are not necessarily unique!
•  The length distribution of observed peptides deviates from the theoretical distribution: short peptides (length 6 and shorter) are usually not observed

Danielle L. Swaney; Craig D. Wenger; Joshua J. Coon; J. Proteome Res. 2010, 9, 1323-1329.

Uniqueness

•  If we are interested in proteomics (in contrast to peptide identification in metabolomics, MHC ligandomics etc.), we want to quantify proteins
•  Non-unique peptide sequences can stem from different proteins
•  Obviously, uniqueness depends on the chosen database
•  Uniqueness becomes more likely for longer peptide sequences
•  Reasons for non-uniqueness
   •  Chance hits
   •  Different isoforms
   •  Conserved regions shared within a protein family

Uniqueness

•  Uniqueness depends on the size of the database
•  Searching an appropriate (non-redundant) database is thus preferable
•  Reference databases (SwissProt) usually contain few degenerate (non-unique) tryptic peptides above a mass of 750 Da
•  Problem: isoforms of proteins/splice variants!

Nesvizhskii A I, Aebersold R. Mol Cell Proteomics 2005;4:1419-1440.

Uniqueness

Qeli & Ahrens, Nature Biotechnology 28, 647–650 (2010)

Protein Isoforms

•  NextProt Release 3.0
   •  20,110 human proteins
   •  35,978 sequences resulting from alternative isoforms
•  On average 2.75 different splice variants for each protein sequence
•  Some proteins have a much larger number of variants
•  Resolving the different isoforms is only possible if peptides crossing the right exon boundaries are observed

NextProt Release 3.0, 2011-12-09, http://www.nextprot.org/db/statistics/release?viewas=numbers

Protein Isoforms

•  Phosphodiesterase 9A has 16 documented isoforms
•  Peptides stemming from the second half of the sequence are entirely indistinguishable between isoforms

http://www.nextprot.org/db/entry/NX_O76083/structures

Protein Isoforms

Nesvizhskii A I, Aebersold R. Mol Cell Proteomics 2005;4:1419-1440.

Protein Families

•  Sequence coverage is often poor in large-scale studies: many proteins are identified through very few peptides only
•  In prokaryotes, typically over 90% of the identified peptides are unique in the whole proteome
•  In particular in eukaryotes, the large number of orthologs leads to significant sequence identity between different proteins that are not isoforms
•  In eukaryotes, the number of unique identified peptides can thus easily drop below 50% (Gupta & Pevzner, 2009)

Protein Families

Nesvizhskii A I, Aebersold R. Mol Cell Proteomics 2005;4:1419-1440.

Parsimony-Based Inference

•  Idea: find the smallest set of proteins explaining all observed peptides
•  If all peptides mapping to one protein family can be explained by a single protein, then it is quite likely that only this protein is present (but this need not be the case)
•  Basically: applying Occam's razor to the dataset, i.e. finding the simplest explanation possible (maximum parsimony)

Parsimony-Based Inference

•  Scenarios for different proteins given a set of observed peptides
   •  Distinct proteins do not share peptides
   •  Differentiable proteins can be distinguished by at least one distinct peptide
   •  Indistinguishable proteins share all peptides
   •  Subset proteins contain only peptides also contained in a single other protein
   •  Subsumable proteins contain only peptides that are also contained in other proteins, i.e. several other proteins together explain all of their peptides

Nesvizhskii A I, Aebersold R. Mol Cell Proteomics 2005;4:1419-1440.

Protein Ambiguity Groups

Example:

[Figure: three proteins A, B and C that share observed peptides]

•  Note that even though the presence of A is sufficient to explain all observed peptides, this does not automatically imply the absence of B and C
•  The data is explained equally well by the presence of A, the presence of A + B, A + C, B + C, or A + B + C
•  The set of proteins sharing one or multiple peptides is often referred to as a protein ambiguity group

Parsimony-Based Inference

•  Maximum parsimony inference results in a minimal list of proteins
•  It thus retains all distinct and differentiable proteins of a protein ambiguity group
•  It does not contain any subsumable or subset proteins
•  In the previous example, A would be sufficient to explain the observed peptides; B and C would not be reported

[Figure: three proteins A, B and C that share observed peptides]

Reporting of PAGs

Nesvizhskii A I, Aebersold R. Mol Cell Proteomics 2005;4:1419-1440.
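Finding the smallest protein set that explains all peptides is an instance of the set cover problem, so the sketch below uses the standard greedy approximation rather than an exact solution. It illustrates the parsimony idea only and is not the procedure of any particular tool; the function name and the toy data are hypothetical.

```python
def greedy_parsimony(protein_to_peptides):
    """Greedily pick proteins until every observed peptide is explained.

    protein_to_peptides -- dict mapping protein name to a set of peptides
    Returns the selected proteins in the order they were chosen.
    """
    uncovered = set().union(*protein_to_peptides.values())
    selected = []
    while uncovered:
        # Choose the protein explaining the most still-unexplained peptides;
        # ties are broken alphabetically to keep the result deterministic.
        best = max(sorted(protein_to_peptides),
                   key=lambda prot: len(protein_to_peptides[prot] & uncovered))
        selected.append(best)
        uncovered -= protein_to_peptides[best]
    return selected


example = {"A": {"pep1", "pep2", "pep3"}, "B": {"pep1", "pep2"}, "C": {"pep3"}}
print(greedy_parsimony(example))
```

In this toy example only A is reported, while B and C are dropped even though their presence cannot be excluded, which is exactly the limitation noted for protein ambiguity groups.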

Significance of Inferred Hits

•  What is the meaning of a PSM for a protein identification?
   •  FDR is calculated on the PSM level
   •  1% FDR means that one in 100 PSMs is expected to be incorrect and can thus yield an incorrect protein identification
•  This does not mean that the FDR on the protein level is also 1%!
•  In particular in large-scale studies (tens of thousands of spectra), protein FDRs are much higher than peptide FDRs
•  PSMs for a large number of (mostly) identical samples
   •  The number of correctly identified proteins does not increase significantly with the number of spectra (it is always the same proteins being identified; additional correct PSMs do not increase the number of proteins)
   •  The number of false positives increases with the number of PSMs (false PSMs hit random proteins, so initially they are mostly novel false positives!)

Protein FDRs

•  Error rates increase when going from peptides to proteins
•  Correct peptide IDs tend to group into a small set of correct proteins
•  Incorrect IDs are semi-random and scatter over the whole protein database

A. Nesvizhskii, J. Proteomics (2010), 73:2092-2123
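A back-of-the-envelope calculation shows how the error rate inflates. The sketch assumes the worst case in which every incorrect PSM hits a different wrong protein, and that the number of truly present proteins stays fixed as more spectra are acquired; all numbers are invented for illustration.

```python
def protein_level_fdr(n_psms, psm_fdr, n_true_proteins):
    """Worst-case protein FDR if each false PSM yields its own false protein."""
    false_psms = round(n_psms * psm_fdr)
    false_proteins = false_psms  # assumption: no two false PSMs share a protein
    return false_proteins / (false_proteins + n_true_proteins)


# 1% PSM-level FDR, 1000 proteins truly present in the sample:
for n_psms in (10_000, 100_000):
    print(n_psms, protein_level_fdr(n_psms, 0.01, 1000))
```

Under these assumptions a tenfold increase in PSMs raises the protein-level error from about 9% to 50%, although the PSM-level FDR is 1% in both cases.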

ProteinProphet

•  ProteinProphet is an open-source software tool for protein inference and currently one of the standard tools in the area
•  Key ideas
   •  Maximum parsimony approaches to compile protein lists
   •  Reporting of protein ambiguity groups
   •  Protein probability estimation: estimate the probability that a given protein is correctly identified given all evidence for it

Nesvizhskii, et al., Anal. Chem. (2003), 75, 4646-4658

ProteinProphet - Overview

Nesvizhskii, et al., Anal. Chem. (2003), 75, 4646-4658

PeptideProphet

•  Peptide Probability Estimates (PPE)
   •  Computed by PeptideProphet
   •  Converts search engine scores into a probability (1 - posterior error probability)
   •  Similar ideas have been discussed in the context of consensus identification
•  PeptideProphet uses expectation maximization to compute a mixture model of the score distributions of correct and incorrect PSMs
•  Given a PSM and a search engine score, we can thus compute a posterior probability (the probability that the PSM is correct)
•  In contrast to a (raw) score, PPEs are a simple way to determine the trust in each individual PSM

Keller, et al., Anal. Chem. (2002), 74, 5383-5392
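The mixture-model idea can be illustrated with a deliberately simplified sketch: expectation maximization for two Gaussian components, one for incorrect and one for correct PSMs. The choice of two Gaussians, the initialization and all names are assumptions of this illustration and do not reproduce PeptideProphet's actual score model.

```python
import math


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_mixture(scores, iterations=50, min_sigma=1e-3):
    """Fit a two-Gaussian mixture by EM; returns a function score -> p(+|score)."""
    lo, hi = min(scores), max(scores)
    mu = {"-": lo, "+": hi}  # start the components at the score extremes
    spread = max((hi - lo) / 4.0, min_sigma)
    sigma = {"-": spread, "+": spread}
    prior_pos = 0.5

    def posterior(score):
        pos = prior_pos * normal_pdf(score, mu["+"], sigma["+"])
        neg = (1.0 - prior_pos) * normal_pdf(score, mu["-"], sigma["-"])
        return pos / (pos + neg) if pos + neg > 0.0 else 0.5

    for _ in range(iterations):
        # E-step: probability of each PSM belonging to the "correct" component.
        resp = [posterior(s) for s in scores]
        w_pos = sum(resp)
        w_neg = len(scores) - w_pos
        if w_pos < 1e-9 or w_neg < 1e-9:
            break
        # M-step: re-estimate means, standard deviations and the prior.
        mu["+"] = sum(r * s for r, s in zip(resp, scores)) / w_pos
        mu["-"] = sum((1.0 - r) * s for r, s in zip(resp, scores)) / w_neg
        var_pos = sum(r * (s - mu["+"]) ** 2 for r, s in zip(resp, scores)) / w_pos
        var_neg = sum((1.0 - r) * (s - mu["-"]) ** 2 for r, s in zip(resp, scores)) / w_neg
        sigma["+"] = max(math.sqrt(var_pos), min_sigma)
        sigma["-"] = max(math.sqrt(var_neg), min_sigma)
        prior_pos = w_pos / len(scores)
    return posterior


scores = [0.8, 1.0, 1.2, 0.9, 1.1, 1.0, 0.7, 1.3, 4.8, 5.0, 5.2, 4.9, 5.1]
ppe = fit_mixture(scores)
print(ppe(5.0), ppe(1.0))
```

The returned function plays the role of the PPE: it maps a raw score to the estimated probability that the PSM is correct.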

Protein Probability Estimates

•  Given the PPEs, we can easily compute the probability for each of the induced protein IDs
•  Assuming all peptides are unique, we can compute the probability P for a protein identification as 1 minus the probability that all peptide identifications inducing this protein are wrong
•  We could do this on the peptide level quite simply as follows:

   $P = 1 - \prod_{i} (1 - p_i)$

   with probabilities $p_i$ for the identification of peptide i being correct
•  However, we also need to consider that different spectra can give evidence for the same peptide

Protein Probability Estimates

•  We thus need to consider probabilities for each PSM independently
•  Each PSM is assigned a PPE by PeptideProphet
•  The probability that a protein is not present in a sample despite its PSMs depends on the probabilities $p(+|D_i^j)$ for the peptide ID of peptide i based on the observed data (spectrum) j being correct
•  We can thus compute P based on the PPEs of all PSMs:

   $P = 1 - \prod_{i} \prod_{j} \left( 1 - p(+|D_i^j) \right)$

Nesvizhskii, et al., Anal. Chem. (2003), 75, 4646-4658

Protein Probability Estimates

•  There are a few problems with this:
   •  PSMs are not independent
      There is a high probability for multiple spectra of the same peptide to hit the same incorrect ID if the spectra are of high quality but do not match the database (e.g., due to post-translational modification)
   •  Ambiguous peptide-protein matches
      If a peptide matches multiple proteins, its evidence cannot simply be shared across these proteins

Protein Probability Estimates

•  A simple way to deal with multiple PSMs is to
   •  Include each peptide just once
   •  Consider only the PSM with the best PPE of all PSMs to the same peptide: $p_i = \max_j p(+|D_i^j)$
•  P would then be computed as follows:

   $P = 1 - \prod_{i} \left( 1 - \max_j p(+|D_i^j) \right)$

•  This procedure yields a more conservative estimate of protein probabilities

ProteinProphet

Example:

>gi|125910|sp|P02754.3|LACB_BOVIN
MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDL
EILLQKWENGECAQKKIIAEKTKIPAVFKIDALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQCLVR
TPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI

VYVEELKPTPEGDLEILLQK : p = 0.81
TPEVDDEALEK : p = 0.91
LSFNPTQLEEQCHI : p = 0.48
LSFNPTQLEEQCHI : p = 0.65  (max = 0.65)

P(LACB_BOVIN) = 1 – (1 – 0.81) (1 – 0.91) (1 – 0.65) = 0.99

After: Nesvizhskii, et al., Anal. Chem. (2003), 75, 4646-4658
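The worked example can be reproduced with a short sketch that keeps only the best PPE per peptide and then combines the peptides. The function name and the PSM list representation are hypothetical.

```python
def protein_probability(psms):
    """P = 1 - prod_i (1 - max_j p_ij) over the peptides i of one protein.

    psms -- list of (peptide, ppe) tuples; a peptide may occur several times
    """
    best = {}
    for peptide, ppe in psms:
        # Keep only the best-scoring PSM for each peptide.
        best[peptide] = max(ppe, best.get(peptide, 0.0))
    prob_all_wrong = 1.0
    for ppe in best.values():
        prob_all_wrong *= 1.0 - ppe
    return 1.0 - prob_all_wrong


lacb_psms = [
    ("VYVEELKPTPEGDLEILLQK", 0.81),
    ("LSFNPTQLEEQCHI", 0.48),
    ("LSFNPTQLEEQCHI", 0.65),
    ("TPEVDDEALEK", 0.91),
]
print(round(protein_probability(lacb_psms), 2))
```

Using all four PSMs independently instead would give a slightly higher value, which is why the per-peptide maximum is the more conservative choice.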

Sibling Peptides

•  Correct assignments tend to cluster to the same proteins
•  Incorrect assignments tend to be hits to proteins with no other assigned peptides
•  As a result, the computed PPEs, while correct in the context of the whole dataset, need to be corrected for an accurate estimate in the context of their source protein
•  ProteinProphet introduces the notion of sibling peptides
   •  Sibling peptides are peptides hitting the same protein
   •  Rather than counting them, ProteinProphet defines the number of sibling peptides $NSP_i$ for a peptide i as the sum of the PPEs:

      $NSP_i = \sum_{m \neq i} p_m$

      where the sum runs over all other peptides m hitting the same protein as i, and the PPEs $p_m$ are the maximum values reached for a given peptide in the dataset

Sibling Peptides

Example:

>gi|125910|sp|P02754.3|LACB_BOVIN
MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDL
EILLQKWENGECAQKKIIAEKTKIPAVFKIDALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQCLVR
TPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI

VYVEELKPTPEGDLEILLQK : p = 0.81
TPEVDDEALEK : p = 0.91
LSFNPTQLEEQCHI : p = 0.48
LSFNPTQLEEQCHI : p = 0.65  (max = 0.65)

NSP(VYV…) = 0.91 + 0.65 = 1.56
NSP(TPE…) = 0.65 + 0.81 = 1.46
NSP(LSF…) = 0.91 + 0.81 = 1.72

After: Nesvizhskii, et al., Anal. Chem. (2003), 75, 4646-4658
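The NSP values of the example follow directly from the per-peptide maxima: each peptide's NSP is the sum of the best PPEs of all other peptides of the same protein. The function name below is hypothetical.

```python
def nsp_values(best_ppe):
    """NSP_i = sum of the best PPEs of all sibling peptides m != i.

    best_ppe -- dict mapping each peptide of one protein to its maximum PPE
    """
    total = sum(best_ppe.values())
    return {peptide: total - ppe for peptide, ppe in best_ppe.items()}


lacb = {
    "VYVEELKPTPEGDLEILLQK": 0.81,
    "TPEVDDEALEK": 0.91,
    "LSFNPTQLEEQCHI": 0.65,
}
for peptide, nsp in nsp_values(lacb).items():
    print(peptide, round(nsp, 2))
```

A peptide that is the only hit to its protein gets an NSP of 0, which is the situation typical for incorrect assignments.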

Sibling Peptides

•  Intuitively, one would trust identifications with a high NSP more than those with a low NSP (more evidence per protein)
•  We can thus refine PPEs in the context of the source protein as follows:

   $p(+|D, NSP) = \frac{p(+|D) \, p(NSP|+)}{p(+|D) \, p(NSP|+) + p(-|D) \, p(NSP|-)}$

   with
   •  p(NSP|+) and p(NSP|-) being the probabilities of having a particular NSP value for correct/incorrect assignments
   •  p(+|D) and p(-|D) being the uncorrected probabilities for the peptide assignment being correct/incorrect
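The refinement is a plain application of Bayes' rule, as the following sketch shows. It assumes p(-|D) = 1 - p(+|D); the function name and the example likelihoods are invented for illustration.

```python
def adjust_for_nsp(p_pos, p_nsp_given_pos, p_nsp_given_neg):
    """Refine a PPE with the NSP likelihoods for correct/incorrect assignments."""
    p_neg = 1.0 - p_pos  # assumption: p(-|D) = 1 - p(+|D)
    numerator = p_pos * p_nsp_given_pos
    denominator = numerator + p_neg * p_nsp_given_neg
    return numerator / denominator


# An uncertain PSM (0.5) whose NSP bin is four times more likely for correct hits:
print(adjust_for_nsp(0.5, 0.8, 0.2))
# The same PSM in an NSP bin that is typical for incorrect hits:
print(adjust_for_nsp(0.5, 0.1, 0.6))
```

If both likelihoods are equal, the NSP carries no information and the PPE stays unchanged.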

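Sibling Peptides

•  Values for p(NSP|+) and p(NSP|-) can be computed for the whole dataset
•  NSP values are binned and counted for correct and incorrect assignments:

   $p(NSP|+) = \frac{1}{N \, p(+)} \sum_{i:\, NSP_i \in \text{bin}} p(+|D_i)$

   $p(NSP|-) = \frac{1}{N \, (1 - p(+))} \sum_{i:\, NSP_i \in \text{bin}} p(-|D_i)$

   where N is the total number of peptide assignments and p(+) is the prior probability of a peptide identification being correct
•  p(+) can be computed by summation over all peptide identifications of the dataset:

   $p(+) = \frac{1}{N} \sum_{i} p(+|D_i)$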

NSP Distributions

•  NSP distributions can be determined using expectation maximization
•  As a first guess, unadjusted p(+|D) values are used to compute an estimated NSP value for each assignment
•  Applying EM then yields adjusted probabilities; this is repeated until convergence has been reached
•  NSP distributions depend on the dataset and the dataset size

[Figure: NSP distribution for datasets of varying size. Squares: single run of a low-complexity sample; circles: four runs of the same sample; triangles: 22 runs]

Nesvizhskii, et al., Anal. Chem. (2003), 75, 4646-4658

Influence of NSP Correction

•  NSP correction yields better predictions of protein probabilities
•  The figure shows the predicted vs. true protein probabilities with and without NSP
•  Different lines correspond to different datasets
•  Dotted line: perfect prediction

Nesvizhskii, et al., Anal. Chem. (2003), 75, 4646-4658

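Protein Ambiguity

•  Shared peptides within a PAG cause issues as well
•  Their probabilities can be distributed over their potential source proteins through a weighting scheme based on the protein probabilities:

   $w_i^n = \frac{P_n}{\sum_{n'} P_{n'}}$

   where the sum runs over all proteins n' that contain peptide i
•  Weights $w_i^n$ are again estimated iteratively using an EM-like algorithm

[Figure: peptide 1 (probability p1) maps to protA and protB with weights w1A and w1B; peptide 2 (probability p2) maps to protB with weight w2B; PA and PB are the resulting protein probabilities]

Nesvizhskii, et al., Anal. Chem. (2003), 75, 4646-4658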
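The iterative weighting can be sketched for the two-protein situation in the figure. The update rule used here, protein probabilities from weighted PPEs followed by weights proportional to the protein probabilities, is a simplified reading of the scheme; the function name and data layout are hypothetical.

```python
def estimate_weights(peptide_probs, peptide_to_proteins, iterations=20):
    """Alternate between protein probabilities and peptide-to-protein weights.

    peptide_probs       -- dict peptide -> PPE
    peptide_to_proteins -- dict peptide -> list of proteins containing it
    """
    # Start by sharing every peptide equally among its candidate proteins.
    weights = {(pep, prot): 1.0 / len(prots)
               for pep, prots in peptide_to_proteins.items() for prot in prots}
    proteins = sorted({prot for prots in peptide_to_proteins.values() for prot in prots})
    protein_probs = {}
    for _ in range(iterations):
        # Protein probability: 1 - prod_i (1 - w_i^n * p_i).
        for prot in proteins:
            all_wrong = 1.0
            for (pep, candidate), weight in weights.items():
                if candidate == prot:
                    all_wrong *= 1.0 - weight * peptide_probs[pep]
            protein_probs[prot] = 1.0 - all_wrong
        # Weight update: proportional to the current protein probabilities.
        for pep, prots in peptide_to_proteins.items():
            total = sum(protein_probs[prot] for prot in prots)
            for prot in prots:
                weights[(pep, prot)] = (protein_probs[prot] / total
                                        if total > 0.0 else 1.0 / len(prots))
    return weights, protein_probs


weights, probs = estimate_weights(
    {"peptide1": 0.9, "peptide2": 0.9},
    {"peptide1": ["protA", "protB"], "peptide2": ["protB"]},
)
print(weights, probs)
```

In this example protB has independent evidence from peptide 2, so the shared peptide 1 is assigned increasingly to protB and the probability of protA shrinks.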

Protein  Ambiguity  Group  

Nesvizhskii, et al., Anal. Chem. (2003), 75, 4646-4658

References

Papers:
•  Craig, R. and Beavis, R.C. (2003) Rapid Commun. Mass Spectrom., 17, 2310–2316.
•  Colinge J, Bennett KL. Introduction to Computational Proteomics. PLoS Comput Biol 3:e114. (http://dx.doi.org/10.1371/journal.pcbi.0030114)
•  Nesvizhskii A I, Aebersold R. Interpretation of Shotgun Proteomics Data. Mol Cell Proteomics 2005;4:1419-1440.
•  Nesvizhskii, Keller, Kolker, Aebersold. A Statistical Model for Identifying Proteins by Tandem Mass Spectrometry. Anal. Chem. 2003, 75, 4646-4658.
•  Keller, Nesvizhskii, Kolker, Aebersold. Empirical Statistical Model to Estimate the Accuracy of Peptide Identifications Made by MS/MS and Database Search. Anal. Chem. 2002, 74, 5383-5392.

Links:
•  ProteinProphet: http://proteinprophet.sourceforge.net
•  OMSSA online server: http://pubchem.ncbi.nlm.nih.gov/omssa/
•  MASCOT online server: http://www.matrixscience.com/search_form_select.html
•  PeptideAtlas, a database of peptide spectra and identifications: http://www.peptideatlas.org/