
Page 1: Interpreting MS/MS Proteomics Results Brian C. Searle Proteome Software Inc. Portland, Oregon USA Brian.Searle@ProteomeSoftware.com NPC Progress Meeting

Interpreting MS/MS Proteomics Results

Brian C. Searle
Proteome Software Inc., Portland, Oregon USA

Brian.Searle@ProteomeSoftware.com

NPC Progress Meeting (February 2nd, 2006)

Illustrated by Toni Boudreault

The first thing I should say is that none of the material presented is original research done at Proteome Software, but we do strive to make the tools presented here available in our software product Scaffold. With that caveat aside…

Page 2:

Organization

• Identify
• SEQUEST
• X! Tandem/Mascot
• Differ
• Combine

This is first and foremost an introduction, so we’re first going to talk about how you go about identifying proteins with tandem mass spectrometry in the first place.

Then we’re going to talk about the motivations behind the development of the first really useful bioinformatics technique in our field, SEQUEST. This technique has been extended by two other tools called X! Tandem and Mascot.

We’re also going to talk about how these programs differ and how we can use that to our advantage by considering them simultaneously using probabilities.

Page 3:

Start with a protein

[Figure: a protein drawn as a chain of single-letter amino acids]

So, this is proteomics, so we’re going to use tandem mass spectrometry to identify proteins, hopefully many of them, and hopefully very quickly.

Page 4:

Cut with an enzyme

[Figure: the same protein chain, cut into peptides]

And to use this technique you generally have to digest the protein into peptides about 8 to 20 amino acids in length and…

Page 5:

Select a peptide

[Figure: the same protein chain, with one peptide selected]

…look at each peptide individually. We select the peptide by mass using the first half of the tandem mass spectrometer.

Page 6:

Impart energy in collision cell

[Figure: the peptide A E P T I R, with H2O at its end]

The mass spectrometer imparts energy into the peptide, causing it to fragment at the peptide bonds between amino acids.

Page 7:

Measure mass of daughter ions

[Figure: spectrum, intensity vs. m/z, with peaks for A (72.0), A E (201.1), A E P (298.1) and A E P T (399.2)]

The masses of these fragment ions are recorded using the second mass spectrometer.

Page 8:

B-type Ions

[Figure: spectrum, intensity vs. m/z; peaks labeled A, E, P, T, I, R, H2O with spacings 72.0, 129.0, 97.0, 101.0, 113.1, 174.1]

These ions are commonly called B ions, based on nomenclature you don’t really want to know about… But the mass difference between the peaks corresponds directly to the amino acid sequence.

Page 9:

B-type Ions

[Figure: the same spectrum, with the spacings annotated as A-0, AE-A, AEP-AE, AEPT-AEP, AEPTI-AEPT, AEPTIR-AEPTI]

For example, the A-E peak minus the A peak should produce the mass of E. You can build these mass differences up and derive a sequence for the original peptide.

This is pretty neat and it makes tandem mass spectrometry one of the best tools out there for sequencing novel peptides.
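The mass-difference idea can be sketched in a few lines of Python. This is only an illustration on a perfectly clean ladder, not any published tool: the residue masses are standard monoisotopic values, while the names `b_ion_ladder` and `read_sequence` and the 0.02 Da tolerance are assumptions made for this example. Leucine is left out of the lookup table because it has the same mass as isoleucine and cannot be told apart this way.

```python
# Monoisotopic residue masses in daltons (leucine omitted: same mass as isoleucine).
RESIDUE_MASS = {
    "A": 71.03711, "E": 129.04259, "P": 97.05276,
    "T": 101.04768, "I": 113.08406, "R": 156.10111,
}
PROTON = 1.007276  # mass of the charge-carrying proton


def b_ion_ladder(peptide):
    """Return singly charged B-ion m/z values for prefixes of length 1..n-1."""
    ladder, total = [], PROTON
    for residue in peptide[:-1]:
        total += RESIDUE_MASS[residue]
        ladder.append(total)
    return ladder


def read_sequence(ladder, tolerance=0.02):
    """Turn a clean B-ion ladder back into a sequence via peak-to-peak differences."""
    sequence, previous = [], PROTON
    for mz in ladder:
        step = mz - previous
        matches = [aa for aa, mass in RESIDUE_MASS.items() if abs(mass - step) <= tolerance]
        sequence.append(matches[0] if matches else "?")
        previous = mz
    return "".join(sequence)


ladder = b_ion_ladder("AEPTIR")
print([round(mz, 1) for mz in ladder])
print(read_sequence(ladder))
```

The first four ladder values round to the 72.0, 201.1, 298.1 and 399.2 peaks shown on the earlier slide. The final residue does not appear in the B-ion ladder itself, so the sketch recovers only the first five letters.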

Page 10:

So, it seems pretty easy, doesn’t it? But there are a couple of confounding factors. For example…

Page 11:

B-type Ions

[Figure: spectrum, intensity vs. m/z; peaks labeled A, E, P, T, I, R, H2O, each marked with CO]

B ions have a tendency to degrade and lose carbon monoxide, producing…

Page 12:

A-type Ions

[Figure: spectrum along the m/z axis; peaks labeled A, E, P, T, I, R, H2O, each marked with CO]

…A ions. Furthermore…

Page 13:

Y-type Ions

[Figure: spectrum, intensity vs. m/z; peaks labeled R, I, T, P, E, A, H2O]

…the other half of each fragmented peptide shows up as Y ions, which sequence backwards. And, unfortunately, this is the real world, so…

Page 14:

Y-type Ions

[Figure: spectrum, intensity vs. m/z; peaks labeled R, I, T, P, E, A, H2O]

…all the peaks have different measured heights and many peaks can often be missing.

Page 15:

B-type, A-type, Y-type Ions

[Figure: spectrum, intensity vs. m/z; peaks labeled R, I, T, P, E, A, H2O]

All these peaks are seen together simultaneously and we don’t even know…

Page 16:

[Figure: spectrum, intensity vs. m/z]

…what type of ion they are, making the mass differences approach even more difficult. Finally, as with all analytical techniques…

Page 17:

[Figure: spectrum, intensity vs. m/z]

…there’s noise, producing a final spectrum that looks like…

Page 18:

[Figure: spectrum, intensity vs. m/z]

…this, on a good day. And so it’s actually fairly difficult to…

Page 19:

[Figure: spectrum, intensity vs. m/z; labels A, E, P, T, I, R, H2O with spacings 72.0, 129.0, 97.0, 101.0, 113.1, 174.1]

…compute the mass differences to sequence the peptide, certainly in a computer automated way.

Page 20:

So the community needed a new technique. Now, it wasn’t all without hope…

Page 21:

Known Ion Types

• B-type ions
• A-type ions
• Y-type ions

We knew a couple of things about peptide fragmentation. Not only do we know to expect B, A, and Y ions, but…

Page 22:

Known Ion Types

• B-type ions
• A-type ions
• Y-type ions
• B- or Y-type +2H ions
• B- or Y-type -NH3 ions
• B- or Y-type -H2O ions

…we also know a couple of other variations on those ions that come up. We even know something about the…

Page 23:

Known Ion Types

• B-type ions: 100%
• A-type ions: 20%
• Y-type ions: 100%
• B- or Y-type +2H ions: 50%
• B- or Y-type -NH3 ions: 20%
• B- or Y-type -H2O ions: 20%

…likelihood of seeing each type of ion, where generally B and Y ions are most prominent.

Page 24:

If we know the amino acid sequence of a peptide, we can guess what the spectra should look like!

So it’s actually pretty easy to guess what a spectrum should look like if we know what the peptide sequence is.

Page 25:

Model Spectrum

ELVISLIVESK

*Courtesy of Dr. Richard Johnson, http://www.hairyfatguy.com/

So as an example, consider the peptide ELVIS LIVES K that was synthesized by Rich Johnson in Seattle.

Page 26:

Model Spectrum

We can create a hypothetical spectrum based on our rules…

Page 27:

• B/Y type ions (100%)
• B/Y +2H type ions (50%)
• A type ions, B/Y -NH3/-H2O (20%)

…where B and Y ions are estimated at 100%, +2H ions are estimated at 50%, and other stragglers are at 20%.
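To make the 100/50/20 rules concrete, here is a rough Python sketch of building a model spectrum for ELVISLIVESK. It rests on several assumptions that are not in the talk: the function name `model_spectrum` is made up, the masses are standard monoisotopic values, and "+2H" is read as the doubly charged version of each B or Y ion. It is not how any particular search engine builds its model.

```python
RESIDUE_MASS = {
    "E": 129.04259, "L": 113.08406, "V": 99.06841,
    "I": 113.08406, "S": 87.03203, "K": 128.09496,
}
PROTON, WATER, AMMONIA, CO = 1.007276, 18.010565, 17.026549, 27.994915


def model_spectrum(peptide):
    """Return sorted (m/z, relative intensity) pairs using the 100/50/20 rules."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    peaks = []
    for cut in range(1, len(peptide)):
        b = sum(masses[:cut]) + PROTON          # N-terminal fragment
        y = sum(masses[cut:]) + WATER + PROTON  # C-terminal fragment
        for ion in (b, y):
            peaks.append((ion, 100))                # B- and Y-type ions
            peaks.append(((ion + PROTON) / 2, 50))  # +2H: assumed doubly charged
            peaks.append((ion - AMMONIA, 20))       # -NH3 neutral loss
            peaks.append((ion - WATER, 20))         # -H2O neutral loss
        peaks.append((b - CO, 20))                  # A-type ion: B ion minus CO
    return sorted(peaks)


spectrum = model_spectrum("ELVISLIVESK")
print(len(spectrum), "model peaks")
for mz, intensity in spectrum[:5]:
    print(f"{mz:9.3f}  {intensity}")
```

Each of the ten peptide bonds contributes nine model peaks here, so even a short peptide produces a fairly crowded hypothetical spectrum.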

Page 28:

Model Spectrum

So if we consider the spectrum that was derived from the ELVIS LIVES K peptide…

Page 29:

Model Spectrum

…we can find where the overlap is between the hypothetical and the actual spectra…

Page 30:

Model Spectrum

…and say conclusively based on the evidence that the spectrum does belong to the ELVIS LIVES K peptide.

Page 31:

But who cares? The more important question is “what about situations where we don’t know the sequence?”

Page 32:

We guess!

Page 33:

PepSeq

AAAAAAAAAA
AAAAAAAAAC
AAAAAAAACC
AAAAAAACCC
…
ELVISLIVESK
…
WYYYYYYYYY
YYYYYYYYYY

J. Rozenski et al., Org. Mass Spectrom., 29 (1994) 654-658.

And so this was an approach followed by a program called PepSeq, which would guess every combination of amino acids possible, build a hypothetical spectrum, and find the best matching hypothetical.

Page 34:

PepSeq

• Impossibly hard after 7 or 8 amino acids!
• High false positive rate because you consider so many options

This was a start, but it’s clearly impossibly hard with larger peptides and there’s a lot of room to overfit the data.

Page 35:

PepSeq

• Impossibly hard after 7 or 8 amino acids!
• High false positive rate because you consider so many options

Another strategy is needed!

So obviously this isn’t going to work in the long run.

Page 36:

Sequencing Explosion

• 1977 Shotgun sequencing invented, bacteriophage φX174 sequenced
• 1989 Yeast Genome project announced
• 1990 Human Genome project announced
• 1992 First chromosome (Yeast) sequenced
• 1995 H. influenzae sequenced
• 1996 Yeast Genome sequenced
• 2000 Human Genome draft

We needed a new invention to come around, and that was shotgun Sanger sequencing. In ’89 and ’90 the Yeast and Human Genome projects were announced, followed by the first chromosome in ’92, et cetera, et cetera.

Page 37:

Sequencing Explosion

• 1977 Shotgun sequencing invented, bacteriophage φX174 sequenced
• 1989 Yeast Genome project announced
• 1990 Human Genome project announced
• 1992 First chromosome (Yeast) sequenced
• 1995 H. influenzae sequenced
• 1996 Yeast Genome sequenced
• 2000 Human Genome draft

Eng, J. K.; McCormack, A. L.; Yates, J. R. III. J. Am. Soc. Mass Spectrom. 1994, 5, 976-989.

In 1994 Jimmy Eng and John Yates published a technique to exploit genome sequencing for use in tandem mass spectrometry. And the idea was…

Page 38:

SEQUEST

…instead of searching all possible peptide sequences, search only those in genome databases.

Now, in the post-genomic world this seems like a pretty trivial idea, but back then there was a lot of assumption placed on the idea that we’d actually have a complete Human genome in a reasonable amount of time.

Page 39:

SEQUEST

• 2×10^14: all possible 11mers (ELVISLIVESK)
• 2×10^10: all possible peptides in NR
• 1×10^8: all tryptic peptides in NR
• 4×10^6: all Human tryptic peptides in NR

So, in terms of 11 amino acid peptides, we’re talking about a 10 thousand fold difference between searching every possible 11mer and searching those in the current non-redundant protein database from the NCBI, and roughly a 50 million fold difference for searching human tryptic peptides.

So that was huge; it made hypothetical spectrum matching feasible.
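The size of that reduction is easy to check with a little arithmetic. The sketch below takes the three database counts straight from the slide (they are quoted, not recomputed) and only works out the first number, 20 amino acids raised to the 11th power; the variable names are arbitrary.

```python
AMINO_ACIDS = 20
PEPTIDE_LENGTH = 11  # length of ELVISLIVESK

# Database sizes as quoted on the slide, not recomputed here.
search_spaces = {
    "all possible 11mers": AMINO_ACIDS ** PEPTIDE_LENGTH,
    "all peptides in NR": 2e10,
    "all tryptic peptides in NR": 1e8,
    "all human tryptic peptides in NR": 4e6,
}

every_11mer = search_spaces["all possible 11mers"]
for name, size in search_spaces.items():
    reduction = every_11mer / size
    print(f"{name:34s} {size:10.1e}   reduction {reduction:14,.0f}x")
```

With the slide's figures, the reductions come out near 10 thousand fold for all of NR and near 50 million fold for human tryptic peptides.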

Page 40:

SEQUEST Model Spectrum

SEQUEST made a couple of other interesting improvements as well. Jimmy and John noted that there was a discontinuity between the intensities of the hypothetical spectrum and the actual spectrum. Instead of trying to make a better model, they decided just to make the actual spectrum look like the model with normalization…

Page 41:

SEQUEST Model Spectrum

…like so. For a scoring function they decided to use Cross-Correlation, which basically sums the peaks that overlap between the hypothetical and the actual spectra.

Page 42:

SEQUEST Model Spectrum

And then they shifted the spectra back and…

Page 43:

SEQUEST Model Spectrum

…forth so that the peaks shouldn’t align. They used this number, also called the Auto-Correlation, as their background.

Page 44:

SEQUEST XCorr

[Figure: Correlation Score vs. Offset (AMU), showing the Cross Correlation (direct comparison) and the Auto Correlation (background). Gentzel M. et al., Proteomics 3 (2003) 1597-1610]

This is another representation of the Cross Correlation and the Auto Correlation.

Page 45:

SEQUEST XCorr

[Figure: Correlation Score vs. Offset (AMU), showing the Cross Correlation (direct comparison) and the Auto Correlation (background). Gentzel M. et al., Proteomics 3 (2003) 1597-1610]

$$\mathrm{XCorr} = \frac{\mathrm{CrossCorr}}{\mathrm{avg}_{\,\mathrm{offset}=-75\ldots75}\left(\mathrm{AutoCorr}\right)}$$

The XCorr score is the Cross Correlation divided by the average of the Auto Correlation over a 150 AMU range.

The XCorr is high if the direct comparison is significantly greater than the background, which is obviously good for peptide identification.
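Here is a toy Python version of that calculation, following the ratio exactly as the slide states it. Several things are assumptions for the sake of the example: spectra are plain lists with one bin per AMU, the names `correlation` and `xcorr` are made up, and the aligned comparison is left out of the background average. Published descriptions of SEQUEST usually state the final score as the direct comparison minus the mean background rather than divided by it, and SEQUEST also preprocesses the spectra first, so treat this as an illustration of the slide, not of the program.

```python
def correlation(observed, model, offset):
    """Dot product of the observed spectrum with the model shifted by `offset` bins."""
    total = 0.0
    for i, height in enumerate(model):
        j = i + offset
        if 0 <= j < len(observed):
            total += height * observed[j]
    return total


def xcorr(observed, model, max_offset=75):
    """Direct comparison divided by the average background over +/- max_offset bins."""
    direct = correlation(observed, model, 0)
    background = [
        correlation(observed, model, k)
        for k in range(-max_offset, max_offset + 1)
        if k != 0  # assumption: leave the aligned comparison out of the background
    ]
    mean_background = sum(background) / len(background)
    return direct / mean_background if mean_background else float("inf")


# Toy data: 400 one-AMU bins, a flat 0.1 noise floor and three real peaks.
observed = [0.1] * 400
right_model = [0.0] * 400
wrong_model = [0.0] * 400
for peak in (100, 200, 300):
    observed[peak] = 1.0
    right_model[peak] = 1.0
    wrong_model[peak + 50] = 1.0  # same pattern, shifted away from the real peaks

print(round(xcorr(observed, right_model), 2))
print(round(xcorr(observed, wrong_model), 2))
```

The model whose peaks line up with the observed peaks scores roughly ten times its own background, while the shifted model scores no better than background.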

Page 46:

SEQUEST DeltaCn

$$\mathrm{DeltaCn} = \frac{\mathrm{XCorr}_1 - \mathrm{XCorr}_2}{\mathrm{XCorr}_1}$$

And this XCorr is actually a pretty robust method for estimating how accurate the match is, and so far, there really haven’t been any significant improvements on it.

The DeltaCn is another score that scientists often use. It measures how good the XCorr is relative to the next best match. As you can see, this is actually a pretty crude calculation.
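As a quick illustration of how crude it is, the whole score fits in one small function. The name `delta_cn` and the XCorr values below are invented for the example.

```python
def delta_cn(xcorr_scores):
    """Normalized gap between the best and second-best XCorr for one spectrum."""
    ranked = sorted(xcorr_scores, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    return (best - runner_up) / best


# Hypothetical XCorr values for the candidate peptides of a single spectrum.
print(round(delta_cn([3.2, 2.4, 1.1]), 3))  # clear winner
print(round(delta_cn([3.2, 3.1, 1.1]), 3))  # near tie
```

Only the top two candidates matter; everything else in the list of matches is ignored.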

Page 47:

                    Accuracy Score    Relative Score
SEQUEST             Strong (XCorr)    Weak (DeltaCn)

Here’s another representation of that sentiment. The XCorr is a strong measure of accuracy, whereas the DeltaCn is a weak measure of relative goodness.

Page 48:

                    Accuracy Score    Relative Score
SEQUEST             Strong (XCorr)    Weak (DeltaCn)
Alternate Method    Weak              Strong

Obviously, there could be an alternative method that focuses more on the success of the relative score. Mascot and X! Tandem fit that bill.

Page 49:

X! Tandem Scoring

by-Score = Sum of intensities of peaks matching B-type or Y-type ions

$$\mathrm{HyperScore} = \text{by-Score} \times N_y! \times N_b!$$

Fenyo, D.; Beavis, R. C. Anal. Chem., 75 (2003) 768-774

Now the X! Tandem accuracy score is rather crude. It only considers B and Y ions, and it attaches these factorial terms (the factorials of the numbers of matched Y and B ions) with an admittedly hand waving argument.
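A minimal Python sketch of that score, under stated assumptions: the function name `hyperscore`, the 0.5 m/z matching tolerance and the toy peak list are all invented here, and the real X! Tandem code does its own peak processing first.

```python
from math import factorial


def hyperscore(peaks, b_ions, y_ions, tolerance=0.5):
    """by-Score times Nb! times Ny!, where Nb and Ny count matched B and Y ions."""

    def matched_intensity(target):
        hits = [height for mz, height in peaks if abs(mz - target) <= tolerance]
        return max(hits) if hits else None

    by_score, n_b, n_y = 0.0, 0, 0
    for target in b_ions:
        height = matched_intensity(target)
        if height is not None:
            by_score += height
            n_b += 1
    for target in y_ions:
        height = matched_intensity(target)
        if height is not None:
            by_score += height
            n_y += 1
    return by_score * factorial(n_b) * factorial(n_y)


# Toy spectrum as (m/z, intensity) pairs; one of two B ions and both Y ions match.
peaks = [(100.0, 10.0), (200.0, 5.0), (300.0, 2.0)]
print(hyperscore(peaks, b_ions=[100.0, 250.0], y_ions=[200.0, 300.0]))
```

The factorials grow quickly, so a candidate that matches a few more ions is rewarded far more than the extra intensity alone would suggest.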

Page 50:

Distribution of “Incorrect” Hits

[Figure: histogram, # of Matches vs. Hyper Score, with the Best Hit and Second Best marked]

But instead of just considering the best match to the second best, it looks at the distribution of lower scoring hits, assuming that they are all wrong. This is somewhat based on ideas pioneered with the BLAST algorithm.

Here, every bar represents the number of matches at a given score. The X! Tandem creators found that the distribution decays (or slopes down) exponentially…

Page 51:

Estimate Likelihood (E-Value)

[Figure: Log(# of Matches) vs. Hyper Score, with the Best Hit marked]

…and the log of the distribution is relatively linear because of the exponential decay.

Page 52:

Estimate Likelihood (E-Value)

[Figure: Log(# of Matches) vs. Hyper Score, with a linear fit labeled “Expected Number Of Random Matches” and the Best Hit marked]

If the distribution represents the number of random matches at any given score, the linear fit should correspond to the expected number of random matches.

Page 53:

Estimate Likelihood (E-Value)

[Figure: Log(# of Matches), with the Best Hit marked and the annotation “Score of 60 has 1/10 chance of occurring at random”]

This is called an E-Value, or Expected-Value. And from this, you can calculate the likelihood that the best match is random.

In this case, a score of 60 corresponds with a log number of matches being -1, which means the estimated number of random matches for that score is 0.1.
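The fit-and-extrapolate step can be sketched as follows. This is a simplified reading of the slides, not X! Tandem's actual code: the names `fit_line` and `e_value` are made up, the histogram is invented so that it reproduces the slide's example (a score of 60 landing at a log count of -1), and every bin is assumed to hold at least one match so the logarithm is defined.

```python
from math import log10


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def e_value(score_histogram, best_score):
    """Extrapolate the log-linear decay of low-scoring hits out to the best score."""
    scores = sorted(score_histogram)
    logs = [log10(score_histogram[s]) for s in scores]
    slope, intercept = fit_line(scores, logs)
    return 10 ** (slope * best_score + intercept)


# Hypothetical histogram: Hyper Score bin -> number of (assumed incorrect) matches.
histogram = {0: 1000, 15: 100, 30: 10, 45: 1}
print(e_value(histogram, best_score=60))
```

For this made-up histogram the extrapolated value at a score of 60 is approximately 0.1, the same situation as in the slide's example.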

Page 54:

X! Tandem and Mascot

• Empirical (X! Tandem): E-Value = Likelihood that match is incorrect relative to N guesses
• Theoretical (Mascot): P-Value = Likelihood that match is incorrect (E ≈ P·N)

Now, X! Tandem calculates this E-Value empirically. Another search engine, Mascot, tries to get at the same kind of number using theoretical calculations, most likely based on the number of identified peaks and the likelihood of finding certain amino acids in the genome database. They’ve never explicitly published their algorithm, so we’ll never really know, but I suspect it’s something smart.

I just want to bring up a point that we’ll touch on a little later…

Page 55:

X! Tandem and Mascot

• Empirical (X! Tandem): E-Value = Likelihood that match is incorrect relative to N guesses
• Theoretical (Mascot): P-Value = Likelihood that match is incorrect (E ≈ P·N)
• Probability = Likelihood that match is correct. Note (Probability ≠ 1−P)!

…the E-Value that X! Tandem calculates and the P-Value that Mascot calculates are probabilistically based, but they can only estimate the likelihood that the match is wrong.

This is realistically not nearly as useful as knowing the probability that a peptide identification is right, which is NOT 1 minus the P-Value.
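As a worked illustration of the E ≈ P·N relation from the slide, take numbers invented purely for the example: a P-Value of 10^-6 for one match, and 50,000 candidate peptides compared against the spectrum.

```latex
E \approx P \cdot N = 10^{-6} \times \left(5 \times 10^{4}\right) = 5 \times 10^{-2}
```

So a match that looks like a one-in-a-million event against a single candidate would still be expected to turn up by chance about once in every twenty spectra when that many candidates are tried.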

Page 56:

             Accuracy Score    Relative Score
SEQUEST      XCorr             DeltaCn
X! Tandem    HyperScore        E-Value

Now, let’s go back and fill in the X! Tandem part of our accuracy/relativity scoring grid.

Page 57:

             Accuracy Score    Relative Score
SEQUEST      XCorr             DeltaCn
X! Tandem    HyperScore        E-Value

To reiterate, the XCorr is an excellent measure of accuracy…

Page 58:

             Accuracy Score    Relative Score
SEQUEST      XCorr             DeltaCn
X! Tandem    HyperScore        E-Value

…whereas the E-Value is an excellent measure of how good the best score is relative to the rest.

If we assume that accuracy and relativity scores are independent measures of goodness, could we use both SEQUEST’s XCorr and X! Tandem’s E-Value together?

Page 59:

10 Protein Control Sample

[Figure: scatter plot, X! Tandem: -log(E-Value) vs. SEQUEST: Discriminant Score]

And the answer is a resounding yes. Each point on this graph is a spectrum, where correct identifications are marked in red, while incorrect identifications are marked in blue. We know what’s correct and incorrect because this is a control sample.

Although in general the spectra SEQUEST scores well are spectra X! Tandem also scores well, there is considerable scatter between the search engines.

Page 60:

10 Protein Control Sample

[Figure: scatter plot, X! Tandem: -log(E-Value) vs. Mascot: Ion-Identity Score]

One might wonder, if X! Tandem and Mascot use similar scoring approaches, would they benefit as much, but the answer is surprisingly still yes!

Now, why are the scores so different?

Page 61:

Why So Different?

• SEQUEST
– Considers relative intensities
• X! Tandem
– Considers semi-tryptic peptides
– Considers only B/Y-type Ions
• Mascot
– Considers theoretical P-Value relative to search space

Well, here are a couple of possible reasons. SEQUEST is the only method to consider relative intensities.

Page 62:

Why So Different?

• SEQUEST
– Considers relative intensities
• X! Tandem
– Considers semi-tryptic peptides
– Considers only B/Y-type Ions
• Mascot
– Considers theoretical P-Value relative to search space

X! Tandem is the only method to consider peptides outside the standard search space by default, such as semi-tryptic peptides. However, it’s the only score that considers only B and Y ions, as opposed to a complete model.

Page 63:

Why So Different?

• SEQUEST
– Considers relative intensities
• X! Tandem
– Considers semi-tryptic peptides
– Considers only B/Y-type Ions
• Mascot
– Considers theoretical P-Value relative to search space

And Mascot is the only search engine to compute a completely theoretical P-Value.

Page 64:

Consider Multiple Algorithms?

[Figure: scatter plot, X! Tandem: -log(E-Value) vs. Mascot: Ion-Identity Score]

So we clearly want to consider multiple search engines simultaneously, but how?

Page 65:

How To Compare Search Engines?

– SEQUEST: XCorr > 2.5, DeltaCn > 0.1
– Mascot: Ion Score - Identity Score > 0
– X! Tandem: E-Value < 0.01

You can’t use a thresholding system because it’s impossible to find corresponding thresholds. For example, a SEQUEST match with an XCorr of 2.5 doesn’t mean the same thing as an X! Tandem match with an E-Value of 0.01.

Page 66:

How To Compare Search Engines?

– SEQUEST: XCorr > 2.5, DeltaCn > 0.1
– Mascot: Ion Score - Identity Score > 0
– X! Tandem: E-Value < 0.01

Need to convert scores to probabilities!

The simplest way would be to convert the scores into probabilities and compare those. We advocate for Andrew Keller and Alexey Nesvizhskii’s Peptide Prophet approach because it actually calculates a true probability, not just a p-value.

Page 67:

10 Protein Control Sample (Q-ToF)
X! Tandem approach

[Figure: histogram, # of Matches vs. Mascot: Ion-Identity Score; regions labeled “Other Incorrect IDs for Spectrum” and “Possibly Correct?”]

So if you remember, X! Tandem considers the best peptide match for a spectrum against a distribution of incorrect matches.

Page 68:

10 Protein Control Sample (Q-ToF)
Peptide Prophet approach

[Figure: histogram, # of Matches vs. Mascot: Ion-Identity Score; regions labeled “ALL Other ‘Best’ Matches” and “Possibly Correct?”]

Keller, A. et al., Anal. Chem. 74 (2002) 5383-5392

Well, Peptide Prophet looks across the entire sample, and not at just one spectrum at a time. It compares the best match against all of the other best matches in the sample, and that distribution is clearly bimodal.

Page 69:

10 Protein Control Sample (Q-ToF)
Peptide Prophet approach

[Figure: histogram, # of Matches vs. Mascot: Ion-Identity Score; regions labeled “ALL Other ‘Best’ Matches” and “Possibly Correct?”]

Keller, A. et al., Anal. Chem. 74 (2002) 5383-5392

The low mode represents matches that are most likely wrong, while the high mode represents matches that are probably right.

Page 70:

10 Protein Control Sample (Q-ToF)
Peptide Prophet approach

[Figure: histogram, # of Matches vs. Mascot: Ion-Identity Score; fitted curves labeled “Incorrect” and “Correct”, with “Possibly Correct?” marked]

Peptide Prophet curve fits two distributions to the modes, following the assumption that the low scoring distribution is “Incorrect” and that the higher scoring distribution is “Correct”.

Page 71:

10 Protein Control Sample (Q-ToF)

[Figure: histogram, # of Matches vs. Mascot: Ion-Identity Score; fitted curves labeled “Incorrect” and “Correct”, with “Possibly Correct?” marked]

$$p(+ \mid D) = \frac{p(D \mid +)\,p(+)}{p(D \mid +)\,p(+) + p(D \mid -)\,p(-)}$$

These two distributions can be analyzed using Bayesian statistics with this formula. Now that formula looks pretty complex, but…

Page 72:

10 Protein Control Sample (Q-ToF)

[Figure: histogram, # of Matches vs. Mascot: Ion-Identity Score; fitted curves labeled “Incorrect” and “Correct”]

$$p(+ \mid D) = \frac{p(D \mid +)\,p(+)}{p(D \mid +)\,p(+) + p(D \mid -)\,p(-)}$$

…it just calculates the height of the correct distribution at a particular score, divided by the combined height of both distributions.

Page 73:

10 Protein Control Sample (Q-ToF)

[Figure: fitted curves labeled “Incorrect” and “Correct” along the Mascot: Ion-Identity Score axis]

$$p(+ \mid D) = \frac{p(D \mid +)\,p(+)}{p(D \mid +)\,p(+) + p(D \mid -)\,p(-)} = \frac{\text{prob of having score and being correct}}{\text{prob of having score}}$$

This is essentially the probability of having that score and being correct divided by the probability of just having that score.
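The "height over combined height" reading of the formula is short enough to write out. In this sketch both fitted curves are taken to be normal distributions and their means, widths and the 50% fraction of correct matches are invented for illustration; Peptide Prophet as published fits its own curve shapes to real data, so this shows only the Bayesian step, under those assumptions. The names `normal_pdf` and `probability_correct` are made up.

```python
from math import exp, pi, sqrt


def normal_pdf(x, mean, sd):
    """Height of a normal curve at x."""
    return exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2 * pi))


def probability_correct(score, fraction_correct, correct, incorrect):
    """p(+|D): height of the correct curve over the height of both curves at `score`."""
    height_correct = normal_pdf(score, *correct) * fraction_correct
    height_incorrect = normal_pdf(score, *incorrect) * (1.0 - fraction_correct)
    return height_correct / (height_correct + height_incorrect)


# Hypothetical fitted curves as (mean, standard deviation) on the score axis.
CORRECT, INCORRECT = (20.0, 10.0), (-20.0, 10.0)

for score in (-10.0, 0.0, 10.0, 30.0):
    p = probability_correct(score, fraction_correct=0.5, correct=CORRECT, incorrect=INCORRECT)
    print(f"score {score:6.1f} -> probability correct {p:.3f}")
```

Where the two curves are equally tall, the probability is one half; further to the right, the correct curve dominates and the probability climbs toward one.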

Page 74: Interpreting MS/MS Proteomics Results Brian C. Searle Proteome Software Inc. Portland, Oregon USA Brian.Searle@ProteomeSoftware.com NPC Progress Meeting

[Figure: histogram of # of Matches (y-axis) versus Mascot Ion-Identity Score (x-axis), with regions labeled “Incorrect”, “Possibly Correct?”, and “Correct”]

This is a neat method because it actually considers the likelihood of being correct, unlike X! Tandem and Mascot, which only calculate the probability of being incorrect. It’s because of this that Peptide Prophet can produce a true probability, which is important when the sample characteristics change.

Page 75: Interpreting MS/MS Proteomics Results Brian C. Searle Proteome Software Inc. Portland, Oregon USA Brian.Searle@ProteomeSoftware.com NPC Progress Meeting

[Figure: Q-ToF: histogram of # of Matches (y-axis) versus Mascot Ion-Identity Score (x-axis), with regions labeled “Incorrect”, “Possibly Correct?”, and “Correct”]

For example, the control sample we’ve been looking at was derived from Q-ToF data, which produces pretty high-quality results.

Page 76: Interpreting MS/MS Proteomics Results Brian C. Searle Proteome Software Inc. Portland, Oregon USA Brian.Searle@ProteomeSoftware.com NPC Progress Meeting

[Figure: two histograms of # of Matches (y-axis) versus Mascot Ion-Identity Score (x-axis), one panel labeled “Q-ToF:” and one labeled “Ion Trap:”, each with regions labeled “Incorrect”, “Possibly Correct?”, and “Correct”]

If you compare that to the same sample run on an Ion Trap, the probability of being correct is greatly diminished. If you’ll note, the Incorrect distribution doesn’t change very much between the two analyses; however, the likelihood that the identification is right changes dramatically!
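The effect can be reproduced with a small sketch. All numbers here are hypothetical: the two runs share the same Gaussian “Incorrect” and “Correct” curve shapes and differ only in what fraction of matches is correct (`correct_weight`), standing in for the taller and shorter correct distributions. A tail probability computed from the incorrect curve alone is included for comparison, as a stand-in for a p-value-style score.

```python
import math


def normal_pdf(x, mean, sd):
    """Height of a Gaussian curve at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def probability_correct(score, correct_weight, correct=(45.0, 8.0), incorrect=(15.0, 5.0)):
    """p(+|D) for Gaussian curves given as (mean, sd); all values are hypothetical."""
    joint_correct = correct_weight * normal_pdf(score, *correct)
    joint_incorrect = (1.0 - correct_weight) * normal_pdf(score, *incorrect)
    return joint_correct / (joint_correct + joint_incorrect)


score = 30.0
# Same score, same curve shapes, different share of correct matches per run.
q_tof = probability_correct(score, correct_weight=0.30)
ion_trap = probability_correct(score, correct_weight=0.05)

# Chance that an incorrect match scores at least this high. It depends only on
# the incorrect curve, so it is identical for both runs.
p_value = 0.5 * math.erfc((score - 15.0) / (5.0 * math.sqrt(2.0)))

print("Q-ToF probability correct:   ", round(q_tof, 3))
print("Ion Trap probability correct:", round(ion_trap, 3))
print("tail probability (both runs):", p_value)
```

The tail probability cannot distinguish the two runs, while the probability of being correct drops sharply when correct matches are rarer.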

Page 77: Interpreting MS/MS Proteomics Results Brian C. Searle Proteome Software Inc. Portland, Oregon USA Brian.Searle@ProteomeSoftware.com NPC Progress Meeting

[Figure: Ion Trap: histogram of # of Matches (y-axis) versus Mascot Ion-Identity Score (x-axis), with regions labeled “Incorrect”, “Possibly Correct?”, and “Correct”]

Because Peptide Prophet considers the correct distribution, it is robust to fluctuations between samples. P-values and E-values don’t consider this information, so they can’t be compared across multiple samples, or across different examinations of the same sample. That is why we need to use Peptide Prophet for comparing two different search engines.

Page 78: Interpreting MS/MS Proteomics Results Brian C. Searle Proteome Software Inc. Portland, Oregon USA Brian.Searle@ProteomeSoftware.com NPC Progress Meeting

Consider Multiple Algorithms?

[Figure: scatter plot of X! Tandem -log(E-Value) (y-axis) versus Mascot Ion-Identity Score (x-axis)]

So going back to the scatter plot between X! Tandem and Mascot, we can use Peptide Prophet to compute the score threshold that represents a 95% cut-off…

Page 79: Interpreting MS/MS Proteomics Results Brian C. Searle Proteome Software Inc. Portland, Oregon USA Brian.Searle@ProteomeSoftware.com NPC Progress Meeting

Consider Multiple Algorithms?

[Figure: scatter plot of X! Tandem -log(E-Value) (y-axis) versus Mascot Ion-Identity Score (x-axis), with 95% thresholds marked: X! Tandem: 2.6 = 95%, Mascot: -2.5 = 95%]

Like so.

This allows you to fairly consider the answers from both search engines simultaneously.

The important thing to note is that if you looked at a different sample, these thresholds would change depending on the height of the correct distributions.
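One way to picture the threshold calculation is a simple search for the score at which the probability of being correct reaches 95%. The sketch below is an assumption-laden illustration: the two engines are placeholders (“engine A” and “engine B”) with invented Gaussian curve parameters on different score scales, and `score_threshold` is a hypothetical helper, not how Peptide Prophet or Scaffold computes its cut-offs.

```python
import math


def normal_pdf(x, mean, sd):
    """Height of a Gaussian curve at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def probability_correct(score, model):
    """p(+|D) for one engine's fitted curves."""
    c = model["correct_weight"] * normal_pdf(score, model["correct_mean"], model["correct_sd"])
    i = (1.0 - model["correct_weight"]) * normal_pdf(score, model["incorrect_mean"], model["incorrect_sd"])
    return c / (c + i)


def score_threshold(model, target=0.95, steps=60):
    """Bisect for the score whose probability of being correct equals target.

    Assumes the probability rises steadily with score over the search range,
    which holds for the hypothetical curves used below.
    """
    low = model["incorrect_mean"]
    high = model["correct_mean"] + 5.0 * model["correct_sd"]
    for _ in range(steps):
        mid = (low + high) / 2.0
        if probability_correct(mid, model) < target:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


# Two engines with unrelated score scales (all parameters are made up).
engines = {
    "engine A": {"incorrect_mean": 15.0, "incorrect_sd": 5.0,
                 "correct_mean": 45.0, "correct_sd": 8.0, "correct_weight": 0.2},
    "engine B": {"incorrect_mean": 1.0, "incorrect_sd": 0.8,
                 "correct_mean": 5.0, "correct_sd": 1.5, "correct_weight": 0.2},
}
thresholds = {name: score_threshold(model) for name, model in engines.items()}
for name, value in thresholds.items():
    print(name, "95% threshold:", round(value, 2))
```

The raw thresholds differ between engines, but both mean the same thing: a 95% chance that the match is correct. Changing `correct_weight` shifts each threshold, which is why the cut-offs move from sample to sample.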

Page 80: Interpreting MS/MS Proteomics Results Brian C. Searle Proteome Software Inc. Portland, Oregon USA Brian.Searle@ProteomeSoftware.com NPC Progress Meeting

Conclusion

• All search engines use different criteria, producing different scores

• Using multiple search engines simultaneously yields better results

• Peptide Prophet can normalize search engine results

So in conclusion, all of the search engines look at different criteria.

Page 81: Interpreting MS/MS Proteomics Results Brian C. Searle Proteome Software Inc. Portland, Oregon USA Brian.Searle@ProteomeSoftware.com NPC Progress Meeting


And we can leverage this to identify more peptides

Page 82: Interpreting MS/MS Proteomics Results Brian C. Searle Proteome Software Inc. Portland, Oregon USA Brian.Searle@ProteomeSoftware.com NPC Progress Meeting


And that Peptide Prophet is a great mechanism for doing that because it calculates true probabilities, instead of p-values.

Page 83: Interpreting MS/MS Proteomics Results Brian C. Searle Proteome Software Inc. Portland, Oregon USA Brian.Searle@ProteomeSoftware.com NPC Progress Meeting

The End