
Using Prior Knowledge to Improve Scoring in High-Throughput Top-Down Proteomics Experiments

• Rich LeDuc
• Le-Shin Wu

The “Scoring” Problem

• Proteoforms are hypotheses about what was in MS.

• The model “knows” the process.

• Output is a ranked list of hypotheses.

• Science builds on prior knowledge.

[Figure: Competing Hypotheses (1, 2, 3, 4) enter a Process Model, which outputs a ranked list of hypotheses (2, 3, 1, 4), with a measure of confidence, under a given model.]

‘P score’:

$$P_{f,n} = \frac{(x f)^{n} \, e^{-x f}}{n!}$$

F. Meng, B. Cargile, L. Miller, J. Johnson, and N. Kelleher, Nat. Biotechnol., 2001, 19, 952-957.

f is the number of observed fragment ions,

n is the number of matches,

Ma is the mass accuracy, and

$$x = M_a \times \frac{1}{111.1} \times 2 \times 2$$
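The Poisson form of the P score can be sketched in a few lines. This is a minimal illustration, not the original implementation: the function names are hypothetical, the per-fragment random-match rate x is assumed to be Ma × (1/111.1) × 2 × 2, and Ma is assumed to be given in Da.

```python
import math


def random_match_rate(mass_accuracy_da: float) -> float:
    """Assumed per-fragment random-match rate: x = Ma * (1/111.1) * 2 * 2."""
    return mass_accuracy_da * (1.0 / 111.1) * 2 * 2


def p_score(f: int, n: int, mass_accuracy_da: float) -> float:
    """Poisson probability of exactly n random matches among f fragment ions."""
    if f <= 0 or n < 0 or mass_accuracy_da <= 0:
        raise ValueError("need f > 0, n >= 0 and a positive mass accuracy")
    rate = random_match_rate(mass_accuracy_da) * f  # this is x * f
    # Evaluate (x f)^n e^(-x f) / n! in log space so a large n does not overflow.
    log_p = n * math.log(rate) - rate - math.lgamma(n + 1)
    return math.exp(log_p)


if __name__ == "__main__":
    # Hypothetical numbers: 10 fragment ions, 0.1 Da mass accuracy.
    for matches in range(4):
        print(matches, p_score(10, matches, 0.1))
```

Working in log space is a design choice for numerical safety; for small n the direct formula gives the same value.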

Meng-Kelleher p-score

[Equation: Meng-Kelleher p-score, defined from p_crude and n.]

Meng-Kelleher p-score

[Figure: Probability (0 to 1) versus Number Matching (1 to 30).]

Specific Example

Bayesian Approach

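$$p(\text{proteoform} \mid \text{data}) = \frac{p(\text{proteoform}) \, p(\text{data} \mid \text{proteoform})}{p(\text{data})}$$

• p(proteoform): prior probability of the proteoform
• p(data | proteoform): likelihood of the proteoform given the observed data
• p(data): probability of the data
• p(proteoform | data): posterior probability of the proteoform after making the observations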
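A small numeric sketch may help make the roles of prior, likelihood, and posterior concrete. The hypothesis names and numbers below are invented for illustration, and p(data) is assumed to be the sum of prior × likelihood over the hypotheses considered.

```python
from typing import Dict


def bayes_posteriors(priors: Dict[str, float],
                     likelihoods: Dict[str, float]) -> Dict[str, float]:
    """Posterior for each hypothesis: prior * likelihood / p(data)."""
    # p(data) is taken as the total over the hypotheses under consideration.
    evidence = sum(priors[h] * likelihoods[h] for h in priors)
    if evidence <= 0:
        raise ValueError("p(data) must be positive")
    return {h: priors[h] * likelihoods[h] / evidence for h in priors}


if __name__ == "__main__":
    # Two hypothetical proteoforms with equal priors and different likelihoods.
    result = bayes_posteriors({"A": 0.5, "B": 0.5}, {"A": 0.08, "B": 0.02})
    print(result)
```

With equal priors, the posterior ranking follows the likelihoods alone; unequal priors would shift it.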

The Scoring Model

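From Bayes' theorem we have:

$$\Pr(q \mid M_O, \{m_i\}) = \frac{\Pr(q) \, \Pr(M_O, \{m_i\} \mid q)}{\Pr(M_O, \{m_i\})}$$

From independence we can write:

$$\Pr(M_O, \{m_i\} \mid q) = \Pr(M_O \mid q) \prod_{i=1}^{n} \Pr(m_i \mid q)$$

Which gives our final scoring function:

$$\Pr(q \mid M_O, \{m_i\}) = \frac{\Pr(q) \, \Pr(M_O, \{m_i\} \mid q)}{\sum_j \Pr(q_j) \, \Pr(M_O, \{m_i\} \mid q_j)} = \frac{\Pr(M_O \mid q) \prod_i \Pr(m_i \mid q)}{\sum_j \Pr(M_O \mid q_j) \prod_i \Pr(m_i \mid q_j)}$$

The last step takes the priors Pr(q_j) to be equal, so they cancel.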
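The final scoring function normalizes each candidate's joint likelihood over all candidates. The sketch below shows one way to evaluate it, assuming equal priors and assuming the per-candidate log-likelihoods log Pr(M_O | q_j) and log Pr(m_i | q_j) are already available. The function name and inputs are hypothetical, and the log-sum-exp step is a numerical convenience rather than part of the talk.

```python
import math
from typing import List, Sequence


def posterior_scores(log_ms1: Sequence[float],
                     log_frags: Sequence[Sequence[float]]) -> List[float]:
    """Posterior of each candidate q_j under equal priors.

    log_ms1[j]      = log Pr(M_O | q_j)
    log_frags[j][i] = log Pr(m_i | q_j)
    """
    if not log_ms1 or len(log_ms1) != len(log_frags):
        raise ValueError("need one MS1 term and one fragment list per candidate")
    # Independence: the joint log-likelihood is the MS1 term plus the fragment terms.
    joint = [ms1 + sum(frags) for ms1, frags in zip(log_ms1, log_frags)]
    # Log-sum-exp keeps the normalizing sum finite when likelihoods are tiny.
    peak = max(joint)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in joint))
    return [math.exp(v - log_norm) for v in joint]


if __name__ == "__main__":
    # Two hypothetical candidates, each scored against two fragment masses.
    ms1 = [math.log(0.5), math.log(0.5)]
    frags = [[math.log(0.2), math.log(0.2)], [math.log(0.1), math.log(0.1)]]
    print(posterior_scores(ms1, frags))
```

Summing logs instead of multiplying probabilities matters in practice because a product over many fragments underflows quickly.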

MS1 Generative Model

Given a certain theoretic proteoform, what is the probability of seeing the observed precursor mass?

Likelihood Fun Facts

• Area does not equal one.
• Need some level for “wrong precursor mass”.

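[Figure: fragment-mass likelihood. Probability versus Fragment Mass over the range 0 to I, with a peak of width w_i at m_i and a noise level, Noise = k.]

[Equation: piecewise definition of p(m_i | ·), including a case for m_i in I but not in a permissible region and a value of 0 otherwise; the constants involve t, k, w_i, l_m, and I.]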

Lambda Scores

Assume that prior to scoring, each sequence had an equal probability of being the correct sequence. This means that if we are considering k sequences, then our prior probability is just:

$$\text{prior} = \frac{1}{k}$$

So then, the ratio of the posterior over the prior is:

$$\frac{\text{post}}{\text{prior}} = \frac{\text{post}}{1/k} = k \cdot \text{post}$$

Take the log of this ratio:

$$\hat{\lambda} = \ln(k \cdot \text{post})$$
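As a quick sketch of the lambda score, the function below takes a posterior and the number of candidate sequences k and returns ln(k · post). The function name is hypothetical, and returning negative infinity for a zero posterior is a choice made for this example.

```python
import math


def lambda_score(posterior: float, k: int) -> float:
    """Log of posterior over prior, with a uniform prior of 1/k."""
    if k <= 0 or not 0.0 <= posterior <= 1.0:
        raise ValueError("need k > 0 and a posterior in [0, 1]")
    if posterior == 0.0:
        return float("-inf")  # log of zero; chosen convention for this sketch
    return math.log(k * posterior)


if __name__ == "__main__":
    k = 4  # hypothetical number of candidate sequences
    for post in (0.9, 0.25, 0.01):
        print(post, lambda_score(post, k))
```

A score of zero means the data left the candidate exactly at its prior; positive values mean the data favored it, and negative values mean the data counted against it.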

Lambda Spread

[Figure: lambda (-200 to 50) versus p-score (0.000 to 0.120).]

The lambda score spreads hits with the same number of matching fragment ions.

Room for Improvement

[Figure: two panels, “Initial Version” and “I to Max of all proteoforms”, each plotted against Theoretical Mass.]

One set of real observations scored against 890,000 random “theoretical” proteoforms.

Scoring Models Compared

Ahlf, D.R., Compton, P.D., Tran, J.C., Early, B.P., Thomas, P.M., Kelleher, N.L. “Evaluation of the Compact High-Field Orbitrap for Top-Down Proteomics of Human Cells”, J. Proteome Res., 2012, 11, 4308-4314. PMCID: PMC3437942.

[Figure: TPR (Sensitivity), 0.00 to 1.00, versus FDR (1-Specificity), 0.00 to 0.10, for four scoring models: Original Model, LS + Fragment Matching, LS + Intensity Correlation, and P Scores.]
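A comparison of this kind plots sensitivity against 1 - specificity as the score threshold is swept. The sketch below shows how such points could be computed from scored hits with known correct/incorrect labels. It is not the analysis used in the talk: the function name and data are hypothetical, higher scores are assumed to be better (a p-score would need to be negated first), and tied scores are not given special treatment.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float],
               is_correct: Sequence[bool]) -> List[Tuple[float, float]]:
    """Return (1 - specificity, sensitivity) pairs, best score first."""
    positives = sum(1 for c in is_correct if c)
    negatives = len(is_correct) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one correct and one incorrect hit")
    # Lowering the threshold one hit at a time, starting from the best score.
    ranked = sorted(zip(scores, is_correct), key=lambda pair: pair[0], reverse=True)
    true_pos = false_pos = 0
    points = [(0.0, 0.0)]
    for _, correct in ranked:
        if correct:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


if __name__ == "__main__":
    # Hypothetical scores and labels for four hits.
    print(roc_points([0.9, 0.8, 0.7, 0.6], [True, False, True, False]))
```

A scoring model whose curve rises faster at small values of 1 - specificity separates correct from incorrect hits better in the region that matters for confident identifications.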

Future Directions

Add oxidation for MS1.

Improve modeling of various processes.

Incorporate into a search engine.

Conclusions

Include prior knowledge: Science builds on itself.

There is a system that gives a framework for including prior knowledge in models.

This particular implementation is better than older scoring systems, and it can improve!

Acknowledgements and Questions

Kelleher group for providing the data.

All the many colleagues I have worked with on this project over the years.

Of course all the related funding agencies, but specifically NSF ABI-1062432.
