Using Prior Knowledge to Improve Scoring in High-Throughput Top-Down Proteomics Experiments
• Rich LeDuc
• Le-Shin Wu
The “Scoring” Problem
• Proteoforms are hypotheses about what was in MS.
• The model “knows” the process.
• Output is a ranked list of hypotheses.
• Science builds on prior knowledge.
Competing Hypotheses
Ranked list of hypotheses, with measure of confidence, under a given model
Process Model
$$\text{‘P score’} = P_{f,n} = \frac{(xf)^{n}\, e^{-xf}}{n!}$$
where
• f is the number of observed fragment ions,
• n is the number of matching fragment ions,
• $M_a$ is the mass accuracy, and
$$x = \frac{1}{111.1} \times M_a \times 2 \times 2$$
F. Meng, B. Cargile, L. Miller, J. Johnson, and N. Kelleher, Nat. Biotechnol., 2001, 19, 952-957.
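As a rough illustration, the Poisson P score on this slide can be sketched in Python. The function names are hypothetical, and the code assumes that the mass accuracy is given in daltons and that 111.1 is the average amino acid residue mass; neither assumption is stated on the slide.

```python
import math

# Assumption: 111.1 is the average amino acid residue mass in Da.
AVG_RESIDUE_MASS = 111.1


def chance_match_rate(mass_accuracy: float) -> float:
    """x = (1 / 111.1) * Ma * 2 * 2, the chance rate of a random match."""
    return (1.0 / AVG_RESIDUE_MASS) * mass_accuracy * 2 * 2


def poisson_p(f: int, n: int, mass_accuracy: float) -> float:
    """P_{f,n} = (x f)^n * exp(-x f) / n! for f fragments and n matches."""
    mean = chance_match_rate(mass_accuracy) * f  # expected random matches
    return mean ** n * math.exp(-mean) / math.factorial(n)


if __name__ == "__main__":
    # Hypothetical numbers: 20 fragment ions, 0.1 Da mass accuracy.
    for n in range(4):
        print(n, poisson_p(f=20, n=n, mass_accuracy=0.1))
```

Because the expected number of random matches x·f is small for tight mass accuracy, the probability falls off quickly as n grows.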
Meng-Kelleher p-score
$$p_{\text{crude}} = 1 - \sum_{i=0}^{n-1} p_{f,i}$$
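A minimal sketch of the crude p-score, read here as the chance of seeing at least n random matches, that is, one minus the sum of the Poisson terms for 0 through n-1 matches. The function names are hypothetical, and `mean` stands for the product x·f from the process model.

```python
import math


def poisson_pmf(mean: float, i: int) -> float:
    """Poisson probability of exactly i random matches."""
    return mean ** i * math.exp(-mean) / math.factorial(i)


def crude_p_score(mean: float, n: int) -> float:
    """p_crude = 1 - sum_{i=0}^{n-1} P_{f,i}: chance of n or more matches."""
    return 1.0 - sum(poisson_pmf(mean, i) for i in range(n))


if __name__ == "__main__":
    # Hypothetical expected number of random matches.
    mean = 0.5
    for n in range(6):
        print(n, crude_p_score(mean, n))
```

Subtracting from one loses precision when the tail is very small; a production implementation would more likely sum the upper tail directly or work in log space.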
Meng-Kelleher p-score
Figure: Probability (0 to 1) versus Number Matching (1 to 30).
Specific Example
Bayesian Approach
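$$p(\text{proteoform} \mid \text{data}) = \frac{p(\text{proteoform})\; p(\text{data} \mid \text{proteoform})}{p(\text{data})}$$
• $p(\text{proteoform})$: prior probability of the proteoform
• $p(\text{data} \mid \text{proteoform})$: likelihood of the proteoform given the observed data
• $p(\text{data})$: probability of the data
• $p(\text{proteoform} \mid \text{data})$: posterior probability of the proteoform after making the observations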
The Scoring Model
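From Bayes' theorem we have:
$$\Pr(q \mid M_O, \{m_i\}) = \frac{\Pr(q)\,\Pr(M_O, \{m_i\} \mid q)}{\Pr(M_O, \{m_i\})}$$
From independence we can write:
$$\Pr(M_O, \{m_i\} \mid q) = \Pr(M_O \mid q) \prod_{i=1}^{n} \Pr(m_i \mid q)$$
Which gives our final scoring function:
$$\Pr(q \mid M_O, \{m_i\}) = \frac{\Pr(q)\,\Pr(M_O, \{m_i\} \mid q)}{\sum_j \Pr(q_j)\,\Pr(M_O, \{m_i\} \mid q_j)} = \frac{\Pr(M_O \mid q) \prod_i \Pr(m_i \mid q)}{\sum_j \Pr(M_O \mid q_j) \prod_i \Pr(m_i \mid q_j)}$$
The last step cancels the priors, which are taken to be equal for every candidate proteoform $q_j$.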
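The scoring function can be sketched as a normalization over candidate proteoforms. This is an illustration under assumptions, not the authors' implementation: the function names and the input layout (one MS1 log-likelihood and a list of MS2 log-likelihoods per candidate) are hypothetical, and the sums are done in log space only to avoid underflow when many fragment terms are multiplied.

```python
import math
from typing import List, Optional, Sequence


def log_sum_exp(values: Sequence[float]) -> float:
    """Stable log(sum(exp(v))) for a non-empty sequence."""
    peak = max(values)
    if peak == float("-inf"):
        return peak
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def posterior_scores(
    log_ms1: Sequence[float],
    log_ms2: Sequence[Sequence[float]],
    priors: Optional[Sequence[float]] = None,
) -> List[float]:
    """Posterior Pr(q_j | M_O, {m_i}) for each candidate q_j.

    log_ms1[j] is log Pr(M_O | q_j); log_ms2[j][i] is log Pr(m_i | q_j).
    Priors must be positive; they default to the uniform prior 1/k.
    """
    k = len(log_ms1)
    if priors is None:
        priors = [1.0 / k] * k
    # Independence: the joint likelihood is a product, so logs are summed.
    joint = [
        math.log(p) + ms1 + sum(ms2)
        for p, ms1, ms2 in zip(priors, log_ms1, log_ms2)
    ]
    total = log_sum_exp(joint)  # log of the denominator (sum over j)
    return [math.exp(j - total) for j in joint]


if __name__ == "__main__":
    # Two hypothetical candidates with made-up likelihoods.
    ms1 = [math.log(0.5), math.log(0.5)]
    ms2 = [[math.log(0.9), math.log(0.8)], [math.log(0.1), math.log(0.2)]]
    print(posterior_scores(ms1, ms2))
```

With uniform priors the prior terms cancel, so the result depends only on the MS1 and MS2 likelihoods.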
MS1 Generative Model
Given a certain theoretic proteoform, what is the probability of seeing the observed precursor mass?
Likelihood Fun Facts
• Area does not equal one.
• Need some level for “wrong precursor mass”.
MS2 Generative Model
Figure: Probability versus Fragment Mass over the range 0 to I; labels: $w_i$, $m_i$, Noise = k.
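$$p(m_i \mid q) = \begin{cases} \dfrac{w_i}{t} & \text{for } m_i \text{ in a permissible region} \\[4pt] \dfrac{k}{t} & \text{for } m_i \in I, \text{ but not in a permissible region} \\[4pt] 0 & \text{otherwise} \end{cases}$$
where t is defined in terms of w, k, I, and l.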
Lambda Scores
Assume that prior to scoring, each sequence had an equal probability of being the correct sequence. This means that if we are considering k sequences, then our prior probability is just:
$$\text{prior} = \frac{1}{k}$$
So then, the ratio of the posterior over the prior is:
$$\frac{\text{post}}{\text{prior}} = \frac{\text{post}}{1/k} = k \cdot \text{post}$$
Take the log of this ratio:
$$\hat{\lambda} = \ln(k \cdot \text{post})$$
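A small sketch of the lambda score as the natural log of the posterior-to-prior ratio under a uniform prior of 1/k. The function name is hypothetical, and returning negative infinity for a zero posterior is a choice made here, not something the slides specify.

```python
import math


def lambda_score(posterior: float, k: int) -> float:
    """lambda = ln(posterior / prior) with prior = 1/k, i.e. ln(k * posterior)."""
    if posterior <= 0.0:
        return float("-inf")  # assumption: a zero posterior maps to -infinity
    return math.log(k * posterior)


if __name__ == "__main__":
    k = 1000  # hypothetical number of candidate sequences
    for post in (0.9, 1.0 / k, 1e-12):
        print(post, lambda_score(post, k))
```

A positive score means the data raised the probability of the sequence above its prior, zero means no change, and a negative score means the data made it less plausible.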
Lambda Spread
Figure: lambda (-200 to 50) versus p-score (0.000 to 0.120).
The lambda score spreads hits with the same number of matching fragment ions.
Room for Improvement
Figure panels: “Initial Version” and “I to Max of all proteoforms”; x-axis on both panels: Theoretical Mass.
One set of real observations scored against 890,000 random “theoretical” proteoforms.
Scoring Models Compared
Ahlf, D.R., Compton, P.D., Tran, J.C., Early, B.P., Thomas, P.M., Kelleher, N.L. “Evaluation of the Compact High-Field Orbitrap for Top-Down Proteomics of Human Cells”, J. Proteome Res., 2012, 11, 4308-4314. PMCID: PMC3437942.
Figure: TPR (Sensitivity), 0.00 to 1.00, versus FDR (1-Specificity), 0.00 to 0.10, for four series: Original Model, LS + Fragment Matching, LS + Intensity Correlation, and P Scores.
Future Directions
Add oxidation for MS1.
Improve modeling of various processes.
Incorporate into a search engine.
Conclusions
Include prior knowledge: Science builds on itself.
There is a system that gives a framework for including prior knowledge in models.
This particular implementation is better than older scoring systems, and it can improve!
Acknowledgements and Questions
• Kelleher group for providing the data.
• All the many colleagues I have worked with on this project over the years.
• Of course all the related funding agencies, but specifically NSF ABI-1062432.