Protein and gene model inference based on statistical modeling in k-partite graphs
+Protein and gene model inference based on statistical modeling in k-partite graphs
Sarah Gerster, Ermir Qeli, Christian H. Ahrens, and Peter Bühlmann
+Problem Description
Given peptides and scores/probabilities, infer the set of proteins present in the sample.
[Figure: four example peptides (PERFGKLMQK, MLLTDFSSAWCR, FFRDESQINNR, TGYIPPPLJMGKR) linked to candidate proteins A, B, and C.]
+Previous Approaches
- N-peptides rule
- ProteinProphet (Nesvizhskii et al. 2003. Anal Chem)
  - Assumes peptide scores are correct.
- Nested mixture model (Li et al. 2010. Ann Appl Statist)
  - Rescores peptides while doing the protein inference.
  - Does not allow shared peptides.
  - Peptide scores are independent.
- Hierarchical statistical model (Shen et al. 2008. Bioinformatics)
  - Allows for shared peptides.
  - Assumes PSM scores for the same peptide are independent.
  - Impractical on normal datasets.
- MSBayesPro (Li et al. 2009. J Comput Biol)
  - Uses peptide detectabilities to determine peptide priors.
+Markovian Inference of Proteins and Gene Models (MIPGEM)
- Includes shared/degenerate peptides in the model.
- Treats peptide scores/probabilities as random values.
- Allows dependence of peptide scores.
- Infers gene models.
+Why scores as random values?
[Figure: the example peptides (PERFGKLMQK, MLLTDFSSAWCR, FFRDESQINNR, TGYIPPPLJMGKR) and candidate proteins A, B, and C from the problem description.]
+Building the bipartite graph
+Shared peptides
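A minimal Python sketch of the graph construction, assuming a hypothetical peptide-to-protein mapping (the edges below are invented for illustration; the slides do not list them). Peptides form one node set and proteins the other, and a peptide counts as shared (degenerate) when it has more than one neighboring protein.

```python
from collections import defaultdict

# Hypothetical peptide -> protein matches; these edges are illustrative only.
peptide_to_proteins = {
    "PERFGKLMQK": ["Protein A"],
    "MLLTDFSSAWCR": ["Protein A", "Protein B"],
    "FFRDESQINNR": ["Protein B"],
    "TGYIPPPLJMGKR": ["Protein C"],
}


def build_bipartite_graph(mapping):
    """Return (neighbors of each peptide, peptides of each protein)."""
    neighbors = {pep: set(prots) for pep, prots in mapping.items()}
    peptides_of = defaultdict(set)
    for pep, prots in neighbors.items():
        for prot in prots:
            peptides_of[prot].add(pep)
    return neighbors, dict(peptides_of)


def shared_peptides(neighbors):
    """Peptides connected to more than one protein."""
    return sorted(pep for pep, prots in neighbors.items() if len(prots) > 1)


neighbors, peptides_of = build_bipartite_graph(peptide_to_proteins)
shared = shared_peptides(neighbors)
print(shared)
```

In this toy mapping only MLLTDFSSAWCR is shared, so it is the only peptide whose evidence has to be split between two candidate proteins.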
+Definitions
Let $p_i$ be the score/probability of peptide $i$. $I$ is the set of all peptides.
Let $Z_j$ be the indicator variable for protein $j$. $J$ is the set of all proteins.
The quantity to infer is
$$
P[Z_j = 1 \mid \{p_i; i \in I\}]
$$
+Simple Probability Rules
$$
P(A \mid B) = \frac{P(A, B)}{P(B)} = \frac{P(B \mid A)\,P(A)}{P(B)}
$$
$$
P(A, B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)
$$
$$
P(A) = \sum_{b} P(A, B = b)
$$
$$
\sum_{a} P(A = a \mid B) = 1
$$
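The rules above can be checked numerically on a small example. The sketch below assumes a made-up joint distribution over two binary variables; the numbers carry no meaning beyond illustration.

```python
# Hypothetical joint distribution P(A = a, B = b) over two binary variables.
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}


def marginal_a(a):
    """P(A = a), obtained by summing the joint over b."""
    return sum(p for (x, _), p in joint.items() if x == a)


def marginal_b(b):
    """P(B = b), obtained by summing the joint over a."""
    return sum(p for (_, y), p in joint.items() if y == b)


def cond_a_given_b(a, b):
    """P(A = a | B = b) = P(A = a, B = b) / P(B = b)."""
    return joint[(a, b)] / marginal_b(b)


def cond_b_given_a(b, a):
    """P(B = b | A = a) = P(A = a, B = b) / P(A = a)."""
    return joint[(a, b)] / marginal_a(a)


# Bayes rule: P(A | B) = P(B | A) P(A) / P(B)
lhs = cond_a_given_b(1, 1)
rhs = cond_b_given_a(1, 1) * marginal_a(1) / marginal_b(1)
bayes_gap = abs(lhs - rhs)

# Conditional probabilities sum to one over the conditioned variable.
norm = sum(cond_a_given_b(a, 1) for a in (0, 1))
print(bayes_gap, norm)
```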
+Bayes Rule
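$$
P[Z_j = 1 \mid \{p_i; i \in I\}] = \frac{P[Z_j = 1, \{p_i; i \in I\}]}{P[\{p_i; i \in I\}]} = \frac{P[\{p_i; i \in I\} \mid Z_j = 1] \cdot P[Z_j = 1]}{P[\{p_i; i \in I\}]}
$$
- $P[Z_j = 1]$: prior probability on the protein being present.
- $P[\{p_i; i \in I\}]$: joint probability of seeing these peptide scores.
- $P[\{p_i; i \in I\} \mid Z_j = 1]$: probability of observing these peptide scores given that the protein is present.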
+Assumptions
Prior probabilities of proteins are independent:
$$
P[\{Z_j; j \in J\}] = \prod_{j \in J} P(Z_j)
$$
- Dependencies can be included with a little more effort.
- This does not mean that proteins are independent.
+Assumptions
Connected components are independent
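Because components are treated independently, inference can run on one component at a time. The sketch below finds the components of a peptide-protein graph with a breadth-first search; the input mapping and its names (`pep1`, `A`, and so on) are hypothetical.

```python
from collections import deque


def connected_components(peptide_to_proteins):
    """Split a peptide-protein bipartite graph into connected components.

    Returns a list of (peptides, proteins) pairs, each a pair of sets.
    """
    # Tag nodes by kind so a peptide and a protein can never collide by name.
    adj = {}
    for pep, prots in peptide_to_proteins.items():
        adj.setdefault(("pep", pep), set())
        for prot in prots:
            adj[("pep", pep)].add(("prot", prot))
            adj.setdefault(("prot", prot), set()).add(("pep", pep))

    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, peps, prots = deque([start]), set(), set()
        while queue:
            node = queue.popleft()
            kind, name = node
            (peps if kind == "pep" else prots).add(name)
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append((peps, prots))
    return components


# Hypothetical example: pep2 is shared, so A and B end up in one component.
example = {"pep1": ["A"], "pep2": ["A", "B"], "pep3": ["B"], "pep4": ["C"]}
components = connected_components(example)
print(components)
```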
+Assumptions
Peptide scores are independent given their neighboring proteins.
- $Ne(i)$ is the set of proteins connected to peptide $i$ in the graph.
- $I_r$ is the set of peptides belonging to the $r$th connected component.
- $R(I_r)$ is the set of proteins connected to peptides in $I_r$.

$$
P[\{p_i; i \in I_r\} \mid \{Z_j; j \in R(I_r)\}] = \prod_{i \in I_r} P[p_i \mid \{Z_j; j \in Ne(i)\}]
$$
+Assumptions
Conditional peptide probabilities are modeled by a mixture model.
The specific mixture model they use is based on the peptide scores used (from PeptideProphet).
+Bayes Rule
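$$
P[Z_j = 1 \mid \{p_i; i \in I\}] = \frac{P[Z_j = 1, \{p_i; i \in I\}]}{P[\{p_i; i \in I\}]} = \frac{P[\{p_i; i \in I\} \mid Z_j = 1] \cdot P[Z_j = 1]}{P[\{p_i; i \in I\}]}
$$
- $P[Z_j = 1]$: prior probability on the protein being present.
- $P[\{p_i; i \in I\}]$: joint probability of seeing these peptide scores.
- $P[\{p_i; i \in I\} \mid Z_j = 1]$: probability of observing these peptide scores given that the protein is present.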
+Joint peptide score distribution
Assumption: peptides in different components are independent.
- $I_r$ is the set of peptides in component $r$.
- $R(I_r)$ is the set of proteins connected to peptides in $I_r$.

$$
P(\{p_i; i \in I\}) = \prod_{r=1}^{R} P(\{p_i; i \in I_r\})
$$
$$
P(\{p_i; i \in I_r\}) = \sum_{Z_j \in \{0,1\},\; j \in R(I_r)} P(\{p_i; i \in I_r\} \mid \{Z_j; j \in R(I_r)\})\, P(\{Z_j; j \in R(I_r)\})
$$
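The sum over protein configurations can be written as a brute-force enumeration. This is a sketch under stated assumptions, not the authors' implementation: the function name, argument layout, and the stand-in conditional density are all hypothetical, and the loop visits $2^n$ configurations for $n$ proteins, so it only suits small components.

```python
from itertools import product


def component_marginal(scores, ne, priors, cond_density):
    """P({p_i; i in I_r}) for one component, summed over all protein states.

    scores: peptide -> score p_i
    ne: peptide -> list of neighboring proteins Ne(i)
    priors: protein -> prior P(Z_j = 1)
    cond_density: function (p_i, number of present neighbors) -> density
    """
    proteins = sorted(priors)
    total = 0.0
    for states in product((0, 1), repeat=len(proteins)):
        z = dict(zip(proteins, states))
        # Independent priors: P({Z_j}) is a product over proteins.
        prior = 1.0
        for j in proteins:
            prior *= priors[j] if z[j] else 1.0 - priors[j]
        # Peptide scores are independent given their neighboring proteins.
        lik = 1.0
        for i, p in scores.items():
            lik *= cond_density(p, sum(z[j] for j in ne[i]))
        total += lik * prior
    return total


def toy_density(p, n_present):
    """Stand-in density (assumed): 2.0 if any neighbor is present, else 1.0."""
    return 2.0 if n_present else 1.0


# One peptide with a single neighboring protein and prior 0.5:
# 0.5 * 1.0 (protein absent) + 0.5 * 2.0 (protein present).
toy = component_marginal({"pep1": 0.9}, {"pep1": ["A"]}, {"A": 0.5}, toy_density)
print(toy)
```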
+Conditional Probability
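Mixture model
$$
P(p_i \mid \{Z_j; j \in Ne(i)\}) \approx
\begin{cases}
\dfrac{1}{u - l} & \text{if } \sum_{j \in Ne(i)} Z_j = 0 \\
f_1(p_i) & \text{if } \sum_{j \in Ne(i)} Z_j > 0
\end{cases}
$$
$$
l = \min_i(p_i), \qquad m = \operatorname{median}_i(p_i), \qquad u = \max_i(p_i)
$$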
+Conditional Probability
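Mixture model
$$
f_1(x) =
\begin{cases}
b_1 (x - l) & l \le x \le m \\
(b_1 + b_2)(x - m) + b_1 (m - l) & m < x \le u
\end{cases}
$$
$$
l = \min_i(p_i), \qquad m = \operatorname{median}_i(p_i), \qquad u = \max_i(p_i)
$$
$$
\int_l^u f_1(x)\,dx = 1
$$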
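A small Python sketch of the mixture components, assuming illustrative values for $l$, $m$, $u$, and $b_1$ (none of them come from the slides). Here $b_2$ is obtained from the normalization constraint, and a crude midpoint rule checks that $f_1$ integrates to one.

```python
def solve_b2(b1, l, m, u):
    """b2 that makes f1 integrate to 1 on [l, u] for a given b1."""
    return (2.0 - b1 * (u - l) ** 2) / (u - m) ** 2


def f1(x, b1, b2, l, m, u):
    """Piecewise-linear density for peptides with a present neighboring protein."""
    if l <= x <= m:
        return b1 * (x - l)
    if m < x <= u:
        return (b1 + b2) * (x - m) + b1 * (m - l)
    return 0.0


def cond_density(x, n_present, b1, b2, l, m, u):
    """Uniform on [l, u] if no neighboring protein is present, otherwise f1."""
    return 1.0 / (u - l) if n_present == 0 else f1(x, b1, b2, l, m, u)


def integrate(f, a, b, n=20000):
    """Midpoint-rule integral, used only as a sanity check."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))


# Hypothetical parameter values, chosen only for illustration.
l, m, u, b1 = 0.80, 0.95, 1.00, 10.0
b2 = solve_b2(b1, l, m, u)
area = integrate(lambda x: f1(x, b1, b2, l, m, u), l, u)
print(b2, area)
```

With these values the slope increases after the median, so high scores are much more likely under $f_1$ than under the uniform component.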
+f1(x) – pdf of P(pi|{zj})
[Figure: plot of f(x) against x, with the x-axis running from 0.8 to 0.98, the y-axis from 0 to 0.4, and the median marked.]
+Choosing b1 and b2
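Seek to maximize the log likelihood of observing the peptide scores. Here $\ell$ denotes the log likelihood, distinct from $l = \min_i(p_i)$.
$$
\ell = \log P(\{p_i; i \in I\}) = \log\left( \prod_{r=1}^{R} P(\{p_i; i \in I_r\}) \right)
$$
$$
\ell = \sum_{r=1}^{R} \log P(\{p_i; i \in I_r\})
$$
$$
\ell = \sum_{r=1}^{R} \log\left( \sum_{z_j \in \{0,1\},\; j \in R(I_r)} \; \prod_{i \in I_r} P(p_i \mid \{Z_j = z_j; j \in Ne(i)\}) \cdot \prod_{j \in R(I_r)} P(Z_j = z_j) \right)
$$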
+Choosing b1 and b2
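It turns out:
$$
\hat{b}_1 = \operatorname*{arg\,min}_{b_1} \; -\ell(b_1)
$$
$$
\hat{b}_2 = \frac{2 - \hat{b}_1 (u - l)^2}{(u - m)^2}
$$
where $\ell(b_1)$ is the log likelihood as a function of $b_1$.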
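The expression for $\hat{b}_2$ can be checked against the normalization constraint $\int_l^u f_1(x)\,dx = 1$ on the piecewise-linear density defined earlier. A short derivation, offered as a sanity check rather than as the authors' own argument:

```latex
\begin{aligned}
1 &= \int_l^m b_1 (x - l)\,dx + \int_m^u \left[ (b_1 + b_2)(x - m) + b_1 (m - l) \right] dx \\
  &= \frac{b_1 (m - l)^2}{2} + \frac{(b_1 + b_2)(u - m)^2}{2} + b_1 (m - l)(u - m) \\
  &= \frac{b_1 \left[ (m - l) + (u - m) \right]^2}{2} + \frac{b_2 (u - m)^2}{2}
   = \frac{b_1 (u - l)^2 + b_2 (u - m)^2}{2}
\end{aligned}
\quad\Longrightarrow\quad
b_2 = \frac{2 - b_1 (u - l)^2}{(u - m)^2}
```

So only $b_1$ is a free parameter; once it is estimated, $b_2$ is fixed by the requirement that $f_1$ be a density.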
+Conditional Protein Probabilities
$$
P[Z_j = 1 \mid \{p_i; i \in I\}] = \frac{P[Z_j = 1, \{p_i; i \in I\}]}{P[\{p_i; i \in I\}]} = \frac{P[\{p_i; i \in I\} \mid Z_j = 1] \cdot P[Z_j = 1]}{P[\{p_i; i \in I\}]}
$$
+Conditional Protein Probabilities
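$$
\begin{aligned}
P(Z_j = 1 \mid \{p_i; i \in I\}) &= \frac{P[\{p_i; i \in I\} \mid Z_j = 1] \cdot P[Z_j = 1]}{P[\{p_i; i \in I\}]} \\
&= \frac{\sum_{k \in R(I_{d(j)})} P[\{p_i; i \in I\} \mid Z_j = 1, Z_k = z] \cdot P[Z_j = 1, Z_k = z]}{P[\{p_i; i \in I\}]} \\
&= \frac{A(1)}{P[\{p_i; i \in I\}]}
\end{aligned}
$$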
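+Conditional Protein Probabilities (NEC Correction)
$$
\begin{aligned}
P(Z_j = 1 \mid \{p_i; i \in I\}) &= \frac{P[\{p_i; i \in I\} \mid Z_j = 1] \cdot P[Z_j = 1]}{P[\{p_i; i \in I\}]} \\
&= \frac{\sum_{z_k \in \{0,1\}} P[\{p_i; i \in I\} \mid Z_j = 1, \{Z_k; k \ne j\}] \cdot P[Z_j = 1, \{Z_k; k \ne j\}]}{P[\{p_i; i \in I\}]} \\
&= \frac{A(1)}{P[\{p_i; i \in I\}]}
\end{aligned}
$$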
+Conditional Protein Probabilities
$$
P(Z_j = 1 \mid \{p_i; i \in I\}) = \frac{A(1)}{P[\{p_i; i \in I\}]}
$$
$$
P(Z_j = 0 \mid \{p_i; i \in I\}) = \frac{A(0)}{P[\{p_i; i \in I\}]}
$$
+Conditional Protein Probabilities
$$
P(Z_j = 1 \mid \{p_i; i \in I\}) + P(Z_j = 0 \mid \{p_i; i \in I\}) = 1
$$
$$
\frac{A(0)}{P(\{p_i; i \in I\})} + \frac{A(1)}{P(\{p_i; i \in I\})} = \frac{A(1) + A(0)}{P(\{p_i; i \in I\})} = 1
$$
$$
A(1) + A(0) = P(\{p_i; i \in I\})
$$
+Conditional Protein Probabilities
$$
P(Z_j = 1 \mid \{p_i; i \in I\}) = \frac{A(1)}{P[\{p_i; i \in I\}]} = \frac{A(1)}{A(0) + A(1)}
$$
$$
P(Z_j = 0 \mid \{p_i; i \in I\}) = \frac{A(0)}{P[\{p_i; i \in I\}]} = \frac{A(0)}{A(0) + A(1)}
$$
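The ratio $A(1) / (A(0) + A(1))$ means the denominator never has to be computed separately. The sketch below illustrates this on a toy component by brute-force enumeration; the component, the priors, and the stand-in density are hypothetical, and the function names are not taken from the authors' software.

```python
from itertools import product


def joint_weight(z, scores, ne, priors, cond_density):
    """P({p_i} | Z = z) * P(Z = z) for one full protein configuration z."""
    w = 1.0
    for j, present in z.items():
        w *= priors[j] if present else 1.0 - priors[j]
    for i, p in scores.items():
        w *= cond_density(p, sum(z[j] for j in ne[i]))
    return w


def protein_posterior(j, scores, ne, priors, cond_density):
    """P(Z_j = 1 | scores) = A(1) / (A(0) + A(1)) within one component."""
    others = sorted(k for k in priors if k != j)
    a = {0: 0.0, 1: 0.0}
    for zj in (0, 1):
        # A(zj): sum over the states of all other proteins with Z_j fixed.
        for states in product((0, 1), repeat=len(others)):
            z = dict(zip(others, states))
            z[j] = zj
            a[zj] += joint_weight(z, scores, ne, priors, cond_density)
    return a[1] / (a[0] + a[1])


def toy_density(p, n_present):
    """Stand-in density (assumed): 3.0 if any neighbor is present, else 1.0."""
    return 3.0 if n_present else 1.0


# Hypothetical component: pep1 is unique to A, pep2 is shared by A and B.
scores = {"pep1": 0.97, "pep2": 0.90}
ne = {"pep1": ["A"], "pep2": ["A", "B"]}
priors = {"A": 0.5, "B": 0.5}

post_a = protein_posterior("A", scores, ne, priors, toy_density)
post_b = protein_posterior("B", scores, ne, priors, toy_density)
print(post_a, post_b)
```

In this toy case protein A ends up with the higher posterior, since B is supported only by a peptide that A can also explain.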
+Shared Peptides
$$
A_{\text{unshared}}(1) = \sum_{k \in R(I_A)} P[\{p_i; i \in I_A\} \mid Z_1 = 1, Z_k = z] \cdot P[Z_1 = 1, Z_k = z]
$$
$$
A_{\text{unshared}}(1) = P[\{p_i; i \in I_A\} \mid Z_1 = 1] \cdot P[Z_1 = 1]
$$
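+Shared Peptides
$$
A_{\text{shared}}(1) = \sum_{k \in R(I_B)} P[\{p_i; i \in I_B\} \mid Z_1 = 1, Z_k = z] \cdot P[Z_1 = 1, Z_k = z]
$$
$$
\begin{aligned}
A_{\text{shared}}(1) ={}& P[\{p_i; i \in I_B\} \mid Z_1 = 1, Z_2 = 1] \cdot P[Z_1 = 1] \cdot P[Z_2 = 1] \\
&+ P[\{p_i; i \in I_B\} \mid Z_1 = 1, Z_2 = 0] \cdot P[Z_1 = 1] \cdot P[Z_2 = 0]
\end{aligned}
$$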
+Shared Peptides
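If the shared peptide has $p_i \ge$ median:
$$
P_{\text{unshared}}[Z_1 = 1 \mid \{p_i; i \in I\}] \ge P_{\text{shared}}[Z_1 = 1 \mid \{p_i; i \in I\}]
$$
$$
P_{\text{unshared}}[Z_1 = 0 \mid \{p_i; i \in I\}] \le P_{\text{shared}}[Z_1 = 0 \mid \{p_i; i \in I\}]
$$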
+Shared Peptides
If the shared peptide has $p_i <$ median:
$$
P_{\text{unshared}}[Z_1 = 1 \mid \{p_i; i \in I\}] < P_{\text{shared}}[Z_1 = 1 \mid \{p_i; i \in I\}]
$$
$$
P_{\text{unshared}}[Z_1 = 0 \mid \{p_i; i \in I\}] > P_{\text{shared}}[Z_1 = 0 \mid \{p_i; i \in I\}]
$$
+Gene Model Inference
+Gene Model Inference
Assume a gene model, X, has only protein sequences which belong to the same connected component.
[Figure: Peptides 1 to 4, Proteins A and B, and Gene X, with the proteins of Gene X lying in a single connected component.]
+Gene Model Inference
Assume a gene model, X, has only protein sequences which belong to the same connected component.
- $R(X)$ is the set of proteins with edges to $X$.
- $I_r(X)$ is the set of peptides with edges to proteins with edges to $X$.

$$
P[X = 1 \mid \{p_i; i \in I\}] = 1 - P\left[ \bigcap_{j \in R(X)} \{Z_j = 0\} \;\middle|\; \{p_i; i \in I_r(X)\} \right]
$$
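A gene model counts as present unless every one of its proteins is absent. The sketch below computes that probability by enumeration for a toy component. It assumes the gene's proteins make up the whole component, so conditioning on all component peptides coincides with conditioning on $I_r(X)$; the data and the stand-in density are hypothetical.

```python
from itertools import product


def gene_model_posterior(gene_proteins, scores, ne, priors, cond_density):
    """P[X = 1 | scores] = 1 - P[all Z_j = 0 for j in R(X) | scores]."""
    proteins = sorted(priors)
    all_absent, total = 0.0, 0.0
    for states in product((0, 1), repeat=len(proteins)):
        z = dict(zip(proteins, states))
        # Weight of this configuration: prior times peptide likelihood.
        w = 1.0
        for j in proteins:
            w *= priors[j] if z[j] else 1.0 - priors[j]
        for i, p in scores.items():
            w *= cond_density(p, sum(z[j] for j in ne[i]))
        total += w
        if all(z[j] == 0 for j in gene_proteins):
            all_absent += w
    return 1.0 - all_absent / total


def toy_density(p, n_present):
    """Stand-in density (assumed): 3.0 if any neighbor is present, else 1.0."""
    return 3.0 if n_present else 1.0


# Hypothetical component: gene X encodes proteins A and B.
scores = {"pep1": 0.97, "pep2": 0.90}
ne = {"pep1": ["A"], "pep2": ["A", "B"]}
priors = {"A": 0.5, "B": 0.5}

p_gene = gene_model_posterior(["A", "B"], scores, ne, priors, toy_density)
print(p_gene)
```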
+Gene Model Inference
Gene model, X, has proteins from different connected components of the peptide-protein graph.
[Figure: Peptides 1 to 4, Proteins A and B, and Gene X, with the proteins of Gene X lying in different connected components.]
+Gene Model Inference
Gene model, X, has proteins from different connected components of the peptide-protein graph.
Rl(X) is the set of proteins with edges to X in component l.
Il(X) is the set of peptides with edges to proteins with edges to X in component l.
$$
P\left[ \bigcap_{j \in R(X)} \{Z_j = 0\} \;\middle|\; \{p_i; i \in I_r(X)\} \right] = \prod_{l=1}^{m} P\left[ \bigcap_{j \in R_l(X)} \{Z_j = 0\} \;\middle|\; \{p_i; i \in I_l(X)\} \right]
$$
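Once the per-component "all proteins absent" probabilities are available, combining them is a single product. A minimal sketch, with made-up input values and a hypothetical function name:

```python
import math


def gene_model_posterior_multi(absent_probs):
    """P[X = 1 | scores] when the proteins of X span several components.

    absent_probs: for each component, the probability that all proteins of X
    in that component are absent, given that component's peptide scores.
    """
    return 1.0 - math.prod(absent_probs)


# Two hypothetical components with absence probabilities 0.2 and 0.5.
p_x = gene_model_posterior_multi([0.2, 0.5])
print(p_x)
```

The product is valid only because connected components are assumed independent.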
+Datasets
- Mixture of 18 purified proteins
- Mixture of 49 proteins (Sigma49)
- Drosophila melanogaster
- Saccharomyces cerevisiae (~4200 proteins)
- Arabidopsis thaliana (~4580 gene models)
+Comparisons with other tools
Small datasets with a known answer:
- Mix of 18 proteins
- Sigma49
+Comparisons with other tools
[Figure panels: Sigma49 with one-hit wonders; Sigma49 with no one-hit wonders.]
+Comparison with other tools
Arabidopsis thaliana dataset has many proteins with high sequence similarity.
+Splice isoforms
+Conclusion +Criticism
- Developed a model for protein and gene model inference.
- Comparisons with other tools do not justify the complexity: the value of a small false-positive (FP) rate at the expense of many false negatives (FN) is not shared by all applications.
- Discards some useful information, such as the number of spectra per peptide.
- Assumptions of parsimony from pruning may be too aggressive.