![Page 1: Exploring the use of mass spectrometry data for differential proteomics](https://reader036.vdocuments.site/reader036/viewer/2022081516/568138e4550346895da09638/html5/thumbnails/1.jpg)
Exploring the use of mass spectrometry data for differential
proteomics
Nicholas Mitsakakis
Dept of Public Health Sciences
University of Toronto
![Page 2: Exploring the use of mass spectrometry data for differential proteomics](https://reader036.vdocuments.site/reader036/viewer/2022081516/568138e4550346895da09638/html5/thumbnails/2.jpg)
• Finishing my 1st year of PhD program in Biostatistics
• Work originally produced for the requirements of our “Lab in Statistical Design” course
• Data from Dr. Andrew Emili’s laboratory, Banting and Best Dept of Medical Research, UofT
• Just first exploratory steps!
![Page 3: Exploring the use of mass spectrometry data for differential proteomics](https://reader036.vdocuments.site/reader036/viewer/2022081516/568138e4550346895da09638/html5/thumbnails/3.jpg)
Overview
• Background
• Data
• Objectives of the analysis
• Methods-Results
• Conclusions
• Next steps
![Page 4: Exploring the use of mass spectrometry data for differential proteomics](https://reader036.vdocuments.site/reader036/viewer/2022081516/568138e4550346895da09638/html5/thumbnails/4.jpg)
Some definitions: Proteomics
• “The study of the proteome, the complete set of proteins produced by a species, using the technologies of large-scale protein separation and identification” (Medterms.com)
• “The large-scale, high-throughput analysis of proteins” (CHA Cambridge Healthtech Advisors, Clinical Genomics: The Impact of Genomics on Clinical Trials and Medical Practice, 2004)
• “It represents the effort to establish the identities, quantities, structures and biochemical and cellular functions of all proteins in an organism, organ, or organelle, and how these properties vary in space, time and physiological state” (Defining the Mandate of Proteomics in the Post-Genomics Era, National Academy of Sciences, 2002)
![Page 5: Exploring the use of mass spectrometry data for differential proteomics](https://reader036.vdocuments.site/reader036/viewer/2022081516/568138e4550346895da09638/html5/thumbnails/5.jpg)
• A protein is a complex molecule consisting of a particular sequence of amino acids
• A short chain of amino acids (the monomers) forms a peptide
• Peptides joined together into a longer chain (a polypeptide, i.e. a polymer) form a protein
![Page 6: Exploring the use of mass spectrometry data for differential proteomics](https://reader036.vdocuments.site/reader036/viewer/2022081516/568138e4550346895da09638/html5/thumbnails/6.jpg)
Mass Spectrometry:
• “An instrument used to identify chemicals in a substance by their mass and charge. Mass spectrometers are instruments that essentially weigh molecules. A mass spectrometer also can measure how much of a compound is present in a mixture” (medterms.com)
• “In a typical approach, this technique for measuring and analyzing molecules involves introducing enough energy into a target molecule to cause its disintegration. The resulting fragments are then analyzed, based on their mass/charge ratios, to produce a molecular fingerprint” (CHI Predictive Pharmacogenomics report)
Direct analysis of protein complexes using mass spectrometry - Andrew J. Link et al., Nature Biotechnology 17, 676-682 (1999)
In our data, a confidence value was calculated (Kislinger et al., 2003) and returned for each peptide found
![Page 7: Exploring the use of mass spectrometry data for differential proteomics](https://reader036.vdocuments.site/reader036/viewer/2022081516/568138e4550346895da09638/html5/thumbnails/7.jpg)
The (almost raw) data: an excerpt
protein peptide sequence confidence
3BH1_MOUSE K.TSEWIGTIVEQHR.E 0.8018
3BH7_MOUSE R.HFGYKPLFSWEESR.T 0.8036
3BH7_MOUSE R.HFGYKPLFSWEESR.T 0.9958
3BH7_MOUSE R.HFGYKPLFSWEESR.T 0.9958
3BH7_MOUSE R.HFGYKPLFSWEESR.T 0.9763
3BH7_MOUSE R.HFGYKPLFSWEESR.T 0.9954
3BH7_MOUSE R.HFGYKPLFSWEESR.T 0.8693
3BH7_MOUSE R.HFGYKPLFSWEESR.T 0.9958
3BH7_MOUSE R.HFGYKPLFSWEESR.T 0.986
41_MOUSE R.SLDGAAAAESTDR.S 0.7168
5NTD_MOUSE R.GPLAHQISGLFLPSK.V 0.8122
5NTD_MOUSE R.GPLAHQISGLFLPSK.V 0.8943
A1AG_MOUSE K.AVTHVGMDESEIIFVDWK.K 0.8105
A1AG_MOUSE K.AVTHVGMDESEIIFVDWK.K 0.9284
A1AG_MOUSE K.AVTHVGMDESEIIFVDWK.K 0.9243
A1AG_MOUSE K.AVTHVGMDESEIIFVDWK.K 0.9033
A1AG_MOUSE K.AVTHVGMDESEIIFVDWK.K 0.8206
A1AG_MOUSE K.AVTHVGMDESEIIFVDWK.K 0.9332
A1AG_MOUSE K.AVTHVGMDESEIIFVDWK.K 0.9464
A1AG_MOUSE K.HGAFMLAFDLK.D 0.99
A1AG_MOUSE K.YEGGVETFAHLIVLR.K 0.9619
A1AG_MOUSE K.YEGGVETFAHLIVLR.K 0.7187
…………………………………………………………………
…………………………………………………………………
- Confidence: level of confidence about whether the particular peptide instance was identified
- Confidence cutoff value → number of peptides identified (counts)
- Different cutoff values = different “number of counts” for the peptides
- Different covariates (?) not shown
- The experiment was run 7 times
[Annotations on the excerpt: Peptide 1, Peptide 2, Peptide 3]
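The step from confidence values to counts can be sketched as follows. This is a minimal illustration, not the code used for the analysis: the rows are a few lines copied from the excerpt above, the function name `peptide_counts` is hypothetical, and treating the cutoff as inclusive (`>=`) is an assumption.

```python
from collections import Counter

# A few (protein, peptide sequence, confidence) rows copied from the excerpt.
rows = [
    ("3BH1_MOUSE", "K.TSEWIGTIVEQHR.E", 0.8018),
    ("3BH7_MOUSE", "R.HFGYKPLFSWEESR.T", 0.8036),
    ("3BH7_MOUSE", "R.HFGYKPLFSWEESR.T", 0.9958),
    ("3BH7_MOUSE", "R.HFGYKPLFSWEESR.T", 0.9763),
    ("41_MOUSE", "R.SLDGAAAAESTDR.S", 0.7168),
    ("A1AG_MOUSE", "K.HGAFMLAFDLK.D", 0.99),
]


def peptide_counts(rows, cutoff):
    """Count identifications per (protein, peptide) with confidence >= cutoff.

    The inclusive comparison is an assumption; the slides do not say
    whether the cutoff itself is included.
    """
    return Counter((prot, pep) for prot, pep, conf in rows if conf >= cutoff)


for cutoff in (0.75, 0.9):
    counts = peptide_counts(rows, cutoff)
    print(cutoff, len(counts), dict(counts))
```

Raising the cutoff both lowers the count of each peptide and removes peptides altogether, which is why the choice of cutoff matters for everything downstream.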
![Page 8: Exploring the use of mass spectrometry data for differential proteomics](https://reader036.vdocuments.site/reader036/viewer/2022081516/568138e4550346895da09638/html5/thumbnails/8.jpg)
General (long term) objective
• Use the data from the mass spec to obtain (semi)quantitative information about the (relative) abundances of the proteins in the mixtures
• Relative: Most interested in comparing abundances of the same protein between two mixtures (biological samples)
• Construct measures (scores) and appropriate statistical tests that identify significant differences in protein abundances between 2 different mixtures
• Measures should:
– “reflect” abundance of the proteins
– have nice statistical properties
![Page 9: Exploring the use of mass spectrometry data for differential proteomics](https://reader036.vdocuments.site/reader036/viewer/2022081516/568138e4550346895da09638/html5/thumbnails/9.jpg)
Using counts
• Count: how many times a peptide was identified (peptide level measure)
• Previously suggested as semi-quantitative measure of protein abundance (Liu et al., 2004, Collinge et al., 2005)
• Issues:
1. Define an appropriate confidence cutoff threshold value for the peptide counts
2. Construct protein level measures (by combining the peptide counts of each protein)
3. At both the peptide and the protein level, low “variation” among experiments is desirable
![Page 10: Exploring the use of mass spectrometry data for differential proteomics](https://reader036.vdocuments.site/reader036/viewer/2022081516/568138e4550346895da09638/html5/thumbnails/10.jpg)
Selecting appropriate cutoff value
cutoff value   avg # of peptides identified
0.75           2091
0.8            1897
0.85           1704
0.9            1518
0.95           1271
[Figure: two panels, “Mean of counts” and “Variance of counts”; log values plotted against threshold values]
          Mean of counts      Variance of counts
Thresh    Mean      Median    Mean      Median
0.75      2.6615    0.8571    5.3068    0.9048
0.8       2.5585    0.8571    4.9831    0.8095
0.85      2.4392    0.7143    4.7034    0.6667
0.9       2.3003    0.7143    4.3618    0.6190
0.95      2.0503    0.5714    3.8513    0.6190
A higher threshold gives fewer peptides, but of higher confidence and of generally lower variance
The cutoff value of 0.9 was selected
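A sketch of how the per-threshold summary could be computed: for each peptide take the mean and variance of its counts across the 7 runs, then summarize those over peptides. The three peptides below are taken from the processed count data in this deck; the function name `summarize_counts` is hypothetical, and the use of the sample variance (divisor n - 1) is an assumption, although it is consistent with tabulated medians such as 0.6190 = 13/21.

```python
from statistics import mean, median, variance

# Counts over the 7 runs for three peptides from the processed count data.
counts_by_peptide = {
    "R.HFGYKPLFSWEESR.T": [4, 8, 2, 2, 4, 2, 6],
    "K.HGAFMLAFDLK.D": [3, 0, 2, 0, 2, 3, 1],
    "R.LAQIHFPR.L": [1, 0, 0, 0, 1, 0, 0],
}


def summarize_counts(counts_by_peptide):
    """Mean/median over peptides of the per-peptide mean and variance."""
    means = [mean(c) for c in counts_by_peptide.values()]
    # Sample variance (n - 1 divisor) is an assumption about the analysis.
    variances = [variance(c) for c in counts_by_peptide.values()]
    return {
        "mean_of_means": mean(means),
        "median_of_means": median(means),
        "mean_of_variances": mean(variances),
        "median_of_variances": median(variances),
    }


summary = summarize_counts(counts_by_peptide)
print(summary)
```

Repeating this for each candidate cutoff would give one row of the table per cutoff.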
![Page 11: Exploring the use of mass spectrometry data for differential proteomics](https://reader036.vdocuments.site/reader036/viewer/2022081516/568138e4550346895da09638/html5/thumbnails/11.jpg)
The processed count data – an excerpt
protein      peptide                                    run1 run2 run3 run4 run5 run6 run7
2AAA_MOUSE   R.AISHEHSPSDLEAHFVPLVK.R                   3    1    1    0    0    1    0
3BH3_MOUSE   K.FSTANPVYVGNVAWAHILAAR.G                  0    0    1    0    1    3    0
3BH7_MOUSE   R.HFGYKPLFSWEESR.T                         4    8    2    2    4    2    6
4F2_MOUSE    R.SLLHGDFHALSSSPDLFSYIR.H                  0    1    0    0    1    0    0
5NTD_MOUSE   R.GPLAHQISGLFLPSK.V                        1    0    1    0    3    2    0
6PGD_MOUSE   R.DYFGAHTYELLTKPGEFIHTNWTGHGGSVSSSSYNA.-   1    0    0    0    3    0    0
A1A1_MOUSE   K.EQPLDEELKDAFQNAYLELGGLGER.V              0    0    0    1    1    0    0
A1A1_MOUSE   R.YHTEIVFAR.T                              3    0    1    0    0    0    0
A1AG_MOUSE   K.AVTHVGMDESEIIFVDWK.K                     7    6    7    5    6    15   5
A1AG_MOUSE   K.HGAFMLAFDLK.D                            3    0    2    0    2    3    1
A1AG_MOUSE   K.YEGGVETFAHLIVLR.K                        3    1    0    0    2    1    1
A1B1_MOUSE   R.LGAPISSGLSDLFDLTSGVGTLSGSYVAPK.A         0    0    0    1    1    0    0
A1T1_MOUSE   K.DQSPASHEIATNLGDFAISLYR.E                 4    1    1    2    1    3    3
A1T1_MOUSE   K.KLDQDTVFALANYILFK.G                      1    1    2    1    1    1    0
A1T1_MOUSE   K.NHYQAEVFSVNFAESEEAK.K                    5    3    3    2    2    5    4
A1T1_MOUSE   K.NHYQAEVFSVNFAESEEAKK.V                   0    1    3    2    0    3    2
A1T1_MOUSE   K.SFQHLLQTLNRPDSELQLSTGNGLFVNNDLK.L        24   16   52   13   44   43   33
A1T1_MOUSE   R.LAQIHFPR.L                               1    0    0    0    1    0    0
A1T2_MOUSE   R.LVQIHIPR.L                               1    0    0    0    2    2    0
A2A1_MOUSE   K.LLGFGSALLDNVDPNPENFVGAGIIQTK.A           1    0    0    1    0    0    0
A2AB_MOUSE   K.VPTLVSPLSSVGEANGHPKPPR.E                 0    0    1    0    1    0    0
A2HS_MOUSE   R.HAFSPVASVESASGETLHSPK.V                  1    3    2    2    3    4    0
A2M1_MOUSE   K.WARPPISMNFEVPFAPSGLK.V                   4    4    4    1    5    6    6
A2MG_MOUSE   K.HSLGDNDAHSIFQSVGINIFTNSK.I               3    0    6    1    7    7    10
A2MG_MOUSE   K.VNTNYRPGLPFSGQVLLVDEK.G                  0    0    1    1    0    2    0
A2MG_MOUSE   R.KYFPETWIWDLVPLDVSGDGELAVK.V              0    0    1    0    0    0    1
Some proteins are represented by only one peptide, others by more than one
![Page 12: Exploring the use of mass spectrometry data for differential proteomics](https://reader036.vdocuments.site/reader036/viewer/2022081516/568138e4550346895da09638/html5/thumbnails/12.jpg)
Searching for the best protein measure
For i = 1,…,P, where P is the number of proteins
For j = 1,…,p(i), where p(i) is the number of peptides for protein i
If c_ij is the number of counts for peptide j of protein i,
then s_i = g(c_i1, c_i2, …, c_ip(i)) is the score for protein i
For every i, one can see C_i = (c_i1, c_i2, …, c_ip(i)) as a random vector and g(C_i) as a statistic, an estimator of (a function of) the abundance of protein i.
Three sensible (and suggested) candidates for function g are mean, max and sum; call them m, M and S.
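As a concrete illustration of the three candidates, the sketch below applies sum, max and mean to the peptide counts of A1AG_MOUSE from the processed count excerpt, one run at a time. The helper name `protein_scores` is hypothetical, and taking the mean over all p(i) peptides of the protein (zero counts included) is an assumption; the slides do not spell out how zero counts are treated.

```python
from statistics import mean

# Counts of the three A1AG_MOUSE peptides over runs 1..7 (from the excerpt).
a1ag = [
    [7, 6, 7, 5, 6, 15, 5],  # K.AVTHVGMDESEIIFVDWK.K
    [3, 0, 2, 0, 2, 3, 1],   # K.HGAFMLAFDLK.D
    [3, 1, 0, 0, 2, 1, 1],   # K.YEGGVETFAHLIVLR.K
]


def protein_scores(peptide_counts, g):
    """Apply g to the peptide counts of each run, giving one score per run."""
    return [g(run) for run in zip(*peptide_counts)]


S = protein_scores(a1ag, sum)   # sum of peptide counts per run
M = protein_scores(a1ag, max)   # largest peptide count per run
m = protein_scores(a1ag, mean)  # mean peptide count per run (zeros included)

print("S:", S)
print("M:", M)
print("m:", [round(x, 3) for x in m])
```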
![Page 13: Exploring the use of mass spectrometry data for differential proteomics](https://reader036.vdocuments.site/reader036/viewer/2022081516/568138e4550346895da09638/html5/thumbnails/13.jpg)
Specific research question: which function gives the smallest variation across the 7 experiments?
To compare the estimators, an adjustment was made so that they have equal mean values (across the 7 experiments) for every i.
We use S as the baseline estimator. The other two need to be adjusted.
Denoting g_i = g(C_i) where g is m, M or S, we use the equations:
m_adj_i = a_i * m_i, where a_i = avg(S_i)/avg(m_i)
M_adj_i = b_i * M_i, where b_i = avg(S_i)/avg(M_i)
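The adjustment can be sketched as a rescaling of each measure so that its average over the 7 runs matches that of S. The example uses the per-run sum and max of the A1AG_MOUSE peptide counts from the processed count excerpt; the function name `adjust_to_baseline` is hypothetical, and the sketch assumes the measure being adjusted has a nonzero average.

```python
from statistics import mean, variance


def adjust_to_baseline(scores, baseline):
    """Rescale scores so their average equals the average of baseline.

    Returns the factor (a_i or b_i in the slides' notation) and the
    adjusted scores. Assumes mean(scores) is not zero.
    """
    factor = mean(baseline) / mean(scores)
    return factor, [factor * s for s in scores]


# Per-run sum and max of the three A1AG_MOUSE peptides (runs 1..7).
S = [13, 7, 9, 5, 10, 19, 7]
M = [7, 6, 7, 5, 6, 15, 5]

b, M_adj = adjust_to_baseline(M, S)
print("b =", round(b, 4))
print("variance of adjusted max:", round(variance(M_adj), 3))
print("variance of sum:", round(variance(S), 3))
```

One caveat worth keeping in mind: if the mean were taken over a fixed set of p(i) peptides in every run, then m = S/p(i), a_i = p(i), and the adjusted mean would coincide with S exactly. The reported a_i values (for example a maximum of 16.29) are not all integers, which suggests the mean was computed over a peptide set that varies by run; that reading is an inference and is not stated in the slides.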
![Page 14: Exploring the use of mass spectrometry data for differential proteomics](https://reader036.vdocuments.site/reader036/viewer/2022081516/568138e4550346895da09638/html5/thumbnails/14.jpg)
• We keep 1241 selected proteins with 2 or more counts in total over all peptides and experiments
• ~700 proteins have only one peptide => a_i = b_i = 1
• a_i: median=1, mean=1.51, max = 16.29
• b_i: median=1, mean=1.20, max = 4.47
The relationships between the 3 statistics depend on the protein
![Page 15: Exploring the use of mass spectrometry data for differential proteomics](https://reader036.vdocuments.site/reader036/viewer/2022081516/568138e4550346895da09638/html5/thumbnails/15.jpg)
[Figure: pairwise plots of the three measures (sum, mean, max)]
Comparing the variances of the 3 measures
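Variance of each measure, over 1241 proteins:
          m        M        S
mean      12.226   11.828   8.592
median    0.619    0.619    0.667

Pairwise comparison, > (=): each cell gives the number of proteins for which the variance of the row measure is greater than that of the column measure (number of proteins with equal variance in parentheses)
     m          M          S
m    0(1241)    129(870)   168(749)
M    242(870)   0(1241)    226(745)
S    324(749)   270(745)   0(1241)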
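A pairwise table of this kind can be built by comparing, protein by protein, the variance of one measure with that of another. The sketch below reads each cell as "row variance greater than column variance, with ties in parentheses", which is an interpretation of the ">(=)" label; the function name `pairwise_greater` and the toy variances are hypothetical and are not the study data.

```python
def pairwise_greater(variances):
    """For each ordered pair (row, col) of measures, count the proteins where
    the row variance exceeds the column variance, and the number of ties.

    variances maps a measure name to a list of per-protein variances,
    all lists in the same protein order.
    """
    table = {}
    for row in variances:
        for col in variances:
            pairs = list(zip(variances[row], variances[col]))
            greater = sum(x > y for x, y in pairs)
            # Exact equality; real data may need a tolerance for floats.
            ties = sum(x == y for x, y in pairs)
            table[(row, col)] = (greater, ties)
    return table


# Toy per-protein variances for three proteins (illustrative values only).
toy = {
    "m": [1.0, 2.0, 3.0],
    "M": [1.0, 3.0, 2.0],
    "S": [2.0, 2.0, 2.0],
}

table = pairwise_greater(toy)
for row in toy:
    print(row, [f"{g}({t})" for g, t in (table[(row, col)] for col in toy)])
```

For any pair of measures, the two off-diagonal counts plus the ties add up to the number of proteins, which is a quick consistency check on such a table (for example 129 + 242 + 870 = 1241).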
![Page 16: Exploring the use of mass spectrometry data for differential proteomics](https://reader036.vdocuments.site/reader036/viewer/2022081516/568138e4550346895da09638/html5/thumbnails/16.jpg)
• Conclusion: the “mean” of peptide counts per protein seems to be a slightly more stable statistic
• Similar results were obtained when the coefficient of variation (std/mean) was used instead of the variance as a measure of relative variation (to cope with the fact that the 3 measures have different mean values per protein)
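The reason the coefficient of variation copes with different mean values is that it does not change when a measure is rescaled by a positive constant. A minimal sketch, assuming the sample standard deviation is used (the slides only say std/mean); the function name is hypothetical and the data are the per-run sums of the A1AG_MOUSE peptide counts.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (NaN if the mean is 0)."""
    mu = mean(values)
    if mu == 0:
        return float("nan")
    return stdev(values) / mu


S = [13, 7, 9, 5, 10, 19, 7]     # per-run sum for A1AG_MOUSE
S_scaled = [2.5 * s for s in S]  # same measure on a different scale

print(round(coefficient_of_variation(S), 4))
print(round(coefficient_of_variation(S_scaled), 4))  # same value
```

With this measure, the rescaling factors applied to equalize means have no effect on the comparison.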
![Page 17: Exploring the use of mass spectrometry data for differential proteomics](https://reader036.vdocuments.site/reader036/viewer/2022081516/568138e4550346895da09638/html5/thumbnails/17.jpg)
Next step: using covariates to identify “unstable” peptides
• Mass of the peptide
• Charge (values 1, 2 or 3)
• higher # of counts gives higher variance => coefficient of variation was used as a measure of variability
• Tests and model fitting showed evidence that:
– Peptides with charge = 3 have higher variability
– Higher mass gives higher variance BUT not higher CV
![Page 18: Exploring the use of mass spectrometry data for differential proteomics](https://reader036.vdocuments.site/reader036/viewer/2022081516/568138e4550346895da09638/html5/thumbnails/18.jpg)
Weighted mean
• Peptides with charge = 3 (i.e. “unstable”) were weighted less
• The formula used is:

$$wm_i = \frac{w_l \sum_{j=1,\; ch(j) \in \{1,2\}}^{p(i)} c_{ij} \;+\; w_h \sum_{j=1,\; ch(j) = 3}^{p(i)} c_{ij}}{k_l + k_h}$$

• ch(j): charge of peptide j,
• k_l, k_h: numbers of low charge (1 or 2) and high charge (3) peptides respectively,
• w_l, w_h: weights for the low and high charge peptides respectively
• I used w_l = 2*w_h
Unfortunately the weighted mean did not produce values of lower variance …
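A sketch of the weighted mean under one reading of the formula: the low-charge and high-charge counts are summed separately, multiplied by w_l and w_h, and the total is divided by k_l + k_h. That reading, the absolute weight values (only the ratio w_l = 2*w_h is given), and the charges in the example are all assumptions; the counts are the run 1 values of the three A1AG_MOUSE peptides.

```python
def weighted_mean(counts, charges, w_low=2.0, w_high=1.0):
    """Charge-weighted mean of peptide counts for one protein in one run.

    Peptides with charge 1 or 2 get weight w_low, peptides with charge 3
    get weight w_high; the weighted sum is divided by the number of
    peptides (k_l + k_h), following the assumed reading of the slide.
    """
    low = [c for c, ch in zip(counts, charges) if ch in (1, 2)]
    high = [c for c, ch in zip(counts, charges) if ch == 3]
    k_low, k_high = len(low), len(high)
    if k_low + k_high == 0:
        raise ValueError("no peptides with charge 1, 2 or 3")
    return (w_low * sum(low) + w_high * sum(high)) / (k_low + k_high)


counts = [7, 3, 3]   # run 1 counts of the A1AG_MOUSE peptides
charges = [2, 3, 2]  # hypothetical charges, for illustration only

wm = weighted_mean(counts, charges)
print(round(wm, 4))
```

Because the denominator counts peptides rather than summing weights, the result is on a different scale from the plain mean; comparisons of stability would therefore need a scale-free measure such as the CV.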
![Page 19: Exploring the use of mass spectrometry data for differential proteomics](https://reader036.vdocuments.site/reader036/viewer/2022081516/568138e4550346895da09638/html5/thumbnails/19.jpg)
First model fitting efforts
• Outcome: count c_ij
• Covariates: m_ij (mass), ch_ij (charge), a_i (abundance, unknown)
• Fit a log-linear model
• Eliminate the effect of abundance by normalizing per protein
• For the covariates, subtract the arithmetic mean
• For the outcome (counts), divide by the geometric mean

$$mnorm_{ij} = m_{ij} - \frac{1}{p(i)} \sum_{k=1}^{p(i)} m_{ik}, \qquad cnorm_{ij} = \frac{c_{ij}}{\left( \prod_{k=1}^{p(i)} c_{ik} \right)^{1/p(i)}}$$
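The per-protein normalization can be sketched as below: covariates are centered on their arithmetic mean within the protein, and counts are divided by their geometric mean within the protein. The function names and the peptide masses are hypothetical; the counts are the run 1 values of the three A1AG_MOUSE peptides. The geometric mean is only defined here for strictly positive counts, and how zero counts were handled is not stated in the slides, so the sketch simply rejects them.

```python
import math


def center(values):
    """Subtract the arithmetic mean, so the centered values sum to zero."""
    avg = sum(values) / len(values)
    return [v - avg for v in values]


def divide_by_geometric_mean(counts):
    """Divide each count by the geometric mean of the counts.

    Requires strictly positive counts (the log of zero is undefined).
    """
    if any(c <= 0 for c in counts):
        raise ValueError("geometric mean needs strictly positive counts")
    log_mean = sum(math.log(c) for c in counts) / len(counts)
    gm = math.exp(log_mean)
    return [c / gm for c in counts]


counts = [7, 3, 3]                 # run 1 counts of the A1AG_MOUSE peptides
masses = [2050.0, 1250.0, 1690.0]  # hypothetical masses, illustration only

mnorm = center(masses)
cnorm = divide_by_geometric_mean(counts)
print([round(x, 2) for x in mnorm])
print([round(x, 4) for x in cnorm])
```

On the log scale the two steps are the same operation: dividing by the geometric mean centers log(c_ij) within the protein, which is what removes a protein-level abundance effect from a log-linear model.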
Poisson model gave very high over-dispersion, so a Negative Binomial model was used
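A rough way to see why a Poisson model is too restrictive: under a Poisson distribution the variance equals the mean, so a variance-to-mean ratio well above 1 points to over-dispersion. The sketch below computes that ratio for one peptide from the processed count excerpt; it is a descriptive check only and not the model-based assessment reported in the slides, and the function name is hypothetical.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Sample variance divided by the mean (about 1 for Poisson counts)."""
    return variance(counts) / mean(counts)


# A1T1_MOUSE, K.SFQHLLQTLNRPDSELQLSTGNGLFVNNDLK.L, runs 1..7.
counts = [24, 16, 52, 13, 44, 43, 33]

print(round(dispersion_index(counts), 2))
```

The averages at the 0.9 cutoff tell the same story: the mean variance of counts (4.3618) is about 1.9 times the mean of counts (2.3003).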
![Page 20: Exploring the use of mass spectrometry data for differential proteomics](https://reader036.vdocuments.site/reader036/viewer/2022081516/568138e4550346895da09638/html5/thumbnails/20.jpg)
Criteria For Assessing Goodness Of Fit
Criterion            DF     Value         Value/DF
Deviance             15E3   14231.8717    0.9243
Scaled Deviance      15E3   14231.8717    0.9243
Pearson Chi-Square   15E3   46419.6434    3.0148
Scaled Pearson X2    15E3   46419.6434    3.0148
Log Likelihood              28490.0270
Algorithm converged.

Analysis Of Parameter Estimates
Parameter     DF   Estimate   Standard Error   Wald 95% Confidence Limits   Chi-Square   Pr > ChiSq
Intercept     1    0.7605     0.0142           0.7326    0.7884             2856.13      <.0001
mass          1    -0.0004    0.0000           -0.0005   -0.0003            97.88        <.0001
charge_mean   1    1.4668     0.0733           1.3232    1.6105             400.37       <.0001
Dispersion    1    2.6255     0.0400           2.5471    2.7039
Using the un-normalized count data
![Page 21: Exploring the use of mass spectrometry data for differential proteomics](https://reader036.vdocuments.site/reader036/viewer/2022081516/568138e4550346895da09638/html5/thumbnails/21.jpg)
Criteria For Assessing Goodness Of Fit
Criterion            DF     Value         Value/DF
Deviance             15E3   16036.8557    1.0416
Scaled Deviance      15E3   16036.8557    1.0416
Pearson Chi-Square   15E3   18308.3152    1.1891
Scaled Pearson X2    15E3   18308.3152    1.1891
Log Likelihood              210259.2628
Algorithm converged.

Analysis Of Parameter Estimates
Parameter     DF   Estimate   Standard Error   Wald 95% Confidence Limits   Chi-Square   Pr > ChiSq
Intercept     1    1.9767     0.0139           1.9494    2.0039             20202.8      <.0001
mass          1    -0.0003    0.0000           -0.0003   -0.0002            32.89        <.0001
charge_mean   1    0.8567     0.0758           0.7082    1.0052             127.85       <.0001
Dispersion    1    2.8378     0.0338           2.7716    2.9040
Using the normalized count data
???
![Page 22: Exploring the use of mass spectrometry data for differential proteomics](https://reader036.vdocuments.site/reader036/viewer/2022081516/568138e4550346895da09638/html5/thumbnails/22.jpg)
Conclusions
• Verified that selecting a higher threshold for confidence values gives overall more “stable” peptide counts
• “mean” outperforms (if only marginally) “max” and “sum” as a protein level measure
• Peptides with charge = 3 seem to be more “unstable”
– The implemented “weighted mean” did not improve stability
• A log-linear model seems to fit the count data -> a first step towards more complex models for acquiring more refined and informed protein level measures (?)
![Page 23: Exploring the use of mass spectrometry data for differential proteomics](https://reader036.vdocuments.site/reader036/viewer/2022081516/568138e4550346895da09638/html5/thumbnails/23.jpg)
Next steps
• Data with true protein abundances would be useful
• Exploring model fitting (hierarchical GLM models)
• Effects of other covariates (peptide sequence) could be explored (could be hard!)
• Break down the MS experiment process into smaller processes and model accordingly
• Use data of known varying abundances to evaluate the protein measures
![Page 24: Exploring the use of mass spectrometry data for differential proteomics](https://reader036.vdocuments.site/reader036/viewer/2022081516/568138e4550346895da09638/html5/thumbnails/24.jpg)
Acknowledgements
Biostatistics - PHS
Rafal Kustra
Emili Lab – BBDMR
Andrew Emili
Clement Chung
Thomas Kislinger
![Page 25: Exploring the use of mass spectrometry data for differential proteomics](https://reader036.vdocuments.site/reader036/viewer/2022081516/568138e4550346895da09638/html5/thumbnails/25.jpg)
Thank you!
Questions? Comments? Suggestions?
![Page 26: Exploring the use of mass spectrometry data for differential proteomics](https://reader036.vdocuments.site/reader036/viewer/2022081516/568138e4550346895da09638/html5/thumbnails/26.jpg)
Supplementary Information
![Page 27: Exploring the use of mass spectrometry data for differential proteomics](https://reader036.vdocuments.site/reader036/viewer/2022081516/568138e4550346895da09638/html5/thumbnails/27.jpg)
Variance depends on mean
Coefficient of variation (std/mean): it provides a relative (rather than an absolute) measure of dispersion and is independent of the mean => use the coefficient of variation instead
Selecting appropriate dispersion measure
![Page 28: Exploring the use of mass spectrometry data for differential proteomics](https://reader036.vdocuments.site/reader036/viewer/2022081516/568138e4550346895da09638/html5/thumbnails/28.jpg)
Each dot corresponds to a protein
Y axis: difference between the CV values of two measures
X axis: protein rank when ordered increasingly by the average value of the measure mentioned first
As we move from (hypothetically) lower abundance to higher abundance proteins, the order of quality of the measures changes
For low/medium abundance proteins, “mean” is better more often
If we are interested in lower/medium abundance proteins, the mean appears to be the better measure
1241 proteins with more than 1 peptide count total in all 7 experiments
![Page 29: Exploring the use of mass spectrometry data for differential proteomics](https://reader036.vdocuments.site/reader036/viewer/2022081516/568138e4550346895da09638/html5/thumbnails/29.jpg)
Comparing the coefficient of variation of the three measures
                           mean     max      sum
Mean                       1.1593   1.1741   1.183
Median                     1.1114   1.1547   1.1547
# of prot. with CV <= 1    553      540      528
The three measure functions show similar behavior
Mean is marginally better
Exploring further: comparing performance trends with respect to protein abundance…
![Page 30: Exploring the use of mass spectrometry data for differential proteomics](https://reader036.vdocuments.site/reader036/viewer/2022081516/568138e4550346895da09638/html5/thumbnails/30.jpg)
# of peptides identified
cutoff   mean      var
0.75     2091      59451
0.8      1897      50075
0.85     1704.43   44376.6
0.9      1518.71   32945.6
0.95     1271.14   22605.5

# of proteins identified
cutoff   mean      var
0.75     1336.86   22985.1
0.8      1186.57   18895
0.85     1042.86   16031.5
0.9      909.714   11183.6
0.95     742.714   7599.24

Peptides per protein
cutoff   mean      var
0.75     1.56503   0.0034228
0.8      1.59964   0.00315881
0.85     1.63534   0.00320114
0.9      1.67008   0.00298823
0.95     1.71291   0.00440239
Stats per Confidence cutoff values
![Page 31: Exploring the use of mass spectrometry data for differential proteomics](https://reader036.vdocuments.site/reader036/viewer/2022081516/568138e4550346895da09638/html5/thumbnails/31.jpg)
Other covariates
• Peptide Charge
• SEQUEST output (from which the confidence values are estimated)
• Scan number of the peptide (related to the time when the peptide entered the mass spec)
• Patterns in the peptide sequence itself (motifs)
![Page 32: Exploring the use of mass spectrometry data for differential proteomics](https://reader036.vdocuments.site/reader036/viewer/2022081516/568138e4550346895da09638/html5/thumbnails/32.jpg)
The ABC’s (and XYZ’s) of peptide sequencing - Hanno Steen & Matthias Mann, Nature Reviews Molecular Cell Biology 5, 699-711 (2004)
The mass-spectrometry/proteomic experiment
![Page 33: Exploring the use of mass spectrometry data for differential proteomics](https://reader036.vdocuments.site/reader036/viewer/2022081516/568138e4550346895da09638/html5/thumbnails/33.jpg)
[Diagram: spectra → SEQUEST (Eng et al., 1994) → correlation scores → STATQUEST (Kislinger et al., 2003) → confidence values]
“Using the SEQUEST algorithm, acquired fragmentation spectra of peptides are correlated with predicted amino acid sequences in translated genomic databases”