exploring the use of mass spectrometry data for differential proteomics

Exploring the use of mass spectrometry data for differential

proteomics

Nicholas MitsakakisDept of Public Health Sciences

University of Toronto

• Finishing my 1st year of PhD program in Biostatistics

• Work originally produced for the requirements of our “Lab in Statistical Design” course

• Data from Dr. Andrew Emili’s laboratory, Banting and Best Dept of Medical Research, UofT

• Just first exploratory steps!

Overview

• Background

• Data

• Objectives of the analysis

• Methods-Results

• Conclusions

• Next steps

Some definitions: Proteomics

• “The study of the proteome, the complete set of proteins produced by a species, using the technologies of large-scale protein separation and identification”Medterms.com

• “The large-scale, high-throughput analysis of proteins” CHA Cambridge Healthtech Advisors, Clinical Genomics: The

Impact of Genomics on Clinical Trials and Medical Practice, 2004

• “It represents the effort to establish the identities, quantities, structures and biochemical and cellular functions of all proteins in an organism, organ, or organelle, and how these properties vary in space, time and physiological state”Defining the Mandate of Proteomics in the Post- Genomics Era,

National Academy of Sciences, 2002

• Protein is a complex molecule consisting of a particular sequence of amino acids

• A sequence of amino acids forms a peptide (monomer)

• Different peptides joined together form a protein (polypeptide - polymer)

Mass Spectrometry: “An instrument used to identify chemicals in a substance by their mass and charge. Mass spectrometers are instruments that essentially weigh molecules. A mass spectrometer also can measure how much of a compound is present in a mixture”, medterms.com“In a typical approach, this technique for measuring and analyzing molecules involves introducing enough energy into a target molecule to cause its disintegration. The resulting fragments are then analyzed, based on their mass/ charge ratios, to produce a molecular fingerprint“, CHI Predictive Pharmacogenomics report

Direct analysis of protein complexes using mass spectrometry - Andrew J. Link et al., Nature Biotechnology

17, 676 - 682 (1999)

In our data, for each found peptide a confidence value was calculated (Kislinger et al, 2003) and returned

The (almost raw) data: an excerpt

protein peptide sequence confidence

3BH1_MOUSE K.TSEWIGTIVEQHR.E 0.8018

3BH7_MOUSE R.HFGYKPLFSWEESR.T 0.8036








41_MOUSE R.SLDGAAAAESTDR.S 0.7168

5NTD_MOUSE R.GPLAHQISGLFLPSK.V 0.8122

5NTD_MOUSE R.GPLAHQISGLFLPSK.V 0.8943

A1AG_MOUSE K.AVTHVGMDESEIIFVDWK.K 0.8105







A1AG_MOUSE K.HGAFMLAFDLK.D 0.99

A1AG_MOUSE K.YEGGVETFAHLIVLR.K 0.9619

A1AG_MOUSE K.YEGGVETFAHLIVLR.K 0.7187

…………………………………………………………………

…………………………………………………………………

- Level of confidence about whether the particular peptide instance was identified - confidence cutoff value → number of peptides identified (counts)- different cutoff values = different “number of counts” for the peptides- Different covariates (?) not shown- The experiment ran 7 times

Peptide 1

Peptide 2Peptide 3

General (long term) objective• Use the data from the mass spec to obtain

(semi)quantitative information about the (relative) abundances of the proteins in the mixtures

• Relative: Most interested in comparing abundances of the same protein between two mixtures (biological samples)

• Construct measures (scores) and appropriate statistical tests that identify significant differences in protein abundances between 2 different mixtures

• Measures should:– “reflect” abundance of the proteins– have nice statistical properties

Using counts

• Count: how many times a peptide was identified (peptide level measure)

• Previously suggested as semi-quantitative measure of protein abundance (Liu et al., 2004, Collinge et al., 2005)

• Issues:1. Define appropriate confidence cutoff threshold value

for the peptide counts

2. Construct protein level measures (by combining the peptide counts of each protein)

3. In both peptide and protein level, low “variation” among experiments is desirable

Selecting appropriate cutoff value

cutoff avg # of peptidesvalue identified0.75 2091 0.8 18970.85 17040.9 15180.95 1271

Threshold values

Log

val

ues

Threshold values

Mean of counts

Variance of counts

Log

val

ues

Mean of counts Variance of counts

Thresh mean Median Mean Median

0.75 2.6615 0.8571 5.3068 0.9048

0.8 2.5585 0.8571 4.9831 0.8095

0.85 2.4392 0.7143 4.7034 0.6667

0.9 2.3003 0.7143 4.3618 0.6190

0.95 2.0503 0.5714 3.8513 0.6190

Higher threshold gives less peptides but of higher confidence and of generally lower variance

The cutoff value of 0.9 was selected

The processed count data – an excerpt

protein peptide run1 run2 run3 run4 run5 run6 run72AAA_MOUSE R.AISHEHSPSDLEAHFVPLVK.R 3 1 1 0 0 1 03BH3_MOUSE K.FSTANPVYVGNVAWAHILAAR.G 0 0 1 0 1 3 03BH7_MOUSE R.HFGYKPLFSWEESR.T 4 8 2 2 4 2 64F2_MOUSE R.SLLHGDFHALSSSPDLFSYIR.H 0 1 0 0 1 0 05NTD_MOUSE R.GPLAHQISGLFLPSK.V 1 0 1 0 3 2 06PGD_MOUSE R.DYFGAHTYELLTKPGEFIHTNWTGHGGSVSSSSYNA.- 1 0 0 0 3 0 0A1A1_MOUSE K.EQPLDEELKDAFQNAYLELGGLGER.V 0 0 0 1 1 0 0A1A1_MOUSE R.YHTEIVFAR.T 3 0 1 0 0 0 0A1AG_MOUSE K.AVTHVGMDESEIIFVDWK.K 7 6 7 5 6 15 5A1AG_MOUSE K.HGAFMLAFDLK.D 3 0 2 0 2 3 1A1AG_MOUSE K.YEGGVETFAHLIVLR.K 3 1 0 0 2 1 1A1B1_MOUSE R.LGAPISSGLSDLFDLTSGVGTLSGSYVAPK.A 0 0 0 1 1 0 0A1T1_MOUSE K.DQSPASHEIATNLGDFAISLYR.E 4 1 1 2 1 3 3A1T1_MOUSE K.KLDQDTVFALANYILFK.G 1 1 2 1 1 1 0A1T1_MOUSE K.NHYQAEVFSVNFAESEEAK.K 5 3 3 2 2 5 4A1T1_MOUSE K.NHYQAEVFSVNFAESEEAKK.V 0 1 3 2 0 3 2A1T1_MOUSE K.SFQHLLQTLNRPDSELQLSTGNGLFVNNDLK.L 24 16 52 13 44 43 33A1T1_MOUSE R.LAQIHFPR.L 1 0 0 0 1 0 0A1T2_MOUSE R.LVQIHIPR.L 1 0 0 0 2 2 0A2A1_MOUSE K.LLGFGSALLDNVDPNPENFVGAGIIQTK.A 1 0 0 1 0 0 0A2AB_MOUSE K.VPTLVSPLSSVGEANGHPKPPR.E 0 0 1 0 1 0 0A2HS_MOUSE R.HAFSPVASVESASGETLHSPK.V 1 3 2 2 3 4 0A2M1_MOUSE K.WARPPISMNFEVPFAPSGLK.V 4 4 4 1 5 6 6A2MG_MOUSE K.HSLGDNDAHSIFQSVGINIFTNSK.I 3 0 6 1 7 7 10A2MG_MOUSE K.VNTNYRPGLPFSGQVLLVDEK.G 0 0 1 1 0 2 0A2MG_MOUSE R.KYFPETWIWDLVPLDVSGDGELAVK.V 0 0 1 0 0 0 1

Some proteins represented by only one peptide, some others by more than one

Searching for the best protein measure

For i = 1,…P, P number of proteins

For j = 1,…p(i), p(i) number of peptides for protein i

If cij is the number of counts for peptide j,

then si= g(ci1, ci2,…cip(i)) is the score for protein i

For every i, one can see Ci = (ci1, ci2,…cip(i)) as a random vector and g(Ci) as a statistic, estimator of (a function of) the abundance of protein i.

Three sensible (and suggested) candidates for function g are mean, max and sum; call them m, M and S.

Specific Research Question: Looking for the function that gives smaller variation across the 7 experiments

In order to compare the estimators an adjustment was made so that they have equal mean values (across the 7 experiment) for every i.

We use S as baseline estimator. The other two need to be adjusted.

Denoting gi = g(Ci) where g is m, M or S, we use the equations:

m_adji = ai * mi, where ai = avg(Si)/avg(mi)

M_adji = bi * Mi, where bi = avg(Si)/avg(Mi)

• We keep 1241 selected proteins with 2 or more counts in all peptides and experiment

• ~700 proteins have only one peptide =>

ai = bi = 1

• a_i: median=1, mean=1.51, max = 16.29

• b_i: median=1, mean=1.20, max = 4.47

The relationships between the 3 statistics depend on the protein

sum

sum

mea

n

mean

max

max

Comparing the variances of the 3 measures

m M Smean 12.226 11.828 8.592median 0.619 0.619 0.667over 1241 proteins >(=)

m M Sm 0(1241) 129(870) 168(749)M 242(870) 0(1241) 226(745)S 324(749) 270(745) 0(1241)

• As a conclusion: “mean” of peptide counts per protein seems to be a slightly more stable statistic

• Similar results were generated when Coefficient of Variance (std/mean) was used instead of variance, as a measure of relative variation (to cope with the fact that the 3 measures have different mean values per protein)

Next step: using covariates to identify “unstable” peptides

• Mass of the peptide

• Charge (values 1, 2 or 3)

• higher # of counts gives higher variance => coefficient of variation was used as a measure of variability

• Tests and model fitting showed evidence that:– Peptides with Charge =3 have higher variability

– Higher mass gives higher variance BUT not higher CV

Weighted mean• Peptides with charge = 3 (i.e. “unstable”) were weighted

less

• The formula used is: wmi =

• ch(j): charge of peptide j,

• Kl, kh: numbers of low charge (1 or 2) and high charge (3) peptides respectively,

• wl, wh: weights for the low and high charge peptides respectively

• I used wl = 2*wh

hl

jchijph

jchijpl

kk

cwcw

3)(

)(2,1)(

)( **

Unfortunately the weighted mean did not produce values of lower variance …

First model fitting efforts• Outcome: count cij

• Covariates: mij, chij, ai (unknown)• Fit a log-linear model• Eliminate the effect of abundance by normalizing per protein• For the covariates, subtract the arithmetic mean• For the outcome (counts), divide by the geometric mean

( )

1

1/ ( )( )

1

( )

p i

ijj

ij ij

ijij p i

p i

ijj

m

mnorm mp i

ccnorm

c

Poisson model gave very high over-dispersion, so a Negative Binomial model was used

Criteria For Assessing Goodness Of Fit Criterion DF Value Value/DF

Deviance 15E3 14231.8717 0.9243 Scaled Deviance 15E3 14231.8717 0.9243

Pearson Chi-Square 15E3 46419.6434 3.0148 Scaled Pearson X2 15E3 46419.6434 3.0148

Log Likelihood 28490.0270 Algorithm converged.

Analysis Of Parameter Estimates Standard Wald 95% Confidence Chi-

Parameter DF Estimate Error Limits Square Pr > ChiSq Intercept 1 0.7605 0.0142 0.7326 0.7884 2856.13 <.0001 mass 1 -0.0004 0.0000 -0.0005 -0.0003 97.88 <.0001

charge_mean 1 1.4668 0.0733 1.3232 1.6105 400.37 <.000Dispersion 1 2.6255 0.0400 2.5471 2.7039

Using the un-normalized count data

Criteria For Assessing Goodness Of Fit Criterion DF Value Value/DF

Deviance 15E3 16036.8557 1.0416 Scaled Deviance 15E3 16036.8557 1.0416

Pearson Chi-Square 15E3 18308.3152 1.1891 Scaled Pearson X2 15E3 18308.3152 1.1891

Log Likelihood 210259.2628 Algorithm converged.

Analysis Of Parameter Estimates Standard Wald 95% Confidence Chi-

Parameter DF Estimate Error Limits Square Pr > ChiSq Intercept 1 1.9767 0.0139 1.9494 2.0039 20202.8 <.0001 mass 1 -0.0003 0.0000 -0.0003 -0.0002 32.89 <.0001

charge_mean 1 0.8567 0.0758 0.7082 1.0052 127.85 <.0001 Dispersion 1 2.8378 0.0338 2.7716 2.9040

Using the normalized count data

???

Conclusions

• Verification that selecting higher threshold for confidence values gives overall more “stable” peptide counts

• “mean” outperforms (even marginally) “max” and “sum” as a protein level measure

• Peptides with charge = 3 seem to be more “unstable”– Implemented “weighted mean” did not improve stability

• A log-linear model seems to fit the count data -> first step for more complex models for acquiring more refined and informed protein level measures (?)

Next steps

• Data with true protein abundances would be useful

• Exploring model fitting (hierarchical GLM models)

• Effects of other covariates (peptide sequence) could be explored (could be hard!)

• Break down the MS experiment process into smaller processes and model accordingly

• Use data of known varying abundances to evaluate the protein measures

Acknowledgements

Biostatistics - PHS

Rafal Kustra

Emili Lab – BBDMR

Andrew Emili

Clement Chung

Thomas Kislinger

Thank you!

Questions? Comments? Suggestions?

Supplementary Information

Variance depends on mean

Coefficient of Variance (std/mean) - it provides a relative (rather than an absolute) measure of dispersion – it is independent of the mean => Use coefficient of variance instead

Selecting appropriate dispersion measure

Each dot corresponds to a protein

Y axis: difference between the CV values of two measures

X axis: protein rank when ordered increasingly by the average value of the measure mentioned first

As we move from (hypothetically) lower abundance to higher abundance proteins the order of quality of measures changesFor low/medium abundance proteins, “mean” is better more often

If we are interested in lower/medium abundance proteins mean appears to be better measure

1241 proteins with more than 1 peptide count total in all 7 experiments

Comparing the Coefficient of Variance of the three measures

Mean max sum

Mean 1.1593 1.1741 1.183

Median 1.1114 1.1547 1.1547

#of prot. with CV <=1

553 540 528

The three measure functions show similar behaviorMean is marginally better

Exploring further: comparing performance trends with respect to protein abundance…

#of peptides identifiedcutoff: 0.75 mean: 2091 var: 59451cutoff: 0.8 mean: 1897 var: 50075cutoff: 0.85 mean: 1704.43 var: 44376.6cutoff: 0.9 mean: 1518.71 var: 32945.6cutoff: 0.95 mean: 1271.14 var: 22605.5

# of proteins identifiedcutoff: 0.75 mean: 1336.86 var: 22985.1cutoff: 0.8 mean: 1186.57 var: 18895cutoff: 0.85 mean: 1042.86 var: 16031.5cutoff: 0.9 mean: 909.714 var: 11183.6cutoff: 0.95 mean: 742.714 var: 7599.24

Peptides per proteincutoff: 0.75 mean: 1.56503 var: 0.0034228cutoff: 0.8 mean: 1.59964 var: 0.00315881cutoff: 0.85 mean: 1.63534 var: 0.00320114cutoff: 0.9 mean: 1.67008 var: 0.00298823cutoff: 0.95 mean: 1.71291 var: 0.00440239

Stats per Confidence cutoff values

Other covariates

• Peptide Charge

• SEQUEST output (from which the confidence values are estimated)

• Number of scan for the peptide (related to the time when the peptide entered the mass spec)

• Patterns in the peptide sequence itself (motifs)

The ABC’s (and XYZ’s) of peptide sequencing-Hanno Steen & Matthias Mann, Nature Reviews Molecular Cell Biology 5, 699 -711 (2004)

The mass-spectrometry/proteomic experiment

spectra

Confidence values

Correlation scores

SEQUEST, Eng et al, 1994

STATQUEST, Kislinger et al, 2003

“Using the SEQUEST algorithm, acquired fragmentation spectra of peptides are correlated with predicted amino acid sequences in translated genomic databases”

exploring the use of mass spectrometry data for differential proteomics

Documents

mass spectrometry andrew

mass spectrometers

use of mass spectrometry

mass charge ratios

clinical genomics

confidence value

impact of genomics

particular peptide instance