analysis of mass spectrometry protein data · proteomics, what and why? • mass spectrometry •...

59
Analysis of mass spectrometry protein data Jenny Barrett Section of Epidemiology and Biostatistics Leeds Institute of Molecular Medicine University of Leeds, UK

Upload: others

Post on 20-Jul-2020

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Analysis of mass spectrometry protein data

Jenny BarrettSection of Epidemiology and Biostatistics

Leeds Institute of Molecular MedicineUniversity of Leeds, UK

Page 2: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Outline•

Introduction•

Proteomics, what and why?

Mass spectrometry

Statistical issues•

Experimental design

Preprocessing•

calibration, baseline subtraction, normalisation

Peak detection and alignment•

Multivariate exploratory data analysis

Statistical models to detect differences between groups•

Classification of samples

Concluding remarks

Page 3: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

What is proteomics?

Proteins are: cell-specifictime-specific

Protein –

chain of amino acids in 3D structure which carry out biological functions, made from DNA via mRNA

Proteome – the total protein complement expressed by a cell

Proteomics

the study of proteomes

Page 4: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

DNA ProteinRNA mRNA

Transcription Processing Translation

Transcriptionalcontrol

Post- transcriptionalControl e.g.alternative splicing

Translational controls

Activity controls

Epigenetic factorsFeedback loops

Why Proteomics?

Closer to the disease process

Page 5: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Proteomics technology•

Mass spectrometry (MS) can be used to characterise the proteome when applied to the appropriate biological sample (plasma, serum, . . .)

Many experimental pre-processing steps applied prior to MS•

depletion of abundant proteins, fractionation, digestion, separation . . .

Various types of MS analyzer –

our focus is on MS based on time-of-flight

Page 6: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Mass spectrometry

Molecules (including peptides and proteins) have different masses

MS separates ionised molecules by their mass-to- charge (m/z) ratio

Time-of-flight (TOF), which can be measured, differs according to m/z

-

can then be converted to

m/z

scale

Typical range considered 2000 to 50000 Daltons (<2000 Da

too much noise)

Page 7: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

YYLaserLaser

(+)(+) (+)(+)(+)(+) (+)(+)(+)(+)(+)(+)(+)(+)

(+)(+)(+)(+)(+)(+)(+)(+)(+)

+ - Accelerating PotentialAccelerating Potential

Mass Spectrometer

S am ple Ch ip

G r id s D etector

A cce leration phase

Page 8: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Example MS analysers•

MALDI

(Matrix Assisted Laser Desorption-

Ionization)•

Apply sample to chip

Uses energy-absorbing matrix which facilitates ionisation without much fragmentation

SELDI (Surface Enhanced . . .)•

Like MALDI but chips are pre-coated with special surface

Different proteins bind to surface depending on chip type

iTRAQ

(Isobaric Tags for Relative and Absolute Quantification)•

Samples are chemically labelled

Page 9: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Scope of this talk

Not talking about experimental

pre-processing

MS of the proteome results in a series of data points (x, y) where x is the m/z

value and y is a

measure of abundance

Our focus is on the analysis of such data in a “label- free”

context –

we will not discuss the important

step of identifying the peptides and proteins

Page 10: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

020

4060

Inte

nsity

0 5000 10000 15000 20000Mass to charge ratio (m/z)

020

4060

Inte

nsity

0 5000 10000 15000 20000Mass to charge ratio (m/z)

Spectra

from two duplicate samples

Page 11: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

020

4060

Inte

nsity

0 5000 10000 15000 20000Mass to charge ratio (m/z)

020

4060

Inte

nsity

0 5000 10000 15000 20000Mass to charge ratio (m/z)

Spectra

from two distinct samples

Page 12: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Experimental Design•

Quantitative measures of the proteome are influenced by many sources of variation –

experimental and inter-individual

Careful experimental design is crucial if false inferences are to be avoided, at all stages, including sample selection and sample handling

Numerous studies have demonstrated the importance of sample handling

Page 13: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

2500 5000 7500 10000

2500 5000 7500 10000

0

25

50

75

100

2500 5000 7500 10000

0

25

50

75

100

2500 5000 7500 10000

0

25

50

75

100

2500 5000 7500 10000

0

25

50

75

100

2500 5000 7500 10000

2500 5000 7500 10000

2500 5000 7500 10000

0

25

50

75

100

2500 5000 7500 10000

0

25

50

75

100

2500 5000 7500 10000

0

25

50

75

100

2500 5000 7500 10000

0

25

50

75

100

2500 5000 7500 10000

2500 5000 7500 10000

2500 5000 7500 10000

0

25

50

75

100

2500 5000 7500 10000

0

25

50

75

100

2500 5000 7500 10000

0

25

50

75

100

2500 5000 7500 10000

0

25

50

75

100

2500 5000 7500 10000

Citrate

EDTA

Heparin

Serum

CM10 IMAC-Cu

-

H50

m/z m/z m/z

Inte

nsity

Peak profiles from 3 different SELDI chip types Sample processed in 4 different waysBanks et al, Clinical Chemistry 2005

Page 14: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

050

100

150

CM10LOW

NV

Seru

m 2

40N

V Se

rum

120

NV

Seru

m 2

40N

V Se

rum

60

NV

Seru

m 1

20N

V Se

rum

0N

V Se

rum

60

SH S

erum

30

SH S

erum

0D

T Se

rum

0SH

Ser

um 2

40SH

Ser

um 2

40SH

Ser

um 3

0SH

Ser

um 1

20SH

Ser

um 6

0N

V Se

rum

30

NV

Seru

m 3

0D

T Se

rum

120

DT

Seru

m 3

0D

T Se

rum

120

DT

Seru

m 6

0D

T Se

rum

30

DT

Seru

m 0

DT

Seru

m 2

40D

T Se

rum

240

DT

Seru

m 6

0SH

Ser

um 1

20SH

Ser

um 6

0SH

Ser

um 0

NV

Seru

m 0

DT

Hep

arin

60

DT

Hep

arin

240

DT

Hep

arin

0D

T H

epar

in 1

20D

T H

epar

in 6

0D

T H

epar

in 2

40D

T H

epar

in 3

0D

T H

epar

in 1

20D

T H

epar

in 3

0D

T H

epar

in 0

NV

Hep

arin

30

NV

Hep

arin

240

NV

Hep

arin

120

NV

Hep

arin

0N

V H

epar

in 6

0N

V H

epar

in 2

40N

V H

epar

in 0

NV

Hep

arin

120

NV

Hep

arin

30

NV

Hep

arin

60

SH H

epar

in 3

0SH

Hep

arin

60

SH H

epar

in 0

SH H

epar

in 2

40SH

Hep

arin

30

SH H

epar

in 2

40SH

Hep

arin

60

SH H

epar

in 1

20SH

Hep

arin

120

DT

Citr

ate

120

DT

Citr

ate

0D

T C

itrat

e 0

DT

Citr

ate

30D

T C

itrat

e 30

DT

Citr

ate

240

DT

Citr

ate

60D

T C

itrat

e 60

DT

Citr

ate

240

DT

Citr

ate

120

SH C

itrat

e 30

NV

Citr

ate

60SH

Citr

ate

0SH

Citr

ate

120

SH C

itrat

e 60

SH C

itrat

e 24

0SH

Citr

ate

30SH

Citr

ate

240

SH C

itrat

e 12

0N

V C

itrat

e 60

SH C

itrat

e 0

SH C

itrat

e 60

NV

Citr

ate

30N

V C

itrat

e 24

0N

V C

itrat

e 12

0N

V C

itrat

e 24

0 17

NV

Citr

ate

30N

V C

itrat

e 12

0N

V C

itrat

e 0

NV

Citr

ate

0N

V ED

TA 6

0N

V ED

TA 2

40N

V ED

TA 1

20N

V ED

TA 0

NV

EDTA

30

NV

EDTA

0N

V ED

TA 3

0D

T ED

TA 0

DT

EDTA

30

DT

EDTA

120

DT

EDTA

240

DT

EDTA

60

DT

EDTA

240

DT

EDTA

120

DT

EDTA

60

DT

EDTA

30

DT

EDTA

0SH

ED

TA 3

0SH

ED

TA 6

0SH

ED

TA 0

NV

EDTA

240

SH E

DTA

240

NV

EDTA

120

NV

EDTA

60

SH E

DTA

60

SH E

DTA

240

SH E

DTA

30

SH E

DTA

120

SH E

DTA

120

SH E

DTA

0 050

100

150

200

250

H50LOW

HC

Citr

ate

120

185

HC

ED

TA 2

40 1

85H

C E

DTA

0 1

85H

C C

itrat

e 0

185

HC

Hep

arin

120

185

HC

ED

TA 6

0 18

5H

C S

erum

30

185

SH E

DTA

30

185

SH C

itrat

e 60

185

DT

Seru

m 0

185

DT

Seru

m 2

40 1

85N

V ED

TA 2

40 1

85N

V Se

rum

30

185

NV

Citr

ate

60 1

85N

V ED

TA 0

185

NV

Citr

ate

240

185

NV

Hep

arin

0 1

85N

V H

epar

in 1

20 1

85H

C E

DTA

30

185

HC

Ser

um 2

40 1

85N

V Se

rum

240

185

NV

EDTA

120

185

DT

Seru

m 1

20 1

85H

C S

erum

0 1

85H

C H

epar

in 2

40 1

85D

T E

DTA

240

185

HC

Citr

ate

60 1

85H

C H

epar

in 0

185

HC

Citr

ate

240

185

HC

Ser

um 6

0 18

5SH

Ser

um 1

20 1

85SH

Citr

ate

0 18

5SH

Hep

arin

30

185

SH C

itrat

e 30

185

SH C

itrat

e 60

185

SH H

epar

in 0

185

SH H

epar

in 1

20 1

85SH

Citr

ate

240

185

SH C

itrat

e 30

185

DT

Citr

ate

240

185

DT

Seru

m 2

40 1

85SH

ED

TA 1

20 1

85SH

Hep

arin

60

185

SH S

erum

120

185

SH S

erum

240

185

SH H

epar

in 3

0 18

5SH

ED

TA 2

40 1

85SH

ED

TA 3

0 18

5H

C H

epar

in 3

0 18

5H

C H

epar

in 6

0 18

5N

V Se

rum

240

185

HC

Ser

um 1

20 1

85H

C E

DTA

240

185

SH E

DTA

60

185

SH H

epar

in 2

40 1

85SH

ED

TA 0

185

SH C

itrat

e 12

0 18

5SH

Ser

um 6

0 18

5SH

Citr

ate

240

185

SH H

epar

in 6

0 18

5SH

Ser

um 3

0 18

5SH

Ser

um 0

185

SH E

DTA

240

185

SH C

itrat

e 0

185

SH E

DTA

60

185

SH S

erum

240

185

SH E

DTA

0 1

85SH

ED

TA 1

20 1

85SH

Ser

um 6

0 18

5SH

Ser

um 3

0 18

5SH

Hep

arin

120

185

SH C

itrat

e 12

0 18

5SH

Ser

um 0

185

DT

Seru

m 3

0 18

5D

T Se

rum

120

185

DT

Citr

ate

30 1

85D

T ED

TA 0

185

DT

Citr

ate

0 18

5D

T H

epar

in 3

0 18

5N

V H

epar

in 2

40 1

85N

V H

epar

in 6

0 18

5N

V Se

rum

120

185

DT

Seru

m 6

0 18

5D

T E

DTA

120

185

DT

Hep

arin

240

185

NV

Citr

ate

30 1

85N

V ED

TA 3

0 18

5N

V C

itrat

e 0

185

NV

EDTA

0 1

85N

V ED

TA 2

40 1

85N

V C

itrat

e 0

185

NV

Seru

m 6

0 18

5N

V Se

rum

60

185

NV

Hep

arin

240

185

NV

Seru

m 1

20 1

85N

V H

epar

in 0

185

NV

EDTA

60

185

NV

Hep

arin

30

185

NV

Seru

m 0

185

NV

Citr

ate

30 1

85SH

Hep

arin

240

185

NV

Citr

ate

60 1

85D

T C

itrat

e 60

185

DT

Citr

ate

120

185

DT

EDTA

0 1

85D

T H

epar

in 1

20 1

85D

T E

DTA

240

185

DT

EDTA

60

185

DT

EDTA

30

185

DT

Seru

m 3

0 18

5N

V C

itrat

e 12

0 18

5N

V H

epar

in 1

20 1

85N

V Se

rum

0 1

85D

T H

epar

in 0

185

NV

Hep

arin

30

185

NV

EDTA

30

185

NV

EDTA

120

185

NV

Hep

arin

60

185

NV

Citr

ate

120

185

NV

Seru

m 3

0 18

5N

V C

itrat

e 24

0 18

5N

V ED

TA 6

0 18

5D

T C

itrat

e 0

185

DT

Hep

arin

30

185

DT

EDTA

120

185

DT

Citr

ate

60 1

85D

T H

epar

in 0

185

DT

Seru

m 6

0 18

5D

T H

epar

in 2

40 1

85D

T C

itrat

e 30

185

DT

Hep

arin

60

1850

5010

015

020

025

030

035

0

IMAC30LOW

NV

Seru

m 1

20N

V Se

rum

60

SH S

erum

30

NV

Seru

m 3

0D

T Se

rum

240

DT

Seru

m 3

0D

T Se

rum

0

DT

Seru

m 6

0D

T Se

rum

60

DT

Seru

m 0

SH

Ser

um 0

NV

Seru

m 0

SH S

erum

0N

V Se

rum

0N

V Se

rum

240

SH S

erum

240

SH S

erum

240

SH S

erum

30

SH S

erum

120

SH S

erum

120

SH S

erum

60

SH S

erum

60

NV

Seru

m 6

0N

V Se

rum

120

NV

Seru

m 3

0D

T Se

rum

240

NV

Seru

m 2

40D

T Se

rum

30

DT

Seru

m 1

20D

T Se

rum

120

SH H

epar

in 2

40SH

Hep

arin

30

SH E

DTA

120

SH E

DTA

0SH

ED

TA 2

40N

V ED

TA 0

SH E

DTA

60

SH E

DTA

60

SH E

DTA

120

SH E

DTA

30

NV

Hep

arin

60

NV

Citr

ate

60

NV

Citr

ate

240

NV

Hep

arin

0SH

Hep

arin

120

SH H

epar

in 6

0 SH

Hep

arin

0SH

ED

TA 3

0SH

ED

TA 2

40N

V ED

TA 1

20

NV

EDTA

120

NV

EDTA

30

NV

EDTA

30

NV

EDTA

0

NV

EDTA

60

NV

EDTA

60

NV

Hep

arin

240

NV

EDTA

240

NV

EDTA

240

DT

Hep

arin

0D

T C

itrat

e 60

D

T H

epar

in 0

NV

Citr

ate

30

SH C

itrat

e 60

N

V C

itrat

e 0

DT

EDTA

120

DT

ED

TA 3

0 D

T ED

TA 0

DT

ED

TA 0

D

T H

epar

in 3

0D

T E

DTA

60

DT

EDTA

120

DT

ED

TA 3

0 D

T E

DTA

60

DT

Citr

ate

0D

T H

epar

in 2

40D

T H

epar

in 3

0D

T H

epar

in 1

20D

T C

itrat

e 24

0 D

T C

itrat

e 0

DT

EDTA

240

DT

Citr

ate

120

DT

Citr

ate

120

SH C

itrat

e 12

0SH

Citr

ate

240

NV

Citr

ate

60

SH C

itrat

e 60

SH

Citr

ate

0D

T C

itrat

e 60

DT

Citr

ate

30

DT

Hep

arin

60

DT

Citr

ate

240

DT

EDTA

240

DT

Hep

arin

240

DT

Hep

arin

60

SH C

itrat

e 30

SH

Citr

ate

30

NV

Citr

ate

120

NV

Hep

arin

120

NV

Citr

ate

120

NV

Hep

arin

30

NV

Citr

ate

30

NV

Citr

ate

0D

T H

epar

in 1

20

SH H

epar

in 0

SH H

epar

in 1

20N

V H

epar

in 2

40N

V H

epar

in 1

20N

V H

epar

in 3

0 N

V C

itrat

e 24

0N

V H

epar

in 0

NV

Hep

arin

60

DT

Citr

ate

30

SH H

epar

in 3

0 SH

Citr

ate

0SH

Hep

arin

240

SH C

itrat

e 12

0SH

Citr

ate

240

SH H

epar

in 6

0

CM10 IMAC-Cu H50

Cluster analysis of 119 samples based on peak profilesSerum, Citrate plasma, EDTA plasma, Heparin plasma

Page 15: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Essential design principles•

To avoid confounding, at sample selection stage groups to be compared should be matched wherever possible for factors such as age and sex

To avoid bias, samples from different groups must not be treated differently (e.g. in time to processing, method of processing).

Use randomisation wherever possible, e.g. in assignment of samples to chips

Page 16: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Designing efficient studies•

Attention to experimental design can also improve efficiency of studies

More sophisticated systems of randomization may be useful to remove known sources of variation (block designs)

Technical replicates can be used to reduce intra- individual variability

Sample size calculations should be carried out to maximize power within constraints of the study

Page 17: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Analysis –

pre-processing•

Calibration•

Time-of-flight converted to m/z

scale using a

number of calibrants

with known m/z

values

Baseline subtraction•

Remove baseline “noise”

from matrix, e.g. by

Loess smoothing

Normalisation•

Data normalised to create comparability between spectra

Page 18: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

020

4060

8010

0in

tens

ity

0 10000 20000 30000 40000 50000mz

Page 19: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

020

4060

8010

0

0 10000 20000 30000 40000 50000mz

intensity baseline

Page 20: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

020

4060

8010

0ba

selin

e_co

rrec

ted

0 10000 20000 30000 40000 50000mz

Page 21: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Normalisation•

Aim is to remove technical sources of variation (machine performance, sample amount, etc)

Normalisation

by the (local or global) total ion current is often applied

Concerns have been raised that this may remove relevant biological information as well as the desired experimental variability

Optimal methods are an area of ongoing research

Page 22: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Peak alignment and detection•

There is imprecision on both the x (m/z) and y (abundance) axes

Peaks in any one spectrum may be defined as points that are maximal within +/-

N points on either side

Need to also consider signal-to-noise ratio -

peaks must exceed the typical local intensity values by a certain ratio

How can peaks be aligned across spectra allowing for variability in x?

Page 23: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Numerous methods of peak detection –

none perfect

Morris et al (2005): use the mean of all spectra•

a peak denoting a molecular feature should stand out against noise

circumvents problem of alignment

Tibshirani

et al (2004): apply one-dimensional clustering algorithm to m/z

values of peaks across

spectra•

Tight clusters should represent the same biological peak possibly shifted slightly along x-axis

Use centroid

of each cluster as consensus m/z

value

Page 24: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Peak detection –

in-house method

Smooth each spectrum using progressively wider moving averages

Identify peaks of size n in each of these smoothed spectra (local maxima with at least n lower points at each side)

Create a frequency distribution

Identify peak clusters

Page 25: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Smoothing a spectrum

Page 26: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Smoothed spectra

Page 27: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Smoothed spectra

Page 28: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Peak detection

Smooth each spectrum using progressively wider moving averages

Identify peaks of size n in each of these smoothed spectra (local maxima with at least n lower points at each side)

Create a frequency distribution

Identify peak clusters

Page 29: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Frequency distribution of peaks

2000 4000 6000 8000 10000

050

010

0015

0020

0025

0030

00

m/z

Freq

uenc

y

Page 30: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Identifying peak clusters

3000 3200 3400 3600 3800 4000

050

010

0015

0020

0025

0030

00

m/z

Freq

uenc

y

3000 3200 3400 3600 3800 4000

050

010

0015

0020

0025

0030

00

m/z

Freq

uenc

y

3000 3200 3400 3600 3800 4000

050

010

0015

0020

0025

0030

00

m/z

Freq

uenc

y

Page 31: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Scoring peaks

Each spectrum compared to resulting peak list to determine presence/absence of peak

If peak is present then largest abundance measurement within certain tolerance is recorded as peak abundance

Not all peaks can be detected and it is a subject of research to consider how to treat these (zeroes, missing, censored, . . .)

Page 32: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Extracting common peaks from spectra

4220 4240 4260 4280 4300 4320

0.60

0.65

0.70

0.75

0.80

m/z (Da)

Abun

danc

e

9000 9100 9200 9300 9400 9500

0.60

0.62

0.64

0.66

0.68

0.70

m/z (Da)

Abun

danc

e

4220 4240 4260 4280 4300 4320

0.60

0.65

0.70

0.75

0.80

m/z (Da)

Abun

danc

e

9000 9100 9200 9300 9400 9500

0.60

0.62

0.64

0.66

0.68

0.70

m/z (Da)

Abun

danc

e

Page 33: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Peak profile data

After all this processing, raw data is reduced to ~300 to 1000 fixed peaks

For each sample the data consist of an abundance measure at each of these peaks

Aims•

Identify peaks that differ between groups

Develop classification scheme which can be applied to new samples

Page 34: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Exploratory data analysis

Proteomics data is multivariate –

measurements on numerous inter-correlated peaks

Some exploratory data analysis is useful to explore gross differences between groups –

e.g. compare

mean spectra

Can carry out principal components analysis to investigate the pattern of variation more fully

Page 35: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Renal transplant study•

Set of 50 renal transplant recipients, 30 with clinically defined stable condition and 20 with unstable condition

• In addition 30 healthy controls and 30 patients

with renal failure (prior to dialysis), similar age and gender distribution in each group

Identical sample handling and processing protocol used

Samples randomised to chips and all analysed in duplicate

3 chip types (H50, IMAC and CM10)

Page 36: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline
Page 37: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

-10 0 10 20 30 40

-20

-15

-10

-50

510

RT imac30 med

PC1

PC

2

S

S

SSU

U

U U

UU

SS

UU

U

U

S

S

S

S

U

U

S

S

U

U

S

SU

UU

U

SS

S S U

U

S

S

UUSSU

U

SSS S

UU

S

S

S

S

U

U

U

U

UU

S S

SSU

US

S

S

S SS

S S

S

S

SS S

S

U UU U

U

US

S

S

S

SS

SS

SSS

S

Page 38: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Which peaks differ between patients and controls?

Simple test to compare group means for each peak•

To allow for duplicated samples we use linear random effects model

For person i, replicate k, the intensity is modelled by

yik

= α

+ βxi

+ ηi

+ εik

where xi

is indicator (1/0) of case/control status, ηi

~ N(0, σu2), εik

~ N(0, σe2)

Covariates can easily be included

Page 39: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Sample size calculations

Sample size depends on intra-

(σ2) and inter-

(τ2) sample variance, number (m) of technical replicates, difference in means to be detected ( δ)

For a comparison of 2 groups, for a particular peak, the number of samples n required in each group is

where α

is the significance level and 1-β

the power

⎟⎟⎠

⎞⎜⎜⎝

⎛+= ⎥⎦

⎤⎢⎣

⎡ +m

mnzz στδ

βα2

2

2

2/2

Page 40: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Estimates of variance•

Sample size requirements depend heavily on intra-

and inter-sample variance (σ2

and τ2) of the abundance measures for the peak

σ2

and τ2

can be estimated from pilot data

Since there are multiple peaks, to get one estimate we may use the median, 90th

percentile or

maximum variance across peaks

Generally maximum is too conservative, and median will mean power may be low for 50% of peaks

Page 41: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Effect of variance estimate on sample size

Cairns et al, Proteomics, 2009

Page 42: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Multiple testing•

Several hundred peaks will typically be tested. At the discovery stage the aim is not to demonstrate a difference between groups with certainty, but to ensure that a high proportion of “significant”

results are true positives

More useful to think of false discovery rate (FDR) than false positive rate (FPR)

FPR = E[(# false positives)/(# true null)] = α

FDR = E[(# false positives)/(# significant)]

Page 43: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

False discovery rate

Let P be the total number of peaks and π

the proportion of peaks that are truly differentially expressed by a particular amount, then the FDR can be estimated by:

This can be used to inform choice of α

( )( ){ }πβπα

πα)1(1

1−+−

Page 44: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

FDR as a function of α, β

and π

Page 45: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Improving the statistical model

Various improvements have been proposed to the statistical analysis process (e.g. Oberg et al, 2008)

Normalisation could be carried out in the same step as testing and estimating differences between groups;

In labelled experiments a structure can be imposed on the peaks (peptides within proteins, etc.)

Page 46: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Classification of samples

There are many available classification methods

Classification trees•

Use peak information to successively split the sample until all subjects at the same node are in the same class or a minimum node size is reached

Key issue is to avoid over-fitting

Page 47: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Random Forest algorithm•

The Random Forest (RF) algorithm is a classifier based on a group of classification trees

In each tree, samples are successively split using the classifier that “best”

discriminates between classes.

Usually this is based on the Gini

index, a measure of impurity of the nodes

Trees can be grown until each node consists entirely of samples from one class (pure node). Alternatively a minimum node size can be set as stopping rule

Idea of RF is to grow many trees and classify on the basis of the majority vote (Breiman, 2001)

Page 48: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Avoiding over-fitting

How do we make the trees differ and be less dependent on the particular data set?

Each tree is based on a bootstrap sample of original data set

At each partition only a random subset of the variables is available to choose from to make the best split

Each tree used to classify items NOT in the bootstrap sample (out-of-bag (OOB))

Page 49: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Breast cancer proteomics data

Several groups met in Leiden, Netherlands, in 2007 to compare methods of classifying samples from breast cancer patients and controls on basis of proteomics data

Data set:–

Mass spectrometry (MS) proteomic profiles from 76 breast cancer cases and 77 controls formed initial data set

Profiles from an additional 39 cases and 39 controls used as validation data

MS profiles:–

The data were available after some pre-processing

Abundance measurements for 11,205 points corresponding to mass-to-charge (m/z) ratios along the spectrum

Mertens, SAGMB, 2008

Page 50: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Mean spectra from controls and cases

2000 4000 6000 8000 10000

0.60

0.70

0.80

mean spectra (controls)

m/z

Inte

nsity

2000 4000 6000 8000 10000

0.60

0.70

0.80

mean spectra (cases)

m/z

Inte

nsity

Page 51: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Peak detection: reduce the 11,205 data points to 365 “robust”

peaks

Using all 153 samples 444 peaks detected–

Robust peak detection method reduced this to 365 peaks which appear in 7 out of 10 subsets selected

Random Forest: apply RF to the 365 peaks, using 20,000 trees in the forest, and classify according to majority vote

We have also compared this to one-stage strategy: RF applied to all 11,205 data points

Two-stage analysis strategy

Page 52: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

(a) Predictedclassification

True classification (b)Predictedclassification

True classification

Case Control Case Control

Case 62 13 Case 62 11

Control 14 64 Control 14 66

Classification results

Confusion matrix of true versus predicted case-control status for the classification (a) based on all data points and (b) based on peaks

Page 53: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Scatter plot of case and control samples showing votes from the two applications of the RF algorithm

0.2

.4.6

.81

Vot

e fro

m R

F ba

sed

on a

ll da

ta

0 .2 .4 .6 .8Vote from RF based on peaks

Cases Controls

Page 54: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Classification: calibration versus validation results

Of the groups taking part in the original project, our analysis (using RF applied to peaks) had joint lowest percentage correct from calibration data (84%) but highest from validation data (83% correct)

.6.7

.8.9

1P

erce

nt c

orre

ct

.6.7

.8.9

1P

erce

nt c

orre

ct

Training ValidationData sets

Barrett & Cairns, SAGMB, 2008; Hand, SAGMB, 2008

Page 55: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

. . . including results using raw data points

Rerunning with RF applied to raw data points, results from both calibration and validation data sets are very similar (82% and 85% correct)

Barrett & Cairns, SAGMB, 2008; Hand, SAGMB, 2008

.6.7

.8.9

1P

erce

nt c

orre

ct

.6.7

.8.9

1P

erce

nt c

orre

ct

Training ValidationData sets

Page 56: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Classification: summary

RF has good performance as a classifier –

in terms of percentage correctly classified -

certainly comparable

to other methods

One especially good property is unbiased estimation of performance from initial data set

Use of peaks has little effect on classification accuracy compared with raw (correlated) data

Parsimony principle would favour use of summary measures where good ones can be found

Page 57: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

Variable importance measures

Can also see which variables have been important in classification –

emphasis is on ranking of variables

rather than absolute measures

Most common VIMs

are Gini

index (based on impurity of nodes) and permutation-based VIM

Have been used in numerous studies in the analysis of genetic, gene expression and proteomic data

We have investigated VIMs

but so far not found them to outperform analysis of each peak separately

Page 58: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

General conclusions•

Careful attention to study design is essential to avoid bias and improve efficiency

Proper understanding of (sources of) variation is needed, so that these can be minimized or removed as far as possible

Complex analyses should be preceded by more simple descriptive analyses

Page 59: Analysis of mass spectrometry protein data · Proteomics, what and why? • Mass spectrometry • Statistical issues • Experimental design • Preprocessing • calibration, baseline

AcknowledgementsSection of Epidemiology and Biostatistics and Clinical Proteomics Research Group, Leeds Institute of Molecular Medicine, University of Leeds

Especially:David Cairns –

MRC Research fellow in Statistics

Roz

Banks

Professor of Biomedical proteomics

Dave Perkins

Principal research scientist in bioinformatics