analysis of mass spectrometry protein data · proteomics, what and why? • mass spectrometry •...

Analysis of mass spectrometry protein data

Jenny Barrett
Section of Epidemiology and Biostatistics
Leeds Institute of Molecular Medicine
University of Leeds, UK

Outline

• Introduction
  - Proteomics, what and why?
  - Mass spectrometry
• Statistical issues
  - Experimental design
  - Preprocessing: calibration, baseline subtraction, normalisation
  - Peak detection and alignment
  - Multivariate exploratory data analysis
  - Statistical models to detect differences between groups
  - Classification of samples
• Concluding remarks

What is proteomics?

• Protein: a chain of amino acids in a 3D structure which carries out biological functions, made from DNA via mRNA
• Proteome: the total protein complement expressed by a cell
• Proteomics: the study of proteomes
• Proteins are cell-specific and time-specific

[Figure: DNA → RNA → mRNA → Protein, via transcription, processing and translation. Labelled points of regulation: transcriptional control, post-transcriptional control (e.g. alternative splicing), translational controls, activity controls; also epigenetic factors and feedback loops]

Why Proteomics?

Closer to the disease process

Proteomics technology

• Mass spectrometry (MS) can be used to characterise the proteome when applied to the appropriate biological sample (plasma, serum, . . .)
• Many experimental pre-processing steps are applied prior to MS
  - depletion of abundant proteins, fractionation, digestion, separation . . .
• Various types of MS analyzer exist; our focus is on MS based on time-of-flight

Mass spectrometry

• Molecules (including peptides and proteins) have different masses
• MS separates ionised molecules by their mass-to-charge (m/z) ratio
• Time-of-flight (TOF), which can be measured, differs according to m/z
  - TOF can then be converted to the m/z scale
• Typical range considered is 2000 to 50000 Daltons (below 2000 Da there is too much noise)

[Figure: Mass spectrometer schematic showing the laser, sample chip, acceleration phase (accelerating potential, + to -), grids and detector]

Example MS analysers

• MALDI (Matrix Assisted Laser Desorption-Ionization)
  - Apply sample to chip
  - Uses an energy-absorbing matrix which facilitates ionisation without much fragmentation
• SELDI (Surface Enhanced . . .)
  - Like MALDI but chips are pre-coated with a special surface
  - Different proteins bind to the surface depending on chip type
• iTRAQ (Isobaric Tags for Relative and Absolute Quantification)
  - Samples are chemically labelled

Scope of this talk

• Not talking about experimental pre-processing
• MS of the proteome results in a series of data points (x, y) where x is the m/z value and y is a measure of abundance
• Our focus is on the analysis of such data in a "label-free" context; we will not discuss the important step of identifying the peptides and proteins

[Figure: Spectra from two duplicate samples (intensity against mass to charge ratio (m/z), 0 to 20000)]

[Figure: Spectra from two distinct samples (intensity against mass to charge ratio (m/z), 0 to 20000)]

Experimental Design

• Quantitative measures of the proteome are influenced by many sources of variation, both experimental and inter-individual
• Careful experimental design is crucial at all stages, including sample selection and sample handling, if false inferences are to be avoided
• Numerous studies have demonstrated the importance of sample handling

[Figure: Peak profiles (intensity against m/z, 2500 to 10000) from 3 different SELDI chip types (CM10, IMAC-Cu, H50) for a sample processed in 4 different ways (citrate, EDTA, heparin, serum). Banks et al, Clinical Chemistry 2005]

[Figure: Cluster analysis of 119 samples based on peak profiles, one panel per chip type (CM10, IMAC-Cu, H50). Sample types: serum, citrate plasma, EDTA plasma, heparin plasma]

Essential design principles

• To avoid confounding, at the sample selection stage the groups to be compared should be matched wherever possible for factors such as age and sex
• To avoid bias, samples from different groups must not be treated differently (e.g. in time to processing, method of processing)
• Use randomisation wherever possible, e.g. in assignment of samples to chips
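As an illustration of the last point, a minimal sketch of randomly assigning samples to chips. The function name, the sample labels and the figure of 8 spots per chip are assumptions made for the example, not details from the talk.

```python
import random

def randomise_to_chips(sample_ids, spots_per_chip, seed=None):
    """Shuffle the samples, then fill chips in order (hypothetical helper)."""
    rng = random.Random(seed)          # seeded so the layout can be reproduced
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    # Consecutive slices of the shuffled list become the chips.
    return [shuffled[i:i + spots_per_chip]
            for i in range(0, len(shuffled), spots_per_chip)]

# Hypothetical study: 8 cases and 8 controls, 8 spots per chip.
samples = [f"case{i}" for i in range(8)] + [f"control{i}" for i in range(8)]
layout = randomise_to_chips(samples, spots_per_chip=8, seed=1)
for chip_number, chip in enumerate(layout, start=1):
    print(chip_number, chip)
```

Simple shuffling does not guarantee that each chip holds equal numbers of cases and controls; a block design would shuffle within groups first.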

Designing efficient studies

• Attention to experimental design can also improve efficiency of studies
• More sophisticated systems of randomization may be useful to remove known sources of variation (block designs)
• Technical replicates can be used to reduce intra-individual variability
• Sample size calculations should be carried out to maximize power within constraints of the study

Analysis: pre-processing

• Calibration
  - Time-of-flight is converted to the m/z scale using a number of calibrants with known m/z values
• Baseline subtraction
  - Remove baseline "noise" from the matrix, e.g. by Loess smoothing
• Normalisation
  - Data are normalised to create comparability between spectra
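The calibration step can be sketched as follows, assuming the idealised time-of-flight relation in which the square root of m/z is linear in flight time. Instrument software may use a different calibration equation; the function names and the calibrant values here are invented for illustration.

```python
import math

def fit_tof_calibration(times, known_mz):
    """Least-squares fit of sqrt(m/z) = slope * t + intercept (assumed model)."""
    roots = [math.sqrt(mz) for mz in known_mz]
    n = len(times)
    mean_t = sum(times) / n
    mean_r = sum(roots) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (r - mean_r) for t, r in zip(times, roots))
    slope = sxy / sxx
    intercept = mean_r - slope * mean_t
    return slope, intercept

def tof_to_mz(t, slope, intercept):
    """Convert a flight time to m/z using the fitted calibration."""
    return (slope * t + intercept) ** 2

# Made-up calibrants: flight times (arbitrary units) and their known m/z values.
slope, intercept = fit_tof_calibration([10.0, 20.0, 30.0], [441.0, 1681.0, 3721.0])
print(tof_to_mz(25.0, slope, intercept))
```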

[Figure: Baseline subtraction in three panels (against m/z, 0 to 50000): raw intensity; intensity with fitted baseline; baseline-corrected intensity]

Normalisation

• Aim is to remove technical sources of variation (machine performance, sample amount, etc)
• Normalisation by the (local or global) total ion current is often applied
• Concerns have been raised that this may remove relevant biological information as well as the unwanted experimental variability
• Optimal methods are an area of ongoing research
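A minimal sketch of global total ion current (TIC) normalisation: each intensity is divided by the sum of intensities in its spectrum. The function name and the `target` parameter are assumptions for the example; a local version would apply the same idea within m/z windows.

```python
def normalise_tic(intensities, target=1.0):
    """Rescale one spectrum so that its total ion current equals `target`."""
    total = sum(intensities)
    if total <= 0:
        raise ValueError("total ion current must be positive")
    return [target * y / total for y in intensities]

# Two made-up spectra with the same shape but different overall signal.
print(normalise_tic([1.0, 1.0, 2.0]))
print(normalise_tic([10.0, 10.0, 20.0]))
```

After normalisation the two spectra are identical, which is the intended effect when the difference reflects sample amount, and the concern when it reflects biology.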

Peak alignment and detection

• There is imprecision on both the x (m/z) and y (abundance) axes
• Peaks in any one spectrum may be defined as points that are maximal within +/- N points on either side
• Need to also consider signal-to-noise ratio: peaks must exceed the typical local intensity values by a certain ratio
• How can peaks be aligned across spectra allowing for variability in x?
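The local-maximum definition with a signal-to-noise check can be sketched as below. Using the window median as the "typical local intensity" is an assumption made for this example, as are the function and parameter names.

```python
from statistics import median

def detect_peaks(intensities, half_window, snr=3.0):
    """Indices that are the unique maximum within +/- half_window points
    and exceed snr times the median intensity of that window."""
    peaks = []
    for i in range(half_window, len(intensities) - half_window):
        window = intensities[i - half_window:i + half_window + 1]
        y = intensities[i]
        if y < max(window) or window.count(y) > 1:
            continue                      # not a strict local maximum
        if y > snr * median(window):      # crude signal-to-noise criterion
            peaks.append(i)
    return peaks

spectrum = [1, 1, 1, 1, 10, 1, 1, 1, 1, 2, 1, 1]
print(detect_peaks(spectrum, half_window=2, snr=3.0))
print(detect_peaks(spectrum, half_window=2, snr=1.5))
```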

Numerous methods of peak detection, none perfect

• Morris et al (2005): use the mean of all spectra
  - a peak denoting a molecular feature should stand out against noise
  - circumvents the problem of alignment
• Tibshirani et al (2004): apply a one-dimensional clustering algorithm to the m/z values of peaks across spectra
  - Tight clusters should represent the same biological peak, possibly shifted slightly along the x-axis
  - Use the centroid of each cluster as the consensus m/z value
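To illustrate the clustering idea, a minimal sketch that groups sorted peak positions whenever neighbouring values lie within a gap threshold and reports each group's centroid. This gap rule is a simple stand-in chosen for the example, not the clustering algorithm used by Tibshirani et al; `max_gap` is a hypothetical tuning parameter.

```python
def cluster_peak_positions(mz_values, max_gap):
    """Group peak m/z values pooled across spectra and return cluster centroids."""
    clusters = []
    for mz in sorted(mz_values):
        if clusters and mz - clusters[-1][-1] <= max_gap:
            clusters[-1].append(mz)       # close to the previous peak: same cluster
        else:
            clusters.append([mz])         # otherwise start a new cluster
    return [sum(c) / len(c) for c in clusters]

# Made-up peak positions detected in several spectra.
pooled = [3000.0, 3002.0, 3001.0, 3500.0, 3502.0]
print(cluster_peak_positions(pooled, max_gap=5.0))
```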

Peak detection: in-house method

• Smooth each spectrum using progressively wider moving averages
• Identify peaks of size n in each of these smoothed spectra (local maxima with at least n lower points at each side)
• Create a frequency distribution
• Identify peak clusters
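The first three steps above might look like the following sketch. It assumes all spectra share the same m/z grid, so that positions can be counted by index, and it reads "peak of size n" as a point strictly higher than the n points on each side. The function names and window widths are illustrative, not the in-house implementation.

```python
from collections import Counter

def moving_average(y, width):
    """Centred moving average; windows are truncated at the ends of the spectrum."""
    half = width // 2
    out = []
    for i in range(len(y)):
        lo, hi = max(0, i - half), min(len(y), i + half + 1)
        out.append(sum(y[lo:hi]) / (hi - lo))
    return out

def peaks_of_size(y, n):
    """Indices strictly higher than each of the n points on either side."""
    return [i for i in range(n, len(y) - n)
            if all(y[i] > y[j] for j in range(i - n, i + n + 1) if j != i)]

def peak_frequency(spectra, widths, n):
    """Count how often each index is a peak, over all spectra and smoothing widths."""
    counts = Counter()
    for y in spectra:
        for width in widths:
            counts.update(peaks_of_size(moving_average(y, width), n))
    return counts

spectra = [[0, 1, 2, 5, 2, 1, 0], [0, 1, 3, 6, 3, 1, 0]]
print(peak_frequency(spectra, widths=[1, 3], n=2))
```

Positions that are detected repeatedly across spectra and smoothing widths build up high counts, which is what the frequency distribution is then used to find.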

[Figure: Smoothing a spectrum]

[Figure: Smoothed spectra]


[Figure: Frequency distribution of peaks (frequency against m/z, 2000 to 10000)]

[Figure: Identifying peak clusters (frequency against m/z, 3000 to 4000, three panels)]

Scoring peaks

• Each spectrum is compared to the resulting peak list to determine presence/absence of each peak
• If a peak is present then the largest abundance measurement within a certain tolerance is recorded as the peak abundance
• Not all peaks can be detected and it is a subject of research to consider how to treat these (zeroes, missing, censored, . . .)
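A minimal sketch of the scoring step, under the assumption that a consensus peak counts as present when the spectrum has one of its own detected peaks within the tolerance. Undetected peaks are returned as `None` here, which is only one of the possible treatments listed above; all names are hypothetical.

```python
def score_peaks(mz, intensity, detected_mz, consensus_mz, tolerance):
    """Map each consensus peak to its abundance in one spectrum, or None if absent."""
    scores = {}
    for target in consensus_mz:
        present = any(abs(p - target) <= tolerance for p in detected_mz)
        if present:
            # Largest abundance within the tolerance window around the consensus m/z.
            scores[target] = max(
                (y for x, y in zip(mz, intensity) if abs(x - target) <= tolerance),
                default=None,
            )
        else:
            scores[target] = None
    return scores

# Made-up spectrum with one detected peak at m/z 101.
mz = [100, 101, 102, 103, 200, 201]
intensity = [1, 5, 2, 1, 1, 1]
print(score_peaks(mz, intensity, detected_mz=[101],
                  consensus_mz=[101.5, 200.5], tolerance=1.0))
```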

[Figure: Extracting common peaks from spectra (abundance against m/z (Da), shown for the ranges 4220 to 4320 and 9000 to 9500)]

Peak profile data

• After all this processing, the raw data are reduced to about 300 to 1000 fixed peaks
• For each sample the data consist of an abundance measure at each of these peaks
• Aims
  - Identify peaks that differ between groups
  - Develop a classification scheme which can be applied to new samples

Exploratory data analysis

• Proteomics data are multivariate: measurements on numerous inter-correlated peaks
• Some exploratory data analysis is useful to explore gross differences between groups, e.g. compare mean spectra
• Can carry out principal components analysis to investigate the pattern of variation more fully
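For readers who want to see what principal components analysis computes, a small sketch that finds the first principal component by power iteration on the sample covariance matrix. In practice a statistics package would be used; the function name and the toy data are assumptions for illustration.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (loadings, scores) for PC1 of a samples-by-peaks table."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(p)]
           for a in range(p)]
    # Power iteration; the all-ones start fails if it is orthogonal to PC1.
    v = [1.0] * p
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centred]
    return v, scores

# Toy data: two perfectly correlated "peaks" measured on four samples.
loadings, scores = first_principal_component([[0, 0], [1, 2], [2, 4], [3, 6]])
print(loadings, scores)
```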

Renal transplant study

• Set of 50 renal transplant recipients, 30 with clinically defined stable condition and 20 with unstable condition
• In addition 30 healthy controls and 30 patients with renal failure (prior to dialysis), with similar age and gender distribution in each group
• Identical sample handling and processing protocol used
• Samples randomised to chips and all analysed in duplicate
• 3 chip types (H50, IMAC and CM10)

[Figure: Principal components plot (PC1 against PC2) titled "RT imac30 med", with samples plotted as the letters S and U]

Which peaks differ between patients and controls?

• Simple test to compare group means for each peak
• To allow for duplicated samples we use a linear random effects model
• For person i, replicate k, the intensity is modelled by

  $y_{ik} = \alpha + \beta x_i + \eta_i + \varepsilon_{ik}$

  where $x_i$ is an indicator (1/0) of case/control status, $\eta_i \sim N(0, \sigma_u^2)$ and $\varepsilon_{ik} \sim N(0, \sigma_e^2)$
• Covariates can easily be included
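As a simplified stand-in for fitting the random effects model, the sketch below averages the replicates for each person and compares the person means between groups with a pooled two-sample t statistic. With the same number of replicates per person this estimates the same group difference β, but it is an assumption-laden shortcut rather than the analysis used in the talk, and it cannot include covariates. All names and data are made up.

```python
import math
from statistics import mean, variance

def compare_groups(replicates_by_person, is_case):
    """Return (difference in means, t statistic) based on per-person means."""
    person_means = {p: mean(v) for p, v in replicates_by_person.items()}
    cases = [m for p, m in person_means.items() if is_case[p]]
    controls = [m for p, m in person_means.items() if not is_case[p]]
    diff = mean(cases) - mean(controls)        # estimate of beta
    df = len(cases) + len(controls) - 2
    pooled = ((len(cases) - 1) * variance(cases)
              + (len(controls) - 1) * variance(controls)) / df
    se = math.sqrt(pooled * (1 / len(cases) + 1 / len(controls)))
    return diff, diff / se

# Duplicate intensities for one peak in four hypothetical people.
data = {"a": [1, 3], "b": [3, 5], "c": [0, 0], "d": [1, 3]}
status = {"a": True, "b": True, "c": False, "d": False}
print(compare_groups(data, status))
```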

Sample size calculations

• Sample size depends on the intra-sample ($\sigma^2$) and inter-sample ($\tau^2$) variance, the number ($m$) of technical replicates and the difference in means to be detected ($\delta$)
• For a comparison of 2 groups, for a particular peak, the number of samples $n$ required in each group is

  $n = 2\left[\frac{z_{\alpha/2} + z_{\beta}}{\delta}\right]^2 \left(\tau^2 + \frac{\sigma^2}{m}\right)$

  where $\alpha$ is the significance level and $1-\beta$ the power
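A sketch of the sample size calculation, assuming the standard two-group formula n = 2 [(z_{α/2} + z_β)/δ]² (τ² + σ²/m), with z denoting upper quantiles of the standard normal distribution. The function name and the example variances are hypothetical.

```python
import math
from statistics import NormalDist

def samples_per_group(delta, sigma2, tau2, m, alpha=0.05, power=0.8):
    """Samples per group to detect a mean difference delta (assumed formula)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)            # power = 1 - beta
    n = 2 * ((z_alpha + z_beta) / delta) ** 2 * (tau2 + sigma2 / m)
    return math.ceil(n)

# Made-up variances: the effect of running duplicates instead of single spectra.
print(samples_per_group(delta=1.0, sigma2=1.0, tau2=1.0, m=1))
print(samples_per_group(delta=1.0, sigma2=1.0, tau2=1.0, m=2))
```

Replication only reduces the intra-sample term, so the gain from extra technical replicates levels off once τ² dominates.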

Estimates of variance

• Sample size requirements depend heavily on the intra-sample and inter-sample variance ($\sigma^2$ and $\tau^2$) of the abundance measures for the peak
• $\sigma^2$ and $\tau^2$ can be estimated from pilot data
• Since there are multiple peaks, to get one estimate we may use the median, 90th percentile or maximum variance across peaks
• Generally the maximum is too conservative, and the median will mean power may be low for 50% of peaks

[Figure: Effect of variance estimate on sample size. Cairns et al, Proteomics, 2009]

Multiple testing

• Several hundred peaks will typically be tested. At the discovery stage the aim is not to demonstrate a difference between groups with certainty, but to ensure that a high proportion of "significant" results are true positives
• More useful to think of false discovery rate (FDR) than false positive rate (FPR)

  FPR = E[(# false positives)/(# true null)] = α

  FDR = E[(# false positives)/(# significant)]

False discovery rate

Let $P$ be the total number of peaks and $\pi$ the proportion of peaks that are truly differentially expressed by a particular amount. Then the FDR can be estimated by

  $\mathrm{FDR} = \frac{\alpha(1-\pi)}{\alpha(1-\pi) + (1-\beta)\pi}$

This can be used to inform the choice of $\alpha$.

[Figure: FDR as a function of α, β and π]
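The FDR estimate can be tabulated for candidate significance levels with a few lines of code. The sketch takes the estimate to be α(1 − π) divided by α(1 − π) + (1 − β)π; the values of β and π used in the example are made up.

```python
def estimated_fdr(alpha, beta, pi):
    """Expected false positives over expected significant results, per peak tested."""
    false_positives = alpha * (1 - pi)      # true nulls that reach significance
    true_positives = (1 - beta) * pi        # real differences that are detected
    return false_positives / (false_positives + true_positives)

# Hypothetical setting: 80% power and 10% of peaks truly different.
for alpha in (0.05, 0.01, 0.001):
    print(alpha, round(estimated_fdr(alpha, beta=0.2, pi=0.1), 3))
```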

Improving the statistical model

Various improvements have been proposed to the statistical analysis process (e.g. Oberg et al, 2008)

• Normalisation could be carried out in the same step as testing and estimating differences between groups
• In labelled experiments a structure can be imposed on the peaks (peptides within proteins, etc.)

Classification of samples

• There are many available classification methods
• Classification trees
  - Use peak information to successively split the sample until all subjects at the same node are in the same class or a minimum node size is reached
• Key issue is to avoid over-fitting

Random Forest algorithm

• The Random Forest (RF) algorithm is a classifier based on a group of classification trees
• In each tree, samples are successively split using the classifier that "best" discriminates between classes. Usually this is based on the Gini index, a measure of impurity of the nodes
• Trees can be grown until each node consists entirely of samples from one class (pure node). Alternatively a minimum node size can be set as a stopping rule
• Idea of RF is to grow many trees and classify on the basis of the majority vote (Breiman, 2001)
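To make the Gini index concrete, a short sketch that computes node impurity and the weighted impurity of a candidate split on one peak. The function names, the threshold and the toy data are assumptions for illustration; a tree-growing routine would evaluate many such splits and keep the one with the lowest impurity.

```python
from collections import Counter

def gini_impurity(labels):
    """1 minus the sum of squared class proportions; 0 for a pure or empty node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(values, labels, threshold):
    """Size-weighted Gini impurity of the two child nodes of a split."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)

# Made-up abundances for one peak in two cases and two controls.
abundance = [1.0, 2.0, 3.0, 4.0]
status = ["control", "control", "case", "case"]
print(gini_impurity(status))
print(split_impurity(abundance, status, threshold=2.5))
```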

Avoiding over-fitting

• How do we make the trees differ and be less dependent on the particular data set?
  - Each tree is based on a bootstrap sample of the original data set
  - At each partition only a random subset of the variables is available to choose from to make the best split
• Each tree is used to classify items NOT in the bootstrap sample (out-of-bag (OOB))
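The two sources of randomness can be sketched as follows: a bootstrap sample with its out-of-bag complement, and a random subset of candidate variables for one split. The function names are hypothetical, and the choice of 19 candidates out of 365 variables (roughly the square root) is an assumption for the example rather than a setting reported in the talk.

```python
import random

def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; return (in-bag, out-of-bag)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag

def candidate_variables(n_variables, n_try, rng):
    """Random subset of variables offered to a single split."""
    return rng.sample(range(n_variables), n_try)

rng = random.Random(0)
in_bag, out_of_bag = bootstrap_split(153, rng)
print(len(set(in_bag)), "distinct samples in bag;", len(out_of_bag), "out of bag")
print(candidate_variables(365, 19, rng))
```

Because each sample is out-of-bag for some of the trees, those trees' votes give a performance estimate that does not reuse the data the trees were grown on.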

Breast cancer proteomics data

• Several groups met in Leiden, Netherlands, in 2007 to compare methods of classifying samples from breast cancer patients and controls on the basis of proteomics data
• Data set:
  - Mass spectrometry (MS) proteomic profiles from 76 breast cancer cases and 77 controls formed the initial data set
  - Profiles from an additional 39 cases and 39 controls were used as validation data
• MS profiles:
  - The data were available after some pre-processing
  - Abundance measurements for 11,205 points corresponding to mass-to-charge (m/z) ratios along the spectrum

Mertens, SAGMB, 2008

[Figure: Mean spectra from controls and cases (intensity against m/z, 2000 to 10000; one panel for controls, one for cases)]

Two-stage analysis strategy

• Peak detection: reduce the 11,205 data points to 365 "robust" peaks
  - Using all 153 samples, 444 peaks were detected
  - The robust peak detection method reduced this to 365 peaks which appear in 7 out of 10 subsets selected
• Random Forest: apply RF to the 365 peaks, using 20,000 trees in the forest, and classify according to majority vote
• We have also compared this to a one-stage strategy: RF applied to all 11,205 data points

Classification results

Confusion matrix of true versus predicted case-control status for the classification (a) based on all data points and (b) based on peaks

(a) All data points
Predicted classification    True case    True control
Case                        62           13
Control                     14           64

(b) Peaks
Predicted classification    True case    True control
Case                        62           11
Control                     14           66
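The percentages quoted for the initial data set can be recomputed from the confusion matrix counts. The helper below is a hypothetical sketch; sensitivity and specificity are added for illustration and are not reported in the talk.

```python
def summarise(tp, fp, fn, tn):
    """Accuracy, sensitivity and specificity from a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),    # cases correctly predicted as cases
        "specificity": tn / (tn + fp),    # controls correctly predicted as controls
    }

# Counts from the confusion matrices: (a) all data points, (b) peaks.
all_points = summarise(tp=62, fp=13, fn=14, tn=64)
peaks = summarise(tp=62, fp=11, fn=14, tn=66)
print(round(all_points["accuracy"], 3), round(peaks["accuracy"], 3))
```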

[Figure: Scatter plot of case and control samples showing votes from the two applications of the RF algorithm (vote from RF based on all data against vote from RF based on peaks)]

Classification: calibration versus validation results

Of the groups taking part in the original project, our analysis (using RF applied to peaks) had the joint lowest percentage correct from the calibration data (84%) but the highest from the validation data (83% correct)

[Figure: Percent correct for the training and validation data sets]

Barrett & Cairns, SAGMB, 2008; Hand, SAGMB, 2008

. . . including results using raw data points

Rerunning with RF applied to raw data points, results from both calibration and validation data sets are very similar (82% and 85% correct)

[Figure: Percent correct for the training and validation data sets, including RF applied to raw data points]

Barrett & Cairns, SAGMB, 2008; Hand, SAGMB, 2008

Classification: summary

• RF has good performance as a classifier in terms of percentage correctly classified, certainly comparable to other methods
• One especially good property is unbiased estimation of performance from the initial data set
• Use of peaks has little effect on classification accuracy compared with raw (correlated) data
• Parsimony principle would favour use of summary measures where good ones can be found

Variable importance measures

• Can also see which variables have been important in classification; emphasis is on ranking of variables rather than absolute measures
• Most common VIMs are the Gini index (based on impurity of nodes) and permutation-based VIM
• Have been used in numerous studies in the analysis of genetic, gene expression and proteomic data
• We have investigated VIMs but so far not found them to outperform analysis of each peak separately

General conclusions

• Careful attention to study design is essential to avoid bias and improve efficiency
• Proper understanding of (sources of) variation is needed, so that these can be minimized or removed as far as possible
• Complex analyses should be preceded by more simple descriptive analyses

Acknowledgements

Section of Epidemiology and Biostatistics and Clinical Proteomics Research Group, Leeds Institute of Molecular Medicine, University of Leeds

Especially:
David Cairns, MRC Research fellow in Statistics
Roz Banks, Professor of Biomedical proteomics
Dave Perkins, Principal research scientist in bioinformatics
