analysis of mass spectrometry protein data · proteomics, what and why? • mass spectrometry •...
TRANSCRIPT
Analysis of mass spectrometry protein data
Jenny BarrettSection of Epidemiology and Biostatistics
Leeds Institute of Molecular MedicineUniversity of Leeds, UK
Outline•
Introduction•
Proteomics, what and why?
•
Mass spectrometry
•
Statistical issues•
Experimental design
•
Preprocessing•
calibration, baseline subtraction, normalisation
•
Peak detection and alignment•
Multivariate exploratory data analysis
•
Statistical models to detect differences between groups•
Classification of samples
•
Concluding remarks
What is proteomics?
Proteins are: cell-specifictime-specific
Protein –
chain of amino acids in 3D structure which carry out biological functions, made from DNA via mRNA
Proteome – the total protein complement expressed by a cell
Proteomics
–
the study of proteomes
DNA ProteinRNA mRNA
Transcription Processing Translation
Transcriptionalcontrol
Post- transcriptionalControl e.g.alternative splicing
Translational controls
Activity controls
Epigenetic factorsFeedback loops
Why Proteomics?
Closer to the disease process
Proteomics technology•
Mass spectrometry (MS) can be used to characterise the proteome when applied to the appropriate biological sample (plasma, serum, . . .)
•
Many experimental pre-processing steps applied prior to MS•
depletion of abundant proteins, fractionation, digestion, separation . . .
•
Various types of MS analyzer –
our focus is on MS based on time-of-flight
Mass spectrometry
•
Molecules (including peptides and proteins) have different masses
•
MS separates ionised molecules by their mass-to- charge (m/z) ratio
•
Time-of-flight (TOF), which can be measured, differs according to m/z
-
can then be converted to
m/z
scale
•
Typical range considered 2000 to 50000 Daltons (<2000 Da
too much noise)
YYLaserLaser
(+)(+) (+)(+)(+)(+) (+)(+)(+)(+)(+)(+)(+)(+)
(+)(+)(+)(+)(+)(+)(+)(+)(+)
+ - Accelerating PotentialAccelerating Potential
Mass Spectrometer
S am ple Ch ip
G r id s D etector
A cce leration phase
Example MS analysers•
MALDI
(Matrix Assisted Laser Desorption-
Ionization)•
Apply sample to chip
•
Uses energy-absorbing matrix which facilitates ionisation without much fragmentation
•
SELDI (Surface Enhanced . . .)•
Like MALDI but chips are pre-coated with special surface
•
Different proteins bind to surface depending on chip type
•
iTRAQ
(Isobaric Tags for Relative and Absolute Quantification)•
Samples are chemically labelled
Scope of this talk
•
Not talking about experimental
pre-processing
•
MS of the proteome results in a series of data points (x, y) where x is the m/z
value and y is a
measure of abundance
•
Our focus is on the analysis of such data in a “label- free”
context –
we will not discuss the important
step of identifying the peptides and proteins
020
4060
Inte
nsity
0 5000 10000 15000 20000Mass to charge ratio (m/z)
020
4060
Inte
nsity
0 5000 10000 15000 20000Mass to charge ratio (m/z)
Spectra
from two duplicate samples
020
4060
Inte
nsity
0 5000 10000 15000 20000Mass to charge ratio (m/z)
020
4060
Inte
nsity
0 5000 10000 15000 20000Mass to charge ratio (m/z)
Spectra
from two distinct samples
Experimental Design•
Quantitative measures of the proteome are influenced by many sources of variation –
experimental and inter-individual
•
Careful experimental design is crucial if false inferences are to be avoided, at all stages, including sample selection and sample handling
•
Numerous studies have demonstrated the importance of sample handling
2500 5000 7500 10000
2500 5000 7500 10000
0
25
50
75
100
2500 5000 7500 10000
0
25
50
75
100
2500 5000 7500 10000
0
25
50
75
100
2500 5000 7500 10000
0
25
50
75
100
2500 5000 7500 10000
2500 5000 7500 10000
2500 5000 7500 10000
0
25
50
75
100
2500 5000 7500 10000
0
25
50
75
100
2500 5000 7500 10000
0
25
50
75
100
2500 5000 7500 10000
0
25
50
75
100
2500 5000 7500 10000
2500 5000 7500 10000
2500 5000 7500 10000
0
25
50
75
100
2500 5000 7500 10000
0
25
50
75
100
2500 5000 7500 10000
0
25
50
75
100
2500 5000 7500 10000
0
25
50
75
100
2500 5000 7500 10000
Citrate
EDTA
Heparin
Serum
CM10 IMAC-Cu
-
H50
m/z m/z m/z
Inte
nsity
Peak profiles from 3 different SELDI chip types Sample processed in 4 different waysBanks et al, Clinical Chemistry 2005
050
100
150
CM10LOW
NV
Seru
m 2
40N
V Se
rum
120
NV
Seru
m 2
40N
V Se
rum
60
NV
Seru
m 1
20N
V Se
rum
0N
V Se
rum
60
SH S
erum
30
SH S
erum
0D
T Se
rum
0SH
Ser
um 2
40SH
Ser
um 2
40SH
Ser
um 3
0SH
Ser
um 1
20SH
Ser
um 6
0N
V Se
rum
30
NV
Seru
m 3
0D
T Se
rum
120
DT
Seru
m 3
0D
T Se
rum
120
DT
Seru
m 6
0D
T Se
rum
30
DT
Seru
m 0
DT
Seru
m 2
40D
T Se
rum
240
DT
Seru
m 6
0SH
Ser
um 1
20SH
Ser
um 6
0SH
Ser
um 0
NV
Seru
m 0
DT
Hep
arin
60
DT
Hep
arin
240
DT
Hep
arin
0D
T H
epar
in 1
20D
T H
epar
in 6
0D
T H
epar
in 2
40D
T H
epar
in 3
0D
T H
epar
in 1
20D
T H
epar
in 3
0D
T H
epar
in 0
NV
Hep
arin
30
NV
Hep
arin
240
NV
Hep
arin
120
NV
Hep
arin
0N
V H
epar
in 6
0N
V H
epar
in 2
40N
V H
epar
in 0
NV
Hep
arin
120
NV
Hep
arin
30
NV
Hep
arin
60
SH H
epar
in 3
0SH
Hep
arin
60
SH H
epar
in 0
SH H
epar
in 2
40SH
Hep
arin
30
SH H
epar
in 2
40SH
Hep
arin
60
SH H
epar
in 1
20SH
Hep
arin
120
DT
Citr
ate
120
DT
Citr
ate
0D
T C
itrat
e 0
DT
Citr
ate
30D
T C
itrat
e 30
DT
Citr
ate
240
DT
Citr
ate
60D
T C
itrat
e 60
DT
Citr
ate
240
DT
Citr
ate
120
SH C
itrat
e 30
NV
Citr
ate
60SH
Citr
ate
0SH
Citr
ate
120
SH C
itrat
e 60
SH C
itrat
e 24
0SH
Citr
ate
30SH
Citr
ate
240
SH C
itrat
e 12
0N
V C
itrat
e 60
SH C
itrat
e 0
SH C
itrat
e 60
NV
Citr
ate
30N
V C
itrat
e 24
0N
V C
itrat
e 12
0N
V C
itrat
e 24
0 17
NV
Citr
ate
30N
V C
itrat
e 12
0N
V C
itrat
e 0
NV
Citr
ate
0N
V ED
TA 6
0N
V ED
TA 2
40N
V ED
TA 1
20N
V ED
TA 0
NV
EDTA
30
NV
EDTA
0N
V ED
TA 3
0D
T ED
TA 0
DT
EDTA
30
DT
EDTA
120
DT
EDTA
240
DT
EDTA
60
DT
EDTA
240
DT
EDTA
120
DT
EDTA
60
DT
EDTA
30
DT
EDTA
0SH
ED
TA 3
0SH
ED
TA 6
0SH
ED
TA 0
NV
EDTA
240
SH E
DTA
240
NV
EDTA
120
NV
EDTA
60
SH E
DTA
60
SH E
DTA
240
SH E
DTA
30
SH E
DTA
120
SH E
DTA
120
SH E
DTA
0 050
100
150
200
250
H50LOW
HC
Citr
ate
120
185
HC
ED
TA 2
40 1
85H
C E
DTA
0 1
85H
C C
itrat
e 0
185
HC
Hep
arin
120
185
HC
ED
TA 6
0 18
5H
C S
erum
30
185
SH E
DTA
30
185
SH C
itrat
e 60
185
DT
Seru
m 0
185
DT
Seru
m 2
40 1
85N
V ED
TA 2
40 1
85N
V Se
rum
30
185
NV
Citr
ate
60 1
85N
V ED
TA 0
185
NV
Citr
ate
240
185
NV
Hep
arin
0 1
85N
V H
epar
in 1
20 1
85H
C E
DTA
30
185
HC
Ser
um 2
40 1
85N
V Se
rum
240
185
NV
EDTA
120
185
DT
Seru
m 1
20 1
85H
C S
erum
0 1
85H
C H
epar
in 2
40 1
85D
T E
DTA
240
185
HC
Citr
ate
60 1
85H
C H
epar
in 0
185
HC
Citr
ate
240
185
HC
Ser
um 6
0 18
5SH
Ser
um 1
20 1
85SH
Citr
ate
0 18
5SH
Hep
arin
30
185
SH C
itrat
e 30
185
SH C
itrat
e 60
185
SH H
epar
in 0
185
SH H
epar
in 1
20 1
85SH
Citr
ate
240
185
SH C
itrat
e 30
185
DT
Citr
ate
240
185
DT
Seru
m 2
40 1
85SH
ED
TA 1
20 1
85SH
Hep
arin
60
185
SH S
erum
120
185
SH S
erum
240
185
SH H
epar
in 3
0 18
5SH
ED
TA 2
40 1
85SH
ED
TA 3
0 18
5H
C H
epar
in 3
0 18
5H
C H
epar
in 6
0 18
5N
V Se
rum
240
185
HC
Ser
um 1
20 1
85H
C E
DTA
240
185
SH E
DTA
60
185
SH H
epar
in 2
40 1
85SH
ED
TA 0
185
SH C
itrat
e 12
0 18
5SH
Ser
um 6
0 18
5SH
Citr
ate
240
185
SH H
epar
in 6
0 18
5SH
Ser
um 3
0 18
5SH
Ser
um 0
185
SH E
DTA
240
185
SH C
itrat
e 0
185
SH E
DTA
60
185
SH S
erum
240
185
SH E
DTA
0 1
85SH
ED
TA 1
20 1
85SH
Ser
um 6
0 18
5SH
Ser
um 3
0 18
5SH
Hep
arin
120
185
SH C
itrat
e 12
0 18
5SH
Ser
um 0
185
DT
Seru
m 3
0 18
5D
T Se
rum
120
185
DT
Citr
ate
30 1
85D
T ED
TA 0
185
DT
Citr
ate
0 18
5D
T H
epar
in 3
0 18
5N
V H
epar
in 2
40 1
85N
V H
epar
in 6
0 18
5N
V Se
rum
120
185
DT
Seru
m 6
0 18
5D
T E
DTA
120
185
DT
Hep
arin
240
185
NV
Citr
ate
30 1
85N
V ED
TA 3
0 18
5N
V C
itrat
e 0
185
NV
EDTA
0 1
85N
V ED
TA 2
40 1
85N
V C
itrat
e 0
185
NV
Seru
m 6
0 18
5N
V Se
rum
60
185
NV
Hep
arin
240
185
NV
Seru
m 1
20 1
85N
V H
epar
in 0
185
NV
EDTA
60
185
NV
Hep
arin
30
185
NV
Seru
m 0
185
NV
Citr
ate
30 1
85SH
Hep
arin
240
185
NV
Citr
ate
60 1
85D
T C
itrat
e 60
185
DT
Citr
ate
120
185
DT
EDTA
0 1
85D
T H
epar
in 1
20 1
85D
T E
DTA
240
185
DT
EDTA
60
185
DT
EDTA
30
185
DT
Seru
m 3
0 18
5N
V C
itrat
e 12
0 18
5N
V H
epar
in 1
20 1
85N
V Se
rum
0 1
85D
T H
epar
in 0
185
NV
Hep
arin
30
185
NV
EDTA
30
185
NV
EDTA
120
185
NV
Hep
arin
60
185
NV
Citr
ate
120
185
NV
Seru
m 3
0 18
5N
V C
itrat
e 24
0 18
5N
V ED
TA 6
0 18
5D
T C
itrat
e 0
185
DT
Hep
arin
30
185
DT
EDTA
120
185
DT
Citr
ate
60 1
85D
T H
epar
in 0
185
DT
Seru
m 6
0 18
5D
T H
epar
in 2
40 1
85D
T C
itrat
e 30
185
DT
Hep
arin
60
1850
5010
015
020
025
030
035
0
IMAC30LOW
NV
Seru
m 1
20N
V Se
rum
60
SH S
erum
30
NV
Seru
m 3
0D
T Se
rum
240
DT
Seru
m 3
0D
T Se
rum
0
DT
Seru
m 6
0D
T Se
rum
60
DT
Seru
m 0
SH
Ser
um 0
NV
Seru
m 0
SH S
erum
0N
V Se
rum
0N
V Se
rum
240
SH S
erum
240
SH S
erum
240
SH S
erum
30
SH S
erum
120
SH S
erum
120
SH S
erum
60
SH S
erum
60
NV
Seru
m 6
0N
V Se
rum
120
NV
Seru
m 3
0D
T Se
rum
240
NV
Seru
m 2
40D
T Se
rum
30
DT
Seru
m 1
20D
T Se
rum
120
SH H
epar
in 2
40SH
Hep
arin
30
SH E
DTA
120
SH E
DTA
0SH
ED
TA 2
40N
V ED
TA 0
SH E
DTA
60
SH E
DTA
60
SH E
DTA
120
SH E
DTA
30
NV
Hep
arin
60
NV
Citr
ate
60
NV
Citr
ate
240
NV
Hep
arin
0SH
Hep
arin
120
SH H
epar
in 6
0 SH
Hep
arin
0SH
ED
TA 3
0SH
ED
TA 2
40N
V ED
TA 1
20
NV
EDTA
120
NV
EDTA
30
NV
EDTA
30
NV
EDTA
0
NV
EDTA
60
NV
EDTA
60
NV
Hep
arin
240
NV
EDTA
240
NV
EDTA
240
DT
Hep
arin
0D
T C
itrat
e 60
D
T H
epar
in 0
NV
Citr
ate
30
SH C
itrat
e 60
N
V C
itrat
e 0
DT
EDTA
120
DT
ED
TA 3
0 D
T ED
TA 0
DT
ED
TA 0
D
T H
epar
in 3
0D
T E
DTA
60
DT
EDTA
120
DT
ED
TA 3
0 D
T E
DTA
60
DT
Citr
ate
0D
T H
epar
in 2
40D
T H
epar
in 3
0D
T H
epar
in 1
20D
T C
itrat
e 24
0 D
T C
itrat
e 0
DT
EDTA
240
DT
Citr
ate
120
DT
Citr
ate
120
SH C
itrat
e 12
0SH
Citr
ate
240
NV
Citr
ate
60
SH C
itrat
e 60
SH
Citr
ate
0D
T C
itrat
e 60
DT
Citr
ate
30
DT
Hep
arin
60
DT
Citr
ate
240
DT
EDTA
240
DT
Hep
arin
240
DT
Hep
arin
60
SH C
itrat
e 30
SH
Citr
ate
30
NV
Citr
ate
120
NV
Hep
arin
120
NV
Citr
ate
120
NV
Hep
arin
30
NV
Citr
ate
30
NV
Citr
ate
0D
T H
epar
in 1
20
SH H
epar
in 0
SH H
epar
in 1
20N
V H
epar
in 2
40N
V H
epar
in 1
20N
V H
epar
in 3
0 N
V C
itrat
e 24
0N
V H
epar
in 0
NV
Hep
arin
60
DT
Citr
ate
30
SH H
epar
in 3
0 SH
Citr
ate
0SH
Hep
arin
240
SH C
itrat
e 12
0SH
Citr
ate
240
SH H
epar
in 6
0
CM10 IMAC-Cu H50
Cluster analysis of 119 samples based on peak profilesSerum, Citrate plasma, EDTA plasma, Heparin plasma
Essential design principles•
To avoid confounding, at sample selection stage groups to be compared should be matched wherever possible for factors such as age and sex
•
To avoid bias, samples from different groups must not be treated differently (e.g. in time to processing, method of processing).
•
Use randomisation wherever possible, e.g. in assignment of samples to chips
Designing efficient studies•
Attention to experimental design can also improve efficiency of studies
•
More sophisticated systems of randomization may be useful to remove known sources of variation (block designs)
•
Technical replicates can be used to reduce intra- individual variability
•
Sample size calculations should be carried out to maximize power within constraints of the study
Analysis –
pre-processing•
Calibration•
Time-of-flight converted to m/z
scale using a
number of calibrants
with known m/z
values
•
Baseline subtraction•
Remove baseline “noise”
from matrix, e.g. by
Loess smoothing
•
Normalisation•
Data normalised to create comparability between spectra
020
4060
8010
0in
tens
ity
0 10000 20000 30000 40000 50000mz
020
4060
8010
0
0 10000 20000 30000 40000 50000mz
intensity baseline
020
4060
8010
0ba
selin
e_co
rrec
ted
0 10000 20000 30000 40000 50000mz
Normalisation•
Aim is to remove technical sources of variation (machine performance, sample amount, etc)
•
Normalisation
by the (local or global) total ion current is often applied
•
Concerns have been raised that this may remove relevant biological information as well as the desired experimental variability
•
Optimal methods are an area of ongoing research
Peak alignment and detection•
There is imprecision on both the x (m/z) and y (abundance) axes
•
Peaks in any one spectrum may be defined as points that are maximal within +/-
N points on either side
•
Need to also consider signal-to-noise ratio -
peaks must exceed the typical local intensity values by a certain ratio
•
How can peaks be aligned across spectra allowing for variability in x?
Numerous methods of peak detection –
none perfect
•
Morris et al (2005): use the mean of all spectra•
a peak denoting a molecular feature should stand out against noise
•
circumvents problem of alignment
•
Tibshirani
et al (2004): apply one-dimensional clustering algorithm to m/z
values of peaks across
spectra•
Tight clusters should represent the same biological peak possibly shifted slightly along x-axis
•
Use centroid
of each cluster as consensus m/z
value
Peak detection –
in-house method
•
Smooth each spectrum using progressively wider moving averages
•
Identify peaks of size n in each of these smoothed spectra (local maxima with at least n lower points at each side)
•
Create a frequency distribution
•
Identify peak clusters
Smoothing a spectrum
Smoothed spectra
Smoothed spectra
Peak detection
•
Smooth each spectrum using progressively wider moving averages
•
Identify peaks of size n in each of these smoothed spectra (local maxima with at least n lower points at each side)
•
Create a frequency distribution
•
Identify peak clusters
Frequency distribution of peaks
2000 4000 6000 8000 10000
050
010
0015
0020
0025
0030
00
m/z
Freq
uenc
y
Identifying peak clusters
3000 3200 3400 3600 3800 4000
050
010
0015
0020
0025
0030
00
m/z
Freq
uenc
y
3000 3200 3400 3600 3800 4000
050
010
0015
0020
0025
0030
00
m/z
Freq
uenc
y
3000 3200 3400 3600 3800 4000
050
010
0015
0020
0025
0030
00
m/z
Freq
uenc
y
Scoring peaks
•
Each spectrum compared to resulting peak list to determine presence/absence of peak
•
If peak is present then largest abundance measurement within certain tolerance is recorded as peak abundance
•
Not all peaks can be detected and it is a subject of research to consider how to treat these (zeroes, missing, censored, . . .)
Extracting common peaks from spectra
4220 4240 4260 4280 4300 4320
0.60
0.65
0.70
0.75
0.80
m/z (Da)
Abun
danc
e
9000 9100 9200 9300 9400 9500
0.60
0.62
0.64
0.66
0.68
0.70
m/z (Da)
Abun
danc
e
4220 4240 4260 4280 4300 4320
0.60
0.65
0.70
0.75
0.80
m/z (Da)
Abun
danc
e
9000 9100 9200 9300 9400 9500
0.60
0.62
0.64
0.66
0.68
0.70
m/z (Da)
Abun
danc
e
Peak profile data
•
After all this processing, raw data is reduced to ~300 to 1000 fixed peaks
•
For each sample the data consist of an abundance measure at each of these peaks
•
Aims•
Identify peaks that differ between groups
•
Develop classification scheme which can be applied to new samples
Exploratory data analysis
•
Proteomics data is multivariate –
measurements on numerous inter-correlated peaks
•
Some exploratory data analysis is useful to explore gross differences between groups –
e.g. compare
mean spectra
•
Can carry out principal components analysis to investigate the pattern of variation more fully
Renal transplant study•
Set of 50 renal transplant recipients, 30 with clinically defined stable condition and 20 with unstable condition
• In addition 30 healthy controls and 30 patients
with renal failure (prior to dialysis), similar age and gender distribution in each group
•
Identical sample handling and processing protocol used
•
Samples randomised to chips and all analysed in duplicate
•
3 chip types (H50, IMAC and CM10)
-10 0 10 20 30 40
-20
-15
-10
-50
510
RT imac30 med
PC1
PC
2
S
S
SSU
U
U U
UU
SS
UU
U
U
S
S
S
S
U
U
S
S
U
U
S
SU
UU
U
SS
S S U
U
S
S
UUSSU
U
SSS S
UU
S
S
S
S
U
U
U
U
UU
S S
SSU
US
S
S
S SS
S S
S
S
SS S
S
U UU U
U
US
S
S
S
SS
SS
SSS
S
Which peaks differ between patients and controls?
Simple test to compare group means for each peak•
To allow for duplicated samples we use linear random effects model
•
For person i, replicate k, the intensity is modelled by
yik
= α
+ βxi
+ ηi
+ εik
where xi
is indicator (1/0) of case/control status, ηi
~ N(0, σu2), εik
~ N(0, σe2)
•
Covariates can easily be included
Sample size calculations
•
Sample size depends on intra-
(σ2) and inter-
(τ2) sample variance, number (m) of technical replicates, difference in means to be detected ( δ)
•
For a comparison of 2 groups, for a particular peak, the number of samples n required in each group is
where α
is the significance level and 1-β
the power
⎟⎟⎠
⎞⎜⎜⎝
⎛+= ⎥⎦
⎤⎢⎣
⎡ +m
mnzz στδ
βα2
2
2
2/2
Estimates of variance•
Sample size requirements depend heavily on intra-
and inter-sample variance (σ2
and τ2) of the abundance measures for the peak
•
σ2
and τ2
can be estimated from pilot data
•
Since there are multiple peaks, to get one estimate we may use the median, 90th
percentile or
maximum variance across peaks
•
Generally maximum is too conservative, and median will mean power may be low for 50% of peaks
Effect of variance estimate on sample size
Cairns et al, Proteomics, 2009
Multiple testing•
Several hundred peaks will typically be tested. At the discovery stage the aim is not to demonstrate a difference between groups with certainty, but to ensure that a high proportion of “significant”
results are true positives
•
More useful to think of false discovery rate (FDR) than false positive rate (FPR)
FPR = E[(# false positives)/(# true null)] = α
FDR = E[(# false positives)/(# significant)]
False discovery rate
Let P be the total number of peaks and π
the proportion of peaks that are truly differentially expressed by a particular amount, then the FDR can be estimated by:
This can be used to inform choice of α
( )( ){ }πβπα
πα)1(1
1−+−
−
FDR as a function of α, β
and π
Improving the statistical model
Various improvements have been proposed to the statistical analysis process (e.g. Oberg et al, 2008)
•
Normalisation could be carried out in the same step as testing and estimating differences between groups;
•
In labelled experiments a structure can be imposed on the peaks (peptides within proteins, etc.)
Classification of samples
•
There are many available classification methods
•
Classification trees•
Use peak information to successively split the sample until all subjects at the same node are in the same class or a minimum node size is reached
•
Key issue is to avoid over-fitting
Random Forest algorithm•
The Random Forest (RF) algorithm is a classifier based on a group of classification trees
•
In each tree, samples are successively split using the classifier that “best”
discriminates between classes.
Usually this is based on the Gini
index, a measure of impurity of the nodes
•
Trees can be grown until each node consists entirely of samples from one class (pure node). Alternatively a minimum node size can be set as stopping rule
•
Idea of RF is to grow many trees and classify on the basis of the majority vote (Breiman, 2001)
Avoiding over-fitting
•
How do we make the trees differ and be less dependent on the particular data set?
–
Each tree is based on a bootstrap sample of original data set
–
At each partition only a random subset of the variables is available to choose from to make the best split
•
Each tree used to classify items NOT in the bootstrap sample (out-of-bag (OOB))
Breast cancer proteomics data
•
Several groups met in Leiden, Netherlands, in 2007 to compare methods of classifying samples from breast cancer patients and controls on basis of proteomics data
•
Data set:–
Mass spectrometry (MS) proteomic profiles from 76 breast cancer cases and 77 controls formed initial data set
–
Profiles from an additional 39 cases and 39 controls used as validation data
•
MS profiles:–
The data were available after some pre-processing
–
Abundance measurements for 11,205 points corresponding to mass-to-charge (m/z) ratios along the spectrum
Mertens, SAGMB, 2008
Mean spectra from controls and cases
2000 4000 6000 8000 10000
0.60
0.70
0.80
mean spectra (controls)
m/z
Inte
nsity
2000 4000 6000 8000 10000
0.60
0.70
0.80
mean spectra (cases)
m/z
Inte
nsity
•
Peak detection: reduce the 11,205 data points to 365 “robust”
peaks
–
Using all 153 samples 444 peaks detected–
Robust peak detection method reduced this to 365 peaks which appear in 7 out of 10 subsets selected
•
Random Forest: apply RF to the 365 peaks, using 20,000 trees in the forest, and classify according to majority vote
•
We have also compared this to one-stage strategy: RF applied to all 11,205 data points
Two-stage analysis strategy
(a) Predictedclassification
True classification (b)Predictedclassification
True classification
Case Control Case Control
Case 62 13 Case 62 11
Control 14 64 Control 14 66
Classification results
Confusion matrix of true versus predicted case-control status for the classification (a) based on all data points and (b) based on peaks
Scatter plot of case and control samples showing votes from the two applications of the RF algorithm
0.2
.4.6
.81
Vot
e fro
m R
F ba
sed
on a
ll da
ta
0 .2 .4 .6 .8Vote from RF based on peaks
Cases Controls
Classification: calibration versus validation results
Of the groups taking part in the original project, our analysis (using RF applied to peaks) had joint lowest percentage correct from calibration data (84%) but highest from validation data (83% correct)
.6.7
.8.9
1P
erce
nt c
orre
ct
.6.7
.8.9
1P
erce
nt c
orre
ct
Training ValidationData sets
Barrett & Cairns, SAGMB, 2008; Hand, SAGMB, 2008
. . . including results using raw data points
Rerunning with RF applied to raw data points, results from both calibration and validation data sets are very similar (82% and 85% correct)
Barrett & Cairns, SAGMB, 2008; Hand, SAGMB, 2008
.6.7
.8.9
1P
erce
nt c
orre
ct
.6.7
.8.9
1P
erce
nt c
orre
ct
Training ValidationData sets
Classification: summary
•
RF has good performance as a classifier –
in terms of percentage correctly classified -
certainly comparable
to other methods
•
One especially good property is unbiased estimation of performance from initial data set
•
Use of peaks has little effect on classification accuracy compared with raw (correlated) data
•
Parsimony principle would favour use of summary measures where good ones can be found
Variable importance measures
•
Can also see which variables have been important in classification –
emphasis is on ranking of variables
rather than absolute measures
•
Most common VIMs
are Gini
index (based on impurity of nodes) and permutation-based VIM
•
Have been used in numerous studies in the analysis of genetic, gene expression and proteomic data
•
We have investigated VIMs
but so far not found them to outperform analysis of each peak separately
General conclusions•
Careful attention to study design is essential to avoid bias and improve efficiency
•
Proper understanding of (sources of) variation is needed, so that these can be minimized or removed as far as possible
•
Complex analyses should be preceded by more simple descriptive analyses
AcknowledgementsSection of Epidemiology and Biostatistics and Clinical Proteomics Research Group, Leeds Institute of Molecular Medicine, University of Leeds
Especially:David Cairns –
MRC Research fellow in Statistics
Roz
Banks
–
Professor of Biomedical proteomics
Dave Perkins
–
Principal research scientist in bioinformatics