multivariate statistical analyses and machine learning for … · 2017-02-14 · multivariate...
TRANSCRIPT
Multivariate statistical analyses and Multivariate statistical analyses and machine learning for metabolomicsmachine learning for metabolomics
Roy GoodacreRoy GoodacreSchool of Chemistry,School of Chemistry,The University of Manchester, The University of Manchester, Sackville Street, PO Box 88, Sackville Street, PO Box 88, Manchester, M60 1QDManchester, M60 1QD
[email protected]@manchester.ac.ukwww.biospec.netwww.biospec.net
Postdocs: Dr Catherine Winder, Dr Robert Cormell, Dr Roger Jarvis, Dr Mohammad Afzaal, A.N. Other x 5, With Collabs: Dr Seetharaman Vaidyanathan, Dr John Fletcher Research Technicians: Jo Ellis, Steffi Schuler + A.N. OtherResearch Students (PhD): Felicity Currie, Katherine Hollywood, Robben Jaber, Nicoletta Nicolaou, Soyab Patel, Ketan Patel, Emma Wharfe, Will Allwood, Will Cheung, Robert Coe. From Oct 1: Nicola Wood, Dong Hyun Kim.Research Students (MSc): Nicola Wood, Sadia Rabbini, Martin Coleston, Graham Mullard.£s, €s, $s: BBSRC, EPSRC, RSC, EU F6,
DEFRA, NERC, ORS
www.biospec.net
‘Top down’ metabolomics
Carefully designedData-generating experiment
Generationof hypotheses
Inductive reasoning by computation
Analysis and test hypotheses
Strategy: Inductive approach to knowledge discovery via holism
Goodacre, R. (2005) Making sense of the metabolome using evolutionary computation: seeing the wood with the trees. Journal of Experimental Botany 56, 245-254.
“Hiring a statistician after the data have been collected is like hiring a physician when the patient is in the morgue. He might be able to
tell you what went wrong, but is unlikely to be able to fix it”
Data floods
“Data does not equal information; information does not equal knowledge; and, most importantly of all, knowledge
does not equal wisdom. We have oceans of data, rivers of
information, small puddles of knowledge, and the odd drop of wisdom”
Henry Nix, 1990
Data outputs0
100
%
391
320
267269
365
595481
463
392
392419
481
594482
549503
743708595 672596672
833815744815
1059946834
9468351059
98710601063
000627 54 4 (2 009) C (1 4)
1.41.61.822.22.42.62.83Chemical shift (ppm)
Inte
nsity
dimethylglycine ?
alanine
acetate
carnitine argininehypotaurine
5001000150020002500300035004000
-0.1
0
0.1
0.2
0.3
0.4
0.5
Wavenumber (cm-1)
Abs
orba
nce
(arb
itrar
y)
Deconvolution
Metabolite Conc
Glucose 0.1
Indole 0.001
Tryptophan 1.2
Ethanolamine 0.7
… …
Metabolite #88 0.9
Metabolite #167 0.05
Name of metabolite with its concentration
Typical metabolic profile
Data handling
Input data
Objects going down in different rows
X-var1
Metabolite or peak 1
X-var2
Metabolite or peak 2
X-var3
Metabolite or peak 3
Sample 1
Sample 2…
Metabolite Conc
Glucose 0.1
Indole 0.001
Tryptophan 1.2
Ethanolamine 0.7
… …
Metabolite #88 0.9
Metabolite #167 0.05
Data handlingObjects going down in different rows
X-var1
Metabolite or peak 1
X-var2
Metabolite or peak 2
X-var3
Metabolite or peak 3
Y-var1
Lots of Metadata
Y-var2
Diseased or Healthy (Levels)
Sample 1 0 (control)
Sample 2… 1 (diseased)
Species AgeM/FBMI
sampling processingetc, etc…
Input data Output data
Unsupervised learning
System is shown a set of inputs (spectra) and then left to cluster the spectra into groupsThis optimization procedure is usually simplification or dimensionality reduction.
INPUT DATA
CLUSTERING HUMANINTERPRETATION
FEATUREEXTRACTION
Projection of the data
loadings (p)summarise variation in variables
p1p2 …
sam
ples
pi
123
k
t1 t2 ti…
scores (t)summarise variation in samples
uncorrelatedorthogonal
variablesx1, x2, … xi, … …, xn
variance: t1 ≥ t2 ≥ … ≥ ti
scores = loadings × datat1 = p1x1 + p2x2 + … + pixi + … +
pnxn
PCA
x1
x2
x3
PC1tj1
point j
PC2 tj2PC1 = 1st principal component
describes largest variancegoes through variable origin spacetj1 = score for point j = distance
from projection of that pointonto PC1 from origin
PC2 = 2nd principal componentdescribes 2nd largest variancegoes through variable origin spacetj2 = score for point j
PC1
PC2
Finds structure in dataRotate to uncover maximumcorrelations with respect to natural variation
Supervised learning
The goal is to find a mathematical model that will correctly associate the inputs with the targetsUsually achieved by minimising the error between the target and the model's response (output).
INPUT DATA
OUTPUT
OUTPUT DATA
COMPARISON
CALIBRATIONSYSTEM
ERROR
Discriminant function analysis(aka, canonical variates analysis)
Uses uncorrelated inputs a priori informationProjection based on:
Minimises within group varianceMaximises between group variance
Test by projection of ‘unknown’ samples Statistical significance:
χ2 confidence limits
-6 -4 -2 0 2 4 6 8-6
-4
-2
0
2
4
6
BindBt0
Bt1 Bt10
Bt11
Bt12Bt13
Bt14
Bt15
Bt16
Bt17Bt18Bt19
Bt2 Bt3 Bt4
Bt5
Bt6 Bt7 Bt8 Bt9
Hind
Hp1 Hp3
Hp4 Hp5
Hta0
Hta1Hta2
Hta3Hta4Hta5 Hta6
Hta7Hta8
Htb0
Htb2Htb3
Htb4Htb5
Htb6Htc1
Htc2
Htc3
Htc4Htc5 Htc6Htc7Htc8
Lind
Lp0
Lp1 Lp2
Lp3 Lta0
Lta1Lta2
Lta3
Lta4Lta5Lta6
Lta7Ltb0Ltb1Ltb2Ltb3Ltb4
Ltb5Ltb6
Ltb7Ltc0
Ltc1Ltc2
Ltc3Ltc4
Ltc5Ltc6
Ltc7
Nind
Nt-1Nt-2Nt-3Nt-4
Nt0
Nt1 Nt10
Nt11
Nt12
Nt13Nt14
Nt15
Nt2 Nt3
Nt4
Nt5 Nt6 Nt7
Nt8 Nt9
PindPt0 Pt1 Pt10Pt11
Pt12Pt13Pt14 Pt15Pt16
Pt17
Pt18Pt19 Pt2 Pt3 Pt4 Pt5 Pt6
Pt7 Pt8
Pt9
TindTt-1
Tt-2
Tt0 Tt1 Tt10
Tt11Tt12Tt13Tt14Tt15Tt16Tt17
Tt2 Tt3 Tt4 Tt5
Tt6
Tt7 Tt8
Tt9
DF1
DF2
Peat
6 groupsCircles = 95% χ2 confidence limitsArrows represent outlier samples that were from upper horizon of the peat depth profile
Speyside
Aberdeenshire
Orkney
Islay
Harrison, B., Ellis, J., Broadhurst, D., Reid, K., Goodacre, R. & Priest, F.G. (2006) Differentiation of peats used in the preparation of malt for Scotch whisky production using
Fourier transform infrared spectroscopy, Journal of the Institute of Brewing, submitted.
Target encoding for PLS, ANNs, etc…Usually binary encoded:
A B C identity
X 1 0 0 → A
Y 0 1 0 → B
Z 0 0 1 → C
Easy look up table
Known bacteria
New
isol
ates
Supervised methods are powerful…
Learn from experienceGeneralise from previous examples to new onesPerform pattern recognition on complex multivariate data.Make errors
usually because of badly chosen datatanks from trees…
What have I measured that is important
Consider healthy vs. diseased 2 class problemCollect metabolite data from both classes:
Encompass biological variationEven distribution
Try to keep number of samples the sameStatistics are easier / more rigorous
→ next do inductive bit…
Easy strategies first
Stare and compare!Always wise to actually look at the data…
Difference spectraAvg(diseased) - Avg(healthy)
Univariate analysesAnalysis of variance (ANOVA)T-test, etc
Loadings plots
Use multivariate analyses and inspect loadings
loadings (p)
…
←sa
mpl
es
scores (t)
Variables → …
Types of loadings plots
UnsupervisedPCASOFM
Supervised: discriminatoryDFA (CVA)PLS-DA
Supervised: regressionPLS
Supervised: tree-basedCART
Col-2 WT
C21 WT
Fiehn et al. (2000) Metabolite profiling for plant functional genomics. Nature Biotechnology 18, 1157-1161.
dgd1
Types of loadings plots
UnsupervisedPCASOFM
Supervised: discriminatoryDFA (CVA)PLS-DA
Supervised: regressionPLS
Supervised: tree-basedCART
But not always that easy…
isomaltoseU#106
U#107
proline
serine
threonine pyroglutamic acid
glutamate
β-alanine phenylalanine
ascorbate
γ-hydroxybutyricacid lactoneCol-2
C21
dgd1
Univariatep < 0.01 t-testp not signif.
Modest 250 metabolites
2 class problemHealthy vs. Diseased
To use or not to use?Yes / No, 250 times = 2250 or 1.8×1075
PC does say 10×106 orderings every second it would still take > 3×1062 years
Houston we have a problem…Global optimal
solution
Population ofsolutions encoding possible solutions
MutationCrossover
Selection
ComplexNP problem
algorithm
Goodsolution
ComplexNP problem
No
Evolutionary computation
Explanatory machine learning
Inputs[metabolome data]
01010101
Target output(s)[categorical or quantitative]
desi
red
resp
onse
s pa
ired
with
sam
ples
Explanatorymathematical
transformation
Qua
ntita
tive
leve
ls
Metabolites
Genetic algorithm
Chromosome with n genes selects which of ninputs to use in DA, MLR, PLS or ANN
… i … n-1 n1 2inputs:
filter:
0 1 0 1 01 1genes:
+1
0 or 1
GA: in silico evolution
Population ofsolutions
MutationCrossover
Naturalselection
+
+
Assess fitness function of population
Frequency of input selections
60080010001200140016001800200022000.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
wavenumbers
post spoilage
pre spoilage
0
2
4
6
8
10
12
sum
med
freq
uenc
y
Ellis, D.I., Broadhurst, D., Kell, D.B., Rowland, J.J. & Goodacre, R. (2002) Rapid and quantitative detection of the microbial spoilage of meat using FT-IR spectroscopy and
machine learning. Applied and Environmental Microbiology 68, 2822-2828.
Spoilage at 107 =
1096 cm-1
1683 cm-1
Amide I (C=O vibration)
Free amines (C-N stretch)
=
=
Proteolysis of meat by microbes detected by FT-IR
Overall conclusions on hypothesis generation
Many approaches can be used ranging fromSimple → stare and compare, ANOVA etcMultivariate → loadings: PCA, DFA, PLS More complex → evolutionary computation
Need to design experiment carefullyHypotheses might not always be correct
But a good place to start when initial knowledge is non-existent.
PyChem v2.0.0 Beta (http://pychem.org.uk)
Standalone WinXP graphical application for MVAPreprocessing algorithms incorporatedMany standard chemometrics algorithms
PCA, HCA, DFA, PLSR, PLS-DA.Feature selection using a variety of genetic algorithm tools
Coded in Python (http://python.org/) the software is both FREEand OPEN SOURCE. Source code can be downloaded at http://sourceforge.net/projects/pychem/.Can be used to analyse continuous or discrete data such as transcriptomic, metabolomic (vibrational spectra, GC-MS, LC-MS etc…)
Written by Dr Roger Jarvis
m/z 117 vs. m/z 136
m/z 117 vs. m/z 109
60 80 100 120 140 160 180 2000
2
4
6
8
10
12
Mass (m/z)
Perc
enta
ge to
tal i
on c
ount
6177
89
94
110
136
162
Chlorogenic acid
60 80 100 120 140 160 180 2000
2
4
6
8
10
12
Mass (m/z)
Perc
enta
ge to
tal i
on c
ount
6177
89
94
110
136
162
Chlorogenic acid
50 100 150 2000
5
10
15
20
25
30
Mass (m/z)
Perc
enta
ge to
tal i
on c
ount
55
67
82
109
137 165
N N
N
CH3
+•or
N
N
N
N
CH3
O
CH3
OCH3
N
N
+• N
N
CH3
O+
+
N
N
CH3
O
CH3
O
•
Caffeine
+•N
N
CH3
50 100 150 2000
5
10
15
20
25
30
Mass (m/z)
Perc
enta
ge to
tal i
on c
ount
55
67
82
109
137 165
N N
N
CH3
+•or
N
N
N
N
CH3
O
CH3
OCH3
N
N
+• N
N
CH3
O+
+
N
N
CH3
O
CH3
O
•
Caffeine
+•N
N
CH3
+Caffeine
-Caffeine
On (stamp) collecting data“Science is built up with facts, as a house is with stones. But a collection of facts is no more a science than a heap of stones is a house”
Jules Henri Poincaré (1854-1912) La Science et l'hypothèse
Having the parts list doesn’t mean that you know how an organism works