multivariate statistical analyses and machine learning for … · 2017-02-14 · multivariate...

Multivariate statistical analyses and Multivariate statistical analyses and machine learning for metabolomicsmachine learning for metabolomics

Roy GoodacreRoy GoodacreSchool of Chemistry,School of Chemistry,The University of Manchester, The University of Manchester, Sackville Street, PO Box 88, Sackville Street, PO Box 88, Manchester, M60 1QDManchester, M60 1QD

[email protected]@manchester.ac.ukwww.biospec.netwww.biospec.net

Postdocs: Dr Catherine Winder, Dr Robert Cormell, Dr Roger Jarvis, Dr Mohammad Afzaal, A.N. Other x 5, With Collabs: Dr Seetharaman Vaidyanathan, Dr John Fletcher Research Technicians: Jo Ellis, Steffi Schuler + A.N. OtherResearch Students (PhD): Felicity Currie, Katherine Hollywood, Robben Jaber, Nicoletta Nicolaou, Soyab Patel, Ketan Patel, Emma Wharfe, Will Allwood, Will Cheung, Robert Coe. From Oct 1: Nicola Wood, Dong Hyun Kim.Research Students (MSc): Nicola Wood, Sadia Rabbini, Martin Coleston, Graham Mullard.£s, €s, $s: BBSRC, EPSRC, RSC, EU F6,

DEFRA, NERC, ORS

www.biospec.net

‘Top down’ metabolomics

Carefully designedData-generating experiment

Generationof hypotheses

Inductive reasoning by computation

Analysis and test hypotheses

Strategy: Inductive approach to knowledge discovery via holism

Goodacre, R. (2005) Making sense of the metabolome using evolutionary computation: seeing the wood with the trees. Journal of Experimental Botany 56, 245-254.

“Hiring a statistician after the data have been collected is like hiring a physician when the patient is in the morgue. He might be able to

tell you what went wrong, but is unlikely to be able to fix it”

Data floods

“Data does not equal information; information does not equal knowledge; and, most importantly of all, knowledge

does not equal wisdom. We have oceans of data, rivers of

information, small puddles of knowledge, and the odd drop of wisdom”

Henry Nix, 1990

Data outputs0

100

%

391

320

267269

365

595481

463

392

392419

481

594482

549503

743708595 672596672

833815744815

1059946834

9468351059

98710601063

000627 54 4 (2 009) C (1 4)

1.41.61.822.22.42.62.83Chemical shift (ppm)

Inte

nsity

dimethylglycine ?

alanine

acetate

carnitine argininehypotaurine

5001000150020002500300035004000

-0.1

0

0.1

0.2

0.3

0.4

0.5

Wavenumber (cm-1)

Abs

orba

nce

(arb

itrar

y)

Deconvolution

Metabolite Conc

Glucose 0.1

Indole 0.001

Tryptophan 1.2

Ethanolamine 0.7

… …

Metabolite #88 0.9

Metabolite #167 0.05

Name of metabolite with its concentration

Typical metabolic profile

Data handling

Input data

Objects going down in different rows

X-var1

Metabolite or peak 1

X-var2


X-var3


Sample 1

Sample 2…

Metabolite Conc

Glucose 0.1

Indole 0.001

Tryptophan 1.2

Ethanolamine 0.7

… …

Metabolite #88 0.9

Metabolite #167 0.05

Data handlingObjects going down in different rows

X-var1


X-var2


X-var3


Y-var1

Lots of Metadata

Y-var2

Diseased or Healthy (Levels)

Sample 1 0 (control)

Sample 2… 1 (diseased)

Species AgeM/FBMI

sampling processingetc, etc…

Input data Output data

Unsupervised learning

System is shown a set of inputs (spectra) and then left to cluster the spectra into groupsThis optimization procedure is usually simplification or dimensionality reduction.

INPUT DATA

CLUSTERING HUMANINTERPRETATION

FEATUREEXTRACTION

Projection of the data

loadings (p)summarise variation in variables

p1p2 …

sam

ples

pi

123

k

t1 t2 ti…

scores (t)summarise variation in samples

uncorrelatedorthogonal

variablesx1, x2, … xi, … …, xn

variance: t1 ≥ t2 ≥ … ≥ ti

scores = loadings × datat1 = p1x1 + p2x2 + … + pixi + … +

pnxn

PCA

x1

x2

x3

PC1tj1

point j

PC2 tj2PC1 = 1st principal component

describes largest variancegoes through variable origin spacetj1 = score for point j = distance

from projection of that pointonto PC1 from origin

PC2 = 2nd principal componentdescribes 2nd largest variancegoes through variable origin spacetj2 = score for point j

PC1

PC2

Finds structure in dataRotate to uncover maximumcorrelations with respect to natural variation

Supervised learning

The goal is to find a mathematical model that will correctly associate the inputs with the targetsUsually achieved by minimising the error between the target and the model's response (output).

INPUT DATA

OUTPUT

OUTPUT DATA

COMPARISON

CALIBRATIONSYSTEM

ERROR

Discriminant function analysis(aka, canonical variates analysis)

Uses uncorrelated inputs a priori informationProjection based on:

Minimises within group varianceMaximises between group variance

Test by projection of ‘unknown’ samples Statistical significance:

χ2 confidence limits

-6 -4 -2 0 2 4 6 8-6

-4

-2

0

2

4

6

BindBt0

Bt1 Bt10

Bt11

Bt12Bt13

Bt14

Bt15

Bt16

Bt17Bt18Bt19

Bt2 Bt3 Bt4

Bt5

Bt6 Bt7 Bt8 Bt9

Hind

Hp1 Hp3

Hp4 Hp5

Hta0

Hta1Hta2

Hta3Hta4Hta5 Hta6

Hta7Hta8

Htb0

Htb2Htb3

Htb4Htb5

Htb6Htc1

Htc2

Htc3

Htc4Htc5 Htc6Htc7Htc8

Lind

Lp0

Lp1 Lp2

Lp3 Lta0

Lta1Lta2

Lta3

Lta4Lta5Lta6

Lta7Ltb0Ltb1Ltb2Ltb3Ltb4

Ltb5Ltb6

Ltb7Ltc0

Ltc1Ltc2

Ltc3Ltc4

Ltc5Ltc6

Ltc7

Nind

Nt-1Nt-2Nt-3Nt-4

Nt0

Nt1 Nt10

Nt11

Nt12

Nt13Nt14

Nt15

Nt2 Nt3

Nt4

Nt5 Nt6 Nt7

Nt8 Nt9

PindPt0 Pt1 Pt10Pt11

Pt12Pt13Pt14 Pt15Pt16

Pt17

Pt18Pt19 Pt2 Pt3 Pt4 Pt5 Pt6

Pt7 Pt8

Pt9

TindTt-1

Tt-2

Tt0 Tt1 Tt10

Tt11Tt12Tt13Tt14Tt15Tt16Tt17

Tt2 Tt3 Tt4 Tt5

Tt6

Tt7 Tt8

Tt9

DF1

DF2

Peat

6 groupsCircles = 95% χ2 confidence limitsArrows represent outlier samples that were from upper horizon of the peat depth profile

Speyside

Aberdeenshire

Orkney

Islay

Harrison, B., Ellis, J., Broadhurst, D., Reid, K., Goodacre, R. & Priest, F.G. (2006) Differentiation of peats used in the preparation of malt for Scotch whisky production using

Fourier transform infrared spectroscopy, Journal of the Institute of Brewing, submitted.

Target encoding for PLS, ANNs, etc…Usually binary encoded:

A B C identity

X 1 0 0 → A

Y 0 1 0 → B

Z 0 0 1 → C

Easy look up table

Known bacteria

New

isol

ates

Supervised methods are powerful…

Learn from experienceGeneralise from previous examples to new onesPerform pattern recognition on complex multivariate data.Make errors

usually because of badly chosen datatanks from trees…

What have I measured that is important

Consider healthy vs. diseased 2 class problemCollect metabolite data from both classes:

Encompass biological variationEven distribution

Try to keep number of samples the sameStatistics are easier / more rigorous

→ next do inductive bit…

Easy strategies first

Stare and compare!Always wise to actually look at the data…

Difference spectraAvg(diseased) - Avg(healthy)

Univariate analysesAnalysis of variance (ANOVA)T-test, etc

Loadings plots

Use multivariate analyses and inspect loadings

loadings (p)

…

←sa

mpl

es

scores (t)

Variables → …

Types of loadings plots

UnsupervisedPCASOFM

Supervised: discriminatoryDFA (CVA)PLS-DA

Supervised: regressionPLS

Supervised: tree-basedCART

Col-2 WT

C21 WT

Fiehn et al. (2000) Metabolite profiling for plant functional genomics. Nature Biotechnology 18, 1157-1161.

dgd1

Types of loadings plots

UnsupervisedPCASOFM

Supervised: discriminatoryDFA (CVA)PLS-DA

Supervised: regressionPLS

Supervised: tree-basedCART

But not always that easy…

isomaltoseU#106

U#107

proline

serine

threonine pyroglutamic acid

glutamate

β-alanine phenylalanine

ascorbate

γ-hydroxybutyricacid lactoneCol-2

C21

dgd1

Univariatep < 0.01 t-testp not signif.

Modest 250 metabolites

2 class problemHealthy vs. Diseased

To use or not to use?Yes / No, 250 times = 2250 or 1.8×1075

PC does say 10×106 orderings every second it would still take > 3×1062 years

Houston we have a problem…Global optimal

solution

Population ofsolutions encoding possible solutions

MutationCrossover

Selection

ComplexNP problem

algorithm

Goodsolution

ComplexNP problem

No

Evolutionary computation

Explanatory machine learning

Inputs[metabolome data]

01010101

Target output(s)[categorical or quantitative]

desi

red

resp

onse

s pa

ired

with

sam

ples

Explanatorymathematical

transformation

Qua

ntita

tive

leve

ls

Metabolites

Genetic algorithm

Chromosome with n genes selects which of ninputs to use in DA, MLR, PLS or ANN

… i … n-1 n1 2inputs:

filter:

0 1 0 1 01 1genes:

+1

0 or 1

GA: in silico evolution

Population ofsolutions

MutationCrossover

Naturalselection

+

+

Assess fitness function of population

Frequency of input selections

60080010001200140016001800200022000.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

wavenumbers

post spoilage

pre spoilage

0

2

4

6

8

10

12

sum

med

freq

uenc

y

Ellis, D.I., Broadhurst, D., Kell, D.B., Rowland, J.J. & Goodacre, R. (2002) Rapid and quantitative detection of the microbial spoilage of meat using FT-IR spectroscopy and

machine learning. Applied and Environmental Microbiology 68, 2822-2828.

Spoilage at 107 =

1096 cm-1

1683 cm-1

Amide I (C=O vibration)

Free amines (C-N stretch)

=

=

Proteolysis of meat by microbes detected by FT-IR

Overall conclusions on hypothesis generation

Many approaches can be used ranging fromSimple → stare and compare, ANOVA etcMultivariate → loadings: PCA, DFA, PLS More complex → evolutionary computation

Need to design experiment carefullyHypotheses might not always be correct

But a good place to start when initial knowledge is non-existent.

PyChem v2.0.0 Beta (http://pychem.org.uk)

Standalone WinXP graphical application for MVAPreprocessing algorithms incorporatedMany standard chemometrics algorithms

PCA, HCA, DFA, PLSR, PLS-DA.Feature selection using a variety of genetic algorithm tools

Coded in Python (http://python.org/) the software is both FREEand OPEN SOURCE. Source code can be downloaded at http://sourceforge.net/projects/pychem/.Can be used to analyse continuous or discrete data such as transcriptomic, metabolomic (vibrational spectra, GC-MS, LC-MS etc…)

Written by Dr Roger Jarvis

http://sourceforge.net/projects/pychem/

m/z 117 vs. m/z 136

m/z 117 vs. m/z 109

60 80 100 120 140 160 180 2000

2

4

6

8

10

12

Mass (m/z)

Perc

enta

ge to

tal i

on c

ount

6177

89

94

110

136

162

Chlorogenic acid

60 80 100 120 140 160 180 2000

2

4

6

8

10

12

Mass (m/z)

Perc

enta

ge to

tal i

on c

ount

6177

89

94

110

136

162

Chlorogenic acid

50 100 150 2000

5

10

15

20

25

30

Mass (m/z)

Perc

enta

ge to

tal i

on c

ount

55

67

82

109

137 165

N N

N

CH3

+•or

N

N

N

N

CH3

O

CH3

OCH3

N

N

+• N

N

CH3

O+

+

N

N

CH3

O

CH3

O

•

Caffeine

+•N

N

CH3

50 100 150 2000

5

10

15

20

25

30

Mass (m/z)

Perc

enta

ge to

tal i

on c

ount

55

67

82

109

137 165

N N

N

CH3

+•or

N

N

N

N

CH3

O

CH3

OCH3

N

N

+• N

N

CH3

O+

+

N

N

CH3

O

CH3

O

•

Caffeine

+•N

N

CH3

+Caffeine

-Caffeine

On (stamp) collecting data“Science is built up with facts, as a house is with stones. But a collection of facts is no more a science than a heap of stones is a house”

Jules Henri Poincaré (1854-1912) La Science et l'hypothèse

Having the parts list doesn’t mean that you know how an organism works

multivariate statistical analyses and machine learning for … · 2017-02-14 · multivariate...

Documents