how to perform - station biologique de...

http://workflow4metabolomics.org

HOW TO PERFORM MULTIVARIATE ANALYSES?

W4M Core Team


The "Multivariate" module

The "Multivariate" module on W4M allows you to perform:

• Principal Component Analysis (PCA)

• Partial Least-Squares regression (PLS) and discriminant analysis (PLS-DA)

• Orthogonal Partial Least-Squares regression (OPLS) and discriminant analysis (OPLS-DA)

The original algorithms for PCA, PLS and OPLS, based on the NIPALS algorithm, have been implemented in the R environment


Chaining the statistical modules

The Multivariate module can be chained with the Univariate module and with the Filters module (either to filter out pool or blank samples before the statistics, or to filter out variables according to a statistical threshold after the analysis)


Preparing your files (1/9)

Your data must be split into 3 files:

• dataMatrix.tsv

• sampleMetadata.tsv

• variableMetadata.tsv




Each file can be prepared in Excel and saved in the tab-delimited text format



You can then rename your file with the .tsv extension (instead of .txt) by right-clicking on the file

.tsv files (i.e. tab-separated values) are handled correctly by both Excel and Galaxy



The decimal separator must be "."

Missing values must be indicated as "NA"



Note: you can switch your default language in Excel to English in order to have your decimal separator automatically set to "."



The dataMatrix.tsv file must contain:

• the names of your samples in the first row

• the names of your variables in the first column

• numbers (or NA) in all the other cells

Note: the name in the top-left (A1) cell does not matter; avoid using "ID" for Excel compatibility
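The layout rules above can be checked before uploading. The following is only a minimal sketch using the Python standard library: the function name `check_data_matrix` is hypothetical and is not part of W4M or Galaxy, and the table is assumed to be available as tab-separated text.

```python
import csv
import io


def check_data_matrix(tsv_text):
    """Check a dataMatrix-style table given as tab-separated text.

    Returns (sample_names, variable_names, problems).
    """
    rows = [row for row in csv.reader(io.StringIO(tsv_text), delimiter="\t") if row]
    header, body = rows[0], rows[1:]
    sample_names = header[1:]                   # the top-left (A1) cell is ignored
    variable_names = [row[0] for row in body]   # first column
    problems = []
    for row in body:
        if len(row) != len(header):
            problems.append(f"{row[0]}: expected {len(header)} cells, found {len(row)}")
            continue
        for sample, cell in zip(sample_names, row[1:]):
            if cell == "NA":
                continue                        # missing value
            try:
                float(cell)                     # fails on a comma separator such as 2,3
            except ValueError:
                problems.append(f"{row[0]} / {sample}: {cell!r} is neither a number nor NA")
    return sample_names, variable_names, problems


# Illustrative table: the cell "2,3" uses a comma as decimal separator.
demo = "dataMatrix\ts1\ts2\nv1\t1.5\tNA\nv2\t2,3\t4\n"
print(check_data_matrix(demo)[2])
```

An empty list of problems only means that the layout rules are satisfied; it says nothing about the quality of the values themselves.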



Preparing your files (7/9)

The sampleMetadata.tsv file must contain:

• the names of the factors to be used in statistical analyses in the first row

• columns containing either characters (for qualitative factors) or numbers (for quantitative factors)

• the names of your samples in the first column, which must exactly match those of the dataMatrix.tsv file

Notes:

• 1) the name in the top-left (A1) cell does not matter; avoid using "ID" for Excel compatibility

• 2) you can add columns for storing metadata about your samples even if they are not used in your Galaxy analysis

• 3) results from statistical analyses (e.g. scores) will be added as supplementary columns in this file



The variableMetadata.tsv file must contain:

• the names of the metadata (e.g. mzmed, rtmed) in the first row (there must be at least one column in addition to the variable names)

• the names of your variables in the first column, which must exactly match those of the dataMatrix.tsv file

Notes:

• 1) the name in the top-left (A1) cell does not matter; avoid using "ID" for Excel compatibility

• 2) you can add columns for storing metadata about your variables even if they are not used in your Galaxy analysis

• 3) results from the statistical analyses (e.g. loadings, VIPs) will be added as new columns in this file
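Since sample and variable names must match across the three files, a quick cross-check can save a failed run. This is a sketch under assumptions: the helper names are hypothetical, the tables are passed as tab-separated text, and only the presence of names is compared (not their order).

```python
import csv
import io


def read_rows(tsv_text):
    """Parse tab-separated text into a list of non-empty rows."""
    return [row for row in csv.reader(io.StringIO(tsv_text), delimiter="\t") if row]


def name_mismatches(data_matrix, sample_metadata, variable_metadata):
    """Return the names found in one table but not in its counterpart."""
    data_rows = read_rows(data_matrix)
    matrix_samples = set(data_rows[0][1:])                  # first row of dataMatrix
    matrix_variables = {row[0] for row in data_rows[1:]}    # first column of dataMatrix
    meta_samples = {row[0] for row in read_rows(sample_metadata)[1:]}
    meta_variables = {row[0] for row in read_rows(variable_metadata)[1:]}
    return {
        "samples": sorted(matrix_samples ^ meta_samples),
        "variables": sorted(matrix_variables ^ meta_variables),
    }
```

Both lists are empty when the names agree; any name listed is present in one file only.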




Sample and variable names:

• should not start with a digit

• should contain only the following characters:

    • a b c d e f g h i j k l m n o p q r s t u v w x y z

    • A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

    • 0 1 2 3 4 5 6 7 8 9

    • , [comma]

    • - [dash]

    • _ [underscore]

    •   [blank]

• should not contain any other punctuation marks or accented characters

• should not contain any duplicates
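The naming rules above translate directly into a regular expression. The following is a minimal sketch; `check_names` is a hypothetical helper, not a W4M function.

```python
import re
from collections import Counter

# Letters, digits, comma, dash, underscore and blank, as listed above.
ALLOWED = re.compile(r"[A-Za-z0-9,_ -]+")


def check_names(names):
    """Return a list of (name, reason) pairs for names that break the rules."""
    problems = []
    for name in names:
        if re.match(r"[0-9]", name):
            problems.append((name, "starts with a digit"))
        if not ALLOWED.fullmatch(name):
            problems.append((name, "empty or contains a character outside the allowed set"))
    for name, count in Counter(names).items():
        if count > 1:
            problems.append((name, "duplicated"))
    return problems


print(check_names(["s1", "2b", "é_x", "s1"]))
```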


Loading your files into Galaxy (1/2)

Upload your three files (dataMatrix.tsv, sampleMetadata.tsv and variableMetadata.tsv)

• either by using the upload icon and dragging and dropping the files


Loading your files into Galaxy (2/2)

• or with the Get Data / Upload File tool


Check that your data have been uploaded correctly


Rename your history (optional)



Open the "Multivariate" module and select your 3 files of interest: you are now ready to start your multivariate analyses!


Principal component analysis (PCA)



Select

• the total number of components

• the scaling

• the logarithm (log10) transformation of the values (optional)

• the components for display

• and launch the computation
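To make the effect of the transformation and scaling options concrete, here is a minimal sketch applied to a single variable. The scaling choices offered by the module are not listed above, so mean-centering followed by unit-variance scaling is an assumption made for illustration, and the function name is hypothetical.

```python
import math
import statistics


def log10_then_standardize(intensities):
    """Log10-transform one variable, then mean-center and scale to unit variance."""
    logged = [math.log10(value) for value in intensities]   # requires values > 0
    mean = statistics.mean(logged)
    sd = statistics.stdev(logged)                           # sample standard deviation
    return [(value - mean) / sd for value in logged]


print(log10_then_standardize([1.0, 10.0, 100.0]))
```

After this step every variable has the same variance, so intense signals no longer dominate the components simply because of their magnitude.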



Graphical results

Look at the "figure.pdf" file to see the scree plot, extreme observations, and the loading and score plots


Observation diagnostics

The "observation diagnostics" plot highlights observations with a large distance from the center in the score plane (score distance) or a large distance to their projection in the score plane (orthogonal distance)
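The two distances can be sketched as follows. This is an illustration under assumptions, not the module's code: the score distance is scaled by the variance of each score (a common convention that the text above does not confirm), the data are assumed centered, the loadings are assumed orthonormal, and the function name is hypothetical.

```python
import math
import statistics


def observation_distances(X, P):
    """Score and orthogonal distances of each row of X for loadings P.

    X: centered data, one row per observation.
    P: loadings, one row per variable and one column per component.
    """
    n_comp = len(P[0])
    # Scores: projection of each observation on each component.
    T = [[sum(x * p[a] for x, p in zip(row, P)) for a in range(n_comp)] for row in X]
    variances = [statistics.variance([t[a] for t in T]) for a in range(n_comp)]
    score_dist, ortho_dist = [], []
    for row, t in zip(X, T):
        score_dist.append(math.sqrt(sum(t[a] ** 2 / variances[a] for a in range(n_comp))))
        # Residual between the observation and its projection in the score plane.
        fitted = [sum(t[a] * p[a] for a in range(n_comp)) for p in P]
        ortho_dist.append(math.sqrt(sum((x - f) ** 2 for x, f in zip(row, fitted))))
    return score_dist, ortho_dist
```

An observation far from the center along a component gets a large score distance, while an observation that the components fail to describe gets a large orthogonal distance.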



Graphical results

The figure can be downloaded as a .pdf file



Numerical results

Numerical results (including the percentage of explained inertia) can be viewed in the "information.txt" file


Score and loading values

The score (resp. loading) values of the selected components are added as columns in the sampleMetadata.tsv (resp. variableMetadata.tsv) file


Tuning the parameters

You can recall the page with your parameters, modify them, and restart the analysis


Advanced parameters (1/2)

• The default algorithm is svd (faster), except when there are missing values (nipals is used instead)

• The number of extreme values on the loading plots (coloured in red) can be modified

• The type of graphic can be modified
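For readers curious about the nipals option, the basic iteration for the first principal component can be sketched as below. This is a simplified illustration for a centered matrix without missing values; the handling of missing values, which is the reason the module switches to nipals, is deliberately left out, and the function name is hypothetical.

```python
import math


def nipals_first_component(X, tol=1e-12, max_iter=500):
    """First principal component of a centered, complete matrix X by NIPALS.

    Returns (scores, loadings); X is a list of rows without missing values.
    """
    n_var = len(X[0])
    # Start from the column with the largest sum of squares.
    start = max(range(n_var), key=lambda j: sum(row[j] ** 2 for row in X))
    t = [row[start] for row in X]
    for _ in range(max_iter):
        tt = sum(v * v for v in t)
        # Loadings: regress each column of X on the current scores, then normalize.
        p = [sum(row[j] * ti for row, ti in zip(X, t)) / tt for j in range(n_var)]
        norm = math.sqrt(sum(v * v for v in p))
        p = [v / norm for v in p]
        # Scores: project each row of X on the loadings.
        t_new = [sum(x * pj for x, pj in zip(row, p)) for row in X]
        converged = sum((a - b) ** 2 for a, b in zip(t_new, t)) < tol
        t = t_new
        if converged:
            break
    return t, p
```

The proportion of inertia explained by the component is the sum of squared scores divided by the sum of squared values of X; further components are obtained by repeating the iteration on the residual matrix.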




• A factor (column of the sampleMetadata.tsv file) can be indicated to color the samples

• In the case of a qualitative factor, it can also be used to draw the Mahalanobis ellipse of each class


References

• Husson F., Lê S. and Pagès J. (2011). Exploratory multivariate analysis by example using R. Chapman & Hall/CRC.

• Ringnér M. (2008). What is principal component analysis? Nature Biotechnology, 26:303-304. http://dx.doi.org/10.1038/nbt0308-303

• Baccini A. (2010). Statistique descriptive multidimensionnelle (pour les nuls). www.math.univ-toulouse.fr/~baccini/zpedago/asdm.pdf


Partial Least Squares (PLS) and Partial Least Squares Discriminant Analysis (PLS-DA)


Select (1/2)

• the Y response to be modelled (column of the sampleMetadata.tsv file)

    • Note: in the case of a qualitative response, Mahalanobis ellipses can be drawn for each class by indicating the same factor as Y

• the number of random permutations of the labels used to estimate the significance of the model


Select (2/2)

• the total number of components

• the scaling






Graphical results

Look at the "figure.pdf" file to see the results of the permutation testing, extreme observations, and the loading and score plots


Diagnostic metrics

0 ≤ R2X ≤ 1: proportion of X inertia explained by the model

0 ≤ R2Y ≤ 1: proportion of Y inertia explained by the model

Q2Y ≤ 1: estimation of the predictive performance of the model by cross-validation (Q2Y can be negative when the model has no predictive ability)

R2X and R2Y increase with the number of components, whereas Q2Y reaches a maximum (additional components then only overfit the data), as can be visualized with the "overview" graphic
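The difference between a fitted metric (R2Y) and a cross-validated one (Q2Y) can be illustrated on a much simpler model. The sketch below replaces PLS with an ordinary least-squares line and uses leave-one-out cross-validation; both choices, the formula Q2 = 1 - PRESS/TSS, and the function names are assumptions made for illustration, not the module's implementation.

```python
def fit_line(x, y):
    """Least-squares slope and intercept of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return slope, mean_y - slope * mean_x


def r2_and_q2(x, y):
    """R2 on the full data and Q2 by leave-one-out cross-validation."""
    mean_y = sum(y) / len(y)
    total = sum((b - mean_y) ** 2 for b in y)
    slope, intercept = fit_line(x, y)
    residual = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    press = 0.0
    for i in range(len(x)):
        # Refit without observation i, then predict it.
        s, c = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (s * x[i] + c)) ** 2
    return 1 - residual / total, 1 - press / total


print(r2_and_q2([0.0, 1.0, 2.0, 3.0, 4.0], [0.1, 0.9, 2.2, 2.8, 4.1]))
```

Because each prediction is made for an observation left out of the fit, Q2 is never larger than R2 in this setting, and it becomes negative when the model predicts worse than the mean of the response.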



Permutation testing

The algorithm randomly permutes the Y labels, builds the corresponding models and computes their R2X, R2Y and Q2Y

Counting the number of R2Y (and Q2Y) values from the random models that are greater than the values of the true model gives an indication of the significance of the PLS modelling
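The counting principle can be sketched with a stand-in statistic. Here the squared correlation between a single x variable and y plays the role of R2Y, which is an assumption made to keep the example short; a real analysis would refit the PLS model for each permutation. The function names, the use of "greater than or equal to", and the fixed seed are also illustrative choices.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation between x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def permutation_test(x, y, n_perm=200, seed=0):
    """Return the observed statistic and the fraction of permutations reaching it."""
    rng = random.Random(seed)
    observed = r_squared(x, y)
    reached = 0
    for _ in range(n_perm):
        shuffled = y[:]
        rng.shuffle(shuffled)            # break the link between x and y
        if r_squared(x, shuffled) >= observed:
            reached += 1
    return observed, reached / n_perm
```

A small fraction suggests that the relation between X and Y captured by the true model is unlikely to arise from random labels.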



Numerical results

The details of the R2X, R2Y, and Q2Y values are stored in the "information.txt" file


Scores, loadings and VIPs

The scores (resp. loadings and VIPs) of the selected components are added as columns in the sampleMetadata.tsv (resp. variableMetadata.tsv) file



• Use the icon to view your last parameters, modify them and start a new computation

• The optimal number of components can be estimated

• The dataset can be split into reference and test subsets (the latter comprising the samples with odd indices); in this case, an estimation of the error on the test subset (RMSEP) is computed in addition to the estimation of the error on the reference subset (RMSEE)
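The split and the two error estimates can be sketched as follows. This is an illustration under assumptions: indices are taken to be 1-based positions in the sample list, the error is a plain root-mean-square error (the module may normalise it differently), and both function names are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted responses."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def split_reference_test(samples):
    """Split samples so that the test subset holds the odd positions (1st, 3rd, ...)."""
    test = samples[0::2]        # 1-based positions 1, 3, 5, ...
    reference = samples[1::2]   # 1-based positions 2, 4, 6, ...
    return reference, test


reference, test = split_reference_test(["a", "b", "c", "d", "e"])
print(reference, test)
```

Applying `rmse` to the reference subset gives an RMSEE-like value, and applying it to the predictions on the test subset gives an RMSEP-like value; a test error much larger than the reference error is a sign of overfitting.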




• Several other types of graphics are available:

    • XY-scores

    • predict-train and predict-test (the latter being available only if the test subset of odd indices has been defined)


Orthogonal Partial Least Squares (OPLS) and Orthogonal Partial Least Squares Discriminant Analysis (OPLS-DA)


Select (1/2)

• the Y response to be modelled (as in PLS)

• the number of predictive components (set it to 1)

• the number of orthogonal components

• the number of random permutations of the labels used to estimate the significance of the model (as in PLS)


Select (2/2)

• the scaling






Graphical results

Look at the "figure.pdf" file to see the results of the permutation testing, extreme observations, and the loading and score plots


Diagnostic metrics and permutation testing

Diagnostics are similar to those of PLS (see above)

Note: OPLS improves the interpretation of the components but not the overall predictive performance of the model


Numerical results

The details of the R2X, R2Y, and Q2Y values are stored in the "information.txt" file


Advanced parameters

• Use the icon to view your last parameters, modify them and start a new computation

• The optimal number of orthogonal components can be estimated

• Note: care should be taken to avoid too many orthogonal components (which would result in overfitting) by thoroughly examining the R2Y and Q2Y values in the "overview" graphic


References

• Trygg J., Holmes E. and Lundstedt T. (2007). Chemometrics in Metabonomics. Journal of Proteome Research, 6:469-479. http://dx.doi.org/10.1021/pr060594q

• Wheelock Å.M. and Wheelock C.E. (2013). Trials and tribulations of omics data analysis: Assessing quality of SIMCA-based multivariate models using examples from pulmonary medicine. Molecular BioSystems, 9:2589-2596. http://dx.doi.org/10.1039/C3MB70194H
