tuning the significance level of simca models for reducing...



Page 1: Tuning the significance level of SIMCA models for reducing ...chemom2019.sciencesconf.org/data/04_chemom... · SO-CovSel: A NEW METHOD FOR VARIABLE SELECTION IN MULTI-BLOCK DATA Alessandra

SO-CovSel: A NEW METHOD FOR VARIABLE SELECTION IN MULTI-BLOCK DATA

Alessandra Biancolillo 1, Federico Marini 1, Jean-Michel Roger 2

1 University of Rome La Sapienza, Piazzale Aldo Moro 5, I-00185, Rome, Italy

2 ITAP, Irstea, Montpellier SupAgro, Univ Montpellier, Montpellier, France

Tuning the significance level of SIMCA models for reducing the impact of strong class overlap: a novel approach

R. Vitale1,2, F. Marini3, C. Ruckebusch2

1 Molecular Imaging and Photonics Unit, Department of Chemistry, Katholieke Universiteit Leuven, Celestijnenlaan 200F, B-3001 Leuven, Belgium

2 Laboratoire de Spectrochimie Infrarouge et Raman – UMR 8516, Université de Lille, Bâtiment C5, 59655 Villeneuve d’Ascq, France

3 Department of Chemistry, Università degli Studi di Roma “La Sapienza”, Piazzale Aldo Moro 5, 00185 Roma, Italy

Keywords: Class Modelling (CM), Soft Independent Modelling of Class Analogy (SIMCA), Receiver Operating Characteristic (ROC) curves.

1 Introduction

Nowadays, a large number of problems in fields such as foodstuff origin authentication, quality control and process monitoring are addressed by Class Modelling (CM) statistical methods. Techniques such as UNEQual class modelling (UNEQ [1]) and Soft Independent Modelling of Class Analogy (SIMCA [2]) have been used extensively for these purposes in recent decades. In contrast to the more popular Discriminant Analysis (DA), CM derives classification rules using only samples/objects belonging to a single target category. Faults in the definition of non-target categories, which could bias the classification performance, can thus be avoided. Nevertheless, it is well known that CM approaches can suffer from severe limitations when the classes under study overlap strongly. In such cases, properly adjusting the significance level of the resulting models is a potential way to reach a better compromise between True Positive and True Negative rates. In this work, a new data-driven methodology that exploits the concept of the Receiver Operating Characteristic (ROC) curve [3] is proposed to address this task. Its only requirement is that measurements for samples belonging to non-target classes are also available. Although this is not strictly needed in the CM context, it can be highly beneficial whenever significant overlap exists between categories. This presentation explores the potential of this procedure as a way of tuning SIMCA model parameters in such circumstances.
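The idea of tuning an acceptance threshold with a ROC curve can be sketched as follows. This is only an illustration, not the authors' procedure: it assumes each sample has a single distance-to-model score, that a sample is accepted by the target class when its score is at or below the threshold (the threshold playing the role fixed by the significance level), and that the best compromise is the threshold maximising TPR + TNR (Youden's index). The function names and the toy scores are hypothetical.

```python
def roc_points(target_scores, nontarget_scores):
    """Return one (threshold, TPR, TNR) triple per candidate threshold.

    Scores are distances to the class model; a sample is accepted by the
    target class when its score is <= threshold (assumption of this sketch).
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    points = []
    for thr in thresholds:
        tpr = sum(s <= thr for s in target_scores) / len(target_scores)
        tnr = sum(s > thr for s in nontarget_scores) / len(nontarget_scores)
        points.append((thr, tpr, tnr))
    return points


def best_threshold(target_scores, nontarget_scores):
    """Pick the threshold maximising TPR + TNR (assumed criterion)."""
    points = roc_points(target_scores, nontarget_scores)
    return max(points, key=lambda p: p[1] + p[2])


# Toy, made-up distances for overlapping classes.
target = [0.2, 0.4, 0.5, 0.9]
nontarget = [0.6, 0.8, 1.2, 1.5]
thr, tpr, tnr = best_threshold(target, nontarget)
print(thr, tpr, tnr)
```

Non-target samples are needed only to compute the True Negative rate, which matches the requirement stated in the abstract.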

2 Theory

Let $\mathbf{X}$ be an $N \times J$ dataset constituted by $Z$ submatrices $\mathbf{X}_z$ ($N_z \times J$), each one containing the measurements collected for a single class of samples. In SIMCA, every category of objects is modelled independently of the others, based on a Principal Component Analysis (PCA) model of appropriate dimensionality or complexity (say $A_z$), as:

$$\mathbf{X}_z = \mathbf{T}_z \mathbf{P}_z^{\mathrm{T}} + \mathbf{E}_z \qquad (1)$$
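A minimal sketch of the decomposition in equation (1) for one class, using NumPy's SVD. It assumes the class submatrix is column-centred before the PCA (the abstract does not state the preprocessing), and the function name is hypothetical.

```python
import numpy as np


def pca_class_model(Xz, n_components):
    """Decompose one class submatrix as Xz_centred = T P^T + E.

    Column centring is an assumption of this sketch.
    """
    mean = Xz.mean(axis=0)
    Xc = Xz - mean
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T      # loadings, J x A
    T = Xc @ P                   # scores, Nz x A
    E = Xc - T @ P.T             # residuals, Nz x J
    return mean, T, P, E


rng = np.random.default_rng(0)
Xz = rng.normal(size=(20, 5))    # made-up data: 20 samples, 5 variables
mean, T, P, E = pca_class_model(Xz, n_components=2)
print(T.shape, P.shape, E.shape)
```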


Introduction

• More and more multi-block data:

• same samples measured by different techniques

• A lot of methods for analyzing all the blocks simultaneously

• CCSWA (ComDim), MBPCA, MBPLS, SO-PLS, Statis, …

• But there is a lack of dedicated variable selection methods


The idea

To mix SO-PLS [1] and CovSel [2] because they act similarly

[1]: T. Næs, O. Tomic, B.-H. Mevik, H. Martens, Path modelling by sequential PLS regression, J. Chemometr. 25 (2011), 28–40

[2]: J.-M. Roger, B. Palagos, D. Bertrand, E. Fernandez-Ahumada, Chemometr. Intell. Lab. Syst. 106 (2011), 216–223


SO-PLS

• Principle: To extract complementary latent variables from each block, successively

• Algorithm:
  1. i = 1
  2. Extract ki PLS latent variables from (Xi, Y)
  3. Orthogonalize each block > i wrt the already extracted LVs
  4. Orthogonalize Y wrt the already extracted LVs
  5. i = i + 1; GOTO 2
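The sequential loop above can be sketched with NumPy as follows. This is an illustrative reading of the steps, not the reference implementation of [1]: each PLS weight is taken as the dominant left singular vector of X'Y, X is deflated after every score, blocks are column-centred, and Y is a 2-D array. All function names are hypothetical.

```python
import numpy as np


def pls_scores(X, Y, k):
    """Extract k PLS score vectors from (X, Y), deflating X after each one."""
    X = X.copy()
    scores = []
    for _ in range(k):
        # Weight vector: direction of X with maximal covariance with Y.
        U, _, _ = np.linalg.svd(X.T @ Y, full_matrices=False)
        t = X @ U[:, 0]
        X -= np.outer(t, t @ X) / (t @ t)   # remove what t explains in X
        scores.append(t)
    return np.column_stack(scores) if scores else np.empty((X.shape[0], 0))


def orthogonalize(A, T):
    """Remove from A the part explained by the columns of T."""
    if T.shape[1] == 0:
        return A
    return A - T @ np.linalg.lstsq(T, A, rcond=None)[0]


def so_pls_scores(blocks, Y, n_lv):
    """Return one score matrix per block, extracted sequentially."""
    blocks = [B - B.mean(axis=0) for B in blocks]
    Y = Y - Y.mean(axis=0)
    all_scores = []
    for i, k in enumerate(n_lv):
        T = pls_scores(blocks[i], Y, k)                  # step 2
        all_scores.append(T)
        blocks[i + 1:] = [orthogonalize(B, T) for B in blocks[i + 1:]]  # step 3
        Y = orthogonalize(Y, T)                          # step 4
    return all_scores
```

With this construction the scores of later blocks are orthogonal to those of earlier blocks, which is what makes the extracted information complementary.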


CovSel

• Principle: To extract original variables from one block X which explain the covariance between X and Y

• Algorithm:
  1. Extract from (X, Y) the variable Xi that maximizes cov²(Xi, Y)
  2. Orthogonalize X wrt the already extracted Xi
  3. Orthogonalize Y wrt the already extracted Xi
  4. GOTO 1

CovSel = PLS with loadings as [0 0 … 1 … 0 0 0]
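A minimal NumPy sketch of the CovSel loop described above. It assumes column-centred data, a 2-D Y, and, for several responses, a criterion equal to the sum of squared covariances over the Y columns; the function name and toy data are hypothetical.

```python
import numpy as np


def covsel(X, Y, k):
    """Return the indices of k variables of X selected by covariance with Y."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    selected = []
    for _ in range(k):
        # Squared covariance of each column with Y (up to a constant factor),
        # summed over the responses.
        crit = ((X.T @ Y) ** 2).sum(axis=1)
        j = int(np.argmax(crit))
        selected.append(j)
        x = X[:, j].copy()
        # Orthogonalize X and Y with respect to the selected variable.
        X = X - np.outer(x, x @ X) / (x @ x)
        Y = Y - np.outer(x, x @ Y) / (x @ x)
    return selected


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))                       # made-up data
Y = (2.0 * X[:, 3] + 0.5 * X[:, 7]).reshape(-1, 1)  # Y built from two columns
print(covsel(X, Y, 2))
```

Because a selected column is projected out of X, its criterion drops to zero and it is not picked again, so the selection is non-redundant by construction.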


SO-CovSel

• Principle: To extract complementary original variables from each block, successively

• Algorithm:
  1. i = 1
  2. Extract ki original variables from (Xi, Y) by CovSel
  3. Orthogonalize each block > i wrt the already extracted variables
  4. Orthogonalize Y wrt the already extracted variables
  5. i = i + 1; GOTO 2
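The SO-CovSel loop can be sketched in the same way, replacing the PLS step by a CovSel step. This is a sketch under assumptions (column-centred blocks, 2-D Y, sum of squared covariances as the selection criterion), not the published implementation; all function names are hypothetical.

```python
import numpy as np


def covsel(X, Y, k):
    """Indices of k variables of X selected by squared covariance with Y."""
    X, Y = X.copy(), Y.copy()
    selected = []
    for _ in range(k):
        j = int(np.argmax(((X.T @ Y) ** 2).sum(axis=1)))
        selected.append(j)
        x = X[:, j].copy()
        X = X - np.outer(x, x @ X) / (x @ x)
        Y = Y - np.outer(x, x @ Y) / (x @ x)
    return selected


def orthogonalize(A, S):
    """Remove from A the part explained by the columns of S."""
    if S.shape[1] == 0:
        return A
    return A - S @ np.linalg.lstsq(S, A, rcond=None)[0]


def so_covsel(blocks, Y, n_sel):
    """Return, for each block, the list of selected variable indices."""
    blocks = [B - B.mean(axis=0) for B in blocks]
    Y = Y - Y.mean(axis=0)
    selection = []
    for i, k in enumerate(n_sel):
        idx = covsel(blocks[i], Y, k)                    # step 2
        selection.append(idx)
        S = blocks[i][:, idx]
        blocks[i + 1:] = [orthogonalize(B, S) for B in blocks[i + 1:]]  # step 3
        Y = orthogonalize(Y, S)                          # step 4
    return selection
```

Passing 0 for a block skips it, which mirrors settings where no variable is selected from a given block.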

Under publication in Journal of Chemometrics


SO-PLS vs SO-CovSel

• From two blocks X1 and X2

• Let K1, K2 be the number of (latent) variables to extract from X1, X2

• SO-PLS provides scores T1 (N×K1) and T2 (N×K2) as linear combinations of the original variables of X1 and X2

• SO-CovSel provides scores T1 (N×K1) and T2 (N×K2) as subsets of the original variables of X1 and X2

• Depending on the Y, the scores of both methods can be fed into:

• a linear regression -> SO-PLS-R & SO-CovSel-R

• a discriminant analysis -> SO-PLS-DA & SO-CovSel-DA


First example: Hazelnut origin discrimination

• 2 blocks: X (286 samples × 2112 variables) and Z (286 samples × 1000 variables)

• Spectral regions: overtone region and combination bands region

• 2 classes: 221 PDO Romana hazelnuts and 155 Common hazelnuts

• Training set of 286 samples

• Test set of 90 samples (49 Romana PDO, 41 Common)


Hazelnuts: Results

SO-CovSel-LDA (Selected V = 2, 10): 5 misclassified

Class     Predicted PDO   Predicted Common
PDO       39              2
Common    3               46

SO-PLS-LDA (Latent V = 3, 5): 6 misclassified

Class     Predicted PDO   Predicted Common
PDO       38              3
Common    3               46
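The misclassification counts follow directly from the off-diagonal entries of the two confusion matrices. A short check in Python, using only the numbers reported on the slide (the variable and function names are illustrative):

```python
# Test-set confusion matrices as reported on the slide
# (rows: actual class, columns: predicted PDO, predicted Common).
confusion = {
    "SO-CovSel-LDA": [[39, 2], [3, 46]],
    "SO-PLS-LDA": [[38, 3], [3, 46]],
}


def summarize(matrix):
    """Return (total, misclassified, accuracy) for a square confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return total, total - correct, correct / total


summary = {name: summarize(m) for name, m in confusion.items()}
for name, (total, wrong, acc) in summary.items():
    print(f"{name}: {wrong} of {total} misclassified, accuracy {acc:.3f}")
# SO-CovSel-LDA: 5 of 90 misclassified, accuracy 0.944
# SO-PLS-LDA: 6 of 90 misclassified, accuracy 0.933
```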


Hazelnuts: Results

CovSel: 2 variables selected in X, 9 variables selected in Z

VIP: 526 variables selected in X, 210 variables selected in Z


Second example: Polarisation spectroscopy

• Four complementary settings to measure the reflectance according to four polarisation states:

• none, perpendicular, at 45 degrees, circular

• Measurement of visible spectra of a complete mixture design of:

• 10 concentrations of methylene blue

• 5 concentrations of pure scattering medium


Second example: Polarisation spectroscopy

DATA:

[Figure: spectra of the four blocks plotted against wavelength (nm), 450 to 700 nm. Panels: X1, no polarisation; X2, perpendicular polarisation; X3, 45° polarisation; X4, circular polarisation.]

X1, X2, X3, X4: 4 blocks of 50 spectra × 91 wavelengths

Y: 1 block of 50 samples × 2 concentrations (Cabs, Cdif)


Second example: Polarisation spectroscopy

RESULTS (cross-validation by blocks of identical Cabs)

Model                                                  Cabs            Cdif
                                                       SECV    R2      SECV    R2
PLS on global reflectance                              20.7    0.71    0.92    0.54
SO-PLS on 4 blocks (#LV = 6, 7, 2, 0)                  13.6    0.88    0.58    0.82
SO-CovSel on 4 blocks (selected variables 7, 3, 2, 3)  15.6    0.84    0.82    0.72
SO-CovSel on 4 blocks (selected variables 1, 2, 1, 0)  15.1    0.85    1.3     0.2
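Cross-validation by blocks of identical Cabs amounts to leaving out, in turn, all samples that share one Cabs level. A minimal sketch with an ordinary least-squares model is shown below. It rests on assumptions not stated on the slide: SECV is taken here as the root mean square of the cross-validated residuals, the regression model is plain least squares with an intercept, and the function names and toy data are hypothetical.

```python
import numpy as np


def grouped_cv_predictions(X, y, groups):
    """Leave-one-group-out predictions from a least-squares model."""
    y_cv = np.empty(len(y), dtype=float)
    for g in np.unique(groups):
        test = groups == g
        # Fit on all other groups (intercept + selected variables).
        A_train = np.column_stack([np.ones((~test).sum()), X[~test]])
        coef, *_ = np.linalg.lstsq(A_train, y[~test], rcond=None)
        A_test = np.column_stack([np.ones(test.sum()), X[test]])
        y_cv[test] = A_test @ coef
    return y_cv


def secv_and_r2(y, y_cv):
    """SECV (assumed: RMS of CV residuals) and cross-validated R2."""
    res = y - y_cv
    secv = np.sqrt(np.mean(res ** 2))
    r2 = 1.0 - np.sum(res ** 2) / np.sum((y - y.mean()) ** 2)
    return secv, r2


# Made-up data mimicking the design: 10 Cabs levels x 5 samples each.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 1.0 + 2.0 * X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=50)
groups = np.repeat(np.arange(10), 5)
print(secv_and_r2(y, grouped_cv_predictions(X, y, groups)))
```

Grouping by Cabs level keeps samples with the same absorber concentration out of the training set, so each prediction is made at a concentration the model has not seen.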


Second example: Polarisation spectroscopy

[Figure: four panels, one per block. X1: no polarisation; X2: perpendicular polarisation; X3: 45° polarisation; X4: circular polarisation.]

• Selected variables:

• 559 nm on back scattering

• 666 & 700 nm on perpendicular polarisation

• 700 nm on 45° polarisation

It should be feasible to build a sensor with 3 wavelengths and 3 polarizers.

Possible application: chlorophyll content in vegetation


Third example: Sensory analysis of chocolates

• 208 chocolate samples belonging to 4 sensory poles

• Measurement of global fingerprints by:

• NIRS, PTRMS, SPME, 3D-Fluorescence, 8 organic acids by HPLC

• 144 samples for calibration

• 62 samples for validation


Third example: Sensory analysis of chocolates

1 sample misclassified out of 62


Third example: Sensory analysis of chocolates

Technique   #Sel   Selection
Vis-NIRS    3      408; 598; 1116 nm
PTRMS       20     41.054; 43.055; 43.068; 44.995; 61.064; 68.055; 68.075; 69.086; 82.984; 85.064; 89.084; 93.041; 101.094; 104.046; 112.150; 114.145; 115.111; 119.068; 125.058; 127.029
SPME        1      ACIacet
3D-Fluo     0
Acids       1      Citric


Conclusion

• SO-CovSel is an adaptation of SO-PLS (a multi-block PLS method)

• It is able to select non-redundant variables in each block

• The selection is optimal in terms of covariance

• The selection is parsimonious

• SO-CovSel can be used on any type of blocks, for regression or discrimination


Acknowledgements

• Arnaud Ducanchez for the data on polarized spectroscopy

• Agropolis Fondation for supporting the Chaman project on chocolates and Zoé Deuscher and Jean-Luc Le Quéré from the Centre des Sciences du Goût et de l’Alimentation, Dijon, for the PTRMS data