SO-CovSel: A NEW METHOD FOR VARIABLE SELECTION
IN MULTI-BLOCK DATA
Alessandra Biancolillo 1, Federico Marini 1, Jean-Michel Roger 2
1 University of Rome La Sapienza, Piazzale Aldo Moro 5, I-00185, Rome, Italy,2 ITAP, Irstea, Montpellier SupAgro, Univ Montpellier, Montpellier, France
Tuning the significance level of SIMCA models for reducing the impact of strong class overlap: a novel approach
R. Vitale1,2, F. Marini3, C. Ruckebusch2
1 Molecular Imaging and Photonics Unit, Department of Chemistry, Katholieke Universiteit Leuven, Celestijnenlaan 200F, B-3001 Leuven, Belgium
2 Laboratoire de Spectrochimie Infrarouge et Raman – UMR 8516, Université de Lille, Bâtiment C5, 59655 Villeneuve d’Ascq, France
3 Department of Chemistry, Università degli Studi di Roma “La Sapienza”, Piazzale Aldo Moro 5, 00185 Roma, Italy
Keywords: Class Modelling (CM), Soft Independent Modelling of Class Analogy (SIMCA), Receiver Operating Characteristic (ROC) curves.
1 Introduction Nowadays, a large number of problems in fields like foodstuff origin authentication, quality control or process monitoring is addressed by Class Modelling (CM) statistical methods. Techniques such as UNEQual class modelling (UNEQ [1]) or Soft Independent Modelling of Class Analogy (SIMCA [2]) have been extensively used in the last decades for similar purposes. Contrarily to the more popular Discriminant Analysis (DA), the basic principle of CM is that classification rules are derived using only samples/objects belonging to a single target category. Faults in the definition of non-target categories, which could bias the classification performance, can thus be avoided. Nevertheless, it is also well-known that if the classes under study present a high degree of overlap, CM approaches might suffer from severe limitations. In cases like this, properly adjusting the significance level of the resulting models can represent a potential solution to guarantee a better compromise between True Positive and True Negative rate. In this work, a new data-driven methodology that exploits the concept of Receiver Operating Characteristic (ROC) curve [3] is proposed to address such a task. Its only requirement is that measurements for samples belonging to non-target classes are also available. Although this is actually not strictly needed in the CM context, it can be highly beneficial in all situations in which significant overlapping exists between categories. This presentation explores the potential of this procedure as a possible way of tuning SIMCA model parameters in circumstances like this.
2 Theory Let X be an N×J dataset constituted by Z submatrices Xz (Nz×J), each one containing the measurements collected for a single class of samples. In SIMCA, every category of objects is modelled independently from the others based on a Principal Component Analysis (PCA) model of appropriate dimensionality or complexity (say Az) as:
Xz = TzPTz +Ez (1)
Introduction
• More and more multi-block data:
• same samples measured by different techniques
• A lot of methods for analyzing all the blocks simultaneously
• CCSWA (ComDim), MBPCA, MBPLS, SO-PLS, Statis, …
• But there is a lack of dedicated variable selection methods
The idea
To mix SO-PLS [1] and CovSel [2] because they act similarly
[1]: T. Næs, O. Tomic, B.-H. Mevik, H. Martens, Path modelling by sequential PLS regression (2011), J. Chemometr. 25, 28–40
[2]: J.M.Roger, B.Palagos, D.Bertrand, E.Fernandez-Ahumada, (2011) Chemometr. Intel. Lab. Syst.106, 216-223
SO-PLS
• Principle: To extract complementary latent variables from each block, successively
• Algorithm:1. i=12. Extract ki PLS latent variables from (Xi, Y)3. Orthogonalize each block > i wrt already extracted LV4. Orthogonalize Y wrt already extracted LV5. i=i+1; GOTO 2
CovSel• Principle: To extract original variables from one block X
which explain the Covariance between X and Y
• Algorithm:1. Extract the variable i from (X, Y) / Max( cov2(Xi,Y) )2. Orthogonalize X wrt already extracted Xi3. Orthogonalize Y wrt already extracted Xi4. GOTO 1
CovSel = PLS with loadings as [0 0 … 1 … 0 0 0]
SO - CovSel
• Principle: To extract complementary original variables from each block, successively
• Algorithm:1. i=12. Extract ki original variables from (Xi, Y) by CovSel3. Orthogonalize each block > i wrt already extracted variables4. Orthogonalize Y wrt already extracted variables5. i=i+1; GOTO 2
Under publication in Journal of Chemometrics
SO-PLS vs SO-CovSel• From two blocks X1 and X2
• Let K1, K2 be the number of (latent) variables to extract from X1, X2
• SO-PLS provides scores T1(NxK1) T2(NxK2), as linear combination of the original variables of X1 and X2
• SO-CovSel provides scores T1(NxK1) T2(NxK2), as subsets of original variables of X1 and X2
• Depending on the Y, the scores of both methods can be inputted into:
• a linear regression -> SO-PLS-R & SO-CovSel-R
• a discriminant analysis -> SO-PLS-DA & SO-CovSel-DA
First example: Hazelnut origin discrimination
286
Sam
ples
2112 Variables
X Z
1000 Variables
286
Sam
ples
Training Set of 286 samples
Test Set of 90 samples
49 Romana PDO 41 Common
221 PDO Romana Hazelnut
155 Common Hazelnut
2 Classes
Overtone region Combination bands region
Overtone region Combination bands region
Hazelnuts: Results
SO-CovSel-LDA
Class Predicted PDO PredictedCommon
PDO 39 2
Common 3 46
SO-PLS-LDA
Class Predicted PDO PredictedCommon
PDO 38 3
Common 3 46
6 Misclassified
5 Misclassified
Latent V = 3, 5
Selected V = 2, 10
Hazelnuts: Results
CovSel
VIP
X: 2 variables selected
Z: 9 variables selected
X: 526 variables selected
Z: 210 variables selected
Second example: Polarisation spectroscopy
• Four complementary settings to measure the reflectance according to four polarisation states :
• none, perpendicular, at 45 degrees, circular
• Measurement of visible spectra of a complete mixture design of:
• 10 concentrations of Blue Methylene
• 5 concentrations of pure scattering medium
Second example: Polarisation spectroscopy
DATA:
450 500 550 600 650 700
2
4
6
8
10
12×10-3
450 500 550 600 650 700
2
4
6
8
10
×10-5
450 500 550 600 650 700wavelength (nm)
0
2
4
6
8×10-5
450 500 550 600 650 700wavelength (nm)
-4-3-2-1012
×10-5
X1 : no pola X2 : perpendicular pola
X3 : 45° pola X4 : circular pola
X1, X2, X3, X4 : 4 blocs of 50 spectra x 91 wavelengths
Y: 1 bloc of 50 samples x 2 concentrations Cabs, Cdif
Second example: Polarisation spectroscopy
Cabs Cdif
SECV R2 SECV R2
PLS on global reflectance 20.7 0.71 0.92 0.54
SO-PLS on 4 blocks#LV = 6, 7, 2, 0 13.6 0.88 0.58 0.82
SO-CovSel on 4 blocks7, 3, 2, 3 15.6 0.84 0.82 0.72
SO-CovSel on 4 blocks1, 2, 1, 0 15.1 0.85 1.3 0.2
RESULTS
Cross Validation by blocks of identical Cabs
Second example: Polarisation spectroscopy
0 20 40 60 80 1000
0.005
0.01
0.015
0 20 40 60 80 1000
0.2
0.4
0.6
0.8
1
1.2 #10-4
0 20 40 60 80 100-2
0
2
4
6
8
10 #10-5
0 20 40 60 80 100-6
-4
-2
0
2
4 #10-5
• Selected variables :• 559 nm on back scattering• 666 & 700 nm on perpendicular polarisation• 700 nm on 45° polarisation
Should be feasible to build a sensor with 3 wavelengths and 3 polarizers Possible application: chlorophyll content in vegetation
X1 : no pola X2 : perpendicular pola
X3 : 45° pola X4 : circular pola
Third example: Sensory analysis of chocolates
• 208 chocolates samples belonging to 4 sensory poles
• Measurement of global fingerprints by:
• NIRS, PTRMS, SPME, 3D-Fluorescence, 8 organic acids by HPLC
• 144 samples for calibration
• 62 samples for validation
Third example: Sensory analysis of chocolates
1 sample misclassified over 62
Third example: Sensory analysis of chocolates
Technique #Sel Selection
Vis-NIRS 3 408; 598; 1116 nm
PRTMS 20 41.054;43.055; 43.068; 44.995; 61.064; 68.055; 68.075; 69.086;82.984; 85.064; 89.084; 93.041; 101.094; 104.046112.150; 114.145; 115.111; 119.068; 125.058; 127.029
SPME 1 ACIacet
3D-Fluo 0
Acids 1 Citric
Conclusion
• SO-CovSel is an adaptation of SO-PLS multi-block PLS
• It is able to select non redundant variables in each block
• The selection is optimal in terms of covariance
• The selection is parsimonious
• SO-CovSel can be used on any type of blocks, for regression or discrimination
Acknowledgements
• Arnaud Ducanchez for the data on polarized spectroscopy
• Agropolis Fondation for supporting the Chaman project on chocolates and Zoé Deuscher and Jean-Luc Le Quéré from the Centre des Sciences du Goût et de l’Alimentation, Dijon, for the PTRMS data