
Research Article

Received: 11 October 2013; Revised: 19 January 2014; Accepted: 24 January 2014; Published online in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/cem.2602

Can we beat over-fitting?

Olivier Cloarec*

Over-fitting in multivariate regression is often viewed as the consequence of the number of variables. However, it is almost counterintuitive that the number of variables used to fit a regression model increases the risk of over-fitting instead of adding useful information. In this paper, we will be discussing the source of over-fitting and ways of reducing it during the computation of partial least squares (PLS) components. A close look at the linear algebra used for PLS component calculation will highlight hints of the origin of over-fitting. Simulation of multivariate datasets will explore the influence of noise, number of variables and complexity of the underlying latent variable structure on over-fitting. A tentative solution to overcome the identified problem will be presented, and a new PLS algorithm will be proposed. Finally, the properties of this new algorithm will be explored. Copyright © 2014 John Wiley & Sons, Ltd.

Keywords: partial least squares; PLS; over-fitting; validation

1. INTRODUCTION

Partial least squares (PLS) methods were developed by Herman Wold and co-workers [1,2] in the early 1980s and are now widely used to analyse multivariate datasets with a number of variables greater than the number of observations. Initially receiving much attention in the domain of chemometrics, PLS has become a standard tool to solve problems related to chemistry, such as structure–activity relationships and spectroscopic multivariate calibration [3]. Owing to this success, its application range has widened, and it is now used in many other fields where the number of observed variables is greater than the number of observations, such as medicine, pharmacology, physiology, social sciences or econometrics [4–7].

In this paper, we will only consider PLS1, which deals with a single response variable; subsequent references to PLS therefore refer to PLS1 only. The PLS model is based on the assumption that the multivariate data X are composed of p_lv latent variables that result in p observed variables (p >> p_lv), and that one latent variable, or a combination of latent variables, is linearly related to the response variable y. When a PLS model is fitted, the n observations are projected onto the latent structures having the highest covariance with the response variable y. For this reason, the acronym PLS is also often read as projection on latent structures.

Fitting PLS models involves the iterative addition of components. Each component is composed of a set of vectors and scalars describing a latent variable. These vectors and scalars are calculated from the covariance between X and y and between their respective residuals after removing the contribution of the prior components.

The goodness of fit is estimated using the determination coefficient R². However, this parameter is not reliable for assessing the validity of the fit because it tends to inflate with the number of components and with the variability of the data. An inflated R² is symptomatic of over-fitting. In order to estimate the degree of over-fitting in a PLS regression, the Q², a statistic generated by cross-validation, is popular [8,9]. The calculations of R² and Q² are given in the Appendix. As a rule of thumb, the greater the difference between Q² and R², the greater the over-fitting. In addition, once the model has been validated using the Q², possibly including random permutation re-sampling to test the validity of the Q² itself [10], the interpretation uses the scores and loadings of the initial regression. This means that even though the regression has been validated, its interpretation is very often based on over-fitted parameters and can potentially lead to erroneous interpretation of the data and even to wrong conclusions. When there is no relation between X and y, Q² can even be negative, which may be rather confusing considering the squared notation.

The first part of this paper introduces the theory of the PLS component calculation, investigates the cause of over-fitting and proposes a way of reducing it. Simulations were performed to test the proposed solution on single-component calculation, and their results are presented. This is followed by the proposal of a modified non-linear iterative PLS (NIPALS) algorithm that takes the correction into account. Finally, the new algorithm is compared with the classic NIPALS PLS algorithm in the case of multicomponent PLS models.

2. THEORY

The PLS regression algorithm is commonly used to model the relationship between a response variable y and a set of p predictor variables {x_1, x_2, x_3, ..., x_p} when the matrix formed by the predictor variables is not of full rank.

* Correspondence to: Olivier Cloarec, Korrigan Sciences Limited, 9 Imperial Place, Maidenhead SL6 2GN, UK. E-mail: [email protected]

Korrigan Sciences Limited, 9 Imperial Place, Maidenhead SL6 2GN, UK


Using matrix notation, this relationship can be written as

$y = X\beta + \varepsilon$    (1)

where β is the p × 1 regression vector that has to be estimated to perform further prediction and ε the n × 1 residual vector related to the deviation between the measured and modelled response values. The PLS algorithm allows the calculation of an estimate of β when the standard least-squares solution cannot be calculated because of the non-existence of the inverse of X'X, which makes it impossible to estimate β as

$b = (X'X)^{-1}X'y$    (2)
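As an aside, a minimal NumPy check (with hypothetical sizes n = 50 and p = 1000, chosen here only for illustration) shows why Equation (2) cannot be used when p > n: the rank of X'X is at most n, so its inverse does not exist.

import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 1000                        # more variables than observations
X = rng.standard_normal((n, p))
XtX = X.T @ X                          # p x p matrix, but its rank is at most n
print(np.linalg.matrix_rank(XtX))      # 50, far below p = 1000, so (X'X)^-1 does not exist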

The PLS algorithm decomposes the X matrix into a series of n_c components that can be combined to calculate an estimate of β. The PLS regression model for X and y is

$X = TP' + F$    (3)

$y = Tc + e$    (4)

where P and T are the loadings and scores matrices, respectively, and c is the vector of regression coefficients between the scores and y. F and e are the residual matrix and vector, respectively.

The PLS regression coefficient estimate, b_pls, is given by

$b_{pls} = W(P'W)^{-1}c$    (5)

where W is the p × n_c weight matrix. Its columns are calculated during the first step of the PLS algorithm. The NIPALS algorithm for a components can be written as follows:

X_0 = X
y_0 = y
for h = 1, 2, ..., a do
    w_h = X'_{h-1} y_{h-1}
    normalise w_h to 1
    t_h = X_{h-1} w_h
    p_h = X'_{h-1} t_h / (t'_h t_h)
    c_h = y'_{h-1} t_h / (t'_h t_h)
    X_h = X_{h-1} - t_h p'_h
    y_h = y_{h-1} - c_h t_h
end for
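For readers who prefer code, the following is a minimal NumPy sketch of this listing combined with Equation (5). It assumes X and y are already mean-centred; the function name nipals_pls1 is chosen here and is not from the paper.

import numpy as np

def nipals_pls1(X, y, n_components):
    # Minimal NIPALS PLS1 following the listing above (X and y assumed centred).
    n, p = X.shape
    Xh, yh = X.copy(), y.astype(float).copy()
    W = np.zeros((p, n_components))     # weights w_h
    P = np.zeros((p, n_components))     # loadings p_h
    T = np.zeros((n, n_components))     # scores t_h
    c = np.zeros(n_components)          # inner coefficients c_h
    for h in range(n_components):
        w = Xh.T @ yh                   # w_h = X'_{h-1} y_{h-1}
        w /= np.linalg.norm(w)          # normalise w_h to 1
        t = Xh @ w                      # t_h = X_{h-1} w_h
        tt = t @ t
        ph = Xh.T @ t / tt              # p_h = X'_{h-1} t_h / t'_h t_h
        ch = yh @ t / tt                # c_h = y'_{h-1} t_h / t'_h t_h
        Xh = Xh - np.outer(t, ph)       # deflate X
        yh = yh - ch * t                # deflate y
        W[:, h], P[:, h], T[:, h], c[h] = w, ph, t, ch
    b = W @ np.linalg.solve(P.T @ W, c) # Equation (5): b_pls = W (P'W)^{-1} c
    return b, T, P, W, c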

Looking at the calculation of the scores (t_h) for each component, we can notice that

$t_h \propto X_{h-1}X'_{h-1}y_{h-1}$    (6)

This means that the scores result from a linear transformation of y by the matrix XX'. We are now going to explore the structure of XX' when X is composed of vectors derived from linear combinations of latent structures with different respective weights. Such a linear model can be found in many datasets and is very common in spectroscopy. This model can be written in matrix form as

$X = CS' + E$    (7)

where C is the matrix of weights of the latent variables (e.g. chemical concentrations), S holds the characteristic profiles of the latent variables (e.g. pure compound spectra) and E is a matrix of white noise (constant variance with expectation of 0) that is related, for example, to instrumental noise. Using this model, we can then deduce that

$XX' = CS'SC' + CS'E' + ESC' + EE'$    (8)

The matrix CS'SC' contains the information related to the noise-free latent structure; the matrix CS'E' and its transpose ESC' represent the interaction between the noise and the latent structure; and finally, EE' contains the specific influence of the noise on the calculation of the scores. The structure of the matrix EE' is

$EE' = \begin{pmatrix} \sum_i e_{1,i}^2 & \sum_i e_{1,i}e_{2,i} & \cdots & \sum_i e_{1,i}e_{n,i} \\ \sum_i e_{2,i}e_{1,i} & \sum_i e_{2,i}^2 & \cdots & \sum_i e_{2,i}e_{n,i} \\ \vdots & \vdots & \ddots & \vdots \\ \sum_i e_{n,i}e_{1,i} & \sum_i e_{n,i}e_{2,i} & \cdots & \sum_i e_{n,i}^2 \end{pmatrix}$    (9)

From the expression of EE', it can be seen that the diagonal is made of sums of squares of p elements (the number of variables). This means that the more variables there are, the bigger these sums of squares become. The other elements of the matrix are sums of products of two noise contributions. Assuming that the noise contributions are independent and that the noise level is the same for all observations, the expectations of $\sum_i e_{j,i}^2$ and $\sum_i e_{k,i}e_{j,i}$ (for $k \neq j$) are a constant number and 0, respectively. Therefore, the expectation of EE' is a diagonal matrix, and as a consequence, XX' contains the contribution of a diagonal matrix whose magnitude is proportional to the noise intensity. Thus, given Equation (6), the scores of the PLS components contain a contribution correlated to y, whether or not X contains information related to y, and this contribution is generated by the diagonal of EE' through XX'.
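The following small simulation (with sizes chosen arbitrarily here, not taken from the paper) illustrates both points: for a pure white-noise E, the diagonal of EE' averages to roughly p times the noise variance while the off-diagonal elements average to zero, and scores computed as XX'y from such a matrix are nevertheless strongly correlated with a random y.

import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 50, 5000, 1.0
E = rng.normal(0.0, sigma, size=(n, p))     # pure white noise, no latent structure
EEt = E @ E.T

diag = np.diag(EEt)
off = EEt[~np.eye(n, dtype=bool)]
print(diag.mean())                          # ~ p * sigma^2: sum of p squared noise terms
print(off.mean())                           # ~ 0: sums of products of independent noise terms

# Equation (6): the first-component scores are proportional to XX'y, so the
# nearly diagonal EE' maps y almost onto itself and inflates r2(t, y).
y = rng.standard_normal(n)
t = EEt @ (y - y.mean())
print(np.corrcoef(t, y)[0, 1] ** 2)         # close to 1 although X carries no information on y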

3. SIMULATED DATA GENERATION

Simulated data have been generated using an in-house routine written in Python 2.7.2 (www.python.org) using Numpy 1.6.1 (www.numpy.org). The programme is able to generate multivariate data containing several latent variables with different levels of noise. The dataset is built using the following procedure:

• The number of variables as well as the number of latent variables is determined.
• For each latent variable, a number of peaks is selected randomly from a uniform discrete random distribution {1, 2, 3, ..., 10} in order to generate a spectrum:
  – Relative peak heights are sampled from a uniform distribution U(0, 1).
  – Peak widths are sampled from a uniform distribution U(2, 3) and are proportional to the number of variables.
  – Positions of the peaks are sampled from a uniform distribution U(2, p_ob).
• The expected intensity i_lv of each latent variable is randomly sampled from an exponential distribution E(5).
• The variance v_lv of the intensity for each latent variable is randomly sampled from a uniform distribution U(0.8, i_lv).
• The intensity of each latent variable in each observation is sampled from a normal distribution N(i_lv, v_lv).
• Each non-noisy observation (a vector of p numbers) is a sum of the latent variables.
• Finally, a chosen level of noise is added to the observation. The noise is sampled from a normal distribution with zero mean and a chosen standard deviation.


Figure 1. Examples of simulated datasets for n = 50, p = 1000 with different signal-to-noise ratios (S/N). (A) S/N = 1000, (B) S/N = 10.

In this study, the level of noise will be characterised as the noise-to-maximum-signal ratio in the dataset.

Some parameters, such as the number of variables, the possible number of peaks and the parameter of the exponential distribution, were selected arbitrarily according to the author's experience with real data, such as spectroscopic data or micro-array data. Figure 1 shows two examples of simulated X with different levels of noise. A compact code sketch of this generation procedure is given at the end of this section.

To simulate the null hypothesis, y is randomly generated from a normal distribution. To simulate an existing covariance between a latent variable and y, the intensities of a latent variable used to build X are used as y. Different levels of random variation can also be added to the latent variable intensities used as y until these become equivalent to a totally random y.
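Since the exact routine is not reproduced in the paper, the sketch below is only a rough, compressed version of the procedure described above; Gaussian peak shapes, the width scaling and the function name simulate_dataset are assumptions made here for illustration.

import numpy as np

def simulate_dataset(n_obs=50, n_vars=1000, n_latent=3, noise_sd=0.01, seed=0):
    # Rough sketch of the Section 3 procedure; peak shape and scaling are assumptions.
    rng = np.random.default_rng(seed)
    axis = np.arange(n_vars)
    S = np.zeros((n_latent, n_vars))                    # latent variable profiles ("spectra")
    for lv in range(n_latent):
        for _ in range(rng.integers(1, 11)):            # 1 to 10 peaks per latent variable
            height = rng.uniform(0.0, 1.0)              # relative peak height ~ U(0, 1)
            width = rng.uniform(2.0, 3.0) * n_vars / 1000.0   # width ~ U(2, 3), scaled with p
            pos = rng.uniform(2.0, n_vars)              # peak position ~ U(2, p)
            S[lv] += height * np.exp(-0.5 * ((axis - pos) / width) ** 2)
    i_lv = rng.exponential(5.0, n_latent)               # expected intensities ~ E(5)
    v_lv = rng.uniform(0.8, i_lv)                       # intensity variances ~ U(0.8, i_lv)
    C = rng.normal(i_lv, np.sqrt(v_lv), (n_obs, n_latent))    # intensities ~ N(i_lv, v_lv)
    E = rng.normal(0.0, noise_sd, (n_obs, n_vars))      # white noise of chosen level
    X = C @ S + E                                       # Equation (7): X = CS' + E
    return X, C

A null-hypothesis y can then be drawn from an independent normal distribution, while a y related to X is obtained from a column of C, optionally with added noise.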

4. CORRECTION OF XX'

We have seen in Section 2 that the diagonal of EE' and, as a consequence, the diagonal of XX' are at least partially responsible for the over-fitting often observed in PLS regression. Figure 2(A) shows the results of the PLS first-component calculation using different simulated datasets with an increasing number of variables but containing only noise.

Figure 2. Correlations between the scores of the first PLS component and y. (A) The scores were calculated using an X matrix made of p variables sampled from a standard normal distribution and 50 observations. For each p, the simulation was run 20 times. (B) The signal-to-noise ratio was gradually raised for a dataset composed of three latent variables and 50 observations for X and a random y.

It appears that the more variables there are, the more likely over-fitting will occur, as indicated by the inflation of the correlation between the scores and y. Similarly, when noise is increasingly added to a dataset, the correlation between the scores of the first component and y eventually rises, whether or not there is a latent structure related to y in the data (Figure 2(B)).

What is proposed here is to reduce the influence of this diagonal by zeroing its elements. However, this weighting of the diagonal has a side effect: the resulting XX' no longer corresponds to a centred X. In this case, the calculated scores and predicted y would not be centred, making the comparison with y more difficult. Furthermore, a bias would appear when the contribution of such a component is removed from X before the calculation of the next component. This is why a correction must be applied to re-centre XX'. It is performed by subtracting the mean of the remaining elements from each remaining element. In summary, if Z is the corrected matrix, its elements are, for all i,

$z_{i,i} = 0$    (10)

and for all i ≠ j

$z_{i,j} = (XX')_{i,j} - \frac{1}{n^2 - n}\sum_{k \neq l}(XX')_{k,l}$    (11)
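Equations (10) and (11) translate directly into a few lines of NumPy; the sketch below (the function name correct_kernel is chosen here) is reused in the sketch of the modified algorithm in Section 5.

import numpy as np

def correct_kernel(X):
    # Corrected XX': zero the diagonal (Equation 10) and re-centre the
    # remaining elements by subtracting their mean (Equation 11).
    Z = X @ X.T
    n = Z.shape[0]
    np.fill_diagonal(Z, 0.0)                    # z_ii = 0
    off = ~np.eye(n, dtype=bool)
    Z[off] -= Z[off].sum() / (n ** 2 - n)       # subtract the mean of the n^2 - n off-diagonal terms
    return Z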


Figure 3. Correlations between the scores of the first PLS component and y after zeroing the diagonal of XX'. (A) The scores were calculated using an X matrix made of p variables sampled from a standard normal distribution and 50 observations. For each p, the simulation was run 20 times. (B) The signal-to-noise ratio was gradually raised for a dataset composed of three latent variables and 50 observations for X and a random y.

Simulations were performed to investigate the effect of this correction on over-fitting. The results are presented in Figure 3. Comparing these results with those shown in Figure 2 demonstrates that the correlation between the scores calculated for the first component is now independent of the number of variables in X. Furthermore, a low signal-to-noise ratio no longer leads to a high correlation between y and t.

Further simulations were used to investigate the effect of the corrected XX' on the score calculation when there is actually an existing correlation between y and a latent variable used to build X. Here, the level of correlation between a latent variable in X and y is decreased by adding different noise intensities to X.

Figure 4 presents the output of these simulations. It shows that for a high signal-to-noise ratio and a strong real correlation between a latent variable and y, the corrected XX' provides a fit similar to that of NIPALS PLS. The average r² obtained using the corrected XX' is always slightly lower than that obtained by a normal PLS component; however, it can be argued that this slight superiority of the normal fit could be due to the remaining over-fitting. The dispersion of the r² for both components when the signal-to-noise ratio is greater than 1 is due to the overlap between the signals of the latent variable profiles used to construct X. For the lower r², this would mean that further components are required to fit a better PLS model. For a signal-to-noise ratio around 2, the NIPALS PLS r² is still high and under the influence of over-fitting, even though it might start to grasp some signal related to y from the noisy X. On the other hand, the modified PLS seems to catch signal related to y, with an increase of its r².

Figure 4. Correlation between y and the scores of the first component for the NIPALS PLS and the modified PLS when X is affected by different signal-to-noise ratios. The scores were calculated using an X matrix containing structures related to y and a varying level of noise.

5. MODIFIED NIPALS PLS ALGORITHM

Incorporating the new way of calculating the scores into the NIPALS algorithm requires further modifications of the initial algorithm. Firstly, the weights are no longer calculated, so the normalisation has to be performed on the scores after they have been calculated. After calculation of all the requested components, an orthogonalisation of P is added, and the scores are recomputed from the orthogonalised P. Then the regression coefficients are deduced from P and c.

X_0 = X
y_0 = y
for h = 1, 2, ..., a do
    Z_{h-1} = X_{h-1} X'_{h-1}
    z_{i,i} = 0 for all i
    z_{i,j} = z_{i,j} - z̄ for i ≠ j, with z̄ = (1/(n² - n)) Σ_{i≠j} z_{i,j} and n the number of observations
    t_h = Z_{h-1} y_{h-1}
    normalise t_h to 1
    p_h = X'_{h-1} t_h
    c_h = y'_{h-1} t_h
    X_h = X_{h-1} - t_h p'_h
    y_h = y_{h-1} - c_h t_h
end for
Orthogonalisation of P
T = XP
c = (T'T)^{-1} T'y
b = Pc
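A NumPy sketch of this modified algorithm is given below. It reuses the correct_kernel function from the previous sketch, assumes X and y are centred, and uses a QR decomposition for the orthogonalisation of P, which is one possible choice the listing above does not make explicit.

import numpy as np

def modified_pls1(X, y, n_components):
    # Sketch of the modified NIPALS PLS1 of Section 5; X and y assumed centred.
    n, p = X.shape
    Xh, yh = X.copy(), y.astype(float).copy()
    P = np.zeros((p, n_components))
    for h in range(n_components):
        Z = correct_kernel(Xh)              # corrected X_{h-1} X'_{h-1}
        t = Z @ yh                          # t_h = Z_{h-1} y_{h-1}
        t /= np.linalg.norm(t)              # normalise t_h to 1
        ph = Xh.T @ t                       # p_h = X'_{h-1} t_h
        ch = yh @ t                         # c_h = y'_{h-1} t_h
        Xh = Xh - np.outer(t, ph)           # deflate X
        yh = yh - ch * t                    # deflate y
        P[:, h] = ph
    P, _ = np.linalg.qr(P)                  # orthogonalisation of P (QR is an assumption)
    T = X @ P                               # recompute the scores from the orthogonalised P
    c = np.linalg.solve(T.T @ T, T.T @ y)   # c = (T'T)^{-1} T'y
    b = P @ c                               # regression vector b = Pc
    return b, T, P, c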

The performance of the modified algorithm was compared with that of the classic NIPALS PLS algorithm on data with increasing levels of noise in X and y. An initial noise-free dataset was simulated with four latent variables (n = 30, p = 1000). In this noise-free dataset, y was taken as the intensity of the first latent variable used to generate X. A two-component model was fitted using the NIPALS PLS and the modified algorithms, and R² = 0.988 and R² = 0.981 were obtained, respectively. This shows that for this dataset, not all of the variance generated by the latent variables used to simulate the data needed to be taken into account to obtain a very good fit. Some noise sampled from a normal distribution was then added to both X and y. This noise addition was repeated with different intensity levels. The R² obtained from both algorithms was collected for each combination of noise levels in X and y.


Figure 5. Influence of the noise contained in X and y on the determination coefficient R² of PLS models fitted with the classic and modified PLS algorithms.

These results are summarised in the heatmaps presented in Figure 5. They show clearly that the PLS models fitted with the classic PLS algorithm do not reflect reality. The R² decreases with the noise added to y only for low levels of noise in X. With moderate and high noise levels in X, the R² is always very high whatever the noise in y. When fitted with the modified algorithm, the R² decreases with the noise added to y and X, as it should. Based on these simulations, the modified algorithm greatly reduces over-fitting compared with the classic PLS and can therefore lead to more reliable and more accurate interpretation.

6. CONCLUSION

In this paper, it has been shown that some of the over-fitting observed in PLS regression has its origin in the way the scores are calculated. The noise that is part of the independent variables matrix (X) adds to the diagonal of XX', which is used in the calculation of the scores in the NIPALS algorithm. The expectation of this noise contribution is a (positive real) number times the identity matrix; the noisier the data, the more weight this identity matrix has on the calculation of the scores, which in turn results in over-fitting. A correction by zeroing the diagonal of XX' has been proposed and tested, and it provides a more reliable calculation of the scores by removing the influence of the noise-dependent diagonal matrix. Furthermore, the number of variables no longer increases the potential over-fitting, which reduces the need for variable selection and allows the use of all the variables measured in an experiment. When integrated in a modified NIPALS algorithm, the new PLS can therefore be used for multivariate calibration or pattern recognition with much less risk that the conclusions of the study or future predictions are erroneous because of over-fitting. However, for prediction purposes, this modified PLS does not need model validation with the use of new data. This new PLS algorithm delivers a safer method for multivariate analysis with a lower risk of over-interpretation of the results. However, further work is needed to extend the use of this algorithm to multi-dimensional response variables. Finally, as an answer to the title of this paper, it seems that over-fitting can be dealt with if we look carefully into the reason for such behaviour; once the origin of the problem is defined, a correction can be applied to obtain more robust and reliable chemometric tools. So yes, we can beat over-fitting!

REFERENCES

1. Wold H. Partial least squares. In Encyclopedia of the Statistical Sciences, vol. 6. John Wiley: New York, 1985.
2. Wold S, Ruhe A, Wold H, Dunn WJ, III. The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM J. Sci. Stat. Comput. 1984; 5(3): 735–743.
3. Martens H, Naes T. Multivariate Calibration (2nd ed), vol. 1. Wiley: Chichester, 1989.
4. Clayton TA, Lindon JC, Cloarec O, Antti H, Charuel C, Hanton G, Provost JP, Le Net JL, Baker D, Walley RJ, Everett JR, Nicholson JK. Pharmaco-metabonomic phenotyping and personalized drug treatment. Nature 2006; 440(7087): 1073–1077.
5. Hulland JS. Use of partial least squares (PLS) in strategic management research: a review of four recent studies. Strateg. Manage. J. 1999; 20: 195–204.
6. Lobaugh NJ, West R, McIntosh AR. Spatiotemporal analysis of experimental differences in event-related potential data with partial least squares. Psychophysiology 2001; 38: 517–530.
7. Nguyen DV, Rocke DM. Tumor classification by partial least squares using microarray gene expression data. Bioinformatics 2002; 18: 39–50.
8. Tenenhaus M. La Régression PLS. Technip: Paris, 1998.
9. Wakeling IN, Morris JJ. A test of significance for partial least squares regression. J. Chemometr. 1993; 7(4): 291–304.
10. van der Voet H. Comparing the predictive accuracy of models using a simple randomization test. Chemometr. Intell. Lab. 1994; 25(2): 313–323.

APPENDIX

All the R² values are estimated using

$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$    (A.1)

where $\hat{y}_i$ and $\bar{y}$ are the modelled value of observation i and the mean of y, respectively.

All the Q² values are estimated using

$Q^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - y_i^*)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$    (A.2)

where $y_i^*$ is the estimate of $y_i$ obtained through cross-validation.
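For completeness, a small sketch of how R² and a cross-validated Q² can be computed from these definitions is given below; it reuses the nipals_pls1 function from the earlier sketch and assumes leave-one-out cross-validation, which is only one of the possible schemes.

import numpy as np

def r2_q2(X, y, n_components):
    # R2 (Equation A.1) and leave-one-out Q2 (Equation A.2).
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    b, *_ = nipals_pls1(Xc, yc, n_components)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - np.sum((yc - Xc @ b) ** 2) / ss_tot

    y_cv = np.empty_like(y, dtype=float)
    for i in range(len(y)):                             # leave observation i out
        mask = np.arange(len(y)) != i
        Xtr, ytr = X[mask], y[mask]
        xm, ym = Xtr.mean(axis=0), ytr.mean()
        b_i, *_ = nipals_pls1(Xtr - xm, ytr - ym, n_components)
        y_cv[i] = ym + (X[i] - xm) @ b_i                # y*_i predicted without observation i
    q2 = 1.0 - np.sum((y - y_cv) ** 2) / ss_tot
    return r2, q2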
