
Least-squares approximation of a space distribution for a given covariance and latent sub-space

José Camacho a,⁎, Pablo Padilla a, Jesús Díaz-Verdejo a, Keith Smith b, David Lovett b

a Departamento de Teoría de la Señal, Telemática y Comunicaciones, Universidad de Granada, 18071, Granada, Spain
b Perceptive Engineering Ltd., United Kingdom

⁎ Corresponding author. E-mail address: [email protected] (J. Camacho).

Article history: Received 22 July 2010; Received in revised form 21 December 2010; Accepted 22 December 2010; Available online 29 December 2010

Keywords: Covariance matrices; Constrained least squares; Principal component analysis; Partial least squares

Abstract

In this paper, a new method to approximate a data set by another data set with a constrained covariance matrix is proposed. The method is termed Approximation of a DIstribution for a given COVariance (ADICOV). The approximation is solved in any projection subspace, including that of Principal Component Analysis (PCA) and Partial Least Squares (PLS). Given the direct relationship between covariance matrices and projection models, ADICOV is useful to test whether a data set satisfies the covariance structure in a projection model. This idea is broadly applicable in chemometrics. Also, ADICOV can be used to simulate data with a specific covariance structure and data distribution. Some applications are illustrated in an industrial case study.


1. Introduction

Projection methods are intimately related to covariance matrices. For instance, the loading vectors of Principal Component Analysis (PCA) can be extracted either from a mean centered¹ data set X or from the corresponding covariance matrix Cx = (X^T · X)/(Nx − 1), with Nx being the number of observations in X. To identify the loadings from X, the NIPALS algorithm [1] or the Singular Value Decomposition (SVD) [2], among others, can be employed. At the same time, the loading vectors are the eigenvectors of Cx, and thus can be computed from the eigendecomposition (ED) of this matrix. In addition, the eigenvectors remain the same when multiplying Cx by any non-zero scalar k. Therefore, the loading vectors in PCA can be computed directly from X^T · X. Similarly, the loadings and weights in Partial Least Squares (PLS) regression can be identified either from the original data sets X and Y [1,3], or from the cross-product matrices X^T · X and X^T · Y using the kernel algorithm [4–6].

¹ Typically, PCA is performed on mean centered data in order to focus the analysis on the variability within the sample, instead of its deviation from a given, and often meaningless, origin of coordinates.
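As a concrete illustration of this equivalence, the following minimal NumPy sketch (our own example, not code from the paper) extracts the PCA loadings both from the mean centered data matrix and from the cross-product X^T · X, and checks that the two sets of loading vectors coincide up to a sign:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)                      # mean centering (see footnote 1)

# Loadings from the data matrix: right singular vectors of X
_, _, Vt = np.linalg.svd(X, full_matrices=False)
P_svd = Vt.T                                # columns are loading vectors

# Loadings from the cross-product matrix: eigenvectors of X^T.X
# (the 1/(Nx-1) scaling of the covariance does not change the eigenvectors)
evals, P_eig = np.linalg.eigh(X.T @ X)
P_eig = P_eig[:, np.argsort(evals)[::-1]]   # sort by decreasing eigenvalue

# Corresponding loading vectors agree up to sign
agreement = np.abs(np.sum(P_svd * P_eig, axis=0))
print(np.allclose(agreement, 1.0))          # -> True
```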

The identification of projection models from covariance matrices presents one main advantage and one main drawback. The advantage is that when the number of observations in the data matrices is very large, it is more efficient to fit a model from the covariance matrices [4]. This is because the size of the covariance matrix depends only on the number of variables of the data set, and not on the number of observations. Moreover, the computation of the covariance can be performed in an iterative manner which is appropriate for very large data sets, since the complete set of observations does not need to be considered at once. The limitation is that covariance matrices do not give any information about the distribution of the data in the model subspace: the scores. Notice that two different data sets with a completely different distribution of the scores, and even with different numbers of observations, may yield the same covariance matrix and so the same projection model. The visualization of the distribution of the scores is needed in some applications. For exploratory data analysis (EDA), it is highly important to investigate the distribution of the data together with the model structure [7,8]. Also, in multivariate statistical process monitoring (MSPM), the score distribution helps us determine if the process is under statistical control. In any case, although scores are not obtained when projection models are calibrated from cross-product matrices, they can be computed subsequently using the original data.

In this paper, the parallelism between original data and covariance in the calibration of projection models is exploited to propose a method to approximate a data set by another data set with a given covariance structure. The method is termed Approximation of a DIstribution for a given COVariance (ADICOV). The approximation is solved in any projection subspace. Given the direct relationship between covariance matrices and projection models, ADICOV may be useful in two types of applications:

• To yield data with a specific covariance structure which approximates a given space distribution.

• To test whether a data set satisfies the covariance structure in a model. This idea can be valuable in a high number of applications, including process monitoring, the choice of the method for missing value estimation and observation-wise compression.

The paper is organized as follows. Section 2 introduces ADICOV. Section 3 performs some experiments with random data. From the results of these experiments, Section 4 discusses some applications of ADICOV. Section 5 presents an industrial case study. In Section 6, the concluding remarks are drawn.

2. Approximation of a distribution for a given covariance (ADICOV)

Let us define X as the original or calibration data matrix and L as the processed or test data matrix. ADICOV computes the approximation matrix A as the least squares approximation of L constrained to the covariance of X, represented by Cx. Thus, A has the same covariance matrix as X, and approximates the space distribution of L. The least squares approximation can be solved in the original space of the variables or in a subspace of interest. For instance, if ADICOV is used to test whether a data set is coherent with a PCA model, the least squares approximation should be performed on the PCA subspace.

Depending on the application, matrices X and L may or may not be independently generated. For instance, in process monitoring, X contains the calibration data used for model fitting and L contains posterior observations of the process, collected during actual monitoring. In that case, X and L are independently generated. On the other hand, in missing data estimation, X may be a data set with missing values and L the same matrix where the missing elements have been estimated with a given method. ADICOV in this context is useful to determine whether the estimated values introduce a change in the covariance structure of the original data.

Once the approximation by ADICOV is performed, the interest may be specifically on the approximation matrix A or on the difference found between L and its approximation. The interest is in A when the aim is data simulation. The interest is in the difference between L and A when the aim is to measure to what extent L satisfies the covariance of X. This difference should be computed on the (sub)space of interest. Thus, if the original space of the variables is considered in ADICOV, the difference is computed between A and L. Otherwise, the difference is computed between the projections of A and L on the (sub)space of interest, Ta and Tl respectively. Mathematically, it can always be stated that the difference is computed between Ta and Tl, since these matrices are, in fact, A and L when no subspace is considered.

The difference between Ta and Tl may be assessed for different purposes. When L is obtained after some processing of X, the difference between Ta and Tl can be computed to evaluate to what extent the processing performed involved the introduction of artifacts in the data. This is only sensible when the processing is not supposed to change the covariance structure. This is the case, for instance, of the estimation of missing values commented on before. On the other hand, when X and L are independently generated, the difference between Ta and Tl gives an idea of the degree of stability of the stochastic process under analysis. This is useful in process monitoring and to assess the robustness of a calibration model. A similar philosophy can be applicable in model selection, model validation and classification.

2.1. Projection onto the subspace of interest

ADICOV constrains the approximation matrix A to satisfy the covariance specified in Cx in a certain subspace. Thus, the equivalence in covariance between X and A can be imposed in, say, the subspace corresponding to the first pair of Principal Components (PCs) or to the last three Latent Variables (LVs) in PLS. The projection of the covariance matrix Cx on a specific subspace, referred to as Gx, follows:

Gx = R^T · Cx · R        (1)

with R as the corresponding projection matrix. Also, data matrix L is projected on the subspace, yielding Tl:

Tl = L · R        (2)

Let us define Ta as the least squares approximation of Tl with Gx as the covariance matrix. Then, A is obtained from the following operation:

A = Ta · Q^T        (3)

where Q performs the linear transformation from the subspace to the original space of the variables.

If the subspace of interest is the PCA subspace, with P the loading matrix [9], then R in Eqs. (1) and (2) and Q in Eq. (3) are set to P. If the subspace of interest is the PLS subspace, with P the loading matrix of the x-block and W the weight matrix [10], then R = W · (P^T · W)^(−1) in Eqs. (1) and (2) and Q = P in Eq. (3).

It should be noted that the applications presented later on in this paper are illustrated with PCA, so that both Cx and R = Q = P are computed from X. This implies that Gx is a diagonal matrix with the eigenvalues of X. Nonetheless, from a general point of view Gx does not need to be diagonal.
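The choice of R and Q and the projections of Eqs. (1) and (2) can be wrapped in a small helper. The sketch below is a hedged NumPy illustration of our own: P and W are assumed to come from an already fitted PCA or PLS model and are placeholders, not outputs of code given in the paper.

```python
import numpy as np

def subspace_matrices(P, W=None):
    """Return (R, Q) for Eqs. (1)-(3).

    PCA subspace: R = Q = P.
    PLS subspace: R = W (P^T W)^{-1}, Q = P.
    """
    if W is None:                       # PCA case
        return P, P
    R = W @ np.linalg.inv(P.T @ W)      # PLS case
    return R, P

def project(Cx, L, R):
    """Project the covariance and the test data onto the subspace of interest."""
    Gx = R.T @ Cx @ R                   # Eq. (1)
    Tl = L @ R                          # Eq. (2)
    return Gx, Tl
```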

2.2. Imposing a covariance matrix

Ta should be defined to have a covariance matrix equal to Gx. This can be performed as follows. The SVD of Ta is:

Ta = Ua · Sa · Va^T        (4)

and its covariance matrix:

Ga = (1/(Na − 1)) · Va · Sa^T · Sa · Va^T        (5)

with Na = Nl, where Na and Nl are the number of observations in Ta and Tl, respectively.

On the other hand, the ED of Gx follows:

Gx = Vx · Dx · Vx^T        (6)

where Vx contains the eigenvectors of Gx and Dx denotes the corresponding eigenvalues in its diagonal. Let us define the matrix Sx as the diagonal matrix where each of the elements equals the square root of the corresponding element in Dx. Sx has only real values since the elements in Dx cannot be negative.

To satisfy Ga=Gx, the following equalities need to hold:

Sa = (√(Na − 1) / √(Nx − 1)) · Sx        (7)

Va = Vx        (8)

Using the derivation above, the procedure to obtain the approximation matrix A is as follows: matrices Sx and Vx are obtained from Gx in order to compute Ta from Eqs. (7), (8) and then (4), and in turn A from Eq. (3). Still, matrix Ua needs to be set in Eq. (4). This is explained in the next section. Provided that matrix Ua in Eq. (4) is orthonormal, A will have the same covariance as X in the subspace of interest.

2.3. Least squares approximation

Ua can be selected so that Ta approaches a specific spatial distribution of the observations. From Eq. (3), this is the same as stating that A approaches a specific spatial distribution of the observations in the projected subspace. In particular, this distribution should be as close as possible to that of Tl. This makes it possible to evaluate to what extent the transition from X to L involved the introduction of artifacts that affect the subspace of interest.

The problem of finding Ua so that Ta approximates Tl in the least squares sense is equivalent to finding the least squares solution for:

Tl = Ua · Sa · Va^T        (9)

where Tl, Sa and Va are given as explained in the previous sections. This, in turn, is an optimization problem where the solution, Ua, is constrained to be orthogonal. In particular, this optimization problem is known as the Orthogonal Procrustes Problem [11]. This problem has an analytic solution, therefore avoiding the computational burden of optimization algorithms like [12]. The solution is the polar decomposition [13] of matrix M, where:

M = Tl · (Sa · Va^T)^T        (10)

The polar decomposition of M can be computed from the SVD as follows:

M = Um · Sm · Vm^T        (11)

then the solution holds:

Ua = Um · Vm^T        (12)

2.4. The ADICOV algorithm

According to the previous discussion, the ADICOV algorithm is computed as follows:

Inputs:
    R, Q: subspace of interest.
    X: original data set.
    L: current data set.

Algorithm:
    Nx ← countObservations(X)
    Cx = (1/(Nx − 1)) · X^T · X
    Nl ← countObservations(L)
    Gx = R^T · Cx · R
    Tl = L · R
    [Vx, Dx] ← ED(Gx)
    Sx ← elementwiseSqrt(Dx)
    Sa = (√(Nl − 1) / √(Nx − 1)) · Sx
    M = Tl · (Sa · Vx^T)^T
    [Um, Sm, Vm] ← SVD(M)
    Ua = Um · Vm^T
    Ta = Ua · Sa · Vx^T
    A = Ta · Q^T
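The listing translates almost line for line into NumPy. The sketch below is such a translation, with function and variable names of our own. As noted in the comments that follow the listing, Sx and Vx can be obtained directly from the SVD of X; the sketch exploits that reading and works with the cross-product X^T · X (without the 1/(Nx − 1) scaling), which, combined with Eq. (7), reproduces the covariance of X in the subspace. X and L are assumed to be preprocessed consistently (e.g. mean centered).

```python
import numpy as np

def adicov(X, L, R, Q):
    """Least squares approximation of L subject to the covariance of X
    (Section 2.4); R and Q define the subspace as in Section 2.1."""
    Nx, Nl = X.shape[0], L.shape[0]

    Gx = R.T @ (X.T @ X) @ R                      # projected cross-product of X
    Tl = L @ R                                    # Eq. (2)

    Dx, Vx = np.linalg.eigh(Gx)                   # ED(Gx)
    Sx = np.diag(np.sqrt(np.clip(Dx, 0.0, None))) # guard against tiny negative eigenvalues

    Sa = np.sqrt(Nl - 1) / np.sqrt(Nx - 1) * Sx   # Eq. (7)

    M = Tl @ (Sa @ Vx.T).T                        # Eq. (10)
    Um, _, Vmt = np.linalg.svd(M, full_matrices=False)
    Ua = Um @ Vmt                                 # Eq. (12): orthogonal Procrustes solution

    Ta = Ua @ Sa @ Vx.T                           # Eq. (4) with Va = Vx (Eq. (8))
    return Ta @ Q.T                               # Eq. (3)


# Quick check in a 2-PC PCA subspace: A reproduces the covariance of X there.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6)); X -= X.mean(axis=0)
L = rng.normal(size=(50, 6));  L -= L.mean(axis=0)
P = np.linalg.svd(X, full_matrices=False)[2][:2].T   # loadings of the first 2 PCs
A = adicov(X, L, P, P)
print(np.allclose((A @ P).T @ (A @ P) / (50 - 1),
                  P.T @ X.T @ X @ P / (200 - 1)))    # -> True
```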

Two comments on this algorithm are in order. First, if X is the input of the algorithm, the computation of Cx is not strictly necessary, since Sx and Vx can be directly obtained from the SVD of X. Second, there are situations in which X cannot be used as the input to the algorithm, for instance because the number of observations is too large. Under these circumstances, Cx and Nx have to be used as alternative inputs to X and the first two steps of the algorithm are not performed. As discussed in the Introduction, Cx can be computed in an iterative manner which is suitable for very large data sets:

Cx(t) = Cx(t−1) + x(t)^T · x(t)        (13)

where Cx(t) is the covariance matrix after the t-th observation x(t) has been considered.
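A small sketch of this incremental computation, with x(t) taken as a row vector and the normalization by Nx − 1 applied once at the end; the generator-based interface is our own choice, not part of the paper.

```python
import numpy as np

def accumulate_cross_product(stream, n_vars):
    """Accumulate X^T.X one observation at a time (Eq. (13)).

    `stream` yields one observation (a length-`n_vars` array) at a time, so the
    full data matrix never has to be held in memory."""
    C = np.zeros((n_vars, n_vars))
    n = 0
    for x in stream:
        x = np.asarray(x).reshape(1, -1)    # row vector x(t)
        C += x.T @ x                        # Eq. (13)
        n += 1
    return C / (n - 1), n                   # normalized covariance Cx and Nx
```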

2.5. ADICOV similarity index

ADICOV is useful to simulate data with a given covariance and space distribution and to check whether L satisfies the covariance of X in a given subspace. For the latter purpose, a similarity index between L and A is defined. This index is computed as the normalized form of the square of the Frobenius norm [14] of the difference between both matrices. Again, this should be performed on the sub-space of interest:

E = (L − A) · R        (14)

I_A = (1/(Nl · Dr)) · ‖E‖²_F = (1/(Nl · Dr)) · Tr(E^T · E)        (15)

where I_A is the ADICOV similarity index, Tr() stands for the trace of a matrix, Nl is the number of observations in L and A, and Dr is the dimension of the projection subspace. I_A is the normalized element-wise Euclidean distance between L and A in the projection subspace. In some contexts, the Mahalanobis distance may be more appropriate than the Euclidean distance. For this, R is substituted in Eq. (14) by the corresponding matrix. For instance, to compute the ADICOV index using the Mahalanobis distance in the PCA subspace with eigenvectors in P and singular values in the diagonal matrix S, Eq. (14) is substituted by:

E = (L − A) · P · S^(−1)        (16)
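Eqs. (14)–(16) reduce to a few lines of code. In the hedged sketch below (our own naming), passing the inverse of the diagonal singular value matrix switches from the Euclidean form of Eq. (14) to the Mahalanobis form of Eq. (16):

```python
import numpy as np

def adicov_index(L, A, R, S_inv=None):
    """ADICOV similarity index I_A (Eq. (15)).

    With S_inv=None, E is computed as in Eq. (14) (Euclidean form); passing the
    inverse of the diagonal singular value matrix of the PCA model (with R = P)
    gives the Mahalanobis form of Eq. (16)."""
    E = (L - A) @ R                     # Eq. (14)
    if S_inv is not None:
        E = E @ S_inv                   # Eq. (16)
    Nl, Dr = E.shape
    return np.sum(E ** 2) / (Nl * Dr)   # squared Frobenius norm, normalized
```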

3. Some experiments with random data

To understand the behavior of ADICOV in detail, its performance for different features of the input matrices should be studied. An experimental design with random data is performed for this purpose. Recall that X is the original data matrix with covariance matrix Cx, L is the matrix to be approximated with ADICOV, and R and Q are the corresponding transformation matrices of the subspace of interest. The factors considered in the experiment are the following:

• The number of observations (Nl) in L.
• The number of variables (M) for each observation.
• The dimension of the projection subspace (Dr).
• The similarity between X and L.

Given these starting considerations, the following experiment with random data matrices is performed. First, X is randomly generated for a given dimension specified by Nl and M, as follows:

X = Ux · Sx · Vx^T        (17)

where Ux and Vx are obtained from the SVD of random data generated according to a multinormal distribution with 0 mean and unit standard deviation, and Sx contains a fixed number S of non-zero singular values. These eigenvalues are generated according to the square of a normal distribution with 0 mean and unit standard deviation. In previous investigations, it was found that the number and dispersion of the non-zero eigenvalues had an effect on the approximation by ADICOV. In turn, these two features indirectly depend on the dimension of X. For instance, X matrices with a larger number of observations are prone to have more non-zero eigenvalues.


Following the simulation approach in Eq. (17), the number of non-zero singular values does not depend on Nl and M. Furthermore, data sets with strong structural relationships among variables can be simulated.

From X, matrix L is simulated as follows:

L = (1 − k) · X + k · N        (18)

where N is multinormal data with 0 mean and unit standard deviation which represents white noise, and k is a constant between 0 and 1. Finally, L is approximated with ADICOV in the subspace corresponding to the first Dr PCs of a PCA model computed from X, yielding matrix A, and the similarity index I_A is computed. This experiment is repeated for the following values:

• Nl = {10, 100, 1000}.
• M = {10, 100, 1000}.
• Dr = {2, 5, 10}.
• k = {0.02, 0.05, 0.10} in Eq. (18).

In all the cases, the number of non-zero singular values in X is S = 20. Ten replicates are simulated for each combination of the different factors.
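A hedged sketch of this data generation scheme (Eqs. (17) and (18)); only the structure described above is reproduced, not the exact random number generation details of the original study. One run of the design then reuses the adicov and adicov_index sketches from Section 2.

```python
import numpy as np

def simulate_pair(Nl, M, S=20, k=0.05, rng=None):
    """Generate X (Eq. (17)) and its noisy counterpart L (Eq. (18))."""
    rng = np.random.default_rng(rng)
    S = min(S, Nl, M)                                   # at most min(Nl, M) singular values
    Ux = np.linalg.svd(rng.normal(size=(Nl, Nl)), full_matrices=False)[0]
    Vx = np.linalg.svd(rng.normal(size=(M, M)), full_matrices=False)[0]
    sv = np.zeros(min(Nl, M))
    sv[:S] = rng.normal(size=S) ** 2                    # squared normal draws as singular values
    X = Ux[:, :len(sv)] @ np.diag(sv) @ Vx[:, :len(sv)].T   # Eq. (17)
    N = rng.normal(size=(Nl, M))                        # white noise
    L = (1 - k) * X + k * N                             # Eq. (18)
    return X, L

# One run of the design: ADICOV in the first Dr PCs of X, then the index I_A.
X, L = simulate_pair(Nl=100, M=10, k=0.05, rng=1)
Dr = 2
P = np.linalg.svd(X - X.mean(0), full_matrices=False)[2][:Dr].T
A = adicov(X - X.mean(0), L - L.mean(0), P, P)
print(adicov_index(L - L.mean(0), A, P))
```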

The contribution of each factor under consideration to the ADICOV index I_A, and possible interactions between pairs of factors, is assessed by means of Analysis of Variance (ANOVA). Factors Nl, Dr and k are found to be statistically significant (p-value ≤ 0.01). Fig. 1 shows the Least Significant Difference (LSD) plots for those factors. The results suggest some considerations for the proper usage of ADICOV:

• The ADICOV index I_A decreases with the number of observations, that is to say, L can be approximated for a given covariance with smaller variations as Nl increases.

• The number of variables does not have a statistically significant effect on the index I_A, since both ADICOV and the index are computed in the subspace of interest, instead of the original space. In contrast, the number of constraints on the least squares approximation problem does depend on the number of PCs, that is, the dimension of the subspace. The ADICOV index I_A increases with the dimension of the subspace of interest.

• Discrepancies between X and L yield a scenario with a higher index I_A.

Two interactions between factors are also found statistically significant (p-value ≤ 0.01). The interaction plots are shown in Fig. 2. These interactions are the ones between the number of observations Nl and the number of PCs Dr, and between the number of observations Nl and the noise factor k. The first interaction reveals that a low number of observations is more limiting than a high number of PCs. The second interaction shows that the dissimilarity in covariance between X and L is the most limiting factor. For high enough values of k, a higher number of observations Nl does not lead to a reduction of I_A. This conclusion can also be illustrated with the example shown in Fig. 3: consider two simple matrices X and L (Nl observations and 2 variables) with completely different covariance matrices. No matter the number of observations generated according to the two different covariance structures, the approximation of L constrained to the covariance of X using ADICOV yields a very different distribution of the points. Thus, the distribution in A (Fig. 3(c)) is very different to that of L (Fig. 3(b)) and so a very high I_A index is obtained.

Fig. 1. Least Significant Difference (LSD) plots of the ADICOV index I_A for the statistically significant factors (p-value ≤ 0.01): (a) Nl, (b) Dr and (c) k.

Fig. 2. Interaction plots of the ADICOV index I_A for the statistically significant interactions (p-value ≤ 0.01): (a) observations with PCs, and (b) observations with k.

Fig. 3. Data distribution for three matrices. X and L are designed to have very different variance–covariance matrices and A is the approximation of L with the covariance of X by ADICOV.

4. Applications of ADICOV

As mentioned in the Introduction, ADICOV can be applied to generate data sets with a specific covariance structure or to test whether a data set meets a specific covariance structure. For the first objective, the interest is in the output matrix A. For the second objective, the interest is in the ADICOV index I_A in Eq. (15).

The applications for which I_A is useful follow a similar pattern. There is a set of matrices {L_1, ..., L_f} which are to be tested for whether they approximate the covariance matrix of X in a certain subspace. ADICOV is applied to each of the matrices and the indices {I_1^A, ..., I_f^A} are computed. In this section, several applications of ADICOV following this pattern are introduced.

4.1. Simulating data

ADICOV can be applied to simulate data with a specific covariance matrix or projection model, and a specific data distribution. To illustrate this application, Fig. 4 shows three completely different data distributions which provide the same PCA model: a multinormal distribution, a distribution with one outlier and a distribution with two clusters. The ADICOV algorithm was employed to yield these data sets using the same covariance matrix Cx and different L matrices: a multinormal one, a multinormal one with an outlier, and a distribution with two multinormal distributions with an offset between them. This example clearly illustrates that the same projection model may be the result of very different data distributions.

Fig. 4. Three different score distributions for the same PCA model with 2 PCs. PLS-Toolbox [15].
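A hedged sketch of how such data sets could be produced with the adicov function sketched in Section 2.4; the three target distributions below are illustrative choices of ours, not the exact matrices behind Fig. 4.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 5)); X -= X.mean(axis=0)          # fixes the covariance Cx
P = np.linalg.svd(X, full_matrices=False)[2][:2].T          # 2-PC subspace of X

# Three very different target distributions L
L_normal  = rng.normal(size=(200, 5))
L_outlier = np.vstack([rng.normal(size=(199, 5)), 8 * np.ones((1, 5))])
L_cluster = np.vstack([rng.normal(size=(100, 5)) - 3,
                       rng.normal(size=(100, 5)) + 3])

# Each approximation keeps the covariance of X in the 2-PC subspace while
# mimicking the score distribution of the corresponding L matrix.
A_normal, A_outlier, A_cluster = (adicov(X, Li - Li.mean(axis=0), P, P)
                                  for Li in (L_normal, L_outlier, L_cluster))
```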

4.2. Avoiding the introduction of artifacts in a data set

In data analysis applications, all the steps performed on the data from collection to the actual analysis should be considered for a proper interpretation of the results [16]. In many cases, these steps are not performed in the same place and/or by the same person. Thus, there is a potential risk of misinterpretation of the analysis results, giving relevance to artifacts in the data which, far from being the trace of the phenomenon of interest, are determined by the data acquisition mechanism or the data transcription procedure. There may be many examples of this problem. Those considered here are data compression and missing data estimation.

4.2.1. Interval-wise compression

In the industrial environment, it is customary to collect a high number of variables at fast rates to perform control and monitoring tasks. This collection activity produces vast amounts of data, with tens to thousands of variables and tens to millions of observations. Typically, these data are compressed to reduce the volume of data, for instance using interval-wise averages. Thus, instead of the actual reading of the sensors, an average per time interval is considered in the posterior analysis. This operation is similar to a low-pass filtering, and an expected side effect of this compression is a reduction of measurement noise.

Clearly, the compression operation may distort the covariance structure in the data if it is not properly performed. As a consequence, the PCA or the PLS model for applications such as data investigation, monitoring and control is actually affected by the compression. To avoid this, the parameters of the compression, for instance the interval length, should be properly defined. This can be carried out by choosing a number of possible intervals and computing the compression from the original matrix of data X according to these intervals, yielding matrices {L_1, ..., L_f}. Then, ADICOV and I_A can be used to compare the resulting compressed data sets in order to choose an interval length which does not change the covariance to a large extent.

4.2.2. Missing data estimation

Missing data methods based on projection models have been deeply studied in the literature. Most contributions focus on the estimation of missing data once the model is built [17–21]. Recently, the problem of model building with missing elements has also been considered [22]. In both cases, the estimates of missing elements should be distinguished from actual data for posterior analysis.

Unfortunately, in most situations missing data are estimated and then, for simplicity, treated as actual data, with the consequent risk of the introduction of artifacts. This problem has been treated by modelling the uncertainty generated in the estimations of missing elements after model building [23].

In a practical situation, the choice of a missing value method is an unsupervised problem in the sense that the actual missing elements are not available. Therefore, the analyst cannot rely on prediction error but on different selection criteria. An alternative criterion is to avoid the introduction of artifacts in the data due to estimation. ADICOV and I_A can be used to select the missing data approach which, among a number of considered methods, introduces the lowest distortion in the covariance of the data.

4.3. Process monitoring

The performance of the monitoring system in an industrial process is very important from an economical point of view. An accurate monitoring system saves time in the detection of production problems [24] and so it saves money. Projection models are nicely combined with multivariate statistical process monitoring (MSPM) techniques. Commonly, two complementary charts are used for monitoring: the Q-statistic, which compresses the residuals; and the D-statistic or Hotelling's T² statistic [25], computed from the scores. There are two phases in the construction of the monitoring charts in MSPM [26,27]. The first phase is used to check whether the process is in control and to identify data useful for the calibration of the monitoring system, and in the second phase actual monitoring is performed. During the first phase, the whole data set is used to fit the projection model and build the monitoring charts. Abnormalities are detected and potential causes are identified. This may lead to a reduction in the variability of the process prior to the design of the actual monitoring system. The same monitoring statistics are used in both phases, although control limits are computed with slightly different equations [26,28–30]. Also, making the most of the nature of projection models, the contribution of the variables to an abnormality can be investigated using contribution plots [31,32].

The traditional statistics are based on observation-wise distances to the mean, both in the model and residual subspaces. The D-statistic can be identified as a Mahalanobis distance in the model sub-space and the Q-statistic as a Euclidean distance in the residual sub-space. In most situations, deviations in the Q-statistic indirectly point to a change of covariance. Nonetheless, it should be noted that this is an indirect measure, and deviations are not always caused by a covariance breakage. For instance, a data set which satisfies the relationship structure among variables in the model may present an abnormal separation from the mean. This would cause high values in both the D-statistic and the Q-statistic without a real breakage of structure. Typically, a high Q-statistic value may lead to the conclusion that the model is not adequately representing the new data, while in this case this conclusion would be wrong. Nevertheless, the most important drawback of the indirect association between distance and fault is when a breakage of the covariance structure shows a low deviation from the mean, and so goes unnoticed in traditional charts. For instance, in [33] it was shown that when the number of variables is high, changes in the covariance with small deviations may be difficult to detect.

In this context, the use of ADICOV is different than in the previous cases. In the first phase, X is the matrix of calibration data, used for model fitting and, optionally, for the control limits design. Then, X is divided into a number of sub-intervals of observations, yielding matrices {L_1, ..., L_f}. In the second phase, X is the calibration data – only data in statistical control – and intervals of incoming observations are collected in matrices {L_1, ..., L_f}. Two types of indices are generated for monitoring. First, ADICOV is performed on the L matrices in the subspace corresponding to the PCA (or PLS) model, yielding approximation matrices {A_1^D, ..., A_f^D}. These matrices are compared to the original L matrices using the Mahalanobis distance, in turn yielding indices {I_1^D, ..., I_f^D}. These indices are similar in nature to the D-statistic. Nonetheless, unlike the D-statistic, the ADICOV indices do not detect an abnormal deviation of the scores in the model subspace, but a breakage of the covariance structure. A second set of ADICOV matrices is generated for the residual subspace: {A_1^Q, ..., A_f^Q}. These matrices are compared to the original L matrices using the Euclidean distance, yielding indices {I_1^Q, ..., I_f^Q}. Again, these indices are similar to the Q-statistic, but are focused on detecting a breakage in the covariance rather than an abnormal deviation.
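One possible reading of these interval-wise indices as code (our interpretation, not the authors' implementation): the model subspace is spanned by the retained loadings P with singular values in S, and the residual subspace is represented here by the discarded loadings P_res; the adicov and adicov_index sketches from Section 2 are reused.

```python
import numpy as np

def monitoring_indices(X, L_blocks, P, S, P_res):
    """Interval-wise ADICOV monitoring indices: I_D (model subspace,
    Mahalanobis form) and I_Q (residual subspace, Euclidean form)."""
    S_inv = np.diag(1.0 / np.diag(S))
    I_D, I_Q = [], []
    for L in L_blocks:
        A_model = adicov(X, L, P, P)                    # approximation in the model subspace
        I_D.append(adicov_index(L, A_model, P, S_inv))  # Eq. (16) form
        A_resid = adicov(X, L, P_res, P_res)            # approximation in the residual subspace
        I_Q.append(adicov_index(L, A_resid, P_res))     # Eq. (15) form
    return np.array(I_D), np.array(I_Q)
```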

5. Case study: industrial data from continuous process

The data set under study was collected during a period of more than 4 days of continuous operation of a fluidized bed reactor fed with four reactants. The collection rate is 20 s. Data consist of 18,887 observations on 36 process variables. The process variables include feed flows, temperatures, pressures, vent flow and steam flow.

The reaction generates multiple products; the operating objective is to maximize the yield of the desired products. The data do not contain any product quality data. Reactor temperatures and the mix of the four reactants are the key for achieving the desired product mix and yield. The reactant feed rates, reactor pressure and vaporization are manipulated by the operator. There is no direct control of reactor tube temperatures. The operator tends to move multiple process inputs at the same time, which causes a high degree of correlation in the reactor temperatures.

5.1. Interval-wise compression

The data set considered contains a very high number of observations which may be difficult to handle. In addition, if data are compressed interval-wise, the measurement noise is expected to be reduced. Nonetheless, the size of the interval should be carefully chosen in order to avoid an important impact on the projection model due to the compression. This can be performed using ADICOV. In the experiment with random data in Section 3, it was shown that the reduction in the number of observations implies an increase of the ADICOV index, evidencing a higher difficulty for small data sets to approximate a given data distribution with a fixed covariance. However, the extent to which this occurs depends on other inter-related factors, such as the eigenvalues of the covariance matrix and, especially, the degree of similarity between matrices L and A. Thus, the correct size of the interval should be evaluated from the specific data of the process under analysis.

The selection of the interval size with ADICOV in this example is performed as follows. Initially, a number of possible intervals are considered: interval size (minutes) = 2^a for a = {0, 1, ..., 9}. Computing the average observation of each interval, the corresponding compressed data matrices are obtained: {L_1, ..., L_512}. Then, ADICOV is employed to find the least squares approximation of the L matrices, {A_1, ..., A_512}, with the covariance of the non-compressed data set. This approximation is carried out on the original space of the variables, so that R and Q were set to the 36×36 identity matrix I. Finally, the ADICOV index (Eq. (15)) is computed for each of the 10 cases: {I_1^A, ..., I_512^A}. The result is shown in Fig. 5. It can be seen that a compression interval of 1 h introduces a distortion of the covariance ten times higher than a 2 min compression interval. This information is useful to select an adequate interval.

According to Fig. 5, as expected, the lower the size of the interval used for compression, the lower the ADICOV index, as a consequence of a lower distortion of the covariance matrix in the L matrices. The index increase from 256 to 512 is especially abrupt. Fig. 6 illustrates the extent to which the distortion indicated by the index affects a PCA model fitted from the data. For this, the loading plot of the first PC and the eigenvalues corresponding to the original data matrix (X) and matrices L1, L256 and L512 are compared. The highest distortion is found for L512 whereas L1 shows almost no distortion. These results are coherent with Fig. 5.
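A sketch of the interval selection procedure, assuming evenly spaced observations 20 s apart as stated above; the helper names and the reuse of the adicov and adicov_index sketches from Section 2 are our own choices.

```python
import numpy as np

def compress(X, interval_minutes, sample_period_s=20):
    """Interval-wise averages of the rows of X."""
    w = int(interval_minutes * 60 / sample_period_s)        # observations per interval
    n = (X.shape[0] // w) * w                                # drop the incomplete last interval
    return X[:n].reshape(-1, w, X.shape[1]).mean(axis=1)

def index_per_interval(X, intervals=(1, 2, 4, 8, 16, 32, 64, 128, 256, 512)):
    """ADICOV index of each compressed matrix against the covariance of X,
    in the original variable space (R = Q = I)."""
    I = np.eye(X.shape[1])
    out = {}
    for m in intervals:
        Lm = compress(X, m)
        Am = adicov(X, Lm, I, I)
        out[m] = adicov_index(Lm, Am, I)
    return out
```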

Fig. 5. ADICOV evaluation of the interval size in interval-wise compression.

Fig. 6. Comparison of the loading vector corresponding to the first PC (a) and the eigenvalues (b) for the original matrix X and compressed matrices L1, L256 and L512.

Table 1
ADICOV indices for the matrices filled with the estimations of missing values.

# obs.    I_A (NIPALS)    I_A (TSR)
100       0.0705          0.0846
18,878    0.0373          0.0272


5.2. Missing data estimation

To assess the distortion due to the estimation of missing values, one third of the elements in the original data matrix is randomly discarded and considered to be missing values. Since the influence of the estimations on the final model is also dependent on the size of the data set, two cases are considered: a) only the first 100 observations with missing values are available in X1 (100×36), and b) the complete data set with missing values is available in X2 (18,887×36). The covariance of each of these matrices is computed by considering only the available information.

For each of the two data sets, two techniques are employed for model building: the NIPALS algorithm [1] and the iterative Trimmed Score Regression (TSR) algorithm [22]. In both cases, a PCA model with 2 PCs is used to perform the estimates. These estimates replace the missing elements in the X matrices, yielding matrices {L_1^NIPALS, L_2^NIPALS} for NIPALS and matrices {L_1^TSR, L_2^TSR} for TSR. Then, ADICOV is applied to find the least squares approximation of the L matrices on the original space of the variables. Thus, again R and Q are set to the 36×36 identity matrix I. Finally, the ADICOV index (Eq. (15)) is computed for each of the cases. The result is shown in Table 1. As expected, the influence of the estimations on the model is higher for a lower number of observations. Also, the estimation method with the lower index varies for the two cases (100 and 18,878 observations). It was surprising to find that the estimation based on NIPALS, which is in general regarded as providing poor estimation performance [22], outperformed TSR in the first case. This example shows that there is no single “best” method for all the possible cases, and illustrates the convenience of using ADICOV.
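A hedged sketch of this kind of comparison: the imputation methods used in the paper (NIPALS with missing data and iterative TSR) are not reproduced here; a simple column-mean imputation stands in as a placeholder so that only the ADICOV part of the procedure is illustrated, again reusing the sketches from Section 2.

```python
import numpy as np

def mean_impute(X_missing):
    """Placeholder imputation: replace NaNs by the column means."""
    X = X_missing.copy()
    col_means = np.nanmean(X, axis=0)
    idx = np.where(np.isnan(X))
    X[idx] = np.take(col_means, idx[1])
    return X

def imputation_distortion(X_missing):
    """ADICOV index of the imputed matrix against the covariance of the
    available data, in the original variable space (R = Q = I)."""
    L = mean_impute(X_missing)
    # Crude stand-in for the 'available information' covariance used in the paper
    X_avail = np.nan_to_num(X_missing, nan=0.0)
    I = np.eye(X_missing.shape[1])
    A = adicov(X_avail, L, I, I)
    return adicov_index(L, A, I)
```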

Again, the extent to which the introduction of estimates in the data distorts the model can be visually assessed by looking at score plots and eigenvalue plots. This is shown for X1 (first 100 observations) in Fig. 7. The distortion in the model due to the estimation of missing elements is especially high from the second PC onwards. Still, it is difficult to select between the two estimation procedures, NIPALS and TSR, by looking at this plot, whereas with ADICOV this selection is straightforward. The fact that the distortion affects from the second PC onwards could also be found with ADICOV by selecting the subspaces corresponding to the first PC, the first 2 PCs, and so on.

Fig. 7. Comparison of the loading vectors corresponding to the first 3 PCs in (a), (b) and (c) and the eigenvalues (d) for the original matrix X with missing values (dots) and the matrices with estimations using NIPALS (circles) and TSR (squares).

5.3. Process monitoring

The last application of ADICOV considered in this paper is process monitoring. This example is restricted to the first phase of the construction of monitoring charts. A PCA model with 2 PCs is fitted using the original data set, and Hotelling's T² statistic and Q-statistics are computed. In Section 5.1, interval-wise compression was considered. To illustrate the consequence of such compression in monitoring, the original data is averaged using a 64 min interval. The size of the interval was chosen to improve visualization of the results, while similar results were observed for lower intervals. From this compressed data, another PCA model with 2 PCs and the corresponding statistics are computed. Finally, and using this same interval length, the ADICOV-based monitoring statistics, that is, the I^D and I^Q indices, are computed.

After proper scaling, the resulting statistics of each of the three approaches are compared in Fig. 8, both for the model subspace (D-statistics and I^D indices) and the residuals (Q-statistics and I^Q indices). Control limits corresponding to the traditional statistics from the original data are included to support the discussion. It should be noted that these control limits are only valid for the statistics for which they were computed.

Fig. 8 shows that several large deviations at observation level are lost in both traditional and ADICOV-based MSPM due to the use of intervals. Apart from this, the traditional statistics from the compressed data are quite representative of the statistics from the original data. ADICOV-based statistics in the model subspace (Fig. 8(a)) are similar to the traditional statistics, but a higher baseline is observed in the former. Also, those I^D indices corresponding to observations under control seem to be more stable. In the residual subspace (Fig. 8(b)), clear differences between I^Q indices and Q-statistics are found in two periods: between observations 5,500 and 7,000, approximately, and the final period. In the first period, ADICOV shows abnormally high I^Q values in comparison to those of the rest of the observations, while the Q-statistics remain normal. This period coincides with an abnormality detection in the D-statistic chart. ADICOV leads to the conclusion that this abnormality also affects the covariance of the residuals, even though a large deviation in the residual subspace (Q-statistic) is not found. Clearly, the conclusion regarding the abnormality in this period is different if ADICOV is not used. Fig. 9 shows the scores corresponding to the third PC of the original data, which was left in the residuals. The scores are moving through different operation points, which reflects that a tighter control may easily reduce process variability. In the period highlighted by ADICOV, the scores show a sharp change of value. This plot supports that the abnormality found in the first period in Fig. 8(a), which reflects a change of operation point in the process, is also present in the residuals. Notice that the first two PCs do present a more stable profile, except for the two abnormal periods highlighted. Regarding the last period of the process, the Q-statistics signal an abnormality which is smoothed in the ADICOV chart. This abnormality is also clear in the score plot of Fig. 9. Unlike what happened in the previous case, the abnormal behavior at the end of the process also clearly manifests in the fourth PC, among several others. This may cause a situation in which, while the deviation from the mean is clear, the correlation structure in the data is not so seriously affected.

Fault detection using traditional statistics has proven to be a very valuable tool. According to the results, ADICOV in this context may be seen as a complementary tool, especially since ADICOV may be applied to monitor sets of observations instead of individual observations.² Future research, needed to derive statistically grounded control limits and contribution plots for this approach, will determine in which situations the ADICOV indices are more appropriate than the traditional statistics in the context of process monitoring.

² Nonetheless, extensions of this approach may provide an ADICOV-based monitoring system for individual observations.

Fig. 8. I^D (a) and I^Q (b) indices compared to the (scaled) Hotelling T² statistic and the (scaled) Q-statistic, respectively, both for the original and interval-wise compressed data in a PCA model with 2 PCs.

Fig. 9. Score plot for the third PC in the original data. PLS-Toolbox [15].

6. Conclusion

In this paper, a new method to approximate a data set by another data set with a given covariance matrix is proposed. The method is termed Approximation of a DIstribution for a given COVariance (ADICOV). The method performs a constrained least squares approximation which is solved for any projection subspace, including that of Principal Component Analysis (PCA) and Partial Least Squares (PLS).

ADICOV was deeply analyzed using random data to understand which factors are more influential in the approximation of a data set for an imposed covariance matrix. Relevant factors, by order of importance, are: the similarity between the actual covariance structure of the data set approximated and that used in the approximation, the number of observations (rows) in the data matrices, and the dimension of the subspace of interest.

ADICOV may be applied with two different goals: a) to yield data with a specific covariance structure and space distribution, and b) to test whether a data set satisfies the covariance structure in a projection model. To illustrate the latter, an industrial case study and two types of applications were considered:

• To select, among a number of possible processing operations or methods, the one which introduces less distortion in the covariance structure. Possible examples are selecting the method for missing data estimation and selecting the method for data compression.

• When the aim is to assess the stability of a process. Examples are process monitoring and assessing the robustness of calibration models in spectroscopy.

Other potential applications, not considered in this paper, are model selection, model validation and classification.

Acknowledgments

Research in this paper is partially supported by the Spanish Ministry of Science and Technology through grant TEC2008-06663-C03-02. The referees are acknowledged for their useful comments.

References

[1] H. Wold, E. Lyttkens, Nonlinear iterative partial least squares (NIPALS) estimation procedures, Bull. Intern. Statist. Inst. Proc., 37th session, London, 1969, pp. 1–15.
[2] C. Eckart, G. Young, The approximation of one matrix by another of lower rank, Psychometrika 1 (1936) 211–218.
[3] S. de Jong, SIMPLS: an alternative approach to partial least squares regression, Chemometrics and Intelligent Laboratory Systems 18 (1993) 251–263.
[4] F. Lindgren, P. Geladi, S. Wold, The kernel algorithm for PLS, Journal of Chemometrics 7 (1993) 45–59.
[5] S. de Jong, C. ter Braak, Comments on the PLS kernel algorithm, Journal of Chemometrics 8 (1994) 169–174.
[6] B. Dayal, J. MacGregor, Improved PLS algorithms, Journal of Chemometrics 11 (1997) 73–85.
[7] K. Gabriel, The biplot graphic display of matrices with application to principal component analysis, Biometrika 58 (1971) 453–467.
[8] J. Camacho, Missing-data theory in the context of exploratory data analysis, Chemometrics and Intelligent Laboratory Systems 103 (2010) 8–18.
[9] J. Jackson, A User's Guide to Principal Components, Wiley-Interscience, England, 2003.
[10] P. Geladi, B. Kowalski, Partial least-squares regression: a tutorial, Analytica Chimica Acta 185 (1986) 1–17.
[11] P. Schönemann, A generalized solution of the orthogonal Procrustes problem, Psychometrika 31 (1966) 1–9.
[12] C. Bernaards, R. Jennrich, Gradient projection algorithms and software for arbitrary rotation criteria in factor analysis, Educational and Psychological Measurement 65 (5) (2005) 676–696.
[13] W. Gibson, On the least-squares orthogonalization of an oblique transformation, Psychometrika 27 (2) (1962) 193–195.
[14] G. Golub, C. Van Loan, Matrix Computations, Johns Hopkins University Press, Baltimore, Maryland, 1996.
[15] B. Wise, N. Gallagher, R. Bro, J. Shaver, W. Windig, R. Koch, PLS Toolbox 3.5 for use with Matlab, Eigenvector Research Inc., 2005.
[16] R. Bro, Personal communication, VII Colloquium Chemometricum Mediterraneum, 2010.
[17] P. Nelson, P. Taylor, J. MacGregor, Missing data methods in PCA and PLS: score calculations with incomplete observations, Chemometrics and Intelligent Laboratory Systems 35 (1996) 45–65.
[18] D. Andrews, P. Wentzell, Applications of maximum likelihood principal component analysis: incomplete data sets and calibration transfer, Analytica Chimica Acta 350 (1997) 341–352.
[19] B. Walczak, D. Massart, Dealing with missing data: Part I, Chemometrics and Intelligent Laboratory Systems 58 (2001) 15–27.
[20] F. Arteaga, A. Ferrer, Dealing with missing data in MSPC: several methods, different interpretations, some examples, Journal of Chemometrics 16 (2002) 408–418.
[21] F. Arteaga, A. Ferrer, Framework for regression-based missing data imputation methods in on-line MSPC, Journal of Chemometrics 19 (2005) 439–447.
[22] F. Arteaga, A. Ferrer, New proposals for PCA model building with missing data, 11th International Conference on Chemometrics for Analytical Chemistry, Vol. 1, 2008, p. 87.
[23] P. Nelson, J. MacGregor, P. Taylor, The impact of missing measurements on PCA and PLS prediction and monitoring applications, Chemometrics and Intelligent Laboratory Systems 80 (2006) 1–12.
[24] T. Kourti, Process analysis and abnormal situation detection: from theory to practice, IEEE Control Systems Magazine 22 (5) (2002).
[25] H. Hotelling, Multivariate Quality Control, Techniques of Statistical Analysis, McGraw-Hill, New York, 1947.
[26] N. Tracy, J. Young, R. Mason, Multivariate control charts for individual observations, Journal of Quality Technology 24 (2) (1992) 88–95.
[27] A. Ferrer, Multivariate statistical process control based on principal component analysis (MSPC-PCA): some reflections and a case study in an autobody assembly process, Quality Engineering 19 (4) (2007) 311–325.
[28] G. Box, Some theorems on quadratic forms applied in the study of analysis of variance problems: effect of inequality of variance in one-way classification, The Annals of Mathematical Statistics 25 (1954) 290–302.
[29] J. Jackson, G. Mudholkar, Control procedures for residuals associated with principal component analysis, Technometrics 21 (1979) 331–349.
[30] P. Nomikos, J. MacGregor, Multivariate SPC charts for monitoring batch processes, Technometrics 37 (1) (1995) 41–59.
[31] J. MacGregor, C. Jaeckle, C. Kiparissides, M. Koutoudi, Process monitoring and diagnosis by multiblock PLS methods, AIChE Journal 40 (5) (1994) 826–838.
[32] T. Kourti, J. MacGregor, Multivariate SPC methods for process and product monitoring, Journal of Quality Technology 28 (4) (1996).
[33] J. Camacho, J. Picó, A. Ferrer, On-line monitoring of batch processes based on PCA: does the modelling structure matter? Analytica Chimica Acta 642 (2009) 59–68.
