A comparison of model-based and regression classification techniques applied to near infrared spectroscopic data in food authentication studies

Deirdre Toher∗†‡, Gerard Downey∗ and Thomas Brendan Murphy†

∗Ashtown Food Research Centre, Teagasc, Dublin 15, Ireland
†Department of Statistics, School of Computer Science and Statistics, Trinity College Dublin, Dublin 2, Ireland
‡Corresponding Author: email [email protected]; tel +353.1.8059500; fax +353.1.8059550



Abstract

Classification methods can be used to classify samples of unknown type into known types. Many classification methods have been proposed in the chemometrics, statistical and computer science literature.

Model-based classification methods have been developed from a statistical modelling viewpoint. This approach allows for uncertainty in the classification procedure to be quantified using probabilities. Linear discriminant analysis and quadratic discriminant analysis are particular model-based classification methods.

Partial least squares discriminant analysis is commonly used in food authentication studies based on spectroscopic data. This method uses partial least squares regression with a binary outcome variable for two-group classification problems.

In this paper, model-based classification is compared to partial least squares discriminant analysis for its ability to correctly classify pure and adulterated honey samples when the honey has been extended by three different adulterants. The methods are compared using the classification performance, the range of applicability of the methods and the interpretability of the results.

In addition, since the percentage of adulterated samples in any given sample set is unlikely to be known in a real-life setting, the ability of updating procedures within model-based clustering to accurately predict the adulterated samples, even when the proportion of pure to adulterated samples in the training data is grossly unrepresentative of the true situation, is studied in detail.


1 Introduction

The main aim of food authenticity studies[1] is to detect when foods are not what they claim to be and thereby prevent economic fraud or possible damage to health. Foods that are susceptible to such fraud are those which are expensive and subject to the vagaries of weather during growth or harvesting, e.g. coffee, various fruits, herbs and spices. Food fraud can generate significant amounts of money (e.g. several million US dollars) for unscrupulous traders, so the risk of adulteration is real. Honey is defined by the EU[2] as “the natural, sweet product produced by Apis mellifera bees from the nectar of plants or from secretions of living plants, which bees collect, transform by combining with specific substances of their own, deposit, dehydrate, store and leave in honeycombs to ripen and mature”. As it is a relatively expensive product to produce and extremely variable in nature, honey is prone to adulteration for economic gain. Instances of honey adulteration have been recorded since Roman times, when concentrated grape juice was sometimes added, although nowadays industrial syrups are more likely to be used as honey extenders. False claims may also be made in relation to the geographic origin of the honey, but this study concentrated on attempting to classify samples as either pure or adulterated. In this study, artisanal honeys were adulterated in the laboratory using three adulterants – fructose:glucose mixtures, fully-inverted beet syrup and high fructose corn syrup – in various ratios and weight percentages.

Model-based classification[3] is a classification method based on the Gaussian mixture model with a parsimonious covariance structure. This method models the data within each group using a Gaussian distribution, and each group occurs with some fixed probability. This classification method has been shown to give excellent classification performance in a wide range of applications[4]. A recent extension of model-based classification that uses data with unknown group membership in the model-fitting procedure has been developed[5]. A detailed review of model-based classification and its extensions is given in Section 4.1.

Partial least squares regression is a method that seeks to optimise both the variance explained and the correlation with the response variable[6]. In a previous study[7] it was found to outperform other chemometric methods commonly used in the study of near-infrared transflectance spectra. It has the advantage that it can utilise highly-correlated variables for classification purposes.

Both model-based classification and partial least squares discriminant analysis require training on data with known group or class labels. The collection of training data in food authenticity studies can be very expensive and time-consuming, so methods that require few training data observations are particularly useful.

We show that both methods give excellent classification performance, even when few training data values are available. We also find that model-based classification is robust in situations where the training and test data are quite different.

2 Materials and Methods

2.1 Honey Samples

Honey samples (157 samples) were obtained directly from beekeepers throughout the island of Ireland. Samples were from the years 2000 and 2001; they were stored unrefrigerated from time of production and were not filtered after receipt in the laboratory. Prior to spectral collection, honeys were incubated at 40°C overnight to dissolve any crystalline material, manually stirred to ensure homogeneity and adjusted to a standard solids content (70° Brix) to avoid spectral complications from naturally-occurring variations in sugar concentration.

Collecting, extending and recording spectra of the honey was done at time points several months apart; the first phase involved extending some of the authentic samples of honey with fructose:glucose mixtures, while the second phase involved extending some of the remaining authentic samples with fully-inverted beet syrup and high fructose corn syrup. All adulterant solutions were also produced at 70° Brix. Brix standardisation of honeys and adulterant solutions meant that any adulteration detected would not be simply on the basis of gross added solids.

The fructose:glucose mixtures were produced by dissolving fructose and glucose (Analar grade; Merck) in distilled water in the following ratios: 0.7:1, 1.2:1 and 2.3:1 w/w. Twenty-five of the pure honeys were adulterated with each of the three fructose:glucose adulterant solutions at three levels, i.e. 7, 14 and 21% w/w, thus producing 225 adulterated honeys.

The other adulterant solutions were generated by diluting commercially-sourced fully-inverted beet syrup (50:50 fructose:glucose; Irish Sugar, Carlow, Ireland) and high fructose corn syrup (45% fructose and 55% glucose) with distilled water. Eight authentic honeys were chosen at random and adulterated with beet invert syrup at levels of 7, 10, 14, 21, 30, 50 and 70% w/w; high fructose corn syrup was added to ten different, randomly-selected honeys at 10, 30, 50 and 70% w/w. This produced 56 BI-adulterated and 40 HFCS-adulterated samples.

On visual inspection, the spectra of the pure honey from the two measurement phases displayed an offset, with those recorded in the first phase exhibiting a higher mean absorbance value. To remove this offset, the difference of the mean absorbance values at each wavelength was subtracted from spectra collected in the first phase. This was done to ensure that the detection of the fructose:glucose adulterants was not a by-product of the offset.
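The offset correction described above amounts to subtracting a per-wavelength mean shift from one phase; a minimal numpy sketch (function name hypothetical, spectra stored as rows of a matrix):

```python
import numpy as np

def remove_phase_offset(phase1, phase2):
    """Subtract the per-wavelength difference of the phase mean absorbances
    from the phase-1 spectra (rows = samples, columns = wavelengths)."""
    offset = phase1.mean(axis=0) - phase2.mean(axis=0)
    return phase1 - offset, phase2
```

After correction, the mean spectrum of the first phase coincides with that of the second phase at every wavelength.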

2.2 Spectral Collection

In the first measurement phase, transflectance spectra were collected between 400 and 2498 nm in 2 nm steps on a NIRSystems 6500 scanning monochromator (FOSS NIRSystems, Silver Springs, MD) fitted with a sample transport module. The second phase collected transflectance spectra between 700 and 2498 nm, again in 2 nm steps, using the same monochromator. Samples were presented to the instrument in a camlock cell fitted with a gold-plated backing plate (0.1 mm sample thickness; part no. IH-0355-1). Between samples, this cell was washed with detergent, rinsed thoroughly with tepid water and dried using lens tissue. In the absence of any manufacturer-supplied temperature control for this accessory, an ad hoc procedure was devised to minimise variation in the temperature of honey samples prior to spectral collection. This procedure involved placing each sample in the instrument and scanning it 15 times without storing the resultant spectra. On the sixteenth scan, the spectrum was stored for analysis. During the pre-scanning process, each sample was presumed to have equilibrated to instrument temperature. While instrumental temperature does vary (K. Norris, personal communication), this variation is less than that likely to occur in the ambient temperature of a laboratory which is not equipped with an air-conditioning system. Preliminary experimental work had indicated 15 scans to be an appropriate compromise number on the basis of visual examination of test sample spectra. All spectra (478 in total) were collected in duplicate (including re-sampling); mean spectra were used in all subsequent calculations.

3 Dimension Reduction and Preprocessing

Spectral data between 1100 and 2498 nm were used in this study. Each spectrum therefore contains 700 wavelengths, with adjacent absorption values being highly correlated. Therefore, before using model-based classification methods, a dimension reduction step is required – this avoids singular covariance matrices, improves computational efficiency and increases statistical modelling possibilities. The technique chosen for data reduction in this


Figure 1: Spectra of Pure and Adulterated Honey Samples (intensity versus wavelength, nm)

study is wavelet analysis.

3.1 Wavelet Analysis

Wavelet analysis is a technique commonly used in image and signal processing in order to compress data. Here, it is used to decompose each spectrum into a series of wavelet coefficients. Without any thresholding of these coefficients, the original spectra can be exactly reconstructed from the coefficients. However, many of the coefficients in the wavelet analysis are zero or close to zero. By thresholding the coefficients that are zero or close to zero, it is possible to dramatically reduce the dimensionality of the dataset. The resulting recomposed spectra are then approximations of each of the individual spectra. Ogden[8] gives a good practical introduction to wavelet analysis.

3.1.1 Daubechies’ Wavelet

Daubechies’ wavelet is a consistently reliable type to use and is the default within wavethresh[9]. To carry out wavelet analysis efficiently, the data dimension should be of the order 2^k, where k is an integer. Unfortunately, this can result in quite a lot of information being set aside. Techniques for extending the data to bring them up to the nearest 2^k are available, but in this case these methods result in problems when carrying out the model-based discriminant analysis – the associated variance structures are often singular. Thus the central 2^9 = 512 observations were chosen – the range 1290 nm – 2312 nm.

3.1.2 Thresholding techniques

Various methods of thresholding[10] wavelet coefficients in order to achieve dimension reduction have been published.

As a default procedure, universal hard thresholding is used. When using universal thresholding, the threshold is λ = σ√(2 log 2^k), where σ is a robust estimate of the standard deviation of the coefficients and there are 2^k coefficients. Other thresholding techniques may sometimes provide better approximations of the spectra, but their use adds another decision to the process that may lead to overfitting rather than a general procedure.
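As an illustration of universal hard thresholding, the following is a minimal numpy sketch. It uses the simple Haar wavelet rather than the Daubechies' wavelet used in the study (wavethresh handles that case), all function names are hypothetical, and the signal length is assumed to be a power of two:

```python
import numpy as np

def haar_dwt(x):
    """Full Haar wavelet decomposition of a signal of length 2^k.
    Returns the final approximation coefficient and the detail
    coefficients per level (finest scale first)."""
    coeffs = []
    a = x.astype(float)
    while len(a) > 1:
        approx = (a[0::2] + a[1::2]) / np.sqrt(2)
        detail = (a[0::2] - a[1::2]) / np.sqrt(2)
        coeffs.append(detail)
        a = approx
    return a, coeffs

def haar_idwt(a, coeffs):
    """Inverse Haar transform (exact reconstruction without thresholding)."""
    for detail in reversed(coeffs):
        out = np.empty(2 * len(a))
        out[0::2] = (a + detail) / np.sqrt(2)
        out[1::2] = (a - detail) / np.sqrt(2)
        a = out
    return a

def universal_hard_threshold(x):
    """Hard-threshold the detail coefficients at lambda = sigma*sqrt(2*log(n)),
    with sigma estimated robustly (MAD of the finest-scale details)."""
    n = len(x)
    a, coeffs = haar_dwt(x)
    sigma = np.median(np.abs(coeffs[0])) / 0.6745  # robust scale estimate
    lam = sigma * np.sqrt(2 * np.log(n))
    coeffs = [np.where(np.abs(d) > lam, d, 0.0) for d in coeffs]
    return haar_idwt(a, coeffs)
```

Setting small coefficients to zero before inverting the transform yields an approximation of the original spectrum from far fewer nonzero coefficients.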

Figure 2: Actual and Thresholded Spectra (intensity versus wavelength, nm)

3.2 Preprocessing: Savitzky-Golay

The Savitzky-Golay[11] algorithm is commonly used in spectroscopic analysis to calculate derivatives of the data. This concentrates the analysis on the shape rather than the height of the spectra. We therefore compare this method of pre-processing the data with the alternative of no preprocessing.
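A Savitzky-Golay derivative can be sketched as a local least-squares polynomial fit evaluated at each window centre; this is a minimal illustration (function name hypothetical) rather than the optimised convolution form used in practice:

```python
import math
import numpy as np

def savgol_derivative(y, window=11, polyorder=2, deriv=1, delta=1.0):
    """Savitzky-Golay derivative: least-squares fit a polynomial of degree
    `polyorder` within each centred window and evaluate its `deriv`-th
    derivative at the window centre (edges are left as NaN)."""
    half = window // 2
    t = np.arange(-half, half + 1) * delta
    A = np.vander(t, polyorder + 1, increasing=True)  # local design matrix
    # Row `deriv` of the pseudo-inverse gives the fitted coefficient c_deriv;
    # the derivative at the centre of the window is deriv! * c_deriv.
    w = np.linalg.pinv(A)[deriv] * math.factorial(deriv)
    out = np.full(len(y), np.nan)
    for i in range(half, len(y) - half):
        out[i] = w @ y[i - half:i + half + 1]
    return out
```

Because the weights w do not depend on the data, the same filter can be applied as a single convolution over every spectrum.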


4 Classification Techniques

The classification techniques used on this data set were based on Gaussian mixture models; under these models each group is modelled using a Gaussian distribution. The covariance of each of the Gaussian components is structured in a parsimonious manner using constraints on the eigen decomposition of the covariance matrix[4]. This approach offers the ability to model groups that have distinct volume, shape and orientation properties.

We introduce mixture models in general terms in Section 4.1 and thenshow the development of clustering and classification methods from mixtures.

4.1 Mixture Models

The mixture model assumes that observations come from one of G groups, that observations within each group g are modelled by a density f(·|θ_g), where θ_g are unknown parameters, and that the probability of coming from group g is p_g.

Therefore, given data y with independent multivariate observations y_1, . . . , y_M, a mixture model with G groups has the likelihood function

L_mix(θ_1, . . . , θ_G; p_1, . . . , p_G | y) = ∏_{m=1}^{M} ∑_{g=1}^{G} p_g f(y_m | θ_g).   (1)

The Gaussian mixture model further assumes the density f to be a multivariate Gaussian density φ, parameterized by a mean µ_g and covariance matrix Σ_g:

φ(y_m | µ_g, Σ_g) = exp{ −(1/2) (y_m − µ_g)^T Σ_g^{−1} (y_m − µ_g) } / √det(2πΣ_g).

The multivariate Gaussian densities imply that the groups are centered at the means µ_g, with the shape, orientation and volume of the scatter of observations within each group depending on the covariances Σ_g.
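The density and the mixture likelihood (1) translate directly into code; a minimal numpy sketch in naive looped form, without numerical stabilisation (function names hypothetical):

```python
import numpy as np

def gaussian_density(y, mu, Sigma):
    """Multivariate Gaussian density phi(y | mu, Sigma) as defined above."""
    diff = y - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    return np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(2 * np.pi * Sigma))

def mixture_log_likelihood(Y, p, mus, Sigmas):
    """Log of the mixture likelihood (1): for each observation, the log of
    the probability-weighted sum of group densities, summed over observations."""
    return float(sum(
        np.log(sum(pg * gaussian_density(y, mu, S)
                   for pg, mu, S in zip(p, mus, Sigmas)))
        for y in Y))
```

In practice a log-sum-exp formulation would be used to avoid underflow when the densities are very small, as with high-dimensional spectra.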

The Σ_g can be decomposed, using an eigen decomposition, into the form

Σ_g = λ_g D_g A_g D_g^T,

where λ_g is a constant of proportionality, D_g is an orthogonal matrix of eigenvectors and A_g is a diagonal matrix whose elements are proportional to the eigenvalues, as described by Fraley and Raftery[4].

The parameters λ_g, A_g and D_g have interpretations in terms of the volume, shape and orientation of the scatter for the component: λ_g controls the volume, A_g the shape of the scatter and D_g the orientation. Constraining some or all of these parameters to be equal across groups gives great modelling flexibility. Some of the options for constraining the covariance parameters are given in Table 1.

Table 1: Parametrizations of the covariance matrix Σ_g
(Model ID letters: E = Equal, V = Variable, I = Identity)

Model ID   Decomposition              Structure
EII        Σ_g = λI                   Spherical
VII        Σ_g = λ_g I                Spherical
EEI        Σ_g = λA                   Diagonal
VEI        Σ_g = λ_g A                Diagonal
EVI        Σ_g = λA_g                 Diagonal
VVI        Σ_g = λ_g A_g              Diagonal
EEE        Σ_g = λDAD^T               Ellipsoidal
EEV        Σ_g = λD_g A D_g^T         Ellipsoidal
VEV        Σ_g = λ_g D_g A D_g^T      Ellipsoidal
VVV        Σ_g = λ_g D_g A_g D_g^T    Ellipsoidal
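The decomposition in Table 1 can be illustrated by constructing covariance matrices from (λ, A, D) directly; a small numpy sketch for the 2-D case (helper names hypothetical, and A normalised here to determinant 1 so that λ alone controls volume, one common convention):

```python
import numpy as np

def rotation(theta):
    """2-D orthogonal matrix of eigenvectors (the orientation D_g)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def make_sigma(lam, A_diag, D):
    """Sigma_g = lam * D_g * A_g * D_g^T: lam sets volume, the diagonal A_g
    (normalised to determinant 1) sets shape, D_g sets orientation."""
    A = np.diag(A_diag / np.prod(A_diag) ** (1 / len(A_diag)))
    return lam * D @ A @ D.T

# EII-style component (spherical): D = I, A = I gives Sigma = lam * I
sigma_eii = make_sigma(2.0, np.ones(2), np.eye(2))
# VVV-style component: its own volume, shape and orientation
sigma_vvv = make_sigma(3.0, np.array([4.0, 1.0]), rotation(np.pi / 6))
```

With det(A_g) = 1, the determinant of Σ_g is λ^d, so λ has a direct volume interpretation.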

The letters in the Model ID denote the volume, shape and orientation respectively. For example, EEV represents equal volume and shape with variable orientation. The mixture model (1) can be fitted to multivariate observations y_1, y_2, . . . , y_M by maximizing the log-likelihood using the EM algorithm[12]. The resulting output from the EM algorithm includes estimates of the probability of group membership for each observation; these can be used to cluster the observations into their most probable groups. This procedure is the basis of model-based clustering[4] and is easily implemented in the mclust library[13] for the statistics package R[14].

However, in model-based discriminant analysis (also known as eigenvalue discriminant analysis)[3], the model is fitted to data which consist of multivariate observations w_n, where n = 1, 2, . . . , N, and labels l_n, where l_ng = 1 if observation n belongs to group g and 0 otherwise.

Therefore, the resulting likelihood function is

L_disc(p_1, . . . , p_G; θ_1, . . . , θ_G | w, l) = ∏_{n=1}^{N} ∏_{g=1}^{G} [p_g f(w_n | θ_g)]^{l_ng}.   (2)

The log-likelihood function (2) is maximized, yielding parameter estimates p_1, . . . , p_G and θ_1, . . . , θ_G. For stability, equal probabilities, p_1 = . . . = p_G = 1/G, are often assumed.


The posterior probability of group membership for an observation y whose label is unknown can be estimated as

P(Group g | y) ≈ p_g f(y | θ_g) / ∑_{q=1}^{G} p_q f(y | θ_q)   (3)

and observations can thus be classified into their most probable group. The mclust[13] package can also be used to perform model-based discriminant analysis. This again allows for the possibility of the models listed in Table 1. It is worth noting that Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) are special cases of model-based discriminant analysis; they correspond to the EEE and VVV models respectively.
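Model-based discriminant analysis with the unconstrained VVV model (i.e. QDA) and posterior classification via (3) can be sketched as follows; a minimal numpy illustration with equal priors, not the mclust implementation (function names hypothetical):

```python
import numpy as np

def phi(y, mu, Sigma):
    """Multivariate Gaussian density."""
    diff = y - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    return np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(2 * np.pi * Sigma))

def fit_qda(W, labels, G):
    """VVV-style model-based discriminant analysis: per-group mean and
    covariance estimated from labelled data (equal priors assumed)."""
    return [(W[labels == g].mean(axis=0),
             np.cov(W[labels == g], rowvar=False)) for g in range(G)]

def posterior(y, params):
    """Posterior group membership probabilities, equation (3),
    with equal priors p_g = 1/G."""
    dens = np.array([phi(y, mu, S) for mu, S in params])
    return dens / dens.sum()
```

Classifying an observation into its most probable group is then a matter of taking the argmax of the posterior vector.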

4.2 Discriminant Analysis With Updating

Model-based discriminant analysis, as developed in[3], only uses the observations with known group membership in the model-fitting procedure. Once the model is fitted, the observations with unknown group labels can be classified into their most probable groups.

An alternative approach is to model both the labelled data (w, l) and the unlabelled data y and to maximize the resulting log-likelihood for the combined model. The likelihood function for the combined data is a product of the likelihood functions given in (2) and (1). This classification approach was recently developed in[5] and was demonstrated to give improved classification performance over classical model-based discriminant analysis.

With this modelling approach the likelihood function is of the form

L_update(p, θ | w, l, y) = L_disc(p, θ | w, l) × L_mix(p, θ | y)
                         = ∏_{n=1}^{N} ∏_{g=1}^{G} [p_g f(w_n | θ_g)]^{l_ng} × ∏_{m=1}^{M} ∑_{g=1}^{G} p_g f(y_m | θ_g).   (4)

The log-likelihood (4) is maximized using the EM algorithm (Section 4.3) to find estimates for p (if estimated) and θ. Output from the EM algorithm includes estimates of the probability of group membership for the unlabelled observations y, as given in (3).

The EM algorithm for maximizing the log-likelihood (4) proceeds iteratively, substituting the unknown labels with their estimated values. At each iteration the estimated labels are updated and new parameter estimates are produced. By passing the estimated values of the unknown labels into the EM algorithm it is possible to “update” the classification results with some of the knowledge gained from fitting the model to all of the data. Indeed, even with small training sets containing unrepresentative proportions of pure/adulterated honey samples, updating shows consistency.

4.3 EM Algorithm

The Expectation-Maximization (EM) algorithm[12] is ideally suited to the problem of maximizing the log-likelihood function when some of the data have unknown group labels; this arises in the model-based clustering likelihood (1) and the model-based discriminant analysis with updating likelihood (4). In this section, we show the steps involved in the EM algorithm for model-based discriminant analysis with updating; the model-based clustering steps are shown in[4].

Consider the data to be classified as M multivariate observations, each consisting of two parts: a known part y_m and an unknown part z_m. In this context the spectroscopic data are observed and thus known, so they are treated as the y_m. The labels (pure or adulterated) are unknown and thus are treated as the z_m. Additionally, N labelled observations are available, each consisting of two parts: a known part w_n and known labels l_n.

The unobserved portion of the data, z, is a matrix of indicator variables, so that z_m = (z_m1, . . . , z_mG), where each z_mg is 1 if y_m is from group g and 0 otherwise.

Then the observed data likelihood can be written in the form

L_O(p, θ | w, l, y) = ∏_{n=1}^{N} ∏_{g=1}^{G} [p_g f(w_n | θ_g)]^{l_ng} × ∏_{m=1}^{M} ∑_{g=1}^{G} p_g f(y_m | θ_g)   (5)

and the complete data likelihood is

L_C(p, θ | w, l, y, z) = ∏_{n=1}^{N} ∏_{g=1}^{G} [p_g f(w_n | θ_g)]^{l_ng} × ∏_{m=1}^{M} ∏_{g=1}^{G} [p_g f(y_m | θ_g)]^{z_mg}.   (6)

Initial estimates of p (if estimated) and θ = (µ, Σ) are taken from classical model-based discriminant analysis, by maximizing (2).

The expected values of the unknown labels are calculated so that

z_mg^(k+1) ← p_g^(k) f(y_m | θ_g^(k)) / ∑_{q=1}^{G} p_q^(k) f(y_m | θ_q^(k))   (7)

for g = 1, . . . , G and m = 1, . . . , M, and the parameters p and θ can then be estimated by:

p_g^(k+1) ← [ ∑_{n=1}^{N} l_ng + ∑_{m=1}^{M} z_mg^(k+1) ] / (N + M)   (if estimated)

µ_g^(k+1) ← [ ∑_{n=1}^{N} l_ng w_n + ∑_{m=1}^{M} z_mg^(k+1) y_m ] / [ ∑_{n=1}^{N} l_ng + ∑_{m=1}^{M} z_mg^(k+1) ]   (8)

The estimates of Σ_g again depend on the constraints placed on the eigenvalue decomposition; details of the calculations are given in[3, 15].

The iterative process continues until convergence is achieved. The use of an Aitken acceleration-based convergence criterion is discussed in[5].

Updating can take two forms: soft and hard updating. With soft updating (EM), updates of the missing labels are made using (7), so the unknown labels are replaced by probabilities (a soft classification). Hard updating (CEM), in contrast, replaces the probabilities given in (7) with an indicator vector of the most probable group. The hard classification algorithm does not maximize (4) but instead tries to maximize (6) (although local maxima are a possibility).
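One iteration of the updating procedure can be sketched as follows: the E-step (7) in its soft (EM) and hard (CEM) forms, and proportion/mean updates following (8). This is a minimal numpy illustration with unconstrained parameters (function names hypothetical), not the constrained-covariance implementation; covariance updates are omitted:

```python
import numpy as np

def phi(y, mu, Sigma):
    """Multivariate Gaussian density."""
    diff = y - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    return np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(2 * np.pi * Sigma))

def e_step(Y, p, params):
    """Soft updating, equation (7): posterior membership probabilities
    z_mg for the unlabelled observations."""
    Z = np.array([[pg * phi(y, mu, S) for pg, (mu, S) in zip(p, params)]
                  for y in Y])
    return Z / Z.sum(axis=1, keepdims=True)

def harden(Z):
    """Hard updating (CEM): replace each row of probabilities with an
    indicator vector of the most probable group."""
    H = np.zeros_like(Z)
    H[np.arange(len(Z)), Z.argmax(axis=1)] = 1.0
    return H

def m_step(W, L, Y, Z):
    """Proportion and mean updates following (8), pooling labelled (W, L)
    and unlabelled (Y, Z) data."""
    R = np.vstack([L, Z])              # (N+M) x G responsibilities
    X = np.vstack([W, Y])
    p = R.sum(axis=0) / R.sum()
    mus = (R.T @ X) / R.sum(axis=0)[:, None]
    return p, mus
```

Iterating e_step (optionally hardened) and m_step until convergence gives the EM and CEM updating procedures respectively.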

5 Regression-Based Techniques

Commonly used in near infrared spectroscopy on food samples, discriminant partial least squares was found to be a reliable method of classifying Irish honey samples[7].

5.1 Partial Least Squares Regression

Partial least squares regression (PLSR) was developed by Wold[16, 17] and is based on the assumption of a linear relationship between the observed variables (e.g. the spectroscopy measurements) and the outcome variable (e.g. pure or adulterated). It is similar to principal components regression (PCR). Let X be an n × p matrix of observed values (n samples on p measurement points) and let y be a vector of length n representing the outcome variable/label.

If S is the sample covariance matrix of X, the similarity between PCR and PLSR can be seen by examining what is being maximised in each case. The mth principal component direction v_m in PCR maximises Var(Xα) subject to v_l^T S α = 0 for l = 1, . . . , m − 1, while the mth PLS direction φ_m maximises Corr²(y, Xα) Var(Xα) subject to φ_l^T S α = 0 for l = 1, . . . , m − 1; both are also subject to the condition that ||α|| = 1. However, within the PLS maximization problem, the variance term does dominate.


5.1.1 Algorithm for PLSR

Each predictor x_j is standardized to have mean 0 and variance 1. Initialize ŷ^(0) = ȳ1 and x_j^(0) = x_j for j = 1, . . . , p; the coefficient φ_1j = ⟨x_j, y⟩ is the univariate regression coefficient of y on each x_j.

For m = 1, . . . , p:

• h_m = ∑_{j=1}^{p} φ_mj x_j^(m−1), where φ_mj = ⟨x_j^(m−1), y⟩

• θ_m = ⟨h_m, y⟩ / ⟨h_m, h_m⟩

• ŷ^(m) = ŷ^(m−1) + θ_m h_m

• x_j^(m) = x_j^(m−1) − [⟨h_m, x_j^(m−1)⟩ / ⟨h_m, h_m⟩] h_m for j = 1, . . . , p, so that each x_j^(m−1) is orthogonalized with respect to h_m

It has been noted that the sequence of PLS fits for m = 1, . . . , p represents the conjugate gradient sequence for computing least squares solutions.
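The algorithm above translates almost line-for-line into numpy; a minimal sketch (function name hypothetical) returning the fitted values ŷ^(m):

```python
import numpy as np

def pls_fit(X, y, m):
    """Partial least squares regression following the algorithm above:
    standardize the predictor columns, then iteratively build directions h
    and orthogonalize the predictors against each one."""
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    yhat = np.full(len(y), y.mean())       # yhat^(0) = ybar * 1
    Xj = X.copy()
    for _ in range(m):
        phi = Xj.T @ y                     # univariate regression coefficients
        h = Xj @ phi                       # new PLS direction
        theta = (h @ y) / (h @ h)
        yhat = yhat + theta * h
        # orthogonalize each predictor with respect to h
        Xj = Xj - np.outer(h, (h @ Xj) / (h @ h))
    return yhat
```

With m = p this reproduces the full least squares fit, consistent with the conjugate-gradient remark above.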

5.1.2 Number of Parameters

PLSR uses m relevant loadings/components in the model. However, deciding on m is not trivial[18]. Even when m is known, the number of parameters in the model is open to debate; there is a known difficulty in calculating the degrees of freedom of a model using PLS[19]. Calculating the number of parameters in a model is especially relevant when using a complexity penalty as part of the model selection criterion. For the purposes of this study, the number of parameters in the population model was assumed to be

p(p + 1)/2 + m + 1,

so that m = 0 corresponds to no correlation between X and y and m = p gives the full least squares model; this agrees with [18].

6 Model Selection and Verification

The model with the best Bayesian Information Criterion value was chosen in each case, so that a consistent criterion could be applied to all models. It is also possible to choose models based on leave-one-out cross-validation, but this is computationally expensive.

The certainty of the classification decisions is measured using Brier's score[20].


6.1 Bayesian Information Criterion (BIC)

The results given are for models selected using the Bayesian Information Criterion (BIC), where

BIC = 2 × log-likelihood − d log(N + M),

given that N + M is the number of observations and d is the number of parameters used in the model.
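As a minimal sketch of the selection rule (helper names hypothetical, candidates given as tuples of name, log-likelihood, parameter count and observation count):

```python
import math

def bic(loglik, d, n_obs):
    """BIC as defined above: 2 * log-likelihood - d * log(N + M)."""
    return 2 * loglik - d * math.log(n_obs)

def select_model(candidates):
    """Choose the (name, loglik, d, n_obs) tuple with the highest BIC,
    trading fit against the complexity penalty d * log(N + M)."""
    return max(candidates, key=lambda c: bic(c[1], c[2], c[3]))
```

A better-fitting but heavily parameterised model (e.g. VVV) is only selected if its gain in log-likelihood outweighs the penalty on its extra parameters.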

6.2 Brier’s Score

Brier[20] developed a continuous performance measure for which perfect prediction gives a score of zero. Given G groups, N samples and forecast probabilities z_n1, . . . , z_nG for sample n belonging to groups 1, . . . , G respectively, the Brier score, B, is

B = [1/(2N)] ∑_{g=1}^{G} ∑_{n=1}^{N} (z_ng − z_ng^true)²   (9)

where z_ng^true is an indicator variable for the actual group membership of sample n. The score is especially useful for determining the certainty of predictions. Some observations may be only barely put into the correct group, or indeed just miss out on correct classification. Observations that are barely classified correctly add more to the Brier score than those for which a more certain classification is made.

A trait of PLSR is that some regression outputs may in fact lie beyond the zero–one scale. For the purposes of calculating a pseudo-Brier score, such results were set equal to either zero or one so that these certain classifications do not add to the total.
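Equation (9) and the clipping of out-of-range PLSR outputs can be sketched as follows (function names hypothetical, two-group case for the pseudo-score):

```python
import numpy as np

def brier_score(Z, true_groups):
    """Brier score, equation (9): half the mean squared difference between
    forecast probabilities and the true group indicator matrix."""
    N, G = Z.shape
    Ztrue = np.zeros((N, G))
    Ztrue[np.arange(N), true_groups] = 1.0
    return ((Z - Ztrue) ** 2).sum() / (2 * N)

def pseudo_brier_from_plsr(scores, true_groups):
    """Pseudo-Brier score for two-group PLSR output: clip regression scores
    outside [0, 1] so confident predictions beyond the scale add nothing."""
    p1 = np.clip(scores, 0.0, 1.0)
    Z = np.column_stack([1 - p1, p1])
    return brier_score(Z, true_groups)
```

A maximally uncertain two-group forecast of (0.5, 0.5) contributes 0.25 per sample, while a correct certain forecast contributes nothing.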

7 Comparison of Methods

As an example of the data input to the classification problem, Figure 1 shows transflectance spectra of authentic and adulterated honey samples. Visual examination of these spectra reveals a sloping baseline, the absence of fine structure and the impossibility of visually differentiating between samples, features which are typical of near infrared spectral collections. For these reasons, multivariate statistical methods are required to efficiently and effectively extract useful information from such datasets.

The data consist of 157 unadulterated samples, 225 extended with fructose:glucose mixtures, 56 extended with beet invert syrup and 40 samples extended with high fructose corn syrup. Let fg, bi and cs represent adulteration by fructose:glucose mixtures, beet invert syrup and high fructose corn syrup respectively, and let adult represent samples adulterated with any of the three adulterant types. The figures reported are the mean correct classification rates (as percentages) over 400 random splits of the data into training and test sets and their associated standard deviations (in brackets).

Three different types of training data were examined:

a) correct proportions of pure and of each type of adulterant
b) correct proportions of pure and adulterated
c) unrepresentative proportions of pure and adulterated

The performance of model-based discriminant analysis (DA), soft updating (EM), hard updating (CEM) and partial least squares regression (PLSR) was measured at three different ratios of training to test data – 50/50%, 25/75% and 10/90% – in order to fully examine the robustness of each technique to varying sample size.
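The evaluation design (repeated random splits, reporting the mean and standard deviation of the correct classification rate) can be sketched generically; the `classify` argument stands in for any of the four methods and its interface is hypothetical:

```python
import numpy as np

def split_evaluate(X, y, classify, train_frac, n_splits=400, seed=0):
    """Mean and standard deviation of the correct classification rate (%)
    over repeated random train/test splits. `classify(Xtr, ytr, Xte)`
    returns predicted labels for the test rows."""
    rng = np.random.default_rng(seed)
    rates = []
    n = len(y)
    for _ in range(n_splits):
        idx = rng.permutation(n)
        ntr = int(round(train_frac * n))
        tr, te = idx[:ntr], idx[ntr:]
        pred = classify(X[tr], y[tr], X[te])
        rates.append(100.0 * np.mean(pred == y[te]))
    return np.mean(rates), np.std(rates)
```

Holding the split scheme fixed across methods means the reported differences reflect the classifiers rather than the sampling.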

7.1 Correct Proportions of Pure and of Each Type of Adulterant in Training Set

The training sets, at 50%, 25% and 10% of the data, are completely representative of the entire population studied.

BIC selection method

Training    Test   Method   Classification Rate   Brier's Score
pure: 79     78    DA       94.86 (1.01)          0.060 (0.014)
fg: 112     113    EM       93.02 (2.02)          0.060 (0.016)
bi: 28       28    CEM      93.95 (1.56)          0.053 (0.009)
cs: 20       20    PLSR     94.66 (1.04)          0.053 (0.007)
pure: 39    118    DA       93.51 (1.07)          0.071 (0.012)
fg: 56      169    EM       84.49 (12.72)         0.140 (0.120)
bi: 14       42    CEM      91.60 (3.94)          0.076 (0.037)
cs: 10       30    PLSR     93.88 (1.14)          0.060 (0.011)
pure: 16    141    DA       90.81 (1.70)          0.104 (0.045)
fg: 22      203    EM       52.38 (15.58)         0.465 (0.157)
bi: 6        50    CEM      69.80 (20.85)         0.294 (0.209)
cs: 4        36    PLSR     92.46 (3.75)          0.075 (0.031)

There is no significant difference in the performance of DA and PLSR, but both outperform the updating procedures. All methods show decreasing performance as the training percentage decreases; however, the decrease is not large for DA and PLSR.


As the percentage in the training set decreases, the number of factors selected by PLSR becomes more variable (8–14 factors at 50% to 1–30 factors at 10%). DA chooses more parsimonious models as the training set percentage decreases, but the updating procedures (EM and CEM) choose models with more parameters as the size of the training set decreases relative to the test set.
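The soft (EM) and hard (CEM) updating procedures can be illustrated on a toy one-dimensional problem. This is a simplification of the model-based updating of [5] (univariate Gaussians, no covariance-model selection): the test labels are treated as missing data, the E-step computes their posterior class memberships, and the M-step re-estimates the class parameters from labelled plus unlabelled data; CEM differs only in replacing the posteriors by hard 0/1 assignments.

```python
import numpy as np

rng = np.random.default_rng(1)

# Labelled training data and unlabelled test data, two 1-D Gaussian classes.
x_tr = np.concatenate([rng.normal(0, 1, 30), rng.normal(4, 1, 30)])
y_tr = np.repeat([0, 1], 30)
x_te = np.concatenate([rng.normal(0, 1, 100), rng.normal(4, 1, 100)])

def normal_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def update(x_tr, y_tr, x_te, hard=False, n_iter=25):
    """EM (soft) or CEM (hard) updating: test labels are treated as
    missing and the class parameters are re-estimated jointly."""
    mu = np.array([x_tr[y_tr == k].mean() for k in (0, 1)])
    sd = np.array([x_tr[y_tr == k].std(ddof=1) for k in (0, 1)])
    pi = np.array([(y_tr == k).mean() for k in (0, 1)])
    for _ in range(n_iter):
        # E-step: posterior class memberships of the unlabelled data.
        dens = np.stack([pi[k] * normal_pdf(x_te, mu[k], sd[k]) for k in (0, 1)])
        z = dens / dens.sum(axis=0)
        if hard:  # CEM: replace posteriors by 0/1 assignments.
            z = (z == z.max(axis=0)).astype(float)
        # M-step over labelled (known z) plus unlabelled (estimated z) data.
        x = np.concatenate([x_tr, x_te])
        for k in (0, 1):
            w = np.concatenate([(y_tr == k).astype(float), z[k]])
            mu[k] = (w * x).sum() / w.sum()
            sd[k] = np.sqrt((w * (x - mu[k]) ** 2).sum() / w.sum())
            pi[k] = w.mean()
    return z.argmax(axis=0)

pred_soft = update(x_tr, y_tr, x_te, hard=False)
pred_hard = update(x_tr, y_tr, x_te, hard=True)
```

Because the unlabelled data outnumber the labelled data, a poor initial fit can drag the parameters far from the truth during updating, which is consistent with the instability of EM and CEM seen in the 10% training-set rows above.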

7.2 Correct Proportions of Pure/Adulterated in Training Set

Only the proportion of pure to adulterated samples is kept fixed; the proportion of each type of adulterant is allowed to vary between simulations.

BIC selection method

Training      Test   Method   Classification Rate   Brier's Score
pure:   79      78   DA       94.66 (1.18)          0.062 (0.015)
adult: 160     161   EM       93.84 (2.41)          0.059 (0.020)
                     CEM      93.84 (1.57)          0.054 (0.015)
                     PLSR     94.58 (0.91)          0.054 (0.006)

pure:   39     118   DA       93.45 (1.01)          0.073 (0.013)
adult:  80     241   EM       83.86 (12.18)         0.145 (0.117)
                     CEM      90.85 (7.17)          0.083 (0.069)
                     PLSR     93.85 (1.01)          0.060 (0.010)

pure:   16     141   DA       90.63 (1.74)          0.115 (0.057)
adult:  32     289   EM       51.79 (15.00)         0.470 (0.151)
                     CEM      68.14 (20.99)         0.312 (0.210)
                     PLSR     92.45 (4.14)          0.074 (0.032)

As in Section 7.1, there is no significant difference between DA and PLSR, which both outperform the updating methods. Classification performance is slightly worse (as expected) but is quite robust to the type of adulterant.

The number of factors chosen by PLSR again becomes more variable as the proportion of training data to test data decreases. The pattern of DA choosing more parsimonious models, while the updating procedures choose models with more parameters, again occurs with decreasing training data proportion.
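Selecting the number of PLSR factors can be sketched with a minimal NIPALS PLS2 implementation regressing a class-indicator matrix on the predictors. This is a generic illustration, not the chemometrics pipeline of the paper: the data are synthetic, the helper names are our own, and a single hold-out split stands in for the cross-validation actually used.

```python
import numpy as np

def pls_fit(X, Y, n_comp):
    """NIPALS PLS2 regression of an indicator matrix Y on predictors X."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    p, q = Xc.shape[1], Yc.shape[1]
    W, P, C = np.zeros((p, n_comp)), np.zeros((p, n_comp)), np.zeros((q, n_comp))
    for a in range(n_comp):
        u = Yc[:, 0].copy()
        for _ in range(500):  # inner NIPALS iterations
            w = Xc.T @ u
            w /= np.linalg.norm(w)
            t = Xc @ w
            c = Yc.T @ t / (t @ t)
            u_new = Yc @ c / (c @ c)
            if np.linalg.norm(u_new - u) < 1e-10:
                u = u_new
                break
            u = u_new
        p_load = Xc.T @ t / (t @ t)
        Xc -= np.outer(t, p_load)   # deflate X and Y
        Yc -= np.outer(t, c)
        W[:, a], P[:, a], C[:, a] = w, p_load, c
    # Regression coefficients mapping centred X to centred Y.
    B = W @ np.linalg.solve(P.T @ W, C.T)
    return x_mean, y_mean, B

def pls_predict(model, X):
    x_mean, y_mean, B = model
    return (np.asarray(X, float) - x_mean) @ B + y_mean

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 20))
y = (X[:, :3].sum(axis=1) > 0).astype(int)
Y = np.eye(2)[y]                     # class-indicator matrix
tr, te = np.arange(80), np.arange(80, 120)

# Choose the number of factors by hold-out misclassification rate.
errors = {}
for a in range(1, 9):
    model = pls_fit(X[tr], Y[tr], a)
    pred = pls_predict(model, X[te]).argmax(axis=1)
    errors[a] = float((pred != y[te]).mean())
best_a = min(errors, key=errors.get)
```

With a small training set the hold-out error curve becomes flat and noisy across factor counts, which is one plausible mechanism for the increasing variability in the selected number of factors noted above.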

7.3 Unrepresentative Proportions of Pure/Adulterated

In this section the ratios of pure to adulterated samples in the training sets (and hence also in the test sets) are varied in order to examine the performance of each method when the training data does not accurately reflect the entire population structure.

Table 2: 50% Training Data / 50% Test Data
BIC selection method

Training      Test   Method   Classification Rate   Brier's Score
pure:   40     117   DA       92.47 (1.52)          0.079 (0.018)
adult: 199     122   EM       93.13 (1.55)          0.061 (0.015)
                     CEM      92.45 (1.81)          0.067 (0.017)
                     PLSR     95.15 (1.15)          0.058 (0.006)

pure:   20     102   DA       90.01 (1.60)          0.101 (0.017)
adult: 219     137   EM       90.32 (3.35)          0.089 (0.031)
                     CEM      89.24 (2.90)          0.100 (0.029)
                     PLSR     97.08 (1.28)          0.050 (0.008)

pure:   99      58   DA       94.98 (1.21)          0.058 (0.013)
adult: 140     181   EM       92.69 (1.80)          0.064 (0.016)
                     CEM      93.82 (1.48)          0.055 (0.014)
                     PLSR     94.66 (1.22)          0.054 (0.007)

pure:  119      38   DA       95.34 (0.92)          0.064 (0.016)
adult: 120     201   EM       90.82 (4.41)          0.081 (0.036)
                     CEM      92.87 (2.91)          0.064 (0.027)
                     PLSR     94.87 (1.11)          0.047 (0.004)

In the 50% training data situation, PLSR marginally outperforms DA and the updating methods when the number of pure samples in the training set is less than expected. There is no significant difference between PLSR and DA when the number of pure samples in the training set is greater than expected. With 25% training data there is almost no difference in classification performance between DA and PLSR, with DA marginally outperforming PLSR when the number of pure samples in the training set is greater than expected (> 39 pure samples). With 10% of the data used as a training set, DA shows its robustness to both training sample size and composition. PLSR again performs well in terms of classification but has a much higher standard deviation than DA.

Where there are fewer than the expected number of pure samples in the training set (e.g. < 79 in Table 2), DA almost always chooses EEE, while when there are more than the expected number of pure samples, DA most frequently chooses VEV. The range of factors chosen by PLSR is smaller when the number of pure samples in the training set is higher than expected than when it is lower.
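For reference, EEE and VEV are names from the eigenvalue-decomposition family of covariance models [3, 15], in which the three letters code the volume, shape and orientation of the group covariance matrices as equal (E) or variable (V) across groups; this is a sketch of the standard parameterisation, with the full family given in [15]:

```latex
% Decomposition of the covariance matrix of group k:
%   \lambda_k = volume, A_k = shape (diagonal), D_k = orientation (orthogonal)
\Sigma_k = \lambda_k \, D_k A_k D_k^{\top}
% EEE:  \Sigma_k = \lambda D A D^{\top}        (one common covariance matrix)
% VEV:  \Sigma_k = \lambda_k D_k A D_k^{\top}  (equal shape; volume and
%                                               orientation vary by group)
```

So the switch from EEE to VEV corresponds to DA moving from a single pooled covariance matrix to group-specific volumes and orientations as more pure samples become available.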


Table 3: 25% Training Data / 75% Test Data
BIC selection method

Training      Test   Method   Classification Rate   Brier's Score
pure:   20     137   DA       92.25 (1.68)          0.079 (0.017)
adult:  99     222   EM       72.37 (18.16)         0.256 (0.178)
                     CEM      82.16 (16.19)         0.169 (0.159)
                     PLSR     93.12 (3.59)          0.067 (0.023)

pure:   10     147   DA       91.12 (1.63)          0.089 (0.014)
adult: 109     212   EM       68.93 (18.38)         0.291 (0.183)
                     CEM      82.28 (15.57)        0.168 (0.151)
                     PLSR     92.68 (6.09)          0.070 (0.037)

pure:   49     108   DA       93.98 (0.98)          0.068 (0.012)
adult:  70     251   EM       85.05 (10.34)         0.135 (0.099)
                     CEM      90.79 (6.48)          0.084 (0.063)
                     PLSR     93.50 (1.35)          0.067 (0.012)

pure:   59      98   DA       93.73 (1.00)          0.072 (0.015)
adult:  60     261   EM       84.92 (9.30)          0.137 (0.088)
                     CEM      90.83 (3.49)          0.084 (0.032)
                     PLSR     93.41 (1.59)          0.068 (0.014)

Table 4: 10% Training Data / 90% Test Data
BIC selection method

Training      Test   Method   Classification Rate   Brier's Score
pure:    8     149   DA       89.87 (2.08)          0.110 (0.047)
adult:  40     281   EM       61.02 (17.37)         0.368 (0.171)
                     CEM      85.91 (8.41)          0.132 (0.080)
                     PLSR     88.17 (7.33)          0.106 (0.047)

pure:    4     153   DA       88.66 (2.37)          0.115 (0.025)
adult:  44     277   EM       66.37 (15.83)         0.302 (0.142)
                     CEM      91.88 (0.94)          0.074 (0.009)
                     PLSR     85.95 (8.67)          0.120 (0.054)

pure:   20     137   DA       90.89 (1.70)          0.099 (0.023)
adult:  28     293   EM       74.45 (16.39)         0.241 (0.164)
                     CEM      85.15 (11.28)         0.140 (0.112)
                     PLSR     89.75 (5.42)          0.098 (0.035)

pure:   24     133   DA       90.92 (1.74)          0.104 (0.029)
adult:  24     297   EM       80.03 (11.17)         0.187 (0.112)
                     CEM      86.05 (8.56)          0.131 (0.084)
                     PLSR     89.83 (4.97)          0.097 (0.033)


8 Conclusions

Both PLSR and model-based discriminant analysis achieve good classification results. Unlike in previous studies [5], the updating techniques do not improve classification performance. This may be due to the structure, or lack thereof, within the adulterated samples. Comparing the Brier's scores to the proportion of samples misclassified shows that the classification decisions made are quite definitive.

Model-based discriminant analysis is more flexible and more robust than PLSR. As the training set proportion decreases and the make-up of the training set becomes unrepresentative of the population, model-based discriminant analysis shows its strengths.

The results from model-based discriminant analysis take the form of the probability of membership of each group rather than, as in PLSR, a score. Thus a cost function can easily be imposed at the end of the classification process, and the results have a natural interpretation.
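Imposing a cost function on the group-membership probabilities amounts to choosing, for each sample, the class with the smallest expected cost. A minimal sketch follows; the posterior values and the cost matrix are illustrative (here a missed adulterant is made five times as costly as a false alarm), not figures from the study.

```python
import numpy as np

# posterior[i, k] = probability that sample i belongs to class k
# (0 = pure, 1 = adulterated); values are illustrative.
posterior = np.array([[0.90, 0.10],
                      [0.70, 0.30],
                      [0.20, 0.80]])

# cost[j, k] = cost of declaring class j when the truth is class k.
# Declaring "pure" for an adulterated sample costs 5; a false alarm costs 1.
cost = np.array([[0.0, 5.0],
                 [1.0, 0.0]])

# Expected cost of each possible decision, then the minimising decision.
expected_cost = posterior @ cost.T
decision = expected_cost.argmin(axis=1)  # → [0, 1, 1]
```

Note that the second sample, with only a 0.30 probability of adulteration, is nevertheless flagged as adulterated because missing an adulterant is the more expensive error; a raw PLSR score offers no comparably direct way to encode such asymmetric costs.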

9 Acknowledgements

The work reported in this paper is funded by Teagasc under the Walsh Fellowship Scheme. Thanks are due to individual beekeepers for the provision of honey samples and to the Irish Department of Agriculture & Food for financial support (FIRM programme), which enabled collection and spectral analysis of the samples. This work is also supported under the Science Foundation Ireland Basic Research Grant scheme (Grant 04/BR/M0057).

References

[1] M. Lees, editor. Food Authenticity: Issues and Methodologies. Eurofins Scientific, Nantes, France, 1998.

[2] European Commission. Council Directive 2001/110/EC of 20 June 2001, relating to honey, 2002.

[3] H. Bensmail and G. Celeux. Regularized Gaussian discriminant analysis through eigenvalue decomposition. Journal of the American Statistical Association, 91:1743–1748, 1996.

[4] C. Fraley and A. E. Raftery. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97:611–631, 2002.


[5] N. Dean, T. B. Murphy, and G. Downey. Using unlabelled data to update classification rules with applications in food authenticity studies. Journal of the Royal Statistical Society, Series C, 55:1–14, 2006.

[6] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, 2001.

[7] G. Downey, V. Fouratier, and J. D. Kelly. Detection of honey adulteration by addition of fructose and glucose using near infrared transflectance spectroscopy. Journal of Near Infrared Spectroscopy, 11:447–456, 2003.

[8] R. T. Ogden. Essential Wavelets for Statistical Applications and Data Analysis. Birkhäuser, 1997.

[9] A. Kovac, M. Maechler, and G. Nason (R port). wavethresh: Software to perform wavelet statistics and transforms, 2004. R package version 2.2-8.

[10] D. L. Donoho and I. M. Johnstone. Ideal Spatial Adaptation by Wavelet Shrinkage. Biometrika, 81(3):425–455, 1994.

[11] A. Savitzky and M. J. E. Golay. Smoothing and differentiation of data by simplified least squares procedures. Analytical Chemistry, 36:1627–1639, 1964.

[12] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B, 39:1–38, 1977.

[13] C. Fraley, A. E. Raftery, and R. Wehrens (R port). mclust: Model-based cluster analysis. R package version 2.1-8.

[14] R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, 2004. ISBN 3-900051-07-0.

[15] G. Celeux and G. Govaert. Gaussian parsimonious clustering models.Pattern Recognition, 28:781–793, 1995.

[16] H. Wold. Estimation of principal components and related models by iterative least squares. In Multivariate Analysis, pages 391–420, 1966.


[17] H. Wold. Nonlinear estimation by iterative least square procedures. In Research Papers in Statistics: Festschrift for J. Neyman, pages 411–444, 1966.

[18] I. S. Helland. Some theoretical aspects of partial least squares regression. Chemometrics and Intelligent Laboratory Systems, 58:97–107, 2001.

[19] H. van der Voet. Pseudo-degrees of freedom for partial least squares. Journal of Chemometrics, 13:195–208, 1999.

[20] G. W. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78:1–3, 1950.
