

Qualitative and quantitative aspects of the application of genetic algorithm-based variable selection in polarography and stripping voltammetry

Ana Herrero*, M. Cruz Ortiz

Dpto. Química, Facultad de Ciencias, Universidad de Burgos, Pza. Misael Bañuelos s/n, 09001 Burgos, Spain

Received 9 June 1998; received in revised form 14 August 1998; accepted 25 August 1998

Abstract

A genetic algorithm (GA) is successfully applied as a variable selection method in the multivariate analysis, with partial least squares (PLS) regression, of several polarographic and stripping voltammetric data sets in which different interferences are present (coupled reactions, formation of intermetallic compounds, overlapping signals and matrix effect). In most cases, the results obtained with this variable selection method are better than those obtained when all the variables are considered. Such is the case in the determination of benzaldehyde, where a dimerization reaction occurs simultaneously with the electrochemical reactions. In general, an improvement in the precision is achieved for the test samples by using the GA. In addition, the GA provides valuable qualitative information that, in every case, is a significant tool for detecting and understanding the chemical phenomena related to each analysis. © 1999 Elsevier Science B.V. All rights reserved.

Keywords: Feature selection; Genetic algorithm; Partial least squares regression; Polarography; Stripping voltammetry; Coupled reactions; Overlapping signals; Intermetallic compounds; Matrix effect

1. Introduction

Determining which variables are really relevant for building a multivariate regression model is of great importance. At first, it could be thought that the more variables the model contains the better it is, because all the variables related to the response are taken into account. This reasoning is not adequate, however, since the presence of other variables not related to the response can increase the background noise, reducing the prediction ability of the model built. The aim of a variable selection procedure is a significant reduction in the number of variables in order to obtain simpler and more stable relationships.

On the other hand, variable selection in a multivariate calibration can reveal the need to use some variables apparently not related to a specific analyte or, on the contrary, not to use others that could seem essential. This can indicate the presence of unexpected phenomena in the electrochemical signal: interferences such as coupled reactions, formation of intermetallic compounds, overlapping signals, matrix effect, etc. In short, a guided variable selection procedure can provide very useful qualitative information about the analysed chemical system. Several methods have been proposed to carry out the selection of variables in regression analysis [1–5], stepwise procedures being the most commonly used.

Analytica Chimica Acta 378 (1999) 245–259

*Corresponding author. Fax: +34-947-258831; e-mail: [email protected]

0003-2670/99/$ – see front matter © 1999 Elsevier Science B.V. All rights reserved.

PII: S0003-2670(98)00619-9

In general, stepwise procedures for the selection of variables have the drawback that they do not sufficiently explore the possible combinations of variables. An efficient alternative is provided by some optimization procedures, such as genetic algorithms (GAs) [6–9], which are the approach used in this paper.

GAs [10–13] have been used to solve difficult problems with objective functions that do not possess "nice" properties such as continuity, differentiability, etc. [14–17]. These algorithms maintain and manipulate a family, or population, of solutions and implement a "survival of the fittest" strategy in their search for better solutions. Whereas traditional search techniques use characteristics of the problem to determine the next sampling point (e.g., gradients, Hessians, linearity and continuity), GAs make no such assumptions. Instead, the next sampled point is determined by stochastic sampling/decision rules rather than by a set of deterministic decision rules. GAs are therefore very useful in the variable selection problem, because the relationship between the presence/absence of variables in a calibration model and the prediction ability of the model, especially for PLS models, is very complex and the mathematical properties cited above are unknown.

GAs search the solution space of a function through the use of simulated evolution, i.e. the survival of the fittest strategy. GAs have been shown to solve the optimization problem by exploring all regions of the potential solutions and exponentially exploiting promising areas through mutation, crossover and selection operations applied to individuals in the population. A complete discussion of genetic algorithms can be found in [14–17].

The use of a GA requires six fundamental issues to be settled; in this paper, they were the following:

1. Each subset of variables is represented by a vector of binary coordinates (genes), called a chromosome, in which the code 1 means that the variable has been chosen to build a PLS calibration model and the code 0 means that it has not been chosen.

2. Selection function. The chromosomes are chosen with a probability inversely proportional to the PRESS of the corresponding model; in this way the two chromosomes that will reproduce are selected.

3. Genetic operators. Uniform and simple cross-over between chromosomes, and mutation of each gene, with given probabilities.

4. Evaluation function. The PRESS of the PLS model built with only those variables coded 1 in the chromosome.

5. The GA used in this work, known as 'steady-state without duplicates' [6,7], is characterized by the fact that it does not replace the whole population, but only adds a descendant chromosome (a potential solution) to the population if the PRESS corresponding to this chromosome is lower than the worst of the population. In this case, the worst chromosome leaves the population so that its size is maintained. Duplicate chromosomes are not allowed either.

6. The population evolves by maintaining the best responses and including those chromosomes that improve the response. Usually, the evolution stops when a sufficiently high number of descendant chromosomes has been generated.
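The six points above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the implementation used in this work: the function names (`loo_press`, `steady_state_ga`), the population size, the mutation probability and the number of offspring are all hypothetical, only uniform cross-over is shown, and the evaluation function uses the leave-one-out PRESS of an ordinary least-squares fit as a stand-in for the PLS model.

```python
import numpy as np

def loo_press(X, y, mask):
    """Leave-one-out PRESS of a least-squares fit on the selected columns.
    Stand-in for the PLS model of the paper (assumption)."""
    Xs = X[:, mask]
    press = 0.0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        A = np.column_stack([np.ones(keep.sum()), Xs[keep]])
        coef, *_ = np.linalg.lstsq(A, y[keep], rcond=None)
        press += (y[i] - (coef[0] + Xs[i] @ coef[1:])) ** 2
    return press

def steady_state_ga(X, y, pop_size=20, n_children=300, p_mut=0.02,
                    p_init=0.2, seed=0):
    """Steady-state GA without duplicates; all settings are hypothetical."""
    rng = np.random.default_rng(seed)
    n_var = X.shape[1]
    # Point 1: binary chromosomes (True = variable included in the model).
    pop, seen = [], set()
    while len(pop) < pop_size:
        chrom = rng.random(n_var) < p_init
        if chrom.any() and chrom.tobytes() not in seen:
            seen.add(chrom.tobytes())
            pop.append((loo_press(X, y, chrom), chrom))  # point 4
    for _ in range(n_children):  # point 6: stop after n_children offspring
        press = np.array([p for p, _ in pop])
        # Point 2: selection probability inversely proportional to PRESS.
        w = 1.0 / (press + 1e-12)
        i, j = rng.choice(pop_size, size=2, replace=False, p=w / w.sum())
        # Point 3: uniform cross-over followed by gene-wise mutation.
        take = rng.random(n_var) < 0.5
        child = np.where(take, pop[i][1], pop[j][1])
        child ^= rng.random(n_var) < p_mut
        if not child.any() or child.tobytes() in seen:
            continue  # point 5: duplicates never enter the population
        child_press = loo_press(X, y, child)
        worst = max(range(pop_size), key=lambda k: pop[k][0])
        # Point 5: the child replaces the worst chromosome only if better.
        if child_press < pop[worst][0]:
            seen.discard(pop[worst][1].tobytes())
            seen.add(child.tobytes())
            pop[worst] = (child_press, child)
    return min(pop, key=lambda t: t[0])  # (PRESS, selected-variable mask)
```

Calling `steady_state_ga(X, y)` on a matrix of currents `X` (samples by potentials) and a concentration vector `y` would return the lowest PRESS found and the corresponding Boolean mask of selected potentials. Because only one chromosome can change per offspring, the best PRESS in the population can never get worse during the evolution.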

This specific GA has been applied in the literature in multivariate analysis for multicomponent spectrophotometric [18,19] and electrochemical [20] determinations, for quality control [21], etc. It also seems adequate for analysing a wide variety of electrochemical signals in which several interference problems, such as overlapping signals or matrix effect, require the use of multivariate regression models [22–24]. With the aim of studying the suitability of this feature selection method for handling electroanalytical data, both qualitative and quantitative aspects of the application of the cited GA to several polarographic and stripping voltammetric examples have been analysed in this paper.

Some electrochemical examples in which interference problems such as coupled reactions, formation of intermetallic compounds, overlapping signals and matrix effect are present have been studied. In most cases, the results showed that the GA improved or, at least, maintained the prediction ability of the partial least squares (PLS) regression models built with the selected variables, the precision for the test samples being improved.

On the other hand, the qualitative information obtained about the chemical systems has always been really significant, helping to understand the chemical and electrochemical processes taking place. This fact indicates that the GA can be a useful tool in qualitative electroanalysis.

2. Experimental

2.1. Materials and equipment

Analytical-reagent grade chemicals were used without further purification; benzaldehyde was obtained from Merck (>99%). All the solutions were prepared with deionized water obtained in a Barnstead NANO Pure II system. Table 1 summarizes the supporting electrolytes and the instrumental and experimental conditions used in each case. Nitrogen (99.997%) was used to remove dissolved oxygen.

Most of the polarographic and voltammetric measurements were carried out using a Metrohm 646 VA processor with a 647 VA stand in conjunction with a Metrohm multimode electrode (MME), used in the static mercury drop electrode (SMDE) mode for polarography and in the hanging mercury drop electrode (HMDE) mode for voltammetry (stirring rate 1290 rev min⁻¹). The three-electrode system was completed by means of a platinum auxiliary electrode and an Ag/AgCl/KCl (3 mol dm⁻³) reference electrode. Some voltammetric measurements were carried out with a μAUTOLAB system from Eco Chemie in conjunction with a Metrohm 663 VA stand, equipped with a Metrohm multimode electrode (MME) used in the hanging mercury drop electrode (HMDE) mode, and connected to the interface to mercury electrodes IME 663 (stirring rate 1500 rev min⁻¹). In this case, the three-electrode system was completed by means of an Ag/AgCl/KCl (3 mol dm⁻³) reference electrode and a glassy carbon auxiliary electrode.

Data analysis was carried out with PARVUS [25], STATGRAPHICS [26] and MATLAB [27].

2.2. Polarographic procedures

The solution was placed in the polarographic cell and purged with nitrogen for 10 min. Once the solution had been deoxygenated, polarograms were recorded from the initial, Ei, to the final, Ef, potential. All the results presented were obtained using the differential-pulse mode with drop time 0.6 s, drop area 0.40 mm² and scan rate ≈10 mV s⁻¹. After each addition, the solution was stirred and deoxygenated for 15 s before applying the polarographic procedure again.

2.3. Voltammetric procedures

The solution was placed in the cell and purged with nitrogen for 10 min. Once the solution had been deoxygenated, a deposition potential, Edep, was applied to the working electrode during a deposition time, tdep. At the end of this time the stirrer was switched off and, after 30 s had elapsed, an anodic potential scan was initiated from Ei to Ef in the differential-pulse mode. Other instrumental parameters were: modulation amplitude for the Metrohm system, 50 mV; modulation amplitude for the μAUTOLAB system, 49.95 mV; pulse repetition time, 0.6 s; nominal area, 0.40 mm²; scan rate, 10 mV s⁻¹. The solution was stirred and deoxygenated for 15 s after each addition.

3. Results and discussion

Among the multivariate regression methods, PLS is designed as a biased model with high stability in its predictions. The prediction ability of the PLS models is evaluated by means of the cross-validated variance, which is derived from the prediction error sum of squares (PRESS). So, a GA applied to the selection of the variables that will constitute the PLS model should improve, or at least maintain, the cross-validated variance of the PLS model. The GA used in this paper rejects any chromosome with a cross-validated variance lower than a predictive threshold value.
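The relationship between PRESS and the cross-validated variance can be illustrated with a short Python sketch. Several points are assumptions rather than details taken from this paper: the PLS1 model is fitted with a plain NIPALS loop, the cross-validation is leave-one-out, and the cross-validated variance is taken as 100(1 − PRESS/SS), SS being the sum of squares of the mean-centred response. The function names are hypothetical.

```python
import numpy as np

def pls1_fit(X, y, n_comp):
    """PLS1 (single response) by NIPALS; returns (x_mean, y_mean, b)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_comp):
        w = E.T @ f                      # weight vector of this component
        norm = np.linalg.norm(w)
        if norm < 1e-12:                 # nothing left to explain
            break
        w /= norm
        t = E @ w                        # scores
        tt = t @ t
        p = E.T @ t / tt                 # X loadings
        qa = f @ t / tt                  # y loading
        E = E - np.outer(t, p)           # deflate X
        f = f - qa * t                   # deflate y
        W.append(w); P.append(p); q.append(qa)
    if not W:
        return x_mean, y_mean, np.zeros(X.shape[1])
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)  # regression vector
    return x_mean, y_mean, b

def press_loo(X, y, n_comp):
    """Leave-one-out prediction error sum of squares (assumed CV scheme)."""
    press = 0.0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        xm, ym, b = pls1_fit(X[keep], y[keep], n_comp)
        press += (y[i] - (ym + (X[i] - xm) @ b)) ** 2
    return press

def cv_variance(X, y, n_comp):
    """Cross-validated variance in %, assumed to be 100 * (1 - PRESS / SS)."""
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 100.0 * (1.0 - press_loo(X, y, n_comp) / ss_tot)
```

Under these assumptions, a lower PRESS always means a higher cross-validated variance for a given data set, so a chromosome can be ranked by either quantity.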

This GA has been used to perform a variable selection procedure in determinations carried out by differential-pulse polarography (DPP) and differential-pulse anodic stripping voltammetry (DPASV) in which PLS regression was needed to model different interference problems, which are individually specified below. The possibility that, through a guided experimental design, the variables chosen by the GA could reveal the nature of these interferences has been studied together with the quantitative analysis.


Table 1
Experimental conditions and parameters corresponding to each analysed case

Case  Analytes                           Technique  Supporting electrolyte                                            Electrode  Edep/V  tdep/s  Ei/V    Ef/V    System
A     Benzaldehyde                       DPP        McIlvaine buffers from pH 2.2 to 7.6 containing 20% of alcohol    SMDE       –       –       −0.800  −1.502  Metrohm
B     Cu(II), Pb(II), Cd(II) and Zn(II)  DPP        Acetic acid (2 mol dm⁻³) and ammonium hydroxide (1 mol dm⁻³)      SMDE       –       –       0.095   −1.117  Metrohm
C     Cu(II), Pb(II), Cd(II) and Zn(II)  DPASV      Acetic acid (2 mol dm⁻³) and ammonium hydroxide (1 mol dm⁻³)      HMDE       −1.110  30      −1.110  0.072   Metrohm
D     Tl(I) and Pb(II)                   DPASV      Oxalic acid (0.1 mol dm⁻³) and hydrochloric acid (0.1 mol dm⁻³)   HMDE       −0.600  30      −0.558  −0.282  Metrohm
E     Cu(II) and Fe(III)                 DPASV      Citrate-citric acid (pH 4.7)                                      HMDE       −1.300  60      −0.154  0.050   μAUTOLAB

Edep, tdep, Ei and Ef being the deposition potential, deposition time, initial potential and final potential, respectively.


Fig. 1. Polarograms (once the blank signal has been subtracted) and variables selected by the GA (vertical lines) in the determination of benzaldehyde for pH: (a) 2.2, (b) 4.6 and (c) 7.6. Benzaldehyde concentrations go from 0.49 to 3.34 mmol dm⁻³.


3.1. Determination of benzaldehyde by DPP

In the polarographic determination of benzaldehyde carried out at different pH values, coupled reactions take place (dimerization reactions in particular). These lead to polarograms of low analytical quality, with highly overlapping and shifted peaks, so that the use of multivariate regression techniques (PLS regression) is required [23]. A PLS regression model was therefore built independently for each pH. Seven additions of benzaldehyde (7 objects) constituted each training set, the current being recorded at 119 potentials (119 predictor variables). The mean of the absolute relative errors corresponding to these PLS models was equal to 2.10%.
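The mean of the absolute relative errors used throughout this section can be computed as in the short Python sketch below. The function name is hypothetical, and the numbers are not the benzaldehyde data behind the 2.10% figure: they are the four Cu(II) test samples of Case E in Table 3 (model without GA), used here only to show the arithmetic.

```python
def mean_abs_relative_error(c_true, c_calc):
    """Mean of |c_calc - c_true| / c_true, expressed in percent."""
    errors = [abs(calc - true) / true * 100.0
              for true, calc in zip(c_true, c_calc)]
    return sum(errors) / len(errors)

# Cu(II), Case E, PLS model without variable selection (Table 3).
c_true = [0.308, 0.500, 0.800, 0.982]
c_calc = [0.312, 0.500, 0.787, 0.939]
print(round(mean_abs_relative_error(c_true, c_calc), 2))  # 1.83
```

The individual errors are about 1.30%, 0%, 1.63% and 4.38%, which gives a mean of about 1.83% for these four samples.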

Fig. 1 shows the polarograms for the three pH values (2.2, 4.6, 7.6) which are most representative of the different kinds of signals recorded. When the reduction of benzaldehyde is carried out in acid medium, Fig. 1(a), two successive polarographic peaks are observed. The first (≈−0.95 V) corresponds to the formation of a free radical which can simultaneously dimerize (depending on the concentration of benzaldehyde) or be reduced to benzyl alcohol, giving the second peak (≈−1.30 V), which is obscured by the discharge of protons (<−1.38 V). At higher pH, Fig. 1(c), the two peaks are gradually replaced by a single peak (≈−1.32 V) associated with a two-electron electrode reaction, corresponding to the formation of benzyl alcohol as a result of an ECE process. So, at intermediate pH, Fig. 1(b), this third peak (≈−1.4 V) appears together with the other two (≈−1.08 and −1.23 V).

When the GA was applied to the data set corresponding to each pH, a different set of selected variables was obtained for each one. Fig. 1 shows the variables selected (vertical lines) by the GA for the three pH values indicated above. In Fig. 1(a) some of the variables selected are in the zone corresponding to the first peak, and the rest are related to those potentials where protons are reduced, which indicates that these potentials are also related to the benzaldehyde concentration, because their inclusion in the PLS models improves the cross-validated variance. In fact, a shift of the polarographic signal due to the discharge of the protons was observed when the concentration of benzaldehyde increased. In Fig. 1(b) potentials over the three polarographic peaks have been selected by the GA at intermediate pH values, whereas in Fig. 1(c) the variables selected at high pH correspond to the only peak of the polarogram, principally to its tails, avoiding in both cases the zones where the shifting of the peak is clear.

Next, the PLS models built with all the variables and those built with the variables selected by the GA have been compared. The prediction ability of these models has been evaluated by means of paired-sample tests, which have been used to compare the absolute value of the relative error of the benzaldehyde concentration calculated by the PLS models built without and with the selection of variables. Since the differences between these absolute errors could not be considered normally distributed, it is necessary to use non-parametric tests. For the two tests applied, the Signs test and the Wilcoxon signed ranks test [28], the null hypothesis (H0) was: the median of the differences is zero, i.e. the variable selection procedure has no effect; whereas the alternative hypothesis (Ha) was: the median is different from zero, i.e. the variable selection has an effect. H0 is rejected when the actual significance probability (p) is lower than the significance level α = 0.05. The results of these non-parametric tests are shown in Table 2, Case A. Both tests conclude that the variable selection procedure has an effect (p<0.05), i.e. the concentrations calculated with and without selection of variables are statistically different. Furthermore, as the median of the differences is −0.255, the variable selection procedure reduces the relative error, at that significance level, by 0.255%.
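For readers who wish to reproduce this kind of comparison, the two paired tests can be sketched in Python with the standard library only. This is an illustrative sketch, not the STATGRAPHICS procedure used here: exact two-sided p-values are assumed, zero differences are dropped, tied absolute differences receive average ranks, and the exact Wilcoxon distribution is obtained by enumerating all sign patterns, which is only practical for small samples. The function names are hypothetical.

```python
from itertools import product
from math import comb

def sign_test(diffs):
    """Exact two-sided sign test on paired differences (zeros dropped)."""
    nz = [d for d in diffs if d != 0]
    n = len(nz)
    k = min(sum(d > 0 for d in nz), sum(d < 0 for d in nz))
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, p)

def wilcoxon_signed_rank(diffs):
    """Exact two-sided Wilcoxon signed ranks test by enumeration (small n)."""
    nz = sorted((d for d in diffs if d != 0), key=abs)
    n = len(nz)
    ranks = [0.0] * n
    i = 0
    while i < n:                      # average ranks over ties in |d|
        j = i
        while j < n and abs(nz[j]) == abs(nz[i]):
            j += 1
        for k in range(i, j):
            ranks[k] = (i + 1 + j) / 2
        i = j
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, nz) if d > 0)
    w_obs = min(w_plus, total - w_plus)
    hits = 0
    for signs in product((0, 1), repeat=n):   # every sign pattern under H0
        s = sum(r for r, b in zip(ranks, signs) if b)
        if min(s, total - s) <= w_obs + 1e-12:
            hits += 1
    return hits / 2 ** n

# Hypothetical differences (error with GA minus error without GA).
diffs = [-0.3, -0.1, -0.5, -0.2, -0.7, -0.4, -0.6, -0.8]
print(sign_test(diffs), wilcoxon_signed_rank(diffs))  # 0.0078125 0.0078125
```

With eight differences all of the same sign, both tests give p = 2/2⁸ ≈ 0.0078, so H0 would be rejected at α = 0.05. The differences in this example are invented for illustration and are not data from this work.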

3.2. Determination of Cu(II), Pb(II), Cd(II) and Zn(II) by DPP

Partial least squares regression has been successfully applied to the polarographic determination of copper, lead, cadmium and zinc [22]. This multivariate model allows one to determine these four metals simultaneously when an adequate experimental design is used (approximately a central composite design [29], but with equal volume additions of each metal to give five different levels of concentration). In this case, 28 samples constituted the training set and 8 the test set (the current at 100 potentials was recorded for each sample). The mean of the absolute relative errors was 2.37%, as opposed to errors of up to 40% obtained when classical univariate methods were used with these "apparently" non-problematic voltammetric signals [22].

Table 2
Medians and actual significance probabilities (p) for the two non-parametric tests applied to all the analysed cases

Case                        A        B        C        D        E
Median                      −0.255   −0.155   1.440    2.815    0.000
Wilcoxon signed ranks test  0.004ᵃ   0.164    0.031ᵃ   0.021ᵃ   0.584
Signs test                  0.004ᵃ   0.112    0.052    0.077    0.617

H0: median is zero, and Ha: median is different from zero.
ᵃ p<0.05 implies that H0 is rejected.

The polarograms corresponding to this analysis are shown in Fig. 2, where it can be seen that there is noise due to irregularities of the signals, mainly at the extremes of the polarogram. For this reason, when the GA is applied independently for each metal, the selected potentials correspond both to the peak of the analysed metal and to other zones of the polarograms, in such a way that these fluctuations can be modelled by the multivariate regression. The variables selected by the GA for Cu(II) (first peak) and Pb(II) (second peak) are shown in Fig. 2(a) and (b), respectively. The variable selection carried out by the GA seems to indicate that there is no interference other than the variability of the base line, which is modelled by taking into account potentials that are not especially relevant (valley points).

The above non-parametric tests have been applied again to compare the results obtained by the PLS models with and without the variable selection procedure, the corresponding results being shown in Table 2, Case B. Both tests conclude that there are no statistically significant differences between the two procedures for a significance level α = 0.05. So, the use of the GA in this case allows one to maintain the prediction ability of the PLS models (see the cross-validated variance in Table 3) and gives indicative qualitative information.

3.3. Determination of Cu(II), Pb(II), Cd(II) and Zn(II) by DPASV

The determination of these four metals has also been carried out at lower concentrations, so a more sensitive electroanalytical technique, which involves an electrodeposition step, has been used. However, when several metals are simultaneously present on a mercury electrode, intermetallic compounds are usually formed between the amalgamated metals (intermetallic compounds between Au–Zn, Cu–Cd, Cu–Zn, Cu–Sn, Co–Zn, Ni–Zn, etc., have been reported) [30]. This phenomenon can seriously interfere with the analytical response, causing its severe depression or shift, which can generate large errors in the determinations made by stripping analysis [31].

The same experimental design used in the last polarographic determination has been followed to carry out this simultaneous analysis (so there are 28 samples in the training set and 8 in the test set, the current being recorded at 99 potentials). All the voltammograms recorded are shown in Fig. 3, where some interferences can be foreseen at the higher concentration levels. As in the polarographic analysis, a PLS model has been built independently for each metal (giving a mean of the absolute relative errors of 3.7%), which has been able to model these interferences, which could be caused by the formation of intermetallic compounds.

The GA has also been applied to this data set and, contrary to the previous case, the selected variables are related to those voltammetric peaks of the metals that form intermetallic compounds, as can be seen in Fig. 3(a) and (b) for Cu(II) (fourth peak) and Pb(II) (third peak), respectively. So, the variable selection made by the GA allows one to confirm the existence of a new interference and, knowing the potentials involved in it, to relate this interference to the formation of intermetallic compounds.

With reference to the prediction ability of the PLS models built with the variables selected by the GA, the results of the two non-parametric tests in Table 2 (Case C) do not coincide: whereas the Signs test concludes that the determinations made without and with variable selection are statistically equal, the Wilcoxon signed ranks test concludes the contrary. Since the actual significance probability of the latter test is really near to the critical value of 0.05, it could be concluded that both procedures give statistically equal results.

3.4. Determination of Tl(I) and Pb(II) by DPASV

On the other hand, the simultaneous determination of several metals by means of voltammetric techniques can be made difficult by the presence of overlapping signals, as in the case of Tl(I) and Pb(II) when both metals are determined simultaneously by stripping voltammetry. To solve this problem, several experimental approaches [32] (searching for more specific signals by using an adequate supporting electrolyte, a separation technique, etc.) and statistical approaches (PLS [33] and continuum [34] regression, etc.) have been proposed. In this paper, the GA has been used to select variables in the simultaneous determination of Tl(I) and Pb(II) when, together with the overlapping signals of both metals, another signal corresponding to the experimental blank appears in the same potential window [24].

Fig. 2. Polarograms recorded in the determination of Cu(II), Pb(II), Cd(II) and Zn(II) by DPP (concentration ranges: 1.92–9.47 μmol dm⁻³ for Cu(II), Pb(II) and Zn(II), and 3.02–14.85 μmol dm⁻³ for Cd(II)). The polarographic peaks correspond, from left to right, to Cu(II), Pb(II), Cd(II) and Zn(II), respectively. The vertical lines indicate those variables selected by the GA for (a) Cu(II) and (b) Pb(II).

Table 3
True (ctrue) and calculated (ccalc±precision) concentrations for the test set samples, and explained and cross-validated (in brackets) variances of the PLS models built without and with variable selection by means of the genetic algorithm. Concentrations (ctrue and ccalc) in μmol dm⁻³.

                         Without GA                       With GA
Case  Metal    ctrue     Variance (%)  ccalc              Variance (%)  ccalc
B     Cu(II)   3.831     99.661        3.854±0.158        99.855        4.003±0.092
               7.605     (99.547)      7.483±0.095        (99.823)      7.527±0.043
               5.747                   5.669±0.312                      5.662±0.039
               5.703                   5.691±0.192                      5.703±0.090
               5.747                   5.684±0.059                      5.783±0.041
               5.703                   5.667±0.071                      5.699±0.025
               5.747                   5.488±0.211                      5.657±0.096
               5.703                   5.737±0.073                      5.694±0.047
      Cd(II)   9.010     99.792        9.208±0.115        99.85         9.208±0.058
               8.942     (99.727)      8.907±0.163        (99.77)       8.948±0.101
               9.010                   9.207±0.194                      9.140±0.124
               8.942                   9.142±0.138                      9.209±0.106
               6.007                   5.931±0.060                      5.959±0.063
               11.922                  11.893±0.104                     11.932±0.054
               9.010                   9.285±0.141                      8.980±0.119
               8.942                   9.051±0.083                      9.053±0.051
      Pb(II)   5.747     99.826        5.800±0.067        99.850        5.789±0.059
               5.703     (99.772)      5.707±0.092        (99.702)      5.711±0.060
               3.831                   3.381±0.113                      3.707±0.358
               7.605                   7.592±0.079                      7.542±0.046
               5.747                   5.773±0.036                      5.776±0.027
               5.703                   5.765±0.059                      5.765±0.038
               5.747                   6.199±0.086                      5.981±0.253
               5.703                   5.826±0.054                      5.769±0.040
      Zn(II)   5.747     99.143        5.620±0.222        99.207        5.607±0.209
               5.703     (99.105)      5.613±0.133        (99.199)      5.570±0.090
               5.747                   6.489±0.437                      6.728±0.555
               5.703                   5.734±0.270                      5.918±0.144
               5.747                   5.708±0.083                      5.852±0.077
               5.703                   5.859±0.099                      5.767±0.065
               3.831                   3.631±0.296                      3.789±0.056
               7.605                   7.760±0.102                      7.859±0.055
C     Cu(II)   0.136     99.210        0.138±0.011        97.827        0.131±0.006
               0.269     (95.263)      0.277±0.013        (96.633)      0.277±0.004
               0.204                   0.195±0.012                      0.193±0.007
               0.202                   0.205±0.012                      0.202±0.006
               0.204                   0.217±0.015                      0.232±0.006
               0.202                   0.225±0.010                      0.217±0.005
               0.204                   0.201±0.015                      0.191±0.006
               0.202                   0.208±0.011                      0.208±0.006
      Cd(II)   0.052     97.898        0.053±0.002        96.485        0.054±0.001
               0.051     (94.476)      0.052±0.003        (95.614)      0.052±0.002
               0.052                   0.048±0.003                      0.047±0.002
               0.051                   0.050±0.004                      0.047±0.002
               0.035                   0.032±0.004                      0.035±0.001
               0.069                   0.074±0.003                      0.075±0.001
               0.052                   0.050±0.004                      0.048±0.002
               0.051                   0.051±0.003                      0.049±0.003
      Pb(II)   0.075     98.903        0.080±0.003        97.650        0.081±0.003
               0.074     (95.945)      0.077±0.004        (95.717)      0.079±0.003

The original voltammograms (without the blank signal being subtracted) have been used in the analysis, which has been carried out following a complete design (5 levels for each metal and all the possible combinations between levels [24]), in such a way that the training set is formed by 25 samples and the test set by 4, the current being recorded at 47 potentials. Fig. 4 shows the voltammograms recorded, where the dotted line corresponds to the signal of the experimental blank and the solid lines to the voltammograms of the calibration samples.
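The size of this training set follows directly from the complete design: 5 levels of one metal combined with 5 levels of the other give 5 × 5 = 25 samples. A minimal Python sketch is given below; the coded levels 1 to 5 are placeholders, since the actual concentration levels are those of [24] and are not reproduced here, and the function name is hypothetical.

```python
from itertools import product

def full_factorial(*levels):
    """All combinations of the given factor levels (complete design)."""
    return list(product(*levels))

tl_levels = [1, 2, 3, 4, 5]  # coded Tl(I) levels (placeholders)
pb_levels = [1, 2, 3, 4, 5]  # coded Pb(II) levels (placeholders)
design = full_factorial(tl_levels, pb_levels)
print(len(design))  # 25
```

Each tuple in `design` is one calibration sample, given as a pair of coded levels (Tl(I), Pb(II)).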

As in the previous examples, the vertical lines of Fig. 4(a) and (b) indicate the variables selected by the GA for Tl(I) and Pb(II), respectively. It is evident that, in the determination of Tl(I), the GA has avoided the blank peak potentials and has only considered potentials in the tails of the thallium peak, where the blank signal does not interfere. In the same way, no potential related to the blank signal has been taken into account in the variable selection made for the determination of Pb(II). This behaviour of the GA reveals the existence of

Table 3 (Continued )

Case Metal ctrue (mmol dmÿ3) Without GA With GA

Variance (%) ccalc (mmol dmÿ3) Variance (%) ccalc (mmol dmÿ3)

0.050 0.046�0.004 0.048�0.004

0.099 0.101�0.004 0.100�0.006

0.075 0.078�0.004 0.080�0.003

0.074 0.080�0.003 0.080�0.003

0.075 0.075�0.005 0.073�0.004

0.074 0.077�0.004 0.079�0.002

ZnII 0.474 98.590 0.482�0.018 98.591 0.474�0.009

0.468 (97.486) 0.471�0.020 (97.727) 0.459�0.009

0.474 0.471�0.019 0.474�0.008

0.468 0.460�0.021 0.432�0.021

0.474 0.517�0.028 0.507�0.011

0.468 0.494�0.017 0.480�0.013

0.317 0.317�0.024 0.307�0.011

0.625 0.626�0.022 0.638�0.015

D TlI 0.326 99.710 0.326�0.013 98.244 0.360�0.040

1.283 (99.671) 1.340�0.020 (98.181) 1.390�0.024

0.802 0.807�0.115 0.555�0.049

1.427 1.410�0.027 1.320�0.038

PbII 0.343 99.577 0.351�0.011 99.603 0.353�0.013

0.563 (99.570) 0.578�0.017 (99.608) 0.581�0.007

0.900 0.949�0.077 0.944�0.076

1.112 1.100�0.031 1.090�0.019

E CuII 0.308 99.887 0.312�0.006 99.896 0.314�0.004

0.500 (99.893) 0.500�0.007 (99.900) 0.500�0.005

0.800 0.787�0.009 0.786�0.007

0.982 0.939�0.030 0.949�0.011

FeIII 20.537 99.997 20.285�0.161 99.995 20.114�0.195

80.051 (99.993) 79.997�0.207 (99.994) 80.072�0.084

50.032 49.923�0.195 50.000�0.169

88.364 88.712�0.484 88.557�0.203

254 A. Herrero, M. Cruz Ortiz / Analytica Chimica Acta 378 (1999) 245±259

Page 11: Qualitative and quantitative aspects of the application of genetic algorithm-based variable selection in polarography and stripping voltammetry

Fig. 3. Voltammograms recorded in the determination of CuII, PbII, CdII and ZnII by DPASV (concentration ranges: 68.56±334.42 nmol dmÿ3

for CuII, 25.26±123.20 nmol dmÿ3 for PbII, 17.48±85.24 nmol dmÿ3 for CdII and 159.25±776.83 nmol dmÿ3 for ZnII). The voltammetric peaks

correspond, from left to right, to ZnII, CdII, PbII and CuII, respectively. The vertical lines indicate those variables selected by the GA for (a)

CuII and (b) PbII.

A. Herrero, M. Cruz Ortiz / Analytica Chimica Acta 378 (1999) 245±259 255

Page 12: Qualitative and quantitative aspects of the application of genetic algorithm-based variable selection in polarography and stripping voltammetry

Fig. 4. Voltammograms recorded in the determination of TlI and PbII by DPASV (concentration ranges: 0.32±1.61 mmol dmÿ3 for TlI and

0.23±1.13 mmol dmÿ3 for PbII). The dotted line corresponds to the voltammogram for the blank. The voltammetric peaks correspond, from left

to right, to TlI and PbII, respectively. The vertical lines indicate those variables selected by the GA for (a) TlI and (b) PbII.

256 A. Herrero, M. Cruz Ortiz / Analytica Chimica Acta 378 (1999) 245±259

Page 13: Qualitative and quantitative aspects of the application of genetic algorithm-based variable selection in polarography and stripping voltammetry

a `foreign' signal which, if a complex matrix were

analysed, could not be detected in a simple way. So,

this GA-based variable selection procedure could be a

useful tool to detect and analyse non-expected anoma-

lies in voltammetric and polarographic determina-

tions.

The predictions of the PLS models built after the selection of variables have been compared with those of the PLS models built without variable selection by means of the non-parametric tests described above. The corresponding results are shown in Table 2, Case D. As in the previous case, the results of the two tests do not coincide, although the actual significance probability of the Wilcoxon signed ranks test is again very near to the critical value, as is that of the Sign test. In this case, the use of the GA does not improve the numerical results, but it provides significant qualitative information.
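How the two paired tests can be applied, and why they may disagree near the critical value, can be illustrated with a short sketch. Everything about the input is an assumption made for illustration: the pairs are the absolute errors of the eight Case D test predictions listed in Table 3 (TlI and PbII pooled), which need not be the quantities the authors submitted to the tests, and the exact p-values are obtained here by plain enumeration rather than with the software used in the paper. The function names are hypothetical.

```python
from itertools import product
from math import comb


def sign_test_p(diffs):
    """Exact two-sided sign test on paired differences (zeros are dropped)."""
    nonzero = [d for d in diffs if d != 0]
    n = len(nonzero)
    n_pos = sum(d > 0 for d in nonzero)
    k = min(n_pos, n - n_pos)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def wilcoxon_signed_rank_p(diffs):
    """Exact two-sided Wilcoxon signed-rank test (enumerates all sign patterns,
    so it is only practical for small samples)."""
    nonzero = [d for d in diffs if d != 0]
    ranks = average_ranks([abs(d) for d in nonzero])
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    observed = min(w_plus, total - w_plus)
    count = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= observed:
            count += 1
    return count / 2 ** len(ranks)


# Case D test samples from Table 3 (TlI rows first, then PbII rows).
c_true = [0.326, 1.283, 0.802, 1.427, 0.343, 0.563, 0.900, 1.112]
without_ga = [0.326, 1.340, 0.807, 1.410, 0.351, 0.578, 0.949, 1.100]
with_ga = [0.360, 1.390, 0.555, 1.320, 0.353, 0.581, 0.944, 1.090]

# Assumed pairing: absolute error with GA minus absolute error without GA.
err_without = [abs(p - t) for p, t in zip(without_ga, c_true)]
err_with = [abs(p - t) for p, t in zip(with_ga, c_true)]
diffs = [round(b - a, 6) for a, b in zip(err_without, err_with)]

sign_p = sign_test_p(diffs)
wilcoxon_p = wilcoxon_signed_rank_p(diffs)
print(f"Sign test p = {sign_p:.4f}")
print(f"Wilcoxon signed ranks p = {wilcoxon_p:.4f}")
```

With these illustrative inputs the Sign test gives p of about 0.070 and the Wilcoxon signed ranks test about 0.039, i.e. the two fall on opposite sides of a 0.05 level (a level assumed here, not taken from the paper). The figures are not meant to reproduce Table 2; they only show that the Sign test, which ignores the size of the differences, can disagree with the Wilcoxon test on the same pairs.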

3.5. Determination of CuII and FeIII by DPASV

Another usual interference in sample analysis is the matrix effect which, in voltammetric determinations, can be due to solution-phase electroactive species present in the sample that produce a response at the same potential as the analyte of interest. Such is the effect of high FeIII concentrations in the stripping voltammetric determination of CuII: the voltammetric peaks of both metals are close together in most supporting electrolytes, giving a single wide voltammetric peak that results from two very highly overlapping peaks. Most of the univariate methods proposed in the literature to avoid the interference of iron in the determination of copper include extraction [35] or selective complexation [36], medium exchange procedures [37], subtractive approaches [38], etc. These methods do not always achieve the pursued goal, since the separate determination of each metal is, in some cases, accompanied by relatively high errors.

The use of PLS regression allows one to determine both metals, copper and the interferent, simultaneously through an adequate experimental design, with the same experimental effort that would be necessary to determine copper alone in the presence of a high level of iron. So a complete design with 6 levels for each metal (i.e. 36 training set samples) and 4 test set samples has been used, the current being recorded at 35 potentials. All the voltammograms are shown together in Fig. 5, where the high overlapping of the two peaks is obvious.

Next, the GA has been used on the training data set to select the most informative variables for the PLS models; the variables selected for CuII and FeIII are those shown in Fig. 5(a) and (b), respectively. Fig. 5(a) shows that, in the determination of copper, the GA has selected variables just in the zone of the copper peak potential, as expected, and others in the right tail of the peak. However, in the case of iron, the peak potential zone is clearly avoided by the GA (only potentials in the tails of the peak have been selected), probably because of the shift undergone by the iron peak when the concentration of this metal increases. Again, the GA avoids the zones of conflict, giving a qualitative way of detecting this kind of phenomenon.

With reference to the prediction ability of the PLS models built after the variable selection procedure, the applied non-parametric tests conclude that there are no statistically significant differences between these predictions and those made by the PLS models built without variable selection (Table 2, Case E). So, the PLS models built after variable selection have the same prediction ability as those built with all the predictor variables, although they are constituted by a significantly lower number of variables.

Fig. 5. Voltammograms recorded in the determination of CuII and FeIII by DPASV (concentration ranges: 0–1.01 µmol dm⁻³ for CuII and 0–101.03 µmol dm⁻³ for FeIII). The highly overlapping voltammetric peaks correspond, from left to right, to CuII and FeIII, respectively. The vertical lines indicate those variables selected by the GA for (a) CuII and (b) FeIII.

3.6. Precision for the test set samples

The precision corresponding to the test set samples without and with variable selection has been calculated in order to compare both procedures. An empirical formula suggested in [39,40] has been used to evaluate the confidence intervals. Table 3 shows the results, where the calculated concentration of each test set sample is accompanied by the corresponding precision, together with the explained and cross-validated variances for the PLS model built without and with previous variable selection for each metal. In general, the precision calculated for the test samples by the PLS models built with those variables selected by the GA is better (smaller) than that obtained when all the predictor variables were used, as can be seen in Table 3. These results confirm that the GA selects the variables most related to the response, avoiding those that could disturb the prediction-response relationship, so the predicted values are more accurate.
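The statement about precision can be checked by simple counting on the ± half-widths of Table 3. The sketch below is only an illustration and not part of the original analysis: it includes the rows of Table 3 for CdII, PbII and ZnII in Case C and for Cases D and E, and the dictionary layout and the function name `compare_precision` are hypothetical.

```python
# Half-widths (the value after the ± sign) of ccalc in Table 3,
# stored as (without GA, with GA) lists for each case and metal.
half_widths = {
    "C CdII": ([0.002, 0.003, 0.003, 0.004, 0.004, 0.003, 0.004, 0.003],
               [0.001, 0.002, 0.002, 0.002, 0.001, 0.001, 0.002, 0.003]),
    "C PbII": ([0.003, 0.004, 0.004, 0.004, 0.004, 0.003, 0.005, 0.004],
               [0.003, 0.003, 0.004, 0.006, 0.003, 0.003, 0.004, 0.002]),
    "C ZnII": ([0.018, 0.020, 0.019, 0.021, 0.028, 0.017, 0.024, 0.022],
               [0.009, 0.009, 0.008, 0.021, 0.011, 0.013, 0.011, 0.015]),
    "D TlI": ([0.013, 0.020, 0.115, 0.027], [0.040, 0.024, 0.049, 0.038]),
    "D PbII": ([0.011, 0.017, 0.077, 0.031], [0.013, 0.007, 0.076, 0.019]),
    "E CuII": ([0.006, 0.007, 0.009, 0.030], [0.004, 0.005, 0.007, 0.011]),
    "E FeIII": ([0.161, 0.207, 0.195, 0.484], [0.195, 0.084, 0.169, 0.203]),
}


def compare_precision(without_ga, with_ga):
    """Count test samples whose half-width with the GA is smaller, equal or larger."""
    smaller = sum(w < wo for wo, w in zip(without_ga, with_ga))
    equal = sum(w == wo for wo, w in zip(without_ga, with_ga))
    larger = sum(w > wo for wo, w in zip(without_ga, with_ga))
    return smaller, equal, larger


counts = {key: compare_precision(*pair) for key, pair in half_widths.items()}
totals = tuple(sum(c[i] for c in counts.values()) for i in range(3))

for key, (smaller, equal, larger) in counts.items():
    print(f"{key}: smaller={smaller} equal={equal} larger={larger}")
print("total (smaller, equal, larger):", totals)
```

For these rows, 29 of the 40 half-widths are smaller with the GA, 5 are equal and 6 are larger; TlI in Case D, where three of the four intervals widen, is the main exception to the general trend.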

References

[1] N.R. Draper, H. Smith, Applied Regression Analysis, 2nd ed., Wiley, New York, 1981.
[2] P.J. Brown, Measurement, Regression and Calibration, Clarendon Press, Oxford, 1993.
[3] J.H. Kalivas, P.M. Lang, Mathematical Analysis of Spectral Orthogonality, Marcel Dekker, New York, 1994.
[4] A.J. Miller, Subset Selection in Regression, Chapman and Hall, London, 1990.
[5] A. Garrido Frenich, D. Jouan-Rimbaud, D.L. Massart, S. Kuttatharmmakul, M. Martínez Galera, J.L. Martínez Vidal, Analyst 120 (1995) 2787.
[6] R. Leardi, R. Boggia, M. Terrile, J. Chemom. 6 (1992) 267.
[7] R. Leardi, J. Chemom. 8 (1994) 65.
[8] H. Kubinyi, J. Chemom. 10 (1996) 119.
[9] D. Broadhurst, R. Goodacre, A. Jones, J.J. Rowland, D.B. Kell, Anal. Chim. Acta 348 (1997) 71.
[10] C.B. Lucasius, G. Kateman, Chemom. Intell. Lab. Syst. 19 (1993) 1.
[11] C.B. Lucasius, G. Kateman, Chemom. Intell. Lab. Syst. 25 (1994) 99.
[12] C.R. Houck, J.A. Joines, M.G. Kay, A Genetic Algorithm for Function Optimization: A Matlab Implementation, NCSU-IE TR 95-09, 1995 (anonymous ftp from: ftp://ftp.eos.ncsu.edu/pub/simul/GAOT).
[13] D.B. Hibbert, Chemom. Intell. Lab. Syst. 19 (1993) 277.
[14] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs, 3rd ed., Springer, Berlin, 1996.
[15] L. Davis, The Handbook of Genetic Algorithms, Van Nostrand Reinhold, New York, 1991.
[16] J. Holland, Adaptation in Natural and Artificial Systems, The University of Michigan Press, Ann Arbor, MI, 1975.
[17] D.E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley, Reading, MA, 1989.
[18] D. Jouan-Rimbaud, D.L. Massart, R. Leardi, Anal. Chem. 67 (1995) 4295.
[19] M.J. Arcos, M.C. Ortiz, B. Villahoz, L.A. Sarabia, Anal. Chim. Acta 339 (1997) 63.
[20] M.J. Arcos, C. Alonso, M.C. Ortiz, Electrochimica Acta 43 (1998) 479.
[21] M.C. Ortiz, A. Herrero, M.S. Sánchez, L.A. Sarabia, M. Íñiguez, Analyst 120 (1995) 2793.
[22] A. Herrero, M.C. Ortiz, Anal. Chim. Acta 348 (1997) 51.
[23] A. Herrero, M.C. Ortiz, J. Electroanal. Chem. 432 (1997) 223.
[24] A. Herrero, M.C. Ortiz, Talanta 46 (1998) 129.
[25] M. Forina, R. Leardi, C. Armanino, S. Lanteri, PARVUS: An Extendable Package of Programs for Data Exploration, Classification and Correlation, Ver 1.1, Elsevier Scientific Software, 1990.
[26] STATGRAPHICS, Ver. 5, STSC, Rockville, MD, 1991.
[27] Matlab, The MathWorks, Natick, Mass., 1992.
[28] M. Hollander, D.A. Wolfe, Nonparametric Statistical Methods, Wiley, Chichester, 1973.
[29] G.E.P. Box, W.G. Hunter, J.S. Hunter, Estadística para investigadores: introducción al diseño de experimentos, análisis de datos y construcción de modelos, Reverté, Barcelona, 1989.
[30] F. Vydra, K. Štulík, E. Juláková, Electrochemical Stripping Analysis, Ellis Horwood, Chichester, 1976.
[31] T.R. Copeland, R.A. Osteryoung, R.K. Skogerboe, Anal. Chem. 46 (1974) 2093.
[32] Z. Lukaszewski, W. Zembrzuski, A. Piela, Anal. Chim. Acta 318 (1996) 159.
[33] A. Henrion, R. Henrion, G. Henrion, F. Scholz, Electroanalysis 2 (1990) 309.
[34] M.C. Ortiz, M.J. Arcos, L.A. Sarabia, Chemom. Intell. Lab. Syst. 34 (1996) 245.
[35] V. Meenakumari, Analyst 120 (1995) 2849.
[36] M.E. Vázquez Díaz, J.C. Jiménez Sánchez, M. Callejón Mochón, A. Guiraum Pérez, Analyst 119 (1994) 1571.
[37] S. Gottesfeld, M. Ariel, J. Electroanal. Chem. 9 (1965) 112.
[38] J.E. Bonelli, H.E. Taylor, R.K. Skogerboe, Anal. Chim. Acta 118 (1980) 243.
[39] Unscrambler II (v 4.0), User's Guide, Camo A/S, Norway, 1992.
[40] S. De Vries, C.J.F. Ter Braak, Chemom. Intell. Lab. Syst. 30 (1995) 239.
