
European Journal of Pharmaceutical Sciences 22 (2004) 145–152

Modified particle swarm optimization algorithm for variable selection in MLR and PLS modeling: QSAR studies of antagonism of angiotensin II antagonists

Qi Shen, Jian-Hui Jiang, Chen-Xu Jiao, Guo-li Shen, Ru-Qin Yu∗

State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, Hunan University, Changsha 410082, China

Received 25 September 2003; received in revised form 24 February 2004; accepted 2 March 2004

Abstract

A modified particle swarm optimization (PSO) algorithm is proposed. The PSO algorithm has been modified to adapt it to discrete combinatorial optimization problems and to reduce the probability of becoming trapped in local optima. In the modified PSO algorithm, the velocity represents the probability of an element in each particle taking the value 1 or 0. The modified discrete PSO algorithm is used to select variables in MLR and PLS modeling and to predict the antagonism of angiotensin II antagonists. The modified C(p) statistic is employed as the fitness function. The results were compared with those obtained by GAs. Experimental results demonstrate that the modified PSO is a useful tool for variable selection that converges quickly towards the optimal position.

© 2004 Elsevier B.V. All rights reserved.

Keywords: Particle swarm optimization; Quantitative structure-activity relationships; Variable selection; Antagonism

1. Introduction

The study of quantitative structure-activity relationships (QSARs), an important area of chemometrics, searches for information relating chemical structure to biological and other activities. Chemical structure is represented by a variety of descriptors, such as topological, thermodynamic, quantum mechanical and shape descriptors. Several hundred or even thousands of descriptors can be generated in QSAR studies. In QSAR, the number of compounds with available biological activity values is usually small compared with the number of structural descriptors. This can lead either to overfitting or even to a complete failure to build a meaningful regression model. The selection of variables that are really indicative of the biological activity concerned is becoming one of the key steps in QSAR studies. The benefit gained from variable selection in QSAR is not only the stability of the model, but also the interpretability of the relationship between the descriptors and the biological activity.

∗ Corresponding author. Tel.: +86-731-8821577; fax: +86-731-8822577.

E-mail address: [email protected] (R.-Q. Yu).

Many QSAR models are based mainly on multiple linear regression (MLR) and partial least-squares (PLS). PLS regression has been found useful in handling regression tasks that involve many variables and has been recommended as an approach that efficiently utilizes the information carried by these variables. However, there is increasing evidence that variable selection is also essential for successful PLS analysis of QSAR data and that the lack of variable selection can spoil the PLS regression (Hemmateenejad et al., 2002; Tang and Li, 2002). It has been recognized that eliminating uninformative variables, which do not contribute to model formulation, is important in both MLR and PLS.

For variable selection, one can use the classical stepwise regression procedure, as well as more sophisticated methodologies such as simulated annealing (Sutter et al., 1995), genetic algorithms (GAs) (Rogers and Hopfinger, 1994; Yasri and Hartsough, 2001) and evolutionary algorithms (EAs) (Kubinyi, 1996; Luke, 1994). Among them, GAs and EAs are optimization techniques that simulate biological systems and are classified under the research area of so-called artificial life. The particle swarm optimization (PSO) algorithm (Kennedy and Eberhart, 1995), a relatively new optimization technique in this category that originated as a simulation of a simplified social system, can also be used as an excellent optimizer. PSO, developed by Eberhart and Kennedy in 1995, is a stochastic global optimization technique inspired by the social behavior of bird flocking. Similar to GAs and EAs, PSO is a population-based optimization tool, which searches for optima by updating generations. However, unlike GAs and EAs, PSO has no evolution operators such as crossover and mutation. Compared with GAs and EAs, the advantages of PSO are that it is easy to implement and has few parameters to adjust. PSO (Shi and Eberhart, 1998; Clerc and Kennedy, 2002) has been successfully applied in function optimization, artificial neural network training and fuzzy system control. To our knowledge, there have been no reports concerning the application of PSO in variable selection. Most versions of PSO operate in continuous, real-number space. In our preliminary research, the discrete binary version of PSO developed by Kennedy and Eberhart (1997) was applied to variable selection, but the results were not satisfactory. In the present paper, the PSO algorithm is modified to adapt it to the discrete combinatorial optimization problem of variable selection in QSAR. A modified discrete PSO algorithm is proposed to select variables in MLR and PLS modeling and is used to predict the antagonism of angiotensin II antagonists. The results were compared with those obtained by GAs. It is demonstrated that the modified PSO is a useful tool for variable selection that converges quickly towards the optimal position.

0928-0987/$ – see front matter © 2004 Elsevier B.V. All rights reserved. doi:10.1016/j.ejps.2004.03.002

The vasoactive hormone angiotensin II, produced by the renin-angiotensin system, plays an integral role in the pathophysiology of hypertension. Angiotensin-converting enzyme (ACE) inhibitors, which block the conversion of angiotensin I to angiotensin II, are widely used for the treatment of hypertension and congestive heart failure. However, ACE inhibitors have some side effects, such as dry cough and rashes. A need therefore arose for more specific means of blocking the effects of angiotensin II with minimal side effects. Because experimental assessment of inhibitory ability can be expensive and time consuming, QSAR can be used as a complementary tool to predict the antagonism of angiotensin II, which is helpful in the design of angiotensin II antagonists.

2. Theory

2.1. Particle swarm optimization

The PSO algorithm, introduced as an optimization technique by Eberhart and Kennedy, simulates the behavior of bird flocking in a scenario where a group of birds randomly looks for food in an area. None of the birds knows where the food is located, except the individual that is nearest to the food. The effective strategy for the birds is therefore to follow the bird that is nearest to the food. PSO is motivated by this scenario and uses it to solve optimization problems. In PSO, each single solution is a particle in the search space. The algorithm models the exploration of a problem space by a population of individuals, or particles. All particles have fitness values, which are evaluated by the fitness function to be optimized. As in other evolutionary computation methods, the population of individuals is updated by applying operators according to the fitness information, so that the individuals can be expected to move towards better solution areas. Instead of crossover and mutation operators, each individual in PSO flies through the search space with a velocity that directs its flight. The particles are “flown” through the problem space by following the current optimum particle.

PSO is initialized with a group of random particles. Each particle is treated as a point in a $D$-dimensional space. The $i$th particle is represented as $x_i = (x_{i1}, x_{i2}, \ldots, x_{iD})$. The best previous position of the $i$th particle, that is, the one that gives the best fitness value, is represented as $p_i = (p_{i1}, p_{i2}, \ldots, p_{iD})$. The best particle among all the particles in the population is represented by $p_g = (p_{g1}, p_{g2}, \ldots, p_{gD})$. The velocity, the rate of position change for particle $i$, is represented as $v_i = (v_{i1}, v_{i2}, \ldots, v_{iD})$. In every iteration, each particle is updated by following these two best values. After finding the two best values, the particle updates its velocity and position according to the following equations:

$$v_{id} = v_{id} + c_1 \times r_1 \times (p_{id} - x_{id}) + c_2 \times r_2 \times (p_{gd} - x_{id}) \qquad (1)$$

$$x_{id} = x_{id} + v_{id} \qquad (2)$$

where $c_1$ and $c_2$ are two positive constants named learning factors, and $r_1$ and $r_2$ are random numbers in the range (0, 1). Eq. (1) is used to calculate the particle's new velocity from its previous velocity and the distances of its current position from its own best position and from the group's best position. The particle then flies toward a new position according to Eq. (2). Such an adjustment of the particle's movement through the space causes it to search around the two best positions. The algorithm terminates when the minimum error criterion is attained or the number of cycles reaches a user-defined limit.
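The update in Eqs. (1) and (2) can be sketched for a single particle as follows. This is a minimal illustration rather than the authors' implementation (which was written in Matlab); the function name `pso_step` and the default learning factors `c1 = c2 = 2.0` are assumptions made for the example.

```python
import random

def pso_step(x, v, p_best, g_best, c1=2.0, c2=2.0, rng=random):
    """Apply Eqs. (1) and (2) once to one particle.

    x, v    : current position and velocity (lists of floats)
    p_best  : best previous position of this particle (p_i)
    g_best  : best position found by the whole population (p_g)
    c1, c2  : learning factors (the values 2.0 are an assumption)
    """
    new_x, new_v = [], []
    for d in range(len(x)):
        r1, r2 = rng.random(), rng.random()  # fresh random numbers per dimension
        # Eq. (1): previous velocity plus the pulls toward p_i and p_g
        vd = v[d] + c1 * r1 * (p_best[d] - x[d]) + c2 * r2 * (g_best[d] - x[d])
        new_v.append(vd)
        # Eq. (2): move the particle by its new velocity
        new_x.append(x[d] + vd)
    return new_x, new_v
```

When a particle already sits on both best positions, the two attraction terms vanish and it simply keeps moving with its previous velocity.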

PSO is simple to implement and has few parameters to adjust. It has been found to be robust and fast in solving some optimization problems (Brandstatter and Baumgartner, 2002; Carlisle and Dozier, 2000; Ciuprina et al., 2002).

2.2. Modified particle swarm optimization

Most versions of PSO operate in continuous, real-number space. In continuous versions of PSO, the velocity consists of three parts: the first is the previous velocity of the particle, and the second and third are the terms associated with the best positions found in the past. The PSO algorithm updates a population of particles on the basis of information about each particle's previous best performance and the best particle in the population. In PSO, only the best positions give out information to others, whereas in GAs, chromosomes share information with each other. Because the information sharing mechanism in PSO differs significantly from that in GAs, all the particles tend to converge to the best solution quickly.

For a discrete problem expressed in binary notation, a particle moves in a search space restricted to 0 or 1 in each dimension. In a binary problem, updating a particle means changing bits, each of which is in either state 1 or state 0, and the velocity represents the probability of bit $x_{id}$ taking the value 1 or 0.

Based on the information sharing mechanism of PSO, a modified PSO for variable selection is proposed as follows. The velocity $v_{id}$ of every individual is a random number in the range (0, 1). The resulting change in position is then defined by the following rules:

If $0 < v_{id} \le a$, then $x_{id}(\mathrm{new}) = x_{id}(\mathrm{old})$ (3)

If $a < v_{id} \le \tfrac{1}{2}(1 + a)$, then $x_{id}(\mathrm{new}) = p_{id}$ (4)

If $\tfrac{1}{2}(1 + a) < v_{id} \le 1$, then $x_{id}(\mathrm{new}) = p_{gd}$ (5)

where $a$, named the static probability, is a value in the range (0, 1). The initial value of $a$ is 0.5.
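Rules (3) to (5) can be sketched as a per-bit update. The sketch below is an illustration under stated assumptions, not the authors' Matlab code; the function name `update_position` is hypothetical. Because the velocity is drawn uniformly, each bit keeps its old value with probability $a$ and copies $p_{id}$ or $p_{gd}$ with probability $(1 - a)/2$ each.

```python
import random

def update_position(x, p_best, g_best, a, rng=random):
    """Update one binary particle with rules (3)-(5).

    x       : current bit string of the particle (list of 0/1)
    p_best  : the particle's own best bit string (p_i)
    g_best  : the population's best bit string (p_g)
    a       : static probability, between 0 and 1
    """
    threshold = 0.5 * (1.0 + a)
    new_x = []
    for d in range(len(x)):
        v = rng.random()              # velocity: a uniform random number
        if v <= a:                    # rule (3): keep the old bit
            new_x.append(x[d])
        elif v <= threshold:          # rule (4): copy the particle's own best
            new_x.append(p_best[d])
        else:                         # rule (5): copy the global best
            new_x.append(g_best[d])
    return new_x
```

In a variable selection setting, each bit would flag whether the corresponding descriptor is included in the regression model.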

Although the velocity in the modified PSO differs from that in the continuous version of PSO, the information sharing mechanism and the particle updating model, which follows the two best positions, are the same in both versions. Some elements in $x_i$ are kept fixed according to Eq. (3), which is similar to the first term on the right side of Eq. (1). Without this part, the “flying” particles would be determined only by their best positions in history, and all particles would tend to move toward the same position, resembling a local search. In this sense, Eq. (3) gives the particles a tendency to expand the search space, that is, a global search ability. Eqs. (4) and (5) are similar to the second and third terms on the right side of Eq. (1). Without these two parts, the particles would keep on “flying” randomly and PSO would not be able to find a meaningful solution. There should be a balance between the local and global search abilities, and the static probability $a$ plays the role of balancing them. The larger the value of $a$, the greater the probability that the modified PSO overleaps local optima. Conversely, a small value of $a$ makes it more likely that a particle follows the two best past positions, so the algorithm converges more quickly. The modified PSO is designed to have more exploration ability at the beginning and more exploitation ability later on, to search the local area around the particle. Accordingly, the parameter is defined to decrease with the generation: the static probability $a$ starts at 0.5 and decreases to 0.33 when the iteration terminates.

Even so, preliminary experiments indicate that such a PSO version still tends to converge to local optima. To circumvent this drawback and improve the ability of the modified PSO algorithm to overleap local optima, 10% of the particles are forced to fly randomly, without following the two best particles.

With a decreasing static probability and a percentage of randomly flying particles to overleap local optima, the modified PSO retains satisfactory convergence characteristics.
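The two devices described above, a static probability that decreases from 0.5 to 0.33 and a 10% share of randomly flying particles, might be sketched as below. The text does not state the shape of the decrease or how the random particles are chosen, so the linear schedule, the per-iteration random choice and all function names here are assumptions.

```python
import random

def static_probability(iteration, max_iter, a_start=0.5, a_end=0.33):
    """Static probability a for a given iteration (0-based).

    Interpolates linearly from a_start to a_end; the linear form is an
    assumption, since only the two end values are given in the text.
    """
    if max_iter <= 1:
        return a_end
    frac = iteration / (max_iter - 1)
    return a_start + (a_end - a_start) * frac

def pick_random_flyers(n_particles, fraction=0.10, rng=random):
    """Choose the indices of particles that ignore both best positions."""
    k = max(1, round(fraction * n_particles))
    return set(rng.sample(range(n_particles), k))

def random_position(n_bits, rng=random):
    """A random bit string for a particle that is forced to fly randomly."""
    return [rng.randint(0, 1) for _ in range(n_bits)]
```

In a full search loop, particles returned by `pick_random_flyers` would receive `random_position(...)`, while all others would be updated with rules (3) to (5) using the current value of `static_probability(...)`.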

2.3. Fitness function

In the modified PSO, the performance of each particle is measured according to a pre-defined fitness function. The modified C(p) statistic is applied as the objective function for variable selection in the modified PSO. The modified C(p) in MLR is expressed as follows:

$$C_{p1}(p) = \mathrm{RSS}_p / \hat{\sigma}^2_{\mathrm{PLS}} - (n - 2p) \qquad (6)$$

Here $n$ is the number of observations and $p$ is the number of independent variables. $\mathrm{RSS}_p$ is the residual sum of squares of the $p$-variable model. $\hat{\sigma}^2_{\mathrm{PLS}}$ is defined as the value of RSS corresponding to the minimum number of principal components in a conventional PLS analysis of the original data set beyond which a further increase in the number of principal components does not cause a significant reduction in RSS. The modified C(p) in PLS is expressed as follows:

$$C_{p2}(p) = \mathrm{RSS}_k / \hat{\sigma}^2_{\mathrm{PLS}} - (n - 2k) \qquad (7)$$

Here $k$ is the number of latent variables and $\mathrm{RSS}_k$ is the residual sum of squares of the $k$-latent-variable model. The details of the modified C(p) have been described elsewhere (Shen et al., 2003a,b).
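Eqs. (6) and (7) share one form and differ only in whether the model size is the number of variables $p$ or the number of latent variables $k$. A minimal sketch, with a hypothetical function name and illustrative input values that are not taken from the paper:

```python
def modified_cp(rss, sigma2_pls, n, size):
    """Modified C(p) of Eqs. (6) and (7).

    rss        : residual sum of squares of the candidate model
    sigma2_pls : the reference value sigma^2_PLS from the full PLS analysis
    n          : number of observations
    size       : p (variables, MLR) or k (latent variables, PLS)
    """
    if sigma2_pls <= 0:
        raise ValueError("sigma2_pls must be positive")
    return rss / sigma2_pls - (n - 2 * size)

# Illustrative numbers only: RSS = 10, sigma^2_PLS = 2, n = 31, p = 3
example = modified_cp(10.0, 2.0, 31, 3)   # 10/2 - (31 - 6) = -20.0
```

Since the search looks for the model with the minimum C(p), a particle with a lower value is the fitter one.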

2.3.1. Angiotensin II antagonists data

A set of 38 4H-1,2,4-triazoles acting as angiotensin II antagonists, whose antagonism data are collected in a comprehensive review by Kurup et al. (2001), was used to test the performance of the modified PSO in variable selection for QSAR. Ashton et al. (1993) synthesized the 4H-1,2,4-triazoles and evaluated their antagonism against angiotensin II, which is expressed as IC50, the molar concentration of the compound causing 50% antagonism of angiotensin II.

The data set of 38 4H-1,2,4-triazoles was randomly divided into two groups, with 31 compounds used as the training set for developing regression models and the remaining seven compounds used as the validation set for predicting antagonism against angiotensin II. The molecular structure and the numbering of substituents in the series of 4H-1,2,4-triazoles are shown in Fig. 1. A list of the compounds studied, along with their antagonism, is given in Table 1.
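A random 31/7 division of this kind can be sketched as follows. The function name and the `seed` parameter are hypothetical, and the sketch does not reproduce the particular split used in the study, which is recorded in the last column of Table 1.

```python
import random

def split_compounds(n_compounds=38, n_validation=7, seed=None):
    """Randomly divide compound numbers 1..n_compounds into two sets.

    Returns (training, validation) as sorted lists of compound numbers.
    """
    rng = random.Random(seed)             # seed only for reproducibility
    ids = list(range(1, n_compounds + 1))
    validation = set(rng.sample(ids, n_validation))
    training = [i for i in ids if i not in validation]
    return training, sorted(validation)
```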

A series of molecular descriptors was calculated for the 4H-1,2,4-triazole analogs, including structural, spatial, thermodynamic, electronic and quantum mechanical descriptors and E-State indices. The structural descriptors include the molecular weight (MW), the number of rotatable bonds (Rotbonds) and the number of hydrogen bond acceptors (Hbond acceptor). The spatial descriptors (Rohrbaugh and Jurs, 1987; Stanton and Jurs, 1990) comprise the radius of gyration (RadOfGyration), density, molecular surface area, principal moment of inertia (PMI), molecular volume and shadow indices. The thermodynamic descriptors (Viswanadhan et al., 1989) describe the hydrophobic character (logP: logarithm of the octanol/water partition coefficient), refractivity (MolRef: molar refractivity), heat of formation (Hf) and the desolvation free energies for water and octanol (Fh2o: desolvation free energy for H2O; Foct: desolvation free energy for octanol). The electronic descriptors (Cerius2 QSAR+, 1997) concern superdelocalizability (Sr), atomic polarizabilities (Apol) and the dipole moment (Dipole). The electrotopological-state indices (E-State indices) (Hall and Kier, 1991, 1995) used include S-aasC, S-aaN, S-aaCH, etc. For example, in the symbol S-aaN, 'S' represents the electrotopological state of the atom, 'a' stands for a bond in an aromatic ring, and 'N' represents nitrogen. Table 2 summarizes all molecular descriptors used as the candidate variables for selection.

Fig. 1. Structure of 4H-1,2,4-triazole.

Table 1
Summary of experimental and calculated antagonism against angiotensin II of the 4H-1,2,4-triazoles, along with their structures, used in the QSAR study

        Substituent                 log(1/C)
Number  X      Y                    Observed(a)  Calculated1(b)  Calculated2(c)  Series(d)
1       C4H9   C6H5                 6.03         5.86            5.93            1
2       SC2H5  C6H5                 5.72         5.61            5.67            1
3       SC3H7  C6H5                 5.80         5.86            5.88            1
4       C4H9   4-Pyridyl            5.85         5.92            5.82            1
5       C4H9   3-Pyridyl            5.77         5.95            5.89            1
6       C4H9   2-Furyl              6.00         6.01            6.03            1
7       SC2H5  CH2C6H5              6.00         6.23            6.13            1
8       SC3H7  CH2C6H5              6.42         6.51            6.42            1
9       SC3H7  CH2CH2C6H5           7.12         6.46            6.54            1
10      SC3H7  (CH2)3C6H5           6.49         6.66            7.00            2
11      SC3H7  CH2SC6H5             5.92         6.52            6.39            1
12      SC2H5  CH2OMe               5.28         5.09            5.33            1
13      SC2H5  CF3                  5.32         5.16            5.23            1
14      SC3H7  CF3                  5.68         5.54            5.60            1
15      C4H9   SCH2COOMe            6.48         5.84            6.11            2
16      C4H9   SCH2CONHMe           6.14         6.27            5.80            1
17      C4H9   SCH2CH2OH            6.32         6.73            6.57            1
18      C4H9   SCHEtCOOMe           6.52         6.45            6.46            1
19      C4H9   SCH2COC6H5           6.72         7.12            7.08            1
20      C4H9   SC6H5                7.22         6.94            7.31            2
21      C4H9   SCH2C6H5             7.82         7.16            7.24            1
22      C3H7   SCH2CH2C6H5          7.15         7.44            7.50            1
23      C4H9   SCH2C6H5             6.92         7.06            6.97            1
24      C4H9   SCH2(2-Me–C6H4)      7.85         7.27            7.87            2
25      C4H9   SCH2(3-Me–C6H4)      7.49         7.43            7.41            1
26      C4H9   SCH2(4-Me–C6H4)      8.12         7.51            7.56            1
27      C4H9   SCH2(2-Cl–C6H4)      7.52         7.66            7.87            1
28      C4H9   SCH2(3-Cl–C6H4)      7.59         7.65            8.17            2
29      C4H9   SCH2(4-Cl–C6H4)      8.17         7.71            7.66            1
30      C4H9   SCH2(3-OMe–C6H4)     7.68         8.08            8.10            1
31      C4H9   SCH2(4-OMe–C6H4)     8.52         7.92            8.55            2
32      C4H9   SOCH2(4-OMe–C6H4)    8.13         7.58            7.89            1
33      C4H9   SCH2(2-CN–C6H4)      7.22         7.18            7.29            1
34      C4H9   SCH2(4-CF3C6H4)      7.38         7.98            7.76            1
35      C4H9   SCH2-2-naphthy       7.31         7.07            7.26            1
36      C4H9   SCH(CO2Me)C6H5       8.21         7.62            7.85            1
37      C4H9   SCH2(2-CO2Me–C6H4)   7.85         7.18            7.63            2
38      C4H9   SCH2(3-CO2Me–C6H4)   7.52         7.62            7.49            1

(a) Taken from reference (Kurup et al., 2001).
(b) Calculated by Eq. (1) in Table 3.
(c) Calculated by Eq. (1) in Table 4.
(d) Randomly selected as a member of the training (1) or validation (2) set.

Table 2
List of molecular descriptors for the 4H-1,2,4-triazoles studied as candidate variables

Functional families of descriptors    Descriptors
Spatial descriptors                   RadOfGyration (radius of gyration)
                                      Shadow indices (surface area projections) (Shadow-XZ, shadow-YZ, shadow-nu, shadow-Xlength, shadow-Ylength, shadow-Zlength)
                                      Vm (molecular volume)
                                      Density
                                      Area (molecular surface area)
                                      PMI (principal moment of inertia)
Structural descriptors                MW (molecular weight)
                                      Hbond acceptor (number of hydrogen bond acceptors)
                                      Rotbonds (number of rotatable bonds)
Electronic descriptors                Apol (sum of atomic polarizabilities)
                                      Dipole (dipole-mag, dipole-Y, dipole-Z)
                                      Sr (superdelocalizability)
Quantum mechanical descriptors        HOMO (highest occupied molecular orbital energy)
                                      LUMO (lowest unoccupied molecular orbital energy)
Thermodynamic descriptors             A logP, logP (the octanol/water partition coefficient)
                                      Fh2o (desolvation free energy for water)
                                      Foct (desolvation free energy for octanol)
                                      MRCM∗∗-3, MolRef (molar refractivity)
                                      Heat of formation (Hf)
E-State index                         S-aaCH, S-aasC, S-dssC, S-ssS, S-dO, S-sCH3, S-aaN, S-aasN, S-ssO

All these molecular descriptors were generated using the Cerius2 3.5 software system on a Silicon Graphics R3000 workstation.

The modified PSO, GAs and PLS algorithms were written in Matlab 5.3 and run on a personal computer (Intel Pentium 4 processor, 1.5 GHz, 256 MB RAM).

3. Results and discussion

3.1. Variable selection by the modified PSO based on MLR modeling for QSAR

MLR was first used with the modified PSO for variable selection. The best model found during the modified PSO search, that is, the one with the minimum C(p) value, contains three variables: Dipole-Z, Foct and MR. This model and the next-best models, which contain four and six variables, respectively, are shown in Table 3. The correlation between the calculated and experimental values of log 1/IC50 for Eq. (1) of Table 3 is shown in Fig. 2. The correlation coefficient for the training set is 0.9305 and Rp for the validation set is 0.9219. In these equations, the negative coefficient of Foct indicates that increasing the desolvation free energy for octanol decreases antagonism. The positive coefficient of MR indicates that a large molar refractivity is beneficial for the activity. This is in accordance with the conclusion of Kurup et al. that the receptor has some bulk tolerance. The negative coefficient of the descriptor LUMO implies that molecules with low-energy LUMOs would promote the inhibitory activity.

Table 3
Results of variable selection by the modified PSO using C(p) and MLR modeling

Number  R(a)    S(a)    F(a)     Rp(b)   C(p)
1       0.9305  0.3524  58.0602  0.9219  −13.067
2       0.9422  0.3284  51.405   0.9594  −13.019
3       0.9657  0.2649  55.3617  0.9346  −13.009

Equations:
1  log 1/IC50 = (0.0763 × Dipole-Z) − (0.1467 × Foct) + (0.0362 × MR) − 2.2206
2  log 1/IC50 = (−0.0488 × Dipole-mag) − (0.1903 × Foct) + (0.0344 × MR) − (0.5759 × LUMO) − 9.357
3  log 1/IC50 = (0.0547 × S-aaCH) − (0.1817 × Foct) + (0.285 × S-sCH3) + (0.0289 × MR) − (0.2914 × Vm) − (0.7048 × LUMO) − 11.4079

(a) R: correlation coefficient; S: standard deviation; F: F-statistic.
(b) Rp: correlation coefficient of the prediction set.

For comparison with the modified PSO, variable selection by GAs based on MLR modeling was also performed. For the comparison, both the GAs and the modified PSO were limited to 100 cycles. The results are shown in Fig. 3. Fig. 3 shows that the minimum C(p) value is reached during the search by both the GAs and the modified PSO, but the fitness value drops more quickly in the modified PSO algorithm. The minimum C(p) is obtained within 13 cycles by the modified PSO search, whereas the GAs search needs 35 cycles. These experimental results are consistent with the expectation that the modified PSO converges to the best solution quickly.

Fig. 2. Calculated vs. observed log 1/IC50 of the three-variable model using MLR modeling.

3.2. Variable selection by the modified PSO based on PLS modeling for QSAR

Fig. 3. Minimum fitness value vs. number of generations in GAs and the modified PSO algorithm by MLR modeling.

Table 4
Results of variable selection by the modified PSO using C(p) and PLS modeling

k   C(p)     RSS(a)   R(a)     Rpred(a)  Variables
3   −16.89   2.2789   0.9533   0.9369    S-aasC, S-dssC, S-ssC, area, S-aasN, Dipole-mag, Dipole-Z, Foct, S-aaN, MR, A logP, Rotbonds, PMI, MW, Vm, Sr, shadow-Xlength, shadow-XZ
3   −16.83   2.2963   0.9529   0.9181    S-aasC, S-dssC, S-aasN, Vm, Apol, Dipole-mag, Dipole-Z, Shadow-XZ, MR, A logP, Fh2o, logP, MW, HOMO

k: the number of latent variables.
(a) RSS: residual sum of squares; R and Rpred: correlation coefficients for the training and prediction sets.

At the beginning, PLS regression with all variables was carried out for the antagonism of the 4H-1,2,4-triazoles. Using all initial descriptors, the PLS analysis of the training set resulted in a 24-latent-variable QSAR model with correlation coefficient R = 0.9975 for the training set and R = 0.8046 for the test set. This is an obvious symptom of overfitting: although the correlation coefficient is high, there are too many latent variables. As a rule of thumb, the number of observations (n) should be at least five times the number of descriptors (p). The number of PLS components is too large compared with the sample size, which may result in overfitting and underestimation of the model error. The modified PSO algorithm is therefore employed to select the structural parameters that contribute strongly to antagonism for PLS modeling.
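As a quick check of the rule of thumb, applied to latent variables as the paragraph above does, the 31 training compounds and the 24 latent variables of the full model give a ratio far below five, whereas a three-latent-variable model, such as those listed in Table 4, satisfies it:

$$\frac{n}{p} = \frac{31}{24} \approx 1.29 < 5, \qquad \frac{31}{3} \approx 10.3 > 5$$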

The results are summarized in Table 4. The best models obtained have C(p) values significantly lower (approximately −17) than those for MLR (approximately −13). The modified PSO search reached a best model with 18 selected variables and a next-best model with 14 selected variables. The subsequent PLS modeling gives three-latent-variable models in both cases. The best PLS model gave correlation coefficients of 0.9533 and 0.9369 for the training and validation sets, respectively. Compared with the PLS analysis using all variables, better results are obtained from the regression analysis that includes only the selected variables. Variable selection by the modified PSO improved the predictive ability of the PLS model considerably, raising Rp from 0.8046 to 0.9369. The correlation between the calculated and experimental values of log 1/IC50 is shown in Fig. 4. The PLS models gave substantially improved RSS and R values compared with those of MLR modeling. The modified PSO search selects 18 or 14 variables from 37 descriptors. The subsequent PLS modeling eliminates the correlation among the selected variables and identifies three latent variables as linear combinations of the preliminarily selected 18 or 14 variables. These selected variables all carry some information related to antagonism. In this way, PLS maximally extracts information from the original data set for QSAR model building.

Fig. 4. Calculated vs. observed log 1/IC50 of the three-latent-variable model using PLS modeling.

Fig. 5. Minimum fitness value vs. number of generations in GAs and the modified PSO algorithm by PLS modeling.

Variable selection by GAs based on PLS was also performed for comparison with the modified PSO. The results are shown in Fig. 5. The minimum C(p) is reached in about 20 cycles by the modified PSO algorithm, whereas the GAs need about 40 cycles. As in MLR modeling, the fitness value drops more quickly in the modified PSO algorithm with PLS modeling.

When the modified PSO search terminates, one may count the number of times a particular molecular descriptor appears in the 100 individual combinations. When one lists the descriptors in order of decreasing number of appearances, the top descriptors, i.e. the most frequently appearing features, are obtained. Thermodynamic descriptors occupy an important position, and one third of the preferred descriptors are of thermodynamic type. This shows that thermodynamic effects play an essential role in the inhibitory activity of 4H-1,2,4-triazole analogs. A log P and log P, which are related to the hydrophobic character of the molecule, are factors usually much considered in the development of QSAR in biochemistry. They have been shown to be important in QSAR of angiotensin II antagonists (Kurup et al., 2001; Maria et al., 2003). The molecular refractivity index (MR) is a combined measure of molecular size and polarizability, which seems essential with respect to inhibitory activity. The desolvation properties Foct and Fh2o also seem important to inhibitory activity. Another kind of preferred descriptor is the E-State indices. Descriptors S-ssO, S-aasC, S-sCH3, S-ssS and S-aasN are effective variables for angiotensin II antagonists. Besides these variables, the descriptors Shadow-XZ, Vm, Apol, Dipole, and Rotbonds are also favorable for antagonism. The descriptor Rotbonds is related to the conformational flexibility of the molecule. The importance of the descriptor “Rotbonds” indicates that an appropriate steric configuration favors binding with the receptor. The interaction of angiotensin II antagonists is a complex one, which involves thermodynamic, electronic, and structural effects.
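The frequency count described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `rank_descriptors`, the encoding of each individual as a 0/1 list, and the toy population of three individuals over four descriptors are all assumptions made for the example, and the numbers are not data from this study.

```python
from collections import Counter


def rank_descriptors(selections, names):
    """Rank descriptors by how often they are selected in a population.

    selections: list of 0/1 lists, one per individual (1 = descriptor selected).
    names: descriptor names, one per position (hypothetical labels here).
    Returns (name, count) pairs sorted by decreasing count, ties broken by name.
    """
    counts = Counter()
    for individual in selections:
        for name, bit in zip(names, individual):
            if bit:
                counts[name] += 1
    # Descriptors that were never selected are kept with a count of zero.
    return sorted(((n, counts[n]) for n in names), key=lambda p: (-p[1], p[0]))


# Toy population (assumed for illustration only, not the paper's results).
names = ["LogP", "MR", "Rotbonds", "Dipole"]
population = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
]
ranking = rank_descriptors(population, names)
print(ranking)
```

Applied to the final swarm of 100 individuals, the head of such a ranking would correspond to the preferred descriptors discussed in the text.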

4. Conclusion

In the present study, the particle swarm optimization algorithm is modified for use in discrete combinatorial optimization problems. The modified PSO is applied to variable selection in MLR and PLS modeling for predicting the antagonism of angiotensin II antagonists. The modified C(p) is employed as the fitness function. Compared with GAs, the modified PSO shows satisfactory performance and is able to select preferred variables with satisfactory convergence rates.

Acknowledgements

The work was financially supported by the National Natural Science Foundation of China (Grant Nos. 29735150, 20075006, 20105007).

References

Ashton, W.T., Cantone, C.L., Chang, L.L., 1993. Nonpeptide angiotensin II antagonists derived from 4H-1,2,4-triazoles and 3H-imidazo triazoles. J. Med. Chem. 36, 591–609.

Brandstatter, B., Baumgartner, U., 2002. Particle swarm optimization-mass-spring system analogon. IEEE Trans. Magn. 38 (2), 997–1000.

Carlisle, A., Dozier, G., 2000. Adapting particle swarm optimization to dynamic environments. In: Proceedings of International Conference on Artificial Intelligence, Las Vegas, Nevada, USA, pp. 429–434.

Cerius2 QSAR+, 1997. Molecular Simulations Inc., San Diego.

Clerc, M., Kennedy, J., 2002. The particle swarm-explosion, stability and convergence in a multidimensional complex space. IEEE Trans. Evol. Comput. 6, 58–64.

Ciuprina, G., Loan, D., Munteanu, I., 2002. Use of intelligent-particle swarm optimization in electromagnetics. IEEE Trans. Magn. 38 (2), 1037–1040.

Hall, L.H., Kier, L.B., 1991. The electrotopological state: structure information at the atomic level for molecular graphs. J. Chem. Inf. Comput. Sci. 31, 76–78.

Hall, L.H., Kier, L.B., 1995. Electrotopological state indices for atom types: a novel combination of electronic, topological, and valence state information. J. Chem. Inf. Comput. Sci. 35, 1039–1045.

Hemmateenejad, B., Miri, R., Akhond, M., Shamsipur, M., 2002. QSAR study of the calcium channel antagonist activity of some recently synthesized dihydropyridine derivatives. An application of genetic algorithm for variable selection in MLR and PLS methods. Chemom. Intell. Lab. Syst. 64, 91–99.

Kennedy, J., Eberhart, R., 1995. Particle swarm optimization. In: Proceedings of IEEE International Conference on Neural Networks, Perth, Australia, pp. 1942–1948.

Kennedy, J., Eberhart, R., 1997. A discrete binary version of the particle swarm algorithm. In: Proceedings of IEEE International Conference on Computational Cybernetics and Simulation, pp. 4104–4108.

Kubinyi, H., 1996. Evolutionary variable selection in regression and PLS analyses. J. Chemom. 10, 119–133.

Kurup, A., Garg, R., Carini, D.J., Hansch, C., 2001. Comparative QSAR: angiotensin II antagonists. Chem. Rev. 101 (9), 2727–2750.

Luke, B.T., 1994. Evolutionary programming applied to the development of quantitative structure-activity relationships and quantitative structure-property relationships. J. Chem. Inf. Comput. Sci. 34, 1279–1287.

Maria, M.Y., Justo, P., Julio, G., Manuel, A., Francisco, M., Javier, V., 2003. Production of ACE inhibitory peptides by digestion of chickpea legumin with alcalase. Food Chem. 91, 363–369.

Rohrbaugh, R.H., Jurs, P.C., 1987. Descriptions of molecular shape applied in studies of structure/activity and structure/property relationships. Anal. Chim. Acta 199, 99–109.

Rogers, D., Hopfinger, A.J., 1994. Application of genetic function approximation to quantitative structure-activity relationships and quantitative structure-property relationships. J. Chem. Inf. Comput. Sci. 34, 854–866.

Shen, Q., Lü, Q.-Z., Jing, J.H., Yu, R.Q., 2003a. Quantitative structure-activity relationships (QSAR) studies of inhibitors of tyrosine kinase. Eur. J. Pharm. Sci. 20, 63–71.

Shen, Q., Jing, J.H., Yu, R.Q., 2003b. Variable selection by an evolution algorithm using modified Cp based on MLR and PLS modeling: QSAR studies of carcinogenicity of aromatic amines. Anal. Bioanal. Chem. 375, 248–254.

Shi, Y., Eberhart, R., 1998. A modified particle swarm optimizer. In: Proceedings of IEEE World Congress on Computational Intelligence, pp. 69–73.

Stanton, D.T., Jurs, P.C., 1990. Development and use of charged partial surface area structural descriptors in computer-assisted quantitative structure-property relationship studies. Anal. Chem. 62, 2323–2329.

Sutter, J.M., Dixon, S.L., Jurs, P.C., 1995. Automated descriptor selection for quantitative structure-activity relationships using generalized simulated annealing. J. Chem. Inf. Comput. Sci. 35, 77–84.

Tang, K., Li, T., 2002. Combining PLS with GA–GP for QSAR. Chemom. Intell. Lab. Syst. 64, 55–64.

Viswanadhan, V.N., Ghose, A.K., Revankar, G.R., Robins, R.K., 1989. Atomic physicochemical parameters for three dimensional structure directed quantitative structure-activity relationships. 4. Additional parameters for hydrophobic and dispersive interactions and their application for an automated superposition of certain naturally occurring nucleoside antibiotics. J. Chem. Inf. Comput. Sci. 29, 163–172.

Yasri, A., Hartsough, D., 2001. Toward an optimal procedure for variable selection and QSAR model building. J. Chem. Inf. Comput. Sci. 41, 1218–1227.