variable selection by an evolution algorithm using modified cp based on mlr and pls modeling: qsar...

Qi Shen · Jian-Hui Jiang · Guo-li Shen · Ru-Qin Yu

Variable selection by an evolution algorithm using modified Cp based on MLR and PLS modeling: QSAR studies of carcinogenicity of aromatic amines

Anal Bioanal Chem (2003) 375: 248–254
DOI 10.1007/s00216-002-1668-1

Received: 19 July 2002 / Revised: 16 October 2002 / Accepted: 21 October 2002 / Published online: 19 December 2002

ORIGINAL PAPER

Q. Shen · J.-H. Jiang · G.-l. Shen · R.-Q. Yu (✉)
State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, Hunan University, 410082 Changsha, China
e-mail: [email protected]

© Springer-Verlag 2002

Abstract The variable selection in QSAR studies by MLR and PLS modeling has been performed using the evolution algorithm (EA). The Cp statistic has been modified and used as the objective function in the EA search for different combinations of molecular descriptors. For MLR modeling a few information-rich descriptors are selected for model formulation. In PLS modeling, the proposed procedure selects a relatively large number of information-containing descriptors, and a PLS model is formulated based on a few latent variables, which are linear combinations of the selected descriptors. The proposed procedures were used for the prediction of carcinogenicity of aromatic amines.

Keywords Modified Cp statistic · Variable selection · Evolution algorithm · QSAR · MLR · PLS

Introduction

Quantitative structure–activity relationships (QSARs) relating chemical structure to biological activity are an important part of chem-bioinformatics. Many applications in chem-bioinformatics depend on the representation of molecules by descriptors and on the selection of the descriptors that are most related to biological activity. Chemical structure is represented by a variety of descriptors, including spatial, electronic, topological, information-content, thermodynamic, conformational, quantum mechanical, and shape descriptors. Several hundred or even thousands of descriptors can be generated in QSAR studies. To avoid the danger of over-fitting and to obtain a stable model, only a subset of the available independent variables should be selected in QSAR analysis. The benefit gained from variable selection in QSAR is not only the stability of the model, but also the interpretability of the relationship between the descriptors and the biological activity. Many QSAR studies and methods are aimed at variable selection, for example, stepwise regression, simulated annealing [1], genetic algorithms [2, 3], and evolution algorithms (EAs) [4, 5], to name just a few.

Methods of variable selection are mostly based on multiple linear regression (MLR), which can perform satisfactorily in well-conditioned situations. For large data sets with highly collinear variables, only a small number of variables are selected in MLR modeling, and much of the information in the whole data set might not be extracted. Partial least-squares (PLS) has been recommended as an alternative to MLR to enlarge the information contained in each model and avoid the danger of over-fitting. This recommendation rests on the assumption that the more variables are used, the better the prediction ability of PLS. However, there is increasing evidence that variable selection can also refine the performance of PLS analysis [4, 6, 7, 8]. Results of PLS analysis might not be improved by using an excess of variables. This arises from the sensitivity of PLS to irrelevant variables that do not contribute to fitting and prediction. It has been recognized that the aforementioned assumption for efficient PLS performance might be unrealistic, and the elimination of uninformative variables remains important in QSAR studies even when PLS is applied.

The methods used for variable selection employ an appropriate objective function to evaluate the effectiveness of a descriptor. The objective function strongly affects the effectiveness and convergence of variable selection. A variety of objective functions are available in the literature. Some objective functions, such as LOF [1], FIT [4], or COST [5], are empirically defined and may contain parameters that need to be adjusted by the user. In our experience, the function FIT can hardly be used when the system comprises many descriptors. Some studies [4] employed the squared predictive correlation coefficient (R²_pred) or S_PRESS values determined by the leave-one-out procedure as the objective function.

Owing to the computational burden, these objective criteria are not generally recommended. There is an expanding demand for an effective and objective selection criterion to evaluate the significance of the descriptors.

The purpose of the present study is to seek an objective function that involves no user-adjustable parameters and can be used successfully to select variables in both MLR and PLS modeling. We employ the modified Cp statistic in MLR and PLS modeling in conjunction with the evolution algorithm to predict the carcinogenic potency of aromatic amines.

Aromatic amines are widely used in the papermaking, leather, textile, and other industries. Most kinds of aromatic amines are mutagenic and carcinogenic. Experimental assessment of mutagenic and carcinogenic potency can be expensive, time consuming, and hazardous. QSAR can be used to predict the carcinogenic potency of aromatic amines when experimental data are not available.

Modified Cp as applied to variable selection in QSAR studies

The Cp statistic [9] is modified for both MLR and PLS modeling. As a rather large number of molecular descriptors are involved in QSAR studies, the evolution algorithm is used as an optimization procedure to efficiently search the huge variable space for the best descriptor combination using MLR or PLS modeling.

Modified Cp for variable selection in MLR

The Cp statistic, an information criterion [9], is applied as the objective function for variable selection and prediction in MLR. Cp is expressed as:

$$C_p(p) = \frac{\mathrm{RSS}_p}{\hat{\sigma}^2} - (n - 2p) \tag{1}$$

where n is the number of samples (observations) and p is the number of independent variables. RSSp is the residual sum of squares of the p-variable model, and σ̂² is the estimate obtained from the RSS of the model involving all variables. The Cp statistic performs well when the sample size is larger than the number of variables. In most QSAR studies, however, the number of variables generated can be far larger than the number of samples. When Cp is used as the objective function in conjunction with the evolution algorithm (EA, vide infra), the number of independent variables selected in each model tends to be larger than the sample size. As a rule of thumb, the number of observations (n) should be at least five times the number of descriptors (p). When Cp is used as the objective function, even when an apparently optimum Cp is reached, the number of descriptors might still be too large compared with n, which degrades the performance of QSAR modeling.

Cp performs satisfactorily in well-conditioned situations where n is larger than p and the collinearity among variables is negligible. In Eq. 1, σ̂² is usually derived from a traditional well-conditioned model. In QSAR studies, however, multicollinearity among variables is the rule rather than the exception, and the number of variables representing the objects is large compared with the sample size. The use of σ̂² in such ill-conditioned systems would result in over-fitting and underestimation of the model error. In effect, σ̂² represents the weight of the model precision with regard to the model dimensionality. In other words, an increase in model precision is associated with an increased number of selected variables, and σ̂² is sensitive to p. It is generally recognized that PLS can deal with the multicollinearity problem and can provide a correct estimate of the model error. Cp is therefore modified by replacing σ̂² in Eq. 1 with σ̂²_PLS. When the original data set of interest is subjected to conventional PLS analysis, σ̂²_PLS is defined as the value of RSS corresponding to the minimum number of principal components beyond which a further increase in the number of principal components does not cause a significant reduction in RSS. The modified Cp in MLR is expressed as follows:

$$C_{p1}(p) = \frac{\mathrm{RSS}_p}{\hat{\sigma}^2_{\mathrm{PLS}}} - (n - 2p) \tag{2}$$

Here p is still the number of independent variables when MLR is used as the statistical model, and RSSp is still the residual sum of squares of the p-variable model. The experimental results described in the following sections show that the penalty on the number of independent variables in the modified Cp is moderate.
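As a minimal sketch of how Eq. 2 is evaluated, the snippet below computes Cp1(p) and compares it with the MLR results reported in Table 2 (n = 41 compounds). The text does not report the numerical value of σ̂²_PLS; the constant 0.3504 used here is an assumption obtained by back-solving Eq. 2 from the Table 2 entries, and the function and variable names are illustrative only.

```python
# ASSUMPTION: sigma^2_PLS is not stated in the text. 0.3504 is back-solved
# from the Cp1(p) and RSS columns of Table 2 and is used for illustration only.
SIGMA2_PLS = 0.3504
N_SAMPLES = 41  # number of aromatic amines in the data set


def modified_cp(rss, n_terms, n=N_SAMPLES, sigma2_pls=SIGMA2_PLS):
    """Modified Cp: RSS / sigma^2_PLS - (n - 2 * n_terms).

    n_terms is p (selected descriptors) for MLR, as in Eq. 2.
    """
    return rss / sigma2_pls - (n - 2 * n_terms)


# p -> (RSS, Cp1(p)) as reported in Table 2
table2 = {4: (5.0672, -18.5388), 5: (4.2190, -18.9595), 6: (3.5826, -18.7757)}

for p, (rss, reported) in table2.items():
    print(p, round(modified_cp(rss, p), 4), reported)
```

With this assumed σ̂²_PLS the three computed values agree with the tabulated Cp1(p) to about three decimal places, which also illustrates the moderate penalty: going from p = 5 to p = 6 lowers RSS but raises Cp1.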

Modified Cp for variable selection in PLS

If the system is highly complicated and the activity of interest depends on a large number of correlated descriptors, MLR would not be able to obtain an appropriate model using few descriptors. Usually, information-rich variables are extracted by MLR modeling, while variables containing less information are neglected. The PLS method, on the other hand, can extract as much information as possible and deals quite efficiently with the multicollinearity problem. However, PLS is sensitive to noisy variables, and the addition of irrelevant variables may impair the prediction performance of PLS. Variable selection is therefore also essential in PLS analysis, and the Cp criterion is further modified to select variables in PLS analysis. The modified Cp used in PLS is expressed as:

$$C_{p2}(p) = \frac{\mathrm{RSS}_k}{\hat{\sigma}^2_{\mathrm{PLS}}} - (n - 2k) \tag{3}$$

Here k is the number of latent variables and RSSk is the residual sum of squares of the k-latent-variable model. When Cp is employed for variable selection in PLS, the algorithm automatically determines the number of latent variables that minimizes Cp. In effect, the algorithm optimizes the variable subset and the number of latent variables simultaneously, and no parameter in Cp needs adjustment by the user. By using Cp in PLS one can alleviate the instability due to multicollinearity without significant loss of model precision.
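The automatic choice of the number of latent variables can be sketched as follows, assuming the RSS of the PLS model has already been computed for k = 1, 2, ... latent variables on a fixed descriptor subset. The RSS values, the σ̂²_PLS value, and the function name are hypothetical and serve only to show the arg-min over k in Eq. 3.

```python
def best_latent_count(rss_by_k, sigma2_pls, n):
    """Return (k, Cp2) minimizing Cp2 = RSS_k / sigma2_pls - (n - 2k).

    rss_by_k[0] is the RSS of the 1-latent-variable model, rss_by_k[1]
    that of the 2-latent-variable model, and so on.
    """
    scores = [
        (rss / sigma2_pls - (n - 2 * k), k)
        for k, rss in enumerate(rss_by_k, start=1)
    ]
    cp, k = min(scores)  # tuples compare on Cp first
    return k, cp


# Hypothetical RSS curve for k = 1..5 (NOT taken from the paper).
rss_curve = [6.0, 3.5, 2.4, 2.1, 2.05]
k_best, cp_best = best_latent_count(rss_curve, sigma2_pls=0.35, n=41)
print(k_best, round(cp_best, 3))
```

In this toy curve the RSS keeps decreasing with k, but beyond k = 3 the gain no longer offsets the 2k penalty, so the criterion stops adding latent variables.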


Evolution algorithm for variable selection using modified Cp by MLR or PLS modeling

In this study, we employ the EA as an efficient search procedure for variable selection. The EA, derived by analogy with the evolution process in nature, searches stochastically and in a coordinated manner for the global optimum in a problem space of high dimensionality. The EA is mainly based on mutations, which help the search break out of local optima.

First, a population of N models (for example, 100) is generated, each with a randomly chosen subset of independent variables. The modified Cp value of each model is then calculated as the fitness measure. The evolving process includes mutation and selection operations. Each model is allowed to create a new model through the mutation operation; the Cp of each new model is calculated, and the new models are added to the population. The 2N models are then ordered from best to worst according to their Cp values, and the N models with the largest Cp are removed from the population. Mutation and selection operations are performed repeatedly in this way. The algorithm terminates when the best model (the model with the lowest Cp) remains stable for 50 generations, or when the number of cycles reaches a user-defined limit.
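The search loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' Matlab program: the text does not specify the mutation operator, so a single random bit flip is assumed here, and the names `mutate` and `evolve` as well as the `objective` callable (for example, the modified Cp of a descriptor subset) are hypothetical.

```python
import random


def mutate(mask, rng):
    """ASSUMED mutation: flip one random bit, keeping at least one variable."""
    child = list(mask)
    i = rng.randrange(len(child))
    child[i] = not child[i]
    if not any(child):
        child[i] = True  # undo a flip that would empty the subset
    return tuple(child)


def evolve(objective, n_vars, pop_size=100, patience=50, max_gen=1000, seed=0):
    """Minimize objective(mask) over boolean masks of length n_vars.

    Returns (best_mask, best_score, history of the best score per generation).
    """
    rng = random.Random(seed)
    population = []
    for _ in range(pop_size):
        mask = tuple(rng.random() < 0.5 for _ in range(n_vars))
        if not any(mask):
            mask = mutate(mask, rng)
        population.append(mask)
    scored = sorted(((objective(m), m) for m in population), key=lambda t: t[0])
    history = [scored[0][0]]
    stable = 0
    for _ in range(max_gen):
        # Every model creates one mutated offspring, giving a pool of 2N models.
        children = [mutate(m, rng) for _, m in scored]
        pool = scored + [(objective(c), c) for c in children]
        # Order from best to worst and drop the N models with the largest score.
        pool.sort(key=lambda t: t[0])
        scored = pool[:pop_size]
        stable = stable + 1 if scored[0][0] >= history[-1] else 0
        history.append(scored[0][0])
        if stable >= patience:  # best model unchanged for `patience` generations
            break
    return scored[0][1], scored[0][0], history
```

Because parents stay in the pool until they are outranked, the best score can never get worse from one generation to the next; the stopping rule relies on this property.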

Aromatic amine data

A set of 41 aromatic amines, whose carcinogenic potencies were collected in a comprehensive review by Benigni et al. [10], was used to test the performance of the modified Cp in QSAR studies. The carcinogenic potency is expressed as BRR = log(MW/TD50)_rat, where MW is the molecular weight and TD50 is the daily dose rate necessary to halve the probability of an experimental rat remaining tumorless to the end of its standard life span. As candidate variables for selection, a series of molecular descriptors were calculated for the amines. These include topological, thermodynamic, electronic, spatial, and information-content descriptors. The topological descriptors calculated include electrotopological-state indices (E-State indices) [11, 12, 13, 14, 15] representing the electron accessibility, such as N-aasC and S-aaCH [15]. The E-State index for an atom represents the electron accessibility associated with each atom type. It is an indication of the presence or absence of a given atom type and a count of the number of atoms of a particular element type. For example, in the symbol N-aasC, ‘N’ represents the number of carbon atoms in a phenyl group connected to atoms other than hydrogen, ‘a’ stands for a bond in an aromatic ring, ‘s’ stands for the single bonds of that group, and ‘C’ denotes the carbon atom. The thermodynamic descriptors describe the hydrophobic character (logP, logarithm of the partition coefficient in octanol/water) [16], refractivity (MolRef, molar refractivity) [16], the thermal stability of the molecules (Hf, heat of formation) [17], and the dissolution free energy for water and octanol (Fh2o, desolvation free energy for H2O; Foct, desolvation free energy for octanol) [18]. The electronic descriptors concern superdelocalizability (Sr), atomic polarizabilities (Apol) [19], the dipole moment (Dipole) [19], the highest occupied molecular orbital energy (HOMO) [20], and the lowest unoccupied molecular orbital energy (LUMO) [20]. The spatial descriptors used involve the radius of gyration (RadOfGyration), Jurs charged partial surface area descriptors (Jurs descriptors) [21], density, principal moment of inertia (PMI) [21], and molecular volume. Some so-called information-content descriptors [22], such as atomic composition indices and multigraph information-content indices, were also included in the candidate list. All these 68 molecular descriptors were generated using the Cerius2 3.5 software system on a Silicon Graphics R3000 workstation. Besides the aforementioned newly calculated molecular descriptors, 9 variables used by Benigni et al. were also included in the list of candidate variables. Table 1 summarizes all molecular descriptors used as candidate variables for selection.

The evolutionary algorithm was programmed in Matlab and run on a personal computer (Intel Pentium processor, 733 MHz, 128 MB RAM).

Results and discussion

Variable selection using modified Cp and MLR modeling for QSAR

MLR was first used with the modified Cp as the statistic. For convenience of comparison with the results reported in the literature [10], all 41 compounds were taken as a whole without dividing them into training and validation sets. Among the final 100 combinations of the EA search, the best model, with the minimum Cp value, contains 5 variables. This model and the next-best models, involving 4 and 6 variables, are shown in Table 2. The results obtained by MLR after variable selection using the modified Cp in the EA search compared favorably with those reported in the literature [10], which gave an RSS of 0.947 and an S of 0.358.

Table 2 also lists the maximum correlation coefficient among the variables selected, Rmax. Although the Rmax values listed in Table 2 are lower than the maximum correlation reported by Benigni et al. [10] (0.74), the correlation between the selected variables is still significant. This is an argument in favor of testing PLS modeling.
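A quantity like Rmax can be computed directly from the descriptor columns of a selected subset. The sketch below assumes that Rmax is the largest absolute pairwise Pearson correlation; the text does not say whether the absolute value is taken, so that choice, the function names, and the toy data are assumptions.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def r_max(columns):
    """Largest |r| over all pairs of descriptor columns (ASSUMED definition)."""
    return max(abs(pearson(a, b)) for a, b in combinations(columns, 2))


# Toy descriptor columns (hypothetical values, one list per descriptor).
toy = [[1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8], [1.0, 0.0, 1.0, 0.0]]
print(round(r_max(toy), 4))
```

A value of Rmax well above zero for a selected subset signals the kind of residual collinearity that motivates the PLS treatment in the next section.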

Variable selection using the modified Cp and PLS modeling for QSAR

As mentioned in the introduction, the results of PLS modeling might not be improved when the initial data set is overloaded with irrelevant variables. We applied the variable selection procedure for PLS using the modified Cp (Eq. 3) as described above. The results are summarized in Table 3. The best models obtained have Cp values significantly lower (approximately −26) than those for MLR (approximately −19). The EA search with Cp2(p) as the objective function reached a best model with 26 selected variables and a next-best model with 32 selected variables. The subsequent PLS modeling provides 4- and 5-latent-variable models, respectively. The PLS models gave substantially improved RSS and R values compared with those of MLR modeling. The use of Cp2(p) thus selects 26 or 32 variables from the 75 descriptors. These selected variables carry more or less information related to carcinogenicity. The subsequent PLS modeling eliminates the correlation among the selected variables and formulates 4–5 latent variables as linear combinations of the preliminarily selected 26 or 32 variables. In this way, PLS maximally extracts information from the original data set for QSAR model building.

Table 1 List of candidate variable molecular descriptors for aromatic amines studied

| Variable | Description | Variable | Description |
|---|---|---|---|
| 1. S-aaCH | E-State index | 38. LUMO | Lowest unoccupied molecular orbital energy |
| 2. S-aasC | E-State index | 39. RadOfGyration | Radius of gyration |
| 3. S-sNH2 | E-State index | 40. Area (Å2) | Molecular surface area |
| 4. S-sCH3 | E-State index | 41. MW (g mol–1) | Molecular weight |
| 5. S-ssCH2 | E-State index | 42. Vm (Å3) | Molecular volume |
| 6. S-ssssC | E-State index | 43. Density (g mL–1) | Density |
| 7. S-dsN | E-State index | 44. PMI-mag | Principal moment of inertia |
| 8. I-sCH3 | E-State index | 45. PMI-X | Principal moment of inertia |
| 9. I-sNH2 | E-State index | 46. PMI-Y | Principal moment of inertia |
| 10. I-ssCH2 | E-State index | 47. PMI-Z | Principal moment of inertia |
| 11. N-sCH3 | E-State index | 48. PMI | Principal moment of inertia |
| 12. N-aaCH | E-State index | 49. Hbond acceptor | Number of hydrogen bond acceptors |
| 13. N-aasC | E-State index | 50. Hbond donor | Number of hydrogen bond donors |
| 14. N-sNH2 | E-State index | 51. RotBonds | Number of rotatable bonds |
| 15. N-dsN | E-State index | 52. Jurs-WPSA-2 | Jurs charged partial surface area descriptor |
| 16. N-ssO | E-State index | 53. Jurs-TPSA | Jurs charged partial surface area descriptor |
| 17. JX | Balaban index | 54. Jurs-RPSA | Jurs charged partial surface area descriptor |
| 18. Kappa-1 | Kappa topological indices | 55. Jurs-RASA | Jurs charged partial surface area descriptor |
| 19. Kappa-3 | Kappa topological indices | 56. Shadow-Zlength | Shadow indices |
| 20. SC-1 | Kier and Hall subgraph count index | 57. Shadow-nu | Shadow indices |
| 21. CHI-1 | Molecular connectivity index | 58. CIC | Multigraph information content indices |
| 22. logZ | Topological descriptor | 59. BIC | Multigraph information content indices |
| 23. AlogP | Total value of logP | 60. SIC | Multigraph information content indices |
| 24. LogP | The octanol/water partition coefficient | 61. IC | Multigraph information content indices |
| 25. MRCM**-3 | Molar refractivity | 62. V-ADJ-mag | Information indices |
| 26. MolRef | Molar refractivity | 63. E-ADJ-mag | Information indices |
| 27. Hf | Heat of formation | 64. V-DIST-mag | Information indices |
| 28. Fh2O | Desolvation free energy for water | 65. E-DIST-mag | Information indices |
| 29. Foct | Desolvation free energy for octanol | 66. IAC-Total | Information of atomic composition index |
| 30. Apol | Sum of atomic polarizabilities | 67. MR3 | Molar refractivity |
| 31. Dipol-mag | Dipole | 68. ΣMR2,6 | Molar refractivity |
| 32. Dipol-X | Dipole | 69. Σ�2,6 | Resonance-polar electronic substituent constants |
| 33. Dipol-Y | Dipole | 70. Σ�2,6 | Inductive electronic substituent constants |
| 34. Dipol-Z | Dipole | 71. Es(R) | Steric properties of substituents |
| 35. Sr | Superdelocalizability | 72. I(Bi) | Indicator variable |
| 36. Energy | Energy of conformation | 73. I(F) | Indicator variable |
| 37. HOMO | Highest occupied molecular orbital energy | 74. I(BiBr) | Indicator variable |
| | | 75. I(RNNO) | Indicator variable |

Table 2 Results of variable selection using Cp1(p) and MLR modeling

| p | Descriptors | Cp1(p) | RSS^a | R^a | S^a | F^a | Rmax^b |
|---|---|---|---|---|---|---|---|
| 4 | 1,13,25,75 | −18.5388 | 5.0672 | 0.9458 | 0.3752 | 76.31 | 0.4771 |
| 5 | 1,10,13,74,75 | −18.9595 | 4.2190 | 0.9551 | 0.3472 | 72.69 | 0.5469 |
| 6 | 1,10,13,23,74,75 | −18.7757 | 3.5826 | 0.9620 | 0.3246 | 78.30 | 0.6457 |

a RSS: sum of squared residuals; R: correlation coefficient; S: standard deviation; F: F-statistic
b Rmax: the maximum correlation coefficient among the variables
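The S column of Table 2 can be cross-checked against the RSS column. The paper only describes S as a standard deviation; the relation S = sqrt(RSS / (n − p − 1)) used below is an assumption (the usual residual standard deviation of a p-variable regression with intercept) that happens to reproduce the tabulated values for n = 41. The function name is illustrative.

```python
import math


def residual_sd(rss, n, p):
    """ASSUMED relation: S = sqrt(RSS / (n - p - 1)) for a p-variable MLR model."""
    return math.sqrt(rss / (n - p - 1))


# (p, RSS, S) as reported in Table 2; n = 41 compounds
table2 = [(4, 5.0672, 0.3752), (5, 4.2190, 0.3472), (6, 3.5826, 0.3246)]

for p, rss, s_reported in table2:
    print(p, round(residual_sd(rss, 41, p), 4), s_reported)
```

Agreement to four decimals for all three rows suggests that the tabulated S already accounts for the degrees of freedom used by the selected descriptors.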

When the EA search terminates, one may count the number of times each molecular descriptor appears in the 100 individual combinations. Listing the descriptors in order of decreasing frequency of appearance gives the top descriptors, that is, the most frequently appearing features, which are the preferred descriptors for building QSAR models (Table 4). Molar refractivity (MR3) of the substituent, a combined measure of its size and polarizability that was used by Benigni et al. [10], turned out to be one of the most important variables. It is interesting that among the top descriptors there are eight electrotopological state (E-state) descriptors. These indices seem to be information-rich in describing molecular structure. In order to exert their carcinogenic potential, aromatic amines have to be metabolized to reactive electrophiles [10]. For aromatic amines and amides, this typically involves an initial N-oxidation to N-hydroxyarylamines and N-hydroxyarylamides. It is therefore reasonable that the E-state indices N-aasC and S-aasC are effective descriptors for these reactions. The subscript 'aasC' refers to a carbon atom in a phenyl group that is connected to an atom other than hydrogen. N-aasC and S-aasC embody the important structural features that seem to be essential for the carcinogenic potency of aromatic amines, for example the nature and the number of amine or amine-generating groups and other ring substituents, which play a crucial role in the reactions of nitrogen oxidation and reduction. The descriptors S-aaCH and N-aaCH represent the number of carbon atoms connected with hydrogen in the phenyl ring. Certain aromatic amines and nitroaromatic hydrocarbons are converted into electrophilic derivatives through ring oxidation pathways. The number and nature of aromatic rings are structurally important for carcinogenic potency. The atom group =N– also has a significant influence on carcinogenicity, but few molecules contain it, so the related descriptor S-dsN is less important. Superdelocalizability (Sr) is an index of reactivity in aromatic hydrocarbons, which may well correlate with relative reactivity. Indicator variables I(Bi), I(F), and I(BiBr) [10] characterize some special features of the compounds, which seem to be important for carcinogenicity.

Table 3 Results of variable selection using Cp2(p) and PLS modeling

k   Variables   RSS^a   R^a   Cp2(p)
4   5,6,7,8,10,11,12,13,14,15,21,23,27,35,38,43,45,54,55,58,61,67,69,73,74,75,   2.1138   0.9777   –26.9506
5   1,5,6,7,8,10,13,14,16,17,18,20,21,22,35,37,38,39,43,49,54,55,58,59,60,61,66,67,68,69,73,74,   1.9791   0.9792   –25.3518

^a See footnote a of Table 2

Table 4 Preferred variables for the data set selected by the modified Cp

Method       Preferred variables
Cp1(p) MLR   N-aasC, S-aaCH, MRCM**3, I-ssCH2, S-dsN, S-ssCH2, S-aasC, N-aaCH, I-sCH3, Sr, I(BiBr), I(RNNO),
Cp2(p) PLS   MR3, I-ssCH2, S-dsN, I-sCH3, S-aaCH, N-aasC, S-ssCH2, S-aasC, N-aaCH, AlogP, logP, LogZ, Sr, I(BiBr), I(Bi), I(F), LUMO, Shadow-Zlength, Jurs-TPSA

Table 5 Result of prediction using Cp1(p) and MLR modeling

p   Descriptors   Cp1(p)   RSS^a   R^a   S^a   F^a   Rpred^b
5   1,10,13,15,74   –13.9566   2.6921   0.9633   0.3101   72.1195   0.9246
6   1,10,13,15,70,74   –13.4112   2.2053   0.9700   0.2858   71.7413   0.9252
7   10,13,12,15,23,70,74   –12.8608   1.7201   0.9767   0.2572   76.9667   0.9323

^a See footnote a of Table 2
^b Rpred: correlation coefficient of the prediction set

Table 6 Result of prediction using Cp2(p) and PLS modeling

k   Variables   Cp2(p)   RSS^a   R^a   Rpred^b
3   1,2,5,7,8,10,11,12,13,15,22,23,24,38,54,55,56,57,67,70,72,73,74,   –21.2076   1.6040   0.9783   0.9175
4   2,5,7,8,10,11,14,18,20,21,22,24,26,34,38,39,43,54,56,61,67,72   –19.3284   1.5636   0.9789   0.9437
5   1,2,7,8,10,16,18,19,22,23,29,35,40,43,49,50,54,55,57,58,59,60,67,69,69,71,72,73,74,75   –19.4765   1.5140   0.9795   0.9520

^a See footnote a of Table 2
^b Rpred: correlation coefficient of the prediction set

In a similar way one can find the top descriptors that appear most frequently during the EA search for MLR modeling; although these top descriptors may not all be picked by the final MLR model, they are also listed in Table 4. Most of the top descriptors in MLR modeling are also selected as important variables in PLS modeling. One notices, however, that quite a few descriptors not picked by the MLR model were selected by PLS modeling. The MLR modeling used a minimum number of information-rich original descriptors and neglected all remaining, less informative variables. For example, hydrophobicity and LUMO are descriptors selected by PLS modeling. These descriptors are related to general properties of molecules. The set of aromatic amines taken for investigation contains chemical compounds with very different basic structures (e.g., anilines, biphenylamines, naphthylamines, and aminofluorenes). Some molecules possess very little or even no carcinogenicity at all. It is reasonable that MLR would not select logP, AlogP, and LUMO as most preferred variables. Some spatial and topological descriptors, for example, logZ, Jurs-TPSA, and Shadow-Zlength, are also selected by PLS modeling. Here one can see the way PLS with the proposed modified Cp works. With the help of the modified Cp, one selects the most preferred, more or less information-containing variables, and simultaneously the PLS model is formulated based on a few latent variables that are linear combinations of all selected variables. One of the advantages of PLS modeling is that it eliminates the correlation between the selected variables. For this reason some of the aforementioned variables, which might be too general to be used in MLR modeling, are useful in PLS modeling.
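The frequency ranking used to obtain the preferred descriptors can be illustrated with a minimal sketch. It assumes that the final EA population is available as a collection of descriptor-index combinations; the function name `rank_descriptors`, the toy population, and the labels are hypothetical and serve only as an illustration, not as the implementation used in this work.

```python
from collections import Counter


def rank_descriptors(population, names):
    """Rank descriptors by how often they occur across the final EA population.

    population: iterable of descriptor-index combinations, one per individual
    names: dict mapping descriptor index -> label (hypothetical labels here)
    """
    counts = Counter()
    for individual in population:
        # set() guards against an index listed twice within one individual
        counts.update(set(individual))
    # Sort by decreasing count; ties are broken by index for a stable order
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(names.get(idx, f"x{idx}"), n) for idx, n in ranked]


# Toy population of 4 individuals (the text uses 100); indices are illustrative
toy_population = [(1, 10, 13), (10, 13, 74), (1, 10, 15), (10, 70, 74)]
toy_names = {1: "d1", 10: "d10", 13: "d13"}
ranking = rank_descriptors(toy_population, toy_names)
```

Descriptors at the head of `ranking` would correspond to the preferred variables of Table 4; where to cut the list is a judgment left to the analyst.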

Prediction of carcinogenicity using QSAR models

The real goal in developing a QSAR model is to predict activity (Tables 5 and 6). To check the validity of the proposed methods, the data set of 41 aromatic amines was divided at random into two groups. Thirty-four compounds were used as the training set for developing regression models, while the remaining 7 compounds were used as the validation set.
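A random division of this kind might be sketched as follows. The function name `split_dataset`, the fixed seed, and the use of integer indices to stand for the compounds are assumptions made for illustration; the actual assignment of compounds to the two sets is not reproduced here.

```python
import random


def split_dataset(n_compounds=41, n_validation=7, seed=0):
    """Randomly split compound indices into training and validation sets."""
    rng = random.Random(seed)  # fixed seed (an assumption) makes the split repeatable
    indices = list(range(n_compounds))
    rng.shuffle(indices)
    validation = sorted(indices[:n_validation])
    training = sorted(indices[n_validation:])
    return training, validation


training_idx, validation_idx = split_dataset()
```

Variable selection and model fitting would then use only `training_idx`, with `validation_idx` held back for computing Rpred.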

The best MLR model selected using the training set contains 7 variables. The correlation between the calculated and experimental values of BRR is shown in Fig. 1. The correlation coefficient for the training set was 0.9767 and Rpred for the validation set was 0.9323. The PLS model based on 5 latent variables gave correlation coefficients of 0.9750 and 0.9520 for the training and validation sets, respectively (Fig. 2).
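As a consistency check, the S and F columns of Table 5 can be recomputed from the tabulated RSS and R values. The sketch below assumes the usual definitions, S = sqrt(RSS/(n − p − 1)) and F = (R²/p)/((1 − R²)/(n − p − 1)), with n = 34 training compounds; the definitions actually used are those of footnote a of Table 2, so this is an assumption, and the function names are hypothetical.

```python
import math

N_TRAIN = 34  # number of training compounds


def residual_std_error(rss, n, p):
    """Standard error of regression with n - p - 1 residual degrees of freedom."""
    return math.sqrt(rss / (n - p - 1))


def f_statistic(r, n, p):
    """Overall F ratio of a p-descriptor model computed from its correlation coefficient."""
    r2 = r * r
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))


# (p, RSS, R, S, F) as listed in Table 5
table5 = [
    (5, 2.6921, 0.9633, 0.3101, 72.1195),
    (6, 2.2053, 0.9700, 0.2858, 71.7413),
    (7, 1.7201, 0.9767, 0.2572, 76.9667),
]

# Each entry: (p, recomputed S, tabulated S, recomputed F, tabulated F)
checks = [
    (p, residual_std_error(rss, N_TRAIN, p), s, f_statistic(r, N_TRAIN, p), f)
    for p, rss, r, s, f in table5
]
```

Under these assumptions the recomputed S values agree with the table to four decimal places, and the F values agree to within the rounding of R.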

Conclusion

The modified Cp statistic has been shown to be an improved objective function in the EA search for the optimal combination of molecular descriptors. MLR modeling using EA and the modified Cp is a straightforward way to determine the optimum model with a few information-rich descriptors. When the correlation between molecular descriptors must be eliminated, PLS modeling is more advantageous. The EA search using the modified Cp selects a relatively large number of information-containing descriptors and simultaneously formulates an optimum PLS model based on a few latent variables extracted from the selected descriptors. The QSAR analysis of the carcinogenicity of aromatic amines shows satisfactory prediction performance for the proposed methodology.

Acknowledgements The work was financially supported by the National Natural Science Foundation of China (Grant Nos. 29735150, 20075006, 29975007 and 20105007).


Fig. 1 Calculated versus observed BRR of a seven-descriptor model using Cp1(p) and MLR modeling

Fig. 2 Calculated versus observed BRR of a five-PC model using Cp2(p) and PLS modeling


References

1. Sutter JM, Dixon SL, Jurs PC (1995) J Chem Inf Comput Sci 35:77–84
2. Rogers D, Hopfinger AJ (1994) J Chem Inf Comput Sci 34:854–866
3. Yasri A, Hartsough D (2001) J Chem Inf Comput Sci 41:1218–1227
4. Kubinyi H (1996) J Chemometrics 10:119–133
5. Luke BT (1994) J Chem Inf Comput Sci 34:1279–1287
6. Xie H-P, Jiang J-H, Cui H, Shen G-L, Yu R-Q (2002) Comput Chem (in press)
7. Lindgren F, Geladi P, Rannar S, Wold S (1994) J Chemometrics 8:349–363
8. Jiang J-H, Berry RJ, Siesler HW, Ozaki Y (2002) Anal Chem (in press)
9. Mallows CL (1973) Technometrics 15:661–675
10. Benigni R, Giuliani A, Franke R, Gruska A (2000) Chem Rev 100:3697–3714
11. Kier LB, Hall LH (1990) Pharm Res 7:801–807
12. Hall LH, Kier LB (1991) J Chem Inf Comput Sci 31:76–78
13. Hall LH, Kier LB (1991) Quant Struct-Act Relat 10:43–48
14. Kier LB, Hall LH, Frazer JW (1992) J Math Chem 7:229–237
15. Hall LH, Kier LB (1995) J Chem Inf Comput Sci 35:1039–1045
16. Viswanadhan VN, Ghose AK, Revankar GR, Robins RK (1989) J Chem Inf Comput Sci 29:163–172
17. Dewar MJS, Thiel W (1977) J Am Chem Soc 99:4899–4907
18. Hopfinger AJ et al (1980) Safe handling of chemical carcinogens, mutagens, teratogens and highly toxic substances. Ann Arbor Press, Ann Arbor, MI, p 385
19. Gasteiger J, Marsili M (1980) Tetrahedron 36:3219
20. Fischer H, Kollmar H (1969) Theor Chim Acta 13:213
21. Stanton DT, Jurs PC (1990) Anal Chem 62:2323–2329
22. Bonchev D (1983) Information theoretic indices for characterization of chemical structures. Research Studies Press, Chichester, UK, p 249