
BioSystems 65 (2002) 37–47

Quantitative structure–activity relationships by evolved neural networks for the inhibition of dihydrofolate reductase by pyrimidines

Dana G. Landavazo, Gary B. Fogel *, David B. Fogel
Natural Selection Inc., 3333 N. Torrey Pines Ct., Suite 200, La Jolla, CA 92037, USA

Received 18 October 2001; received in revised form 1 November 2001; accepted 2 November 2001

Abstract

Evolutionary computation provides a useful method for training neural networks in the face of multiple local optima. This paper begins with a description of methods for quantitative structure–activity relationships (QSAR). An overview of artificial neural networks for pattern recognition problems such as QSAR is presented and extended with the description of how evolutionary computation can be used to evolve neural networks. Experiments are conducted to examine QSAR for the inhibition of dihydrofolate reductase by pyrimidines using evolved neural networks. Results indicate the utility of evolutionary algorithms and neural networks for the predictive task at hand. Furthermore, results that are comparable or perhaps better than those published previously were obtained using only a small fraction of the previously required degrees of freedom. © 2002 Elsevier Science Ireland Ltd. All rights reserved.

Keywords: Artificial neural networks; Quantitative structure–activity relationships; Dihydrofolate reductase; Evolutionary computation

www.elsevier.com/locate/biosystems

1. Introduction

The effectiveness of drug discovery can be enhanced through the accurate prediction of biological activity or chemical properties of a molecule from a set of structure-based descriptors. A general assumption of this approach is that the activity or property of a molecule is implicit in its physical structure. Quantitative structure–activity relationships (QSAR) are used to develop statistical models relating structure to activity. Following the identification of a relatively small set of related or unrelated molecules with known activities or properties, many descriptors for these molecules are determined. These are grouped generally into three major classes: hydrophobic, electronic, and steric effects; however, several subdivisions of each may be used. Once a suitable statistical model is developed, it can be used to predict the activity of a novel compound and may also provide insight toward the mechanism of action for a particular drug.

* Corresponding author. Tel.: +1-858-455-6449; fax: +1-858-455-1560.
E-mail address: [email protected] (G.B. Fogel).

0303-2647/02/$ - see front matter © 2002 Elsevier Science Ireland Ltd. All rights reserved.
PII: S0303-2647(01)00192-7

Over the past decade, artificial neural networks have been applied to QSAR in the analysis of


2,4-diamino-5-(substituted benzyl)pyrimidines (So and Richards, 1992) and 2,4-diamino-6-dimethyl-5-phenyldihydrotriazines as dihydrofolate reductase (DHFR) inhibitors (Andrea and Kalayeh, 1991), and neural networks have been recognized as an important tool in drug discovery (Kovesdi et al., 1999; Schneider, 2000). DHFR catalyzes the NADPH-dependent reduction of 7,8-dihydrofolate (H2F) to 5,6,7,8-tetrahydrofolate (H4F) and is necessary for maintaining intracellular levels of H4F, an essential cofactor in the synthetic pathway of purines, thymidylate and several amino acids. Evolved neural networks for QSAR have also been proposed (So and Karplus, 1996a,b, 1997a,b; So et al., 2000). Here we review the concepts of evolved artificial neural networks for pattern recognition and indicate the effectiveness of evolved artificial neural networks for QSAR by using a small subset of available features to predict activity for a set of 74 2,4-diamino-5-(substituted benzyl)pyrimidines. Relative to previous literature (Hirst et al., 1994), the evolved neural network was capable of similar predictive accuracy to the model given in Hirst et al. (1994) using only three of the 27 indicated input features.

1.1. Artificial neural networks for pattern recognition

Artificial neural networks (or simply neural networks) are computer algorithms based loosely on modeling the neuronal interconnections of natural organisms. They are stimulus–response transfer functions that accept some input and yield some output. They are typically used to learn an input–output mapping over a set of examples.

Neural networks are parallel processing structures consisting of nonlinear processing elements interconnected by fixed or variable weights. They are quite versatile, for they can be constructed to generate arbitrarily complex decision regions for stimulus–response pairs. That is, in general, if given sufficient complexity, there exists a neural network that will map every input pattern to its appropriate output pattern, so long as the input–output mapping is not one-to-many (i.e. the same input having varying output). Neural networks are therefore well suited for use as detectors and classifiers. The classic pattern recognition algorithms require assumptions concerning the underlying statistics of the environment. Neural networks, in contrast, are nonparametric and can effectively address a broader class of problems (Lippmann, 1987).

Multilayer perceptrons, also sometimes described as feed forward networks, are probably the most common architecture used in supervised learning applications (where exemplar patterns are available for training). Each computational node sums N weighted inputs, where w_ij is the weight between nodes i and j and a_i is the activation of the ith node, subtracts a threshold value, Θ_j, and passes the result through a logistic (sigmoid) function, generating the output:

f(θ_j) = (1 + e^(−θ_j))^(−1),

where θ_j = (Σ a_i w_ij) − Θ_j, i = 1, …, N. Single perceptrons (i.e. feed forward networks consisting of a single input node) form decision regions separated by a hyperplane. If the inputs from the given data classes are linearly separable, a hyperplane can be positioned between the classes by adjusting the weights and bias terms. If the inputs are not linearly separable, containing overlapping distributions, a least mean square (LMS) solution is typically generated to minimize the mean squared error between the calculated output of the network and the actual desired output. While single perceptrons can generate hyperplane boundaries, perceptrons with a hidden layer of processing nodes have been proven to be capable of approximating any measurable function (Hornik et al., 1989), indicating their broad utility for addressing general pattern recognition problems.
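For concreteness, a minimal sketch of this node computation is given below in Python; the layer sizes and random weight values are illustrative assumptions only, not the settings used in this paper.

import numpy as np

def logistic(x):
    # Logistic (sigmoid) transfer function: f(x) = 1 / (1 + exp(-x)).
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(inputs, w_hidden, theta_hidden, w_out, theta_out):
    # Each node sums its weighted inputs, subtracts its threshold, and
    # passes the result through the logistic function.
    hidden = logistic(w_hidden @ inputs - theta_hidden)
    return logistic(w_out @ hidden - theta_out)

# Illustrative dimensions only: 30 inputs, three hidden nodes, one output.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=30)
w_h = rng.uniform(-0.5, 0.5, size=(3, 30))
t_h = rng.uniform(-0.5, 0.5, size=3)
w_o = rng.uniform(-0.5, 0.5, size=(1, 3))
t_o = rng.uniform(-0.5, 0.5, size=1)
print(mlp_forward(x, w_h, t_h, w_o, t_o))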

Another versatile neural network architecture is the radial basis function network. Rather than partitioning the available data using hyperplanes, the radial basis function network clusters available data, often with the use of approximate Gaussian density functions. The network comprises an input layer of nodes corresponding to the input feature dimension, a single hidden layer of nodes with computational properties described below, and output nodes, which perform linear combinations on the hidden nodes.


Each connection between an input and hidden node carries two variable parameters corresponding to a mean and standard deviation. Poggio and Girosi (1990) proved that linear combinations of these near-Gaussian density functions can be constructed to approximate any measurable function. Therefore, like the multilayer perceptron, radial basis functions are universal function approximators.
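The following sketch, under the same illustrative assumptions as above (arbitrary sizes and values), shows how a hidden node with per-input means and standard deviations produces a Gaussian activation, with the output formed as a linear combination of the hidden activations.

import numpy as np

def rbf_forward(x, means, stds, out_weights):
    # Each hidden node carries a mean and standard deviation for every input
    # dimension; its activation is a Gaussian of its scaled distance to the
    # input.  The output node is a linear combination of these activations.
    z = ((x - means) / stds) ** 2          # shape: (n_hidden, n_inputs)
    hidden = np.exp(-0.5 * z.sum(axis=1))  # Gaussian activation per hidden node
    return out_weights @ hidden

# Illustrative sizes only: four inputs, three hidden nodes, one output.
rng = np.random.default_rng(1)
x = rng.normal(size=4)
means = rng.normal(size=(3, 4))
stds = np.full((3, 4), 0.8)
w_out = rng.normal(size=3)
print(rbf_forward(x, means, stds, w_out))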

Given a network architecture (i.e. type of network, the number of nodes in each layer, the connections between the nodes, and so forth), and a training set of input patterns, the collection of variable weights determines the output of the network to each presented pattern. The error between the actual output of the network and the desired target output defines a response surface over an n-dimensional hyperspace, where there are n parameters (e.g. weights) to be adapted. Multilayer feed forward perceptrons are the most commonly selected architecture, and training these networks can be accomplished through a back propagation algorithm that implements a gradient search over the error response surface for the set of weights that minimizes the sum of the squared error between the actual and target values.

Although the use of back propagation is common in neural network applications, it is quite limiting. The procedure is mathematically tractable and provides guaranteed convergence, but only to a locally optimal solution. Even if the network's topology provides sufficient complexity to solve the given pattern recognition task completely, the back propagation method may be incapable of discovering an appropriate set of weights to accomplish the task. When this occurs, the operator has several options: (1) accept suboptimal performance; (2) restart the procedure and try again; (3) use ad hoc tricks, such as adding noise to the exemplars; (4) collect new data and retrain; or (5) add degrees of freedom to the network by increasing the number of nodes and/or connections.

Only this last approach, adding more degrees of freedom to the network, is guaranteed to give adequate performance on the training set, provided that sufficient nodes and layers are available. Yet this also presents problems to the designer of the network, for any function can map any measurable domain to its corresponding range if given sufficient degrees of freedom. Unfortunately, such overfit functions generally provide very poor performance during validation on data acquired independently. Such anomalies are encountered commonly in regression analysis, statistical model building, and system identification. Assessing the proper trade-off between the goodness-of-fit to the data and the necessary degrees of freedom requires information criteria (e.g. Akaike's information criterion, the minimum description length principle, predicted squared error, or others; see Fogel, 1991a for a discussion of these criteria). By relying on the back propagation method, the designer almost inevitably accepts that the resulting network will not satisfy the maxim of parsimony, simply because of the defective nature of the training procedure itself. The problems of local convergence with the back propagation algorithm indicate the desirability of training with stochastic optimization methods, such as simulated evolution, which can provide convergence to globally optimal solutions.
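As one concrete illustration of such a criterion, the snippet below computes Akaike's information criterion under a Gaussian-error assumption using the common form AIC = n·ln(RSS/n) + 2k; the form of the penalty and the toy residuals are assumptions for illustration, not taken from this paper.

import numpy as np

def aic_gaussian(residuals, n_params):
    # AIC = n*ln(RSS/n) + 2k: trades goodness-of-fit (residual sum of
    # squares) against the number of adjustable parameters k; lower is better.
    r = np.asarray(residuals)
    n = len(r)
    return n * np.log(float(r @ r) / n) + 2 * n_params

# A small model (k = 10) with slightly larger error can still be preferred
# over a large model (k = 100) that fits the training data a little better.
rng = np.random.default_rng(2)
print(aic_gaussian(rng.normal(scale=0.10, size=50), 10),
      aic_gaussian(rng.normal(scale=0.09, size=50), 100))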

1.2. Evolutionary computation and neural networks

Natural evolution is a population-based optimization process. Simulating this process on a computer results in stochastic optimization algorithms that can often outperform classical methods of optimization when applied to difficult real-world problems. Historically, there have been three main avenues of research in evolutionary computation: evolutionary programming (EP), evolution strategies (ES), and genetic algorithms (GA) (Back et al., 1997). The methods are broadly similar in that each maintains a population of trial solutions, imposes random changes to those solutions, and incorporates the use of selection to determine which solutions to maintain into future generations and which to remove from the pool of trials. The methods used to differ in the types of random changes imposed and in the methods for selecting successful trials.


Fogel (2000) provided a review of the similarities and differences between these procedures, but indicated that the methods are now, for all intents and purposes, identical, and that there appears to be no scientific rationale for discriminating between 'genetic' and 'evolutionary' computation.

This is particularly true in light of the 'no free lunch' theorem of Wolpert and Macready (1997), which in broad terms states that all algorithms that do not resample points in a search space perform the same on average when applied across all possible functions. Thus, no choice of search operator (e.g. crossover or mutation) can be uniformly superior. The choice must be matched to the problem at hand. Fogel and Ghozeil (1997) offered a related result indicating that there is no advantage to any bijective mapping (representation) on any problem: a change in representation can be offset by a change in variation operator to generate an equivalent algorithm.

Each of the characteristics that made the previously different approaches to simulating evolution unique has been shown to be without any basis for any general mathematical advantage. Moreover, much of the early mathematical foundations of 'schema processing' have recently been shown to be in error (Macready and Wolpert, 1998; see also Fogel and Ghozeil, 1997) or not relevant for optimization (Altenberg, 1995; Vose and Wright, 2001). Rather than seek differences between 'genetic' and 'evolutionary' computation, a more effective approach appears to be to tailor the representation, variation operators, and selection procedures to the optimization task at hand. Evolutionary computation offers a broad range of choices for each of these considerations.

The procedures generally proceed as follows. A problem to be solved is cast in the form of an objective function that describes the worth of alternative solutions. Without loss of generality, suppose that the task is to find the solution that minimizes the objective function. A collection (population) of trial solutions is selected at random from some feasible range across the available parameters. Each solution is scored with respect to the objective function. The solutions (parents) are then mutated and/or recombined with other solutions in order to create new trials (offspring). These offspring are also scored with respect to the objective function, and a subset of the parents and offspring are selected to become parents of the next iteration (generation) based on their relative performance. Those with superior performance are given a greater chance of being selected than are those of inferior quality. Fogel (2000) details examples of evolutionary algorithms applied to a wide range of problems, including breast cancer detection (both from histologic and radiographic data), pharmaceutical design, and a variety of industrial settings.
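A minimal sketch of this parent–offspring–selection cycle is given below; the sphere objective, the Gaussian mutation scale, and truncation selection are illustrative assumptions standing in for the problem-specific choices discussed in the text.

import numpy as np

def objective(x):
    # Toy objective to be minimized (an assumption for illustration).
    return float(np.sum(x ** 2))

rng = np.random.default_rng(3)
pop_size, dim = 20, 5
# A population of trial solutions drawn at random from a feasible range.
parents = rng.uniform(-5.0, 5.0, size=(pop_size, dim))

for generation in range(100):
    # Offspring are created by random mutation of the parents.
    offspring = parents + rng.normal(scale=0.3, size=parents.shape)
    # Parents and offspring are scored; the better half survives
    # (truncation selection standing in for a probabilistic tournament).
    pool = np.vstack([parents, offspring])
    scores = np.array([objective(s) for s in pool])
    parents = pool[np.argsort(scores)[:pop_size]]

print("best objective value:", objective(parents[0]))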

Evolutionary computation has been demonstrated as a useful means for training neural networks, particularly as compared with gradient-based methods such as back propagation. For example, Porto et al. (1995) compared back propagation, simulated annealing and evolutionary programming for training a fixed network topology to classify active sonar returns. The results indicated that stochastic search techniques such as annealing and evolution outperform back propagation consistently, yet can be executed more rapidly on an appropriately configured parallel processing computer. After sufficient computational effort, the most successful network can be put into practice. But the evolutionary process can be continued during application, so as to provide iterative improvements on the basis of newly acquired exemplars. The procedure is efficient because it can use the entire current population of networks as initial solutions to accommodate each newly acquired datum. There is no need to restart the search procedure in the face of new data, in contrast with many classic search algorithms, such as dynamic programming.

Designing neural networks through simulated evolution follows an iterative procedure:
1. A specific class of neural networks is selected. The number of input nodes corresponds to the amount of input data to be analyzed. The number of classes of concern (i.e. the number of classification types of interest) determines the number of output nodes.
2. Exemplar data is selected for training.
3. A population of P complete networks is selected at random. A network incorporates the number of hidden layers, the number of nodes in each of these layers, the weighted connections between all nodes in a feedforward or other design, and all of the bias terms associated with each node. Reasonable initial bounds must be selected for the size of the networks, based on the available computer architecture and memory.
4. Each of these 'parent' networks is evaluated on the exemplar data. A payoff function is used to assess the worth of each network. A typical objective function is the root mean squared error (RMSE) between the target and the actual output summed over all output nodes; this measure is often chosen because it simplifies calculations in the back propagation training algorithm. As evolutionary computation does not rely on similar calculations, any arbitrary payoff function can be incorporated into the process and can be made to reflect the operational worth of various correct and incorrect classifications. Furthermore, information criteria such as Akaike's information criterion (AIC) (Fogel, 1991b) or the minimum description length principle (Fogel and Simpson, 1993) can provide mathematical justification for assessing the worth of each solution, based on its classification error and the required degrees of freedom.
5. 'Offspring' are created from these parent networks through random mutation. Simultaneous variation is applied to the number of layers and nodes, and to the values for the associated parameters (e.g. weights and biases of a multilayer perceptron; weights, biases, means and standard deviations of a radial basis function network). A probability distribution function is used to determine the likelihood of selecting combinations of these variations. The probability distribution can be preselected a priori by the operator or can be made to evolve along with the network, providing for nearly completely autonomous evolution (Fogel, 2000).
6. The offspring networks are scored in a similar manner as their parents.
7. A probabilistic round-robin competition is conducted to determine the relative worth of each proposed network. Pairs of networks are selected at random. The network with superior performance is assigned a 'win'. Competitions are run to a preselected limit. Those networks with the most wins are selected to become parents for the next generation. In this manner, solutions that are far superior to their competitors have a correspondingly high probability of being selected. The converse is also true. This function helps prevent stagnation at local optima by providing a parallel biased random walk.
8. The process iterates by returning to step (5).
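As a sketch of steps 3 and 4 for a fixed topology, the fragment below represents each population member as a flat vector of weights and thresholds paired with per-dimension step sizes, and scores it by RMSE on exemplar data; the predict function is left abstract, and the initialization values shown are those reported later in the methods (Section 2) rather than requirements of the general procedure.

import numpy as np
from dataclasses import dataclass

@dataclass
class Candidate:
    # One member of the evolving population: all weights and bias terms of a
    # fixed-topology network flattened into one vector, plus the matching
    # vector of mutation step sizes used for self-adaptive variation.
    weights: np.ndarray
    sigmas: np.ndarray

def rmse_payoff(candidate, predict, X, y):
    # Step 4: score a candidate by root mean squared error on exemplar data.
    # `predict(weights, x)` maps a weight vector and an input pattern to the
    # network output; any forward-pass implementation can be plugged in.
    preds = np.array([predict(candidate.weights, x) for x in X])
    return float(np.sqrt(np.mean((preds - y) ** 2)))

# Step 3 for a 30-3-1 network: 90 + 3 weights plus 4 thresholds = 97 values.
n_dim = 97
parent = Candidate(weights=np.random.uniform(-0.5, 0.5, n_dim),
                   sigmas=np.full(n_dim, 0.05))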

1.3. Application of evolved neural networks for quantitative structure–activity relationships

Neural networks are useful for QSAR because there are many potential parameters for each molecule and the contribution of each parameter is not known a priori. The number of input parameters (and their combinations) is sufficiently large that an exhaustive search is not computationally feasible. One method to reduce this dimensionality is to make assumptions about the most relevant parameters on the basis of some information criterion or other measure. However, this type of reduction may lead to convergence on local minima. Evolutionary computation can be used as a tool to rapidly search for the appropriate number of input parameters while avoiding premature convergence due to inappropriate heuristics.

An early attempt at evolved neural networks in QSAR was offered in Tetko et al. (1994), which used EP for structure–activity relationships (SAR). QSAR methods focus on quantitative values to describe biological activity, whereas SAR methods focus on broader classifications of activities for prediction of overall worth. In their work, EP was used to search for parameter sets with the best expected error of activity classification. Tetko et al. (1994) demonstrated that EP could be used to anticipate the hypoglycemic activity of flavonoids using a set of 52 molecules divided into two groups: active and non-active. Ninety-three parameters were used for each molecule (electrostatic, topologic and spatial characteristics).

At nearly the same time, Luke (1994) offered a method for the use of EP for QSAR. As many as 53 parameters were used for 31 analogues of antifilarial antimycin as reported from a data set in the literature.


A population of N predictors was generated at random. The fitness of each predictor (with respect to a payoff function described in the article) was calculated. Each predictor was allowed to create a new predictor. The 2N predictors were ranked and the N least-fit predictors were removed from the population, restoring the population size to N. This process was iterated for a set number of generations. In his analysis, Luke noted that 'in this study, evolutionary programming [was] chosen over a genetic algorithm for the simple reason that the latter may have problems when the number of descriptors increases in a QSAR/QSPR (quantitative structure–property relationship) investigation. Though the … data has 53 descriptors, it is not hard to imagine a set of data containing 1000 descriptors. If a researcher wants to find good predictors that only use three of the descriptors (nterm=3) the ni vector contains 997 zeroes and three nonzero values. If both parents have nterm=3, it is very likely that nterm will not be 3 in an offspring … in a standard genetic algorithm, the offspring … replaces the least fit predictor in the set … whether or not the offspring's fitness is better than the solution it is producing. Therefore, there is no guarantee that the overall fitness of the population will increase from generation to generation in the case where a large number of descriptors are present' (p. 1281). Four data sets were used in this study, each of which was borrowed from other sources in the literature.
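The generation scheme Luke describes amounts to rank-and-truncate selection over parents and offspring; a minimal sketch follows, in which the fitness and mutate functions are placeholders and fitness is assumed to be an error to minimize (Luke's actual payoff function is defined in his article).

def next_generation(parents, fitness, mutate):
    # Each of the N parents creates one offspring; the 2N candidates are
    # ranked by fitness and the N least-fit are removed, restoring the
    # population to size N.
    offspring = [mutate(p) for p in parents]
    pool = parents + offspring
    pool.sort(key=fitness)        # ascending: smaller error ranks higher
    return pool[:len(parents)]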

Following this work, So and Karplus (1996a,b) investigated both EP and GAs for the training of neural nets for QSAR on the same initial data set as Luke (1994), using 53 parameters for 31 antimycin candidates. Their results suggested that, although both procedures arrived at the best solution in under 10 generations, the average fitness of the population was significantly higher when using an EP approach. So and Karplus used only the EP algorithm in the remainder of their analysis. This same approach was applied to a homogeneous set of 57 classical 1,4-benzodiazepin-2-ones with binding affinities for the BZ/GABAA receptor. The data set used for this investigation was reported elsewhere in the literature and in this study. Once again, the authors used the EP approach to show 'the general utility of the genetic neural network (GNN) methodology in dealing with data sets of high dimensionality' (p. 5255). So and Karplus extended their work with two additional papers in 1997 regarding 3D QSAR with 'genetic' neural networks (even though the procedure employed was in fact more akin to EP).

2. Methods and results

For the experiments presented here, data on the biological activities for 55 2,4-diamino-5-(substituted benzyl)pyrimidines were used (Hirst et al., 1994). Biological affinities were measured by the association constant (log Ki) to DHFR from Escherichia coli MB1428. These data were used to test the performance of neural networks trained using backpropagation versus those trained using evolutionary computation. Hirst et al. (1994) divided these data into five sets of 11 molecules each. Using leave-one-out cross-validation (Efron, 1982), each of these data sets was used as a test set with the other four sets serving as training sets. Therefore, each of the 55 pyrimidines appeared only once over all of the test sets.
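A sketch of this leave-one-set-out design follows; the molecule indices and the random assignment to folds are placeholders, since the actual partition of the 55 pyrimidines follows Hirst et al. (1994).

import random

random.seed(0)
molecule_ids = list(range(55))        # placeholders for the 55 pyrimidines
random.shuffle(molecule_ids)
folds = [molecule_ids[i * 11:(i + 1) * 11] for i in range(5)]

# Each of the five sets serves once as the test set; the remaining four
# sets (44 molecules) form the training set for that trial.
for test_set in folds:
    train_set = [m for fold in folds if fold is not test_set for m in fold]
    assert len(train_set) == 44 and len(test_set) == 11
    # ... train a network on train_set, evaluate on test_set ...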

Each of the 55 molecules in the data set contained three positions suitable for replacement (the 3-, 4-, and 5-positions of the phenyl ring) by a different substituent (i.e. –OH, –O(CH2)6CH3, etc.), and the activity for each of these substituent combinations was measured. The data reported in Hirst et al. (1994) suggested that the addition of hydroxyl groups at positions 3 and 5, and a hydrogen at position 4, produced an activity (3.04) that was significantly below the mean of the remaining data (μ = 7.115, σ = 0.935). This data point was removed from all training and testing data sets as a significant outlier, despite it being used by Hirst et al. (1994).

All substituents were assigned attributes in Hirst et al. (1994) with regard to polarity, size, flexibility, number of hydrogen-bond donors and acceptors, presence and strength of π-donors and π-acceptors, polarizability of the molecular orbitals, the σ-effect, and branching. Therefore, each substituent position was characterized by a vector of 10 numbers containing the attribute information.


It remains unclear from Hirst et al. (1994) whether all 10 of these attributes were used for each of the three positions or if the value for the branching attribute was removed as an input vector to the neural network. Given this uncertainty, we chose to use all 30 attributes describing each of the 55 molecules in the database for development of an evolved neural network.

A neural network with a fixed architecture consisting of 30 inputs, three hidden nodes, and one output (as similar as possible to the description provided by Hirst et al., 1994) was used for prediction of activity. For all training sets, a population of 200 parents and 100 offspring was used to evolve the appropriate weights for each connection in the neural network to increase the predictive accuracy over time. Every generation, each parent network architecture generated offspring network architectures by varying all of its weighted connections simultaneously, following a Gaussian distribution with zero mean and an adaptable standard deviation. The update rule for the standard deviation followed the standard method shown in Back et al. (1997) and Fogel (2000), where lognormal perturbations are applied to the standard deviations before generating offspring. Specifically, offspring were created using the following equations:

(1) σ′_i = σ_i × exp(τ N(0, 1) + τ′ N_i(0, 1)), i = 1, …, n

(2) x′_i = x_i + σ′_i N_i(0, 1), i = 1, …, n

where i denotes the ith dimension of the solution vector x or strategy parameter vector σ, N(0, 1) is a standard Gaussian random variable, N_i(0, 1) designates that a standard Gaussian random variable is sampled anew for each ith dimension, and τ and τ′ are constants set equal to 1/sqrt(2n) and 1/(2sqrt(n)), respectively, where n is the number of dimensions in x and σ. The weights of each connection on each member of the initial population were distributed uniformly at random in the range (−0.5, 0.5) and the initial standard deviation in each dimension for each parent was set to 0.05.
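A sketch of this self-adaptive mutation, under the reading of Eqs. (1) and (2) given above, follows; the only assumed values are the random seed and the example dimensionality.

import numpy as np

rng = np.random.default_rng(4)

def mutate(x, sigma):
    # Eq. (1): lognormal perturbation of the strategy parameters (step sizes),
    # with one N(0,1) sample shared across dimensions and one drawn per
    # dimension.  Eq. (2): Gaussian perturbation of the weights using the
    # mutated step sizes.
    n = len(x)
    tau = 1.0 / np.sqrt(2.0 * n)
    tau_prime = 1.0 / (2.0 * np.sqrt(n))
    common = rng.standard_normal()
    sigma_new = sigma * np.exp(tau * common + tau_prime * rng.standard_normal(n))
    x_new = x + sigma_new * rng.standard_normal(n)
    return x_new, sigma_new

# Initialization as described above: weights uniform in (-0.5, 0.5) and an
# initial step size of 0.05 in every dimension.
weights = rng.uniform(-0.5, 0.5, 97)
sigmas = np.full(97, 0.05)
child_weights, child_sigmas = mutate(weights, sigmas)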

Selection was based on each neural network's total RMSE on the training data, with a stochastic tournament being used to determine which parent and offspring networks survived into the next generation. For this tournament, each network was paired against 10 other randomly selected networks from the population; comparisons in performance were made in each pairing, with a 'win' being assigned to the network having the lower RMSE. After all such pairings were completed, those networks with the most wins were selected to be parents for the successive generation.
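A sketch of this tournament is given below; the function operates on any list of candidate networks and their training RMSEs, and the tie-handling and seeding details are assumptions of the sketch rather than specifics reported in the paper.

import random

def tournament_select(population, rmse, n_parents, n_opponents=10, seed=None):
    # Each network meets 10 randomly chosen opponents and earns a 'win' for
    # every pairing in which its training RMSE is lower; the networks with
    # the most wins become the parents of the next generation.
    rng = random.Random(seed)
    wins = []
    for i in range(len(population)):
        others = [j for j in range(len(population)) if j != i]
        opponents = rng.sample(others, n_opponents)
        score = sum(1 for j in opponents if rmse[i] < rmse[j])
        wins.append((score, i))
    wins.sort(reverse=True)
    return [population[i] for _, i in wins[:n_parents]]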

For each evolution, the number of generations was varied from 10 to 200 to determine the most appropriate point to stop neural network training and so avoid under- or overfitting of the data. Performances were measured on data held out from training based on the Spearman rank correlation coefficient (following Hirst et al., 1994) and the RMSE. Values for these two performance criteria are shown in Fig. 1. Using the Spearman rank correlation coefficient, the variance across trials is much larger than expected, and the measure cannot be used easily for assessing performance. Using the RMSE performance criterion, the measure is minimized at 50 generations and its variance is much smaller over subsequent generations.

Fig. 1. The performances of evolved neural networks with 30 inputs based on the Spearman rank correlation coefficient (×) and the RMSE, given for the total number of generations in each evolutionary trial for data set 1. Performance was measured on the test set. The number of generations was varied from 10 to 200 to determine the most appropriate point to stop neural network training.


Table 1
Testing set performance as measured by RMSE and Spearman rank correlation coefficient

                  Evolved neural networks                                   Hirst et al. (1994)
                  RMSE                  Spearman rank corr. coeff.          Spearman rank corr. coeff.
Number of inputs  3        30           3        30                         27

Set 1             0.075    0.075        0.817    0.718                      0.788
Set 2             0.081    0.116        0.733    0.327                      0.228
Set 3             0.111    0.122        0.652    0.335                      0.702
Set 4             0.097    0.172        0.773    0.443                      0.838
Set 5             0.101    0.117        0.647    0.516                      0.753
Mean (σ)          0.093 (0.015)  0.120 (0.034)  0.724 (0.074)  0.468 (0.161)  0.660

The best-evolved neural network with three inputs achieved a higher Spearman rank correlation coefficient (0.724) than the network with 27 inputs presented in Hirst et al. (1994) (0.660). Furthermore, comparison between the best-evolved neural networks comprising three or 30 input features in terms of RMSE favors the smaller neural network.

Given this, the remaining sets 2–5 were then trained and tested using this same architecture and parameters for 50 generations of evolution. Performance over all sets is shown in Table 1, relative to the performance offered in Hirst et al. (1994).

Some attributes are more strongly correlated with the known activity than others. A statistical approach was used to determine those attributes with statistically significant correlation with the known activity. Specifically, a stepwise regression was used with values of F-to-enter and F-to-remove of 4.000 and 3.996, respectively. This analysis suggested that only one attribute for each of the three positions was well correlated with the known activity. Polarizability (the response of the electron distribution to changes in the solvent or with other polar agents) was selected for positions 1 and 2, while polarity (a measure of the positive or negative charge) was selected for position 3. A new neural network architecture consisting of these three inputs, three hidden nodes and one output was used in a subsequent round of evolution. For each evolution, the number of generations was varied from 10 to 200 to determine the most appropriate point to stop neural network training. As with the previous neural network, performances were measured based on the Spearman rank correlation coefficient and the RMSE. Values for these two performance criteria are shown in Fig. 2. Based on RMSE, the most appropriate number of generations for training was again taken to be 50 for set 1. The remaining sets 2–5 were then trained and tested using this same architecture, and the results are presented in Table 1.
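A sketch of the forward phase of such a stepwise selection is given below, using the partial F statistic and the 4.000 F-to-enter threshold; the backward (F-to-remove) phase is omitted and the random data merely stand in for the 30 physicochemical attributes, so this is an illustration of the idea rather than the exact procedure used.

import numpy as np

def rss_with(X, y, cols):
    # Residual sum of squares of an ordinary least squares fit (with
    # intercept) on the selected columns; also returns the number of
    # fitted coefficients.
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid), A.shape[1]

def forward_stepwise(X, y, f_to_enter=4.0):
    # At each step, add the candidate feature with the largest partial F
    # statistic, provided it exceeds the F-to-enter threshold.
    n = len(y)
    selected = []
    rss_cur = float(((y - y.mean()) ** 2).sum())   # intercept-only model
    while True:
        best = None
        for j in range(X.shape[1]):
            if j in selected:
                continue
            rss_new, p_new = rss_with(X, y, selected + [j])
            f_stat = (rss_cur - rss_new) / (rss_new / (n - p_new))
            if best is None or f_stat > best[0]:
                best = (f_stat, j, rss_new)
        if best is None or best[0] < f_to_enter:
            return selected
        selected.append(best[1])
        rss_cur = best[2]

# Toy usage: 54 samples of 30 random descriptors, one informative column.
rng = np.random.default_rng(5)
X = rng.normal(size=(54, 30))
y = 0.8 * X[:, 3] + rng.normal(scale=0.3, size=54)
print(forward_stepwise(X, y))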

Fig. 2. The performances of evolved neural networks with three inputs based on the Spearman rank correlation coefficient (×) and the RMSE, given for the total number of generations in each evolutionary trial for data set 1. Performance was measured on the test set. The number of generations was varied from 10 to 200 to determine the most appropriate point to stop neural network training.


3. Discussion

An assumption of all QSAR approaches is that the activity or property of a molecule is implicit in its physical structure. QSAR can then be used to develop statistical models relating known structures to activities. Given that there are many potential parameters for each molecule, neural network approaches have proven successful for activity prediction. Classical methods for training neural networks using backpropagation have been applied to QSAR problems with limited success (see above). As an alternative to backpropagation, evolutionary computation can be used as a tool to rapidly search for an appropriate set of weights that connect the nodes. Following evolution, the best-evolved neural network can then be used as a predictor for previously unknown samples in testing data. Here, we applied evolved neural networks to the inhibition of DHFR by pyrimidines as a test of the validity of this approach against a previous experiment using neural networks trained via backpropagation (Hirst et al., 1994).

To compare the performance of our neural network, we employed the Spearman rank correlation coefficient. This statistic measures the degree of association between the rankings of two random variables. It is thought to be useful either when the data consist of ranks or when the data set is small enough to be ranked easily (Milton and Tsokos, 1983). The Spearman rank correlation coefficient was calculated for each independent run of evolution in training set 1 over a specified number of generations (10, 20, 30, …). For each of these generations, the data presented in Fig. 1 show that the Spearman rank correlation coefficient varies drastically by generation, whereas another potential performance metric, the RMSE, does not show a similar spread.
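Both measures are straightforward to reproduce; the snippet below computes them for a hypothetical 11-molecule test set whose observed and predicted activities are placeholder values, not data from this paper.

import numpy as np
from scipy.stats import spearmanr

def rmse(pred, target):
    pred, target = np.asarray(pred), np.asarray(target)
    return float(np.sqrt(np.mean((pred - target) ** 2)))

# Placeholder activities for one hypothetical 11-molecule test set.
observed  = np.array([7.2, 6.8, 7.5, 6.1, 8.0, 7.7, 6.5, 7.1, 7.9, 6.9, 7.4])
predicted = np.array([7.0, 6.9, 7.4, 6.3, 7.8, 7.6, 6.7, 7.0, 7.7, 7.1, 7.3])

rho, _ = spearmanr(predicted, observed)
print("Spearman rank correlation:", rho)
print("RMSE:", rmse(predicted, observed))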

These data cast doubt on the utility of the Spearman rank correlation coefficient as a means of comparison and as a means of measuring the performance of any predictive model on these data. (These results may suggest a broader inquiry into the variability of the Spearman rank correlation coefficient more generally.) However, comparison of the evolved neural network to the data reported in Hirst et al. (1994) requires the use of the Spearman coefficient. The data in Table 1 suggest that the evolved neural network with 30 inputs, three hidden nodes, and one output node performed qualitatively worse than the network given in Hirst et al. (1994). However, the RMSE over all test sets for this same network topology was quite low (0.120), suggesting that the evolved neural network performed quite well. The variance associated with the Spearman rank correlation coefficient may be due to a nonlinear association between the parameters in question.

The stepwise regression suggested that only three of the 30 available inputs might be adequate for the task at hand. Using the Spearman rank correlation coefficient as a measure of performance, the best evolved network relying on these three input features performed at a level comparable to that offered in Hirst et al. (1994). This new result should be favored in light of the maxim of parsimony, using only 10% of the previous degrees of freedom.

If attention is turned to the RMSE as a measure of performance instead of the Spearman rank correlation coefficient, the best-evolved neural network with three inputs demonstrates a lower RMSE than the network with 30 inputs (0.093 vs. 0.120, respectively). Further study will be required to assess any statistically significant difference, but again the maxim of parsimony favors the smaller network.

It is important to note that, for the research presented here, one outlier from the original data was removed (see above). This outlier was a potential source of confusion for any neural network given that it was markedly different from any other training data and would likely be a source of significant error in training and testing. It remains unclear if this outlier was a typographical error in Hirst et al. (1994), but we assume that it was used in the development of their neural network. The many noncorrelated inputs and this outlier molecule cast doubt on the validity of the performances given in Hirst et al. (1994).

It is also important to note that, regardless of the measure of performance shown here, the statistical approach (relying on stepwise regression) applied to the physicochemical data prior to the choice of input vectors assisted the development of a more useful predictor.


Given a fixed architecture, this approach may help identify inputs that have little or no correlation with activity. However, using evolutionary computation, the topology of the neural network could be evolved simultaneously with the weights (Angeline et al., 1994). This would allow evolution to discover increasingly useful network topologies. Additional penalties could be applied for network size, giving a reward for smaller topologies of equal or better predictive accuracy.
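One way such a penalty could enter the payoff function is sketched below; the additive form and the value of the penalty coefficient are hypothetical choices for illustration, not part of the method used here.

def penalized_payoff(rmse, n_parameters, alpha=0.001):
    # Add a small cost per adjustable parameter so that, between topologies
    # of equal predictive accuracy, the smaller network scores better
    # (lower is better, as with the RMSE payoff used for training).
    return rmse + alpha * n_parameters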

Acknowledgements

The authors would like to thank Peter J. Angeline for assistance with software. The authors would also like to thank the reviewers for their comments, and Ray Paton for acting as editor-in-chief on this submission.

References

Altenberg, L., 1995. The schema theorem and Price's theorem. In: Whitley, L.D., Vose, M.D. (Eds.), Foundations of Genetic Algorithms 3. Morgan Kaufmann, San Mateo, CA, pp. 23–49.

Andrea, T.A., Kalayeh, H., 1991. Applications of neural networks in quantitative structure–activity relationships of dihydrofolate reductase inhibitors. J. Med. Chem. 34 (9), 2824–2836.

Angeline, P., Saunders, G., Pollack, J., 1994. Complete induction of recurrent neural networks. In: Sebald, A.V., Fogel, L.J. (Eds.), Proceedings of the Third Annual Conference on Evolutionary Programming. World Scientific, Singapore, pp. 1–8.

Back, T., Hammel, U., Schwefel, H.P., 1997. Evolutionary computation: comments on the history and current state. IEEE Trans. Evol. Comput. 1 (1), 3–17.

Efron, B., 1982. The Jackknife, the Bootstrap, and other Resampling Plans. SIAM, Philadelphia, PA.

Fogel, D.B., 1991a. System Identification through Simulated Evolution. Ginn Press, Needham, MA.

Fogel, D.B., 1991b. An information criterion for optimal neural network selection. IEEE Trans. Neural Networks 2 (5), 490–497.

Fogel, D.B., Simpson, P.K., 1993. Experiments with evolving fuzzy clusters. In: Fogel, D.B., Atmar, W. (Eds.), Proceedings of the Second Annual Conference on Evolutionary Programming. Evolutionary Programming Society, La Jolla, CA, pp. 90–97.

Fogel, D.B., Ghozeil, A., 1997. A note on representations and variation operators. IEEE Trans. Evol. Comput. 1 (2), 159–161.

Fogel, D.B., 2000. Evolutionary Computation: Toward a New Philosophy of Machine Intelligence, 2nd ed. IEEE Press, Piscataway, NJ.

Hirst, J.D., King, R.D., Sternberg, M.J.E., 1994. Quantitative structure–activity relationships by neural networks and inductive logic programming. I. The inhibition of dihydrofolate reductase by pyrimidines. J. Comput. Aided Mol. Des. 8, 405–420.

Hornik, K., Stinchcombe, M., White, H., 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359–366.

Kovesdi, I., Dominguez-Rodriguez, M.F., Orfi, L., Naray-Szabo, G., Varro, A., Papp, J.G., Matyus, P., 1999. Application of neural networks in structure–activity relationships. Med. Res. Rev. 19 (3), 249–269.

Lippmann, R.P., 1987. An introduction to computing with neural nets. IEEE ASSP Magazine, April, 4–22.

Luke, B.T., 1994. Evolutionary programming applied to the development of quantitative structure–activity relationships and quantitative structure–property relationships. J. Chem. Inf. Comput. Sci. 34, 1279–1287.

Macready, W.G., Wolpert, D.H., 1998. Bandit problems and the exploration/exploitation tradeoff. IEEE Trans. Evol. Comput. 2 (1), 2–22.

Milton, J.S., Tsokos, J.O., 1983. Statistical Methods in the Biological and Health Sciences. McGraw-Hill, Inc., New York.

Poggio, T., Girosi, F., 1990. Networks for approximation and learning. Proceedings of the IEEE 78 (9), 1481–1497.

Porto, V.W., Fogel, D.B., Fogel, L.J., 1995. Alternative neural network training methods. IEEE Expert, June, 10 (3), 16–22.

Schneider, G., 2000. Neural networks are useful tools for drug design. Neural Networks 13 (1), 15–16.

So, S.-S., van Helden, S.P., van Geerstein, V.J., Karplus, M., 2000. Quantitative structure–activity relationship studies of progesterone receptor binding steroids. J. Chem. Inf. Comput. Sci. 40, 762–772.

So, S.-S., Karplus, M., 1996a. Evolutionary optimization in quantitative structure–activity relationship: an application of genetic neural networks. J. Med. Chem. 39, 1521–1530.

So, S.-S., Karplus, M., 1996b. Genetic neural networks for quantitative structure–activity relationships: improvements and application of benzodiazepine affinity for benzodiazepine/GABAA receptors. J. Med. Chem. 39, 5246–5256.

So, S.-S., Karplus, M., 1997a. Three-dimensional quantitative structure–activity relationships from molecular similarity matrices and genetic neural networks. 1. Method and validations. J. Med. Chem. 40, 4347–4359.

So, S.-S., Karplus, M., 1997b. Three-dimensional quantitative structure–activity relationships from molecular similarity matrices and genetic neural networks. 2. Applications. J. Med. Chem. 40, 4360–4371.


So, S.-S., Richards, W.G., 1992. Application of neural networks: quantitative structure–activity relationships of the derivatives of 2,4-diamino-5-(substituted-benzyl)pyrimidines as DHFR inhibitors. J. Med. Chem. 35, 3201–3207.

Tetko, I.V., Tanchuk, V.Y., Luik, A.I., 1994. Application of an evolutionary algorithm to the structure–activity relationship. In: Sebald, A.V., Fogel, L.J. (Eds.), Proceedings of the Third Annual Conference on Evolutionary Programming. World Scientific, New Jersey, pp. 109–118.

Vose, M.D., Wright, A.H., 2001. Form invariance and implicit parallelism. Evol. Comput. 9 (3), 355–370.
