Mol Divers (2011) 15:269–289
DOI 10.1007/s11030-010-9234-9
COMPREHENSIVE REVIEW
Genetic algorithm optimization in drug design QSAR: Bayesian-regularized genetic neural networks (BRGNN) and genetic algorithm-optimized support vector machines (GA-SVM)
Michael Fernandez · Julio Caballero · Leyden Fernandez · Akinori Sarai
Received: 14 May 2009 / Accepted: 25 January 2010 / Published online: 20 March 2010
© Springer Science+Business Media B.V. 2010
Abstract Many articles in in silico drug design implemented genetic algorithm (GA) for feature selection, model optimization, conformational search, or docking studies. Some of these articles described GA applications to quantitative structure–activity relationship (QSAR) modeling in combination with regression and/or classification techniques. We reviewed the implementation of GA in drug design QSAR and specifically its performance in the optimization of robust mathematical models such as Bayesian-regularized artificial neural networks (BRANNs) and support vector machines (SVMs) on different drug design problems. Modeled data sets encompassed ADMET and solubility properties, cancer target inhibitors, acetylcholinesterase inhibitors, HIV-1 protease inhibitors, ion-channel and calcium entry blockers, and antiprotozoan compounds, as well as protein class, functional, and conformational stability data. The GA-optimized predictors were often more accurate and robust than previously published models on the same data sets and explained more than 65% of data variance in validation experiments. In addition, feature selection over large pools of molecular descriptors provided insights into the structural and atomic properties ruling ligand–target interactions.
M. Fernandez (✉) · A. Sarai
Department of Bioscience and Bioinformatics, Kyushu Institute of Technology (KIT), 680-4 Kawazu, Iizuka 820-8502, Japan
e-mail: [email protected]

J. Caballero
Centro de Bioinformatica y Simulacion Molecular, Universidad de Talca, 2 Norte 685, Casilla 721, Talca, Chile
e-mail: [email protected]

L. Fernandez
Barcelona Supercomputing Center–Centro Nacional de Supercomputación, Nexus II Building, c/ Jordi Girona 29, 08034 Barcelona, Spain
Keywords Drug design · Enzyme inhibition · Feature selection · In silico modeling · QSAR · Review · SAR · Structure–activity relationships
List of abbreviations
ADMET Absorption, distribution, metabolism,
excretion and toxicity
AD Alzheimer's disease
log S Aqueous solubility
ANNs Artificial neural networks
BRANNs Bayesian-regularized artificial neural
networks
BRGNNs Bayesian-regularized genetic neural
networks
BBB Blood–brain barrier
CoMFA Comparative molecular field analysis
CG Conjugate Gradient
GA Genetic algorithm
GA-PLS Genetic algorithm-based partial least
squares
GA-SVM Genetic algorithm-optimized support vector
machines
GNN Genetic neural networks
GSR Genetic stochastic resonance
HIA Human intestinal absorption
PPBR Human plasma protein binding rate
Log P Lipophilicity
LHRH Luteinizing hormone-releasing hormone
MMP Matrix metalloproteinase
MT Mitochondrial toxicity
MLR Multiple linear regression
MT− Negative mitochondrial toxicity
NNEs Neural network ensembles
EVA Normal coordinate eigenvalue
BIO Oral bioavailability
PLS Partial least squares
P-gp P-glycoprotein
PCC Physicochemical composition
MT+ Positive mitochondrial toxicity
PC-GA-ANN Principal component-genetic
algorithm-artificial neural network
PCs Principal components
PPR Projection pursuit regression
QSAR Quantitative structure–activity relationship
QSPR Quantitative structure–property relationship
RBF Radial Basis Function
SOMs Self-organized maps
SR Stochastic resonance
SVMs Support vector machines
Trb1 Thyroid hormone receptor b1
Tdp Torsades de pointes
VKCs Voltage-gated potassium channels
Introduction
One of the main challenges in today's drug design is the
discovery of new biologically active compounds on the basis
of previously synthesized molecules. Quantitative structure–activity relationship (QSAR) is an indirect ligand-based
approach which models the effect of structural features on
biological activity. This knowledge is then employed to
propose new compounds with enhanced activity and selec-
tivity profile for a specific therapeutic target [1]. QSAR
methods are based entirely on experimental structure–activity relationships for enzyme inhibitors or receptor ligands. In
comparison to direct receptor-based methods, which include
molecular docking and advanced molecular dynamics simu-
lations, QSAR methods do not strictly require the 3D-struc-
ture of a target enzyme or even a receptor–effector complex.
They are not computationally demanding and allow the establishment of an in silico tool from which the biological activity of
newly synthesized molecules can be predicted [1].
Three-dimensional-QSAR (3D-QSAR) methods, espe-
cially comparative molecular field analysis (CoMFA) [2]
and Comparative Molecular Similarity Indices Analysis,
(CoMSIA) [3] are nowadays used widely in drug design.
The main advantages of these methods are that they are
applicable to heterogeneous data sets, and they bring a
3D-mapped description of favorable and unfavorable interactions according to physicochemical properties. In this sense,
they provide a solid platform for retrospective hypotheses by
means of the interpretation of significant interaction regions.
However, some disadvantages of these methods are related
to the 3D information and alignment of the molecular struc-
tures, since there are uncertainties about different binding
modes of ligands, and uncertainties about the bioactive con-
formations [4].
CoMFA and CoMSIA have emerged as the 3D-QSAR
methods most embraced by the scientific community today;
however, current articles on QSAR encompass the use of many forms of molecular information and statistical
cal correlation methods. The structures can be described by
physicochemical parameters [5], topological descriptors [6],
quantum chemical descriptors [7], etc. The correlation can
be obtained by linear methods or by nonlinear predictors such as artificial neural networks (ANNs) [8] and nonlinear support vector machines (SVMs) [9]. Unlike linear methods (CoMFA, CoMSIA, etc.), ANNs and SVMs are able to describe nonlinear relationships, which should lead to a more realistic approximation of the structure–activity paradigm, since interactions between the ligand and its biological target must be nonlinear.
Two major problems arise when the functional dependence between biological activities and the computed molecular descriptor matrix is nonlinear, and when the number of calculated variables exceeds the number of compounds in the
data set. The nonlinearity problem can be tackled within a nonlinear modeling framework, while the over-dimensionality issue can be handled by implementing a feature selection routine that determines which of the descriptors have a significant influence on the activity of a set of compounds. Genetic algorithm (GA), rather than forward or backward elimination procedures, has been successfully applied for feature selection
in QSAR studies when the dimensionality of the data set is
high and/or the interrelations between variables are convo-
luted [10].
The present review focuses on the application of very
flexible and robust approaches: Bayesian-regularized genetic
neural networks (BRGNNs) and GA-optimized SVM
(GA-SVM) to QSAR modeling in drug design. Biological
activities of low-molecular-weight compounds and protein function, class, and stability data were modeled to derive reliable classifiers with potential use in virtual library screening. First, we present a general survey of GA implementation and application in drug design QSAR. Second, we describe the BRGNN and GA-SVM approaches. Finally, we discuss their applications to modeling different target–ligand data sets relevant for drug discovery, as well as protein function and stability prediction.
General survey of genetic algorithm implementations
in drug design QSAR
Genetic algorithms are stochastic optimization methods governed by rules inspired by biological evolution [11]. A GA investigates many possible solutions simultaneously, each exploring a different region of the parameter space [12]. First, a population of N individuals is created in which each individual encodes a
randomly chosen subset of the modeling space and the fit-
ness or cost of each individual in the present generation is
determined. Secondly, parents selected on the basis of their
scaled fitness scores yield a fraction of children of the next
generation by crossover (crossover children) and the rest by
mutation (mutation children). In this way, the new offspring
contains characteristics from its parents. Usually, the routine is run until a satisfactory rather than the globally optimal solution is achieved. Advantages such as the ability to quickly scan a vast solution set, the fact that bad proposals do not affect the final solution, and the lack of any need to know the rules of the problem make GA very attractive for model optimization in drug discovery, in which every problem is highly particular because of the lack of previous knowledge of the functional relationship, and generalization is very difficult.
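The generational loop described above can be sketched in a few lines of Python. This is a toy illustration, not an implementation from the reviewed studies: selection is simplified to truncation of the ranked population, and the population size, rates, and OneMax-style fitness are arbitrary choices.

```python
import random

def run_ga(fitness, n_genes, pop_size=20, generations=50,
           crossover_frac=0.8, mutation_rate=0.05, seed=0):
    """Toy binary GA: the fittest half become parents, children come
    from crossover or mutation, and elitism keeps the best intact."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_genes)]
           for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]      # truncation selection
        children = [ranked[0][:]]              # elitism
        while len(children) < pop_size:
            if rng.random() < crossover_frac:  # crossover child
                a, b = rng.sample(parents, 2)
                cut = rng.randrange(1, n_genes)
                children.append(a[:cut] + b[cut:])
            else:                              # mutation child
                parent = rng.choice(parents)
                children.append([g ^ (rng.random() < mutation_rate)
                                 for g in parent])
        pop = children
    return max(pop, key=fitness)

# "OneMax" stand-in for a model-fitness function: more 1-bits = fitter
best = run_ga(fitness=sum, n_genes=12)
```

In a QSAR setting, the fitness call would instead train and score a regression model on the descriptor subset encoded by each chromosome.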
Chromosome representation
Solving the shortcomings of QSAR analysis, such as selection of optimum feature subsets, optimization of model parameters, and data set manipulation, has been the main goal of GA-based QSAR. The optimization space can include variables and model parameters. However, since variable selection is the most common task, populations have mainly been encoded by binary or integer chromosomes. Binary representation is very popular due to its easy and straightforward implementation: the chromosome is a binary vector of the same length as the main data matrix. The values 1 and 0 represent the inclusion or exclusion of a feature in the individual chromosome, respectively. Models with different dimensionality can evolve throughout the search process at the same time. In this case, the algorithm is highly automatic, since no extra parameters must be set, and the optimum solution is achieved when a predefined stopping condition is reached.
On the other hand, integer representation is encoded by a string of integers representing the positions of the features in the whole data matrix. Usually, the number of features the chromosome encodes is controlled according to criteria derived from previous knowledge of the modeled problem. Despite the extra supervision this requires, the algorithm gains efficiency because inefficient large-dimension models are avoided by controlling the number of variables during the search process. This aspect is especially important when complex predictors, given their high tendency to overparametrization/overfitting and their computational cost, are trained [10]. Model size can also be controlled in binary GA, but this simple routine is usually implemented in a very unsupervised way.
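The two encodings can be illustrated with a toy descriptor pool; the pool size and indices below are made up for the example.

```python
# Two equivalent chromosome encodings of one feature subset drawn from
# a pool of 8 descriptors (indices 0..7).
n_descriptors = 8

binary_chrom = [0, 1, 0, 0, 1, 1, 0, 0]   # length = pool size; 1 = included
integer_chrom = [1, 4, 5]                 # fixed length = model size

def binary_to_indices(chrom):
    """Convert a binary chromosome to its integer representation."""
    return [i for i, bit in enumerate(chrom) if bit]

def indices_to_binary(indices, n):
    """Convert an integer chromosome back to a binary one."""
    chosen = set(indices)
    return [1 if i in chosen else 0 for i in range(n)]
```

The binary form lets model size vary freely during the search, while the integer form fixes it, which matches the trade-off described above.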
In many GA implementations in QSAR studies, individuals in the populations are predictors, and training, validation, and/or crossvalidation errors are the individual fitness or cost functions. Different functions have been reported to rank the individuals in a population depending on the mathematical model implemented inside the GA framework. Authors have proposed a variety of fitness functions which are proportional to the residual error of the training set [10,13–25], the validation set [26], or crossvalidation [27–30], or a combination of them [31–33]. Overfitting has been decreased by complementing the cost function with terms accounting for the trade-off between the number of variables and the number of training cases [34] and/or by keeping model complexity as simple as possible during the search [10].
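A cost function of the kind just described might, for example, penalize the variables-to-cases ratio. The sketch below is only illustrative; the functional form and the penalty weight are assumptions, not the specific terms used in [34].

```python
def penalized_cost(mse, n_selected, n_train, penalty=0.1):
    """Illustrative GA cost: training MSE plus a parsimony term that
    grows with the ratio of selected variables to training cases.
    Lower cost = fitter individual."""
    return mse + penalty * (n_selected / n_train)

# with equal fit, a 5-descriptor model beats a 10-descriptor one
lean = penalized_cost(0.20, n_selected=5, n_train=50)
bloated = penalized_cost(0.20, n_selected=10, n_train=50)
```

Such a term biases the search toward small models even when a larger model fits the training data equally well.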
Population generation and ranking of individuals
The first step is to create a gene pool (population of models) of N individuals. Chromosome values are randomly initiated,
and the fitness of each individual in this generation is deter-
mined by the fitness function of the model and scaled by
the scaling function. Fitness scaling converts the raw fitness
scores that are returned by the fitness function to values in a
range that is suitable for the selection function. The selection
function uses the scaled fitness values to select the parents
of the next generation, assigning a higher probability of selection to individuals with higher scaled values. Controlling the range of the scaled values is
very important because it affects the performance of the GA.
Scaled values varying too widely cause individuals with the highest scaled values to reproduce too rapidly: they take over the population gene pool too quickly and prevent the GA from exploring other areas of the solution space. Conversely, scaled values varying too narrowly give all individuals too similar a chance of reproduction, and the optimization progresses very slowly. Among the most widely used fitness scaling functions are rank-based functions. The position of an individual in the sorted score list is its rank. Rank-based functions scale the raw scores based on the rank of each individual instead of its score. This fitness scaling removes
the effect of the spread of the raw scores [11,12].
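As an illustration, rank-based scaling can be implemented so that the scaled value is a function of rank alone. The 1/√rank scheme below is one common choice (e.g., a default in some GA toolboxes), not necessarily the one used in the reviewed studies.

```python
def rank_scaled_fitness(raw_costs):
    """Scale raw cost scores by rank: the fittest (lowest-cost)
    individual gets 1/sqrt(1), the next 1/sqrt(2), and so on. Only
    the ordering matters, so the spread of raw scores has no effect."""
    order = sorted(range(len(raw_costs)), key=lambda i: raw_costs[i])
    scaled = [0.0] * len(raw_costs)
    for rank, i in enumerate(order, start=1):
        scaled[i] = 1.0 / rank ** 0.5
    return scaled

# two populations with very different spreads but the same ordering
# receive identical scaled values
tight = rank_scaled_fitness([0.30, 0.10, 0.20])
wide = rank_scaled_fitness([900.0, 1.0, 5.5])
```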
Evolution and stopping criteria
During evolution, a fraction of the children of the next generation
is produced by crossover (crossover children) and the rest by
mutation (mutation children) from the parents. Sexual and
asexual reproductions take place so that the new offspring
contains characteristics from both or one of its parents. In
sexual reproduction, a selection function selects probabilis-
tically two individuals on the basis of their ranking to serve
as parents. An individual can be selected more than once as a
parent, in which case it contributes its genes to more than one
child. Stochastic uniform selection lays out a line in which each parent corresponds to a section of the line of length proportional to its scaled value [11,12]. Similarly, roulette selection chooses parents by simulating a roulette wheel, in which the area of the section of the wheel corresponding to an individual is proportional to that individual's expectation. The
algorithm uses a random number to select one of the sections
with a probability equal to its area [11,12]. On the other hand, tournament selection chooses each parent by selecting a set of players (individuals) at random and then choosing the best individual of that set to be a parent [32]. Then, crossover performs a random selection of a fraction of each parent's descriptor set, and a child is constructed by combining these fragments of genetic code. Finally, the rest of the individuals in the new generation are obtained by asexual reproduction, in which randomly selected parents are subjected to random mutation of their genes. Reproduction often includes elitism, which protects the fittest individual in any given generation from crossover or mutation [27]. Finally, stopping
criteria determine what causes the algorithm to terminate.
Most common parameters used to control algorithm flow are
the maximum number of iterations the GA will perform and
the maximum time the algorithm runs before stopping. Some implementations stop a GA if the best fitness score is less than or equal to a threshold value; others evaluate the performance over a preset number of generations or time interval, and the algorithm stops if there is no improvement in the best fitness value.
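Two of the selection schemes above can be sketched as follows; the scaled fitness values are invented for the example.

```python
import random

def roulette_select(scaled, rng):
    """Roulette wheel: each individual owns a wheel section whose area
    is proportional to its scaled fitness; a random spin picks one."""
    spin = rng.uniform(0.0, sum(scaled))
    acc = 0.0
    for i, s in enumerate(scaled):
        acc += s
        if spin <= acc:
            return i
    return len(scaled) - 1          # guard against floating-point edge

def tournament_select(scaled, rng, size=3):
    """Tournament: draw `size` players at random, keep the fittest."""
    players = rng.sample(range(len(scaled)), size)
    return max(players, key=lambda i: scaled[i])

rng = random.Random(42)
scaled = [0.1, 0.5, 0.2, 0.9]       # made-up scaled fitness values
picks = [roulette_select(scaled, rng) for _ in range(2000)]
# the fittest individual (index 3) should be selected most often
```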
Some applications
GA has been successfully applied to drug design QSAR
to optimize linear and nonlinear predictors. Cho and
Hermsmeier [13] introduced a simple encoding scheme for
chemical features and allocation of compounds in a data set.
They applied GA to simultaneously optimize descriptors and
composition of training and test sets. The method generates
multiple models on subsets of compounds representing clus-
ters with different chemotypes and a molecular similarity
method determined the best model for a given compound in
the test set. The performance on the Selwood data set [35]
was comparable to other published methods.
Hemmateenejad and co-workers [31–33] reported seminal studies on GA-based QSAR in drug design. They modeled the calcium channel antagonist activity of a set of nifedipine analogues by GA-optimized multiple linear regression (MLR) and partial least squares (PLS) regression [31]. Adequate models with low standard errors and high correlation coefficients were derived from topology, hydrophobicity, and surface area, but PLS had better prediction ability than MLR.
The authors applied a principal component–genetic algorithm–artificial neural network (PC-GA-ANN) procedure to model the activity of another series of nifedipine analogues [32]. Each molecule was encoded by 10 sets of descriptors, and principal component analysis (PCA) was used to compress the descriptor groups into principal components (PCs). GA selected the best set of PCs to train a feed-forward ANN. The PC-GA-ANN routine outperformed ANNs trained with top-ranked PCs (PC-ANN) by yielding better predictive ability.
Hemmateenejad et al. [33] reported the application of PC regression to model the structure–carcinogenicity relationship of drugs. PC correlation ranking and a GA were compared for selecting the best set of PCs for a large data set containing 735 carcinogenic activities and 1,355 descriptors. A crossvalidation procedure showed that introduction of PCs by conventional eigenvalue ranking was outperformed by correlation ranking and GA, which achieved similar quality of about 80% accuracy. Thyroid hormone receptor b1 (Trb1)
antagonists are of special interest because of their potential
role in safe therapies for nonthyroid disorders while avoid-
ing the cardiac side effects. Optimum molecular descriptors
selected by GA served as inputs for a projection pursuit
regression (PPR) study, yielding accurate models [36]. GA has also been reported to optimize routines of descriptor generation. Normal coordinate eigenvalue (EVA) structural descriptors, based on calculated fundamental molecular vibrational frequencies, are sensitive to 3D structure, and additionally structural superposition is not required [28]. The original technique involves a standardization method wherein uniform Gaussians of fixed standard deviation (σ) are used to smear out frequencies projected onto a linear scale. GA was used to search for optimal localized σ values by optimizing crossvalidated PLS regression scores. Although GA-based EVA did not improve performance for a benchmark steroid data set, crossvalidation statistics were 0.25 units higher than the simple EVA approach in the case of a more heterogeneous data set of five structural classes.
A GA-optimized ANN, named GNW, that simultaneously
optimizes feature selection and node weights, was reported
by Xue and Bajorath [37] for supervised feature ranking.
Interconnected weights were binary encoded as a 16-bit
string chromosome. A primary feature ranking index, defined as the sum of self-depleted weights and the corresponding weight adjustments, correctly identified relevant features for artificial data sets with known feature rankings. GNW outperformed an SVM method on three artificial data sets and a matrix metalloproteinase-1 inhibitor data set [37].
Two-dimensional (2D) representation was chosen to clas-
sify about 500 molecules in seven biological activity classes
using a method based on principal component analysis com-
bined with GA [38]. Scoring functions, which accounted for
number of compounds in pure classes (i.e., compounds with
the same biological activity), singletons, and mixed classes,
identified effective descriptor sets. The results indicated that
combinations of few critical descriptors related to aromatic
character, hydrogen bond acceptors, estimated polar van der
Waals surface area, and a single structural key were preferred to classify compounds according to their biological activities.
Kamphausen et al. [39] reported a simplified GA based on small training sets that runs for a small number of generations. Folding energies of RNA molecules and spin-glass models of multiletter-alphabet biopolymers such as peptides
were optimized. Notably, the de novo construction of peptidic thrombin inhibitors, computationally guided by this approach, required the experimental fitness determination of only 600 different compounds from a virtual library of more than 10¹⁷ molecules [39].
Caco-2 cell monolayers are widely used systems for
predicting human intestinal absorption, and quantitative structure–property relationship (QSPR) models of Caco-2 permeability have been widely reported. Yamashita et al. [34] used a GA-based partial least squares (GA-PLS)
method to predict Caco-2 permeability data using topolog-
ical descriptors. The final PLS model described more than
80% of crossvalidation variance.
In alternative applications, a GA routine based on the the-
ory of stochastic resonance (SR) was reported in which vari-
ables that are related to the bioactivity of a molecule series
were considered as signal and the other non-related features
as noise [40]. The signal was amplified by SR in a nonlinear
system with GA-optimized parameters. The algorithm was
successfully evaluated with the well-known Selwood data set [35]. The relevant variables were enhanced, and their power
spectra were significantly changed and similar to that of the
bioactivity after genetic SR (GSR). The descriptor matrix
continuously became more informative, and the collinear-
ity was suppressed. Then, feature selection was easier and
more efficient, and, consequently, QSAR models obtained for the data set had better performance than previously reported
approaches [40]. Teixido et al. [41] presented another non-
conventional GA to search for peptides that can cross the
blood–brain barrier (BBB). A genetic meta-algorithm opti-
mized the GA parameters and the approach was validated
by virtual screening of a peptide library of more than 1000
molecules. Chromosomes were populated with physicochemical properties of peptides instead of amino acid sequences, and the fitness function was derived from statis-
tical analysis of the experimental data available on peptide-
BBB permeability. The authors stated that a GA tuned for a specific problem can steer the design and drug discovery process and set the stage for evolutionary combinatorial chemistry.
Coupling of ANNs and GA in drug QSAR studies was
introduced by So and Karplus [27] by proposing GA-based
ANNs called genetic neural networks (GNNs). After cal-
culating molecular descriptors using different commercially
available software, predictive models were generated by cou-
pling GA feature selection and neural networks function
approximation. The optimum neural networks outperformed PLS and GA-based MLR models. The authors extended
GNN to 3D-QSAR modeling by exploring similarity matrix
space [42,43]. An early review on this approach [44] reports
its evaluation in several problems such as the Selwood data
set, the benzodiazepine affinity for benzodiazepine/GABAA
receptors, progesterone receptor-binding steroids, and human intestinal absorption. Patankar and Jurs have also reported
several QSAR models by hybrid GNN frameworks out-
performing other predictors for the inhibition of acyl-CoA:
cholesterol O-acyltransferase [45], sodium ion–proton antiporter [46], cyclooxygenase-2 [47], carbonic anhydrase [48],
human type 1 5alpha-reductase [49], and glycine/NMDA
receptor [50]. Another variant of the same hybrid approach
was recently reported by Di Fenza et al. [26] as the first attempt to combine GA and ANNs for the modeling of Caco-2 cell apparent permeability. The optimum model had an adequate crossvalidation accuracy of 57%, and the selected descriptors were related to physicochemical characteristics such as hydrophilicity, hydrogen bonding propensity, hydrophobicity, and molecular size, which are involved in the cellular membrane permeation phenomenon. Ab initio theory was used to calculate several quantum chemical descriptors, including electrostatic potentials and local charges at each atom, HOMO and LUMO energies, etc., which were used to model the solubility of thiazolidine-4-carboxylic acid derivatives by means of GA-PLS, yielding relative errors of prediction lower than 4%.
Bayesian-regularized genetic neural networks
In the context of hybrid GA-ANN modeling of biological
interactions, we introduced BRGNNs as a robust nonlinear modeling technique that combines GA and Bayesian regularization for neural network input selection and supervised network training, respectively (Fig. 1). This approach attempts to solve the main weaknesses of neural network modeling: the selection of optimum input variables and the adjustment of network weights and biases to optimum values for yielding regularized neural network predictors [50–52]. By combining the concepts of BRANNs and GAs, BRGNNs were implemented in such a way that BRANN inputs are selected inside a GA framework. The BRGNN approach is a version of the So and Karplus method [27], incorporating Bayesian regularization, that has been successfully introduced by our group in drug design QSAR. BRGNN was programmed within the Matlab environment [53] using the GA [54] and Neural Networks [55] Toolboxes.
Bayesian regularized artificial neural networks
Back-propagation ANNs are data-driven models in the sense
that their adjustable parameters are selected in such a way as
to minimize some network performance function F:
F = MSE = (1/N) Σ_{i=1}^{N} (y_i − t_i)²  (1)
Fig. 1 Flowchart of the BRGNN framework in QSAR studies: a pool of molecular descriptors feeds GA model optimization with crossvalidation; models with R above a threshold value are retained, and the best model (best Q²) is selected; optionally, random splits assembling test sets allow ensemble averaging of the predictions
In the above equation, MSE is the mean of the sum of squares of the network errors, N is the number of compounds, y_i is the predicted biological activity of compound i, and t_i is the experimental biological activity of compound i.
Often, predictors can memorize the training examples but fail to generalize to new situations. The
Bayesian framework for ANNs is based on a probabilistic
interpretation of network training to improve generalization
capability of the classical networks. In contrast to conven-
tional network training, where an optimal set of weights
is chosen by minimizing an error function, the Bayesian
approach involves a probability distribution of network
weights. In BRANNs, Bayesian approach yields a posterior
distribution of network parameters, conditional on the train-
ing data, and predictions are expressed in terms of expecta-
tions with respect to this posterior distribution [56,57].
Assuming a set of pairs D = {x_i, t_i}, where i = 1, ..., N is a label running over the pairs, the data set can be modeled as deviating from this mapping under some additive noise process (v_i):

t_i = y_i + v_i  (2)

If v is modeled as zero-mean Gaussian noise with standard deviation σ_v, then the probability of the data given the parameters w is:

P(D|w, β, M) = (1/Z_D(β)) exp(−β · MSE)  (3)

where M is the particular neural network model used, β = 1/(2σ_v²), and the normalization constant is given by Z_D(β) = (π/β)^(N/2). P(D|w, β, M) is called the likelihood. The maximum likelihood parameters w_ML (the w that minimizes MSE) depend sensitively on the details of the noise in the data [56,57].
To complete the interpolation model, a prior probability distribution must be defined which embodies our prior knowledge of the sort of mappings that are reasonable. Typically, this is quite a broad distribution, reflecting the fact that we only have a vague belief in a range of possible parameter values. Once we have observed the data, Bayes' theorem can be used to update our beliefs, and we obtain the posterior probability density. As a result, the posterior distribution is concentrated on a smaller range of values than the prior distribution. Since a neural network with large weights will usually give rise to a mapping with large curvature, we favor small values for the network weights. At this point, a prior is defined that expresses the sort of smoothness the interpolant is expected to have. The model has a prior of the form:

P(w|α, M) = (1/Z_W(α)) exp(−α · MSW)  (4)

where α represents the inverse variance of the distribution and the normalization constant is given by Z_W(α) = (π/α)^(N/2). MSW is the mean of the sum of the squares of the network weights and is commonly referred to as a regularizing function [56,57].
Considering the first level of inference, if α and β are known, then the posterior probability of the parameters w is:

P(w|D, α, β, M) = P(D|w, β, M) P(w|α, M) / P(D|α, β, M)  (5)

where P(w|D, α, β, M) is the posterior probability, that is, the plausibility of a weight distribution considering the information of the data set in the model used, P(w|α, M) is the
prior density, which represents our knowledge of the weights before any data are collected, P(D|w, β, M) is the likelihood function, which is the probability of the data occurring given the weights, and P(D|α, β, M) is a normalization factor, which guarantees that the total probability is 1.
Considering that the noise in the training set data is Gaussian and that the prior distribution for the weights is Gaussian, the posterior probability fulfills the relation:

P(w|D, α, β, M) = (1/Z_F) exp(−F)  (6)

where Z_F depends on the objective function parameters. Therefore, under this framework, minimization of F is identical to finding the (locally) most probable parameters.
In short, Bayesian regularization involves modifying the performance function F defined in Eq. 1, possibly improving generalization by adding a term that regularizes the weights by penalizing overly large magnitudes:

F = β · MSE + α · MSW  (7)

The relative size of the objective function parameters α and β dictates the emphasis on a smoother network response. MacKay's Bayesian framework automatically adapts the regularization parameters to maximize the evidence of the training data [56,57]. BRANNs were first and broadly applied to model biological activities by Burden and Winkler [51,52].
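Numerically, the regularized objective of Eqs. 1 and 7 can be sketched as below. The α and β values, predictions, and weight vectors are arbitrary illustrations; MacKay's framework would adapt α and β during training rather than fix them.

```python
import numpy as np

def regularized_objective(y_pred, t_true, weights, alpha, beta):
    """F = beta*MSE + alpha*MSW (Eq. 7): data misfit (Eq. 1) plus a
    penalty on the mean of the squared network weights."""
    mse = np.mean((np.asarray(y_pred) - np.asarray(t_true)) ** 2)
    msw = np.mean(np.asarray(weights) ** 2)
    return beta * mse + alpha * msw

y_pred = [1.1, 2.0, 2.9]
t_true = [1.0, 2.0, 3.0]
f_small = regularized_objective(y_pred, t_true, [0.1, -0.2, 0.05],
                                alpha=0.1, beta=1.0)
f_large = regularized_objective(y_pred, t_true, [3.0, -4.0, 5.0],
                                alpha=0.1, beta=1.0)
# with the same fit, the smoother (small-weight) network scores lower
```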
Genetic algorithm implementation in BRANN feature
selection
A string of integers encodes the numbers of the rows in
the all-descriptors matrix that will be tested as BRANN
inputs (Fig. 2). Each individual encodes the same number
of descriptors; the descriptors are randomly chosen from a
common data matrix in such a way that (1) no two individuals
can have exactly the same set of descriptors and (2)
all descriptors in a given individual must be different. The
fitness of each individual in a generation is determined by
the training mean square error (MSE) of the model, using a
top scaling function that scores a top fraction of the individuals
in a population equally; these individuals have the same
probability of being reproduced, while the rest are assigned the
value 0. As depicted in Fig. 2, children are created sexually
by single-point crossover between father chromosomes and
asexually by mutating one gene in the chromosome of a single
father. Similar to So and Karplus [27], we also included
elitism, so the genetic content of the best-fitted individual
moves on to the next generation intact. The reproductive
cycle is continued until 90% of the generations show
the same target fitness score (Fig. 3).
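The reproduction scheme described above can be sketched as follows. This is a simplified illustration with hypothetical chromosomes and a placeholder fitness callable, not the published implementation; the top-scaled parent fraction and the stopping criterion are reduced to their simplest form.

```python
import random

def evolve(population, fitness, n_keep=2, mut_rate=0.1, n_descriptors=100):
    """One GA generation over integer chromosomes (descriptor row indices).

    population : list of lists of distinct descriptor indices
    fitness    : maps a chromosome to its training MSE (lower is better)
    """
    ranked = sorted(population, key=fitness)
    next_gen = [ranked[0][:]]                      # elitism: best individual passes intact
    parents = ranked[:max(n_keep, 2)]              # top-scaled fraction reproduces equally
    while len(next_gen) < len(population):
        mother, father = random.sample(parents, 2)
        point = random.randrange(1, len(mother))   # single-point crossover
        child = mother[:point] + father[point:]
        if random.random() < mut_rate:             # asexual mutation of one gene
            child[random.randrange(len(child))] = random.randrange(n_descriptors)
        # enforce distinct genes per individual and distinct individuals
        if len(set(child)) == len(child) and child not in next_gen:
            next_gen.append(child)
    return next_gen
```

In the actual BRGNN routine the loop repeats until the target fitness stabilizes across generations, and each chromosome is evaluated by training a Bayesian-regularized network on the selected descriptor subset.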
Contrary to other GA-based approaches, the objective of
the algorithm is not to obtain a single optimum model but a
reduced population of well-fitted models, with MSE lower
than a threshold value, which the Bayesian regularization
guarantees to possess good generalization capabilities
(Fig. 3). Because we used the MSE of training-set fitting,
rather than crossvalidation or test-set MSE values, as the
cost function, the optimum model cannot be derived directly
from the best-fitted model yielded by the genetic search.
However, crossvalidation experiments over the subpopulation
of well-fitted models can identify the best generalizable
network with the highest predictive power. This process also
avoids chance correlations. This approach has proven highly
efficient in comparison with crossvalidation-based GA
approaches, since only optimum models, according to the
Bayesian regularization, are crossvalidated at the end of the
routine, rather than all the models generated throughout the
search.
Genetic algorithm-optimized support vector machines
(GA-SVM)
Support vector machine (SVM) is a machine learning
method that has been used for many kinds of pattern recognition
problems [58]. In contrast to the BRANN framework, which
is not in widespread use, SVM has become a very popular
pattern recognition technique. Since there are excellent
introductions to SVMs [58,59], only the main idea of SVMs
applied to pattern classification problems is stated here. First,
the input vectors are mapped into a feature space (possibly
of higher dimension). Second, a hyperplane that can
separate two classes is constructed within this feature space.
Only relatively low-dimensional vectors in the input space
and matrix products in the feature space are involved in
the mapping function. SVM was designed to minimize structural
risk, whereas previous techniques were usually based on
minimization of empirical risk. Hence, SVM is usually less
vulnerable to the overfitting problem, and it can deal with a
large number of features.
The mapping into the feature space is performed by a
kernel function. There are several parameters in the SVM,
including the kernel function and the regularization parameter.
The kernel function and its specific parameters, together with
the regularization parameter, cannot be set from the
optimization problem but have to be tuned by the user. They can
be optimized by the use of Vapnik–Chervonenkis bounds,
crossvalidation, an independent optimization set, or Bayesian
learning. In the articles from our group, the Radial Basis
Function (RBF) was used as the kernel function.
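For reference, a standard formulation of the RBF kernel (the textbook definition, not code from the reviewed articles) maps a pair of input vectors to a similarity governed by the width parameter σ:

```python
import math

def rbf_kernel(x, y, sigma):
    """K(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

# Identical vectors map to 1; the similarity decays toward 0 with distance,
# and sigma controls how fast that decay occurs.
```

A small σ makes the SVM fit local structure closely (risking overfitting), while a large σ smooths the decision surface; this is exactly why σ must be tuned alongside the regularization parameter C.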
For nonlinear SVM models, we also used GA-based
optimization of the kernel regularization parameter C and the width
σ of the RBF kernel, as suggested by Fröhlich et al. [60]. We
Fig. 2 Flow diagram of the strategy for the genetic algorithm implemented in the BRGNNs
simply concatenated a representation of the parameter to our
existing chromosome. That means we are trying to select an
optimal feature subset and an optimal C at the same time. This
is reasonable, because the choice of the parameter is influenced
by the feature subset taken into account and vice versa.
Usually, it is not necessary to consider any arbitrary value of C,
but only certain discrete values of the form n × 10^k, where
n = 1, . . . , 9 and k = −3, . . . , 4. Therefore, these values
can be calculated by randomly generating n and k values as
integers in (1, . . . , 9) and (−3, . . . , 4), respectively.
In a similar way, we used GA to optimize the width of the
RBF kernel, but in this case, n and k values were integers
in (1, . . . , 9) and (−2, . . . , 1). Then, our chromosome
was concatenated with another gene with discrete values in
the interval (0.001–90,000) for encoding the C parameter,
and similarly the width of the RBF kernel was encoded in a
gene containing discrete values ranging in the interval
(0.01–90). In other articles, feature and hyperparameter genes were
Fig. 3 Reproduction procedure in the BRGNN implementation
concatenated in the chromosomes and encoded as bit strings;
however, evolution was driven using similar crossover, mutation,
and selection operators according to fitness functions
accounting for crossvalidation accuracies [61–63].
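The discrete n × 10^k encoding of the hyperparameter genes can be sketched as follows (an illustrative helper, not the authors' Matlab code); the stated integer ranges reproduce the intervals 0.001–90,000 for C and 0.01–90 for the RBF width:

```python
import random

def random_hyperparameter(n_range, k_range):
    """Draw a discrete value n * 10**k, as used for the C and width genes."""
    n = random.randint(*n_range)   # mantissa, e.g. 1..9
    k = random.randint(*k_range)   # decade exponent, e.g. -3..4
    return n * 10 ** k

# C     drawn from {1..9} x 10^{-3..4} -> spans 0.001 ... 90,000
# sigma drawn from {1..9} x 10^{-2..1} -> spans 0.01  ... 90
C = random_hyperparameter((1, 9), (-3, 4))
sigma = random_hyperparameter((1, 9), (-2, 1))
```

During the GA run these two genes are mutated and crossed over exactly like the descriptor-index genes, so feature selection and hyperparameter tuning proceed in the same evolutionary loop.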
Data subsets are created; subsets are generated in the
crossvalidation process for training the SVM, and another
subset is then predicted. This process is repeated until all
subsets have been predicted. A venetian-blind method was
used for creating the data subsets: first, the data set is
sorted according to the dependent variable, and second,
the cases are added consecutively to each subset, in such
a way that they become representative samples of the whole
data set. The GA routine minimized the regression MSE and
the misclassification percent of the crossvalidation experiment.
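The venetian-blind subset creation described above can be sketched as follows (an illustrative helper; returning index lists rather than data copies is an assumption of this sketch):

```python
def venetian_blind_subsets(y, n_subsets):
    """Sort cases by the dependent variable, then deal them consecutively
    into n_subsets so that each subset spans the whole activity range."""
    order = sorted(range(len(y)), key=lambda i: y[i])   # ranks by activity
    subsets = [[] for _ in range(n_subsets)]
    for rank, idx in enumerate(order):
        subsets[rank % n_subsets].append(idx)           # deal like venetian blinds
    return subsets
```

Because consecutive activity ranks land in different subsets, every fold contains low-, mid-, and high-activity compounds, which is what makes each left-out subset representative of the whole data set.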
The GA-SVM implemented in our articles is a version
of the GA by Caballero and Fernandez [10] incorporating
SVM hyperparameter optimization; it was programmed
within the Matlab environment [53] using the libSVM library for
Matlab by Chang and Lin [64].
A few other authors [61–63] represented features of
chromosomes as bit strings, but SVM parameters were optimized
by the Conjugate Gradient (CG) method during model fitness
evaluation. The crossover and mutation rates were set to adequate
values according to preliminary experiments, and evolution
was stopped when the number of generations reached
a preset maximum value, or when the fitness value remained
constant or nearly constant for a maximum number of
generations [61–63].
Models validation
Traditionally, meaningful assessment of statistical fit of a
QSAR model consists of predicting some removed propor-
tion of the data set. The whole data are randomly split into a
number of disjointed crossvalidation subsets. One from each
of these subsets is left out in turn, and the remaining com-
plement of data is used to make a partial model. The samples in the left-out data are then used to perform predictions. At
the end of this process, there are predictions for all data in
the training set, made up from the predictions originating
from the resulting partial models. All partial models are then
assessedagainst thesameperformance criteria, anddecisions
are made on the basis of the consistency of the assessment
results. The most often used crossvalidation method is
leave-one-out crossvalidation, in which all crossvalidation
subsets consist of only one data point each.
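The leave-one-out procedure described above reduces to the following generic loop (a sketch with placeholder fit/predict callables, not any specific QSAR model):

```python
def leave_one_out(cases, fit, predict):
    """Leave-one-out crossvalidation: each case is predicted by a partial
    model trained on all remaining cases.

    cases             : list of (x, y) pairs
    fit(train)        : builds a model from a list of (x, y) pairs
    predict(model, x) : prediction for the left-out case
    """
    predictions = []
    for i, (x, _) in enumerate(cases):
        train = cases[:i] + cases[i + 1:]        # complement of the left-out case
        predictions.append(predict(fit(train), x))
    return predictions
```

The returned predictions, one per case, are what the consistency assessment and the Q2 statistic are computed from.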
In addition to assessment of statistical fit by crossvalidation,
randomization of the modeled property (also known
as Y-randomization) has also been used to evaluate model
robustness [21,24,27,65,66]. Undesirable chance correlations can
be achieved as a result of exhaustive GA searches. So and
Karplus [27] proposed the evaluation of crossvalidation
performance on several scrambled data sets. The position
of the dependent variable (modeled property) for every
case along the data set is randomized several times, and Q2
is calculated. The absence of chance correlation is proved
when no Q2 > 0.5 appears during the test [27].
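The Y-randomization test can be sketched as follows; the q2_of callable is a placeholder for a full crossvalidated model-building run on the scrambled response, which is the expensive part in practice:

```python
import random

def y_randomization_test(y, q2_of, n_trials=10, threshold=0.5, seed=0):
    """Scramble the modeled property n_trials times and recompute Q2.

    q2_of(y_scrambled) -> crossvalidated Q2 obtained with the scrambled response
    Returns True when no scrambled Q2 exceeds the threshold, i.e. when
    no chance correlation is detected.
    """
    rng = random.Random(seed)
    scrambled_q2 = []
    for _ in range(n_trials):
        y_perm = y[:]
        rng.shuffle(y_perm)               # randomize the dependent-variable positions
        scrambled_q2.append(q2_of(y_perm))
    return max(scrambled_q2) < threshold
```

A model that still reaches Q2 > 0.5 on scrambled activities has merely memorized the descriptor pool, which is exactly the failure mode exhaustive GA searches can produce.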
The accuracy of crossvalidation results is widely
accepted in the literature on the basis of the Q2 value. In this
sense, a high value of this statistic (Q2 > 0.5)
is considered as proof of the high predictive ability of the
model. However, a high value of Q2 appears to be a necessary
but not sufficient condition for the model to have
high predictive power, and the predictive ability of a QSAR
model can only be estimated using a sufficiently large collection
of compounds that was not used for building the model
[65,66]. In this sense, the data set can be divided into training
and validation (or test) partitions. For a given partitioning,
a model is constructed only from the samples of the training
set. At this point, an important step is the generation of these
partitions. Quite a few methods have been used, such as random
selection, activity-ranked binning, and sphere exclusion
algorithms [65,66]. Various forms of neural networks have
also been employed in the selection of training sets, including
Kohonen neural networks [19].
Undoubtedly, external validation is a way to establish the
reliability of a QSAR model. However, the majority of studies
that are validated by external predictions are based on a
single validation set; this may cause the predictors to perform
well on a particular external set, but there is no guarantee that
the same results will be achieved on another. For example,
it can happen that several outliers are, by pure coincidence,
out of the test set, in which case the validation error will be
small even though the training error was high. The ensemble
solution has been proposed for originating multiple validation
sets [67]. An ensemble is a collection of predictors that,
as a whole, provides a prediction which is a combination of
the individual ones. If there is disagreement among those
predictors, then very reliable models can be obtained, since
a further decrease in generalization error can be achieved.
Another trait to take into account for the ensemble application
is the average error of the ensemble members; by decreasing the
error of each individual, the ensemble attains a smaller
generalization error [67].
In BRGNN-related studies, the predictive power was
measured taking into account the R2 and root MSE values of the
averaged test sets of BRGNN ensembles having an optimum
number of members [15,18,19,21,24,68,69]. For generating
the predictors to be averaged, the whole data set was
partitioned into several training and test sets. The assembled
predictors aggregate their outputs to produce a single prediction.
In this way, instead of predicting a single randomly selected
external set, the result of averaging several sets was
predicted. Each case was predicted several times while forming
training and test sets, and an average of both values was reported.
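Ensemble aggregation by output averaging reduces to the following (a minimal sketch in which member models are plain callables; the actual studies average BRGNN members trained on different data partitions):

```python
def ensemble_predict(models, x):
    """Aggregate member outputs into a single averaged prediction."""
    outputs = [model(x) for model in models]
    return sum(outputs) / len(outputs)
```

When member errors are partly uncorrelated, their average cancels part of each individual error, which is the mechanism behind the smaller generalization error cited above.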
Data sets: sources and general prior preparation
Biological activity measurements were taken as affinity
constants (Ki) or ligand concentrations for 50% (IC50) or
90% (IC90) inhibition of the targets (Table 1). For modeling,
IC50 and IC90 were converted into logarithmic activities
(pIC50 and pIC90), which are measurements of drug effectiveness,
i.e., the functional strength of the ligand toward the target.
For classification problems, data were labeled according
to some convenient threshold.
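This logarithmic conversion is simply the negative decadic logarithm of the molar concentration (a standard definition, not code from the reviewed articles):

```python
import math

def pic50(ic50_molar):
    """Convert an IC50 (mol/L) to the logarithmic activity pIC50 = -log10(IC50)."""
    return -math.log10(ic50_molar)

# A 10 nM inhibitor (1e-8 M) gives pIC50 = 8; higher pIC50 means a more
# potent ligand, which makes the scale convenient for regression modeling.
```

The same formula applies to IC90 (giving pIC90) and to Ki values (giving pKi).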
In our articles, prior to molecular descriptor calculations,
3D structures of the studied compounds (Fig. 4) were
geometrically optimized using the semi-empirical quantum-chemical
methods implemented in the MOPAC 6.0 computer
software by the Frank J. Seiler Research Laboratory [70].
The articles in Table 1 included QSAR modeling of cancer
therapy targets [19,20,23,25,71–73], an HIV target [22],
Table 1 Data set details and statistics of the optimum models reported by BRGNN modeling

Dataset category | Target name or biological activity/function | Descriptor type | Data size | Optimum variables | Validation accuracy (%) | Ref.
Cancer | Farnesyl protein transferase | 3D | 78 | 7 | 70 | [25]
Cancer | Matrix metalloproteinase | 2D | 30a | 6 | 70a | [23]
Cancer | Matrix metalloproteinase | 2D | 63–68b | 7 | 80b | [72]
Cancer | Cyclin-dependent kinase | 2D | 98 | 6 | 65 | [19]
Cancer | LHRH (non-peptide) | 2D | 128 | 8 | 75 | [20]
Cancer | LHRH (erythromycin A analogs) | Quantum chemical | 38 | 4 | 70 | [71]
HIV | HIV-1 protease | 2D | 55 | 4 | 70 | [22]
Cardiac dysfunction | Potassium channel | 2D | 29 | 3 | 91 | [16]
Cardiac dysfunction | Calcium channel | 2D | 60 | 5 | 65 | [17]
Alzheimer's disease | Acetylcholinesterase inhibition (tacrine analogs) | 3D | 136 | 7 | 74 | [21]
Alzheimer's disease | Acetylcholinesterase inhibition (huprine analogs) | 3D | 41 | 6 | 84 | [24]
Antifungal | Candida albicans | 3D | 96 | 16 | 87 | [10]
Antiprotozoan | Cruzain | 2D | 46 | 5 | 75 | [18]
Protein conformational stability | Human lysozyme | 2D | 123 | 10 | 68 | [68]
Protein conformational stability | Gene V protein | 2D | 123 | 10 | 66 | [69]
Protein conformational stability | Chymotrypsin inhibitor 2 | 3D | 95 | 10 | 72 | [15]

a Average values of five models for MMP-1, MMP-2, MMP-3, MMP-9, and MMP-13 matrix metalloproteinases
b Average values of three models for MMP-1, MMP-9, and MMP-13 matrix metalloproteinases
Alzheimer's disease targets [21,24], ion channel blockers
[16,17], antifungals [10], an antiprotozoan target [18], ion channel
proteins [29], the ghrelin receptor [30], and protein
conformational stability [15,68,69]. Dragon computer software
[74] was used for generating the majority of the feature
vectors for low-weight compounds. Four types of molecular
descriptors (according to the Dragon software classification)
were used: zero-dimensional (0D), one-dimensional (1D),
two-dimensional (2D), and three-dimensional (3D). When 2D
topological representation of molecules was used, the spatial
lag was varied from 1 to 8. Four atomic properties (atomic
masses, atomic van der Waals volumes, atomic Sanderson
electronegativities, and atomic polarizabilities) weighted
both 2D and 3D molecular graphs. In some biological systems,
it was suitable to use quantum-chemical descriptors,
which were calculated from the output files of the semi-empirical
geometry optimizations.
In the studies of pharmacokinetic and pharmacodynamic
properties, including absorption, distribution, metabolism,
excretion, and toxicity (ADMET), using GA-optimized
SVMs, several properties were modeled: identification
of P-glycoprotein substrates and nonsubstrates (P-gp) [61],
prediction of human intestinal absorption (HIA) [61],
prediction of compounds inducing torsades de pointes
(Tdp) [61], prediction of BBB penetration [61], human
plasma protein binding rate (PPBR) [62], oral bioavailability
(BIO) [62], and induced mitochondrial toxicity (MT)
[63]. All the structures of the compounds were generated
and then optimized using the Cerius2 program package
(Cerius2, version 4.10) [75]. The authors manually inspected
the 3D structure of each compound to ensure that each
molecule was properly represented, and molecular descriptors
were computed using the online application PCLIENT
[76].
Feature spaces for peptides and proteins in [68] and [69]
were computed using the in-house software PROTMETRICS
[77]. Different sets of protein feature vectors were computed
on the sequences [68,69] and crystal structures [15], weighted
by 48 amino acid/residue properties from the AAindex database
[78].
In general, descriptors that were constant or almost constant
were eliminated; pairs of variables with a squared
correlation coefficient greater than 0.9 were classified as
intercorrelated, and only one of these was included for building
the model. The resulting data matrices were still of high
dimension. Feature subspaces in such matrices were explored
in search of lower-dimensional combinations of vectors
that yield optimum nonlinear models through the BRGNN or
GA-SVM techniques. Afterward, in some applications, optimum
feature vectors were used for unsupervised training of
competitive neurons to build self-organized maps (SOMs)
[79] for the qualitative analysis of optimum chemical
subspace distributions at different activity levels.
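The descriptor pre-filtering described above (dropping near-constant columns and one member of each pair with r² > 0.9) can be sketched as follows; keeping the first member of each intercorrelated pair is an assumption of this sketch:

```python
def prefilter_descriptors(columns, r2_cut=0.9, eps=1e-12):
    """Drop (near-)constant descriptors and, for each intercorrelated
    pair (squared correlation > r2_cut), keep only the first one seen.

    columns : dict of descriptor name -> list of values (one per compound)
    """
    def variance(v):
        m = sum(v) / len(v)
        return sum((x - m) ** 2 for x in v) / len(v)

    def r2(u, v):
        mu, mv = sum(u) / len(u), sum(v) / len(v)
        cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
        return cov * cov / (variance(u) * variance(v) * len(u) ** 2)

    kept = {}
    for name, values in columns.items():
        if variance(values) < eps:                        # constant or almost constant
            continue
        if any(r2(values, v) > r2_cut for v in kept.values()):
            continue                                      # intercorrelated with a kept one
        kept[name] = values
    return kept
```

Even after this filter, the matrices remain high-dimensional, which is why the GA-driven subspace search described above is still needed.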
Application of BRGNN and GA-SVM to ligand–target data sets
ADMET modeling
GA-optimized SVMs have been applied at the early stage of
drug discovery to predict pharmacokinetic and pharmacodynamic
properties, including ADMET [61–63]. An interesting
SVM method that combined GA for feature selection with the
CG method for parameter optimization (GA-CG-SVM) was
reported to predict PPBR and BIO [62]. A general
implementation of this framework is described later. For each
individual, feature chromosomes were represented as bit
strings, but SVM parameters were optimized by the CG method
during model fitness evaluation. The crossover and mutation
rates were set to 0.8 and 0.05, respectively. Evolution
was stopped when the number of generations reached 500 or
when the fitness value remained constant or nearly constant for
the last 50 generations. This approach yielded an optimum
29-variable model for the PPBR of 692 compounds with
prediction accuracies of 86 and 81% for five-fold crossvalidation
and the independent test set (161 compounds), respectively.
At the same time, an optimum 25-variable model
for the BIO data set, including 690 compounds in the training
set and 76 compounds in an independent validation set,
had prediction accuracies of 80 and 86% for training set
five-fold crossvalidation and the independent test set, respectively
[62]. The descriptors selected by the GA-CG method covered
a large range of molecular properties, which implies that
the PPBR and BIO of a drug might be affected by many
complicated factors. The authors claimed that the PPBR and
BIO predictors outperformed previous models in the literature
[62].
Drug-induced MT has been one of the key reasons for
drugs failing to enter, or being withdrawn from, the market
[80]. That is why MT has become an important test in
ADMET studies. The hybrid GA-CG-SVM approach was
also applied to predict MT using a collected data set of
288 compounds, including 171 MT+ and 117 MT− [63].
The data set was randomly divided into a training set (253
compounds) and a test set (35 compounds). Bit string representation
of the feature chromosome was used. Populations were
evolved with crossover and mutation rates of 0.5
and 0.1, respectively. The algorithm was stopped when the
generation number reached 200 or the fitness value did
not improve during the last 10 generations [63]. Accuracies
for five-fold crossvalidation and the test set were about
85 and 77%, respectively. A total of 27 optimum molecular
descriptors were selected, which were roughly grouped
into five categories: molecular weight-related descriptors,
van der Waals volume-related descriptors, electronegativities,
molecular structural information, and shape- and other
physicochemical properties-related descriptors. This descriptor
Table 2 Data set details and statistics of the optimum models reported by GA-SVM modeling

Dataset category | Target name or biological activity/function | Descriptor type | Data size | Optimum variables | Validation accuracy (%) | Ref.
ADMET | Human plasma protein binding rate (PPBR) | 0D, 1D, 2D and 3Da | 853 | 29 | 81 | [63]
ADMET | Oral bioavailability (BIO) | 0D, 1D, 2D and 3D | 766 | 25 | 86 |
ADMET | Mitochondrial toxicity (MT) | 0D, 1D, 2D and 3D | 288 | 27 | 77 | [64]
ADMET | P-glycoprotein substrates and nonsubstrates (P-gp) | 0D, 1D, 2D and 3D | 201 | 8 | 85 | [62]
ADMET | Human intestinal absorption (HIA) | 0D, 1D, 2D and 3D | 196 | 25 | 87 |
ADMET | Induction of torsades de pointes (Tdp) | 0D, 1D, 2D and 3D | 361 | 17 | 86 |
ADMET | Blood–brain barrier (BBB) penetration | 0D, 1D, 2D and 3D | 3,941 | 169 | 91 |
ADMET | Blood–brain barrier (BBB) penetration | 0D, 1D, 2D and 3D | 593 | 24 | 97 |
Cancer | Apoptosis | 0D, 1D, 2D and 3D | 43 | 7 | 92 | [72]
Aqueous solubility | LogS | Structural, atom type, electrotopological | 1,342 | 9 | 90 | [95]
Aqueous solubility | Log P | Structural, atom type, electrotopological | 10,782 | 14 | 82 |
Protein function/class | Folding class | Sequence features and order | 204,277498 | 700 | 90 | [102]
Protein function/class | Subcellular location | Physicochemical composition | 504 | 33 | 56 | [103]
Protein function/class | Subcellular location | Physicochemical composition | 703 | 28 | 72 |
Protein function/class | Protein–protein complexes | Physicochemical atomic properties | 172,345 | 30 | 90 | [104]
Protein function/class | Voltage-gated K+ channelb | 2D | 100 | 3 | 85 | [29]
Protein function/class | Ghrelin receptor | 2D | 23 | 2 | 93 | [30]

a Descriptor classification according to the Dragon software [74]
b Average over three physiological variable models
diversity pointed out the high complexity of the MT mechanism
[63].
The same methodology was successfully applied to other
ADMET-related properties [61]. Identification of P-gp
substrates and nonsubstrates yielded an eight-input model
explaining 85% of crossvalidation variance. Prediction of HIA
yielded a 25-input model explaining 87% of crossvalidation
variance. Prediction of compounds inducing Tdp yielded a
17-input model explaining 86% of crossvalidation variance.
Prediction of BBB penetration yielded two models, a 169-input
and a 24-input model, explaining more than 91 and 94% of
crossvalidation variance, respectively [61] (Table 2). The
authors cited above claimed that the optimum models significantly
improve overall prediction accuracy and have fewer input
features in comparison to previously reported models [61].
Anticancer targets
Cancer is characterized by uncontrolled proliferative growth
and the spread of aberrant cells from their site of origin. Most
anticancer agents exert their therapeutic action by damaging
DNA, blocking DNA synthesis, altering tubulin
polymerization–depolymerization, or disrupting the hormonal
stimulation of cell growth [81]. Recent findings on the underlying
genetic changes related to the cancerous state have aroused
interest in novel mechanistic targets.
Computer-aided development of cancer therapeutics has
taken on new dimensions since modern biological techniques
opened the way to understanding the mechanism and structure
of key cellular processes at the protein level. In the context
of cancer therapy targets, BRGNNs have been employed to
predict inhibition of farnesyl protein transferase [25], matrix
metalloproteinase (MMP) [23,70], and cyclin-dependent kinase
[19], as well as antagonist activity for the luteinizing hormone
releasing hormone (LHRH) receptor [20,69]. Results from
BRGNN modeling of four cancer-target data sets appear in
Table 1. The numbers of selected features varied according to
the size and variability of each data set. The selected features
correspond to the molecular descriptors that best
described the affinity of the ligands toward the targets. Models
were validated by crossvalidation and/or test set prediction.
Validation accuracies were higher than 65% for all data
sets.
Two-dimensional molecular descriptors were used for
BRGNN modeling of the activity toward cancer targets
of several chemotypes in Fig. 4, such as 1H-pyrazolo[3,4-
d]pyrimidine derivatives (1 and 2) as cyclin-dependent kinase
inhibitors; heterocyclic compounds as LHRH agonists; and
thieno[2,3-b]pyridine-4-ones (3), thieno[2,3-d]pyrimidine-
2,4-diones (4), imidazo[1,2-a]pyrimidin-5-ones (5),
benzimidazole derivatives (6 and 7), N-hydroxy-2-[(phenylsulfonyl)amino]acetamide
derivatives (8 and 9), and
N-hydroxy-α-phenylsulfonylacetamide derivatives (10 and
11) as inhibitors of the MMP family.
On the other hand, thiol (12) and non-thiol (13) inhibitors
of farnesyl protein transferase in Fig. 4 were modeled by 3D
descriptors, which encoded distributions of atomic properties
over the three-dimensional molecular space [25]. Knowledge of
the binding mode was available for this target; thus, ligand
molecules were conveniently aligned to the crystal structure of
an inhibitor in the binding site. 3D encoding of molecules is
more realistic than the 2D approximation, but conformational
variability could introduce some undesirable noise into the data.
Consequently, 2D descriptors tend to achieve better
performance when the system lacks binding mode information
and/or when the target is promiscuous and the ligands bind
in different conformations.
It is worth noting that BRGNNs trained with quantum-chemical
descriptors from 11,12-cyclic carbamate derivatives
of 6-O-methylerythromycin A (14) in Fig. 4 predicted
LHRH antagonist activity with 70% accuracy [69]. Quantum-chemical
descriptors only encoded information relative to the
electronic states of the molecules rather than the distribution of
chemical groups on the structure. The structural homogeneity
of the macrolides in this data set suggests a well-defined
electronic pattern that was successfully recognized by the
networks after supervised training.
Unwanted, defective, or damaged cells are rapidly and
selectively eliminated from the body by the innate mechanism
called apoptosis, or programmed cell death. Resistant
tumor cells evade the action of anticancer agents by increasing
their apoptotic threshold [82,83]. This has triggered
interest in novel chemical compounds capable of inducing
apoptosis in chemo/immunoresistant tumor cells, and apoptosis
has therefore received huge attention in recent years
[82,83]. The induction of apoptosis by a total of 43 4-aryl-
4H-chromenes (15) in Fig. 4 was predicted by chemometric
methods using molecular descriptors calculated from the
molecular structure [71]. GA and stepwise multiple linear
regression were applied to feature selection for SVM, ANN,
and MLR training. Nevertheless, GA was implemented
inside the linear framework, and the selected descriptors
were then used for SVM and ANN training. The optimum
7-variable SVM predictor superseded ANN and MLR as well as
previously reported models, showing correlation coefficients
of 0.950 and 0.924 for the training and test set, respectively,
with crossvalidation accuracy of about 70% [71].
Acetylcholinesterase inhibition
Alzheimer's disease (AD) is a neurodegenerative disorder
characterized by a progressive impairment of cognitive
function, which seems to be associated with deposition of
amyloid protein and neuronal loss, as well as with altered
neurotransmission in the brain. Neurodegeneration in AD
patients is mainly attributed to the loss of the basal forebrain
cholinergic system, which is thought to play a central role in
producing the cognitive impairments [84]. Therefore,
enhancement of cholinergic transmission has been regarded
as one of the most promising approaches for treating AD
patients.
BRGNN models of acetylcholinesterase inhibition by
huprine- and tacrine-like inhibitors have been reported. For
analogs of tacrine (16) [21] and huprine (17) [24] in Fig. 4,
GA explored a wide pool of 3D descriptors. The predictive
capacity of the selected model was evaluated by averaging
multiple validation sets generated as members of neural network
ensembles (NNEs). The tacrine model showed an adequate
test accuracy of about 71% [21] (Table 1). Likewise, the huprine
analogs data set was also evaluated by NNE averaging, showing
an optimum high accuracy of 85% when 40 networks were
assembled [24]. The higher accuracy yielded for the huprine
analogs in comparison to the tacrine analogs could be
related to the higher structural variability of the tacrine data set.
This fact contributed to the 30% prediction uncertainty
in the affinity of tacrine analogs. In this connection, tacrine-like
inhibitors had been found experimentally to bind
acetylcholinesterase in different binding modes at the active site
and also at peripheral sites [85,86].
HIV-1 protease inhibition
A number of targets for potential chemotherapeutic intervention
against HIV-1 are provided by the retrovirus life
cycle. The protease-mediated transformation from the immature,
non-dangerous virion to the mature, infective virus is
a crucial stage in the HIV-1 life cycle. HIV-1 protease has
thus become a major target for anti-AIDS drug design, and its
inhibition has been shown to extend the length and improve
the quality of life of AIDS patients [87]. A large number
of inhibitors have been designed, synthesized, and assayed,
and several HIV-1 protease inhibitors are now utilized in the
treatment of AIDS [87–90].
Cyclic urea derivatives (18) in Fig. 4 are among the most
successful candidates for AIDS targeting, and BRGNN was
successfully applied to model the activities of a set of such
compounds toward HIV-1 protease [22]. 2D encoding was
used to avoid conformational noise in the feature chemical
space, and the optimum BRGNN model accurately predicted
IC50 values with 70% accuracy in the validation test for 55 cyclic
urea derivatives (Table 1). Although the feature space was
only 2D-dependent, the problem was accurately solved
by the nonlinear approach. Inhibitory activity variations due
to differential chemical substitutions on the cyclic urea scaffold
were learned by the networks, and the activity of new
compounds was adequately predicted.
Potassium-channel and calcium entry blocker activities
K+ channels constitute a remarkably diverse family of
membrane-spanning proteins that have a wide range of functions
in electrically excitable and unexcitable cells. One important
class opens in response to a calcium concentration increase
within the cytosol. Pharmacological and electrophysiological
evidence and, more recently, structural evidence from cloning
studies have established that there exist several kinds of
Ca2+-activated K+ channels [91,92].
Several compounds have been shown to block the IKCa-mediated
Ca2+-activated K+ permeability in red blood cells
[93]. A model of the selective inhibition of the intermediate-conductance
Ca2+-activated K+ channel by some clotrimazole
analogs (19, 20) in Fig. 4 was developed with BRGNNs
[16]. Substitutions around the triarylmethane scaffold yielded
a differential inhibition of the K+ channel by triarylmethane
analogs that was encoded in 2D descriptors. The BRGNN
approach yielded a remarkably accurate model describing
more than 90% of data variance in validation experiments.
Interactions with the ion channel were encoded in topological
charge variables, and the homogeneity of the data set
assures a very high prediction accuracy. The SOM map of
blockers depicted a very good behavior of the optimum features
for unsupervised differentiation of inhibitors at different activity
levels [16].
Similarly, a BRGNN model of calcium entry blockers
with myocardial activity (negative inotropic activity) was
reported [17]. Taking into account the lack of information
about the active conformations and mechanism of action of
diltiazem analogs (21–23) in Fig. 4 as cardiac malfunction
drugs, structural information was encoded in 2D topological
autocorrelation vectors. Remarkably, the optimum BRGNN
model exhibited an adequate accuracy of about 65% [17]. The
complexity of the cellular cardiac response, a multifactor
event in which several interactions such as membrane trespassing
and receptor interactions take place, accounts for
this discrete but adequate performance.
Antifungal activity
None of the existing systemic antifungals satisfies the medical
need completely; there are weaknesses in spectrum,
potency, safety, and pharmacokinetic properties [10]. Few
substances have been discovered that exert an inhibitory
effect on the fungi pathogenic to humans, and most of these
are relatively toxic. The BRGNN methodology was applied to a
data set of antifungal heterocyclic ring derivatives in Fig. 4
(2,5,6-trisubstituted benzoxazoles; 2,5-disubstituted
benzimidazoles; 2-substituted benzothiazoles; and 2-substituted
oxazolo(4,5-b)pyridines) (24 and 25) [10].
A comparative analysis using MLR and BRGNNs was
carried out to correlate the inhibitory activity against Candida
albicans (log(1/C)) with 3D descriptors encoding the
chemical structures of the heterocyclic compounds [10].
Beyond the improvement in training set fitting, BRGNN
outperformed multiple linear regression, describing 87% of the
test set variance. The antifungal nonlinear models showed
that the distributions of van der Waals atomic volumes and
atomic masses have a large influence on the antifungal
activities of the compounds studied. The BRGNN model also
included the influence of atomic polarizability, which could be
associated with the capacity of the antifungal compounds to
be deformed when interacting with biological macromolecules
[10].
Antiprotozoan activity
Trypanosoma cruzi, a parasitic protozoan, is the causative
agent of Chagas disease or American trypanosomiasis,
one of the most threatening endemic diseases in Central and South
America. The primary cysteine protease of Trypanosoma
cruzi, cruzain, is expressed throughout the life cycle and is
essential for the survival of the parasite within host cells
[94]. Thus, cruzain inhibition has become an attractive route
for the development of potential therapeutics for the treatment of
Chagas disease.
The Ki values of a set of 46 ketone-based cruzain inhibitors
(26 and 27) in Fig. 4 were successfully modeled by means
of data-diverse ensembles of BRGNNs using 2D molecular
descriptors, with an accuracy of about 75% [18]. The
BRGNNs outperformed a GA-optimized PLS model, suggesting
that the functional dependence between affinity and the
topological structure of the inhibitors has a strong nonlinear
component. The unsupervised training of SOM maps with
optimum feature vectors depicted high and low inhibitory
activity levels that matched well with the activity profiles
of the data set.
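The SOM analyses mentioned above can be sketched with a minimal self-organizing map. This is an illustrative toy implementation, not the authors' setup: the grid size, learning schedule, and two-cluster toy data (standing in for high- and low-activity descriptor vectors) are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_feat, grid = 5, (4, 4)                    # descriptor length, map size (toy)
weights = rng.random(grid + (n_feat,))      # one prototype vector per map unit

def best_matching_unit(x):
    """Grid position of the unit whose prototype is closest to sample x."""
    dists = np.linalg.norm(weights - x, axis=2)
    return np.unravel_index(np.argmin(dists), grid)

def train(data, epochs=50, lr0=0.5, sigma0=2.0):
    coords = np.indices(grid).transpose(1, 2, 0)  # (row, col) of each unit
    for t in range(epochs):
        lr = lr0 * (1 - t / epochs)               # decaying learning rate
        sigma = sigma0 * (1 - t / epochs) + 0.5   # shrinking neighborhood
        for x in data:
            bmu = np.array(best_matching_unit(x))
            # Gaussian neighborhood: units near the BMU move toward x.
            d2 = ((coords - bmu) ** 2).sum(axis=2)
            h = np.exp(-d2 / (2 * sigma ** 2))[..., None]
            weights[:] += lr * h * (x - weights)

# Two toy descriptor clusters standing in for high/low activity compounds.
data = np.vstack([rng.normal(0, 0.1, (20, n_feat)),
                  rng.normal(1, 0.1, (20, n_feat))])
train(data)
print(best_matching_unit(data[0]), best_matching_unit(data[-1]))
```

After training, samples from the two clusters map to different units, which is the unsupervised activity-level differentiation the SOM maps provide.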
Aqueous solubility
The aqueous solubility (log S) and lipophilicity (log P) are
very important properties to be evaluated in the drug design
process. Zhang et al. [95] reported SVM classifiers considering a
three-class scheme for these two properties. They applied GA
for feature selection and the CG method for parameter optimization.
Two data sets with 1,342 and 10,782 compounds were
used to generate the log S and log P models. The chromosome
was represented as a bit string, and simple mutation and crossover
operators were used to create the individuals of the new
generations. Five-fold cross-validation accuracy was used as
the fitness function to evaluate the quality of the individuals
allowed to reproduce or survive to the next generation.
A roulette wheel algorithm selected the chromosomes for
crossover to produce offspring, and the swapping positions
were created randomly, with crossover and mutation rates of
0.5 and 0.1, respectively [95]. The overall prediction accuracies
for log S were 87 and 90% for the training set and test set,
respectively. Similarly, the overall prediction accuracies for
log P were 81.0 and 82.0% for the training set and test set, respectively.
The prediction accuracies of the two-class models of
log S and log P were higher than those of the three-class models, and GA
feature selection had a significant impact on the quality of the
classification [95].
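The GA loop described above (bit-string chromosomes, roulette-wheel selection, and crossover and mutation rates of 0.5 and 0.1) can be sketched as follows. The toy fitness function merely stands in for the five-fold cross-validation accuracy of a classifier built on the selected descriptors; the pool size and the "good" feature subset are invented for illustration.

```python
import random

N_FEATURES = 20          # size of the candidate descriptor pool (toy)
POP_SIZE = 30
CROSSOVER_RATE = 0.5     # rates reported in [95]
MUTATION_RATE = 0.1

# Hypothetical stand-in for cross-validation accuracy: reward one
# particular "good" descriptor subset and penalize extra features.
GOOD = {1, 4, 7, 11}
def fitness(bits):
    chosen = {i for i, b in enumerate(bits) if b}
    return len(chosen & GOOD) - 0.1 * len(chosen - GOOD)

def roulette(pop, fits):
    """Sample two parents with probability proportional to fitness."""
    lo = min(fits)
    weights = [f - lo + 1e-9 for f in fits]   # shift to strictly positive
    return random.choices(pop, weights=weights, k=2)

def crossover(a, b):
    if random.random() < CROSSOVER_RATE:
        cut = random.randrange(1, N_FEATURES)  # random swapping position
        return a[:cut] + b[cut:]
    return a[:]

def mutate(bits):
    return [1 - b if random.random() < MUTATION_RATE else b for b in bits]

random.seed(0)
pop = [[random.randint(0, 1) for _ in range(N_FEATURES)]
       for _ in range(POP_SIZE)]
for gen in range(50):
    fits = [fitness(ind) for ind in pop]
    nxt = []
    while len(nxt) < POP_SIZE:
        p1, p2 = roulette(pop, fits)
        nxt.append(mutate(crossover(p1, p2)))
    pop = nxt

best = max(pop, key=fitness)
print(sorted(i for i, b in enumerate(best) if b))
```

In the published work the fitness evaluation would train and cross-validate an SVM on each candidate feature subset, which dominates the runtime of the whole search.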
Protein function/class-structure relationships
Functional variations induced by mutations are the main
causes of several genetic pathologies and syndromes. Given
the availability of functional variation data on mutations of
several proteins and of other protein functional/structural data,
it is possible to use supervised learning to model protein
function/property relationships [29,30,70,71,96-105]. GA-SVM
regression and binary classification were carried out
to predict functional properties of ghrelin receptor mutants
[30] and voltage-gated K+ channel proteins [29]. Structural
information was encoded in 2D descriptors calculated from
the protein sequences. The regression and classification tasks
were properly attained, with accuracies of about 93 and 85%,
respectively (Table 2). The optimum model of the constitutive
activity of the ghrelin receptor was remarkably accurate
while depending on only two descriptors.
A novel 3D pseudo-folding graph representation of protein
sequences inside a magic dodecahedron was used to
classify voltage-gated potassium channels (VKCs) according
to the signs of three electrophysiological variables: activation
threshold voltage, half-activation voltage, and half-inactivation
voltage [29]. We found relevant contributions of
the pseudo-core and pseudo-surface of the 3D pseudo-folded
proteins to the discrimination between VKCs according
to the three electrophysiological variables. Moreover, the
accuracies of the voltage-gated K+ channel models built by
GA-SVM were higher than those of the other nine GA-wrapper
linear and nonlinear classifiers [29].
Since many disease-causing mutations exert their effects
by altering protein folding, the prediction of protein structures
and of stability changes upon mutation is a fundamental
aim in molecular biology. The BRGNN technique has also been
applied to model the conformational stability of mutants of
human lysozyme [68], gene V protein [69], and chymotrypsin
inhibitor 2 [15]. The unfolding Gibbs free energy changes
(ΔΔG) of human lysozyme and gene V protein mutants
were successfully modeled using amino acid sequence autocorrelation
vectors calculated by measuring the autocorrelations
of 48 amino acid/residue properties [68,69] selected
from the AAindex database [78]. On the other hand, the ΔΔG of
chymotrypsin inhibitor 2 mutants was predicted using
protein radial distribution scores calculated over the 3D structure
using the same 48 amino acid/residue properties. Ensembles
of BRGNNs yielded optimum nonlinear models for the
conformational stabilities of the human lysozyme, gene V protein,
and chymotrypsin inhibitor 2 mutants, which described about
68, 66, and 72% of the ensemble test set variances, respectively (Table 1).
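A minimal sketch of the sequence autocorrelation descriptors of the kind used in [68,69], assuming a Moreau-Broto-style averaged formula and a single Kyte-Doolittle hydrophobicity scale in place of the 48 AAindex property scales used in the published models:

```python
# Kyte-Doolittle hydrophobicity values for a few residues; the published
# models drew 48 such property scales from the AAindex database instead.
HYDRO = {'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5,
         'E': -3.5, 'G': -0.4, 'L': 3.8, 'K': -3.9, 'S': -0.8}

def autocorrelation(seq, prop, max_lag=4):
    """Averaged Moreau-Broto autocorrelation of a residue property:
    AC(d) = (1 / (n - d)) * sum_i p(i) * p(i + d), for lags d = 1..max_lag."""
    vals = [prop[aa] for aa in seq]
    n = len(vals)
    descriptors = []
    for d in range(1, max_lag + 1):
        ac = sum(vals[i] * vals[i + d] for i in range(n - d)) / (n - d)
        descriptors.append(ac)
    return descriptors

print(autocorrelation("ARNDCEGLKS", HYDRO))
```

Each lag d correlates a property value with the value d residues downstream, so the vector captures how a physicochemical property is distributed along the sequence independently of sequence length.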
The neural network models provided information about
the most relevant properties ruling the conformational stability
of the studied proteins. The authors determined how each input
descriptor was correlated to the output predicted by the
network [15,68,69]. Entropy changes and the power to be at
the N-terminal of an α-helix had the strongest contributions
to the stability pattern of human lysozyme. In the case of
the gene V protein mutants, the sequence autocorrelations of
thermodynamic transfer hydrophobicity and the power to be
at the middle of an α-helix had the highest impact on the
ΔΔG. Meanwhile, the spherical distribution of side-chain
entropy changes over the 3D structure of the chymotrypsin inhibitor
2 mutants exhibited the highest relevance in comparison with
the other descriptors.
Prediction of the structural class of a protein, which characterizes
its overall folding type or that of its domain, has typically been based on a
group of features that possesses only one kind of discriminative
information. Other types of discriminative information
associated with the primary sequence have thus been missed,
reducing the prediction accuracy [102]. Li et al. [102] reported a
novel method for the prediction of protein structural class by
coupling GA and SVMs. Proteins were represented by six
feature groups composed of 10 structural and physicochemical
features of proteins and peptides, yielding a total of 1,447
features. GA was applied to select an optimum feature subset
and to optimize the SVM parameters. The authors used a hybrid
binary-decimal representation of the chromosomes, and the fitness
function was the accuracy of five-fold cross-validation.
Features in the chromosome were represented as 1,447 binary
genes and the parameters as two decimal genes. Jack-knife
tests on the working data sets yielded outstanding prediction
accuracies of classification higher than 97%, with an overall
accuracy of 99.5% [102] (Table 2).
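The hybrid binary-decimal chromosome of [102] can be sketched as follows. The choice of C and gamma as the two decimal genes and their ranges are assumptions for illustration; the paper's exact parameterization may differ.

```python
import random

N_FEAT = 1447  # one binary gene per candidate feature, as in [102]

def random_chromosome():
    """1,447 binary feature genes followed by two decimal SVM-parameter
    genes (hypothetical ranges for C and the RBF gamma)."""
    bits = [random.randint(0, 1) for _ in range(N_FEAT)]
    c_gene = random.uniform(0.1, 100.0)      # decimal gene: SVM C
    gamma_gene = random.uniform(1e-4, 1.0)   # decimal gene: RBF gamma
    return bits + [c_gene, gamma_gene]

def decode(chrom):
    """Split a chromosome back into (selected feature indices, C, gamma)."""
    bits, c, gamma = chrom[:N_FEAT], chrom[N_FEAT], chrom[N_FEAT + 1]
    selected = [i for i, b in enumerate(bits) if b]
    return selected, c, gamma

random.seed(1)
chrom = random_chromosome()
selected, c, gamma = decode(chrom)
print(len(selected), round(c, 2), round(gamma, 4))
```

The appeal of this encoding is that a single GA individual simultaneously answers "which features?" and "which classifier settings?", so feature selection and parameter tuning are optimized jointly rather than in separate passes.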
SVM learning methods have also shown effectiveness
for the prediction of protein subcellular and subnuclear localizations,
which demands cooperation between informative
features and classifier design. For this purpose, Huang et al.
[103] reported an accurate system for predicting protein
subnuclear localization, named ProLoc, based on an evolutionary
SVM (ESVM) classifier with automatic feature selection
from a large set of physicochemical composition (PCC)
descriptors. An inheritable GA combined with SVM automatically
selected the best number of PCC features using
two data sets, containing 504 proteins localized in six subnuclear
compartments and 370 proteins localized in nine
subnuclear compartments. The features and SVM parameters
were concatenated and encoded in binary chromosomes, which
evolved according to mutation and crossover operators. The
training accuracy of ten-fold cross-validation was used as the
fitness function. ProLoc, with 33 and 28 PCC features, reported
leave-one-out accuracies over 56 and 72% for each data set,
respectively [103]. Both predictors outperformed an SVM model
using k-peptide composition features and an optimized evidence-theoretic
k-nearest neighbor classifier utilizing pseudo
amino acid composition.
The nature of different protein-protein complexes was
analyzed by a computational framework that handles the
preparation, processing, and analysis of protein-protein complexes
with machine learning algorithms [104]. Among
different machine learning algorithms, SVM was applied
in combination with various feature selection techniques
including GA. Physicochemical characteristics of protein-protein
complex interfaces were represented in four different
ways, using two different atomic contact vectors, DrugScore
pair potential vectors, and SFC score descriptor vectors. Two
different data sets were used: one with contacts enforced
by the crystallographic packing environment (crystal contacts)
and biologically functional homodimer complexes, and
another with permanent complexes and transient protein-protein
complexes [104]. The authors implemented a simple
GA with a population size of 30, a crossover rate of 75%,
and a mutation rate of 5%. Two-point crossover and single-bit
mutation were applied, and the population evolved until convergence
was reached, defined as either no further change over 10 generations
or 100% prediction quality. Although SVM did not yield
the highest accuracy, the optimum models obtained by GA
selection reached more than 90% accuracy for both the packing-enforced/functional
and the permanent/transient complexes.
GA also identified the discriminating ability of the three most
relevant features, given in descending order as follows: the
contacts of hydrophobic and/or aromatic atoms located in the
protein-protein interfaces, the purely hydrophobic/hydrophobic
atom contacts, and the polar/hydrophobic atom contacts
[104].
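The genetic operators and stopping rule reported in [104] can be sketched as follows; the population handling and the SVM-based fitness evaluation are omitted, and only the operators and the convergence test are shown.

```python
import random

def two_point_crossover(a, b):
    """Swap the segment between two random cut points
    (applied at a 75% rate per pair in [104])."""
    i, j = sorted(random.sample(range(1, len(a)), 2))
    return a[:i] + b[i:j] + a[j:]

def single_bit_mutation(bits, rate=0.05):
    """Flip one randomly chosen bit with the given probability (5% in [104])."""
    bits = bits[:]
    if random.random() < rate:
        k = random.randrange(len(bits))
        bits[k] = 1 - bits[k]
    return bits

def converged(best_history, window=10):
    """Stopping rule from [104]: 100% prediction quality reached, or the
    best fitness unchanged over the last 10 generations."""
    if best_history and best_history[-1] >= 1.0:
        return True
    return (len(best_history) >= window
            and len(set(best_history[-window:])) == 1)
```

A typical driver would append each generation's best fitness to `best_history` and stop the evolutionary loop as soon as `converged(best_history)` returns True.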
Kernytsky et al. [105] reported a framework that first sets
global sequence features and, second, widely expands the
feature space by generically encoding the coexistence of
residue-based features in proteins. A global protein feature
scheme was generated for function and structure prediction
studies. They proposed a combination of individual features
spanning the feature space from global feature
inputs to features that can capture local evidence such
as the individual residues of a catalytic triad. GA-optimized
ANN and SVM were used to explore the vast feature space
created. Inside the GA, the initial population of solutions was
built as multiple combinations of all the global features,
which also contained the maximal intersection of all the feature
classes, with 360 input features [105]. New offspring
were created by inserting or deleting nodes in the existing
individuals. Nodes were defined as feature classes, or as any
operator on the features that combined two global feature
classes. The mutation probability was set to 0.4 per node per
generation, and the crossover probability was set to 0.2
per solution per generation. After new offspring solutions
were generated via crossover and/or mutation (insertion/deletion)
of the parent solutions, the worst solutions were discarded
to restore the population's original size, ensuring that
the best-performing solutions were not selected out of the next
generation by chance; such schemes tend to converge
faster at the cost of losing diversity more quickly among the
solutions. This contrasts with the typical selection scheme
(roulette wheel selection), in which more-fit solutions have
a higher chance than less-fit solutions of passing to the next
generation but have no guaranteed survival. The area under the
receiver operating characteristic curve was monitored as the
fitness/cost function.
The population size was set to 100 solutions, with 50 potential
offspring created in each generation, and the GA ran for 1,000
generations. GA was critical to effectively manage a feature
space far too large for exhaustive enumeration, and it
allowed the detection of combinations of features that were neither
too general, giving poor performance, nor too specific, leading
to overtraining. This GA variant was successfully applied to
the prediction of protein enzymatic activity [105].
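The elitist survivor scheme described above, in which pooled parents and offspring are truncated back to the original population size, can be sketched as follows; the numeric individuals and identity fitness are placeholders for feature-combination solutions scored by AUC.

```python
def elitist_survival(parents, offspring, fitness, pop_size):
    """Pool parents and offspring, then discard the worst solutions so the
    population returns to its original size. Unlike roulette-wheel
    selection, the best solutions can never be lost by chance."""
    pool = parents + offspring
    pool.sort(key=fitness, reverse=True)   # best fitness (e.g. AUC) first
    return pool[:pop_size]                 # truncate to the original size

# Toy usage: individuals are numbers, fitness is the identity function.
survivors = elitist_survival([3, 1, 2], [5, 0], lambda x: x, 3)
print(survivors)  # the three fittest of the pooled five: [5, 3, 2]
```

This guaranteed survival of the elite is exactly what makes the scheme converge faster while shrinking population diversity more quickly than proportional selection.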
Conclusions
The reviewed articles comprise GA-optimized predictors
implemented to quantitatively or qualitatively describe structure-activity
relationships in data relevant for drug discovery.
BRGNN and GA-SVM are presented and discussed as powerful
data modeling tools arising from the combination of GA
with efficient nonlinear mapping techniques such as BRANN
and SVM. Convoluted relationships can be successfully
modeled, and the relevant explanatory variables identified among
large pools of descriptors. Interestingly, accurate predictors
were achieved from 2D topological representations of ligands
and targets. The approach outperformed other linear and nonlinear
mapping techniques combining different feature selection
methods. BRGNNs showed satisfactory performance,
converging quickly toward the optimal position and avoiding
overfitting to a large extent. Similarly, GA optimization of
SVMs yielded robust and well-generalizing models. However,
considering the complexity of the network architecture and
the weight optimization routines, BRGNN was more suitable for
function approximation of convoluted but low-dimensional
data, whereas GA-SVM performed better in
classification tasks on high-dimensional data. These methodologies
are regarded as useful tools for drug design.
Acknowledgements Julio Caballero acknowledges with thanks the
support received through Programa Bicentenario de Ciencia y
Tecnología, ACT/24.
References
1. Gasteiger J (2006) Chemoinformatics: a new field with a long tradition. Anal Bioanal Chem 384:57-64. doi:10.1007/s00216-005-0065-y
2. Cramer RD, Patterson DE, Bunce JD (1988) Comparative molecular field analysis (CoMFA). 1. Effect of shape on binding of steroids to carrier proteins. J Am Chem Soc 110:5959-5967. doi:10.1021/ja00226a005
3. Klebe G, Abraham U, Mietzner T (1994) Molecular similarity indices in a comparative analysis (CoMSIA) of drug molecules to correlate and predict their biological activity. J Med Chem 37:4130-4146. doi:10.1021/jm00050a010
4. Folkers G, Merz A, Rognan D (1993) CoMFA: scope and limitations. In: Kubinyi H (ed) 3D-QSAR in drug design. Theory, methods and applications. ESCOM Science Publishers BV, Leiden, pp 583-618
5. Hansch C, Kurup A, Garg R, Gao H (2001) Chem-bioinformatics and QSAR: a review of QSAR lacking positive hydrophobic terms. Chem Rev 101:619-672. doi:10.1021/cr0000067
6. Sabljic A (1990) Topological indices and environmental chemistry. In: Karcher W, Devillers J (eds) Practical applications of quantitative structure-activity relationships (QSAR) in environmental chemistry and toxicology. Kluwer, Dordrecht, pp 61-82
7. Karelson M, Lobanov VS, Katritzky AR (1996) Quantum-chemical descriptors in QSAR/QSPR studies. Chem Rev 96:1027-1043. doi:10.1021/cr950202r
8. Livingstone DJ, Manallack DT, Tetko IV (1997) Data modelling with neural networks: advantages and limitations. J Comput Aid Mol Des 11:135-142. doi:10.1023/A:1008074223811
9. Burbidge R, Trotter M, Buxton B, Holden S (2001) Drug design by machine learning: support vector machines for pharmaceutical data analysis. Comput Chem 26:5-14. doi:10.1016/S0097-8485(01)00094-8
10. Caballero J, Fernández M (2006) Linear and non-linear modeling of antifungal activity of some heterocyc