an evolutionary optimizer of libsvm...
TRANSCRIPT
Submitted to Challenges. Pages 1 - 21.OPEN ACCESS
challengesISSN 2078-1547
www.mdpi.com/journal/challenges
Communication
An Evolutionary Optimizer of libsvm ModelsDragos Horvath 1,*, J.B. Brown 2, Gilles Marcou 1 and Alexandre Varnek 1
1 Laboratoire de Chémoinformatique, UMR 7140 CNRS – Univ. Strasbourg, 1, rue Blaise Pascal, 6700Strasbourg, France
2 Kyoto University Graduate School of Pharmaceutical Sciences, Department of Systems Biosciencefor Drug Discovery, 606-8501, Kyoto, Japan
* Author to whom correspondence should be addressed; [email protected]
Version August 7, 2014 submitted to Challenges. Typeset by LATEX using class file mdpi.cls
Abstract: This User Guide describes the rationale behind, and the modus operandi of a Unix1
script-driven package for evolutionary searching of optimal libsvm operational parameters,2
leading to Support Vector Machine models of maximal predictive power and robustness.3
Unlike common libsvm parameterizing engines, the current distributions includes the key4
choice of best-suited sets of attributes/descriptors, next to the classical libsvm operational5
parameters (kernel choice, cost, etc.), allowing a unified search in an enlarged problem6
space. It relies on an aggressive, repeated cross-validation scheme to ensure a rigorous7
assessment of model quality. Primarily designed for chemoinformatics applications, it also8
supports the inclusion of decoy instances – for which the explained property (bioactivity) is,9
strictly speaking, unknown, but presumably "inactive". The package supports four different10
parallel deployment schemes of the Genetic Algorithm-driven exploration of the operational11
parameter space, for workstations and computer clusters.12
Keywords: chemoinformatics; QSAR; machine learning; libsvm parameter optimization13
1. Introduction14
Support Vector Machines (SVM), such as the very popular libsvm toolkit [1], are a method of choice15
for machine learning of complex, non-linear patterns of dependence of an explained variable – here16
termed the ”property” P – and a set (vector) of attributes (descriptors ~D) thought to be determinants17
of the current P value of the instance/object they characterize. The ultimate goal of machine learning18
Version August 7, 2014 submitted to Challenges 2 of 21
is to unveil the relationship between the property of an object and its descriptors: P = P ( ~D), where19
this relationship may ideally represent a causal relationship between the attributes of the objects that20
are responsible for its property, or, at least, a statistically validated covariance between attributes and21
property. Following good scientific practice (Occam’s razor), this relationship must take the simplest22
form that explains so-far available experimental observations. Complexifying the relationship is only23
justified if this leads to a significant increase of its predictive/extrapolation ability. Support Vector24
Machines may describe relationships of arbitrary complexity. However, controlling the underlying25
complexity of an SVM model may be only indirectly achieved, by fine tuning its operational parameters.26
Unfortunately, this is a rather counterintuitive task (one cannot easily ”guess” the optimal order of27
magnitude for some of these parameters, not the mention the actual values), and cannot be addressed by28
systematic scanning of all potentially valid combinations of control parameters. Furthermore, in various29
domains, such as mining for Quantitative (Molecular) Structure – Property Relationships (QSPR), for30
which this package has been primarily designed, there are many (hundreds) of a priori equally appealing,31
alternative choices for the molecular descriptor set to be used as the input ~D. There are many ways in32
which the molecular structure can be encoded under the form of numeric descriptors. While some33
problem-based knowledge may orient the user towards the most likely useful class of descriptors to34
use in order to model a given molecular property P, tens to hundreds of empirical ways to capture the35
needed molecular information in a bit-, integer- or real-number vector remain, and it cannot be known36
beforehand which of these alternative ”Descriptor Spaces” (DS) would allow for the most robust learning37
of P = P ( ~D).38
Moreover, vectors ~D in these various DS may be of widely varying dimension, sparseness and39
magnitude, meaning that some of the SVM parameters (in particular the γ coefficient controlling40
Gaussian or Sigmoid kernel functions) will radically differ by many orders of magnitude with respect41
to the employed DS. Since γ multiplies either a squared Euclidean distance, or a dot product value42
between two vectors ~D and ~d in the considered DS in order to return the argument of an exponential43
function, fitting will not focus on γ per se, but on a preliminary "gamma factor" gamma defined such that44
γ =gamma/ < | ~D − ~d|2 > or, respectively γ =gamma/ < ~D.~d >. Above, the denominators represent45
means, over training set instance pairs, of Euclidean distance and dot product values, as required. In this46
way, γ is certain to adopt a reasonable order of magnitude if the dimensionless gamma parameters is47
allowed to take values within (0,10).48
The goal of the herein presented software distribution is to provide a framework for simultaneous49
”meta”-optimization of relevant operational SVM parameters, including – and foremost – the choice of50
best suited descriptor spaces, out of a list of predefined choices. This assumes simultaneous fitting51
of categorical degrees of freedom (descriptor choice) and ”classical” numerical SVM parameters,52
where the acceptable orders of magnitudes for some of the latter directly depend on the former DS53
choice. Such an optimization challenge – unfeasible by systematic search – clearly requires stochastic,54
nature-inspired heuristics. A simple Genetic Algorithm (GA) has been implemented here. Chromosomes55
are a concatenation (vector) of the currently used operational parameter values. To each chromosome,56
a fitness score is associated, reporting how successful SVM model building was when the respective57
choice of parameters was employed. Fitter setups are allowed to preferentially generate ”offspring”,58
by cross-overs and mutations, and generate new potentially interesting setup configurations. Since this59
Version August 7, 2014 submitted to Challenges 3 of 21
is perfectly able to deal with both categorical and continuous degrees of freedom, additional toggle60
variables were considered: descriptor scaling (if set, initial raw descriptors of arbitrary magnitudes will61
first be linearly rescaled between zero and one) and descriptor pruning (if set, the initial raw descriptor62
space will undergo a dimensionality reduction, by removing one of pairwise correlated descriptor63
columns, respectively).64
Model optimization is so-far supported for both ε Support Vector Regression (SVR) and Support65
Vector Classification (SVC) problems. The definition of SVR and SVC chromosome slightly differs,66
since the former also includes the ε tube width, within which regression errors are ignored. The nature67
of the problem may be indicated by command line parameter.68
The fitness (objective) function (SVM model quality score) to be optimized by picking the69
most judicious parameter combination should accurately reflect the ability of the resulting model to70
extrapolate/predict instances not encountered during the training stage, not its ability to (over)fit training71
data. In this sense, ”classical” leave-1/N -out cross-validation (XV) already implemented in libsvm was72
considered too lenient, because its outcome may be strongly biased by the peculiar order of instances in73
the training set, dictating the subsets of instances that are simultaneously being left out. A M -fold74
repeated leave-1/N -out XV scenario, where each M-fold repeat proceeds on a randomly reordered75
training set item list, has been retained here. As a consequence, M ×N individual SVM models are by76
default built at any given operational parameter choice as defined by the chromosome. However, if any of77
these M×N attempts fails at fitting stage (quality of fit below a user-defined threshold), then the current78
parameterization is immediately discarded as unfit, in order to save computer time. Otherwise, the79
quality score used as fitness function of the GA not only reflects the average cross-validation propensity80
of models at given operational setup, but also accounts for its robustness with respect to training set81
item ordering. At equal average XV propensity over the M trials, models that cross-validate roughly82
equally well irrespective of the order to of training set items are to be preferred over models which83
achieve sometimes very strong, other times quite weak XV success depending on how left-out items84
were grouped together.85
For SVR models, the cross-validated determination coefficientQ2 is used to define fitness, as follows:86
1. For each XV attempt between 1...M , consider the N models generated by permuting the left-out87
1/N of the training set.88
2. For each training instance i, retrieve the predicted property P̂M(i) returned by the one specific89
model among the above-mentioned N which had not included i in its training set.90
3. Calculate the cross-validated Root-Mean-Squared Error RMSEM of stage M , based on the91
deviations of P̂M(i) from the experimental property of instance i. Convert this into the92
determination coefficient Q2M = 1 − RMSE2
M/σ2(P ), by reporting this error to the intrinsic93
variance σ2(P ) of property P within the training set.94
4. Calculate the mean < Q2M > and the standard deviation σ of Q2
M values over the M attempts.95
Then, fitness is defined as < Q2M > −κ× σ(Q2
M), with a user-defined κ (2, by default).96
In SVC, the above procedure is applied to the Balanced Accuracy (BA) of classification, defined as the97
mean, over all classes, of the fractions of correctly predicted members of class C out of the total number98
Version August 7, 2014 submitted to Challenges 4 of 21
of instances in class C. This is a more rigorous criterion than the default fraction of well-classified99
instances reported by libsvm (sum of well-classified within each class/training set size), in particular100
with highly unbalanced sets, as often the case in chemoinformatics.101
Eventually, the set ofM×N individual SVM models may serve as final consensus model for external102
predictions if the fitness score exceeds a user-defined threshold and if, of course, external descriptor sets103
are being provided. Furthermore, if these external sets also feature activity data, external prediction104
challenge quality scores are calculated on-the-fly. However, the latter are not used in the Darwinian105
selection process of the optimal parameter chromosome, but merely generated for further exploitation106
by the user. External ”on-the-fly” validation also allows to eventually delete the M × N temporarily107
generated models, avoiding the potential storage problems that would be quick to emerge if several108
tens of bulky SVM model files were to be kept for each of the thousands of attempted parameterization109
schemes. The user is given the option to eventually explicitly regenerate the model files for one or more110
of the most successful parameterization schemes ”discovered” by the Darwinian process, and use them111
outside of the context of this package.112
Both the expansion of parameter space to include ”meta”-variables such as the chosen DS and the113
pretreatment protocol of descriptors and the aim for a robust model quality estimated based on an114
M-fold more expensive approach than classical XV may render the mining for the optimal model115
quite effort-consuming. Therefore, the current distribution supports both a multi-CPU workstation116
version for ”easy” problems (hundreds of training items within tens of candidate DS), but also parallel117
deployment-ready pilots using Torque, Slurm or LSF (Load Share Facility) on clusters.118
2. Installation and Use119
This distribution includes a set of tcsh scripts, relying on various subscripts (tcsh, awk and perl) and120
executables in order to perform specific tasks. This User guide assumes basic knowledge of Linux – it121
will not go into detailed explanations concerning fundamentals such as environment, paths, etc.122
2.1. Installation and Environment Setup123
2.1.1. Prerequisites124
As a prerequisite, check whether tcsh, awk and perl are already installed on your system. Also,125
the at job scheduler should be enabled, both on the local workstations and on cluster nodes, as needed.126
In principle, our tools should not be tributary to specific versions of these prerequisite standard packages,127
and such dependence was never observed so far. Note, however, that awk (unfortunately) displays a128
language-dependent behavior, and may be coaxed to accept a comma as decimal separator instead of129
the dot, which would produce weird results (as awk always does something, and never complains).130
Therefore, we explicitly recommend to check the language setup of your machine – if in doubt, set your131
LANG environment variable to "en_US". Also, if you plan to machine-learn from DOS/Windows files132
and forgot to convert them to Unix-style, expect chaotic behavior. Install the dos2unix utility now, so133
you won’t have to come back to this later.134
Version August 7, 2014 submitted to Challenges 5 of 21
2.1.2. Installation135
Download the .tar.gz archive and extract files in a dedicated directory. It does not include the136
libsvm package, which you must install separately. Two environment variables must be added to your137
configuration: GACONF, being assigned the absolute path to the dedicated directory, and LIBSVM, set138
to the absolute path of the residence directory of libsvm tools. An example (for tcsh shells) is given in139
the file EnvVars.csh of the distribution, to be edited and added to your .cshrc configuration file. Adapt140
as needed (use export instead of setenv, in your .profile configuration file) if your default shell is other141
than (t)csh.142
Specific executables shipped with this distribution are compiled for x86_64 Linux machines. They143
are all regrouped in the subdirectory $GACONF/x86_64, since it is assumed that your machine type, as144
encoded by the environment variable $MACHTYPE, is "x86_64". If this is not your case, you will have145
to create your own executable subdirectory $GACONF/$MACHTYPE and generate the corresponding146
executables in that directory. To this purpose, source codes are provided for all Fortran utilities. They147
can be straightforwardly recompiled:148
149
foreach f ($GACONF/*.f)150
g77 -O3 -o $GACONF/$MACHTYPE/$f:r $f151
end152
153
– please adapt to your favorite Unix shell syntax. Note – gfortran, f77 or other Fortran154
compilers may do the job as well: those snippets of code are fairly robust, using standard commands.155
However, simScorer, the executable used to calculate mean Euclidean distance and dot product values156
of training set descriptor vectors is a full-blown, strategic virtual screening tool of our laboratory. Its157
FreePascal source code cannot be released. However, we might perhaps be able to generate, upon158
request, the kind of executable you require.159
2.1.3. Deployment-Specific Technical Parameters160
As hinted in the Introduction, this distribution supports a parallel deployment of the fitness scoring of161
operational parameter combinations encoded by the chromosomes. Four scenarios are supported:162
1. local: several CPUs of a workstation are used in parallel, each hosting an independent machine163
learning/XV attempt (the "slave" job) with parameters encoded by the current chromosome , and164
returning, upon completion, the fitness score associated to that chromosome to the master "pilot"165
script overseeing the process. Master (pilot) and slave (evaluators of the parameterization scheme166
quality) scripts run on the same machine – however, since the pilot sleeps for most of the time,167
waiting for slaves to return their results, the number of slave jobs may equal the number of CPU168
cores available. As the number of active slave processes decreases, the pilot will start new ones,169
until the user-defined maximal number of parameterization attempts has been reached.170
2. torque: slaves estimating the quality of a parameterization scheme are being submitted on171
available nodes of a cluster running the torque job scheduler. The pilot runs on the front-end172
Version August 7, 2014 submitted to Challenges 6 of 21
(again, without significantly draining computer power) and schedules a novel slave job as soon173
as the number of executing or waiting slave jobs has dropped below the user-defined limit, until174
the user-defined maximal number of parameterization attempts has been reached. One slave job is175
scheduled to use a single CPU core (meaning that, like in the local version, successive XV runs176
are executed sequentially). Slave jobs are being allotted explicit time lapses (walltime) by the user177
and those failing to complete within allotted time are lost.178
3. slurm: unlike in torque-driven deployment, the pilot script running on the front-end of179
the slurm-operated batch system reserves entire multi-CPU nodes for the slave jobs, not180
individualCPUs. Since slave jobs were designed to run on a single CPU, the slurm pilot does181
not directly manage them, but deploys, on each reserved node, a slave manager mandated to start182
the actual slave jobs on each CPU node, then wait for their completion. When the last of the slaves183
finished and returned results to the front-end pilot, the manager completes as well, and frees the184
node. The front-end pilot observes this, and reschedules another session on a free node, until the185
user-defined maximal number of parameterization attempts is reached. A runtime limit per node186
(walltime) must be specified: when reached, the manager and all so-far incomplete slave jobs will187
be killed.188
4. bsub: this batch system (officially known as LSF, but practically nicknamed bsub by the name189
of its key job submission command) is similar to slurm, in the sense that the front-end pilot is190
managing multi-CPU nodes, not individual cores. However, there is no mandatory time limit191
per node involved. The task of the front-end pilot is to start a user-defined number of "local"192
workstation pilot jobs on as many nodes, keep track of incoming results, potentially reschedule193
local pilots if some happened to terminate prematurely, and eventually stop the local pilots as soon194
as the user-defined maximal number of parameterization attempts has been reached. Local pilots195
will continue feeding new slave jobs to the available CPUs of the node, not freeing the current196
node until terminated by the front-end pilots (or by failure).197
IMPORTANT! Cluster-based strategies may require some customization of the default parameters.198
These may be always modified at command-line startup of the pilot scripts, but permanent modifications199
can be hard-coded by editing the corresponding *.oppars files ("*" meaning torque, slurm, bsub,200
respectively) in $GACONF. The files contain commented set instructions (tcsh-style, do not translate201
to your default shell syntax) and should be self-explanatory. Some adjustments – account and queue202
specifications, when applicable – are mandatory. Moreover, you might want to modify the tmpdir203
variable, which points to the local path on a node-own hard drive, on which temporary data may be204
written. Unless it crashed, the slave job will move relevant results back to your main work directory, and205
delete the rest of temporary data. Do not modify the restart parameter (explained further on).206
2.2. User Guide: Understanding and Using the libsvm Parameter Configurator The libsvm parameter207
configurator is run by invoking, at command line, the pilot script dedicated to the specific deployment208
scheme supported by your hardware:209
Version August 7, 2014 submitted to Challenges 7 of 21
210
$GACONF/pilot_<scheme>.csh data_dir=<directory containing input211
data> workdir=<intended location of results> [option=value...] >&212
logfile.msg213
214
where <scheme> is either of the above-mentioned local, torque, slurm, bsub. The215
script does not attempt to check whether it is actually run on a machine supporting the envisaged216
protocol – calling pilot_slurm.csh on a plain workstation will result in explicit errors, while calling217
pilot_local.csh on the front-end of a slurm cluster will charge this machine with machine learning218
jobs, much to the disarray of other users and the ire of the system manager.219
2.2.1. Preparing the Input Data Directory220
The minimalistic input of a machine learning procedure is a property-descriptor matrix of the training221
set instances. However, one of the main goals of the present procedure is to select the best suited set of222
descriptors, out of a set of predefined possibilities. Therefore, several property-descriptor matrices (one223
per candidate descriptor space) must be provided. As a consequence, it was decided to regroup all the224
required files into a directory, and pass the directory name to the script, which then will detect relevant225
files therein – following naming conventions. An example of input directory, D1-datadir, is provided226
with the distribution. It is good practice to keep a copy of the list of training set items in the data227
directory, even though this will not be explicitly used. In D1-datadir, this is entitled ref.smi_act_info.228
It is a list of SMILES strings (column #1) of 272 ligands for which experimental affinity constants (pKi,229
column #2) with respect to the D1 dopamine receptor were extracted from the ChEMBL database [2],230
forming the training set of a D1 affinity prediction model.231
Several alternative schemes to encode the molecular information under numeric form were232
considered, i.e. these molecules were encoded as position vector in several DS. These DS include233
pharmacophore (PH)- and atom-symbol (SY) labeled sequences (seq) and circular fragment counts234
(tree,aab), as well as fuzzy pharmacophore triplets [3,4]. For each DS, a descriptor file encoding235
one molecule per line (in the order of the reference list) is provided: FPT1.svm, seqPH37.svm,236
seqSY37.svm, aabPH02.svm, treeSY03.svm, treePH03.svm in .svm format. Naming of these237
descriptor files should consist of an appropriate DS label (a name of letters, numbers and underscores –238
no spaces, no dots – unequivocally reminding you how the descriptors were generated), followed by the239
.svm extension. For example, if you have three equally trustworthy/relevant commercial pieces of soft,240
each offering to calculate a different descriptor vector for the molecules, let each generate its descriptors241
and store them as Soft1.svm, Soft2.svm, Soft3.svm in the data directory. Let us generically refer to242
these files as DS.svm, each standing for a descriptor space option. These should be plain Unix text files,243
following the SVM format convention, i.e. representing each instance (molecule) on a line, as:244
245
SomeID 1:Value_Of_Descriptor_Element_1 2:Value_Of_Descriptor_Element_2246
... n:Value_Of_Descriptor_Element_n247
248
Version August 7, 2014 submitted to Challenges 8 of 21
where the values of vector elementDi refer, of course, to the current instance described by the current249
line. The strength of this way to render the vector ~D is that elements equal to zero may be omitted, so250
that in terms of total line lengths the explicit citing of the element number ahead of the value is more251
than compensated for by the omission of numerous zeros in very sparse vectors. If your software does252
not support writing of .svm files, it is always possible to convert a classical tabular file to .svm format,253
using:254
255
awk ’{printf "%s",$1;for (i=2;i<=NF;i++) {if ($i!=0) printf256
"%d:%.3f",i-1,$i};print ""}’ tabular_file_with_first_ID_column.txt257
> DS_from_tabular.svm258
259
where you may tune the output precision of descriptors (here %.3f). Note – there are no title/header260
lines in .svm files. If the tabular file does contain one, make sure to add an NR>1 clause to the above261
awk line.262
By default, .svm files do not contain the modeled property as the first field on the line, as expected263
from a property-descriptor matrix. Since several .svm files, each representing another DS – and264
presumably generated with another piece of soft, that may or may not allow the injection of the modeled265
property parameter into the first column – our software tools allow .svm files with first columns of266
arbitrary content to be imported as such into the data directory. The training set property values need267
to be specified separately, in a file having either the extension .SVMreg – for continuous properties to268
be modeled by regression, or SVMclass, for classification challenges. You may have both types of269
files, but only one file of a given type per data directory (their name is irrelevant, only the extension270
matters). For example, in D1-datadir, the file ref.SVMreg is a single-column file of associated D1271
pKi affinity values (matching the order of molecules in descriptor files). The alternative ref.SVMclass272
represents a two-class classification scheme, distinguishing "actives" of pKi > 7 from "inactives". The273
provided scripts will properly merge property data with corresponding descriptors from .svm files during274
the process of cross-validated model building. Cross-validation is an automated protocol – the user needs275
not to provide a split of the files into training and test sets: just upload the global descriptor files to the276
data directory.277
In addition, our package allows on-the-fly external predictions and validation of the models. This is278
useful in as far as, during the search stage of the optimal parameters, models generated according to a279
current parameter set are evaluated, then discarded, in order not to fill the disks with bulky model files that280
seemed promising at the very beginning of the evolutionary process, but were long since outperformed281
by later generations of parameter choices. Do not panic: the model files corresponding to user-picked282
parameter configurations may alway be recreated and kept for off-line use. However, if descriptor files283
(or property-descriptor matrices) of external test sets are added to the data directory, models of fitness284
exceeding a user-defined threshold will be challenged to predict properties for the external instances285
before deletion. The predictions, of course, will be kept and reported to the user, but never used in286
parameter scheme selection – the latter being strictly piloted by the cross-validated fitness score. Since287
the models may be based on either of the initially provided DS.svm, all the corresponding descriptor288
sets must also be provided for each of the considered external test sets. In order to avoid confusion with289
Version August 7, 2014 submitted to Challenges 9 of 21
training files, external set descriptors must be labeled as ExternalSetName.DS.psvm. Even if only one290
external set is considered, the three-part dot-separated syntax must be followed (therefore, no dots within291
ExternalSetName or DS, please). For example, check the external prediction files in D1-datadir. They292
correspond to an external set of dopamine D5-receptor inhibitors, encoded by the same descriptor types293
chosen for training: D5.FPT1.psvm, D5.seqPH37.psvm, ... Different termination notwithstanding,294
.psvm files are in the same .svm format. Whatever the first column of these files may contain, our295
tool will attempt to establish a correlation between the predicted property and this first column (forcibly296
interpreted as numeric data). If the user actually provided the experimental property values, then external297
validation statistics are automatically generated. Otherwise, if these experimental values are not known298
– this is a genuine prediction exercise – then the user should focus on the returned prediction values.299
Senseless external validation statistics scores may be reported if your first column may be interpreted as300
a number – just ignore. In the sample data provided here, the first column reports pKi affinity constants301
for the D5 receptor, whereas the predicted property will be the D1 affinity. This is thus not an actual302
external validation challenge, but an attempt to generate computed D1 affinities for D5 ligands.303
To wrap up, the input data directory must/may contain the following files:304
1. A list of training items (optional)305
2. A one-column property file with extensions .SVMreg (regression modeling of continuous306
variables) or SVMclass (classification modeling) respectively (compulsory)307
3. Descriptor files, in .svm format, one per considered DS, numerically encoding one training308
instance per line, ordered like in the property file (at least one DS.svm required). The contents309
of their first column is irrelevant, as the property column will be inserted instead.310
4. Optional external prediction files – for each considered external set, all the associated descriptors311
must be provided as ExternalSetName.DS.psvm files. The user has the option of inserting312
experimental properties to be automatically correlated with predictions in the first column of these313
.psvm files314
2.2.2. Adding Decoys – The Decoy Directory315
In chemoinformatics, it is sometimes good practice to increase the diversity of a con-generic training316
set (based on a common scaffold) by adding very different molecules. Otherwise, machine learning may317
fail to see the point that the common scaffold is essential for activity and learn the "antipharmacophore"318
– features that render some of these scaffold-based analogues less active than others. Therefore, when319
confronted to external predictions outside of the scaffold-based families, such models will predict all320
compounds not containing this antipharmacophore (including cosmic vacuum, matching the above321
condition) to be active. It makes sense to teach models that outside the scaffold-based family activity will322
be lost. This may be achieved by adding presumed diverse inactives to the training set. However, there is323
no experimental (in)activity measure for these – empirically, a value smaller than the measured activity324
of the less active genuine training set compound is considered. This assumption may, however, distort325
model quality assessment. Therefore, in this approach we propose an optimal compromise scenario in326
which, if so desired by the user, the training set may be augmented by a number of decoy instances327
Version August 7, 2014 submitted to Challenges 10 of 21
equal to half of the actual training set size. At each XV attempt, after training set reshuffling, a (every328
time different) random subset of decoys from a user-defined decoy repository will be added. The SVM329
model fitting will account for these decoys, labeled as inactives. However, the model quality assessment330
will ignore the decoy compounds, for which no robust experimental property is known: it will focus331
on the actual training compounds only. Therefore, the user should not manually complete the training332
set compounds in the data directory with decoy molecules. An extra directory – to be communicated to333
the pilot script using the decoy_dir=<path of decoy repository> command line option –334
should be set up, containing DS.svm files of an arbitrary large number of decoy compounds. Again, all335
the DS candidates added to the data directory must be represented in the decoy directory. The software336
will therefrom extract random subsets of size comparable to the actual training set, automatically inject337
a low activity value into the first column, and join these to the training instances, creating a momentarily338
expanded training set.339
In case of doubt concerning the utility of decoys in the learning process, the user is encouraged to run340
two simulations in parallel – one employing decoys, the other not (this is the default behavior unless341
decoy_dir=<path to decoy repository> is specified). Unless decoy addition strongly342
downgrades model fitness scores, models employing decoys should be preferred because they have an343
intrinsically larger applicability domain, being confronted with a much larger chemical subspace at the344
fitting stage.345
2.2.3. Command-line Launch of the libsvm Parameter Configurator: the Options and their Meaning346
The following is an overview of the most useful options that can be passed to the pilot script, followed347
by associated explanations of underlying processes, if relevant:348
1. workdir=<intended location of results> is always mandatory, for both new349
simulations (for which the working directory must not already exist, but will be created) and350
simulation restarts, in which the working directory already exists, containing preprocessed input351
files and so-far generated results.352
2. cont=yes must be specified in order to restart a simulation, based on an existing working353
directory – in which data preprocessing is completed. Unless this flag is set (default is new run),354
specifying an existing working directory will result in an error.355
3. data_dir=<directory containing input data> is mandatory for each new356
simulation, for it refers to the repository of basic input information, as described above. Unless357
cont=yes, failure to specify the input repository will result in failure.358
4. decoy_dir=<path of decoy repository> as described in the previous subsection is359
mandatory only for new starts (if cont=yes, preprocessed copies of decoy descriptors are360
assumed to already reside in the working directory).361
5. mode=SVMclass toggles the pilot to run in classification mode. Default is mode=SVMreg, i.e.362
libsvm ε-regression. Note: the extension of the property data file in the input data directory must363
strictly match the mode option value (it is advised not to mistype "SVMclass").364
Version August 7, 2014 submitted to Challenges 11 of 21
6. prune=yes is an option concerning the descriptor preprocessing stage, which will be described365
in this paragraph. This preprocessing is the main job of $GACONF/svmPreProc.pl, operating366
within $GACONF/prepWorkDir.csh, called by the pilot script. It may perform various367
operations on the brute DS.svm. By default, training set descriptors from the input directory368
are scanned for columns of near-constant or constant values. These are discarded if the standard369
deviation of vector element i over all training set instances is lower than 2% of the interval370
maxDi−minDi (near-constant), or ifmaxDi = minDi (constant). Then, training set descriptors371
are Min/Max-scaled (for each descriptor column i, maxDi is mapped to 1.0, while minDi maps372
to 0.). This is exactly what the libsvm-own svm-scale would do – now it is achieved on-the-fly,373
within the larger preprocessing scheme envisage here. Training set descriptors are the ones used374
to fix the reference minimum and maximum, further used to scale all other terms (external set375
.psvm files and/or decoy .svm files). These reference extremes are stored in .pri files in the376
working directory, for further use. In this process, descriptor elements ("columns") are sorted with377
respect to their standard deviations and renumbered. If, furthermore, the prune=yes option has378
been invoked, a (potentially time-consuming) search for pairwise correlated descriptor columns379
(at R2 > 0.7) will be performed, and one member of each concerned pair will be discarded.380
However, it is not a priori granted that either Min/Max scaling or pruning of correlated columns381
will automatically lead to better models. Therefore, scaling and pruning are considered as degrees382
of freedom of the GA – the final model fitness is the one to decide whether scaled, pruned, scaled383
and pruned or plain original descriptor values are the best choice to feed into the machine learning384
process. At preprocessing stage, these four different versions (scaled, pruned, scaled and pruned385
or plain original) are generated, in the working directory, for each of the .svm and .psvm files in386
input or decoy folders. If pruning is chosen, but for a given DS the number of cross-correlated387
terms is very low (less than 15% of columns could be discarded in the process), then pruning is388
tacitly ignored – the two "pruned" versions are deleted because they would likely behave similarly389
to their parent files. For each of the processed descriptor files – now featuring property values in390
the first column, as required by svm-train – the mean values of descriptor vector dot products,391
respectively Euclidean distances are calculated and stored in the working directory (file extensions392
.EDprops). For large training sets (> 500 instances), these means are not taken over all the393
N(N − 1)/2 pairs of descriptor vectors, but are calculated on hand of a random subset of 500394
instances only.395
IMPORTANT! The actual descriptor files used for model building – and expected as input for396
model prediction – are, due to preprocessing, significantly different from the original DS.svm397
generated by yourself in the input folder. Therefore, any attempt to use generated models in a398
stand-alone context, for property prediction, must take into account that any input descriptors must399
first be reformatted in the same way in which an external test .psvm file is being preprocessed.400
Calling the final model on brute descriptor files will only produce noise. All the information401
required for preprocessing is contained in .pri files. Suppose, for example, that Darwinian402
evolution in libsvm parameter space showed that the so-far best modeling strategy is to use the403
pruned and scaled version of DS.svm. In that case, in order to predict properties of external404
instances described in ExternalDS.svm, one would first need to call:405
Version August 7, 2014 submitted to Challenges 12 of 21
406
$GACONF/svmPreProc.pl ExternalDS.svm selfile=DS_pruned.pri407
scale=yes output=ReadyToUseExternalDS.svm408
409
where the required DS_pruned.pri can be found in the working directory. If original (not scaled)410
descriptors are to be used, drop the scale=yes directive. If pruning was not envisaged (or shown411
not to be a good idea, according to the undergone evolutionary process), use the DS.pri selection412
file instead. The output .svm is now ready to used for prediction by the generated models.413
7. wait=yes instructs the pilot script to generate the new working directory and preprocess input414
data, as described above, and then stop rather than launching the GA. The GA simulation may be415
launched later, using the cont=yes option.416
8. maxconfigs=<maximal number of parameter configurations to be417
explored> – by default, it is set to 3000. It defines the amount of effort to be invested418
in this search of the optimal parameter set, and should take into account the number of considered419
descriptor spaces impacting on the total volume of searchable problem space. However, not being420
able to guess the correct maxconfigs values is not a big problem. If the initial estimate seems421
to be too high, no need to wait for the seemingly endless job to complete: one may always trigger422
a clean stop of the procedure by creating a file (even an empty one is fine) named stop_now in the423
working directory. If the estimate was too low, one may always apply for more number crunching424
by restarting the pilot with options workdir=<already active working directory>425
cont=yes maxconfigs=<more than before>426
9. nnodes=<number of "nodes" on which to deploy simultaneous slave427
jobs> is a possibly misleading name for a context-dependent parameter. For the slurm and428
LSF contexts, this refers indeed to the number of multi-CPU machines (nodes) required for429
parallel parameterization quality assessments. Under the torque batch system and on local430
workstations, it actually stands for the number of CPU cores dedicated to this task. Therefore, the431
nnodes default value is context-dependent: 10 for slurm and LSF/bsub, 30 for torque and432
equal to the actual number of available cores on local workstations.433
10. lo=<"leave-out" XV multiplicity> is an integer parameters greater or equal to two,434
encoding the number of folds into which to split the training set for XV. By default, lo=3, meaning435
that 1/3 of the set is iteratively kept out for predictions by a model fitted on the remaining 2/3436
of training items. Higher lo values make for easier XV, as a larger part of data is part of the437
momentary training set, producing a model having "seen" more examples and therefore more438
likely to be able to properly predict the few remaining 1/lo test examples. Furthermore, since lo439
obviously gives the number of model fitting jobs needed per XV cycle, higher values translate to440
proportionally higher CPU efforts. If your data sets are small, so that 2/3 of it would likely not441
contain enough information to support predicting the left-out 1/3, you may increase lo up to 5. Do442
not go past that limit – you are only delusioning yourself by making XV too easy a challenge, and443
consume more electricity, atop of that.444
Version August 7, 2014 submitted to Challenges 13 of 21
11. ntrials=<number of repeated XV attempts, based on randomized445
regrouping of kept and left-out subsets> dictates how many times (12,446
by default) the leave-1/lo-out procedure has to be repeated. It matches the formal parameter M447
used in Introduction in order to define the model fitness score. The larger ntrials, the more448
robust the fitness score (the better the guarantee that this was not some lucky XV accident due to a449
peculiarly favorable regrouping of kept vs. left-out instances.) However, the more time-consuming450
the simulation will be, as the total number of fitted local models during XV equals ntrials ×451
lo. A possible escape from this dilemma of quick vs. rigorous XV is to perform a first run over a452
large number (maxconfigs) of parameterization attempts, but using a low ntrials XV repeat453
rate. Next, select the few tens to hundreds of best-performing parameter configurations visited so454
far, and use them (see option use_chromo below) to rebuild and reestimate the corresponding455
models, now at a high XV repeat rate.456
12. use_chromo=<file of valid parameter configuration chromosomes> is457
an option forcing the scripts to create and assess models at the parameter configurations from458
the input file, rather than using the GA to propose novel parameter setup schemes. These valid459
parameter configuration chromosomes have supposedly emerged during a previous simulation460
– therefore, the use_chromo option implicitly assumes cont=yes, i.e. an existing working461
directory with preprocessed descriptors. As will be detailed below, the GA-driven search of462
valid parameter configurations creates, in the working directory, two result files: done_so_far,463
reporting every so-far assessed parameter combination and the associated model fitness criteria,464
and best_pop, a list of the most diverse setups among the so-far most successful ones.465
There are many more options available, which will not be detailed here because their default values rarely466
need to be tampered with. Geeks are encouraged to check them out, in comment-adorned files ending in467
*pars provided in $GACONFIG. There are common parameters (common.pars), deployment-specific468
parameters in *.oppars files, and model-specific (i.e. regressions-specific vs. classification-specific)469
parameters SVMreg.pars, SVMclass.pars. The latter concern two model quality cutoffs470
1. minlevel – only models better than this, in terms of fitness scores, will be submitted to external471
prediction challenges.472
2. fit_no_go represents the minimal performance at fitting stage, for every local model built473
during the XV process. If (by default) a regression model fitting attempt fails to exceed a fit474
R2 value (or a classification model fails to discriminate at better BA), then hope to see this model475
succeed in terms of XV is low. In order to save time, the current parameterization attempt is476
aborted – the parameter choice is obviously wrong – and time is saved by reporting a fictitious,477
very low fitness score associated to the current parameter set.478
Default choices for regression models refer to correlation coefficients, and are straightforward to479
interpret. However, thresholds for classification problems depend on the number of classes. Therefore,480
defaults in SVMclass.pars correspond not to absolute balanced accuracy levels, but to fractions of481
non-random BA value range (parameter value 1.0 maps to a BA cutoff value of 1, while parameter482
value zero maps to the baseline BA value, equaling the reciprocal of the class number).483
Version August 7, 2014 submitted to Challenges 14 of 21
2.2.4. Defining the Parameter Phase Space484
As already mentioned, the model parameter space to be explored features both categorical and485
continuous variables. The former include DS and libsvm kernel type selection, the latter cover ε (for486
SVMreg, i.e. ε-regression mode only), cost γ and coeff0 values. In general, as well as in this487
case, GAs typically treat continuous variables like a user precision-dependent series of discrete values,488
within a user-defined range. The GA problem space is defined in two mode-dependent setup files in489
$GACONF: SVMreg.rng and SVMclass.rng, respectively. They feature one line for each of the490
considered problem space parameters, except for the possible choices of DS, which are not available491
by default. After the preprocessing stage, when standardized .svm files were created in the working492
directory, the script will make a list of all the available DS options, and write it out as the first line493
of a local copy (within the working directory) of the <mode>.rng file. Then it will concatenate the494
default $GACONF/<mode>.rng to the latter. This local copy is the one piloting the GA simulation,495
and may be adjusted whenever the defaults from $GACONF/<mode>.rng seem inappropriate. To do496
so, first invoke the pilot script with the data repository, a new working directory name, desired mode and497
option wait=yes. This will create the working directory, uploading all files and generating the local498
<mode>.rng file, then stop. Edit this local file, then re-invoke the pilot on the working directory, with499
option cont=yes.500
The syntax of .rng files is straightforward. Column one designs the name of the degree of freedom501
defined on the current line. If this degree of freedom is categorical, the remaining fields on the502
line enumerate the values it may take. For example, line #1 is a list of descriptor spaces (including503
their optinally generated "pruned" versions) found in the working directory. The parameter "scale"504
chooses whether those descriptors should be used in their original form, or after Min/Max scaling505
(in clear, if scale is "orig", then DS.orig.svm will be used instead of DS.scaled.svm). The "kernel"506
parameter picks the kernel type to be used with libsvm. Albeit a numeric code is used to define it507
on the svm-train command line (0- linear, 1 - 3rd order polynomial, 2 - Radial Basis Function, 3-508
Sigmoid), on the .rng file line, the options were prefixed by "k" in order to let the soft handle this509
as a categoric option. The "ignore-remaining" keyword on the kernel line is a directive to the GA510
algorithm to ignore the further options γ and coeff0 if the linear kernel choice k0 is selected. This511
is required for population redundancy control: two chromosomes encoding identical choices for all512
parameters except γ and coeff0 do stand for genuinely different parameterization schemes, leading to513
models of different quality – unless the kernel choice is set to linear, which is not sensitive to γ and514
coeff0 choices, leading to exactly the same modeling outcome. All .rng file lines having a numeric515
entry in column #2 are by default considered to encode continuous (discretized) variables. Such lines516
must obligatorily have 6 fields (besides #-marked comments): variable name, absolute minimal value,517
preferred minimal value, preferred maximum, absolute maximum and, last, output format definition518
implicitly controlling its precision. The distinction between absolute and preferred ranges is made in519
order to allow the user to focus on a narrowest range assumed to contain the targeted optimal value, all520
while leaving an open option for the exploration of larger spaces: upon random initialization or mutation521
of the variable, in 80% of cases the value will be drawn within the "preferred" boundaries, in 20% of522
cases within the "absolute" boundaries. The brute real value is then converted into a discrete option523
Version August 7, 2014 submitted to Challenges 15 of 21
according to the associated format definition in the last column – it is converted to a string representation524
using sprintf("<format representation>", value), thus rounded up to as many decimal525
digits as mentioned in the decimal part of its format specifier. For example, the cost parameter spanning a526
range of width 18, with a precision of 0.1 may globally adopt 180 different values. Modifying the default527
"4.1" format specifier to "4.2" triggers a 10-fold increase of the phase space volume to be explored, in528
order to allow for a finer scan of cost.529
2.2.5. The Genetic Algorithm530
A chromosome will be rendered as a concatenation of the options picked for every parameter, in the531
order listed in the .rng file. They are generated on-the-fly, during the initialization phase of slave jobs,532
and not in a centralized manner. When the pilot submits a slave job, it does not impose the parameter533
configuration to be assessed by the slave, but expects the slave to generate such a configuration, based534
on the past history of explored configurations, and assess its goodness. In other words, this GA is535
asynchronous. Each so-far completed slave job will report the chromosome it has assessed, associated to536
its fitness score, by concatenation to the end of the done_so_far file in the working directory. The pilot537
script, periodically waking up from sleep to check the machine work load, will also verify whether new538
entries were meanwhile added to done_so_far. If so, it will update the population of so-far best, non539
redundant results in the working directory file best_pop, by sorting done_so_far by its fitness score,540
then discarding all chromosomes that are very similar to slightly fitter essays and cutting the list off at541
fitness levels of 70% of the best-so-far encountered fitness score (this "Darwinian selection" is the job of542
awk script $GACONF/chromdiv.awk).543
A new slave job will propose a novel parameter configuration by generating offspring of these "elite"544
chromosomes in best_pop. In order to avoid reassessing an already seen, but not necessarily very545
fit, configuration, it will also read the entire done_so_far file, in order to verify that the intended546
configuration is not already in there. If so, genetic operators are again invoked, until a novel configuration547
emerges. However, neither best_pop nor done_so_far do not yet include configurations that are548
currently under evaluation on remote nodes. The asynchronous character of the GA does not allow549
absolute guarantees that it will be a perfectly self-avoiding search, albeit measures have been taken in550
this sense.551
Upon lecture of best_pop and done_so_far, the parameter selector of the slave job (awk script552
$GACONF/make_childrenX.awk) may randomly decide (with predefined probability) to perform553
a "cross-over" between two randomly picked partners from best_pop, a "mutation" of a randomly554
picked single parent, or a "spontaneous generation" (random initialization, ignoring the "ancestors" in555
best_pop). Of course, if best_pop does not contain at least two individuals, the cross-over option is556
not available, and if best_pop is empty, at the begininng of the simulation, then the mutation option is557
unavailable as well: initially, all parameter configurations will be issued by "spontaneous generation".558
2.2.6. Slave Processes – Startup, Learning Steps, Output Files and their Interpretation559
Slave jobs run in temporary directories on the local file system of the nodes. A temporary directory560
is being assigned an unambiguous attempt ID, appended to its default name "attempt". If the slave561
Version August 7, 2014 submitted to Challenges 16 of 21
job successfully completes, this attempt folder will be, after cleaning of temporary files, copied back562
to the working directory. $GACONF also contains a sample working directory D1-workdir, displaying563
typical results obtained from regression-based learning from D1-datadir – you may browse through it564
in order to get acquainted with output files. The attempt ID associated to every chromosome is also565
reported in the result file done_so_far. Inspect $GACONF/D1-workdir/done_so_far. Every566
line thereof is formally divided by an equal sign in two sub-domains. The former encodes the actual567
chromosome (field #1 stands for chosen DS, #2 for the descriptor scaling choice, ...). Fields at right568
of the "=" report output data: attempt ID, fitting score and, as last field on line, the XV fitness score,569
< Q2 > −2σ(Q2) for regression, < BAXV > −2σ(BAXV ) for classification problems as defined in570
Introduction. The fitting score is calculated, for completeness, as the "mean-minus-two-sigma" of fitted571
correlation coefficients and fitted balanced accuracy. Please do not allow yourself to be confused by572
"fitting" (referring to statistics of the svm-train model fitting process, and concerning the instances573
used to build the model), and Darwinian "fitness" (defined on the basis of XV results). The fitting score574
is never used in the Darwinian selection process – it merely serves to inform the user about the loss of575
accuracy between fitted and predicted/cross-validated property values.576
IMPORTANT! Since every slave job will try to append its result line to done_so_far as soon as577
it has completed calculations, chance may have different jobs on different nodes – each "seeing" the578
working directory on a NFS-mounted partition – attempt to simultaneously write to done_so_far. This579
may occasionally lead to corrupted lines, not matching the description above. Ignore them.580
Yet, before results are copied back to the working directory, the actual work must be performed.581
This begins by generating a chromosome, unless the use_chromo option instructs the job to apply an582
externally imposed setup. After the chromosome is generated and stored in the current attempt folder,583
it first must be interpreted. On one hand, the chromosome contains descriptor selection information.584
The property-descriptor matrix matching the name reported in the first field of the chromosome, and585
more precisely its scaled or non-scaled version, as indicated by the second field, will be the one586
copied to the attempt folder, for further processing. The remaining chromosome elements must be587
translated into the actual libsvm command line options. In this sense, the actual ε value for regression588
is calculated by multiplying the epsilon parameter in the chromosome by the standard deviation of589
the training property values. In this way, ε, representing 10 to 100% of the natural deviation of the590
property, implicitly has the proper order of magnitude and proper units. Likewise, the chromosome591
gamma parameter will be converted to the actual γ value, dividing by the mean vector dot product (for592
kernel choices k1 or k3), or by the mean Euclidean distance (kernel choice k2), respectively. Since593
the cost parameter is tricky to estimate, even in terms of magnitude orders, the chromosome features594
a log-scale cost parameter, to be converted into the actual cost by taking the exponential thereof.595
This, and also the stripping off of the "k" prefix in the kernel choice parameter, are also performed596
at the chromosome interpretation stage, mode-dependently carried out by dedicated "decoder" awk597
scripts chromo2SVMreg.awk and chromo2SVMclass.awk, respectively. At the end, the tool creates,598
within the attempt folder, the explicit option line required to pilot svm-train according to the content599
of the chromosome. This option line is stored in a one-line file svm.pars – check out, for example,600
$GACONF/D1-workdir/attempt.11118/svm.pars, the option set that produced the fittest601
models so-far.602
Version August 7, 2014 submitted to Challenges 17 of 21
A singled-out property-descriptor matrix and a command-line option set for svm-train are the603
necessary and sufficient prerequisites to start XV. For each of the ntrials requested XV attempts, the604
order of lines in the property-descriptor matrix is randomly changes. The reordered matrix is then split605
into the requested lo folds. XV is explicit, and is not using the own facility of svm-train. Iteratively,606
svm-train is used to learn a local model on the basis of all but the left-out fold. If adding of decoys607
is desired, random decoy descriptor lines (of the same type, and having undergone the same scaling608
strategy as training set descriptors) are added to make up 50% of the local learning set size (the lo-1609
folds in use). These will differ at each successive XV attempt, if the pool of decoys out of which they610
are drawn is much larger than the required half of training set size).611
These local models are stored, and used to predict the entire training set, in which, however, the612
identities of local "training" and locally left-out ("test") instances are known. These predictions are613
separately reported into "train" and "test" files. The former report, for each instance, fitted property614
values by local models having used the instance for training. The latter report predicted property values615
by local models not having used it for training.616
Quality of fit is checked on-the-fly, and failure to exceed a predefined fitting quality criterion triggers617
a premature exit of the slave job, with a fictitious low fitness score associated to the current chromosome618
– see the fit_no_go option. In such a case, the attempt subdirectory is deleted, and a new slave job is619
launched instead to the defunct one.620
Results in the attempt subdirectories of D1-workdir correspond to the default 12 times repeated621
leave-1/3-out XV schemes. This means that 12 × 3 = 36 local models are generated. For each D1622
ligand, there are exactly 12 of these local models that were not aware of that structure when they were623
fitted, and 24 others that did benefit from its structure-property information upon building.624
Files final_train.pred.gz in attempt subdirectories (provided as compressed .gz) report, for each625
instance, the 24 "fitted" affinity values returned by models having used it for fitting (it is a 26-column626
file, in which column #1 reports current numbering and #2 contains the experimental affinity627
value). Reciprocally, the 14-column files final_test.pred.gz report the 12 prediction results stemming628
from models not encountering instances at fitting stage. Furthermore, consens_train.pred.gz and629
consens_test.pred.gz are "condensed" versions of the previous, in which the multiple predictions per630
line have been condensed to their mean (column #2) and standard deviations (column #3), column #1631
being the experimental property.632
Eventually, the stat file found in each attempt subdirectory provides detailed statistics about fitting and633
prediction errors in terms of root-mean-squared error, maximal error and determination coefficients. In a634
classification process, correctly classified fractions and balanced accuracies are reported. Lines labeled635
"local_train" compare every fitted value column from final_train, to the experimental data, whilst636
"local_test" represent local model-specific XV results. By extracting all the lines "local_test:r_squared",637
actually representing the XV coefficients of every local model, and calculating the mean and standard638
deviations of these values, one may recalculate the fitness score of this attempt. The stat file lines639
labeled "test" and "train" compare the consensus means as reported in the consens_*.pred files to640
the experiment. Note: the determination coefficient "r_squared" between the mean of predictions641
and experiment tends to exceed the mean of local determination coefficients, reporting individual642
Version August 7, 2014 submitted to Challenges 18 of 21
performances of local models. This is the "consensus" effect, not exploited in fitness score calculations643
that rely on the latter mean of individual coefficients, penalized by twice their standard deviations.644
Last but not least, this model building exercise had included an external prediction set of D5 ligands,645
for which the current models were required to return a prediction of their D1 affinities. External646
prediction is enabled as soon as model fitness exceeds the user-defined threshold minlevel. This647
was the case for the best model corresponding to $GACONF/D1-workdir/attempt.11118/.648
External prediction file names are a concatenation of external set name ("D5"), the locally used DS649
("treePH03_pruned"), the descriptor scaling status ("scaled") and the extension ".pred". As all these650
external items are, by definition, "external" to all the 12×3 = 36 local models (no checking is performed651
in order to detect potential overlaps with the training set), the 38-column external prediction file will652
report the current numbering (column #1), the data reported in the first field of the .psvm files in column653
#2 (here, the experimental D5 pKi values), followed by 36 columns of individual predictions of the654
modeled property (the D1 affinity values), by each of the local models. Since the external prediction set655
.psvm files had been endowed with numeric data in the first field of each line, the present tool assumes656
these to be corresponding experimental property values, to be compared to the predictions. Therefore, it657
will take the consensus of the 36 individual D1 affinity predictions for each item, and compare them658
to the experimental field. The results of external prediction challenge statistics are reported in the659
extval file of the attempt subdirectory. First, the strictest comparison consists in calculating the RMSE660
between experimental and predicted data, and the determination coefficient – these results are labeled661
"Det" in the extval report. However, external prediction exercises may be challenging, and sometimes662
useful even if the direct predicted-experimental error is large. For example, predicted values may not663
quantitatively match experiment, but happen to be all offset by a common value or, in the weakest case,664
nevertheless follow some linear trend – albeit of arbitrary slope and intercept. In order to test either of665
these hypotheses, extval also reports statistics for (a) the predicted vs. experimental regression line at666
fixed slope of 1.0, but with free intercept, labeled "FreeInt", and (b) the unconstrained, optimal linear667
predicted-experimental correlation that may be established, labeled "Corr".668
In this peculiar case, extval results seem unlikely to make any sense, because the tool mistakenly669
takes the property data associated to external compounds (D5 affinities) for experimental D1 affinities,670
to be confronted with their predicted alter-egos. Surprise – even the strict Determination statistics are671
robustly positive on the fact that predicted D1 pKi values quantitatively match experimental D5 pKi672
values. This is due to the very close biological relatedness of the two targets, which do happen to share673
a lot of ligands. Indeed, 25% of the instances of the D5 external set were also reported ligands of D1,674
and, as such, part of the model training set. Yet, 3 out of 4 D5 ligands were genuinely new. This675
notwithstanding, their predicted D1 affinities came close to observed D5 values – a quite meaningful676
result confirming the extrapolative prediction abilities of the herein built model.677
At this point, local SVM models and other temporary files are deleted, the chromosome associated to678
it attempt ID, fitting and fitness scores is being appended to done_so_far, and the attempt subfolder now679
containing only setup information and results is moved from its temporary location on the cluster back680
to the working directory (with exception of the workstation-based implementation, when temporary and681
final attempt subdirectory location are identical). The slave job successfully exits.682
Version August 7, 2014 submitted to Challenges 19 of 21
2.2.7. Reconstruction and Off-package Use of Optimally Parameterized libsvm Models683
As soon as the number of processed attempts reaches maxconfigs, the pilot script stops684
rescheduling new slave jobs. Ongoing slave processes are allowed to complete – therefore, the final685
number of lines in done_so_far may slightly exceed maxconfigs. Before exiting, the pilot will also686
clean the working directories, by deleting all the attempt sub-folders corresponding to the less successful687
parameterization schemes, which did not pass the selection hurdle and are not listed in best_pop. A688
trace of their statistical parameters is saved in a directory called "loosers".689
The winning attempt sub-folders contain all the fitted and cross-validated/predicted property tables690
and statistics reports, but not the battery of local SVM models that served to generate them. These691
were not kept, because they may be quite bulky files which may not be allowed to accumulate in large692
numbers on the disk (at the end of the run, before being able to decide which attempts can be safely693
discarded, there should have been maxconfigs × lo times ntrials of them). However, they – or,694
actually, equivalent (recall that local model files are tributary to the random regrouping of kept vs. left-out695
instances at each XV step) – model files can be rebuilt (and kept) by rerunning the pilot script with696
the use_chromo option pointing to a file with chromosomes encoding the desired parameterization697
scheme(s). If use_chromo points to a multi-line file, the rebuilding process may be parallelized: as the698
slave jobs proceed, new attempt sub-folders – now each containing lo times ntrials libsvm model699
files – will appear in the working directory (but not necessarily in the listing order of use_chromo).700
IMPORTANT! The file of preferred setup chromosomes should only contain the parameter fields,701
but not the chromosome evaluation information. Therefore, you cannot use best_pop per se, or ‘head702
-top‘ of best_pop as use_chromo file, unless you first remove the right-most fields, starting at the703
"equal" separator:704
705
head -top best_pop| sed ’s/=.*//’ > my_favorite_chromosomes.lst706
$GACONF/pilot_<scheme>.csh workdir=<working directory used for GA707
run> use_chromo=my_favorite_chromosomes.lst708
709
Randomness at learning/left-out splitting stages may likely cause the fitness score upon refitting to710
(slightly) differ from the initial one (on the basis of which the setup was considered interesting). If the711
difference is significant, it means that the XV strategy was not robust enough – i.e. was not repeated712
sufficiently often (increase ntrials).713
Local model files can now be used for consensus predictions, and the degree of divergence of714
their prediction for an external instance may serve as prediction confidence score [5]. Remember that715
svm-predict using those model files should be called on preprocessed descriptor files, as outlined in716
the Command-Line Launch subsection.717
Alternatively, you may rebuild your libsvm models manually, using the svm-train parameter718
command line written out in successful attempt sub-folders. In this case, you may tentatively rebuild719
the models on your brute descriptor files, or on svm-scale-preprocessed descriptor files – unless the720
successful recipe requires pruning.721
Version August 7, 2014 submitted to Challenges 20 of 21
3. Conclusions722
This parameter configurator for libsvm addresses several important aspects in SVM model building,723
in view – but not exclusively dedicated to – chemoinformatics applications.724
First, the approach co-opts an important decision making of the model building process: the choice725
of descriptors/attributes, and their best preprocessing strategies, into the optimization procedure. This is726
important, because the employed descriptor space is a primordial determinant of modeling success, and727
determines the optimal operational parameters of libsvm.728
Next, an aggressive, repeated cross-validation scheme, introducing a penalty proportional to the729
fluctuation of cross-validation propensity on the local grouping of learning vs. left-out instances is used730
to select an operational parameter set leading to a model of maximal robustness, not to a model owing its731
apparent quality to a lucky ordering of instances in the training set. This fitness score is in our opinion a732
good estimator of the extrapolative predictive power of the model.733
Furthermore, the approach allows decoy instances to be added in order to expand learned problem734
space zone, without however adding uncertainty to the reported statistics (statistics of decoy-free and735
decoy-based models are directly comparable, because they both focus on the actual training instances736
and their confirmed experimental properties, ignoring "presumed" inactive decoy property values).737
Last but not least, the approach is versatile, covering both regression and classification problems and738
supporting a large variety of parallel deployment schemes. It was, for example, successfully used to739
generate very large (> 9000 instances) dataset-based chemogenomics models [6].740
BA – Balanced Accuracy (of classification, the mean of sensitivity and specificity) DS – Descriptor741
Space, GA – Genetic Algorithm, SVM – Support Vector Machine, SVR – Support Vector742
Regression, SVC – Support Vector Classification, RMSE – Root Mean Squared Error, XV -743
Cross-validation.744
The authors thank the High Performance Computing centers of the Universities of Strasbourg,745
France, and Cluj, Romania, for having hosted a part of the herein reported development work.This746
work was also supported in part by a Grant-in-Aid for Young Scientists from the Japanese Society747
for the Promotion of Science (Kakenhi (B) 25870336).This research was additionally supported748
by the Funding Program for Next Generation World-Leading Researchers as well as the CREST749
program of the Japan Science and Technology Agency.750
References751
1. Chang, C.C.; Lin, C.J. LIBSVM: A library for support vector machines. ACM Transactions on752
Intelligent Systems and Technology 2011, 2, 27:1–27.753
2. Gaulton, A.; Bellis, L.J.; Bento, A.P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.;754
McGlinchey, S.; Michalovich, D.; Al-Lazikani, B.; Overington, J.P. ChEMBL: a large-scale755
bioactivity database for drug discovery. Nucl. Ac. Res. 2011, 40, D1100–D1107.756
Version August 7, 2014 submitted to Challenges 21 of 21
3. Ruggiu, F.; Marcou, G.; Varnek, A.; Horvath, D. Isida Property-labelled Fragment Descriptors.757
Molecular Informatics 2010, 29, 855–868.758
4. Bonachera, F.; Parent, B.; Barbosa, F.; Froloff, N.; Horvath, D. Fuzzy Tricentric Pharmacophore759
Fingerprints. 1 - Topological Fuzzy Pharmacophore Triplets and adapted Molecular Similarity760
Scoring Schemes. J. Chem. Inf. Model. 2006, 46, 2457–2477.761
5. Horvath, D.; Marcou, G.; Varnek, A. Predicting the Predictability: A Unified Approach to the762
Applicability Domain Problem of QSAR Models. J. Chem. Inf. Model. 2009, 49, 1762–1776.763
6. Brown, J.B.; Okuno, Y.; Marcou, G.; Varnek, A.; Horvath, D. Computational chemogenomics:764
Is it more than inductive transfer? J. Comput.-Aided Mol. Des. 2014, 28, 597–618.765
c© August 7, 2014 by the authors; submitted to Challenges for possible open access766
publication under the terms and conditions of the Creative Commons Attribution license767
http://creativecommons.org/licenses/by/3.0/.768