THE EFFECT OF TAUTOMERISM ON THE
PREDICTION OF BIOAVAILABILITY AND
VIRTUAL SCREENING
A study submitted in partial fulfillment
of the requirements for the degree of
Master of Science in Chemoinformatics
at
THE UNIVERSITY OF SHEFFIELD
by
DAVID PARKER
September 2004
Table of Contents
Acknowledgements
Abstract
Common abbreviations
1 Introduction
1.1 High throughput screening in lead compound discovery
1.2 Lipinski’s “rule of five”
1.3 A measure of lipophilicity - log P
1.4 Tautomerism and property prediction
1.4.1 Prototropic Tautomerism
1.4.2 Valence tautomerism
1.5 The impact of tautomerism on drug design
1.6 Tautomerism and molecular docking programs
1.7 Tautomerism and molecular descriptors
1.8 The project domain
1.9 Project outline
2 Methodology
2.1 Introduction
2.2 SMILES notation
2.3 SMARTS notation
2.4 Estimating compound lipophilicity: ELOGP
2.5 Estimating compound aqueous solubility: ESOL
2.6 Estimating acid-base ionisation constants: ACD/pKa
2.7 The SOLSTICE tool set
2.8 Compound data set preparation
2.8.1 Files in .sdf format
2.8.2 SMILES canonicalisation
2.8.3 Leatherface: A tool for transforming chemical structures
2.9 Compound property prediction
2.9.1 ELOGP
2.9.2 ESOL
2.9.3 pKa
2.10 Result collation, indexing and data presentation
2.10.1 DIVA – A spreadsheet for manipulating and displaying chemical information
2.10.2 Post processing of the dataset
2.10.3 Allocating a predicted charge at pH7
2.10.4 Data analysis and presentation
2.10.5 Identifying tautomeric substructures
2.10.6 Other data analysis indicators
2.10.7 Comparison of measured and predicted log P and pKa values
2.10.8 Analysis of prediction failures
2.10.9 CHI data – a source of information about tautomer classes not highlighted by the STT
3 Results and discussion
3.1 About this chapter
3.2 Introducing the datasets
3.3 Comparing the property predictions made for the NDC forms and PRFs of each compound set
3.3.1 ELOGP
3.3.2 ESOL
3.3.3 pKa
3.4 Summarising the differences between the NDC forms and PRFs of the HTS and PM datasets
3.5 Formal charge distributions at pH7
3.5.1 The influence of predicted pKa changes on predicted charge distribution
3.5.2 A comparison of predicted charge distribution at pH7 within pH 2-10 and pH 0-14 limits
3.6 Issues and problems with prediction tools
3.6.1 AlogP and SMILES
3.6.2 Analysis of prediction failures
3.6.2.1 Log P
3.6.2.2 pKa
3.7 Revealing the types of structural changes performed by the STT and the tautomer substructures concerned
3.7.1 Analysing the effect of the STT on each dataset
3.7.2 Categorising the types of structure change performed by the STT
3.7.3 Validating the structural changes performed by the STT
3.8 Comparing measured and predicted property values
3.8.1 Compounds whose structures were not modified by the STT
3.8.1.1 pKa comparisons
3.8.1.1.1 HTS dataset
3.8.1.1.2 PM dataset
3.8.1.2 log P comparisons
3.8.1.2.1 HTS dataset
3.8.1.2.2 PM dataset
3.8.2 The impact of the STT changing tautomers on the outcome of log P and pKa predictions
3.8.2.1 Introduction
3.8.2.2 Defining tautomer type subclasses
3.8.2.3 pKa comparisons
3.8.2.4 Log P comparisons
3.8.2.5 Re-investigating the validity of the structural changes performed by the STT
3.8.2.6 Evaluating the predictions of alternative tautomers
3.8.2.6.1 Substructures 5b and 5d
3.8.2.6.2 Substructures 5e and 12b
3.8.2.6.3 Substructures of type 12a
3.9 A method of investigating tautomer issues not highlighted by the STT
3.9.1 Analysis of CHI data
3.9.2 Comparison of measured and predicted log P and pKa data for “new” tautomeric compounds
4 Evaluating tautomeric misrepresentation in a larger dataset
4.1 Introduction
4.2 Sampling of compounds
4.3 Dataset analysis
5 Conclusions and further work
5.1 Conclusions
5.2 Further work
References
Acknowledgements
I would like to thank Graham Mullier, Eric Clarke, John Delaney and
Val Gillet for their supervisory support and encouragement during the project and for
keeping me supplied with data, ideas and constructive feedback as it progressed.
Thanks also to David Adams for useful discussions.
I also wish to thank in general my colleagues at Syngenta in Jealott’s Hill for
making me feel welcome during my time here and to Nick and Thierry, with whom I
shared one of the cottages.
Finally thanks to my parents for helping with the logistics of getting me to
and from J.H., my friends, and to Dawn for her patience and understanding not just
during this project, but throughout my Masters.
Abstract
This work develops and tests a methodology for assessing the degree of
tautomer misrepresentation in chemical datasets and analyses the effect different
tautomers have on predictions of aqueous solubility (log Sw), lipophilicity (log P),
acid-base ionisation constants (pKa) and charge at pH7.
A structure transformation tool (STT) is used to convert compounds from
their database stored form into one considered to be physiologically relevant at pH7,
allowing the number and type of tautomeric compounds “wrongly drawn” to be
assessed. In the three datasets studied, such compounds are found to represent no more
than 1-2% of the total.
By making comparisons between predicted values and measured data, the
tautomers that give the best descriptions of molecules are assessed and the distinct
patterns found for different classes of tautomer examined and likely explanations
presented.
The effectiveness of the STT itself is tested and a series of tautomeric
compounds it “misses” is identified. The study shows that its structure changing rules
are reasonable, but are sometimes applied too generically to be reliable in every case.
The reasons why the various property prediction tools fail for
individual compounds are also investigated. Particular problems with AlogP’s
inconsistent handling of SMILES and deficiencies in its fragment dictionary are
highlighted.
Common abbreviations
• CHI Chromatographic Hydrophobicity Index
• HTS High Throughput Screen(ing)
• NDC Native Drawing Convention
• PM Pesticide Manual
• PRF Physiologically Relevant Form
• STT Structure Transformation Tool
1 Introduction
1.1 High throughput screening in lead compound discovery
The drive to accelerate the lead compound discovery process, and
thereby reduce the time taken to bring new pharmaceuticals and agrochemicals to
the market, has also been motivated by the desire to make significant savings in the
associated research and development costs. With product development times of 10
years not unusual, there is also a strong financial incentive for a company to
be able to respond to the market demand for a product before its
competitors do.
Undoubtedly mechanisation, miniaturisation and computerisation have made
it easier to screen larger and larger numbers of inputs from combinatorial chemistry
using techniques such as High Throughput Screening (HTS). However, simply
performing more screens does little to optimise the desired physical properties of
those active compounds that become leads (Morris & Bruneau, 2000). In recent
times, considerable efforts have therefore been made in developing techniques that
aid in the design of combinatorial libraries by allowing the physical properties of
screened inputs to be predicted in advance.
The activity of both agrochemicals and pharmaceuticals is dependent on their
ability to bind specifically to a desired target, typically a pocket on the surface of a
protein. Removing in advance any compounds from the screen that are unlikely
to satisfy the conditions required for binding therefore increases the proportion of
likely strong actives actually screened. This in turn is likely to increase the number
of strongly active leads obtained from a screen, and so reduce the risk of high
development costs being directed at leads that are later found to be only weakly
active.
1.2 Lipinski’s “rule of five”
At the forefront of efforts to define the “drug-likeness” of compounds was the
work of Lipinski (Lipinski et al., 1997). His now much cited “rule of five” principle
placed upper limits on four molecular properties, above any of which a molecule is
less likely to be drug-like in permeation. These limits are:
• Molecular weight of 500.
• Log P (octanol / water) of 5.
• Five hydrogen bond donors (either OH or NH).
• Ten hydrogen bond acceptors (N or O atoms).
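The four limits listed above can be expressed as a simple filter. The following is a minimal sketch only: the class and function names are hypothetical, the property values are assumed to have been calculated or measured elsewhere, and a value is counted as a violation only when it lies strictly above its limit.

```python
from dataclasses import dataclass
from typing import List

# Upper limits quoted in the list above (Lipinski et al., 1997).
MAX_MOLECULAR_WEIGHT = 500.0
MAX_LOG_P = 5.0
MAX_H_BOND_DONORS = 5
MAX_H_BOND_ACCEPTORS = 10


@dataclass
class CompoundProperties:
    """Hypothetical container; values are assumed to be supplied by other tools."""
    molecular_weight: float
    log_p: float             # octanol / water
    h_bond_donors: int       # number of OH and NH groups
    h_bond_acceptors: int    # number of N and O atoms


def rule_of_five_violations(props: CompoundProperties) -> List[str]:
    """Return the names of the limits that the compound exceeds."""
    violations = []
    if props.molecular_weight > MAX_MOLECULAR_WEIGHT:
        violations.append("molecular_weight")
    if props.log_p > MAX_LOG_P:
        violations.append("log_p")
    if props.h_bond_donors > MAX_H_BOND_DONORS:
        violations.append("h_bond_donors")
    if props.h_bond_acceptors > MAX_H_BOND_ACCEPTORS:
        violations.append("h_bond_acceptors")
    return violations
```

Returning the list of exceeded limits, rather than a single pass or fail flag, leaves the choice of how many violations to tolerate to the caller.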
With some ground rules set, a number of other research groups developed
more sophisticated models for predicting “drug-likeness”, such as a feed-forward
neural network (Sadowski & Kubinyi, 1998) and a genetic algorithm scoring
scheme (Gillet et al., 1998), both to good effect. The concept of “lead-likeness” and
how lead and drug compound properties differ from each other has also been studied
by Oprea (2000).
The delivery of agrochemicals to crops (typically by spraying) and the
application of pharmaceuticals to patients (typically by ingestion or injection) are
necessarily approached in completely different ways. The agrochemical industry
therefore realised that the typical physicochemical properties of agrochemicals were
likely to differ from those of pharmaceuticals. As a consequence, Briggs quickly
followed Lipinski with his “ground rules of three” (Briggs, 1997) and, for fungicides,
went on to set several alternative physical property limits for agrochemical-like
behaviour (Briggs et al., 2002).
Tice compared the physical properties of a set of active herbicides and
insecticides with the same properties as described by Lipinski’s “rule of five” for
pharmaceuticals (Tice, 2001 & 2002). His main observation was that these classes of
agrochemicals contained significantly fewer hydrogen bond donors than
did the pharmaceuticals. Tice’s observations led him to modify the values laid
down in Lipinski’s rules, specifically to reflect the nature of herbicides and
insecticides.
Clarke and Delaney (2003) recently compared the changes in nine physical
and molecular properties for herbicides, fungicides and insecticides between
identified HTS hit series compounds, lead series compounds and an agrochemical
product series, as well as a random subset of agrochemicals from their employer’s
corporate database. Properties considered included percentage aromaticity, molecular
weight, charge at pH 7 and partition coefficient differences. Herbicides and
fungicides in particular were surprisingly found to readily meet Lipinski’s criteria for
pharmaceutical lead-like compounds. For agrochemical products as a whole, Clarke
and Delaney (2003) summarised:
“…the whole progression from hits to products is dominated by
rising solubility, decreasing basicity and the removal of carbon,
particularly in aromatic systems.”
Underlying the physical property profiles of agrochemicals and
pharmaceuticals are the values, whether measured or calculated, assigned to each.
Though practical measurements of properties such as log P, pKa and solubility
would ideally be made for every compound in a corporate collection at the time it is
added, in practice this is time consuming and expensive, and consequently an unrealistic
expectation.
1.3 A measure of lipophilicity - log P
Log P, the logarithm of the partition coefficient and a measure of a compound’s
lipophilicity, was pioneered by Hansch and co-workers (Hansch et al., 1962). It has
been of particular
interest in the agrochemical industry as it reveals the degree of preference a
compound has for residing in an organic phase (typically n-octanol) over an aqueous
phase. Given the nature of typical agrochemical delivery methods to crops, a
favourable log P is critical in making sure that agrochemicals are capable of crossing
their target species’ cell membrane in order that they can act. Equally important in an
increasingly environmentally-conscious world are the potentially negative
consequences of agrochemicals accidentally leaching into the environment, for
example in rainwater run off into watercourses, and the adverse effects they may
then have on other plants or wildlife.
Various methodologies have been used by researchers in their quest to find an
accurate means of predicting log P. Fragment based methods, such as in the program
CLOGP, first described by Leo (1993), break down a molecule into distinct
substructures chosen from a predetermined set. The pre-calculated log P
contributions of each substructure are then summed across the set generated for the
molecule to give the overall result.
Another common method is atom descriptor based and involves every atom
in a molecule being assigned to one of a series of different atom types, each of which
contributes a weight to the overall log P. The overall value is then obtained as
the linear sum of the set of component weights. The best-known application
based on this method, ALOGP, has seen some refinement to its operation since
inception, the most recent being described by Wildman and Crippen (1999). The
technique AUTOLOGP (Devillers et al., 2000) takes a different approach by
combining a series of different types of descriptor for hydrogen bond donor and
acceptor ability, lipophilicity and molar refractivity. A trained back-propagation
neural network is then used to evaluate the descriptor set values and produce a log P
estimate from them.
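The additive principle shared by the fragment based and atom based methods can be sketched as follows. This is an illustration only and is not the ALOGP or CLOGP implementation: the atom type names and weights are invented placeholders, and the assignment of atoms to types, which is the difficult part in practice, is assumed to have been done already.

```python
from collections import Counter
from typing import Dict, Iterable

# ILLUSTRATIVE PLACEHOLDER WEIGHTS ONLY: these atom types and values are
# invented for this sketch and are not taken from any published scheme.
EXAMPLE_ATOM_WEIGHTS: Dict[str, float] = {
    "C_aliphatic": 0.30,
    "C_aromatic": 0.25,
    "O_hydroxyl": -0.40,
    "N_amine": -0.60,
}


def additive_log_p(atom_types: Iterable[str], weights: Dict[str, float]) -> float:
    """Sum the weight of every atom type occurrence in a molecule.

    Raises KeyError when an atom type has no entry in the dictionary,
    rather than silently ignoring it.
    """
    counts = Counter(atom_types)
    unknown = set(counts) - set(weights)
    if unknown:
        raise KeyError(f"no weight for atom types: {sorted(unknown)}")
    return sum(weights[atom_type] * n for atom_type, n in counts.items())
```

Because the result depends entirely on which types are assigned, two tautomers of the same compound that are typed differently will in general receive different log P estimates.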
With so many types and variants of log P prediction tool available, a need
was identified for a rigorous comparison of their performance against literature log P
data (Draper, 2002 & Clarke et al., 2004). Across six compound classes, including a
random 700 compound sample from the Pesticide Manual (Tomlin, 2000) and five
50-250 compound samples from specific agrochemical classes in the Syngenta
corporate database, six log P predictors were tested:
1. Fragment based CLOGP method – Daylight v4.71.
2. Fragment based CLOGP method – Biobyte v3.14.
3. Atom based ALOGP method - Accelrys Diamond Descriptors.
4. Atom and fragment combined method – ACD Phys Chem Batch v4.76.
5. Solvation descriptors – Sirius Absolv v1.4.
6. Quantum mechanics and neural network derived – Accelrys Diamond
Properties v1.5.
No single predictor was found to routinely outperform the others, with the
different agrochemical classes variously favouring a total of four of the six tools.
Overall, however, Predictor 1 was found to be the best performer and Predictor 6 the
poorest. In order to maximise the predictive power of the combined methods, a
consensus scoring approach was applied to them and a new parameter, ELOGP,
defined. It was calculated for each compound as the average of the log P
values obtained from methods 1, 3, 4 and 5. Analysis showed that ELOGP often
outperformed the individual methods from which it was derived, with significant
improvements in the proportion of log P predictions made within 0.5 units of the
actual measured value. The success of ELOGP has since seen it adapted for use
in HTS applications, the HTS version of ELOGP being the mean log P of just
methods 1, 3 and 4 (Clarke & Delaney, 2003).
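The consensus calculation described above amounts to taking a mean over a chosen subset of predictors. A minimal sketch is given below; the function name and the dictionary keyed by predictor number are assumptions made for illustration, and how the original work treated compounds for which a predictor failed is not stated here, so a missing value simply raises an error.

```python
from statistics import mean
from typing import Dict, Sequence

# Predictor numbers follow the numbered list of six tools given earlier.
FULL_ELOGP_METHODS = (1, 3, 4, 5)
HTS_ELOGP_METHODS = (1, 3, 4)


def elogp(predictions: Dict[int, float],
          methods: Sequence[int] = FULL_ELOGP_METHODS) -> float:
    """Mean of the log P values from the selected predictors.

    A KeyError is raised if any selected predictor has no value
    (an assumption of this sketch, not a documented behaviour).
    """
    return mean(predictions[m] for m in methods)
```

For example, with hypothetical predictions of 2.0, 3.0, 2.5 and 3.5 from methods 1, 3, 4 and 5, the full consensus value is 2.75 and the HTS variant, which omits method 5, is 2.5.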
1.4 Tautomerism and property prediction
Fragment-based property prediction tools can only produce accurate results
if they are given a precise description of each structure. This allows
them to assign sets of specific fragment types to molecules using dictionaries of
fragments. However, many organic compounds have multiple structural isomers that
inter-convert, typically by transfer of a chemical group, and which are in equilibrium
with each other. To add further complication, the position of this equilibrium may
vary, depending upon the immediate physical and chemical environment of the
molecule.
This phenomenon, known as tautomerism, therefore has a potentially very
significant impact on physical and chemical property prediction and on
computer-aided drug design as a whole. As a consequence, it was the subject of a recent review
by Pospisil and co-workers (Pospisil et al., 2003). The concepts underlying
tautomerism in heterocyclic chemistry, however, are well established. For example,
accounts in the field by Heller et al. (1925) date back almost 80 years.
1.4.1 Prototropic Tautomerism
The best-known and most thoroughly studied type of tautomerism is prototropic
tautomerism, manifested in the variable position of attachment of a hydrogen atom in
a molecule. It has been the subject of extensive reviews by Katritzky et al.
(1963, 1976, 2000 & 2001). Of this type, keto-enol tautomerism has been particularly
well studied and reviewed, for example by Whitman in relation to enzymatic
reactions in which it plays a part (Whitman, 1999). In Figure 1.1, for example,
acetone has both keto (left) and enol (right) forms that are in equilibrium with each
other.
[Figure 1.1: structures of the keto (left) and enol (right) forms of acetone]
In simple ketones, the keto form is generally more stable than the enol form
(by ΔG = 11 kcal mol⁻¹ in the above example (Chiang et al., 1989)), resulting in the
Figure 1.1 equilibrium, in practice, lying far over to the left. The favouring of the
keto form was considered by Wheland (1955) to be due to the greater strength of
carbon-oxygen double bonds compared to carbon-carbon double bonds. In contrast,
for aromatic rings, for example phenol and its two cyclohexadienone tautomers
(Figure 1.2), the enol form is most often the one favoured. In this example the higher
free energy of aromatisation (36 kcal mol⁻¹) (Wheland, 1955) overrides the
underlying preference for the keto forms, meaning the enol form dominates instead.
[Figure 1.2: structures of phenol and its two cyclohexadienone tautomers]
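To give a sense of scale to the free energy differences quoted above, a ΔG value can be converted into an equilibrium constant using K = exp(-ΔG / RT). The sketch below is illustrative only: the temperature of 298.15 K is an assumption, as none is stated alongside the quoted values, and the function names are hypothetical.

```python
import math

# Molar gas constant in kcal mol^-1 K^-1.
GAS_CONSTANT_KCAL = 1.987204e-3


def equilibrium_constant(delta_g_kcal: float, temperature_k: float = 298.15) -> float:
    """K = exp(-dG / RT) for a free energy change dG in kcal mol^-1.

    The default temperature is an assumption of this sketch.
    """
    return math.exp(-delta_g_kcal / (GAS_CONSTANT_KCAL * temperature_k))


def minor_form_fraction(delta_g_kcal: float, temperature_k: float = 298.15) -> float:
    """Fraction of molecules present as the less stable of two forms."""
    k = equilibrium_constant(delta_g_kcal, temperature_k)
    return k / (1.0 + k)


# Free energy by which the enol form of acetone lies above the keto form,
# as quoted in the text (Chiang et al., 1989).
ACETONE_ENOL_PENALTY_KCAL = 11.0
```

Under these assumptions, `equilibrium_constant(ACETONE_ENOL_PENALTY_KCAL)` is of the order of 10⁻⁸, which is why the enol form of acetone is effectively absent at equilibrium.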
Particularly in conjugated systems containing more than one heteroatom, the
position of the equilibrium can be far harder to predict. For example,
4-pyridone (Figure 1.3, right) and 4-hydroxypyridine (Figure 1.3, left) exist in an
equilibrium that was studied by Beak et al. (1976). In ethanol solution they were able
to detect only the pyridone form, but in the vapour phase they found the hydroxypyridine
form to dominate. A number of other theoretical studies have been carried out to
examine the tautomer preference of more complex cases such as those of cytosine,
thymine, uracil, 2,6-dithioxanthine and some of their analogues in both the gas and
aqueous phases (Civcir, 2000 & 2001).
[Figure 1.3: structures of 4-hydroxypyridine (left) and 4-pyridone (right)]
In contrast to the pyridine / pyridone system of Figure 1.3, although
1H-azepin-2-one (Figure 1.4, left) and 1-methyl-azepin-3-one (Figure 1.4, right)
contain conjugated π-electron systems, they cannot tautomerise into aromatic ring
structures. As a result, they exist, as drawn, predominantly in their keto forms
(Heinzelmann & Märky, 1973 and MacNab & Monahan, 1990).
[Figure 1.4: structures of 1H-azepin-2-one (left) and 1-methyl-azepin-3-one (right)]
Prototropic tautomerism sometimes results in zwitterionic tautomers, as
in the case of isonicotinic acid (Figure 1.5). The position of the equilibrium in a
solution of dimethyl sulfoxide (DMSO) and water was found by Hallé et al. (1996)
to be very sensitive to the ratio of the solvent’s components. Above 80% DMSO they
found that the position of the equilibrium very strongly favoured the non-zwitterionic
tautomer.
[Figure 1.5: structures of the non-zwitterionic and zwitterionic tautomers of iso-nicotinic acid]
The electronic properties of molecules can also influence the position of
tautomeric equilibria. Katritzky et al. (2001) for example, compared the tautomer
ratios in solution of a series of 1,2-/2,5-dihydropyrimidines (Figure 1.6 and
Table 1.1) in deuterated chloroform and DMSO.
[Figure 1.6: structures of the A and B tautomers of the 1,2-/2,5-dihydropyrimidines (R1 = Ph, SPh; R2 = Ph, OMe, SPh)]
                 Ratio of A form to B form in solvent
R1     R2     CDCl3     DMSO-d6        Reference
Ph     Ph     2:1       A form only    Weis & Vishkautsan (1984) / Weis & van der Plas (1986)
Ph     OMe    1:6       8:1            Weis et al. (1986)
SPh    SPh    1:3       8:1            Weis et al. (1986)
Table 1.1: “Tautomeric Equilibria of 1,2-/2,5-dihydropyrimidines” (Adapted from Katritzky et al. (2001))
In summary, electron donating substituents and non-polar solvents were
found to favour the B tautomers, but increasing the polarity of the solvent
dramatically changed the position of the equilibrium to strongly favour the A
tautomers instead. The effect of these apparently moderate physical changes hints at
the difficulties faced when trying to accurately predict the predominant tautomer in a
given environment.
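The size of the solvent effect in Table 1.1 is easier to see when each A:B ratio is converted into the fraction of the compound present as the A form. The short sketch below does this arithmetic; the function name is hypothetical, and the entry reported as “A form only” is represented here as a ratio of 1:0, which is an assumption made for the purposes of the calculation.

```python
from fractions import Fraction


def fraction_of_a(a_parts: int, b_parts: int) -> Fraction:
    """Fraction of the mixture present as tautomer A for a ratio a_parts:b_parts."""
    return Fraction(a_parts, a_parts + b_parts)


# A:B ratios from Table 1.1, keyed by (R1, R2) and then by solvent.
# "A form only" is entered as (1, 0): an assumption of this sketch.
TABLE_1_1 = {
    ("Ph", "Ph"): {"CDCl3": (2, 1), "DMSO-d6": (1, 0)},
    ("Ph", "OMe"): {"CDCl3": (1, 6), "DMSO-d6": (8, 1)},
    ("SPh", "SPh"): {"CDCl3": (1, 3), "DMSO-d6": (8, 1)},
}

# Fraction of A for every substituent pair and solvent.
FRACTION_A = {
    substituents: {solvent: fraction_of_a(*ratio) for solvent, ratio in by_solvent.items()}
    for substituents, by_solvent in TABLE_1_1.items()
}
```

For the R1 = Ph, R2 = OMe compound, for instance, the A form makes up 1/7 of the mixture in deuterated chloroform but 8/9 of it in DMSO.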
For larger, conjugated ring systems and systems containing more than three
heteroatoms, more than two tautomers are frequently plausible. Tišler (1955) for
example studied compounds in the mercapto-oxo-triazole system shown in
Figure 1.7 and for the R-group instances investigated, found tautomer B to be the
most prevalent.
[Figure 1.7: structures of tautomers A, B and C of the mercapto-oxo-triazole system]
The prototropic tautomerism discussed so far has primarily occurred by
intermolecular proton transfer. However, there are also mechanisms for
intramolecular tautomerisation assisted by hydrogen bonding, such as seen in the
pyrazines and quinazolines shown in Figure 1.8 (Katritzky et al., 1995 & 1997). In
the cases considered, when X = carbon the B tautomers were preferred and when X =
nitrogen the C tautomers were the most prominent.
[Figure 1.8: structures of tautomers A, B and C (X = C, N)]
Sometimes prototropic tautomerism is accompanied by a more substantial
structural change such as a ring opening / closure. Lázlár et al. (1998) for example
studied the tautomeric equilibria of a series of 1-alkyl-substituted
2-arylimidazolidines in deuteriochloroform (Figure 1.9) that undergo reversible,
five-membered ring-opening reactions. The bulkier the R-substituent, the more
favoured the ring-opened tautomer was found to be.
[Figure 1.9: structures of the ring-closed and ring-opened tautomers (R = Me, Et, Pr, iPr)]
The potential influence of the preferred tautomer of a ligand on that ligand’s
binding properties, and therefore on its whole chemistry, can be illustrated by comparing
the X-ray crystal structures of the compounds shown in Figure 1.10. The crown ester
(left) and crown ether (right) are drawn in the forms observed by Bradshaw et al.
(1985 & 1986). With the crown ester’s central cavity containing one more hydrogen
bond donor and one fewer hydrogen bond acceptor than the crown ether, their
chemical environments will present somewhat different prospects to any potential
encapsulated atom or group. Consequently these tautomers are likely to have
different coordination chemistries.
[Figure 1.10: structures of the crown ester (left) and crown ether (right)]
1.4.2 Valence tautomerism
Valence tautomerism occurs without the detachment of a chemical group and
its reattachment elsewhere, and instead primarily involves an electronic
rearrangement within a molecule. An example, which also involves a 6-membered
ring opening / closure, was observed in a series of N-bridged 1,3-thiazolium-4-olates
prepared by Zaleska and co-workers (Zaleska et al., 1996) (Figure 1.11). The
position of the equilibrium was found to depend on the nature of the solvent and the
pH of the solution.
[Figure 1.11: structures of the valence tautomers of the N-bridged 1,3-thiazolium-4-olates (R1 = Me, Ph; R2, R3 = Ph, p-C6H4Me)]
Though not technically tautomers, a number of simple functional groups can
be drawn in different resonance hybrid forms. For example, Figure 1.12 shows the
resonance forms of azide (top) and diazo (bottom) groups (Leach & Gillet, 2003). If
a hybrid is not considered in conjunction with its other form(s), each could be treated
as though it were a completely different species in terms of chemical reactivity. In turn,
this could have a profoundly adverse effect on the physical and chemical properties
predicted for a molecule containing that functional group.
[Figure 1.12: resonance forms of the azide (top) and diazo (bottom) groups]
1.5 The impact of tautomerism on drug design
Despite the amount that is known about tautomerism and the extent to which
it prevails in heterocyclic chemistry, the fundamental impact it may have in limiting
the success of current computational methods for drug design remains little
researched. The shape of a drug-like molecule together with its donor and acceptor
properties and its physicochemical properties are all critical in determining whether it
will bind strongly to a target receptor site and show the required level of activity.
[Figure 1.13: structure of tetracycline]
For the reasons shown above, tautomerism could have a profound effect on a
molecule’s ability to meet these criteria. For example, the complex molecule
tetracycline (Figure 1.13) has a total of 64 potential tautomeric forms and a strong
ability to modify its geometry and bonding structure to suit its chemical environment
(Duarte et al., 1999). Given the influence of factors such as solvent and pH, it
remains hard to predict whether or not a particular active tautomer of interest will be
energetically available in a particular environment. Such concerns were raised by
Pospisil et al. (2003):
“…does a molecule bind preferably in one distinct tautomer? Is the
most stable tautomeric form in aqueous solution also the most stable
form in the active site of the protein? What can be the binding
contribution of a ligand in its excited tautomeric state in contrast to its
‘normal’ tautomeric state, e.g., its low energy configuration?”
Despite these pressing questions, few studies examining the binding modes of
particular tautomers have been carried out to date. One study by Brandstetter et al.
(2001), however, has shown a preference for the enol tautomer in binding between the
8-barbiturate inhibitor RO200-1770 and the active site of the matrix
metalloproteinase MMP-8. The keto tautomer of the barbiturate, in contrast, is the one
that dominates in solution. Similarly, Yan and co-workers (Yan et al., 1998)
calculated the preferred tautomer for binding between pterin and ricin. They found
that, of the four possibilities, the chosen tautomer was the one of lowest
energy in neither aqueous solution nor the gas phase. These two observations show that
normally unfavoured tautomers can still be activated and stabilised given the
right ligand-protein environment, but that trying to accurately anticipate such occasions
for exploitation in lead compound discovery applications is likely to remain a sizable
challenge.
1.6 Tautomerism and molecular docking programs
Chemical compounds are typically stored in databases as discrete, canonical
structures. The tools that currently exist to convert tautomers into their alternative
forms remain only of limited functionality according to Pospisil et al. (2003):
“There are several programs available which are able to create
tautomers, however for only one single compound at a time.”
Additionally, in the estimation of Trepalin et al. (2003), up to 0.5% of
commercial databases for bio-screening applications contain tautomers. Combining
these two observations, it is likely that many commercial and corporate databases
used for HTS will be missing a valuable and sizable amount of tautomeric structural
information about their collection. The extent of the problem led Pospisil et al.
(2003) to suggest:
“…if a database is used for computer-aided lead finding, enriching
one’s database by energetically similar tautomers may significantly
improve the success rates in computer-aided drug design.”
Pospisil et al. (2003) also pointed out that including tautomers in virtual
screening increases the amount of “chemical space” covered by databases and
improves the chances of hits being generated. Of the tautomer generation tools
currently in existence, most can be considered as utilities that provide “pre-
processing” for other screening or docking applications that do not take tautomerism
into consideration themselves. Of these, “ProtoPlex” (Pearlmann et al., 2002), a
similar tool by Sadowski (2002), Pospisil’s “in-house” program “AGENT 2”
(AGENT 2, 2004 & Pospisil, 2002) and Kenny’s “Leatherface” (Kenny, 1999) for
the interconversion of tautomer forms are currently the best known.
1.7 Tautomerism and molecular descriptors
One of the major problems of molecular descriptor prediction is that an
accurate structural representation of a molecule is required. For tautomeric
compounds this is particularly difficult, as single structures of the presumed
dominant tautomer are usually drawn to represent them, while their other tautomers
and the position of the equilibria between them are frequently given little or no
consideration. This means that in a given chemical environment, predictions could at
best be uncertain or at worst, be meaningless. For example, Sayle and Delany (1999)
calculated the log P (CLOGP) values for the paired 4-hydroxypyridine and
4-pyridone tautomers (Figure 1.3). They found them to be markedly different to each
other, at 0.93 and –1.31 respectively. As Pospisil et al. (2003) explained, the success
of the fragment-based CLOGP method depends not only on the nature of the
fragments produced, but also on the inclusion of complete and representative
tautomeric information in the training data:
“The fragment-based method depends on the way fragments are
produced, their number, size, and the training sets. Thus, missed or
incorrectly selected tautomers for the training set lead to wrong
correlations and cause the log P prediction to fail.”
Tautomerism can also affect the perceived similarity of molecules and
therefore inadvertently influence how compounds are clustered. Willett et al. (1998),
for example, found the Tanimoto index of the tautomer pair 4-nitrosophenol (Figure
1.14, left) and [1,4] benzoquinone monooxime (Figure 1.14, right) to be only 0.196,
despite them being treated as no more than different forms of the same compound.
Tautomerism can also affect measures of compound set diversity because of the low
levels of similarity that are sometimes ascribed to pairs of tautomers.
[Figure 1.14: 4-nitrosophenol (left) and [1,4] benzoquinone monooxime (right)]
1.8 The project domain
The development of HTS and computer-aided drug design has opened up
many possibilities for more rapid and more successful lead compound discovery. An
integral part of this process has been the development of applications to predict the
physical and chemical properties of drug-like molecules and their likely activity at a
particular target. The widespread and well-studied chemical phenomenon of
tautomerism in organic chemistry often has a marked effect on the shape, structure
and chemical properties of molecules and preliminary studies have shown that its
impact on the prediction of those properties can also be considerable.
The project will therefore address the interest of Syngenta in finding out the
extent to which tautomeric misrepresentation is a characteristic of the molecules in its
own corporate collection. Using the example compounds found, the influence of
tautomerism on the prediction of a number of their physical and chemical properties
will be investigated.
1.9 Project outline
Tautomerism can manifest itself in many ways and structural forms. The
nature and scale of the problem it causes to the descriptor-based property prediction
methods currently used by Syngenta will first be studied. The Pesticide Manual
(Tomlin, 2000), samples from the chemical database of Syngenta and compounds
from published chemical catalogues will be examined for this purpose.
Leatherface, a Structure Transformation Tool (STT) developed by Kenny
(1999), will be used to convert each compound from its database-stored form into the
form it considers to be the most physiologically likely at pH7. By examining the
changes in structure and property prediction values of compounds due to the STT, a
series of commonly misrepresented tautomer substructures will be gathered for
further investigation.
In particular, the physical properties lipophilicity (log P), aqueous solubility
(log Sw) and acid-base ionization constant (pKa) will be studied. The property
prediction tools ELOGP (v2) (log P), ESOL (v1.1) (solubility) and ACD/pKa (v6.16)
(pKa) will be used to perform the calculations, accessed via the Syngenta SOLSTICE
web browser-based interface. The reasons for the individual prediction tools failing
to give values for individual compounds will also be reviewed.
A comparison of predicted and measured property values for compounds will
help evaluate how well the STT performs and which tautomers give the best
predictions; in other words, how often it produces tautomers with accurately
estimated physical properties. It will also be investigated whether there are
tautomeric compounds within the datasets studied that are drawn in a “wrong”
tautomer that the STT fails to identify. The results of these findings may help suggest
ways the STT’s performance could be improved.
2 Methodology
2.1 Introduction
The physical and molecular properties of compounds of potential
agrochemical and pharmaceutical interest are especially important in relation to their
biological activity. Screening candidate structures before they reach the
synthesis and activity profiling stages of development would mean that, as well as
saving the cost and time of ultimately unnecessary syntheses, those lead
compounds that are identified are more likely to be successful and strongly active.
Properties such as aqueous solubility, acidity / basicity, and lipophilicity are
often critical to ensuring that a compound is quantitatively delivered to its target
active site and binds strongly with it. Developing tools to accurately predict
these properties from a molecule’s structure is therefore of considerable interest.
As discussed in Chapter 1, Section 5, the accuracy of such structure-based
predictions is particularly called into question for tautomeric compounds, whose
structure can take multiple forms and where the equilibria between them are often
either undetermined or dependent on the physicochemical environment in which the
compound is placed. This issue is especially important as compounds are most often stored in
chemical databases as single tautomers and the choice of tautomer is dependent on
the conventions of the individual or organisation that entered it there.
These drawing conventions are therefore likely to vary considerably, both in
rules used and the rigour with which they are applied. So though the conventions
used by Syngenta are very strict, in a large collection of compounds including
examples from other published collections, various other uncertain drawing
conventions are likely to be prevalent as well. The concept of the Native Drawing
Convention (NDC) will therefore be used to refer to the specific structure (e.g.
tautomer) of a compound stored in a particular database, whatever the drawing
conventions applied to it were.
The aim of this chapter is to identify a protocol for assessing the extent of
tautomeric misrepresentation in a given dataset of NDC structured compounds and to
evaluate the likely influence that it has on their prediction of the physical properties -
lipophilicity, solubility, acid-base ionisation constant and charge at pH7. Extensive
use will be made of a Structure Transformation Tool (STT) to help identify
tautomeric compounds considered to be drawn in the “wrong” form, convert them to
a form it considers likely to be the most physiologically-relevant and so hopefully
improve the quality of subsequent property predictions made for them.
The effectiveness and limitations of the predictors themselves will also be
considered by analysing the reasons for individual prediction failures. Finally, the
limitations of the STT will be examined to identify compounds with potential
tautomer misrepresentation issues that were either overlooked or were detected but
left unchanged because the structure was already considered to be in the “right” form.
The main questions that the methodology will aim to address for a given compound dataset are:
• What proportion of NDC compounds in the dataset does the STT
consider to be represented in the “wrong” tautomer form?
• What distinct substructures are identified as being in the “wrong” form
and how does the STT modify them to make them “right”?
• How similar or different are the predictions between the different
tautomers and how do they compare with the measured values, where
available, for each such compound?
• How often does the tautomer output by the STT improve the accuracy
of predictions? Are there other important tautomers that the STT
appears to overlook?
• Of the compounds that are not changed by the STT, are there any where
a tautomerism issue has been completely missed?
• Could the STT’s structure-changing rules be enhanced?
2.2 SMILES notation
The Daylight SMILES notation (Daylight, 2004a) of chemical structures is
now a widely recognised standard. It provides, via a relatively simple set of
conventions, a means of describing a two-dimensional chemical structure as a linear
character string. These codes can then be used by software applications to regenerate
structures for whatever purpose they require. This form of notation will be the one
presented to the various property prediction techniques and the STT used within this
project as well as on occasion within this dissertation. The basic conventions of
SMILES are relatively few:
• Atoms are represented by their upper case alphabetic symbols. Lower
case symbols represent aromatic centres.
• Hydrogens are automatically assumed to be present, e.g. CC represents
ethane, CH3-CH3.
• Ring closures are indicated by matching digits on the atoms at each end of
the “join”, e.g. C1CCCCC1 represents cyclohexane.
• Double bonds are drawn as “=” and triple bonds as “#”.
• Branch points are denoted with brackets, e.g. phenol – c1ccc(O)cc1
A typical SMILES file, bearing the extension .smi, is a simple text file
comprising one line per structure, each line being in the format
“<SMILES string><space><structure ID>”. The preparation of SMILES files from a
given compound set therefore forms the important first step of the methodology.
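As an illustration only, the one-line-per-structure layout described above could be read with a short Python routine such as the following. The function name parse_smi and the structure IDs in the example are hypothetical; they are not part of SOLSTICE or of the Daylight toolkits.

```python
def parse_smi(text):
    """Parse .smi content into a list of (smiles, structure_id) pairs.

    Assumes each non-blank line has the form
    "<SMILES string><space><structure ID>".
    """
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(None, 1)  # split at the first run of whitespace
        if len(parts) != 2:
            raise ValueError(f"line {line_no}: expected '<SMILES> <ID>'")
        smiles, structure_id = parts
        records.append((smiles, structure_id.strip()))
    return records


# Hypothetical file content built from the SMILES examples given above.
example = "CC ethane\nC1CCCCC1 cyclohexane\nc1ccc(O)cc1 phenol\n"
for smiles, structure_id in parse_smi(example):
    print(structure_id, smiles)
```

Rejecting lines that lack an ID is a design choice of this sketch; a real file may need more lenient handling.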
2.3 SMARTS notation
The Daylight SMARTS notation (Daylight, 2004b) is effectively an extension
of the SMILES language that allows more variability to be built into an atom or bond
structure pattern by the use of AND, OR and NOT operators. Therefore in principle,
a single SMARTS may represent any number of specific SMILES that happen to
match a valid instance of its pattern. For example, [!#7&a] represents any atom that
is not a nitrogen and is aromatic (the uppercase primitive N would exclude only
aliphatic nitrogens).
SMARTS targets are therefore useful for defining open-ended substructure
series with highly precise rules concerning the atom, bond and positional variations
that are allowed or disallowed. Multiple SMILES specifications can therefore be
compared against a SMARTS target and each one classified as either being a match
or a non-match and handled accordingly.
2.4 Estimating compound lipophilicity: ELOGP
ELOGP v2 provides an estimate of compound lipophilicity - log P. As
reported in Chapter 1, Section 3, various approaches to log P prediction have been
developed, with the majority based on molecule fragmentation techniques. Following
extensive evaluation work on these various tools, the consensus scoring ELOGP
approach developed by Draper (2002) and then applied by Clarke and Delaney
(2003) and Clarke et al. (2004) based on AlogP v1.5 (Ghose et al., 1988), ACD/logP
v6.16 (ACD, 2004) and CLOGP v4.73 (Daylight, 2004c) was adopted as the
standard log P prediction tool for Syngenta.
2.5 Estimating compound aqueous solubility: ESOL
ESOL v1 is a method for estimating the aqueous solubility at pH7 of a
compound. Its development was fully described by Delaney (2004) and first applied
by Clarke and Delaney (2003). In its SOLSTICE implementation (see Chapter 2,
Section 7) it involves the use of the molecular properties log P (estimated from
ELOGP), molecular weight (MWT), number of rotatable bonds (RB, defined from a
set of SMARTS targets) and aromatic proportion (AP) to derive estimated solubility
Log(Sw) (ESOL log ppm), Equation 2.1:
Log(Sw) = 0.16 – 0.63 ELOGP – 0.0062 MWT + 0.066 RB – 0.74 AP
Equation 2.1
While the MWT, RB and AP components can be derived using absolute rules
for any structure presented, the log P component is dependent on the effectiveness of
the conventions and implementation of ELOGP for its own accuracy.
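To make the dependence on ELOGP explicit, Equation 2.1 can be written as a short Python function. This is a sketch only: the function name and the descriptor values in the example are hypothetical, and only the coefficients are taken from Equation 2.1.

```python
def esol_log_sw(elogp, mwt, rb, ap):
    """Estimated aqueous solubility Log(Sw) from Equation 2.1.

    elogp: estimated log P, mwt: molecular weight,
    rb: number of rotatable bonds, ap: aromatic proportion.
    """
    return 0.16 - 0.63 * elogp - 0.0062 * mwt + 0.066 * rb - 0.74 * ap


# Hypothetical descriptor values, chosen only to show the arithmetic.
log_sw = esol_log_sw(elogp=2.0, mwt=150.0, rb=2, ap=0.5)
print(round(log_sw, 3))
```

Because the ELOGP coefficient is -0.63, a difference of one log P unit between two forms of a compound changes the estimated Log(Sw) by 0.63 units, all other descriptors being equal.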
2.6 Estimating acid-base ionisation constants: ACD/pKa
ACD/pKa v6.16 (ACD, 2004b) was the chosen pKa prediction tool for this
project. The underlying acid dissociation constant, Ka, reflects the relative
concentrations of an ionisable molecule’s associated and dissociated forms at a given
temperature, usually 25°C. ACD/pKa’s outputs, unlike those of ESOL or ELOGP, are not
necessarily single values and can be of either acidic or basic type.
• Acidic dissociation: HA + H2O ⇌ H3O+ + A-
• Basic dissociation: HB+ + H2O ⇌ H3O+ + B
This reflects the fact that molecules can contain multiple ionisation centres
and so multiple dissociations of one type, or the other, or both become feasible. As a
result only the most basic and / or most acidic pKa calculated by ACD/pKa is / are
reported within a user-defined pH range. The range selected for use during this
project was pH 0-14, which is also the maximum range that ACD/pKa allows. The effect
on predictions of charge at pH 7 of using a narrower pH 2-10 range is also investigated
in Chapter 3, Section 5.2.
ACD/pKa also has a tautomer checking utility that was used extensively to
predict whether drawn tautomeric structures were likely to be “major” or “minor” ones.
These results provided useful comparisons with the types of tautomer change
performed by the Structure Transformation Tool (STT) (Chapter 2, Section 8.3) to
determine whether its effect was always a positive one, i.e. whether it always
converted tautomers to a “major” form.
2.7 The SOLSTICE tool set
SOLSTICE v2.18 is a Syngenta in-house suite of structure handling, statistics
generation and file format inter-conversion utilities bundled together and accessed
via an Intranet web browser interface. Amongst its facilities are:
• ELOGP v2 log P octanol prediction (encompassing ACD/logP v6.16,
AlogP v1.5 and ClogP v4.73) (Clarke & Delaney, 2003; Clarke et al.,
2004)
• ESOL v1 aqueous solubility prediction (including ELOGP v2)
(Delaney, 2004)
• pKa prediction using ACD/pKa v6.16 (part of ACD PhysChem v6
(ACD, 2004c))
• SMILES > SDF structure file format inter-conversion (Chapter 2,
Section 8.1)
• SDF > SMILES structure file format inter-conversion (Chapter 2,
Section 8.1)
• Unique structure identification (identifying duplicates and validating
SMILES)
• SMILES canonicalisation (Chapter 2, Section 8.2)
For simplicity, these tools will largely be referred to from here onwards without
their version numbers. SOLSTICE allows dataset files to be uploaded and stored on
its server in a variety of different formats and for batched “jobs” to be processed,
results to be viewed on-screen and output files to be downloaded for further
processing.
Jobs submitted for processing but not yet complete remain queued in the
“background” allowing continued use of SOLSTICE for other tasks. Results of past
jobs can also be stored online, organised into project folders and published so that
other SOLSTICE users can access them. Between acquiring a set of NDC
compounds for study and assessing how the questions in Chapter 2, Section 1 can be
answered, the following stages were followed:
• Compound data set preparation
• Compound property prediction
• Result collation, indexing and presentation
2.8 Compound data set preparation
2.8.1 Files in .sdf format
While structures are often stored as SMILES, another format is “structure
data format” (.sdf), originally developed by MDL Information Systems (MDL,
2003). This format uses a connection table approach to represent structures and is
supported by many current chemical software packages, some of which have their
own parsers that automatically interpret .sdf files into structure diagrams. If a
compound set is provided in such a format the SDF > SMILES conversion routine of
SOLSTICE can be used to generate the required SMILES. A further SOLSTICE
routine is available to perform the reverse conversion if required.
2.8.2 SMILES canonicalisation
A SMILES is a non-unique way of representing a structure. This means that
in general, different but equally valid SMILES strings can represent a given
structure. Therefore in principle any one of them could be used to make physical
property predictions for a given compound and be expected to give the same result.
In practice however it has been discovered that the choice of SMILES variant used
sometimes has a bearing on the value of the prediction made when structures contain
6-membered aromatic rings with one or more nitrogens. In particular, the AlogP
contribution of ELOGP was found to be so-affected, prompting further investigation
into the cause.
To illustrate the issue, there are twelve distinct SMILES representations of
2-ethoxypyridine, depending on how the aromatic ring is “split open” and which
“end” of the molecule is read from first. Table 2.1 shows that two AlogP values
occur with equal frequency and differ by a not insignificant amount (0.47 log P
units). This difference however has a less significant influence on ELOGP since it is
largely averaged-out when the calculated and unaffected ClogP and ACD/logP
values are included.
SMILES             AlogP
c1cnc(OCC)cc1      1.988
c1cc(OCC)ncc1      1.988
n1ccccc1OCC        1.988
c1(OCC)ccccn1      1.523
c1(OCC)ncccc1      1.523
c1cccnc1OCC        1.988
c1c(OCC)nccc1      1.523
c1ccnc(OCC)c1      1.523
n1c(OCC)cccc1      1.523
c1cccc(OCC)n1      1.988
c1nc(OCC)ccc1      1.988
c1ccc(OCC)nc1      1.523
(ACD/logP (all) = 1.855; ClogP (all) = 1.994)
Table 2.1: The differences in AlogP predictions for different SMILES of 2-ethoxypyridine
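The damping of the AlogP discrepancy within a consensus value can be illustrated with the figures from Table 2.1. The sketch below ASSUMES that the consensus is an unweighted mean of the three methods. This is an assumption made purely for illustration and is not a statement of how ELOGP actually combines them.

```python
from statistics import mean

ACD_LOGP = 1.855               # Table 2.1, identical for all SMILES
CLOGP = 1.994                  # Table 2.1, identical for all SMILES
ALOGP_VALUES = (1.988, 1.523)  # the two AlogP values seen in Table 2.1

# ASSUMPTION: consensus = unweighted mean of AlogP, ACD/logP and ClogP.
consensus = [mean((alogp, ACD_LOGP, CLOGP)) for alogp in ALOGP_VALUES]

alogp_spread = ALOGP_VALUES[0] - ALOGP_VALUES[1]
consensus_spread = consensus[0] - consensus[1]
print(f"AlogP spread:     {alogp_spread:.3f}")
print(f"consensus spread: {consensus_spread:.3f}")
```

Under this assumption the spread seen in AlogP is reduced to one third of its size in the consensus value, since only one of the three contributing methods is affected.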
The same effect was found in the predictions of a number of other simple
aromatic, nitrogen-containing structures such as 2-hydroxypyridine,
[2,2’]-bipyridinyl, quinoline and 7,8-dihydro-cinnoline (Figure 2.1).
[Figure 2.1: structures of 2-hydroxypyridine, [2,2’]-bipyridinyl, quinoline and 7,8-dihydro-cinnoline]
In each case, pairs of AlogP values also differing by 0.47 log P units were
predicted, depending on which particular SMILES was used. The consistency of the
difference between predictions suggests that this is a commonly replicated problem
with AlogP’s SMILES parser which sometimes assigns a different set of atom types
to atoms either side of “joins”, depending on which SMILES format is presented.
In order to counteract the effect, it was decided that all compounds would
have their SMILES canonicalised using SOLSTICE’s Unique Structures tool before
any predictions were carried out. This procedure reassigns SMILES using the
accepted Daylight conventions (Weininger et al., 1989 & Daylight, 2004d) and a
parser from Daylight’s SMILES Toolkit v4.8 (Daylight, 2004e). Whilst
canonicalisation cannot be considered a way of improving the accuracy of ELOGP
and ESOL predictions, it does help improve their consistency and comparability by
removing the possibility of the same structure appearing to give different ELOGP
predictions.
2.8.3 Leatherface: A tool for transforming chemical structures
Leatherface is a UNIX command line Structure Transformation Tool (STT)
developed by Kenny (1999) and designed to convert molecules into a form
considered to be most chemically or physiologically relevant at pH7. It does so by
applying structural modification rules to SMILES specifications identified using
SMARTS targets. These alterations usually take the form of:
• Changing the tautomeric form of a compound
• Protonation of anions to remove charge (e.g. carboxylate →
carboxylic acid)
• Deprotonation of cations to remove charge (e.g. triethylammonium →
triethylamine)
• Changing of resonance hybrid to remove charge separation (e.g. nitro
group, Figure 2.2)
[Figure 2.2: the STT converting a charge-separated nitro group to its uncharged representation]
Structures already considered to be in an appropriate form are unaltered by
the STT. The rules applied to a SMILES that matches a SMARTS target state how
atom charges should be changed, bond orders should be changed and where
hydrogens should be added or removed. These rules applied by the STT may be
supplemented at any time by editing the .vb and .smt files it consults each time it is
executed. The .vb (“Vector Binding”) file contains the shortcut SMARTS definitions
of the different target substructures. The .smt (“SMARTS definitions”) file contains
the corresponding structure changing rules for each SMARTS to be applied to
matching SMILES.
In these studies the STT was always provided with a canonicalised NDC
SMILES file as input. The output file of results was also saved as a SMILES file.
After invoking the STT, the following program command sequence was followed,
after which the output file was generated and the STT closed:
Do you require assistance? N
Enter SMILES file: <SMILES file name>
Enter SMARTS definition file: <name of .smt file>
Enter SMILES output file: <chosen .smi file name>
Will a vector binding file be used? Y
Enter vector binding file: <name of .vb file>
Also built into the STT is a canonicalisation routine, which applies Daylight
conventions (Daylight, 2004d) to its SMILES results before they are written to the
output file. However as it was not confirmed which version of the Daylight parser the
STT called, each output file from the STT was also passed through the Unique
Structures utility of SOLSTICE to ensure consistency. The set of SMILES structures
obtained from these steps represents each compound’s considered Physiologically
Relevant Form, referred to from here onwards as its PRF.
Each compound dataset therefore now comprises an NDC and a PRF set of
SMILES structures. Depending on whether the STT has modified a compound’s
structure, its NDC and PRF forms may or may not be identical. Identifying what
number and kind of changes the STT makes to structures forms an important part of
the property prediction analysis of the datasets that follows.
2.9 Compound property prediction
All the necessary predictions of log P (ELOGP), solubility (ESOL) and pKa
(ACD/pKa via ACD PhysChem v6) for both NDC and PRF structure sets are now
acquired using SOLSTICE.
2.9.1 ELOGP
The .csv formatted “ELOGP” output file contained all the predictions made
by the job. It was downloaded for further processing and the following data fields
saved from SOLSTICE:
• Compound reference
• ELOGP value
• Clog P value
• ACD/logP value
• AlogP value
In some circumstances, examined in Chapter 3, Section 6, the individual
prediction methods AlogP and / or ACD/logP that underlie ELOGP failed
to give values for particular structures. This sometimes prevented meaningful
ELOGP prediction comparisons from being made between the NDC and PRF forms
of the same structure or the same form of different structures. This issue is dealt with
in Chapter 2, Section 10.2.
2.9.2 ESOL
The .csv formatted “ESOL Results” summary file contained all the
predictions made by an ESOL job and was downloaded for each job run. The fields
selected for saving from SOLSTICE were:
• Compound reference
• ESOL value
2.9.3 pKa
From each ACD/pKa run conducted, the following data fields for each
compound were downloaded and saved in .csv format from the “PhysChem
Results Table”:
• Compound reference
• pKa 1 (i.e. 1st predicted value)
• pKa 1 flag (i.e. whether 1st predicted value is a most acidic “MA” or
most basic “MB” pKa)
• pKa 2 (i.e. any 2nd predicted value – often blank)
• pKa 2 flag (often blank)
2.10 Result collation, indexing and data presentation
2.10.1 DIVA – A spreadsheet for manipulating and displaying chemical information
DIVA v 2.1 (“Diverse Information Visualization and Analysis”) is a specialist
spreadsheet application developed by Accelrys (2004) for managing and visualising
chemical data and was the primary data-gathering tool for this project. It allows users
to:
• Visualise chemical structures stored in .sdf format as fields in a
spreadsheet.
• Collect data from a variety of different sources together in a single
environment
• Display trends, patterns and relationships in data using graphs, charts
and diagrams.
• Merge compound data sets based on a common index shared by them
– typically a compound reference number
• Produce reports summarising compound set information.
Data gathered about a particular compound set was typically combined in the
following order into a single DIVA (.div) spreadsheet:
1. Import a list of the compound reference numbers for the dataset.
2. Merge measured log P, solubility and pKa data where available.
3. For the NDC followed by the PRF structure sets:
   o Merge sdf structure
   o Merge ELOGP data
   o Merge ESOL data
   o Merge pKa data
4. Finally export the dataset as a .csv file.
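The merging steps above amount to joining several tables on a shared compound reference. A minimal Python sketch of that idea is given below. It does not reproduce how DIVA performs the merge, and the function name, field names, compound references and values are all hypothetical.

```python
def merge_on_reference(base, *others):
    """Merge per-compound records that share a compound reference key.

    Each argument maps compound reference -> {field name: value}. Only
    references present in `base` are kept, mirroring step 1 of the list;
    fields missing from a later table are simply left absent.
    """
    merged = {ref: dict(fields) for ref, fields in base.items()}
    for table in others:
        for ref, fields in table.items():
            if ref in merged:
                merged[ref].update(fields)
    return merged


# Hypothetical references and values, for illustration only.
references = {"CPD-1": {}, "CPD-2": {}}
measured = {"CPD-1": {"measured_logP": 1.2}}
ndc_elogp = {"CPD-1": {"NDC_ELOGP": 1.5}, "CPD-2": {"NDC_ELOGP": 0.3}}
prf_elogp = {"CPD-1": {"PRF_ELOGP": 1.1}, "CPD-2": {"PRF_ELOGP": 0.3}}

dataset = merge_on_reference(references, measured, ndc_elogp, prf_elogp)
for ref, fields in dataset.items():
    print(ref, fields)
```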
The exporting process allowed the data to be read into Microsoft Excel, with the
exception that the fields containing .sdf structures were converted back into
the SMILES from which they were originally derived. Microsoft Excel was used alongside
DIVA because of its more powerful sorting and calculation
capabilities.
2.10.2 Post processing of the dataset
Before meaningful, comparable results could be drawn from a compound
dataset, some indexing and simple calculations needed to be performed using
MS Excel. The indexing took the form of flagging each compound “yes” or “no”
against a series of criteria by adding a number of additional index fields to the
spreadsheet:
1. Does the STT change the structure of the compound? If the NDC and
PRF SMILES were identical then Structure Changed? index = “no”,
otherwise “yes”.
2. Have AlogP, ACD/logP and ClogP predictions all successfully been
made for both NDC forms and PRFs of the compound? The Valid
ELOGP / ESOL for comparison? flag was set to “yes” if this was the
case or “no” if any of these predictions failed. This unfortunately may
remove a small proportion of compounds from any ELOGP
comparisons made, but makes sure that compounds with false
differences in predicted ELOGP because of prediction failure alone are
not confused with compounds where there is a genuine difference.
3. Have pKa predictions been made successfully for both NDC forms and
PRFs of the compound, i.e. do both of the pKa 1 fields contain values?
The Valid pKa for comparison? flag was set to “yes” if this was the
case or “no” if either of these predictions failed.
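The three index fields could equally be derived programmatically. The following Python sketch shows one way of doing so, although the project itself performed this step in MS Excel. The dictionary keys, the use of None to mark a failed prediction and all values in the example are assumptions made for illustration.

```python
def index_flags(ndc, prf):
    """Return the three yes/no index fields for one compound.

    ndc and prf are dicts with the keys 'smiles', 'alogp', 'acdlogp',
    'clogp' and 'pka1'; a failed prediction is represented by None.
    """
    def yes_no(flag):
        return "yes" if flag else "no"

    logp_keys = ("alogp", "acdlogp", "clogp")
    logp_ok = all(form[key] is not None
                  for form in (ndc, prf) for key in logp_keys)
    pka_ok = ndc["pka1"] is not None and prf["pka1"] is not None
    return {
        "Structure Changed?": yes_no(ndc["smiles"] != prf["smiles"]),
        "Valid ELOGP / ESOL for comparison?": yes_no(logp_ok),
        "Valid pKa for comparison?": yes_no(pka_ok),
    }


# Hypothetical records: the ACD/logP prediction for the PRF has failed.
ndc = {"smiles": "Oc1ccccn1", "alogp": 1.0, "acdlogp": 0.9,
       "clogp": 1.1, "pka1": 9.0}
prf = {"smiles": "O=c1cccc[nH]1", "alogp": 0.2, "acdlogp": None,
       "clogp": 0.1, "pka1": 11.0}
print(index_flags(ndc, prf))
```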
Since multiple pKa predictions and measurements were possible for a single
compound, care was taken to make sure that pKas of the same type were being
compared. As the fields pKa 1 and pKa 2 from ACD/pKa SOLSTICE results files
may contain either acidic and / or basic values, a degree of manual inspection and
rearrangement of data was sometimes required before comparisons were made.
Additional care was also taken when, for example, an acidic pKa prediction
had been obtained but there was more than one measured acidic pKa to compare it
to. In this situation, since the predicted pKa value quoted will be the “most acidic”
one, it was primarily related to the most acidic measured pKa, i.e. the one with the
lowest value. The comparison of measured and predicted data is covered more fully
in Chapter 2, Section 10.7.
2.10.3 Allocating a predicted charge at pH7
Also of interest were any changes in predicted charge on each structure
between pairs of tautomers at pH7, this being the pH closest to the conditions
under which most compounds exist in nature. In order to partition the compounds it was assumed that
“most acidic” (MA) pKa 1s of 6 or lower would result in them existing
predominantly in a deprotonated state with a single negative charge. For “most basic”
(MB) pKa 1s of more than 8 it was assumed that compounds would exist
predominantly in a protonated form with a single positive charge. For the remaining
compounds the protonation state was harder to predict, and so they were assumed
to be neutral structures.
On this basis, two “Formal charge at pH7” fields were added to the
datasheet and completed accordingly for the NDC and PRF forms of those
compounds where criterion 3 in Chapter 2, Section 10.2 was indexed “yes”.
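The allocation rule above can be summarised as a small Python function. This is a sketch of the stated thresholds only; the function name is hypothetical and the flag values follow the “MA” / “MB” labels used for pKa 1.

```python
def formal_charge_at_ph7(pka1, flag):
    """Allocate a formal charge from the first predicted pKa and its flag.

    flag is "MA" (most acidic) or "MB" (most basic).
    """
    if flag == "MA" and pka1 <= 6:
        return -1  # most acidic pKa of 6 or lower: assumed deprotonated
    if flag == "MB" and pka1 > 8:
        return 1   # most basic pKa of more than 8: assumed protonated
    return 0       # all remaining cases: assumed neutral


for pka1, flag in [(4.5, "MA"), (6.0, "MA"), (7.0, "MA"),
                   (8.0, "MB"), (9.2, "MB")]:
    print(pka1, flag, formal_charge_at_ph7(pka1, flag))
```

Note that the two thresholds are asymmetric at their boundaries: a most acidic pKa of exactly 6 is treated as charged, whereas a most basic pKa of exactly 8 is treated as neutral.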
2.10.4 Data analysis and presentation
In order to compare the predictions made for the NDC and PRF compound
pairs the following additional calculations were performed and graphs plotted:
• For compounds where criterion 2 in Chapter 2, Section 10.2 was met, the
absolute difference between their NDC and PRF structures’ ELOGP
predictions was calculated.
o The NDC and PRF ELOGP predictions were then plotted against each
other.
o The distribution of PRF ELOGP values was also plotted.
• For compounds where criterion 2 in Chapter 2, Section 10.2 was met, the
absolute difference between their NDC and PRF structures’ ESOL
predictions was calculated.
o The NDC and PRF ESOL predictions were then plotted against each
other.
o The distribution of PRF ESOL values was also plotted.
• For compounds where criterion 3 in Chapter 2, Section 10.2 was met, the
absolute difference between their NDC and PRF structures’ pKa 1 predictions
was calculated.
o The NDC and PRF pKa 1 predictions were then plotted against each
other.
o The distribution of PRF pKa 1 values was also plotted.
A statistical breakdown of the effect of the STT on a dataset allowed a
measure of the tautomer misrepresentation issue to be gauged. This was done by
partitioning each compound into one of four categories in Table 2.2 according to:
• Whether the STT changed its structure in some way.
• Whether the predicted ELOGPs, ESOLs or pKas of its NDC and PRF
forms were different.
                             Changed structure?
                             No      Yes
Physical property     No     a       b
changed value?        Yes    c       d
Table 2.2: Classification of changes caused to compounds by the STT
1. Compounds matching type a were unchanged by the STT and therefore
saw no property prediction change.
2. Compounds matching type b were cases where, for whatever reason, a
structural change did not lead to a change in property prediction value.
3. Any compounds that matched type c could only arise from “bugs” in
a prediction routine, as this would require the same compound
structure to give rise to two different prediction values. It was the
erroneous appearance of compounds in this category, during the analysis
of the ELOGP results of non-canonicalised SMILES, that first revealed
the problem with AlogP discussed in Chapter 2, Section 8.2 and
Chapter 3, Section 6.1.
4. Compounds matching type d were most likely to be those where the
STT had encountered a tautomer misrepresentation issue and modified the
structure, resulting in a change in property prediction.
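The four-way partition of Table 2.2 can be sketched in a few lines of Python. This is an illustrative sketch only, not the tooling used in the study: it assumes that a structural change can be detected by comparing canonical SMILES strings, and the function name, tolerance, compound identifiers and property values are all hypothetical.

```python
from collections import Counter

def classify(ndc_smiles, prf_smiles, ndc_value, prf_value, tol=1e-9):
    """Return 'a', 'b', 'c' or 'd' following the layout of Table 2.2."""
    # Assumption: both SMILES are canonical, so string inequality
    # means the STT changed the structure.
    structure_changed = ndc_smiles != prf_smiles
    value_changed = abs(ndc_value - prf_value) > tol
    if not structure_changed:
        return "c" if value_changed else "a"
    return "d" if value_changed else "b"

# Hypothetical records: (id, NDC SMILES, PRF SMILES, NDC value, PRF value)
records = [
    ("X1", "CCO", "CCO", -0.14, -0.14),                  # nothing changed
    ("X2", "Oc1ccccn1", "O=c1cccc[nH]1", 1.20, 0.15),    # tautomer change
    ("X3", "C[N+](=O)[O-]", "CN(=O)=O", -0.30, -0.30),   # nitro hybrid only
]

counts = Counter(classify(n, p, vn, vp) for _, n, p, vn, vp in records)
print(dict(counts))
```

Any compound landing in category c under such a scheme would signal an inconsistency in the prediction routine rather than a property of the compound.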
2.10.5 Identifying tautomeric substructures
Having identified the d sub-set of compounds (Table 2.2) whose structure and
property predictions had changed due to the STT, it was necessary to identify what
the specific structural changes were, and to categorise them accordingly. Non-
tautomeric changes made, e.g. protonation or deprotonation of heteroatoms to
neutralise charges, could be identified and sidelined at this point.
It was initially decided to allocate each compound to a substructure class
based only on the immediate local region about which the STT had performed its
tautomer transformation. So for example, the simple 2-pyridone framework B
(Figure 2.3) would be considered a general class to represent all the ring systems A,
with its A-groups representing substituents of any nature.
Figure 2.3
[Structure diagrams: example ring systems A and the general 2-pyridone framework B, with A-groups marking substituent positions]
When analysing the prediction data however it was found useful to subdivide
these broad classes into more specific substructural types, by separating those of
different ring system configuration in the tautomeric region of each molecule.
Additionally, each definition of a specific class was extended to the limits of
substituent conjugation where heteroatoms were involved and where a prototropic
tautomer shift involving them was theoretically possible. So in Figure 2.3, each
example of structure A was now considered a separate class, in which the
A-groups, although in principle still representing any group, could no longer
form rings with each other or participate in tautomerism.
2.10.6 Other data analysis indicators
• For compounds of type d (Table 2.2), the distributions of the absolute
differences between the NDC and PRF predictions for each property
showed whether certain difference values occurred more frequently
than others. By analysing which types of structural change the STT had
performed, tautomer transformations common to particular narrow
absolute difference ranges were sometimes identifiable.
• Plots of the NDC structure’s and PRF structure’s predicted charge
distribution at pH7 for the entire compound set indicated the effect that
the STT had on its expected charge distribution. Also examined were
the specific numbers of compounds whose predicted charge changed
due to a change in structure.
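The first indicator above amounts to binning the non-zero absolute differences and then looking for structural changes that recur within a bin. A minimal Python sketch follows; the bin width of 0.1, the transformation labels and all values are assumptions made for illustration, not data from the study.

```python
import math
from collections import Counter

def bin_label(diff, width=0.1):
    """Label the half-open bin [lower, lower + width) containing diff."""
    # The small epsilon guards against floating-point results such as
    # 0.3 / 0.1 evaluating to slightly less than 3.
    lower = math.floor(diff / width + 1e-9) * width
    return f"{lower:.2f}-{lower + width:.2f}"

def transformations_per_bin(records, width=0.1):
    """Count (bin, transformation) pairs for non-zero differences."""
    counts = Counter()
    for _, ndc_value, prf_value, transformation in records:
        diff = abs(ndc_value - prf_value)
        if diff > 0:
            counts[(bin_label(diff, width), transformation)] += 1
    return counts

# Hypothetical records: (id, NDC prediction, PRF prediction, change type)
records = [
    ("X1", 2.10, 1.05, "hydroxy-to-oxo"),
    ("X2", 3.40, 2.38, "hydroxy-to-oxo"),
    ("X3", 0.50, 0.92, "other"),
    ("X4", 1.70, 1.70, "nitro hybrid"),   # zero difference, ignored
]

per_bin = transformations_per_bin(records)
print(per_bin.most_common())
```

A bin dominated by a single transformation label is the kind of pattern the text describes as sometimes identifiable.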
2.10.7 Comparison of measured and predicted log P and pKa values
Log Ps and pKas are among the more common physical property
measurements made for compounds. Given a dataset for which both predicted and
measured data was available, both the accuracy of the predictions and the degree to
which the STT improved them by converting structures to their presumed “right”
form could be gauged.
This was done by calculating the absolute difference between pairs of
predicted and measured values for a compound. The size and sign of the disparity
between these absolute differences for a compound’s NDC and PRF structural forms
provided a measure of which form gave the more accurate prediction. A positive
disparity represented an improvement in prediction accuracy through the use of the
STT, suggesting that the PRF form of the structure was a better representation of the
compound. Negative disparities indicated that the NDC form of the structure gave a
more accurate prediction than did the PRF form, suggesting that the former tautomer
may after all be the more representative form. By tabulating these disparities,
comparisons with other compounds of the same or different sub-classes defined in
Chapter 2, Section 10.5 could then be drawn.
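The disparity described above can be written as a short Python function. This is a sketch under the assumption that the disparity is taken as the NDC absolute error minus the PRF absolute error, which is the ordering that makes a positive value correspond to an improvement from the STT; the function name and the example values are hypothetical.

```python
def prediction_disparity(measured, ndc_predicted, prf_predicted):
    """Positive result: the PRF prediction is closer to the measurement.
    Negative result: the NDC prediction is closer."""
    ndc_error = abs(ndc_predicted - measured)
    prf_error = abs(prf_predicted - measured)
    return ndc_error - prf_error

# Hypothetical example: measured log P 2.0, NDC predicts 3.0, PRF predicts 2.2
improved = prediction_disparity(2.0, 3.0, 2.2)   # about +0.8
worsened = prediction_disparity(2.0, 2.1, 3.5)   # about -1.4
print(improved, worsened)
```

Tabulating this value per compound, grouped by substructure class, gives the comparison described in the text.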
2.10.8 Analysis of prediction failures
The prediction routines AlogP, ACD/logP and ACD/pKa were sometimes
unsuccessful at giving values for individual structures, leading to blank results
appearing in output files. A detailed analysis of the specific compound structures
concerned, together with any error messages they produced during the running of
the prediction job, helped identify the common reasons why failures occurred and
pinpointed the specific structural features that repeatedly gave problems.
This in turn helped suggest ways each prediction tool could be improved, or at least
highlighted its limitations more specifically.
2.10.9 CHI data – a source of information about tautomer classes not highlighted by the STT
CHI (Chromatographic Hydrophobicity Index) is a reversed-phase HPLC
technique that enables a high throughput assessment of lipophilicity to be made
(Valkó et al., 1997 & Kaliszan et al., 1999). A sample of interest is injected into an
aqueous buffer solution at a constant rate and the percentage of organic mobile
phase, usually acetonitrile, steadily increased at a constant gradient. The retention
time at which the sample is equally distributed between the aqueous and organic
phases is used in conjunction with the instrument / column’s calibration curve to
determine its CHI value at that aqueous pH.
CHI values can be used as indicators of log P and acidic or basic pKa when
measured at multiple pHs and can suggest whether a structure exists in different
forms. With respect to tautomerism, analysis of measured CHI data allowed
compounds potentially containing tautomer issues to be highlighted. By examining
these more closely, previously unidentified tautomeric compounds missed by the
STT could be identified. CHI values for compounds examined in this study were all
recorded at pHs 2.5, 7 and 10.
3 Results and discussion
3.1 About this chapter
The work discussed in this Chapter covers the tautomer misrepresentation
issue in relation to compound property predictions in several consecutive themes:
• Examining the property prediction and structural changes to datasets due
to passing them through a Structure Transformation Tool (STT) that
seeks, amongst other things, to correct tautomers drawn in the “wrong”
form.
• Analysis of the problems and specific prediction failures associated with
the particular prediction tools used.
• Classifying the tautomer types identified and assessing the validity of the
structural changes applied to them.
• Comparison of predicted and measured property values to assess the
benefits to property predictions of applying the STT to datasets.
• Investigation of a method to determine whether there are tautomer issues
either ignored or unchanged by the STT.
3.2 Introducing the datasets
The methodology developed in Chapter 2 was largely derived from the
experience gained of working with two test sets of compounds. One was compiled as
a result of research activities at Syngenta in recent years; the other is a published list
of both current and past agrochemical products.
• Compound set 1 comprises 2,616 compounds that have been highlighted
as hits of interest from high throughput screening (HTS) and lead
compounds from a variety of research projects. As such they form part of
the Syngenta compound collection and are likely to provide good
coverage of recent agrochemical-like compound classes. It will be
commonly referred to as the HTS dataset and its compounds have been
given generic reference numbers of the type HTSxxxx (where xxxx =
0001-2616).
• Compound set 2 comprises 1,359 compounds from the Pesticide Manual
(Tomlin, 2000) and contains examples of both current and superseded
products. It will commonly be referred to as the PM dataset and its
compounds have reference numbers of the type PLxxxx (where xxxx are
values in the range 0001-1618).
As moderate sized sets of compounds, they will generate an easily-managed
amount of data but still be big enough for meaningful trends to be extracted from
them to shape the methodology they are being used to develop.
3.3 Comparing the property predictions made for the NDC forms and PRFs of each compound set
To judge the effect that the STT had on each dataset, the differences in the
predictions between their Native Drawing Convention (NDC) forms and
Physiologically Relevant Forms (PRFs) will be judged from the numbers of
compounds whose prediction values changed and on the size of those changes. This
will indicate how serious an issue presenting the “wrong” tautomer to a log P, pKa or
solubility prediction tool is.
3.3.1 ELOGP
Figures 3.1 and 3.2 show the plots of NDC form versus PRF ELOGP
predictions for the HTS and PM datasets respectively. As discussed in Chapter 2,
Section 10.2, these comparisons exclude the small number of compounds where one
or more of the log P prediction methods underpinning ELOGP fail.
Figure 3.1: Comparison of NDC and PRF ELOGP predictions for the HTS dataset
[Plot; x-axis: NDC ELOGP, y-axis: PRF ELOGP; both axes span -5 to 13]
Figure 3.2: Comparison of NDC and PRF ELOGP predictions for the PM dataset
[Plot; x-axis: NDC ELOGP, y-axis: PRF ELOGP; both axes span -5 to 13]
Figure 3.1 shows that the majority of NDC and PRF ELOGP predictions for
the HTS dataset were identical. Only in 69 cases (2.7%) of the 2,520 compared was a
difference observed between them. For the PM dataset in Figure 3.2, 37 compounds
(2.9%) of the 1,295 cases compared gave different predictions. The distributions of the
absolute non-zero ELOGP prediction differences for the HTS and PM datasets are
shown in Figures 3.3 and 3.4 respectively.
Figure 3.3: The distribution of non-zero absolute differences between NDC and
PRF ELOGP predictions for the HTS dataset
[Plot; x-axis: NDC / PRF absolute ELOGP prediction difference (0 to 2.4), y-axis: Compound count (0 to 20)]
Figure 3.4: The distribution of non-zero absolute differences between NDC and
PRF ELOGP predictions for the PM dataset
[Plot; x-axis: NDC / PRF absolute ELOGP prediction difference (0 to 2.4), y-axis: Compound count (0 to 8)]
Figures 3.3 and 3.4 show that the differences in predictions between the NDC
forms and PRFs of the affected compounds were as much as 2.33 log P units and on
average 1.05 and 0.83 log P units for the HTS and PM datasets respectively. Since
there are likely to be relatively few distinct tautomeric substructures, each of which
can be found in multiple molecules, it may be expected that inter-converting specific
examples of the same type would give rise to similar differences in predicted ELOGP
or ESOL value between pairs of tautomers, and hence to an irregular rather than a
smooth distribution.
While both datasets are relatively small, making it difficult to extract detailed
correlations, some standard difference patterns could be observed. The strongest
example occurs in the absolute ELOGP difference “bin” 1.00-1.10 in Figure 3.3,
which also coincides with the highest count of non-zero differences for the HTS
dataset. 14 of these 18 compounds underwent the same tautomerisation (Figure 3.5)
and represent all but 3 of the examples of the type found in that dataset.
Figure 3.5
[Structure diagram: the STT converts the hydroxy (OH) tautomer to the NH / oxo (O) tautomer, with substituents R1, R2 and R3]
The smaller size of the PM dataset prevented similar meaningful patterns
from being extracted from Figure 3.4. The distribution of predicted ELOGP values
for the PRF of each compound in each dataset is shown in Figure 3.6.
Figure 3.6: Distribution of predicted ELOGP values for HTS and PM dataset compounds represented in their PRF
[Plot; x-axis: Predicted PRF ELOGP (-4 to 12), y-axis: Fraction of compound set (0 to 0.3); series: HTS set, PM set]
The near-normal distributions highlight the lower mean (3.19) and higher
standard deviation (1.80) of the PM ELOGP predictions compared to the HTS
predictions (3.61 and 1.48 respectively), but also show that the profiles of
predictions made for the PRFs of structures in both datasets are broadly similar, with
at least a handful of predicted values being found in every region of the common
ELOGP range for agrochemicals.
3.3.2 ESOL
Figures 3.7 and 3.8 show the plots of NDC form versus PRF ESOL
predictions for the HTS and PM datasets respectively. The compounds excluded
directly correspond with the sets omitted from the ELOGP prediction comparisons.
Figure 3.7: Comparison of NDC and PRF ESOL predictions for the HTS dataset
[Plot; x-axis: NDC ESOL, y-axis: PRF ESOL; both axes span -5 to 7]
Figure 3.8: Comparison of NDC and PRF ESOL predictions for the PM dataset
[Plot; x-axis: NDC ESOL, y-axis: PRF ESOL; both axes span -5 to 7]
For compounds where the ESOL predictions change between the NDC form
and PRF, the magnitude of the change is similar to that seen for ELOGP. Since ESOL
predictions are largely dependent on ELOGP predictions, there is a strong
negative correlation between them (R = -0.90 to -0.95 for both datasets and both
sets of NDC and PRF predictions). The distributions of the absolute non-zero ESOL
prediction differences for the HTS and PM datasets are shown in Figures 3.9 and
3.10 respectively.
Figure 3.9: The distribution of non-zero absolute differences between NDC and PRF ESOL predictions for the HTS dataset
[Plot; x-axis: NDC / PRF absolute ESOL prediction difference (0 to 2), y-axis: Compound count (0 to 25)]
Figure 3.10: The distribution of non-zero absolute differences between NDC and PRF ESOL predictions for the PM dataset
[Plot; x-axis: NDC / PRF absolute ESOL prediction difference (0 to 2), y-axis: Compound count (0 to 15)]
Figures 3.9 and 3.10 show that the differences in predictions between the NDC
forms and PRFs of the affected compounds were as much as 1.65 ESOL log units
and on average 0.75 and 0.58 ESOL log units for the HTS and PM datasets
respectively.
Due to the noted dependency of ESOL predictions on ELOGP, the
distributions of non-zero absolute differences in Figures 3.9 and 3.10
closely match those of Figures 3.3 and 3.4. The maximum compound count in Figure
3.9 for the “bin” range 0.70-0.80 ESOL units can therefore largely be attributed to
examples of the same single type of tautomer change that was highlighted from
ELOGP data in Figure 3.3 and shown in Figure 3.5. The smaller size of the PM
dataset prevents similar meaningful patterns from being extracted from Figure 3.10.
The distribution of predicted ESOL values for the PRF of each compound in each
dataset is shown in Figure 3.11.
Figure 3.11: Distribution of predicted ESOL values for HTS and PM dataset compounds represented in their PRF
[Plot; x-axis: Predicted PRF ESOL (-4 to 7), y-axis: Fraction of compound set (0 to 0.35); series: HTS set, PM set]
In contrast to the ELOGP distributions of Figure 3.6, the distributions in
Figure 3.11 highlight the higher mean (1.46) and higher standard deviation (1.58) of
the PM ESOL predictions compared to the HTS predictions (1.25 and 0.98
respectively). The profiles of ESOL predictions for each set however are still largely
comparable, with at least a handful of predicted ESOL values being found in every
region of the common aqueous solubility range for agrochemicals.
3.3.3 pKa
Figures 3.12 and 3.13 show the plots of NDC form versus PRF pKa
predictions for the HTS and PM datasets respectively. As discussed in Chapter 2,
Section 10.2, these comparisons necessarily exclude compounds for which a pKa
prediction was not obtained for the NDC form and / or PRF, resulting in only 1,997
(76%) and 635 (47%) of possible comparisons being made for the HTS and PM
datasets respectively.
Figure 3.12: Comparison of NDC and PRF pKa predictions for the HTS dataset
[Plot; x-axis: NDC pKa, y-axis: PRF pKa; both axes span 0 to 14]
Figure 3.13: Comparison of NDC and PRF pKa predictions for the PM dataset
[Plot; x-axis: NDC pKa, y-axis: PRF pKa; both axes span 0 to 14]
Figures 3.12 and 3.13 show that non-zero pKa prediction differences between
compounds’ NDC forms and PRFs are far fewer, and typically smaller, for the PM
dataset than for the HTS dataset. In 52 cases (2.6% of valid
comparisons) in the HTS dataset in Figure 3.12 a change in pKa prediction is
observed. For the PM dataset in Figure 3.13, 5 compounds (0.8% of valid
comparisons) similarly had different predictions for their NDC forms and PRFs. The
Figures also show that the difference in predictions between the NDC forms and
PRFs of the affected compounds can be as much as 7.47 pKa units and on average
2.35 and 1.15 pKa units for the HTS and PM datasets respectively.
Consequently, when structure misrepresentation, tautomeric or otherwise,
occurs, the effect on predictions may be considerable. It is also important to note that
large differences between pKa predictions for different forms of the same structure
are more likely to be due to the accidental mismatching of two different pKas,
between which no meaningful comparison can realistically be drawn. The
distributions of the absolute non-zero pKa prediction differences for the HTS and
PM datasets are shown together in Figure 3.14.
Figure 3.14: The distribution of non-zero absolute differences between NDC and PRF pKa predictions for the HTS and PM datasets
[Plot; x-axis: NDC / PRF absolute pKa prediction difference (0 to 7), y-axis: Compound count (0 to 10); series: PM set, HTS set]
Seven of the 10 compounds that make up the maximum compound count for
the HTS dataset in Figure 3.14, corresponding to the “bin” range 2.00-2.40 pKa
units, can once again be attributed to the tautomer change
highlighted in Figure 3.5. The distribution of pKa value predictions for the PRFs of
the compounds in both datasets is shown in Figure 3.15.
Figure 3.15: Distribution of predicted pKa values for HTS and PM dataset compounds represented in their PRF
[Plot; x-axis: Predicted PRF pKa (0 to 14), y-axis: Fraction of compound set (0 to 0.14); series: HTS set, PM set]
The distributions shown in Figure 3.15 are clearly not of a classical statistical
type, but show that the pKa predictions made cover almost all of ACD/pKa’s
maximum 0-14 range. Of the two datasets the HTS set has the more evenly
distributed range of pKa prediction values.
3.4 Summarising the differences between the NDC forms and PRFs of the HTS and PM datasets
The full effect that the STT had on each dataset was assessed by
determining whether or not each compound’s structure was modified by it and whether
or not a change in its predicted ELOGP, ESOL or pKa was observed. Table 3.1
provides a summary of the outcomes for both datasets.
                             HTS dataset                    PM dataset
                             Changed structure?             Changed structure?
                             (NDC → PRF)                    (NDC → PRF)
Property   Changed value?    No             Yes             No             Yes
ELOGP      No                2310 (91.7)    141 (5.6)       1167 (90.1)    91 (7.0)
           Yes               0 (0)          69 (2.7)        0 (0)          37 (2.9)
ESOL       No                2310 (91.7)    141 (5.6)       1167 (90.1)    91 (7.0)
           Yes               0 (0)          69 (2.7)        0 (0)          37 (2.9)
pKa        No                1821 (91.2)    124 (6.2)       599 (94.3)     31 (4.9)
           Yes               0 (0)          52 (2.6)        0 (0)          5 (0.8)
(Each entry gives the actual number of compounds, followed in brackets by the corresponding percentage of total comparisons made for that property)
Table 3.1: Classification of changes caused to the HTS and PM dataset compounds by the STT.
For the majority of compounds in each dataset (at least 90%) the STT makes
no modification to their structure and consequently no change in predicted properties
results. As canonical SMILES were used as input, there are no instances in either
dataset of compounds retaining the same structure but appearing to change ELOGP,
ESOL or pKa prediction.
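The percentages in Table 3.1 follow directly from the category counts. As a small worked check, the Python sketch below recomputes them for the HTS ELOGP comparison; the counts are taken from Table 3.1, while the function name and the use of the letters a to d from Table 2.2 as dictionary keys are illustrative choices.

```python
def category_percentages(counts):
    """Express each category count as a percentage of all comparisons."""
    total = sum(counts.values())
    return {key: round(100 * n / total, 1) for key, n in counts.items()}

# HTS dataset, ELOGP rows of Table 3.1, keyed by the Table 2.2 categories:
# a = nothing changed, b = structure only, c = value only, d = both changed
hts_elogp = {"a": 2310, "b": 141, "c": 0, "d": 69}

total_compared = sum(hts_elogp.values())
percentages = category_percentages(hts_elogp)
print(total_compared, percentages)
```

The total of 2,520 matches the number of ELOGP comparisons quoted for the HTS dataset in Section 3.3.1.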
Inspecting the 141 compounds in the HTS dataset and the 91 compounds in
the PM dataset, where a change in structure did not lead to a change in ELOGP or
ESOL prediction, revealed that the only alteration in each case was to the hybrid
form of nitro-groups (see Figure 2.2). It therefore appears that the ELOGP and ESOL
prediction routines have correctly identified and treated both nitro group hybrids as
one-and-the-same entity in these instances.
Compounds whose pKa values remain unchanged despite a structural change
can, in all but 6 of the 124 cases found in the HTS dataset and 3 of the 31 cases in the
PM dataset, similarly be attributed to a nitro group. Of the remaining compounds,
four differ only in whether a carboxyl group is protonated or not (HTS1707, HTS1663,
HTS1715 and HTS1716), one (HTS2608) differs only in the hybrid form of a nitroso
group and four (HTS2070, PL0083 (6-Isopentenylaminopurine), PL1003 (Kinetin)
and PL1612 (Zeatin), Figure 3.16) have undergone a tautomeric change, but the two
tautomers coincidentally have the same predicted pKa.
Figure 3.16
[Structure diagrams: the STT tautomer transformations of Kinetin (PL1003), Zeatin (PL1612) and 6-isopentenylaminopurine (PL0083)]
The remaining compounds are those where a change in structure has led to a
change in prediction for at least one of the three properties. These compounds are
therefore those most likely to have had a tautomer change carried out on them by
the STT. The nature of these compounds will be discussed in Chapter 3, Section 7.
3.5 Formal charge distributions at pH7
3.5.1 The influence of predicted pKa changes on predicted charge distribution
A formally neutral compound may actually exist in a charged state in aqueous
solution at pH7, depending on its pKa. In principle, different tautomers may have
sufficiently different predicted pKas that their predicted formal charge at pH7 could
change. This could result in their aqueous behaviours being very dissimilar to each
other. Using the protocol laid out in Chapter 2, Section 10.3, every compound with a
predicted pKa value in both datasets could therefore be assigned a formal charge
prediction for both its NDC form and PRF. This was initially carried out using a
pH range of 0-14 for ACD/pKa and led to the following distributions for the HTS
dataset (Figure 3.17):
Figure 3.17: Predicted charge distributions at pH7 for the HTS dataset in its NDC forms and PRFs using a pH range of 0-14
Passing the HTS dataset through the STT resulted in only small
changes in the predicted charge distribution for the dataset. Emphasising the
similarity of the distributions, 132 of the 144 positively charged NDC structures are
also positively charged in their PRF. 1688 of the 1693 neutral NDC structures are
also neutral in their PRF. Finally, 158 of the 160 negatively charged NDC structures
are also negatively charged in their PRF.
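Comparisons of this kind amount to a cross-tabulation of each compound's predicted charge in its NDC form against that in its PRF. The Python sketch below illustrates the bookkeeping only; it takes already-assigned charges as input (the assignment protocol of Chapter 2, Section 10.3 is not reproduced), and the function names and the five example pairs are hypothetical.

```python
from collections import Counter

def charge_transitions(charge_pairs):
    """Count each (NDC charge, PRF charge) combination."""
    return Counter(charge_pairs)

def retained_fraction(transitions, charge):
    """Fraction of structures with this NDC charge that keep it as PRF."""
    total = sum(n for (ndc, _), n in transitions.items() if ndc == charge)
    if total == 0:
        return None  # no NDC structure carried this charge
    return transitions.get((charge, charge), 0) / total

# Hypothetical (NDC charge, PRF charge) pairs for five compounds
pairs = [(+1, +1), (+1, 0), (0, 0), (0, 0), (-1, -1)]
transitions = charge_transitions(pairs)

for charge in (+1, 0, -1):
    print(charge, retained_fraction(transitions, charge))
```

The off-diagonal entries of the tabulation, here the single (+1, 0) pair, are the compounds whose predicted charge changed along with their structure.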
Only minor changes in the predicted formal charge distribution at pH7 for the
PM dataset, using the same pH range, were also found (Figure 3.18). Closer
inspection of the distribution reveals that only two compounds’ predicted charges
actually change due to their structures being modified by the STT. The predicted
charges of these compounds, PL0558 (Dimethirimol) and PL0679 (Ethirimol)
(Figure 3.19), both changed from +1 to 0 in conjunction with a change in tautomer.
Figure 3.18: Predicted charge distributions at pH7 for the PM dataset in its NDC forms and PRFs using a pH range of 0-14
Figure 3.19
[Structure diagrams: Dimethirimol (PL0558) and Ethirimol (PL0679), both in their NDC forms]
Clarke (2002), using similar formal charge definitions, predicted the charge
distribution of compounds in the Pesticide Manual to be approximately 10:1, acid :
base. His findings are to some extent reflected in the predicted positive to negative
charge ratios for both the NDC forms and PRFs shown in Figure 3.18 (both ~ 5:1).
3.5.2 A comparison of predicted charge distribution at pH7 within pH 2-10 and pH 0-14 limits
Compounds can have multiple acid-base ionisation constants. ACD/pKa has
an option to deal with them by presenting only the most acidic (MA) and / or
the most basic (MB) pKa it finds within the pH range defined by the user. This may
mean however that other, more appropriate mid-scale pKas that better
characterise compounds simply get overlooked. Consequently, the larger HTS
test dataset was taken, the defined pH “window” narrowed to 2-10 and the extent
of change in the predicted charge distribution observed, giving an indication of
how dependent the distribution is on the pH range chosen.
Only compounds that have predicted pKas within both the 0-14 and 2-10 pH
ranges for both their NDC forms and PRFs could be used in the comparison. This
limited the pH range comparison to 1254 structures (48% of the entire dataset or
63% of the compounds compared over the 0-14 pH range). The predicted charge
distribution profiles for these compounds’ NDC forms and PRFs at the two pH
ranges are shown in Figure 3.20.
Figure 3.20: Predicted charge distributions at pH7 for the HTS dataset in its NDC forms and PRFs using pH ranges of 0-14 and 2-10 for comparison
Figure 3.20 shows that narrowing the pH “window” has only a minor
influence on the charge distribution for the compounds compared. The number of
structures whose predicted charge at pH 7 actually changes when the pH range is
narrowed from 0-14 to 2-10 is only 16 (NDC structures) and 13 (PRF structures),
equating to only ~1% of the compounds. The exact choice of pH range therefore had
no significant influence on the outcome of the charge distribution predictions.
3.6 Issues and problems with prediction tools
3.6.1 AlogP and SMILES
As was highlighted in Chapter 2, Section 8.2, SMILES presented as input to
the ELOGP prediction tool of SOLSTICE had to be canonicalised to
Daylight conventions (Weininger et al., 1989 & Daylight, 2004d) to ensure that a
consistent AlogP prediction, and hence a consistent ELOGP prediction, was always
obtained for each structure. The extent of the
problem that requires this action was examined using the HTS dataset by comparing
the AlogP predictions obtained using the non-canonicalised SMILES stored in the
Syngenta database with those obtained using their canonicalised SMILES, generated
with the Daylight SMILES toolkit (via SOLSTICE Unique Structures). In this dataset,
123 (4.7%) compounds gave different AlogP values for the different SMILES forms,
indicating that a small but significant proportion of compounds were affected. The
distribution of these absolute differences (Figure 3.21) showed that the majority of
them fell within a narrow range, suggesting that the error is a routine one, specific to
AlogP’s handling of SMILES.
Figure 3.21: A graph showing the distribution of non-zero AlogP differences between the canonicalised and un-canonicalised SMILES of HTS dataset compounds
[Plot; x-axis: Absolute AlogP difference (0 to 0.9), y-axis: Compound count (0 to 100)]
Since AlogP is essentially a “black box” prediction tool, where a SMILES is
simply passed to it and a result is passed back, there is little that an AlogP user can
do to remove the problem without its developer’s intervention other than to use
canonicalised SMILES to ensure prediction consistency.
3.6.2 Analysis of prediction failures
A small proportion of compounds in both datasets were excluded from the
comparison of prediction results because AlogP, ACD/logP or ACD/pKa did not
generate a value for them in either their NDC form
or PRF. To uncover the reasons why and show the current limitations of the tools, the
specific instances where failures occurred were investigated.
3.6.2.1 Log P
The complete list of compounds in the HTS dataset that were excluded from
log P result comparisons is given in Table 3.2. It shows the predictions that were
made and indicates the reason for the failure of the others. Such failures affected 96
compounds (3.7% of the dataset).
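Ref     NDC ACD/logP  NDC AlogP  PRF ACD/logP  PRF AlogP    Ref     NDC ACD/logP  NDC AlogP  PRF ACD/logP  PRF AlogP
(each row below lists two compounds side by side, using the same five columns for each)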
HTS0034 0.412 -3.75 0.412 A HTS1865 Ch 4.671 Ch 4.671
HTS0045 0.448 A 0.448 A HTS1895 3.978 A 3.978 A
HTS0056 3.283 A 3.283 A HTS1936 5.606 A 5.606 A
HTS0094 1.356 A 1.356 A HTS1937 4.285 A 4.285 A
HTS0110 1.595 -2.263 1.595 A HTS1950 5.097 A 5.097 A
HTS0130 0.457 -4.393 0.457 A HTS1954 6.87 A 6.870 A
HTS0162 Ch 0.300 0.962 1.666 HTS1955 6.272 A 6.272 A
HTS0169 3.904 A 3.904 A HTS1957 7.610 A 7.610 A
HTS0223 0.483 A 0.483 A HTS1963 4.127 A 4.127 A
HTS0284 4.113 A 4.113 A HTS1965 4.494 A 4.494 A
HTS0451 4.374 3.482 2.085 A HTS1996 4.853 4.689 3.125 A
HTS0468 6.542 A 6.542 A HTS2008 1.224 A 1.224 A
HTS0508 5.426 4.172 3.137 A HTS2021 2.412 A 2.412 A
HTS0695 0.989 -4.742 0.989 A HTS2042 3.099 A 3.099 A
HTS0710 F A F A HTS2068 F A F A
HTS0736 Ch A 3.620 2.018 HTS2081 7.972 A 7.972 A
HTS0802 2.774 A 2.774 A HTS2102 5.081 A 5.081 A
HTS0878 1.844 A 1.844 A HTS2115 3.725 A 3.725 A
HTS0901 3.505 A 3.061 A HTS2128 4.579 A 4.579 A
HTS0905 Ch 6.847 Ch 6.847 HTS2132 Ch 5.570 Ch 5.570
HTS0913 Ch 8.169 Ch 8.169 HTS2133 1.905 A 1.905 A
HTS1007 0.219 A 0.219 A HTS2142 3.877 A 3.877 A
HTS1018 4.677 A 4.677 A HTS2149 F 2.212 F 2.212
HTS1053 3.899 A 3.899 A HTS2154 3.940 A 3.940 A
HTS1109 Ch A Ch A HTS2155 3.342 A 3.342 A
HTS1285 0.756 A 0.756 A HTS2156 1.721 A 1.721 A
HTS1335 2.153 A 2.153 A HTS2170 2.706 A 2.706 A
HTS1340 4.595 A 4.595 A HTS2177 F 4.926 F 4.926
HTS1376 2.882 A 2.882 A HTS2179 Ch 6.312 Ch 6.312
HTS1377 3.069 A 3.069 A HTS2185 3.485 A 3.485 A
HTS1389 6.479 A 6.479 A HTS2188 Ch 5.125 Ch 5.125
HTS1407 4.356 A 4.356 A HTS2207 4.634 A 4.634 A
HTS1464 3.730 3.436 1.565 A HTS2258 0.499 0.780 0.499 A
HTS1523 Ch 2.021 Ch 2.021 HTS2280 5.031 A 5.031 A
HTS1539 Ch 4.507 Ch 4.507 HTS2291 4.665 A 4.665 A
HTS1556 5.636 A 5.636 A HTS2429 0.327 A 0.327 A
HTS1606 1.167 A 1.167 A HTS2430 3.09 A 3.09 A
HTS1652 5.025 A 5.025 A HTS2431 4.984 A 4.984 A
HTS1653 6.675 A 6.675 A HTS2477 2.712 A 2.712 A
HTS1654 0.876 A 0.876 A HTS2500 0.109 A 0.109 A
HTS1755 F 4.305 F 4.305 HTS2501 0.556 A 0.556 A
HTS1763 2.516 A 2.516 A HTS2509 0.652 A 0.652 A
HTS1766 6.008 A 6.008 A HTS2527 6.067 A 6.067 A
HTS1771 1.244 A 1.244 A HTS2532 1.884 A 1.884 A
HTS1797 Ch 5.053 Ch 5.053 HTS2567 1.463 A 1.463 A
HTS1831 0.029 A 0.029 A HTS2590 0.217 A 0.217 A
HTS1839 2.051 A 2.051 A HTS2592 3.842 4.400 2.328 A
HTS1856 0.024 A 0.024 A HTS2609 2.793 -0.329 2.793 A
• A = unparameterised atom(s) found in structure – structure cannot be fully resolved
• Ch = structure charged – structure cannot be fully resolved
• F = contains fragments that cannot be calculated
• Highlighted compounds undergo a tautomeric structural change with the STT, one or both tautomers of which give rise to a log P prediction error.
• Shaded compounds undergo a change in a resonance hybrid substructure only with the STT, one or both tautomers of which give rise to a log P prediction error.
Table 3.2: Reasons for the failure of AlogP or ACDlogP to predict a value for the NDC form or PRF of affected compounds in the HTS dataset
A similar analysis of log P prediction failures for the PM dataset was also
carried out, revealing that 64 compounds were similarly affected (4.7% of the
dataset). The discussion of findings will therefore address both datasets.
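A tally of failure reasons per tool and structural form can be produced mechanically from rows laid out like those of Table 3.2. The Python sketch below is illustrative only and is not the procedure used in the study: the five rows are copied from Table 3.2, the column order (NDC ACD/logP, NDC AlogP, PRF ACD/logP, PRF AlogP) is as read from the table header, and the function and variable names are hypothetical.

```python
from collections import Counter

# Failure codes as defined in the key to Table 3.2
FAILURE_CODES = {
    "A": "unparameterised atom(s)",
    "Ch": "structure charged",
    "F": "fragments that cannot be calculated",
}
COLUMNS = ("NDC ACD/logP", "NDC AlogP", "PRF ACD/logP", "PRF AlogP")

rows = """\
HTS0034 0.412 -3.75 0.412 A
HTS0162 Ch 0.300 0.962 1.666
HTS0710 F A F A
HTS0736 Ch A 3.620 2.018
HTS1865 Ch 4.671 Ch 4.671
"""

def tally_failures(text):
    """Count (column, failure code) pairs; numeric fields are successes."""
    counts = Counter()
    for line in text.splitlines():
        _ref, *fields = line.split()
        for column, field in zip(COLUMNS, fields):
            if field in FAILURE_CODES:
                counts[(column, field)] += 1
    return counts

failures = tally_failures(rows)
for (column, code), n in sorted(failures.items()):
    print(f"{column}: {n} x {FAILURE_CODES[code]}")
```

Applied to the full table, a tally of this kind separates failures that are specific to one structural form from those that affect both.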
The most common error encountered with AlogP was that of atom fragments
not defined in its dictionary. An inspection of the “problem” compounds revealed
that failures occurred most often when phosphorus, sulphur and especially nitrogen
were present in less-common bonding arrangements. The structural features that
appeared to cause the majority of AlogP failures are shown with examples in Table 3.3:
Substructure                        Number of instances   Examples
                                    (HTS + PM)
[diagram: S, O, O, O, Ar]           4 + 6                 HTS0169 / HTS2021 / PL0051 (2,4-Dichlorophenyl benzenesulfonate) /
                                                          PL0743 (Fenson) / PL1069 (Methasulfocarb)
Nsp3-Nsp3                           57 + 7                HTS0509 / HTS0508 / HTS1464 / HTS1996 / HTS2592 / HTS1965 /
                                                          HTS1766 / HTS0878 / PL0054 (2-Hydrazinoethanol) /
                                                          PL0467 (Daminozide) / PL1405 (Sintofen)
[diagrams: N, O or N+, O]           7 + 3                 HTS2429 / HTS0034 / HTS0110 / PL0607 (Dipyrithione)
[diagrams: N+, N, O or N, N, O]     3 + 0                 HTS1340 / HTS1653 / HTS2291
Any nitrogen-sulphur bond           4 + 7                 HTS0094 / HTS0710 / PL0108 (Alanycarb) / PL0181 (Benfuracarb) /
                                                          PL0704 (Fenaminosulf) / PL1446 (Sulglycapin)
Any net charge fragments            0 + 20                PL0196 (Benzamorf) / PL0245 (BTS 44584) / PL0874 (Glyodin) /
                                                          PL1431 (Sulcofuron)
Any Si                              4 + 5                 HTS0901 / HTS2431 / PL0825 (Flusilazole) / PL1402 (Simeconazole)
Tetracoordinate S                   0 + 2                 PL0648 (Endosulfan) / PL0147 (Aramite)
(Compound references in bold underwent a tautomer change with the STT. Those quoted represent all such compounds in both the HTS and PM datasets for which a prediction failure occurred)
Table 3.3: A summary of the structural features that caused AlogP or ACDlogP to fail to give a log P prediction
Of these features, Nsp3-Nsp3 bonds were the most common reason for
prediction failure. Table 3.3 also reveals 6 compounds that underwent a tautomer
change but have so far been excluded from the analysis of prediction results because
of an AlogP failure. Of these, 5 appear to fail for the same reason: their PRF
tautomer contains an Nsp3-Nsp3 bond. The AlogP prediction for the
remaining tautomeric molecule, HTS0901, fails for both its NDC and PRF tautomers
because it contains a silicon atom.
ACD/logP predictions appeared to fail for two reasons. Failure outright
occurred in both datasets on 25 occasions for structures carrying a net charge –
particularly examples containing positively charged nitrogen and sulphur. Failure
also occurred when less common structural fragments were encountered. For
example compound HTS0710 contains an N=S=C fragment, HTS1755 contains an
N=S=N fragment, HTS2177 contains an N-P=S fragment and HTS2149 contains an
O=PN2 fragment.
Twelve of the 96 NDC / PRF structure pairs from the HTS dataset for which
one or more log P predictions failed differ only in the particular resonance hybrid
drawn for a functional group they contain. For example, HTS0162 contains an azide
group: ACD/logP fails for its charge-separated NDC form but predicts a value for its
neutral PRF. By way of contrast and exception, ACD/logP is in most cases able to
resolve the charge-separated and uncharged hybrid forms of nitroso and nitro groups
and treat them equivalently. For AlogP, however, neither nitroso group hybrid is
normally recognised, and predictions for compounds containing them usually fail to
give a value.
One of the more unusual effects of applying the STT to the HTS dataset was
seen for compound HTS0736, whose charge-separated isocyanide NDC form A was
converted into a neutral C(carbene)=N PRF B (Figure 3.22). While ACD/logP and
AlogP failed for hybrid A on grounds of charge and “unknown” atom fragment
respectively, both surprisingly offered predictions for hybrid B.
Figure 3.22 [structure diagram: hybrid forms A and B of HTS0736, related by the STT]
Log P prediction failure did not affect any compounds in the PM dataset
where the STT had made a tautomeric structure change. Structure PL0162
(Aziprotryne), however, always failed with ACD/logP because it is drawn in the
Pesticide Manual with the structurally ambiguous, azide-like substituent group
shown in Figure 3.23.
Figure 3.23 [structure diagram: PL0162 (Aziprotryne)]
3.6.2.2 pKa
The reasons for failure of ACD/pKa predictions were more difficult to relate
to individual molecular characteristics or specific sub-structures than for log P.
Table 3.4 shows the errors encountered for both datasets, with the instances of each
error’s occurrence quoted for predictions made over a pH 0-14 range.
Error number | Error message | HTS | PM
1 | “All calculated pKa values are out of user specified pH range” | 942 | 518
2 | “Cannot calculate pKa” (no reason given) | 27 | 109
3 | “The structure does not contain ionization centers calculated by current version of ACD/pKa” | 244 | 708
4 | pKa value not predicted but no error given either | 2 | 0
5 | “The structure contain elements in not-typical valence” | 0 | 2
Totals | | 1215 | 1337
Table 3.4: Error types encountered from the failure of ACD/pKa to predict values for compounds from the HTS and PM datasets
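The column totals quoted in Table 3.4 can be checked with a few lines of Python. This is only an illustrative sketch: the name `errors_by_type` and the dictionary layout are assumptions made for the example, while the counts themselves are those given in the table.

```python
# Counts of each ACD/pKa error type from Table 3.4, stored as (HTS, PM).
# The variable names and layout are illustrative, not part of the original study.
errors_by_type = {
    1: (942, 518),   # all calculated pKa values out of the specified pH range
    2: (27, 109),    # "Cannot calculate pKa" (no reason given)
    3: (244, 708),   # no ionization centers recognised
    4: (2, 0),       # no value and no error message
    5: (0, 2),       # elements in "not-typical valence"
}

# Sum each column independently.
hts_total = sum(hts for hts, _ in errors_by_type.values())
pm_total = sum(pm for _, pm in errors_by_type.values())

print(f"HTS total: {hts_total}, PM total: {pm_total}")
```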
The HTS dataset figures relate to 619 specific compounds (23% of the total
dataset) where no pKa prediction was offered for either one or both of their NDC
form or PRF using the pH range 0-14. 11 tautomer inter-conversions were affected
by missing pKa values, in each case relating to the PRF tautomer and caused by
predicted values being out of range. In 9 instances this was due to compounds that
had undergone a 4-hydroxypyridine (NDC) to 4-(1H)-pyridone (PRF) substructure
type inter-conversion (Figure 3.24). The only tautomeric example where pKa
predictions failed for both tautomers was HTS0810 involving a related pyrimidine
(NDC) to pyrimidinone (PRF) substructure transformation.
Figure 3.24 [structure diagram: 4-hydroxypyridine (NDC) to 4-(1H)-pyridone (PRF) substructure inter-conversion performed by the STT]
Failure of ACD/pKa predictions affected 725 compounds in the PM dataset
(53% of total) for either their NDC form or PRF. Unlike the log P prediction
methods, ACD/pKa did not attempt to split up and treat separately the 62 multiple
component compounds in this dataset. Instead it simply registered a failure,
regardless of whether each constituent component was acceptable in its own right.
The error relating to “not-typical valence” was caused by the structurally ambiguous
compound PL0162 (Aziprotryne) that also caused ACD/logP to fail. The only
compound structurally-altered by the STT to present ACD/pKa with problems was
PL1022 (Mazidox). The NDC A form was successfully handled but the PRF B
resulted in error 2 occurring (Table 3.4 and Figure 3.25).
Figure 3.25 [structure diagram: PL1022 (Mazidox), NDC form A and PRF B related by the STT]
In the PM dataset, none of the compounds whose tautomeric form was changed
by the STT was affected by pKa prediction failure.
3.7 Revealing the types of structural changes performed by the STT and the tautomer substructures concerned
3.7.1 Analysing the effect of the STT on each dataset
Of the 69 HTS dataset compounds in Table 3.1 whose ELOGP and ESOL
predictions were changed due to their structure being changed, 63 related to a true
change in tautomer form. The nature of the remainder is discussed in Chapter 3,
Section 4. The 52 compounds whose pKa predictions were similarly affected are a
subset of the 63 structures identified above. By including the additional 6 that were
found by examining the log P prediction failures a total of 69 compound tautomer
changes were therefore uncovered in the HTS dataset.
On close examination of the 37 PM dataset compounds in Table 3.1 whose
structures and both ELOGP / ESOL predictions were changed, only 7 could be
attributed to a prototropic tautomer change. These compounds were PL0083
(6-Isopentenylaminopurine), PL0558 (Dimethirimol), PL0679 (Ethirimol), PL0891
(Haloxydine), PL1003 (Kinetin), PL1343 (Pyriclor) and PL1612 (Zeatin) (Figures
3.16, 3.19 and 3.26).
Figure 3.26 [structure diagrams: PL0891 (Haloxydine) and PL1343 (Pyriclor), both NDC forms]
25 of the remainder were simple anion protonations or cation deprotonations
while the final 5 structures all contained nitro groups, which due to the specific
nature of their structures appear to have caused either ClogP (PL0401 (Clothianidin),
PL0595 (Dinotefuran), PL0942 (Imidacloprid) and PL1500 (Thiamethoxam)) or
AlogP (PL0775 (Fluazinam)) specific problems, unusually resulting in different
ELOGP predictions for their different hybrid forms (Figure 3.28).
The compounds where ClogP is affected all contain the same N-nitro
substructure (Figure 3.27). The run log of the ClogP v4 (current SOLSTICE
version) job reveals that, for the A form of the group, ClogP contributions
estimated on the fly were used instead of true matching dictionary fragment(s).
The ClogP v3 and v4 predictions shown in Table 3.5 for these compounds also
differ significantly between the hybrids, reflecting differences in the ClogP v3
and v4 methodologies (Leo & Hoekman, 2000). The inadequacy of the dictionary for
this relatively uncommon substructural feature would therefore seem to be the
cause of the discrepancy. PL0775 (Fluazinam), on the other hand, appears to be an
exception to the general rule that the resonance hybrid forms of carbon-bound
nitro groups are typically treated equivalently by AlogP.
Figure 3.27 [structure diagram: hybrid forms A and B of the N-nitro substructure, related by the STT]
Figure 3.28 [structure diagrams: PL0775 (Fluazinam), PL0401 (Clothianidin), PL0595 (Dinotefuran), PL0942 (Imidacloprid) and PL1500 (Thiamethoxam), all NDC forms]
Compound | Structure form | ClogP v4 | ClogP v3 | AlogP | ACD/logP
PL0775 (Fluazinam) | NDC | 5.915 | 5.217 | 5.254 | 8.190
PL0775 (Fluazinam) | PRF | 5.915 | 5.217 | 5.719 | 8.190
PL0401 (Clothianidin) | NDC | -2.026 | 2.303 | 2.055 | 0.152
PL0401 (Clothianidin) | PRF | 0.176 | 0.173 | 2.055 | 0.152
PL0595 (Dinotefuran) | NDC | -3.078 | 1.384 | 0.628 | 0.700
PL0595 (Dinotefuran) | PRF | -0.876 | -0.946 | 0.628 | 0.700
PL0942 (Imidacloprid) | NDC | -1.560 | 2.772 | 2.260 | 0.199
PL0942 (Imidacloprid) | PRF | 0.672 | 0.672 | 2.260 | 0.199
PL1500 (Thiamethoxam) | NDC | -0.04 | 1.541 | 3.170 | 1.156
PL1500 (Thiamethoxam) | PRF | 0.718 | 1.503 | 3.170 | 1.156
(Highlighted NDC and PRF hybrid pairs are those where predictions differ between them)
Table 3.5: PM compounds containing pairs of resonance hybrids that resulted in different log P predictions sometimes being obtained for each
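Because the highlighting of Table 3.5 does not survive in plain text, the hybrid pairs whose predictions differ can be recovered directly from the values. The sketch below is illustrative only: the `predictions` dictionary and the helper `methods_with_discrepancy` are names invented for this example, and the numbers are those listed in the table.

```python
# log P predictions from Table 3.5, stored per method as (NDC, PRF).
# Names and layout are assumptions made for this example.
predictions = {
    "PL0775 (Fluazinam)": {
        "ClogP v4": (5.915, 5.915), "ClogP v3": (5.217, 5.217),
        "AlogP": (5.254, 5.719), "ACD/logP": (8.190, 8.190)},
    "PL0401 (Clothianidin)": {
        "ClogP v4": (-2.026, 0.176), "ClogP v3": (2.303, 0.173),
        "AlogP": (2.055, 2.055), "ACD/logP": (0.152, 0.152)},
    "PL0595 (Dinotefuran)": {
        "ClogP v4": (-3.078, -0.876), "ClogP v3": (1.384, -0.946),
        "AlogP": (0.628, 0.628), "ACD/logP": (0.700, 0.700)},
    "PL0942 (Imidacloprid)": {
        "ClogP v4": (-1.560, 0.672), "ClogP v3": (2.772, 0.672),
        "AlogP": (2.260, 2.260), "ACD/logP": (0.199, 0.199)},
    "PL1500 (Thiamethoxam)": {
        "ClogP v4": (-0.04, 0.718), "ClogP v3": (1.541, 1.503),
        "AlogP": (3.170, 3.170), "ACD/logP": (1.156, 1.156)},
}


def methods_with_discrepancy(by_method):
    """Return the methods whose NDC and PRF predictions differ."""
    return [m for m, (ndc, prf) in by_method.items() if ndc != prf]


discrepancies = {c: methods_with_discrepancy(m) for c, m in predictions.items()}
for compound, methods in discrepancies.items():
    print(compound, "->", ", ".join(methods) or "none")
```

On these values only AlogP distinguishes the Fluazinam hybrids, whereas both ClogP versions distinguish the hybrids of the four N-nitro compounds, in line with the discussion above.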
Of the five compounds whose structures, and consequently pKas, the STT
changed, four (PL0558 (Dimethirimol), PL0679 (Ethirimol), PL0891 (Haloxydine)
and PL1343 (Pyriclor) (Figures 3.19 and 3.26)) form a subset of the 7 PM dataset
tautomeric compounds identified from the log P data. The remaining compound,
PL1606 (WL 9385) (Figure 3.29), contains an azide group whose hybrids ACD/pKa
failed to treat as equivalent.
Figure 3.29 [structure diagram: the two azide hybrid forms of PL1606 (WL 9385), related by the STT]
3.7.2 Categorising the types of structure change performed by the STT
A considerable variety of tautomer and resonance hybrid transformations
were undertaken by the STT on both datasets. The specific substructures involved,
some of which are part of larger heterocyclic fused ring systems, and the number of
compounds found of each class are shown in Table 3.6.
No | NDC substructure (atom labels from diagram) | PRF substructure (atom labels from diagram) | Number of instances encountered (HTS + PM)
1 | not[O], N+, O-, A | not[O], N, O, A | 12 + 0
2 | N, N, N, OH, A, A | NH, N, N, O, A, A | 3 + 0
3 | N, N, OH, A, A, A, A, A | N, NH, O, A, A, A, A, A | 4 + 0
4 | N, N, OH, A, A, A, A, A | NH, N, O, A, A, A, A, A | 1 + 0
5 | N, OH, Not OH, Not OH, A, A | NH, O, Not OH, Not OH, A, A | 10 + 2
6 | N, N, OH, OH, A, A | NH, NH, O, O, A, A | 2 + 0
7 | A, N-, N+, N | A, N, N, N | 1 + 0
8 | N, N, OH, A, A, A | NH, N, O, A, A, A | 2 + 0
9 | N, N, OH, N, A, A, A, A | NH, N, O, N, A, A, A, A | 5 + 2
10 | N, N, OH, A, A, A | NH, N, O, A, A, A | 2 + 0
11 | N, OH, OH, A, A, A | NH, OH, O, A, A, A | 3 + 0
12 | N, OH, Not OH, A, A, A | NH, O, Not OH, A, A, A | 5 + 0
13 | A, N+, C- | A, N, C:: | 1 + 0
14 | N, N, OH, A, A, A | NH, N, O, A, A, A | 2 + 0
15 | N, N, OH, A, A, A | NH, N, O, A, A, A | 17 + 0
16 | N, N, OH, N, A, A, A, A | NH, N, O, N, A, A, A, A | 2 + 0
17 | N, N, N, SH, A, A | N, NH, N, S, A, A | 2 + 0
18 | N, N, OH, N, A, A, A, A | N, NH, O, N, A, A, A, A | 1 + 0
19 | N, N, SH, A, A, A | N, NH, S, A, A, A | 1 + 0
20 | N, N, OH, A, A, A | NH, N, O, A, A, A | 3 + 0
21 | NH, N, N, A, A | N, NH, N, A, A | 1 + 0
22 | S, NH, SH, A, A, A | S, NH2, S, A, A, A | 1 + 0
23 | N, N, N, SH, A, A | NH, N, N, S, A, A | 1 + 0
24 | N, N, N, OH, A, A | NH, N, N, O, A, A | 1 + 0
25 | NH, N, A', A, A | N, NH, A', A, A | 0 + 3
A = Any group (not H when attached to a heteroatom)
Table 3.6: Tautomer substructure types identified from the HTS and PM datasets
3.7.3 Validating the structural changes performed by the STT
So far it has not been identified whether the PRF tautomers are more likely to
be major ones than their NDC analogues. It is also not known whether there are
sometimes other major tautomers that the STT did not generate. As a result, the
tautomer analysis utility of ACD/pKa was used to make predictions about what it
expects the “major” and “minor” tautomers of each substructure to be. To do this,
simple molecules containing each substructure were analysed by ACD/pKa. The
results are shown in Table 3.7 below:
No | NDC example (atom labels from diagram): prediction | PRF example (atom labels from diagram): prediction | ACD/pKa suggested alternatives
1 | N+, O-: Major | N, O: Fail: “non-typical valence” | -
2 | N, N, N, OH: Minor | NH, N, N, O: CD Major 1 | N, N, NH, O: CD Major 2
3 | N, N, OH, CH3, CH3, CH3: Minor 1 | N, NH, O, CH3, CH3, CH3: Minor 2 | N, N, O, CH3, CH3, CH3: Major
4 | N, N, OH, CH3, CH3, CH3: Minor 1 | NH, N, O, CH3, CH3, CH3: Minor 2 | N, N, O, CH3, CH3, CH3: Major
5 | N, OH: Minor | NH, O: Major | -
6 | N, N, OH, OH: Minor | NH, NH, O, O: Major | -
7 | CH3, N-, N+, N: Fail: “non-typical valence” | CH3, N, N, N: Fail: “non-typical valence” | -
8 | N, N, OH: Minor | NH, N, O: Major | -
9 | N, N, OH, N, CH3, CH3: Minor | NH, N, O, N, CH3, CH3: CD Major 1 | N, NH, O, N, CH3, CH3: CD Major 2
10 | N, N, OH: Minor | NH, N, O: Major | -
11 | N, OH, OH: Minor 1 | NH, OH, O: CD Major 1 | NH, O, O: CD Major 2; NH, O, OH: Minor 2
12 | N, OH: Minor | NH, O: Major | -
13 | CH3, N+, C-: Fail: “Charged structure” | CH3, N, C::: Fail: “Non-typical valence” | -
14 | N, N, OH: Minor | NH, N, O: Major | -
15 | N, N, OH: Minor | NH, N, O: Major | -
16 | N, N, OH, N, CH3, CH3: Minor 1 | NH, N, O, N, CH3, CH3: Minor 2 | N, N, O, NH, CH3, CH3: Major
17 | N, N, N, SH, CH3: Minor | N, NH, N, S, CH3: Major | -
18 | N, N, OH, N, CH3, CH3: Minor 1 | N, NH, O, N, CH3, CH3: Minor 2 | N, N, O, NH, CH3, CH3: Major
19 | N, N, SH, CH3: Minor | N, NH, S, CH3: Major | -
20 | N, N, OH, CH3: Minor | NH, N, O, CH3: CD Major 1 | N, NH, O, CH3: CD Major 2
21 | NH, N, N, N, CH3, CH3: CD Major 1 | N, NH, N, N, CH3, CH3: CD Major 2 | -
22 | S, NH, SH: Minor 1 | S, NH2, S: Minor 2 | S, NH, S: CD Major 1; S, NH, S: CD Major 2
23 | N, N, N, SH: Minor | NH, N, N, S: CD Major 1 | N, NH, N, S: CD Major 2
24 | N, N, N, OH: Minor | NH, N, N, O: CD Major 1 | N, NH, N, O: CD Major 2
25 | NH, N, CH3: CD Major 1 | N, NH, CH3: CD Major 2 | -
ACD/pKa dominant tautomer predictions:
• Fail: “…” = ACD/pKa failed to interpret the structure for checking of alternative tautomeric forms (reason given in “”)
• “Minor” = Sole predicted minor tautomer
• “Minor 1/2” = Predicted minor tautomers suggested independently of each other
• “Major” = Sole predicted major tautomer
• “Major 1/2” = Predicted major tautomers suggested independently of each other
• “CD Major 1/2” = Suggested conditions dependent major tautomers of each other
Table 3.7: ACD/pKa major / minor tautomer predictions for example compounds of each substructure type identified in Table 3.6
In the majority of cases, the STT’s structure-changing rules successfully
tautomerised these substructure example compounds from a predicted “minor” to a
“major” tautomer. The NDC to PRF tautomer transformation is therefore a
worthwhile process. In cases 21 and 25 both the NDC and PRF tautomers appear to
be energetically very similar, since the tautomerisation performed was between two
“major” forms. Only in cases 3, 4 and 22 would the STT’s rules fail to find a “major”
tautomer. What cannot be determined from these findings in cases 2, 8, 11 and 20-25
is which conditions-dependent tautomer is most likely in a given circumstance. In
cases 1, 7 and 13, due to problems caused by either charge or unusual valence states,
one or both of the NDC and PRF hybrids could not be handled by ACD/pKa’s
tautomer analysis utility.
3.8 Comparing measured and predicted property values
3.8.1 Compounds whose structures were not modified by the STT
In addition to the influence of tautomerism on the outcome of property
predictions, it is important to establish how reliable the predictions made for these
datasets are in comparison to measured values. Initially, comparisons of measured
and predicted property values will be restricted to those compounds whose structures
were unchanged by their passage through the Structure Transformation Tool (STT),
that is, those whose Native Drawing Convention (NDC) forms and Physiologically
Relevant Forms (PRFs) are identical.
3.8.1.1 pKa comparisons
3.8.1.1.1 HTS dataset
81 compounds in the HTS dataset had both predicted and measured pKa data
available. The distribution of absolute pKa differences between their measured and
predicted values is shown in Figure 3.30.
[Histogram: compound count (y-axis, 0 to 25) against absolute pKa difference, predicted vs measured (x-axis, 0 to 4.5)]
Figure 3.30: A plot of the absolute differences between predicted and measured pKa values for the HTS dataset where the STT made no structural change
The distribution in Figure 3.30 shows that the errors in the predicted pKas are
significantly larger than typical experimental errors (+/- 0.1 pH unit) in the measured
values, making the influence of the latter on the former negligible. As the “fall-off”
of compound count in the above distribution appears to be in two parts, stepped at
around 2 pKa absolute difference units, this may represent a cut-off point for the
majority of valid measured vs. predicted pKa comparisons.
From an examination of the ten compounds with the largest absolute pKa
differences, there was evidence that certain sub-structures were commonly involved.
In particular, eight featured one of 3 recurring substructures (Figure 3.31): four
contained substructure 1 (HTS1199, HTS1192, HTS1200 and HTS1195), two
contained substructure 2 (HTS0957 and HTS0092) and two contained substructure 3
(HTS0521 and HTS1364).
Figure 3.31 [structure diagrams: recurring substructures 1, 2 and 3]
Recurrences such as these indicate that ACD/pKa systematically misinterprets
particular classes of substructure.
3.8.1.1.2 PM dataset
In this dataset there were 129 compounds for which both measured and
predicted pKa data was available. A plot of the absolute difference between
measured and predicted pKa values for these compounds is shown in Figure 3.32:
[Histogram: compound count (y-axis, 0 to 100) against absolute pKa difference, predicted vs measured (x-axis, 0 to 16)]
Figure 3.32: A plot of the absolute differences between predicted and measured pKa values for the PM dataset where the STT made no structural change
As in Figure 3.30, the majority of absolute differences in Figure 3.32
were less than 2 pKa units. For the PM dataset 64% of these predictions were within
1 pKa unit and 82% within 2 pKa units of the measured value. This compares
favourably with the corresponding figures of 48% (within 1 pKa unit) and 67%
(within 2 pKa units) found for the HTS dataset in Figure 3.30. Overall, pKa
predictions for the PM dataset were typically more reliable than those for the HTS
dataset.
3.8.1.2 log P comparisons
3.8.1.2.1 HTS dataset
This dataset contained 65 compounds whose log P had been measured.
ELOGP predictions were successfully made for 64 of these. The distribution of
absolute log P differences between these compounds’ measured log Ps and predicted
ELOGPs is shown in Figure 3.33:
[Histogram: compound count (y-axis, 0 to 15) against absolute log P difference, predicted vs measured (x-axis, 0 to 1.8)]
Figure 3.33: A plot of the absolute differences between predicted and measured log P values for the HTS dataset where the STT made no structural change
This distribution shows that the degree of error in HTS dataset log P
prediction was lower than was seen for HTS dataset pKas in Figure 3.30. Illustrating
this, the mean absolute error in log P prediction is 0.64 log units in Figure 3.33
compared with 1.50 pKa units in Figure 3.30. Figure 3.33 also shows that 85% of the
log P predictions are within 1 log P unit of measured values, compared with 48% of
pKa predictions at the same threshold in Figure 3.30. These observations show that
pKa predictions were less reliable than log P predictions.
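The two summary statistics used in this comparison, the mean absolute error and the percentage of predictions falling within a given distance of the measured value, can be sketched as follows. The function name `summarise_errors`, the inclusive threshold and the sample values are all assumptions made for illustration; none of the numbers below come from the HTS or PM datasets.

```python
def summarise_errors(measured, predicted, threshold=1.0):
    """Return (mean absolute error, % of predictions within `threshold` units).

    Whether the threshold is inclusive is an assumption of this sketch.
    """
    if len(measured) != len(predicted) or not measured:
        raise ValueError("need two non-empty lists of equal length")
    abs_errors = [abs(p - m) for m, p in zip(measured, predicted)]
    mae = sum(abs_errors) / len(abs_errors)
    within = 100.0 * sum(e <= threshold for e in abs_errors) / len(abs_errors)
    return mae, within


# Hypothetical values for illustration only.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 5.0, 4.25]
mae, pct_within_1 = summarise_errors(measured, predicted)
print(f"MAE = {mae:.4f}, within 1 unit = {pct_within_1:.1f}%")
```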
It is at first surprising that the modal log P difference “bin” and the smallest
absolute log P difference “bin” did not coincide in Figure 3.33. Since the number of
compounds making up the distribution is relatively small, the skewing of the maximum
can be assumed to be artificial. This skewing can be seen by examining the
distribution of signed log P differences (Figure 3.34):
[Histogram: compound count (y-axis, 0 to 18) against log P difference, predicted vs measured (x-axis, -2 to 2)]
Figure 3.34: A plot of the differences between predicted and measured log P values for the HTS dataset where the STT made no structural change
The ten poorest log P predictions highlighted in Figure 3.33 are a more varied
collection of compounds than was seen for the same dataset’s pKa predictions.
However, several structural features seem to recur within them that highlight current
weaknesses of the log P prediction tools used:
• 5-membered aromatic rings containing 2 or more nitrogens (HTS1499,
HTS1542, HTS0891, HTS0876, HTS0804, HTS0704 and HTS0197)
• Aromatic nitrogen-nitrogen bonds (HTS0704, HTS0804, HTS1542 and
HTS1499)
• Cyclopropyl groups (HTS0197, HTS1393 and HTS1499)
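Several compounds appear under more than one of the features listed above. A short tally makes the overlap explicit; this is only an illustrative sketch, and the dictionary `features` and the other variable names are assumptions made for the example, while the compound references are those quoted in the list.

```python
from collections import Counter

# Compound references quoted against each recurring structural feature.
features = {
    "5-membered aromatic ring containing 2 or more nitrogens": [
        "HTS1499", "HTS1542", "HTS0891", "HTS0876",
        "HTS0804", "HTS0704", "HTS0197"],
    "aromatic nitrogen-nitrogen bond": [
        "HTS0704", "HTS0804", "HTS1542", "HTS1499"],
    "cyclopropyl group": ["HTS0197", "HTS1393", "HTS1499"],
}

# Count how many of the listed features each compound exhibits.
feature_counts = Counter(cid for ids in features.values() for cid in ids)

# Compounds that show two or more of the features.
multi_feature = sorted(cid for cid, n in feature_counts.items() if n > 1)

print(len(feature_counts), "distinct compounds")
print("More than one feature:", ", ".join(multi_feature))
```

On the references quoted, eight distinct compounds are involved and HTS1499 is the only one listed under all three features.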
3.8.1.2.2 PM dataset
There were 470 compounds for which both measured and predicted log P
data was available. A plot of the absolute difference between measured and predicted
log P values for these compounds is shown in Figure 3.35:
[Histogram: compound count (y-axis, 0 to 350) against absolute log P difference, predicted vs measured (x-axis, 0 to 7)]
Figure 3.35: A plot of the absolute differences between predicted and measured log P values for the PM dataset where the STT made no structural change
In comparison to the log P distribution obtained for the HTS dataset in Figure
3.33, the more regular distribution in Figure 3.35 reflects the larger number of
compounds it comprises. Compared to the 85% of log P predictions in the HTS
dataset, only 75% of predictions for the PM dataset were within 1 log P unit of their
measured value. Therefore, in contrast to the findings for pKa predictions, log P
predictions for the HTS dataset were marginally more accurate than those for the PM
dataset.
3.8.2 The impact of the STT changing tautomers on the outcome of log P and pKa predictions
3.8.2.1 Introduction
There was only a limited amount of measured log P and pKa data available
for the 76 changed-tautomer compounds identified from the HTS and PM datasets.
Therefore it was not possible to compare their measured and predicted property
values in isolation.
However, the compound substructures from Table 3.6 were used to search the
entire Syngenta database for other examples where log P or pKa values had been
measured. This search identified 54 compounds from 9 of the tautomer structure
classes (3, 5, 6, 9, 10, 11, 12, 15 and 20) with measured pKa values and 69
compounds from the same nine classes with measured log P values. Eight of these
compounds were also present in the HTS dataset; the remainder will be referred to
by the generic reference numbers “MEASxx”, where xx = 01-73.
3.8.2.2 Defining tautomer type subclasses
The 23 generic tautomer classes defined so far often encompass a variety of
more specific substructures. To better address the underlying diversity, 7 of the 9
where measured data was available were split into between 2 and 5 further sub-
classes using the guidelines defined in Chapter 2, Section 10.5.
The substructures comprising the expanded list of tautomer types with
measured property values are shown in Table 3.8 together with the number of
examples of each. Those unchanged from Table 3.6 are included in Table 3.8 with
the same reference number and those that have been subdivided are suffixed by a
series of letters.
Ref. | NDC structure (atom labels from diagram) | PRF structure (atom labels from diagram) | Meas’d pKa | Meas’d log P | Both meas’d
3 | N, N, OH, A, A, A, A, A | N, NH, O, A, A, A, A, A | 3 | 4 | 3
5a | N, OH, A, A, A, A | NH, O, A, A, A, A | 3 | 1 | 1
5b | N, OH, A, A, N, A, A, A | NH, O, A, A, N, A, A, A | 3 | 13 | 2
5c | N, OH, N, A, A, A | NH, O, N, A, A, A | 0 | 1 | 0
5d | N, OH, N, A, A, A | NH, O, N, A, A, A | 1 | 1 | 1
5e | N, OH, A, A, A, O, A | NH, O, A, A, A, O, A | 6 | 5 | 5
6a | N, N, OH, OH, A, A | NH, NH, O, O, A, A | 3 | 7 | 3
6b | N, N, OH, OH, N, N, A, A | NH, NH, O, O, N, N, A, A | 1 | 0 | 0
6c | N, N, OH, OH, A, O, A, A, A | NH, NH, O, O, A, O, A, A, A | 1 | 1 | 1
9a | N, N, OH, N, H, A, A, A, A, O, A | NH, N, O, N, H, A, A, A, A, O, A | 1 | 1 | 1
9b | N, N, OH, N, A, A, A, A | NH, N, O, N, A, A, A, A | 2 | 4 | 2
9c | N, N, OH, N, H, A, A, A | NH, N, O, N, H, A, A, A | 4 | 2 | 2
10 | N, N, OH, A, A, A | NH, N, O, A, A, A | 2 | 4 | 2
11a | N, OH, OH, A, A, O, A | NH, OH, O, A, A, O, A | 3 | 3 | 3
11b | N, OH, OH, N, N, A, A, A | NH, OH, O, N, N, A, A, A | 0 | 1 | 0
12a | N, OH, A, A, A, A | NH, O, A, A, A, A | 4 | 3 | 1
12b | N, OH, A, A, A, O, A | NH, O, A, A, A, O, A | 5 | 6 | 5
12c | N, OH, N, A, O, A, A, A, A, H, A | NH, O, N, O, A, A, A, A, A, A, H | 3 | 3 | 3
12d | N, OH, A, A, O, A, A, A, A | NH, O, A, A, O, A, A, A, A | 1 | 1 | 1
15a | N, N, OH, A, A, O, A | NH, N, O, A, A, O, A | 2 | 2 | 2
15b | N, N, OH, A, A, A | NH, N, O, A, A, A | 0 | 1 | 0
20a | N, N, OH, A, A, A, A, A | NH, N, O, A, A, A, A, A | 0 | 2 | 0
20b | N, N, OH, A, A, A | NH, N, O, A, A, A | 3 | 1 | 1
20c | N, N, OH, A, OH, O, A | NH, N, O, A, OH, O, A | 1 | 1 | 1
20d | N, N, OH, A, N, A, A, A | NH, N, O, A, N, A, A, A | 1 | 1 | 1
20e | N, N, OH, A, OH, A | NH, N, O, A, OH, A | 1 | 0 | 0
Totals | | | 54 | 69 | 41
A = Any non-heteroatoms OR any non-protonated heteroatoms OR any group not containing heteroatoms conjugated into the ring. A-groups must also not be connected to form additional rings.
Table 3.8: Expanded list of tautomer substructure types for compounds with measured pKa or log P values
Table 3.8 reveals the degree of diversity found within many of the more
generic substructures previously identified in Table 3.7. It also shows that there is a
high degree of overlap of compounds that have both measured log P and pKa data.
3.8.2.3 pKa comparisons
A summary of the measured pKa values plus the NDC and PRF tautomer
ACD/pKa predictions of compounds listed in Table 3.8 is shown in Table 3.9:
Ref | Type | Meas pKa | NDC predicted pKa | PRF predicted pKa | Absolute NDC - Meas'd difference | Absolute PRF - Meas'd difference | Prediction improvement (NDC→PRF)
HTS0047 | 3 | 7.70 A | 0.47 MA | - | 7.23 | - |
HTS0957 | 3 | 3.61 A | 4.17 A | 2.96 MA | 0.56 | 0.65 | -0.09
HTS0958 | 3 | 3.94 A | - | 4.12 MA | - | 0.18 |
MEAS10 | 5a | 2.00 A | 3.22 MA | 3.03 MA | 1.22 | 1.03 | 0.19
MEAS11 | 5a | 3.09 A | 4.26 MA | 4.24 MA | 1.17 | 1.15 | 0.02
MEAS19 | 5a | 6.00 A | 4.87 MA | 6.31 MA | 1.13 | 0.31 | 0.82
MEAS16 | 5b | 9.50 A | 2.02 A | 5.55 MB | 7.48 | |
MEAS17 | 5b | 6.55 A | 1.13 A | 4.81 MB | 5.42 | |
MEAS18 | 5b | 11.60 A | 4.80 A | 6.36 MB | 6.80 | |
MEAS23 | 5d | 11.30 A | 4.85 A | 5.91 MB | 6.45 | |
MEAS36 | 5e | 9.20 A | 8.43 MB | 8.68 MA | | 0.52 |
MEAS42 | 5e | 5.50 A | 4.58 MA | 6.53 MA | 0.92 | 1.03 | -0.11
MEAS45 | 5e | 9.10 A | 8.51 MB | 8.73 MA | | 0.37 |
MEAS46 | 5e | 5.40 A | 4.66 MA | 6.59 MA | 0.74 | 1.19 | -0.45
MEAS47 | 5e | 3.60 A | 3.45 MA | 2.51 MA | 0.15 | 1.09 | -0.94
MEAS51 | 5e | 4.80 A | 4.55 MA | 7.02 MA | 0.25 | 2.22 | -1.97
HTS0107 | 6a | 9.34 A | 2.06 MA | 9.15 MA | 7.28 | 0.19 | 7.09
MEAS70 | 6a | 8.93 A | 2.06 MA | 9.15 MA | 6.87 | 0.22 | 6.65
MEAS71 | 6a | 8.93 A | 2.06 MA | 9.15 MA | 6.87 | 0.22 | 6.65
MEAS03 | 6b | 7.83 A | 0.82 A | 8.46 MA | 7.01 | 0.63 | 6.38
MEAS20 | 6c | 9.20 A | 2.11 MA | 9.20 MA | 7.09 | 0.00 | 7.09
MEAS66 | 9a | 8.60 A | 0.25 MA | 9.47 MA | 8.35 | 0.87 | 7.48
MEAS05 | 9b | 10.70 A | 4.64 MA | 10.38 MA | 6.06 | 0.32 | 5.74
MEAS12 | 9b | 2.00 B | 7.48 MB | 7.54 MA | 5.48 | |
MEAS01 | 9c | 9.90 A | 4.24 MA | 10.05 MA | 5.66 | 0.15 | 5.51
MEAS06 | 9c | 11.00 A | 5.19 MA | 11.09 MA | 5.81 | 0.09 | 5.72
MEAS07 | 9c | 10.60 A | 4.71 MA | 10.58 MA | 5.89 | 0.02 | 5.87
MEAS08 | 9c | 9.60 A | 4.58 MA | 10.94 MA | 5.02 | 1.34 | 3.68
HTS0451 | 10 | 4.97 A | 6.73 MB | 5.51 MA | | 0.54 |
HTS0508 | 10 | 4.93 A | 6.71 MB | 5.51 MA | | 0.58 |
MEAS40 | 11a | 5.40 A | 6.45 MA | 4.50 MA | 1.05 | 0.90 | 0.15
MEAS43 | 11a | 5.90 A | 6.61 MA | 4.50 MA | 0.71 | 1.40 | -0.69
MEAS48 | 11a | 5.40 A | 6.51 MA | 4.50 MA | 1.11 | 0.90 | 0.21
MEAS14 | 12a | 8.30 A | 6.13 MA | 7.97 MA | 2.17 | 0.33 | 1.84
MEAS15 | 12a | 9.70 A | 8.10 MB | 9.83 MA | | 0.13 |
MEAS54 | 12a | 6.90 A | 7.57 MA | 7.04 MA | 0.67 | 0.14 | 0.53
MEAS62 | 12a | 7.60 A | 7.77 MA | 7.85 MA | 0.17 | 0.25 | -0.08
MEAS37 | 12b | 10.30 A | 7.73 MB | 9.53 MA | | 0.77 |
MEAS41 | 12b | 6.00 A | 5.03 MA | 5.38 MA | 0.97 | 0.62 | 0.35
MEAS44 | 12b | 8.40 A | 5.50 MB | 8.81 MA | | 0.41 |
MEAS58 | 12b | 6.59 A | 7.96 MA | 5.94 MA | 1.37 | 0.65 | 0.72
MEAS59 | 12b | 5.69 A | 5.58 MA | 5.86 MA | 0.11 | 0.17 | -0.06
MEAS50 | 12c | 10.20 A | 13.27 MA | 9.74 MA | 3.07 | 0.46 | 2.61
MEAS52 | 12c | 13.30 A | 8.72 MB | 10.43 MA | | 2.87 |
MEAS56 | 12c | 5.30 A | 6.71 MB | 8.86 MA | | 3.56 |
MEAS55 | 12d | 4.59 A | 5.32 MB | 8.82 MA | | 4.23 |
MEAS72 | 15a | 9.76 A | 13.77 MA | 9.74 MA | 4.01 | 0.02 | 3.99
MEAS73 | 15a | 9.66 A | 13.78 MA | 9.23 MA | 4.12 | 0.43 | 3.69
MEAS21 | 20b | 8.20 A | 8.36 MB | 7.96 MA | | 0.24 |
MEAS53 | 20b | 6.51 A | 12.29 MA | 6.17 MA | 5.78 | 0.34 | 5.44
MEAS57 | 20b | 7.60 A | 8.21 MB | 7.77 MA | | 0.17 |
MEAS49 | 20c | 4.60 A | 3.74 MA | 4.50 MA | 0.86 | 0.10 | 0.76
MEAS13 | 20d | 4.38 A | 1.28 MA | 5.86 MA | 3.10 | 1.48 | 1.62
MEAS02 | 20e | 5.42 A | 6.50 MA | 4.50 MA | 1.08 | 0.92 | 0.16
Mean absolute difference between measured and predicted pKa values: | | | | | 3.59 | 0.76 |
• A = acidic pKa, B = basic pKa, MA = most acidic pKa, MB = most basic pKa.
• Cells highlighted yellow relate to predictions against which no comparisons can be drawn due to pKa type incompatibility (no more suitable pKa prediction available) or prediction failure.
• Acidic pKas highlighted in black were obtained by manual ACD/pKa prediction experiments since the SOLSTICE version only provided a basic pKa with the settings used.
Table 3.9: Summary of measured and predicted pKa values
A summary of the NDC and PRF structure pKa prediction accuracy, reported
at a variety of thresholds is given in Table 3.10.
% of successful prediction comparisons made within x units of the measured pKa value
Compound form | < 0.5 | < 1.0 | < 2.0 | < 4.0 | > 4.0 | Unknown (number of compounds where comparison was not possible)
NDC | 9.8 | 26.8 | 43.9 | 51.2 | 48.8 | 13
PRF | 50.0 | 75.0 | 91.7 | 97.9 | 2.1 | 6
% Improvement | 40.2 | 48.2 | 47.8 | 46.7 | 46.7 |
Table 3.10: Summary of the accuracy of pKa predictions for compounds with measured values
Table 3.9 shows that, on average, the predictions made for the PRFs of
molecules were over 2.8 pKa units closer to the measured values than those made
for their NDC forms. Emphasising this positive effect, Table 3.10 shows an
improvement of at least 40 percentage points in prediction accuracy across a range
of measured – predicted pKa difference thresholds.
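The “% Improvement” row of Table 3.10 and the average gain quoted from Table 3.9 are simple differences, as the following sketch shows. The variable names are illustrative assumptions; the percentages and mean absolute differences are those reported in the two tables.

```python
# Cumulative percentages from Table 3.10 at each difference threshold.
thresholds = ["< 0.5", "< 1.0", "< 2.0", "< 4.0"]
ndc_pct = [9.8, 26.8, 43.9, 51.2]
prf_pct = [50.0, 75.0, 91.7, 97.9]

# Improvement in percentage points at each threshold (PRF minus NDC).
improvement = [round(p - n, 1) for n, p in zip(ndc_pct, prf_pct)]

# Mean absolute differences (NDC, PRF) from the last row of Table 3.9.
mean_gain = round(3.59 - 0.76, 2)

for label, gain in zip(thresholds, improvement):
    print(f"{label}: +{gain} percentage points")
print(f"Mean gain: {mean_gain} pKa units")
```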
Such positive benefits confirm that the effect of converting NDC structures to
PRF structures was substantial. Table 3.9 also shows evidence that the degree of
prediction improvement for compounds within specific subclasses, or between related
classes of tautomer substructure, is often very similar. For example, converting the 5
structures of types 6a, 6b and 6c improved their pKa predictions by 6.4-7.1 pKa units.
Evidence of similarly uniform improvements can be seen for types 9b and 9c. In
contrast, the predictions for the type 5e compounds examined largely suffered when
they were changed to their PRF tautomers. Such negative effects for particular
substructures are discussed in Chapter 3, Section 8.2.6.
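The “Prediction improvement” column of Table 3.9 is the absolute NDC error minus the absolute PRF error. The sketch below reproduces it for the type 6 compounds; the function `improvement` and the tuple layout are assumptions made for illustration, while the pKa values are those listed in the table.

```python
# (reference, type, measured pKa, NDC predicted pKa, PRF predicted pKa)
# for the type 6a, 6b and 6c rows of Table 3.9.
rows = [
    ("HTS0107", "6a", 9.34, 2.06, 9.15),
    ("MEAS70", "6a", 8.93, 2.06, 9.15),
    ("MEAS71", "6a", 8.93, 2.06, 9.15),
    ("MEAS03", "6b", 7.83, 0.82, 8.46),
    ("MEAS20", "6c", 9.20, 2.11, 9.20),
]


def improvement(measured, ndc_pred, prf_pred):
    """Reduction in absolute error on going from the NDC form to the PRF."""
    return abs(ndc_pred - measured) - abs(prf_pred - measured)


# A positive value means the PRF prediction is closer to the measured pKa.
gains = {ref: round(improvement(m, n, p), 2) for ref, _, m, n, p in rows}

for ref, gain in gains.items():
    print(f"{ref}: {gain:+.2f} pKa units")
```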
3.8.2.4 Log P comparisons
A summary of the measured log P values plus the NDC and PRF tautomer
log P predictions of compounds listed in Table 3.8 is shown in Table 3.11:
Columns: Ref; Type; Meas’d log P; NDC predictions (ELOGP, ClogP, ACD logP, AlogP); PRF predictions (ELOGP, ClogP, ACD logP, AlogP); Abs. diff. meas’d → NDC; Abs. diff. meas’d → PRF; Imprvm’t NDC → PRF
HTS0047 3 0.89 1.72 2.60 1.34 1.20 0.82 0.52 1.70 0.25 0.83 0.07 0.76
HTS0810 3 0.91 1.32 2.44 1.16 0.37 0.71 0.41 2.30 -0.59 0.41 0.20 0.21
HTS0957 3 0.50 0.41 0.95 0.67 -0.38 -0.71 -17.26 0.47 -1.33 0.09 1.21 -1.12
HTS0958 3 0.50 0.56 1.51 0.17 -0.01 -0.06 -0.52 1.30 -0.96 0.06 0.56 -0.50
MEAS11 5a 2.34 2.12 2.41 1.68 2.28 -0.07 -0.42 0.37 -0.15 0.22 2.41 -2.19
MEAS16 5b 2.10 3.15 3.10 3.44 2.91 1.84 0.99 2.84 1.68 1.05 0.26 0.79
MEAS18 5b 1.60 2.83 2.97 2.76 2.77 1.58 0.60 3.08 1.07 1.23 0.02 1.22
MEAS22 5b 2.15 3.28 3.37 3.22 3.24 2.02 1.05 3.67 1.35 1.13 0.13 1.00
MEAS25 5b 2.58 3.76 3.90 3.75 3.64 2.51 1.57 4.20 1.74 1.18 0.08 1.11
MEAS26 5b 1.79 3.29 3.40 3.29 3.17 2.03 1.08 3.74 1.28 1.50 0.24 1.26
MEAS27 5b 3.45 4.15 4.27 4.14 4.03 3.12 1.94 5.29 2.14 0.70 0.33 0.37
MEAS28 5b 3.89 4.63 4.80 4.67 4.43 3.61 2.47 5.82 2.54 0.74 0.28 0.46
MEAS29 5b 2.99 4.16 4.30 4.21 3.96 3.13 1.97 5.36 2.07 1.17 0.14 1.03
MEAS30 5b 3.03 4.27 4.43 4.28 4.10 3.02 2.10 4.73 2.21 1.24 0.01 1.23
MEAS31 5b 2.60 3.79 3.90 3.75 3.71 2.53 1.57 4.20 1.82 1.19 0.07 1.12
MEAS32 5b 2.57 3.72 3.77 3.68 3.71 2.46 1.44 4.13 1.82 1.15 0.11 1.04
MEAS33 5b 2.99 4.20 4.30 4.21 4.10 2.95 1.97 4.66 2.21 1.21 0.04 1.17
MEAS34 5b 3.50 4.69 4.83 4.74 4.50 3.43 2.50 5.19 2.61 1.19 0.07 1.12
MEAS24 5c 2.80 3.88 4.04 3.98 3.60 2.49 1.67 4.11 1.71 1.08 0.31 0.77
MEAS23 5d 2.30 3.37 3.49 3.42 3.20 1.99 1.11 3.54 1.31 1.07 0.31 0.76
MEAS36 5e 1.40 3.02 4.11 1.79 3.17 1.40 1.12 1.34 1.74 1.62 0.00 1.62
MEAS42 5e 3.23 3.52 4.61 2.36 3.60 1.85 1.60 1.53 2.42 0.29 1.38 -1.09
MEAS45 5e 0.60 1.60 2.20 0.77 1.84 -0.02 -0.74 0.28 0.41 1.00 0.62 0.38
MEAS46 5e 2.39 2.26 2.70 1.34 2.73 0.43 -0.26 0.47 1.09 0.14 1.96 -1.82
MEAS51 5e 2.45 1.74 1.94 0.86 2.44 -0.01 -1.02 0.21 0.79 0.71 2.46 -1.75
HTS0107 6a < 0.50 0.48 -1.52 3.96 -1.01 -1.86 -3.46 1.11 -3.23 0.02 2.36 -2.33
MEAS61 6a 0.60 1.71 1.45 2.39 1.29 -0.47 -0.48 0.01 -0.93 1.11 1.07 0.04
MEAS63 6a < 0.50 1.11 0.19 2.49 0.65 -0.67 -1.80 1.36 -1.57 0.61 1.17 -0.57
MEAS64 6a < 0.50 1.26 -0.09 3.37 0.51 -0.82 -2.03 1.28 -1.71 0.76 1.32 -0.56
MEAS65 6a < 0.50 1.68 1.20 2.70 1.15 -0.35 -0.74 0.76 -1.07 1.18 0.85 0.33
MEAS70 6a < 0.50 0.58 -0.66 3.28 -0.88 -1.75 -2.60 0.44 -3.10 0.08 2.25 -2.17
MEAS71 6a < 0.50 1.05 0.93 1.69 0.54 -0.51 -1.01 1.16 -1.68 0.55 1.01 -0.46
MEAS20 6c 2.10 3.60 4.78 1.87 4.17 2.20 2.79 2.47 1.33 1.50 0.10 1.41
MEAS66 9a 3.17 3.07 2.96 2.18 4.08 3.15 2.64 3.32 3.48 0.10 0.02 0.08
MEAS04 9b 1.80 3.20 3.70 2.79 3.11 1.76 2.07 1.90 1.30 1.40 0.04 1.36
MEAS05 9b 0.30 1.74 2.11 1.20 1.92 0.30 0.48 0.31 0.11 1.44 0.00 1.44
MEAS09 9b 1.10 2.24 2.77 1.80 2.13 0.80 1.09 0.78 0.52 1.14 0.30 0.83
MEAS12 9b 2.40 2.82 3.79 1.09 3.57 2.11 2.19 2.38 1.77 0.42 0.29 0.13
MEAS06 9c 2.20 3.25 4.13 2.53 3.09 1.97 2.44 2.20 1.28 1.05 0.23 0.82
MEAS07 9c 0.20 1.33 2.02 0.40 1.56 0.05 0.32 0.08 -0.25 1.13 0.15 0.97
HTS0451 10 1.77 3.68 3.17 4.37 3.48 2.50 2.50 2.09 - 1.91 0.73 1.18
HTS0508 10 2.60 4.44 3.72 5.43 4.17 3.04 3.04 3.14 - 1.84 0.44 1.40
MEAS67 10 1.51 3.63 2.92 4.23 3.75 2.25 2.25 2.95 - 2.12 0.74 1.38
MEAS68 10 1.63 4.02 3.42 4.69 3.93 2.75 2.75 3.54 - 2.39 1.12 1.27
MEAS40 11a 1.66 1.73 2.51 0.44 2.25 0.20 0.48 0.46 -0.33 0.07 1.46 -1.39
MEAS43 11a 3.00 3.54 4.92 1.92 3.77 1.87 2.93 2.11 0.57 0.54 1.13 -0.60
MEAS48 11a 0.87 1.52 2.16 0.25 2.16 0.01 0.21 0.24 -0.42 0.65 0.86 -0.21
HTS1013 11b 2.21 3.60 3.52 4.06 3.23 2.17 2.43 2.01 2.08 1.39 0.04 1.36
MEAS38 12a 2.60 3.23 3.44 2.23 4.02 1.60 2.05 1.92 0.82 0.63 1.00 -0.37
MEAS39 12a 3.60 4.23 4.55 3.18 4.95 2.67 3.17 3.09 1.75 0.63 0.93 -0.30
MEAS62 12a 1.14 1.90 2.04 1.23 2.43 0.33 0.65 0.18 0.16 0.76 0.81 -0.05
MEAS37 12b 1.93 3.32 4.11 2.44 3.41 1.98 2.16 1.75 2.03 1.39 0.05 1.34
MEAS41 12b 3.62 4.82 5.05 4.71 4.69 2.51 3.15 2.15 2.23 1.20 1.11 0.09
MEAS44 12b 3.15 4.55 5.05 4.31 4.29 2.84 3.15 2.77 2.60 1.40 0.31 1.09
MEAS58 12b 1.28 2.14 2.28 2.40 1.74 0.10 0.38 0.64 -0.72 0.86 1.18 -0.32
MEAS59 12b 2.03 3.13 2.99 3.90 2.49 0.72 1.10 1.03 0.03 1.10 1.31 -0.22
MEAS60 12b 3.20 3.59 3.16 4.19 3.41 1.29 1.26 1.66 0.95 0.39 1.91 -1.53
MEAS50 12c 3.95 4.79 5.21 5.04 4.11 2.57 3.23 2.48 2.00 0.84 1.38 -0.55
MEAS52 12c 2.08 3.26 4.00 3.16 2.62 1.14 2.03 1.08 0.31 1.18 0.94 0.24
MEAS56 12c 5.26 6.49 7.22 7.40 4.86 3.70 4.77 3.78 2.55 1.23 1.56 -0.33
MEAS55 12d 5.69 5.41 6.17 5.51 4.54 3.18 4.14 3.16 2.23 0.28 2.51 -2.23
MEAS72 15a 1.49 2.43 3.38 0.77 3.16 1.35 1.50 1.34 1.21 0.94 0.14 0.80
MEAS73 15a 0.83 2.17 2.86 0.46 3.18 0.59 0.98 0.03 0.75 1.34 0.24 1.09
MEAS35 15b < 1.00 1.80 2.41 0.15 2.86 1.06 0.48 0.98 1.72 0.80 0.06 0.74
MEAS53 20a < 0.50 0.78 0.94 0.46 0.93 -0.07 -1.01 1.32 -0.52 0.28 0.57 -0.29
MEAS69 20a 2.22 2.43 2.95 1.89 2.46 1.62 1.57 1.77 1.51 0.21 0.60 -0.39
MEAS21 20b 0.74 2.07 2.06 1.88 2.28 0.99 0.46 1.08 1.45 1.33 0.25 1.08
MEAS49 20c 0.77 3.26 4.03 2.57 3.18 1.46 2.17 0.04 2.16 2.49 0.69 1.80
MEAS13 20d 1.37 2.17 2.33 1.05 3.12 1.72 0.97 1.78 2.41 0.80 0.35 0.45
Mean absolute difference between measured and predicted log P values: 0.95 (NDC), 0.71 (PRF)
• Values highlighted yellow relate to ELOGP prediction values based solely on ClogP v4 due to the failure of AlogP. SOLSTICE ELOGP automatically gives ClogP v4 whenever AlogP predictions fail.
• Values highlighted in grey are generated from upper limit measured log P values that are simply quoted as being below a particular threshold value. For the purposes of comparison, the upper threshold limit is assumed as the measured value.
Table 3.11: Summary of measured and predicted log P values
A summary of the NDC and PRF structure log P prediction accuracy,
reported at a variety of thresholds is given in Table 3.12.
% of successful prediction comparisons made within x units of the measured log P value
Compound form | < 0.5 | < 1.0 | < 2.0 | < 4.0 | % Remainder | Unknown (number of compounds where comparison was not possible)
NDC + | 21.7 | 46.4 | 95.7 | 100 | 0 | 0
PRF + * | 50.7 | 68.1 | 92.8 | 100 | 0 | 0
% Improvement | 29.0 | 21.7 | -2.9 | 0 | 0 |
* Includes calculations for 4 compounds whose ELOGP predictions are based on ClogP only.
+ Also includes calculations for 8 compounds whose measured log P values are estimated at threshold values. See Table 3.11 for details.
Table 3.12: Summary of the accuracy of log P predictions for compounds with measured values
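The threshold percentages in Table 3.12 can be derived from the absolute-difference columns of Table 3.11. The following is a minimal sketch of that calculation, assuming a strict "less than" comparison as the column headings suggest; the function name `percent_within` is illustrative and not part of any tool used in this study, and only the first four rows of Table 3.11 are used as sample data.

```python
def percent_within(abs_errors, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Return {threshold: % of absolute errors strictly below that threshold}."""
    total = len(abs_errors)
    return {t: 100.0 * sum(e < t for e in abs_errors) / total for t in thresholds}

# Abs. diff. columns for the first four rows of Table 3.11
# (HTS0047, HTS0810, HTS0957, HTS0958); sample data only.
ndc_errors = [0.83, 0.41, 0.09, 0.06]
prf_errors = [0.07, 0.20, 1.21, 0.56]

ndc_pct = percent_within(ndc_errors)
prf_pct = percent_within(prf_errors)
```

Applying the same function to all rows of the absolute-difference columns should give figures comparable to those tabulated above.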
The mean improvement in predictions of 0.24 log P units represents a small but
still significant increase in accuracy. Table 3.12 also shows that the benefit is
greatest at the tighter thresholds, i.e. in the proportion of predictions that fall
within 1 log P unit of the measured value. As was observed for pKa predictions, the
effects on different sub-classes of tautomer are varied and often distinctive. For
example, types 5b-d, 9a-c, 10 and 15a/b show significantly better predictions when
PRF tautomers are used instead of their NDC analogues. By contrast, for types 6a,
11a, 12a-d and 20a the opposite is more often true.
For the grey-shaded compounds of types 6a and 20a in Table 3.11, the actual
log P values could conceivably lie either between or outside the values predicted
for their NDC forms and PRFs. The degree of prediction improvement for those
compounds therefore depends on the actual log P value assumed, which makes it
particularly difficult to assess whether converting the NDC form to the PRF has a
positive or negative effect. This point is illustrated by the improvement estimates
obtained for a range of assumed log P values, shown in Table 3.13.
Compound Log P prediction improvement (NDC → PRF)
assuming the actual log P value is…
Ref Type -1.0 -0.5 0 0.5
HTS0107 6a 0.62 -0.38 -1.38 -2.33
MEAS63 6a 1.78 1.44 0.44 -0.57
MEAS64 6a 2.08 1.45 0.45 -0.56
MEAS65 6a 2.03 2.03 1.33 0.33
MEAS70 6a 0.83 -0.17 -1.17 -2.17
MEAS71 6a 1.56 1.54 0.54 -0.46
MEAS53 20a 0.84 0.84 0.71 -0.29
Table 3.13: The variation in log P prediction improvement depending on the actual log P value used for compound types 6a and 20a
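The dependence shown in Table 3.13 follows directly from how the improvement is defined in Table 3.11: the absolute error of the NDC prediction minus that of the PRF prediction. A minimal sketch is given below, using the ELOGP values for HTS0107 from Table 3.11; the function name `improvement` is illustrative only.

```python
def improvement(actual, ndc_pred, prf_pred):
    """Improvement NDC -> PRF: positive when the PRF prediction is closer to `actual`."""
    return abs(ndc_pred - actual) - abs(prf_pred - actual)

# HTS0107 (type 6a): ELOGP predictions of 0.48 (NDC) and -1.86 (PRF), from Table 3.11.
assumed_actual_values = [-1.0, -0.5, 0.0, 0.5]
hts0107 = [round(improvement(a, 0.48, -1.86), 2) for a in assumed_actual_values]
```

For the first three assumed values this agrees with the HTS0107 row of Table 3.13. The last entry may differ from the tabulated figure in the second decimal place, presumably because the table was calculated from unrounded predictions.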
As was found for tautomer type 5e in Table 3.9, Table 3.11 also shows
patterns in prediction improvement that are distinct for particular tautomer sub-
classes. For example, log P predictions for PRF structures stand out as being
typically poorer than those of their NDC analogues for tautomer types 5e, 6a
(apparently), 11a, 12a-d and 20a. “Standard” log P prediction improvements were
also seen for the four compounds of type 10 (1.18-1.40 log P units) and the majority
of type 5b (1.00-1.26).
3.8.2.5 Re-investigating the validity of the structural changes performed by the STT
The fact that some tautomer substructure compounds seem to give better
predictions in their NDC forms than their PRFs may be at first surprising. To
investigate the issue, the tautomer analysis facility of ACD/pKa was used to suggest
the dominant tautomers for each of the sub-classes using a simple example of each.
Its results are summarised in Table 3.14:
(Structure drawings not reproduced; each entry gives the ACD/pKa classification of the tautomer drawn in that column.)
Ref | NDC tautomer | PRF tautomer | ACD/pKa suggested alternative major tautomers
3 | Minor 1 | Minor 2 | Major
5a | Minor | Major | -
5b | Minor | Major | -
5c | Minor | Major | -
5d | Minor | Major | -
5e | Minor | Major 1 | Major 2
6a | Minor | Major | -
6b | Minor | Major | -
6c | Minor | Major | -
9a | Minor | CD major 1 | CD major 2
9b | Minor | CD major 1 | CD major 2
9c | Minor | CD major 1 | CD major 2
10 | Minor | Major | -
11a | Minor | Major 1 | Major 2
11b | Minor | CD major 1 | CD major 2
12a | Minor | Major | -
12b | Minor | Major | -
12c | Minor | Major 1 | Major 2
12d | Minor | Major 1 | Major 2
15a | Minor | Major | -
15b | Minor | Major | -
20a | Minor | CD major 1 | CD major 2
20b | Minor | CD major 1 | CD major 2
20c | Minor 1 | Minor 2 | Major
20d | Minor | CD major 1 | CD major 2
20e | Minor 1 | Minor 2 | Major 1; Major 2
ACD/pKa dominant tautomer predictions:
• “Minor” = Sole suggested minor tautomer
• “Minor 1 / 2” = Minor tautomers suggested independently of each other
• “Major” = Sole suggested major tautomer
• “CD Major 1 / 2” = Suggested conditions-dependent major tautomers of each other
• “Major 1 / 2” = Major tautomers suggested independently of each other
ACD/pKa tautomer predictions made on the basis that A = tertiary sp3 carbon, e.g. t-butyl.
Table 3.14: ACD/pKa major / minor tautomer predictions for example compounds of each substructure type identified in Table 3.8
A comparison of Table 3.14 with Table 3.7 shows that defining more specific
tautomer substructures sometimes increases the overall number of possible
tautomers and sometimes changes which tautomer is predicted to be the “major”
one. This occurred for tautomer structure classes 5 → subclasses 5a/b/c/d/e,
11 → 11a/b, 12 → 12a/b/c/d and 20 → 20a/b/c/d/e.
For example, the PRF of subclass 20b was predicted to be a conditions-dependent
“major” tautomer, whereas the analogous PRF tautomer of subclass 20c was
predicted to be a “minor” one, the major form in this case being neither the NDC
form nor the PRF. The keto substituent in subclass 20c therefore appears to have
an important influence on the position of the tautomeric equilibria.
Subclasses 5e, 11a, 12b, 12c, 12d and 15a also contain tautomeric keto
groups, many of which, according to ACD/pKa, play an active part in the structures
of their “major” tautomers. This may explain why predictions for the PRFs of some
of these subclass compounds are not always an improvement on their NDC forms.
3.8.2.6 Evaluating the predictions of alternative tautomers
So far only the log P and pKa predictions of a compound’s NDC and PRF
tautomers have been investigated. However, ACD/pKa has suggested that alternative
tautomers can sometimes exist. If this is the case then the property predictions
for compounds based on them may be better than those of the NDC or PRF
tautomers. Comparing the property predictions of all the tautomers of a compound
with measured values therefore provided a better means of probing which was the
best description of particular molecules, or at least identifying which tautomer(s)
provided the poorest description. Several of the tautomer sub-structure classes will
now be examined to these ends.
3.8.2.6.1 Substructures 5b and 5d
As well as the NDC and PRF tautomers, for compounds of substructure types
5b, and 5d there is a third plausible (“minor” according to ACD/pKa) tautomer that
places the variable-position hydrogen on the second ring nitrogen. For example see
Figure 3.36 for type 5b:
[Structure drawings not reproduced: NDC tautomer (“minor”), PRF tautomer (“major”) and 3rd tautomer (“minor”) of type 5b]
Figure 3.36
The measured pKa values and the predictions for the three tautomers of each
available compound of these types are shown in Table 3.15.
Compound | Type | Measured pKa | pKa prediction: NDC form | pKa prediction: PRF | pKa prediction: 3rd tautomer
MEAS16 | 5b | 9.50 A | 2.02 A | 5.55 MB | 2.09 MB
MEAS17 | 5b | 6.55 A | 1.13 A | 4.81 MB | --
MEAS18 | 5b | 11.60 A | 4.80 A | 6.36 MB | 3.95 MB
MEAS23 | 5d | 11.30 A | 4.85 A | 5.91 MB | 4.53 MB
-- = Prediction failed or no suitable acidic pKa prediction available. A / MA = Acidic / Most Acidic. MB = Most basic.
Table 3.15: Measured vs. predicted pKa values for the different tautomers of type 5b and 5d compounds
As Table 3.15 shows, difficulty was encountered in obtaining acidic pKa
predictions for the PRF and 3rd tautomers to compare with the acidic pKa values
measured. Though acidic pKa predictions were obtained for their NDC tautomers,
these differed considerably from the measured values. These results thus show that
ACD/pKa has particular difficulties in making accurate predictions for these
subclasses of compounds. Table 3.16 however provides more conclusive evidence of
which tautomer best represents type 5b and 5d structures for log P predictions:
Compound | Type | Measured log P | ELOGP prediction: NDC form | ELOGP prediction: PRF | ELOGP prediction: 3rd tautomer
MEAS16 | 5b | 2.10 | 3.15 | 1.84 | 1.08
MEAS18 | 5b | 1.60 | 2.83 | 1.58 | 0.83
MEAS22 | 5b | 2.15 | 3.28 | 2.02 | 1.26
MEAS25 | 5b | 2.58 | 3.76 | 2.51 | 1.75
MEAS26 | 5b | 1.79 | 3.29 | 2.03 | 1.34
MEAS27 | 5b | 3.45 | 4.15 | 3.12 | 1.70
MEAS28 | 5b | 3.89 | 4.63 | 3.61 | 2.19
MEAS29 | 5b | 2.99 | 4.16 | 3.13 | 1.77
MEAS30 | 5b | 3.03 | 4.27 | 3.02 | 2.26
MEAS31 | 5b | 2.60 | 3.79 | 2.53 | 1.77
MEAS32 | 5b | 2.57 | 3.72 | 2.46 | 1.65
MEAS33 | 5b | 2.99 | 4.20 | 2.95 | 2.13
MEAS34 | 5b | 3.50 | 4.69 | 3.43 | 2.62
MEAS23 | 5d | 2.30 | 3.37 | 1.99 | 1.23
(Highlighted predictions are those closest to the measured value)
Table 3.16: Measured vs. predicted log P values for the different tautomers of type 5b and 5d compounds
Table 3.16 shows how for every compound, the predicted log P for its PRF
structure closely mirrors the measured value. This finding is in agreement with
ACD/pKa’s “major” tautomer prediction for these tautomer sub-classes (Table 3.14).
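The comparison behind Table 3.16 amounts to picking, for each compound, the tautomer whose prediction lies closest to the measured value. A minimal sketch of that selection is shown below for three rows of the table; the helper name `closest_tautomer` and the dictionary layout are illustrative assumptions, not part of the tools used in this study.

```python
def closest_tautomer(measured, predictions):
    """Return the tautomer label whose predicted value is nearest to `measured`."""
    return min(predictions, key=lambda name: abs(predictions[name] - measured))

# Measured log P and ELOGP predictions for three rows of Table 3.16.
table_3_16_rows = {
    "MEAS16": (2.10, {"NDC": 3.15, "PRF": 1.84, "3rd": 1.08}),
    "MEAS27": (3.45, {"NDC": 4.15, "PRF": 3.12, "3rd": 1.70}),
    "MEAS23": (2.30, {"NDC": 3.37, "PRF": 1.99, "3rd": 1.23}),
}

best = {ref: closest_tautomer(measured, preds)
        for ref, (measured, preds) in table_3_16_rows.items()}
```

For these three rows the selected tautomer is the PRF in each case, in line with the observation above.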
3.8.2.6.2 Substructures 5e and 12b
For compounds of substructure 5e, as well as the NDC and PRF tautomers,
there is a third alternative (Figure 3.37) that enolises the keto substituent. According
to ACD/pKa’s predictions, this is potentially a major tautomer, along with the PRF
tautomer.
[Structure drawings not reproduced: NDC tautomer (“minor”), PRF tautomer (“major”) and 3rd tautomer (“major”) of type 5e]
Figure 3.37
As Tables 3.9 and 3.11 showed, the NDC tautomers of the type 5e
compounds often unexpectedly gave more accurate log P and pKa predictions than
did their PRF tautomers. This is in contradiction with the ACD/pKa tautomer
predictions in Table 3.14 where the NDC tautomer was thought to be a “minor” one.
To resolve the issue, pKa and log P predictions for each tautomer of each type 5e
compound were obtained and compared with available measured values. The results
are shown in Tables 3.17 and 3.18.
Compound | Type | Measured pKa | pKa prediction: NDC form | pKa prediction: PRF | pKa prediction: 3rd tautomer
MEAS36 | 5e | 9.20 A | 8.43 MB | 8.68 MA | 4.53 MA
MEAS42 | 5e | 5.50 A | 4.58 MA | 6.53 MA | 4.50 MA
MEAS45 | 5e | 9.10 A | 8.51 MB | 8.73 MA | 4.54 MA
MEAS46 | 5e | 5.40 A | 4.66 MA | 6.59 MA | 4.50 MA
MEAS47 | 5e | 3.60 A | 3.45 MA | 2.51 MA | 4.50 MA
MEAS51 | 5e | 4.80 A | 4.55 MA | 7.02 MA | 4.50 MA
= No suitable acidic pKa found. A / MA = Acidic / Most Acidic. MB = Most basic Highlighted predictions are the closest of those available to the measured value.
Table 3.17: Measured vs. predicted pKa values for the different tautomers of type 5e compounds
Compound | Type | Measured log P | ELOGP prediction: NDC form | ELOGP prediction: PRF | ELOGP prediction: 3rd tautomer
MEAS36 | 5e | 1.40 | 3.02 | 1.40 | 1.63
MEAS42 | 5e | 3.23 | 3.52 | 1.85 | 3.06
MEAS45 | 5e | 0.60 | 1.60 | -0.02 | -0.07
MEAS46 | 5e | 2.39 | 2.26 | 0.43 | 1.35
MEAS51 | 5e | 2.45 | 1.74 | -0.01 | 1.10
Highlighted predictions are those closest to the measured value.
Table 3.18: Measured vs. predicted log P values for the different tautomers of type 5e compounds
For the 5 compounds that are common between them, Tables 3.17 and 3.18
show that the tautomer that gives the predictions closest to both the measured pKa
and log P values for each compound is always the same. The log P and pKa
predictions for each compound’s 3rd tautomer are consistently poorer than for the
NDC or PRF tautomers. This suggests, despite the ACD/pKa’s prediction, it is the
least accurate description of type 5e compounds.
The tables also show that the tautomer that gave the best log P prediction
varied between the NDC and PRF. This appears to indicate that the balance between
which tautomers are major and minor is variable. For example, the varying steric and
electronic effects of different combinations of substituents attached to the
substructure may favour different tautomers (as was found in several examples in
Chapter 1, Section 4.1) or sometimes artificially enhance the predictions of minor
tautomers over major ones.
Given that ACD/pKa predicted that type 5e NDC tautomers would be
“minor” forms, it is a significant result that, for some compounds, structures drawn
as this tautomer gave the most accurate property predictions of all.
One phenomenon that may explain how the NDC tautomer could be stabilised for
these compounds is the intramolecular hydrogen bonding opportunity offered to its
phenolic proton by the carbonyl oxygen of the adjacent keto substituent
(Figure 3.38). Examples in Chapter 1, Section 4.1 show that this arrangement is not
without precedent.
[Structure drawing not reproduced: the intramolecularly hydrogen-bonded NDC tautomer]
Figure 3.38
Other tautomer subclasses where similar intramolecular hydrogen bonding is
possible include 11a, 12c and 12d. Such a conclusion would also provide an
explanation why their NDC tautomers often gave more accurate predictions than
their PRF analogues.
Compounds of type 12b, the 2-pyridone analogue of type 5e, behave
differently. Analysis of the pKa predictions for each of their three
tautomers (Table 3.19) suggests that their PRFs generally give the most accurate
predictions, consistent with ACD/pKa’s major tautomer prediction for this type.
Compound | Type | Measured pKa | pKa prediction: NDC form | pKa prediction: PRF | pKa prediction: 3rd tautomer
MEAS37 | 12b | 10.30 A | 7.73 MB | 9.53 MA | 4.50 MA
MEAS41 | 12b | 6.00 A | 5.03 MA | 5.38 MA | 4.50 MA
MEAS44 | 12b | 8.40 A | 5.50 MB | 8.81 MA | 4.50 MA
MEAS58 | 12b | 6.59 A | 7.96 MA | 5.94 MA | 4.53 MA
MEAS59 | 12b | 5.69 A | 5.58 MA | 5.86 MA | 4.50 MA
= No suitable acidic pKa found. A / MA = Acidic / Most Acidic. MB = Most basic (Highlighted predictions are the closest of those available to the measured value)
Table 3.19: Measured vs. predicted pKa values for the different tautomers of type 12b compounds
The log P prediction data (Table 3.20) shows that NDC tautomers usually
gave the poorest results, consistent with the pKa prediction findings. This indicates
that the NDC tautomer is a good description for type 5e compounds, but very poor
for type 12b compounds.
Compound | Type | Measured log P | ELOGP prediction: NDC form | ELOGP prediction: PRF | ELOGP prediction: 3rd tautomer
MEAS37 | 12b | 1.40 | 3.32 | 1.98 | 2.66
MEAS41 | 12b | 3.62 | 4.82 | 2.51 | 3.86
MEAS44 | 12b | 3.15 | 4.55 | 2.84 | 3.52
MEAS58 | 12b | 1.28 | 2.14 | 0.10 | 1.22
MEAS59 | 12b | 2.03 | 3.13 | 0.72 | 2.07
MEAS60 | 12b | 3.20 | 3.59 | 1.29 | 2.33
(Highlighted predictions are those closest to the measured value)
Table 3.20: Measured vs. predicted log P values for the different tautomers of type 12b compounds
The steric and electronic differences between compounds of types 5e and 12b
are therefore likely to give them different tautomeric equilibrium
positions, affecting which tautomer is the major one. This would also explain, in a
wider sense, why different and distinct trends in prediction improvement were often
seen for different substructure types in Tables 3.9 and 3.11.
3.8.2.6.3 Substructures of type 12a
Three surprising results from the predicted log P data were that the
ELOGP predictions for the 12a compounds (MEAS38, MEAS39 and MEAS62) were poorer
for their PRFs than for their NDC forms. The 12a substructure represents examples of
the relatively simple 2-hydroxypyridine (NDC) / 2-(1H)-pyridone (PRF)
tautomerisation, for which there are no other alternative tautomers and in which the
pyridone form is commonly regarded as the more accurate representation.
It is hard to conceive that the electronic properties of the various phenyl, bromo
and alkyl substituents of these compounds could induce a significant change of
tautomer. It is therefore most likely that their PRF tautomers are still the
dominant ones, but that the bulky trifluoromethyl or phenyl substituent attached to
the pyridone ring immediately adjacent to the nitrogen in each compound artificially
enhances the predictions for the NDC tautomers. This in turn suggests that ELOGP’s
treatment of these substituents lacks a measure of the steric requirements and
preferences of these functional groups.
3.9 A method of investigating tautomer issues not highlighted by the STT
The application of the STT to sets of agrochemical database structures has
given an insight into the type and extent of tautomer misrepresentation issues.
However, it has provided no indication of what other types of tautomer it encountered
but left unchanged or ignored. One method of probing this issue, with the aim of
identifying any further tautomeric substructures, was the analysis of available CHI
(Chromatographic Hydrophobicity Index) data for compounds from the HTS dataset.
3.9.1 Analysis of CHI data
In this dataset, 122 compounds had CHI values at 3 pHs, and of these 22 were
found to contain substructures where tautomerism was an issue. Of these 22
compounds, only 4 (HTS0451, HTS0508, HTS0810 and HTS1364) had previously
been identified and had their tautomer form changed by the STT. The remaining 18
compounds (15% of those with measured CHI values) were previously undiscovered.
Since the number of compounds for which CHI data was available was relatively
small, it is probable that of the remaining HTS dataset there is a sizeable number of
other tautomeric compounds that also go unchanged or unnoticed by the STT.
The 18 newly identified compounds were classified into a series of further
tautomer substructural classes and summarised in Table 3.21. In the case of type 34a
they have been supplemented by two additional examples identified from the PM
dataset.
Type | Examples | No of possible tautomers | Notes
26a | HTS0320 / HTS0321 | 6 | -
26b | HTS0526 / HTS0527 | 3 | Three fewer tautomers than 26a due to a nitrogen being tertiary rather than secondary
27a | HTS0246 / HTS0381 | 2 | -
27b | HTS0479 / HTS0480 | 3 | One more tautomer than 27a due to one less plane of symmetry
28 | HTS1368 | 3 | -
29 | HTS1418 | 3 | -
30 | HTS1499 | 5 | -
31 | HTS1505 | 3 | -
32 | HTS1321 / HTS1322 | 3 | -
33 | HTS1335 | 2 | -
34a | HTS1014 / HTS1015 / PL1052 (Mesotrione) / PL1434 (Sulcotrione) | ≥ 4 | Drawn in tri-ketone form
34b | HTS1326 | > 4 | Similar to 34a but drawn in a mono-enol / di-ketone form
Table 3.21: The additional tautomer substructure types identified from the HTS dataset by the analysis of CHI data
For the majority of these compounds the prototropic tautomerisations
involved were of the already familiar OH → NH, OH → OH or NH → NH types.
Only types 32, 34a and 34b were keto-enol type tautomers. For example, Sulcotrione
(PL1434) is represented in the PM dataset in its tri-ketone form; however, the C-H
bond centred between its three ketone groups is, because of their presence,
particularly acidic. As a result, one or both of the compound’s enol forms are likely
to be major tautomers (Figure 3.39).
[Structure drawings not reproduced: PL1434 (Sulcotrione) in its tri-ketone form and its two enol forms (“and / or”)]
Figure 3.39
3.9.2 Comparison of measured and predicted log P and pKa data for “new” tautomeric compounds
In order to gauge the differences in log P and pKa predictions between the
different tautomers of the various types described in Table 3.21, a series of measured
value vs. predicted value comparisons for each tautomer were carried out for those
compounds where at least one piece of measured data was available. A summary of
these findings is shown in Table 3.22.
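Comp’d ref | Type | Taut’ ref | pKa meas’d (type) | pKa pred’d (type) | Log P meas’d | Log P pred’d
HTS0526 | 26b | NDC | - | - | 5.83 | 4.99
HTS0526 | 26b | Alt 1 | - | - | 5.83 | 4.49
HTS0526 | 26b | Alt 2 | - | - | 5.83 | 3.69
HTS0527 | 26b | NDC | - | - | 4.94 | 3.98
HTS0527 | 26b | Alt 1 | - | - | 4.94 | 3.51
HTS0527 | 26b | Alt 2 | - | - | 4.94 | 2.83
HTS0246 | 27a | NDC | 7.82 * MB | 4.50 MB | 5.16 | 5.19
HTS0246 | 27a | Alt 1 | 7.82 * MB | 4.34 MB | 5.16 | 5.28
HTS0381 | 27a | NDC | 8.03 + MB | 4.12 MB | - | -
HTS0381 | 27a | Alt 1 | 8.03 + MB | 4.50 MB | - | -
HTS0479 | 27b | NDC | 7.37 + MB | 4.03 MB | - | -
HTS0479 | 27b | Alt 1 | 7.37 + MB | 4.51 MB | - | -
HTS0479 | 27b | Alt 2 | 7.37 + MB | 4.28 MB | - | -
HTS0480 | 27b | NDC | 7.94 + MB | 4.03 MB | - | -
HTS0480 | 27b | Alt 1 | 7.94 + MB | 4.51 MB | - | -
HTS0480 | 27b | Alt 2 | 7.94 + MB | 4.28 MB | - | -
HTS1499 | 30 | NDC | 4.43 MA | 2.10 MA | 2.73 | 1.46
HTS1499 | 30 | Alt 1 | 4.43 MA | 2.10 MA | 2.73 | 1.46
HTS1499 | 30 | Alt 2 | 4.43 MA | 1.20 MA | 2.73 | 2.57
HTS1499 | 30 | Alt 3 | 4.43 MA | 1.74 MA | 2.73 | 2.16
HTS1499 | 30 | Alt 4 | 4.43 MA | 2.10 MA | 2.73 | 1.30
PL1052 (Mesotrione) | 34a | NDC | 3.12 MA | 2.77 MA | - | -
PL1052 (Mesotrione) | 34a | Alt 1 | 3.12 MA | 4.50 MA | - | -
PL1052 (Mesotrione) | 34a | Alt 2 | 3.12 MA | 4.50 MA | - | -
PL1434 (Sulcotrione) | 34a | NDC | 3.13 MA | 2.87 MA | -5.00 | 0.79
PL1434 (Sulcotrione) | 34a | Alt 1 | 3.13 MA | 4.50 MA | -5.00 | 0.28
PL1434 (Sulcotrione) | 34a | Alt 2 | 3.13 MA | 4.50 MA | -5.00 | 0.64
• Predictions in bold are those closest to the measured value
• - = no measured data available
• * = average of 4 measurements
• + = average of 3 measurements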
Table 3.22: Measured vs. predicted pKa and log P values for the different tautomers of compounds identified from HTS dataset by the analysis of CHI data
As the log P predictions for different tautomers of the same molecule are
often similar to each other, it is difficult with the limited data available to conclude
which tautomer best describes each structure. Importantly, however, the alternative
tautomers to the NDC form in which each compound is represented in the database do
not give consistently poorer predictions and so cannot be conclusively ruled out.
Similarly, the predictions for the NDC representation cannot be ruled out as
noticeably poorer on this limited evidence either. Such findings are likely to mean
that these compounds have multiple dominant tautomers, between which the equilibria
are sensitive to the nature of each individual molecule.
As Table 3.22 shows, even the closest pKa predictions were often still very
different from the measured values. It can be concluded that ACD/pKa had
particular difficulty with the poorly defined and so inherently more complex tautomer
issues found in these molecules. So long as this is the case, establishing ground
rules for which tautomer should be used for the most accurate pKa prediction in such
cases is likely to remain problematic.
4 Evaluating tautomeric misrepresentation in a larger dataset
4.1 Introduction
The methodology to assess tautomeric misrepresentation in a dataset of
compounds was developed on two relatively small compound sets, primarily for ease
of data handling and examination. However in order to gain a better appreciation of
the issue in a broader context it was considered important to apply it to a larger
dataset. With this aim in mind a 10,000 compound sample was gathered from the
Interim Vendor Database (IVDB) that is held and maintained at Syngenta.
4.2 Sampling of compounds
The IVDB comprises the structures of approximately 14 million compound
entries gathered from the compound catalogues of various chemical suppliers (e.g.
Aldrich) and also vendors of samples used in biological screening. It is therefore a
sizable collection that provides a window on the “chemical universe” outside of
Syngenta. 20% of IVDB compound records were converted into Daylight SMILES
format using the structure format inter-conversion tool dbtranslate of UNITY
(Tripos, 2004) and the set then screened to remove duplicates. An implementation of
the Knuth sampling algorithm (Knuth, 1998) was then applied to the remaining
approximately 1.6 million structures to obtain a 10,000 compound set.
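The text does not state which of Knuth's sampling algorithms was implemented. As an illustration only, the sketch below assumes selection sampling (Knuth's "Algorithm S"), which draws exactly n records from a list of known length N in a single pass, giving every subset of size n an equal chance of selection. The function name `selection_sample` and its parameters are hypothetical.

```python
import random

def selection_sample(records, sample_size, rng=None):
    """Draw exactly `sample_size` records in one pass, preserving input order."""
    rng = rng or random.Random()
    total = len(records)
    if not 0 <= sample_size <= total:
        raise ValueError("sample_size must be between 0 and len(records)")
    sample = []
    for examined, record in enumerate(records):
        still_needed = sample_size - len(sample)
        if still_needed == 0:
            break
        # Select with probability still_needed / (records not yet examined).
        if rng.random() * (total - examined) < still_needed:
            sample.append(record)
    return sample

# Toy example: 10 records from 100 (the study drew 10,000 from about 1.6 million).
demo = selection_sample(list(range(100)), 10, random.Random(2004))
```

Because the selection probability rises to 1 whenever the number of records left equals the number still needed, the function always returns exactly the requested number of records.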
This set was screened and 192 salts, which as multiple-component structures
would otherwise fail outright with ACD/pKa, were removed. Canonicalisation of the
SMILES of the remaining 9,808 compounds using the Unique Structures function of
SOLSTICE removed a further 336 compounds whose SMILES, as generated by dbtranslate
(Tripos, 2004), were invalid, together with 7 further duplicate SMILES. These
duplicates were presumably not picked up earlier because different SMILES forms had
been used. The 336 compounds “lost” during the canonicalisation stage resulted from
Tripos (dbtranslate) and Daylight (canonicalisation rules) using slightly different
SMILES drawing conventions. As the majority of the remaining 9,465 compounds came
without a compound reference number, a generic reference “djpxxxx” (where
xxxx = 0001-9465) was assigned to each. This set of compounds will commonly be
referred to as the IVDB dataset.
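The clean-up steps described above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually run in SOLSTICE: multi-component structures are detected by the "." separator that SMILES uses between disconnected components, and `canonicalise` is a hypothetical callable standing in for the canonicalisation tool, returning None for SMILES it cannot parse.

```python
def clean_smiles(smiles_list, canonicalise):
    """Drop multi-component and invalid SMILES, remove duplicates, assign djp references."""
    kept, seen = [], set()
    for smiles in smiles_list:
        if "." in smiles:            # multi-component structure, e.g. a salt
            continue
        canonical = canonicalise(smiles)
        if canonical is None or canonical in seen:
            continue                 # invalid SMILES or duplicate structure
        seen.add(canonical)
        kept.append(canonical)
    return {f"djp{i:04d}": s for i, s in enumerate(kept, start=1)}

# Toy stand-in for a canonicaliser: maps two spellings of ethanol to one form.
toy_canonical_forms = {"CCO": "CCO", "OCC": "CCO", "c1ccccc1": "c1ccccc1"}
cleaned = clean_smiles(["CCO", "OCC", "[Na+].[Cl-]", "c1ccccc1", "C(("],
                       toy_canonical_forms.get)

# Bookkeeping for the counts quoted in the text.
remaining = 10000 - 192 - 336 - 7
```

In the toy example the salt, the unparseable string and the second spelling of ethanol are all dropped, leaving two referenced structures.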
4.3 Dataset analysis
This compound set was analysed to determine the degree of overlap between
it and the HTS, PM and measured value compound sets studied in Chapter 3. It was
found that only eight of its compounds had previously been studied, six from the
HTS set (HTS1705, HTS1873, HTS2032, HTS2044, HTS2189 and HTS2432) and
two from the PM dataset (PL0030 (1-Naphthylacetic acid) and PL1236
(Phenthoate)).
SMILES files for both the Native Drawing Convention (NDC) and
Physiologically Relevant Form (PRF) of the dataset compounds were then prepared
and log P (ELOGP), solubility (ESOL) and pKa (ACD/pKa) prediction jobs run for
each. Of the 9,465 compounds involved, 9,100 successful log P and solubility
prediction comparisons between the Native Drawing Convention (NDC) forms and
Physiologically Relevant Forms (PRFs) of the dataset were obtained (96.1% of
compounds, the same set for both property predictions). Similarly, successful pKa
prediction comparisons were made on 7,461 occasions (78.8% of compounds).
Indexing of the successful prediction results allowed each compound to be
classified according to whether the Structure Transformation Tool (STT) changed its
structure and whether a change in property prediction occurred between its NDC and
PRF forms. A summary of the effect of the STT on the IVDB dataset for each
property prediction is shown in Table 4.1.
Property | Changed value? | Changed structure? (NDC → PRF): No | Changed structure? (NDC → PRF): Yes
ELOGP | No | 8027 (88.2) | 943 (10.4)
ELOGP | Yes | 0 (0) | 130 (1.4)
ESOL | No | 8027 (88.2) | 943 (10.4)
ESOL | Yes | 0 (0) | 130 (1.4)
pKa | No | 6590 (88.3) | 784 (10.5)
pKa | Yes | 0 (0) | 87 (1.2)
(Numbers represent actual numbers of compounds for which the full set of prediction data was available to allow accurate interpretation. The figures in brackets are the corresponding percentages of the total of those compounds)
Table 4.1: Classification of changes caused to the IVDB dataset compounds by the STT.
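The bracketed percentages in Table 4.1 are each count divided by the number of successful prediction comparisons for that property (9,100 for ELOGP and ESOL, 7,461 for pKa). A minimal sketch of the calculation is given below; the function name and the key labels are illustrative only.

```python
def crosstab_percentages(counts):
    """Convert category counts to percentages of their combined total (1 d.p.)."""
    total = sum(counts.values())
    return {key: round(100.0 * n / total, 1) for key, n in counts.items()}

# Keys are (changed value?, changed structure?) as in Table 4.1.
elogp_counts = {("no", "no"): 8027, ("no", "yes"): 943,
                ("yes", "no"): 0, ("yes", "yes"): 130}
pka_counts = {("no", "no"): 6590, ("no", "yes"): 784,
              ("yes", "no"): 0, ("yes", "yes"): 87}

elogp_pct = crosstab_percentages(elogp_counts)
pka_pct = crosstab_percentages(pka_counts)
```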
The percentages of compounds that fell into each category are very similar to
the distributions obtained in Table 3.1 for both the HTS and PM datasets (Chapter 3,
Section 4). This confirms that the pattern of structure misrepresentation highlighted
by the STT in this set, a sample drawn from a wider chemical context, is not
significantly different from that found within agrochemical-related compound
collections.
The 943 compounds whose log P and solubility predictions were both
unchanged despite a change in structure could all be attributed to a nitro group
changing hybrid form only. 747 of the 784 compounds whose pKa prediction was
unchanged despite a change in structure could be attributed to the same cause. The
remaining 37 of these compounds were all of substructure 21 (Figure 4.1) and
underwent the same tautomer change, the pKa predictions for both tautomers being
coincidentally the same. PL1003 (Kinetin), PL1612 (Zeatin) and PL0083
(6-Isopentenylaminopurine) (Figure 3.16 in Chapter 3, Section 4), which belong to
the same tautomer class in the PM dataset, also showed the same effect.
[Structure drawings not reproduced: substructure 21 before and after the tautomer change made by the STT]
Figure 4.1
Of the 130 compounds whose structures were changed by the STT and both
the log P and solubility prediction also changed, 25 were due to a non-tautomer
change in structure:
• Nitro group change of hybrid form (8 examples)
• Nitroso group change of hybrid form (6 examples)
• Protonation and / or deprotonation of heteroatoms only (11 examples)
The remaining 105 compounds underwent a change in tautomer. Of the 87
compounds whose structures were changed by the STT and whose pKa prediction also
changed, 86 were a subset of the 105 tautomeric compounds identified from the log P
predictions. The remaining compound, djp2181 (Figure 4.2), was also tautomeric but was not
included in the log P comparison because AlogP failed to give a value for its NDC
tautomer. This was due to it containing an Nsp3-Nsp3 bond, which, as discussed in
Chapter 3, Section 6.2.1, is a common “problem” substructure for AlogP.
Figure 4.2 (structure diagrams of the NDC and PRF tautomers of djp2181)
Analysis of log P prediction failures showed that this compound was the only
tautomeric one so affected. Analysis of the pKa prediction failures identified three
compounds where a change in structure had occurred, two of which were tautomeric
and both of which were also positively identified from the log P prediction
comparisons.
In total this means 120 compounds underwent a change of tautomer. Table 4.2
shows a summary of the structural types and corresponding number of compounds
that were found, together with the change made by the STT. In addition to those
tautomer types already highlighted from the previous datasets, 7 additional classes
have been defined and are included along with the ACD/pKa “major” and “minor”
tautomer predictions for each substructure, assuming the A-groups are methyls in
each case.
(Structure diagrams are not reproduced here. The “Change” column summarises the heteroatom labels shown in the NDC and PRF diagrams.)

No   Change (NDC → PRF)                                  NDC substructure   PRF substructure   Number of instances encountered
3    OH → O, N → NH                                      Minor 1            Minor 2            3
5    OH → O, N → NH (two substituents “Not OH”)          Minor              Major              5
9    OH → O, N → NH                                      Minor              CD Major           3
12   OH → O, N → NH (one substituent “Not OH”)           Minor              Major              14
14   OH → O, N → NH                                      Minor              Major              1
17   SH → S, N → NH                                      Minor              Major              29
19   SH → S, N → NH                                      Minor              Major              2
20   OH → O, N → NH                                      Minor              CD Major           21
21   NH shifts to a different ring N atom                CD Major 1         CD Major 2         15 + 14 *
35   SH → S, N → NH                                      Minor              Major              3
36   SH → S, N → NH                                      Minor              Major              1
37   OH → O, N → NH                                      Minor              Major              3
38   OH → O, N → NH                                      Minor              Major              3
39   SH → S, N → NH                                      Minor              Major              1
40   SH → S, N → NH                                      Minor              Major              1
41   O → OH, NH → N                                      Major              Minor              1 +

A = Any group (not H when attached to a heteroatom)
* = Compounds where structure changed but pKa prediction didn’t
+ = Identified by examining list of compounds where valid log P prediction comparison was not possible but where a change of structure was registered
ACD/pKa tautomer predictions (assuming A = Me):
“Minor” = Sole predicted minor tautomer
“Minor 1/2” = Predicted minor tautomers suggested independently of each other
“Major” = Sole predicted major tautomer
“CD Major 1/2” = Suggested conditions-dependent major tautomers of each other
Table 4.2: ACD/pKa major / minor tautomer predictions for example compounds of each substructure type identified from the IVDB dataset
The additional tautomer types identified are largely similar in nature to those
already defined and involve prototropic shifts between O and N, or S and N atoms.
Type 32 provides the only case of all the types where the STT rules applied appear to
convert a “major” tautomer into a “minor” one.
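As a quick consistency check, the instance counts in Table 4.2 can be summed and compared with the 120 tautomeric compounds reported above. The sketch below simply transcribes the counts from the table. Treating the seven newly defined classes as those numbered 35 to 41 is an assumption inferred from the numbering, not something stated explicitly in the text.

```python
# Instances per substructure type, transcribed from Table 4.2.
# Type 21 combines its two footnoted groups (15 + 14).
instances = {
    3: 3, 5: 5, 9: 3, 12: 14, 14: 1, 17: 29, 19: 2, 20: 21,
    21: 15 + 14,
    35: 3, 36: 1, 37: 3, 38: 3, 39: 1, 40: 1, 41: 1,
}

total_tautomeric = sum(instances.values())

# Assumption: the newly defined classes are those numbered 35 and above.
new_types = sorted(n for n in instances if n >= 35)
new_type_instances = sum(instances[n] for n in new_types)

# Largest class size and the types that reach it.
largest = max(instances.values())
most_common = sorted(n for n, count in instances.items() if count == largest)
```

On these figures, types 17 and 21 are the most frequently encountered, and the newly defined classes account for only a small share of the total.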
5 Conclusions and further work
5.1 Conclusions
Tautomerism is a widely recognised phenomenon in heterocyclic chemistry
which has the potential to present major issues to computational tools that predict
physical properties such as lipophilicity and acid-base ionisation constants.
This project developed and tested a methodology for assessing tautomer
misrepresentation and its effect on the prediction of solubility (log Sw), lipophilicity
(log P) and acid-base ionisation constants (pKa). Two moderate-sized
(~1,300/~2,600) agrochemical related test sets and a larger (~9,500) publicly
available set were used to do this. A Structure Transformation Tool (STT) identified
compounds drawn in a “wrong” Native Drawing Convention (NDC) tautomer and
converted them into a “right” form - one considered to be most likely at pH7 - a
Physiologically Relevant Form (PRF).
Analysis of the datasets showed that the STT made no change to the structure
of approximately 90% of compounds, and that only 1-2% of compounds changed tautomer form. This
indicates that the tautomer misrepresentation issue is relatively minor and is not
significantly different for agrochemicals than for any other class of compounds. The
STT changed the predicted charge distribution at pH7 for less than 1% of compounds
in each test set. The charge predictions of only ~1% of compounds
changed when the pH range used to predict their corresponding pKa values was
narrowed from 0-14 to 2-10.
For compounds whose structure the STT changed, the absolute change to log
P predictions was typically in the range 0-2 with a mean value of ~1. For solubility
(log Sw) this range was 0-2 (mean ~1) and for pKa it was 0-4 (mean ~2.5). The
datasets contained approximately 40 distinct tautomer substructure types, most often
based on thiol- or hydroxy- pyridines, pyrimidines, pyrazines, 1,3,5-triazines,
imidazoles and 1,2,4-triazoles. The most common effect of the STT was to convert
each to a thione or ketone analogue. In the majority of cases the ACD/pKa tautomer
prediction tool confirmed that the STT turned “minor” tautomers of each
substructure into “major” ones.
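The ranges and means quoted above are summaries of the absolute difference between the NDC and PRF predictions for each changed compound. A minimal sketch of that summary is given below. The function name and the four pairs of log P values are hypothetical and serve only to illustrate the calculation.

```python
from statistics import mean


def absolute_change_summary(ndc_values, prf_values):
    """Summarise |PRF - NDC| prediction changes as (min, max, mean)."""
    if len(ndc_values) != len(prf_values):
        raise ValueError("NDC and PRF value lists must be the same length")
    changes = [abs(prf - ndc) for ndc, prf in zip(ndc_values, prf_values)]
    return min(changes), max(changes), mean(changes)


# Hypothetical log P predictions for four compounds (illustrative values only).
ndc_logp = [1.0, 2.5, 0.5, 3.0]
prf_logp = [0.0, 2.0, 2.5, 2.5]

smallest, largest, average = absolute_change_summary(ndc_logp, prf_logp)
```

The same function applies unchanged to log Sw or pKa predictions, since only the paired values differ.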
A comparison of the predicted and measured log P data for the limited
numbers of tautomeric compounds with measured values showed that the mean
improvement in log P predictions due to the STT was 0.24 log units. This
represented an approximate 30% improvement in the proportion of predictions made
within 0.5 log units of measured log P values. For tautomeric compounds with pKa
measurements the mean improvement was 2.80 log units. In real terms this equates to
a far more substantial improvement in predictions, with 40% more being made
within 0.5 log units of measured values. These findings also validate the “minor” /
“major” tautomer predictions made by ACD/pKa.
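The two measures used in this comparison can be sketched as follows, assuming that “improvement” means the reduction in mean absolute error between predicted and measured values, and that the second measure is the fraction of predictions falling within 0.5 log units of the measurement. The exact definitions used in Chapter 3 may differ, and all names and data values below are hypothetical.

```python
def mean_abs_error(predicted, measured):
    """Mean absolute difference between predicted and measured values."""
    return sum(abs(p - m) for p, m in zip(predicted, measured)) / len(measured)


def fraction_within(predicted, measured, tolerance=0.5):
    """Fraction of predictions lying within `tolerance` of the measurement."""
    hits = sum(1 for p, m in zip(predicted, measured) if abs(p - m) <= tolerance)
    return hits / len(measured)


# Hypothetical measured and predicted log P values (illustrative only).
measured = [1.0, 2.0, 3.0, 4.0]
ndc_pred = [2.0, 2.25, 4.0, 4.5]
prf_pred = [1.25, 2.25, 3.5, 5.0]

# Positive values mean the PRF predictions are closer to measurement.
mean_improvement = mean_abs_error(ndc_pred, measured) - mean_abs_error(prf_pred, measured)
gain_within_half_unit = fraction_within(prf_pred, measured) - fraction_within(ndc_pred, measured)
```

In this made-up example the PRF prediction for the last compound is worse than the NDC one even though the averages improve, which mirrors the substructure-specific exceptions discussed below.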
Comparing property predictions with measured values showed that different
substructure types were affected by the STT in different ways. In fact for some
substructures, predictions for their PRFs were actually poorer than for their NDC
forms. In many of these cases intramolecular hydrogen bonding to a keto substituent
could explain why conventionally “minor” tautomers were stabilised as “major”
ones. Which tautomer gave the best predictions for compounds of a particular class
was not always clear-cut, however. In these situations the steric and electronic effects
of different substituents seemed to influence the balance of equilibria between
tautomers. Such exceptions show that the STT substructure definitions and structure-changing
rules are currently too generic and not appropriate in all the circumstances to
which they are applied.
A series of structural features were identified as commonly causing the
various log P and pKa prediction tools to fail to predict a value. In particular
Nsp3-Nsp3 bonds (AlogP), net charges (ACDlogP) and simply “no values in range”
(ACD/pKa) were found to recur most often. For ~5% of compounds in one test set,
AlogP’s parsing of SMILES was found to change according to which SMILES
variant was used. For consistency, all SMILES were therefore canonicalised to
Daylight conventions before predictions were made.
The available CHI (Chromatographic Hydrophobicity Index) data for one of
the test sets identified 20 compounds where a “wrongly” drawn tautomer issue had
not been addressed by the STT. In these compounds a further 12 tautomer
substructure types were also identified. Of these, 4 tri-ketones (for example
Mesotrione) were particularly important omissions. Therefore, although the STT is
largely effective within its current configuration, it still lacks definitions for many
other important substructural types.
5.2 Further work
• The methodology was developed and applied to only medium sized datasets.
It would therefore be beneficial to test larger (e.g. 100k) ones for tautomer
misrepresentation issues, especially those with measured data.
• The steric and/or electronic influence of substituents seems to be important in
determining the major / minor tautomer balance between structurally similar
compounds. A more detailed examination of their effects on predictions
would help clarify their role.
• Does the canonicalisation of SMILES significantly improve or degrade log P
predictions? Are the benefits of prediction consistency outweighed by
poorer predictions overall?
• The problems AlogP has with uncanonicalised SMILES and its issues with
particular substructures suggest it requires further development. It would also
be worthwhile investigating a more reliable atom-based tool to replace AlogP
within ELOGP. In the immediate term, an automatic canonicalisation of
SMILES by ELOGP would be beneficial.
• AlogP and ACDlogP currently handle pairs of resonance hybrids such as
nitroso and nitro inconsistently. Program development is required so that
hybrid pairs are always identified as equivalent representations, prediction
failure rates are reduced and prediction consistency is improved.
• There is considerable scope for expanding the structure-changing rules used
by the STT. This could include identifying configurations of existing
substructures where intramolecular hydrogen bonding can occur (Chapter 3,
Section 8.2.6.3) as well as introducing new substructure types such as tri-
ketones (Chapter 3, Section 9.1). A study of the effect of additional rules on
the property prediction benefits of the STT would also be important before
fully implementing them.
• There are alternative means of probing datasets for tautomeric compounds
missed by the STT. Unusually large differences between measured values and
property predictions could help identify new cases, for example.
• The implementation of an STT that performs the reverse transformations to
Leatherface would also help identify a wider range of “missing” tautomer
substructures, not just those already in the “right” form.
• Other types of measured data could also indicate a compound’s major
tautomer form when log P or pKa data is unavailable. For example 13C NMR
or IR spectroscopy will differentiate between “4-hydroxypyridine” and “4-
(1H)-pyridone” tautomers. Predicted vs. measured spectral comparisons
could provide a new approach to the tautomer misrepresentation issue.
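The suggestion above of using unusually large differences between measured and predicted values to find tautomeric compounds missed by the STT could be sketched as a simple residual filter. The function name, the compound identifiers, the data values and the threshold of 2.0 log units are all hypothetical. A flagged compound would still need manual inspection, since a large residual can have causes other than tautomer misrepresentation.

```python
def flag_large_residuals(records, threshold=2.0):
    """Return IDs of compounds whose |measured - predicted| exceeds threshold.

    `records` is an iterable of (compound_id, measured, predicted) tuples.
    """
    return [
        compound_id
        for compound_id, measured, predicted in records
        if abs(measured - predicted) > threshold
    ]


# Hypothetical (id, measured log P, predicted log P) records.
records = [
    ("cmpd-A", 1.2, 1.5),
    ("cmpd-B", 0.3, 3.1),
    ("cmpd-C", 2.0, 1.0),
]

candidates = flag_large_residuals(records)
```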
135
References
Accelrys (2004) “Accelrys: DIVA”. Accelrys [Online]
http://www.accelrys.com/products/diva/index.html
[Accessed 15 August 2004]
ACD (2004a). “ACD/logP DB: Overview”. Advanced Chemistry Development /
Labs. [Online] http://www.acdlabs.com/products/phys_chem_lab/logp/
[Accessed 15 August 2004]
ACD (2004b). “ACD/pKa DB: Overview”. Advanced Chemistry Development /
Labs. [Online] http://www.acdlabs.com/products/phys_chem_lab/pka/
[Accessed 15 August 2004]
ACD (2004c). “Physico-Chemical Laboratory”. Advanced Chemistry Development /
Labs. [Online] http://www.acdlabs.com/products/phys_chem_lab/
[Accessed 15 August 2004]
AGENT 2 (2004). “AGENT 2.0: Advanced Creator of Tautomers”. Swiss Federal
Institute of Technology Zurich. [Online]
http://www.pharma.ethz.ch/pc/Agent2/ [Accessed 25 May 2004]
Beak, P., Fry, F.S., Lee, J. & Steele, F. (1976). “Equilibration studies - protomeric
equilibria of 2-hydroxypyridines and 4-hydroxypyridines,
2-hydroxypyrimidines and 4-hydroxypyrimidines, 2-mercaptopyridines and
4-mercaptopyridines, and structurally related compounds in the gas-phase”.
Journal of the American Chemical Society, 98 (1), 171-179.
136
Bradshaw, J.S., Chamberlin, D.A., Harrison, P.E., Wilson, B.E., Arena, G., Dalley,
N.K., Lamb, J.D., Izatt, R.M., Morin, F.G. & Grant, D.M. (1985). “Proton-
Ionizable Crown Compounds. 1. Synthesis, Complexation Properties, And
Structural Studies Of Macrocyclic Polyether Diester Ligands Containing A
Triazole Subcyclic Unit”. Journal of Organic Chemistry, 50 (17), 3065-3069.
Bradshaw, J.S., Nielson, R.B., P.-K. Tse, Arena, G., Wilson, B.E., Dalley, N.K.,
Lamb, J.D., Christensen, J.J. & Izatt, R.M. (1986). “Proton-Ionizable Crown
Compounds. 4. New Macrocyclic Polyether Ligands Containing A Triazole
Subcyclic Unit”. Journal of Heterocyclic Chemistry, 23 (2), 361-368.
Brandstetter, H., Grams, F., Glitz, D., Lang, A., Huber, R., Bode, W., Krell, H.W. &
Engh, R.A. (2001). “The 1.8-angstrom crystal structure of a matrix
metallaproteinaise 8-barbiturate inhibitor complex reveals a previously
unobserved mechanism for collagenase substrate recognition”. Journal of
Biological Chemistry, 276, 17405-17412.
Briggs, G.G. (1997). “Predicting the uptake and movement of agrochemicals from
physical properties”. SCI Meeting on the uptake of agrochemicals and
pharmaceuticals, London, UK. December 1997, presentation.
Briggs, G.G., Desbordes, P. & Genix, P. (2002). “Are there limits to the physical
properties of fungicides?”. 10th IUPAC International Congress on the
Chemistry of Crop Protection, Basel, Switzerland. August 2002, poster.
137
Chiang, Y., Kresge, A.J. & Schepp, N.P. (1989). “Temperature coefficients of the
rates of acid-catalyzed enolization of acetone and ketonization of its enol in
aqueous and acetonitrile solutions - Comparison of thermodynamic
parameters for the keto-enol equilibrium in solution with those in the gas-
phase”. Journal of the American Chemical Society, 111 (11), 3977-3980.
Civcir, P.U. (2000). “A theoretical study of tautomerism of cytosine, thymine, uracil
and their 1-methyl analogues in the gas and aqueous phases using AM1 and
PM3 methods”. Journal of Molecular Structure – Theochem, 532, 157-159.
Civcir, P.U. (2001). “A theoretical study of 2,6-dithioxanthine in the gas and aqueous
phases using AM1 and PM3 methods”. Journal of Molecular Structure –
Theochem, 572, 5-13.
Clarke, E.D. (2001). “Physico-Chemical Profiling in the Agrochemical Industry”
Sirius User Meeting 2002, Measurement and Beyond, October 2002,
Brighton, presentation.
Clarke, E.D. & Delaney, J.S. (2003). “Physical and Molecular Properties of
Agrochemicals: An Analysis of Screen Inputs, Hits, Leads, and Products”.
Chimia, 57 (11), 731-734.
Clarke, E.D., Draper, E., Holliday, J.D. & Mullier, G.W. (2004). “ELOGP:
Improving the Prediction of Log P Octanol for Agrochemicals.”
UK-QSAR and Chemoinformatics Group Spring 2004 Meeting, April 2004,
Liverpool, poster.
138
Daylight (2004a). “SMILES Tutorial”. Daylight Chemical Information Systems Inc.
[Online] http://www.daylight.com/dayhtml/smiles/smiles-intro.html
[Accessed 15 August 2004]
Daylight (2004b). “Daylight Theory: SMARTS”. Daylight Chemical Information
Systems Inc. [Online]
http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html
[Accessed 15 August 2004]
Daylight (2004c). “CLOGP Reference Manual”. Daylight Chemical Information
Systems Inc. [Online] http://www.daylight.com/dayhtml/doc/clogp/
[Accessed 15 August 2004]
Daylight (2004d). “Daylight Theory: SMILES”. Daylight Chemical Information
Systems Inc. [Online]
http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html
[Accessed 15 August 2004]
Daylight (2004e). “SMILES Toolkit 4.8”. Daylight Chemical Information
Systems Inc. [Online] http://www.daylight.com/products/smiles_kit.html
[Accessed 15 August 2004]
Delaney, J.S. (2004). “ESOL: estimating aqueous solubility directly from molecular
structure”. Journal of Chemical Information and Computer Sciences, 44 (3),
1000-1005.
Devillers, D., Domine, D., Guillon, C. & Karcher, W. (2000) “Simulating
lipophilicity of organic molecules with a back-propagation neural network”.
Journal of Pharmaceutical Sciences, 87 (9), 1086-1090.
139
Draper, E. (2002). Improving the Effectiveness of Descriptor-Based Predictions.
MSc, University of Sheffield.
Duarte, H.A., Carvalho, S., Paniago, E.B. & Simas, A.M. (1999). “The importance of
tautomers in the chemical behavior of tetracyclines”. Journal of
Pharmacological Sciences, 88, 111-120.
Ghose, A.K., Pritchett, A. & Crippen, G.M. (1988). “Atomic physicochemical
parameters for 3-dimensional structure directed quantitative structure-
activity-relationships. 3. Modeling hydrophobic interactions”. Journal of
Computational Chemistry, 9 (1), 80-90.
Gillet, V.J., Willett, P. & Bradshaw, J. (1998). “Identification of Biological Activity
Profiles Using Substructural Analysis and Genetic Algorithms”. Journal of
Chemical Information and Computer Sciences, 38, 165-179.
Hallé, J.C., Lelievre, J. & Terrier, F. (1996). “Solvent effect on preferred protonation
sites in nicotinate and isonicotinate anions”. Canadian Journal of Chemistry,
74 (4), 613-620.
Hansch, C., Maloney, P., Fujita, T. & Muir, R. (1962). “Correlation of Biological
Activity of Phenoxyacetic Acids with Hammett Substituent Constants and
Partition Coefficients”. Nature, 194, 178-180.
Heinzelmann, W. & Märky, M. (1973). “Photosynthese von Dihydroazepinonen aus
2-Alkyl-indazolen”. Helvetica Chimica Acta, 56 (6), 1852-1858.
Heller, G., Buchwaldt, A., Fuchs, R., Kleinicke, W. & Kloss, J. (1925). Journal für
Praktische Chemie, 111, 1-74.
140
Kaliszan, R., Haber, P. & Snyder, L.R. (1999). “Estimation of Compound pKa and
log kw values by means of two Reversed-Phase HPLC Run”. HPLC ’99, May
1999, Granada, L/043.
Katritzky, A.R. & Lagowski, J.M. (1963). “Prototropic Tautomerism of
Heteroaromatic Compounds 1: General Discussion and Methods of Study”.
Advances in Heterocyclic Chemistry, 1, 311-338.
Katritzky, A.R., Elguero, J., Marzin, C. & Linda, P. (1976). “The Tautomerism of
Heterocycles”. Advances in Heterocyclic Chemistry, Supplement 1. New
York: Academic Press.
Katritzky, A.R. & Ghiviriga, I (1995). “An NMR-Study Of The Tautomerism Of
2-Acylaminopyridines”. Journal of the Chemical Society, Perkin
Transactions 2, (8), 1651-1653.
Katritzky, A.R., Ghiviriga, I., Oniciu, D.C., O’Ferrall, R.A.M. & Walsh, S.M.
(1997). “Study of the enol-enaminone tautomerism of alpha-heterocyclic
ketones by deuterium effects on C-13 chemical shifts”. Journal of the
Chemical Society, Perkin Transactions 2, (12), 2605-2608.
Katritzky, A.R., Denisko, O.V. & Elguero, J. (2000). “Prototropic Tautomerism of
Heterocycles: Heteroaromatic Tautomerism – General Overview and
Methodology”. Advances In Heterocyclic Chemistry, 76, 1-84.
Katritzky, A.R., Denisko, O.V., Stanovnik, B. & Tišler, M. (2001). “The
Tautomerism of Heterocycles: Six-Membered Heterocycles: Part 1, Annular
Tautomerism”. Advances In Heterocyclic Chemistry, 81, 253-303.
141
Kenny, P. (1999). “Handling Heterocyclic Tautomerism”. EuroMUG ‘99 meeting,
Cambridge, UK. 28-29 October 1999, presentation. [Online]
http://www.daylight.com/meetings/mug99/Kenny/kenny_mug99.htm
[Accessed 25 May 2004]
Knuth, D. (1998). The Art Of Computer Programming: Volume 2 – Semi-numerical
Algorithms, Reading, MA: Addison-Wesley Longman, pp 142.
Lázlár, L., Göblyös, A., Evanics, F., Bernáth, G. & Fülöp, F. (1998). “Ring-chain
tautomerism of 2-aryl-substituted imidazolidines”. Tetrahedron, 54 (44),
13639-13644.
Leach, A.R. & Gillet, V.J. (2003). An Introduction to Chemoinformatics. Kluwer
Academic Publishers: Dordrecht. pp. 19.
Leo, A.J. (1993). “Calculating Log P(oct) from structures”. Chemical Reviews, 93 (4),
1281-1306.
Leo, A.J & Hoekman, D. (2000). “Calculating log P(oct) with no missing fragments;
The problem of estimating new interaction parameters”. Perspectives in Drug
Discovery and Design, 18, 19-38.
Lipinski, C.A., Lombardo, F., Dominy, B.W. & Feeny, P.J. (1997). “Experimental
and computational approaches to estimate solubility and permeability in drug
discovery and development settings”. Advanced Drug Delivery Reviews, 23,
3-25.
MacNab, H. & Monahan, L.C. (1990). “Azepinones 4. Electrocyclic and
cycloaddition reactions of simple 1H-azepin-3(2H)-ones”. Journal of the
Chemical Society, Perkin Transactions 1, (11), 3169-3173.
142
MDL (2003) “CTFile Formats, Chapter 6: SDfiles” MDL Information Systems.
[Online] http://www.mdli.com/downloads/public/ctfile/ctfile.pdf
[Accessed 15 August 2004]
Morris, J.M. & Bruneau, P.P. (2000). “Prediction of Physicochemical Properties”. In:
Bohn, H.-J. & Schneider, G. (eds.), Virtual Screening for Bioactive
Molecules, Weinheim: Wiley-VCH. pp 33-58.
Oprea, T.I. (2000). “Property distribution of drug-related chemical databases”.
Journal of Computer-Aided Molecular Design, 14 (3), 251-264.
Pearlman, R.S., Khashan, R., Wong, D. & Balducci, R. (2002). “ProtoPlex: user-
control over tautomeric and protonation states”. Abstracts of Papers of the
American Chemical Society, 224, 232-COMP.
Pospisil, P., Ballmer, P., Folkers, G. & Scapozza, L. (2002). “Tautomerism in
nucleobase derivatives and their score in virtual screening to thymidine
kinase”. Abstracts of Papers of the American Chemical Society, 224, 211-
COMP.
Pospisil, P., Ballmer, P., Scapozza, L. & Folkers, G. (2003). “Tautomerism in
Computer-Aided Drug Design”. Journal of Receptors and Signal
Transduction, 23 (4), 361-371.
Sadowski, J. & Kubinyi, H. (1998). “A Scoring Scheme for Discriminating between
Drugs and Nondrugs”. Journal of Medicinal Chemistry, 41, 3325-3329.
Sadowski, J. (2002). “A tautomer and protonation pre-processor for virtual
screening”. Abstracts of Papers of the American Chemical Society, 224, 233-
COMP.
143
Sayle, R. & Delany, J. (1999). “Canonicalization and Enumeration of Tautomers”.
EuroMUG ‘99 meeting, Cambridge, UK. 28-29 October 1999, presentation.
[Online]
http://www.daylight.com/meetings/emug99/Delany/taut_html/sld001.htm
[Accessed 25 May 2004]
Tice, C.M. (2001). “Selecting the right compounds for screening: does Lipinski’s
Rule of 5 for pharmaceuticals apply to agrochemicals?”. Pest Management
Science. 57, 3-16.
Tice, C.M. (2002). “Selecting the rights compounds for screening: use of surface-
area parameters”. Pest Management Science. 58, 219-233.
Tišler, M. (1959). Archiv der Pharmazie, 292, 90-97.
Tomlin, C.D.S. (ed.) (2000). The Pesticide Manual, 12th edition, Farnham, Surrey,
UK: British Crop Protection Council.
Trepalin, S.V., Skorenko, A.V., Balakin, K.V., Nasonov, A.F., Lang, S.A.,
Ivashchenko, A.A. & Savchuk, N.P. (2003). “Advanced exact structure
searching in large databases of chemical compounds”. Journal of Chemical
Information and Computer Science, 43, 852-860.
Tripos (2004). UNITY 4.4.1, Tripos Inc., 1699 South Hanley Rd., St. Louis,
Missouri, 63144, USA. [Online]
http://www.tripos.com/sciTech/inSilicoDisc/chemInfo/unity.html
[Accessed 20 August 2004]
144
Valkó, K., Bevan, C. & Reynolds, D. (1997). “Chromatographic hydrophobicity
index by fast-gradient RP HPLC: A high-throughput alternative to log P and
log D”. Analytical Chemistry, 69 (11), 2022-2029.
Weininger, D., Weininger, A. & Weininger, J.L. (1989). “SMILES 2. Algorithm for
Generation of Unique SMILES Notation”. Journal of Chemical Information
and Computer Sciences, 29 (2), 97-101.
Weis, A.L. & Vishkautsan, R. (1984). “Dihydropyrimidines 9. Preparation and
imine-enamine tautomerism of 4,6-diphenyl-1,2-dihydropyrimidine”.
Chemistry Letters, (10), 1773-1776.
Weis, A.L. & van der Plas, H.C. (1986). “Dihydropyrimidines - Synthesis, structure
and tautomerism”. Heterocycles, 24 (5), 1433-1455.
Weis, A.L., Frolow, F. & Vishkautsan, R. (1986). “Dihydropyrimidines 16. Stability
and enamine-imine tautomerism in 1,2-dihydropyrimidines and
2,5-dihydropyrimidines”. Journal of Organic Chemistry, 51 (24), 4623-4626.
Wheland, G.W. (1955). Resonance in Organic Chemistry, New York: Wiley.
pp. 98-100.
Wildman, S.A. & Crippen, G.M. (1999). “Prediction of Physicochemical Parameters
by Atomic Contributions”. Journal of Chemical Information and Computer
Science, 39, 868-873.
Willett, P., Barnard, J.M. & Downs, G.M. (1998). “Chemical similarity searching”.
Journal of Chemical Information and Computer Science, 38, 983-996.
145
Whitman, C.P. (1999). “Keto-Enol Tautomerism in Enzymatic Reactions”.
Comprehensive Natural Products Chemistry, 5, 31-50.
Yan, X., Day, P., Hollis, T., Monzingo, A.F. & Schelp, E. (1998). “Recognition and
interaction of small rings with the ricin A-chain binding site”. Proteins, 31,
33-41.
Zaleska, B., Ciez, D. & Falk, H. (1996). “Synthesis and properties of unique
mesoionic 1,3-thiazolium-4-olates”. Monatshefte für Chemie, 127 (12), 1251-
1257.