
THE EFFECT OF TAUTOMERISM ON THE

PREDICTION OF BIOAVAILABILITY AND

VIRTUAL SCREENING

A study submitted in partial fulfillment

of the requirements for the degree of

Master of Science in Chemoinformatics

at

THE UNIVERSITY OF SHEFFIELD

by

DAVID PARKER

September 2004

Table of Contents

Acknowledgements
Abstract
Common abbreviations
1 Introduction
1.1 High throughput screening in lead compound discovery
1.2 Lipinski’s “rule of five”
1.3 A measure of lipophilicity - log P
1.4 Tautomerism and property prediction
1.4.1 Prototropic Tautomerism
1.4.2 Valence tautomerism
1.5 The impact of tautomerism on drug design
1.6 Tautomerism and molecular docking programs
1.7 Tautomerism and molecular descriptors
1.8 The project domain
1.9 Project outline
2 Methodology
2.1 Introduction
2.2 SMILES notation
2.3 SMARTS notation
2.4 Estimating compound lipophilicity: ELOGP
2.5 Estimating compound aqueous solubility: ESOL
2.6 Estimating acid-base ionisation constants: ACD/pKa
2.7 The SOLSTICE tool set
2.8 Compound data set preparation
2.8.1 Files in .sdf format
2.8.2 SMILES canonicalisation
2.8.3 Leatherface: A tool for transforming chemical structures
2.9 Compound property prediction
2.9.1 ELOGP
2.9.2 ESOL
2.9.3 pKa
2.10 Result collation, indexing and data presentation
2.10.1 DIVA – A spreadsheet for manipulating and displaying chemical information
2.10.2 Post processing of the dataset
2.10.3 Allocating a predicted charge at pH7
2.10.4 Data analysis and presentation
2.10.5 Identifying tautomeric substructures
2.10.6 Other data analysis indicators
2.10.7 Comparison of measured and predicted log P and pKa values
2.10.8 Analysis of prediction failures
2.10.9 CHI data – a source of information about tautomer classes not highlighted by the STT
3 Results and discussion
3.1 About this chapter
3.2 Introducing the datasets
3.3 Comparing the property predictions made for the NDC forms and PRFs of each compound set
3.3.1 ELOGP
3.3.2 ESOL
3.3.3 pKa
3.4 Summarising the differences between the NDC forms and PRFs of the HTS and PM datasets
3.5 Formal charge distributions at pH7
3.5.1 The influence of predicted pKa changes on predicted charge distribution
3.5.2 A comparison of predicted charge distribution at pH7 within pH 2-10 and pH 0-14 limits
3.6 Issues and problems with prediction tools
3.6.1 AlogP and SMILES
3.6.2 Analysis of prediction failures
3.6.2.1 Log P
3.6.2.2 pKa
3.7 Revealing the types of structural changes performed by the STT and the tautomer substructures concerned
3.7.1 Analysing the effect of the STT on each dataset
3.7.2 Categorising the types of structure change performed by the STT
3.7.3 Validating the structural changes performed by the STT
3.8 Comparing measured and predicted property values
3.8.1 Compounds whose structures were not modified by the STT
3.8.1.1 pKa comparisons
3.8.1.1.1 HTS dataset
3.8.1.1.2 PM dataset
3.8.1.2 log P comparisons
3.8.1.2.1 HTS dataset
3.8.1.2.2 PM dataset
3.8.2 The impact of the STT changing tautomers on the outcome of log P and pKa predictions
3.8.2.1 Introduction
3.8.2.2 Defining tautomer type subclasses
3.8.2.3 pKa comparisons
3.8.2.4 Log P comparisons
3.8.2.5 Re-investigating the validity of the structural changes performed by the STT
3.8.2.6 Evaluating the predictions of alternative tautomers
3.8.2.6.1 Substructures 5b and 5d
3.8.2.6.2 Substructures 5e and 12b
3.8.2.6.3 Substructures of type 12a
3.9 A method of investigating tautomer issues not highlighted by the STT
3.9.1 Analysis of CHI data
3.9.2 Comparison of measured and predicted log P and pKa data for “new” tautomeric compounds
4 Evaluating tautomeric misrepresentation in a larger dataset
4.1 Introduction
4.2 Sampling of compounds
4.3 Dataset analysis
5 Conclusions and further work
5.1 Conclusions
5.2 Further work
References

Acknowledgements

I would like to thank Graham Mullier, Eric Clarke, John Delaney and

Val Gillet for their supervisory support and encouragement during the project and for

keeping me supplied with data, ideas and constructive feedback as it progressed.

Thanks also to David Adams for useful discussions.

I also wish to thank in general my colleagues at Syngenta in Jealott’s Hill for

making me feel welcome during my time here and to Nick and Thierry, with whom I

shared one of the cottages.

Finally thanks to my parents for helping with the logistics of getting me to

and from J.H., my friends, and to Dawn for her patience and understanding not just

during this project, but throughout my Masters.

Abstract

This work develops and tests a methodology for assessing the degree of

tautomer misrepresentation in chemical datasets and analyses the effect different

tautomers have on predictions of aqueous solubility (log Sw), lipophilicity (log P),

acid-base ionisation constants (pKa) and charge at pH7.

A structure transformation tool (STT) is used to convert compounds from

their database stored form into one considered to be physiologically relevant at pH7;

allowing the number and type of tautomeric compounds “wrongly drawn” to be

assessed. In the 3 datasets studied, such compounds are found to represent no more

than 1-2% of the total.

By making comparisons between predicted values and measured data, the

tautomers that give the best descriptions of molecules are assessed and the distinct

patterns found for different classes of tautomer examined and likely explanations

presented.

The effectiveness of the STT itself is tested and a series of tautomeric

compounds it “misses” identified. The study shows that its structure changing rules

are reasonable, but are sometimes too generically applied to always be reliable.

The reasons for the failing of the various property prediction tools for

individual compounds are also investigated. Particular problems with AlogP’s

inconsistent handling of SMILES and deficiencies in its fragment dictionary are

highlighted.

Common abbreviations

• CHI Chromatographic Hydrophobicity Index

• HTS High Throughput Screen(ing)

• NDC Native Drawing Convention

• PM Pesticide Manual

• PRF Physiologically Relevant Form

• STT Structure Transformation Tool

1 Introduction

1.1 High throughput screening in lead compound discovery

The drive to accelerate the lead compound discovery process, and thereby

reduce the time taken to bring new pharmaceuticals and agrochemicals to the

market, has been motivated in part by the desire to make significant savings in the

associated research and development costs. With product development times of 10

years not being unusual, there is also a strong financial incentive for a company to

respond quickly and meet the market demand for a product before its

competitors do.

Undoubtedly mechanisation, miniaturisation and computerisation have made

it easier to screen larger and larger numbers of inputs from combinatorial chemistry

using techniques such as High Throughput Screening (HTS). However, simply

performing more screens does little to optimise the desired physical properties of

those active compounds that become leads (Morris & Bruneau, 2000). In recent

times, considerable efforts have therefore been made in developing techniques that

aid in the design of combinatorial libraries by allowing the physical properties of

screened inputs to be predicted in advance.

The activity of both agrochemicals and pharmaceuticals is dependent on their

ability to bind specifically to a desired target, typically a pocket on the surface of a

protein. So by removing in advance any compounds from the screen that are unlikely

to satisfy the conditions required for binding, the proportion of likely strong actives

actually screened will be increased. This ultimately is likely to increase the number

of strongly active leads obtained from a screen and so reduce the risk of high

development costs being directed at leads that are found later to only be weakly

active.

1.2 Lipinski’s “rule of five”

At the forefront of efforts to define the “drug-likeness” of compounds was the

work of Lipinski (Lipinski et al., 1997). His now much cited “rule of five” principle

placed upper limits on four molecular properties, above any of which a molecule is

less likely to be drug-like in permeation. These limits are:

• Molecular weight of 500.

• Log P (octanol / water) of 5.

• Five hydrogen bond donors (either OH or NH).

• Ten hydrogen bond acceptors (N or O atoms).
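
The four limits above lend themselves to a simple filter. The following is a minimal Python sketch and is not taken from the work cited: the `Compound` record, the function name `rule_of_five_violations` and the example property values are hypothetical, and the properties are assumed to have been measured or calculated elsewhere. Following the wording above, a property counts as a violation only when it exceeds its limit.

```python
from dataclasses import dataclass


@dataclass
class Compound:
    """Hypothetical record of pre-computed properties for one compound."""
    name: str
    mol_weight: float      # molecular weight
    log_p: float           # octanol / water partition coefficient
    h_bond_donors: int     # count of OH and NH groups
    h_bond_acceptors: int  # count of N and O atoms


# Upper limits as listed above (Lipinski et al., 1997).
LIMITS = {
    "mol_weight": 500,
    "log_p": 5,
    "h_bond_donors": 5,
    "h_bond_acceptors": 10,
}


def rule_of_five_violations(compound: Compound) -> list[str]:
    """Return the names of the properties that exceed their upper limit."""
    return [
        prop for prop, limit in LIMITS.items()
        if getattr(compound, prop) > limit
    ]


# Illustrative values only; they do not describe real compounds.
small = Compound("example-1", 320.4, 2.1, 2, 5)
greasy = Compound("example-2", 612.8, 6.3, 1, 8)

print(rule_of_five_violations(small))   # []
print(rule_of_five_violations(greasy))  # ['mol_weight', 'log_p']
```

Returning the list of exceeded properties, rather than a single pass or fail flag, keeps the sketch usable with modified limits such as those later proposed for agrochemicals.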

With some ground rules now set, a number of other research groups developed

more sophisticated models for predicting “drug-likeness”, such as a feed-forward

neural network (Sadowski & Kubinyi, 1998) and a genetic algorithm scoring

scheme (Gillet et al., 1998), both to good effect. The concept of “lead-likeness” and

how lead and drug compound properties differ from each other has also been studied

by Oprea (2000).

The delivery of agrochemicals to crops (typically by spraying) and the

application of pharmaceuticals to patients (typically by ingestion or injection) are

necessarily approached in completely different ways. The agrochemical industry

therefore realised that the typical physicochemical properties of agrochemicals were

likely to differ from those of pharmaceuticals. As a consequence, Briggs quickly

followed Lipinski with his “ground rules of three” (Briggs, 1997) and, for fungicides,

went on to set several alternative physical property limits for agrochemical-like

behaviour (Briggs et al., 2002).

Tice compared the physical properties of a set of active herbicides and

insecticides with the same ones described by Lipinski’s “rule of five” for

pharmaceuticals (Tice, 2001 & 2002). His main observation was that these classes of

agrochemicals contained significantly lower numbers of hydrogen bond donors than

did the pharmaceuticals. Tice’s observations led him to modify the values laid

down in Lipinski’s rules, specifically to reflect the nature of herbicides and

insecticides.

Clarke and Delaney (2003) recently compared the changes in nine physical

and molecular properties for herbicides, fungicides and insecticides between

identified HTS hit series compounds, lead series compounds and an agrochemical

product series, as well as a random subset of agrochemicals from their employer’s

corporate database. Properties considered included percentage aromaticity, molecular

weight, charge at pH 7 and partition coefficient differences. Herbicides and

fungicides in particular were surprisingly found to readily meet Lipinski’s criteria for

pharmaceutical lead-like compounds. For agrochemical products as a whole, Clarke

and Delaney (2003) summarised:

“…the whole progression from hits to products is dominated by

rising solubility, decreasing basicity and the removal of carbon,

particularly in aromatic systems.”

Underlying the physical property profiles of agrochemicals and

pharmaceuticals are the values, whether measured or calculated, assigned to each.

Though practical measurements for properties such as log P, pKa and solubility

would ideally be made for every compound in a corporate collection at the time it is

added, in practice this is time consuming, expensive and consequently an unrealistic

expectation.

1.3 A measure of lipophilicity - log P

The partition coefficient log P, a measure of a compound’s lipophilicity, was

pioneered by Hansch and co-workers (Hansch et al., 1962). It has been of particular

interest in the agrochemical industry as it reveals the degree of preference a

compound has for residing in an organic phase (typically n-octanol) over an aqueous

phase. Given the nature of typical agrochemical delivery methods to crops, a

favourable log P is critical in making sure that agrochemicals are capable of crossing

their target species’ cell membrane in order that they can act. Equally important in an

increasingly environmentally-conscious world are the potentially negative

consequences of agrochemicals accidentally leaching into the environment, for

example in rainwater run off into watercourses, and the adverse effects they may

then have on other plants or wildlife.
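
As a reminder of what the quantity represents, log P is the base-10 logarithm of the ratio of a neutral compound's equilibrium concentrations in the organic and aqueous phases. The Python sketch below is illustrative only: the function name and the concentrations are hypothetical and are not data from this study.

```python
import math


def log_p(conc_octanol: float, conc_water: float) -> float:
    """log P = log10([solute] in n-octanol / [solute] in water).

    Both concentrations must be in the same units and refer to the
    neutral form of the compound at equilibrium.
    """
    if conc_octanol <= 0 or conc_water <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(conc_octanol / conc_water)


# Hypothetical example: 1000 times more concentrated in octanol than in water.
print(log_p(10.0, 0.01))  # approximately 3.0
```

A positive value therefore indicates a preference for the organic phase, and each unit corresponds to a tenfold change in the concentration ratio.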

Various methodologies have been used by researchers in their quest to find an

accurate means of predicting log P. Fragment based methods, such as in the program

CLOGP, first described by Leo (1993), break down a molecule into distinct

substructures chosen from a predetermined set. The pre-calculated log P

contributions of each substructure are then summed across the set generated for the

molecule to give the overall result.
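
The additive principle described above can be sketched in a few lines of Python. This is not CLOGP's algorithm or its fragment dictionary: the fragment labels, the contribution values and the pre-computed fragment list are all invented for illustration, and any correction terms a real fragment method applies are omitted.

```python
# Hypothetical fragment dictionary: fragment label -> log P contribution.
# The values are invented for illustration and are NOT CLOGP parameters.
FRAGMENT_CONTRIBUTIONS = {
    "CH3": 0.9,
    "CH2": 0.5,
    "OH": -1.1,
    "C=O": -1.5,
}


def additive_log_p(fragments: list[str]) -> float:
    """Sum the pre-calculated contribution of every fragment in a molecule.

    `fragments` is the list of fragment labels the molecule has been broken
    into; the fragmentation step itself is assumed to be done elsewhere.
    """
    missing = sorted({f for f in fragments if f not in FRAGMENT_CONTRIBUTIONS})
    if missing:
        # A fragment absent from the dictionary means no estimate can be made.
        raise KeyError(f"no contribution stored for: {missing}")
    return sum(FRAGMENT_CONTRIBUTIONS[f] for f in fragments)


# A molecule split into CH3 + CH2 + OH under this toy scheme.
estimate = additive_log_p(["CH3", "CH2", "OH"])
print(round(estimate, 2))  # 0.3
```

The explicit failure on an unknown fragment reflects the dependence of such methods on the coverage of their dictionary: a structure drawn with a fragment the dictionary lacks cannot be scored at all.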

Another common method is atom descriptor based and involves every atom

in a molecule being assigned to one of a series of different atom types, each of which

contributes a log P weight to the overall log P. The overall value is then obtained as

the linear sum of the set of component weights. The best-known application

based on this method, ALOGP, has seen some refinement to its operation since

inception, the most recent being described by Wildman and Crippen (1999). The

technique AUTOLOGP (Devillers et al., 2000) takes a different approach by

combining a series of different types of descriptor for hydrogen bond donor and

acceptor ability, lipophilicity and molar refractivity. A trained back-propagation

neural network is then used to evaluate the descriptor set values and produce a log P

estimate from them.

With there being so many types and variants in log P prediction tools, a need

was identified for a rigorous comparison of their performance against literature log P

data (Draper, 2002 & Clarke et al., 2004). Across six compound classes, including a

random 700 compound sample from the Pesticide Manual (Tomlin, 2000) and five

50-250 compound samples from specific agrochemical classes in the Syngenta

corporate database, six log P predictors were tested:

1. Fragment based CLOGP method – Daylight v4.71.

2. Fragment based CLOGP method – Biobyte v3.14.

3. Atom based ALOGP method - Accelrys Diamond Descriptors.

4. Atom and fragment combined method – ACD Phys Chem Batch v4.76.

5. Solvation descriptors – Sirius Absolv v1.4.

6. Quantum mechanics and neural network derived – Accelrys Diamond

Properties v1.5.

No single predictor was found to routinely out-perform the others, with the

different agrochemical classes variously favouring a total of four of the six tools.

Overall however Predictor 1 was found to be the best performer and Predictor 6 the

poorest. In order to maximise the predictive power of the combined methods, a

consensus scoring approach was applied to them and a new parameter, ELOGP,

defined. It was calculated for each compound as being the average of the log P

values obtained from methods 1, 3, 4 and 5. Analysis showed that ELOGP often out-

performed the individual methods from which it was derived, with significant

improvements in the proportion of log P predictions made within 0.5 units of the

actual measured value. The success of ELOGP has since seen it adapted for use

in HTS applications, the HTS version of ELOGP being the mean log P of just

methods 1, 3 and 4 (Clarke & Delaney, 2003).
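
The consensus calculation described above is a plain arithmetic mean over the selected predictors. A minimal Python sketch follows; the function name, the dictionary keyed by predictor number and the example predictions are assumptions made for illustration, and raising an error when a prediction is missing is a choice made here, not something stated in the source.

```python
from statistics import mean

# Predictor numbers follow the numbered list of six tools above.
ELOGP_METHODS = (1, 3, 4, 5)   # full ELOGP
ELOGP_HTS_METHODS = (1, 3, 4)  # HTS version of ELOGP


def consensus_log_p(predictions: dict[int, float], methods: tuple[int, ...]) -> float:
    """Average the log P values from the chosen predictors."""
    missing = [m for m in methods if m not in predictions]
    if missing:
        raise ValueError(f"no prediction from method(s): {missing}")
    return mean(predictions[m] for m in methods)


# Invented predictions for one compound, keyed by predictor number.
example = {1: 2.8, 2: 3.1, 3: 3.4, 4: 3.0, 5: 2.6, 6: 4.2}

print(round(consensus_log_p(example, ELOGP_METHODS), 2))      # 2.95
print(round(consensus_log_p(example, ELOGP_HTS_METHODS), 2))  # 3.07
```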

1.4 Tautomerism and property prediction

To produce accurate results, fragment-based property prediction tools need a

precise description of each structure. This allows

them to assign sets of specific fragment types to molecules using dictionaries of

fragments. However, many organic compounds have multiple structural isomers that

inter-convert, typically by transfer of a chemical group, and which are in equilibrium

with each other. To add further complication, the position of this equilibrium may

vary, depending upon the immediate physical and chemical environment of the

molecule.

The phenomenon, known as tautomerism, therefore has potentially a very

significant impact on physical and chemical property prediction and on computer-

aided drug design as a whole. As a consequence, it was the subject of a recent review

by Pospisil and co-workers (Pospisil et al., 2003). The concepts underlying

tautomerism in heterocyclic chemistry however are well established. For example,

accounts in the field by Heller et al. (1925) date back almost 80 years.

1.4.1 Prototropic Tautomerism

The best-known and most studied type of tautomerism is prototropic

tautomerism, manifested in the variable position of attachment of a hydrogen atom in

a molecule. It has been the subject of extensive reviews by Katritzky et al.

(1963, 1976, 2000 & 2001). Of this type, keto-enol tautomerism has been particularly

well studied and reviewed, for example by Whitman in relation to enzymatic

reactions in which it plays a part (Whitman, 1999). In Figure 1.1, for example,

acetone has both keto (left) and enol (right) forms that are in equilibrium with each

other.

Figure 1.1: Keto (left) and enol (right) forms of acetone.

In simple ketones, the keto form is generally more stable than the enol form

(by ΔG = 11 kcal mol⁻¹ in the above example (Chiang et al., 1989)), resulting in the

Figure 1.1 equilibrium, in practice, being far over to the left. The favouring of the

keto form was considered by Wheland (1955) to be due to the greater strength of

carbon-oxygen double bonds compared to carbon-carbon double bonds. In contrast,

for aromatic rings, for example phenol and its two cyclohexadienone tautomers

(Figure 1.2), the enol form is most often the one favoured. In this example the higher

free energy of aromatisation (36 kcal mol⁻¹) (Wheland, 1955) overrides the

underlying preference for the keto forms, meaning the enol form instead dominates.
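
To give a feel for how one-sided a free energy preference of 11 kcal mol⁻¹ is, the difference can be converted into an equilibrium constant with the standard relation K = exp(-ΔG/RT). The Python sketch below assumes a temperature of 298.15 K, which is not stated in the text, and the function names are illustrative.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal mol^-1 K^-1


def equilibrium_constant(delta_g_kcal: float, temp_k: float = 298.15) -> float:
    """K = exp(-dG / RT) for a process with free energy change dG."""
    return math.exp(-delta_g_kcal / (R_KCAL * temp_k))


def minor_fraction(delta_g_kcal: float, temp_k: float = 298.15) -> float:
    """Fraction of molecules in the less stable of two interconverting forms."""
    k = equilibrium_constant(delta_g_kcal, temp_k)
    return k / (1.0 + k)


# Keto -> enol for acetone, using the 11 kcal/mol figure quoted above.
k_enol = equilibrium_constant(11.0)
print(f"K(enol/keto) ~ {k_enol:.1e}")
print(f"enol fraction ~ {minor_fraction(11.0):.1e}")
```

Under these assumptions K is of the order of 10⁻⁸, that is, roughly one enol molecule for every hundred million keto molecules.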

Figure 1.2: Phenol (enol form) and its two cyclohexadienone (keto) tautomers.

Particularly in conjugated systems containing more than one heteroatom, the

position of the equilibrium can be far less easy to predetermine. For example,

4-pyridone (Figure 1.3, right) and 4-hydroxypyridine (Figure 1.3, left) exist in an

equilibrium that was studied by Beak et al. (1976). They were only able to detect the

pyridone form in a solution of ethanol, but in the vapour phase found the

hydroxypyridine form to dominate. A number of theoretical studies have also been

carried out to examine the tautomer preference of more complex cases such as those of

cytosine, thymine, uracil, 2,6-dithioxanthine and some of their analogues in both the

gas and aqueous phases (Civcir, 2000 & 2001).

Figure 1.3: 4-Hydroxypyridine (left) and 4-pyridone (right).

In contrast to the pyridine / pyridone system of Figure 1.3, though

1H-azepin-2-one (Figure 1.4, left) and 1-methyl-azepin-3-one (Figure 1.4, right)

contain conjugated π-electron systems, they cannot tautomerise into aromatic ring

structures. As a result, they exist, as drawn, predominantly in their keto forms

(Heinzelmann & Märky, 1973 and MacNab & Monahan, 1990).

Figure 1.4: 1H-Azepin-2-one (left) and 1-methyl-azepin-3-one (right).

Prototropic tautomerism sometimes results in zwitterionic tautomers, such as

in the case of iso-nicotinic acid (Figure 1.5). The position of the equilibrium in a

solution of dimethyl sulfoxide (DMSO) and water was found by Hallé et al. (1996)

to be very sensitive to the ratio of the solvent’s components. Above 80% DMSO they

found that the position of the equilibrium very strongly favoured the non-zwitterionic

tautomer.

Figure 1.5: Non-zwitterionic and zwitterionic tautomers of iso-nicotinic acid.

The electronic properties of molecules can also influence the position of

tautomeric equilibria. Katritzky et al. (2001) for example, compared the tautomer

ratios in solution of a series of 1,2-/2,5-dihydropyrimidines (Figure 1.6 and

Table 1.1) in deuterated chloroform and DMSO.

Figure 1.6: Tautomers A and B of the 1,2-/2,5-dihydropyrimidines (R1 = Ph, SPh; R2 = Ph, OMe, SPh).

Ratio of A form to B form in solvent

R1    R2    CDCl3   DMSO-d6       Reference
Ph    Ph    2:1     A form only   Weis & Vishkautsan (1984) / Weis & van der Plas (1986)
Ph    OMe   1:6     8:1           Weis et al. (1986)
SPh   SPh   1:3     8:1           Weis et al. (1986)

Table 1.1: “Tautomeric Equilibria of 1,2-/2,5-dihydropyrimidines” (Adapted from Katritzky et al. (2001))

In summary, electron donating substituents and non-polar solvents were

found to favour the B tautomers, whereas increasing the polarity of the solvent

dramatically shifted the position of the equilibrium to strongly favour the A

tautomers instead. The effect of these apparently moderate physical changes hints at

the difficulties faced when trying to accurately predict the predominant tautomer in a

given environment.
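
The size of the solvent effect is easier to see when the A:B ratios of Table 1.1 are expressed as the fraction of the A tautomer. The helper below is a small illustrative Python sketch (the function name is hypothetical); it uses the R1 = Ph, R2 = OMe entry, which goes from 1:6 in CDCl3 to 8:1 in DMSO-d6.

```python
from fractions import Fraction


def fraction_of_a(ratio: str) -> Fraction:
    """Convert an 'A:B' ratio string such as '1:6' into the fraction of A."""
    a_text, b_text = ratio.split(":")
    a, b = int(a_text), int(b_text)
    if a < 0 or b < 0 or a + b == 0:
        raise ValueError(f"invalid ratio: {ratio!r}")
    return Fraction(a, a + b)


# R1 = Ph, R2 = OMe row of Table 1.1.
in_cdcl3 = fraction_of_a("1:6")
in_dmso = fraction_of_a("8:1")

print(f"CDCl3:   {float(in_cdcl3):.0%} A")  # 14% A
print(f"DMSO-d6: {float(in_dmso):.0%} A")   # 89% A
```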

For larger, conjugated ring systems and systems containing more than three

heteroatoms, more than two tautomers are frequently plausible. Tišler (1955) for

example studied compounds in the mercapto-oxo-triazole system shown in

Figure 1.7 and for the R-group instances investigated, found tautomer B to be the

most prevalent.

Figure 1.7: Tautomers A, B and C of the mercapto-oxo-triazole system.

The prototropic tautomerism discussed so far has primarily occurred by

intermolecular proton transfer. However, there are also mechanisms for

intramolecular tautomerisation assisted by hydrogen bonding, such as seen in the

pyrazines and quinazolines shown in Figure 1.8 (Katritzky et al., 1995 & 1997). In

the cases considered, when X = carbon the B tautomers were preferred and when X =

nitrogen the C tautomers were the most prominent.

Figure 1.8: Tautomers A, B and C of the pyrazines and quinazolines (X = C, N).

Sometimes prototropic tautomerism is accompanied by a more substantial

structural change such as a ring opening / closure. Lázlár et al. (1998) for example

studied the tautomeric equilibria of a series of 1-alkyl-substituted

2-arylimidazolidines in deuteriochloroform (Figure 1.9) that undergo reversible,

five-membered ring-opening reactions. The bulkier the R-substituent, the more

favoured the ring-opened tautomer was found to be.

Figure 1.9: Ring-closed and ring-opened tautomers of the 1-alkyl-substituted 2-arylimidazolidines (R = Me, Et, Pr, iPr).

The potential influence of the preferred tautomer of a ligand on that ligand’s

binding properties and therefore its whole chemistry can be illustrated by comparing

the X-ray crystal structures of the compounds shown in Figure 1.10. The crown ester

(left) and crown ether (right) are drawn in the forms observed by Bradshaw et al.

(1985 & 1986). With the crown ester’s central cavity containing one more hydrogen

bond donor and one fewer hydrogen bond acceptor than the crown ether, their

chemical environments will present somewhat different prospects to any potential

encapsulation atom or group. Consequently these tautomers are likely to have

different coordination chemistries.

Figure 1.10: Crown ester (left) and crown ether (right) in the forms observed by Bradshaw et al. (1985 & 1986).

1.4.2 Valence tautomerism

Valence tautomerism occurs without chemical group detachment and

reattachment elsewhere taking place and instead primarily involves an electronic

rearrangement within a molecule. An example, which also involves a 6-membered

ring opening / closure, was observed in a series of N-bridged 1,3-thiazolium-4-olates

prepared by Zaleska and co-workers (Zaleska et al., 1996) (Figure 1.11). The

position of the equilibrium was found to depend on the nature of the solvent and the

pH of the solution.

Figure 1.11: Valence tautomers of the N-bridged 1,3-thiazolium-4-olates (R1 = Me, Ph; R2, R3 = Ph, p-C6H4Me).

Though not technically tautomers, a number of simple functional groups can

be drawn in different resonance hybrid forms. For example, Figure 1.12 shows the

resonance forms of azide (top) and diazo (bottom) groups (Leach & Gillet, 2003). If

a hybrid is not considered in conjunction with its other form(s), each could be treated

as though it was a completely different species in a chemical reactivity sense. In turn,

this could have a profoundly adverse effect on the physical and chemical properties

predicted for a molecule containing that functional group.

Figure 1.12: Resonance forms of the azide (top) and diazo (bottom) groups.

1.5 The impact of tautomerism on drug design

Despite the amount that is known about tautomerism and the extent to which

it prevails in heterocyclic chemistry, the fundamental impact it may have in limiting

the success of current computational methods for drug design remains little

researched. The shape of a drug-like molecule together with its donor and acceptor

properties and its physicochemical properties are all critical in determining whether it

will bind strongly to a target receptor site and show the required level of activity.

Figure 1.13: Tetracycline.

For the reasons shown above, tautomerism could have a profound effect on a

molecule’s ability to meet these criteria. For example, the complex molecule

Tetracycline (Figure 1.13) has a total of 64 potential tautomeric forms and a strong

ability to modify its geometry and bonding structure to suit its chemical environment

(Duarte et al., 1999). Given the influence of factors such as solvent and pH, it

remains hard to predict whether or not a particular active tautomer of interest will be

energetically available in a particular environment. Such concerns were raised by

Pospisil et al. (2003):

“…does a molecule bind preferably in one distinct tautomer? Is the

most stable tautomeric form in aqueous solution also the most stable

form in the active site of the protein? What can be the binding

contribution of a ligand in its excited tautomeric state in contrast to its

‘normal’ tautomeric state, e.g., its low energy configuration?”

Despite these pressing questions, few studies examining the binding modes of

particular tautomers have been carried out to date. One study by Brandstetter et al.

(2001) however has shown the preference for enol tautomer binding between the

8-barbiturate inhibitor RO200-1770 and the active site of the matrix

metalloproteinase MMP-8. The keto tautomer of the barbiturate in contrast is the one

that dominates in solution. Similarly, Yan and co-workers (Yan et al., 1998)

calculated the preferential tautomer of binding between pterin and ricin. They found

that of the four possibilities the chosen tautomer was neither the one of the lowest

energy in aqueous solution nor in the gas phase. These two observations show that

though normally unfavoured tautomers can still be activated and stabilized given the

right ligand-protein environment, trying to accurately anticipate such occasions for

exploitation in lead compound discovery applications is likely to remain a sizable

challenge.

1.6 Tautomerism and molecular docking programs

Chemical compounds are typically stored in databases as discrete, canonical

structures. The tools that currently exist to convert tautomers into their alternative

forms remain only of limited functionality according to Pospisil et al. (2003):

“There are several programs available which are able to create

tautomers, however for only one single compound at a time.”

Additionally, in the estimation of Trepalin et al. (2003), up to 0.5% of

commercial databases for bio-screening applications contain tautomers. Combining

these two observations, it is likely that many commercial and corporate databases

used for HTS will be missing a valuable and sizable amount of tautomeric structural

information about their collection. The extent of the problem led Pospisil et al.

(2003) to suggest:

“…if a database is used for computer-aided lead finding, enriching

one’s database by energetically similar tautomers may significantly

improve the success rates in computer-aided drug design.”

Pospisil et al. (2003) also pointed out that including tautomers in virtual

screening increases the amount of “chemical space” covered by databases and

improves the chances of hits being generated. Of the tautomer generation tools

currently in existence, most can be considered as utilities that provide

“pre-processing” for other screening or docking applications that do not take tautomerism

into consideration themselves. Of these, “ProtoPlex” (Pearlmann et al., 2002), a

similar tool by Sadowski (2002), Pospisil’s “in-house” program “AGENT 2”

(AGENT 2, 2004 & Pospisil, 2002) and Kenny’s “Leatherface” (Kenny, 1999) for

the interconversion of tautomer forms are currently the most well known.

1.7 Tautomerism and molecular descriptors

One of the major problems of molecular descriptor prediction is that an

accurate structural representation of a molecule is required. For tautomeric

compounds this is particularly difficult, as single structures of the presumed

dominant tautomer are usually drawn to represent them, while their other tautomers

and the position of the equilibria between them are frequently given little or no

consideration. This means that in a given chemical environment, predictions could at

best be uncertain or at worst, be meaningless. For example, Sayle and Delany (1999)

calculated the log P (CLOGP) values for the paired 4-hydroxypyridine and

4-pyridone tautomers (Figure 1.3). They found them to be markedly different to each

other, at 0.93 and –1.31 respectively. As Pospisil et al. (2003) explained, the success

of the fragment-based CLOGP method depends not only on the nature of the

fragments produced, but also on the inclusion of complete and representative

tautomeric information in the training data:

“The fragment-based method depends on the way fragments are

produced, their number, size, and the training sets. Thus, missed or

incorrectly selected tautomers for the training set lead to wrong

correlations and cause the log P prediction to fail.”

Tautomerism can also affect the perceived similarity of molecules and

therefore inadvertently influence how compounds are clustered. Willett et al. (1998),

for example, found the Tanimoto index of the tautomer pair 4-nitrosophenol (Figure

1.14, left) and [1,4] benzoquinone monooxime (Figure 1.14, right) to be only 0.196,

despite them being treated as no more than different forms of the same compound.

Tautomerism can also affect measures of compound set diversity because of the low

levels of similarity that are sometimes ascribed to pairs of tautomers.

[Figure 1.14: 4-nitrosophenol (left) and [1,4] benzoquinone monooxime (right)]
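To make the similarity calculation concrete, the following minimal Python sketch computes a Tanimoto coefficient from two sets of structural features. The feature labels are hypothetical placeholders and not the fingerprint used by Willett et al. (1998), so the sketch illustrates the formula only and is not expected to reproduce the value of 0.196.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient: shared features / features present in either set."""
    if not a and not b:
        # Convention assumed here: two empty feature sets count as identical.
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


# Hypothetical bond-level feature labels for the two tautomers (illustrative only).
nitrosophenol = {"c:c", "c-O", "O-H", "c-N", "N=O"}
quinone_monooxime = {"C=C", "C=O", "C=N", "N-O", "O-H"}

score = tanimoto(nitrosophenol, quinone_monooxime)
print(f"Tanimoto = {score:.3f}")
```

Because the two tautomers share few bond-level features, even a crude feature set such as this one gives a low score, which is the behaviour described above.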

1.8 The project domain

The development of HTS and computer-aided drug design has opened up

many possibilities for more rapid and more successful lead compound discovery. An

integral part of this process has been the development of applications to predict the

physical and chemical properties of drug-like molecules and their likely activity at a

particular target. The widespread and well-studied chemical phenomenon of

tautomerism in organic chemistry often has a marked effect on the shape, structure

and chemical properties of molecules and preliminary studies have shown that its

impact on the prediction of those properties can also be considerable.

The project will therefore address the interest of Syngenta in finding out the

extent to which tautomeric misrepresentation is a characteristic of the molecules in its

own corporate collection. Using the example compounds found, the influence of

tautomerism on the prediction of a number of their physical and chemical properties

will be investigated.

1.9 Project outline

Tautomerism can manifest itself in many ways and structural forms. The

nature and scale of the problem it causes to the descriptor-based property prediction

methods currently used by Syngenta will first be studied. The Pesticide Manual

(Tomlin, 2000), samples from the chemical database of Syngenta and compounds

from published chemical catalogues will be examined for this purpose.

Leatherface, a Structure Transformation Tool (STT) developed by Kenny

(1999), will be used to convert each compound from its database-stored form into the

form it considers to be the most physiologically likely at pH7. By examining the

changes in structure and property prediction values of compounds due to the STT, a

series of commonly misrepresented tautomer substructures will be gathered for

further investigation.

In particular, the physical properties lipophilicity (log P), aqueous solubility

(log Sw) and acid-base ionization constant (pKa) will be studied. The property

prediction tools ELOGP (v2) (log P), ESOL (v1.1) (solubility) and ACD/pKa (v6.16)

(pKa) will be used to perform the calculations, accessed via the Syngenta SOLSTICE

web browser-based interface. The reasons for the individual prediction tools failing

to give values for individual compounds will also be reviewed.

A comparison of predicted and measured property values for compounds will

help evaluate how well the STT performs and which tautomers give the best

predictions; in other words, how often it produces tautomers with accurately

estimated physical properties. It will also be investigated whether there are

tautomeric compounds within the datasets studied that are drawn in a “wrong”

tautomer that the STT fails to identify. The results of these findings may help suggest

ways the STT’s performance could be improved.

2 Methodology

2.1 Introduction

The physical and molecular properties of compounds of potential

agrochemical and pharmaceutical interest are especially important in relation to their

biological activity. The screening of candidate structures before they reach the

synthesis and activity profiling stages of development would mean that as well as

saving on the cost and time of their ultimately unnecessary synthesis, those lead

compounds that are identified are more likely to be successful and strongly active.

Properties such as aqueous solubility, acidity / basicity, and lipophilicity are

often critical to ensuring that a compound is quantitatively delivered to its target

active site and binds strongly with it. Therefore developing tools to accurately predict

these properties from a molecule’s structure is of considerable interest.

As discussed in Chapter 1, Section 5, the accuracy of such structure-based

predictions is particularly called into question for tautomeric compounds, whose

structures can take multiple forms and where the equilibrium between them is often

either undetermined or dependent on the physicochemical environment in which the compound is

placed. This issue is especially important as compounds are most often stored in

chemical databases as single tautomers and the choice of tautomer is dependent on

the conventions of the individual or organisation who entered it there.

These drawing conventions are therefore likely to vary considerably, both in

rules used and the rigour with which they are applied. So though the conventions

used by Syngenta are very strict, in a large collection of compounds including

examples from other published collections, various other uncertain drawing

conventions are likely to be prevalent as well. The concept of the Native Drawing

Convention (NDC) will therefore be used to refer to the specific structure (e.g.

tautomer) of a compound stored in a particular database, whatever the drawing

conventions applied to it were.

The aim of this chapter is to identify a protocol for assessing the extent of

tautomeric misrepresentation in a given dataset of NDC structured compounds and to

evaluate the likely influence that it has on the prediction of their physical properties:

lipophilicity, solubility, acid-base ionisation constant and charge at pH7. Extensive

use will be made of a Structure Transformation Tool (STT) to help identify

tautomeric compounds considered to be drawn in the “wrong” form, convert them to

a form it considers likely to be the most physiologically-relevant and so hopefully

improve the quality of subsequent property predictions made for them.

The effectiveness and limitations of the predictors themselves will also be

considered by analysing the reason for individual prediction failures. Finally the

limitations of the STT will be examined to identify compounds containing potential

tautomer misrepresentation issues that have either gone ignored or were found but

the structure was already considered to be in the “right” form. The main questions

that the methodology will aim to address for a given compound dataset are:

• What proportion of NDC compounds in the dataset does the STT

consider to be represented in the “wrong” tautomer form?

• What distinct substructures are identified as being in the “wrong” form

and how does the STT modify them to make them “right”?

• How similar or different are the predictions between the different

tautomers and how do they compare with the measured values, where

available, for each such compound?

• How often does the tautomer output by the STT improve the accuracy

of predictions? Are there other important tautomers that the STT

appears to overlook?

• Of the compounds that are not changed by the STT, are there any where

a tautomerism issue has been completely missed?

• Could the STT’s structure-changing rules be enhanced?

2.2 SMILES notation

The Daylight SMILES notation (Daylight, 2004a) of chemical structures is

now a widely recognised standard. It provides, via a relatively simple set of

conventions, a means of describing a two-dimensional chemical structure as a linear

character string. These codes can then be used by software applications to regenerate

structures for whatever purpose they require. This form of notation will be the one

presented to the various property prediction techniques and the STT used within this

project as well as on occasion within this dissertation. The basic conventions of

SMILES are relatively few:

• Atoms are represented by their upper case alphabetic symbols. Lower

case symbols represent aromatic centres.

• Hydrogens are automatically assumed to be present. e.g. CC represents

ethane, CH3-CH3.

• Ring closures are indicated by matching digits on the atoms at each end of

the “join”. e.g. C1CCCCC1 represents cyclohexane.

• Double bonds are drawn as “=” and triple bonds as “#”.

• Branch points are denoted with brackets, e.g. phenol – c1ccc(O)cc1

A typical SMILES file, bearing the extension .smi, is a simple text file

comprising one line per structure, each line being in the format

“<SMILES string><space><structure ID>”. The preparation of SMILES files from a

given compound set therefore forms the important first step of the methodology.
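The file layout described above can be read with a few lines of Python. This is a minimal sketch that assumes the single-space "<SMILES string><space><structure ID>" layout given in the text; the function name and the example records are illustrative and not part of any Daylight or SOLSTICE tool.

```python
from typing import Iterable, Iterator, Tuple


def read_smi(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (SMILES, structure ID) pairs from the lines of a .smi file."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first space only, so IDs containing spaces survive.
        smiles, _, ident = line.partition(" ")
        yield smiles, ident.strip()


# Example input using SMILES quoted in the text; the IDs are hypothetical.
records = list(read_smi(["c1ccc(O)cc1 phenol", "", "C1CCCCC1 cyclohexane"]))
print(records)
```

In practice the same function could be passed an open file handle, since a text file iterates line by line.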

2.3 SMARTS notation

The Daylight SMARTS notation (Daylight, 2004b) is effectively an extension

of the SMILES language that allows more variability to be built into an atom or bond

structure pattern by the use of AND, OR and NOT operators. Therefore in principle,

a single SMARTS may represent any number of specific SMILES that happen to

match a valid instance of its pattern. For example [!#7&a] represents any atom that

is not a nitrogen and is aromatic (a bare N in SMARTS denotes aliphatic nitrogen only).

SMARTS targets are therefore useful for defining open-ended substructure

series with highly precise rules concerning the atom, bond and positional variations

that are allowed or disallowed. Multiple SMILES specifications can therefore be

compared against a SMARTS target and each one classified as either being a match

or a non-match and handled accordingly.

2.4 Estimating compound lipophilicity: ELOGP

ELOGP v2 provides an estimate of compound lipophilicity - log P. As

reported in Chapter 1, Section 3, various approaches to log P prediction have been

developed, with the majority based on molecule fragmentation techniques. Following

extensive evaluation work on these various tools, the consensus scoring ELOGP

approach developed by Draper (2002) and then applied by Clarke and Delaney

(2003) and Clarke et al. (2004) based on AlogP v1.5 (Ghose et al., 1988), ACD/logP

v6.16 (ACD, 2004) and CLOGP v4.73 (Daylight, 2004c) was adopted as the

standard log P prediction tool for Syngenta.

2.5 Estimating compound aqueous solubility: ESOL

ESOL v1 is a method for estimating the aqueous solubility at pH7 of a

compound. Its development was fully described by Delaney (2004) and first applied

by Clarke and Delaney (2003). In its SOLSTICE implementation (see Chapter 2,

Section 7) it involves the use of the molecular properties log P (estimated from

ELOGP), molecular weight (MWT), number of rotatable bonds (RB, defined from a

set of SMARTS targets) and aromatic proportion (AP) to derive estimated solubility

Log(Sw) (ESOL log ppm), Equation 2.1:

Log(Sw) = 0.16 – 0.63 ELOGP – 0.0062 MWT + 0.066 RB – 0.74 AP

Equation 2.1

While the MWT, RB and AP components can be derived using absolute rules

for any structure presented, the log P component is dependent on the effectiveness of

the conventions and implementation of ELOGP for its own accuracy.
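Equation 2.1 is simple enough to evaluate directly. The sketch below is a plain transcription of the equation into Python; the function name and the example inputs are hypothetical, and AP is assumed to be expressed as a fraction between 0 and 1.

```python
def esol_log_sw(elogp: float, mwt: float, rb: int, ap: float) -> float:
    """Estimated solubility Log(Sw) from Equation 2.1.

    elogp: ELOGP estimate of log P
    mwt:   molecular weight
    rb:    number of rotatable bonds
    ap:    aromatic proportion (assumed here to be a 0-1 fraction)
    """
    return 0.16 - 0.63 * elogp - 0.0062 * mwt + 0.066 * rb - 0.74 * ap


# Hypothetical compound: log P 2.0, MWT 200, 3 rotatable bonds, AP 0.5.
estimate = esol_log_sw(elogp=2.0, mwt=200.0, rb=3, ap=0.5)
print(f"Log(Sw) = {estimate:.3f}")
```

The coefficient on ELOGP shows why tautomer choice matters for solubility estimates: a shift of one log P unit between tautomers moves the estimated Log(Sw) by 0.63 units, all else being equal.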

2.6 Estimating acid-base ionisation constants: ACD/pKa

ACD/pKa v6.16 (ACD, 2004b) was the chosen pKa prediction tool for this

project. The underlying acid dissociation constant, Ka, reflects the relative

concentrations of an ionisable molecule’s associated and dissociated forms at a given

temperature, usually 25°C. The outputs of ACD/pKa, unlike those of ESOL or ELOGP, are not

necessarily single values and can be of either acidic or basic type.

• Acidic dissociation: HA + H2O ⇌ H3O+ + A-

• Basic dissociation: HB+ + H2O ⇌ H3O+ + B

This reflects the fact that molecules can contain multiple ionisation centres

and so multiple dissociations of one type, or the other, or both become feasible. As a

result only the most basic and / or most acidic pKa calculated by ACD/pKa is / are

reported within a user-defined pH range. The maximum range permitted by

ACD/pKa, pH 0-14, was also the range selected for use during this project. The effect

on predictions of charge at pH 7 using a narrower pH 2-10 range is also investigated

in Chapter 3, Section 5.2.
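The link between a pKa value and the ionisation state at a given pH can be sketched with the standard Henderson-Hasselbalch relationship. This is textbook chemistry and not a description of how ACD/pKa works internally; the function and argument names are illustrative assumptions.

```python
def fraction_ionised(pka: float, ph: float, kind: str) -> float:
    """Fraction of molecules in the charged form at a given pH.

    kind="acid": fraction deprotonated (A-) for HA <-> H+ + A-
    kind="base": fraction protonated (HB+) for HB+ <-> H+ + B,
                 where pka refers to the conjugate acid HB+.
    """
    if kind == "acid":
        return 1.0 / (1.0 + 10.0 ** (pka - ph))
    if kind == "base":
        return 1.0 / (1.0 + 10.0 ** (ph - pka))
    raise ValueError("kind must be 'acid' or 'base'")


# An acid with pKa 6 and a base with pKa 8 are both about 91% ionised at pH 7.
print(round(fraction_ionised(6.0, 7.0, "acid"), 3))
print(round(fraction_ionised(8.0, 7.0, "base"), 3))
```

A species is half ionised when the pH equals its pKa, and each further pH unit changes the ratio of the two forms tenfold.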

ACD/pKa also has a tautomer checking utility that was used extensively to

predict whether drawn tautomeric structures were likely to be “major” or “minor” ones.

These results provided useful comparisons with the types of tautomer change

performed by the Structure Transformation Tool (STT) (Chapter 2, Section 8.3) to

determine whether its effect was always a positive one, i.e. whether it always

converted tautomers to a “major” form.

2.7 The SOLSTICE tool set

SOLSTICE v2.18 is a Syngenta in-house suite of structure handling, statistics

generation and file format inter-conversion utilities bundled together and accessed

via an Intranet web browser interface. Amongst its facilities are:

• ELOGP v2 log P octanol prediction (encompassing ACD/logP v6.16,

AlogP v1.5 and ClogP v4.73) (Clarke & Delaney, 2003; Clarke et al.,

2004)

• ESOL v1 aqueous solubility prediction (including ELOGP v2)

(Delaney, 2004)

• pKa prediction using ACD/pKa v6.16 (part of ACD PhysChem v6

(ACD, 2004c))

• SMILES > SDF structure file format inter-conversion (Chapter 2,

Section 8.1)

• SDF > SMILES structure file format inter-conversion (Chapter 2,

Section 8.1)

• Unique structure identification (identifying duplicates and validating

SMILES)

• SMILES canonicalisation (Chapter 2, Section 8.2)

For simplicity, these tools will largely be referred to from here on without

their version numbers. SOLSTICE allows dataset files to be uploaded and stored on

its server in a variety of different formats and for batched “jobs” to be processed,

results to be viewed on-screen and output files to be downloaded for further

processing.

Jobs submitted for processing but not yet complete remain queued in the

“background” allowing continued use of SOLSTICE for other tasks. Results of past

jobs can also be stored online, organised into project folders and published so that

other SOLSTICE users can access them. Between acquiring a set of NDC

compounds for study and assessing how the questions in Chapter 2, Section 1 can be

answered, the following stages were followed:

• Compound data set preparation

• Compound property prediction

• Result collation, indexing and presentation

2.8 Compound data set preparation

2.8.1 Files in .sdf format

While structures are often stored as SMILES, another format is “structure

data format” (.sdf), originally developed by MDL Information Systems (MDL,

2003). This format uses a connection table approach to represent structures and is

supported by many current chemical software packages, some of which have their

own parsers that automatically interpret .sdf files into structure diagrams. If a

compound set is provided in such a format the SDF > SMILES conversion routine of

SOLSTICE can be used to generate the required SMILES. A further SOLSTICE

routine is available to perform the reverse conversion if required.

2.8.2 SMILES canonicalisation

A SMILES is a non-unique way of representing a structure. This means that

in general, different but equally valid SMILES strings can represent a given

structure. Therefore in principle any one of them could be used to make physical

property predictions for a given compound and be expected to give the same result.

In practice however it has been discovered that the choice of SMILES variant used

sometimes has a bearing on the value of the prediction made when structures contain

6-membered aromatic rings with one or more nitrogens. In particular, the AlogP

contribution of ELOGP was found to be so-affected, prompting further investigation

into the cause.

To illustrate the issue, there are twelve distinct SMILES representations of

2-ethoxypyridine, depending on how the aromatic ring is “split open” and which

“end” of the molecule is read from first. Table 2.1 shows that two AlogP values

occur with equal frequency and differ by a not insignificant amount (0.47 log P

units). This difference however has a less significant influence on ELOGP since it is

largely averaged-out when the calculated and unaffected ClogP and ACD/logP

values are included.

SMILES             AlogP
c1cnc(OCC)cc1      1.988
c1cc(OCC)ncc1      1.988
n1ccccc1OCC        1.988
c1(OCC)ccccn1      1.523
c1(OCC)ncccc1      1.523
c1cccnc1OCC        1.988
c1c(OCC)nccc1      1.523
c1ccnc(OCC)c1      1.523
n1c(OCC)cccc1      1.523
c1cccc(OCC)n1      1.988
c1nc(OCC)ccc1      1.988
c1ccc(OCC)nc1      1.523

(ACD/logP (all) = 1.855; ClogP (all) = 1.994)

Table 2.1: The differences in AlogP predictions for different SMILES of 2-ethoxypyridine
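The figures in Table 2.1 can be checked with a short script. The sketch below tallies the two AlogP values and then estimates how much of the discrepancy would survive in a consensus score. The consensus step assumes, purely for illustration, that ELOGP is an unweighted mean of its three component predictions; the text does not state the actual ELOGP weighting.

```python
from collections import Counter

# AlogP values from Table 2.1, keyed by the SMILES variant presented.
alogp_by_smiles = {
    "c1cnc(OCC)cc1": 1.988, "c1cc(OCC)ncc1": 1.988, "n1ccccc1OCC": 1.988,
    "c1(OCC)ccccn1": 1.523, "c1(OCC)ncccc1": 1.523, "c1cccnc1OCC": 1.988,
    "c1c(OCC)nccc1": 1.523, "c1ccnc(OCC)c1": 1.523, "n1c(OCC)cccc1": 1.523,
    "c1cccc(OCC)n1": 1.988, "c1nc(OCC)ccc1": 1.988, "c1ccc(OCC)nc1": 1.523,
}
ACD_LOGP, CLOGP = 1.855, 1.994  # identical for every SMILES variant

counts = Counter(alogp_by_smiles.values())
low, high = sorted(counts)
alogp_spread = high - low

# ASSUMPTION: simple three-way mean standing in for the ELOGP consensus.
consensus = {v: (v + ACD_LOGP + CLOGP) / 3 for v in counts}
consensus_spread = max(consensus.values()) - min(consensus.values())

print(counts)
print(f"AlogP spread = {alogp_spread:.3f}, consensus spread = {consensus_spread:.3f}")
```

Under that assumed averaging, the discrepancy in the consensus value is one third of the AlogP discrepancy, which is consistent with the statement that the effect on ELOGP is largely averaged out.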

The same effect was found in the predictions of a number of other simple

aromatic, nitrogen-containing structures such as 2-hydroxypyridine,

[2,2’]-bipyridinyl, quinoline and 7,8-dihydro-cinnoline (Figure 2.1).

[Figure 2.1: structures of 2-hydroxypyridine, [2,2’]-bipyridinyl, quinoline and 7,8-dihydro-cinnoline]

In each case, pairs of AlogP values also differing by 0.47 log P units were

predicted, depending on which particular SMILES was used. The consistency of the

difference between predictions suggests that this is a commonly replicated problem

with AlogP’s SMILES parser which sometimes assigns a different set of atom types

to atoms either side of “joins”, depending on which SMILES format is presented.

In order to counteract the effect, it was decided that all compounds would

have their SMILES canonicalised using SOLSTICE’s Unique Structures tool before

any predictions were carried out. This procedure reassigns SMILES using the

accepted Daylight conventions (Weininger et al., 1989 & Daylight, 2004d) and a

parser from Daylight’s SMILES Toolkit v4.8 (Daylight, 2004e). Whilst

canonicalisation cannot be considered a way of improving the accuracy of ELOGP

and ESOL predictions, it does help improve their consistency and comparability by

removing the possibility of the same structure appearing to give different ELOGP

predictions.

2.8.3 Leatherface: A tool for transforming chemical structures

Leatherface is a UNIX command line Structure Transformation Tool (STT)

developed by Kenny (1999) and designed to convert molecules into a form

considered to be most chemically or physiologically relevant at pH7. It does so by

applying structural modification rules to SMILES specifications identified using

SMARTS targets. These alterations usually take the form of:

• Changing the tautomeric form of a compound

• Protonation of anions to remove charge (e.g. carboxylate →

carboxylic acid)

• Deprotonation of cations to remove charge (e.g. triethylammonium →

triethylamine)

• Changing of resonance hybrid to remove charge separation (e.g. nitro

group, Figure 2.2)

[Figure 2.2: conversion by the STT of the charge-separated form of the nitro group into an uncharged form]

Structures already considered to be in an appropriate form are unaltered by

the STT. The rules applied to a SMILES that matches a SMARTS target state how

atom charges should be changed, bond orders should be changed and where

hydrogens should be added or removed. These rules applied by the STT may be

supplemented at any time by editing the .vb and .smt files it consults each time it is

executed. The .vb (“Vector Binding”) file contains the shortcut SMARTS definitions

of the different target substructures. The .smt (“SMARTS definitions”) file contains

the corresponding structure changing rules for each SMARTS to be applied to

matching SMILES.

In these studies the STT was always provided with a canonicalised NDC

SMILES file as input. The output file of results was also saved as a SMILES file.

After invoking the STT, the following program command sequence was followed,

after which the output file was generated and the STT closed:

Do you require assistance? N

Enter SMILES file: <SMILES file name>

Enter SMARTS definition file: <name of .smt file>

Enter SMILES output file: <chosen .smi file name>

Will a vector binding file be used? Y

Enter vector binding file: <name of .vb file>

Also built into the STT is a canonicalisation routine, which applies Daylight

conventions (Daylight, 2004d) to its SMILES results before they are written to the

output file. However as it was not confirmed which version of the Daylight parser the

STT called, each output file from the STT was also passed through the Unique

Structures utility of SOLSTICE to ensure consistency. The set of SMILES structures

obtained from these steps represent each compound’s considered Physiologically

Relevant Form, referred to from here on as its PRF.

Each compound dataset therefore now comprises an NDC and a PRF set of

SMILES structures. Depending on whether the STT has modified a compound’s

structure, its NDC and PRF form may or may not be identical. Identifying what

number and kind of changes the STT makes to structures forms an important part of

the property prediction analysis of the datasets that follows.

2.9 Compound property prediction

All the necessary predictions of log P (ELOGP), solubility (ESOL) and pKa

(ACD/pKa via ACD PhysChem v6) for both NDC and PRF structure sets are now

acquired using SOLSTICE.

2.9.1 ELOGP

The .csv formatted “ELOGP” output file contained all the predictions made

by the job. It was downloaded for further processing and the following data fields

saved from SOLSTICE:

• Compound reference

• ELOGP value

• Clog P value

• ACD/logP value

• AlogP value

In some circumstances examined in Chapter 3, Section 6, the individual

prediction methods AlogP and / or ACD/logP that underlie ELOGP sometimes failed

to give values for particular structures. This sometimes prevented meaningful

ELOGP prediction comparisons from being made between the NDC and PRF forms

of the same structure or the same form of different structures. This issue is dealt with


2.9.2 ESOL

The .csv formatted “ESOL Results” summary file contained all the

predictions made by an ESOL job and was downloaded for each job run. The fields

selected for saving from SOLSTICE were:

• Compound reference

• ESOL value

2.9.3 pKa

From each ACD/pKa run conducted, the following data fields for each

compound were downloaded and saved in .csv format from the “PhysChem

Results Table”:

• Compound reference

• pKa 1 (i.e. 1st predicted value)

• pKa 1 flag (i.e. whether 1st predicted value is a most acidic “MA” or

most basic “MB” pKa)

• pKa 2 (i.e. any 2nd predicted value – often blank)

• pKa 2 flag (often blank)

2.10 Result collation, indexing and data presentation

2.10.1 DIVA – A spreadsheet for manipulating and displaying chemical information

DIVA v 2.1 (“Diverse Information Visualization and Analysis”) is a specialist

spreadsheet application developed by Accelrys (2004) for managing and visualising

chemical data and was the primary data-gathering tool for this project. It allows users

to:

• Visualise chemical structures stored in .sdf format as fields in a

spreadsheet.

• Collect data from a variety of different sources together in a single

environment

• Display trends, patterns and relationships in data using graphs, charts

and diagrams.

• Merge compound data sets based on a common index shared by them

– typically a compound reference number

• Produce reports summarising compound set information.

Data gathered about a particular compound set was typically combined in the

following order into a single DIVA (.div) spreadsheet:

1. Import a list of the compound reference numbers for the dataset

2. Merge measured log P, solubility and pKa data where available.

3. For the NDC followed by the PRF structure sets:

Merge sdf structure

Merge ELOGP data

Merge ESOL data

Merge pKa data

4. Finally export the dataset as a .csv file.
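The merge steps above amount to joining several tables on a shared compound reference. The following minimal sketch shows that idea with plain Python dictionaries. It is not DIVA's implementation, and the reference numbers, field names and values are all hypothetical.

```python
def merge_on_reference(*tables: dict) -> dict:
    """Combine tables keyed by compound reference into one record per compound."""
    merged: dict = {}
    for table in tables:
        for ref, fields in table.items():
            # Create the compound's row on first sight, then add the new fields.
            merged.setdefault(ref, {}).update(fields)
    return merged


# Hypothetical data: measured values exist for only one of the two compounds.
measured = {"CPD-001": {"measured_logp": 1.4}}
ndc_elogp = {"CPD-001": {"ndc_elogp": 1.9}, "CPD-002": {"ndc_elogp": 0.3}}
prf_elogp = {"CPD-001": {"prf_elogp": 1.1}, "CPD-002": {"prf_elogp": 0.3}}

sheet = merge_on_reference(measured, ndc_elogp, prf_elogp)
for ref, row in sheet.items():
    print(ref, row)
```

Prefixing each field with the structure set it came from (NDC or PRF) keeps the two sets of predictions distinct after merging.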

The exporting process allowed the data to be read into Microsoft Excel, the

exception being that the fields which contained .sdf structures were converted into

the SMILES they were originally derived from. The use of Microsoft Excel alongside

DIVA stemmed from Excel’s more powerful sorting and calculation-performing

capabilities.

2.10.2 Post processing of the dataset

Before meaningful, comparable results could be drawn from a compound

dataset, some indexing and simple calculations needed to be performed using

MS Excel. The indexing took the form of flagging each compound “yes” or “no”

against a series of criteria by adding a number of additional index fields to the

spreadsheet:

1. Does the STT change the structure of the compound? If the NDC and

PRF SMILES were identical then Structure Changed? index = “no”,

otherwise “yes”.

2. Have AlogP, ACD/logP and ClogP predictions all successfully been

made for both NDC forms and PRFs of the compound? The Valid

ELOGP / ESOL for comparison? flag was set to “yes” if this was the

case or “no” if any of these predictions failed. This unfortunately may

remove a small proportion of compounds from any ELOGP

comparisons made, but makes sure that compounds with false

differences in predicted ELOGP because of prediction failure alone are

not confused with compounds where there is a genuine difference.

3. Have pKa predictions been made successfully for both NDC forms and

PRFs of the compound, i.e. Do both of the pKa 1 fields contain values?

The Valid pKa for comparison? flag was set to “yes” if this was the

case or “no” if either of these predictions failed.
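The three index fields above can be expressed as simple tests on each compound record. The original work performed this indexing in MS Excel; the Python sketch below only mirrors the stated logic, with hypothetical field names and example values, and with a failed prediction represented by None.

```python
LOGP_METHODS = ("alogp", "acdlogp", "clogp")


def yes_no(flag: bool) -> str:
    return "yes" if flag else "no"


def index_compound(rec: dict) -> dict:
    """Return the three yes/no index fields for one compound record."""
    structure_changed = rec["ndc_smiles"] != rec["prf_smiles"]
    # All three component predictions must exist for both NDC and PRF forms.
    logp_ok = all(
        rec.get(f"{form}_{method}") is not None
        for form in ("ndc", "prf")
        for method in LOGP_METHODS
    )
    pka_ok = rec.get("ndc_pka1") is not None and rec.get("prf_pka1") is not None
    return {
        "Structure Changed?": yes_no(structure_changed),
        "Valid ELOGP / ESOL for comparison?": yes_no(logp_ok),
        "Valid pKa for comparison?": yes_no(pka_ok),
    }


# Hypothetical record: the numbers are placeholders, not real predictions.
example = {
    "ndc_smiles": "Oc1ccccn1", "prf_smiles": "O=c1cccc[nH]1",
    "ndc_alogp": 1.0, "ndc_acdlogp": 0.9, "ndc_clogp": 1.1,
    "prf_alogp": None, "prf_acdlogp": -0.3, "prf_clogp": -0.5,
    "ndc_pka1": 9.0, "prf_pka1": 11.6,
}
flags = index_compound(example)
print(flags)
```

In this example the missing AlogP value for the PRF form is enough to exclude the compound from ELOGP and ESOL comparisons, while the pKa comparison remains valid.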

Since multiple pKa predictions and measurements were possible for a single

compound, care was taken to make sure that pKas of the same type were being

compared. As the fields pKa 1 and pKa 2 from ACD/pKa SOLSTICE results files

may contain either acidic and / or basic values, a degree of manual inspection and

rearrangement of data was sometimes required before comparisons were made.

Additional care was also taken when, for example, an acidic pKa prediction

had been obtained but there was more than one measured acidic pKa to compare it

to. In this situation since the predicted pKa value quoted will be the “most acidic”

one, it was primarily related to the most acidic measured pKa, i.e. the one with the

lowest value. The comparison of measured and predicted data is covered more fully


2.10.3 Allocating a predicted charge at pH7

Also of interest were any changes in predicted charge on each structure

between pairs of tautomers at pH7; this being the pH closest to which most

compounds exist in nature. In order to partition the compounds it was assumed that

“most acidic” (MA) pKa 1s of 6 or lower would result in them existing

predominately in a deprotonated state with a single negative charge. For “most basic”

(MB) pKa 1s of more than 8 it was assumed that compounds would exist

predominately in a protonated form with a single positive charge. For the remaining

compound pKas it was harder to predict the protonation state and so the compounds were assumed

to be neutral structures.
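The partitioning rule described above can be written as a small function. This is a sketch of the stated thresholds only (MA pKa 1 of 6 or lower gives a charge of -1, MB pKa 1 above 8 gives +1, everything else is treated as neutral); the function name and the example values are hypothetical.

```python
def charge_at_ph7(pka1: float, flag: str) -> int:
    """Assign a formal charge at pH 7 from pKa 1 and its MA/MB flag."""
    if flag == "MA" and pka1 <= 6:
        return -1  # most acidic pKa of 6 or lower: predominantly deprotonated
    if flag == "MB" and pka1 > 8:
        return 1   # most basic pKa above 8: predominantly protonated
    return 0       # remaining cases are assumed to be neutral


# Hypothetical (pKa 1, flag) pairs covering each branch and both boundaries.
examples = [(4.5, "MA"), (6.0, "MA"), (7.0, "MA"), (9.2, "MB"), (8.0, "MB")]
charges = [charge_at_ph7(pka, flag) for pka, flag in examples]
print(charges)
```

The boundaries are asymmetric by design in the text: an MA value of exactly 6 counts as charged, whereas an MB value of exactly 8 does not.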

On this basis, two “Formal charge at pH7” fields were added to the

datasheet and completed accordingly for the NDC and PRF forms of those

compounds where criterion 3 in Chapter 2, Section 10.2 was indexed “yes”.

2.10.4 Data analysis and presentation

In order to compare the predictions made for the NDC and PRF compound

pairs the following additional calculations were performed and graphs plotted:

• For compounds where criterion 2 in Chapter 2, Section 10.2 was met, the

absolute difference between the ELOGP predictions of their NDC and PRF

structures was calculated.

o The NDC and PRF ELOGP predictions were then plotted against each

other.

o The distribution of PRF ELOGP values was also plotted.

• For compounds where criterion 2 in Chapter 2, Section 10.2 was met, the

absolute difference between the ESOL predictions of their NDC and PRF

structures was calculated.

o The NDC and PRF ESOL predictions were then plotted against each

other.

o The distribution of PRF ESOL values was also plotted.

• For compounds where criterion 3 in Chapter 2, Section 10.2 was met, the

absolute difference between the pKa 1 predictions of their NDC and PRF

structures was calculated.

o The NDC and PRF pKa 1 predictions were then plotted against each

other.

o The distribution of PRF pKa 1 values was also plotted.

A statistical breakdown of the effect of the STT on a dataset allowed a

measure of the tautomer misrepresentation issue to be gauged. This was done by

partitioning each compound into one of four categories in Table 2.2 according to:


• Whether the STT changed its structure in some way.

• Whether the predicted ELOGPs, ESOLs or pKas of its NDC and PRF

forms were different.

                                     Changed structure?
                                     No        Yes
Physical property       No           a         b
changed value?          Yes          c         d

Table 2.2: Classification of changes caused to compounds by the STT

1. Compounds matching type a were unchanged by the STT and therefore

saw no property prediction change.

2. Compounds matching type b were cases where, for whatever reason, a

structural change did not lead to a change in property prediction value.

3. Any compounds matching type c could only arise from “bugs” in a prediction routine, as this would require the same compound structure to give rise to two different prediction values. It was through compounds appearing here in error, during the analysis of the ELOGP results of non-canonicalised SMILES, that the problem with AlogP, discussed in Chapter 2, Section 8.2 and Chapter 3, Section 6.1, was first discovered.

4. Compounds matching type d were most likely to be those where the

STT has encountered a tautomer misrepresentation issue, modified its

structure, and a change in property prediction resulted.
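The four-way partition of Table 2.2 can be expressed as a short sketch. It assumes that each structure is available as a canonical SMILES string, so that a string comparison detects a structural change; the function name, tolerance and example records are hypothetical.

```python
from collections import Counter


def classify_change(ndc_smiles, prf_smiles, ndc_value, prf_value, tol=1e-9):
    """Return the Table 2.2 category ("a", "b", "c" or "d") for one compound."""
    structure_changed = ndc_smiles != prf_smiles  # assumes canonical SMILES
    value_changed = abs(ndc_value - prf_value) > tol
    if structure_changed:
        return "d" if value_changed else "b"
    return "c" if value_changed else "a"


# Hypothetical records: (NDC SMILES, PRF SMILES, NDC value, PRF value).
records = [
    ("CCO", "CCO", -0.20, -0.20),                # unchanged: type a
    ("Oc1ccccn1", "O=c1cccc[nH]1", 1.10, 0.10),  # tautomer change: type d
    ("CC(C)=O", "CC(C)=O", 0.30, 0.50),          # same structure, new value: type c
]
tally = Counter(classify_change(*r) for r in records)
print(tally)
```

Any non-zero count for type c in such a tally would point to an inconsistency in the prediction routine rather than to a tautomer issue.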


2.10.5 Identifying tautomeric substructures

Having identified the d sub-set of compounds (Table 2.2) whose structure and

property predictions had changed due to the STT, it was necessary to identify what

the specific structural changes were, and to categorise them accordingly. Non-

tautomeric changes made, e.g. protonation or deprotonation of heteroatoms to

neutralise charges, could be identified and sidelined at this point.

It was initially decided to allocate each compound to a substructure class

based only on the immediate local region about which the STT had performed its

tautomer transformation. So for example, the simple 2-pyridone framework B

(Figure 2.3) would be considered a general class to represent all the ring systems A,

with its A-groups representing substituents of any nature.

[Structure drawings: example ring systems A and the general 2-pyridone framework B; each structure contains an O / NH unit and substituents labelled A]

Figure 2.3

When analysing the prediction data however it was found useful to subdivide

these broad classes into more specific substructural types, by separating those of

different ring system configuration in the tautomeric region of each molecule.

Additionally, each definition of a specific class was extended to the limits of

substituent conjugation where heteroatoms were involved and where a prototropic

tautomer shift involving them was theoretically possible. So in Figure 2.3, each


structure A example was now considered a separate class, where the A-groups

although, in principle, still representing any group, now cannot form rings with each

other or participate in tautomerism.

2.10.6 Other data analysis indicators

• For compounds of type d (Table 2.2), the distributions of the absolute

differences between the NDC and PRF predictions for each property

showed whether certain difference values occurred more repeatedly

than others. By analysing which types of structural change the STT had

performed, tautomer transformations common to particular narrow

absolute difference ranges were sometimes identifiable.

• Plots of the NDC structure’s and PRF structure’s predicted charge

distribution at pH7 for the entire compound set indicated the effect that

the STT had on its expected charge distribution. Also examined were

the specific numbers of compounds whose predicted charge changed

due to a change in structure.

2.10.7 Comparison of measured and predicted log P and pKa values

Log Ps and pKas are among the more common physical property

measurements made for compounds. Given a dataset for which both predicted and

measured data was available, both the accuracy of the predictions and the degree to

which the STT improved them by converting structures to their presumed “right”

form could be gauged.

This was done by calculating the absolute difference between pairs of

predicted and measured values for a compound. The size and sign of the disparity


between these absolute differences for a compound’s NDC and PRF structural forms

provided a measure of which form gave the more accurate prediction. A positive

disparity represented an improvement in prediction accuracy through the use of the

STT, suggesting that the PRF form of the structure was a better representation of the

compound. Negative disparities indicated that the NDC form of the structure gave a

more accurate prediction than did the PRF form, suggesting that the former tautomer

may after all be the more representative form. By tabulating these disparities,

comparisons with other compounds of the same or different sub-classes defined in

Chapter 2, Section 10.5 could then be drawn.
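The disparity measure can be sketched as follows. The order of subtraction (NDC error minus PRF error) is inferred from the statement that a positive disparity represents an improvement through use of the STT; the function name and the numbers are illustrative assumptions.

```python
def disparity(measured, ndc_pred, prf_pred):
    """|NDC error| minus |PRF error|; positive means the PRF is closer."""
    return abs(ndc_pred - measured) - abs(prf_pred - measured)


# Illustrative numbers only: measured log P 2.0, NDC 3.0, PRF 2.2.
d = disparity(2.0, 3.0, 2.2)
print(round(d, 2))
```

A value of zero indicates that both forms are equally far from the measurement, which includes the case where the STT left the prediction unchanged.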

2.10.8 Analysis of prediction failures

The prediction routines AlogP, ACD/logP and ACD/pKa were sometimes unsuccessful at giving values for individual structures, leading to blank results appearing in output files. A detailed analysis of the structures of the specific compounds concerned, together with any error messages produced during the running of the prediction job, helped identify the common reasons why failures occurred, and pinpointed the specific structural features that appeared to give problems repeatedly. This in turn helped suggest ways in which each prediction tool could be improved, or at least highlighted its limitations more specifically.

2.10.9 CHI data – a source of information about tautomer classes not highlighted by the STT

CHI (Chromatographic Hydrophobicity Index) is a reversed-phase HPLC technique that enables a high-throughput assessment of lipophilicity to be made (Valkó et al., 1997 & Kaliszan et al., 1999). A sample of interest is injected into an aqueous buffer solution at a constant rate and the percentage of organic mobile phase, usually acetonitrile, is steadily increased at a constant gradient. The retention time at which the sample is equally distributed between the aqueous and organic phases is used in conjunction with the instrument / column’s calibration curve to determine its CHI value at that aqueous pH.

CHI values can be used as indicators of log P and acidic or basic pKa when

measured at multiple pHs and can suggest whether a structure exists in different

forms. With respect to tautomerism, analysis of measured CHI data allowed

compounds potentially containing tautomer issues to be highlighted. By examining

these more closely, previously unidentified tautomeric compounds missed by the

STT could be identified. CHI values for compounds examined in this study were all

recorded at pHs 2.5, 7 and 10.


3 Results and discussion

3.1 About this chapter

The work discussed in this Chapter covers the tautomer misrepresentation

issue in relation to compound property predictions in several consecutive themes:

• Examining the property prediction and structural changes to datasets due

to passing them through a Structure Transformation Tool (STT) that

seeks, amongst other things, to correct tautomers drawn in the “wrong”

form.

• Analysis of the problems and specific prediction failures associated with

the particular prediction tools used.

• Classifying the tautomer types identified and assessing the validity of the

structural changes applied to them.

• Comparison of predicted and measured property values to assess the

benefits to property predictions of applying the STT to datasets.

• Investigation of a method to determine whether there are tautomer issues

either ignored or unchanged by the STT.

3.2 Introducing the datasets

The methodology developed in Chapter 2 was largely derived from the

experience gained of working with two test sets of compounds. One was compiled as

a result of research activities at Syngenta in recent years; the other is a published list

of both current and past agrochemical products.


• Compound set 1 comprises 2,616 compounds that have been highlighted

as hits of interest from high throughput screening (HTS) and lead

compounds from a variety of research projects. As such they form part of

the Syngenta compound collection and are likely to provide good

coverage of recent agrochemical-like compound classes. It will be

commonly referred to as the HTS dataset and its compounds have been

given generic reference numbers of the type HTSxxxx (where xxxx =

0001-2616).

• Compound set 2 comprises 1,359 compounds from the Pesticide Manual

(Tomlin, 2000) and contains examples of both current and superseded

products. It will commonly be referred to as the PM dataset and its

compounds have reference numbers of the type PLxxxx (where xxxx are

values in the range 0001-1618).

As moderate sized sets of compounds, they will generate an easily-managed

amount of data but still be big enough for meaningful trends to be extracted from

them to shape the methodology they are being used to develop.

3.3 Comparing the property predictions made for the NDC forms and PRFs of each compound set

To judge the effect that the STT had on each dataset, the differences in the

predictions between their Native Drawing Convention (NDC) forms and

Physiologically Relevant Forms (PRFs) will be judged from the numbers of

compounds whose prediction values changed and on the size of those changes. This

will indicate how serious an issue presenting the “wrong” tautomer to a log P, pKa or

solubility prediction tool is.


3.3.1 ELOGP

Figures 3.1 and 3.2 show the plots of NDC form versus PRF ELOGP

predictions for the HTS and PM datasets respectively. As discussed in Chapter 2,

Section 10.2, these comparisons exclude the small number of compounds where one

or more of the log P prediction methods underpinning ELOGP fail.

[Scatter plot: PRF ELOGP (y-axis) against NDC ELOGP (x-axis); both axes run from -5 to 13]

Figure 3.1: Comparison of NDC and PRF ELOGP predictions for the HTS dataset

[Scatter plot: PRF ELOGP (y-axis) against NDC ELOGP (x-axis); both axes run from -5 to 13]

Figure 3.2: Comparison of NDC and PRF ELOGP predictions for the PM dataset

Figure 3.1 shows that the majority of NDC and PRF ELOGP predictions for the HTS dataset were identical. In only 69 (2.7%) of the 2,520 cases compared was a difference observed between them. For the PM dataset in Figure 3.2, 37 (2.9%) of the 1,295 cases compared gave different predictions. The distributions of the absolute non-zero ELOGP prediction differences for the HTS and PM datasets are shown in Figures 3.3 and 3.4 respectively.

[Histogram: compound count (y-axis, 0 to 20) against NDC / PRF absolute ELOGP prediction difference (x-axis, 0 to 2.4)]

Figure 3.3: The distribution of non-zero absolute differences between NDC and PRF ELOGP predictions for the HTS dataset

[Histogram: compound count (y-axis, 0 to 8) against NDC / PRF absolute ELOGP prediction difference (x-axis, 0 to 2.4)]

Figure 3.4: The distribution of non-zero absolute differences between NDC and PRF ELOGP predictions for the PM dataset

Figures 3.3 and 3.4 show that the differences in predictions between the NDC forms and PRFs of the affected compounds were as much as 2.33 log P units, and on average 1.05 and 0.83 log P units for the HTS and PM datasets respectively. Since there are likely to be relatively few distinct tautomeric substructures, each of which can be found in multiple molecules, it may be expected that inter-converting specific examples of the same type would give rise to similar differences in predicted ELOGP or ESOL value between pairs of tautomers, and hence to an irregular rather than a smooth distribution.

While both datasets are relatively small, making it difficult to extract detailed

correlations, some standard difference patterns could be observed. The strongest

example occurs in the absolute ELOGP difference “bin” 1.00-1.10 in Figure 3.3,

which also coincides with the highest count of non-zero differences for the HTS

dataset. 14 of these 18 compounds underwent the same tautomerisation (Figure 3.5)

and represent all but 3 of the examples of the type found in that dataset.

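[Scheme: tautomer pair linked by the STT; a hydroxy form (N, N, OH; substituents R1, R2, R3) and an oxo form (N, NH, O; substituents R1, R2, R3)]

Figure 3.5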
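The binning of non-zero absolute differences used to find the most populated "bin" can be sketched as follows. The bin width of 0.1 matches the ELOGP "bin" 1.00-1.10 quoted in the text; the function name and the sample differences are assumptions and are not the thesis data.

```python
import math
from collections import Counter


def bin_differences(diffs, width=0.1):
    """Count non-zero differences per bin; key k covers [k*width, (k+1)*width)."""
    counts = Counter()
    for d in diffs:
        if d == 0:
            continue  # only non-zero differences are plotted
        # Rounding first guards against floating-point noise at bin edges.
        counts[math.floor(round(d / width, 9))] += 1
    return counts


# Illustrative differences, not the thesis data.
counts = bin_differences([1.02, 1.05, 1.09, 0.45, 2.33, 0.0])
peak_bin, peak_count = counts.most_common(1)[0]
print(f"{peak_bin * 0.1:.2f}-{(peak_bin + 1) * 0.1:.2f}", peak_count)
```

Listing the compounds that fall in the peak bin and comparing their structural changes is then a natural way to check whether one tautomer transformation dominates it.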

The smaller size of the PM dataset prevented similar meaningful patterns

from being extracted from Figure 3.4. The distribution of predicted ELOGP values

for the PRF of each compound in each dataset is shown in Figure 3.6.

[Distribution plot: fraction of compound set (y-axis, 0 to 0.3) against predicted PRF ELOGP (x-axis, -4 to 12) for the HTS set and the PM set]

Figure 3.6: Distribution of predicted ELOGP values for HTS and PM dataset compounds represented in their PRF

The near-normal distributions highlight the lower mean (3.19) and higher standard deviation (1.80) of the PM ELOGP predictions compared to the HTS predictions (3.61 and 1.48 respectively), but also show that the profiles of predictions made for the PRFs of structures in both datasets are broadly similar, with at least a handful of predicted values being found in every region of the common ELOGP range for agrochemicals.

3.3.2 ESOL

Figures 3.7 and 3.8 show the plots of NDC form versus PRF ESOL

predictions for the HTS and PM datasets respectively. The compounds excluded

directly correspond with the sets omitted from the ELOGP prediction comparisons.

[Scatter plot: PRF ESOL (y-axis) against NDC ESOL (x-axis); both axes run from -5 to 7]

Figure 3.7: Comparison of NDC and PRF ESOL predictions for the HTS dataset

[Scatter plot: PRF ESOL (y-axis) against NDC ESOL (x-axis); both axes run from -5 to 7]

Figure 3.8: Comparison of NDC and PRF ESOL predictions for the PM dataset

For compounds where the ESOL predictions change between the NDC form and PRF, the magnitude of the change is similar to that seen for ELOGP. Since ESOL predictions are largely dependent on ELOGP predictions, there is a strong negative correlation between them (R = -0.90 to -0.95 for both datasets and both sets of NDC and PRF predictions). The distributions of the absolute non-zero ESOL prediction differences for the HTS and PM datasets are shown in Figures 3.9 and 3.10 respectively.
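The strength of the relationship between ESOL and ELOGP predictions can be checked with a Pearson correlation coefficient, sketched here in plain Python. A negative value such as those quoted arises from a correlation coefficient, since a squared coefficient cannot be negative. The data are invented purely to show the calculation.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up values in which predicted solubility falls as log P rises.
elogp = [1.0, 2.0, 3.0, 4.0]
esol = [3.0, 2.4, 1.7, 1.1]
r = pearson_r(elogp, esol)
print(round(r, 3), round(r * r, 3))
```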

[Histogram: compound count (y-axis, 0 to 25) against NDC / PRF absolute ESOL prediction difference (x-axis, 0 to 2)]

Figure 3.9: The distribution of non-zero absolute differences between NDC and PRF ESOL predictions for the HTS dataset

[Histogram: compound count (y-axis, 0 to 15) against NDC / PRF absolute ESOL prediction difference (x-axis, 0 to 2)]

Figure 3.10: The distribution of non-zero absolute differences between NDC and PRF ESOL predictions for the PM dataset

Figures 3.9 and 3.10 show that the difference in predictions between the NDC

forms and PRFs of the affected compounds were as much as 1.65 ESOL log units

and on average 0.75 and 0.58 ESOL log units for the HTS and PM datasets

respectively.


Due to the noted dependency of ESOL predictions on ELOGP, the distributions of non-zero absolute differences in Figures 3.9 and 3.10 closely match those of Figures 3.3 and 3.4. The maximum compound count in Figure 3.9, for the “bin” range 0.70-0.80 ESOL units, can therefore largely be attributed to examples of the same single type of tautomer change that was highlighted from the ELOGP data in Figure 3.3 and shown in Figure 3.5. The smaller size of the PM dataset prevents similar meaningful patterns from being extracted from Figure 3.10. The distribution of predicted ESOL values for the PRF of each compound in each dataset is shown in Figure 3.11.

[Distribution plot: fraction of compound set (y-axis, 0 to 0.35) against predicted PRF ESOL (x-axis, -4 to 7) for the HTS set and the PM set]

Figure 3.11: Distribution of predicted ESOL values for HTS and PM dataset compounds represented in their PRF

In contrast to the ELOGP distributions of Figure 3.6, the distributions in Figure 3.11 highlight the higher mean (1.46) and higher standard deviation (1.58) of the PM ESOL predictions compared to the HTS predictions (1.25 and 0.98 respectively). The profiles of ESOL predictions for each set are, however, still largely comparable, with at least a handful of predicted ESOL values being found in every region of the common aqueous solubility range for agrochemicals.


3.3.3 pKa

Figures 3.12 and 3.13 show the plots of NDC form versus PRF pKa predictions for the HTS and PM datasets respectively. As discussed in Chapter 2, Section 10.2, these comparisons necessarily exclude compounds where pKa predictions were not obtained for their NDC form and / or PRF, resulting in only 1,997 (76%) and 635 (47%) of the possible comparisons being made for the HTS and PM datasets respectively.

[Scatter plot: PRF pKa (y-axis) against NDC pKa (x-axis); both axes run from 0 to 14]

Figure 3.12: Comparison of NDC and PRF pKa predictions for the HTS dataset

[Scatter plot: PRF pKa (y-axis) against NDC pKa (x-axis); both axes run from 0 to 14]

Figure 3.13: Comparison of NDC and PRF pKa predictions for the PM dataset

Figures 3.12 and 3.13 show that non-zero pKa prediction differences between compounds’ NDC forms and PRFs are far fewer and typically smaller for the PM dataset than for the HTS dataset. In 52 cases (2.6% of valid comparisons) in the HTS dataset in Figure 3.12 a change in pKa prediction is observed. For the PM dataset in Figure 3.13, 5 compounds (0.8% of valid comparisons) similarly had different predictions for their NDC forms and PRFs. The Figures also show that the difference in predictions between the NDC forms and PRFs of the affected compounds can be as much as 7.47 pKa units, and is on average 2.35 and 1.15 pKa units for the HTS and PM datasets respectively.

Consequently, when structure misrepresentation, tautomeric or otherwise,

occurs, the effect on predictions may be considerable. It is also important to note that

large differences between pKa predictions for different forms of the same structure

are more likely to be due to the accidental mismatching of two different pKas,

between which no meaningful comparison can realistically be drawn. The

distributions of the absolute non-zero pKa prediction differences for the HTS and

PM datasets are shown together in Figure 3.14.

[Histogram: compound count (y-axis, 0 to 10) against NDC / PRF absolute pKa prediction difference (x-axis, 0 to 7) for the HTS set and the PM set]

Figure 3.14: The distribution of non-zero absolute differences between NDC and PRF pKa predictions for the HTS and PM datasets

Seven of the 10 compounds that comprise the maximum compound count for the HTS dataset in Figure 3.14, corresponding to the “bin” range 2.00-2.40 pKa units, can once again be attributed to compounds undergoing the tautomer change highlighted in Figure 3.5. The distribution of pKa value predictions for the PRFs of the compounds in both datasets is shown in Figure 3.15.

[Distribution plot: fraction of compound set (y-axis, 0 to 0.14) against predicted PRF pKa (x-axis, 0 to 14) for the HTS set and the PM set]

Figure 3.15: Distribution of predicted pKa values for HTS and PM dataset compounds represented in their PRF

The distributions shown in Figure 3.15 are clearly not of a classical statistical type, but show that the pKa predictions made cover almost all of ACD/pKa’s maximum 0-14 range. Of the two datasets, the HTS set has the more evenly distributed range of pKa prediction values.

3.4 Summarising the differences between the NDC forms and PRFs of the HTS and PM datasets

The full effect that the STT had on each dataset was evaluated by determining whether or not each compound’s structure was modified by it and whether or not a change in its predicted ELOGP, ESOL or pKa was observed. Table 3.1 provides a summary of the outcomes for both datasets.

                                  HTS dataset                     PM dataset
                                  Changed structure?              Changed structure?
                                  (NDC → PRF)                     (NDC → PRF)
Property   Changed value?         No             Yes              No             Yes
ELOGP      No                     2310 (91.7)    141 (5.6)        1167 (90.1)    91 (7.0)
           Yes                    0 (0)          69 (2.7)         0 (0)          37 (2.9)
ESOL       No                     2310 (91.7)    141 (5.6)        1167 (90.1)    91 (7.0)
           Yes                    0 (0)          69 (2.7)         0 (0)          37 (2.9)
pKa        No                     1821 (91.2)    124 (6.2)        599 (94.3)     31 (4.9)
           Yes                    0 (0)          52 (2.6)         0 (0)          5 (0.8)

(The first number in each cell is the actual number of compounds. The number in brackets is the corresponding percentage of total comparisons made for that property)

Table 3.1: Classification of changes caused to the HTS and PM dataset compounds by the STT.
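The percentages in Table 3.1 can be reproduced from the counts with a few lines of code. The counts are transcribed from the table and the category letters follow Table 2.2; the dictionary layout and function name are assumptions made for the sketch. The ESOL rows are identical to the ELOGP rows and are therefore omitted.

```python
# Counts transcribed from Table 3.1, keyed by (dataset, property).
# Categories a-d follow Table 2.2.
table_3_1 = {
    ("HTS", "ELOGP"): {"a": 2310, "b": 141, "c": 0, "d": 69},
    ("PM", "ELOGP"): {"a": 1167, "b": 91, "c": 0, "d": 37},
    ("HTS", "pKa"): {"a": 1821, "b": 124, "c": 0, "d": 52},
    ("PM", "pKa"): {"a": 599, "b": 31, "c": 0, "d": 5},
}


def percentages(counts):
    """Return the total and each category as a percentage of that total."""
    total = sum(counts.values())
    return total, {k: round(100 * v / total, 1) for k, v in counts.items()}


for key, counts in table_3_1.items():
    total, pct = percentages(counts)
    print(key, total, pct)
```

The totals recovered this way (2,520 and 1,295 for ELOGP; 1,997 and 635 for pKa) agree with the numbers of valid comparisons quoted in Section 3.3.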


For the majority of compounds in each dataset (at least 90%) the STT makes no modification to their structure and consequently no change in predicted properties results. As canonical SMILES were used as input, there are no instances in either dataset of compounds retaining the same structure but appearing to change ELOGP, ESOL or pKa prediction.

Inspecting the 141 compounds in the HTS dataset and the 91 compounds in

the PM dataset, where a change in structure did not lead to a change in ELOGP or

ESOL prediction, revealed that the only alteration in each case was to the hybrid

form of nitro-groups (see Figure 2.2). It therefore appears that the ELOGP and ESOL

prediction routines have correctly identified and treated both nitro group hybrids as

one-and-the-same entity in these instances.

Compounds whose pKa values remain unchanged despite a structural change can similarly be attributed to a nitro group in all but 6 of the 124 cases found in the HTS dataset and 3 of the 31 cases in the PM dataset. Of the remaining compounds, four differ only in whether a carboxyl group is protonated or not (HTS1707, HTS1663, HTS1715 and HTS1716), one (HTS2608) differs only in the hybrid form of a nitroso group and four (HTS2070, PL0083 (6-Isopentenylaminopurine), PL1003 (Kinetin) and PL1612 (Zeatin), Figure 3.16) have undergone a tautomeric change, but the two tautomers coincidentally have the same predicted pKa.

[Schemes: the tautomer pairs linked by the STT for Kinetin (PL1003), Zeatin (PL1612) and 6-isopentenylaminopurine (PL0083)]

Figure 3.16

The remaining compounds are those where a change in structure has led to a change in prediction for at least one of the three properties. These compounds are therefore those most likely to have had a tautomer change carried out on them by the STT. The nature of these compounds will be discussed in Chapter 3, Section 7.

3.5 Formal charge distributions at pH7

3.5.1 The influence of predicted pKa changes on predicted charge distribution

A formally neutral compound may actually exist in a charged state in aqueous solution at pH7, depending on its pKa. In principle, different tautomers may have sufficiently different predicted pKas that their predicted formal charge at pH7 could change. This could result in their aqueous behaviours being very dissimilar to each other. Using the protocol laid out in Chapter 2, Section 10.3, every compound with a predicted pKa value in both datasets could therefore be assigned a formal charge prediction for both its NDC form and PRF. This was initially carried out using a pH range of 0-14 for ACD/pKa and led to the following distributions for the HTS dataset (Figure 3.17):


Figure 3.17: Predicted charge distributions at pH7 for the HTS dataset in its NDC forms and PRFs using a pH range of 0-14

Passing the HTS dataset through the STT resulted in only small changes in the predicted charge distribution for the dataset. Emphasising the similarity of the distributions, 132 of the 144 positively charged NDC structures are also positively charged in their PRF, 1,688 of the 1,693 neutral NDC structures are also neutral in their PRF and 158 of the 160 negatively charged NDC structures are also negatively charged in their PRF.
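Counts of this kind come from cross-tabulating each compound's predicted charge in its NDC form against that in its PRF. A minimal sketch is given here; the compound identifiers and charges are hypothetical and the function name is an assumption.

```python
from collections import Counter


def charge_transitions(ndc_charges, prf_charges):
    """Count (NDC charge, PRF charge) pairs for compounds present in both maps."""
    pairs = Counter()
    for cid, ndc_q in ndc_charges.items():
        if cid in prf_charges:
            pairs[(ndc_q, prf_charges[cid])] += 1
    return pairs


# Hypothetical compounds and charges, for illustration only.
ndc = {"X1": 1, "X2": 0, "X3": -1, "X4": 1}
prf = {"X1": 1, "X2": 0, "X3": -1, "X4": 0}
pairs = charge_transitions(ndc, prf)
unchanged = sum(n for (q_ndc, q_prf), n in pairs.items() if q_ndc == q_prf)
print(pairs, unchanged)
```

The off-diagonal pairs, where the two charges differ, identify the compounds whose predicted charge changed on passing through the STT.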

Only minor changes in the predicted formal charge distribution at pH7 were also found for the PM dataset using the same pH range (Figure 3.18). Closer inspection of the distribution reveals that only two compounds’ predicted charges actually change due to their structures being modified by the STT. The predicted charges of these compounds (PL0558 (Dimethirimol) and PL0679 (Ethirimol), Figure 3.19) both changed from +1 to 0 in conjunction with a change in tautomer.

Figure 3.18: Predicted charge distributions at pH7 for the PM dataset in its NDC forms and PRFs using a pH range of 0-14

[Structure drawings of Dimethirimol (PL0558) and Ethirimol (PL0679), both shown in their NDC forms]

Figure 3.19

Clarke (2002), using similar formal charge definitions, predicted the charge

distribution of compounds in the Pesticide Manual to be approximately 10:1, acid :

base. His findings are to some extent reflected in the predicted positive to negative

charge ratios for both the NDC forms and PRFs shown in Figure 3.18 (both ~ 5:1).

3.5.2 A comparison of predicted charge distribution at pH7 within pH 2-10 and pH 0-14 limits

Compounds can have multiple acid-base ionisation constants. ACD/pKa has an option to deal with them by presenting only the most acidic (MA) and / or the most basic (MB) pKa it finds within the pH range defined by the user. This may mean, however, that other, more appropriate mid-scale pKas that better characterise compounds simply get overlooked. Consequently, taking the larger HTS test dataset, narrowing the defined pH “window” to 2-10 and observing the extent of change in the predicted charge distribution helped give an indication as to how dependent it is on the pH range chosen.

Only compounds that have predicted pKas within both the 0-14 and 2-10 pH ranges for both their NDC forms and PRFs could be used in the comparison. This limited the pH range comparison to 1,254 structures (48% of the entire dataset or 63% of the compounds compared over the 0-14 pH range). The predicted charge distribution profiles for these compounds’ NDC forms and PRFs at the two pH ranges are shown in Figure 3.20.


Figure 3.20: Predicted charge distributions at pH7 for the HTS dataset in its NDC forms and PRFs using pH ranges of 0-14 and 2-10 for comparison

Figure 3.20 shows that narrowing the pH “window” has only a minor

influence on the charge distribution for the compounds compared. The number of

structures whose predicted charge at pH 7 actually changes when the pH range is

narrowed from 0-14 to 2-10 is only 16 (NDC structures) and 13 (PRF structures),

equating to only ~1% of the compounds. The exact choice of pH range therefore had

no significant influence on the outcome of the charge distribution predictions.

3.6 Issues and problems with prediction tools

3.6.1 AlogP and SMILES

As was highlighted in Chapter 2, Section 8.2, it was important that SMILES presented as input to the ELOGP prediction tool of SOLSTICE were canonicalised to Daylight conventions (Weininger et al., 1989 & Daylight, 2004d) to ensure that a consistent AlogP prediction, and hence a consistent ELOGP prediction, was always obtained for each structure. The extent of the problem that requires this action was examined using the HTS dataset by comparing the AlogP predictions obtained using the non-canonicalised SMILES stored in the Syngenta database with those obtained using their canonicalised SMILES, generated using the Daylight SMILES toolkit (via SOLSTICE Unique Structures). In this dataset, 123 (4.7%) compounds gave different AlogP values for the different SMILES forms, indicating that a small but significant proportion of compounds were affected. The distribution of these absolute differences (Figure 3.21) showed that the majority of them fell within a narrow range, tending to suggest that the error is a routine one, specific to AlogP’s handling of SMILES.

[Histogram: compound count (y-axis, 0 to 100) against absolute AlogP difference (x-axis, 0 to 0.9)]

Figure 3.21: A graph showing the distribution of non-zero AlogP differences between the canonicalised and un-canonicalised SMILES of HTS dataset compounds

Since AlogP is essentially a “black box” prediction tool, where a SMILES is simply passed to it and a result is passed back, there is little that an AlogP user can do to remove the problem without its developers’ intervention other than to use canonicalised SMILES to ensure prediction consistency.
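The consistency check described in this section can be sketched generically: each compound is predicted from both of its SMILES forms and flagged if the two results differ. Everything named here is hypothetical. In particular, `toy_predict` is a deliberately order-sensitive stand-in and is not AlogP, and the second string of each pair merely stands in for a canonical form.

```python
def inconsistent_predictions(smiles_pairs, predict, tol=1e-9):
    """Return ids whose stored and canonical SMILES give different predictions.

    smiles_pairs -- {compound id: (stored SMILES, canonical SMILES)}
    predict      -- any callable mapping a SMILES string to a number
    """
    flagged = []
    for cid, (stored, canonical) in smiles_pairs.items():
        if abs(predict(stored) - predict(canonical)) > tol:
            flagged.append(cid)
    return flagged


def toy_predict(smiles):
    # Deliberately order-sensitive stand-in; it is NOT AlogP.
    return 0.1 * smiles.index("O") if "O" in smiles else 0.0


# Hypothetical ids; "OCC" and "CCO" are two valid SMILES for ethanol.
pairs = {"X1": ("OCC", "CCO"), "X2": ("CC", "CC")}
print(inconsistent_predictions(pairs, toy_predict))
```

Because the predictor is passed in as a callable, the same check could be pointed at any "black box" tool that accepts a SMILES string and returns a value.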

3.6.2 Analysis of prediction failures

A small proportion of compounds in both datasets were excluded from the

comparison of prediction results because AlogP, ACD/logP or ACD/pKa was not

successful in generating a value for individual compounds in either their NDC form


or PRF. To uncover the reasons why and show the current limitations of the tools, the

specific instances where failures occurred were investigated.

3.6.2.1 Log P

The compounds in the HTS dataset that were excluded from log P result comparisons are listed in full in Table 3.2, which shows the predictions that were made and indicates the reason for the failure of the others. Such failures affected 96 compounds (3.7% of the dataset).

Ref  NDC ACD/logP  NDC AlogP  PRF ACD/logP  PRF AlogP  Ref  NDC ACD/logP  NDC AlogP  PRF ACD/logP  PRF AlogP

HTS0034 0.412 -3.75 0.412 A HTS1865 Ch 4.671 Ch 4.671

HTS0045 0.448 A 0.448 A HTS1895 3.978 A 3.978 A

HTS0056 3.283 A 3.283 A HTS1936 5.606 A 5.606 A

HTS0094 1.356 A 1.356 A HTS1937 4.285 A 4.285 A

HTS0110 1.595 -2.263 1.595 A HTS1950 5.097 A 5.097 A

HTS0130 0.457 -4.393 0.457 A HTS1954 6.87 A 6.870 A

HTS0162 Ch 0.300 0.962 1.666 HTS1955 6.272 A 6.272 A

HTS0169 3.904 A 3.904 A HTS1957 7.610 A 7.610 A

HTS0223 0.483 A 0.483 A HTS1963 4.127 A 4.127 A

HTS0284 4.113 A 4.113 A HTS1965 4.494 A 4.494 A

HTS0451 4.374 3.482 2.085 A HTS1996 4.853 4.689 3.125 A

HTS0468 6.542 A 6.542 A HTS2008 1.224 A 1.224 A

HTS0508 5.426 4.172 3.137 A HTS2021 2.412 A 2.412 A

HTS0695 0.989 -4.742 0.989 A HTS2042 3.099 A 3.099 A

HTS0710 F A F A HTS2068 F A F A

HTS0736 Ch A 3.620 2.018 HTS2081 7.972 A 7.972 A

HTS0802 2.774 A 2.774 A HTS2102 5.081 A 5.081 A

HTS0878 1.844 A 1.844 A HTS2115 3.725 A 3.725 A

HTS0901 3.505 A 3.061 A HTS2128 4.579 A 4.579 A

HTS0905 Ch 6.847 Ch 6.847 HTS2132 Ch 5.570 Ch 5.570

HTS0913 Ch 8.169 Ch 8.169 HTS2133 1.905 A 1.905 A

HTS1007 0.219 A 0.219 A HTS2142 3.877 A 3.877 A

HTS1018 4.677 A 4.677 A HTS2149 F 2.212 F 2.212

HTS1053 3.899 A 3.899 A HTS2154 3.940 A 3.940 A

HTS1109 Ch A Ch A HTS2155 3.342 A 3.342 A

HTS1285 0.756 A 0.756 A HTS2156 1.721 A 1.721 A

HTS1335 2.153 A 2.153 A HTS2170 2.706 A 2.706 A

HTS1340 4.595 A 4.595 A HTS2177 F 4.926 F 4.926


HTS1376 2.882 A 2.882 A HTS2179 Ch 6.312 Ch 6.312

HTS1377 3.069 A 3.069 A HTS2185 3.485 A 3.485 A

HTS1389 6.479 A 6.479 A HTS2188 Ch 5.125 Ch 5.125

HTS1407 4.356 A 4.356 A HTS2207 4.634 A 4.634 A

HTS1464 3.730 3.436 1.565 A HTS2258 0.499 0.780 0.499 A

HTS1523 Ch 2.021 Ch 2.021 HTS2280 5.031 A 5.031 A

HTS1539 Ch 4.507 Ch 4.507 HTS2291 4.665 A 4.665 A

HTS1556 5.636 A 5.636 A HTS2429 0.327 A 0.327 A

HTS1606 1.167 A 1.167 A HTS2430 3.09 A 3.09 A

HTS1652 5.025 A 5.025 A HTS2431 4.984 A 4.984 A

HTS1653 6.675 A 6.675 A HTS2477 2.712 A 2.712 A

HTS1654 0.876 A 0.876 A HTS2500 0.109 A 0.109 A

HTS1755 F 4.305 F 4.305 HTS2501 0.556 A 0.556 A

HTS1763 2.516 A 2.516 A HTS2509 0.652 A 0.652 A

HTS1766 6.008 A 6.008 A HTS2527 6.067 A 6.067 A

HTS1771 1.244 A 1.244 A HTS2532 1.884 A 1.884 A

HTS1797 Ch 5.053 Ch 5.053 HTS2567 1.463 A 1.463 A

HTS1831 0.029 A 0.029 A HTS2590 0.217 A 0.217 A

HTS1839 2.051 A 2.051 A HTS2592 3.842 4.400 2.328 A

HTS1856 0.024 A 0.024 A HTS2609 2.793 -0.329 2.793 A

• A = unparameterised atom(s) found in structure – structure cannot be fully resolved

• Ch = structure charged – structure cannot be fully resolved

• F = contains fragments that cannot be calculated

• Highlighted compounds undergo a tautomeric structural change with the STT, one or both tautomers of which give rise to a log P prediction error.

• Shaded compounds undergo a change in a resonance hybrid substructure only with the STT, one or both tautomers of which give rise to a log P prediction error.

Table 3.2: Reasons for the failure of AlogP or ACD/logP to predict a value for the NDC form or PRF of affected compounds in the HTS dataset

A similar analysis of log P prediction failures for the PM dataset was also

carried out, revealing that 64 compounds were similarly affected (4.7% of the

dataset). The discussion of findings will therefore address both datasets.

The most common error encountered with AlogP was that of atom fragments not defined in its dictionary. An inspection of the “problem” compounds revealed that failures occurred most often when phosphorus, sulphur and especially nitrogen were present in less-common bonding arrangements. The structural features that appeared to cause the majority of AlogP failures are shown with examples in Table 3.3:

Substructure | Number of instances (HTS & PM) | Examples
Drawn substructure: S, O, O, O-Ar (aryl sulfonate ester) | 4 + 6 | HTS0169 / HTS2021 / PL0051 (2,4-Dichlorophenyl benzenesulfonate) / PL0743 (Fenson) / PL1069 (Methasulfocarb)
Nsp3-Nsp3 | 57 + 7 | HTS0509 / HTS0508 / HTS1464 / HTS1996 / HTS2592 / HTS1965 / HTS1766 / HTS0878 / PL0054 (2-Hydrazinoethanol) / PL0467 (Daminozide) / PL1405 (Sintofen)
Drawn substructure: N, O or N+, O | 7 + 3 | HTS2429 / HTS0034 / HTS0110 / PL0607 (Dipyrithione)
Drawn substructure: N+, N, O or N, N, O | 3 + 0 | HTS1340 / HTS1653 / HTS2291
Any nitrogen-sulphur bond | 4 + 7 | HTS0094 / HTS0710 / PL0108 (Alanycarb) / PL0181 (Benfuracarb) / PL0704 (Fenaminosulf) / PL1446 (Sulglycapin)
Any net charge fragments | 0 + 20 | PL0196 (Benzamorf) / PL0245 (BTS 44584) / PL0874 (Glyodin) / PL1431 (Sulcofuron)
Any Si | 4 + 5 | HTS0901 / HTS2431 / PL0825 (Flusilazole) / PL1402 (Simeconazole)
Tetracoordinate S | 0 + 2 | PL0648 (Endosulfan) / PL0147 (Aramite)

(Compound references in bold underwent a tautomer change with the STT. Those quoted represent all such compounds in both the HTS and PM datasets for which a prediction failure occurred)

Table 3.3: A summary of the structural features that caused AlogP or ACDlogP to fail to give a log P prediction

Of these features, Nsp3-Nsp3 bonds were the most common reason for prediction failure. Table 3.3 also reveals 6 compounds that underwent a tautomer change but have so far been excluded from the analysis of prediction results because of an AlogP failure. Five of these appear to fail for the same reason: their PRF tautomers contain Nsp3-Nsp3 bonds. The AlogP prediction for the remaining tautomeric molecule, HTS0901, fails for both its NDC and PRF tautomers because it contains a silicon atom.

ACD/logP predictions appeared to fail for two reasons. Outright failure occurred on 25 occasions across both datasets for structures carrying a net charge, particularly examples containing positively charged nitrogen and sulphur. Failure also occurred when less common structural fragments were encountered. For example, compound HTS0710 contains an N=S=C fragment, HTS1755 contains an N=S=N fragment, HTS2177 contains an N-P=S fragment and HTS2149 contains an O=PN2 fragment.

Twelve of the 96 NDC / PRF structure pairs from the HTS dataset for which one or more log P predictions failed differ only in the particular resonance hybrid drawn for a functional group they contain. For example, HTS0162 contains an azide group. ACD/logP fails for its charge-separated NDC form but predicts a value for its neutral PRF. By contrast, ACD/logP is in most cases able to resolve the charge-separated and uncharged hybrid forms of nitroso and nitro groups and treat them equivalently. AlogP, however, does not normally recognise either nitroso group hybrid, and predictions for compounds containing them usually fail to give a value.

One of the more unusual effects of applying the STT to the HTS dataset was seen with compound HTS0736, whose charge-separated isocyanide NDC form A was converted into a neutral C(carbene)=N PRF B (Figure 3.22). While both ACD/logP and AlogP failed for hybrid A on grounds of charge and “unknown” atom fragment respectively, both surprisingly offered predictions for hybrid B.

Figure 3.22: [structure diagram: STT conversion of HTS0736 from hybrid A to hybrid B]

Log P prediction failure did not affect any compounds in the PM dataset

where the STT had made a tautomeric structure change. Structure PL0162

(Aziprotryne) however always failed with ACD/logP due to it appearing in the

Pesticide Manual drawn with the structurally ambiguous, azide-like substituent

group, shown in Figure 3.23.

Figure 3.23: [structure diagram of PL0162 (Aziprotryne) showing its azide-like substituent group]

3.6.2.2 pKa

The reasons for failure of ACD/pKa predictions were more difficult to relate

to individual molecular characteristics or specific sub-structures than for log P.

Table 3.4 shows the errors encountered for both datasets, with the instances of each

error’s occurrence quoted for predictions made over a pH 0-14 range.

Error number | Error message | HTS | PM
1 | “All calculated pKa values are out of user specified pH range” | 942 | 518
2 | “Cannot calculate pKa” (no reason given) | 27 | 109
3 | “The structure does not contain ionization centers calculated by current version of ACD/pKa” | 244 | 708
4 | pKa value not predicted but no error given either | 2 | 0
5 | “The structure contain elements in not-typical valence” | 0 | 2
Totals | | 1215 | 1337

Table 3.4: Error types encountered from the failure of ACD/pKa to predict values for compounds from the HTS and PM datasets

The HTS dataset figures relate to 619 specific compounds (23% of the total

dataset) where no pKa prediction was offered for either one or both of their NDC

form or PRF using the pH range 0-14. 11 tautomer inter-conversions were affected

by missing pKa values, in each case relating to the PRF tautomer and caused by

predicted values being out of range. In 9 instances this was due to compounds that

had undergone a 4-hydroxypyridine (NDC) to 4-(1H)-pyridone (PRF) substructure

type inter-conversion (Figure 3.24). The only tautomeric example where pKa

predictions failed for both tautomers was HTS0810 involving a related pyrimidine

(NDC) to pyrimidinone (PRF) substructure transformation.

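Figure 3.24: [structure diagram: STT conversion of the 4-hydroxypyridine substructure (NDC) to the 4-(1H)-pyridone substructure (PRF); substituent positions labelled A]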

Failure of ACD/pKa predictions affected 725 compounds in the PM dataset

(53% of total) for either their NDC form or PRF. Unlike the log P prediction


methods, ACD/pKa did not attempt to split up and treat separately the 62 multiple

component compounds in this dataset. Instead it simply registered a failure,

regardless of whether each constituent component was acceptable in its own right.

The error relating to “not-typical valence” was caused by the structurally ambiguous

compound PL0162 (Aziprotryne) that also caused ACD/logP to fail. The only

compound structurally-altered by the STT to present ACD/pKa with problems was

PL1022 (Mazidox). The NDC A form was successfully handled but the PRF B

resulted in error 2 occurring (Table 3.4 and Figure 3.25).

Figure 3.25: [structure diagram: STT conversion of PL1022 (Mazidox) from NDC form A to PRF B]

No compounds where tautomeric structure changes were carried out by the STT were also affected by pKa prediction failure in the PM dataset.

3.7 Revealing the types of structural changes performed by the STT and the tautomer substructures concerned

3.7.1 Analysing the effect of the STT on each dataset

Of the 69 HTS dataset compounds in Table 3.1 whose ELOGP and ESOL

predictions were changed due to their structure being changed, 63 related to a true

change in tautomer form. The nature of the remainder is discussed in Chapter 3,

Section 4. The 52 compounds whose pKa predictions were similarly affected are a

subset of the 63 structures identified above. By including the additional 6 that were

found by examining the log P prediction failures a total of 69 compound tautomer

changes were therefore uncovered in the HTS dataset.


On close examination of the 37 PM dataset compounds in Table 3.1 whose

structures and both ELOGP / ESOL predictions were changed, only 7 could be

attributed to a prototropic tautomer change. These compounds were PL0083

(6-Isopentenylaminopurine), PL0558 (Dimethirimol), PL0679 (Ethirimol), PL0891

(Haloxydine), PL1003 (Kinetin), PL1343 (Pyriclor) and PL1612 (Zeatin) (Figures 3.16, 3.19 and 3.26).

Figure 3.26: [structure diagrams of PL0891 (Haloxydine) and PL1343 (Pyriclor), both in their NDC forms]

25 of the remainder were simple anion protonations or cation deprotonations

while the final 5 structures all contained nitro groups, which due to the specific

nature of their structures appear to have caused either ClogP (PL0401 (Clothianidin),

PL0595 (Dinotefuran), PL0942 (Imidacloprid) and PL1500 (Thiamethoxam)) or

AlogP (PL0775 (Fluazinam)) specific problems, unusually resulting in different

ELOGP predictions for their different hybrid forms (Figure 3.28).

The compounds where ClogP is affected all contain the same N-nitro

substructure (Figure 3.27) and examining the run log of the ClogP v4 (current

SOLSTICE version) job reveals that on-the-fly calculated ClogP contribution

estimations for the A form of the group were used as opposed to the selection of true

matching dictionary fragment(s). The ClogP v3 and v4 predictions shown in Table

3.5 for these compounds also show significant differences in predictions between the

hybrids, reflecting differences in the ClogP v3 and v4 methodologies (Leo & Hoekman, 2000). The inadequacy of the dictionary for this relatively uncommon

substructural feature would therefore seem to be the cause of the discrepancy.

PL0775 (Fluazinam) on the other hand appears to represent an exception to the

general rule that the resonance hybrid forms of carbon-bound nitro groups are

typically treated equivalently by AlogP.

Figure 3.27: [structure diagram: STT conversion of the N-nitro substructure from hybrid A to hybrid B]

Figure 3.28: [structure diagrams of PL0775 (Fluazinam), PL0401 (Clothianidin), PL0595 (Dinotefuran), PL0942 (Imidacloprid) and PL1500 (Thiamethoxam), all in their NDC forms]

Method | PL0775 (Fluazinam) NDC / PRF | PL0401 (Clothianidin) NDC / PRF | PL0595 (Dinotefuran) NDC / PRF | PL0942 (Imidacloprid) NDC / PRF | PL1500 (Thiamethoxam) NDC / PRF
ClogP v4 | 5.915 / 5.915 | -2.026 / 0.176 | -3.078 / -0.876 | -1.560 / 0.672 | -0.04 / 0.718
ClogP v3 | 5.217 / 5.217 | 2.303 / 0.173 | 1.384 / -0.946 | 2.772 / 0.672 | 1.541 / 1.503
AlogP | 5.254 / 5.719 | 2.055 / 2.055 | 0.628 / 0.628 | 2.260 / 2.260 | 3.170 / 3.170
ACD/logP | 8.190 / 8.190 | 0.152 / 0.152 | 0.700 / 0.700 | 0.199 / 0.199 | 1.156 / 1.156

(Highlighted NDC and PRF hybrid pairs are those where predictions differ between them)

Table 3.5: PM compounds containing pairs of resonance hybrids that resulted in different log P predictions sometimes being obtained for each
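The check behind Table 3.5 (which methods give different log P predictions for the two hybrid forms of a compound) can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and the function name `differing_pairs` are assumptions made for the example, while the numerical values are copied from Table 3.5.

```python
# (NDC, PRF) log P predictions copied from Table 3.5.
# The data layout and the function name are illustrative assumptions.
TABLE_3_5 = {
    "PL0775 (Fluazinam)": {
        "ClogP v4": (5.915, 5.915), "ClogP v3": (5.217, 5.217),
        "AlogP": (5.254, 5.719), "ACD/logP": (8.190, 8.190),
    },
    "PL0401 (Clothianidin)": {
        "ClogP v4": (-2.026, 0.176), "ClogP v3": (2.303, 0.173),
        "AlogP": (2.055, 2.055), "ACD/logP": (0.152, 0.152),
    },
    "PL0595 (Dinotefuran)": {
        "ClogP v4": (-3.078, -0.876), "ClogP v3": (1.384, -0.946),
        "AlogP": (0.628, 0.628), "ACD/logP": (0.700, 0.700),
    },
    "PL0942 (Imidacloprid)": {
        "ClogP v4": (-1.560, 0.672), "ClogP v3": (2.772, 0.672),
        "AlogP": (2.260, 2.260), "ACD/logP": (0.199, 0.199),
    },
    "PL1500 (Thiamethoxam)": {
        "ClogP v4": (-0.04, 0.718), "ClogP v3": (1.541, 1.503),
        "AlogP": (3.170, 3.170), "ACD/logP": (1.156, 1.156),
    },
}


def differing_pairs(table, tol=1e-9):
    """Return {compound: [methods]} for methods whose NDC and PRF values differ."""
    result = {}
    for compound, methods in table.items():
        changed = [name for name, (ndc, prf) in methods.items()
                   if abs(ndc - prf) > tol]
        if changed:
            result[compound] = changed
    return result


for compound, methods in differing_pairs(TABLE_3_5).items():
    print(f"{compound}: {', '.join(methods)}")
```

With the values above, only AlogP is flagged for Fluazinam, whereas both ClogP versions are flagged for the four N-nitro compounds, which matches the discussion in Section 3.7.1.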

Of the five compounds whose structures, and consequently pKas, the STT changed, four (PL0558 (Dimethirimol), PL0679 (Ethirimol), PL0891 (Haloxydine) and PL1343 (Pyriclor); Figures 3.19 and 3.26) form a subset of the 7 PM dataset tautomeric compounds identified from the log P data. The remaining compound, PL1606 (WL 9385) (Figure 3.29), contains an azide group whose hybrids ACD/pKa failed to treat as equivalent.

Figure 3.29: [structure diagram: STT conversion of the azide group of PL1606 (WL 9385) from its charge-separated hybrid to its neutral hybrid]

3.7.2 Categorising the types of structure change performed by the STT

A considerable variety of tautomer and resonance hybrid transformations

were undertaken by the STT on both datasets. The specific substructures involved,


some of which are part of larger heterocyclic fused ring systems, and the number of

compounds found of each class are shown in Table 3.6.

[Substructure drawings: only the atom labels recovered from each drawing are listed]

No | NDC substructure | PRF substructure | Number of instances encountered (HTS + PM)
1 | not[O], N+, O-, A | not[O], N, O, A | 12 + 0
2 | N, N, N, OH, A, A | NH, N, N, O, A, A | 3 + 0
3 | N, N, OH, A, A, A, A, A | N, NH, O, A, A, A, A, A | 4 + 0
4 | N, N, OH, A, A, A, A, A | NH, N, O, A, A, A, A, A | 1 + 0
5 | N, OH, Not OH, Not OH, A, A | NH, O, Not OH, Not OH, A, A | 10 + 2
6 | N, N, OH, OH, A, A | NH, NH, O, O, A, A | 2 + 0
7 | A, N-, N+, N | A, N, N, N | 1 + 0
8 | N, N, OH, A, A, A | NH, N, O, A, A, A | 2 + 0
9 | N, N, OH, N, A, A, A, A | NH, N, O, N, A, A, A, A | 5 + 2
10 | N, N, OH, A, A, A | NH, N, O, A, A, A | 2 + 0
11 | N, OH, OH, A, A, A | NH, OH, O, A, A, A | 3 + 0
12 | N, OH, Not OH, A, A, A | NH, O, Not OH, A, A, A | 5 + 0
13 | A, N+, C- | A, N, C:: | 1 + 0
14 | N, N, OH, A, A, A | NH, N, O, A, A, A | 2 + 0
15 | N, N, OH, A, A, A | NH, N, O, A, A, A | 17 + 0
16 | N, N, OH, N, A, A, A, A | NH, N, O, N, A, A, A, A | 2 + 0
17 | N, N, N, SH, A, A | N, NH, N, S, A, A | 2 + 0
18 | N, N, OH, N, A, A, A, A | N, NH, O, N, A, A, A, A | 1 + 0
19 | N, N, SH, A, A, A | N, NH, S, A, A, A | 1 + 0
20 | N, N, OH, A, A, A | NH, N, O, A, A, A | 3 + 0
21 | NH, N, N, A, A | N, NH, N, A, A | 1 + 0
22 | S, NHSH, A, A, A | S, NH2S, A, A, A | 1 + 0
23 | N, N, N, SH, A, A | NH, N, N, S, A, A | 1 + 0
24 | N, N, N, OH, A, A | NH, N, N, O, A, A | 1 + 0
25 | NH, N, A', A, A | N, NH, A', A, A | 0 + 3

A = Any group (not H when attached to a heteroatom)

Table 3.6: Tautomer substructure types identified from the HTS and PM datasets

3.7.3 Validating the structural changes performed by the STT

So far it has not been identified whether the PRF tautomers are more likely to

be major ones than their NDC analogues. It is also not known whether there are

sometimes other major tautomers that the STT did not generate. As a result, the

tautomer analysis utility of ACD/pKa was used to make predictions about what it

expects the “major” and “minor” tautomers of each substructure to be. To do this,

simple molecules containing each substructure were analysed by ACD/pKa. The

results are shown in Table 3.7 below:

[Structure drawings: only the atom labels recovered from each drawing are listed, followed by the ACD/pKa prediction for that form]

No | NDC (structural form example) | PRF (structural form example) | ACD/pKa suggested alternatives
1 | N+, O-: Major | N, O: Fail: “non-typical valence” | -
2 | N, N, N, OH: Minor | NH, N, N, O: CD Major 1 | N, N, NH, O: CD Major 2
3 | N, N, OH, CH3, CH3, CH3: Minor 1 | N, NH, O, CH3, CH3, CH3: Minor 2 | N, N, O, CH3, CH3, CH3: Major
4 | N, N, OH, CH3, CH3, CH3: Minor 1 | NH, N, O, CH3, CH3, CH3: Minor 2 | N, N, O, CH3, CH3, CH3: Major
5 | N, OH: Minor | NH, O: Major | -
6 | N, N, OH, OH: Minor | NH, NH, O, O: Major | -
7 | CH3, N-, N+, N: Fail: “non-typical valence” | CH3, N, N, N: Fail: “non-typical valence” | -
8 | N, N, OH: Minor | NH, N, O: Major | -
9 | N, N, OH, N, CH3, CH3: Minor | NH, N, O, N, CH3, CH3: CD Major 1 | N, NH, O, N, CH3, CH3: CD Major 2
10 | N, N, OH: Minor | NH, N, O: Major | -
11 | N, OH, OH: Minor 1 | NH, OH, O: CD Major 1 | NH, O, O: CD Major 2; NH, O, OH: Minor 2
12 | N, OH: Minor | NH, O: Major | -
13 | CH3, N+, C-: Fail: “Charged structure” | CH3, N, C::: Fail: “Non-typical valence” | -
14 | N, N, OH: Minor | NH, N, O: Major | -
15 | N, N, OH: Minor | NH, N, O: Major | -
16 | N, N, OH, N, CH3, CH3: Minor 1 | NH, N, O, N, CH3, CH3: Minor 2 | N, N, O, NH, CH3, CH3: Major
17 | N, N, N, SH, CH3: Minor | N, NH, N, S, CH3: Major | -
18 | N, N, OH, N, CH3, CH3: Minor 1 | N, NH, O, N, CH3, CH3: Minor 2 | N, N, O, NH, CH3, CH3: Major
19 | N, N, SH, CH3: Minor | N, NH, S, CH3: Major | -
20 | N, N, OH, CH3: Minor | NH, N, O, CH3: CD Major 1 | N, NH, O, CH3: CD Major 2
21 | NH, N, N, NCH3, CH3: CD Major 1 | N, NH, N, NCH3, CH3: CD Major 2 | -
22 | S, NHSH: Minor 1 | S, NH2S: Minor 2 | S, NHS: CD Major 1; S, NHS: CD Major 2
23 | N, N, N, SH: Minor | NH, N, N, S: CD Major 1 | N, NH, N, S: CD Major 2
24 | N, N, N, OH: Minor | NH, N, N, O: CD Major 1 | N, NH, N, O: CD Major 2
25 | NH, N, CH3: CD Major 1 | N, NH, CH3: CD Major 2 | -

ACD/pKa dominant tautomer predictions:
• Fail: “…” = ACD/pKa failed to interpret the structure for checking of alternative tautomeric forms (reason given in “”)
• “Minor” = Sole predicted minor tautomer
• “Minor 1/2” = Predicted minor tautomers suggested independently of each other
• “Major” = Sole predicted major tautomer
• “Major 1/2” = Predicted major tautomers suggested independently of each other
• “CD Major 1/2” = Suggested condition-dependent major tautomers of each other

Table 3.7: ACD/pKa major / minor tautomer predictions for example compounds of each substructure type identified in Table 3.6

In the majority of cases, the STT’s structure-changing rules successfully tautomerised these substructure example compounds from a predicted “minor” to a “major” tautomer. The NDC to PRF tautomer transformation is therefore a worthwhile process. In cases 21 and 25 both the NDC and PRF tautomers appear to be energetically very similar, since the tautomerisation performed was between two “major” forms. Only in cases 3, 4 and 22 would the STT’s rules fail to find a “major” tautomer. What cannot be determined from these findings in cases 2, 8, 11 and 20-25 is which condition-dependent tautomer is most likely in a given circumstance. In cases 1, 7 and 13, due to problems caused by either charge or unusual valence states, one or both of the NDC and PRF hybrids could not be handled by ACD/pKa’s tautomer analysis utility.

3.8 Comparing measured and predicted property values

3.8.1 Compounds whose structures were not modified by the STT

In addition to the influence of tautomerism on the outcome of property predictions, it is important to establish how reliable the predictions made for these datasets are in comparison to measured values. Initially, measured and predicted property value comparisons will be restricted to those compounds whose structures were unchanged by their passage through the Structure Transformation Tool (STT), that is, those whose Native Drawing Convention (NDC) forms and Physiologically Relevant Forms (PRFs) are identical.

3.8.1.1 pKa comparisons

3.8.1.1.1 HTS dataset

81 compounds in the HTS dataset had both predicted and measured pKa data available. The distribution of absolute pKa differences between their measured and predicted values is shown in Figure 3.30.

[Histogram: x-axis, absolute pKa difference (predicted vs measured), 0 to 4.5; y-axis, compound count, 0 to 25]

Figure 3.30: A plot of the absolute differences between predicted and measured pKa values for the HTS dataset where the STT made no structural change

The distribution in Figure 3.30 shows that the errors in the predicted pKas are

significantly larger than typical experimental errors (+/- 0.1 pH unit) in the measured

values, making the influence of the latter on the former negligible. As the “fall-off”

of compound count in the above distribution appears to be in two parts, stepped at

around 2 pKa absolute difference units, this may represent a cut-off point for the

majority of valid measured vs. predicted pKa comparisons.
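A distribution of this kind can be built by binning the absolute differences. The following is a minimal sketch, assuming a bin width of 0.5 pKa units to match the axis of Figure 3.30. The function name and the example values are hypothetical and are not taken from the HTS dataset.

```python
from collections import Counter


def bin_absolute_differences(measured, predicted, bin_width=0.5):
    """Count compounds per bin of |predicted - measured|.

    Keys of the returned dict are the lower edges of the occupied bins.
    """
    counts = Counter()
    for m, p in zip(measured, predicted):
        diff = abs(p - m)
        lower_edge = int(diff // bin_width) * bin_width
        counts[lower_edge] += 1
    return dict(sorted(counts.items()))


# Hypothetical measured and predicted pKa values, for illustration only.
measured = [4.0, 7.2, 9.1, 3.3]
predicted = [4.3, 6.1, 11.8, 3.4]
print(bin_absolute_differences(measured, predicted))
```

A stepped fall-off such as the one noted at around 2 pKa units would appear as a sharp drop between the counts of adjacent bins.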

From an examination of the ten compounds with the largest absolute pKa

differences, there was evidence that certain sub-structures were commonly involved.

In particular, eight featured one of 3 recurring substructures (Figure 3.31), of which

there were 4 examples of 1 (HTS1199, HTS1192, HTS1200 and HTS1195), 2 of 2

(HTS0957 and HTS0092) and 2 of 3 (HTS0521 and HTS1364).

Figure 3.31: [structure diagrams of the three recurring substructures, labelled 1, 2 and 3]


Recurrences such as these indicate that ACD/pKa systematically misinterprets

particular classes of substructure.

3.8.1.1.2 PM dataset

In this dataset there were 129 compounds for which both measured and

predicted pKa data was available. A plot of the absolute difference between

measured and predicted pKa values for these compounds is shown in Figure 3.32:

[Histogram: x-axis, absolute pKa difference (predicted vs measured), 0 to 16; y-axis, compound count, 0 to 100]

Figure 3.32: A plot of the absolute differences between predicted and measured pKa values for the PM dataset where the STT made no structural change

As in Figure 3.30, the majority of absolute differences in Figure 3.32 were less than 2 pKa units. For the PM dataset 64% of these predictions were within 1 pKa unit and 82% within 2 pKa units of the measured value. This compares favourably with the HTS dataset in Figure 3.30, where the corresponding figures were 48% and 67%. Overall, pKa predictions for the PM dataset were

typically more reliable than those for the HTS dataset.
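The "within 1 pKa unit" and "within 2 pKa units" figures are cumulative percentages of the absolute differences. A minimal sketch of that calculation follows. The function name and the example values are hypothetical, and the strict less-than comparison is an assumption chosen to match the "< x" column headings used later in Table 3.10.

```python
def percent_within(measured, predicted, threshold):
    """Percentage of predictions whose absolute error is below the threshold."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("need two non-empty sequences of equal length")
    diffs = [abs(p - m) for m, p in zip(measured, predicted)]
    return 100.0 * sum(d < threshold for d in diffs) / len(diffs)


# Hypothetical measured and predicted pKa values, for illustration only.
measured = [5.0, 6.0, 7.0, 8.0]
predicted = [5.4, 7.5, 7.1, 11.0]
print(percent_within(measured, predicted, 1.0))  # share within 1 pKa unit
print(percent_within(measured, predicted, 2.0))  # share within 2 pKa units
```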


3.8.1.2 log P comparisons

3.8.1.2.1 HTS dataset

This dataset contained 65 compounds whose log P had been measured. ELOGP predictions were successfully made for 64 of these. The distribution of absolute log P differences between these compounds’ measured log P values and predicted ELOGP values is shown in Figure 3.33:

[Histogram: x-axis, absolute log P difference (predicted vs measured), 0 to 1.8; y-axis, compound count, 0 to 15]

Figure 3.33: A plot of the absolute differences between predicted and measured log P values for the HTS dataset where the STT made no structural change

This distribution shows that the degree of error in HTS dataset log P prediction was lower than was seen for HTS dataset pKas in Figure 3.30. Illustrating this, the mean absolute error in log P prediction is 0.64 log units in Figure 3.33 compared to 1.50 pKa units in Figure 3.30. Figure 3.33 also shows that 85% of the log P comparisons are within 1 log P unit of measured values, compared to 48% of pKa predictions at the same threshold in Figure 3.30. These observations show that pKa

predictions were less reliable than log P predictions.
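The mean absolute error quoted above is the average of the absolute differences between predicted and measured values. The sketch below shows the calculation. The function name and the example values are hypothetical and do not reproduce the HTS figures of 0.64 and 1.50.

```python
def mean_absolute_error(measured, predicted):
    """Average of |predicted - measured| over all compounds."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs(p - m) for m, p in zip(measured, predicted))
    return total / len(measured)


# Hypothetical measured and predicted log P values, for illustration only.
measured = [1.0, 2.0, 3.0]
predicted = [1.5, 1.0, 3.0]
print(mean_absolute_error(measured, predicted))
```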


It is at first surprising that the modal log P difference and smallest absolute log P difference “bins” did not coincide in Figure 3.33. Since the number of compounds making up the distribution is relatively small, the skewing of the maximum can be assumed to be artificial. This skewing is observed by examining the actual (signed) distribution of log P differences (Figure 3.34):

[Histogram: x-axis, log P difference (predicted vs measured), -2 to 2; y-axis, compound count, 0 to 18]

Figure 3.34: A plot of the differences between predicted and measured log P values for the HTS dataset where the STT made no structural change
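One way such a mismatch can arise is sketched below: if the signed differences in a small sample cluster slightly away from zero, folding them into absolute values moves the modal bin away from the smallest-difference bin. The function name, the bin width of 0.4 and the example values are hypothetical and are not taken from the HTS dataset.

```python
import math
from collections import Counter


def modal_bin(values, bin_width):
    """Return the lower edge of the most populated bin."""
    counts = Counter(math.floor(v / bin_width) for v in values)
    index, _ = counts.most_common(1)[0]
    return index * bin_width


# Hypothetical signed differences (predicted - measured), for illustration only.
signed = [0.5, 0.6, 0.5, 0.7, -0.1, 0.1, 0.55, 1.3]
absolute = [abs(d) for d in signed]

print(modal_bin(signed, 0.4))    # modal bin of the signed distribution
print(modal_bin(absolute, 0.4))  # modal bin of the absolute distribution
```

In this example both distributions have their modal bin starting at 0.4 rather than 0, so the most populated absolute-difference bin is not the one containing the smallest differences.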

The ten poorest log P predictions highlighted in Figure 3.33 are a more varied

collection of compounds than was seen for the same dataset’s pKa predictions.

However, several structural features seem to recur within them that highlight current

weaknesses of the log P prediction tools used:


• 5-membered aromatic rings containing 2 or more nitrogens (HTS1499,

HTS1542, HTS0891, HTS0876, HTS0804, HTS0704 and HTS0197)

• Aromatic nitrogen-nitrogen bonds (HTS0704, HTS0804, HTS1542 and

HTS1499)

• Cyclopropyl groups (HTS0197, HTS1393 and HTS1499)

3.8.1.2.2 PM dataset

There were 470 compounds for which both measured and predicted log P

data was available. A plot of the absolute difference between measured and predicted

log P values for these compounds is shown in Figure 3.35:

[Histogram: x-axis, absolute log P difference (predicted vs measured), 0 to 7; y-axis, compound count, 0 to 350]

Figure 3.35: A plot of the absolute differences between predicted and measured log P values for the PM dataset where the STT made no structural change

In comparison to the log P distribution obtained for the HTS dataset in Figure 3.33, the smoother distribution in Figure 3.35 reflects the larger number of compounds it comprises. Whereas 85% of log P predictions in the HTS dataset were within 1 log P unit of their measured value, only 75% of predictions for the PM dataset were. Therefore, in contrast to the findings for pKa predictions, log P predictions for the HTS dataset were marginally more accurate than those for the PM dataset.

3.8.2 The impact of the STT changing tautomers on the outcome of log P and pKa predictions

3.8.2.1 Introduction

There was only a limited amount of measured log P and pKa data available

for the 76 changed-tautomer compounds identified from the HTS and PM datasets.

Therefore it was not possible to compare their measured and predicted property

values in isolation.

However the compound substructures from Table 3.6 were used to search the

entire Syngenta database for other examples where log P or pKa values had been

measured. This search identified 54 compounds from 9 of the tautomer structure

classes (3, 5, 6, 9, 10, 11, 12, 15 and 20) with measured pKa values and 69

compounds from the same nine classes with measured log P values. The overlap

between these compounds and the HTS dataset corresponded to 8 compounds, the

remainder of which will be referred to by the generic reference numbers “MEASxx”

where xx = 01-73.

3.8.2.2 Defining tautomer type subclasses

The 23 generic tautomer classes defined so far often encompass a variety of

more specific substructures. To better address the underlying diversity, 7 of the 9

where measured data was available were split into between 2 and 5 further sub-

classes using the guidelines defined in Chapter 2, Section 10.5.

The substructures comprising the expanded list of tautomer types with

measured property values are shown in Table 3.8 together with the number of


examples of each. Those unchanged from Table 3.6 are included in Table 3.8 with

the same reference number and those that have been subdivided are suffixed by a

series of letters.

[Structure drawings: only the atom labels recovered from each drawing are listed]

Ref. | NDC structure | PRF structure | Meas’d pKa | Meas’d log P | Both meas’d
3 | N, N, OH, A, A, A, A, A | N, NH, O, A, A, A, A, A | 3 | 4 | 3
5a | N, OH, A, A, A, A | NH, O, A, A, A, A | 3 | 1 | 1
5b | N, OH, A, A, N, A, A, A | NH, O, A, A, N, A, A, A | 3 | 13 | 2
5c | N, OH, N, A, A, A | NH, O, N, A, A, A | 0 | 1 | 0
5d | N, OH, N, A, A, A | NH, O, N, A, A, A | 1 | 1 | 1
5e | N, OH, A, A, A, O, A | NH, O, A, A, A, O, A | 6 | 5 | 5
6a | N, N, OH, OH, A, A | NH, NH, O, O, A, A | 3 | 7 | 3
6b | N, N, OH, OH, N, N, A, A | NH, NH, O, O, N, N, A, A | 1 | 0 | 0
6c | N, N, OH, OH, A, O, A, A, A | NH, NH, O, O, A, O, A, A, A | 1 | 1 | 1
9a | N, N, OH, N, H, A, A, A, A, O, A | NH, N, O, N, H, A, A, A, A, O, A | 1 | 1 | 1
9b | N, N, OH, N, A, A, A, A | NH, N, O, N, A, A, A, A | 2 | 4 | 2
9c | N, N, OH, N, H, A, A, A | NH, N, O, N, H, A, A, A | 4 | 2 | 2
10 | N, N, OH, A, A, A | NH, N, O, A, A, A | 2 | 4 | 2
11a | N, OH, OH, A, A, O, A | NH, OH, O, A, A, O, A | 3 | 3 | 3
11b | N, OH, OH, N, N, A, A, A | NH, OH, O, N, N, A, A, A | 0 | 1 | 0
12a | N, OH, A, A, A, A | NH, O, A, A, A, A | 4 | 3 | 1
12b | N, OH, A, A, A, O, A | NH, O, A, A, A, O, A | 5 | 6 | 5
12c | N, OH, N, A, O, A, A, A, A, H, A | NH, O, N, O, A, A, A, A, A, A, H | 3 | 3 | 3
12d | N, OH, A, A, O, A, A, A, A | NH, O, A, A, O, A, A, A, A | 1 | 1 | 1
15a | N, N, OH, A, A, O, A | NH, N, O, A, A, O, A | 2 | 2 | 2
15b | N, N, OH, A, A, A | NH, N, O, A, A, A | 0 | 1 | 0
20a | N, N, OH, A, A, A, A, A | NH, N, O, A, A, A, A, A | 0 | 2 | 0
20b | N, N, OH, A, A, A | NH, N, O, A, A, A | 3 | 1 | 1
20c | N, N, OH, A, OH, O, A | NH, N, O, A, OH, O, A | 1 | 1 | 1
20d | N, N, OH, A, N, A, A, A | NH, N, O, A, N, A, A, A | 1 | 1 | 1
20e | N, N, OH, A, OH, A | NH, N, O, A, OH, A | 1 | 0 | 0
Totals | | | 54 | 69 | 41

A = Any non-heteroatoms OR any non-protonated heteroatoms OR any group not containing heteroatoms conjugated into the ring. A-groups must also not be connected to form additional rings.

Table 3.8: Expanded list of tautomer substructure types for compounds with measured pKa or log P values

Table 3.8 reveals the degree of diversity found within many of the more

generic substructures previously identified in Table 3.7. It also shows that there is a

high degree of overlap of compounds that have both measured log P and pKa data.

3.8.2.3 pKa comparisons

A summary of the measured pKa values, together with the NDC and PRF tautomer ACD/pKa predictions for the compounds listed in Table 3.8, is shown in Table 3.9:

Columns: Ref; Type; Meas'd pKa; NDC predicted pKa; PRF predicted pKa; Absolute NDC - Meas'd difference; Absolute PRF - Meas'd difference; Prediction improvement (NDC→PRF)

HTS0047 3 7.70 A 0.47 MA - - 7.23

HTS0957 3 3.61 A 4.17 A 2.96 MA 0.56 0.65 -0.09

HTS0958 3 3.94 A - - 4.12 MA 0.18

MEAS10 5a 2.00 A 3.22 MA 3.03 MA 1.22 1.03 0.19

MEAS11 5a 3.09 A 4.26 MA 4.24 MA 1.17 1.15 0.02

MEAS19 5a 6.00 A 4.87 MA 6.31 MA 1.13 0.31 0.82

MEAS16 5b 9.50 A 2.02 A 5.55 MB 7.48

MEAS17 5b 6.55 A 1.13 A 4.81 MB 5.42

MEAS18 5b 11.60 A 4.80 A 6.36 MB 6.80

MEAS23 5d 11.30 A 4.85 A 5.91 MB 6.45

MEAS36 5e 9.20 A 8.43 MB 8.68 MA 0.52

MEAS42 5e 5.50 A 4.58 MA 6.53 MA 0.92 1.03 -0.11

MEAS45 5e 9.10 A 8.51 MB 8.73 MA 0.37

MEAS46 5e 5.40 A 4.66 MA 6.59 MA 0.74 1.19 -0.45

MEAS47 5e 3.60 A 3.45 MA 2.51 MA 0.15 1.09 -0.94

MEAS51 5e 4.80 A 4.55 MA 7.02 MA 0.25 2.22 -1.97

HTS0107 6a 9.34 A 2.06 MA 9.15 MA 7.28 0.19 7.09

MEAS70 6a 8.93 A 2.06 MA 9.15 MA 6.87 0.22 6.65

MEAS71 6a 8.93 A 2.06 MA 9.15 MA 6.87 0.22 6.65

MEAS03 6b 7.83 A 0.82 A 8.46 MA 7.01 0.63 6.38

MEAS20 6c 9.20 A 2.11 MA 9.20 MA 7.09 0.00 7.09

MEAS66 9a 8.60 A 0.25 MA 9.47 MA 8.35 0.87 7.48

MEAS05 9b 10.70 A 4.64 MA 10.38 MA 6.06 0.32 5.74

MEAS12 9b 2.00 B 7.48 MB 7.54 MA 5.48

MEAS01 9c 9.90 A 4.24 MA 10.05 MA 5.66 0.15 5.51

MEAS06 9c 11.00 A 5.19 MA 11.09 MA 5.81 0.09 5.72

MEAS07 9c 10.60 A 4.71 MA 10.58 MA 5.89 0.02 5.87

MEAS08 9c 9.60 A 4.58 MA 10.94 MA 5.02 1.34 3.68

HTS0451 10 4.97 A 6.73 MB 5.51 MA 0.54

HTS0508 10 4.93 A 6.71 MB 5.51 MA 0.58

MEAS40 11a 5.40 A 6.45 MA 4.50 MA 1.05 0.90 0.15

MEAS43 11a 5.90 A 6.61 MA 4.50 MA 0.71 1.40 -0.69

MEAS48 11a 5.40 A 6.51 MA 4.50 MA 1.11 0.90 0.21

MEAS14 12a 8.30 A 6.13 MA 7.97 MA 2.17 0.33 1.84

MEAS15 12a 9.70 A 8.10 MB 9.83 MA 0.13

MEAS54 12a 6.90 A 7.57 MA 7.04 MA 0.67 0.14 0.53

MEAS62 12a 7.60 A 7.77 MA 7.85 MA 0.17 0.25 -0.08

MEAS37 12b 10.30 A 7.73 MB 9.53 MA 0.77

MEAS41 12b 6.00 A 5.03 MA 5.38 MA 0.97 0.62 0.35

MEAS44 12b 8.40 A 5.50 MB 8.81 MA 0.41

MEAS58 12b 6.59 A 7.96 MA 5.94 MA 1.37 0.65 0.72

MEAS59 12b 5.69 A 5.58 MA 5.86 MA 0.11 0.17 -0.06

MEAS50 12c 10.20 A 13.27 MA 9.74 MA 3.07 0.46 2.61

MEAS52 12c 13.30 A 8.72 MB 10.43 MA 2.87

MEAS56 12c 5.30 A 6.71 MB 8.86 MA 3.56

MEAS55 12d 4.59 A 5.32 MB 8.82 MA 4.23

MEAS72 15a 9.76 A 13.77 MA 9.74 MA 4.01 0.02 3.99

MEAS73 15a 9.66 A 13.78 MA 9.23 MA 4.12 0.43 3.69

MEAS21 20b 8.20 A 8.36 MB 7.96 MA 0.24

MEAS53 20b 6.51 A 12.29 MA 6.17 MA 5.78 0.34 5.44

MEAS57 20b 7.60 A 8.21 MB 7.77 MA 0.17


MEAS49 20c 4.60 A 3.74 MA 4.50 MA 0.86 0.10 0.76

MEAS13 20d 4.38 A 1.28 MA 5.86 MA 3.10 1.48 1.62

MEAS02 20e 5.42 A 6.50 MA 4.50 MA 1.08 0.92 0.16

Mean absolute difference between measured and predicted pKa values: 3.59 0.76

• A = acidic pKa, B = basic pKa, MA = most acidic pKa, MB = most basic pKa.

• Cells highlighted yellow relate to predictions against which no comparisons can be drawn due to pKa type incompatibility (no more suitable pKa prediction available) or prediction failure.

• Acidic pKas highlighted in black were obtained by manual ACD/pKa prediction experiments since the SOLSTICE version only provided a basic pKa with the settings used.

Table 3.9: Summary of measured and predicted pKa values

A summary of the NDC and PRF structure pKa prediction accuracy, reported

at a variety of thresholds is given in Table 3.10.

% of successful prediction comparisons made within x units of the measured pKa value

Compound form | < 0.5 | < 1.0 | < 2.0 | < 4.0 | > 4.0 | Unknown (number of compounds where comparison was not possible)
NDC | 9.8 | 26.8 | 43.9 | 51.2 | 48.8 | 13
PRF | 50.0 | 75.0 | 91.7 | 97.9 | 2.1 | 6
% Improvement | 40.2 | 48.2 | 47.8 | 46.7 | 46.7 |

Table 3.10: Summary of the accuracy of pKa predictions for compounds with measured values

Table 3.9 shows that, overall, predictions made for the PRFs of molecules are on average over 2.8 pKa units more accurate than those for their NDC forms (mean absolute differences of 0.76 and 3.59 pKa units respectively). Emphasising this positive effect, Table 3.10 shows that an improvement of at least 40% in prediction accuracy occurs across a range of measured vs. predicted

pKa difference thresholds.
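The difference and improvement columns of Table 3.9 follow directly from the measured and predicted values. The sketch below recomputes them for four rows copied from Table 3.9. The function name and the tuple layout are assumptions made for this example.

```python
# (reference, measured pKa, NDC predicted pKa, PRF predicted pKa),
# copied from Table 3.9. The layout is an illustrative assumption.
ROWS = [
    ("HTS0107", 9.34, 2.06, 9.15),
    ("MEAS20", 9.20, 2.11, 9.20),
    ("MEAS51", 4.80, 4.55, 7.02),
    ("MEAS14", 8.30, 6.13, 7.97),
]


def improvement(measured, ndc_pred, prf_pred):
    """Return (|NDC - measured|, |PRF - measured|, improvement NDC -> PRF).

    A positive improvement means the PRF prediction is closer to the
    measured value than the NDC prediction.
    """
    ndc_error = abs(ndc_pred - measured)
    prf_error = abs(prf_pred - measured)
    return ndc_error, prf_error, ndc_error - prf_error


for ref, measured, ndc_pred, prf_pred in ROWS:
    ndc_error, prf_error, gain = improvement(measured, ndc_pred, prf_pred)
    print(f"{ref}: NDC {ndc_error:.2f}, PRF {prf_error:.2f}, improvement {gain:.2f}")
```

For HTS0107 this gives differences of 7.28 and 0.19 and an improvement of 7.09, as listed in Table 3.9. MEAS51 gives a negative improvement (-1.97), which is an example of a prediction that worsened on conversion to the PRF.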

Such positive benefits confirm that the effect of converting NDC structures to PRF structures was substantial. Table 3.9 also shows evidence that the degree of prediction improvement for compounds within specific subclasses, or between related classes of tautomer substructure, is often very similar. For example, converting the 5 type 6a, 6b and 6c structures improved their pKa predictions by 6.4-7.1 pKa units. Evidence of similarly uniform improvements can be seen for types 9b and 9c. In contrast, the predictions for the type 5e compounds examined largely suffered from changing them to their PRF tautomers. Such negative effects for particular

substructures are discussed in Chapter 3, Section 8.2.6.

3.8.2.4 Log P comparisons

A summary of the measured log P values and the NDC and PRF tautomer log P predictions for the compounds listed in Table 3.8 is shown in Table 3.11:

Columns: Ref; Type; Meas’d log P; NDC predictions (ELOGP, ClogP, ACD logP, AlogP); PRF predictions (ELOGP, ClogP, ACD logP, AlogP); Abs. diff. meas’d → NDC; Abs. diff. meas’d → PRF; Imprvm’t NDC → PRF

HTS0047 3 0.89 1.72 2.60 1.34 1.20 0.82 0.52 1.70 0.25 0.83 0.07 0.76

HTS0810 3 0.91 1.32 2.44 1.16 0.37 0.71 0.41 2.30 -0.59 0.41 0.20 0.21

HTS0957 3 0.50 0.41 0.95 0.67 -0.38 -0.71 -17.26 0.47 -1.33 0.09 1.21 -1.12

HTS0958 3 0.50 0.56 1.51 0.17 -0.01 -0.06 -0.52 1.30 -0.96 0.06 0.56 -0.50

MEAS11 5a 2.34 2.12 2.41 1.68 2.28 -0.07 -0.42 0.37 -0.15 0.22 2.41 -2.19

MEAS16 5b 2.10 3.15 3.10 3.44 2.91 1.84 0.99 2.84 1.68 1.05 0.26 0.79

MEAS18 5b 1.60 2.83 2.97 2.76 2.77 1.58 0.60 3.08 1.07 1.23 0.02 1.22

MEAS22 5b 2.15 3.28 3.37 3.22 3.24 2.02 1.05 3.67 1.35 1.13 0.13 1.00

MEAS25 5b 2.58 3.76 3.90 3.75 3.64 2.51 1.57 4.20 1.74 1.18 0.08 1.11

MEAS26 5b 1.79 3.29 3.40 3.29 3.17 2.03 1.08 3.74 1.28 1.50 0.24 1.26

MEAS27 5b 3.45 4.15 4.27 4.14 4.03 3.12 1.94 5.29 2.14 0.70 0.33 0.37

MEAS28 5b 3.89 4.63 4.80 4.67 4.43 3.61 2.47 5.82 2.54 0.74 0.28 0.46

MEAS29 5b 2.99 4.16 4.30 4.21 3.96 3.13 1.97 5.36 2.07 1.17 0.14 1.03

MEAS30 5b 3.03 4.27 4.43 4.28 4.10 3.02 2.10 4.73 2.21 1.24 0.01 1.23

MEAS31 5b 2.60 3.79 3.90 3.75 3.71 2.53 1.57 4.20 1.82 1.19 0.07 1.12

MEAS32 5b 2.57 3.72 3.77 3.68 3.71 2.46 1.44 4.13 1.82 1.15 0.11 1.04

MEAS33 5b 2.99 4.20 4.30 4.21 4.10 2.95 1.97 4.66 2.21 1.21 0.04 1.17

MEAS34 5b 3.50 4.69 4.83 4.74 4.50 3.43 2.50 5.19 2.61 1.19 0.07 1.12

MEAS24 5c 2.80 3.88 4.04 3.98 3.60 2.49 1.67 4.11 1.71 1.08 0.31 0.77

MEAS23 5d 2.30 3.37 3.49 3.42 3.20 1.99 1.11 3.54 1.31 1.07 0.31 0.76

MEAS36 5e 1.40 3.02 4.11 1.79 3.17 1.40 1.12 1.34 1.74 1.62 0.00 1.62

MEAS42 5e 3.23 3.52 4.61 2.36 3.60 1.85 1.60 1.53 2.42 0.29 1.38 -1.09

MEAS45 5e 0.60 1.60 2.20 0.77 1.84 -0.02 -0.74 0.28 0.41 1.00 0.62 0.38


MEAS46 5e 2.39 2.26 2.70 1.34 2.73 0.43 -0.26 0.47 1.09 0.14 1.96 -1.82

MEAS51 5e 2.45 1.74 1.94 0.86 2.44 -0.01 -1.02 0.21 0.79 0.71 2.46 -1.75

HTS0107 6a < 0.50 0.48 -1.52 3.96 -1.01 -1.86 -3.46 1.11 -3.23 0.02 2.36 -2.33

MEAS61 6a 0.60 1.71 1.45 2.39 1.29 -0.47 -0.48 0.01 -0.93 1.11 1.07 0.04

MEAS63 6a < 0.50 1.11 0.19 2.49 0.65 -0.67 -1.80 1.36 -1.57 0.61 1.17 -0.57

MEAS64 6a < 0.50 1.26 -0.09 3.37 0.51 -0.82 -2.03 1.28 -1.71 0.76 1.32 -0.56

MEAS65 6a < 0.50 1.68 1.20 2.70 1.15 -0.35 -0.74 0.76 -1.07 1.18 0.85 0.33

MEAS70 6a < 0.50 0.58 -0.66 3.28 -0.88 -1.75 -2.60 0.44 -3.10 0.08 2.25 -2.17

MEAS71 6a < 0.50 1.05 0.93 1.69 0.54 -0.51 -1.01 1.16 -1.68 0.55 1.01 -0.46

MEAS20 6c 2.10 3.60 4.78 1.87 4.17 2.20 2.79 2.47 1.33 1.50 0.10 1.41

MEAS66 9a 3.17 3.07 2.96 2.18 4.08 3.15 2.64 3.32 3.48 0.10 0.02 0.08

MEAS04 9b 1.80 3.20 3.70 2.79 3.11 1.76 2.07 1.90 1.30 1.40 0.04 1.36

MEAS05 9b 0.30 1.74 2.11 1.20 1.92 0.30 0.48 0.31 0.11 1.44 0.00 1.44

MEAS09 9b 1.10 2.24 2.77 1.80 2.13 0.80 1.09 0.78 0.52 1.14 0.30 0.83

MEAS12 9b 2.40 2.82 3.79 1.09 3.57 2.11 2.19 2.38 1.77 0.42 0.29 0.13

MEAS06 9c 2.20 3.25 4.13 2.53 3.09 1.97 2.44 2.20 1.28 1.05 0.23 0.82

MEAS07 9c 0.20 1.33 2.02 0.40 1.56 0.05 0.32 0.08 -0.25 1.13 0.15 0.97

HTS0451 10 1.77 3.68 3.17 4.37 3.48 2.50 2.50 2.09 - 1.91 0.73 1.18

HTS0508 10 2.60 4.44 3.72 5.43 4.17 3.04 3.04 3.14 - 1.84 0.44 1.40

MEAS67 10 1.51 3.63 2.92 4.23 3.75 2.25 2.25 2.95 - 2.12 0.74 1.38

MEAS68 10 1.63 4.02 3.42 4.69 3.93 2.75 2.75 3.54 - 2.39 1.12 1.27

MEAS40 11a 1.66 1.73 2.51 0.44 2.25 0.20 0.48 0.46 -0.33 0.07 1.46 -1.39

MEAS43 11a 3.00 3.54 4.92 1.92 3.77 1.87 2.93 2.11 0.57 0.54 1.13 -0.60

MEAS48 11a 0.87 1.52 2.16 0.25 2.16 0.01 0.21 0.24 -0.42 0.65 0.86 -0.21

HTS1013 11b 2.21 3.60 3.52 4.06 3.23 2.17 2.43 2.01 2.08 1.39 0.04 1.36

MEAS38 12a 2.60 3.23 3.44 2.23 4.02 1.60 2.05 1.92 0.82 0.63 1.00 -0.37

MEAS39 12a 3.60 4.23 4.55 3.18 4.95 2.67 3.17 3.09 1.75 0.63 0.93 -0.30

MEAS62 12a 1.14 1.90 2.04 1.23 2.43 0.33 0.65 0.18 0.16 0.76 0.81 -0.05

MEAS37 12b 1.93 3.32 4.11 2.44 3.41 1.98 2.16 1.75 2.03 1.39 0.05 1.34

MEAS41 12b 3.62 4.82 5.05 4.71 4.69 2.51 3.15 2.15 2.23 1.20 1.11 0.09

MEAS44 12b 3.15 4.55 5.05 4.31 4.29 2.84 3.15 2.77 2.60 1.40 0.31 1.09

MEAS58 12b 1.28 2.14 2.28 2.40 1.74 0.10 0.38 0.64 -0.72 0.86 1.18 -0.32

MEAS59 12b 2.03 3.13 2.99 3.90 2.49 0.72 1.10 1.03 0.03 1.10 1.31 -0.22

MEAS60 12b 3.20 3.59 3.16 4.19 3.41 1.29 1.26 1.66 0.95 0.39 1.91 -1.53

MEAS50 12c 3.95 4.79 5.21 5.04 4.11 2.57 3.23 2.48 2.00 0.84 1.38 -0.55

MEAS52 12c 2.08 3.26 4.00 3.16 2.62 1.14 2.03 1.08 0.31 1.18 0.94 0.24

MEAS56 12c 5.26 6.49 7.22 7.40 4.86 3.70 4.77 3.78 2.55 1.23 1.56 -0.33

MEAS55 12d 5.69 5.41 6.17 5.51 4.54 3.18 4.14 3.16 2.23 0.28 2.51 -2.23

MEAS72 15a 1.49 2.43 3.38 0.77 3.16 1.35 1.50 1.34 1.21 0.94 0.14 0.80

MEAS73 15a 0.83 2.17 2.86 0.46 3.18 0.59 0.98 0.03 0.75 1.34 0.24 1.09


MEAS35 15b < 1.00 1.80 2.41 0.15 2.86 1.06 0.48 0.98 1.72 0.80 0.06 0.74

MEAS53 20a < 0.50 0.78 0.94 0.46 0.93 -0.07 -1.01 1.32 -0.52 0.28 0.57 -0.29

MEAS69 20a 2.22 2.43 2.95 1.89 2.46 1.62 1.57 1.77 1.51 0.21 0.60 -0.39

MEAS21 20b 0.74 2.07 2.06 1.88 2.28 0.99 0.46 1.08 1.45 1.33 0.25 1.08

MEAS49 20c 0.77 3.26 4.03 2.57 3.18 1.46 2.17 0.04 2.16 2.49 0.69 1.80

MEAS13 20d 1.37 2.17 2.33 1.05 3.12 1.72 0.97 1.78 2.41 0.80 0.35 0.45

Mean absolute difference between measured and predicted log P values: 0.95 0.71

• Values highlighted yellow relate to ELOGP prediction values based solely on ClogP v4 due to the failure of AlogP. SOLSTICE ELOGP automatically gives ClogP v4 whenever AlogP predictions fail.

• Values highlighted in grey are generated from upper limit measured log P values that are simply quoted as being below a particular threshold value. For the purposes of comparison, the upper threshold limit is assumed as the measured value.

Table 3.11: Summary of measured and predicted log P values

A summary of the NDC and PRF structure log P prediction accuracy,

reported at a variety of thresholds is given in Table 3.12.

% of successful prediction comparisons made within x units of the measured log P value

Compound form    < 0.5   < 1.0   < 2.0   < 4.0   % Remainder   Unknown (number of compounds where comparison was not possible)
NDC +            21.7    46.4    95.7    100     0             0
PRF + *          50.7    68.1    92.8    100     0             0
% Improvement    29.0    21.7    -2.9    0       0

* Includes calculations for 4 compounds whose measured log P’s are estimated only by ClogP.
+ Also includes calculations for 8 compounds whose measured log Ps are estimated at threshold values. See Table 3.11 for details.

Table 3.12: Summary of the accuracy of log P predictions for compounds with measured values

A mean improvement in predictions of 0.24 log P units represents a small but still significant increase in accuracy. Table 3.12 also shows that the benefit to log P predictions is greatest at the tighter thresholds, that is, among predictions that fall within 1 log P unit of the measured value. As was observed for pKa predictions, the spectrum of effects on different sub-classes of tautomer is varied and often distinctive. For example, types 5b-d, 9a-c, 10 and 15a/b show significantly better predictions from the use of PRF tautomers instead of their NDC analogues. By way of contrast, for types 6a, 11a, 12a-d and 20a the opposite is more often true.

In the case of the grey-shaded compounds of types 6a and 20a in Table 3.11, the actual log P values could conceivably lie either between or outside their corresponding NDC form and PRF predicted values. The degree of prediction improvement for those compounds is therefore dependent on the assumed actual log P value chosen, which makes it particularly difficult to assess whether converting their NDC form to their PRF has a positive or negative effect. This point is illustrated by the various assumed actual log P values, and the effect they have on improvement estimates, shown in Table 3.13.

Compound Log P prediction improvement (NDC → PRF)

assuming the actual log P value is…

Ref Type -1.0 -0.5 0 0.5

HTS0107 6a 0.62 -0.38 -1.38 -2.33

MEAS63 6a 1.78 1.44 0.44 -0.57

MEAS64 6a 2.08 1.45 0.45 -0.56

MEAS65 6a 2.03 2.03 1.33 0.33

MEAS70 6a 0.83 -0.17 -1.17 -2.17

MEAS71 6a 1.56 1.54 0.54 -0.46

MEAS53 20a 0.84 0.84 0.71 -0.29

Table 3.13: The variation in log P prediction improvement depending on the actual log P value used for compound types 6a and 20a
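The dependence shown in Table 3.13 can be sketched as follows. The sketch assumes that the improvement is the absolute error of the NDC ELOGP prediction minus that of the PRF ELOGP prediction, which is consistent with the rows of Table 3.11; the function name is hypothetical and only two compounds are used.

```python
def improvement(ndc_pred, prf_pred, actual):
    """NDC -> PRF improvement: positive when the PRF prediction is closer."""
    return abs(actual - ndc_pred) - abs(actual - prf_pred)


# ELOGP predictions (NDC form, PRF) taken from Table 3.11.
elogp = {
    "HTS0107": (0.48, -1.86),
    "MEAS65": (1.68, -0.35),
}
# Assumed actual log P values, as in the column headings of Table 3.13.
assumed_values = (-1.0, -0.5, 0.0, 0.5)

sensitivity = {
    ref: [round(improvement(ndc, prf, actual), 2) for actual in assumed_values]
    for ref, (ndc, prf) in elogp.items()
}

for ref, values in sensitivity.items():
    print(ref, values)
```

The results agree with the corresponding rows of Table 3.13 to within 0.01 log P units; the small discrepancy for HTS0107 at an assumed value of 0.5 presumably reflects rounding of the tabulated predictions.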

As was found for tautomer type 5e in Table 3.9, Table 3.11 also shows patterns in prediction improvement that are distinct for particular tautomer sub-classes. For example, log P predictions for PRF structures stand out as being typically poorer than those of their NDC analogues for tautomer types 5e, 6a (apparently), 11a, 12a-d and 20a. “Standard” log P prediction improvements were also seen for the four compounds of type 10 (1.18-1.40 log P units) and the majority of type 5b (1.00-1.26).

3.8.2.5 Re-investigating the validity of the structural changes performed by the STT

The fact that some tautomer substructure compounds seem to give better

predictions in their NDC forms than their PRFs may be at first surprising. To

investigate the issue, the tautomer analysis facility of ACD/pKa was used to suggest

the dominant tautomers for each of the sub-classes using a simple example of each.

Its results are summarised in Table 3.14:

Ref    NDC tautomer    PRF tautomer    ACD/pKa suggested alternative major tautomers
3      Minor 1         Minor 2         Major
5a     Minor           Major           -
5b     Minor           Major           -
5c     Minor           Major           -
5d     Minor           Major           -
5e     Minor           Major 1         Major 2
6a     Minor           Major           -
6b     Minor           Major           -
6c     Minor           Major           -
9a     Minor           CD major 1      CD major 2
9b     Minor           CD major 1      CD major 2
9c     Minor           CD major 1      CD major 2
10     Minor           Major           -
11a    Minor           Major 1         Major 2
11b    Minor           CD major 1      CD major 2
12a    Minor           Major           -
12b    Minor           Major           -
12c    Minor           Major 1         Major 2
12d    Minor           Major 1         Major 2
15a    Minor           Major           -
15b    Minor           Major           -
20a    Minor           CD major 1      CD major 2
20b    Minor           CD major 1      CD major 2
20c    Minor 1         Minor 2         Major
20d    Minor           CD major 1      CD major 2
20e    Minor 1         Minor 2         Major 1; Major 2

(Structure diagrams not reproduced; each entry gives the ACD/pKa classification of the tautomer drawn in that column.)

ACD/pKa dominant tautomer predictions:

• “Minor” = Sole suggested minor tautomer
• “Minor 1 / 2” = Minor tautomers suggested independently of each other
• “Major” = Sole suggested major tautomer
• “CD Major 1 / 2” = Suggested conditions-dependent major tautomers of each other
• “Major 1 / 2” = Major tautomers suggested independently of each other

ACD/pKa tautomer predictions made on the basis that A = tertiary sp3 carbon, e.g t-butyl.

Table 3.14: ACD/pKa major / minor tautomer predictions for example compounds of each substructure type identified in Table 3.8

A comparison of Table 3.14 with Table 3.7 shows that defining more specific tautomer substructures sometimes leads to an increase in the overall number of possible tautomers, and sometimes to a change in which tautomer is predicted to be the “major” one. This occurred for tautomer structure class 5 → subclasses 5a/b/c/d/e, 11 → 11a/b, 12 → 12a/b/c/d and 20 → 20a/b/c/d/e.

For example, the PRF of subclass 20b was predicted to be a conditions-dependent “major” tautomer, whereas the analogous PRF tautomer of subclass 20c was predicted to be a “minor” one, the major form in this case being neither the NDC form nor the PRF. The keto substituent in subclass 20c therefore appears to have an important influence on the position of the tautomeric equilibria.


Subclasses 5e, 11a, 12b, 12c, 12d and 15a also contain tautomeric keto

groups, many of which, according to ACD/pKa, play an active part in the structures

of their “major” tautomers. This may explain why predictions for the PRFs of some

of these subclass compounds are not always an improvement on their NDC forms.

3.8.2.6 Evaluating the predictions of alternative tautomers

So far only the log P and pKa predictions of a compound’s NDC and PRF tautomers have been investigated. However, ACD/pKa has suggested that alternative tautomers can sometimes exist. If this is the case, then property predictions based on them may be better than those of the NDC or PRF tautomers. Comparing the property predictions of all the tautomers of a compound with measured values therefore provided a better means of probing which was the best description of particular molecules, or at least of identifying which tautomer(s) provided the poorest description. Several of the tautomer sub-structure classes will now be examined to these ends.

3.8.2.6.1 Substructures 5b and 5d

As well as the NDC and PRF tautomers, for compounds of substructure types

5b, and 5d there is a third plausible (“minor” according to ACD/pKa) tautomer that

places the variable-position hydrogen on the second ring nitrogen. For example see

Figure 3.36 for type 5b:

[Structure diagrams: NDC tautomer (“minor”), PRF tautomer (“major”), 3rd tautomer (“minor”)]

Figure 3.36

The measured pKa values and the predictions for the three tautomers of each available compound of these types are shown in Table 3.15.

Columns: Compound; Type; Measured pKa; pKa prediction (NDC form, PRF, 3rd tautomer)

MEAS16 5b 9.50 A 2.02 A 5.55 MB 2.09 MB

MEAS17 5b 6.55 A 1.13 A 4.81 MB -- --

MEAS18 5b 11.60 A 4.80 A 6.36 MB 3.95 MB

MEAS23 5d 11.30 A 4.85 A 5.91 MB 4.53 MB

= Prediction failed or no suitable acidic pKa prediction available. A / MA = Acidic / Most Acidic MB = Most basic

Table 3.15: Measured vs. predicted pKa values for the different tautomers of type 5b and 5d compounds

As Table 3.15 shows, difficulty was encountered in obtaining acidic pKa

predictions for the PRF and 3rd tautomers to compare with the acidic pKa values

measured. Though acidic pKa predictions were obtained for their NDC tautomers,

these differed considerably from the measured values. These results thus show that

ACD/pKa has particular difficulties in making accurate predictions for these

subclasses of compounds. Table 3.16 however provides more conclusive evidence of

which tautomer best represents type 5b and 5d structures for log P predictions:

Columns: Compound; Type; Measured log P; ELOGP prediction (NDC form, PRF, 3rd tautomer)

MEAS16 5b 2.10 3.15 1.84 1.08

MEAS18 5b 1.60 2.83 1.58 0.83

MEAS22 5b 2.15 3.28 2.02 1.26

MEAS25 5b 2.58 3.76 2.51 1.75

MEAS26 5b 1.79 3.29 2.03 1.34

MEAS27 5b 3.45 4.15 3.12 1.70

MEAS28 5b 3.89 4.63 3.61 2.19

MEAS29 5b 2.99 4.16 3.13 1.77

MEAS30 5b 3.03 4.27 3.02 2.26

MEAS31 5b 2.60 3.79 2.53 1.77

MEAS32 5b 2.57 3.72 2.46 1.65

MEAS33 5b 2.99 4.20 2.95 2.13

MEAS34 5b 3.50 4.69 3.43 2.62

MEAS23 5d 2.30 3.37 1.99 1.23

(Highlighted predictions are those closest to the measured value)

Table 3.16: Measured vs. predicted log P values for the different tautomers of type 5b and 5d compounds

Table 3.16 shows how for every compound, the predicted log P for its PRF

structure closely mirrors the measured value. This finding is in agreement with

ACD/pKa’s “major” tautomer prediction for these tautomer sub-classes (Table 3.14).
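The comparison behind this observation can be sketched by selecting, for each compound, the tautomer whose predicted log P lies closest to the measured value. The snippet below is an illustrative sketch using three rows of Table 3.16; the function name is hypothetical and is not part of any tool used in the study.

```python
# (measured log P, {tautomer: ELOGP prediction}) for three rows of Table 3.16.
rows = {
    "MEAS16": (2.10, {"NDC": 3.15, "PRF": 1.84, "3rd": 1.08}),
    "MEAS27": (3.45, {"NDC": 4.15, "PRF": 3.12, "3rd": 1.70}),
    "MEAS23": (2.30, {"NDC": 3.37, "PRF": 1.99, "3rd": 1.23}),
}


def closest_tautomer(measured, predictions):
    """Return the tautomer label with the smallest |predicted - measured|."""
    return min(predictions, key=lambda name: abs(predictions[name] - measured))


best = {ref: closest_tautomer(measured, preds) for ref, (measured, preds) in rows.items()}

for ref, tautomer in best.items():
    print(ref, tautomer)
```

For these three rows the PRF is selected each time.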

3.8.2.6.2 Substructures 5e and 12b

For compounds of substructure 5e, as well as the NDC and PRF tautomers,

there is a third alternative (Figure 3.37) that enolises the keto substituent. According

to ACD/pKa’s predictions, this is potentially a major tautomer, along with the PRF

tautomer.

[Structure diagrams: NDC tautomer (“minor”), PRF tautomer (“major”), 3rd tautomer (“major”)]

Figure 3.37

As Tables 3.9 and 3.11 showed, the NDC tautomers of the type 5e

compounds often unexpectedly gave more accurate log P and pKa predictions than

did their PRF tautomers. This is in contradiction with the ACD/pKa tautomer

predictions in Table 3.14 where the NDC tautomer was thought to be a “minor” one.

To resolve the issue, pKa and log P predictions for each tautomer of each type 5e

compound were obtained and compared with available measured values. The results

are shown in Tables 3.17 and 3.18.

Columns: Compound; Type; Measured pKa; pKa prediction (NDC form, PRF, 3rd tautomer)

MEAS36 5e 9.20 A 8.43 MB 8.68 MA 4.53 MA

MEAS42 5e 5.50 A 4.58 MA 6.53 MA 4.50 MA

MEAS45 5e 9.10 A 8.51 MB 8.73 MA 4.54 MA

MEAS46 5e 5.40 A 4.66 MA 6.59 MA 4.50 MA

MEAS47 5e 3.60 A 3.45 MA 2.51 MA 4.50 MA

MEAS51 5e 4.80 A 4.55 MA 7.02 MA 4.50 MA

= No suitable acidic pKa found. A / MA = Acidic / Most Acidic. MB = Most basic Highlighted predictions are the closest of those available to the measured value.

Table 3.17: Measured vs. predicted pKa values for the different tautomers of type 5e compounds

Columns: Compound; Type; Measured log P; log P prediction (NDC form, PRF, 3rd tautomer)

MEAS36 5e 1.40 3.02 1.40 1.63

MEAS42 5e 3.23 3.52 1.85 3.06

MEAS45 5e 0.60 1.60 -0.02 -0.07

MEAS46 5e 2.39 2.26 0.43 1.35

MEAS51 5e 2.45 1.74 -0.01 1.10

Highlighted predictions are those closest to the measured value.

Table 3.18: Measured vs. predicted log P values for the different tautomers of type 5e compounds

For the 5 compounds common to both, Tables 3.17 and 3.18 show that the tautomer giving the predictions closest to the measured pKa and log P values is always the same for each compound. The log P and pKa predictions for each compound’s 3rd tautomer are consistently poorer than for the NDC or PRF tautomers. This suggests that, despite ACD/pKa’s prediction, it is the least accurate description of type 5e compounds.

The tables also show that the tautomer that gave the best log P prediction

varied between the NDC and PRF. This appears to indicate that the balance between

which tautomers are major and minor is variable. For example, the varying steric and

electronic effects of different combinations of substituents attached to the

substructure may favour different tautomers (as was found in several examples in

Chapter 1, Section 4.1) or sometimes artificially enhance the predictions of minor

tautomers over major ones.

Given that ACD/pKa predicted that type 5e NDC tautomers would be “minor” forms, the fact that some compounds drawn as this tautomer gave the most accurate property predictions of all is a significant result. One phenomenon that may explain how the NDC tautomer could be stabilised for these compounds is the intramolecular hydrogen bonding opportunity offered to its phenolic proton by the carbonyl oxygen of the adjacent keto substituent (Figure 3.38). Examples in Chapter 1, Section 4.1 show that this arrangement is not without precedent.

[Structure diagram: NDC tautomer with the hydroxyl group adjacent to the keto substituent]

Figure 3.38

Other tautomer subclasses where similar intramolecular hydrogen bonding is

possible include 11a, 12c and 12d. Such a conclusion would also provide an

explanation why their NDC tautomers often gave more accurate predictions than

their PRF analogues.

Compounds of type 12b, the 2-pyridone analogue of type 5e, behave in a contrasting manner. Analysis of the pKa predictions for each of their three tautomers (Table 3.19) suggests that their PRFs mainly give the most accurate predictions, consistent with ACD/pKa’s major tautomer prediction for this type.

Columns: Compound; Type; Measured pKa; pKa prediction (NDC form, PRF, 3rd tautomer)

MEAS37 12b 10.30 A 7.73 MB 9.53 MA 4.50 MA

MEAS41 12b 6.00 A 5.03 MA 5.38 MA 4.50 MA

MEAS44 12b 8.40 A 5.50 MB 8.81 MA 4.50 MA

MEAS58 12b 6.59 A 7.96 MA 5.94 MA 4.53 MA

MEAS59 12b 5.69 A 5.58 MA 5.86 MA 4.50 MA

= No suitable acidic pKa found. A / MA = Acidic / Most Acidic. MB = Most basic (Highlighted predictions are the closest of those available to the measured value)

Table 3.19: Measured vs. predicted pKa values for the different tautomers of type 12b compounds

The log P prediction data (Table 3.20) shows that NDC tautomers usually

gave the poorest results, consistent with the pKa prediction findings. This indicates

that the NDC tautomer is a good description for type 5e compounds, but very poor

for type 12b compounds.


Columns: Compound; Type; Measured log P; log P prediction (NDC form, PRF, 3rd tautomer)

MEAS37 12b 1.40 3.32 1.98 2.66

MEAS41 12b 3.62 4.82 2.51 3.86

MEAS44 12b 3.15 4.55 2.84 3.52

MEAS58 12b 1.28 2.14 0.10 1.22

MEAS59 12b 2.03 3.13 0.72 2.07

MEAS60 12b 3.20 3.59 1.29 2.33

(Highlighted predictions are those closest to the measured value)

Table 3.20: Measured vs. predicted log P values for the different tautomers of type 12b compounds

The steric and electronic differences between compounds of types 5e and 12b are therefore likely to result in them having different tautomer equilibrium positions, affecting which tautomer is the major one. This would also explain, in a wider sense, why different and distinct trends in prediction improvement were often seen for different substructure types in Tables 3.9 and 3.11.

3.8.2.6.3 Substructures of type 12a

Three surprising results from the predicted log P data were the poorer

ELOGP predictions of the 12a compounds (MEAS38, MEAS39, and MEAS62) for

their PRFs than their NDC forms. The 12a substructure represents examples of the

relatively simple 2-hydroxypyridine (NDC) / 2-(1H)-pyridone (PRF)

tautomerisation, to which there are no other alternative tautomers and of which the

pyridone form is commonly regarded as the more accurate representation.

It is hard to conceive that the various phenyl, bromo and alkyl substituents of these compounds are able to induce a significant change of tautomer through their electronic properties. It is therefore most likely that their PRF tautomers are still the dominant ones, but that the bulky trifluoromethyl or phenyl substituent attached to the pyridone ring immediately adjacent to the nitrogen in each compound artificially enhances predictions for the NDC tautomers. This in turn suggests that ELOGP’s treatment of these substituents lacks a measure of the steric requirements and preferences of these functional groups.

3.9 A method of investigating tautomer issues not highlighted by the STT

The application of the STT to sets of agrochemical database structures has given an insight into the type and extent of the tautomer misrepresentation issue. However, it has provided no indication of what other types of tautomer it encountered but left unchanged or ignored. One method of probing this issue, with the aim of identifying any further tautomeric substructures, was the analysis of available CHI (Chromatographic Hydrophobicity Index) data for compounds from the HTS dataset.

3.9.1 Analysis of CHI data

In this dataset, 122 compounds had CHI values at 3 pHs, and of these 22 were

found to contain substructures where tautomerism was an issue. Of these 22

compounds, only 4 (HTS0451, HTS0508, HTS0810 and HTS1364) had previously

been identified and had their tautomer form changed by the STT. The remaining 18

compounds (15% of those with measured CHI values) were previously undiscovered.

Since the number of compounds for which CHI data was available was relatively

small, it is probable that of the remaining HTS dataset there is a sizeable number of

other tautomeric compounds that also go unchanged or unnoticed by the STT.

The 18 newly identified compounds were classified into a series of further

tautomer substructural classes and summarised in Table 3.21. In the case of type 34a

they have been supplemented by two additional examples identified from the PM

dataset.

Type   Examples                                                             No of possible tautomers   Notes
26a    HTS0320 / HTS0321                                                    6                          -
26b    HTS0526 / HTS0527                                                    3                          Three fewer tautomers than 26a due to a nitrogen being tertiary rather than secondary
27a    HTS0246 / HTS0381                                                    2                          -
27b    HTS0479 / HTS0480                                                    3                          One more tautomer than 27a due to one less plane of symmetry
28     HTS1368                                                              3                          -
29     HTS1418                                                              3                          -
30     HTS1499                                                              5                          -
31     HTS1505                                                              3                          -
32     HTS1321 / HTS1322                                                    3                          -
33     HTS1335                                                              2                          -
34a    HTS1014 / HTS1015 / PL1052 (Mesotrione) / PL1434 (Sulcotrione)       ≥ 4                        Drawn in tri-ketone form
34b    HTS1326                                                              > 4                        Similar to 34a but drawn in a mono-enol / di-ketone form

Table 3.21: The additional tautomer substructure types identified from the HTS dataset by the analysis of CHI data

For the majority of these compounds the prototropic tautomerisations involved were of the already familiar OH → NH, OH → OH or NH → NH types. Only types 32, 34a and 34b were keto-enol type tautomers. For example, Sulcotrione (PL1434) is represented in the PM in its tri-ketone form; however, the C-H bond centred between its three ketone groups is, because of their presence, particularly acidic. As a result, one or both of the compound’s enol forms are likely to be major tautomers (Figure 3.39).

[Structure diagrams: PL1434 (Sulcotrione) in its tri-ketone form, and one and / or the other of its two enol forms]

Figure 3.39

3.9.2 Comparison of measured and predicted log P and pKa data for “new” tautomeric compounds

In order to gauge the differences in log P and pKa predictions between the

different tautomers of the various types described in Table 3.21, a series of measured

value vs. predicted value comparisons for each tautomer were carried out for those

compounds where at least one piece of measured data was available. A summary of

these findings is shown in Table 3.22.

Comp’d ref             Type   Taut’ ref   Meas’d pKa   Pred’d pKa   Meas’d log P   Pred’d log P
HTS0526                26b    NDC         -            -            5.83           4.99
                              Alt 1                    -                           4.49
                              Alt 2                    -                           3.69
HTS0527                26b    NDC         -            -            4.94           3.98
                              Alt 1                    -                           3.51
                              Alt 2                    -                           2.83
HTS0246                27a    NDC         7.82 * MB    4.50 MB      5.16           5.19
                              Alt 1                    4.34 MB                     5.28
HTS0381                27a    NDC         8.03 + MB    4.12 MB      -              -
                              Alt 1                    4.50 MB                     -
HTS0479                27b    NDC         7.37 + MB    4.03 MB      -              -
                              Alt 1                    4.51 MB                     -
                              Alt 2                    4.28 MB                     -
HTS0480                27b    NDC         7.94 + MB    4.03 MB      -              -
                              Alt 1                    4.51 MB                     -
                              Alt 2                    4.28 MB                     -
HTS1499                30     NDC         4.43 MA      2.10 MA      2.73           1.46
                              Alt 1                    2.10 MA                     1.46
                              Alt 2                    1.20 MA                     2.57
                              Alt 3                    1.74 MA                     2.16
                              Alt 4                    2.10 MA                     1.30
PL1052 (Mesotrione)    34a    NDC         3.12 MA      2.77 MA      -              -
                              Alt 1                    4.50 MA                     -
                              Alt 2                    4.50 MA                     -
PL1434 (Sulcotrione)   34a    NDC         3.13 MA      2.87 MA      -5.00          0.79
                              Alt 1                    4.50 MA                     0.28
                              Alt 2                    4.50 MA                     0.64

• Predictions in bold are those closest to the measured value
• - = no measured data available
• * = average of 4 measurements
• + = average of 3 measurements

Table 3.22: Measured vs. predicted pKa and log P values for the different tautomers of compounds identified from HTS dataset by the analysis of CHI data

As the log P predictions for different tautomers of the same molecule are often similar to each other, it is difficult with the limited data available to conclude which tautomer best describes each structure. Importantly however, the alternative tautomers to the NDC form in which the compound is represented in the database do not give consistently poorer predictions and so cannot be conclusively ruled out. Similarly, the NDC representation’s predictions cannot be ruled out as noticeably poorer on this limited evidence either. Such findings are likely to mean that these compounds have multiple dominant tautomers, the equilibria between which are sensitive to the nature of each individual molecule.

As Table 3.22 shows, even the pKa predictions closest to the measured values were often still very different from them. It can be concluded that ACD/pKa had particular difficulty with the poorly refined, and so inherently more complex, tautomer issues found in these molecules. So long as this is the case, establishing ground rules for which tautomer should be used for the most accurate pKa prediction in such cases is likely to remain problematic.


4 Evaluating tautomeric misrepresentation in a larger dataset

4.1 Introduction

The methodology to assess tautomeric misrepresentation in a dataset of compounds was developed on two relatively small compound sets, primarily for ease of data handling and examination. However, in order to gain a better appreciation of the issue in a broader context it was considered important to apply it to a larger dataset. With this aim in mind, a 10,000 compound sample was gathered from the Interim Vendor Database (IVDB) that is held and maintained at Syngenta.

4.2 Sampling of compounds

The IVDB comprises the structures of approximately 14 million compound

entries gathered from the compound catalogues of various chemical suppliers (e.g.

Aldrich) and also vendors of samples used in biological screening. It is therefore a

sizable collection that provides a window on the “chemical universe” outside of

Syngenta. 20% of IVDB compound records were converted into Daylight SMILES

format using the structure format inter-conversion tool dbtranslate of UNITY

(Tripos, 2004) and the set then screened to remove duplicates. An implementation of

the Knuth sampling algorithm (Knuth, 1998) was then applied to the remaining

approximately 1.6 million structures to obtain a 10,000 compound set.
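The text does not state which of Knuth's sampling algorithms was implemented. As an illustration only, the following sketch assumes selection sampling (Algorithm S), which draws exactly n records from a stream of known length in a single pass; the function name and arguments are hypothetical.

```python
import random


def selection_sample(records, n, total, rng=None):
    """Draw exactly n of `total` records in one pass (selection sampling).

    Each record is kept with probability (still needed) / (still unseen),
    which gives every record the same overall chance of selection.
    """
    rng = rng or random.Random()
    chosen = []
    for seen, record in enumerate(records):
        if (total - seen) * rng.random() < n - len(chosen):
            chosen.append(record)
            if len(chosen) == n:
                break
    return chosen


# Example: a 10-record sample from 1,000 records. For the dataset described
# above the call would use n=10_000 and the de-duplicated record count.
sample = selection_sample(range(1000), 10, 1000, random.Random(0))
print(len(sample))
```

A property of this approach is that the sample preserves the original record order and never needs the full dataset in memory, which suits a file of this size.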

This set was screened and 192 salts, which as multiple-component structures would otherwise fail outright with ACD/pKa, were removed. Canonicalisation of the remaining 9,808 compounds’ SMILES using the Unique Structures function of SOLSTICE resulted in a further 336 compounds being removed because the SMILES generated for them by dbtranslate (Tripos, 2004) were invalid, and in 7 further duplicate SMILES being deleted. These duplicates were presumably not picked up earlier because different SMILES forms had been used. The 336 compounds “lost” during the canonicalisation stage were due to Tripos (dbtranslate) and Daylight (canonicalisation rules) using slightly different SMILES drawing conventions. As the majority of the remaining 9,465 compounds came without a compound reference number, a generic reference “djpxxxx” (where xxxx = 0001-9465) was assigned to each. This set of compounds will commonly be referred to as the IVDB dataset.

4.3 Dataset analysis

This compound set was analysed to determine the degree of overlap between

it and the HTS, PM and measured value compound sets studied in Chapter 3. It was

found that only eight of its compounds had previously been studied, six from the

HTS set (HTS1705, HTS1873, HTS2032, HTS2044, HTS2189 and HTS2432) and

two from the PM dataset (PL0030 (1-Naphthylacetic acid) and PL1236

(Phenthoate)).

SMILES files for both the Native Drawing Convention (NDC) and

Physiologically Relevant Form (PRF) of the dataset compounds were then prepared

and log P (ELOGP), solubility (ESOL) and pKa (ACD/pKa) prediction jobs run for

each. Of the 9,465 compounds involved, 9,100 successful log P and solubility

prediction comparisons between the Native Drawing Convention (NDC) forms and

Physiologically Relevant Forms (PRFs) of the dataset were obtained (96.1% of

compounds, the same set for both property predictions). Similarly, successful pKa

prediction comparisons were made on 7,461 occasions (78.8% of compounds).

Indexing of the successful prediction results allowed each compound to be

classified according to whether the Structure Transformation Tool (STT) changed its

structure and whether a change in property prediction occurred between its NDC and

PRF forms. A summary of the effect of the STT on the IVDB dataset for each

property prediction is shown in Table 4.1.

                              Changed structure? (NDC → PRF)
Property   Changed value?     No               Yes
ELOGP      No                 8027 (88.2)      943 (10.4)
           Yes                0 (0)            130 (1.4)
ESOL       No                 8027 (88.2)      943 (10.4)
           Yes                0 (0)            130 (1.4)
pKa        No                 6590 (88.3)      784 (10.5)
           Yes                0 (0)            87 (1.2)

(Numbers represent actual numbers of compounds for which the full set of prediction data was available to allow accurate interpretation. The figures in brackets are the corresponding percentages of the total of those compounds.)

Table 4.1: Classification of changes caused to the IVDB dataset compounds by the STT.
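The classification behind Table 4.1 amounts to a two-by-two tally, sketched below. The function names are hypothetical and a change is detected here by simple inequality of SMILES strings and predicted values, which is an assumption about the indexing step; the counts themselves are those reported for ELOGP.

```python
from collections import Counter


def classify(ndc_smiles, prf_smiles, ndc_value, prf_value):
    """Return (structure_changed, value_changed) for one compound."""
    return (ndc_smiles != prf_smiles, ndc_value != prf_value)


def summarise(flags):
    """Count each (structure_changed, value_changed) cell with its percentage."""
    counts = Counter(flags)
    total = sum(counts.values())
    return {cell: (n, round(100 * n / total, 1)) for cell, n in counts.items()}


# Rebuild the ELOGP block of Table 4.1 from its reported counts.
elogp_flags = (
    [(False, False)] * 8027 + [(True, False)] * 943 + [(True, True)] * 130
)
elogp_summary = summarise(elogp_flags)
print(elogp_summary[(True, True)])  # (130, 1.4)

# Success rates quoted in the text, as percentages of the 9,465 compounds.
print(round(100 * 9100 / 9465, 1), round(100 * 7461 / 9465, 1))  # 96.1 78.8
```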

The percentages of compounds that fell into each category are very similar to

the distributions obtained in Table 3.1 for both the HTS and PM datasets (Chapter 3,

Section 4). This confirms that the pattern of structure misrepresentation highlighted

by the STT in this set, a sample drawn from a wider chemical context, is not

significantly different from that found within agrochemical-related compound

collections.

The 943 compounds whose log P and solubility predictions were both

unchanged despite a change in structure could all be attributed to a nitro group

changing hybrid form only. Of the 784 compounds whose pKa prediction was unchanged

despite a change in structure, 747 could be attributed to the same cause. The remaining

37 compounds were all of substructure 21 (Figure 4.1) and underwent the same

tautomer change, with the pKa predictions for both tautomers being coincidentally the

same. Three compounds of the same tautomer class in the PM dataset, PL1003 (Kinetin),

PL1612 (Zeatin) and PL0083 (6-Isopentenylaminopurine) (Figure 3.16; Chapter 3,

Section 4), also showed the same effect.

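[Structure diagram: the two tautomers of substructure 21 (atom labels N, NH, N and NH, N, N, with A substituents) related by the STT transformation]

Figure 4.1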

Of the 130 compounds whose structures were changed by the STT and whose log P

and solubility predictions both changed, 25 underwent a non-tautomer

change in structure:

• Nitro group change of hybrid form (8 examples)

• Nitroso group change of hybrid form (6 examples)

• Protonation and / or deprotonation of heteroatoms only (11 examples)

The remaining 105 compounds underwent a change in tautomer. Of the 87

compounds whose structures were changed by the STT and whose pKa prediction also

changed, 86 were a subset of the 105 tautomeric compounds identified from the log P

predictions. The remainder, djp2181 (Figure 4.2), was also tautomeric but was not

included in the log P comparison because AlogP failed to give a value for its NDC

tautomer. This was due to it containing an Nsp3-Nsp3 bond, which as discussed in

Chapter 3, Section 6.2.1, is a common “problem” substructure for AlogP.

[Structure diagrams: djp2181 (NDC tautomer) and djp2181 (PRF tautomer)]

Figure 4.2

Analysis of log P prediction failures showed that this compound was the only

tautomeric one so affected. Analysis of the pKa prediction failures identified three

compounds where a change in structure had occurred. Two of these were tautomeric,

and both had already been positively identified from the log P prediction

comparisons.

In total this means 120 compounds underwent a change of tautomer. Table 4.2

shows a summary of the structural types and corresponding number of compounds

that were found, together with the change made by the STT. In addition to those

tautomer types already highlighted from the previous datasets, 7 additional classes

have been defined and are included along with the ACD/pKa “major” and “minor”

tautomer predictions for each substructure, assuming the A-groups are methyls in

each case.

No   NDC substructure            ACD/pKa      PRF substructure            ACD/pKa      Number of instances
     (atom labels in diagram)    prediction   (atom labels in diagram)    prediction   encountered
3    N, N, OH, A ×5              Minor 1      N, NH, O, A ×5              Minor 2      3
5    N, OH, Not OH ×2, A ×2      Minor        NH, O, Not OH ×2, A ×2      Major        5
9    N, N, N, OH, A ×4           Minor        NH, N, N, O, A ×4           CD Major     3
12   N, OH, Not OH, A ×3         Minor        NH, O, Not OH, A ×3         Major        14
14   N, N, OH, A ×3              Minor        NH, N, O, A ×3              Major        1
17   N, N, N, SH, A ×2           Minor        N, NH, N, S, A ×2           Major        29
19   N, N, SH, A ×3              Minor        N, NH, S, A ×3              Major        2
20   N, N, OH, A ×3              Minor        NH, N, O, A ×3              CD Major     21
21   NH, N, N, A ×2              CD Major 1   N, NH, N, A ×2              CD Major 2   15 + 14 *
35   N, N, SH, O, A ×3           Minor        NH, N, S, O, A ×3           Major        3
36   NH, N, N, SH, A             Minor        NH, NH, N, S, A             Major        1
37   N, N, N, OH, A ×2           Minor        N, NH, N, O, A ×2           Major        3
38   N, N, O, OH, A ×3           Minor        NH, N, O, O, A ×3           Major        3
39   NH, N, O, SH, A ×2          Minor        NH, NH, O, S, A ×2          Major        1
40   N, N, O, SH, A ×3           Minor        N, NH, O, S, A ×3           Major        1
41   NH, NH, O, A ×2             Major        N, NH, OH, A ×2             Minor        1 +

A = Any group (not H when attached to a heteroatom)
* = Compounds where structure changed but pKa prediction didn’t
+ = Identified by examining list of compounds where valid log P prediction comparison was not possible but where a change of structure was registered
ACD/pKa tautomer predictions (assuming A = Me):
“Minor” = Sole predicted minor tautomer
“Minor 1/2” = Predicted minor tautomers suggested independently of each other
“Major” = Sole predicted major tautomer
“CD Major 1/2” = Suggested conditions-dependent major tautomers of each other

Table 4.2: ACD/pKa major / minor tautomer predictions for example compounds of each substructure type identified from the IVDB dataset
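As a consistency check, the instance counts of Table 4.2 can be tallied against the total of 120 tautomeric compounds quoted in the text. The dictionary below simply transcribes the table; treating types 35 to 41 as the seven additional classes is an assumption based on their numbering.

```python
# Substructure type -> number of instances encountered (Table 4.2).
instances = {
    3: 3, 5: 5, 9: 3, 12: 14, 14: 1, 17: 29, 19: 2, 20: 21,
    21: 15 + 14,  # 14 of these changed structure but not pKa prediction
    35: 3, 36: 1, 37: 3, 38: 3, 39: 1, 40: 1, 41: 1,
}

total = sum(instances.values())
new_types = [number for number in instances if number >= 35]

print(total)           # 120
print(len(new_types))  # 7
```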

The additional tautomer types identified are largely similar in nature to those

already defined and involve prototropic shifts between O and N, or S and N atoms.

Type 41 provides the only case of all the types where the STT rules applied appear to

convert a “major” tautomer into a “minor” one.

5 Conclusions and further work

5.1 Conclusions

Tautomerism is a widely recognised phenomenon in heterocyclic chemistry

which has the potential to present major issues to computational tools that predict

physical properties such as lipophilicity and acid-base ionisation constants.

This project developed and tested a methodology for assessing tautomer

misrepresentation and its effect on the prediction of solubility (log Sw), lipophilicity

(log P) and acid-base ionisation constants (pKa). Two moderate-sized

(~1,300/~2,600) agrochemical-related test sets and a larger (~9,500) publicly

available set were used to do this. A Structure Transformation Tool (STT) identified

compounds drawn in a “wrong” Native Drawing Convention (NDC) tautomer and

converted them into a “right” form, the Physiologically Relevant Form (PRF), which is

the one considered most likely at pH 7.

Analysis of the datasets showed that the STT made no change to the structure

of 90% of compounds. Of the others, only 1-2% changed tautomer form. This

indicates that the tautomer misrepresentation issue is relatively minor and is not

significantly different for agrochemicals than for any other class of compounds. The

STT changed the predicted charge distribution at pH 7 for less than 1% of the

compounds in each test set. The charge predictions of only ~1% of compounds

changed when the pH range used to predict their corresponding pKa values was

narrowed from 0-14 to 2-10.

For compounds whose structure the STT changed, the absolute change to log

P predictions was typically in the range 0-2 with a mean value of ~1. For solubility

(log Sw) this range was 0-2 (mean ~1) and for pKa it was 0-4 (mean ~2.5). The

datasets contained approximately 40 distinct tautomer substructure types, most often

based on thiol- or hydroxypyridines, pyrimidines, pyrazines, 1,3,5-triazines,

imidazoles and 1,2,4-triazoles. The most common effect of the STT was to convert

each to a thione or ketone analogue. In the majority of cases the ACD/pKa tautomer

prediction tool confirmed that the STT turned “minor” tautomers of each

substructure into “major” ones.

A comparison of the predicted and measured log P data for the limited

numbers of tautomeric compounds with measured values showed that the mean

improvement in log P predictions due to the STT was 0.24 log units. This

represented an approximate 30% improvement in the proportion of predictions made

within 0.5 log units of measured log P values. For tautomeric compounds with pKa

measurements the mean improvement was 2.80 log units. In real terms this equates to

a far more substantial improvement in predictions, with 40% more being made

within 0.5 log units of measured values. These findings also validate the “minor” /

“major” tautomer predictions made by ACD/pKa.
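The two summary statistics used above can be sketched as follows. This is an illustration under stated assumptions: "improvement" is taken to be the reduction in absolute prediction error on going from the NDC to the PRF form, the function name is hypothetical, and the values are toy numbers rather than project data.

```python
def prediction_summary(measured, ndc_pred, prf_pred, window=0.5):
    """Compare NDC and PRF predictions against measured values.

    Returns the mean reduction in absolute error and the fraction of
    predictions within `window` log units for each form.
    """
    n = len(measured)
    ndc_err = [abs(p - m) for p, m in zip(ndc_pred, measured)]
    prf_err = [abs(p - m) for p, m in zip(prf_pred, measured)]
    mean_improvement = sum(a - b for a, b in zip(ndc_err, prf_err)) / n
    within_ndc = sum(e <= window for e in ndc_err) / n
    within_prf = sum(e <= window for e in prf_err) / n
    return mean_improvement, within_ndc, within_prf


# Toy values only (log units).
measured = [1.0, 2.0, 3.0, 0.5]
ndc_pred = [2.0, 2.25, 4.5, 0.5]
prf_pred = [1.25, 2.0, 3.25, 1.0]

print(prediction_summary(measured, ndc_pred, prf_pred))  # (0.4375, 0.5, 1.0)
```

The toy data include one compound whose PRF prediction is poorer than its NDC prediction, which is how a positive mean improvement can coexist with individual cases that get worse.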

Comparing property predictions with measured values showed that different

substructure types were affected by the STT in different ways. In fact for some

substructures, predictions for their PRFs were actually poorer than for their NDC

forms. In many of these cases intramolecular hydrogen bonding to a keto substituent

could explain why conventionally “minor” tautomers were stabilised as “major”

ones. Which tautomer gave the best predictions for compounds of a particular class

was not always clear-cut, however. In these situations the steric and electronic effects

of different substituents seemed to influence the balance of equilibria between

tautomers. Such exceptions show that the STT substructure definitions and

structure-changing rules are currently too generic and not appropriate in all

circumstances to which they are applied.

A series of structural features were identified as commonly causing the

various log P and pKa prediction tools to fail to predict a value. In particular

Nsp3-Nsp3 bonds (AlogP), net charges (ACDlogP) and simply “no values in range”

(ACD/pKa) were found to recur most often. For ~5% of compounds in one test set,

AlogP’s parsing of SMILES was found to change according to which SMILES

variant was used. For consistency, all SMILES were therefore canonicalised to

Daylight conventions before predictions were made.

The available CHI (Chromatographic Hydrophobicity Index) data for one of

the test sets identified 20 compounds where a “wrongly” drawn tautomer issue had

not been addressed by the STT. In these compounds a further 12 tautomer

substructure types were also identified. Of these, 4 tri-ketones (for example

Mesotrione) were particularly important omissions. Therefore, although the STT is

largely effective within its current configuration, it still lacks definitions for many

other important substructural types.

5.2 Further work

• The methodology was developed and applied to only medium-sized datasets.

It would therefore be beneficial to test larger (e.g. 100k) ones for tautomer

misrepresentation issues, especially those with measured data.

• The steric and/or electronic influence of substituents seems to be important in

determining the major / minor tautomer balance between structurally similar

compounds. A more detailed examination of their effects on predictions

would help clarify their role.

• Does the canonicalisation of SMILES improve or degrade log P predictions

significantly? Are the benefits of prediction consistency outweighed by

poorer predictions overall?

• The problems AlogP has with uncanonicalised SMILES and its issues with

particular substructures suggest it requires further development. It would also

be worthwhile investigating a more reliable atom-based tool to replace AlogP

within ELOGP. In the immediate term, an automatic canonicalisation of

SMILES by ELOGP would be beneficial.

• AlogP and ACDlogP currently handle pairs of resonance hybrids such as

nitroso and nitro inconsistently. Program development is required so that

hybrid pairs are always identified as equivalent representations, prediction

failure rates are reduced and prediction consistency is improved.

• There is considerable scope for expanding the structure-changing rules used

by the STT. This could include identifying configurations of existing

substructures where intramolecular hydrogen bonding can occur (Chapter 3,

Section 8.2.6.3) as well as introducing new substructure types such as tri-

ketones (Chapter 3, Section 9.1). A study of the effect of additional rules on

the property prediction benefits of the STT would also be important before

fully implementing them.

• There are alternative means of probing datasets for tautomeric compounds

missed by the STT. Unusually large differences between measured values and

property predictions could help identify new cases, for example.

• The implementation of an STT that performs the reverse transformations to

Leatherface would also help identify a wider range of “missing” tautomer

substructures, not just those already in the “right” form.

• Other types of measured data could also indicate a compound’s major

tautomer form when log P or pKa data is unavailable. For example 13C NMR

or IR spectroscopy will differentiate between “4-hydroxypyridine” and “4-

(1H)-pyridone” tautomers. Predicted vs. measured spectral comparisons

could provide a new approach to the tautomer misrepresentation issue.
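The suggestion of using unusually large differences between measured and predicted values to find tautomers missed by the STT could be prototyped along these lines. The function name, the compound identifiers and the threshold of 1.5 log units are all hypothetical choices made for illustration; a suitable threshold would have to be set from the error distribution of the prediction tool concerned.

```python
def flag_large_residuals(records, threshold=1.5):
    """Return (compound_id, absolute error) pairs above the threshold.

    `records` is an iterable of (compound_id, measured, predicted) tuples.
    Results are ordered with the largest discrepancy first.
    """
    flagged = [
        (compound_id, abs(predicted - measured))
        for compound_id, measured, predicted in records
        if abs(predicted - measured) > threshold
    ]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Toy records: (identifier, measured log P, predicted log P).
records = [
    ("cmpd-A", 2.0, 2.25),
    ("cmpd-B", 1.0, 3.5),
    ("cmpd-C", 4.0, 2.0),
]
candidates = flag_large_residuals(records)
print(candidates)  # [('cmpd-B', 2.5), ('cmpd-C', 2.0)]
```

Flagged compounds would only be candidates for manual inspection, since a large residual can arise from causes other than a misdrawn tautomer.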

References

Accelrys (2004) “Accelrys: DIVA”. Accelrys [Online]

http://www.accelrys.com/products/diva/index.html

[Accessed 15 August 2004]

ACD (2004a). “ACD/logP DB: Overview”. Advanced Chemistry Development /

Labs. [Online] http://www.acdlabs.com/products/phys_chem_lab/logp/


ACD (2004b). “ACD/pKa DB: Overview”. Advanced Chemistry Development /

Labs. [Online] http://www.acdlabs.com/products/phys_chem_lab/pka/


ACD (2004c). “Physico-Chemical Laboratory”. Advanced Chemistry Development /

Labs. [Online] http://www.acdlabs.com/products/phys_chem_lab/


AGENT 2 (2004). “AGENT 2.0: Advanced Creator of Tautomers”. Swiss Federal

Institute of Technology Zurich. [Online]

http://www.pharma.ethz.ch/pc/Agent2/ [Accessed 25 May 2004]

Beak, P., Fry, F.S., Lee, J. & Steele, F. (1976). “Equilibration studies - protomeric

equilibria of 2-hydroxypyridines and 4-hydroxypyridines,

2-hydroxypyrimidines and 4-hydroxypyrimidines, 2-mercaptopyridines and

4-mercaptopyridines, and structurally related compounds in the gas-phase”.

Journal of the American Chemical Society, 98 (1), 171-179.

Bradshaw, J.S., Chamberlin, D.A., Harrison, P.E., Wilson, B.E., Arena, G., Dalley,

N.K., Lamb, J.D., Izatt, R.M., Morin, F.G. & Grant, D.M. (1985). “Proton-

Ionizable Crown Compounds. 1. Synthesis, Complexation Properties, And

Structural Studies Of Macrocyclic Polyether Diester Ligands Containing A

Triazole Subcyclic Unit”. Journal of Organic Chemistry, 50 (17), 3065-3069.

Bradshaw, J.S., Nielson, R.B., P.-K. Tse, Arena, G., Wilson, B.E., Dalley, N.K.,

Lamb, J.D., Christensen, J.J. & Izatt, R.M. (1986). “Proton-Ionizable Crown

Compounds. 4. New Macrocyclic Polyether Ligands Containing A Triazole

Subcyclic Unit”. Journal of Heterocyclic Chemistry, 23 (2), 361-368.

Brandstetter, H., Grams, F., Glitz, D., Lang, A., Huber, R., Bode, W., Krell, H.W. &

Engh, R.A. (2001). “The 1.8-angstrom crystal structure of a matrix

metalloproteinase 8-barbiturate inhibitor complex reveals a previously

unobserved mechanism for collagenase substrate recognition”. Journal of

Biological Chemistry, 276, 17405-17412.

Briggs, G.G. (1997). “Predicting the uptake and movement of agrochemicals from

physical properties”. SCI Meeting on the uptake of agrochemicals and

pharmaceuticals, London, UK. December 1997, presentation.

Briggs, G.G., Desbordes, P. & Genix, P. (2002). “Are there limits to the physical

properties of fungicides?”. 10th IUPAC International Congress on the

Chemistry of Crop Protection, Basel, Switzerland. August 2002, poster.

Chiang, Y., Kresge, A.J. & Schepp, N.P. (1989). “Temperature coefficients of the

rates of acid-catalyzed enolization of acetone and ketonization of its enol in

aqueous and acetonitrile solutions - Comparison of thermodynamic

parameters for the keto-enol equilibrium in solution with those in the gas-

phase”. Journal of the American Chemical Society, 111 (11), 3977-3980.

Civcir, P.U. (2000). “A theoretical study of tautomerism of cytosine, thymine, uracil

and their 1-methyl analogues in the gas and aqueous phases using AM1 and

PM3 methods”. Journal of Molecular Structure – Theochem, 532, 157-159.

Civcir, P.U. (2001). “A theoretical study of 2,6-dithioxanthine in the gas and aqueous

phases using AM1 and PM3 methods”. Journal of Molecular Structure –

Theochem, 572, 5-13.

Clarke, E.D. (2001). “Physico-Chemical Profiling in the Agrochemical Industry”

Sirius User Meeting 2002, Measurement and Beyond, October 2002,

Brighton, presentation.

Clarke, E.D. & Delaney, J.S. (2003). “Physical and Molecular Properties of

Agrochemicals: An Analysis of Screen Inputs, Hits, Leads, and Products”.

Chimia, 57 (11), 731-734.

Clarke, E.D., Draper, E., Holliday, J.D. & Mullier, G.W. (2004). “ELOGP:

Improving the Prediction of Log P Octanol for Agrochemicals.”

UK-QSAR and Chemoinformatics Group Spring 2004 Meeting, April 2004,

Liverpool, poster.

Daylight (2004a). “SMILES Tutorial”. Daylight Chemical Information Systems Inc.

[Online] http://www.daylight.com/dayhtml/smiles/smiles-intro.html


Daylight (2004b). “Daylight Theory: SMARTS”. Daylight Chemical Information

Systems Inc. [Online]

http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html


Daylight (2004c). “CLOGP Reference Manual”. Daylight Chemical Information

Systems Inc. [Online] http://www.daylight.com/dayhtml/doc/clogp/


Daylight (2004d). “Daylight Theory: SMILES”. Daylight Chemical Information

Systems Inc. [Online]

http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html


Daylight (2004e). “SMILES Toolkit 4.8”. Daylight Chemical Information

Systems Inc. [Online] http://www.daylight.com/products/smiles_kit.html


Delaney, J.S. (2004). “ESOL: estimating aqueous solubility directly from molecular

structure”. Journal of Chemical Information and Computer Sciences, 44 (3),

1000-1005.

Devillers, D., Domine, D., Guillon, C. & Karcher, W. (2000) “Simulating

lipophilicity of organic molecules with a back-propagation neural network”.

Journal of Pharmaceutical Sciences, 87 (9), 1086-1090.

Draper, E. (2002). Improving the Effectiveness of Descriptor-Based Predictions.

MSc, University of Sheffield.

Duarte, H.A., Carvalho, S., Paniago, E.B. & Simas, A.M. (1999). “The importance of

tautomers in the chemical behavior of tetracyclines”. Journal of

Pharmaceutical Sciences, 88, 111-120.

Ghose, A.K., Pritchett, A. & Crippen, G.M. (1988). “Atomic physicochemical

parameters for 3-dimensional structure directed quantitative structure-

activity-relationships. 3. Modeling hydrophobic interactions”. Journal of

Computational Chemistry, 9 (1), 80-90.

Gillet, V.J., Willett, P. & Bradshaw, J. (1998). “Identification of Biological Activity

Profiles Using Substructural Analysis and Genetic Algorithms”. Journal of

Chemical Information and Computer Sciences, 38, 165-179.

Hallé, J.C., Lelievre, J. & Terrier, F. (1996). “Solvent effect on preferred protonation

sites in nicotinate and isonicotinate anions”. Canadian Journal of Chemistry,

74 (4), 613-620.

Hansch, C., Maloney, P., Fujita, T. & Muir, R. (1962). “Correlation of Biological

Activity of Phenoxyacetic Acids with Hammett Substituent Constants and

Partition Coefficients”. Nature, 194, 178-180.

Heinzelmann, W. & Märky, M. (1973). “Photosynthese von Dihydroazepinonen aus

2-Alkyl-indazolen”. Helvetica Chimica Acta, 56 (6), 1852-1858.

Heller, G., Buchwaldt, A., Fuchs, R., Kleinicke, W. & Kloss, J. (1925). Journal für

Praktische Chemie, 111, 1-74.

Kaliszan, R., Haber, P. & Snyder, L.R. (1999). “Estimation of Compound pKa and

log kw values by means of two Reversed-Phase HPLC Run”. HPLC ’99, May

1999, Granada, L/043.

Katritzky, A.R. & Lagowski, J.M. (1963). “Prototropic Tautomerism of

Heteroaromatic Compounds 1: General Discussion and Methods of Study”.

Advances in Heterocyclic Chemistry, 1, 311-338.

Katritzky, A.R., Elguero, J., Marzin, C. & Linda, P. (1976). “The Tautomerism of

Heterocycles”. Advances in Heterocyclic Chemistry, Supplement 1. New

York: Academic Press.

Katritzky, A.R. & Ghiviriga, I (1995). “An NMR-Study Of The Tautomerism Of

2-Acylaminopyridines”. Journal of the Chemical Society, Perkin

Transactions 2, (8), 1651-1653.

Katritzky, A.R., Ghiviriga, I., Oniciu, D.C., O’Ferrall, R.A.M. & Walsh, S.M.

(1997). “Study of the enol-enaminone tautomerism of alpha-heterocyclic

ketones by deuterium effects on C-13 chemical shifts”. Journal of the

Chemical Society, Perkin Transactions 2, (12), 2605-2608.

Katritzky, A.R., Denisko, O.V. & Elguero, J. (2000). “Prototropic Tautomerism of

Heterocycles: Heteroaromatic Tautomerism – General Overview and

Methodology”. Advances In Heterocyclic Chemistry, 76, 1-84.

Katritzky, A.R., Denisko, O.V., Stanovnik, B. & Tišler, M. (2001). “The

Tautomerism of Heterocycles: Six-Membered Heterocycles: Part 1, Annular

Tautomerism”. Advances In Heterocyclic Chemistry, 81, 253-303.

Kenny, P. (1999). “Handling Heterocyclic Tautomerism”. EuroMUG ‘99 meeting,

Cambridge, UK. 28-29 October 1999, presentation. [Online]

http://www.daylight.com/meetings/mug99/Kenny/kenny_mug99.htm

[Accessed 25 May 2004]

Knuth, D. (1998). The Art Of Computer Programming: Volume 2 – Seminumerical

Algorithms, Reading, MA: Addison-Wesley Longman, p. 142.

Lázár, L., Göblyös, A., Evanics, F., Bernáth, G. & Fülöp, F. (1998). “Ring-chain

tautomerism of 2-aryl-substituted imidazolidines”. Tetrahedron, 54 (44),

13639-13644.

Leach, A.R. & Gillet, V.J. (2003). An Introduction to Chemoinformatics. Kluwer

Academic Publishers: Dordrecht. pp. 19.

Leo, A.J. (1993). “Calculating Log P(oct) from structures”. Chemical Reviews, 93 (4),

1281-1306.

Leo, A.J & Hoekman, D. (2000). “Calculating log P(oct) with no missing fragments;

The problem of estimating new interaction parameters”. Perspectives in Drug

Discovery and Design, 18, 19-38.

Lipinski, C.A., Lombardo, F., Dominy, B.W. & Feeney, P.J. (1997). “Experimental

and computational approaches to estimate solubility and permeability in drug

discovery and development settings”. Advanced Drug Delivery Reviews, 23,

3-25.

MacNab, H. & Monahan, L.C. (1990). “Azepinones 4. Electrocyclic and

cycloaddition reactions of simple 1H-azepin-3(2H)-ones”. Journal of the

Chemical Society, Perkin Transactions 1, (11), 3169-3173.

MDL (2003) “CTFile Formats, Chapter 6: SDfiles” MDL Information Systems.

[Online] http://www.mdli.com/downloads/public/ctfile/ctfile.pdf


Morris, J.M. & Bruneau, P.P. (2000). “Prediction of Physicochemical Properties”. In:

Böhm, H.-J. & Schneider, G. (eds.), Virtual Screening for Bioactive

Molecules, Weinheim: Wiley-VCH. pp 33-58.

Oprea, T.I. (2000). “Property distribution of drug-related chemical databases”.

Journal of Computer-Aided Molecular Design, 14 (3), 251-264.

Pearlman, R.S., Khashan, R., Wong, D. & Balducci, R. (2002). “ProtoPlex: user-

control over tautomeric and protonation states”. Abstracts of Papers of the

American Chemical Society, 224, 232-COMP.

Pospisil, P., Ballmer, P., Folkers, G. & Scapozza, L. (2002). “Tautomerism in

nucleobase derivatives and their score in virtual screening to thymidine

kinase”. Abstracts of Papers of the American Chemical Society, 224, 211-

COMP.

Pospisil, P., Ballmer, P., Scapozza, L. & Folkers, G. (2003). “Tautomerism in

Computer-Aided Drug Design”. Journal of Receptors and Signal

Transduction, 23 (4), 361-371.

Sadowski, J. & Kubinyi, H. (1998). “A Scoring Scheme for Discriminating between

Drugs and Nondrugs”. Journal of Medicinal Chemistry, 41, 3325-3329.

Sadowski, J. (2002). “A tautomer and protonation pre-processor for virtual

screening”. Abstracts of Papers of the American Chemical Society, 224, 233-

COMP.

Sayle, R. & Delany, J. (1999). “Canonicalization and Enumeration of Tautomers”.

EuroMUG ‘99 meeting, Cambridge, UK. 28-29 October 1999, presentation.

[Online]

http://www.daylight.com/meetings/emug99/Delany/taut_html/sld001.htm

[Accessed 25 May 2004]

Tice, C.M. (2001). “Selecting the right compounds for screening: does Lipinski’s

Rule of 5 for pharmaceuticals apply to agrochemicals?”. Pest Management

Science, 57, 3-16.

Tice, C.M. (2002). “Selecting the right compounds for screening: use of surface-

area parameters”. Pest Management Science, 58, 219-233.

Tišler, M. (1959). Archiv der Pharmazie, 292, 90-97.

Tomlin, C.D.S. (ed.) (2000). The Pesticide Manual, 12th edition, Farnham, Surrey,

UK: British Crop Protection Council.

Trepalin, S.V., Skorenko, A.V., Balakin, K.V., Nasonov, A.F., Lang, S.A.,

Ivashchenko, A.A. & Savchuk, N.P. (2003). “Advanced exact structure

searching in large databases of chemical compounds”. Journal of Chemical

Information and Computer Sciences, 43, 852-860.

Tripos (2004). UNITY 4.4.1, Tripos Inc., 1699 South Hanley Rd., St. Louis,

Missouri, 63144, USA. [Online]

http://www.tripos.com/sciTech/inSilicoDisc/chemInfo/unity.html


Valkó, K., Bevan, C. & Reynolds, D. (1997). “Chromatographic hydrophobicity

index by fast-gradient RP HPLC: A high-throughput alternative to log P and

log D”. Analytical Chemistry, 69 (11), 2022-2029.

Weininger, D., Weininger, A. & Weininger, J.L. (1989). “SMILES 2. Algorithm for

Generation of Unique SMILES Notation”. Journal of Chemical Information

and Computer Sciences, 29 (2), 97-101.

Weis, A.L. & Vishkautsan, R. (1984). “Dihydropyrimidines 9. Preparation and

imine-enamine tautomerism of 4,6-diphenyl-1,2-dihydropyrimidine”.

Chemistry Letters, (10), 1773-1776.

Weis, A.L. & van der Plas, H.C. (1986). “Dihydropyrimidines - Synthesis, structure

and tautomerism”. Heterocycles, 24 (5), 1433-1455.

Weis, A.L., Frolow, F. & Vishkautsan, R. (1986). “Dihydropyrimidines 16. Stability

and enamine-imine tautomerism in 1,2-dihydropyrimidines and

2,5-dihydropyrimidines”. Journal of Organic Chemistry, 51 (24), 4623-4626.

Wheland, G.W. (1955). Resonance in Organic Chemistry, New York: Wiley.

pp. 98-100.

Wildman, S.A. & Crippen, G.M. (1999). “Prediction of Physicochemical Parameters

by Atomic Contributions”. Journal of Chemical Information and Computer

Sciences, 39, 868-873.

Willett, P., Barnard, J.M. & Downs, G.M. (1998). “Chemical similarity searching”.

Journal of Chemical Information and Computer Sciences, 38, 983-996.

Whitman, C.P. (1999). “Keto-Enol Tautomerism in Enzymatic Reactions”.

Comprehensive Natural Products Chemistry, 5, 31-50.

Yan, X., Day, P., Hollis, T., Monzingo, A.F. & Schelp, E. (1998). “Recognition and

interaction of small rings with the ricin A-chain binding site”. Proteins, 31,

33-41.

Zaleska, B., Ciez, D. & Falk, H. (1996). “Synthesis and properties of unique

mesoionic 1,3-thiazolium-4-olates”. Monatshefte für Chemie, 127 (12), 1251-

1257.
