NEW VIRTUAL SCREENING TOOLS FOR

MOLECULAR DISCOVERY

CHRISTOPHER R. CORBEIL

A thesis submitted to McGill University in partial fulfillment of the requirements of the

degree of Doctor of Philosophy

Department of Chemistry

McGill University

Montreal, Quebec, Canada

H3A 2K6

December, 2008

© Chris Corbeil, 2008

ABSTRACT

In the field of molecular discovery, virtually screening large libraries of

compounds has often proved more cost-efficient than traditional experimental

approaches. In fact, it has now become common practice thanks to the virtual screening

tools available to chemists in the pharmaceutical industry, specifically docking. Most

docking programs do not account for the dynamics associated with protein-ligand binding,

whether it is protein flexibility or the inclusion of displaceable water molecules.

FITTED1.0 was developed to include these two specific features and has been validated on

a testing set of 33 protein-ligand complexes. Further developments were needed to

increase the speed of FITTED to enable its application as a virtual screening tool. This

enhanced version, FITTED1.5, has been applied to the screening of the Maybridge library

against the HCV polymerase and revealed FITTED's ability to identify active substances. With this

and other successful applications of FITTED, a comparative study was performed against

other docking programs, with a specific interest in the effect of the ligand and protein

input conformation and the inclusion of bridging water molecules on the accuracy of

docking programs. All three had major effects on accuracy and led to suggestions on how

to better conduct comparative studies. In parallel, we applied our expertise in the virtual

screening area to the field of asymmetric catalyst development, which led to the creation of

ACE1.0. When creating a tool for predicting stereoselectivities, one has to describe the

transition state with great accuracy yet within a reasonable amount of time. To

tackle this problem, ACE creates the transition states from linear combinations of reactant

and product interactions. A genetic algorithm is then exploited as a conformational search

engine to optimize the TS structure. ACE has been applied to the Diels-Alder

cycloaddition and the proline-catalysed aldol reactions and has shown good correlation

between observed and predicted selectivities.

RÉSUMÉ

In the pharmaceutical field, virtual screening of large libraries of molecules is a

less costly alternative to high-throughput screening and is often at least as effective.

Indeed, the development of such tools, and more particularly of docking methods, has

allowed virtual screening to become common practice in the pharmaceutical industry.

However, most docking methods do not take into account the dynamics of protein/ligand

complexes, and more specifically protein flexibility and the presence of water molecules

required for optimal binding. With this in mind, FITTED1.0 was developed and validated

on a set of 33 protein/ligand complexes. FITTED1.0 thus makes it possible to model fully

flexible ternary protein/ligand/water complexes. Further developments were then needed

to increase its speed and enable its use for the screening of large libraries. This

improved version, FITTED1.5, was applied to the screening of the Maybridge library

against the hepatitis C virus polymerase and led to the discovery of two new inhibitors.

Following these very encouraging results, a comparative study was undertaken, aimed

specifically at evaluating the impact of the input data on the predictive power of the

most commonly used docking programs, including FITTED2.6. We thus demonstrated that the

presence of water and the starting conformations of the ligand and of the protein have a

major impact. In parallel, we drew on our expertise to develop a second virtual screening

tool, ACE1.0, this time applied to the screening of asymmetric catalysts. In the field of

asymmetric catalysis, we needed to predict the structure and potential energy of

transition states within a reasonable time. To do so, ACE creates transition state

structures by linear combination of the reactants and products of the reaction. A genetic

algorithm is then used to carry out an exhaustive conformational search and optimize

these structures. ACE was applied to two well-known reactions in organic chemistry (the

Diels-Alder cycloaddition and the aldol reaction) and demonstrated high predictive power.

ACKNOWLEDGEMENTS

First I would like to thank my PhD supervisor Dr. Nicolas Moitessier, for without

his mentorship and guidance I would not have matured into the scientist I am today. I

would also like to thank him for always reminding me that, although I am a computational chemist,

I must remember who my audience is: the organic or medicinal chemist.

Secondly I would like to thank all the members, past and present, of the

Moitessier Research Group for all their help, friendship and patience during my studies at

McGill. I would like to specifically thank Pablo Englebienne for always being there to

discuss any computational problem I had, from how to write better code to helping

me understand that deleting the registry of your computer is a bad thing. I would also like

to thank Janice and her Mom for providing baked goods every Monday and Wednesday

like clockwork. Cookies and cakes are always a good incentive to come into work.

I would also like to thank Chantal Marotte, Sandra Aerssen, Fay Nurse, Alison

McCaffrey, Karen Turner and Normand Trempe for aiding me in traversing the deep

waters that are the McGill Administration. Without these people, the Otto Maass

Chemistry Building would have fallen apart.

I am forever indebted to my wife, MaryAnne, whom I met during my studies at

McGill. I would like to thank her for her patience when I worked at night, on weekends

and whenever there was a problem with one of my computer programs. Without her I

would not have been able to survive the stresses associated with doing a PhD.

Lastly I am grateful for all the financial assistance from the CIHR Chemical

Biology Fellowship program, ViroChem Pharma, the Robert Zamboni Travel Award,

Pall Dissertation Award and the Udho Parsini Diwan Prize. I am also thankful to RQCHP

for somehow letting me receive over 90 years of CPU time during my PhD.

CONTRIBUTION OF CO-AUTHORS

This thesis consists of an introduction, which contains one review draft (Chapter

1.1), three published articles (Chapters 2, 3 and 4) and one draft that has been submitted

for publication (Chapter 5). All the work described in these manuscripts has been carried

out as part of my research for the degree of Doctor of Philosophy in Chemistry.

All the manuscripts have co-authors, whose contributions are described below. Dr.

Nicolas Moitessier has been my supervisor throughout my doctoral degree and is a

co-author for each manuscript.

Chapter 2: I wrote the code for the FITTED1.0 suite and conducted all docking

experiments. I prepared most of the testing set except for HIV-1 protease, which P.

Englebienne prepared.

Chapter 3: I wrote the code improvements for FITTED1.5, conducted the

comparison with FITTED1.0 and conducted the docking and virtual screening studies of

HCV polymerase. P. Englebienne prepared the Maybridge database for virtual screening.

C. G. Yannopoulos, L. Chan, S. K. Das, D. Bilimoria and L. Heureux are responsible for

the biological evaluation of the hit compounds found from the virtual screening.

Chapter 4: I performed all experiments reported and made the improvements to

FITTED.

Chapter 5: I wrote all the code for ACE1.0 and selected the testing set. J. A.

Schwartzentruber optimized the conjugate gradient minimization routine within ACE. S.

Thielges tested ACE and performed the validation studies.

TABLE OF CONTENTS

Title Page ....................................................................................................................... i

Abstract .......................................................................................................................... ii

Résumé ........................................................................................................................... iii

Acknowledgements .......................................................................................................... iv

Contribution of Co-Authors ........................................................................................... v

Table of Contents ........................................................................................................... vi

List of Figures ................................................................................................................ ix

List of Tables ................................................................................................................. xii

List of Equations ............................................................................................................ xiv

Abbreviations ................................................................................................................. xv

Chapter 1: Introduction .............................................................................................. 1

1.1 The challenge of modeling reality in the docking of small

molecules to biological targets .................................................................... 2

Abstract .................................................................................................. 2

Introduction ............................................................................................ 3

Ligand Flexibility ................................................................................... 4

Ring Flexibility ...................................................................................... 13

Protein Flexibility ................................................................................... 16

Predicting Displaceable Key Bridging Water Molecules ....................... 23

Predicting Metal Geometry ................................................................ 26

Conclusion .............................................................................................. 26

1.2 Application of computational techniques to asymmetric catalyst

development ................................................................................................ 28

Quantum Mechanics Predictions of Stereomeric Excess ....................... 29

Application of Virtual Screening Techniques to the Field of Asymmetric

Catalyst Development ............................................................................ 34

Conclusion ............................................................................................. 45

1.3 Outline of Thesis .......................................................................................... 46

1.4 References .................................................................................................... 48

Chapter 2: Docking Ligands into Flexible and Solvated Macromolecules. 1.

Development and Validation of FITTED1.0 .................................................................... 71

Abstract .............................................................................................................. 72

Introduction ........................................................................................................ 73

Theory and Implementation ............................................................................... 74

Results and Discussion ....................................................................................... 88

Conclusion .......................................................................................................... 109

Experimental Section ......................................................................................... 110

Preparation of Training Set .................................................................... 111

Docking Study ........................................................................................ 113

Acknowledgements ............................................................................................ 113

References .......................................................................................................... 114


Chapter 3: Docking Ligands into Flexible and Solvated Macromolecules. 2.

Development and Application of FITTED1.5 to the Virtual Screening of Potential

HCV Polymerase Inhibitors ........................................................................................... 119

Abstract .............................................................................................................. 120

Introduction ........................................................................................................ 121


Validation of FITTED1.5 ..................................................................................... 128

Application to the Screening of a Library against HCV Polymerase ................ 132

Conclusion .......................................................................................................... 137



References .......................................................................................................... 140


Chapter 4: Docking Ligands into Flexible and Solvated Macromolecules. 3.

Impact of Input Ligand Conformation, Protein Flexibility and Water Molecules on

the Accuracy of Docking Programs ............................................................................... 145

Abstract .............................................................................................................. 146

Introduction ........................................................................................................ 147


Results and Discussion ....................................................................................... 158

Conclusion .......................................................................................................... 171



References .......................................................................................................... 177

Chapter 5: Toward a Computational Tool Predicting the Stereochemical Outcome

of Asymmetric Reactions. Development and Application of a Rapid and Accurate

Program Based on Organic Principles ........................................................................... 185

Communication .................................................................................................. 186

References .......................................................................................................... 195

Chapter 6: Conclusion, Future Work and Contributions to Knowledge ...................... 199

Conclusions ........................................................................................................ 199

Future Work ....................................................................................................... 200

Contributions to Knowledge .............................................................................. 201

References .......................................................................................................... 202

Appendix A: Copyright Waivers .................................................................................. 205

Appendix B: Supporting information for Docking Ligands into Flexible and

Solvated Macromolecules. 1. Development and Validation of FITTED1.0 ..................... 209

Appendix C: Supporting information for Docking Ligands into Flexible and

Solvated Macromolecules. 2. Development and Application of FITTED1.5 to the

Virtual Screening of Potential HCV Polymerase Inhibitors .......................................... 221

Appendix D: Supporting information for Docking Ligands into Flexible and

Solvated Macromolecules. 3. Impact of Input Ligand Conformation, Protein

Flexibility and Water Molecules on the Accuracy of Docking Programs ..................... 229

Appendix E: Supporting information for Toward a Computational Tool Predicting

the Stereochemical Outcome of Asymmetric Reactions. 2. Development and

Application of a Rapid and Accurate Program Based on Organic Principles ............... 233

Appendix F: FITTED2.6 User Manual ........................................................................... 241

LIST OF FIGURES

Figure 1.1 - Matching algorithm ................................................................................... 5

Figure 1.2 - Incremental construction ........................................................................... 8

Figure 1.3 - Genetic Algorithm ..................................................................................... 10

Figure 1.4 - Monte Carlo ............................................................................................... 12

Figure 1.5 - Possible methods to include protein flexibility ......................................... 17

Figure 1.6 - Proposed mechanisms for proline catalysed aldol reaction....................... 30

Figure 1.7 - Example of bicyclic analogue studied by Shinisha et al. .......................... 31

Figure 1.8 - Proposed mechanisms for osmium tetraoxide asymmetric

dihydroxylation of alkenes ............................................................................................. 31

Figure 1.9 - Imidazolidinone catalysed Diels-Alder Reaction ....................................... 32

Figure 1.10 - Mechanism for Mannich reaction............................................................ 33

Figure 1.11 - QM/MM study of Sharpless dihydroxylation. .......................................... 34

Figure 1.12 - Palladium catalysed allylation ................................................................. 35

Figure 1.13 - bis(oxazoline)copper(II) catalysed Diels-Alder ...................................... 36

Figure 1.14 - QSSR using Quantum Mechanical Interaction Field analysis in the

design of chiral amino alcohols for alkyl addition to aldehydes. .................................. 37

Figure 1.15 - Hydroboration of alkenes ........................................................................ 38

Figure 1.16 - Dihydroxylation of xylose ....................................................................... 39

Figure 1.17 - Sharpless dihydroxylation catalyst studied for optimization and

validation of genetic algorithm. .................................................................................. 40

Figure 1.18 - Reactions studied with reverse docking................................................... 41

Figure 1.19 - Mechanism of the Horner-Wadsworth-Emmons reaction ................................. 43

Figure 1.20 - Mixing of two ground states to find transition state ................................ 43

Figure 1.21 - All energies are calculated on the model PES then projected onto

the true PES using a mixing term .................................................................................. 44

Figure 1.22 - Summary of methods used to find transition states ................................. 45

Figure 2.1 - The binding site of 1d8m .......................................................................... 78

Figure 2.2 - Chromosome describing a protein/water/ligand complex. ........................ 80

Figure 2.3 - Generation of the initial population using a series of filters. .................... 81

Figure 2.4 - 4 possible pairs of children generated after application of two one

point cross-over operations. A horizontal bar represents a gene. .................................. 84

Figure 2.5 - Interaction energy between a methanol molecule and an explicit

water molecule or a displaceable water ......................................................................... 86

Figure 2.6 - Bridging water molecules and flexible binding site residues in TK /

inhibitor complexes. ....................................................................................................... 89

Figure 2.7 - Bridging water molecules and flexible binding site residues in

FXa / inhibitors complexes.. .......................................................................................... 90

Figure 2.8 - Docked and crystal structure of 1e2p ligand.. ........................................... 92

Figure 2.9 - Crystal structure and proposed docked model for the 1d8m complex. ..... 109

Figure 3.1 - Selected HCV polymerase inhibitors. ....................................................... 122

Figure 3.2 - Helix T perturbation upon inhibitor binding. Blue and grey ribbon

representations are from 2 different X-ray complexes. 20, 21 ...................................... 123

Figure 3.3 - Consensus docking.. ................................................................................. 125

Figure 3.4 - Binding site pharmacophore for HCV polymerase. .................................. 125

Figure 3.5 - Selected known actives. ............................................................................ 133

Figure 3.6 - Funnel approach implemented in FITTED ............................................... 134

Figure 3.7 - Active compounds recovered. ................................................................... 136

Figure 4.1 - Conformation of 1nfu ligand ..................................................................... 151

Figure 4.2 - FITTED 1.5 vs. FITTED 2.6 chromosome and the various

docking modes ............................................................................................................... 152

Figure 4.3 - Example of the corner flap approach converting a boat

conformation to a chair .................................................................................................. 153

Figure 4.4 - The assumption of torsion equivalencies. ................................................. 153

Figure 4.5 - Representation of the interaction sites found for 1bwi.............................. 155

Figure 4.6 - Schematic of the generation of the initial population within FITTED2.6. 156

Figure 4.7 - Schematic of the evolution cycle of FITTED 2.6. ..................................... 158

Figure 4.8 - Accuracy vs. ligand and protein conformations. ....................................... 164

Figure 4.9 - Self-docking vs. cross-docking for protein. .............................................. 167

Figure 4.10 - Accuracy and water molecules in self-docking experiments. ................. 168

Figure 4.11 - Protein class and accuracy on cross-docking experiments. ..................... 169

Figure 4.12 - Accuracy of program with OMEGA-generated structures ...................... 171

Figure 5.1 - General synthetic scheme and representative dienophiles 1a-f and

dienes 2a-c used in the validation study......................................................................... 188

Figure 5.2 - Predicted vs. observed diastereomeric excesses for 44

Diels-Alder reactions. ..................................................................................................... 189

Figure 5.3 - General synthetic scheme and representative catalysts (6a-e) and

aldehydes (5a-d) used in the validation study. ............................................................... 191

Figure 5.4 - Predicted vs. observed diastereomeric excesses for 17 selected cases.. .... 192

Figure 5.5 - Predicted TS structure for the reaction involving 4, 5c and 6a.. ............... 192

Figure 5.6 - ACE predictions and DFT predictions vs. observed diastereomeric

excesses for 4 selected cases .......................................................................................... 193

Figure E.1 - Data represented as Entry # versus ΔΔG. ................................................. 238

Figure E.2 - Data represented as Entry # versus ΔΔG. ................................................. 239

LIST OF TABLES

Table 2.1 - Self-docking – HIV-1 protease inhibitors. .................................................. 97

Table 2.2 - Self-docking – Thymidine kinase inhibitors. .............................................. 98

Table 2.3 - Self-docking – Factor Xa, trypsin and MMP-3 inhibitors. .......................... 99

Table 2.4 - Cross-docking and docking to multiple conformations -

HIV-1 protease inhibitors. ............................................................................................. 100

Table 2.5 - Cross-docking and docking to multiple conformations –

Thymidine kinase inhibitors. ......................................................................................... 101

Table 2.6 - Cross-docking and docking to multiple conformations –

Factor Xa, trypsin and MMP-3 inhibitors. ..................................................................... 102

Table 2.7 - Docking to flexible proteins - HIV-1 protease inhibitors. .......................... 103

Table 2.8 - Docking to flexible proteins - thymidine kinase inhibitors. ....................... 104

Table 2.9 - Docking to flexible proteins – Factor Xa, trypsin and MMP-3 inhibitors .. 105

Table 2.10 - Docking accuracy (%): rigid proteins. ...................................................... 106

Table 2.11 - Docking accuracy (%): flexible proteins. ................................................. 106

Table 3.1 - Comparison of FITTED 1.0 with FITTED 1.5 ........................................... 129

Table 3.2 - Docking of HCV polymerase inhibitors to the allosteric site

with FITTED 1.5. ........................................................................................................... 130

Table 3.3 - Docking of HCV polymerase inhibitors to the catalytic site. ..................... 131

Table 3.4 - Focused libraries based on MatchScore > 75 and RankScore as indicated 135

Table 4.1 - Testing set of ligand / protein complexes. .................................................. 159

Table 4.2 - Comparison of success rates of FITTED versions 1.0, 1.5 and 2.6

using the “Dock” Docking mode. .................................................................................. 161

Table 4.3 - Comparison of time and number of runs required for various versions

of FITTED when the “Dock” docking mode is selected for rigid protein docking. ..... 161

Table 4.4 - Abbreviations used in Figure 4.8 ................................................................ 165

Table 4.5 - List of ligands used to define protein binding sites .................................... 174

Table B.1 - HIV-1 Protease mono-alcohol inhibitors ................................................... 209

Table B.2 - HIV-1 Protease diol inhibitors ................................................................... 210

Table B.3 - Thymidine Kinase inhibitors ...................................................................... 211

Table B.4 - Factor Xa inhibitors. ................................................................................... 212

Table B.5 - Trypsin inhibitors. ...................................................................................... 213

Table B.6 - Stromelysin-1 inhibitors. ............................................................................ 214

Table C.1 - Self-docking – HIV-1 protease. ................................................................. 221

Table C.2 - Self-docking – Thymidine kinase inhibitors. ............................................. 222

Table C.3 - Self-docking – Factor Xa, trypsin and MMP-3 inhibitors. ........................ 223

Table C.4 - Docking to flexible proteins - HIV-1 protease inhibitors. ......................... 224

Table C.5 - Docking to flexible proteins - thymidine kinase inhibitors........................ 225

Table C.6 - Docking to flexible proteins – Factor Xa, trypsin and

MMP-3 inhibitors. .......................................................................................................... 226

Table C.7 - Docking accuracy – FITTED 1.0 vs. FITTED 1.5. ......................................... 227

Table D.1 - Accuracy of the 6 docking programs using various conditions and

self-docking experiments with dry protein. ................................................................... 229


self-docking experiments with proteins with waters. .................................................... 230


cross-docking experiments with dry proteins. ............................................................... 231


cross-docking experiments with dry proteins. ............................................................... 232

LIST OF EQUATIONS

Equation 1.1 – EVB model Energy .............................................................................. 44

Equation 1.2 – EVB equation for projecting the model PES onto the true PES .............................. 44

Equation 2.1 – Water Switching function .................................................................... 85

Equation 2.2 – Probability of mutation of water .......................................................... 87

Equation 3.1 – Sphere weight ....................................................................................... 125

Equation 3.2 - MatchScore ........................................................................................... 125

Equation 4.1 – Calculation of minimum MatchScore .................................................. 157

Equation 5.1 – Linear combination of reactant and product ........................................ 187

LIST OF ABBREVIATIONS

ENM Elastic Network Model

EVB Empirical Valence Bond

GA Genetic Algorithm

HTS High Throughput Screening

iGluR2 Ionotropic Glutamate Receptor 2

MCMM Multi Configurational Molecular Mechanics

MCSS Multiple Copy Simultaneous Search

MD Molecular Dynamics

MM Molecular Mechanics

NEB Nudged Elastic Band

NMA Normal Mode Analysis

PDB Protein Data Bank

PES Potential Energy Surface

PSO Particle Swarm Optimization

QM Quantum Mechanics

QSAR Quantitative Structure Activity Relationship

QSSR Quantitative Structure Selectivity Relationship

RBD Rigid Body Docking

RMSD Root Mean Square Deviation

SPE Stochastic Proximity Embedding

TS Transition State

TSFF Transition State Forcefields

VS Virtual Screening

CHAPTER 1

CHAPTER ONE

1. INTRODUCTION

Molecular discovery is in essence the search for novel molecules which would

perform a given task. With the increasing pressure from modern society to perform

greener, safer chemistry with rapid results, there has been a push for alternative methods

and technologies for molecular discovery, one alternative being computational methods.

These methods would allow a chemist to assess and develop many ideas and concepts.

Computational methods have found a home in the pharmaceutical industry where this

pressure is not only applied by society but also by the need to increase profits.1-3 There

have been many successful uses of computational methods4,5 in the field of drug design

and development, yet this success has not translated to many other fields of chemistry

such as asymmetric catalyst development. Herein, the state of the art in docking-based

virtual screening methods and the application of virtual screening techniques to the field of

asymmetric catalyst development are discussed.

1.1 THE CHALLENGE OF MODELING REALITY IN THE DOCKING OF

SMALL MOLECULES TO BIOLOGICAL TARGETS

ABSTRACT

From virtual screening to understanding the binding mode of novel ligands, docking

methods are being increasingly used at multiple points in the drug discovery pipeline.

Unfortunately, in many cases, the accuracy of docking programs is greatly affected by the

amount of information given to them. To improve the docking accuracy, developers have

been moving towards the superior modeling of the dynamics involved in the protein-

ligand binding process. The problem of modeling reality can be broken down into several

factors including: 1) the modeling of ligand flexibility, 2) the modeling of receptor

flexibility and 3) the modeling of bridging water molecules. Other factors such as metal

coordination should also be considered. Each of these problems requires a separate or

combined conformational search technique. In this review, we will discuss the current

progress in the development of search engines for modeling ligand flexibility, including

cyclic portions, followed by recent progress in considering receptor flexibility and

bridging water molecules and finally the inclusion of metal coordination geometry in

docking.

INTRODUCTION

Due to the advances in crystallography and nuclear magnetic resonance

spectroscopy, there is an ever-expanding knowledge of structural information for

potential therapeutic targets. This increase of knowledge has added pressure within the

drug discovery community to provide new drugs in a more time- and cost-effective

manner. This pressure has accelerated the acceptance and integration of computer-aided

drug design methods within the toolkit of chemists in the pharmaceutical industry.1-3

These increasingly accurate tools provide viable alternatives to traditional experimental

approaches such as high throughput screening. Nowadays, computational techniques can

be found in many aspects of the drug discovery pipeline, particularly in the field of lead

discovery with the most popular virtual screening (VS) methods.6 In the last decade, there

have been a number of VS success stories4, 5 yet there have been only a few reported

studies directly comparing VS with experimental screening campaigns.7-9 Interestingly, in

these latter cases, VS provided more fruitful results than traditional experimental

approaches.

There is a plethora of VS methods available to medicinal chemists today ranging

from ligand-based approaches10 including QSAR,11 ligand similarity searching12, 13 and

ligand pharmacophore screening14 to protein structure-based approaches such as docking

programs.15 Virtually docking ligands to biological targets is one of the more popular

techniques for VS. This popularity is in large part due to the ever-increasing number of

protein crystal structures publicly available within the Protein Data Bank,16 not to mention

the large number of proprietary structures within pharmaceutical companies.

One of the first docking studies was reported by Levinthal et al. who performed a

protein-protein docking simulation to predict possible conformations of hemoglobin

fibers.17 This work then inspired the development of the first small molecule / protein

docking tool, namely DOCK.18 In order to reduce the computational cost associated with

the conformational sampling, both these studies only allowed for the optimization of the

relative orientation and/or translation of the two molecules being assembled. These early

methods perform what is known as rigid body docking (RBD).18-20 With the exponential

increase in computational power, protein-ligand docking programs have evolved into

“flexible” docking programs which incorporate the flexibility of ligands. Although these

second generation programs optimized the conformation, orientation and translation of

the ligand, they still treated the protein as a rigid object.

While a number of these programs were reported, comparative studies have often

only evaluated their relative accuracy in predicting the binding mode of known ligands21-27

and only recently their ability to virtually screen libraries of small molecules and

identify active compounds from these libraries.21, 27-32 Many of these studies showed that

docking the ligand back to its native protein structure (i.e., co-crystallized with this

specific ligand), also referred to as self-docking, was quite successful. However when

docking to another protein structure (referred to as cross-docking), the accuracy of most

of the programs significantly dropped.33-35 This drop has been attributed to the rigid

protein model used by these programs (lock and key model). In reality, protein/ligand

assemblies in cells are complex dynamic multiple component systems which encompass

numerous variables such as protein flexibility, displaceable bridging water molecules,

metal coordination and many more.

Even with the advances in the speed of modern day computers a systematic search

of the entire conformational space available to the protein and ligand during docking

remains intractable. Gehlhaar et al.36 surmised that there may be up to 10^30 possible

solutions for one of the complexes that they were examining. If it were possible to

evaluate 1 million distinct solutions per second, it would still take over a trillion years to

perform a systematic search on a current CPU. Therefore modeling reality requires the

development of docking algorithms able to search the conformational space quickly and

efficiently, in order to locate the global solution on the multidimensional binding free

energy hypersurface.
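
The order of magnitude quoted above can be checked with a few lines of arithmetic. The sketch below assumes a constant evaluation rate and a Julian year of 365.25 days; the function name is illustrative only.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # Julian year, an assumption of this sketch


def exhaustive_search_years(n_solutions, evals_per_second):
    """Years needed to score every candidate solution once at a constant rate."""
    return n_solutions / evals_per_second / SECONDS_PER_YEAR


# 10^30 candidate solutions scored at one million per second
years = exhaustive_search_years(1e30, 1e6)
print(f"{years:.1e} years")
```

The result is on the order of 10^16 years, far beyond the trillion-year (10^12) figure quoted above.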

With the growing popularity of docking programs, there was a need for a technical

review of the literature covering the various methods available to model the binding of

protein-ligand complexes. Herein, we review the search algorithms developed over the

last twenty years or so to address ligand and protein flexibility, bridging water molecules

and metal coordination.

LIGAND FLEXIBILITY

Two major criteria for docking performance are that the conformational search of the

ligands must be both accurate and time-efficient. Unfortunately, these two factors are

inversely related and should be properly balanced. Throughout the

years, many conformational search algorithms have been developed to assess the

flexibility of the ligand. The first instance of flexible ligand docking appeared in 1985

when Ghose and Crippen37 used a distance geometry method, followed by Goodsell and

Olson38 who used a simulated annealing approach to account for ligand flexibility.

Nowadays, most docking programs incorporate one or more of the following four ligand

conformational search algorithms: shape complementarity, incremental construction,

Monte Carlo and genetic algorithms. With these techniques, only the acyclic portions of

the ligand are considered flexible. Searching the conformational space of rings

dramatically increases the search space and necessary computational time. Below are

described the various approaches illustrated with the most popular programs.

Shape Complementarity. Some of the most widely used conformational search algorithms

in small molecule / protein docking rely on shape complementarity techniques. These methods

evaluate how well the ligand orientation and translation (referred to as a pose) match with

the protein binding site. These matches can be defined in terms of molecular interactions

as in DOCK,18 FITTED,39 FlexX40 and Surflex,41, 42 or geometrical matching as in SHEF.43

Figure 1.1 - Matching algorithm.

When matching ligand and protein molecular interacting groups, most programs

work in a similar manner (Figure 1.1). First, the generation of an initial conformation of

the ligand is followed by the creation of a ligand and protein pharmacophore. A subset of

ligand points is then chosen from which a distance matrix is built. This latter step is

repeated for a subset of protein points. These two distance matrices are then compared to

determine a match between a subset (typically 3 or 4) of protein points and a subset of

ligand points. Once a match is found, a translation/rotation matrix is calculated to overlay

the best matching points in the same frame of reference. This matrix is next applied to the

ligand conformation hence positioning the ligand in the protein binding site.
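
The distance-matrix comparison described above can be sketched as follows. This is a minimal illustration, not the implementation of any of the cited programs: the point coordinates, the tolerance `tol` and the function names are assumptions, and the final translation/rotation step is omitted.

```python
import itertools
import math


def distance_matrix(points):
    """Pairwise Euclidean distances between 3D interaction points."""
    return [[math.dist(p, q) for q in points] for p in points]


def match_triangles(ligand_pts, protein_pts, tol=0.5):
    """List (ligand_triplet, protein_triplet, mismatch) sorted by mismatch.

    A pair of triplets matches when each of the three corresponding side
    lengths differs by at most `tol` (an illustrative tolerance, in angstroms).
    """
    d_lig = distance_matrix(ligand_pts)
    d_prot = distance_matrix(protein_pts)
    matches = []
    for a, b, c in itertools.combinations(range(len(ligand_pts)), 3):
        # Permutations test every assignment of ligand points to protein points.
        for p, q, r in itertools.permutations(range(len(protein_pts)), 3):
            diffs = (
                abs(d_lig[a][b] - d_prot[p][q]),
                abs(d_lig[b][c] - d_prot[q][r]),
                abs(d_lig[a][c] - d_prot[p][r]),
            )
            if max(diffs) <= tol:
                matches.append(((a, b, c), (p, q, r), sum(diffs)))
    return sorted(matches, key=lambda m: m[2])


# Toy data: the protein points are the ligand points shifted by (10, 10, 10),
# plus one distant point that should never take part in a match.
ligand = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
protein = [(10.0, 10.0, 10.0), (13.0, 10.0, 10.0), (10.0, 14.0, 10.0), (30.0, 30.0, 30.0)]
matches = match_triangles(ligand, protein)
```

Because only side lengths are compared, a triangle and its mirror image are indistinguishable, which is why three points alone cannot encode the chirality of the pocket.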

However, more than a single match can be found for a single ligand conformation.

Thus, comparing multiple subsets of ligand and protein matrices can yield a series of

probable matches and determining the best match can be difficult. The determination of

this best match is where most programs vary. For instance, an early version of DOCK18

first prepares a list of pairs by systematically pairing each ligand point with all of the

protein points. DOCK then selects one ligand point and searches for another ligand point

that is within a cut-off distance to the first ligand point. It then selects one of the

protein points paired with the first ligand point and searches for a second protein point

whose distance from it is similar to the distance between the two ligand points. This is repeated

with all the points in the list until no new pair can be added. The best match is then

defined as the list with the most matching points. FITTED,39 FlexX,40 and others44, 45 use

3-point matching algorithms. For each newly generated conformation of the ligand, one

or more triangles are selected from the ligand and compared to the triangles generated from

the protein interaction sites (e.g., hydrogen bond donor groups). The ligand triangle /

protein triangle pair which is closest in size is considered the best match. It is also

possible to use more than three points to account for the chirality of the pocket and

ligands.46 Surflex41, 47 uses a morphological similarity score which measures the matches

between possible protein interactions (referred to as protomol within Surflex) and the

ligand. A newer version of DOCK48 implemented a bipartite graph matching algorithm

which is similar to a three point matching algorithm except that each vertex of the protein

triangle is matched one at a time. A point of the ligand triangle is selected and translated

to overlay with a matching protein point. A second ligand triangle point is then

superimposed onto a matching protein point whose distance from the first protein point is

similar to the distance between the two ligand points. This is repeated

with the last ligand triangle point.

Geometrical matching methods evaluate the fit between the ligand shape and

binding site cavity. The method developed by Yamagishi et al.49 models the pocket and

ligand as sets of spherical harmonic functions, which can be described as contour lines

resembling human fingerprints. It then matches the ligand to the

protein by using fingerprint matching techniques. eHiTS50, 51 represents the ligand and

pocket as a series of polyhedra. An interaction type (e.g., hydrophobic, H-bond donor) is

then assigned to each polyhedron vertex. The match between the ligand and protein

vertices is calculated by a knowledge based scoring function. This function evaluates the

energy of the system based on the distance and relative orientation (angles and torsions)

of surface point pairs on the ligand and the receptor. The scoring is based on a statistical

collection of data from high resolution PDB complexes.

Incremental Construction. Incremental construction algorithm-based programs build the

molecule on-the-fly within the binding pocket therefore addressing ligand flexibility. This

technique has been widely implemented in de novo design programs, which propose

potential novel tight binding ligands.52-55 Incremental construction techniques have also

been developed for docking small molecules to proteins (see Figure 1.2). In this context,

the ligand is first broken into multiple rigid fragments typically at rotatable bonds and

ring systems. An anchor fragment is selected and first docked. The adjacent connecting

fragment is subsequently added and this process is repeated until the entire molecule is

reconstructed. Even though most programs have this basic structure there are multiple

variations of incremental construction methods.

Figure 1.2 - Incremental construction.

Another algorithm in DOCK56, 57 fragments the molecule at all rotatable bonds,

identifies an anchor (i.e., a rigid fragment) then places it using a matching algorithm. The

N best poses for this fragment are selected for the next stage. It is possible for DOCK to

identify multiple anchors requiring that the method described below be repeated for each

anchor. The program adds the adjacent fragment of the molecule, creating multiple

conformations of the fragment around the newly formed bond. The conformations are

then pruned to only keep the N best scoring conformations to reduce the combinatorial

explosion that would occur if all of them were kept. FlexX40 also uses a similar algorithm

which they refer to as a greedy algorithm. Within SLIDE,58, 59 fragments are rotated on-

the-fly to remove undesired clashes between the ligand under construction and the protein

instead of archiving multiple conformations. When building up the ligand, Surflex42, 47

adds fragments in a conformation which maximizes the morphological similarity between

ligand and the protomol. Multiple poses are created with the best scoring poses

undergoing gradient based optimization. Incremental construction assumes that the rest of

the molecule (not added yet) does not affect the conformation of a fragment. Surflex

overcomes this major assumption by including a whole molecule approach to the docking

of fragments.42 When performing the conformational search of a fragment, the rest of the

ligand is still present in its initial input conformation. Thus Surflex quickly removes

fragment conformations which are acceptable on their own, but clash when the ligand

is rebuilt. Another program, eHiTS,50, 51 uses a novel take on the incremental construction

algorithm. Unlike the previous algorithms, all rigid fragments are simultaneously and

independently docked and scored. The algorithm then attempts to reconnect the best

scoring fragment poses into the complete ligand. For this purpose, the distances between

the connecting atoms of the rigid fragments are calculated. If the distance corresponds to

the length of a flexible connecting chain, the optimal conformation of this chain is

calculated and the fragment and connecting chain are linked.
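
The build-and-prune loop described above for DOCK and FlexX (add a fragment, enumerate its conformations, keep the N best) can be sketched as a small beam search. This is a toy illustration under stated assumptions: each fragment conformation is reduced to a single torsion angle, and `toy_score` with its `PREFERRED` angles is a hypothetical stand-in for a real scoring function.

```python
import math


def incremental_build(anchor_poses, fragment_options, score, n_best=2):
    """Grow partial ligands one fragment at a time, keeping the n_best builds.

    anchor_poses     -- candidate placements of the anchor, each a tuple
    fragment_options -- for each fragment (in build order), its candidate
                        conformations
    score            -- callable on a partial build; lower is better
    """
    partial = sorted(anchor_poses, key=score)[:n_best]
    for options in fragment_options:
        grown = [build + (option,) for build in partial for option in options]
        # Prune to the n_best builds to avoid the combinatorial explosion.
        partial = sorted(grown, key=score)[:n_best]
    return partial


# Hypothetical preferred angle (degrees) for each degree of freedom.
PREFERRED = (120.0, 240.0, 180.0)


def toy_score(build):
    """Penalize the deviation of each angle from its preferred value."""
    return sum(1.0 - math.cos(math.radians(a - p)) for a, p in zip(build, PREFERRED))


best_builds = incremental_build(
    anchor_poses=[(0.0,), (120.0,), (240.0,)],
    fragment_options=[[0.0, 120.0, 240.0], [60.0, 180.0, 300.0]],
    score=toy_score,
)
```

The sketch also makes the underlying assumption visible: a partial build is scored without knowledge of the fragments still to be added, which is the limitation the whole molecule approach of Surflex addresses.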

Genetic Algorithms. The first genetic algorithms60-62 implemented in docking programs

were derived from Darwin’s theory of evolution63 which proceeds through natural

selection. Throughout this process, favourable genes (as defined by a fitness function) are

passed onto the next generations of an evolving population and unfavourable ones are

eliminated64 (see Figure 1.3). Within a docking program, a population is defined as a set

of individuals with each individual representing a distinct pose. The pose is encoded

within a chromosome made up of a series of genes which represent the ligand

conformation (values of the rotations about rotatable bonds), orientation and position in

space (in the protein reference frame). An initial population is created by assigning values

to each gene of each of the individuals and the population is then allowed to evolve

through the passing of genetic information (reproduction). As described in Darwin’s

theory, genetic operators such as crossover and mutation will modify/optimize the

population over time. Crossover of genetic information proceeds by selecting two parent

individuals and switching the values of the genes from a crossover point in the

chromosome onward. Mutations are simulated by random modification of a gene within a

chromosome. A number of variations (e.g., chromosome definition, selection of the next

generation, genetic operators) have been implemented in docking programs.
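
A minimal sketch of the chromosome encoding and of the two genetic operators described above is given below. The gene layout (torsions, then three orientation angles, then a position), the mutation rate and the step size are illustrative assumptions rather than the settings of any cited program.

```python
import random


def random_chromosome(n_torsions, rng, box=10.0):
    """Genes: torsion angles, then three orientation angles, then x, y, z."""
    torsions = [rng.uniform(-180.0, 180.0) for _ in range(n_torsions)]
    orientation = [rng.uniform(-180.0, 180.0) for _ in range(3)]
    position = [rng.uniform(-box, box) for _ in range(3)]
    return torsions + orientation + position


def crossover(parent_a, parent_b, rng):
    """One-point crossover: swap every gene from the crossover point onward."""
    point = rng.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:], parent_b[:point] + parent_a[point:]


def mutate(chromosome, rng, rate=0.1, step=30.0):
    """Randomly perturb each gene with probability `rate`."""
    return [g + rng.uniform(-step, step) if rng.random() < rate else g for g in chromosome]


rng = random.Random(0)
mother = random_chromosome(4, rng)
father = random_chromosome(4, rng)
child_a, child_b = crossover(mother, father, rng)
mutant = mutate(child_a, rng)
```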

Figure 1.3 - Genetic Algorithm.

The DOCK60 developers have also implemented a genetic algorithm, although the

geometric matching is still recommended in the newest versions.57 An elitist strategy is

employed in the selection of the next generation where some of the fittest individuals

(among the parents) are saved into the next generation along with the best scoring

individuals resulting from crossover and mutation (the offspring). DIVALI62 improved on

this approach by incorporating the bond rotations as genes. Within GOLD,61, 65 bond

rotations are encoded as genes but the orientation within the pocket is represented by a

series of genes which map interaction points of the ligand with points on the protein. This

mapping is used to orient the ligand within the pocket using least-squares fitting. GOLD

also differs from the previous programs by incorporating a roulette wheel selection of

parents, while random selection is usually done by others. Roulette wheel selection

proceeds through favouring individuals for reproduction based on their fitness. The fitter

an individual is, the greater its chance of reproducing. GOLD also uses a steady state

selection of the next generation. This occurs when the individuals are crossed over and/or

mutated. If the newly generated individuals (children) have better fitness scores than the

parents, then the parents are replaced. Other programs will select the next generation

from the whole set of parents and children. GOLD has also implemented the use of

islands (also known as niches) to allow sub-sets of the population to evolve

independently of each other with the occasional exchange of individuals between islands.
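
Roulette wheel selection as described above can be sketched in a few lines. The sketch assumes that fitness values are positive and that larger means fitter, so a docking score where lower is better would first have to be converted; the pose labels are hypothetical.

```python
import random


def roulette_select(population, fitness, rng):
    """Draw one individual with probability proportional to its fitness.

    `fitness` must hold positive numbers where larger means fitter.
    """
    pick = rng.uniform(0.0, sum(fitness))
    running = 0.0
    for individual, fit in zip(population, fitness):
        running += fit
        if pick <= running:
            return individual
    return population[-1]  # guards against floating-point round-off


rng = random.Random(1)
draws = [roulette_select(["pose_a", "pose_b"], [1.0, 9.0], rng) for _ in range(10000)]
share_b = draws.count("pose_b") / len(draws)  # close to 9 / (1 + 9)
```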

Prior to Darwin’s theory of evolution (Charles Darwin 1809-1882), Jean

Baptiste Lamarck (1744-1829) had proposed a different theory, the theory of inheritance

of acquired characteristics.66 Although similar (both described evolution toward a “best”

solution), these two theories differ by the information transmitted from one generation to

the next. Lamarck believed that changes which occurred during the life of an individual are passed to the

offspring while Darwin thought these changes did not affect the evolution. Although the

docking programs are primarily based on the Darwinian evolution, a Lamarckian flavour

has been found to improve the docking efficiency. Within the realm of docking this is

done through local conformational search. Within AutoDock,67 the Lamarckian aspect of

the algorithm is added by performing small perturbations of the genes within the

chromosome. If the perturbed individual has a better fit than the unperturbed individual,

then the original’s chromosome is replaced. These perturbations are then passed on to

the next generation. AutoDock also uses an elitist strategy in selecting the next

generation. Another docking program incorporating a Lamarckian genetic algorithm is

FITTED39, 68, 69 which uses a conjugate gradient energy minimization algorithm as the local

search method. The evolution as implemented in FITTED also incorporates novel genetic

operators. A probability of optimization is applied to individuals created after crossover

and mutation. Thus only a fraction of the offspring proceeds through the conjugate

gradient energy minimization. Also the newly created generation has a probability of

learning. If an individual is selected for education, it will be optimized by local search.
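
The Lamarckian step, in which a locally optimized individual overwrites its own chromosome, can be sketched as below. The local search here is a simple accept-if-better random perturbation chosen for brevity; it is not the conjugate gradient minimizer used in FITTED, and `p_optimize`, `step`, `tries` and `toy_score` are illustrative assumptions.

```python
import random


def local_search(chromosome, score, rng, step=5.0, tries=20):
    """Keep small random gene perturbations only when they lower the score."""
    best, best_score = list(chromosome), score(chromosome)
    for _ in range(tries):
        trial = [g + rng.uniform(-step, step) for g in best]
        trial_score = score(trial)
        if trial_score < best_score:
            best, best_score = trial, trial_score
    return best


def lamarckian_step(population, score, rng, p_optimize=0.5):
    """Refine a random fraction of individuals and write the result back.

    Overwriting the chromosome with the refined genes is the Lamarckian part:
    the improvement is inherited by the offspring of this individual.
    """
    return [
        local_search(ind, score, rng) if rng.random() < p_optimize else list(ind)
        for ind in population
    ]


def toy_score(chromosome):
    return sum(g * g for g in chromosome)  # hypothetical energy, minimum at zero


rng = random.Random(2)
population = [[rng.uniform(-50.0, 50.0) for _ in range(5)] for _ in range(8)]
refined = lamarckian_step(population, toy_score, rng)
```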

Monte Carlo. Monte Carlo techniques employ random or pseudo-random modifications

of bond rotations, translations and rotations of the ligand pose (see Figure 1.4). The

resulting pose is next analysed. If it has a better score than the previous one then it is

saved. When the score is not better it is not outright rejected and can be saved based on a

selection criterion such as the Metropolis criterion. The temperature-dependent

Metropolis criterion allows for higher energy structures to be accepted with the

probability decreasing with increasing potential energy. It is possible to tune how strict

the criterion is by selecting an appropriate z value (see Figure 1.4). This approach allows

for passing of high energy barriers on the potential energy surface (PES). If the pose is

rejected, another set of manipulations is tried and the process is repeated.
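
A minimal sketch of a Monte Carlo search with Metropolis acceptance follows. It uses the textbook form of the criterion, in which an uphill move is accepted when a uniform random number falls below exp(-ΔE/kT); how the z value of Figure 1.4 enters is not assumed here. The step size, temperature and toy energy function are illustrative.

```python
import math
import random

K_B = 0.0019872  # Boltzmann constant in kcal/(mol K)


def metropolis_accept(e_old, e_new, temperature, rng):
    """Accept downhill moves always, uphill moves with Boltzmann probability."""
    if e_new <= e_old:
        return True
    return rng.random() < math.exp(-(e_new - e_old) / (K_B * temperature))


def monte_carlo(score, start, rng, n_steps=5000, step=10.0, temperature=300.0):
    """Random-walk search over a list of degrees of freedom, tracking the best pose."""
    current, e_current = list(start), score(start)
    best, e_best = list(current), e_current
    for _ in range(n_steps):
        trial = [x + rng.uniform(-step, step) for x in current]
        e_trial = score(trial)
        if metropolis_accept(e_current, e_trial, temperature, rng):
            current, e_current = trial, e_trial
            if e_current < e_best:
                best, e_best = list(current), e_current
    return best, e_best


def toy_energy(pose):
    return sum(0.01 * x * x for x in pose)  # hypothetical energy surface


rng = random.Random(3)
start_pose = [40.0, -25.0, 10.0]
best_pose, best_energy = monte_carlo(toy_energy, start_pose, rng)
```

Raising the temperature makes uphill moves more likely to be accepted, which helps the search cross barriers at the cost of slower convergence.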

Figure 1.4 - Monte Carlo.

Among the docking programs using Monte Carlo as a conformational search tool

are ICM, Glide and LigandFit. ICM70, 71 uses a Brownian movement Monte Carlo

technique, which imposes restrictions on the random moves; large changes in the bond

rotations of the ligand should be accompanied by small rotations and translations of the

entire molecule. As part of a multistage funnel approach to docking, which uses a set of

hierarchical filters, Glide72 carries out pose refinement through a Monte Carlo algorithm.

Like ICM, LigandFit73 does not select truly random moves to generate new conformations.

The rotations about bonds are based on the number of atoms connected to the bond.

For example, a torsion which rotates 25 atoms in a molecule with 50 atoms would have a

10° resolution while a torsion which rotates 10 atoms would have a resolution of 5°. This new

conformation is first evaluated based on its shape matches with the protein followed by

an energy calculation.

Swarm Intelligence. In the past few years, new algorithms based on swarm intelligence

have been introduced in the field of docking. Swarm intelligence is inspired by the

movement of a swarm of birds when one of them finds food. Within docking this can be

translated into a conformational move of a population following the fittest individual of

this population. Two methods of swarm intelligence have found their way into docking:

particle swarm optimization (PSO) and ant colony optimization.

PSO implemented within SODOCK74 is based on the traditional sense of swarm

intelligence. In this program, an initial set of poses is created randomly. During the next

iteration, each rotatable bond, torsion and orientation of all the poses are transformed

based on their distance to the best solution. The distance a pose moves is referred to as its

velocity. PSO@AutoDock75 is a variation of AutoDock67 which implements a PSO

method to overcome high energy barriers. This program only updates the velocity of

a conformation if its fitness is worse than that of the previous one.
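
The idea of moving each pose toward the best solution with a velocity can be illustrated with the textbook particle swarm update. This sketch is not the SODOCK or PSO@AutoDock implementation: the inertia weight `w`, the acceleration coefficients `c1` and `c2`, the search bounds and the toy score are all assumptions.

```python
import random


def pso_minimize(score, dim, rng, n_particles=20, n_iter=100,
                 w=0.7, c1=1.5, c2=1.5, bound=5.0):
    """Textbook particle swarm optimization; lower scores are better."""
    xs = [[rng.uniform(-bound, bound) for _ in range(dim)] for _ in range(n_particles)]
    vs = [[0.0] * dim for _ in range(n_particles)]
    pbest = [x[:] for x in xs]
    pbest_score = [score(x) for x in xs]
    g = min(range(n_particles), key=pbest_score.__getitem__)
    gbest, gbest_score = pbest[g][:], pbest_score[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                # Velocity: inertia plus pulls toward the particle's own best
                # position and toward the best position found by the swarm.
                vs[i][d] = (w * vs[i][d]
                            + c1 * rng.random() * (pbest[i][d] - xs[i][d])
                            + c2 * rng.random() * (gbest[d] - xs[i][d]))
                xs[i][d] += vs[i][d]
            s = score(xs[i])
            if s < pbest_score[i]:
                pbest[i], pbest_score[i] = xs[i][:], s
                if s < gbest_score:
                    gbest, gbest_score = xs[i][:], s
    return gbest, gbest_score


def toy_score(x):
    return sum(v * v for v in x)  # hypothetical energy, minimum at the origin


rng = random.Random(3)
best, best_score = pso_minimize(toy_score, 3, rng)
```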

Ant colony optimization has been implemented in the docking program PLANTS.76

This global optimization technique is inspired by the way ants search for and

locate food. When ants find food, they release pheromones on their way back to

their nest. This pheromone path will next be used by the colony to go back to the food

source. Each ant may take a different path but some paths cross each other. When ants

return to the food source, they follow the path with the strongest (most) pheromones. As

implemented in PLANTS, the ant colony optimization algorithm creates a population of

distinct conformations called ants. Each ant (i.e., pose) has a set of rotations, translations

and rotatable bonds. The best individual of the population deposits a pheromone on each

value of its set. In the next iteration, the probability of assigning a given value to a member of the set is

directly proportional to the amount of pheromone deposited on that value up to that point.
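
The pheromone bookkeeping described above can be sketched for discretized degrees of freedom. This is an illustration only and not the PLANTS algorithm: the evaporation step, the deposit of one unit per iteration, the discretization into 60° steps and the toy score are assumptions.

```python
import random


def ant_colony_minimize(score, choices, rng, n_ants=10, n_iter=50, evaporation=0.1):
    """choices[d] lists the allowed values of degree of freedom d."""
    pheromone = [[1.0] * len(values) for values in choices]
    best, best_score = None, float("inf")
    for _ in range(n_iter):
        iteration_best, iteration_score = None, float("inf")
        for _ in range(n_ants):
            # Each ant picks one value per degree of freedom, with probability
            # proportional to the pheromone currently on that value.
            picks = [
                rng.choices(range(len(values)), weights=trail)[0]
                for values, trail in zip(choices, pheromone)
            ]
            s = score([values[i] for values, i in zip(choices, picks)])
            if s < iteration_score:
                iteration_best, iteration_score = picks, s
        if iteration_score < best_score:
            best, best_score = iteration_best, iteration_score
        # Evaporate (an assumption of this sketch), then let the best ant deposit.
        for d, i in enumerate(iteration_best):
            pheromone[d] = [(1.0 - evaporation) * t for t in pheromone[d]]
            pheromone[d][i] += 1.0
    return [values[i] for values, i in zip(choices, best)], best_score


ANGLES = [0.0, 60.0, 120.0, 180.0, 240.0, 300.0]
TARGET = (120.0, 240.0, 60.0)  # hypothetical optimum of the toy score


def toy_score(pose):
    return sum(abs(a - t) for a, t in zip(pose, TARGET))


rng = random.Random(4)
pose, pose_score = ant_colony_minimize(toy_score, [ANGLES] * 3, rng)
```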

RING FLEXIBILITY

The algorithms mentioned above have one major constraint; they only address the

flexibility of acyclic portions of the ligands. This is a major issue of concern when

docking large libraries of ligands since only one conformation of the ligand ring may be

used and it may not be the bioactive one.77-79 Even if the most thermodynamically

favoured ring conformation is docked, the protein may stabilize (i.e., bind tightly) a

higher energy conformation. Therefore an accurate ligand pose may not be found if the

ring conformation is not searched while docking. There are two options when

incorporating the flexibility of rings for docking programs: either the ring conformations

can be searched before the docking run using conformation generators or on-the-fly by

the docking program itself.

Ring Flexibility Through Conformational Generators. All docking programs can consider

ring flexibility if a pre-computed ensemble of ligand conformations is docked. These

ensembles can be generated by a conformational generator; each conformation is then docked

independently and the results are merged at the end. One of the most common methods to

incorporate ring flexibility within conformational search tools is to include a library of

ring conformations. This technique is used within OMEGA,80 which is a tool to prepare

ligands for docking. OMEGA is based on a depth-first, divide and conquer approach to

generate conformations of ligands. Fragmentation of the ligand is followed by the

generation of multiple conformations for each of these fragments. The conformations are

next evaluated using an energy calculation and sorted. From this data, OMEGA

reassembles the fragments into a ligand structure. LigPrep81 is another ligand preparation

program which uses a library of known ring conformations. LigPrep first identifies

flexible rings and matches them to a template. Their relative energies are then estimated

using an energy associated with the ring template, axial-equatorial energies and short

range pair-wise repulsions between atoms directly bound to the ring. This data is then

used to identify the most favourable ring conformation for a single flexible ring. For

molecules with multiple rings, a total ring energy is calculated using the sum of the

energies of all the rings present. Following the optimization of the ring, LigPrep proceeds

with a Monte Carlo search to optimize the acyclic portions of the ligand. Although easy

to implement, ring conformation libraries are limited to small rings as most of these

libraries do not cover medium and large rings.

It is also possible to create ring conformations de novo instead of using ring

libraries. In contrast to ring conformation libraries, this approach is not limited to a

predefined set of conformations but is more time consuming. CORINA,82 another ligand

conformation generator, uses a rule based approach for acyclic portions. To search rings,

it creates a circle with a size dependent on the number of atoms within the rings. sp3

Atoms are then added either in the plane of the circle or above and below the plane

alternately. sp2 Atoms are left in the plane with consideration of cis and trans

geometries. For polycyclic systems a backtracking algorithm is used which first finds all

the possible conformations of the smaller rings within the polycyclic system. The lowest

energy conformation is tried first and each ring is added systematically. If a conformation with

the lowest energy ring cannot be found the next lowest is tried. This is repeated until a

conformation can be created. Another program, CONCORD,83, 84 uses a rule based

approach for acyclic portions. In contrast, ring conformations are determined by an

algorithm which minimizes its strain within internal coordinates. Another technique,

stochastic proximity embedding (SPE),85 finds conformations by imposing geometrical

constraints. A constraint, either volume or distance, is first selected then the atomic

coordinates are modified until the conformation fits this constraint. These constraints are

defined by a set of rules, in which certain functional groups which match specific

substructures are assigned a constraint. These rules can be defined by a range around an

equilibrium bond length, angle, torsion or pair wise interactions between protein and

ligand determined by statistical analysis of the PDB. Rubicon85 uses a similar approach

but with a metric matrix algorithm that generates conformations which fit the constraint

instead of modifying the atomic coordinates.

These generated libraries of conformations can then be docked with RBD methods.

Another option is to generate the ensemble of ligand conformations as a first step at

the start of a docking run. This is often achieved with template libraries incorporated into

incremental construction methods. This implementation is straightforward as each new

conformation is considered as a separate possibility for a fragment as in Surflex or FlexX.

Surflex acts via a two step procedure. First templates of 5-7 membered hydrocarbon rings

are mapped onto the ligand rings ignoring the atom types. An energy minimization

routine next optimizes the ring shape considering the atom types. FlexX calls CORINA at

the beginning of a docking run to generate multiple conformations of flexible rings.86

Glide also uses a template library (same library as in LigPrep) to generate multiple

conformations of rings in the first step of its hierarchical filter funnel approach.

Ring Flexibility On-the-Fly. Very few programs consider ring flexibility on-the-fly.

Combining one of the above-mentioned acyclic conformational search methods with an

on-the-fly conformational ring search method is expected to be more time-efficient. To

date, the methods implemented are based on variations of Goto and Osawa’s corner

flapping approach which reflects an atom of a ring through a mirror plane made up of

adjacent atoms.87, 88 Flipping atoms and their substituents while keeping the correct

geometry and chirality proves challenging. GOLD89 uses a series of bond rotations to

transform the original position of the flipped atom into the one reflected through the mirror

plane, hence requiring that the 4 adjacent atoms (2 on either side) must be in a plane. This

requirement limits the method and renders the full search of large flexible cyclic systems

difficult. FITTED uses a different approach which removes this requirement,

enabling the searching of larger rings.39 FITTED also enables new ring conformations to

be investigated during docking by using a conjugate gradient minimization.
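
The geometric core of corner flapping, reflecting one atom through a plane defined by neighbouring atoms, can be sketched as follows. Only the reflection itself is shown; repositioning the substituents of the flipped atom and preserving chirality, which the text identifies as the difficult part, are not addressed. The coordinates are hypothetical.

```python
import math


def reflect_through_plane(point, a, b, c):
    """Mirror `point` through the plane passing through points a, b and c."""
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    # Plane normal from the cross product of two in-plane vectors.
    normal = [
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    ]
    length = math.sqrt(sum(n * n for n in normal))
    if length == 0.0:
        raise ValueError("the three plane points are collinear")
    normal = [n / length for n in normal]
    # Signed distance from the plane, measured along the unit normal.
    distance = sum((point[i] - a[i]) * normal[i] for i in range(3))
    return tuple(point[i] - 2.0 * distance * normal[i] for i in range(3))


flap_atom = (0.5, 0.5, 0.6)
plane = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))  # the z = 0 plane
flipped = reflect_through_plane(flap_atom, *plane)
```

Applying the reflection twice returns the atom to its starting position, which gives a convenient sanity check.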

PROTEIN FLEXIBILITY

The algorithms described in the previous sections account for the ligand flexibility.

However, in most cases, docking programs treat the protein as a rigid object, following

the lock and key model. Numerous reports have described the flexibility of proteins and

its effect on docking accuracies.33-35, 89-91 Considering these additional conformational

degrees of freedom still remains one of the major challenges in the field of small

molecule docking.15, 92-97 The simplest way (although not the most time-efficient) to

include protein flexibility is to dock the ligand to multiple alternative conformations of a

protein and merge the results (see Figure 1.5A). In fact, this method has been shown to

increase the accuracy of docking compared to cross-docking results.39, 98 It has also been

shown that including all available protein structures, while not affecting the docking

accuracy, is more CPU time consuming.98 These results demonstrate that the inclusion of

protein flexibility is a necessity but that selection of the protein structures used for a study

is critical.
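
Merging the results of several rigid-protein docking runs, the simplest approach mentioned above, amounts to pooling the poses and ranking them by score. In the sketch below the structure labels, pose labels and scores are hypothetical, and lower scores are assumed to be better.

```python
def merge_docking_runs(runs, top_n=3):
    """Pool poses from several rigid-protein runs and rank them by score.

    runs -- dict mapping a protein-structure label to a list of
            (pose_label, score) pairs; lower scores are assumed better.
    """
    pooled = [
        (score, structure, pose)
        for structure, poses in runs.items()
        for pose, score in poses
    ]
    pooled.sort(key=lambda entry: entry[0])
    return pooled[:top_n]


# Hypothetical scores for one ligand docked to three protein conformations.
runs = {
    "conf_A": [("pose_1", -7.2), ("pose_2", -6.8)],
    "conf_B": [("pose_1", -9.1)],
    "conf_C": [("pose_1", -5.0), ("pose_2", -8.3)],
}
ranked = merge_docking_runs(runs)
```

The cost of this approach grows linearly with the number of protein structures, which is why the choice of structures matters.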

Using only experimentally determined protein structures may be too restrictive as

other (i.e., not available) protein conformations may be adopted upon binding to another

ligand. New protein conformations generated by computational techniques can

complement experimentally determined structures. Protein conformations can also be

produced while docking. In this case, the conformational search of the proteins falls into

two categories; either the conformation is searched prior to the docking run (see Figure

1.5B) or searched on-the-fly by the docking program (see Figure 1.5C).

Figure 1.5 - Possible methods to include protein flexibility A) Generation of multiple

protein conformations for multiple docking runs; B) On-the-fly generation of protein

conformations during docking using one protein input structure and C) On-the-fly

generation of protein conformations using multiple protein conformations.

Generation of Multiple Protein Conformations. Molecular dynamics (MD) computation

can be exploited to generate multiple protein conformations by simulating the protein’s

conformational changes over time (see Figure 1.5A). Variations of this technique may

force large conformational changes (normal mode analysis, nudged elastic band method,

elastic network model) not only on the side chains but also on the backbone of the

protein. However, as MD simulations may provide a wealth of conformations, a method

is required to limit the number of conformations.

Amaro et al.99

performed a 20 ns MD simulation on RNA editing ligase 1, taking a

snapshot every 50 ps, resulting in 400 conformations. They then applied a “QR

factorization” method which is used to remove redundant information, reducing the 400

conformations to 33 with no loss of data. By reducing the number of conformations for

docking, they increased the overall time efficiency of the screen itself. However, the MD

simulations are time-consuming. Garner et al.100

addressed this problem by examining

multiple protein structures and identifying atoms which do not move upon ligand

binding. This information was converted into a set of constraints applied to those atoms

during the MD simulation, hence decreasing the time required for the simulation. MD

simulations are not appropriate for a true conformational search of the protein and

therefore should only be used to probe truly novel (but accurate) conformations. To probe

for new possible conformations, Withers et al.101

developed the active site pressurization

method to force the protein to adopt novel conformations. This is done by filling the

protein cavity with a virtual resin made of uncharged Lennard-Jones particles in the form

of a grid. Initially only the resin beads that are not directly clashing with the protein are

turned on and beads adjacent to them are flagged. During the MD simulation, the protein

reacts to the beads that are on while the flagged beads observe the possible forces applied

to them. After the initial MD run, the conformation is saved and the flagged bead that

observed the most favourable interaction with the protein is turned on followed by

another MD simulation. This process is continued until it reaches a target number of

structures.

Another option similar to MD simulations is normal mode analysis (NMA). NMA

approaches are less time consuming than regular MD simulations but are more memory

intensive. The theory behind NMA is that simple harmonic oscillations around a local

energy minimum correspond to the normal modes of vibration. Like MD, many

conformations can be produced and the selection of a smaller subset is necessary. Keseru

et al.102

used a selection of low frequency normal modes to approximate the large

movements in the protein structure. Cavasotto et al.103

developed a method which reduces

the number of conformations by only selecting relevant normal modes which affect the

area of interest. New conformations are then created through linear combinations of the

normal modes followed by optimization of the side chains by Monte Carlo search in the

presence of known binders. Cavasotto et al.35

also tried a simpler method to generate the

initial set of protein conformations. Instead of NMA, the ligand was initially placed into

the active site of the protein in multiple orientations. Each of these conformations

underwent an energy minimization followed by an optimization of side chain

conformations through a Monte Carlo procedure.

For larger changes in protein conformations other options are available such as the

elastic network model (ENM)104

and nudged elastic band (NEB) method.105, 106

The ENM

is in fact a simpler version of NMA. Within ENM, all pairwise interactions between Cα atoms

within a cut-off distance are represented as springs with a uniform force constant. With

these springs in place, one can then perform NMA on this simplified model to

determine the normal modes of distortion of the elastic model and create new protein

conformations. This has been used to generate multiple conformations of the ionotropic

glutamate receptor (iGluR2).107

The iGluR2 can adopt multiple conformations due to

domain closure upon ligand binding. To generate multiple conformations along the path

of closure the ENM method was first used to identify the normal modes of an initial

intermediate conformation, followed by steps in both directions (towards the open and

closed conformations) to generate an ensemble of conformations.
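The spring network at the heart of the ENM can be illustrated with a short sketch. This is only a minimal illustration, assuming Cα coordinates supplied as (x, y, z) tuples; the function names, the 7 Å cut-off and the unit force constant are hypothetical choices and are not taken from the cited work. The sketch builds the list of springs and evaluates the harmonic energy of a deformed structure; a full ENM calculation would go on to diagonalize the corresponding Hessian to obtain the normal modes.

```python
import math

def build_springs(ca_coords, cutoff=7.0):
    """Return (i, j, rest_length) for every C-alpha pair closer than cutoff (angstroms)."""
    springs = []
    for i in range(len(ca_coords)):
        for j in range(i + 1, len(ca_coords)):
            d = math.dist(ca_coords[i], ca_coords[j])
            if d <= cutoff:
                springs.append((i, j, d))
    return springs

def elastic_energy(ca_coords, springs, force_constant=1.0):
    """Harmonic energy 0.5 * k * (d - d0)^2 summed over all springs (uniform k)."""
    energy = 0.0
    for i, j, rest in springs:
        d = math.dist(ca_coords[i], ca_coords[j])
        energy += 0.5 * force_constant * (d - rest) ** 2
    return energy

# Hypothetical four-residue chain, 3.8 angstroms between consecutive C-alpha atoms.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
springs = build_springs(reference)

# Pull the last residue 1 angstrom away; only the springs attached to it are strained.
deformed = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (12.4, 0.0, 0.0)]
print(len(springs), elastic_energy(reference, springs), elastic_energy(deformed, springs))
```

Because every spring shares the same force constant, the only structural input is the contact topology, which is what makes the model cheap compared with an all-atom NMA.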

The NEB method works by approximating the path between the beginning and end

conformations with a series of images interconnected by springs. These springs prevent

the images from sliding down onto the preceding images on the potential energy surface (PES). The initial images

are copies of the starting and final conformations. During the simulated annealing

optimization, the interconnecting springs allow each image to be affected by the

previous one, allowing for the creation of multiple conformations along the PES.108

This

method has been recently implemented within AMBER108

and has been applied to RNA

but can be applied to proteins to simulate larger movements.

On-the-Fly Protein Flexibility. Accounting for protein flexibility during the

conformational search of the ligand pose allows for ligand dependent conformational

changes within the protein. The simplest way to account for local protein conformational

adjustments during docking is to allow for some overlap with the protein by reducing the

repulsive nature of the Lennard-Jones potential. This softening approach, also referred to

as soft-docking, was first implemented in 1991 by Jiang et al.109

ADAM110

went further

by incorporating an offset distance to virtually increase the van der Waals distance

between the protein and ligand along with an energy minimization which allows for the

relaxation of the protein. The energy minimization-based optimization of the protein

structure allows for small movements of the protein atoms. Apostolakis et al.111

improved

on this by allowing the van der Waals interactions to be gradually turned on. This

docking method starts by creating a random conformation of the ligand followed by an

energy minimization including the gradual turning on of the van der Waals interactions.

This is followed by a re-optimization of the ligand conformation through a Monte-Carlo

search.
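The effect of softening the repulsive wall can be made concrete with a generic n-m Lennard-Jones expression. The sketch below is an illustration under stated assumptions: the functional form is the textbook n-m potential normalized so that the minimum sits at r_min with depth `depth`, and the numerical values are hypothetical. It is not claimed to be the exact expression used by any of the programs cited here. It compares the standard 12-6 form with a softer 8-4 form at short distances.

```python
def lennard_jones(r, r_min, depth, n=12, m=6):
    """Generic n-m Lennard-Jones energy with its minimum (-depth) located at r = r_min."""
    ratio = r_min / r
    return depth * (m / (n - m) * ratio ** n - n / (n - m) * ratio ** m)

# Hypothetical parameters: minimum at 3.5 angstroms, well depth 0.2 kcal/mol.
r_min, depth = 3.5, 0.2
for r in (2.8, 3.0, 3.5):
    hard = lennard_jones(r, r_min, depth, 12, 6)
    soft = lennard_jones(r, r_min, depth, 8, 4)
    print(f"r = {r:.1f}  12-6: {hard:7.3f}  8-4: {soft:7.3f}")
```

Both forms share the same minimum, but the 8-4 wall rises far more slowly inside r_min, so a slightly overlapping ligand pose is penalized less and is not discarded outright.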

Many docking programs use a grid based approach to calculate interaction energies

between the protein and the ligand. However, if more than one protein structure is used,

multiple grids are created and multiple docking runs must be carried out. To reduce the

CPU time demand, techniques have been developed which combine the individual grids

into grids modeling an ensemble of protein conformations. This approach was first

developed by Knegtel et al.112

who used DOCK with an ensemble of protein structures

modeled by a single average grid. This grid was derived from multiple grids,

corresponding to each of the protein structures. Two averaging techniques were

implemented: an energy weighted average and a geometry weighted average. The energy

weighted method takes the average energy of all protein conformations at each point in

space while the geometry weighted method takes the average position of an atom in all

protein structures and determines the energy at that point. Osterberg et al.34

improved on

the energy weighted grid by averaging the grid using the Boltzmann distribution factor

and also using a weighted average of the energy grids termed a clamped grid. If the

energy at a point for a specific protein is unfavourable then that point is assigned a low

weight. This technique is similar to one developed by Moitessier et al.113

who docked

aminoglycosides to virtually flexible RNA. They also created a single RNA structure

from averaging the coordinates of multiple RNA structures, then creating one grid.

Sotriffer et al.114

used AutoDock in conjunction with an ensemble of energy grids.

Instead of averaging the ensemble of grids they were joined back to back separated by a

strip of repulsive grid points to remove any possibility of ligand docking in the interface

of two grids.
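The difference between a plain energy average and a Boltzmann-weighted average of several grids can be sketched as follows. This is a simplified illustration, assuming each grid is a flat list of energies in kcal/mol sampled at the same points; the function names and the toy values are hypothetical, and the weighting shown is a generic Boltzmann average rather than the exact published expressions.

```python
import math

KT = 0.593  # kcal/mol, roughly RT at 298 K

def mean_energy(energies):
    """Plain (energy weighted) average over all protein conformations."""
    return sum(energies) / len(energies)

def boltzmann_weighted_energy(energies, kt=KT):
    """Average in which low-energy conformations dominate via exp(-E/kT) weights."""
    e_min = min(energies)  # shift by the minimum for numerical stability
    weights = [math.exp(-(e - e_min) / kt) for e in energies]
    return sum(w * e for w, e in zip(weights, energies)) / sum(weights)

def combine_grids(grids, combiner):
    """grids: one equally sized list of point energies per protein conformation."""
    return [combiner(point_energies) for point_energies in zip(*grids)]

# Three hypothetical conformations, three grid points each.
grids = [[-1.0, 5.0, 0.2], [-0.5, -2.0, 0.2], [3.0, 10.0, 0.2]]
print(combine_grids(grids, mean_energy))
print(combine_grids(grids, boltzmann_weighted_energy))
```

At the second grid point one conformation is favourable while the other two clash; the plain average reports a repulsive point, whereas the Boltzmann average stays close to the favourable value, which is the behaviour that lets a ligand use whichever conformation accommodates it.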

Generating new protein conformations on-the-fly is another possibility. There are

two options for searching the conformational space of proteins during docking (see

Figure 1.5B); either the ligand is docked first, followed by optimization of the protein, or

both the ligand and the protein conformational searches occur simultaneously.

The first docking study that considered conformational searching of the proteins

upon ligand docking was reported by Leach.115

In this work, optimization of protein side

chains was carried out after docking of the ligand. A rotamer library was trimmed using a

dead-end elimination algorithm to discard some rotamers of side chains which overlap with each

other. A tree search algorithm was then exploited to find optimal combinations of amino

acid rotamers. Schaffer et al.116

improved on this by generating multiple starting

conformations of the protein prior to the docking of the ligand. This is completed by a re-

optimization of the protein side chains and a final energy minimization to relax the

protein. The number of possible conformations of the protein resulting from Leach’s

algorithm was trimmed by Anderson et al.117

by only applying the conformational search

to residues which have been identified as flexible in multiple crystal structures. If only

one protein structure is available, a selection scheme is proposed to reduce the number of

residues that would be considered as flexible.

Utilizing a rotamer library has its downside. While binding, a ligand may lead to

novel conformations of side chains that may not be present in the rotamer library. Various

techniques have been developed for de novo prediction of side chain conformations. ICM

incorporated a dual alanine scanning and refinement procedure to relax the binding site

cavity.118

First, flexible residues are identified then mutated to alanine to enlarge the

pocket. If there are more than two flexible residues, multiple protein structures are

created where various combinations of two residues are mutated to alanine. The ligand is

then docked to the ensemble of protein variants and clustered. The best scoring poses are

then frozen and the protein side chains are reconstructed and re-optimized. Koska et al.119

implemented a similar post-docking refinement of the protein using a combination of

ChiFlex to create an ensemble of protein input structures (see Figure 1.5A) and ChiRotor

to optimize the protein following docking using LibDock. Sherman et al.120

showed that

Glide can also accommodate protein flexibility when it is used in conjunction with

PRIME, a protein homology modeling program. First, Glide identifies the 3 most flexible

residues within the pocket and, like ICM, mutates them to alanine to allow for a larger

binding cavity. The flexible residues are selected within the Glide protocol using a set of

4 rules: 1) residues with atoms that deviate by 2.5 Å from the apo protein crystal

structure; 2) residues which have multiple occupancy or missing density within 5 Å of

the co-crystallized ligand; 3) if multiple protein crystal structures are available, residues

that have atoms which deviate by more than 1.5 Å between structures; and 4) if no more

than 3 residues have been selected, residues with high B-factors. Glide then

docks the ligand, reconstructs the mutated residues and optimizes the conformation of all

the residues within 5 Å of the ligand. These residues and the ligand are then subjected to

an energy minimization followed by a re-docking using Glide.

In all the above-mentioned cases, the ligand pose was first optimized followed by

optimization of the protein. Although this approach improves on generation of multiple

conformations prior to the ligand docking, this is not a perfect representation. In reality,

both the ligand and protein conformations should be optimized in concert. Within SLIDE58, 59

protein flexibility is incorporated by rotating side chains to remove atomic overlaps

with the ligand. Kairys et al.121

utilize a mining minima method to optimize the

conformations of side chains. Ligand and protein conformations are generated using

random values within a specific range. This range is subsequently reduced based on the

lowest energy conformation found. Protein flexibility has been implemented into the ant

colony optimization technique of PLANTS.76

In this context, pheromones are placed on

the torsion values for optimal side chain conformations. Skelgen122, 123

uses a modified

simulated annealing algorithm to search the conformational space of side chains.

Interestingly, Skelgen allows the use of either a rotamer library or de novo generation of

random side chain conformations. GOLD61, 65, 124

tackles protein

flexibility in many ways. In its most current release version (v. 4.0), GOLD gives the user

the ability to manually create rotamer libraries for selected residues. It also uses an 8-4

Lennard-Jones potential energy function to soften van der Waals interactions. GOLD also

allows for the optimization of NH3 and OH orientations while docking the ligand by

incorporating them as genes within the genetic algorithm. Another program addressing

protein flexibility is MORDOR which uses a path exploration with distance

constraints.125, 126

MORDOR places the ligand within a pharmacophore sphere of the

binding site. Once the ligand is placed, the protein and ligand are simultaneously

optimized through energy minimization. The path exploration with distance constraints

imposes an RMSD penalty to force the ligand to explore new conformations.

The methods described above either use one protein conformation as input with its

structure being modified while docking or dock to multiple protein conformations to

account for larger moves not considered when allowing optimization of the side chains

only (see Figure 1.5B). Some methods have been developed which allow for use of

multiple protein structures as input for docking (see Figure 1.5C). In FlexX127

a module,

FlexX-ensemble, has been incorporated which merges portions of the proteins with

similar conformations into one instance and saves the dissimilar portions as independent

instances. During docking, FlexX-ensemble scores the ligand with all instances. If more

than one instance is present for a portion of the protein the best scoring one is selected. A

similar approach has also been implemented in DOCK.128

FITTED39, 68, 69, 129

uses multiple

protein structures to create a virtual backbone and rotamer library. During the evolution,

the genetic algorithm of FITTED allows for cross-over and mutations of side chain

rotamers and backbone conformations. ROSETTALIGAND130

applies a Monte Carlo

search to an ensemble of protein and ligand conformations to account for the flexibility of

both partners. A ligand pose is selected and randomly perturbed, then the side chain

conformations are optimized using a backbone dependent rotamer library.

It is also possible to use MD to optimize the conformation of both the ligand and

protein.131

However, fully searching all the possibilities for the orientation of the ligand is

challenging with regular MD techniques. In fact, in order to adopt other orientations and

conformations during a simulation, the ligand is required to leave, at least partly, the

binding pocket and to bind back in another orientation. Mangoni et al. overcame this

problem by separating the center of mass of the ligand from its internal and rotational

motions. This approach allows the receptor and ligand internal degrees of motion to be at

a different kinetic energy than the rotation and translation allowing for a more efficient

search of the ligand. To reduce the computational cost of MD-based docking, Tatsumi et

al.132

developed a hybrid MD/harmonic dynamics method which first performs MD to

determine the collective motions of the protein. These collective motions are then

approximated through harmonic modes so that large motions can be used even when only

portions of the receptors are considered. The side chains and ligand conformations are

then optimized using MD. It is also possible to use metadynamics to perform the

conformational search. Metadynamics is similar to MD except that it keeps a history of

the explored region of the energy hypersurface adding penalties to the regions already

visited leading the search towards new conformational space.133
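The history-dependent penalty that drives metadynamics can be illustrated on a one-dimensional toy problem. The sketch below is a deliberately simplified illustration: it replaces the MD integrator with a Metropolis Monte Carlo walker, uses a hypothetical double-well energy function, and all parameter values (Gaussian height and width, deposition interval, temperature) are arbitrary assumptions. It only shows how Gaussians deposited at visited positions accumulate into a bias that raises the energy of regions already explored.

```python
import math
import random

def potential(x):
    """Toy double-well energy surface with minima at x = -1 and x = +1."""
    return (x * x - 1.0) ** 2

def bias(x, centers, height=0.1, width=0.2):
    """History-dependent penalty: one Gaussian per previously deposited position."""
    return sum(height * math.exp(-((x - c) ** 2) / (2.0 * width ** 2)) for c in centers)

def run_metadynamics(steps=2000, kt=0.1, deposit_every=10, seed=1):
    """Metropolis walk on potential + bias, depositing a Gaussian at regular intervals."""
    rng = random.Random(seed)
    x, centers, visited = -1.0, [], []
    for step in range(steps):
        trial = x + rng.uniform(-0.1, 0.1)
        delta = (potential(trial) + bias(trial, centers)) - (potential(x) + bias(x, centers))
        if delta <= 0.0 or rng.random() < math.exp(-delta / kt):
            x = trial
        if step % deposit_every == 0:
            centers.append(x)  # remember where the walker has been
        visited.append(x)
    return visited, centers

visited, centers = run_metadynamics()
print(f"{len(centers)} Gaussians deposited; explored x from {min(visited):.2f} to {max(visited):.2f}")
```

As the starting well fills with Gaussians, the effective energy there rises and the walker is pushed towards regions it has not yet sampled.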

PREDICTING DISPLACEABLE KEY BRIDGING WATER MOLECULES

Bridging water molecules are waters which mediate the interactions between polar

groups of the ligand and protein.134, 135

How docking programs treat and/or predict the

placement of key bridging water molecules is of utmost importance in the field of small

molecule docking today.136

Typically docking studies treat waters depending on the target

being investigated. If the waters are experimentally determined to play a key role in the

ligand binding, they can be treated as parts of the protein. However, explicit water

molecules do not allow for cases where waters are displaced by the ligand. Two common

examples are the case of HIV-1 protease and thymidine kinase. In both these cases,

ligands have been designed to displace key bridging water molecules that interact directly

with the protein and are observed in crystal structures.137-140

Since most docking programs do

not correctly account for displaceable water molecules, it is typically suggested that these

waters should be deleted from the protein structure. Recent studies have shown that when

docking, inclusion of water molecules always increases the accuracy independently from

the origins of the water (crystallographic, predicted or optimized position).141, 142

Therefore the challenge is to include the displaceability of water within docking. There

are two options that are available: either predict whether the ligand displaces the water or

not or predict the water positions within the binding site during docking.

Predicting Displacement of Water Molecules. If the protein input structure contains one

or more water molecules, a docking program should be able to determine if the water is

displaced (off) or present (on). For grid based energy methods one can use methods

similar to those developed for protein flexibility. The clamped or Boltzmann equations

developed by Osterberg et al.34

combine grids to make one grid where the water is both

on and off. Moitessier et al.113

have also applied a similar technique to dock to hydrated

and flexible RNA. Huang et al.143

modified DOCK to simultaneously dock to two grids,

one with waters, one without. Thus DOCK calculates the score with both grids, selecting

the best of the two scores. Rarey et al.144

implemented the particle concept within FlexX.

In this program, a water particle is considered as a single sphere. During docking, FlexX

considers both options (the water being on or off) and selects the best scoring one. Within

GOLD145

the water orientation is allowed to be optimized during the genetic algorithm

but, like FlexX, it considers both on and off, keeping the best scoring option. A water-

specific switching function has been added to the AMBER force field within FITTED.68

This function turns off the ligand’s interactions with a given water molecule when it is

overlapping. SLIDE59

uses a similar approach which turns off the water where there is an

overlap with the ligand or protein. Jiang et al.146

created a solvated rotamer library which

has been used in a similar fashion to rotamer libraries used for protein flexibility.

Rotamers are added to the library with waters at positions that form hydrogen bonds with

the residue; these solvated residues are then used during the conformational search of the

protein. Van Dijk and Bonvin147

developed a stepwise procedure to determine key water

molecules. The binding site is first flooded and only waters on the surface of the protein

are kept. The ligand is then placed within the binding site and only waters which mediate

its binding with the protein are kept. This is followed by a random selection of water

molecules, in which the probability of keeping a water is set to the ratio of its observed contacts

with the protein to the ideal number of contacts, the latter derived from statistics

on the PDB. This is continued until only 25% of the water molecules remain. Key water

molecules are then identified by selecting waters that are below a score cut-off.
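The random removal step of this stepwise procedure can be sketched as below. This is an illustrative reading of the description above and not the published implementation: the data structure (a dictionary of observed contact counts), the function name, and the cap of 0.99 on the keep probability (added so that the loop always terminates) are all assumptions of this sketch.

```python
import math
import random

def prune_waters(contacts, ideal_contacts, keep_fraction=0.25, seed=0):
    """contacts: dict mapping a water identifier to its observed protein contacts."""
    rng = random.Random(seed)
    target = max(1, math.ceil(keep_fraction * len(contacts)))
    kept = sorted(contacts)
    while len(kept) > target:
        water = rng.choice(kept)
        # Keep probability = observed / ideal contacts, capped below 1
        # (an assumption of this sketch) so that removal can always proceed.
        p_keep = min(0.99, contacts[water] / ideal_contacts)
        if rng.random() >= p_keep:
            kept.remove(water)
    return kept

# Hypothetical waters and contact counts; the ideal number of contacts is set to 4.
observed = {"W1": 4, "W2": 1, "W3": 3, "W4": 0, "W5": 2, "W6": 1, "W7": 4, "W8": 0}
print(prune_waters(observed, ideal_contacts=4))
```

Waters with few contacts are removed quickly, so the surviving quarter is enriched in well-coordinated molecules before the final score cut-off is applied.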

Prediction of Water Positions. During the binding process, the protein or ligand may

adopt a conformation not yet experimentally observed, which is facilitated by the

presence of a bridging water molecule. Therefore methods have been developed to

predict the positions of potential bridging water molecules. Even though many of the

developed methods for water position prediction have not been incorporated within

docking programs, these methods could easily be used in conjunction with displaceable

water techniques mentioned above.

The first method to determine water positions within proteins was implemented

within the program GRID.148, 149

GRID uses a series of evenly distributed points, referred

to as a grid. At each grid point, the interaction energy between a water probe at this

location and the protein is calculated. If the energy is favourable (i.e., below a given

threshold), water can occur at that node. Issues can arise when using a grid

representation: too many waters may be found, or waters on the surface may be

discounted. Pitt and Goodfellow150

developed a knowledge based method to place waters

within the binding site. A table of preferred water positions for each side chain is created.

The water is then only retained if there is space for it within the binding site. Amadasi et

al.151

validated water positions found in crystal structures and GRID calculations by first

using the HINT score, which is used to calculate the global interaction strength of the

water with the protein, to determine which waters are located in a hydrophilic region. The

RANK algorithm then measures the number and geometric quality of hydrogen bonds

made between the water molecules and the proteins, with higher ranking waters being the

most favoured. Miranker and Karplus152

used a multiple copy simultaneous search

(MCSS), where many water molecules were placed within the active site and their

location optimized using the CHARMM force field. The interaction energy with the

protein was then calculated and waters below an energy cut-off value were minimized and

kept.
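The grid-based probing idea described for GRID can be illustrated with a toy scan. The sketch below is a minimal illustration only: the probe energy is a bare 12-6 Lennard-Jones term with hypothetical parameters, whereas the actual GRID energy function is considerably more elaborate, and the function names, grid spacing and threshold are assumptions made for this example.

```python
import math

def probe_energy(point, protein_atoms, r_min=3.0, depth=0.5):
    """Toy 12-6 Lennard-Jones interaction between a water probe and each protein atom."""
    energy = 0.0
    for atom in protein_atoms:
        distance = max(math.dist(point, atom), 0.5)  # floor avoids division by zero
        ratio = r_min / distance
        energy += depth * (ratio ** 12 - 2.0 * ratio ** 6)
    return energy

def favourable_nodes(protein_atoms, lower, upper, spacing=1.0, threshold=-0.4):
    """Scan a cubic grid and return (node, energy) pairs below the threshold, best first."""
    steps = int(round((upper - lower) / spacing)) + 1
    axis = [lower + i * spacing for i in range(steps)]
    hits = []
    for x in axis:
        for y in axis:
            for z in axis:
                energy = probe_energy((x, y, z), protein_atoms)
                if energy < threshold:
                    hits.append(((x, y, z), energy))
    return sorted(hits, key=lambda hit: hit[1])

# A single hypothetical protein atom at the origin.
atoms = [(0.0, 0.0, 0.0)]
hits = favourable_nodes(atoms, -4.0, 4.0)
print(len(hits), hits[0])
```

Even this toy scan returns a whole shell of acceptable nodes around one atom, which shows why grid methods tend to report too many candidate waters and need a subsequent filtering or clustering step.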

Currently only two docking programs incorporate the prediction of water positions,

albeit for different reasons. The FlexX developers incorporated the prediction of water

positions when they implemented the particle concept for displaceable waters.144

FlexX

first determines positions where waters would be energetically favourable followed by a

clustering algorithm to remove redundant information. Within the Glide XP scoring

functions, it is necessary to calculate the amount of desolvation of the ligand.72

To

calculate this energy contribution, Glide first docks water molecules explicitly into the

binding site and then uses empirical scoring terms to measure the exposure to water for

certain groups of the ligand.

PREDICTION OF METAL GEOMETRY

When orienting the ligand in the binding site of a metalloenzyme, docking

programs must have an adequate description of metal coordination geometries.15, 153-155

Currently only a few programs consider specific interaction points around metals. GOLD

initially determines the coordination geometry by examining the angles between protein

coordinating atoms (e.g., His nitrogens or Cys sulphurs). Once the geometry is

determined, interaction points corresponding to the free coordination sites are added.

FITTED39

uses a similar approach by incorporating the vectorial-bond valence model.156

This method states that the sum of the vectors formed by coordination should be equal to

0. The FlexX developers have also recently published an improved description of metal

geometries, based on a template library of metal coordination geometries which is

compared to the protein atoms coordinating the metal. An RMSD is then calculated

between the various templates and the protein metal site and the best matching template

is selected.157
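The zero-sum condition of the vectorial bond-valence model can be checked numerically. The sketch below is an illustration under assumptions: each coordination bond is represented by a vector pointing from the metal to the coordinating atom with a magnitude given by the usual empirical bond valence, s = exp((r0 - d)/b) with b = 0.37 Å; the value of r0 and the geometry used are hypothetical, and how FITTED actually uses the model is not shown here.

```python
import math

def bond_valence(distance, r0, b=0.37):
    """Empirical bond valence s = exp((r0 - distance) / b), distances in angstroms."""
    return math.exp((r0 - distance) / b)

def valence_vector_sum(metal, coordinating_atoms, r0):
    """Length of the resultant of all bond-valence vectors around the metal."""
    total = [0.0, 0.0, 0.0]
    for atom in coordinating_atoms:
        delta = [a - m for a, m in zip(atom, metal)]
        distance = math.sqrt(sum(c * c for c in delta))
        s = bond_valence(distance, r0)
        for k in range(3):
            total[k] += s * delta[k] / distance
    return math.sqrt(sum(c * c for c in total))

# Hypothetical ideal tetrahedral site: four atoms 2.0 angstroms from the metal.
a = 2.0 / math.sqrt(3.0)
metal = (0.0, 0.0, 0.0)
tetrahedral = [(a, a, a), (a, -a, -a), (-a, a, -a), (-a, -a, a)]
r0 = 1.7  # hypothetical bond-valence parameter

print(valence_vector_sum(metal, tetrahedral, r0))      # complete site
print(valence_vector_sum(metal, tetrahedral[:3], r0))  # one coordination site vacant
```

With one site left vacant the resultant is no longer zero and points away from the missing atom, which suggests where a ligand donor atom would have to sit to restore the balance.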

CONCLUSION

Modeling reality is a great challenge in the field of small molecule / protein

docking. A number of new global search algorithms have been developed and

implemented in over 60 docking programs. However, the true challenge is to identify the

correct pose (i.e., assign it the best score) among a plethora of poses that are generated

through this search. Factors such as ionic strength, binding entropy and metal/ligand

interaction energies are very challenging problems that still need to be solved in the field

of scoring functions for docking programs.15, 94, 95, 158

We believe that the major challenge

left in modeling reality is a better understanding of the link between the conformation of

the ligand/protein/water/salt multicomponent system and its score.

1.2 APPLICATION OF COMPUTATIONAL TECHNIQUES TO ASYMMETRIC

CATALYST DEVELOPMENT.

Financial and environmental pressure in the drug discovery and development field

requires that novel drugs are found quickly, efficiently and cheaply. To fulfill these

requirements, computational techniques have found their way into the toolkit of

medicinal chemists in the pharmaceutical industry providing a viable alternative to

experimental approaches such as high throughput screening.1-3

In fact, there are now

many predictive methods (e.g., QSAR, docking, combinatorial library profiling) available

to drug discovery and development chemists.15

Although these methods are based on

approximations, they are accurate enough to yield a higher rate of finding lead molecules

when screening libraries of millions compared to the traditional experimental

approaches.7-9

These techniques yield small libraries enriched in bioactive

molecules, and the small number of potentially missed bioactive molecules rarely

outweighs the speed and cost savings of screening a library in silico.

Surprisingly, the significant advances in the field of computational drug design and

development have not stimulated the development of many other VS methods in other

chemical fields such as asymmetric catalysis. Computational tools in the field of

asymmetric catalysis are typically used to rationalize the outcome of a given reaction post

facto rather than to predict it. The ability to predict the stereomeric excess of a reaction

would enable organic chemists to quickly test out new asymmetric catalyst structures and

to prioritize a few of them for synthesis.

The lack of quick yet predictive computational tools for organic chemists,

when compared to the field of drug design and development, has one major origin. To

accurately discriminate an excellent from a poor asymmetric catalyst, it is necessary to

predict the difference in transition state energies (necessary to compute the stereomeric

excess) to within less than 1 kcal/mol. On the other hand, discriminating drug hits from

non-binders requires a lower resolution, on the order of 3 to 5 kcal/mol, and looks at ground

state structures.
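The sensitivity of the stereomeric excess to the transition state energy difference can be made explicit with a short calculation. The sketch below assumes kinetic control and that the two competing pathways differ only in their activation free energies, so that the product ratio is exp(ΔΔG‡/RT) and the excess equals tanh(ΔΔG‡/2RT); the function name and the chosen temperature are assumptions of this illustration.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)

def stereomeric_excess(ddg_kcal, temperature=298.15):
    """Fractional excess of the major stereoisomer from the transition-state energy gap."""
    return math.tanh(ddg_kcal / (2.0 * R_KCAL * temperature))

for ddg in (0.5, 1.0, 2.0, 3.0):
    print(f"{ddg:.1f} kcal/mol -> {100 * stereomeric_excess(ddg):.0f}% excess")
```

Under these assumptions a gap of 1 kcal/mol at room temperature corresponds to an excess of roughly 69%, so an error of that size in the computed energies is enough to confuse a mediocre catalyst with an excellent one.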

Drug design methods, such as docking programs, use scoring functions to predict

the ligand binding affinity, with many methods using force fields to calculate the

potential energy of the ligand-protein complex.15

However, force fields have been

developed to simulate the ground state of molecules and cannot be applied directly to the

computation of transition state energies. To calculate the transition state energy, the most

accurate although time-consuming approach is to use quantum mechanics (QM). To

enable time-efficient screening, computational organic chemists have developed methods

and programs that enable a faster calculation of stereomeric excesses.

QUANTUM MECHANICS PREDICTIONS OF STEREOMERIC EXCESS

Quantum mechanics has been exploited to rationalize experimental results and

provide valuable insight into the reaction pathway of many reactions.159, 160

Great care

must be taken when selecting a basis set for QM methods since smaller and quicker

methods may be able to give qualitative answers but not the quantitative predictions desired.

Although QM can calculate the transition state structures and energies very

accurately,161

it still lacks the speed required for the development of a QM-based VS tool

and can hardly be applied to large catalytic systems. To address this last issue, it is

possible to use a hybrid technique, called QM/MM, which treats the reacting part of the

system with QM and the rest with molecular mechanics (MM).

Using Quantum Mechanics to Predict Stereoselectivities and Transition States. One of

the reactions most studied using QM is the proline-catalysed aldol reaction.162-169

There were initially four proposed mechanisms for the proline catalysed aldol reaction

(see Figure 1.6).170-174

QM methods were extremely helpful in resolving which

mechanism is most likely to occur. The dual proline mechanism (Figure 1.6D) was

discounted after kinetic studies and theoretical experiments showing that this mechanism

is energetically disfavoured.164, 175

QM methods (B3LYP/6-31G*) showed that the

carboxylic acid mechanism (see Figure 1.6B) is favoured by 7.4 kcal/mol over the

carbinolamine mechanism (see Figure 1.6A) and by 30.5 kcal/mol over the enaminium

mechanisms (see Figure 1.6C).168

These insights into the mechanism of this reaction

demonstrated the usefulness of computational techniques in studying the transition states

of reactions. Based on this, QM methods were next used to predict the stereochemical outcome

of asymmetric aldol reactions. This study revealed their ability to predict literature

stereomeric excesses with high accuracy, in most cases within a few percent.166

Figure 1.6 - Proposed mechanisms for proline catalyzed aldol reaction

Prediction-based design of new catalysts inducing high stereoselectivities was also

carried out.169, 176

Shinisha et al.176

went on to study bicyclic analogues of proline (see

Figure 1.7) using QM methods (B3LYP/6-31G*). Overall, all these catalysts, even though

synthetically challenging, are predicted to give better selectivities than

proline.

Figure 1.7 - Example of bicyclic analogue studied by Shinisha et al.176

Another example of where QM methods aided in post-facto rationalization of a

reaction is the osmium tetraoxide asymmetric dihydroxylation of alkenes. There were

many discussions about the mechanistic picture of the asymmetric dihydroxylation of

alkenes facilitated by osmium tetraoxide. These discussions centered around two main

themes (see Figure 1.8): 1) a [3+2] cycloaddition that directly formed the osmium

glycolate product177

and 2) a stepwise [2+2] mechanism which begins with the

formation of the osmium-alkene complex, followed by [2+2] addition and ring

expansion to form the desired osmium glycolate.178

Figure 1.8 - Proposed mechanisms for osmium tetraoxide asymmetric dihydroxylation of

alkenes. L = ligand.

These competing possibilities for the mechanism of this reaction led to the use of

QM techniques to resolve this dispute.179-181

The discussion culminated when Sharpless

showed with QM methods (B3LYP/3-21G for the osmium and B3LYP/6-31G* for all

other atoms) that the [3+2] mechanism is significantly energetically favoured over the

[2+2].182

Although QM methods were instrumental in the understanding of the

mechanism, their application to the prediction of stereochemical outcomes was limited by

the required size of the asymmetric catalysts for this reaction. As discussed below, other

less intensive methods were required.

Another example of the application of QM methods to the prediction of transition state
structures and energies is the Diels-Alder reaction. From frontier molecular orbital
theory to high-level QM methods, theory has long been used to predict the
regioselectivity and stereoselectivity of this reaction.183 For example, the Diels-Alder
reaction catalysed by imidazolidinones (see Figure 1.9) was studied by Gordillo and
Houk184 with QM methods (B3LYP/6-31G*), which showed excellent agreement when predicting
endo:exo ratios (diastereoselectivities). However, they obtained poorer correlations
with enantioselectivities. The quality of this correlation with experiment has been
shown to vary greatly with the QM method used185-189 and therefore a careful selection
of the QM method is required.

Figure 1.9 - Imidazolidinone catalysed Diels-Alder Reaction

Even with these prime examples of how QM methods aided in rationalizing reaction
mechanisms and showed some promise in their ability to predict stereochemical outcomes
post facto, there had been no de novo prediction of asymmetric catalysts using QM
methods until recently.190, 191 The proline-catalysed Mannich reactions (see Figure
1.10) are known to be selective for products with a syn orientation, but when proline
was replaced with pipecolinic acid a mixture of syn and anti products resulted. QM
methods (HF/6-31G*) were used to study the transition state of this reaction. Several
catalysts were then proposed which should theoretically be selective for the anti
orientation. Once synthesised, these compounds were indeed shown to be highly selective
for the anti product.

Figure 1.10 - Mechanism for the proline-catalyzed Mannich reaction.

Using QM/MM Techniques for Predictions of Stereomeric Excess and Transition State
Structures. The advantages of using QM/MM hybrid approaches for the prediction of
stereomeric excess and transition states are their applicability to large systems and
their speed relative to pure QM methods.192 This is achieved by using QM methods on the
atoms involved in the bond breaking and forming and MM on the rest of the molecule.

One of the only examples of QM/MM methods applied to the prediction of stereoselectivity
is the dihydroxylation of n-alkenes (see Figure 1.11).193 Initial studies using styrene
demonstrated the usefulness of QM/MM applied to this reaction, with the predicted
stereoselectivity for styrene closely matching the experimental result (99.4 %ee
predicted, 96 %ee observed).194 Studying n-alkenes proved more difficult because of the
explosion of possible conformations on going from propene to 1-decene. Nevertheless,
using the QM/MM approach, stereoselectivities were predicted with reasonable correlation
with experimental results.

Figure 1.11 - QM/MM study of the Sharpless dihydroxylation (QM in blue, MM in black).

APPLICATION OF VIRTUAL SCREENING TECHNIQUES TO THE FIELD OF ASYMMETRIC

CATALYST DEVELOPMENT.

QM and hybrid QM/MM methods have shown promise as tools to design novel

asymmetric catalysts. However, QM is still significantly too slow to be advantageously

used to screen or design novel structures as compared to experimental stepwise

optimization or screening. This lack of speed has led to the transfer of techniques

primarily used in the field of drug design and development to the field of asymmetric

catalyst development. Two options are quantitative structure activity relationship

methods (QSAR) and molecular mechanics.

Quantitative Structure Selectivity Relationship (QSSR). When QSAR techniques are

applied to the field of asymmetric catalyst development, they are rebranded QSSR since

selectivity and not activity is the desired predicted property. QSSR is defined as the

process that relates chemical structure quantitatively to a chemical process.11

In essence, the simplest QSSR technique relates a series of descriptors, whether they
are constitutional, topological, geometrical or physicochemical, to chemical
structures.195, 196 Chavali et al.197, 198 used molecular indices which described
electronic structures and connectivities to predict catalytic activity and toxicity.
Even though this technique was not used to predict stereomeric excess, it demonstrated
the potential of QSSR techniques to predict chemical properties.

It is also desirable to relate structural features directly to Gibbs free energy. Based

on this, Oslob et al. predicted selectivities in palladium catalysed allylation (see Figure

1.12). It was postulated that the reactivity of the terminal allyl carbon can be evaluated by

a linear free energy relationship using descriptors such as bond distance, angles and a

series of dihedral angles which describe the relative position of the allyl group, the

palladium atom and the ligand. The stereomeric ratio was found to be best predicted
when using only 4 descriptors: the breaking Pd-C bond distance, two dihedrals describing
the in-plane distortion and displacement of the allyl group and, finally and most
influentially, the energy increase associated with the incoming nucleophile. This energy
increase was calculated as the energy difference between the palladium allyl complex
alone and the minimized complex in the presence of the nucleophile. Overall, this
technique yielded good correlations between experimental and predicted
stereoselectivities. This work showed promise, and the use of geometrical descriptors to
relate chemical structure to selectivities can most likely be applied to other reactions.
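A linear free energy relationship of this kind can be written compactly. The expression below is a generic form given for illustration only: the coefficients $c_i$ (fitted to experimental selectivities) and the symbols $d_i$ (the four descriptors) are notation introduced here and are not taken from the original study.

```latex
\ln\frac{[\text{major}]}{[\text{minor}]} = -\frac{\Delta\Delta G^{\ddagger}}{RT},
\qquad
\Delta\Delta G^{\ddagger} = G^{\ddagger}_{\text{major}} - G^{\ddagger}_{\text{minor}}
\approx c_0 + \sum_{i=1}^{4} c_i\, d_i
```

Once the coefficients have been fitted on known systems, evaluating the descriptors for a new ligand gives an estimate of the stereomeric ratio without a new transition state calculation.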

Figure 1.12 - Palladium catalysed allylation

Alvarez et al.199 used a method termed continuous chirality measure to determine which
portions of a molecule are responsible for its chirality and induce stereoselectivity.
The continuous chirality measure quantifies the level of chirality of the molecule. The
method was first used to rationalize the stereoselectivity of the
bis(oxazoline)copper(II) Diels-Alder reaction. They deduced that the chirality is mainly
induced by the C5 flaps (see Figure 1.13, portion in blue), which affect the orientation
of the diene. Very good correlation between experimental stereoselectivities and
continuous chirality measures was observed. Upon further investigation, a new
bis(oxazoline)copper(II) catalyst was proposed to be highly stereoselective (catalyst in
Figure 1.13) but was unfortunately not synthesised.

Figure 1.13 - bis(oxazoline)copper(II) catalysed Diels-Alder. C5 flaps shaded in Blue.

Quantum molecular interaction fields200 (see Figure 1.14) are an alternative to
structural descriptors for QSSR. Kozlowski and co-workers superimposed optimized TS
conformations of a series of known catalysts onto a grid. At each point on this grid the
interaction energy between the molecule under investigation and a carbon 2s electron
probe is calculated using QM methods. Regression analysis is performed on the computed
grids to find regions common to all catalysts where increases in the energy of the probe
result in increases (green region in Figure 1.14) or decreases (red region in Figure
1.14) in stereoselectivities. Once a set of known catalysts has undergone this
treatment, the resulting model can be used to predict the selectivity of new catalysts.
This technique has been applied to a series of reactions and showed good correlation
between predicted and experimental selectivities.200-204

Figure 1.14 - QSSR using Quantum Mechanical Interaction Field analysis in the design

of chiral amino alcohols for alkyl addition to aldehydes.

Using Molecular Mechanics to Predict Transition States. Using descriptors to determine

the link between chemical structure and transition states is a quick alternative to QM

methods, but it still requires intimate knowledge of the geometry of the transition state

structure. If the geometry is not known, it may be difficult to accurately predict

stereoselectivity with QSSR techniques and using QM methods to perform a full

conformational search requires a significant investment in time and expertise. To
efficiently search the conformational space of the transition state, molecular mechanics
is a viable alternative159, 205 provided two issues are addressed. The first is that the
force field being used must be applicable to transition state modeling. In most cases,
force fields have been developed to predict the ground state structures and energies of
molecules. It is therefore necessary to derive FF parameters for transition states. The
second issue is the conformational search. Traditional conformational search engines
locate minima (ground states) and not saddle points (transition states).

To address these two issues, transition state force fields (TSFF) model the transition
state of a reaction as a minimum on a PES. The simplest TSFF freezes or constrains the
breaking or forming bonds, and the angles and dihedrals that involve them, at their
optimum geometry. This strategy is known as a rigid transition state model. A model
system is first developed using QM or crystallographic methods159 and then used to
derive the equilibrium values for the force field. These interactions can then be added
to the force field with large force constants to effectively constrain the atoms
concerned to the transition state geometry. This model is sound because no significant
change in the geometry of the transition state is usually observed from one catalyst
and/or reactant to the next, and the rest of the catalyst and reactant, which is not
involved in the reaction but induces the stereoselectivity, can be assumed to be in its
ground state.
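The idea of adding interactions with large force constants can be sketched as a harmonic restraint term. The example below is a minimal illustration and not a published TSFF: the force constant (1000 kcal/mol/Å²), the reference distance and the function name `restraint_energy` are all assumptions made for the example.

```python
import math

def restraint_energy(coords, restraints, force_constant=1000.0):
    """Sum of harmonic penalties k * (r - r0)**2 over restrained atom pairs.

    coords     : dict mapping atom index -> (x, y, z) in Angstrom
    restraints : dict mapping (i, j) -> reference distance r0 taken from a
                 model transition state (hypothetical values here)
    """
    energy = 0.0
    for (i, j), r0 in restraints.items():
        r = math.dist(coords[i], coords[j])
        energy += force_constant * (r - r0) ** 2
    return energy

# Hypothetical forming bond held at 2.2 Angstrom.
ts_restraints = {(0, 1): 2.2}
on_target = {0: (0.0, 0.0, 0.0), 1: (2.2, 0.0, 0.0)}
distorted = {0: (0.0, 0.0, 0.0), 1: (2.3, 0.0, 0.0)}
print(restraint_energy(on_target, ts_restraints))
print(restraint_energy(distorted, ts_restraints))
```

With a force constant this large, even a 0.1 Å deviation costs about 10 kcal/mol, so a minimizer keeps the reacting atoms at the model geometry while the rest of the structure relaxes.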

The first use of a TSFF was for a theoretical study of hydroborations by Houk et al.

(see Figure 1.15).206

A model system consisting of ethylene and BH3 was computed using

HF/3-21G* and used to constrain the atoms involved in the transition state. Chiral
boranes reported in the literature were built and the stereoselectivities were
calculated. If the energies of multiple conformations of a single transition state were
within a few kcal/mol of each other, a Boltzmann distribution over all conformations was
used to determine the stereoselectivity. This simple method showed that molecular
mechanics can be used to predict the stereoselectivity of a reaction with good accuracy,
usually within 10-15%. Even though this approach proved to be less accurate than QM
methods, its ability to screen compounds with higher throughput made it applicable to
the VS of new catalysts.
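Boltzmann weighting over several conformers of each competing transition state can be sketched as follows. This is an illustration under stated assumptions (kinetic control, 298.15 K, energies in kcal/mol) rather than the procedure of the original study; the function name and the conformer energies are hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def boltzmann_excess(energies_a, energies_b, temperature=298.15):
    """Stereomeric excess (%) in favour of isomer A, from the conformer
    energies (kcal/mol) of the transition states leading to A and to B."""
    rt = R_KCAL * temperature
    e_ref = min(min(energies_a), min(energies_b))  # shift for numerical stability
    w_a = sum(math.exp(-(e - e_ref) / rt) for e in energies_a)
    w_b = sum(math.exp(-(e - e_ref) / rt) for e in energies_b)
    return 100.0 * (w_a - w_b) / (w_a + w_b)

# Hypothetical conformer energies for the two diastereomeric transition states.
to_isomer_a = [0.0, 0.8, 1.5]
to_isomer_b = [1.2, 1.9]
print(f"{boltzmann_excess(to_isomer_a, to_isomer_b):.1f} % excess of A")
```

Summing over conformers matters when several lie within a few kcal/mol of the lowest one, because each contributes a non-negligible population to its stereoisomer.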

Figure 1.15 - Hydroboration of alkenes

Moitessier et al. also used a rigid transition state model TSFF to aid in the
rationalization of the unexpected outcome of the dihydroxylation of a benzyl protected
allyl xylose. In this study, the isolated isomer was opposite to the one expected from
the Sharpless mnemonic (see Figure 1.16).207, 208

Figure 1.16 - Dihydroxylation of xylose.

An initial transition state model was built using results from a previous study by
Delmonte et al.182 This model was then used as the core of the transition state and was
frozen during the optimization of the catalyst using a modified CFF91 force field. Since
only a new transition state structure could lead to the opposite stereoisomer, a
conformational search was needed, for which a genetic algorithm was used. To validate
and optimize the protocol, two catalysts were first investigated (1 and 2, see Figure
1.17). With these systems two binding modes were identified, one corresponding to the
Corey model (alkene sandwiched between the two walls of the catalyst) and the other
corresponding to the Sharpless model (alkene interacting with the floor). Even though
the Sharpless [2+2] model was discounted, the proposed conformation was still predicted
to exist. With this information validating the genetic algorithm, the protocol was next
applied to model the dihydroxylation of the benzyl protected xylose 3. The unexpected
isomer, opposite to the one predicted by the Sharpless mnemonic, resulted from the
alkene being too large to adopt either binding mode. Instead, the protected allyl xylose
encompassed the catalyst (see Figure 1.17, 3). With these promising results, a VS of
alkenes was undertaken. The accuracy of the predicted stereoselectivities was lower than
that of pure QM methods, but the predictions were obtained in a fraction of the time. In
addition, the protocol's ability to reproduce the ranked list shows its promise as a
method for VS.
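A genetic algorithm for a conformational search of this kind can be sketched in a few lines. The code below is a generic illustration, not the published protocol: the chromosome is a list of torsion angles, `toy_energy` is a stand-in for a force field energy evaluated with the transition state core frozen, and all names and settings (population size, mutation rate, seed) are assumptions.

```python
import math
import random

def toy_energy(torsions):
    """Stand-in energy: a threefold torsional term per angle (not a real force field)."""
    return sum(1.0 + math.cos(3.0 * math.radians(t)) for t in torsions)

def genetic_search(energy, n_torsions, pop_size=30, generations=40,
                   mutation_rate=0.2, seed=0):
    """Minimise `energy` over torsion angles (degrees) with a simple GA."""
    rng = random.Random(seed)
    population = [[rng.uniform(-180.0, 180.0) for _ in range(n_torsions)]
                  for _ in range(pop_size)]
    history = []  # best energy found in each generation
    for _ in range(generations):
        population.sort(key=energy)
        history.append(energy(population[0]))
        parents = population[: pop_size // 2]  # fittest half survives unchanged (elitism)
        children = []
        while len(parents) + len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            # Uniform crossover: each torsion is inherited from one parent.
            child = [rng.choice(pair) for pair in zip(mother, father)]
            # Mutation: occasionally perturb a torsion by a Gaussian step.
            child = [t + rng.gauss(0.0, 15.0) if rng.random() < mutation_rate else t
                     for t in child]
            children.append(child)
        population = parents + children
    best = min(population, key=energy)
    history.append(energy(best))
    return best, history

best, history = genetic_search(toy_energy, n_torsions=4, seed=1)
print([round(t, 1) for t in best], round(history[-1], 4))
```

Because the fittest half of each generation is carried over unchanged, the best energy never increases from one generation to the next. In a real application each distinct low-energy conformer would be kept, since different binding modes can lead to different stereoisomers.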

Figure 1.17 - Sharpless dihydroxylation catalysts studied for the optimization and
validation of the genetic algorithm (black = catalyst, blue = alkene, green = frozen
atoms). Above: 3D representation; below: schematic.

A similar technique, termed reverse docking, has been developed by Harriman and
Deslongchamps.209 Whereas traditional docking places a flexible ligand into a rigid
protein, reverse docking places a rigid transition state into a flexible catalyst,
thereby modeling the reactant/catalyst transition structure. The resulting structure can
then be used to predict stereoselectivities. The initial version of this method was
based on the AutoDock67 conformational search engine and HF/6-31G* transition states.
Its application to the azidation of α,β-unsaturated carbonyls with Miller's catalyst
validated the approach (see Figure 1.18A). This first version was able to predict the
conformation of the catalyst and the favoured stereoisomer, although the stereomeric
excesses were poorly reproduced. The method was then developed into an independent
program called EM-Dock, recently implemented within MOE.210 This new version was applied
to the TADDOL-catalysed asymmetric hetero Diels-Alder reaction (see Figure 1.18B) but
was still unable to predict stereomeric excess.210 In these first two versions, the
score for the van der Waals and electrostatic interactions between the reactants and the
catalyst was calculated using a grid-based energy method similar to the one implemented
in AutoDock.38, 67, 211 Because the use of the grid resulted in large energy differences
between the stereoisomers, the protocol was modified to use a pair-wise potential. The
application of this new scoring method to the TADDOL-catalysed asymmetric hetero
Diels-Alder reaction212 and the organocatalyzed asymmetric Strecker hydrocyanation of
aldimines and ketimines213 (see Figure 1.18C) resulted in the desired decrease in these
energy differences. This change in scoring led to accurate predictions of
stereoselectivities when compared to experimental results.
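A pair-wise potential evaluates every catalyst-reactant atom pair at its exact distance, rather than interpolating from values precomputed on a grid. The sketch below uses a generic 12-6 Lennard-Jones term plus a Coulomb term with Lorentz-Berthelot combining rules; this functional form, the parameters and the function name are assumptions made for illustration and are not necessarily those of the published method.

```python
import math

COULOMB = 332.0637  # kcal*Angstrom/(mol*e^2), converts q1*q2/r to kcal/mol

def pairwise_energy(atoms_a, atoms_b):
    """Non-bonded energy (kcal/mol) between two sets of atoms.

    Each atom is a tuple (xyz, charge, epsilon, sigma); epsilon in kcal/mol,
    sigma in Angstrom. Parameters are combined with Lorentz-Berthelot rules.
    """
    total = 0.0
    for xyz_a, q_a, eps_a, sig_a in atoms_a:
        for xyz_b, q_b, eps_b, sig_b in atoms_b:
            r = math.dist(xyz_a, xyz_b)
            eps = math.sqrt(eps_a * eps_b)
            sig = 0.5 * (sig_a + sig_b)
            sr6 = (sig / r) ** 6
            total += 4.0 * eps * (sr6 * sr6 - sr6)  # van der Waals term
            total += COULOMB * q_a * q_b / r        # electrostatic term
    return total

# Two hypothetical neutral atoms placed at the Lennard-Jones minimum distance.
r_min = 3.4 * 2.0 ** (1.0 / 6.0)
atom_1 = [((0.0, 0.0, 0.0), 0.0, 0.1, 3.4)]
atom_2 = [((r_min, 0.0, 0.0), 0.0, 0.1, 3.4)]
print(pairwise_energy(atom_1, atom_2))
```

The cost of the double loop grows with the product of the two atom counts, which is the price paid for avoiding grid interpolation error.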

Figure 1.18 - Reactions studied with reverse docking: A) azidation of α,β-unsaturated
carbonyls, B) TADDOL-catalysed asymmetric hetero Diels-Alder reaction and C)
organocatalyzed asymmetric Strecker hydrocyanation of aldimines.

In all these MM studies, the atoms involved directly in the formation of new

interactions were frozen, an approximation which is often reasonable. However, when the

catalyst structure changes drastically from one reactant to the next, flexible transition

state methods are desirable.

An easy extension of the rigid transition state TSFF would be to use smaller force
constants for the interactions that have been frozen. The challenge is determining what
these values should be. MMX was developed so that the equilibrium bond lengths and force
constants are a function of bond order.214 An issue arises since bond orders are not
explicitly known for transition states and, in reality, force constants may not be
directly proportional to bond order. ReaxFF is a similar method which allows the bond
order to vary as a function of the bond distance.215-217 To overcome this problem,
Norrby et al.218 developed the Q2MM method, in which the TSFF is developed entirely from
QM calculations. This method has been applied to many reactions and has been shown to be
highly accurate for the prediction of stereomeric excess.218-222 However, because QM is
used to develop the parameters, a significant investment in time and expertise is still
required. Q2MM has been applied to the asymmetric dihydroxylation reaction (see Figure
1.8) for the prediction and rationalization of selectivities,220, 223-225 and fairly
good correlations between predicted and experimental selectivities were achieved.224
Q2MM has also been applied to the Horner-Wadsworth-Emmons reaction218, 226 (see Figure
1.19). This reaction involves two transition states, necessitating the development of
parameters for both transition states and the study of multiple diastereomeric pathways
to identify the rate-limiting step. Because the energy difference between TS1 and TS2
could not be determined accurately, predicting selectivities was difficult and accurate
predictions were obtained only for high selectivities (i.e., above 95%).226

Figure 1.19 - Mechanism of the Horner-Wadsworth-Emmons reaction.

Another option is to approximate the transition state as the intersection of two
ground states interacting through a mixing term. This technique, known as the empirical
valence bond method (EVB) (see Figure 1.20), was suggested by Warshel and Weiss.227, 228

Figure 1.20 - Mixing of two ground states to find transition state. (Plot of energy
against reaction coordinate showing the curves Ereactant, Eproduct and E.)

Although it was initially used to simulate enzyme reactions, it can also be applied to
organic reactions. For EVB to predict a transition state, the relative energies of the
reactant and product are assumed to be similar. This restriction comes from the force
field itself. Most force fields are only meant to reproduce heats of formation and to
compare relative energies of molecules with identical connectivities. Another issue
arises when structures are far from the energy minimum. Force fields have difficulty
with distorted structures, and some force fields, more specifically class I force
fields, do not accurately describe the van der Waals energy term at short distance
(i.e., steep Lennard-Jones potential). To overcome this shortcoming, more complex
functions that better represent the PES for distorted structures can be used, such as
the Morse potential in MM3.229-233 EVB creates a model PES by using a weighted sum of
the energies of the reactants and products.

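$$E_{\mathrm{model}} = (1-\lambda)\,E_{\mathrm{reactant}} + \lambda\,E_{\mathrm{product}} \qquad (1.1)$$

These energies are then projected onto the true PES using the mixing term.

$$E = E_{\mathrm{model}} + E_{\mathrm{mix}} \qquad (1.2)$$

This mixing term describes the mixing between the reactant and product PES.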

Typically, λ values close to 0.5 correspond to a minimum on the model PES, allowing for
adequate searching of the transition state (see Figure 1.21). The reaction force field
(RFF)234 and multiconfiguration molecular mechanics (MCMM)235, 236 are similar
approaches. These techniques have been validated by simulating reaction pathways but
have not been directly applied to the prediction of stereomeric excesses.
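The role of λ can be illustrated with a one-dimensional toy model. The sketch below assumes that the weighted sum takes the form (1 - λ)·E_reactant + λ·E_product, that both ground states are harmonic wells with equal force constants and equal minimum energies, and that the reaction coordinate runs from 0 (reactant) to 1 (product). None of these choices comes from the cited methods, and the function names are hypothetical.

```python
def diabatic(x, x0, k=100.0):
    """Harmonic ground-state energy centred at x0 (toy model)."""
    return 0.5 * k * (x - x0) ** 2

def model_energy(x, lam, x_reactant=0.0, x_product=1.0):
    """Weighted sum of reactant and product energies at coordinate x."""
    e_r = diabatic(x, x_reactant)
    e_p = diabatic(x, x_product)
    return (1.0 - lam) * e_r + lam * e_p

def model_minimum(lam, x_reactant=0.0, x_product=1.0, steps=1000):
    """Grid search for the coordinate that minimises the model energy."""
    grid = [x_reactant + (x_product - x_reactant) * i / steps
            for i in range(steps + 1)]
    return min(grid, key=lambda x: model_energy(x, lam, x_reactant, x_product))

for lam in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(lam, round(model_minimum(lam), 3))
```

In this symmetric toy model the minimum of the model surface moves smoothly from the reactant to the product as λ increases, and at λ = 0.5 it sits at the point where the two ground-state curves cross. An ordinary minimizer can therefore locate the approximate transition state region.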

Figure 1.21 - All energies are calculated on the model PES then projected onto the true
PES using a mixing term. (Plot of energy against reaction coordinate showing model
curves for λ = 0, 0.25, 0.5, 0.75 and 1.0, together with E and EMix.)

If only the relative energy between two stereoisomers is desired, it is also possible
to neglect the mixing term (equivalent for both stereomeric transition states) and
assume that λ is equal to 0.5 for the transition state. This approach is taken within
the SEAM method.237-240 This method has not been applied to the prediction of
stereoselectivities but rather to the prediction of the geometry of a transition state
and the reactivity of reactants. The initial versions of SEAM237, 238 were validated on
simple reactions such as the SN2 displacement of alkyl halides. These initial studies
showed good correlation between experimental and predicted reactivities. Later studies
went on to examine achiral pericyclic reactions and compared their results against a QM
method (PM3), demonstrating a fairly good correlation.239, 240
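Under the simplifications stated above (mixing term neglected, λ = 0.5), and assuming the model energy is a weighted sum of reactant and product energies, the relative energy of two stereoisomeric transition states reduces to a simple expression. The labels A and B for the two stereoisomeric pathways are introduced here for illustration.

```latex
\Delta E^{\ddagger}_{A-B} \approx \tfrac{1}{2}\left[
\left(E^{A}_{\mathrm{reactant}} + E^{A}_{\mathrm{product}}\right)
- \left(E^{B}_{\mathrm{reactant}} + E^{B}_{\mathrm{product}}\right)\right]
```

Each bracketed sum is evaluated at the geometry that minimises it for the corresponding pathway, so only ground-state force field energies are required.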

CONCLUSION

In summary, it is possible to effectively search for the transition state of reactions
using a plethora of methods (see Figure 1.22). However, caution is needed when selecting
the method. For a complete one-time search of a PES, either QM or EVB methods are
appropriate. For highly accurate prediction of stereomeric excess, QM methods should
also be used. If one wants to perform a virtual screening or computer-aided optimization
of a catalyst, specialized molecular mechanics methods such as TSFF, EVB or SEAM are
better suited.

Figure 1.22 - Summary of methods used to find transition states

1.3 OUTLINE OF THESIS

The need for quick and viable alternatives to experimental approaches such as
high-throughput screening or rational stepwise optimization has led to the development
of computational molecular design tools. These tools now guide chemists and aid in their
search for novel chemical space (i.e., chemical entities). Even though these tools have
been developed, they are based on many assumptions and require new ideas to enable a
quicker and more efficient search for novel molecules.

Chapter 2 describes the development and validation of FITTED1.0, a new tool for
docking small molecules to flexible and hydrated macromolecules. FITTED1.0
incorporates macromolecular flexibility by allowing the use of multiple protein input
structures and a genetic algorithm to perform a conformational search on both the
protein and the ligand. Displaceable bridging waters are accounted for by using a
switching function which turns off the ligand's interaction with the water when it is
too close. Application of FITTED1.0 first demonstrated the importance of including these
features and validated the method, which accurately predicted the binding poses of a
test set of 33 protein-ligand complexes.
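A switching function of the kind described can be sketched as a smooth scale factor applied to the ligand-water interaction energy. The functional form (a cubic smoothstep), the two cutoff distances and the function names below are assumptions chosen for illustration; they are not the function implemented in FITTED.

```python
def water_switch(distance, r_off=1.5, r_on=2.5):
    """Scale factor for the ligand-water interaction.

    Returns 0.0 when the closest ligand atom is within r_off of the water
    (the water is treated as displaced), 1.0 beyond r_on, and a smooth
    cubic interpolation in between. Distances in Angstrom (hypothetical).
    """
    if distance <= r_off:
        return 0.0
    if distance >= r_on:
        return 1.0
    t = (distance - r_off) / (r_on - r_off)
    return t * t * (3.0 - 2.0 * t)

def scaled_water_energy(raw_energy, closest_distance):
    """Ligand-water interaction energy after applying the switch."""
    return water_switch(closest_distance) * raw_energy

for d in (1.0, 2.0, 3.0):
    print(d, water_switch(d), scaled_water_energy(-2.0, d))
```

A smooth switch avoids a discontinuity in the score as the ligand approaches the water, which helps a search algorithm compare poses with and without the bridging water.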

With the initial iteration of FITTED complete, there was a need to increase the speed
and make the program more appropriate for VS applications. Chapter 3 describes the
modifications to FITTED to enable the pruning of virtual libraries using toxicophores
and Lipinski's rules and the inclusion of an automatically created set of interaction
sites to aid in creating a population for the genetic algorithm. FITTED1.5 was shown to
be more accurate than the initial version and allowed for the time-sensitive screening
of a virtual library against HCV polymerase and the identification of novel active
inhibitors.
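Pruning a library with Lipinski's rules can be sketched as a simple filter over precomputed descriptors. The thresholds are the standard rule-of-five values (molecular weight at most 500, logP at most 5, at most 5 hydrogen bond donors and 10 acceptors). The data class, the function names, the example compounds and the choice to tolerate one violation are assumptions for illustration and do not describe how FITTED applies the rules.

```python
from dataclasses import dataclass

@dataclass
class Compound:
    name: str
    mol_weight: float   # g/mol
    logp: float         # octanol-water partition coefficient
    h_donors: int       # hydrogen bond donors
    h_acceptors: int    # hydrogen bond acceptors

def lipinski_violations(c: Compound) -> int:
    """Number of rule-of-five criteria that the compound violates."""
    return sum([
        c.mol_weight > 500.0,
        c.logp > 5.0,
        c.h_donors > 5,
        c.h_acceptors > 10,
    ])

def prune(library, max_violations=1):
    """Keep compounds with at most `max_violations` rule-of-five violations."""
    return [c for c in library if lipinski_violations(c) <= max_violations]

# Hypothetical library entries with made-up descriptor values.
library = [
    Compound("cmpd-1", 342.4, 2.1, 2, 5),
    Compound("cmpd-2", 712.9, 6.3, 7, 12),
]
print([c.name for c in prune(library)])
```

Filtering on cheap descriptors before docking removes compounds that would be discarded later anyway, so the more expensive docking step is run on fewer molecules.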

One of the major discussions of late in the field of docking is how to perform a
comparative study properly, with a focus on which input parameters are used, such as
ligand conformation, protein conformation and the inclusion of waters. An enhanced
version of FITTED (version 2.6) was used in a comparative study, described in chapter 4,
with other major docking programs, and the impact of these parameters on the outcome of
the comparative study is discussed. This study reveals that the traditional way of
comparing programs is not a realistic scenario. In reality, the results of comparative
studies vary greatly depending on the input conformations of the protein and ligand.

With all the VS tools in the field of drug design and development there is a need to

use these techniques as a stepping stone to create VS tools for organic chemists. This led

to the development of ACE1.0, described in chapter 5, a tool for the screening of

asymmetric catalysts. ACE uses a linear combination of reactant and product interactions

to give an approximation of the transition state parameters. Two reactions were initially

screened and the predictions showed excellent correlation with experimental data.

1.4 REFERENCES

1. Richon, A. B., Current status and future direction of the molecular modeling

industry. Drug Discov. Today 2008, 13 (15-16), 665-669.

2. Richon, A. B., An early history of the molecular modeling industry. Drug Discov.

Today 2008, 13 (15-16), 659-664.

3. Guido, R. V. C.; Oliva, G.; Andricopulo, A. D., Virtual screening and its

integration with modern drug design technologies. Curr. Med. Chem. 2008, 15

(1), 37-46.

4. Borman, S., Drugs by design. Chem. Eng. News 2005, 83 (48), 28-30.

5. Clark, D. E., What has virtual screening ever done for drug discovery? Expert

Opin. Drug Discov. 2008, 3 (8), 841-851.

6. Kirchmair, J.; Distinto, S.; Schuster, D.; Spitzer, G.; Langer, T.; Wolber, G.,

Enhancing drug discovery through in silico screening: Strategies to increase true

positives retrieval rates. Curr. Med. Chem. 2008, 15 (20), 2040-2053.

7. Shoichet, B. K.; McGovern, S. L.; Wei, B.; Irwin, J. J., Hits, Leads and Artifacts

from Virtual and High Throughput Screening. Molecular Informatics:

Confronting Complexity, May 13th - 16th, 2002.

8. Parker, C. N., McMaster University data-mining and docking competition:

Computational models on the catwalk. J. Biomol. Screen. 2005, 10 (7), 647-648.

9. Lang, P. T.; Kuntz, I. D.; Maggiora, G. M.; Bajorath, J., Evaluating the high-

throughput screening computations. J. Biomol. Screen. 2005, 10 (7), 649-652.

10. Douguet, D., Ligand-based approaches in virtual screening. Curr. Comput.-Aided

Drug Des. 2008, 4 (3), 180-190.

11. Gedeck, P.; Lewis, R. A., Exploiting QSAR models in lead optimization. Curr.

Opin. Drug Disc. 2008, 11 (4), 569-575.

12. Lengauer, T.; Lemmen, C.; Rarey, M.; Zimmermann, M., Novel technologies for

virtual screening. Drug Discov. Today 2004, 9 (1), 27-34.

13. Auer, J.; Bajorath, J., Molecular similarity concepts and search calculations.

Methods in molecular biology (Clifton, N.J.) 2008, 453, 327-347.

14. Sun, H., Pharmacophore-based virtual screening. Curr. Med. Chem. 2008, 15

(10), 1018-1024.

15. Moitessier, N.; Englebienne, P.; Lee, D.; Lawandi, J.; Corbeil, C. R., Towards the

development of universal, fast and highly accurate docking/scoring methods: A

long way to go. Br. J. Pharmacol. 2008, 153 (SUPPL. 1), S7-S26.

16. Berman, H.; Henrick, K.; Nakamura, H., Announcing the worldwide Protein Data

Bank. Nat. Struct. Mol. Biol. 2003, 10 (12), 980-980.

17. Levinthal, C.; Wodak, S. J.; Kahn, P.; Dadivanian, A. K., Hemoglobin interaction

in sickle cell fibers I: Theoretical approaches to the molecular contacts. Proc.

Natl. Acad. Sci. U. S. A. 1975, 72 (4), 1330-1334.

18. Kuntz, I. D.; Blaney, J. M.; Oatley, S. J.; Langridge, R.; Ferrin, T. E., A geometric

approach to macromolecule-ligand interactions. J. Mol. Biol. 1982, 161 (2), 269-

288.

19. Mizutani, M. Y.; Tomioka, N.; Itai, A., Rational automatic search method for

stable docking models of protein and ligand. J. Mol. Biol. 1994, 243 (2), 310-326.

20. Yamada, M.; Itai, A., Development of an efficient automated docking method.

Chem. Pharm. Bull. 1993, 41 (6), 1200-1202.

21. Bissantz, C.; Folkers, G.; Rognan, D., Protein-based virtual screening of chemical

databases. 1. Evaluation of different docking/scoring combinations. J. Med.

Chem. 2000, 43 (25), 4759-4767.

22. Bursulaya, B. D.; Totrov, M.; Abagyan, R.; Brooks III, C. L., Comparative study

of several algorithms for flexible ligand docking. J. Comput.-Aided Mol. Des.

2003, 17 (11), 755-763.

23. Kontoyianni, M.; McClellan, L. M.; Sokol, G. S., Evaluation of Docking

Performance: Comparative Data on Docking Algorithms. J. Med. Chem. 2004, 47

(3), 558-565.

24. Perola, E.; Walters, W. P.; Charifson, P. S., A detailed comparison of current

docking and scoring methods on systems of pharmaceutical relevance. Proteins

2004, 56 (2), 235-249.

25. Kellenberger, E.; Rodrigo, J.; Muller, P.; Rognan, D., Comparative evaluation of

eight docking tools for docking and virtual screening accuracy. Proteins 2004, 57

(2), 225-242.

26. Cummings, M. D.; DesJarlais, R. L.; Gibbs, A. C.; Mohan, V.; Jaeger, E. P.,

Comparison of automated docking programs as virtual screening tools. J. Med.

Chem. 2005, 48 (4), 962-976.

27. Warren, G. L.; Andrews, C. W.; Capelli, A. M.; Clarke, B.; LaLonde, J.; Lambert,

M. H.; Lindvall, M.; Nevins, N.; Semus, S. F.; Senger, S.; Tedesco, G.; Wall, I.

D.; Woolven, J. M.; Peishoff, C. E.; Head, M. S., A Critical Assessment of

Docking Programs and Scoring Functions. J. Med. Chem. 2006, 49 (20), 5912-

5931.

28. Klebe, G., Virtual ligand screening: strategies, perspectives and limitations. Drug

Discov. Today 2006, 11 (13-14), 580-594.

29. Jalaie, M.; Shanmugasundaram, V., Virtual screening: Are we there yet? Mini-

Rev. Med. Chem. 2006, 6 (10), 1159-1167.

30. Muegge, I.; Oloff, S., Advances in virtual screening. Drug Discovery Today:

Technologies 2006, 3 (4), 405-411.

31. Fara, D. C.; Oprea, T. I.; Prossnitz, E. R.; Bologa, C. G.; Edwards, B. S.; Sklar, L.

A., Integration of virtual and physical screening. Drug Discovery Today:

Technologies 2006, 3 (4), 377-385.

32. Irwin, J. J., Community benchmarks for virtual screening. J. Comput.-Aided Mol.

Des. 2008, 1-7.

33. Murray, C. W.; Baxter, C. A.; Frenkel, A. D., The sensitivity of the results of

molecular docking to induced fit effects: Application to thrombin, thermolysin

and neuraminidase. J. Comput.-Aided Mol. Des. 1999, 13 (6), 547-562.

34. Osterberg, F.; Morris, G. M.; Sanner, M. F.; Olson, A. J.; Goodsell, D. S.,

Automated docking to multiple target structures: Incorporation of protein mobility

and structural water heterogeneity in autodock. Proteins 2002, 46 (1), 34-40.

35. Cavasotto, C. N.; Abagyan, R. A., Protein Flexibility in Ligand Docking and

Virtual Screening to Protein Kinases. J. Mol. Biol. 2004, 337 (1), 209-225.

36. Gehlhaar, D. K.; Verkhivker, G. M.; Rejto, P. A.; Sherman, C. J.; Fogel, D. B.;

Fogel, L. J.; Freer, S. T., Molecular recognition of the inhibitor AG-1343 by HIV-

1 protease: Conformationally flexible docking by evolutionary programming.

Chem. Biol. 1995, 2 (5), 317-324.

37. Ghose, A. K.; Crippen, G. M., Geometrically feasible binding modes of a flexible

ligand molecule at the receptor site. J. Comput. Chem. 1985, 6 (5), 350-359.

38. Goodsell, D. S.; Olson, A. J., Automated docking of substrates to proteins by

simulated annealing. Proteins 1990, 8 (3), 195-202.

39. Corbeil, C. R.; Moitessier, N., Docking Ligands into Flexible and Solvated

Macromolecules. 3. Impact of Input Ligand Conformation, Protein Flexibility and

Water Molecules on Accuracy of Major Docking Programs. J. Chem. Inf. Model.

2008, Submitted.

40. Rarey, M.; Kramer, B.; Lengauer, T.; Klebe, G., A fast flexible docking method

using an incremental construction algorithm. J. Mol. Biol. 1996, 261 (3), 470-489.

41. Jain, A. N., Morphological similarity: A 3D molecular similarity method

correlated with protein-ligand recognition. J. Comput.-Aided Mol. Des. 2000, 14

(2), 199-213.

42. Jain, A., Surflex-Dock 2.1: Robust performance from ligand energetic modeling,

ring flexibility, and knowledge-based search. J. Comput.-Aided Mol. Des. 2007,

21 (5), 281-306.

43. Beautrait, A.; Leroux, V.; Chavent, M.; Ghemtio, L.; Devignes, M. D.; Smaïl-

Tabbone, M.; Cai, W.; Shao, X.; Moreau, G.; Bladon, P.; Yao, J.; Maigret, B.,

Multiple-step virtual screening using VSM-G: Overview and validation of fast

geometrical matching enrichment. J. Mol. Model. 2008, 14 (2), 135-148.

44. Diller, D. J.; Merz, K. M., Jr., High throughput docking for library design and

library prioritization. Proteins 2001, 43 (2), 113-124.

45. Jackson, R. M., Q-fit: A probabilistic method for docking molecular fragments by

sampling low energy conformational space. J. Comput.-Aided Mol. Des. 2002, 16

(1), 43-57.

46. Wu, S. Y.; McNae, I.; Kontopidis, G.; McClue, S. J.; McInnes, C.; Stewart, K. J.;

Wang, S.; Zheleva, D. I.; Marriage, H.; Lane, D. P.; Taylor, P.; Fischer, P. M.;

Walkinshaw, M. D., Discovery of a novel family of CDK inhibitors with the

program LIDAEUS: Structural basis for ligand-induced disordering of the

activation loop. Structure 2003, 11 (4), 399-410.

47. Jain, A. N., Surflex: Fully automatic flexible molecular docking using a molecular

similarity-based search engine. J. Med. Chem. 2003, 46 (4), 499-511.

48. Ewing, T. J. A.; Kuntz, I. D., Critical evaluation of search algorithms for

automated molecular docking and database screening. J. Comput. Chem. 1997, 18

(9), 1175-1189.

49. Yamagishi, M. E. B.; Martins, N. F.; Neshich, G.; Cai, W.; Shao, X.; Beautrait,

A.; Maigret, B., A fast surface-matching procedure for protein-ligand docking. J.

Mol. Model. 2006, 12 (6), 965-972.

50. Zsoldos, Z.; Reid, D.; Simon, A.; Sadjad, B. S.; Johnson, A. P., eHiTS: An

innovative approach to the docking and scoring function problems. Curr. Protein

Pept. Sci. 2006, 7 (5), 421-435.

51. Zsoldos, Z.; Reid, D.; Simon, A.; Sadjad, B. S.; Johnson, A. P., eHiTS: A new

fast, exhaustive flexible ligand docking system. J. Mol. Graph. Modell. 2007, 26

(1), 198-212.

52. Böhm, H. J., The computer program LUDI: a new method for the de novo design

of enzyme inhibitors. J. Comput.-Aided Mol. Des. 1992, 6 (1), 61-78.

53. Nishibata, Y.; Itai, A., Confirmation of usefulness of a structure construction

program based on three-dimensional receptor structure for rational lead

generation. J. Med. Chem. 1993, 36 (20), 2921-2928.

54. Gillet, V.; Myatt, G.; Zsoldos, Z.; Johnson, A., SPROUT, HIPPO and CAESA:

Tools for de novo structure generation and estimation of synthetic accessibility.

Perspect. Drug. Discov. 1995, 3 (1), 34-50.

55. Makino, S.; Ewing, T. J. A.; Kuntz, I. D., DREAM++: Flexible docking program

for virtual combinatorial libraries. J. Comput.-Aided Mol. Des. 1999, 13 (5), 513-

532.

56. Leach, A. R.; Kuntz, I. D., Conformational analysis of flexible ligands in

macromolecular receptor sites. J. Comput. Chem. 1992, 13 (6), 730-748.

57. Ewing, T. J. A.; Makino, S.; Skillman, A. G.; Kuntz, I. D., DOCK 4.0: Search

strategies for automated molecular docking of flexible molecule databases. J.

Comput.-Aided Mol. Des. 2001, 15 (5), 411-428.

58. Schnecke, V.; Swanson, C. A.; Getzoff, E. D.; Tainer, J. A.; Kuhn, L. A.,

Screening a peptidyl database for potential ligands to proteins with side-chain

flexibility. Proteins 1998, 33 (1), 74-87.

59. Schnecke, V.; Kuhn, L. A., Virtual screening with solvation and ligand-induced

complementarity. Perspect. Drug. Discov. 2000, 20, 171-190.

60. Oshiro, C. M.; Kuntz, I. D.; Dixon, J. S., Flexible ligand docking using a genetic

algorithm. J. Comput.-Aided Mol. Des. 1995, 9 (2), 113-130.

61. Jones, G.; Willett, P.; Glen, R. C., Molecular recognition of receptor sites using a

genetic algorithm with a description of desolvation. J. Mol. Biol. 1995, 245 (1),

43-53.

62. Clark, K. P.; Ajay, Flexible ligand docking without parameter adjustment across

four ligand-receptor complexes. J. Comput. Chem. 1995, 16 (10), 1210-1226.

63. Ayala, F. J., Darwin's greatest discovery: Design without designer. Proc. Natl.

Acad. Sci. U. S. A. 2007, 104 (SUPPL. 1), 8567-8573.

64. Abraham, A.; Nedjah, N.; Mourelle, L. d. M., Evolutionary computation: From

genetic algorithms to genetic programming. In Studies in Computational

Intelligence, Nedjah, N.; Macedo Mourelle, L.; Abraham, A., Eds. 2006; Vol. 13,

pp 1-20.

65. Jones, G.; Willett, P.; Glen, R. C.; Leach, A. R.; Taylor, R., Development and

validation of a genetic algorithm for flexible docking. J. Mol. Biol. 1997, 267 (3),

727-748.

66. Gould, S. J., Lamarck and the Birth of Modern Evolutionism in Two-Factor

Theories. In The Structure of Evolutionary Theory, Belknap Harvard: 2002; p 170.

67. Morris, G. M.; Goodsell, D. S.; Halliday, R. S.; Huey, R.; Hart, W. E.; Belew, R.

K.; Olson, A. J., Automated docking using a Lamarckian genetic algorithm and an

empirical binding free energy function. J. Comput. Chem. 1998, 19 (14), 1639-

1662.

68. Corbeil, C. R.; Englebienne, P.; Moitessier, N., Docking Ligands into Flexible

and Solvated Macromolecules. 1. Development and Validation of FITTED 1.0. J.

Chem. Inf. Model. 2007, 47 (2), 435-449.

69. Corbeil, C. R.; Englebienne, P.; Yannopoulos, C. G.; Chan, L.; Das, S. K.;

Bilimoria, D.; Heureux, L.; Moitessier, N., Docking Ligands into Flexible and

Solvated Macromolecules. 2. Development and Application of FITTED 1.5 to the

Virtual Screening of Potential HCV Polymerase Inhibitors. J. Chem. Inf. Model.

2008, 48 (4), 902-909.

70. Abagyan, R.; Totrov, M.; Kuznetsov, D., ICM - A new method for protein

modeling and design: Applications to docking and structure prediction from the

distorted native conformation. J. Comput. Chem. 1994, 15 (5), 488-506.

71. Totrov, M.; Abagyan, R., Flexible protein-ligand docking by global energy

optimization in internal coordinates. Proteins 1997, 29 (SUPPL. 1), 215-220.

72. Friesner, R. A.; Banks, J. L.; Murphy, R. B.; Halgren, T. A.; Klicic, J. J.; Mainz,

D. T.; Repasky, M. P.; Knoll, E. H.; Shelley, M.; Perry, J. K.; Shaw, D. E.;

Francis, P.; Shenkin, P. S., Glide: A New Approach for Rapid, Accurate Docking

and Scoring. 1. Method and Assessment of Docking Accuracy. J. Med. Chem.

2004, 47 (7), 1739-1749.

73. Venkatachalam, C. M.; Jiang, X.; Oldfield, T.; Waldman, M., LigandFit: A novel

method for the shape-directed rapid docking of ligands to protein active sites. J.

Mol. Graph. Modell. 2003, 21 (4), 289-307.

74. Chen, H. M.; Liu, B. F.; Huang, H. L.; Hwang, S. F.; Ho, S. Y., SODOCK:

Swarm optimization for highly flexible protein-ligand docking. J. Comput. Chem.

2007, 28 (2), 612-623.

75. Namasivayam, V.; Günther, R., PSO@AUTODOCK: A fast flexible molecular

docking program based on swarm intelligence. Chem. Biol. Drug Des. 2007, 70

(6), 475-484.

76. Korb, O.; Stützle, T.; Exner, T. E. In PLANTS: Application of ant colony

optimization to structure-based drug design, Lecture Notes in Computer Science

(including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in

Bioinformatics), Brussels, 2006; pp 247-258.

77. Jain, A. N., Bias, reporting, and sharing: Computational evaluations of docking

methods. J. Comput.-Aided Mol. Des. 2008, 22 (3-4), 201-212.

78. Jain, A. N.; Nicholls, A., Recommendations for evaluation of computational

methods. J. Comput.-Aided Mol. Des. 2008, 22 (3-4), 133-139.

79. Hawkins, P. C. D.; Warren, G. L.; Skillman, A. G.; Nicholls, A., How to do an

evaluation: Pitfalls and traps. J. Comput.-Aided Mol. Des. 2008, 22 (3-4), 179-

190.

80. Boström, J.; Greenwood, J. R.; Gottfries, J., Assessing the performance of

OMEGA with respect to retrieving bioactive conformations. J. Mol. Graph.

Modell. 2003, 21 (5), 449-462.

81. 4.11 ring_conf. In LigPrep 2.2 User Manual, Schrödinger, LLC: 2008; p 42.

82. Gasteiger, J.; Rudolph, C.; Sadowski, J., Automatic Generation of 3D-Atomic

Coordinates for Organic Molecules. Tetrahedron Computer Methodology 1990, 3

(6C), 537-547.

83. Rusinko, A., III; Sheridan, R. P.; Nilakantan, R.; Haraki, K. S.; Bauman, N.;

Venkataraghavan, R., Using CONCORD to construct a large database of three-

dimensional coordinates from connection tables. Journal of Chemical Information

and Computer Sciences 1989, 29, 251-255.

84. Guner, O. F.; Henry, D. R.; Pearlman, R. S., Use of flexible queries for searching

conformationally flexible molecules in databases of three-dimensional structures.

Journal of Chemical Information and Computer Sciences 1992, 32, 101-109.

85. Xu, H.; Izrailev, S.; Agrafiotis, D. K., Conformational sampling by self-

organization. J. Chem. Inf. Comput. Sci. 2003, 43 (4), 1186-1191.

86. 15.5 Interface to CORINA. In FlexX Release 3 with GUI User Guide and

Technical Reference, BioSolveIT GmbH: 2008; p 365.

87. Goto, H.; Osawa, E., Corner flapping: A simple and fast algorithm for exhaustive

generation of ring conformations. J. Am. Chem. Soc. 1989, 111 (24), 8950-8951.

88. Goto, H.; Osawa, E., Further Developments in the Algorithm for Generating

Cyclic Conformers - Test with Cycloheptadecane. Tetrahedron Lett. 1992, 33

(10), 1343-1346.

89. Payne, A. W. R.; Glen, R. C., Molecular recognition using a binary genetic search

algorithm. J. Mol. Graph. 1993, 11 (2), 74-91+121.

90. Bursavich, M. G.; Rich, D. H., Designing non-peptide peptidomimetics in the 21st

century: Inhibitors targeting conformational ensembles. J. Med. Chem. 2002, 45

(3), 541-558.

91. Erickson, J. A.; Jalaie, M.; Robertson, D. H.; Lewis, R. A.; Vieth, M., Lessons in

Molecular Recognition: The Effects of Ligand and Protein Flexibility on

Molecular Docking Accuracy. J. Med. Chem. 2004, 47 (1), 45-55.

92. Abagyan, R.; Totrov, M., High-throughput docking for lead generation. Curr.

Opin. Chem. Biol. 2001, 5 (4), 375-382.

93. Shoichet, B. K.; McGovern, S. L.; Wei, B.; Irwin, J. J., Lead discovery using

molecular docking. Curr. Opin. Chem. Biol. 2002, 6 (4), 439-446.

94. Mohan, V.; Gibbs, A. C.; Cummings, M. D.; Jaeger, E. P.; DesJarlais, R. L.,

Docking: Successes and challenges. Curr. Pharm. Des. 2005, 11 (3), 323-333.

95. Sousa, S. F.; Fernandes, P. A.; Ramos, M. J., Protein-ligand docking: Current

status and future challenges. Proteins 2006, 65 (1), 15-26.

96. Coupez, B.; Lewis, R. A., Docking and scoring - Theoretically easy, practically

impossible? Curr. Med. Chem. 2006, 13 (25), 2995-3003.

97. Kroemer, R. T., Structure-based drug design: Docking and scoring. Curr. Protein

Pept. Sci. 2007, 8 (4), 312-328.

98. Barril, X.; Morley, S. D., Unveiling the full potential of flexible receptor docking

using multiple crystallographic structures. J. Med. Chem. 2005, 48 (13), 4432-

4443.

99. Amaro, R. E.; Baron, R.; McCammon, J. A., An improved relaxed complex

scheme for receptor flexibility in computer-aided drug design. J. Comput.-Aided

Mol. Des. 2008, 22 (9), 693-705.

100. Garner, J.; Deadman, J.; Rhodes, D.; Griffith, R.; Keller, P. A., A new

methodology for the simulation of flexible protein-ligand interactions. J. Mol.

Graph. Modell. 2007, 26 (1), 187-197.

101. Withers, I. M.; Mazanetz, M. P.; Wang, H.; Fischer, P. M.; Laughton, C. A.,

Active site pressurization: A new tool for structure-guided drug design and other

studies of protein flexibility. J. Chem. Inf. Model. 2008, 48 (7), 1448-1454.

102. Keseru, G. M.; Kolossvary, I., Fully flexible low-mode docking: Application to

induced fit in HIV integrase. J. Am. Chem. Soc. 2001, 123 (50), 12708-12709.

103. Cavasotto, C. N.; Kovacs, J. A.; Abagyan, R. A., Representing receptor flexibility

in ligand docking through relevant normal modes. J. Am. Chem. Soc. 2005, 127

(26), 9632-9640.

104. Zheng, W.; Doniach, S., A comparative study of motor-protein motions by using a

simple elastic-network model. Proc. Natl. Acad. Sci. U. S. A. 2003, 100 (23),

13253-13258.

105. Mills, G.; Jónsson, H., Quantum and thermal effects in H2 dissociative

adsorption: Evaluation of free energy barriers in multidimensional quantum

systems. Phys. Rev. Lett. 1994, 72 (7), 1124-1127.

106. Jónsson, H.; Mills, G.; Jacobsen, K. W., Nudged elastic band method for finding

minimum energy paths of transitions. In Classical and Quantum Dynamics in

Condensed Phase Simulations, Berne, B. J.; Ciccotti, G.; Coker, D. F., Eds. World

Scientific: Singapore, 1998.

107. Sander, T.; Liljefors, T.; Balle, T., Prediction of the receptor conformation for

iGluR2 agonist binding: QM/MM docking to an extensive conformational

ensemble generated using normal mode analysis. J. Mol. Graph. Modell. 2008, 26

(8), 1259-1268.

108. Mathews, D. H.; Case, D. A., Nudged elastic band calculation of minimal energy

paths for the conformational change of a GG non-canonical pair. J. Mol. Biol.

2006, 357 (5), 1683-1693.

109. Jiang, F.; Kim, S. H., 'Soft docking': Matching of molecular surface cubes. J. Mol.

Biol. 1991, 219 (1), 79-102.

110. Mizutani, M. Y.; Takamatsu, Y.; Ichinose, T.; Nakamura, K.; Itai, A., Effective

handling of induced-fit motion in flexible docking. Proteins 2006, 63 (4), 878-

891.

111. Apostolakis, J.; Plückthun, A.; Caflisch, A., Docking small ligands in flexible

binding sites. J. Comput. Chem. 1998, 19 (1), 21-37.

112. Knegtel, R. M. A.; Kuntz, I. D.; Oshiro, C. M., Molecular docking to ensembles

of protein structures. J. Mol. Biol. 1997, 266 (2), 424-440.

113. Moitessier, N.; Westhof, E.; Hanessian, S., Docking of Aminoglycosides to

Hydrated and Flexible RNA. J. Med. Chem. 2006, 49 (3), 1023-1033.

114. Sotriffer, C. A.; Dramburg, I., "In situ cross-docking" to simultaneously address

multiple targets. J. Med. Chem. 2005, 48 (9), 3122-3125.

115. Leach, A. R., Ligand docking to proteins with discrete side-chain flexibility. J.

Mol. Biol. 1994, 235 (1), 345-356.

116. Schaffer, L.; Verkhivker, G. M., Predicting structural effects in HIV-1 protease

mutant complexes with flexible ligand docking and protein side-chain

optimization. Proteins 1998, 33 (2), 295-310.

117. Anderson, A. C.; O'Neil, R. H.; Surti, T. S.; Stroud, R. M., Approaches to solving

the rigid receptor problem by identifying a minimal set of flexible residues during

ligand docking. Chem. Biol. 2001, 8 (5), 445-457.

118. Bottegoni, G.; Kufareva, I.; Totrov, M.; Abagyan, R., A new method for ligand

docking to flexible receptors by dual alanine scanning and refinement (SCARE).

J. Comput.-Aided Mol. Des. 2008, 1-15.

119. Koska, J.; Spassov, V. Z.; Maynard, A. J.; Yan, L.; Austin, N.; Flook, P. K.;

Venkatachalam, C. M., Fully Automated Molecular Mechanics Based Induced Fit

Protein-Ligand Docking Method. J. Chem. Inf. Model. 2008, 48 (10), 1965-1973.

120. Sherman, W.; Beard, H. S.; Farid, R., Use of an induced fit receptor structure in

virtual screening. Chem. Biol. Drug Des. 2006, 67 (1), 83-84.

121. Kairys, V.; Gilson, M. K., Enhanced docking with the mining minima optimizer:

Acceleration and side-chain flexibility. J. Comput. Chem. 2002, 23 (16), 1656-

1670.

122. Alberts, I. L.; Todorov, N. P.; Dean, P. M., Receptor Flexibility in de Novo

Ligand Design and Docking. J. Med. Chem. 2005, 48 (21), 6585-6596.

123. Alberts, I. L.; Todorov, N. P.; Källblad, P.; Dean, P. M., Ligand docking and

design in a flexible receptor site. QSAR Comb. Sci. 2005, 24 (4), 503-507.

124. Verdonk, M. L.; Cole, J. C.; Hartshorn, M. J.; Murray, C. W.; Taylor, R. D.,

Improved protein-ligand docking using GOLD. Proteins 2003, 52 (4), 609-623.

125. Guilbert, C.; James, T. L., Docking to RNA via Root-Mean-Square-Deviation-

Driven Energy Minimization with Flexible Ligands and Flexible Targets. J.

Chem. Inf. Model. 2008, 48 (6), 1257-1268.

126. Pinto, I. G.; Guilbert, C.; Ulyanov, N. B.; Stearns, J.; James, T. L., Discovery of

ligands for a novel target, the human telomerase RNA, based on flexible-target

virtual screening and NMR. J. Med. Chem. 2008, 51 (22), 7205-7215.

127. Claußen, H.; Buning, C.; Rarey, M.; Lengauer, T., FLEXE: Efficient molecular

docking considering protein structure variations. J. Mol. Biol. 2001, 308 (2), 377-

395.

128. Wei, B. Q.; Weaver, L. H.; Ferrari, A. M.; Matthews, B. W.; Shoichet, B. K.,

Testing a flexible-receptor docking algorithm in a model binding site. J. Mol.

Biol. 2004, 337 (5), 1161-1182.

129. Moitessier, N.; Therrien, E.; Hanessian, S., A Method for Induced-Fit Docking,

Scoring, and Ranking of Flexible Ligands. Application to Peptidic and

Pseudopeptidic beta-secretase (BACE 1) Inhibitors. J. Med. Chem. 2006, 49 (20),

5885-5894.

130. Davis, I. W.; Baker, D., RosettaLigand Docking with Full Ligand and Receptor

Flexibility. J. Mol. Biol. 2009, 385 (2), 381-392.

131. Luty, B. A.; Wasserman, Z. R.; Stouten, P. F. W.; Hodge, C. N.; Zacharias, M.;

McCammon, J. A., A molecular mechanics/grid method for evaluation of ligand-

receptor interactions. J. Comput. Chem. 1995, 16 (4), 454-464.

132. Tatsumi, R.; Fukunishi, Y.; Nakamura, H., A hybrid method of molecular

dynamics and harmonic dynamics for docking of flexible ligand to flexible

receptor. J. Comput. Chem. 2004, 25 (16), 1995-2005.

133. Gervasio, F. L.; Laio, A.; Parrinello, M., Flexible docking in solution using

metadynamics. J. Am. Chem. Soc. 2005, 127 (8), 2600-2607.

134. Ladbury, J. E., Just add water! The effect of water on the specificity of protein-

ligand binding sites and its potential application to drug design. Chem. Biol. 1996,

3 (12), 973-980.

135. Barillari, C.; Taylor, J.; Viner, R.; Essex, J. W., Classification of water molecules

in protein binding sites. J. Am. Chem. Soc. 2007, 129 (9), 2577-2587.

136. Li, Z.; Lazaridis, T., Water at biomolecular binding interfaces. Phys. Chem.

Chem. Phys. 2007, 9 (5), 573-581.

137. Lam, P. Y. S.; Jadhav, P. K.; Eyermann, C. J.; Hodge, C. N.; Ru, Y.; Bacheler, L.

T.; Meek, J. L.; Otto, M. J.; Rayner, M. M.; Wong, Y. N.; Chang, C. H.; Weber,

P. C.; Jackson, D. A.; Sharpe, T. R.; Erickson-Viitanen, S., Rational design of

potent, bioavailable, nonpeptide cyclic ureas as HIV protease inhibitors. Science

1994, 263 (5145), 380-384.

138. Grzesiek, S.; Bax, A.; Nicholson, L. K.; Yamazaki, T.; Wingfield, P.; Stahl, S. J.;

Eyermann, C. J.; Torchia, D. A.; Hodge, C. N.; Lam, P. Y. S.; Jadhav, P.

K.; Chang, C. H., NMR evidence for the displacement of a conserved interior

water molecule in HIV protease by a non-peptide cyclic urea-based inhibitor. J.

Am. Chem. Soc. 1994, 116 (4), 1581-1582.

139. Hodge, C. N.; Aldrich, P. E.; Bacheler, L. T.; Chang, C. H.; Eyermann, C. J.;

Garber, S.; Grubb, M.; Jackson, D. A.; Jadhav, P. K.; Korant, B.; Lam, P. Y. S.;

Maurin, M. B.; Meek, J. L.; Otto, M. J.; Rayner, M. M.; Reid, C.; Sharpe, T. R.;

Shum, L.; Winslow, D. L.; Erickson-Viitanen, S., Improved cyclic urea inhibitors

of the HIV-1 protease: Synthesis, potency, resistance profile, human

pharmacokinetics and X-ray crystal structure of DMP 450. Chem. Biol. 1996, 3

(4), 301-314.

140. Champness, J. N.; Bennett, M. S.; Wien, F.; Visse, R.; Summers, W. C.;

Herdewijn, P.; De Clercq, E.; Ostrowski, T.; Jarvest, R. L.; Sanderson, M. R.,

Exploring the active site of herpes simplex virus type-1 thymidine kinase by X-

ray crystallography of complexes with aciclovir and other ligands. Proteins 1998,

32 (3), 350-361.

141. De Graaf, C.; Pospisil, P.; Pos, W.; Folkers, G.; Vermeulen, N. P. E., Binding

mode prediction of cytochrome P450 and thymidine kinase protein-ligand

complexes by consideration of water and rescoring in automated docking. J. Med.

Chem. 2005, 48 (7), 2308-2318.

142. Roberts, B. C.; Mancera, R. L., Ligand-protein docking with water molecules. J.

Chem. Inf. Model. 2008, 48 (2), 397-408.

143. Huang, N.; Shoichet, B. K., Exploiting ordered waters in molecular docking. J.

Med. Chem. 2008, 51 (16), 4862-4865.

144. Rarey, M.; Kramer, B.; Lengauer, T., The particle concept: Placing discrete water

molecules during protein-ligand docking predictions. Proteins 1999, 34 (1),

17-28.

145. Verdonk, M. L.; Chessari, G.; Cole, J. C.; Hartshorn, M. J.; Murray, C. W.;

Nissink, J. W. M.; Taylor, R. D.; Taylor, R., Modeling water molecules in

protein-ligand docking using GOLD. J. Med. Chem. 2005, 48 (20), 6504-6515.

146. Jiang, L.; Kuhlman, B.; Kortemme, T.; Baker, D., A "solvated rotamer" approach

to modeling water-mediated hydrogen bonds at protein-protein interfaces.

Proteins 2005, 58 (4), 893-904.

147. van Dijk, A. D. J.; Bonvin, A. M. J. J., Solvated docking: Introducing water into

the modelling of biomolecular complexes. Bioinformatics 2006, 22 (19), 2340-

2347.

148. Goodford, P. J., A computational procedure for determining energetically

favorable binding sites on biologically important macromolecules. J. Med. Chem.

1985, 28 (7), 849-857.

149. Wade, R. C.; Goodford, P. J., Further development of hydrogen bond functions

for use in determining energetically favorable binding sites on molecules of

known structure. 2. Ligand probe groups with the ability to form more than two

hydrogen bonds. J. Med. Chem. 1993, 36 (1), 148-156.

150. Pitt, W. R.; Goodfellow, J. M., Modelling of solvent positions around polar

groups in proteins. Protein Eng. 1991, 4 (5), 531-537.

151. Amadasi, A.; Surface, J. A.; Spyrakis, F.; Cozzini, P.; Mozzarelli, A.; Kellogg, G.

E., Robust classification of "relevant" water molecules in putative protein binding

sites. J. Med. Chem. 2008, 51 (4), 1063-1067.

152. Miranker, A.; Karplus, M., Functionality maps of binding sites: A multiple copy

simultaneous search method. Proteins 1991, 11 (1), 29-34.

153. Harding, M. M., The geometry of metal-ligand interactions relevant to proteins.

Acta Crystallogr. D 1999, 55 (8), 1432-1443.

154. Harding, M. M., The geometry of metal-ligand interactions relevant to proteins.

II. Angles at the metal atom, additional weak metal-donor interactions. Acta

Crystallogr. D 2000, 56 (7), 857-867.

155. Harding, M. M., Geometry of metal-ligand interactions in proteins. Acta

Crystallogr. D 2001, 57 (3), 401-411.

156. Harvey, M. A.; Baggio, S.; Baggio, R., A new simplifying approach to molecular

geometry description: The vectorial bond-valence model. Acta Crystallogr. B

2006, 62 (6), 1038-1042.

157. Seebeck, B.; Reulecke, I.; Kämper, A.; Rarey, M., Modeling of metal interaction

geometries for protein-ligand docking. Proteins 2008, 71 (3), 1237-1254.

158. Uberbacher, E.; LoCascio, P.; Passovets, S.; Ghattyvenkatakrishna, P.; Agarwal,

P.; Arnold, N.; Bordner, A.; Gorin, A., Computational challenges for modeling

and simulating biomacromolecular assemblies. Journal of Physics: Conference

Series 2006, 46 (1), 311-315.

159. Houk, K. N.; Paddon-Row, M. N.; Rondan, N. G., Theory and modeling of

stereoselective organic reactions. Science 1986, 231 (4742), 1108-1117.

160. Houk, K. N.; Cheong, P. H. Y., Computational prediction of small-molecule

catalysts. Nature 2008, 455 (7211), 309-313.

161. Lynch, B. J.; Truhlar, D. G., How well can hybrid density functional methods

predict transition state geometries and barrier heights? J. Phys. Chem. A 2001, 105

(13), 2936-2941.

162. Bahmanyar, S.; Houk, K. N., The origin of stereoselectivity in proline-catalyzed

intramolecular aldol reactions. J. Am. Chem. Soc. 2001, 123 (51), 12911-12912.

163. Bahmanyar, S.; Houk, K. N., Transition States of Amine-Catalyzed Aldol

Reactions Involving Enamine Intermediates: Theoretical Studies of Mechanism,

Reactivity, and Stereoselectivity. J. Am. Chem. Soc. 2001, 123 (45), 11273-11283.

164. Rankin, K. N.; Gauld, J. W.; Boyd, R. J., Density Functional Study of the Proline-

Catalyzed Direct Aldol Reaction. J. Phys. Chem. A 2002, 106 (20), 5155-5159.

165. Tang, Z.; Jiang, F.; Yu, L. T.; Cui, X.; Gong, L. Z.; Mi, A. Q.; Jiang, Y. Z.; Wu,

Y. D., Novel small organic molecules for a highly enantioselective direct aldol

reaction. J. Am. Chem. Soc. 2003, 125 (18), 5262-5263.

166. Bahmanyar, S.; Houk, K. N.; Martin, H. J.; List, B., Quantum Mechanical

Predictions of the Stereoselectivities of Proline-Catalyzed Asymmetric

Intermolecular Aldol Reactions. J. Am. Chem. Soc. 2003, 125 (9), 2475-2479.

167. Clemente, F. R.; Houk, K. N., Computational evidence for the enamine

mechanism of intramolecular aldol reactions catalyzed by proline. Angew. Chem.

Int. Ed. 2004, 43 (43), 5766-5768.

168. Allemann, C.; Gordillo, R.; Clemente, F. R.; Cheong, P. H. Y.; Houk, K. N.,

Theory of asymmetric organocatalysis of aldol and related reactions:

Rationalizations and predictions. Acc. Chem. Res. 2004, 37 (8), 558-569.

169. Cheong, P. H. Y.; Houk, K. N., Origins and predictions of stereoselectivity in

intramolecular aldol reactions catalyzed by proline derivatives. Synthesis 2005,

(9), 1533-1537.

170. Spencer, T. A.; Neel, H. S.; Flechtner, T. W.; Zayle, R. A., Observations on amine

catalysis of formation and dehydration of ketols. Tetrahedron Lett. 1965, 6 (43),

3889-3897.

171. Hajos, Z. G.; Parrish, D. R., Asymmetric synthesis of bicyclic intermediates of

natural product chemistry. J. Org. Chem. 1974, 39 (12), 1615-1621.

172. Agami, C.; Meynier, F.; Puchot, C.; Guilhem, J.; Pascard, C., Stereochemistry-59.

New insights into the mechanism of the proline-catalyzed asymmetric Robinson

cyclization; structure of two intermediates. Asymmetric dehydration. Tetrahedron

1984, 40 (6), 1031-1038.

173. Agami, C.; Puchot, C., Kinetic analysis of the dual catalysis by proline in the

asymmetric intramolecular aldol reaction. J. Mol. Catal. 1986, 38 (3), 341-343.

174. Blaney, J. M.; Dixon, J. S., A good ligand is hard to find: Automated docking

methods. Perspect. Drug. Discov. 1993, 1 (2), 301-319.

175. Hoang, L.; Bahmanyar, S.; Houk, K. N.; List, B., Kinetic and stereochemical

evidence for the involvement of only one proline molecule in the transition states

of proline-catalyzed intra- and intermolecular aldol reactions. J. Am. Chem. Soc.

2003, 125 (1), 16-17.

176. Shinisha, C. B.; Sunoj, R. B., Bicyclic proline analogues as organocatalysts for

stereoselective aldol reactions: An in silico DFT study. Org. Biomol. Chem. 2007,

5 (8), 1287-1294.

177. Jørgensen, K. A.; Hoffmann, R., Binding of alkenes to the ligands in OsO2X2 (X

= O and NR) and CpCo(NO)2. A frontier orbital study of the formation of

intermediates in the transition-metal-catalyzed synthesis of diols, amino alcohols,

and diamines. J. Am. Chem. Soc. 1986, 108 (8), 1867-1876.

178. Chong, A. O.; Oshima, K.; Sharpless, K. B., Synthesis of dioxobis(tert-

alkylimido)osmium(VIII) and oxotris(tert-alkylimido)osmium(VIII) Complexes.

Stereospecific vicinal diamination of olefins. J. Am. Chem. Soc. 1977, 99 (10),

3420-3426.

179. Pidun, U.; Boehme, C.; Frenking, G., Theory Rules Out a [2 + 2] Addition of

Osmium Tetroxide to Olefins as Initial Step of the Dihydroxylation Reaction.

Angew. Chem. Int. Ed. 1996, 35 (23-24), 2817-2820.

180. Dapprich, S.; Ujaque, G.; Maseras, F.; Lledos, A.; Musaev, D. G.; Morokuma, K.,

Theory does not support an osmaoxetane intermediate in the osmium-catalyzed

dihydroxylation of olefins. J. Am. Chem. Soc. 1996, 118 (46), 11660-11661.

181. Torrent, M.; Deng, L.; Duran, M.; Sola, M.; Ziegler, T., Density functional study

of the [2+2]- and [2+3]-cycloaddition mechanisms for the osmium-catalyzed

dihydroxylation of olefins. Organometallics 1997, 16 (1), 13-19.

182. DelMonte, A. J.; Haller, J.; Houk, K. N.; Sharpless, K. B.; Singleton, D. A.;

Strassner, T.; Thomas, A. A., Experimental and Theoretical Kinetic Isotope

Effects for Asymmetric Dihydroxylation. Evidence Supporting a Rate-Limiting

"(3 + 2)" Cycloaddition. J. Am. Chem. Soc. 1997, 119 (41), 9907-9908.

183. Ess, D. H.; Jones, G. O.; Houk, K. N., Conceptual, qualitative, and quantitative

theories of 1,3-dipolar and Diels-Alder cycloadditions used in synthesis. Adv.

Synth. Catal. 2006, 348 (16-17), 2337-2361.

184. Gordillo, R.; Houk, K. N., Origins of stereoselectivity in Diels-Alder

cycloadditions catalyzed by chiral imidazolidinones. J. Am. Chem. Soc. 2006, 128

(11), 3543-3553.

185. Bakalova, S. M.; Santos, A. G., A computational study of the Diels-Alder reaction

of ethyl-S-lactyl acrylate and cyclopentadiene. Origins of stereoselectivity. J. Org.

Chem. 2004, 69 (24), 8475-8481.

186. Dinadayalane, T. C.; Vijaya, R.; Smitha, A.; Sastry, G. N., Diels-Alder reactivity

of butadiene and cyclic five-membered dienes ((CH)4X, X = CH2, SiH2, O, NH,

PH, and S) with ethylene: A benchmark study. J. Phys. Chem. A 2002, 106 (8),

1627-1633.

187. Goumans, T. P. M.; Ehlers, A. W.; Lammertsma, K.; Würthwein, E. U.;

Grimme, S., Improved reaction and activation energies of [4+2] cycloadditions,

[3,3] sigmatropic rearrangements and electrocyclizations with the

spin-component-scaled MP2 method. Chem. Eur. J. 2004, 10 (24), 6468-6475.

188. Bakalova, S. M.; Santos, A. G., A theoretical study of the stereoselectivities of the

Diels-Alder addition of cyclopentadiene to ethyl-(S)-lactyl acrylate catalyzed by

aluminium chloride. Eur. J. Org. Chem. 2006, (7), 1779-1789.

189. Jones, G. O.; Guner, V. A.; Houk, K. N., Diels-Alder reactions of

cyclopentadiene and 9,10-dimethylanthracene with cyanoalkenes: The

performance of density functional theory and Hartree-Fock calculations for the

prediction of substituent effects. J. Phys. Chem. A 2006, 110 (4), 1216-1224.

190. Mitsumori, S.; Zhang, H.; Cheong, P. H. Y.; Houk, K. N.; Tanaka, F.; Barbas, C.

F., Direct asymmetric anti-Mannich-type reactions catalyzed by a designed amino

acid. J. Am. Chem. Soc. 2006, 128 (4), 1040-1041.

191. Cheong, P. H. Y.; Zhang, H.; Thayumanavan, R.; Tanaka, F.; Houk, K. N.;

Barbas, C. F., III, Pipecolic acid-catalyzed direct asymmetric Mannich reactions. Org.

Lett. 2006, 8 (5), 811-814.

192. Balcells, D.; Maseras, F., Computational approaches to asymmetric synthesis.

New J. Chem. 2007, 31 (3), 333-343.

193. Drudis-Sole, G.; Ujaque, G.; Maseras, F.; Lledos, A., A QM/MM study of the

asymmetric dihydroxylation of terminal aliphatic n-alkenes with

OsO4·(DHQD)2PYDZ: Enantioselectivity as a function of chain length. Chem.

Eur. J. 2005, 11 (3), 1017-1029.

194. Ujaque, G.; Maseras, F.; Lledós, A., Theoretical study on the origin of

enantioselectivity in the bis(dihydroquinidine)-3,6-pyridazine·osmium

tetroxide-catalyzed dihydroxylation of styrene. J. Am. Chem. Soc. 1999, 121 (6),

1317-1323.

195. Hoogenraad, M.; Klaus, G. M.; Elders, N.; Hooijschuur, S. M.; McKay, B.;

Smith, A. A.; Damen, E. W. P., Oxazaborolidine mediated asymmetric ketone

reduction: Prediction of enantiomeric excess based on catalyst structure.

Tetrahedron Asymmetry 2004, 15 (3), 519-523.

196. Van Der Linden, J. B.; Ras, E. J.; Hooijschuur, S. M.; Klaus, G. M.; Luchters, N.

T.; Dani, P.; Verspui, G.; Smith, A. A.; Damen, E. W. P.; McKay, B.;

Hoogenraad, M., Asymmetric catalytic ketone hydrogenation: Relating substrate

structure and product enantiomeric excess using QSPR. QSAR Comb. Sci. 2005,

24 (1), 94-98.

197. Chavali, S.; Lin, B.; Miller, D. C.; Camarda, K. V., Environmentally-benign

transition metal catalyst design using optimization techniques. Comput. Chem.

Eng. 2004, 28 (5), 605-611.

198. Lin, B.; Chavali, S.; Camarda, K.; Miller, D. C., Computer-aided molecular

design using Tabu search. Comput. Chem. Eng. 2005, 29 (2), 337-347.

199. Alvarez, S.; Schefzick, S.; Lipkowitz, K.; Avnir, D., Quantitative Chirality

Analysis of Molecular Subunits of Bis(oxazoline)copper(II) Complexes in

Relation to Their Enantioselective Catalytic Activity. Chem. Eur. J. 2003, 9 (23),

5832-5837.

200. Kozlowski, M. C.; Dixon, S. L.; Panda, M.; Lauri, G., Quantum mechanical

models correlating structure with selectivity: Predicting the enantioselectivity of

β-amino alcohol catalysts in aldehyde alkylation. J. Am. Chem. Soc. 2003, 125

(22), 6614-6615.

201. Lipkowitz, K. B.; Pradhan, M., Computational studies of chiral catalysts: A

Comparative Molecular Field Analysis of an asymmetric Diels-Alder reaction

with catalysts containing bisoxazoline or phosphinooxazoline ligands. J. Org.

Chem. 2003, 68 (12), 4648-4656.

202. Sciabola, S.; Alex, A.; Higginson, P. D.; Mitchell, J. C.; Snowden, M. J.; Morao,

I., Theoretical Prediction of the Enantiomeric Excess in Asymmetric Catalysis.

An Alignment-Independent Molecular Interaction Field Based Approach. J. Org.

Chem. 2005, 70 (22), 9025-9027.

203. Ianni, J. C.; Annamalai, V.; Phuan, P.-W.; Panda, M.; Kozlowski, M. C., A Priori

Theoretical Prediction of Selectivity in Asymmetric Catalysis: Design of Chiral

Catalysts by Using Quantum Molecular Interaction Fields. Angew. Chem. Int. Ed.

2006, 45 (33), 5502-5505.

204. Huang, J.; Ianni, J. C.; Antoline, J. E.; Hsung, R. P.; Kozlowski, M. C., De Novo

Chiral Amino Alcohols in Catalyzing Asymmetric Additions to Aryl Aldehydes.

Org. Lett. 2006, 8 (8), 1565-1568.

205. Jensen, F.; Norrby, P. O., Transition states from empirical force fields. Theor.

Chem. Acc. 2003, 109 (1), 1-7.

206. Houk, K. N.; Rondan, N. G.; Wu, Y. D.; Metz, J. T.; Paddon-Row, M. N.,

Theoretical studies of stereoselective hydroborations. Tetrahedron 1984, 40 (12),

2257-2274.

207. Moitessier, N.; Maigret, B.; Chrétien, F.; Chapleur, Y., Molecular dynamics-

based models explain the unexpected diastereoselectivity of the Sharpless

asymmetric dihydroxylation of allyl D-xylosides. Eur. J. Org. Chem. 2000, (6),

995-1005.

208. Moitessier, N.; Henry, C.; Len, C.; Chapleur, Y., Toward a computational tool

predicting the stereochemical outcome of asymmetric reactions. 1. Application to

sharpless asymmetric dihydroxylation. J. Org. Chem. 2002, 67 (21), 7275-7282.

209. Harriman, D. J.; Deslongchamps, G., Reverse-docking as a computational tool for

the study of asymmetric organocatalysis. J. Comput.-Aided Mol. Des. 2004, 18

(5), 303-308.

210. Harriman, D. J.; Deslongchamps, G., Reverse-docking study of the TADDOL-

catalyzed asymmetric hetero-Diels-Alder reaction. J. Mol. Model. 2006, 12 (6),

793-797.

211. Morris, G. M.; Goodsell, D. S.; Huey, R.; Olson, A. J., Distributed automated

docking of flexible ligands to proteins: Parallel applications of AutoDock 2.4. J.


212. Harriman, D. J.; Deleavey, G. F.; Lambropoulos, A.; Deslongchamps, G.,

Reverse-docking study of the organocatalyzed asymmetric Strecker

hydrocyanation of aldimines and ketimines. Tetrahedron 2007, 63 (52), 13032-

13038.

213. Harriman, D. J.; Lambropoulos, A.; Deslongchamps, G., In silico correlation of

enantioselectivity for the TADDOL catalyzed asymmetric hetero-Diels-Alder

reaction. Tetrahedron Lett. 2007, 48 (4), 689-692.

214. Eksterowicz, J. E.; Houk, K. N., Transition-state modeling with empirical force

fields. Chem. Rev. 1993, 93 (7), 2439-2461.

215. Van Duin, A. C. T.; Dasgupta, S.; Lorant, F.; Goddard III, W. A., ReaxFF: A

reactive force field for hydrocarbons. J. Phys. Chem. A 2001, 105 (41), 9396-

9409.

216. Nielson, K. D.; Van Duin, A. C. T.; Oxgaard, J.; Deng, W. Q.; Goddard III, W. A.,

Development of the ReaxFF reactive force field for describing transition metal

catalyzed reactions, with application to the initial stages of the catalytic formation

of carbon nanotubes. J. Phys. Chem. A 2005, 109 (3), 493-499.

217. Chenoweth, K.; Van Duin, A. C. T.; Persson, P.; Cheng, M. J.; Oxgaard, J.;

Goddard III, W. A., Development and application of a ReaxFF reactive force field

for oxidative dehydrogenation on vanadium oxide catalysts. J. Phys. Chem. C

2008, 112 (37), 14645-14654.

218. Norrby, P. O., Selectivity in asymmetric synthesis from QM-guided molecular

mechanics. J. Mol. Struct. THEOCHEM 2000, 506, 9-16.

219. Rasmussen, T.; Norrby, P. O., Modeling the stereoselectivity of the β-amino

alcohol-promoted addition of dialkylzinc to aldehydes. J. Am. Chem. Soc. 2003,

125 (17), 5130-5138.

220. Fristrup, P.; Jensen, G. H.; Andersen, M. L. N.; Tanner, D.; Norrby, P. O.,

Combining Q2MM modeling and kinetic studies for refinement of the osmium-

catalyzed asymmetric dihydroxylation (AD) mnemonic. J. Organomet. Chem.

2006, 691 (10), 2182-2198.

221. Donoghue, P. J.; Kieken, E.; Helquist, P.; Wiest, O., Development of a Q2MM

force field for the silver(I)-catalyzed hydroamination of alkynes. Adv. Synth.

Catal. 2007, 349 (17-18), 2647-2654.

222. Rydberg, P.; Olsen, L.; Norrby, P. O.; Ryde, U., General transition-state force

field for cytochrome P450 hydroxylation. J. Chem. Theory Comput. 2007, 3 (5),

1765-1773.

223. Becker, H.; Ho, P. T.; Kolb, H. C.; Loren, S.; Norrby, P. O.; Sharpless, K. B.,

Comparing two models for the selectivity in the asymmetric dihydroxylation

reaction (AD). Tetrahedron Lett. 1994, 35 (40), 7315-7318.

224. Norrby, P. O.; Rasmussen, T.; Haller, J.; Strassner, T.; Houk, K. N., Rationalizing

the stereoselectivity of osmium tetroxide asymmetric dihydroxylations with

transition state modeling using quantum mechanics-guided molecular mechanics.

J. Am. Chem. Soc. 1999, 121 (43), 10186-10192.

225. Fristrup, P.; Tanner, D.; Norrby, P. O., Updating the asymmetric osmium-

catalyzed dihydroxylation (AD) mnemonic: Q2MM modeling and new kinetic

measurements. Chirality 2003, 15 (4), 360-368.

226. Norrby, P. O.; Brandt, P.; Rein, T., Rationalization of Product Selectivities in

Asymmetric Horner-Wadsworth-Emmons Reactions by Use of a New Method for

Transition-State Modeling. J. Org. Chem. 1999, 64 (16), 5845-5852.

227. Warshel, A.; Weiss, R. M., An empirical valence bond approach for comparing

reactions in solutions and in enzymes. J. Am. Chem. Soc. 1980, 102 (20), 6218-

6226.

228. Åqvist, J.; Warshel, A., Simulation of enzyme reactions using valence bond force

fields and other hybrid quantum/classical approaches. Chem. Rev. 1993, 93 (7),

2523-2544.

229. Allinger, N. L.; Yuh, Y. H.; Lii, J.-H., Molecular Mechanics. The MM3 Force

Field for Hydrocarbons. 1. J. Am. Chem. Soc. 1989, 111 (23), 8551-8566.

230. Lii, J.-H.; Allinger, N. L., Molecular Mechanics. The MM3 Force Field for

Hydrocarbons. 2. Vibrational Frequencies and Thermodynamics. J. Am. Chem.

Soc. 1989, 111 (23), 8566-8575.

231. Lii, J.-H.; Allinger, N. L., Molecular Mechanics. The MM3 Force Field for

Hydrocarbons. 3. The van der Waals’ Potentials and Crystal Data for Aliphatic

and Aromatic Hydrocarbons. J. Am. Chem. Soc. 1989, 111 (23), 8576-8582.

232. Allinger, N. L.; Li, F.; Yan, L., Molecular Mechanics. The MM3 Force Field for

Alkenes. J. Comput. Chem. 1990, 11 (7), 848-867.

233. Allinger, N. L.; Li, F.; Yan, L.; Tai, J. C., Molecular Mechanics (MM3)

Calculations on Conjugated Hydrocarbons. J. Comput. Chem. 1990, 11 (7), 868-

895.

234. Rappé, A. K.; Pietsch, M. A.; Wiser, D. C.; Hart, J. R.; Bormann-Rochotte, L. M.;

Skiff, W. M., RFF, Conceptual Development of a Full Periodic Table Force Field

for Studying Reaction Potential Surfaces. Mol. Eng. 1997, 7 (3), 385-400.

235. Kim, Y.; Corchado, J. C.; Villa, J.; Xing, J.; Truhlar, D. G., Multiconfiguration

molecular mechanics algorithm for potential energy surfaces of chemical

reactions. J. Chem. Phys. 2000, 112 (6), 2718-2735.

236. Truhlar, D. G., Valence bond theory for chemical dynamics. J. Comput. Chem.

2007, 28 (1), 73-86.

237. Jensen, F., Locating minima on seams of intersecting potential energy surfaces.

An application to transition structure modeling. J. Am. Chem. Soc. 1992, 114 (5),

1596-1603.

238. Jensen, F., Transition structure modeling by intersecting potential energy surfaces.

J. Comput. Chem. 1994, 15 (11), 1199-1216.

239. Jensen, F., Using force fields methods for locating transition structures. J. Chem.

Phys. 2003, 119 (17), 8804-8808.

240. Olsen, P. T.; Jensen, F., Modeling chemical reactions for conformationally mobile

systems with force field methods. J. Chem. Phys. 2003, 118 (8), 3523-3531.

CHAPTER TWO

The rampant use of the lock and key model for ligand-protein docking has been

found to decrease docking accuracy when comparing self-docking to

cross-docking results. With this in mind, we have developed FITTED, which overcomes this

major assumption by modeling a more dynamic protein-ligand binding process. Within

this chapter, the development of FITTED is discussed. The dynamics of ligand-protein

binding are addressed using a Lamarckian genetic algorithm to allow for the flexibility of

both the ligand and the protein and a switching function models the displacement of

bridging water molecules. FITTED was validated on a set of 33 ligand-protein complexes

and showed good accuracy and the importance of including protein flexibility and

displaceable bridging water molecules.

This chapter is a copy and is reproduced with permission from the Journal of Chemical

Information and Modeling. This article is cited as Corbeil, C. R.; Englebienne, P.;

Moitessier, N., Docking Ligands into Flexible and Solvated Macromolecules. 1.

Development and Validation of FITTED 1.0. Journal of Chemical Information and

Modeling 2007, 47, (2), 435-449. Copyright 2007, with permission from the American

Chemical Society

DOCKING LIGANDS INTO FLEXIBLE AND SOLVATED

MACROMOLECULES. 1.

DEVELOPMENT AND VALIDATION OF FITTED 1.0

ABSTRACT

We report the development and validation of a novel suite of programs, FITTED

1.0, for the docking of flexible ligands into flexible proteins. This docking tool is unique

in that it can deal with both the flexibility of macromolecules (side-chains and main-

chains) and the presence of bridging water molecules while treating protein/ligand

complexes as realistically dynamic systems. This software relies on a genetic algorithm

to account for the flexibility of the two molecules, as well as the location of bridging

water molecules. In addition, FITTED 1.0 features a novel application of a switching

function to retain or displace key water molecules from the protein-ligand complexes.

Two independent modules, ProCESS and SMART, were developed to set up the proteins

and the ligands prior to the docking stage. Validation of the accuracy of the software was

achieved via the application of FITTED 1.0 to the docking of inhibitors of HIV-1 protease,

thymidine kinase, trypsin, factor Xa and MMP to their respective proteins.

INTRODUCTION

Fast, cost-effective and accurate methods of drug design are essential to modern

day medicinal chemistry. Docking-based rational design methods provide a quick and

economical alternative to high-throughput screening and more traditional drug discovery

approaches, and are increasingly popular.1

Over the last few years, several comparative studies of docking programs have been published and show the poor accuracy of some of the commercially available packages, with Glide and GOLD being amongst the best programs.2-8 In most studies, inhibitors are accurately docked back to their corresponding protein structure (self-docking). However, it has been shown that docking to other structures (cross-docking) performs poorly.9-11 This failure results in part from the use of inaccurate protein models. Several docking programs treat the proteins as rigid objects and do not account for conformational changes upon binding, resulting in this observed poor performance in the cross-docking studies and low enrichment factors in virtual screening studies.12,13 Improvement of the developed software is necessary to include more accurate protein models.

To account for the discrepancy between self- and cross-docking, various strategies have been explored and implemented in existing software.14,15 The program FlexE uses a set of protein structures as an input and describes the side-chain and main-chain flexibility.16 SLIDE, which handles flexible side-chains,15 can also explore the main-chain flexibility when coupled with ROCK.17 Another docking program, AutoDock, models rigid proteins using grids that can be combined into grids that approximate ensembles of conformations.10 In a fourth strategy, Glide, when merged with Prime, accounts for protein adjustments through the use of homology models.18

We have recently proposed a novel concept for the docking of ligands to solvated biopolymers,19 a pharmacophore-oriented docking approach,20 and a genetic algorithm (GA) based docking method.21 The latter takes advantage of more than one structure to dock compounds in virtually flexible proteins. Using a similar approach to Lengauer and co-workers16 and Shoichet and co-workers,22 we used a library of experimentally observed protein conformations and made composite structures to model the protein flexibility and to explore a wide region of conformational space. The proteins and ligands were described as genes and a mixed Lamarckian/Darwinian evolution optimized the entire complex. This virtual flexibility was found to significantly increase the accuracy of the docking of BACE-1 inhibitors.21

We report herein the development of FITTED 1.0 (Flexibility Induced Through

Targeted Evolutionary Description), a suite of programs based on a genetic algorithm

(GA) with an emphasis on speed. This docking program is unique in that it can deal with

both the flexibility of macromolecules and the presence of bridging “displaceable” water

molecules. Additional operators to the more traditional cross-over and mutations were

implemented and led to a significant increase in speed. These operators simulate the

learning (through energy minimization at various stages) and the early selection of

individuals based on a crude estimation of their fitness (e.g., is the ligand in the binding

site?). A new potential energy function modeling the interaction with displaceable water

molecules and two modules (ProCESS and SMART) needed to prepare the ligands and

proteins are also described. A validation of the accuracy of the docking program was

performed on five different sets of protein-ligand complexes: HIV-1 protease, thymidine

kinase, trypsin, factor Xa and stromelysin-1 co-crystallized with a variety of inhibitors.

THEORY AND IMPLEMENTATION

Proof of Concept. Our previous report21 of the use of a Lamarckian GA to account for both ligand and protein flexibility was based on Discover 3.0 (ref 23) as a force field engine and considered only the ligand torsion angles as degrees of freedom. The flexibility of the side chains and main chains of the target protein was modeled using a library of conformations (from available data) that could evolve by means of genetic operators (cross-over and mutations). An anchor atom was needed by this early version to ensure convergence in a reasonable period of time. In practice, runs took as long as 20 hours for the most flexible ligands. Upon further investigation, we found that more than 96% of CPU time was consumed by intermediate minimization steps (part of the Lamarckian GA). The inclusion of ligand translational and rotational degrees of freedom led to intractable computations. This proof of concept led us to develop a program based on the same concept with a strong focus on the CPU time required, instead of using many independent programs that communicate neither quickly nor easily with each other. FITTED 1.0 includes a force field engine to perform conjugate gradient minimization24 and a genetic algorithm.

Program Requirements and Setup. FITTED is being designed to be a docking-based

virtual screening (VS) tool. Before libraries can be screened, the docking algorithm must

be validated. In practice, aspects of the docking routine that are common to all runs

should be performed only once. First, since protein structures are common to all runs in a

VS study, it is best to have a separate program to do their setup once, quickly and

efficiently. Second, a virtual library of drug-like molecules is, in practice, applied to more

than one biologically relevant target and should also be prepared independently from the

VS run. These two aspects led us to create modules for FITTED, namely ProCESS and

SMART, described in greater detail in the following sections. The use of modules is a

common practice as exemplified by the AutoDock suite of programs.25 FITTED, SMART

and ProCESS can either be run from the command line in Linux or as console applications in

Windows. However, the accuracy of FITTED was found to be highly compiler-dependent

and caution should be taken to ensure the suitability of the compiler before using FITTED

(gcc v.3.2.3 was found to be the best).

ProCESS, a Tool to Prepare Protein Files. In order to have protein files useable by

FITTED, we developed the ProCESS module (Protein Conformational Ensemble System

Setup) which assigns the advanced residue names, advanced hydrogen names, atom types

and charges for the protein as discussed below. FITTED may use several protein files to

treat the protein as flexible. However, these files must be homogeneously prepared (i.e.,

same atom name and number, same primary sequence). ProCESS requests all-atom

proteins in mol2 format as inputs. Various programs (InsightII, Maestro, Sybyl) can be

used to add the hydrogens to the PDB files. However, these graphical interfaces do not all

generate the same mol2 files from the same PDB files (different residue names, hydrogen

names, order of atoms). ProCESS first ensures that the protein files are consistent and can

be used unambiguously. Rules exist for the naming of atoms/groups in proteins (PDB).26

For instance, CYS and ASP designate cysteine and aspartic acid residues respectively.

However, this naming does not give information on the protonation state of the side-

chain nor does it identify the terminal residues. CYS can be involved in a disulfide bridge

or not while ASP can be protonated or not. These residue names cannot be used

unambiguously to assign partial charges and atom types. To address this issue, the

graphical interfaces assign various residue names for ionized and neutral aspartyl

residues: Maestro (Schrödinger): ASP and ASH, InsightII (Accelrys): ASP- and ASP,

Sybyl (Tripos): ASP and ASZ. Additional names are used for capped terminal residues in

Sybyl: AMN or AMI, and CXL or CXC for the ionized or neutral terminal amino and

carboxylate groups respectively. In order to use mol2 files generated with these different

interfaces, ProCESS reassigns advanced names to the added hydrogens (e.g., HA for alpha

hydrogens, HB1 and HB2 for beta hydrogens) by examining their chemical environment

and proceeds with the advanced residue names, starting with the terminal residues. If the

residue is an N-terminus, it is assigned a fourth letter, an N; if it is a C-terminus, a C is

appended to the name. ProCESS then checks for the protonation state of CYS, ASP and

GLU. CYS is CYSH if not involved in a disulfide bridge; ASP and GLU are ASPH and

GLUH if neutral. HIS can have one of three possible names: HISE if the hydrogen is on

the ε nitrogen, HISD if the hydrogen is on the δ nitrogen and HISP if positively charged.

By defining the chemical environment, these advanced names aid ProCESS in assigning

appropriate partial charges and atom types to the protein.
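The naming rules above can be illustrated with a minimal sketch. This is not the ProCESS source code: the function name, its keyword arguments and the `his_state` labels are hypothetical, and the order in which the protonation suffix and the terminal letter are combined is an assumption, since the text does not state how the two rules interact.

```python
def advanced_residue_name(resname, *, n_terminal=False, c_terminal=False,
                          neutral=False, in_disulfide=False, his_state=None):
    """Return an advanced residue name following the rules described in the text.

    his_state is one of "epsilon", "delta" or "positive" (hypothetical labels).
    """
    name = resname.upper()
    if name == "CYS":
        if not in_disulfide:
            name = "CYSH"                      # cysteine not in a disulfide bridge
    elif name in ("ASP", "GLU"):
        if neutral:
            name += "H"                        # neutral (protonated) side chain
    elif name == "HIS":
        suffix = {"epsilon": "E", "delta": "D", "positive": "P"}
        if his_state not in suffix:
            raise ValueError("his_state must be 'epsilon', 'delta' or 'positive'")
        name += suffix[his_state]
    # Assumption: the terminal letter is appended after any protonation suffix.
    if n_terminal:
        name += "N"
    elif c_terminal:
        name += "C"
    return name
```

With these assumptions, a neutral aspartic acid becomes ASPH, a free cysteine becomes CYSH, and an N-terminal alanine becomes ALAN.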

In some cases, the atom ordering also varies from one PDB or mol2 file to another.

To address this issue, ProCESS sorts each of the protein files by atom and residue names

and checks for sequence identity; if discrepancies in the primary sequence are found,

ProCESS exits and manual editing of the protein files is required.
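As an illustration of this consistency check, the sketch below sorts atom records into a canonical order and refuses to continue when the primary sequences differ. The record layout (residue number, residue name, atom name) and the function names are assumptions made for the example; the actual ProCESS file handling is not described at this level of detail.

```python
def sort_atoms(atoms):
    """Sort (residue_number, residue_name, atom_name) records into a canonical order."""
    return sorted(atoms, key=lambda a: (a[0], a[1], a[2]))


def primary_sequence(atoms):
    """Return the residue names ordered by residue number."""
    names = {}
    for resnum, resname, _ in atoms:
        names.setdefault(resnum, resname)
    return [names[k] for k in sorted(names)]


def check_consistency(structures):
    """Return the sorted structures, or raise if the primary sequences differ."""
    reference = primary_sequence(structures[0])
    for index, atoms in enumerate(structures[1:], start=2):
        if primary_sequence(atoms) != reference:
            # Mirrors the described behavior: stop and ask for manual editing.
            raise ValueError(f"structure {index} has a different primary sequence")
    return [sort_atoms(atoms) for atoms in structures]
```

Once sorted this way, two files describing the same protein with a different atom order yield identical record lists, so atoms can be matched by index.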

As soon as the protein files are checked and made consistent, ProCESS truncates the

protein input structures, using a user-defined cutoff distance around a user-defined

binding site (list of residues, see Supporting Information). The truncated proteins are

represented as united atoms, and AMBER atom types and partial charges are assigned to

each of them. The use of truncated united-atom protein structures significantly reduces

the time required by FITTED to set the lists of non-bond interactions at the outset of each

minimization stage. While potential energies computed using a force field are used all

through the docking process, scoring of the final poses is performed by the previously

developed RankScore21

scoring function. This function accounts for the entropy cost of

freezing flexible residues upon binding implicitly computed by scaling down the

interactions with flexible residues as discussed below. To account for these scaling

factors, ProCESS assigns new “scaled” atom types and charges derived from AMBER

atom types. When processed, the protein structures are outputted in mol2 format. Similar

to FITTED, ProCESS requires a keyword file which contains all the necessary parameters

for ProCESS to work. A typical keyword file is given as Supporting Information.

ProCESS, a tool to create binding site cavity files. As discussed below, FITTED

disregards poses that are not within the binding site cavity, which is approximated by a

series of overlapping spheres. Initial docking attempts using grids as cavity

representations required too much CPU time. We found that moving to

spheres reduced the CPU time by a factor of 3. ProCESS creates the

overlapping spheres by first generating an evenly spaced grid and keeping the points that

do not clash with the protein (Figure 2.1a). If more than one protein is used, the point

must clash with all proteins to be removed. Thus, alternative accessible spaces are

maintained.

Next the points are converted into spheres (Figure 2.1b). For this purpose, each

sphere is inflated until making contact with a protein atom center (slightly overlapping

with the protein surface) or one of the grid edges. The obtained sphere size and center are

archived. Smaller spheres, included in larger ones, are next removed in order to reduce

the total number of spheres while still covering the entire cavity space. This step is

carried out repeatedly until all spheres are examined. This step significantly reduces the

number of spheres while approximating the whole cavity space. If there is a water

molecule present, ProCESS ignores it. FITTED will later determine whether or not the

water should be considered. The grid file is outputted in mol2 format, the last column

being not partial charges, but the radii of the spheres.
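The sphere-generation procedure can be sketched as follows. This is a simplified illustration rather than the ProCESS implementation: the function name, the 2.0 Å clash distance and the choice of the most open structure to limit the sphere radius when several proteins are used are all assumptions, and zero-radius spheres on the grid boundary are simply dropped.

```python
import itertools
import math


def build_cavity_spheres(proteins, origin, size, spacing, clash=2.0):
    """Approximate a binding-site cavity by overlapping spheres.

    proteins: list of structures, each a list of (x, y, z) atom centers.
    origin, size: corner and edge lengths of the grid box.
    Returns a list of (center, radius) tuples.
    """
    upper = tuple(o + s for o, s in zip(origin, size))
    axes = [[origin[i] + k * spacing for k in range(int(size[i] / spacing) + 1)]
            for i in range(3)]
    spheres = []
    for point in itertools.product(*axes):
        # Distance to the nearest atom center in each input structure.
        nearest = [min(math.dist(point, atom) for atom in atoms) for atoms in proteins]
        # A point is removed only if it clashes with all structures.
        if all(d < clash for d in nearest):
            continue
        # Inflate until an atom center or a grid edge is reached.
        # Assumption: the most open structure sets the atom limit.
        edge = min(min(p - lo, hi - p) for p, lo, hi in zip(point, origin, upper))
        radius = min(max(nearest), edge)
        if radius > 0.0:
            spheres.append((point, radius))
    # Drop spheres fully contained in a larger sphere that was kept.
    spheres.sort(key=lambda s: s[1], reverse=True)
    kept = []
    for center, radius in spheres:
        if not any(math.dist(center, c) + radius <= r for c, r in kept):
            kept.append((center, radius))
    return kept
```

Sorting by decreasing radius means each sphere only needs to be compared with the larger spheres already kept, since containment is transitive.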

Figure 2.1 - The binding site of 1d8m mapped as (a) a set of points, (b) a set of spheres,

the spheres are colored by size range (from 1.5 Å to over 4.0 Å).

SMART, a Tool for Ligand Preparation. We also developed the SMART module (Small Molecule Atom-typing and Rotatable Torsion assignment), which automatically identifies and labels the rotatable bonds of the ligands and assigns AMBER atom types. As the rings are not conformationally sampled in the current version of FITTED, SMART also identifies the rings27 in the ligand and labels all the corresponding bonds as non-rotatable. Although no conformational sampling methods are applied to the rings, energy minimization is applied to the cyclic systems, therefore locally optimizing the ring structures. The partial charge assignment (Gasteiger-Hückel charges are recommended) is still carried out using existing software such as Sybyl.28 SMART also creates reference structures of the ligands used by FITTED to compute accurate RMSDs. For instance, rotamers of symmetric groups such as phenyl rings and t-butyl groups are considered, creating a number of new structures that will be used as references in the atomic RMSD calculation.
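The role of these reference structures can be shown with a short sketch, assuming that each reference lists the same atoms in the same order with symmetry-equivalent atoms permuted, and that the reported RMSD is the lowest value over all references. The function names are illustrative only.

```python
import math


def rmsd(coords_a, coords_b):
    """Atomic RMSD between two equally ordered lists of (x, y, z) coordinates."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


def symmetry_aware_rmsd(pose, references):
    """Lowest RMSD of the pose against all symmetry-equivalent references."""
    return min(rmsd(pose, reference) for reference in references)
```

A pose in which a phenyl ring is flipped by 180 degrees would give a large RMSD against the crystal structure alone but a value close to zero once the flipped reference is included.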

FITTED 1.0, an Algorithm to Account for Protein and Ligand Flexibility. The initial proof

of concept showed that the inclusion of flexibility greatly increased the accuracy of a

docking run. To reduce the amount of time per run, two program aspects can be

investigated: removal of repetitious events (addressed by SMART and ProCESS) and

increase in the quality of the individuals. The latter aspect is addressed by FITTED itself

and is discussed in the following sections.

Genetic Algorithm Implementation. Genetic algorithms (GA) have been used as

optimization tools in many fields for some time. In the present work, the GA is used to

optimize the binding mode. The chromosomes (Figure 2.2) describe the three-

dimensional structure of the protein/ligand/water complex and the fitness function is the

potential energy of this structure. In the illustrated case, 5 input files are used for the

protein/water structures, 4 side-chains are deemed flexible, and one bridging water

molecule is considered. The first section of the chromosome codes the ligand binding

mode and includes all the internal coordinates necessary to define a given conformation

in a given location in space and a given orientation (often referred to as a pose). The

ligand poses are therefore explicitly described in the chromosomes and FITTED can apply

a conjugate gradient algorithm to finely tune this pose.

The portion of the chromosome defining the solvated protein structure is divided

into 3 sections: i. the rigid portion of the protein, including the entire backbone, ii. the

side chain conformations of the flexible binding site residues and iii. the water molecule

locations. Each side chain and rigid protein portion adopts 5 different conformations in

the 5 protein input structures, referred to as call numbers with a value of 1 to 5. Similarly,

each water molecule adopts 5 different locations, again referred to by a call number. Thus,

libraries of side chain conformations (1 library per side chain, 5 side chain conformations

per library), a library of 5 structures for the rigid protein portion and a library for each

water molecule (5 sets of Cartesian coordinates per water molecule) are built at the outset

of the docking run from the 5 input structures. FITTED next constructs the protein/water

complex from the set of digits and the libraries and adds the ligand pose to form the

ternary complex. A force field energy is associated with this pose and is recalculated

whenever the pose is modified.
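A chromosome of this kind could be represented as in the sketch below. The class and function names are hypothetical, the ligand bond distances and angles are omitted for brevity, and the angle range (radians between -π and π) and the translation box are assumptions rather than values taken from FITTED.

```python
from dataclasses import dataclass
import math
import random


@dataclass
class Chromosome:
    position: tuple        # ligand translation (x, y, z)
    orientation: tuple     # three ligand rotation angles, in radians
    torsions: list         # one dihedral angle per rotatable bond, in radians
    backbone: int          # call number of the rigid protein portion
    side_chains: list      # one call number per flexible side chain
    waters: list           # one call number per water molecule


def random_chromosome(rng, n_structures, n_side_chains, n_waters, n_torsions, box=5.0):
    """Build a random chromosome; call numbers run from 1 to n_structures."""
    def angle():
        return rng.uniform(-math.pi, math.pi)

    def call():
        return rng.randint(1, n_structures)

    return Chromosome(
        position=tuple(rng.uniform(-box, box) for _ in range(3)),
        orientation=(angle(), angle(), angle()),
        torsions=[angle() for _ in range(n_torsions)],
        backbone=call(),
        side_chains=[call() for _ in range(n_side_chains)],
        waters=[call() for _ in range(n_waters)],
    )


def build_protein(chromosome, backbone_library, side_chain_libraries, water_libraries):
    """Assemble the composite protein/water structure from 1-based call numbers."""
    return {
        "backbone": backbone_library[chromosome.backbone - 1],
        "side_chains": [lib[c - 1] for lib, c in
                        zip(side_chain_libraries, chromosome.side_chains)],
        "waters": [lib[c - 1] for lib, c in zip(water_libraries, chromosome.waters)],
    }
```

For the illustrated case, `random_chromosome(random.Random(0), 5, 4, 1, 4)` yields one backbone call number, four side-chain call numbers, one water call number and four dihedral angles.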

[Figure 2.2 schematic. Ligand section: position in space (x, y, z); orientation in space (xyxzyz); bond distances and angles values; dihedral angles (one gene per rotatable bond). Protein backbone and residues not in the binding site: backbone and rigid residue structure (digit from 1 to 5). Side chains of the flexible residues in the binding site: one conformation gene per side chain (digit from 1 to 5). Water molecules: position in space (digit from 1 to 5).]

Figure 2.2 - Chromosome describing a protein/water/ligand complex.

In the illustrated case, the protein has 4 flexible residues and is represented by 5

input structures, a single water molecule is included and the ligand has 4 rotatable bonds.

A horizontal bar represents a gene

Intelligent Design of the Initial Population. With the libraries completed, FITTED

proceeds to creating individuals. Each individual is first assigned a protein structure. This

is followed by a random generation of the ligand pose.

We first thought that the required CPU time could be significantly reduced by increasing the “quality” of the initial population and focused on its generation. It is known that a population including good guesses often converges more rapidly and is less likely to become trapped in a local minimum.29 In an early version, an energy threshold was used to select reasonable individuals. However, the lengthy minimization routine was used to optimize all the individuals, including the many that were discarded using this threshold. To reduce the number of unnecessary minimization steps, we envisioned the use of additional genetic operators in the form of filters (Figure 2.3).

[Figure 2.3 flowchart. Inputs: 5 protein structure files, ligand file, cavity file, constraint file, keyword file. Steps and decision points: pick 1 protein structure; randomize torsions, orientation, translation; constraint fulfilled?; construct complex; ligand in binding site?; pot. energy acceptable?; conjugate-gradient minimization; pot. energy acceptable?; save complex; enough individuals?; initial population ready.]

Figure 2.3 - Generation of the initial population using a series of filters.

A first filter was implemented that discards the poses with strong steric clashes and

poses outside the protein cavity approximated by a set of spheres. If any atom of the

ligand is not located within a sphere, the pose is discarded prior to potential energy

evaluation. A second test is made and only the protein/water/ligand complexes with

energy below a user-defined threshold are further optimized by energy minimization.

To further improve the method, we added the possibility of exploiting experimental

information by including constraints to force key interactions. For instance, ligand poses

that do not interact with a given protein residue or atom (e.g., metal) will be discarded. Like the grid file, the constraint file is in mol2 format. Constraints are also defined as spheres and columns are added to the mol2 file to define the size of the constraint and its type (i.e., charge below -0.3).

Thus, in the current version, the first input protein/water file is selected and ligand

poses are randomly generated until one pose fulfills all the criteria (located within the

cavity, fulfilling the constraints) (Figure 2.3). FITTED next constructs the complex (the

corresponding chromosome) and further optimizes it through conjugate gradient energy

minimization. If the optimized complex passes the last test (final energy compared to a

second user-defined threshold), it is archived. If the ligand pose does not pass, it is

discarded and another one is generated. This procedure is reiterated with the other

protein/water structures which are evenly represented in the initial population.

As expected, this implementation significantly reduced the time needed to produce

a high quality initial population while including the rotational and translational degrees of

freedom, which were not present in the previous method.
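The filtering scheme described in this section can be summarized by the sketch below. It is an illustration under stated assumptions, not the FITTED code: every callable passed to the function (`random_pose`, `in_cavity`, `satisfies_constraints`, `energy`, `minimize`) is a hypothetical placeholder, and the protein structures are cycled in turn as one simple way of keeping them evenly represented.

```python
def generate_initial_population(structures, size, random_pose, in_cavity,
                                satisfies_constraints, energy, minimize,
                                pre_threshold, post_threshold, max_attempts=100000):
    """Build an initial population using cheap filters before any minimization."""
    population = []
    attempts = 0
    while len(population) < size:
        attempts += 1
        if attempts > max_attempts:
            raise RuntimeError("could not generate enough acceptable individuals")
        # Cycle through the input structures so that each is evenly represented.
        structure = structures[len(population) % len(structures)]
        pose = random_pose()
        # First filter: geometric tests only, no energy evaluation.
        if not (in_cavity(pose) and satisfies_constraints(pose)):
            continue
        # Second filter: only complexes below the first threshold are minimized.
        if energy(structure, pose) > pre_threshold:
            continue
        pose = minimize(structure, pose)
        # Last test: final energy compared with the second threshold.
        if energy(structure, pose) > post_threshold:
            continue
        population.append((structure, pose))
    return population
```

The ordering of the tests matters: the cheapest checks come first, so the costly minimization is only spent on poses that already look reasonable.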

Evolution of Flexible Ligands. The theme of intelligent design is further carried into

addressing the evolution of the ligands. The first issue addressed is the refinement of the

orientation through mutation. It was observed that by increasing the probability of

mutation of solely the rotation of the ligand (orientation in space), which requires a larger

sampling, an increase in the speed of convergence occurred. Also by decreasing the range

of the possible rotation mutation from 0-360 to +/- 30 degrees, an increase is observed in

the validity of the individuals produced through evolution.

Secondly, the probability that the best individual is further optimized without being

coupled is small. To increase this probability, we added a probability of learning.

Before the evolution of each generation, a small percentage of the population is further

optimized by energy minimization. This approach brings the Lamarckian aspect of this

GA one step further.
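The two operators discussed in this section can be sketched as follows, assuming that the ligand orientation is stored as three angles in degrees and that individuals can be minimized through a user-supplied callable. The function names and the way the learning fraction is rounded are illustrative choices, not details taken from FITTED.

```python
import random


def mutate_orientation(angles, rng, max_step=30.0):
    """Perturb each rotation angle by at most +/- max_step degrees, wrapped to [0, 360)."""
    return tuple((a + rng.uniform(-max_step, max_step)) % 360.0 for a in angles)


def apply_learning(population, minimize, fraction, rng):
    """Minimize a randomly chosen fraction of the population before the next generation."""
    count = max(1, round(fraction * len(population)))
    for index in rng.sample(range(len(population)), count):
        population[index] = minimize(population[index])
    return population
```

Restricting the step to ±30 degrees keeps a mutated individual close to its parent, which is consistent with the observation that smaller rotation mutations produce more valid individuals than a full 0-360 degree resampling.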

Evolution of Flexible Proteins – New Genetic Operators. The produced “high quality”

population will then evolve using a series of genetic operators including mutations and

cross-over. These operators will blend the genotypes from the various individuals by

swapping portions of the chromosomes (cross-over) or randomly modifying genes (mutation). Parent complex structures are randomly selected from the mating pool, coupled and children are produced by cross-over and mutation operators in a steady-state way. The offspring should first pass the genotype selection described above (cavity and constraint filters) to be selected. To our knowledge this early crude selection, first developed by Haupt and co-workers,29 is a new concept in the GA field applied to docking methods. A proportion (user-defined) of the selection will be further optimized

by energy-minimization. This energy minimization stage represents the Lamarckian

aspect of the genetic algorithm. The children learn/evolve during their life (are energy-

minimized) and can transmit the acquired skills to the next generation. In practice, this

optimization had to be applied to a small fraction of the population. If all the structures

are fully optimized, the conformational search usually converges to high-energy local

minima. The two best fit individuals among the parents and their children survive. This

process of natural selection is based on the potential energy computed with the AMBER

force field.30

In the current version, the input side-chain and backbone conformations and the

water molecule locations in space are archived in libraries, and each protein structure is

described as a composite of these allowed conformations (a chromosome).21

Creating a

ternary complex then requires the reconstruction of the protein structure and the addition

of the water molecules and ligand. The separation between each section of the

chromosome is made on purpose. FITTED will apply the genetic operators to each section

independently. For instance, a single point cross-over operation can be applied to the

protein side chain section, another one to the ligand internal coordinate section of the

chromosome and a last one to the water molecules (if there is more than one). As the

position of the cross-over is randomly selected it has a higher probability to be applied

between the first and last genes describing the ligand than before the first or after the last

gene. As a result, the orientation in space of the ligand (first gene of the ligand) would be

somewhat linked to the backbone conformation (next gene in the chromosome). To

address this artifact, when a cross-over operation is performed, one of the following two

options is randomly selected: the top portion of the section is kept and the bottom portion

is exchanged or the top portion is exchanged and the bottom is kept. The same two

options apply to the side chain section of the chromosome and to the water molecules

(Figure 2.4). Cross-over operations on sections that include a single gene (i.e., a single


water molecule) are restricted to complete exchange or no operation. The probability of

performing a cross-over operation on each section is defined by the user using the

appropriate keyword. Figure 2.4 illustrates the four possible pairs of children produced if

2 cross-over operators are applied. In practice, 4 cross-over operations (one for ligand,

one for binding site residues, one for the rest of the protein and one for the water

molecules) can be used and produce one of the 16 possibilities.
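
The section-wise cross-over described above can be sketched in Python. The sketch assumes that a chromosome is stored as a dictionary mapping a section name to a list of genes; this layout, the section names and the function names are hypothetical and only serve the illustration.

```python
import random


def one_point_crossover(sec_a, sec_b, rng=random):
    """One-point cross-over restricted to a single chromosome section.

    One of two options is drawn at random: keep the top portion and exchange
    the bottom, or exchange the top and keep the bottom. A single-gene section
    is either fully exchanged or left unchanged.
    """
    n = len(sec_a)
    if n == 1:
        if rng.random() < 0.5:
            return sec_b[:], sec_a[:]
        return sec_a[:], sec_b[:]
    cut = rng.randint(1, n - 1)  # cut position strictly inside the section
    if rng.random() < 0.5:
        # keep the top portion, exchange the bottom portion
        return sec_a[:cut] + sec_b[cut:], sec_b[:cut] + sec_a[cut:]
    # exchange the top portion, keep the bottom portion
    return sec_b[:cut] + sec_a[cut:], sec_a[:cut] + sec_b[cut:]


def crossover(parent_a, parent_b, p_cross, rng=random):
    """Apply an independent cross-over decision to every section."""
    child_a, child_b = {}, {}
    for name in parent_a:
        if rng.random() < p_cross[name]:
            child_a[name], child_b[name] = one_point_crossover(
                parent_a[name], parent_b[name], rng)
        else:
            child_a[name], child_b[name] = parent_a[name][:], parent_b[name][:]
    return child_a, child_b
```

Because each section draws its option independently, two crossed sections give four possible pairs of children and four crossed sections give sixteen, as stated above.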

Mutation operations can also alter each gene of the chromosome except the ligand

bond distances and angles. Mutations in the protein backbone, side chain and water

genes are limited to the substitution of the digit by another digit in the range defined by the

number of protein input files. The mutations do not produce protein

backbone or side chain conformations, nor water molecule locations, that are not in the

initial libraries. As a result, FITTED will not propose protein/water structures that are not

composites of the input structures. When producing a composite protein conformation,

FITTED also assesses the integrity of the structure and rejects any generated protein

structure that has intramolecular steric clashes.
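
A library-restricted mutation of this kind can be sketched as below. The sketch assumes that each protein or water gene is an integer index pointing to one of the input structures; the function name and its parameters are hypothetical, and the steric clash check performed by FITTED is not reproduced here.

```python
import random


def mutate_library_genes(genes, n_structures, p_mut, rng=random):
    """Mutate genes that index a library of input structures.

    With probability p_mut, a gene is replaced by an index drawn from
    range(n_structures), so a mutated chromosome can only ever point to
    conformations or water locations that exist in the input library.
    """
    return [rng.randrange(n_structures) if rng.random() < p_mut else g
            for g in genes]
```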

[Figure 2.4: schematic showing the parents and children pairs 1 to 4 obtained from two one-point cross-over operations.]

Figure 2.4 - 4 possible pairs of children generated after application of two one point

cross-over operations. A horizontal bar represents a gene.


Docking to Rigid or Flexible Proteins with FITTED. Three options are available. First,

docking to a single conformation can be performed, which allows for self- and cross-

docking studies. Second, docking to a conformational ensemble can also be carried out.

Using this option (referred to as “semi-flexible”), the input protein structures will remain

unchanged over the evolution but can be exchanged between individuals, the cross-over

and mutations operating on the entire protein structures only. Third, one can use the fully

flexible protein structure. With this last option, the cross-over and mutation operators will

be separately applied to the backbone, side chains and water molecules.

Displaceable Water Molecules - Implementation. To date, very few methods have been

proposed to consider dynamically bound water molecules.31

We recently reported a new

concept to describe displaceable water molecules.19

As discussed in this previous report,

the non-bonded energy function can include an energy well at an interaction distance to

the water molecule, but no van der Waals wall in order to simulate the water

displacement. The proof-of-concept of this approach was demonstrated by using

combinations of AutoDock grids modeling the “dry” and solvated RNA oligomers. The

docking of aminoglycosides to these combined grids was found to be more accurate than

the docking to solvated or dry RNA oligomers.19

As FITTED does not make use of grids,

we had to develop and implement an additional potential energy term to the AMBER

function. To remove the Lennard-Jones wall for the water molecule at short distances, we

introduced a switching function (SF) in the form of a scaling factor applied to the

intermolecular energies involving a water molecule.

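\[
sw =
\begin{cases}
1.0 & \text{if } d \geq d_{switchwat} \\[4pt]
\dfrac{(3\,d_{switchwat} - d_{cutwat} - 2\,d)\,(d - d_{cutwat})^{2}}{(d_{switchwat} - d_{cutwat})^{3}} & \text{if } d_{cutwat} < d < d_{switchwat} \\[4pt]
0.0 & \text{if } d \leq d_{cutwat}
\end{cases}
\tag{2.1}
\]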

where sw is the scaling factor, d is the shortest distance between any atom of the ligand

and any atom of the water molecule, dcutwat is the cutoff distance and dswitchwat is the

switching distance. Such functions are traditionally used to cut off long-range non-bonded

interactions. In this specific case, it is used to cut off short-range interactions.
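
A minimal Python sketch of such a short-range switching factor is given below. It assumes the usual cubic switching polynomial, rising smoothly from 0 at the cutoff distance to 1 at the switching distance; the default distances are the values quoted for Figure 2.5, and the function name water_switch is hypothetical.

```python
def water_switch(d, d_cutwat=1.20, d_switchwat=1.75):
    """Scaling factor applied to ligand-water intermolecular energies.

    d is the shortest ligand-water interatomic distance (in Angstrom).
    Returns 0.0 at or below the cutoff (water displaced), 1.0 at or beyond
    the switching distance (water present), and a smooth cubic in between.
    """
    if d <= d_cutwat:
        return 0.0
    if d >= d_switchwat:
        return 1.0
    span = d_switchwat - d_cutwat
    return (3.0 * d_switchwat - d_cutwat - 2.0 * d) * (d - d_cutwat) ** 2 / span ** 3
```

The scaled interaction energy would then be water_switch(d) multiplied by the unscaled ligand-water energy. Since the factor depends only on the single shortest distance, it applies to the whole ligand at once, in line with the molecule-based character of the function.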


[Figure 2.5: plot of interaction energy (kcal/mol) versus distance (Å).]

Figure 2.5 - Interaction energy between a methanol molecule and an explicit water

molecule (red) or a displaceable water molecule (blue). Cutoff distance = 1.20 Å,

switching distance = 1.75 Å.

Figure 2.5 represents the energy curve obtained with this new function and

illustrates the interaction between methanol and a water molecule. Although the standard

SF’s are atom-based or group-based, this specific SF has to be molecule-based. To model

a realistic situation, the water molecule should be included in the binding site (SF =1 for

the entire ligand) or displaced (SF = 0 for the entire ligand). Thus, the situations where

this function takes values between 0 and 1 are artifacts. The positive energy observed in Figure 2.5

between 1.20 and 1.75 Å is a consequence of this function. One way to address this issue

would be to turn off the energy function as soon as it is positive. However, a continuous

function between 0 and 1 was needed by the energy-minimization routine. In order to

define the optimal cutoff and switching distances, the intermolecular interactions within

complexes such as methanol-water or methyl acetamide-water were investigated. In all

cases, the interaction energy between the molecules was positive at distances below 1.75

Å, which was selected as the optimal switching distance. Therefore, this SF applies only when the

interaction energy with the water molecule is repulsive. The scaled interaction energy reached a maximum of

about 15 kcal/mol when a cutoff distance of 1.20 Å was used (see Figure 2.5).


Applied to the docking of molecules, this potential energy function penalizes the

poses that neither interact favorably with the water molecule (distance < switching distance =

1.75 Å) nor displace it completely (distance > cutoff distance = 1.20 Å) and will

consequently help the ligand to interact with or fully displace the water molecules.

Displaceable Water Molecules - Optimizing Water Evolution. Bridging water molecules

are often observed in crystal structures. This information is exploited by FITTED. In the

present work, critical water molecules are either maintained when present in the crystal

structures or added by analogy to other structures and their orientation optimized by

energy-minimization when missing. Initial attempts have shown that the prediction of the

occurrence of water molecules in the complexes was not accurate. In practice, we

observed that the ligand pose was first optimized (with greater decrease of the total

potential energy). This early optimization is followed by the refinement of the protein

structure and finally the water molecules. However, by that stage most of the water location

possibilities (one per protein structure) have already been eliminated over the generations.

We then found that higher mutation rates increased the accuracy by increasing the

sampling of the water molecules. In order to address this issue, we implemented a

ramping mutation rate for the water molecules. This ramping is achieved by using a

quadratic function.

\[
p_{mutwat} = p_{max} \left( \frac{n^{th}\ \text{generation}}{\text{maximum number of generations}} \right)^{4}
\tag{2.2}
\]

Thus, very low mutation rates are applied at the early stages of the evolution while

larger rates are used at the late stages. One drawback of the use of the AMBER force

field is the lack of directionality of the hydrogen bond term. Evaluation of the free energy

of binding of the water molecules was also a concern. In the current version of FITTED,

water molecules are considered as part of the protein. To account at least partly for the

entropy cost associated with the capture of a water molecule, a penalty is added to the

final score whenever a water molecule is maintained. This number is arbitrary as this

penalty is system-dependent and should also include the enthalpic contribution to the

binding of the water. Work is in progress in our laboratory to include directional

hydrogen bonds in the next version of FITTED and to improve the scoring of the free

energy of binding of the water molecules to protein/ligand complexes.
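
The ramped water mutation rate introduced at the start of this section can be illustrated with a short Python sketch. It assumes a power-law ramp in the fraction of elapsed generations; the exponent is deliberately left as an explicit argument, and the function and parameter names are hypothetical.

```python
def water_mutation_rate(generation, max_generations, p_max, exponent):
    """Mutation probability for the water genes at a given generation.

    The rate grows from 0 at generation 0 to p_max at the last generation,
    following (generation / max_generations) ** exponent.
    """
    fraction = generation / max_generations
    return p_max * fraction ** exponent
```

With any exponent greater than 1 the rate stays very low during the early generations, when the ligand pose is being optimized, and rises steeply near the end, when the water molecules are refined.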


Scoring Function. The AMBER force field was implemented in FITTED and used during

the actual docking with a higher weight for the intermolecular interactions than for the

internal energy. Very few scoring functions include a term accounting for the protein

entropy. In the present case, using a force field does not permit the evaluation of the

entropy contribution to the free energy of binding. Understanding that the mobility and

entropy of flexible residues is modulated by the ligands, we have proposed to estimate

the free energy of binding to the flexible residues. First, the stronger the interactions are,

the tighter a ligand is bound. Then, the tighter a ligand is bound, the more frozen the

surrounding side chains are. We proposed to account for this entropy/enthalpy

compensation by reducing the interaction with flexible residues. In practice, the

interaction with flexible side chains was scaled down by the use of a new set of atom

types and partial charges assigned by ProCESS. The final poses were then scored using

our scoring function RankScore also implemented in FITTED.

RESULTS AND DISCUSSION

Selection of the Testing Set. As FITTED incorporates protein flexibility and displaceable

water molecules, a selection of protein/inhibitor complexes should be made to evaluate

these aspects of FITTED. The selected inhibitors are listed as supporting information.

As HIV-1 protease (HIVP)/inhibitor complexes often involve a bridging water

molecule, this enzyme was selected as a first test case. Although HIVP is a flexible

protein, the inhibitors usually bind to the closed form and HIVP is not considered as a

highly flexible protein in docking studies. However, slight adjustments were observed

and RMSD’s of 0.5 to 1.4 Å between protein structure binding sites were computed.

HIVP can exhibit two different protonation states, with either one or both catalytic aspartic

acid side chains being protonated.32

In most cases, inhibitors binding to the catalytic

dyad via a diol moiety favor the diprotonation of the catalytic Asp, while monoalcohols

or other functional groups favor the monoprotonated state. We therefore decided to

prepare two sets of protein files. 1b6l, 1eby, 1hpo, 1hpv, and 1pro protein structures were

monoprotonated as discussed in the experimental section while 1ajv, 1ajx, 1hvr, 1hwr

and 1qbs were diprotonated as experimentally observed for the binding of cyclic diols to

the diaspartate catalytic site.32

Only the crystal structures 1b6l, 1eby and 1hpv featured a


water molecule. This same water molecule was therefore added to the other 7 protein

structures to allow FITTED to select whether or not this water is needed.

Similarly, thymidine kinase (TK) is a flexible protein and its inhibitors often bind

through interactions with the protein relayed by many water molecules.

Interestingly, a first water (water molecules 1 and 2 in Figure 2.6) can be located at two

different positions following Gln125 side chain conformational changes. The combined

water displacement/Gln side chain flip will be investigated in great detail. As illustrated

in Figure 2.6, either the Gln125 carbonyl oxygen (Figure 2.6a) or the amide hydrogens

(Figure 2.6b) point towards the Arg176 side-chain. The first Gln125 side chain

conformation shown in Figure 2.6a is observed in 1e2k, 1e2p, 1ki4, 1ki8 and 1of1 while

the second conformation is observed in 1ki3, 1ki7, 2ki5 and 1qhi (PDB codes). Similarly,

Water 4 can be displaced by Gln221. These two enzymes (HIVP and TK) together with

oligopeptide binding protein A (OppA) were also selected as test cases by the GOLD

developers in order to evaluate the reliability of their method accounting for bridging

water molecules.31

However, Verdonk et al. considered three water molecules interacting

with the nucleotide base of the TK inhibitors while we considered another three

interacting with the ribose part of these inhibitors, for a total of six water molecules. As

illustrated in Figure 2.3, these 6 waters participate in multiple hydrogen bonds with both

the ligands and the proteins.

Figure 2.6 - Bridging water molecules and flexible binding site residues in TK / inhibitor

complexes. (a) 1e2k and (b) 1ki3. Co-crystallized sulfate is shown in orange.

Factor Xa (FXa) and its homolog trypsin were also included in the validation set.

FXa / inhibitor complexes show from zero to two water molecules involved in the


ligand binding while a single bridging water molecule is observed in the selected

trypsin/inhibitor complexes (Figure 2.7). The first one interacts with both the inhibitor

cationic moieties and the protein backbone (Ile227 in FXa and Val205 in trypsin) while

the second one bridges the inhibitor cation with the key Asp189 side chain of factor Xa.

The specific shape of these two deep binding sites featuring a narrow pocket made the

conformational sampling problematic. Thus, a larger population size (200) was used in

order to reach convergence.

We completed this validation study with a small set of metalloprotease (MMP-3,

stromelysin-1) inhibitors for a total of 33 complexes. Most of the known MMP inhibitors

chelate the catalytic zinc cation and are of interest to evaluate the accuracy of FITTED to

reproduce the metal ligation. As the metal chelation is a short range interaction, we

implemented a specific term using a potential similar to the LJ12-10 used for hydrogen

bonds.

All these protein structures were processed using ProCESS (a typical keyword file

is given as supporting information) prior to their use with FITTED and the ligands were

prepared with SMART.

Figure 2.7 - Bridging water molecules and flexible binding site residues in FXa /

inhibitor complexes. (a) 1ezq and (b) 1f0r.

Docking using FITTED 1.0 – General Consideration. All the compounds were first self-

docked to their corresponding protein structure in the presence or absence of water. Table 2.1

- Table 2.3 summarize the data obtained for these docking studies. This first set of


docking runs was carried out to evaluate the impact of the new potential energy term for

the displaceable water molecules. A cross docking study was next carried out to evaluate

the impact of the protein structure on the docking accuracy (Table 2.4 - Table 2.6). These

same ligands were next docked to the “semi-flexible” proteins and to the fully flexible

proteins in order to evaluate the ability of FITTED to predict the protein structure (Table

2.7 and 2.9). For the statistical analysis, we considered that the ligand pose was

accurately predicted when the RMSD relative to the crystal structure was below 2.0 Å,

that the protein structure prediction was correct when the RMSD was below the average

RMSD between the series of protein structures used as input (when the prediction is

better than a random selection of protein structures). Finally, we considered the water

molecules to be accurately predicted when the occurrence was right. A set of 10 runs was

carried out for each inhibitor, in order to demonstrate the convergence of the protocol. In

most of the cases at least 5 out of the 10 runs led to similar poses (difference between

computed RMSD’s below 0.5 Å). Although 100 individuals were enough for the docking

of thymidine kinase inhibitors, a larger initial population (200) was required for the other

4 proteins in order to reach the convergence criterion.
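
The success and convergence criteria listed above can be written down as a short Python sketch. The function names are hypothetical, the RMSD is computed on matched atom lists without any superposition, and interpreting "similar poses" as runs whose RMSD values span less than 0.5 Å is an assumption of this example.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD (in Angstrom) between two equally ordered lists of (x, y, z) tuples."""
    squared = sum((xa - xb) ** 2
                  for a, b in zip(coords_a, coords_b)
                  for xa, xb in zip(a, b))
    return math.sqrt(squared / len(coords_a))


def pose_is_correct(ligand_rmsd, threshold=2.0):
    """Ligand pose criterion: RMSD to the crystal structure below 2.0 Angstrom."""
    return ligand_rmsd < threshold


def protein_is_correct(protein_rmsd, average_pairwise_rmsd):
    """Protein criterion: better than a random pick among the input structures."""
    return protein_rmsd < average_pairwise_rmsd


def runs_converged(run_rmsds, tolerance=0.5, min_similar=5):
    """True if at least min_similar runs give RMSDs within tolerance of each other."""
    ordered = sorted(run_rmsds)
    return any(ordered[i + min_similar - 1] - ordered[i] < tolerance
               for i in range(len(ordered) - min_similar + 1))
```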

Self-docking Study. Among the 5 proteins investigated, 4 proteins can bind ligands

through one or more bridging water molecules. Using this set we first evaluated the

impact of the potential energy term developed for the water molecules described above

on the docking accuracy. In a first set of experiments, the water molecules were removed

from the protein structures and inhibitors were docked back to their corresponding

protein structure (self-docking). In a second set of experiments, the water molecules were

maintained and the developed potential energy term for the water molecule was used.

Table 2.1 presents the results of the self docking study for HIVP inhibitors. As can

be seen in the third and fifth columns, 7 (without water) or 8 (with water) out of the 10

inhibitors were self-docked within 1.2 Å from the experimentally observed binding

modes. Interestingly, 1hpo was properly docked when the water energy potential was

used and the water was predicted to be displaced. However, when the water was removed

prior to docking, 1hpo, which is known to displace the water molecule, was misdocked.

We attribute this unexpected result to the energy hill shown in Figure 2.5. As postulated

above, this energy potential tends to favor either the complete displacement of the water


or favorable interactions with the water while disfavoring intermediate docked poses such as

the one proposed when the water is removed prior to docking. A close look at Table 2.1

also revealed that the RMSD’s are generally higher when the water is removed prior

to docking. We again believe that this can be attributed to the energy potential used to model

the displaceable water molecules.

The TK inhibitors were next docked to the rigid protein in self-docking

experiments (Table 2.2). 6 out of 9 inhibitors were self-docked within 1.1 Å from the

observed binding mode and the 1ki3 inhibitor was docked with a reasonable RMSD. A special

situation arose with the meso compound 1e2p. As shown in Figure 2.8, the RMSD of

2.03 Å computed for the docked pose was attributed to the exchange of the C-1 and C-2 groups.

Considering the two methylenol groups as equivalent reduces the RMSD to 0.77 Å.

Docking of 1e2p was therefore considered as successful.
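
Treating equivalent groups as interchangeable amounts to taking the lowest RMSD over the atom mappings allowed by the symmetry of the ligand. The following Python sketch illustrates the idea on a toy three-atom molecule; the function names and the way the permutations are supplied are assumptions of this example, not the procedure used to obtain the values quoted above.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally ordered lists of (x, y, z) tuples."""
    squared = sum((xa - xb) ** 2
                  for a, b in zip(coords_a, coords_b)
                  for xa, xb in zip(a, b))
    return math.sqrt(squared / len(coords_a))


def symmetry_rmsd(docked, crystal, permutations):
    """Lowest RMSD over the supplied atom-index permutations.

    Each permutation lists, for every crystal atom, the index of the docked
    atom it is compared with; the identity permutation should be included.
    """
    return min(rmsd([docked[i] for i in perm], crystal) for perm in permutations)


# Toy example: atoms 1 and 2 are equivalent and were swapped by the docking.
crystal = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
docked = [(0.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
plain = rmsd(docked, crystal)
corrected = symmetry_rmsd(docked, crystal, [[0, 1, 2], [0, 2, 1]])
```

In this toy case the plain RMSD is nonzero although the two structures are identical once the equivalent atoms are relabeled, which is the same effect as the one discussed for 1e2p.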

Figure 2.8 - Docked (green) and crystal structure (grey) of 1e2p ligand. Computed

RMSD: 2.03 Å. The pro-chiral carbon is shown as a ball.

Surprisingly, even though many water molecules are involved in the ligand/protein

complexes, the docking to the “dry protein” was as accurate as the docking to the

solvated protein. Only the docking of 2ki5 was slightly affected by the absence of water.

The ten runs carried out with 1ki7 consistently led to the same wrong

conformation. In this case, the ribose ring and the base of the nucleotide mimics were

inverted within the binding site.

We next investigated the two sets of charged trypsin and FXa inhibitors. In this

case, the need for water molecules was clear (Table 2.3). Without water, FITTED docked


only 4 out of the 10 inhibitors properly while 8 were accurately docked when the water

molecules were considered. A close look at the failures does not reveal any major

mistake. For instance, the proposed poses for 1f0u were interacting with Asp171 as

experimentally observed. However, the hydrophobic biaryl moiety of 1f0u was not

located in the same pocket with the terminal ammonium group forming a hydrogen bond

with Tyr76 instead of Asn79. As for 1hpo described above, the surprise comes from the

water prediction. In these three cases (1f0u, 1qbo and 1fjs), the occurrence of water

molecules is not accurately predicted but induces a proper docking of the inhibitors.

In contrast, the occurrence of water is accurately predicted when docking trypsin

inhibitors and the removal of the water does not affect the docking.

MMP inhibitors were docked and low accuracy was observed, with only 2 of the 4 inhibitors

being accurately docked. This small set is clearly not large enough to fully assess FITTED

for metalloenzymes.

Overall, these first experiments demonstrated FITTED’s abilities to fully sample the

ligand conformational space and assign better scores to experimentally observed poses.

This first study also validated the water molecule prediction method since the occurrence

of the so-called water 301 in HIVP and waters in TK and trypsin is right in most of the

cases. Unexpectedly, this additional energy term also helps the docking of inhibitors that

displace water.

Cross-docking Study. In a real case study, medicinal chemists wish to design compounds

de novo or to screen libraries of compounds that are not co-crystallized with the enzyme.

Thus, a self-docking study is not representative of the real accuracy of docking programs.

To properly evaluate the predictive power of FITTED, a set of cross-docking experiments

was next carried out.

Each inhibitor was docked to the corresponding set of proteins in order to evaluate

the impact of the protein conformation on the docking accuracy. First, each HIVP

inhibitor was docked to the 5 protein structures and the RMSD’s and scores were

computed (Table 2.4). The data collected for the first five inhibitors revealed that the

docking accuracy is greatly influenced by the protein conformation. The cross docking

experiments carried out with the TK, FXa and MMP inhibitors also showed a significant

decrease of the accuracy relative to the self-docking study (Table 2.5 and Table 2.6). In


contrast, the other five HIVP and trypsin inhibitors were accurately docked in most of the

cases regardless of the protein structure used.

Overall, this cross-docking study confirms the need for a docking method that

models the protein flexibility and/or the sensitivity of FITTED to the protein structure.

Docking to Multiple Conformations. The self-docking and cross-docking data can be

used to simulate the docking to multiple conformations. The five (or nine for TK) final

docked poses for each inhibitor (one per protein structure) are next compared and the best

scoring pose is selected (Table 2.4 -Table 2.6). In the case of the monoalcohols 1b6l,

1hpo and 1pro, the self-docking led to the best score. The same observation was made

with 5 out of the 9 investigated TK inhibitors and 4 out of the 5 FXa systems. However,

as four of the five diols (1ajv, 1ajx, 1hvr and 1hwr) were docked with good accuracy

regardless of the protein structure, the prediction of the protein conformation was much

poorer. Interestingly, although 1qbs was not accurately docked back to its corresponding

protein structure, its correct binding mode associated with a better score was proposed when

1ajx and 1hwr protein structures were employed.
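
The selection step used to simulate docking to multiple conformations can be sketched as follows. The sketch assumes that a more negative score is better, as the tabulated scores suggest, and that the results of the individual rigid dockings are collected in a dictionary; the function name and the data layout are hypothetical.

```python
def best_scoring_pose(results):
    """Pick the protein structure whose final docked pose has the best score.

    results maps a protein structure identifier to a dictionary holding the
    ligand RMSD and the score of the final pose obtained with that structure.
    The lowest (most negative) score is taken as the best one.
    """
    best = min(results, key=lambda name: results[name]["score"])
    return best, results[best]
```

The RMSD plays no role in the selection; it is only carried along so that the accuracy of the selected pose can be assessed afterwards.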

Docking to Semi-flexible and Flexible Proteins. Although the previous study intrinsically

includes protein flexibility, it requires 5 to 9 experiments per compound and therefore

implies the equivalent increase in required CPU time. The current version of FITTED

offers to model the protein flexibility in a single experiment. Either of the two options

(docking to semi-flexible and fully flexible proteins) described above can be selected.

Adding the flexibility of the protein increases the complexity of the potential energy surface

to explore, therefore making the conformational sampling more difficult. We therefore

expected to observe a reduced accuracy when moving from rigid to flexible proteins.

In fact, 1hpo was misdocked to the semi-flexible and flexible HIVP (Table 2.7)

while an RMSD below 1.0 Å was recorded in the previous set of experiments. A close

look at the 10 runs revealed that the same misdocked pose was observed in 5 of the 10

proposed poses. It was also found that the experimentally observed pose has a worse

score. These two observations ruled out the hypothesized bad convergence but pointed

out a weakness of the scoring function. In contrast, 1qbs was misdocked to the rigid

protein with the orientation of the seven-membered core reversed. Again, as in the


docking to multiple conformations study, a much better score and the right pose were

predicted when the semi-flexible and flexible proteins were used.

The computed RMSD’s between the crystal structures 1of1 and 1e2k, 1e2p and

1e2k and 1of1 and 1e2p of TK were 0.26, 0.59 and 0.59 Å respectively. The computed

RMSD’s for the other pairs of structures ranged from 0.80 to 1.16 Å with an average of

0.92 Å. When the semi-flexible docking was used, the correct protein structure was

picked among the possible nine in 4 cases and was alternatively picked (5 runs each) with

a similar one when the 1of1 inhibitor was docked (Table 2.8). In contrast, the protein

structure was roughly as good as the average when 1e2p was docked and worse than

average when 1ki8 and 2ki5 were docked. Overall, the protein structure was predicted

with an average RMSD of 0.44 Å for the eight successful dockings (lower than the

average RMSD computed for each pair of structures). As discussed above, the Gln125

side chain of TK can adopt two distinct conformations. FITTED predicts the right

conformation in 7 of the 8 successful docking cases. This is a good indicator of the

predictive power of FITTED when the semi-flexible option is selected. The docking to the

fully flexible protein was less successful with an average RMSD of 0.78 Å but still below

the average RMSD for the 9 protein structures (average RMSD = 0.92 Å). In this last

study the Gln125 side chain was misoriented in 3 out of the 8 successful cases.

Data collected in Table 2.9 show that 1o2j, 1o3g and 1o3i were properly docked

while 1f0r was misdocked to the semi-flexible and flexible protein. Whether it was

docked to the rigid, semi-flexible or flexible proteins, two alternative poses (RMSD

~1.0 Å or RMSD ~9.5 Å) were proposed for 1ezq. However, the wrong pose was

assigned a better score when the fully flexible protein was used. The correct pose was

observed much less often (20% of the runs) than the wrong one. This result indicates that the

global minimum may be located in a sharp and deep energy well of the potential energy

surface that is difficult to find. In this series again, the prediction of the occurrence of

water molecules is good while the protein structure prediction is more disappointing.

1bwi was consistently misdocked to the rigid, semi-flexible and fully flexible

protein. More interestingly, 1d8m was misdocked to its corresponding protein crystal

structure but properly docked to the 1b8y protein structure and to the semi and fully

flexible protein. A closer look at the 1d8m data showed that this inhibitor was properly

docked when most dissimilar protein structures were used. This may indicate that some


fine adjustments of the protein in the crystal structure of 1d8m would be required. In

order to account for these slight moves, FITTED has selected a more appropriate protein

structure. This may also indicate a poor accuracy of the protein structure prediction for

this enzyme or a poor description of the metal chelation.

This exhaustive docking study demonstrated that the scoring function can not only

assign high scores to the experimentally observed pose but also discriminate between

protein structures. It also shows that in specific cases such as 1qbs or 1d8m, flexibility

improves the accuracy over self-docking.

The scores given to the final docked poses were also compared and showed that

they are all within 1 unit for each compound regardless of the protein flexibility method.

The scoring function is being further investigated and improvements will be reported in

due course.


Table 2.1 - Self-docking – HIV-1 protease inhibitors.

        Obs.        Docking to protein(a)    Docking to protein + water molecule(b)
        Water(c)    Lig(d)     Score(e)      Lig(d)    Pred. Water(f)    Score(e)
1b6l    1           1.10       -9.8          0.55      1                 -11.6
1eby    1           2.68       -9.3          4.55      0                 -9.4
1hpo    0           2.29       -9.1          0.94      0                 -10.1
1hpv    1           1.02       -8.7          1.19      1                 -8.5
1pro    0           0.82       -5.2          0.72      0                 -5.7
1ajv    0           0.91(g)    -10.0         0.59      0                 -11.4
1ajx    0           0.82       -9.4          0.77      0                 -9.9
1hvr    0           0.49       -11.7         0.40      0                 -12.2
1hwr    0           0.60       -7.5          0.52      0                 -8.1
1qbs    0           5.05       -7.5          5.14      0                 -8.2

(a) Water molecules removed prior to docking.
(b) Water molecule known as Water 301 was retained and the function describing the interaction between ligand and water molecules is applied.
(c) Water molecule observed or not in crystal structures: 1 and 0 define the presence or absence of the water molecule respectively.
(d) RMSD (in Å): criterion of success of 2.0 Å.
(e) Score in arbitrary units.
(f) Water molecules as proposed by FITTED.
Bold numbers highlight failures.


Table 2.2 - Self-docking – Thymidine kinase inhibitors.

        Obs. water      Docking to protein(a)    Docking to protein + water molecule(b)
        molecules(c)    Lig(d)     Score(e)      Lig(d)     Pred. water molecules(f)    Score(e)
1e2k    1 0 1 1 1 1     0.63       -6.1          0.66       1 0 1 1 0 1                 -7.1
1e2p    1 0 1 1 1 1     2.69(g)    -4.7          2.03(g)    1 0 1 1 1 1                 -5.2
1ki3    0 1 0 0 0 1     1.86       -5.9          1.84       0 1 0 1 1 1                 -6.1
1ki4    1 0 1 1 1 1     0.43       -6.9          0.66       1 0 1 1 1 1                 -7.7
1ki7    1 0 1 1 1 0     5.79       -5.1          5.76       1 0 1 1 1 1                 -4.8
1ki8    1 0 1 1 1 0     0.77       -6.2          0.64       1 0 1 1 1 1                 -6.9
2ki5    0 1 1 1 1 1     1.10       -5.5          0.45       0 1 0 1 1 1                 -6.3
1of1    1 0 1 1 1 1     0.37       -6.1          0.29       1 0 1 1 1 1                 -6.8
1qhi    0 1 1 1 1 0     0.47       -7.2          0.66       0 1 0 1 1 1                 -7.8

(a) Water molecules removed prior to docking.
(b) 2 to 6 water molecules (see text) were retained and the function describing the interaction between ligand and water molecules is applied.
(c) Water molecules observed or not in crystal structures: 1 and 0 define the presence or absence of each water molecule respectively.
(d) RMSD (in Å): criterion of success of 2.0 Å.
(e) Score in arbitrary units.
(f) Water molecules as proposed by FITTED.
(g) When considering the meso nature of 1e2p ligand, these RMSD’s were equivalent to RMSD’s below 1.0 Å (see text).
Bold numbers highlight failures.


Table 2.3 - Self-docking – Factor Xa, trypsin and MMP-3 inhibitors.

        Obs.        Docking to protein(a)    Docking to protein + water molecule(b)
        Water(c)    Lig(d)    Score(e)       Lig(d)    Pred. Water(f)    Score(e)
1ezq    1 0         3.32      -7.4           0.82      0 0               -11.1
1f0r    1 1         2.33      -8.3           1.88      0 0               -8.1
1fjs    1 0         3.64      -7.7           1.78      0 0               -8.8
1nfu    0 0         2.57      -8.9           1.50      0 0               -8.0
1xka    1 0         1.13      -8.3           0.87      0 0               -8.4
1f0u    1 -         2.92      -5.9           3.95      1 -               -6.7
1o2j    1 -         1.03      -5.9           0.94      1 -               -6.2
1o3g    1 -         1.35      -6.9           1.69      1 -               -7.3
1o3i    1 -         0.70      -6.5           0.68      1 -               -6.7
1qbo    1 -         3.84      -7.6           3.49      1 -               -6.8
1b8y    - -         1.15      -9.4           -         - -               -
1bwi    - -         6.35      -5.8           -         - -               -
1ciz    - -         1.22      -10.6          -         - -               -
1d8m    - -         2.99      -6.0           -         - -               -

(a) Water molecules removed prior to docking.
(b) None to 2 water molecules (see text) were retained and the function describing the interaction between ligand and water molecules is applied.
(c) Water molecules observed or not in crystal structures: 1 and 0 define the presence or absence of each water molecule respectively.
(d) RMSD (in Å): criterion of success of 2.0 Å.
(e) Score in arbitrary units.
(f) Water molecules as proposed by FITTED.
Bold numbers highlight failures.


Table 2.4 - Cross-docking and docking to multiple conformations – HIV-1 protease inhibitors.

        Docking to rigid proteins               Statistics for the best scoring pose(a)
        1b6l    1eby    1hpo    1hpv    1pro    Lig(b)    Pro(c)    Water(d)    Score(e)
1b6l    0.55    0.83    3.37    1.11    1.04    0.55      0.00      1           -11.6
1eby    2.57    4.55    2.86    6.15    2.72    2.86      0.96      1           -8.9
1hpo    4.64    3.62    0.94    4.26    2.4     0.94      0.00      0           -10.1
1hpv    4.09    3.39    2.01    1.19    3.53    2.01      1.00      0           -8.8
1pro    0.62    1.01    0.86    0.78    0.72    0.72      0.00      0           -5.7

        1ajv    1ajx    1hvr    1hwr    1qbs
1ajv    0.59    1.26    1.46    1.52    1.12    1.12      0.87      0           -10.1
1ajx    0.73    0.77    1.1     0.75    0.73    0.73      0.81      0           -9.6
1hvr    1.9     1.27    0.4     1.22    0.77    0.77      0.72      0           -11.9
1hwr    0.68    0.78    0.85    0.52    0.67    0.78      0.84      0           -8.9
1qbs    5.35    1.49    1.17    5.11    5.15    1.17      0.72      0           -10.3

(a) Each ligand was docked to the 5 protein structures and the best scoring of the 5 final poses was selected.
(b) RMSD (in Å): criterion of success of 2.0 Å.
(c) RMSD (in Å): criterion of success: better than average RMSD; average RMSD between protein structures computed on the binding site residues: 0.91 Å for the first five structures (one Asp 25 protonated) and 0.77 Å for the last five structures (AspA25 and AspB25 protonated).
(d) Water molecules as proposed by FITTED; 1 and 0 define the presence or absence of the water molecule, respectively.
(e) Score in arbitrary units.
Bold numbers highlight failures.


Table 2.5 - Cross-docking and docking to multiple conformations – Thymidine kinase inhibitors.

        Docking to rigid proteins
        1e2k       1e2p       1ki3    1ki4    1ki7    1ki8    2ki5       1of1    1qhi
1e2k    0.66       2.11       3.42    0.76    0.83    0.84    0.96       0.78    1.31
1e2p    2.24(f)    2.03(f)    0.97    0.74    0.78    1.41    2.75(f)    1.20    2.03(f)
1ki3    2.36       2.62       1.84    2.43    2.25    2.50    2.61       2.94    1.74
1ki4    2.48       2.36       3.38    0.66    1.11    0.73    2.35       2.52    1.00
1ki7    5.79       5.67       5.55    5.08    5.76    5.25    5.15       5.65    5.67
1ki8    3.8        3.93       2.35    1.91    1.19    0.64    3.92       3.84    1.29
2ki5    2.29       3.22       1.28    2.13    2.08    1.90    0.45       2.22    1.21
1of1    0.39       0.49       1.14    0.60    0.78    0.89    0.49       0.29    0.81
1qhi    2.41       2.29       1.18    5.43    1.68    2.13    1.10       2.19    0.67

        Statistics for the best scoring pose(a)
        Lig(b)     Pro(c)    Water(d)       Score(e)
1e2k    0.66       0.00      1 0 1 1 0 1    -8.1
1e2p    2.24(f)    0.59      1 0 0 1 1 1    -6.2
1ki3    1.74       0.78      1 0 1 1 1 1    -6.1
1ki4    0.66       0.00      1 0 1 1 1 1    -8.2
1ki7    5.67       0.87      0 0 0 1 1 1    -5.8
1ki8    0.64       0.00      1 0 1 1 1 1    -7.4
2ki5    1.90       1.11      1 0 1 1 1 1    -6.7
1of1    0.29       0.00      1 0 0 1 1 1    -7.3
1qhi    0.67       0.00      0 1 1 1 1 1    -6.1

(a) Each ligand was docked to the 9 protein structures and the best scoring of the 9 final poses was selected.
(b) RMSD (in Å): criterion of success of 2.0 Å.
(c) RMSD (in Å): criterion of success: better than average RMSD; average RMSD between protein structures computed on the binding site residues: 0.92 Å.
(d) Water molecules as proposed by FITTED; 1 and 0 define the presence or absence of the water molecule, respectively.
(e) Score in arbitrary units.
(f) Equivalent to RMSDs below 1.0 Å if the meso nature of the ligand is considered.
Bold numbers highlight failures.

Table 2.6 - Cross-docking and docking to multiple conformations – Factor Xa, trypsin and MMP-3 inhibitors.

        Docking to rigid proteins               Statistics for the best scoring pose(a)
        1ezq    1f0r    1fjs    1nfu    1xka    Lig(b)    Pro(c)    Water(d)    Score(e)
1ezq    0.82    9.14    3.53    3.45    4.8     0.82      0.00      0 0         -11.1
1f0r    2.42    1.89    2.31    2.49    2.42    2.49      0.75      0 0         -9.4
1fjs    3.3     2.81    1.78    2.24    3.22    1.78      0.00      0 0         -8.8
1nfu    2.05    2.17    2.26    1.5     3.79    1.5       0.00      0 0         -9.7
1xka    1.66    1.5     1.64    1.58    0.87    0.87      0.00      0 0         -8.4

        1f0u    1o2j    1o3g    1o3i    1qbo
1f0u    3.95    4.21    2.16    4.98    5.50    5.50      0.74      1 -         -6.9
1o2j    1.33    0.94    0.80    4.16    1.43    0.80      0.55      1 -         -6.3
1o3g    0.59    0.79    1.69    0.67    1.22    0.67      0.31      1 -         -7.6
1o3i    0.59    0.93    0.94    0.69    1.06    0.69      0.00      1 -         -6.7
1qbo    5.23    4.14    3.89    4.30    3.49    3.89      1.09      1 -         -7.4

        1b8y    1bwi    1ciz    1d8m
1b8y    1.15    1.51    1.38    2.30    -       1.15      0.00      - -         -9.4
1bwi    5.64    6.35    8.95    6.40    -       6.35      0.00      - -         -5.8
1ciz    1.15    4.53    1.23    4.33    -       1.15      0.45      - -         -10.1
1d8m    1.22    6.23    2.21    2.99    -       1.22      1.11      - -         -7.3

(a) Each ligand was docked to the 5 protein structures of its family (4 for MMP-3) and the best scoring of the final poses was selected.
(b) RMSD (in Å): criterion of success of 2.0 Å.
(c) RMSD (in Å): criterion of success: better than average RMSD; average RMSD between protein structures computed on the binding site residues: factor Xa: 0.86 Å, trypsin: 0.90 Å, MMP-3: 0.92 Å.
(d) Water molecules as proposed by FITTED; 1 and 0 define the presence or absence of the water molecule, respectively.
(e) Score in arbitrary units.
Bold numbers highlight failures.


Table 2.7 - Docking to flexible proteins - HIV-1 protease inhibitors.

        Docking to semi-flexible protein            Docking to fully flexible protein
        Lig(a)    Pro(b)    Water(c)    Score(d)    Lig(a)    Pro(b)    Water(c)    Score(d)
1b6l    1.06      0.00      1           -11.0       1.08      0.53      1           -11.4
1eby    3.62      0.85      0           -9.3        6.06      1.02      0           -8.6
1hpo    4.03      0.99      0           -8.1        3.25      1.16      0           -8.5
1hpv    3.88      1.00      1           -10.3       1.54      0.79      1           -10.0
1pro    0.51      0.00      0           -5.9        0.94      0.59      0           -5.6
1ajv    0.75      0.00      0           -11.4       1.46      1.02      0           -10.6
1ajx    0.85      0.84      0           -9.3        1.77      0.72      0           -10.0
1hvr    1.72      0.00      0           -9.7        1.59      0.67      0           -11.6
1hwr    0.79      0.81      0           -8.8        0.58      0.71      0           -8.9
1qbs    1.22      0.72      0           -11.0       1.32      0.59      0           -11.0

(a) RMSD (in Å): criterion of success of 2.0 Å.
(b) RMSD (in Å): criterion of success: better than average RMSD; average RMSD between protein structures computed on the binding site residues: 0.91 Å for the first five structures (one Asp 25 protonated) and 0.77 Å for the last five structures (AspA25 and AspB25 protonated).
(c) Water molecules as proposed by FITTED; 1 and 0 define the presence or absence of the water molecule, respectively.
(d) Score in arbitrary units.
Bold numbers highlight failures.
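The 2.0 Å success criterion can be applied mechanically to a column of ligand RMSDs. The sketch below does so for the semi-flexible ligand RMSDs of Table 2.7; the function name is illustrative, and treating a value of exactly 2.0 Å as a success is an assumption, since the table footnote does not say whether the bound is inclusive.

```python
SUCCESS_CUTOFF = 2.0  # Å, criterion of success used throughout the tables

# Ligand RMSDs (Å) for docking to the semi-flexible HIV-1 protease (Table 2.7).
semi_flexible_lig_rmsd = {
    "1b6l": 1.06, "1eby": 3.62, "1hpo": 4.03, "1hpv": 3.88, "1pro": 0.51,
    "1ajv": 0.75, "1ajx": 0.85, "1hvr": 1.72, "1hwr": 0.79, "1qbs": 1.22,
}

def success_rate(rmsd_by_complex, cutoff=SUCCESS_CUTOFF):
    """Return (percentage of successes, list of successful PDB codes)."""
    # Assumption: an RMSD equal to the cutoff counts as a success.
    successes = [code for code, value in rmsd_by_complex.items() if value <= cutoff]
    return 100.0 * len(successes) / len(rmsd_by_complex), successes

rate, successful = success_rate(semi_flexible_lig_rmsd)
print(f"{len(successful)}/{len(semi_flexible_lig_rmsd)} ligands within "
      f"{SUCCESS_CUTOFF} Å ({rate:.0f}%)")
```

For this HIV-1 protease subset, seven of the ten ligands meet the criterion; the overall percentages in Table 2.11 pool all five protein families.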


Table 2.8 - Docking to flexible proteins - thymidine kinase inhibitors.

        Docking to semi-flexible protein
        Lig(a)    Pro(b)     Occurrence of water mol.(c)    Score(d)
1e2k    0.67      0.00       1 0 1 1 0 1                    -7.0
1e2p    0.51      0.88       1 0 0 1 0 1                    -5.8
1ki3    1.46      0.00       0 1 1 0 0 1                    -6.8
1ki4    0.64      0.00       1 0 1 1 1 1                    -7.3
1ki7    5.20      1.01       1 0 1 1 1 1                    -5.4
1ki8    0.60      0.96       1 0 1 1 1 1                    -6.7
2ki5    1.92      0.89       0 1 1 1 1 1                    -5.6
1of1    0.35      0.26(c)    1 0 1 1 1 1                    -6.7
1qhi    0.96      0.00       1 0 0 1 1 0                    -7.5

        Docking to fully flexible protein
        Lig(a)    Pro(b)     Occurrence of water mol.(c)    Score(d)
1e2k    0.75      0.61       1 0 1 1 0 1                    -7.2
1e2p    0.95      0.93       1 0 1 1 1 1                    -5.7
1ki3    1.35      0.90       0 0 0 1 1 1                    -6.8
1ki4    0.77      0.89       1 0 0 1 1 1                    -7.9
1ki7    5.25      1.11       0 1 0 1 1 0                    -6.5
1ki8    0.65      0.53       1 0 1 1 1 1                    -7.8
2ki5    1.62      0.80       1 0 1 1 1 1                    -6.8
1of1    0.75      0.99       1 0 1 1 1 1                    -7.3
1qhi    0.64      0.69       0 1 0 1 1 1                    -8.2

(a) RMSD (in Å): criterion of success of 2.0 Å.
(b) RMSD (in Å): criterion of success: better than average RMSD; average RMSD between protein structures computed on the binding site residues: 0.92 Å.
(c) Water molecules as proposed by FITTED; 1 and 0 define the presence or absence of the water molecules, respectively.
(d) Score in arbitrary units.
Bold numbers highlight failures.


Table 2.9 - Docking to flexible proteins – Factor Xa, trypsin and MMP-3 inhibitors.

        Docking to semi-flexible protein             Docking to fully flexible protein
        Lig(a)     Pro(b)    Water(c)    Score(d)    Lig(a)     Pro(b)    Water(c)    Score(d)
1ezq    1.34       0.00      1 0         -10.2       9.64(e)    0.92      1 0         -8.5
1f0r    2.50(d)    0.75      0 0         -8.1        2.32(f)    0.63      0 0         -9.7
1fjs    2.45       0.77      0 0         -9.1        3.24       1.11      1 0         -8.6
1nfu    1.87       0.70      0 0         -8.7        1.17       0.71      0 0         -9.6
1xka    1.31       0.91      0 0         -8.2        1.52       0.70      0 0         -8.7
1f0u    6.11       1.04      1 -         -6.7        4.25       0.87      1 -         -7.6
1o2j    1.06       0.33      1 -         -6.5        1.30       0.58      1 -         -7.1
1o3g    0.83       0.66      1 -         -7.2        0.82       0.77      1 -         -7.7
1o3i    1.24       1.32      1 -         -5.9        0.62       0.67      1 -         -6.9
1qbo    4.48       1.32      1 -         -6.6        3.65       0.78      1 -         -7.7
1b8y    0.95       1.11      - -         -9.5        1.40       0.67      - -         -10.1
1bwi    5.40       1.14      - -         -5.4        6.14       0.55      - -         -6.2
1ciz    2.01       0.45      - -         -9.7        1.39       1.19      - -         -10.8
1d8m    1.03       1.11      - -         -7.4        1.37       1.49      - -         -8.0

(a) RMSD (in Å): criterion of success of 2.0 Å.
(b) RMSD (in Å): criterion of success: better than average RMSD; average RMSD between protein structures computed on the binding site residues: Factor Xa: 0.86 Å, trypsin: 0.90 Å, MMP-3: 0.92 Å.
(c) Water molecules as proposed by FITTED; 1 and 0 define the presence or absence of the water molecule, respectively.
(d) Score in arbitrary units.
(e) The second best has an RMSD of 1.36 Å with a higher potential energy but a better score.
(f) Poses with RMSD below 1.5 Å were found but given worse scores.
Bold numbers highlight failures.


Table 2.10 - Docking accuracy (%): rigid proteins.

           Docking to protein(a)    Docking to protein + water molecule(b)    Cross-docking
           Lig(c)                   Lig(c)    Water(d)                        Lig(c)
Success    64                       76        82                              47

(a) Water molecules removed prior to self-docking.
(b) Bridging water molecules (see experimental section) were retained and the function describing the interaction between ligand and water molecules was applied.
(c) RMSD (in Å): criterion of success of 2.0 Å.
(d) Criterion of success: occurrence predicted when ligand successfully docked.

Table 2.11 - Docking accuracy (%): flexible proteins.

           Multiple conformations(a)        Semi-flexible protein            Fully flexible protein
           Lig(b)    Pro(c)     Water(d)    Lig(b)    Pro(c)     Water(d)    Lig(b)    Pro(c)    Water(d)
Success    79        47 / 76    73          73        27 / 61    82          73        0 / 73    81

(a) Best scoring poses from self- and cross-docking studies (see text).
(b) RMSD (in Å): criterion of success of 2.0 Å.
(c) RMSD (in Å) calculated on successful dockings (ligand correctly docked); two percentages of success are given, following two different criteria of success: exact protein structure (RMSD = 0.0 Å) / RMSD below average.
(d) Criterion of success: occurrence predicted.
(e) The success rates are computed on the systems with the ligand successfully docked.

Discussion. Table 2.10 and Table 2.11 summarize the accuracy observed throughout this study. It is worth mentioning that the tables show data for the top scoring poses only. This study was designed to assess the impact on the accuracy of FITTED of, on the one hand, the energy term used to model “displaceable” water molecules and, on the other hand, protein flexibility. First, a clear increase in accuracy was observed when the “displaceable” water molecules were added, which validated the concept (Table 2.10). Overall, FITTED self-docked 76% of the inhibitors within 2.0 Å of the observed binding modes when the water was considered and only 64% when it was removed. In addition, the occurrence of water molecules was predicted with 82% accuracy for the successfully docked ligands (Table 2.10).
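To make the water-occurrence criterion concrete, the sketch below compares observed and predicted occupancy vectors for two thymidine kinase complexes taken from Table 2.2. Scoring each water site separately is an assumption made for illustration; the text does not state whether the reported percentage is computed per site or per complex.

```python
def occupancy_agreement(observed, predicted):
    """Fraction of water sites whose predicted presence (1) or absence (0) matches the crystal structure."""
    if len(observed) != len(predicted):
        raise ValueError("occupancy vectors must have the same length")
    matches = sum(o == p for o, p in zip(observed, predicted))
    return matches / len(observed)

# Observed and predicted occupancies of the six key water sites (Table 2.2).
observed = {"1e2k": (1, 0, 1, 1, 1, 1), "1ki4": (1, 0, 1, 1, 1, 1)}
predicted = {"1e2k": (1, 0, 1, 1, 0, 1), "1ki4": (1, 0, 1, 1, 1, 1)}

agreement = {code: occupancy_agreement(observed[code], predicted[code]) for code in observed}
for code, value in agreement.items():
    print(f"{code}: {100 * value:.0f}% of water sites predicted correctly")
```

Here 1ki4 reproduces all six observed occupancies, whereas 1e2k differs at one site.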

As a comparison, Kontoyianni and co-workers4 found GOLD and Glide to be the most accurate programs, with 69% and 58% of the compounds docked in a manner similar to the experimentally observed mode (referred to as “close” in Kontoyianni’s report), while LigandFit, FlexX and DOCK showed poorer predictive power. In another comparative study, Brooks and co-workers3 docked 73% and 46% of the compounds with RMSD below 2.0 Å with ICM and GOLD respectively, while AutoDock, DOCK and FlexX were less accurate. Rognan and co-workers2 performed a similar study and found that Glide, GOLD, Surflex and QXP docked 80 to 90% of the inhibitors within 2.0 Å of the observed pose, while FlexX, Fred, DOCK and Slide showed lower accuracy (50-65%). Another study carried out by Perola and co-workers5 showed that Glide (61% within 2.0 Å) outperformed GOLD and ICM (48% and 45% respectively). Although each study was based on a different set of protein-ligand complexes, our validation study shows that FITTED performed very well, with accuracy as high as or higher than that of the best performing docking programs. More importantly, FITTED accounts for both protein flexibility and displaceable water molecules, whereas GOLD includes water molecules but restricts protein flexibility to the polar hydrogens, and Glide considers neither protein flexibility nor water molecules.

With these data in hand, we turned our attention to the benefit and impact of protein flexibility. Self-docking is the ideal case, as the protein is already molded to the ligand structure. In contrast, cross-docking combines ligands and protein structures that were not co-crystallized, and it is well known that docking programs perform worse when cross-docking is carried out.10 In practice, docking experiments that consider protein flexibility should be more accurate than cross-docking experiments and ideally as accurate as self-docking. In the present study, 47% and 76% accuracy were recorded for the cross- and self-docking experiments, respectively. Gratifyingly, the accuracy of FITTED when docking ligands to flexible proteins (73%, Table 2.11) is similar to that seen in self-docking. In addition, neither the prediction of the water molecule occurrence nor the ligand pose is affected by adding protein flexibility, as no significant drop in accuracy is observed when moving from rigid (self-docking) to flexible proteins.

A close look at the predicted protein structures revealed the good accuracy of our protocol. For instance, while 5 to 9 protein structures were used as input for each experiment, the correct conformation was selected for 9 of the 24 (37%) correctly docked systems when the semi-flexible protein was used. With a random selection, only 11 to 20% of the cases (1 in 9 to 1 in 5) would present the correct structure. For the successful docking experiments (ligand pose accurately predicted), average protein RMSDs of 0.50 Å and 0.78 Å were recorded with the semi-flexible and fully flexible proteins, respectively.
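The random-selection baseline quoted above follows from simple arithmetic: with N input conformations, a random pick is correct with probability 1/N. The short sketch below reproduces the numbers; the variable names are illustrative only.

```python
def random_selection_rate(n_conformations):
    """Percentage of cases in which a random pick among n input conformations is the correct one."""
    return 100.0 / n_conformations

baseline_low = random_selection_rate(9)    # nine input structures (thymidine kinase)
baseline_high = random_selection_rate(5)   # five input structures (e.g., HIV-1 protease)
observed_rate = 100.0 * 9 / 24             # correct conformation in 9 of 24 docked systems

print(f"random baseline: {baseline_low:.0f}% to {baseline_high:.0f}%")
print(f"observed: {observed_rate:.1f}%")
print(f"improvement over random: {observed_rate / baseline_high:.1f}x to "
      f"{observed_rate / baseline_low:.1f}x")
```

The observed rate (37.5%, reported as 37% in the text) is therefore roughly two to three times what random selection would give.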

Unexpected results were also recorded. In two cases (MMP-3: 1d8m and HIVP: 1qbs), docking to flexible proteins was more accurate than self-docking. Figure 2.9 shows a superposition of the crystal structure of 1d8m and the docked structure obtained with the semi-flexible option. In the crystal structure, hydrogen bonds are observed between the protein backbone (Ala93 and Leu92) and one of the sulfonamide oxygen atoms of the inhibitor. In contrast, the docked pose shows weak hydrogen bonds but strong hydrophobic/π-stacking interactions with the His119 and Tyr141 side chains located in the S1’ pocket. To induce this interaction pattern, the S1’ pocket must be more closed in the modeled structure than in the crystal structure so that it encloses the ligand better. This discrepancy may indicate that the hydrogen bonding contribution to the binding is underestimated or that the van der Waals interactions are overestimated.


Figure 2.9 - Crystal structure (protein in grey, ligand in gray) and proposed docked

model (protein in blue, inhibitor in green) for the 1d8m complex.

In the case of 1qbs, no clear explanation was found. A close look at the docked and crystal complexes does not reveal any specific movement or steric clash. We therefore believe that the potential energy function may pick up some nuances that are detrimental to the correct pose. Again, slight adjustments of the crystal structures may be necessary prior to or upon docking. Under this hypothesis, the crystal structure of 1qbs includes some discrepancies that induce slight van der Waals repulsions, preventing a good score when docking the 1qbs inhibitor to the 1qbs protein structure. These repulsions would vanish when the flexible protein is used, and much lower scores were indeed recorded. Possible strategies to address this issue are the use of relaxed structures (as proposed in Glide33), soft structures as proposed by Shoichet and co-workers34 or flexible structures as shown in this study. These two unexpected results may reveal some inaccuracies of the scoring function.

Overall, this study demonstrated the ability of FITTED to dock inhibitors accurately to flexible and partially solvated proteins and validated it on this set of representative protein-inhibitor complexes.

CONCLUSION

We have developed FITTED 1.0, a unique docking program that accounts for both protein flexibility and bridging water molecules. The flexibility is handled by a genetic algorithm based on various genetic operators specific to FITTED (i.e., designed cross-over operator, focused mutation, filters). Modifications to the initial genetic algorithm have been made to increase the speed and accuracy by orienting the docking toward “favored” poses (e.g., poses within the cavity and fulfilling constraints). We have also implemented a new potential energy term that accurately accounts for dynamically bound water molecules. Application of FITTED to the docking of a variety of protein/inhibitor complexes resulted in proposed docked poses within 2.0 Å of the observed binding modes in 73% and 76% of the cases using flexible and rigid proteins, respectively. The accurate prediction of the occurrence of and need for displaceable water molecules was also demonstrated. Finally, the protein structures were predicted with reasonable accuracy.

Our initial studies led to a method that docked each compound within 0.5 to 20 hours when not considering rotation and orientation of the ligand as part of the chromosomes. FITTED, which now explores the entire conformational space of the ligands and considers protein flexibility and displaceable water molecules, docks all the tested compounds within 3 hours. Further studies are in progress to improve the scoring function and to reduce by a factor of 10 or more the required CPU time, which is still not appropriate for virtual screening.

EXPERIMENTAL SECTION.

PREPARATION OF THE TRAINING SET

General Considerations. The protein/ligand complexes were retrieved from the PDB or from the PDBbind database.35,36 The complexes were selected for the occurrence of water molecules, the flexibility of the protein structure, the diversity of the ligands, a resolution of 2.5 Å or better, and good binding affinities. At least four structures were sought for each system (MMP, HIVP, TK, FXa, trypsin). The complexes were set up using Maestro and/or InsightII. The complexes from the same family were superimposed prior to their use with FITTED. For FITTED to be used with more than one protein structure, the sequences have to be identical. Therefore, some minor mutations (often far from the binding site) were made (e.g., Arg14 into Lys14 in HIVP complex 1b6l), missing side chains were reconstructed (e.g., Arg220 in 1qhi), and names of residues were made identical (e.g., Glu124A into Glu124 in 1nfu). Hydrogens were next added and optimized by energy minimization. All the non-conserved waters were removed and missing key water molecules were added by analogy with other structures when applicable. For instance, water 301, observed in many HIVP/inhibitor complexes, is displaced in 1ajv; it was added to the 1ajv protein structure and its position optimized by energy minimization using AMBER94 as a force field. The naming of the water molecules was made homogeneous within each set. Each protein was then saved as a mol2 file and processed using ProCESS to assign protein atom types and charges. Each ligand was charged using Sybyl Gasteiger-Hückel charges and processed using SMART. Large grids of spheres were prepared, as well as constraint files. These constraints were loose in order to orient and speed up the docking (as previously described) without biasing the results. Constraint spheres with diameters as large as 8.0 Å were used.

HIV-1 Protease Inhibitor/Protein Complexes. HIVP complexes meeting the criteria defined above were retrieved from the PDB. 1eby (crystal structure resolution: 2.29 Å), 1hpo (2.50 Å), 1hpv (1.90 Å) and 1pro (1.80 Å) were superimposed onto 1b6l (1.75 Å) and one catalytic aspartic acid side chain was protonated. 1ajx (2.00 Å), 1hvr (1.80 Å), 1hwr (1.80 Å) and 1qbs (1.80 Å) were superimposed onto 1ajv (2.00 Å) and the two catalytic aspartic acid side chains were protonated, following the experimental study of similar complexes.32 A water molecule hydrogen-bonding to both Ile50 NH’s was kept when present and added when missing. The constraint applied requires a polar group to be located close to the catalytic site. As some of the inhibitors have a large number of rotatable bonds, initial populations of 200 individuals were used in all 10 cases.

Thymidine Kinase Inhibitor/Protein Complexes. All the available TK inhibitor/protein complexes were retrieved from the PDB and filtered. A final set of nine structures was used (1e2k, 1e2p, 1ki3, 1ki4, 1ki7, 1ki8, 2ki5, 1of1, 1qhi). Six key water molecules were considered, as discussed in the text. The constraint requires polar groups to be located within the catalytic site.

Factor Xa Inhibitor/Protein Complexes. These complexes were retrieved from the PDBbind database. 1f0r, 1fjs, 1nfu and 1xka were superimposed onto 1ezq. In this set, two key water molecules were identified and added to the protein structures when missing. The constraint requires a polar group to be located close to the Asp189 side chain. Initial populations of 200 individuals were used.

Trypsin Inhibitor/Protein Complexes. 1o2j, 1o3g, 1o3i and 1qbo were superimposed onto 1f0u (1.9 Å). In this case, a single water molecule interacting with Leu227 was considered. The constraint requires a polar group to be located close to the Asp171 side chain. Initial populations of 200 individuals were used.

MMP Inhibitor/Protein Complexes. 1bwi, 1ciz, and 1d8m were superimposed onto 1b8y. No water molecules were retained. The constraint requires a polar group to be located close to the catalytic zinc atom. Specific zinc (van der Waals and metal chelation) and hydroxamic acid (internal energy) parameters were added to the force field. The Lennard-Jones 12-10 (LJ12-10) potential parameters used for the zinc atom were designed to reproduce the observed energy of zinc chelation.37
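For readers unfamiliar with the 12-10 form, the sketch below writes the general potential A/r^12 - B/r^10 in terms of a well depth and the distance of the minimum, E(r) = ε[5(r_min/r)^12 - 6(r_min/r)^10], which has its minimum of -ε at r = r_min. The parameter values are placeholders chosen for illustration; they are not the zinc parameters used in this work, and the exact parametrization used in FITTED is not given in this section.

```python
def lj_12_10(r, well_depth, r_min):
    """12-10 potential: E(r) = eps * [5 (r_min/r)^12 - 6 (r_min/r)^10].

    The minimum, E = -well_depth, is reached at r = r_min.
    """
    if r <= 0:
        raise ValueError("distance must be positive")
    ratio = r_min / r
    return well_depth * (5.0 * ratio ** 12 - 6.0 * ratio ** 10)

# Placeholder parameters (hypothetical, for illustration only).
EPS = 10.0    # well depth, kcal/mol
R_MIN = 2.1   # distance of the energy minimum, Å

for distance in (1.8, 2.1, 2.5, 4.0):
    print(f"r = {distance:.1f} Å  E = {lj_12_10(distance, EPS, R_MIN):8.3f} kcal/mol")
```

Compared with the usual 12-6 form, the 10th-power attraction decays faster with distance, which gives a narrower and more directional well, a common choice for hydrogen-bond-like or metal-ligand contacts.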

DOCKING STUDY

Self-docking, Semi-flexible Protein and Fully Flexible Protein. In the first of these three sets of runs (self- and cross-docking, docking to multiple conformations), a single protein structure was used as input to evaluate the accuracy of the docking algorithm. In the second set (docking to semi-flexible proteins), the protein structure was restricted to the four to nine input conformations. In the third set (docking to fully flexible proteins), the protein structures were composites of the four to nine input conformations. A typical keyword file with all the default parameters is given as supplemental material. The default parameters (e.g., 10 runs, population size of 100 individuals) were used unless otherwise stated.
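The difference between the semi-flexible and fully flexible modes can be pictured as a difference in how the protein is encoded. In the sketch below, a semi-flexible protein is described by a single index selecting one whole input conformation, whereas a fully flexible protein carries one index per independently varying part, so that the final structure is a composite. The notion of "flexible parts" and their number are hypothetical and introduced only for illustration; this is not the FITTED chromosome definition.

```python
import random

def semi_flexible_gene(n_conformations, rng):
    """One index: the whole protein adopts a single input conformation."""
    return rng.randrange(n_conformations)

def fully_flexible_genes(n_conformations, n_flexible_parts, rng):
    """One index per flexible part: the protein is a composite of the input conformations."""
    return [rng.randrange(n_conformations) for _ in range(n_flexible_parts)]

def count_protein_states(n_conformations, n_flexible_parts, fully_flexible):
    """Number of distinct protein structures reachable in each mode."""
    if fully_flexible:
        return n_conformations ** n_flexible_parts
    return n_conformations

rng = random.Random(0)
N_CONFORMATIONS = 5     # e.g., five input crystal structures
N_FLEXIBLE_PARTS = 10   # hypothetical number of independently varying parts

print("semi-flexible gene:", semi_flexible_gene(N_CONFORMATIONS, rng))
print("fully flexible genes:", fully_flexible_genes(N_CONFORMATIONS, N_FLEXIBLE_PARTS, rng))
print("states (semi-flexible):", count_protein_states(N_CONFORMATIONS, N_FLEXIBLE_PARTS, False))
print("states (fully flexible):", count_protein_states(N_CONFORMATIONS, N_FLEXIBLE_PARTS, True))
```

Under these assumptions the fully flexible search space grows exponentially with the number of flexible parts, which is one way to read why the exact crystal conformation is recovered far less often in the fully flexible runs (Table 2.11).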

ProCESS and FITTED Parameters. The ensemble of spheres defining the binding-site cavity was centered on the center of the cavity and did not exceed 28 Å in length. The grid resolution was 1.5 Å.
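As a rough illustration of these two parameters, the sketch below builds a cubic lattice of candidate sphere centres with 1.5 Å spacing whose extent does not exceed 28 Å, centered on a given point. This is only the bounding lattice under stated assumptions; how ProCESS selects the spheres that actually fill the cavity is not described here, and the function name is hypothetical.

```python
import itertools

def cubic_grid(center, max_length=28.0, spacing=1.5):
    """Lattice points centered on `center`, spaced by `spacing`, spanning at most `max_length` per axis."""
    half_steps = int((max_length / 2.0) // spacing)   # whole steps that fit in half the box
    offsets = [k * spacing for k in range(-half_steps, half_steps + 1)]
    cx, cy, cz = center
    return [(cx + dx, cy + dy, cz + dz)
            for dx, dy, dz in itertools.product(offsets, repeat=3)]

grid = cubic_grid(center=(0.0, 0.0, 0.0))
extent = max(p[0] for p in grid) - min(p[0] for p in grid)
print(f"{len(grid)} candidate sphere centres, extent {extent:.1f} Å per axis")
```

With these values the lattice has 19 points per axis, which shows how quickly the number of candidate points grows as the grid resolution is refined.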


ACKNOWLEDGMENTS

We thank Virochem Pharma for financial support and a scholarship to CRC as well

as the Canadian Foundation for Innovation for financial support through the New

Opportunities Fund program. PE is supported by a scholarship from Canadian Institutes

of Health Research (Strategic Training Initiative in Chemical Biology). We thank

RQCHP for generous allocation of computer resources.

Supporting Information Available. Typical keyword files for FITTED and ProCESS. A

detailed description of the validation set (PDB codes, structures, Ki’s).


REFERENCES

1. Rester, U. Dock around the Clock – Current Status of Small Molecule Docking and

Scoring. QSAR Comb. Sci. 2006, 25, 605–615.

2. Bissantz, C.; Folkers, G.; Rognan, D. Protein-Based Virtual Screening of Chemical

Databases. 1. Evaluation of Different Docking/Scoring Combinations. J. Med.

Chem. 2000, 43, 4759–4767.

3. Bursulaya, B. D.; Totrov, M.; Abagyan, R.; Brooks, C. L., III. Comparative study of

several algorithms for flexible ligand docking. J. Comput.-Aided Mol. Des. 2003,

17, 755–763.

4. Kontoyianni, M.; McClellan, L. M.; Sokol, G. S. Evaluation of Docking

Performance: Comparative Data on Docking Algorithms. J. Med. Chem. 2004, 47,

558–565.

5. Perola, E.; Walters, W. P.; Charifson, P. S. A Detailed Comparison of Current

Docking and Scoring Methods on Systems of Pharmaceutical Relevance. Proteins:

Struct. Funct. Bioinf. 2004, 56, 235–249.

6. Kellenberger, E.; Rodrigo, J.; Muller, P.; Rognan, D. Comparative Evaluation of

Eight Docking Tools for Docking and Virtual Screening Accuracy. Proteins:

Struct. Funct. Bioinf. 2004, 57, 225–242.

7. Cummings, M. D.; DesJarlais, R. L.; Gibbs, A. C.; Mohan, V.; Jaeger, E. P.

Comparison of Automated Docking Programs as Virtual Screening Tools. J. Med.

Chem. 2005, 48, 962–976.

8. Kontoyianni, M.; Sokol, G. S.; McClellan, L. M. Evaluation of Library Ranking

Efficacy in Virtual Screening. J. Comput. Chem. 2005, 26, 11–22.

9. Cavasotto, C. N.; Abagyan, R. A. Protein Flexibility in Ligand Docking and Virtual

Screening to Protein Kinases. J. Mol. Biol. 2004, 337, 209–225.

10. Osterberg, F.; Morris, G. M.; Sanner, M. F.; Olson, A. J.; Goodsell, D. S.

Automated Docking to Multiple Target Structures: Incorporation of Protein

Mobility and Structural Water Heterogeneity in AutoDock. Proteins: Struct. Funct.

Genet. 2002, 46, 34–40.


11. Murray, C. W.; Baxter, C. A.; Frenkel, A. D. The Sensitivity of the Results of

Molecular Docking to Induced Fit Effects: Application to Thrombin, Thermolysin

and Neuraminidase. J. Comput.-Aided Mol. Des. 1999, 13, 547–562.

12. Murray, C. W.; Baxter, C. A.; Frenkel, A. D. The Sensitivity of the Results of

Molecular Docking to Induced Fit Effects: Application to Thrombin, Thermolysin

and Neuraminidase. J. Comput.-Aided Mol. Des. 1999, 13, 547–562.

13. Erickson, J. A.; Jalaie, M.; Robertson, D. H.; Lewis, R. A.; Vieth, M. Lessons in

Molecular Recognition: the Effects of Ligand and Protein Flexibility on Molecular

Docking Accuracy. J. Med. Chem. 2004, 47, 45–55.

14. Carlson, H. A. Protein Flexibility and Drug Design: How to Hit a Moving Target.

Curr. Opin. Chem. Biol. 2002, 6, 447–452.

15. Schnecke, V.; Kuhn, L. A. Virtual Screening with Solvation and Ligand-Induced

Complementarity. Perspect. Drug Discovery Des. 2000, 20, 171–190.

16. Claußen, H.; Buning, C.; Rarey, M.; Lengauer, T. FlexE: Efficient Molecular

Docking Considering Protein Structure Variations. J. Mol. Biol. 2001, 308, 377–

395.

17. Zavodszky, M. I.; Lei, M.; Thorpe, M. F.; Day, A. R.; Kuhn, L. A. Modeling

Correlated Main-Chain Motions in Proteins for Flexible Recognition. Proteins:

Struct. Funct. Bioinf. 2004, 57, 243–261.

18. Sherman, W.; Day, T.; Jacobson, M. P.; Friesner, R. A.; Farid, R. Novel Procedure

for Modeling Ligand/Receptor Induced Fit Effects. J. Med. Chem. 2006, 49, 534–

553.

19. Moitessier, N.; Westhof, E.; Hanessian, S. Docking of Aminoglycosides to

Hydrated and Flexible RNA. J. Med. Chem. 2006, 49, 1023–1033.

20. Moitessier, N.; Henry, C.; Maigret, B.; Chapleur, Y. Combining Pharmacophore

Search, Automated Docking, and Molecular Dynamics Simulations as a Novel

Strategy for Flexible Docking. Proof of Concept: Docking of Arginine-Glycine-

Aspartic Acid-like Compounds into the αvβ3 Binding Site. J. Med. Chem. 2004, 47,

4178–4187.


21. Moitessier, N.; Therrien, E.; Hanessian, S. A Method for Induced-fit Docking,

Scoring and Ranking of Flexible ligands. Application to Peptidic and

Pseudopeptidic BACE 1 Inhibitors. J. Med. Chem. 2006, 49, 5885–5894.

22. Wei, B. Q.; Weaver, L. H.; Ferrari, A. M.; Matthews, B. W.; Shoichet, B. K.

Testing a Flexible-Receptor Docking Algorithm in a Model Binding Site. J. Mol.

Biol. 2004, 337, 1161–1182.

23. CDiscover, 98.0; Accelrys, Inc.: San Diego, CA, 2001.

24. Fletcher, R.; Reeves, C. M. Function Minimization by Conjugate Gradients. Comp.

J. 1964, 7, 149–154.

25. Morris, G. M.; Goodsell, D. S.; Halliday, R. S.; Huey, R.; Hart, W. E.; Belew, R.

K.; Olson, A. J. Automated Docking Using a Lamarckian Genetic Algorithm and

an Empirical Binding Free Energy Function. J. Comput. Chem. 1998, 19, 1639–1662.

26. http://www.rcsb.org/pdb/file_formats/pdb/pdbguide2.2/guide2.2_frame.html

27. Corey, E. J.; Wipke, W. T. Computer-Assisted Design of Complex Organic

Syntheses. Science 1969, 166, 178–192.

28. Gasteiger, J.; Marsili, M. Iterative partial equalization of orbital electronegativity--

a rapid access to atomic charges. Tetrahedron 1980, 36, 3219–3228.

29. Haupt, R. L. 1995. Optimization of aperiodic conducting grids. 11th An. Rev.

Progress in Applied Computational Electromagnetics Conf. Monterey, CA.

30. Weiner, S. J.; Kollman, P. A.; Nguyen, D. T.; Case, D. A. An All Atom Force Field

for Simulations of Proteins and Nucleic Acids. J. Comput. Chem. 1986, 7, 230–252.

31. See for example: Verdonk, M. L.; Chessari, G.; Cole, J. C.; Hartshorn, M. J.;

Murray, C. W.; Nissink, J. W. M.; Taylor, R. D.; Taylor, R. Modeling Water

Molecules in Protein-Ligand Docking Using GOLD. J. Med. Chem. 2005, 48,

6504–6515.

32. Yamazaki, T.; Nicholson, L. K.; Torchia, D. A.; Wingfield, P.; Stahl, S. J.;

Kaufman, J. D.; Eyermann, C. J.; Hodge, C. N.; Lam, P. Y. S.; Ru, Y.; Jadhav, P.

K.; Chang, C.-H.; Weber, P. C. NMR and X-ray Evidence That the HIV Protease

Catalytic Aspartyl Groups Are Protonated in the Complex Formed by the Protease

and a Non-peptide Cyclic Urea Inhibitor. J. Am. Chem. Soc. 1994, 116, 10791–

10792.

33. Friesner, R. A.; Banks, J. L.; Murphy, R. B.; Halgren, T. A.; Klicic, J. J.; Mainz, D.

T.; Repasky, M. P.; Knoll, E. H.; Shelley, M.; Perry, J. K.; Shaw, D. E.; Francis, P.;

Shenkin, P. S. Glide: A New Approach for Rapid, Accurate Docking and Scoring.

1. Method and Assessment of Docking Accuracy. J. Med. Chem. 2004, 47, 1739–

1749.

34. Ferrari, A. M.; Wei, B. Q.; Costantino, L.; Shoichet, B. K. Soft Docking and

Multiple Receptor Conformations in Virtual Screening. J. Med. Chem. 2004, 47,

5076–5084.

35. Wang, R.; Fang, X.; Lu, Y.; Wang, S. The PDBbind database: Collection of

binding affinities for protein-ligand complexes with known three-dimensional

structures. J. Med. Chem. 2004, 47, 2977–2980.

36. Wang, R.; Fang, X.; Lu, Y.; Yang, C. Y.; Wang, S. The PDBbind database:

Methodologies and updates. J. Med. Chem. 2005, 48, 4111–4119.

37. Tiraboschi, G.; Gresh, N.; Giessner-Prettre, C.; Pedersen, L. G.; Deerfield, D. W.

Parallel Ab Initio and Molecular Mechanics Investigation of Polycoordinated Zn(II)

Complexes with Model Hard and Soft Ligands: Variations of Binding Energy and

of Its Components with Number and Charges of Ligands. J. Comput. Chem. 2000, 21,

1011–1039.


CHAPTER 3


CHAPTER THREE

Encouraged by the promising results obtained with the first version of FITTED discussed in the previous chapter, we further optimized the program to allow for the time-efficient virtual screening of a compound library. The initial version was quite slow; improvements in the selection of compounds to be docked and in the creation of the initial population increased the overall speed of the program. This increase in speed then enabled its application to the virtual screening of the Maybridge library against HCV polymerase and the discovery of two novel lead compounds.

This chapter is a copy of a published article and is reproduced with permission from the Journal of Chemical Information and Modeling. The article is cited as Corbeil, C. R.; Englebienne, P.; Yannopoulos, C. G.; Chan, L.; Das, S. K.; Bilimoria, D.; Heureux, L.; Moitessier, N., Docking Ligands into Flexible and Solvated Macromolecules. 2. Development and Application of FITTED 1.5 to the Virtual Screening of Potential HCV Polymerase Inhibitors. Journal of Chemical Information and Modeling 2008, 48, (4), 902-909. Copyright 2008, with permission from the American Chemical Society.



DOCKING LIGANDS INTO FLEXIBLE AND SOLVATED MACROMOLECULES. 2. DEVELOPMENT AND APPLICATION OF FITTED 1.5 TO THE VIRTUAL SCREENING OF POTENTIAL HCV POLYMERASE INHIBITORS.

ABSTRACT

HCV NS5B polymerase is a validated target for the treatment of hepatitis C,

known to be one of the most challenging enzymes for docking programs. In order to

improve the low accuracy of existing docking methods observed with this challenging

enzyme, we have significantly modified and updated FITTED 1.0, a recently reported

docking program, into FITTED 1.5. This enhanced version is now applicable to the virtual

screening of compound libraries and includes new features such as filters and

pharmacophore- or interaction site-oriented docking. As a first validation, FITTED 1.5

was applied to the testing set previously developed for FITTED 1.0 and extended to

include HCV polymerase inhibitors. This first validation showed an increased accuracy

as well as an increase in speed. It also showed that the accuracy towards HCV polymerase

is better than previously observed with other programs. Next, application of FITTED 1.5

to the virtual screening of the Maybridge library seeded with known HCV polymerase

inhibitors revealed its ability to recover most of these actives in the top 5% of the hit list.

As a third validation, further biological assays uncovered HCV polymerase inhibition for

selected Maybridge compounds ranked in the top of the hit list.

INTRODUCTION

Docking-based Virtual Screening. Various approaches to the design or identification of

new drugs have recently been developed and successfully applied, including both

experimental (e.g., SAR by NMR1) and computational approaches (e.g., docking2, 3). In

modern drug design, docking-based virtual screening (VS) methods provide a quick and

cost-effective alternative to high-throughput screening (HTS).2, 3 Many recent VS

applications have been reported and demonstrate an increasing level of accuracy for the

currently available methods.4, 5

However, to date, only a few docking programs (e.g.,

FlexX-Ensemble,6 AutoDock 4.0 7) can take into account conformational changes that

occur as a result of binding to a ligand. FITTED 1.0 (Flexibility Induced Through

Targeted Evolutionary Description) is a docking program that was recently developed

and validated against a set of co-crystallized protein/ligand complexes.8, 9

This program

not only accurately predicts the ligand binding mode, but it also predicts the optimal

protein conformation and the presence/absence and location of water molecules with a high

level of accuracy. Recently, HCV polymerase has been found to be a very challenging

protein for docking programs. Our long-term goal is to develop a docking-based virtual

screening tool that can be applied to as many proteins as possible. However,

the initial version of FITTED (v. 1.0) was found too slow and not applicable to large VS

studies. Efforts to increase the speed without affecting the accuracy were necessary. In

addition, its accuracy for HCV polymerase inhibitors was low. We report herein an

enhanced version of FITTED (v. 1.5), its validation and its application to the discovery of

HCV polymerase inhibitors.

Hepatitis C. Hepatitis C virus (HCV) is the major causative agent responsible for non-A,

non-B hepatitis, affecting over 170 million people worldwide. Chronic HCV infection

often results in liver fibrosis, liver cirrhosis, hepatocellular carcinoma, and other forms of

liver dysfunction.10, 11

Given the widespread impact of this disease, there is a substantial

medical need for the discovery of new and effective anti-HCV agents to complement

current therapies.12

The impetus for the identification of agents that will be part of a

potent and effective combination regimen is growing in view of the inevitability of the

development of drug resistant mutations. Extensive efforts have been devoted toward the

study of the NS5B RNA-dependent RNA polymerase due to its critical function in the

replication cycle of the virus. Positive results from several clinical trials have indeed

validated the HCV NS5B polymerase as a target for the therapy of HCV infections. For

example, nucleoside analogs (e.g., Valopicitabine (NM283) 1,13, 14 and R1626 2 15) and

non-nucleoside or allosteric (e.g., HCV-796 3 16) inhibitors of HCV NS5B have been shown

to be effective either alone or in combination with interferon (Figure 3.1). Others such as

VCH-759 17, 18 and GSK625433 19 are currently being evaluated in clinical trials.

Figure 3.1 - Selected HCV polymerase inhibitors.

HCV Polymerase, a Flexible Protein. We have recently shown that at least two major

conformations can be adopted by the HCV polymerase upon binding of inhibitors to an

allosteric site located in the thumb region.20, 21

The main difference appeared to be a

significant shift of the α helix T located in the binding site (Figure 3.2). It is therefore

critical to account for this HCV polymerase thumb binding site plasticity in both hit

identification and inhibitor design stages. In a recent comparative study by Warren et al.

it was shown that none of the assessed docking programs predicted the experimentally

observed binding modes of HCV polymerase inhibitors with high accuracy.22

In fact, the

flexibility of this protein can in part explain this poor accuracy. This challenge was the

starting point of the development of FITTED 1.5.

Figure 3.2 - Helix T perturbation upon inhibitor binding. Blue and grey ribbon

representations are from two different X-ray complexes. 20, 21


FITTED 1.0 and 1.5. As discussed above, two major aspects have to be considered for

implementation into FITTED 1.5. First, the newer version should be applicable to VS

studies. Second, FITTED 1.5 should be accurate enough with HCV polymerase inhibitors.

Flexible ligand/flexible protein docking programs have seldom been applied to

VS.23

In practice, taking flexibility into account significantly enlarges the search space,

thus reducing throughput and drastically impeding implementation in VS campaigns. We

hypothesized that evaluating FITTED in this context would assess the role of flexibility in

VS studies against HCV polymerase. FITTED 1.0 is a suite of programs for docking that

considers ligand and protein flexibility by means of a genetic algorithm24

while water

molecule displacement is accounted for by means of a specific potential energy

function.8, 25

During the docking process, the protein side-chains and backbone

conformations, the water molecule positions and the ligand torsion angles are coded as

genes and optimized through a combined Lamarckian/Darwinian evolution. This early

version of the program was developed to dock single compounds in proof-of-concept

studies with no consideration for CPU time requirements.8 This, obviously, is a serious

limitation in the context of a large VS study. In order to optimize the software for

efficiency and speed, a stepwise approach to identify and remove inappropriate

candidates (poses) early in the process was implemented in FITTED 1.5. In addition,

preliminary studies have shown that the accuracy of FITTED 1.0 with HCV polymerase

inhibitors should be improved. The various modifications and implementations, which

required major rewriting of the program, are listed and described in the following

sections.

SMART. SMART (Small Molecule Atom typing and Rotatable Torsion assignment) is a

module of FITTED used to prepare the ligands to be docked. In contrast to the original

program developed with FITTED 1.0, the current version now describes the compound

features with a bit string added to the compound’s mol2 file. The bit string includes the

following descriptors: molecular weight, number of rotatable bonds, net charge and

presence of functional groups such as known toxicophores or reactive groups (e.g., nitro

groups, aldehydes, and Michael acceptors) or labile imines. These descriptors are then

used by FITTED to filter out compounds not fulfilling Lipinski’s rules26

or having

undesired (user-defined) functionalities. These descriptors can also restrict the search to

compounds with needed functionalities (e.g., aldehydes and nitriles for reversible

covalent inhibitors). Although this simple approach is not expected to accurately

discriminate between drug-like molecules and non drug-like molecules, it will orient our

study towards a “cleaner” compound set.

Interaction Site Filter. FITTED 1.0 included a functionality that filtered out poses that did

not fulfill constraints imposed by the user (e.g., binding to metals). It also included a

function that ensured that poses were within the binding site (ClashScore) prior to any

further optimization or more complex scoring. ClashScore, which is a binary score, uses

a series of spheres representing the accessible cavity space. Each pose is then compared

to this set of spheres and a score (“in” or “out”) is computed. This crude score is used to

discard poses that are not located within the binding site. After modifications, FITTED 1.5

can pre-select the poses that are the most apt to be successfully docked with a number of

predefined interaction sites (Figure 3.3). A set of interaction sites is similar to a

pharmacophore but automatically generated by ProCESS from the protein structure alone.

FITTED also allows for the use of a manually created pharmacophore which may exploit

user expertise, as the one shown in Figure 3.4, or for the use of both automatically-

generated interaction sites and user-defined pharmacophores (Figure 3.3). The inclusion

of a pharmacophore component in virtual screening has been shown to enhance

efficiency and accuracy of docking methods in previous studies,27, 28

including HCV

polymerase.22

Figure 3.3 - Consensus docking. Application to the generation of the initial population.

Figure 3.4 - Binding site pharmacophore for HCV polymerase: red: hydrogen bond

acceptor; green: hydrophobic/aromatic; yellow: either hydrogen bond acceptor or

hydrophobic/aromatic.

The interaction sites (and pharmacophores) are represented by a series of spheres.

A sphere diameter defines the allowed volume of the constraint and a weight (w in eq 3.1)

other than one can be assigned to each sphere.

(3.1)  $W_{\mathrm{sphere}} = 100 \times \dfrac{w_{\mathrm{sphere}}}{\sum_{\mathrm{spheres}} w}$

(3.2)  $\mathrm{MatchScore}_{\mathrm{pose}} = \sum_{\mathrm{matched\ spheres}} W_{\mathrm{sphere}}$

Each generated pose is compared to the interaction site (and/or pharmacophore)

and a MatchScore (and/or PharmScore) (eq 3.2) ranging from 0 to 100% is computed. If the

atom types of the ligand atoms lying within the volume of the sphere match the

interaction site/pharmacophore sphere’s pharmacophoric properties, the weight of the

sphere is added to the MatchScore (or PharmScore) for that pose. FITTED 1.5 then

discards poses with a low MatchScore (and/or PharmScore), thereby reducing the

required CPU time by directing the docking toward strongly interacting poses. It is well

known that higher success rates are obtained when rescoring of poses is performed using

other scoring functions (consensus scoring). In the present version of FITTED, up to four

scores are computed while docking (ClashScore mentioned above, MatchScore,

PharmScore and GAFFScore derived from the computed General AMBER Force Field29

(GAFF) potential energy) and can be combined to discriminate active from inactive

compounds (Figure 3.3). These four scores are applied in decreasing order of speed,

fastest first, allowing FITTED to eliminate poses that score poorly with one function before

proceeding with the next one. This filtering of poses is carried out both during the

generation of the initial population (Figure 3.3) and during the evolution, and can

be viewed as consensus docking.30

This feature significantly reduces the time required to

dock a single compound and increases its accuracy in three ways. First, the MatchScore

and PharmScore are quicker to compute than the GAFF potential energy. Second, “bad”

poses are not considered for energy minimization, a time-consuming step in the docking

process. Third, poses with reasonable GAFFScores but poor chemical complementarity

with the protein were found using FITTED 1.0. Conversely, FITTED 1.5 assigns a low

MatchScore to these poses, thus reducing the number of false positives.
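The weighting and matching scheme described above can be sketched as follows. This is only an illustration under stated assumptions, not the FITTED implementation: the `Sphere` class, the property labels and the rule that a sphere is matched when at least one atom with the same label lies inside it are hypothetical simplifications, and a radius is used here where the text speaks of a diameter.

```python
import math
from dataclasses import dataclass


@dataclass
class Sphere:
    center: tuple   # (x, y, z) in Å
    radius: float   # Å (assumption: half of the sphere diameter)
    prop: str       # hypothetical property label, e.g. "HBA" or "hydrophobic"
    w: float = 1.0  # user-assigned weight


def normalized_weights(spheres):
    """Scale the raw weights so that they sum to 100 (cf. eq 3.1)."""
    total = sum(s.w for s in spheres)
    return [100.0 * s.w / total for s in spheres]


def match_score(atoms, spheres):
    """Sum the normalized weights of the matched spheres (cf. eq 3.2).

    atoms: list of ((x, y, z), prop) pairs describing one ligand pose.
    A sphere counts as matched when at least one atom carrying the same
    property label lies inside it.
    """
    score = 0.0
    for sphere, weight in zip(spheres, normalized_weights(spheres)):
        if any(prop == sphere.prop
               and math.dist(xyz, sphere.center) <= sphere.radius
               for xyz, prop in atoms):
            score += weight
    return score


# Toy example: three spheres, one of which is given a double weight.
spheres = [
    Sphere((0.0, 0.0, 0.0), 1.0, "hydrophobic", w=2.0),
    Sphere((5.0, 0.0, 0.0), 1.0, "HBA"),
    Sphere((0.0, 5.0, 0.0), 1.0, "HBA"),
]
pose = [
    ((0.2, 0.0, 0.0), "hydrophobic"),  # inside sphere 1, matching label
    ((5.0, 0.5, 0.0), "HBA"),          # inside sphere 2, matching label
    ((0.0, 5.0, 0.0), "hydrophobic"),  # inside sphere 3, wrong label
]
print(match_score(pose, spheres))
```

A pose would then be kept or discarded by comparing this score with a user-defined cutoff. Mixed-property spheres (such as the yellow sphere of Figure 3.4) are not handled in this sketch.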

ProCESS. ProCESS (Protein Conformation Ensemble System Setup) is the second module

of FITTED used to prepare the protein file. As described in our previous report,8 ProCESS

also prepares the set of spheres representing the cavity space used to compute ClashScore

(see above). The current version of ProCESS can now derive a set of interaction sites such

as ideal locations for hydrogen bond donors and acceptors, hydrophobic and aromatic

groups.

Quick Docking. When docking a potential ligand, FITTED generates an initial population

and then simulates its evolution. Although this is appropriate for “good” binders, it may

be inappropriate for compounds which are, for instance, too large or too hydrophobic and

therefore should be excluded prior to this time-consuming conformational search. For

this purpose, additional filters were implemented to prevent undesirable compounds from

being docked. First, compounds lacking the required pharmacophoric groups are

excluded. Then, FITTED generates random poses, up to a maximum number, to produce the

initial population. If 100,000 possible binding modes are generated without one being accepted

into the initial population (low MatchScore and/or not in the cavity), the program abandons

the compound and docks the next one on the list. This stage, based on simple shape

discrimination, does not require any CPU-intensive energy or score computation and can

be done within a few seconds per compound.
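The abort criterion described above can be illustrated with a short sketch. The callbacks `random_pose` and `is_acceptable` are hypothetical stand-ins for pose generation and for the inexpensive cavity and MatchScore checks. Resetting the rejection counter after each accepted pose is an assumption made here, since the text only specifies the case in which no pose at all is accepted.

```python
import random


def build_initial_population(random_pose, is_acceptable, pop_size,
                             max_rejections=100_000, seed=0):
    """Fill an initial population from random poses, or give up.

    Returns the list of accepted poses, or None when `max_rejections`
    consecutive poses were rejected, in which case the caller would move
    on to the next compound.
    """
    rng = random.Random(seed)
    population = []
    rejections = 0  # consecutive rejected poses
    while len(population) < pop_size:
        pose = random_pose(rng)
        if is_acceptable(pose):
            population.append(pose)
            rejections = 0
        else:
            rejections += 1
            if rejections >= max_rejections:
                return None
    return population
```

Because only inexpensive geometric tests are involved at this stage, a compound that cannot fit the site is rejected before any energy evaluation takes place.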

Refined Docking. A close look at the evolution of the score of the top pose as the docking

proceeds revealed that the scores computed after a few generations are typically within

1.0 to 1.5 kcal/mol of the score of the final pose. It also indicated that at this stage of the

evolution, poses close to the native pose are not always identified. These two

observations demonstrate: 1. the high quality of the initial population and therefore the

identification of poses with good scores early in the evolution; and 2. the need for a

multi-generation evolution process to produce the correct (i.e., experimentally observed)

pose. Thus, if after a few generations (e.g., 5) the score is not satisfactory, the docking

can be aborted. We have therefore implemented new functions (and keywords) into

FITTED to account for this intermediate evaluation. In practice more than one of these

intermediate evaluations can be used to further reduce the number of generations carried

out with a potentially inactive compound.
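A minimal sketch of such intermediate evaluations is given below. The callbacks `evolve_one_generation` and `best_score` are hypothetical, as is the `checkpoints` mapping from generation number to the worst score still tolerated at that point; the actual FITTED keywords are not reproduced here.

```python
def evolve_with_checkpoints(population, evolve_one_generation, best_score,
                            checkpoints, total_generations=100):
    """Run an evolution, aborting early at an unsatisfactory checkpoint.

    best_score(population) returns the score of the top pose, with lower
    values being better. Returns (population, generations_run, completed).
    """
    for generation in range(1, total_generations + 1):
        population = evolve_one_generation(population)
        cutoff = checkpoints.get(generation)
        if cutoff is not None and best_score(population) > cutoff:
            return population, generation, False  # aborted at a checkpoint
    return population, total_generations, True
```

For instance, `checkpoints={5: cutoff}` would evaluate the top pose once after five generations, and several entries would correspond to the use of more than one intermediate evaluation.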

Scoring. The force field used in FITTED 1.0 (AMBER84) was not appropriate for most of

the small drug-like molecules like the ones found in virtual libraries. Instead we have

used the General Amber Force Field (GAFF)29

for the description of the small molecules.

This required a series of modifications to the force field itself as a specific format has to

be used to be readable by FITTED. SMART was also modified to assign these new atom

types. Finally, a simple automated parameter estimator was developed. Although GAFF

parameters span a large variety of functional groups, some are missing; these are

estimated on the fly by FITTED 1.5. The estimates are simply derived from the input

structures; bond lengths and angles of the input structures are used as equilibrium values

and reported in a specific log file. These listed missing parameters can later be further

optimized and added to the force field for the next study or to perform a second run on

these specific molecules. In fact, our own version of GAFF is regularly updated to

include more functional groups and heteroaromatic rings.
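The idea of taking equilibrium values from the input geometry can be sketched as follows. The function names and the data layout are hypothetical, and force constants, which the text does not discuss, are deliberately left out.

```python
import math


def bond_length(a, b):
    """Distance between two atoms, in the unit of the input coordinates."""
    return math.dist(a, b)


def bond_angle(a, b, c):
    """Angle a-b-c in degrees, with b as the central atom."""
    v1 = [p - q for p, q in zip(a, b)]
    v2 = [p - q for p, q in zip(c, b)]
    cos_theta = sum(x * y for x, y in zip(v1, v2)) / (
        math.hypot(*v1) * math.hypot(*v2))
    # Clamp to [-1, 1] to guard against rounding errors before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))


def estimate_missing(coords, missing_bonds, missing_angles):
    """Use the input geometry as equilibrium values for missing terms.

    coords maps an atom index to (x, y, z); the two lists hold index
    tuples for which no force field parameters were found.
    """
    return {
        "bonds": {ij: bond_length(coords[ij[0]], coords[ij[1]])
                  for ij in missing_bonds},
        "angles": {ijk: bond_angle(*(coords[i] for i in ijk))
                   for ijk in missing_angles},
    }
```

The returned dictionary plays the role of the log of guessed parameters, which could later be reviewed and merged into the force field.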

Whereas ClashScore, PharmScore, MatchScore and GAFFScore are used upon

docking to identify the correct pose, the RankScore scoring function reported

previously24

is used to assign the final poses a score describing their binding affinities.

VALIDATION OF FITTED 1.5

FITTED 1.0 vs. 1.5. As all these changes may affect the accuracy of FITTED, we used the

testing set previously prepared for FITTED 1.0 to evaluate the accuracy of the current

version. This set consists of ligands complexed with HIV-1 protease, thymidine kinase,

factor Xa, trypsin and MMP-3. Table 3.1 summarizes the accuracy obtained for the self-

docking of these 33 inhibitors (“Rigid” mode) as well as their docking to flexible

proteins. The “Semiflexible” mode, as defined in our previous report,8 corresponds to the

docking of ligands to conformational ensembles of protein structures while the “Flexible”

mode corresponds to a fully flexible protein structure.8 The detailed results for each of

the 33 systems are given as supporting information. The computed RMSDs (root mean

square deviation) compare the modeled “docked” binding mode to the observed one

(from crystal structures). For this study, the interaction sites were generated by ProCESS.

Table 3.1 - Comparison of FITTED 1.0 with FITTED 1.5

                                    % Success a
                          version 1.0          version 1.5
Mode                      < 1.0 Å  < 2.0 Å     < 1.0 Å  < 2.0 Å
Rigid (self-docking)         33       79          84       93
Rigid (cross-docking)        21       47          51       74
Semiflexible                 36       73          78       84
Flexible                     57       73          78       88

a Two criteria of success are shown. A docking run is considered successful if the RMSD

between modeled and experimental binding mode is within 1.0 or 2.0 Å respectively.
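The success criterion of footnote a can be expressed in a few lines. This sketch assumes that the docked and experimental poses are given as equally ordered coordinate lists in the same reference frame; the selection of atoms and any treatment of symmetry-equivalent atoms are not specified in the text and are not addressed here.

```python
import math


def rmsd(pose, reference):
    """Root mean square deviation between two equally ordered coordinate lists."""
    if len(pose) != len(reference) or not pose:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(p, r) ** 2 for p, r in zip(pose, reference))
    return math.sqrt(squared / len(pose))


def success_rate(rmsds, cutoff):
    """Percentage of docking runs whose RMSD is below `cutoff` (in Å)."""
    return 100.0 * sum(r < cutoff for r in rmsds) / len(rmsds)
```

Calling `success_rate` with cutoffs of 1.0 and 2.0 Å on the per-complex RMSDs of one docking mode would give the two entries of the corresponding table row.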

Overall, accuracy is significantly increased from version 1.0 to 1.5. More

specifically, there is an enhanced accuracy when examining systems for which the

RMSDs are below 1.0 Å. This observation is most likely due to the use of interaction

sites to guide the docking. With this implementation, FITTED 1.5 generates and considers

only poses that already passed the MatchScore and/or PharmScore filter. Thus, both the

initial population and the offspring produced during the

evolution are of higher quality than with FITTED 1.0. Along with the

increased accuracy, a 3-fold increase in speed was observed. In addition, as observed

previously with FITTED 1.0, the docking to flexible proteins is nearly as accurate as self-

docking and much more accurate than cross-docking.

With these encouraging results in hand, we turned our attention to HCV

polymerase inhibitors. For this purpose, two sets of protein/inhibitor complexes were

initially built and very recently extended as novel crystal structures have been reported

and made available. The first set includes sixteen inhibitors bound to the allosteric site

described above while the second set includes seven inhibitors bound to the catalytic site.

A second allosteric site has been reported but is not used herein.31

We first used the

interaction sites as for the testing set above. The results summarized in

Table 3.2 show that the accuracy obtained when docking to the allosteric site of the

HCV polymerase was not as high as for the testing set used above, but they were

nevertheless considered reasonable. In order to increase the accuracy of this program and

to eventually increase the enrichment factor of the VS study, we manually defined a

pharmacophore used in place of the interaction sites by FITTED (Figure 3.3). In this case,

the accuracy was slightly increased, while the required CPU time was not affected. In this

study, we used the empty space found within a sphere of 40 Å designated as large cavity in

Table 3.2. A more focused binding site (25 Å) led to a significant increase in

accuracy when self-docking was considered, but only a slight increase in accuracy when

a flexible protein was used. Interestingly, the use of flexible protein was found to be

significantly more accurate than cross-docking, indicating that its implementation should

increase the accuracy of FITTED in VS studies against HCV polymerase.

Table 3.2 - Docking of HCV polymerase inhibitors to the allosteric site with FITTED 1.5.

                                                % Success a
                          Interaction sites    Pharmacophore       Pharmacophore
                          Large cavity b       Large cavity b      Focused cavity c
Mode                      < 1.0 Å  < 2.0 Å     < 1.0 Å  < 2.0 Å    < 1.0 Å  < 2.0 Å
Rigid (self-docking)         56       75          63       75         81       81
Rigid (cross-docking)         0        0          13       31          6       38
Semiflexible                 38       50          56       69         63       69
Flexible                     25       31          47       59         38       63

a Two criteria of success are shown. A docking run is considered successful if the

RMSD between modeled and experimental binding mode is within 1.0 or 2.0 Å

respectively. b 40 Å diameter cavity. c 25 Å diameter cavity.

We then turned our attention to the catalytic site; however, docking of the seven

reported inhibitors was initially unsuccessful (Table 3.3). Considering the size of this

very large binding site, this disappointing result is not surprising. While this work was

ongoing, Warren et al. reported a large comparative study including 13 HCV polymerase

inhibitors.22

In their study, only 2 programs docked one out of the 13 inhibitors to the

catalytic site with RMSD below 2.0 Å while the other 8 programs failed with all the

inhibitors. Although their set (not given) and ours may be different, there is at least one

HCV polymerase inhibitor common to both sets.

Warren et al. also mentioned that “no docking program was able to generate

docked poses within 2 Å for ≥40% of the compounds” when only the NTP site is

considered. Unfortunately, no details were provided. In order to orient the docking

towards this binding site, we used ProCESS to automatically generate interaction sites and

spheres representing the binding site cavity centered on this site. Much to our delight,

FITTED was found to dock 5 out of the seven inhibitors with RMSDs below 1.2 Å in self-

docking experiments and the same 5 with RMSDs below 1.5 Å when the semiflexible

mode was selected. These results therefore position our program among the top of the list

of assessed programs for this HCV polymerase site. We believe that the good accuracy

observed with FITTED is due to the consensus docking approach implemented in FITTED

1.5, which is expected to accurately filter out unreasonable poses. We also found that

introducing protein flexibility led to a significant increase in accuracy relative to

cross-docking.

Table 3.3 - Docking of HCV polymerase inhibitors to the catalytic site.

                                            % Success a
                                    Whole cavity          NTP site
Mode                                < 1.0 Å  < 2.0 Å      < 1.0 Å  < 2.0 Å
Rigid (self-docking)                    0        0           57       71
Rigid (cross-docking)                   0        0            0       14
Semiflexible                            0        0           43       57
Flexible                                0        0           29       43
Flo, Gold, Glide, DockIt, MVP,          -     0 to 8 b        -    0 to 60 b
LigFit, Dock4, FlexX, Fred, MOE

a Two criteria of success are shown. A docking run is considered successful if the RMSD

between modeled and experimental binding mode is within 1.0 or 2.0 Å respectively.

b Data for self-docking (corresponding to Rigid mode) from Warren et al.22

APPLICATION TO THE SCREENING OF A LIBRARY AGAINST THE HCV POLYMERASE.

The previous validation demonstrated that the current version of FITTED docked

inhibitors with reasonable accuracy and also demonstrated the key role of the protein

flexibility accounted for by FITTED in the HCV polymerase context. However, this

validation did not provide any indication about its ability to identify active compounds

within a large set (i.e., to rank known inhibitors at the top of the hit list). For this purpose,

we first selected the well-studied thiophene site, for which we have collected much data

from both SAR and X-ray crystallography. Another two sites (catalytic and allosteric) are

also validated targets but were not considered here. The Maybridge library of drug-like

molecules, which was obtained from the ZINC web site,32, 33

was seeded with known

actives ranging from nanomolar to micromolar activities.

To account for the site flexibility, we used two inhibitor-bound in-house crystal

structures in a “semi-flexible” docking run, an option implemented in FITTED that allows

the simultaneous docking of a flexible ligand to more than one protein structure. The

pharmacophore shown in Figure 3.4 includes six spheres identifying two hydrophobic

pockets (shown in green), three sites for hydrogen-bond acceptors (HBA, shown in red)

and a mixed hydrophobic/HBA site (shown in yellow). At this last location, a phenyl ring

may interact with His475 via aromatic ring stacking and/or π-cation interaction with

Lys533. Experimental data showed that hydrogen bonds with Tyr477 and Ser476 are key

interactions and that two out of the three defined hydrophobic pockets are often

targeted.20

A compound that binds without filling the deep hydrophobic pocket delineated

by Met423, Trp528 and Leu419 would trap water molecules, a phenomenon that is highly

disfavored. Similarly, desolvation of the other hydrophobic pocket defined by Leu419,

Ile482 and Leu497 is favored upon binding. To account for these specific situations, the

spheres representing these hydrophobic pockets are given a higher weight (w = 2) than

the other 4 (w = 1) and poses with a MatchScore lower than 60% are discarded.
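As a worked illustration, assuming that (3.1) normalizes the weights so that they sum to 100, the six spheres of this pharmacophore (two with w = 2 and four with w = 1) would carry the following normalized weights:

```latex
\sum w = 2(2) + 4(1) = 8, \qquad
W_{\mathrm{hydrophobic}} = 100 \times \tfrac{2}{8} = 25, \qquad
W_{\mathrm{other}} = 100 \times \tfrac{1}{8} = 12.5
```

Under this reading, a pose matching neither hydrophobic pocket scores at most 4 × 12.5 = 50% and is always discarded by the 60% cutoff, whereas a pose filling both pockets passes as soon as it matches one additional sphere (62.5%).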

Stepwise Screening. As described above, the fully automated FITTED 1.5 screening

protocol can be broken down into five distinct steps (Figure 3.6): 1. filtering out non-

drug-like compounds; 2. filtering out compounds that cannot match the binding site

pharmacophore and/or the binding site cavity; 3. quick docking and discarding

compounds with RankScore values higher than -5.25; 4. refined docking of the best

candidates; 5. selection of the best scoring compounds.

This stepwise approach was applied to the Maybridge set seeded with 23 known

active inhibitors including very weak inhibitors. A large variety of known ligands were

selected and some are illustrated in Figure 3.5. As an additional test for FITTED, one of

these HCV polymerase inhibitors (compound 7) has been reported to have a high anti-

HCV activity, with the (R) enantiomer being the most active.34, 35

We therefore spiked

our set with the two enantiomeric forms of 7 in order to assess their binding to the

allosteric site and their relative predicted binding affinity. In contrast to closely related

analogues which bind to the catalytic site, compound 8 is believed to bind to the

allosteric site and was added to this set.36

Figure 3.5 - Selected known actives. 35-40

The protocol is shown in Figure 3.6. The entire library was processed using SMART

and a first filtration step was carried out using FITTED 1.5. Compounds with net charges

between -1 and +1, with a number of hydrogen bond acceptors lower than or equal to 10, a

number of hydrogen bond donors lower than or equal to 5, a maximum of 6 rotatable bonds,

molecular weights below or equal to 550 and containing no potentially toxic, reactive or

hydrolyzable groups were retained. This set comprised nearly 32,500 compounds,

including 19 known active inhibitors (0.058% of the library).
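The first filtration step amounts to a conjunction of simple threshold tests on precomputed descriptors. The sketch below uses the thresholds quoted above; the `Descriptors` class and its field names are hypothetical and do not reflect the bit string format written by SMART.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    net_charge: int
    h_bond_acceptors: int
    h_bond_donors: int
    rotatable_bonds: int
    molecular_weight: float
    flagged_groups: frozenset = frozenset()  # e.g. {"nitro", "aldehyde"}


def passes_first_filter(d: Descriptors) -> bool:
    """Apply the thresholds quoted in the text to one compound."""
    return (
        -1 <= d.net_charge <= 1
        and d.h_bond_acceptors <= 10
        and d.h_bond_donors <= 5
        and d.rotatable_bonds <= 6
        and d.molecular_weight <= 550
        and not d.flagged_groups  # no toxic, reactive or hydrolyzable group
    )
```

Since every test only reads a stored descriptor, this step costs a negligible amount of time per compound compared with docking.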

Figure 3.6 - Funnel approach implemented in FITTED.

Prior to the actual docking, compounds not featuring the necessary pharmacophoric

groups, as well as compounds that could not fit the binding site cavity and/or had a

MatchScore lower than 60% were discarded. None of the known active compounds failed

this test while a further 6% of the filtered library was eliminated. When appropriate

poses were found, FITTED started the genetic algorithm optimization and produced the

initial population. A quick evolution (five generations) was then applied to this

population and the top three poses (best GAFFScore) were scored using RankScore.

Compounds with a RankScore of -5.25 or lower were allowed to progress to the next

step. An additional 52% of the filtered library was removed at this stage, while all seed

compounds were retained. This observation provides a clear indication of the usefulness

of this intermediate selection. The remaining 13,713 compounds were further optimized

(another 95 generations of evolution) and scored using RankScore. Finally, focused

libraries of different sizes were compiled based on different score cutoffs. Table 3.4

summarizes the size, number of recovered known actives and enrichment factors for these

focused libraries.
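The funnel logic described above, in which inexpensive tests are applied first and the survivors are counted at each stage, can be sketched as follows. The stage predicates and the precomputed toy scores are hypothetical; in an actual run each score would be computed only for the compounds reaching that stage.

```python
def run_funnel(compounds, stages):
    """Apply (name, keep) stages in order and record the survivor counts."""
    survivors = list(compounds)
    report = [("input", len(survivors))]
    for name, keep in stages:
        survivors = [c for c in survivors if keep(c)]
        report.append((name, len(survivors)))
    return survivors, report


# Toy library with precomputed values, for illustration only.
library = [
    {"id": "a", "drug_like": True, "match": 80, "quick_rank": -6.0},
    {"id": "b", "drug_like": False, "match": 90, "quick_rank": -7.0},
    {"id": "c", "drug_like": True, "match": 50, "quick_rank": -8.0},
    {"id": "d", "drug_like": True, "match": 70, "quick_rank": -4.0},
]
stages = [
    ("drug-like filter", lambda c: c["drug_like"]),
    ("MatchScore >= 60", lambda c: c["match"] >= 60),
    ("quick docking, RankScore <= -5.25", lambda c: c["quick_rank"] <= -5.25),
]
survivors, report = run_funnel(library, stages)
for name, count in report:
    print(f"{name}: {count}")
```

Only the compounds remaining after the last stage would be submitted to the longer, refined docking and final ranking.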

Table 3.4 - Focused libraries based on MatchScore > 75 and RankScore as indicated

RankScore cutoff    Hits a    Known actives    Enrichment factor b
< -7.0                835          9                 18.4
< -7.5                401          8                 34.1
< -8.0                147          6                 69.7
< -8.5                 48          6                214
< -9.0                 14          3                366
Filtered library    32457         19                  1.0

a Including known actives.

b Based on the filtered library
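The enrichment factors of Table 3.4 are consistent with the usual definition, namely the fraction of actives in the focused library divided by the fraction of actives in the reference (filtered) library. The text does not spell out the formula, so the sketch below should be read as an assumption that happens to reproduce the tabulated values after rounding.

```python
def enrichment_factor(hits, actives_in_hits, library_size, actives_in_library):
    """Active fraction in the hit list divided by that in the library."""
    return (actives_in_hits / hits) / (actives_in_library / library_size)


# Rows of Table 3.4: (RankScore cutoff, hits, known actives).
rows = [(-7.0, 835, 9), (-7.5, 401, 8), (-8.0, 147, 6), (-8.5, 48, 6), (-9.0, 14, 3)]
for cutoff, hits, actives in rows:
    ef = enrichment_factor(hits, actives, 32457, 19)
    print(f"RankScore < {cutoff}: EF = {ef:.1f}")
```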

Enrichment Factors. Effective docking-based virtual screening tools should prioritize

active compounds from a library of drug-like molecules. It is common practice to seed a

library of drug-like molecules with known actives and use the enrichment factor obtained

to evaluate the accuracy of the docking and scoring functions of the software. An

accurate program should be able to recover the seed compounds at the top of the score-

ranked hit list. Comparative studies evaluating the accuracy of docking programs to

extract active compounds from large libraries showed that Surflex, GOLD, Glide, and

FlexX are among the best programs.3

For instance, the best performer in Rognan’s study,

Surflex, was able to rank ten known thymidine kinase inhibitors from a library of 1000

drug-like molecules in the top 10% of the library with five in the top 3%.41

In another

study slightly less than 50% of the seed inhibitors were ranked in the top 10% by DOCK,

GOLD and Glide, with 38% in the top 2% when GOLD was used.42

Overall, a state-of-

the-art VS tool rarely extracts 100% of the actives in the top 10% and even more rarely in

the top 5%; 50-60% in the top 5% is more commonly observed with the best programs. In

contrast, from the focused sets selected by FITTED 1.5 (Table 3.4), large enrichment

factors were computed. Considering that in the initial library and in the filtered library

only 0.035% and 0.058% respectively were known actives, enrichment factors of over

three hundred for the top fourteen compounds were achieved for this target with FITTED

1.5.

Data analysis illustrated in Figure 3.7 indicated that a third of the actives were

recovered in the top 0.1% and that half of the actives were found in the top 2% of the hit

list. An average of 12 minutes of CPU time per compound was needed to dock each of

the 32,500 filtered compounds using the semi-flexible HCV polymerase structure on

desktop Linux PCs (AMD Opteron) while less than a second per compound was needed to

filter out the bad candidates.

Interestingly, (R)-7, the most active enantiomer of 7, was predicted to bind well

to the allosteric site, while (S)-7 was given a score much worse than the other 22 actives.

This result correlates well with the experimental data and indicates that 7 may bind to the

allosteric site. Compound 8 was also assigned a high score and is predicted to bind tightly

to the allosteric site, an observation that has only been postulated.36

Figure 3.7 - Active compounds recovered. Blue curve: FITTED VS study; orange:

random selection.

Biological Evaluation. Encouraged by the large enrichment factors obtained for this

target, we assessed FITTED’s ability to identify new HCV polymerase inhibitors from the

Maybridge library by screening the high ranking compounds in biochemical assays. The

top scoring compounds with a RankScore below -7.0 and a MatchScore higher than 75

were considered for biological evaluation (826 compounds; 1.25% of the Maybridge

library) using a scintillation proximity assay (SPA) described in the experimental section.

Unfortunately, some high scoring compounds were not available for purchase at the time

and only 659, representing 1% of the total Maybridge library, were acquired. All these

659 available compounds were screened against the HCV polymerase using a single

point concentration and resulted in 220 compounds showing greater than 50% inhibition

at 10 μg/mL and 12 compounds which had greater than 90% inhibition at 10 μg/mL. The

set of 12 actives was re-tested in an eleven-point dose-response SPA and two drug-like

compounds were identified to inhibit HCV polymerase with IC50 values of 7 μM and

12 μM respectively.

With these newly discovered actives, a new enrichment factor of 20.4 was

computed for the top 835 hits.

CONCLUSION

HCV polymerase has been shown to be a challenging enzyme for docking methods, which

prompted us to assess FITTED in this context. FITTED 1.0 was therefore modified to

incorporate features for its application to docking and virtual screening, such as

ligand-based and pharmacophore-based prefiltering. The resulting version, FITTED 1.5, showed

significantly enhanced accuracy and speed relative to the previous version. Validation

experiments carried out on two binding sites on HCV polymerase (allosteric and catalytic

site) further confirmed its accuracy. We next looked at its ability to identify active HCV

polymerase inhibitors from a set of drug-like molecules. A virtual screening run on the

Maybridge library seeded with known actives gave enrichment factors which were

superior to the ones often observed with other available docking programs. Top scoring

compounds representing around 1% of the Maybridge library were purchased and

screened in HCV polymerase assays resulting in the identification of two compounds

with IC50 values of 7 and 12 μM. The screening of larger libraries is now ongoing.

FITTED 1.5 and the subsequent versions are now available to the scientific

community.43

EXPERIMENTAL SECTION

Running FITTED 1.0 Testing Set with FITTED 1.5. The preparation of the testing set has

been previously reported and will not be described herein.8 All protein, interaction site

and cavity files were then prepared using ProCESS 1.5 and all ligand files with SMART

1.5. The HCV polymerase/inhibitor complexes were prepared following the same

protocol. In order to proceed in the semiflexible and flexible modes, FITTED requires

identical sequences (and number of atoms) for the protein structures used as input.

However, a large number of differences in the sequence of the various crystal structures

have been found. As they were far enough from the binding sites, they are not expected to

affect the docking accuracy. Thus, manual mutations were carried out to correct these

discrepancies.

Preparation of the Cavity and Pharmacophore Files for VS. Two crystal structures

(1NHV and 2GIR) representative of the set of sixteen were used for the VS study.

Preparation of the protein files was carried out as previously described. ProCESS was then

used to prepare the structures for the VS. The center of the active site was defined by the

centroid of the ligands present in the crystal structures. A sphere radius of 25 Å was used

to generate the binding site cavity file. The pharmacophore was generated manually by

examining the known binding modes and the interaction sites identified by ProCESS and

extrapolating the six key interactions shown in Figure 3.3.

Preparation of the Library. The Maybridge library was downloaded from the ZINC

database32 in mol2 format. Each compound of the library was then prepared by SMART,

which added the rotatable bonds, atom types and completed the bit string for each

compound.

Docking a Library with FITTED. Each compound of the library was docked individually

using FITTED in Semiflexible mode. Compounds containing the following groups were

filtered out and were not docked: aldehydes, esters, imines, nitro, acyclic Michael

acceptors, azides, isocyanates and acyl chlorides. As an additional constraint, all

compounds were required to have at least one aromatic ring. The screening was carried

out on the 872 node Dell PowerEdge cluster of Intel Pentium 4, 3.2 GHz located at the

Réseau Québecois de Calcul de Haute Performance (RQCHP) at the Université de

Sherbrooke.
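The prefiltering step described above (exclusion of listed functional groups plus the aromatic ring requirement) can be sketched as a bit-mask test. This is only an illustration: the bit layout, the group names and the helper functions below are hypothetical and do not reproduce the actual bit string written by SMART.

```python
# Hypothetical bit layout: one bit per excluded group, then one bit for "has aromatic ring".
EXCLUDED_GROUPS = (
    "aldehyde", "ester", "imine", "nitro", "acyclic_michael_acceptor",
    "azide", "isocyanate", "acyl_chloride",
)
GROUP_BIT = {name: 1 << i for i, name in enumerate(EXCLUDED_GROUPS)}
AROMATIC_RING_BIT = 1 << len(EXCLUDED_GROUPS)
EXCLUDED_MASK = AROMATIC_RING_BIT - 1  # every bit below the aromatic-ring bit


def encode(groups, has_aromatic_ring):
    """Build the toy bit string for one compound from its detected groups."""
    bits = 0
    for name in groups:
        bits |= GROUP_BIT.get(name, 0)  # groups outside the exclusion list set no bit
    if has_aromatic_ring:
        bits |= AROMATIC_RING_BIT
    return bits


def passes_prefilter(bits):
    """Keep a compound only if no excluded group is present and it has an aromatic ring."""
    return (bits & EXCLUDED_MASK) == 0 and (bits & AROMATIC_RING_BIT) != 0
```

Testing a precomputed bit string is a single pair of integer operations per compound, which is consistent with filtering being far cheaper than docking.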

Biological Evaluation of the Selected Compounds. Briefly, 250 ng of a 5’-biotinylated

DNA oligonucleotide (oligo dT15) primer, annealed to 10 pmol of a homopolymeric poly

rA RNA template, was captured on the surface of streptavidin-coated beads (GE

Healthcare, Uppsala, Sweden). The polymerization activity of 50 nM HCV NS5B

enzyme (genotype 1b, BK strain) was quantified by measuring the incorporation of

radiolabeled [3H]-UTP substrate onto the 3’ end of the growing primer at 22 °C for 140

min. Detection was performed by counting the signal using a liquid scintillation counter

(Wallac MicroBeta Trilux, Perkin Elmer, MA). Compounds were initially tested using a

single point concentration and the actives were re-confirmed by eleven point dose

responses. Curves were fitted to data points using nonlinear regression analysis, and IC50s

were interpolated from the resulting curves using GraphPad Prism software, version 2.0

(Graphpad Software Inc., San Diego, CA).

ACKNOWLEDGEMENTS

We thank the Canada Foundation for Innovation for financial support through the

New Opportunities Fund program. CRC holds a CIHR-funded Chemical Biology

Scholarship and PE holds a McGill Majors Fellowship (J. W. McConnell Memorial). We

also thank CIHR (Discovery program), FQRNT (Nouveaux chercheurs) and NSERC for

funding and RQCHP for generous allocation of computer resources.

Supporting Information Available: Detailed data on the docking to the validation set. This

information is available free of charge via the Internet at http://pubs.acs.org.

REFERENCES

1 Shuker, S. B.; Hajduk, P. J.; Meadows, R. P.; Fesik, S. W. Discovering High-

Affinity Ligands for Proteins: SAR by NMR. Science 1996, 274, 1531-1534.

2 Cavasotto, C. N.; Orry, A. J. W. Ligand Docking and Structure-Based Virtual

Screening in Drug Discovery. Curr. Top. Med. Chem. 2007, 7, 1006-1014.

3 Moitessier, N.; Englebienne, P.; Lee, D.; Lawandi, J.; Corbeil, C. R. Towards the

Development of Universal, Fast and Highly Accurate Docking/Scoring Methods: A

Long Way to Go. Br. J. Pharmacol. 2008, 153 (Suppl. 1), S7-S26.

4 Cozza, G.; Bonvini, P.; Zorzi, E.; Poletto, G.; Pagano, M. A.; Sarno, S.; Donella-

Deana, A.; Zagotto, G.; Rosolen, A.; Pinna, L. A.; Meggio, F.; Moro, S.

Identification of Ellagic Acid as Potent Inhibitor of Protein Kinase CK2: A

Successful Example of a Virtual Screening Application. J. Med. Chem. 2006, 49,

2363-2366.

5 De Graaf, C.; Oostenbrink, C.; Keizers, P. H. J.; Van Der Wijst, T.; Jongejan, A.;

Vermeulen, N. P. E. Catalytic Site Prediction and Virtual Screening of Cytochrome

P450 2D6 Substrates by Consideration of Water and Rescoring in Automated

Docking. J. Med. Chem. 2006, 49, 2417-2430.

6 Claussen, H.; Buning, C.; Rarey, M.; Lengauer, T. FLEXE: Efficient Molecular

Docking Considering Protein Structure Variations. J. Mol. Biol. 2001, 308, 377-

395.

7 AutoDock, 4.0; The Scripps Research Institute: La Jolla, CA, 2006.

8 Corbeil, C. R.; Englebienne, P.; Moitessier, N. Docking Ligands into Flexible and

Solvated Macromolecules. 1. Development and Validation of FITTED 1.0. J.

Chem. Inf. Model. 2007, 47, 435-449.

9 Englebienne, P.; Fiaux, H.; Kuntz, D. A.; Corbeil, C. R.; Gerber-Lemaire, S.; Rose,

D. R.; Moitessier, N. Evaluation of Docking Programs for Predicting Binding of

Golgi alpha-Mannosidase II Inhibitors: A Comparison with Crystallography.

Proteins: Struct., Funct., Bioinf. 2007, 69, 160-176.

10 Bacon, B. R.; Di Bisceglie, A. M.; Korb, J. R.; Tillmann, H. L.; Herold, K. C.;

Himelhoch, S.; De Knegt, R. J.; Van Den Berg, A. P.; Bell, B. P.; Walker, B. D.;

Lauer, G. M. Hepatitis C Virus Infection [3] (multiple letters). N. Engl. J. Med.

2001, 345, 1425-1428.

11 Lauer, G. M.; Walker, B. D. Hepatitis C Virus Infection. N. Engl. J. Med. 2001,

345, 41-52.

12 Strader, D. B.; Wright, T.; Thomas, D. L.; Seeff, L. B. Diagnosis, Management,

and Treatment of Hepatitis C. Hepatology 2004, 39, 1147-1171.

13 Toniutto, P.; Fabris, C.; Bitetto, D.; Fornasiere, E.; Rapetti, R.; Pirisi, M.

Valopicitabine Dihydrochloride: A Specific Polymerase Inhibitor of Hepatitis C

Virus. Curr. Opin. Invest. Drugs 2007, 8, 150-158.

14 Pierra, C.; Amador, A.; Benzaria, S.; Cretton-Scott, E.; D'Amours, M.; Mao, J.;

Mathieu, S.; Moussa, A.; Bridges, E. G.; Standring, D. N.; Sommadossi, J. P.;

Storer, R.; Gosselin, G. Synthesis and Pharmacokinetics of Valopicitabine

(NM283), an Efficient Prodrug of the Potent Anti-HCV Agent 2'-C-

Methylcytidine. J. Med. Chem. 2006, 49, 6614-6620.

15 Smith, D. B.; Martin, J. A.; Swallow, S.; Smith, M. K., A.; Yee, C.; Crowell, M.;

Kim, W.; Sarma, K.; Najera, I.; Jiang, W.-R.; Le Pogam, S.; Rajyaguru, S.;

Klumpp, K.; Leveque, V.; Ma, H.; Tu, Y.; Chan, R.; Brandl, M.; Alfredson, T.;

Wu, X.; Birudaraj, R.; Tran, T.; Cammack, N. From R1479 to R1626 :

Optimization of a Nucleoside Inhibitor of NS5B for the Treatment of Hepatitis C.

In Abstracts of Papers, 232nd ACS National Meeting, San Francisco, CA, United

States, Sept. 10-14, 2006.

16 ViroPharma Incorporated. Therapeutic focus: HCV 796.

http://www.viropharma.com/therapeutic/hcv796.asp (accessed Jan 03, 2008).

17 Virochem Pharma, Corporate Information.

http://www.virochempharma.com/ourFocus.html (accessed Jan 03, 2008).

18 Virochem Pharma, Corporate Information

http://clinicaltrials.gov/ct/gui/show/NCT00389298?order=6 (accessed Jan 03,

2008).

19 Haigh, D.; Amphlett, E. M.; Bravi, G. S.; Bright, H.; Chung, V.; Chambers, C. L.;

Cheasty, A. G.; Convery, M. A.; Ellis, M. R.; Fenwick, R.; Gray, D. F.; Hartley, C.

D.; Howes, P. D.; Jarvest, R. L.; Medhurst, K. J.; Mehbob, A.; Mesogiti, D.;

Mirzai, F.; Nerozzi, F.; Parry, N. R.; Roughley, N.; Skarzynski, T.; Slater, M. J.;

Smith, S. A.; Stocker, R.; Theobald, C. J.; Thomas, P. J.; Thommes, P. A.; Thorpe,

J. H.; Wilkinson, C. S.; Williams, E. Identification of GSK625433: A Novel

Clinical Candidate for the Treatment of Hepatitis C. In Abstracts of Papers, 233rd

ACS National Meeting, Chicago, IL, United States, March 25-29 2007.

20 Biswal, B. K.; Cherney, M. M.; Wang, M.; Chan, L.; Yannopoulos, C. G.;

Bilimoria, D.; Nicolas, O.; Bedard, J.; James, M. N. G. Crystal Structures of the

RNA-dependent RNA Polymerase Genotype 2a of Hepatitis C Virus Reveal two

Conformations and Suggest Mechanisms of Inhibition by Non-nucleoside

Inhibitors. J. Biol. Chem. 2005, 280, 18202-18210.

21 Biswal, B. K.; Wang, M.; Cherney, M. M.; Chan, L.; Yannopoulos, C. G.;

Bilimoria, D.; Bedard, J.; James, M. N. G. Non-nucleoside Inhibitors Binding to

Hepatitis C Virus NS5B Polymerase Reveal a Novel Mechanism of Inhibition. J.

Mol. Biol. 2006, 361, 33-45.

22 Warren, G. L.; Andrews, C. W.; Capelli, A. M.; Clarke, B.; LaLonde, J.; Lambert,

M. H.; Lindvall, M.; Nevins, N.; Semus, S. F.; Senger, S.; Tedesco, G.; Wall, I. D.;

Woolven, J. M.; Peishoff, C. E.; Head, M. S. A Critical Assessment of Docking

Programs and Scoring Functions. J. Med. Chem. 2006, 49, 5912-5931.

23 Kim, J.; Park, J. G.; Chong, Y. FlexE Ensemble Docking Approach to Virtual

Screening for CDK2 Inhibitors. Mol. Simul. 2007, 33, 667-676.

24 Moitessier, N.; Therrien, E.; Hanessian, S. A method for Induced-fit Docking,

Scoring, and Ranking of Flexible Ligands. Application to Peptidic and

Pseudopeptidic β-secretase (BACE 1) Inhibitors. J. Med. Chem. 2006, 49, 5885-

5894.

25 Moitessier, N.; Westhof, E.; Hanessian, S. Docking of Aminoglycosides to

Hydrated and Flexible RNA. J. Med. Chem. 2006, 49, 1023-1033.

26 Lipinski, C. A.; Lombardo, F.; Dominy, B. W.; Feeney, P. J. Experimental and

Computational Approaches to Estimate Solubility and Permeability in Drug

Discovery and Development Settings. Adv. Drug Delivery Rev. 1997, 23, 3-25.

27 Moitessier, N.; Henry, C.; Maigret, B.; Chapleur, Y. Combining Pharmacophore

Search, Automated Docking, and Molecular Dynamics Simulations as a Novel

Strategy for Flexible Docking. Proof of Concept: Docking of Arginine-glycine-aspartic

Acid-like Compounds into the αvβ3 Binding Site. J. Med. Chem. 2004, 47,

4178-4187.

28 Goto, J.; Kataoka, R.; Hirayama, N. Ph4Dock: Pharmacophore-Based Protein-

Ligand Docking. J. Med. Chem. 2004, 47, 6804-6811.

29 Wang, J.; Wolf, R. M.; Caldwell, J. W.; Kollman, P. A.; Case, D. A. Development

and Testing of a General Amber Force Field. J. Comput. Chem. 2004, 25, 1157-

1174.

30 Paul, N.; Rognan, D. ConsDock: A New Program for the Consensus Analysis of

Protein-ligand Interactions. Proteins Struct. Funct. Genet. 2002, 47, 521-533.

31 Di Marco, S.; Volpari, C.; Tomei, L.; Altamura, S.; Harper, S.; Narjes, F.; Koch,

U.; Rowley, M.; De Francesco, R.; Migliaccio, G.; Carfi, A. Interdomain

Communication in Hepatitis C Virus Polymerase Abolished by Small Molecule

Inhibitors Bound to a Novel Allosteric Site. J. Biol. Chem. 2005, 280, 29765-

29770.

32 Irwin, J. J.; Shoichet, B. K. ZINC - A Free Database of Commercially Available

Compounds for Virtual Screening. J. Chem. Inf. Model. 2005, 45, 177-182.

33 A Free Database for Virtual Screening ZINC. http://blaster.docking.org/zinc/

(accessed Jan 03, 2008).

34 Gopalsamy, A.; Lim, K.; Ciszewski, G.; Park, K.; Ellingboe, J. W.; Bloom, J.;

Insaf, S.; Upeslacis, J.; Mansour, T. S.; Krishnamurthy, G.; Damarla, M.; Pyatski,

Y.; Ho, D.; Howe, A. Y. M.; Orlowski, M.; Feld, B.; O’Connell, J. Discovery of

Pyrano[3,4-b]indoles as Potent and Selective HCV NS5B Polymerase Inhibitors. J.

Med. Chem. 2004, 47, 6603-6608.

35 Gopalsamy, A.; Aplasca, A.; Ciszewski, G.; Park, K.; Ellingboe, J. W.; Orlowski,

M.; Feld, B.; Howe, A. Y. M. Design and synthesis of 3,4-dihydro-1H-[1]-

benzothieno[2,3-c]pyran and 3,4-dihydro-1H-pyrano[3,4-b]benzofuran derivatives

as non-nucleoside inhibitors of HCV NS5B RNA dependent RNA polymerase.

Bioorg. Med. Chem. Lett. 2006, 16, 457-460.

36 Pfefferkorn, J. A.; Nugent, R.; Gross, R. J.; Greene, M.; Mitchell, M. A.; Reding,

M. T.; Funk, L. A.; Anderson, R.; Wells, P. A.; Shelly, J. A.; Anstadt, R.; Finzel,

B. C.; Harris, M. S.; Kilkuskie, R. E.; Kopta, L. A.; Schwende, F. J. Inhibitors of

HCV NS5B polymerase. Part 2: Evaluation of the northern region of

(2Z)-2-benzoylamino-3-(4-phenoxy-phenyl)-acrylic acid. Bioorg. Med. Chem. Lett. 2005,

15, 2481-2486.

37 Wang, M.; Ng, K. K.-S.; Cherney, M. M.; Chan, L.; Yannopoulos, C. G.; Bedard,

J.; Morin, N.; Nguyen-Ba, N.; Alaoui-Ismaili, M. H.; Bethell, R. C.; James, M. N.

G. Non-nucleoside Analogue Inhibitors Bind to an Allosteric Site on HCV NS5B

Polymerase. J. Biol. Chem. 2003, 278, 9489-9495.

38 Biswal, B. K.; Wang, M.; Cherney, M. M.; Chan, L.; Yannopoulos, C. G.;

Bilimoria, D.; Bedard, J.; James, M. N. G. Non-nucleoside Inhibitors Binding to

Hepatitis C Virus NS5B Polymerase Reveal a Novel Mechanism of Inhibition. J.

Mol. Biol. 2006, 361, 33-45.

39 Li, H.; Tatlock, J.; Linton, A.; Gonzalez, J.; Borchardt, A.; Dragovich, P.; Jewell,

T.; Prins, T.; Zhou, R.; Blazel, J.; Parge, H.; Love, R.; Hickey, M.; Doan, C.; Shi,

S.; Duggal, R.; Lewis, C.; Fuhrman, S. Identification and structure-based

optimization of novel dihydropyrones as potent HCV RNA polymerase inhibitors.

Bioorg. Med. Chem. Lett. 2006, 16, 4834-4838.

40 Chan, L.; Pereira, O.; Reddy, T. J.; Das, S. K.; Poisson, C.; Courchesne, M.;

Proulx, M.; Siddiqui, A.; Yannopoulos, C. G.; Nguyen-Ba, N.; Roy, C.; Nasturica,

D.; Moinet, C.; Bethell, R.; Hamel, M.; L’Heureux, L.; David, M.; Nicolas, O.;

Courtemanche-Asselin, P.; Brunette, S.; Bilimoria, D.; Bedard, J. Discovery of

thiophene-2-carboxylic acids as potent inhibitors of HCV NS5B polymerase and

HCV subgenomic RNA replication. Part 2: Tertiary amides. Bioorg. Med. Chem.

Lett. 2004, 14, 797-800.

41 Kellenberger, E.; Rodrigo, J.; Muller, P.; Rognan, D. Comparative Evaluation of

Eight Docking Tools for Docking and Virtual Screening Accuracy. Proteins Struct.

Funct. Genet. 2004, 57, 225-242.

42 Kontoyianni, M.; Sokol, G. S.; McClellan, L. M. Evaluation of Library Ranking

Efficacy in Virtual Screening. J. Comput. Chem. 2005, 26, 11-22.

43 Moitessier, N.; Corbeil, C. R.; Englebienne, P.; Schwartzentruber, J. FITTED 2.2

can be obtained on request from the developers ([email protected]).

CHAPTER 4

- 145 -

CHAPTER FOUR

In the previous chapters, the two reported versions of FITTED have been validated

and tested using a test set of 33 protein ligand complexes along with a virtual screening

application against HCV polymerase. We had also demonstrated the importance of the

inclusion of protein flexibility and displaceable bridging water molecules with FITTED.

However, we sought to further demonstrate the impact of these two features as well as

others (e.g., input conformation of the ligand) on the pose prediction accuracy of FITTED

along with other popular docking programs. Alongside this comparative study, another

iteration of improvements to the program was carried out and is presented in this

chapter.

This chapter is reproduced from a manuscript that has been submitted for publication in

the Journal of Chemical Information and Modeling. This article is cited as Corbeil, C. R.;

Moitessier, N. “Docking Ligands into Flexible and Solvated Macromolecules. 3. Impact

of Input Ligand Conformation, Protein Flexibility and Water Molecules on the Accuracy

of Docking Programs.” Journal of Chemical Information and Modeling 2009, accepted.

Copyright 2009, with permission from the American Chemical Society.


DOCKING LIGANDS INTO FLEXIBLE AND SOLVATED

MACROMOLECULES. 3.

IMPACT OF INPUT LIGAND CONFORMATION, PROTEIN

FLEXIBILITY AND WATER MOLECULES ON THE ACCURACY

OF DOCKING PROGRAMS

ABSTRACT

Several modifications and additions to FITTED 1.5 led to the development of FITTED 2.6.

Among the novel implementations are a matching algorithm-enhanced genetic algorithm

and a ring conformational search algorithm. With these various optimizations, we also

hoped to remove the biases and to develop a docking program that would provide results

(i.e., poses) as independent as possible of the input ligand and protein conformations and

of the parameters used, while keeping the option to provide additional experimental

information. These biases were investigated within FITTED 2.6 along with FlexX, GOLD,

Glide and Surflex. The input ligand conformation was found to have a major impact on

accuracy, with drops of 10-50% observed for all the programs

except FITTED. This comparative study also demonstrates that the accuracy of FITTED is

comparable to that of other docking programs. We have also demonstrated that protein

flexibility, displaceable water molecules and ring conformational search algorithms, three

of the main FITTED features, significantly increased its accuracy. Finally, we also

proposed potential modifications to the available programs to further improve their

accuracy in binding mode prediction.

INTRODUCTION

In modern drug design, docking-based virtual screening (VS) methods provide a quick

and inexpensive alternative to high throughput screening. In fact, numerous applications

have demonstrated the reasonable level of accuracy of the available methods.1,2 In parallel,

comparative studies evaluated the relative accuracy of previous versions of docking

programs in predicting the correct binding modes, typically with Glide and GOLD yielding

the best results.3-9 Many of these studies, which often made use of ligand / protein

co-crystal structures, showed that docking the native ligands back to the corresponding

protein structures (self-docking) gave reasonable results. However, when docking a ligand

to non-native crystal structures of the same protein (cross-docking), the accuracy of most

of the programs was significantly lower.10-12 These failures result in part from the

assumption that proteins are rigid objects (the lock-and-key model) even though they are

known to be flexible, dynamic objects. As a result, this major assumption leads to

inaccurate binding pose predictions and low enrichment factors in VS.13 In fact,

implementing protein flexibility has been seen as one major challenge in the development

of docking methods.14-16 Currently, very few programs consider the flexibility of the

protein upon docking, although various strategies have been proposed, ranging from

soft-docking (e.g., smoothed protein structure in AutoDock11) to a more exhaustive and

therefore time-consuming protein conformational search as seen in Glide when combined

with Prime.17 Docking to conformational ensembles has also been implemented within a

few programs such as FlexX-Ensemble,18 Slide19,20 and AutoDock.21 These various

implementations led to significantly improved predictions of binding modes when

compared to cross-docking studies.

One of the other challenges in docking and VS is the treatment of key water molecules.22

In most protein / ligand docking studies, water molecules, if present, are treated on a

per-protein basis. If the water molecules appear to be highly conserved, then they are kept

as part of the protein description for the docking run. This approach clearly precludes

accurate docking of ligands that would displace these key water molecules upon binding.

A commonly described example is that of HIV-1 protease ligands. A tightly bound water

molecule has been observed within the co-crystal structure of HIV-1 protease with

KNI-272 and analogues.23-27 This water molecule may therefore be required for an

optimally accurate docking of this set of analogues. In parallel, inhibitors built around a

cyclic urea scaffold have been designed to displace this water and would not be properly

docked if this water molecule was kept.28-30 Ideally, water molecules should be

displaceable. In fact, a previous report from our group showed that AutoDock gained

accuracy when water molecules were made displaceable.31 Currently only a few docking

methods can displace water molecules while docking ligands. For instance, GOLD32 uses

user-defined waters present in the protein input file while FlexX33 places water molecules

within the binding site and keeps the ones interacting strongly. In these two cases, the

programs score with the water present (on) or not (off) and select the best scoring option.

A version of FlexX currently in beta testing allows for displaceable waters34 that were

present in the input file. Surprisingly, when the GOLD implementation was reported, it

showed no improvement of the docking accuracy. This somewhat unexpected observation

called into question the development of functionalities specifically designed to displace

water molecules.

Within the past years, we have developed and reported FITTED 1.0,35 then FITTED 1.5.36

FITTED (Flexibility Induced Through Targeted Evolutionary Description) is a docking

program that addresses the challenges of protein flexibility and displaceable water

molecules. Herein we describe the development of the next version of this docking

program, FITTED 2.6, that focuses on accelerating the docking process while keeping

similar accuracy. We also focused on reducing the dependence of the accuracy on input

parameters and structures. We have previously found that the accuracy of eHiTS was

affected by the ligand input structure and further investigation was necessary to evaluate

this effect on the accuracy of other programs including ours.37 In order to identify its

strengths and weaknesses and evaluate these dependencies, we then compared FITTED to

some of the most popular docking software available with a specific focus on how

changes in input structure and parameters affect docking accuracy.


FITTED 1.0 and 1.5, creating a virtual screening tool out of a docking program. Previous

reports from our group detailed the development of FITTED versions 1.0 and 1.5,35,36 and

only a brief description is given below. FITTED is a suite of programs that includes

FITTED (the docking engine), ProCESS (Protein Conformation Ensemble System Setup, a

module for protein file preparation) and SMART (Small Molecule Atom Typing and

Rotatable Torsion assignment, a module for ligand preparation). Docking a ligand to a

protein can be seen as a global optimization problem. The ligand binding mode, protein

conformation, water molecule occurrence and locations have to be optimized to provide

an optimal free energy of binding. In FITTED, a Lamarckian genetic algorithm addresses

the conformational space search. Genetic algorithms are stochastic methods and often

start with randomly generated populations, followed by a time consuming evolution. Due

to the large conformational space of the protein / water / ligand complexes, finding the

global solution requires a large number of generations and large populations. Shortcutting

the process is therefore necessary to reduce the CPU time required for a single

run. We reasoned that starting with a population that has already evolved (i.e., lower

average energy than random poses) would lead to the desired decrease in computational

time. Thus, using a series of atomic charge constraints and a binding site volume, FITTED

1.0 prepared such an initial population intelligently. This approach allows for quicker

convergence of the population through evolution. Along with this genetic algorithm,

FITTED incorporated a switching function that effectively turns off water molecules and

allows them to be displaced when required. The initial validation of FITTED 1.0 with a small

set of protein / ligand complexes showed promise, with 76% and 73% success in self-docking

and docking to flexible proteins, respectively.35 Although this first version was

docking ligands effectively, our eventual goal was to make FITTED a VS tool. Some

modifications of the original algorithm were necessary to make it fast enough

for VS.

The first step in any virtual screen is the preparation of the virtual database of potential

ligands. Since a docking program will attempt to dock any given compound, we first

focused on prioritizing “drug-like” molecules for docking. For this purpose, a series of

descriptors was implemented into SMART. Bit strings describing the molecular structure

generated by SMART could then be exploited by FITTED to filter out compounds with

undesired chemical features and/or physical properties. The docking was modified to

incorporate a consensus docking approach that enabled FITTED to create and allow the

population to evolve in a more intelligent manner than the previous version. This was

done by adding pharmacophores and/or automatically generated sets of protein

interaction sites (generated by ProCESS) to orient the docking process. When a

conformation of a ligand did not match well to the pharmacophore (PharmScore) and/or

the interaction sites (MatchScore), the conformation was discarded and a new one was

generated. Thus, the inclusion of the interaction sites oriented the docking towards better

solutions and, as a result, afforded a 10% increase in accuracy over FITTED 1.0 and a

significant decrease in required CPU time.36 FITTED was then used in the screening of the

Maybridge database against HCV polymerase and was successful in identifying two hits

in the low micromolar range.36
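The discard-and-regenerate step described above, in which a conformation that matches the interaction sites poorly is thrown away and a new one is generated, can be sketched as a simple rejection loop. The function name, the callables and the threshold convention (higher score = better match) are assumptions made for illustration; they are not the FITTED implementation.

```python
def generate_accepted_pose(propose, match_score, threshold, max_tries=1000):
    """Return the first proposed pose whose match score reaches the threshold.

    propose      -- callable returning a new candidate pose (hypothetical generator)
    match_score  -- callable scoring how well a pose matches the interaction sites
    threshold    -- minimum acceptable score (assumes higher means a better match)
    Returns None if no candidate is accepted within max_tries attempts.
    """
    for _ in range(max_tries):
        pose = propose()
        if match_score(pose) >= threshold:
            return pose  # good enough to enter the population
        # otherwise the candidate is discarded and a new one is generated
    return None


# Toy usage: "poses" are plain numbers and the score is the number itself.
toy_candidates = iter([10, 40, 80, 90])
first_accepted = generate_accepted_pose(lambda: next(toy_candidates), lambda p: p, 75)
```

Capping the number of attempts is a design choice of this sketch; it prevents an endless loop for ligands that cannot satisfy the match criterion at all.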

FITTED 2.6. Improvements to remove dependencies on input parameters. When more

knowledge is provided to a docking program, the accuracy is expected to increase. For

instance, if the ligand in its crystal structure conformation is docked, a program that uses

the ligand input structure as an initial guess would most likely outperform any other

program in self-docking experiments. However, these experiments would give no

information regarding its true accuracy since, in a real drug design scenario, the user does

not know the solution. Some other biases, including the selection of parameters and the

protocol used to prepare the protein (e.g., protonation state of ionizable residues), can

also greatly affect the evaluation of programs. The removal of these dependencies arising

from the input parameters has recently become one of the hot topics in the literature.38-42

One of these dependencies is the input conformation of rings. In a VS study, large

libraries of compounds are tested in silico. These libraries are typically prepared from

two-dimensional representations of these molecules, then a 2D to 3D converter such as

OMEGA43 or CORINA44 is used to generate the 3D coordinates. Most 2D to 3D

converters output an esthetically stable state with the option to find a low-energy

conformation as defined by a force field and a conformational search algorithm. This

conformation may not always be the same as the bioactive conformation. If the molecule

is acyclic then this poses no problem, since most popular docking software considers the

flexibility of acyclic portions of the molecule. Molecules with flexible cyclic structures

prove to be more challenging, as illustrated in Figure 4.1. In addition, the conformation of

the flexible ring depends on the program used to generate it and is often not fully

searched.38,39,45,46 Our genetic algorithm has therefore been modified to account for ring

flexibility as detailed below.

Figure 4.1 - Conformation of 1nfu ligand (a) as observed in the crystal structure and (b)

as generated by OMEGA

Implementation of ring flexibility. There are three main strategies to address the issue of

flexible ring systems. First a separate tool can create multiple conformations of the ligand

to be used as multiple inputs by docking programs. Second, several ring conformations

can be exploited during the incremental construction of ligands. Surflex 2.1 uses

templates of five- to seven-membered cycloalkanes to generate multiple input

conformations of the rings used by the incremental construction algorithm.47 Even though

the templates are saturated carbocyclic structures, the inclusion of energy minimization

steps accounts for the various conformations that may exist for heterocyclic and

unsaturated systems. Similarly, Glide48 version 5.0 uses the template library from

LigPrep to be able to conformationally search larger rings.49 The third option is to

perform the conformational search while docking. GOLD50-52 exploits the corner flap

approach developed by Goto and Osawa,53 where the atom to be flipped is reflected in

the plane formed by the adjacent atoms. The major advantage of this approach is that

rings of any size can be searched. The one pitfall is the requirement to have the four

adjacent atoms in a plane. In a previous version of Glide, version 4.5, the docking engine

used an approach similar to that of GOLD.54

Genetic algorithms rely on the use of chromosomes and the theory of evolution. In

the context of docking, the chromosomes are sets of numerical values (genes) that can

evolve through genetic operators such as mutations and cross-over. These numerical

values often define the conformation, orientation and position in space of the ligand

(referred to as a pose). The chromosome, as defined in FITTED 1.5, included the acyclic

flexible torsions, translation and orientation for a given pose of the ligand. Depending on

the selected docking mode, the chromosome may also include the protein backbone,

water positions and binding site residues (Figure 4.2).

[Figure 4.2 diagram. Genes: flexible torsions, translation, orientation, flexible rings, protein backbone, binding site residues, waters. Docking modes: rigid docking, semiflexible docking, semiflexible docking with flexible waters, fully flexible docking.]

Figure 4.2 - FITTED 1.5 vs. FITTED 2.6 chromosome and the various docking modes.

Each of the horizontal lines represents a gene (e.g., given conformation of a side chain

residue). The box highlights the implementation in FITTED 2.6.

Since only the acyclic portions of the ligand were included in the chromosome, the

conformation of the ring(s) within the ligand remained the same throughout the evolution

unless altered by the energy minimization routine. To account for ring flexibility, FITTED

2.6 now includes a conformational search algorithm for rings during the generation of

new conformations (Figure 4.3 and Figure 4.4). FITTED 2.6 uses a corner flap algorithm
similar to that of GOLD but does not impose any criterion on the position in space of A, B,
D and E (Figure 4.3). This is achieved by defining the plane from three atoms (A, B and
D) instead of the four atoms required in GOLD. Any distortions of the bond lengths and
angles are then resolved through the energy minimization steps performed by FITTED. In
order to maintain the asymmetry of atom C, GOLD imposes the rotation of two bonds
(AB and BC) to position C1 and C2 (Figure 4.4). This approach reinforces the need to
have four atoms in a plane. In our current implementation, an assumption is made that the
torsion C1CBC’ is equivalent to C’2C’BC (Figure 4.4). Thus the Cartesian coordinates of
C’2 can be defined by converting C2 into C’2 from its internal coordinates (bonds, angles
and torsions) [55].
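The reflection at the heart of the corner flap can be illustrated with a few lines of Python. This is only a minimal sketch, not the FITTED source code: it reflects the corner atom C through the plane defined by the three atoms A, B and D, as described above. The function names are hypothetical, and neither the repositioning of the substituents of C nor the subsequent energy minimization is shown.

```python
import math

def _sub(u, v):
    return (u[0] - v[0], u[1] - v[1], u[2] - v[2])

def _dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]

def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def flap_corner(c, a, b, d):
    """Reflect corner atom c through the plane defined by atoms a, b and d.

    All arguments are (x, y, z) tuples in angstroms. Only three atoms are
    needed to define the plane, so no coplanarity of four atoms is required.
    """
    normal = _cross(_sub(b, a), _sub(d, a))
    length = math.sqrt(_dot(normal, normal))
    if length < 1e-9:
        raise ValueError("A, B and D are collinear; no plane is defined")
    unit = tuple(x / length for x in normal)
    # Signed distance of c from the plane, measured along the unit normal.
    height = _dot(_sub(c, a), unit)
    # Move c by twice that distance to land on the other side of the plane.
    return tuple(ci - 2.0 * height * ui for ci, ui in zip(c, unit))
```

Applying the function twice returns the original coordinates, which is a convenient sanity check for any implementation of this operator.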


Figure 4.3 - Example of the corner flap approach converting a boat conformation to a

chair.


Figure 4.4 - The assumption of torsion equivalencies.

An improved definition of interaction sites and a matching algorithm. As described

previously, the module ProCESS prepares the protein files in the format needed by

FITTED. It also probes the binding site and generates additional data for optimal docking.

Among this data is a list of potential protein interaction sites (ISs). Geometric rules are

applied to find the ideal locations of hydrogen bond donor and acceptor groups referred

to as HBD and HBA ISs. In the previous version, ISs were centered on the Ser, Thr and

Tyr hydroxyl oxygen and on metal centers. These points are now placed in the position of

the oxygen lone pairs or free metal coordination sites. ProCESS determines these free
metal coordination sites by examining the surrounding residues and using the vector bond
valence postulate [56], which states that the sum of the bond valence vectors of all the
coordinated atoms must be equal to zero.
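One way to picture how a zero vector sum can point to a free coordination site is sketched below. This is an illustrative assumption, not the ProCESS implementation: each coordinated atom contributes a vector along its bond to the metal, all bonds are given equal weight unless valences are supplied, and a single missing ligand is assumed to lie opposite to the resultant. The function name and the default weights are hypothetical.

```python
import math

def free_site_direction(metal, coordinated_atoms, valences=None):
    """Return a unit vector pointing from the metal toward a vacant site.

    metal and each coordinated atom are (x, y, z) tuples. If the vectors
    already sum to (almost) zero, the coordination sphere is considered
    complete and None is returned.
    """
    if valences is None:
        valences = [1.0] * len(coordinated_atoms)  # assumption: equal bond valences
    total = [0.0, 0.0, 0.0]
    for atom, valence in zip(coordinated_atoms, valences):
        bond = [a - m for a, m in zip(atom, metal)]
        length = math.sqrt(sum(x * x for x in bond))
        if length == 0.0:
            raise ValueError("coordinated atom coincides with the metal")
        for k in range(3):
            total[k] += valence * bond[k] / length
    norm = math.sqrt(sum(x * x for x in total))
    if norm < 1e-6:
        return None
    # The missing vector must cancel the resultant of the existing ones.
    return tuple(-x / norm for x in total)
```

For an octahedral metal with five ligands, the sketch returns the direction of the sixth, vacant position.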

The earlier version of FITTED did not identify the hydrophobic pockets with great

accuracy. To resolve this issue, various strategies have been implemented. In the current


version, a grid of evenly distributed points is generated and the interaction of a probe

atom at each of these grid points with the protein is computed. To be considered

hydrophobic (referred to as HYD), the point should not be in close proximity to an HBA

or HBD point. In addition, the van der Waals interaction energy calculated between the probe
atom and all the protein carbons should be below a minimum van der Waals energy cut-off
value. When applied to a number of proteins, this new definition was found to be more
accurate than the previous one.

A weight is then assigned to each point depending on its type. For HBA and HBD, a
weight is assigned depending on whether the point was created from a charged or neutral
residue and then scaled to account for the buriedness of the point [57]. The HYD points are
scaled according to the ratio of the van der Waals energy calculated for the point to the
minimum van der Waals energy cut-off. These weights are then used to compute the
MatchScore of each pose as described previously [36].
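The two rules for hydrophobic points and their weighting can be sketched as follows. This is a minimal illustration under stated assumptions: a 12-6 Lennard-Jones form is used for the probe-carbon interaction, and the well depth, optimal distance, energy cut-off and polar exclusion radius are placeholder values, not the parameters used by ProCESS.

```python
import math

def lj_energy(r, epsilon=0.1, r_min=3.8):
    """12-6 Lennard-Jones energy at distance r (illustrative parameters)."""
    ratio = r_min / max(r, 1e-3)  # guard against a zero distance
    return epsilon * (ratio ** 12 - 2.0 * ratio ** 6)

def hyd_weight(point, protein_carbons, polar_sites,
               energy_cutoff=-1.0, polar_exclusion=2.5):
    """Return the weight of a grid point if it qualifies as HYD, else None.

    Rule 1: the point must not be close to any HBA or HBD point.
    Rule 2: the summed probe-carbon van der Waals energy must be below
            (more favourable than) the cut-off.
    The weight is the ratio of the computed energy to the cut-off.
    """
    for site in polar_sites:
        if math.dist(point, site) < polar_exclusion:
            return None
    energy = sum(lj_energy(math.dist(point, carbon)) for carbon in protein_carbons)
    if energy >= energy_cutoff:
        return None
    return energy / energy_cutoff
```

With these definitions, a point that just passes the cut-off receives a weight close to 1, and more deeply buried hydrophobic points receive larger weights.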

Some years ago, we showed that using pharmacophore-oriented docking with a
matching algorithm can improve docking accuracy substantially [58]. FITTED 1.5 initiated a
move in this direction by orienting the docking using ISs [36]. With the new version, we
complete this move with the inclusion of a three-point triangle matching algorithm to
orient the ligand instead of the random translation and orientation performed with
previous versions. Triangles made of ligand atoms are matched onto triangles made of
ISs of identical chemical property (HBD, HBA or HYD). In order to optimize the
efficiency of this algorithm, only a subset of the potential triangle matches is used. First,
FITTED removes triangles that connect low-weight interaction sites. Second, all
triangles must contain at least one point of the top 10 ISs as sorted by weight.
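The two pruning rules and the type-based matching can be sketched as below. This is a hedged illustration rather than the FITTED algorithm: the data layout, the `min_weight` threshold and the side-length `tolerance` are assumptions introduced here, and the vertices of the two triangles are assumed to be listed in corresponding order.

```python
import math
from itertools import combinations

def select_protein_triangles(sites, min_weight=0.5, top_n=10):
    """Return index triples of protein ISs that survive both pruning rules.

    sites is a list of dicts with at least a "weight" entry.
    Rule 1: drop triangles that connect any low-weight site.
    Rule 2: keep only triangles containing at least one of the top_n sites.
    """
    ranked = sorted(range(len(sites)), key=lambda i: sites[i]["weight"], reverse=True)
    top = set(ranked[:top_n])
    selected = []
    for triple in combinations(range(len(sites)), 3):
        if any(sites[i]["weight"] < min_weight for i in triple):
            continue
        if not top.intersection(triple):
            continue
        selected.append(triple)
    return selected

def triangles_match(ligand_triangle, protein_triangle, tolerance=1.0):
    """Check that two triangles agree in vertex type and in side lengths.

    Each triangle is a list of three (kind, (x, y, z)) vertices, where kind
    is "HBA", "HBD" or "HYD".
    """
    for (ligand_kind, _), (protein_kind, _) in zip(ligand_triangle, protein_triangle):
        if ligand_kind != protein_kind:
            return False
    for i, j in ((0, 1), (1, 2), (0, 2)):
        d_ligand = math.dist(ligand_triangle[i][1], ligand_triangle[j][1])
        d_protein = math.dist(protein_triangle[i][1], protein_triangle[j][1])
        if abs(d_ligand - d_protein) > tolerance:
            return False
    return True
```

Pruning before matching keeps the number of candidate triangles far below the full cubic count, which is the efficiency argument made in the text.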

The creation of the ligand ISs is based on simple rules. All ligand atoms that are
hydrogen bond acceptors (HBA) or donors (HBD) are labelled as HBA or HBD (blue and
red atoms, respectively, in Figure 4.5). Hydrophobic ligand points (HYD, green spheres in
Figure 4.5) are centroids placed either at the center of hydrophobic rings (rings with a
majority of carbons) or at the center of isopropyl, methyl and tert-butyl groups (Figure 4.5).
All possible combinations of three-point triangles are then created and stored. The positions
of these ligand ISs are recalculated with each new conformation.


Figure 4.5 - Representation of the ISs found for 1bwi. (Green = hydrophobic points, blue

= hydrogen bond acceptors, red = hydrogen bond donors)

With the two lists created, a new ligand conformation is randomly generated and a

triangle match between the ligand ISs and the protein ISs is sought. The ligand triangle is

then superimposed onto the protein interaction site triangle. The ligand, which is now
oriented within the binding site, proceeds through the consensus docking approach
described for FITTED 1.5 and summarized in Figure 4.6.

[Figure 4.6 flowchart. Inputs: ligand, protein files, interaction sites, binding site cavity. Boxes: generate conformation (corner flap), matching algorithm, PharmScore, MatchScore, ClashScore, GAFFScore, minimize, GAFFScore, save in population, calculate new min MatchScore. Decision: population size reached? (yes: evolution; no: generate a new conformation). On failure: too many trials failed, reduce min MatchScore.]

Figure 4.6 - Schematic of the generation of the initial population within FITTED2.6.

With earlier versions of FITTED, the minimum MatchScore necessary for a pose to be
accepted was manually set and therefore the accuracy of the docking run was heavily
dependent on it. This is in fact an appropriate approach in drug design when the user
wants to make use of additional information (e.g., a pharmacophore developed from other
studies). However, in the case where no information is available, a value for the
minimum MatchScore is still requested. With the new additions to the generation and
orientation of the ligand, the focus switched to the automatic selection of a MatchScore
for a ligand during docking. FITTED 2.6 starts with an initial minimum MatchScore, which
can be either automatically reduced or increased depending on the ligand. As soon as a
new individual is saved, the minimum MatchScore (Min_MatchScore in eq. 4.1) is
recalculated based on eq. 4.1. The scaling factors have been empirically defined to orient
the docking without affecting the time required for the generation of the initial
population.

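[Equation 4.1: Min_MatchScore as a function of the MatchScores of the saved individuals (summed over index i), the maximum MatchScore (MatchScore_Max) and the empirical constants 0.5, 2 and 10.0.]    (4.1)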

Evolution and convergence. With the previous versions of FITTED, it was necessary to
perform multiple runs to find the global minimum. To increase the convergence between
various runs, FITTED 2.6 incorporates a matching algorithm to create a higher-quality
initial population. Additional modifications were made to the evolution algorithm to
better mimic Lamarckian and Darwinian evolution. We sought to favour the
evolution of the best individuals. First, in order to increase the possibility of the best
individuals coupling with each other, we implemented a new evolutionary function called
the probability of elitism (pElite operator). This function copies one of the top
individuals, performs a local search on it and passes it on to the next generation. Second, a
new selection criterion for the next generation, called Metropolis evolution, was
implemented. With this mode, the children replace the parents based on an energy-based
Metropolis criterion at a user-defined temperature. With this criterion, higher-energy
children have a non-zero probability to survive and be coupled in the next generation.
This approach ensures some structural diversity in the population and enables the
creation of a population which follows the Boltzmann distribution. This population can
then be used for refined scoring.
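The Metropolis replacement rule can be written compactly. The sketch below shows the standard Metropolis acceptance test and is not taken from the FITTED code: energies are assumed to be in kcal/mol, the temperature in kelvin, and the function name and the `rng` argument are hypothetical.

```python
import math
import random

K_B = 0.0019872041  # Boltzmann (gas) constant in kcal/(mol K)

def metropolis_accept(e_parent, e_child, temperature, rng=random):
    """Decide whether a child replaces its parent.

    A child with lower or equal energy is always accepted. A child with
    higher energy is accepted with probability exp(-dE / (k_B * T)), so it
    keeps a non-zero chance of surviving into the next generation.
    """
    delta = e_child - e_parent
    if delta <= 0.0:
        return True
    probability = math.exp(-delta / (K_B * temperature))
    return rng.random() < probability
```

At 300 K, a child that is 1 kcal/mol higher in energy than its parent is accepted in roughly one trial out of five, whereas a steady-state scheme would always reject it.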

As a last modification, we moved away from the all atom representation of protein /

ligand interactions to the less time consuming united atom representation. However, the

all atom representation is kept to compute the ligand internal energy and preclude any

inversion of chiral centers. This hybrid united atom / all atom representation resulted in

an increase in speed over the last version of FITTED.

[Figure 4.7 flowchart. Inputs: ligand, protein files, interaction sites, binding site cavity. Boxes: generate initial population, reproduction (Metropolis or steady state), pElite. Decision: is population converged? (yes: exit; no: next reproduction cycle).]

Figure 4.7 - Schematic of the evolution cycle of FITTED 2.6.

RESULTS AND DISCUSSION

Objectives. To validate this new version, we have expanded the validation set developed
for FITTED 1.0 from 5 proteins (33 crystal structures) to a highly diverse set of 18 proteins
and 100 crystal structures (Table 4.1). In addition to the evaluation of FITTED’s accuracy, we
decided to investigate the impact of parameters and input structure on accuracy. As
discussed above, protein structure, ligand structure (e.g., rings) and selected parameters
often have an impact on the docking accuracy. Although some parameters are expected to
increase accuracy (e.g., Standard Precision mode vs. eXtra Precision mode in Glide or
accuracy levels in GOLD), the ligand conformation should not (Table 4.1). The set
described here includes very challenging proteins such as HCV RNA polymerase [9] and
metalloenzymes [37].

Table 4.1 - Testing set of ligand / protein complexes.

Protein (Abbreviation)                        # of Structures   Include Water?   PDB Codes
Cyclin-dependent kinase 2 (CDK2)              4                 Yes              1aq1, 1dm2, 1pxn, 1pxn
Cyclooxygenase-2 (COX-2)                      4                 No               1cx2, 1pxx, 3pgh, 4cox
Estrogen receptor (ER)                        3                 Yes              1err, 1sj0, 3ert
Factor Xa (FXa)                               5                 Yes              1ezq, 1f0r, 1fjs, 1nfu, 1zka
GluK2 Kainate Receptor (GluK2)                5                 Yes              1s7y, 1s9t, 1sd3, 1tt1, 1yae
HCV polymerase allosteric site (HCV Allo)     9                 No               1nhu, 1nhv, 1os5, 2gir, 2hai, 2hwh, 2hwi, 2ilr, 2o5d
HCV polymerase catalytic site (HCV Cat)       7                 Yes              1yvf, 1z4u, 2fvc, 2gc8, 2giq, 2qe2, 2qe5
HIV-1 protease, mono-protonated ASP (HIVP)    5                 Yes              1b6l, 1eby, 1hpo, 1hpv, 1pro
HIV-1 protease, di-protonated ASP (HIVPD)     5                 Yes              1ajv, 1ajx, 1hvr, 1hwr, 1qbs
HIV-1 reverse transcriptase (HIVRT)           4                 Yes              1c1b, 1fk9, 1rt1, 1vrt
Mannosidase (Mann)                            8                 No               1hww, 1hxk, 1ps3, 1r33, 1r34, 1tqt, 2f1a, 2f18
Matrix metalloprotease 3 (MMP-3)              4                 No               1b8y, 1bwi, 1ciz, 1d8m
P38 MAP kinase (P38)                          5                 No               1a9u, 1b17, 1w7h, 1w82, 1w84
Thermolysin (Therm)                           8                 Yes              1thl, 1tlp, 1tmn, 3tmn, 4tmn, 5tmn, 6tmn, 8tln
Thrombin (Thrn)                               5                 Yes              1dwc, 1etr, 1ets, 1ett, 1tmt
Thymidine kinase (TK)                         9                 Yes              1e2k, 1e2p, 1ki3, 1ki4, 1ki7, 1ki8, 1of1, 1qhi, 2ki5
Trypsin (Tryp)                                5                 Yes              1f0u, 1o2j, 1o3g, 1o3i, 1qbo
Vitamin D receptor (VDR)                      5                 Yes              1db1, 1ie8, 1txi, 2har, 2has


Comparing versions FITTED 1.0, 1.5 and 2.6. To examine the effect of the various

modifications made to FITTED, we compared the current version to the previous ones by

using the training set initially developed to test the accuracy of FITTED 1.0 [35]. This set
consists of 33 protein-ligand complexes and 5 proteins (HIVP, FXa, Tryp, MMP-3
and TK). Table 4.2 summarizes the results obtained with the three FITTED versions for
self-docking (“Rigid” protein flexibility mode) and docking to flexible proteins using the
crystallographic conformation of the ligands. As previously reported, the “SemiFlexible”
protein flexibility mode corresponds to docking to a conformational ensemble of protein
structures, while flexible docking corresponds to a fully flexible protein.

Overall accuracy has declined between versions 1.5 and 2.6 (Table 4.2), although this set
is not large enough to provide a statistically relevant evaluation. Gratifyingly, an overall
increase in speed and convergence between multiple runs was also recorded (Table 4.3).

This drop in accuracy is attributed to the manual selection of the minimum MatchScore

in version 1.5 that is now automatically determined during the generation of the initial

population. As a result, docking to FXa which was fairly successful with the previous

versions shows very poor accuracy with the current version (i.e., one out of 5 ligands is

docked accurately) while docking to the other proteins demonstrated accuracy of 60%

(Tryp), 75% (MMP-3) and even 100% (HIVPD, HIVP, TK). Manual selection of high

Min_MatchScore values with version 2.6 forces the key ionic interactions between the

ligands and binding site Asp of FXa and Tryp and restores an accuracy similar to that of

version 1.5. This automatic determination of the minimum MatchScore is important as in

a blind docking study no information is given and determination of this minimum

MatchScore value would be difficult.

Table 4.2 - Comparison of success rates of FITTED versions 1.0, 1.5 and 2.6 using the
“Dock” docking mode.

                         % Success
Docking Mode             1.0    1.5    2.6
Rigid (self-docking)     79     93     79
Rigid (cross-docking)    47     75     56
Semiflexible             73     84     67
Flexible                 73     88     67

A three-fold increase in speed is seen with the newer version of FITTED. This increase
can be in part attributed to the introduction of a matching algorithm to orient the ligand
when generating the initial population and to the use of the hybrid united atom/all atom
representation described above. In the current work, both the “Dock” docking mode, which
carries out an extensive conformational search, and the significantly quicker
(i.e., fewer generations) “VS” docking mode are used. The increase in the convergence of
the runs (a single run is often enough with the current version) can be attributed in part to
the inclusion of the new pElite evolutionary operator and to the matching algorithm. It
should be stressed that in VS mode the time can be significantly reduced (down to 3-4
minutes for TK inhibitors) and that code optimization is ongoing to further reduce the
CPU time required for a single run.

Table 4.3 - Comparison of time and number of runs required for various versions of
FITTED when the “Dock” docking mode is selected for rigid protein docking.

            Time (min) per run / Number of runs
Protein     1.0       1.5      2.6
TK          63/5      30/3     8.5/1
HIVP        114/5     55/3     22/2
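The combined effect of faster runs and fewer runs can be made explicit with a short calculation on the values of Table 4.3. The sketch below assumes that the total cost of docking one ligand is simply the time per run multiplied by the number of runs; this reading is an assumption made for illustration, and the helper names are hypothetical.

```python
# (time per run in minutes, number of runs), copied from Table 4.3
TABLE_4_3 = {
    "TK": {"1.0": (63.0, 5), "1.5": (30.0, 3), "2.6": (8.5, 1)},
    "HIVP": {"1.0": (114.0, 5), "1.5": (55.0, 3), "2.6": (22.0, 2)},
}

def total_minutes(protein, version):
    """Total time per ligand, assuming cost = time per run x number of runs."""
    per_run, runs = TABLE_4_3[protein][version]
    return per_run * runs

def speedup(protein, old, new, per_run=True):
    """Ratio of old to new time, either per run or for the total cost."""
    if per_run:
        return TABLE_4_3[protein][old][0] / TABLE_4_3[protein][new][0]
    return total_minutes(protein, old) / total_minutes(protein, new)

for protein in TABLE_4_3:
    print(protein,
          "per-run speedup 1.5 -> 2.6: %.1f" % speedup(protein, "1.5", "2.6"),
          "total speedup 1.5 -> 2.6: %.1f" % speedup(protein, "1.5", "2.6", per_run=False))
```

Under this assumption, the per-run gain between versions 1.5 and 2.6 is about 3.5-fold for TK and 2.5-fold for HIVP, while the total time per ligand falls from 90 to 8.5 minutes for TK and from 165 to 44 minutes for HIVP.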


Comparing dependencies on input structures. We next turned our attention to the impact

of the input structure of both the ligand and the protein on the pose prediction accuracy of

FITTED. We also investigated how other docking programs perform under the same

conditions. For this comparison we decided to focus more specifically on three important
features which can affect the accuracy of docking programs: the ligand input structure,
the protein input structure and the inclusion of bridging water molecules.

Discussions with the developers and/or technical support of each program allowed a

fair comparative study and an optimal use of these programs. In fact, following

recommendations, many conditions (set of parameters) were tried. In addition, all the

major scoring functions were tried if more than one was available (in FlexX and GOLD).

Some representative data are shown in Figure 4.8. Different levels of accuracy were also
tried with some of the programs as described in the legend of Figure 4.8.

Prior to the description of the experiments and results, we thought this study should be

put in context. In this work, we wish to evaluate the docking ability of programs without

any information other than the crystal structure of the protein. Obviously in the context of

drug design and screening, any relevant information should be given to the docking

program. However, this would bring too many variables to this study as ISs in FITTED or

pharmacophores in FlexX (FlexX-Pharm) can be trained and would significantly increase

their respective accuracy.

We also wish to stress that the primary goal of this study is not to compare programs

but to evaluate the impact of the input parameters on their pose prediction accuracy. In

addition, as in any comparative study, the data collected in this work should be

considered with care as the set is still not large enough to draw conclusions on their

respective accuracy, and some hidden biases may remain as discussed below. In addition,

we have used the RMSDs between ligand crystal structures and docked poses to measure

the docking accuracy. This criterion is believed to be appropriate to evaluate the impact

of input parameters (the relative accuracy under two different conditions) but not to

compare programs.

Ligand input conformations. We first looked at the impact of the ligand structure.
Previously reported comparative studies have typically used only one conformation of the
ligand, either the crystal [6] or a non-crystal [5, 9, 59] conformation. In a real drug
design scenario, the bioactive conformation is unknown and therefore the non-crystal
conformation represents a more realistic scenario. To evaluate the bias when using the
crystallographic conformation of the ligand, we compared the accuracy of the docking
programs when the crystal ligand structures or structures generated with OMEGA [43] were
used alternatively as input (Figure 4.8). In this work, the OMEGA-generated structures were
used to assess the programs’ ability to dock the non-crystal conformation of the ligand. To
our knowledge, none of the docking programs assessed in this work has been trained
with OMEGA-generated structures. However, these ligand structures cannot be
considered as completely unbiased as these conformations may be preferably docked by
one of these programs, a bias that we have not evaluated.

In this work, the docked pose is assumed to be accurately predicted if its RMSD is within
2.0 Å of the crystal binding mode when performing self-docking experiments. Even though the
use of RMSD values is known to be potentially misleading, we believe that it will clearly
reveal drops or increases in accuracy induced by specific parameters or conformations. When
evaluating the accuracy in cross-docking experiments, the additional error introduced
when superposing the protein structures should be considered. Thus, an arbitrary RMSD
of 2.25 Å was selected as the criterion of success in cross-docking experiments.
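The success criterion can be stated precisely in code. The sketch below is a minimal illustration, assuming that the docked pose and the crystal pose list the same heavy atoms in the same order and that the two structures are already in a common frame; corrections for symmetry-equivalent atoms are not included, and the function names are hypothetical.

```python
import math

SELF_DOCKING_CUTOFF = 2.0    # angstroms, criterion stated in the text
CROSS_DOCKING_CUTOFF = 2.25  # angstroms, criterion stated in the text

def rmsd(pose, reference):
    """Root-mean-square deviation between two lists of (x, y, z) tuples."""
    if len(pose) != len(reference) or not pose:
        raise ValueError("poses must be non-empty and of equal length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(pose, reference))
    return math.sqrt(squared / len(pose))

def is_success(pose, reference, cross_docking=False):
    """Apply the 2.0 A (self-docking) or 2.25 A (cross-docking) criterion."""
    cutoff = CROSS_DOCKING_CUTOFF if cross_docking else SELF_DOCKING_CUTOFF
    return rmsd(pose, reference) <= cutoff
```

A pose displaced rigidly by 2.1 Å would therefore count as a failure in self-docking but as a success in cross-docking, which illustrates how the looser cut-off absorbs the superposition error.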

As can be seen in Figure 4.8, the accuracy of all the programs drops by 10 to 20% when
moving from crystal ligand structures (yellow bars) to OMEGA-generated structures
(orange bars), except for FITTED, for which no change in accuracy was observed. This
first piece of data confirmed that a docking program should not be evaluated by using
ligand crystal structures as input. We then used the ring conformational search features
when available (red bars in Figure 4.8). The overall accuracy increases, although without
reaching that observed with crystal structures. The ring conformational search engine
used by Surflex, which covers a wider range of rings, is clearly more efficient than
those used by Glide 4.5, GOLD and FITTED. While this manuscript was in preparation, a
new version of Glide that features a new ring search method was released but was not
used in this work. This most recent version of Glide (v5.0) uses a ring library that
accounts for both larger rings and more ring systems and includes small heterocyclic
rings previously not included. These additional features may increase its accuracy. As no
drop was observed between the use of crystal ligand structures and OMEGA-generated
ligand conformations, the FITTED ring search algorithm was not expected to improve the
binding mode prediction significantly. A closer look at the data for FITTED reveals that
the ring conformation is often searched even when the specific feature is turned off. This
can be rationalized by the generation of highly distorted structures and their optimization
through energy minimization. As a first conclusion, all the programs but FITTED are very
sensitive to the input ligand conformations and the implementation of a ring
conformational search engine can reduce this dependency.

When comparing programs, the accuracy does not change much between programs; the ranking
is led by Surflex (68% with the fully relaxed protein structures), followed by Glide (66% in
XP mode), GOLD (65% with ChemScore and flexible rings), FITTED (59%) and FlexX (54%
with FlexXScore). It should be stressed that this study is carried out with a very difficult
testing set including some of the most challenging proteins such as HCV RNA
polymerase. From now on, we will only present the data for the best set of conditions for
each program unless there is a significant deviation or interesting point of discussion. In
fact, the best conditions described in Figure 4.8 were found to be the best conditions for
most of the following studies. In addition, only the OMEGA-generated structures will be
discussed as we believe that the data obtained with these represent the true accuracy of
the docking programs.

[Figure 4.8 bar chart. Vertical axis: % Success (0 to 90). Programs and conditions: FITTED (VS, Dock); FlexX (SIS, TA; CS, FS, PLP, SS); Glide (Std, RP; HTVS, SP, XP); GOLD (Std, Rbt; CS, GS); Surflex (Std, RP-H, RP-HA; std, PG). Bars: crystal structure, OMEGA-generated structure, OMEGA-generated structure + flexible rings.]

Figure 4.8 - Accuracy vs. ligand and protein conformations. For legend see Table 4.4.

Table 4.4 - Abbreviations used in Figure 4.8

Program    Abbreviation   Definition
FITTED     VS             Virtual Screening mode
           Dock           Docking mode
FlexX      SIS            Single Interaction Scan (matching algorithm)
           TA             Triangle Algorithm
           CS             ChemScore scoring function
           FS             FlexXScore scoring function
           PLP            PLP scoring function
           SS             ScreenScore scoring function
Glide      RP             Refined protein (optimized protein structure)
           Std            Non-refined protein
           HTVS           High Throughput Virtual Screening mode
           SP             Standard Precision mode
           XP             Extra Precision mode
GOLD       Std            Standard (automatic selection of parameters)
           Rbt            Robust
           GS             GoldScore scoring function
           CS             ChemScore scoring function
Surflex    Std            Standard docking
           PG             pgeom
           Std            Non-refined protein
           RP-H           Protein structure with hydrogen atom positions refined
           RP-HA          Refined protein structure with a constrained optimization of the heavy atoms

Protein input conformations. We next looked at the effect of the protein conformation.
The Glide and Surflex developers recommend relaxing the protein structure prior to
docking ligands. This procedure (referred to as “refined proteins” in this manuscript, see
Table 4.4) aims at removing any inaccuracies in the crystal structure. However, it is often
carried out keeping the co-crystallized ligand in place and can be seen as a bias for
self-docking experiments. As can be seen in Figure 4.8, these procedures appeared to have
a moderate impact on the accuracy for Glide (increase of 4%) but a significant impact with
Surflex when the fully refined protein is used. In this latter case, an increase of 15% is
observed in the standard docking mode and 12% if the advanced docking procedure
(pgeom) is used.

The study described above was carried out using a set of self-docking experiments (i.e.,
the protein structure with its native ligand). We next turned our attention to cross-docking
experiments as we believe these experiments would be more representative of the true
accuracy of the docking programs when performing a virtual screen. As expected, all the
programs demonstrated a much poorer accuracy in this set of experiments, with GOLD
being the most accurate, although not significantly (Figure 4.9A). Drops of up to nearly 40%
in the cross-docking success rate relative to the self-docking rate were
recorded. The largest drops were observed for Surflex and Glide when the refined protein
conformations were used. Nevertheless, cross-docking to the refined proteins remains
slightly more accurate than to the crystal structures with Glide. This observation confirms
the developers’ recommendation but also demonstrates that this is a clear bias when
comparing programs running only self-docking experiments. Such large drops in
accuracy between self- and cross-docking have often been observed [10-13, 60, 61]. When the
proteins are considered flexible (by selecting the best scoring poses of the cross-docking
experiments, also known as docking to conformational ensembles), the accuracy of all the
programs but GOLD is significantly improved (Figure 4.9B).
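Docking to a conformational ensemble, as defined in the parenthesis above, amounts to a simple selection over independent cross-docking runs. The sketch below is an illustration only: the record layout is hypothetical, lower scores are assumed to be better, and the optional exclusion of the native protein structure is included as a convenience.

```python
def best_pose_over_ensemble(results, native_id=None):
    """Select the best scoring pose among cross-docking runs.

    results is a list of dicts with keys "protein_id", "score" and "rmsd",
    one per protein conformation the ligand was docked to. Lower scores are
    assumed to be better. If native_id is given, the run against the
    ligand's own (native) protein structure is ignored.
    """
    candidates = [r for r in results if r["protein_id"] != native_id]
    if not candidates:
        raise ValueError("no non-native docking result available")
    return min(candidates, key=lambda r: r["score"])
```

The selected pose is then judged by its RMSD, so the ensemble succeeds only if the scoring function ranks an accurate pose first among all protein conformations.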

Within FITTED, the protein can be made flexible without having recourse to multiple
runs, with run times similar to rigid protein docking. In our previous report, the protein
conformational ensembles used to evaluate this feature included the cognate protein
structures (i.e., protein conformation when co-crystallized with the ligand to be docked)
together with other protein conformations [35, 36]. This approach, incorporating both self-
and cross-docking experiments in a single run, allowed us to demonstrate that FITTED
was able to identify the best protein conformation for a given ligand. However, when
evaluating the docking accuracy, we believe that this specific protein conformation
should not be included. Thus, in this work, only the non-native protein structures were
included. This feature makes FITTED slightly more accurate than Glide, GOLD, Surflex
and FlexX, while FITTED was found to be less accurate in self-docking experiments.

[Figure 4.9 bar charts, panels A and B. Vertical axis: % Success (0 to 100). Horizontal axis labels (panel A): FITTED Flex Glide Glide RP GOLD Surflex Surflex RP-HA. Bars: self-docking, cross-docking, docking to conf. ens. (labelled FlexDock in panel A), docking to flex. prot.]

Figure 4.9 - Self-docking vs. cross-docking for protein (A) with no waters and (B) with

key water molecules.

Water molecules. The three major features of FITTED are protein flexibility, ring
conformational search and displaceable water molecules. At this stage, the importance of
the first two had been investigated. But do water molecules significantly affect the
accuracy as well? To investigate the role of key water molecules, we carried out
self-docking experiments and looked at 4 distinct cases: (i) all waters were removed from the
protein crystal structures (“no waters”); (ii) all key waters were kept (“explicit waters”);
(iii) the best scoring poses of the “no waters” and “explicit waters” experiments were kept,
simulating a displaceable ensemble of water molecules (“displ. ensemble”); (iv) waters
were made displaceable whenever the feature was available (“displ. waters”). The
collected data are shown in Figure 4.10. A small increase was observed for all the
programs when the key waters were kept. When the waters were made displaceable, as
in the “displ. ensemble” and “displ. waters” experiments, the accuracy further
increased. As previously observed by the GOLD developers, the strategy implemented in
GOLD did not improve the docking ability of this program [32]. The same observation was
made with the FlexX program. In contrast, the displaceable water approach implemented
in FITTED was found to be more accurate than the explicit water approach and even than the
“displ. ensemble”. In fact, displ. ensemble simulates either all the waters on or none, while
FITTED can displace each water independently. In addition, as the various optimizations
of FITTED were carried out with this feature turned on, the best results are obtained when
it is used.
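The difference between the “displ. ensemble” protocol and independently displaceable waters can be made concrete by counting the hydration states each one can represent. The sketch below is purely a counting illustration with hypothetical function names; it does not reproduce how FITTED actually samples water occupancy during the evolution.

```python
from itertools import product

def ensemble_states(n_waters):
    """States reachable by "displ. ensemble": all waters off or all waters on."""
    return [tuple([False] * n_waters), tuple([True] * n_waters)]

def independent_states(n_waters):
    """States reachable when each water can be displaced independently."""
    return list(product((False, True), repeat=n_waters))

n = 3
print(len(ensemble_states(n)), "ensemble states versus",
      len(independent_states(n)), "independent states for", n, "waters")
```

With three key waters, the ensemble protocol covers 2 of the 8 possible hydration patterns, so a ligand that displaces only one water cannot be represented exactly by either of its two states.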

[Figure 4.10 bar chart. Vertical axis: % Success (0 to 100). Bars: dry, explicit waters, displ. ensemble, displ. waters.]

Figure 4.10 - Accuracy and water molecules in self-docking experiments.

We then investigated the combination of protein flexibility and water molecules. Figure
4.9B summarizes the data for self- and cross-docking experiments in the presence of
displaceable waters if implemented and displ. ensemble if not. Once more, a slight
improvement is observed with most of the programs, indicating that considering both
features leads to at least similar or improved results (Figures 4.9 and 4.12) and should be
considered by other developers.

The presence of hydrogen bond donors or acceptors in the protein binding site is
expected to help in finding the proper orientation of the ligand. In contrast, non-directional
hydrophobic interactions are directly related to the nature of the compound and protein
binding site and their respective solvation energies more than to any “real” hydrophobic
interactions between protein and ligands. These interactions are therefore expected to be
more difficult to identify in a computationally tractable manner. The proteins of our set
were classified as polar (CDK2, COX-2, ER, FXa, GluK2, HIVP, HIVPD, HIVRT, Thrn,
TK, Tryp, VDR), hydrophobic (HCV Allo, HCV Cat, P38) and metal-containing
enzymes (Mann, MMP-3, Therm), based on the main ISs identified by ProCESS. The
cross-docking data were reorganized to account for this factor and the results are
summarized in Figure 4.11. GOLD appears to be fairly insensitive to the protein type
while Surflex and FlexX were much less accurate with metalloenzymes. The automatic
metal parameters of GOLD and FITTED may explain their good accuracy with
metalloenzymes. More striking is the much greater accuracy of FITTED with hydrophilic
enzymes than with hydrophobic enzymes, while Glide and FlexX are significantly more
accurate with hydrophobic proteins than with polar proteins. Interestingly, the SIS
algorithm in FlexX, developed to improve the accuracy with hydrophobic proteins, led
to an increase in accuracy with this class of proteins when compared to the traditional FlexX
algorithm. Once more, these data indicate that the set used for any comparative study
would have a significant effect on the relative accuracies of programs. For instance,
FITTED would be the second best program if hydrophilic proteins were selected while
being the worst if only hydrophobic proteins were selected.

[Figure 4.11 bar chart. Vertical axis: % Success (0 to 100). Programs: FITTED, FlexX - SIS - FS, FlexX - TA - FS, Glide, GOLD, Surflex. Bars: metal, hydrophobic, polar.]

Figure 4.11 - Protein class and accuracy on cross-docking experiments.

As a summary, the accuracy of each of the assessed programs using the optimal

conditions is shown in Figure 4.12 for self-docking and cross-docking experiments as

well as docking to flexible proteins when available. Overall, the levels of accuracy given

here are significantly lower than the ones provided in other comparative studies.5, 6, 9 In

fact, we found our testing set to be much more challenging than the one we used


previously. Additionally, part of this drop (10-20%) is directly attributed to the use of

OMEGA-generated structures and another part (10-30%) to the use of cross-docking

experiments in place of self-docking experiments used elsewhere.

To further assess the impact of such a protein-specific training and the novel placement

of ISs implemented in the current version of ProCESS, we have carried out additional

experiments. When protein-specific information (e.g., ISs derived from known

ligand/protein complexes) is manually given to FITTED, the accuracy increases

significantly (data not shown). This clearly demonstrates that this manual placement of

ISs (and more specifically of hydrophobic sites) for docking with FITTED remains better

than the automated placement, which should be further improved.

Overall, FITTED, Glide, GOLD and Surflex show very similar accuracies on our testing

set for self docking (i.e., rigid protein). When protein flexibility is considered (ligands

docked to all non-native protein structures in multiple runs and in a single run with

FITTED), FITTED is slightly more accurate (the only one featuring displaceable waters and

protein flexibility simultaneously), followed by GOLD and Glide. It is worth recalling

that FITTED was outperformed by the other programs in the self-docking experiments. It

is clear from Figure 4.12 that implementing protein flexibility would significantly improve

Surflex, Glide and FlexX accuracy while no significant improvements are expected for

GOLD which already uses a soft protein representation (Lennard-Jones 8-4).

The following numbers are given as rough estimates as these programs were run on

various computers and supercomputers with varying processor speed and some programs

(Surflex, FlexX) do not output the CPU time. In the extreme cases FlexX docks a

compound every 30 s, while FITTED is the slowest, by a factor of 10 to 15. When the

criterion of success is made more stringent (RMSD ≤1.0 Å for self-docking and ≤1.25 Å

for cross-docking, Figure 4.12b), FITTED slightly outperforms the other programs for

self-docking, and all programs show similar accuracy in cross-docking with this stricter

criterion. It also shows that the arbitrary limit of 2 Å does not much affect the ranking of

programs by their RMSD-derived accuracy.


[Figure 4.12 plots, panels (a) and (b): % success (0-100) for FITTED, FlexX, Glide, GOLD and Surflex; series: Self-docking, Cross-docking, Flexible proteins]

Figure 4.12 - Accuracy of programs with OMEGA-generated structures. FITTED: Dock

mode; FlexX: ScreenScore and SIS used for the incremental construction; Glide: XP and

refined protein; GOLD: ChemScore; Surflex: pgeom and protein with refined hydrogen

positions. For the flexible protein, the “fully flexible” protein mode of the FITTED

implementation is used. (a) Success criterion: RMSD ≤2.0 Å for self-docking and ≤2.25 Å for

cross-docking; (b) success criterion: RMSD ≤1.0 Å for self-docking and ≤1.25 Å for cross-docking.

From this comparative study we confirmed that a few guidelines should be considered

to perform a proper evaluation: (1) the ligand should be in a conformation other than the

crystal structure; (2) both cross-docking and self-docking experiments should be carried

out; (3) refining the protein structure using the co-crystallized ligand may bias the

self-docking accuracy but does not affect the cross-docking accuracy.

CONCLUSION

We have further modified our docking program FITTED and implemented a ring search

method into the genetic algorithm as well as a matching algorithm to produce the initial


population. This advanced version was tested against major docking programs. It should

be stressed that this work was not intended to rank programs as the ranking varies from

one set of protein / ligand complexes to another. In fact, this work demonstrated that

ranking can significantly vary depending on the protein / ligand set considered (e.g.,

hydrophilic, hydrophobic) as well as the input ligand and protein conformations (e.g.,

crystal structures or OMEGA-generated, self- or cross-docking, with or without water

molecules). With this study, we demonstrated the impact of protein and ligand

conformations as well as protein flexibility and water molecules on the accuracy of

docking programs. We have been working on these last two properties for the last few

years and have shown herein that these two features significantly improve the accuracy of

our docking program FITTED. The placement of hydrophobic interaction sites has been

identified as a remaining issue and more work is currently ongoing to better understand

and identify hydrophobic pockets. This work may also help developers to better

understand the weaknesses and strengths of their respective programs.

EXPERIMENTAL SECTION

Preparation of the docking set. Structures were downloaded from the PDB62 and selected

based on diversity of the ligands, presence of water molecules, flexibility of the protein

and resolution of the crystal structure better than 2.5 Å. In some cases crystal structures with

resolutions poorer than 2.5 Å were kept to increase the diversity of the conformations

seen within a protein. All structures were prepared using Maestro,63 the graphical

interface to the Schrödinger Suite of programs. Structures of the same protein were then

superimposed using the protein structure alignment option within Maestro. The protein

sequences were then homogenized by mutating and deleting missing residues when at

least 10 Å from the binding cavity. If a missing residue was closer than this minimum

distance, the structure was removed from the set. Hydrogen atoms were added using

Maestro and energy-minimized using the OPLS_2005 forcefield. All non-conserved

waters were removed from all the structures. Conserved or key water molecules were

defined as water molecules that make at least 2 hydrogen bonds with the protein and one

with the ligand. The protein and ligand structures were then separated. The ligand crystal

structure was used as input into OMEGA64 to generate new starting conformations. For


this study we had OMEGA only output the most thermodynamically stable conformation

using all standard default values.
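
The key-water criterion used during the preparation (at least 2 hydrogen bonds with the protein and one with the ligand) can be expressed as a simple filter. The sketch below is hypothetical: the `Water` record and the assumption that hydrogen-bond counts have already been computed elsewhere are ours, and "one with the ligand" is read here as "at least one".

```python
from dataclasses import dataclass


@dataclass
class Water:
    name: str
    hbonds_to_protein: int  # assumed to be precomputed by another tool
    hbonds_to_ligand: int   # assumed to be precomputed by another tool


def is_key_water(water, min_protein=2, min_ligand=1):
    """True if the water meets the conserved/key water definition of the text."""
    return (water.hbonds_to_protein >= min_protein
            and water.hbonds_to_ligand >= min_ligand)


def keep_key_waters(waters):
    """Drop every non-conserved water, keeping only key waters."""
    return [w for w in waters if is_key_water(w)]


# Hypothetical waters, for illustration only.
waters = [Water("W1", 2, 1), Water("W2", 3, 0), Water("W3", 1, 1)]
kept = keep_key_waters(waters)
```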

Docking programs methodology. A recent review by our group65 found over 60 docking

programs that have been published. It is becoming ever harder to distinguish which

program is best for a specific protein or in general. To assess how well FITTED performs

compared to other docking programs, a small comparative study was undertaken using

FlexX, Glide, GOLD and Surflex. It is worth noting that even though AutoDock11 or a

combination of Glide and Prime17 can allow for protein flexibility, they were not used due

to time constraints. Also, FlexX does have a module for protein flexibility

(FlexX-Ensemble), but the version used in this study was incompatible with the FlexX-Ensemble

module.

For all docking runs the OMEGA-generated ligand conformation was used for

self-docking, cross-docking and flexible-protein docking. The crystallized ligand structure

was run separately but only the self-docking data is shown. All the docking experiments

were performed using dry proteins (no waters present) unless otherwise stated. When

protein structures contained one or more key water molecules, additional docking experiments were

performed on the wet protein (only the key water molecules for that protein crystal

structure are present). If the docking program has the ability to dock with displaceable

crystallographic waters, additional sets of docking experiments were performed on a

displaceable-water protein structure (all possible water positions occur). When defining

the active site for all the proteins, the largest ligand of the set for a particular protein was

used.


Table 4.5 - List of ligands used to define protein binding sites

Protein PDB Code

CDK2 1aq1

COX-2 4cox

Estrogen Receptor 1sj0

Factor Xa 1nfu

Kainate Glutamate Receptor 1yae

HCV Polymerase Allosteric Pocket 2o5d

HCV Polymerase Catalytic Pocket 2fvc

HIV-1 Protease 1pro

HIV-1 Protease Diols 1hvr

HIV Reverse Transcriptase 1vrt

Mannosidase 2f18

MMP-III 1d8m

p38 Map Kinase 1w82

Thermolysin 3tmn

Thymidine Kinase 1tmt

Thrombin 1qhi

Trypsin 1qbo

Vitamin D receptor 2has

In all cases docking success was measured using the standard RMSD

criterion (RMSD between the heavy atoms of the docked pose and the reported crystal

structure). For self-docking runs we used a criterion of less than 2.0 Å but increased

this to 2.25 Å for cross-docking to account for the error resulting from the superposition

of the proteins. Only the top-scoring pose was used for accuracy measures, as it would be

the one picked in a VS experiment whether or not the docking run was successful. The RMSD

was calculated using the tool provided by each program. One exception was the case of

FlexX and Glide, where the GOLD RMSD script was also used for the calculation of the

RMSDs for HIV-1 protease ligands. Due to the C2-symmetric nature of the HIV-1

protease binding site it was necessary to calculate the RMSD on 2 orientations of the

ligand, the original and the ligand rotated by 180°. With FlexX and Glide this second RMSD could

not be obtained directly, since these programs write the RMSD to the output file of the run and it

cannot be re-computed. The rotated ligand was generated by rotating a duplicated copy of the

protein/ligand complex in space and re-superimposing it using InsightII. With all


programs the RMSD was calculated using both orientations for the HIV-1 protease

ligands with the lowest RMSD being kept.
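
The success criterion described above can be sketched in a few lines. This is a minimal illustration and not the RMSD tool of any of the programs compared: it assumes that the pose and the crystal ligand are given as equally ordered lists of heavy-atom coordinates in the same (superimposed) frame, and, purely for the example, that the C2 axis coincides with the z axis. All function names are hypothetical.

```python
import math


def rmsd(pose, reference):
    """In-place RMSD (no fitting) between two equally ordered coordinate lists."""
    if len(pose) != len(reference) or not pose:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum((a - b) ** 2
                  for p, r in zip(pose, reference)
                  for a, b in zip(p, r))
    return math.sqrt(squared / len(pose))


def rotate_180_about_z(coords):
    """Assumed symmetry operation: 180 degree rotation about the z axis."""
    return [(-x, -y, z) for x, y, z in coords]


def symmetric_rmsd(pose, reference):
    """Lowest RMSD over the original and the 180-degree-rotated reference ligand."""
    return min(rmsd(pose, reference),
               rmsd(pose, rotate_180_about_z(reference)))


def is_success(value, cross_docking=False):
    """2.0 A cutoff for self-docking, 2.25 A for cross-docking (inclusive here)."""
    return value <= (2.25 if cross_docking else 2.0)


# Hypothetical two-atom ligand, for illustration only.
crystal = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
pose = [(-1.0, 0.0, 0.0), (-2.0, 0.0, 0.5)]
plain = rmsd(pose, crystal)
best = symmetric_rmsd(pose, crystal)
```

In this toy case the pose is a failure against the original orientation but a success against the rotated one, which is why both orientations are evaluated for the HIV-1 protease ligands.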

In all cases we have been in contact with either the developers themselves or the

technical support of the programs discussed herein to determine the best conditions for

our comparative study. Where they were uncertain we ran all possibilities.

FlexX 3.1.0.33, 60, 66, 67 FlexX uses an incremental construction algorithm to build up the

ligand within the active site. To determine the placement of the fragments, FlexX uses a

set of interaction sites and then a matching algorithm to find the best match between the

fragment and the interactions. FlexX can account for displaceable water molecules using

the particle water concept, where all possible combinations of the waters being present or

not present are tried and the best-scoring combination is kept. We used the FlexX 3.1.0

interface to construct all the project files for each individual crystal structure. For each

protein, the binding site hydrogen positions were manually oriented to

create optimal hydrogen bonds with the protein and/or the native ligand. For each

structure where water molecules occur, a project file was created for the dry protein (no

waters), the wet protein (waters are treated as spheres) and the displaced-waters protein

(waters are considered as spheres and allowed to be displaceable). Four settings.pxx files

were created so that we could run FlexX through the command-line interface using the

FlexXScore, ChemScore, ScreenScore and PLP scoring functions. Within the .bat file we

would turn on the ring search using the corina_f executable68 provided by Molecular

Networks by using the keyword SET RING_MODE to 1 and/or turn on the SIS docking

algorithm by using the PLACEBAS 1 keyword. At the time of this publication

FlexX-Ensemble was not available for FlexX 3.1.0 and was deemed not ready for a comparative

study by the developers.
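
The treatment of displaceable waters described above (every present/absent combination is tried and the best-scoring one is kept) can be illustrated with a brute-force enumeration. This is a conceptual sketch only and does not reproduce the FlexX code: the function names, the toy scoring function and the convention that a lower score is better are all assumptions made for the example.

```python
from itertools import product


def best_water_combination(waters, score):
    """Try all 2**n on/off combinations of waters; return (best_score, kept_waters).

    `score` is any callable taking the tuple of waters kept and returning a
    number, with lower values assumed to be better.
    """
    best = None
    for mask in product((False, True), repeat=len(waters)):
        kept = tuple(w for w, present in zip(waters, mask) if present)
        value = score(kept)
        if best is None or value < best[0]:
            best = (value, kept)
    return best


def toy_score(kept):
    """Hypothetical score: W1 stabilizes the pose, W2 clashes with the ligand."""
    contribution = {"W1": -1.5, "W2": 0.8}
    return -10.0 + sum(contribution[w] for w in kept)


result = best_water_combination(["W1", "W2"], toy_score)
```

The number of combinations doubles with each water, so such an enumeration is only practical for the handful of key waters found in a binding site.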

Glide 4.5.48, 69, 70 Glide uses a funnel approach to docking by initially creating a series of

ligand conformations then removing the unfavourable ones. With this done, a refinement

is performed by doing an energy minimization followed by a restricted Monte Carlo

search on the lowest-energy conformations. This Monte Carlo search is used to refine the

initial structure.


The protein structures in mol2 format were prepared using the protein preparation

wizard with default values (the resulting proteins are hereafter referred to as refined proteins). Grids

were prepared for the initially prepared protein as well as the refined proteins using a 30 Å

box with the center of the grid being defined by using the largest ligand of the protein in

our set. Default parameters were used to dock with Glide for HTVS, SP and XP docking

modes. With each docking mode, both grids were used individually. With Glide the

default is to allow for flexible rings and therefore to study Glide with rigid rings this

functionality was turned off.

GOLD 3.2.71 GOLD performs a conformational search of the ligand by using a genetic

algorithm. When dealing with displaceable waters GOLD considers all possible

combinations of the water being present or not present keeping the best scoring

combination. The prepared ligand and proteins were used in mol2 format for GOLD. The

automatic settings with the default parameters were used. When docking to the wet

protein (proteins with key water molecules), the orientation of the waters is optimized but

the waters are not displaced. When docking to displaceable water-protein structures, the

waters are set to displaceable and their orientation is optimized. To examine the corner

flap approach, the flip_free_corners keyword was set to 1 in the .conf file. Upon discussion with

the developers it was suggested to try a more robust search. This was done by using the

keywords autoscale = 1.5 and autoscale_nopt_min = 15000. It was also suggested to set

early_termination to 0. Both additions were tried and are referred to as the robust search

in the results.

Surflex 2.3.72 The Surflex docking algorithm combines a shape matching algorithm with

the matching of a protomol that is similar to a pharmacophore. Surflex uses an

incremental construction algorithm with relinking of the fragmented ligand. No interface

was provided with Surflex, which was therefore used from the command line. For Surflex the

prepared ligands and proteins were used in mol2 format. The protomol was initially

generated using the largest ligand of the set for that protein. Surflex was then used to

dock using all the default values. To perform the conformational search of rings the

+rings option was used. Upon discussion with the developers it was suggested we try

the -pgeom option, which is meant to increase docking accuracy, and use their program


to optimize the hydrogens of the proteins. Docking runs using these suggested conditions

were performed.

FITTED 2.6.73 The files describing the proteins, interaction sites and cavity sites were

prepared using the PROCESS module while the ligands were prepared using SMART. The

created files were next used by FITTED. The default parameters were used with each of

these three programs. FITTED is now available at www.FITTED.ca.

Acknowledgment. We thank CIHR and Virochem Pharma for financial support as well as

the Canada Foundation for Innovation for financial support through the New

Opportunities Fund program. CRC held a CIHR-funded Chemical Biology Scholarship

during a portion of this study. We are thankful to the RQCHP for allocation of computer

resources for this study. We would also like to thank the following people for their input

and suggestions on how to improve the docking accuracy of their programs: Ajay Jain,

UCSF (Surflex) and the support departments of both CCDC (GOLD) and Schrödinger

(Glide).

Supporting Information Available: A more detailed listing of results and the comparative

study set is available free of charge via the Internet at http://pubs.acs.org.

REFERENCES

1. Cozza, G.; Bonvini, P.; Zorzi, E.; Poletto, G.; Pagano, M. A.; Sarno, S.; Donella-

Deana, A.; Zagotto, G.; Rosolen, A.; Pinna, L. A.; Meggio, F.; Moro, S.,

Identification of Ellagic Acid as Potent Inhibitor of Protein Kinase CK2: A

Successful Example of a Virtual Screening Application. J. Med. Chem. 2006, 49

(8), 2363-2366.

2. De Graaf, C.; Oostenbrink, C.; Keizers, P. H. J.; Van Der Wijst, T.; Jongejan, A.;

Vermeulen, N. P. E., Catalytic site prediction and virtual screening of cytochrome

P450 2D6 substrates by consideration of water and rescoring in automated docking.

J. Med. Chem. 2006, 49 (8), 2417-2430.


3. Bissantz, C.; Folkers, G.; Rognan, D., Protein-based virtual screening of chemical

databases. 1. Evaluation of different docking/scoring combinations. J. Med. Chem.

2000, 43 (25), 4759-4767.

4. Bursulaya, B. D.; Totrov, M.; Abagyan, R.; Brooks III, C. L., Comparative study of

several algorithms for flexible ligand docking. J. Comput.-Aided Mol. Des. 2003,

17 (11), 755-763.

5. Kontoyianni, M.; McClellan, L. M.; Sokol, G. S., Evaluation of docking

performance: comparative data on docking algorithms. J. Med. Chem. 2004, 47 (3),

558-565.

6. Perola, E.; Walters, W. P.; Charifson, P. S., A detailed comparison of current

docking and scoring methods on systems of pharmaceutical relevance. Proteins

2004, 56 (2), 235-249.

7. Kellenberger, E.; Rodrigo, J.; Muller, P.; Rognan, D., Comparative evaluation of

eight docking tools for docking and virtual screening accuracy. Proteins 2004, 57

(2), 225-242.

8. Cummings, M. D.; DesJarlais, R. L.; Gibbs, A. C.; Mohan, V.; Jaeger, E. P.,

Comparison of automated docking programs as virtual screening tools. J. Med.

Chem. 2005, 48 (4), 962-976.

9. Warren, G. L.; Andrews, C. W.; Capelli, A. M.; Clarke, B.; LaLonde, J.; Lambert,

M. H.; Lindvall, M.; Nevins, N.; Semus, S. F.; Senger, S.; Tedesco, G.; Wall, I. D.;

Woolven, J. M.; Peishoff, C. E.; Head, M. S., A Critical Assessment of Docking

Programs and Scoring Functions. J. Med. Chem. 2006, 49 (20), 5912-5931.

10. Cavasotto, C. N.; Abagyan, R. A., Protein flexibility in ligand docking and virtual

screening to protein kinases. J. Mol. Biol. 2004, 337 (1), 209-225.

11. Osterberg, F.; Morris, G. M.; Sanner, M. F.; Olson, A. J.; Goodsell, D. S.,

Automated docking to multiple target structures: Incorporation of protein mobility

and structural water heterogeneity in autodock. Proteins 2002, 46 (1), 34-40.

12. Murray, C. W.; Baxter, C. A.; Frenkel, A. D., The sensitivity of the results of

molecular docking to induced fit effects: Application to thrombin, thermolysin and

neuraminidase. J. Comput.-Aided Mol. Des. 1999, 13 (6), 547-562.


13. Erickson, J. A.; Jalaie, M.; Robertson, D. H.; Lewis, R. A.; Vieth, M., Lessons in

molecular recognition: The effects of ligand and protein flexibility on molecular

docking accuracy. J. Med. Chem. 2004, 47 (1), 45-55.

14. Cavasotto, C. N.; J.W. Orry, A.; Abagyan, R. A., The challenge of considering

receptor flexibility in ligand docking and virtual screening. Curr. Comput.-Aided

Drug Des. 2005, 1, 423-440.

15. Klebe, G., Virtual ligand screening: strategies, perspectives and limitations. Drug

Discov. Today 2006, 11 (13-14), 580-594.

16. Sousa, S. F.; Fernandes, P. A.; Ramos, M. J., Protein-ligand docking: Current status

and future challenges. Proteins 2006, 65 (1), 15-26.

17. Sherman, W.; Day, T.; Jacobson, M. P.; Friesner, R. A.; Farid, R., Novel Procedure

for Modeling Ligand/Receptor Induced Fit Effects. J. Med. Chem. 2006, 49 (2),

534-553.

18. Claussen, H.; Buning, C.; Rarey, M.; Lengauer, T., FLEXE: Efficient molecular

docking considering protein structure variations. J. Mol. Biol. 2001, 308 (2), 377-

395.

19. Schnecke, V.; Kuhn, L. A., Virtual screening with solvation and ligand-induced

complementarity. Perspect. Drug. Discov. 2000, 20, 171-190.

20. Zavodszky, M. I.; Lei, M.; Thorpe, M. F.; Day, A. R.; Kuhn, L. A., Modeling

correlated main-chain motions in proteins for flexible molecular recognition.

Proteins 2004, 57 (2), 243-261.

21. Sotriffer, C. A.; Dramburg, I., "In situ cross-docking" to simultaneously address

multiple targets. J. Med. Chem. 2005, 48 (9), 3122-3125.

22. Li, Z.; Lazaridis, T., Water at biomolecular binding interfaces. Phys. Chem. Chem.

Phys. 2007, 9 (5), 573-581.

23. Baldwin, E. T.; Bhat, T. N.; Gulnik, S.; Liu, B.; Topol, I. A.; Kiso, Y.; Mimoto, T.;

Mitsuya, H.; Erickson, J. W., Structure of HIV-1 protease with KNI-272, a tight-

binding transition-state analog containing allophenylnorstatine. Structure 1995, 3

(6), 581-590.

24. Wang, Y. X.; Freedberg, D. I.; Wingfield, P. T.; Stahl, S. J.; Kaufman, J. D.; Kiso,

Y.; Bhat, T. N.; Erickson, J. W.; Torchia, D. A., Bound water molecules at the


interface between the HIV-1 protease and a potent inhibitor, KNI-272, determined

by NMR. J. Am. Chem. Soc. 1996, 118 (49), 12287-12290.

25. Kervinen, J.; Thanki, N.; Zdanov, A.; Tino, J.; Barrish, J.; Lin, P. F.; Colonno, R.;

Riccardi, K.; Samanta, H.; Wlodawer, A., Structural analysis of the native and

drug-resistant HIV-1 proteinases complexed with an aminodiol inhibitor. Protein

Pept. Lett. 1996, 3 (6), 399-406.

26. Hong, L.; Zhang, X. J.; Foundling, S.; Hartsuck, J. A.; Tang, J., Structure of a

G48H mutant of HIV-1 protease explains how glycine-48 replacements produce

mutants resistant to inhibitor drugs. FEBS Lett. 1997, 420 (1), 11-16.

27. Louis, J. M.; Dyda, F.; Nashed, N. T.; Kimmel, A. R.; Davies, D. R., Hydrophilic

peptides derived from the transframe region of Gag-Pol inhibit the HIV-1 protease.

Biochemistry 1998, 37 (8), 2105-2110.

28. Lam, P. Y. S.; Jadhav, P. K.; Eyermann, C. J.; Hodge, C. N.; Ru, Y.; Bacheler, L.

T.; Meek, J. L.; Otto, M. J.; Rayner, M. M.; Wong, Y. N.; Chang, C. H.; Weber, P.

C.; Jackson, D. A.; Sharpe, T. R.; Erickson-Viitanen, S., Rational design of potent,

bioavailable, nonpeptide cyclic ureas as HIV protease inhibitors. Science 1994, 263

(5145), 380-384.

29. Grzesiek, S.; Bax, A.; Nicholson, L. K.; Yamazaki, T.; Wingfield, P.; Stahl, S. J.;

Eyermann, C. J.; Torchia, D. A.; Nicholas Hodge, C.; Lam, P. Y. S.; Jadhav, P. K.;

Chang, C. H., NMR evidence for the displacement of a conserved interior water

molecule in HIV protease by a non-peptide cyclic urea-based inhibitor. J. Am.

Chem. Soc. 1994, 116 (4), 1581-1582.

30. Hodge, C. N.; Aldrich, P. E.; Bacheler, L. T.; Chang, C. H.; Eyermann, C. J.;

Garber, S.; Grubb, M.; Jackson, D. A.; Jadhav, P. K.; Korant, B.; Lam, P. Y. S.;

Maurin, M. B.; Meek, J. L.; Otto, M. J.; Rayner, M. M.; Reid, C.; Sharpe, T. R.;

Shum, L.; Winslow, D. L.; Erickson-Viitanen, S., Improved cyclic urea inhibitors

of the HIV-1 protease: Synthesis, potency, resistance profile, human

pharmacokinetics and X-ray crystal structure of DMP 450. Chem. Biol. 1996, 3 (4),

301-314.

31. Moitessier, N.; Westhof, E.; Hanessian, S., Docking of aminoglycosides to

hydrated and flexible RNA. J. Med. Chem. 2006, 49 (3), 1023-1033.


32. Verdonk, M. L.; Chessari, G.; Cole, J. C.; Hartshorn, M. J.; Murray, C. W.;

Nissink, J. W. M.; Taylor, R. D.; Taylor, R., Modeling water molecules in protein-

ligand docking using GOLD. J. Med. Chem. 2005, 48 (20), 6504-6515.

33. Rarey, M.; Kramer, B.; Lengauer, T., The particle concept: Placing discrete water

molecules during protein- ligand docking predictions. Proteins 1999, 34 (1), 17-28.

34. 14.3.5 Mapping amino acids to templates. In FlexX Release 3 with GUI User Guide

and Technical Reference, BioSolveIT GmbH: 2007; p 310.

35. Corbeil, C. R.; Englebienne, P.; Moitessier, N., Docking ligands into flexible and

solvated macromolecules. 1. Development and validation of FITTED 1.0. J. Chem.

Inf. Model. 2007, 47 (2), 435-449.

36. Corbeil, C. R.; Englebienne, P.; Yannopoulos, C. G.; Chan, L.; Das, S. K.;

Bilimoria, D.; Heureux, L.; Moitessier, N., Docking ligands into flexible and

solvated macromolecules. 2. Development and application of FITTED 1.5 to the

virtual screening of potential HCV polymerase inhibitors. J. Chem. Inf. Model.

2008, 48 (4), 902-909.

37. Englebienne, P.; Fiaux, H.; Kuntz, D. A.; Corbeil, C. R.; Gerber-Lemaire, S.; Rose,

D. R.; Moitessier, N., Evaluation of docking programs for predicting binding of

Golgi α-mannosidase II inhibitors: A comparison with crystallography. Proteins

2007, 69 (1), 160-176.

38. Kirchmair, J.; Markt, P.; Distinto, S.; Wolber, G.; Langer, T., Evaluation of the

performance of 3D virtual screening protocols: RMSD comparisons, enrichment

assessments, and decoy selection—What can we learn from earlier mistakes? J.


39. Good, A.; Oprea, T., Optimization of CAMD techniques 3. Virtual screening

enrichment studies: a help or hindrance in tool selection? J. Comput.-Aided Mol.

Des. 2008, 22 (3), 169-178.

40. Jain, A. N., Bias, reporting, and sharing: Computational evaluations of docking


41. Jain, A. N., Bias, reporting, and sharing: computational evaluations of docking

methods. J. Comput.-Aided Mol. Des. 2007, 1-12.


42. Hartshorn, M. J.; Verdonk, M. L.; Chessari, G.; Brewerton, S. C.; Mooij, W. T. M.;

Mortenson, P. N.; Murray, C. W., Diverse, high-quality test set for the validation of

protein-ligand docking performance. J. Med. Chem. 2007, 50 (4), 726-741.

43. Boström, J.; Greenwood, J. R.; Gottfries, J., Assessing the performance of OMEGA

with respect to retrieving bioactive conformations. J. Mol. Graph. Modell. 2003, 21

(5), 449-462.

44. Gasteiger, J.; Sadowski, J.; Schuur, J.; Selzer, P.; Steinhauer, L.; Steinhauer, V.,

Chemical information in 3D space. J. Chem. Inf. Comput. Sci. 1996, 36 (5), 1030-

1037.

45. Boström, J., Reproducing the conformations of protein-bound ligands: A critical

evaluation of several popular conformational searching tools. J. Comput.-Aided

Mol. Des. 2001, 15 (12), 1137-1152.

46. Jain, A. N.; Nicholls, A., Recommendations for evaluation of computational


47. Jain, A., Surflex-Dock 2.1: Robust performance from ligand energetic modeling,

ring flexibility, and knowledge-based search. J. Comput.-Aided Mol. Des. 2007, 21

(5), 281-306.

48. Friesner, R. A.; Banks, J. L.; Murphy, R. B.; Halgren, T. A.; Klicic, J. J.; Mainz, D.

T.; Repasky, M. P.; Knoll, E. H.; Shelley, M.; Perry, J. K.; Shaw, D. E.; Francis, P.;

Shenkin, P. S., Glide: A new approach for rapid, accurate docking and scoring. 1.

Method and assessment of docking accuracy. J. Med. Chem. 2004, 47 (7), 1739-

1749.

49. 5.2.3 Setting Docking Options. In Glide 5.0 User Manual, Schrödinger, LLC.:

2008; p 44.

50. Jones, G.; Willett, P.; Glen, R. C., Molecular recognition of receptor sites using a

genetic algorithm with a description of desolvation. J. Mol. Biol. 1995, 245 (1), 43-

53.

51. Jones, G.; Willett, P.; Glen, R. C.; Leach, A. R.; Taylor, R., Development and

validation of a genetic algorithm for flexible docking. J. Mol. Biol. 1997, 267 (3),

727-748.

52. Payne, A. W. R.; Glen, R. C., Molecular recognition using a binary genetic search

algorithm. J. Mol. Graph. 1993, 11 (2), 74-91+121.



generation of ring conformations. J. Am. Chem. Soc. 1989, 111 (24), 8950-8951.

54. 5.2.3 Setting Docking Options. In Glide 4.5 User Manual, Schrödinger, LLC.:

2007; p 42.

55. Thompson, H. B., Calculation of Cartesian Coordinates and Their Derivatives from

Internal Molecular Coordinates. J. Chem. Phys. 1967, 47 (9), 3407-3410.

56. Harvey, M. A.; Baggio, S.; Baggio, R., A new simplifying approach to molecular

geometry description: The vectorial bond-valence model. Acta Crystallogr. B 2006,

62 (6), 1038-1042.

57. O'Boyle, N. M.; Brewerton, S. C.; Taylor, R., Using buriedness to improve

discrimination between actives and inactives in docking. J. Chem. Inf. Model. 2008,

48 (6), 1269-1278.

58. Moitessier, N.; Henry, C.; Maigret, B.; Chapleur, Y., Combining pharmacophore

search, automated docking, and molecular dynamics simulations as a novel strategy

for flexible docking. Proof of concept: Docking of arginine-glycine-aspartic acid-

like compounds into the alphav beta3 Binding Site. J. Med. Chem. 2004, 47 (17),

4178-4187.

59. Chen, H.; Lyne, P. D.; Giordanetto, F.; Lovell, T.; Li, J., On evaluating molecular-

docking methods for pose prediction and enrichment factors. J. Chem. Inf. Model.

2006, 46 (1), 401-415.

60. Kramer, B.; Rarey, M.; Lengauer, T., Evaluation of the FLEXX incremental

construction algorithm for protein-ligand docking. Proteins 1999, 37 (2), 228-241.

61. Birch, L.; Murray, C. W.; Hartshorn, M. J.; Tickle, I. J.; Verdonk, M. L.,

Sensitivity of molecular docking to induced fit effects in influenza virus

neuraminidase. J. Comput.-Aided Mol. Des. 2002, 16 (12), 855-869.

62. Bernstein, F. C.; Koetzle, T. F.; Williams, G. J. B., The protein data bank: a

computer based archival file for macromolecular structures. J. Mol. Biol. 1977, 112

(3), 535-542.

63. Maestro, 8.0; Schrödinger, LLC.: Portland, OR, 2007.

64. OMEGA, 2.2.1; OpenEye Scientific Software: Santa Fe, NM, 2007.


65. Moitessier, N.; Englebienne, P.; Lee, D.; Lawandi, J.; Corbeil, C. R., Towards the

development of universal, fast and highly accurate docking/scoring methods: A

long way to go. Br. J. Pharmacol. 2008, 153 (SUPPL. 1), S7-S26.

66. Rarey, M.; Kramer, B.; Lengauer, T.; Klebe, G., A fast flexible docking method

using an incremental construction algorithm. J. Mol. Biol. 1996, 261 (3), 470-489.

67. FlexX, 3.1.0; BioSolveIT: Sankt Augustin, Germany, 2008.

68. Corina_F, 3.4; Molecular Networks: Erlangen, Germany, 2008.

69. Halgren, T. A.; Murphy, R. B.; Friesner, R. A.; Beard, H. S.; Frye, L. L.; Pollard,

W. T.; Banks, J. L., Glide: A new approach for rapid, accurate docking and scoring.

2. Enrichment factors in database screening. J. Med. Chem. 2004, 47 (7), 1750-

1759.

70. Glide, 4.5; Schrödinger, LLC.: Portland, OR, 2007.

71. GOLD, 3.2; Cambridge Crystallographic Data Centre: Cambridge, UK, 2007.

72. Surflex, 2.3; BioPharmics, LLC: San Francisco, CA, 2008.

73. Corbeil, C. R.; Englebienne, P.; Moitessier, N. FITTED, 2.6; McGill University:

Montreal, Que., 2008.

CHAPTER 5

- 185 -

CHAPTER FIVE

Although popular in the drug design field, computational tools for virtual

screening are lacking in the field of asymmetric catalyst design. ACE was created from

the expertise gained through the development of FITTED. The main challenge when

creating a virtual screening tool for asymmetric catalyst development is the need to

develop an on-the-fly determination of transition state parameters for a molecular

mechanics forcefield. This was addressed by using a linear combination of transition state

parameters along with a genetic algorithm to enable an efficient conformational search of

the transition state structure. ACE was then validated on two reactions and showed

excellent correlations between experimental and predicted stereoselectivities.

This chapter is a copy and is reproduced with permission from Angewandte Chemie,

International Edition. This article is cited as Corbeil, C. R.; Thielges, S.;

Schwartzentruber, J. A.; Moitessier, N., Toward a Computational Tool Predicting the

Stereochemical Outcome of Asymmetric Reactions: Development and Application of a

Rapid and Accurate Program Based on Organic Principles. Angewandte Chemie

International Edition 2008, 47, (14), 2635-2638. Copyright 2008, with permission from

Wiley.


TOWARD A COMPUTATIONAL TOOL PREDICTING THE

STEREOCHEMICAL OUTCOME OF ASYMMETRIC REACTIONS.

DEVELOPMENT AND APPLICATION OF A RAPID AND ACCURATE

PROGRAM BASED ON ORGANIC PRINCIPLES.

The asymmetric catalyst discovery process as practiced now often relies on expensive

(and sometimes serendipitous) stepwise optimization and/or library screening.1 We believe

that this is poised to change, as computational predictive methods have reached a level of

accuracy that obviates many steps now done manually. We report herein the early version

of a new program, ACE (Asymmetric Catalyst Evaluation), its underlying concepts, and

the assessment of its applicability and accuracy in distinguishing efficient asymmetric

catalysts or chiral auxiliaries from inferior ones.

Although much effort has been directed toward the development of computer-aided

drug design tools, there has been little investigation into computational tools for

asymmetric catalyst design. Nowadays, the fields of quantum mechanics (QM) and quantum
mechanics/molecular mechanics (QM/MM)2 are highly developed and have yielded accurate
predictions of asymmetric reaction stereoselectivities.3-6 However, QM methods would
require months of computation to screen a library of potential catalysts in the search for
new ones. To address this issue, other methods have been developed, including reverse
docking,7, 8 quantitative structure-selectivity relationships9-11 and, more specifically,
the use of quantum mechanical interaction fields.12, 13 As another alternative to QM
techniques, molecular mechanics (MM) applied to ground state structures has been
used.14 Advanced MM-based transition state (TS) techniques, which accurately predict
TS structures and their relative potential energies, have also been reported.15 Although
these methods (e.g., Q2MM,16 TS force fields,17 SEAM,18, 19 Empirical Valence
Bond (EVB)20, 21 and multiconfiguration MM (MCMM)22) have shown great potential in
locating and investigating TS’s, only a very few studies have attempted to
predict the stereochemical outcome of reactions,7, 8, 14, 23-28 with even fewer applications
to the design of new asymmetric catalysts.13, 29, 30 In fact, one major shortcoming of force
fields is the lack of accurate parameters for metal complexes, which are necessary to model
metal-catalyzed reactions and need to be specifically developed.31

ACE is a molecular mechanics-based independent program that has been developed
from simple organic chemistry principles. For example, the Hammond-Leffler postulate
states that the TS most resembles the species (reactants or products) to which it is
closest in energy. Following this principle, ACE constructs TS’s from a linear combination
of reactants and products, including a factor (λ) describing the position of the TS on the
potential energy surface (Eq. 5.1, with 0 < λ < 1). A similar approach is used to locate
transition states by the EVB method mentioned above, where λ is iterated from 0 to 1 to
find the maximum energy corresponding to the TS. EVB has indeed been successfully used in
the study of several enzymatic mechanisms.21 Within ACE, interactions between two atoms
forming a bond are described as both covalent bond and non-bonded interactions with
weights (1-λ) and λ for each of these two types of interaction. Angles, torsions and
non-bonded interactions between atoms of the reacting center are also scaled by either
(1-λ) if found in the reactants or λ if found in the products. As a comparison, λ can be
related to the Brønsted coefficient, which measures the role of the reacting partners in a TS.

$$\mathrm{TS} = (1 - \lambda)\,\mathrm{reactant} + \lambda\,\mathrm{product} \tag{5.1}$$
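The λ-weighting described above (reactant-type terms scaled by 1-λ, product-type terms scaled by λ) can be illustrated with a minimal Python sketch. The function name `ts_energy`, its arguments and the numerical values are hypothetical and chosen only for illustration; they are not taken from ACE.

```python
def ts_energy(e_reactant_terms: float, e_product_terms: float, lam: float) -> float:
    """Combine force field energy terms following Eq. 5.1.

    e_reactant_terms: energy of the interactions found only in the reactants
    e_product_terms:  energy of the interactions found only in the products
    lam:              position of the TS along the reaction (0 < lam < 1)
    All names and units are illustrative assumptions.
    """
    if not 0.0 < lam < 1.0:
        raise ValueError("lambda must satisfy 0 < lambda < 1")
    # Reactant-type terms are weighted by (1 - lambda), product-type terms by lambda.
    return (1.0 - lam) * e_reactant_terms + lam * e_product_terms


# Illustrative values only (arbitrary energy units).
early_ts = ts_energy(10.0, 30.0, lam=0.2)  # reactant-like TS
late_ts = ts_energy(10.0, 30.0, lam=0.6)   # product-like TS
```

A small λ keeps the model close to the reactant description (early TS), while a larger λ shifts it toward the product description (late TS).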

According to the Curtin-Hammett principle, stereoisomeric excesses can be derived from the

difference in the diastereomeric TS energies, in this case the MM3* force field potential

energies. This force field has already been used with the SEAM and TSFF approaches to

predict TS energy differences.
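As a sketch of how an excess can be derived from a TS energy difference, the Python function below applies the standard Boltzmann relation: a product ratio of exp(ΔΔE/RT) corresponds to an excess of tanh(ΔΔE/2RT). The function name, the default temperature of 298.15 K and the optional `scale` argument are assumptions made for this illustration (the chapter later applies a factor of 0.5 to the potential energy difference); this is not ACE's actual code.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def excess_percent(delta_e_kcal: float,
                   temperature_k: float = 298.15,
                   scale: float = 1.0) -> float:
    """Signed stereoisomeric excess (%) from a TS energy difference.

    delta_e_kcal = E(TS leading to isomer B) - E(TS leading to isomer A),
    so a positive value favours isomer A and returns a positive excess.
    `scale` is a hypothetical empirical factor applied to the energy difference.
    """
    ddg = scale * delta_e_kcal
    # Ratio A/B = exp(ddg / RT); excess = (A - B) / (A + B) = tanh(ddg / 2RT).
    return 100.0 * math.tanh(ddg / (2.0 * R_KCAL * temperature_k))


# Illustrative call: a 1 kcal/mol gap at the assumed temperature.
example = excess_percent(1.0)
```

Because the relation is a hyperbolic tangent, small errors in the energy difference matter most for moderately selective reactions and much less once the excess approaches 100%.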


Figure 5.1 - General synthetic scheme and representative dienophiles 1a-f and dienes 2a-c used in the validation study.

For each of the diene/dienophile pairs, reactants and products were built
considering only an endo attack, known to be favored in this type of reaction. Prior to
running the computation, λ has to be set. It is well known that Diels-Alder reactions in the
presence of strong Lewis acids have low energies of activation and early TS’s, a situation
which corresponds to low values of λ. In order to evaluate the impact of the selected λ
value, values ranging from 0.10 to 0.60 in steps of 0.10 were used.


Figure 5.2 - Predicted (grey) vs. observed (black) diastereomeric excesses for 44 Diels-Alder reactions (bar chart; excess axis from -100 to 100, one pair of bars per reaction). Positive excesses refer to the (R) isomers while negative excesses refer to the (S) isomers (λ = 0.20).

ACE creates the TS’s from reactants and products prepared using graphical
interfaces and ESFF charges32 and carries out a conformational analysis using a genetic
algorithm similar to the one previously implemented in our docking program FITTED
1.0.33 This algorithm samples the conformational space of the transition structures. The

potential energy was computed for each of the TS’s, and diastereomeric excesses were

derived and compared to the experimental data. Initially, the difference in potential
energy between the diastereomeric TS’s consistently overestimated the experimentally
observed difference in free energy. A correction factor (0.5) was applied to the potential
energy difference to better align predictions with observations. Although this factor has
no true physical meaning, it may reflect the difference between force field potential
energy in vacuo and experimental free energy in solvent, as well as a steeper modelled
potential energy surface (PES) at the TS. Plots of ΔΔG(predicted) vs. ΔΔG(experimental) are given as

supplementary material. Overall, the rank-ordered list was not strongly affected by the

value of λ when in the range [0.1-0.3]. Since the ranking is more important for virtual

screening than the predicted absolute values, the selection of λ would not have much

impact on the success of a screening campaign. However, increasing λ led to slightly (λ=

0.4) or significantly (λ ≥ 0.5) reduced accuracies. These data demonstrate that λ does not

have to be fully optimized but should be selected with care based on the type of reaction

analyzed. For this class of reaction, λ has to be lower than 0.5, suggesting an early TS,
which is consistent with DFT studies.34 In fact, when using λ = 0.10 or λ = 0.20, the
predicted distances of the forming/breaking bonds (in the range of 2.05-2.15 Å) match well
the distances computed with higher-level calculations on model systems (ranging
from 2.05 to 2.55 Å).34 However, the two forming bonds show the
same distances with ACE, while the attack is usually asynchronous. Further development
of the method is needed to account for this effect.

Applied to the entire set, ACE accurately predicted the correct isomer in 41 out of

the 44 cases. The major failures (# 1-4 on Figure 5.2) were observed with polycyclic

auxiliaries exemplified by 1e. This suggests that the force field description of complex

molecules has to be refined.

In practice, a tool like ACE would be of interest for its ability to discriminate very

good auxiliaries from a list of potential auxiliaries. The predicted 20 best of the 44

systems were first considered. Experimentally, 19 of these 20 systems led to selectivities

of over 80%, with 15 over 90% and 13 over 95%. On the other side, the 10 systems

which were predicted to provide the lowest selectivities were considered. Six out of these

10 systems had experimentally obtained diastereomeric excesses below 70% and only

one obtained an excess over 95%. These data clearly show the potential of this method to

discriminate between efficient and inefficient chiral auxiliaries.
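The kind of bookkeeping behind these counts (systems ranked by predicted selectivity, then checked against experiment) can be sketched as follows. The helper names and the six data points are invented for illustration only and are not the thesis data.

```python
def count_hits(predicted, observed, top_n, threshold):
    """Among the top_n systems ranked by |predicted excess|, count those
    whose |observed excess| reaches the threshold (both in %)."""
    ranked = sorted(range(len(predicted)),
                    key=lambda i: abs(predicted[i]),
                    reverse=True)
    return sum(1 for i in ranked[:top_n] if abs(observed[i]) >= threshold)


def count_correct_isomer(predicted, observed):
    """Count systems for which prediction and experiment have the same sign,
    i.e. the same major isomer."""
    return sum(1 for p, o in zip(predicted, observed) if p * o > 0)


# Made-up signed excesses (%), for illustration only.
predicted = [95.0, -88.0, 40.0, 99.0, -10.0, 72.0]
observed = [92.0, -81.0, 55.0, 97.0, 15.0, 60.0]

hits_top3 = count_hits(predicted, observed, top_n=3, threshold=80.0)
correct = count_correct_isomer(predicted, observed)
```

Ranking on the absolute value treats (R)- and (S)-selective systems alike, which matches the goal of finding highly selective auxiliaries regardless of which isomer they favour.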

The second reaction we investigated was the asymmetric organocatalyzed aldol

reaction (Figure 5.3). Reported reactions using various combinations of ketones,

aldehydes and proline derivatives used as catalysts were selected, for a total of 40

combinations.


Figure 5.3 - General synthetic scheme and representative catalysts (6a-e) and aldehydes (5a-d) used in the validation study.

According to extensive experimental and DFT studies, this reaction involves the
formation of a flexible macrocyclic TS35, 36 and so required sampling the conformational
space of large rings. The corner flapping approach37 was implemented in ACE to carry
out this conformational search. DFT studies found the key TS to be closer in
energy to the produced intermediate than to the starting reactants, implying a λ value
greater than 0.5.35 Figure 5.4 summarizes the results obtained with λ = 0.60, though λ in
the range 0.60 to 0.75 led to similar results. As for the Diels-Alder reaction, ACE TS’s
can be compared to TS’s obtained using higher-level calculations. Figure 5.5 illustrates
the superposition of the most energetically favoured TS structures as proposed by DFT
and ACE. The distances of the forming bond predicted by these two methods are within
0.1 Å of each other.

Figure 5.4 - Predicted (grey) vs. observed (black) diastereomeric excesses for 17 selected cases (bar chart; excess axis from -100 to 100). Positive excess refers to the (R) isomer while negative excess refers to the (S) isomer (λ = 0.6). The complete data (40 cases) is given as supporting information.

Figure 5.5 - Predicted TS structure for the reaction involving 4, 5c and 6a. Grey: DFT prediction; black: ACE prediction.


Figure 5.6 - ACE predictions (grey) and DFT predictions (white) vs. observed (black) diastereomeric excesses for 4 selected cases (4 + 5b, 4 + 5c, 4 + 5d and 5d + 5d).

In the asymmetric organocatalyzed aldol reaction, ACE was again accurate, with
the correct isomers predicted in 38 cases out of 40. Extensive investigations did not
reveal the cause of the two failures. Most of the cases investigated here
are known to provide excesses below 80%, equivalent to a small difference in energies
between diastereomeric TS’s, which makes this second validation study more challenging.
Only one of the computed reactions experimentally showed an enantiomeric excess higher
than 90% (#4 in Figure 5.4) and was indeed predicted to lead to the highest selectivity
of the set (prediction: 99%). This demonstrates that ACE could accurately guide the
design of efficient catalysts.

As another validation, it is of interest to compare high-level calculation results,

when available, with these results. Houk and co-workers have reported an exhaustive

study on the proline-catalyzed aldol reaction of acetone with various aldehydes, in an

attempt to assess the predictive power of DFT.4 As shown in Figure 5.6, ACE shows

accuracy close to DFT but within a much shorter period of time. This unexpectedly high

accuracy might be attributable to the exhaustive conformational search of the macrocyclic

TS’s carried out by ACE but not by DFT techniques. In fact, ACE could be used as a

conformational search engine providing high-quality starting structures for further DFT

studies. In addition, this software provides good quality transition structures that can be

used for rationalizations of data in place of CPK models, as additional pairwise
interaction energies can be output.


The trade-off between computing speed and accuracy of predictions is well known.

In this communication, we have presented a unique computational tool, ACE, which

performs conformational sampling, TS potential energy optimization, and TS relative

energy evaluation within less than an hour on a standard PC. Application of this tool to

two well-established reactions has revealed its good accuracy in predicting enantio-

/diastereomeric excesses. Future enhancements and applications/validations are ongoing

to improve and assess the predictive power and versatility of the software as well as its

transferability to other reactions. Metal-catalyzed reactions are being investigated.

However, the early version of ACE shows considerable promise and we believe should

be transferable to any other reactions with well known mechanisms.


REFERENCES

1. Francis, M. B.; Jacobsen, E. N., Discovery of novel catalysts for alkene epoxidation

from metal-binding combinatorial libraries. Angew. Chem. Int. Ed. 1999, 38, (7),

937-941.

2. Lin, H.; Truhlar, D. G., QM/MM: What have we learned, where are we, and where

do we go from here? Theor. Chem. Acc. 2007, 117, (2), 185-199.

3. Panda, M.; Phuan, P. W.; Kozlowski, M. C., Theoretical and experimental studies

of asymmetric organozinc additions to benzaldehyde catalyzed by flexible and

constrained γ-amino alcohols. J. Org. Chem. 2003, 68, (2), 564-571.

4. Bahmanyar, S.; Houk, K. N.; Martin, H. J.; List, B., Quantum Mechanical

Predictions of the Stereoselectivities of Proline-Catalyzed Asymmetric

Intermolecular Aldol Reactions. J. Am. Chem. Soc. 2003, 125, (9), 2475-2479.

5. Garcia, J. I.; Jimenez-Oses, G.; Martinez-Merino, V.; Mayoral, J. A.; Pires, E.;

Villalba, I., QM/MM modeling of enantioselective pybox-ruthenium- and box-

copper-catalyzed cyclopropanation reactions: Scope, performance, and applications

to ligand design. Chem. Eur. J. 2007, 13, (14), 4064-4073.

6. Goumans, T. P. M.; Ehlers, A. W.; Lammertsma, K., The asymmetric Schrock

olefin metathesis catalyst. A computational study. Organometallics 2005, 24, (13),

3200-3206.

7. Harriman, D. J.; Deslongchamps, G., Reverse-docking as a computational tool for

the study of asymmetric organocatalysis. J. Comput.-Aided Mol. Des. 2004, 18, (5),

303-308.

8. Harriman, D. J.; Lambropoulos, A.; Deslongchamps, G., In silico correlation of

enantioselectivity for the TADDOL catalyzed asymmetric hetero-Diels-Alder

reaction. Tetrahedron Lett. 2007, 48, (4), 689-692.

9. Chavali, S.; Lin, B.; Miller, D. C.; Camarda, K. V., Environmentally-benign

transition metal catalyst design using optimization techniques. Comp. Chem. Eng.

2004, 28, (5), 605-611.

10. Lin, B.; Chavali, S.; Camarda, K.; Miller, D. C., Computer-aided molecular design

using Tabu search. Comp. Chem. Eng. 2005, 29, (2), 337-347.


11. Sciabola, S.; Alex, A.; Higginson, P. D.; Mitchell, J. C.; Snowden, M. J.; Morao, I.,

Theoretical prediction of the enantiomeric excess in asymmetric catalysis. An

alignment-independent molecular interaction field based approach. J. Org. Chem.

2005, 70, (22), 9025-9027.

12. Ianni, J. C.; Annamalai, V.; Phuan, P. W.; Panda, M.; Kozlowski, M. C., A priori

theoretical prediction of selectivity in asymmetric catalysis: Design of chiral

catalysts by using quantum molecular interaction fields. Angew. Chem. Int. Ed.

2006, 45, (33), 5502-5505.

13. Huang, J.; Ianni, J. C.; Antoline, J. E.; Hsung, R. P.; Kozlowski, M. C., De novo

chiral amino alcohols in catalyzing asymmetric additions to aryl aldehydes. Org.

Lett. 2006, 8, (8), 1565-1568.

14. Deeth, R. J.; Fey, N., A molecular mechanics study of copper(II)-catalyzed

asymmetric Diels-Alder reactions. Organometallics 2004, 23, (5), 1042-1054.

15. Jensen, F.; Norrby, P. O., Transition states from empirical force fields. Theor.

Chem. Acc. 2003, 109, (1), 1-7.

16. Norrby, P. O., Selectivity in asymmetric synthesis from QM-guided molecular

mechanics. J. Mol. Struct. THEOCHEM 2000, 506, 9-16.

17. Eksterowicz, J. E.; Houk, K. N., Transition-state modeling with empirical force

fields. Chem. Rev. 1993, 93, (7), 2439-2461.

18. Olsen, P. T.; Jensen, F., Modeling chemical reactions for conformationally mobile

systems with force field methods. J. Chem. Phys. 2003, 118, (8), 3523-3531.

19. Jensen, F., Using force fields methods for locating transition structures. J. Chem.

Phys. 2003, 119, (17), 8804-8808.

20. Warshel, A.; Weiss, R. M., An empirical valence bond approach for comparing

reactions in solutions and in enzymes. J. Am. Chem. Soc. 1980, 102, (20), 6218-

6226.

21. Aqvist, J.; Warshel, A., Simulation of enzyme reactions using valence bond force

fields and other hybrid quantum/classical approaches. Chem. Rev. 1993, 93, (7),

2523-2544.

22. Truhlar, D. G., Valence bond theory for chemical dynamics. J. Comput. Chem.

2007, 28, (1), 73-86.


23. Moitessier, N.; Chrétien, F.; Chapleur, Y.; Maigret, B., Molecular dynamics-based

models explain the unexpected diastereoselectivity of the Sharpless asymmetric
dihydroxylation of allyl D-xylosides. Eur. J. Org. Chem. 2000, (6), 995.

24. Moitessier, N.; Henry, C.; Len, C.; Chapleur, Y., Toward a Computational Tool

Predicting the Stereochemical Outcome of Asymmetric Reactions. 1. Application to

Sharpless Asymmetric Dihydroxylation. J. Org. Chem. 2002, 67, (21), 7275-7282.

25. Fristrup, P.; Jensen, G. H.; Andersen, M. L. N.; Tanner, D.; Norrby, P. O.,

Combining Q2MM modeling and kinetic studies for refinement of the osmium-

catalyzed asymmetric dihydroxylation (AD) mnemonic. J. Organomet. Chem.

2006, 691, (10), 2182-2198.

26. Gennari, C.; Fioravanzo, E.; Bernardi, A.; Vulpetti, A., Origins of stereoselectivity

in the addition of allyl- and crotylboronates to aldehydes: The development and

application of a force field model of the transition state. Tetrahedron 1994, 50, (29),

8815-8826.

27. Rasmussen, T.; Norrby, P. O., Modeling the stereoselectivity of the β-amino

alcohol-promoted addition of dialkylzinc to aldehydes. J. Am. Chem. Soc. 2003,

125, (17), 5130-5138.

28. Bernardi, A.; Gennari, C.; Goodman, J. M.; Paterson, I., The rational design and

systematic analysis of asymmetric aldol reactions using enol borinates:

Applications of transition state computer modelling. Tetrahedron Asymmetry 1995,

6, (11), 2613-2636.

29. Kozlowski, M. C.; Waters, S. P.; Skudlarek, J. W.; Evans, C. A., Computer-Aided

Design of Chiral Ligands. Part III. A Novel Ligand for Asymmetric Allylation

Designed Using Computational Techniques. Org. Lett. 2002, 4, (25), 4391-4393.

30. Gennari, C.; Hewkin, C. T.; Molinari, F.; Bernardi, A.; Comotti, A.; Goodman, J.

M.; Paterson, I., The rational design of highly stereoselective boron enolates using

transition-state computer modeling: A novel, asymmetric anti aldol reaction for

ketones. J. Org. Chem. 1992, 57, (19), 5173-5177.

31. Deeth, R. J., Comprehensive molecular mechanics model for oxidized type I copper

proteins: Active site structures, strain energies, and entatic bulging. Inorg. Chem.

2007, 46, (11), 4492-4503.


32. Shi, S.; Yan, L.; Yang, Y.; Fisher-Shaulsky, J.; Thacher, T., An extensible and

systematic force field, ESFF, for molecular modeling of organic, inorganic, and

organometallic systems. J. Comput. Chem. 2003, 24, (9), 1059-1076.

33. Corbeil, C. R.; Englebienne, P.; Moitessier, N., Docking ligands into flexible and

solvated macromolecules. 1. Development and validation of FITTED 1.0. J. Chem.

Inf. Model. 2007, 47, (2), 435-449.

34. Branchadell, V., Density Functional Study of Diels-Alder Reactions Between

Cyclopentadiene and Substituted Derivatives of Ethylene. Int. J. Quantum Chem.

1997, 61, 381-388.

35. Rankin, K. N.; Gauld, J. W.; Boyd, R. J., Density functional study of the proline-

catalyzed direct aldol reaction. J. Phys. Chem. A 2002, 106, (20), 5155-5159.

36. Allemann, C.; Gordillo, R.; Clemente, F. R.; Cheong, P. H. Y.; Houk, K. N.,

Theory of asymmetric organocatalysis of aldol and related reactions:

Rationalizations and predictions. Acc. Chem. Res. 2004, 37, (8), 558-569.


37. Goto, H.; Osawa, E., Corner flapping: A simple and fast algorithm for exhaustive
generation of ring conformations. J. Am. Chem. Soc. 1989, 111, (24), 8950-8951.

CHAPTER 6


CHAPTER SIX

CONCLUSION, FUTURE WORK

AND CONTRIBUTIONS TO KNOWLEDGE

CONCLUSION

Virtual screening tools for molecular discovery are becoming ever more prevalent

to guide organic and medicinal chemists in their search for novel molecules. The ability

of these tools to produce results quickly and cheaply has led to their widespread
acceptance in the field of drug design and development, yet there is still a lack of these

tools for organic chemists. Two programs, one for virtual screening against biological

targets and the second for virtual screening of asymmetric catalysts, have been

developed, validated and applied.

Two major shortcomings of most docking programs are the assumptions that protein
flexibility and water molecules do not have a significant impact on docking accuracy.
FITTED1.0 has been developed to account for these phenomena. FITTED enables the use of
multiple protein input structures and, by means of a genetic algorithm, allows for the
flexibility of the protein under investigation. To account for displaceable bridging water
molecules, a switching function that turns off the interactions between a water molecule
and the ligand when they are too close has been implemented. Initial validation showed the
importance of including both protein flexibility and displaceable water molecules.
Application of FITTED to the docking of our test set resulted in 73% docking accuracy when
docking using flexible proteins.

When performing a virtual screen, the speed of the program is of utmost

importance. With the success of FITTED1.0, we further modified it to reduce the average

time required to perform a docking run. With the inclusion of filters to remove unwanted

compounds and interaction sites to aid in orienting the poses while docking, a significant

increase in the accuracy and speed of FITTED was seen. A virtual screening campaign

against HCV polymerase was undertaken with this enhanced version. Initial docking

validations showed that FITTED was able to accurately predict the binding pose of known

HCV polymerase inhibitors. The Maybridge virtual library, seeded with known

inhibitors, was then screened against HCV polymerase. Again, FITTED showed excellent


enrichment rates along with identifying two novel molecules of interest for the

pharmaceutical industry involved in this research.

After the success of the previous two versions, a comparative study was undertaken

to assess the effect of ligand and protein input conformation (which include the treatment

of bridging water molecules) on the accuracy of major docking programs, including

FITTED. This work showed that the accuracy of these programs is greatly affected by the
input information provided. When including multiple protein structures and bridging water
molecules, FITTED ranked second of these six major programs.

Considering the lack of virtual screening tools for organic chemists, we next turned

our attention to creating a tool to predict the stereoselectivities of asymmetric reactions.

ACE was developed and enabled the quick estimation of transition state forcefield

parameters through the linear combination of ground state interactions. ACE was validated

on two reactions and showed excellent correlation between observed and predicted

selectivities.

Overall, both programs exhibited excellent predictive power. By developing these

two new tools, we have provided a greener, safer and quicker alternative to experimental

screening to the scientific community.

FUTURE WORK

One of the advantages of writing your own program code is the ability to implement
new and exciting ideas into the program quite easily. One of the possible directions that
both programs can take is the ability not only to predict selectivity, be it a ligand’s
affinity for a protein or the stereoselectivity of a catalyst, but also to propose a better
binding ligand or to improve an existing catalyst. One of the major downfalls of de novo
prediction of new molecules is the consideration of synthetic accessibility.1 For an organic
chemist, synthetic accessibility is usually estimated by retrosynthetic analysis, but coding
a chemist’s knowledge, expertise and experience into a program is a difficult task.2-13
All of these tools require that knowledge of known reactions be programmed into the code;
using a new reaction therefore requires its addition to the reaction database. The
automatic creation of chemical reaction databases applied to the field of de novo design
is therefore needed.


Another possible direction of future work would be the combination of FITTED and
ACE to create a tool for biocatalysis. In the past year there has been much discussion of
the use of computational tools for the design of biocatalytic enzymes.14-20 The main issue
with these techniques is the use of QM/MM methods, which slows down the throughput
due to the necessity of correctly predicting the enzyme’s transition state. The combination
of FITTED and ACE, together with additional implementation, would enable the virtual
screening of biocatalytic enzymes.

CONTRIBUTIONS TO KNOWLEDGE

We have developed FITTED, a docking-based virtual screening program for
solvated and flexible proteins. It has been shown to predict the binding pose of
ligand-protein complexes with good accuracy. During the development of FITTED we
have shown the importance of protein flexibility and bridging water molecules, along with
the effect of ligand input conformation, for major docking programs and not only FITTED. We
also used FITTED to virtually screen the Maybridge library and identified two molecules
with IC50s less than 15 μM. We have also proposed that when conducting comparative
studies, one should consider cross-docking accuracies instead of self-docking accuracies
to better approximate a real case scenario in which the binding pose of a ligand is not known.
With the experience gained in developing FITTED, we created a new tool for the
prediction of stereoselectivities called ACE. This program has been developed with
ease of use for organic chemists as a main driving force.


REFERENCES

1. Gasteiger, J., De novo design and synthetic accessibility. J. Comput.-Aided Mol.

Des. 2007, 21, (6), 307-309.

2. Boda, K.; Seidel, T.; Gasteiger, J., Structure and reaction based evaluation of

synthetic accessibility. J. Comput.-Aided Mol. Des. 2007, 21, (6), 311-325.

3. Corey, E. J.; Wipke, W. T., Computer-Assisted Design of Complex Organic

Synthesis. Science 1969, 166, (3902), 178-192.

4. Corey, E. J.; Long, A. K.; Rubenstein, S. D., Computer-assisted analysis in organic

synthesis. Science 1985, 228, (4698), 408-418.

5. Wipke, W. T.; Gund, P., Simulation and evaluation of chemical synthesis.

Congestion: a conformation-dependent function of steric environment at a reaction

center. Application with torsional terms to stereoselectivity of nucleophilic

additions to ketones. J. Am. Chem. Soc. 1976, 98, (25), 8107-8118.

6. Wipke, W. T.; Ouchi, G. I.; Krishnan, S., Simulation and evaluation of chemical

synthesis--SECS: An application of artificial intelligence techniques. Artif. Intell.

1978, 11, (1-2), 173-193.

7. Wipke, W. T.; Rogers, D., Artificial Intelligence in organic synthesis, SST. Starting

Material Selection Strategies. An Application of Superstructure J. Chem. Inf.

Comput. Sci. 1984, 24, (2), 71-81.

8. Gelernter, H.; Rose, J. R.; Chen, C., Building and refining a knowledge base for

synthetic organic chemistry via the methodology of inductive and deductive

machine learning. J. Chem. Inf. Comput. Sci. 1990, 30, (4), 492-504.

9. Gelernter, H. L.; Sanders, A. F.; Larsen, D. L., Empirical explorations of

SYNCHEM. The methods of artificial intelligence are applied to the problem of

organic synthesis route discovery. Science 1977, 197, (4308), 1041-1049.

10. Satoh, H.; Funatsu, K., SOPHIA, a Knowledge Base-Guided Reaction Prediction

System - Utilization of a Knowledge Base Derived from a Reaction Database. J.

Chem. Inf. Comput. Sci. 1995, 35, (1), 34-44.

11. Gasteiger, J.; Jochum, C., EROS: A computer program for generating sequences of

reactions. Top. Curr. Chem. 1978, 74, 93-126.

12. Ugi, I.; Bauer, J.; Bley, K.; Dengler, A.; Dietz, A.; Fontain, E.; Gruber, B.; Herges,

R.; Knauer, M.; Reitsam, K.; Stein, N., Computer-assisted solution of chemical


problems - The historical development and the present state of the art of a new

discipline of chemistry. Angew. Chem. Int. Ed. 1993, 32, (2), 201-227.

13. Hanessian, S.; Franco, J.; Gagnon, G.; Laramee, D.; Larouche, B., Computer-

assisted analysis and perception of stereochemical features in organic molecules

using the CHIRON program. J. Chem. Inf. Comput. Sci. 1990, 30, (4), 413-425.

14. Jiang, L.; Althoff, E. A.; Clemente, F. R.; Doyle, L.; Röthlisberger, D.;
Zanghellini, A.; Gallaher, J. L.; Betker, J. L.; Tanaka, F.; Barbas III, C. F.; Hilvert,

D.; Houk, K. N.; Stoddard, B. L.; Baker, D., De novo computational design of

retro-aldol enzymes. Science 2008, 319, (5868), 1387-1391.

15. Marti, S.; Andres, J.; Moliner, V.; Silla, E.; Tunon, I.; Bertran, J., Computational

design of biological catalysts. Chem. Soc. Rev. 2008, 37, (12), 2634-2643.

16. Prather, K. L. J.; Martin, C. H., De novo biosynthetic pathways: rational design of

microbial chemical factories. Curr. Opin. Biotechnol. 2008, 19, (5), 468-474.

17. Ward, T. R., Artificial enzymes made to order: Combination of computational

design and directed evolution. Angew. Chem. Int. Ed. 2008, 47, (41), 7802-7803.

18. Chaput, J. C.; Woodbury, N. W.; Stearns, L. A.; Williams, B. A. R., Creating

protein biocatalysts as tools for future industrial applications. Ex. Op. Bio. Ther.

2008, 8, (8), 1087-1098.

19. Sterner, R.; Merkl, R.; Raushel, F. M., Computational Design of Enzymes. Chem.

Biol. 2008, 15, (5), 421-423.

20. Ghirlanda, G., Computational biochemistry: Old enzymes, new tricks. Nature 2008,

453, (7192), 164-166.


APPENDIX A


APPENDIX A

COPYRIGHT WAIVERS

Copyright waiver for chapter 2: Docking Ligands into Flexible and Solvated

Macromolecules. 1. Development and Validation of FITTED 1.0; and chapter 3: Docking

Ligands into Flexible and Solvated Macromolecules. 2. Development and Application of

Fitted 1.5 to the Virtual Screening of Potential HCV Polymerase Inhibitors.

American Chemical Society’s Policy on Theses and Dissertations

If your university requires a signed copy of this letter see contact information below.

Thank you for your request for permission to include your paper(s) or portions of text from your paper(s) in your thesis.

Permission is now automatically granted; please pay special attention to the implications paragraph below. The

Copyright Subcommittee of the Joint Board/Council Committees on Publications approved the following:

Copyright permission for published and submitted material from theses and dissertations

ACS extends blanket permission to students to include in their theses and dissertations their own articles, or

portions thereof, that have been published in ACS journals or submitted to ACS journals for publication, provided

that the ACS copyright credit line is noted on the appropriate page(s).

Publishing implications of electronic publication of theses and dissertation material

Students and their mentors should be aware that posting of theses and dissertation material on the Web prior to

submission of material from that thesis or dissertation to an ACS journal may affect publication in that journal.

Whether Web posting is considered prior publication may be evaluated on a case-by-case basis by the journal’s

editor. If an ACS journal editor considers Web posting to be “prior publication”, the paper will not be accepted

for publication in that journal. If you intend to submit your unpublished paper to ACS for publication, check with

the appropriate editor prior to posting your manuscript electronically.

If your paper has not yet been published by ACS, we have no objection to your including the text or portions of the text in

your thesis/dissertation in print and microfilm formats; please note, however, that electronic distribution or Web posting of

the unpublished paper as part of your thesis in electronic formats might jeopardize publication of your paper by ACS. Please

print the following credit line on the first page of your article: "Reproduced (or 'Reproduced in part') with permission from

[JOURNAL NAME], in press (or 'submitted for publication'). Unpublished work copyright [CURRENT YEAR] American

Chemical Society." Include appropriate information.

If your paper has already been published by ACS and you want to include the text or portions of the text in your

thesis/dissertation in print or microfilm formats, please print the ACS copyright credit line on the first page of your

article: “Reproduced (or 'Reproduced in part') with permission from [FULL REFERENCE CITATION.] Copyright

[YEAR] American Chemical Society." Include appropriate information.

Submission to a Dissertation Distributor: If you plan to submit your thesis to UMI or to another dissertation distributor,

you should not include the unpublished ACS paper in your thesis if the thesis will be disseminated electronically, until ACS

has published your paper. After publication of the paper by ACS, you may release the entire thesis (not the individual ACS

article by itself) for electronic dissemination through the distributor; ACS’s copyright credit line should be printed on the

first page of the ACS paper.

Use on an Intranet: The inclusion of your ACS unpublished or published manuscript is permitted in your thesis in print and

microfilm formats. If ACS has published your paper you may include the manuscript in your thesis on an intranet that is not

publicly available. Your ACS article cannot be posted electronically on a publicly available medium (i.e. one that is not

password protected), such as but not limited to, electronic archives, Internet, library server, etc. The only material from your

paper that can be posted on a public electronic medium is the article abstract, figures, and tables, and you may link to the

article’s DOI or post the article’s author-directed URL link provided by ACS. This paragraph does not pertain to the

dissertation distributor paragraph above.

Questions? Call +1 202/872-4368/4367. Send e-mail to [email protected] or fax to +1 202-776-8112. 10/10/03, 01/15/04, 06/07/06


Copyright waiver for chapter 5: Toward a Computational Tool Predicting the

Stereochemical Outcome of Asymmetric Reactions. 2. Development and Application of a

Rapid and Accurate Program Based on Organic Principles.


Copyright waiver for chapter 1.1: The challenge of modeling reality in the docking of

small molecules to biological targets and for Chapter 4: Docking Ligands into Flexible

and Solvated Macromolecules. 3. Impact of Input Ligand Conformation, Protein

Flexibility and Water Molecules on the Accuracy of Docking Programs.


APPENDIX B


SUPPORTING INFO FOR CHAPTER 2:


DOCKING LIGANDS INTO FLEXIBLE AND SOLVATED MACROMOLECULES. 1.

DEVELOPMENT AND VALIDATION OF FITTED 1.0

Table B.1 - HIV-1 protease mono-alcohol inhibitors.

PDB code    Ki
1b6l        5 nM
1eby        0.2 nM
1hpv        0.6 nM
1hpo        0.6 nM
1pro        0.005 nM

[Structure column: chemical structure drawings not recoverable from the extracted text.]

Table B.2 - HIV-1 protease diol inhibitors.

PDB code    Ki
1ajv        19.1 nM
1ajx        12.2 nM
1hvr        0.31 nM
1hwr        4.7 nM
1qbs        0.12 nM

[Structure column: chemical structure drawings not recoverable from the extracted text.]

Table B.3 - Thymidine kinase inhibitors.

PDB code    Ki
1e2k        11.4 µM
1e2p        27 µM
1ki3        -
1ki4        -
1ki7        -
1ki8        -
2ki5        -
1of1        4.1 µM
1qhi        -

[Structure column: chemical structure drawings not recoverable from the extracted text.]

Table B.4 - Factor Xa inhibitors.

PDB code    Ki
1ezq        0.9 nM
1f0r        22 nM
1fjs        0.11 nM
1nfu        1.3 nM
1xka        131 nM

[Structure column: chemical structure drawings not recoverable from the extracted text.]

Table B.5 - Trypsin inhibitors.

PDB code    Ki
1f0u        69 nM
1o2j        120 nM
1o3g        74 nM
1o3i        170 nM
1qbo        18 nM

[Structure column: chemical structure drawings not recoverable from the extracted text.]

Table B.6 - Stromelysin-1 inhibitors.

PDB code    Ki
1b8y        14 nM
1bwi        104 nM
1ciz        36 nM
1d8m        3.1 nM

[Structure column: chemical structure drawings not recoverable from the extracted text.]

Typical keyword files for FITTED and ProCESS are presented on the following pages. Most of the values used here are set to the default values.

################################################################################
#
#                                   FITTED
#
#        Flexibility Induced Through Targeted Evolutionary Description
#
#          Nicolas Moitessier, Christopher Corbeil, Pablo Englebienne
#
#                                 March 2006
#
################################################################################
#
# INPUT/OUTPUT FILES
################################################################################
#

Protein           9                       # Number of protein input files
                  1e2k_protein.mol2       # Names of the protein files
                  1e2p_protein.mol2
                  1ki3_protein.mol2
                  1ki4_protein.mol2
                  1ki7_protein.mol2
                  1ki8_protein.mol2
                  2ki5_protein.mol2
                  1of1_protein.mol2
                  1qhi_protein.mol2
Ligand            1e2k_lig.mol2           # Ligand structure file
Output            1e2k_run01              # File that will contain the output
Forcefield        emc.txt                 # Force field file
Active_site_cav   tk_grid.mol2            # File containing the sphere locations and sizes
Constraints       tk_cons.mol2            # Constraint file
Ref               1                       # Number of reference structures for RMSD calculations
                  1e2k_ligand_ref1.mol2   # Reference structure

#
# CONJUGATE GRADIENT PARAMETERS
################################################################################
#
# Creation of the initial population
#-------------------------------------------------------------------------------
#
GI_MaxInt          100       # Maximum number of iterations
GI_StepSize        0.002     # Step size used in the minimization process
GI_MaxStep         0.9       # Maximum step size
GI_MaxSameEnergy   3         # Max. number of times the same energy can appear
GI_MaxGrad         0.00001   # Maximum size of the gradient
GI_EnergyBound     0.0001    # Diff. in energy between two "similar" structures
#
# Evolution
#-------------------------------------------------------------------------------
#
GA_MaxInt          20        # Maximum number of iterations
GA_StepSize        0.002     # Step size used in the minimization process
GA_MaxStep         1         # Maximum step size
GA_MaxSameEnergy   3         # Max. number of times the same energy can appear
GA_MaxGrad         0.00001   # Maximum size of the gradient, Evolution
GA_EnergyBound     0.00001   # Diff. in energy between two "similar" structures

#
# ENERGY/SCORING PARAMETERS
################################################################################
#
# Energy parameters
#-------------------------------------------------------------------------------
#
Score_Initial       minimize  # Scoring of the input conformation
vdW                 1-4       # Consider 1,4 vdw interactions and greater
vdWScale_1-4        0.5       # Scaling factor for vdw 1,4 interactions
vdWScale_1-5        1.0       # Scaling factor for vdw 1,5+ interactions
E_vdWScale_Pro      2.0       # Scaling factor for vdw L,P interactions
E_vdWScale_Wat      2.0       # Scaling factor for vdw L,Water interactions
Elec                1-4       # Consider 1,4 electrostatics
ElecScale_1-4       0.25      # Scaling factor for L,L 1,4 electrostatics
ElecScale_1-5       0.5       # Scaling factor for L,L 1,5+ electrostatics
E_ElecScale_Pro     1.0       # Scaling factor for L,P interactions
E_ElecScale_Wat     1.0       # Scaling factor for L,Water interactions
HBond               Y         # Include hydrogen bonds
E_HBondScale_Pro    2.0       # Scaling factor for hydrogen bond term
E_HBondScale_Wat    2.0       # Scaling factor for hydrogen bond term
Cutdist             9.0       # Cutoff distance
switchdist          7.0       # Switching distance
Cutdist_Wat         1.75      # Cutoff distance for waters
Switchdist_Wat      1.20      # Switching distance for waters
Water_loss_energy   -0.5      # Penalty added if a water is displaced
#
# Scoring function parameters
#-------------------------------------------------------------------------------
#
weight_rot_bonds    0.14      # Penalty per frozen rotatable bond
S_vdWScale_Pro      0.175     # Scaling factor for vdw L,P interactions
S_ElecScale_Pro     0.064     # Scaling factor for L,P interactions
S_HBondScale_Pro    0.25      # Scaling factor for hydrogen bond term
S_vdWScale_Wat      0.175     # Scaling factor for vdw L,Water interactions
S_ElecScale_Wat     0.064     # Scaling factor for L,Water interactions
S_HBondScale_Wat    0.25      # Scaling factor for hydrogen bond term

#
# GENETIC ALGORITHM PARAMETERS
################################################################################
#
# Creation of the initial population
#-------------------------------------------------------------------------------
#
Pop_Size         100           # Population size
anchor_atom      15            # Ligand anchor atom
anchor_coor      -10 -9 -15    # Location of the center of the binding site
max_tx           3             # Maximum radius for translation
max_ty           3             # Maximum translation angle in x, y plane for rotation.
max_tz           3             # Maximum elevation angle for translation
max_rxy          30            # Maximum rotation of molecule in x
max_ryz          30            # Maximum rotation of molecule in y
max_rxz          30            # Maximum rotation of molecule in z
max_sc_PP        0.8           # Maximum distance for two atoms to be considered clashing
max_num_sc       5             # Maximum number of steric clashes of the side chains
GI_Minimized_E   1000          # Maximum potential energy to accept a pose
GI_Initial_E     10000000      # Maximum potential energy to start a minimization
#
# Evolution
#-------------------------------------------------------------------------------
#
flex             15            # Number of flexible side chains
                 HISD58        # Name of the flexible residues
                 ARG222
                 GLU83
                 ARG163
                 TYR101
                 ILE97
                 MET231
                 ARG176
                 TYR172
                 MET121
                 ILE100
                 TRP88
                 MET128
                 GLN125
                 MET85
max_gen          200           # Maximum number of generations
seed             15            # Seed number
resolution       7             # Torsion angle resolution for ligand
pCross           0.85          # Probability of crossover
pMut             0.02          # Probability of mutation
pMutRot          0.20          # Probability of mutation for the rotation in 3D space
pOpt             0.20          # Probability of energy minimization of the children
pLearn           0.1           # Probability of energy minimization of the parents

#
# Convergence and output
#-------------------------------------------------------------------------------
#
print_structures         final   # Which structures to print out
print_best_every_x_gen   5       # Intermediate statistical output
print_energy_full        no      # Detailed potential energy
converge_std_dev         0.2     # Standard deviation criterion
Number_of_best           20      # Number of best structures to be output
MaxSameEnergy_GA         100     # Number of unchanged lowest-in-energy structure
#
################################################################################
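The keyword file above follows a simple layout: a keyword, one or more values, and an optional `#` comment, with list-type keywords (Protein, Ref, flex) giving a count followed by one entry per line. The block below is a minimal Python sketch of a reader for that layout. It is not the FITTED parser: the function name `parse_keywords`, the `LIST_KEYWORDS` set, and the rule that each comment sits on the same line as its keyword are assumptions made for illustration.

```python
# Hypothetical reader for a FITTED-style keyword file (illustration only).
# Assumed: these keywords take a count, followed by that many one-per-line entries.
LIST_KEYWORDS = {"protein", "ref", "flex"}


def parse_keywords(text):
    """Return {keyword: [values as strings]} for a keyword file given as text."""
    params = {}
    pending = None  # (keyword, number of list entries still expected)
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue  # blank line or comment-only line
        tokens = line.split()
        if pending is not None:
            key, left = pending
            params[key].append(tokens[0])
            pending = (key, left - 1) if left > 1 else None
            continue
        key, values = tokens[0], tokens[1:]
        if key.lower() in LIST_KEYWORDS and values:
            count = int(values[0])
            params[key] = []
            pending = (key, count) if count > 0 else None
        else:
            params[key] = values
    return params


example = """
Protein 2               # Number of protein input files
1e2k_protein.mol2       # Names of the protein files
1e2p_protein.mol2
Ligand 1e2k_lig.mol2    # Ligand structure file
anchor_coor -10 -9 -15  # Location of the center of the binding site
pCross 0.85             # Probability of crossover
"""

params = parse_keywords(example)
print(params["Protein"])
print(params["anchor_coor"])
```

Values are kept as strings so that the caller decides which keywords are numeric; a real reader would also need to validate keyword names and handle missing entries.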

################################################################################
#
#                                   ProCESS
#
#                Protein Conformational Ensemble System Setup
#
#          Nicolas Moitessier, Christopher Corbeil, Pablo Englebienne
#
#                            Dept. of Chemistry
#
#                    McGill University, Montreal, Canada
#
#                                 Feb 2006
#
################################################################################
#
#
################################################################################

#
protein    9                  # Number of protein files
           1e2k_prot.mol2     # Name of the protein files
           1e2p_prot.mol2
           1ki3_prot.mol2
           1ki4_prot.mol2
           1ki7_prot.mol2
           1ki8_prot.mol2
           1of1_prot.mol2
           1qhi_prot.mol2
           2ki5_prot.mol2
rep_file   1e2k_prot.mol2     # Name of the reference structure (for atom sorting)
Output     tk                 # File that will contain the output structure
Grid       tk_grid            # File that will contain the grid

#
# PROTEIN DESCRIPTION
################################################################################
#
Find_Residues   name     # Find residues in protein mol2 by either number or name
Active_Site     15       # Number of flexible side chains
                HIS58    # Names of flexible residues
                ARG222
                GLU83
                ARG163
                TYR101
                ILE97
                MET231
                ARG176
                TYR172
                MET121
                ILE100
                TRP88
                MET128
                GLN125
                MET85

#
# ACTION
################################################################################
#
Assign_G   yes   # Assigns proper group names
Truncate   yes   # Truncates proteins keeping residues within cutoff distance
Cutoff     7     # Cutoff distance
United     yes   # Makes the united atom representation
Coarse     0     # Makes the coarse grained representation at indicated level

#
# GRID DESCRIPTION
################################################################################
#
Grid_boundary      hard                # Spheres make contact with the grid edges
grid_center        -10.8 -10.6 -13.3   # Center of the grid
grid_resolution    1.5                 # Grid resolution
grid_size          9 9 9               # Grid size
grid_clash         1.5                 # Maximum distance between grid point and protein or edge
grid_sphere_size   10                  # A sphere centered at grid_center truncates the apexes of the grid
#
################################################################################
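The grid keywords above describe a regular grid around `grid_center` whose corners are cut off by a sphere. The following Python sketch shows one way such a grid could be generated. It rests on assumptions not stated in the keyword file: `grid_size` is read as the number of points along each axis, `grid_sphere_size` as a radius in Å, and the protein-clash filter (`grid_clash`) is left out. The function name `build_grid` is hypothetical and this is not the ProCESS implementation.

```python
import itertools
import math


def build_grid(center, resolution, size, sphere_radius):
    """Regular grid centered on `center`, keeping points within `sphere_radius`.

    Assumed meaning of the arguments: `size` is the number of points per axis
    and `sphere_radius` is in the same length unit as `resolution`.
    """
    half = [(n - 1) / 2 for n in size]  # index of the central point on each axis
    points = []
    for idx in itertools.product(*(range(n) for n in size)):
        offset = [(i - h) * resolution for i, h in zip(idx, half)]
        if math.sqrt(sum(d * d for d in offset)) <= sphere_radius:
            points.append(tuple(c + d for c, d in zip(center, offset)))
    return points


# Values taken from the keyword file above.
grid = build_grid(center=(-10.8, -10.6, -13.3), resolution=1.5,
                  size=(9, 9, 9), sphere_radius=10.0)
print(len(grid), "grid points kept out of", 9 ** 3)
```

Under these assumptions the grid spans 6 Å on either side of the center, so only the eight corner points (about 10.4 Å away) fall outside the 10 Å sphere, which agrees with the keyword comment that the sphere truncates the apexes of the grid.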


APPENDIX C




DOCKING LIGANDS INTO FLEXIBLE AND SOLVATED MACROMOLECULES. 2.

DEVELOPMENT AND APPLICATION OF FITTED 1.5 TO THE

VIRTUAL SCREENING OF POTENTIAL HCV POLYMERASE

INHIBITORS.

Table C.1 - Self-docking - HIV-1 protease.

            Docking to protein + water molecule (b)
PDB code    Obs. water (c)    Ligand (d)    Pred. water (e)
1b6l        1                 0.36          1
1eby        1                 0.39          1
1hpo        0                 0.71          0
1hpv        1                 0.44          1
1pro        0                 0.18          0
1ajv        0                 0.43          0
1ajx        0                 0.31          0
1hvr        0                 0.51          0
1hwr        0                 0.32          0
1qbs        0                 5.42          0

(b) Water molecule known as Water 301 was retained and the function describing the interaction between ligand and water molecules is applied. (c) Water molecule observed or not in crystal structures: 1 and 0 define the presence or absence of the water molecule respectively. (d) RMSD (in Å): criterion of success of 2.0 Å. (e) Water molecules as proposed by FITTED. Bold numbers highlight failures.
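The success criterion used in these tables (ligand RMSD below 2.0 Å) can be illustrated with a short Python sketch. It assumes that both poses are given as lists of (x, y, z) coordinates in the same reference frame with a one-to-one atom correspondence, and it ignores symmetry-equivalent atoms; the function names are hypothetical and this is not the code used by FITTED. The RMSD values are those listed in Table C.1.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum((a - b) ** 2
                  for atom_a, atom_b in zip(coords_a, coords_b)
                  for a, b in zip(atom_a, atom_b))
    return math.sqrt(squared / len(coords_a))


def success_rate(rmsd_values, threshold=2.0):
    """Fraction of docking runs whose ligand RMSD is below the threshold (in Å)."""
    return sum(value < threshold for value in rmsd_values) / len(rmsd_values)


# Ligand RMSD values (Å) from Table C.1.
table_c1 = {"1b6l": 0.36, "1eby": 0.39, "1hpo": 0.71, "1hpv": 0.44, "1pro": 0.18,
            "1ajv": 0.43, "1ajx": 0.31, "1hvr": 0.51, "1hwr": 0.32, "1qbs": 5.42}

failures = [pdb for pdb, value in table_c1.items() if value >= 2.0]
print(f"success rate: {success_rate(list(table_c1.values())):.0%}, failures: {failures}")
```

For this table the only complex above the 2.0 Å threshold is 1qbs.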

Table C.2 - Self-docking - Thymidine kinase inhibitors.

PDB code    Obs. water molecules (c)    Ligand (d)    Pred. water molecules (e)
1e2k        1 0 1 1 1 1                 0.55          1 0 1 1 0 1
1e2p        1 0 1 1 1 1                 0.71          1 0 1 1 0 1
1ki3        0 1 0 0 0 1                 1.18          0 1 0 1 1 1
1ki4        1 0 1 1 1 1                 0.44          1 0 1 1 1 1
1ki7        1 0 1 1 1 0                 5.76          1 0 1 1 1 1
1ki8        1 0 1 1 1 0                 0.39          1 0 1 1 1 1
2ki5        0 1 1 1 1 1                 0.72          0 1 0 1 1 1
1of1        1 0 1 1 1 1                 0.40          1 0 1 1 1 1
1qhi        0 1 1 1 1 0                 0.66          0 1 0 1 1 1

(b) 2 to 6 water molecules (see text) were retained and the function describing the interaction between ligand and water molecules is applied. (c) Water molecules observed or not in crystal structures: 1 and 0 define the presence or absence of each water molecule respectively. (d) RMSD (in Å): criterion of success of 2.0 Å. (e) Water molecules as proposed by FITTED. Bold numbers highlight failures. (g) Two structures with similar energies were observed (RMSD = 2.03 and 0.53).
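The observed and predicted water columns of Table C.2 are presence/absence strings, so they can be compared position by position. The Python sketch below counts how many individual water molecules are assigned the observed state. The way the comparison is scored here (one point per matching position) is an assumption for illustration and may differ from how the percentages reported elsewhere in this appendix were computed.

```python
# (observed, predicted) presence/absence of six water molecules, from Table C.2.
table_c2 = {
    "1e2k": ("101111", "101101"),
    "1e2p": ("101111", "101101"),
    "1ki3": ("010001", "010111"),
    "1ki4": ("101111", "101111"),
    "1ki7": ("101110", "101111"),
    "1ki8": ("101110", "101111"),
    "2ki5": ("011111", "010111"),
    "1of1": ("101111", "101111"),
    "1qhi": ("011110", "010111"),
}


def count_matches(observed, predicted):
    """Number of positions where the predicted state equals the observed one."""
    if len(observed) != len(predicted):
        raise ValueError("bit strings must have the same length")
    return sum(o == p for o, p in zip(observed, predicted))


matches = sum(count_matches(obs, pred) for obs, pred in table_c2.values())
total = sum(len(obs) for obs, _ in table_c2.values())
fully_correct = [pdb for pdb, (obs, pred) in table_c2.items() if obs == pred]

print(f"{matches} of {total} water states reproduced")
print("complexes with every water state reproduced:", fully_correct)
```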


Table C.3 - Self-docking - Factor Xa, trypsin and MMP-3 inhibitors.

PDB code    Obs. water (c)    Ligand (d)    Pred. water (e)
1ezq        1 0               0.57          1 0
1f0r        1 1               0.77          0 0
1fjs        1 0               0.93          1 0
1nfu        0 0               1.20          0 0
1xka        1 0               1.22          1 0
1f0u        1 -               0.47          1 -
1o2j        1 -               0.76          1 -
1o3g        1 -               0.58          1 -
1o3i        1 -               0.49          1 -
1qbo        1 -               0.90          1 -
1b8y        - -               0.61          - -
1bwi        - -               0.98          - -
1ciz        - -               0.92          - -
1d8m        - -               0.71          - -

(b) None to 2 water molecules (see text) were retained and the function describing the interaction between ligand and water molecules is applied. (c) Water molecules observed or not in crystal structures: 1 and 0 define the presence or absence of each water molecule respectively. (d) RMSD (in Å): criterion of success of 2.0 Å. (e) Water molecules as proposed by FITTED. Bold numbers highlight failures.


Table C.4 - Docking to flexible proteins - HIV-1 protease inhibitors.

PDB code    Ligand (a)    Protein (b)    Water (c)    Ligand (a)    Protein (b)    Water
1b6l        0.45          0.00           1            0.34          0.73           1
1eby        0.58          0.00           1            0.79          1.37           1
1hpo        1.53          0.99           0            1.18          1.45           0
1hpv        0.83          1.00           1            0.87          1.12           1
1pro        0.67          0.72           0            0.63          1.15           0
1ajv        0.57          0.00           0            0.60          0.50           0
1ajx        0.67          0.69           0            0.55          0.67           0
1hvr        0.64          0.00           0            0.57          0.31           0
1hwr        0.45          0.81           0            0.44          0.84           0
1qbs        (5.33)        0.72           0            5.25          0.67           0

(b) RMSD (in Å): criterion of success: better than average RMSD; average RMSD between protein structures computed on the binding site residues: 0.91 Å for the first five structures (one Asp 25 protonated) and 0.77 Å for the last five structures (AspA25 and AspB25 protonated). (c) Water molecules as proposed by FITTED; 1 and 0 define the presence or absence of the water molecule respectively. Bold numbers highlight failures. (d) Score in arbitrary units.


Table C.5 - Docking to flexible proteins - thymidine kinase inhibitors.

Docking to semi-flexible protein

PDB code    Lig (a)    Pro (b)    Occurrence of water mol. (c)
1e2k        0.54       0.00       1 0 1 1 0 1
1e2p        0.57       0.00       1 0 1 1 1 1
1ki3        1.29       0.00       0 1 0 1 1 1
1ki4        0.52       0.00       1 0 1 1 1 1
1ki7        5.75       0.83       0 1 0 1 1 1
1ki8        0.58       0.96       1 0 1 1 1 1
2ki5        0.59       0.89       0 1 0 1 1 1
1of1        0.34       0.26       1 0 1 1 1 1
1qhi        0.72       0.00       0 1 0 1 1 1

Docking to fully flexible protein

PDB code    Lig (a)    Pro (b)    Occurrence of water mol. (c)
1e2k        0.71       0.99       1 0 1 1 1 1
1e2p        0.78       0.74       1 0 1 1 1 1
1ki3        3.01       0.71       1 0 1 1 1 1
1ki4        0.42       0.96       1 0 1 1 1 1
1ki7        5.69       0.65       0 1 1 1 1 1
1ki8        0.48       0.43       1 0 1 1 1 1
2ki5        (1.60)     1.06       1 0 1 1 0 1
1of1        0.52       0.95       1 0 1 1 1 1
1qhi        (0.65)     0.74       0 1 0 1 1 1

binding site residues: 0.92 Å. (c) Water molecules as proposed by FITTED; 1 and 0 define the presence or absence of the water molecules respectively. Bold numbers highlight failures.


Table C.6 - Docking to flexible proteins - Factor Xa, trypsin and MMP-3 inhibitors.

PDB code    Ligand (a)    Protein (b)    Water (c)    Ligand (a)    Protein (b)    Water (c)
1ezq        0.59          0.00           1 0          0.49          0.59           0 0
1f0r        2.24          0.55           0 0          0.53          0.27           1 1
1fjs        2.71          0.90           1 1          2.51          0.62           1 1
1nfu        0.77          0.00           0 0          0.85          0.65           0 0
1xka        8.17          1.01           1 1          2.01          0.83           1 1
1f0u        0.56          0.00           1 -          0.43          0.40           1 -
1o2j        1.02          0.53           1 -          0.80          0.40           1 -
1o3g        0.78          0.00           1 -          0.93          0.50           1 -
1o3i        0.80          0.00           1 -          0.77          0.65           1 -
1qbo        0.53          0.00           1 -          0.58          0.63           1 -
1b8y        0.69          0.45           - -          0.67          0.64           - -
1bwi        0.63          0.00           - -          0.95          0.55           - -
1ciz        0.64          0.45           - -          0.92          0.89           - -
1d8m        1.10          0.00           - -          0.93          1.24           - -

binding site residues: 0.92 Å. (c) Water molecules as proposed by FITTED; 1 and 0 define the presence or absence of the water molecules respectively. Bold numbers highlight failures.


Table C.7 - Docking accuracy - FITTED 1.0 vs. FITTED 1.5.

           Rigid                  Semi-flexible                      Fully-flexible
Version    Lig (a)   Water (c)    Lig (a)   Pro (b)   Water (c)      Lig (a)   Pro (b)   Water (c)
1.0        76%       84%          73%       58%       86%            73%       71%       83%
1.5        94%       91%          84%       84%       82%            88%       78%       75%

binding site residues: 0.92 Å. (c) Water molecules successfully predicted (absence or presence) by FITTED.


APPENDIX D




DOCKING LIGANDS INTO FLEXIBLE AND SOLVATED MACROMOLECULES. 3.

IMPACT OF INPUT LIGAND CONFORMATION, PROTEIN

FLEXIBILITY AND WATER MOLECULES ON THE ACCURACY OF

DOCKING PROGRAMS

Table D.1 - Accuracy of the 6 docking programs using various conditions and self-

docking experiments with dry protein.

Self Docking Dry < 2 Å

Crystal Generated Omega +Rings

eHiTS acc . 3 91 77

eHiTS acc. 6 93 80

FITTED Dock 58 59 59

FITTED VS 56 57 54

Flexx CS 56 48 52

Flexx CS SIS 60 50 51

Flexx FS 60 46 54

Flexx FS SIS 54 51 49

Flexx PLP 50 35 45

Flexx PLP SIS 46 41 46

Flexx SS 60 46 54

Flexx SS SIS 54 51 49

Glide HTVS 73 54 56

Glide SP 73 56 58

Glide XP 74 58 60

Glide HTVS Refined Protein 80 58 60

Glide SP Refined Protein 80 59 60

Glide XP Refined Protein 79 67 67

GOLD CS 73 64 65

GOLD GS 64 53 55


GOLD CS - Robust 70 58 65

GOLD GS - Robust 63 48 57

Surflex 62 47 51

Surflex - pgeom 73 50 66

Surflex - hprot 62 50 59

Surflex - hprot - pgeom 80 69 72

Surflex - popt 65 47 55

Surflex - popt - pgeom 72 57 59

Table D.2 - Accuracy of the 6 docking programs using various conditions and self-

docking experiments with proteins with waters.

Self Docking with Waters < 2 Å

Explicit Waters

Displ. Ensemble Displ. Waters

eHiTS acc . 3 76 79

eHiTS acc. 6 81 82

FITTED Dock 62 61 64

FITTED VS 62 54 64

Flexx CS 50 52 43

Flexx CS SIS 53 53 54

Flexx FS 49 51 46

Flexx FS SIS 49 53 48

Flexx PLP 38 37 39

Flexx PLP SIS 41 38 42

Flexx SS 49 51 46

Flexx SS SIS 49 53 48

Glide HTVS 58 58

Glide SP 64 63

Glide XP 62 63

Glide HTVS Refined Protein 64 62

Glide SP Refined Protein 59 60

Glide XP Refined Protein 68 68

GOLD CS 65 68 63

GOLD GS 54 52 55

GOLD CS - Robust 64 59 62

GOLD GS - Robust 51 53 52

Surflex 38 49

Surflex - pgeom 42 51

Surflex - hprot 42 59

Surflex - hprot - pgeom 56 69


Surflex - popt 43 50

Surflex - popt - pgeom 47 58

Table D.3 - Accuracy of the 6 docking programs using various conditions and cross-

docking experiments with dry proteins.

Cross and Flexible Docking Dry < 2.25

Cross-

Docking Conformational

Ensemble

Flexible Protein (“Semi”/”Fully”

)

eHiTS acc . 3 64 80

eHiTS acc. 6 70 71

FITTED Dock 40 50 48/48

FITTED VS 37 48 44/44

Flexx CS 24 43

Flexx CS SIS 27 43

Flexx FS 20 28

Flexx FS SIS 25 38

Flexx PLP 16 24

Flexx PLP SIS 21 28

Flexx SS 20 28

Flexx SS SIS 25 38

Glide HTVS 29 44

Glide SP 31 43

Glide XP 31 46

Glide HTVS Refined Protein 35 50

Glide SP Refined Protein 34 48

Glide XP Refined Protein 36 50

GOLD CS 45 38

GOLD GS 38 44

GOLD CS - Robust 44 40

GOLD GS - Robust 33 33

Surflex 29 37



Surflex - hprot - pgeom 30 47

Surflex - popt 27 38



Table D.4 - Accuracy of the 6 docking programs using various conditions and cross-docking experiments with proteins with waters.

Cross and Flexible Docking wet < 2.25

Cross-

Docking Conformational

Ensemble Flexible Protein (“Semi”/”Fully”)

eHiTS acc . 3 64 79

eHiTS acc. 6 69 71

FITTED Dock 39 50 50/52

FITTED VS 35 47 48/44

Flexx CS 27 45

Flexx CS SIS 24 41

Flexx FS 20 31

Flexx FS SIS 27 43

Flexx PLP 16 29

Flexx PLP SIS 21 30

Flexx SS 20 31

Flexx SS SIS 27 30

Glide HTVS 33 49

Glide SP 36 49

Glide XP 35 49

Glide HTVS Refined Protein 33 54

Glide SP Refined Protein 33 52

Glide XP Refined Protein 37 54

GOLD CS 42 41

GOLD GS 34 36

GOLD CS - Robust 46 41

GOLD GS - Robust 36 39

Surflex 28 36



Surflex - hprot - pgeom 28 47

Surflex - popt 32 38


APPENDIX E



TOWARD A COMPUTATIONAL TOOL PREDICTING THE

STEREOCHEMICAL OUTCOME OF ASYMMETRIC REACTIONS. 2.

DEVELOPMENT AND APPLICATION OF A RAPID AND ACCURATE

PROGRAM BASED ON ORGANIC PRINCIPLES.

EXPERIMENTAL

All structures were drawn within InsightII, charged using the ESFF force field and
saved in mol2 format. These structures were next prepared for use with ACE. This step
is done automatically using SMART, a module of ACE, which assigns MM3 atom
types and identifies “rotatable” bonds and rings. These files were next used with ACE, which
automatically performs the entire run (transition state definition from reactants and
products, conformational search, energy calculation, conjugate gradient energy
minimization). The MM3-derived energies of the modeled TSs are then output
together with the optimized structures. The ACE and SMART executables are available
free of charge to academics.
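Since ACE outputs an energy for each modeled TS, a predicted selectivity can be obtained by comparing the energies of the TSs leading to each product. The Python sketch below uses a generic Boltzmann weighting for that comparison. It is an illustration of the general idea and not necessarily the procedure implemented in ACE; the function names, the temperature, and the energies in the example are hypothetical placeholders, not results from this work.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def boltzmann_fraction(energies_a, energies_b, temperature=298.15):
    """Fraction of product A, given TS energies (kcal/mol) leading to A and to B."""
    rt = R_KCAL * temperature
    reference = min(energies_a + energies_b)  # shift energies for numerical stability
    weight_a = sum(math.exp(-(e - reference) / rt) for e in energies_a)
    weight_b = sum(math.exp(-(e - reference) / rt) for e in energies_b)
    return weight_a / (weight_a + weight_b)


def ddg_from_fraction(fraction_a, temperature=298.15):
    """Free energy difference G(A) - G(B), in kcal/mol, implied by a fraction of A."""
    return -R_KCAL * temperature * math.log(fraction_a / (1.0 - fraction_a))


# Hypothetical TS energies (kcal/mol), invented for illustration only.
ts_to_product_a = [10.0, 11.5]
ts_to_product_b = [11.0]

fraction_a = boltzmann_fraction(ts_to_product_a, ts_to_product_b)
print(f"predicted fraction of A: {fraction_a:.2f}")
print(f"corresponding energy difference: {ddg_from_fraction(fraction_a):.2f} kcal/mol")
```

With a single TS per product, the second function simply recovers the energy gap between the two TSs, which is a convenient consistency check.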

For the Diels-Alder reaction, only structures leading to the endo product were
investigated. In practice, the endo adducts are the major isomers observed experimentally
with the investigated auxiliaries. In all cases, two products and two reactants were drawn,
considering both possible diastereomeric outcomes of an endo attack.


The xyz coordinates of the Diels-Alder transition state predicted by ACE for entry 16
in Table 2.

[Scheme: reactants 1a and 2a with Et2AlCl; structure drawings not recoverable from the extracted text.]

Al 2.0634 0.9863 -10.9626

C 1.8520 1.4088 -12.8730

H 2.4556 2.2144 -13.0919

H 0.8801 1.7130 -13.0285

C 1.5955 2.4373 -9.7165

H 0.6656 2.2335 -9.3227

H 1.5048 3.3096 -10.2568

C 1.6209 -1.6380 -10.2357

C 3.9636 -0.8343 -10.3641

C 5.0941 -2.6913 -9.7039

H 5.8008 -3.2594 -10.1864

H 5.3276 -2.7237 -8.7009

C 3.6296 -3.1380 -9.9682

H 3.2701 -3.5887 -9.1131

C 3.5065 -4.0980 -11.1720

H 2.5054 -4.2671 -11.3509

C 4.1300 -5.4741 -10.8654

H 3.7210 -5.8791 -10.0113

H 3.9541 -6.1317 -11.6386

H 5.1481 -5.4096 -10.7341


C 4.0831 -3.5207 -12.4814

H 3.9319 -4.1759 -13.2617

H 3.6221 -2.6347 -12.7283

H 5.0946 -3.3483 -12.4103

C 0.7409 -2.7769 -9.8094

H 1.1692 -3.7749 -9.7120

C -0.6222 -2.7158 -10.1443

H -1.1231 -3.6633 -10.3648

H -1.0045 -1.8377 -10.6734

C 2.6431 2.5860 -8.6070

H 2.3812 3.3422 -7.9587

H 2.7272 1.7125 -8.0682

H 3.5654 2.8090 -9.0075

C 2.1990 0.2054 -13.7572

H 1.5853 -0.5927 -13.5402

H 2.0825 0.4411 -14.7531

H 3.1755 -0.0843 -13.6062

O 3.8299 0.3394 -10.6093

O 1.0690 -0.5977 -10.5226

N 3.0146 -1.8152 -10.1981

O 5.2029 -1.3320 -10.1378

C -1.5294 -2.4255 -8.3555

H -2.5917 -2.6410 -8.5044

C -0.5912 -3.3758 -7.6425

H -0.8552 -3.5014 -6.6534

H -0.5227 -4.3111 -8.0692

C 0.6491 -2.5189 -7.7908

H 1.6137 -2.8160 -7.3703


C -1.0712 -1.1396 -8.0856

H -1.6609 -0.2166 -8.2113

C 0.2164 -1.1969 -7.7348

H 0.8529 -0.3270 -7.5044
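A coordinate listing of this kind (element symbol followed by x, y and z in Å) can be read and inspected with a few lines of Python. The sketch below parses such lines and measures two carbon-carbon separations. The four atoms are copied from the listing above; they were chosen here because their separations are close to 2 Å, which is what one might expect for the forming bonds, but the listing itself does not label them, so that assignment is an assumption. The function name `parse_xyz_lines` is hypothetical.

```python
import math


def parse_xyz_lines(text):
    """Return a list of (element, (x, y, z)) from 'El x y z' lines; skip other lines."""
    atoms = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue
        try:
            xyz = tuple(float(value) for value in parts[1:])
        except ValueError:
            continue  # not a coordinate line
        atoms.append((parts[0], xyz))
    return atoms


# Four carbon atoms copied from the transition state listing above.
selection = """
C -0.6222 -2.7158 -10.1443
C -1.5294 -2.4255 -8.3555
C 0.7409 -2.7769 -9.8094
C 0.6491 -2.5189 -7.7908
"""

atoms = parse_xyz_lines(selection)
d_first = math.dist(atoms[0][1], atoms[1][1])
d_second = math.dist(atoms[2][1], atoms[3][1])
print(f"C...C separations: {d_first:.2f} Å and {d_second:.2f} Å")
```

`math.dist` requires Python 3.8 or later.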

For the aldol reaction, both enolate possibilities (the two methyl groups C1 and

C2 are equivalent but do not interchange during the conformational search as the proline

nitrogen N1 is doubly bonded to the carbon C3 in the product) and 4 products (2 for each

diastereomeric possibility) were considered. This allowed us to consider all possible

outcomes for the different enolate possibilities.

[Scheme showing the atom labels C1, C2, C3 and N1; structure drawing not recoverable from the extracted text.]

The xyz coordinates of the aldol transition state predicted by ACE for the reaction of
acetone with acetaldehyde in the presence of proline (entry 17 in Table 4).

[Scheme: aldol reaction with compound labels 6a and 5c4; structure drawings not recoverable from the extracted text.]

N 2.3242 -0.4309 -0.0200

C 1.7969 0.8830 -0.4502

H 0.7710 0.8985 -0.3684

C 2.1884 1.2606 -1.8748

O 2.2783 0.2458 -2.6967


H 3.0048 -0.9521 -2.2561

O 1.8634 2.4146 -2.2262

C 2.4191 1.8258 0.5854

H 2.4177 2.8146 0.2916

H 1.8993 1.7757 1.4751

C 3.8281 1.2572 0.7645

H 4.4494 1.6119 0.0210

H 4.2558 1.5263 1.6635

C 3.6495 -0.2622 0.6270

H 4.3799 -0.6931 0.0402

H 3.6295 -0.7312 1.5437

C 1.7336 -1.5744 -0.2667

C 0.3506 -1.6485 -0.8689

H 0.3551 -1.2719 -1.8273

H 0.0085 -2.6174 -0.9079

H -0.3165 -1.1056 -0.3051

C 2.4508 -2.8408 -0.1536

H 3.2958 -2.8163 0.4961

H 1.8200 -3.6723 0.0681

C 3.0974 -3.1365 -1.7928

H 2.2518 -3.2755 -2.4430

C 3.9336 -4.4119 -1.7820

H 4.7303 -4.3142 -1.1372

H 3.3647 -5.2163 -1.4823

H 4.3006 -4.6039 -2.7248

O 3.8119 -2.0472 -2.1381


Figure E.1 - Data represented as Entry # versus ΔΔG (see Figure 2 in the publication for
another representation of the data). Predicted (grey) vs. observed (black).

[Bar chart not recoverable from the extracted text; horizontal axis from -2 to 2, entries 1 to 43.]


Figure E.2 - Data represented as Entry # versus ΔΔG (see Figure 4 in the publication
for another representation of the data). Predicted (grey) vs. observed (black).

[Bar chart not recoverable from the extracted text; horizontal axis from -2.0 to 2.0, entries 1 to 39.]


APPENDIX F

FITTED2.6 USER MANUAL


Nicolas Moitessier, Christopher Corbeil, Pablo Englebienne

Department of Chemistry, McGill University

Montréal, Québec, Canada

FITTED 2.6

User Guide


Table of Contents

I. Preface
    II.1. Conventions used in this guide
    II.2. Acknowledgements

II. FITTED: Theory and implementation
    II.2. Overview
    II.3. ProCESS
        II.3.1. Initial setup
        II.3.2. Definition of the binding site and creation of the XXXX_site.txt file
        II.3.3. Creation of the XXXX_dock.mol2 file
        II.3.4. Binding site cavity generation
        II.3.5. Interaction Sites/Pharmacophore generation and creation of the XXXX_IS.mol2 file
        II.3.6. Solvation and creation of the XXXX_score.mol2 file
    II.4. SMART
        II.4.1. SMART input
        II.4.2. Partial atomic charges
        II.4.3. The bit string description
    II.5. FITTED
        II.5.1. FITTED modes
        II.5.2. Protein flexibility
        II.5.3. Covalent docking
        II.5.4. FITTED scoring functions
        II.5.5. Genetic algorithm
    II.6. References

III. Installation
    III.2. The FITTED, ProCESS and SMART folders

IV. Getting started with FITTED
    IV.2. Setting up the system
    IV.3. Running the FITTED suite

V. Preparing a keyword file for FITTED
    V.2. Input/output files
    V.3. Run parameters
    V.4. Filtering parameters
    V.5. Conjugate gradient parameters
    V.6. Energy parameters
    V.7. Scoring parameters
    V.8. Initial population parameters

V.9. Evolution parameters ............................................................................................................ - 280 -

APPENDIX F

V.10. Docking of covalent inhibitors

V.11. A typical FITTED keyword file

VI. Preparing a keyword file for ProCESS

VI.2. Input/output files

VI.3. Reading the input files and preparing the output protein files

VI.4. Parameters for the binding cavity file

VI.5. Parameters for the Interaction Sites file

VI.6. A typical ProCESS keyword file

VII. Running SMART

Appendix A: FITTED Input File Formats
A.1. The protein files
A.1.1. The XXXX_dock.mol2
A.1.2. The XXXX_score.mol2 file
A.1.3. The XXXX_site.txt file
A.2. The Ligand file
A.3. The binding site cavity file
A.4. The interaction site, pharmacophore and XXXX_IS.mol2 files
A.5. The force field file
A.5.1. Adding parameters to the bond list
A.5.2. Adding parameters to the angle list
A.5.3. Adding parameters to the torsion list
A.5.4. Adding parameters to the out-of-plane list
A.5.5. Adding parameters to the van der Waals list
A.5.6. Adding parameters to the hydrogen bond list
A.5.7. Adding parameters to the bond charge increment list
A.5.8. Adding parameters to the partial bond charge increment / formal charge adjustment factor list

Appendix B: FITTED errors and warnings

Appendix C: ProCESS errors and warnings

Appendix D: SMART errors and warnings

Appendix E: Functional group definitions

Appendix F: Additional keywords for FITTED
F.1. Input/output files
F.2. Run parameters
F.3. Filtering parameters
F.4. Conjugate gradient parameters
F.5. Energy parameters
F.6. Scoring parameters
F.7. Initial population parameters
F.8. Evolution parameters

F.9. Output/convergence parameters
F.10. Docking of covalent inhibitors

Appendix G: Additional Keywords for ProCESS
G.1. Input/output files
G.2. Reading the input files and preparing the output protein files
G.3. Parameters for the binding cavity file
G.4. Parameters for the Interaction sites file

I. Preface

II.1. Conventions used in this guide

This guide describes a suite of programs that can be used either interactively or through command-line arguments. FITTED and ProCESS also require a set of commands supplied in a keyword file, a standard ASCII text file whose instructions follow the form Keyword Option, usually one per line, although in some cases a keyword may span multiple lines.

In the remainder of the manual, different typefaces will be used to symbolize the following:

Filenames and command-line input: constant-width font, standard face. Examples: ligand.mol2, keyword.txt, smart -fitted ligands.mol2

Keyword names: constant-width font, bold face. Examples: Protein, Mode, AutoFind_Site

Keyword options: constant-width font, italic face. Examples: 1a46.mol2, Docking, Yes

Please note that the formatting is for clarity of the manual only as it is not possible to format an ASCII file with different typefaces.
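The Keyword Option convention can be illustrated with a minimal Python sketch of a reader for such a file. This is not the parser used by FITTED or ProCESS; the function name parse_keyword_file is hypothetical, the pairing of keywords and options in the sample is assumed from the examples above, and keywords spanning multiple lines are deliberately not handled because their layout is not described here.

```python
def parse_keyword_file(text):
    """Read 'Keyword Option' lines into a dict (single-line keywords only)."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        parts = line.split(None, 1)  # keyword, then the rest of the line
        keyword = parts[0]
        option = parts[1].strip() if len(parts) > 1 else ""
        settings[keyword] = option
    return settings


# Hypothetical keyword file content built from the example names above.
example = """Protein 1a46.mol2
Mode Docking
AutoFind_Site Yes
"""
settings = parse_keyword_file(example)
print(settings)
```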

II.2. Acknowledgements

Over the last few years, the development of FITTED has been funded by ViroChem Pharma (research grants) and the Canadian Institutes of Health Research (CIHR Operating grants) while the development of ACE has been funded by the Natural Sciences and Engineering Research Council (NSERC discovery grant). These partners are warmly acknowledged. More recently, the "Ministère du Développement Économique, de l'Innovation et de l'Exportation du Québec" has recognized the potential of our drug discovery platform and granted us funding for further development and commercialization as part of a program called "Soutien à la maturation technologique". Jeremy Schwartzentruber (code optimization) and Devin Lee (comparative study) are also acknowledged.

II. FITTED: Theory and implementation

II.1. Overview

Docking methods are computational techniques that are able to predict the binding mode of ligands (e.g., enzyme inhibitors, receptor antagonists) to their biological targets.[1] Combined with a scoring function, these methods can be used to screen large libraries of compounds for their affinities for a specific pharmaceutically relevant target. FITTED 2.6 is a suite of programs (FITTED, ProCESS and SMART) for the docking of small-molecule ligands onto flexible proteins.

ProCESS is a module used in the preparation of the protein input files for FITTED. ProCESS (i) can truncate the protein to reduce its structure strictly to the binding site; (ii) assigns atom types and partial charges; (iii) checks the protein files for consistency if more than one is used; (iv) pre-computes the atomic solvation data; (v) identifies the binding site cavity and the interaction sites and prepares the files describing them.

FITTED also includes SMART, a module used to prepare the ligand input files. SMART (i) assigns atom types; (ii) identifies rotatable bonds (e.g., C-C single bonds); (iii) applies descriptors and derives a bit string describing the chemical features and properties of the compound; and (iv) assigns MMFF partial atomic charges.

The main module (FITTED) docks the ligands into the flexible protein in the presence of displaceable water molecules.

A complete description of the theory and modules of FITTED 1.0 and 1.5 can be found in references 2 and 3. The current version of the suite (FITTED 2.6) retains the earlier features and adds the following, which will be reported shortly [4]:

Automatic interaction site generation,

Automatic binding site identification,

Interaction site / force field consensus docking,

Pharmacophore oriented docking,

Force field parameter estimator,

Generalized AMBER force field (GAFF) implementation,

Improved scoring function (RankScore 2),

Filters for drug-like molecules,

Improved atom typing,

Semi-flexible protein docking with flexible waters,

United atom representation of the ligand protein non-bonded interactions,

New pElite genetic operator,

New evolution type: Metropolis,

Improved conjugate gradient minimization algorithm,

Covalent inhibitor docking,

Docking of ligands to DNA/RNA.

Matching algorithm for orienting individuals in the initial population.

Descriptions of the concepts of displaceable water molecules, flexible proteins, and pharmacophore-oriented docking can be found in references [2]-[6]. Applications of FITTED can be found in references [3], [7] and [8], while the development of the RankScore scoring function used by FITTED is described in references [6], [10] and [11]. Docking of covalent inhibitors is yet to be reported [12].

II.2. ProCESS

ProCESS is the module used to prepare the receptor(s) for FITTED, generating the various descriptions of the binding site of the receptor(s). Herein, enzymes, cell-membrane receptors and nucleic acids are collectively referred to as receptors. Most commonly, the receptor will be a protein, although ProCESS and FITTED can now handle nucleic acids (see Table 1b). For more information on the setup of the individual protein files before using ProCESS, see section IV.1, Setting up the system.

II.2.1. Initial setup

ProCESS requires all input proteins (if running any of the flexible modes) to be similar. For two proteins to be considered similar, they must have the same number of atoms, the same residue naming and have the same atoms within each residue (including atom names).

The order of the atoms in the input file does not have to coincide for all proteins. The first step done by ProCESS involves sorting the protein atoms to have the same order as the first protein listed in the keyword file. ProCESS can sort the atoms within a residue if all the atoms of that residue are listed contiguously. If the atoms of a residue are listed in multiple locations (e.g., all heavy atoms first, then all hydrogen atoms), ProCESS will issue an error. If ProCESS cannot find a residue from the reference file in another protein it will output an error message stating which residue cannot be found. If this occurs, rename the residue in the file giving the error.
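The similarity and ordering rules above can be sketched in Python as follows. This is an illustration only, not the ProCESS implementation: atoms are represented as hypothetical (residue_id, atom_name, xyz) tuples, and the function names and error messages are invented for the example.

```python
def group_by_residue(atoms):
    """Group atom records by residue, requiring each residue to be contiguous."""
    blocks = {}
    previous = None
    for record in atoms:
        residue_id, atom_name = record[0], record[1]
        if residue_id != previous and residue_id in blocks:
            raise ValueError(f"atoms of residue {residue_id} are not contiguous")
        blocks.setdefault(residue_id, {})[atom_name] = record
        previous = residue_id
    return blocks


def sort_like_reference(reference, other):
    """Reorder the atoms of `other` to follow the atom order of `reference`."""
    if len(reference) != len(other):
        raise ValueError("proteins have a different number of atoms")
    ref_blocks = group_by_residue(reference)
    other_blocks = group_by_residue(other)
    sorted_atoms = []
    for residue_id, ref_atoms in ref_blocks.items():
        if residue_id not in other_blocks:
            raise ValueError(f"residue {residue_id} not found")
        other_atoms = other_blocks[residue_id]
        if set(other_atoms) != set(ref_atoms):
            raise ValueError(f"residue {residue_id} has different atom names")
        # Take the other protein's records in the reference atom order.
        sorted_atoms.extend(other_atoms[name] for name in ref_atoms)
    return sorted_atoms


reference = [("ALA1", "N", (0.0, 0.0, 0.0)), ("ALA1", "CA", (1.0, 0.0, 0.0)),
             ("SER2", "N", (2.0, 0.0, 0.0)), ("SER2", "OG", (3.0, 0.0, 0.0))]
other = [("SER2", "OG", (3.1, 0.0, 0.0)), ("SER2", "N", (2.1, 0.0, 0.0)),
         ("ALA1", "CA", (1.1, 0.0, 0.0)), ("ALA1", "N", (0.1, 0.0, 0.0))]
reordered = sort_like_reference(reference, other)
print([(residue, name) for residue, name, _ in reordered])
```

In this sketch a residue whose atoms are split across several locations raises an error, mirroring the behaviour described above.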

Charge assignment, atom typing and interaction site generation by ProCESS require hydrogen atom names to follow the IUPAC recommendations in section 2.1.1 of Pure Appl. Chem. 70:117 (1998). Residue names must use the standard three-letter code or an extended four-letter naming (see Tables 1a and 1b).

Table 1a - List of acceptable residue names for protein ProCESS input

Amino acid                    Mid chain  Other accepted names  N-terminal  C-terminal
Alanine                       ALA        -                     ALAN        ALAN
Arginine                      ARG        -                     ARGN        ARGN
Asparagine                    ASN        -                     ASNN        ASNN
Neutral aspartic acid         ASPH       ASH, ASZ              -           -
Aspartic_acid                 ASP        -                     ASPN        ASPN
Cysteine in disulfide bridge  CYS        -                     -           -
Cysteine                      CYSH       CYS                   CYSHN       CYSHN
Glutamine                     GLN        -                     GLNN        GLNN
Glutamic_acid                 GLU        -                     GLUN        GLUN
Neutral glutamic acid         GLUH       GLU                   -           -
Glycine                       GLY        -                     GLYN        GLYN
Histidine epsilon             HISE       HIS, HIE              HISEN       HISEN
Histidine protonated          HISP       HIS, HIP              HISPN       HISPN
Histidine delta               HISD       HIS, HID              HISDN       HISDN
Water                         HOH        WAT                   -           -
Isoleucine                    ILE        -                     ILEN        ILEN
Leucine                       LEU        -                     LEUN        LEUN
Lysine                        LYS        -                     LYSN        LYSN
Methionine                    MET        -                     METN        METN
Phenylalanine                 PHE        -                     PHEN        PHEN
Proline                       PRO        -                     PRON        PRON
Serine                        SER        -                     SERN        SERN
Threonine                     THR        -                     THRN        THRN
Tyrosine                      TYR        -                     TYRN        TYRN
Tryptophan                    TRP        -                     TRPN        TRPN
Valine                        VAL        -                     VALN        VALN

Table 1b - List of acceptable residue names for nucleic acid ProCESS input

Nucleotide       Mid chain  5’-terminus  3’-terminus
deoxyadenosine   DA*        DA5          DA3
deoxyguanosine   DG*        DG5          DG3
deoxycytosine    DC*        DC5          DC3
deoxythymidine   DT*        DT5          DT3
riboadenosine    RA*        RA5          RA3
riboguanosine    RG*        RG5          RG3
ribocytosine     RC*        RC5          RC3
ribouracil       RU*        RU5          RU3

II.2.2. Definition of the binding site and creation of the XXXX_site.txt file

The first step within ProCESS is to define the binding site. The binding site can either be defined manually or found automatically with the aid of a co-crystallized ligand. For the manual definition of the binding site, the residues included in it should be listed with the Binding_Site keyword. To define the binding site automatically, the AutoFind_Site keyword is used. In this case, ProCESS looks for the file defined in the Ligand keyword and selects all residues within Ligand_Cutoff of the ligand as binding site residues. These residues (whether defined manually or from the co-crystallized ligand) are listed in the XXXX_site.txt file (XXXX being the protein filename with the “.mol2” extension removed). If more than one protein is used as input, they are all considered when creating XXXX_site.txt; however, there will be only one XXXX_site.txt output file, since all the proteins will have the same binding site residues.
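As an illustration of the automatic definition, the following Python sketch selects residues around a ligand. It is not the ProCESS code: the function name is hypothetical, atoms are given as (residue, xyz) pairs, and it assumes that a residue counts as "within Ligand_Cutoff" as soon as any one of its atoms lies within that distance of any ligand atom.

```python
import math


def residues_within_cutoff(protein_atoms, ligand_atoms, ligand_cutoff):
    """Return residues having at least one atom within ligand_cutoff of the ligand."""
    selected = []
    for residue, xyz in protein_atoms:
        if residue in selected:
            continue  # residue already selected through another atom
        if any(math.dist(xyz, lig_xyz) <= ligand_cutoff for lig_xyz in ligand_atoms):
            selected.append(residue)
    return selected


protein_atoms = [("ASP25", (0.0, 0.0, 0.0)), ("ASP25", (1.0, 0.0, 0.0)),
                 ("GLY27", (10.0, 0.0, 0.0)), ("ILE50", (3.0, 0.0, 0.0))]
ligand_atoms = [(0.0, 0.0, 2.0)]
site = residues_within_cutoff(protein_atoms, ligand_atoms, 4.0)
print(site)
```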

II.2.3. Creation of the XXXX_dock.mol2 file

To reduce the amount of time required to compute the non-bonded interactions within FITTED, the protein is truncated to remove residues that contribute little to the binding energy, i.e., those far away from the binding site. There are two options available to truncate the protein: automated truncation, where the center of the truncation is the ligand, and manual truncation, where the center is the collection of active site residues. A residue is removed from the protein unless at least one of its atoms lies within the cutoff distance of the center in use.

II.2.4. Binding site cavity generation

To reduce the amount of time required for a FITTED run, a negative image of the binding site cavity (herein referred to as the binding site cavity) is generated. Using the binding site cavity, FITTED can quickly restrict itself to ligand binding modes that do not clash with the protein, without having to compare against all the atoms of the protein.

Figure 1 - Grid generated from ProCESS

The initial setup of the binding site cavity requires the initial definition of the location of the binding site. This can be done either manually (Grid_Center) or automatically using the co-crystallized ligand (Ligand). Once the center of the site has been defined, ProCESS creates a grid (Figure 1).

The size and resolution of the grid can be customized by using the Grid_Size and Grid_Resolution keywords, although the default values are highly recommended for docking to protein binding sites. Grid points are located within a cube spanning the coordinates Grid_Center ± Grid_Size. At each point of the grid ProCESS checks for clashes with protein atoms. A grid point is considered to be clashing if it lies within Grid_Clash Å of a protein atom, excluding water atoms. If multiple proteins are used, a grid point must clash with all proteins to be removed from the grid. Finally, the active site is made spherical, to remove the bias from the initial orientation of the grid, by removing all grid points lying further than Grid_Size Å from the center of the binding site cavity. This sphere is shown as black dashes in Figure 2b.
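The grid construction just described can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the ProCESS implementation: the function name is hypothetical, each protein is passed as a list of coordinates from which water atoms have already been removed, and the parameter names simply mirror the Grid_Size, Grid_Resolution and Grid_Clash keywords.

```python
import math


def build_cavity_grid(center, grid_size, resolution, grid_clash, proteins):
    """Return grid points inside the sphere that do not clash with every protein."""
    steps = int(round(grid_size / resolution))
    offsets = [i * resolution for i in range(-steps, steps + 1)]
    kept = []
    for dx in offsets:
        for dy in offsets:
            for dz in offsets:
                point = (center[0] + dx, center[1] + dy, center[2] + dz)
                if math.dist(point, center) > grid_size:
                    continue  # spherical trimming of the cubic grid
                # A point is removed only if it clashes with all proteins.
                clashes_with_all = all(
                    any(math.dist(point, atom) < grid_clash for atom in atoms)
                    for atoms in proteins
                )
                if not clashes_with_all:
                    kept.append(point)
    return kept


one_protein = build_cavity_grid((0.0, 0.0, 0.0), 2.0, 1.0, 1.5, [[(2.0, 0.0, 0.0)]])
two_proteins = build_cavity_grid((0.0, 0.0, 0.0), 2.0, 1.0, 1.5,
                                 [[(2.0, 0.0, 0.0)], [(-2.0, 0.0, 0.0)]])
print(len(one_protein), len(two_proteins))
```

With two protein conformations whose atoms occupy different regions, no point clashes with both, so more of the grid survives than with a single protein.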

Once all the grid points that are clashing with the protein have been removed, ProCESS eliminates points in regions that are not part of the binding site (Figure 2a). The binding site region (orange in Figure 2a) is defined by points overlapping with the ligand (or the grid center), and expanded through all adjacent points; the regions not connected to the binding site region are termed isolated (red dots in Figure 2a). These isolated points are then removed.
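The removal of isolated points amounts to keeping one connected region of the grid. A minimal sketch, assuming grid points are stored as integer (i, j, k) indices and that "adjacent" means sharing a face (6-connectivity, which is an assumption of this example rather than a documented ProCESS setting), could look like this:

```python
from collections import deque


def keep_connected_region(points, seed):
    """Return the points connected to `seed` through face-adjacent neighbours."""
    remaining = set(points)
    if seed not in remaining:
        return set()
    region = {seed}
    queue = deque([seed])
    neighbours = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    while queue:
        i, j, k = queue.popleft()
        for di, dj, dk in neighbours:
            candidate = (i + di, j + dj, k + dk)
            if candidate in remaining and candidate not in region:
                region.add(candidate)
                queue.append(candidate)
    return region


grid_points = {(0, 0, 0), (1, 0, 0), (2, 0, 0), (5, 5, 5)}
binding_region = keep_connected_region(grid_points, seed=(0, 0, 0))
print(sorted(binding_region))
```

Here the seed stands for a point overlapping with the ligand or the grid center; the point (5, 5, 5) is isolated and is dropped.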

Once the clustering is finished, ProCESS inflates each grid point until it clashes with the protein or the edge of the large sphere (black dashes in Figure 2b) to reduce the total number of points. Once a grid point is inflated it is referred to as a sphere. Each sphere has a radius associated with it, which is either the minimum distance to a protein atom (Grid Point #1 in Figure 2b) or the distance from the edge of the grid (Grid Point #2 in Figure 2b), whichever is smaller. ProCESS then increases the radius of the spheres to allow for an overlap, which prevents gaps that would otherwise not be covered by any sphere (Grid Point #3 in Figure 2b). ProCESS then takes the sphere with the largest radius and removes all the smaller spheres included within its volume.
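A rough Python sketch of the inflation and pruning step is given below. It is an interpretation, not the ProCESS algorithm: the function name and the overlap parameter are hypothetical, the pruning is applied greedily from the largest sphere downwards, and a smaller sphere is assumed to be "included" when its center lies inside an already kept larger sphere.

```python
import math


def inflate_and_prune(points, protein_atoms, center, grid_size, overlap=0.0):
    """Turn grid points into spheres and drop those centred inside a larger one."""
    spheres = []
    for point in points:
        to_protein = min(math.dist(point, atom) for atom in protein_atoms)
        to_edge = grid_size - math.dist(point, center)
        # Radius: the smaller of the two distances, plus the optional overlap.
        spheres.append((min(to_protein, to_edge) + overlap, point))
    spheres.sort(reverse=True)  # largest radius first
    kept = []
    for radius, point in spheres:
        inside_larger = any(math.dist(point, c) < r for r, c in kept)
        if not inside_larger:
            kept.append((radius, point))
    return kept


points = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, -4.5, 0.0)]
spheres = inflate_and_prune(points, protein_atoms=[(4.0, 0.0, 0.0)],
                            center=(0.0, 0.0, 0.0), grid_size=5.0)
print(spheres)
```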

Figure 2 – (a) Removal of isolated points; (b) Inflation of grid points into spheres. Labels in panel (b): Binding Site Cavity Center, Grid Point #1, Grid Point #2, Grid Point #3.

The spheres kept after inflation are output in the Binding_Site_Cav file. A representation of the binding site cavity file is shown in Figure 3.

Figure 3 - Binding site cavity generated by ProCESS

II.2.5. Interaction Sites/Pharmacophore generation and creation of the XXXX_IS.mol2 file

Interaction sites. Studies have shown that using pharmacophoric constraints during docking leads to increased accuracy.[6] By using a pharmacophore before the force field energy calculation, FITTED can discard poses lacking key interactions thus reducing the CPU time required. Pharmacophores are typically user-generated (see below) with knowledge of the binding site. With this in mind, ProCESS can automatically generate interaction sites from the protein input file(s). One advantage to the use of an automated procedure is that potential interaction points which may be ignored or discarded by users would be kept, allowing for the probing of otherwise unprobed binding pockets.

Figure 4 - Example of the interaction site spheres created for serine; red: hydrogen bond donor, blue: hydrogen bond acceptor.

The interaction sites are automatically created by ProCESS after the generation of the binding site cavity. The hydrogen bond donor (HBD) and acceptor (HBA) sites are created by locating spheres complementary to the properties of the binding site residues. As an example (see Figure 4), for a serine side-chain many HBD and HBA points are created. These points are transformed into spheres defining the geometrical constraints of the interaction. In all cases, the initial sphere radius is predefined (HBA = 1.5 Å; HBD = 1.5 Å). Along with the radius, each point is also assigned a weight depending on the type of point (HBD = 5; HBA = 5; HBA = 25 when the point is for a metal interaction). A weight is used to define the importance of the possible interactions that can occur at each individual sphere. These weights can be modified by using HBD_weight, HBA_Weight and Metal_Weight. At this stage only points within the binding site (see section II.2.4, Binding site cavity generation) are kept.

To generate the hydrophobic (HYD) points, ProCESS first goes through the binding site cavity grid that was created before the inflation of the spheres. At each point not close to an HBA or HBD point, the van der Waals (vdW) interaction energy with the protein carbons is calculated. If the calculated energy is below a cutoff (Hydrophobic_Level, by default set to -0.3), the point is kept. If this point is surrounded by a number of other HYD points, it is added to the interaction site file. The weight of this point is calculated by taking the quartic ratio of the vdW energy calculated at the point divided by the Hydrophobic_Level. This weight can be scaled up or down using the Hydrophobic_Weight keyword, which is by default set to 1. The size of the hydrophobic point is set to 2.0 Å.
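A short numerical sketch may help with the weighting. It assumes that "quartic ratio" means the ratio of the vdW energy to Hydrophobic_Level raised to the fourth power, which is an interpretation rather than a documented formula; the function name is hypothetical and the check on neighbouring HYD points is omitted.

```python
def hydrophobic_point_weight(vdw_energy, hydrophobic_level=-0.3, hydrophobic_weight=1.0):
    """Return the weight of a hydrophobic point, or None if the point is rejected."""
    if vdw_energy >= hydrophobic_level:
        return None  # energy not below the cutoff: point is not kept
    # Assumed reading of "quartic ratio": (E_vdW / Hydrophobic_Level) ** 4
    return hydrophobic_weight * (vdw_energy / hydrophobic_level) ** 4


print(hydrophobic_point_weight(-0.6))
print(hydrophobic_point_weight(-0.1))
```

Under this reading, a point whose vdW energy is twice the cutoff receives a weight sixteen times larger than a point sitting exactly at the cutoff, so strongly hydrophobic regions dominate.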

Once all the interaction site spheres have been initially defined, they are rescaled and some of them are further filtered out. The number of atoms within a 12 Å radius sphere of the interaction site sphere is used as a crude estimation of the deepness of the pocket where the sphere is located. The weight of a sphere is multiplied by this factor, increasing the weight of buried spheres relative to solvent exposed points.

To further reduce the number of sites, ProCESS merges interaction sites which may be considered redundant or overlapping. Interaction sites are considered to be overlapping if they lie within a distance of each other specified by the Pharm_Polar_Softness, Pharm_Nonpolar_Softness, and Pharm_Aromatic_Softness keywords.

As ProCESS may generate many interaction sites, one can further reduce the number of points to keep only the sites with the largest weights. This is done with the Num_of_IS keyword, which sets the number of interaction sites to keep and is set to 75 by default. The final interaction sites file for 1b6l is shown in Figure 5.
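The two reduction steps (merging overlapping sites, then keeping the highest weights) can be sketched as follows. This is only an illustration: the function name is hypothetical, sites are (weight, kind, xyz) tuples, the merge distances are passed in a dictionary keyed by site kind rather than through the three softness keywords, and "merging" is simplified to keeping the heavier of two overlapping sites of the same kind.

```python
import math


def reduce_interaction_sites(sites, softness, num_of_is=75):
    """Drop overlapping sites of the same kind, then keep the heaviest ones."""
    kept = []
    for weight, kind, xyz in sorted(sites, reverse=True):  # heaviest first
        redundant = any(
            kept_kind == kind and math.dist(xyz, kept_xyz) <= softness[kind]
            for _, kept_kind, kept_xyz in kept
        )
        if not redundant:
            kept.append((weight, kind, xyz))
    return kept[:num_of_is]


sites = [(5.0, "HBD", (0.0, 0.0, 0.0)), (8.0, "HBD", (0.5, 0.0, 0.0)),
         (3.0, "HYD", (0.2, 0.0, 0.0)), (1.0, "HBA", (9.0, 0.0, 0.0))]
softness = {"HBD": 1.0, "HBA": 1.0, "HYD": 1.0}
top_sites = reduce_interaction_sites(sites, softness, num_of_is=2)
print(top_sites)
```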

Figure 5 - Interaction sites generation by ProCESS. Red: hydrogen bond acceptors, blue: hydrogen bond donors, green: hydrophobic.

An XXXX_IS.mol2 file is created, which is used by the matching algorithm in FITTED. Unlike the interaction site file, the XXXX_IS.mol2 files are created for each protein individually, as if rigid docking were being done. Therefore, for rigid protein docking the XXXX_IS.mol2 file and the interaction site file are similar.

Pharmacophore. A pharmacophore file can also be generated manually following the same format as the interaction sites file (see section A.4, The interaction site, pharmacophore and XXXX_IS.mol2 files). Typically the pharmacophore file has fewer spheres than the interaction sites file, usually involving interactions which are key to binding, such as interactions with metals.

II.2.6. Solvation and creation of the XXXX_score.mol2 file

As soon as the XXXX_dock.mol2 file is ready, ProCESS prepares a second protein file (or set of files if more than one protein file is given as input) named XXXX_score.mol2. This protein file describes the protein using the all-atom representation, GAFF atom types and AMBER atom charges. A key feature of these files is the pre-computed data used to calculate the Generalized Born/Surface Area (GB/SA) solvation properties (for a detailed description of this method, see [14] and [15]). As part of the scoring function RankScore, FITTED evaluates the solvation contribution to the free energy of binding. Following Still's method [14], [15], Gpol,i for each atom is calculated and written to this mol2 file. The solvent accessible surface area (SASA) is also computed, and this data is used to derive the solvation energy of the protein. As soon as a complex is formed by FITTED, the same Gpol,i values can be reused and the ligand effects added to each of them to compute the solvation of the complex in a timely fashion.

II.3. SMART

SMART is the module used to prepare ligand structures in a modified MOL2 format for use by FITTED. Basically, it assigns GAFF atom types and describes the ligand's functional groups as a bit string. It can also optionally correct the bond order assignment in a molecule, assign MMFF charges [16] and prepare structures for use with ACE (Asymmetric Catalyst Evaluation) by assigning MM3 atom types.[17]

II.3.1. SMART input

The input to SMART should be either a standard MOL2-formatted file or a standard SD/MOL file containing the structure of the ligand to be docked. In either format, the input file can contain one or multiple molecules. Upon completion, SMART will output all diagnostic messages to a log file, and output the prepared ligands in FITTED MOL2 format, either as a single multi-MOL2 file or in multiple single-MOL2 files. Hydrogen atoms will not be added by the current version of SMART and are therefore required in the input files.

II.3.2. Partial atomic charges

FITTED requires partial atomic charges to be assigned on the ligand; SMART has the option to assign MMFF atomic partial charges.[16] Atomic charges can also be assigned by other methods with a third-party tool, in which case SMART can be instructed to preserve them. MMFF charges are accurate for a wide variety of organic molecules; a warning message will be output to the log file in case the assignment could be inaccurate due to missing parameters. Other charging schemes are currently being implemented and validated and will be released by Dec. 2008.

II.3.3. The bit string description

The bit string within SMART is created to allow the fine tuning of the library to be screened using FITTED. SMART determines the presence of functional groups in a ligand by analyzing the assigned atom types and/or connectivity. The groups recognized within the bit string are defined in Table 8 (see also Appendix E, Functional group definitions).

Table 8 – Recognized functional groups in SMART

Aromatic

Aldehyde

Ester

Lactone

Amide

Lactame

Acid

Nitrile

Imine

Nitro

Acceptor

Azide

Isocyanate

Acyl_Chloride

Sulphonamide

Carbamate

Ammonium

Oxime

Ketone

Boronate

Primary_Amine

Secondary_Amine
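To make the idea of a functional-group bit string concrete, here is a minimal Python sketch. The group names are those of Table 8, but the bit order, the use of exactly one bit per group and the function name are assumptions of this example; the actual layout written by SMART may differ.

```python
# Group names taken from Table 8; the ordering of the bits is assumed.
FUNCTIONAL_GROUPS = [
    "Aromatic", "Aldehyde", "Ester", "Lactone", "Amide", "Lactame", "Acid",
    "Nitrile", "Imine", "Nitro", "Acceptor", "Azide", "Isocyanate",
    "Acyl_Chloride", "Sulphonamide", "Carbamate", "Ammonium", "Oxime",
    "Ketone", "Boronate", "Primary_Amine", "Secondary_Amine",
]


def to_bit_string(groups_present):
    """Encode the set of groups found in a ligand as a string of 0s and 1s."""
    unknown = set(groups_present) - set(FUNCTIONAL_GROUPS)
    if unknown:
        raise ValueError(f"unrecognized functional groups: {sorted(unknown)}")
    return "".join("1" if group in groups_present else "0" for group in FUNCTIONAL_GROUPS)


bits = to_bit_string({"Aromatic", "Amide"})
print(bits)
```

A screening filter can then test a single character of the string, for instance to skip every ligand carrying an acyl chloride.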

II.4. FITTED

FITTED is the tool performing most of the work. FITTED has many different options allowing the user to perform various tasks. In the following sections, the different options available within FITTED and a more detailed description of FITTED's inner workings are given.

II.4.1. FITTED modes

Several modes within FITTED allow it to solve various computational problems that may arise: Docking, Virtual Screening, Scoring and Filtering. Additionally, it is possible to adjust a rough docking pose by performing a local optimization. To access any of these modes use the Mode keyword with Dock, VS, Score, Filter, SAR or Local as the option.

If Dock is selected, FITTED will dock the molecule into the protein target. With the Dock option, FITTED ignores all filtering criteria present in the keyword file and also sets larger default values for the number of trials to find a suitable conformation (GI_Num_Of_Trials 10000, GA_Num_of_Trials 1000). If no suitable individual is found after GI/GA_Num_Of_Trials, the last individual is saved and a new search is started. When the Dock mode is set, Number_of_Runs is set to 3.

The VS mode is optimized for the fast screening of compounds. FITTED will use the filtering criteria (see Filtering parameters), skipping compounds which do not satisfy these conditions and proceeding to dock the suitable ones. A smaller default value for GI_Num_Of_Trials (2000) is used, to avoid spending extra time on molecules too large for the cavity. FITTED will also use the CutScore_1, Max_Gen_1, CutScore_2 and Max_Gen_2 keywords. If after Max_Gen_1 generations the score of one of the top three individuals is below CutScore_1, the simulation is allowed to proceed; an analogous check is performed after Max_Gen_2 generations with CutScore_2 as a threshold. These checks ensure that potential non-binders do not consume valuable CPU time. When the VS mode is set, Number_of_Runs is set to 1.
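The two checkpoints can be summarized with a small Python sketch. It is an illustration only: the function names and all numerical values are hypothetical, and it assumes that lower scores are better, so that the "top three individuals" are the three lowest scores.

```python
def passes_checkpoint(scores, cut_score):
    """True if at least one of the three best (lowest) scores is below cut_score."""
    top_three = sorted(scores)[:3]
    return any(score < cut_score for score in top_three)


def should_continue(generation, scores, max_gen_1, cut_score_1, max_gen_2, cut_score_2):
    """Decide whether a virtual screening run proceeds past the current generation."""
    if generation == max_gen_1:
        return passes_checkpoint(scores, cut_score_1)
    if generation == max_gen_2:
        return passes_checkpoint(scores, cut_score_2)
    return True  # no checkpoint at this generation


population_scores = [-5.0, -2.0, 0.0, 3.0]
first_check = should_continue(10, population_scores, 10, -4.0, 30, -8.0)
second_check = should_continue(30, population_scores, 10, -4.0, 30, -8.0)
print(first_check, second_check)
```

In this made-up case the ligand survives the first checkpoint but is dropped at the second because none of its best poses has reached the stricter threshold.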

In the Score mode, FITTED will only score the initial input structure against all the input proteins. By using this option in conjunction with the keyword Score_Initial with the option Minimize, it is possible not only to score the input structure, but also to optimize the input ligand structure by energy minimization and score the energy-minimized structure.

If Mode Filter is specified, FITTED checks whether the molecule passes the filtering criteria defined in the keyword file or the default filters. Diagnostic messages are output to the log file, stating whether the molecule is kept or which of the filters were not satisfied.

With the Local mode it is possible to perform a local energy search on the ligand input structure. In this mode the ligand is only slightly perturbed, with small changes allowed from the initial structure.

With the SAR mode it is possible to perform a structure-activity relationship study assuming similar binding modes for all the ligands. Thus, the torsions are searched conformationally, but translation and orientation are searched only locally (around the location of the initial structure).

II.4.2. Protein flexibility

FITTED is able to dock using various modes of protein flexibility using the Flex_Type keyword.

This keyword has four different options: rigid docking (Rigid), semiflexible docking (Semiflex),

semiflexible docking with movable waters (Flex_Water) and fully flexible docking (Flex).

The number of protein files required for N proteins is 3N+1. For rigid docking, only one protein input structure is needed (4 files for one protein: XXXX_score.mol2, XXXX_dock.mol2, XXXX_site.txt and XXXX_IS.mol2). The XXXX_site.txt file is only used to determine the binding site. In this mode, only the ligand is considered flexible (translation, rotation and torsions). When running semiflexible or fully flexible docking within FITTED, more than one protein structure is needed (i.e., 3 files per protein, XXXX_score.mol2, XXXX_dock.mol2 and XXXX_IS.mol2, and a single XXXX_site.txt).
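The 3N+1 file count can be checked before a run with a few lines of Python. This is only a sketch: the function name is hypothetical, and naming the shared XXXX_site.txt file after the first structure is an assumption, not something the program is known to require.

```python
def required_protein_files(protein_names):
    """Return the 3N+1 file names expected for N protein structures."""
    if not protein_names:
        raise ValueError("at least one protein structure is required")
    per_protein = ("_score.mol2", "_dock.mol2", "_IS.mol2")
    # Three files per protein structure.
    files = [name + suffix for name in protein_names for suffix in per_protein]
    # Assumption: the single binding-site file is named after the first structure.
    files.append(protein_names[0] + "_site.txt")
    return files


if __name__ == "__main__":
    # "prot_a" and "prot_b" are placeholder structure names.
    for file_name in required_protein_files(["prot_a", "prot_b"]):
        print(file_name)
```

For rigid docking (N = 1) the list has 4 entries; for three structures it has 10.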

Semiflexible docking is similar to docking to conformational ensembles, with the protein conformations being allowed to exchange between complexes during the conformational search. The semiflexible mode with movable waters is identical to the semiflexible mode, except that the water molecules can also be exchanged independently of the proteins.


In the fully flexible docking mode, the entire system is considered flexible. The ligand (translation, rotation and torsions), protein (backbone, binding site side chains) and waters are allowed to have all genetic operators applied to them. The residues listed in the XXXX_flex.txt file will be considered flexible.

II.4.3. Covalent docking

About 5% of the marketed drugs are covalent inhibitors. To address this under-investigated field of docking, we turned our attention to the development of an original method to dock covalent inhibitors in a fully automated fashion (unpublished results). This feature is currently being validated and should be used with care. The current version of FITTED is able to dock and virtually screen covalently bound and competitive ligands. FITTED can be instructed to do so by using the Covalent_Residue keyword, followed by the name of the residue that will form the covalent bond with the ligand (e.g., SER451). Currently only serine and cysteine can form covalent bonds with the ligand (contact the developers for other residues). In these cases, the hydrogen is automatically transferred from the residue to the carbonyl or nitrile group of the ligand. In cases where the hydrogen is not transferred from the protein to the ligand but instead to another residue (the current version of the program does not allow any change in the number of atoms while docking), the receiving residue must be specified using the Proton_Moved_To keyword followed by the atom to which the hydrogen will be transferred.

With FITTED, only certain ligand functional groups have been set up to handle covalent bonds (aldehydes, ketones, nitriles and boronates). These groups are handled automatically with no further specification of the group name or location. If FITTED does not find any of these functional groups, it will dock the compound in a regular docking manner. To force only covalent poses, the keyword Covalent_Ligand with the option Only can be used. During the creation of the initial population and the evolution, only poses in which a covalent bond is formed are then kept, while all other poses are rejected. If this keyword is not selected, FITTED will force a minimum of 10% of the initial population to be covalent, while during the evolution there is no restriction and the population is allowed to freely evolve towards covalent or non-covalent binding modes.

To allow for covalent docking, FITTED uses a switching function similar to the one implemented for displaceable water molecules. If the functional group is within Cutdist_Cov angstroms of the serine oxygen or cysteine sulfur, a bond is made, the hybridization (and atom type) of the reacting functional group is modified (i.e., from c to c3 for a reacting aldehyde) and the proton is moved.

II.4.4. FITTED scoring functions

Within FITTED there are two scoring functions, one applied during the docking (DockScore) and another used for a final scoring (RankScore2).

DockScore is in fact a consensus score made up of 4 components. The first is PharmScore which is calculated for the ligand match to the pharmacophore (see equation (1)). The MatchScore is calculated in an analogous manner, but considering the match of the interaction sites instead.


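$$\mathrm{PharmScore} = 100 \times \frac{\displaystyle\sum_{i}^{n} \begin{cases} w_{\mathrm{sphere},i}, & \text{if sphere } i \text{ matches} \\ 0, & \text{otherwise} \end{cases}}{\displaystyle\sum_{i}^{n} w_{\mathrm{sphere},i}} \tag{1}$$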
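Reading PharmScore as 100 times the summed weights of the matched pharmacophore spheres divided by the summed weights of all spheres, the calculation can be sketched as follows. The function and argument names are illustrative and are not taken from FITTED itself; how a sphere is decided to be "matched" is left to the caller.

```python
def pharm_score(sphere_weights, matched):
    """Percentage of the total pharmacophore weight carried by matched spheres.

    sphere_weights: weight of each pharmacophore sphere
    matched:        booleans, True where the ligand pose matches that sphere
    """
    if len(sphere_weights) != len(matched):
        raise ValueError("one match flag is needed per sphere")
    total = sum(sphere_weights)
    if total <= 0:
        raise ValueError("the total sphere weight must be positive")
    # Unmatched spheres contribute 0 to the numerator.
    hit = sum(w for w, is_match in zip(sphere_weights, matched) if is_match)
    return 100.0 * hit / total


if __name__ == "__main__":
    print(pharm_score([2.0, 1.0, 1.0], [True, False, True]))
```

A pose matching every sphere scores 100, and a pose matching none scores 0, so Min_PharmScore can be read as a percentage threshold.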

Next, a ClashScore is calculated with the active site cavity file generated by ProCESS. If an atom does not match one of the spheres of the cavity, the pose is rejected and a new one is generated. Finally, a GAFFScore based on the GAFF force field [18] is computed. The GAFFScore terms and their weights are given in equation (2). Note that a lower (more negative) value of GAFFScore is better.

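$$\begin{aligned}
\mathrm{GAFFScore} ={} & E_{\mathrm{internal}} + 0.5\,E_{\mathrm{vdW}}^{1\text{-}4} + E_{\mathrm{vdW}}^{1\text{-}5} + 2.0\,E_{\mathrm{vdW}}^{\mathrm{protein}} \\
& + 0.25\,E_{\mathrm{coulombic}}^{1\text{-}4} + 0.5\,E_{\mathrm{coulombic}}^{1\text{-}5} \\
& + E_{\mathrm{coulombic}}^{\mathrm{protein}} + E_{\mathrm{Water}}^{\mathrm{Loss}}
\end{aligned} \tag{2}$$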

The weights can be modified by using the keywords found in the section Additional keywords for FITTED. Since proteins are large objects and the non-bonded energy contribution is nearly 0 at long distances, FITTED is equipped with a switching function to effectively turn off the long-range protein-ligand non-bonded terms. This function is similar to the one implemented in CHARMM [13]. The cut-off and switching distances can be modified using the keywords Cutdist and Switchdist.
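For illustration, the classic CHARMM-style switching function can be written in a few lines. This is a sketch under two assumptions: that FITTED's function has this exact polynomial form (the text only says it is similar), and that Switchdist is the distance at which switching starts while Cutdist is the distance at which the interaction vanishes.

```python
def switching_factor(r, switch_dist, cut_dist):
    """Scale factor in [0, 1] applied to a non-bonded term at distance r.

    1 below switch_dist, 0 beyond cut_dist, smooth polynomial in between
    (the form classically used in CHARMM).
    """
    if not 0.0 <= switch_dist < cut_dist:
        raise ValueError("expected 0 <= switch_dist < cut_dist")
    if r <= switch_dist:
        return 1.0
    if r >= cut_dist:
        return 0.0
    r2, on2, off2 = r * r, switch_dist ** 2, cut_dist ** 2
    return (off2 - r2) ** 2 * (off2 + 2.0 * r2 - 3.0 * on2) / (off2 - on2) ** 3


if __name__ == "__main__":
    # Illustrative distances in angstroms, not program defaults.
    for distance in (6.0, 8.0, 9.0, 10.0, 12.0):
        print(distance, switching_factor(distance, 8.0, 10.0))
```

The scaled energy is then the raw pairwise energy multiplied by this factor, so the interaction and its derivative go smoothly to zero at the cut-off.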

FITTED allows for displaceable water molecules. All water molecules within the protein structure are considered displaceable, which means they can be either present or removed, depending on which situation is better during the docking. Within FITTED this is achieved with a switching function that effectively removes the water if the ligand is too close to it. The distance between the ligand and the water that is considered too close can be changed by using Cutdist_Wat and Switchdist_Wat.

RankScore2 (Equation 3) is a force-field based scoring function, with the addition of terms to account for solvation and entropic effects.

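$$\mathrm{RankScore2} = f\,\left(E_{\mathrm{vdw}} + E_{\mathrm{coul}} + E_{\mathrm{HB/metal}}\right) + E_{\mathrm{strain}} + E_{\mathrm{solv}} + E_{S} \tag{3}$$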

Each energy term is weighted by a coefficient that can be modified by using the keywords specified in section II.10, Energy parameters. The first 3 energy terms (Evdw, Ecoul and EHB/metal) correspond to the intermolecular interactions calculated with the GAFF force field. The factor f scales these terms to account for the reduced free energy of protein-ligand interaction experienced by flexible residues. The Estrain term accounts for the conformational strain of the ligand pose. It is calculated from the GAFF internal energies of the ligand when bound and unbound. Esolv includes an estimation of the energy of desolvation calculated with the GB/SA approach. The atomic contributions to GB are pre-computed by ProCESS and stored in the XXXX_score.mol2 file.

Following Still's method [14],[15], the polar contribution to the solvation energy is computed. A surface area is also derived from the ligand, protein and complex and used to compute the non-polar contribution. The solvation calculation remains time-consuming and can be turned off by setting Solvation to off. Finally, the ES term (Equation 4) represents the penalty for torsional entropy loss upon binding.

$$E_{S} = 0.15 \sum^{N_{\mathrm{rot}}} G(\mathrm{pol}) \sum_{i}^{\mathrm{atoms}} \frac{N_i}{N_0} \tag{4}$$

Essentially, it involves counting the number of rotatable bonds (acyclic, non-terminal bonds between sp3-hybridized atoms), weighted by the polarity of the bond (G(pol)) and by how buried the bond is (estimated by the relative number of atoms around each atom of the bond, Ni/N0). The rationale behind this is that the more polar a bond, the more frozen it will be in a binding pocket; conversely, the more solvent-exposed a bond is, the freer it will be to move.

II.4.5. Genetic algorithm

A genetic algorithm (see Figure 7a) is a global optimization technique. In the present case, it is used as a conformational search tool, allowing for the sampling of the flexibility of complex systems. The first step is the creation of an initial population of poses with many different conformations, also known as individuals. Each conformation can be described by a chromosome made up of genes; in this case each gene contains the value of a molecular descriptor such as a torsional angle, the position or the orientation (see Figure 6). Once the initial population is created, the population evolves. Within this loop, parents (a pair of individuals in the population) are selected and coupled together using genetic operators, creating new conformations called children. With the children produced, the population is trimmed (some parents and children are removed) and another reproduction cycle is started. This loop is continued until the population converges. This is the basis for the FITTED genetic algorithm. Modifications to the general scheme, along with more in-depth definitions of each element of the genetic algorithm, are discussed in the following sections.


II.4.5.1. The chromosomes

Figure 6 - FITTED Chromosome

For evolution to occur on a population of conformations, each individual must have a chromosome. Flex_Type determines the makeup of these chromosomes (Figure 6).

The chromosome contains many different genes. In Flex_Type Rigid the chromosome has a separate gene for each flexible torsion and one each for the translation and the orientation. For semiflexible docking (Semiflex) the protein conformation gene is added to the chromosome. For semiflexible docking with flexible waters (Flex_Water) the position of each individual water is added to the chromosome. Finally, during fully flexible docking (Flex) the conformation of each individual binding site residue is added.

II.4.5.2. Generation of a high quality (already evolved) population

FITTED creates the initial population (see Figure 7b) by generating many ligand conformations, randomizing the torsions, orientation and position of the ligand. The keyword Seed is used to initialize the random number generator; setting it to 0 makes FITTED select a random number as the seed. The allowed values of these genes (torsional angle values) can be restricted by using keywords. The rotation of the torsions occurs at discrete values controlled by the Resolution keyword (360°/Resolution); this increases the probability that the conformation generated is energetically stable. FITTED uses a corner flap approach to conformationally search rings, which can be turned on by using Corner_Flap on.

A three-point matching algorithm is used to orient the ligand when generating the initial population using the XXXX_IS.mol2 file. If a Pharmacophore is used, each triangle formed by three points has at least one of the points from the pharmacophore. If only XXXX_IS.mol2 is used, it is possible to create triangles only with the highest weighted interaction sites (see section II.2.5, Interaction Sites/Pharmacophore generation). The Num_of_Top_IS parameter allows the user to specify that only triangles with one of the top Num_of_Top_IS points should be created (default = 10). To switch off the matching algorithm, the keyword Matching_Algorithm should be set to off. The matching algorithm is automatically turned off when doing either a Local or SAR search or a covalent docking run.

With the matching algorithm turned off, the maximum translation performed in one generation can be set with the Max_Tx, Max_Ty and Max_Tz keywords; the center of the ligand will be translated by a random number up to these values. When Mode is set to Dock the default for these values is 5 Å; when set to SAR or Local the default is 0.2 Å. The ligand pose will be rigidly rotated by any value between 0° and 360° if using the Dock mode. When doing a local search or during SAR mode the ligand will only be rotated within +/- Max_Rxy, Max_Ryz and Max_Rxz, which are set to 2° by default.

If using any of the flexible docking modes, the population will be increased to ensure an even representation of the protein conformations throughout the population. The size of the population is controlled with the Pop_Size keyword. Pop_Size will be modified so that it is divisible by 2 and by the number of proteins. For example, if a Pop_Size of 100 is selected and 3 proteins are used, the Pop_Size will be increased to 102.
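The Pop_Size adjustment amounts to rounding up to the next common multiple of 2 and the number of proteins. A minimal sketch follows; the function name is hypothetical, and rounding up (rather than down) is inferred from the 100 to 102 example.

```python
import math


def adjust_pop_size(pop_size, n_proteins):
    """Round pop_size up so that it is divisible by 2 and by n_proteins."""
    if pop_size < 1 or n_proteins < 1:
        raise ValueError("pop_size and n_proteins must be positive")
    # Least common multiple of 2 and the number of protein structures.
    step = 2 * n_proteins // math.gcd(2, n_proteins)
    # Integer ceiling division, then scale back up to a multiple of step.
    return -(-pop_size // step) * step


if __name__ == "__main__":
    print(adjust_pop_size(100, 3))
```

With 100 individuals and 3 proteins the step is 6, and the next multiple of 6 at or above 100 is 102, in line with the example in the text.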

Once a pose is generated, it is first passed through the PharmScore filter. The filter uses the Pharmacophore file. For more information on how the file is created and how the PharmScore is

calculated, see sections II.2.5: Interaction Sites/Pharmacophore generation and II.4.4: FITTED scoring functions. If the pose does not meet the Min_PharmScore then a new conformation is

generated. If the conformation passes then it passes through the MatchScore filter.

Figure 7 – (a) Genetic algorithm; (b) Generation of initial population

[Flowcharts. Box labels in panel (a): Generation of Initial Population; Selection of next generation; Is population converged? (Yes/No); Exit. Box labels in panel (b): Generate Conformation; GAFFScore; Minimize; GAFFScore; Save in Population; Fail; Calculate new Min_MatchScore; Pop_Size Reached? (Yes/No); Selection of the next generation; Is population converged? (Yes/No); Exit.]

The conformation must meet the value of Min_MatchScore; otherwise a new conformation is generated. In the case of using Mode Dock, the initial Min_MatchScore is optimized on the fly. If after 500 trials at generating a conformation none is found that passes the MatchScore filter, the Min_MatchScore is reduced by 0.5%. Once an individual is saved into the initial population, Min_MatchScore is reoptimized with equation (5), using all the conformations generated to calculate the average MatchScore.

In equation (5), Stringent_MS tunes how stringent the Min_MatchScore should be. By default this value is 5.0 for rigid docking and 4.0 for flexible docking. The larger the value of Stringent_MS, the more stringent the automatic calculation of Min_MatchScore becomes.

ClashScore is calculated on the binding site cavity file. If the ligand passes the ClashScore filter, a GAFFScore is calculated. If the GAFFScore is below GI_Initial_E then the structure is minimized and a new GAFFScore is calculated. If the latter is below GI_Minimized_E then the conformation is saved into the population. Once Pop_Size has been reached, FITTED proceeds with the evolution of the population.
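The cascade of filters used to build the initial population can be summarized in a short sketch. The scoring and minimization routines are passed in as placeholder callables, the PharmScore and ClashScore filters are omitted for brevity, and the max_trials guard is an addition of this sketch so that it always terminates; none of these names come from the program itself.

```python
def build_initial_population(generate_pose, match_score, gaff_score, minimize,
                             pop_size, min_match_score,
                             gi_initial_e, gi_minimized_e,
                             max_failures=500, relax=0.005,
                             max_trials=1_000_000):
    """Fill a population with poses that pass the MatchScore and GAFFScore filters."""
    population = []
    failures = 0   # consecutive MatchScore failures
    trials = 0
    while len(population) < pop_size:
        if trials >= max_trials:
            raise RuntimeError("could not fill the population")
        trials += 1
        pose = generate_pose()
        if match_score(pose) < min_match_score:
            failures += 1
            if failures >= max_failures:
                # After 500 failed trials, lower the threshold by 0.5%.
                min_match_score *= 1.0 - relax
                failures = 0
            continue
        failures = 0
        if gaff_score(pose) >= gi_initial_e:
            continue  # too high in energy to be worth minimizing
        pose = minimize(pose)
        if gaff_score(pose) < gi_minimized_e:
            population.append(pose)
    return population, min_match_score
```

The function returns the relaxed threshold together with the population so that the caller can see how far Min_MatchScore had to be lowered.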

II.4.5.3. Evolution

Once a population is created it proceeds through reproduction (see Figure 8). FITTED uses a Lamarckian evolutionary algorithm in which the population learns during the evolution. A conformation learns by being energy-minimized; in this way, the child conformation no longer descends directly from its parents. At the beginning of each evolutionary iteration a percentage of the population is further optimized by energy minimization. The number of individuals energy-minimized is set with the pLearn keyword. The larger the value, the longer the docking run will be. However, optimizing a small fraction speeds up the convergence of the docking runs.

Once this learning stage is over, 2 individuals of the population are randomly chosen to act as parents. Parents that have not been used in previous generations are given a higher priority to be selected.

$$\mathrm{Min\_MatchScore} = \mathrm{Stringent\_MS} \times \frac{\mathrm{MatchScore}_{\mathrm{Average}} + \mathrm{MatchScore}_{\mathrm{Maximum}}}{10.0} \tag{5}$$

Figure 8 - Reproduction

[Flowchart. Box labels: Generate Initial Population; pLearn; Select Parents; pCross; pMut; GAFFScore; Minimize; GAFFScore; Save Children; Has Pop_Size been reached? (Yes/No); Selection of the next generation; Is population converged? (Yes/No); Exit.]

At this point the parents' chromosomes are crossed over to create children chromosomes defining new poses, followed by the application of mutation(s). The probabilities of crossover and mutation are controlled by the keywords pCross and pMut; the default values of both are highly recommended. Crossover is the exchange of genes between the chromosomes of the parents to create new individuals. Mutation is the random generation of a new value for a gene within the children. FITTED also allows the probability of mutation to be increased independently for the orientation in space of the ligand and for water exchange by using pMutRot and pMutWat. Larger values for pMutRot and pMutWat are needed since many of the initial values may be eliminated during the early stages of the evolution. During the evolution, orientations often need only a small refinement. Accordingly, the extent of the mutation of the ligand orientation can also be restricted by using Max_Rxy, Max_Rxz and Max_Ryz. The water mutation probability increases exponentially from 0% to pMutWat at Max_Gen. Once this new individual is made, it passes through the same filters as for the initial population and may eventually be energy-minimized.
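The two genetic operators can be illustrated on chromosomes stored as plain lists of genes. This is a generic sketch, not FITTED's implementation: a uniform crossover (each gene swapped with probability 0.5) is assumed because the text does not specify the crossover type, and the per-gene samplers are hypothetical stand-ins for whatever generates a new torsion, orientation or water state.

```python
import random


def crossover(parent_a, parent_b, p_cross, rng):
    """With probability p_cross, exchange genes between the two parents."""
    child_a, child_b = list(parent_a), list(parent_b)
    if rng.random() < p_cross:
        for i in range(len(child_a)):
            # Assumption: uniform crossover, each gene swapped half the time.
            if rng.random() < 0.5:
                child_a[i], child_b[i] = child_b[i], child_a[i]
    return child_a, child_b


def mutate(chromosome, samplers, p_mut, rng):
    """Replace each gene by a freshly sampled value with probability p_mut."""
    return [sampler(rng) if rng.random() < p_mut else gene
            for gene, sampler in zip(chromosome, samplers)]


if __name__ == "__main__":
    rng = random.Random(1)
    # Two torsion genes (degrees) and one protein-conformation index.
    samplers = [lambda r: r.choice(range(0, 360, 30)),
                lambda r: r.choice(range(0, 360, 30)),
                lambda r: r.randrange(3)]
    kid_a, kid_b = crossover([0, 90, 0], [180, 270, 2], 0.85, rng)
    print(mutate(kid_a, samplers, 0.05, rng), mutate(kid_b, samplers, 0.05, rng))
```

Separate probabilities for particular genes, as with pMutRot and pMutWat, would correspond to passing a different p_mut for the orientation and water genes.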

II.4.5.4. Selection of the next generation

Once a new child population is created, FITTED allows for multiple ways to select the individuals of the next generation (see Figure 9). By default the Evolution keyword is set to Steady_State, where the best two individuals out of the two parents and two children are kept and passed to a new population. A variation of Steady_State is the Metropolis option. In the latter, in the case where one or more of the parents would pass to the next generation using the steady state mode, the Metropolis criterion is used to see if the children should be kept. The Metropolis option should be used when population statistics are to be studied.

Once the individuals selected to be passed to the next generation are known, the pElite genetic operator is applied. A number of individuals, determined by pElite, are copied, a local optimization is applied, and the corresponding poses replace the worst individuals of this new population. To decrease the problem of early convergence caused by copying only the best individuals, pElite selects a random individual out of the top pElite_SSize. This process can occur every generation, or at different intervals set by pElite_Every_X_Gen. This option prevents premature convergence and is highly recommended. This new population is then passed on to the next generation.
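The steady-state rule and the elitism step described above can be sketched as follows. The sketch assumes that lower scores are better (as for GAFFScore) and treats pElite as a plain count of individuals to copy, which is an assumption; the optimize callable is a placeholder for the local optimization.

```python
import random


def steady_state(parents, children, score):
    """Keep the best two individuals out of two parents and two children."""
    family = list(parents) + list(children)
    return sorted(family, key=score)[:2]


def apply_elitism(population, score, n_elite, top_size, optimize, rng):
    """Copy n_elite individuals drawn at random from the top_size best,
    locally optimize the copies and let them replace the worst individuals."""
    ranked = sorted(population, key=score)
    top = ranked[:top_size]
    clones = [optimize(rng.choice(top)) for _ in range(n_elite)]
    return ranked[:len(ranked) - n_elite] + clones


if __name__ == "__main__":
    rng = random.Random(0)
    # Individuals are represented directly by their scores in this toy example.
    print(steady_state([-3.0, -1.0], [-2.0, -4.0], score=lambda s: s))
    print(apply_elitism([5.0, 1.0, 3.0, 2.0, 4.0], lambda s: s, 2, 3,
                        lambda s: s - 0.5, rng))
```

Drawing the copies from the top_size best rather than from the single best individual is what keeps some diversity in the population.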

Figure 9 - Selection of the next generation

[Flowchart. Box labels: Generate Initial Population; Steady State; Metropolis; Is pElite_Every_X_Gen criteria met? (Yes/No); pElite; Selection of the next generation; Is population converged? (Yes/No); Exit.]


II.4.5.5. Convergence criteria

Figure 10 - Convergence criteria

[Flowchart. Box labels: Generate Initial Population; Selection of the next generation; Is Best - Average < Diff_Avg_Best?; Is Best – Diff_Number < Diff_N_Best?; Has Max_Gen been reached and Max_Gen != Max_Gen_2?; Is one of top 3 individuals below CutScore_X?; Increase Max_Gen to Max_Gen_X; Is population converged? (Yes/No); Exit.]

During virtual screening it has been noted that a good scoring ligand typically has a good score within a few generations. Therefore, to reduce the amount of time spent on ligands that are unlikely to be good binders, we filter out the ligands that do not have a GAFFScore lower (better) than CutScore_1 at Max_Gen. If the score is below CutScore_1 then Max_Gen is increased to Max_Gen_1. FITTED also has a second filter of this type, which is enabled with the CutScore_2 and Max_Gen_2 keywords.
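The staged cut-off can be pictured as a loop over (cut-off score, extended generation limit) pairs. In this sketch evolve_to is a hypothetical callable that runs the search up to a given generation and returns the population scores sorted from best (lowest) to worst; it does not correspond to an actual FITTED routine.

```python
def passes_cut(sorted_scores, cut_score):
    """True if at least one of the top three individuals beats the cut-off."""
    return any(score < cut_score for score in sorted_scores[:3])


def run_with_cutoffs(evolve_to, max_gen, stages):
    """Run to max_gen, then extend the run stage by stage while the ligand
    keeps passing the cut-off.

    stages: list of (cut_score, extended_max_gen) pairs, for example
            [(CutScore_1, Max_Gen_1), (CutScore_2, Max_Gen_2)].
    Returns (generations_run, survived_all_filters).
    """
    limit = max_gen
    scores = evolve_to(limit)
    for cut_score, extended_max_gen in stages:
        if not passes_cut(scores, cut_score):
            return limit, False      # unlikely binder: stop early
        limit = extended_max_gen
        scores = evolve_to(limit)
    return limit, True


if __name__ == "__main__":
    # Toy search whose best score improves linearly with the generation count.
    toy = lambda generation: [-1.0 * generation, 0.0, 0.0]
    print(run_with_cutoffs(toy, 10, [(-5.0, 50), (-100.0, 200)]))
```

In the toy run the ligand passes the first filter but fails the second, so the search stops at the first extended limit instead of consuming the full budget.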

FITTED has two energetic convergence criteria for the genetic algorithm. One involves monitoring the difference between the best individual GAFFScore and the average GAFFScore of the population (Diff_Avg_Best), while the other criterion uses the difference between the

GAFFScore of the pose ranked Diff_Number and the best GAFFScore (Diff_N_Best).
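Both energetic criteria reduce to comparing a score gap with a threshold. The sketch below assumes that lower GAFFScores are better, that Diff_Number is a 1-based rank, and that meeting either criterion is enough to stop; the text does not state whether one or both must hold, so that choice is an assumption.

```python
def is_converged(scores, diff_avg_best, diff_number, diff_n_best):
    """Check the two energetic convergence criteria on a list of GAFFScores."""
    ranked = sorted(scores)          # best (lowest) score first
    if not 1 <= diff_number <= len(ranked):
        raise ValueError("diff_number must be a valid 1-based rank")
    best = ranked[0]
    average = sum(ranked) / len(ranked)
    gap_to_average = average - best
    gap_to_rank_n = ranked[diff_number - 1] - best
    # Assumption: either criterion being satisfied ends the run.
    return gap_to_average < diff_avg_best or gap_to_rank_n < diff_n_best


if __name__ == "__main__":
    population_scores = [-10.0, -9.9, -9.8, -2.0]
    print(is_converged(population_scores, 1.0, 3, 0.5))
```

In this example the average is pulled up by one poor pose, so the first criterion fails, but the three best poses lie within 0.5 of each other and the second criterion signals convergence.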


II.5. References

[1] Moitessier, N.; Englebienne, P.; Lee, D.; Lawandi, J.; Corbeil, C. Towards the development of universal, fast and highly accurate docking/scoring methods: A long way to go. Brit. J. Pharmacol. 2008, 153, (SUPPL. 1).

[2] Corbeil, C. R.; Englebienne, P.; Moitessier, N. Docking Ligands into Flexible and Solvated Macromolecules. 1. Development and Validation of FITTED 1.0. J. Chem. Inf.

Model. 2007, 47, 435-449.

[3] Corbeil, C. R.; Englebienne, P.; Yannopoulos, C. G.; Chan, L.; Das, S. K.; Bilimoria, D.; Heureux, L.; Moitessier, N. Docking Ligands into Flexible and Solvated Macromolecules. 2. Development and Application of FITTED 1.5 to the Virtual Screening of Potential HCV

Polymerase Inhibitors. J. Chem. Inf. Model. 2008, 48, 902-909.

[4] Corbeil, C. R.; Moitessier, N. Docking Ligands into Flexible and Solvated Macromolecules. 3. Impact of Input Ligand Conformation, Protein Flexibility and Water Molecules on Accuracy of Major Docking Programs. J. Chem. Inf. Model. submitted

[5] Moitessier, N.; Westhof, E.; Hanessian, S. Docking of Aminoglycosides to Hydrated and

Flexible RNA. J. Med. Chem. 2006, 49, 1023-1033.

[6] Moitessier, N.; Therrien, E.; Hanessian, S. Method for Induced-Fit Docking, Scoring, and Ranking of Flexible Ligands. Application to Peptidic and Pseudopeptidic β-secretase (BACE-1) Inhibitors. J. Med. Chem. 2006, 49, 5885-5894.

[7] Moitessier, N.; Henry, C.; Maigret, B.; Chapleur, Y. Combining Pharmacophore Search, Automated Docking, and Molecular Dynamics Simulations as a Novel Strategy for Flexible Docking. Proof of Concept: Docking of Arginine-Glycine-Aspartic Acid-like Compounds into the αvβ3 Binding Site. J. Med. Chem. 2004, 47, 4178-4187.

[8] Englebienne, P.; Fiaux, H.; Kuntz, D.; Corbeil, C. R.; Rose, D.; Gerber-Lemaire, S.; Moitessier, N. Evaluation of Docking Programs for Predicting Binding of Golgi alpha-Mannosidase II Inhibitors: A Comparison with Crystallography. Proteins: Struct., Funct.,

Bioinf. 2007, 69, 160-176.

[9] Fay, A.; Corbeil, C.R.; Moitessier, N.; Bowie, D. Ligand Docking and Electrophysiological Analysis of Full and Partial Agonists at Kainate Receptors. To be submitted

[10] Englebienne, P.; Moitessier, N. Docking Ligands into Flexible and Solvated Macromolecules. 4. Are Popular Scoring Functions Accurate for This Class of Proteins? Manuscript in preparation.

[11] Englebienne, P.; Corbeil, C. R.; Moitessier, N. Docking Ligands into Flexible and Solvated Macromolecules. 5. Force Field-Based Prediction of Binding Affinities of Ligands To Proteins and Development of RankScore2. Manuscript in preparation.

[12] Schwartzentruber, J.; Lawandi, J.; Corbeil, C. R.; Moitessier, N. Unpublished results


[13] Brooks, B. R.; Bruccoleri, R. E.; Olafson, B. D.; States, D. J.; Swaminathan S.; Karplus M. CHARMM: A Program for Macromolecular Energy, Minimization and Dynamics

Calculations. J. Comp. Chem. 1983, 4, 187-217.

[14] Still, W. C.; Tempczyk, A.; Hawley, R. C.; Hendrickson, T. Semianalytical treatment of solvation for molecular mechanics and dynamics. J. Am. Chem. Soc. 1990, 112, 6127-6129.

[15] Qiu, D.; Shenkin, P. S.; Hollinger, F. P.; Still, W. C. The GB/SA Continuum Model for Solvation. A Fast Analytical Method for the Calculation of Approximate Born Radii. J. Phys. Chem. A 1997, 101, 3005-3014.

[16] Halgren, T. A. Merck molecular force field. I. Basis, form, scope, parameterization, and

performance of MMFF94. J. Comp. Chem. 1996, 17, 490-519, and following papers.

[17] Corbeil, C. R.; Thielges, S.; Schwartzentruber, J. A.; Moitessier, N. Toward a Computational Tool Predicting the Stereochemical Outcome of Asymmetric Reactions: Development and Application of a Rapid and Accurate Program Based on Organic

Principles. Angew. Chem., Int. Ed. 2008, 47, 2635-2638.

[18] Wang, J.; Wolf, R.M.; Caldwell, J. W.; Kollman, P. A.; Case, D. A. Development and

testing of a general Amber force field. J. Comp. Chem. 2004, 25, 1157-1174


III. Installation

III.1. The FITTED, ProCESS and SMART folders

To install the suite of programs, simply copy the FITTED2.6 folder to the location where you want to use it. No other manipulations are necessary. This folder includes three sub-directories, one for each of the three modules (FITTED, ProCESS and SMART). Within these three folders are other subfolders such as input, output and keyword. The names of these folders cannot be changed. The figure below describes the tree of folders necessary for each program to work properly.

Figure 11 - Directory structure for FITTED, ProCESS and SMART

[Directory tree. ProCESS contains the input, keyword and output folders; SMART contains the input, forcefield and output folders; FITTED contains the input, keyword, forcefield and output folders, with a structure folder inside output.]


IV. Getting started with FITTED

This section describes how to prepare and start a docking run with the FITTED suite of programs. All the examples and the corresponding files can be found in the examples/ folder.

IV.1. Setting up the system

In general, the structure files downloaded from the Protein Databank (PDB) require some preparation to ensure optimal results with any docking program. For instance, the protonation state of some residues may be critical to the binding of a ligand, hence to the observed enzymatic activity. The accuracy of the docking obtained with the FITTED suite of programs therefore relies on a careful preparation of the input files. This preparation requires the use of other programs with a graphical user interface such as Sybyl, Maestro or Insight II. The following sections give general details on what needs to be done to the protein and ligand structure files before the FITTED suite can be used for optimal results.

When the ligand and the protein remain as one file, they will be referred to as the complex from now on. The complex may also include ions (e.g., metals), water molecules and co-factors. X-ray crystal structures downloaded from the PDB most likely do not contain hydrogen atoms. Hydrogens on the ligands are needed for FITTED since the ligand is treated via an all-atom representation while the hydrogens on the protein are required to assign advanced residue names according to the protonation states, and to compute the solvation parameters. Once hydrogens are added, their orientation should be optimized, for instance performing an energy minimization with the heavy atoms fixed.

One of the advantages of FITTED is its ability to have mobile and displaceable water molecules. However, this feature requires the proper setup of waters within the complex. Only water molecules which are perceived as key for the binding of ligands should be kept, while all others should be removed. Waters are perceived as critical if they interact with both the protein and the ligand (bridging interactions) and are not exposed to the aqueous medium. If the number of key water molecules varies with the protein structure, copy the location of the missing waters from the other structure. During the docking run FITTED will displace them if necessary.

At this point, the complex(es) can be split into the corresponding protein and ligand structure file(s). The protein file(s) should include the water molecules, ions and co-factors, if any. These files (ligand and protein) should be saved in mol2 format, available within most of the interfaces. If running a rigid docking (Flex_Type Rigid), the protein file is ready to be submitted to ProCESS. If Flex_Type is Semiflex, Flex or Flex_Water, additional steps may be required to ensure that all protein files are identical (i.e., same number of protein atoms, same number of water molecules, same residue names). If discrepancies are found, ProCESS will exit with an error message (see the ProCESS section). If some crystal structures have more water molecules than others, waters can be taken/copied from the protein structures that have similar conformations. The following list gives some of the common 'errors' in PDB files which need to be corrected.


o In all cases, if the 'error' appears close to the binding site, the protein structure should not be considered for the study.

o Mutated residues: if the mutation is far from the binding site (at least 10 Å from the binding site), the residue can be virtually mutated to the desired residue.

o Incomplete residues: in some cases, parts of very flexible residues are not observed and are not included in PDB files. Again, if they are far from the active site, they can be virtually reconstructed.

o Missing residues: if they are far from the active site, they can be (i) added where missing or (ii) removed from the other files.

o Terminal residues: in some cases, terminal residues are not properly described in the PDB or are mishandled by the program used to set up the protein (e.g., terminal COO- groups are CHO). In this case, the missing atoms should be added.

o Missing waters: all protein files should have the same number of waters. If a water molecule is missing, one can be virtually added from another protein file. FITTED will remove the water if it is not needed.

o Missing atoms: if an atom is actually missing and is far away from the active site, it can be added. If the atom is there, make sure the atom name and atom type are the same. The atom may be in a different part of the protein file. If this is the case, renumber the atom within your graphical interface and regenerate the protein input file for ProCESS.

o Nucleic acids: the 5'-terminus should have a 5'-OH, and not a phosphate group. If necessary, remove the phosphate group and protonate the 5' oxygen. The residue names need to be corrected before attempting to run ProCESS, but after adding hydrogens to the system. For this, a pair of scripts (fix_dna.awk and fix_rna.awk) is provided in the process/scripts/ directory. These scripts rename the residues according to the names in Table 1b:

fix_rna.awk term5=<5'term> term3=<3'term> file.mol2 > file_new.mol2

<5'term> and <3'term> denote the residue numbers (column 7 in the MOL2 file) of the 5'- and 3'-terminal residues, respectively.

Before the ligand is run with SMART, partial charges need to be properly assigned (Gasteiger-Hückel charges have been validated but others should work as well) and the file saved in mol2 format.

IV.2. Running the FITTED suite

All three modules work under Windows and Linux. Both versions can be used from a terminal window, and the Windows executables can also be started by double-clicking on the icons. The commands given below are to be entered in a terminal.

APPENDIX F


To run ProCESS, place all protein files in the ProCESS input directory and create an appropriate keyword file (example keyword files are 1e2k_rigid.txt and tk.txt). If flexibility is desired, make sure the ProCESS keyword file lists multiple protein structure files (see tk.txt). Go to the process/ directory and run ProCESS by typing:

./process <keyword filename>

ProCESS will create all the files (XXXX_dock.mol2, XXXX_score.mol2, XXXX_site.txt, cavity.mol2, interaction_sites.mol2, XXXX_IS.mol2) in the output/ directory within minutes, as well as XXXX.out, which contains information about the calculations and errors. If the same keyword file is run again, all the files will be overwritten. Copy all files except the XXXX.out file to the input/ directory of FITTED.

To run SMART (example 1e2k_ligand.mol2), place the ligand file previously prepared (section IV.1) in the SMART input/ directory. Go to the smart/ directory and run SMART by typing:

./smart 1e2k_ligand.mol2

All files will be written to the output/ directory. Copy the 1e2k_ligand_1.mol2 file to the input/ directory of FITTED.

To run FITTED, make sure all the input files (ligand, protein(s), active site cavity, and interaction site files) are in the FITTED input/ directory. Create the appropriate keyword file (examples for rigid and flexible docking are included as 1e2k_rigid.txt and 1e2k_flex.txt) and place it in the keyword/ directory. Then type:

./fitted <keyword_filename>

All results will be written to a .out file (name specified in the keyword file) and all errors/warnings to a .log file (same name as the output file). If structures are output (“printed”), these files will be created in the structure directory within the output directory.

If running more than one file sequentially as in virtual screening runs, scripts can be used to create keyword files, extract data and run FITTED.
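As an illustration of such scripting, the following Python sketch writes one keyword file per ligand from a common template. The function name, the template contents and the file naming scheme are assumptions made for this example and are not part of the FITTED distribution; only keywords described in this manual are used.

```python
from pathlib import Path

# Hypothetical template: the values are illustrative only.
TEMPLATE = """Protein_Conformations 1
{protein}
Ligand {ligand}.mol2
Output {ligand}_run
Mode VS
"""


def write_keyword_files(ligands, protein, keyword_dir):
    """Write one keyword file per ligand and return the list of paths."""
    keyword_dir = Path(keyword_dir)
    keyword_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for ligand in ligands:
        path = keyword_dir / f"{ligand}.txt"
        path.write_text(TEMPLATE.format(protein=protein, ligand=ligand))
        paths.append(path)
    return paths
```

Each generated file can then be passed to ./fitted in turn, for instance from a shell loop.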


V. Preparing a keyword file for FITTED

The following sections list some of the keywords (those most frequently changed; the complete list is given elsewhere in this manual), their functions and default values. Gray shading indicates a required keyword; angle brackets <> indicate a numeric value; plain text indicates a text string (such as a file name); square brackets [choice1|choice2] indicate a choice of values, with the default shown in italics.

Note that keyword files are case-sensitive. Empty lines are allowed, and text after a pound sign (#) is considered a comment.
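The comment and empty-line rules can be illustrated with a short Python sketch. This is not the parser used by FITTED; the function name is hypothetical, and the sketch only shows how text after a pound sign and empty lines would be discarded while the case of the keywords is preserved.

```python
def parse_keyword_lines(lines):
    """Return a list of token lists, skipping comments and empty lines."""
    records = []
    for line in lines:
        # Everything after the first '#' is a comment.
        content = line.split("#", 1)[0].strip()
        if content:
            # Case is kept as-is because keyword files are case-sensitive.
            records.append(content.split())
    return records
```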

Although the value of many keywords can be altered, default values should be used unless a

specific system requires different settings. These keywords are essentially used by the developers for optimization and evaluation of the program. In general, modification of a specific value does not significantly improve or affect the accuracy but may result in longer or quicker docking runs.

At the end of this section, a typical keyword file can be found.

V.1. Input/output files

Protein_Conformations <# of files>

input_file_1

input_file_2

Following this keyword is the number of protein structure files used as input (same protein, different conformations). These protein files should be prepared using ProCESS prior to the actual docking.

On the following lines are the protein file names, one per line.

For each of the proteins listed, the following files should be present:

input_file_dock.mol2

input_file_score.mol2

input_file_site.txt

input_file_IS.mol2

The name listed in this keyword file should therefore not include extensions such as _dock.mol2 that will be automatically added by FITTED.
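Before starting a run, it can be useful to check that the four files expected for each protein name are present. The Python sketch below does this under the assumption that all files sit in a single input directory; the function name is hypothetical, and the suffixes are those listed above.

```python
from pathlib import Path

# Suffixes that FITTED appends to each protein name, as listed in this section.
SUFFIXES = ("_dock.mol2", "_score.mol2", "_site.txt", "_IS.mol2")


def missing_protein_files(base_names, input_dir):
    """Return the expected protein file names that are absent from input_dir."""
    input_dir = Path(input_dir)
    missing = []
    for base in base_names:
        for suffix in SUFFIXES:
            if not (input_dir / (base + suffix)).is_file():
                missing.append(base + suffix)
    return missing
```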

Ligand ligand_file.mol2

Name of the ligand file (in MOL2 format). This ligand file should be prepared using SMART prior to the actual docking.

Ref <#_of_files>

lig_ref_file1.mol2


lig_ref_file2.mol2

Following this keyword is an integer stating how many reference files are used to calculate the root-mean-square deviation (RMSD) of the ligand heavy atoms. These ligand files should be in the same reference frame as the protein structure. The possible symmetric conformations of the ligand are calculated in silico.

Two reference files may be needed in some instances where the ligand or the protein active site is Cn-symmetric (n ≥ 2).

On the following line(s), the reference file(s) (in MOL2 format) are listed, one per line.

If this keyword is missing, no RMSD values will be computed.
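For readers unfamiliar with the quantity, the Python sketch below computes a heavy-atom RMSD without superposition, since the reference is already in the frame of the protein. It assumes that hydrogens have been removed beforehand and that atoms are listed in the same order in both structures. Taking the lowest value over several reference files is an assumption of this example, and the symmetry handling performed by FITTED is not reproduced.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for point_a, point_b in zip(coords_a, coords_b):
        total += sum((a - b) ** 2 for a, b in zip(point_a, point_b))
    return math.sqrt(total / len(coords_a))


def best_rmsd(pose, references):
    """Lowest RMSD of the pose against any of the reference structures."""
    return min(rmsd(pose, ref) for ref in references)
```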

Output filename

Name of the output file.

Forcefield forcefield_file.txt

Name of the force field file to use, if a force field other than fitted_ff.txt is desired. The format of this force field file should be consistent with the format required by FITTED (see section II.5).

Binding_Site_Cav cavity_file.mol2

Following this keyword is the file defining the empty space present in the active site cavity (a set of spheres prepared by ProCESS).

If this keyword is missing, no grid filter will be used (it is highly recommended to use both the Interaction_Sites and Binding_Site_Cav keywords).

Interaction_Sites interaction_sites_file.mol2

Name of the file containing the interaction site description (prepared by ProCESS).

If this keyword is missing, no interaction site filter will be used (it is highly recommended to use both the Interaction_Sites and Binding_Site_Cav keywords).

Pharmacophore pharmacophore_file.mol2

Name of the file containing the pharmacophore constraints on the ligands (prepared by ProCESS). Typically this keyword is used to ensure that the individuals produced match this constraint, but the constraint can be softened by setting Min_PharmScore.

If this keyword is missing, no constraint will be used.

V.2. Run parameters

Mode [Dock|Filter|VS|Score|Local|SAR]

Dock

Normal docking run. No ligands are filtered out.

This is the default.

Filter

Filters out structures that do not meet Filter, Optional or Essential groups (see below).

Once filtering is done the program exits.


VS

Filters out structures that do not meet Filter, Optional or Essential groups (see below). If the ligand passes all the filters, the docking is performed; otherwise FITTED exits. Additional keywords are also provided (see below).

Score

Scores the ligand input structure in the provided orientation against all input proteins.

Local

Performs a local search on the ligand input structure. The provided orientation/translation/conformation is used as a starting point and only slight modifications to the ligand conformation, orientation and translation are carried out.

SAR

Performs a local search on the ligand input structure. The provided orientation/translation/conformation is used as a starting point and only slight modifications to the ligand orientation and translation are carried out, while a complete search of conformations is done.

Flex_Type [Rigid|Semiflex|Flex_water|Flex]

Rigid

The ligand is docked onto one protein structure.

This is the default if only one protein structure is used.

Semiflex

The ligand is docked onto multiple protein structures (requires Protein_Conformations ≥ 2). Proteins can be exchanged during the evolution, but not the genes corresponding to side chains or water molecules (a more complete description of this mode is given in reference 1).

This is the default if more than one protein structure is used.

Flex_water

The ligand is docked onto multiple protein structures (requires Protein_Conformations ≥ 2). Similar to Semiflex, except that each water molecule evolves independently.

Flex

The ligand is docked onto multiple protein structures (requires Protein_Conformations ≥ 2). The side chains and waters are allowed to be exchanged independently from the protein backbone.

Number_of_Runs <number of runs>

More than one run per ligand can be performed (the ligand may be docked several times to ensure a complete search).

If this keyword is missing, a single run is done.

The default value is 3 in Dock mode and 1 in all other modes.

V.3. Filtering parameters

The following keywords are used to filter out structures in VS or Filter modes only.


Max_Charge <max_charge>

If a ligand has a net charge higher than max_charge, the program exits.

Default is +2.

Min_Charge <min_charge>

If a ligand has a net charge lower than min_charge, the program exits.

Default is -2.

Max_MW <max_MW>

If a ligand has a molecular weight higher than max_MW, the program exits.

Default is 500.

Min_MW <min_MW>

If a ligand has a molecular weight lower than min_MW, the program exits.

Default is 250.

Max_HBD <max_HBD>

If a ligand has more hydrogen bond donors than max_HBD, the program exits.

Default is 5.

Min_HBD <min_HBD>

If a ligand has fewer hydrogen bond donors than min_HBD, the program exits.

Default is 0.

Max_HBA <max_HBA>

If a ligand has more hydrogen bond acceptors than max_HBA, the program exits.

Default is 10.

Min_HBA <min_HBA>

If a ligand has fewer hydrogen bond acceptors than min_HBA, the program exits.

Default is 0.

Max_Nrot <max_Nrot>

If a ligand has more rotatable bonds than max_Nrot, the program exits.

Default is 6.

Min_Nrot <min_Nrot>

If a ligand has fewer rotatable bonds than min_Nrot, the program exits.

Default is 0.

Max_Ionizable <max_ionizable>


If a ligand has more ionizable groups than max_ionizable, the program exits.

Default is 2.

Min_Ionizable <min_ionizable>

If a ligand has fewer ionizable groups than min_ionizable, the program exits.

Default is 0.

Max_Rings <max_rings>

If a ligand has more rings than max_rings, the program exits.

Default is 10.

Min_Rings <min_rings>

If a ligand has fewer rings than min_rings, the program exits.

Default is 0.

Max_O <max_O>

If a ligand has more oxygen atoms than max_O, the program exits.

Default is 100.

Min_O <min_O>

If a ligand has fewer oxygen atoms than min_O, the program exits.

Default is 0.

Max_N <max_N>

If a ligand has more nitrogen atoms than max_N, the program exits.

Default is 100.

Min_N <min_N>

If a ligand has fewer nitrogen atoms than min_N, the program exits.

Default is 0.

Max_S <max_S>

If a ligand has more sulfur atoms than max_S, the program exits.

Default is 100.

Min_S <min_S>

If a ligand has fewer sulfur atoms than min_S, the program exits.

Default is 0.

Max_Hetero <max_hetero>

If a ligand has more heteroatoms (N, S and O) than max_hetero, the program exits.


Default is 100.

Min_Hetero <min_hetero>

If a ligand has fewer heteroatoms (N, S and O) than min_hetero, the program exits.

Default is 0.

Max_Metal <max_metal>

If a ligand has more heavy atoms other than C, N, O, S, P than max_metal, the program exits.

Default is 0.

Min_Metal <min_metal>

If a ligand has fewer heavy atoms other than C, N, O, S, P than min_metal, the program exits.

Default is 0.

Max_Num_of_Atoms <max_atoms>

If a ligand has more atoms than max_atoms, the program exits.

Default is 10000.

Min_Num_of_Atoms <min_atoms>

If a ligand has fewer atoms than min_atoms, the program exits.

Default is 0.
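The Min_/Max_ keywords above all follow the same pattern, which can be sketched in Python as a range check per property. This is an illustration and not the FITTED implementation: the function and dictionary names are hypothetical, only a subset of the properties is shown, and the limits are the default values quoted in this section.

```python
# (minimum, maximum) defaults for a subset of the filtering keywords.
DEFAULT_LIMITS = {
    "Charge": (-2, 2),
    "MW": (250, 500),
    "HBD": (0, 5),
    "HBA": (0, 10),
    "Nrot": (0, 6),
    "Ionizable": (0, 2),
    "Rings": (0, 10),
}


def failed_filters(properties, limits=DEFAULT_LIMITS):
    """Return the names of the Min_/Max_ keywords violated by a ligand."""
    failures = []
    for name, (low, high) in limits.items():
        value = properties[name]
        if value < low:
            failures.append("Min_" + name)
        elif value > high:
            failures.append("Max_" + name)
    return failures
```

A ligand would be rejected as soon as this list is non-empty, which corresponds to the program exiting in VS or Filter mode.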

Filter <#_groups_filtered>

group_filtered1

group_filtered2

Number of functional groups that are filtered out. The name(s) of the filtered functional groups are listed below this keyword (see Table 3).

Optional <#_option_groups>

group_needed1

group_needed2

Number of functional groups of which at least one has to be present. The name(s) of these functional groups are listed below this keyword (see Table 3).

Essential <#_essential_groups>

group_needed1

group_needed2

Number of functional groups that are required. The name(s) of the needed functional groups are listed below this keyword (see Table 3).
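The combined effect of the Filter, Optional and Essential keywords can be summarized with set operations, as in the Python sketch below. The function name and argument names are hypothetical, and the logic is our reading of the three descriptions above rather than the actual FITTED code.

```python
def passes_group_filters(groups, filtered=(), optional=(), essential=()):
    """Apply Filter, Optional and Essential rules to a ligand's functional groups."""
    groups = set(groups)
    # Filter: any listed group present rejects the ligand.
    if groups & set(filtered):
        return False
    # Optional: if groups are listed, at least one of them must be present.
    if optional and not groups & set(optional):
        return False
    # Essential: every listed group must be present.
    return set(essential) <= groups
```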


Table 3. List of groups recognized by FITTED that can be listed after Filter, Optional or Essential.

Aromatic    Acid       Acceptor        Carbamate    Primary Amine
Aldehyde    Lactame    Azide           Ammoniun     Secondary Amine
Ester       Nitrile    Isocyanate      Oxime
Lactone     Imine      Acyl_Chloride   Ketone
Amide       Nitro      Sulphonamide    Boronate

V.4. Conjugate gradient parameters

The default values for all the keywords described in this section are recommended.

GA_* or GI_*

There are two sets of the following keywords: one for the parameters used during the generation of the initial population (GI_*; e.g., GI_MaxIter) and another one used during the evolution (GA_*; e.g., GA_MaxIter). The default values are recommended.

XX_MaxIter <maxiter>

Maximum number of iterations. Once this number is reached the minimization is finished.

The default is 20.

V.5. Energy parameters

Score_Initial [none|score|minimize]

Scoring of the initial ligand binding mode.

none

No scoring of the initial input structure is performed.

This is the default setting.

score

Only the score of the initial input ligand is output.

minimize

The scores of the initial pose and of the energy-minimized structure are both output.

V.6. Scoring parameters

The default values for all the keywords of this section are highly recommended, as they represent the scaling factors of RankScore2 (reference 7).


V.7. Initial population parameters

Pop_Size <pop_size>

Population size for the genetic algorithm conformational search.

The default is 100 for rigid docking and 200 for flexible docking.

Min_MatchScore <min_matchscore>

Minimum match of the interaction sites.

This keyword is used only if an interaction site file is provided. If Mode is set to Dock, Min_MatchScore is calculated automatically.

The default is 25.

Min_PharmScore <min_constraint>

This keyword is used only if a pharmacophore file is provided.

Minimum percent match of the pharmacophore.

The default is 100.

V.8. Evolution parameters

Max_Gen <max_gen>

Determines the maximum number of generations for the genetic algorithm.

The default is 200.

CutScore_1 <cutscore_1>

Upper bound score at Max_Gen to further proceed with the docking run. If at least one individual within the top 3 scores below CutScore_1, the program proceeds to Max_Gen_1.

The default is -4.

Max_Gen_1 <max_gen_1>

This keyword is used in VS mode only.

After Max_Gen generations, if none of the top poses has a score below the one specified by CutScore_1, the program exits. Otherwise, the program proceeds until it reaches Max_Gen_1.

The default is set to be Max_Gen.


CutScore_2 <cutscore_2>

Upper bound score at Max_Gen_1 to further proceed with the docking run. If at least one individual within the top 3 scores below CutScore_2, the program proceeds to Max_Gen_2.

The default is -5.5.

Max_Gen_2 <max_gen_2>

As for Max_Gen_1, if after Max_Gen_1 generations none of the top poses has a score below the one specified by CutScore_2, the program exits. Otherwise, the program proceeds until it reaches Max_Gen_2.

The default is Max_Gen.
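The staged termination described by Max_Gen, CutScore_1, Max_Gen_1, CutScore_2 and Max_Gen_2 can be sketched as two checkpoints. The Python code below is an illustration only: the function names and the scores_at_gen mapping (generation number to population scores) are hypothetical, and lower scores are assumed to be better, in line with the "below CutScore" wording.

```python
def passes_checkpoint(scores, cutscore, top_n=3):
    """True if at least one of the top_n (lowest) scores is below cutscore."""
    return any(score < cutscore for score in sorted(scores)[:top_n])


def last_generation(scores_at_gen, max_gen, max_gen_1, max_gen_2,
                    cutscore_1=-4.0, cutscore_2=-5.5):
    """Return the generation at which a VS-mode run would stop."""
    # First checkpoint, evaluated after Max_Gen generations.
    if not passes_checkpoint(scores_at_gen[max_gen], cutscore_1):
        return max_gen
    # Second checkpoint, evaluated after Max_Gen_1 generations.
    if not passes_checkpoint(scores_at_gen[max_gen_1], cutscore_2):
        return max_gen_1
    return max_gen_2
```

With this scheme, poorly scoring ligands are abandoned early and only promising ones receive the full number of generations.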

Seed <seed>

Select the starting point within the random number generator. If the same run is done with the same seed, the exact same result will be obtained. If a different seed is used, the GA will follow a different path. Changing the seed helps the developers to evaluate the convergence of a run.

The default is 100.

V.9. Docking of covalent inhibitors

This feature is under validation.

Covalent_Residue <residue_name>

Following this keyword is the name of the residue with which the covalent inhibitor will react (e.g., SER554). Only CYS and SER are implemented in the current version.

Covalent_Ligand [Only|Both]

Controls the covalent docking. FITTED automatically identifies aldehyde, boronate or nitrile groups (other groups will eventually be implemented) and assigns the proper atom types when covalent poses are considered.

Only

Only covalent poses will be considered


Both

Covalent and non-covalent poses will be considered concomitantly.

Proton_Moved_To <residue> <atom_name>

The proton will be moved to atom <atom_name> of residue <residue>.


V.10. A typical FITTED keyword file

##################################################################################################
#
#  FITTED
#  Flexibility Induced Through Targeted Evolutionary Description
#
#  Nicolas Moitessier, Christopher Corbeil, Pablo Englebienne
#  Jeremy Schwartzentruber
#  McGill University, Montreal, Canada
#  October 2008
#
##################################################################################################

#

#


##################################################################################################

Protein_Conformations 9 # Number of protein input files

1e2k_protein # File names

1e2p_protein #

1ki3_protein #

1ki4_protein #

1ki7_protein #

1ki8_protein #

2ki5_protein #

1of1_protein #

1qhi_protein #

Ligand 1e2k_lig.mol2 # Ligand structure file

Output 1e2k_run_1 # File that will contain the output

Forcefield fitted_ff.txt # Force field file

Ref 1 # Number of reference ligand files (for RMSD)

1e2k_lig.mol2 # Name of the ligand ref. file

Binding_Site_Cav cavity_1e2k.mol2 # Cavity file name

Interaction_Sites InterSite_1e2k.mol2 # Interaction site file name

#

# FILTERING (used in VS and Filter modes)

##################################################################################################

#

# Example:

# keep compounds with: 250 < MW < 400, filtering out aldehydes and nitro-containing compounds

# and docking only compounds with aromatic group(s)

#-------------------------------------------------------------------------------------------------

Min_MW 250

Max_MW 400

Filter 2

Nitro

Aldehyde

Essential 1

Aromatic


#

#

# GENETIC ALGORITHM PARAMETERS

##################################################################################################

#

# Scoring of the input structure and creation of the initial population

#-------------------------------------------------------------------------------------------------

Mode Dock # Can be Dock, Filter, VS, Score, Local, or SAR

Flex_Type Semiflex # Can be Rigid, Semiflex, Flex, Flex_water

#

#

#

###############################################################################


VI. Preparing a keyword file for ProCESS

The following section lists the keywords, their functions and default values. Gray shading indicates a required keyword; angle brackets <> indicate a numeric value; plain text indicates

a text string (such as a file name); square brackets [] indicate a choice of values, the default shown

in italics.

ProCESS keyword files are case-sensitive. Empty lines are allowed, and text after a pound sign (#) is considered a comment.

Although the value of many keywords can be altered, default values should be used unless a specific system requires different settings.


VI.1. Input/output files

Protein <#_protein_struct>

protein_file1.mol2

protein_file2.mol2

Following the keyword, specify the number of protein structure files to be processed.

On the following lines, specify the protein file names, one per line.

Output output_filename

Name of the file that will contain the output structure.


Binding_Site_Cav cavity_filename

Name of the file where to output the binding site cavity.

If this keyword is not present ProCESS will not create a binding site cavity file.

Interaction_Sites pharmacophore_filename

Name of the file where to output the interaction sites definition file.

If this keyword is not present ProCESS will not create an interaction sites definition file.

Binding_Site <#_flex_residues>

flex_residue_1_name

flex_residue_2_name

Manually defines the active site (the active site can also be defined automatically by providing a ligand, see below).


On the same line following this keyword, specify the number of flexible residues.

On subsequent lines, the residue name/numbers (according to Find_Residues) are specified, one

per line.

VI.2. Reading the input files and preparing the output protein files

Renumber_Residues <first_residue_number>

Specify the new number for the first residue; the rest will be sequentially renumbered.

This feature is useful if the protein is a multimer, having multiple residues with the same group name (e.g., two Pro60, two Asp25 as in HIV-1 protease).

AutoFind_Site [Y|N]

This function allows the user to have ProCESS automatically find the flexible residues/binding site.

The default is N.

AutoFind_Center [Y|N]

This function allows the user to have ProCESS automatically find the center of the binding site.

The default is N.

Ligand ligandfile.mol2

Ligand file (in MOL2 format) used to define the active site and its center. It should be in the same reference frame as the protein.

Ligand_Cutoff <ligand_cutoff>

Protein residues within this cutoff (in Å) are considered part of the binding site.

The default is 6.0.

Truncate [Y|N|auto]

Determines whether the protein will be truncated, keeping only residues within Cutoff of the binding site residues.

The default is auto.

auto

The protein will be truncated, keeping residues within the cutoff distance of the ligand rather than of the binding site residues.

Cutoff <cutoff>

Any residue that does not have an atom within this distance (in Å) from an atom of a flexible residue or of the given ligand will be deleted from the protein file that ProCESS will output.

The default value is 11 when Truncate is set to auto and 9 when Truncate is set to Y.
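The distance criterion used for truncation can be illustrated with the Python sketch below, which keeps a residue if any of its atoms lies within the cutoff of any reference atom (an atom of the ligand or of a flexible residue). The function name and the data layout (a dictionary of residue names to coordinate lists) are assumptions of this example, not the ProCESS internals.

```python
import math


def residues_to_keep(residue_atoms, reference_atoms, cutoff=11.0):
    """Return the names of residues with at least one atom within cutoff (in Å)."""
    keep = []
    for name, atoms in residue_atoms.items():
        # A single close contact is enough to retain the whole residue.
        if any(math.dist(atom, ref) <= cutoff
               for atom in atoms for ref in reference_atoms):
            keep.append(name)
    return keep
```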

VI.3. Parameters for the binding cavity file

Grid_Center <grid_center>


Explicitly defines the center of the binding site.

The default is to automatically find it using the center of a ligand.

Grid_Size <size>

Specifies the size of the box for the binding site.

The default is 12.5.

VI.4. Parameters for the Interaction Sites file

XXX_Weight <xxx_weight>

This group of keywords (XXX being Hydrophobic, Metal, HBA or HBD) specifies the parameters for the assignment of pharmacophoric points. xxx_weight is the weight given to favourable XXX-type interactions. Default parameters are highly recommended.

Hydrophobic_Weight <hydro_weight>

Defines the weight for hydrophobic interaction points.

The default is 1.

Metal_Weight <metal_weight>

Defines the weight for metal interaction points.

The default is 50.

HBA_Weight <hba_weight>

Defines the weight for hydrogen bond acceptor interaction points.

The default is 5.

HBD_Weight <hbd_weight> <hbd_penalty>

Defines the weight for hydrogen bond donor interaction points.

The default is 5.

If too many points are found, one can reduce this number by using the following keywords:

Pharm_Polar_Softness <pharm_polar_soft>

Maximum distance (in Å) between two polar points to merge.

The default is 0.0.

Pharm_Nonpolar_Softness <pharm_nonpolar_soft>

Maximum distance (in Å) between two non-polar points to merge.

The default is 0.0.
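One simple way to picture the effect of the two softness keywords is a greedy merge in which a point closer than the softness distance to an existing cluster center is absorbed into that cluster. The Python sketch below is a hypothetical illustration of that idea; the merging scheme actually used by ProCESS is not described here and may differ. A strict comparison is used so that a softness of 0.0 leaves all points untouched.

```python
import math


def merge_points(points, softness):
    """Greedily merge points lying closer than softness (in Å) to a cluster center."""
    clusters = []  # each cluster is the list of its member points
    for point in points:
        for members in clusters:
            center = tuple(sum(c) / len(members) for c in zip(*members))
            if math.dist(point, center) < softness:
                members.append(point)
                break
        else:
            # No cluster was close enough: start a new one.
            clusters.append([point])
    # Each cluster is represented by the average position of its members.
    return [tuple(sum(c) / len(m) for c in zip(*m)) for m in clusters]
```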

Hydrophobic_Level <hydro_level>

Van der Waals interaction energy between a probe placed on the grid point and the hydrophobic carbons, below which the point is considered hydrophobic. If the interaction is found to be lower than hydro_level, a hydrophobic point is added at this location. For more information, see the section on Interaction Sites/Pharmacophore generation.


Min_Weight <min_weight>

Minimum weight for a pharmacophoric point to be included in the final pharmacophore.

The defaults are 0.5 and 0.0, respectively.

Num_of_IS <num_of_spheres>

This determines the maximum number of interaction site spheres in the interaction sites file.

The default is 75.


VI.5. A typical ProCESS keyword file

##################################################################################################
#
#  ProCESS
#  Protein Conformational Ensemble System Setup
#
#  Nicolas Moitessier, Christopher Corbeil, Pablo Englebienne
#  Jeremy Schwartzentruber
#  McGill University, Montreal, Canada
#  October 2008
#
##################################################################################################

#

#


##################################################################################################

Protein 1

protein.mol2

Output protein # File that will contain the output structure

Binding_Site_Cav cavity.mo # File that will contain the binding site file

Interaction_Sites site.mol2 # File that will contain the pharmacophore

#

#

# PROTEIN DESCRIPTION

##################################################################################################

AutoFind_Site yes # Finding site automatically

Ligand lig.mol2 # Ligand used to find center and site

Ligand_Cutoff 9 # Residues within cutoff are part of the binding site#

#

# ACTIONS

##################################################################################################

Assign_G yes # Assigning residue names

Truncate auto # Truncates the protein keeping residues within cutoff

Cutoff 7 # Cutoff distance

United yes # Makes the united atom representation

#

#

# INTERACTION SITES DESCRIPTION

##################################################################################################

Pharm_Polar_Softness 0.65 # max distance between two polar points to merge

Pharm_Nonpolar_Softness 0.9 # max distance between two non polar points to merge

Pharm_Aromatic_Softness 1.9 # max distance between two aromatic points to merge

Aromatic_Weight 1 # weight given to aromatic points

Metal_Weight 8 # weight given to metal points

Hydrophobic_Weight 1 # weight given to hydrophobic points

HBA_Weight 5 # weight given to HBA points

HBD_Weight 5 # weight given to HBD points

Hydrophobic_Level -0.35 # vdw interaction with hydrophobic carbons

Hydrophobic_Resolution 0.35 # grid resolution for computation of hydrophobic points

Min_Weight 2 # Minimum weight to be included in the pharmacophore

Num_of_IS 50 # Maximum number of beads


#

#

# ACTIVE SITE CAVITY DESCRIPTION

##################################################################################################

Pharmacophore pharm # File that will contain the pharmacophore

Grid_Boundary soft # Grid computed within a box (hard) or not (soft)

Grid_Resolution 1.5 # Grid resolution

Grid_Size 10 # Grid size

Grid_Clash 1.5 # Max dist to consider a point clashing with the protein

#

#

#

##################################################################################################


VII. Running SMART

SMART is the module used to prepare ligand structures in a modified MOL2 format for use by FITTED. It can also assign MMFF charges and prepare ligand structures for use with ACE (Asymmetric Catalyst Evaluation).

SMART has two modes of operation: interactive and command-line. The interactive mode is started by calling smart without any arguments. The program will then request user input to determine the mode of operation and the file to be processed. Note that not all options are available in interactive mode.

The command-line argument syntax of SMART is as follows (arguments in angle brackets < > are mandatory, arguments in square brackets [ ] are optional):

./smart <mode> [OPTIONS] <ligandfile>

mode is one of the following:

-fitted, -f assign GAFF atom types, write file in FITTED format (default)

-ace, -a assign MM3 atom types, write file in ACE format

If specified, the optional arguments modify the default behaviour of SMART:

-in sd read input file in SD/MOL format

-in fitted read input file in FITTED MOL2 format

-out std output standard MOL2 format instead

-out debug output verbose MOL2 format instead (useful for debugging)

-multi write a single multi-MOL2 file as output

-o <name> specify name for output files <name>.mol2 and <name>.log

-m XXX[-YYY] process only molecules within the specified range. The range can be one of: i) XXX: process a single molecule; ii) XXX-YYY: process molecules #XXX to #YYY (inclusive); iii) XXX-: process from molecule XXX until the end.

-nocharge do not assign MMFF charges

-nobond do not reassign bond order

The standard format output may be useful for visualization software (as it assigns standard Tripos atom types) or as input to other programs. If the options -nocharge or -nobond are specified, the atomic charges and/or the bond orders from the input file are preserved. If the option -in [sd|fitted] is not specified, SMART expects an input file in MOL2 format.


ligandfile specifies a file in standard MOL2 file format (http://www.tripos.com/data/support/mol2.pdf) or standard SD/MOL file format (http://www.mdli.com/downloads/public/ctfile/ctfile.jsp). It can contain either a single molecule or multiple molecules, and it has to be located in the input/ directory. In the case of a multi-molecule input file, there will be one ligandfile.out output file and multiple ligandfile_X.mol2 output files, unless the -multi option is used, in which case there will be a single ligandfile.mol2 file.

The -multi option writes a single multi-MOL2 file as output. Note that the current version of FITTED cannot use multi-MOL2 files as input; an awk script (scripts/separate_mol2.awk) is provided to split a multi-MOL2 file into single-molecule MOL2 files. Additionally, another script (scripts/extract_mol2.awk) is provided to extract a specific molecule from a multi-MOL2 file. For example, to extract molecule #99 from a multi-MOL2 file, the following command can be issued (from the smart/output/ directory):

../scripts/extract_mol2.awk m=99 ligandfile.mol2

The options -o and -m can be used in combination to quickly process a large set of molecules in a multi-processor environment (e.g., a multiple-core workstation or a cluster). Say we have an SD file ligands.sd with 249,543 molecules. Instead of processing them with one instance of SMART, we can use 5 instances, each processing a chunk of up to 50,000 molecules, thus significantly reducing the wall-clock time required. The commands to do this are:

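./smart -in sd -m 1-50000 -o ligands_1 -multi ligands.sd

./smart -in sd -m 50001-100000 -o ligands_2 -multi ligands.sd

./smart -in sd -m 100001-150000 -o ligands_3 -multi ligands.sd

./smart -in sd -m 150001-200000 -o ligands_4 -multi ligands.sd

./smart -in sd -m 200001- -o ligands_5 -multi ligands.sd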

Note the last command: because there are fewer than 50,000 molecules left, we instruct SMART to process from molecule 200001 until the end of the file. The following step would be to make individual MOL2 files for use as FITTED input. This is accomplished by using an awk script to separate the multi-MOL2 files into individual files:

cat ligands_1.mol2 ligands_2.mol2 ligands_3.mol2 ligands_4.mol2 \

ligands_5.mol2 | ../scripts/separate_mol2.awk

This will generate files in the form ligand_N.mol2, with N going from 1 to 249,543.
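For other library sizes, the -m ranges can be generated rather than typed by hand. The Python sketch below builds the range strings and the corresponding SMART command lines using the options documented in this section; the function names and the ligands_N output naming are assumptions of this example.

```python
def chunk_ranges(total, chunk_size):
    """Return -m range strings covering molecules 1..total in chunks."""
    ranges = []
    start = 1
    while start <= total:
        end = start + chunk_size - 1
        # The last chunk is left open-ended ("XXX-") so it runs to the end of the file.
        ranges.append(f"{start}-{end}" if end < total else f"{start}-")
        start = end + 1
    return ranges


def smart_commands(sd_file, total, chunk_size, prefix="ligands"):
    """Build one SMART command line per chunk of the SD file."""
    return [
        f"./smart -in sd -m {rng} -o {prefix}_{i} -multi {sd_file}"
        for i, rng in enumerate(chunk_ranges(total, chunk_size), start=1)
    ]
```

Calling smart_commands("ligands.sd", 249543, 50000) yields five command lines, one per SMART instance.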


FITTED Input File Formats

The following sections outline the file formats used for FITTED. FITTED uses a modified version of the Sybyl MOL2 format for all of its input files. For more information on the original Sybyl MOL2 format, visit http://www.tripos.com/data/support/mol2.pdf. The following is an example of a standard MOL2 formatted file.

Column #   Description
1          Atom number
2          Atom name
3          x coordinate
4          y coordinate
5          z coordinate
6          Atom type
7          Group number
8          Group name
9          Partial charge
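The nine columns of a standard MOL2 atom record can be read with a few lines of Python. The sketch below is a minimal whitespace-based reader for a single atom line; the class and function names are hypothetical, and the extra columns of the FITTED-specific formats described in the following sections are ignored.

```python
from typing import NamedTuple


class Mol2Atom(NamedTuple):
    number: int
    name: str
    x: float
    y: float
    z: float
    atom_type: str
    group_number: int
    group_name: str
    charge: float


def parse_atom_line(line):
    """Parse the first nine whitespace-separated columns of a MOL2 atom line."""
    fields = line.split()
    if len(fields) < 9:
        raise ValueError("expected at least 9 columns in a MOL2 atom line")
    return Mol2Atom(int(fields[0]), fields[1],
                    float(fields[2]), float(fields[3]), float(fields[4]),
                    fields[5], int(fields[6]), fields[7], float(fields[8]))
```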


II.1. The protein files

II.1.1. The XXXX_dock.mol2

The format of the file outputted by ProCESS is a standard MOL2 format with the following changes:

@<TRIPOS>ATOM section:

o the atom types (column 6) are Amber united atom types instead of Sybyl atom types

o the group names (column 8) include the advanced residue names (see Appendix A)

@<TRIPOS>BOND section:

o only bonds for the flexible residues are listed

Column #   Description
1          Atom number
2          Atom name
3          x coordinate
4          y coordinate
5          z coordinate
6          Atom type
7          Group number
8          Group name
9          Partial charge
10         Misc. Information

II.1.2. The XXXX_score.mol2 file

The format of the file outputted by ProCESS is a standard MOL2 format with the following changes:


@<TRIPOS>ATOM section:

o the atom types (column 6) are Amber united atom types instead of Sybyl atom types

o the group names (column 8) include the advanced residue names (see Appendix A)

@<TRIPOS>BOND section:

o all bonds are listed

Column #   Description
1          Atom number
2          Atom name
3          x coordinate
4          y coordinate
5          z coordinate
6          Atom type
7          Group number
8          Group name
9          Partial charge
10         Misc. Information
11         Scaling factor
12         Place Holder
13         OPLS Atom Type
14         Place Holder
15         van der Waals Radii
16         Atom Volume
17         Atomic Solvation
18         vdW solvation
19 and >   Water solvation


II.1.3. The XXXX_site.txt file

The XXXX_site.txt file is outputted by ProCESS and contains the binding site residue list. The first line of the file must start with Site followed by the number of residues. The following lines list the names of the binding site residues, one per line.
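A minimal Python sketch of a reader for this layout is given below. It is an illustration under stated assumptions rather than the FITTED implementation: the function name parse_site_file is hypothetical, the header fields are assumed to be separated by whitespace, and the residue names in the example are made up.

```python
def parse_site_file(text):
    """Return the residue names listed in a XXXX_site.txt style file.

    Assumptions: the first non-empty line reads "Site <count>" and the
    next <count> non-empty lines each hold one residue name.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty site file")
    header = lines[0].split()
    if len(header) < 2 or header[0] != "Site":
        raise ValueError("first line must start with 'Site' and a count")
    count = int(header[1])
    residues = lines[1:1 + count]
    if len(residues) != count:
        raise ValueError(f"expected {count} residues, found {len(residues)}")
    return residues


if __name__ == "__main__":
    example = "Site 3\nASP25\nGLY27\nILE50\n"  # hypothetical residue names
    print(parse_site_file(example))
```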

II.2. The Ligand file

The format outputted by SMART is based on the MOL2 format. Some modifications were introduced in order to implement the bitstring describing the presence of functional groups, and to aid in checking the chirality, atom connectivity and ring perception. The changes from the standard MOL2 format are as follows:


@<TRIPOS>MOLECULE section:

o the second line (data associated with the molecule) is expanded by a number of fields describing the ligand and the functional groups present (bitstring). The presence of a particular group is indicated by a 1 in the respective field. The order of the fields is as follows:

number of atoms

number of bonds

molecular weight

net charge

number of hydrogen bond donors

number of hydrogen bond acceptors

number of rotatable bonds

number of rings

number of ionisable groups

presence of aromatic group

presence of aldehyde

presence of ester

presence of lactone

presence of amide

presence of amide

presence of lactam

presence of acid

presence of nitrile

presence of imine

presence of nitro

presence of Michael acceptor

presence of azide

presence of isocyanate

presence of acyl chloride


presence of sulphonamide

presence of carbamate

presence of ammonium

presence of oxime

presence of secondary amine

presence of primary amine

presence of ketone

presence of boronate

number of oxygens

number of nitrogens

number of sulphurs

number of hetero atoms

number of toxic metals

@<TRIPOS>ATOM section:

o the Sybyl atom types (column 6) are replaced by GAFF atom types; the corresponding Sybyl atom types are stored in column 11.

o the number of hydrogen atoms attached to an atom is stated in column 12.

o the hybridization of an atom is stated in column 13.

@<TRIPOS>BOND section:

o an additional column specifies the bond as rotatable (r) or non-rotatable (nr).

Column #   Description
1          Atom number
2          Atom name
3          x coordinate
4          y coordinate
5          z coordinate
6          Atom type
7          Group number
8          Group name
9          Partial charge
10         Misc. Information
11         Sybyl atom type
12         Number of hydrogens
13         Hybridization


II.3. The binding site cavity file

The binding site cavity file describes the empty space within the protein as a collection of spheres of different radii. It resembles a MOL2 formatted file with the following changes:

@<TRIPOS>MOLECULE section:

o on the second line the number of spheres is specified as the first field; fields 2-5 are 0.

@<TRIPOS>ATOM section:

o column 6 (Sybyl atom type) is unnecessary and is therefore replaced by a dash.

o column 9 (partial charge) is replaced by the radius of the sphere.

Column #   Description
1          Point number
2          Point name
3          x coordinate
4          y coordinate
5          z coordinate
6          Point type
7          Group number
8          Group name
9          Radius
10         Misc. information
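To illustrate how a set of spheres can describe empty space, the Python sketch below reads the centre and radius from a cavity line and tests whether a point falls inside any sphere. How FITTED actually uses the spheres is not specified here, so the "inside any sphere" test is an assumption; the function names and the sample line are hypothetical.

```python
import math


def parse_sphere_line(line):
    """Return ((x, y, z), radius) from a cavity-file point line.

    Columns 3-5 hold the coordinates and column 9 the radius.
    """
    fields = line.split()
    centre = (float(fields[2]), float(fields[3]), float(fields[4]))
    radius = float(fields[8])
    return centre, radius


def in_cavity(point, spheres):
    """True if the point lies within at least one sphere (assumed criterion)."""
    return any(math.dist(point, centre) <= radius for centre, radius in spheres)


if __name__ == "__main__":
    # Made-up cavity line: column 6 is a dash, column 9 is the radius.
    spheres = [parse_sphere_line("1 SPH 1.0 2.0 3.0 - 1 CAV 1.5")]
    print(in_cavity((1.0, 2.0, 4.0), spheres))
    print(in_cavity((1.0, 2.0, 5.0), spheres))
```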


II.4. The interaction site, pharmacophore and XXXX_IS.mol2 files

The interaction site and pharmacophore files are used to create conformations that already interact well with the protein. Again, the format resembles the MOL2 format, with additional columns for the interaction site type and weight:

@<TRIPOS>MOLECULE section:

o on the second line the number of constraints is specified as the first field; fields 2-5 are 0.

@<TRIPOS>ATOM section:

o column 6 (Sybyl atom type) is unnecessary and is therefore replaced by a dash.

o column 7 (group number) is replaced by a point type descriptor.

o column 9 (partial charge) is replaced by the radius of the constraint.

o column 10 specifies the type of the pharmacophoric point (HBD, HBA, HYD, ARO, or any combination such as HBA/HYD).

o column 11 specifies the weight of the constraint.

Column #   Description
1          Point number
2          Point name
3          x coordinate
4          y coordinate
5          z coordinate
6          -
7          Point type
8          PHARM
9          Radius
10         Pharmacophoric type
11         Weight


II.5. The force field file

The force field file is where all the parameters for the FITTED force field are kept. Additionally, SMART uses the force field file to assign MMFF charges to molecules if so requested. The force field is a modified GAFF force field [17] with MMFF [15] charge parameters in free format, so it can be edited with any text editor. Although most of the parameters for drug-like molecules are present, some may be missing. When adding a parameter to the force field file, some rules must be followed.

Each section starts with a title (e.g., #fitted_bond_parameters), followed by the actual parameters (e.g., 1.0 1 c c 1.5500 290.100), and ends with a line of blank parameters designated by stars (e.g., 0.0 1 * * 0.0000 0.000). The title and end lines must not be removed; any line added before the title line or after the end line will be ignored.

FITTED also allows for the use of wildcard parameters for angles and torsions, where I, J, K or L can represent any atom type, by using the wildcard character * in the respective column.

A line with wildcards (*) for all the atoms is read as an end line.
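The section layout described above (title line, parameter lines, all-wildcard end line) can be sketched as a small Python reader. This is an illustration, not the FITTED parser: the function name read_section is hypothetical, and the sketch assumes that lines starting with # are separators or comments and that the atom types begin in column 3, as in the sections with Ver and Ref columns.

```python
def read_section(lines, title, n_types):
    """Collect the parameter rows between `title` and the end line.

    n_types is the number of atom-type columns (2 for bonds, 3 for
    angles, 4 for torsions); a row whose atom types are all "*" is
    treated as the end line, and everything after it is ignored.
    """
    rows = []
    in_section = False
    for raw in lines:
        line = raw.strip()
        if not in_section:
            in_section = (line == title)
            continue
        if not line or line.startswith("#"):
            continue  # separator or comment line (assumption)
        fields = line.split()
        if all(t == "*" for t in fields[2:2 + n_types]):
            break  # end line reached
        rows.append(fields)
    return rows


if __name__ == "__main__":
    sample = [
        "#Ver Ref I J R0 K2",
        "#fitted_bond_parameters",
        "#-----",
        "1.0 1 c c 1.5500 290.100",
        "1.0 1 c c1 1.4600 379.800",
        "0.0 1 * * 0.0000 0.000",
        "1.0 1 c c2 1.4060 449.900",  # after the end line: ignored
    ]
    for row in read_section(sample, "#fitted_bond_parameters", 2):
        print(row)
```

The last line of the sample shows why new parameters must be inserted above the end line.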

II.5.1. Adding parameters to the bond list

The bond list starts 2 lines below the #fitted_bond_parameters title, and its end is signaled by having both the I and J atom types (columns 3 and 4) as *. Any parameters added after this last line will be ignored, and the line itself must not be removed. Each parameter added must be on a single line in the following format:

Units: R0 (Å), K2 (kcal/mol Å²)

Column #   Description
1          Force field file version number
2          Reference number
3          Atom type of I
4          Atom type of J
5          R0
6          K2

#----------------------------------------------

#Ver Ref I J R0 K2

#----------------------------------------------

#fitted_bond_parameters

#----------------------------------------------

1.0 1 c c 1.5500 290.100

1.0 1 c c1 1.4600 379.800

1.0 1 c c2 1.4060 449.900

1.0 1 c c3 1.5080 328.300

1.0 1 c ca 1.4870 349.700

[...]

1.2 1 ct ss 1.7700 256.600

1.2 1 ct nh 1.3640 449.000

1.2 1 nt nt 1.3400 450.000

0.0 1 * * 0.0000 0.000

Adding parameters to the angle list

The angle list starts 2 lines below the #fitted_angle_parameters title, and the end is signaled

by having all I, J and K atom types (columns 3-5) as *. FITTED also allows for the use of wildcard

parameters, where I and/or K can represent any atom, by using the wildcard character * in the

respective column. Parameters added to the force field including wildcards should be placed at the


end of the angle list. The less specific the parameter (higher number of wildcards), the lower in the list it should be placed. The parameter added must be in a single line in the following format:

Units: Theta0 (degrees), K2 (kcal/mol rad²)

Column #   Description
1          Force field file version number
2          Reference number
3          Atom type of I
4          Atom type of J
5          Atom type of K
6          Theta0
7          K2

#----------------------------------------------------

# E = K2 * (Theta - Theta0)^2

#----------------------------------------------------

#Ver Ref I J K Theta0 K2

#----------------------------------------------------

#fitted_angle_parameters

#----------------------------------------------------

1.0 1 hw ow hw 104.5200 100.0000

1.0 1 hw hw ow 127.7400 0.0000

1.0 1 br c br 113.1000 66.9000

1.0 1 br c c3 110.7400 63.3000

1.0 1 br c o 121.4600 63.2000

[...]

1.2 1 * n4 hn 109.0000 35.0000

1.2 1 * n4 * 109.5000 60.0000

1.2 1 * na * 120.0000 60.0000

1.2 1 * nb * 120.0000 60.0000

0.0 1 * * * 0.0000 0.0000
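The angle term stated in the header of the list above, E = K2 * (Theta - Theta0)^2, can be evaluated as in the following Python sketch. It assumes that Theta0 is tabulated in degrees and that K2 is expressed per squared radian, so the deviation is converted to radians before squaring; the function name angle_energy is hypothetical.

```python
import math


def angle_energy(theta_deg, theta0_deg, k2):
    """E = K2 * (Theta - Theta0)^2, with the deviation taken in radians.

    Assumption: angles are given in degrees and K2 in kcal/mol per rad^2.
    """
    delta = math.radians(theta_deg - theta0_deg)
    return k2 * delta * delta


if __name__ == "__main__":
    # hw-ow-hw parameters from the list above: Theta0 = 104.52, K2 = 100.0
    for theta in (100.0, 104.52, 110.0):
        print(theta, angle_energy(theta, 104.52, 100.0))
```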

Adding parameters to the torsion list

The torsion list starts 2 lines below the #fitted_torsion_parameters title, and the end of the

list is signaled by having all I, J, K and L atom types as *. FITTED also allows for the use of

wildcard parameters, where I and L can represent any atom, by using the wildcard character * in the

respective column. Parameters added to the force field including wildcards should be placed at the beginning of the torsion list. The parameter added must be in a single line in the following format:

Units: V (kcal/mol)

Column #   Description
1          Force field file version number
2          Reference number
3          Atom type of I
4          Atom type of J
5          Atom type of K
6          Atom type of L
7          V1
8          Phi0(1)
9          V2
10         Phi0(2)
11         V3
12         Phi0(3)
13         V4
14         Phi0(4)

#----------------------------------------------------------------------------------------

# E = SUM(n=1,4) { (Vn/m) * [ 1 + cos(n*Phi - Phi0(n)) ] }

# m = multiplicity or total number of torsions centered on the same bond.

#----------------------------------------------------------------------------------------

#Ver Ref I J K L V1 Phi0 V2 Phi0 V3 Phi0 V4 Phi0

#----------------------------------------------------------------------------------------

#fitted_torsion_parameters

#----------------------------------------------------------------------------------------

1.0 1 * c c * 0.0000 0.00 1.2000 180.00 0.0000 0.00 0.0000 0.00

1.0 1 * c c1 * 0.0000 0.00 0.0000 180.00 0.0000 0.00 0.0000 0.00

1.0 1 * c cg * 0.0000 0.00 0.0000 180.00 0.0000 0.00 0.0000 0.00

1.0 1 * c ch * 0.0000 0.00 0.0000 180.00 0.0000 0.00 0.0000 0.00

1.0 1 * c c2 * 0.0000 0.00 8.7000 180.00 0.0000 0.00 0.0000 0.00

[...]

1.0 1 hc c3 c3 f 0.1900 0.00 0.0000 0.00 0.0000 0.00 0.0000 0.00

1.0 1 hc c3 c3 cl 0.2500 0.00 0.0000 0.00 0.0000 0.00 0.0000 0.00

1.0 1 hc c3 c3 br 0.5500 0.00 0.0000 0.00 0.0000 0.00 0.0000 0.00

0.0 1 * * * * 0.0000 0.00 0.0000 0.00 0.0000 0.00 0.0000 0.00
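The torsion term given in the header of the list above, E = SUM(n=1,4) { (Vn/m) * [ 1 + cos(n*Phi - Phi0(n)) ] }, can be sketched in Python as follows. The function name torsion_energy is hypothetical, and the sketch assumes that Phi and Phi0 are in degrees.

```python
import math


def torsion_energy(phi_deg, terms, m=1):
    """Sum the four cosine terms of the torsion energy.

    terms is [(V1, Phi0_1), (V2, Phi0_2), (V3, Phi0_3), (V4, Phi0_4)];
    m is the number of torsions centred on the same bond.
    Assumption: all angles are in degrees.
    """
    total = 0.0
    for n, (v, phi0_deg) in enumerate(terms, start=1):
        angle = math.radians(n * phi_deg - phi0_deg)
        total += (v / m) * (1.0 + math.cos(angle))
    return total


if __name__ == "__main__":
    # "* c c2 *" parameters from the list above: only V2 = 8.7 is non-zero.
    terms = [(0.0, 0.0), (8.7, 180.0), (0.0, 0.0), (0.0, 0.0)]
    for phi in (0.0, 90.0, 180.0):
        print(phi, torsion_energy(phi, terms))
```

With these parameters the energy is lowest for the planar arrangements (Phi = 0 or 180) and highest at Phi = 90.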


Adding parameters to the out-of-plane list

The out-of-plane list starts 2 lines below the #fitted_oop_parameters title, and the end of the list is signaled by having all I, J, K and L atom types as *. FITTED also allows for the use of wildcard parameters, where I, K and/or L can represent any atom, by using the wildcard character * in the respective column. Parameters including wildcards should be placed at the end of the out-of-plane list. The less specific the parameter (higher number of wildcards), the lower in the list it should be placed. Each parameter added must be on a single line in the following format:

Units: Kchi (kcal/mol)

Column #   Description
1          Force field file version number
2          Reference number
3          Atom type of I
4          Atom type of J
5          Atom type of K
6          Atom type of L
7          Kchi
8          n
9          Chi0

#-----------------------------------------------------------------------

# E = Kchi * [ 1 + cos(n*Chi - Chi0) ]

#-----------------------------------------------------------------------

#Ver Ref I J K L Kchi n Chi0

#-----------------------------------------------------------------------

#fitted_oop_parameters

#-----------------------------------------------------------------------

1.0 1 c c2 c2 c3 1.1000 2 180.0000

1.0 1 c ca c3 ca 1.1000 2 180.0000

1.0 1 c n c3 hn 1.1000 2 180.0000

1.0 1 c n c3 o 1.1000 2 180.0000

1.0 1 c2 na c2 c3 1.1000 2 180.0000

[...]

1.0 1 c2 c2 * * 10.5000 2 180.0000

1.0 1 o c * * 1.1000 2 180.0000

1.0 1 * c * * 10.5000 2 180.0000

1.0 1 * ca * * 7.1000 2 180.0000

0.0 1 * * * * 0.0000 2 180.0000

Adding parameters to the van der Waals list

The vdW list starts 2 lines below the #fitted_vdW_parameters title, and the end of the list is signaled by having the I atom type as *. Each parameter added must be on a single line in the following format:

Units: Ri* (Å), EPSi (kcal/mol)

Column #   Description
1          Force field file version number
2          Reference number
3          Atom type of I
4          Ri*
5          EPSi

#------------------------------------------------

#type r-eps

#combination arithmetic

#------------------------------------------------

# E = EPSij * { (Rij*/Rij)^12 - 2(Rij*/Rij)^6 }

# where EPSij = sqrt( EPSi * EPSj)

# Rij* = (Ri* + Rj*)/2

#------------------------------------------------

#Ver Ref I Ri* EPSi

#------------------------------------------------

#fitted_vdW_parameters

#------------------------------------------------


1.0 1 h1 2.7740 0.01570

1.0 1 h2 2.5740 0.01570

1.0 1 h3 2.3740 0.01570

1.0 1 h4 2.8180 0.01500

1.0 1 h5 2.7180 0.01500

[...]

1.2 1 n0 3.7360 0.00277

1.2 1 k 5.3160 0.000328

1.2 1 zn2 2.2000 0.0125

0.0 1 * 0.0000 0.00000
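The pair energy and combination rules given in the header of the list above can be sketched in Python as follows. The function name vdw_energy is hypothetical; the formulas are those of the header, E = EPSij * { (Rij*/Rij)^12 - 2(Rij*/Rij)^6 }, with EPSij = sqrt(EPSi * EPSj) and Rij* = (Ri* + Rj*)/2.

```python
import math


def vdw_energy(r, ri_star, eps_i, rj_star, eps_j):
    """12-6 energy for a pair of atoms separated by r (in Angstrom)."""
    eps_ij = math.sqrt(eps_i * eps_j)    # geometric mean of the well depths
    r_star = (ri_star + rj_star) / 2.0   # arithmetic mean of Ri* and Rj*
    ratio6 = (r_star / r) ** 6
    return eps_ij * (ratio6 * ratio6 - 2.0 * ratio6)


if __name__ == "__main__":
    # h1 and h4 parameters from the list above.
    h1 = (2.7740, 0.01570)
    h4 = (2.8180, 0.01500)
    for r in (2.5, 2.796, 3.5):
        print(r, vdw_energy(r, *h1, *h4))
```

In this form the energy reaches its minimum, -EPSij, when the separation equals Rij*.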

Adding parameters to the hydrogen bond list

The Hbond list starts 2 lines below the #fitted_Hbond_parameters title, and the end of the list

is signaled by having both the I and J atom types as *. The parameter added must be in a single line

in the following format:

Units: A, B (kcal/mol)

Column #   Description
1          Force field file version number
2          Reference number
3          Atom type of I
4          Atom type of J
5          A
6          B

#--------------------------------------------------

# E = Aij/r^12 - Bij/r^10

#--------------------------------------------------

#Ver Ref I J A B

#--------------------------------------------------

#fitted_Hbond_parameters

#--------------------------------------------------

1.0 3 hw nb 7557.0000 2385.0000

1.0 3 hw nc 10238.0000 3071.0000

1.0 3 hw o 7557.0000 2385.0000

1.0 3 hw oh 7557.0000 2385.0000

1.0 3 hw os 7557.0000 2385.0000

[...]

1.2 3 zn s6 15000.0000 5000.0000

1.2 3 zn ss 15000.0000 5000.0000

1.2 3 zn sh 15000.0000 5000.0000

0.0 3 * * 0.0000 0.0000
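The 12-10 term given in the header of the list above, E = Aij/r^12 - Bij/r^10, can be evaluated as in this Python sketch. The function names are hypothetical. The position of the minimum follows from setting dE/dr = 0, which gives r^2 = 6A/(5B).

```python
import math


def hbond_energy(r, a, b):
    """E = A/r^12 - B/r^10 for a donor-acceptor separation r."""
    return a / r ** 12 - b / r ** 10


def hbond_minimum(a, b):
    """Separation at which the 12-10 term is lowest: r = sqrt(6A / 5B)."""
    return math.sqrt(6.0 * a / (5.0 * b))


if __name__ == "__main__":
    # hw-o parameters from the list above.
    a, b = 7557.0, 2385.0
    r_min = hbond_minimum(a, b)
    print(r_min, hbond_energy(r_min, a, b))
```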

II.5.2. Adding parameters to the bond charge increment list

The bond charge increment list starts below the #fitted_charge_parameters title, and the end of the list is signaled by having both the I and J atom types as *. Each line specifies a bond charge increment for a bond between atoms of type I and J (bciIJ), such that the resulting charge on J is bciIJ, while that on I is -bciIJ. Each parameter added must be on a single line in the following format:

Column #   Description
1          Version number
2          Atom type of I
3          Atom type of J
4          bci
5          Comment

##############################

#Cl I J bond_inc source

##############################

#fitted_charge_parameters


0 1 1 0.0000 #C94

0 1 2 -0.1382 C94

0 1 3 -0.0610 #C94

0 1 4 -0.2000 #X94

[...]

0 80 81 -0.4000 #C94

0 101 1 0.0000 empirical

0 101 6 -0.1900 empirical

0 101 37 -0.0000 empirical

* * * * *
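The sign convention for bond charge increments (bciIJ added to J and subtracted from I) can be illustrated with the Python sketch below. It is a simplified illustration, not the SMART charge routine: the function name apply_bci is hypothetical, formal charges and partial bci values are ignored, a bond listed in the opposite order (J, I) is assumed to use the negated increment, and a missing increment is set to 0 as in the corresponding SMART warning.

```python
def apply_bci(atom_types, bonds, bci):
    """Accumulate partial charges from bond charge increments.

    atom_types: list of MMFF atom types, one per atom
    bonds:      list of (index_a, index_b) pairs
    bci:        dict mapping (type_I, type_J) to the increment bciIJ
    """
    charges = [0.0] * len(atom_types)
    for a, b in bonds:
        type_a, type_b = atom_types[a], atom_types[b]
        if (type_a, type_b) in bci:
            inc = bci[(type_a, type_b)]
        elif (type_b, type_a) in bci:
            inc = -bci[(type_b, type_a)]  # reversed order (assumption)
        else:
            inc = 0.0                     # missing increment set to 0
        charges[a] -= inc                 # atom in the I position
        charges[b] += inc                 # atom in the J position
    return charges


if __name__ == "__main__":
    table = {(1, 2): -0.1382}             # "0 1 2 -0.1382" from the list above
    print(apply_bci([1, 2], [(0, 1)], table))
```

Because every increment is added to one atom and subtracted from the other, the charges produced by this sketch always sum to zero.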

Adding parameters to the partial bond charge increment / formal charge adjustment factor list

As a more general way of describing bcis, MMFF includes a partial bci parameter that is assigned to each atom type [15]; a bci for a bond can be obtained as the sum of the partial bcis corresponding to each atom type involved. Additionally, the formal charge on some groups is spread among neighbouring atoms; this is specified in the formal charge adjustment factor for the central atom type in those functional groups [15].

The parameter list starts below the #fitted_mmff_addl_charges title, and the end of the list is signaled by a line with the atom type as *. Each parameter added must be on a single line in the following format:

Column #   Description
1          Version number
2          Atom type
3          Partial bci
4          Formal charge adj
5          Comment

###

# MMFF Partial Bond Charge Incs and Formal-Charge Adj. Factors: 19-MAY-1994

#

# source: J. Comp. Chem. 17, 616 (1996)

###

# type pbci fcadj Origin/Comment

###

#fitted_mmff_addl_charges

0 1 0.000 0.000 E94

0 2 -0.135 0.000 E94

0 3 -0.095 0.000 E94

0 4 -0.200 0.000 E94

[...]

0 96 2.000 0.000 Ionic charge

0 97 1.000 0.000 Ionic charge

0 98 2.000 0.000 Ionic charge

0 99 2.000 0.000 Ionic charge

* * * * *


FITTED errors and warnings

ERROR: Molecule outside maximum number of angles.

FITTED can only handle molecules with up to [3 × #atoms] angles. If there are more, please contact the developers.

ERROR: Molecule outside maximum number of torsions.

FITTED can only handle molecules with up to [6 × #atoms] torsions. If there are more, please contact the developers.

ERROR: Forcefield file <forcefield filename> not found

The force field file listed in the keyword file is not found in the forcefield/ directory.

ERROR: Molecule too big for active site

Increase GI_Num_of_Trials

Increase GI_Initial_E

Increase GI_Minimized_E

Increase Grid_Size in ProCESS to create a larger active site cavity.

If none of these work, the molecule is too big for the active site and cannot be docked.

ERROR: Protein input file <Protein file name> not present

The protein file could not be found in the input/ directory.

ERROR: Ligand input file <ligand file name> not present

The ligand file could not be located in the input/ directory.

ERROR: Binding_Site_Cav file <Active site filename> not present

Binding_Site_Cav file could not be located within the input/ directory. Without an active site file

the docking may take longer and be less accurate.

WARNING: Binding_Site_Cav needed for generation of initial population

FITTED issues this warning but will not exit. Without an active site file the docking may take longer and be less accurate.

ERROR: Reference file <Reference file name> not present.

The reference file could not be located within the input/ directory.

ERROR: Missing Forcefield Parameters

FITTED exits because there are missing force field parameters. Either add them to the force field file or use the Parameters Auto keyword to have FITTED assign parameters automatically.


WARNING: Missing Forcefield parameters, assigning parameters automatically

The parameters that were assigned automatically are listed below this warning. If the automatic assignment is not suitable, add the parameter with the desired value to the force field file.

ERROR: <keyword_filename> Can not be opened

If the keyword file is not found in the keyword/ directory, an error.log file is created with this error. Please place the keyword file in the keyword/ directory.

ERROR: Coordinates not found in protein structural file

@<TRIPOS>ATOM is not found in the protein file preceding the coordinates

ERROR: Array size for number of protein atoms and bonds not in Protein 1

mol2 file.

@<TRIPOS>MOLECULE is not found in the first protein mol2 file

ERROR: Array size for number of ligand atoms and bonds not in Ligand

mol2.

@<TRIPOS>MOLECULE is not found in the ligand mol2 file

ERROR: Coordinates not found in ligand file

@<TRIPOS>ATOM is not found in the ligand file preceding the coordinates

ERROR: Check water names and atom types.

The water atom name and atom types are non-standard. Change to standard names.

ERROR: Bonds not found in ligand file

@<TRIPOS>BOND is not found in the ligand file preceding the list of bonds

ERROR: No assignment of Rotatable bonds

Please assign rotatable bonds either manually or by using SMART

ERROR: Bonds not found in protein file

@<TRIPOS>BOND is not found in the protein file preceding the list of bonds

ERROR: Protein keyword not found in keyword file.

The keyword Protein is not found within the keyword file. Please include it in your keyword file, followed by the number of protein files and, on the following lines, the list of protein files for the docking/virtual screening run.

ERROR: Can not find <residue name>

A residue listed in the keyword file cannot be found. Please check the spelling.

ERROR: Flex file <protein file name> not found

Make sure <protein file name>_flex.txt is in the input directory.


Error: Can not find coordinates in <Binding_Site_filename> is not present

in the active site cavity file

@<TRIPOS>MOLECULE is not found in the Binding_Site_Cav file

Error: Can not find coordinates in <Binding_Site_filename>

@<TRIPOS>ATOM is not found in the Binding_Site_Cav file preceding the coordinates

Error: Can not find coordinates in <pharmacophore_filename>

@<TRIPOS>ATOM is not found in the Pharmacophore file preceding the coordinates

Error: Can not find coordinates in <Interaction_Sites_filename>

@<TRIPOS>ATOM is not found in the Interaction_Sites file preceding the coordinates

Error: Ligand can not match minimum pharmacophore

Increase value of Min_PharmScore.

Error: Ligand can not match minimum Interaction Sites

Increase value of Min_MatchScore.

ERROR: Reference file <reference_filename> not present

The reference file is not located within the input/ directory.

ERROR: Invalid parameter specified for covalent residue.

Make sure the residue name is listed in the keyword the same way it is listed in the protein file.

ERROR: FITTED cannot find O/S and H for the covalent residue

Format in protein input file may be incorrect. In particular, make sure that for serine the alcohol atom names are set as OG and HG.

ERROR: Invalid parameter specified for other catalytic residue.

Make sure the residue and atom name are specified the same in the keyword and protein file.

ERROR: The proteins do not have the same number of atoms

Make sure to run ProCESS with all proteins in one keyword file.

ERROR: Problem with creation of z-matrix for ligand.

Make sure there is not a missing bond in the bond list of the ligand mol2 file. FITTED cannot handle mol2 with multiple structures.

ERROR: Problem with creation of z-matrix for active site residue

<residue_name>.

A bond is missing from the bond list in one of the protein mol2 files. Either add the missing bond(s) or remove the residue from the XXXX_site.txt file if it is not critical to the binding of the ligand.


ProCESS errors and warnings

Number of proteins not in keyword file.

If the number of protein files does not follow the Protein_Conformations keyword.

Coordinates not found in structural file

If in either the protein or ligand mol2 file @<TRIPOS>ATOM does not precede the coordinates of the structure.

Bonds not found in structural file

If in either the protein or ligand mol2 file @<TRIPOS>BOND does not precede the bond list.

Ligand file not present now closing

If Ligand is not found in the keyword file.

User wanted automatic finding of active site center, Ligand Reference not

given.

If the keyword AutoFind_Site is used in the keyword and Ligand is not found in the keyword file.

<Protein file name> file not present. Program now Closing.

The protein file given can not be found in the input/ directory.

<Ligand file name> file not present. Program now Closing

The ligand file given can not be found in the input/ directory

Side chain <residue name> Not found in <protein file name>

The residue given can not be found in the protein file.

Unknown residue name: <residue name>

The residue is not known. Refer to Tables 1a and 1b for accepted residue names.


SMART errors and warnings

The following is a list of errors and warnings that SMART outputs to the corresponding log file in the output/ directory. Errors indicate serious problems that cause SMART to either skip a molecule or

exit. Warnings are potential problems that might cause the SMART output to be incorrect; critical examination of the output and input structures in these cases is strongly encouraged.

ERROR: File <filename> cannot be opened.

The input file specified could not be read. Make sure that the file is located in the input/ directory.

Check the spelling and the file permissions.

ERROR: Atom <atom_name> cannot find element

The specified atom has a non-standard Sybyl atom type, or is not in the range of atomic numbers 1-35 (H-Br), 44-46 (Ru-Pd), 53 (I) or 78 (Pt). Without a proper element assignment, atom types cannot be assigned. In particular, look for: i) P atoms in phosphates and analogous functional groups: the Sybyl atom type for the P atom should be “P.3”; ii) S atoms in sulfoxides, sulfones and derivatives: the Sybyl atom type for the S atom should be “S.o” or “S.o2” respectively.

ERROR: could not write to <filename>

The specified output file could not be written. Check permissions on the output/ and parent

directories, that there is enough empty space in the volume and that the filename is valid.

ERROR: cannot create Z-matrix. Does the molecule have a torsion?

In order to be processed by SMART, a molecule must at least have 4 atoms connected sequentially in order to define a torsion. If a torsion cannot be defined, the molecule is skipped.

WARNING: Sum of partial charges does not equal net charge

When assigning MMFF charges, the partial charges assigned do not match the predicted formal charge. Check atom type assignment and bond connectivity.

WARNING: Cannot assign atomic weight to atom <atom_number> <atom_name>

When generating the bit string, the molecular weight is calculated from the sum of atomic weights. Currently, only atoms of atomic number 1-17 (H-Cl), 34-35 (Se, Br) and 53 (I) are parameterized.

WARNING: Atom <atom_name> has a formal charge of <formal charge>

When automatically assigning the bond orders (-assign_bond command-line option), this message is outputted to the log file for every atom with a formal charge higher than 1. Check the bond order assignment in these molecules to make sure it is correct.

WARNING: Missing bond increment. Bond # <bond_number> Atoms <atom_name1> <atom_name2>; MMFF atom types <MMFF_type1> and <MMFF_type2>. Bond increment set

to 0.

When automatically assigning the MMFF charges (-charge command-line option), this message is

outputted for every pair of atoms for which the bond increment is not parameterized. Add the bond increment in the forcefield/fitted_ff.txt file.

WARNING: Could not assign charges to molecule


When automatically assigning the MMFF charges (-charge command-line option), this message is

indicative of other problems with the charge assignment. Look for warning messages appearing before this one in the log file.


Functional group definitions

Table 9 - Definition of functional groups in SMART (blue = atom type, green = element)

Keyword            Description

Aromatic           An aromatic group is present if a ca atom type is within the molecule.

Aldehyde           An aldehyde is present if there is a c atom type in the molecule bound to a hydrogen.

Ester              An ester is present if there is a c atom type bound to an atom with an os atom type, with the c not bound to an n atom type and with both c and os atoms being acyclic.

Lactone            A lactone is present if there is a c atom type bound to an atom with an os atom type, with the c not bound to an n atom type and with the c and os atoms involved in a ring.

Amide              An amide is present if there is a c atom type bound to an atom with an n atom type, with the c not bound to an os atom type and with both c and n atoms being acyclic.

Lactame            A lactam is present if there is a c atom type bound to an atom with an n atom type, with the c not bound to an os atom type and with both c and n atoms being cyclic.

Acid               An acid (carboxylate) is present if an atom with a c atom type is bound to two atoms with o atom types.

Nitrile            A nitrile is present if an atom with a c1 atom type is bound to an atom with an n1 atom type.

Imine              An imine is present if an atom with a c2 atom type is bound to an atom with an n2 atom type, both acyclic; R cannot be an oxygen atom.

Nitro              A nitro is present if there is an atom with an no atom type within the molecule.

Acceptor           A Michael acceptor is present if an atom with an atom type of c2 is bound to either 1) an atom with a c atom type which is not a carboxylate, or 2) a nitrile group. The bond between c2 and c/c1 must be acyclic.

Azide              An azide is present if there are three acyclic nitrogens in a linear formation.

Isocyanate         An isocyanate is present if an atom with an atom type of c is bound to 2 atoms, one with an atom type of n2 and another with an atom type of o, where the c-n2 bond is acyclic.

Acyl_Chloride      An acyl chloride is present if an atom with an atom type of c is bound to an atom with an atom type of cl or br.

Sulphonamide       A sulphonamide is present when an atom with an atom type of s6 is bound to an atom with an atom type of n.

Carbamate          A carbamate is present when an atom with an atom type of c is bound to an atom with an n atom type and an atom with an os atom type.

Ammonium           An ammonium is present if there is an atom with an n4 atom type.

Oxime              An oxime is present if there is an atom with a c2 atom type bound to an atom with an n2 atom type which in turn is bound to an oxygen atom.

Ketone             A ketone is present if an atom with a c atom type is bound to 2 carbon atoms.

Boronate           A boronate is present if there is a boron atom bound to a carbon and two oxygens.

Primary_Amine      A primary amine is present if there is an atom with an atom type of n3 bound to two hydrogens.

Secondary_Amine    A secondary amine is defined as an atom with an atom type of n3 bound to a single hydrogen.


Additional keywords for FITTED

The following sections list the keywords, their functions and default values. Gray shading indicates a required keyword; angle brackets <> indicate a numeric value; plain text indicates a text string (such as a file name); square brackets [choice1|choice2] indicate a choice of values, with the default shown in italics.

Note that keyword files are case-sensitive. Empty lines are allowed, and text after a pound sign (#) is considered a comment.

Although the value of many keywords can be altered, default values should be used unless a specific system requires different settings. These keywords are essentially used by the developers for optimization of the program (time and accuracy). In general, modification of a specific value does not significantly improve or affect the accuracy but may result in longer or quicker docking runs.


II.6. Input/output files

Protein_Conformations <# of files>

input_file_1

input_file_2

Following this keyword is the number of protein structure files used as input (same protein different conformation). These protein files should be prepared using ProCESS prior to the actual docking.

On the following lines are the protein file names, one per line.

For each of the proteins listed, the following files should be associated with it:

input_file_dock.mol2

input_file_score.mol2

input_file_site.txt

input_file_IS.mol2

The name listed in this keyword file should therefore not include extensions such as _dock.mol2 that will be automatically added by FITTED.
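The keyword file conventions described above (case-sensitive keywords, empty lines allowed, comments after #, and a count followed by one file name per line) can be sketched in Python as follows. This is an illustration rather than the FITTED reader; the function names clean_lines and protein_files and the example file names are hypothetical.

```python
PROTEIN_SUFFIXES = ("_dock.mol2", "_score.mol2", "_site.txt", "_IS.mol2")


def clean_lines(text):
    """Drop comments (text after #) and empty lines."""
    cleaned = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if line:
            cleaned.append(line)
    return cleaned


def protein_files(text):
    """Map each protein name to the files expected for it."""
    lines = clean_lines(text)
    for i, line in enumerate(lines):
        fields = line.split()
        if fields[0] == "Protein_Conformations":  # case-sensitive match
            count = int(fields[1])
            names = lines[i + 1:i + 1 + count]
            return {n: [n + s for s in PROTEIN_SUFFIXES] for n in names}
    return {}


if __name__ == "__main__":
    example = """
    # hypothetical keyword file
    Protein_Conformations 2   # two conformations of the same protein
    prot_a
    prot_b

    Ligand ligand_file.mol2
    """
    for name, files in protein_files(example).items():
        print(name, files)
```

The names are listed without extensions, so the suffixes are appended here in the same way as described for FITTED.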

Ligand ligand_file.mol2

Name of the ligand file (in MOL2 format). This ligand file should be prepared using SMART prior to the actual docking.

Ref <#_of_files>

lig_ref_file1.mol2


lig_ref_file2.mol2

Following this keyword is an integer stating how many reference files are used to calculate the root-mean-square deviation (RMSD) of the ligand heavy atoms. These ligand files should be in the same reference frame as the protein structure. The possible symmetric conformations of the ligand are calculated in silico.

Two reference files may be needed when the ligand or the protein active site is Cn symmetric (n ≥ 2).

On the following line(s), the reference file(s) (in MOL2 format) are listed, one per line.

If this keyword is missing, no RMSD values will be computed.
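
A minimal sketch of a heavy-atom RMSD against several reference files is given below. It assumes that the pose and each reference list the same atoms in the same order and already share the protein reference frame, so no superposition is applied. The function names and the (element, x, y, z) tuple layout are hypothetical, and the in silico enumeration of symmetric conformations done by FITTED is not reproduced here.

```python
import math


def heavy_atom_rmsd(pose, reference):
    """RMSD over non-hydrogen atoms; atoms are (element, x, y, z) tuples."""
    if len(pose) != len(reference):
        raise ValueError("pose and reference must contain the same atoms")
    squared = []
    for (elem_p, *xyz_p), (_elem_r, *xyz_r) in zip(pose, reference):
        if elem_p == "H":
            continue  # hydrogens are ignored
        squared.append(sum((a - b) ** 2 for a, b in zip(xyz_p, xyz_r)))
    return math.sqrt(sum(squared) / len(squared))


def best_rmsd(pose, references):
    """Lowest RMSD over all references (e.g., two symmetry-related files)."""
    return min(heavy_atom_rmsd(pose, ref) for ref in references)


pose = [("C", 0.0, 0.0, 0.0), ("H", 5.0, 5.0, 5.0), ("O", 1.0, 0.0, 0.0)]
ref_a = [("C", 0.0, 0.0, 1.0), ("H", 0.0, 0.0, 0.0), ("O", 1.0, 0.0, 1.0)]
ref_b = [("C", 0.0, 0.0, 0.0), ("H", 0.0, 0.0, 0.0), ("O", 1.0, 0.0, 0.0)]
print(best_rmsd(pose, [ref_a, ref_b]))
```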

Output filename


Forcefield forcefield_file.txt

Name of the force field file to use if a force field other than fitted_ff.txt is required. The format of this file should be consistent with the format required by FITTED (see section II.5).

Binding_Site_Cav cavity_file.mol2

Following this keyword is the file defining the empty space present in the active site cavity (a set of spheres prepared by ProCESS).

If this keyword is missing, no grid filter will be used (it is highly recommended to use both the Pharmacophore and Binding_Site_Cav keywords).

Interaction_Sites interaction_sites_file.mol2

Name of the file containing the interaction site description (prepared by ProCESS).

If this keyword is missing, no interaction site filter will be used (it is highly recommended to use both the Interaction_Sites and Binding_Site_Cav keywords).

Pharmacophore pharmacophore_file.mol2

Name of the file containing the pharmacophore constraints on the ligands (prepared by ProCESS). Typically this keyword is used to ensure that the individuals produced match this constraint, but it can be softened by setting Min_Constraint.

If this keyword is missing, no constraint will be used.

Protein_Ref <#_of_files>

ref_file_1.ext

ref_file_2.ext

Following this keyword is the number of reference protein structure files used to compute the protein RMSD (deviation of the modeled protein structure from the reference structures).

On the following lines are the protein file names, one per line. These files will be used in addition to the protein files listed before to calculate a root-mean-square deviation (RMSD) between the protein generated during a FITTED docking run and the Protein_Ref files. Additional files may be needed if the protein has a symmetrical structure (e.g., HIV-1 protease).

If this keyword is missing, protein input files will be used as references.

II.7. Run parameters

Mode [Dock|Filter|VS|Score|Local|SAR]

Dock

Normal docking run. No ligands are filtered out.


Filter

Filters out structures that do not meet Filter, Optional or Essential groups (see below).

Once filtering is done the program exits.

VS

Filters out structures that do not meet Filter, Optional or Essential groups (see below). If the ligand passes all the filters, the docking is performed; otherwise FITTED exits. Additional keywords are also provided (see below).

Score

Scores the ligand input structure in the provided orientation against all input proteins.

Local

Performs a local search on the ligand input structure. The provided orientation/translation/conformation is used as a starting point and only slight modifications to the ligand conformation, orientation and translation are carried out.

SAR

Performs a local search on the ligand input structure. The provided orientation/translation/conformation is used as a starting point and only slight modifications to the ligand orientation and translation are carried out, while a complete search of conformations is done.

Flex_Type [Rigid|Semiflex|Flex_water|Flex]

Rigid

The ligand is docked onto one protein structure.

This is the default if only one protein structure is used.

Semiflex

The ligand is docked onto multiple protein structures (requires Protein ≥ 2). Proteins can be exchanged during the evolution but not the genes corresponding to side chains or water molecules (a more complete description of this mode is given in reference 1).

This is the default if more than one protein structure is used.

Flex_water

The ligand is docked onto multiple protein structures (requires Protein ≥ 2). Similar to Semiflex, except that each water molecule evolves independently.

Flex

The ligand is docked onto multiple protein structures (requires Protein ≥ 2). The side chains and waters are allowed to be exchanged independently from the protein backbone.

Number_of_Runs <number of runs>

More than one run per ligand can be performed (the ligand may be docked several times to ensure a complete search).

If this keyword is missing, a single run is done.

The default value is 3 for Dock mode; for all other modes the default is 1.

II.8. Filtering parameters

The following keywords are used to filter out structures in VS or Filter modes only.

Max_Charge <max_charge>

If a ligand has a net charge higher than max_charge, the program exits.

Default is +2.

Min_Charge <min_charge>

If a ligand has a net charge lower than min_charge, the program exits.

Default is -2.

Max_MW <max_MW>

If a ligand has a molecular weight higher than max_MW, the program exits.

Default is 500.

Min_MW <min_MW>

If a ligand has a molecular weight lower than min_MW, the program exits.

Default is 250.

Max_HBD <max_HBD>

If a ligand has more hydrogen bond donors than max_HBD, the program exits.

Default is 5.

Min_HBD <min_HBD>

If a ligand has fewer hydrogen bond donors than min_HBD, the program exits.

Default is 0.

Max_HBA <max_HBA>

If a ligand has more hydrogen bond acceptors than max_HBA, the program exits.

Default is 10.

Min_HBA <min_HBA>

If a ligand has fewer hydrogen bond acceptors than min_HBA, the program exits.

Default is 0.

Max_Nrot <max_Nrot>

If a ligand has more rotatable bonds than max_Nrot, the program exits.

Default is 6.

Min_Nrot <min_Nrot>

If a ligand has fewer rotatable bonds than min_Nrot, the program exits.

Default is 0.

Max_Ionizable <max_ionizable>

If a ligand has more ionizable groups than max_ionizable, the program exits.

Default is 2.

Min_Ionizable <min_ionizable>

If a ligand has fewer ionizable groups than min_ionizable, the program exits.

Default is 0.

Max_Rings <max_rings>

If a ligand has more rings than max_rings, the program exits.

Default is 10.

Min_Rings <min_rings>

If a ligand has fewer rings than min_rings, the program exits.

Default is 0.

Max_O <max_O>

If a ligand has more oxygen atoms than max_O, the program exits.

Default is 100.

Min_O <min_O>

If a ligand has fewer oxygen atoms than min_O, the program exits.

Default is 0.

Max_N <max_N>

If a ligand has more nitrogen atoms than max_N, the program exits.

Default is 100.

Min_N <min_N>

If a ligand has fewer nitrogen atoms than min_N, the program exits.

Default is 0.

Max_S <max_S>

If a ligand has more sulfur atoms than max_S, the program exits.

Default is 100.

Min_S <min_S>

If a ligand has fewer sulfur atoms than min_S, the program exits.

Default is 0.

Max_Hetero <max_hetero>

If a ligand has more heteroatoms (N, S and O) than max_hetero, the program exits.

Default is 100.

Min_Hetero <min_hetero>

If a ligand has fewer heteroatoms (N, S and O) than min_hetero, the program exits.

Default is 0.

Max_Metal <max_metal>

If a ligand has more heavy atoms other than C, N, O, S, P than max_metal, the program exits.

Default is 0.

Min_Metal <min_metal>

If a ligand has fewer heavy atoms other than C, N, O, S, P than min_metal, the program exits.

Default is 0.

Max_Num_of_Atoms <max_atoms>

If a ligand has more atoms than max_atoms, the program exits.

Default is 10000.

Min_Num_of_Atoms <min_atoms>

If a ligand has fewer atoms than min_atoms, the program exits.

Default is 0.
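
The Min_*/Max_* keywords above amount to a range check on each ligand property. The Python sketch below collects the documented defaults in one table and reports which limits a ligand violates. The property values are assumed to have been computed elsewhere, and the function and dictionary names are hypothetical rather than part of FITTED.

```python
# (minimum, maximum) defaults as documented for the Min_*/Max_* keywords.
DEFAULT_LIMITS = {
    "Charge": (-2, 2),
    "MW": (250, 500),
    "HBD": (0, 5),
    "HBA": (0, 10),
    "Nrot": (0, 6),
    "Ionizable": (0, 2),
    "Rings": (0, 10),
    "O": (0, 100),
    "N": (0, 100),
    "S": (0, 100),
    "Hetero": (0, 100),
    "Metal": (0, 0),
    "Num_of_Atoms": (0, 10000),
}


def failed_filters(properties, limits=DEFAULT_LIMITS):
    """Return the names of the keywords whose limit the ligand violates.

    `properties` maps a property name (e.g. "MW") to its value;
    properties that are not supplied are simply not checked.
    """
    failures = []
    for name, (low, high) in limits.items():
        value = properties.get(name)
        if value is None:
            continue
        if value < low:
            failures.append("Min_" + name)
        elif value > high:
            failures.append("Max_" + name)
    return failures


print(failed_filters({"MW": 620.0, "HBD": 2, "Charge": -3}))
```

An empty list means the ligand passes all property filters; in VS or Filter mode a non-empty list corresponds to the cases where the program exits.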

Filter <#_groups_filtered>

group_filtered1

group_filtered2

Number of functional groups that are filtered out. The name(s) of the filtered functional groups are listed below this keyword (see Table 3).

Optional <#_option_groups>

group_needed1

group_needed2

Number of functional groups of which at least one has to be present. The name(s) of the needed functional groups are listed below this keyword (see Table 3).

Essential <#_essential_groups>

group_needed1

group_needed2

Number of functional groups that are required. The name(s) of the needed functional groups are listed below this keyword (see Table 3).

Table 3. List of groups recognized by FITTED that can be listed after Filter, Optional or Essential.

Aromatic Acid Acceptor Carbamate Primary Amine

Aldehyde Lactame Azide Ammoniun Secondary Amine

Ester Nitrile Isocyanate Oxime

Lactone Imine Acyl_Chloride Ketone

Amide Nitro Sulphonamide Boronate
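
The combined effect of the Filter, Optional and Essential keywords can be sketched as three set tests. The Python function below is a hypothetical illustration, not FITTED code, and it assumes that the functional groups of the ligand have already been detected and are given by name.

```python
def passes_group_filters(ligand_groups, filtered=(), optional=(), essential=()):
    """Apply the Filter / Optional / Essential logic to a set of group names.

    - filtered:  the ligand is rejected if it contains any of these groups
    - optional:  if given, at least one of these groups must be present
    - essential: every one of these groups must be present
    """
    groups = set(ligand_groups)
    if groups & set(filtered):
        return False
    if optional and not (groups & set(optional)):
        return False
    return set(essential) <= groups


print(passes_group_filters(
    {"Amide", "Aromatic"},
    filtered=["Nitro"],
    optional=["Amide", "Ester"],
    essential=["Aromatic"],
))
```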

II.9. Conjugate gradient parameters

The default values for all the keywords described in this section are highly recommended.

GA_* or GI_*

There are two sets of the following keywords: one for the parameters used during the generation of the initial population (GI_*; e.g., GI_MaxIter) and another one used during the evolution (GA_*; e.g., GA_MaxIter). In the keyword names below, XX stands for either GI or GA. The default values are recommended.

XX_MaxIter <maxiter>

o Maximum number of iterations. Once this number is reached the minimization is finished.

o The default is 20.

XX_StepSize <stepsize>

o Initial value of the step taken in the direction of the gradient during minimization.

o The default is 0.02.

XX_MaxStep <maxstep>

o Maximum step size allowed during minimization.

o The default is 1.

XX_EnergyBound <energybound>

o Minimum energy difference between two molecules to be considered similar.

o The default is 1.0 for GI_EnergyBound and 0.001 for GA_EnergyBound.

XX_MaxSameEnergy <maxsameenergy>

o Number of times that the same energy (defined by EnergyBound) can be repeated.

o The default is 3.

XX_MaxGrad <maxgrad>

o Gradient convergence criteria.

o The default is 0.001.

II.10. Energy parameters

The default values for all the keywords described in this section are highly recommended.

Score_Initial [none|score|minimize]

Scoring of the initial ligand binding mode.

none

No scoring of the initial input structure is performed.


score

Only the score of the initial input ligand is output.

minimize

The score of the initial pose and the score of the energy-minimized structure are both output.

VdW [1-4|1-5]

Selects whether 1,4 and/or 1,5 and greater van der Waals interactions should be considered.

1-4

Used to consider 1,4 interactions and above.


1-5

Used to consider only 1,5 interactions and above.

VdWScale_1-4 <vdwscale_1-4>

Scaling factor for the 1,4 van der Waals interactions.

The default is 1.0.

VdWScale_1-5 <vdwscale_1-5>

Scaling factor for the 1,5 van der Waals interactions.

The default is 1.0.

E_VdWScale_Pro <e_vdwscale_pro>

Scaling factor for the ligand-protein van der Waals interactions.

The default is 1.0.

E_VdWScale_Wat <e_vdwscale_wat>

Scaling factor for the ligand-water van der Waals interactions.

The default is the value of E_VdWScale_Pro.

Elec [1-4|1-5]

Select whether 1,4 and/or 1,5 and greater electrostatic interactions should be considered.

1-4



1-5


ElecScale_1-4 <elecscale_1-4>

Scaling factor for the 1,4 electrostatic interactions.

The default is 1.0.

ElecScale_1-5 <elecscale_1-5>

Scaling factor for the 1,5 electrostatic interactions.

The default is 1.0.

E_ElecScale_Pro <e_elecscale_pro>

Scaling factor for the ligand-protein electrostatic interactions.

The default is 1.0.

E_ElecScale_Wat <e_elecscale_wat>

Scaling factor for the ligand-water electrostatic interactions.

The default value is set the same as E_ElecScale_Pro.

HBond [Y|N]

Selects whether or not hydrogen bonds are included in the energy calculation.

The default is Y.

E_HbondScale_Pro <e_hbondscale_pro>

Scaling factor for the ligand-protein hydrogen bond interactions.

The default is 1.0.

E_HbondScale_Wat <e_hbondscale_wat>

Scaling factor for the ligand-water hydrogen bond interactions.

The default value is set the same as E_HbondScale_Pro.

Cutdist <cutdist>

Cutoff distance (in Å) for the non-bond interactions with the protein.

The default value is 9.

Switchdist <switchdist>

Switching distance (in Å) for the non-bond interactions with the protein.


Cutdist_Wat <cutdist_wat>

Cutoff distance for the non-bond interactions with the water molecules.

The default value is 1.20.

Switchdist_Wat <switchdist_wat>

Switching distance for the non-bond interactions with the water molecules.


GI_Protein_Nbonds [United|All_Atom]

FITTED will treat protein non-bonded interactions with the ligand as either all atom or united for the generation of the initial population.

The default for this keyword is United.

GA_Protein_Nbonds [United|All_Atom]

FITTED will treat protein non-bonded interactions with the ligand as either all atom or united during the evolution.

The default for this keyword is United.

GA_Protein_Nbonds2 <generation number>

FITTED will switch from united to all atom representation of the non-bonded interactions at this generation.

The default is Max_Gen_2.

Solvation [On|Off]

Allows the user to turn off the calculation of the solvation energy.

The default is On.

Displaceable_Waters [On|Off]

Allows the user to turn off the displaceable waters.

The default is On, which allows displaceable waters.

II.11. Scoring parameters

S_VdWScale_Pro <s_vdwscale_pro>

Scaling factor for the ligand-protein van der Waals score.

The default is 1.0.

S_VdWScale_Wat <s_vdwscale_wat>

Scaling factor for the ligand-water (located in protein structure) van der Waals interactions.

The default is the value of S_VdWScale_Pro.

S_ElecScale_Pro <s_elecscale_pro>

Scaling factor for the ligand-protein electrostatic interactions.

The default is 1.0.

S_ElecScale_Wat <s_elecscale_wat>

Scaling factor for the ligand-water electrostatic interactions.

The default is the value of S_ElecScale_Pro.

S_HbondScale_Pro <s_hbondscale_pro>

Scaling factor for the ligand-protein hydrogen bond interactions.

The default is 1.0.

S_HbondScale_Wat <s_hbondscale_wat>

Scaling factor for the ligand-water hydrogen bond interactions.

The default is the value of S_HbondScale_Pro.

S_PolarSolvation

Scaling factor for the polar solvation energy.

The default is 1.0.

S_nonPolarSolvation

Scaling factor for the non-polar solvation energy.

The default is 1.0.

Water_Loss_Energy <water_loss_energy>

Energy penalty (in kcal/mol) associated with the displacement of a water molecule in the active site during docking.

The default value is 1.0.

II.12. Initial population parameters

Pop_Size <pop_size>

Population size for the genetic algorithm conformational search.

The default is 100 for rigid docking and 200 for flexible docking.

Min_MatchScore <min_matchscore>

This keyword is used only if an interaction site file is provided. If the Mode is set to Dock, Min_MatchScore is automatically calculated.

Minimum match of the interaction sites.

The default is 25.

Min_PharmScore <min_constraint>

This keyword is used only if a pharmacophore file is provided.

Minimum percent match of the pharmacophore.

The default is 100.

Anchor_Atom <anchor_atom>

Sequence number of the atom to be used as an anchor. This is used to identify the center of translation and rotation for the GA.

If this keyword is not specified, the anchor is automatically set to the center of gravity of the ligand.

Anchor_Coor <anchor_x> <anchor_y> <anchor_z>

Following this keyword must be the x, y and z coordinates of the protein active site center.

If this keyword is not used, it is automatically set to the center of the protein active site defined by the active site (flexible) residues.

Max_Tx <max_tx>

Max_Ty <max_ty>

Max_Tz <max_tz>

Maximum value for translation (in Å) in x, y, and z respectively.

The default is 5 for the three values.

GI_Initial_E <gi_initial_e>

Energy value (in kcal/mol) added to the minimized energy of the free ligand to give an upper bound. If the energy of an individual in the initial population/GA is below this number, then the individual is optimized by energy minimization.

The default is 100,000.

GI_Minimized_E <gi_minimized_e>

Energy value (in kcal/mol) added to the minimized energy of the free ligand to give a lower cutoff. If the energy of an individual is below this value after energy minimization then the individual is kept as a part of the initial population.

The default is 1,000 (could be set to values as low as 100).
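
The two energy thresholds act as successive gates when the initial population is built. The sketch below shows that logic in Python under stated assumptions: energy_of and minimize are caller-supplied stand-ins for the force-field evaluation and the minimizer, and the function name is hypothetical rather than part of FITTED.

```python
def accept_initial_individual(individual, energy_of, minimize,
                              free_ligand_energy,
                              gi_initial_e=100_000.0,
                              gi_minimized_e=1_000.0):
    """Return the minimized individual if it passes both gates, else None.

    Gate 1 (GI_Initial_E): only individuals whose energy is below
    free_ligand_energy + gi_initial_e are worth minimizing.
    Gate 2 (GI_Minimized_E): after minimization the energy must be below
    free_ligand_energy + gi_minimized_e for the individual to be kept.
    """
    if energy_of(individual) >= free_ligand_energy + gi_initial_e:
        return None
    minimized = minimize(individual)
    if energy_of(minimized) < free_ligand_energy + gi_minimized_e:
        return minimized
    return None


# Toy stand-ins: an "individual" is just its energy in kcal/mol, and
# minimization is mimicked by dividing the energy by 100.
toy_energy = lambda e: e
toy_minimize = lambda e: e / 100.0

print(accept_initial_individual(50_000.0, toy_energy, toy_minimize, 10.0))
```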

GI_Num_of_Trials <gi_num_trials>

Maximum number of successive unsuccessful trials before exiting.

The default for Mode Dock is 10,000 and for Mode VS is 1,000.

Matching_Algorithm [On|Off]

Turns on or off the matching algorithm.

By default, it is set to On.

Num_of_Top_IS <num_of_top_IS>

Number of top-ranked interaction sites considered; each interaction site triangle must contain at least one of them.

The default is 10.

Stringent_Triangles <weight_of_triangles>

Factor by which the triangles are selected. The higher Stringent_Triangles is set, the more the matching algorithm will favour triangles that have not been used.


Stringent_MS <stringent_MS>

Weight factor used in the calculation of Min_MatchScore. The higher this value, the stricter Min_MatchScore becomes.


Corner_Flap [On|Off]

Turns the corner flap conformational search for rings on or off.

By default, it is set to Off.

II.13. Evolution parameters

Max_Gen <max_gen>

Determine the maximum number of generations for the genetic algorithm.

The default is 200.


CutScore_1 <cutscore_1>

Upper bound score at Max_Gen required to proceed further with the docking run. If at least one individual within the top 3 has a score below CutScore_1, the program proceeds to Max_Gen_1.

The default is -4.

Max_Gen_1 <max_gen_1>

This keyword is used in VS mode only.

After Max_Gen generations, if none of the top poses has a score below the one specified by CutScore_1, the program exits. Otherwise, the program proceeds until it reaches Max_Gen_1.

The default is set to be Max_Gen.

CutScore_2 <cutscore_2>

Upper bound score at Max_Gen_1 required to proceed further with the docking run. If at least one individual within the top 3 has a score below CutScore_2, the program proceeds to Max_Gen_2.

Max_Gen_2 <max_gen_2>

As for Max_Gen_1, if after Max_Gen_1 generations none of the top poses has a score below the one specified by CutScore_2, the program exits. Otherwise, the program proceeds until it reaches Max_Gen_2.
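
The staged termination used in VS mode can be summarized by the Python sketch below. It is an illustration only: scores_at is a hypothetical stand-in for running the genetic algorithm up to a given generation and returning the pose scores, lower scores are assumed to be better, and the numerical values used in the example (other than the documented CutScore_1 default of -4 and Max_Gen default of 200) are invented.

```python
def continue_after_checkpoint(scores, cut_score, top_n=3):
    """True if at least one of the top_n poses scores below cut_score."""
    return any(score < cut_score for score in sorted(scores)[:top_n])


def final_generation(scores_at, max_gen, cutscore_1,
                     max_gen_1, cutscore_2, max_gen_2):
    """Return the generation at which a VS run would stop."""
    if not continue_after_checkpoint(scores_at(max_gen), cutscore_1):
        return max_gen        # poor ligand: stop at the first checkpoint
    if not continue_after_checkpoint(scores_at(max_gen_1), cutscore_2):
        return max_gen_1      # average ligand: stop at the second checkpoint
    return max_gen_2          # promising ligand: run to completion


# Invented scores of the population at the two checkpoints.
toy_scores = {200: [-5.0, -1.0, 0.5, 2.0], 300: [-5.5, -2.0, -1.0, 0.0]}
print(final_generation(toy_scores.get, 200, -4.0, 300, -6.0, 400))
```

The point of such a scheme is that ligands unlikely to reach a good score are abandoned early, so most of the computing time in a screen is spent on the promising ones.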


Seed <seed>

Select the starting point within the random number generator. If the same run is done with the same seed, the exact same result will be obtained. If a different seed is used, the GA will follow a different path. Changing the seed helps the developers to evaluate the convergence of a run.

The default is 100.

Max_Rxy <max_rxy>

Max_Ryz <max_ryz>

Max_Rzx <max_rzx>

Maximum value (in degrees) for the mutation of the rotation in their respective planes.

The default is 30 for the three values.

Parent_Selection [Random|Tournament]

Select how the parents are chosen.

Random

A random individual is selected, then checked to see whether it has already been coupled. If it has, another individual is drawn at random. If this occurs 10 times, the last individual drawn is kept as the parent.

Tournament

Tournament_Size random individuals are selected with the best being kept as the parent.

Tournament_Size <tourney size>

The tournament size of the parent selection. Only used with Parent_Selection Tournament.

The default is 5.
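
Tournament selection as described above can be sketched in a few lines of Python. The sketch assumes that contestants are drawn without replacement and that the best individual is the one with the lowest energy; both are assumptions, since the text does not specify them, and the function name is hypothetical.

```python
import random


def tournament_select(population, energy_of, tournament_size=5, rng=random):
    """Pick tournament_size random individuals and return the best one.

    "Best" is taken to mean lowest energy. Contestants are sampled
    without replacement (an assumption of this sketch).
    """
    size = min(tournament_size, len(population))
    contestants = rng.sample(population, size)
    return min(contestants, key=energy_of)


# Toy population in which each individual is represented by its energy.
toy_population = [5.0, 3.0, 9.0, 1.0, 7.0, 4.0, 8.0]
seeded = random.Random(100)  # a fixed seed makes the draw reproducible
parent = tournament_select(toy_population, lambda e: e, 5, seeded)
print(parent)
```

A larger tournament size increases the selection pressure, because the overall best individuals are more likely to be among the contestants.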

Max_Num_SC <max_num_sc>

Maximum number of steric clashes allowed between the flexible side chains of the protein and/or between the water molecules when a composite protein structure is created.

The default is 0.

Max_SC_PP <max_sc_pp>

Maximum distance (in Å) between an atom of one side chain and an atom of another side chain in a composite protein structure for the pair to be considered a clash.

The default is 1.5.

Resolution <resolution>

Select the resolution for the bond rotation during the generation of the initial population. For example, if a resolution of 12 is selected, the bond rotation will occur in multiples of (360/12), or 30 degrees.

The default is 120.
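
The relation between Resolution and the rotation increment is simple arithmetic, shown here as a short Python sketch (the function names are hypothetical):

```python
def torsion_increment(resolution=120):
    """Rotation step in degrees: 360 divided by the resolution."""
    return 360.0 / resolution


def allowed_torsions(resolution=120):
    """All torsion values (in degrees) reachable at this resolution."""
    step = torsion_increment(resolution)
    return [i * step for i in range(resolution)]


print(torsion_increment(12))    # the example from the text
print(torsion_increment())      # the default resolution of 120
print(allowed_torsions(12)[:4])
```

With the default resolution of 120, bonds are therefore rotated in steps of 3 degrees.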

pLearn <plearn>

Probability of energy minimization of the parents at every generation.

The default is 0.1.

pCross <pcross>

Probability of crossover at every generation.


pMut <pmut>

Probability of mutation at every generation.


pMutRot <pmutrot>

Probability of mutation of the orientation of the ligand at every generation.


pMutWat <pmutwat>

The maximum rate of mutation of the water at Max_Gen generations.


pElite <pElite>

The percentage of the best of the population to be directly passed on to the next generation.


pElite_Every_X_Gen <pElite_Every_X_Gen>

pElite is applied every pElite_Every_X_Gen generations.

The default is 2.

pElite_SSize <pElite_SSize>

The individual to be passed directly on to the next generation will be selected at random from the top pElite_SSize individuals of the population.

The default is 10.

pOpt <popt>

Probability of optimization of the ligand at every generation.


Evolution [Steady_State|Metropolis|Elite]

Steady_State

During the evolution, out of a pair of two children and their 2 parents the two best will be saved.


Metropolis

During the evolution, out of a pair of two children and their 2 parents two individuals will be saved following the Metropolis criterion. If the children are higher in energy they are checked to see if they have a high probability to exist at room temperature. If they do they are saved.

Elite

During the evolution, the top pop_size individuals of the children and parents will be kept for the next generation.
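
For the Metropolis option, the test applied to a higher-energy child can be sketched as follows. This is the standard Metropolis acceptance rule, used here as an assumption: energies are taken to be in kcal/mol, room temperature is taken as 298.15 K, and the exact expression implemented in FITTED is not given in this manual. The function name is hypothetical.

```python
import math
import random

GAS_CONSTANT = 0.0019872  # kcal/(mol*K)


def metropolis_accept(parent_energy, child_energy,
                      temperature=298.15, rng=random):
    """Standard Metropolis test (assumed form, not taken from FITTED).

    A child that is lower in energy is always accepted; a child that is
    higher in energy is accepted with probability exp(-dE / RT).
    """
    if child_energy <= parent_energy:
        return True
    delta = child_energy - parent_energy
    probability = math.exp(-delta / (GAS_CONSTANT * temperature))
    return rng.random() < probability


print(metropolis_accept(-10.0, -12.0))   # lower in energy: accepted
print(metropolis_accept(0.0, 1.0e6))     # far higher in energy: rejected
```

At 298.15 K the product RT is about 0.59 kcal/mol, so under this rule a child only 1 kcal/mol above its parent is accepted in roughly one case out of five.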

GA_Num_of_Trials <ga_num_trials>

Maximum number of successive unsuccessful trials to create children.

The default is 1000.

Diff_Avg_Best <difference_avg_best>

The absolute difference between the average energy of the population and the energy of the best individual of the population. If the calculated value is below difference_avg_best, the population is considered to be converged.

The default is 1.

Diff_N_Best <difference_n_best>

The absolute difference in energy between the individual with the lowest energy and the individual ranked Diff_Number.

If Diff_Number is defined the default value is 0.4.

Diff_Number <number_rank>

Rank of the individual to be used with Diff_N_Best.

By default this criterion is not used.
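
The two convergence criteria can be sketched in Python as below. How FITTED combines them is not stated in this manual; the sketch assumes that either criterion is sufficient and that the Diff_N_Best test is only evaluated when Diff_Number is defined. The function name is hypothetical.

```python
def is_converged(energies, diff_avg_best=1.0,
                 diff_number=None, diff_n_best=0.4):
    """Check the population energies against the convergence criteria.

    energies: energies of all individuals of the current population.
    """
    ranked = sorted(energies)            # lowest (best) energy first
    best = ranked[0]
    average = sum(ranked) / len(ranked)
    # Diff_Avg_Best: average energy close to the best energy.
    if abs(average - best) < diff_avg_best:
        return True
    # Diff_N_Best: individual ranked diff_number close to the best one.
    if diff_number is not None:
        return abs(ranked[diff_number - 1] - best) < diff_n_best
    return False


print(is_converged([-10.0, -9.8, -9.6]))
print(is_converged([-10.0, -5.0, 0.0]))
print(is_converged([-10.0, -9.9, 0.0, 10.0], diff_number=2))
```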

II.14. Output/convergence parameters

Print_Level [0|1|2|3]

Controls the amount of data outputted.

Print_Structures [Final|Full|None]

Controls the output of the structures during or at the end of the docking.

Final

Only the final structures will be printed.


Full

The structures (protein and ligand) will be printed during the run along with the final structures.

None

No structures will be printed.

Print_Num_Structures <print_num_structures>

Select how many of the top poses are printed as MOL2 files.

The default is 1.

Number_of_Best <number_of_best>

Select for how many individuals the score, energy and RMSD are printed during the run.

The default is 10 in Mode Dock and 1 in Mode VS.

Print_Best_Every_X_Gen <print_best_every_x_gen>

How often to print a summary of the run.

The default is (Max_Gen + 1).

Print_Energy_Full [Y|N]

Controls the printout of the detailed energy contributions.

Y

Print out a breakdown of the energy (bond energy, angle energy, etc.).


N

Print out only the total energy.

MaxSameEnergy_GA <maxsameenergy_ga>

If the lowest energy individual does not change in this many generations, the program exits.


II.15. Docking of covalent inhibitors

This feature is under validation.

Covalent_Residue <residue_name>

Following this keyword is the name of the residue that the covalent inhibitor will react with (e.g., SER554). Only CYS and SER are implemented in the current version.

Covalent_Ligand [Only|Both]

Controls the covalent docking. FITTED will automatically identify aldehyde, boronate or nitrile groups (other groups will eventually be implemented) and assign the proper atom types when covalent poses are considered.

Only

Only covalent poses will be considered.


Both

Covalent and non-covalent poses will be considered concomitantly.

Proton_Moved_To <residue> <atom_name>

The proton will be moved to atom <atom_name> of residue <residue>.

Additional Keywords for ProCESS

The following section lists the keywords, their functions and default values. Gray shading indicates a required keyword; angle brackets <> indicate a numeric value; plain text indicates a text string (such as a file name); square brackets [] indicate a choice of values, the default shown in italics.

ProCESS keyword files are case-sensitive. Empty lines are allowed, and text after a pound sign (#) is considered a comment.

Although the value of many keywords can be altered, default values should be used unless a specific system requires different settings.


II.16. Input/output files

Protein <#_protein_struct>

protein_file1.mol2

protein_file2.mol2

Following the keyword, specify the number of protein structure files to be processed.

On the following lines, specify the protein file names, one per line.

Output output_filename


Binding_Site_Cav cavity_filename

Name of the file where to output the binding site cavity.

If this keyword is not present ProCESS will not create a binding site cavity file.

Interaction_Sites pharmacophore_filename

Name of the file where to output the interaction sites definition file.

If this keyword is not present ProCESS will not create an interaction sites definition file.

Binding_Site <#_flex_residues>

flex_residue_1_name

flex_residue_2_name

Manually defines the active site. (The active site can be automatically defined by providing a ligand, see below)

On the same line following this keyword, specify the number of flexible residues.

On subsequent lines, the residue names/numbers (according to Find_Residues) are specified, one per line.

Rep_file protein_file1.mol2

Specify which file to use as a template.

If omitted, the first file specified in Protein is used.

II.17. Reading the input files and preparing the output protein files

Renumber_Residues <first_residue_number>

Specify the new number for the first residue; the rest will be sequentially renumbered.

This feature is useful if the protein is a multimer, having multiple residues with the same group name (e.g., two Pro60, two Asp25 as in HIV-1 protease).

AutoFind_Site [Y|N]

This function allows the user to have ProCESS automatically find the flexible residues/binding site.

The default is N.

Ligand ligandfile.mol2

Ligand file (in MOL2 format) used to define the active site and its center. It should be in the same reference frame as the protein.

Ligand_Cutoff <ligand_cutoff>

Protein residues within this cutoff (in Å) are considered part of the binding site.

The default is 5.0.

Truncate [Y|N|auto]

Determines whether the protein will be truncated, keeping only residues within Cutoff of the binding site residues.

The default is Y.

auto

The protein will be truncated keeping residues within cutoff distance of the ligand and not within cutoff distance from the binding site residues.

Cutoff <cutoff>

Any residue that does not have an atom within this distance (in Å) from an atom of a flexible residue or of the given ligand will be deleted from the protein file that ProCESS will output.

The default value is 11 for auto truncation and 9 for Truncate Y.

Find_Residues [Name|Number]

If Binding_Site is used, defines how ProCESS will identify the residues that make up the binding site.

Name

Search residues by group name.


Number

Search residues by group number.

Assign_H [Y|N]

ProCESS requires the advanced hydrogen atom names. This keyword allows ProCESS to rename the hydrogens if they are not assigned correctly.

The default is N.

Assign_G [Y|N]

Allows ProCESS to assign the advanced PDB residue names automatically.

The default is N.

United [Y|N]

This allows the user to select whether the protein will have a united-atom or all-atom representation.

Y is the default.

Coarse [0|1|2|3]

This keyword followed by 0 is for united-atom representation. Other coarse grained representations are under development.

0 is the default.

Assign_Charges [Scaled|None]

Scaled

Scales van der Waals and electrostatics for the flexible residues. This scaling of parameters allows for entropy cost of the binding to the protein to be accounted for (see J. Med. Chem. 2006, 49, 5885 for a more complete description).


None

Assigns default values to charges and van der Waals.

II.18. Parameters for the binding cavity file

Grid_Center <grid_center>

Specifically defines the center of the binding site.

The default is to automatically find it using the center of a ligand.

Grid_Size <size>

Specifies the size of the box for the binding site.


Grid_Boundary [Soft|Hard]

Soft

When converting from the grid to spheres, the boundary of the box will be ignored (defined by Grid_Size) and spheres can include volume outside of the box.


Hard

The active site cavity file will be constrained within the box defined by Grid_Size.

Grid_Resolution <grid_resolution>

Following this keyword is the resolution (Å) of the grid.

The default is 1.5.

Grid_Sphere_Size <grid_sphere_size>

Specifies the size of a sphere used to trim the sides of the box to make it rounder.

The default is Grid_Size.

Grid_Clash <grid_clash>

If a protein atom is within this distance of a grid point, the point is removed from the grid.

The default is 1.5.
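
The effect of Grid_Clash can be illustrated by the following Python sketch, which removes every grid point lying closer than grid_clash to any protein atom. It is a simplified illustration with a hypothetical function name: coordinates are plain (x, y, z) tuples in Å, a point exactly at the clash distance is kept (an assumption), and no attempt is made to reproduce the way ProCESS builds the grid or converts it to spheres.

```python
import math


def remove_clashing_points(grid_points, protein_atoms, grid_clash=1.5):
    """Keep only the grid points at least grid_clash away from all atoms."""
    kept = []
    for point in grid_points:
        if all(math.dist(point, atom) >= grid_clash for atom in protein_atoms):
            kept.append(point)
    return kept


toy_grid = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
toy_atoms = [(1.0, 0.0, 0.0)]
print(remove_clashing_points(toy_grid, toy_atoms))
```

The remaining points describe the empty space of the cavity that the ligand may occupy.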

II.19. Parameters for the Interaction sites file

Xxx_Weight <xxx_weight>

This group of keywords (Xxx being Hydrophobic, Metal, HBA or HBD) specifies the parameters for the assignment of pharmacophoric points. xxx_weight is used to give weight to favourable xxx-type interactions. Default parameters are highly recommended.

Hydrophobic_Weight <hydro_weight>

Defines the weight for hydrophobic interaction points.

The default is 1.

Metal_Weight <metal_weight>

Defines the weight for metal interaction points.

The default is 50.

HBA_Weight <hba_weight>

Defines the weight for hydrogen bond acceptor interaction points.

The default is 5.

HBD_Weight <hbd_weight> <hbd_penalty>

Defines the weight for hydrogen bond donor interaction points.

The default is 5.

If too many points are found, one can reduce this number by using the following keywords:

Pharm_Polar_Softness <pharm_polar_soft>

Maximum distance (in Å) between two polar points to merge.

The default is 0.0.

Pharm_Nonpolar_Softness <pharm_nonpolar_soft>

Maximum distance (in Å) between two non-polar points to merge.

The default is 0.0.

Hydrophobic_Level <hydro_level>

Van der Waals interaction between a probe on the grid point and hydrophobic carbons required for the point to be considered hydrophobic. If the interaction is found to be lower than hydro_level, a hydrophobic point is added at this location. For more information see the section on Interaction Sites/Pharmacophore generation.


Min_Weight <min_weight>

Minimum weight for a pharmacophoric point to be included in the final pharmacophore.

The defaults are 0.5 and 0.0, respectively.

Num_of_IS <num_of_spheres>

This determines the maximum number of interaction site spheres in the interaction sites file.

The default is 75.
