NEW VIRTUAL SCREENING TOOLS FOR
MOLECULAR DISCOVERY
CHRISTOPHER R. CORBEIL
A thesis submitted to McGill University in partial fulfillment of the requirements of the
degree of Doctor of Philosophy
Department of Chemistry
McGill University
Montreal, Quebec, Canada
H3A 2K6
December, 2008
© Chris Corbeil, 2008
ABSTRACT
In the field of molecular discovery, virtual screening of large libraries of
compounds has often proved more cost-efficient than traditional experimental
approaches. In fact, it has now become common practice thanks to the virtual screening
tools available to chemists in the pharmaceutical industry, specifically docking programs. Most
docking programs do not account for the dynamics associated with protein-ligand binding,
whether protein flexibility or the presence of displaceable water molecules.
FITTED1.0 was developed to include these two features and has been validated on
a testing set of 33 protein-ligand complexes. Further developments were needed to
increase the speed of FITTED and enable its application as a virtual screening tool. This
enhanced version, FITTED1.5, has been applied to the screening of the Maybridge library
against the HCV polymerase and revealed FITTED's ability to identify active substances. Following this
and other successful applications of FITTED, a comparative study was performed against
other docking programs, with a specific interest in the effect of the ligand and protein
input conformations and the inclusion of bridging water molecules on the accuracy of
docking programs. All three had major effects on accuracy and led to suggestions on how
to better conduct comparative studies. In parallel, we applied our expertise in the virtual
screening area to the field of asymmetric catalyst development, which led to the creation of
ACE1.0. When creating a tool for predicting stereoselectivities, one has to describe the
transition state with great accuracy yet within a reasonable amount of time. To
tackle this problem, ACE creates the transition states from linear combinations of reactant
and product interactions. A genetic algorithm is then exploited as a conformational search
engine to optimize the TS structure. ACE has been applied to the Diels-Alder
cycloaddition and the proline-catalysed aldol reaction and has shown good correlation
between observed and predicted selectivities.
RÉSUMÉ
In the pharmaceutical field, virtual screening of large libraries of
molecules is a less expensive alternative to high-throughput screening and is often at
least as effective. Indeed, the development of such tools, and more particularly
of docking methods, has allowed virtual screening to become common practice
in the pharmaceutical industry. However, most docking methods do not
take into account the dynamics of protein/ligand complexes, and more
specifically protein flexibility and the presence of water molecules necessary for
optimal binding. With this in mind, FITTED1.0 was developed and validated on a set
of 33 protein/ligand complexes. FITTED1.0 thus makes it possible to model fully
flexible ternary protein/ligand/water complexes. Further developments were then
necessary to increase its speed and allow its use for the screening of
large libraries. This improved version, FITTED1.5, was applied to the screening of the
Maybridge library against the hepatitis C virus polymerase and led to the
discovery of two new inhibitors. Following these very encouraging results, a
comparative study was undertaken specifically to assess the impact of input
data on the predictive power of the most commonly used docking programs,
including FITTED2.6. We thereby demonstrated that the presence of water and the
starting conformations of the ligand and the protein have a major impact. In
parallel, we drew on our expertise to develop a second virtual screening
tool, ACE1.0, this time applied to the screening of asymmetric catalysts.
In the field of asymmetric catalysis, we needed to predict the structure and potential
energy of transition states within a reasonable time. To do so, ACE creates
transition state structures by linear combination of the reactants and products of the
reaction. A genetic algorithm is then exploited to carry out an exhaustive
conformational search and optimize these structures. ACE was applied to two
well-known reactions of organic chemistry (the Diels-Alder cycloaddition
and the aldol reaction) and demonstrated great predictive power.
ACKNOWLEDGEMENTS
First I would like to thank my PhD supervisor, Dr. Nicolas Moitessier, for without
his mentorship and guidance I would not have matured into the scientist I am today. I
would also like to thank him for always reminding me that, although I am a computational
chemist, I must remember who my audience is: the organic or medicinal chemist.
Secondly I would like to thank all the members, past and present, of the
Moitessier Research Group for all their help, friendship and patience during my studies at
McGill. I would like to specifically thank Pablo Englebienne for always being there to
discuss any computational problem I had, from how to write better code to helping
me understand that deleting the registry of your computer is a bad thing. I would also like
to thank Janice and her Mom for providing baked goods every Monday and Wednesday
like clockwork. Cookies and cakes are always a good incentive to come into work.
I would also like to thank Chantal Marotte, Sandra Aerssen, Fay Nurse, Alison
McCaffrey, Karen Turner and Normand Trempe for helping me navigate the deep
waters of the McGill administration. Without these people, the Otto Maass
Chemistry building would have fallen apart.
I am forever indebted to my wife, MaryAnne, whom I met during my studies at
McGill. I would like to thank her for her patience when I worked at night, on the weekend
and whenever there was a problem with one of my computer programs. Without her I
would not have been able to survive the stresses associated with doing a PhD.
Lastly I am grateful for all the financial assistance from the CIHR Chemical
Biology Fellowship program, ViroChem Pharma, the Robert Zamboni Travel Award,
Pall Dissertation Award and the Udho Parsini Diwan Prize. I am also thankful to RQCHP
for somehow letting me receive over 90 years of CPU time during my PhD.
CONTRIBUTION OF CO-AUTHORS
This thesis consists of an introduction that contains one review draft (Chapter
1.1), three published articles (Chapters 2, 3 and 4) and one draft that has been submitted
for publication (Chapter 5). All the work described in these manuscripts has been carried
out as part of my research for the degree of Doctor of Philosophy in Chemistry.
All the manuscripts have co-authors; their contributions are described below. Dr.
Nicolas Moitessier has been my supervisor throughout my doctoral degree and is a
co-author of each manuscript.
Chapter 2: I wrote the code for the FITTED1.0 suite and conducted all docking
experiments. I prepared most of the testing set, except for HIV-1 protease, which P.
Englebienne prepared.
Chapter 3: I wrote the code improvements for FITTED1.5, conducted the
comparison with FITTED1.0 and conducted the docking and virtual screening studies of
HCV polymerase. P. Englebienne prepared the Maybridge database for virtual screening.
C. G. Yannopoulos, L. Chan, S. K. Das, D. Bilimoria and L. Heureux are responsible for
the biological evaluation of the hit compounds found from the virtual screening.
Chapter 4: I performed all experiments reported and made the improvements to
FITTED.
Chapter 5: I wrote all the code for ACE1.0 and selected the testing set. J. A.
Schwartzentruber optimized the conjugate gradient minimization routine within ACE. S.
Thielges tested ACE and performed the validation studies.
TABLE OF CONTENTS
Title Page ....................................................................................................................... i
Abstract .......................................................................................................................... ii
Résumé ........................................................................................................................... iii
Acknowledgments .......................................................................................................... iv
Contribution of Co-Authors ........................................................................................... v
Table of Contents ........................................................................................................... vi
List of Figures ................................................................................................................ ix
List of Tables ................................................................................................................. xii
List of Equations ............................................................................................................ xiv
Abbreviations ................................................................................................................. xv
Chapter 1: Introduction .............................................................................................. 1
1.1 The challenge of modeling reality in the docking of small
molecules to biological targets .................................................................... 2
Abstract .................................................................................................. 2
Introduction ............................................................................................ 3
Ligand Flexibility ................................................................................... 4
Ring Flexibility ...................................................................................... 13
Protein Flexibility ................................................................................... 16
Predicting Displaceable Key Bridging Water Molecules ....................... 23
Predicting of Metal Geometry ................................................................ 26
Conclusion .............................................................................................. 26
1.2 Application of computational techniques to asymmetric catalyst
Development ................................................................................................ 28
Quantum Mechanics Predictions of Stereomeric Excess ....................... 29
Application of Virtual Screening Techniques to the Field of Asymmetric
Catalyst Development ............................................................................ 34
Conclusion .............................................................................................. 45
1.3 Outline of Thesis .......................................................................................... 46
1.4 References .................................................................................................... 48
Chapter 2: Docking Ligands into Flexible and Solvated Macromolecules. 1.
Development and Validation of FITTED1.0 .................................................................... 71
Abstract .............................................................................................................. 72
Introduction ........................................................................................................ 73
Theory and Implementation ............................................................................... 74
Results and Discussion ....................................................................................... 88
Conclusion .......................................................................................................... 109
Experimental Section ......................................................................................... 110
Preparation of Training Set .................................................................... 111
Docking Study ........................................................................................ 113
Acknowledgements ............................................................................................ 113
References .......................................................................................................... 114
Chapter 3: Docking Ligands into Flexible and Solvated Macromolecules. 2.
Development and Application of FITTED1.5 to the Virtual Screening of Potential
HCV Polymerase Inhibitors ........................................................................................... 119
Abstract .............................................................................................................. 120
Introduction ........................................................................................................ 121
Theory and Implementation ............................................................................... 123
Validation of FITTED1.5 ..................................................................................... 128
Application to the Screening of a Library against HCV Polymerase ................ 132
Conclusion .......................................................................................................... 137
Experimental Section ......................................................................................... 137
Acknowledgements ............................................................................................ 139
References .......................................................................................................... 140
Chapter 4: Docking Ligands into Flexible and Solvated Macromolecules. 3.
Impact of Input Ligand Conformation, Protein Flexibility and Water Molecules on
the Accuracy of Docking Programs ............................................................................... 145
Abstract .............................................................................................................. 146
Introduction ........................................................................................................ 147
Theory and Implementation ............................................................................... 148
Results and Discussion ....................................................................................... 158
Conclusion .......................................................................................................... 171
Experimental Section ......................................................................................... 172
Acknowledgements ............................................................................................ 177
References .......................................................................................................... 177
Chapter 5: Toward a Computational Tool Predicting the Stereochemical Outcome
of Asymmetric Reactions. Development and Application of a Rapid and Accurate
Program Based on Organic Principles ........................................................................... 185
Communication .................................................................................................. 186
References .......................................................................................................... 195
Chapter 6: Conclusion, Future Work and Contributions to Knowledge ...................... 199
Conclusions ........................................................................................................ 199
Future Work ....................................................................................................... 200
Contributions to Knowledge .............................................................................. 201
References .......................................................................................................... 202
Appendix A: Copyright Waivers .................................................................................. 205
Appendix B: Supporting information for Docking Ligands into Flexible and
Solvated Macromolecules. 1. Development and Validation of FITTED1.0 .................... 209
Appendix C: Supporting information for Docking Ligands into Flexible and
Solvated Macromolecules. 2. Development and Application of FITTED1.5 to the
Virtual Screening of Potential HCV Polymerase Inhibitors .......................................... 221
Appendix D: Supporting information for Docking Ligands into Flexible and
Solvated Macromolecules. 3. Impact of Input Ligand Conformation, Protein
Flexibility and Water Molecules on the Accuracy of Docking Programs ..................... 229
Appendix E: Supporting information for Toward a Computational Tool Predicting
the Stereochemical Outcome of Asymmetric Reactions. 2. Development and
Application of a Rapid and Accurate Program Based on Organic Principles ............... 233
Appendix F: FITTED2.6 User Manual ........................................................................... 241
LIST OF FIGURES
Figure 1.1 - Matching algorithm ................................................................................... 5
Figure 1.2 - Incremental construction ........................................................................... 8
Figure 1.3 - Genetic Algorithm ..................................................................................... 10
Figure 1.4 - Monte Carlo ............................................................................................... 12
Figure 1.5 - Possible methods to include protein flexibility ......................................... 17
Figure 1.6 - Proposed mechanisms for proline catalysed aldol reaction....................... 30
Figure 1.7 - Example of bicyclic analogue studied by Shinisha et al. .......................... 31
Figure 1.8 - Proposed mechanisms for osmium tetraoxide asymmetric
dihydroxylation of alkenes ............................................................................................. 31
Figure 1.9 - Imidazolidinone catalysed Diels-Alder Reaction ...................................... 32
Figure 1.10 - Mechanism for Mannich reaction ............................................................ 33
Figure 1.11 - QM/MM study of Sharpless dihydroxylation. ........................................ 34
Figure 1.12 - Palladium catalysed allylation ................................................................. 35
Figure 1.13 - bis(oxazoline)copper(II) catalysed Diels-Alder ...................................... 36
Figure 1.14 - QSSR using Quantum Mechanical Interaction Field analysis in the
design of chiral amino alcohols for alkyl addition to aldehydes. .................................. 37
Figure 1.15 - Hydroboration of alkenes ........................................................................ 38
Figure 1.16 - Dihydroxylation of xylose ....................................................................... 39
Figure 1.17 - Sharpless dihydroxylation catalyst studied for optimization and
validation of genetic algorithm. ..................................................................................... 40
Figure 1.18 - Reactions studied with reverse docking .................................................. 41
Figure 1.19 - Mechanism of the Horner-Wadsworth-Emmons reaction ...................... 43
Figure 1.20 - Mixing of two ground states to find transition state ................................ 43
Figure 1.21 - All energies are calculated on the model PES then projected onto
the true PES using a mixing term .................................................................................. 44
Figure 1.22 - Summary of methods used to find transition states ................................. 45
Figure 2.1 - The binding site of 1d8m .......................................................................... 78
Figure 2.2 - Chromosome describing a protein/water/ligand complex. ........................ 80
Figure 2.3 - Generation of the initial population using a series of filters. .................... 81
Figure 2.4 - 4 possible pairs of children generated after application of two one
point cross-over operations. A horizontal bar represents a gene. .................................. 84
Figure 2.5 - Interaction energy between a methanol molecule and an explicit
water molecule or a displaceable water ......................................................................... 86
Figure 2.6 - Bridging water molecules and flexible binding site residues in TK /
inhibitor complexes. ....................................................................................................... 89
Figure 2.7 - Bridging water molecules and flexible binding site residues in
FXa / inhibitor complexes. ............................................................................................ 90
Figure 2.8 - Docked and crystal structure of 1e2p ligand. ............................................ 92
Figure 2.9 - Crystal structure and proposed docked model for the 1d8m complex. ..... 109
Figure 3.1 - Selected HCV polymerase inhibitors. ....................................................... 122
Figure 3.2 - Helix T perturbation upon inhibitor binding. Blue and grey ribbon
representations are from 2 different X-ray complexes [20, 21]. ................................... 123
Figure 3.3 - Consensus docking.. ................................................................................. 125
Figure 3.4 - Binding site pharmacophore for HCV polymerase. .................................. 125
Figure 3.5 - Selected known actives. ............................................................................ 133
Figure 3.6 - Funnel approach implemented in FITTED ............................................... 134
Figure 3.7 - Active compounds recovered. ................................................................... 136
Figure 4.1 - Conformation of 1nfu ligand ..................................................................... 151
Figure 4.2 - FITTED 1.5 vs. FITTED 2.6 chromosome and the various
docking modes ............................................................................................................... 152
Figure 4.3 - Example of the corner flap approach converting a boat
conformation to a chair .................................................................................................. 153
Figure 4.4 - The assumption of torsion equivalencies. ................................................. 153
Figure 4.5 - Representation of the interaction sites found for 1bwi.............................. 155
Figure 4.6 - Schematic of the generation of the initial population within FITTED2.6. 156
Figure 4.7 - Schematic of the evolution cycle of FITTED 2.6. .................................... 158
Figure 4.8 - Accuracy vs. ligand and protein conformations. ....................................... 164
Figure 4.9 - Self-docking vs. cross-docking for protein. .............................................. 167
Figure 4.10 - Accuracy and water molecules in self-docking experiments. ................. 168
Figure 4.11 - Protein class and accuracy on cross-docking experiments. ..................... 169
Figure 4.12 - Accuracy of program with OMEGA-generated structures ...................... 171
Figure 5.1 - General synthetic scheme and representative dienophiles 1a-f and
dienes 2a-c used in the validation study......................................................................... 188
Figure 5.2 - Predicted vs. observed diastereomeric excesses for 44
Diels Alder reactions. ..................................................................................................... 189
Figure 5.3 - General synthetic scheme and representative catalysts (6a-e) and
aldehydes (5a-d) used in the validation study. ............................................................... 191
Figure 5.4 - Predicted vs. observed diastereomeric excesses for 17 selected cases.. .... 192
Figure 5.5 - Predicted TS structure for the reaction involving 4, 5c and 6a.. ............... 192
Figure 5.6 - ACE predictions and DFT predictions vs. observed diastereomeric
excesses for 4 selected cases .......................................................................................... 193
Figure E.1 - Data represented as Entry # versus ΔΔG. ................................................. 238
Figure E.2 - Data represented as Entry # versus ΔΔG. ................................................. 239
LIST OF TABLES
Table 2.1 - Self-docking – HIV-1 protease inhibitors. .................................................. 97
Table 2.2 - Self-docking – Thymidine kinase inhibitors. .............................................. 98
Table 2.3 - Self-docking – Factor Xa, trypsin and MMP-3 inhibitors. ......................... 99
Table 2.4 - Cross-docking and docking to multiple conformations -
HIV-1 protease inhibitors. ............................................................................................. 100
Table 2.5 - Cross-docking and docking to multiple conformations –
Thymidine kinase inhibitors. ......................................................................................... 101
Table 2.6 - Cross-docking and docking to multiple conformations –
Factor Xa, trypsin and MMP-3 inhibitors. ..................................................................... 102
Table 2.7 - Docking to flexible proteins - HIV-1 protease inhibitors. .......................... 103
Table 2.8 - Docking to flexible proteins - thymidine kinase inhibitors. ....................... 104
Table 2.9 - Docking to flexible proteins – Factor Xa, trypsin and MMP-3 inhibitors .. 105
Table 2.10 - Docking accuracy (%): rigid proteins. ...................................................... 106
Table 2.11 - Docking accuracy (%): flexible proteins. ................................................. 106
Table 3.1 - Comparison of FITTED 1.0 with FITTED 1.5 ........................................... 129
Table 3.2 - Docking of HCV polymerase inhibitors to the allosteric site
with FITTED 1.5. ........................................................................................................... 130
Table 3.3 - Docking of HCV polymerase inhibitors to the catalytic site. ..................... 131
Table 3.4 - Focused libraries based on MatchScore > 75 and RankScore as indicated 135
Table 4.1 - Testing set of ligand / protein complexes. .................................................. 159
Table 4.2 - Comparison of success rates of FITTED versions 1.0, 1.5 and 2.6
using the “Dock” Docking mode. .................................................................................. 161
Table 4.3 - Comparison of time and number of runs required for various versions
of FITTED when the “Dock” docking mode is selected for rigid protein docking. ..... 161
Table 4.4 - Abbreviations used in Figure 4.8 ................................................................ 165
Table 4.5 - List of ligands used to define protein binding sites .................................... 174
Table B.1 - HIV-1 Protease mono-alcohol inhibitors ................................................... 209
Table B.2 - HIV-1 Protease diol inhibitors ................................................................... 210
Table B.3 - Thymidine Kinase inhibitors ...................................................................... 211
Table B.4 - Factor Xa inhibitors. ................................................................................... 212
Table B.5 - Trypsin inhibitors. ...................................................................................... 213
Table B.6 - Stromelysin-1 inhibitors. ............................................................................ 214
Table C.1 - Self-docking – HIV-1 protease. .................................................................. 221
Table C.2 - Self-docking – Thymidine kinase inhibitors. ............................................. 222
Table C.3 - Self-docking – Factor Xa, trypsin and MMP-3 inhibitors. ........................ 223
Table C.4 - Docking to flexible proteins - HIV-1 protease inhibitors. ......................... 224
Table C.5 - Docking to flexible proteins - thymidine kinase inhibitors........................ 225
Table C.6 - Docking to flexible proteins – Factor Xa, trypsin and
MMP-3 inhibitors. .......................................................................................................... 226
Table C.7 - Docking accuracy – FITTED 1.0 vs. FITTED 1.5. ......................................... 227
Table D.1 - Accuracy of the 6 docking programs using various conditions and
self-docking experiments with dry protein. ................................................................... 229
Table D.2 - Accuracy of the 6 docking programs using various conditions and
self-docking experiments with proteins with waters. .................................................... 230
Table D.3 - Accuracy of the 6 docking programs using various conditions and
cross-docking experiments with dry proteins. ............................................................... 231
Table D.4 - Accuracy of the 6 docking programs using various conditions and
cross-docking experiments with dry proteins. ............................................................... 232
LIST OF EQUATIONS
Equation 1.1 – EVB model Energy .............................................................................. 44
Equation 1.2 – EVB equation for projecting the model PES onto the true PES ............ 44
Equation 2.1 – Water Switching function .................................................................... 85
Equation 2.2 – Probability of mutation of water .......................................................... 87
Equation 3.1 – Sphere weight ....................................................................................... 125
Equation 3.2 - MatchScore ........................................................................................... 125
Equation 4.1 – Calculation of minimum MatchScore .................................................. 157
Equation 5.1 – Linear combination of reactant and product ........................................ 187
LIST OF ABBREVIATIONS
ENM Elastic Network Model
EVB Empirical Valence Bond
GA Genetic Algorithm
HTS High Throughput Screening
iGluR2 Ionotropic glutamate Receptor
MCMM Multi Configurational Molecular Mechanics
MCSS Multiple Copy Simultaneous Search
MD Molecular Dynamics
MM Molecular Mechanics
NEB Nudged Elastic Band
NMA Normal Mode Analysis
PDB Protein Data Bank
PES Potential Energy Surface
PSO Particle Swarm Optimization
QM Quantum Mechanics
QSAR Quantitative Structure Activity Relationship
QSSR Quantitative Structure Selectivity Relationship
RBD Rigid Body Docking
RMSD Root Mean Square Deviation
SPE Stochastic Proximity Embedding
TS Transition State
TSFF Transition State Forcefields
VS Virtual Screening
CHAPTER 1
CHAPTER ONE
1. INTRODUCTION
Molecular discovery is in essence the search for novel molecules which would
perform a given task. With the increasing pressure from modern society to perform
greener, safer chemistry with rapid results, there has been a push for alternative methods
and technologies for molecular discovery, one alternative being computational methods.
These methods would allow a chemist to assess and develop many ideas and concepts.
Computational methods have found a home in the pharmaceutical industry, where this
pressure is applied not only by society but also by the need to increase profits [1-3]. There
have been many successful uses of computational methods [4, 5] in the field of drug design
and development, yet this success has not translated to many other fields of chemistry
such as asymmetric catalyst development. Herein the state-of-the-art in docking-based
virtual screening methods and application of virtual screening techniques to the field of
asymmetric catalyst development are discussed.
1.1 THE CHALLENGE OF MODELING REALITY IN THE DOCKING OF
SMALL MOLECULES TO BIOLOGICAL TARGETS
ABSTRACT
From virtual screening to understanding the binding mode of novel ligands, docking
methods are being increasingly used at multiple points in the drug discovery pipeline.
Unfortunately, in many cases, the accuracy of docking programs is greatly affected by the
amount of information given to them. To improve the docking accuracy, developers have
been moving towards the superior modeling of the dynamics involved in the protein-
ligand binding process. The problem of modeling reality can be broken down into several
factors including: 1) the modeling of ligand flexibility, 2) the modeling of receptor
flexibility and 3) the modeling of bridging water molecules. Other factors such as metal
coordination should also be considered. Each of these problems requires a separate or
combined conformational search technique. In this review, we will discuss the current
progress in the development of search engines for modeling ligand flexibility, including
cyclic portions, followed by recent progress in considering receptor flexibility and
bridging water molecules and finally the inclusion of metal coordination geometry in
docking.
INTRODUCTION
Due to the advances in crystallography and nuclear magnetic resonance
spectroscopy, there is an ever-expanding knowledge of structural information for
potential therapeutic targets. This increase of knowledge has added pressure within the
drug discovery community to provide new drugs in a more time- and cost-effective
manner. This pressure has accelerated the acceptance and integration of computer-aided
drug design methods within the toolkit of chemists in the pharmaceutical industry.1-3
These increasingly accurate tools provide viable alternatives to traditional experimental
approaches such as high throughput screening. Nowadays, computational techniques can
be found in many aspects of the drug discovery pipeline, particularly in the field of lead
discovery with the most popular virtual screening (VS) methods.6 In the last decade, there
have been a number of VS success stories,4, 5 yet there have been only a few reported
studies directly comparing VS with experimental screening campaigns.7-9 Interestingly, in
these latter cases, VS provided more fruitful results than traditional experimental
approaches.
There is a plethora of VS methods available to medicinal chemists today ranging
from ligand-based approaches10 including QSAR,11 ligand similarity searching12, 13 and
ligand pharmacophore screening14 to protein structure-based approaches such as docking
programs.15 Virtually docking ligands to biological targets is one of the more popular
techniques for VS. This popularity is in large part due to the ever-increasing number of
protein crystal structures publicly available within the Protein Data Bank,16 not to mention
the large number of proprietary structures within pharmaceutical companies.
One of the first docking studies was reported by Levinthal et al., who performed a
protein-protein docking simulation to predict possible conformations of hemoglobin
fibers.17 This work then inspired the development of the first small molecule / protein
docking tool, namely DOCK.18 In order to reduce the computational cost associated with
the conformational sampling, both these studies only allowed for the optimization of the
relative orientation and/or translation of the two molecules being assembled. These early
methods perform what is known as rigid body docking (RBD).18-20 With the exponential
increase in computational power, protein-ligand docking programs have evolved into
“flexible” docking programs which incorporate the flexibility of ligands. Although these
second generation programs optimize the conformation, orientation and translation of
the ligand, they still treat the protein as a rigid object.
While a number of these programs were reported, comparative studies have often
only evaluated their relative accuracy in predicting the binding mode of known ligands21-27
and only recently their ability to virtually screen libraries of small molecules and
identify active compounds from these libraries.21, 27-32 Many of these studies showed that
docking the ligand back to its native protein structure (i.e., co-crystallized with this
specific ligand), also referred to as self-docking, was quite successful. However, when
docking to another protein structure (referred to as cross-docking), the accuracy of most
of the programs dropped significantly.33-35 This drop has been attributed to the rigid
protein model used by these programs (lock and key model). In reality, protein/ligand
assemblies in cells are complex dynamic multiple component systems which encompass
numerous variables such as protein flexibility, displaceable bridging water molecules,
metal coordination and many more.
Even with the advances in the speed of modern day computers, a systematic search
of the entire conformational space available to the protein and ligand during docking
remains intractable. Gehlhaar et al.36 surmised that there may be up to 10^30 possible
solutions for one of the complexes that they were examining. If it were possible to
evaluate 1 million distinct solutions per second, it would still take over a trillion years to
perform a systematic search on a current CPU. Therefore modeling reality requires the
development of docking algorithms able to search the conformational space quickly and
efficiently, in order to locate the global solution on the multidimensional binding free
energy hypersurface.
With the growing popularity of docking programs, there was a need for a technical
review of the literature covering the various methods available to model the binding of
protein-ligand complexes. Herein, we review the search algorithms developed over the
last twenty years or so to address ligand and protein flexibility, bridging water molecules
and metal coordination.
LIGAND FLEXIBILITY
Two major criteria for docking performance are that the conformational search of the
ligands be both accurate and time-efficient. Unfortunately, accuracy and speed trade off
against each other and should be properly balanced. Throughout the
years, many conformational search algorithms have been developed to assess the
flexibility of the ligand. The first instance of flexible ligand docking appeared in 1985
when Ghose and Crippen37 used a distance geometry method, followed by Goodsell and
Olson38 who used a simulated annealing approach to account for ligand flexibility.
Nowadays, most docking programs incorporate one or more of the following four ligand
conformational search algorithms: shape complementarity, incremental construction,
Monte Carlo and genetic algorithms. With these techniques, only the acyclic portions of
the ligand are considered flexible. Searching the conformational space of rings
dramatically increases the search space and necessary computational time. Below are
described the various approaches illustrated with the most popular programs.
Shape Complementarity. Some of the most widely used conformational search algorithms
in small molecule / protein docking rely on shape complementarity techniques. These methods
evaluate how well the ligand orientation and translation (referred to as a pose) match with
the protein binding site. These matches can be defined in terms of molecular interactions
as in DOCK,18 FITTED,39 FlexX40 and Surflex,41, 42 or geometrical matching as in SHEF.43
Figure 1.1 - Matching algorithm.
When matching ligand and protein molecular interacting groups, most programs
work in a similar manner (Figure 1.1). First, the generation of an initial conformation of
the ligand is followed by the creation of a ligand and protein pharmacophore. A subset of
ligand points is then chosen from which a distance matrix is built. This latter step is
repeated for a subset of protein points. These two distance matrices are then compared to
determine a match between a subset (typically 3 or 4) of protein points and a subset of
ligand points. Once a match is found, a translation/rotation matrix is calculated to overlay
the best matching points in the same frame of reference. This matrix is next applied to the
ligand conformation hence positioning the ligand in the protein binding site.
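The triangle comparison at the heart of such matching schemes can be sketched as follows. This is a minimal illustration and not the implementation of any of the programs cited: it ignores interaction types and the final translation/rotation step, and the function names and the tolerance value are assumptions chosen for the example.

```python
from itertools import combinations
from math import dist


def side_lengths(triangle):
    """Sorted pairwise distances of three points (a flattened distance matrix)."""
    return sorted(dist(p, q) for p, q in combinations(triangle, 2))


def triangle_mismatch(ligand_triangle, protein_triangle):
    """Largest difference between corresponding sorted side lengths."""
    pairs = zip(side_lengths(ligand_triangle), side_lengths(protein_triangle))
    return max(abs(a - b) for a, b in pairs)


def best_triangle_match(ligand_points, protein_points, tolerance=0.5):
    """Return (mismatch, ligand_triangle, protein_triangle) or None.

    Every 3-point subset of the ligand is compared with every 3-point
    subset of the protein; the pair closest in size is kept, provided the
    mismatch is within the (hypothetical) tolerance, in the same length
    unit as the coordinates.
    """
    best = None
    for ligand_triangle in combinations(ligand_points, 3):
        for protein_triangle in combinations(protein_points, 3):
            error = triangle_mismatch(ligand_triangle, protein_triangle)
            if error <= tolerance and (best is None or error < best[0]):
                best = (error, ligand_triangle, protein_triangle)
    return best
```

In a real program the matched triangles would then be superimposed (for instance by a least-squares fit) to obtain the translation/rotation matrix applied to the ligand.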
However, more than a single match can be found for a single ligand conformation.
Thus, comparing multiple subsets of ligand and protein matrices can yield a series of
probable matches and determining the best match can be difficult. The determination of
this best match is where most programs vary. For instance, an early version of DOCK18
first prepares a list of pairs by systematically pairing each ligand point with all of the
protein points. DOCK then selects one ligand point and searches for another ligand point
that is within a cut-off distance of the first ligand point. It then selects one of the
protein points matching the first ligand point and searches for a second protein point
whose distance from it is similar to the distance between the two ligand points. This is repeated
with all the points in the list until no new pair can be added. The best match is then
defined as the list with the most matching points. FITTED,39 FlexX,40 and others44, 45 use
3-point matching algorithms. For each newly generated conformation of the ligand, one
or more triangles are selected from the ligand and compared to the triangles generated from
the protein interaction sites (e.g., hydrogen bond donor groups). The ligand triangle /
protein triangle pair which is closest in size is considered the best match. It is also
possible to use more than three points to account for the chirality of the pocket and
ligands.46
Surflex41, 47 uses a morphological similarity score which measures the matches
between possible protein interactions (referred to as protomol within Surflex) and the
ligand. A newer version of DOCK48 implemented a bipartite graph matching algorithm
which is similar to a three point matching algorithm except that the vertices of the protein
triangle are matched one at a time. A point of the ligand triangle is selected and translated
to overlay with a matching protein point. A second ligand triangle point is then
superimposed onto a matching protein point whose distance from the first protein point is
similar to the distance between the two ligand points. This is repeated
with the last ligand triangle point.
Geometrical matching methods evaluate the fit between the ligand shape and
binding site cavity. The method developed by Yamagishi et al.49 models the pocket and
ligand as sets of spherical harmonic functions whose contour lines resemble human
fingerprints. It then matches the ligand to the
protein by using fingerprint matching techniques. eHiTS50, 51 represents the ligand and
pocket as a series of polyhedra. An interaction type (e.g., hydrophobic, H-bond donor) is
then assigned to each polyhedron vertex. The match between the ligand and protein
vertices is calculated by a knowledge based scoring function. This function evaluates the
energy of the system based on the distance and relative orientation (angles and torsions)
of surface point pairs on the ligand and the receptor. The scoring is based on a statistical
collection of data from high resolution PDB complexes.
Incremental Construction. Incremental construction algorithm-based programs build the
molecule on-the-fly within the binding pocket, thereby addressing ligand flexibility. This
technique has been widely implemented in de novo design programs, which propose
potential novel tight binding ligands.52-55 Incremental construction techniques have also
been developed for docking small molecules to proteins (see Figure 1.2). In this context,
the ligand is first broken into multiple rigid fragments typically at rotatable bonds and
ring systems. An anchor fragment is selected and first docked. The adjacent connecting
fragment is subsequently added and this process is repeated until the entire molecule is
reconstructed. Even though most programs have this basic structure there are multiple
variations of incremental construction methods.
Figure 1.2 - Incremental construction.
Another algorithm in DOCK56, 57 fragments the molecule at all rotatable bonds,
identifies an anchor (i.e., a rigid fragment) then places it using a matching algorithm. The
N best poses for this fragment are selected for the next stage. It is possible for DOCK to
identify multiple anchors requiring that the method described below be repeated for each
anchor. The program adds the adjacent fragment of the molecule, creating multiple
conformations of the fragment around the newly formed bond. The conformations are
then pruned to only keep the N best scoring conformations to reduce the combinatorial
explosion that would occur if all of them were kept. FlexX40 also uses a similar algorithm
which its developers refer to as a greedy algorithm. Within SLIDE,58, 59 fragments are rotated
on-the-fly to remove undesired clashes between the ligand under construction and the protein
instead of archiving multiple conformations. When building up the ligand, Surflex42, 47
adds fragments in a conformation which maximizes the morphological similarity between
ligand and the protomol. Multiple poses are created with the best scoring poses
undergoing gradient based optimization. Incremental construction assumes that the rest of
the molecule (not added yet) does not affect the conformation of a fragment. Surflex
overcomes this major assumption by including a whole molecule approach to the docking
of fragments.42 When performing the conformational search of a fragment, the rest of the
ligand is still present in its initial input conformation. Thus Surflex quickly removes
fragment conformations which are acceptable on their own, but clash when the ligand
is rebuilt. Another program, eHiTS,50, 51 uses a novel take on the incremental construction
algorithm. Unlike the previous algorithms, all rigid fragments are simultaneously and
independently docked and scored. The algorithm then attempts to reconnect the best
scoring fragment poses into the complete ligand. For this purpose, the distances between
the connecting atoms of the rigid fragments are calculated. If the distance corresponds to
the length of a flexible connecting chain, the optimal conformation of this chain is
calculated and the fragment and connecting chain are linked.
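The anchor-and-grow scheme with pruning to the N best partial poses can be illustrated with a small sketch. It is deliberately schematic: a pose is reduced to one torsion angle per added fragment, and the scoring function, the candidate angles and all names are hypothetical stand-ins rather than those of any docking program.

```python
from typing import Callable, List, Sequence, Tuple

Pose = Tuple[float, ...]  # one torsion angle per fragment added so far


def incremental_construction(
    n_fragments: int,
    candidate_angles: Sequence[float],
    score: Callable[[Pose], float],
    n_best: int,
) -> List[Pose]:
    """Grow poses one fragment at a time, keeping the n_best lowest scores."""
    partial: List[Pose] = [()]  # the placed anchor, before any torsion is set
    for _ in range(n_fragments):
        # Try every candidate torsion for the new fragment on every kept pose.
        grown = [pose + (angle,) for pose in partial for angle in candidate_angles]
        # Prune to the n_best poses to limit the combinatorial explosion.
        grown.sort(key=score)
        partial = grown[:n_best]
    return partial


# Hypothetical score: squared deviation from made-up "ideal" torsions.
IDEAL = (60.0, 180.0, 300.0)


def toy_score(pose: Pose) -> float:
    return sum((angle - ideal) ** 2 for angle, ideal in zip(pose, IDEAL))


poses = incremental_construction(3, (60.0, 180.0, 300.0), toy_score, n_best=2)
print(poses[0])
```

Because partial poses are scored without the fragments still to be added, such a scheme embodies the assumption discussed above that the missing part of the molecule does not influence the fragment being placed.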
Genetic Algorithms. The first genetic algorithms60-62 implemented in docking programs
were derived from Darwin’s theory of evolution,63 which proceeds through natural
selection. Throughout this process, favourable genes (as defined by a fitness function) are
passed on to the next generations of an evolving population and unfavourable ones are
eliminated64 (see Figure 1.3). Within a docking program, a population is defined as a set
of individuals with each individual representing a distinct pose. The pose is encoded
within a chromosome made up of a series of genes which represent the ligand
conformation (values of the rotations about rotatable bonds), orientation and position in
space (in the protein reference frame). An initial population is created by assigning values
to each gene of each of the individuals and the population is then allowed to evolve
through the passing of genetic information (reproduction). As described in Darwin’s
theory, genetic operators such as crossover and mutation will modify/optimize the
population over time. Crossover of genetic information proceeds by selecting two parent
individuals and switching the values of the genes from a crossover point in the
chromosome onward. Mutations are simulated by random modification of a gene within a
chromosome. A number of variations (e.g., chromosome definition, selection of the next
generation, genetic operators) have been implemented in docking programs.
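The two genetic operators described above can be sketched in a few lines. In this illustration a chromosome is simply a list of floating-point genes; the gene layout, the mutation range and the function names are assumptions made for the example and do not reproduce the encoding of any particular program.

```python
import random


def one_point_crossover(parent_a, parent_b, point):
    """Swap the genes of two parents from the crossover point onward."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


def mutate(chromosome, gene_index, rng, low=-180.0, high=180.0):
    """Return a copy with one gene replaced by a random value.

    The [low, high] range is a hypothetical choice suited to a torsion
    angle in degrees; translation genes would need their own range.
    """
    mutated = list(chromosome)
    mutated[gene_index] = rng.uniform(low, high)
    return mutated


# Example: two parents, each encoding four torsion angles.
rng = random.Random(42)
child_a, child_b = one_point_crossover(
    [10.0, 20.0, 30.0, 40.0], [50.0, 60.0, 70.0, 80.0], point=2
)
child_a = mutate(child_a, gene_index=0, rng=rng)
```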
Figure 1.3 - Genetic Algorithm.
The DOCK60 developers have also implemented a genetic algorithm, although the
geometric matching is still recommended in the newest versions.57 An elitist strategy is
employed in the selection of the next generation where some of the fittest individuals
(among the parents) are saved into the next generation along with the best scoring
individuals resulting from crossover and mutation (the offspring). DIVALI62 improved on
this approach by incorporating the bond rotations as genes. Within GOLD,61, 65 bond
rotations are encoded as genes but the orientation within the pocket is represented by a
series of genes which map interaction points of the ligand with points on the protein. This
mapping is used to orient the ligand within the pocket using least-squares fitting. GOLD
also differs from the previous programs by incorporating a roulette wheel selection of
parents, while random selection is usually done by others. Roulette wheel selection
proceeds through favouring individuals for reproduction based on their fitness. The more
fit an individual is, the more chance it has to couple. GOLD also uses a steady state
selection of the next generation. This occurs when the individuals are crossed over and/or
mutated. If the newly generated individuals (children) have better fitness scores than the
parents, then the parents are replaced. Other programs will select the next generation
from the whole set of parents and children. GOLD has also implemented the use of
islands (also known as niches) to allow sub-sets of the population to evolve
independently of each other with the occasional exchange of individuals between islands.
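Roulette wheel selection, as described above, can be sketched as follows. The sketch assumes non-negative fitness values for which larger is better; a docking score expressed as an energy (lower is better) would first have to be converted. The function name and signature are illustrative and are not taken from GOLD.

```python
import random


def roulette_wheel_select(fitnesses, rng):
    """Pick an index with probability proportional to its fitness."""
    total = sum(fitnesses)
    pick = rng.uniform(0.0, total)  # where the wheel stops
    running = 0.0
    for index, fitness in enumerate(fitnesses):
        running += fitness
        if pick < running:
            return index
    return len(fitnesses) - 1  # guard against floating-point round-off


rng = random.Random(7)
parent_index = roulette_wheel_select([1.0, 4.0, 5.0], rng)
```

With fitness values of 1, 4 and 5, the third individual would be expected to be chosen as a parent about half of the time.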
Prior to Darwin’s theory of evolution (Charles Darwin, 1809-1882), Jean
Baptiste Lamarck (1744-1829) had proposed a different theory, the theory of inheritance
of acquired characteristics.66 Although similar (both described evolution toward a “best”
solution), these two theories differ in the information transmitted from one generation to
the next. Lamarck believed that changes which occurred during the life of an individual are
passed to the offspring while Darwin thought these changes did not affect the evolution. Although the
docking programs are primarily based on Darwinian evolution, a Lamarckian flavour
has been found to improve the docking efficiency. Within the realm of docking this is
done through local conformational search. Within AutoDock,67 the Lamarckian aspect of
the algorithm is added by performing small perturbations of the genes within the
chromosome. If the perturbed individual has a better fit than the unperturbed individual,
then the original’s chromosome is replaced. These perturbations will then be passed to
the next generation. AutoDock also uses an elitist strategy in selecting the next
generation. Another docking program incorporating a Lamarckian genetic algorithm is
FITTED,39, 68, 69 which uses a conjugate gradient energy minimization algorithm as the local
search method. The evolution as implemented in FITTED also incorporates novel genetic
operators. A probability of optimization is applied to individuals created after crossover
and mutation. Thus only a fraction of the offspring proceeds through the conjugate
gradient energy minimization. Also the newly created generation has a probability of
learning. If an individual is selected for education, it will be optimized by local search.
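The Lamarckian step, in which an improvement found by local search is written back into the chromosome, can be sketched as below. A simple random perturbation stands in for the local search; it is not the local search routine of AutoDock nor the conjugate gradient minimization of FITTED, and the step size, number of trials and names are assumptions.

```python
import random


def lamarckian_local_search(chromosome, energy, rng, step=5.0, n_trials=50):
    """Perturb the genes and keep every change that lowers the energy.

    The returned genes replace the original chromosome, so the
    "acquired" improvement is inherited by the offspring.
    """
    best = list(chromosome)
    best_energy = energy(best)
    for _ in range(n_trials):
        trial = [gene + rng.uniform(-step, step) for gene in best]
        trial_energy = energy(trial)
        if trial_energy < best_energy:
            best, best_energy = trial, trial_energy
    return best, best_energy


# Hypothetical energy function with its minimum at the origin.
def toy_energy(genes):
    return sum(gene * gene for gene in genes)


improved, improved_energy = lamarckian_local_search(
    [10.0, -10.0], toy_energy, random.Random(0)
)
```

Applying such a function to only a randomly chosen fraction of the offspring would correspond to the probability of optimization mentioned above.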
Monte Carlo. Monte Carlo techniques employ random or pseudo-random modifications
of bond rotations, translations and rotations of the ligand pose (see Figure 1.4). The
resulting pose is next analysed. If it has a better score than the previous one then it is
saved. When the score is not better it is not outright rejected and can be saved based on a
selection criterion such as the Metropolis criterion. The temperature-dependent
Metropolis criterion allows for higher energy structures to be accepted with the
probability decreasing with increasing potential energy. It is possible to tune how strict
the criteria are by selecting an appropriate z value (see Figure 1.4). This approach allows
for passing of high energy barriers on the potential energy surface (PES). If the pose is
rejected another set of manipulations is tried and the process is repeated.
Figure 1.4 - Monte Carlo.
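The acceptance test can be sketched using the usual textbook form of the Metropolis criterion, in which a move that raises the energy by ΔE is accepted with probability exp(-ΔE / kT). Whether a given docking program uses exactly this expression is not stated here; the function name and the choice of kcal/mol units are assumptions.

```python
import math
import random


def metropolis_accept(old_energy, new_energy, temperature, rng,
                      k_boltzmann=0.0019872):
    """Decide whether to keep a new pose (energies assumed in kcal/mol).

    k_boltzmann is the Boltzmann constant in kcal/(mol K); temperature
    is in kelvin and must be positive.
    """
    delta = new_energy - old_energy
    if delta <= 0.0:
        return True  # a better (or equal) pose is always kept
    # A worse pose is kept with a probability that decays with delta.
    return rng.random() < math.exp(-delta / (k_boltzmann * temperature))


rng = random.Random(3)
keep = metropolis_accept(old_energy=-7.2, new_energy=-6.9,
                         temperature=300.0, rng=rng)
```

Raising the temperature makes the test more permissive, which is what allows the search to cross energy barriers.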
Among the docking programs using Monte Carlo as a conformational search tool
are ICM, Glide and LigandFit. ICM70, 71 uses a Brownian movement Monte Carlo
technique, which imposes restrictions on the random moves; large changes in the bond
rotations of the ligand should be accompanied by small rotations and translations of the
entire molecule. As part of a multistage funnel approach to docking, which uses a set of
hierarchical filters, Glide72 carries out pose refinement through a Monte Carlo algorithm.
Like ICM, LigandFit73 does not select truly random moves to generate new conformations.
The rotations about bonds are based on the number of atoms connected to the bond.
For example, a torsion which rotates 25 atoms in a molecule with 50 atoms would have a
10° resolution while a torsion which rotates 10 atoms would have a resolution of 5°. This new
conformation is first evaluated based on its shape match with the protein followed by
an energy calculation.
Swarm Intelligence. In the past few years, new algorithms based on swarm intelligence
have been introduced in the field of docking. Swarm intelligence is inspired by the
movement of a swarm of birds when one of them finds food. Within docking this can be
translated into a conformational move of a population following the fittest individual of
this population. Two methods of swarm intelligence have found their way into docking:
particle swarm optimization (PSO) and ant colony optimization.
PSO implemented within SODOCK74 is based on the traditional sense of swarm
intelligence. In this program, an initial set of poses is created randomly. During the next
iteration, each rotatable bond, torsion and orientation of all the poses are transformed
based on their distance to the best solution. The distance each of them moves is referred to as its
velocity. PSO@AutoDock75 is a variation of AutoDock67 which implements a PSO
method to overcome high energy barriers. In this program, the velocity of a
conformation is only updated if its fitness is worse than that of the previous one.
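A single PSO move can be sketched with the standard textbook velocity update, in which each variable is pulled toward the best value found by the particle itself and toward the best value found by the whole swarm. The inertia and acceleration coefficients below are common illustrative defaults; they are assumptions and are not taken from SODOCK or PSO@AutoDock.

```python
import random


def pso_step(position, velocity, personal_best, global_best, rng,
             inertia=0.7, c_personal=1.5, c_global=1.5):
    """Return the new position and velocity of one particle (pose)."""
    new_position = []
    new_velocity = []
    for x, v, p, g in zip(position, velocity, personal_best, global_best):
        # The pull grows with the distance to the best known values.
        v_next = (inertia * v
                  + c_personal * rng.random() * (p - x)
                  + c_global * rng.random() * (g - x))
        new_velocity.append(v_next)
        new_position.append(x + v_next)
    return new_position, new_velocity


rng = random.Random(11)
pose, speed = pso_step(
    position=[0.0, 90.0], velocity=[0.0, 0.0],
    personal_best=[5.0, 60.0], global_best=[10.0, 45.0], rng=rng,
)
```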
Ant colony optimization has been implemented in the docking program PLANTS.76
This global optimization technique is inspired by the way ants search for and
locate food. When ants find food, they release pheromones on their way back to
their nest. This pheromone path will next be used by the colony to go back to the food
source. Each ant may take a different path but some paths cross each other. When ants
return to the food source, they follow the path with the strongest (most) pheromones. As
implemented in PLANTS, the ant colony optimization algorithm creates a population of
distinct conformations called ants. Each ant (i.e., pose) has a set of rotations, translations
and rotatable bonds. The best individual of the population deposits a pheromone on each
value of its set. In the next iteration, the probability of assigning a given value to a member
of the set is directly proportional to the amount of pheromone deposited on that value up to that point.
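The pheromone bookkeeping described above can be sketched as follows. Each variable of a pose (a rotation, a translation or a rotatable bond) is assumed to take one of a few discrete values, and each value carries a pheromone amount. The discretization, the deposit amount and the function names are hypothetical, and details of the published algorithm such as pheromone evaporation are omitted.

```python
import random


def deposit(pheromones, best_pose, amount=1.0):
    """The best ant reinforces every value it used."""
    for variable, value in enumerate(best_pose):
        pheromones[variable][value] += amount


def sample_pose(pheromones, rng):
    """Build a new ant, choosing each value in proportion to its pheromone."""
    pose = []
    for pheromone_by_value in pheromones:
        values = list(pheromone_by_value)
        weights = [pheromone_by_value[value] for value in values]
        pose.append(rng.choices(values, weights=weights, k=1)[0])
    return pose


# Two torsions, each allowed three values (degrees), uniform at the start.
trail = [{0: 1.0, 120: 1.0, 240: 1.0} for _ in range(2)]
deposit(trail, best_pose=(120, 240), amount=7.0)
new_ant = sample_pose(trail, random.Random(5))
```

After the deposit, the value 120 of the first torsion carries 8 of the 10 pheromone units and would therefore be expected in about 80% of the new ants.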
RING FLEXIBILITY
The algorithms mentioned above have one major constraint; they only address the
flexibility of acyclic portions of the ligands. This is a major issue of concern when
docking large libraries of ligands since only one conformation of the ligand ring may be
used and it may not be the bioactive one.77-79 Even if the most thermodynamically
favoured ring conformation is docked, the protein may stabilize (i.e., bind tightly) a
higher energy conformation. Therefore an accurate ligand pose may not be found if the
ring conformation is not searched while docking. There are two options when
incorporating the flexibility of rings for docking programs, either the ring conformations
can be searched before the docking run using conformation generators or on-the-fly by
the docking program itself.
Ring Flexibility Through Conformational Generators. All docking programs can consider
ring flexibility if a pre-computed ensemble of ligand conformations is docked. These
ensembles can be generated by a conformational generator, with each conformation then
docked independently and the results merged at the end. One of the most common methods to
incorporate ring flexibility within conformational search tools is to include a library of
ring conformations. This technique is used within OMEGA,80 which is a tool to prepare
ligands for docking. OMEGA is based on a depth-first, divide and conquer approach to
generate conformations of ligands. Fragmentation of the ligand is followed by the
generation of multiple conformations for each of these fragments. The conformations are
next evaluated using an energy calculation and sorted. From this data, OMEGA
reassembles the fragments into a ligand structure. LigPrep81 is another ligand preparation
program which uses a library of known ring conformations. LigPrep first identifies
flexible rings and matches them to a template. Their relative energies are then estimated
using an energy associated with the ring template, axial-equatorial energies and short
range pair-wise repulsions between atoms directly bound to the ring. This data is then
used to identify the most favourable ring conformation for a single flexible ring. For
molecules with multiple rings, a total ring energy is calculated using the sum of the
energies of all the rings present. Following the optimization of the ring, LigPrep proceeds
with a Monte Carlo search to optimize the acyclic portions of the ligand. Although easy
to implement, ring conformation libraries are limited to small-sized rings as most of these
libraries do not cover medium and large size rings.
It is also possible to create ring conformations de novo instead of using ring
libraries. In contrast to ring conformation libraries, this approach is not limited to a
predefined set of conformations but is more time consuming. CORINA,82 another ligand
conformation generator, uses a rule based approach for acyclic portions. To search rings,
it creates a circle with a size dependent on the number of atoms within the rings. sp3
Atoms are then added either in the plane of the circle or above and below the plane
alternately. sp2 Atoms are left in the plane with consideration of cis and trans
geometries. For polycyclic systems a backtracking algorithm is used which first finds all
the possible conformations of the smaller rings within the polycyclic system. The lowest
energy conformation is tried first and each ring is added systematically. If a conformation with
the lowest energy ring cannot be found the next lowest is tried. This is repeated until a
conformation can be created. Another program, CONCORD,83, 84 uses a rule based
approach for acyclic portions. In contrast, ring conformations are determined by an
algorithm which minimizes the ring strain within internal coordinates. Another technique,
stochastic proximity embedding (SPE),85 finds conformations by imposing geometrical
constraints. A constraint, either volume or distance, is first selected then the atomic
coordinates are modified until the conformation fits this constraint. These constraints are
defined by a set of rules, in which certain functional groups which match specific
substructures are assigned a constraint. These rules can be defined by a range around an
equilibrium bond length, angle, torsion or pair wise interactions between protein and
ligand determined by statistical analysis of the PDB. Rubicon85 uses a similar approach
but with a metric matrix algorithm that generates conformations which fit the constraint
instead of modifying the atomic coordinates.
These generated libraries of conformations can next be docked with RBD methods.
Another option is to generate the ensemble of ligand conformations as a first step at
the start of a docking run. This is often achieved with template libraries incorporated into
incremental construction methods. This implementation is straightforward as each new
conformation is considered as a separate possibility for a fragment as in Surflex or FlexX.
Surflex acts via a two step procedure. First, templates of 5-7 membered hydrocarbon rings
are mapped onto the ligand rings ignoring the atom types. An energy minimization
routine next optimizes the ring shape considering the atom types. FlexX calls CORINA at
the beginning of a docking run to generate multiple conformations of flexible rings.86
Glide also uses a template library (same library as in LigPrep) to generate multiple
conformations of rings in the first step of its hierarchical filter funnel approach.
Ring Flexibility On-the-Fly. Very few programs consider ring flexibility on-the-fly.
Combining one of the above-mentioned acyclic conformational search methods with an
on-the-fly conformational ring search method is expected to be more time-efficient. To
date, the methods implemented are based on variations of Goto and Osawa’s corner
flapping approach, which reflects an atom of a ring through a mirror plane made up of
adjacent atoms.87, 88 Flipping atoms and their substituents while keeping the correct
geometry and chirality proves challenging. GOLD89 uses a series of bond rotations to
transform the original position of the flipped atom into the one reflected through the mirror
plane, hence requiring that the 4 adjacent atoms (2 on either side) be in a plane. This
requirement limits the method and renders the full search of large flexible cyclic systems
difficult. FITTED uses a different approach which removes this requirement,
enabling the searching of larger rings.39 FITTED also enables new ring conformations to
be investigated during docking by using a conjugate gradient minimization.
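The geometric core of a corner flap, the reflection of one atom through a mirror plane, can be sketched as below. The plane is assumed to be defined by three neighbouring ring atoms; the handling of substituents, chirality and the planarity requirement discussed above is left out, and the code reproduces neither the GOLD nor the FITTED implementation. All names are illustrative.

```python
def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def reflect_through_plane(atom, p1, p2, p3):
    """Mirror the point `atom` through the plane containing p1, p2 and p3."""
    normal = _cross(_sub(p2, p1), _sub(p3, p1))
    norm_squared = _dot(normal, normal)
    if norm_squared == 0.0:
        raise ValueError("the three plane atoms are collinear")
    # Signed offset of the atom from the plane, in units of the normal.
    offset = _dot(_sub(atom, p1), normal) / norm_squared
    return tuple(a - 2.0 * offset * n for a, n in zip(atom, normal))


# A corner atom 0.5 above the z = 0 plane is flapped to 0.5 below it.
flipped = reflect_through_plane(
    (0.3, 0.4, 0.5), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
)
```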
PROTEIN FLEXIBILITY
The algorithms described in the previous sections account for the ligand flexibility.
However, in most cases, docking programs treat the protein as a rigid object, following
the lock and key model. Numerous reports have described the flexibility of proteins and
its effect on docking accuracies.33-35, 89-91 Considering these additional conformational
degrees of freedom remains one of the major challenges in the field of small
molecule docking.15, 92-97 The simplest way (although not the most time-efficient) to
include protein flexibility is to dock the ligand to multiple alternative conformations of a
protein and merge the results (see Figure 1.5A). In fact, this method has been shown to
increase the accuracy of docking compared to cross-docking results.39, 98 It has also been
shown that including all available protein structures, while not affecting the docking
accuracy, is more CPU time consuming.98 These results demonstrate that the inclusion of
protein flexibility is a necessity but that selection of the protein structures used for a study
is critical.
Using only experimentally determined protein structures may be too restrictive as
other (i.e., not available) protein conformations may be adopted upon binding to another
ligand. New protein conformations generated by computational techniques can
complement experimentally determined structures. Protein conformations can also be
produced while docking. In this case, the conformational search of the proteins falls into
two categories: either the conformation is searched prior to the docking run (see Figure
1.5B) or searched on-the-fly by the docking program (see Figure 1.5C).
Figure 1.5 - Possible methods to include protein flexibility: A) Generation of multiple
protein conformations for multiple docking runs; B) On-the-fly generation of protein
conformations during docking using one protein input structure and C) On-the-fly
generation of protein conformations using multiple protein conformations.
Generation of Multiple Protein Conformations. Molecular dynamics (MD) computation
can be exploited to generate multiple protein conformations by simulating the protein’s
conformational changes over time (see Figure 1.5A). Variations of this technique may
force large conformational changes (normal mode analysis, nudged elastic band method,
elastic network model) not only on the side chains but also on the backbone of the
protein. However, as MD simulations may provide a wealth of conformations, a method
is required to limit the number of conformations.
Amaro et al.99
performed a 20 ns MD simulation on RNA editing ligase 1, taking a
snapshot every 50 ps, resulting in 400 conformations. They then applied a “QR
factorization” method which is used to remove redundant information, reducing the 400
conformations to 33 with no loss of data. By reducing the number of conformations for
docking, they increased the overall time efficiency of the screen itself. However, the MD
simulations are time-consuming. Garner et al.100
addressed this problem by examining
multiple protein structures and identifying atoms which do not move upon ligand
binding. This information was converted into a set of constraints applied to those atoms
during the MD simulation, hence decreasing the time required for the simulation. MD
simulations are not appropriate for a true conformational search of the protein and
therefore should only be used to probe truly novel (but accurate) conformations. To probe
for new possible conformations, Withers et al.101
developed the active site pressurization
method to force the protein to adopt novel conformations. This is done by filling the
protein cavity with a virtual resin made of uncharged Lennard-Jones particles in the form
of a grid. Initially only the resin beads that are not directly clashing with the protein are
turned on and beads adjacent to them are flagged. During the MD simulation, the protein
reacts to the beads that are on while the flagged beads observe the possible forces applied
to them. After the initial MD run, the conformation is saved and the flagged bead that
observed the most favourable interaction with the protein is turned on followed by
another MD simulation. This process is continued until it reaches a target number of
structures.
Another option similar to MD simulations is normal mode analysis (NMA). NMA
approaches are less time consuming than regular MD simulations but are more memory
intensive. The theory behind NMA is that simple harmonic oscillations around a local
energy minimum correspond to the normal modes of vibration. Like MD, many
conformations can be produced and the selection of a smaller subset is necessary. Keseru
et al.102
used a selection of low frequency normal modes to approximate the large
movements in the protein structure. Cavasotto et al.103
developed a method which reduces
the number of conformations by only selecting relevant normal modes which affect the
area of interest. New conformations are then created through linear combinations of the
normal modes followed by optimization of the side chains by Monte Carlo search in the
presence of known binders. Cavasotto et al.35
also tried a simpler method to generate the
initial set of protein conformations. Instead of NMA, the ligand was initially placed into
the active site of the protein in multiple orientations. Each of these conformations
underwent an energy minimization followed by an optimization of side chain
conformations through a Monte Carlo procedure.
For larger changes in protein conformations other options are available such as the
elastic network model (ENM)104
and nudged elastic band (NEB) method.105, 106
The ENM
is in fact a simpler version of NMA. Within ENM, all pairwise interactions between Cα atoms
within a cut-off distance are represented as springs with a uniform force constant. With
these springs in place, one can then perform NMA on this simplified model to
determine the normal modes of distortion of the elastic model and create new protein
conformations. This has been used to generate multiple conformations of the ionotropic
glutamate receptor (iGluR2).107
The iGluR2 can adopt multiple conformations due to
domain closure upon ligand binding. To generate multiple conformations along the path
of closure the ENM method was first used to identify the normal modes of an initial
intermediate conformation, followed by steps in both directions (towards the open and
closed conformations) to generate an ensemble of conformations.
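The spring network underlying the ENM can be sketched in a few lines. This is a minimal illustration only: the function names, the 8.0 Å cut-off, the unit force constant and the four-atom toy chain are assumptions made for this example and are not taken from the cited work.

```python
import math

def build_springs(ca_coords, cutoff=8.0):
    """Return (i, j, rest_length) for every C-alpha pair within the cut-off."""
    springs = []
    for i in range(len(ca_coords)):
        for j in range(i + 1, len(ca_coords)):
            d = math.dist(ca_coords[i], ca_coords[j])
            if d <= cutoff:
                springs.append((i, j, d))
    return springs

def elastic_energy(coords, springs, k=1.0):
    """Harmonic energy of a deformed structure; k is the uniform force constant."""
    return 0.5 * k * sum(
        (math.dist(coords[i], coords[j]) - d0) ** 2 for i, j, d0 in springs
    )

# Toy reference structure: four C-alpha atoms on a line, 3.8 Angstrom apart.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
springs = build_springs(reference)

# Deformed structure: the last atom is pulled 1 Angstrom along the chain.
deformed = reference[:3] + [(12.4, 0.0, 0.0)]
strain = elastic_energy(deformed, springs)
```

The reference structure has zero elastic energy by construction, and any displacement raises it. A normal mode analysis of such a model would diagonalize the second derivatives of this energy, which is not shown here.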
The NEB method works by approximating the path between the beginning and end
conformations with a series of images interconnected by springs. These springs prevent
the images from sliding down onto the preceding images on the PES. The initial images
are copies of the starting and final conformations. During the simulated annealing
optimization, the interconnecting springs allow each image to be affected by the
previous one allowing for creation of multiple conformations along a PES.108
This
method has been recently implemented within AMBER108
and has been applied to RNA
but can be applied to proteins to simulate larger movements.
On-the-Fly Protein Flexibility. Accounting for protein flexibility during the
conformational search of the ligand pose allows for ligand dependent conformational
changes within the protein. The simplest way to account for local protein conformational
adjustments during docking is to allow for some overlap with the protein by reducing the
repulsive nature of the Lennard-Jones potential. This softening approach, also referred to
as soft-docking, was first implemented in 1991 by Jiang et al.109
ADAM110
went further
by incorporating an offset distance to virtually increase the van der Waals distance
between the protein and ligand along with an energy minimization which allows for the
relaxation of the protein. The energy minimization-based optimization of the protein
structure allows for small movements of the protein atoms. Apostolakis et al.111
improved
on this by allowing the van der Waals interactions to be gradually turned on. This
docking method starts by creating a random conformation of the ligand followed by an
energy minimization including the gradual turning on of the van der Waals interactions.
This is followed by a re-optimization of the ligand conformation through a Monte Carlo
search.
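The effect of softening the repulsive wall can be illustrated with a general n-m Lennard-Jones form. This is a sketch under stated assumptions: the functional form below (well depth epsilon at the separation r_min) and the reduced units are chosen for illustration, and the 8-4 exponents are used only as one example of a softened potential, not as the form used in the programs cited in this paragraph.

```python
def lennard_jones(r, r_min, epsilon, n=12, m=6):
    """General n-m Lennard-Jones energy with well depth epsilon at r = r_min."""
    ratio = r_min / r
    return epsilon * (m / (n - m) * ratio ** n - n / (n - m) * ratio ** m)

# Energy of a slightly overlapping contact (r = 0.8 * r_min) in reduced units.
hard = lennard_jones(0.8, 1.0, 1.0)            # standard 12-6 form
soft = lennard_jones(0.8, 1.0, 1.0, n=8, m=4)  # softened 8-4 form
```

Both forms share the same minimum, but the softened form penalizes the overlapping contact far less, which is what allows a ligand to be placed in a pocket that is slightly too tight.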
Many docking programs use a grid based approach to calculate interaction energies
between the protein and the ligand. However, if more than one protein structure is used,
multiple grids are created and multiple docking runs must be carried out. To reduce the
CPU time demand, techniques have been developed which combine the individual grids
into grids modeling an ensemble of protein conformations. This approach was first
developed by Knegtel et al.112
who used DOCK with an ensemble of protein structures
modeled by a single average grid. This grid was derived from multiple grids,
corresponding to each of the protein structures. Two averaging techniques were
implemented: an energy weighted average and a geometry weighted average. The energy
weighted method takes the average energy of all protein conformations at each point in
space while the geometry weighted method takes the average position of an atom in all
protein structures and determines the energy at that point. Osterberg et al.34
improved on
the energy weighted grid by averaging the grid using the Boltzmann distribution factor
and also using a weighted average of the energy grids termed a clamped grid. If the
energy at a point for a specific protein is unfavourable then that point is assigned a low
weight. This technique is similar to one developed by Moitessier et al.113
who docked
aminoglycosides to virtually flexible RNA. They also created a single RNA structure
by averaging the coordinates of multiple RNA structures, from which one grid was then created.
Sotriffer et al.114
used AutoDock in conjunction with an ensemble of energy grids.
Instead of being averaged, the grids of the ensemble were joined back to back, separated by a
strip of repulsive grid points to remove any possibility of ligand docking in the interface
of two grids.
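The difference between a plain energy average and a Boltzmann-weighted average of several grids can be sketched as follows. This is one possible reading of the averaging schemes described above, not the published implementation: the function names, the flat list representation of a grid, the kT value of 0.593 kcal/mol (roughly room temperature) and the example energies are all assumptions.

```python
import math

def mean_grid(grids):
    """Plain energy average at each grid point over all protein conformations."""
    return [sum(values) / len(values) for values in zip(*grids)]

def boltzmann_grid(grids, kT=0.593):
    """Boltzmann-weighted average: favourable energies dominate each point."""
    averaged = []
    for values in zip(*grids):
        e_min = min(values)  # shift energies for numerical stability
        weights = [math.exp(-(e - e_min) / kT) for e in values]
        averaged.append(sum(w * e for w, e in zip(weights, values)) / sum(weights))
    return averaged

# Two hypothetical grids (three points each) from two protein conformations.
grid_a = [-2.0, 5.0, 0.5]
grid_b = [4.0, -1.0, 0.5]

plain = mean_grid([grid_a, grid_b])
weighted = boltzmann_grid([grid_a, grid_b])
```

Where one conformation clashes at a grid point and the other does not, the plain average is dominated by the clash, whereas the Boltzmann-weighted value stays close to the favourable energy.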
Generating new protein conformations on-the-fly is another possibility. There are
two options for searching the conformational space of proteins during docking (see
Figure 1.5B): either the ligand is docked first, followed by optimization of the protein, or
both the ligand and the protein conformational searches occur simultaneously.
The first docking study that considered conformational searching of the proteins
upon ligand docking was reported by Leach.115
In this work, optimization of protein side
chains was carried out after docking of the ligand. A rotamer library was trimmed using a
dead-end elimination algorithm to eliminate some rotamers of side chains which overlap with each
other. A tree search algorithm was then exploited to find optimal combinations of amino
acid rotamers. Schaffer et al.116
improved on this by generating multiple starting
conformations of the protein prior to the docking of the ligand. This is completed by a re-
optimization of the protein side chains and a final energy minimization to relax the
protein. The number of possible conformations of the protein resulting from Leach’s
algorithm was trimmed by Anderson et al.117
by only applying the conformational search
to residues which have been identified as flexible in multiple crystal structures. If only
one protein structure is available, a selection scheme is proposed to reduce the number of
residues that would be considered as flexible.
Utilizing a rotamer library has its downside. While binding, a ligand may lead to
novel conformations of side chains that may not be present in the rotamer library. Various
techniques have been developed for de novo prediction of side chain conformations. ICM
incorporated a dual alanine scanning and refinement procedure to relax the binding site
cavity.118
First, flexible residues are identified and then mutated to alanine to enlarge the
pocket. If there are more than two flexible residues, multiple protein structures are
created where various combinations of two residues are mutated to alanine. The ligand is
then docked to the ensemble of protein variants and clustered. The best scoring poses are
then frozen and the protein side chains are reconstructed and re-optimized. Koska et al.119
implemented a similar post-docking refinement of the protein using a combination of
ChiFlex to create an ensemble of protein input structures (see Figure 1.5A) and ChiRotor
to optimize the protein following docking using LibDock. Sherman et al.120
showed that
Glide can also accommodate protein flexibility when it is used in conjunction with
PRIME, a protein homology modeling program. First Glide identifies the 3 most flexible
residues within the pocket and, as in ICM, mutates them to alanine to allow for a larger
binding cavity. The flexible residues are selected within the Glide protocol using a set of
4 rules: 1) residues with atoms that deviate by 2.5 Å from the apo protein crystal
structure; 2) residues which have multiple occupancy or missing density within 5 Å of
the co-crystallized ligand; 3) if multiple protein crystal structures are available, residues
that have atoms which deviate by more than 1.5 Å between structures; and 4) if no more
than 3 residues have been selected, residues with high B-factors. Glide then
docks the ligand, reconstructs the mutated residues and optimizes the conformation of all
the residues within 5 Å of the ligand. These residues and the ligand are then subjected to
an energy minimization followed by a re-docking using Glide.
In all the above-mentioned cases, the ligand pose was first optimized followed by
optimization of the protein. Although this approach improves on generation of multiple
conformations prior to the ligand docking, this is not a perfect representation. In reality,
both the ligand and protein conformations should be optimized in concert. Within SLIDE,58, 59
protein flexibility is incorporated by rotating side chains to remove atomic overlaps
with the ligand. Kairys et al.121
used a mining minima method to optimize the
conformations of side chains. Ligand and protein conformations are generated using
random values within a specific range. This range is subsequently reduced based on the
lowest energy conformation found. Protein flexibility has been implemented in the ant
colony optimization technique of PLANTS.76
In this context, pheromones are placed on
the torsion values for optimal side chain conformations. Skelgen122, 123
uses a modified
simulated annealing algorithm to search the conformational space of side chains.
Interestingly, Skelgen allows the use of either a rotamer library or de novo generation of
random side chain conformations. GOLD61, 65, 124
tackles protein
flexibility in many ways. In its most current release version (v. 4.0), GOLD gives the user
the ability to manually create rotamer libraries for selected residues. It also uses an 8-4
Lennard-Jones potential energy function to soften van der Waals interactions. GOLD also
allows for the optimization of NH3 and OH orientations while docking the ligand by
incorporating them as genes within the genetic algorithm. Another program addressing
protein flexibility is MORDOR which uses a path exploration with distance
constraints.125, 126
MORDOR places the ligand within a pharmacophore sphere of the
binding site. Once the ligand is placed, the protein and ligand are simultaneously
optimized through energy minimization. The path exploration with distance constraints
imposes an RMSD deviation penalty to force the ligand to explore new conformations.
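The range-reduction idea behind the mining minima search mentioned in this paragraph (random values drawn within a range that is then narrowed around the lowest energy found) can be sketched on a single torsion. Everything specific here is an assumption made for illustration: the function names, the halving schedule, the number of samples and the toy torsional energy are not taken from the cited method.

```python
import math
import random

def shrinking_range_search(energy, low, high, rounds=10, samples=50, seed=0):
    """Sample random values in a range, then narrow the range around the best one."""
    rng = random.Random(seed)
    best_x = None
    best_e = float("inf")
    for _ in range(rounds):
        for _ in range(samples):
            x = rng.uniform(low, high)
            e = energy(x)
            if e < best_e:
                best_x, best_e = x, e
        half_width = (high - low) / 4.0  # the range is halved each round
        low, high = best_x - half_width, best_x + half_width
    return best_x, best_e

def torsion_energy(theta):
    """Toy single-well torsional term with its minimum at 1.0 rad."""
    return 1.0 - math.cos(theta - 1.0)

best_theta, best_energy = shrinking_range_search(torsion_energy, -math.pi, math.pi)
```

In a docking context the same loop would act on all ligand and side chain variables at once, with the docking score in place of the toy energy.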
The methods described above either use one protein conformation as input with its
structure being modified while docking or dock to multiple protein conformations to
account for larger moves not considered when allowing optimization of the side chains
only (see Figure 1.5B). Some methods have been developed which allow for use of
multiple protein structures as input for docking (see Figure 1.5C). In FlexX127
a module,
FlexX-ensemble, has been incorporated which merges portions of the proteins with
similar conformations into one instance and saves the dissimilar portions as independent
instances. During docking, FlexX-ensemble scores the ligand with all instances. If more
than one instance is present for a portion of the protein the best scoring one is selected. A
similar approach has also been implemented in DOCK.128
FITTED39, 68, 69, 129
uses multiple
protein structures to create a virtual backbone and rotamer library. During the evolution,
the genetic algorithm of FITTED allows for cross-over and mutations of side chain
rotamers and backbone conformations. ROSETTALIGAND130
applies a Monte Carlo
search to an ensemble of protein and ligand conformations to account for the flexibility of
both partners. A ligand pose is selected, randomly perturbed then the side chain
conformations are optimized using a backbone dependent rotamer library.
It is also possible to use MD to optimize the conformation of both the ligand and
protein.131
However, fully searching all the possibilities for the orientation of the ligand is
challenging with regular MD techniques. In fact, in order to adopt other orientations and
conformations during a simulation, the ligand is required to leave, at least partly, the
binding pocket and to bind back in another orientation. Mangoni et al. overcame this
problem by separating the center of mass of the ligand from its internal and rotational
motions. This approach allows the receptor and ligand internal degrees of motion to be at
a different kinetic energy than the rotation and translation allowing for a more efficient
search of the ligand. To reduce the computational cost of MD-based docking, Tatsumi et
al.132
developed a hybrid MD/harmonic dynamics method which first performs MD to
determine the collective motions of the protein. These collective motions are then
approximated through harmonic modes so that large motions can be used even when only
portions of the receptors are considered. The side chains and ligand conformations are
then optimized using MD. It is also possible to use metadynamics to perform the
conformational search. Metadynamics is similar to MD except that it keeps a history of
the explored region of the energy hypersurface adding penalties to the regions already
visited leading the search towards new conformational space.133
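The history-dependent penalty used in metadynamics can be sketched for a single collective variable. This is a minimal illustration assuming Gaussian penalties of fixed height and width; the function name, the parameter values and the list of visited values are hypothetical.

```python
import math

def bias_potential(s, visited, height=0.5, width=0.2):
    """Sum of Gaussian penalties deposited at previously visited values of s."""
    return sum(
        height * math.exp(-((s - s_i) ** 2) / (2.0 * width ** 2))
        for s_i in visited
    )

# Collective-variable values already explored during the simulation.
history = [0.0, 0.05, -0.05, 0.1]

revisit_penalty = bias_potential(0.0, history)     # region already sampled
new_region_penalty = bias_potential(2.0, history)  # region not yet sampled
```

The penalty is large where the simulation has already been and negligible elsewhere, so adding it to the potential energy pushes the search towards unexplored conformations.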
PREDICTING DISPLACEABLE KEY BRIDGING WATER MOLECULES
Bridging water molecules are waters which mediate the interactions between polar
groups of the ligand and protein.134, 135
How docking programs treat and/or predict the
placement of key bridging water molecules is of utmost importance in the field of small
molecule docking today.136
Typically docking studies treat waters depending on the target
being investigated. If the waters are experimentally determined to play a key role in the
ligand binding, they can be treated as parts of the protein. However, explicit water
molecules do not allow for cases where waters are displaced by the ligand. Two common
examples are the case of HIV-1 protease and thymidine kinase. In both these cases,
ligands have been designed to displace key bridging water molecules that interact directly
with the protein and are observed in crystal structures.137-140
Since most docking programs do
not correctly account for displaceable water molecules, it is typically suggested that these
waters should be deleted from the protein structure. Recent studies have shown that when
docking, inclusion of water molecules always increases the accuracy independently from
the origins of the water (crystallographic, predicted or optimized position).141, 142
Therefore the challenge is to include the displaceability of water within docking. There
are two options that are available: either predict whether the ligand displaces the water or
not or predict the water positions within the binding site during docking.
Predicting Displacement of Water Molecules. If the protein input structure contains one
or more water molecules, a docking program should be able to determine if the water is
displaced (off) or present (on). For grid based energy methods one can use methods
similar to those developed for protein flexibility. The clamped or Boltzmann equations
developed by Osterberg et al.34
combine grids to make one grid where the water is both
on and off. Moitessier et al.113
have also applied a similar technique to dock to hydrated
and flexible RNA. Huang et al.143
modified DOCK to simultaneously dock to two grids,
one with waters, one without. Thus DOCK calculates the score with both grids, selecting
the best of the two scores. Rarey et al.144
implemented the particle concept within FlexX.
In this program, a water particle is considered as a single sphere. During docking, FlexX
considers both options (the water being on or off) and selects the best scoring one. Within
GOLD145
the water orientation is allowed to be optimized during the genetic algorithm
but like FlexX it considers both on and off keeping the best scoring option. A water-
specific switching function has been added to the AMBER force field within FITTED.68
This function turns off the ligand’s interactions with a given water molecule when it is
overlapping. SLIDE59
uses a similar approach which turns off the water where there is an
overlap with the ligand or protein. Jiang et al.146
created a solvated rotamer library which
has been used in a similar fashion to rotamer libraries used for protein flexibility.
Rotamers are added to the library with waters at positions that form hydrogen bonds with
the residue; these solvated residues are then used during the conformational search of the
protein. Van Dijk and Bonvin147
developed a stepwise procedure to determine key water
molecules. The binding site is first flooded and only waters on the surface of the protein
are kept. The ligand is then placed within the binding site and only waters which mediate
its binding with the protein are kept. This is followed by a random selection of water
molecules where their probability to be kept is set to the fraction of the observed contacts
with the protein over the ideal number of contacts which have been derived by statistics
on the PDB. This is continued until only 25% of the water molecules remain. Key water
molecules are then identified by selecting waters that are below a score cut-off.
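The on/off treatment described earlier in this section, where a pose is scored with each water present or displaced and the best scoring option is kept, can be sketched as an enumeration over water states. The function names and the toy scoring function, including its numerical contributions, are hypothetical and serve only to show the selection logic.

```python
from itertools import product

def best_water_configuration(score_pose, n_waters):
    """Score every on/off combination of the waters and keep the lowest score."""
    best = None
    for state in product((True, False), repeat=n_waters):
        score = score_pose(state)
        if best is None or score < best[1]:
            best = (state, score)
    return best

def toy_score(state):
    """Hypothetical score (lower is better) for one pose and two waters."""
    base = -7.0
    contributions = (-1.5, 2.0)  # water 0 bridges the ligand, water 1 clashes
    return base + sum(c for on, c in zip(state, contributions) if on)

state, score = best_water_configuration(toy_score, 2)
```

For this toy pose the first water is kept and the second is displaced. Exhaustive enumeration grows as 2^n, so it is only practical for the handful of key waters usually considered.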
Prediction of Water Positions. During the binding process, the protein or ligand may
adopt a conformation not yet experimentally observed, which is facilitated by the
presence of a bridging water molecule. Therefore methods have been developed to
predict the positions of potential bridging water molecules. Even though many of the
developed methods for water position prediction have not been incorporated within
docking programs, these methods could easily be used in conjunction with displaceable
water techniques mentioned above.
The first method to determine water positions within proteins was implemented
within the program GRID.148, 149
GRID uses a series of evenly distributed points, referred
to as a grid. At each grid point, the interaction energy between a water probe at this
location and the protein is calculated. If the energy is favourable (i.e., below a given
threshold), water can occur at that node. Issues can arise with a grid
representation: too many waters may be found, or waters on the surface may be
discounted. Pitt and Goodfellow150
developed a knowledge based method to place waters
within the binding site. A table of preferred water positions for each side chain is created.
The water is then only retained if there is space for it within the binding site. Amadasi et
al.151
validated water positions found in crystal structures and GRID calculations by first
using the HINT score, which is used to calculate the global interaction strength of the
water with the protein, to determine which waters are located in a hydrophilic region. The
RANK algorithm then measures the number and geometric quality of hydrogen bonds
made between the water molecules and the proteins, with higher ranking waters being the
most favoured. Miranker and Karplus152
used a multiple copy simultaneous search
(MCSS), where many water molecules were placed within the active site and their
location optimized using the CHARMM force field. The interaction energy with the
protein was then calculated and waters below an energy cut-off value were minimized and
kept.
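The grid-based probing described for GRID at the start of this discussion can be sketched with a toy energy function. This is an illustration only: GRID uses its own empirical energy function, whereas the simple 12-6 term, the parameter values, the threshold and the single-atom "protein" below are assumptions.

```python
import math

def probe_energy(point, atoms, r_min=3.0, epsilon=1.0):
    """Toy 12-6 interaction energy between a water probe and all protein atoms."""
    total = 0.0
    for atom in atoms:
        ratio = r_min / math.dist(point, atom)
        total += epsilon * (ratio ** 12 - 2.0 * ratio ** 6)
    return total

def favourable_nodes(grid_points, atoms, threshold=-0.5):
    """Keep the grid points where the probe energy is below the threshold."""
    return [p for p in grid_points if probe_energy(p, atoms) < threshold]

# One protein atom at the origin and grid points every 1.5 Angstrom along x.
atoms = [(0.0, 0.0, 0.0)]
grid_points = [(x * 1.5, 0.0, 0.0) for x in range(1, 7)]

water_sites = favourable_nodes(grid_points, atoms)
```

Points too close to the atom are rejected because of the clash, and distant points because the interaction is too weak, leaving only the node at the optimal contact distance.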
Currently only two docking programs incorporate the prediction of water positions,
albeit for different reasons. The FlexX developers incorporated the prediction of water
positions when they implemented the particle concept for displaceable waters.144
FlexX
first determines positions where waters would be energetically favourable followed by a
clustering algorithm to remove redundant information. Within the Glide XP scoring
functions, it is necessary to calculate the amount of desolvation of the ligand.72
To
calculate this energy contribution, Glide first docks water molecules explicitly into the
binding site and then uses empirical scoring terms to measure the exposure to water for
certain groups of the ligand.
PREDICTION OF METAL GEOMETRY
When orienting the ligand in the binding site of a metalloenzyme, docking
programs must have an adequate description of metal coordination geometries.15, 153-155
Currently only a few programs consider specific interaction points around metals. GOLD
initially determines the coordination geometry by examining the angles between protein
coordinating atoms (e.g., His nitrogens or Cys sulphurs). Once the geometry is
determined, interaction points corresponding to the free coordination sites are added.
FITTED39
uses a similar approach by incorporating the vectorial-bond valence model.156
This method states that the sum of the vectors formed by coordination should be equal to
0. The FlexX developers have also recently published an improved description of metal
geometries, based on a template library of metal coordination geometries which is
compared to the protein atoms coordinating the metal. An RMSD is then calculated
between the various templates and the protein metal site and the best matching template
is selected.157
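The zero-sum condition of the vectorial-bond valence model can be checked numerically. This is a minimal sketch assuming equal bond valences for all coordinating atoms and an ideal tetrahedral site; the function name and the coordinates are chosen for illustration and do not reproduce the FITTED implementation.

```python
import math

def resultant(metal, ligand_atoms, valences=None):
    """Vector sum of valence-weighted unit vectors from the metal to its ligands."""
    if valences is None:
        valences = [1.0] * len(ligand_atoms)  # assumption: equal bond valences
    total = [0.0, 0.0, 0.0]
    for atom, valence in zip(ligand_atoms, valences):
        d = math.dist(metal, atom)
        for k in range(3):
            total[k] += valence * (atom[k] - metal[k]) / d
    return total

metal = (0.0, 0.0, 0.0)
tetrahedron = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]

full = resultant(metal, tetrahedron)           # all four sites occupied
open_site = resultant(metal, tetrahedron[:3])  # one coordination site free
```

With all four sites occupied the resultant vanishes. With one site free, the non-zero resultant points away from the vacancy, so its opposite gives the direction in which a ligand atom would complete the coordination sphere.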
CONCLUSION
Modeling reality is a great challenge in the field of small molecule / protein
docking. A number of new global search algorithms have been developed and
implemented in over 60 docking programs. However, the true challenge is to identify the
correct pose (i.e., assign it the best score) among a plethora of poses that are generated
through this search. Factors such as ionic strength, binding entropy, metal/ligand
interaction energies are very challenging problems that still need to be solved in the field
of scoring functions for docking programs.15, 94, 95, 158
We believe that the major challenge
left in modeling reality is a better understanding of the link between the conformation of
the ligand/protein/water/salt multicomponent system and its score.
1.2 APPLICATION OF COMPUTATIONAL TECHNIQUES TO ASYMMETRIC
CATALYST DEVELOPMENT.
Financial and environmental pressure in the drug discovery and development field
requires that novel drugs are found quickly, efficiently and cheaply. To fulfill these
requirements, computational techniques have found their way into the toolkit of
medicinal chemists in the pharmaceutical industry providing a viable alternative to
experimental approaches such as high throughput screening.1-3
In fact, there are now
many predictive methods (e.g., QSAR, docking, combinatorial library profiling) available
to drug discovery and development chemists.15
Although these methods are based on
approximations, they are accurate enough to yield a higher rate of finding lead molecules
when screening libraries of millions compared to the traditional experimental
approaches.7-9
Although these techniques yield small libraries enriched in bioactive
molecules and may miss a small number of bioactive molecules, this loss rarely
outweighs the speed and cost savings of screening a library in silico.
Surprisingly, the significant advances in the field of computational drug design and
development have not stimulated the development of many other VS methods in other
chemical fields such as asymmetric catalysts. Computational tools in the field of
asymmetric catalysts are typically used to rationalize the outcome of a given reaction post
facto rather than to predict it. The ability to predict the stereomeric excess of a reaction
would enable organic chemists to quickly test out new asymmetric catalyst structures and
to prioritize a few of them for synthesis.
The lack of quick, although predictive, computational tools for organic chemists
when compared to the field of drug design and development has a major origin. To
accurately discriminate an excellent from a poor asymmetric catalyst, it is necessary to
predict the difference in transition state energies (necessary to compute the stereomeric
excess), within less than 1 kcal/mol. On the other hand, discriminating drug hits from non
binders requires a lower resolution in the order of 3 to 5 kcal/mol and looks at ground
state structures.
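The sensitivity of the stereomeric excess to the transition state energy difference can be made explicit with a short calculation. The sketch assumes kinetic control, with the two stereoisomers formed through competing transition states in a ratio of exp(ΔΔG‡/RT), which gives an excess of tanh(ΔΔG‡/2RT); the function name and the temperature of 298.15 K are assumptions made for this example.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def stereomeric_excess(ddg, temperature=298.15):
    """Excess of the major stereoisomer (0 to 1) from a TS energy gap in kcal/mol."""
    return math.tanh(ddg / (2.0 * R_KCAL * temperature))

for gap in (0.5, 1.0, 2.0, 3.0):
    print(f"{gap:.1f} kcal/mol -> {100 * stereomeric_excess(gap):.0f}% excess")
```

Under these assumptions a gap of 1 kcal/mol corresponds to an excess of roughly 69% at room temperature, while 2 kcal/mol already exceeds 90%, which illustrates why a resolution better than 1 kcal/mol is needed to rank catalysts.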
Drug design methods such as docking programs, use scoring functions to predict
the ligand binding affinity, with many methods using force fields to calculate the
potential energy of the ligand-protein complex.15
However, force fields have been
developed to simulate the ground state of molecules and cannot be applied directly to the
computation of transition state energies. To calculate the transition state energy, the most
accurate although time-consuming approach is to use quantum mechanics (QM). To
enable time-efficient screening, computational organic chemists have developed methods
and programs that enable a faster calculation of stereomeric excesses.
QUANTUM MECHANICS PREDICTIONS OF STEREOMERIC EXCESS
Quantum mechanics has been exploited to rationalize experimental results and
provide valuable insight into the reaction pathway of many reactions.159, 160
Great care
must be taken when selecting a basis set for QM methods since smaller and quicker
methods may be able to give qualitative answers but not the quantitative predictions desired.
Although QM can calculate the transition state structures and energies very
accurately,161
it still lacks the speed required for the development of a QM-based VS tool
and can hardly be applied to large catalytic systems. To address this last issue, it is
possible to use a hybrid technique, called QM/MM, which treats the reacting part of the
system with QM and the rest with molecular mechanics (MM).
Using Quantum Mechanics to Predict Stereoselectivities and Transition States. One of
the reactions most studied using QM is the proline catalysed aldol reaction.162-169
There were initially four proposed mechanisms for the proline catalysed aldol reaction
(see Figure 1.6).170-174
QM methods were extremely helpful in resolving which
mechanism is most likely to occur. The dual proline mechanism (Figure 1.6D) was
discounted after kinetic studies and theoretical experiments showing that this mechanism
is energetically disfavoured.164, 175
QM methods (B3LYP/6-31G*) showed that the
carboxylic acid mechanism (see Figure 1.6B) is 7.4 kcal/mol more favoured than the
carbinolamine mechanism (see Figure 1.6A) and 30.5 kcal/mol over the enaminium
mechanisms (see Figure 1.6C).168
These insights into the mechanism of this reaction
demonstrated the usefulness of computational techniques in studying the transition states
of reactions. Based on this, QM methods were next used to predict the stereochemical outcome
of asymmetric aldol reactions. This study revealed their ability to predict literature
stereomeric excesses with high accuracy, in most cases within a few percent.166
Figure 1.6 - Proposed mechanisms for proline catalyzed aldol reaction
Prediction-based design of new catalysts inducing high stereoselectivities was also
carried out.169, 176
Shinisha et al.176
went on to study bicyclic analogues of proline (see
Figure 1.7) using QM methods (B3LYP/6-31G*). Overall, all these catalysts, even though
synthetically challenging, are predicted to give better selectivities than
proline.
Figure 1.7 - Example of bicyclic analogue studied by Shinisha et al.176
Another example of where QM methods aided in post-facto rationalization of a
reaction is the osmium tetraoxide asymmetric dihydroxylation of alkenes. There were
many discussions about the mechanistic picture of the asymmetric dihydroxylation of
alkenes facilitated by osmium tetraoxide. These discussions centered around two main
themes (see Figure 1.8): 1) a [3+2] cycloaddition that directly formed the osmium
glycolate product177
and 2) a stepwise [2+2] mechanism which begins with
formation of the osmium alkene complex, followed by [2+2] addition and ring
expansion to form the desired osmium glycolate.178
Figure 1.8 - Proposed mechanisms for osmium tetraoxide asymmetric dihydroxylation of
alkenes. L = A ligand
These competing possibilities for the mechanism of this reaction led to the use of
QM techniques to resolve the dispute.179-181 The discussion culminated when Sharpless
showed with QM methods (B3LYP/3-21G for the osmium and B3LYP/6-31G* for all
other atoms) that the [3+2] mechanism is significantly favoured energetically over the
[2+2].182 Although QM methods were instrumental in the understanding of the
mechanism, their application to the prediction of stereochemical outcomes was limited by
the size of the asymmetric catalysts required for this reaction. As discussed below, other
less intensive methods were required.
Another example of the application of QM methods to the prediction of transition
state structures and energies is the Diels-Alder reaction. From frontier molecular
orbital theory to high-level QM methods, this reaction has a long history
of using theory to predict its regioselectivity and stereoselectivity.183 For example, the
Diels-Alder reaction catalysed by imidazolidinones (see Figure 1.9) was studied by
Gordillo and Houk184 with QM methods (B3LYP/6-31G*), which showed excellent
agreement with experiment when predicting endo:exo ratios (diastereoselectivities).
However, they obtained poorer correlations with enantioselectivities. The quality of this
correlation with experiment has been shown to vary greatly with the QM method used,185-189
and therefore a careful selection of the QM method is required.
Figure 1.9 - Imidazolidinone catalysed Diels-Alder Reaction
Even with these prime examples of how QM methods aided in rationalizing
reaction mechanisms and showed some promise in their ability to predict stereochemical
outcomes post facto, there had been no de novo prediction of asymmetric catalysts using
QM methods until recently.190, 191 The proline-catalysed Mannich reactions (see Figure
1.10) are known to be selective for products with a syn orientation, but when proline was
substituted with pipecolinic acid a mixture of syn and anti products resulted. QM
methods (HF/6-31G*) were used to study the transition state of this reaction. Several
catalysts were then proposed which should theoretically be selective for the anti
orientation. Once synthesised, these compounds were indeed shown to be highly
selective for the anti orientation.
Figure 1.10 - Mechanism for the proline-catalyzed Mannich reaction.
Using QM/MM Techniques for Predictions of Stereomeric Excess and Transition State
Structures. The advantages of QM/MM hybrid approaches for the prediction of
stereomeric excess and transition states are their applicability to large systems
and their speed relative to QM methods.192 This is achieved by using
QM methods on the atoms involved in bond breaking and forming and MM on the
rest of the molecule.
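As a rough illustration of this partition, the total energy in a generic additive QM/MM scheme can be written as below. This is a textbook-style sketch only; the labels are chosen here for illustration, and the studies cited in this section may use a different (for example subtractive) coupling scheme.

```latex
E_{\mathrm{total}} =
  E_{\mathrm{QM}}(\text{reactive core})
  + E_{\mathrm{MM}}(\text{remainder})
  + E_{\mathrm{QM/MM}}(\text{core--remainder coupling})
```

Only the first term requires a QM calculation, which is why the cost grows with the size of the reactive core rather than with the size of the whole catalyst.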
One of the few examples of QM/MM methods applied to the prediction of
stereoselectivity is the dihydroxylation of n-alkenes (see Figure 1.11).193 Initial
studies using styrene demonstrated the usefulness of QM/MM applied to this reaction,
with the predicted stereoselectivity for styrene closely matching the experimental result
(99.4 %ee predicted, 96 %ee observed).194 Studying n-alkenes is more
difficult because of the explosion of possible conformations when going from
propene to 1-decene. Using the QM/MM approach, stereoselectivities were nevertheless
predicted with reasonable correlation with experimental results.
Figure 1.11 - QM/MM study of the Sharpless dihydroxylation. (QM in Blue, MM in
black)
APPLICATION OF VIRTUAL SCREENING TECHNIQUES TO THE FIELD OF ASYMMETRIC
CATALYST DEVELOPMENT.
QM and hybrid QM/MM methods have shown promise as tools to design novel
asymmetric catalysts. However, QM is still too slow to be advantageously
used to screen or design novel structures when compared with experimental stepwise
optimization or screening. This lack of speed has led to the transfer of techniques
primarily used in the field of drug design and development to the field of asymmetric
catalyst development. Two options are quantitative structure activity relationship
(QSAR) methods and molecular mechanics.
Quantitative Structure Selectivity Relationship (QSSR). When QSAR techniques are
applied to the field of asymmetric catalyst development, they are rebranded QSSR since
selectivity and not activity is the desired predicted property. QSSR is defined as the
process that relates chemical structure quantitatively to a chemical process.11 In essence,
the simplest QSSR technique relates a series of descriptors, whether they are
constitutional, topological, geometrical or physicochemical, to chemical structures.195, 196
Chavali et al.197, 198 used molecular indices which described electronic structures
and connectivities to predict catalytic activity and toxicity. Even though this technique
was not used to predict stereomeric excess, it was a demonstration of the potential of
QSSR techniques to predict chemical properties.
It is also desirable to relate structural features directly to Gibbs free energy. Based
on this, Oslob et al. predicted selectivities in palladium catalysed allylation (see Figure
1.12). It was postulated that the reactivity of the terminal allyl carbon can be evaluated by
a linear free energy relationship using descriptors such as bond distances, angles and a
series of dihedral angles which describe the relative position of the allyl group, the
palladium atom and the ligand. The stereomeric ratio was found to be best predicted
when using only 4 descriptors: the breaking Pd-C bond distance, two dihedrals describing
the in-plane distortion and displacement of the allyl group and, finally and most
influentially, the energy increase associated with the incoming nucleophile. This energy
increase was calculated as the energy difference between the palladium allyl
complex alone and the minimized complex in the presence of the nucleophile. Overall,
this technique yielded good correlations between experimental and predicted
stereoselectivities. This work showed promise, and the use of geometrical descriptors to
relate chemical structure to selectivities can most likely be applied to other reactions.
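The general shape of such a linear free energy relationship can be sketched as follows. This is a generic form under assumed notation, not the fitted model of the work described above: the d_i stand for the chosen descriptors, the c_i for regression coefficients, and ΔΔG‡ is defined here as G‡(major) minus G‡(minor).

```latex
\frac{[\text{major}]}{[\text{minor}]}
  = \exp\!\left(-\frac{\Delta\Delta G^{\ddagger}}{RT}\right),
\qquad
\Delta\Delta G^{\ddagger} \approx c_0 + \sum_{i=1}^{n} c_i\, d_i
```

Fitting the c_i against measured product ratios for known catalysts then allows the ratio for a new catalyst to be estimated from its descriptors alone.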
Figure 1.12 - Palladium catalysed allylation
Alvarez et al.199 used a method termed the continuous chirality measure to determine
which portions of the molecule are responsible for its chirality and induce
stereoselectivity. The continuous chirality measure quantifies the level of chirality of
the molecule. The method was first used to rationalize the stereoselectivity of the
bis(oxazoline)copper(II) Diels-Alder reaction. They deduced that the chirality is mainly
induced by the C5 flaps (see Figure 1.13, portion in blue) which affect the orientation of
the diene. Very good correlation between experimental stereoselectivities and continuous
chirality measures was observed. Upon further investigation, a new
bis(oxazoline)copper(II) catalyst was proposed to be highly stereoselective (catalyst in
Figure 1.13) but was unfortunately not made.
Figure 1.13 - bis(oxazoline)copper(II) catalysed Diels-Alder. C5 flaps shaded in Blue.
Descriptors for QSSR other than structural descriptors include quantum molecular
interaction fields200 (see Figure 1.14). Kozlowski and co-workers superimposed
optimized TS conformations of a series of known catalysts onto a grid. At each point on
this grid the interaction energy between the molecule under investigation and a carbon 2s
electron probe is calculated using QM methods. Regression analysis is performed on the
computed grids to find regions common to all catalysts where increases in the energy of the
probe result in increases (green region in Figure 1.14) or decreases (red region in Figure
1.14) in stereoselectivity. Once a set of known catalysts has undergone this treatment, the
resulting model can be used to predict the selectivity of new catalysts. This technique has
been applied to a series of reactions and showed good correlation between predicted and
experimental selectivities.200-204
Figure 1.14 - QSSR using Quantum Mechanical Interaction Field analysis in the design
of chiral amino alcohols for alkyl addition to aldehydes.
Using Molecular Mechanics to Predict Transition States. Using descriptors to determine
the link between chemical structure and transition states is a quick alternative to QM
methods, but it still requires intimate knowledge of the geometry of the transition state
structure. If the geometry is not known, it may be difficult to accurately predict
stereoselectivity with QSSR techniques, and using QM methods to perform a full
conformational search requires a significant investment in time and expertise. To
efficiently search the conformational space of the transition state, molecular mechanics is
a viable alternative,159, 205 with two issues to address. First, the force field being used
must be applicable to transition state modeling. In most cases, force fields have been
developed to predict the ground state structures and energies of molecules. It is therefore
necessary to derive FF parameters for transition states. The second issue is the
conformational search. Traditional conformational search engines locate minima (ground
states) and not saddle points (transition states).
To address these two issues, transition state force fields (TSFF) model the transition
state of a reaction as a minimum on a PES. The simplest TSFF freezes or constrains the
breaking or forming bonds, and the angles and dihedrals which are composed of them, in
their optimum geometry. This strategy is known as a rigid transition state model. A
model system is first developed using QM or crystallographic methods159 and then used
to derive the equilibrium values for the force field. These interactions can then be added
to the force field with large force constants to effectively constrain all atoms to the
transition state geometry. This model is sound as no significant changes in the geometry of
the transition state are usually observed from one catalyst and/or reactant to the next and
the rest of the catalyst and reactant, not involved in the reaction but inducing the
stereoselectivity, can be assumed to be in its ground state.
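One way to picture the rigid transition state model is as an ordinary ground-state force field plus stiff restraint terms. The expression below is a schematic sketch only: the harmonic form, the symbols and the grouping of terms are assumptions made for illustration, with the TS-superscripted reference values taken from the model system.

```latex
E_{\mathrm{TSFF}} = E_{\mathrm{FF}}
  + \sum_{b \,\in\, \text{TS bonds}} k_b \left(r_b - r_b^{\mathrm{TS}}\right)^2
  + \sum_{a \,\in\, \text{TS angles}} k_a \left(\theta_a - \theta_a^{\mathrm{TS}}\right)^2
  + \sum_{d \,\in\, \text{TS dihedrals}} k_d \left(\phi_d - \phi_d^{\mathrm{TS}}\right)^2
```

With large force constants, any departure of the reacting atoms from the reference geometry is heavily penalized, so a standard minimizer relaxes only the rest of the catalyst and reactant around a fixed core.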
The first use of a TSFF was for a theoretical study of hydroborations by Houk et al.
(see Figure 1.15).206 A model system consisting of ethylene and BH3 was computed using
HF/3-21G* and used to constrain the atoms involved in the transition state. Chiral
boranes reported in the literature were built and the stereoselectivities were calculated. If
energies for multiple conformations of a single transition state were within a few
kcal/mol of each other, a Boltzmann distribution over all conformations was used to
determine the stereoselectivity. This simple method showed that molecular mechanics
can be used to predict the stereoselectivity of a reaction with good accuracy, usually within
10-15%. Even though this approach proves to be less accurate than QM methods, the
ability to screen compounds with higher throughput made it applicable to the VS of new
catalysts.
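The Boltzmann step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of the cited study: the function names are hypothetical, energies are taken to be relative transition state energies in kcal/mol, and the two competing pathways are simply labelled A and B.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def boltzmann_weights(energies, temperature=298.15):
    """Return normalized Boltzmann populations for energies in kcal/mol."""
    e_min = min(energies)  # shift by the minimum for numerical stability
    factors = [math.exp(-(e - e_min) / (R_KCAL * temperature)) for e in energies]
    total = sum(factors)
    return [f / total for f in factors]


def predicted_excess(energies_a, energies_b, temperature=298.15):
    """Percent excess of pathway A over pathway B.

    Each argument lists the energies of all transition state conformers
    leading to one stereoisomer. A negative result means B is favoured.
    """
    pooled = list(energies_a) + list(energies_b)
    weights = boltzmann_weights(pooled, temperature)
    p_a = sum(weights[:len(energies_a)])  # total population of A conformers
    p_b = 1.0 - p_a
    return 100.0 * (p_a - p_b)
```

Calling `predicted_excess` with one list of conformer energies per stereoisomer returns a signed percentage, so conformers lying well above the lowest one contribute almost nothing to the result.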
Figure 1.15 - Hydroboration of alkenes
Moitessier et al. also used a rigid transition state model TSFF to aid in the
rationalization of the unexpected outcome of the dihydroxylation of a benzyl protected
allyl xylose. In this study, the isolated isomer was opposite to the one expected from the
Sharpless mnemonic (see Figure 1.16).207, 208
Figure 1.16 - Dihydroxylation of xylose.
An initial transition state model was built using results from a previous study by
Delmonte et al.182 This model was then used as the core for the transition state and was
frozen during the optimization of the catalyst using a modified CFF91 force field. Since
only a new transition state structure would result in the opposite stereoisomer, a
conformational search was needed, for which they used a genetic algorithm. To
validate and optimize the protocol, two catalysts were first investigated (1 and 2, see
Figure 1.17). With these systems two binding modes were identified, one which
corresponded to the Corey model (alkene sandwiched between the two walls of the
catalyst) and the other one corresponding to the Sharpless model (alkene interacting with
the floor). Even though the Sharpless [2+2] model was discounted, the proposed
conformation was still predicted to exist. With this information validating the genetic
algorithm, the protocol was next applied to model the dihydroxylation of the benzyl
protected xylose 3. The isomer opposite to that predicted by the Sharpless mnemonic
resulted from the alkene being too large to adopt either binding mode. Instead, the
protected allyl xylose encompassed the catalyst (see Figure 1.17, 3). With these promising
results, a VS of alkenes was undertaken. The accuracy of the predicted stereoselectivities
was lower than that of pure QM methods, but the predictions were obtained in a fraction of
the time. In addition, the protocol’s ability to predict the ranked list shows its promise
as a method for VS.
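To make the idea of a genetic algorithm conformational search concrete, the following toy sketch evolves a population of torsion-angle vectors. Everything in it is an assumption for illustration: the energy function is a made-up threefold cosine term rather than a force field, and the selection, crossover and mutation settings are arbitrary and are not those of the study described above.

```python
import math
import random


def toy_energy(torsions):
    """Hypothetical energy: one threefold cosine term per dihedral (degrees)."""
    return sum(1.0 + math.cos(math.radians(3.0 * t)) for t in torsions)


def genetic_search(n_torsions=4, pop_size=30, generations=40, seed=0):
    """Return the best torsion vector found and the best energy per generation."""
    rng = random.Random(seed)
    population = [[rng.uniform(-180.0, 180.0) for _ in range(n_torsions)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=toy_energy)
        history.append(toy_energy(population[0]))
        parents = population[:pop_size // 2]      # truncation selection
        children = [population[0][:]]             # elitism: keep the best as is
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, n_torsions)    # one-point crossover
            child = mother[:cut] + father[cut:]
            for i in range(n_torsions):           # random-reset mutation
                if rng.random() < 0.1:
                    child[i] = rng.uniform(-180.0, 180.0)
            children.append(child)
        population = children
    population.sort(key=toy_energy)
    history.append(toy_energy(population[0]))
    return population[0], history
```

Because the best individual is always copied unchanged into the next generation, the best energy in `history` can never increase. In a real application the chromosome would encode the rotatable bonds of the catalyst and substrate, and the fitness would be the force field energy with the transition state core held fixed.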
Figure 1.17 - Sharpless dihydroxylation catalyst studied for optimization and validation
of genetic algorithm, (Black = catalyst, Blue = alkene, Green = frozen atoms). Above is
3D representation, below is schematic.
A similar technique has been developed by Harriman and Deslongchamps and
termed Reverse Docking.209 Whereas in traditional docking a flexible ligand is docked
into a rigid protein, here a rigid transition state is docked into a flexible catalyst,
modeling the reactant/catalyst transition structure. The resulting structure can then be
used to predict stereoselectivities. The initial version of this method was based on the
AutoDock67 conformational search engine and HF/6-31G* transition states. Its
application to the azidation of α,β-unsaturated carbonyls with Miller’s catalyst validated
the approach (see Figure 1.18A). This first version was able to predict the conformation
of the catalyst and the favoured stereoisomer although the stereomeric excesses were
poorly reproduced. The method was then developed into an independent program called
EM-Dock, recently implemented within MOE.210 This new version was applied to the
TADDOL-catalysed asymmetric hetero Diels-Alder reaction (see Figure 1.18B) but was
still unable to predict stereomeric excess.210 In these first two versions, the scores for the
van der Waals and electrostatic interactions between the reactants and the catalyst were
calculated using a grid based energy method similar to the one implemented in
AutoDock.38, 67, 211 Large energy differences between the stereoisomers resulted from the
use of the grid and therefore their protocol was modified to use a pair-wise potential. The
application of this new scoring method to the TADDOL-catalysed asymmetric hetero
Diels-Alder reaction212 and the organocatalyzed asymmetric Strecker hydrocyanation of
aldimines and ketimines213 (see Figure 1.18C) resulted in the desired decrease in these
energy differences. This change in scoring led to accurate prediction of stereoselectivities
when compared to experimental results.
Figure 1.18 - Reactions studied with reverse docking: A) Azidation of α,β-unsaturated
carbonyls, B) TADDOL-catalysed asymmetric hetero Diels-Alder reaction and C)
organocatalyzed asymmetric Strecker hydrocyanation of aldimines.
In all these MM studies, the atoms involved directly in the formation of new
interactions were frozen, an approximation which is often reasonable. However, when the
catalyst structure changes drastically from one reactant to the next, flexible transition
state methods are desirable.
An easy extension of the rigid transition state TSFF would be to allow for smaller
force constants for interactions which have been frozen. The challenge is determining
what these values should be. MMX was developed so that the equilibrium bond lengths and
force constants are a function of bond order.214 An issue arises since bond orders are not
explicitly known for transition states and in reality force constants may not be directly
proportional to bond order. ReaxFF is a similar method which allows the bond order to
vary as a function of the bond distance.215-217 To overcome this problem, Norrby et al.218
developed the Q2MM method where the TSFF is entirely developed from QM
calculations. This method has been applied to many reactions and has been shown to be
highly accurate for the prediction of stereomeric excess.218-222 However, because QM is used
to develop the parameters, a significant investment in time and expertise is still required.
Q2MM has been applied to the asymmetric dihydroxylation reaction (see Figure 1.8) for the
prediction and rationalization of selectivities.220, 223-225 By using Q2MM, fairly good
correlations between predicted and experimental selectivities were achieved.224 Q2MM
has also been applied to the Horner-Wadsworth-Emmons reaction218, 226 (see Figure
1.19). This reaction involves two transition states, necessitating the development of
parameters for both transition states and the study of multiple diastereomeric pathways to
identify the rate limiting step. Because the energy difference between TS1 and TS2 could
not be determined accurately, predictions of selectivities were difficult, and
accurate predictions were possible only for high selectivities (i.e., above 95%).226
Figure 1.19 - Mechanism of the Horner-Wadsworth-Emmons reaction.
Another option is to approximate the transition state as the intersection of two
ground states interacting through a mixing term. This technique, known as the empirical
valence bond method (EVB) (see Figure 1.20), was suggested by Warshel and Weiss.227, 228
Figure 1.20 - Mixing of two ground states to find transition state. (Plot of energy against
reaction coordinate, with curves labelled Ereactant, Eproduct and E.)
Although it was initially used to simulate enzyme reactions, it can also be applied to
organic reactions. For EVB to predict a transition state, the relative energies of the
reactant and product are assumed to be similar. This restriction comes from the force field
itself. Most force fields are only meant to reproduce heats of formation and compare relative
energies of molecules with identical connectivities. Another issue arises when structures
are far away from the energy minimum. Force fields have difficulty with distorted
structures, and some force fields, more specifically class I force fields, do not have an
accurate description of the van der Waals energy term at short distance (i.e., steep
Lennard-Jones potential). To overcome this shortcoming, more complex functions that
better represent the PES for distorted structures can be used, such as the Morse potential
in MM3.229-233 EVB creates a model PES by using a weighted sum of the energies of the
reactants and products.
(1.1) $E_{\mathrm{model}} = (1 - \lambda)\,E_{\mathrm{reactant}} + \lambda\,E_{\mathrm{product}}$
These energies are then projected onto the true PES using the mixing term.
(1.2) $E = E_{\mathrm{model}} + E_{\mathrm{mix}}$
This mixing term describes the mixing between the reactant and product PES.
Typically, λ values close to 0.5 correspond to a minimum on the model PES, allowing for
adequate searching of the transition state (see Figure 1.21). The reaction force field
(RFF)234 and multi configurational molecular mechanics (MCMM)235, 236 are similar
approaches. These techniques have been validated by simulating reaction pathways but
have not been directly applied to the prediction of stereomeric excesses.
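The behaviour of the weighted sum in Equation 1.1 can be illustrated with a one-dimensional toy model. The sketch below is an assumption-laden example rather than an EVB implementation: both ground states are hypothetical harmonic wells with equal force constants, the reaction coordinate x is dimensionless, and the mixing term of Equation 1.2 is left out.

```python
def reactant_energy(x, k=100.0):
    """Hypothetical harmonic reactant ground state with its minimum at x = 0."""
    return 0.5 * k * x ** 2


def product_energy(x, k=100.0, offset=0.0):
    """Hypothetical harmonic product ground state with its minimum at x = 1."""
    return 0.5 * k * (x - 1.0) ** 2 + offset


def model_energy(x, lam):
    """Weighted sum of reactant and product energies (Equation 1.1)."""
    return (1.0 - lam) * reactant_energy(x) + lam * product_energy(x)


def model_minimum(lam, n_points=2001):
    """Locate the minimum of the model PES on a grid from x = -0.5 to x = 1.5."""
    grid = [-0.5 + 2.0 * i / (n_points - 1) for i in range(n_points)]
    return min(grid, key=lambda x: model_energy(x, lam))
```

For these equal-curvature wells the minimum of the model surface sits at x = λ, so scanning λ from 0 to 1 moves the minimum smoothly from the reactant to the product, and λ = 0.5 places it at the point where the two ground state curves cross. That crossing region is where the transition state is sought, which is why an ordinary minimizer can be used on the model PES.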
Figure 1.21 - All energies are calculated on the model PES then projected onto the true
PES using a mixing term. (Plot of energy against reaction coordinate, with labels E and EMix
and model curves for λ = 0, 0.25, 0.5, 0.75 and 1.0.)
If only the relative energy between two stereoisomers is desired, it is also possible
to neglect the mixing term (equivalent for both stereomeric transition states) and assume
λ is equal to 0.5 for the transition state. This approach is undertaken within the SEAM
method.237-240 This method has not been applied to the prediction of stereoselectivities but
rather to the prediction of the geometry of a transition state and the reactivity of
reactants. The initial versions of SEAM237, 238 were validated on simple reactions such as
SN2 displacement of alkyl halides. These initial studies showed good correlation between
experimental and predicted reactivities. Later studies went on to examine achiral
pericyclic reactions and compared their results against a QM method (PM3),
demonstrating a fairly good correlation.239, 240
CONCLUSION
In summary, it is possible to effectively search for the transition state of reactions
using a plethora of methods (see Figure 1.22). However, caution is needed when selecting
the method. For a complete one-time search of a PES, either QM or EVB methods are
appropriate. For highly accurate prediction of stereomeric excess, QM methods should
also be used. If one wants to perform a virtual screening or computer-aided optimization
of a catalyst, specialized molecular mechanics methods such as TSFF, EVB or SEAM are
better suited.
Figure 1.22 - Summary of methods used to find transition states
1.3 OUTLINE OF THESIS
The need for quick and viable alternatives to experimental approaches such as high
throughput or rational stepwise optimization has led to the development of
computational molecular design tools. These tools now guide chemists and aid in their
search for novel chemical space (i.e., chemical entities). Even though these tools have
been developed, they are based on many assumptions and require new ideas to enable a
quicker and more efficient search for novel molecules.
Chapter 2 describes the development and validation of FITTED1.0, a new tool for
docking small molecules to flexible and hydrated macromolecules. FITTED1.0
incorporates macromolecular flexibility by allowing the use of multiple protein input
structures and a genetic algorithm to perform a conformational search on both the protein
and ligand. Displaceable bridging waters are accounted for by using a switching function
which turns off the ligand’s interaction with the water when it is too close. Application of
FITTED1.0 first demonstrated the importance of the inclusion of these features and
validated the method, which accurately predicted the binding poses of a test set of 33
protein-ligand complexes.
With the initial iteration of FITTED complete, there was a need to increase the speed
and make the program more appropriate to VS applications. Chapter 3 describes the
modifications to FITTED to enable the pruning of virtual libraries using toxicophores and
Lipinski’s rules and the inclusion of an automatically created set of interaction sites to
aid in creating a population for the genetic algorithm. FITTED1.5 was shown to be more
accurate than the initial version and allowed for the time-sensitive screening of a virtual
library against HCV polymerase and the identification of novel active inhibitors.
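As an illustration of library pruning with Lipinski's rules, the sketch below filters compounds on the four standard rule-of-five thresholds. It is not the FITTED implementation: the `Compound` container and function names are hypothetical, the descriptors are assumed to be precomputed by another tool, the one-violation tolerance is a common convention exposed as a parameter, and toxicophore filtering is not covered.

```python
from dataclasses import dataclass


@dataclass
class Compound:
    """Precomputed descriptors for one library entry (hypothetical container)."""
    name: str
    mol_weight: float       # molecular weight in g/mol
    logp: float             # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(c: Compound) -> int:
    """Count how many of the four rule-of-five thresholds are exceeded."""
    return sum([
        c.mol_weight > 500.0,
        c.logp > 5.0,
        c.h_bond_donors > 5,
        c.h_bond_acceptors > 10,
    ])


def prune_library(library, max_violations=1):
    """Keep only compounds with at most `max_violations` rule violations."""
    return [c for c in library if lipinski_violations(c) <= max_violations]
```

Applying such a filter before docking removes compounds unlikely to be orally bioavailable, so that docking time is spent only on the remaining entries.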
One of the major discussions of late in the field of docking is how to perform a
comparative study properly, with a focus on which input parameters are used, such as
ligand conformation, protein conformation and the inclusion of waters. An enhanced
version of FITTED (version 2.6) was used in a comparative study, described in chapter 4,
with other major docking programs, and the impact of these parameters on the outcome of
the comparative study is discussed. This study reveals that the traditional way of
comparing programs is not a realistic scenario. In reality, the results of comparative
studies vary greatly depending on the input conformations of the protein and ligand.
With all the VS tools in the field of drug design and development there is a need to
use these techniques as a stepping stone to create VS tools for organic chemists. This led
to the development of ACE1.0, described in chapter 5, a tool for the screening of
asymmetric catalysts. ACE uses a linear combination of reactant and product interactions
to give an approximation of the transition state parameters. Two reactions were initially
screened and the predictions showed excellent correlation with experimental data.
1.4 REFERENCES
1. Richon, A. B., Current status and future direction of the molecular modeling
industry. Drug Discov. Today 2008, 13 (15-16), 665-669.
2. Richon, A. B., An early history of the molecular modeling industry. Drug Discov.
Today 2008, 13 (15-16), 659-664.
3. Guido, R. V. C.; Oliva, G.; Andricopulo, A. D., Virtual screening and its
integration with modern drug design technologies. Curr. Med. Chem. 2008, 15
(1), 37-46.
4. Borman, S., Drugs by design. Chem. Eng. News 2005, 83 (48), 28-30.
5. Clark, D. E., What has virtual screening ever done for drug discovery? Expert
Opin. Drug Discov. 2008, 3 (8), 841-851.
6. Kirchmair, J.; Distinto, S.; Schuster, D.; Spitzer, G.; Langer, T.; Wolber, G.,
Enhancing drug discovery through in silico screening: Strategies to increase true
positives retrieval rates. Curr. Med. Chem. 2008, 15 (20), 2040-2053.
7. Shoichet, B. K.; McGovern, S. L.; Wei, B.; Irwin, J. J., Hits, Leads and Artifacts
from Virtual and High Throughput Screening. Molecular Informatics:
Confronting Complexity, May 13th - 16th 2002.
8. Parker, C. N., McMaster University data-mining and docking competition:
Computational models on the catwalk. J. Biomol. Screen. 2005, 10 (7), 647-648.
9. Lang, P. T.; Kuntz, I. D.; Maggiora, G. M.; Bajorath, J., Evaluating the high-
throughput screening computations. J. Biomol. Screen. 2005, 10 (7), 649-652.
10. Douguet, D., Ligand-based approaches in virtual screening. Curr. Comput.-Aided
Drug Des. 2008, 4 (3), 180-190.
11. Gedeck, P.; Lewis, R. A., Exploiting QSAR models in lead optimization. Curr.
Opin. Drug Disc. 2008, 11 (4), 569-575.
12. Lengauer, T.; Lemmen, C.; Rarey, M.; Zimmermann, M., Novel technologies for
virtual screening. Drug Discov. Today 2004, 9 (1), 27-34.
13. Auer, J.; Bajorath, J., Molecular similarity concepts and search calculations.
Methods in molecular biology (Clifton, N.J.) 2008, 453, 327-347.
14. Sun, H., Pharmacophore-based virtual screening. Curr. Med. Chem. 2008, 15
(10), 1018-1024.
15. Moitessier, N.; Englebienne, P.; Lee, D.; Lawandi, J.; Corbeil, C. R., Towards the
development of universal, fast and highly accurate docking/scoring methods: A
long way to go. Br. J. Pharmacol. 2008, 153 (SUPPL. 1), S7-S26.
16. Berman, H.; Henrick, K.; Nakamura, H., Announcing the worldwide Protein Data
Bank. Nat. Struct. Mol. Biol. 2003, 10 (12), 980-980.
17. Levinthal, C.; Wodak, S. J.; Kahn, P.; Dadivanian, A. K., Hemoglobin interaction
in sickle cell fibers I: Theoretical approaches to the molecular contacts. Proc.
Natl. Acad. Sci. U. S. A. 1975, 72 (4), 1330-1334.
18. Kuntz, I. D.; Blaney, J. M.; Oatley, S. J.; Langridge, R.; Ferrin, T. E., A geometric
approach to macromolecule-ligand interactions. J. Mol. Biol. 1982, 161 (2), 269-
288.
19. Mizutani, M. Y.; Tomioka, N.; Itai, A., Rational automatic search method for
stable docking models of protein and ligand. J. Mol. Biol. 1994, 243 (2), 310-326.
20. Yamada, M.; Itai, A., Development of an efficient automated docking method.
Chem. Pharm. Bull. 1993, 41 (6), 1200-1202.
21. Bissantz, C.; Folkers, G.; Rognan, D., Protein-based virtual screening of chemical
databases. 1. Evaluation of different docking/scoring combinations. J. Med.
Chem. 2000, 43 (25), 4759-4767.
22. Bursulaya, B. D.; Totrov, M.; Abagyan, R.; Brooks III, C. L., Comparative study
of several algorithms for flexible ligand docking. J. Comput.-Aided Mol. Des.
2003, 17 (11), 755-763.
23. Kontoyianni, M.; McClellan, L. M.; Sokol, G. S., Evaluation of Docking
Performance: Comparative Data on Docking Algorithms. J. Med. Chem. 2004, 47
(3), 558-565.
24. Perola, E.; Walters, W. P.; Charifson, P. S., A detailed comparison of current
docking and scoring methods on systems of pharmaceutical relevance. Proteins
2004, 56 (2), 235-249.
25. Kellenberger, E.; Rodrigo, J.; Muller, P.; Rognan, D., Comparative evaluation of
eight docking tools for docking and virtual screening accuracy. Proteins 2004, 57
(2), 225-242.
26. Cummings, M. D.; DesJarlais, R. L.; Gibbs, A. C.; Mohan, V.; Jaeger, E. P.,
Comparison of automated docking programs as virtual screening tools. J. Med.
Chem. 2005, 48 (4), 962-976.
27. Warren, G. L.; Andrews, C. W.; Capelli, A. M.; Clarke, B.; LaLonde, J.; Lambert,
M. H.; Lindvall, M.; Nevins, N.; Semus, S. F.; Senger, S.; Tedesco, G.; Wall, I.
D.; Woolven, J. M.; Peishoff, C. E.; Head, M. S., A Critical Assessment of
Docking Programs and Scoring Functions. J. Med. Chem. 2006, 49 (20), 5912-
5931.
28. Klebe, G., Virtual ligand screening: strategies, perspectives and limitations. Drug
Discov. Today 2006, 11 (13-14), 580-594.
29. Jalaie, M.; Shanmugasundaram, V., Virtual screening: Are we there yet? Mini-
Rev. Med. Chem. 2006, 6 (10), 1159-1167.
30. Muegge, I.; Oloff, S., Advances in virtual screening. Drug Discovery Today:
Technologies 2006, 3 (4), 405-411.
31. Fara, D. C.; Oprea, T. I.; Prossnitz, E. R.; Bologa, C. G.; Edwards, B. S.; Sklar, L.
A., Integration of virtual and physical screening. Drug Discovery Today:
Technologies 2006, 3 (4), 377-385.
32. Irwin, J. J., Community benchmarks for virtual screening. J. Comput.-Aided Mol.
Des. 2008, 1-7.
33. Murray, C. W.; Baxter, C. A.; Frenkel, A. D., The sensitivity of the results of
molecular docking to induced fit effects: Application to thrombin, thermolysin
and neuraminidase. J. Comput.-Aided Mol. Des. 1999, 13 (6), 547-562.
34. Osterberg, F.; Morris, G. M.; Sanner, M. F.; Olson, A. J.; Goodsell, D. S.,
Automated docking to multiple target structures: Incorporation of protein mobility
and structural water heterogeneity in autodock. Proteins 2002, 46 (1), 34-40.
35. Cavasotto, C. N.; Abagyan, R. A., Protein Flexibility in Ligand Docking and
Virtual Screening to Protein Kinases. J. Mol. Biol. 2004, 337 (1), 209-225.
36. Gehlhaar, D. K.; Verkhivker, G. M.; Rejto, P. A.; Sherman, C. J.; Fogel, D. B.;
Fogel, L. J.; Freer, S. T., Molecular recognition of the inhibitor AG-1343 by HIV-
1 protease: Conformationally flexible docking by evolutionary programming.
Chem. Biol. 1995, 2 (5), 317-324.
37. Ghose, A. K.; Crippen, G. M., Geometrically feasible binding modes of a flexible
ligand molecule at the receptor site. J. Comput. Chem. 1985, 6 (5), 350-359.
38. Goodsell, D. S.; Olson, A. J., Automated docking of substrates to proteins by
simulated annealing. Proteins 1990, 8 (3), 195-202.
39. Corbeil, C. R.; Moitessier, N., Docking Ligands into Flexible and Solvated
Macromolecules. 3. Impact of Input Ligand Conformation, Protein Flexibility and
Water Molecules on Accuracy of Major Docking Programs. J. Chem. Inf. Model.
2008, Submitted.
40. Rarey, M.; Kramer, B.; Lengauer, T.; Klebe, G., A fast flexible docking method
using an incremental construction algorithm. J. Mol. Biol. 1996, 261 (3), 470-489.
41. Jain, A. N., Morphological similarity: A 3D molecular similarity method
correlated with protein-ligand recognition. J. Comput.-Aided Mol. Des. 2000, 14
(2), 199-213.
42. Jain, A., Surflex-Dock 2.1: Robust performance from ligand energetic modeling,
ring flexibility, and knowledge-based search. J. Comput.-Aided Mol. Des. 2007,
21 (5), 281-306.
43. Beautrait, A.; Leroux, V.; Chavent, M.; Ghemtio, L.; Devignes, M. D.; Smaïl-
Tabbone, M.; Cai, W.; Shao, X.; Moreau, G.; Bladon, P.; Yao, J.; Maigret, B.,
Multiple-step virtual screening using VSM-G: Overview and validation of fast
geometrical matching enrichment. J. Mol. Model. 2008, 14 (2), 135-148.
44. Diller, D. J.; Merz, K. M., Jr., High throughput docking for library design and
library prioritization. Proteins 2001, 43 (2), 113-124.
45. Jackson, R. M., Q-fit: A probabilistic method for docking molecular fragments by
sampling low energy conformational space. J. Comput.-Aided Mol. Des. 2002, 16
(1), 43-57.
46. Wu, S. Y.; McNae, I.; Kontopidis, G.; McClue, S. J.; McInnes, C.; Stewart, K. J.;
Wang, S.; Zheleva, D. I.; Marriage, H.; Lane, D. P.; Taylor, P.; Fischer, P. M.;
Walkinshaw, M. D., Discovery of a novel family of CDK inhibitors with the
program LIDAEUS: Structural basis for ligand-induced disordering of the
activation loop. Structure 2003, 11 (4), 399-410.
47. Jain, A. N., Surflex: Fully automatic flexible molecular docking using a molecular
similarity-based search engine. J. Med. Chem. 2003, 46 (4), 499-511.
48. Ewing, T. J. A.; Kuntz, I. D., Critical evaluation of search algorithms for
automated molecular docking and database screening. J. Comput. Chem. 1997, 18
(9), 1175-1189.
49. Yamagishi, M. E. B.; Martins, N. F.; Neshich, G.; Cai, W.; Shao, X.; Beautrait,
A.; Maigret, B., A fast surface-matching procedure for protein-ligand docking. J.
Mol. Model. 2006, 12 (6), 965-972.
50. Zsoldos, Z.; Reid, D.; Simon, A.; Sadjad, B. S.; Johnson, A. P., eHiTS: An
innovative approach to the docking and scoring function problems. Curr. Protein
Pept. Sci. 2006, 7 (5), 421-435.
51. Zsoldos, Z.; Reid, D.; Simon, A.; Sadjad, B. S.; Johnson, A. P., eHiTS: A new
fast, exhaustive flexible ligand docking system. J. Mol. Graph. Modell. 2007, 26
(1), 198-212.
52. Böhm, H. J., The computer program LUDI: a new method for the de novo design
of enzyme inhibitors. J. Comput.-Aided Mol. Des. 1992, 6 (1), 61-78.
53. Nishibata, Y.; Itai, A., Confirmation of usefulness of a structure construction
program based on three-dimensional receptor structure for rational lead
generation. J. Med. Chem. 1993, 36 (20), 2921-2928.
54. Gillet, V.; Myatt, G.; Zsoldos, Z.; Johnson, A., SPROUT, HIPPO and CAESA:
Tools for de novo structure generation and estimation of synthetic accessibility.
Perspect. Drug. Discov. 1995, 3 (1), 34-50.
55. Makino, S.; Ewing, T. J. A.; Kuntz, I. D., DREAM++: Flexible docking program
for virtual combinatorial libraries. J. Comput.-Aided Mol. Des. 1999, 13 (5), 513-
532.
56. Leach, A. R.; Kuntz, I. D., Conformational analysis of flexible ligands in
macromolecular receptor sites. J. Comput. Chem. 1992, 13 (6), 730-748.
57. Ewing, T. J. A.; Makino, S.; Skillman, A. G.; Kuntz, I. D., DOCK 4.0: Search
strategies for automated molecular docking of flexible molecule databases. J.
Comput.-Aided Mol. Des. 2001, 15 (5), 411-428.
58. Schnecke, V.; Swanson, C. A.; Getzoff, E. D.; Tainer, J. A.; Kuhn, L. A.,
Screening a peptidyl database for potential ligands to proteins with side-chain
flexibility. Proteins 1998, 33 (1), 74-87.
59. Schnecke, V.; Kuhn, L. A., Virtual screening with solvation and ligand-induced
complementarity. Perspect. Drug. Discov. 2000, 20, 171-190.
60. Oshiro, C. M.; Kuntz, I. D.; Dixon, J. S., Flexible ligand docking using a genetic
algorithm. J. Comput.-Aided Mol. Des. 1995, 9 (2), 113-130.
61. Jones, G.; Willett, P.; Glen, R. C., Molecular recognition of receptor sites using a
genetic algorithm with a description of desolvation. J. Mol. Biol. 1995, 245 (1),
43-53.
62. Clark, K. P.; Ajay, Flexible ligand docking without parameter adjustment across
four ligand-receptor complexes. J. Comput. Chem. 1995, 16 (10), 1210-1226.
63. Ayala, F. J., Darwin's greatest discovery: Design without designer. Proc. Natl.
Acad. Sci. U. S. A. 2007, 104 (SUPPL. 1), 8567-8573.
64. Abraham, A.; Nedjah, N.; Mourelle, L. d. M., Evolutionary computation: From
genetic algorithms to genetic programming. In Studies in Computational
Intelligence, Nedjah, N.; Macedo Mourelle, L.; Abraham, A., Eds. 2006; Vol. 13,
pp 1-20.
65. Jones, G.; Willett, P.; Glen, R. C.; Leach, A. R.; Taylor, R., Development and
validation of a genetic algorithm for flexible docking. J. Mol. Biol. 1997, 267 (3),
727-748.
66. Gould, S. J., Lamarck and the Birth of Modern Evolutionism in Two-Factor
Theories. In The Structure of Evolutionary Theory, Belknap Harvard: 2002; p 170.
67. Morris, G. M.; Goodsell, D. S.; Halliday, R. S.; Huey, R.; Hart, W. E.; Belew, R.
K.; Olson, A. J., Automated docking using a Lamarckian genetic algorithm and an
empirical binding free energy function. J. Comput. Chem. 1998, 19 (14), 1639-
1662.
68. Corbeil, C. R.; Englebienne, P.; Moitessier, N., Docking Ligands into Flexible
and Solvated Macromolecules. 1. Development and Validation of FITTED 1.0. J.
Chem. Inf. Model. 2007, 47 (2), 435-449.
69. Corbeil, C. R.; Englebienne, P.; Yannopoulos, C. G.; Chan, L.; Das, S. K.;
Bilimoria, D.; Heureux, L.; Moitessier, N., Docking Ligands into Flexible and
Solvated Macromolecules. 2. Development and Application of FITTED 1.5 to the
Virtual Screening of Potential HCV Polymerase Inhibitors. J. Chem. Inf. Model.
2008, 48 (4), 902-909.
70. Abagyan, R.; Totrov, M.; Kuznetsov, D., ICM - A new method for protein
modeling and design: Applications to docking and structure prediction from the
distorted native conformation. J. Comput. Chem. 1994, 15 (5), 488-506.
71. Totrov, M.; Abagyan, R., Flexible protein-ligand docking by global energy
optimization in internal coordinates. Proteins 1997, 29 (SUPPL. 1), 215-220.
72. Friesner, R. A.; Banks, J. L.; Murphy, R. B.; Halgren, T. A.; Klicic, J. J.; Mainz,
D. T.; Repasky, M. P.; Knoll, E. H.; Shelley, M.; Perry, J. K.; Shaw, D. E.;
Francis, P.; Shenkin, P. S., Glide: A New Approach for Rapid, Accurate Docking
and Scoring. 1. Method and Assessment of Docking Accuracy. J. Med. Chem.
2004, 47 (7), 1739-1749.
73. Venkatachalam, C. M.; Jiang, X.; Oldfield, T.; Waldman, M., LigandFit: A novel
method for the shape-directed rapid docking of ligands to protein active sites. J.
Mol. Graph. Modell. 2003, 21 (4), 289-307.
74. Chen, H. M.; Liu, B. F.; Huang, H. L.; Hwang, S. F.; Ho, S. Y., SODOCK:
Swarm optimization for highly flexible protein-ligand docking. J. Comput. Chem.
2007, 28 (2), 612-623.
75. Namasivayam, V.; Günther, R., PSO@AUTODOCK: A fast flexible molecular
docking program based on swarm intelligence. Chem. Biol. Drug Des. 2007, 70
(6), 475-484.
76. Korb, O.; Stützle, T.; Exner, T. E. In PLANTS: Application of ant colony
optimization to structure-based drug design, Lecture Notes in Computer Science
(including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in
Bioinformatics), Brussels, 2006; pp 247-258.
77. Jain, A. N., Bias, reporting, and sharing: Computational evaluations of docking
methods. J. Comput.-Aided Mol. Des. 2008, 22 (3-4), 201-212.
78. Jain, A. N.; Nicholls, A., Recommendations for evaluation of computational
methods. J. Comput.-Aided Mol. Des. 2008, 22 (3-4), 133-139.
79. Hawkins, P. C. D.; Warren, G. L.; Skillman, A. G.; Nicholls, A., How to do an
evaluation: Pitfalls and traps. J. Comput.-Aided Mol. Des. 2008, 22 (3-4), 179-
190.
80. Boström, J.; Greenwood, J. R.; Gottfries, J., Assessing the performance of
OMEGA with respect to retrieving bioactive conformations. J. Mol. Graph.
Modell. 2003, 21 (5), 449-462.
81. 4.11 ring_conf. In LigPrep 2.2 User Manual, Schrödinger, LLC.: 2008; p 42.
82. Gasteiger, J.; Rudolph, C.; Sadowski, J., Automatic Generation of 3D-Atomic
Coordinates for Organic Molecules. Tetrahedron Computer Methodology 1990, 3
(6C), 537-547.
83. Rusinko, A., III; Sheridan, R. P.; Nilakantan, R.; Haraki, K. S.; Bauman, N.;
Venkataraghavan, R., Using CONCORD to construct a large database of three-
dimensional coordinates from connection tables. J. Chem. Inf. Comput. Sci.
1989, 29, 251-255.
84. Guner, O. F.; Henry, D. R.; Pearlman, R. S., Use of flexible queries for searching
conformationally flexible molecules in databases of three-dimensional structures.
J. Chem. Inf. Comput. Sci. 1992, 32, 101-109.
85. Xu, H.; Izrailev, S.; Agrafiotis, D. K., Conformational sampling by self-
organization. J. Chem. Inf. Comput. Sci. 2003, 43 (4), 1186-1191.
86. 15.5 Interface to CORINA. In FlexX Release 3 with GUI User Guide and
Technical Reference, BioSolveIT GmbH: 2008; p 365.
87. Goto, H.; Osawa, E., Corner flapping: A simple and fast algorithm for exhaustive
generation of ring conformations. J. Am. Chem. Soc. 1989, 111 (24), 8950-8951.
88. Goto, H.; Osawa, E., Further Developments in the Algorithm for Generating
Cyclic Conformers - Test with Cycloheptadecane. Tetrahedron Lett. 1992, 33
(10), 1343-1346.
89. Payne, A. W. R.; Glen, R. C., Molecular recognition using a binary genetic search
algorithm. J. Mol. Graph. 1993, 11 (2), 74-91+121.
90. Bursavich, M. G.; Rich, D. H., Designing non-peptide peptidomimetics in the 21st
century: Inhibitors targeting conformational ensembles. J. Med. Chem. 2002, 45
(3), 541-558.
91. Erickson, J. A.; Jalaie, M.; Robertson, D. H.; Lewis, R. A.; Vieth, M., Lessons in
Molecular Recognition: The Effects of Ligand and Protein Flexibility on
Molecular Docking Accuracy. J. Med. Chem. 2004, 47 (1), 45-55.
92. Abagyan, R.; Totrov, M., High-throughput docking for lead generation. Curr.
Opin. Chem. Biol. 2001, 5 (4), 375-382.
93. Shoichet, B. K.; McGovern, S. L.; Wei, B.; Irwin, J. J., Lead discovery using
molecular docking. Curr. Opin. Chem. Biol. 2002, 6 (4), 439-446.
94. Mohan, V.; Gibbs, A. C.; Cummings, M. D.; Jaeger, E. P.; DesJarlais, R. L.,
Docking: Successes and challenges. Curr. Pharm. Des. 2005, 11 (3), 323-333.
95. Sousa, S. F.; Fernandes, P. A.; Ramos, M. J., Protein-ligand docking: Current
status and future challenges. Proteins 2006, 65 (1), 15-26.
96. Coupez, B.; Lewis, R. A., Docking and scoring - Theoretically easy, practically
impossible? Curr. Med. Chem. 2006, 13 (25), 2995-3003.
97. Kroemer, R. T., Structure-based drug design: Docking and scoring. Curr. Protein
Pept. Sci. 2007, 8 (4), 312-328.
98. Barril, X.; Morley, S. D., Unveiling the full potential of flexible receptor docking
using multiple crystallographic structures. J. Med. Chem. 2005, 48 (13), 4432-
4443.
99. Amaro, R. E.; Baron, R.; McCammon, J. A., An improved relaxed complex
scheme for receptor flexibility in computer-aided drug design. J. Comput.-Aided
Mol. Des. 2008, 22 (9), 693-705.
100. Garner, J.; Deadman, J.; Rhodes, D.; Griffith, R.; Keller, P. A., A new
methodology for the simulation of flexible protein-ligand interactions. J. Mol.
Graph. Modell. 2007, 26 (1), 187-197.
101. Withers, I. M.; Mazanetz, M. P.; Wang, H.; Fischer, P. M.; Laughton, C. A.,
Active site pressurization: A new tool for structure-guided drug design and other
studies of protein flexibility. J. Chem. Inf. Model. 2008, 48 (7), 1448-1454.
102. Keseru, G. M.; Kolossvary, I., Fully flexible low-mode docking: Application to
induced fit in HIV integrase. J. Am. Chem. Soc. 2001, 123 (50), 12708-12709.
103. Cavasotto, C. N.; Kovacs, J. A.; Abagyan, R. A., Representing receptor flexibility
in ligand docking through relevant normal modes. J. Am. Chem. Soc. 2005, 127
(26), 9632-9640.
104. Zheng, W.; Doniach, S., A comparative study of motor-protein motions by using a
simple elastic-network model. Proc. Natl. Acad. Sci. U. S. A. 2003, 100 (23),
13253-13258.
105. Mills, G.; Jónsson, H., Quantum and thermal effects in H2 dissociative
adsorption: Evaluation of free energy barriers in multidimensional quantum
systems. Phys. Rev. Lett. 1994, 72 (7), 1124-1127.
106. Jonsson, H.; Mills, G.; Jacobsen, K. W., Nudged elastic band method for finding
minimum energy paths of transitions. In Classical and Quantum Dynamics in
Condensed Phase Simulations, Berne, B. J.; Ciccotti, G.; Coker, D. F., Eds. World
Scientific: Singapore, 1998.
107. Sander, T.; Liljefors, T.; Balle, T., Prediction of the receptor conformation for
iGluR2 agonist binding: QM/MM docking to an extensive conformational
ensemble generated using normal mode analysis. J. Mol. Graph. Modell. 2008, 26
(8), 1259-1268.
108. Mathews, D. H.; Case, D. A., Nudged elastic band calculation of minimal energy
paths for the conformational change of a GG non-canonical pair. J. Mol. Biol.
2006, 357 (5), 1683-1693.
109. Jiang, F.; Kim, S. H., 'Soft docking': Matching of molecular surface cubes. J. Mol.
Biol. 1991, 219 (1), 79-102.
110. Mizutani, M. Y.; Takamatsu, Y.; Ichinose, T.; Nakamura, K.; Itai, A., Effective
handling of induced-fit motion in flexible docking. Proteins 2006, 63 (4), 878-
891.
111. Apostolakis, J.; Plückthun, A.; Caflisch, A., Docking small ligands in flexible
binding sites. J. Comput. Chem. 1998, 19 (1), 21-37.
112. Knegtel, R. M. A.; Kuntz, I. D.; Oshiro, C. M., Molecular docking to ensembles
of protein structures. J. Mol. Biol. 1997, 266 (2), 424-440.
113. Moitessier, N.; Westhof, E.; Hanessian, S., Docking of Aminoglycosides to
Hydrated and Flexible RNA. J. Med. Chem. 2006, 49 (3), 1023-1033.
114. Sotriffer, C. A.; Dramburg, I., "In situ cross-docking" to simultaneously address
multiple targets. J. Med. Chem. 2005, 48 (9), 3122-3125.
115. Leach, A. R., Ligand docking to proteins with discrete side-chain flexibility. J.
Mol. Biol. 1994, 235 (1), 345-356.
116. Schaffer, L.; Verkhivker, G. M., Predicting structural effects in HIV-1 protease
mutant complexes with flexible ligand docking and protein side-chain
optimization. Proteins 1998, 33 (2), 295-310.
117. Anderson, A. C.; O'Neil, R. H.; Surti, T. S.; Stroud, R. M., Approaches to solving
the rigid receptor problem by identifying a minimal set of flexible residues during
ligand docking. Chem. Biol. 2001, 8 (5), 445-457.
118. Bottegoni, G.; Kufareva, I.; Totrov, M.; Abagyan, R., A new method for ligand
docking to flexible receptors by dual alanine scanning and refinement (SCARE).
J. Comput.-Aided Mol. Des. 2008, 1-15.
119. Koska, J.; Spassov, V. Z.; Maynard, A. J.; Yan, L.; Austin, N.; Flook, P. K.;
Venkatachalam, C. M., Fully Automated Molecular Mechanics Based Induced Fit
Protein-Ligand Docking Method. J. Chem. Inf. Model. 2008, 48 (10), 1965-1973.
120. Sherman, W.; Beard, H. S.; Farid, R., Use of an induced fit receptor structure in
virtual screening. Chem. Biol. Drug Des. 2006, 67 (1), 83-84.
121. Kairys, V.; Gilson, M. K., Enhanced docking with the mining minima optimizer:
Acceleration and side-chain flexibility. J. Comput. Chem. 2002, 23 (16), 1656-
1670.
122. Alberts, I. L.; Todorov, N. P.; Dean, P. M., Receptor Flexibility in de Novo
Ligand Design and Docking. J. Med. Chem. 2005, 48 (21), 6585-6596.
123. Alberts, I. L.; Todorov, N. P.; Källblad, P.; Dean, P. M., Ligand docking and
design in a flexible receptor site. QSAR Comb. Sci. 2005, 24 (4), 503-507.
124. Verdonk, M. L.; Cole, J. C.; Hartshorn, M. J.; Murray, C. W.; Taylor, R. D.,
Improved protein-ligand docking using GOLD. Proteins 2003, 52 (4), 609-623.
125. Guilbert, C.; James, T. L., Docking to RNA via Root-Mean-Square-Deviation-
Driven Energy Minimization with Flexible Ligands and Flexible Targets. J.
Chem. Inf. Model. 2008, 48 (6), 1257-1268.
126. Pinto, I. G.; Guilbert, C.; Ulyanov, N. B.; Stearns, J.; James, T. L., Discovery of
ligands for a novel target, the human telomerase RNA, based on flexible-target
virtual screening and NMR. J. Med. Chem. 2008, 51 (22), 7205-7215.
127. Claußen, H.; Buning, C.; Rarey, M.; Lengauer, T., FLEXE: Efficient molecular
docking considering protein structure variations. J. Mol. Biol. 2001, 308 (2), 377-
395.
128. Wei, B. Q.; Weaver, L. H.; Ferrari, A. M.; Matthews, B. W.; Shoichet, B. K.,
Testing a flexible-receptor docking algorithm in a model binding site. J. Mol.
Biol. 2004, 337 (5), 1161-1182.
129. Moitessier, N.; Therrien, E.; Hanessian, S., A Method for Induced-Fit Docking,
Scoring, and Ranking of Flexible Ligands. Application to Peptidic and
Pseudopeptidic beta-secretase (BACE 1) Inhibitors. J. Med. Chem. 2006, 49 (20),
5885-5894.
130. Davis, I. W.; Baker, D., RosettaLigand Docking with Full Ligand and Receptor
Flexibility. J. Mol. Biol. 2009, 385 (2), 381-392.
131. Luty, B. A.; Wasserman, Z. R.; Stouten, P. F. W.; Hodge, C. N.; Zacharias, M.;
McCammon, J. A., A molecular mechanics/grid method for evaluation of ligand-
receptor interactions. J. Comput. Chem. 1995, 16 (4), 454-464.
132. Tatsumi, R.; Fukunishi, Y.; Nakamura, H., A hybrid method of molecular
dynamics and harmonic dynamics for docking of flexible ligand to flexible
receptor. J. Comput. Chem. 2004, 25 (16), 1995-2005.
133. Gervasio, F. L.; Laio, A.; Parrinello, M., Flexible docking in solution using
metadynamics. J. Am. Chem. Soc. 2005, 127 (8), 2600-2607.
134. Ladbury, J. E., Just add water! The effect of water on the specificity of protein-
ligand binding sites and its potential application to drug design. Chem. Biol. 1996,
3 (12), 973-980.
135. Barillari, C.; Taylor, J.; Viner, R.; Essex, J. W., Classification of water molecules
in protein binding sites. J. Am. Chem. Soc. 2007, 129 (9), 2577-2587.
136. Li, Z.; Lazaridis, T., Water at biomolecular binding interfaces. Phys. Chem.
Chem. Phys. 2007, 9 (5), 573-581.
137. Lam, P. Y. S.; Jadhav, P. K.; Eyermann, C. J.; Hodge, C. N.; Ru, Y.; Bacheler, L.
T.; Meek, J. L.; Otto, M. J.; Rayner, M. M.; Wong, Y. N.; Chang, C. H.; Weber,
P. C.; Jackson, D. A.; Sharpe, T. R.; Erickson-Viitanen, S., Rational design of
potent, bioavailable, nonpeptide cyclic ureas as HIV protease inhibitors. Science
1994, 263 (5145), 380-384.
138. Grzesiek, S.; Bax, A.; Nicholson, L. K.; Yamazaki, T.; Wingfield, P.; Stahl, S. J.;
Eyermann, C. J.; Torchia, D. A.; Hodge, C. N.; Lam, P. Y. S.; Jadhav, P.
K.; Chang, C. H., NMR evidence for the displacement of a conserved interior
water molecule in HIV protease by a non-peptide cyclic urea-based inhibitor. J.
Am. Chem. Soc. 1994, 116 (4), 1581-1582.
139. Hodge, C. N.; Aldrich, P. E.; Bacheler, L. T.; Chang, C. H.; Eyermann, C. J.;
Garber, S.; Grubb, M.; Jackson, D. A.; Jadhav, P. K.; Korant, B.; Lam, P. Y. S.;
Maurin, M. B.; Meek, J. L.; Otto, M. J.; Rayner, M. M.; Reid, C.; Sharpe, T. R.;
Shum, L.; Winslow, D. L.; Erickson-Viitanen, S., Improved cyclic urea inhibitors
of the HIV-1 protease: Synthesis, potency, resistance profile, human
pharmacokinetics and X-ray crystal structure of DMP 450. Chem. Biol. 1996, 3
(4), 301-314.
140. Champness, J. N.; Bennett, M. S.; Wien, F.; Visse, R.; Summers, W. C.;
Herdewijn, P.; De Clercq, E.; Ostrowski, T.; Jarvest, R. L.; Sanderson, M. R.,
Exploring the active site of herpes simplex virus type-1 thymidine kinase by X-
ray crystallography of complexes with aciclovir and other ligands. Proteins 1998,
32 (3), 350-361.
141. De Graaf, C.; Pospisil, P.; Pos, W.; Folkers, G.; Vermeulen, N. P. E., Binding
mode prediction of cytochrome P450 and thymidine kinase protein-ligand
complexes by consideration of water and rescoring in automated docking. J. Med.
Chem. 2005, 48 (7), 2308-2318.
142. Roberts, B. C.; Mancera, R. L., Ligand-protein docking with water molecules. J.
Chem. Inf. Model. 2008, 48 (2), 397-408.
143. Huang, N.; Shoichet, B. K., Exploiting ordered waters in molecular docking. J.
Med. Chem. 2008, 51 (16), 4862-4865.
144. Rarey, M.; Kramer, B.; Lengauer, T., The particle concept: Placing discrete water
molecules during protein-ligand docking predictions. Proteins 1999, 34 (1), 17-
28.
145. Verdonk, M. L.; Chessari, G.; Cole, J. C.; Hartshorn, M. J.; Murray, C. W.;
Nissink, J. W. M.; Taylor, R. D.; Taylor, R., Modeling water molecules in
protein-ligand docking using GOLD. J. Med. Chem. 2005, 48 (20), 6504-6515.
146. Jiang, L.; Kuhlman, B.; Kortemme, T.; Baker, D., A "solvated rotamer" approach
to modeling water-mediated hydrogen bonds at protein-protein interfaces.
Proteins 2005, 58 (4), 893-904.
147. van Dijk, A. D. J.; Bonvin, A. M. J. J., Solvated docking: Introducing water into
the modelling of biomolecular complexes. Bioinformatics 2006, 22 (19), 2340-
2347.
148. Goodford, P. J., A computational procedure for determining energetically
favorable binding sites on biologically important macromolecules. J. Med. Chem.
1985, 28 (7), 849-857.
149. Wade, R. C.; Goodford, P. J., Further development of hydrogen bond functions
for use in determining energetically favorable binding sites on molecules of
known structure. 2. Ligand probe groups with the ability to form more than two
hydrogen bonds. J. Med. Chem. 1993, 36 (1), 148-156.
150. Pitt, W. R.; Goodfellow, J. M., Modelling of solvent positions around polar
groups in proteins. Protein Eng. 1991, 4 (5), 531-537.
151. Amadasi, A.; Surface, J. A.; Spyrakis, F.; Cozzini, P.; Mozzarelli, A.; Kellogg, G.
E., Robust classification of "relevant" water molecules in putative protein binding
sites. J. Med. Chem. 2008, 51 (4), 1063-1067.
152. Miranker, A.; Karplus, M., Functionality maps of binding sites: A multiple copy
simultaneous search method. Proteins 1991, 11 (1), 29-34.
153. Harding, M. M., The geometry of metal-ligand interactions relevant to proteins.
Acta Crystallogr. D 1999, 55 (8), 1432-1443.
154. Harding, M. M., The geometry of metal-ligand interactions relevant to proteins.
II. Angles at the metal atom, additional weak metal-donor interactions. Acta
Crystallogr. D 2000, 56 (7), 857-867.
155. Harding, M. M., Geometry of metal-ligand interactions in proteins. Acta
Crystallogr. D 2001, 57 (3), 401-411.
156. Harvey, M. A.; Baggio, S.; Baggio, R., A new simplifying approach to molecular
geometry description: The vectorial bond-valence model. Acta Crystallogr. B
2006, 62 (6), 1038-1042.
157. Seebeck, B.; Reulecke, I.; Kämper, A.; Rarey, M., Modeling of metal interaction
geometries for protein-ligand docking. Proteins 2008, 71 (3), 1237-1254.
158. Uberbacher, E.; LoCascio, P.; Passovets, S.; Ghattyvenkatakrishna, P.; Agarwal,
P.; Arnold, N.; Bordner, A.; Gorin, A., Computational challenges for modeling
and simulating biomacromolecular assemblies. Journal of Physics: Conference
Series 2006, 46 (1), 311-315.
159. Houk, K. N.; Paddon-Row, M. N.; Rondan, N. G., Theory and modeling of
stereoselective organic reactions. Science 1986, 231 (4742), 1108-1117.
160. Houk, K. N.; Cheong, P. H. Y., Computational prediction of small-molecule
catalysts. Nature 2008, 455 (7211), 309-313.
161. Lynch, B. J.; Truhlar, D. G., How well can hybrid density functional methods
predict transition state geometries and barrier heights? J. Phys. Chem. A 2001, 105
(13), 2936-2941.
162. Bahmanyar, S.; Houk, K. N., The origin of stereoselectivity in proline-catalyzed
intramolecular aldol reactions. J. Am. Chem. Soc. 2001, 123 (51), 12911-12912.
163. Bahmanyar, S.; Houk, K. N., Transition States of Amine-Catalyzed Aldol
Reactions Involving Enamine Intermediates: Theoretical Studies of Mechanism,
Reactivity, and Stereoselectivity. J. Am. Chem. Soc. 2001, 123 (45), 11273-11283.
164. Rankin, K. N.; Gauld, J. W.; Boyd, R. J., Density Functional Study of the Proline-
Catalyzed Direct Aldol Reaction. J. Phys. Chem. A 2002, 106 (20), 5155-5159.
165. Tang, Z.; Jiang, F.; Yu, L. T.; Cui, X.; Gong, L. Z.; Mi, A. Q.; Jiang, Y. Z.; Wu,
Y. D., Novel small organic molecules for a highly enantioselective direct aldol
reaction. J. Am. Chem. Soc. 2003, 125 (18), 5262-5263.
166. Bahmanyar, S.; Houk, K. N.; Martin, H. J.; List, B., Quantum Mechanical
Predictions of the Stereoselectivities of Proline-Catalyzed Asymmetric
Intermolecular Aldol Reactions. J. Am. Chem. Soc. 2003, 125 (9), 2475-2479.
167. Clemente, F. R.; Houk, K. N., Computational evidence for the enamine
mechanism of intramolecular aldol reactions catalyzed by proline. Angew. Chem.
Int. Ed. 2004, 43 (43), 5766-5768.
168. Allemann, C.; Gordillo, R.; Clemente, F. R.; Cheong, P. H. Y.; Houk, K. N.,
Theory of asymmetric organocatalysis of aldol and related reactions:
Rationalizations and predictions. Acc. Chem. Res. 2004, 37 (8), 558-569.
169. Cheong, P. H. Y.; Houk, K. N., Origins and predictions of stereoselectivity in
intramolecular aldol reactions catalyzed by proline derivatives. Synthesis 2005,
(9), 1533-1537.
170. Spencer, T. A.; Neel, H. S.; Flechtner, T. W.; Zayle, R. A., Observations on amine
catalysis of formation and dehydration of ketols. Tetrahedron Lett. 1965, 6 (43),
3889-3897.
171. Hajos, Z. G.; Parrish, D. R., Asymmetric synthesis of bicyclic intermediates of
natural product chemistry. J. Org. Chem. 1974, 39 (12), 1615-1621.
172. Agami, C.; Meynier, F.; Puchot, C.; Guilhem, J.; Pascard, C., Stereochemistry-59.
New insights into the mechanism of the proline-catalyzed asymmetric Robinson
cyclization; structure of two intermediates. Asymmetric dehydration. Tetrahedron
1984, 40 (6), 1031-1038.
173. Agami, C.; Puchot, C., Kinetic analysis of the dual catalysis by proline in the
asymmetric intramolecular aldol reaction. J. Mol. Catal. 1986, 38 (3), 341-343.
174. Blaney, J. M.; Dixon, J. S., A good ligand is hard to find: Automated docking
methods. Perspect. Drug. Discov. 1993, 1 (2), 301-319.
175. Hoang, L.; Bahmanyar, S.; Houk, K. N.; List, B., Kinetic and stereochemical
evidence for the involvement of only one proline molecule in the transition states
of proline-catalyzed intra- and intermolecular aldol reactions. J. Am. Chem. Soc.
2003, 125 (1), 16-17.
176. Shinisha, C. B.; Sunoj, R. B., Bicyclic proline analogues as organocatalysts for
stereoselective aldol reactions: An in silico DFT study. Org. Biomol. Chem. 2007,
5 (8), 1287-1294.
177. Jørgensen, K. A.; Hoffmann, R., Binding of alkenes to the ligands in OsO2X2 (X
= O and NR) and CpCo(NO)2. A frontier orbital study of the formation of
intermediates in the transition-metal-catalyzed synthesis of diols, amino alcohols,
and diamines. J. Am. Chem. Soc. 1986, 108 (8), 1867-1876.
178. Chong, A. O.; Oshima, K.; Sharpless, K. B., Synthesis of dioxobis(tert-
alkylimido)osmium(VIII) and oxotris(tert-alkylimido)osmium(VIII) Complexes.
Stereospecific vicinal diamination of olefins. J. Am. Chem. Soc. 1977, 99 (10),
3420-3426.
179. Pidun, U.; Boehme, C.; Frenking, G., Theory Rules Out a [2 + 2] Addition of
Osmium Tetroxide to Olefins as Initial Step of the Dihydroxylation Reaction.
Angew. Chem. Int. Ed. 1996, 35 (23-24), 2817-2820.
180. Dapprich, S.; Ujaque, G.; Maseras, F.; Lledos, A.; Musaev, D. G.; Morokuma, K.,
Theory does not support an osmaoxetane intermediate in the osmium-catalyzed
dihydroxylation of olefins. J. Am. Chem. Soc. 1996, 118 (46), 11660-11661.
181. Torrent, M.; Deng, L.; Duran, M.; Sola, M.; Ziegler, T., Density functional study
of the [2+2]- and [2+3]-cycloaddition mechanisms for the osmium-catalyzed
dihydroxylation of olefins. Organometallics 1997, 16 (1), 13-19.
182. DelMonte, A. J.; Haller, J.; Houk, K. N.; Sharpless, K. B.; Singleton, D. A.;
Strassner, T.; Thomas, A. A., Experimental and Theoretical Kinetic Isotope
Effects for Asymmetric Dihydroxylation. Evidence Supporting a Rate-Limiting
"(3 + 2)" Cycloaddition. J. Am. Chem. Soc. 1997, 119 (41), 9907-9908.
183. Ess, D. H.; Jones, G. O.; Houk, K. N., Conceptual, qualitative, and quantitative
theories of 1,3-dipolar and Diels-Alder cycloadditions used in synthesis. Adv.
Synth. Catal. 2006, 348 (16-17), 2337-2361.
184. Gordillo, R.; Houk, K. N., Origins of stereoselectivity in Diels-Alder
cycloadditions catalyzed by chiral imidazolidinones. J. Am. Chem. Soc. 2006, 128
(11), 3543-3553.
185. Bakalova, S. M.; Santos, A. G., A computational study of the Diels-Alder reaction
of ethyl-S-lactyl acrylate and cyclopentadiene. Origins of stereoselectivity. J. Org.
Chem. 2004, 69 (24), 8475-8481.
186. Dinadayalane, T. C.; Vijaya, R.; Smitha, A.; Sastry, G. N., Diels-Alder reactivity
of butadiene and cyclic five-membered dienes ((CH)4X, X = CH2, SiH2, O, NH,
PH, and S) with ethylene: A benchmark study. J. Phys. Chem. A 2002, 106 (8),
1627-1633.
187. Goumans, T. P. M.; Ehlers, A. W.; Lammertsma, K.; Würthwein, E. U.;
Grimme, S., Improved reaction and activation energies of [4+2] cycloadditions,
[3,3] sigmatropic rearrangements and electrocyclizations with the
spin-component-scaled MP2 method. Chem. Eur. J. 2004, 10 (24), 6468-6475.
188. Bakalova, S. M.; Santos, A. G., A theoretical study of the stereoselectivities of the
Diels-Alder addition of cyclopentadiene to ethyl-(S)-lactyl acrylate catalyzed by
aluminium chloride. Eur. J. Org. Chem. 2006, (7), 1779-1789.
189. Jones, G. O.; Guner, V. A.; Houk, K. N., Diels-Alder reactions of
cyclopentadiene and 9,10-dimethylanthracene with cyanoalkenes: The
performance of density functional theory and Hartree-Fock calculations for the
prediction of substituent effects. J. Phys. Chem. A 2006, 110 (4), 1216-1224.
190. Mitsumori, S.; Zhang, H.; Cheong, P. H. Y.; Houk, K. N.; Tanaka, F.; Barbas, C.
F., Direct asymmetric anti-Mannich-type reactions catalyzed by a designed amino
acid. J. Am. Chem. Soc. 2006, 128 (4), 1040-1041.
191. Cheong, P. H. Y.; Zhang, H.; Thayumanavan, R.; Tanaka, F.; Houk, K. N.;
Barbas, C. F., III, Pipecolic acid-catalyzed direct asymmetric Mannich reactions. Org.
Lett. 2006, 8 (5), 811-814.
192. Balcells, D.; Maseras, F., Computational approaches to asymmetric synthesis.
New J. Chem. 2007, 31 (3), 333-343.
193. Drudis-Sole, G.; Ujaque, G.; Maseras, F.; Lledos, A., A QM/MM study of the
asymmetric dihydroxylation of terminal aliphatic n-alkenes with
OsO4·(DHQD)2PYDZ: Enantioselectivity as a function of chain length. Chem.
Eur. J. 2005, 11 (3), 1017-1029.
194. Ujaque, G.; Maseras, F.; Lledós, A., Theoretical study on the origin of
enantioselectivity in the bis(dihydroquinidine)-3,6-pyridazine·osmium
tetroxide-catalyzed dihydroxylation of styrene. J. Am. Chem. Soc. 1999, 121 (6),
1317-1323.
195. Hoogenraad, M.; Klaus, G. M.; Elders, N.; Hooijschuur, S. M.; McKay, B.;
Smith, A. A.; Damen, E. W. P., Oxazaborolidine mediated asymmetric ketone
reduction: Prediction of enantiomeric excess based on catalyst structure.
Tetrahedron: Asymmetry 2004, 15 (3), 519-523.
196. Van Der Linden, J. B.; Ras, E. J.; Hooijschuur, S. M.; Klaus, G. M.; Luchters, N.
T.; Dani, P.; Verspui, G.; Smith, A. A.; Damen, E. W. P.; McKay, B.;
Hoogenraad, M., Asymmetric catalytic ketone hydrogenation: Relating substrate
structure and product enantiomeric excess using QSPR. QSAR Comb. Sci. 2005,
24 (1), 94-98.
197. Chavali, S.; Lin, B.; Miller, D. C.; Camarda, K. V., Environmentally-benign
transition metal catalyst design using optimization techniques. Comput. Chem.
Eng. 2004, 28 (5), 605-611.
198. Lin, B.; Chavali, S.; Camarda, K.; Miller, D. C., Computer-aided molecular
design using Tabu search. Comput. Chem. Eng. 2005, 29 (2), 337-347.
199. Alvarez, S.; Schefzick, S.; Lipkowitz, K.; Avnir, D., Quantitative Chirality
Analysis of Molecular Subunits of Bis(oxazoline)copper(II) Complexes in
Relation to Their Enantioselective Catalytic Activity. Chem. Eur. J. 2003, 9 (23),
5832-5837.
200. Kozlowski, M. C.; Dixon, S. L.; Panda, M.; Lauri, G., Quantum mechanical
models correlating structure with selectivity: Predicting the enantioselectivity of
β-amino alcohol catalysts in aldehyde alkylation. J. Am. Chem. Soc. 2003, 125
(22), 6614-6615.
201. Lipkowitz, K. B.; Pradhan, M., Computational studies of chiral catalysts: A
Comparative Molecular Field Analysis of an asymmetric Diels-Alder reaction
with catalysts containing bisoxazoline or phosphinooxazoline ligands. J. Org.
Chem. 2003, 68 (12), 4648-4656.
202. Sciabola, S.; Alex, A.; Higginson, P. D.; Mitchell, J. C.; Snowden, M. J.; Morao,
I., Theoretical Prediction of the Enantiomeric Excess in Asymmetric Catalysis.
An Alignment-Independent Molecular Interaction Field Based Approach. J. Org.
Chem. 2005, 70 (22), 9025-9027.
203. Ianni, J. C.; Annamalai, V.; Phuan, P.-W.; Panda, M.; Kozlowski, M. C., A Priori
Theoretical Prediction of Selectivity in Asymmetric Catalysis: Design of Chiral
Catalysts by Using Quantum Molecular Interaction Fields. Angew. Chem. Int. Ed.
2006, 45 (33), 5502-5505.
204. Huang, J.; Ianni, J. C.; Antoline, J. E.; Hsung, R. P.; Kozlowski, M. C., De Novo
Chiral Amino Alcohols in Catalyzing Asymmetric Additions to Aryl Aldehydes.
Org. Lett. 2006, 8 (8), 1565-1568.
205. Jensen, F.; Norrby, P. O., Transition states from empirical force fields. Theor.
Chem. Acc. 2003, 109 (1), 1-7.
206. Houk, K. N.; Rondan, N. G.; Wu, Y. D.; Metz, J. T.; Paddon-Row, M. N.,
Theoretical studies of stereoselective hydroborations. Tetrahedron 1984, 40 (12),
2257-2274.
207. Moitessier, N.; Maigret, B.; Chre?tien, F.; Chapleur, Y., Molecular dynamics-
based models explain the unexpected diastereoselectivity of the sharpless
asymmetric dihydroxylation of allyl D- xylosides. Eur. J. Org. Chem. 2000, (6),
995-1005.
208. Moitessier, N.; Henry, C.; Len, C.; Chapleur, Y., Toward a computational tool
predicting the stereochemical outcome of asymmetric reactions. 1. Application to
sharpless asymmetric dihydroxylation. J. Org. Chem. 2002, 67 (21), 7275-7282.
CHAPTER 1
- 67 -
209. Harriman, D. J.; Deslongchamps, G., Reverse-docking as a computational tool for
the study of asymmetric organocatalysis. J. Comput.-Aided Mol. Des. 2004, 18
(5), 303-308.
210. Harriman, J. D.; Deslongchamps, G., Reverse-docking study of the TADDOL-
catalyzed asymmetric hetero-Diels-Alder reaction. J. Mol. Model. 2006, 12 (6),
793-797.
211. Morris, G. M.; Goodsell, D. S.; Huey, R.; Olson, A. J., Distributed automated
docking of flexible ligands to proteins: Parallel applications of AutoDock 2.4. J.
Comput.-Aided Mol. Des. 1996, 10 (4), 293-304.
212. Harriman, D. J.; Deleavey, G. F.; Lambropoulos, A.; Deslongchamps, G.,
Reverse-docking study of the organocatalyzed asymmetric Strecker
hydrocyanation of aldimines and ketimines. Tetrahedron 2007, 63 (52), 13032-
13038.
213. Harriman, D. J.; Lambropoulos, A.; Deslongchamps, G., In silico correlation of
enantioselectivity for the TADDOL catalyzed asymmetric hetero-Diels-Alder
reaction. Tetrahedron Lett. 2007, 48 (4), 689-692.
214. Eksterowicz, J. E.; Houk, K. N., Transition-state modeling with empirical force
fields. Chem. Rev. 1993, 93 (7), 2439-2461.
215. Van Duin, A. C. T.; Dasgupta, S.; Lorant, F.; Goddard Iii, W. A., ReaxFF: A
reactive force field for hydrocarbons. J. Phys. Chem. A 2001, 105 (41), 9396-
9409.
216. Nielson, K. D.; Van Duin, A. C. T.; Oxgaard, J.; Deng, W. Q.; Goddard Iii, W. A.,
Development of the ReaxFF reactive force field for describing transition metal
catalyzed reactions, with application to the initial stages of the catalytic formation
of carbon nanotubes. J. Phys. Chem. A 2005, 109 (3), 493-499.
217. Chenoweth, K.; Van Duin, A. C. T.; Persson, P.; Cheng, M. J.; Oxgaard, J.;
Goddard Iii, W. A., Development and application of a ReaxFF reactive force field
for oxidative dehydrogenation on vanadium oxide catalysts. J. Phys. Chem. C
2008, 112 (37), 14645-14654.
218. Norrby, P. O., Selectivity in asymmetric synthesis from QM-guided molecular
mechanics. J. Mol. Struct. THEOCHEM 2000, 506, 9-16.
CHAPTER 1
- 68 -
219. Rasmussen, T.; Norrby, P. O., Modeling the stereoselectivity of the β-amino
alcohol-promoted addition of dialkylzinc to aldehydes. J. Am. Chem. Soc. 2003,
125 (17), 5130-5138.
220. Fristrup, P.; Jensen, G. H.; Andersen, M. L. N.; Tanner, D.; Norrby, P. O.,
Combining Q2MM modeling and kinetic studies for refinement of the osmium-
catalyzed asymmetric dihydroxylation (AD) mnemonic. J. Organomet. Chem.
2006, 691 (10), 2182-2198.
221. Donoghue, P. J.; Kieken, E.; Helquist, P.; Wiest, O., Development of a Q2MM
force field for the silver(I)-catalyzed hydroamination of alkynes. Adv. Synth.
Catal. 2007, 349 (17-18), 2647-2654.
222. Rydberg, P.; Olsen, L.; Norrby, P. O.; Ryde, U., General transition-state force
field for cytochrome P450 hydroxylation. J. Chem. Theory Comput. 2007, 3 (5),
1765-1773.
223. Becker, H.; Ho, P. T.; Kolb, H. C.; Loren, S.; Norrby, P. O.; Sharpless, K. B.,
Comparing two models for the selectivity in the asymmetric dihydroxylation
reaction (AD). Tetrahedron Lett. 1994, 35 (40), 7315-7318.
224. Norrby, P. O.; Rasmussen, T.; Haller, J.; Strassner, T.; Houk, K. N., Rationalizing
the stereoselectivity of osmium tetroxide asymmetric dihydroxylations with
transition state modeling using quantum mechanics- guided molecular mechanics.
J. Am. Chem. Soc. 1999, 121 (43), 10186-10192.
225. Fristrup, P.; Tanner, D.; Norrby, P. O., Updating the asymmetric osmium-
catalyzed dihydroxylation (AD) mnemonic: Q2MM modeling and new kinetic
measurements. Chirality 2003, 15 (4), 360-368.
226. Norrby, P. O.; Brandt, P.; Rein, T., Rationalization of Product Selectivities in
Asymmetric Horner-Wadsworth-Emmons Reactions by Use of a New Method for
Transition-State Modeling. J. Org. Chem. 1999, 64 (16), 5845-5852.
227. Warshel, A.; Weiss, R. M., An empirical valence bond approach for comparing
reactions in solutions and in enzymes. J. Am. Chem. Soc. 1980, 102 (20), 6218-
6226.
228. Aqvist, J.; Warshel, A., Simulation of enzyme reactions using valence bond force
fields and other hybrid quantum/classical approaches. Chem. Rev. 1993, 93 (7),
2523-2544.
CHAPTER 1
- 69 -
229. Allinger, N. L.; Yuh, Y. H.; Lii, J.-H., Molecular Mechanics. The MM3 Force
Field for Hydrocarbon 3. 1. J. Am. Chem. Soc. 1989, 111 (23), 8551-8566.
230. Lii, J.-H.; Allinger, N. L., Molecular Mechanics. The MM3 Force Field for
Hydrocarbons. 2. Vibrational Frequencies and Thermodynamics. J. Am. Chem.
Soc. 1989, 111 (23), 8566-8575.
231. Lii, J.-H.; Allinger, N. L., Molecular Mechanics. The MM3 Force Field for
Hydrocarbons. 3. The van der Waals’ Potentials and Crystal Data for Aliphatic
and Aromatic Hydrocarbons. J. Am. Chem. Soc. 1989, 111 (23), 8576-8582.
232. Allinger, N. L.; Li, F.; Yan, L., Molecular Mechanics. The MM3 Force Field for
Alkenes. J. Comput. Chem. 1990, 11 (7), 848-867.
233. Allinger, N. L.; Li, F.; Yan, L.; Tai, J. C., Molecular Mechanics (MM3)
Calculations on Conjugated Hydrocarbons. J. Comput. Chem. 1990, 11 (7), 868-
895.
234. Rappé, A. K.; Pietsch, M. A.; Wiser, D. C.; Hart, J. R.; Bormann-Rochotte, L. M.;
Skiff, W. M., Rff, Conceptual Development of a Full Periodic Table Force Field
for Studying Reaction Potential Surfaces. Mol. Eng. 1997, 7 (3), 385-400.
235. Kim, Y.; Corchado, J. C.; Villa, J.; Xing, J.; Truhlar, D. G., Multiconfiguration
molecular mechanics algorithm for potential energy surfaces of chemical
reactions. J. Chem. Phys. 2000, 112 (6), 2718-2735.
236. Truhlar, D. G., Valence bond theory for chemical dynamics. J. Comput. Chem.
2007, 28 (1), 73-86.
237. Jensen, F., Locating minima on seams of intersecting potential energy surfaces.
An application to transition structure modeling. J. Am. Chem. Soc. 1992, 114 (5),
1596-1603.
238. Jensen, F., Transition structure modeling by intersecting potential energy surfaces.
J. Comput. Chem. 1994, 15 (11), 1199-1216.
239. Jensen, F., Using force fields methods for locating transition structures. J. Chem.
Phys. 2003, 119 (17), 8804-8808.
240. Olsen, P. T.; Jensen, F., Modeling chemical reactions for conformationally mobile
systems with force field methods. J. Chem. Phys. 2003, 118 (8), 3523-3531.
CHAPTER 1
- 70 -
CHAPTER 2
- 71 -
CHAPTER TWO
The widespread use of the lock-and-key model for ligand-protein docking has been
found to decrease docking accuracy, as seen when self-docking results are compared to
cross-docking results. With this in mind, we have developed FITTED, which overcomes this
major assumption by modeling a more dynamic protein-ligand binding process. Within
this chapter, the development of FITTED is discussed. The dynamics of ligand-protein
binding are addressed using a Lamarckian genetic algorithm, which allows for the flexibility
of both the ligand and the protein, and a switching function, which models the displacement
of bridging water molecules. FITTED was validated on a set of 33 ligand-protein complexes;
it showed good accuracy and demonstrated the importance of including protein flexibility
and displaceable bridging water molecules.
This chapter is a copy and is reproduced with permission from the Journal of Chemical
Information and Modeling. This article is cited as Corbeil, C. R.; Englebienne, P.;
Moitessier, N., Docking Ligands into Flexible and Solvated Macromolecules. 1.
Development and Validation of FITTED 1.0. Journal of Chemical Information and
Modeling 2007, 47, (2), 435-449. Copyright 2007, with permission from the American
Chemical Society.
DOCKING LIGANDS INTO FLEXIBLE AND SOLVATED
MACROMOLECULES. 1.
DEVELOPMENT AND VALIDATION OF FITTED 1.0
ABSTRACT
We report the development and validation of a novel suite of programs, FITTED
1.0, for the docking of flexible ligands into flexible proteins. This docking tool is unique
in that it can deal with both the flexibility of macromolecules (side-chains and main-
chains) and the presence of bridging water molecules while treating protein/ligand
complexes as realistically dynamic systems. This software relies on a genetic algorithm
to account for the flexibility of the two molecules, as well as the location of bridging
water molecules. In addition, FITTED 1.0 features a novel application of a switching
function to retain or displace key water molecules from the protein-ligand complexes.
Two independent modules, ProCESS and SMART, were developed to set up the proteins
and the ligands prior to the docking stage. Validation of the accuracy of the software was
achieved via the application of FITTED 1.0 to the docking of inhibitors of HIV-1 protease,
thymidine kinase, trypsin, factor Xa and MMP to their respective proteins.
INTRODUCTION
Fast, cost-effective and accurate methods of drug design are essential to modern-day
medicinal chemistry. Docking-based rational design methods provide a quick and
economical alternative to high-throughput screening and to more traditional drug discovery
approaches, and are increasingly popular.1
Over the last few years, several comparative studies of docking programs have
been published and show the poor accuracy of some of the commercially available
packages, with Glide and GOLD being amongst the best programs.2-8 In most studies,
inhibitors are accurately docked back to their corresponding protein structure
(self-docking). However, it has been shown that docking to other structures (cross-docking)
performs poorly.9-11 This failure results in part from the use of inaccurate protein models.
Several docking programs treat the proteins as rigid objects and do not account for
conformational changes upon binding, resulting in this observed poor performance in the
cross-docking studies and low enrichment factors in virtual screening studies.12,13
Improvement of the developed software is necessary to include more accurate protein
models.
To account for the discrepancy between self- and cross-docking, various strategies
have been explored and implemented in existing software.14,15 The program FlexE uses a
set of protein structures as an input and describes the side-chain and main-chain
flexibility.16 SLIDE, which handles flexible side-chains,15 can also explore the main-chain
flexibility when coupled with ROCK.17 Another docking program, AutoDock,
models rigid proteins using grids that can be combined into grids that approximate
ensembles of conformations.10 In a fourth strategy, Glide, when merged with Prime,
accounts for protein adjustments through the use of homology models.18
We have recently proposed a novel concept for the docking of ligands to solvated
biopolymers,19 a pharmacophore-oriented docking approach,20 and a genetic algorithm
(GA) based docking method.21 The latter takes advantage of more than one structure to
dock compounds in virtually flexible proteins. Using a similar approach to Lengauer and
co-workers16 and Shoichet and co-workers,22 we used a library of experimentally
observed protein conformations and made composite structures to model the protein
flexibility and to explore a wide region of conformational space. The proteins and ligands
were described as genes and a mixed Lamarckian/Darwinian evolution optimized the
entire complex. This virtual flexibility was found to significantly increase the accuracy of
the docking of BACE-1 inhibitors.21
We report herein the development of FITTED 1.0 (Flexibility Induced Through
Targeted Evolutionary Description), a suite of programs based on a genetic algorithm
(GA) with an emphasis on speed. This docking program is unique in that it can deal with
both the flexibility of macromolecules and the presence of bridging “displaceable” water
molecules. Operators in addition to the more traditional cross-over and mutation were
implemented and led to a significant increase in speed. These operators simulate the
learning (through energy minimization at various stages) and the early selection of
individuals based on a crude estimation of their fitness (e.g., is the ligand in the binding
site?). A new potential energy function modeling the interaction with displaceable water
molecules and two modules (ProCESS and SMART) needed to prepare the ligands and
proteins are also described. A validation of the accuracy of the docking program was
performed on five different sets of protein-ligand complexes: HIV-1 protease, thymidine
kinase, trypsin, factor Xa and stromelysin-1 co-crystallized with a variety of inhibitors.
THEORY AND IMPLEMENTATION
Proof of Concept. Our previous report21 of the use of a Lamarckian GA to account for both
ligand and protein flexibility used Discover 3.0 as a force field engine23 and
considered only the ligand torsion angles as degrees of freedom. The flexibility of the
side chains and main chains of the target protein was modeled using a library of
conformations (from available data) that could evolve by means of genetic operators
(cross-over and mutations). This early version needed an anchor atom to ensure
convergence in a reasonable period of time. In practice, runs took as long
as 20 hours for the most flexible ligands. Upon further investigation, we found that more
than 96% of CPU time was consumed by intermediate minimization steps (part of the
Lamarckian GA). The inclusion of ligand translational and rotational degrees of freedom
led to intractable computations. This proof of concept led us to develop a program based
on the same concept with a strong focus on the CPU time required, instead of using many
independent programs that communicate neither quickly nor easily with each other.
FITTED 1.0 includes a force field engine to perform conjugate gradient minimization24
and a genetic algorithm.
Program Requirements and Setup. FITTED is being designed to be a docking-based
virtual screening (VS) tool. Before libraries can be screened, the docking algorithm must
be validated. In practice, aspects of the docking routine that are common to all runs
should be performed only once. First, since protein structures are common to all runs in a
VS study, it is best to have a separate program to do their setup once, quickly and
efficiently. Second, a virtual library of drug-like molecules is, in practice, applied to more
than one biologically relevant target and should also be prepared independently from the
VS run. These two aspects led us to create modules for FITTED, namely ProCESS and
SMART, described in greater detail in the following sections. The use of modules is a
common practice, as exemplified by the AutoDock suite of programs.25 FITTED, SMART
and ProCESS can be run either from the command line in Linux or as console applications in
Windows. However, the accuracy of FITTED was found to be highly compiler-dependent,
and caution should be taken to ensure the suitability of the compiler before using FITTED
(gcc v.3.2.3 was found to be the best).
ProCESS, a Tool to Prepare Protein Files. In order to have protein files useable by
FITTED, we developed the ProCESS module (Protein Conformational Ensemble System
Setup) which assigns the advanced residue names, advanced hydrogen names, atom types
and charges for the protein as discussed below. FITTED may use several protein files to
account for protein flexibility. However, these files must be prepared consistently (i.e.,
same atom names and numbering, same primary sequence). ProCESS requires all-atom
proteins in mol2 format as inputs. Various programs (InsightII, Maestro, Sybyl) can be
used to add the hydrogens to the PDB files. However, these graphical interfaces do not all
generate the same mol2 files from the same PDB files (differing residue names, hydrogen
names, order of atoms). ProCESS first ensures that the protein files are consistent and can
be used unambiguously. Rules exist for the naming of atoms/groups in proteins (PDB).26
For instance, CYS and ASP designate cysteine and aspartic acid residues respectively.
However, this naming does not give information on the protonation state of the side-
chain nor does it identify the terminal residues. CYS can be involved in a disulfide bridge
or not while ASP can be protonated or not. These residue names cannot be used
unambiguously to assign partial charges and atom types. To address this issue, the
graphical interfaces assign various residue names for ionized and neutral aspartyl
residues: Maestro (Schrödinger): ASP and ASH; InsightII (Accelrys): ASP- and ASP;
Sybyl (Tripos): ASP and ASZ. Additional names are used for capped terminal residues in
Sybyl: AMN or AMI, and CXL or CXC for the ionized or neutral terminal amino and
carboxylate groups respectively. In order to use mol2 files generated with these different
interfaces, ProCESS reassigns advanced names to the added hydrogens (e.g., HA for alpha
hydrogens, HB1 and HB2 for beta hydrogens) by examining their chemical environment
and then assigns the advanced residue names, starting with the terminal residues. If the
residue is an N-terminus, it is assigned a fourth letter, an N; if it is a C-terminus, a C is
appended to the name. ProCESS then checks for the protonation state of CYS, ASP and
GLU. CYS is CYSH if not involved in a disulfide bridge; ASP and GLU are ASPH and
GLUH if neutral. HIS can have one of three possible names: HISE if the hydrogen is on
the ε nitrogen, HISD if the hydrogen is on the δ nitrogen and HISP if positively charged.
By defining the chemical environment, these advanced names aid ProCESS in assigning
appropriate partial charges and atom types to the protein.
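The naming rules described above can be summarized in a short sketch. It is illustrative only and is not the ProCESS implementation: the function name, its arguments, and the way a terminal suffix is combined with a protonation-state name are assumptions made for this example.

```python
def advanced_residue_name(name, *, neutral=False, in_disulfide=False,
                          his_state=None, terminus=None):
    """Return an advanced residue name following the rules in the text.

    All argument names are hypothetical. his_state must be "epsilon",
    "delta" or "charged" when name is "HIS".
    """
    if name == "CYS" and not in_disulfide:
        name = "CYSH"                      # free cysteine
    elif name in ("ASP", "GLU") and neutral:
        name += "H"                        # protonated (neutral) acid
    elif name == "HIS":
        name = {"epsilon": "HISE", "delta": "HISD", "charged": "HISP"}[his_state]
    # Terminal residues receive an extra letter. Appending it after the
    # protonation-state name is an assumption of this sketch.
    if terminus == "N":
        name += "N"
    elif terminus == "C":
        name += "C"
    return name
```

For example, `advanced_residue_name("ASP", neutral=True)` gives `ASPH`, while a cysteine flagged with `in_disulfide=True` keeps the name `CYS`.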
In some cases, the atom ordering also varies from one PDB or mol2 file to another.
To address this issue, ProCESS sorts each of the protein files by atom and residue names
and checks for sequence identity; if discrepancies in the primary sequence are found,
ProCESS exits and manual editing of the protein files is required.
As soon as the protein files are checked and made consistent, ProCESS truncates the
protein input structures, using a user-defined cutoff distance around a user-defined
binding site (list of residues, see Supporting Information). The truncated proteins are
represented as united atoms, and AMBER atom types and partial charges are assigned to
each of them. The use of truncated united-atom protein structures significantly reduces
the time required by FITTED to set the lists of non-bond interactions at the outset of each
minimization stage. While potential energies computed using a force field are used
throughout the docking process, scoring of the final poses is performed by the previously
developed RankScore21 scoring function. This function accounts implicitly for the entropy
cost of freezing flexible residues upon binding by scaling down the
interactions with flexible residues, as discussed below. To account for these scaling
factors, ProCESS assigns new “scaled” atom types and charges derived from AMBER
atom types. Once processed, the protein structures are written in mol2 format. Like
FITTED, ProCESS requires a keyword file which contains all the parameters it needs
to run. A typical keyword file is given as Supporting Information.
ProCESS, a Tool to Create Binding Site Cavity Files. As discussed below, FITTED
disregards poses that are not within the binding site cavity, which is approximated by a
series of overlapping spheres. Initial docking attempts using grids as cavity
representations required an unsatisfactory amount of CPU time; we found that moving to
spheres reduced the CPU time by a factor of 3. ProCESS creates the
overlapping spheres by first generating an evenly spaced grid and keeping the points that
do not clash with the protein (Figure 2.1a). If more than one protein is used, the point
must clash with all proteins to be removed. Thus, alternative accessible spaces are
maintained.
Next the points are converted into spheres (Figure 2.1b). For this purpose, each
sphere is inflated until it makes contact with a protein atom center (slightly overlapping
with the protein surface) or one of the grid edges. The resulting sphere size and center are
archived. Smaller spheres contained within larger ones are then removed, one sphere at a
time until all spheres have been examined. This step significantly reduces the total
number of spheres while still covering the entire cavity space. If there is a water
molecule present, ProCESS ignores it. FITTED will later determine whether or not the
water should be considered. The grid file is written in mol2 format, with the last column
holding the radii of the spheres rather than partial charges.
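The cavity construction just described (grid points, clash test against all input structures, inflation, removal of enclosed spheres) can be sketched as follows. This is a minimal illustration under stated assumptions, not the ProCESS code: the function names, the `spacing` and `clash` parameters, and the choice of inflating against the most open input structure are all hypothetical.

```python
import math
from itertools import product


def prune_contained(spheres):
    """Drop every sphere that lies entirely inside a larger kept sphere."""
    kept = []
    for centre, radius in sorted(spheres, key=lambda s: -s[1]):
        if not any(math.dist(centre, c) + radius <= r for c, r in kept):
            kept.append((centre, radius))
    return kept


def build_cavity_spheres(proteins, lo, hi, spacing=1.0, clash=2.0):
    """proteins: one list of (x, y, z) atom centres per input structure.
    lo, hi: opposite corners of the grid box (angstroms)."""
    axes = [
        [lo[k] + i * spacing
         for i in range(int(round((hi[k] - lo[k]) / spacing)) + 1)]
        for k in range(3)
    ]
    spheres = []
    for point in product(*axes):
        nearest = [min(math.dist(point, a) for a in atoms) for atoms in proteins]
        # A grid point is removed only if it clashes with ALL structures.
        if all(d < clash for d in nearest):
            continue
        # Inflate until touching an atom centre or a grid edge. Using the
        # most open structure (max) is an assumption of this sketch.
        to_edge = min(min(point[k] - lo[k], hi[k] - point[k]) for k in range(3))
        radius = min(max(nearest), to_edge)
        if radius > 0:
            spheres.append((point, radius))
    return prune_contained(spheres)
```

Sorting by decreasing radius before pruning means each sphere only has to be compared with the larger spheres already kept, which is one simple way of examining every sphere once.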
Figure 2.1 - The binding site of 1d8m mapped as (a) a set of points and (b) a set of spheres;
the spheres are colored by size range (from 1.5 Å to over 4.0 Å).
SMART, a Tool for Ligand Preparation. We also developed the SMART module (Small
Molecule Atom-typing and Rotatable Torsion assignment), which automatically identifies
and labels the rotatable bonds of the ligands and assigns AMBER atom types. As the rings
are not conformationally sampled in the current version of FITTED, SMART also identifies
the rings27 in the ligand and labels all the corresponding bonds as non-rotatable. Although
no conformational sampling methods are applied to the rings, energy minimization is
applied to the cyclic systems, therefore locally optimizing the ring structures. The partial
charge assignment (Gasteiger-Hückel charges are recommended) is still carried out using
existing software such as Sybyl.28 SMART also creates reference structures of the ligands
used by FITTED to compute accurate RMSDs. For instance, rotamers of symmetric
groups such as phenyl rings and t-butyl groups are considered, creating a number of new
structures that will be used as references in the atomic RMSD calculation.
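The role of the additional reference structures can be illustrated with a small sketch. It assumes that the reported RMSD is the lowest value obtained over all symmetry-equivalent references; the text does not state how the references are combined, so this rule and all function names are assumptions.

```python
import math


def rmsd(pose, reference):
    """Atomic RMSD between two equally ordered lists of (x, y, z) points."""
    total = sum(math.dist(p, q) ** 2 for p, q in zip(pose, reference))
    return math.sqrt(total / len(pose))


def swapped_reference(reference, i, j):
    """Hypothetical helper: exchange two symmetry-equivalent atoms."""
    ref = list(reference)
    ref[i], ref[j] = ref[j], ref[i]
    return ref


def symmetry_aware_rmsd(pose, references):
    """Lowest RMSD over all reference structures (assumed combination rule)."""
    return min(rmsd(pose, ref) for ref in references)
```

With a single reference, a pose whose two equivalent atoms are exchanged would be reported as a poor match although it is chemically identical; adding the swapped reference removes that artifact.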
FITTED 1.0, an Algorithm to Account for Protein and Ligand Flexibility. The initial proof
of concept showed that the inclusion of flexibility greatly increased the accuracy of a
docking run. To reduce the time per run, two aspects of the program can be
addressed: the removal of repetitious events (handled by SMART and ProCESS) and
an increase in the quality of the individuals. The latter aspect is addressed by FITTED itself
and is discussed in the following sections.
Genetic Algorithm Implementation. Genetic algorithms (GA) have been used as
optimization tools in many fields for some time. In the present work, the GA is used to
optimize the binding mode. The chromosomes (Figure 2.2) describe the three-
dimensional structure of the protein/ligand/water complex and the fitness function is the
potential energy of this structure. In the illustrated case, 5 input files are used for the
protein/water structures, 4 side-chains are deemed flexible, and one bridging water
molecule is considered. The first section of the chromosome codes the ligand binding
mode and includes all the internal coordinates necessary to define a given conformation
in a given location in space and a given orientation (often referred to as a pose). The
ligand poses are therefore explicitly described in the chromosomes and FITTED can apply
a conjugate gradient algorithm to finely tune this pose.
The portion of the chromosome defining the solvated protein structure is divided
into 3 sections: i. the rigid portion of the protein, including the entire backbone, ii. the
side chain conformations of the flexible binding site residues and iii. the water molecule
locations. Each side chain and rigid protein portion adopts 5 different conformations in
the 5 protein input structures, referred to as call numbers with a value of 1 to 5. Similarly,
each water molecule adopts 5 different locations, again referred to by a call number. Thus,
libraries of side chain conformations (1 library per side chain, 5 side chain conformations
per library), a library of 5 structures for the rigid protein portion and a library for each
water molecule (5 sets of Cartesian coordinates per water molecule) are built at the outset
of the docking run from the 5 input structures. FITTED next constructs the protein/water
complex from the set of digits and the libraries and adds the ligand pose to form the
ternary complex. A force field energy is associated with this pose and is recalculated
whenever the pose is modified.
Figure 2.2 - Chromosome describing a protein/water/ligand complex.
Genes shown in the figure, grouped by chromosome section:
Ligand section
  Ligand: bond distances and angles values
  Ligand: position in space (x, y, z)
  Ligand: orientation in space (xy, xz, yz)
  Ligand: dihedral angles 1 to 4 (numbers from -π to π)
Protein backbone and residues not in the binding site
  Backbone and rigid residue structure (digit from 1 to 5)
Side chains of the flexible residues in the binding site
  Side chains 1 to 4: conformation (digit from 1 to 5)
Water molecules
  Water molecule: position in space (digit from 1 to 5)
In the illustrated case, the protein has 4 flexible residues and is represented by 5
input structures, a single water molecule is included and the ligand has 4 rotatable bonds.
A horizontal bar represents a gene.
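As an illustration of this encoding, the chromosome can be written as a small data structure. This is a sketch only: the class and function names, the use of radians, and the `box` parameter are assumptions, and the fixed ligand bond distances and angles are omitted for brevity.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Chromosome:
    backbone: int        # call number of the rigid protein portion
    side_chains: list    # one call number per flexible side chain
    waters: list         # one call number per water molecule
    position: tuple      # ligand x, y, z
    orientation: tuple   # three rotation angles, in radians
    dihedrals: list      # one torsion per rotatable bond, in [-pi, pi]


def initial_chromosome(structure, n_side_chains, n_waters, n_dihedrals, box, rng):
    """Assign one whole input structure, then draw a random ligand pose."""
    def angle():
        return rng.uniform(-math.pi, math.pi)
    return Chromosome(
        backbone=structure,
        side_chains=[structure] * n_side_chains,
        waters=[structure] * n_waters,
        position=tuple(rng.uniform(-box, box) for _ in range(3)),
        orientation=tuple(angle() for _ in range(3)),
        dihedrals=[angle() for _ in range(n_dihedrals)],
    )


def build_protein(chrom, backbone_lib, side_chain_libs, water_libs):
    """Look up the coordinates selected by each call number (1-based)."""
    return {
        "backbone": backbone_lib[chrom.backbone - 1],
        "side_chains": [lib[c - 1] for lib, c in zip(side_chain_libs, chrom.side_chains)],
        "waters": [lib[c - 1] for lib, c in zip(water_libs, chrom.waters)],
    }
```

Because the protein genes are only indices into libraries built once from the input files, changing a side chain conformation amounts to changing one digit, and the coordinates are looked up when the complex is rebuilt.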
Intelligent Design of the Initial Population. With the libraries completed, FITTED
proceeds to creating individuals. Each individual is first assigned a protein structure. This
is followed by a random generation of the ligand pose.
We first thought that the required CPU time could be significantly reduced by
increasing the “quality” of the initial population and focused on its generation. It is
known that a population including good guesses often converges more rapidly and
is less likely to become trapped in a local minimum.29 In an early version, an
energy threshold was used to select reasonable individuals. However, the lengthy
minimization routine was used to optimize all the individuals, including the many that
were then discarded using this threshold. To reduce the number of unnecessary minimization
steps, we envisioned the use of additional genetic operators in the form of filters (Figure
2.3).
Figure 2.3 - Generation of the initial population using a series of filters.
[Flowchart. Input files: 5 protein structure files, ligand file, cavity file, constraint file,
keyword file. Boxes: pick 1 protein structure; randomize torsions, orientation, translation;
constraint fulfilled?; construct complex; ligand in binding site?; pot. energy acceptable?;
conjugate-gradient minimization; pot. energy acceptable?; save complex; enough
individuals?; initial population ready.]
A first filter was implemented that discards the poses with strong steric clashes and
poses outside the protein cavity approximated by a set of spheres. If any atom of the
ligand is not located within a sphere, the pose is discarded prior to potential energy
evaluation. A second test is made and only the protein/water/ligand complexes with
energy below a user-defined threshold are further optimized by energy minimization.
To further improve the method, we added the possibility of exploiting experimental
information by including constraints to force key interactions. For instance, ligand poses
that do not interact with a given protein residue or atom (e.g., a metal) will be
discarded. Like the grid file, the constraint file is in mol2 format. Constraints are also
defined as spheres, and columns are added to the mol2 file to define the size of the
constraint and its type (e.g., charge below -0.3).
Thus, in the current version, the first input protein/water file is selected and ligand
poses are randomly generated until one pose fulfills all the criteria (located within the
cavity, fulfilling the constraints) (Figure 2.3). FITTED next constructs the complex (the
corresponding chromosome) and further optimizes it through conjugate gradient energy
minimization. If the optimized complex passes the last test (final energy compared to a
second user-defined threshold), it is archived. If the ligand pose does not pass, it is
discarded and another one is generated. This procedure is reiterated with the other
protein/water structures which are evenly represented in the initial population.
As expected, this implementation significantly reduced the time needed to produce
a high quality initial population while including the rotational and translational degrees of
freedom, which were not present in the previous method.
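The filtering sequence described in this section can be sketched as a loop in which the cheap geometric tests run before any energy evaluation, and minimization is reserved for poses that pass them. The sketch below is a hedged illustration: the function names, the callables passed in, and the `max_tries` safeguard are hypothetical and do not reflect FITTED keywords.

```python
import math
import random


def in_cavity(ligand_atoms, spheres):
    """True only if every ligand atom lies inside at least one cavity sphere."""
    return all(any(math.dist(a, c) <= r for c, r in spheres) for a in ligand_atoms)


def initial_population(n_structures, per_structure, random_pose, passes_filters,
                       energy, minimize, pre_threshold, post_threshold,
                       max_tries=100000):
    """Build a population in which every input structure is evenly represented."""
    population = []
    for structure in range(1, n_structures + 1):
        accepted = tries = 0
        while accepted < per_structure and tries < max_tries:
            tries += 1
            pose = random_pose()
            if not passes_filters(pose):            # cavity and constraint filters
                continue
            if energy(structure, pose) > pre_threshold:
                continue                            # too poor to be worth minimizing
            pose = minimize(structure, pose)        # the expensive step
            if energy(structure, pose) > post_threshold:
                continue
            population.append((structure, pose))
            accepted += 1
    return population
```

In a real setting `passes_filters` would call something like `in_cavity` on the ligand coordinates; here any predicate can be supplied, which keeps the example independent of a force field.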
Evolution of Flexible Ligands. The theme of intelligent design is further carried into
addressing the evolution of the ligands. The first issue addressed is the refinement of the
orientation through mutation. We observed that increasing the probability of
mutating solely the rotation of the ligand (orientation in space), which requires more
extensive sampling, increased the speed of convergence. Likewise, decreasing the range
of the possible rotation mutation from 0-360 to ±30 degrees increased
the validity of the individuals produced through evolution.
Secondly, the probability that the best individual is further optimized without being
coupled is small. To increase this probability, we added a probability of learning.
Before the evolution of each generation, a small percentage of the population is further
optimized by energy minimization. This approach brings the Lamarckian aspect of this
GA one step further.
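Both ideas in this section, the restricted rotation mutation and the learning step, are simple to express in code. The following is a sketch under assumptions: the orientation is taken to be three angles in degrees, and the function names and the `rate` parameter are hypothetical.

```python
import random


def mutate_orientation(orientation, rng, max_step_deg=30.0):
    """Perturb each rotation angle by at most +/- max_step_deg, wrapped to [0, 360)."""
    return tuple((a + rng.uniform(-max_step_deg, max_step_deg)) % 360.0
                 for a in orientation)


def apply_learning(population, minimize, rate, rng):
    """Minimize a randomly chosen fraction of the population before evolution."""
    return [minimize(ind) if rng.random() < rate else ind for ind in population]
```

A small step keeps a mutated pose close to one that already passed the filters, which is consistent with the reported increase in the validity of the individuals.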
Evolution of Flexible Proteins – New Genetic Operators. The produced “high quality”
population will then evolve using a series of genetic operators including mutations and
cross-over. These operators blend the genotypes of the various individuals by
swapping portions of the chromosomes (cross-over) or randomly modifying genes
(mutation). Parent complex structures are randomly selected from the mating pool and
coupled, and children are produced by cross-over and mutation operators in a steady-state
fashion. The offspring must first pass the genotype selection described above (cavity
and constraint filters) to be selected. To our knowledge, this early crude selection, first
developed by Haupt and co-workers,29 is a new concept in the GA field applied to
docking methods. A user-defined proportion of the selected offspring is further optimized
by energy minimization. This energy minimization stage represents the Lamarckian
aspect of the genetic algorithm. The children learn/evolve during their life (are
energy-minimized) and can transmit the acquired skills to the next generation. In practice, this
optimization had to be applied to only a small fraction of the population: if all the structures
are fully optimized, the conformational search usually converges to high-energy local
minima. The two fittest individuals among the parents and their children survive. This
process of natural selection is based on the potential energy computed with the AMBER
force field.30
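One steady-state step of this scheme can be sketched as follows. This is a minimal illustration, assuming that the two survivors take the place of the two parents in the population; the function name, the callables and the `learn_rate` parameter are hypothetical.

```python
import random


def steady_state_step(population, breed, passes_filters, minimize, energy,
                      learn_rate, rng):
    """Couple two random parents and keep the two fittest of parents + children."""
    i, j = rng.sample(range(len(population)), 2)
    parents = [population[i], population[j]]
    # Early crude selection: children failing the cheap filters are dropped.
    children = [c for c in breed(parents[0], parents[1]) if passes_filters(c)]
    # Lamarckian learning applied to a fraction of the surviving children.
    children = [minimize(c) if rng.random() < learn_rate else c for c in children]
    best = sorted(parents + children, key=energy)[:2]
    population[i], population[j] = best
    return population
```

Since the parents themselves compete for survival, the energies of the two retained individuals can never be worse than those of the parents they replace.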
In the current version, the input side-chain and backbone conformations and the
locations of the water molecules are archived in libraries, and each protein structure is
described as a composite of these allowed conformations (a chromosome).21 Creating a
ternary complex then requires the reconstruction of the protein structure and the addition
of the water molecules and ligand. The sections of the chromosome are deliberately kept
separate: FITTED applies the genetic operators to each section
independently. For instance, a single point cross-over operation can be applied to the
protein side chain section, another one to the ligand internal coordinate section of the
chromosome and a last one to the water molecules (if there is more than one). As the
position of the cross-over is randomly selected it has a higher probability to be applied
between the first and last genes describing the ligand than before the first or after the last
gene. As a result, the orientation in space of the ligand (first gene of the ligand) would be
somewhat linked to the backbone conformation (next gene in the chromosome). To
address this artifact, when a cross-over operation is performed, one of the following two
options is randomly selected: the top portion of the section is kept and the bottom portion
is exchanged or the top portion is exchanged and the bottom is kept. The same two
options apply to the side chain section of the chromosome and to the water molecules
(Figure 2.4). Cross-over operations on sections containing a single gene (i.e., a single
water molecule) are restricted to complete exchange or no operation. The probability of
performing a cross-over operation on each section is defined by the user with the
appropriate keyword. Figure 2.4 illustrates the four possible pairs of children produced when
two cross-over operators are applied. In practice, four cross-over operations (one for the
ligand, one for the binding site residues, one for the rest of the protein and one for the water
molecules) can be used, producing one of 16 possibilities.
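The section-wise cross-over described above can be sketched as follows. This is a minimal illustration, assuming that a chromosome is stored as a dictionary of gene lists; the section names and the `probabilities` argument are hypothetical and stand in for the user keywords mentioned in the text.

```python
import random

def cross_section(a, b, rng):
    """One-point cross-over applied to a single chromosome section."""
    if len(a) == 1:
        # Single-gene section: complete exchange or no operation.
        return (b[:], a[:]) if rng.random() < 0.5 else (a[:], b[:])
    cut = rng.randrange(1, len(a))
    if rng.random() < 0.5:
        # Top portion kept, bottom portion exchanged.
        return a[:cut] + b[cut:], b[:cut] + a[cut:]
    # Top portion exchanged, bottom portion kept.
    return b[:cut] + a[cut:], a[:cut] + b[cut:]

def cross_over(parent_a, parent_b, probabilities, rng):
    """Apply an independent cross-over to each section of the chromosome."""
    child_a, child_b = {}, {}
    for name in parent_a:
        if rng.random() < probabilities.get(name, 0.0):
            child_a[name], child_b[name] = cross_section(
                parent_a[name], parent_b[name], rng)
        else:
            child_a[name], child_b[name] = parent_a[name][:], parent_b[name][:]
    return child_a, child_b

# Hypothetical chromosomes with the four sections named in the text.
parent_a = {"ligand": [1, 2, 3, 4], "binding_site": [5, 6, 7],
            "rest_of_protein": [8, 9], "water": [10]}
parent_b = {"ligand": [11, 12, 13, 14], "binding_site": [15, 16, 17],
            "rest_of_protein": [18, 19], "water": [20]}
rates = {name: 1.0 for name in parent_a}
child_a, child_b = cross_over(parent_a, parent_b, rates, random.Random(1))
```

Choosing at random which portion is exchanged in each section is what yields 2 x 2 x 2 x 2 = 16 possible pairs of children when all four operators are applied.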
Mutation operations can also alter each gene of the chromosome except the ligand
bond distances and angles. Mutations of the protein backbone, side chain and water
genes are limited to the substitution of the digit by another digit in the range defined by the
number of protein input files. Mutations therefore never produce protein backbone
conformations, side chain conformations or water molecule locations that are not in the
initial libraries. As a result, FITTED will not propose protein/water structures that are not
composites of the input structures. When producing a composite protein conformation,
FITTED also assesses the integrity of the structure and rejects any generated protein
structure that has intramolecular steric clashes.
[Figure 2.4 panels: Parents; Children pair 1; Children pair 2; Children pair 3; Children pair 4]
Figure 2.4 - The four possible pairs of children generated after application of two one-point
cross-over operations. A horizontal bar represents a gene.
Docking to Rigid or Flexible Proteins with FITTED. Three options are available. First,
docking to a single conformation can be performed, which allows for self- and
cross-docking studies. Second, docking to a conformational ensemble can be carried out.
With this option (referred to as “semi-flexible”), the input protein structures remain
unchanged over the evolution but can be exchanged between individuals, the cross-over
and mutation operators acting on entire protein structures only. Third, one can use the fully
flexible protein structure. With this last option, the cross-over and mutation operators are
applied separately to the backbone, side chains and water molecules.
Displaceable Water Molecules - Implementation. To date, very few methods have been
proposed to consider dynamically bound water molecules.31 We recently reported a new
concept to describe displaceable water molecules.19 As discussed in this previous report,
the non-bonded energy function can include an energy well at an interaction distance to
the water molecule, but no van der Waals wall in order to simulate the water
displacement. The proof-of-concept of this approach was demonstrated by using
combinations of AutoDock grids modeling the “dry” and solvated RNA oligomers. The
docking of aminoglycosides to these combined grids was found to be more accurate than
the docking to solvated or dry RNA oligomers.19 As FITTED does not make use of grids,
we had to develop and implement an additional potential energy term to the AMBER
function. To remove the Lennard-Jones wall for the water molecule at short distances, we
introduced a switching function (SF) in the form of a scaling factor applied to the
intermolecular energies involving a water molecule.
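$$
\mathrm{sw} =
\begin{cases}
0.0 & \text{if } d \le d_{cutwat} \\[4pt]
\dfrac{(d - d_{cutwat})^{2}\,(3\,d_{switchwat} - d_{cutwat} - 2\,d)}{(d_{switchwat} - d_{cutwat})^{3}} & \text{if } d_{cutwat} < d < d_{switchwat} \\[4pt]
1.0 & \text{if } d \ge d_{switchwat}
\end{cases}
\qquad (2.1)
$$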
where sw is the scaling factor, d is the shortest distance between any atom of the ligand
and any atom of the water molecule, d_cutwat is the cutoff distance and d_switchwat is the
switching distance. Such functions are traditionally used to cut off long-range non-bonded
interactions. In this specific case, it is used to cut off short-range interactions.
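The scaling factor of Equation 2.1 can be sketched in a few lines. The cubic form used between the two distances is an assumption (the usual smooth switching polynomial that goes from 0 at the cutoff distance to 1 at the switching distance), and the default distances are those quoted for Figure 2.5; the function names are hypothetical.

```python
def water_scaling_factor(d, d_cutwat=1.20, d_switchwat=1.75):
    """Scaling factor sw for ligand-water energies (distances in Angstrom).

    Returns 0.0 when the water is displaced (d <= d_cutwat), 1.0 when it is
    fully present (d >= d_switchwat), and a smooth cubic ramp in between.
    """
    if d <= d_cutwat:
        return 0.0
    if d >= d_switchwat:
        return 1.0
    span = d_switchwat - d_cutwat
    return (d - d_cutwat) ** 2 * (3.0 * d_switchwat - d_cutwat - 2.0 * d) / span ** 3

def scaled_water_energy(raw_energy, d, d_cutwat=1.20, d_switchwat=1.75):
    """Ligand-water interaction energy after applying the scaling factor."""
    return water_scaling_factor(d, d_cutwat, d_switchwat) * raw_energy
```

Here d is the shortest ligand-water interatomic distance, so a single factor is applied to the whole ligand-water interaction, in line with the molecule-based character of the function.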
[Figure 2.5: plot of interaction energy (kcal/mol) versus distance (Å)]
Figure 2.5 - Interaction energy between a methanol molecule and an explicit water
molecule (red) or a displaceable water molecule (blue). Cutoff distance = 1.20 Å,
switching distance = 1.75 Å.
Figure 2.5 represents the energy curve obtained with this new function and
illustrates the interaction between methanol and a water molecule. Although the standard
SFs are atom-based or group-based, this specific SF has to be molecule-based. To model
a realistic situation, the water molecule should be either included in the binding site (SF = 1 for
the entire ligand) or displaced (SF = 0 for the entire ligand). Thus, the situations where
this function takes values between 0 and 1 are artifacts. The positive energy observed in Figure 2.5
between 1.20 and 1.75 Å is a consequence of this function. One way to address this issue
would be to turn off the energy function as soon as it is positive. However, a continuous
function between 0 and 1 was needed by the energy-minimization routine. In order to
define the optimal cutoff and switching distances, the intermolecular interaction within
complexes such as methanol-water or methyl acetamide-water was investigated. In all
cases, the interaction energy between the molecules was positive at distances below 1.75
Å, which was therefore selected as the optimal switching distance. As a result, this SF applies only when the
interaction energy with the water molecule is repulsive. The scaled interaction energy reached a maximum of
about 15 kcal/mol when a cutoff distance of 1.20 Å was used (see Figure 2.5).
Applied to the docking of molecules, this potential energy function penalizes the
poses that neither interact favorably with the water molecule (distance < switching distance =
1.75 Å) nor displace it completely (distance > cutoff distance = 1.20 Å) and will
consequently help the ligand to either interact with or fully displace the water molecules.
Displaceable Water Molecules - Optimizing Water Evolution. Bridging water molecules
are often observed in crystal structures. This information is exploited by FITTED. In the
present work, critical water molecules are either maintained when present in the crystal
structures or added by analogy to other structures and their orientation optimized by
energy-minimization when missing. Initial attempts showed that the prediction of the
occurrence of water molecules in the complexes was not accurate. In practice, we
observed that the ligand pose was optimized first (with a greater decrease of the total
potential energy). This early optimization was followed by the refinement of the protein
structure and finally of the water molecules. However, by that stage most of the possible water
locations (one per protein structure) had been eliminated over the generations.
We then found that higher mutation rates increased the accuracy by increasing the
sampling of the water molecules. In order to address this issue, we implemented a
ramping mutation rate for the water molecules. This ramping is achieved by using a
quadratic function.
$$
p_{mutwat} = p_{max} \left( \frac{n^{\text{th}}\ \text{generation}}{\text{maximum number of generations}} \right)^{4} \qquad (2.2)
$$
Thus, very low mutation rates are applied at the early stages of the evolution while
larger rates are used at the late stages. One drawback of the use of the AMBER force
field is the lack of directionality of the hydrogen bond term. Evaluation of the free energy
of binding of the water molecules was also a concern. In the current version of FITTED,
water molecules are considered as part of the protein. To account at least partly for the
entropy cost associated with the capture of a water molecule, a penalty is added to the
final score whenever a water molecule is maintained. This number is arbitrary as this
penalty is system-dependent and should also include the enthalpic contribution to the
binding of the water. Work is in progress in our laboratory to include directional
hydrogen bonds in the next version of FITTED and to improve the scoring of the free
energy of binding of the water molecules to protein/ligand complexes.
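The ramping water mutation rate of Equation 2.2 can be sketched as below. This is a hedged illustration: the exponent of 4 is read from the equation as recovered here and is exposed as a parameter (a value of 2 would give a strictly quadratic ramp), and the function name and the value of `p_max` used in the example are hypothetical.

```python
def water_mutation_rate(generation, max_generations, p_max, exponent=4):
    """Mutation probability for the water genes at a given generation.

    The rate grows from 0 at the start of the evolution to p_max at the
    last generation, following a power of the elapsed fraction.
    """
    if max_generations <= 0:
        raise ValueError("max_generations must be positive")
    fraction = min(max(generation / max_generations, 0.0), 1.0)
    return p_max * fraction ** exponent

# Example: the rate stays low for most of the run and rises sharply near the end.
schedule = [water_mutation_rate(g, 100, p_max=0.2) for g in (0, 25, 50, 75, 100)]
```

With this shape the water genes are left almost untouched while the ligand pose and protein structure are being optimized, and are sampled more aggressively once these have settled.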
Scoring Function. The AMBER force field was implemented in FITTED and used during
the actual docking with a higher weight for the intermolecular interactions than for the
internal energy. Very few scoring functions include a term accounting for the protein
entropy. In the present case, using a force field does not permit the evaluation of the
entropy contribution to the free energy of binding. Understanding that the mobility and
entropy of flexible residues are modulated by the ligands, we have proposed to estimate
the free energy of binding to the flexible residues. First, the stronger the interactions are,
the tighter a ligand is bound. Then, the tighter a ligand is bound, the more frozen the
surrounding side chains are. We proposed to account for this entropy/enthalpy
compensation by reducing the interaction with flexible residues. In practice, the
interaction with flexible side chains was scaled down by the use of a new set of atom
types and partial charges assigned by ProCESS. The final poses were then scored using
our scoring function RankScore also implemented in FITTED.
RESULTS AND DISCUSSION
Selection of the Testing Set. As FITTED incorporates protein flexibility and displaceable
water molecules, a selection of protein/inhibitor complexes had to be made to evaluate
these aspects of FITTED. The selected inhibitors are listed as supporting information.
As HIV-1 protease (HIVP)/inhibitor complexes often involve a bridging water
molecule, this enzyme was selected as a first test case. Although HIVP is a flexible
protein, the inhibitors usually bind to the closed form and HIVP is not considered as a
highly flexible protein in docking studies. However, slight adjustments were observed
and RMSD’s of 0.5 to 1.4 Å between protein structure binding sites were computed.
HIVP can exhibit two different protonation states, with either one or both catalytic aspartic
acid side chains being protonated.32 In most cases, inhibitors binding to the catalytic
dyad via a diol moiety favor the diprotonation of the catalytic Asp, while monoalcohols
or other functional groups favor the monoprotonated state. We therefore decided to
prepare two sets of protein files. The 1b6l, 1eby, 1hpo, 1hpv and 1pro protein structures were
monoprotonated as discussed in the experimental section, while 1ajv, 1ajx, 1hvr, 1hwr
and 1qbs were diprotonated as experimentally observed for the binding of cyclic diols to
the diaspartate catalytic site.32 Only the crystal structures 1b6l, 1eby and 1hpv featured a
water molecule. This same water molecule was therefore added to the other 7 protein
structures to allow FITTED to select whether or not this water is needed.
Similarly, thymidine kinase (TK) is a flexible protein and inhibitors often bind
through interactions with the protein that are relayed by many water molecules.
Interestingly, a first water (water molecules 1 and 2 in Figure 2.6) can be located at two
different positions depending on the Gln125 side chain conformation. The combined
water displacement/Gln side chain flip will be investigated in great detail. As illustrated
in Figure 2.6, either the Gln125 carbonyl oxygen (Figure 2.6a) or the amide hydrogens
(Figure 2.6b) point towards the Arg176 side chain. The first Gln125 side chain
conformation shown in Figure 2.6a is observed in 1e2k, 1e2p, 1ki4, 1ki8 and 1of1 while
the second conformation is observed in 1ki3, 1ki7, 2ki5 and 1qhi (PDB codes). Similarly,
Water 4 can be displaced by Gln221. These two enzymes (HIVP and TK) together with
oligopeptide binding protein A (OppA) were also selected as test cases by the GOLD
developers in order to evaluate the reliability of their method accounting for bridging
water molecules.31 However, Verdonk et al. considered three water molecules interacting
with the nucleotide base of the TK inhibitors while we considered another three
interacting with the ribose part of these inhibitors, for a total of six water molecules. As
illustrated in Figure 2.3, these six waters participate in multiple hydrogen bonds with both
the ligands and the proteins.
Figure 2.6 - Bridging water molecules and flexible binding site residues in TK / inhibitor
complexes. (a) 1e2k and (b) 1ki3. Co-crystallized sulfate is shown in orange.
Factor Xa (FXa) and its homolog trypsin were also included in the validation set.
FXa/inhibitor complexes show from none to two water molecules involved in the
ligand binding while a single bridging water molecule is observed in the selected
trypsin/inhibitor complexes (Figure 2.7). The first one interacts with both the inhibitor
cationic moieties and the protein backbone (Ile227 in FXa and Val205 in trypsin) while
the second one bridges the inhibitor cation with the key Asp189 side chain of factor Xa.
The specific shape of these two deep binding sites, each featuring a narrow pocket, made the
conformational sampling problematic. Thus, a larger population size (200) was used in
order to reach convergence.
We completed this validation study with a small set of metalloprotease (MMP-3,
stromelysin-1) inhibitors for a total of 33 complexes. Most of the known MMP inhibitors
chelate the catalytic zinc cation and are of interest to evaluate the ability of FITTED to
reproduce the metal ligation. As the metal chelation is a short-range interaction, we
implemented a specific term using a potential similar to the LJ12-10 used for hydrogen
bonds.
All these protein structures were processed using ProCESS (a typical keyword file
is given as supporting information) prior to their use with FITTED and the ligands were
prepared with SMART.
Figure 2.7 - Bridging water molecules and flexible binding site residues in FXa /
inhibitors complexes. (a) 1ezq and (b) 1f0r.
Docking using FITTED 1.0 – General Consideration. All the compounds were first
self-docked to their corresponding protein structure in the presence or absence of water.
Table 2.1 - Table 2.3 summarize the data obtained for these docking studies. This first set of
docking runs was carried out to evaluate the impact of the new potential energy term for
the displaceable water molecules. A cross-docking study was next carried out to evaluate
the impact of the protein structure on the docking accuracy (Table 2.4 - Table 2.6). These
same ligands were next docked to the “semi-flexible” proteins and to the fully flexible
proteins in order to evaluate the ability of FITTED to predict the protein structure (Table
2.7 - Table 2.9). For the statistical analysis, we considered that the ligand pose was
accurately predicted when the RMSD relative to the crystal structure was below 2.0 Å,
and that the protein structure prediction was correct when the RMSD was below the average
RMSD between the series of protein structures used as input (i.e., when the prediction is
better than a random selection of protein structures). Finally, we considered the water
molecules to be accurately predicted when their occurrence was right. A set of 10 runs was
carried out for each inhibitor in order to demonstrate the convergence of the protocol. In
most of the cases, at least 5 out of the 10 runs led to similar poses (difference between
computed RMSD’s below 0.5 Å). Although 100 individuals were enough for the docking
of thymidine kinase inhibitors, a larger initial population (200) was required for the other
4 proteins in order to reach the convergence criterion.
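The success and convergence criteria stated above can be written down as a short sketch. The thresholds (2.0 Å, 0.5 Å, 5 of 10 runs) come from the text; the function names are hypothetical, the RMSD is computed without any superposition, and the convergence test simply looks for enough runs whose RMSD values lie within the tolerance of each other.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equally long")
    total = sum((xa - xb) ** 2
                for pa, pb in zip(coords_a, coords_b)
                for xa, xb in zip(pa, pb))
    return math.sqrt(total / len(coords_a))

def pose_is_correct(ligand_rmsd, threshold=2.0):
    """Ligand pose criterion: RMSD to the crystal structure below 2.0 A."""
    return ligand_rmsd < threshold

def protein_is_correct(protein_rmsd, average_pairwise_rmsd):
    """Protein criterion: better than the average RMSD between input structures."""
    return protein_rmsd < average_pairwise_rmsd

def waters_are_correct(predicted, observed):
    """Water criterion: predicted occurrence (1/0 flags) matches the crystal."""
    return list(predicted) == list(observed)

def is_converged(run_rmsds, tolerance=0.5, min_runs=5):
    """True if at least min_runs runs have RMSD values within the tolerance."""
    values = sorted(run_rmsds)
    for i, low in enumerate(values):
        count = sum(1 for v in values[i:] if v - low < tolerance)
        if count >= min_runs:
            return True
    return False
```

For example, a ligand RMSD of 0.94 Å passes the pose criterion while 2.29 Å does not, and a protein RMSD of 0.72 Å is counted as correct against an average pairwise RMSD of 0.77 Å.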
Self-docking Study. Among the 5 proteins investigated, 4 proteins can bind ligands
through one or more bridging water molecules. Using this set we first evaluated the
impact of the potential energy term developed for the water molecules described above
on the docking accuracy. In a first set of experiments, the water molecules were removed
from the protein structures and inhibitors were docked back to their corresponding
protein structure (self-docking). In a second set of experiments, the water molecules were
maintained and the developed potential energy term for the water molecule was used.
Table 2.1 presents the results of the self docking study for HIVP inhibitors. As can
be seen in the third and fifth columns, 7 (without water) or 8 (with water) out of the 10
inhibitors were self-docked within 1.2 Å from the experimentally observed binding
modes. Interestingly, 1hpo was properly docked when the water energy potential was
used, and the water was predicted to be displaced. However, when the water was removed
prior to docking, 1hpo, which is known to displace the water molecule, was misdocked.
We attribute this unexpected result to the energy hill shown in Figure 2.5. As postulated
above, this energy potential tends to favor either the complete displacement of the water
or favorable interactions with the water while disfavoring intermediate docked poses such as
the one proposed when the water is removed prior to docking. A close look at Table 2.1
also revealed that the RMSD’s are higher in most cases (7 of the 10 inhibitors) when the water
is removed prior to docking. We again believe that this can be attributed to the energy
potential used to model the displaceable water molecules.
The TK inhibitors were next docked to the rigid protein in self-docking
experiments (Table 2.2). 6 out of 9 inhibitors were self-docked within 1.1 Å of the
observed binding mode and the 1ki3 inhibitor was docked with a reasonable RMSD. A special
situation arose with the meso compound 1e2p. As shown in Figure 2.8, the RMSD of
2.03 Å computed for the docked pose was attributed to the exchange of the C-1 and C-2 groups.
Considering the two methylenol groups as equivalent reduces the RMSD to 0.77 Å.
Docking of 1e2p was therefore considered as successful.
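Treating two groups as equivalent when computing the RMSD, as done above for the meso compound, can be sketched as follows. This is an illustrative helper rather than the procedure actually used in the thesis: `swappable_pairs` is a hypothetical argument listing pairs of atom-index groups that may be exchanged, and the smallest RMSD over all combinations of exchanges is reported.

```python
import math
from itertools import product

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched lists of (x, y, z)."""
    total = sum((xa - xb) ** 2
                for pa, pb in zip(coords_a, coords_b)
                for xa, xb in zip(pa, pb))
    return math.sqrt(total / len(coords_a))

def symmetry_corrected_rmsd(docked, crystal, swappable_pairs):
    """Smallest RMSD over all exchanges of the listed equivalent groups.

    swappable_pairs holds (group_1, group_2) tuples of atom indices; the
    k-th atom of group_1 is equivalent to the k-th atom of group_2.
    """
    best = rmsd(docked, crystal)
    for choice in product((False, True), repeat=len(swappable_pairs)):
        mapping = list(range(len(docked)))
        for swap, (group_1, group_2) in zip(choice, swappable_pairs):
            if swap:
                for i, j in zip(group_1, group_2):
                    mapping[i], mapping[j] = j, i
        relabelled = [docked[k] for k in mapping]
        best = min(best, rmsd(relabelled, crystal))
    return best

# Toy example: atoms 1 and 2 are equivalent and were exchanged by the docking.
crystal = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
docked = [(0.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
plain = rmsd(docked, crystal)
corrected = symmetry_corrected_rmsd(docked, crystal, [((1,), (2,))])
```

In the toy example the plain RMSD is large only because of the atom labelling, while the corrected value recognizes that the two poses are identical.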
Figure 2.8 - Docked (green) and crystal structure (grey) of 1e2p ligand. Computed
RMSD: 2.03 Å. The pro-chiral carbon is shown as a ball.
Surprisingly, even though many water molecules are involved in the ligand/protein
complexes, the docking to the “dry protein” was as accurate as the docking to the
solvated protein. Only the docking of 2ki5 was slightly affected by the absence of water.
The ten runs carried out with 1ki7 consistently led to the same wrong
pose. In this case, the ribose ring and the base of the nucleotide mimic were
inverted within the binding site.
We next investigated the two sets of charged trypsin and FXa inhibitors. In this
case, the need for water molecules was clear (Table 2.3). Without water, FITTED docked
only 4 out of the 10 inhibitors properly while 8 were accurately docked when the water
molecules were considered. A close look at the failures does not reveal any major
mistake. For instance, the proposed poses for 1f0u were interacting with Asp171 as
experimentally observed. However, the hydrophobic biaryl moiety of 1f0u was not
located in the same pocket with the terminal ammonium group forming a hydrogen bond
with Tyr76 instead of Asn79. As for 1hpo described above, the surprise comes from the
water prediction. In these three cases (1f0u, 1qbo and 1fjs), the occurrence of water
molecules is not accurately predicted but induces a proper docking of the inhibitors.
In contrast, the occurrence of water is accurately predicted when docking trypsin
inhibitors and the removal of the water does not affect the docking.
MMP inhibitors were docked and low accuracy was observed, with only 2 of the 4 inhibitors
being accurately docked. This small set is clearly not large enough to fully assess FITTED
for metalloenzymes.
Overall, these first experiments demonstrated FITTED’s abilities to fully sample the
ligand conformational space and assign better scores to experimentally observed poses.
This first study also validated the water molecule prediction method since the occurrence
of the so-called water 301 in HIVP and waters in TK and trypsin is right in most of the
cases. Unexpectedly, this additional energy term also helps the docking of inhibitors that
displace water.
Cross-docking Study. In a real case study, medicinal chemists wish to design compounds
de novo or to screen libraries of compounds that are not co-crystallized with the enzyme.
Thus, a self-docking study is not representative of the real accuracy of docking programs.
To properly evaluate the predictive power of FITTED, a set of cross-docking experiments
was next carried out.
Each inhibitor was docked to the corresponding set of proteins in order to evaluate
the impact of the protein conformation on the docking accuracy. First, each HIVP
inhibitor was docked to the 5 protein structures and the RMSD’s and scores were
computed (Table 2.4). The data collected for the first five inhibitors revealed that the
docking accuracy is greatly influenced by the protein conformation. The cross docking
experiments carried out with the TK, FXa and MMP inhibitors also showed a significant
decrease of the accuracy relative to the self-docking study (Table 2.5 and Table 2.6). In
contrast, the other five HIVP and trypsin inhibitors were accurately docked in most of the
cases regardless of the protein structure used.
Overall, this cross-docking study confirms the need for a docking method that
models the protein flexibility and/or reveals the sensitivity of FITTED to the protein structure.
Docking to Multiple Conformations. The self-docking and cross-docking data can be
used to simulate the docking to multiple conformations. The five (or nine for TK) final
docked poses for each inhibitor (one per protein structure) were compared and the best
scoring pose was selected (Table 2.4 - Table 2.6). In the case of the monoalcohols 1b6l,
1hpo and 1pro, the self-docking led to the best score. The same observation was made
with 5 out of the 9 investigated TK inhibitors and 4 out of the 5 FXa systems. However,
as four of the five diols (1ajv, 1ajx, 1hvr and 1hwr) were docked with good accuracy
regardless of the protein structure, the prediction of the protein conformation was much
poorer. Interestingly, although 1qbs was not accurately docked back to its corresponding
protein structure, its correct binding mode, associated with a better score, was proposed when
the 1ajx and 1hvr protein structures were employed (Table 2.4).
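Selecting the best scoring pose among independent rigid dockings amounts to a simple minimum over the scores. The sketch below assumes that a more negative score is better, which is how the scores in the tables are read here; the protein labels and numbers in the example are purely illustrative and are not taken from the tables.

```python
def best_scoring_pose(poses):
    """Pick the protein structure whose final docked pose has the best score.

    poses maps a protein structure label to a (score, ligand_rmsd) tuple.
    The lowest (most negative) score is assumed to be the best one.
    """
    if not poses:
        raise ValueError("at least one docked pose is required")
    protein = min(poses, key=lambda name: poses[name][0])
    score, ligand_rmsd = poses[protein]
    return protein, score, ligand_rmsd

# Illustrative values only: one final pose per rigid protein structure.
example = {
    "protein_A": (-8.2, 5.1),
    "protein_B": (-10.3, 1.2),
    "protein_C": (-9.0, 1.5),
}
selected = best_scoring_pose(example)
```

The selected protein structure can then be compared with the one actually co-crystallized with the ligand to judge the protein structure prediction.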
Docking to Semi-flexible and Flexible Proteins. Although the previous study intrinsically
includes protein flexibility, it requires 5 to 9 experiments per compound and therefore
implies an equivalent increase in the required CPU time. The current version of FITTED
can model the protein flexibility in a single experiment. Either of the two options
(docking to semi-flexible and fully flexible proteins) described above can be selected.
Adding the flexibility of the protein increases the complexity of the potential energy surface
to explore, thereby making the conformational sampling more difficult. We therefore
expected to observe a reduced accuracy when moving from rigid to flexible proteins.
In fact, 1hpo was misdocked to the semi-flexible and flexible HIVP (Table 2.7)
while an RMSD below 1.0 Å was recorded in the previous set of experiments. A close
look at the 10 runs revealed that the same misdocked pose was observed in 5 of the 10
proposed poses. It was also found that the experimentally observed pose has a worse
score. These two observations ruled out the hypothesized bad convergence but pointed
out a weakness of the scoring function. In contrast, 1qbs was misdocked to the rigid
protein with the orientation of the seven-membered core reversed. Again, as in the
docking to multiple conformations study, a much better score and the right pose were
obtained when the semi-flexible and flexible proteins were used.
The computed RMSD’s between the crystal structures 1of1 and 1e2k, 1e2p and
1e2k and 1of1 and 1e2p of TK were 0.26, 0.59 and 0.59 Å respectively. The computed
RMSD’s for the other pairs of structures ranged from 0.80 to 1.16 Å with an average of
0.92 Å. When the semi-flexible docking was used, the correct protein structure was
picked among the possible nine in 4 cases and was alternatively picked (5 runs each) with
a similar one when the 1of1 inhibitor was docked (Table 2.8). In contrast, the protein
structure was roughly as good as the average when 1e2p was docked and worse than
average when 1ki8 and 2ki5 were docked. Overall, the protein structure was predicted
with an average RMSD of 0.44 Å for the eight successful dockings (lower than the
average RMSD computed for each pair of structures). As discussed above, the Gln125
side chain of TK can adopt two distinct conformations. FITTED predicts the right
conformation in 7 of the 8 successful docking cases. This is a good indicator of the
predictive power of FITTED when the semi-flexible option is selected. The docking to the
fully flexible protein was less successful with an average RMSD of 0.78 Å but still below
the average RMSD for the 9 protein structures (average RMSD = 0.92 Å). In this last
study the Gln125 side chain was misoriented in 3 out of the 8 successful cases.
Data collected in Table 2.9 show that 1o2j, 1o3g and 1o3i were properly docked
while 1f0r was misdocked to the semi-flexible and flexible proteins. Whether it was
docked to the rigid, semi-flexible or flexible proteins, two alternative poses (RMSD
~1.0 Å or RMSD ~9.5 Å) were proposed for 1ezq. However, the wrong pose was
assigned a better score when the fully flexible protein was used. The correct pose was
observed much less often (20% of the runs) than the wrong one. This result indicates that the
global minimum may be located in a sharp and deep energy well of the potential energy
surface that is difficult to find. In this series again, the prediction of the occurrence of
water molecules is good while the protein structure prediction is more disappointing.
1bwi was consistently misdocked to the rigid, semi-flexible and fully flexible
protein. More interestingly, 1d8m was misdocked to its corresponding protein crystal
structure but properly docked to the 1b8y protein structure and to the semi and fully
flexible protein. A closer look at the 1d8m data showed that this inhibitor was properly
docked when most dissimilar protein structures were used. This may indicate that some
fine adjustments of the protein in the crystal structure of 1d8m would be required. In
order to account for these slight moves, FITTED has selected a more appropriate protein
structure. This may also indicate a poor accuracy of the protein structure prediction for
this enzyme or a poor description of the metal chelation.
This exhaustive docking study demonstrated that the scoring function can not only
assign high scores to the experimentally observed pose but also discriminate between
protein structures. It also shows that in specific cases such as 1qbs or 1d8m, flexibility
improves the accuracy over self-docking.
The scores given to the final docked poses were also compared and showed that
they are all within 1 unit for each compound regardless of the protein flexibility method.
The scoring function is being further investigated and improvements will be reported in
due course.
Table 2.1 - Self-docking – HIV-1 protease inhibitors.

                          Docking to protein(a)    Docking to protein + water molecule(b)
PDB code  Obs. Water(c)   Lig(d)    Score(e)       Lig(d)    Pred. Water(f)   Score(e)
1b6l      1               1.10      -9.8           0.55      1                -11.6
1eby      1               2.68      -9.3           4.55      0                -9.4
1hpo      0               2.29      -9.1           0.94      0                -10.1
1hpv      1               1.02      -8.7           1.19      1                -8.5
1pro      0               0.82      -5.2           0.72      0                -5.7
1ajv      0               0.91(g)   -10.0          0.59      0                -11.4
1ajx      0               0.82      -9.4           0.77      0                -9.9
1hvr      0               0.49      -11.7          0.40      0                -12.2
1hwr      0               0.60      -7.5           0.52      0                -8.1
1qbs      0               5.05      -7.5           5.14      0                -8.2

(a) Water molecules removed prior to docking. (b) Water molecule known as Water 301 was
retained and the function describing the interaction between ligand and water molecules
is applied. (c) Water molecule observed or not in crystal structures: 1 and 0 define the
presence or absence of the water molecule respectively. (d) RMSD (in Å): criterion of
success of 2.0 Å. (e) Score in arbitrary units. (f) Water molecules as proposed by FITTED.
Bold numbers highlight failures.
Table 2.2 - Self-docking – Thymidine kinase inhibitors.

                                    Docking to protein(a)   Docking to protein + water molecule(b)
PDB code  Obs. water molecules(c)   Lig(d)    Score(e)      Lig(d)    Pred. water molecules(f)   Score(e)
1e2k      1 0 1 1 1 1               0.63      -6.1          0.66      1 0 1 1 0 1                -7.1
1e2p      1 0 1 1 1 1               2.69(g)   -4.7          2.03(g)   1 0 1 1 1 1                -5.2
1ki3      0 1 0 0 0 1               1.86      -5.9          1.84      0 1 0 1 1 1                -6.1
1ki4      1 0 1 1 1 1               0.43      -6.9          0.66      1 0 1 1 1 1                -7.7
1ki7      1 0 1 1 1 0               5.79      -5.1          5.76      1 0 1 1 1 1                -4.8
1ki8      1 0 1 1 1 0               0.77      -6.2          0.64      1 0 1 1 1 1                -6.9
2ki5      0 1 1 1 1 1               1.10      -5.5          0.45      0 1 0 1 1 1                -6.3
1of1      1 0 1 1 1 1               0.37      -6.1          0.29      1 0 1 1 1 1                -6.8
1qhi      0 1 1 1 1 0               0.47      -7.2          0.66      0 1 0 1 1 1                -7.8

(a) Water molecules removed prior to docking. (b) 2 to 6 water molecules (see text) were
retained and the function describing the interaction between ligand and water molecules
is applied. (c) Water molecules observed or not in crystal structures: 1 and 0 define the
presence or absence of each water molecule respectively. (d) RMSD (in Å): criterion of
success of 2.0 Å. (e) Score in arbitrary units. (f) Water molecules as proposed by FITTED.
Bold numbers highlight failures. (g) When considering the meso nature of the 1e2p ligand,
these RMSD’s were equivalent to RMSD’s below 1.0 Å (see text).
Table 2.3 - Self-docking – Factor Xa, trypsin and MMP-3 inhibitors.

                          Docking to protein(a)    Docking to protein + water molecule(b)
PDB code  Obs. Water(c)   Lig(d)    Score(e)       Lig(d)    Pred. Water(f)   Score(e)
1ezq      1 0             3.32      -7.4           0.82      0 0              -11.1
1f0r      1 1             2.33      -8.3           1.88      0 0              -8.1
1fjs      1 0             3.64      -7.7           1.78      0 0              -8.8
1nfu      0 0             2.57      -8.9           1.50      0 0              -8.0
1xka      1 0             1.13      -8.3           0.87      0 0              -8.4
1f0u      1 -             2.92      -5.9           3.95      1 -              -6.7
1o2j      1 -             1.03      -5.9           0.94      1 -              -6.2
1o3g      1 -             1.35      -6.9           1.69      1 -              -7.3
1o3i      1 -             0.70      -6.5           0.68      1 -              -6.7
1qbo      1 -             3.84      -7.6           3.49      1 -              -6.8
1b8y      - -             1.15      -9.4           -         - -              -
1bwi      - -             6.35      -5.8           -         - -              -
1ciz      - -             1.22      -10.6          -         - -              -
1d8m      - -             2.99      -6.0           -         - -              -

(a) Water molecules removed prior to docking. (b) None to 2 water molecules (see text) were
retained and the function describing the interaction between ligand and water molecules
is applied. (c) Water molecules observed or not in crystal structures: 1 and 0 define the
presence or absence of each water molecule respectively. (d) RMSD (in Å): criterion of
success of 2.0 Å. (e) Score in arbitrary units. (f) Water molecules as proposed by FITTED.
Bold numbers highlight failures.
Table 2.4 - Cross-docking and docking to multiple conformations – HIV-1 protease
inhibitors.

          Docking to rigid proteins             Statistics for the best scoring pose(a)
Ligand    1b6l   1eby   1hpo   1hpv   1pro      Lig(b)   Pro(c)   Water(d)   Score(e)
1b6l      0.55   0.83   3.37   1.11   1.04      0.55     0.00     1          -11.6
1eby      2.57   4.55   2.86   6.15   2.72      2.86     0.96     1          -8.9
1hpo      4.64   3.62   0.94   4.26   2.4       0.94     0.00     0          -10.1
1hpv      4.09   3.39   2.01   1.19   3.53      2.01     1.00     0          -8.8
1pro      0.62   1.01   0.86   0.78   0.72      0.72     0.00     0          -5.7

Ligand    1ajv   1ajx   1hvr   1hwr   1qbs      Lig(b)   Pro(c)   Water(d)   Score(e)
1ajv      0.59   1.26   1.46   1.52   1.12      1.12     0.87     0          -10.1
1ajx      0.73   0.77   1.1    0.75   0.73      0.73     0.81     0          -9.6
1hvr      1.9    1.27   0.4    1.22   0.77      0.77     0.72     0          -11.9
1hwr      0.68   0.78   0.85   0.52   0.67      0.78     0.84     0          -8.9
1qbs      5.35   1.49   1.17   5.11   5.15      1.17     0.72     0          -10.3

(a) Each ligand was docked to the 5 protein structures and the best scoring of the 5 final
poses was selected. (b) RMSD (in Å): criterion of success of 2.0 Å. (c) RMSD (in Å):
criterion of success: better than average RMSD; average RMSD between protein
structures computed on the binding site residues: 0.91 Å for the first five structures (one
Asp 25 protonated) and 0.77 Å for the last five structures (AspA25 and AspB25
protonated). (d) Water molecules as proposed by FITTED; 1 and 0 define the presence or
absence of the water molecule respectively. Bold numbers highlight failures. (e) Score in
arbitrary units.
Table 2.5 - Cross-docking and docking to multiple conformations – Thymidine kinase
inhibitors.

Docking to rigid proteins
        1e2k     1e2p     1ki3   1ki4   1ki7   1ki8   2ki5     1of1   1qhi
1e2k    0.66     2.11     3.42   0.76   0.83   0.84   0.96     0.78   1.31
1e2p    2.24^f   2.03^f   0.97   0.74   0.78   1.41   2.75^f   1.20   2.03^f
1ki3    2.36     2.62     1.84   2.43   2.25   2.50   2.61     2.94   1.74
1ki4    2.48     2.36     3.38   0.66   1.11   0.73   2.35     2.52   1.00
1ki7    5.79     5.67     5.55   5.08   5.76   5.25   5.15     5.65   5.67
1ki8    3.8      3.93     2.35   1.91   1.19   0.64   3.92     3.84   1.29
2ki5    2.29     3.22     1.28   2.13   2.08   1.90   0.45     2.22   1.21
1of1    0.39     0.49     1.14   0.60   0.78   0.89   0.49     0.29   0.81
1qhi    2.41     2.29     1.18   5.43   1.68   2.13   1.10     2.19   0.67

Statistics for the best scoring pose^a
        Lig^b    Pro^c   Water^d            Score^e
1e2k    0.66     0.00    1  0  1  1  0  1   -8.1
1e2p    2.24^f   0.59    1  0  0  1  1  1   -6.2
1ki3    1.74     0.78    1  0  1  1  1  1   -6.1
1ki4    0.66     0.00    1  0  1  1  1  1   -8.2
1ki7    5.67     0.87    0  0  0  1  1  1   -5.8
1ki8    0.64     0.00    1  0  1  1  1  1   -7.4
2ki5    1.90     1.11    1  0  1  1  1  1   -6.7
1of1    0.29     0.00    1  0  0  1  1  1   -7.3
1qhi    0.67     0.00    0  1  1  1  1  1   -6.1

a Each ligand was docked to each of the 9 protein structures and the best scoring of the
9 final poses was selected. b RMSD (in Å): criterion of success of 2.0 Å. c RMSD (in Å):
criterion of success: better than average RMSD; average RMSD between protein
structures computed on the binding site residues: 0.92 Å. d Water molecules as proposed
by FITTED; 1 and 0 define the presence or absence of the water molecule respectively.
e Score in arbitrary units. f Equivalent to RMSDs below 1.0 Å if the meso nature of the
ligand is considered. Bold numbers highlight failures.
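Footnote f of Table 2.5 refers to the internal symmetry of the meso 1e2p ligand: a pose in which symmetry-equivalent atoms are exchanged is chemically identical to the crystal pose but yields a large naive RMSD. The thesis does not state how the symmetry-equivalent RMSD was obtained. The following is only a minimal sketch of one common approach, taking the smallest RMSD over a user-supplied list of symmetry-equivalent atom orderings; the function names, the toy coordinates and the permutation list are all illustrative assumptions.

```python
import numpy as np


def rmsd(coords_a, coords_b):
    """Plain RMSD between two matched (N, 3) coordinate sets, without superposition."""
    diff = np.asarray(coords_a, dtype=float) - np.asarray(coords_b, dtype=float)
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))


def symmetry_corrected_rmsd(docked, crystal, permutations):
    """Smallest RMSD over the supplied symmetry-equivalent atom orderings.

    `permutations` must include the identity ordering; each entry lists,
    for every crystal atom, the index of the docked atom matched to it.
    """
    docked = np.asarray(docked, dtype=float)
    return min(rmsd(docked[list(p)], crystal) for p in permutations)


# Toy symmetric 4-atom "ligand" (hypothetical coordinates, in Å).
crystal = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
docked = crystal[::-1]                         # same pose, atoms numbered end to end
equivalent_orderings = [(0, 1, 2, 3), (3, 2, 1, 0)]

naive = rmsd(docked, crystal)
corrected = symmetry_corrected_rmsd(docked, crystal, equivalent_orderings)
```

In this toy case the naive value is large although the two poses coincide, while the corrected value is zero, which is the kind of situation footnote f describes.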
Table 2.6 - Cross-docking and docking to multiple conformations – Factor Xa, trypsin
and MMP-3 inhibitors.

        Docking to rigid proteins            Statistics for the best scoring pose^a
        1ezq   1f0r   1fjs   1nfu   1xka     Lig^b   Pro^c   Water^d   Score^e
1ezq    0.82   9.14   3.53   3.45   4.8      0.82    0.00    0  0      -11.1
1f0r    2.42   1.89   2.31   2.49   2.42     2.49    0.75    0  0       -9.4
1fjs    3.3    2.81   1.78   2.24   3.22     1.78    0.00    0  0       -8.8
1nfu    2.05   2.17   2.26   1.5    3.79     1.5     0.00    0  0       -9.7
1xka    1.66   1.5    1.64   1.58   0.87     0.87    0.00    0  0       -8.4

        1f0u   1o2j   1o3g   1o3i   1qbo
1f0u    3.95   4.21   2.16   4.98   5.50     5.50    0.74    1  -       -6.9
1o2j    1.33   0.94   0.80   4.16   1.43     0.80    0.55    1  -       -6.3
1o3g    0.59   0.79   1.69   0.67   1.22     0.67    0.31    1  -       -7.6
1o3i    0.59   0.93   0.94   0.69   1.06     0.69    0.00    1  -       -6.7
1qbo    5.23   4.14   3.89   4.30   3.49     3.89    1.09    1  -       -7.4

        1b8y   1bwi   1ciz   1d8m
1b8y    1.15   1.51   1.38   2.30   -        1.15    0.00    -  -       -9.4
1bwi    5.64   6.35   8.95   6.40   -        6.35    0.00    -  -       -5.8
1ciz    1.15   4.53   1.23   4.33   -        1.15    0.45    -  -      -10.1
1d8m    1.22   6.23   2.21   2.99   -        1.22    1.11    -  -       -7.3

a Each ligand was docked to each protein structure of its family (5 structures; 4 for
MMP-3) and the best scoring of the final poses was selected. b RMSD (in Å): criterion of
success of 2.0 Å. c RMSD (in Å): criterion of success: better than average RMSD; average
RMSD between protein structures computed on the binding site residues: Factor Xa: 0.86 Å,
trypsin: 0.90 Å, MMP-3: 0.92 Å. d Water molecules as proposed by FITTED; 1 and 0 define
the presence or absence of the water molecule respectively. e Score in arbitrary units.
Bold numbers highlight failures.
Table 2.7 - Docking to flexible proteins - HIV-1 protease inhibitors.

        Docking to semi-flexible protein     Docking to fully flexible protein
        Lig^a   Pro^b   Water^c   Score^d    Lig^a   Pro^b   Water^c   Score^d
1b6l    1.06    0.00    1         -11.0      1.08    0.53    1         -11.4
1eby    3.62    0.85    0          -9.3      6.06    1.02    0          -8.6
1hpo    4.03    0.99    0          -8.1      3.25    1.16    0          -8.5
1hpv    3.88    1.00    1         -10.3      1.54    0.79    1         -10.0
1pro    0.51    0.00    0          -5.9      0.94    0.59    0          -5.6
1ajv    0.75    0.00    0         -11.4      1.46    1.02    0         -10.6
1ajx    0.85    0.84    0          -9.3      1.77    0.72    0         -10.0
1hvr    1.72    0.00    0          -9.7      1.59    0.67    0         -11.6
1hwr    0.79    0.81    0          -8.8      0.58    0.71    0          -8.9
1qbs    1.22    0.72    0         -11.0      1.32    0.59    0         -11.0

a RMSD (in Å): criterion of success of 2.0 Å. b RMSD (in Å): criterion of success: better
than average RMSD; average RMSD between protein structures computed on the binding
site residues: 0.91 Å for the first five structures (one Asp 25 protonated) and 0.77 Å for
the last five structures (AspA25 and AspB25 protonated). c Water molecules as proposed
by FITTED; 1 and 0 define the presence or absence of the water molecule respectively.
d Score in arbitrary units. Bold numbers highlight failures.
Table 2.8 - Docking to flexible proteins - thymidine kinase inhibitors.

Docking to semi-flexible protein
        Lig^a   Pro^b    Occurrence of water mol.^c   Score^d
1e2k    0.67    0.00     1  0  1  1  0  1             -7.0
1e2p    0.51    0.88     1  0  0  1  0  1             -5.8
1ki3    1.46    0.00     0  1  1  0  0  1             -6.8
1ki4    0.64    0.00     1  0  1  1  1  1             -7.3
1ki7    5.20    1.01     1  0  1  1  1  1             -5.4
1ki8    0.60    0.96     1  0  1  1  1  1             -6.7
2ki5    1.92    0.89     0  1  1  1  1  1             -5.6
1of1    0.35    0.26^c   1  0  1  1  1  1             -6.7
1qhi    0.96    0.00     1  0  0  1  1  0             -7.5

Docking to fully flexible protein
        Lig^a   Pro^b    Occurrence of water mol.^c   Score^d
1e2k    0.75    0.61     1  0  1  1  0  1             -7.2
1e2p    0.95    0.93     1  0  1  1  1  1             -5.7
1ki3    1.35    0.90     0  0  0  1  1  1             -6.8
1ki4    0.77    0.89     1  0  0  1  1  1             -7.9
1ki7    5.25    1.11     0  1  0  1  1  0             -6.5
1ki8    0.65    0.53     1  0  1  1  1  1             -7.8
2ki5    1.62    0.80     1  0  1  1  1  1             -6.8
1of1    0.75    0.99     1  0  1  1  1  1             -7.3
1qhi    0.64    0.69     0  1  0  1  1  1             -8.2

a RMSD (in Å): criterion of success of 2.0 Å. b RMSD (in Å): criterion of success: better
than average RMSD; average RMSD between protein structures computed on the binding site
residues: 0.92 Å. c Water molecules as proposed by FITTED; 1 and 0 define the presence or
absence of the water molecules respectively. d Score in arbitrary units. Bold numbers
highlight failures.
Table 2.9 - Docking to flexible proteins – Factor Xa, trypsin and MMP-3
inhibitors.

        Docking to semi-flexible protein      Docking to fully flexible protein
        Lig^a    Pro^b   Water^c   Score^d    Lig^a    Pro^b   Water^c   Score^d
1ezq    1.34     0.00    1  0      -10.2      9.64^e   0.92    1  0       -8.5
1f0r    2.50^d   0.75    0  0       -8.1      2.32^f   0.63    0  0       -9.7
1fjs    2.45     0.77    0  0       -9.1      3.24     1.11    1  0       -8.6
1nfu    1.87     0.70    0  0       -8.7      1.17     0.71    0  0       -9.6
1xka    1.31     0.91    0  0       -8.2      1.52     0.70    0  0       -8.7
1f0u    6.11     1.04    1  -       -6.7      4.25     0.87    1  -       -7.6
1o2j    1.06     0.33    1  -       -6.5      1.30     0.58    1  -       -7.1
1o3g    0.83     0.66    1  -       -7.2      0.82     0.77    1  -       -7.7
1o3i    1.24     1.32    1  -       -5.9      0.62     0.67    1  -       -6.9
1qbo    4.48     1.32    1  -       -6.6      3.65     0.78    1  -       -7.7
1b8y    0.95     1.11    -  -       -9.5      1.40     0.67    -  -      -10.1
1bwi    5.40     1.14    -  -       -5.4      6.14     0.55    -  -       -6.2
1ciz    2.01     0.45    -  -       -9.7      1.39     1.19    -  -      -10.8
1d8m    1.03     1.11    -  -       -7.4      1.37     1.49    -  -       -8.0

a RMSD (in Å): criterion of success of 2.0 Å. b RMSD (in Å): criterion of success: better
than average RMSD; average RMSD between protein structures computed on the binding
site residues: Factor Xa: 0.86 Å, trypsin: 0.90 Å, MMP-3: 0.92 Å. c Water molecules as
proposed by FITTED; 1 and 0 define the presence or absence of the water molecule
respectively. d Score in arbitrary units. e The second best has an RMSD of 1.36 Å with a
higher potential energy but a better score. f Poses with RMSD below 1.5 Å were found but
given worse scores. Bold numbers highlight failures.
Table 2.10 - Docking accuracy (%): rigid proteins.

           Docking to    Docking to protein +    Cross-docking
           protein^a     water molecule^b
           Lig^c         Lig^c     Water^d       Lig^c
Success    64            76        82            47

a Water molecules removed prior to self-docking. b Bridging water molecules (see
experimental section) were retained and the function describing the interaction between
ligand and water molecules was applied. c RMSD (in Å): criterion of success of 2.0 Å.
d Criterion of success: occurrence predicted when ligand successfully docked.
Table 2.11 - Docking accuracy (%): flexible proteins.

           Multiple conformations^a     Semi-flexible protein        Fully flexible protein
           Lig^b   Pro^c     Water^d    Lig^b   Pro^c     Water^d    Lig^b   Pro^c     Water^d
Success    79      47  76    73         73      27  61    82         73      0  73     81

a Best scoring poses from self- and cross-docking studies (see text). b RMSD (in Å):
criterion of success of 2.0 Å. c RMSD (in Å) calculated on successful dockings (ligand
correctly docked): 2 percentages of success are given following 2 different criteria of
success: exact protein structure (RMSD=0.0 Å), RMSD below average. d Criterion of
success: occurrence predicted. e The success rates are computed on the systems with the
ligand successfully docked.

Discussion. Table 2.10 and Table 2.11 summarize the accuracy observed throughout this
study. It is worth mentioning that the Tables show data for the top scoring poses only.
This study was designed to assess the impact of the energy term used to model
“displaceable” water molecules on the one hand, and of protein flexibility on the other
hand, on the accuracy of FITTED. First, a clear increase in accuracy was observed when the
“displaceable” water molecules were added, which validates the concept (Table 2.10).
Overall, FITTED self-docked 76% of the inhibitors within 2.0 Å of the observed binding
modes when the water was considered and only 64% when it was removed. In addition, the
occurrence of water molecules was predicted with about 80% accuracy (82% in Table 2.10).
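The success rates quoted here are simply the percentage of top-scoring poses whose ligand RMSD meets the 2.0 Å criterion. As an illustration only, the sketch below applies that criterion to the ligand RMSD values transcribed from Table 2.3 (Factor Xa, trypsin and MMP-3 self-docking). It therefore covers a subset of the validation set and is not expected to reproduce the 64% and 76% figures, which refer to all 33 complexes. Treating the cutoff as inclusive is an assumption, and the function name is ours.

```python
# Ligand RMSD values (in Å) transcribed from Table 2.3.
RMSD_WITHOUT_WATER = [3.32, 2.33, 3.64, 2.57, 1.13, 2.92, 1.03,
                      1.35, 0.70, 3.84, 1.15, 6.35, 1.22, 2.99]
# MMP-3 entries have no retained water molecules, hence only 10 values here.
RMSD_WITH_WATER = [0.82, 1.88, 1.78, 1.50, 0.87, 3.95, 0.94, 1.69, 0.68, 3.49]


def success_rate(rmsds, cutoff=2.0):
    """Percentage of poses with RMSD <= cutoff (inclusive cutoff is an assumption)."""
    hits = sum(1 for value in rmsds if value <= cutoff)
    return 100.0 * hits / len(rmsds)


rate_without_water = success_rate(RMSD_WITHOUT_WATER)
rate_with_water = success_rate(RMSD_WITH_WATER)
print(f"without water: {rate_without_water:.1f}%  with water: {rate_with_water:.1f}%")
```

On this subset the trend matches the one discussed in the text: more poses fall under the cutoff when the displaceable water molecules are modeled.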
As a comparison, Kontoyianni and co-workers4 found GOLD and Glide to be the
most accurate programs, with 69% and 58% of the compounds docked in a manner similar
to the experimentally observed mode (referred to as “close” in Kontoyianni’s report),
while LigandFit, FlexX and DOCK showed poorer predictive power. In another
comparative study, Brooks and co-workers3 docked 73% and 46% of the compounds with
RMSD below 2.0 Å with ICM and GOLD respectively, while AutoDock, DOCK and
FlexX were less accurate. Rognan and co-workers2 performed a similar study and found
that Glide, GOLD, Surflex and QXP docked 80 to 90% of the inhibitors within 2.0 Å
of the observed pose, while FlexX, Fred, DOCK and Slide showed lower accuracy (50-65%).
Another study carried out by Perola and co-workers5 showed that Glide
(61% within 2.0 Å) outperformed GOLD and ICM (48% and 45% respectively).
Although each study was based on a different set of protein-ligand complexes, our
validation study shows that FITTED performed very well, with accuracy as high as or higher
than that of the best performing docking programs. More importantly, FITTED accounts for
both protein flexibility and displaceable water molecules, whereas GOLD includes water
molecules but restricts protein flexibility to the polar hydrogens, and Glide considers
neither protein flexibility nor water molecules.
With these data in hand, we turned our attention to the benefit and impact of the
flexibility. Self-docking is the ideal case, with the protein being molded to the ligand
structure. In contrast, cross-docking combines ligands and protein structures that
were not co-crystallized, and it is well known that docking programs perform more poorly
when cross-docking is carried out.10 In practice, docking experiments considering protein
flexibility should be more accurate than cross-docking experiments and ideally as
accurate as self-docking. In the present study, 47% and 76% accuracy were recorded for
the cross- and self-docking experiments respectively. Gratifyingly, the observed accuracy
of FITTED when docking ligands to flexible proteins (73%, Table 2.11) is similar to that
seen in self-docking. In addition, neither the prediction of the water molecule occurrence
nor the ligand pose is affected by adding the protein flexibility, as no significant drop
in accuracy is observed when moving from rigid (self-docking) to flexible proteins.
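In the "docking to multiple conformations" statistics (footnote a of Tables 2.4 to 2.6), each ligand is docked to every rigid protein structure of its family and the best scoring final pose is kept. A minimal sketch of that selection step is given below. The RMSD values are those of the 1b6l ligand in Table 2.4 and the score of -11.6 is the reported best score; the four other scores are hypothetical placeholders, since per-structure scores are not tabulated, and "lower score is better" is an assumption about the scoring convention.

```python
from dataclasses import dataclass


@dataclass
class Pose:
    protein: str   # PDB code of the rigid protein structure used for this run
    rmsd: float    # ligand RMSD to the crystal pose, in Å
    score: float   # docking score, arbitrary units (assumed: lower is better)


def best_scoring_pose(poses):
    """Return the pose with the lowest score among the per-structure final poses."""
    return min(poses, key=lambda pose: pose.score)


# Ligand of 1b6l docked to the five HIV-1 protease structures of the first set.
poses_1b6l = [
    Pose("1b6l", 0.55, -11.6),   # RMSD and score as reported in Table 2.4
    Pose("1eby", 0.83, -10.4),   # score hypothetical
    Pose("1hpo", 3.37, -9.1),    # score hypothetical
    Pose("1hpv", 1.11, -10.0),   # score hypothetical
    Pose("1pro", 1.04, -9.8),    # score hypothetical
]

selected = best_scoring_pose(poses_1b6l)
```

Because the selection relies on the score alone, the retained pose is not necessarily the one with the lowest RMSD, as the 1eby row of Table 2.4 shows.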
A close look at the predicted protein structures confirmed the good accuracy of our
protocol. For instance, while 5 to 9 protein structures were used as input for each
experiment, the correct conformation was selected for 9 of the 24 (37%) correctly docked
systems when the semi-flexible protein was used. A random selection would return the
correct structure in only 11 to 20% of the cases (1 in 9 to 1 in 5). When considering the
successful docking experiments (ligand pose accurately predicted), average protein RMSDs
of 0.50 Å and 0.78 Å were recorded when using the semi-flexible or fully flexible proteins
respectively.
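The random-selection baseline quoted above follows from a uniform choice among the input structures. The short sketch below reproduces that arithmetic; the function name is ours and nothing here comes from FITTED itself.

```python
def random_selection_rate(n_structures):
    """Chance (in %) of picking the correct structure by uniform random choice."""
    return 100.0 / n_structures


observed = 100.0 * 9 / 24                  # 9 of the 24 correctly docked systems
baseline_low = random_selection_rate(9)    # nine input structures
baseline_high = random_selection_rate(5)   # five input structures

print(f"observed {observed:.1f}% vs random {baseline_low:.1f}% to {baseline_high:.1f}%")
```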
Unexpected results were also recorded. In two cases (MMP-3: 1d8m and HIVP:
1qbs), docking to flexible proteins was more accurate than self-docking. Figure 2.9
shows a superposition of the crystal structure of 1d8m and the docked structure when the
semi-flexible option was used. In the crystal structure, hydrogen bonds are observed
between the protein backbone (Ala93 and Leu92) and one of the sulfonamide oxygen
atoms of the inhibitor. In contrast, the docked pose indicates weak hydrogen bonds but
strong hydrophobic/π-stacking interactions with His119 and Tyr141 side chains located
in the S1’ pocket. In order to induce this interaction pattern, the S1’ pocket must be more
closed in the modeled structure than in the crystal structure to encompass the ligand
better. This discrepancy may show that the hydrogen bonding contribution to the binding
is underestimated or that the van der Waals interactions are overestimated.
Figure 2.9 - Crystal structure (protein in grey, ligand in gray) and proposed docked
model (protein in blue, inhibitor in green) for the 1d8m complex.
In the case of 1qbs, no clear explanation was found. A close look at the docked and
crystal complexes does not reveal any specific movement or steric clash. We therefore
believe that the potential energy function may be sensitive to small structural details
in a way that is detrimental to the correct pose. Again, slight adjustments of the crystal
structures may be necessary prior to or upon docking. Under this hypothesis, the crystal
structure of 1qbs would include small inaccuracies inducing slight van der Waals
repulsions, which prevent a good score when docking the 1qbs inhibitor to the 1qbs
protein structure. These repulsions would vanish when the flexible protein was used and
much lower scores were recorded. Possible strategies to address this issue are the use of
relaxed structures (as proposed in Glide33), soft structures (as proposed by Shoichet and
co-workers34) or flexible structures (as shown in this study). These two unexpected
results may reveal some inaccuracies of the scoring function.
Overall, this study demonstrated the accuracy of FITTED in docking inhibitors to flexible
and partially solvated proteins, as validated on this set of representative
protein-inhibitor complexes.
CONCLUSION
We have developed FITTED 1.0, a unique docking program that accounts for both
protein flexibility and bridging water molecules. The flexibility is handled by a genetic
algorithm based on various genetic operators specific to FITTED (i.e., a designed cross-over
operator, focused mutation and filters). Modifications to the initial genetic algorithm
have been made to increase the speed and accuracy by orienting the docking toward “favored”
poses (e.g., poses within the cavity and fulfilling constraints). We have also implemented
a new potential energy term that accurately accounts for dynamically bound water
molecules. Application of FITTED to the docking of a variety of protein/inhibitor
complexes resulted in proposed docked poses within 2.0 Å of the observed binding
modes in 73 to 76% of the cases using flexible or rigid proteins respectively. The
accurate prediction of the occurrence and need for displaceable water molecules was also
demonstrated. Finally, the protein structures were predicted with reasonable accuracy.
Our initial studies led to a method that docked each compound within 0.5 to 20
hours when not considering rotation and orientation of the ligand as part of the
chromosomes. FITTED, which now explores the entire conformational space of the ligands
and considers protein flexibility and displaceable water molecules, docks all the tested
compounds within 3 hours. This CPU time is still not appropriate for virtual screening;
further studies are in progress to reduce it by a factor of 10 or more and to
improve the scoring function.
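To make the vocabulary of the conclusion concrete (cross-over, mutation, filters, population), the following is a generic, deliberately simplified genetic algorithm written for illustration. It is not FITTED's algorithm: the chromosome (a list of ligand torsion angles), the toy energy function, the filter rule and all function names are assumptions, and FITTED's designed cross-over and focused mutation operators are not reproduced. Only the default population size of 100 is taken from the experimental section.

```python
import random


def toy_energy(torsions):
    """Stand-in for a docking score: lower is better, minimum at 60 degrees."""
    return sum((angle - 60.0) ** 2 for angle in torsions)


def passes_filter(torsions):
    """Stand-in for a pose filter: reject torsions outside [-180, 180] degrees."""
    return all(-180.0 <= angle <= 180.0 for angle in torsions)


def crossover(parent_a, parent_b, rng):
    """One-point cross-over between two parent chromosomes."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:]


def mutate(torsions, rng, step=30.0):
    """Perturb one randomly chosen torsion by at most `step` degrees."""
    child = list(torsions)
    index = rng.randrange(len(child))
    child[index] += rng.uniform(-step, step)
    return child


def evolve(pop_size=100, n_torsions=4, generations=50, seed=0):
    rng = random.Random(seed)
    population = [[rng.uniform(-180.0, 180.0) for _ in range(n_torsions)]
                  for _ in range(pop_size)]
    history = []                                   # best energy per generation
    for _ in range(generations):
        population.sort(key=toy_energy)
        history.append(toy_energy(population[0]))
        parents = population[: pop_size // 2]      # the better half survives
        children = []
        while len(children) < pop_size - len(parents):
            child = mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            if passes_filter(child):               # discard children failing the filter
                children.append(child)
        population = parents + children
    population.sort(key=toy_energy)
    history.append(toy_energy(population[0]))
    return population[0], history


best, history = evolve(generations=20)
```

Because the better half of each generation is carried over unchanged, the best energy never worsens from one generation to the next; filtering children before they are scored is the part that loosely mirrors the idea of steering the search toward "favored" poses.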
EXPERIMENTAL SECTION.
PREPARATION OF THE TRAINING SET
General Considerations. The protein/ligand complexes were retrieved from the PDB or
from the PDBbind database.35,36 The complexes were selected for the occurrence of water
molecules, the flexibility of the protein structure, the diversity of the ligands,
resolutions of 2.5 Å or better and good binding affinities. At least 4 structures were
sought for each system (MMP, HIVP, TK, FXa, trypsin). The complexes were set up
using Maestro and/or InsightII. Complexes from the same family were
superimposed prior to their use with FITTED. For FITTED to be used with more
than one protein structure, the sequences have to be identical. Therefore, some minor
mutations (often far from the binding site) were introduced (e.g., Arg14 into Lys14 in HIVP
complex 1b6l), missing side chains were reconstructed (e.g., Arg220 in 1qhi) and names of
residues were made identical (e.g., Glu124A into Glu124 in 1nfu). Hydrogens were next
added and optimized by energy minimization. All the non-conserved waters were
removed and missing key water molecules were added by analogy with other structures
when applicable. For instance, the water 301 observed in many HIVP/inhibitor
complexes is displaced in 1ajv; it was added to the 1ajv protein structure and its
position optimized by energy minimization using AMBER94 as a force field. The naming
of the water molecules was made homogeneous within each set. Each protein was then saved
as a mol2 file and processed using ProCESS to assign protein atom types and charges.
Each ligand was charged using Sybyl Gasteiger-Hückel charges and processed using
SMART. Large grids of spheres were prepared as well as constraint files. These
constraints were loose in order to orient and speed up the docking (as previously
described) without biasing the results. Diameters of the constraint spheres as large as
8.0 Å were used.
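The superimposition of complexes from the same family was carried out in Maestro and/or InsightII; the thesis does not detail the fitting procedure. For readers who want to reproduce this kind of step, the sketch below shows the standard Kabsch least-squares superposition of matched atoms (for example binding-site atoms present in both structures) together with the resulting RMSD. It is an illustration under these assumptions, not the authors' protocol, and the function names are ours.

```python
import numpy as np


def kabsch_superimpose(mobile, reference):
    """Rotate and translate `mobile` onto `reference` (matched (N, 3) arrays)."""
    mobile = np.asarray(mobile, dtype=float)
    reference = np.asarray(reference, dtype=float)
    mobile_centered = mobile - mobile.mean(axis=0)
    reference_centered = reference - reference.mean(axis=0)
    covariance = mobile_centered.T @ reference_centered      # 3 x 3 matrix
    u, _, vt = np.linalg.svd(covariance)
    # Flip the last axis if needed so the result is a proper rotation.
    sign = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rotation = vt.T @ np.diag([1.0, 1.0, sign]) @ u.T
    return mobile_centered @ rotation.T + reference.mean(axis=0)


def rmsd(coords_a, coords_b):
    """RMSD between two matched (N, 3) coordinate sets."""
    diff = np.asarray(coords_a, dtype=float) - np.asarray(coords_b, dtype=float)
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))


# Synthetic check: a rotated and shifted copy should fit back onto the original.
rng = np.random.default_rng(0)
reference_atoms = rng.normal(size=(10, 3))
theta = 0.7
rot_z = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta),  np.cos(theta), 0.0],
                  [0.0,            0.0,           1.0]])
mobile_atoms = reference_atoms @ rot_z.T + np.array([5.0, -3.0, 2.0])
fitted_atoms = kabsch_superimpose(mobile_atoms, reference_atoms)
```

Applied to the binding-site atoms of two real structures, the RMSD after fitting is the type of quantity reported as the average protein RMSD in the table footnotes.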
HIV-1 Protease Inhibitor/Protein Complexes. HIVP complexes following the criteria
defined above were retrieved from the PDB. 1eby (crystal structure resolution: 2.29 Å),
1hpo (2.50 Å), 1hpv (1.90 Å) and 1pro (1.80 Å), were superimposed onto 1b6l (1.75 Å)
and one catalytic aspartic acid side chain was protonated. 1ajx (2.00 Å), 1hvr (1.80 Å),
1hwr (1.80 Å) and 1qbs (1.80 Å), were superimposed onto 1ajv (2.00 Å) and the two
catalytic aspartic acid side chains were protonated following the experimental study of
similar complexes.32 A water molecule hydrogen-bonding to both Ile50 NH groups was kept
when present and added when missing. The constraint applied imposes a polar group to
be located close to the catalytic site. As some of the inhibitors have a large number of
rotatable bonds, initial populations of 200 individuals were used in all 10 cases.
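The constraint described here requires a polar group of the ligand to sit close to a given site, with constraint spheres of up to 8.0 Å in diameter. The constraint file format and the way FITTED evaluates it are not described in this section, so the sketch below is only a plausible geometric reading: a pose satisfies the constraint if at least one polar atom lies inside the sphere. All names and coordinates are hypothetical.

```python
import math


def satisfies_constraint(polar_atoms, sphere_center, diameter=8.0):
    """True if any polar atom lies within the constraint sphere."""
    radius = diameter / 2.0
    return any(math.dist(atom, sphere_center) <= radius for atom in polar_atoms)


# Hypothetical coordinates (in Å) for two polar atoms of a ligand pose.
catalytic_site = (0.0, 0.0, 0.0)
pose_polar_atoms = [(6.5, 1.0, 0.0), (2.0, 2.0, 1.0)]

pose_is_acceptable = satisfies_constraint(pose_polar_atoms, catalytic_site)
```

A check of this kind is cheap compared with a full energy evaluation, which is consistent with the stated purpose of the constraints: orienting and speeding up the docking.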
Thymidine Kinase Inhibitor/Protein Complexes. All the available TK inhibitor/protein
complexes were retrieved from the PDB and filtered. A final set of nine structures was
used (1e2k, 1e2p, 1ki3, 1ki4, 1ki7, 1ki8, 2ki5, 1of1, 1qhi). Six key water molecules were
considered as discussed in the text. The constraint imposes polar groups to be located
within the catalytic site.
Factor Xa Inhibitor/Protein Complexes. These complexes were retrieved from the
PDBbind database. 1f0r, 1fjs, 1nfu and 1xka were superimposed onto 1ezq. In this set,
two key water molecules were identified and added when missing to the protein
structures. The constraint imposes a polar group to be located close to the Asp189 side
chain. Initial populations of 200 individuals were used.
Trypsin Inhibitor/Protein Complexes. 1o2j, 1o3g, 1o3i and 1qbo, were superimposed
onto 1f0u (1.9 Å). In this case a single water molecule interacting with Leu227 was
considered. The constraint imposes a polar group to be located close to the Asp171 side
chain. Initial populations of 200 individuals were used.
MMP Inhibitor/Protein Complexes. 1bwi, 1ciz, and 1d8m were superimposed onto 1b8y. No
water molecules were retained. The constraint imposes a polar group to be located close
to the catalytic zinc atom. Specific zinc (van der Waals, and metal chelation) and
hydroxamic acid (internal energy) parameters were added to the force field. The LJ12-10
potential parameters used for the zinc atom were designed to reproduce the observed
energy of zinc chelation.37
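The functional form and parameters of the LJ12-10 term used for zinc are not given in this section. For orientation only, the sketch below evaluates the common 12-10 form, written so that the minimum energy is -epsilon at the distance r0. The parameter values in the example are hypothetical and are not the zinc parameters of the FITTED force field.

```python
def lj_12_10(r, epsilon, r0):
    """12-10 potential: E(r) = epsilon * (5 * (r0/r)**12 - 6 * (r0/r)**10).

    The minimum is E(r0) = -epsilon; r and r0 share the same length unit.
    """
    ratio = r0 / r
    return epsilon * (5.0 * ratio ** 12 - 6.0 * ratio ** 10)


# Hypothetical parameters: well depth 1.5 (arbitrary units) at 2.0 Å.
well_depth, optimal_distance = 1.5, 2.0
energy_at_minimum = lj_12_10(optimal_distance, well_depth, optimal_distance)
energy_too_close = lj_12_10(1.8, well_depth, optimal_distance)
energy_too_far = lj_12_10(2.5, well_depth, optimal_distance)
```

Compared with the usual 12-6 form, the 12-10 well is narrower, which is why it is often chosen for short-range directional contacts such as hydrogen bonds or metal coordination.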
DOCKING STUDY
Self-docking, Semi-flexible Protein and Fully Flexible Protein. In the first of these three
sets of runs (self- and cross-docking, docking to multiple conformations), one single
protein structure was used as an input to evaluate the accuracy of the docking algorithm.
In the second set (docking to semi-flexible proteins), the protein structure was restricted
to the four to nine input conformations. In the third set (docking to fully flexible protein),
the protein structures were composites of the four to nine input conformations. A typical
keyword file with all the default parameters is given as supplemental material. The
default parameters (e.g., 10 runs, population size of 100 individuals) were used unless
otherwise stated.
ProCESS and FITTED Parameters. The ensemble of spheres describing the binding-site cavity
was centered on the center of the cavity and did not exceed 28 Å in length. The grid
resolution was 1.5 Å.
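To give a sense of scale for these two parameters, the sketch below builds a regular grid of points with 1.5 Å spacing whose extent does not exceed 28 Å, centered on a given point. The cubic shape, the function name and the example center are assumptions made for illustration; ProCESS presumably keeps only the spheres that lie inside the cavity, which is not modeled here.

```python
import numpy as np


def centered_grid(center, max_length=28.0, spacing=1.5):
    """Cubic grid of points centered on `center`, extent <= max_length per axis."""
    n_points = int(max_length // spacing) + 1                 # points per axis
    offsets = (np.arange(n_points) - (n_points - 1) / 2.0) * spacing
    gx, gy, gz = np.meshgrid(offsets, offsets, offsets, indexing="ij")
    points = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)
    return points + np.asarray(center, dtype=float)


cavity_center = (12.0, -4.5, 30.0)        # hypothetical cavity center, in Å
grid = centered_grid(cavity_center)
extent = grid.max(axis=0) - grid.min(axis=0)
```

With these settings the grid has 19 points per axis, that is 6859 candidate positions before any trimming to the cavity shape.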
ACKNOWLEDGMENTS
We thank Virochem Pharma for financial support and a scholarship to CRC as well
as the Canadian Foundation for Innovation for financial support through the New
Opportunities Fund program. PE is supported by a scholarship from Canadian Institutes
of Health Research (Strategic Training Initiative in Chemical Biology). We thank
RQCHP for generous allocation of computer resources.
Supporting Information Available. Typical keyword files for FITTED and ProCESS. A
detailed description of the validation set (PDB codes, structures, Ki’s).
REFERENCES
1. Rester, U. Dock around the Clock – Current Status of Small Molecule Docking and
Scoring. QSAR Comb. Sci. 2006, 25, 605–615.
2. Bissantz, C.; Folkers, G.; Rognan, D. Protein-Based Virtual Screening of Chemical
Databases. 1. Evaluation of Different Docking/Scoring Combinations. J. Med.
Chem. 2000, 43, 4759–4767.
3. Bursulaya, B. D.; Totrov, M.; Abagyan, R.; Brooks, C. L., III. Comparative Study of
Several Algorithms for Flexible Ligand Docking. J. Comput.-Aided Mol. Des. 2003,
17, 755–763.
4. Kontoyianni, M.; McClellan, L. M.; Sokol, G. S. Evaluation of Docking
Performance: Comparative Data on Docking Algorithms. J. Med. Chem. 2004, 47,
558–565.
5. Perola, E.; Walters, W. P.; Charifson, P. S. A Detailed Comparison of Current
Docking and Scoring Methods on Systems of Pharmaceutical Relevance. Proteins:
Struct. Funct. Bioinf. 2004, 56, 235–249.
6. Kellenberger, E.; Rodrigo, J.; Muller, P.; Rognan, D. Comparative Evaluation of
Eight Docking Tools for Docking and Virtual Screening Accuracy. Proteins:
Struct. Funct. Bioinf. 2004, 57, 225–242.
7. Cummings, M. D.; DesJarlais, R. L.; Gibbs, A. C.; Mohan, V.; Jaeger, E. P.
Comparison of Automated Docking Programs as Virtual Screening Tools. J. Med.
Chem. 2005, 48, 962–976.
8. Kontoyianni, M.; Sokol, G. S.; McClellan, L. M. Evaluation of Library Ranking
Efficacy in Virtual Screening. J. Comput. Chem. 2005, 26, 11–22.
9. Cavasotto, C. N.; Abagyan, R. A. Protein Flexibility in Ligand Docking and Virtual
Screening to Protein Kinases. J. Mol. Biol. 2004, 337, 209–225.
10. Osterberg, F.; Morris, G. M.; Sanner, M. F.; Olson, A. J.; Goodsell, D. S.
Automated Docking to Multiple Target Structures: Incorporation of Protein
Mobility and Structural Water Heterogeneity in AutoDock. Proteins: Struct. Funct.
Genet. 2002, 46, 34–40.
11. Murray, C. W.; Baxter, C. A.; Frenkel, A. D. The Sensitivity of the Results of
Molecular Docking to Induced Fit Effects: Application to Thrombin, Thermolysin
and Neuraminidase. J. Comput.-Aided Mol. Des. 1999, 13, 547–562.
12. Murray, C. W.; Baxter, C. A.; Frenkel, A. D. The Sensitivity of the Results of
Molecular Docking to Induced Fit Effects: Application to Thrombin, Thermolysin
and Neuraminidase. J. Comput.-Aided Mol. Des. 1999, 13, 547–562.
13. Erickson, J. A.; Jalaie, M.; Robertson, D. H.; Lewis, R. A.; Vieth, M. Lessons in
Molecular Recognition: the Effects of Ligand and Protein Flexibility on Molecular
Docking Accuracy. J. Med. Chem. 2004, 47, 45–55.
14. Carlson, H. A. Protein Flexibility and Drug Design: How to Hit a Moving Target.
Curr. Opin. Chem. Biol. 2002, 6, 447–452.
15. Schnecke, V.; Kuhn, L. A. Virtual Screening with Solvation and Ligand-Induced
Complementarity. Perspect. Drug Discovery Des. 2000, 20, 171–190.
16. Claußen, H.; Buning, C.; Rarey, M.; Lengauer, T. FlexE: Efficient Molecular
Docking Considering Protein Structure Variations. J. Mol. Biol. 2001, 308, 377–
395.
17. Zavodszky, M. I.; Lei, M.; Thorpe, M. F.; Day, A. R.; Kuhn, L. A. Modeling
Correlated Main-Chain Motions in Proteins for Flexible Recognition. Proteins:
Struct. Funct. Bioinf. 2004, 57, 243–261.
18. Sherman, W.; Day, T.; Jacobson, M. P.; Friesner, R. A.; Farid, R. Novel Procedure
for Modeling Ligand/Receptor Induced Fit Effects. J. Med. Chem. 2006, 49, 534–
553.
19. Moitessier, N.; Westhof, E.; Hanessian, S. Docking of Aminoglycosides to
Hydrated and Flexible RNA. J. Med. Chem. 2006, 49, 1023–1033.
20. Moitessier, N.; Henry, C.; Maigret, B.; Chapleur, Y. Combining Pharmacophore
Search, Automated Docking, and Molecular Dynamics Simulations as a Novel
Strategy for Flexible Docking. Proof of Concept: Docking of Arginine-Glycine-
Aspartic Acid-like Compounds into the αvβ3 Binding Site. J. Med. Chem. 2004, 47,
4178–4187.
21. Moitessier, N.; Therrien, E.; Hanessian, S. A Method for Induced-fit Docking,
Scoring and Ranking of Flexible ligands. Application to Peptidic and
Pseudopeptidic BACE 1 Inhibitors. J. Med. Chem. 2006, 49, 5885–5894.
22. Wei, B. Q.; Weaver, L. H.; Ferrari, A. M.; Matthews, B. W.; Shoichet, B. K.
Testing a Flexible-Receptor Docking Algorithm in a Model Binding Site. J. Mol.
Biol. 2004, 337, 1161–1182.
23. CDiscover, 98.0; Accelrys, Inc.: San Diego, CA, 2001.
24. Fletcher, R.; Reeves, C. M. Function Minimization by Conjugate Gradients. Comp.
J. 1964, 7, 149–154.
25. Morris, G. M.; Goodsell, D. S.; Halliday, R. S.; Huey, R.; Hart, W. E.; Belew, R.
K.; Olson, A. J. Automated Docking Using a Lamarckian Genetic Algorithm and
an Empirical Binding Free Energy Function. J. Comput. Chem. 1998, 19, 1639–1662.
26. http://www.rcsb.org/pdb/file_formats/pdb/pdbguide2.2/guide2.2_frame.html
27. Corey, E. J.; Wipke, W. T. Computer-Assisted Design of Complex Organic
Syntheses. Science 1969, 166, 178–192.
28. Gasteiger, J.; Marsili, M. Iterative partial equalization of orbital electronegativity--
a rapid access to atomic charges. Tetrahedron 1980, 36, 3219–3228.
29. Haupt, R. L. 1995. Optimization of aperiodic conducting grids. 11th An. Rev.
Progress in Applied Computational Electromagnetics Conf. Monterey, CA.
30. Weiner, S. J.; Kollman, P. A.; Nguyen, D. T.; Case, D. A. An All Atom Force Field
for Simulations of Proteins and Nucleic Acids. J. Comput. Chem. 1986, 7, 230–252.
31. See for example: Verdonk, M. L.; Chessari, G.; Cole, J. C.; Hartshorn, M. J.;
Murray, C. W.; Nissink, J. W. M.; Taylor, R. D.; Taylor, R. Modeling Water
Molecules in Protein-Ligand Docking Using GOLD. J. Med. Chem. 2005, 48,
6504–6515.
32. Yamazaki, T.; Nicholson, L. K.; Torchia, D. A.; Wingfield, P.; Stahl, S. J.;
Kaufman, J. D.; Eyermann, C. J.; Hodge, C. N.; Lam, P. Y. S.; Ru, Y.; Jadhav, P.
K.; Chang, C.-H.; Weber, P. C. NMR and X-ray Evidence That the HIV Protease
Catalytic Aspartyl Groups Are Protonated in the Complex Formed by the Protease
and a Non-peptide Cyclic Urea Inhibitor. J. Am. Chem. Soc. 1994, 116, 10791–
10792.
33. Friesner, R. A.; Banks, J. L.; Murphy, R. B.; Halgren, T. A.; Klicic, J. J.; Mainz, D.
T.; Repasky, M. P.; Knoll, E. H.; Shelley, M.; Perry, J. K.; Shaw, D. E.; Francis, P.;
Shenkin, P. S. Glide: A New Approach for Rapid, Accurate Docking and Scoring.
1. Method and Assessment of Docking Accuracy. J. Med. Chem. 2004, 47, 1739–
1749.
34. Ferrari, A. M.; Wei, B. Q.; Costantino, L.; Shoichet, B. K. Soft Docking and
Multiple Receptor Conformations in Virtual Screening. J. Med. Chem. 2004, 47,
5076–5084.
35. Wang, R.; Fang, X.; Lu, Y.; Wang, S. The PDBbind database: Collection of
binding affinities for protein-ligand complexes with known three-dimensional
structures. J. Med. Chem. 2004, 47, 2977–2980.
36. Wang, R.; Fang, X.; Lu, Y.; Yang, C. Y.; Wang, S. The PDBbind database:
Methodologies and updates. J. Med. Chem. 2005, 48, 4111–4119.
37. Tiraboschi, G.; Gresh, N.; Giessner-Prettre, C.; Pedersen, L. G.; Deerfield, D. W.
Parallel Ab Initio and Molecular Mechanics Investigation of Polycoordinated Zn(II)
Complexes with Model Hard and Soft Ligands: Variations of Binding Energy and
of Its Components with Number and Charges of Ligands. J. Comput. Chem. 2000, 21,
1011–1039.
CHAPTER 3
- 119 -
CHAPTER THREE
Encouraged by the promising results obtained with the first version of FITTED
discussed in the previous chapter, we further optimized the program to allow for a time
efficient virtual screening of a virtual library. The initial version was quite slow and
required improvements in the selection of compounds to be docked and in the creation of
the initial population which increased the overall speed of the program. This increase in
speed then enabled its application to the virtual screening of the Maybridge library
against HCV polymerase and the discovery of two novel lead compounds.
This chapter is a copy and is reproduced with permission from the Journal of Chemical
Information and Modeling. This article is cited as Corbeil, C. R.; Englebienne, P.;
Yannopoulos, C. G.; Chan, L.; Das, S. K.; Bilimoria, D.; Heureux, L.; Moitessier, N.,
Docking Ligands into Flexible and Solvated Macromolecules. 2. Development and
Application of FITTED 1.5 to the Virtual Screening of Potential HCV Polymerase
Inhibitors. Journal of Chemical Information and Modeling 2008, 48, (4), 902-909.
Copyright 2008, with permission from the American Chemical Society.
DOCKING LIGANDS INTO FLEXIBLE AND SOLVATED
MACROMOLECULES. 2.
DEVELOPMENT AND APPLICATION OF FITTED 1.5 TO THE
VIRTUAL SCREENING OF POTENTIAL HCV POLYMERASE
INHIBITORS.
ABSTRACT
HCV NS5B polymerase is a validated target for the treatment of hepatitis C,
known to be one of the most challenging enzymes for docking programs. In order to
improve the low accuracy of existing docking methods observed with this challenging
enzyme, we have significantly modified and updated FITTED 1.0, a recently reported
docking program, into FITTED 1.5. This enhanced version is now applicable to the virtual
screening of compound libraries and includes new features such as filters and
pharmacophore- or interaction site-oriented docking. As a first validation, FITTED 1.5
was applied to the testing set previously developed for FITTED 1.0 and extended to
include HCV polymerase inhibitors. This first validation showed an increased accuracy
as well as an increase in speed. It also shows that the accuracy towards HCV polymerase
is better than previously observed with other programs. Next, application of FITTED 1.5
to the virtual screening of the Maybridge library seeded with known HCV polymerase
inhibitors revealed its ability to recover most of these actives in the top 5% of the hit list.
As a third validation, further biological assays uncovered HCV polymerase inhibition for
selected Maybridge compounds ranked in the top of the hit list.
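The second validation reported in this abstract is a retrospective screen: known actives are seeded into a library, the whole library is ranked by docking score, and one counts how many actives appear in the top 5% of the hit list. A minimal sketch of that bookkeeping follows. The compound identifiers and the ranking are made up for illustration and do not correspond to the Maybridge screen.

```python
def fraction_recovered(ranked_ids, active_ids, top_fraction=0.05):
    """Fraction of the known actives found in the top part of a ranked hit list."""
    n_top = max(1, round(len(ranked_ids) * top_fraction))
    top_of_list = set(ranked_ids[:n_top])
    actives = set(active_ids)
    return len(top_of_list & actives) / len(actives)


# Hypothetical ranked library of 100 compounds (best score first), 3 seeded actives.
ranked_library = [f"cmpd{i}" for i in range(100)]
seeded_actives = ["cmpd0", "cmpd3", "cmpd50"]

recovered = fraction_recovered(ranked_library, seeded_actives)
```

Here two of the three seeded actives fall within the first five positions, so two thirds of the actives are recovered in the top 5%.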
INTRODUCTION
Docking-based Virtual Screening. Various approaches to the design or identification of
new drugs have recently been developed and successfully applied, including both
experimental (e.g., SAR by NMR1) and computational approaches (e.g., docking2, 3). In
modern drug design, docking-based virtual screening (VS) methods provide a quick and
cost-effective alternative to high-throughput screening (HTS).2, 3 Many recent VS
applications have been reported and demonstrate an increasing level of accuracy for the
currently available methods.4, 5 However, to date, only a few docking programs (e.g.,
FlexX-Ensemble,6 AutoDock 4.0 7) can take into account conformational changes that
occur as a result of binding to a ligand. FITTED 1.0 (Flexibility Induced Through
Targeted Evolutionary Description) is a docking program that was recently developed
and validated against a set of co-crystallized protein/ligand complexes.8, 9 This program
not only accurately predicts the ligand binding mode, but it also predicts the optimal
protein conformation and the presence/absence and location of water molecules with a high
level of accuracy. Recently, HCV polymerase has been found to be a very challenging
protein for docking programs. Our long-term goal is to develop a docking-based virtual
screening tool that can be applied to as many proteins as possible. However,
the initial version of FITTED (v. 1.0) was found to be too slow to be applicable to large VS
studies. Efforts to increase the speed without affecting the accuracy were necessary. In
addition, its accuracy for HCV polymerase inhibitors was low. We report herein an
enhanced version of FITTED (v. 1.5), its validation and its application to the discovery of
HCV polymerase inhibitors.
Hepatitis C. Hepatitis C virus (HCV) is the major causative agent responsible for non-A,
non-B hepatitis, affecting over 170 million people worldwide. Chronic HCV infection
often results in liver fibrosis, liver cirrhosis, hepatocellular carcinoma, and other forms of
liver dysfunction.10, 11 Given the widespread impact of this disease, there is a substantial
medical need for the discovery of new and effective anti-HCV agents to complement
current therapies.12 The impetus for the identification of agents that will be part of a
potent and effective combination regimen is growing in view of the inevitability of the
development of drug resistant mutations. Extensive efforts have been devoted toward the
study of the NS5B RNA-dependent RNA polymerase due to its critical function in the
replication cycle of the virus. Positive results from several clinical trials have indeed
validated the HCV NS5B polymerase as a target for the therapy of HCV infections. For
example, nucleoside analogs (e.g., Valopicitabine (NM283, 1)13, 14 and R1626 (2)15) and
non-nucleoside or allosteric (e.g., HCV-796 (3)16) inhibitors of HCV NS5B have been shown
to be effective either alone or in combination with interferon (Figure 3.1). Others such as
VCH-759 17, 18 and GSK625433 19 are currently being evaluated in clinical trials.
Figure 3.1 - Selected HCV polymerase inhibitors.
HCV Polymerase, a Flexible Protein. We have recently shown that at least two major
conformations can be adopted by the HCV polymerase upon binding of inhibitors to an
allosteric site located in the thumb region.20, 21 The main difference appeared to be a
significant shift of the α helix T located in the binding site (Figure 3.2). It is therefore
critical to account for this HCV polymerase thumb binding site plasticity in both hit
identification and inhibitor design stages. In a recent comparative study by Warren et al.,
it was shown that none of the assessed docking programs predicted the experimentally
observed binding modes of HCV polymerase inhibitors with high accuracy.22 In fact, the
flexibility of this protein can in part explain this poor accuracy. This challenge was the
starting point of the development of FITTED 1.5.
Figure 3.2 - Helix T perturbation upon inhibitor binding. Blue and grey ribbon
representations are from two different X-ray complexes.20, 21
THEORY AND IMPLEMENTATION
FITTED 1.0 and 1.5. As discussed above, two major aspects have to be considered for
implementation into FITTED 1.5. First, the newer version should be applicable to VS
studies. Second, FITTED 1.5 should be accurate enough with HCV polymerase inhibitors.
Flexible ligand/flexible protein docking programs have seldom been applied to
VS.23 In practice, taking flexibility into account significantly enlarges the search space,
thus reducing throughput and drastically impeding implementation in VS campaigns. We
hypothesized that evaluating FITTED in this context would assess the role of flexibility in
VS studies against HCV polymerase. FITTED 1.0 is a suite of programs for docking that
considers ligand and protein flexibility by means of a genetic algorithm,24 while the
displacement of water molecules is accounted for by means of a specific potential energy
function.8, 25 During the docking process, the protein side-chains and backbone
conformations, the water molecule positions and the ligand torsion angles are coded as
genes and optimized through a combined Lamarckian/Darwinian evolution. This early
version of the program was developed to dock single compounds in proof-of-concept
studies with no consideration for CPU time requirements.8 This, obviously, is a serious
limitation in the context of a large VS study. In order to optimize the software for
efficiency and speed, a stepwise approach to identify and remove inappropriate
candidates (poses) early in the process was implemented in FITTED 1.5. In addition,
preliminary studies have shown that the accuracy of FITTED 1.0 with HCV polymerase
inhibitors should be improved. The various modifications and implementations, which
required major rewriting of the program, are listed and described in the following
sections.
SMART. SMART (Small Molecule Atom typing and Rotatable Torsion assignment) is a
module of FITTED used to prepare the ligands to be docked. In contrast to the original
program developed with FITTED 1.0, the current version now describes the compound
features with a bit string added to the compound’s mol2 file. The bit string includes the
following descriptors: molecular weight, number of rotatable bonds, net charge and
presence of functional groups such as known toxicophores or reactive groups (e.g., nitro
groups, aldehydes, and Michael acceptors) or labile imines. The descriptors are then
used by FITTED to filter out compounds not fulfilling Lipinski’s rules26 or having
undesired (user-defined) functionalities. These descriptors can also restrict the search to
compounds with needed functionalities (e.g., aldehydes and nitriles for reversible
covalent inhibitors). Although this simple approach is not expected to accurately
discriminate between drug-like molecules and non drug-like molecules, it will orient our
study towards a “cleaner” compound set.
Interaction Site Filter. FITTED 1.0 included a functionality that filtered out poses that did
not fulfill constraints imposed by the user (e.g., binding to metals). It also included a
function that ensured that poses were within the binding site (ClashScore) prior to any
further optimization or more complex scoring. ClashScore, which is a binary score, uses
a series of spheres representing the accessible cavity space. Each pose is then compared
to this set of spheres and a score (“in” or “out”) is computed. This crude score is used to
discard poses that are not located within the binding site. After modifications, FITTED 1.5
can pre-select the poses that are the most apt to be successfully docked with a number of
predefined interaction sites (Figure 3.3). A set of interaction sites is similar to a
pharmacophore but automatically generated by ProCESS from the protein structure alone.
FITTED also allows for the use of a manually created pharmacophore which may exploit
user expertise, as the one shown in Figure 3.4, or for the use of both automatically-
generated interaction sites and user-defined pharmacophores (Figure 3.3). The inclusion
of a pharmacophore component in virtual screening has been shown to enhance
efficiency and accuracy of docking methods in previous studies,27, 28 including HCV
polymerase.22
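The binary ClashScore test described above can be illustrated with a minimal Python sketch. This is not the FITTED implementation: the function name, the data layout and the criterion that every ligand atom must lie inside at least one cavity sphere are assumptions made for illustration only.

```python
import math

def clash_score(pose_atoms, cavity_spheres):
    """Return True ("in") if every atom lies inside at least one cavity sphere.

    pose_atoms: list of (x, y, z) ligand atom coordinates.
    cavity_spheres: list of ((x, y, z), radius) tuples (hypothetical layout).
    """
    for atom in pose_atoms:
        inside = any(math.dist(atom, center) <= radius
                     for center, radius in cavity_spheres)
        if not inside:
            return False  # "out": discard the pose before any costly scoring
    return True

# Toy cavity made of two overlapping spheres
cavity = [((0.0, 0.0, 0.0), 3.0), ((4.0, 0.0, 0.0), 3.0)]
pose_in = [(0.5, 0.5, 0.0), (3.5, 0.0, 1.0)]
pose_out = [(0.5, 0.5, 0.0), (9.0, 0.0, 0.0)]

print(clash_score(pose_in, cavity), clash_score(pose_out, cavity))
```

Because the test needs only distance comparisons, it can be run on every random pose before any energy term is evaluated.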
Figure 3.3 - Consensus docking. Application to the generation of the initial population.
Figure 3.4 - Binding site pharmacophore for HCV polymerase: red: hydrogen bond
acceptor; green: hydrophobic/aromatic; yellow: either hydrogen bond acceptor or
hydrophobic/aromatic.
The interaction sites (and pharmacophores) are represented by a series of spheres.
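A sphere diameter defines the allowed volume of the constraint and a weight (w in eq 3.1)
other than one can be assigned to each sphere.
$$W_{\mathrm{sphere}} = 100 \times \frac{w}{\sum w} \qquad (3.1)$$
$$\mathrm{MatchScore}_{\mathrm{pose}} = \sum_{\mathrm{matched\ spheres}} W_{\mathrm{sphere}} \qquad (3.2)$$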
Each generated pose is compared to the interaction site (and/or pharmacophore)
and a MatchScore (and/or PharmScore) (eq 3.2) ranging from 0 to 100% is computed. If the
atom types of the ligand atoms lying within the volume of the sphere match the
interaction site/pharmacophore sphere’s pharmacophoric properties, the weight of the
sphere is added to the MatchScore (or PharmScore) for that pose. FITTED 1.5 then
discards poses with a low MatchScore (and/or PharmScore), thereby reducing the
required CPU time by directing the docking toward strongly interacting poses. It is well
known that higher success rates are obtained when rescoring of poses is performed using
other scoring functions (consensus scoring). In the present version of FITTED, up to four
scores are computed while docking (ClashScore mentioned above, MatchScore,
PharmScore and GAFFScore derived from the computed General AMBER Force Field29
(GAFF) potential energy) and can be combined to discriminate active from inactive
compounds (Figure 3.3). These four scores are applied in order of decreasing speed (fastest first)
and allow FITTED to eliminate poses that score poorly with one function before
proceeding with the next one. This filtering of poses is carried out both during the
generation of the initial population (Figure 3.3) and during the evolution, and can
be viewed as consensus docking.30 This feature significantly reduces the time required to
dock a single compound and increases its accuracy in three ways. First, the MatchScore
and PharmScore are quicker to compute than the GAFF potential energy. Second, “bad”
poses are not considered for energy minimization, a time-consuming step in the docking
process. Third, poses with reasonable GAFFScores but poor chemical complementarity
with the protein were found using FITTED 1.0. In contrast, FITTED 1.5 assigns a low
MatchScore to these poses, thus reducing the number of false positives.
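The following Python sketch illustrates the MatchScore computation and the fastest-first filtering described above. It assumes that each sphere weight is normalized as W = 100 w / Σw, so that a pose matching every sphere scores 100%. The data layout, the use of a radius rather than a diameter, the feature labels and all function names are hypothetical and are not taken from FITTED.

```python
import math

def sphere_weights(raw_weights):
    """Normalize the user weights w so that they sum to 100."""
    total = sum(raw_weights)
    return [100.0 * w / total for w in raw_weights]

def match_score(pose_atoms, spheres):
    """Sum the normalized weights of all matched spheres.

    pose_atoms: list of ((x, y, z), feature) pairs, e.g. feature "HBA".
    spheres: list of dicts with keys center, radius, feature, w (assumed layout).
    """
    weights = sphere_weights([s["w"] for s in spheres])
    score = 0.0
    for sphere, weight in zip(spheres, weights):
        matched = any(
            feature == sphere["feature"]
            and math.dist(xyz, sphere["center"]) <= sphere["radius"]
            for xyz, feature in pose_atoms
        )
        if matched:
            score += weight
    return score

def first_failed_filter(pose, filters):
    """Apply (name, test) pairs in the given order, cheapest first.

    Returns the name of the first filter the pose fails, or None if it passes all.
    """
    for name, test in filters:
        if not test(pose):
            return name
    return None

spheres = [
    {"center": (0.0, 0.0, 0.0), "radius": 1.5, "feature": "HYD", "w": 2},
    {"center": (5.0, 0.0, 0.0), "radius": 1.5, "feature": "HBA", "w": 1},
    {"center": (0.0, 5.0, 0.0), "radius": 1.5, "feature": "HBA", "w": 1},
]
pose = [((0.2, 0.0, 0.0), "HYD"), ((5.0, 0.5, 0.0), "HBA"), ((0.0, 5.0, 0.0), "HYD")]
filters = [("MatchScore", lambda p: match_score(p, spheres) >= 60.0)]

print(match_score(pose, spheres), first_failed_filter(pose, filters))
```

In this toy case the third sphere is occupied by an atom of the wrong type, so only two of the three spheres contribute to the score. More expensive tests, such as an energy evaluation, would be appended to `filters` after the cheap ones.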
ProCESS. ProCESS (Protein Conformation Ensemble System Setup) is the second module
of FITTED used to prepare the protein file. As described in our previous report,8 ProCESS
also prepares the set of spheres representing the cavity space used to compute ClashScore
(see above). The current version of ProCESS can now derive a set of interaction sites such
as ideal locations for hydrogen bond donors and acceptors, hydrophobic and aromatic
groups.
Quick Docking. When docking a potential ligand, FITTED generates an initial population
and then simulates its evolution. Although this is appropriate for “good” binders, it may
be inappropriate for compounds which are, for instance, too large or too hydrophobic and
therefore should be excluded prior to this time-consuming conformational search. For
this purpose, additional filters were implemented to prevent undesirable compounds from
being docked. First, compounds lacking the required pharmacophoric groups are
excluded. Then, FITTED generates random poses, up to a maximum number of attempts, to produce the
initial population. If 100,000 possible binding modes are generated without accepting one
into the initial population (low MatchScore and/or not in the cavity), the program aborts
and docks the next compound on the list. This stage, based on simple shape
discrimination, does not require any CPU-intensive energy or score computation and can
be done within a few seconds per compound.
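The abort criterion of the quick docking stage can be sketched as follows. This is an illustrative reading rather than the FITTED code: the counter is reset after each accepted pose, which is an assumption, and `generate_pose` and `accept` are hypothetical callables standing in for random pose generation and for the ClashScore/MatchScore tests.

```python
import random

def build_initial_population(generate_pose, accept, size, max_rejections=100_000):
    """Collect `size` accepted poses.

    Returns None (abort, move on to the next compound) if `max_rejections`
    random poses are rejected without a single one being accepted.
    """
    population = []
    rejections = 0
    while len(population) < size:
        pose = generate_pose()
        if accept(pose):
            population.append(pose)
            rejections = 0  # assumption: the counter restarts after a success
        else:
            rejections += 1
            if rejections >= max_rejections:
                return None
    return population

# Toy example: a "pose" is a single number and is accepted if close to zero
rng = random.Random(0)
generate = lambda: rng.uniform(-10.0, 10.0)
population = build_initial_population(generate, lambda p: abs(p) < 2.0, size=5)
aborted = build_initial_population(generate, lambda p: False, size=5, max_rejections=1000)

print(len(population), aborted)
```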
Refined Docking. A close look at the evolution of the score of the top pose as the docking
proceeds revealed that the scores computed after a few generations are typically within
1.0 to 1.5 kcal/mol of the score of the final pose. It also indicated that at this stage of the
evolution, poses close to the native pose are not always identified. These two
observations demonstrate: 1. the high quality of the initial population and therefore the
identification of poses with good scores early in the evolution; and 2. the need for a
multi-generation evolution process to produce the correct (i.e., experimentally observed)
pose. Thus, if after a few generations (e.g., 5) the score is not satisfactory, the docking
can be aborted. We have therefore implemented new functions (and keywords) into
FITTED to account for this intermediate evaluation. In practice more than one of these
intermediate evaluations can be used to further reduce the number of generations carried
out with a potentially inactive compound.
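The intermediate evaluations can be pictured as checkpoints inside the evolution loop. The sketch below is a simplified illustration under stated assumptions: lower scores are better, `evolve_one_generation` and `best_score` are hypothetical callables, and the checkpoint generations and cutoffs are user-supplied. The actual FITTED keywords are not reproduced here.

```python
def dock_with_checkpoints(evolve_one_generation, best_score, population,
                          checkpoints, total_generations):
    """Evolve a population, aborting early at unsatisfactory checkpoints.

    checkpoints: dict mapping a generation number to the worst acceptable
    best score at that generation (lower scores are better).
    Returns (final_population or None, number_of_generations_run).
    """
    for generation in range(1, total_generations + 1):
        population = evolve_one_generation(population)
        cutoff = checkpoints.get(generation)
        if cutoff is not None and best_score(population) > cutoff:
            return None, generation  # score not satisfactory: stop docking
    return population, total_generations

# Toy example: a population is a list of scores that improve by 0.2 per generation
evolve = lambda pop: [score - 0.2 for score in pop]
weak = dock_with_checkpoints(evolve, min, [-4.0, -3.0], {5: -5.25}, 10)
strong = dock_with_checkpoints(evolve, min, [-6.0, -3.0], {5: -5.25}, 10)

print(weak, strong[1])
```

Several entries can be placed in `checkpoints` to apply more than one intermediate evaluation during the same run.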
Scoring. The force field used in FITTED 1.0 (AMBER84) was not appropriate for most of
the small drug-like molecules like the ones found in virtual libraries. Instead we have
used the General AMBER Force Field (GAFF)29 for the description of the small molecules.
This required a series of modifications to the force field itself as a specific format has to
be used to be readable by FITTED. SMART was also modified to assign these new atom
types. Finally, a simple automated parameter estimator was developed. Although GAFF
parameters span a large variety of functional groups, some are missing; these are
estimated on the fly by FITTED 1.5. These parameters are simply derived from the input
structures; bond lengths and angles of the input structures are used as equilibrium values
and reported in a specific log file. These listed missing parameters can later be further
optimized and added to the force field for the next study or to perform a second run on
these specific molecules. In fact, our own version of GAFF is regularly updated to
include more functional groups and heteroaromatic rings.
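As a rough illustration of the parameter estimator, the sketch below takes bond lengths and angles measured on the input structure as equilibrium values for terms that are missing from a parameter table. The data layout and function names are hypothetical, the atom type "xx" is a placeholder, and force constants are deliberately left out because the text does not say how they are chosen.

```python
import math

def bond_length(a, b):
    """Distance between two points, in the units of the input coordinates."""
    return math.dist(a, b)

def bond_angle(a, b, c):
    """Angle a-b-c in degrees, with b as the central atom."""
    v1 = [p - q for p, q in zip(a, b)]
    v2 = [p - q for p, q in zip(c, b)]
    cosine = sum(x * y for x, y in zip(v1, v2)) / (math.hypot(*v1) * math.hypot(*v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))

def estimate_missing(atoms, bonds, angles, known_types):
    """Return {atom-type tuple: equilibrium value} for terms not in known_types.

    atoms: list of (atom_type, (x, y, z)); bonds and angles: tuples of atom indices.
    The returned dict plays the role of the log of guessed parameters.
    """
    log = {}
    for i, j in bonds:
        key = (atoms[i][0], atoms[j][0])
        if key not in known_types:
            log[key] = round(bond_length(atoms[i][1], atoms[j][1]), 3)
    for i, j, k in angles:
        key = (atoms[i][0], atoms[j][0], atoms[k][0])
        if key not in known_types:
            log[key] = round(bond_angle(atoms[i][1], atoms[j][1], atoms[k][1]), 1)
    return log

atoms = [("c3", (1.0, 0.0, 0.0)), ("xx", (0.0, 0.0, 0.0)), ("c3", (0.0, 1.5, 0.0))]
log = estimate_missing(atoms, bonds=[(0, 1), (1, 2)], angles=[(0, 1, 2)], known_types=set())

print(log)
```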
Whereas ClashScore, PharmScore, MatchScore and GAFFScore are used upon
docking to identify the correct pose, the RankScore scoring function reported
previously24 is used to assign the final poses a score describing their binding affinities.
VALIDATION OF FITTED 1.5
FITTED 1.0 vs. 1.5. As all these changes may affect the accuracy of FITTED, we used the
testing set previously prepared for FITTED 1.0 to evaluate the accuracy of the current
version. This set consists of ligands complexed with HIV-1 protease, thymidine kinase,
factor Xa, trypsin and MMP-3. Table 3.1 summarizes the accuracy obtained for the self-
docking of these 33 inhibitors (“Rigid” mode) as well as their docking to flexible
proteins. The “Semiflexible” mode, as defined in our previous report,8 corresponds to the
docking of ligands to conformational ensembles of protein structures while the “Flexible”
mode corresponds to a fully flexible protein structure.8 The detailed results for each of
the 33 systems are given as supporting information. The computed RMSDs (root mean
square deviation) compare the modeled “docked” binding mode to the observed one
(from crystal structures). For this study, the interaction sites were generated by ProCESS.
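For reference, the RMSD and the success criteria used in this validation can be computed as in the short Python sketch below. It assumes that docked and observed poses list the same atoms in the same order and share the protein coordinate frame, so no superposition is applied. Symmetry-equivalent atoms are not handled, and the function names are illustrative.

```python
import math

def rmsd(docked, observed):
    """Root mean square deviation between two equally ordered coordinate lists."""
    if len(docked) != len(observed):
        raise ValueError("poses must contain the same number of atoms")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(docked, observed))
    return math.sqrt(squared / len(docked))

def success_rate(rmsds, cutoff):
    """Percentage of docking runs with an RMSD below the cutoff (in Å)."""
    return 100.0 * sum(r < cutoff for r in rmsds) / len(rmsds)

docked = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
observed = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
runs = [0.5, 1.5, 2.5, 0.8]  # toy RMSD values in Å

print(rmsd(docked, observed), success_rate(runs, 1.0), success_rate(runs, 2.0))
```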
Table 3.1 - Comparison of FITTED 1.0 with FITTED 1.5

                           % Success a
                           version 1.0            version 1.5
Mode                       < 1.0 Å    < 2.0 Å     < 1.0 Å    < 2.0 Å
Rigid (self-docking)       33         79          84         93
Rigid (cross-docking)      21         47          51         74
Semiflexible               36         73          78         84
Flexible                   57         73          78         88

a Two criteria of success are shown. A docking run is considered successful if the RMSD
between modeled and experimental binding mode is within 1.0 or 2.0 Å respectively.
Overall, accuracy is significantly increased from version 1.0 to 1.5. More
specifically, there is an enhanced accuracy when examining systems for which the
RMSDs are below 1.0 Å. This observation is most likely due to the use of interaction
sites to guide the docking. With this implementation, FITTED 1.5 generates and considers
only poses that already passed the MatchScore and/or PharmScore filter. Thus, the
initial population as well as the offspring produced during the
evolution of the population are of higher quality than with FITTED 1.0. Along with the
increased accuracy, a 3-fold increase in speed was observed. In addition, as observed
previously with FITTED 1.0, the docking to flexible proteins is nearly as accurate as self-
docking and much more accurate than cross-docking.
With these encouraging results in hand, we focused our attention on HCV
polymerase inhibitors. For this purpose, two sets of protein/inhibitor complexes were
initially built and very recently extended as novel crystal structures have been reported
and made available. The first set includes sixteen inhibitors bound to the allosteric site
described above while the second set includes seven inhibitors bound to the catalytic site.
A second allosteric site has been reported but is not used herein.31 We first used the
interaction sites as for the testing set above. The results summarized in
Table 3.2 show that the accuracy obtained when docking to the allosteric site of the
HCV polymerase was not as high as for the testing set used above, but they were
nevertheless considered reasonable. In order to increase the accuracy of this program and
to eventually increase the enrichment factor of the VS study, we manually defined a
pharmacophore used in place of the interaction sites by FITTED (Figure 3.3). In this case,
the accuracy was slightly increased, while the required CPU time was not affected. In this
study, we used the empty space found within a sphere of 40 Å, designated as large cavity in
Table 3.2. A more focused binding site (25 Å) led to a significant increase in
accuracy when self-docking was considered, but only a slight increase in accuracy when
a flexible protein was used. Interestingly, the use of a flexible protein was found to be
significantly more accurate than cross-docking, indicating that its implementation should
increase the accuracy of FITTED in VS studies against HCV polymerase.
Table 3.2 - Docking of HCV polymerase inhibitors to the allosteric site with FITTED 1.5.

                           % Success a
                           Interaction sites      Pharmacophore          Pharmacophore
                           Large cavity b         Large cavity b         Focused cavity c
Mode                       < 1.0 Å    < 2.0 Å     < 1.0 Å    < 2.0 Å     < 1.0 Å    < 2.0 Å
Rigid (self-docking)       56         75          63         75          81         81
Rigid (cross-docking)      0          0           13         31          6          38
Semiflexible               38         50          56         69          63         69
Flexible                   25         31          47         59          38         63

a Two criteria of success are shown. A docking run is considered successful if the
RMSD between modeled and experimental binding mode is within 1.0 or 2.0 Å
respectively.
b 40 Å diameter cavity.
c 25 Å diameter cavity.
We then turned our attention to the catalytic site; however, docking of the seven
reported inhibitors was initially unsuccessful (Table 3.3). Considering the size of this
very large binding site, this disappointing result is not surprising. While this work was
ongoing, Warren et al. reported a large comparative study including 13 HCV polymerase
inhibitors.22 In their study, only 2 programs docked one out of the 13 inhibitors to the
catalytic site with RMSD below 2.0 Å while the other 8 programs failed with all the
inhibitors. Although their set (not given) and ours may be different, there is at least one
HCV polymerase inhibitor common to both sets.
Warren et al. also mentioned that “no docking program was able to generate
docked poses within 2 Å for ≥40% of the compounds” when only the NTP site is
considered. Unfortunately, no details were provided. In order to orient the docking
towards this binding site, we used ProCESS to automatically generate interaction sites and
spheres representing the binding site cavity centered on this site. Much to our delight,
FITTED was found to dock 5 out of the seven inhibitors with RMSDs below 1.2 Å in self-
docking experiments and the same 5 with RMSDs below 1.5 Å when the semiflexible
mode was selected. These results therefore position our program among the top of the list
of assessed programs for this HCV polymerase site. We believe that the good accuracy
observed with FITTED is due to the consensus docking approach implemented in FITTED
1.5, which is expected to accurately filter out unreasonable poses. We also found that
introducing protein flexibility led to a significant increase in accuracy relative to
cross-docking.
Table 3.3 - Docking of HCV polymerase inhibitors to the catalytic site.

                                   % Success a
                                   Whole cavity           NTP site
Mode                               < 1.0 Å    < 2.0 Å     < 1.0 Å    < 2.0 Å
Rigid (self-docking)               0          0           57         71
Rigid (cross-docking)              0          0           0          14
SemiFlexible                       0          0           43         57
Flexible                           0          0           29         43
Flo, Gold, Glide, DockIt, MVP,     -          0 to 8 b    -          0 to 60 b
LigFit, Dock4, FlexX, Fred, MOE

a Two criteria of success are shown. A docking run is considered successful if the RMSD
between modeled and experimental binding mode is within 1.0 or 2.0 Å respectively.
b Data for self-docking (corresponding to Rigid mode) from Warren et al.22
APPLICATION TO THE SCREENING OF A LIBRARY AGAINST THE HCV POLYMERASE.
The previous validation demonstrated that the current version of FITTED docked
inhibitors with reasonable accuracy and also demonstrated the key role of the protein
flexibility accounted for by FITTED in the HCV polymerase context. However, this
validation did not provide any indication about its ability to identify active compounds
within a large set (i.e., to rank known inhibitors at the top of the hit list). For this purpose,
we first selected the well-studied thiophene site, for which we have collected much data
from both SAR and X-ray crystallography. Another two sites (catalytic and allosteric) are
also validated targets but were not considered here. The Maybridge library of drug-like
molecules, which was obtained from the ZINC web site,32, 33 was seeded with known
actives with activities ranging from nanomolar to micromolar.
To account for the site flexibility, we used two inhibitor-bound in-house crystal
structures in a “semi-flexible” docking run, an option implemented in FITTED that allows
the simultaneous docking of a flexible ligand to more than one protein structure. The
pharmacophore shown in Figure 3.4 includes six spheres identifying two hydrophobic
pockets (shown in green), three sites for hydrogen-bond acceptors (HBA, shown in red)
and a mixed hydrophobic/HBA site (shown in yellow). At this last location, a phenyl ring
may interact with His475 via aromatic ring stacking and/or π-cation interaction with
Lys533. Experimental data showed that hydrogen bonds with Tyr477 and Ser476 are key
interactions and that two out of the three defined hydrophobic pockets are often
targeted.20 A compound that binds without filling the deep hydrophobic pocket delineated
by Met423, Trp528 and Leu419 would trap water molecules, a phenomenon that is highly
disfavored. Similarly, desolvation of the other hydrophobic pocket defined by Leu419,
Ile482 and Leu497 is favored upon binding. To account for these specific situations, the
spheres representing these hydrophobic pockets are given a larger priority (w = 2) than
the other 4 (w = 1) and poses with a MatchScore lower than 60% are discarded.
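To make the effect of these weights concrete, the arithmetic below works through the six-sphere pharmacophore. It assumes that each sphere contributes a normalized weight W = 100 w / Σw to the MatchScore, with two hydrophobic spheres at w = 2 and four other spheres at w = 1. The pose combinations shown are illustrative examples and not results from the study.

```latex
\begin{align*}
\sum w &= 2 \times 2 + 4 \times 1 = 8 \\
W_{\text{hydrophobic}} &= 100 \times \tfrac{2}{8} = 25\% \\
W_{\text{other}} &= 100 \times \tfrac{1}{8} = 12.5\% \\
\text{both hydrophobic pockets and one other sphere:}\quad 25 + 25 + 12.5 &= 62.5\% \ge 60\% \\
\text{one hydrophobic pocket and three other spheres:}\quad 25 + 3 \times 12.5 &= 62.5\% \ge 60\% \\
\text{four other spheres only:}\quad 4 \times 12.5 &= 50\% < 60\%
\end{align*}
```

Under this reading, a pose that fills neither hydrophobic pocket cannot pass the 60% cutoff, whatever else it matches.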
Stepwise Screening. As described above, the fully automated FITTED 1.5 screening
protocol can be broken down into five distinct steps (Figure 3.6): 1. filtering out non-
drug-like compounds; 2. filtering out compounds that cannot match the binding site
pharmacophore and/or the binding site cavity; 3. quick docking and discarding
compounds with RankScore values higher than -5.25; 4. refined docking of the best
candidates; 5. selection of the best scoring compounds.
This stepwise approach was applied to the Maybridge set seeded with 23 known
active inhibitors including very weak inhibitors. A large variety of known ligands were
selected and some are illustrated in Figure 3.5. As an additional test for FITTED, one of
these HCV polymerase inhibitors (compound 7) has been reported to have a high anti-HCV
activity, with the (R) enantiomer being the most active.34, 35 We therefore spiked
our set with the two enantiomeric forms of 7 in order to assess their binding to the
allosteric site and their relative predicted binding affinity. In contrast to closely related
analogues which bind to the catalytic site, compound 8 is believed to bind to the
allosteric site and was added to this set.36
Figure 3.5 - Selected known actives. 35-40
The protocol is shown in Figure 3.6. The entire library was processed using SMART
and a first filtration step was carried out using FITTED 1.5. Compounds with net charges
between -1 and +1, with a number of hydrogen bond acceptors lower than or equal to 10, a
number of hydrogen bond donors lower than or equal to 5, a maximum of 6 rotatable bonds,
molecular weights below or equal to 550 and containing no potentially toxic, reactive or
hydrolyzable groups were retained. This set comprised nearly 32,500 compounds,
including 19 known active inhibitors (0.058% of the library).
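The first filtration step can be expressed compactly as a predicate over precomputed descriptors. The Python sketch below uses the thresholds quoted above and treats the net charge bounds as inclusive, which is an assumption. The `Descriptors` class and its field names are hypothetical and do not reproduce the SMART bit string format.

```python
from dataclasses import dataclass, field

@dataclass
class Descriptors:
    net_charge: int
    h_bond_acceptors: int
    h_bond_donors: int
    rotatable_bonds: int
    molecular_weight: float
    flagged_groups: set = field(default_factory=set)  # toxic, reactive or hydrolyzable

def passes_first_filter(d):
    """Return True if the compound is retained for docking."""
    return (
        -1 <= d.net_charge <= 1
        and d.h_bond_acceptors <= 10
        and d.h_bond_donors <= 5
        and d.rotatable_bonds <= 6
        and d.molecular_weight <= 550
        and not d.flagged_groups
    )

retained = Descriptors(0, 4, 2, 5, 312.4)
too_flexible = Descriptors(0, 4, 2, 9, 312.4)
reactive = Descriptors(0, 4, 2, 5, 312.4, {"nitro"})

print([passes_first_filter(d) for d in (retained, too_flexible, reactive)])
```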
Figure 3.6 - Funnel approach implemented in FITTED.
Prior to the actual docking, compounds not featuring the necessary pharmacophoric
groups, as well as compounds that could not fit the binding site cavity and/or had a
MatchScore lower than 60% were discarded. None of the known active compounds failed
this test while a further 6% of the filtered library was eliminated. When appropriate
poses were found, FITTED started the genetic algorithm optimization and produced the
initial population. A quick evolution (five generations) was then applied to this
population and the top three poses (best GAFFScore) were scored using RankScore.
Compounds with a RankScore of -5.25 or lower were allowed to progress to the next
step. An additional 52% of the filtered library was removed at this stage, while all seed
compounds were retained. This observation provides a clear indication of the usefulness
of this intermediate selection. The remaining 13,713 compounds were further optimized
(another 95 generations of evolution) and scored using RankScore. Finally, focused
libraries of different sizes were compiled based on different score cutoffs. Table 3.4
summarizes the size, number of recovered known actives and enrichment factors for these
small libraries.
Table 3.4 - Focused libraries based on MatchScore > 75 and RankScore as indicated

RankScore cutoff     Hits a     Known actives     Enrichment factor b
< -7.0               835        9                 18.4
< -7.5               401        8                 34.1
< -8.0               147        6                 69.7
< -8.5               48         6                 214
< -9.0               14         3                 366
Filtered library     32457      19                1.0

a Including known actives.
b Based on the filtered library.
Enrichment Factors. Effective docking-based virtual screening tools should prioritize
active compounds from a library of drug-like molecules. It is common practice to seed a
library of drug-like molecules with known actives and use the enrichment factor obtained
to evaluate the accuracy of the docking and scoring functions of the software. An
accurate program should be able to recover the seed compounds at the top of the score-
ranked hit list. Comparative studies evaluating the accuracy of docking programs to
extract active compounds from large libraries showed that Surflex, GOLD, Glide, and
FlexX are among the best programs.3 For instance, the best performer in Rognan’s study,
Surflex, was able to rank ten known thymidine kinase inhibitors from a library of 1000
drug-like molecules in the top 10% of the library with five in the top 3%.41 In another
study, slightly less than 50% of the seed inhibitors were ranked in the top 10% by DOCK,
GOLD and Glide, with 38% in the top 2% when GOLD was used.42 Overall, a state-of-the-art
VS tool rarely extracts 100% of the actives in the top 10% and even more rarely in
the top 5%; 50-60% in the top 5% is more commonly observed with the best programs. In
contrast, from the focused sets selected by FITTED 1.5 (Table 3.4), large enrichment
factors were computed. Considering that in the initial library and in the filtered library
only 0.035% and 0.058% respectively were known actives, enrichment factors of over
three hundred for the top fourteen compounds were achieved for this target with FITTED
1.5.
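The enrichment factors in Table 3.4 are consistent with the usual definition, namely the fraction of actives in the focused set divided by the fraction of actives in the reference library. The short Python sketch below recomputes them from the tabulated counts, taking the filtered library (32457 compounds, 19 actives) as the reference. The function name is illustrative.

```python
def enrichment_factor(hits, actives_in_hits, library_size, actives_in_library):
    """Ratio of the active fraction in the hit list to that in the library."""
    return (actives_in_hits / hits) / (actives_in_library / library_size)

# (RankScore cutoff, hits, known actives) taken from Table 3.4
rows = [(-7.0, 835, 9), (-7.5, 401, 8), (-8.0, 147, 6), (-8.5, 48, 6), (-9.0, 14, 3)]

for cutoff, hits, actives in rows:
    ef = enrichment_factor(hits, actives, 32457, 19)
    print(f"RankScore < {cutoff}: EF = {ef:.1f}")
```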
Data analysis illustrated in Figure 3.7 indicated that a third of the actives were
recovered in the top 0.1% and that half of the actives were found in the top 2% of the hit
list. An average of 12 minutes of CPU time per compound was needed to dock each of
the 32,500 filtered compounds using the semi-flexible HCV polymerase structure on
desktop Linux PCs (AMD Opteron) while less than a second per compound was needed to
filter out the bad candidates.
Interestingly, (R)-7, the most active enantiomer of 7, was predicted to bind well
to the allosteric site, while (S)-7 was given a score much worse than the other 22 actives.
This result correlates well with the experimental data and indicates that 7 may bind to the
allosteric site. Compound 8 was also assigned a high score and is predicted to bind tightly
to the allosteric site, an observation that has only been postulated.36
Figure 3.7 - Active compounds recovered. Blue curve: FITTED VS study; orange:
random selection.
Biological Evaluation. Encouraged by the large enrichment factors obtained for this
target, we assessed FITTED’s ability to identify new HCV polymerase inhibitors from the
Maybridge library by screening the high ranking compounds in biochemical assays. The
top scoring compounds with a RankScore below -7.0 and a MatchScore higher than 75
were considered for biological evaluation (826 compounds; 1.25% of the Maybridge
library) using a scintillation proximity assay (SPA) described in the experimental section.
Unfortunately, some high scoring compounds were not available for purchase at the time
and only 659, representing 1% of the total Maybridge library, were acquired. All these
659 available compounds were screened against the HCV polymerase using a single
point concentration and resulted in 220 compounds showing greater than 50% inhibition
at 10 μg/mL and 12 compounds which had greater than 90% inhibition at 10 μg/mL. The
set of 12 actives was re-tested in an eleven-point dose-response SPA assay and two drug-like
compounds were identified that inhibit HCV polymerase with IC50 values of 7 μM and
12 μM respectively.
With these newly discovered actives, a new enrichment factor of 20.4 was
computed for the top 835 hits.
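The text does not show how this value was obtained. One reading that reproduces it adds the two newly identified inhibitors to both the actives among the 835 hits (9 known) and the actives in the filtered library of 32457 compounds (19 known):

```latex
\mathrm{EF} = \frac{(9 + 2)/835}{(19 + 2)/32457}
            = \frac{0.01317}{0.000647}
            \approx 20.4
```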
CONCLUSION
HCV polymerase has been shown to be a challenging enzyme for docking methods, which
prompted us to assess FITTED in this context. Hence, FITTED 1.0 has been modified to
incorporate features for its application to docking and virtual screening, such as ligand
and pharmacophore based prefiltering. This current version, namely FITTED 1.5, showed
significantly enhanced accuracy and speed relative to the previous version. Validation
experiments carried out on two binding sites on HCV polymerase (allosteric and catalytic
site) further confirmed its accuracy. We next looked at its ability to identify active HCV
polymerase inhibitors from a set of drug-like molecules. A virtual screening run on the
Maybridge library seeded with known actives gave enrichment factors which were
superior to the ones often observed with other available docking programs. Top scoring
compounds representing around 1% of the Maybridge library were purchased and
screened in HCV polymerase assays resulting in the identification of two compounds
with IC50 values of 7 and 12 μM. The screening of larger libraries is now ongoing.
FITTED 1.5 and the subsequent versions are now available to the scientific
community.43
EXPERIMENTAL SECTION
Running FITTED 1.0 Testing Set with FITTED 1.5. The preparation of the testing set has
been previously reported and will not be described herein.8 All protein, interaction site
and cavity files were then prepared using ProCESS 1.5 and all ligand files with SMART
1.5. The HCV polymerase/inhibitor complexes were prepared following the same
protocol. In order to proceed in the semiflexible and flexible modes, FITTED requires
identical sequences (and number of atoms) for the protein structures used as input.
However, a large number of differences were found in the sequences of the various crystal
structures. As these differences were far enough from the binding sites, they are not
expected to affect the docking accuracy. Thus, manual mutations were carried out to
correct these discrepancies.
Preparation of the Cavity and Pharmacophore Files for VS. Two crystal structures
(1NHV and 2GIR) representative of the set of sixteen were used for the VS study.
Preparation of the protein files was carried out as previously described. ProCESS was then
used to prepare the structures for the VS. The center of the active site was defined by the
centroid of the ligands present in the crystal structures. A sphere radius of 25 Å was used
to generate the binding site cavity file. The pharmacophore was generated manually by
examining the known binding modes and the interaction sites identified by ProCESS and
extrapolating the six key interactions shown in Figure 3.3.
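As an illustration of the cavity definition described above, the following minimal sketch computes a centroid from ligand coordinates and keeps the protein atoms lying within a 25 Å sphere. It is not the ProCESS implementation; the function names and coordinates are hypothetical.

```python
import math


def centroid(coords):
    """Arithmetic mean of a list of (x, y, z) coordinates."""
    if not coords:
        raise ValueError("at least one coordinate is required")
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))


def atoms_in_cavity(protein_atoms, center, radius=25.0):
    """Keep the protein atoms lying within `radius` (in angstroms) of `center`."""
    return [atom for atom in protein_atoms if math.dist(atom, center) <= radius]


# Hypothetical coordinates, for illustration only
ligand_atoms = [(10.0, 0.0, 0.0), (12.0, 2.0, 0.0), (14.0, 4.0, 0.0)]
protein_atoms = [(12.0, 2.0, 5.0), (60.0, 2.0, 0.0)]
site_center = centroid(ligand_atoms)
cavity = atoms_in_cavity(protein_atoms, site_center)
```

When several crystal structures are used, the ligand coordinates of all of them can simply be pooled before calling `centroid`, assuming the structures have been superimposed beforehand.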
Preparation of the Library. The Maybridge library was downloaded from the ZINC
database32 in mol2 format. Each compound of the library was then prepared by SMART,
which assigned the rotatable bonds and atom types and completed the bit string for each
compound.
Docking a Library with FITTED. Each compound of the library was docked individually
using FITTED in Semiflexible mode. Compounds containing the following groups were
filtered out and were not docked: aldehydes, esters, imines, nitro groups, acyclic Michael
acceptors, azides, isocyanates and acyl chlorides. As an additional constraint, all
compounds were required to have at least one aromatic ring. The screening was carried
out on the 872-node Dell PowerEdge cluster (Intel Pentium 4, 3.2 GHz) located at the
Réseau Québecois de Calcul de Haute Performance (RQCHP) at the Université de
Sherbrooke.
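The functional-group prefilter described in this paragraph lends itself to a bit-string test. The sketch below is only an assumption of how such a filter could look: the bit positions, constant names and compound names are hypothetical, since the actual layout of the SMART bit string is not given here.

```python
# Hypothetical bit assignments; the real SMART bit string layout is not specified here.
(ALDEHYDE, ESTER, IMINE, NITRO, ACYCLIC_MICHAEL_ACCEPTOR,
 AZIDE, ISOCYANATE, ACYL_CHLORIDE, AROMATIC_RING) = (1 << i for i in range(9))

EXCLUDED = (ALDEHYDE | ESTER | IMINE | NITRO | ACYCLIC_MICHAEL_ACCEPTOR
            | AZIDE | ISOCYANATE | ACYL_CHLORIDE)
REQUIRED = AROMATIC_RING


def passes_prefilter(bits):
    """True if no excluded group is flagged and every required feature is present."""
    return (bits & EXCLUDED) == 0 and (bits & REQUIRED) == REQUIRED


# Toy library: compound name -> feature bits
library = {
    "cmpd_A": AROMATIC_RING,          # aromatic, no flagged group: kept
    "cmpd_B": AROMATIC_RING | NITRO,  # nitro group: rejected
    "cmpd_C": 0,                      # no aromatic ring: rejected
}
to_dock = [name for name, bits in library.items() if passes_prefilter(bits)]
```

Testing precomputed bits is much cheaper than docking, which is why such a filter is applied before any pose is generated.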
Biological Evaluation of the Selected Compounds. Briefly, 250 ng of a 5’-biotinylated
DNA oligonucleotide (oligo dT15) primer, annealed to 10 pmol of a homopolymeric poly
rA RNA template, was captured on the surface of streptavidin-coated beads (GE
Healthcare, Uppsala, Sweden). The polymerization activity of 50 nM HCV NS5B
enzyme (genotype 1b, BK strain) was quantified by measuring the incorporation of
radiolabeled [3H]-UTP substrate onto the 3’ end of the growing primer at 22 °C for 140
min. Detection was performed by counting the signal using a liquid scintillation counter
(Wallac MicroBeta Trilux, Perkin Elmer, MA). Compounds were initially tested at a
single-point concentration and the actives were re-confirmed by eleven-point dose-
response experiments. Curves were fitted to data points using nonlinear regression analysis, and IC50s
were interpolated from the resulting curves using GraphPad Prism software, version 2.0
(Graphpad Software Inc., San Diego, CA).
ACKNOWLEDGEMENTS
We thank the Canadian Foundation for Innovation for financial support through the
New Opportunities Fund program. CRC holds a CIHR-funded Chemical Biology
Scholarship and PE holds a McGill Majors Fellowship (J. W. McConnell Memorial). We
also thank CIHR (Discovery program), FQRNT (Nouveaux chercheurs) and NSERC for
funding and RQCHP for generous allocation of computer resources.
Supporting Information Available: Detailed data on the docking to the validation set. This
information is available free of charge via the Internet at http://pubs.acs.org.
REFERENCES
1 Shuker, S. B.; Hajduk, P. J.; Meadows, R. P.; Fesik, S. W. Discovering High-
Affinity Ligands for Proteins: SAR by NMR. Science 1996, 274, 1531-1534.
2 Cavasotto, C. N.; Orry, A. J. W. Ligand Docking and Structure-Based Virtual
Screening in Drug Discovery. Curr. Top. Med. Chem. 2007, 7, 1006-1014.
3 Moitessier, N.; Englebienne, P.; Lee, D.; Lawandi, J.; Corbeil, C. R. Towards the
Development of Universal, Fast and Highly Accurate Docking/Scoring Methods: A
Long Way to Go. Br. J. Pharmacol. 2008, 153 (Suppl. 1), S7-S26.
4 Cozza, G.; Bonvini, P.; Zorzi, E.; Poletto, G.; Pagano, M. A.; Sarno, S.; Donella-
Deana, A.; Zagotto, G.; Rosolen, A.; Pinna, L. A.; Meggio, F.; Moro, S.
Identification of Ellagic Acid as Potent Inhibitor of Protein Kinase CK2: A
Successful Example of a Virtual Screening Application. J. Med. Chem. 2006, 49,
2363-2366.
5 De Graaf, C.; Oostenbrink, C.; Keizers, P. H. J.; Van Der Wijst, T.; Jongejan, A.;
Vermeulen, N. P. E. Catalytic Site Prediction and Virtual Screening of Cytochrome
P450 2D6 Substrates by Consideration of Water and Rescoring in Automated
Docking. J. Med. Chem. 2006, 49, 2417-2430.
6 Claussen, H.; Buning, C.; Rarey, M.; Lengauer, T. FLEXE: Efficient Molecular
Docking Considering Protein Structure Variations. J. Mol. Biol. 2001, 308, 377-
395.
7 AutoDock, 4.0; The Scripps Research Institute: La Jolla, CA, 2006.
8 Corbeil, C. R.; Englebienne, P.; Moitessier, N. Docking Ligands into Flexible and
Solvated Macromolecules. 1. Development and Validation of FITTED 1.0. J.
Chem. Inf. Model. 2007, 47, 435-449.
9 Englebienne, P.; Fiaux, H.; Kuntz, D. A.; Corbeil, C. R.; Gerber-Lemaire, S.; Rose,
D. R.; Moitessier, N. Evaluation of Docking Programs for Predicting Binding of
Golgi alpha-Mannosidase II Inhibitors: A Comparison with Crystallography.
Proteins: Struct., Funct., Bioinf. 2007, 69, 160-176.
10 Bacon, B. R.; Di Bisceglie, A. M.; Korb, J. R.; Tillmann, H. L.; Herold, K. C.;
Himelhoch, S.; De Knegt, R. J.; Van Den Berg, A. P.; Bell, B. P.; Walker, B. D.;
Lauer, G. M. Hepatitis C Virus Infection [3] (multiple letters). N. Engl. J. Med.
2001, 345, 1425-1428.
11 Lauer, G. M.; Walker, B. D. Hepatitis C Virus Infection. N. Engl. J. Med. 2001,
345, 41-52.
12 Strader, D. B.; Wright, T.; Thomas, D. L.; Seeff, L. B. Diagnosis, Management,
and Treatment of Hepatitis C. Hepatology 2004, 39, 1147-1171.
13 Toniutto, P.; Fabris, C.; Bitetto, D.; Fornasiere, E.; Rapetti, R.; Pirisi, M.
Valopicitabine Dihydrochloride: A Specific Polymerase Inhibitor of Hepatitis C
Virus. Curr. Opin. Invest. Drugs 2007, 8, 150-158.
14 Pierra, C.; Amador, A.; Benzaria, S.; Cretton-Scott, E.; D'Amours, M.; Mao, J.;
Mathieu, S.; Moussa, A.; Bridges, E. G.; Standring, D. N.; Sommadossi, J. P.;
Storer, R.; Gosselin, G. Synthesis and Pharmacokinetics of Valopicitabine
(NM283), an Efficient Prodrug of the Potent Anti-HCV Agent 2'-C-
Methylcytidine. J. Med. Chem. 2006, 49, 6614-6620.
15 Smith, D. B.; Martin, J. A.; Swallow, S.; Smith, M. K., A.; Yee, C.; Crowell, M.;
Kim, W.; Sarma, K.; Najera, I.; Jiang, W.-R.; Le Pogam, S.; Rajyaguru, S.;
Klumpp, K.; Leveque, V.; Ma, H.; Tu, Y.; Chan, R.; Brandl, M.; Alfredson, T.;
Wu, X.; Birudaraj, R.; Tran, T.; Cammack, N. From R1479 to R1626:
Optimization of a Nucleoside Inhibitor of NS5B for the Treatment of Hepatitis C.
In Abstracts of Papers, 232nd ACS National Meeting, San Francisco, CA, United
States, Sept. 10-14 2006.
16 ViroPharma Incorporated. Therapeutic focus: HCV 796.
http://www.viropharma.com/therapeutic/hcv796.asp (accessed Jan 03, 2008).
17 Virochem Pharma, Corporate Information.
http://www.virochempharma.com/ourFocus.html (accessed Jan 03, 2008).
18 Virochem Pharma, Corporate Information
http://clinicaltrials.gov/ct/gui/show/NCT00389298?order=6 (accessed Jan 03,
2008).
19 Haigh, D.; Amphlett, E. M.; Bravi, G. S.; Bright, H.; Chung, V.; Chambers, C. L.;
Cheasty, A. G.; Convery, M. A.; Ellis, M. R.; Fenwick, R.; Gray, D. F.; Hartley, C.
D.; Howes, P. D.; Jarvest, R. L.; Medhurst, K. J.; Mehbob, A.; Mesogiti, D.;
Mirzai, F.; Nerozzi, F.; Parry, N. R.; Roughley, N.; Skarzynski, T.; Slater, M. J.;
Smith, S. A.; Stocker, R.; Theobald, C. J.; Thomas, P. J.; Thommes, P. A.; Thorpe,
J. H.; Wilkinson, C. S.; Williams, E. Identification of GSK625433: A Novel
Clinical Candidate for the Treatment of Hepatitis C. In Abstracts of Papers, 233rd
ACS National Meeting, Chicago, IL, United States, March 25-29 2007.
20 Biswal, B. K.; Cherney, M. M.; Wang, M.; Chan, L.; Yannopoulos, C. G.;
Bilimoria, D.; Nicolas, O.; Bedard, J.; James, M. N. G. Crystal Structures of the
RNA-dependent RNA Polymerase Genotype 2a of Hepatitis C Virus Reveal two
Conformations and Suggest Mechanisms of Inhibition by Non-nucleoside
Inhibitors. J. Biol. Chem. 2005, 280, 18202-18210.
21 Biswal, B. K.; Wang, M.; Cherney, M. M.; Chan, L.; Yannopoulos, C. G.;
Bilimoria, D.; Bedard, J.; James, M. N. G. Non-nucleoside Inhibitors Binding to
Hepatitis C Virus NS5B Polymerase Reveal a Novel Mechanism of Inhibition. J.
Mol. Biol. 2006, 361, 33-45.
22 Warren, G. L.; Andrews, C. W.; Capelli, A. M.; Clarke, B.; LaLonde, J.; Lambert,
M. H.; Lindvall, M.; Nevins, N.; Semus, S. F.; Senger, S.; Tedesco, G.; Wall, I. D.;
Woolven, J. M.; Peishoff, C. E.; Head, M. S. A Critical Assessment of Docking
Programs and Scoring Functions. J. Med. Chem. 2006, 49, 5912-5931.
23 Kim, J.; Park, J. G.; Chong, Y. FlexE Ensemble Docking Approach to Virtual
Screening for CDK2 Inhibitors. Mol. Simul. 2007, 33, 667-676.
24 Moitessier, N.; Therrien, E.; Hanessian, S. A method for Induced-fit Docking,
Scoring, and Ranking of Flexible Ligands. Application to Peptidic and
Pseudopeptidic β-secretase (BACE 1) Inhibitors. J. Med. Chem. 2006, 49, 5885-
5894.
25 Moitessier, N.; Westhof, E.; Hanessian, S. Docking of Aminoglycosides to
Hydrated and Flexible RNA. J. Med. Chem. 2006, 49, 1023-1033.
26 Lipinski, C. A.; Lombardo, F.; Dominy, B. W.; Feeney, P. J. Experimental and
Computational Approaches to Estimate Solubility and Permeability in Drug
Discovery and Development Settings. Adv. Drug Delivery Rev. 1997, 23, 3-25.
27 Moitessier, N.; Henry, C.; Maigret, B.; Chapleur, Y. Combining Pharmacophore
Search, Automated Docking, and Molecular Dynamics Simulations as a Novel
Strategy for Flexible Docking. Proof of Concept: Docking of Arginine-glycine-
aspartic Acid-like Compounds into the αvβ3 Binding Site. J. Med. Chem. 2004, 47,
4178-4187.
28 Goto, J.; Kataoka, R.; Hirayama, N. Ph4Dock: Pharmacophore-Based Protein-
Ligand Docking. J. Med. Chem. 2004, 47, 6804-6811.
29 Wang, J.; Wolf, R. M.; Caldwell, J. W.; Kollman, P. A.; Case, D. A. Development
and Testing of a General Amber Force Field. J. Comput. Chem. 2004, 25, 1157-
1174.
30 Paul, N.; Rognan, D. ConsDock: A New Program for the Consensus Analysis of
Protein-ligand Interactions. Proteins Struct. Funct. Genet. 2002, 47, 521-533.
31 Di Marco, S.; Volpari, C.; Tomei, L.; Altamura, S.; Harper, S.; Narjes, F.; Koch,
U.; Rowley, M.; De Francesco, R.; Migliaccio, G.; Carfi, A. Interdomain
Communication in Hepatitis C Virus Polymerase Abolished by Small Molecule
Inhibitors Bound to a Novel Allosteric Site. J. Biol. Chem. 2005, 280, 29765-
29770.
32 Irwin, J. J.; Shoichet, B. K. ZINC - A Free Database of Commercially Available
Compounds for Virtual Screening. J. Chem. Inf. Model. 2005, 45, 177-182.
33 A Free Database for Virtual Screening ZINC. http://blaster.docking.org/zinc/
(accessed Jan 03, 2008).
34 Gopalsamy, A.; Lim, K.; Ciszewski, G.; Park, K.; Ellingboe, J. W.; Bloom, J.;
Insaf, S.; Upeslacis, J.; Mansour, T. S.; Krishnamurthy, G.; Damarla, M.; Pyatski,
Y.; Ho, D.; Howe, A. Y. M.; Orlowski, M.; Feld, B.; O’Connell, J. Discovery of
Pyrano[3,4-b]indoles as Potent and Selective HCV NS5B Polymerase Inhibitors. J.
Med. Chem. 2004, 47, 6603-6608.
35 Gopalsamy, A.; Aplasca, A.; Ciszewski, G.; Park, K.; Ellingboe, J. W.; Orlowski,
M.; Feld, B.; Howe, A. Y. M. Design and synthesis of 3,4-dihydro-1H-[1]-
benzothieno[2,3-c]pyran and 3,4-dihydro-1H-pyrano[3,4-b]benzofuran derivatives
as non-nucleoside inhibitors of HCV NS5B RNA dependent RNA polymerase.
Bioorg. Med. Chem. Lett. 2006, 16, 457-460.
36 Pfefferkorn, J. A.; Nugent, R.; Gross, R. J.; Greene, M.; Mitchell, M. A.; Reding,
M. T.; Funk, L. A.; Anderson, R.; Wells, P. A.; Shelly, J. A.; Anstadt, R.; Finzel,
B. C.; Harris, M. S.; Kilkuskie, R. E.; Kopta, L. A.; Schwende, F. J. Inhibitors of
HCV NS5B polymerase. Part 2: Evaluation of the northern region of (2Z)-2-
benzoylamino-3-(4-phenoxy-phenyl)-acrylic acid. Bioorg. Med. Chem. Lett. 2005,
15, 2481-2486.
37 Wang, M.; Ng, K. K.-S.; Cherney, M. M.; Chan, L.; Yannopoulos, C. G.; Bedard,
J.; Morin, N.; Nguyen-Ba, N.; Alaoui-Ismaili, M. H.; Bethell, R. C.; James, M. N.
G. Non-nucleoside Analogue Inhibitors Bind to an Allosteric Site on HCV NS5B
Polymerase. J. Biol. Chem. 2003, 278, 9489-9495.
38 Biswal, B. K.; Wang, M.; Cherney, M. M.; Chan, L.; Yannopoulos, C. G.;
Bilimoria, D.; Bedard, J.; James, M. N. G. Non-nucleoside Inhibitors Binding to
Hepatitis C Virus NS5B Polymerase Reveal a Novel Mechanism of Inhibition. J.
Mol. Biol. 2006, 361, 33-45.
39 Li, H.; Tatlock, J.; Linton, A.; Gonzalez, J.; Borchardt, A.; Dragovich, P.; Jewell,
T.; Prins, T.; Zhou, R.; Blazel, J.; Parge, H.; Love, R.; Hickey, M.; Doan, C.; Shi,
S.; Duggal, R.; Lewis, C.; Fuhrman, S. Identification and structure-based
optimization of novel dihydropyrones as potent HCV RNA polymerase inhibitors.
Bioorg. Med. Chem. Lett. 2006, 16, 4834-4838.
40 Chan, L.; Pereira, O.; Reddy, T. J.; Das, S. K.; Poisson, C.; Courchesne, M.;
Proulx, M.; Siddiqui, A.; Yannopoulos, C. G.; Nguyen-Ba, N.; Roy, C.; Nasturica,
D.; Moinet, C.; Bethell, R.; Hamel, M.; L’Heureux, L.; David, M.; Nicolas, O.;
Courtemanche-Asselin, P.; Brunette, S.; Bilimoria, D.; Bedard, J. Discovery of
thiophene-2-carboxylic acids as potent inhibitors of HCV NS5B polymerase and
HCV subgenomic RNA replication. Part 2: Tertiary amides. Bioorg. Med. Chem.
Lett. 2004, 14, 797-800.
41 Kellenberger, E.; Rodrigo, J.; Muller, P.; Rognan, D. Comparative Evaluation of
Eight Docking Tools for Docking and Virtual Screening Accuracy. Proteins Struct.
Funct. Genet. 2004, 57, 225-242.
42 Kontoyianni, M.; Sokol, G. S.; McClellan, L. M. Evaluation of Library Ranking
Efficacy in Virtual Screening. J. Comput. Chem. 2005, 26, 11-22.
43 Moitessier, N.; Corbeil, C. R.; Englebienne, P.; Schwartzentruber, J. FITTED 2.2
can be obtained on request from the developers ([email protected]).
CHAPTER 4
CHAPTER FOUR
In the previous chapters, the two reported versions of FITTED were validated
and tested using a test set of 33 protein-ligand complexes along with a virtual screening
application against HCV polymerase. We also demonstrated the importance of
including protein flexibility and displaceable bridging water molecules in FITTED.
However, we sought to further demonstrate the impact of these two features, as well as
others (e.g., input conformation of the ligand), on the pose prediction accuracy of FITTED
and of other popular docking programs. Alongside this comparative study, another
iteration of improvements to the program was carried out and is presented in this
chapter.
This chapter is reproduced from a manuscript that has been submitted for publication in
the Journal of Chemical Information and Modeling. This article is cited as Corbeil, C. R.;
Moitessier, N. “Docking Ligands into Flexible and Solvated Macromolecules. 3. Impact
of Input Ligand Conformation, Protein Flexibility and Water Molecules on the Accuracy
of Docking Programs.” Journal of Chemical Information and Modeling 2009, accepted.
Copyright 2009, with permission from the American Chemical Society.
DOCKING LIGANDS INTO FLEXIBLE AND SOLVATED
MACROMOLECULES. 3.
IMPACT OF INPUT LIGAND CONFORMATION, PROTEIN
FLEXIBILITY AND WATER MOLECULES ON THE ACCURACY
OF DOCKING PROGRAMS
ABSTRACT
Several modifications and additions to FITTED 1.5 led to the development of FITTED 2.6.
Among the novel implementations are a matching algorithm-enhanced genetic algorithm
and a ring conformational search algorithm. With these various optimizations, we also
hoped to remove biases and to develop a docking program that would provide results
(i.e., poses) as independent as possible of the input ligand and protein conformations and
of the parameters used, while keeping the option to provide additional experimental
information. These biases were investigated with FITTED 2.6 along with FlexX, GOLD,
Glide and Surflex. The input ligand conformation was found to have a major impact on
program accuracy, as drops of 10-50% were observed with all the programs
but FITTED. This comparative study also demonstrates that the accuracy of FITTED is
comparable to that of other docking programs. We have also demonstrated that protein
flexibility, displaceable water molecules and ring conformational search algorithms, three
of the main FITTED features, significantly increased its accuracy. Finally, we also
propose potential modifications to the available programs to further improve their
accuracy in binding mode prediction.
INTRODUCTION
In modern drug design, docking-based virtual screening (VS) methods provide a quick
and inexpensive alternative to high throughput screening. In fact, numerous applications
have demonstrated the reasonable level of accuracy of the available methods.1, 2 In
parallel, comparative studies evaluated the relative accuracy of previous versions of
docking programs in predicting the correct binding modes, typically with Glide and
GOLD yielding the best results.3-9 Many of these studies, which often made use of
ligand / protein co-crystal structures, showed that docking the native ligands back to
the corresponding protein structures (self-docking) gave reasonable accuracy.
However, when examining docking of a ligand to non-native crystal structures of the
same protein (cross-docking), the accuracy of most of the programs was significantly
lower.10-12 These failures result in part from the assumption that proteins are rigid
objects (the lock-and-key model) even though they are known to be flexible, dynamic
objects. As a result, this major assumption leads to inaccurate binding pose predictions
and low enrichment factors in VS.13 In fact, implementing protein flexibility has been
seen as one major challenge in the development of docking methods.14-16 Currently, very
few programs consider the flexibility of the protein upon docking, although various
strategies have been proposed, ranging from soft-docking (e.g., smoothed protein
structure in AutoDock11) to a more exhaustive and therefore time-consuming protein
conformational search as seen in Glide when combined with Prime.17 Docking to
conformational ensembles has also been implemented within a few programs such as
FlexX-Ensemble,18 Slide19,20 and AutoDock.21 These various implementations led to
significantly improved predictions of binding modes when compared to cross-docking
studies.
One of the other challenges in docking and VS is the treatment of key water
molecules.22 In most protein / ligand docking studies, water molecules, if present, are
treated on a per-protein basis. If the water molecules appear to be highly conserved, then
they are kept as part of the protein description for the docking run. This approach clearly
precludes accurate docking of ligands that would displace these key water molecules
upon binding. A commonly described example is HIV-1 protease ligands. A tightly
bound water molecule has been observed within the co-crystal structure of HIV-1
protease with KNI-272 and analogues.23-27 This water molecule may therefore be
required for an optimally accurate docking of this set of analogues. In parallel, inhibitors
built around a cyclic urea scaffold have been designed to displace this water and would
not be properly docked if this water molecule was kept.28-30 Ideally, water molecules
should be displaceable. In fact, a previous report from our group showed that AutoDock
gained accuracy when water molecules were made displaceable.31 Currently, only a few
docking methods can displace water molecules while docking ligands. For instance,
GOLD32 uses user-defined waters present in the protein input file while FlexX33 places
water molecules within the binding site and keeps the ones interacting strongly. In both
cases, the program scores each pose with the water present (on) or absent (off) and selects
the best scoring option. A version of FlexX currently in beta testing allows for
displaceable waters34 that were present in the input file. Surprisingly, when the GOLD
implementation was reported, it showed no improvement of the docking accuracy. This
somewhat unexpected observation called into question the development of functionalities
specifically designed to displace water molecules.
Within the past years, we have developed and reported two versions of FITTED, 1.0 and
1.5.35,36 FITTED (Flexibility Induced Through Targeted Evolutionary Description) is a
docking program that addresses the challenges of protein flexibility and displaceable
water molecules. Herein we describe the development of the next version of this docking
program, FITTED 2.6, which focuses on accelerating the docking process while keeping
similar accuracy. We also focused on reducing the dependence of the accuracy on input
parameters and structures. We have previously found that the accuracy of eHiTS was
affected by the ligand input structure, and further investigation was necessary to evaluate
this effect on the accuracy of other programs including ours.37 In order to identify its
strengths and weaknesses and evaluate these dependencies, we then compared FITTED to
some of the most popular docking software available, with a specific focus on how
changes in input structure and parameters affect docking accuracy.
THEORY AND IMPLEMENTATION
FITTED 1.0 and 1.5, creating a virtual screening tool out of a docking program. Previous
reports from our group detailed the development of FITTED versions 1.0 and 1.5,35,36 and
only a brief description is given below. FITTED is a suite of programs that includes
FITTED (the docking engine), ProCESS (Protein Conformation Ensemble System Setup, a
module for protein file preparation) and SMART (Small Molecule Atom Typing and
Rotatable Torsion assignment, a module for ligand preparation). Docking a ligand to a
protein can be seen as a global optimization problem. The ligand binding mode, protein
conformation, water molecule occurrence and locations have to be optimized to provide
an optimal free energy of binding. In FITTED, a Lamarckian genetic algorithm addresses
the conformational space search. Genetic algorithms are stochastic methods and often
start with randomly generated populations, followed by a time-consuming evolution. Due
to the large conformational space of the protein / water / ligand complexes, finding the
global solution requires a large number of generations and large populations.
Shortcutting the process is therefore necessary to reduce the CPU time required for a
single run. We thought that starting with a population that has already evolved (i.e., lower
average energy than random poses) would lead to the desired decrease in computational
time. Thus, using a series of atomic charge constraints and a binding site volume, FITTED
1.0 prepared such an initial population intelligently. This approach allows for quicker
convergence of the population through evolution. Along with this genetic algorithm,
FITTED incorporated a switching function that effectively turns off water molecules and
allows them to be displaced when required. The initial validation of FITTED 1.0 with a
small set of protein / ligand complexes showed promise, with 76% and 73% success in
self-docking and docking to flexible proteins, respectively.35 Although this first version
docked ligands effectively, our eventual goal was to make FITTED a VS tool. Some
modifications of the original algorithm were necessary to make it fast enough to reach
the speeds required of VS tools.
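In a Lamarckian genetic algorithm, each individual is refined by a local optimization and the refined values are written back into its chromosome before selection. The toy sketch below illustrates this principle on a three-variable quadratic function standing in for the docking energy; the energy function, operators and all parameter values are assumptions chosen for illustration and do not reproduce the FITTED implementation.

```python
import random

TARGET = (1.0, -2.0, 0.5)  # minimum of the toy energy surface


def energy(genes):
    """Toy stand-in for a docking score: squared distance to TARGET."""
    return sum((g - t) ** 2 for g, t in zip(genes, TARGET))


def local_search(genes, steps=5, rate=0.1, h=1e-4):
    """A few steps of gradient descent using a finite-difference gradient."""
    genes = list(genes)
    for _ in range(steps):
        base = energy(genes)
        grad = []
        for i in range(len(genes)):
            shifted = genes[:]
            shifted[i] += h
            grad.append((energy(shifted) - base) / h)
        genes = [g - rate * d for g, d in zip(genes, grad)]
    return genes


def lamarckian_ga(pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    n_genes = len(TARGET)
    population = [[rng.uniform(-10.0, 10.0) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Lamarckian step: the locally optimized genes replace the originals
        population = [local_search(ind) for ind in population]
        population.sort(key=energy)
        parents = population[: pop_size // 2]  # the best half survives
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_genes)    # one-point crossover
            child = a[:cut] + b[cut:]
            child[rng.randrange(n_genes)] += rng.gauss(0.0, 0.5)  # mutation
            children.append(child)
        population = parents + children
    return min(population, key=energy)


best = lamarckian_ga()
```

Starting from a pre-evolved population, as described above, would correspond to replacing the random initialization with poses that already satisfy some constraints, so that fewer generations are needed.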
The first step in any virtual screen is the preparation of the virtual database of potential
ligands. Since a docking program will attempt to dock any given compound, we first
focused on prioritizing “drug-like” molecules for docking. For this purpose, a series of
descriptors was implemented in SMART. Bit strings describing the molecular structure
generated by SMART could then be exploited by FITTED to filter out compounds with
undesired chemical features and/or physical properties. The docking was modified to
incorporate a consensus docking approach that enabled FITTED to create the population
and allow it to evolve in a more intelligent manner than in the previous version. This was
done by adding pharmacophores and/or automatically generated sets of protein
interaction sites (generated by ProCESS) to orient the docking process. When a
conformation of a ligand did not match the pharmacophore (PharmScore) and/or
the interaction sites (MatchScore) well, the conformation was discarded and a new one
was generated. Thus, the inclusion of the interaction sites oriented the docking towards
better solutions and, as a result, afforded a 10% increase in accuracy over FITTED 1.0 and
a significant decrease in required CPU time.36 FITTED was then used in the screening of
the Maybridge database against HCV polymerase and was successful in identifying two
hits in the low micromolar range.36
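The discard-and-regenerate step described in this paragraph can be sketched as a simple rejection loop. The scoring function below is a deliberately simplified, unweighted stand-in (fraction of interaction sites lying close to a ligand atom) and is not the actual MatchScore; the function names, cut-off and threshold are hypothetical.

```python
import math


def match_score(pose_atoms, interaction_sites, cutoff=1.5):
    """Toy score: fraction of interaction sites within `cutoff` of a ligand atom."""
    if not interaction_sites:
        return 0.0
    matched = sum(
        1 for site in interaction_sites
        if any(math.dist(site, atom) <= cutoff for atom in pose_atoms)
    )
    return matched / len(interaction_sites)


def generate_accepted_pose(make_pose, interaction_sites, min_score=0.5, max_tries=1000):
    """Generate poses until one matches the interaction sites well enough."""
    for _ in range(max_tries):
        pose = make_pose()
        if match_score(pose, interaction_sites) >= min_score:
            return pose
    return None  # no acceptable pose found within the allowed number of tries


# Hypothetical data: two interaction sites and two candidate poses
sites = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
candidates = iter([
    [(10.0, 10.0, 10.0)],                # far from both sites: discarded
    [(0.0, 0.0, 0.5), (3.0, 0.0, 0.0)],  # matches both sites: accepted
])
accepted = generate_accepted_pose(lambda: next(candidates), sites)
```

Because the score is evaluated before any energy calculation, poorly placed conformations are rejected cheaply, which is consistent with the reduction in CPU time reported above.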
FITTED 2.6. Improvements to remove dependencies on input parameters. When more
knowledge is provided to a docking program, the accuracy is expected to increase. For
instance, if the ligand in its crystal structure conformation is docked, a program that uses
the ligand input structure as an initial guess would most likely outperform any other
program in self-docking experiments. However, these experiments would give no
information regarding its true accuracy since, in a real drug design scenario, the user does
not know the solution. Some other biases, including the selection of parameters and the
protocol used to prepare the protein (e.g., protonation state of ionizable residues), can
also greatly affect the evaluation of programs. The removal of these dependencies arising
from the input parameters has recently become one of the hot topics in the literature.38-42
One of these dependencies is the input conformation of rings. In a VS study, large
libraries of compounds are tested in silico. These libraries are typically prepared from
two-dimensional representations of these molecules, then a 2D to 3D converter such as
OMEGA43 or CORINA44 is used to generate the 3D coordinates. Most 2D to 3D
converters output an esthetically stable state with the option to find a low-energy
conformation as defined by a force field and a conformational search algorithm. This
conformation may not always be the same as the bioactive conformation. If the molecule
is acyclic then this poses no problem, since most popular docking programs consider the
flexibility of acyclic portions of the molecule. Molecules with flexible cyclic structures
prove to be more challenging, as illustrated in Figure 4.1. In addition, the conformation of
the flexible ring depends on the program used to generate it and is often not fully
searched.38, 39, 45, 46 Our genetic algorithm has therefore been modified to account for
ring flexibility as detailed below.
Figure 4.1 - Conformation of 1nfu ligand (a) as observed in the crystal structure and (b)
as generated by OMEGA
Implementation of ring flexibility. There are three main strategies to address the issue of
flexible ring systems. First, a separate tool can create multiple conformations of the ligand
to be used as multiple inputs by docking programs. Second, several ring conformations
can be exploited during the incremental construction of ligands. Surflex47 (version 2.1)
uses templates of five- to seven-membered cycloalkanes to generate multiple input
conformations of the rings used by the incremental construction algorithm. Even though
the templates are saturated carbocyclic structures, the inclusion of energy minimization
steps accounts for the various conformations that may exist for heterocyclic and
unsaturated systems. Similarly, Glide48 version 5.0 uses the template library from
LigPrep to be able to conformationally search larger rings.49 The third option is to
perform the conformational search while docking. GOLD50-52 exploits the corner flap
approach developed by Goto and Osawa,53 where the atom to be flipped is reflected in
the plane formed by the adjacent atoms. The major advantage of this approach is that
rings of any size can be searched. The one pitfall is the requirement to have the four
adjacent atoms in a plane. In a previous version of Glide, version 4.5, the docking engine
used an approach similar to that of GOLD.54
Genetic algorithms rely on the use of chromosomes and the theory of evolution. In
the context of docking, the chromosomes are sets of numerical values (genes) that can
evolve through genetic operators such as mutation and crossover. These numerical
values often define the conformation, orientation and position in space of the ligand
(referred to as a pose). The chromosome, as defined in FITTED 1.5, included the acyclic
flexible torsions, translation and orientation for a given pose of the ligand. Depending on
the selected docking mode, the chromosome may also include the protein backbone,
water positions and binding site residues (Figure 4.2).
[Figure 4.2 graphic not reproduced. Genes shown: flexible torsions, translation, orientation, flexible rings, protein backbone, binding site residues, waters. Docking modes shown: rigid docking, semiflexible docking, semiflexible docking with flexible waters, fully flexible docking.]
Figure 4.2 - FITTED 1.5 vs. FITTED 2.6 chromosome and the various docking modes.
Each of the horizontal lines represents a gene (e.g., given conformation of a side chain
residue). The box highlights the implementation in FITTED 2.6.
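The chromosome of Figure 4.2 can be pictured as a small record of genes acted upon by genetic operators. The sketch below is a hypothetical data structure: the field names, the integer encoding of ring, protein and water genes, and the two operators are assumptions made for illustration, and which genes are active in which docking mode is not encoded.

```python
import random
from dataclasses import dataclass, field, replace
from typing import List, Optional, Tuple


@dataclass
class Chromosome:
    # Ligand genes, present in every docking mode
    torsions: List[float]                       # acyclic torsion angles (degrees)
    translation: Tuple[float, float, float]
    orientation: Tuple[float, float, float]
    # Ring gene added in FITTED 2.6 (here: index of a ring conformation)
    ring_conformers: List[int] = field(default_factory=list)
    # Optional genes, used depending on the docking mode
    protein_backbone: Optional[int] = None      # index of a protein conformation
    side_chains: List[int] = field(default_factory=list)
    waters: List[int] = field(default_factory=list)


def crossover(a, b, rng):
    """Uniform crossover: each gene of the child is copied from one parent."""
    def pick(x, y):
        return x if rng.random() < 0.5 else y

    return Chromosome(
        torsions=[pick(x, y) for x, y in zip(a.torsions, b.torsions)],
        translation=pick(a.translation, b.translation),
        orientation=pick(a.orientation, b.orientation),
        ring_conformers=[pick(x, y) for x, y in zip(a.ring_conformers, b.ring_conformers)],
        protein_backbone=pick(a.protein_backbone, b.protein_backbone),
        side_chains=[pick(x, y) for x, y in zip(a.side_chains, b.side_chains)],
        waters=[pick(x, y) for x, y in zip(a.waters, b.waters)],
    )


def mutate_torsion(c, rng, max_step=30.0):
    """Perturb one randomly chosen torsion and wrap it into [-180, 180)."""
    torsions = list(c.torsions)
    if torsions:
        i = rng.randrange(len(torsions))
        angle = torsions[i] + rng.uniform(-max_step, max_step)
        torsions[i] = (angle + 180.0) % 360.0 - 180.0
    return replace(c, torsions=torsions)


rng = random.Random(1)
parent_a = Chromosome([10.0, 20.0], (0.0, 0.0, 0.0), (0.0, 0.0, 0.0),
                      ring_conformers=[0], protein_backbone=0,
                      side_chains=[0, 0], waters=[1])
parent_b = Chromosome([30.0, 40.0], (1.0, 1.0, 1.0), (90.0, 0.0, 0.0),
                      ring_conformers=[1], protein_backbone=1,
                      side_chains=[1, 1], waters=[0])
child = mutate_torsion(crossover(parent_a, parent_b, rng), rng)
```

In such a layout, adding ring flexibility only requires one more gene list, which is the kind of extension highlighted by the box in Figure 4.2.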
Since only the acyclic portions of the ligand were included in the chromosome, the
conformation of the ring(s) within the ligand remained the same throughout the evolution
unless altered by the energy minimization routine. To account for ring flexibility, FITTED
2.6 now includes a conformational search algorithm for rings during the generation of
new conformations (Figure 4.3 and Figure 4.4). FITTED 2.6 uses a corner flap algorithm
similar to that of GOLD but does not impose any criteria on the positions in space of A, B,
D and E (Figure 4.3). This is achieved by creating the plane out of three atoms (A, B and
D) instead of the four atoms required in GOLD. Any distortions of the bond lengths and
angles are next resolved through the energy minimization steps performed by FITTED. In
order to maintain the asymmetry of atom C, GOLD imposes the rotation of two bonds
(AB and BC) to position C1 and C2 (Figure 4.4). This approach reinforces the need to
have four atoms in a plane. In our current implementation, an assumption is made that the
torsion C1CBC' is equivalent to C'2C'BC (Figure 4.4). Thus the Cartesian coordinates of
C'2 can be defined by converting C2 into C'2 from its internal coordinates (bonds, angles
and torsions).55
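The geometric core of a corner flap based on three atoms is the reflection of the corner atom C through the plane defined by A, B and D. The sketch below shows only this reflection, under the assumption that C is bonded to B and D as the atom labels suggest; the repositioning of the substituents of C and the subsequent energy minimization are not covered, and the function names are hypothetical.

```python
import math


def _sub(u, v):
    return tuple(a - b for a, b in zip(u, v))


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def corner_flap(a, b, d, c):
    """Reflect corner atom c through the plane defined by atoms a, b and d."""
    normal = _cross(_sub(b, a), _sub(d, a))
    length = math.sqrt(_dot(normal, normal))
    if length < 1e-8:
        raise ValueError("atoms a, b and d are collinear; no plane is defined")
    n = tuple(x / length for x in normal)
    offset = _dot(_sub(c, a), n)  # signed distance of c from the plane
    return tuple(ci - 2.0 * offset * ni for ci, ni in zip(c, n))


# Hypothetical coordinates: A, B and D lie in the xy plane, C sits above it
A, B, D = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
C = (0.3, 0.4, 0.7)
C_flipped = corner_flap(A, B, D, C)
```

Since B and D lie in the mirror plane, the distances from C to B and to D are unchanged by the reflection; only quantities involving atoms outside the plane are distorted, which is what the minimization step is expected to correct.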
Figure 4.3 - Example of the corner flap approach converting a boat conformation to a
chair.
[Figure graphics not reproduced. Atom labels shown: A to F in Figure 4.3; B, C, C', C1, C2, C'1 and C'2 in Figure 4.4.]
Figure 4.4 - The assumption of torsion equivalencies.
An improved definition of interaction sites and a matching algorithm. As described
previously, the module ProCESS prepares the protein files in the format needed by
FITTED. It also probes the binding site and generates additional data for optimal docking.
Among this data is a list of potential protein interaction sites (ISs). Geometric rules are
applied to find the ideal locations of hydrogen bond donor and acceptor groups referred
to as HBD and HBA ISs. In the previous version, ISs were centered on the Ser, Thr and
Tyr hydroxyl oxygen and on metal centers. These points are now placed in the position of
the oxygen lone pairs or free metal coordination sites. ProCESS determines these free
metal coordination sites by examining the surrounding residues and using the vector bond
valence postulate,56 which states that the sum of all the vectors of the coordinated atoms
must be equal to 0.
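As an illustration of how a zero vector sum can locate a free coordination site, the
following sketch places the missing site along the direction that cancels the vectors of
the atoms already coordinated. It is a simplification made under stated assumptions: unit
vectors are used in place of true bond valence vectors, only one vacancy is considered,
and the function name and the 2.1 Å site distance are hypothetical.

```python
import math

def free_coordination_site(metal, coordinated_atoms, site_distance=2.1):
    """Return the position of one free coordination site, or None if balanced."""
    units = []
    for atom in coordinated_atoms:
        v = [a - m for a, m in zip(atom, metal)]
        length = math.sqrt(sum(x * x for x in v))
        units.append([x / length for x in v])
    # The missing vector is the one that brings the sum of all vectors to zero.
    missing = [-sum(u[k] for u in units) for k in range(3)]
    length = math.sqrt(sum(x * x for x in missing))
    if length < 1e-6:
        return None  # the existing vectors already sum to zero
    return [m + site_distance * x / length for m, x in zip(metal, missing)]

# Hypothetical tetrahedral metal with three of its four positions occupied.
s = 2.0 / math.sqrt(3.0)
METAL = [0.0, 0.0, 0.0]
OCCUPIED = [[s, s, s], [s, -s, -s], [-s, s, -s]]
SITE = free_coordination_site(METAL, OCCUPIED)
```

For this tetrahedral example the site falls on the fourth vertex of the tetrahedron, in the
direction (-1, -1, 1) from the metal.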
The earlier version of FITTED did not identify the hydrophobic pockets with great
accuracy. To resolve this issue, various strategies have been implemented. In the current
version, a grid of evenly distributed points is generated and the interaction of a probe
atom at each of these grid points with the protein is computed. To be considered
hydrophobic (referred to as HYD), a point must meet two conditions. First, it should not
be in close proximity to an HBA or HBD point. Second, the van der Waals interaction energy
calculated between the probe atom and all the protein carbons should be below a minimum
van der Waals energy cut-off value. When applied to a number of proteins, this new
definition was found to be more accurate than the previous one.
A weight is then assigned to each point depending on its type. For HBA and HBD, a
weight is assigned depending on whether the point was created from a charged or neutral
residue and is then scaled to account for the buriedness of the point.57 The HYD points
are scaled according to the ratio of the van der Waals energy calculated for the point
over the minimum van der Waals energy cut-off. These weights are then used to compute the
MatchScore of each pose as described previously.36
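The definition and weighting of HYD points described above can be sketched as follows.
This is a minimal illustration, not the ProCESS implementation: the 12-6 Lennard-Jones
form, its parameters (epsilon, r_min), the energy cut-off, the exclusion distance around
polar sites and all function names are assumptions chosen for the example.

```python
import math

def lj_energy(r, epsilon=0.1, r_min=3.8):
    """12-6 Lennard-Jones energy of a probe/carbon pair (hypothetical parameters)."""
    ratio = r_min / max(r, 0.5)  # clamp very short distances to avoid overflow
    return epsilon * (ratio ** 12 - 2.0 * ratio ** 6)

def hydrophobic_sites(grid, carbons, polar_sites, e_cutoff=-0.5, min_polar_dist=2.5):
    """Return (point, energy, weight) for every grid point accepted as HYD."""
    sites = []
    for point in grid:
        # Condition 1: not in close proximity to an HBA or HBD point.
        if any(math.dist(point, p) < min_polar_dist for p in polar_sites):
            continue
        # Condition 2: summed probe/carbon energy below the cut-off.
        energy = sum(lj_energy(math.dist(point, c)) for c in carbons)
        if energy < e_cutoff:
            # HYD weight: ratio of the point energy over the cut-off energy.
            sites.append((point, energy, energy / e_cutoff))
    return sites

# Hypothetical pocket: eight carbons on the corners of a cube, each 3.8 Å from
# the cube centre, probed at the centre and at a distant point.
a = 3.8 / math.sqrt(3.0)
CARBONS = [[x, y, z] for x in (-a, a) for y in (-a, a) for z in (-a, a)]
GRID = [[0.0, 0.0, 0.0], [20.0, 0.0, 0.0]]
HYD = hydrophobic_sites(GRID, CARBONS, polar_sites=[])
```

With these settings only the buried grid point is retained, and its weight exceeds 1
because its energy is more favourable than the cut-off.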
Some years ago, we showed that using pharmacophore-oriented docking with a
matching algorithm can improve docking accuracy substantially.58 FITTED 1.5 initiated a
move in this direction by orienting the docking using ISs.36 With the new version, we
complete this move with the inclusion of a three-point triangle matching algorithm to
orient the ligand instead of the random translation and orientation performed with
previous versions. Triangles made of ligand atoms are matched onto triangles made of
ISs of identical chemical property (HBD, HBA or HYD). In order to optimize the
efficiency of this algorithm, only a subset of the potential triangle matches is used.
First, FITTED removes triangles that connect low-weight interaction sites. Secondly, all
triangles must contain at least one point of the top 10 ISs as sorted by weights.
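The two pruning rules can be sketched as below. The sketch rests on assumptions: a
"low-weight" site is taken to be any site below a hypothetical threshold, a triangle is
discarded as soon as one of its vertices is such a site, and the site tuples and function
name are invented for the example.

```python
from itertools import combinations

def candidate_triangles(sites, min_weight=0.5, top_n=10):
    """sites: list of (label, kind, weight) tuples; returns the retained triangles."""
    # Rule 1: drop triangles that connect low-weight sites.
    kept = [s for s in sites if s[2] >= min_weight]
    # Rule 2: keep only triangles with at least one of the top-ranked sites.
    ranked = sorted(sites, key=lambda s: s[2], reverse=True)
    top = {s[0] for s in ranked[:top_n]}
    return [tri for tri in combinations(kept, 3)
            if any(s[0] in top for s in tri)]

# Hypothetical interaction sites of a small binding pocket.
SITES = [("a", "HBD", 3.0), ("b", "HBA", 2.0), ("c", "HYD", 1.0),
         ("d", "HYD", 0.8), ("e", "HBA", 0.1)]
TRIANGLES = candidate_triangles(SITES, min_weight=0.5, top_n=1)
```

With top_n set to 1 for illustration, only the triangles containing site "a" survive, which
shows how quickly the two rules reduce the number of matches to be tested.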
The creation of the ligand ISs is based on simple rules. All ligand atoms that are
hydrogen bond acceptors (HBA) or donors (HBD) are labelled as HBA or HBD (blue and
red atoms, respectively, in Figure 4.5). Hydrophobic ligand points (HYD, green spheres in
Figure 4.5) are centroids placed either at the center of hydrophobic rings (rings with a
majority of carbons) or at the center of iso-propyl, methyl and tert-butyl groups
(Figure 4.5). All possible combinations of three-point triangles are then created and
stored. The positions of these ligand ISs are recalculated with each new conformation.
Figure 4.5 - Representation of the ISs found for 1bwi. (Green = hydrophobic points, blue
= hydrogen bond acceptors, red = hydrogen bond donors)
With the two lists created, a new ligand conformation is randomly generated and a
triangle match between the ligand ISs and the protein ISs is sought. The ligand triangle is
then superimposed onto the protein interaction site triangle. The ligand, which is now
oriented within the binding site, proceeds through the consensus docking approach
described in FITTED 1.5 and summarized in Figure 4.6.
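One simple way to carry a whole ligand along with a matched triangle is sketched below. It
is a frame-to-frame placement (first vertex on first vertex, first edge along first edge,
triangle planes coincident) rather than a least-squares fit, and it is shown only as an
illustration: the superposition method actually used by FITTED is not specified here, and
all names and coordinates are hypothetical.

```python
import math

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def unit(v):
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

def triangle_frame(p0, p1, p2):
    """Right-handed frame: x along p0->p1, z normal to the triangle, y = z x x."""
    x = unit(sub(p1, p0))
    z = unit(cross(x, sub(p2, p0)))
    y = cross(z, x)
    return x, y, z

def orient_onto(coords, ligand_tri, site_tri):
    """Rigidly move coords so that the ligand triangle lands on the site triangle."""
    lx, ly, lz = triangle_frame(*ligand_tri)
    sx, sy, sz = triangle_frame(*site_tri)
    moved = []
    for atom in coords:
        rel = sub(atom, ligand_tri[0])
        local = (dot(rel, lx), dot(rel, ly), dot(rel, lz))
        moved.append([site_tri[0][k] + local[0] * sx[k]
                      + local[1] * sy[k] + local[2] * sz[k] for k in range(3)])
    return moved

# Hypothetical ligand (three IS atoms plus one extra atom) and a congruent
# protein IS triangle rotated by 90 degrees about z and shifted by (1, 2, 3).
LIGAND_TRI = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
SITE_TRI = [[1.0, 2.0, 3.0], [1.0, 3.0, 3.0], [-1.0, 2.0, 3.0]]
MOVED = orient_onto(LIGAND_TRI + [[0.0, 0.0, 1.0]], LIGAND_TRI, SITE_TRI)
```

When the two triangles are not exactly congruent, this placement favours the first vertex
and the first edge; a least-squares superposition would spread the mismatch over the three
vertices instead.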
[Flowchart. Inputs: Ligand; Protein Files (Interaction Sites, Binding Site Cavity). Steps:
Generate Conformation (Corner Flap); Matching algorithm; PharmScore; MatchScore;
ClashScore; GAFFScore; Minimize; GAFFScore; Save in Population; Calculate new min
MatchScore; Population size Reached? (Yes: Evolution; No: generate a new conformation).
On Fail: Too many trials failed, reduce min MatchScore.]
Figure 4.6 - Schematic of the generation of the initial population within FITTED2.6.
With earlier versions of FITTED, the minimum MatchScore necessary for a pose to be
accepted was manually set and therefore the accuracy of the docking run was heavily
dependent on it. This is in fact an appropriate approach in drug design when the user
wants to make use of additional information (e.g., a pharmacophore developed from other
studies). However, in the case where no information is available, a value for the
minimum MatchScore is still requested. With the new additions to the generation and
orientation of the ligand, the focus switched to the automatic selection of a MatchScore
for a ligand during docking. FITTED 2.6 starts with an initial minimum MatchScore, which
can be either automatically reduced or increased depending on the ligand. As soon as a
new individual is saved, the minimum MatchScore (Min_MatchScore in eq. 4.1) is
recalculated based on eq. 4.1. The scaling factors have been empirically defined to orient
the docking without affecting the time required for the generation of the initial
population.
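[Equation 4.1, garbled in this transcript. Legible fragments: Min_MatchScore on the
left-hand side; MatchScore terms carrying an index i with a lower limit i = 1; Max
MatchScore; the numerical factors 0.5, 2 and 10.0.] (4.1)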
Evolution and convergence. With the previous versions of FITTED, it was necessary to
perform multiple runs to find the global minimum. To increase the convergence between
various runs, FITTED 2.6 incorporates a matching algorithm to create a higher-quality
initial population. Additional modifications were made to the evolution algorithm to
better mimic Lamarckian and Darwinian evolution. We sought to favour the
evolution of the best individuals. First, in order to increase the possibility of the best
individuals coupling with each other, we implemented a new evolutionary function called
the probability of elitism (pElite operator). This function copies one of the top
individuals, performs a local search on it and passes it on to the next generation. In
addition, a new selection criterion for the next generation, called Metropolis evolution,
was implemented. With this mode, the children replace the parents based on an energy-based
Metropolis criterion at a user-defined temperature. With this criterion, children that are
higher in energy have a non-zero probability to survive and be coupled in the next
generation. This approach ensures some structural diversity in the population and enables
the creation of a population which follows the Boltzmann distribution. This population can
next be used for refined scoring.
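The Metropolis criterion itself is standard and can be written in a few lines. The sketch
below assumes energies expressed in kcal/mol and a temperature in kelvin; the function name
and the way the random number generator is passed in are choices made for the example, not
details taken from FITTED.

```python
import math
import random

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)

def child_replaces_parent(e_child, e_parent, temperature, rng=random):
    """Metropolis test: always accept a lower energy, sometimes a higher one."""
    if e_child <= e_parent:
        return True
    probability = math.exp(-(e_child - e_parent) / (K_B * temperature))
    return rng.random() < probability

# Hypothetical energies: the child is 0.5 kcal/mol above its parent at 300 K.
ACCEPTED = child_replaces_parent(-41.5, -42.0, 300.0, rng=random.Random(0))
```

Raising the temperature increases the acceptance probability of higher-energy children and
therefore the structural diversity of the population, whereas a temperature close to zero
approaches a strictly elitist replacement.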
As a last modification, we moved away from the all atom representation of protein /
ligand interactions to the less time consuming united atom representation. However, the
all atom representation is kept to compute the ligand internal energy and preclude any
inversion of chiral centers. This hybrid united atom / all atom representation resulted in
an increase in speed over the last version of FITTED.
[Flowchart. Inputs: Ligand; Protein Files (Interaction Sites, Binding Site Cavity). Steps:
Generate Initial Population; Reproduction; selection by Metropolis or Steady State; pElite;
Is population converged? (Yes: Exit; No: next cycle).]
Figure 4.7 - Schematic of the evolution cycle of FITTED 2.6.
RESULTS AND DISCUSSION
Objectives. To validate this new version we have expanded the validation set developed
for FITTED 1.0 from 5 proteins (33 crystal structures) to a highly diverse set of 18
proteins and 100 crystal structures (Table 4.1). In addition to the evaluation of FITTED’s
accuracy, we decided to investigate the impact of parameters and input structure on
accuracy. As discussed above, protein structure, ligand structure (e.g., rings) and
selected parameters often have an impact on the docking accuracy. Although some parameters
are expected to increase accuracy (e.g., Standard Precision mode vs. eXtra Precision mode
in Glide or accuracy levels in GOLD), the ligand conformation should not (Table 4.1). The
set described here includes very challenging proteins such as HCV RNA polymerase9 and
metalloenzymes.37
Table 4.1 - Testing set of ligand / protein complexes.
Protein (Abbreviation)                        # of Structures   Include Water?   PDB Codes
Cyclin-dependent kinase 2 (CDK2)              4                 Yes              1aq1, 1dm2, 1pxn, 1pxn
Cyclooxygenase-2 (COX-2)                      4                 No               1cx2, 1pxx, 3pgh, 4cox
Estrogen receptor (ER)                        3                 Yes              1err, 1sj0, 3ert
Factor Xa (FXa)                               5                 Yes              1ezq, 1f0r, 1fjs, 1nfu, 1zka
GluK2 Kainate Receptor (GluK2)                5                 Yes              1s7y, 1s9t, 1sd3, 1tt1, 1yae
HCV polymerase allosteric site (HCV Allo)     9                 No               1nhu, 1nhv, 1os5, 2gir, 2hai, 2hwh, 2hwi, 2ilr, 2o5d
HCV polymerase catalytic site (HCV Cat)       7                 Yes              1yvf, 1z4u, 2fvc, 2gc8, 2giq, 2qe2, 2qe5
HIV-1 protease, mono protonated ASP (HIVP)    5                 Yes              1b6l, 1eby, 1hpo, 1hpv, 1pro
HIV-1 protease, di-protonated ASP (HIVPD)     5                 Yes              1ajv, 1ajx, 1hvr, 1hwr, 1qbs
HIV-1 reverse transcriptase (HIVRT)           4                 Yes              1c1b, 1fk9, 1rt1, 1vrt
Mannosidase (Mann)                            8                 No               1hww, 1hxk, 1ps3, 1r33, 1r34, 1tqt, 2f1a, 2f18
Matrix metalloprotease 3 (MMP-3)              4                 No               1b8y, 1bwi, 1ciz, 1d8m
P38 Map kinase (P38)                          5                 No               1a9u, 1b17, 1w7h, 1w82, 1w84
Thermolysin (Therm)                           8                 Yes              1thl, 1tlp, 1tmn, 3tmn, 4tmn, 5tmn, 6tmn, 8tln
Thrombin (Thrn)                               5                 Yes              1dwc, 1etr, 1ets, 1ett, 1tmt
Thymidine kinase (TK)                         9                 Yes              1e2k, 1e2p, 1ki3, 1ki4, 1ki7, 1ki8, 1of1, 1qhi, 2ki5
Trypsin (Tryp)                                5                 Yes              1f0u, 1o2j, 1o3g, 1o3i, 1qbo
Vitamin D receptor (VDR)                      5                 Yes              1db1, 1ie8, 1txi, 2har, 2has
Comparing versions FITTED 1.0, 1.5 and 2.6. To examine the effect of the various
modifications made to FITTED, we compared the current version to the previous ones by
using the training set initially developed to test the accuracy of FITTED 1.0.35 This set
consists of 33 protein-ligand complexes and 5 proteins (HIVP, FXa, Tryp, MMP-3
and TK). Table 4.2 summarizes the results obtained with the three FITTED versions for
self-docking (“Rigid” protein flexibility mode) and docking to flexible proteins using the
crystallographic conformation of the ligands. As previously reported, the “SemiFlexible”
protein flexibility mode corresponds to docking to a conformational ensemble of protein
structures, while flexible docking corresponds to a fully flexible protein.
Overall accuracy has declined between versions 1.5 and 2.6 (Table 4.2), although this set
is not large enough to provide a statistically relevant evaluation. Gratifyingly, an
overall increase in speed and convergence between multiple runs was also recorded
(Table 4.3).
This drop in accuracy is attributed to the manual selection of the minimum MatchScore
in version 1.5 that is now automatically determined during the generation of the initial
population. As a result, docking to FXa which was fairly successful with the previous
versions shows very poor accuracy with the current version (i.e., one out of 5 ligands is
docked accurately) while docking to the other proteins demonstrated accuracy of 60%
(Tryp), 75% (MMP-3) and even 100% (HIVPD, HIVP, TK). Manual selection of high
Min_MatchScore values with version 2.6 forces the key ionic interactions between the
ligands and binding site Asp of FXa and Tryp and restores an accuracy similar to that of
version 1.5. This automatic determination of the minimum MatchScore is important as in
a blind docking study no information is given and determination of this minimum
MatchScore value would be difficult.
Table 4.2 - Comparison of success rates of FITTED versions 1.0, 1.5 and 2.6 using the
“Dock” Docking mode.
                         % Success
Docking Mode             1.0    1.5    2.6
Rigid (self-docking)     79     93     79
Rigid (cross-docking)    47     75     56
Semiflexible             73     84     67
Flexible                 73     88     67
A three-fold increase in speed is seen with the newer version of FITTED. This increase
can be in part attributed to the introduction of a matching algorithm to orient the ligand
when generating the initial population and to the use of the hybrid united atom/all atom
representation described above. In the current work, two docking modes are used: the
“Dock” mode, which carries out an extensive conformational search, and the significantly
quicker (i.e., fewer generations) “VS” mode. The increase in the convergence of the runs
(a single run is often enough with the current version) can be attributed in part to the
inclusion of the new pElite evolutionary operator and to the matching algorithm. It
should be stressed that in VS mode the time can be significantly reduced (down to 3-4
minutes for TK inhibitors) and that code optimization is ongoing to further reduce the
CPU time required for a single run.
Table 4.3 - Comparison of time and number of runs required for various versions of
FITTED when the “Dock” docking mode is selected for rigid protein docking.
             Time (min) per run / Number of Runs
Protein      1.0       1.5      2.6
TK           63/5      30/3     8.5/1
HIVP         114/5     55/3     22/2
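The figures of Table 4.3 can be turned into speed-up factors with a short calculation. The
sketch below assumes that the total cost of docking one ligand is the time per run
multiplied by the number of runs; the dictionary layout and function names are
hypothetical.

```python
# (minutes per run, number of runs) transcribed from Table 4.3
TIMINGS = {
    "TK": {"1.0": (63, 5), "1.5": (30, 3), "2.6": (8.5, 1)},
    "HIVP": {"1.0": (114, 5), "1.5": (55, 3), "2.6": (22, 2)},
}

def total_minutes(protein, version):
    """Total time, assuming total = time per run x number of runs."""
    per_run, runs = TIMINGS[protein][version]
    return per_run * runs

def speed_up(protein, old, new, total=False):
    """Ratio of old to new time, either per run or over all required runs."""
    if total:
        return total_minutes(protein, old) / total_minutes(protein, new)
    return TIMINGS[protein][old][0] / TIMINGS[protein][new][0]

for name in TIMINGS:
    print(name,
          "per run:", round(speed_up(name, "1.5", "2.6"), 2),
          "all runs:", round(speed_up(name, "1.5", "2.6", total=True), 2))
```

Per run, the speed-up from version 1.5 to 2.6 is about 3.5 for TK (30 / 8.5) and 2.5 for
HIVP (55 / 22). Once the smaller number of runs is included, the gain under this assumption
grows to about 10.6 for TK (90 / 8.5) and 3.75 for HIVP (165 / 44).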
Comparing dependencies on input structures. We next turned our attention to the impact
of the input structure of both the ligand and the protein on the pose prediction accuracy of
FITTED. We also investigated how other docking programs perform under the same
conditions. For this comparison we decided to focus more specifically on three important
features which can affect the accuracy of docking programs: the ligand input structure,
the protein input structure and the inclusion of bridging water molecules.
Discussions with the developers and/or technical support of each program allowed a
fair comparative study and an optimal use of these programs. In fact, following
recommendations, many conditions (set of parameters) were tried. In addition, all the
major scoring functions were tried if more than one was available (in FlexX and GOLD).
Some representative data is shown in Figure 4.8. Different levels of accuracy were also
tried with some of the programs as described in the legend of Figure 4.8.
Prior to the description of the experiments and results, we thought this study should be
put in context. In this work, we wish to evaluate the docking ability of programs without
any information other than the crystal structure of the protein. Obviously in the context of
drug design and screening, any relevant information should be given to the docking
program. However, this would bring too many variables to this study as ISs in FITTED or
pharmacophores in FlexX (FlexX-Pharm) can be trained and would significantly increase
their respective accuracy.
We also wish to stress that the primary goal of this study is not to compare programs
but to evaluate the impact of the input parameters on their pose prediction accuracy. In
addition, as in any comparative study, the data collected in this work should be
considered with care as the set is still not large enough to draw conclusions on their
respective accuracy, and some hidden biases may remain as discussed below. In addition,
we have used the RMSDs between ligand crystal structures and docked poses to measure
the docking accuracy. This criterion is believed to be appropriate to evaluate the impact
of input parameters (the relative accuracy under two different conditions) but not to
compare programs.
Ligand input conformations. We first looked at the impact of the ligand structure.
Previously reported comparative studies typically have only used one conformation of the
ligand, either the crystal6 or a non-crystal5, 9, 59 conformation. In a real drug
design scenario, the bioactive conformation is unknown and therefore the non-crystal
conformation represents a more realistic scenario. To evaluate the bias when using the
crystallographic conformation of the ligand, we compared the accuracy of the docking
programs when the crystal ligand or OMEGA43-generated structures were used
alternatively as input (Figure 4.8). In this work, the OMEGA-generated structures were
used to assess the programs’ ability to dock the non-crystal conformation of the ligand. To
our knowledge, none of the docking programs assessed in this work have been trained
with OMEGA-generated structures. However, these ligand structures cannot be
considered as completely unbiased as these conformations may be preferably docked by
one of these programs, a bias that we have not evaluated.
In this work, the docked pose is assumed to be accurately predicted if it is within 2.0 Å
of the crystal binding mode when performing self-docking experiments. Even though
RMSD values are known to be potentially misleading, we believe that they will clearly
reveal drops or increases in accuracy induced by specific parameters or conformations.
When evaluating the accuracy in cross-docking experiments, the additional error introduced
when superposing the protein structures should be considered. Thus, an arbitrary RMSD
of 2.25 Å in cross-docking experiments was selected as a criterion of success.
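The success criterion can be expressed as a short check. The sketch assumes that the docked
pose and the crystal ligand list the same atoms in the same order, that both are already
expressed in the same protein frame (no refitting of the ligand), and that no correction is
made for symmetry-equivalent atoms; the function names and coordinates are hypothetical.

```python
import math

def rmsd(pose, reference):
    """Root-mean-square deviation (Å) between two equally ordered atom lists."""
    if len(pose) != len(reference) or not pose:
        raise ValueError("pose and reference must hold the same, non-zero number of atoms")
    squared = sum(math.dist(p, r) ** 2 for p, r in zip(pose, reference))
    return math.sqrt(squared / len(pose))

def is_success(pose, reference, cross_docking=False):
    """Apply the 2.0 Å (self-docking) or 2.25 Å (cross-docking) threshold."""
    threshold = 2.25 if cross_docking else 2.0
    return rmsd(pose, reference) <= threshold

# Hypothetical two-atom ligand whose docked pose is shifted by 2.1 Å along x.
CRYSTAL = [[0.0, 0.0, 0.0], [0.0, 0.0, 1.5]]
DOCKED = [[2.1, 0.0, 0.0], [2.1, 0.0, 1.5]]
SELF_OK = is_success(DOCKED, CRYSTAL)
CROSS_OK = is_success(DOCKED, CRYSTAL, cross_docking=True)
```

The example pose, with an RMSD of 2.1 Å, would be counted as a failure in self-docking but
as a success in cross-docking, which illustrates the effect of the more tolerant threshold.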
As can be seen in Figure 4.8, the accuracy of all the programs drops by 10 to 20% when
moving from crystal ligand structures (yellow bars) to OMEGA-generated structures
(orange bars), except for FITTED for which no change in accuracy was observed. This
first piece of data confirmed that a docking program should not be evaluated by using
ligand crystal structures as input. We then used the ring conformational search features
when available (red bars in Figure 4.8). The overall accuracy increases although not
reaching the one observed with crystal structures. The ring conformational search engine
used by Surflex, which covers a wider range of rings, is clearly more efficient than the
ones used by Glide4.5, GOLD and FITTED. While this manuscript was in preparation, a
new version of Glide that features a new ring search method was released but was not
used in this work. This most recent version of Glide (v5.0) uses a ring library that
accounts for both larger rings and more ring systems and includes small heterocyclic
rings previously not included. These additional features may increase its accuracy. As no
drop was observed between the use of crystal ligand structures and OMEGA-generated
ligand conformations, the FITTED ring search algorithm was not expected to improve the
binding mode prediction significantly. A closer look at the data for FITTED reveals that
the ring conformation is often searched even when the specific feature is turned off. This
can be rationalized by the generation of highly distorted structures and their optimization
through energy minimization. As a first conclusion, all the programs but FITTED are very
sensitive to the input ligand conformations and the implementation of a ring
conformational search engine can reduce this dependency.
When comparing programs, the accuracy does not vary much from one program to another.
Surflex is the most accurate (68% with the fully relaxed protein structures), followed by
Glide (66% in XP mode), GOLD (65% with ChemScore and flexible rings), FITTED (59%) and
FlexX (54% with FlexXScore). It should be stressed that this study is carried out with a
very difficult testing set including some of the most challenging proteins such as HCV RNA
polymerase. From now on, we will only present the data for the best set of conditions for
each program unless there is a significant deviation or interesting point of discussion. In
fact, the best conditions described in Figure 4.8 were found to be the best conditions for
most of the following studies. In addition, only the OMEGA-generated structures will be
discussed as we believe that the data obtained with these represents the true accuracy of
the docking programs.
[Bar chart. Y axis: % Success (0 to 90). Series: Crystal structure; OMEGA-generated
structure; OMEGA-generated structure + Flexible rings. Groups: FITTED (VS, Dock); FlexX
(SIS, TA; CS, FS, PLP, SS); Glide (RP, Std; HTVS, SP, XP); GOLD (Std, Rbt; CS, GS);
Surflex (Std, RP-H, RP-HA; std, PG).]
Figure 4.8 - Accuracy vs. ligand and protein conformations. For legend see Table 4.4.
Table 4.4 - Abbreviations used in Figure 4.8
Program    Abbreviation   Definition
FITTED     VS             Virtual Screening mode
           Dock           Docking mode
FlexX      SIS            Single Interaction Scan (matching algorithm)
           TA             Triangle Algorithm
           CS             ChemScore scoring function
           FS             FlexXScore scoring function
           PLP            PLP scoring function
           SS             ScreenScore scoring function
Glide      RP             Refined Protein (optimized protein structure)
           Std            Non-refined protein
           HTVS           High Throughput virtual screening mode
           SP             Standard precision mode
           XP             Extra Precision mode
GOLD       Std            Standard (Automatic selection of parameters)
           Rbt            Robust
           GS             GoldScore scoring function
           CS             ChemScore scoring function
Surflex    Std            Standard Docking
           PG             pgeom
           Std            Non-refined protein
           RP-H           Protein structure with hydrogen atom positions refined
           RP-HA          Refined Protein structure with a constrained optimization of the heavy atoms
Protein input conformations. We next looked at the effect of the protein conformation.
The Glide and Surflex developers recommend relaxing the protein structure prior to
docking ligands. This procedure (referred to as “refined proteins” in this manuscript, see
Table 4.4) aims at removing any inaccuracies in the crystal structure. However, it is often
carried out keeping the co-crystallized ligand in place and can be seen as a bias for
self-docking experiments. As can be seen in Figure 4.8, these procedures appeared to have
a moderate impact on the accuracy for Glide (increase of 4%) but a significant impact with
Surflex when the fully refined protein is used. In this latter case, an increase of 15% is
observed in the standard docking mode and 12% when the advanced docking procedure
(pgeom) is used.
The study described above was carried out using a set of self-docking experiments (i.e.,
the protein structure with its native ligand). We next turned our attention to cross-docking
experiments as we believe these experiments would be more representative of the true
accuracy of the docking programs when performing a virtual screen. As expected, all the
programs demonstrated a much poorer accuracy in this set of experiments, with GOLD
being the most accurate, although not significantly (Figure 4.9A). Drops of nearly 40% in
the cross-docking success rate relative to the self-docking rate were recorded. The largest
drops were attributed to Surflex and Glide when the refined protein
conformations were used. Nevertheless, cross-docking to the refined proteins remains
slightly more accurate than to the crystal structures with Glide. This observation confirms
the developers’ recommendation but also demonstrates that this is a clear bias when
comparing programs running only self-docking experiments. Such large drops in
accuracy between self- and cross-docking have often been observed.10-13, 60, 61 When the
proteins are considered flexible (by selecting the best scoring poses of the cross-docking
experiments, also known as docking to conformational ensembles), the accuracy of all the
programs but GOLD is significantly improved (Figure 4.9B).
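Docking to a conformational ensemble, as defined above, amounts to keeping the best scoring
pose across the individual cross-docking runs. A minimal sketch follows; it assumes that a
lower score is better (conventions differ between programs), and the structure labels,
scores and RMSD values are invented for the example.

```python
def best_pose_over_ensemble(results):
    """results maps a protein structure label to (score, rmsd); lower score wins."""
    best_structure = min(results, key=lambda label: results[label][0])
    return best_structure, results[best_structure]

# Hypothetical cross-docking results of one ligand against three non-native
# protein structures: (docking score, RMSD to the crystal pose in Å).
CROSS_DOCKING = {
    "protein_A": (-8.1, 4.3),
    "protein_B": (-9.6, 1.7),
    "protein_C": (-7.4, 2.9),
}
STRUCTURE, (SCORE, RMSD) = best_pose_over_ensemble(CROSS_DOCKING)
ENSEMBLE_SUCCESS = RMSD <= 2.25  # cross-docking criterion used in this work
```

The selection relies on the score only, since the RMSD is not known in a prospective study;
the ensemble result is therefore a success only if the scoring function ranks a correct pose
first across all protein conformations.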
Within FITTED, the protein can be made flexible without having recourse to multiple
runs, with run times similar to rigid protein docking. In our previous report, the protein
conformational ensembles used to evaluate this feature included the cognate protein
structures (i.e., protein conformation when co-crystallized with the ligand to be docked)
together with other protein conformations.35, 36 This approach, incorporating both self-
and cross-docking experiments in a single run, allowed us to demonstrate that FITTED
was able to identify the best protein conformation for a given ligand. However, when
evaluating the docking accuracy, we believe that this specific protein conformation
should not be included. Thus, in this work, only the non-native protein structures were
included. This feature makes FITTED slightly more accurate than Glide, GOLD, Surflex
and FlexX while FITTED was found to be less accurate in self-docking experiments.
[Bar charts, panels A) and B). Y axis: % Success (0 to 100). X axis: FITTED, Flex, Glide,
Glide RP, GOLD, Surflex, Surflex RP-HA. Series in A): Self-Docking; Cross-Docking;
FlexDock; Docking to flex. prot. Series in B): Self-Docking; Cross-Docking; Docking to
conf. ens.; Docking to flex. prot.]
Figure 4.9 - Self-docking vs. cross-docking for protein (A) with no waters and (B) with
key water molecules.
Water molecules. The three major features of FITTED are protein flexibility, ring
conformational search and displaceable water molecules. At this stage, the importance of
the first two had been investigated. But do water molecules significantly affect the
accuracy as well? To investigate the role of key water molecules, we carried out
self-docking experiments and looked at 4 distinct cases: i. all waters were removed from
the protein crystal structures (“no waters”); ii. all key waters were kept (“explicit
waters”); iii. the best scoring of the “no waters” and “explicit waters” experiments were
kept, simulating a displaceable ensemble of water molecules (“displ. ensemble”); iv. waters
were made displaceable whenever the feature was available (“displ. waters”). The
collected data is shown in Figure 4.10. A small increase was observed for all the
programs when the key waters were kept. When the waters were made displaceable, as
in the “displ. ensemble” and “displ. waters” experiments, the accuracy further
increased. As previously observed by the GOLD developers, the strategy implemented in
GOLD did not improve the docking ability of this program.32 The same observation was
made with the FlexX program. In contrast, the displaceable water approach implemented
in FITTED was found to be more accurate than the explicit water approach and even than the
“displ. ensemble”. In fact, displ. ensemble simulates either all the waters on or none while
FITTED can displace each water independently. In addition, as the various optimizations
of FITTED were carried out with this feature turned on, the best results are obtained when
it is used.
[Bar chart. Y axis: % Success (0 to 100). X axis: FITTED, Flex, Glide, Glide RP, GOLD,
Surflex, Surflex RP-HA. Series: dry; explicit waters; displ. ensemble; displ. waters.]
Figure 4.10 - Accuracy and water molecules in self-docking experiments.
We then investigated the combination of protein flexibility and water molecules. Figure
4.9B summarizes the data for self- and cross-docking experiments in the presence of
displaceable waters if implemented and displ. ensemble if not. Once more, a slight
improvement is observed with most of the programs, indicating that considering both
features leads to at least similar or improved results (Figures 4.9 and 4.12) and should be
considered by other developers.
The presence of hydrogen bond donors or acceptors in the protein binding site is
expected to help find the proper orientation of the ligand. In contrast, non-directional
hydrophobic interactions are directly related to the nature of the compound and protein
binding site and their respective solvation energies more than to any “real” hydrophobic
interactions between protein and ligands. These interactions are therefore expected to be
more difficult to identify in a computationally tractable manner. The proteins of our set
were classified as polar (CDK2, COX-2, ER, FXa, GluK2, HIVP, HIVPD, HIVRT, Thrn,
TK, Tryp, VDR), hydrophobic (HCV Allo, HCV Cat, P38) and metal-containing
enzymes (Mann, MMP-3, Therm), based on the main ISs identified by ProCESS. The
cross-docking data was reorganized to account for this factor and the results are
summarized in Figure 4.11. GOLD appears to be fairly insensitive to the protein type
while Surflex and FlexX were much less accurate with metalloenzymes. The automatic
metal parameters of GOLD and FITTED may explain their good accuracy with
metalloenzymes. More striking is the much greater accuracy of FITTED with hydrophilic
enzymes than with hydrophobic enzymes while Glide and FlexX are significantly more
accurate with hydrophobic proteins than with polar proteins. Interestingly, the SIS
algorithm in FlexX, developed to improve the accuracy with hydrophobic proteins, led
to an increase in accuracy with this class of proteins when compared to the traditional
FlexX algorithm. Once more this data indicates that the set used for any comparative study
would have a significant effect on the relative accuracies of programs. For instance,
FITTED would be the second best program if hydrophilic proteins were selected while
being the worst if only hydrophobic proteins were selected.
[Bar chart. Y axis: % Success (0 to 100). X axis: FITTED; FlexX - SIS - FS; FlexX - TA - FS;
Glide; GOLD; Surflex. Series: Metal; Hydrophobic; Polar.]
Figure 4.11 - Protein class and accuracy on cross-docking experiments.
As a summary, the accuracy of each of the assessed programs using the optimal
conditions is shown in Figure 4.12 for self-docking and cross-docking experiments as
well as docking to flexible proteins when available. Overall, the levels of accuracy given
here are significantly lower than the ones provided in other comparative studies.5, 6, 9 In
fact, we found our testing set to be much more challenging than the one we used
previously. Additionally, part of this drop (10-20%) is directly attributed to the use of
OMEGA-generated structures and another part (10-30%) to the use of cross-docking
experiments in place of self-docking experiments used elsewhere.
To further assess the impact of such a protein-specific training and the novel placement
of ISs implemented in the current version of ProCESS, we have carried out additional
experiments. When protein-specific information (e.g., ISs derived from known
ligand/protein complexes) is manually given to FITTED, the accuracy increases
significantly (data not shown). This clearly demonstrates that this manual placement of
ISs (and more specifically of hydrophobic sites) for docking with FITTED remains better
than the automated placement, which should be further improved.
Overall, FITTED, Glide, GOLD and Surflex show very similar accuracies on our testing
set for self-docking (i.e., rigid protein). When protein flexibility is considered (ligands
docked to all non-native protein structures in multiple runs and in a single run with
FITTED), FITTED is slightly more accurate (the only one featuring displaceable waters and
protein flexibility simultaneously), followed by GOLD and Glide. It is worth recalling
that FITTED was outperformed by the other programs in the self-docking experiments. It
is clear from Figure 4.12 that implementing protein flexibility would significantly improve
Surflex, Glide and FlexX accuracy while no significant improvements are expected for
GOLD which already uses a soft protein representation (Lennard-Jones 8-4).
The following numbers are given as rough estimates as these programs were run on
various computers and supercomputers with varying processor speed and some programs
(Surflex, FlexX) do not output the CPU time. In the extreme cases FlexX docks a
compound every 30 s while FITTED is the slowest by a factor of 10 to 15. When the
criterion of success is made more stringent (RMSD ≤1.0 Å for self-docking and ≤1.25 Å
for cross-docking, Figure 4.12b), FITTED slightly outperforms the other programs for
self-docking, and all programs show similar accuracy in cross-docking with this stricter
criterion. It also shows that the arbitrary limit of 2 Å does not much affect the ranking of
programs by their RMSD-derived accuracy.
[Bar charts, panels (a) and (b). Y axis: % Success (0 to 100). X axis: FITTED, FlexX,
Glide, GOLD, Surflex. Series: Self-docking; Cross-docking; Flexible proteins.]
Figure 4.12 - Accuracy of the programs with OMEGA-generated structures. FITTED: Dock
mode; FlexX: ScreenScore and SIS used for the incremental construction; Glide: XP and
refined protein; GOLD: ChemScore; Surflex: pgeom and protein with refined hydrogen
positions. For the flexible protein, the “fully flexible” protein mode as implemented in
FITTED is used. (a) Success criterion: RMSD ≤2.0 Å for self-docking and ≤2.25 Å for
cross-docking; (b) success criterion: RMSD ≤1.0 Å for self-docking and ≤1.25 Å for
cross-docking.
From this comparative study we confirmed that a few guidelines should be considered
to perform a proper evaluation: 1. the ligand should be in a conformation other than the
crystal structure; 2. both cross-docking and self-docking experiments should be carried
out; 3. refining the protein structure using the co-crystallized ligand may bias the
self-docking accuracy but does not affect the cross-docking accuracy.
CONCLUSION
We have further modified our docking program FITTED and implemented a ring search
method into the genetic algorithm as well as a matching algorithm to produce the initial
population. This advanced version was tested against major docking programs. It should
be stressed that this work was not intended to rank programs as the ranking varies from
one set of protein / ligand complexes to another. In fact, this work demonstrated that
ranking can significantly vary depending on the protein / ligand set considered (e.g.,
hydrophilic, hydrophobic) as well as the input ligand and protein conformations (e.g.,
crystal structures or OMEGA-generated, self- or cross-docking, with or without water
molecules). With this study, we demonstrated the impact of protein and ligand
conformations as well as protein flexibility and water molecules on the accuracy of
docking programs. We have been working on these last two properties for the last few
years and have shown herein that these two features significantly improve the accuracy of
our docking program FITTED. The placement of hydrophobic interaction sites has been
identified as a remaining issue, and more work is currently ongoing to better understand
and identify hydrophobic pockets. This work may also help developers to better
understand the weaknesses and strengths of their respective programs.
EXPERIMENTAL SECTION
Preparation of the docking set. Structures were downloaded from the PDB62 and selected
based on diversity of the ligands, presence of water molecules, flexibility of the protein
and resolution of the crystal structure below 2.5 Å. In some cases crystal structures with
resolutions higher than 2.5 Å were kept to increase the diversity of the conformations
seen within a protein. All structures were prepared using Maestro,63 the graphical
interface to the Schrödinger Suite of programs. Structures of the same protein were then
superimposed using the protein structure alignment option within Maestro. The protein
sequences were then homogenized by mutating and deleting missing residues when at
least 10 Å from the binding cavity. If a missing residue was closer than the minimum
distance the structure was removed from the set. Hydrogen atoms were added using
Maestro and energy-minimized using the OPLS_2005 forcefield. All non-conserved
waters were removed from all the structures. Conserved or key water molecules were
defined as water molecules that make at least 2 hydrogen bonds with the protein and one
with the ligand. The protein and ligand structures were then separated. The ligand crystal
structure was used as input into OMEGA64 to generate new starting conformations. For
this study we had OMEGA only output the most thermodynamically stable conformation
using all standard default values.
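The key-water criterion stated above (at least two hydrogen bonds with the protein and one
with the ligand, read here as "at least one") can be illustrated with a minimal sketch. It
assumes the hydrogen-bond counts have already been computed by some other tool; the Water
record and the function names are hypothetical and are not part of Maestro or FITTED.

```python
from dataclasses import dataclass


@dataclass
class Water:
    """A crystallographic water with precomputed hydrogen-bond counts (hypothetical record)."""
    name: str
    hbonds_to_protein: int
    hbonds_to_ligand: int


def is_key_water(w, min_protein=2, min_ligand=1):
    """Criterion from the text: >= 2 H-bonds to the protein and >= 1 to the ligand."""
    return w.hbonds_to_protein >= min_protein and w.hbonds_to_ligand >= min_ligand


def keep_key_waters(waters):
    """Return only the waters that would be retained in the 'wet' protein structure."""
    return [w for w in waters if is_key_water(w)]


# Illustrative input only; these counts are made up.
waters = [Water("HOH301", 2, 1), Water("HOH302", 3, 0), Water("HOH303", 1, 2)]
kept = [w.name for w in keep_key_waters(waters)]
```

In this toy input only the first water satisfies both conditions; the other two fail either
the ligand or the protein requirement and would be removed.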
Docking programs methodology. A recent review by our group65 found over 60 docking
programs that have been published. It is becoming ever harder to distinguish which
program is best for a specific protein or in general. To assess how well FITTED performs
compared to other docking programs, a small comparative study was undertaken using
FlexX, Glide, GOLD and Surflex. It is worth noting that even though AutoDock11 or a
combination of Glide and Prime17 can allow for protein flexibility, they were not used due
to time constraints. FlexX also has a module for protein flexibility (FlexX-Ensemble),
but the version used in this study was incompatible with the FlexX-Ensemble module.
For all docking runs the OMEGA-generated ligand conformation was used for
self-docking, cross-docking and flexible-protein docking. The crystallized ligand structure
was run separately but only the self-docking data are shown. All the docking experiments
were performed using dry proteins (no waters present) unless otherwise stated. When
protein structures contained key water molecule(s), additional docking experiments were
performed on the wet protein (only the key water molecules for that protein crystal
structure are present). If the docking program has the ability to dock with displaceable
crystallographic waters, additional sets of docking experiments were performed on a
displaceable-water protein structure (all possible water positions occur). When defining
the active site for all the proteins, the largest ligand of the set for a particular protein was
used.
Table 4.5 - List of ligands used to define protein binding sites
Protein                             PDB Code
CDK2                                1aq1
COX-2                               4cox
Estrogen Receptor                   1sj0
Factor Xa                           1nfu
Kainate Glutamate Receptor          1yae
HCV Polymerase Allosteric Pocket    2o5d
HCV Polymerase Catalytic Pocket     2fvc
HIV-1 Protease                      1pro
HIV-1 Protease Diols                1hvr
HIV Reverse Transcriptase           1vrt
Mannosidase                         2f18
MMP-III                             1d8m
p38 Map Kinase                      1w82
Thermolysin                         3tmn
Thymidine Kinase                    1tmt
Thrombin                            1qhi
Trypsin                             1qbo
Vitamin D receptor                  2has
In all cases docking success was measured using the standard RMSD criterion (RMSD
between the heavy atoms of the docked pose and the reported crystal structure). For
self-docking runs we used a criterion of less than 2.0 Å but increased this to 2.25 Å for
cross-docking to account for the error resulting from the superposition of the proteins.
Only the top-scoring pose was used for accuracy measures, as it would be the one picked
in a VS experiment whether or not the docking run was successful. The RMSD was
calculated using the tool provided by each program. One exception was the case of FlexX
and Glide, where the GOLD RMSD script was also used for the calculation of the RMSDs
for HIV-1 protease ligands. Due to the C2-symmetric nature of the HIV-1 protease binding
site it was necessary to calculate the RMSD on two orientations of the ligand, the original
and the ligand rotated by 180°. With FlexX and Glide this second RMSD could not be
obtained directly, since these programs write the RMSD to the output file of the run and it
could not be recomputed. The rotated ligand was generated by rotating a duplicated copy
of the protein/ligand complex in space and re-superimposing it using InsightII. With all
programs the RMSD was calculated using both orientations for the HIV-1 protease
ligands, with the lowest RMSD being kept.
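The two-orientation RMSD described above can be sketched as follows. This is only an
illustration under simplifying assumptions: coordinates are already in a common frame, atoms
are listed in the same order, and the C2 axis is taken to be the z axis through the origin
(in the study the rotated reference was instead produced by rotating and re-superimposing
the complex in InsightII). All function names are hypothetical.

```python
import math


def rmsd(pose, reference):
    """Heavy-atom RMSD between two equally ordered coordinate lists (no refitting)."""
    if len(pose) != len(reference):
        raise ValueError("pose and reference must have the same number of atoms")
    sq = sum((p - r) ** 2 for a, b in zip(pose, reference) for p, r in zip(a, b))
    return math.sqrt(sq / len(pose))


def rotate_180_about_z(coords):
    """Rotate by 180 degrees about the z axis (assumed here to be the C2 axis)."""
    return [(-x, -y, z) for x, y, z in coords]


def symmetric_rmsd(pose, reference):
    """Lowest RMSD over the original and the C2-rotated reference ligand."""
    return min(rmsd(pose, reference), rmsd(pose, rotate_180_about_z(reference)))


def is_success(value, cross_docking=False):
    """Thresholds from the text: below 2.0 A (self-docking) or 2.25 A (cross-docking)."""
    return value < (2.25 if cross_docking else 2.0)


# Made-up coordinates: the pose sits close to the symmetry-related orientation.
reference = [(1.0, 0.0, 0.0), (2.0, 1.0, 0.5), (3.0, 1.0, 1.0)]
pose = [(-1.1, 0.0, 0.0), (-2.0, -1.0, 0.5), (-3.0, -1.1, 1.0)]
plain = rmsd(pose, reference)
best = symmetric_rmsd(pose, reference)
```

With these toy coordinates the pose would be counted as a failure against the original
reference alone but as a success once the rotated orientation is also considered, which is
the situation the two-orientation calculation is meant to handle.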
In all cases we have been in contact with either the developers themselves or the
technical support of the programs discussed herein to determine the best conditions for
our comparative study. Where they were uncertain we ran all possibilities.
FlexX 3.1.0 (refs 33, 60, 66, 67). FlexX uses an incremental construction algorithm to
build up the ligand within the active site. To determine the placement of the fragments,
FlexX uses a set of interaction sites and then a matching algorithm to find the best match
between the fragment and the interactions. FlexX can account for displaceable water
molecules using the particle water concept, where all possible combinations of the water
being present or not present are tried and the best-scoring combination is kept. We used
the FlexX 3.1.0 interface to construct all the project files for each individual crystal
structure. For each protein the binding site hydrogen positions were manually oriented to
create optimal hydrogen bonds with the protein and/or the native ligand. For each
structure where water molecules occur, a project file was created for the dry protein (no
waters), the wet protein (waters are treated as spheres) and the displaced-waters protein
(waters are considered as spheres and allowed to be displaceable). Four settings.pxx files
were created so that we could run FlexX through the command-line interface using the
FlexXScore, ChemScore, ScreenScore and PLP scoring functions. Within the .bat file we
would turn on the ring search using the corina_f executable68 provided by Molecular
Networks by using the keyword SET RING_MODE to 1 and/or turn on the SIS docking
algorithm by using the PLACEBAS 1 keyword. At the time of this publication
FlexX-Ensemble was not available for FlexX 3.1.0 and was deemed not ready for a
comparative study by the developers.
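The idea of trying every present/absent combination of the waters and keeping the best one,
as described above, can be sketched generically. This is not FlexX's implementation: the
scoring function is a caller-supplied placeholder, lower scores are assumed to be better,
and all names are hypothetical.

```python
from itertools import product


def best_water_combination(waters, score):
    """Try every on/off combination of the waters and keep the best-scoring one.

    `score` maps a tuple of retained waters to a number; lower is assumed
    to be better in this sketch.
    """
    best_set, best_score = None, float("inf")
    for mask in product((False, True), repeat=len(waters)):
        kept = tuple(w for w, on in zip(waters, mask) if on)
        s = score(kept)
        if s < best_score:
            best_set, best_score = kept, s
    return best_set, best_score


# Toy scoring function (purely illustrative): each water adds a fixed term,
# negative if it bridges favourably, positive if it clashes with the pose.
contribution = {"W1": -1.5, "W2": 0.8, "W3": -0.2}


def toy_score(kept):
    return -10.0 + sum(contribution[w] for w in kept)


best_set, best_score = best_water_combination(["W1", "W2", "W3"], toy_score)
```

The enumeration visits 2^n combinations for n waters, so it is only practical for the
handful of key waters typically found in a binding site.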
Glide 4.5 (refs 48, 69, 70). Glide uses a funnel approach to docking by initially creating
a series of ligand conformations then removing the unfavourable ones. A refinement is
then performed by an energy minimization followed by a restricted Monte Carlo search on
the lowest-energy conformations. This Monte Carlo search is used to refine the initial
structure.
The protein structures in mol2 format were prepared using the protein preparation
wizard with default values (these proteins are hereafter referred to as refined proteins).
Grids were prepared for the initially prepared protein as well as the refined proteins using
a 30 Å box, with the center of the grid defined by the largest ligand of the protein in
our set. Default parameters were used to dock with Glide in the HTVS, SP and XP docking
modes. With each docking mode, both grids were used individually. With Glide the
default is to allow for flexible rings; therefore, to study Glide with rigid rings this
functionality was turned off.
GOLD 3.2 (ref 71). GOLD performs a conformational search of the ligand by using a
genetic algorithm. When dealing with displaceable waters GOLD considers all possible
combinations of the water being present or not present, keeping the best-scoring
combination. The prepared ligand and proteins were used in mol2 format for GOLD. The
automatic settings with the default parameters were used. When docking to the wet
protein (proteins with key water molecules), the orientation of the waters is optimized but
the waters are not displaced. When docking to displaceable-water protein structures, the
waters are set to displaceable and their orientation is optimized. To examine the
corner-flap approach, flip_free_corners was set to 1 in the .conf file. Upon discussion with
the developers it was suggested to try a more robust search. This was done by using the
keywords autoscale = 1.5 and autoscale_nopt_min = 15000. It was also suggested to set
early_termination to 0. Both additions were tried and are referred to as the robust search
in the results.
Surflex 2.3 (ref 72). The Surflex docking algorithm combines a shape-matching algorithm
with the matching of a protomol that is similar to a pharmacophore. Surflex uses an
incremental construction algorithm with relinking of the fragmented ligand. No interface
was provided with Surflex, which was therefore used from the command line. For Surflex
the prepared ligands and proteins were used in mol2 format. The protomol was initially
generated using the largest ligand of the set for that protein. Surflex was then used to
dock using all the default values. To perform the conformational search of rings the
+rings command was used. Upon discussion with the developers it was suggested that we
try the –pgeom command, which is meant to increase docking accuracy, and use their
program to optimize the hydrogens of the proteins. Docking runs using these suggested
conditions were performed.
FITTED 2.6 (ref 73). The files describing the proteins, interaction sites and cavity sites
were prepared using the PROCESS module while the ligands were prepared using SMART.
The created files were next used by FITTED. The default parameters were used with each
of these three programs. FITTED is now available at www.FITTED.ca.
Acknowledgment. We thank CIHR and Virochem Pharma for financial support as well as
the Canada Foundation for Innovation for financial support through the New
Opportunities Fund program. CRC held a CIHR-funded Chemical Biology Scholarship
during a portion of this study. We are thankful to the RQCHP for allocation of computer
resources for this study. We would also like to thank the following people for their input
and suggestions on how to improve the docking accuracy of their programs: Ajay Jain,
UCSF (Surflex) and the support departments of both CCDC (GOLD) and Schrödinger
(Glide).
Supporting Information Available: A more detailed listing of results and the comparative
study set is available free of charge via the Internet at http://pubs.acs.org.
REFERENCES
1. Cozza, G.; Bonvini, P.; Zorzi, E.; Poletto, G.; Pagano, M. A.; Sarno, S.; Donella-
Deana, A.; Zagotto, G.; Rosolen, A.; Pinna, L. A.; Meggio, F.; Moro, S.,
Identification of Ellagic Acid as Potent Inhibitor of Protein Kinase CK2: A
Successful Example of a Virtual Screening Application. J. Med. Chem. 2006, 49
(8), 2363-2366.
2. De Graaf, C.; Oostenbrink, C.; Keizers, P. H. J.; Van Der Wijst, T.; Jongejan, A.;
Vermeulen, N. P. E., Catalytic site prediction and virtual screening of cytochrome
P450 2D6 substrates by consideration of water and rescoring in automated docking.
J. Med. Chem. 2006, 49 (8), 2417-2430.
3. Bissantz, C.; Folkers, G.; Rognan, D., Protein-based virtual screening of chemical
databases. 1. Evaluation of different docking/scoring combinations. J. Med. Chem.
2000, 43 (25), 4759-4767.
4. Bursulaya, B. D.; Totrov, M.; Abagyan, R.; Brooks III, C. L., Comparative study of
several algorithms for flexible ligand docking. J. Comput.-Aided Mol. Des. 2003,
17 (11), 755-763.
5. Kontoyianni, M.; McClellan, L. M.; Sokol, G. S., Evaluation of docking
performance: comparative data on docking algorithms. J. Med. Chem. 2004, 47 (3),
558-565.
6. Perola, E.; Walters, W. P.; Charifson, P. S., A detailed comparison of current
docking and scoring methods on systems of pharmaceutical relevance. Proteins
2004, 56 (2), 235-249.
7. Kellenberger, E.; Rodrigo, J.; Muller, P.; Rognan, D., Comparative evaluation of
eight docking tools for docking and virtual screening accuracy. Proteins 2004, 57
(2), 225-242.
8. Cummings, M. D.; DesJarlais, R. L.; Gibbs, A. C.; Mohan, V.; Jaeger, E. P.,
Comparison of automated docking programs as virtual screening tools. J. Med.
Chem. 2005, 48 (4), 962-976.
9. Warren, G. L.; Andrews, C. W.; Capelli, A. M.; Clarke, B.; LaLonde, J.; Lambert,
M. H.; Lindvall, M.; Nevins, N.; Semus, S. F.; Senger, S.; Tedesco, G.; Wall, I. D.;
Woolven, J. M.; Peishoff, C. E.; Head, M. S., A Critical Assessment of Docking
Programs and Scoring Functions. J. Med. Chem. 2006, 49 (20), 5912-5931.
10. Cavasotto, C. N.; Abagyan, R. A., Protein flexibility in ligand docking and virtual
screening to protein kinases. J. Mol. Biol. 2004, 337 (1), 209-225.
11. Osterberg, F.; Morris, G. M.; Sanner, M. F.; Olson, A. J.; Goodsell, D. S.,
Automated docking to multiple target structures: Incorporation of protein mobility
and structural water heterogeneity in autodock. Proteins 2002, 46 (1), 34-40.
12. Murray, C. W.; Baxter, C. A.; Frenkel, A. D., The sensitivity of the results of
molecular docking to induced fit effects: Application to thrombin, thermolysin and
neuraminidase. J. Comput.-Aided Mol. Des. 1999, 13 (6), 547-562.
13. Erickson, J. A.; Jalaie, M.; Robertson, D. H.; Lewis, R. A.; Vieth, M., Lessons in
molecular recognition: The effects of ligand and protein flexibility on molecular
docking accuracy. J. Med. Chem. 2004, 47 (1), 45-55.
14. Cavasotto, C. N.; J.W. Orry, A.; Abagyan, R. A., The challenge of considering
receptor flexibility in ligand docking and virtual screening. Curr. Comput.-Aided
Drug Des. 2005, 1, 423-440.
15. Klebe, G., Virtual ligand screening: strategies, perspectives and limitations. Drug
Discov. Today 2006, 11 (13-14), 580-594.
16. Sousa, S. F.; Fernandes, P. A.; Ramos, M. J., Protein-ligand docking: Current status
and future challenges. Proteins 2006, 65 (1), 15-26.
17. Sherman, W.; Day, T.; Jacobson, M. P.; Friesner, R. A.; Farid, R., Novel Procedure
for Modeling Ligand/Receptor Induced Fit Effects. J. Med. Chem. 2006, 49 (2),
534-553.
18. Claussen, H.; Buning, C.; Rarey, M.; Lengauer, T., FLEXE: Efficient molecular
docking considering protein structure variations. J. Mol. Biol. 2001, 308 (2), 377-
395.
19. Schnecke, V.; Kuhn, L. A., Virtual screening with solvation and ligand-induced
complementarity. Perspect. Drug. Discov. 2000, 20, 171-190.
20. Zavodszky, M. I.; Lei, M.; Thorpe, M. F.; Day, A. R.; Kuhn, L. A., Modeling
correlated main-chain motions in proteins for flexible molecular recognition.
Proteins 2004, 57 (2), 243-261.
21. Sotriffer, C. A.; Dramburg, I., "In situ cross-docking" to simultaneously address
multiple targets. J. Med. Chem. 2005, 48 (9), 3122-3125.
22. Li, Z.; Lazaridis, T., Water at biomolecular binding interfaces. Phys. Chem. Chem.
Phys. 2007, 9 (5), 573-581.
23. Baldwin, E. T.; Bhat, T. N.; Gulnik, S.; Liu, B.; Topol, I. A.; Kiso, Y.; Mimoto, T.;
Mitsuya, H.; Erickson, J. W., Structure of HIV-1 protease with KNI-272, a tight-
binding transition-state analog containing allophenylnorstatine. Structure 1995, 3
(6), 581-590.
24. Wang, Y. X.; Freedberg, D. I.; Wingfield, P. T.; Stahl, S. J.; Kaufman, J. D.; Kiso,
Y.; Bhat, T. N.; Erickson, J. W.; Torchia, D. A., Bound water molecules at the
interface between the HIV-1 protease and a potent inhibitor, KNI-272, determined
by NMR. J. Am. Chem. Soc. 1996, 118 (49), 12287-12290.
25. Kervinen, J.; Thanki, N.; Zdanov, A.; Tino, J.; Barrish, J.; Lin, P. F.; Colonno, R.;
Riccardi, K.; Samanta, H.; Wlodawer, A., Structural analysis of the native and
drug-resistant HIV-1 proteinases complexed with an aminodiol inhibitor. Protein
Pept. Lett. 1996, 3 (6), 399-406.
26. Hong, L.; Zhang, X. J.; Foundling, S.; Hartsuck, J. A.; Tang, J., Structure of a
G48H mutant of HIV-1 protease explains how glycine-48 replacements produce
mutants resistant to inhibitor drugs. FEBS Lett. 1997, 420 (1), 11-16.
27. Louis, J. M.; Dyda, F.; Nashed, N. T.; Kimmel, A. R.; Davies, D. R., Hydrophilic
peptides derived from the transframe region of Gag-Pol inhibit the HIV-1 protease.
Biochemistry 1998, 37 (8), 2105-2110.
28. Lam, P. Y. S.; Jadhav, P. K.; Eyermann, C. J.; Hodge, C. N.; Ru, Y.; Bacheler, L.
T.; Meek, J. L.; Otto, M. J.; Rayner, M. M.; Wong, Y. N.; Chang, C. H.; Weber, P.
C.; Jackson, D. A.; Sharpe, T. R.; Erickson-Viitanen, S., Rational design of potent,
bioavailable, nonpeptide cyclic ureas as HIV protease inhibitors. Science 1994, 263
(5145), 380-384.
29. Grzesiek, S.; Bax, A.; Nicholson, L. K.; Yamazaki, T.; Wingfield, P.; Stahl, S. J.;
Eyermann, C. J.; Torchia, D. A.; Nicholas Hodge, C.; Lam, P. Y. S.; Jadhav, P. K.;
Chang, C. H., NMR evidence for the displacement of a conserved interior water
molecule in HIV protease by a non-peptide cyclic urea-based inhibitor. J. Am.
Chem. Soc. 1994, 116 (4), 1581-1582.
30. Hodge, C. N.; Aldrich, P. E.; Bacheler, L. T.; Chang, C. H.; Eyermann, C. J.;
Garber, S.; Grubb, M.; Jackson, D. A.; Jadhav, P. K.; Korant, B.; Lam, P. Y. S.;
Maurin, M. B.; Meek, J. L.; Otto, M. J.; Rayner, M. M.; Reid, C.; Sharpe, T. R.;
Shum, L.; Winslow, D. L.; Erickson-Viitanen, S., Improved cyclic urea inhibitors
of the HIV-1 protease: Synthesis, potency, resistance profile, human
pharmacokinetics and X-ray crystal structure of DMP 450. Chem. Biol. 1996, 3 (4),
301-314.
31. Moitessier, N.; Westhof, E.; Hanessian, S., Docking of aminoglycosides to
hydrated and flexible RNA. J. Med. Chem. 2006, 49 (3), 1023-1033.
32. Verdonk, M. L.; Chessari, G.; Cole, J. C.; Hartshorn, M. J.; Murray, C. W.;
Nissink, J. W. M.; Taylor, R. D.; Taylor, R., Modeling water molecules in protein-
ligand docking using GOLD. J. Med. Chem. 2005, 48 (20), 6504-6515.
33. Rarey, M.; Kramer, B.; Lengauer, T., The particle concept: Placing discrete water
molecules during protein- ligand docking predictions. Proteins 1999, 34 (1), 17-28.
34. 14.3.5 Mapping amino acids to templates. In FlexX Release 3 with GUI User Guide
and Technical Reference, BioSolveIT GmbH: 2007; p 310.
35. Corbeil, C. R.; Englebienne, P.; Moitessier, N., Docking ligands into flexible and
solvated macromolecules. 1. Development and validation of FITTED 1.0. J. Chem.
Inf. Model. 2007, 47 (2), 435-449.
36. Corbeil, C. R.; Englebienne, P.; Yannopoulos, C. G.; Chan, L.; Das, S. K.;
Bilimoria, D.; Heureux, L.; Moitessier, N., Docking ligands into flexible and
solvated macromolecules. 2. Development and application of FITTED 1.5 to the
virtual screening of potential HCV polymerase inhibitors. J. Chem. Inf. Model.
2008, 48 (4), 902-909.
37. Englebienne, P.; Fiaux, H.; Kuntz, D. A.; Corbeil, C. R.; Gerber-Lemaire, S.; Rose,
D. R.; Moitessier, N., Evaluation of docking programs for predicting binding of
Golgi α-mannosidase II inhibitors: A comparison with crystallography. Proteins
2007, 69 (1), 160-176.
38. Kirchmair, J.; Markt, P.; Distinto, S.; Wolber, G.; Langer, T., Evaluation of the
performance of 3D virtual screening protocols: RMSD comparisons, enrichment
assessments, and decoy selection—What can we learn from earlier mistakes? J.
Comput.-Aided Mol. Des. 2008, 22 (3), 213-228.
39. Good, A.; Oprea, T., Optimization of CAMD techniques 3. Virtual screening
enrichment studies: a help or hindrance in tool selection? J. Comput.-Aided Mol.
Des. 2008, 22 (3), 169-178.
40. Jain, A. N., Bias, reporting, and sharing: Computational evaluations of docking
methods. J. Comput.-Aided Mol. Des. 2008, 22 (3-4), 201-212.
41. Jain, A. N., Bias, reporting, and sharing: computational evaluations of docking
methods. J. Comput.-Aided Mol. Des. 2007, 1-12.
42. Hartshorn, M. J.; Verdonk, M. L.; Chessari, G.; Brewerton, S. C.; Mooij, W. T. M.;
Mortenson, P. N.; Murray, C. W., Diverse, high-quality test set for the validation of
protein-ligand docking performance. J. Med. Chem. 2007, 50 (4), 726-741.
43. Boström, J.; Greenwood, J. R.; Gottfries, J., Assessing the performance of OMEGA
with respect to retrieving bioactive conformations. J. Mol. Graph. Modell. 2003, 21
(5), 449-462.
44. Gasteiger, J.; Sadowski, J.; Schuur, J.; Selzer, P.; Steinhauer, L.; Steinhauer, V.,
Chemical information in 3D space. J. Chem. Inf. Comput. Sci. 1996, 36 (5), 1030-
1037.
45. Boström, J., Reproducing the conformations of protein-bound ligands: A critical
evaluation of several popular conformational searching tools. J. Comput.-Aided
Mol. Des. 2001, 15 (12), 1137-1152.
46. Jain, A. N.; Nicholls, A., Recommendations for evaluation of computational
methods. J. Comput.-Aided Mol. Des. 2008, 22 (3-4), 133-139.
47. Jain, A., Surflex-Dock 2.1: Robust performance from ligand energetic modeling,
ring flexibility, and knowledge-based search. J. Comput.-Aided Mol. Des. 2007, 21
(5), 281-306.
48. Friesner, R. A.; Banks, J. L.; Murphy, R. B.; Halgren, T. A.; Klicic, J. J.; Mainz, D.
T.; Repasky, M. P.; Knoll, E. H.; Shelley, M.; Perry, J. K.; Shaw, D. E.; Francis, P.;
Shenkin, P. S., Glide: A new approach for rapid, accurate docking and scoring. 1.
Method and assessment of docking accuracy. J. Med. Chem. 2004, 47 (7), 1739-
1749.
49. 5.2.3 Setting Docking Options. In Glide 5.0 User Manual, Schrödinger, LLC.:
2008; p 44.
50. Jones, G.; Willett, P.; Glen, R. C., Molecular recognition of receptor sites using a
genetic algorithm with a description of desolvation. J. Mol. Biol. 1995, 245 (1), 43-
53.
51. Jones, G.; Willett, P.; Glen, R. C.; Leach, A. R.; Taylor, R., Development and
validation of a genetic algorithm for flexible docking. J. Mol. Biol. 1997, 267 (3),
727-748.
52. Payne, A. W. R.; Glen, R. C., Molecular recognition using a binary genetic search
algorithm. J. Mol. Graph. 1993, 11 (2), 74-91+121.
53. Goto, H.; Osawa, E., Corner flapping: A simple and fast algorithm for exhaustive
generation of ring conformations. J. Am. Chem. Soc. 1989, 111 (24), 8950-8951.
54. 5.2.3 Setting Docking Options. In Glide 4.5 User Manual, Schrödinger, LLC.:
2007; p 42.
55. Thompson, H. B., Calculation of Cartesian Coordinates and Their Derivatives from
Internal Molecular Coordinates. J. Chem. Phys. 1967, 47 (9), 3407-3410.
56. Harvey, M. A.; Baggio, S.; Baggio, R., A new simplifying approach to molecular
geometry description: The vectorial bond-valence model. Acta Crystallogr. B 2006,
62 (6), 1038-1042.
57. O'Boyle, N. M.; Brewerton, S. C.; Taylor, R., Using buriedness to improve
discrimination between actives and inactives in docking. J. Chem. Inf. Model. 2008,
48 (6), 1269-1278.
58. Moitessier, N.; Henry, C.; Maigret, B.; Chapleur, Y., Combining pharmacophore
search, automated docking, and molecular dynamics simulations as a novel strategy
for flexible docking. Proof of concept: Docking of arginine-glycine-aspartic acid-
like compounds into the alphav beta3 Binding Site. J. Med. Chem. 2004, 47 (17),
4178-4187.
59. Chen, H.; Lyne, P. D.; Giordanetto, F.; Lovell, T.; Li, J., On evaluating molecular-
docking methods for pose prediction and enrichment factors. J. Chem. Inf. Model.
2006, 46 (1), 401-415.
60. Kramer, B.; Rarey, M.; Lengauer, T., Evaluation of the FLEXX incremental
construction algorithm for protein-ligand docking. Proteins 1999, 37 (2), 228-241.
61. Birch, L.; Murray, C. W.; Hartshorn, M. J.; Tickle, I. J.; Verdonk, M. L.,
Sensitivity of molecular docking to induced fit effects in influenza virus
neuraminidase. J. Comput.-Aided Mol. Des. 2002, 16 (12), 855-869.
62. Bernstein, F. C.; Koetzle, T. F.; Williams, G. J. B., The protein data bank: a
computer based archival file for macromolecular structures. J. Mol. Biol. 1977, 112
(3), 535-542.
63. Maestro, 8.0; Schrödinger, LLC.: Portland, OR, 2007.
64. OMEGA, 2.2.1; OpenEye Scientific Software: Santa Fe, NM, 2007.
65. Moitessier, N.; Englebienne, P.; Lee, D.; Lawandi, J.; Corbeil, C. R., Towards the
development of universal, fast and highly accurate docking/scoring methods: A
long way to go. Br. J. Pharmacol. 2008, 153 (SUPPL. 1), S7-S26.
66. Rarey, M.; Kramer, B.; Lengauer, T.; Klebe, G., A fast flexible docking method
using an incremental construction algorithm. J. Mol. Biol. 1996, 261 (3), 470-489.
67. FlexX, 3.1.0; BioSolveIT: Sankt Augustin, Germany, 2008.
68. Corina_F, 3.4; Molecular Networks: Erlangen, Germany, 2008.
69. Halgren, T. A.; Murphy, R. B.; Friesner, R. A.; Beard, H. S.; Frye, L. L.; Pollard,
W. T.; Banks, J. L., Glide: A new approach for rapid, accurate docking and scoring.
2. Enrichment factors in database screening. J. Med. Chem. 2004, 47 (7), 1750-
1759.
70. Glide, 4.5; Schrödinger, LLC.: Portland, OR, 2007.
71. GOLD, 3.2; Cambridge Crystallographic Data Centre: Cambridge, UK, 2007.
72. Surflex, 2.3; BioPharmics, LLC: San Francisco, CA, 2008.
73. Corbeil, C. R.; Englebienne, P.; Moitessier, N. FITTED, 2.6; McGill University:
Montreal, Que., 2008.
CHAPTER 5
CHAPTER FIVE
Although popular in the drug design field, computational tools for virtual
screening are lacking in the field of asymmetric catalyst design. ACE was created from
the expertise gained through the development of FITTED. The main challenge when
creating a virtual screening tool for asymmetric catalyst development is the need to
develop an on-the-fly determination of transition state parameters for a molecular
mechanics forcefield. This was addressed by using a linear combination of transition state
parameters along with a genetic algorithm to enable an efficient conformational search of
the transition state structure. ACE was then validated on two reactions and showed
excellent correlations between experimental and predicted stereoselectivities.
This chapter is a copy and is reproduced with permission from Angewandte Chemie,
International Edition. This article is cited as Corbeil, C. R.; Thielges, S.;
Schwartzentruber, J. A.; Moitessier, N., Toward a Computational Tool Predicting the
Stereochemical Outcome of Asymmetric Reactions: Development and Application of a
Rapid and Accurate Program Based on Organic Principles. Angewandte Chemie
International Edition 2008, 47, (14), 2635-2638. Copyright 2008, with permission from
Wiley.
TOWARD A COMPUTATIONAL TOOL PREDICTING THE
STEREOCHEMICAL OUTCOME OF ASYMMETRIC REACTIONS.
DEVELOPMENT AND APPLICATION OF A RAPID AND ACCURATE
PROGRAM BASED ON ORGANIC PRINCIPLES.
The asymmetric catalyst discovery process as practiced now often relies on expensive
-and sometimes serendipitous- stepwise optimization and/or library screening.1 We believe
that this is poised to change, as computational predictive methods have reached a level of
accuracy that obviates many steps now done manually. We report herein the early version
of a new program, ACE (Asymmetric Catalyst Evaluation), its underlying concepts, and
the assessment of its applicability and accuracy in distinguishing efficient asymmetric
catalysts or chiral auxiliaries from inferior ones.
Although much effort has been directed toward the development of computer-aided
drug design tools, there has been little investigation into computational tools for
asymmetric catalyst design. Nowadays, the fields of quantum mechanics (QM) and quantum
mechanics/molecular mechanics2 are highly developed and have yielded accurate
predictions of asymmetric reaction stereoselectivities.3-6 However, QM methods would
require months of computation to screen a library of potential catalysts in the search for
new ones. To address this issue, other methods were developed, which include
reverse-docking,7, 8 quantitative structure-selectivity relationships9-11 and, more
specifically, the use of quantum mechanics interaction fields.12, 13 As another alternative
to QM techniques, molecular mechanics (MM) applied to ground state structures has been
used.14 Advanced MM-based transition state (TS) techniques, which accurately predict
TS structures and their relative potential energies, have also been reported.15 Although
these methods (e.g., Q2MM,16 using TS force fields,17 SEAM,18, 19 Empirical Valence
Bond (EVB)20, 21 and multiconfiguration MM (MCMM)22) have shown great potential in
locating and investigating TS’s, only very few studies were reported that attempted to
predict the stereochemical outcome of reactions,7, 8, 14, 23-28 with even fewer applications
to the design of new asymmetric catalysts.13, 29, 30 In fact, one major shortcoming of force
fields is the lack of accurate parameters for metal complexes, necessary to model
metal-catalyzed reactions, which need to be specifically developed.31
ACE is a molecular mechanics-based independent program that has been developed
from simple organic chemistry principles. For example, the Hammond-Leffler postulate
states that the TS most resembles the species (reactants or products) it is closest to in
energy. Following this principle, ACE constructs TS’s from a linear combination of
reactants and products, including a factor (λ) describing the position of the TS on the
potential energy surface (Eq. 5.1 with λ defined by 0 < λ < 1). A similar approach is used
to locate transition states by the EVB method mentioned above, where λ is iterated from
0 to 1 to find the maximum energy corresponding to the TS. EVB has indeed been
successfully used in the study of several enzymatic mechanisms.21 Within ACE,
interactions between two atoms forming a bond are described as both covalent bond and
non-bonded interactions with weights (1-λ) and λ for each of these two types of
interaction. Angles, torsions and non-bonded interactions between atoms of the reacting
center are also scaled by either (1-λ) if found in the reactants or λ if found in the
products. As a comparison, λ can be related to the Brønsted coefficient which measures
the role of the reacting partners in a TS.
\[ \mathrm{TS} = (1 - \lambda)\,\mathrm{reactant} + \lambda\,\mathrm{product} \tag{5.1} \]
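The scaling rule described above, with reactant-only terms weighted by (1-λ) and
product-only terms weighted by λ, can be written as a minimal sketch. It is not ACE's code:
the function name and the example energies are hypothetical, and the individual term
energies are assumed to have been evaluated beforehand at a trial geometry.

```python
def ts_energy(reactant_terms, product_terms, lam):
    """Mix force-field terms as in Eq. 5.1: (1 - lam) * E_reactant + lam * E_product.

    `reactant_terms` and `product_terms` hold the energies (e.g. in kcal/mol)
    of the terms that exist only in the reactants or only in the products.
    Terms common to both keep their full weight under a linear combination
    and are therefore left out of this sketch.
    """
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must satisfy 0 < lam < 1")
    return (1.0 - lam) * sum(reactant_terms) + lam * sum(product_terms)


# Forming bond at some trial geometry: in the reactants the two atoms interact
# through a non-bonded term, in the products through a covalent bond term.
nonbonded_energy = 4.0   # hypothetical value, kcal/mol
bond_energy = 12.0       # hypothetical value, kcal/mol
early_ts = ts_energy([nonbonded_energy], [bond_energy], lam=0.2)
late_ts = ts_energy([nonbonded_energy], [bond_energy], lam=0.6)
```

A small λ gives a reactant-like (early) transition state dominated by the non-bonded
description, while a larger λ shifts the weight toward the product-like covalent
description.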
As stated by Curtin and Hammett, stereoisomeric excesses can be derived from the
difference in the diastereomeric TS energies, in this case the MM3* force field potential
energies. This force field has already been used with the SEAM and TS force field (TSFF)
approaches to predict TS energy differences.
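As an illustration of how an excess follows from a TS energy difference, the sketch below applies a Boltzmann ratio to the two competing diastereomeric TS’s. The function name and the default temperature of 298.15 K are assumptions made for this example; the text does not specify the temperature used, and it uses force field potential energies in place of free energies.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)


def excess_from_energy_gap(dde_kcal, temperature_k=298.15):
    """Percent excess of the isomer formed through the lower-energy TS.

    dde_kcal = E(TS leading to minor isomer) - E(TS leading to major isomer).
    With ratio = major/minor = exp(dde/RT), the excess
    (major - minor)/(major + minor) simplifies to tanh(dde / (2RT)).
    """
    return 100.0 * math.tanh(dde_kcal / (2.0 * R_KCAL * temperature_k))


for gap in (0.5, 1.0, 2.0):
    print(f"{gap:.1f} kcal/mol -> {excess_from_energy_gap(gap):.1f} %")
```

Because the relationship saturates quickly, small errors in the energy gap matter most for reactions with modest selectivities.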
[Figure 5.1: image not reproduced; the reaction scheme (catalyst, e.g. Et2AlCl) and the structures of 1a-f and 2a-c are not recoverable from the extracted text.]
Figure 5.1 - General synthetic scheme and representative dienophiles 1a-f and dienes 2a-c used in the validation study.
For each of the diene/dienophile pairs, reactants and products were built
considering only an endo attack, known to be favored in this type of reaction. Prior to
running the computation, λ has to be set. It is well known that Diels-Alder reactions in the
presence of strong Lewis acids have low energies of activation and early TS’s, a situation
which corresponds to low values of λ. In order to evaluate the impact of the selected λ
value, values ranging from 0.10 to 0.60 in steps of 0.10 were used.
[Figure 5.2: image not reproduced; plot of diastereomeric excess (axis from -100 to 100) for each numbered reaction.]
Figure 5.2 - Predicted (grey) vs. observed (black) diastereomeric excesses for 44 Diels-Alder reactions. Positive excesses refer to the (R) isomer while negative excesses refer to the (S) isomer (λ = 0.20).
ACE creates the TS’s from reactants and products prepared using graphical interfaces and ESFF charges32 and carries out a conformational analysis using a genetic algorithm similar to the one previously implemented in our docking program FITTED 1.0.33 This algorithm samples the conformational space of the transition structures. The potential energy was computed for each of the TS’s, and diastereomeric excesses were derived and compared to the experimental data. Initially, the difference in potential energy between the diastereomeric TS’s consistently overestimated the experimentally observed difference in free energy. A correction factor (0.5) was applied to the potential energy difference to better align predictions with observations. Although this factor has no true physical meaning, it may reflect the difference between the force field potential energy in vacuo and the experimental free energy in solvent, as well as a steeper modelled PES at the TS. Plots of ΔΔG(predicted) vs. ΔΔG(experimental) are given as supplementary material. Overall, the rank-ordered list was not strongly affected by the value of λ in the range [0.1-0.3]. Since the ranking is more important for virtual screening than the predicted absolute values, the selection of λ would not have much impact on the success of a screening campaign. However, increasing λ led to slightly (λ = 0.4) or significantly (λ ≥ 0.5) reduced accuracies. These data demonstrate that λ does not have to be fully optimized but should be selected with care based on the type of reaction analyzed. For this class of reaction, λ has to be lower than 0.5, suggesting an early TS, as seen in DFT studies.34 In fact, when using λ = 0.1 or λ = 0.20, the distances of the forming/breaking bonds (predicted to be in the range of 2.05-2.15 Å) match well the distances computed for model systems using higher-level calculations (ranging from 2.05 to 2.55 Å).34 However, the two forming bonds show the same distances with ACE, while the attack is usually asynchronous. Further development of the method is needed to account for this effect.
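One way to quantify how little a rank-ordered list changes between two λ values is a Spearman rank correlation. The sketch below is only an illustration: the text does not state which statistic, if any, was used, and the two lists of predicted excesses are invented for the example.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = ranks(x), ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("rank correlation undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Invented predicted excesses (%) for five hypothetical systems at two lambda values.
de_lambda_01 = [92.0, 75.0, 40.0, 88.0, 10.0]
de_lambda_03 = [95.0, 70.0, 55.0, 85.0, 20.0]
print(spearman(de_lambda_01, de_lambda_03))
```

A value close to 1 means the two λ settings rank the systems in nearly the same order, even if the absolute predicted excesses differ.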
Applied to the entire set, ACE accurately predicted the correct isomer in 41 out of
the 44 cases. The major failures (# 1-4 on Figure 5.2) were observed with polycyclic
auxiliaries exemplified by 1e. This suggests that the force field description of complex
molecules has to be refined.
In practice, a tool like ACE would be of interest for its ability to discriminate very
good auxiliaries from a list of potential auxiliaries. The 20 systems predicted to be the
best of the 44 were first considered. Experimentally, 19 of these 20 systems led to selectivities
of over 80%, with 15 over 90% and 13 over 95%. Conversely, the 10 systems
predicted to provide the lowest selectivities were considered. Six of these
10 systems had experimentally obtained diastereomeric excesses below 70% and only
one gave an excess over 95%. These data clearly show the potential of this method to
discriminate between efficient and inefficient chiral auxiliaries.
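The kind of count reported above (how many of the top-ranked predictions are experimentally selective) can be sketched as follows. The function name and the four-system data set are hypothetical and are not the thesis data.

```python
def hits_in_top(predicted, observed, top_n, threshold):
    """Count systems among the top_n by predicted |excess| whose observed |excess| exceeds threshold."""
    order = sorted(range(len(predicted)), key=lambda i: abs(predicted[i]), reverse=True)
    return sum(1 for i in order[:top_n] if abs(observed[i]) > threshold)


# Invented excesses (%) for four hypothetical auxiliaries.
predicted = [95.0, -90.0, 40.0, 10.0]
observed = [98.0, -60.0, 85.0, 5.0]
print(hits_in_top(predicted, observed, top_n=2, threshold=80.0))
```

Absolute values are used so that a highly selective system counts as a hit regardless of which isomer is favored.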
The second reaction we investigated was the asymmetric organocatalyzed aldol
reaction (Figure 5.3). Reported reactions using various combinations of ketones,
aldehydes and proline derivatives used as catalysts were selected, for a total of 40
combinations.
[Figure 5.3: image not reproduced; the reaction scheme and the structures of 5a-d and 6a-e are not recoverable from the extracted text.]
Figure 5.3 - General synthetic scheme and representative catalysts (6a-e) and aldehydes (5a-d) used in the validation study.
According to extensive experimental and DFT studies, this reaction involves the formation of a flexible macrocyclic TS35,36 and so required sampling the conformational space of large rings. The corner flapping approach37 was implemented in ACE to carry out this conformational search. From DFT studies, the key TS is found to be closer in energy to the produced intermediate than to the starting reactants, implying a λ value greater than 0.5.35 Figure 5.4 summarizes the results obtained with λ = 0.60, though λ in the range 0.60 to 0.75 led to similar results. As for the Diels-Alder reaction, ACE TS’s can be compared to TS’s obtained using higher-level calculations. Figure 5.5 illustrates the superposition of the most energetically favoured TS structures as proposed by DFT and ACE. The distances of the forming bond predicted by these two methods are within 0.1 Å of each other.
[Figure 5.4: image not reproduced; plot of diastereomeric excess (axis from -100 to 100) for each numbered reaction.]
Figure 5.4 - Predicted (grey) vs. observed (black) diastereomeric excesses for 17 selected cases. Positive excess refers to the (R) isomer while negative excess refers to the (S) isomer (λ = 0.6). The complete data (40 cases) are given as supporting information.
[Figure 5.5: image not reproduced.]
Figure 5.5 - Predicted TS structure for the reaction involving 4, 5c and 6a. Grey: DFT prediction; black: ACE prediction.
[Figure 5.6: image not reproduced; plot of diastereomeric excess (axis from -100 to 100) for the cases 4 + 5b, 4 + 5c, 4 + 5d and 5d + 5d.]
Figure 5.6 - ACE predictions (grey) and DFT predictions (white) vs. observed (black) diastereomeric excesses for 4 selected cases.
In the asymmetric organocatalyzed aldol reaction, ACE was again accurate, with
the correct isomers predicted in 38 cases out of 40. Most of the cases investigated here
are known to provide excesses below 80%, equivalent to a small difference in energies
between diastereomeric TS’s. This makes this second validation study more challenging.
Extensive investigations did not reveal the cause of these two failures. Only one of the
computed reactions experimentally showed an enantiomeric excess higher than 90% (#4
in Figure 5.2) and was indeed predicted to lead to the highest selectivity of the set
(prediction: 99%). This demonstrates that ACE could accurately guide the design of
efficient catalysts.
As another validation, it is of interest to compare high-level calculation results,
when available, with these results. Houk and co-workers have reported an exhaustive
study on the proline-catalyzed aldol reaction of acetone with various aldehydes, in an
attempt to assess the predictive power of DFT.4 As shown in Figure 5.6, ACE shows
accuracy close to DFT but within a much shorter period of time. This unexpectedly high
accuracy might be attributable to the exhaustive conformational search of the macrocyclic
TS’s carried out by ACE but not by DFT techniques. In fact, ACE could be used as a
conformational search engine providing high-quality starting structures for further DFT
studies. In addition, this software provides good-quality transition structures that can be
used to rationalize data in place of CPK models, as additional pairwise
interaction energies can be output.
The trade-off between computing speed and accuracy of predictions is well known.
In this communication, we have presented a unique computational tool, ACE, which
performs conformational sampling, TS potential energy optimization, and TS relative
energy evaluation in less than an hour on a standard PC. Application of this tool to
two well-established reactions has revealed its good accuracy in predicting enantio- and
diastereomeric excesses. Further enhancements, applications and validations are ongoing
to improve and assess the predictive power and versatility of the software as well as its
transferability to other reactions. Metal-catalyzed reactions are being investigated.
Nevertheless, this early version of ACE shows considerable promise and, we believe, should
be transferable to other reactions with well-known mechanisms.
REFERENCES
1. Francis, M. B.; Jacobsen, E. N., Discovery of novel catalysts for alkene epoxidation
from metal-binding combinatorial libraries. Angew. Chem. Int. Ed. 1999, 38, (7),
937-941.
2. Lin, H.; Truhlar, D. G., QM/MM: What have we learned, where are we, and where
do we go from here? Theor. Chem. Acc. 2007, 117, (2), 185-199.
3. Panda, M.; Phuan, P. W.; Kozlowski, M. C., Theoretical and experimental studies
of asymmetric organozinc additions to benzaldehyde catalyzed by flexible and
constrained γ-amino alcohols. J. Org. Chem. 2003, 68, (2), 564-571.
4. Bahmanyar, S.; Houk, K. N.; Martin, H. J.; List, B., Quantum Mechanical
Predictions of the Stereoselectivities of Proline-Catalyzed Asymmetric
Intermolecular Aldol Reactions. J. Am. Chem. Soc. 2003, 125, (9), 2475-2479.
5. Garcia, J. I.; Jimenez-Oses, G.; Martinez-Merino, V.; Mayoral, J. A.; Pires, E.;
Villalba, I., QM/MM modeling of enantioselective pybox-ruthenium- and box-
copper-catalyzed cyclopropanation reactions: Scope, performance, and applications
to ligand design. Chem. Eur. J. 2007, 13, (14), 4064-4073.
6. Goumans, T. P. M.; Ehlers, A. W.; Lammertsma, K., The asymmetric Schrock
olefin metathesis catalyst. A computational study. Organometallics 2005, 24, (13),
3200-3206.
7. Harriman, D. J.; Deslongchamps, G., Reverse-docking as a computational tool for
the study of asymmetric organocatalysis. J. Comput.-Aided Mol. Des. 2004, 18, (5),
303-308.
8. Harriman, D. J.; Lambropoulos, A.; Deslongchamps, G., In silico correlation of
enantioselectivity for the TADDOL catalyzed asymmetric hetero-Diels-Alder
reaction. Tetrahedron Lett. 2007, 48, (4), 689-692.
9. Chavali, S.; Lin, B.; Miller, D. C.; Camarda, K. V., Environmentally-benign
transition metal catalyst design using optimization techniques. Comp. Chem. Eng.
2004, 28, (5), 605-611.
10. Lin, B.; Chavali, S.; Camarda, K.; Miller, D. C., Computer-aided molecular design
using Tabu search. Comp. Chem. Eng. 2005, 29, (2), 337-347.
11. Sciabola, S.; Alex, A.; Higginson, P. D.; Mitchell, J. C.; Snowden, M. J.; Morao, I.,
Theoretical prediction of the enantiomeric excess in asymmetric catalysis. An
alignment-independent molecular interaction field based approach. J. Org. Chem.
2005, 70, (22), 9025-9027.
12. Ianni, J. C.; Annamalai, V.; Phuan, P. W.; Panda, M.; Kozlowski, M. C., A priori
theoretical prediction of selectivity in asymmetric catalysis: Design of chiral
catalysts by using quantum molecular interaction fields. Angew. Chem. Int. Ed.
2006, 45, (33), 5502-5505.
13. Huang, J.; Ianni, J. C.; Antoline, J. E.; Hsung, R. P.; Kozlowski, M. C., De novo
chiral amino alcohols in catalyzing asymmetric additions to aryl aldehydes. Org.
Lett. 2006, 8, (8), 1565-1568.
14. Deeth, R. J.; Fey, N., A molecular mechanics study of copper(II)-catalyzed
asymmetric Diels-Alder reactions. Organometallics 2004, 23, (5), 1042-1054.
15. Jensen, F.; Norrby, P. O., Transition states from empirical force fields. Theor.
Chem. Acc. 2003, 109, (1), 1-7.
16. Norrby, P. O., Selectivity in asymmetric synthesis from QM-guided molecular
mechanics. J. Mol. Struct. THEOCHEM 2000, 506, 9-16.
17. Eksterowicz, J. E.; Houk, K. N., Transition-state modeling with empirical force
fields. Chem. Rev. 1993, 93, (7), 2439-2461.
18. Olsen, P. T.; Jensen, F., Modeling chemical reactions for conformationally mobile
systems with force field methods. J. Chem. Phys. 2003, 118, (8), 3523-3531.
19. Jensen, F., Using force field methods for locating transition structures. J. Chem.
Phys. 2003, 119, (17), 8804-8808.
20. Warshel, A.; Weiss, R. M., An empirical valence bond approach for comparing
reactions in solutions and in enzymes. J. Am. Chem. Soc. 1980, 102, (20), 6218-
6226.
21. Aqvist, J.; Warshel, A., Simulation of enzyme reactions using valence bond force
fields and other hybrid quantum/classical approaches. Chem. Rev. 1993, 93, (7),
2523-2544.
22. Truhlar, D. G., Valence bond theory for chemical dynamics. J. Comput. Chem.
2007, 28, (1), 73-86.
23. Moitessier, N.; Chrétien, F.; Chapleur, Y.; Maigret, B., Molecular dynamics-based
models explain the unexpected diastereoselectivity of the Sharpless asymmetric
dihydroxylation of allyl D-xylosides. Eur. J. Org. Chem. 2000, (6), 995.
24. Moitessier, N.; Henry, C.; Len, C.; Chapleur, Y., Toward a Computational Tool
Predicting the Stereochemical Outcome of Asymmetric Reactions. 1. Application to
Sharpless Asymmetric Dihydroxylation. J. Org. Chem. 2002, 67, (21), 7275-7282.
25. Fristrup, P.; Jensen, G. H.; Andersen, M. L. N.; Tanner, D.; Norrby, P. O.,
Combining Q2MM modeling and kinetic studies for refinement of the osmium-
catalyzed asymmetric dihydroxylation (AD) mnemonic. J. Organomet. Chem.
2006, 691, (10), 2182-2198.
26. Gennari, C.; Fioravanzo, E.; Bernardi, A.; Vulpetti, A., Origins of stereoselectivity
in the addition of allyl- and crotylboronates to aldehydes: The development and
application of a force field model of the transition state. Tetrahedron 1994, 50, (29),
8815-8826.
27. Rasmussen, T.; Norrby, P. O., Modeling the stereoselectivity of the β-amino
alcohol-promoted addition of dialkylzinc to aldehydes. J. Am. Chem. Soc. 2003,
125, (17), 5130-5138.
28. Bernardi, A.; Gennari, C.; Goodman, J. M.; Paterson, I., The rational design and
systematic analysis of asymmetric aldol reactions using enol borinates:
Applications of transition state computer modelling. Tetrahedron Asymmetry 1995,
6, (11), 2613-2636.
29. Kozlowski, M. C.; Waters, S. P.; Skudlarek, J. W.; Evans, C. A., Computer-Aided
Design of Chiral Ligands. Part III. A Novel Ligand for Asymmetric Allylation
Designed Using Computational Techniques. Org. Lett. 2002, 4, (25), 4391-4393.
30. Gennari, C.; Hewkin, C. T.; Molinari, F.; Bernardi, A.; Comotti, A.; Goodman, J.
M.; Paterson, I., The rational design of highly stereoselective boron enolates using
transition-state computer modeling: A novel, asymmetric anti aldol reaction for
ketones. J. Org. Chem. 1992, 57, (19), 5173-5177.
31. Deeth, R. J., Comprehensive molecular mechanics model for oxidized type I copper
proteins: Active site structures, strain energies, and entatic bulging. Inorg. Chem.
2007, 46, (11), 4492-4503.
32. Shi, S.; Yan, L.; Yang, Y.; Fisher-Shaulsky, J.; Thacher, T., An extensible and
systematic force field, ESFF, for molecular modeling of organic, inorganic, and
organometallic systems. J. Comput. Chem. 2003, 24, (9), 1059-1076.
33. Corbeil, C. R.; Englebienne, P.; Moitessier, N., Docking ligands into flexible and
solvated macromolecules. 1. Development and validation of FITTED 1.0. J. Chem.
Inf. Model. 2007, 47, (2), 435-449.
34. Branchadell, V., Density Functional Study of Diels-Alder Reactions Between
Cyclopentadiene and Substituted Derivatives of Ethylene. Int. J. Quantum Chem.
1997, 61, 381-388.
35. Rankin, K. N.; Gauld, J. W.; Boyd, R. J., Density functional study of the proline-
catalyzed direct aldol reaction. J. Phys. Chem. A 2002, 106, (20), 5155-5159.
36. Allemann, C.; Gordillo, R.; Clemente, F. R.; Cheong, P. H. Y.; Houk, K. N.,
Theory of asymmetric organocatalysis of aldol and related reactions:
Rationalizations and predictions. Acc. Chem. Res. 2004, 37, (8), 558-569.
37. Goto, H.; Osawa, E., Corner flapping: A simple and fast algorithm for exhaustive
generation of ring conformations. J. Am. Chem. Soc. 1989, 111, (24), 8950-8951.
CHAPTER 6
- 199 -
CHAPTER SIX
CONCLUSION, FUTURE WORK
AND CONTRIBUTIONS TO KNOWLEDGE
CONCLUSION
Virtual screening tools for molecular discovery are becoming ever more prevalent
to guide organic and medicinal chemists in their search for novel molecules. The ability
of these tools to produce results quickly and cheaply has led to their widespread
acceptance in the field of drug design and development, yet there is still a lack of such
tools for organic chemists. Two programs, one for virtual screening against biological
targets and the second for virtual screening of asymmetric catalysts, have been
developed, validated and applied.
Two major shortcomings of most docking programs are the assumptions that protein
flexibility and water molecules do not have a significant impact on docking accuracy. FITTED1.0
has been developed to account for these phenomena. FITTED enables the use of multiple
protein input structures and, by means of a genetic algorithm, allows for the flexibility of
the protein under investigation. To account for displaceable bridging water molecules,
a switching function that turns off interactions between the water and the ligand when they
are too close has been implemented. Initial validation showed the importance of including both protein
flexibility and displaceable water molecules. Application of FITTED to the docking of our
test set resulted in 73% docking accuracy when docking using flexible proteins.
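To make the idea of a switching function concrete, the sketch below scales a water-ligand interaction by a smooth weight that falls to zero at short distance. This is a generic smoothstep form chosen for illustration only; the functional form and the cutoff distances actually used in FITTED are not given in this chapter, so `water_switch`, `d_off` and `d_on` are hypothetical.

```python
def water_switch(distance, d_off=1.5, d_on=2.5):
    """Weight in [0, 1] applied to a water-ligand interaction energy.

    Returns 0 when the ligand is closer than d_off (water treated as displaced),
    1 beyond d_on (water fully present), and a smooth cubic ramp in between.
    Distances are in angstroms; both cutoffs are illustrative assumptions.
    """
    if d_on <= d_off:
        raise ValueError("d_on must be larger than d_off")
    if distance <= d_off:
        return 0.0
    if distance >= d_on:
        return 1.0
    x = (distance - d_off) / (d_on - d_off)
    return x * x * (3.0 - 2.0 * x)


for d in (1.0, 2.0, 3.0):
    print(d, water_switch(d))
```

A smooth ramp avoids an abrupt jump in the score when a ligand atom crosses the cutoff during optimization.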
When performing a virtual screen, the speed of the program is of utmost importance. With the success of FITTED1.0, we further modified it to reduce the average time required to perform a docking run. With the inclusion of filters to remove unwanted compounds and interaction sites to aid in orienting the poses while docking, a significant increase in the accuracy and speed of FITTED was seen. A virtual screening campaign against HCV polymerase was undertaken with this enhanced version. Initial docking validations showed that FITTED was able to accurately predict the binding pose of known HCV polymerase inhibitors. The Maybridge virtual library, seeded with known inhibitors, was then screened against HCV polymerase. Again, FITTED showed excellent enrichment rates and identified two novel molecules of interest for the pharmaceutical industry involved in this research.
After the success of the previous two versions, a comparative study was undertaken
to assess the effect of ligand and protein input conformation (which includes the treatment
of bridging water molecules) on the accuracy of major docking programs, including
FITTED. This work showed that the accuracy of these programs is greatly affected by the
information provided. When including multiple protein structures and bridging water
molecules, FITTED ranked second of these six major programs.
Considering the lack of virtual screening tools for organic chemists, we next turned
our attention to creating a tool to predict the stereoselectivities of asymmetric reactions.
ACE was developed and enabled the quick estimation of transition state force field
parameters through the linear combination of ground state interactions. ACE was validated
on two reactions and showed excellent correlation between observed and predicted
selectivities.
Overall, both programs exhibited excellent predictive power. By developing these
two new tools, we have provided the scientific community with a greener, safer and quicker
alternative to experimental screening.
FUTURE WORK
One of the advantages of writing your own program code is the ability to implement new and exciting ideas in the program quite easily. One possible direction for both programs is the ability not only to predict selectivity, be it a ligand’s affinity for a protein or the stereoselectivity of a catalyst, but also to propose a better-binding ligand or to improve an existing catalyst. One of the major pitfalls of de novo prediction of new molecules is the consideration of synthetic accessibility.1 For an organic chemist, synthetic accessibility is usually estimated by retrosynthetic analysis, but coding a chemist’s knowledge, expertise and experience into a program is a difficult task.2-13 All of these tools require that knowledge of known reactions be programmed into the code; if a new reaction is to be used, it must first be added to the reaction database. The automatic creation of chemical reaction databases applied to the field of de novo design is therefore needed.
Another possible direction of future work would be the combination of FITTED and ACE to create a tool for biocatalysis. In the past year there has been much discussion on the use of computational tools for the design of biocatalytic enzymes.14-20 The main issue with these approaches is their reliance on QM/MM techniques, which slows down the throughput because the enzyme’s transition state must be predicted correctly. The combination of FITTED and ACE, together with additional implementation, would enable the virtual screening of biocatalytic enzymes.
CONTRIBUTIONS TO KNOWLEDGE
We have developed FITTED, a docking-based virtual screening program for solvated and flexible proteins. It has been shown to predict the binding pose of ligand-protein complexes with good accuracy. During the development of FITTED, we have shown the importance of protein flexibility and bridging water molecules, along with the effect of ligand input conformation, on major docking programs and not only FITTED. We also used FITTED to virtually screen the Maybridge library and identified two molecules with IC50s less than 15 μM. We have also proposed that when conducting comparative studies, one should consider cross-docking accuracies instead of self-docking accuracies to better approximate a real-case scenario in which the binding pose of a ligand is not known.
With the experience gained in developing FITTED, we created a new tool for the prediction of stereoselectivities called ACE. This program has been developed with ease of use for organic chemists as a main driving force.
REFERENCES
1. Gasteiger, J., De novo design and synthetic accessibility. J. Comput.-Aided Mol.
Des. 2007, 21, (6), 307-309.
2. Boda, K.; Seidel, T.; Gasteiger, J., Structure and reaction based evaluation of
synthetic accessibility. J. Comput.-Aided Mol. Des. 2007, 21, (6), 311-325.
3. Corey, E. J.; Wipke, W. T., Computer-Assisted Design of Complex Organic
Synthesis. Science 1969, 166, (3902), 178-192.
4. Corey, E. J.; Long, A. K.; Rubenstein, S. D., Computer-assisted analysis in organic
synthesis. Science 1985, 228, (4698), 408-418.
5. Wipke, W. T.; Gund, P., Simulation and evaluation of chemical synthesis.
Congestion: a conformation-dependent function of steric environment at a reaction
center. Application with torsional terms to stereoselectivity of nucleophilic
additions to ketones. J. Am. Chem. Soc. 1976, 98, (25), 8107-8118.
6. Wipke, W. T.; Ouchi, G. I.; Krishnan, S., Simulation and evaluation of chemical
synthesis--SECS: An application of artificial intelligence techniques. Art. Intell.
1978, 11, (1-2), 173-193.
7. Wipke, W. T.; Rogers, D., Artificial Intelligence in organic synthesis, SST. Starting
Material Selection Strategies. An Application of Superstructure Search. J. Chem. Inf.
Comput. Sci. 1984, 24, (2), 71-81.
8. Gelernter, H.; Rose, J. R.; Chen, C., Building and refining a knowledge base for
synthetic organic chemistry via the methodology of inductive and deductive
machine learning. J. Chem. Inf. Comput. Sci. 1990, 30, (4), 492-504.
9. Gelernter, H. L.; Sanders, A. F.; Larsen, D. L., Empirical explorations of
SYNCHEM. The methods of artificial intelligence are applied to the problem of
organic synthesis route discovery. Science 1977, 197, (4308), 1041-1049.
10. Satoh, H.; Funatsu, K., SOPHIA, a Knowledge Base-Guided Reaction Prediction
System - Utilization of a Knowledge Base Derived from a Reaction Database. J.
Chem. Inf. Comput. Sci. 1995, 35, (1), 34-44.
11. Gasteiger, J.; Jochum, C., EROS: A computer program for generating sequences of
reactions. Top. Curr. Chem. 1978, 74, 93-126.
12. Ugi, I.; Bauer, J.; Bley, K.; Dengler, A.; Dietz, A.; Fontain, E.; Gruber, B.; Herges,
R.; Knauer, M.; Reitsam, K.; Stein, N., Computer-assisted solution of chemical
problems - The historical development and the present state of the art of a new
discipline of chemistry. Angew. Chem. Int. Ed. 1993, 32, (2), 201-227.
13. Hanessian, S.; Franco, J.; Gagnon, G.; Laramee, D.; Larouche, B., Computer-
assisted analysis and perception of stereochemical features in organic molecules
using the CHIRON program. J. Chem. Inf. Comput. Sci. 1990, 30, (4), 413-425.
14. Jiang, L.; Althoff, E. A.; Clemente, F. R.; Doyle, L.; Röthlisberger, D.;
Zanghellini, A.; Gallaher, J. L.; Betker, J. L.; Tanaka, F.; Barbas III, C. F.; Hilvert,
D.; Houk, K. N.; Stoddard, B. L.; Baker, D., De novo computational design of
retro-aldol enzymes. Science 2008, 319, (5868), 1387-1391.
15. Marti, S.; Andres, J.; Moliner, V.; Silla, E.; Tunon, I.; Bertrain, J., Computational
design of biological catalysts. Chem. Soc. Rev. 2008, 37, (12), 2634-2643.
16. Prather, K. L. J.; Martin, C. H., De novo biosynthetic pathways: rational design of
microbial chemical factories. Curr. Opin. Biotechnol. 2008, 19, (5), 468-474.
17. Ward, T. R., Artificial enzymes made to order: Combination of computational
design and directed evolution. Angew. Chem. Int. Ed. 2008, 47, (41), 7802-7803.
18. Chaput, J. C.; Woodbury, N. W.; Stearns, L. A.; Williams, B. A. R., Creating
protein biocatalysts as tools for future industrial applications. Ex. Op. Bio. Ther.
2008, 8, (8), 1087-1098.
19. Sterner, R.; Merkl, R.; Raushel, F. M., Computational Design of Enzymes. Chem.
Biol. 2008, 15, (5), 421-423.
20. Ghirlanda, G., Computational biochemistry: Old enzymes, new tricks. Nature 2008,
453, (7192), 164-166.
APPENDIX A
- 205 -
APPENDIX A
COPYRIGHT WAIVERS
Copyright waiver for chapter 2: Docking Ligands into Flexible and Solvated
Macromolecules. 1. Development and Validation of FITTED 1.0; and chapter 3: Docking
Ligands into Flexible and Solvated Macromolecules. 2. Development and Application of
FITTED 1.5 to the Virtual Screening of Potential HCV Polymerase Inhibitors.
American Chemical Society’s Policy on Theses and Dissertations
If your university requires a signed copy of this letter see contact information below.
Thank you for your request for permission to include your paper(s) or portions of text from your paper(s) in your thesis.
Permission is now automatically granted; please pay special attention to the implications paragraph below. The
Copyright Subcommittee of the Joint Board/Council Committees on Publications approved the following:
Copyright permission for published and submitted material from theses and dissertations
ACS extends blanket permission to students to include in their theses and dissertations their own articles, or
portions thereof, that have been published in ACS journals or submitted to ACS journals for publication, provided
that the ACS copyright credit line is noted on the appropriate page(s).
Publishing implications of electronic publication of theses and dissertation material
Students and their mentors should be aware that posting of theses and dissertation material on the Web prior to
submission of material from that thesis or dissertation to an ACS journal may affect publication in that journal.
Whether Web posting is considered prior publication may be evaluated on a case-by-case basis by the journal’s
editor. If an ACS journal editor considers Web posting to be “prior publication”, the paper will not be accepted
for publication in that journal. If you intend to submit your unpublished paper to ACS for publication, check with
the appropriate editor prior to posting your manuscript electronically.
If your paper has not yet been published by ACS, we have no objection to your including the text or portions of the text in
your thesis/dissertation in print and microfilm formats; please note, however, that electronic distribution or Web posting of
the unpublished paper as part of your thesis in electronic formats might jeopardize publication of your paper by ACS. Please
print the following credit line on the first page of your article: "Reproduced (or 'Reproduced in part') with permission from
[JOURNAL NAME], in press (or 'submitted for publication'). Unpublished work copyright [CURRENT YEAR] American
Chemical Society." Include appropriate information.
If your paper has already been published by ACS and you want to include the text or portions of the text in your
thesis/dissertation in print or microfilm formats, please print the ACS copyright credit line on the first page of your
article: “Reproduced (or 'Reproduced in part') with permission from [FULL REFERENCE CITATION.] Copyright
[YEAR] American Chemical Society." Include appropriate information.
Submission to a Dissertation Distributor: If you plan to submit your thesis to UMI or to another dissertation distributor,
you should not include the unpublished ACS paper in your thesis if the thesis will be disseminated electronically, until ACS
has published your paper. After publication of the paper by ACS, you may release the entire thesis (not the individual ACS
article by itself) for electronic dissemination through the distributor; ACS’s copyright credit line should be printed on the
first page of the ACS paper.
Use on an Intranet: The inclusion of your ACS unpublished or published manuscript is permitted in your thesis in print and
microfilm formats. If ACS has published your paper you may include the manuscript in your thesis on an intranet that is not
publicly available. Your ACS article cannot be posted electronically on a publicly available medium (i.e. one that is not
password protected), such as but not limited to, electronic archives, Internet, library server, etc. The only material from your
paper that can be posted on a public electronic medium is the article abstract, figures, and tables, and you may link to the
article’s DOI or post the article’s author-directed URL link provided by ACS. This paragraph does not pertain to the
dissertation distributor paragraph above.
Questions? Call +1 202/872-4368/4367. Send e-mail to [email protected] or fax to +1 202-776-8112. 10/10/03, 01/15/04, 06/07/06
Copyright waiver for chapter 5: Toward a Computational Tool Predicting the
Stereochemical Outcome of Asymmetric Reactions. 2. Development and Application of a
Rapid and Accurate Program Based on Organic Principles.
Copyright waiver for chapter 1.1: The challenge of modeling reality in the docking of
small molecules to biological targets and for Chapter 4: Docking Ligands into Flexible
and Solvated Macromolecules. 3. Impact of Input Ligand Conformation, Protein
Flexibility and Water Molecules on the Accuracy of Docking Programs.
APPENDIX B
- 209 -
APPENDIX B
SUPPORTING INFO FOR CHAPTER 2:
DOCKING LIGANDS INTO FLEXIBLE AND SOLVATED
MACROMOLECULES. 1.
DEVELOPMENT AND VALIDATION OF FITTED 1.0
Table B.1 - HIV-1 Protease mono-alcohol inhibitors. (Structure drawings are not reproduced.)
PDB code    Ki
1b6l        5 nM
1eby        0.2 nM
1hpv        0.6 nM
1hpo        0.6 nM
1pro        0.005 nM
Table B.2 - HIV-1 Protease diol inhibitors. (Structure drawings are not reproduced.)
PDB code    Ki
1ajv        19.1 nM
1ajx        12.2 nM
1hvr        0.31 nM
1hwr        4.7 nM
1qbs        0.12 nM
Table B.3 - Thymidine Kinase inhibitors. (Structure drawings are not reproduced.)
PDB code    Ki
1e2k        11.4 μM
1e2p        27 μM
1ki3        (not listed)
1ki4        (not listed)
1ki7        (not listed)
1ki8        (not listed)
2ki5        (not listed)
1of1        4.1 μM
1qhi        (not listed)
Table B.4 - Factor Xa inhibitors.
PDB code    Structure              Ki
1ezq        [structure drawing]    0.9 nM
1f0r        [structure drawing]    22 nM
1fjs        [structure drawing]    0.11 nM
1nfu        [structure drawing]    1.3 nM
1xka        [structure drawing]    131 nM
Table B.5 - Trypsin inhibitors.
PDB code    Structure              Ki
1f0u        [structure drawing]    69 nM
1o2j        [structure drawing]    120 nM
1o3g        [structure drawing]    74 nM
1o3i        [structure drawing]    170 nM
1qbo        [structure drawing]    18 nM
Table B.6 - Stromelysin-1 inhibitors.
PDB code    Structure              Ki
1b8y        [structure drawing]    14 nM
1bwi        [structure drawing]    104 nM
1ciz        [structure drawing]    36 nM
1d8m        [structure drawing]    3.1 nM
Typical keyword files for FITTED and ProCESS are presented on the following pages. Most of the values used here are the default values.
##################################################################################################
#
#                                            FITTED
#
#                 Flexibility Induced Through Targeted Evolutionary Description
#
#                  Nicolas Moitessier, Christopher Corbeil, Pablo Englebienne
#
#                                          March 2006
#
##################################################################################################
#
# INPUT/OUTPUT FILES
##################################################################################################
#
Protein 9 # Number of protein input files
1e2k_protein.mol2 # Names of the protein files
1e2p_protein.mol2
1ki3_protein.mol2
1ki4_protein.mol2
1ki7_protein.mol2
1ki8_protein.mol2
2ki5_protein.mol2
1of1_protein.mol2
1qhi_protein.mol2
Ligand 1e2k_lig.mol2 # Ligand structure file
Output 1e2k_run01 # File that will contain the output
Forcefield emc.txt # Force field file
Active_site_cav tk_grid.mol2 # File containing the sphere locations and sizes
Constraints tk_cons.mol2 # Constraint file
Ref 1 # Number of reference structures for RMSD calculations
1e2k_ligand_ref1.mol2 # Reference structure
#
#
# CONJUGATE GRADIENT PARAMETERS
##################################################################################################
#
# Creation of the initial population
#-------------------------------------------------------------------------------------------------
#
GI_MaxInt 100 # Maximum number of iterations
GI_StepSize 0.002 # Step size used in the minimization process
GI_MaxStep 0.9 # Maximum Step Size
GI_MaxSameEnergy 3 # Max. number of times the same energy can appear
GI_MaxGrad 0.00001 # Maximum size of the gradient
GI_EnergyBound 0.0001 # Diff. in energy between two "similar" structures
#
# Evolution
#-------------------------------------------------------------------------------------------------
#
GA_MaxInt 20 # Maximum number of iterations
GA_StepSize 0.002 # Step size used in the minimization process
GA_MaxStep 1 # Maximum Step Size
GA_MaxSameEnergy 3 # Max. number of times the same energy can appear
GA_MaxGrad 0.00001 # Maximum size of the gradient, Evolution
GA_EnergyBound 0.00001 # Diff. in energy between two "similar" structures
#
#
# ENERGY/SCORING PARAMETERS
##################################################################################################
#
# Energy parameters
#-------------------------------------------------------------------------------------------------
#
Score_Initial minimize # Scoring of the input conformation
vdW 1-4 # Consider 1,4 vdw interactions and greater
vdWScale_1-4 0.5 # Scaling factor for vdw 1,4 interactions
vdWScale_1-5 1.0 # Scaling factor for vdw 1,5+ interactions
E_vdWScale_Pro 2.0 # Scaling factor for vdw L,P interactions
E_vdWScale_Wat 2.0 # Scaling factor for vdw L,Water interactions
Elec 1-4 # Consider 1,4 electrostatics
ElecScale_1-4 0.25 # Scaling factor for L,L 1,4 electrostatics
ElecScale_1-5 0.5 # Scaling factor for L,L 1,5+ electrostatics
E_ElecScale_Pro 1.0 # Scaling factor for L,P interactions
E_ElecScale_Wat 1.0 # Scaling factor for L,Water interactions
HBond Y # Include hydrogen bonds
E_HBondScale_Pro 2.0 # Scaling factor for Hydrogen bond term
E_HBondScale_Wat 2.0 # Scaling factor for Hydrogen bond term
Cutdist 9.0 # Cutoff Distance
switchdist 7.0 # Switching distance
Cutdist_Wat 1.75 # Cutoff Distance for Waters
Switchdist_Wat 1.20 # Switching distance for Waters
Water_loss_energy -0.5 # Penalty added if a water is displaced
#
# Scoring function parameters
#-------------------------------------------------------------------------------------------------
#
weight_rot_bonds 0.14 # Penalty per frozen rotatable bond
S_vdWScale_Pro 0.175 # Scaling factor for vdw L,P interactions
S_ElecScale_Pro 0.064 # Scaling factor for L,P interactions
S_HBondScale_Pro 0.25 # Scaling factor for Hydrogen bond term
S_vdWScale_Wat 0.175 # Scaling factor for vdw L,Water interactions
S_ElecScale_Wat 0.064 # Scaling factor for L,Water interactions
S_HBondScale_Wat 0.25 # Scaling factor for Hydrogen bond term
#
#
# GENETIC ALGORITHM PARAMETERS
##################################################################################################
#
# Creation of the initial population
#-------------------------------------------------------------------------------------------------
#
Pop_Size 100 # Population size
anchor_atom 15 # Ligand anchor atom
anchor_coor -10 -9 -15 # Location of the center of the binding site
max_tx 3 # Maximum radius for translation
max_ty 3 # Maximum translation angle in x, y plane for rotation.
max_tz 3 # Maximum elevation angle for translation
max_rxy 30 # Maximum rotation of molecule in x
max_ryz 30 # Maximum rotation of molecule in y
max_rxz 30 # Maximum rotation of molecule in z
max_sc_PP 0.8 # Maximum distance for two atoms to be considered clashing
max_num_sc 5 # Maximum number of steric clashes of the side chains
GI_Minimized_E 1000 # Maximum potential energy to accept a pose
GI_Initial_E 10000000 # Maximum potential energy to start a minimization
#
# Evolution
#-------------------------------------------------------------------------------------------------
#
flex 15 # Number of flexible side chains
HISD58 # Name of the flexible residues
ARG222
GLU83
ARG163
TYR101
ILE97
MET231
ARG176
TYR172
MET121
ILE100
TRP88
MET128
GLN125
MET85
max_gen 200 # Maximum number of generations
seed 15 # Seed number
resolution 7 # Torsion angle resolution for ligand
pCross 0.85 # Probability of crossover
pMut 0.02 # Probability of mutation
pMutRot 0.20 # Probability of mutation for the rotation in 3D Space
pOpt 0.20 # Probability of energy minimization of the children
pLearn 0.1 # Probability of energy minimization of the parents
#
# Convergence and output
#-------------------------------------------------------------------------------------------------
#
print_structures final # Which structures to print out
print_best_every_x_gen 5 # Intermediate statistical output
print_energy_full no # Detailed potential energy
converge_std_dev 0.2 # Standard deviation criterion
Number_of_best 20 # Number of best structures to be output
MaxSameEnergy_GA 100 # Number of unchanged lowest-in-energy structure
#
#
#
##################################################################################################
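The keyword file above follows a simple layout: one keyword per line, followed by its value(s) and an optional comment introduced by #. Some keywords (for example Protein 9, Ref 1 or flex 15) give a count and are followed by that many entries on the next lines. As an illustration only, a reader for this layout could be sketched in Python as below. The function name parse_keywords and the set of counted keywords are assumptions made for this example; this is not the parser used by FITTED or ProCESS.

```python
def strip_comment(line):
    """Drop everything after the first '#' and trim whitespace."""
    return line.split("#", 1)[0].strip()


def parse_keywords(text, counted=("protein", "ref", "flex", "active_site")):
    """Parse keyword-file text into a dict (hypothetical helper).

    Keywords listed in `counted` are assumed to carry an integer count and to be
    followed by that many one-entry lines, as in the listing above.
    """
    lines = [strip_comment(raw) for raw in text.splitlines()]
    lines = [line for line in lines if line]  # skip blank and comment-only lines
    params = {}
    i = 0
    while i < len(lines):
        tokens = lines[i].split()
        key, values = tokens[0], tokens[1:]
        i += 1
        if key.lower() in counted and values:
            count = int(values[0])
            # the next `count` lines each hold one entry (file or residue name)
            params[key] = [entry.split()[0] for entry in lines[i:i + count]]
            i += count
        elif len(values) == 1:
            params[key] = values[0]
        else:
            params[key] = values
    return params


sample = """
Protein 2 # Number of protein input files
1e2k_protein.mol2 # Names of the protein files
1e2p_protein.mol2
Ligand 1e2k_lig.mol2 # Ligand structure file
anchor_coor -10 -9 -15 # Location of the center of the binding site
pCross 0.85 # Probability of crossover
"""
params = parse_keywords(sample)
print(params["Protein"], params["pCross"])
```

Values are kept as strings in this sketch; converting them to numbers is left to the caller because the listing mixes integers, reals, file names and flags such as Y or yes.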
##################################################################################################
#
#                                           ProCESS
#
#                        Protein Conformational Ensemble System Setup
#
#                  Nicolas Moitessier, Christopher Corbeil, Pablo Englebienne
#
#                                      Dept. of Chemistry
#
#                             McGill University, Montreal, Canada
#
#                                           Feb 2006
#
##################################################################################################
#
# INPUT/OUTPUT FILES
##################################################################################################
#
protein 9 # Number of protein files
1e2k_prot.mol2 # Name of the protein files
1e2p_prot.mol2
1ki3_prot.mol2
1ki4_prot.mol2
1ki7_prot.mol2
1ki8_prot.mol2
1of1_prot.mol2
1qhi_prot.mol2
2ki5_prot.mol2
rep_file 1e2k_prot.mol2 # Name of the reference structure (for atom sorting)
Output tk # File that will contain the output structure
Grid tk_grid # File that will contain the grid
#
# PROTEIN DESCRIPTION
##################################################################################################
#
Find_Residues name # Find residues in protein mol2 by either number or name
Active_Site 15 # Number of flexible side chains
HIS58 # Names of flexible residues
ARG222
GLU83
ARG163
TYR101
ILE97
MET231
ARG176
TYR172
MET121
ILE100
TRP88
MET128
GLN125
MET85
#
# ACTION
##################################################################################################
#
Assign_G yes # Assigns proper group names
Truncate yes # Truncates proteins keeping residues within cutoff distance
Cutoff 7 # Cutoff distance
United yes # Makes the united atom representation
Coarse 0 # Makes the coarse grained representation at indicated level
#
# GRID DESCRIPTION
##################################################################################################
#
Grid_boundary hard # Spheres make contact with the grid edges
grid_center -10.8 -10.6 -13.3 # Center of the grid
grid_resolution 1.5 # Grid resolution
grid_size 9 9 9 # Grid size
grid_clash 1.5 # Maximum distance between grid point and protein or edge
grid_sphere_size 10 # A sphere centered at grid_center truncates the apexes of the grid
#
#
##################################################################################################
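The grid keywords in the ProCESS file above describe a regular grid centered on the binding site whose corners are cut off by a sphere. A rough Python sketch of that geometry is given below. It assumes that grid_size is the number of grid points along each axis and that grid_resolution and grid_sphere_size are in Å; both are assumptions made for illustration and are not stated in the keyword file. The function build_grid is hypothetical, and the grid_clash test against protein atoms is not included.

```python
import itertools
import math


def build_grid(center, n_points, resolution, sphere_radius):
    """Return the grid points around `center` that lie inside the truncating sphere."""
    axes = []
    for c, n in zip(center, n_points):
        half = (n - 1) / 2.0
        # n evenly spaced coordinates centered on c
        axes.append([c + (i - half) * resolution for i in range(n)])
    points = []
    for point in itertools.product(*axes):
        if math.dist(point, center) <= sphere_radius:
            points.append(point)
    return points


# Values taken from the ProCESS keyword file above
points = build_grid(
    center=(-10.8, -10.6, -13.3),
    n_points=(9, 9, 9),
    resolution=1.5,
    sphere_radius=10.0,
)
print(len(points))
```

Under these assumptions the 9 x 9 x 9 grid spans 12 Å along each axis, so only its eight corner points (about 10.4 Å from the center) fall outside the 10 Å sphere, which is consistent with the comment that the sphere truncates the apexes of the grid.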
APPENDIX C
SUPPORTING INFO FOR CHAPTER 3:
DOCKING LIGANDS INTO FLEXIBLE AND SOLVATED
MACROMOLECULES. 2.
DEVELOPMENT AND APPLICATION OF FITTED 1.5 TO THE
VIRTUAL SCREENING OF POTENTIAL HCV POLYMERASE
INHIBITORS.
Table C.1 - Self-docking - HIV-1 Protease.
            Docking to protein + water molecule (b)
PDB code    Obs. water (c)    Ligand (d)    Pred. water (e)
1b6l        1                 0.36          1
1eby        1                 0.39          1
1hpo        0                 0.71          0
1hpv        1                 0.44          1
1pro        0                 0.18          0
1ajv        0                 0.43          0
1ajx        0                 0.31          0
1hvr        0                 0.51          0
1hwr        0                 0.32          0
1qbs        0                 5.42          0
(a) Water molecules removed prior to docking. (b) Water molecule known as Water 301
was retained and the function describing the interaction between ligand and water
molecules is applied. (c) Water molecule observed or not in crystal structures: 1 and 0
define the presence or absence of the water molecule respectively. (d) RMSD (in Å):
criterion of success of 2.0 Å. (e) Water molecules as proposed by FITTED. Bold numbers
highlight failures.
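The success criterion used in these tables (ligand RMSD of 2.0 Å or less relative to the crystal structure) can be illustrated with a short Python sketch. The functions rmsd and success_rate are hypothetical helpers written for this example. They assume that the two coordinate lists are already in the same frame with atoms in matching order, and that a value equal to the cutoff counts as a success. The RMSD values are the ligand column of Table C.1.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


def success_rate(rmsd_values, cutoff=2.0):
    """Percentage of docking runs whose RMSD does not exceed the cutoff (in Å)."""
    hits = sum(1 for value in rmsd_values if value <= cutoff)
    return 100.0 * hits / len(rmsd_values)


# Ligand RMSD values (Å) from Table C.1
table_c1 = {
    "1b6l": 0.36, "1eby": 0.39, "1hpo": 0.71, "1hpv": 0.44, "1pro": 0.18,
    "1ajv": 0.43, "1ajx": 0.31, "1hvr": 0.51, "1hwr": 0.32, "1qbs": 5.42,
}
failures = [code for code, value in table_c1.items() if value > 2.0]
print(success_rate(table_c1.values()), failures)
```

With these values, nine of the ten complexes meet the criterion and 1qbs (5.42 Å) is the only failure.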
Table C.2 - Self-docking - Thymidine kinase inhibitors.
            Docking to protein + water molecule (b)
PDB code    Obs. water molecules (c)    Ligand (d)    Pred. water molecules (e)
1e2k        1 0 1 1 1 1                 0.55          1 0 1 1 0 1
1e2p        1 0 1 1 1 1                 0.71          1 0 1 1 0 1
1ki3        0 1 0 0 0 1                 1.18          0 1 0 1 1 1
1ki4        1 0 1 1 1 1                 0.44          1 0 1 1 1 1
1ki7        1 0 1 1 1 0                 5.76          1 0 1 1 1 1
1ki8        1 0 1 1 1 0                 0.39          1 0 1 1 1 1
2ki5        0 1 1 1 1 1                 0.72          0 1 0 1 1 1
1of1        1 0 1 1 1 1                 0.40          1 0 1 1 1 1
1qhi        0 1 1 1 1 0                 0.66          0 1 0 1 1 1
(a) Water molecules removed prior to docking. (b) 2 to 6 water molecules (see text) were
retained and the function describing the interaction between ligand and water molecules
is applied. (c) Water molecules observed or not in crystal structures: 1 and 0 define the
presence or absence of each water molecule respectively. (d) RMSD (in Å): criterion of
success of 2.0 Å. (e) Water molecules as proposed by FITTED. Bold numbers highlight
failures. (g) Two structures with similar energies were observed (RMSD = 2.03 and 0.53).
Table C.3 - Self-docking - Factor Xa, trypsin and MMP-3 inhibitors.
            Docking to protein + water molecule (b)
PDB code    Obs. water (c)    Ligand (d)    Pred. water (e)
1ezq        1 0               0.57          1 0
1f0r        1 1               0.77          0 0
1fjs        1 0               0.93          1 0
1nfu        0 0               1.20          0 0
1xka        1 0               1.22          1 0
1f0u        1 -               0.47          1 -
1o2j        1 -               0.76          1 -
1o3g        1 -               0.58          1 -
1o3i        1 -               0.49          1 -
1qbo        1 -               0.90          1 -
1b8y        - -               0.61          - -
1bwi        - -               0.98          - -
1ciz        - -               0.92          - -
1d8m        - -               0.71          - -
(a) Water molecules removed prior to docking. (b) None to 2 water molecules (see text)
were retained and the function describing the interaction between ligand and water
molecules is applied. (c) Water molecules observed or not in crystal structures: 1 and 0
define the presence or absence of each water molecule respectively. (d) RMSD (in Å):
criterion of success of 2.0 Å. (e) Water molecules as proposed by FITTED. Bold numbers
highlight failures.
Table C.4 - Docking to flexible proteins - HIV-1 protease inhibitors.
            Docking to semi-flexible protein          Docking to fully flexible protein
PDB code    Ligand (a)   Protein (b)   Water (c)      Ligand (a)   Protein (b)   Water (c)
1b6l        0.45         0.00          1              0.34         0.73          1
1eby        0.58         0.00          1              0.79         1.37          1
1hpo        1.53         0.99          0              1.18         1.45          0
1hpv        0.83         1.00          1              0.87         1.12          1
1pro        0.67         0.72          0              0.63         1.15          0
1ajv        0.57         0.00          0              0.60         0.50          0
1ajx        0.67         0.69          0              0.55         0.67          0
1hvr        0.64         0.00          0              0.57         0.31          0
1hwr        0.45         0.81          0              0.44         0.84          0
1qbs        (5.33)       0.72          0              5.25         0.67          0
(a) RMSD (in Å): criterion of success of 2.0 Å. (b) RMSD (in Å): criterion of success:
better than average RMSD; average RMSD between protein structures computed on the
binding site residues: 0.91 Å for the first five structures (one Asp 25 protonated) and 0.77
Å for the last five structures (AspA25 and AspB25 protonated). (c) Water molecules as
proposed by FITTED; 1 and 0 define the presence or absence of the water molecule
respectively. Bold numbers highlight failures. (d) Score in arbitrary units.
Table C.5 - Docking to flexible proteins - thymidine kinase inhibitors.
Docking to semi-flexible protein
PDB code    Lig (a)    Pro (b)    Occurrence of water mol. (c)
1e2k        0.54       0.00       1 0 1 1 0 1
1e2p        0.57       0.00       1 0 1 1 1 1
1ki3        1.29       0.00       0 1 0 1 1 1
1ki4        0.52       0.00       1 0 1 1 1 1
1ki7        5.75       0.83       0 1 0 1 1 1
1ki8        0.58       0.96       1 0 1 1 1 1
2ki5        0.59       0.89       0 1 0 1 1 1
1of1        0.34       0.26       1 0 1 1 1 1
1qhi        0.72       0.00       0 1 0 1 1 1
Docking to fully flexible protein
PDB code    Lig (a)    Pro (b)    Occurrence of water mol. (c)
1e2k        0.71       0.99       1 0 1 1 1 1
1e2p        0.78       0.74       1 0 1 1 1 1
1ki3        3.01       0.71       1 0 1 1 1 1
1ki4        0.42       0.96       1 0 1 1 1 1
1ki7        5.69       0.65       0 1 1 1 1 1
1ki8        0.48       0.43       1 0 1 1 1 1
2ki5        (1.60)     1.06       1 0 1 1 0 1
1of1        0.52       0.95       1 0 1 1 1 1
1qhi        (0.65)     0.74       0 1 0 1 1 1
(a) RMSD (in Å): criterion of success of 2.0 Å. (b) RMSD (in Å): criterion of success:
better than average RMSD; average RMSD between protein structures computed on the
binding site residues: 0.92 Å. (c) Water molecules as proposed by FITTED; 1 and 0 define
the presence or absence of the water molecules respectively. Bold numbers highlight
failures.
Table C.6 - Docking to flexible proteins - Factor Xa, trypsin and MMP-3 inhibitors.
            Docking to semi-flexible protein          Docking to fully flexible protein
PDB code    Ligand (a)   Protein (b)   Water (c)      Ligand (a)   Protein (b)   Water (c)
1ezq        0.59         0.00          1 0            0.49         0.59          0 0
1f0r        2.24         0.55          0 0            0.53         0.27          1 1
1fjs        2.71         0.90          1 1            2.51         0.62          1 1
1nfu        0.77         0.00          0 0            0.85         0.65          0 0
1xka        8.17         1.01          1 1            2.01         0.83          1 1
1f0u        0.56         0.00          1 -            0.43         0.40          1 -
1o2j        1.02         0.53          1 -            0.80         0.40          1 -
1o3g        0.78         0.00          1 -            0.93         0.50          1 -
1o3i        0.80         0.00          1 -            0.77         0.65          1 -
1qbo        0.53         0.00          1 -            0.58         0.63          1 -
1b8y        0.69         0.45          - -            0.67         0.64          - -
1bwi        0.63         0.00          - -            0.95         0.55          - -
1ciz        0.64         0.45          - -            0.92         0.89          - -
1d8m        1.10         0.00          - -            0.93         1.24          - -
(a) RMSD (in Å): criterion of success of 2.0 Å. (b) RMSD (in Å): criterion of success:
better than average RMSD; average RMSD between protein structures computed on the
binding site residues: 0.92 Å. (c) Water molecules as proposed by FITTED; 1 and 0 define
the presence or absence of the water molecules respectively. Bold numbers highlight
failures.
Table C.7 - Docking accuracy - FITTED 1.0 vs. FITTED 1.5.
        Rigid                   Semi-flexible                      Fully flexible
Ver     Lig (a)   Water (c)     Lig (a)   Pro (b)   Water (c)      Lig (a)   Pro (b)   Water (c)
1.0     76%       84%           73%       58%       86%            73%       71%       83%
1.5     94%       91%           84%       84%       82%            88%       78%       75%
(a) RMSD (in Å): criterion of success of 2.0 Å. (b) RMSD (in Å): criterion of success:
better than average RMSD; average RMSD between protein structures computed on the
binding site residues: 0.92 Å. (c) Water molecules successfully predicted (absence or
presence) by FITTED.
APPENDIX D
SUPPORTING INFO FOR CHAPTER 4:
DOCKING LIGANDS INTO FLEXIBLE AND SOLVATED
MACROMOLECULES. 3.
IMPACT OF INPUT LIGAND CONFORMATION, PROTEIN
FLEXIBILITY AND WATER MOLECULES ON THE ACCURACY OF
DOCKING PROGRAMS
Table D.1 - Accuracy of the 6 docking programs using various conditions and self-docking experiments with dry protein.
Self Docking Dry < 2 Å
Crystal Generated Omega +Rings
eHiTS acc. 3 91 77
eHiTS acc. 6 93 80
FITTED Dock 58 59 59
FITTED VS 56 57 54
Flexx CS 56 48 52
Flexx CS SIS 60 50 51
Flexx FS 60 46 54
Flexx FS SIS 54 51 49
Flexx PLP 50 35 45
Flexx PLP SIS 46 41 46
Flexx SS 60 46 54
Flexx SS SIS 54 51 49
Glide HTVS 73 54 56
Glide SP 73 56 58
Glide XP 74 58 60
Glide HTVS Refined Protein 80 58 60
Glide SP Refined Protein 80 59 60
Glide XP Refined Protein 79 67 67
GOLD CS 73 64 65
GOLD GS 64 53 55
GOLD CS - Robust 70 58 65
GOLD GS - Robust 63 48 57
Surflex 62 47 51
Surflex - pgeom 73 50 66
Surflex - hprot 62 50 59
Surflex - hprot - pgeom 80 69 72
Surflex - popt 65 47 55
Surflex - popt - pgeom 72 57 59
Table D.2 - Accuracy of the 6 docking programs using various conditions and self-docking experiments with proteins with waters.
Self docking with Waters < 2 Å
Explicit Waters   Displ. Ensemble   Displ. Waters
eHiTS acc. 3 76 79
eHiTS acc. 6 81 82
FITTED Dock 62 61 64
FITTED VS 62 54 64
Flexx CS 50 52 43
Flexx CS SIS 53 53 54
Flexx FS 49 51 46
Flexx FS SIS 49 53 48
Flexx PLP 38 37 39
Flexx PLP SIS 41 38 42
Flexx SS 49 51 46
Flexx SS SIS 49 53 48
Glide HTVS 58 58
Glide SP 64 63
Glide XP 62 63
Glide HTVS Refined Protein 64 62
Glide SP Refined Protein 59 60
Glide XP Refined Protein 68 68
GOLD CS 65 68 63
GOLD GS 54 52 55
GOLD CS - Robust 64 59 62
GOLD GS - Robust 51 53 52
Surflex 38 49
Surflex - pgeom 42 51
Surflex - hprot 42 59
Surflex - hprot - pgeom 56 69
Surflex - popt 43 50
Surflex - popt - pgeom 47 58
Table D.3 - Accuracy of the 6 docking programs using various conditions and cross-docking experiments with dry proteins.
Cross and Flexible Docking Dry < 2.25
Cross-Docking   Conformational Ensemble   Flexible Protein (“Semi”/“Fully”)
eHiTS acc. 3 64 80
eHiTS acc. 6 70 71
FITTED Dock 40 50 48/48
FITTED VS 37 48 44/44
Flexx CS 24 43
Flexx CS SIS 27 43
Flexx FS 20 28
Flexx FS SIS 25 38
Flexx PLP 16 24
Flexx PLP SIS 21 28
Flexx SS 20 28
Flexx SS SIS 25 38
Glide HTVS 29 44
Glide SP 31 43
Glide XP 31 46
Glide HTVS Refined Protein 35 50
Glide SP Refined Protein 34 48
Glide XP Refined Protein 36 50
GOLD CS 45 38
GOLD GS 38 44
GOLD CS - Robust 44 40
GOLD GS - Robust 33 33
Surflex 29 37
Surflex - pgeom 31 40
Surflex - hprot 29 44
Surflex - hprot - pgeom 30 47
Surflex - popt 27 38
Surflex - popt - pgeom 30 45
Table D.4 - Accuracy of the 6 docking programs using various conditions and cross-docking experiments with proteins with waters.
Cross and Flexible Docking wet < 2.25
Cross-Docking   Conformational Ensemble   Flexible Protein (“Semi”/“Fully”)
eHiTS acc. 3 64 79
eHiTS acc. 6 69 71
FITTED Dock 39 50 50/52
FITTED VS 35 47 48/44
Flexx CS 27 45
Flexx CS SIS 24 41
Flexx FS 20 31
Flexx FS SIS 27 43
Flexx PLP 16 29
Flexx PLP SIS 21 30
Flexx SS 20 31
Flexx SS SIS 27 30
Glide HTVS 33 49
Glide SP 36 49
Glide XP 35 49
Glide HTVS Refined Protein 33 54
Glide SP Refined Protein 33 52
Glide XP Refined Protein 37 54
GOLD CS 42 41
GOLD GS 34 36
GOLD CS - Robust 46 41
GOLD GS - Robust 36 39
Surflex 28 36
Surflex - pgeom 30 41
Surflex - hprot 32 43
Surflex - hprot - pgeom 28 47
Surflex - popt 32 38
Surflex - popt - pgeom 31 44
APPENDIX E
SUPPORTING INFO FOR CHAPTER 5:
TOWARD A COMPUTATIONAL TOOL PREDICTING THE
STEREOCHEMICAL OUTCOME OF ASYMMETRIC REACTIONS. 2.
DEVELOPMENT AND APPLICATION OF A RAPID AND ACCURATE
PROGRAM BASED ON ORGANIC PRINCIPLES.
EXPERIMENTAL
All structures were drawn within InsightII, charged using the ESFF force field and
saved in mol2 format. These structures were next prepared for use with ACE. This step
is done automatically using SMART, a module of ACE, which assigns MM3 atom
types and identifies “rotatable” bonds and rings. These files were next used with ACE, which
automatically performs the entire run (transition state definition from reactants and
products, conformational search, energy calculation, conjugate gradient energy
minimization). The MM3-derived energies of the modeled TSs are then output
together with the optimized structures. The ACE and SMART executables are available
free of charge to academics.
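To relate a difference in transition state energies to a stereochemical outcome, the standard Boltzmann relation ratio = exp(ΔΔG/RT) can be used. The Python sketch below illustrates this relation only; it is not taken from ACE. It assumes that the energy difference between the lowest TS leading to each isomer is expressed in kcal/mol, that the products form under kinetic control, and that the temperature is 298.15 K unless stated otherwise. The function names are hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal mol^-1 K^-1


def isomer_ratio(delta_delta_g, temperature=298.15):
    """Major:minor ratio for a TS energy difference in kcal/mol (Boltzmann relation)."""
    return math.exp(delta_delta_g / (R_KCAL * temperature))


def excess_percent(delta_delta_g, temperature=298.15):
    """Diastereomeric (or enantiomeric) excess in percent for the same energy difference."""
    ratio = isomer_ratio(delta_delta_g, temperature)
    return 100.0 * (ratio - 1.0) / (ratio + 1.0)


for ddg in (0.5, 1.0, 2.0):
    print(ddg, round(isomer_ratio(ddg), 1), round(excess_percent(ddg), 1))
```

Because the ratio depends exponentially on the energy difference, an error of a few tenths of a kcal/mol in the modeled TS energies changes the predicted excess noticeably, which is why the TS description needs to be accurate.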
For the Diels-Alder reaction, only structures leading to the endo product were
investigated. In practice, the endo adducts are the major isomers observed experimentally
with the investigated auxiliaries. In all cases two products and two reactants were drawn,
considering both possible diastereomeric outcomes of an endo attack.
The xyz coordinates of the Diels-Alder transition state predicted by ACE for entry 16
in Table 2.
[Reaction scheme: 1a and 2a with Et2AlCl]
Al 2.0634 0.9863 -10.9626
C 1.8520 1.4088 -12.8730
H 2.4556 2.2144 -13.0919
H 0.8801 1.7130 -13.0285
C 1.5955 2.4373 -9.7165
H 0.6656 2.2335 -9.3227
H 1.5048 3.3096 -10.2568
C 1.6209 -1.6380 -10.2357
C 3.9636 -0.8343 -10.3641
C 5.0941 -2.6913 -9.7039
H 5.8008 -3.2594 -10.1864
H 5.3276 -2.7237 -8.7009
C 3.6296 -3.1380 -9.9682
H 3.2701 -3.5887 -9.1131
C 3.5065 -4.0980 -11.1720
H 2.5054 -4.2671 -11.3509
C 4.1300 -5.4741 -10.8654
H 3.7210 -5.8791 -10.0113
H 3.9541 -6.1317 -11.6386
H 5.1481 -5.4096 -10.7341
C 4.0831 -3.5207 -12.4814
H 3.9319 -4.1759 -13.2617
H 3.6221 -2.6347 -12.7283
H 5.0946 -3.3483 -12.4103
C 0.7409 -2.7769 -9.8094
H 1.1692 -3.7749 -9.7120
C -0.6222 -2.7158 -10.1443
H -1.1231 -3.6633 -10.3648
H -1.0045 -1.8377 -10.6734
C 2.6431 2.5860 -8.6070
H 2.3812 3.3422 -7.9587
H 2.7272 1.7125 -8.0682
H 3.5654 2.8090 -9.0075
C 2.1990 0.2054 -13.7572
H 1.5853 -0.5927 -13.5402
H 2.0825 0.4411 -14.7531
H 3.1755 -0.0843 -13.6062
O 3.8299 0.3394 -10.6093
O 1.0690 -0.5977 -10.5226
N 3.0146 -1.8152 -10.1981
O 5.2029 -1.3320 -10.1378
C -1.5294 -2.4255 -8.3555
H -2.5917 -2.6410 -8.5044
C -0.5912 -3.3758 -7.6425
H -0.8552 -3.5014 -6.6534
H -0.5227 -4.3111 -8.0692
C 0.6491 -2.5189 -7.7908
H 1.6137 -2.8160 -7.3703
C -1.0712 -1.1396 -8.0856
H -1.6609 -0.2166 -8.2113
C 0.2164 -1.1969 -7.7348
H 0.8529 -0.3270 -7.5044
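Coordinates listed in this form (element symbol followed by x, y and z in Å) are straightforward to read back for a quick geometric check. The Python sketch below is an illustration only; parse_xyz is a hypothetical helper, not part of ACE. It reads such lines and computes the distance from the aluminium atom to each of two oxygen atoms, using three lines copied from the listing above.

```python
import math


def parse_xyz(text):
    """Return (symbol, (x, y, z)) tuples; lines that do not fit the pattern are skipped."""
    atoms = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # e.g. running headers or page numbers
        try:
            x, y, z = (float(value) for value in parts[1:])
        except ValueError:
            continue
        atoms.append((parts[0], (x, y, z)))
    return atoms


sample = """Al 2.0634 0.9863 -10.9626
O 3.8299 0.3394 -10.6093
O 1.0690 -0.5977 -10.5226"""

atoms = parse_xyz(sample)
al_xyz = atoms[0][1]
distances = [math.dist(al_xyz, xyz) for symbol, xyz in atoms if symbol == "O"]
print([round(d, 2) for d in distances])
```

Both distances come out close to 1.9 Å, which is what one would expect if these two oxygen atoms are coordinated to the aluminium in the modeled transition state.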
For the aldol reaction, both enolate possibilities (the two methyl groups C1 and
C2 are equivalent but do not interchange during the conformational search as the proline
nitrogen N1 is doubly bonded to the carbon C3 in the product) and 4 products (2 for each
diastereomeric possibility) were considered. This allowed us to consider all possible
outcomes for the different enolate possibilities.
[Structure drawing with atom labels C1, C2, C3 and N1]
The xyz coordinates of the aldol transition state predicted by ACE for the reaction of
acetone with acetaldehyde in the presence of proline (entry 17 in Table 4).
[Reaction scheme with compound labels 4, 5c and 6a]
N 2.3242 -0.4309 -0.0200
C 1.7969 0.8830 -0.4502
H 0.7710 0.8985 -0.3684
C 2.1884 1.2606 -1.8748
O 2.2783 0.2458 -2.6967
H 3.0048 -0.9521 -2.2561
O 1.8634 2.4146 -2.2262
C 2.4191 1.8258 0.5854
H 2.4177 2.8146 0.2916
H 1.8993 1.7757 1.4751
C 3.8281 1.2572 0.7645
H 4.4494 1.6119 0.0210
H 4.2558 1.5263 1.6635
C 3.6495 -0.2622 0.6270
H 4.3799 -0.6931 0.0402
H 3.6295 -0.7312 1.5437
C 1.7336 -1.5744 -0.2667
C 0.3506 -1.6485 -0.8689
H 0.3551 -1.2719 -1.8273
H 0.0085 -2.6174 -0.9079
H -0.3165 -1.1056 -0.3051
C 2.4508 -2.8408 -0.1536
H 3.2958 -2.8163 0.4961
H 1.8200 -3.6723 0.0681
C 3.0974 -3.1365 -1.7928
H 2.2518 -3.2755 -2.4430
C 3.9336 -4.4119 -1.7820
H 4.7303 -4.3142 -1.1372
H 3.3647 -5.2163 -1.4823
H 4.3006 -4.6039 -2.7248
O 3.8119 -2.0472 -2.1381
Figure E.1 - Data represented as Entry # versus ΔΔG (see Figure 2 in the publication for
another representation of the data). Predicted (grey) vs. observed (black).
[Bar chart: entries 1 to 43 on one axis, ΔΔG from -2 to 2 on the other]
Figure E.2 - Data represented as Entry # versus ΔΔG (see Figure 4 in the publication
for another representation of the data). Predicted (grey) vs. observed (black).
[Bar chart: entries 1 to 39 on one axis, ΔΔG from -2.0 to 2.0 on the other]
APPENDIX F
FITTED2.6 USER MANUAL
Nicolas Moitessier, Christopher Corbeil, Pablo Englebienne Department of Chemistry, McGill University Montréal, Québec, Canada
FITTED 2.6
User Guide
Table of Contents
I. Preface .......................................................................................................................................... - 246 -
II.1. Conventions used in this guide ............................................................................................. - 246 -
II.2. Acknowledgements ............................................................................................................... - 246 -
II. FITTED: Theory and implementation ......................................................................................... - 247 -
II.2. Overview ................................................................................................................................ - 247 -
II.3. ProCESS ................................................................................................................................ - 249 - II.3.1. Initial setup ..................................................................................................................... - 249 - II.3.2. Definition of the binding site and creation of the XXXX_site.txt file ............................... - 250 - II.3.3. Creation of the XXXX_dock.mol2 file ............................................................................. - 250 - II.3.4. Binding site cavity generation ........................................................................................ - 250 - II.3.5. Interaction Sites/Pharmacophore generation and creation of the XXXX_IS.mol2 file ... - 253 - II.3.6. Solvation and creation of the XXXX_score.mol2 file ...................................................... - 254 -
II.4. SMART ................................................................................................................................... - 255 - II.4.1. SMART input ................................................................................................................... - 255 - II.4.2. Partial atomic charges .................................................................................................... - 255 - II.4.3. The bit string description ................................................................................................ - 255 -
II.5. FITTED ................................................................................................................................... - 256 - II.5.1. FITTED modes ................................................................................................................ - 256 - II.5.2. Protein flexibility.............................................................................................................. - 257 - II.5.3. Covalent docking ............................................................................................................ - 258 - II.5.4. FITTED scoring functions ................................................................................................ - 258 - II.5.5. Genetic algorithm ........................................................................................................... - 260 -
II.6. References ............................................................................................................................ - 267 -
III. Installation ...................................................................................................................................... 269
III.2. The FITTED, ProCESS and SMART folders ............................................................................... 269
IV. Getting started with FITTED ..................................................................................................... - 270 -
IV.2. Setting up the system .......................................................................................................... - 270 -
IV.3. Running the FITTED suite .................................................................................................... - 271 -
V. Preparing a keyword file for FITTED ........................................................................................ - 273 -
V.2. Input/output files .................................................................................................................... - 273 -
V.3. Run parameters .................................................................................................................... - 274 -
V.4. Filtering parameters .............................................................................................................. - 275 -
V.5. Conjugate gradient parameters ............................................................................................ - 279 -
V.6. Energy parameters ............................................................................................................... - 279 -
V.7. Scoring parameters ............................................................................................................... - 279 -
V.8. Initial population parameters ................................................................................................. - 280 -
V.9. Evolution parameters ............................................................................................................ - 280 -
APPENDIX F
- 244 -
V.10. Docking of covalent inhibitors ............................................................................................. - 281 -
V.11. A typical FITTED keyword file .............................................................................................. - 282 -
VI. Preparing a keyword file for ProCESS .................................................................................... - 284 -
VI.2. Input/output files ................................................................................................................... - 284 -
VI.3. Reading the input files and preparing the output protein files ............................................. - 285 -
VI.4. Parameters for the binding cavity file ................................................................................... - 285 -
VI.5. Parameters for the Interaction Sites file ............................................................................... - 286 -
VI.6. A typical ProCESS keyword file ............................................................................................ - 288 -
VII. Running SMART ........................................................................................................................ - 290 -
Appendix A: FITTED Input File Formats ....................................................................................... - 292 -
A.1. The protein files .................... - 293 -
A.1.1. The XXXX_dock.mol2 .................... - 293 -
A.1.2. The XXXX_score.mol2 file .................... - 293 -
A.1.3. The XXXX_site.txt file .................... - 296 -
A.2. The Ligand file .................... - 296 -
A.3. The binding site cavity file .................... - 300 -
A.4. The interaction site, pharmacophore and XXXX_IS.mol2 files .................... - 301 -
A.5. The force field file .................... - 302 -
A.5.1. Adding parameters to the bond list .................... - 302 -
A.5.2. Adding parameters to the angle list .................... - 302 -
A.5.3. Adding parameters to the torsion list .................... - 303 -
A.5.4. Adding parameters to the out-of-plane list .................... - 304 -
A.5.5. Adding parameters to the van der Waals list .................... - 304 -
A.5.6. Adding parameters to the hydrogen bond list .................... - 305 -
A.5.7. Adding parameters to the bond charge increment list .................... - 305 -
A.5.8. Adding parameters to the partial bond charge increment / formal charge adjustment factor list .................... - 306 -
Appendix B: FITTED errors and warnings .................................................................................... - 307 -
Appendix C: ProCESS errors and warnings ................................................................................ - 310 -
Appendix D: SMART errors and warnings .................................................................................... - 311 -
Appendix E: Functional group definitions ................................................................................... - 313 -
Appendix F: Additional keywords for FITTED .............................................................................. - 316 -
F.1. Input/output files .................... - 316 -
F.2. Run parameters .................... - 318 -
F.3. Filtering parameters .................... - 319 -
F.4. Conjugate gradient parameters .................... - 322 -
F.5. Energy parameters .................... - 323 -
F.6. Scoring parameters .................... - 325 -
F.7. Initial population parameters .................... - 326 -
F.8. Evolution parameters .................... - 328 -
F.9. Output/convergence parameters .................... - 331 -
F.10. Docking of covalent inhibitors .................... - 332 -
Appendix G: Additional Keywords for ProCESS ......................................................................... - 333 -
G.1. Input/output files .................... - 333 -
G.2. Reading the input files and preparing the output protein files .................... - 334 -
G.3. Parameters for the binding cavity file .................... - 335 -
G.4. Parameters for the Interaction sites file .................... - 336 -
I. Preface
I.1. Conventions used in this guide
This guide describes a suite of programs that can be used either interactively or through command-line arguments. FITTED and ProCESS also require a set of commands supplied in a keyword file, a standard ASCII text file whose instructions follow the form Keyword Option, usually one per line, although in some cases a keyword may span multiple lines.
In the remainder of the manual, different typefaces will be used to symbolize the following:
Filenames and command-line input: constant-width font, standard face. Examples: ligand.mol2 keyword.txt smart –fitted ligands.mol2
Keyword names: constant-width font, bold face. Examples: Protein Mode AutoFind_Site
Keyword options: constant-width font, italic face. Examples: 1a46.mol2 Docking Yes
Please note that the formatting is for clarity of the manual only as it is not possible to format an ASCII file with different typefaces.
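The Keyword Option convention can be illustrated with a minimal parsing sketch. It is not part of the FITTED suite: the function name parse_keyword_file is hypothetical, the pairing of keywords and options in the sample is assumed from the examples above, and keywords that span several lines are not handled.

```python
def parse_keyword_file(text):
    """Collect 'Keyword Option' pairs, one per line (illustrative only)."""
    keywords = {}
    for raw_line in text.splitlines():
        parts = raw_line.split(None, 1)  # keyword, then the rest of the line
        if not parts:
            continue  # blank line
        name = parts[0]
        option = parts[1].strip() if len(parts) > 1 else ""
        # A keyword may appear more than once, so options are kept in a list.
        keywords.setdefault(name, []).append(option)
    return keywords


# Assumed pairing of the example keywords and options shown above.
sample = "Protein 1a46.mol2\nMode Docking\nAutoFind_Site Yes\n"
parsed = parse_keyword_file(sample)
print(parsed)
```

Storing options in a list per keyword is a design choice of this sketch; it simply avoids losing information if a keyword is repeated.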
I.2. Acknowledgements
Over the last years, the development of FITTED has been funded by ViroChem Pharma (research grants) and the Canadian Institutes of Health Research (CIHR operating grants), while the development of ACE has been funded by the Natural Sciences and Engineering Research Council (NSERC Discovery Grant). These partners are warmly acknowledged. More recently, the "Ministère du Développement économique, de l'Innovation et de l'Exportation du Québec" has recognized the potential of our drug discovery platform and granted us funding for further development and commercialization as part of a program called "Soutien à la maturation technologique". Jeremy Schwartzentruber (code optimization) and Devin Lee (comparative study) are also acknowledged.
II. FITTED: Theory and implementation
II.1. Overview
Docking methods are computational techniques that are able to predict the binding mode of ligands (e.g., enzyme inhibitors, receptor antagonists) to their biological targets.[1] Combined with a scoring function, these methods can be used to screen large libraries of compounds for their affinities for a specific pharmaceutically relevant target. FITTED 2.6 is a suite of programs (FITTED, ProCESS and SMART) for the docking of small-molecule ligands onto flexible proteins.
ProCESS is a module used in the preparation of the protein input files for FITTED. ProCESS (i) can truncate the protein to reduce its structure strictly to the binding site; (ii) assigns atom types and partial charges; (iii) checks the protein files for consistency if more than one is used; (iv) pre-computes the atomic solvation data; (v) identifies the binding site cavity and the interaction sites and prepares the files describing them.
FITTED also includes SMART, a module used to prepare the ligand input files. SMART (i) assigns atom types; (ii) identifies rotatable bonds (e.g., C-C single bonds); (iii) applies descriptors and derives a bit string describing the chemical features and properties of the compound; and (iv) assigns MMFF partial atomic charges.
The main module (FITTED) docks the ligands into the flexible protein in the presence of displaceable water molecules.
A complete description of the theory and modules of FITTED 1.0 and 1.5 can be found in references 2 and 3. The current version of the suite (FITTED 2.6) includes previous and additional features and will be reported shortly [4]:
Automatic interaction site generation,
Automatic binding site identification,
Interaction site / force field consensus docking,
Pharmacophore oriented docking,
Force field parameter estimator,
Generalized AMBER force field (GAFF) implementation,
Improved scoring function (RankScore 2),
Filters for drug-like molecules,
Improved atom typing,
Semi-flexible protein docking with flexible waters,
United atom representation of the ligand protein non-bonded interactions,
New pElite genetic operator,
New evolution type: Metropolis,
Improved conjugate gradient minimization algorithm,
Covalent inhibitor docking,
Docking of ligands to DNA/RNA.
Matching algorithm for orienting individuals in the initial population.

Descriptions of the concepts of displaceable water molecules, flexible proteins, and pharmacophore-oriented docking can be found in references [2]-[6]. Applications of FITTED can be found in references [3], [7] and [8], while the development of the RankScore scoring function used by FITTED is described in references [6], [10] and [11]. Docking of covalent inhibitors is yet to be reported [12].
II.2. PROCESS
ProCESS is the module used to prepare the receptor(s) for FITTED, generating the various descriptions of the binding site of the receptor(s). Herein, enzymes, cell-membrane receptors and nucleic acids are all referred to as receptors. Most commonly, the receptor will be a protein, although ProCESS and FITTED can now handle nucleic acids (see Table 1b). For more information on the setup of the individual protein files before using ProCESS, see section IV.1, Setting up the system.
II.2.1. Initial setup
ProCESS requires all input proteins (if running any of the flexible modes) to be similar. For two proteins to be considered similar, they must have the same number of atoms, the same residue naming and have the same atoms within each residue (including atom names).
The order of the atoms in the input file does not have to coincide for all proteins. The first step done by ProCESS involves sorting the protein atoms to have the same order as the first protein listed in the keyword file. ProCESS can sort the atoms within a residue if all the atoms of that residue are listed contiguously. If the atoms of a residue are listed in multiple locations (e.g., all heavy atoms first, then all hydrogen atoms), ProCESS will issue an error. If ProCESS cannot find a residue from the reference file in another protein it will output an error message stating which residue cannot be found. If this occurs, rename the residue in the file giving the error.
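The similarity requirement can be pictured with a small sketch. This is an illustration of the stated rule only, not ProCESS code: the function name are_similar and the (residue, atom name) tuple representation are assumptions, and the residue labels are made up.

```python
from collections import Counter


def are_similar(protein_a, protein_b):
    """Compare two proteins given as lists of (residue, atom_name) tuples.

    Counter equality ignores atom order but requires the same residues,
    the same atom names within each residue, and the same atom count.
    """
    return Counter(protein_a) == Counter(protein_b)


reference = [("SER195", "N"), ("SER195", "CA"), ("SER195", "OG")]
reordered = [("SER195", "OG"), ("SER195", "N"), ("SER195", "CA")]
renamed = [("SER195", "N"), ("SER195", "CA"), ("SER195", "O1")]

print(are_similar(reference, reordered))  # True
print(are_similar(reference, renamed))    # False
```

The sketch deliberately ignores atom order, since the text states that the order does not have to coincide across input files.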
Charge assignment, atom typing and interaction site generation by ProCESS require hydrogen atom names to follow the IUPAC recommendations in section 2.1.1 of Pure Appl. Chem. 70:117 (1998). Residue names must use the standard three-letter code or an extended four-letter naming (see Tables 1a and 1b).
Table 1a - List of acceptable residue names for protein ProCESS input
Amino acid                     Mid chain   Other accepted names   N-terminal   C-terminal
Alanine                        ALA         -                      ALAN         ALAN
Arginine                       ARG         -                      ARGN         ARGN
Asparagine                     ASN         -                      ASNN         ASNN
Neutral aspartic acid          ASPH        ASH, ASZ               -            -
Aspartic acid                  ASP         -                      ASPN         ASPN
Cysteine in disulfide bridge   CYS         -                      -            -
Cysteine                       CYSH        CYS                    CYSHN        CYSHN
Glutamine                      GLN         -                      GLNN         GLNN
Glutamic acid                  GLU         -                      GLUN         GLUN
Neutral glutamic acid          GLUH        GLU                    -            -
Glycine                        GLY         -                      GLYN         GLYN
Histidine epsilon              HISE        HIS, HIE               HISEN        HISEN
Histidine protonated           HISP        HIS, HIP               HISPN        HISPN
Histidine delta                HISD        HIS, HID               HISDN        HISDN
Water                          HOH         WAT                    -            -
Isoleucine                     ILE         -                      ILEN         ILEN
Leucine                        LEU         -                      LEUN         LEUN
Lysine                         LYS         -                      LYSN         LYSN
Methionine                     MET         -                      METN         METN
Phenylalanine                  PHE         -                      PHEN         PHEN
Proline                        PRO         -                      PRON         PRON
Serine                         SER         -                      SERN         SERN
Threonine                      THR         -                      THRN         THRN
Tyrosine                       TYR         -                      TYRN         TYRN
Tryptophan                     TRP         -                      TRPN         TRPN
Valine                         VAL         -                      VALN         VALN
Table 1b - List of acceptable residue names for nucleic acid ProCESS input
Nucleotide        Mid chain   5’-terminus   3’-terminus
deoxyadenosine    DA*         DA5           DA3
deoxyguanosine    DG*         DG5           DG3
deoxycytosine     DC*         DC5           DC3
deoxythymidine    DT*         DT5           DT3
riboadenosine     RA*         RA5           RA3
riboguanosine     RG*         RG5           RG3
ribocytosine      RC*         RC5           RC3
ribouracil        RU*         RU5           RU3
II.2.2. Definition of the binding site and creation of the XXXX_site.txt file
The first step within ProCESS is to define the binding site, either manually or automatically with the aid of a co-crystallized ligand. For manual definition, the residues of the binding site are listed with the Binding_Site keyword. For automatic definition, the AutoFind_Site keyword is used; ProCESS then reads the file given by the Ligand keyword and selects all residues within Ligand_Cutoff of the ligand as binding site residues. The selected residues (whether listed manually or derived from the co-crystallized ligand) are written to XXXX_site.txt (XXXX being the protein filename with the ".mol2" extension removed). If more than one protein is used as input, all of them are considered when creating XXXX_site.txt; however, only one XXXX_site.txt file is output, since all the proteins have the same binding site residues.
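The automatic selection of binding site residues can be sketched as a simple distance test. This is a hedged illustration for a single protein, not the ProCESS implementation: the function name, the data layout, the residue labels and the cutoff value of 3.0 Å are all assumptions chosen for the example.

```python
import math


def binding_site_residues(protein_atoms, ligand_coords, cutoff):
    """Return residues having at least one atom within `cutoff` of the ligand.

    protein_atoms: list of (residue_label, (x, y, z)) tuples.
    ligand_coords: list of (x, y, z) tuples for the co-crystallized ligand.
    """
    selected = set()
    for residue, xyz in protein_atoms:
        if any(math.dist(xyz, lig) <= cutoff for lig in ligand_coords):
            selected.add(residue)
    return sorted(selected)


# Made-up coordinates (in Angstroms) for illustration.
protein_atoms = [
    ("SER195", (0.0, 0.0, 0.0)),
    ("SER195", (1.0, 0.0, 0.0)),
    ("HIS57", (4.0, 0.0, 0.0)),
    ("ASP102", (12.0, 0.0, 0.0)),
]
ligand_coords = [(2.0, 0.0, 0.0)]
site = binding_site_residues(protein_atoms, ligand_coords, cutoff=3.0)
print(site)  # ['HIS57', 'SER195']
```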
II.2.3. Creation of the XXXX_dock.mol2 file
To reduce the amount of time required to compute the non-bond interactions within FITTED, the protein is truncated to remove residues which have small contributions to the binding energy, i.e., far away from the binding site. There are two options available to truncate the protein: automated truncation, where the center of the truncation is the ligand, and manual truncation, where the center is the collection of active site residues. A residue is removed from the protein unless an atom within the residue lies within the cutoff distance of the center in use.
II.2.4. Binding site cavity generation
To reduce the amount of time required for a FITTED run, a negative image of the binding site cavity (herein referred to as the binding site cavity) is generated. Using the binding site cavity, FITTED can quickly discard ligand binding modes that clash with the protein, without having to compare against all the atoms of the protein.
Figure 1 - Grid generated from ProCESS
The setup of the binding site cavity first requires the location of the binding site to be defined. This can be done either manually (Grid_Center) or automatically using the co-crystallized ligand (Ligand). Once the center of the site has been defined, ProCESS creates a grid (Figure 1).

The size and resolution of the grid can be customized with the Grid_Size and Grid_Resolution keywords, although the default values are highly recommended for docking to protein binding sites. Grid points are located within a cube spanning the coordinates Grid_Center ± Grid_Size. At each point of the grid, ProCESS checks for clashes with protein atoms. A grid point is considered to be clashing if it lies within Grid_Clash Å of a protein atom, excluding water atoms. If multiple proteins are used, a grid point must clash with all proteins to be removed from the grid. Finally, the active site is made spherical, to remove the bias from the initial orientation of the grid, by removing all grid points lying further than Grid_Size Å from the center of the binding site cavity. This sphere is shown as black dashes in Figure 2b.
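The grid construction, clash test and spherical trimming described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the ProCESS code: the function name build_cavity_grid is hypothetical, water atoms are assumed to have been removed from the input already, and the parameter values in the example are arbitrary rather than program defaults.

```python
import itertools
import math


def build_cavity_grid(center, size, resolution, clash, proteins):
    """Return grid points that survive the clash test and spherical trim.

    proteins: list of proteins, each a list of (x, y, z) atom coordinates
    (water atoms assumed to be excluded beforehand).
    """
    steps = int(round(size / resolution))
    offsets = [k * resolution for k in range(-steps, steps + 1)]
    kept = []
    for dx, dy, dz in itertools.product(offsets, repeat=3):
        # Spherical trim: drop points further than `size` from the center.
        if dx * dx + dy * dy + dz * dz > size * size:
            continue
        point = (center[0] + dx, center[1] + dy, center[2] + dz)
        # A point is removed only if it clashes with every protein.
        clashes_with_all = bool(proteins) and all(
            any(math.dist(point, atom) < clash for atom in atoms)
            for atoms in proteins
        )
        if not clashes_with_all:
            kept.append(point)
    return kept


one_protein = [[(0.0, 0.0, 0.0)]]
two_proteins = [[(0.0, 0.0, 0.0)], [(5.0, 5.0, 5.0)]]
single = build_cavity_grid((0.0, 0.0, 0.0), 2.0, 1.0, 1.0, one_protein)
double = build_cavity_grid((0.0, 0.0, 0.0), 2.0, 1.0, 1.0, two_proteins)
print(len(single), len(double))
```

In the example, the central grid point is removed when only the first protein is given, but it is kept when a second protein that does not clash at that point is added, which mirrors the "must clash with all proteins" rule.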
Once all the grid points that are clashing with the protein have been removed, ProCESS eliminates points in regions that are not part of the binding site (Figure 2a). The binding site region (orange in Figure 2a) is defined by points overlapping with the ligand (or the grid center), and expanded through all adjacent points; the regions not connected to the binding site region are termed isolated (red dots in Figure 2a). These isolated points are then removed.
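The removal of isolated points amounts to a connected-region search starting from the points that overlap the ligand (or the grid center). The sketch below uses a breadth-first search over integer grid indices. It is an illustration only: the function name is hypothetical, and treating "adjacent" as the six face-sharing neighbours is an assumption, since the adjacency rule is not specified here.

```python
from collections import deque

# Assumed adjacency: the six face-sharing neighbours of a grid point.
NEIGHBOUR_STEPS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0),
                   (0, -1, 0), (0, 0, 1), (0, 0, -1))


def connected_region(points, seeds):
    """Return the grid points reachable from the seeds through neighbours.

    points: iterable of integer (i, j, k) grid indices that survived the
    clash test; seeds: indices overlapping the ligand or grid center.
    """
    points = set(points)
    queue = deque(p for p in seeds if p in points)
    region = set(queue)
    while queue:
        i, j, k = queue.popleft()
        for di, dj, dk in NEIGHBOUR_STEPS:
            neighbour = (i + di, j + dj, k + dk)
            if neighbour in points and neighbour not in region:
                region.add(neighbour)
                queue.append(neighbour)
    return region


grid = {(0, 0, 0), (1, 0, 0), (2, 0, 0), (5, 5, 5)}
site = connected_region(grid, seeds=[(0, 0, 0)])
isolated = grid - site
print(sorted(site), sorted(isolated))
```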
Once the clustering is finished, ProCESS inflates each grid point until it clashes with the protein or reaches the edge of the large sphere (black dashes in Figure 2b), in order to reduce the total number of points. Once a grid point is inflated, it is referred to as a sphere. Each sphere has a radius associated with it, which is either the minimum distance to a protein atom (Grid Point #1 in Figure 2b) or the distance to the edge of the grid (Grid Point #2 in Figure 2b), whichever is smaller. ProCESS then increases the radius of the spheres so that they overlap, which avoids leaving regions that are not covered by any sphere (Grid Point #3 in Figure 2b). ProCESS then takes the sphere with the largest radius and removes all the smaller spheres included within its volume.
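The inflation and pruning steps can be sketched as follows. This is a hedged reading of the description, not the ProCESS algorithm: the function name and the `overlap` parameter are hypothetical, the amount by which radii are increased is not stated in this guide, and applying the containment test repeatedly from the largest sphere downwards is an assumption.

```python
import math


def inflate_and_prune(points, protein_atoms, center, big_radius, overlap=0.0):
    """Turn grid points into spheres and drop spheres contained in larger ones.

    Returns a list of (radius, point) tuples, largest radius first.
    """
    spheres = []
    for point in points:
        to_protein = min(math.dist(point, atom) for atom in protein_atoms)
        to_edge = big_radius - math.dist(point, center)
        # Radius is the smaller of the two distances, plus an assumed margin.
        spheres.append((min(to_protein, to_edge) + overlap, point))
    spheres.sort(key=lambda sphere: sphere[0], reverse=True)

    kept = []
    for radius, point in spheres:
        # A sphere is redundant if it lies entirely inside a kept sphere.
        contained = any(
            math.dist(point, other) + radius <= other_radius
            for other_radius, other in kept
        )
        if not contained:
            kept.append((radius, point))
    return kept


points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
kept = inflate_and_prune(points, protein_atoms=[(0.0, 0.0, 3.0)],
                         center=(0.0, 0.0, 0.0), big_radius=4.0)
print(kept)
```

In this example the third point yields a sphere of radius 2.0 that lies entirely within the sphere centered at the origin, so it is discarded.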
Figure 2 – (a) Removal of isolated points; (b) Inflation of grid points into spheres (labels: Binding Site Cavity Center, Grid Point #1, Grid Point #2, Grid Point #3)
The spheres kept after inflation are output in the Binding_Site_Cav file. A representation of the binding site cavity file is shown in Figure 3.
Figure 3 - Binding site cavity generated by ProCESS
II.2.5. Interaction Sites/Pharmacophore generation and creation of the XXXX_IS.mol2 file
Interaction sites. Studies have shown that using pharmacophoric constraints during docking leads to increased accuracy.[6] By using a pharmacophore before the force field energy calculation, FITTED can discard poses lacking key interactions thus reducing the CPU time required. Pharmacophores are typically user-generated (see below) with knowledge of the binding site. With this in mind, ProCESS can automatically generate interaction sites from the protein input file(s). One advantage to the use of an automated procedure is that potential interaction points which may be ignored or discarded by users would be kept, allowing for the probing of otherwise unprobed binding pockets.
Figure 4 - Example of the interaction site spheres created for serine; red: hydrogen bond donor, blue: hydrogen bond acceptor.
The interaction sites are created automatically by ProCESS after the generation of the binding site cavity. The hydrogen bond donor (HBD) and acceptor (HBA) sites are created by locating spheres complementary to the properties of the binding site residues. As an example (see Figure 4), many HBD and HBA points are created for a serine side chain. These points are transformed into spheres defining the geometrical constraints of the interaction. In all cases, the initial sphere radius is predefined (HBA = 1.5 Å; HBD = 1.5 Å). Along with the radius, each point is assigned a weight depending on its type (HBD = 5; HBA = 5; HBA = 25 when the point is for a metal interaction). The weight defines the importance of the interactions that can occur at each individual sphere. These weights can be modified by using HBD_weight, HBA_Weight and Metal_Weight. At this stage, only points within the binding site (see section II.2.4, Binding site cavity generation) are kept.
To generate the hydrophobic (HYD) points, ProCESS first scans the binding site cavity grid as it was before the inflation of the spheres. At each point not close to an HBA or HBD point, the van der Waals (vdW) interaction energy with the protein carbons is calculated. If this energy is below a cutoff (Hydrophobic_Level, by default set to -0.3), the point is kept. If the point is also surrounded by a number of other HYD points, it is added to the interaction site file. The weight of the point is calculated as the quartic ratio of the vdW energy at the point to the Hydrophobic_Level. This weight can be scaled up or down using the Hydrophobic_Weight keyword, which is by default set to 1. The size of the hydrophobic point is set to 2.0 Å.
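The hydrophobic weighting can be illustrated with a short sketch. It assumes that "quartic ratio" means the ratio of the vdW energy to Hydrophobic_Level raised to the fourth power, which is an interpretation rather than a confirmed formula; the function name is hypothetical and the neighbour-count criterion is left out.

```python
def hydrophobic_weight(vdw_energy, hydrophobic_level=-0.3, hydrophobic_weight=1.0):
    """Return the assumed weight of a hydrophobic point, or None if rejected.

    A point is kept only if its vdW energy is below the cutoff; the weight
    is taken here as (energy / cutoff) ** 4 times the scaling factor.
    """
    if vdw_energy >= hydrophobic_level:
        return None  # not favourable enough to be a hydrophobic point
    return hydrophobic_weight * (vdw_energy / hydrophobic_level) ** 4


strong = hydrophobic_weight(-0.6)    # twice the cutoff energy
rejected = hydrophobic_weight(-0.1)  # above the cutoff
print(strong, rejected)
```

Under this reading, a point whose energy is twice the cutoff receives a weight sixteen times larger than a point sitting just at the cutoff, so the most favourable positions dominate.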
Once all the interaction site spheres have been defined, they are rescaled and some of them are filtered out. The number of atoms within a 12 Å radius of the interaction site sphere is used as a crude estimate of the depth of the pocket where the sphere is located. The weight of a sphere is multiplied by this factor, increasing the weight of buried spheres relative to solvent-exposed ones.
To further reduce the number of sites, ProCESS merges interaction sites which may be considered redundant or overlapping. Interaction sites are considered to be overlapping if they lie within the distance of each other specified by the Pharm_Polar_Softness, Pharm_Nonpolar_Softness and Pharm_Aromatic_Softness keywords.
As ProCESS may generate many interaction sites, the number of points can be reduced further by keeping only the sites with the largest weights. This is done with the Num_of_IS keyword, which sets the number of interaction sites to keep (75 by default). The final interaction sites file for 1b6l is shown in Figure 5.
Figure 5 - Interaction sites generation by ProCESS. Red: hydrogen bond acceptors, blue: hydrogen bond donors, green: hydrophobic.
An XXXX_IS.mol2 file is created for use by the matching algorithm in FITTED. Unlike the interaction sites file, the XXXX_IS.mol2 files are created for each protein individually, as if rigid docking were being done. Therefore, for rigid protein docking, the XXXX_IS.mol2 file and the interaction sites file are similar.
Pharmacophore. A pharmacophore file can also be generated manually following the same format as the interaction sites file (see Appendix A.4, The interaction site, pharmacophore and XXXX_IS.mol2 files). Typically the pharmacophore file has fewer spheres than the interaction sites file, usually involving interactions which are key to binding, such as interactions with metals.
II.2.6. Solvation and creation of the XXXX_score.mol2 file
As soon as the XXXX_dock.mol2 file is ready, ProCESS prepares a second protein file (or set of files if more than one protein file is given as input) named XXXX_score.mol2. This file describes the protein using the all-atom representation, GAFF atom types and AMBER atom charges. A key feature of these files is the pre-computed data used to calculate the Generalized Born/Surface Area (GB/SA) solvation properties (for a detailed description of this method, see [14] and [15]). As part of the scoring function RankScore, FITTED evaluates the solvation contribution to the free energy of binding. Following Still's method [14], [15], G_pol,i is calculated for each atom and written to this mol2 file. The solvent accessible surface area (SASA) is also computed, and these data are used to derive the solvation energy of the protein. As soon as a complex is formed by FITTED, the same G_pol,i values can be reused, with the ligand effects added to each of them, to compute the solvation of the complex in a timely fashion.
II.3. SMART
SMART is the module used to prepare ligand structures in a modified MOL2 format for use by FITTED. It assigns GAFF atom types and describes the ligand's functional groups as a bit string. It can also optionally correct the bond order assignment in a molecule, assign MMFF charges [16] and prepare structures for use with ACE (Asymmetric Catalyst Evaluation) by assigning MM3 atom types.[17]
II.3.1. SMART input
The input to SMART should be either a standard MOL2-formatted file or a standard SD/MOL file containing the structure of the ligand to be docked. In either format, the input file can contain one or multiple molecules. Upon completion, SMART will output all diagnostic messages to a log file, and output the prepared ligands in FITTED MOL2 format, either as a single multi-MOL2 file or in multiple single-MOL2 files. Hydrogen atoms will not be added by the current version of SMART and are therefore required in the input files.
II.3.2. Partial atomic charges
FITTED requires partial atomic charges to be assigned on the ligand; SMART has the option to assign MMFF atomic partial charges.[16] Atomic charges can also be assigned by other methods with a third-party tool, in which case SMART can be instructed to preserve them. MMFF charges are accurate for a wide variety of organic molecules; a warning message will be output to the log file in case the assignment could be inaccurate due to missing parameters. Other charging schemes are currently being implemented and validated and will be released by Dec. 2008.
II.3.3. The bit string description
The bit string within SMART is created to allow the fine tuning of the library to be screened using FITTED. SMART determines the presence of functional groups in a ligand by analyzing the assigned atom types and/or connectivity. The groups recognized within the bit string are defined in Table 8 (see also Appendix E, Functional group definitions).
Table 8 – Recognized functional groups in SMART
Aromatic
Aldehyde
Ester
Lactone
Amide
Lactame
Acid
Nitrile
Imine
Nitro
Acceptor
Azide
Isocyanate
Acyl_Chloride
Sulphonamide
Carbamate
Ammonium
Oxime
Ketone
Boronate
Primary_Amine
Secondary_Amine
II.4. FITTED
FITTED is the tool performing most of the work. FITTED has many different options allowing the user to perform various tasks. In the following sections, the different options available within FITTED and a more detailed description of FITTED's inner workings are given.
II.4.1. FITTED modes
Several modes within FITTED allow it to solve various computational problems that may arise: docking, virtual screening, scoring and filtering. Additionally, it is possible to adjust a rough docking pose by performing a local optimization. To access any of these modes, use the Mode keyword with Dock, VS, Score, Filter, SAR or Local as the option.
If Dock is selected, FITTED will dock the molecule into the protein target. With the Dock option, FITTED ignores any filtering criteria present in the keyword file and sets larger default values for the number of trials allowed to find a suitable conformation (GI_Num_Of_Trials 10000, GA_Num_of_Trials 1000). If no suitable individual is found after GI/GA_Num_Of_Trials trials, the last individual is saved and a new search is started. When the Dock mode is set, Number_of_Runs is set to 3.
The VS mode is optimized for the fast screening of compounds. FITTED will use the filtering criteria (see Filtering parameters), skipping compounds which do not satisfy these conditions and proceeding to dock the suitable ones. A smaller default value for GI_Num_Of_Trials (2000) is used, which avoids spending extra time on molecules too large for the cavity. FITTED will also use the CutScore_1, Max_Gen_1, CutScore_2 and Max_Gen_2 keywords. If after Max_Gen_1 generations the score of one of the top three individuals is below CutScore_1, the simulation is allowed to proceed; an analogous check is performed after Max_Gen_2 generations with CutScore_2 as the threshold. These checks ensure that potential non-binders do not consume valuable CPU time. When the VS mode is set, Number_of_Runs is set to 1.
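The two checkpoints used in VS mode can be sketched as a simple early-termination test. This is an illustration of the described logic only: the function names are hypothetical, lower scores are assumed to be better, and the generation numbers and thresholds in the example are arbitrary values, not program defaults.

```python
def passes_checkpoint(scores, cut_score):
    """True if any of the three best (lowest) scores is below cut_score."""
    top_three = sorted(scores)[:3]
    return any(score < cut_score for score in top_three)


def should_continue(generation, scores, max_gen_1, cut_score_1,
                    max_gen_2, cut_score_2):
    """Decide whether a screening run proceeds past the current generation."""
    if generation == max_gen_1:
        return passes_checkpoint(scores, cut_score_1)
    if generation == max_gen_2:
        return passes_checkpoint(scores, cut_score_2)
    return True  # no checkpoint at this generation


population_scores = [-2.0, 5.0, 7.0, 9.0]
first = should_continue(50, population_scores, 50, 0.0, 100, -10.0)
second = should_continue(100, population_scores, 50, 0.0, 100, -10.0)
print(first, second)  # True False
```

Here the run passes the first checkpoint but is abandoned at the second, because no individual has reached the stricter threshold by then.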
In the Score mode, FITTED will only score the initial input structure against all the input proteins. By using this option in conjunction with the keyword Score_Initial with the option Minimize, it is possible not only to score the input structure, but also to optimize the input ligand structure by energy minimization and score the energy-minimized structure.
If Mode Filter is specified, FITTED checks whether the molecule passes the filtering criteria defined in the keyword file or the default filters. Diagnostic messages are output to the log file, stating whether the molecule is kept or which of the filters were not satisfied.
With the Local mode, it is possible to perform a local energy search on the ligand input structure. In this mode, the ligand is only slightly perturbed, with small changes from the initial structure being allowed.
With the SAR mode, it is possible to perform a structure-activity relationship study assuming similar binding modes for all the ligands. Thus, the torsions are fully searched, but translations and orientations are searched only locally (around the location of the initial structure).
II.4.2. Protein flexibility
FITTED is able to dock using various modes of protein flexibility, selected with the Flex_Type keyword. This keyword has four different options: rigid docking (Rigid), semiflexible docking (Semiflex), semiflexible docking with movable waters (Flex_Water) and fully flexible docking (Flex).
The number of protein files required for N proteins is (3N+1). For rigid docking, only one protein input structure is needed (4 files for one protein: XXXX_score.mol2, XXXX_dock.mol2, XXXX_site.txt, XXXX_IS.mol2). The XXXX_site.txt file is only used to determine the binding site. In this mode, only the ligand is considered flexible (translation, rotation and torsions). When running semiflexible or fully flexible docking within FITTED, more than one protein is needed (i.e., 3 files per protein, XXXX_score.mol2, XXXX_dock.mol2 and XXXX_IS.mol2, and a single XXXX_site.txt).
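The (3N+1) file count can be checked with a small sketch that lists the expected files for a set of proteins. The function name is hypothetical, and naming the single site file after the first protein is an assumption made only for this illustration; the protein names are made up.

```python
def required_protein_files(protein_names):
    """List the 3N+1 protein files expected for N prepared proteins."""
    files = []
    for name in protein_names:
        files.extend([
            f"{name}_score.mol2",
            f"{name}_dock.mol2",
            f"{name}_IS.mol2",
        ])
    # Assumption: the single site file carries the first protein's name.
    files.append(f"{protein_names[0]}_site.txt")
    return files


rigid_files = required_protein_files(["protA"])
flexible_files = required_protein_files(["protA", "protB"])
print(len(rigid_files), len(flexible_files))  # 4 7
```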
Semiflexible docking is similar to docking to conformational ensembles, with the protein conformations being allowed to exchange between complexes during the conformational search. The semiflexible mode with movable waters is identical to the semiflexible mode, except that the water molecules can also be exchanged independently of the proteins.
In the fully flexible docking mode, the entire system is considered flexible. The ligand (translation, rotation and torsions), the protein (backbone, binding site side chains) and the waters are allowed to have all genetic operators applied to them. The residues listed in the XXXX_flex.txt file will be considered as flexible.
II.4.3. Covalent docking
About 5% of the marketed drugs are covalent inhibitors. To address this under-investigated field of docking, we turned our attention to the development of an original method to dock covalent inhibitors in a fully automated fashion (unpublished results). This feature is currently being validated and should be used with care. The current version of FITTED is able to dock and virtually screen covalently bound and competitive ligands. FITTED can be instructed to do so by using the Covalent_Residue keyword. Following this keyword should be the name of the residue that will
form the covalent bond with the ligand (ex. SER451). Currently only serine and cysteine can form covalent bonds with the ligand (contact the developers for other residues). In these cases, the hydrogen from the residue is automatically transferred from the residue to the carbonyl or nitrile group of the ligand. In cases where the hydrogen is not transferred from the protein to the ligand but instead to another residue (the current version of the program does not allow any change in the number of atoms while docking), the residue must be specified using the Proton_Moved_To
keyword followed by the atom which the hydrogen will be transferred to.
With FITTED, only certain ligand functional groups have been setup to handle covalent bonds (aldehydes, ketones, nitriles and boronates). These groups are handled automatically with no further specification of the group name or location. If FITTED does not find any of these functional groups, it will dock the compound in a regular docking manner. To force only covalent poses the keyword Covalent_Ligand with the option Only can be used. During the creation of the initial
population and evolution only poses where a covalent bond is formed are kept while all other poses will be rejected. If this keyword is not selected FITTED will force a minimum of 10% of the initial population to be covalent, while during the evolution there is no restriction and the population is allowed to freely evolve towards covalent or non covalent binding modes.
To allow for covalent docking, FITTED uses a switching function similar to the one implemented for displaceable water molecules. If the functional group is within Cutdist_Cov angstroms of the serine oxygen or cysteine sulfur, a bond is made, the hybridization (and atom type) of the reacting functional group is modified (e.g., from c to c3 for a reacting aldehyde) and the proton is moved.
II.4.4. FITTED scoring functions
Within FITTED there are two scoring functions, one applied during the docking (DockScore) and another used for a final scoring (RankScore2).
DockScore is in fact a consensus score made up of 4 components. The first is PharmScore which is calculated for the ligand match to the pharmacophore (see equation (1)). The MatchScore is calculated in an analogous manner, but considering the match of the interaction sites instead.
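\[
\mathrm{PharmScore} = 100 \times \frac{\sum_{i}^{n} \mathrm{sphere}_i}{\sum_{i}^{n} w_i},
\qquad
\mathrm{sphere}_i =
\begin{cases}
w_i, & \text{if the sphere matches} \\
0, & \text{otherwise}
\end{cases}
\tag{1}
\]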
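As a minimal sketch of the PharmScore described above, the function below assumes the score is the weighted percentage of pharmacophore spheres matched by a pose. The function name, the argument layout and the handling of an empty pharmacophore are illustrative assumptions, not part of FITTED.

```python
def pharm_score(weights, matched):
    """Weighted percentage of pharmacophore spheres matched by a pose.

    weights: weight w_i of each pharmacophore sphere (hypothetical input layout)
    matched: True/False flag per sphere, True if the pose matches that sphere
    """
    if len(weights) != len(matched):
        raise ValueError("weights and matched must have the same length")
    total = sum(weights)
    if total == 0:
        # Assumption: an empty or zero-weight pharmacophore scores 0.
        return 0.0
    # A matched sphere contributes its weight, an unmatched one contributes 0.
    hit = sum(w for w, m in zip(weights, matched) if m)
    return 100.0 * hit / total
```

For example, matching two spheres of weight 1.0 and 2.0 out of a total weight of 4.0 gives a PharmScore of 75.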
Next, a ClashScore is calculated with the active site cavity file generated by ProCESS. If an atom does not match one of the spheres of the cavity, the pose is rejected and a new one is generated. Finally, a GAFFScore based on the GAFF force field [18] is computed. The following are the GAFFScore weights. Note that a lower (more negative) value of GAFFScore is better.
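\[
\begin{aligned}
\mathrm{GAFFScore} ={}& E_{\mathrm{internal}} + 0.5\,E_{\mathrm{vdW}}^{1\text{-}4} + E_{\mathrm{vdW}}^{1\text{-}5} + 2.0\,E_{\mathrm{vdW}}^{\mathrm{protein}} \\
&+ 0.25\,E_{\mathrm{coulombic}}^{1\text{-}4} + 0.5\,E_{\mathrm{coulombic}}^{1\text{-}5} \\
&+ E_{\mathrm{coulombic}}^{\mathrm{protein}} + E_{\mathrm{LossWater}}
\end{aligned}
\tag{2}
\]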
The weights can be modified by using the keywords found in the 0 Additional keywords for FITTED. Since proteins are large objects and the non-bonded energy contribution is nearly 0 at long distances, FITTED is equipped with a switching function to effectively turn off the long range protein ligand non-bond terms. This function is similar to the one implemented in CHARMM.[13] The cut-off and switching distances can be modified using the keywords Cutdist and Switchdist.
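The manual does not give the functional form of this switching function. The sketch below uses the classic CHARMM-style switch on the squared distance as a stand-in, and assumes that Switchdist plays the role of the distance where switching starts (r_on) and Cutdist the distance where the interaction vanishes (r_off). Both the formula and this mapping are assumptions.

```python
def switch(r, r_on, r_off):
    """CHARMM-style switching factor: 1 below r_on, 0 above r_off, smooth in between.

    r_on and r_off are assumed to correspond to Switchdist and Cutdist.
    """
    if r <= r_on:
        return 1.0
    if r >= r_off:
        return 0.0
    numerator = (r_off**2 - r**2) ** 2 * (r_off**2 + 2 * r**2 - 3 * r_on**2)
    return numerator / (r_off**2 - r_on**2) ** 3


def switched_energy(energy, r, r_on, r_off):
    """Scale a pairwise non-bonded energy by the switching factor."""
    return energy * switch(r, r_on, r_off)
```

Because the factor reaches exactly 0 at r_off, protein-ligand atom pairs beyond the cut-off can be skipped without introducing a discontinuity in the score.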
FITTED allows for displaceable water molecules. All water molecules within the protein structure are considered displaceable, which means they can be either present or removed, depending on which situation is better during the docking. Within FITTED this is achieved by using a switching function which effectively removes the water if the ligand is too close to it. The distance between the ligand and the water that is considered too close can be changed by using Cutdist_Wat and Switchdist_Wat.
RankScore2 (Equation 3) is a force-field based scoring function, with the addition of terms to account for solvation and entropic effects.
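\[
\mathrm{RankScore2} = f \times \left( E_{\mathrm{vdw}} + E_{\mathrm{coul}} + E_{\mathrm{HB/metal}} \right) + E_{\mathrm{strain}} + E_{\mathrm{solv}} + E_{S}
\tag{3}
\]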
Each energy term is weighted by a coefficient that can be modified by using the keywords specified in section II.10, Energy parameters. The first 3 energy terms (Evdw, Ecoul and EHB/metal) correspond to the intermolecular interactions calculated with the GAFF force field. The factor f scales these terms to account for the reduced free energy of protein-ligand interaction experienced by flexible residues. The Estrain term accounts for the conformational strain of the ligand pose. It is calculated from the GAFF internal energies of the ligand when bound and unbound. Esolv includes an estimation of the energy of desolvation calculated with the GB/SA approach. The atomic contributions to GB are pre-computed by ProCESS and stored in the XXXX_score.mol2 file.
Following Still's method [14],[15], the polar contribution to the solvation energy is computed. A surface area is also derived from the ligand, protein and complex and used to compute the non-polar contribution. The solvation calculation remains time-consuming and can be turned off by setting Solvation to off. Finally, the ES term (Equation 4) represents the penalty for torsional entropy loss upon binding.
\[
E_{S} = 0.15 \sum_{\mathrm{rot}} G(\mathrm{pol}) \sum_{i}^{\mathrm{atoms}} \frac{N_i}{N_0}
\tag{4}
\]
Essentially, it involves counting the number of rotatable bonds (acyclic, non-terminal bonds between sp3-hybridized atoms), weighted by the polarity of the bond (G(pol)) and by how buried the bond is (estimated by the relative number of atoms around each atom of the bond, Ni/N0). The rationale is that the more polar a bond, the more frozen it will be in a binding pocket; conversely, the more solvent-exposed a bond is, the freer it is to move.
II.4.5. Genetic algorithm
A genetic algorithm (see Figure 7a) is a global optimization technique. In the present case, it is used as a conformational search tool, allowing the flexibility of complex systems to be sampled. The first step is the creation of an initial population of poses with many different conformations, also known as individuals. Each conformation can be described by a chromosome made up of genes; in this case each gene contains the value of a molecular descriptor such as a torsional angle, the position or the orientation (see Figure 6). Once the initial population is created, the population evolves. Within this loop, parents (a pair of individuals in the population) are selected and coupled together using genetic operators, creating new conformations called children. With the children produced, the population is trimmed (some parents and children are removed) and another reproduction cycle is started. This loop is continued until the population converges. This is the basis for the FITTED genetic algorithm. Modifications to the general scheme, along with more in-depth definitions of each element of the genetic algorithm, are discussed in the following sections.
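The general loop described above can be sketched as follows. This is a generic illustration of the create, reproduce, trim and test-for-convergence cycle, not the FITTED implementation: the function names, the callables passed in, the trimming rule (keep the best pop_size individuals) and the convergence tolerance are all assumptions.

```python
import random


def run_ga(new_individual, reproduce, score, pop_size, max_gen, seed=0):
    """Generic genetic-algorithm loop; lower scores are better.

    new_individual(rng) -> individual
    reproduce(mother, father, rng) -> list of children
    score(individual) -> float
    """
    rng = random.Random(seed)
    # Step 1: create the initial population of individuals.
    population = [new_individual(rng) for _ in range(pop_size)]
    for _generation in range(max_gen):
        # Step 2: pair parents and create children with the genetic operators.
        children = []
        while len(children) < pop_size:
            mother, father = rng.sample(population, 2)
            children.extend(reproduce(mother, father, rng))
        # Step 3: trim parents and children back to the population size.
        population = sorted(population + children, key=score)[:pop_size]
        # Step 4: stop once best and worst scores are (almost) identical.
        if score(population[-1]) - score(population[0]) < 1e-6:
            break
    return population
```

Passing the operators in as callables keeps the loop independent of how a pose is encoded, which mirrors the separation between the general scheme and the FITTED-specific elements discussed next.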
II.4.5.1. The chromosomes
Figure 6 - FITTED chromosome
For evolution to occur on a population of conformations, each individual must have a chromosome. Flex_Type determines the makeup of these chromosomes (Figure 6).
The chromosome contains many different genes. With Flex_Type Rigid, the chromosome has a separate gene for each flexible torsion and one gene each for the translation and the orientation. For semiflexible docking (Semiflex), the protein conformation gene is added to the chromosome. For semiflexible docking with flexible waters (Flex_water), the position of each individual water is added to the chromosome. Finally, for fully flexible docking (Flex), the conformation of each individual binding site residue is added.
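The chromosome makeup for each Flex_Type can be illustrated with a small helper that lists the genes. The gene labels are hypothetical, and the sketch assumes that each mode adds its genes on top of those of the previous mode, which the text suggests but does not state explicitly.

```python
def gene_layout(flex_type, n_torsions, n_waters=0, n_residues=0):
    """Return illustrative gene labels for a chromosome of the given Flex_Type."""
    # Rigid: one gene per flexible torsion, plus translation and orientation.
    genes = [f"torsion_{i}" for i in range(n_torsions)]
    genes += ["translation", "orientation"]
    if flex_type == "Rigid":
        return genes
    # Semiflex: adds the gene selecting the protein conformation.
    genes.append("protein_conformation")
    if flex_type == "Semiflex":
        return genes
    # Flex_water: adds one position gene per water molecule.
    genes += [f"water_{i}" for i in range(n_waters)]
    if flex_type == "Flex_water":
        return genes
    # Flex: adds one conformation gene per binding site residue.
    if flex_type == "Flex":
        return genes + [f"residue_{i}" for i in range(n_residues)]
    raise ValueError(f"unknown Flex_Type: {flex_type}")
```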
II.4.5.2. Generation of a high quality (already evolved) population
FITTED creates the initial population (see Figure 7b) by generating many ligand conformations, randomizing the torsions, orientation and position of the ligand. The keyword Seed is used to initialize the random number generator; setting it to 0 makes FITTED select a random seed. The allowed values of these genes (torsional angle values) can be restricted by using keywords. The torsions are rotated in discrete steps controlled by the Resolution keyword (360°/Resolution); this increases the probability that the conformation generated is energetically stable. FITTED uses a corner flap approach to search ring conformations, which can be turned on by using Corner_Flap on. A three-point matching algorithm, which uses the XXXX_IS.mol2 file, orients the ligand when generating the initial population. If a Pharmacophore is used, each triangle formed by three points has at least one of its points from the pharmacophore. If only XXXX_IS.mol2 is used, it is possible to create triangles only with the highest weighted interaction sites (see section II.2.5, Interaction Sites/Pharmacophore generation). The Num_of_Top_IS parameter allows the user to specify that only triangles with one of the top Num_of_Top_IS points should be created (default = 10). To switch off the matching algorithm, the keyword Matching_Algorithm should be set to off. The matching algorithm is automatically turned off when doing either a Local or SAR search or a covalent docking run. With the matching algorithm turned off, the maximum translation performed in one generation can be set with the Max_Tx, Max_Ty and Max_Tz keywords; the center of the ligand will be translated by a random amount up to these values. When Mode is set to Dock the default for these values is 5 Å; when set to SAR or Local the default is 0.2 Å. The ligand pose will be rigidly rotated by any value between 0° and 360° if using the Dock mode. During a local search or in SAR mode, the ligand will only be rotated within +/- Max_Rxy, Max_Ryz and Max_Rxz, which are set to 2° by default. If any of the flexible docking modes is used, the population will be increased to ensure an even representation of the protein conformations throughout the population. The size of the population is controlled with the Pop_Size keyword. Pop_Size will be modified so that it is divisible by 2 and by the number of proteins. For example, if a Pop_Size of 100 is selected and 3 proteins are used, the Pop_Size will be increased to 102.
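The Pop_Size adjustment can be reproduced with a few lines of arithmetic. The sketch assumes that the requested size is rounded up to the next multiple of the least common multiple of 2 and the number of protein structures, which is consistent with the 100 to 102 example; the function name is hypothetical.

```python
import math


def adjust_pop_size(pop_size, n_proteins):
    """Round pop_size up so it is divisible by 2 and by n_proteins."""
    # Least common multiple of 2 and the number of protein conformations.
    step = 2 * n_proteins // math.gcd(2, n_proteins)
    # Ceiling division, then scale back up to a multiple of step.
    return -(-pop_size // step) * step
```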
Once a pose is generated, it is first passed through the PharmScore filter. The filter uses the Pharmacophore file. For more information on how the file is created and how the PharmScore is calculated, see sections II.2.5: Interaction Sites/Pharmacophore generation and II.4.4: FITTED scoring functions. If the pose does not meet the Min_PharmScore, a new conformation is generated. If the conformation passes, it then goes through the MatchScore filter.
[Figure 7 flowcharts. (a) Generation of initial population -> Reproduction -> Selection of next generation -> Is population converged? (No: back to Reproduction; Yes: Exit). (b) Generate conformation -> PharmScore -> MatchScore -> ClashScore -> GAFFScore -> Minimize -> GAFFScore -> Save in population -> Calculate new Min_MatchScore -> Pop_Size reached? (No: generate another conformation; Yes: Reproduction); a pose that fails a filter is discarded and a new conformation is generated.]
Figure 7 - (a) Genetic algorithm; (b) Generation of initial population
The conformation must meet the value of Min_MatchScore, otherwise a new conformation is generated. When using Mode Dock, the initial Min_MatchScore is optimized on the fly. If, after 500 trials at generating a conformation, none is found that passes the MatchScore filter, the Min_MatchScore is reduced by 0.5%. Once an individual is saved into the initial population, Min_MatchScore is reoptimized with equation (5), using all the conformations generated to calculate the average MatchScore.
In equation (5), Stringent_MS tunes how stringent the Min_MatchScore should be. By default this value is 5.0 for rigid docking and 4.0 for flexible docking. The larger the value of Stringent_MS, the more stringent the automatic calculation of Min_MatchScore becomes.
ClashScore is calculated from the binding site cavity file. If the ligand passes the ClashScore filter, a GAFFScore is calculated. If the GAFFScore is below GI_Initial_E, the structure is minimized and a new GAFFScore is calculated. If the latter is below GI_Minimized_E, the conformation is saved into the population. Once Pop_Size has been reached, FITTED proceeds with the evolution of the population.
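The cascade of filters used to build the initial population can be sketched as below. The scoring and minimization routines are passed in as hypothetical callables, the 500-trial rule is simplified to 500 consecutive MatchScore failures, and the re-optimization of Min_MatchScore with equation (5) is left out, so this is an outline of the control flow under those assumptions rather than the FITTED code.

```python
def build_initial_population(generate, pharm, match, clashes, gaff, minimize,
                             pop_size, min_pharm, min_match,
                             gi_initial_e, gi_minimized_e, max_trials=100000):
    """Fill a population with poses that pass every filter (lower GAFFScore is better)."""
    population = []
    failed_match = 0
    for _ in range(max_trials):
        if len(population) >= pop_size:
            break
        pose = generate()
        if pharm(pose) < min_pharm:          # PharmScore filter
            continue
        if match(pose) < min_match:          # MatchScore filter
            failed_match += 1
            if failed_match >= 500:
                min_match *= 0.995           # relax the threshold by 0.5%
                failed_match = 0
            continue
        failed_match = 0
        if clashes(pose):                    # ClashScore filter
            continue
        if gaff(pose) >= gi_initial_e:       # GAFFScore before minimization
            continue
        pose = minimize(pose)
        if gaff(pose) < gi_minimized_e:      # GAFFScore after minimization
            population.append(pose)
    return population, min_match
```

Ordering the filters from cheapest to most expensive means that the energy minimization is only spent on poses that already fit the pharmacophore, the interaction sites and the cavity.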
II.4.5.3. Evolution
Once a population is created, it proceeds through reproduction (see Figure 8). FITTED uses a Lamarckian evolutionary algorithm where the population learns during the evolution. A conformation learns by being energy-minimized; in this way, the child conformation does not descend directly from its parents anymore. At the beginning of each evolutionary iteration a percentage of the population is further optimized by energy minimization. The number of individuals energy-minimized is selected by using the pLearn keyword. The larger the value, the longer the docking run will be. However, optimizing a small fraction speeds up the convergence of the docking runs.
Once this learning stage is over, 2 individuals of the population are randomly chosen to act as parents. Parents that have not been used in previous generations are given a higher priority to be selected.
\[
\mathrm{Min\_MatchScore} = \mathrm{Stringent\_MS} \times \frac{\mathrm{MatchScore}_{\mathrm{Average}} + \mathrm{MatchScore}_{\mathrm{Maximum}}}{10.0}
\tag{5}
\]
[Figure 8 flowchart. Within Reproduction: pLearn -> Select parents -> pCross -> pMut -> PharmScore -> MatchScore -> ClashScore -> GAFFScore -> Minimize -> GAFFScore -> Save children -> Has Pop_Size been reached? (No: select new parents; Yes: Selection of the next generation).]
Figure 8 - Reproduction
At this point the parents' chromosomes are crossed over to create children chromosomes defining new poses, followed by the application of mutation(s). The probabilities of crossover and mutation are controlled by the keywords pCross and pMut; using their default values is highly recommended. Crossover is the exchange of genes between the chromosomes of the parents to create new individuals. Mutation is the random generation of a new value for a gene within the children. FITTED also allows the probability of mutation to be increased independently for the orientation in space of the ligand and for water exchange by using pMutRot and pMutWat. Larger values for pMutRot and pMutWat are needed since many of the initial values may be eliminated during the early stages of the evolution. During the evolution, orientations often need only a small refinement. Accordingly, the extent of the mutation of the ligand orientation can also be restricted by using Max_Rxy, Max_Rxz and Max_Ryz. The water mutation probability increases exponentially from 0% to pMutWat at Max_Gen. Once a new individual is made, it passes through the same filters as for the initial population and may eventually be energy-minimized.
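The two genetic operators can be sketched on chromosomes stored as plain lists of genes. The manual does not say which crossover scheme FITTED uses, so a uniform gene-by-gene exchange is assumed here; the separate pMutRot and pMutWat probabilities are not modelled, and new_gene is a hypothetical callback that draws a fresh value for gene i.

```python
import random


def reproduce(parent_a, parent_b, p_cross, p_mut, new_gene, rng):
    """Create two children by crossover (probability p_cross) and mutation (p_mut per gene)."""
    child_a, child_b = list(parent_a), list(parent_b)
    # Crossover: exchange genes between the two chromosomes.
    if rng.random() < p_cross:
        for i in range(len(child_a)):
            if rng.random() < 0.5:
                child_a[i], child_b[i] = child_b[i], child_a[i]
    # Mutation: replace a gene with a newly generated random value.
    for child in (child_a, child_b):
        for i in range(len(child)):
            if rng.random() < p_mut:
                child[i] = new_gene(i, rng)
    return child_a, child_b
```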
II.4.5.4. Selection of the next generation
Once a new child population is created, FITTED allows for multiple ways to select the individuals of the next generation (see Figure 9). By default the Evolution keyword is set to Steady_State, where the best two individuals out of the two parents and two children are kept and passed to the new population. A variation of Steady_State is the Metropolis option: in the case where one or more of the parents would pass to the next generation under the steady-state mode, the Metropolis criterion is used to decide whether the children should be kept instead. The Metropolis option should be used when population statistics are to be studied. Once the individuals to be passed to the next generation are known, the pElite genetic operator is applied. A number of individuals, determined by pElite, are copied, a local optimization is applied, and the corresponding poses replace the worst individuals of this new population. To reduce the risk of early convergence caused by copying only the best individuals, pElite selects a random individual out of the top pElite_SSize. This process can occur every generation, or at different intervals set by pElite_Every_X_Gen. This option prevents premature convergence and is highly recommended. This new population is then passed on to the next generation.
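The two selection modes can be illustrated as follows. The steady-state rule follows the description above directly; the Metropolis test uses the standard acceptance probability exp(-ΔE/kT), and the temperature factor kt is an assumed parameter that the manual does not name.

```python
import math
import random


def steady_state(parents, children, score):
    """Keep the best two individuals out of two parents and two children."""
    return sorted(list(parents) + list(children), key=score)[:2]


def metropolis_accept(e_parent, e_child, kt, rng):
    """Standard Metropolis criterion: always accept a better child,
    accept a worse one with probability exp(-(E_child - E_parent) / kT)."""
    if e_child <= e_parent:
        return True
    return rng.random() < math.exp(-(e_child - e_parent) / kt)
```

Accepting some worse children keeps the population closer to a thermal distribution, which is presumably why this mode is suggested when population statistics are of interest.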
[Figure 9 flowchart. Reproduction -> Steady State or Metropolis selection -> Is pElite_Every_X_Gen criterion met? (Yes: apply pElite) -> Is population converged? (No: back to Reproduction; Yes: Exit).]
Figure 9 - Selection of the next generation
II.4.5.5. Convergence criteria
[Figure 10 flowchart. Decision boxes within "Is population converged?": Has Max_Gen been reached and Max_Gen != Max_Gen_2?; Is one of top 3 individuals below CutScore_X? (Yes: increase Max_Gen to Max_Gen_X); Is Best - Average < Diff_Avg_Best?; Is Best - Diff_Number < Diff_N_Best?. Outcomes lead either back to Reproduction or to Exit.]
Figure 10 - Convergence criteria
During virtual screening it has been noted that a good-scoring ligand typically reaches a good score within a few generations. Therefore, to reduce the time spent on ligands that are unlikely to be good binders, ligands that do not have a GAFFScore lower (better) than CutScore_1 at Max_Gen are filtered out. If the score is below CutScore_1, then Max_Gen is increased to Max_Gen_1. FITTED also has a second filter of this type, which is enabled with the CutScore_2 and Max_Gen_2 keywords.
FITTED has two energetic convergence criteria for the genetic algorithm. One monitors the difference between the best individual GAFFScore and the average GAFFScore of the population (Diff_Avg_Best), while the other uses the difference between the GAFFScore of the pose ranked Diff_Number and the best GAFFScore (Diff_N_Best).
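A possible reading of these criteria is sketched below. The two convergence tests are returned separately because the manual does not state how they are combined; the function names are hypothetical, and the use of the three best poses in the CutScore test follows the wording of Figure 10.

```python
def convergence_flags(scores, diff_avg_best, diff_number, diff_n_best):
    """Evaluate both energetic criteria on a list of GAFFScores (lower is better).

    Returns (average_criterion_met, nth_pose_criterion_met).
    """
    ranked = sorted(scores)
    best = ranked[0]
    average = sum(ranked) / len(ranked)
    # Score of the pose ranked diff_number (1-based), capped at the population size.
    nth = ranked[min(diff_number, len(ranked)) - 1]
    return (average - best) < diff_avg_best, (nth - best) < diff_n_best


def passes_cut_score(scores, cut_score, top=3):
    """True if one of the top-ranked poses is below (better than) cut_score."""
    return any(s < cut_score for s in sorted(scores)[:top])
```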
II.5. References
[1] Moitessier, N.; Englebienne, P.; Lee, D.; Lawandi, J.; Corbeil, C. Towards the development of universal, fast and highly accurate docking/scoring methods: A long way to go. Brit. J. Pharmacol. 2008, 153, (SUPPL. 1).
[2] Corbeil, C. R.; Englebienne, P.; Moitessier, N. Docking Ligands into Flexible and Solvated Macromolecules. 1. Development and Validation of FITTED 1.0. J. Chem. Inf. Model. 2007, 47, 435-449.
[3] Corbeil, C. R.; Englebienne, P.; Yannopoulos, C. G.; Chan, L.; Das, S. K.; Bilimoria, D.; Heureux, L.; Moitessier, N. Docking Ligands into Flexible and Solvated Macromolecules. 2. Development and Application of FITTED 1.5 to the Virtual Screening of Potential HCV Polymerase Inhibitors. J. Chem. Inf. Model. 2008, 48, 902-909.
[4] Corbeil, C. R.; Moitessier, N. Docking Ligands into Flexible and Solvated Macromolecules. 3. Impact of Input Ligand Conformation, Protein Flexibility and Water Molecules on Accuracy of Major Docking Programs. J. Chem. Inf. Model. Submitted.
[5] Moitessier, N.; Westhof, E.; Hanessian, S. Docking of Aminoglycosides to Hydrated and Flexible RNA. J. Med. Chem. 2006, 49, 1023-1033.
[6] Moitessier, N.; Therrien, E.; Hanessian, S. Method for Induced-Fit Docking, Scoring, and Ranking of Flexible Ligands. Application to Peptidic and Pseudopeptidic β-Secretase (BACE-1) Inhibitors. J. Med. Chem. 2006, 49, 5885-5894.
[7] Moitessier, N.; Henry, C.; Maigret, B.; Chapleur, Y. Combining Pharmacophore Search, Automated Docking, and Molecular Dynamics Simulations as a Novel Strategy for Flexible Docking. Proof of Concept: Docking of Arginine-Glycine-Aspartic Acid-like Compounds into the αvβ3 Binding Site. J. Med. Chem. 2004, 47, 4178-4187.
[8] Englebienne, P.; Fiaux, H.; Kuntz, D.; Corbeil, C. R.; Rose, D.; Gerber-Lemaire, S.; Moitessier, N. Evaluation of Docking Programs for Predicting Binding of Golgi alpha-Mannosidase II Inhibitors: A Comparison with Crystallography. Proteins: Struct., Funct., Bioinf. 2007, 69, 160-176.
[9] Fay, A.; Corbeil, C. R.; Moitessier, N.; Bowie, D. Ligand Docking and Electrophysiological Analysis of Full and Partial Agonists at Kainate Receptors. To be submitted.
[10] Englebienne, P.; Moitessier, N. Docking Ligands into Flexible and Solvated Macromolecules. 4. Are Popular Scoring Functions Accurate for This Class of Proteins? Manuscript in preparation.
[11] Englebienne, P.; Corbeil, C. R.; Moitessier, N. Docking Ligands into Flexible and Solvated Macromolecules. 5. Force Field-Based Prediction of Binding Affinities of Ligands To Proteins and Development of RankScore2. Manuscript in preparation.
[12] Schwartzentruber, J.; Lawandi, J.; Corbeil, C. R.; Moitessier, N. Unpublished results.
[13] Brooks, B. R.; Bruccoleri, R. E.; Olafson, B. D.; States, D. J.; Swaminathan, S.; Karplus, M. CHARMM: A Program for Macromolecular Energy, Minimization and Dynamics Calculations. J. Comp. Chem. 1983, 4, 187-217.
[14] Still, W. C.; Tempczyk, A.; Hawley, R. C.; Hendrickson, T. Semianalytical treatment of solvation for molecular mechanics and dynamics. J. Am. Chem. Soc. 1990, 112, 6127-6129.
[15] Qiu, D.; Shenkin, P. S.; Hollinger, F. P.; Still, W. C. The GB/SA Continuum Model for Solvation. A Fast Analytical Method for the Calculation of Approximate Born Radii. J. Phys. Chem. A 1997, 101, 3005-3014.
[16] Halgren, T. A. Merck molecular force field. I. Basis, form, scope, parameterization, and performance of MMFF94. J. Comp. Chem. 1996, 17, 490-519, and following papers.
[17] Corbeil, C. R.; Thielges, S.; Schwartzentruber, J. A.; Moitessier, N. Toward a Computational Tool Predicting the Stereochemical Outcome of Asymmetric Reactions: Development and Application of a Rapid and Accurate Program Based on Organic Principles. Angew. Chem., Int. Ed. 2008, 47, 2635-2638.
[18] Wang, J.; Wolf, R. M.; Caldwell, J. W.; Kollman, P. A.; Case, D. A. Development and testing of a general Amber force field. J. Comp. Chem. 2004, 25, 1157-1174.
III. Installation
III.1. The FITTED, ProCESS and SMART folders
To install the suite of programs, simply copy the FITTED2.6 folder to the location where you want to use it. No other manipulations are necessary. This folder includes three sub-directories, one for each of the three modules (FITTED, ProCESS and SMART). Within these three folders are other subfolders such as input, output and keyword. The names of these folders cannot be changed. The figure below describes the tree of folders necessary for each program to work properly.
Figure 11 - Directory structure for FITTED, ProCESS and SMART
ProCESS
    input
        • Ligand.mol2
        • Protein.mol2
    keyword
        • KeywordFile.txt
    output
        • Output.out
        • Protein_dock.mol2
        • Protein_score.mol2
        • Protein_site.txt
        • Interaction_Sites.mol2
        • Binding_Site_Cavity.mol2
SMART
    input
        • Ligand.mol2
    forcefield
        • Fitted_ff.txt
    output
        • Output.out
        • Ligand.mol2
FITTED
    forcefield
        • fitted_ff.txt
    input
        • Ligand.mol2
        • Protein_dock.mol2
        • Protein_score.mol2
        • Protein_site.txt
        • Interaction_Sites.mol2
        • Protein_IS.mol2
        • Pharmacophore.mol2
        • Binding_Site_Cavity.mol2
    keyword
        • KeywordFile.txt
    output
        • Output.out
        • logFile.log
        structure
            • Ligand.mol2
            • Protein.mol2
IV. Getting started with FITTED
This section describes how to prepare and start a docking run with the FITTED suite of programs. All the examples and the corresponding files can be found in the examples/ folder.
IV.1. Setting up the system
In general, the structure files downloaded from the Protein Databank (PDB) require some preparation to ensure optimal results with any docking program. For instance, the protonation state of some residues may be critical to the binding of a ligand, hence to the observed enzymatic activity. The accuracy of the docking obtained with the FITTED suite of programs therefore relies on a careful preparation of the input files. This preparation requires the use of other programs with a graphical user interface such as Sybyl, Maestro or Insight II. The following sections give general details on what needs to be done to the protein and ligand structure files before the FITTED suite can be used for optimal results.
When the ligand and the protein remain as one file, they will be referred to as the complex from now on. The complex may also include ions (e.g., metals), water molecules and co-factors. X-ray crystal structures downloaded from the PDB most likely do not contain hydrogen atoms. Hydrogens on the ligands are needed for FITTED since the ligand is treated via an all-atom representation while the hydrogens on the protein are required to assign advanced residue names according to the protonation states, and to compute the solvation parameters. Once hydrogens are added, their orientation should be optimized, for instance performing an energy minimization with the heavy atoms fixed.
One of the advantages of FITTED is its ability to have mobile and displaceable water molecules. However, this feature requires the proper setup of waters within the complex. Only water molecules which are perceived as key for the binding of ligands should be kept, while all others should be removed. Waters are perceived as critical if they interact with both the protein and the ligand (bridging interactions) and are not exposed to the aqueous medium. If the number of key water molecules varies with the protein structure, copy the location of the missing waters from the other structure. During the docking run FITTED will displace them if necessary.
At this point, the complex(es) can be split into the corresponding protein and ligand structure file(s). The protein file(s) should include the water molecules, ions and co-factors, if any. These files (ligand and protein) should be saved in mol2 format, available within most of the interfaces. If running a rigid docking (Mode Rigid), the protein file is ready to be submitted to ProCESS. If Mode is Semiflex, Flex or Flex_Water, additional steps may be required to ensure that all protein files are identical (i.e., same number of protein atoms, same number of water molecules, same residue names). If discrepancies are found, ProCESS will exit with an error message (see 0 ProCESS). If some crystal structures have more water molecules than others, waters can be copied from the protein structures that have similar conformations. The following section lists some of the common 'errors' in PDB files which need to be corrected.
o In all cases: if the 'error' appears close to the binding site, the protein structure should not be considered for the study.
o Mutated residues: if the mutation is far from the binding site (at least 10 Å from the binding site), the residue can be virtually mutated to the desired residue.
o Incomplete residues: in some cases, parts of very flexible residues are not observed and are not included in PDB files. Again, if they are far from the active site, they can be virtually reconstructed.
o Missing residues: if they are far from the active site, they can be (i) added where missing or (ii) removed from the other files.
o Terminal residues: in some cases, terminal residues are not properly described in the PDB or are mishandled by the program used to set up the protein (e.g., terminal COO- groups are CHO). In this case, the missing atoms should be added.
o Missing waters: all protein files should have the same number of waters. If a water molecule is missing, one can be virtually added from another protein file. FITTED will remove the water if it is not needed.
o Missing atoms: if an atom is actually missing and is far from the active site, it can be added. If the atom is there, make sure the atom name and atom type are the same in all files. The atom may be in a different part of the protein file; if this is the case, renumber the atom within your graphical interface and regenerate the protein input file for ProCESS.
o Nucleic acids: the 5'-terminus should have a 5'-OH, and not a phosphate group. If necessary, remove the phosphate group and protonate the 5' oxygen. The residue names need to be corrected before attempting to run ProCESS, but after adding hydrogens to the system. For this, a pair of scripts (fix_dna.awk and fix_rna.awk) is provided in the process/scripts/ directory. These scripts rename the residues according to the names in Table 1b:
fix_rna.awk term5=<5'term> term3=<3'term> file.mol2 > file_new.mol2
<5'term> and <3'term> denote the residue numbers (column 7 in the MOL2 file) of the 5'- and 3'-terminal residues, respectively.
Before the ligand is run with SMART, partial charges need to be properly assigned (Gasteiger-Hückel charges have been validated, but others should work as well) and the file saved in mol2 format.
IV.2. Running the FITTED suite
All three modules work under Windows and Linux. Both versions can be used from a terminal window, and the Windows executables can also be started by double-clicking on their icons. The commands given below are to be entered in a terminal.
To run ProCESS, place all protein files in the ProCESS input directory and create an appropriate keyword file (example keyword files are 1e2k_rigid.txt and tk.txt). If flexibility is desired, make sure the ProCESS keyword file lists multiple protein structure files (see tk.txt). Go to the process/ directory and run ProCESS by typing:
./process <keyword filename>
ProCESS will create all the files (XXXX_dock.mol2, XXXX_score.mol2, XXXX_site.txt, cavity.mol2, interaction_sites.mol2, XXXX_IS.mol2) in the output/ directory within minutes, as well as XXXX.out, which will include information about the calculations and errors. If the same keyword file is run again, all the files will be overwritten. Copy all files except the XXXX.out file to the input/ directory of FITTED.
To run SMART (example 1e2k_ligand.mol2), place the ligand file previously prepared (section IV.1) in the SMART input/ directory. Go to the smart/ directory and run SMART by typing:
./smart 1e2k_ligand.mol2
All files will be written to the output/ directory. Copy the 1e2k_ligand_1.mol2 file to the input/ directory of FITTED.
To run FITTED, make sure all the input files (ligand, protein(s), active site cavity, and interaction site files) are in the FITTED input/ directory. Create the appropriate keyword file (examples for rigid and flexible docking are included as 1e2k_rigid.txt and 1e2k_flex.txt) and place it in the keyword/ directory. To run FITTED, type:
./fitted <keyword_filename>
All results will be put into a .out file (name specified in the keyword file) and all errors/warnings in a .log file (same name as the output file). If structures are outputted (“printed”), these files will be created in the structure directory within the output directory.
If running more than one file sequentially as in virtual screening runs, scripts can be used to create keyword files, extract data and run FITTED.
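As an illustration of such a script, the sketch below writes one keyword file per ligand into the keyword/ directory and returns the corresponding command lines. It only uses keywords described in section V.1 and should be extended with whatever else a given run requires; the helper names, the one-protein setup and the choice of naming the output after the ligand are assumptions.

```python
from pathlib import Path


def keyword_text(protein_name, ligand_file, output_name):
    """Build the text of a minimal keyword file for one ligand."""
    lines = [
        "# generated for one ligand of a screening run",
        "Protein_Conformations 1",
        protein_name,                 # no _dock.mol2 extension, FITTED adds it
        f"Ligand {ligand_file}",
        f"Output {output_name}",
    ]
    return "\n".join(lines) + "\n"


def prepare_screen(fitted_dir, protein_name, ligand_files):
    """Write keyword files under fitted_dir/keyword and return the commands to run."""
    keyword_dir = Path(fitted_dir) / "keyword"
    keyword_dir.mkdir(parents=True, exist_ok=True)
    commands = []
    for ligand_file in ligand_files:
        stem = Path(ligand_file).stem
        keyword_name = f"{stem}.txt"
        (keyword_dir / keyword_name).write_text(
            keyword_text(protein_name, ligand_file, stem))
        commands.append(["./fitted", keyword_name])
    return commands
```

Each returned command can then be executed from the fitted/ directory, for instance with subprocess.run(command, cwd=fitted_dir).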
V. Preparing a keyword file for FITTED
The following sections list some of the keywords (those that are most frequently changed; for a complete list see 0), their functions and default values. Gray shading indicates a required keyword; angle brackets <> indicate a numeric value; plain text indicates a text string (such as a file name); square brackets [choice1|choice2] indicate a choice of values, with the default shown in italics.
Note that keyword files are case-sensitive. Empty lines are allowed, and text after a pound sign (#) is considered a comment.
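These syntax rules can be illustrated with a small reader that drops comments and empty lines and keeps the case of every token. It is only a sketch of the rules stated above, with a hypothetical function name; it does not know which keywords expect file names on the following lines.

```python
def read_keyword_lines(text):
    """Split keyword-file text into lists of tokens, one list per meaningful line."""
    entries = []
    for raw in text.splitlines():
        # Everything after a pound sign is a comment.
        line = raw.split("#", 1)[0].strip()
        # Empty lines (or comment-only lines) are skipped.
        if line:
            # Tokens are kept as written because keyword files are case-sensitive.
            entries.append(line.split())
    return entries
```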
Although the value of many keywords can be altered, default values should be used unless a specific system requires different settings. These keywords are essentially used by the developers for optimization and evaluation of the program. In general, modification of a specific value does not significantly improve or affect the accuracy but may result in longer or quicker docking runs.
At the end of this section, a typical keyword file can be found.
V.1. Input/output files
Protein_Conformations <# of files>
input_file_1
input_file_2
Following this keyword is the number of protein structure files used as input (same protein different conformation). These protein files should be prepared using ProCESS prior to the actual docking.
On the following lines are the protein file names, one per line.
For each of the proteins listed, the following files should be associated with it:
input_file_dock.mol2
input_file_score.mol2
input_file_site.txt
input_file_IS.mol2
The name listed in this keyword file should therefore not include extensions such as _dock.mol2 that will be automatically added by FITTED.
Ligand ligand_file.mol2
Name of the ligand file (in MOL2 format). This ligand file should be prepared using SMART prior to the actual docking.
Ref <#_of_files>
lig_ref_file1.mol2
lig_ref_file2.mol2
Following this keyword is an integer stating how many reference files are used to calculate the root-mean-square deviation (RMSD) of the ligand heavy atoms. These ligand files should be in the same reference frame as the protein structure. The possible symmetric conformations of the ligand are calculated in silico.
Two reference files may be needed in some instances where the ligand or protein active site is Cn symmetric (n ≥ 2).
On the following line(s), the reference file(s) (in MOL2 format) are listed, one per line.
If this keyword is missing, no RMSD values will be computed.
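For reference, the heavy-atom RMSD between a docked pose and a reference structure can be sketched as below. The sketch assumes that both coordinate lists contain the same heavy atoms in the same order and that, when several reference files are given, the lowest value is the one of interest; how FITTED combines several references is not stated here, so taking the minimum is an assumption.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered lists of (x, y, z)."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


def best_rmsd(pose, references):
    """Lowest RMSD of the pose against any of the reference structures (assumed rule)."""
    return min(rmsd(pose, reference) for reference in references)
```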
Output filename
Name of the output file.
Forcefield forcefield_file.txt
Name of the force field file to use. This keyword is needed only if a force field other than fitted_ff.txt is to be used. The format of this force field should be consistent with the required format for FITTED (see section II.5).
Binding_Site_Cav cavity_file.mol2
Following this keyword is the file defining the empty space present in the active site cavity (a set of spheres prepared by ProCESS).
If this keyword is missing, no grid filter will be used (it is highly recommended to use both Pharmacophore and Binding_Site_Cav keywords).
Interaction_Sites interaction_sites_file.mol2
Name of the file containing the interaction site description (prepared by ProCESS).
If this keyword is missing, no interaction site filter will be used (it is highly recommended to use both Interaction_Sites and Binding_Site_Cav).
Pharmacophore pharmacophore_file.mol2
Name of the file containing the pharmacophore constraints on the ligands (prepared by ProCESS). Typically this keyword is used to ensure that the individuals produced match this constraint, but it can be softened by setting Min_Constraint.
If this keyword is missing, no constraint will be used.
V.2. Run parameters
Mode [Dock|Filter|VS|Score|Local|SAR]
Dock
Normal docking run. No ligands are filtered out.
This is the default.
Filter
Filters out structures that do not meet Filter, Optional or Essential groups (see below).
Once filtering is done the program exits.
VS
Filters out structures that do not meet Filter, Optional or Essential groups (see below). If the ligand passes all the filters, the docking is performed; otherwise, FITTED exits. Additional keywords are also provided (see below).
Score
Scores the ligand input structure in the provided orientation against all input proteins.
Local
Performs a local search on the ligand input structure. The provided orientation/translation/conformation is used as a starting point and only slight modifications to the ligand conformation, orientation and translation are carried out.
SAR
Performs a local search on the ligand input structure. The provided orientation/translation/conformation is used as a starting point and only slight modifications to the ligand orientation and translation are carried out, while a complete search of conformations is done.
Flex_Type [Rigid|Semiflex|Flex_water|Flex]
Rigid
The ligand is docked onto one protein structure.
This is the default if only one protein structure is used.
Semiflex
The ligand is docked onto multiple protein structures (requires Protein ≥ 2). Proteins can be
exchanged during the evolution but not the genes corresponding to side chains or water molecules (a more complete description of this mode is given in reference 1).
This is the default if more than one protein structure is used.
Flex_water
The ligand is docked into multiple protein structures (requires Protein ≥ 2). Similar to
Semiflex, except that each water molecule evolves independently.
Flex
The ligand is docked onto multiple protein structures (requires Protein ≥ 2). The side chains
and waters are allowed to be exchanged independently from the protein backbone.
Number_of_Runs <number of runs>
More than one run per ligand can be performed (the ligand may be docked several times to ensure a complete search).
If this keyword is missing, a single run is done.
The default value is 3 for Dock mode; for all other modes the default is 1.
V.3. Filtering parameters
The following keywords are used to filter out structures in VS or Filter modes only.
Max_Charge <max_charge>
If a ligand has a net charge higher than max_charge, the program exits.
Default is +2.
Min_Charge <min_charge>
If a ligand has a net charge lower than min_charge, the program exits.
Default is -2.
Max_MW <max_MW>
If a ligand has a molecular weight higher than max_MW, the program exits.
Default is 500.
Min_MW <min_MW>
If a ligand has a molecular weight lower than min_MW, the program exits.
Default is 250.
Max_HBD <max_HBD>
If a ligand has more hydrogen bond donors than max_HBD, the program exits.
Default is 5.
Min_HBD <min_HBD>
If a ligand has fewer hydrogen bond donors than min_HBD, the program exits.
Default is 0.
Max_HBA <max_HBA>
If a ligand has more hydrogen bond acceptors than max_HBA, the program exits.
Default is 10.
Min_HBA <min_HBA>
If a ligand has fewer hydrogen bond acceptors than min_HBA, the program exits.
Default is 0.
Max_Nrot <max_Nrot>
If a ligand has more rotatable bonds than max_Nrot, the program exits.
Default is 6.
Min_Nrot <min_Nrot>
If a ligand has fewer rotatable bonds than min_Nrot, the program exits.
Default is 0.
Max_Ionizable <max_ionizable>
If a ligand has more ionizable groups than max_ionizable, the program exits.
Default is 2.
Min_Ionizable <min_ionizable>
If a ligand has fewer ionizable groups than min_ionizable, the program exits.
Default is 0.
Max_Rings <max_rings>
If a ligand has more rings than max_rings, the program exits.
Default is 10.
Min_Rings <min_rings>
If a ligand has fewer rings than min_rings, the program exits.
Default is 0.
Max_O <max_O>
If a ligand has more oxygen atoms than max_O, the program exits.
Default is 100.
Min_O <min_O>
If a ligand has fewer oxygen atoms than min_O, the program exits.
Default is 0.
Max_N <max_N>
If a ligand has more nitrogen atoms than max_N, the program exits.
Default is 100.
Min_N <min_N>
If a ligand has fewer nitrogen atoms than min_N, the program exits.
Default is 0.
Max_S <max_S>
If a ligand has more sulfur atoms than max_S, the program exits.
Default is 100.
Min_S <min_S>
If a ligand has fewer sulfur atoms than min_S, the program exits.
Default is 0.
Max_Hetero <max_hetero>
If a ligand has more heteroatoms (N, S and O) than max_hetero, the program exits.
Default is 100.
Min_Hetero <min_hetero>
If a ligand has fewer heteroatoms (N, S and O) than min_hetero, the program exits.
Default is 0.
Max_Metal <max_metal>
If a ligand has more heavy atoms other than C, N, O, S, P than max_metal, the program exits.
Default is 0.
Min_Metal <min_metal>
If a ligand has fewer heavy atoms other than C, N, O, S, P than min_metal, the program exits.
Default is 0.
Max_Num_of_Atoms <max_atoms>
If a ligand has more atoms than max_atoms, the program exits.
Default is 10000.
Min_Num_of_Atoms <min_atoms>
If a ligand has fewer atoms than min_atoms, the program exits.
Default is 0.
Filter <#_groups_filtered>
group_filtered1
group_filtered2
Number of functional groups that are filtered out. The name(s) of the filtered functional groups are listed below this keyword (see Table 3).
Optional <#_option_groups>
group_needed1
group_needed2
Number of functional groups of which at least one has to be present. The name(s) of the needed functional groups are listed below this keyword (see Table 3).
Essential <#_essential_groups>
group_needed1
group_needed2
Number of functional groups that are required. The name(s) of the needed functional groups are listed below this keyword (see Table 3).
Table 3. List of groups recognized by FITTED that can be listed after Filter, Optional or Essential.
Aromatic    Acid       Acceptor        Carbamate   Primary Amine
Aldehyde    Lactame    Azide           Ammonium    Secondary Amine
Ester       Nitrile    Isocyanate      Oxime
Lactone     Imine      Acyl_Chloride   Ketone
Amide       Nitro      Sulphonamide    Boronate
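The filtering rules of this section can be summarized in a few lines of code. The sketch below is an illustration only, with hypothetical function names and data structures: numeric properties are checked against inclusive minimum/maximum bounds, Filter groups must all be absent, at least one Optional group must be present, and every Essential group must be present.

```python
def passes_range_filters(properties, limits):
    """properties: name -> value; limits: name -> (minimum, maximum), both inclusive."""
    return all(low <= properties[name] <= high
               for name, (low, high) in limits.items())


def passes_group_filters(groups, filtered=(), optional=(), essential=()):
    """groups: functional groups detected in the ligand (names as in Table 3)."""
    present = set(groups)
    if present & set(filtered):          # any filtered group rejects the ligand
        return False
    if optional and not (present & set(optional)):
        return False                     # at least one optional group is needed
    return set(essential) <= present     # all essential groups are needed
```

With the limits `{"MW": (250, 400)}`, `filtered=("Nitro", "Aldehyde")` and `essential=("Aromatic",)`, an aromatic ester of molecular weight 320 is kept, whereas an aromatic aldehyde of the same weight is rejected.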
V.4. Conjugate gradient parameters
The default values for all the keywords described in this section are recommended.
GA_* or GI_*
There are two sets of the following keywords: one for the parameters used during the generation of the initial population (GI_*; e.g., GI_MaxIter) and another one used during the evolution (GA_*; e.g., GA_MaxIter). The default values are recommended.
XX_MaxIter <maxiter>
Maximum number of iterations. Once this number is reached the minimization is finished.
The default is 20.
V.5. Energy parameters
Score_Initial [none|score|minimize]
Scoring of the initial ligand binding mode.
none
No scoring of the initial input structure is performed.
This is the default setting.
score
Only the score of the initial input ligand is output.
minimize
The score of the initial pose and the score of the energy-minimized structure are output.
V.6. Scoring parameters
The default values for all the keywords of this section (see 0) are highly recommended, as they represent the scaling factors of RankScore2 (reference 7).
V.7. Initial population parameters
Pop_Size <pop_size>
Population size for the genetic algorithm conformational search.
The default is 100 for rigid docking and 200 for flexible docking.
Min_MatchScore <min_matchscore>
This keyword is used only if an interaction site file is provided. If the Mode is set to Dock, Min_MatchScore is automatically calculated.
Minimum match of the interaction sites.
The default is 25.
Min_PharmScore <min_constraint>
This keyword is used only if a pharmacophore file is provided.
Minimum percent match of the pharmacophore.
The default is 100.
V.8. Evolution parameters
Max_Gen <max_gen>
Determines the maximum number of generations for the genetic algorithm.
The default is 200.
CutScore_1 <cutscore_1>
Upper bound score at Max_Gen to further proceed with the docking run. If at least one individual within the top 3 has a score below CutScore_1, the program proceeds to Max_Gen_1.
The default is -4.
Max_Gen_1 <max_gen_1>
This keyword is used in VS mode only.
After Max_Gen generations, if none of the top poses has a score below the one specified by
CutScore_1, the program exits. Otherwise, the program proceeds until it reaches Max_Gen_1
The default is set to be Max_Gen.
CutScore_2 <cutscore_2>
Upper bound score at Max_Gen_1 to further proceed with the docking run. If at least one individual within the top 3 has a score below CutScore_2, the program proceeds to Max_Gen_2.
The default is -5.5.
Max_Gen_2 <max_gen_2>
As for Max_Gen_1, if after Max_Gen_1 generations none of the top poses has a score below the one
specified by CutScore_2, the program exits. Otherwise, the program proceeds until it reaches
Max_Gen_2.
The default is Max_Gen.
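The two-stage early termination controlled by CutScore_1, Max_Gen_1, CutScore_2 and Max_Gen_2 can be pictured with the following sketch. It is a simplified illustration, not FITTED code: the function names are hypothetical, the scores at each checkpoint are passed in directly rather than produced by a genetic algorithm, and "top poses" is taken to mean the three lowest scores.

```python
def should_continue(scores, cutscore, top_n=3):
    """True if at least one of the top_n (lowest) scores is below the cutoff."""
    return any(score < cutscore for score in sorted(scores)[:top_n])


def final_generation(scores_at_max_gen, scores_at_max_gen_1,
                     max_gen=200, max_gen_1=200, max_gen_2=200,
                     cutscore_1=-4.0, cutscore_2=-5.5):
    """Generation at which a VS run would stop under the staged cutoffs."""
    if not should_continue(scores_at_max_gen, cutscore_1):
        return max_gen        # no promising pose after Max_Gen: exit
    if not should_continue(scores_at_max_gen_1, cutscore_2):
        return max_gen_1      # not good enough after Max_Gen_1: exit
    return max_gen_2          # run to completion
```

The benefit in a virtual screen is that unpromising ligands consume only Max_Gen generations, so most of the computing time is spent on ligands that already show reasonable scores.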
Seed <seed>
Select the starting point within the random number generator. If the same run is done with the same seed, the exact same result will be obtained. If a different seed is used, the GA will follow a different path. Changing the seed helps the developers to evaluate the convergence of a run.
The default is 100.
V.9. Docking of covalent inhibitors
This feature is under validation.
Covalent_Residue <residue_name>
Following this keyword is the name of the residue with which the covalent inhibitor will react (e.g., SER554). Only CYS and SER are implemented in the current version.
Covalent_Ligand [Only|Both]
Controls the covalent docking. FITTED will automatically identify aldehyde, boronate or nitrile groups (other groups will eventually be implemented) and assign the proper atom types when covalent poses are considered.
Only
Only covalent poses will be considered.
This is the default.
Both
Covalent and non-covalent poses will be considered concomitantly.
Proton_Moved_To <residue> <atom_name>
The proton will be moved to atom <atom_name> of residue <residue>.
V.10. A typical FITTED keyword file
##################################################################################################
#
#                                            FITTED
#
#                 Flexibility Induced Through Targeted Evolutionary Description
#
#                  Nicolas Moitessier, Christopher Corbeil, Pablo Englebienne
#                                    Jeremy Schwartzentruber
#                             McGill University, Montreal, Canada
#                                         October 2008
##################################################################################################
#
#
# INPUT/OUTPUT FILES
##################################################################################################
Protein_Conformations 9 # Number of protein input files
1e2k_protein # File names
1e2p_protein #
1ki3_protein #
1ki4_protein #
1ki7_protein #
1ki8_protein #
2ki5_protein #
1of1_protein #
1qhi_protein #
Ligand 1e2k_lig.mol2 # Ligand structure file
Output 1e2k_run_1 # File that will contain the output
Forcefield fitted_ff.txt # Force field file
Ref 1 # Number of reference ligand files (for RMSD)
1e2k_lig.mol2 # Name of the ligand ref. file
Binding_Site_Cav cavity_1e2k.mol2 # Cavity file name
Interaction_Sites InterSite_1e2k.mol2 # Interaction site file name
#
# FILTERING (used in VS and Filter modes)
##################################################################################################
#
# Example:
# keep compounds with: 250 < MW < 400, filtering out aldehydes and nitro-containing compounds
# and docking only compounds with aromatic group(s)
#-------------------------------------------------------------------------------------------------
Min_MW 250
Max_MW 400
Filter 2
Nitro
Aldehyde
Essential 1
Aromatic
#
#
# GENETIC ALGORITHM PARAMETERS
##################################################################################################
#
# Scoring of the input structure and creation of the initial population
#-------------------------------------------------------------------------------------------------
Mode Dock # Can be Dock, Filter, VS, Score, Local or SAR
Flex_Type Semiflex # Can be Rigid, Semiflex, Flex, Flex_water
#
#
#
###############################################################################
VI. Preparing a keyword file for ProCESS
The following section lists the keywords, their functions and default values. Gray shading indicates a required keyword; angle brackets <> indicate a numeric value; plain text indicates a text string (such as a file name); square brackets [] indicate a choice of values, with the default shown in italics.
ProCESS keyword files are case-sensitive. Empty lines are allowed, and text after a pound sign (#) is considered a comment.
Although the value of many keywords can be altered, default values should be used unless a specific system requires different settings.
At the end of this section, a typical keyword file can be found.
VI.1. Input/output files
Protein <#_protein_struct>
protein_file1.mol2
protein_file2.mol2
Following the keyword, specify the number of protein structure files to be processed.
On the following lines, specify the protein file names, one per line.
Output output_filename
Name of the output file.
Binding_Site_Cav cavity_filename
Name of the file to which the binding site cavity is written.
If this keyword is not present, ProCESS will not create a binding site cavity file.
Interaction_Sites pharmacophore_filename
Name of the file to which the interaction sites definition is written.
If this keyword is not present, ProCESS will not create an interaction sites definition file.
Binding_Site <#_flex_residues>
flex_residue_1_name
flex_residue_2_name
Manually defines the active site. (The active site can be automatically defined by providing a ligand, see below)
On the same line following this keyword, specify the number of flexible residues.
On subsequent lines, the residue name/numbers (according to Find_Residues) are specified, one
per line.
VI.2. Reading the input files and preparing the output protein files
Renumber_Residues <first_residue_number>
Specify the new number for the first residue; the rest will be sequentially renumbered.
This feature is useful if the protein is a multimer, having multiple residues with the same group name (e.g., two Pro60, two Asp25 as in HIV-1 protease).
AutoFind_Site [Y|N]
This function allows the user to have ProCESS automatically find the flexible residues/binding site.
The default is N.
AutoFind_Center [Y|N]
This function allows the user to have ProCESS automatically find the center of the binding site.
The default is N.
Ligand ligandfile.mol2
Ligand file (in MOL2 format) used to define the active site and its center. It should be in the same reference frame as the protein.
Ligand_Cutoff <ligand_cutoff>
Protein residues within this cutoff (in Å) are considered part of the binding site.
The default is 6.0.
Truncate [Y|N|auto]
Determines whether the protein will be truncated, keeping only residues within Cutoff of the binding site residues.
The default is auto.
auto
The protein will be truncated, keeping residues within Cutoff distance of the ligand rather than within Cutoff distance of the binding site residues.
Cutoff <cutoff>
Any residue that does not have an atom within this distance (in Å) from an atom of a flexible residue or of the given ligand will be deleted from the protein file that ProCESS will output.
The default value is 11 for auto truncation and 9 when Truncate is set to Y.
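The distance criterion behind Truncate and Cutoff can be sketched as follows. This is a simplified illustration with hypothetical names and data structures, not the ProCESS implementation: a residue is kept if any of its atoms lies within the cutoff (in Å) of any anchor atom, where the anchor atoms are those of the ligand or of the flexible residues.

```python
import math


def truncate_residues(residues, anchor_atoms, cutoff):
    """Keep residues with at least one atom within cutoff of any anchor atom.

    residues: residue name -> list of (x, y, z) atom coordinates
    anchor_atoms: list of (x, y, z) for the ligand or the flexible residues
    """
    def is_near(atom):
        return any(math.dist(atom, anchor) <= cutoff for anchor in anchor_atoms)

    return {name: atoms for name, atoms in residues.items()
            if any(is_near(atom) for atom in atoms)}
```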
VI.3. Parameters for the binding cavity file
Grid_Center <grid_center>
Specifically defines the center of the binding site
The default is to automatically find it using the center of a ligand.
Grid_Size <size>
Specifies the size of the box for the binding site.
The default is 12.5.
VI.4. Parameters for the Interaction Sites file
XXX_Weight <xxx_weight>
This group of keywords (XXX being Hydrophobic, Metal, HBA or HBD) specifies the parameters for the assignment of pharmacophoric points. xxx_weight is used to give weight to favourable xxx-type interactions. Default parameters are highly recommended.
Hydrophobic_Weight <hydro_weight>
Defines the weight for hydrophobic interaction points.
The default is 1.
Metal_Weight <metal_weight>
Defines the weight for metal interaction points.
The default is 50.
HBA_Weight <hba_weight>
Defines the weight for hydrogen bond acceptor interaction points.
The default is 5.
HBD_Weight <hbd_weight> <hbd_penalty>
Defines the weight for hydrogen bond donor interaction points.
The default is 5.
If too many points are found, one can reduce this number by using the following keywords:
Pharm_Polar_Softness <pharm_polar_soft>
Maximum distance (in Å) between two polar points to merge.
The default is 0.0.
Pharm_Nonpolar_Softness <pharm_nonpolar_soft>
Maximum distance (in Å) between two non-polar points to merge.
The default is 0.0.
Hydrophobic_Level <hydro_level>
Threshold on the van der Waals interaction between a probe on the grid point and hydrophobic carbons for the point to be considered hydrophobic. If the interaction is found to be lower than hydro_level, a hydrophobic point is added at this location. For more information see the section on Interaction Sites/Pharmacophore generation.
The default is -0.3.
Min_Weight <min_weight>
Minimum weight for a pharmacophoric point to be included in the final pharmacophore.
The defaults are 0.5 and 0.0, respectively.
Num_of_IS <num_of_spheres>
This determines the maximum number of interaction site spheres in the interaction sites file.
The default is 75.
VI.5. A typical ProCESS keyword file
##################################################################################################
#
#                                            ProCESS
#
#                         Protein Conformational Ensemble System Setup
#
#                  Nicolas Moitessier, Christopher Corbeil, Pablo Englebienne
#                                    Jeremy Schwartzentruber
#                             McGill University, Montreal, Canada
#                                         October 2008
##################################################################################################
#
#
# INPUT/OUTPUT FILES
##################################################################################################
Protein 1
protein.mol2
Output protein # File that will contain the output structure
Binding_Site_Cav cavity.mo # File that will contain the binding site file
Interaction_Sites site.mol2 # File that will contain the pharmacophore
#
#
# PROTEIN DESCRIPTION
##################################################################################################
AutoFind_Site yes # Finding site automatically
Ligand lig.mol2 # Ligand used to find center and site
Ligand_Cutoff 9 # Residues within cutoff are part of the binding site
#
#
# ACTIONS
##################################################################################################
Assign_G yes # Assigning residue names
Truncate auto # Truncates the protein keeping residues within cutoff
Cutoff 7 # Cutoff distance
United yes # Makes the united atom representation
#
#
# INTERACTION SITES DESCRIPTION
##################################################################################################
Pharm_Polar_Softness 0.65 # max distance between two polar points to merge
Pharm_Nonpolar_Softness 0.9 # max distance between two non polar points to merge
Pharm_Aromatic_Softness 1.9 # max distance between two aromatic points to merge
Aromatic_Weight 1 # weight given to aromatic points
Metal_Weight 8 # weight given to metal points
Hydrophobic_Weight 1 # weight given to hydrophobic points
HBA_Weight 5 # weight given to HBA points
HBD_Weight 5 # weight given to HBD points
Hydrophobic_Level -0.35 # vdw interaction with hydrophobic carbons
Hydrophobic_Resolution 0.35 # grid resolution for computation of hydrophobic points
Min_Weight 2 # Minimum weight to be included in the pharmacophore
Num_of_IS 50 # Maximum number of beads
#
#
# ACTIVE SITE CAVITY DESCRIPTION
##################################################################################################
Pharmacophore pharm # File that will contain the pharmacophore
Grid_Boundary soft # Grid computed within a box (hard) or not (soft)
Grid_Resolution 1.5 # Grid resolution
Grid_Size 10 # Grid size
Grid_Clash 1.5 # Max dist to consider a point clashing with the protein
#
#
#
##################################################################################################
VII. Running SMART
SMART is the module used to prepare ligand structures in a modified MOL2 format for use by FITTED. It can also assign MMFF charges and prepare ligand structures for use with ACE (Asymmetric Catalyst Evaluation).
SMART has 2 modes of operation: interactive and command-line arguments. The interactive mode is started by calling smart without any arguments. The program will request user input to determine
the mode of operation and the file to be processed. Note that not all options are available through the interactive mode.
The command-line argument syntax of SMART is as follows (arguments in angle brackets < > are mandatory, arguments in square brackets [ ] are optional):
./smart <mode> [OPTIONS] <ligandfile>
mode is one of the following:
-fitted, -f assign GAFF atom types, write file in FITTED format (default)
-ace, -a assign MM3 atom types, write file in ACE format
If specified, the optional arguments modify the default behaviour of SMART:
-in sd read input file in SD/MOL format
-in fitted read input file in FITTED MOL2 format
-out std output standard MOL2 format instead
-out debug output verbose MOL2 format instead (useful for debugging)
-multi write a single multi-MOL2 file as output
-o <name> specify name for output files <name>.mol2 and <name>.log
-m XXX[-YYY] process only molecules within specified range. Range can be one of: i) XXX: process a single molecule; ii) XXX-YYY: process molecules #XXX to #YYY (inclusive); iii) XXX-: process from molecule #XXX until the end.
-nocharge do not assign MMFF charges
-nobond do not reassign bond order
The standard format output may be useful for use with visualization software (as it assigns standard Tripos atom types) or as input to other programs. If the options -nocharge or -nobond are specified, the atomic charges and/or the bond order from the input files are preserved. If the option -in [sd|fitted] is not specified, SMART expects an input file in MOL2 format.
ligandfile specifies a file in standard MOL2 file format (http://www.tripos.com/data/support/mol2.pdf) or standard SD/MOL file format (http://www.mdli.com/downloads/public/ctfile/ctfile.jsp). It can contain either a single molecule or multiple molecules, and it has to be located in the input/ directory. In the case of a multi-molecule input file, there will be one ligandfile.out output file and multiple ligandfile_X.mol2 output files, unless the -multi option is used, in which case there will be a single ligandfile.mol2 file.
The -multi option writes a single multi-MOL2 file as output. Note that the current version of FITTED cannot use multi-MOL2 files as input; an awk script (scripts/separate_mol2.awk) is provided to split a multi-MOL2 file into single-molecule MOL2 files. Additionally, another script (scripts/extract_mol2.awk) is provided to extract a specific molecule from a multi-MOL2 file. For example, to extract molecule #99 from a multi-MOL2 file, the following command can be issued (from the smart/output/ directory):
../scripts/extract_mol2.awk m=99 ligandfile.mol2
The options -o and -m can be used in combination to quickly process a large set of molecules in a multi-processor environment (e.g., a multiple core workstation or a cluster). Say we have an SD file ligands.sd with 249,543 molecules. Instead of processing them with one instance of SMART, we can use 5 instances, each processing a chunk of up to 50,000 molecules, thus significantly reducing the wall-clock time required. The commands to do this are:
./smart -in sd -m 1-50000 -o ligands_1 -multi ligands.sd
./smart -in sd -m 50001-100000 -o ligands_2 -multi ligands.sd
./smart -in sd -m 100001-150000 -o ligands_3 -multi ligands.sd
./smart -in sd -m 150001-200000 -o ligands_4 -multi ligands.sd
./smart -in sd -m 200001- -o ligands_5 -multi ligands.sd
Note the last command: because there are fewer than 50,000 molecules left, we instruct SMART to process from molecule 200001 until the end of the file. The following step would be to make individual MOL2 files for use as FITTED input. This is accomplished by using an awk script to separate the multi-MOL2 files into individual files:
cat ligands_1.mol2 ligands_2.mol2 ligands_3.mol2 ligands_4.mol2 \
ligands_5.mol2 | ../scripts/separate_mol2.awk
This will generate files in the form ligand_N.mol2, with N going from 1 to 249,543.
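For libraries of a different size, the chunked SMART command lines can be generated rather than typed by hand. The short script below is a convenience sketch, not part of the SMART distribution; the function name and default file names are assumptions, while the options it writes (-in, -m, -o, -multi) are those described above.

```python
def smart_commands(n_molecules, chunk_size, sd_file="ligands.sd", prefix="ligands"):
    """Build one SMART command line per chunk of an SD file."""
    commands = []
    for index, start in enumerate(range(1, n_molecules + 1, chunk_size), 1):
        end = start + chunk_size - 1
        # The last chunk is left open-ended so SMART reads to the end of the file.
        span = f"{start}-{end}" if end < n_molecules else f"{start}-"
        commands.append(
            f"./smart -in sd -m {span} -o {prefix}_{index} -multi {sd_file}"
        )
    return commands


if __name__ == "__main__":
    for command in smart_commands(249543, 50000):
        print(command)
```

Run as a script, this prints the five commands used in the example above; each line can then be submitted to a different core or cluster node.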
FITTED Input File Formats
The following sections outline the file formats used for FITTED. FITTED uses a modified version of the Sybyl MOL2 format for all of its input files. For more information on the original Sybyl MOL2 format, visit http://www.tripos.com/data/support/mol2.pdf. The following is an example of a standard MOL2 formatted file.
Column #    Description
1           Atom number
2           Atom name
3           x coordinate
4           y coordinate
5           z coordinate
6           Atom type
7           Group number
8           Group name
9           Partial charge
II.1. The protein files
II.1.1. The XXXX_dock.mol2
The format of the file outputted by ProCESS is a standard MOL2 format with the following changes:
@<TRIPOS>ATOM section:
o the atom types (column 6) are Amber united atom types instead of Sybyl atom types
o the group names (column 8) include the advanced residue names (see Appendix A)
@<TRIPOS>BOND section:
o only bonds for the flexible residues are listed
Column #    Description
1           Atom number
2           Atom name
3           x coordinate
4           y coordinate
5           z coordinate
6           Atom type
7           Group number
8           Group name
9           Partial charge
10          Misc. Information
II.1.2. The XXXX_score.mol2 file
The format of the file outputted by ProCESS is a standard MOL2 format with the following changes:
@<TRIPOS>ATOM section:
o the atom types (column 6) are Amber united atom types instead of Sybyl atom types
o the group names (column 8) include the advanced residue names (see Appendix A)
@<TRIPOS>BOND section:
o all bonds are listed
Column #    Description
1           Atom number
2           Atom name
3           x coordinate
4           y coordinate
5           z coordinate
6           Atom type
7           Group number
8           Group name
9           Partial charge
10          Misc. Information
11          Scaling factor
12          Place Holder
13          OPLS Atom Type
14          Place Holder
15          van der Waals Radii
16          Atom Volume
17          Atomic Solvation
18          vdW solvation
19 and >    Water solvation
II.1.3. The XXXX_site.txt file
The XXXX_site.txt file is outputted by ProCESS and contains the binding site residue list. The first line of the file must start with Site followed by the number of residues. The following lines list the names of the binding site residues, one per line.
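A reader for this simple format takes only a few lines. The sketch below is illustrative, with a hypothetical function name; it checks the Site header and the declared count against the names that follow.

```python
def parse_site_text(text):
    """Return the residue names listed in the text of a XXXX_site.txt file."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    if header[0] != "Site":
        raise ValueError("first line must start with 'Site'")
    count = int(header[1])
    residues = lines[1:1 + count]
    if len(residues) != count:
        raise ValueError("fewer residue names than announced on the Site line")
    return residues
```

For instance, a file containing the three lines `Site 2`, `ASP25` and `PRO60` (residue names chosen only for illustration) yields the list `["ASP25", "PRO60"]`.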
II.2. The Ligand file
The format outputted by SMART is based on the MOL2 format. Some modifications were introduced in order to implement the bitstring describing the presence of functional groups, and to aid in checking the chirality, atom connectivity and ring perception. The changes from the standard MOL2 format are as follows:
@<TRIPOS>MOLECULE section:
o the second line (data associated with the molecule) is expanded by a number of fields describing the ligand and the functional groups present (bitstring). The presence of a particular group is indicated by a 1 on the respective field. The order of the fields is as
follows:
number of atoms
number of bonds
molecular weight
net charge
number of hydrogen bond donors
number of hydrogen bond acceptors
number of rotatable bonds
number of rings
number of ionisable groups
presence of aromatic group
presence of aldehyde
presence of ester
presence of lactone
presence of amide
presence of amide
presence of lactame
presence of acid
presence of nitrile
presence of imine
presence of nitro
presence of Michael acceptor
presence of azide
presence of isocyanate
presence of acyl chloride
presence of sulphonamide
presence of carbamate
presence of ammonium
presence of oxime
presence of secondary amine
presence of primary amine
presence of ketone
presence of boronate
number of oxygens
number of nitrogens
number of sulphurs
number of hetero atoms
number of toxic metals
@<TRIPOS>ATOM section:
o the Sybyl atom types (column 6) are replaced by GAFF atom types; the corresponding Sybyl atom types are stored in column 11.
o the number of hydrogen atoms attached to an atom is stated on column 12.
o the hybridization of an atom is stated on column 13.
@<TRIPOS>BOND section:
o an additional column specifies the bond as rotatable, r, or non rotatable, nr.
Column #    Description
1           Atom number
2           Atom name
3           x coordinate
4           y coordinate
5           z coordinate
6           Atom type
7           Group number
8           Group name
9           Partial charge
10          Misc. Information
11          Sybyl Atom type
12          Number of Hydrogens
13          Hybridization
II.3. The binding site cavity file
The binding site cavity file is used to determine the empty space within the protein via a collection of spheres of different radii. It resembles a MOL2 formatted file with the following changes:
@<TRIPOS>MOLECULE section:
o on the second line the number of spheres is specified as the first field; fields 2-5 are 0.
@<TRIPOS>ATOM section:
o column 6 (Sybyl atom type) is unnecessary, therefore it is replaced by a dash.
o column 9 (partial charges) is replaced by the radius of the sphere.
Column #   Description
1          Point number
2          Point name
3          x coordinate
4          y coordinate
5          z coordinate
6          Point type
7          Group number
8          Group name
9          Radius
10         Misc. information
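To make the role of the spheres concrete, the Python sketch below reads the coordinates (columns 3-5) and radius (column 9) of each point and tests whether a position falls inside the cavity. This is an assumption about how such a file could be used, not the FITTED implementation; the two records are hypothetical and the columns are assumed to be whitespace-separated.

```python
import math


def parse_sphere(line):
    """Return (x, y, z, radius) from one cavity ATOM record."""
    f = line.split()
    x, y, z = (float(v) for v in f[2:5])
    radius = float(f[8])  # column 9 holds the radius instead of a charge
    return (x, y, z, radius)


def inside_cavity(point, spheres):
    """True if the point lies within at least one sphere."""
    return any(math.dist(point, s[:3]) <= s[3] for s in spheres)


# Hypothetical records, for illustration only.
records = [
    "1 SPH 0.000 0.000 0.000 - 1 CAV 2.500 ****",
    "2 SPH 3.000 0.000 0.000 - 1 CAV 1.000 ****",
]
spheres = [parse_sphere(r) for r in records]
print(inside_cavity((1.0, 1.0, 1.0), spheres))
print(inside_cavity((6.0, 0.0, 0.0), spheres))
```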
II.4. The interaction site, pharmacophore and XXXX_IS.mol2 files
The interaction site and pharmacophore files are used to create conformations that already interact well with the protein. Again, the format resembles the MOL2 format, with additional columns for the interaction site type and weight.
@<TRIPOS>MOLECULE section:
o on the second line the number of constraints is specified as the first field; fields 2-5 are 0.
@<TRIPOS>ATOM section:
o column 6 (Sybyl atom type) is unnecessary, therefore it is replaced by a dash.
o column 7 (group number) is replaced by a point type descriptor.
o column 9 (partial charges) is replaced by the radius of the constraint.
o column 10 specifies the type of the pharmacophoric point (HBD, HBA, HYD, ARO, or any combination such as HBA/HYD).
o column 11 specifies the weight of the constraint.
Column #   Description
1          Point number
2          Point name
3          x coordinate
4          y coordinate
5          z coordinate
6          -
7          Point type
8          PHARM
9          Radius
10         Pharmacophoric type
11         Weight
II.5. The force field file
The force field file is where all the parameters for the FITTED force field are kept. Additionally, SMART uses the force field file to assign MMFF charges to molecules if so requested. The force field is a modified GAFF force field [17] with MMFF [15] charge parameters in free format, so it can be edited with any text editor. Although most of the parameters for drug-like molecules are present, some may be missing. When adding a parameter to the force field file, some rules must be followed.
Each section starts with a title (e.g., #fitted_bond_parameters), followed by the actual
parameters (e.g., 1.0 1 c c 1.5500 290.100), and ends with a line of blank
parameters designated by stars (e.g., 0.0 1 * * 0.0000 0.000). The title and
end lines should not be removed, and any line added before the title line or after the end line will be ignored.
FITTED also allows for the use of wildcard parameters for angles and torsion parameters, where I, J, K or L can represent any atom type, by using the wildcard character * in the respective column.
Using wildcards (*) for all the atoms will be read as an end line.
II.5.1. Adding parameters to the bond list
The bond list starts 2 lines below the #fitted_bond_parameters title, and the end is signaled by having both the I and J atom types (columns 3 and 4) as *. Any parameters added after this last line will be ignored. As stated above, the end line should not be removed. The parameter added must be in a single line in the following format:
Units: R0 (Å), K2 (kcal/(mol Å^2))
Column #   Description
#1         Force field file version number
#2         Reference number
#3         Atom type of I
#4         Atom type of J
#5         R0
#6         K2
#----------------------------------------------
#Ver Ref I J R0 K2
#----------------------------------------------
#fitted_bond_parameters
#----------------------------------------------
1.0 1 c c 1.5500 290.100
1.0 1 c c1 1.4600 379.800
1.0 1 c c2 1.4060 449.900
1.0 1 c c3 1.5080 328.300
1.0 1 c ca 1.4870 349.700
[...]
1.2 1 ct ss 1.7700 256.600
1.2 1 ct nh 1.3640 449.000
1.2 1 nt nt 1.3400 450.000
0.0 1 * * 0.0000 0.000
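The section layout described above (title line, parameter lines, end line of stars) can be read as follows. This Python sketch is an illustration only and does not reproduce the FITTED reader; the function name is hypothetical and the columns are assumed to be whitespace-separated.

```python
def read_bond_parameters(lines):
    """Collect (R0, K2) per (I, J) pair between the title and the end line."""
    params = {}
    in_section = False
    for raw in lines:
        line = raw.strip()
        if line == "#fitted_bond_parameters":
            in_section = True
            continue
        # Skip everything before the title, blank lines and '#' rulers.
        if not in_section or not line or line.startswith("#"):
            continue
        ver, ref, i, j, r0, k2 = line.split()
        if i == "*" and j == "*":
            break  # end line: anything after it is ignored
        params[(i, j)] = (float(r0), float(k2))
    return params


excerpt = """\
#Ver Ref I J R0 K2
#----------------------------------------------
#fitted_bond_parameters
#----------------------------------------------
1.0 1 c c 1.5500 290.100
1.0 1 c c3 1.5080 328.300
0.0 1 * * 0.0000 0.000
1.0 1 x y 1.0000 100.000
""".splitlines()

print(read_bond_parameters(excerpt))
```

The last line of the excerpt is placed after the end line on purpose, to show that it is not read.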
Adding parameters to the angle list
The angle list starts 2 lines below the #fitted_angle_parameters title, and the end is signaled by having all I, J and K atom types (columns 3-5) as *. FITTED also allows for the use of wildcard parameters, where I and/or K can represent any atom, by using the wildcard character * in the respective column. Parameters added to the force field including wildcards should be placed at the end of the angle list. The less specific the parameter (higher number of wildcards), the lower in the list it should be placed. The parameter added must be in a single line in the following format:
Units: Theta0 (degrees), K2 (kcal/(mol rad^2))
Column #   Description
#1         Force field file version number
#2         Reference number
#3         Atom type of I
#4         Atom type of J
#5         Atom type of K
#6         Theta0
#7         K2
#----------------------------------------------------
# E = K2 * (Theta - Theta0)^2
#----------------------------------------------------
#Ver Ref I J K Theta0 K2
#----------------------------------------------------
#fitted_angle_parameters
#----------------------------------------------------
1.0 1 hw ow hw 104.5200 100.0000
1.0 1 hw hw ow 127.7400 0.0000
1.0 1 br c br 113.1000 66.9000
1.0 1 br c c3 110.7400 63.3000
1.0 1 br c o 121.4600 63.2000
[...]
1.2 1 * n4 hn 109.0000 35.0000
1.2 1 * n4 * 109.5000 60.0000
1.2 1 * na * 120.0000 60.0000
1.2 1 * nb * 120.0000 60.0000
0.0 1 * * * 0.0000 0.0000
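The ordering rule for wildcard angle parameters (most specific first, least specific last) suggests a lookup in which the first matching line is used. The Python sketch below illustrates that idea under two assumptions that are not confirmed by this manual: that the first match wins, and that I-J-K is equivalent to K-J-I. The function names are hypothetical and the list is a small subset of the excerpt above.

```python
# (I, J, K, Theta0, K2), ordered from most to least specific.
ANGLE_LIST = [
    ("hw", "ow", "hw", 104.52, 100.0),
    ("br", "c", "o", 121.46, 63.2),
    ("*", "n4", "hn", 109.0, 35.0),
    ("*", "n4", "*", 109.5, 60.0),
]


def matches(pattern, atom_type):
    """A '*' pattern matches any atom type."""
    return pattern == "*" or pattern == atom_type


def find_angle(i, j, k, angle_list=ANGLE_LIST):
    """Return (Theta0, K2) of the first matching line, or None."""
    for pi, pj, pk, theta0, k2 in angle_list:
        if pj != j:
            continue
        forward = matches(pi, i) and matches(pk, k)
        reverse = matches(pi, k) and matches(pk, i)
        if forward or reverse:
            return theta0, k2
    return None


print(find_angle("c3", "n4", "hn"))
print(find_angle("c3", "n4", "c3"))
```

With this lookup, placing "* n4 *" above "* n4 hn" would hide the more specific line, which is consistent with the placement rule given above.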
Adding parameters to the torsion list
The torsion list starts 2 lines below the #fitted_torsion_parameters title, and the end of the list is signaled by having all I, J, K and L atom types as *. FITTED also allows for the use of wildcard parameters, where I and L can represent any atom, by using the wildcard character * in the respective column. Parameters added to the force field including wildcards should be placed at the beginning of the torsion list. The parameter added must be in a single line in the following format:
Units: V (kcal/mol)
Column #   Description
#1         Force field file version number
#2         Reference number
#3         Atom type of I
#4         Atom type of J
#5         Atom type of K
#6         Atom type of L
#7         V1
#8         Phi0(1)
#9         V2
#10        Phi0(2)
#11        V3
#12        Phi0(3)
#13        V4
#14        Phi0(4)
#----------------------------------------------------------------------------------------
# E = SUM(n=1,4) { (Vn/m) * [ 1 + cos(n*Phi - Phi0(n)) ] }
# m = multiplicity or total number of torsions centered on the same bond.
#----------------------------------------------------------------------------------------
#Ver Ref I J K L V1 Phi0 V2 Phi0 V3 Phi0 V4 Phi0
#----------------------------------------------------------------------------------------
#fitted_torsion_parameters
#----------------------------------------------------------------------------------------
1.0 1 * c c * 0.0000 0.00 1.2000 180.00 0.0000 0.00 0.0000 0.00
1.0 1 * c c1 * 0.0000 0.00 0.0000 180.00 0.0000 0.00 0.0000 0.00
1.0 1 * c cg * 0.0000 0.00 0.0000 180.00 0.0000 0.00 0.0000 0.00
1.0 1 * c ch * 0.0000 0.00 0.0000 180.00 0.0000 0.00 0.0000 0.00
1.0 1 * c c2 * 0.0000 0.00 8.7000 180.00 0.0000 0.00 0.0000 0.00
[...]
1.0 1 hc c3 c3 f 0.1900 0.00 0.0000 0.00 0.0000 0.00 0.0000 0.00
1.0 1 hc c3 c3 cl 0.2500 0.00 0.0000 0.00 0.0000 0.00 0.0000 0.00
1.0 1 hc c3 c3 br 0.5500 0.00 0.0000 0.00 0.0000 0.00 0.0000 0.00
0.0 1 * * * * 0.0000 0.00 0.0000 0.00 0.0000 0.00 0.0000 0.00
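The torsion energy expression given in the file header can be evaluated directly from one parameter line. The Python sketch below is an illustration; it assumes that the Phi0 values are in degrees (as the 180.00 entries suggest), and the function name is hypothetical.

```python
import math


def torsion_energy(phi_deg, terms, m=1):
    """E = SUM(n=1,4) (Vn/m) * [1 + cos(n*Phi - Phi0(n))], angles in degrees.

    terms: four (Vn, Phi0(n)) pairs in the order n = 1..4.
    m: number of torsions centered on the same bond.
    """
    energy = 0.0
    for n, (vn, phi0_deg) in enumerate(terms, start=1):
        angle = math.radians(n * phi_deg - phi0_deg)
        energy += (vn / m) * (1.0 + math.cos(angle))
    return energy


# Terms of the "* c c *" line from the excerpt above.
c_c_terms = [(0.0, 0.0), (1.2, 180.0), (0.0, 0.0), (0.0, 0.0)]

for phi in (0.0, 90.0, 180.0):
    print(phi, round(torsion_energy(phi, c_c_terms), 4))
```

For this line only the twofold term is non-zero, so the energy is lowest for the planar arrangements (0 and 180 degrees) and highest at 90 degrees.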
Adding parameters to the out-of-plane list
The out-of-plane list starts 2 lines below the #fitted_oop_parameters title, and the end of the list is signaled by having all I, J, K and L atom types as *. FITTED also allows for the use of wildcard parameters, where I, K and/or L can represent any atom, by using the wildcard character * in the respective column. Parameters added to the force field including wildcards should be placed at the end of the out-of-plane list. The less specific the parameter (higher number of wildcards), the lower in the list it should be placed. The parameter added must be in a single line in the following format:
Units: K (kcal/mol)
Column #   Description
#1         Force field file version number
#2         Reference number
#3         Atom type of I
#4         Atom type of J
#5         Atom type of K
#6         Atom type of L
#7         Kchi
#8         n
#9         Chi0
#-----------------------------------------------------------------------
# E = Kchi * [ 1 + cos(n*Chi - Chi0) ]
#-----------------------------------------------------------------------
#Ver Ref I J K L Kchi n Chi0
#-----------------------------------------------------------------------
#fitted_oop_parameters
#-----------------------------------------------------------------------
1.0 1 c c2 c2 c3 1.1000 2 180.0000
1.0 1 c ca c3 ca 1.1000 2 180.0000
1.0 1 c n c3 hn 1.1000 2 180.0000
1.0 1 c n c3 o 1.1000 2 180.0000
1.0 1 c2 na c2 c3 1.1000 2 180.0000
[...]
1.0 1 c2 c2 * * 10.5000 2 180.0000
1.0 1 o c * * 1.1000 2 180.0000
1.0 1 * c * * 10.5000 2 180.0000
1.0 1 * ca * * 7.1000 2 180.0000
0.0 1 * * * * 0.0000 2 180.0000
Adding parameters to the van der Waals list
The vdW list starts 2 lines below the #fitted_vdW_parameters title, and the end of the list is signaled by having the I atom type as *. The parameter added must be in a single line in the following format:
Units: Ri* (Å), EPSi (kcal/mol)
Column #   Description
#1         Force field file version number
#2         Reference number
#3         Atom type of I
#4         Ri*
#5         EPSi
#------------------------------------------------
#type r-eps
#combination arithmetic
#------------------------------------------------
# E = EPSij * { (Rij*/Rij)^12 - 2(Rij*/Rij)^6 }
# where EPSij = sqrt( EPSi * EPSj)
# Rij* = (Ri* + Rj*)/2
#------------------------------------------------
#Ver Ref I Ri* EPSi
#------------------------------------------------
#fitted_vdW_parameters
#------------------------------------------------
1.0 1 h1 2.7740 0.01570
1.0 1 h2 2.5740 0.01570
1.0 1 h3 2.3740 0.01570
1.0 1 h4 2.8180 0.01500
1.0 1 h5 2.7180 0.01500
[...]
1.2 1 n0 3.7360 0.00277
1.2 1 k 5.3160 0.000328
1.2 1 zn2 2.2000 0.0125
0.0 1 * 0.0000 0.00000
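The combination rules stated in the file header (geometric mean for EPS, arithmetic mean for R*) and the 12-6 expression can be checked with a short calculation. The Python sketch below is an illustration using the h1 and h2 values from the excerpt; the function name is hypothetical.

```python
import math


def vdw_energy(r, ri_star, eps_i, rj_star, eps_j):
    """E = EPSij * [(Rij*/r)^12 - 2 (Rij*/r)^6] with the rules from the header."""
    eps_ij = math.sqrt(eps_i * eps_j)      # EPSij = sqrt(EPSi * EPSj)
    rij_star = (ri_star + rj_star) / 2.0   # Rij* = (Ri* + Rj*)/2
    ratio6 = (rij_star / r) ** 6
    return eps_ij * (ratio6 ** 2 - 2.0 * ratio6)


# h1 (Ri* = 2.7740, EPSi = 0.01570) with h2 (Ri* = 2.5740, EPSi = 0.01570).
r_min = (2.7740 + 2.5740) / 2.0
print(r_min, vdw_energy(r_min, 2.7740, 0.01570, 2.5740, 0.01570))
```

At a separation equal to Rij* the energy equals -EPSij, which is the bottom of the well for this functional form.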
Adding parameters to the hydrogen bond list
The Hbond list starts 2 lines below the #fitted_Hbond_parameters title, and the end of the list is signaled by having both the I and J atom types as *. The parameter added must be in a single line in the following format:
Units: A, B (kcal/mol)
Column #   Description
#1         Force field file version number
#2         Reference number
#3         Atom type of I
#4         Atom type of J
#5         A
#6         B
#--------------------------------------------------
# E = Aij/r^12 - Bij/r^10
#--------------------------------------------------
#Ver Ref I J A B
#--------------------------------------------------
#fitted_Hbond_parameters
#--------------------------------------------------
1.0 3 hw nb 7557.0000 2385.0000
1.0 3 hw nc 10238.0000 3071.0000
1.0 3 hw o 7557.0000 2385.0000
1.0 3 hw oh 7557.0000 2385.0000
1.0 3 hw os 7557.0000 2385.0000
[...]
1.2 3 zn s6 15000.0000 5000.0000
1.2 3 zn ss 15000.0000 5000.0000
1.2 3 zn sh 15000.0000 5000.0000
0.0 3 * * 0.0000 0.0000
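When choosing A and B for a new atom pair, it can help to see where the 12-10 expression has its minimum. Setting dE/dr = 0 for E = A/r^12 - B/r^10 gives r^2 = 12A/(10B). The Python sketch below applies this to the hw-o line of the excerpt; it is an illustration only, the function names are hypothetical, and distances are assumed to be in Å.

```python
def hbond_energy(r, a, b):
    """E = A/r^12 - B/r^10."""
    return a / r ** 12 - b / r ** 10


def hbond_minimum(a, b):
    """Distance and energy at the minimum of the 12-10 expression."""
    r_min = (1.2 * a / b) ** 0.5  # from dE/dr = 0: r^2 = 12A / (10B)
    return r_min, hbond_energy(r_min, a, b)


# hw-o parameters from the excerpt above.
r_min, e_min = hbond_minimum(7557.0, 2385.0)
print(round(r_min, 3), round(e_min, 3))
```

For these values the minimum lies at roughly 1.95 with a depth of roughly 0.5 kcal/mol.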
II.5.2. Adding parameters to the bond charge increment list
The bond charge increment list starts below the #fitted_charge_parameters title, and the end of the list is signaled by having both the I and J atom types as *. Each line specifies a bond charge increment for a bond between atoms of type I and J (bciIJ), such that the resulting charge on J is bciIJ, while that on I is -bciIJ. The parameter added must be in a single line in the following format:
Column #   Description
#1         Version number
#2         Atom type of I
#3         Atom type of J
#4         bci
#5         Comment
##############################
#Cl I J bond_inc source
##############################
#fitted_charge_parameters
0 1 1 0.0000 #C94
0 1 2 -0.1382 C94
0 1 3 -0.0610 #C94
0 1 4 -0.2000 #X94
[...]
0 80 81 -0.4000 #C94
0 101 1 0.0000 empirical
0 101 6 -0.1900 empirical
0 101 37 -0.0000 empirical
* * * * *
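The sign convention for bond charge increments (bciIJ added to J and subtracted from I) can be illustrated with a small accumulation loop. The Python sketch below is not the SMART implementation: it ignores formal charges, assumes that a bond listed in the reverse order uses the same increment with opposite sign, and uses a hypothetical three-atom fragment.

```python
def charges_from_bci(atom_types, bonds, bci):
    """Sum bond charge increments into per-atom partial charges."""
    charges = [0.0] * len(atom_types)
    for a, b in bonds:
        ta, tb = atom_types[a], atom_types[b]
        if (ta, tb) in bci:
            inc = bci[(ta, tb)]        # atom a plays the role of I, atom b of J
        elif (tb, ta) in bci:
            inc = -bci[(tb, ta)]       # listed the other way round: flip the sign
        else:
            raise KeyError(f"missing bond increment for types {ta} and {tb}")
        charges[a] -= inc              # I receives -bci
        charges[b] += inc              # J receives +bci
    return charges


# Two increments taken from the excerpt above (MMFF atom types 1 and 2).
BCI = {(1, 1): 0.0000, (1, 2): -0.1382}

# Hypothetical fragment: atoms 0 and 1 of type 1, atom 2 of type 2.
print(charges_from_bci([1, 1, 2], [(0, 1), (1, 2)], BCI))
```

Because every increment is added to one atom and subtracted from the other, the charges of a neutral fragment sum to zero.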
Adding parameters to the partial bond charge increment / formal charge adjustment factor list
As a more general way of describing bcis, MMFF includes a partial bci parameter that is assigned to each atom type [15]; a bci for a bond can be obtained as the sum of the partial bci corresponding to each atom type involved. Additionally, the formal charge on some groups is spread among neighbouring atoms; this is specified in the formal charge adjustment factor for the central atom type in those functional groups [15].
The parameter list starts below the #fitted_mmff_addl_charges title, and the end of the list is signaled by a line of stars (* * * * *). The parameter added must be in a single line in the following format:
Column #   Description
#1         Version number
#2         Atom type
#3         Partial bci
#4         Formal charge adjustment factor
#5         Comment
###
# MMFF Partial Bond Charge Incs and Formal-Charge Adj. Factors: 19-MAY-1994
#
# source: J. Comp. Chem. 17, 616 (1996)
###
# type pbci fcadj Origin/Comment
###
#fitted_mmff_addl_charges
0 1 0.000 0.000 E94
0 2 -0.135 0.000 E94
0 3 -0.095 0.000 E94
0 4 -0.200 0.000 E94
[...]
0 96 2.000 0.000 Ionic charge
0 97 1.000 0.000 Ionic charge
0 98 2.000 0.000 Ionic charge
0 99 2.000 0.000 Ionic charge
* * * * *
FITTED errors and warnings
ERROR: Molecule outside maximum number of angles.
FITTED can only handle molecules with up to [3 × #atoms] angles. If there are more, please contact the developers at [email protected]
ERROR: Molecule outside maximum number of torsions.
FITTED can only handle molecules with up to [6 × #atoms] torsions. If there are more, please contact the developers at [email protected]
ERROR: Forcefield file <forcefield filename> not found
The force field file listed in the keyword file is not found in the forcefield/ directory.
ERROR: Molecule too big for active site
Increase GI_Num_of_Trials
Increase GI_Initial_E
Increase GI_Minimized_E
Increase Grid_Size in ProCESS to create a larger active site cavity.
If none of these work, the molecule is too big for the active site and cannot be docked.
ERROR: Protein input file <Protein file name> not present
The protein file could not be found in the input/ directory.
ERROR: Ligand input file <ligand file name> not present
The ligand file could not be located in the input/ directory.
ERROR: Binding_Site_Cav file <Active site filename> not present
Binding_Site_Cav file could not be located within the input/ directory. Without an active site file
the docking may take longer and be less accurate.
WARNING: Binding_Site_Cav needed for generation of initial population
FITTED issues this warning but will not exit. Without an active site file the docking may take longer and be less accurate.
ERROR: Reference file <Reference file name> not present.
The reference file could not be located within the input/ directory.
ERROR: Missing Forcefield Parameters
FITTED exits because there are missing force field parameters. Either add them to the force field file or use Parameters Auto keyword to have FITTED automatically assign parameters.
WARNING: Missing Forcefield parameters, assigning parameters
automatically
Listed below this warning are the parameters that were assigned automatically. If an automatic assignment is not suitable, add the parameter with the desired value to the force field file.
ERROR: <keyword_filename> Can not be opened
If the keyword file is not found in the keyword/ directory, an error.log file is created with this error. Please put the keyword file in the keyword/ directory.
ERROR: Coordinates not found in protein structural file
@<TRIPOS>ATOM is not found in the protein file preceding the coordinates
ERROR: Array size for number of protein atoms and bonds not in Protein 1
mol2 file.
@<TRIPOS>MOLECULE is not found in the first protein mol2 file
ERROR: Array size for number of ligand atoms and bonds not in Ligand
mol2.
@<TRIPOS>MOLECULE is not found in the ligand mol2 file
ERROR: Coordinates not found in ligand file
@<TRIPOS>ATOM is not found in the ligand file preceding the coordinates
ERROR: Check water names and atom types.
The water atom name and atom types are non-standard. Change to standard names.
ERROR: Bonds not found in ligand file
@<TRIPOS>BOND is not found in the ligand file preceding the list of bonds
ERROR: No assignment of Rotatable bonds
Please assign rotatable bonds either manually or by using SMART
ERROR: Bonds not found in protein file
@<TRIPOS>BOND is not found in the protein file preceding the list of bonds
ERROR: Protein keyword not found in keyword file.
The keyword Protein is not found within the keyword file. Please include this keyword in your keyword file, followed by the number of protein files and, on the next lines, a list of the protein files for the docking/virtual screening run.
ERROR: Can not find <residue name>
FITTED cannot find a residue listed in the keyword file. Please check the spelling.
ERROR: Flex file <protein file name> not found
Make sure <protein file name>_flex.txt is in the input directory.
Error: Can not find coordinates in <Binding_Site_filename> is not present
in the active site cavity file
@<TRIPOS>MOLECULE is not found in the Binding_Site_Cav file
Error: Can not find coordinates in <Binding_Site_filename>
@<TRIPOS>ATOM is not found in the Binding_Site_Cav file preceding the coordinates
Error: Can not find coordinates in <pharmacophore_filename>
@<TRIPOS>ATOM is not found in the Pharmacophore file preceding the coordinates
Error: Can not find coordinates in <Interaction_Sites_filename>
@<TRIPOS>ATOM is not found in the Interaction_Sites file preceding the coordinates
Error: Ligand can not match minimum pharmacophore
Increase value of Min_PharmScore.
Error: Ligand can not match minimum Interaction Sites
Increase value of Min_MatchScore.
ERROR: Reference file <reference_filename> not present
The reference file is not located within the input/ directory.
ERROR: Invalid parameter specified for covalent residue.
Make sure the residue name is listed in the keyword the same way it is listed in the protein file.
ERROR: FITTED cannot find O/S and H for the covalent residue
Format in protein input file may be incorrect. In particular, make sure that for serine the alcohol atom names are set as OG and HG.
ERROR: Invalid parameter specified for other catalytic residue.
Make sure the residue and atom name are specified the same in the keyword and protein file.
ERROR: The proteins do not have the same number of atoms
Make sure to run ProCESS with all proteins in one keyword file.
ERROR: Problem with creation of z-matrix for ligand.
Make sure there is not a missing bond in the bond list of the ligand mol2 file. FITTED cannot handle mol2 with multiple structures.
ERROR: Problem with creation of z-matrix for active site residue
<residue_name>.
A bond is missing from the bond list in one of the protein mol2 files. Either add the missing bond(s) or remove the residue from the XXXX_site.txt file if it is not critical to binding of the ligand.
ProCESS errors and warnings
Number of proteins not in keyword file.
Issued if the number of protein files does not follow the Protein_Conformations keyword.
Coordinates not found in structural file
Issued if @<TRIPOS>ATOM does not precede the coordinates of the structure in either the protein or ligand mol2 file.
Bonds not found in structural file
Issued if @<TRIPOS>BOND does not precede the bond list in either the protein or ligand mol2 file.
Ligand file not present now closing
Issued if Ligand is not found in the keyword file.
User wanted automatic finding of active site center, Ligand Reference not
given.
Issued if the keyword AutoFind_Site is used in the keyword file and Ligand is not found in the keyword file.
<Protein file name> file not present. Program now Closing.
The protein file given cannot be found in the input/ directory.
<Ligand file name> file not present. Program now Closing
The ligand file given cannot be found in the input/ directory.
Side chain <residue name> Not found in <protein file name>
The residue given cannot be found in the protein file.
Unknown residue name: <residue name>
The residue is not known. Refer to Tables 1a and 1b for accepted residue names.
SMART errors and warnings
The following is a list of errors and warnings that SMART outputs to the corresponding log file in the output/ directory. Errors indicate serious problems that cause SMART to either skip a molecule or
exit. Warnings are potential problems that might cause the SMART output to be incorrect; critical examination of the output and input structures in these cases is strongly encouraged.
ERROR: File <filename> cannot be opened.
The input file specified could not be read. Make sure that the file is located in the input/ directory.
Check the spelling and the file permissions.
ERROR: Atom <atom_name> cannot find element
The specified atom has a non-standard Sybyl atom type, or is not in the range of atomic numbers 1-35 (H-Br), 44-46 (Ru-Pd), 53 (I) or 78 (Pt). Without a proper element assignment, atom types cannot be assigned. In particular, look for: i) P atoms in phosphates and analogous functional groups: the Sybyl atom type for the P atom should be “P.3”; ii) S atoms in sulfoxides, sulfones and derivatives: the Sybyl atom type for the S atom should be “S.o” or “S.o2” respectively.
ERROR: could not write to <filename>
The specified output file could not be written. Check permissions on the output/ and parent
directories, that there is enough empty space in the volume and that the filename is valid.
ERROR: cannot create Z-matrix. Does the molecule have a torsion?
In order to be processed by SMART, a molecule must at least have 4 atoms connected sequentially in order to define a torsion. If a torsion cannot be defined, the molecule is skipped.
WARNING: Sum of partial charges does not equal net charge
When assigning MMFF charges, the partial charges assigned do not match the predicted formal charge. Check atom type assignment and bond connectivity.
WARNING: Cannot assign atomic weight to atom <atom_number> <atom_name>
When generating the bit string, the molecular weight is calculated from the sum of atomic weights. Currently, only atoms of atomic number 1-17 (H-Cl), 34-35 (Se, Br) and 53 (I) are parameterized.
WARNING: Atom <atom_name> has a formal charge of <formal charge>
When automatically assigning the bond orders (-assign_bond command-line option), this message is outputted to the log file for every atom with a formal charge higher than 1. Check the bond order assignment in these molecules to make sure it is correct.
WARNING: Missing bond increment. Bond # <bond_number> Atoms <atom_name1> <atom_name2>; MMFF atom types <MMFF_type1> and <MMFF_type2>. Bond increment set
to 0.
When automatically assigning the MMFF charges (-charge command-line option), this message is
outputted for every pair of atoms for which the bond increment is not parameterized. Add the bond increment in the forcefield/fitted_ff.txt file.
WARNING: Could not assign charges to molecule
When automatically assigning the MMFF charges (-charge command-line option), this message is
indicative of other problems with the charge assignment. Look for warning messages appearing before this one in the log file.
Functional group definitions
Table 9 - Definition of functional groups in SMART (blue = atom type, green = element)
[Structure drawings not reproduced; each keyword is followed by its description.]
Keyword Description
Aromatic
An aromatic group is present if a ca atom type is within the molecule.
Aldehyde
An aldehyde is present if there is a c atom type in the molecule bound to a hydrogen.
Ester
An ester is present if there is a c atom type bound to an atom with an os atom type, with the c not bound to an n atom type and with both c and os atoms being acyclic.
Lactone
A lactone is present if there is a c atom type bound to an atom with an os atom type, with the c not bound to an n atom type and with the c and os atoms involved in a ring.
Amide
An amide is present if there is a c atom type bound to an atom with an n atom type, with the c not bound to an os atom type and with both c and n atoms being acyclic.
Lactame
A lactam is present if there is a c atom type bound to an atom with an n atom type, with the c not bound to an os atom type and with both c and n atoms being cyclic.
Acid
An acid (carboxylate) is present if an atom with a c atom type is bound to two atoms with o atom types.
Nitrile
A nitrile is present if an atom with a c1 atom type is bound to an atom with an n1 atom type.
Imine
An imine is present if an atom with a c2 atom type is bound to an atom with an n2 atom type, both acyclic; R cannot be an oxygen atom.
Nitro
A nitro is present if there is an atom with an no atom type within the molecule.
Acceptor
A Michael acceptor is present if an atom with an atom type of c2 is bound to either 1) an atom with a c atom type which is not a carboxylate, or 2) a nitrile group. The bond between c2 and c/c1 must be acyclic.
Azide
An azide is present if there are three acyclic nitrogens in a linear formation.
Isocyanate
An isocyanate is present if an atom with an atom type of c is bound to 2 atoms, one with an atom type of n2 and another with an atom type of o, where the c - n2 bond is acyclic.
Acyl_Chloride
An acyl chloride is present if an atom with an atom type of c is bound to an atom with an atom type of cl or br.
Sulphonamide
A sulphonamide is present when an atom with an atom type of s6 is bound to an atom with an atom type of n.
Carbamate
A carbamate is present when an atom with an atom type of c is bound to an atom with an n atom type and an atom with an os atom type.
Ammonium
An ammonium is present if there is an atom with an n4 atom type.
Oxime
An oxime is present if there is an atom with a c2 atom type bound to an atom with an n2 atom type which in turn is bound to an oxygen atom.
Ketone
A ketone is present if an atom with a c atom type is bound to 2 carbon atoms.
Boronate
A boronate is present if there is a boron atom bound to a carbon and two oxygens.
Primary_Amine
A primary amine is present if there is an atom with an atom type of n3 bound to two hydrogens.
Secondary_Amine
A secondary amine is defined as an atom with an atom type of n3 bound to a single hydrogen.
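Several of these definitions reduce to checking for an atom type or for a bonded pair of atom types. The Python sketch below illustrates this for five of the simpler groups; it is not the SMART implementation, the function names are hypothetical, and the typed heavy-atom graph used as input (a benzonitrile-like fragment) is made up for the example.

```python
def has_bonded_pair(types, bonds, type_a, type_b):
    """True if any bond joins an atom of type_a to an atom of type_b."""
    for i, j in bonds:
        ti, tj = types[i], types[j]
        if (ti == type_a and tj == type_b) or (ti == type_b and tj == type_a):
            return True
    return False


def detect_groups(types, bonds):
    """Flag a few functional groups from atom types and connectivity."""
    return {
        "Aromatic": "ca" in types,
        "Nitrile": has_bonded_pair(types, bonds, "c1", "n1"),
        "Nitro": "no" in types,
        "Sulphonamide": has_bonded_pair(types, bonds, "s6", "n"),
        "Ammonium": "n4" in types,
    }


# Hypothetical input: six ring carbons (ca), one c1 carbon and one n1 nitrogen.
types = ["ca", "ca", "ca", "ca", "ca", "ca", "c1", "n1"]
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 6), (6, 7)]
print(detect_groups(types, bonds))
```

Groups whose definitions involve ring membership or exclusions (ester, lactone, amide, Michael acceptor) would need ring perception and neighbour checks that are left out of this sketch.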
Additional keywords for FITTED
The following sections list the keywords, their functions and default values. Gray shading indicates a required keyword; angle brackets <> indicate a numeric value; plain text indicates
a text string (such as a file name); square brackets [choice1|choice2] indicate a choice of
values, the default shown in italics.
Note that keyword files are case-sensitive. Empty lines are allowed, and text after a pound sign (#) is considered a comment.
Although the value of many keywords can be altered, default values should be used unless a specific system requires different settings. These keywords are essentially used by the developers for optimization of the program (time and accuracy). In general, modification of a specific value does not significantly improve or affect the accuracy but may result in longer or quicker docking runs.
At the end of this section, a typical keyword file can be found.
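The keyword file conventions stated above (case-sensitive keywords, empty lines allowed, text after a pound sign treated as a comment) can be illustrated with a minimal reader. The Python sketch below is not the FITTED parser; the function name and the example file content are hypothetical.

```python
def read_keyword_lines(text):
    """Return the non-empty lines of a keyword file as lists of tokens."""
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments after '#'
        if line:                             # skip empty lines
            entries.append(line.split())     # case is left untouched
    return entries


EXAMPLE = """\
# hypothetical docking run
Protein_Conformations 2
prot_a
prot_b

Ligand ligand_file.mol2   # prepared with SMART
Mode Dock
"""

for tokens in read_keyword_lines(EXAMPLE):
    print(tokens)
```

Tokens are not converted to lower or upper case, since a keyword written with different capitalization would not be recognized in a case-sensitive file.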
II.6. Input/output files
Protein_Conformations <# of files>
input_file_1
input_file_2
Following this keyword is the number of protein structure files used as input (same protein different conformation). These protein files should be prepared using ProCESS prior to the actual docking.
On the following lines are the protein file names, one per line.
For each of the proteins listed, the following associated files should be present:
input_file_dock.mol2
input_file_score.mol2
input_file_site.txt
input_file_IS.mol2
The name listed in this keyword file should therefore not include extensions such as _dock.mol2 that will be automatically added by FITTED.
Ligand ligand_file.mol2
Name of the ligand file (in MOL2 format). This ligand file should be prepared using SMART prior to the actual docking.
Ref <#_of_files>
lig_ref_file1.mol2
lig_ref_file2.mol2
Following this keyword is an integer stating how many reference files are used to calculate the root-mean-square deviation (RMSD) of the ligand heavy atoms. These ligand files should be in the same reference frame as the protein structure. The possible symmetric conformations of the ligand are calculated in silico.
Two reference files may be needed in some instances where the ligand or protein active site is Cn symmetric (n ≥ 2).
On the following line(s), the reference file(s) (in MOL2 format) are listed, one per line.
If this keyword is missing, no RMSD values will be computed.
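Because the reference files share the reference frame of the protein, an RMSD can be computed without superimposing the structures. The Python sketch below illustrates this; it is not the FITTED implementation. It assumes that the heavy atoms are listed in the same order in the pose and in each reference, it does not enumerate symmetric conformations, and taking the lowest value over several reference files is an assumption made for the example. The coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched coordinate lists."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


def best_rmsd(pose, references):
    """Lowest RMSD of the pose against any of the reference structures."""
    return min(rmsd(pose, ref) for ref in references)


# Hypothetical heavy-atom coordinates.
pose = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ref_1 = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
ref_2 = [(0.0, 0.0, 0.5), (1.0, 0.0, 0.5)]
print(best_rmsd(pose, [ref_1, ref_2]))
```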
Output filename
Name of the output file.
Forcefield forcefield_file.txt
Name of the force field file to use, if a force field other than fitted_ff.txt is to be used. The format of this force field should be consistent with the required format for FITTED (see section II.5).
Binding_Site_Cav cavity_file.mol2
Following this keyword is the file defining the empty space present in the active site cavity (a set of spheres prepared by ProCESS).
If this keyword is missing, no grid filter will be used (it is highly recommended to use both the Pharmacophore and Binding_Site_Cav keywords).
Interaction_Sites interaction_sites_file.mol2
Name of the file containing the interaction site description (prepared by ProCESS).
If this keyword is missing, no interaction site filter will be used (it is highly recommended to use both the Interaction_Sites and Binding_Site_Cav keywords).
Pharmacophore pharmacophore_file.mol2
Name of the file containing the pharmacophore constraints on the ligands (prepared by ProCESS). Typically this keyword is used to ensure that the individuals produced match this constraint, but it can be softened by setting Min_Constraint.
If this keyword is missing, no constraint will be used.
Protein_Ref <#_of_files>
ref_file_1.ext
ref_file_2.ext
Following this keyword is the number of reference protein structure files used to compute the protein RMSD (deviation of the modeled protein structure from the reference structures).
On the following lines are the protein file names, one per line. These files will be used in addition to the Protein files listed before to calculate a root-mean-square deviation (RMSD) between the protein generated during a FITTED docking run and the Protein_Ref files. Additional files may be needed if the protein has a symmetrical structure (e.g., HIV-1 protease).
If this keyword is missing, protein input files will be used as references.
II.7. Run parameters
Mode [Dock|Filter|VS|Score|Local]
Dock
Normal docking run. No ligands are filtered out.
This is the default.
Filter
Filters out structures that do not meet Filter, Optional or Essential groups (see below).
Once filtering is done the program exits.
VS
Filters out structures that do not meet Filter, Optional or Essential groups (see below). If the ligand passes all the filters, the docking is performed; otherwise, FITTED exits. Additional keywords are also provided (see below).
Score
Scores the ligand input structure in the provided orientation against all input proteins.
Local
Performs a local search on the ligand input structure. The provided orientation/translation/conformation is used as a starting point and only slight modifications to the ligand conformation, orientation and translation are carried out.
SAR
Performs a local search on the ligand input structure. The provided orientation/translation/conformation is used as a starting point and only slight modification to the ligand orientation and translation are carried out while a complete search of conformations is done.
Flex_Type [Rigid|Semiflex|Flex_water|Flex]
Rigid
The ligand is docked onto one protein structure.
This is the default if only one protein structure is used.
Semiflex
The ligand is docked onto multiple protein structures (requires Protein ≥ 2). Proteins can be
exchanged during the evolution but not the genes corresponding to side chains or water molecules (a more complete description of this mode is given in reference 1).
This is the default if more than one protein structure is used.
Flex_water
The ligand is docked into multiple protein structures (requires Protein ≥ 2). Similar to
Semiflex, except that each water molecule evolves independently.
Flex
The ligand is docked onto multiple protein structures (requires Protein ≥ 2). The side chains
and waters are allowed to be exchanged independently from the protein backbone.
Number_of_Runs <number of runs>
More than one run per ligand can be performed (the ligand may be docked several times to ensure a complete search).
If this keyword is missing, a single run is done.
The default value is 3 for Dock mode; for all other modes, the default is 1.
II.8. Filtering parameters
The following keywords are used to filter out structures in VS or Filter modes only.
Max_Charge <max_charge>
If a ligand has a net charge higher than max_charge, the program exits.
Default is +2.
Min_Charge <min_charge>
If a ligand has a net charge lower than min_charge, the program exits.
Default is -2.
Max_MW <max_MW>
If a ligand has a molecular weight higher than max_MW, the program exits.
Default is 500.
Min_MW <min_MW>
If a ligand has a molecular weight lower than min_MW, the program exits.
Default is 250.
Max_HBD <max_HBD>
If a ligand has more hydrogen bond donors than max_HBD, the program exits.
Default is 5.
Min_HBD <min_HBD>
If a ligand has fewer hydrogen bond donors than min_HBD, the program exits.
Default is 0.
Max_HBA <max_HBA>
If a ligand has more hydrogen bond acceptors than max_HBA, the program exits.
Default is 10.
Min_HBA <min_HBA>
If a ligand has fewer hydrogen bond acceptors than min_HBA, the program exits.
Default is 0.
Max_Nrot <max_Nrot>
If a ligand has more rotatable bonds than max_Nrot, the program exits.
Default is 6.
Min_Nrot <min_Nrot>
If a ligand has fewer rotatable bonds than min_Nrot, the program exits.
Default is 0.
Max_Ionizable <max_ionizable>
If a ligand has more ionizable groups than max_ionizable, the program exits.
Default is 2.
Min_Ionizable <min_ionizable>
If a ligand has fewer ionizable groups than min_ionizable, the program exits.
Default is 0.
Max_Rings <max_rings>
If a ligand has more rings than max_rings, the program exits.
Default is 10.
Min_Rings <min_rings>
If a ligand has fewer rings than min_rings, the program exits.
Default is 0.
Max_O <max_O>
If a ligand has more oxygen atoms than max_O, the program exits.
Default is 100.
Min_O <min_O>
If a ligand has fewer oxygen atoms than min_O, the program exits.
Default is 0.
Max_N <max_N>
If a ligand has more nitrogen atoms than max_N, the program exits.
Default is 100.
Min_N <min_N>
If a ligand has fewer nitrogen atoms than min_N, the program exits.
Default is 0.
Max_S <max_S>
If a ligand has more sulfur atoms than max_S, the program exits.
Default is 100.
Min_S <min_S>
If a ligand has fewer sulfur atoms than min_S, the program exits.
Default is 0.
Max_Hetero <max_hetero>
If a ligand has more heteroatoms (N, S and O) than max_hetero, the program exits.
Default is 100.
Min_Hetero <min_hetero>
If a ligand has fewer heteroatoms (N, S and O) than min_hetero, the program exits.
Default is 0.
Max_Metal <max_metal>
If a ligand has more heavy atoms other than C, N, O, S, P than max_metal, the program exits.
Default is 0.
Min_Metal <min_metal>
If a ligand has fewer heavy atoms other than C, N, O, S, P than min_metal, the program exits.
Default is 0.
Max_Num_of_Atoms <max_atoms>
If a ligand has more atoms than max_atoms, the program exits.
Default is 10000.
Min_Num_of_Atoms <min_atoms>
If a ligand has fewer atoms than min_atoms, the program exits.
Default is 0.
Filter <#_groups_filtered>
group_filtered1
group_filtered2
Number of functional groups that are filtered out. The name(s) of the filtered functional groups are listed below this keyword (see Table 1).
Optional <#_option_groups>
group_needed1
group_needed2
Number of functional groups of which at least one has to be present. The name(s) of the needed functional groups are listed below this keyword (see Table 1).
Essential <#_essential_groups>
group_needed1
group_needed2
Number of functional groups that are required. The name(s) of the needed functional groups are listed below this keyword (see Table 1).
Table 3. List of groups recognized by FITTED that can be listed after Filter, Optional or Essential.
Aromatic    Acid       Acceptor        Carbamate    Primary Amine
Aldehyde    Lactame    Azide           Ammoniun     Secondary Amine
Ester       Nitrile    Isocyanate      Oxime
Lactone     Imine      Acyl_Chloride   Ketone
Amide       Nitro      Sulphonamide    Boronate
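The filtering rules of this section can be illustrated with a minimal Python sketch. It is not FITTED code: the descriptor names, the function names and the dictionary layout are hypothetical, and only the default ranges are taken from the keyword descriptions above. It assumes the ligand descriptors (charge, molecular weight, etc.) and the list of functional groups have already been computed elsewhere.

```python
# Default (min, max) ranges quoted from the keyword list above.
# The dictionary keys are hypothetical names chosen for this sketch.
DEFAULT_RANGES = {
    "charge": (-2, 2),       # Min_Charge / Max_Charge
    "mw": (250, 500),        # Min_MW / Max_MW
    "hbd": (0, 5),           # Min_HBD / Max_HBD
    "hba": (0, 10),          # Min_HBA / Max_HBA
    "nrot": (0, 6),          # Min_Nrot / Max_Nrot
    "ionizable": (0, 2),     # Min_Ionizable / Max_Ionizable
    "rings": (0, 10),        # Min_Rings / Max_Rings
}


def failed_filters(descriptors, ranges=DEFAULT_RANGES):
    """Return the names of the descriptors that fall outside their range."""
    failed = []
    for name, (low, high) in ranges.items():
        value = descriptors[name]
        if value < low or value > high:
            failed.append(name)
    return failed


def passes_groups(groups, filtered=(), optional=(), essential=()):
    """Apply the Filter / Optional / Essential rules to a set of group names."""
    groups = set(groups)
    if groups & set(filtered):                      # any filtered group rejects
        return False
    if optional and not (groups & set(optional)):   # at least one optional group
        return False
    return set(essential) <= groups                 # every essential group


ligand = {"charge": 0, "mw": 320.4, "hbd": 2, "hba": 5,
          "nrot": 4, "ionizable": 1, "rings": 3}
print(failed_filters(ligand))
print(passes_groups({"Amide", "Aromatic"}, filtered=["Aldehyde"],
                    optional=["Amide", "Ester"], essential=["Aromatic"]))
```

In VS or Filter mode a ligand would be rejected as soon as one of these checks fails; the sketch collects all failures only to make the output easier to read.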
II.9. Conjugate gradient parameters
The default values for all the keywords described in this section are highly recommended.
GA_* or GI_*
There are two sets of the following keywords: one for the parameters used during the generation of the initial population (GI_*; e.g., GI_MaxIter) and another one used during the evolution (GA_*; e.g., GA_MaxIter). The default values are recommended.
XX_MaxIter <maxiter>
o Maximum number of iterations. Once this number is reached the minimization is finished.
o The default is 20.
XX_StepSize <stepsize>
o Initial value of the step taken in the direction of the gradient during minimization.
o The default is 0.02.
XX_MaxStep <maxstep>
o Maximum step size allowed during minimization.
o The default is 1.
XX_EnergyBound <energybound>
o Minimum energy difference between two molecules to be considered similar.
o The default is 1.0 for GI_EnergyBound and 0.001 for GA_EnergyBound.
XX_MaxSameEnergy <maxsameenergy>
o Number of times that the same energy (defined by EnergyBound) can be repeated.
o The default is 3.
XX_MaxGrad <maxgrad>
o Gradient convergence criterion.
o The default is 0.001.
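How the stopping keywords of this section interact can be sketched as follows. This is a hedged illustration, not the FITTED minimizer: it uses plain steepest descent with a fixed step instead of a conjugate gradient, and the function name, argument names and return values are hypothetical. The defaults are the GA_* values listed above.

```python
import math


def minimize(energy, gradient, x, max_iter=20, step_size=0.02, max_step=1.0,
             energy_bound=0.001, max_same_energy=3, max_grad=0.001):
    """Steepest descent with stopping rules named after the XX_* keywords."""
    e_old = energy(x)
    same = 0
    for _ in range(max_iter):                        # XX_MaxIter
        g = gradient(x)
        norm = math.sqrt(sum(gi * gi for gi in g))
        if norm < max_grad:                          # XX_MaxGrad
            return x, e_old, "gradient"
        # Displacement along -gradient, capped at max_step (XX_MaxStep).
        length = min(step_size * norm, max_step)
        x = [xi - length * gi / norm for xi, gi in zip(x, g)]
        e_new = energy(x)
        # Count consecutive steps whose energy change is below XX_EnergyBound.
        same = same + 1 if abs(e_new - e_old) < energy_bound else 0
        e_old = e_new
        if same >= max_same_energy:                  # XX_MaxSameEnergy
            return x, e_old, "energy"
    return x, e_old, "max_iter"


def quadratic(x):
    return sum(v * v for v in x)


def quadratic_gradient(x):
    return [2.0 * v for v in x]


print(minimize(quadratic, quadratic_gradient, [1.0])[2])
```

With the small default number of iterations, the run on this toy quadratic stops on the iteration limit, which is consistent with the minimization being used as a quick local refinement rather than a full optimization.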
II.10. Energy parameters
The default values for all the keywords described in this section are highly recommended.
Score_Initial [none|score|minimize]
Scoring of the initial ligand binding mode.
none
No scoring of the initial input structure is performed.
This is the default setting.
score
Only the score of the initial input ligand is output.
minimize
The score of the initial pose and the score of the energy-minimized structure are output.
VdW [1-4|1-5]
Selects whether 1,4 and/or 1,5 and greater van der Waals interactions should be considered.
1-4
Used to consider 1,4 interactions and above.
This is the default setting.
1-5
Used to consider only 1,5 interactions and above.
VdWScale_1-4 <vdwscale_1-4>
Scaling factor for the 1,4 van der Waals interactions.
The default is 1.0.
VdWScale_1-5 <vdwscale_1-5>
Scaling factor for the 1,5 van der Waals interactions.
The default is 1.0.
E_VdWScale_Pro <e_vdwscale_pro>
Scaling factor for the ligand-protein van der Waals interactions.
The default is 1.0.
E_VdWScale_Wat <e_vdwscale_wat>
Scaling factor for the ligand-water van der Waals interactions.
The default is the value of E_VdWScale_Pro.
Elec [1-4|1-5]
Select whether 1,4 and/or 1,5 and greater electrostatic interactions should be considered.
1-4
Used to consider 1,4 interactions and above.
This is the default setting.
1-5
Used to consider 1,5 interactions and above.
ElecScale_1-4 <elecscale_1-4>
Scaling factor for the 1,4 electrostatic interactions.
The default is 1.0.
ElecScale_1-5 <elecscale_1-5>
Scaling factor for the 1,5 electrostatic interactions.
The default is 1.0.
E_ElecScale_Pro <e_elecscale_pro>
Scaling factor for the ligand-protein electrostatic interactions.
The default is 1.0.
E_ElecScale_Wat <e_elecscale_wat>
Scaling factor for the ligand-water electrostatic interactions.
The default is the value of E_ElecScale_Pro.
HBond [Y|N]
Selects whether or not hydrogen bonds are included in the energy calculation.
The default is Y.
E_HbondScale_Pro <e_hbondscale_pro>
Scaling factor for the ligand-protein hydrogen bond interactions.
The default is 1.0.
E_HbondScale_Wat <e_hbondscale_wat>
Scaling factor for the ligand-water hydrogen bond interactions.
The default is the value of E_HbondScale_Pro.
Cutdist <cutdist>
Cutoff distance (in Å) for the non-bond interactions with the protein.
The default value is 9.
Switchdist <switchdist>
Switching distance (in Å) for the non-bond interactions with the protein.
The default value is 7.
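The role of Cutdist and Switchdist can be illustrated with a short Python sketch. This manual does not give the functional form used by FITTED; the sketch assumes the widely used CHARMM-style switching function, which equals 1 below the switching distance, 0 beyond the cutoff, and decays smoothly in between. The function name is hypothetical.

```python
def switch(r, switchdist=7.0, cutdist=9.0):
    """Scale factor applied to a non-bonded term at distance r (in Å).

    Assumed form: CHARMM-style switching function, not confirmed for FITTED.
    """
    if r <= switchdist:
        return 1.0          # full interaction inside the switching distance
    if r >= cutdist:
        return 0.0          # interaction ignored beyond the cutoff
    on2, off2, r2 = switchdist ** 2, cutdist ** 2, r ** 2
    return (off2 - r2) ** 2 * (off2 + 2.0 * r2 - 3.0 * on2) / (off2 - on2) ** 3


for distance in (5.0, 7.5, 8.0, 8.5, 9.5):
    print(distance, round(switch(distance), 3))
```

Scaling the interaction smoothly between the two distances avoids the energy discontinuity that a sharp cutoff would introduce during minimization.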
Cutdist_Wat <cutdist_wat>
Cutoff distance for the non-bond interactions with the water molecules.
The default value is 1.20.
Switchdist_Wat <switchdist_wat>
Switching distance for the non-bond interactions with the water molecules.
The default is 1.75.
GI_Protein_Nbonds [United|All_Atom]
FITTED will treat protein non-bonded interactions with the ligand as either all atom or united for the generation of the initial population.
The default for this keyword is United.
GA_Protein_Nbonds [United|All_Atom]
FITTED will treat protein non-bonded interactions with the ligand as either all atom or united during the evolution.
The default for this keyword is United.
GA_Protein_Nbonds2 <generation number>
FITTED will switch from united to all atom representation of the non-bonded interactions at this generation.
The default is Max_Gen_2.
Solvation [On|Off]
Allows the user to turn off the calculation of the solvation energy.
The default is On.
Displaceable_Waters [On|Off]
Allows the user to turn off the displaceable waters.
The default is On, which allows displaceable waters.
II.11. Scoring parameters
S_VdWScale_Pro <s_vdwscale_pro>
Scaling factor for the ligand-protein van der Waals score.
The default is 1.0.
S_VdWScale_Wat <s_vdwscale_wat>
Scaling factor for the ligand-water (located in protein structure) van der Waals interactions.
The default is the value of S_VdWScale_Pro.
S_ElecScale_Pro <s_elecscale_pro>
Scaling factor for the ligand-protein electrostatic interactions.
The default is 1.0.
S_ElecScale_Wat <s_elecscale_wat>
Scaling factor for the ligand-water electrostatic interactions.
The default is the value of S_ElecScale_Pro.
S_HbondScale_Pro <s_hbondscale_pro>
Scaling factor for the ligand-protein hydrogen bond interactions.
The default is 1.0.
S_HbondScale_Wat <s_hbondscale_wat>
Scaling factor for the ligand-water hydrogen bond interactions.
The default is the value of S_HbondScale_Pro.
S_PolarSolvation
Scaling factor for the polar solvation energy.
The default is 1.0.
S_nonPolarSolvation
Scaling factor for the non-polar solvation energy.
The default is 1.0.
Water_Loss_Energy <water_loss_energy>
Energy penalty (in kcal/mol) associated with the displacement of a water molecule in the active site during docking.
The default value is 1.0.
II.12. Initial population parameters
Pop_Size <pop_size>
Population size for the genetic algorithm conformational search.
The default is 100 for rigid docking and 200 for flexible docking.
Min_MatchScore <min_matchscore>
This keyword is used only if an interaction site file is provided. If the Mode is set to Dock, Min_MatchScore is automatically calculated.
Minimum match of the interaction sites.
The default is 25.
Min_PharmScore <min_constraint>
This keyword is used only if a pharmacophore file is provided.
Minimum percent match of the pharmacophore.
The default is 100.
Anchor_Atom <anchor_atom>
Sequence number of the atom to be used as an anchor. This is used to identify the center of translation and rotation for the GA.
If this keyword is not specified, the anchor is automatically set to the center of gravity of the ligand.
Anchor_Coor <anchor_x> <anchor_y> <anchor_z>
Following this keyword must be the x, y and z coordinates of the protein active site center.
If this keyword is not used, it is automatically set to the center of the protein active site defined by the active site (flexible) residues.
Max_Tx <max_tx>
Max_Ty <max_ty>
Max_Tz <max_tz>
Maximum value for translation (in Å) in x, y, and z respectively.
The default is 5 for the three values.
GI_Initial_E <gi_initial_e>
Energy value (in kcal/mol) added to the minimized energy of the free ligand to give an upper bound. If the energy of an individual in the initial population/GA is below this number, then the individual is optimized by energy minimization.
The default is 100,000.
GI_Minimized_E <gi_minimized_e>
Energy value (in kcal/mol) added to the minimized energy of the free ligand to give a lower cutoff. If the energy of an individual is below this value after energy minimization then the individual is kept as a part of the initial population.
The default is 1,000 (could be set to values as low as 100).
GI_Num_of_Trials <gi_num_trials>
Maximum number of successive unsuccessful trials before exiting.
The default for Mode Dock is 10,000 and for Mode VS is 1,000.
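The way GI_Initial_E, GI_Minimized_E and GI_Num_of_Trials act together can be sketched as below. This is an illustrative reading of the keyword descriptions, not the FITTED implementation: the callbacks `propose` and `minimize`, the tuple layout and the function name are hypothetical, and the failure counter is assumed to reset after each accepted individual.

```python
def build_initial_population(propose, minimize, e_free, pop_size=100,
                             gi_initial_e=100000.0, gi_minimized_e=1000.0,
                             gi_num_trials=10000):
    """Collect individuals that pass the two energy gates described above.

    propose() returns a (candidate, energy) pair for a random pose;
    minimize(candidate, energy) returns the pair after energy minimization;
    e_free is the minimized energy of the free ligand.
    """
    population, failures = [], 0
    while len(population) < pop_size and failures < gi_num_trials:
        candidate, energy = propose()
        if energy < e_free + gi_initial_e:           # worth minimizing
            candidate, energy = minimize(candidate, energy)
            if energy < e_free + gi_minimized_e:     # good enough to keep
                population.append((candidate, energy))
                failures = 0
                continue
        failures += 1                                # successive failed trial
    return population


state = {"i": -1}


def toy_propose():
    # Every second pose clashes badly and gets a very high energy.
    state["i"] += 1
    return state["i"], (50.0 if state["i"] % 2 == 0 else 1e9)


def toy_minimize(candidate, energy):
    return candidate, energy - 10.0


print(build_initial_population(toy_propose, toy_minimize, e_free=0.0, pop_size=3))
```

The first gate avoids spending minimization time on hopeless poses, while the second gate keeps only individuals whose minimized energy is reasonably close to that of the free ligand.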
Matching_Algorithm [On|Off]
Turns on or off the matching algorithm.
By default, it is set to On.
Num_of_Top_IS <num_of_top_IS>
Number of top-ranked interaction sites considered by the matching algorithm; each interaction site triangle must contain at least one of them.
The default is 10.
Stringent_Triangles <weight_of_triangles>
Factor governing the selection of the triangles. The higher Stringent_Triangles is set, the more the matching algorithm will favour triangles that have not been used.
The default value is 5.
Stringent_MS <stringent_MS>
Weight factor used in the calculation of Min_MatchScore. The higher this value, the stricter Min_MatchScore becomes.
The default value is 4.
Corner_Flap [On|Off]
Turns the corner flap conformational search for rings on or off.
By default, it is set to Off.
II.13. Evolution parameters
Max_Gen <max_gen>
Determines the maximum number of generations for the genetic algorithm.
The default is 200.
CutScore_1 <cutscore_1>
Upper bound on the score at Max_Gen required to continue the docking run. If at least one of the top 3 individuals scores below CutScore_1, the program proceeds to Max_Gen_1.
The default is -4.
Max_Gen_1 <max_gen_1>
This keyword is used in VS mode only.
After Max_Gen generations, if none of the top poses has a score below the one specified by CutScore_1, the program exits. Otherwise, the program proceeds until it reaches Max_Gen_1.
The default is set to be Max_Gen.
CutScore_2 <cutscore_2>
Upper bound on the score at Max_Gen_1 required to continue the docking run. If at least one of the top 3 individuals scores below CutScore_2, the program proceeds to Max_Gen_2.
The default is -5.5.
Max_Gen_2 <max_gen_2>
As for Max_Gen_1, if after Max_Gen_1 generations none of the top poses has a score below the one
specified by CutScore_2, the program exits. Otherwise, the program proceeds until it reaches
Max_Gen_2.
The default is Max_Gen.
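The two-checkpoint logic used in VS mode (Max_Gen, Max_Gen_1, Max_Gen_2 with CutScore_1 and CutScore_2) can be summarized with the following Python sketch. The function name and the `score_at` callback are hypothetical; the defaults and the "one individual within the top 3" rule are taken from the keyword descriptions above.

```python
def generations_to_run(score_at, max_gen=200, max_gen_1=None, max_gen_2=None,
                       cutscore_1=-4.0, cutscore_2=-5.5, top=3):
    """Return the generation at which a VS run stops.

    score_at(gen) is a hypothetical callback returning the scores of the
    population (lower is better) after `gen` generations.
    """
    # Max_Gen_1 and Max_Gen_2 default to Max_Gen.
    max_gen_1 = max_gen if max_gen_1 is None else max_gen_1
    max_gen_2 = max_gen if max_gen_2 is None else max_gen_2

    def top_below(gen, cut):
        best = sorted(score_at(gen))[:top]
        return any(score < cut for score in best)

    if not top_below(max_gen, cutscore_1):
        return max_gen        # abandoned at the first checkpoint
    if not top_below(max_gen_1, cutscore_2):
        return max_gen_1      # abandoned at the second checkpoint
    return max_gen_2          # promising ligand, run to completion


print(generations_to_run(lambda gen: [-1.0, -2.0, -3.5], 50, 100, 200))
print(generations_to_run(lambda gen: [-6.0, -2.0, -1.0], 50, 100, 200))
```

Staging the run this way lets unpromising ligands be discarded early, so that most of the computing time in a screen is spent on ligands that already score well.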
Seed <seed>
Select the starting point within the random number generator. If the same run is done with the same seed, the exact same result will be obtained. If a different seed is used, the GA will follow a different path. Changing the seed helps the developers to evaluate the convergence of a run.
The default is 100.
Max_Rxy <max_rxy>
Max_Ryz <max_ryz>
Max_Rzx <max_rzx>
Maximum value (in degrees) for the mutation of the rotation in their respective planes.
The default is 30 for the three values.
Parent_Selection [Random|Tournament]
Selects how the parents are chosen.
Random
A random individual is selected, then checked to see if it has already been coupled. If it has, another individual is drawn at random. If this occurs 10 times, the last individual drawn is kept as the parent.
Tournament
Tournament_Size random individuals are selected with the best being kept as the parent.
Tournament_Size <tourney size>
The tournament size of the parent selection. Only used with Parent_Selection Tournament.
The default is 5.
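Tournament selection as described above can be sketched in a few lines of Python. The function name is hypothetical, and "best" is assumed here to mean the lowest energy; contestants are drawn without replacement, which is also an assumption.

```python
import random


def tournament_select(energies, tournament_size=5, rng=random):
    """Return the index of the best of tournament_size random individuals."""
    size = min(tournament_size, len(energies))
    contestants = rng.sample(range(len(energies)), size)
    return min(contestants, key=lambda i: energies[i])   # lowest energy wins


population_energies = [3.0, -1.0, 2.0, -7.5, 0.0]
generator = random.Random(0)
print(tournament_select(population_energies, tournament_size=2, rng=generator))
```

A larger Tournament_Size increases the selection pressure: with a tournament as large as the population, the best individual is always chosen.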
Max_Num_SC <max_num_sc>
Maximum number of steric clashes allowed between the flexible side chains of the protein and/or between the water molecules when a composite protein structure is created.
The default is 0.
Max_SC_PP <max_sc_pp>
Maximum distance (in Å) between an atom of one side chain and an atom of another side chain in a composite protein structure for the pair to be considered a clash.
The default is 1.5.
Resolution <resolution>
Select the resolution for the bond rotation during the generation of the initial population. For example, if a resolution of 12 is selected, the bond rotation will occur in multiples of (360/12), or 30 degrees.
The default is 120.
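The relation between Resolution and the allowed torsion increments can be checked with a short Python sketch (the function name is hypothetical):

```python
def allowed_torsions(resolution=120):
    """Return the torsion angles (in degrees) reachable at this resolution."""
    step = 360.0 / resolution
    return [i * step for i in range(resolution)]


print(allowed_torsions(12)[:4])   # 30-degree increments
print(allowed_torsions()[1])      # default resolution of 120
```

With the default value of 120, bonds are therefore rotated in 3-degree increments.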
pLearn <plearn>
Probability of energy minimization of the parents at every generation.
The default is 0.1.
pCross <pcross>
Probability of crossover at every generation.
The default is 0.85.
pMut <pmut>
Probability of mutation at every generation.
The default is 0.05.
pMutRot <pmutrot>
Probability of mutation of the orientation of the ligand at every generation.
The default is 0.30.
pMutWat <pmutwat>
The maximum rate of mutation of the water at Max_Gen generations.
The default is 0.35.
pElite <pElite>
The percentage of the best of the population to be directly passed on to the next generations.
The default is 0.01.
pElite_Every_X_Gen <pElite_Every_X_Gen>
pElite will be used every pElite_Every_X_Gen generations.
The default is 2.
pElite_SSize <pElite_SSize>
The individual to be passed directly onto the next generation will be selected at random from the top pElite_SSize individuals of the population.
The default is 10.
pOpt <popt>
Probability of optimization of the ligand at every generation.
The default is 0.20.
Evolution [Steady_State|Metropolis|Elite]
Steady_State
During the evolution, out of a pair of two children and their 2 parents the two best will be saved.
This is the default.
Metropolis
During the evolution, out of a pair of two children and their 2 parents, two individuals will be saved following the Metropolis criterion. If the children are higher in energy, they are checked to see whether they have a high probability of existing at room temperature; if they do, they are saved.
Elite
During the evolution, the top pop_size individuals of the children and parents will be kept for
the next generation.
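The three replacement schemes can be compared with the Python sketch below, in which individuals are represented only by their energies. It is a simplified illustration under stated assumptions: the function names are hypothetical, the Metropolis test is the textbook Boltzmann acceptance rule, and kT is taken as about 0.593 kcal/mol (roughly 298 K). The exact test used by FITTED is not specified here.

```python
import math
import random

KT_ROOM = 0.593  # k_B * T in kcal/mol at about 298 K (assumed value)


def steady_state(parents, children):
    """Keep the two lowest-energy individuals of the family of four."""
    return sorted(parents + children)[:2]


def metropolis_accept(e_child, e_parent, kt=KT_ROOM, rng=random):
    """Accept a lower-energy child, or a higher one with Boltzmann probability."""
    if e_child <= e_parent:
        return True
    return rng.random() < math.exp(-(e_child - e_parent) / kt)


def elite(parents, children, pop_size):
    """Keep the pop_size lowest-energy individuals of parents and children."""
    return sorted(parents + children)[:pop_size]


print(steady_state([1.0, -2.0], [0.5, -3.0]))
print(metropolis_accept(-5.0, -4.0))
print(elite([3.0, 1.0], [2.0, 0.0], 3))
```

Steady_State and Elite are strictly greedy, whereas Metropolis occasionally keeps a slightly worse child, which can help the population escape local minima.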
GA_Num_of_Trials <ga_num_trials>
Maximum number of successive unsuccessful trials to create children.
The default is 1000.
Diff_Avg_Best <difference_avg_best>
The absolute difference between the average energy of the population and the best individual of the population. If the calculated value is below difference_avg_best then the population is considered
to be converged.
The default is 1.
Diff_N_Best <difference_n_best>
The absolute difference in energy between the individual with the lowest energy and the individual ranked Diff_Number.
If Diff_Number is defined the default value is 0.4.
Diff_Number <number_rank>
The rank of the individual to be used with Diff_N_Best.
By default this criterion is not used.
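The convergence tests defined by Diff_Avg_Best, Diff_N_Best and Diff_Number can be sketched as follows. The function name is hypothetical, and two points are assumptions of this sketch: that the Diff_N_Best test signals convergence when the difference is below the threshold, and that either test alone is sufficient.

```python
def is_converged(energies, diff_avg_best=1.0, diff_n_best=0.4, diff_number=None):
    """Check the population against the two convergence criteria above."""
    ranked = sorted(energies)
    best = ranked[0]
    average = sum(ranked) / len(ranked)
    if abs(average - best) < diff_avg_best:          # Diff_Avg_Best
        return True
    if diff_number is not None and diff_number <= len(ranked):
        # Diff_N_Best: best individual versus the one ranked Diff_Number.
        return abs(ranked[diff_number - 1] - best) < diff_n_best
    return False


print(is_converged([-10.0, -9.8, -9.5, -9.4]))
print(is_converged([-10.0, -9.9, 0.0, 5.0], diff_number=2))
```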
II.14. Output/convergence parameters
Print_Level [0|1|2|3]
Controls the amount of data output.
Print_Structures [Final|Full|None]
Controls the output of the structures during or at the end of the docking.
Final
Only the final structures will be printed.
This is the default.
Full
The structures (protein and ligand) will be printed during the run along with the final structures.
None
No structures will be printed.
Print_Num_Structures <print_num_structures>
Select how many of the top poses are printed as MOL2 files.
The default is 1.
Number_of_Best <number_of_best>
Selects the number of individuals for which the score, energy and RMSD are printed during the run.
The default is 10 in Mode Dock and 1 in Mode VS.
Print_Best_Every_X_Gen <print_best_every_x_gen>
How often to print a summary of the run.
The default is (Max_Gen + 1).
Print_Energy_Full [Y|N]
Controls the printout of the detailed energy contributions.
Y
Print out a breakdown of the energy (bond energy, angle energy, etc.).
This is the default.
N
Print out only the total energy.
MaxSameEnergy_GA <maxsameenergy_ga>
If the lowest energy individual does not change in this many generations, the program exits.
The default is Max_Gen.
II.15. Docking of covalent inhibitors
This feature is under validation.
Covalent_Residue <residue_name>
Following this keyword is the name of the residue with which the covalent inhibitor will react (e.g., SER554). Only CYS and SER are implemented in the current version.
Covalent_Ligand [Only|Both]
Controls the covalent docking. FITTED will automatically identify the aldehyde, boronate or nitrile groups (other groups will eventually be implemented) and assign the proper atom types when covalent poses are considered.
Only
Only covalent poses will be considered.
This is the default.
Both
Covalent and non-covalent poses will be considered concomitantly.
Proton_Moved_To <residue> <atom_name>
The proton will be moved to atom <atom_name> of residue <residue>.
Additional Keywords for ProCESS
The following section lists the keywords, their functions and default values. Gray shading indicates a required keyword; angle brackets <> indicate a numeric value; plain text indicates a text string (such as a file name); square brackets [] indicate a choice of values, with the default shown in italics.
ProCESS keyword files are case-sensitive. Empty lines are allowed, and text after a pound sign (#) is considered a comment.
Although the value of many keywords can be altered, default values should be used unless a specific system requires different settings.
At the end of this section, a typical keyword file can be found.
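The file conventions just described (case-sensitive keywords, optional empty lines, comments after #) can be illustrated with a small Python reader. This is a hypothetical helper written for this manual, not part of ProCESS; it assumes that Protein and Binding_Site are the only keywords followed by a counted list of lines, and the file names in the example are invented.

```python
def parse_keywords(text, list_keywords=("Protein", "Binding_Site")):
    """Parse keyword-file text into a dict; keywords are case-sensitive."""
    lines = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()    # drop comments
        if line:                               # skip empty lines
            lines.append(line)
    settings = {}
    i = 0
    while i < len(lines):
        keyword, *values = lines[i].split()
        i += 1
        if keyword in list_keywords:
            count = int(values[0])             # number of entries that follow
            settings[keyword] = lines[i:i + count]
            i += count
        else:
            settings[keyword] = values[0] if len(values) == 1 else values
    return settings


example = """
# hypothetical ProCESS input
Protein 2
protein_a.mol2
protein_b.mol2   # second structure

Ligand ligand.mol2
Truncate Y
Grid_Size 12.5
"""
print(parse_keywords(example))
```

Values are kept as strings in this sketch; converting them to numbers is left to the code that consumes each keyword.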
II.16. Input/output files
Protein <#_protein_struct>
protein_file1.mol2
protein_file2.mol2
Following the keyword, specify the number of protein structure files to be processed.
On the following lines, specify the protein file names, one per line.
Output output_filename
Name of the output file.
Binding_Site_Cav cavity_filename
Name of the file where to output the binding site cavity.
If this keyword is not present ProCESS will not create a binding site cavity file.
Interaction_Sites pharmacophore_filename
Name of the file where to output the interaction sites definition file.
If this keyword is not present ProCESS will not create an interaction sites definition file.
Binding_Site <#_flex_residues>
flex_residue_1_name
flex_residue_2_name
Manually defines the active site. (The active site can be automatically defined by providing a ligand, see below.)
On the same line following this keyword, specify the number of flexible residues.
On subsequent lines, the residue name/numbers (according to Find_Residues) are specified, one per line.
Rep_file protein_file1.mol2
Specify which file to use as a template.
If omitted, the first file specified in Protein is used.
II.17. Reading the input files and preparing the output protein files
Renumber_Residues <first_residue_number>
Specify the new number for the first residue; the rest will be sequentially renumbered.
This feature is useful if the protein is a multimer, having multiple residues with the same group name (e.g., two Pro60, two Asp25 as in HIV-1 protease).
AutoFind_Site [Y|N]
This function allows the user to have ProCESS automatically find the flexible residues/binding site.
The default is N.
Ligand ligandfile.mol2
Ligand file (in MOL2 format) used to define the active site and its center. It should be in the same frame as the protein.
Ligand_Cutoff <ligand_cutoff>
Protein residues within this cutoff (in Å) are considered part of the binding site.
The default is 5.0.
Truncate [Y|N|auto]
Determines whether the protein will be truncated, keeping only residues within Cutoff of the binding site residues.
The default is Y.
auto
The protein will be truncated keeping residues within cutoff distance of the ligand and not within cutoff distance from the binding site residues.
Cutoff <cutoff>
Any residue that does not have an atom within this distance (in Å) from an atom of a flexible residue or of the given ligand will be deleted from the protein file that ProCESS will output.
The default value is 11 for auto truncation and 9 for Truncate Y.
Find_Residues [Name|Number]
If Binding_Site is used, defines the way in which ProCESS will identify the residues that make up the binding site.
Name
Search residues by group name.
This is the default.
Number
Search residues by group number.
Assign_H [Y|N]
ProCESS requires the advanced hydrogen atom names. This keyword allows ProCESS to rename the hydrogens if they are not assigned correctly.
The default is N.
Assign_G [Y|N]
Allows ProCESS to assign the advanced PDB residue names automatically.
The default is N.
United [Y|N]
This allows the user to select whether the protein will have a united-atom or all-atom representation.
Y is the default.
Coarse [0|1|2|3]
This keyword followed by 0 is for united-atom representation. Other coarse grained representations are under development.
0 is the default.
Assign_Charges [Scaled|None]
Scaled
Scales van der Waals and electrostatics for the flexible residues. This scaling of parameters allows for entropy cost of the binding to the protein to be accounted for (see J. Med. Chem. 2006, 49, 5885 for a more complete description).
This is the default.
None
Assigns default values to charges and van der Waals.
II.18. Parameters for the binding cavity file
Grid_Center <grid_center>
Specifically defines the center of the binding site.
The default is to automatically find it using the center of a ligand.
Grid_Size <size>
Specifies the size of the box for the binding site.
The default is 12.5.
Grid_Boundary [Soft|Hard]
Soft
When converting from the grid to spheres, the boundary of the box will be ignored (defined by Grid_Size) and spheres can include volume outside of the box.
This is the default.
Hard
The active site cavity file will be constrained within the box defined by Grid_Size.
Grid_Resolution <grid_resolution>
Following this keyword is the resolution (Å) of the grid.
The default is 1.5.
Grid_Sphere_Size <grid_sphere_size>
Specifies the size of a sphere used to trim the sides of the box to make it rounder.
The default is Grid_Size.
Grid_Clash <grid_clash>
If a protein atom is within this distance of a grid point, the point is removed from the grid.
The default is 1.5.
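How Grid_Size, Grid_Resolution and Grid_Clash shape the binding cavity grid can be sketched in Python as follows. This is a simplified illustration rather than the ProCESS algorithm: the function name is hypothetical, Grid_Size is assumed to be the edge length of a cubic box centered on Grid_Center, and the conversion of the grid to spheres and the Grid_Sphere_Size trimming are not shown.

```python
import math


def cavity_grid(center, protein_atoms, grid_size=12.5, resolution=1.5, clash=1.5):
    """Return the grid points of a cubic box that do not clash with the protein.

    A point is removed when a protein atom lies within `clash` Å of it.
    """
    half = grid_size / 2.0
    steps = int(half / resolution)
    offsets = [k * resolution for k in range(-steps, steps + 1)]
    points = []
    for dx in offsets:
        for dy in offsets:
            for dz in offsets:
                p = (center[0] + dx, center[1] + dy, center[2] + dz)
                if all(math.dist(p, atom) > clash for atom in protein_atoms):
                    points.append(p)
    return points


empty_box = cavity_grid((0.0, 0.0, 0.0), [])
one_atom = cavity_grid((0.0, 0.0, 0.0), [(0.0, 0.0, 0.0)], clash=1.6)
print(len(empty_box), len(one_atom))
```

The remaining points describe the volume accessible to the ligand; a finer Grid_Resolution gives a more detailed cavity at the cost of more points to process.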
II.19. Parameters for the Interaction sites file
Xxx_Weight <xxx_weight>
This group of keywords (Xxx being Hydrophobic, Metal, HBA or HBD) specifies the parameters for the assignment of pharmacophoric points. xxx_weight is used to give weight to favourable xxx-type interactions. Default parameters are highly recommended.
Hydrophobic_Weight <hydro_weight>
Defines the weight for hydrophobic interaction points.
The default is 1.
Metal_Weight <metal_weight>
Defines the weight for metal interaction points.
The default is 50.
HBA_Weight <hba_weight>
Defines the weight for hydrogen bond acceptor interaction points.
The default is 5.
HBD_Weight <hbd_weight> <hbd_penalty>
Defines the weight for hydrogen bond donor interaction points.
The default is 5.
If too many points are found, one can reduce this number by using the following keywords:
Pharm_Polar_Softness <pharm_polar_soft>
Maximum distance (in Å) between two polar points to merge.
The default is 0.0.
Pharm_Nonpolar_Softness <pharm_nonpolar_soft>
Maximum distance (in Å) between two non-polar points to merge.
The default is 0.0.
Hydrophobic_Level <hydro_level>
Van der Waals interaction energy between a probe on the grid point and hydrophobic carbons below which the point is considered hydrophobic. If the interaction is found to be lower than hydro_level, a hydrophobic point is added at this location. For more information see the section on Interaction Sites/Pharmacophore generation.
The default is -0.3.
Min_Weight <min_weight>
Minimum weight for a pharmacophoric point to be included in the final pharmacophore.
The defaults are 0.5 and 0.0, respectively.
Num_of_IS <num_of_spheres>
This determines the maximum number of interaction site spheres in the interaction sites file.
The default is 75.