
doi:10.1006/jmbi.2001.4551, available online at http://www.idealibrary.com. J. Mol. Biol. (2001) 308, 377-395

FLEXE: Efficient Molecular Docking Considering Protein Structure Variations

Holger Claußen*, Christian Buning, Matthias Rarey and Thomas Lengauer

German National Research Center for Information Technology (GMD), Institute for Algorithms and Scientific Computing (SCAI), Schloß Birlinghoven, 53754 Sankt Augustin, Germany

E-mail address of the corresponding author: [email protected]

Abbreviations used: PDB, Protein Data Bank; RMSD, root-mean-square deviation; CPU, central processing unit; CSD, Cambridge Structural Database.

0022-2836/01/020377-19 $35.00/0

Side-chain or even backbone adjustments upon docking of different ligands to the same protein structure, a phenomenon known as induced fit, are frequently observed. Sometimes point mutations within the active site influence the ligand binding of proteins. Furthermore, for homology-derived protein structures there are often ambiguities in side-chain placement and uncertainties in loop modeling which may be critical for docking applications. Nevertheless, only very few molecular docking approaches have taken into account such variations in protein structures.

We present the new software tool FLEXE which addresses the problem of protein structure variations during docking calculations. FLEXE can dock flexible ligands into an ensemble of protein structures which represents the flexibility, point mutations, or alternative models of a protein. The FLEXE approach is based on a united protein description generated from the superimposed structures of the ensemble. For varying parts of the protein, discrete alternative conformations are explicitly taken into account, which can be combinatorially joined to create new valid protein structures.

FLEXE was evaluated using ten protein structure ensembles containing 105 crystal structures from the PDB and one modeled structure with 60 ligands in total. For 50 ligands (83 %) FLEXE finds a placement with an RMSD to the crystal structure below 2.0 Å. In all cases our results are of similar quality to the best solution obtained by sequentially docking the ligands into all protein structures (cross docking). In most cases the computing time is significantly lower than the accumulated run times for the single structures. FLEXE takes about five and a half minutes on average for placing one ligand into the united protein description on a common workstation.

The example of aldose reductase demonstrates the necessity of considering protein structure variations for docking calculations. We docked three potent inhibitors into four protein structures with substantial conformational changes within the active site. Using only one rigid protein structure for screening would have missed potential inhibitors, whereas all inhibitors can be docked taking all protein structures into account.

© 2001 Academic Press

Keywords: structure-based drug design, flexible protein-ligand molecular docking, protein flexibility, protein ensembles, aldose reductase
*Corresponding author

Introduction

In recent years the search for novel drugs has evolved from a process of trial and error into a sophisticated procedure including several computer-based approaches. In structure-based design the structures of known target proteins are used to discover new compounds of therapeutic relevance. The approaches can be classified roughly into two categories: de novo design and docking. The former method designs new ligands to fit the protein target, whereas the latter is used to decide whether existing compounds possess a good steric and chemical complementarity to the given protein. For reviews on computer-aided structure-based design methods, see Kuntz,1 Rosenfeld et al.,2 Lengauer & Rarey,3 Kubinyi4 and Rarey.5

In order to use computational methods for structure-based design, several assumptions have to be made. For example, early docking tools treated both the ligand and the protein as rigid structures for efficiency reasons. At present, standard applications of current docking tools like DOCK,6 GOLD,7,8 and FLEXX9 use flexible ligands, but keep the protein structure essentially rigid, except for a few terminal H-bond donors and acceptors, and assume one single protein conformation even for complexes with different ligands. Therefore, ligands requiring larger conformational changes within the protein upon binding cannot be placed correctly by these methods.

Using experimentally determined structure ensembles of proteins with different ligands, several types of flexibility are observed.10 The scale ranges from simple side-chain rotations, through smaller adjustments of single loops, up to large movements of complete domains. In some cases there are mutations of single amino acid residues (point mutations) within the active site which may influence the ligand binding. Furthermore, there are many sequences for which no 3D structure is known so far. Although the overall structure of some of these proteins may be derived from homology modeling, there are often ambiguities in side-chain placement and uncertainties in loop modeling which may be critical for docking applications.

Even though in most cases only a few side-chains vary within the active site and the conformational changes tend not to be very large,11 flexibility can be quite critical for docking applications. For example, if a rotatable side-chain fills a sub-pocket of an active site into which a ligand should be docked, then the correct binding mode cannot be predicted without rotating the side-chain. We refer to this as the problem of protein structure variation during docking calculations. There are only a few approaches which focus on different aspects of this problem.

Leach12,13 explores the conformational space of protein side-chains keeping the backbone rigid in order to find the global minimum energy combination of amino acid side-chains and ligand conformations within a specified energy cutoff for a given orientation of the ligand. His approach is based on dead-end elimination and the A*-algorithm to search the large combinatorial space. The examination of conformations near to the global energy minimum shows that most differences arise from changes in side-chain conformations that are essentially independent of each other and only very few result from a concerted change in conformations of two or more residues.

Sandak et al.14-17 model flexibility by the concept of hinge-bending. This is a method adopted from computer vision techniques used for recognizing flexible objects assembled from rigid parts. The hinge-bending approach was originally used to model ligand flexibility, but the roles of the ligand and the protein can be interchanged since the mathematical problem is symmetrical. However, the flexibility can only be considered for one docking partner at a time. The hinges have to be defined manually and they are limited to a small number; therefore, side-chain flexibility is beyond the scope of this approach.

Knegtel et al.18 use weighted averages with respect to energy and geometry over ensembles of protein structures to describe the flexibility of proteins. Their ensembles are sets of crystal and NMR structures from the PDB. Ligand flexibility is not taken into account in this approach. The force-field terms for the particular structures are combined and stored on a single scoring grid that is used in DOCK 3.5.19-22 Repulsive potentials are only included if the potentials from all structures are repulsive. Although several protein structures are used during the calculation, it should be noted that the combinatorial nature of the problem resulting from explicitly distinguished alternative conformations is neglected. The interactions of the particular conformations are blurred by averaging, which may cause maxima in the potential related to unrealistic conformations. In addition, the active site is enlarged by neglecting the repulsive terms.

Schnecke et al.23-26 focus on docking as a screening method. The aim of their screening tools Specitope and Slide is the very fast and rough docking of many ligands from a large database. They use multi-level hashing to match an anchor fragment of the ligand to the template points describing the binding site. The anchor fragment is the part of the ligand which can be matched onto the template points using the ligand conformations given in the database. All flexible bonds within this anchor fragment are rigidified, so this approach depends on the initial conformation of the ligand in the database and on the protein from which the template points are derived. Ligand and protein flexibility are modeled as a post-optimization process by resolving collisions between the placed ligand and the protein by directed rotations of single bonds of either the flexible parts of the ligand or the side-chains of the protein. Mean-field theory is used to decide which rotations to use to resolve collisions and improve shape complementarity. In order to evaluate their approach they created several different binding-site templates from three different HIV-1 protease structures and screened three different ligand databases for putative ligands with low energy of binding. Their results show that, in spite of the handling of protein flexibility, the choice of the structure used as the target is still very important for the screening results.25


Our new software tool FLEXE is able to take into account several side-chain conformations and, to some extent, even loop movements while placing a flexible ligand into the active site. The flexibility of the protein is considered directly during ligand placement and not as a post-optimization. The approach is based on the assumption that the overall structure of the protein and the general shape of the active site have to be conserved due to the binding specificity. Large main-chain variations, such as domain movements, are outside the scope of our model.

The main idea is to describe the protein structure variations with a set of protein structures representing the flexibility, mutations or alternative models of a protein. FLEXE selects the combination of partial structures which best suits the given ligand with respect to the scoring function. Therefore, the variability considered by FLEXE is defined by the differences within the given input structures.

Following Knegtel et al.,18 we refer to the set of protein structures representing the variation as the ensemble, although it is not an ensemble in the strict sense of statistical thermodynamics. However, the set of protein structures contains mostly low-energy conformations of the same protein, which also dominate a thermodynamic ensemble because they are highly populated. In this sense the two terms are related.

FLEXE is based on the so-called united protein description created from the superimposed structures of the ensemble. Similar parts of the structures are merged whereas dissimilar areas are treated as separate alternatives. The different structures of the ensemble are not only considered simultaneously, but can also be combined to form new overall structures during the docking process. This concept can easily be extended to rotamer libraries. Since FLEXE is derived from FLEXX, all substantial concepts like the interaction scheme, the incremental construction algorithm, and the scoring function are adapted to the ensemble approach.

Due to the recombination of protein structures, dependencies between different alternatives within the united protein description occur. These are caused by logical and geometric exclusion, resulting in the concept of incompatibility between structure elements of the protein. Incompatibility is internally represented as a graph. Valid protein structures are independent sets of alternatives within this graph that fulfill certain constraints. During the incremental construction of the ligand, optimal independent sets of alternatives with regard to the binding energy are determined for each partial ligand placement.
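The notion of valid structures as independent sets in an incompatibility graph can be illustrated with a small sketch. It is not the FLEXE implementation: the part names, the per-alternative energies, the constraint "exactly one alternative per variable part" and the brute-force enumeration are all assumptions made for illustration only.

```python
from itertools import combinations, product

def is_independent(selection, incompatible):
    """True if no two selected alternatives share an edge in the incompatibility graph."""
    return not any(frozenset(pair) in incompatible
                   for pair in combinations(selection, 2))

def best_selection(parts, energy, incompatible):
    """Enumerate one alternative per variable part (assumed constraint) and
    return the independent set with the lowest summed energy (lower = better)."""
    best, best_energy = None, float("inf")
    for choice in product(*parts.values()):
        if not is_independent(choice, incompatible):
            continue
        total = sum(energy[alt] for alt in choice)
        if total < best_energy:
            best, best_energy = choice, total
    return best, best_energy

# Hypothetical united description with two variable parts, two alternatives each.
parts = {"part1": ["part1_a", "part1_b"], "part2": ["part2_a", "part2_b"]}
# Hypothetical interaction energies of each alternative with a partial ligand placement.
energy = {"part1_a": -2.0, "part1_b": -1.0, "part2_a": -3.0, "part2_b": -0.5}
# part1_a and part2_a overlap geometrically, so they exclude each other.
incompatible = {frozenset(("part1_a", "part2_a"))}

print(best_selection(parts, energy, incompatible))
```

In this toy case the two individually best alternatives are incompatible, so the selection falls back to the best compatible combination. A real united protein description has far more alternatives, which is why exhaustive enumeration is only suitable as an illustration.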

Here, using FLEXE, the ensemble of structures only models conformers present in the input structures (i.e. between four and 16 structures) and the possible recombinations of them. This may be a limitation in some cases. It is not a limitation in methods based on mean-field theory or the A*-algorithm. However, in principle the ensembles used with FLEXE are not limited to experimentally determined protein structures. For example, structures picked from a molecular dynamics simulation, generated by using rotamer libraries, or ambiguous homology models can be used as well. However, in order to evaluate FLEXE we need to know the correct binding mode. For this reason we use mainly experimentally determined protein structures for validation purposes.

FLEXE has been evaluated with ten protein structure ensembles containing 105 crystal structures from the PDB27 in total. In addition we used one homology model of human aldose reductase. The structures within the ensembles have a highly similar backbone trace, but different conformations for several side-chains, point mutations, and slight variations within loops. Sixty structures contain ligands which are used as a reference for the correct binding mode.

All ligands of an ensemble were docked into the united protein description of the ensemble and the root-mean-square deviation (RMSD) to the ligand's position in the crystal structure was determined. For comparison, all ligands were docked into all structures in the ensemble separately with FLEXX (cross docking) and the solutions merged into one ranking list. The RMSD of the ligands is of the same order of magnitude for both tools. Taking into account the top ten solutions, FLEXE finds a ligand position with an RMSD below 2.0 Å to the reference structure in 67 % and FLEXX in 63 % of the cases. The CPU time for FLEXE for the base placement and the complex construction is about five and a half minutes on average on a Sun UltraSPARC 10 workstation and in most cases is lower than the sum of the corresponding run times for the cross-docking experiment with FLEXX.

The cross docking experiment with aldose reductase, which has a highly flexible active site, shows the necessity of taking into account various protein structures for docking calculations. We docked the three potent inhibitors sorbinil, tolrestat, and zopolrestat, which cause substantial conformational changes within the active site of the enzyme. For all ligands the ensemble approach of FLEXE and the cross docking with FLEXX find good solutions. However, FLEXX can only correctly place tolrestat and zopolrestat in the protein structures which were crystallized together with these ligands. Thus, using FLEXX with just one rigid protein structure for screening would have missed potent inhibitors.

Results

Test data set

The ensembles used with FLEXE are not restricted to experimentally determined protein structures. However, in order to evaluate FLEXE we need a reference for the correct binding mode. Therefore, FLEXE has been evaluated with ten protein ensembles containing 105 crystal structures from the PDB27 and one homology model (see Table 1). Ideal ensembles for this test should meet the following criteria. (i) The structures should have very similar sequences and backbone traces. (ii) There should be different side-chain conformations available originating from complexes with different ligands. (iii) There should be several different ligands. (iv) Point mutations are allowed. They are especially interesting if they influence the binding of different ligands. (v) Flexible loops should be limited to short segments (containing a few amino acid residues).

Not all ensembles here satisfy all criteria. All members of an ensemble have highly similar backbone traces, different conformations for some side-chains or point mutations, and slight variations within some loops. Due to the limitations of available PDB structures, the ligands are often quite similar and sometimes even identical. For the aldose reductase case we combine three PDB structures of the porcine aldose reductase enzyme with a homology model of human aldose reductase (Podjarny, A., van Zandt, M., Krämer, O. & Klebe, G., personal communication). In this case, the overall sequence identity is about 86 % but the active site is highly conserved. The size of the ensembles varies from four to 16 members (see Table 1).

All proteins are either apo structures or complexed with a small molecule. Ligands that are covalently bound or too small (e.g. an azide ion) were disregarded. The remaining 60 ligands are marked in Table 1.

Preparation of the ligands

The preparation of the ligands for FLEXE is performed as for FLEXX28 using SYBYL.29 First, the ligand coordinates of the non-hydrogen atoms were extracted from the original PDB file. They are used as reference for the calculation of the RMSD values later on. Where identical ligands occur in different PDB entries, we used all of them as separate references since they sometimes differ slightly. We obtained the ligand input files by defining correct atom types (including hybridization states) and correct bond types, adding hydrogen atoms, assigning formal charges to each atom, and finally energy-minimizing the reference structure. The energy minimization guarantees a low-energy conformation with suitable bond lengths and angles. This new geometry, and the fact that the minimized structure is not translated according to its ensemble structure, guarantees that there is no implicit docking information about the protein-ligand complex of the PDB structure in the ligand input file. In general, all carboxylic-acid and phosphoric-acid groups are ionized while all amino, amidino, and guanidino groups, but no amide groups, are protonated.

Preparation of the ensemble structures

For each structure, the description of an ensemble contains the definition of the protein atoms (via chain identifiers and hetero groups), the resolution of ambiguities in the PDB file (alternate location indicators, etc.), the location of hydrogen atoms at hetero atoms, and the definition of the active site atoms. The assignment of hydrogen positions is made on the basis of default rules except for the definition of the torsion angles at the hydroxyl groups of the amino acid residues serine, threonine, tyrosine, and the hydrogen position inside the histidine side-chain. Here, torsion angles (either 0°, 180° for tyrosine; 60°, 180°, 300° for serine, threonine) and the optimal tautomeric histidine state are selected by visual inspection of the protein. The side-chains of lysine and arginine residues are protonated and the carboxylate groups of aspartic and glutamic acid are ionized. Water molecules contained in the PDB file have been removed.

In order to define the active sites of the proteins, all members of an ensemble are superimposed together with their reference ligand structure. All atoms, including metal ions, are selected that are located less than 6.5 Å from an atom of any ligand of the ensemble. The active site is therefore defined by the union of all ligands of the ensemble. In addition, the complete amino acid is selected if at least one of its atoms is picked.
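The selection rule above can be sketched in a few lines. This is a minimal illustration assuming NumPy and already superimposed coordinates; the function name, argument layout and example coordinates are hypothetical and not taken from FLEXE.

```python
import numpy as np

def select_active_site(protein_xyz, residue_ids, ligand_xyz, cutoff=6.5):
    """Return indices of protein atoms belonging to the active site.

    protein_xyz : (N, 3) coordinates of all protein atoms (incl. metal ions)
    residue_ids : length-N residue label per protein atom
    ligand_xyz  : (M, 3) coordinates of the atoms of ALL ligands of the
                  ensemble, pooled after superposition
    """
    protein_xyz = np.asarray(protein_xyz, dtype=float)
    ligand_xyz = np.asarray(ligand_xyz, dtype=float)
    # Pairwise distances between every protein atom and every ligand atom.
    dist = np.linalg.norm(protein_xyz[:, None, :] - ligand_xyz[None, :, :], axis=2)
    close = (dist < cutoff).any(axis=1)
    # A residue is taken completely as soon as one of its atoms is picked.
    picked_residues = {res for res, hit in zip(residue_ids, close) if hit}
    return [i for i, res in enumerate(residue_ids) if res in picked_residues]

# Toy example: residue "A1" has one atom within 6.5 A and one atom outside,
# residue "A2" is far away from the ligand.
protein = [(0.0, 0.0, 0.0), (12.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
residues = ["A1", "A1", "A2"]
ligands = [(5.0, 0.0, 0.0)]
print(select_active_site(protein, residues, ligands))
```

The second atom of residue "A1" lies 7 Å from the ligand atom but is still selected, because the whole residue is taken once any of its atoms is within the cutoff.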

The superimposed protein structures and reference ligand positions are stored for later cross docking experiments with FLEXX using the same definitions of active sites.

Evaluation of FLEXE

We docked all ligands of each ensemble into the united protein structure with FLEXE. The results are shown in Table 2. For each ligand the RMSD of the first solution, the best solution within the first ten, and the best solution within all solutions is given. In addition, the rank of the best prediction and the ranks of the first predictions with an RMSD below 2.0 Å and 2.5 Å, respectively, are shown. In the literature, an RMSD threshold of 2.0 Å8,28 is usually used for acceptable docking solutions. The increased cutoff of 2.5 Å takes into account the fact that the geometry of the united protein description may be slightly distorted. This can occur because the instances are clustered while building the united protein description using a maximum distance of 1.0 Å.
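The quantities reported per ligand can be made concrete with a short sketch. It assumes that the RMSD is computed over corresponding non-hydrogen atoms in the same order and without any further superposition; the function names and the example rank list are illustrative assumptions, not values or code from the study.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered atom lists."""
    if len(coords_a) != len(coords_b):
        raise ValueError("both placements need the same number of atoms")
    squared = sum((x - y) ** 2
                  for p, q in zip(coords_a, coords_b)
                  for x, y in zip(p, q))
    return math.sqrt(squared / len(coords_a))

def first_rank_below(rmsds_by_rank, threshold):
    """1-based rank of the first solution with RMSD below threshold, else None."""
    for rank, value in enumerate(rmsds_by_rank, start=1):
        if value < threshold:
            return rank
    return None

def best_within(rmsds_by_rank, n=10):
    """Smallest RMSD among the first n ranked solutions."""
    return min(rmsds_by_rank[:n])

# Made-up RMSD values (in A) of ranked solutions for one ligand.
ranked = [2.13, 2.40, 2.20, 2.60, 3.00, 2.10, 1.75]
print(first_rank_below(ranked, 2.0), first_rank_below(ranked, 2.5), best_within(ranked))
```

A return value of `None` corresponds to the "-" entries in Table 2, where no solution below the threshold exists.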

The RMSD of the first solution is often used to rate the performance of a docking tool because in practical screening applications one cannot inspect several placements of a large set of ligands. However, this placement depends not only on the docking algorithm but also highly on the scoring function. Here, we present a new approach for handling protein structure variations using one of the various scoring functions which are proposed for docking applications.9,30-32 None of these scoring functions performs well under all circumstances.33 Therefore, one cannot expect to find the best solution ranked first. Since improving the scoring function is out of the scope of this work, we discuss the quality of the docking on the basis of the best RMSD within the first ten solutions. We do this in order to reduce the influence of the scoring function on the one hand, while on the other hand keeping the number of solutions manageable.

Table 1. Ensembles in the test data set

| Ensemble | PDB codes^a | RMSD (Å)^b | Flexible^c parts/mutations |
|---|---|---|---|
| Aldose reductase (4 pdb, 3 ligands) | 1ah0^a 1ah3^a 1ah4 homology model^a | 0.2-0.6 | Phe122, Leu124, Val130, Val297-Ala/Thr304 |
| Alpha-momorcharin (7 pdb, 4 ligands) | 1mri 1ahc 1mrh^a 1mrg^a 1aha^a 1ahb^a 1mom | 0.2-0.2 | Leu/Val64, Asn/Asp110, Asn68-Met72, Glu85, Glu112, Arg122 |
| Carboanhydrase II (16 pdb, 8 ligands) | 2cbb 1h9n 1hec 1mua 1uga 1ugd 1bcd^a 1cil^a 1cnw^a 1cnx^a 1cra^a 1ray 1cam^a 1caz^a 1ugb 1zsb^a | 0.1-0.3 | His64, Ala/Phe/Ser65, Asn67, Gln92, Gln/Glu106, Asp/His119, Lys133, Val135, Gln136, His/Leu198, Ala/Thr199, Ala/Pro202, Cys206 |
| Carboxypeptidase (16 pdb, 7 ligands) | 5cpa 1arl 1yme 3cpa^a 4cpa^a 6cpa^a 7cpa^a 8cpa 1bavA 1bavB 1bavC 1bavD 1cbx^a 1cps^a 2ctb 2ctc^a | 0.1-0.5 | His69, Arg71, Arg127, Arg145, Ser162-Thr164, Ser194-Ser199, Leu203, Leu206, Tyr208, Ile243, Ile247-Ile255, Thr268, Glu270, Thr274 |
| Dihydrofolate reductase (12 pdb, 12 ligands) | 1dyh^a 1dyi^a 1dyj^a 1jol^a 1ra2^a 1ra3^a 3drc^a 1dhj^a 1dra^a 1drb^a 2drc^a 4dfr^a | 0.1-0.7 | Leu4, Leu8, Ile14, Gly15, Ala19, Met20, Phe/Trp22, Asp/Cys/Ser27, Trp30, Lys32, Arg33, His45-Leu54 |
| Isocitrate dehydrogenase (14 pdb, 6 ligands) | 1idf 1idd 6icd 7icd 1ika 4icd 5icd^a 9icd 1ide^a 1idc^a 1gro^a 1grp^a 8icd^a 1iso | 0.3-0.6 | Leu103-Thr105, Arg119, Asp/Glu/Ser113-Ala117, Arg129, Asn155, Phe/Tyr160, Asn303, Leu304, Asp307, Ser310, Asp311, Glu336, Thr338-Gly340 |
| Mandelate racemase (6 pdb, 4 ligands) | 2mnr 1mdr^a 1mns 1mdl^a 1mra^a 1dtn^a | 0.1-1.1 | Val22-Ala25, Val29, Arg/Lys166, Asn197, Asn248, Asn/Asp270, Gln/Glu317, Leu321 |
| Ricin (9 pdb, 3 ligands) | 1rtc 1obs 2aai 1fmp^a 1apg^a 1ifu 1ifs 1ift 1obt^a | 0.3-0.4 | Asn78, Tyr80, Val82, Asp96, Asn122, Asp124, Arg125, Gln173, Arg/Gly/His180, Ile205, Thr206, Asn209, Ser210, Arg213, Arg258 |
| Seryl tRNA synthetase (6 pdb, 4 ligands) | 1sryA 1sryB 1sesA^a 1sesB^a 1setA^a 1setB^a | 0.3-0.5 | Arg157, Gln211, Glu227, Arg256, Glu258, Met270-Glu279, Glu345, Ser348, Asn378, Asn379, Leu382, Arg386, Ile387 |
| Trypsin (16 pdb, 9 ligands) | 3ptn 2ptc 1tld 1tpo 1taw 1max 1ppc^a 1pph^a 1tng^a 1tnh^a 1tni^a 1tnj^a 1tnk^a 1tnl^a 1tpp 3ptb^a | 0.2-0.3 | Asn97, Thr98, Leu99, Gln175, Lys188, Gln192, Ser217, Gln221, Lys224, Tyr228 |

a The ligands of all marked PDB codes are used as reference structures.
b The range of the mean RMSD of the backbone atoms within the active site.
c Parts with more than one instance in the active site of the united protein description.

The minimal RMSD found is also given in order to estimate the optimal RMSD that could be achieved with an ideal scoring function. The rank of this best solution is an indicator for the quality of the scoring function. However, one should keep in mind that this placement is not itself independent of the scoring function, because the scoring function is used during the docking calculation to choose the optimal independent set of instances and to rank the partial solutions.

The rank of the first solution with an RMSD below 2.0 Å shows how many solutions would have to be scanned to find a good placement. For practical applications this rank is therefore much more important than the rank of the best solution. Table 3A summarizes the number of solutions with minimal RMSD and below 1.0, 1.5, 2.0 and 2.5 Å, respectively. This table reveals that for 83 % of the ligands, a placement with an RMSD below 2.0 Å is found with FLEXE considering all solutions. This first hit is on average ranked 25th with a standard deviation of 60 ranks, whereas the best prediction

Table 2. FLEXE results

| Ensemble | Ligand | RMSD (Å), rank 1 | Min. RMSD (Å), ranks 1-10 | Best solution, RMSD (Å) | Best solution, rank | Rank of first sol. < 2.0 Å | Rank of first sol. < 2.5 Å |
|---|---|---|---|---|---|---|---|
| Aldose reductase (good) | SBI.1ah0 | 0.58 | 0.54 | 0.54 | 2 | 1 | 1 |
| | TOL.1ah3 | 1.09 | 1.05 | 1.05 | 5 | 1 | 1 |
| | ZST.model | 6.74 | 6.72 | 0.64 | 133 | 133 | 133 |
| Alpha-momorcharin (good) | FMC.1mrh | 2.13 | 1.75 | 0.80 | 119 | 7 | 1 |
| | ADN.1mrg | 4.30 | 1.11 | 0.97 | 13 | 10 | 10 |
| | ADE.1aha | 3.27 | 0.85 | 0.76 | 13 | 10 | 7 |
| | FMP.1ahb | 1.60 | 1.42 | 0.85 | 116 | 1 | 1 |
| Carboanhydrase II (satisfactory) | FMS.1bcd | 3.33 | 1.47 | 1.47 | 2 | 2 | 2 |
| | ETS.1cil | 2.76 | 2.69 | 2.35 | 95 | - | 95 |
| | EG1.1cnw | 7.86 | 7.86 | 6.83 | 111 | - | - |
| | EG2.1cnx | 4.70 | 4.70 | 4.10 | 65 | - | - |
| | TRI.1cra | 5.84 | 0.82 | 0.79 | 82 | 5 | 5 |
| | BCT.1cam | 6.82 | 2.51 | 1.96 | 12 | 12 | 11 |
| | ACY.1caz | 2.18 | 1.18 | 0.96 | 41 | 2 | 1 |
| | AZM.1zsb | 6.48 | 3.16 | 1.85 | 302 | 240 | 82 |
| Carboxypeptidase (satisfactory) | G-Y.3cpa | 1.81 | 1.69 | 1.05 | 210 | 1 | 1 |
| | GLY.4cpa | 3.20 | 3.17 | 1.63 | 402 | 237 | 42 |
| | ZAF.6cpa | 7.35 | 7.33 | 7.31 | 53 | - | - |
| | FVF.7cpa | 6.51 | 5.37 | 5.08 | 24 | - | - |
| | BZS.1cbx | 6.40 | 6.03 | 1.53 | 67 | 16 | 16 |
| | CPM.1cps | 4.97 | 1.02 | 1.00 | 21 | 2 | 2 |
| | LOF.2ctc | 2.44 | 2.32 | 1.72 | 140 | 140 | 1 |
| Dihydrofolate reductase (good) | DZF.1dyh | 2.21 | 2.00 | 1.86 | 60 | 4 | 1 |
| | FOL.1dyi | 2.17 | 1.84 | 1.81 | 57 | 4 | 1 |
| | DDF.1dyj | 5.43 | 1.63 | 1.58 | 123 | 8 | 8 |
| | FFO.1jol | 8.49 | 5.37 | 5.21 | 43 | - | - |
| | FOL.1ra2 | 2.30 | 1.91 | 1.91 | 4 | 4 | 1 |
| | MTX.1ra3 | 1.50 | 0.81 | 0.53 | 59 | 1 | 1 |
| | MTX.3drc | 1.21 | 0.50 | 0.50 | 7 | 1 | 1 |
| | MTX.1dhj | 1.12 | 0.67 | 0.46 | 11 | 1 | 1 |
| | MTX.1dra | 1.10 | 0.70 | 0.59 | 51 | 1 | 1 |
| | MTX.1drb | 1.23 | 0.98 | 0.64 | 65 | 1 | 1 |
| | MTX.2drc | 1.05 | 0.97 | 0.61 | 36 | 1 | 1 |
| | MTX.4dfr | 1.32 | 0.66 | 0.65 | 60 | 1 | 1 |
| Isocitrate dehydrogenase (good) | ICT.5icd | 4.35 | 1.47 | 1.12 | 21 | 8 | 8 |
| | ICT.1ide | 4.29 | 1.39 | 1.01 | 77 | 7 | 7 |
| | OXS.1idc | 3.77 | 1.91 | 1.68 | 16 | 9 | 9 |
| | ICT.1gro | 4.29 | 1.53 | 1.05 | 99 | 9 | 9 |
| | ICT.1grp | 4.38 | 2.91 | 0.89 | 90 | 11 | 11 |
| | ICT.8icd | 4.25 | 1.67 | 1.00 | 109 | 7 | 7 |
| Mandelate racemase (good) | SAA.1mdr | 1.85 | 0.99 | 0.56 | 99 | 1 | 1 |
| | SMN.1mdl | 2.54 | 1.45 | 0.83 | 175 | 8 | 8 |
| | SAA.1mra | 1.08 | 0.82 | 0.54 | 27 | 1 | 1 |
| | SAA.1dtn | 1.95 | 0.86 | 0.30 | 168 | 1 | 1 |
| Ricin (satisfactory) | FMP.1fmp | 5.42 | 2.71 | 1.29 | 200 | 40 | 11 |
| | A-G.1apg | 8.22 | 4.44 | 3.26 | 91 | - | - |
| | AMP.1obt | 3.11 | 3.11 | 1.50 | 298 | 238 | 38 |
| Seryl tRNA synthetase (satisfactory) | AHX.1ses | 3.28 | 2.62 | 2.46 | 149 | - | 47 |
| | AMP.1ses | 5.87 | 2.48 | 2.45 | 58 | - | 8 |
| | SSA.1set | 2.35 | 1.95 | 1.92 | 20 | 5 | 1 |
| | SSA.1set | 2.54 | 2.28 | 2.12 | 14 | - | 2 |
| Trypsin (good) | NAS.1ppc | 3.04 | 1.98 | 1.10 | 16 | 4 | 2 |
| | TOS.1pph | 3.75 | 3.75 | 1.03 | 33 | 33 | 33 |
| | AMC.1tng | 0.70 | 0.70 | 0.70 | 1 | 1 | 1 |
| | FBA.1tnh | 0.55 | 0.48 | 0.48 | 3 | 1 | 1 |
| | PBN.1tni | 0.96 | 0.96 | 0.84 | 15 | 1 | 1 |
| | PEA.1tnj | 1.86 | 0.61 | 0.61 | 2 | 1 | 1 |
| | PRA.1tnk | 1.97 | 0.57 | 0.57 | 8 | 1 | 1 |
| | TPA.1tnl | 0.93 | 0.89 | 0.89 | 2 | 1 | 1 |
| | BEN.3ptb | 0.63 | 0.27 | 0.27 | 5 | 1 | 1 |

For each ligand, the RMSD of the first solution and the best RMSD within the first ten solutions are listed together with the RMSD and the rank of the best prediction and the rank of the first solution with an RMSD below 2.5 Å.


Table 3. Statistics on results

| RMSD ≤ (Å) | 1.0 | 1.5 | 2.0 | 2.5 | Min |
|---|---|---|---|---|---|
| A. FLEXE first solution on any rank | | | | | |
| # hits | 28 | 39 | 50 | 54 | 60 |
| Per cent | 46.7 | 65.0 | 83.3 | 90.0 | 100.0 |
| Avg. rank | 24.0 | 25.2 | 24.7 | 11.9 | 73.3 |
| SD | 37.8 | 60.6 | 60.2 | 25.0 | 80.8 |
| B. FLEXE first solution on top ten ranks | | | | | |
| # hits | 20 | 29 | 40 | 43 | 60 |
| Per cent | 33.3 | 48.3 | 66.7 | 71.7 | 100.0 |
| C. FLEXX first solution on top ten merged ranks | | | | | |
| # hits | 25 | 35 | 38 | 46 | 60 |
| Per cent | 41.7 | 58.3 | 63.3 | 76.7 | 100.0 |

This table summarizes the number of ligands for which a solution below 1.0, 1.5, 2.0, 2.5 Å, and with minimal RMSD is found, in absolute numbers and as a percentage, considering all solutions of FLEXE (A), the first ten solutions of FLEXE (B) and the first ten solutions of the merged ranking list of FLEXX (C). In addition, the average rank and the standard deviation (SD) of the rank are given for (A).


is on average ranked 73rd with a standard deviation of 81 ranks. This low rank and the large standard deviation indicate that there are problems with the scoring function. The first hit (RMSD below 2.0 Å) is sometimes not within the first ten solutions; therefore Table 3B shows the same statistics taking into account only the first ten solutions predicted by FLEXE: for 67 % of the ligands, a placement with an RMSD below 2.0 Å is found. Since all of these predictions are within the first ten solutions, no statistics on the rank are given.
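The kind of summary reported in Table 3 (number of hits, percentage, average rank and standard deviation of the rank) can be sketched as follows. The rank list is made up for illustration, and the use of the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, stdev

def summarize_hits(first_hit_ranks, top=None):
    """Summarize first-hit ranks; None means no solution below the threshold.

    If top is given, only hits ranked within the first `top` solutions count.
    """
    n_ligands = len(first_hit_ranks)
    hits = [r for r in first_hit_ranks
            if r is not None and (top is None or r <= top)]
    return {
        "hits": len(hits),
        "percent": round(100.0 * len(hits) / n_ligands, 1),
        "avg_rank": round(mean(hits), 1) if hits else None,
        # Sample standard deviation (assumed estimator).
        "sd": round(stdev(hits), 1) if len(hits) > 1 else None,
    }

# Hypothetical first-hit ranks for six ligands (not data from Table 2).
ranks = [1, 1, 133, None, 7, 10]
print(summarize_hits(ranks))          # considering all solutions
print(summarize_hits(ranks, top=10))  # considering only the first ten
```

As a cross-check on the table itself, 50 hits among 60 ligands corresponds to `round(100 * 50 / 60, 1)`, that is 83.3 %, and 40 hits to 66.7 %.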

We do not discuss all ensembles in detail, butclassify the results into two groups: good solutionswhich are mostly correct and satisfactory solutionswhich partially falied.

Good solutions

For the ensembles of aldose reductase, alpha-momorcharin, dihydrofolate reductase, isocitrate dehydrogenase, mandelate racemase and trypsin FLEXE works very well. The first solution with an RMSD below 2.0 Å occurs frequently on rank 1 and the best RMSD within the first ten solutions is often around 1.5 Å or less. There is one ligand (FFO.1jol) for dihydrofolate reductase for which no solution below 2.0 Å is found and two others where the first solution with an RMSD below 2.0 Å is ranked far down the list: ZST.model (aldose reductase) and TOS.1pph (trypsin).

For dihydrofolate reductase there are two types of ligands. The first type contains the methotrexates (MTX) of different PDB structures, which differ slightly from each other, and the second type consists of the five folate derivatives (see Figure 1). The heterocycles of the latter ligands are all placed in a binding mode similar to the heterocycle of the methotrexate, which lies rotated by 180° in comparison to the folate crystal structure (see Figure 2). Therefore, the RMSDs of the folate placements are about one ångström higher than the RMSDs of the methotrexates. The reason for this rotation lies in the missing water molecules, which as yet cannot be handled in FLEXE.

Satisfactory solutions

These include carboanhydrase II, carboxypeptidase, ricin and seryl tRNA synthetase. For several ligands there is no solution with an RMSD below 2.0 Å within the first ten solutions and the better ones are ranked low. For some ligands even the best prediction has an RMSD of more than 2.0 Å and for five ligands no placement with less than 2.5 Å RMSD is found at all.

Several effects lead to these results. The first is a ranking problem which can be seen with the ligands AZM.1zsb, GLY.4cpa, or AMP.1obt, where the first solution below 2.5 Å as well as the best one are ranked far down the list. Secondly, there are ligands which are quite large and have more than ten rotatable bonds: EG1.1cnw (14), ZAF.6cpa (14), or FVF.7cpa (17) (see Figure 1). Such large ligands lead to many solutions and are also problematic for most other docking tools including FLEXX.28 Finally, the ligand A-G.1apg is not completely bound to the protein in the crystal structure; about one half of the ligand lies outside the active site.

Comparison with FLEXX

In order to take into account protein structure variations with a docking tool in which the protein structure is kept rigid, a ligand has to be docked sequentially into all different protein structures (cross docking)28 and the solutions of the particular docking runs have to be combined in an automatically reproducible way. Picking the protein structures leading to the solution with minimal RMSD is not possible because the reference is usually unknown. Therefore, we merge the particular solution lists predicted by FLEXX and re-sort the predictions according to the scores.

FLEXE is derived from FLEXX as an integrated approach for efficient handling of protein structure variations using the same scoring function as FLEXX. Therefore, the quality of the predictions by FLEXE in terms of RMSD should be comparable with the quality of solutions selected from such a sequential cross docking.

We docked all ligands of all ensembles with FLEXX separately into each of the superimposed structures of the ensemble (cross docking) using the same reference structure and definition of the active site as in FLEXE. This yielded 727 complexes in total. For each ligand, we merged the solutions predicted for all ensemble structures and reranked them according to their score. Table 3C summarizes the results taking into account the first ten solutions of this merged ranking list.
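The merging step can be sketched as follows. The data layout, the function name and the example values are hypothetical, and the sketch assumes that a lower (more negative) score denotes a better solution.

```python
def merge_rankings(runs, top=10):
    """Merge per-structure solution lists and re-rank them by score.

    runs : dict mapping a protein structure name to a list of
           (score, rmsd) tuples from one docking run
    Returns the `top` best-scored (score, rmsd, structure) tuples.
    """
    merged = [(score, rmsd, structure)
              for structure, solutions in runs.items()
              for score, rmsd in solutions]
    merged.sort(key=lambda solution: solution[0])  # lower score ranks first (assumed)
    return merged[:top]

# Made-up cross docking result of one ligand in two protein structures.
runs = {
    "structure_A": [(-30.0, 5.1), (-25.0, 1.2)],
    "structure_B": [(-28.0, 0.9), (-10.0, 3.0)],
}
for rank, (score, rmsd, structure) in enumerate(merge_rankings(runs, top=3), start=1):
    print(rank, score, rmsd, structure)
```

In this toy example the top-ranked merged solution has an RMSD of 5.1 Å although one individual run contains a 0.9 Å placement, which shows how scores obtained in different protein structures determine what reaches the top of the merged list.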

For 38 ligands (63.3 %) there is a solution with an RMSD below 2.0 Å. For FLEXE, this number is slightly higher (40 ligands, 66.7 %), but FLEXX finds significantly more solutions below 1.5 Å than FLEXE. We see the reason for this in the distortion of the ensemble structures by clustering. Accordingly, the results are quite similar for the larger thresholds.

Figure 1. Chemical formulae of the ligands mentioned in the text. The ligands within a box belong to the same ensemble (see also Table 1).

Figure 2. Binding mode of folate and methotrexate. The hetero-cycle of the folate (red) is docked in a binding mode similar to the methotrexate (blue), which is rotated by 180° in comparison to the crystal structure of the folate (green).

Again, we do not discuss all ensembles in detail, but take some cross docking experiments as examples. The results are represented in color-coded matrices (Figures 3-5). The ligands correspond to the rows, the protein structures to the columns. The first column contains the results for the ensemble used in FLEXE, the second column the outcome of the merging approach, and the other columns the values for the particular docking runs with FLEXX. Note that the merging approach does not necessarily find the minimal RMSD solution of the particular docking runs due to the different scores of the ligands within the various protein structures.

Dihydrofolate reductase

The dihydrofolate reductase is a nice example for which the FLEXE approach works well (see Figure 3). With one exception (FFO.jol), the RMSDs of the ligands docked with FLEXE are comparable to or better than the best predictions in the merged ranking list of FLEXX. For the ligands DDF.1dyj, FOL.ra2, and MTX.ra3, FLEXE finds significantly better solutions than the predictions from the merged ranking list of FLEXX, although there are comparable predictions among the individual docking runs from FLEXX, which obviously do not score well enough to reach the top-ten ranks in the merged solution list.

Again, the two types of ligands, namely the methotrexates (MTX) and the five folate derivatives, can be easily distinguished due to the color coding in the matrix. The higher RMSDs of the folate derivatives are caused by a misplacement of their hetero-cycles, which are placed by both tools in a binding mode similar to the hetero-cycles of the methotrexate. As explained before, this misplacement is due to missing water molecules. Since they cannot yet be handled by FLEXE, water was also excluded for the FLEXX runs to allow for comparability.

The cross-docking experiment also shows that some protein structures yield better results for most ligands than others (1dhj or 1drb for example). Since all structures have a similar resolution it would be difficult to choose the right protein structure in advance.

In the test case described above, FLEXX finds a solution for at least one protein structure. In order to analyze whether FLEXE still produces a good prediction if this is not the case, we created an additional ensemble for dihydrofolate reductase containing only the three PDB structures 1dhj, 1drb and 1jol, which performed worst with FLEXX, and repeated the cross docking experiment. Figure 4 compares the results of this experiment with the results produced with the complete ensemble.

The predictions of FLEXE with the reduced ensemble are slightly worse than for the complete ensemble of dihydrofolate reductase structures, whereas the predictions of the merging approach with FLEXX are significantly worse compared to those of the whole ensemble, especially for the methotrexates (MTX). With the reduced ensemble, the binding modes of all methotrexates would therefore have been found only by FLEXE. This experiment shows that FLEXE is able to recombine different structures to new conformations that are more suitable for docking a particular ligand. The conformational space taken into account by FLEXE is much greater than that covered by the separate ensemble structures. FLEXE is therefore more robust against local unfitness than the merging approach using FLEXX because local unfitness can be compensated for by alternatives during ligand placement. This is impossible when merging the solutions afterwards.

Aldose reductase

Aldose reductase catalyzes the reduction of glucose to sorbitol. As this reaction is believed to be linked to the pathogenesis of diabetic complications affecting the nervous, renal, and visual systems, the development of therapeutic agents has attracted intense effort.

Two areas of the active site of the aldose reductase are involved in ligand binding: (i) a recognition region for hydrogen bonds and (ii) a hydrophobic contact zone. Upon binding to different ligands, the aldose reductase opens a specificity pocket in the hydrophobic area which can alter its shape by adopting different conformations. This flexibility can explain the large variety of possible substrates of aldose reductase.34

We used an ensemble of four protein structures for the cross docking experiment. Three of these were the crystal structures of porcine aldose reductase (1ah0, 1ah3, 1ah4), all crystallized by Urzhumtsev et al.34 The structures 1ah0 and 1ah3 are complexed with the potent inhibitors sorbinil (SBI.1ah0) and tolrestat (TOL.1ah3), respectively (Figure 1), whereas the PDB entry 1ah4 contains the native holoenzyme. We combined these structures with a homology model of the complex of the inhibitor zopolrestat (ZST.model) with the human aldose reductase (Podjarny, A., van Zandt, Krämer, O. & Klebe, G., personal communication). For this complex only the coordinates of the Cα atoms are currently published (1mar).35 The human aldose reductase has an overall sequence identity of about 86 % to the porcine sequence, but has a very conserved active site.

Figure 3. Cross-docking dihydrofolate reductase. The color-coded Table shows the RMSD of the best prediction within the top-ten solutions for each ligand (row) predicted by FLEXE (column 1), the merged ranking list (column 2) and by FLEXX for the particular ensemble structures.

Figure 6(a) shows the active site of the united protein description containing the three inhibitors as given in the reference structure. There are two different binding regions: the hydrophobic contact zone, mainly between the tryptophan residues Trp20, Trp79, Trp111 and Trp219, and the amino acid residues Thr48, His110 and Trp111, which form hydrogen bonds to the ligands. This part of the active site is highly conserved over the four ensemble structures and is merged to one structure by clustering in the united protein description.

The specificity pocket is the flexible region of the active site where even backbone movements occur (loop Ala299-Cys303). The pocket is closed by Phe122 and Leu300 when binding sorbinil (conf. 2) and opens in two different ways to accommodate tolrestat (conf. 1) or zopolrestat (conf. 3).

Since the specificity pocket is not involved in binding sorbinil, FLEXX is able to place this ligand into all ensemble structures with an RMSD less than 1.0 Å. The original complex can even be reproduced with an RMSD of 0.43 Å (Figure 5).

Figure 4. Cross-docking dihydrofolate reductase, reduced ensemble. The color-coded Table shows the RMSD of the best prediction within the top ten solutions for each ligand (row) predicted by FLEXE (column 1), the merged ranking list (column 2) and by FLEXX for the particular ensemble structures.

Figure 5. Cross-docking aldose reductase. The color-coded Table shows the RMSD of the best prediction within all solutions for each ligand (row) predicted by FLEXE (column 1), the merged ranking list (column 2) and by FLEXX for the particular ensemble structures.


Due to the different conformations of the specificity pocket when binding tolrestat and zopolrestat, FLEXX can correctly place these ligands only into their own structures. The best solutions in foreign structures have RMSDs of more than 3.5 Å for the tolrestat and 6.0 Å for the zopolrestat. This means that using only one of the ensemble structures with FLEXX would definitely have missed inhibitors. Figure 5 shows the RMSD of the best solutions predicted by FLEXE on any rank, as well as the analogous results of the cross docking of all inhibitors into all the ensemble structures separately with FLEXX. Here we do not restrict the analysis to the top ten ranks, to be sure that no better solution on a lower rank is overlooked. This example shows the necessity of considering several different protein structures to decide whether a ligand can bind to a protein.

The united protein description of FLEXE and the merging of the cross-docking ranking lists predicted by FLEXX are different ways to take into account these protein structure variations. Both approaches select suitable conformations and find good placements with about or even less than 1.0 Å RMSD for all three inhibitors.

The best solutions predicted by FLEXE together with the reference structures for each ligand are shown in Figure 6(b)-(d). In these images, only the amino acid residues forming interactions with the ligand are shown and the alternative instances not used for the particular solution have been faded out.

Sorbinil (Figure 6(b)), which consists of three connected rings, is rigid. Three hydrogen bonds to Thr48, His110 and Trp111 are found and the aromatic ring is placed in the hydrophobic area. A good solution with 0.58 Å is found on rank one and the best one with 0.54 Å on rank two (Table 3).

In the case of tolrestat (Figure 6(c)), two hydrogen bonds between its carboxylate group and the amino acid residues His110 and Trp111 are found, while the naphthyl ring forms hydrophobic contacts with Trp20, Trp79, Trp111, Trp219, Phe115, Phe122 and Leu300. Conformation one of Phe122 and conformation one of Leu300 are used by FLEXE, such that the trifluoromethyl group and the methoxy group of the inhibitor can be placed into the upper part of the specificity pocket in agreement with the experimental structure. Again, there is a good solution (1.09 Å) ranked first and a slightly better solution (1.05 Å) on rank five. The best solution in the merged ranking list of FLEXX is ranked second.

Like tolrestat, zopolrestat (Figure 6(d)) forms two hydrogen bonds between its carboxylate group and His110 and Trp111. The phthalazinone ring lies within the conserved hydrophobic zone and the benzothiazole ring fills the deeper part of the specificity pocket opened by Leu300. FLEXE takes the protein conformation stemming from the tolrestat complex (conf. 1) instead of conformation three. Hence, there is a larger overlap between the nitrogen of the benzothiazole ring and Leu300 tolerated by FLEXE. For Phe122, FLEXE uses conformation three, which is the orientation stemming from the modelled complex between zopolrestat and aldose reductase. Therefore, Phe122 can form hydrophobic interactions with the benzothiazole ring as well as with the phthalazinone ring of the zopolrestat. The best solutions are ranked 133rd by FLEXE and first in the merged ranking list of FLEXX.

Figure 6. Aldose reductase. The Figure shows the united protein description containing (a) the reference structures of the three inhibitors sorbinil (magenta), tolrestat (yellow) and zopolrestat (cyan). In addition, the best prediction and the reference structures of (b) sorbinil, (c) tolrestat and (d) zopolrestat are given separately. In the latter images only the amino acid residues forming interactions with the ligand are shown and the alternative instances not used for the particular solution are faded out.

Run time

Table 4 summarizes the run time needed for preparing the input and for docking. The table shows for each ensemble the average run time for docking a ligand into a single protein structure with FLEXX, the accumulated average run time for sequentially placing the ligand into all protein structures with FLEXX, and the average run time for simultaneously considering the whole ensemble with FLEXE. The CPU times are measured for both tools on a Sun Ultra 10 machine having an Ultra SPARC 2e processor with 440 MHz and 512 MB RAM.

The preparation of the protein structure is not a critical factor with respect to screening applications because these computations have to be done only once for an ensemble (FLEXE) and each protein structure (FLEXX). For FLEXE this phase contains building the united protein description as well as pre-computing and partitioning the incompatibility graph, while for both tools it includes indexing of possible triplets of interaction points. The preparation phase accumulates on average up to a few minutes for the whole ensemble with FLEXX and takes less than 30 minutes with FLEXE with one exception: carboxypeptidase. This case needs about three hours due to a very unfavorable combination of a large ensemble, which leads to complex dependencies, and a large active site.

Table 4. Average run time for one ligand

                              Preparation (in seconds)        Docking (in seconds)
                               FLEXX     FLEXX     FLEXE     FLEXX     FLEXX     FLEXE
Ensemble                      Single  Accumul.  Ensemble    Single  Accumul.  Ensemble

Aldose reductase               23.15     92.60   1333.79    201.48    805.91    307.05
Alpha-momorcharin               8.42     58.93    460.05     23.83    166.81    142.90
Carboanhydrase II               8.42    134.79    230.90     48.68    778.86    159.70
Carboxypeptidase               26.01    416.23  10805.00     63.66   1018.49    845.23
Dihydrofolate reductase        26.72    320.64    722.99     82.39    988.68    141.23
Isocitrate dehydrogenase        9.70    135.74   1597.05     30.47    426.64    399.08
Mandelate racemase              7.24     43.43   1288.29     57.04    342.21    187.39
Ricin                           9.86     88.71   1379.43     65.42    588.74    139.86
Seryl tRNA synthetase          14.93     89.59   1470.57     58.99    353.95   1025.46
Trypsin                        16.54    264.67    647.76     33.52    536.36     80.07

Average                        15.10    164.53   1993.58     66.55    600.67    342.80

The table shows for each ensemble the run time in seconds for preparing the protein structure and the average run time for docking one ligand into a single protein structure (FLEXX), sequentially into all protein structures (FLEXX accumulated), and simultaneously into the ensemble (FLEXE). The CPU times are measured on a Sun Ultra 10 machine having an Ultra SPARC 2e processor with 440 MHz and 512 MB RAM.

Docking the ligand into the protein is actually the runtime-critical phase with respect to larger screening applications. A single docking run with FLEXX takes on average about a minute. However, a separate docking run is necessary for each ensemble structure in order to take into account all conformations given in the ensemble, resulting in an accumulated average run time of ten minutes for a whole ensemble.

For eight of the ten ensembles FLEXE is about a factor of two faster than FLEXX regarding the accumulated run time. FLEXE needs on average about 5.5 minutes for placing one ligand into the united protein structure, taking into account not only all ensemble structures, but also the combinatorial combination of the conformations contained in the ensemble. This recombination of structures is not covered by the sequential approach with FLEXX.

Conclusions

Here, we describe a new docking tool FLEXE which is able to take into account protein structure variations. The idea of our approach is to represent protein flexibility, point mutations, or alternative models of a protein by an ensemble of feasible structures and combine them to form new valid protein structures during the docking process using an independent set search algorithm on the so-called incompatibility graph. FLEXE is a tool for efficiently handling and recombining alternative conformations. The variations considered by FLEXE are defined by the differences within the ensemble structures, and are therefore not limited to experimental data.

In contrast to the merging approach using FLEXX and the discrete docking approaches mentioned in the introduction, FLEXE is able to take into account ligand and protein flexibility simultaneously and directly while placing the ligand into the active site and not as a post optimization as in methods such as merging the ranking lists or resolving collisions. Therefore, FLEXE is independent from initial conformations of the ligand or the protein.

FLEXE treats the instances separately and recombines them in a discrete combinatorial way. Hence, the interactions are not blurred by averaging over distinct alternative instances, which may correspond to unrealistic protein conformations.

The example of aldose reductase demonstrates the necessity of considering protein structure variations for docking calculations. Using a single rigid conformation would have missed inhibitors which could be placed into another ensemble structure.

We applied our method to ten ensembles of protein structures and compared the results with the merged ranking lists that arose from cross-docking runs with FLEXX. The results show that our ensemble approach is able to cope with several side-chain conformations and even movements of loops. Motions of larger backbone segments or even domain movements are not covered by this approach.

For 67 % of the test cases we obtained docking solutions with an RMSD below 2.0 Å within the top ten solutions predicted by FLEXE. This is comparable with the 63 % found in the merged ranking list of FLEXX. However, the run time of FLEXE is on average lower than the accumulated run time needed by FLEXX to dock the ligand sequentially into all members of the ensemble. Moreover, the sequential approach covers only a fraction of the whole conformational space considered by FLEXE, because FLEXE is able to create novel combinations of the ensemble conformations. The cross docking into the reduced ensemble of dihydrofolate reductase shows the advantage of combining structural elements of different ensemble members during ligand placement. FLEXE can dock a ligand correctly into an ensemble of structures into which the ligand cannot be docked by merging the ranking lists of FLEXX afterwards. Docking into all possible combinations of ensemble parts with FLEXX would further increase the accumulated run time.

Figure 7. Interaction geometries. The ligand can form an interaction with the protein if the interaction center of each group is lying (approximately) on the interaction surface of the counter group.

The gap in run time further increases if we take into account rotamer libraries to enrich the number of possible alternative conformations of a protein when, for example, only a few structures are available. Then the merging approach using FLEXX would become combinatorially too complex, whereas FLEXE even has the potential to efficiently handle such a big conformational space.

There are still some problems in ranking the solutions for both FLEXE and the merging approach using FLEXX. However, improving the scoring function was not the aim of this work. This will be the focus of further investigation.

Up to now, intramolecular interactions within the protein are only considered in so far that two instances are incompatible if they overlap with each other. But there are non-overlapping interactions as well, which are partly favorable and partly unfavorable. Taking this into account could further improve the results.

Materials and Methods

FLEXE is based on FLEXX, whose methods and recent developments have been described in detail elsewhere.9,36-39 Results on the evaluation of FLEXX28,40 and a screening application with FLEXX41 have also been reported. Therefore, we just summarize the main concepts of FLEXX which are adapted for FLEXE and describe the new approaches of FLEXE in more detail.

Ligand conformational flexibility

The conformational flexibility of the ligand is modeled by a discrete set of preferred torsion angles at acyclic single bonds, and multiple conformations for ring systems. Torsion angles at multiple bonds, bond lengths and bond angles are used as given in the input structure. The torsion angles are taken from a database containing about 900 molecular fragments with a central single bond which has been derived from the Cambridge Structural Database (CSD)42 by Klebe & Mietzner.43 By this method up to 12 low-energy torsion angles can be assigned to each single bond.

Multiple conformations for rings are computed with the program CORINA.44 The number of ring atoms for each elementary ring is limited to eight. Larger rings are considered rigid and the input structure is used.

RMS deviations due to the described model of ligand flexibility are typically less than 1.0 Å and in most cases are even less than 0.5 Å (e.g. methotrexate: 0.4 Å).

Interaction scheme

The model of molecular interactions used in FLEXX and FLEXE has been adopted from Böhm45,46 and Klebe.47

For each group forming interactions an interaction geometry is assigned consisting of the position of a center and the shape of a spherical interaction surface. Two groups interact if the interaction center of each group is lying approximately on the interaction surface of the counter group (Figure 7). For algorithmic reasons, the interaction surfaces on the protein side are approximated by a finite set of so-called interaction points.

The interactions are divided into three different types, from level 3 for highly directional interactions such as H-bonds down to level 1 for directionally unspecific interactions such as hydrophobic interactions. The higher-level types are preferred in the selection and placement of base fragments. Only if there are not enough high-level interactions will the algorithm descend to lower-level interaction types.38

Scoring function

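The ranking of the docking results is performed with a modification of the scoring function developed by Böhm:30

\begin{align}
\Delta G ={} & \Delta G_0 + \Delta G_{\mathrm{rot}} \times N_{\mathrm{rot}} \tag{1} \\
& + \Delta G_{\mathrm{hb}} \sum_{\text{neutral H-bonds}} f(\Delta R, \Delta\alpha) \tag{2} \\
& + \Delta G_{\mathrm{io}} \sum_{\text{ionic int.}} f(\Delta R, \Delta\alpha) \tag{3} \\
& + \Delta G_{\mathrm{aro}} \sum_{\text{aro int.}} f(\Delta R, \Delta\alpha) \tag{4} \\
& + \Delta G_{\mathrm{lipo}} \sum_{\text{lipo. cont.}} f^{*}(\Delta R) \tag{5}
\end{align}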

The first two terms (equation (1)) of the function are a fixed ground term (ΔG0 = 5.4 kJ/mol) and a term taking into account the loss of entropy during ligand binding due to the hindrance of rotatable bonds in the ligand (ΔGrot = 1.4 kJ/mol). The following terms (equations (2)-(4)) are sums over all pairwise interactions. The last part (equation (5)) of the scoring function evaluates the atom-atom contacts between protein and ligand, i.e. hydrophobic contacts and forbiddingly close contacts (clashes). The functions f, f* are heuristic distance and angle-dependent penalties (see references 9,30 for details).
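How the terms of equations (1)-(5) add up can be sketched as follows. Only the ground term (5.4 kJ/mol) and the rotatable-bond term (1.4 kJ/mol) are taken from the text. The function name, the weights in `PLACEHOLDER_WEIGHTS` and the penalty values are hypothetical placeholders, and the heuristic functions f and f* are assumed to have been evaluated beforehand for each interaction.

```python
def scoring_function_sketch(n_rot, hb, ionic, aro, lipo, weights,
                            dg0=5.4, dg_rot=1.4):
    """Sum the contributions of equations (1)-(5); energies in kJ/mol.

    hb, ionic, aro: precomputed values f(dR, dalpha), one per interaction
    lipo:           precomputed values f*(dR), one per atom-atom contact
    weights:        contributions per ideal interaction (PLACEHOLDERS here)
    """
    return (dg0 + dg_rot * n_rot              # equation (1)
            + weights["hb"] * sum(hb)         # equation (2)
            + weights["io"] * sum(ionic)      # equation (3)
            + weights["aro"] * sum(aro)       # equation (4)
            + weights["lipo"] * sum(lipo))    # equation (5)


# ASSUMPTION: illustrative weights only, not the published parameters.
PLACEHOLDER_WEIGHTS = {"hb": -1.0, "io": -2.0, "aro": -0.5, "lipo": -0.1}

score = scoring_function_sketch(
    n_rot=3,
    hb=[1.0, 0.8],      # two neutral hydrogen bonds
    ionic=[],           # no ionic interactions
    aro=[0.5],          # one aromatic interaction
    lipo=[1.0] * 10,    # ten lipophilic contacts
    weights=PLACEHOLDER_WEIGHTS,
)
print(round(score, 2))
```

A ligand without any interaction keeps only the two penalty terms of equation (1), so in this sketch each additional rotatable bond makes the estimate less favorable by 1.4 kJ/mol.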

Incremental construction algorithm

The docking algorithm is divided into three parts: (i) the selection of the base fragments; (ii) the placement of the base fragments; and (iii) the incremental construction of the whole ligand within the active site. First, the ligand is fragmented into components by severing all acyclic single bonds. FLEXE/X automatically forms a set of alternative base fragments by selecting single components or combinations of them.36

The base fragments are placed into the active site using one of two different algorithms. The first, a triangle matching algorithm, superposes triples of interaction centers of a base fragment with triples of appropriate interaction points in the active site. If a base fragment has fewer than three interaction centers or if the number of placements is less than 100, then the second, a line matching algorithm, is started. It matches pairs of interaction centers with pairs of interaction points. Because of geometric ambiguity, multiple placements are generated by rotation around the axis defined by the interaction points and centers. Both base placement algorithms typically generate a large number of solutions. A reduction by clash tests and clustering follows.

Starting with the different base placements the complete ligand is constructed by linking the remaining components step by step in compliance with the torsional database. After adding one component, new interactions are searched for and the scoring function is used to select the best partial solutions, which are used for the next extension step. The maximum number of solutions taken into account in the next iteration is 400 + 100 nf, where nf is the number of different base fragments.

New concept of FLEXE

The FLEXE approach is based on the united protein description which handles the similarities and differences of the protein structures of the ensemble. All side-chains and backbone parts are treated separately and the dependencies between them are represented in the so-called incompatibility graph. An algorithm for finding sets of pairwise compatible instances is applied to this graph to ensure that valid protein structures are used as a basis for the docking algorithm. These steps are described in more detail in the following sections.

United protein description

The basis of FLEXE is the united protein description which administers protein structure variations. The description is generated from an ensemble of protein structures. Each member of the ensemble must be a valid protein structure which shows one possible conformation of the protein. The particular structures are superimposed and combined to a united protein description (Figure 8). The maximum number of allowed structures per ensemble is currently set to 30.

Figure 8. Ensemble approach. Active site of alpha-momorcharin. The united protein description (right) is created from the superimposed structures of the ensemble (left). Similar parts of the structures are merged whereas dissimilar areas are treated as separate alternatives.

Since we assume highly similar backbone traces for the members of an ensemble, the most straightforward way to superimpose the structures is to fit the backbone atoms of the particular structures. We implemented two simple methods to superimpose the structures of an ensemble onto a reference structure, which we take to be the first structure of the ensemble. The first method applies the Kabsch algorithm48 to fit two sets of atoms, e.g. the backbone atoms given in the ensemble description. The second method iterates the first method, such that all pairs of atoms with a distance greater than a user-defined threshold are ignored for the next step. This procedure emphasizes the differences and improves the fitting in conserved regions of the structures. Alternatively, the superposition can be performed externally with any other tool.


The superimposed structures are combined to create the united protein structure by clustering the alternative side-chain conformations and backbone parts, which we call instances (see Figure 9). The clustered instances can be recombined to form new valid protein structures regardless of the structure from which they originally stem. Therefore, the structures we dock into are not limited to the original ensemble structures.

The method we apply for clustering is a complete-linkage hierarchical cluster algorithm.49 The strategy of hierarchical clustering is such that two clusters with minimal distance are merged into one cluster iteratively as long as the minimal distance between two clusters is less than a predefined threshold. Complete-linkage means that the distance between two clusters is defined as the maximum distance between the elements of the clusters.

We cluster the instances of each part (Figure 9) separately. The instances of the particular parts are the elements of the clusters and the distance between two elements is the mean distance between the atoms of the instances if they are of the same type (i.e. they are both backbone parts or side-chains of the same amino acid residue), otherwise the two instances are not clustered. We use a threshold of 1.0 Å as a trade-off between distorting and clustering the instances.
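The clustering rule can be written down as a minimal sketch. It assumes that the instances of one part are given as coordinate lists with identical atom ordering; the function names and the one-atom example instances are hypothetical and serve only to show the effect of the 1.0 Å threshold.

```python
import math


def mean_atom_distance(a, b):
    """Mean distance between corresponding atoms of two instances."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)


def complete_linkage(instances, distance, threshold=1.0):
    """Merge the two closest clusters while their distance is below threshold.

    The distance between two clusters is the MAXIMUM distance between
    their elements (complete linkage). Returns clusters of instance indices.
    """
    clusters = [[i] for i in range(len(instances))]

    def linkage(c1, c2):
        return max(distance(instances[i], instances[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        d, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if d >= threshold:
            break  # the closest pair of clusters is already too far apart
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Three hypothetical one-atom instances lying on a line (angstroms).
instances = [[(0.0, 0.0, 0.0)], [(0.6, 0.0, 0.0)], [(1.4, 0.0, 0.0)]]
clusters = complete_linkage(instances, mean_atom_distance, threshold=1.0)
print(clusters)
```

In this example the third instance is only 0.8 Å away from the second, but it is not merged because its distance to the first instance (1.4 Å) exceeds the threshold. This is the behavior that distinguishes complete linkage from single linkage.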

Incompatibility graph

Two instances of the united protein structure are incompatible if they cannot be realized simultaneously. The incompatibility between the instances is represented as a graph by using all instances as nodes and connecting pairs of incompatible nodes with edges (see Figure 10).

We distinguish between three kinds of incompatibility: (i) logical: two instances are alternatives of each other; (ii) geometric: two logically compatible instances overlap; (iii) structural: two instances of the same chain are unconnected. Logical incompatibility is implied by the construction of the united protein description: if two instances belong to the same part they are incompatible.

Figure 9. Notation. A component contains all atoms which belong to the same amino acid or a mutation of the amino acid. Each component consists of a backbone and a side-chain part. Each part is a set of instances describing alternative conformations.

An overlap test is performed to test for geometric incompatibility, tolerating an overlap volume of 5.5 Å³ between two instances. This threshold was chosen such that instances of experimentally determined protein structures do not clash with each other. Covalent bonds between adjacent backbone instances, adjacent backbone and side-chain instances, close contacts between two cysteine residues (disulfide bridge) and those with a proline neighbor (ring closure) are exempt from this test.

In addition, compatible instances have to be linked to avoid combinations of instances of different loops which are too far away from each other because such combinations would create absurd protein structures. We assume two instances to be directly linked if the bond lengths of the bonds between them agree with the expected length50 up to a tolerance of ±1.0 Å, which corresponds to the threshold for clustering the instances. Two adjacent components are linked if there is at least one pair of linked backbone instances of the two components. Otherwise a chain break is supposed. Two instances are always compatible if there is a chain break in between them. In order to determine whether two instances are linked via a sequence of instances we use a dynamic programming algorithm which will not be described in detail here.

A valid protein structure corresponds to a completely disconnected subgraph (independent set) in the incompatibility graph, containing exactly one node per part. Therefore, finding a valid protein structure is tantamount to searching for an independent set in the incompatibility graph. In order to speed up this search the incompatibility graph is divided into maximum connected subgraphs, which can be treated independently from each other (see Figure 10). Maximum connected subgraphs are minimal subsets of nodes connected by at least one edge such that there are no edges between nodes of different subgraphs.
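The construction of the incompatibility graph and its division into connected subgraphs can be sketched as follows. The instance names, the `part_of` mapping and the set of clashing pairs are hypothetical inputs; the structural (linkage) test is omitted, so only logical and geometric incompatibility are represented.

```python
from itertools import combinations


def build_incompatibility_graph(part_of, clashes):
    """Adjacency sets: an edge joins two instances that cannot coexist.

    part_of: maps each instance name to the part it belongs to
    clashes: set of frozenset pairs that failed the overlap test (assumed given)
    """
    graph = {node: set() for node in part_of}
    for a, b in combinations(part_of, 2):
        logical = part_of[a] == part_of[b]       # alternatives of each other
        geometric = frozenset((a, b)) in clashes
        if logical or geometric:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_subgraphs(graph):
    """Split the graph into connected components by depth-first search."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical example: parts A and B have two instances each, part C one.
part_of = {"A1": "A", "A2": "A", "B1": "B", "B2": "B", "C1": "C"}
clashes = {frozenset(("A1", "B2"))}
graph = build_incompatibility_graph(part_of, clashes)
components = connected_subgraphs(graph)
print(sorted(sorted(c) for c in components))
```

Here the clash between A1 and B2 couples the parts A and B into one subgraph, while C1 forms a subgraph of its own and can be handled independently.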

In some rare cases there is a conflicting instance which is incompatible with a single instance of a part (see Figure 10). This occurs if one of the ensemble structures itself is invalid, or it may be produced by clustering if the threshold for clustering is too large and the tolerance for volume overlap between two instances is too small. Such a conflicting instance is excluded from further computations.

Surface and interaction geometries

FLEXX uses the Connolly molecular surface51 to decide whether an atom of the protein is solvent-accessible. Only atoms at the surface can form an interaction with the ligand. An analogous definition of a surface in the united protein description of FLEXE is difficult because the surface of an instance cannot be defined independently from proximate compatible instances. Therefore, FLEXE does not compute a surface for the united protein description, but allows all atoms that are part of the active site to form interactions.

To each instance that can form interactions we assign an interaction geometry consisting of a set of interaction points. Each interaction point is individually tested for overlap with the protein, taking compatibility into account (see, for example, Figure 11). The interaction points of instance O1 cannot clash with instance O2 and vice versa, even though the points are placed within the van der Waals radius of the other instance, because the instances O1 and O2 are incompatible. In addition, interaction points of instance O1 interfere with instance S2, which is compatible with O1. However, there is an alternative instance S1 which is also compatible with O1 and does not overlap with any interaction point of instance O1. All interaction points of both instances O1 and O2 can therefore be used.

Figure 10. Incompatibility graph. Each instance is a node in the incompatibility graph. Incompatible nodes are connected by an edge. Due to the constraints, instances incompatible with a single instance of another part can be excluded from further computations. In order to accelerate the search for independent sets the graph is divided into maximum connected subgraphs.

Figure 11. Overlapping interaction geometries. The Figure shows four side-chain instances of the united protein description. The instances O1 and O2 as well as S1 and S2 are alternatives of each other. For clarity only one type of interaction (hydrogen bond acceptor) is shown. All interaction points of the instances O1 and O2 are used because clashes are considered only between compatible instances, and only if there is no alternative instance with less overlap.
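The overlap test for interaction points can be illustrated with a minimal sketch. It assumes one reading of the rule described for Figure 11: a point of an instance is discarded only if, for some other part, every instance compatible with the owner clashes with the point. The names `clashes` and `usable_points`, the sphere representation of atoms, and the coordinates are hypothetical.

```python
from math import dist


def clashes(point, atoms):
    """True if the point lies inside the van der Waals sphere of any atom."""
    return any(dist(point, center) < radius for center, radius in atoms)


def usable_points(owner, points, parts, atoms, incompatible):
    """Keep the interaction points of `owner` that some compatible
    combination of instances leaves free (assumed interpretation)."""
    kept = []
    for p in points:
        blocked = False
        for instances in parts.values():
            if owner in instances:
                continue  # alternatives of the owner never coexist with it
            compatible = [i for i in instances
                          if frozenset((owner, i)) not in incompatible]
            # Blocked only if every compatible alternative of this part clashes.
            if compatible and all(clashes(p, atoms[i]) for i in compatible):
                blocked = True
                break
        if not blocked:
            kept.append(p)
    return kept


# Toy version of Figure 11: S2 overlaps the point of O1, S1 does not.
parts = {"O": ["O1", "O2"], "S": ["S1", "S2"]}
atoms = {
    "O1": [((0.0, 2.0, 0.0), 1.5)],
    "O2": [((0.5, 0.0, 0.0), 1.5)],
    "S1": [((5.0, 0.0, 0.0), 1.5)],
    "S2": [((1.0, 0.0, 0.0), 1.5)],
}
point = (0.0, 0.0, 0.0)
kept = usable_points("O1", [point], parts, atoms, set())
```

With both S1 and S2 available the point is kept; if S2 were the only instance of its part, the same point would be discarded.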

Selection of instances

During the incremental construction algorithm the ligand is placed fragment by fragment into the active site of the united protein description. After each construction step, all possible interactions between the (partially) placed ligand and all instances of the united protein structure are determined. For each particular instance, the scoring function is applied in order to estimate the score of this instance. The score of an independent set of instances in the incompatibility graph can then be calculated as the sum over the scores of its nodes. The independent set of instances with the highest score represents the protein structure that best suits the (partially) placed ligand with respect to the scoring function. The score of this optimal independent set therefore determines the final score of the (partial) solution.
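As a small worked example of this additive score, the following sketch selects the best combination by exhaustive enumeration. It is a brute-force illustration, not the search used in FLEXE; the function name `best_structure`, the instance scores, and the convention that a larger score is better (following the wording above) are assumptions.

```python
from itertools import combinations, product


def best_structure(parts, scores, incompatible):
    """Pick one instance per part so that no two chosen instances are
    incompatible and the summed instance score is maximal."""
    best_choice, best_score = None, float("-inf")
    for choice in product(*parts.values()):
        if any(frozenset(pair) in incompatible
               for pair in combinations(choice, 2)):
            continue  # not an independent set of the incompatibility graph
        total = sum(scores[inst] for inst in choice)
        if total > best_score:
            best_choice, best_score = choice, total
    return best_choice, best_score


parts = {"O": ["O1", "O2"], "S": ["S1", "S2"]}
scores = {"O1": 3.0, "O2": 1.0, "S1": 0.5, "S2": 2.0}  # hypothetical values
incompatible = {frozenset(("O1", "S2"))}
choice, total = best_structure(parts, scores, incompatible)
```

Here the combination (O1, S2) would have the largest sum but is incompatible, so (O1, S1) with a total of 3.5 is selected. The number of combinations grows exponentially with the number of parts, which motivates a graph-based search instead.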

The optimal independent set can be assembled from independent sets of the maximum connected subgraphs. Thereby, only those subgraphs have to be considered which contain at least one node with a score that is not zero because the other subgraphs cannot contribute to the total score.
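The decomposition can be sketched as a plain connected-component search followed by a filter on the node scores. The adjacency-dictionary representation and the names `connected_components` and `scoring_components` are assumptions made for this illustration.

```python
def connected_components(adj):
    """Split an undirected graph (node -> set of neighbours) into its
    connected components using an iterative depth-first search."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adj[node] - component)
        seen |= component
        components.append(component)
    return components


def scoring_components(adj, scores):
    """Keep only components holding at least one node with non-zero score."""
    return [c for c in connected_components(adj)
            if any(scores[n] != 0 for n in c)]


# Incompatibility graph: alternatives of one part are mutually incompatible.
adj = {
    "O1": {"O2", "S2"}, "O2": {"O1"},
    "S1": {"S2"}, "S2": {"S1", "O1"},
    "T1": {"T2"}, "T2": {"T1"},
}
scores = {"O1": 3.0, "O2": 1.0, "S1": 0.5, "S2": 2.0, "T1": 0.0, "T2": 0.0}
relevant = scoring_components(adj, scores)
```

The component formed by T1 and T2 is skipped because neither instance interacts with the ligand in this toy example.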

We use a modified version of the Bron-Kerbosch algorithm⁵² for finding high-scoring independent sets within the maximum connected subgraphs with more than three nodes. The original algorithm was intended for finding all cliques of an undirected graph, which is complementary to the independent set problem. The Bron-Kerbosch algorithm enumerates all independent sets by augmenting an initial independent set in a recursive fashion. At each step, the partial independent set P is extended by a node selected from a set C of so-called candidate nodes. This selection is crucial for the performance of the algorithm. We refer the reader to the original paper for a detailed description of the algorithm.⁵²
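For orientation, the basic recursion can be sketched as follows. This is the textbook Bron-Kerbosch scheme without pivoting, transferred to independent sets by removing neighbours instead of keeping them; it is not the modified version used in FLEXE, and the names are hypothetical.

```python
def maximal_independent_sets(adj):
    """Yield every maximal independent set of an undirected graph given as
    a dict mapping each node to the set of its neighbours."""
    def expand(partial, candidates, excluded):
        if not candidates and not excluded:
            yield frozenset(partial)  # nothing can be added any more
            return
        for v in list(candidates):
            blocked = adj[v] | {v}
            # Extend the partial set by v; its neighbours drop out.
            yield from expand(partial | {v},
                              candidates - blocked,
                              excluded - blocked)
            # v has been handled: move it from the candidates to excluded.
            candidates = candidates - {v}
            excluded = excluded | {v}

    yield from expand(frozenset(), frozenset(adj), frozenset())


adj = {
    "O1": {"O2", "S2"}, "O2": {"O1"},
    "S1": {"S2"}, "S2": {"S1", "O1"},
}
found = set(maximal_independent_sets(adj))
```

For this graph the enumeration returns the three sets {O1, S1}, {O2, S1} and {O2, S2}, each of which contains one instance per part.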

The time-consuming search for optimal independent sets has to be done after each construction step and for each maximum connected subgraph. Therefore, it is the most time-critical step of FLEXE. In order to speed up the search, we check after each selection whether the union P ∪ C of the partial independent set P and the set of candidates C still contains instances of all parts. If this is not the case the algorithm can backtrack the recursion early, since an independent set must contain exactly one instance per part in order to describe a valid protein structure. In addition, we stop the algorithm as soon as the first independent set is found because we only need the score of the optimal independent set. Although it cannot be guaranteed that the first-found independent set is the optimal one, this "greedy" strategy frequently finds the best-scoring independent set, because we sort the candidate set C by descending score before starting the enumeration.
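Both accelerations can be illustrated with a simplified backtracking search. The sketch below is not the FLEXE implementation: it uses an ordered candidate list instead of the full Bron-Kerbosch bookkeeping, and the names `first_valid_structure` and `part_of` as well as the scores are assumptions.

```python
def first_valid_structure(adj, part_of, scores):
    """Return the first independent set with one instance per part,
    trying candidates in order of descending score (greedy)."""
    all_parts = set(part_of.values())

    def expand(partial, candidates):
        covered = {part_of[n] for n in partial} | {part_of[n] for n in candidates}
        if covered != all_parts:
            return None  # some part lost all its instances: backtrack early
        if len(partial) == len(all_parts):
            return partial  # stop at the first complete structure
        for idx, v in enumerate(candidates):
            # Later candidates that stay compatible with v.
            rest = [u for u in candidates[idx + 1:]
                    if u not in adj[v] and part_of[u] != part_of[v]]
            found = expand(partial + [v], rest)
            if found is not None:
                return found
        return None

    ordered = sorted(adj, key=lambda n: scores[n], reverse=True)
    return expand([], ordered)


part_of = {"O1": "O", "O2": "O", "S1": "S", "S2": "S"}
scores = {"O1": 3.0, "O2": 1.0, "S1": 0.5, "S2": 2.0}  # hypothetical values
adj = {"O1": {"O2", "S2"}, "O2": {"O1"}, "S1": {"S2"}, "S2": {"S1", "O1"}}
structure = first_valid_structure(adj, part_of, scores)

# O1 now conflicts with both S instances, so the branch through O1 is pruned.
adj_conflict = {"O1": {"O2", "S1", "S2"}, "O2": {"O1"},
                "S1": {"S2", "O1"}, "S2": {"S1", "O1"}}
fallback = first_valid_structure(adj_conflict, part_of, scores)
```

In the first graph the search returns O1 and S1 without backtracking. In the second graph, choosing O1 leaves no instance of part S, so the check on P ∪ C fails immediately and the search continues with S2 and O2.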

World Wide Web resource

More detailed results of the presented test for all ensembles and all cross docking experiments as well as the input files of the presented data set will be made available on the World Wide Web. The FLEXX software package is available for SUN, SGI, and PCs running the Linux operating system. FLEXE will be made available in the near future. Interested readers should visit our website at http://cartan.gmd.de/FlexX and http://cartan.gmd.de/FlexE, or contact the corresponding author for details.


Acknowledgments

The authors thank Alberto D. Podjarny (UPR de Biologie Structurale IGBMC, Illkirch, France), Michael van Zandt (Institute for Diabetes Discovery (IDD), Branford, Connecticut, USA), Oliver Krämer, and Gerhard Klebe (University of Marburg, Germany) for providing the homology model of the human aldose reductase.

We also thank Martin Stahl (F. Hoffmann-La Roche Ltd., Basel, Switzerland) for fruitful discussions on this topic, our cooperation partners for various helpful comments during the method development, and Sally Hindle for many suggestions on earlier versions of the manuscript.

This work is part of the Relimo project, funded by the BMBF (Bundesministerium für Bildung und Forschung) and the participating industrial partners Boehringer Ingelheim Pharma KG and Merck KGaA, Darmstadt, under grant 0311 620.

References

1. Kuntz, I. (1992). Structure-based strategies for drug design and discovery. Science, 257, 1078-1082.

2. Rosenfeld, R., Vajda, S. & DeLisi, C. (1995). Flexible docking and design. Annu. Rev. Biophys. Biomol. Struct. 24, 677-700.

3. Lengauer, T. & Rarey, M. (1996). Computational methods for biomolecular docking. Curr. Opin. Struct. Biol. 6, 402-406.

4. Kubinyi, H. (1998). Structure-based design of enzyme inhibitors and receptor ligands. Curr. Opin. Drug Discov. Dev. 1, 4-15.

5. Rarey, M. (2001). Protein-ligand docking in drug design. In Bioinformatics - From Genomes to Drugs (Lengauer, T., ed.), vol. 1, VCH-Wiley, Heidelberg 2001.

6. Oshiro, C., Kuntz, I. & Dixon, J. (1995). Flexible ligand docking using a genetic algorithm. J. Comput. Aid. Mol. Des. 9, 113-130.

7. Jones, G., Willett, P. & Glen, R. (1995). Molecular recognition of receptor sites using a genetic algorithm with a description of desolvation. J. Mol. Biol. 245, 43-53.

8. Jones, G., Willett, P., Glen, R., Leach, A. & Taylor, R. (1997). Development and validation of a genetic algorithm for flexible docking. J. Mol. Biol. 267, 727-748.

9. Rarey, M., Kramer, B., Lengauer, T. & Klebe, G. (1996). A fast flexible docking method using an incremental construction algorithm. J. Mol. Biol. 261, 470-489.

10. Gerstein, M. & Krebs, W. (1998). A database of macromolecular motions. Nucl. Acids Res. 26, 4280-4290.

11. Najmanovich, R., Kuttner, J., Sobolev, V. & Edelman, M. (2000). Side-chain flexibility in proteins upon ligand binding. Proteins: Struct. Funct. Genet. 39, 261-268.

12. Leach, A. (1994). Ligand docking to proteins with discrete side-chain flexibility. J. Mol. Biol. 235, 345-356.

13. Leach, A. & Lemon, A. (1998). Exploring the conformational space of protein side chains using dead-end elimination and the A* algorithm. Proteins: Struct. Funct. Genet. 33, 227-239.

14. Sandak, B., Nussinov, R. & Wolfson, H. (1994). 3-D Flexible Docking of Molecules. In Shape and Pattern Matching in Computational Biology: Proceedings of IEEE workshop 1994 (Califano, A., ed.), pp. 41-54, Plenum Press, New York, ISBN 0-306-45138-7.

15. Sandak, B., Nussinov, R. & Wolfson, H. (1995). An automated computer vision and robotics-based technique for 3-D flexible biomolecular docking and matching. Comput. Appl. Biosci. 11, 87-99.

16. Sandak, B., Nussinov, R. & Wolfson, H. (1998). A method for biomolecular structural recognition and docking allowing conformational flexibility. J. Comput. Biol. 5, 631-654.

17. Sandak, B., Wolfson, H. & Nussinov, R. (1998). Flexible docking allowing induced fit in proteins: insights from an open to closed conformational isomers. Proteins: Struct. Funct. Genet. 32, 159-174.

18. Knegtel, R., Kuntz, I. & Oshiro, C. (1997). Molecular docking to ensembles of protein structures. J. Mol. Biol. 266, 424-440.

19. Kuntz, I., Blaney, J., Oatley, S., Langridge, R. & Ferrin, T. (1982). A geometric approach to macromolecule-ligand interactions. J. Mol. Biol. 161, 269-288.

20. Meng, E., Shoichet, B. & Kuntz, I. (1992). Automated docking with grid-based energy evaluation. J. Comput. Chem. 13, 505-524.

21. Meng, E. D. A. G., Blaney, J. & Kuntz, I. (1993). Orientational sampling and rigid-body minimization in molecular docking. Proteins: Struct. Funct. Genet. 17, 266-278.

22. Shoichet, B., Bodian, D. & Kuntz, I. (1992). Molecular docking using shape descriptors. J. Comput. Chem. 13, 380-397.

23. Schnecke, V., Swanson, C., Getzoff, E., Tainer, J. & Kuhn, L. (1998). Screening a peptidyl database for potential ligands to proteins with side-chain flexibility. Proteins: Struct. Funct. Genet. 33, 74-87.

24. Schnecke, V. & Kuhn, L. (1999). Flexible screening for molecules interacting with proteins. In Rigidity in Theory and Applications (Thorpe, M. & Duxbury, P., eds), pp. 385-400, Plenum Publishing, New York.

25. Schnecke, V. & Kuhn, L. (1999). Database screening for HIV protease ligand: the influence of binding-site conformations and representation on ligand selectivity. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology (Lengauer, T., Schneider, R., Bork, P., Brutlag, D., Mewes, W.-W. & Zimmer, R., eds), pp. 242-251, AAAI Press, Menlo Park, CA.

26. Schnecke, V. & Kuhn, L. (2000). Virtual screening with solvation and ligand-induced complementarity. Perspect. Drug Discov. Des. 20, 171-190.

27. Bernstein, F., Koetzle, T., Williams, G., Meyer, E. J., Brice, M., Rodgers, J., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977). The protein data bank: a computer based archival file for macromolecular structures. J. Mol. Biol. 112, 535-542.

28. Kramer, B., Rarey, M. & Lengauer, T. (1999). Evaluation of the FLEXX incremental construction algorithm for protein-ligand docking. Proteins: Struct. Funct. Genet. 37, 228-241.

29. Tripos Associates (1994). SYBYL Molecular Modeling Software, version 6.x, Tripos Associates, Inc., St. Louis, MO.

30. Böhm, H.-J. (1994). The development of a simple empirical scoring function to estimate the binding constant for a protein-ligand complex of known three-dimensional structure. J. Comput. Aid. Mol. Des. 8, 243-256.

31. Muegge, I. & Martin, Y. C. (1999). A general and fast scoring function for protein-ligand interactions: a simplified potential approach. J. Med. Chem. 42, 791-804.

32. Gohlke, G., Hendlich, M. & Klebe, G. (2000). Predicting binding modes and binding affinities and hot spots for protein-ligand complexes using a knowledge-based scoring function. Perspect. Drug Discov. Des. 20, 115-144.

33. Charifson, P., Corkery, J., Murcko, M. & Walters, W. P. (1999). Consensus scoring: a method for obtaining improved hit rates from docking databases of three-dimensional structures into proteins. J. Med. Chem. 42, 5100-5109.

34. Urzhumtsev, A., Tete-Favier, F., Mitschler, A., Barbanton, J., Barth, P. & Urzhumtseva, L., et al. (1997). A "specificity" pocket inferred from the crystal structure of the complexes of aldose reductase with the pharmaceutically important inhibitors tolrestat and sorbinil. Structure, 5, 601-612.

35. Wilson, D. K., Tarle, I., Petrash, J. M. & Quicho, F. A. (1993). Refined 1.8 angstroms structure of human aldose reductase complexed with the potent inhibitor zopolrestat. Proc. Natl Acad. Sci. USA, 90, 9847-9851.

36. Rarey, M., Kramer, B. & Lengauer, T. (1997). Multiple automatic base selection: protein-ligand docking based on incremental construction without manual intervention. J. Comput. Aid. Mol. Des. 11, 369-384.

37. Rarey, M., Kramer, B. & Lengauer, T. (1999). The particle concept: placing discrete water molecules during protein-ligand docking predictions. Proteins: Struct. Funct. Genet. 34, 17-28.

38. Rarey, M., Kramer, B. & Lengauer, T. (1999). Docking of hydrophobic ligands with interaction-based matching algorithms. Bioinformatics, 15, 243-250.

39. Rarey, M. & Lengauer, T. (2000). A recursive algorithm for efficient combinatorial library docking. Perspect. Drug Discov. Des. 20, 63-81.

40. Kramer, B., Rarey, M. & Lengauer, T. (1997). Casp-2 experiences with docking flexible ligands using FLEXX. Proteins: Struct. Funct. Genet. Suppl 1:1, 221-225.

41. Kramer, B., Metz, G., Rarey, M. & Lengauer, T. (1999). Ligand docking and screening with FLEXX. Med. Chem. Res. 7/8, 463-478.

42. Allen, F., Bellard, S., Brice, M., Cartwright, B., Doubleday, A., Higgs, H. & Hummelink-Peters, T. et al. (1979). The Cambridge Crystallographic Data Centre: computer-based search, retrieval, analysis and display of information. Acta Crystallog. sect. B, 35, 2331-2339.

43. Klebe, G. & Mietzner, T. (1994). A fast and efficient method to generate biologically relevant conformations. J. Comput. Aid. Mol. Des. 8, 583-606.

44. Sadowski, J. & Gasteiger, J. (1993). From atoms and bonds to three-dimensional atomic coordinates: automatic model builders. Chem. Rev. 93, 2567-2581.

45. Böhm, H.-J. (1992). The computer program LUDI: a new method for the de novo design of enzyme inhibitors. J. Comput. Aid. Mol. Des. 6, 61-78.

46. Böhm, H.-J. (1992). LUDI: rule-based automatic design of new substituents for enzyme inhibitor leads. J. Comput. Aid. Mol. Des. 6, 593-606.

47. Klebe, G. (1994). The use of composite crystal-field environments in molecular recognition and the de-novo design of protein ligands. J. Mol. Biol. 237, 221-235.

48. Kabsch, W. (1976). A solution for the best rotation to relate two sets of vectors. Acta Crystallog. sect. A, 32, 922-923.

49. Duda, R. & Hart, P. (1973). Pattern Classification and Scene Analysis, John Wiley & Sons, Inc., New York.

50. Engh, R. & Hubert, R. (1991). Accurate bond and angle parameters for X-ray protein structure refinement. Acta Crystallog. sect. A, 47, 392-400.

51. Connolly, M. (1983). Analytical molecular surface calculation. J. Appl. Crystallog. 16, 548-558.

52. Bron, C. & Kerbosch, J. (1973). Finding all cliques of an undirected graph [H]. Commun. ACM, 16, 575-577.

Edited by J. Thornton

(Received 3 July 2000; received in revised form 13 February 2001; accepted 16 February 2001)
