Computational Approaches for the Identification and Characterization of Protein Binding Sites

by Dario Ghersi

A dissertation submitted to the Graduate Faculty of the Mount Sinai Graduate School of Biological Sciences, Biomedical Sciences Doctoral Program, in partial fulfillment of the requirements for the degree of Doctor of Philosophy, Mount Sinai School of Medicine of New York University

2010

Professor Roberto Sanchez

© 2010 by Dario Ghersi
All Rights Reserved

To my parents and all the people with whom I shared small and big things during those years, with affection.

In omnibus requiem quaesivi, et nusquam inveni nisi in angulo cum libro
Thomas à Kempis

Abstract

The problem of inferring the function of a protein in the context of the complex network of interactions is one of the most crucial challenges faced by Computational Biology today. Knowing the binding partners of proteins is an essential step to untangle the web of functional relationships that control cellular processes, and the identification and the characterization of a protein binding site represent an important step to achieve this goal. Some of the most widely used techniques that have been developed by the bioinformatics community over the years are discussed here, together with their limitations and applicability range.

This thesis introduces a framework to perform binding site identification on protein structures by means of an energy-based approach based on the concept of Molecular Interaction Fields (MIFs). The approach has been validated on a large set of bound and unbound protein structures, and a specific application of binding site identification in the context of reverse virtual screening is discussed. The advantage of using chemically specific probes to compute the MIFs is illustrated by applying the binding site identification procedure to phosphorylated ligands.
Furthermore, an improved version of the energy-based binding site identification approach that incorporates evolutionary information is presented, emphasizing its advantage in situations where the energy-based signal is weak.

As an attempt to move beyond the problem of binding site identification, a methodology that can be applied to infer the bound conformation of a protein starting from an unbound form is introduced.

Taken together, the results presented in this work indicate that the energy-based approach with multiple-probe MIFs provides a versatile framework to carry out binding site identification, and hint at the possibility of identifying the bound form of structures that undergo large conformational changes. Furthermore, the problem of predicting the type of ligand that a binding site can accommodate lies among the future challenges that could benefit from the methodology described here.

Acknowledgements

It is only fit to preface the acknowledgments with an apology to the people that one inevitably forgets to mention despite their direct or indirect contributions.

I would like to thank my advisor Dr. Roberto Sanchez for encouraging me to find my own voice and for his always wise suggestions (in science and beyond), Dr. Roman Osman for following my progress with careful attention during all these years, Dr. Mihaly Mezei for his always knowledgeable and bright advice, Dr. Ming-Ming Zhou for his continuous support and insightful ideas, and all the members of the Sanchez Lab for useful discussions. Many thanks to Dr. Suvobrata Chakravarty for the fruitful exchanges that we had over lunch and to Zachary Charlop-Powers for the wide-ranging discussions about science, music and everything else, and for our enjoyable musical projects. Special thanks to Dr. Maurizio Filippone for sharing with me cutting-edge ideas in machine learning and statistics and, more importantly, because I could not imagine a better friend. Thanks to Dr. Fabiana Renzi for being such a wonderful person to spend time with.
I would also like to thank Prof. Franco Celada for giving me the confidence to jump from Medicine into a computational field and for showing me that more things are possible than one would imagine.

Finally, I would like to acknowledge all the people that contribute to make our Department of Structural and Chemical Biology at Mount Sinai a friendly and collaborative environment.

Contents

Dedication
Abstract
Acknowledgements
List of Tables
List of Figures
1 Introduction
  1.1 Inferring Protein Function
  1.2 Binding Site Identification
    1.2.1 Sequence-based approaches
    1.2.2 Structure-based approaches
      1.2.2.1 Geometric approaches
      1.2.2.2 Energy-based approaches
    1.2.3 Approaches that take into account the protein dynamics
  1.3 Binding Site Characterization and Comparison
    1.3.1 Approaches for comparing geometric features
    1.3.2 Approaches for comparing structurally derived properties
  1.4 Available Software for Binding Site Identification and Characterization
  1.5 EASYMIFs and SITEHOUND
    1.5.1 EASYMIFs
      1.5.1.1 Calculation of MIFs in EASYMIFs
      1.5.1.2 Visualizing the results
    1.5.2 SITEHOUND
      1.5.2.1 The SITEHOUND-web Server
  1.6 Conclusions
2 Focused Docking
  2.1 Introduction
    2.1.1 Reverse Virtual Screening
    2.1.2 The Protein-Ligand Docking Problem
      2.1.2.1 The scoring function component
      2.1.2.2 The search component
    2.1.3 Blind Docking vs Focused Docking
  2.2 Materials and Methods
    2.2.1 Binding site identification
    2.2.2 Blind docking setup
    2.2.3 Focused docking setup
    2.2.4 Focused docking with masked grids
    2.2.5 Comparison of blind vs. focused docking
  2.3 Results
    2.3.1 Comparison of blind and focused docking protocols
    2.3.2 Binding site detection accuracy
    2.3.3 Docking pose accuracy
    2.3.4 Comparison of blind vs. focused docking in the unbound dataset
  2.4 Conclusions
3 Binding Site Identification for Phosphorylated Ligands
  3.1 Introduction
  3.2 Materials and Methods
    3.2.1 Binding Site Identification
    3.2.2 Dataset Construction
      3.2.2.1 Phosphopeptides Dataset
      3.2.2.2 ATP Dataset
      3.2.2.3 Phosphosugars Dataset
    3.2.3 Reranking of Putative Sites by Conservation
    3.2.4 Assessment of the Prediction Accuracy
    3.2.5 Electrostatic Potential Calculations
    3.2.6 ROC Curves
  3.3 Results
    3.3.1 Overall Performance on the Whole Datasets
    3.3.2 Evolutionary reranking of the putative sites
    3.3.3 Role of the Electrostatic Potential
    3.3.4 Probe Selectivity Analysis
  3.4 Conclusions
4 Beyond Binding Site Identification
  4.1 Introduction
    4.1.1 Models of Conformational Changes
    4.1.2 The Elastic Network Model
  4.2 Materials and Methods
    4.2.1 Dataset Construction
    4.2.2 Root Mean Square Deviation (RMSD) Calculations
    4.2.3 The Anisotropic Elastic Network Model (ANM)
      4.2.3.1 Fitting
    4.2.4 Side-chain Modeling and MIFs Calculations
    4.2.5 Comparing MIFs derived from binding sites
  4.3 Results
    4.3.1 Normal Model Fitting
    4.3.2 Bound Form Identification
  4.4 Discussion
  4.5 Conclusions
Appendices
A Introduction to Clustering
  A.1 Brief overview of clustering in SITEHOUND
B Focused Docking Setup
  B.1 Selection of complexes
  B.2 Preparation of the proteins and ligands for docking
C Publications Resulting From This Thesis

List of Tables

1.1 Available software for binding site identification and characterization
2.1 Parameters for the different sets of focused docking experiments
2.2 Accuracy of binding site identification
2.3 Accuracy of Blind and Focused Docking in Unbound Proteins
3.1 Summary of the performance on the complete dataset of phosphorylated ligands
3.2 Summary of the performance for the first cluster only
3.3 Performance with the CMET probe
4.1 Dataset of complexes undergoing hinge-like motion upon binding
4.2 Overview of the results with the centroid shape function

List of Figures

1.1 Methyllysine recognition domains
1.2 Pocket Identification
1.3 Conformational changes
1.4 Identification of ligand-binding sites
1.5 An example of interaction energy calculations on a protein
1.6 Characterization of the yeast adenylate kinase binding site using EASYMIFs and SITEHOUND
1.7 SITEHOUND-web results page
2.1 Possible applications of reverse virtual screening
2.2 Focused docking scheme
2.3 SITEHOUND binding site identification performance
2.4 Binding site identification rate for blind and focused docking
2.5 Accuracy of blind and focused docking
2.6 Examples of focused docking results
3.1 Phosphopeptides dataset
3.2 Phosphosugars dataset
3.3 ATP dataset
3.4 Energy ratio density
3.5 Example of phosphobinding site identification
3.6 ROC Curves
3.7 Role of the electrostatic potential
3.8 Combining multiple probes
4.1 Models of Conformational Changes
4.2 ENM scheme
4.3 Contribution of normal modes to RMSD
4.4 MIFs comparison
4.5 Example of Bound Form Identification
4.6 Example of Bound Form Identification (2)
A.1 Effects of linkage on clustering results

Chapter 1
Introduction

1.1 Inferring Protein Function

The widespread structural genomics initiatives[103] have somewhat changed the traditional paradigm of the experimental investigation of proteins, where structure determination was the last step and aimed to clarify and elucidate functional details. Now the number of structures that have not yet been annotated is increasing and, as a consequence, the bioinformatics community has taken up the challenge to develop more reliable tools for inferring function from structure and for guiding further experimental work. Defining the function of a protein is a difficult task, since it involves notions that range from the location of the protein inside the cell to the complete blueprint of its regulatory systems. Widely used approaches involve comparisons of sequences and global structural similarities. In both cases, the underlying assumption is that proteins that share a detectable evolutionary relationship are usually endowed with similar functions.

In the case of global structural similarity approaches, the functional information contained in the closest match from databases like CATH[97] and SCOP[93] is transferred to the query structure. Clearly, caution must be exercised when annotating unknown proteins in this way, since divergent evolution could have produced practically indistinguishable folds with very different functions. For this reason, a significant level of sequence similarity is also required to transfer the annotation[54].

In an intriguing review published in 2006, Kolodny[75] and colleagues show how the concept of a protein fold is more fluid than usually thought. The authors propose that we should move away from the idea of a structural space divided into discrete and non-overlapping islands (the folds) with similarly discrete and non-overlapping functions.
The comparison of global similarities in protein structures and the evolutionary relationships between proteins inferred from global sequence comparisons are clearly very useful and represent one of the main achievements of bioinformatics. On the other hand, it is possible to find numerous examples of proteins that possess distinct evolutionary histories (as determined by sequence and structural analysis) but that carry out similar functions. In many of these instances one can find striking similarities at the level of the active site (in the case of enzymes) or more generally in the binding site (for recognition domains). A compelling example is given in Figure 1.1, where several structurally unrelated domains all involved in methyllysine recognition[16] show a similar arrangement of aromatic residues in the binding site. These cases might represent instances of convergent evolution, where unrelated protein domains acquired similar recognition motifs that were particularly effective and were therefore retained by selection. Another interesting example appeared in a recent issue of Science, where it was shown that phylogenetically unrelated microbial hydrogenases possess similar features in their active site[110].

An additional layer of complexity is given by the fact that many proteins whose function has already been determined are actually endowed with more than one function: this phenomenon has been called "moonlighting" and is likely to play an important role in central processes like catalysis, transcription and gene expression[27, 67]. The concept of moonlighting opens up new spaces to the application of algorithms for the prediction of function from structure, since only in cases where the different functions are clearly encoded in detectable motifs can multiple sequence alignment be an adequate tool.
Computational methods that are specifically tailored to address the problem of binding site identification and characterization are therefore much needed, and have the potential to go beyond the traditional description of global sequence and structural similarities.

Since the function of a protein is to some extent embedded in its interactions with other molecules, elucidating these interactions is a key step to address the problem of functional annotation. The concept of "guilt by association" exploits this idea and has been extensively employed to infer functional information about poorly characterized proteins[62].

Figure 1.1: Domains involved in methylated lysine recognition: chromodomain (a - 1KNA), PHD finger (b - 2FSA), tudor domain (c - 2IG0), ankyrin repeats (d - 3B95), mbt domain (e - 2RHI). The bottom panel shows a close-up view of the aromatic residues surrounding the methylated lysine in cage-like fashion. Despite being structurally unrelated, the domains shown here possess a binding site with an aromatic cage responsible for the methylated lysine recognition.

In some instances, even a sufficiently complete description of the molecular function does not encompass the different physiological roles played by a protein. An interesting example is given by the Glycogen Synthase Kinase 3 (GSK3). In muscle cells, GSK3 is inhibited by insulin activity, with the effect of promoting the conversion of glucose into glycogen and therefore reducing blood glucose levels. From a therapeutic point of view GSK3 would therefore seem a potentially attractive target[57]. On the other hand, GSK3 is also a part of the Wnt signaling pathway, and a reduction in GSK3 is implicated in certain types of cancers, while in other types (e.g. mixed lineage leukemia) GSK3 would seem to play an oncogenic role[120]. Therefore, only a selective inhibition of GSK3 in muscle cells could be a viable therapeutic option for diabetes.
Furthermore, a study by Noble et al.[94] showed how the inhibition of GSK3 leads to reduced production of the neurofibrillary tangles implicated in Alzheimer's disease.

The example provided by GSK3 illustrates the importance of knowing all the binding partners of a protein, since even a detailed chemical knowledge of only one of the pathways where GSK3 is involved does not fully account for the pleiotropic effects of the protein.

The interactions a protein can participate in can be roughly divided into three broad classes, depending on the chemical nature of the protein partner:

- Protein - Protein interactions
- Protein - Nucleic Acid (DNA or RNA) interactions
- Protein - Small Molecule interactions (where the term "small molecule" is meant to represent the broad class of non-polymeric molecules)

Protein - protein interactions can involve large contact interfaces (where shape complementarity usually plays an important role) or be mediated by peptide recognition modules[69, 2].

This work will concentrate on protein - small molecule and protein - peptide interactions, since protein - protein interactions mediated by large contact interfaces and protein - nucleic acid interactions represent a different challenge and require a different set of computational approaches.

A preliminary but crucial step to exploit the functional information embedded in the network of protein interactions is binding site identification, defined as the identification of the residues that make up the region where binding occurs.

An overview of the currently available methods to carry out binding site identification is presented below.

1.2 Binding Site Identification

1.2.1 Sequence-based approaches

One of the simplest yet effective ideas behind sequence-based approaches to identify functionally important residues is to exploit the evolutionary information contained in Multiple Sequence Alignments (MSAs) of homologous sequences and extract a subset of residues that show a high degree of conservation in the MSA. The assumption behind this idea is that the evolutionary pressure acting on functionally important residues will reduce their variability in a protein family. Different conservation measures have been employed, the majority of them being cast in the information theoretic framework. Some of these measures take into account the background distribution of the amino acids, and they have been shown to perform better than those that do not in a study that systematically compared several conservation measures on three different large datasets[22]. More specifically, the Jensen-Shannon divergence has been shown to outperform other information theoretic measures and will be discussed in greater detail in Chapter 3.

An alternative approach that takes advantage of phylogenetic analysis is the evolutionary trace method developed by Lichtarge[85]. The idea behind the method is to consider the degree of conservation of residue positions in a protein family within phylogenetically distinct groups. The assumption is that functionally important residues may be conserved in a subgroup but can vary across different subgroups, since these subgroups may have evolved to perform slightly different functions.

Another approach that takes advantage of phylogenetic information is Rate4Site[88]. The method relies on estimates of site-specific mutation rates obtained with a Bayesian approach that, by including prior information into the model, is less sensitive to the number of sequences in the alignment than other conservation-based methods. On the other hand, a clear disadvantage of Rate4Site compared to simple information theoretic measures of conservation is that it is substantially slower to run[22].

Despite their usefulness to infer functionally important residues, all the sequence-based methods suffer from the fundamental limitation of not being able to discriminate residues that are conserved as part of a binding site from residues that are crucial to protein stability or regulation.
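The background-aware conservation scoring mentioned earlier in this section can be illustrated with a minimal sketch. It scores each column of an alignment by the Jensen-Shannon divergence between the observed amino acid distribution and a background distribution. The uniform background, the pseudocount, and all function names are assumptions made for illustration; published implementations typically use empirical background frequencies and sequence weighting.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumption: a uniform background. Real applications would normally use
# empirical amino acid frequencies instead.
UNIFORM_BACKGROUND = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def column_distribution(column, pseudocount=1e-6):
    """Amino acid frequencies of one alignment column (gaps are ignored)."""
    counts = Counter(res for res in column.upper() if res in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(p[aa] * math.log2(p[aa] / q[aa]) for aa in AMINO_ACIDS if p[aa] > 0)


def js_divergence(column, background=UNIFORM_BACKGROUND):
    """Jensen-Shannon divergence between a column and the background."""
    p = column_distribution(column)
    # Mixture distribution halfway between the column and the background.
    m = {aa: 0.5 * (p[aa] + background[aa]) for aa in AMINO_ACIDS}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(background, m)


def conservation_scores(alignment):
    """Score every column of an alignment given as equal-length strings."""
    return [js_divergence("".join(col)) for col in zip(*alignment)]
```

With base-2 logarithms the score lies between 0 and 1, and a fully conserved column scores higher than a variable one. Note that a high score alone does not say why a position is conserved.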
While binding residues are usually conserved across a protein family, conservation alone is not always a sufficiently specific criterion to identify a binding site, since residues can be conserved for reasons other than binding. To overcome these limitations, other approaches have been devised that explicitly take structural information into account.

Figure 1.2: Geometric identification of binding sites. A) Human alpha-Phosphomannomutase in complex with D-mannose 1-phosphate (PDB code: 2fue). The ligand binds in a deep crevice that is correctly identified as the largest pocket by LIGSITEcsc. B) Mannose 6-phosphate receptor in complex with mannose 6-phosphate (PDB code: 1sz0). The binding site is a shallow pocket and in this case is not among the top three sites predicted by LIGSITEcsc. The black arrows indicate the location of the ligand and the blue spheres show the pockets detected by LIGSITEcsc.

1.2.2 Structure-based approaches

1.2.2.1 Geometric approaches

Most of the geometric approaches to identify binding sites in protein structures rely on the assumption that a binding site is usually a geometrically well-defined cleft or pocket. For example, in a study of 67 protein structures Laskowski determined that the largest cleft corresponded to a binding site in over 83% of the cases[80].

One of the earliest approaches employed by cleft detection algorithms is the protein-solvent-protein concept (e.g. the POCKET[84] and LIGSITE[55] algorithms). The main idea consists of embedding the protein in a 3D lattice and assigning the grid points to either the protein (if within a predefined distance from an atom center) or the solvent. Subsequently, the x, y, z axes are scanned and the pockets are defined as the regions in space that contain points assigned to the solvent category and that are surrounded by protein points.
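The protein-solvent-protein scan described above can be sketched in a few lines. This is a simplified illustration rather than the POCKET or LIGSITE implementation: it scans only the three lattice axes, and the grid spacing, the distance cutoff, the `min_events` threshold and all function names are assumptions chosen for the example.

```python
from itertools import product


def assign_grid(atoms, shape, spacing=1.0, cutoff=1.6):
    """Return the set of grid indices lying within `cutoff` of any atom center."""
    protein = set()
    for i, j, k in product(*(range(n) for n in shape)):
        x, y, z = i * spacing, j * spacing, k * spacing
        if any((x - ax) ** 2 + (y - ay) ** 2 + (z - az) ** 2 <= cutoff ** 2
               for ax, ay, az in atoms):
            protein.add((i, j, k))
    return protein


def enclosed_along_axis(point, axis, protein, shape):
    """True if protein points are found on both sides of `point` along `axis`."""
    def hit(step):
        idx = list(point)
        idx[axis] += step
        while 0 <= idx[axis] < shape[axis]:
            if tuple(idx) in protein:
                return True
            idx[axis] += step
        return False
    return hit(-1) and hit(+1)


def pocket_points(protein, shape, min_events=2):
    """Solvent points enclosed by protein along at least `min_events` axes."""
    pockets = []
    for point in product(*(range(n) for n in shape)):
        if point in protein:
            continue  # only solvent points can belong to a pocket
        events = sum(enclosed_along_axis(point, axis, protein, shape)
                     for axis in range(3))
        if events >= min_events:
            pockets.append(point)
    return pockets
```

The points returned by `pocket_points` would then be grouped into contiguous regions to obtain candidate pockets; that clustering step is left out of the sketch.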
An improved version of LIGSITE (LIGSITEcs) used Connolly's surfaces[26] to replace the protein-solvent-protein events with surface-solvent-surface events, and a further version (LIGSITEcsc) incorporated a conservation measure to rerank the putative pockets[64].

Another well-established algorithm for pocket detection was developed by Laskowski and implemented in the SURFNET program[79]. The gist of the idea is to place spheres between all pairs of atoms in such a way that no atom lies inside a sphere. The spheres are retained only if their radius is between 1 and 4 Å, and are finally clustered. The clustered spheres with the largest volume define the putative pocket. Another approach to detect invaginations on protein structures was proposed by Mezei[89], exploiting the concept of circular variance.

A systematic comparison between all the geometric methods briefly outlined above was carried out by Huang and colleagues on a dataset containing 210 bound proteins plus 48 proteins for which an unbound form was available[64]. The performance of the methods ranged from 80 to 87% for the bound dataset and from 71 to 77% for the unbound cases. Recently, Huang and colleagues combined several geometric approaches with an energy-based approach (see next section) into a metaserver named MetaPocket[63], yielding an improvement over each of the individual methods used in isolation.

Despite their usefulness for binding site identification, one of the major shortcomings of all the geometric approaches is that not all binding sites are deep pockets.
An example is provided in Figure 1.2, where two types of binding sites are represented. The first one is a typical deep cleft that can be easily identified by LIGSITEcsc, while the second is a shallow pocket that does not rank among the top three sites identified by the program.

1.2.2.2 Energy-based approaches

Energy-based approaches to binding site identification work on the assumption that a protein binding site is characterized by energetic properties that stand out from the rest of the protein surface and can be reliably identified.

One of the earliest attempts to characterize binding sites using energetic rather than geometric properties is the GRID program[49], which computes a semi-empirical interaction energy between the protein and a set of chemical probes parameterized to mimic atom types and chemical fragments of pharmaceutical and biological interest. The GRID program is not meant to be used as a binding site identification tool per se, but the interaction energy maps (also known as Molecular Interaction Fields) that are produced by the program can be used for that purpose, with appropriate manipulations. As an example, Q-SiteFinder[81] uses the GRID force field to compute an interaction energy map between the protein and a methyl (-CH3) probe and carries out cluster analysis to identify the regions that have the most favorable total interaction energy. These regions usually correspond to binding sites for drug-like molecules[81]. More recently, Morita et al.[91] improved the performance of this approach by using the AMBER force field[28] for the interaction energy calculations and a more sophisticated two-step algorithm for clustering.

An alternative approach to carry out binding site identification on protein structures builds on the experimental technique introduced by Mattos and Ringe called Multiple Solvent Crystal Structures (MSCS)[87]. The idea behind MSCS is to repeatedly soak the protein with different organic solvents and identify the regions involved in binding to these solvents by X-ray crystallography.
Vajda and Guarnieri[116] have proposed a computational equivalent of this procedure, where the solvent mapping is carried out in silico and a consensus site where different solvents bind favorably is identified as the putative binding site.

1.2.3 Approaches that take into account the protein dynamics

All the energy-based methods discussed in the previous section treat the protein as a rigid body. While this approximation may yield reasonable results in a variety of situations, there are examples where it can be shown that the binding sites are not evident in the unbound form but show up transiently and are "locked in" by the ligand[38]. An illustrative example is provided in Figure 1.3. Clearly, the identification of these transient pockets can be exploited to design inhibitors in situations where the unbound conformation would suggest a low degree of druggability.

Conventional Molecular Dynamics (MD) simulations can be used to generate an ensemble of conformations that are subsequently analyzed with binding site identification tools. A compelling example can be found in the work of Schames et al.[107], who showed the formation of a trench near the active site of the HIV-1 Integrase during a 2 ns MD simulation. Compounds that exploited both the active site and the trench were shown to have better docking energies.

Figure 1.3: An example of conformational change occurring upon binding. A D-allose binding protein is represented in the bound (green) and unbound (red) conformation, with the ligand in licorice representation. The two domains of the protein move closer to each other and form a cleft that accommodates the sugar.

Another study by Frembgen-Kesner et al.[43] identified a transiently forming binding site on a p38 MAP kinase using MD simulations.
The method adopted in this study involved docking a known inhibitor to 5000 different snapshots generated during an MD simulation. The results indicated that a large movement of a side chain in the unbound conformation allowed for the formation of a new binding site exploited by the inhibitor and adjacent to the kinase ATP site, similar to what was seen in crystal structures of the complex.

The disadvantages of the methods involving MD simulations are the high computational cost and the potential inability of average-length MD simulations to capture large conformational changes occurring upon binding. Other approaches that incorporate alternative methods for generating ensembles of conformations have therefore been devised. Gonzalez-Ruiz and Gohlke[48], for example, employed the FRODA method[123] to explore the conformational space of the interleukin-2 receptor and were able to correctly identify the bound conformation starting with an unbound form of the receptor. The main idea behind the FRODA approach is to produce an ensemble of conformations that depend not on time, as in the MD framework, but rather on the distance from a reference conformer. Therefore the trajectory does not reflect time but a geometrical pathway. A rigidity analysis step identifies the parts of the protein that are treated as rigid bodies during the geometric simulation.

Other methods have coupled Normal Mode Analysis (NMA)[17] performed on MD snapshots or Elastic Network Models[5] with binding site identification or docking[24, 104]. The advantage that Normal Mode Analysis offers is the ability to capture large-amplitude conformational changes. Furthermore, the possibility of performing Normal Mode Analysis using the Elastic Network Model enables a substantial gain in speed compared to MD-based analyses.

It is worth pointing out that the methods outlined above require at least an approximate knowledge of the location of the binding site.
The goal is to refine this knowledge by identifying a conformation that is closer to the bound form, or to provide a mechanistic explanation to fill the gap between unbound and bound conformations. The problem of reliably identifying the binding site a priori in the presence of large conformational changes, without any further knowledge, is clearly much more challenging.

1.3 Binding Site Characterization and Comparison

As previously discussed, binding site identification represents an important step towards functional annotation by providing knowledge of the residues that are involved in binding. The majority of approaches to infer functional information from the analysis of the binding site rely on comparisons with known annotated examples. In other words, the process of functional annotation usually consists of transferring the information gained on some well-studied proteins to the protein of interest by virtue of the similarity of their binding sites. This approach relaxes the requirements for homology or structural similarity and underscores the role that the binding site plays in protein function.

In a way that parallels what has been done in the field of binding site identification, both geometric and energy-based approaches have been developed to compare binding sites.

1.3.1 Approaches for comparing geometric features

A graph theoretic method for identifying 3D patterns of amino acid side chains was proposed by Artymiuk et al.[4]. By treating the side chains as nodes of a graph (using a pseudo-atom representation) and the distances between them as the edges, it is possible to search for a given pattern by resorting to well-established subgraph isomorphism algorithms. A proof of principle was provided by screening a set of proteins for the Ser-His-Asp catalytic triad[4].
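To make the graph formulation more concrete, the following Python sketch searches a list of side-chain pseudo-atoms for a labelled distance pattern. It is only an illustration under stated assumptions: it uses brute-force enumeration rather than a dedicated subgraph isomorphism algorithm, and the function name, the data layout, the tolerance and the distances in the example pattern are hypothetical, not values taken from Artymiuk et al.

```python
import math
from itertools import permutations


def match_pattern(residues, types, distances, tolerance=0.5):
    """Return index tuples of residues that match a labelled distance pattern.

    residues  : list of (residue_name, (x, y, z)) pseudo-atoms, one per side chain
    types     : residue names of the pattern nodes, e.g. ["SER", "HIS", "ASP"]
    distances : {(i, j): d} target distance between pattern nodes i and j
    tolerance : allowed absolute deviation on each distance (hypothetical value)
    """
    hits = []
    # Brute force: try every ordered assignment of residues to pattern nodes.
    for combo in permutations(range(len(residues)), len(types)):
        # Node labels must agree with the pattern.
        if any(residues[r][0] != t for r, t in zip(combo, types)):
            continue
        # Every pattern edge must be reproduced within the tolerance.
        if all(
            abs(math.dist(residues[combo[i]][1], residues[combo[j]][1]) - d)
            <= tolerance
            for (i, j), d in distances.items()
        ):
            hits.append(combo)
    return hits


# Toy structure: the coordinates and the pattern distances are made up.
toy_residues = [
    ("SER", (0.0, 0.0, 0.0)),
    ("HIS", (3.0, 0.0, 0.0)),
    ("ASP", (3.0, 4.0, 0.0)),
    ("ALA", (10.0, 10.0, 10.0)),
    ("SER", (20.0, 0.0, 0.0)),
]
triad_types = ["SER", "HIS", "ASP"]
triad_distances = {(0, 1): 3.0, (1, 2): 4.0, (0, 2): 5.0}
triad_hits = match_pattern(toy_residues, triad_types, triad_distances)
```

The enumeration grows factorially with the pattern size, which is why practical implementations rely on subgraph isomorphism algorithms instead.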
More recently a similar approach was taken by Zhang et al.[132] to build a network of binding site similarities.

Another established approach to compare specific arrangements of residues is the TESS algorithm[118], which uses a 3D template acquired by mining the primary literature and containing all the atoms that are essential for an enzyme to perform its function; then, given a query structure, the algorithm looks for a match between the query and the 3D template using the geometric hashing formalism, originally developed for object recognition problems in computer vision[126]. Using a 3D template that contained information for the serine protease active site (again with the well-known Ser-His-Asp catalytic triad), the TESS algorithm was able to detect the active site of all the serine proteases, acetylcholinesterase and haloalkane dehalogenase[118].

The necessity to provide a template with a well-defined structural arrangement of residues limits, in a sense, the applicability of the comparative approaches described above to enzymes or other molecules with a very conserved active site. Proteins whose function is to bind other proteins or ligands (especially in the case of low affinity binding) are less suitable for the generation of a well-defined template, since they will generally lack a highly conserved geometrical arrangement of residues in the binding region.

1.3.2 Approaches for comparing structurally derived properties

Visual inspection of Molecular Interaction Fields (MIFs) provides useful information about regions characterized by favorable interaction energies with specific chemical probes, but it is feasible only when analyzing a few targets. In other cases, a quantitative measure of similarity between targets with respect to specific probes becomes necessary to automate the comparison. In addition, a quantitative similarity index is a valuable tool to cluster the target structures and organize them according to interaction energy patterns.
One can apply the same considerations to the electrostatic potential, which is computed at discrete points in space like the MIFs and can be analyzed using similar or identical indexes.

Different similarity indexes, first developed for comparing quantum mechanically computed electron densities of small molecules, have been adapted to calculate the similarity of electrostatic potentials and MIFs[20]. Among the available indexes, the Hodgkin index is one of the most popular and has been extensively used for protein structural comparisons based on electrostatic potential and MIFs. The Hodgkin index is defined by the following equation:

$$SI = \frac{2\,(p_1 \cdot p_2)}{p_1 \cdot p_1 + p_2 \cdot p_2} \tag{1.1}$$

where $p_1$ and $p_2$ represent the vectors containing the potential energy values for maps 1 and 2 respectively, and the symbol $\cdot$ indicates the dot product.

The Hodgkin index is used by the PIPSA program[13], a software tool to analyze the pairwise similarity of 3D interaction property fields of proteins. The proteins to be analyzed must be superposed before computing the MIFs or the electrostatic potential; after the fields have been calculated, PIPSA gives the option to select a region around the molecules (called "skin" in the program) that contains points within a certain distance from the protein surface. Subsequently the program uses the Hodgkin index to calculate a similarity matrix and clusters the structures accordingly. This approach has been successfully applied to different types of proteins including PH domains[13], WW domains[108], cupredoxins[34] and E2 ubiquitin conjugating enzymes[124]. In all cases the clustering based on the similarity analysis yielded classifications comparable to what could be achieved using functional information such as known peptide binding specificities and catalytic mechanisms.

The Hodgkin index and other similarity indexes usefully summarize global energy-based relationships between structures, but they cannot provide detailed information about regions potentially important for binding affinity or selectivity.
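A minimal Python sketch of Equation 1.1 is given here for illustration. It assumes that the two maps have already been computed on the same grid for superposed structures and flattened into equal-length sequences; the function name is hypothetical and this is not code from PIPSA.

```python
def hodgkin_index(p1, p2):
    """Hodgkin similarity index between two energy maps given as flat sequences.

    Returns 1 for identical maps, -1 for maps of equal magnitude and opposite
    sign, and 0 for uncorrelated (orthogonal) maps.
    """
    if len(p1) != len(p2):
        raise ValueError("the two maps must be sampled on the same grid")
    dot_12 = sum(a * b for a, b in zip(p1, p2))
    dot_11 = sum(a * a for a in p1)
    dot_22 = sum(b * b for b in p2)
    if dot_11 + dot_22 == 0.0:
        raise ValueError("both maps are identically zero")
    return 2.0 * dot_12 / (dot_11 + dot_22)


# Toy maps with three grid points each (values are arbitrary).
map_a = [-1.0, -2.0, -3.0]
map_b = [-2.0, -4.0, -6.0]
similarity = hodgkin_index(map_a, map_b)
```

Unlike a cosine-type index, the Hodgkin index is sensitive to magnitude as well as shape: the two proportional toy maps above do not score 1.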
Another problem, specific to MIFs, is how to integrate and analyze the information obtained by using many different chemical probes. To this purpose multivariate analysis techniques and, in particular, Principal Component Analysis (PCA) have been applied to reduce the dimensionality of the problem and identify regions characterized by highly selective interactions[99]. More recently, a number of publications have demonstrated the usefulness of Consensus Principal Component Analysis (CPCA) applied to multi-probe MIFs[72]. CPCA allows one to discriminate between regions that are important for binding selectivity with respect to a particular probe and regions that do not contribute to the selectivity of the protein binding sites. In addition, many structures can be used to represent a particular protein (multiple NMR conformations or Molecular Dynamics snapshots, for instance), thus implicitly including dynamical information in the analysis. More sophisticated statistical techniques such as Independent Component Analysis (ICA)[25] could in principle be adopted to go beyond the requirements needed by PCA to be effective (linear correlation and normal distributions of the variables).

The similarity indexes and multivariate techniques described above have shown their potential as quantitative tools to compare protein structures and characterize chemically important regions. Despite their usefulness, they have a major requirement that can limit their applicability, namely they need a superposition of the binding sites. For proteins that show very limited structural variability in their binding site this may not be a major issue, but it becomes an obstacle when proteins known to bind the same ligand do not present obvious ways to superpose their binding sites.
Unfortunately, these are also the cases that would benefit the most from these comparative approaches.

In addition, both similarity index comparisons and multivariate analysis rely on the assumption that the maximum correspondence between the grid points of different proteins has been established; in other words, they require the best possible superposition of energy values. This goal is only implicitly pursued when we perform a conventional minimization of the RMSD of equivalent atoms. Using only atoms to maximize the similarity of energy values could easily bias the results, since a group of atoms playing only a marginal role in terms of contribution to the interaction energy with a particular probe or to the electrostatic potential could heavily influence the final superposition of the proteins and, as a result, the outcome of the calculations.

Barbany et al.[9] presented an approach to optimize a MIF-based similarity index (named MIPSIM index) defined by the following equation:

$$\mathrm{MIPSIM}_{AB} = \frac{\sum_{i=1}^{N_A}\sum_{j=1}^{N_B} X_i^A X_j^B \exp(-a r_{ij}^2)}{\sqrt{\sum_{i=1}^{N_A}\sum_{j=1}^{N_A} X_i^A X_j^A \exp(-a r_{ij}^2)}\;\sqrt{\sum_{i=1}^{N_B}\sum_{j=1}^{N_B} X_i^B X_j^B \exp(-a r_{ij}^2)}} \tag{1.2}$$

by seeking the locally optimal superposition of the structures that returns the maximal similarity. The method was originally proposed to superpose ligands, but it can be adapted to protein binding sites. The major limitations of the approach are the computational demands posed by the optimization step and the fact that only a locally optimal solution will be determined.

The problem of the optimal superposition of binding sites has been circumvented in several approaches that combine geometrical features and structurally derived properties (e.g. the electrostatic potential) in a translationally and rotationally invariant framework. For example, Kinoshita and Nakamura[74] built a database of functional sites described by the electrostatic potential computed at several points on the Connolly surface of the binding site, and then implemented a graph theoretic approach to query the database.

More recently, Sael et al.[106] developed another method to rapidly compare physicochemical properties such as the electrostatic potential and a hydrophobicity index mapped onto the surface of proteins. The method takes advantage of 3D Zernike descriptors[95], which consist of a series expansion of a 3D function, thereby allowing for a translationally and rotationally invariant comparison of the resulting coefficients.

Another translationally and rotationally invariant approach for comparing structurally derived properties has been introduced by Das et al.[32].
The approach relies on the concept of shape distributions[98] (originally introduced for object recognition purposes), which measure the probability that a given property will be at a specified distance from another on the surface, thereby incorporating shape and property distributions in one measure. The method was benchmarked on the PDBBind database[119] and the results indicated its capability to classify binding sites in functionally meaningful groups (defined by the EC numbers of the enzymes).

One of the main limitations of these approaches lies in their inability to compare binding sites of different sizes or highly different shapes. Also, the electrostatic potential may not be the best structurally derived property to use for all binding sites (see Chapter 3).

1.4 Available Software for Binding Site Identification and Characterization

The calculations involved in binding site identification and characterization require specialized software, irrespective of whether the geometrical or energy-based approaches are used. The first major requirement for scientific software is clearly its availability, but another essential feature required for large-scale analyses is the possibility of running the programs locally as opposed to using a web-based interface. As Table 1.1 shows, most of the methods to carry out geometrical or energy-based identification and characterization of protein binding sites are either provided as web servers or require a commercial license.
More importantly, no currently available tool provides a combined framework in which one can perform binding site identification and characterization using an energy-based approach.

For these reasons I set out to implement a comprehensive package to address these needs. An introduction to the methods is provided below and in Appendix A.

Table 1.1: Available software for binding site identification and characterization

| Name | Purpose | Availability | Reference |
| FTMAP | fragment-based identification of hot spots | web server only | Brenke et al.[15] |
| CMIP | energy-based characterization of binding sites | currently unavailable | Gelpi et al.[44] |
| GRID | energy-based characterization of binding sites | commercial license | Goodford et al.[49] |
| LIGSITE | binding site identification (geometrical) | web server and standalone | Huang et al.[64] |
| PocketFinder | binding site identification (geometrical) | web server only | Hendlich et al.[55] |
| PocketPicker | binding site identification (geometrical) | web server and standalone | Weisel et al.[121] |
| Q-SiteFinder | binding site identification (energy-based) | web server only | Laurie et al.[81] |

1.5 EASYMIFs and SITEHOUND

EASYMIFs and SITEHOUND[46] are two software tools that in combination enable the identification of binding sites in protein structures using an energy-based approach. EASYMIFs is a simple Molecular Interaction Fields (MIFs) calculator, and SITEHOUND is a post-processing tool for MIFs that identifies interaction energy clusters corresponding to putative binding sites. While these tools are conveniently used in combination, they can also be used separately. EASYMIFs can be used to calculate MIFs for binding site characterization, Quantitative Structure-Activity Relationship (QSAR) studies, selectivity analysis of protein families, pharmacophoric search, and other applications that require MIFs[29].
SITEHOUND can be used to process the output from other MIF or Affinity Map calculation programs in addition to EASYMIFs, such as GRID[49] and the Autogrid tool of the AutoDock software package[92].

Figure 1.4: Identification of ligand-binding sites using EASYMIFs and SITEHOUND. (A) A protein structure is used as input and the program EASYMIFs computes the potential interaction energy of a molecular probe with the protein on each point of an orthogonal grid, called a Molecular Interaction Field (MIF). (B) The program SITEHOUND processes the MIF by first removing all points that have unfavorable interaction energy; (C) subsequently the remaining points are grouped using a hierarchical clustering algorithm, and the resulting clusters are ranked by their Total Interaction Energy (the sum of the interaction energy of all points in one cluster). (D) Known binding sites are usually found among the top three clusters.

1.5.1 EASYMIFs

Molecular Interaction Fields (MIFs) describe the spatial variation of the interaction energy between a target molecule and a specific probe, which usually represents a chemical group. Although the interaction energy field is, by definition, a continuous quantity, for computational convenience it is usually discretized on a three-dimensional orthogonal grid that surrounds the molecule of interest. The output of a MIF calculation is therefore represented by an energy map that provides information about the potential energy between the probe and the molecule under analysis. EASYMIFs aims to provide a simple and rapid way to characterize a protein structure from a chemical standpoint at the global or local level (e.g. around an active site), returning maps that can be loaded in molecular graphics software such as PyMOL[35], VMD[66] or Chimera[101].

1.5.1.1 Calculation of MIFs in EASYMIFs

EASYMIFs computes the potential energy between a chemical probe (represented by a particular atom type) and the protein on a regularly spaced grid, using the following equation:

$$V_i = \sum_j \left( V_{LJ}(r_{ij}) + V_E(r_{ij}) \right) \tag{1.3}$$

where the potential energy calculated for a probe at a point $i$ in the grid is equal to the sum of a Lennard-Jones and an electrostatic term over all the atoms of the protein, and $r_{ij}$ represents the distance between the probe at point $i$ in the grid and an atom $j$ of the protein. The Lennard-Jones and the electrostatic terms are expressed by the following two equations:

$$V_{LJ}(r_{ij}) = \frac{C^{(12)}_{ij}}{r_{ij}^{12}} - \frac{C^{(6)}_{ij}}{r_{ij}^{6}} \tag{1.4}$$

$$V_E(r_{ij}) = \frac{1}{4\pi\varepsilon_0}\,\frac{q_i q_j}{\varepsilon(r_{ij})\, r_{ij}} \tag{1.5}$$

The $C^{(12)}$ and $C^{(6)}$ parameters in the Lennard-Jones term depend on the chosen probe and the particular atom type, and are taken from a matrix of LJ parameters distributed with the GROMACS package[117]. The electric conversion factor $1/(4\pi\varepsilon_0)$ has been set to 138.935485 (the value used in GROMACS, in kJ mol$^{-1}$ nm e$^{-2}$). The distance-dependent dielectric sigmoidal function has been taken from Solmajer and Mehler[111] and has the following form:

$$\varepsilon(r_{ij}) = A + \frac{B}{1 + k\,e^{-\lambda B r_{ij}}} \tag{1.6}$$

where $A = 6.02944$; $B = \varepsilon_0 - A$; $\varepsilon_0 = 78.4$; $\lambda = 0.018733345$; $k = 213.5782$. When the distance between the probe and an atom becomes less than 1.32 Å, a dielectric constant of 8 is used. The parameters reported above for the distance-dependent dielectric have been taken from Cui et al.[30].

1.5.1.2 Visualizing the results

EASYMIFs produces Interaction Energy Maps in the dx format, which can be conveniently visualized in PyMOL, Chimera, VMD and other molecular graphics packages. The dx file is usually displayed as a contour plot, showing regions of space where the energy value is within a specified range. For large-scale analyses that involve the generation of many thousands of maps it is also possible to use compressed maps, which achieve a compression ratio usually greater than 4, thereby making better use of the disk space. The compression algorithm incorporated into EASYMIFs is the Lempel-Ziv-Welch (LZW) algorithm[122], with an O(1) dictionary search step that adds almost no overhead to the calculations.

Figure 1.5 shows an example of MIF calculations. EASYMIFs has been used to calculate an interaction energy map between the protein (in the binding site region) and a hydroxyl probe, shown in gold in the Figure.
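As a rough illustration of Equations 1.3 to 1.6, the following Python sketch evaluates the probe energy at a single grid point. It is not the EASYMIFs implementation: the function names, the layout of the atom tuples and the unit handling (distances in nm for the Lennard-Jones and Coulomb terms, converted to Å for the dielectric function) are assumptions made here; only the numerical constants are taken from the text.

```python
import math

COULOMB = 138.935485      # 1/(4*pi*eps0), value quoted in the text
A = 6.02944               # sigmoidal dielectric parameters quoted in the text
EPS0 = 78.4
B = EPS0 - A
LAMBDA = 0.018733345
K = 213.5782
SHORT_RANGE = 1.32        # in Angstrom; below this distance eps is constant
EPS_SHORT = 8.0


def dielectric(r_angstrom):
    """Distance-dependent dielectric of Equation 1.6 (r assumed in Angstrom)."""
    if r_angstrom < SHORT_RANGE:
        return EPS_SHORT
    return A + B / (1.0 + K * math.exp(-LAMBDA * B * r_angstrom))


def probe_energy(point, probe_charge, atoms):
    """Equation 1.3: sum of LJ and electrostatic terms over all protein atoms.

    point : (x, y, z) of the grid point, assumed in nm
    atoms : iterable of (xyz_nm, charge, c12, c6), where c12 and c6 are the
            probe-atom pair parameters (hypothetical layout)
    The grid point is assumed not to coincide with any atom (r > 0).
    """
    total = 0.0
    for xyz, charge, c12, c6 in atoms:
        r = math.dist(point, xyz)
        v_lj = c12 / r**12 - c6 / r**6                      # Equation 1.4
        v_el = COULOMB * probe_charge * charge / (
            dielectric(10.0 * r) * r                        # nm -> Angstrom
        )                                                   # Equation 1.5
        total += v_lj + v_el
    return total
```

A full map would simply repeat `probe_energy` over every point of the orthogonal grid. Note that the sigmoidal function evaluates to approximately 8 at 1.32 Å, so the constant short-range value joins the curve almost continuously.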
The box around the binding site illustrates the boundaries of the box used in the calculations.

Figure 1.5: An example of interaction energy calculations on a protein. The protein shown here is a D-allose binding protein (PDB code 1rpj). The box delimits the area of the protein where the calculations have been carried out. The golden points indicate areas of favorable interaction energy with a hydroxyl probe (energy threshold set to -28 kJ/mol). The ligand is overlaid for comparison, but was removed before computing the interaction energy map.

1.5.2 SITEHOUND

The purpose of SITEHOUND is to manipulate the output of the EASYMIFs program (and other programs such as Autogrid[92] and GRID[49]) in order to predict regions on protein structures that are likely to be involved in binding to small molecules or peptides. The approach is based on the Q-SiteFinder algorithm[81], but contains more options and improvements. The main differences lie in the use of multiple probes for the detection of different types of binding sites (see Chapter 3); alternative clustering algorithms, which improve results for ligands of different shapes (see Appendix A); and the fact that SITEHOUND can be run independently of a web interface.

The program first filters off all the grid points that have an energy value above a user-specified threshold (a negative value) and clusters the remaining points according to spatial proximity using single or average linkage agglomerative clustering (see Appendix A). Subsequently, the Total Interaction Energy (TIE) of each cluster is computed and this value is used to rank the clusters, from the most negative to the least negative. The last step involves printing the results to text files and in the PDB and DX formats, which allow for graphical display of the results on the protein using standard molecular visualization tools (such as Chimera[101], PyMOL[35] or VMD[66]).

Figure 1.6 illustrates an example of binding site identification carried out with two different probes (methyl and phosphate oxygen) on the same protein.
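The filter, cluster and rank procedure described for SITEHOUND can be sketched in a few lines of Python. This is a simplified stand-in, not the program's code: it assumes single linkage clustering cut at a fixed distance, which is equivalent to finding connected components among points closer than that distance, and the function name and both cutoff arguments are hypothetical.

```python
import math


def rank_sites(points, energies, energy_cutoff, distance_cutoff):
    """Return clusters as (total_interaction_energy, [points]), best first."""
    # Step 1: discard grid points whose energy is above the (negative) threshold.
    kept = [(p, e) for p, e in zip(points, energies) if e <= energy_cutoff]

    # Step 2: single linkage at a fixed cutoff, via a union-find structure.
    parent = list(range(len(kept)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for i in range(len(kept)):
        for j in range(i + 1, len(kept)):
            if math.dist(kept[i][0], kept[j][0]) <= distance_cutoff:
                parent[find(i)] = find(j)

    clusters = {}
    for i, (p, e) in enumerate(kept):
        clusters.setdefault(find(i), []).append((p, e))

    # Step 3: rank by Total Interaction Energy, most negative first.
    ranked = [
        (sum(e for _, e in members), [p for p, _ in members])
        for members in clusters.values()
    ]
    ranked.sort(key=lambda cluster: cluster[0])
    return ranked


# Toy map: four grid points on a line, energies in arbitrary units.
grid = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
values = [-10.0, -12.0, -1.0, -30.0]
sites = rank_sites(grid, values, energy_cutoff=-5.0, distance_cutoff=1.0)
```

The pairwise loop is quadratic in the number of retained points, which is acceptable for a sketch but would need a spatial index for large maps. Average linkage, the other option mentioned in the text, cannot be reduced to connected components and would require a genuine agglomerative procedure.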
The combination of the two probes yields a more comprehensive picture of the large binding site and the correct identification of the adenosine and phosphate binding regions.

Figure 1.6: Characterization of the yeast adenylate kinase binding site using EASYMIFs and SITEHOUND. (A) Ribbon diagram of the yeast adenylate kinase structure showing the top ranking clusters as solid surfaces: phosphate probe cluster (red) and carbon probe clusters (green). (B) SITEHOUND clusters superposed on the structure of the Ap5A (bis(adenosine)-5-pentaphosphate) inhibitor of adenylate kinase[1]. The phosphate probe correctly identifies the pathway of phosphoryl transfer, and the carbon probe correctly identifies the adenosine binding regions. The Figure was prepared using the 1akyCMETcluster.pdb and 1akyOPcluster.pdb files from the example, and the PyMOL[35] molecular graphics program.

1.5.2.1 The SITEHOUND-web Server

A streamlined web-based interface to carry out binding site identification using SITEHOUND has been made available at http://sitehound.sanchezlab.org[59]. The interface (Figure 1.7) can be used to upload a PDB structure, automatically perform the binding site identification and visualize the results of the calculations on a ribbon representation of the protein. The residues potentially involved in binding are also reported on a per-cluster basis, together with a summary of the main features of the clusters. From the results page (Figure 1.7) the user can also download all the files that are produced by SITEHOUND. Furthermore, it is possible to download the .map file produced by EASYMIFs or Autogrid, which can be used by SITEHOUND to carry out the binding site identification with combinations of parameters different from the default parameters used by the web server. SITEHOUND-web only allows for the processing of relatively small systems with default parameters.

1.6 Conclusions

The molecular function of a protein is largely determined by interactions with other molecules at binding sites on its surface.
Hence, identification of the location and characteristics of ligand-binding sites can contribute to functional annotation of a protein; it can guide experiments, and be useful in predicting or verifying interactions. The identification of ligand-binding sites can also be an important part of the drug discovery process.

Several methods have been developed for the identification of binding sites from protein structures and sequences, and the main ideas underlying some of the most widely used approaches have been discussed above. Sequence-based methods have the advantage of being applicable to proteins of unknown structure, by relying on the evolutionary conservation of residues. However, they are also limited by the fact that not all binding sites are conserved, and not all conserved residues correspond to binding sites. Structure-based approaches can overcome these limitations and complement sequence-based methods. Among the structure-based approaches, the energy-based methods directly describe the molecular interaction properties of the protein surface and can in principle also distinguish binding sites with distinct ligand preferences (e.g. hydrophobic versus polar) if different chemical probes are used for the molecular interaction calculation.

Figure 1.7: SITEHOUND-web results page. The results of a carbon probe calculation with the average linkage clustering are shown for adenylate kinase (PDB code 1aky). The Web Interface is available at http://sitehound.sanchezlab.org

The limited availability of software to compute interaction energy maps and the absence of methods that incorporate multiple probes into the binding site identification framework provided the motivation for implementing the EASYMIFs and SITEHOUND programs.
The remainder of this work will describe their application in the context of protein-ligand docking, binding site identification tailored to specific classes of ligands and the identification of bound forms in cases where large conformational changes take place.

Chapter 2

Focused Docking

2.1 Introduction

2.1.1 Reverse Virtual Screening

As mentioned in Chapter 1, an important step towards (possibly automatic) protein functional annotation consists in elucidating all the possible binding partners of a protein. Directly investigating protein binding sites also has valuable applications in the field of drug discovery, where several approaches have been implemented (both experimental and computational) for predicting all the possible biological targets of a drug[68]. The computational approaches meant to accomplish this task usually go under the name of reverse virtual screening.

At least four different situations can be envisaged where knowing all the targets of a drug can be therapeutically exploited (Figure 2.1). The first one involves the problem of so-called orphan drugs, i.e. pharmaceuticals that are marketed and proven to be effective despite our lack of knowledge about their molecular target(s). Related to this is the issue of adverse drug reactions due to off-target effects, where the drug binds to (unwanted) secondary targets giving rise to an adverse reaction. Another scenario involves the repurposing of existing drugs for new uses. In this case the off-target effects of the drug are exploited to treat other medical conditions where the additional targets of the drug play an important role. The advantage of repurposing the drug is mainly due to safety and pharmacoeconomic considerations. Finally, the problem of drug-drug interactions (a common cause of iatrogenic injuries and therapeutic failures) can also be framed in the context of shared similarities between drug targets. Significant effort has been devoted to experimental and computational approaches to detect potentially harmful drug-drug interactions that arise from shared metabolizing enzymes. In this case the interaction is due to the fact that both drugs compete for the same metabolic pathway responsible for the degradation and subsequent excretion of the drug, leading to increases in blood concentration of one or both drugs. A more challenging mechanism to predict can be found at the pharmacodynamic level, where the interaction between drugs occurs at the level of their targets, with synergistic or antagonistic effects. The problem is compounded by the fact that the interaction between drugs is not necessarily due to modulation of the same target, but can also occur at the pathway level.

Figure 2.1: The panels represent different instances where the knowledge of all the targets modulated by a drug can be beneficial. A: adverse drug reactions due to off-target effects; B: orphan drugs (drugs whose mechanism of action was previously unknown); C: drug repurposing; D: drug interactions (for simplicity the drugs are shown to interact with the same target, but more indirect interactions are also possible, for example at the pathway level). The dashed arrow indicates an interaction that was previously unknown.

Determining all the biologically relevant protein partners of a small molecule clearly represents a formidable task. Computational filtering tools capable of narrowing down the experimental validation to a handful of proteins are therefore necessary. One of the most efficient approaches from a computational standpoint is to look for small molecules that are very similar to the one that is under investigation and assume that they will also share the same set of targets. A review of some of the techniques that are available can be found in Sheridan and Kearsley[109]. Even though these approaches are computationally very efficient, they are limited in the sense that they rely on already discovered interactions of cognate small molecules.
If the small molecule of interest is not significantly similar to any other small molecule for which there is enough information, nothing useful can be inferred.

More general approaches involve docking the small molecule into a large set of proteins and using the docking energy as a ranking criterion. The main idea of screening a large set of protein structures against a particular small molecule of interest has been described by Paul et al.[100], who carried out docking experiments by screening proteins whose binding site was already known. Although this approach enables faster reverse virtual screening, it limits the universe of candidate targets to those proteins that have clearly identified binding sites and only to those sites within the protein. Ideally, a reverse virtual screening approach would require only the knowledge of the three-dimensional structure of the candidate target proteins and would allow for the discovery of unexpected interactions that may occur at previously unidentified binding sites. This idea will be described in this chapter by showing the results of combining a binding site prediction step with a subsequent docking experiment.

A brief introduction to the docking problem is necessary to understand the challenges involved in its applications in the context of reverse virtual screening.

2.1.2 The Protein-Ligand Docking Problem

The problem of protein-ligand docking frames the two interrelated questions of the free energy of binding and the optimal orientation and conformation of the ligand (and possibly of the protein as well) as a global optimization problem. In other words, given a protein structure and a ligand, the docking algorithm will seek the orientation and conformation of the ligand that yields a global minimum on the free energy landscape. In practical applications, the free energy is usually approximated by the docking energy, which has been tuned to empirically reproduce experimentally derived free energy values for a given set of known complexes.

The majority of docking algorithms consist of two components:

1. A scoring function that returns the docking energy for a given orientation and conformation of the ligand with respect to the protein
2. A search strategy that seeks to find the global minimum in the docking energy landscape

Here the focus will be on the AutoDock[92] package, one of the most widely cited applications for molecular docking (as reported at http://autodock.scripps.edu).

2.1.2.1 The scoring function component

AutoDock estimates the free energy of binding between a protein (P) and a ligand (L) using pairwise terms and an implicit solvent model. More formally:

$$\Delta G = (V^{L-L}_{bound} - V^{L-L}_{unbound}) + (V^{P-P}_{bound} - V^{P-P}_{unbound}) + (V^{P-L}_{bound} - V^{P-L}_{unbound} + \Delta S_{conf}) \tag{2.1}$$

In other words, the intramolecular energetics of the transition from the unbound state to the bound one are evaluated separately for each of the molecules, and finally the intermolecular energetics of the protein and the ligand in the complex are computed. The second term in the equation above is clearly 0 if the protein is kept fixed during docking. The entropic loss that occurs upon binding (last term in the equation) is directly proportional to the number of rotatable bonds in the ligand.

The pairwise atomic terms in AutoDock are described by the following equation:

$$V = W_{vdw} \sum_{i,j} \left( \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}} \right) + W_{hbond} \sum_{i,j} E(\theta) \left( \frac{C_{ij}}{r_{ij}^{12}} - \frac{D_{ij}}{r_{ij}^{10}} \right) + W_{elec} \sum_{i,j} \frac{q_i q_j}{\varepsilon(r_{ij})\, r_{ij}} + W_{sol} \sum_{i,j} (S_i V_j + S_j V_i) \exp(-r_{ij}^2 / 2\sigma^2) \tag{2.2}$$

The first term is a 6/12 potential for dispersion/repulsion interactions, while the second term explicitly takes into account hydrogen bonding, with a dependence on the angle expressed as a deviation from the ideal bonding geometry. The electrostatic term is expressed by a Coulomb potential with a distance-dependent dielectric. The desolvation potential is based on the volume (V) of the atoms that surround a given atom, weighted by a solvation parameter (S) and (exponentially) by the distance.

All the coefficients in AutoDock4 were derived by calibration on a set of 188 protein-ligand complexes whose binding energies were experimentally determined[65].

2.1.2.2 The search component

The expressions described in the previous section define the energy landscape that has to be sampled in order to determine the global optimum. Since an analytical solution to this problem does not exist, AutoDock resorts to global search heuristics to find a reasonable approximation. In particular, AutoDock employs a modified version of Genetic Algorithms[90].

Genetic Algorithms are a class of evolutionary strategies widely employed in global optimization problems. They apply a Darwinian process of selection of the fittest to a population of individuals, each of which represents a potential solution to the problem under analysis. In this specific context, each individual bears a genotype that fully describes the orientation and conformation of the ligand, and the fitness is simply represented by the resulting docking energy.

A simplified description of the optimization algorithm implemented in AutoDock is given below:

1. randomly initialize the population
2. evaluate the fitness of each individual
3. select a fraction of the most fit individuals for reproduction
4. apply crossover and mutation operators to the individuals
5. carry out local optimization
6. evaluate the fitness of the new individuals
7. discard a fraction of the least fit individuals
8. repeat steps 3 through 7 until the maximum number of energy evaluations has been reached

The mutation operator simply applies a random change to the genotype of an individual according to a predefined probability distribution. In AutoDock this amounts to a random change in the value of one of the degrees of freedom of the ligand. The crossover operator mimics the exchange of genetic material that occurs during meiosis, and it consists in assembling parts of the parents' genotypes into a new combination.

The most notable difference from the common implementations of Genetic Algorithms is to be found in the local optimization step, which brings the solutions represented by the individual genotypes to their local minima by means of a local optimization search carried out in the coordinate space of the ligands. The optimized solution is then coded back into the genotype of the individuals, and this explains the name of Lamarckian Genetic Algorithm given to the hybrid version implemented in AutoDock.

It is important to mention that the maximum number of energy evaluations is the primary factor that controls when the algorithm will stop its search. A value set too low will prevent the algorithm from thoroughly sampling the free energy space and will usually result in poor solutions. On the other hand, a balance has to be struck between accuracy and efficiency of the search, since in many applications docking is applied to thousands of molecules and speed is a crucial factor.

Another factor that can potentially influence the aforementioned balance between speed and quality of the results is the size of the space allowed for docking. In virtual screening, where the binding site of the protein is usually known, the docking calculations are restricted to a region approximately corresponding to the binding site.
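The eight-step outline of the Lamarckian Genetic Algorithm given earlier in this section can be illustrated with a toy Python sketch. It is not AutoDock's implementation: the genotype is a plain list of real numbers, a greedy random walk stands in for the local search, the least fit individuals are discarded by being replaced with the new offspring, and every name and default value below is an assumption chosen for illustration.

```python
import random


def lamarckian_ga(energy, n_genes, max_evals, pop_size=20, keep=0.5,
                  mutation_rate=0.2, step=0.3, local_steps=5, seed=0):
    """Minimize `energy` over lists of n_genes floats; returns (genes, e, evals)."""
    rng = random.Random(seed)
    evals = 0

    def score(genes):
        nonlocal evals
        evals += 1                      # every call counts against the budget
        return energy(genes)

    def local_search(genes, e):
        # Greedy random walk; improvements overwrite the genotype (Lamarckian).
        for _ in range(local_steps):
            trial = [g + rng.gauss(0.0, step) for g in genes]
            e_trial = score(trial)
            if e_trial < e:
                genes, e = trial, e_trial
        return genes, e

    # Steps 1-2: random initial population, evaluated once.
    population = []
    for _ in range(pop_size):
        genes = [rng.uniform(-5.0, 5.0) for _ in range(n_genes)]
        population.append((score(genes), genes))

    # Step 8: the budget is checked once per generation.
    while evals < max_evals:
        population.sort(key=lambda ind: ind[0])
        parents = population[:max(2, int(keep * pop_size))]      # step 3
        children = []
        while len(children) < pop_size - len(parents):
            (_, a), (_, b) = rng.sample(parents, 2)
            cut = rng.randrange(1, n_genes) if n_genes > 1 else 0
            child = a[:cut] + b[cut:]                            # step 4: crossover
            if rng.random() < mutation_rate:                     # step 4: mutation
                child[rng.randrange(n_genes)] += rng.gauss(0.0, 1.0)
            e_child = score(child)                               # step 6
            child, e_child = local_search(child, e_child)        # step 5
            children.append((e_child, child))
        population = parents + children                          # step 7

    best_e, best = min(population, key=lambda ind: ind[0])
    return best, best_e, evals


def sphere(genes):
    """Toy energy landscape with a single minimum at the origin."""
    return sum(g * g for g in genes)


short_run = lamarckian_ga(sphere, n_genes=3, max_evals=200)
long_run = lamarckian_ga(sphere, n_genes=3, max_evals=2000)
```

Because the fittest individuals always survive and both runs share the same random seed, the longer run extends the shorter one and can never end with a worse energy, which mirrors the point made above about the maximum number of energy evaluations.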
Restricting the search to a known binding site is not applicable to the case of reverse virtual screening, as will be discussed below.

2.1.3 Blind Docking vs Focused Docking

The goal of protein-ligand docking is to predict the position and orientation of a ligand (usually a small molecule) when it is bound to a receptor protein. When the binding site to be targeted by the small molecule is known, selecting a reasonably small docking box around this site facilitates docking by focusing the sampling of the translational, rotational, and torsional degrees of freedom of the ligand. This is the usual situation in lead optimization, where predicting the binding mode or pose of the ligand is needed for rational design of improved potency and selectivity, and in hit identification through virtual screening, where the goal is the discovery of ligands, out of a large library, that are likely to bind a protein target. The reverse question is more difficult to address. Given a ligand, is it possible to discover its most likely target? In this reverse virtual screening case, because the binding site is not known, it becomes necessary to explore the entire protein surface by docking, a procedure that has been named blind docking[60, 61]. Because the space where blind docking takes place must accommodate the entire protein and is therefore much larger than a regular docking box, the number of energy evaluations carried out by the docking program is usually set to a proportionally higher value[60, 61], with a corresponding increase in the running time. This shortcoming has been partially overcome by using known protein binding sites as targets for reverse virtual screening[100].
Although this approach enables faster reverse virtual screening, it limits the universe of candidate targets to those proteins that have clearly identified binding sites, and only to those sites within the protein. Ideally, a reverse virtual screening approach would require only the knowledge of the three-dimensional structure of the candidate target proteins and would allow for the discovery of unexpected interactions that may occur at previously unidentified binding sites. One such approach has been described by Brown and Vander Jagt[19], in which a macromolecule encapsulating surface (MES) was used to geometrically define the boundaries of predicted binding sites and guide the docking search. On a set of 14 protein-ligand complexes the MES approach was shown to improve the efficiency of the genetic algorithm-based optimizer in the AutoDock[92] docking software.

The alternative option of predicting a set of putative sites and carrying out docking on them one at a time is investigated here. The use of binding sites calculated directly from the docking grid (i.e. interaction energy-based calculation) is evaluated as a tool to focus the docking searches of the AutoDock[92, 65] software. This idea results in an approach consisting of multiple independent docking runs carried out on smaller boxes, centered on a few predicted binding sites, as opposed to one larger blind docking run that covers the complete protein structure. By comparing the focused docking approach with reference blind docking runs over a set of 77 ligand-protein complexes and 19 ligand-free proteins, the following questions will be addressed: Is focused docking more accurate than blind docking? Is there a real gain in computational efficiency when using focused docking? Is there a penalty paid (e.g. missed binding sites) when using focused docking?

Figure 2.2: Blind docking and focused docking.
The blind protocol consists of a single docking experiment, carried out on the whole protein surface, whereas the focused protocol breaks up the problem into multiple smaller docking experiments, focusing on predicted binding sites.

2.2 Materials and Methods

2.2.1 Binding site identification

A more detailed description of the algorithm can be found in Chapter 2. Here I will only recall the main ideas behind the approach.

The algorithm to predict the location of potential binding sites for drug-like molecules is based on principles similar to those that underlie the QSiteFinder[81] algorithm. Both algorithms identify the regions characterized by favorable van der Waals interactions, which have been shown to play an important role in the binding of drug-like molecules to proteins.

The first step requires the computation of a low-resolution (1.0 Å) carbon affinity map with AutoGrid (part of the AutoDock suite v. 4), using a box large enough to accommodate the entire protein. In the next step, a predefined energy cutoff (-0.3 kcal/mol for all cases) is applied to filter out all the affinity map points corresponding to unfavorable interaction energies. Subsequently, the remaining points are clustered according to their spatial proximity with an agglomerative hierarchical clustering algorithm using average linkage, as implemented in the C Cluster Library[33]. This step yields a hierarchical dendrogram, which is finally cut into non-overlapping clusters by applying a distance cutoff (7.8 Å for all cases). This last step is made possible by the fact that average linkage clustering produces monotonic hierarchies. In other words, the distance between clusters at each merging step never decreases. Therefore, the number of clusters need not be determined a priori; only the value for the distance cutoff must be chosen.
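The filtering and clustering steps can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the SITEHOUND code: grid points are represented as hypothetical (x, y, z, energy) tuples, points exactly at the energy cutoff are kept, and the naive merge loop simply stops once the closest pair of clusters is farther apart than the distance cutoff, which for a monotonic linkage is equivalent to cutting the dendrogram at that height.

```python
import math

def filter_points(points, energy_cutoff=-0.3):
    """Keep grid points (x, y, z, energy) whose energy is at or below the cutoff."""
    return [p for p in points if p[3] <= energy_cutoff]

def average_linkage(a, b):
    """Mean Euclidean distance over all pairs of points taken from clusters a and b."""
    total = sum(math.dist(p[:3], q[:3]) for p in a for q in b)
    return total / (len(a) * len(b))

def cluster(points, distance_cutoff=7.8):
    """Naive agglomerative clustering: repeatedly merge the closest pair of clusters
    and stop as soon as the closest pair is farther apart than the cutoff."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > distance_cutoff:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Made-up grid points: two favorable patches about 20 Å apart and one unfavorable point.
grid = [
    (0.0, 0.0, 0.0, -0.50), (1.0, 0.0, 0.0, -0.40), (0.0, 1.0, 0.0, -0.35),
    (20.0, 0.0, 0.0, -0.60), (21.0, 0.0, 0.0, -0.50),
    (10.0, 10.0, 10.0, -0.10),
]
sites = cluster(filter_points(grid))
print([len(s) for s in sites])
```

The pairwise search above is cubic in the number of points and is only meant to make the logic explicit; a library implementation of hierarchical clustering would be the practical choice for real affinity maps.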
Finally, the clusters so obtained are ranked by Total Interaction Energy (TIE, the sum of the energy values of all the points that belong to the same cluster) and the first three are selected for focused docking (see below). The spatial localization of the clusters is characterized by their center of energy (COE, the average of their coordinates weighted by energy). The only two parameters that this method requires are the energy cutoff to filter the grid points and the distance cutoff for the clustering step. A range of values for these two parameters was tested, and a combination (-0.3 kcal/mol and 7.8 Å, respectively) was chosen that yielded the most accurate binding site prediction as defined by the accuracy measure introduced by Laurie[81]. In terms of computational overhead for the binding site prediction step, it is worth noting that the time required to run the SITEHOUND program is negligible with respect to the time required for a full docking experiment. The median time calculated on the dataset was [...]

[...] $r_c$    (4.3)

where $d_{ij}$ is the Euclidean distance between two nodes and $r_c$ is a specified distance cutoff. From the matrix $\Gamma$ one can compute the Hessian of the system, a block matrix defined as:

\[
H_{ij} = -\frac{\gamma\,\Gamma_{ij}}{(R^0_{ij})^2}
\begin{pmatrix}
X_{ij}X_{ij} & X_{ij}Y_{ij} & X_{ij}Z_{ij} \\
Y_{ij}X_{ij} & Y_{ij}Y_{ij} & Y_{ij}Z_{ij} \\
Z_{ij}X_{ij} & Z_{ij}Y_{ij} & Z_{ij}Z_{ij}
\end{pmatrix}
\qquad (4.4)
\]

\[
H_{ii} = -\sum_{j \neq i} H_{ij} \qquad (4.5)
\]

where $X_{ij}$, $Y_{ij}$ and $Z_{ij}$ are the components of the distance vector between two nodes i and j in the x, y, and z direction respectively and $\gamma$ represents the spring constant, identical for all pairs of nodes.

By performing the spectral decomposition of H one obtains 3N − 6 eigenvectors with corresponding non-zero real eigenvalues, which represent the directions (modes) along which the collective motion of the nodes takes place. The low-frequency modes (which express the highly collective motions of the nodes) can be picked out by selecting the eigenvectors with the lowest corresponding eigenvalues.

For this analysis, the top three modes were selected, and a sampling along these modes was performed. The conformations were generated by computing a displacement along each of the three modes, ranging from -180 to 180 arbitrary units with a step size of 20. This procedure yielded 6859 conformers for each pair (19 displacement values per mode, and 19^3 = 6859).

4.2.3.1 Fitting

In order to obtain a rough estimate of the range and relative contribution of the normal modes to the protein motion (from the unbound to the bound form), a least-squares analysis was carried out to estimate the optimal linear combination of the first 10 modes that yields the best fit. More formally,

\[
M x \approx b - u \qquad (4.6)
\]

where M is the matrix containing the eigenvectors (arranged in columns), b is the vector containing the coordinates of the bound form and u the vector with the unbound coordinates. x is the vector of coefficients of the linear combination of the eigenvectors that yields the best fit.
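When the eigenvectors are orthonormal, as is the case for normalized eigenvectors of a symmetric Hessian, the least-squares solution of Equation 4.6 reduces to x = M^T (b − u), i.e. one dot product per mode. The sketch below illustrates this shortcut on made-up data. The function names, the flat coordinate layout and the two-atom example are assumptions for illustration, modes are stored as a list of vectors rather than as columns of a matrix, and the thesis does not state how the fit was actually implemented.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fit_modes(modes, unbound, bound):
    """Least-squares coefficients for M x ≈ b - u, assuming orthonormal modes."""
    diff = [b - u for b, u in zip(bound, unbound)]
    return [dot(mode, diff) for mode in modes]

def apply_modes(modes, unbound, coeffs):
    """Displace the unbound coordinates along each mode by its coefficient."""
    fitted = list(unbound)
    for c, mode in zip(coeffs, modes):
        fitted = [f + c * v for f, v in zip(fitted, mode)]
    return fitted

def rmsd(a, b):
    """RMSD over atoms for flat coordinate vectors [x1, y1, z1, x2, y2, z2, ...]."""
    n_atoms = len(a) // 3
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / n_atoms)

# Hypothetical two-atom system with two orthonormal "modes".
modes = [
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
]
unbound = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
bound = [3.0, 4.0, 0.0, 0.0, 0.0, 1.0]

coeffs = fit_modes(modes, unbound, bound)
fitted = apply_modes(modes, unbound, coeffs)
print(coeffs, rmsd(unbound, bound), rmsd(fitted, bound))
```

The residual RMSD of the fitted structure comes entirely from the part of the displacement that lies outside the span of the chosen modes, which is why adding modes can only lower it.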
In geometric terms, the vector containing the difference between the bound and the unbound forms is projected onto the space spanned by the 10 lowest-frequency eigenvectors, yielding a displacement for each mode that converts the unbound form into a conformation that is as similar to the bound form as possible (given the normal modes).

4.2.4 Side-chain Modeling and MIFs Calculations

The structures generated as described in section 4.2.3 have been processed with SCWRL3[36], a side-chain modeling software that uses a backbone-dependent rotamer library cast in the Bayesian framework to account for rarely occurring rotamers. No energy minimization was performed.

Subsequently, the residues within 5 Å from the ligand in the crystal structure of the reference bound form were extracted and the center of the binding site computed (for each of the conformers generated with the ENM). Finally, EASYMIFs[46] was used to compute the MIFs around these centers with the carbon (C), methyl (CMET), nitrogen (N), hydroxyl oxygen (OA), phospho oxygen (OP) and water oxygen (OW) probes.

4.2.5 Comparing MIFs derived from binding sites

The ensemble of MIFs (6 maps per conformer, one for each probe) was compared against the MIFs derived from the bound form. In order to deal with some of the problems outlined in Chapter 1 and circumvent the rotational and translational dependence of many indices used for comparing maps, a modified version of the algorithm described in Osada et al.[98] was implemented. The main idea is to derive a multidimensional feature vector from the probability of observing a given distance between a point and the centroid of the map, where the probability is a function of the energy of the points (the more favorable the interaction energy, the higher the chance that a point has of being selected). The size of the maps used in this analysis allowed an exact enumeration of all the distances between the centroid and the points, weighted by the probability (energy) of the points.
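A minimal sketch of such an energy-weighted centroid distance histogram is given below. Several details are assumptions made for illustration and are not taken from the thesis: points are (x, y, z, energy) tuples, the centroid is the unweighted mean of the coordinates, the weight of a point is the magnitude of its favorable (negative) energy, and the number of bins and the maximum distance are arbitrary.

```python
import math

def centroid_shape_distribution(points, n_bins=10, max_dist=10.0):
    """Normalized histogram of point-to-centroid distances, weighted by favorable energy."""
    coords = [p[:3] for p in points]
    n = len(coords)
    centroid = tuple(sum(c[k] for c in coords) / n for k in range(3))
    # Assumed weighting: more negative energy means larger weight, unfavorable points count zero.
    weights = [max(0.0, -p[3]) for p in points]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no point with a favorable (negative) energy")
    hist = [0.0] * n_bins
    for c, w in zip(coords, weights):
        d = math.dist(c, centroid)
        # Distances beyond max_dist are collected in the last bin.
        b = min(int(d / max_dist * n_bins), n_bins - 1)
        hist[b] += w / total
    return hist

# Made-up map: three favorable points and one unfavorable point.
points = [
    (0.0, 0.0, 0.0, -1.0),
    (6.0, 0.0, 0.0, -1.0),
    (0.0, 2.0, 0.0, -2.0),
    (0.0, 0.0, 4.0, 0.5),
]
print(centroid_shape_distribution(points))
```

Because only distances to the centroid enter the histogram, translating or rotating the whole map leaves the result unchanged, which is the property that motivates this kind of descriptor.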
The fingerprint obtained from these energy-weighted centroid distances is called a centroid shape distribution.

To quantify the distance between the shape distributions describing the binding site in the bound form and the ones representing the ensemble of conformations, the Kullback-Leibler divergence (KLD)[77] has been used:

\[
D_{KL}(P \,\|\, Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)} \qquad (4.7)
\]

It is worth noting that the KLD does not induce a metric space since it is not symmetric. This fact does not represent a problem for the application described here, since the comparison is always directional (i.e. one looks for the top n conformations that are most similar to the template, which is the reference bound form in this application). A schematic representation of the method is given in Figure 4.4.

4.3 Results

4.3.1 Normal Mode Fitting

In order to assess the ability of the ENM to generate structures close to the bound conformation, the fitting procedure described in Section 4.2.3.1 has been applied to the 11 pairs of structures in the dataset. The results (Table 4.1 and Figure 4.3) indicate that the top 2-3 modes are usually able to yield a conformation with an RMSD lower than 3 Å (9 out of 11 cases), with 6 out of 11 cases having an RMSD lower than 2 Å. In one case (1lfg 1lfh, a lactoferrin), the ENM does not seem to fully capture the motion from the unbound to the bound form.

As pointed out by Tama et al.[114], when the motion is collective the first mode is adequate to yield a close fit to the bound form. Indeed, Figure 4.3 clearly illustrates this point, since the largest drop in RMSD occurs with the topmost mode and, in the case where the fitting is not successful (1lfg 1lfh), not even 10 modes are sufficient to reduce the RMSD below 3 Å.

Taken together, these results indicate that the ENM applied to this dataset can yield structures that approximately resemble the bound form. Therefore, the bound form identification procedure described in Section 4.2.5 can be meaningfully applied.

Figure 4.3: Contribution of normal modes to fitting. For all the cases included in the dataset an ANM fitting was performed, with a number of normal modes ranging from 1 to 10. The resulting RMSD is plotted against the number of modes employed in the fitting procedure. For most pairs the top three modes are usually enough to reach a low RMSD, while in a few cases (e.g.
1lfg 1lfh) more are needed.

4.3.2 Bound Form Identification

For each of the 11 cases in the dataset the top 20 conformers closest to the bound form in MIFs space have been selected, and the Gaussian Weighted RMSD from the bound form has been computed. The results are shown in Table 4.2. In most cases it is possible to find one conformer among the top 20 that is similar to the fitted conformation, which represents a lower bound on the RMSD, being the optimal conformation that can be obtained with a given set of modes.

Figures 4.5 and 4.6 show two examples of bound form identification, where the lowest-RMSD structures among the top 20 ranking conformers (out of 6859 structures) have been superimposed on the bound form. Interestingly, not all probes behave identically, and some probes seem to be better suited to identifying the bound form than others (e.g. the OA probe for the 1lst 2lao pair).

Figure 4.4: Binding site MIFs comparison. Two MIF maps are shown, one derived from a bound structure and the other from an unbound structure. The grayscale used to represent the points is proportional to the interaction energy (darker shades indicate more favorable interaction energies). The shape distributions derived from the maps as described in section 4.2.5 are shown in the bottom plot. The Kullback-Leibler divergence is used to compute the distance between them.

Table 4.2: Overview of the results with the centroid shape function. The initial RMSD and the RMSDs (in Å) obtained with the best out of the top 20 conformations ranked by similarity to the bound form using the carbon (C), methyl (CMET), nitrogen (N), hydroxyl oxygen (OA), phospho oxygen (OP) and water oxygen (OW) probes are reported.

PDB Codes    Initial   Min    Min C   Min CMET   Min N   Min OA   Min OP   Min OW
1ake 4ake      8.2     3.6     3.9      3.9       4.2     5.3      5.3      4.2
1anf 1omp      7.3     1.3     4.0      4.0       4.5     4.5      2.5      2.5
1gky 1ex6      4.4     1.8     4.0      6.7       3.2     3.2      2.7      3.2
1jg6 1jej      2.8     1.0     4.0      3.8       3.6     3.7      3.6      4.2
1lfg 1lfh      8.2     5.1     8.3      5.3       9.2    10.9      5.9      7.4
1lst 2lao      8.6     1.7     2.6      4.0       2.8     1.7      2.8      2.8
1quk 1oib      5.7     1.0     1.1      3.5       2.8     2.3      1.3      2.8
1rpj 1gud      6.2     1.3     1.3      4.5       2.1     2.9      2.0      2.1
1suv 1bp5     12.5     2.3     3.1      3.9       4.2     4.4      4.8      4.1
1wdn 1ggg     10.2     2.3     3.0      2.3       2.3     2.4      2.3      2.3
2dri 1ba2     12.5     2.1     8.2      2.9       2.1     2.6      2.9      2.1

Figure 4.5: Identification of the bound form of a protein. The green structure represents the bound form of an E. coli phosphate binding protein (PDB code: 1quk[130]). The red structure superimposed on the left is the corresponding unbound form (PDB code: 1oib[130]) with a backbone RMSD of 5.7 Å. The blue structure on the right is the 3rd ranking conformation generated from the unbound form and selected with the shape function approach using a carbon probe (final RMSD: 1.1 Å).

4.4 Discussion

The results presented above provide a proof of concept for the MIFs-based approach to screen an ensemble of structures and seek conformers that resemble the bound form. The ENM has been chosen to generate the ensemble of conformations because of its ability to capture large collective motions and its limited computational demands. However, the bound form identification approach presented here is by no means restricted to a particular sampling technique, and alternative methods (such as Monte Carlo or Molecular Dynamics) could be better suited to deal with other situations. For example, the ENM would not work well in cases where the motion from the unbound to the bound form is not collective.

Another important point worth mentioning is that the shape distributions used to select the topmost 20 structures have been derived directly from the bound forms. This is clearly a situation that would not occur in real applications, since the very goal of the procedure is to get information about the possible conformation of the bound form when only the unbound form is available. The shape distributions would have to be derived from other structures bound to the same ligand (or perhaps a similar one). It is conceivable that the performance would deteriorate, but more robust approaches could be implemented to deal with this issue. For example, multiple structures bound to the same ligand could be analyzed and clustered, to identify potentially different binding modes. It should be possible to compute an average shape distribution by including multiple structures whenever available, thereby increasing the robustness of the approach.

Figure 4.6: Identification of the bound form of a protein. The green structure represents the bound form of the Salmonella typhimurium lysine-arginine-ornithine binding protein (PDB code: 1lst[96]). The red structure superimposed on the left is the corresponding unbound form (PDB code: 2lao[96]) with a backbone RMSD of 8.6 Å. The blue structure on the right is the 11th ranking conformation generated from the unbound form and selected with the shape function approach using the hydroxyl oxygen probe (final RMSD: 1.7 Å).

The idea of comparing properties derived from the binding site in a translationally and rotationally invariant fashion could also be applied to the problem of binding site classification. Binding site classification refers to the possibility of identifying a potential set of ligands that could bind to a given binding site. Several approaches have been conceived to accomplish this task[56]. The method presented here could be applied to compare an unknown binding site against a library of well characterized sites, identifying the closest ones in the MIFs space.
Some of the properties of the ligands known to bind to the structures closest to the query could then be used to infer the type of ligand (and therefore gain functional information about the unknown site) or to refine a lead compound to target the site.

The advantages of such an approach reside in comparing structurally derived properties that are directly related to binding, without sequence or structural constraints and in a fast and computationally inexpensive way.

4.5 Conclusions

The problem of inferring the function of a protein in the context of the complex network of interactions is one of the most crucial challenges faced by Computational Biology today. Knowing the binding partners of proteins is an essential step to untangle the web of functional relationships that control cellular processes, and the identification and the characterization of a protein binding site represent an important step to achieve this goal. Some of the techniques that have been developed by the bioinformatics community over the years have been discussed, together with their limitations and applicability range, in Chapter 1.

This work proposes a framework to perform binding site identification on protein structures by means of an energy-based approach based on the concept of Molecular Interaction Fields (MIFs). The approach has been validated on a large set of bound and unbound protein structures, and a specific application of binding site identification in the context of reverse virtual screening has also been discussed (Chapter 2). The advantage of using chemically specific probes to compute the MIFs has been demonstrated by applying the binding site identification procedure to phosphorylated ligands.
Furthermore, an improved version of the energy-based binding site identification approach that incorporates evolutionary information has been presented, and its advantage in situations where the energy-based signal is weak has been emphasized (Chapter 3).

As an attempt to move beyond the problem of binding site identification, a methodology that can be applied to infer the bound conformation of a protein starting from an unbound form has been introduced, with a preliminary validation on 11 structures (Chapter 4).

Taken together, the results presented in this work indicate that the energy-based approach with multiple-probe MIFs provides a versatile framework to carry out binding site identification and hint at the possibility of identifying the bound form of structures that undergo large conformational changes. Furthermore, the problem of predicting the type of ligand that a binding site can accommodate lies among the future challenges that could benefit from the methodology described here.

Appendix A

Introduction to Clustering

A.1 Brief overview of clustering in SITEHOUND

The main idea implemented in SITEHOUND is to group the points of the interaction energy map that have passed the energy filter into clusters and to rank them by Total Interaction Energy (TIE). It is important to understand the options related to the clustering step in order to effectively use the program. The principles of clustering algorithms and the relevant parameters used by SITEHOUND are discussed here.

The fundamental goal of a clustering algorithm can be considered as finding a partition of a set of points, defined in a multidimensional space, according to some optimality criterion (usually, one seeks to minimize intra-cluster distances and maximize inter-cluster distances). It is worth pointing out that finding the optimal partition is computationally intractable in general (the problem is NP-hard), since an exhaustive search would have to examine all the possible partitions of the points, whose number grows faster than exponentially with the number of points.
In practice, one can resort to heuristics that make the problem amenable to computation and yield satisfactory results.

More formally, given:

\[
x_1 = \{x_{11}, x_{12}, \ldots, x_{1n}\}, \; \ldots, \; x_m = \{x_{m1}, x_{m2}, \ldots, x_{mn}\} \qquad (A.1)
\]

as a set of m points belonging to an n-dimensional space, we can define the following two quantities:

\[
D_p(x_1, x_2) \qquad (A.2)
\]

\[
D_c(R, S) \qquad (A.3)
\]

that represent the distance between two points $x_1$ and $x_2$ and the distance between two clusters R and S, respectively. A natural choice for $D_p$ in our problem is the simple Euclidean distance between the points.

Figure A.1: Effects of linkage on clustering results. a) and b) show the results of average and single linkage on cyclin-dependent kinase 2 (PDB code 1ke5). Single linkage yields a better coverage of the binding pocket, which is quite elongated. On the other hand, for human pregnenolone sulfotransferase (PDB code 1q1q) average linkage is the best choice, since it corresponds more closely to the ligand contour.

One of the most widely used heuristics to approach the clustering problem is to proceed from the bottom to the top by iteratively merging clusters until one cluster containing all the points is obtained. This is where the $D_c$ quantity plays a role, by defining the distance between clusters. The name linkage is commonly used to indicate this quantity. SITEHOUND incorporates two types of linkage, single and average, defined in the following way:

\[
D_c^{single}(R, S) = \min_{x_1 \in R,\; x_2 \in S} D_p(x_1, x_2) \qquad (A.4)
\]

\[
D_c^{average}(R, S) = \frac{\sum_{x_1 \in R} \sum_{x_2 \in S} D_p(x_1, x_2)}{|R|\,|S|} \qquad (A.5)
\]

where the | | notation indicates the cardinality of the set (i.e. the number of points of the cluster).

An important property shared by these two linkages is that the distance between the clusters being merged never decreases from one step to the next. Therefore, it is possible to cut the hierarchy at a particular level and obtain the corresponding clusters. In SITEHOUND this level is called spatial cutoff. The type of linkage used affects (to some extent) the shape of the clusters obtained. In general, it can be shown that single linkage tends to yield more elongated clusters, whereas with average linkage the shape of the clusters is closer to a sphere. From a practical point of view, using single linkage can be more meaningful with peptide binding sites or elongated ligands, whereas average linkage performs better with small chemicals. These effects are illustrated in Figure A.1. In general, it is desirable to run the calculations with both types of linkage and compare the results. In some instances, with average linkage the binding site is split in two regions, whereas single linkage will tend to show one single site. This information could be valuable in the context of ligand design, since the two regions that show up with average linkage could both be exploited by connecting two fragments with a linker.

Appendix B

Focused Docking Setup

B.1 Selection of complexes

Both focused and blind docking experiments were carried out on the same set of complexes obtained from the Astex Diverse Set[53], a published collection of 85 protein-ligand crystal structures extracted from the Protein Data Bank (PDB) and specifically selected to evaluate the performance of docking algorithms. All water molecules and heteroatoms (including the ligands) were removed and, for the cases that contained identical sets of chains, only one set was retained.

B.2 Preparation of the proteins and ligands for docking

Gasteiger charges were added to both ligands and proteins, using the programs included in the AutoDockTools suite (version 1.4.5).
At that stage, eight cases that issued warnings and would have required manual intervention were removed, resulting in a final set of 77 complexes. The PDB codes of the selected chains are: 1gkcA, 1gm8, 1hnnA, 1hp0A, 1hq2, 1hvyD, 1hwiA1B, 1hww, 1ia1B, 1ig3, 1j3jA, 1jd0B, 1jjeA, 1jlaA, 1k3u, 1ke5, 1kzk, 1l2sB, 1l7f, 1lpz, 1lrhD, 1m2zA, 1meh, 1mzc, 1n1mA, 1n2jA, 1n2v, 1n46A, 1nav, 1of1B, 1opk, 1oq5, 1owe, 1oyt, 1p2y, 1p62, 1pmn, 1q1gF, 1q41A, 1q4gB, 1r1h, 1r55, 1r58, 1r9o, 1s19, 1s3v, 1sg0B, 1sj0, 1sq5A, 1sqnB, 1t40, 1t46, 1tow, 1tt1A, 1tz8B, 1u1cF, 1uml, 1unlA1D, 1uou, 1v0pA, 1v48, 1v4s, 1vcj, 1w1pB, 1w2gB, 1x8x, 1xm6A, 1xoqB, 1xoz, 1ygc, 1yqy, 1yvf, 1ywr, 1z95, 2bm2B, 2br1, and 2bsm.

For each single-chain binding site entry in the Astex Diverse Set a BLAST search was performed against the PDB database, selecting all the entries that had a sequence identity > 95% and a coverage > 95%. Subsequently, the cases that had mutated residues in the binding site were eliminated from the dataset. Finally, from the remaining cases only the entries that did not have any ligand in the binding site were selected. This procedure led to 19 unbound proteins corresponding to a subset of the 77 complexes described earlier. The PDB codes of the bound unbound pairs are: 1hq2 1hka, 1t46 1t45, 1ke5 1hcl, 1v0pA 1ob3A, 1l2sB 2blsA, 1v48 1pbn, 1l7f 1nmaN, 1w1pB 1e15A, 1n1mA 1r9mA, 1yvf 2girA, 1n2v 1pud, 1ywr 2okrA, 1oq5 2cbe, 2br1 1ia8, 1oyt 1vr1H, 2bsm 1uyl, 1q41A 1i09A, 1s3v 1pdb, and 1t40 1xgd.

To facilitate the comparison of docking results, the binding site residues in the unbound proteins were superimposed on the corresponding residues of the bound proteins using the backbone atoms of the residues that had at least one atom within 6.0 Å of the ligand heavy atoms in the