knowledge of native protein–protein interfaces is ... · called ksenia deduced from the...

24
HAL Id: hal-01229886 https://hal.inria.fr/hal-01229886 Submitted on 10 Sep 2018 HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés. Knowledge of Native Protein–Protein Interfaces Is Suffcient To Construct Predictive Models for the Selection of Binding Candidates Petr Popov, Sergei Grudinin To cite this version: Petr Popov, Sergei Grudinin. Knowledge of Native Protein–Protein Interfaces Is Suffcient To Con- struct Predictive Models for the Selection of Binding Candidates. Journal of Chemical Information and Modeling, American Chemical Society, 2015, 55 (10), pp.2242-2255. 10.1021/acs.jcim.5b00372. hal-01229886

Upload: others

Post on 22-Aug-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

  • HAL Id: hal-01229886https://hal.inria.fr/hal-01229886

    Submitted on 10 Sep 2018

    HAL is a multi-disciplinary open accessarchive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come fromteaching and research institutions in France orabroad, or from public or private research centers.

    L’archive ouverte pluridisciplinaire HAL, estdestinée au dépôt et à la diffusion de documentsscientifiques de niveau recherche, publiés ou non,émanant des établissements d’enseignement et derecherche français ou étrangers, des laboratoirespublics ou privés.

    Knowledge of Native Protein–Protein Interfaces IsSufficient To Construct Predictive Models for the

    Selection of Binding CandidatesPetr Popov, Sergei Grudinin

    To cite this version:Petr Popov, Sergei Grudinin. Knowledge of Native Protein–Protein Interfaces Is Sufficient To Con-struct Predictive Models for the Selection of Binding Candidates. Journal of Chemical Informationand Modeling, American Chemical Society, 2015, 55 (10), pp.2242-2255. �10.1021/acs.jcim.5b00372�.�hal-01229886�

    https://hal.inria.fr/hal-01229886https://hal.archives-ouvertes.fr

  • Knowledge of Native Protein-Protein Interfacesis Sufficient to Construct Predictive Models for

    the Selection of Binding Candidates

    Petr Popov and Sergei Grudinin∗

    Univ. Grenoble Alpes, LJK, CNRS, LJK, Inria, F-38000 Grenoble, France

    E-mail: [email protected]

    Abstract

    Selection of putative binding poses is a challenging part of virtual screening for protein-protein interactions. Predictive models to filter out binding candidates with the highest bind-ing affinities comprise scoring functions that assign a score to each binding pose. Existingscoring functions are typically deduced collecting statistical information about interfaces ofnative conformations of protein complexes along with interfaces of a large generated set ofnon-native conformations. However, the obtained scoring functions become biased toward themethod used to generate the non-native conformations, i.e. they may not recognize near-nativeinterfaces generated with a different method.

    Present study demonstrates that knowledge of only native protein-protein interfaces issufficient to construct well-discriminative predictive models for the selection of binding candi-dates. Here, we introduce a new scoring method that comprises a knowledge-based potentialcalled KSENIA deduced from the structural information about the native interfaces of 844crystallographic protein-protein complexes. We derive KSENIA using convex optimizationwith a training set composed of native protein complexes and their near-native conformationsthat are obtained using deformations along the low-frequency normal modes. As a result,our knowledge-based potential has only marginal bias toward a method to generate putativebinding poses. Furthermore, KSENIA is smooth by construction, which allows to use it alongwith a rigid-body optimization to refine the binding poses. Using several test benchmarkswe demonstrate that our method discriminates well native and near-native conformations ofprotein complexes from the non-native ones. Our methodology can be easily adapted to therecognition of other types of molecular interactions, such as protein-ligand, protein-RNA, etc.KSENIA will be made publicly available as a part of the SAMSON software platform athttps://team.inria.fr/nano-d/software.

    1

    [email protected]://team.inria.fr/nano-d/softwareSergei GrudininThis document is the Accepted Manuscript version of a Published Work that appeared in final form in Journal of Chemical Information and Modeling, copyright © American Chemical Society after peer review and technical editing by the publisher. To access the final edited and published work see https://pubs.acs.org/doi/full/10.1021/acs.jcim.5b00372.�

  • 1 Introduction

    Protein-protein interactions play crucial rolein the human interactome, orchestrating mostof the signaling network processes. Abruptchanges in protein-protein interactions lead tovarious kind of diseases, which makes proteinstructure prediction an important challenge inrational drug design. However, generally it isvery difficult to experimentally obtain struc-tures of protein complexes, thus computationalmolecular docking techniques are often usednowadays for protein-protein structure predic-tion. Typically, molecular docking as an in-tegral part of the drug discovery process in-volves the scoring stage, where one selects thebest putative binding candidates from the setof binding poses by assigning the score or theenergy value E to each candidate. The scoringstage incorporates sophisticated scoring func-tions,1 which are obtained with the empiricalforce-fields or using information derived fromexperimentally obtained structures of proteincomplexes. The latter type of scoring functionsbelongs to the family of the knowledge-based orstatistical scoring functions. The majority ofmodern knowledge-based scoring functions forthe protein-protein interactions are developedfollowing the observation that the distances be-tween the atoms in experimentally determinedstructures follow the Boltzmann distribution.2

    More precisely, using ideas from statisticaltheory of liquids, effective potentials betweenatoms are extracted using the inverse Boltz-mann relation, Eij(r) = −kBT log(Pij(r)/Z),where kB is the Boltzmann constant, Pij(r) de-notes the probability to find two atoms of cer-tain types i and j at a distance r, and Z de-notes the probability distribution in the refer-ence state. The latter is the thermodynamicequilibrium state of the protein when all in-teractions between the atoms are set to zero.The score of a protein conformation is thengiven as a sum of the effective potentials be-tween all pairs of atoms. Although this conceptis old and originates from the work of Tanakaand Scheraga,3 Miyazawa and Jernigan,4 andSippl,5–7 it is still under debates.8–11 Partic-ularly, the computation of the reference state

    is a challenging problem.12 Although some as-sumptions were made to ease the expression ofthe reference state for protein monomers,5,13–15

    to deduce scoring functions for the protein-protein docking, one usually computes the ref-erence state based on a large set of generatednon-native conformations of protein complexes(decoys).16,17 Another type of statistical poten-tials is constructed using the discriminative ma-chine learning, specifically, the linear program-ming approach.18–23 The basic idea behind thisapproach is to solve a system of inequalitiesthat demand the energy of the native confor-mation to be lower than the energy of all thedecoy conformations for a particular complex,E(P native)− E(P decoyi ) < 0, ∀P

    decoyi ∈ Pdecoy.

    Although this approach circumvents the com-putation of the reference state, its success crit-ically depends on the chosen set of decoy con-formations Pdecoy. Thereby, the obtained sta-tistical potential depends on the sampling algo-rithm used to generate the decoy conformationsand, generally, might not distinguish the nativestructures equally well from decoys obtained byanother sampling algorithm.

    In this study we discovered that knowledgeof only native protein-protein interfaces is suf-ficient to construct well-discriminative predic-tive models for the selection of putative bind-ing candidates. Namely, we introduce a newscoring method that comprises a knowledge-based potential called KSENIA deduced fromthe structural information about the native in-terfaces of 844 crystallographic protein-proteincomplexes. As a result, our approach does notrequire neither the computation of the referencestate nor the ensemble of non-native complexes.Thus, it can have only a marginal bias toward amethod to generate putative binding poses. Tothe best of our knowledge, this is the first inves-tigation of the knowledge-based potential thatneeds no information derived from non-nativeprotein-protein interfaces. More precisely, weuse convex optimization to train the knowledge-based potential on sets of near-native confor-mations with the average root mean square de-viation (RMSD) between monomers of 1 Å.These are composed using the deformationsalong the directions of low-frequency normal

    2

  • modes computed at the native conformations.We demonstrate that the obtained knowledge-based potential is capable to distinguish thenative and near-native protein-protein interac-tions from the non-native ones. Given thatrigid-body minimization refinement improvesthe scoring performance,24 we also implement arigid-body optimization protocol using the de-rived knowledge-based potential. Finally, weverify the robustness of our method on severalprotein-protein docking benchmarks.

    2 Theoretical Basis

    We consider N native protein-protein complexconformations P nativei , i = 1..N . For each pro-tein complex i we generate D decoys, P decoyij ,j = 1..D, where the first index runs over dif-ferent protein complexes and the second indexruns over generated decoys. Then we find a lin-ear scoring functional F , defined for all possiblecomplexes, such that for each native complex iand its decoy j the following inequality holds:

    F (P nativei ) < F (Pdecoyij ) (1)

    We express the scoring functional which fulfillsthese assumptions in the following form:

    F (P ) =M∑k=1

    M∑l=k

    rmax∫0

    nkl(r)Ukl(r) dr, (2)

    where nkl(r) is the number density of atom pairsat a distance r between two atoms of types kand l (kl-pair), with one atom located in thelarger protein (receptor), and the other atom lo-cated in the smaller protein (ligand). Here, Mis the total number of different atom types. Weused M = 20 atom types provided by Huangand Zou,16 which were defined by the classifica-tion of all heavy atoms in standard amino acidsaccording to their element symbol, aromatic-ity, hybridization, and polarity. The functionsUkl(r) are unknown scoring potentials, whichwe determine below.The number density nkl(r)is computed as a sum over all kl-pairs in a given

    protein complex via:

    nkl(r) =∑ij

    1√2πσ2

    e−(r−rij)

    2

    2σ2 (3)

    Here, each kl-pair at a distance rij is repre-sented by a Gaussian centered at rij with thestandard deviation of σ, which takes into ac-count possible inaccuracies and thermal fluctu-ations in the protein structure. In our work wechose σ = 0.4 Å, since this value demonstratedthe best results in the holdout cross-validationtests25 (see Section 2 of Supporting Informa-tion for more details). We considered only atompairs at distances below the threshold distancermax = 10 Å. Using Eq. (3), we can rewrite thescoring functional F (see Eq. (2)) as the sumover all kl-pairs of atoms i and j at a distancerij:

    F (P ) =∑ij

    1√2πσ2

    rmax∫0

    e−(r−rij)

    2

    2σ2 Ukl(r) dr =∑ij

    Υkl(rij)

    (4)We will refer to the functions

    Υkl(r) =1√

    2πσ2

    rmax∫0

    e−(x−r)2

    2σ2 Ukl(x) dx, (5)

    which are the Gauss transform of the scoringpotentials Ukl(x), as to the scoring functions.

    In order to determine unknown scoring po-tentials Ukl(r) (see Eq. (2)), we decomposethem along with the number densities nkl(r) ina polynomial basis:

    Ukl(r) =∑q

    wklq ψq(r), r ∈ [0; rmax]

    nkl(r) =∑q

    xklq ψq(r), r ∈ [0; rmax],(6)

    where ψq(r) are orthogonal basis functions onthe interval [r1; r2], and w

    klq with x

    klq are the ex-

    pansion coefficients of Ukl(r) and nkl(r), respec-tively. Here, we use a set of shifted rectangularfunctions as the basis.26 Given this, the scoringfunctional F (see Eq. (2)) can be expanded up

    3

  • to the order Q as:

    F (P ) ≈M∑k=1

    M∑l=k

    Q∑q

    wklq xklq = (w · x),

    w,x ∈ RQ×M×(M+1)/2,

    (7)

    where we use the order of expansion Q = 40.We will refer to the vector w as to the scoringvector and to the vector x as to the structurevector. Then, we can rewrite the set of inequal-ities (1) as a soft-margin quadratic optimizationproblem:27

    Minimize (in w, bi, ξij):12w ·w +

    ∑ij Cijξij

    Subject to:yij [w · xij − bi]− 1 + ξij ≥ 0, i = 1..N, j = 0..Dξij ≥ 0

    (8)Here, index i runs over different protein com-plexes and index j runs over different confor-mations of the i-th protein complex. Particu-larly, protein conformations with j = 0 are na-tive with the corresponding constants yi0 = +1and protein conformations with j = 1..D arethe decoys with the corresponding constantsyij = −1. Parameters Cij can be regardedas regularization parameters, which control theimportance of different structure vectors. Wefound the optimal values of Cij parametersusing the holdout cross-validation procedure25

    (see Section 2 of Supporting Information). Thescoring vector w, the offset vector b and theslack variables ξij are the parameters to be op-timized. The size of the optimization problem isdetermined by the dimensionality of the struc-ture and the scoring vectors, which is equal toQ×M×(M+1)/2 = 8400, and by the size of thetraining set, N = 844 and D = 225. The lat-ter is composed based only on local informationabout the native interfaces of protein-proteincomplexes and no other information is used (seeSection 3.3). We solve problem (8) in its dualform using the block sequential minimal opti-mization (BSMO) algorithm as explained else-where.26 Finally, given the solution of problem(8), i.e. the scoring vector w, one may restorethe scoring potentials Ukl(r) (see Eq. (6)), thescoring functions Υkl(r) (see Eq. (5)), and com-

    pute the score of a protein complex accordingto Eq. (4).

    3 Materials and Methods

    3.1 Rigid-Body Minimization

    The scoring functions Υkl(r) (see Eq. (5)) aresmooth by construction. This fact allows touse these functions for the structure optimiza-tion. More accurately, for a given kl-pair ofatoms at a distance rij, the negative gradient−∇Υkl(rij) could be regarded as the force withwhich one atom acts on the other atom. Thus,one may use the set of derived functions Υkl(r)to optimize a particular conformation of a pro-tein complex until a local minimum is reached,provided ∇Υkl(rij) = 0 for each pair of atoms.Since special calibration is required to retainstructure integrity of a complex, a more rele-vant structure optimization would be the rigid-body optimization, where instead of force min-imization over each pair of atoms, one mini-mizes the net force and the net torque actingon each monomer. The rigid-body optimiza-tion with functions Υkl(r) could be useful as arefinement step to process docking predictions.It was shown that rigid-body refinement couldimprove docking predictions dramatically.24 Incontrast to our scoring functions Υkl(r), mostof modern statistical potentials are not differen-tiable.14,28–30 Thereby, to perform structure op-timization with such potentials, one either usesa smooth interpolation of potentials, or employsvarious derivative-free optimization strategies,e.g. Nelder-Mead31 or Powell32 methods andtheir modifications, where the convergence rateis much slower compared to first- or higher- or-der optimization strategies. Following this idea,we implemented the local rigid-body minimiza-tion protocol to explore whether such an opti-mization improves scoring capabilities of KSE-NIA . General work-flow for the local rigid-bodyminimization is listed in Table 1.

    3.2 Normal Modes

    Let us consider a system of N particles with 3Ndegrees of freedom near the equilibrium state

    4

  • Table 1: The rigid-body minimization work-flow.

    1 Set initial parameters for the structure optimization.

    2Compute the score Uk of the current conformation and the descent direction dk inthe rigid-body space.

    3Find an appropriate step size α and make a step toward the descent direction:xk+1 = xk + αdk.

    4 Repeat steps 2-3 until desired tolerance or maximum number of iterations is achieved.5 Take the last computed score as the final score of the optimized conformation.

    x0. The potential energy of the system can beapproximated as a quadratic form:

    U(x1, x2, . . . , x3N) = U(x0) +1

    2

    3N∑i=1

    3N∑j=1

    Fijxixj,

    (9)where elements of the matrix of the quadratic

    form Fij =(

    δ2Uδxiδxj

    )x0

    are the force constants

    at the equilibrium state x0. There exist a dif-ferent set of coordinates yi, where both the ki-netic K and the potential U energies have thediagonal form and thus the Newton’s equationsof motion are uncoupled. This means that thesolution for the equations of motion for each co-ordinate can be obtained separately. These co-ordinates yi are called the normal coordinates,and the corresponding energy terms have thefollowing form:

    U(y1, y2, . . . , y3N) = U(x0) +1

    2

    3N∑i=1

    λiy2i ,

    K(y1, y2, . . . , y3N) =1

    2

    3N∑i=1

    ẏ2i

    (10)The transition matrix between the two coor-dinate bases is obtained via diagonalization ofmatrix M−

    12FM−

    12 = LDLT:

    U−U(x0) ≡1

    2xTFx =

    1

    2xTM

    12LDLTM

    12x =

    1

    2yTDy,

    (11)where M is the diagonal mass matrix, i.e.Mij = miδij. Thus, the connection betweenthe two coordinate systems is given as a lineartransformation

    x = M−12Ly (12)

    Normal coordinates provide a convenient wayto describe molecular fluctuations of a systemnear the equilibrium state. Particularly, theevolution of the system in the normal basis isthe superposition of the independent harmonicoscillations along each normal coordinate yi.Such oscillations are called normal modes33 andare expressed as:

    yi(t) = Ai cos (ωit+ δi), (13)

    where ωi ≡√Dii and δi correspond to the fre-

    quency and the phase of the i-th mode, respec-tively. The factor Ai =

    √2kBT/ωi is the am-

    plitude of the fluctuation. Given the transitionmatrix L between the two bases (see Eq. (12)),oscillations in the Cartesian basis can be writ-ten as:

    xk(t) = Lki(Ai cos (ωit+ δi))/√mk (14)

    Thus, all atoms in a molecule for a given modei oscillate with the same frequency and phase.However, the amplitude of the fluctuation of theCartesian coordinate xk, corresponding to theoscillation of the mode yi, is different for eachcoordinate k and is defined by the i-th columnof the transition matrix L:

    〈x2k〉i = L2kiA2i 〈cos (ωit+ δi)2〉/mk

    =1

    2mkL2kiA

    2i = L

    2ki

    kBT

    mkω2i

    (15)

    When all the modes are active, the amplitudeof the fluctuation of the Cartesian coordinatexk reads:

    〈x2k〉 =kBT

    mk

    ∑i

    L2kiω2i

    (16)

    5

  • We use this theoretical framework to constructthe training set of protein-protein complexes.A deeper discussion of normal modes analysisand its applications in structural biology can befound elsewhere.33–37

    3.3 Training Set

    3.3.1 Native Complexes

    We used the training database of 851 non-redundant protein-protein complex structuresprepared by Huang and Zou.16 This databasecontains protein-protein complexes extractedfrom the PDB38 and includes 655 homodimersand 196 heterodimers. We updated three PDBstructures from the original training database:2Q33 supersedes 1N98, 2ZOY supersedes 1V7B,and 3KKJ supersedes 1YVV. The trainingdatabase contains only crystal dimeric struc-tures determined by X-ray crystallography atresolution better than 2.5 Å. Each chain of thedimeric structure has at least 10 amino acids,and the number of interacting residue pairs, asdefined as having at least 1 heavy atom within4.5 Å, is at least 30. Each protein-protein inter-face consists only of 20 standard types of aminoacids. No homologous complexes were includedin the training database. Two protein com-plexes were regarded as homologues if the se-quence identity between receptor-receptor pairsand between ligand-ligand pairs was> 70%. Fi-nally, Huang and Zou16 manually inspected thetraining database and left only those structuresthat had no artifacts of crystallization.

    3.3.2 Near-native Decoys

    To exclude any possible bias to computationalmethods and potentials for the generation ofputative binding poses, we do not use standardmethods for the generation of non-native de-coys, but instead we construct our training setusing structural information about only pro-tein complexes in their native conformations.For the initial set of 844 native protein com-plexes we generated near-native conformations,i.e. conformations within RMSD = 3 Å, foreach native complex as follows. First, giventhe coordinate vector Xnative of each monomer

    in a protein complex, we computed its tenlowest-frequency normal modes. Then, weformed fifteen near-native conformations foreach monomer using the linear combinations ofthese modes :

    X̂ = Xnative +√kBTM

    −12

    10∑i=1

    riLiωi, (17)

    where√kBT is the temperature factor, M is

    the diagonal mass matrix, i.e. Mkl = mkδkl,ri is the random weight for each mode rangingfrom -1 to 1, Li is the i-th column of the transi-tion matrix between the Cartesian and the nor-mal mode bases, and ωi is the frequency of thei-th mode. The temperature factor

    √kBT af-

    fects the amplitude of the deformation, hence,too large temperatures cause a monomer to de-form significantly breaking the covalent bonds.We tested several values of the temperature fac-tor and found the optimal value of

    √kBT to be

    10 kJ12 (see Section 2 of Supporting Informa-

    tion). To ensure the absence of non-relevantconformations, we measured the RMSD be-tween the native and the generated conforma-tions. Indeed, the average RMSD is equal to1.02 Å, which means that the deformations withthe given temperature factor keep all generatedconformations non-disrupted. At the last step,we combined conformations X̂ of two monomersrepresenting one protein complex, resulted in15 × 15 = 225 near-native conformations. Tosummarize, the composed training set to derivethe knowledge-based potential contains 844 as-semblies, where each assembly consists of onenative protein complex and 225 generated near-native conformations.

    We used the MMTK library39 to perform thenormal mode analysis for protein molecules andthe OPLS-UA force-field40 to compute the forceconstants (see Eq. (9)). Since normal modesare defined for the equilibrium state of the sys-tem, we minimized each monomer of a dimerin a vacuum using 50 steps of the steepest de-scent algorithm with the relative energy toler-ance of 1e − 3 and the cut-off distance for allnon-bonded interactions of 5 Å. We chose sucha relatively small number of minimization stepsin order to not significantly deform the X-Ray

    6

  • structure of a monomer. Indeed, the RMSD be-tween the initial and the minimized monomerstructures did not exceed 0.5 Å. Given eachmonomer near the equilibrium state, we usedthe Fourier subspace for the reduced-basis nor-mal modes computations.41 We picked up tenfirst low-frequency modes from the Fourier ba-sis to generate different local deformations ofthe protein complexes. We should note that weexcluded the first six modes that correspond tothe rigid-body motion.

    Finally, we want to stress that all gener-ated conformations represent near-native pro-tein structures. Indeed, we use directions alongthe slowest normal modes to locally deformthe monomers, however, the orientations of themonomers with respect to each other are fixed.Since all the monomer conformations differ onlyslightly from the native monomers (the aver-age RMSD is 1.02 Å), the interaction inter-faces of all generated complexes undergo mod-erate changes keeping the major part of thenative contacts. To conclude, we composedthe training set based only on local informa-tion about the native interfaces and no otherinformation was used. In the Results section wedemonstrate the knowledge-based potential forprotein-protein interactions derived using thistraining set.

    3.4 Test Benchmarks

    Here, we describe the composed benchmarksto test and validate the KSENIA potential.For the accurate validation it is very impor-tant to ensure that the test and the train-ing sets do not overlap. The first test bench-mark consists of the complexes from the train-ing set, however with different binding inter-faces. The other benchmarks are built usingthe interfaces from the protein-protein dock-ing benchmark v. 2, 3 and 4.42 We ran thealign program from the FASTA2 package43

    in order to calculate the number of the ho-mologous interfaces between the training setand the protein-protein docking benchmark.There are no intersections between the two setswith the sequence similarity greater than 70%,there is one pair (1LEW–2OZA) with the sim-

    ilarity of 61 % and 12 pairs (1AVW–1AVX,1BIS– 2B4J, 1CSO– 3SGQ, 1E96– 1E96, 1KLJ–1FAK, 1MCV– 1FLE, 1SBW– 2UUY, 1SCJ–2SNI, 1SFI– 2UUY, 1UJZ– 7CEI, 1ZBX– 1ZHI,2NGR– 1GRN) with the similarity in between50% and 60%. For each of the benchmarks, weevaluate the success rate of our method withrespect to other tested methods, which is de-fined as the percentage of protein complexes forwhich docking predictions of a certain qualityare ranked at the top positions.

    3.4.1 Hex Test Benchmark

    For the first test, we constructed a rigid-bodybenchmark starting from the native structuresin the training set. More precisely, to gener-ate decoys we used the Hex rigid-body dock-ing program.44,45 For the Hex input, we usedpolar Fourier shape expansions to polynomialorder of 31, the real-space angular search stepof 7.5◦, the radial search range of 40 Å witha translation step of 2.5 Å and the subsequentsub-step of 1.25 Å. We ran Hex for each nativecomplex in the training set and clustered thedocking solutions with a threshold of 8 Å. Top200 docking predictions were added to the testbenchmark in addition to the native complexes,resulting in 201 × 844 = 169, 644 protein com-plexes. Finally, we evaluated the success rateof the Hex scoring function on the constructedbenchmark according to the quality of the dock-ing poses. Here we define the quality accordingto the value of RMSD of the backbone atomsof the ligand (LRMSD) after the receptors in thenative and the decoy conformations have beenoptimally superimposed (see Table 2). To doso, we used the fast open-source RigidRMSDlibrary46 that computes RMSDs given spatialtransforms of the docking poses.

    Table 2: Quality with respect to the value of LRMSD.

    Quality LRMSD( Å)

    1 LRMSD ≤ 12 1 < LRMSD ≤ 53 5 < LRMSD ≤ 10

    7

  • 3.4.2 Zdock Test Benchmark

    For the second test benchmark we used theprotein-protein docking benchmark v3.0 com-posed by Hwang et al., which consists of 124non-redundant protein-protein complexes.47

    Then, we employed Zdock v.3.0.1 rigid-bodydocking software,48 which uses a grid-basedrepresentation of two proteins and a three-dimensional fast Fourier transform to explorethe search space of rigid-body docking posi-tions. We used the bound conformation of eachmonomer in the benchmark for the Zdock in-put, randomly set initial protein orientationsand used the default parameters for the dockingpredictions. Finally, we chose 2000 best gener-ated rigid-body docking poses according to theZdock v.3.0.1. scoring function for each com-plex. Thus, the second test benchmark consistsof 124× 2, 000 = 248, 000 protein complexes.

    To evaluate the success rate of this scor-ing function on the constructed benchmark,we use the Critical Assessment of PRedictionof Interactions (CAPRI) criterion49 for a cor-rect prediction (Table 3). This is a more so-phisticated criterion compared to the one usedabove. More precisely, in addition to the ligand-RMSD, it involves the fraction of native con-tacts in the docking prediction fnat, and theinterface RMSD, IRMSD. The fnat parameteris the ratio of the number of native residue-residue contacts in the predicted complex tothe number of residue-residue contacts in thecrystal structure. A pair of residues from dif-ferent monomers are considered to be in con-tact if they are within 5 Å from each other.The IRMSD parameter is the RMSD of the in-terface region between the predicted and nativestructures after optimal superimposition of thebackbone atoms of the interface residues. Aresidue is considered as the interface residue ifany atom of this residue is within 10 Å from theother partner.

    3.4.3 ItScoreTest Benchmark

    Following Huang et al., we generated theItScore test set using 91 protein complexes andZdock v.2.1 docking program as it is describedin the original ItScore paper.16 Overall, this test

    set comprises 2000 decoys together with the na-tive structure per each of 91 protein complexesand we will refer to this set as to the ItScoretest benchmark. We evaluated the success rateson this set using the CAPRI prediction qualitycriterion (Table 3).

    3.4.4 Rosetta Test Benchmark

    Gray et al. generated the Rosetta benchmarkusing 54 complexes from the protein-proteindocking benchmark version 0.050 in both thebound and the unbound conformations. Foreach complex, the authors generated 1,000bound and 1,000 unbound decoys following theflexible docking protocol, which is a part of theRosettaDock suite.51 The first step in the pro-tocol is the random translation and rotation ofone of the proteins constituting the complex.Afterwards, the side chains are optimized si-multaneously with the rigid-body displacementof the protein. Finally, the full-atom minimiza-tion is performed to refine the conformation ofthe complex. We calculated the success rate ofthe RosettaDock protocol using the same qual-ity criterion as in CAPRI49 (Table 3). Both thebound and the unbound Rosetta benchmarksconsist of 54 × 1, 000 = 54, 000 protein com-plexes.

    3.4.5 SwarmDock Test Benchmark

    Finally, we tested our scoring function on theunbound decoy set prepared by Moal et al.1

    and generously provided by Mieczyslaw Tor-chala from Biomolecular modelling group of theFrancis Crick Institute. The SwarmDock de-coy set was generated using the SwarmDockdocking server52 with initial structures in theunbound state taken from the protein-proteindocking benchmark v.4.0.42 In total, the de-coy set consists of about 500 conformations pereach of 176 protein complexes, and there is atleast one correct prediction of at least accept-able quality (according to the CAPRI criterion,Table 3) for 122 complexes. Using this bench-mark, Moal et al. compared performance of115 various scoring functions,1 including finite-and coarse- grained docking potentials, their

    8

  • Table 3: CAPRI criterion to estimate the quality of docking predictions.

    Quality Condition

    1 fnat ≥ 0.5 and (LRMSD ≤ 1.0 or IRMSD ≤ 1.0)

    2(0.3 ≤ fnat < 0.5) and (LRMSD ≤ 5.0 or IRMSD ≤2.0) or (fnat ≥ 0.5 and LRMSD > 1.0 and IRMSD > 1.0)

    3(0.1 ≤ fnat < 0.3 and (LRMSD ≤ 10.0 or IRMSD ≤4.0) or (fnat ≥ 0.3 and LRMSD > 5.0 and IRMSD > 2.0)

    constitutive terms, molecular mechanics energyfunctions, and protein folding potentials. Thiscomparison provides a good reference point ofpredictive capabilities for different scoring func-tions.53 To be consistent with the assessmentpipeline of Moal et al., in this test we didnot use the refinement procedure and calcu-lated the success rates of KSENIA solely basedon the initial scores of the decoys. FollowingMoal et al., for the test we used only com-plexes from the protein-protein docking bench-mark ”v.4.0 update” in order to exclude biasin performance of scoring functions trained onthe previous versions of this benchmark. Thisupdate consists of 52 protein-protein complexesnon-homologous to the previous version of thebenchmark.

    3.5 Computational Considera-tions

    Here we briefly describe the computational de-tails of the potential derivation and the scoringprocedure with the KSENIA potential. The po-tential is derived off-line by solving problem (8)in its dual form using the BSMO algorithm asexplained elsewhere.26 This procedure is itera-tive and takes minutes to hours depending onthe desired convergence, the number of avail-able CPU cores and the amount of memory.Then, for practical applications, as it is de-scribed in more detail in Supporting Informa-tion, we use a cubic spline interpolation of theobtained potential. More precisely, for each of210 types of interactions, the spline coefficientsare pre-computed off-line at forty equidistantpoints.54 To compute the value of the scoringfunction for a protein complex, we construct thelist of intractions using the linked-cell neigh-

    bourlist algorithm. Then, for each pair of in-teractions from the list, we evaluate the valueof the potential and, if necessary, its derivativefrom the spline interpolation.

    Dist

    ribut

    ion

    0

    0.2

    0.4

    0.6

    0.8

    1

    1.2

    Angle (radian)0 1 2 3 4

    Distance (Å)0 2 4 6 8 10

    Figure 1: Distribution of the interval lengths in thefour-dimensional manifold where the partial deriva-tives of the scoring functional are the constant-signfunctions. These distributions are computed usingthe native structures in the training set. Blue, solid:interval length for the d-coordinate, which is the dis-tance between the centers of mass of two monomers.Green, dashed: interval length for the α-coordinate,which is the angle of rotation of the ligand about theaxis connecting the centers of mass. Orange, dotted:interval length for the β- and γ- coordinates, whichare the angles of rotation about two other orthogonalaxes.

    4 Results

    4.1 Scoring Functional

    Figure 2 presents three derived scoring func-tions (dashed) for different atom pairs. As onecan see, at short separation distances the scor-ing functions tend to zero. This is the arti-

    9

  • Score

    0

    20

    40

    60

    80

    Distance (Å)0 2 4 6 8 10

    Cα - Cα

    Score

    0

    20

    40

    60

    80

    Distance (Å)0 2 4 6 8 10

    C3 - C3

    Score

    0

    20

    40

    60

    80

    Distance (Å)0 2 4 6 8 10

    N2+ - O2-

    Figure 2: Examples of the derived distance-dependent scoring functions between atoms of types N2+− O2- ,C3 − C3 and Cα − Cα, respectively. Here, N2+ are guanidine nitrogens with two hydrogens, O2- are oxygensin carboxyl groups, C3 are aliphatic carbons bonded to carbons or hydrogens only and Cα are the backbone Cαatoms. Black, dashed: initially derived scoring functions without taking into account the absence of statisticsat short distances. Blue, solid: re-defined scoring functions that take into account the absence of statistics atshort distances.

    fact of the training set, and is mainly causedby the absence of observations of atom pairsat distances close to zero. However, we wantour scoring functions to be able to penalize con-formations in which steric clashes between themonomers are present. Thus, we re-define thescoring functions at short distances to form ar-tificial potential barriers (see Section 1 of Sup-porting Information). The initial scoring func-tions along with the modified scoring functionsare shown in Figure 2. We refer to the latteras to KSENIA, which stands for Knowledge-based Scoring function Employing only NativeInterfaces .

    The scoring functional F (see Eq. (4)) ofa particular protein complex P is computedas the sum of separate scores for each pair ofatoms within the cutoff distance rmax. Thus,F , as a function of 3 × (NA + NB) variables,where NA and NB are the numbers of atomsin molecules A and B respectively, is not iden-tically zero only in the conformational volumewhere at least one pair of atoms is within rmaxdistance. Since KSENIA typically possessesseveral maxima and minima (see Fig. 2), F islikely to be a rugged function in this volume.55

    However, we want to demonstrate that sinceour scoring functions were derived from the lo-cal deformations of the native conformations,the scoring functional F is smooth at least inthe neighborhood of the native conformation.To show this, we explored the behavior of thescoring functional F in the four-dimensionalmanifold of the 3× (NA + NB) conformational

    space. Namely, given two monomers, one ofwhich is fixed, we consider four coordinates cor-responding to the rigid-body degrees of free-dom: the distance d between the centers ofmass of the two monomers, the rotation of thefree molecule about the axis connecting the cen-ters of mass by an angle α , and two rota-tions about two other orthogonal axes by an-gles β and γ. Then, starting from the nativeconformation of the complex (d0, α0, β0, γ0), wecalculate partial derivatives in the vicinity ofthis conformation. More precisely, we sam-

    ple the first partial derivativeδF (d, α, β, γ)

    δeat points {e0 ± �, e0 ± 2�, e0 ± 3�, . . . }, wheree ∈ {d, α, β, γ}, and � is a sufficiently smallpositive value. At the point where the partialderivative changes its sign, we can not expecta gradient-based local minimization algorithmto find the nearest local minimum to the point(d0, α0, β0, γ0). Thus, one can characterize thesmoothness of the scoring functional F at thepoint (d0, α0, β0, γ0) by four intervals (e0 −m�, e0 + n�), where the partial derivative is aconstant-sign function. Figure 1 shows the dis-tribution of such interval lengths over the nativeconformations in the training set. The mostprobable size of the smooth region around thenative conformation is 2.2 Å, 0.42 rad, 0.22 rad,0.22 rad in four degrees of freedom, respectively.Practically, it means that the rigid-body mini-mization started from an arbitrary point withinthis region is expected to optimize the confor-mation corresponding to this point toward the

    10

  • conformation corresponding to the local mini-mum of this region, assuming that F is convexin the neighborhood of the native conformation.

    Freq

    uenc

    y (%

    )

    0

    10

    20

    30

    40

    RMSD (Å)0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2

    Table1_1

    Figure 3: Histogram representing the distributions ofthe RMSDs between the native and minimized con-formations in the training set using the rigid-bodyminimization protocol.

    Finally, it remains to prove that the point rep-resenting the native conformation in the four-dimensional manifold lies close to the local min-imum. To demonstrate this, we measured theRMSD between the native conformation andthe conformation obtained after the rigid-bodyminimization with the KSENIA potential start-ing from the native conformation. Figure 3shows the distribution of such RMSDs in thetraining set. As it could be seen, the min-imized and native structures are very similarand the corresponding RMSD does not exceed2 Å. Moreover, the most probable RMSD be-tween the two conformations is 0.1 Å.

    To summarize, we demonstrated that thescoring functional F is a smooth function inthe vicinity of the native conformation. Hence,the rigid-body minimization is expected to im-prove predictions if started at an arbitrary pointin this vicinity. Below, we provide numericalexperiments that demonstrate the practical im-portance of the rigid-body minimization withKSENIA.

    4.2 Performance on the TestBenchmarks

    The aim of any scoring function is to dif-ferentiate the native and near-native confor-

    mations of protein complexes from the non-native ones. In this section we demonstratethat observing only the native protein com-plexes is sufficient to build a powerful and well-discriminative knowledge-based potential. Us-ing six different protein-protein benchmarks de-scribed in Section 3.4, we evaluate the suc-cess rate of our method as described above.For each benchmark we also provide successrates of the widely-used scoring functions ofHex,44 Zdock,48 Rosetta,51 ItScore-PP16 andthose tested by Moal et al.1 as the reference.

    4.2.1 Hex Test Benchmark

    In the first test, we used the Hex test bench-mark (see Section 3.4.1). Although the train-ing set and this benchmark share the same na-tive structures, their decoys are very different.More precisely, for the training, we generatedlocal deformations at the protein-protein inter-faces for all native complexes using directionsalong the low-frequency normal modes. Onthe other hand, to generate decoys for the testbenchmark, we performed the exhaustive searchin the six-dimensional space of rigid-body mo-tions. Consequently, many different interfacesfor each native complex are present. Further-more, owing to the clustering of spatially closedocking predictions, there are no similar inter-faces in the test benchmark. Thus, the goal ofthe first test is to demonstrate that employingonly local information about the native inter-faces is sufficient to derive a well-discriminativescoring function.

    We ranked all docking poses in the trainingset according to the values of the initial scor-ing functions and the values of KSENIA. Figure4 presents the corresponding success rates forthe top predictions. Clearly, the derived scor-ing functions predict the native interfaces verywell, providing the success rates of more than90% for the top one predictions. To exploreif our scoring functions can distinguish correctinterfaces (generated by Hex with quality-one,-two or -three) from the non-native ones, weremoved the native structures from the testbenchmark, leaving only predictions with non-zero rotational part of the spatial transform.

    11

  • 90%

    93%

    95%

    98%

    100%

    Top1, q=1,2,3 Top10, q=1,2 Top1, q=1 Top10, q=1

    Suc

    cess

    rat

    e Initial SFs KSENIA

    Figure 4: Performance of the scoring functions on theHex test benchmark. Success rates of the initial scor-ing functions (Initial SFs) are depicted with the bluerectangles. Success rates of KSENIA are depictedwith the yellow rectangles. TopN value is defined asthe percentage of protein complexes for which at leastone of the docking prediction with the correspondingquality q is found within the first N docking poses.The quality of predictions q is evaluated according tothe value of LRMSD (see Table 2).

    0%

    25%

    50%

    75%

    Top1, q=1,2,3 Top10, q=1,2 Top1, q=1 Top10, q=1

    Suc

    cess

    rat

    e

    Initial SFs KSENIA KSENIA+RBM Hex

    Figure 5: Performance of the scoring functions on thereduced Hex test benchmark. Success rates of the ini-tial scoring functions (Initial SFs) are depicted withthe solid blue rectangles. Success rates of KSENIAare depicted with the solid yellow rectangles. Suc-cess rates of KSENIA along with the rigid-body mini-mization (KSENIA+RBM) are depicted with the solidgreen rectangles. Success rates of the Hex scoringfunction are depicted with the solid purple rectangles.Hollow rectangles of the corresponding color repre-sent the maximum achievable success rates. TopNvalue is defined as the percentage of protein com-plexes for which at least one of the docking predic-tion with the corresponding quality q is found withinthe first N docking poses. The quality of predictionsq is evaluated according to the value of LRMSD (seeTable 2).

    We will refer to the obtained set as to the re-duced Hex test benchmark. Figure 5 shows re-computed success rates for the top predictions(solid rectangles). In this figure, we also listthe maximum success rates of the scoring func-tions (hollow rectangles) as the percentage ofprotein complexes for which Hex could predictposes of the corresponding quality. From Fig-ure 5 one can see that the derived scoring func-tions provide a similar success rate as the Hexscoring function, which is solely based on theshape-complementarity term. However, the ini-tial scoring functions slightly out-perform KSE-NIA on the reduced Hex test benchmark. Pre-sumably this is because we lose some informa-tion when re-defining potentials at short dis-tances (see Section 1 of Supporting Informa-tion). Nonetheless, KSENIA is dedicated tobe used with the local rigid-body minimiza-tion for the refinement of the docking predic-tions. Thereby, at the next step we used therigid-body minimization protocol (see Section3.1 and Table 1) to optimize the docking poses.Then, we ranked the optimized docking predic-tions according to the values of KSENIA andre-evaluated the success rates (Figure 5, greensolid rectangles). We found that the rigid-bodyminimization dramatically improves the scoringresults. In particular, the rigid-body minimiza-tion increased the total number of quality-oneposes, improving near-natives poses towardsnatives and rising the maximum success ratefrom 28% to 66%. Moreover, the correspondingsuccess rates are more than twice better com-pared to both the success rates of Hex and thesuccess rates of scoring without the refinementprocedure. Such a significant improvement canbe explained by the fact that the initial Hexpredictions have moderate steric clashes andthus are not very suitable for the scoring withKSENIA without subsequent docking pose op-timization. To summarize, we demonstratedthat employing structural information of onlynative interfaces, it is possible to distinguishnear-native conformations of protein complexesfrom the non-native decoys. We have alsoshown that it is possible to refine docking pre-dictions using a smooth knowledge-based statis-tical potential with a rigid-body minimization

    12

  • procedure, which improves the quality of thepredictions and the overall performance of thescoring method. Below, we further investigatethe capability of our approach on more compli-cated test benchmarks.

    4.2.2 Zdock Test Benchmark

    For the Zdock benchmark set (see Section 3.4.2)we applied the rigid-body minimization pro-tocol with KSENIA, as in the previous sec-tion, ranked the poses and compared the suc-cess rates against Zdock v.3.0.1 scoring func-tion, which includes the shape-complementarityterm, the electrostatic term and the desolvationterm. Figure 6 shows results obtained on thisbenchmark. Our approach shows around threetimes better success rate for the top one quality-one, -two or -three predictions. We observedthat similarly to the Hex test benchmark, therigid-body minimization improves the resultsof scoring significantly, indicating that the re-finement procedure is very crucial for KSE-NIA. We should also note, however, that foreight complexes in the benchmark, the rigid-body minimization deteriorated several quality-one predictions to quality -two or -three. Thus,the maximum number for the top one quality-one predictions is reduced from 97% to 91%.Nonetheless, our method demonstrates aroundseven times better success rate for the top onepredictions with the highest quality comparedto the Zdock v.3.0.1 scoring function.

    4.2.3 Comparison with the ItScore-PPpotential

    The results of the previous sections show thatKSENIA outperforms simple empirical scoringfunctions that include shape, electrostatic, anddesolvation terms. This indicates that nativeprotein complexes contain themselves all nec-essary information to derive robust knowledge-based scoring function. In order to support thisclaim further, we compared the performance ofthe KSENIA potential with the performanceof the ItScore-PP potential.16 The ItScore-PPpotential is a distance-dependent knowledge-based scoring function obtained using an it-

    0%

    25%

    50%

    75%

    100%

    Top1, q=1,2,3 Top10, q=1,2 Top1, q=1 Top10, q=1

    Suc

    cess

    rat

    e

    KSENIA+RBM Zdock

    Figure 6: Performance of the scoring functions onthe Zdock test benchmark. Success rates of KSE-NIA along with the rigid-body minimization (KSE-NIA+RBM) are depicted with the solid green rect-angles. Success rates of the Zdock scoring functionare depicted with the solid purple rectangles. Hol-low rectangles of the corresponding color representthe maximum achievable success rates. TopN valueis defined as the percentage of protein complexes forwhich at least one of the docking prediction with thecorresponding quality q is found within the first Ndocking poses. The quality of predictions q is eval-uated according to the CAPRI criterion (see Table3).

    erative technique that avoids the explicit es-timation of the reference state probabilities.ItScore-PP was derived using the same atomtypization and the same set of native complexesfor its training as our scoring function. Further-more, the authors also used ItScore-PP in com-bination with the refinement procedure whenoptimizing the docking candidates. Thus, it isvery interesting to compare the scoring powerof ItScore-PP and KSENIA.

    Similarly to the previous tests and followingthe scoring procedure used by ItScore-PP, weran KSENIA in combination with the rigid-body refinement, ranked the predictions withrespect to the score and evaluated the successrates according to the CAPRI criterion. Theobtained results are presented in Figure 7. Thesuccess rates of the Zdock v.2.1 and Zdock v.2.3scoring functions are adapted from Huang etal.16 and plotted for the reference comparison.The former scoring function evaluates only theshape complementarity term, the latter also ac-counts for electrostatic and desolvation effects.As one can see, our scoring function slightlyoutperforms the ItScore-PP potential, provid-

    13

  • ing 100% success rate for the top 10 predictionsof acceptable quality. It is very important tonote that, except for nine complexes, there isneither repetition nor highly homologous com-plexes in the test set compared to the trainingset.16 Hence, there is no bias due to the overlapbetween the training and the test sets and thetrue success rates are represented.

    We should note that here we did not verifythe performance of KSENIA on the protein-protein unbound benchmark generated with theZdock software. More precisely, after the rigid-body docking applied to the monomers in theunbound conformations, side-chains of the in-terface residues are, generally, in non-optimalconformations, which might be crucial for KSE-NIA. Instead, we verified the performance ofKSENIA on the Rosetta bound and unboundtest benchmarks and also using the SwarmDocktest set, where side-chain conformations are op-timized.

    0%

    25%

    50%

    75%

    100%

    Top1, q=1,2,3 Top10, q=1,2,3

    Suc

    cess

    rate

    Zdock 2.1 Zdock 2.3 ItScore-PP KSENIA+RBM

    Figure 7: Performance of the scoring functions onthe ItScore test benchmark. Success rates of KSE-NIA along with the rigid-body minimization (KSE-NIA+RBM) are depicted with the solid purple rectan-gle. Success rates of the ItScore-PP scoring functionare depicted with the solid green rectangle. Successrates of the Zdock v.2.1 and v.2.3 scoring functionsare depicted with the solid blue and yellow rectangles,respectively. Results for the ItScore-PP potential aswell as for the Zdock v.2.1 and v.2.3 scoring func-tions are adapted from Huang et al.16 TopN valueis defined as the percentage of protein complexes forwhich at least one of the docking prediction with atleast the acceptable quality is found within the first Ndocking poses. The quality of predictions is evaluatedaccording to the CAPRI criterion (see Table 3).

    4.2.4 Rosetta Test Benchmark

    Comparison of the performance of the Rosetta’sscoring function against our rigid-body min-imization with KSENIA is presented in Fig-ure 8 for both the bound and the unboundbenchmarks. As one can see, although Rosetta

    0%

    25%

    50%

    75%

    Top1, q=1,2,3 Top10, q=1,2 Top1, q=1 Top10, q=1

    Suc

    cess

    rat

    e

    KSENIA+RBM (Bound) Rosetta (Bound)KSENIA+RBM (Unbound) Rosetta (Unbound)

    Figure 8: Performance of the scoring functions on theRosetta bound and unbound test benchmarks. Suc-cess rates of KSENIA along with the rigid-body mini-mization (KSENIA+RBM) are depicted with the solidgreen and the solid blue rectangles for the Rosettabound and unbound test benchmarks, respectively.Success rates of the Rosetta scoring function are de-picted with the solid red and the solid purple rectan-gles for the Rosetta bound and unbound test bench-marks, respectively. Hollow rectangles of the cor-responding color represent the maximum achievablesuccess rates. TopN value is defined as the percent-age of protein complexes for which at least one of thedocking prediction with the corresponding quality q isfound within the first N docking poses. The qualityof predictions q is evaluated according to the CAPRIcriterion (see Table 3).

    itself performs slightly better, our approachstill demonstrates very good results despitethe complexity of these benchmarks. Indeed,the native contacts for all the complexes inthe benchmark are disturbed owing to theside-chain re-packing or homologous replace-ment, for example. In addition, our scoringmethod does not take into consideration theindividual scores of the monomers. In par-ticular, it does not penalize rare rotamericstates of the side-chains, which are present inthe benchmark. Nonetheless, using only dis-tance distributions between the atoms in differ-ent monomers at their native and near-native

    14

  • states, our knowledge-based potential is capa-ble to rank quality-one poses at the top positionfor around 60 % of cases for the Rosetta boundbenchmark, and to rank quality-one, -two or -three poses at the top position for around 45 %of cases for the Rosetta unbound benchmark.

    4.2.5 SwarmDock Test Benchmark

    Moal et al. compared the performance of115 various scoring functions1 on a decoy setwas generated using the SwarmDock dockingserver.52,56 The corresponding success rates ofthe best 40 scoring functions provided by Moalet al.1 along with the success rates of KSENIAare presented in Figure 9.

    As one can see, the KSENIA potential per-forms relatively well compared to the rest ofthe assessed potentials, being among best 40potentials out of 115. The detailed performanceof each scoring function on the entire protein-protein docking benchmark v.4.0 is listed inTable S1 from Supporting Information. Whilethe problem of rigid-body pair-wise docking isconsidered to be solved,57 the flexible dockingproblem still remains to be a challenge. Thus, itis interesting to see the success rates of scoringfunctions on the medium and difficult cases ofprotein-protein docking benchmark v.4.0. Here,we used the full benchmark, otherwise there aretoo few complexes for the test. We evaluatedthe corresponding performance of KSENIA andcompared it with the success rates of other scor-ing functions provided by Moal et al.1 Remark-ably, KSENIA performs very competitively, re-sulted in 6 correctly predicted complexes, andonly six scoring function out of 115 performedbetter (see Figure 10).

    4.3 Crystallographic SymmetryMates as Docking Predic-tions

    We observed that in several cases non-native de-coys replace near-native predictions at the toppositions after the rigid-body minimization ap-plied. As the result, the success rate becomesless than it could be, since the near-native pre-dictions get a lower rank. For example, Table 4

    lists scores before and after the rigid-body min-imization applied to the protein complex 1ZC6from the Hex test benchmark.

    In terms of the ligand-RMSD, the decoystructure significantly differs from the nativeone: LRMSD > 60 Å. However, we found thatthe interface formed by the decoy monomers issimilar to the one of the crystal-packing inter-faces that are observed in the crystal structure.Typically, only one of the interfaces presentedin the crystal is considered to be the native in-terface, and other crystal-packing interfaces orcrystal contacts are considered to be the ar-tifacts of crystallization (Fig. 11). However,distinguishing between the native interface andthe crystal contacts is a challenging problem,since both are formed following the same phys-ical principles.58,59 For the case of homodimer1ZC6, LRMSD between the decoy and the com-plex forming the crystal contact is about 2.8 Å.We found these observations to be the addi-tional evidence of the prediction capability ofKSENIA.

    4.3.1 Performance in CAPRI

    To conclude the Results section we brieflyoverview performance of our team in CAPRI60

    rounds 26, 27 and 30, where the KSENIA scor-ing function was used. First, we used theHex software45 in order to generate preliminarydocking poses. Then, we refined the poses us-ing the rigid-body minimization algorithm incombination with the KSENIA potential. Ad-ditionally, the SCWRL4 package61 was used ateach iteration of the rigid-body minimization inorder to optimize side-chain conformations. Fi-nally, the best ten candidates were selected asthe submission models for CAPRI. Figure 12presents correct predictions for protein-proteinCAPRI targets of Rounds 26-27 obtained withthe described docking pipeline.

    For the Targets 53-54 there were no unboundstructure of one of the monomers and thusthe homology modeling with the I-TASSERserver62 was used in order to generate initialdocking models. For Target 53 our dockingpipeline succeeded to provide one acceptable-quality prediction among ten top-ranked mod-

    15

  • AP_D

    DGau

    INSI

    DECP

    _TEs

    AP_U

    RS ELE

    ALIP

    HCH

    ARM

    M_E

    LEK

    SEN

    IAAP

    _dFI

    RECP

    _DEC

    KAP

    _GEO

    MET

    RIC

    AP_D

    COM

    PLEX

    CP_Q

    aCP

    _Qp

    DESO

    LVAT

    TRAC

    TCP

    _RM

    FCA

    CG_E

    NVPR

    OPNS

    TSAP

    _GOA

    P_AL

    LCP

    _TB

    CP_B

    TCP

    _Qm

    CP_T

    DZR

    ANK

    CP_Z

    S3DC

    _MIN

    ROSE

    TTAD

    OCK

    CP_M

    IXRA

    NK_M

    INCP

    _HLP

    LCP

    _SKO

    bCP

    _TEl

    CP_E

    S3DC

    _MIN

    PYDO

    CK_T

    OTAP

    _ACE

    CHAR

    MM

    _TOT

    CP_B

    FKV

    CP_S

    CKOa

    CP_S

    JKG

    CP_V

    DAA

    _PRO

    PSP

    IDER

    PISA

    AP_O

    PUS_

    PSP

    FIRE

    DOCK

    _AB

    FIRE

    DOCK

    _El

    CP_T

    SCAP

    _T1

    AP_D

    ARS

    SIPP

    ERAP

    _T2

    CP_M

    J3h

    CP_S

    KOIP

    FIRE

    DOCK

    ZRAN

    K2

    0

    5

    10

    15

    20

    25

    Figure 9: The success rates of the best 40 scoring functions adapted from Moal et al.1 along with the successrates of the KSENIA potential on the protein-protein docking benchmark ”v.4.0 update”.The number of com-plexes for which an acceptable or better solution could be found in the top 1 and top 10 solutions was calculatedfor each scoring function. Acceptable quality solutions are shown in orange, medium quality solutions in green,and high quality solutions in blue for the two measures (top 1 left, top 10 middle). The functions are orderedby the top 10 success rate. We kept the same labels for the scoring function as in Moal et al.1

    INSI

    DECP

    _DDG

    rwCP

    _ZLO

    CAL_

    MIN

    CP_E

    3DC_

    MIN

    ODA

    AP_c

    alRW

    pAP

    _GEO

    MET

    RIC

    CP_H

    LPL

    CP_T

    DCP

    _TEl

    FA_P

    PCG

    _VDW

    CG_B

    ETA

    CP_Z

    LOCA

    L_CB

    CP_Z

    S3DC

    _CB

    CP_E

    S3DC

    _CB

    ELE

    AP_c

    alRW

    AP_G

    OAP_

    GCP

    _MIX

    RANK

    _MIN

    _CB

    CP_D

    DGru

    CP_B

    TCP

    _MJ3

    hCP

    _Qa

    CP_Q

    mCP

    _Qp

    AP_D

    COM

    PLEX

    ACE_

    INTE

    AA_P

    ROP

    AP_M

    PSCP

    _ZPA

    IR_M

    INPY

    DOCK

    _TOT

    ATTR

    ACT

    AP_G

    OAP_

    DFAP

    _OPU

    S_PS

    PFI

    REDO

    CKFI

    REDO

    CK_E

    lCP

    _MIX

    RANK

    _MIN

    CHAR

    MM

    _TOT

    CHAR

    MM

    _SAS

    ACP

    _SKO

    bCP

    _SKO

    IPCP

    _RM

    FCA

    ACE_

    SCRE

    ZRAN

    KNI

    PCG

    _ENV

    CP_D

    ECK

    CP_E

    PAIR

    _CB

    CP_E

    3D_M

    INPR

    OPNS

    TSSI

    PPER

    AP_G

    OAP_

    ALL

    ALIP

    HAP

    _T1

    AP_T

    2AP

    _DDG

    awCP

    _BFK

    VCP

    _SJK

    GNS

    CAP

    _DAR

    SPI

    _PI

    FIRE

    DOCK

    _AB

    CP_T

    BCP

    _TSC

    AP_D

    DGau

    CHAR

    MM

    _ELE

    KSE

    NIA

    CP_S

    CKOa

    ZRAN

    K2CP

    _ZS3

    DC_M

    INPI

    SACP

    _ES3

    DC_M

    INSP

    IDER

    0

    2

    4

    6

    8

    10

    Figure 10: The success rates of the best 40 scoring functions adapted from Moal et al.1 along with the successrates of the KSENIA potential on the medium and difficult cases of the protein-protein docking benchmarkv.4.0. The number of complexes for which an acceptable or better solution could be found in the top 1 andtop 10 solutions was calculated for each scoring function. Acceptable quality solutions are shown in orange andmedium quality solutions in green for the two measures (top 1 left, top 10 middle). The functions are orderedby top 10 success rate. We kept the same labels for the scoring function as in Moal et al.1

    16

  • Table 4: Scores for the native and one of the decoy structures before and after the rigid-body minimization.

    1ZC6 Score Score after the rigid-body minimization

    Udecoy -1594.740 -3036.307 (rank 1)Unative -1810.758 (rank 1) -2144.868

    BA

    A' B'B'

    B'

    B' B'

    B'

    Figure 11: Schematic representation of the nativeinterface (orange, solid) and crystal contacts (blue,dashed). The unit cell is depicted as the gray par-allelogram encompassing monomers A and B, whichform the native interface.

    Figure 12: The native and predicted structures of theprotein-protein complexes for CAPRI Targets. Left:native structure of Target 53 (grey) and acceptable-quality model produced by the docking pipeline (thetwo monomers are coloured in red and blue, respec-tively). Right: native structure of Target 58 (grey)and medium-quality model produced by the dockingpipeline (the two monomers are coloured in red andblue, respectively).

    els. However, there were no successful predic-tions for Target 54, probably due to the largedifference between the true structure and thehomologue model (only 4 teams out of 42 suc-ceeded to produce acceptable-quality predic-tions). For Target 58 we obtained one medium-quality prediction and only four other teams outof 22 succeeded to produce predictions of thesame quality. CAPRI Targets 55 and 56 wereaimed to test methods for evaluating the effectof point mutations on protein-protein interac-tion affinity. Predictors were provided with thecomprehensive datasets on the effects of everypoint mutant of two designed protein bindersof influenza hemagglutinin63 . Generally, pointmutations stabilized the protein folds and someof them also provided effect on the binding ofthe complex. It turned out to be very diffi-cult to predict the effect of the mutations fol-lowing physics-based principles. As a result,only machine-learning methods provided statis-tically significant correlation between the pre-dicted values and the measured Kd constants.Particularly, we obtained a good correlation be-tween our score and the binding affinity forpoint mutations corresponding to four residueslying on the interface between the two proteinsand failed otherwise63 . CAPRI Round 30 waslaunched in collaboration with the Critical As-sessment of Structure Predictions of proteins(CASP).64 Overall, 42 interfaces in 25 targetswere designated to CAPRI, comprising 19 pro-tein dimers and 6 tetramers. In this round, weobtained correct predictions for 11 targets, in-cluding 10 predictions of medium quality.

    4.4 Discussion

    Reference state-based statistical methods re-quire a large set of false-positive examples ofprotein complexes, i.e. non-native conforma-tions, in order to compute the reference state.

    17

  • Linear and quadratic programming approacheswhen training scoring functions also use a setof generated false-positive examples in order toconstruct the system of inequalities (1). It isa common practice in protein-protein dockingto select as false-positive examples those de-coys that possess the best score according tosome well-accepted scoring function.16,17,21 Onthe contrary, we have selected false-negative ex-amples purely based on the structure of pro-tein complexes in their native conformations.More precisely, our decoy sets were generatedin such a way that the average RMSD betweenthe corresponding monomers in the decoys andin the native structures is about 1 Å, keep-ing the relative orientation of the monomersfixed. Nonetheless, despite our training setdoes not contain non-native conformations withlarge RMSDs with respect to the native struc-tures, we are able to reconstruct the atom-atom distance-dependent scoring functions (seeEq. (5)). As we have shown above, theobtained potentials demonstrate surprisinglygood results on different protein-protein dock-ing benchmarks. We would like to emphasizethat all the benchmarks mostly consist of non-native decoys that have large RMSDs with re-spect to the native structures. Thus, our resultsstrongly suggest that the native protein com-plexes themselves contain all necessary struc-tural information to build well-discriminativepotentials that recognize native and near-nativeprotein-protein conformations.

    Regarding the disadvantages of the proposedmethodology, i.e. derivation of the KSENIApotential, we can point out two aspects. First,current statistic observations do not take intoaccount conformations of individual monomers.This means that, in principle, we can imagine asituation when two very unrealistic structuresof two monomers (all atomic coordinates insideeach monomer are the same, for example) resultin a good score of the complex. To circumventthis problem, one may either collect extra geo-metric information, such as triplet, quadruplet,etc. distributions of atoms in the complex, oradditionally score individual monomers. Sec-ond, in our training set there are no statis-tics at short separation distances between the

    monomers inside a complex. Thus, as a result,we need to define potential barriers at short dis-tances for the proper behaviour of the obtainedscoring functions.

    We would also like to stress that even thoughthe KSENIA potential is derived using localperturbations of the native protein structures,it generally has no bias toward a method to gen-erate docking predictions. This is because forthe construction of the training set we did notuse any standard docking prediction method.Thus, the rigid-body minimization is very im-portant for the success of the proposed scoringmethodology. Namely, the minimization is re-quired to resolve steric clashes that often ap-pear in docking predictions produced by vari-ous methods, particularly those that use a gridsearch without subsequent refinement of thedocking predictions. For example, Zdock andHex use a soft shape complementarity poten-tial, which permits moderate overlaps betweenthe monomers in a complex. Generally, we be-lieve that structure optimization should be theinevitable step of a general scoring procedurewhen one has no information about dockingpredictions to score.

    Our method does not, in principle, require ex-ternal packages, potentials, or algorithms nei-ther to generate the training set, nor to for-mulate and solve the optimization problem. Inthe present study, to generate the local defor-mations, we computed low-frequency normalmodes using the MMTK package with a united-atom force-field.39 However, the normal modescan be computed in a simpler way using, e.g.the elastic-network model,65 the gaussian net-work model,66 the rotation-translation of blocksmethod,67 etc. Thus, methodology presented inthis paper can be easily adapted to the recog-nition of other types of molecular interactions,such as protein-ligand, protein-RNA, etc., pro-vided that the atom types assignment is modi-fied appropriately.

    5 Conclusions

    Present study demonstrates that knowledge ofonly native protein-protein interfaces is suffi-

    18

  • cient to construct well-discriminative predic-tive models for the selection of binding candi-dates. Namely, we introduced a new scoringmethod that comprises a knowledge-based po-tential called KSENIA deduced from the struc-tural information about the native interfaces of844 crystallographic protein-protein complexes.The knowledge-based potential relies on theinformation obtained thanks to the deforma-tions of these interfaces computed along thelow-frequency normal modes. As a result, incontrast to existing scoring functions, our po-tential does not require neither the computa-tion of the reference state nor the ensembleof non-native complexes. Thus, it can haveonly a marginal bias toward a method to gen-erate putative binding poses. Moreover, KSE-NIA is smooth by construction, which allowsto use it along with the gradient-based rigid-body minimization. Particularly, we showedthat the rigid-body optimization of the dock-ing poses improves the scoring stage of molec-ular docking. Using several test benchmarkswe demonstrated that our method out-performsthe Hex scoring function, which is based on theshape complementarity between the monomersin a complex, and the Zdock scoring function,which also includes the electrostatic and de-solvation terms. We have also demonstratedthat the KSENIA potential slightly outper-forms the ItScore-PP potential, which is theatomic distance-dependent scoring function de-rived using the same set of native complexes.We find remarkable that the native proteincomplexes themselves contain all necessary in-formation to derive a successful and well-discriminative knowledge-based potential. Al-though our method performs slightly worse onthe Rosetta test benchmark and SwarmDocktest benchmark compared to the more sophisti-cated scoring functions, we believe that furtherimprovements of KSENIA, e.g. accounting forthe integrity of monomers, rotamer optimiza-tion, etc., will eliminate this disadvantage.

    Methodology presented in this paper canbe easily adapted to the recognition of othertypes of molecular interactions, such as protein-ligand, protein-RNA, etc. Finally, we want tonote that we have successfully used KSENIA in

    the CAPRI protein docking experiment start-ing from Round 26.60 We will make KSENIApublicly available as a part of the SAMSONsoftware platform developed in our group athttps://team.inria.fr/nano-d/software.

    Acknowledgements The authors acknowl-edge Mieczyslaw Torchala from Biomolecularmodelling group of the Francis Crick Institutefor the provided SwarmDock test benchmark.

    Funding Sources This work was supportedby the Agence Nationale de la Recherche (ANR-2010-JCJC-0206-01, ANR-11-MONU-006-01).

    Associated Content Supporting Informa-tion Available: Description of the artificial bar-riers, the cross-validation procedure, and theperformance of scoring functions including theKSENIA one on the SwarmDock benchmark.This material is available free of charge via theInternet at http://pubs.acs.org.

    19

    https://team.inria.fr/nano-d/softwareh

  • References

    (1) Moal, I. H.; Torchala, M.; Bates, P. A.;Fernández-Recio, J. The Scoring of Posesin Protein-Protein Docking: Current Ca-pabilities and Future Directions. BMCBioinf. 2013, 14, 286.

    (2) Finkelstein, A.; Badretdinov, A.;Gutin, A. Why Do Protein Architec-tures Have Boltzmann-Like Statistics?Proteins: Struct., Funct., Bioinf. 2004,23, 142–150.

    (3) Tanaka, S.; Scheraga, H. A. Medium-and Long-Range Interaction Parametersbetween Amino Acids for PredictingThree-Dimensional Structures of Proteins.Macromolecules 1976, 9, 945–950.

    (4) Miyazawa, S.; Jernigan, R. Estimationof Effective Interresidue Contact Ener-gies from Protein Crystal Structures:Quasi-Chemical Approximation. Macro-molecules 1985, 18, 534–552.

    (5) Sippl, M. Calculation of ConformationalEnsembles from Potentials of Mean Force:an Approach to the Knowledge-Based Pre-diction of Local Structures in GlobularProteins. J. Mol. Biol. 1990, 213, 859–883.

    (6) Sippl, M.; Ortner, M.; Jaritz, M.; Lack-ner, P.; Flöckner, H. Helmholtz Free Ener-gies of Atom Pair Interactions in Proteins.Folding Des. 1996, 1, 289–298.

    (7) Koppensteiner, W.; Sippl, M. Knowledge-Based Potentials–Back to the Roots. Bio-chemistry 1998, 63, 247–252.

    (8) Thomas, P.; Dill, K. Statistical PotentialsExtracted from Protein Structures: HowAccurate Are They? J. Mol. Biol. 1996,257, 457–469.

    (9) Ben-Naim, A. Statistical Potentials Ex-tracted from Protein Structures: AreThese Meaningful Potentials? J. Chem.Phys. 1997, 107, 3698–3706.

    (10) Liu, S.; Zhang, C.; Zhou, H.; Zhou, Y.A Physical Reference State Unifies theStructure-Derived Potential of MeanForce for Protein Folding and Binding.Proteins: Struct., Funct., Bioinf. 2004,56, 93–101.

    (11) Deng, H.; Jia, Y.; Wei, Y.; Zhang, Y.What Is the Best Reference State for De-signing Statistical Atomic Potentials inProtein Structure Prediction? Proteins:Struct., Funct., Bioinf. 2012, 80, 2311–2322.

    (12) Leelananda, S. P.; Feng, Y.; Gniewek, P.;Kloczkowski, A.; Jernigan, R. L. In Mul-tiscale Approaches to Protein Modeling ;Kolinski, A., Ed.; Springer New York,2011; pp 127–157.

    (13) Lu, H.; Skolnick, J. A Distance-DependentAtomic Knowledge-Based Potential forImproved Protein Structure Selection.Proteins: Struct., Funct., Bioinf. 2001,44, 223–232.

    (14) Samudrala, R.; Moult, J. An All-AtomDistance-Dependent Conditional Proba-bility Discriminatory Function for ProteinStructure Prediction. J. Mol. Biol. 1998,275, 895–916.

    (15) Zhou, H.; Zhou, Y. Distance-Scaled, Fi-nite Ideal-Gas Reference State ImprovesStructure-Derived Potentials of MeanForce for Structure Selection and StabilityPrediction. Protein Sci. 2002, 11, 2714–2726.

    (16) Huang, S.-Y.; Zou, X. An IterativeKnowledge-Based Scoring Function forProtein–Protein Recognition. Proteins:Struct., Funct., Bioinf. 2008, 72, 557–579.

    (17) Chuang, G.-Y.; Kozakov, D.; Brenke, R.;Comeau, S. R.; Vajda, S. Dars (De-coys As The Reference State) Potentialsfor Protein-Protein Docking. Biophys. J.2008, 95, 4217–4227.

    20

  • (18) Maiorov, V. N.; Grippen, G. M. Con-tact Potential that Recognizes the CorrectFolding of Globular Proteins. J. Mol. Biol.1992, 227, 876–888.

    (19) Qiu, J.; Elber, R. Atomically DetailedPotentials to Recognize Native and Ap-proximate Protein Structures. Proteins:Struct., Funct., Bioinf. 2005, 61, 44–55.

    (20) Rajgaria, R.; McAllister, S.; Floudas, C.A Novel High Resolution Cα–Cα Dis-tance Dependent Force Field Based on aHigh Quality Decoy Set. Proteins: Struct.,Funct., Bioinf. 2006, 65, 726–741.

    (21) Tobi, D.; Bahar, I. Optimal Design of Pro-tein Docking Potentials: Efficiency andLimitations. Proteins: Struct., Funct.,Bioinf. 2006, 62, 970–981.

    (22) Ravikant, D.; Elber, R. PIE–Efficient Fil-ters and Coarse Grained Potentials for Un-bound Protein–Protein Docking. Proteins:Struct., Funct., Bioinf. 2010, 78, 400–419.

    (23) Chae, M.-H.; Krull, F.; Lorenzen, S.;Knapp, E.-W. Predicting Protein Com-plex Geometries with a Neural Network.Proteins: Struct., Funct., Bioinf. 2010,78, 1026–1039.

    (24) Mirzaei, H.; Beglov, D.; Paschalidis, I. C.;Vajda, S.; Vakili, P.; Kozakov, D. RigidBody Energy Minimization on Manifoldsfor Molecular Docking. J. Chem. TheoryComput. 2012, 8, 4374–4380.

    (25) Kohavi, R. A Study of Cross-Validationand Bootstrap for Accuracy Estimationand Model Selection. International JointConference on Artificial Intelligence. 1995;pp 1137–1145.

    (26) Derevyanko, G.; Grudinin, S. Convex-PP: Predicting Protein–Protein Interac-tions Using Polynomial Expansions andConvex Optimisation. Unpublished.

    (27) Boyd, S.; Vandenberghe, L. Convex Op-timization; Cambridge University Press,New York, 2004.

    (28) Huang, S.-Y.; Yan, C.; Grinter, S. Z.;Chang, S.; Jiang, L.; Zou, X. Inclu-sion of the Orientational Entropic Effectand Low-Resolution Experimental Infor-mation for Protein–Protein Docking inCritical Assessment of PRedicted Interac-tions (CAPRI). Proteins: Struct., Funct.,Bioinf. 2013, 81, 2183–2191.

    (29) Shen, M.-y.; Sali, A. Statistical Potentialfor Assessment and Prediction of ProteinStructures. Protein Sci. 2006, 15, 2507–2524.

    (30) Zhang, C.; Liu, S.; Zhu, Q.; Zhou, Y.A Knowledge-Based Energy Functionfor Protein-Ligand, Protein-Protein, andProtein-DNA Complexes. J. Med. Chem.2005, 48, 2325–2335.

    (31) Nelder, J. A.; Mead, R. A SimplexMethod for Function Minimization. Com-puter Journal 1965, 7, 308–313.

    (32) Powell, M. J. An Efficient Method forFinding the Minimum of a Function ofSeveral Variables Without CalculatingDerivatives. Computer Journal 1964, 7,155–162.

    (33) Wilson, E. B. Molecular Vibrations: theTheory of Infrared and Raman VibrationalSpectra; Dover Publications, New York,1955.

    (34) Brooks, B.; Karplus, M. Harmonic Dy-namics of Proteins: Normal Modes andFluctuations in Bovine Pancreatic TrypsinInhibitor. Proc. Natl. Acad. Sci. U. S. A.1983, 80, 6571–6575.

    (35) May, A.; Zacharias, M. Energy Minimiza-tion in Low-Frequency Normal Modes toEfficiently Allow for Global FlexibilityDuring Systematic Protein-Protein Dock-ing. Proteins: Struct., Funct., Bioinf.2008, 70, 794–809.

    (36) Venkatraman, V.; Ritchie, D. W. FlexibleProtein Docking Refinement Using Pose-Dependent Normal Mode Analysis. Pro-

    21

  • teins: Struct., Funct., Bioinf. 2012, 80,2262–2274.

    (37) Cavasotto, C. N.; Kovacs, J. A.;Abagyan, R. A. Representing ReceptorFlexibility in Ligand Docking ThroughRelevant Normal Modes. J. Am. Chem.Soc. 2005, 127, 9632–9640.

    (38) Berman, H. M.; Battistuz, T.;Bhat, T. N.; Bluhm, W. F.; Bourne, P. E.;Burkhardt, K.; Feng, Z.; Gilliland, G. L.;Iype, L.; Jain, S. The Protein DataBank. Acta Crystallogr., Sect. D: Biol.Crystallogr. 2002, 58, 899–907.

    (39) Hinsen, K. The Molecular ModelingToolkit: a New Approach to MolecularSimulations. J. Comput. Chem. 2000, 21,79–85.

    (40) Jorgensen, W. L.; Maxwell, D. S.; Tirado-Rives, J. Development and Testing of theOPLS All-Atom Force Field on Confor-mational Energetics and Properties of Or-ganic Liquids. J. Am. Chem. Soc. 1996,118, 11225–11236.

    (41) Hinsen, K. Analysis of Domain Motions byApproximate Normal Mode Calculations.Proteins: Struct., Funct., Genet. 1998,33, 417–429.

    (42) Hwang, H.; Vreven, T.; Janin, J.; Weng, Z.Protein–Protein Docking Benchmark Ver-sion 4.0. Proteins: Struct., Funct., Bioinf.2010, 78, 3111–3114.

    (43) Pearson, W. R.; Lipman, D. J. ImprovedTools for Biological Sequence Comparison.Proc. Natl. Acad. Sci. U. S. A. 1988, 85,2444–2448.

    (44) Ritchie, D. W.; Kozakov, D.; Vajda, S.Accelerating and Focusing Protein–Protein Docking Correlations UsingMulti-Dimensional Rotational FFT Gen-erating Functions. Bioinformatics 2008,24, 1865–1873.

    (45) Ritchie, D. W.; Venkatraman, V. Ultra-Fast FFT Protein Docking on Graph-ics Processors. Bioinformatics 2010, 26,2398–2405.

    (46) Popov, P.; Grudinin, S. Rapid Determina-tion of RMSDs Corresponding to Macro-molecular Rigid Body Motions. J. Com-put. Chem. 2014, 35, 950–956.

    (47) Hwang, H.; Pierce, B.; Mintseris, J.;Janin, J.; Weng, Z. Protein–Protein Dock-ing Benchmark Version 3.0. Proteins:Struct., Funct., Bioinf. 2008, 73, 705–709.

    (48) Pierce, B. G.; Hourai, Y.; Weng, Z. Accel-erating Protein Docking In ZDOCK Us-ing an Advanced 3D Convolution Library.PloS One 2011, 6, e24657.

    (49) Méndez, R.; Leplae, R.; De Maria, L.;Wodak, S. J. Assessment of Blind Pre-dictions of Protein–Protein Interactions:Current Status of Docking Methods. Pro-teins: Struct., Funct., Bioinf. 2003, 52,51–67.

    (50) Chen, R.; Weng, Z. Docking UnboundProteins Using Shape Complementarity,Desolvation, and Electrostatics. Proteins:Struct., Funct., Bioinf. 2002, 47, 281–294.

    (51) Gray, J.; Moughon, S.; Wang, C.;Schueler-Furman, O.; Kuhlman, B.;Rohl, C.; Baker, D. Protein–ProteinDocking with Simultaneous Optimizationof Rigid-Body Displacement and Side-Chain Conformations. J. Mol. Biol. 2003,331, 281–300.

    (52) Torchala, M.; Bates, P. A. Protein Struc-ture Prediction; Methods Mol. Biol. (N.Y., NY, U. S.); Springer, 2014; Vol. 1137;pp 181–197.

    (53) Moal, I. H.; Moretti, R.; Baker, D.;Fernández-Recio, J. Scoring Functions forProtein-Protein Interactions. Curr. Opin.Struct. Biol. 2013, 23, 862–867.

    22

  • (54) Press, W. H. Numerical Recipes 3rd Edi-tion: The Art of Scientific Computing ;Cambridge University Press, New York,2007.

    (55) Frauenfelder, H.; Sligar, S. G.;Wolynes, P. G. The Energy Land-scapes and Motions of Proteins. Science1991, 254, 1598–1603.

    (56) Torchala, M.; Moal, I. H.; Chaleil, R. A.;Fernandez-Recio, J.; Bates, P. A. Swar-mDock: a Server for Flexible Protein–Protein Docking. Bioinformatics 2013,29, 807–809.

    (57) Vakser, I. A. Protein-Protein Docking:From Interaction to Interactome. Biophys.J. 2014, 107, 1785–1793.

    (58) Kobe, B.; Guncar, G.; Buchholz, R.; Hu-ber, T.; Maco, B.; Cowieson, N.; Mar-tin, J. L.; Marfori, M.; Forwood, J. K.Crystallography and Protein-Protein In-teractions: Biological Interfaces and Crys-tal Contacts. Biochem. Soc. Trans. 2008,36, 1438–1441.

    (59) Krissinel, E. Crystal Contacts as Nature’sDocking Solutions. J. Comput. Chem.2010, 31, 133–143.

    (60) Janin, J.; Wodak, S. J.; Lensink, M. F.;Velankar, S. Assessing Structural Predic-tions of Protein–Protein Recognition: TheCAPRI Experiment. Rev. Comput. Chem.2015, 28, 137–173.

    (61) Krivov, G. G.; Shapovalov, M. V.; Dun-brack, R. L. Improved Prediction ofProtein Side-Chain Conformations withSCWRL4. Proteins: Struct., Funct.,Bioinf. 2009, 77, 778–795.

    (62) Yang, J.; Yan, R.; Roy, A.; Xu, D.; Pois-son, J.; Zhang, Y. The I-TASSER Suite:Protein Structure and Function Predic-tion. Nat. Methods 2015, 12, 7–8.

    (63) Moretti, R.; others., Community-WideEvaluation of Methods for Predicting theEffect of Mutations on Protein–Protein

    Interactions. Proteins: Struct., Funct.,Bioinf. 2013, 81, 1980–1987.

    (64) Moult, J.; Fidelis, K.; Kryshtafovych, A.;Schwede, T.; Tramontano, A. Critical As-sessment of Methods of Protein StructurePrediction (CASP)–Round X. Proteins:Struct., Funct., Bioinf. 2014, 82, 1–6.

    (65) Tirion, M. M. Large Amplitude Elas-tic Motions in Proteins from a Single-Parameter, Atomic Analysis. Phys. Rev.Lett. 1996, 77, 1905.

    (66) Bahar, I.; Atilgan, A. R.; Erman, B. Di-rect Evaluation of Thermal Fluctuationsin Proteins Using a Single-Parameter Har-monic Potential. Folding Des. 1997, 2,173–81.

    (67) Tama, F.; Gadea, F. X.; Marques, O.;Sanejouand, Y. H. Building-Block Ap-proach for Determining Low-FrequencyNormal Modes of Macromolecules. Pro-teins: Struct., Funct., Bioinf. 2000, 41,1–7.

    23

    IntroductionTheoretical BasisMaterials and MethodsRigid-Body MinimizationNormal ModesTraining SetNative ComplexesNear-native Decoys

    Test BenchmarksHex Test BenchmarkZdock Test BenchmarkItScoreTest BenchmarkRosetta Test BenchmarkSwarmDock Test Benchmark

    Computational Considerations

    Results Scoring FunctionalPerformance on the Test BenchmarksHex Test BenchmarkZdock Test BenchmarkComparison with the ItScore-PP potentialRosetta Test BenchmarkSwarmDock Test Benchmark

    Crystallographic Symmetry Mates as Docking PredictionsPerformance in CAPRI

    Discussion

    Conclusions