
A coarse-grained force field for protein-RNA docking

Piotr Setny* and Martin Zacharias

Physics Department T38, Technical University Munich, James-Franck-Strasse 1, 85748 Garching, Germany

Received May 3, 2011; Revised July 13, 2011; Accepted July 19, 2011

ABSTRACT

Awareness of the important biological role played by functional, non-coding (nc) RNA has grown tremendously in recent years. To perform their tasks, ncRNA molecules typically unite with protein partners, forming ribonucleoprotein complexes. Structural insight into their architectures can be greatly supplemented by computational docking techniques, as they provide means for the integration and refinement of experimental data that are often limited to fragments of larger assemblies or represent multiple levels of spatial resolution. Here, we present a coarse-grained force field for protein-RNA docking, implemented within the framework of the ATTRACT program. Complex structure prediction is based on energy minimization in rotational and translational degrees of freedom of binding partners, with possible extension to include structural flexibility. The coarse-grained representation allows for fast and efficient systematic docking search without any prior knowledge about complex geometry.

INTRODUCTION

In recent years, the known inventory of functional RNAs that do not belong to the protein-coding messenger (m) RNA class has increased dramatically (1-3). These so-called non-coding (nc) RNAs appear to play important roles in diverse cellular activities such as RNA processing and modification, translation, gene expression, protein trafficking or chromosome maintenance. In doing so, RNA molecules typically unite with protein partners in functional ribonucleoprotein complexes (4-6). Structural insight into such assemblies is essential for our understanding of their mechanism of action as well as for the future ability to design new diagnostic tools or therapeutic strategies.

Awareness of the importance of ncRNA grows much faster, however, than the available body of structural data. In spite of spectacular successes, such as obtaining high-resolution structures of small and large ribosomal subunits (7,8), protein-RNA complexes currently comprise only ~4% of records deposited in the Protein Data Bank (PDB) (9), while at the same time it is estimated that the portion of the genome transcribed into ncRNA may be even 20 times larger than the protein-coding part (3). X-ray crystallography of macromolecular complexes, particularly those containing nucleic acids, is a more difficult task than the determination of isolated components. Thus, computational docking techniques, aiming at the prediction of the complex structure based on its components, are becoming increasingly important. Even though they still need to tackle many challenges before becoming a reliable standalone tool (10), they already play an important role in the integration of structural data coming from experimental methods that provide different levels of spatial resolution (11,12).

To date, most computational efforts for ribonucleoproteins have focused on the characterization of binding interfaces (13-17) or the localization of RNA binding sites on proteins (18-23) rather than docking. In contrast to protein-protein docking, which has an established record (10), to our knowledge only very few preliminary attempts have been made at actual protein-RNA binding mode prediction (24-26). They are based on distance-dependent atomic statistical potentials for protein-RNA interactions (24,25) or statistically derived propensities for protein-RNA pairing at the residue level (26). Descriptors developed for quantifying protein-RNA interactions were demonstrated to distinguish between native structures and provided decoys (24,25), or to improve ranking of near-native solutions obtained using shape complementarity-based scoring (26).

Recently, protein-RNA complexes were included as targets in the Critical Assessment of PRediction of Interactions (CAPRI) competition (27). Most of the participating groups adapted methods developed for protein-protein docking for this occasion and used experimental restraints to facilitate native-like geometry selection. Of the two ribonucleoprotein targets, one required homology modeling of both binding partners and was not solved by any of the competing groups, while the second, with provided bound RNA geometry, allowed for a number of successful, medium-accuracy predictions. The presence of protein-RNA targets for the first time in the CAPRI competition indicates the growing interest in computational prediction of ribonucleoprotein structures and, consequently, the need for new methods targeted specifically at protein-RNA docking problems.

*To whom correspondence should be addressed. Tel: +49 89 289 13768; Fax: +49 89 289 12444; Email: [email protected]
Correspondence may also be addressed to Martin Zacharias. Tel: +49 89 289 12335; Fax: +49 89 289 12444; Email: [email protected]

9118-9129 Nucleic Acids Research, 2011, Vol. 39, No. 21. Published online 16 August 2011. doi:10.1093/nar/gkr636

© The Author(s) 2011. Published by Oxford University Press.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

In the current study, we design and parametrize a distance-dependent, coarse-grained force field for protein-RNA interactions. This force field is compatible with an earlier parameter set developed for protein-protein docking by Zacharias (28,29). It allows for fully systematic protein-RNA docking by energy minimization in the rotational and translational degrees of freedom of the binding partners. Unlike propensity-based descriptors, the distance-dependent potential function provides all necessary means for finding realistic bound geometries and for their subsequent scoring by the resulting potential energy. Structure representation at the sub-residue coarse-grained level allows for efficient calculations, yet at the same time maintains reasonable detail of physicochemical features.

In the following sections, we describe force field development and testing based on 110 crystallographic structures of protein-RNA complexes. We also consider its application to a few protein-RNA complexes with available structural data for the bound state as well as for the unbound components.

METHODS

Coarse-grained representation

The presented potential for protein-RNA interactions is designed to be compatible with the earlier coarse-grained parametrization developed for proteins by Zacharias (28,29). In the Zacharias model, each amino acid is represented by up to four pseudoatoms (beads): two corresponding to main chain nitrogen and oxygen, and one or two describing short and long side chains, respectively. In total, there are 31 pseudoatom types. To extend this model to nucleotides, 17 new bead types were introduced. They include three pseudoatoms for the phosphate/ribose part, and three or four for pyrimidine and purine bases, respectively (Figure 1).

The assumed, pairwise additive interactions between protein and RNA beads are described by a distance-dependent potential that has two different forms, corresponding to attractive and repulsive interactions (Figure 2). The attractive potential is of Lennard-Jones type, with a soft repulsive term:

\[
U^{\mathrm{attr}}_{ij}(r) = \varepsilon_{ij}\left[\left(\frac{\sigma_{ij}}{r}\right)^{8} - \left(\frac{\sigma_{ij}}{r}\right)^{6}\right]. \tag{1}
\]

Pairwise-specific parameters $\sigma_{ij}$ and $\varepsilon_{ij}$ govern interaction range and strength, respectively. The repulsive potential is defined as:

\[
U^{\mathrm{rep}}_{ij}(r) =
\begin{cases}
U^{\mathrm{attr}}_{ij}(r) - 2U^{m}_{ij} & \text{for } r \le r^{m}_{ij},\\
-U^{\mathrm{attr}}_{ij}(r) & \text{for } r > r^{m}_{ij},
\end{cases} \tag{2}
\]

where $r^{m}_{ij}$ and $U^{m}_{ij}$ correspond to the position and value of the $U^{\mathrm{attr}}_{ij}$ minimum. Such a formula provides a smooth, easily implemented potential, albeit with a non-physically vanishing force at $r^{m}$. Unlike in the Zacharias model for proteins, no separate electrostatic terms were introduced, thus assuming that the above potentials account for all effective intermolecular interactions.

Figure 1. Coarse-grained representation for nucleotides. Beads (dashed circles) are either centered on particular atoms or at geometric centers of a few atoms (dots).

Figure 2. Repulsive and attractive potential form. $r^{m}$ and $U^{m}$: the position and value of the $U^{\mathrm{attr}}$ minimum.
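As an illustration of Equations (1) and (2), the two potential forms can be sketched in a few lines of Python. This is a minimal sketch, not the ATTRACT implementation: the function names are hypothetical, and the closed-form minimum (r_m = sigma * sqrt(4/3), U_m = -27/256 * eps) follows from setting the derivative of Equation (1) to zero.

```python
import math

def u_attr(r, sigma, eps):
    """Equation (1): soft 8-6 Lennard-Jones-type attractive potential."""
    x = sigma / r
    return eps * (x**8 - x**6)

def minimum(sigma, eps):
    """Position r_m and value U_m of the u_attr minimum (from dU/dr = 0)."""
    r_m = sigma * math.sqrt(4.0 / 3.0)
    u_m = -27.0 / 256.0 * eps
    return r_m, u_m

def u_rep(r, sigma, eps):
    """Equation (2): purely repulsive form with a saddle point at r_m."""
    r_m, u_m = minimum(sigma, eps)
    if r <= r_m:
        return u_attr(r, sigma, eps) - 2.0 * u_m
    return -u_attr(r, sigma, eps)

# Illustrative (made-up) parameters: sigma in Angstrom, eps in kcal/mol.
r_m, u_m = minimum(4.0, 1.0)
saddle_height = u_rep(r_m, 4.0, 1.0)  # equals -u_m, a positive value
```

Both branches of the repulsive form meet at r_m with the value -U_m and zero slope, which is the vanishing force mentioned in the text.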

Potential parametrization

The interaction of each pair ij of protein and RNA beads is described by two parameters ($\sigma_{ij}$ and $\varepsilon_{ij}$), hence giving in total 1054 parameters for the protein-RNA force field. They were derived in a knowledge-based manner, using a set of protein-RNA crystallographic complexes. At first, distance-dependent statistical potentials $\tilde{G}_{ij}(d)$ were constructed for each bead pair, and the initial values of the $\sigma$ and $\varepsilon$ parameters were obtained by fitting Equations (1) and (2) to $\tilde{G}_{ij}(d)$. The resulting parameter values were subsequently adjusted to optimize docking results in terms of finding the right (close to native) binding mode and its proper scoring. The details of the parametrization procedure are given in the following.
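The parameter count quoted above can be checked with simple arithmetic. The sketch below uses the bead-type numbers given in the text; the split of the 17 RNA bead types into three backbone beads, four beads per purine and three per pyrimidine is an assumption inferred from those totals, not a statement from the source.

```python
PROTEIN_BEAD_TYPES = 31   # from the Zacharias protein model
RNA_BEAD_TYPES = 17       # new nucleotide bead types
PARAMS_PER_PAIR = 2       # sigma_ij and epsilon_ij

n_pairs = PROTEIN_BEAD_TYPES * RNA_BEAD_TYPES
n_params = n_pairs * PARAMS_PER_PAIR

# Assumed composition: 3 backbone beads shared by all nucleotides,
# 4 beads for each purine (A, G), 3 for each pyrimidine (C, U).
assumed_rna_types = 3 + 2 * 4 + 2 * 3
```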

Protein-RNA complexes were selected from crystallographic structures used recently for the analysis of protein-RNA binding sites (22,23). The provided lists of non-redundant complexes, having resolution better than 3 Å, were merged together, and a search for homologous structures was performed using the Smith and Waterman algorithm (30) for proteins, with a similarity threshold of 70% sequence identity, and the nucleotide BLAST algorithm (31) for RNA, with a similarity threshold of 80% sequence identity. Two complexes were deemed redundant if the similarity thresholds were simultaneously exceeded by both of their binding partners, thus allowing a single binding component to be considered for docking with different partners. A list of complexes sharing one similar binding component is given in the Supplementary Data.

The non-redundant structures were then subjected to manual analysis with the following rules:

• complexes in which a DNA molecule was found to bind protein and RNA molecules were discarded;
• structures with multiple missing side chains or nucleotides were discarded;
• structures with protein-RNA contact involving
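The redundancy criterion used during data set selection (both partners must exceed their similarity thresholds at the same time) can be sketched as follows. The function name is hypothetical, and the sequence identities are assumed to be precomputed, as fractions, by the alignment tools named in the text.

```python
PROTEIN_ID_THRESHOLD = 0.70  # 70% sequence identity for proteins
RNA_ID_THRESHOLD = 0.80      # 80% sequence identity for RNA

def is_redundant(protein_identity, rna_identity):
    """Two complexes are redundant only if BOTH thresholds are exceeded."""
    return (protein_identity > PROTEIN_ID_THRESHOLD
            and rna_identity > RNA_ID_THRESHOLD)

# The same protein bound to an unrelated RNA is kept as a separate complex.
same_protein_other_rna = is_redundant(1.00, 0.35)
```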



The bead radii were estimated using the acquired $N_{obs}(i, j, d)$, as an average of half the distance of closest approach over all considered interaction partners of a given bead. The probe radius, in turn, was defined as the average of all bead radii, yielding $R_p = 1.78$ Å. The numerical results for (d) appeared to be insensitive to small (~0.1 Å) variations in the applied radii.

Parameter optimization. The initial parameter set was obtained by fitting Equations (1) and (2) to the derived statistical potentials and choosing the ($\sigma$, $\varepsilon$) pair providing the lowest root mean square deviation (RMSD) from $\tilde{G}_{ij}(d)$. In order to optimize the performance of the obtained parameter set for protein-RNA docking, the parameters were further adjusted using a two-stage procedure.
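One simple way to realize such a fit is an exhaustive grid search over candidate values, evaluated for both potential forms. The sketch below is only an assumption about how the fit could be done (the source does not state the fitting algorithm); the function names and the grids are hypothetical, and the statistical potential is assumed to be given as (distance, value) points.

```python
import math

def u_attr(r, sigma, eps):
    x = sigma / r
    return eps * (x**8 - x**6)

def u_rep(r, sigma, eps):
    r_m = sigma * math.sqrt(4.0 / 3.0)   # minimum position of u_attr
    u_m = -27.0 / 256.0 * eps            # minimum value of u_attr
    if r <= r_m:
        return u_attr(r, sigma, eps) - 2.0 * u_m
    return -u_attr(r, sigma, eps)

def rmsd(model, sigma, eps, points):
    """Root mean square deviation between a model and tabulated points."""
    sq = sum((model(d, sigma, eps) - g) ** 2 for d, g in points)
    return math.sqrt(sq / len(points))

def fit_pair(points, sigmas, epsilons):
    """Return (rmsd, form, sigma, eps) with the lowest RMSD on the grid."""
    best = None
    for form, model in (("attr", u_attr), ("rep", u_rep)):
        for sigma in sigmas:
            for eps in epsilons:
                cand = (rmsd(model, sigma, eps, points), form, sigma, eps)
                if best is None or cand[0] < best[0]:
                    best = cand
    return best

# Synthetic "statistical potential" generated from known parameters.
points = [(d, u_attr(d, 4.0, 2.0)) for d in (3.8, 4.2, 4.6, 5.0, 6.0, 8.0)]
best = fit_pair(points, [3.5, 4.0, 4.5], [1.0, 2.0, 3.0])
```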

In the first stage, only the $\sigma$ parameters were optimized to provide the best possible stability of native structures. Due to the large dimensionality of the parameter space, a random, Monte Carlo-like optimization scheme was introduced. The following optimization block was iterated until no further improvement was observed:

(1) start with the current parameter set;
(2) randomly select one $\sigma$ parameter from those that were not yet selected;
(3) for each of 2k+1 values: $\sigma - k\Delta_\sigma, \ldots, \sigma, \ldots, \sigma + k\Delta_\sigma$, perform potential energy minimizations for the set of native complexes, and score the results;
(4) keep the $\sigma$ value that provides the best score;
(5) if some $\sigma$ parameters remain to be scanned, go to point (2).

In practical application, steps (3) and (4) were modified for better efficacy: the search started from a given $\sigma$ value and proceeded in the positive and then the negative direction, but each search direction was interrupted before checking all k values as soon as the score started to deteriorate.
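The optimization block, including the early-stopping variant of steps (3) and (4), can be sketched as a random coordinate scan. This is a toy sketch under stated assumptions: `score` is a hypothetical callable standing in for the energy minimization and scoring of native complexes (lower is better), and the parameters are stored in a plain dictionary.

```python
import random

def scan_parameter(params, key, k, delta, score):
    """Scan params[key] in +/- direction, up to k steps each, stopping a
    direction as soon as the score gets worse than at the previous step."""
    start = params[key]
    start_score = score(params)
    best_value, best_score = start, start_score
    for direction in (+1, -1):
        previous = start_score
        for step in range(1, k + 1):
            trial = dict(params)
            trial[key] = start + direction * step * delta
            s = score(trial)
            if s > previous:          # deteriorating: abandon this direction
                break
            if s < best_score:
                best_value, best_score = trial[key], s
            previous = s
    return best_value

def optimize(params, k, delta, score, rng):
    """Repeat full random-order sweeps until a sweep brings no improvement."""
    params = dict(params)
    current = score(params)
    while True:
        keys = list(params)
        rng.shuffle(keys)             # each parameter selected once per sweep
        for key in keys:
            params[key] = scan_parameter(params, key, k, delta, score)
        new = score(params)
        if new >= current:
            return params
        current = new

# Toy score: squared distance to made-up target values.
target = {"a": 4.2, "b": 4.7}
toy_score = lambda p: sum((p[n] - target[n]) ** 2 for n in p)
result = optimize({"a": 4.0, "b": 5.0}, 3, 0.1, toy_score, random.Random(0))
```

In the real procedure each score evaluation involves energy minimization of all native complexes, which is why limiting the number of scanned values matters.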

The energy minimization was done in protein (ligand) translational and rotational degrees of freedom, with both binding partners kept rigid. The scoring performed in point (3) was based on the following criteria applied to each minimized complex:

• if the alpha carbon RMSD of the minimized versus native protein position was greater than 5 Å, a large penalty (LP) was applied; and
• if the RMSD was between 1.0 and 5.0 Å, its square was accumulated in the RMSD variable.

The best score was considered to have the lowest number of LP or the smallest RMSD value among the scores with the same number of LP. Such a scoring scheme was tuned for finding solutions close to native (with RMSD 5 Å. Such systematic docking was performed from starting points evenly distributed over the receptor surface, hence the locations of the decoys were not limited to the area of the binding interface (see Docking protocols section below).
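The first-stage scoring rule (number of large penalties first, accumulated squared RMSD second) maps naturally onto a tuple that is compared lexicographically. The following is a minimal sketch with hypothetical names; treating the boundaries 1.0 and 5.0 Å as inclusive for accumulation is an assumption.

```python
LP_THRESHOLD = 5.0   # Angstrom: above this a large penalty (LP) is counted
RMSD_FLOOR = 1.0     # Angstrom: below this nothing is accumulated

def stage_one_score(rmsds):
    """Return (number of LP, sum of squared RMSD); a lower tuple is better."""
    n_lp = 0
    sum_sq = 0.0
    for rmsd in rmsds:
        if rmsd > LP_THRESHOLD:
            n_lp += 1
        elif rmsd >= RMSD_FLOOR:
            sum_sq += rmsd ** 2
    return (n_lp, sum_sq)

# Made-up RMSD values for three minimized native complexes.
score = stage_one_score([0.5, 2.0, 6.0])
```

Python compares tuples element by element, so any score with fewer large penalties wins regardless of the accumulated RMSD term.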

The following optimization scheme was repeated until no further improvement in the number of correctly ranked complexes was observed:

(1) start with the current parameter set and perform energy minimization for native complexes and all decoys;
(2) randomly select one $\varepsilon$ parameter from those that were not yet selected;
(3) for each of 2k+1 values: $\varepsilon - k\Delta_\varepsilon, \ldots, \varepsilon, \ldots, \varepsilon + k\Delta_\varepsilon$, perform a single point energy evaluation for all structures;
(4) keep the $\varepsilon$ value that provides the best ranking of native complexes;
(5) if some $\varepsilon$ parameters remain to be scanned, go to point (2).

The criterion for the best overall ranking was based on the ability to provide the lowest energy for native-like (energy minimized) complexes with respect to their corresponding decoys. The best $\varepsilon$ value was regarded as the one yielding the highest number of properly ranked (i.e. with rank 1) native-like complexes. If a few $\varepsilon$ values resulted in identical ranking, the one with the lowest penalty score was chosen. The penalty score was calculated for complexes that did not rank properly. It depended on the RMSD of the native-like complex (native complexes with high RMSD after energy minimization were not expected to be ranked as precisely as complexes with low RMSD), the actual rank of the native-like complex among its decoys (the higher the rank the smaller the penalty) and the difference in energy between the native-like complex and the best ranked decoy. For the $\varepsilon$ optimization procedure, k = 2 and $\Delta_\varepsilon$ = 0.1 kcal/mol were chosen. For further details, see the Supplementary Data.
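The primary ranking criterion of the second stage can be sketched as follows. The helper names are hypothetical and the energies are made up; the penalty score used for tie-breaking is described only qualitatively in the text and is therefore not modeled here.

```python
def candidate_values(center, k, delta):
    """The 2k+1 trial values centered on the current parameter value."""
    return [center + i * delta for i in range(-k, k + 1)]

def native_rank(native_energy, decoy_energies):
    """Rank 1 means that no decoy has a lower energy than the native."""
    return 1 + sum(1 for e in decoy_energies if e < native_energy)

def count_rank_one(complexes):
    """complexes: list of (native_energy, [decoy energies]) tuples."""
    return sum(1 for native, decoys in complexes
               if native_rank(native, decoys) == 1)

trial_eps = candidate_values(1.0, 2, 0.1)   # k = 2, delta = 0.1 kcal/mol
n_ok = count_rank_one([(-10.0, [-8.0, -5.0]), (-3.0, [-4.0, -1.0])])
```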

Parameter set evaluation. The random optimization scheme does not guarantee reaching a single, globally best parameter set, even if (unlikely) such a set exists. In order to obtain insight into the validity of the adopted optimization procedure, three independent parameter optimizations were carried out. The three resulting interaction potentials for each bead pair were compared by analyzing the standard deviations of their minima or saddle point positions with respect to their mean location. The minimum or saddle point position is a simple function of the $\sigma$ parameter (on the distance scale) and the $\varepsilon$ parameter (on the energy scale). In order to combine the deviations in the $\sigma$ and $\varepsilon$ components, they were normalized by the respective average $\sigma$ or $\varepsilon$ deviations over all bead pairs. Such normalized deviations, $S_\sigma$ and $S_\varepsilon$, were then combined, yielding a single value $S = \sqrt{S_\sigma^2 + S_\varepsilon^2}$, measuring the variability of a given bead pair interaction across the three sets.

Finally, an average, consensus parameter set was constructed, and the docking efficacy of all four potentials was compared.
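The variability measure and the consensus set can be sketched as follows. This is a hedged illustration: the data layout (one dictionary of (sigma, eps) tuples per parameter set) and the function names are hypothetical, the population standard deviation is an assumed choice, and the factors sqrt(4/3) and 27/256 are the minimum position and depth of the 8-6 potential per unit sigma and eps.

```python
import math
from statistics import mean, pstdev

R_FACTOR = math.sqrt(4.0 / 3.0)  # r_m = R_FACTOR * sigma
U_FACTOR = 27.0 / 256.0          # |U_m| = U_FACTOR * eps

def pair_deviations(param_sets):
    """Per-pair standard deviation of extremum position and amplitude."""
    out = {}
    for pair in param_sets[0]:
        r = [R_FACTOR * p[pair][0] for p in param_sets]
        u = [U_FACTOR * p[pair][1] for p in param_sets]
        out[pair] = (pstdev(r), pstdev(u))
    return out

def combined_deviation(devs):
    """Normalize by the averages over all pairs, then combine as a norm."""
    avg_r = mean(d[0] for d in devs.values())
    avg_u = mean(d[1] for d in devs.values())
    return {pair: math.hypot(dr / avg_r, du / avg_u)
            for pair, (dr, du) in devs.items()}

def consensus(param_sets):
    """Average sigma and eps of each pair over all parameter sets."""
    return {pair: (mean(p[pair][0] for p in param_sets),
                   mean(p[pair][1] for p in param_sets))
            for pair in param_sets[0]}

# Made-up values for two bead pairs in three optimized sets.
sets = [{"A": (4.0, 1.0), "B": (5.0, 2.0)},
        {"A": (4.2, 1.0), "B": (5.0, 2.6)},
        {"A": (4.4, 1.0), "B": (5.0, 2.3)}]
S = combined_deviation(pair_deviations(sets))
avg = consensus(sets)
```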

Docking protocols

All docking simulations performed in the current study were carried out with the ATTRACT program (28). They involved two rigid partners in coarse-grained representation: an RNA structure treated as an immobile receptor and a protein structure treated as a movable ligand. No additional information such as distance constraints or binding region location was supplied. Prior to docking, a set of starting ligand positions, evenly distributed with a spacing of ~12 Å around the receptor at distances precluding receptor-ligand overlaps, was generated. For each such position, 208 initial ligand orientations were considered. A single docking attempt consisted of five stages of potential energy minimization in ligand translational and rotational degrees of freedom, with decreasing distance cutoff for pairwise interactions. The initial cutoff was set to 50 Å and the final cutoff to 8 Å.

The solutions with converged potential energy were then clustered according to their pairwise RMSD distances in order to remove redundant ligand poses. Scoring of the results was done according to their potential energy.
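Removal of redundant poses can be illustrated with a greedy, energy-ordered filter. The clustering algorithm and the RMSD threshold actually used are not specified in the text, so the procedure, the `threshold` argument and the `rmsd` callable below are assumptions made for the sake of the example.

```python
def remove_redundant(poses, rmsd, threshold):
    """poses: list of (energy, pose). Visit poses from lowest to highest
    energy and keep a pose only if it is farther than `threshold` from
    every pose kept so far."""
    kept = []
    for energy, pose in sorted(poses, key=lambda p: p[0]):
        if all(rmsd(pose, other) > threshold for _, other in kept):
            kept.append((energy, pose))
    return kept

# Toy example: poses are points on a line, "RMSD" is their distance.
toy_poses = [(-5.0, 0.0), (-7.0, 0.3), (-3.0, 10.0)]
unique = remove_redundant(toy_poses, lambda a, b: abs(a - b), 1.0)
```

Because the list is visited in order of increasing energy, each cluster is represented by its lowest-energy member, which is convenient when the final scoring is done by potential energy.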

RESULTS AND DISCUSSION

Potential parameters

The obtained statistical potentials provide a valuable characterization of the assumed interaction model (see Supplementary Data for the corresponding plots): they reasonably indicate attractive and repulsive pseudoatom pairs, define the distances of closest approach and give an estimate of interaction strength.

Their use as a definite and only basis for deriving interaction potentials in the form of Equations (1) or (2) is, however, limited. First, due to the relatively small number of available, high-quality protein-RNA complexes, considerable unphysical oscillations in the statistical potential are observed for some particularly infrequent bead pairs, rendering the fitting procedure inaccurate. Second, in some cases the statistical potential does not reach the zero baseline for distances as large as 14 Å, requiring an arbitrary decision whether the potential should be shifted before fitting and, if so, what offset value should be used. Finally, the assumption that a statistical potential between two beads describes their pairwise interaction in the context of a specific macromolecular environment is not physically justified (33).

For these reasons, the statistical potentials were used to generate only an initial guess for the parameter set (called set 0 in the following), with its subsequent optimization, specifically for docking and scoring applications, in mind. Given the large dimensionality of the parameter space and the relatively limited size of the training set, it is reasonable to expect that the result of parameter optimization would be sensitive to the order of pairwise potential adjustments. As no indisputably superior adjustment sequence was determined, a random scheme was adopted and three independent optimization procedures were carried out. In addition, a consensus parameter set was constructed by averaging the respective $\sigma$ and $\varepsilon$ parameters of all three sets.

The pairwise interaction amplitudes (well depths, negative values, for attractive potentials or saddle point heights, positive values, for repulsive potentials) for the consensus set (Figure 4) agree, in general, with the expected protein-RNA interaction characteristics. If the percentage of strong interactions (defined as having an amplitude higher than the average repulsive or attractive interaction plus the respective one standard deviation) offered by a given bead is taken as a measure of its importance for the specificity of protein-RNA recognition, the aromatic protein side chains (Tyr, Trp, Phe, His) appear to be the most contributing group, in line with the traditional view (34). They preferentially interact with nucleic bases rather than the RNA backbone, which reflects the stacking interactions they are supposed to make.
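The definition of strong interactions can be made concrete with a short sketch. The bead labels and amplitude values are made up for illustration, the function names are hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def strong_threshold(magnitudes):
    """Average magnitude plus one standard deviation."""
    return mean(magnitudes) + pstdev(magnitudes)

def strong_pairs(amplitudes):
    """amplitudes: {(protein_bead, rna_bead): value}; negative values are
    well depths (attractive), positive values saddle heights (repulsive)."""
    attr = [-a for a in amplitudes.values() if a < 0]
    rep = [a for a in amplitudes.values() if a > 0]
    t_attr = strong_threshold(attr) if attr else float("inf")
    t_rep = strong_threshold(rep) if rep else float("inf")
    return {p for p, a in amplitudes.items()
            if (a < 0 and -a > t_attr) or (a > 0 and a > t_rep)}

def strong_percentage(amplitudes, bead):
    """Percentage of a bead's interactions that are classified as strong."""
    pairs = [p for p in amplitudes if bead in p]
    strong = strong_pairs(amplitudes)
    return 100.0 * sum(1 for p in pairs if p in strong) / len(pairs)

amplitudes = {("TYR", "G1"): -2.0, ("TYR", "P"): -0.2,
              ("ALA", "G1"): -0.2, ("ALA", "P"): -0.2,
              ("GLU", "P"): 1.5, ("GLU", "G1"): 0.1, ("ASP", "P"): 0.1}
tyr_share = strong_percentage(amplitudes, "TYR")
```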

Strong, mostly attractive interactions are observed for positively charged arginine. They are directed predominantly toward nucleic bases, at least as far as its distal, ARGS, bead is concerned. Interestingly, the similarly charged lysine appears to favor the RNA backbone instead, while the rest of its interactions are only moderate, with the exception of a strong attraction to the GG4 guanine pseudoatom. As expected, strong attraction is also observed between protein backbone nitrogen and pseudoatoms representing hydrogen bond acceptors on nucleic bases.

The repulsive interactions, while not driving complex formation, are also important for binding specificity. A significant repulsion is observed between negatively charged RNA phosphate groups and ionized side chains of Glu and Asp, and also for Gln, Asn and protein backbone oxygen. Interestingly, Cys appears to be in general the most RNA-repelling amino acid, both where the phosphate backbone and nucleic bases are concerned, even though it does not carry a net charge.

The weakest interacting beads, in turn, belong to hydrophobic residues (Ile, Leu, Val, Ala, Gly) on the protein side, and to the sugar moiety on the RNA side.

As expected, the exact values of parameters for the optimized interaction potentials vary across all three parameter sets. The average deviation for a pairwise interaction with respect to the mean location of its potential minimum or saddle point is 0.07 Å on the distance scale ($S_\sigma$) and 0.18 kcal/mol on the energy scale ($S_\varepsilon$). The observed combined deviations (see Methods section) differ considerably among all pairwise potentials (Figure 5). To some extent, they reflect a statistical uncertainty due to the absolute number of bead pairs used to derive a given potential: rarely observed pseudoatoms like those in Trp or Met tend to have greater variability in their interaction potentials, whereas frequent beads like those in the RNA or protein backbone tend to contribute to more reproducible potentials. Such a trend is also visible for nucleic bases, with the most often encountered guanine having on average lower deviations than the less frequent uracil and adenine. For some ubiquitous pseudoatoms, however, like those in Arg, Lys or protein backbone oxygen, average deviations, though still below the mean for all protein beads, remain relatively high. Interestingly, those pseudoatoms are among the group that seems to particularly contribute to protein-RNA recognition. Perhaps, having a high impact on docking performance, they are also exceptionally sensitive to the order of parameter optimization, as each time their interactions are being optimized, the adjustment depends on the actual state of all other pairwise potentials. This indicates that there are many local optima in the interaction parameter space, resulting in similar docking efficiency for the limited training set. In order to mitigate the effect of such random parameter overfitting toward the training set structures, a consensus parameter set was constructed by averaging all respective $\sigma$ and $\varepsilon$ values of the three optimized sets.

Docking performance

Upon completion of parameter derivation, a systematic docking search was carried out for all non-ribosomal complexes from the training set (64 in total) and the test set (25 in total). Docking performance was evaluated based on the interface RMSD (iRMSD) relative to the native interface, composed of protein-RNA bead pairs found within the cutoff distance of 8 Å in the crystallographic structure, and the fraction of established native contacts (fNC) within such an interface. A docked ligand pose was considered a hit with iRMSD ≤ 2 Å and fNC ≥ 0.3, or iRMSD ≤ 1 Å with fNC ≥ 0.5. Such criteria are equivalent to a hit being a high or medium quality solution according to the CAPRI challenge (35) guidelines. Separately, the statistics for hits of only high quality (i.e. with iRMSD ≤ 1.0 Å and fNC ≥ 0.5) were determined.

Figure 4. Color map: average amplitudes (well depths, negative values, or saddle point heights, positive values) of interaction potentials for all bead pairs. Side plots: the percentage of strong interactions for each bead; strong interactions are defined as having an amplitude higher than average plus 1 SD.

Figure 5. Deviations of minimum or saddle point locations for three sets of pairwise potentials with respect to the consensus set. Side plots: average deviations for each pseudoatom type (points), relative frequencies of pseudoatoms found in interacting pairs (histograms). All data normalized such that average values equal one.
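The hit criteria translate directly into a small classifier. The sketch below uses the thresholds quoted in the text; the function names and the returned labels are hypothetical.

```python
def classify_pose(irmsd, fnc):
    """Classify a docked pose by interface RMSD (Angstrom) and the
    fraction of native contacts."""
    if irmsd <= 1.0 and fnc >= 0.5:
        return "high"
    if irmsd <= 2.0 and fnc >= 0.3:
        return "medium"
    return "miss"

def is_hit(irmsd, fnc):
    """A hit is any pose of medium or high quality."""
    return classify_pose(irmsd, fnc) != "miss"

quality = classify_pose(1.6, 0.45)
```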

Before performing a systematic docking search, potential energy minimization of crystallographic structures was carried out in order to evaluate the ability of the force field to provide stable complexes consistent with experimental binding modes. For each parameter set, except the non-optimized set 0, all complexes, both in the training and test set (Table 1), were found to be stable in the sense that the converged energy minimum met the adopted criteria for a hit. Most of the optimized geometries (~85%) corresponded to high-quality solutions. Interestingly, in some cases when the crystallographic complex distortion exceeded the assumed tolerance for a high-quality model, such a solution was found during the systematic docking search, though not necessarily scored as the best. It is worth stressing that crystallographic structures are usually computationally optimized with the use of atomic force fields, and some residual strains are likely to be encountered when switching to a different set of interaction potentials. Hence, the imbalance of a particular receptor-ligand geometry during rigid body energy minimization does not preclude the ability of the force field to provide an energy minimum that corresponds to a physically equivalent binding mode.

The performance of the parameter set obtained solely upon fitting to statistical potentials (set 0), in terms of providing stable native-like geometries, was surprisingly good, given that its parameters were not tuned in any way to achieve this task. Around 90% of complexes remained stable, and the majority of the optimized geometries corresponded to high-quality solutions, with no significant difference between the training and test set.

The results of systematic docking searches are presented in Table 1. For optimized and consensus potentials, native-like solutions in training set complexes are found to have the highest rank (lowest potential energy) in 50-62% of cases, depending on the parameter set used. The success rate increases significantly when native-like solutions are allowed to have up to 10th rank, and a further increase in the rank threshold brings only moderate improvement (Figure 6). For the test set, optimal ranking is achieved for 36-48% of structures, and the success rate tends to saturate at a higher rank threshold (Figure 6). Interestingly, the portions of generally found native-like solutions are similar for both training and test sets, indicating that the force field's ability to score the results is more prone to overfitting than its ability to provide proper geometries. This can be understood, as binding geometries depend mostly on effective pseudoatomic radii, while energy values are additionally sensitive to the amplitudes of interaction potentials.
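The dependence of the success rate on the rank threshold can be computed from the best rank achieved by a hit in each complex. A minimal sketch with made-up ranks and a hypothetical function name:

```python
def success_rate(best_hit_ranks, threshold):
    """Percentage of complexes with a hit ranked within `threshold`.
    best_hit_ranks holds the best rank of a hit for each complex,
    or None when no hit was found at all."""
    n_ok = sum(1 for r in best_hit_ranks
               if r is not None and r <= threshold)
    return 100.0 * n_ok / len(best_hit_ranks)

ranks = [1, 1, 3, 12, None]                # five hypothetical complexes
curve = [success_rate(ranks, t) for t in (1, 10, 1000)]
```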

As expected, the performance of parameters from set 0 is much worse compared with optimized and consensus potentials. While the fraction of generally found solutions is moderately good, in accordance with the number of stable structures found during crystallographic complex energy minimization, the ability of non-optimized potentials to score the results is rather poor. Indeed, the position of the minimum (or the range of hard core bead-bead interaction) in statistical potentials seems to be a relatively reliable and error-resistant physical descriptor, in contrast to interaction strength. As a result, set 0 potentials approximate shape complementarity but fail to provide a meaningful estimate of the binding free energy.

The performance of the three optimized parameter sets is generally similar within the training or, separately, the test set structures. The consensus parameter set does not bring much difference with respect to the training set, but clearly outperforms the three optimized potentials when test set structures are considered. This suggests that parameter averaging might indeed have contributed to the reduction of inter-parameter correlations arising from overfitting.

Table 1. Docking results for the initial parameter set (0), three optimized parameter sets (1, 2, 3) and the consensus set (C)

Parameter set   Training set                               Test set
                Ns         N1        N10       Nall        Ns         N1        N10       Nall
0               88 (72)    19 (19)   23 (23)   67 (50)     92 (60)    4 (4)     8 (8)     72 (48)
1               100 (86)   50 (41)   75 (59)   86 (66)     100 (84)   36 (36)   56 (56)   92 (76)
2               100 (89)   52 (42)   81 (62)   92 (70)     100 (80)   40 (40)   64 (56)   88 (76)
3               100 (84)   62 (53)   73 (58)   86 (66)     100 (84)   40 (40)   52 (52)   88 (80)
C               100 (83)   61 (53)   80 (66)   88 (70)     100 (84)   48 (44)   76 (72)   100 (84)

Numbers correspond to the percentage of hits for crystallographic structure energy minimization (Ns), and for systematic search within best rank solutions (N1), within the 10 best solutions (N10), and within all docked poses (Nall). Numbers in brackets correspond to solutions of high quality (see text for definition).

Figure 6. The percentage of hits at given rank threshold obtained for three optimized parameter sets and the consensus set. Lower set of dashed lines: the numbers of hits with rank 1; upper set of dashed lines: the numbers of generally found native-like solutions.

If obtaining a native-like solution as the one with the lowest potential energy is considered the ultimate goal of a docking study, the most common reason for the lack of success is improper scoring (error of type 1, E1). It accounts for ~70% of failures for the training set structures and >80% for the test set structures (Table 2). The inability of the systematic search procedure to find a native-like binding geometry (error of type 2, E2) corresponds to 16-38% of failures for the training set and up to 20% for the test set. As noted above, the actual percentage of structures for which the systematic search fails to find a native-like geometry is similar for training and test sets (~10%), and differences in relative frequencies of E1 versus E2 errors for those sets result from generally worse scoring for the test set structures. The error distribution obtained with the non-optimized parameter set 0 shows a greater proportion of E2 errors when compared with optimized and consensus parameter sets. Again, this indicates that the optimization of shape complementarity encoded in the potentials is more efficient than the optimization of their scoring capabilities.
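The relation between the two error types and the docking statistics can be illustrated with a short calculation. The function name is hypothetical; the example counts (25 complexes, 10 with a rank-1 hit, 22 with a hit found anywhere) are back-calculated here from the test-set percentages reported for parameter set 2 in Table 1 (40% and 88% of 25), and reproduce the corresponding E1/E2 entries of Table 2.

```python
def error_fractions(n_total, n_rank1, n_found):
    """Split docking failures (no hit at rank 1) into
    E1: hit found but ranked improperly, E2: hit not found at all.
    Returns both as percentages of all failures."""
    failures = n_total - n_rank1
    e1 = 100.0 * (n_found - n_rank1) / failures
    e2 = 100.0 * (n_total - n_found) / failures
    return e1, e2

e1, e2 = error_fractions(25, 10, 22)
```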

From the structural point of view, the considered protein-RNA complexes tend to be either easy (docked and scored within the top 10 for all four parameter sets) or difficult (with rank worse than 10 for all four parameter sets). The fractions of complexes belonging to those two classes (Figure 7) are much higher than estimates based on the supposition that the success (or failure) in docking for one parameter set is independent of the results for other parameter sets, i.e. can be expected with a probability being a product of success (or failure) probabilities in each case. Clearly, to some degree this is the effect of correlations between parameter sets, as they are derived from a common predecessor, but more importantly it is a consequence of structural properties of the investigated complexes.
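The independence-based estimate mentioned above is a plain product of probabilities. The sketch below uses the training-set N10 percentages of Table 1 for sets 1, 2, 3 and C purely as illustrative inputs; whether the estimate in Figure 7 was computed from exactly these numbers is an assumption.

```python
import math

def independent_estimate(success_probs):
    """Expected fractions of 'easy' (success with every parameter set) and
    'difficult' (failure with every set) complexes if the outcomes for
    different parameter sets were statistically independent."""
    p_easy = math.prod(success_probs)
    p_difficult = math.prod(1.0 - p for p in success_probs)
    return p_easy, p_difficult

p_easy, p_difficult = independent_estimate([0.75, 0.81, 0.73, 0.80])
```

Under independence, only about a third of the complexes would be expected to be easy and well under 1% difficult, which gives a sense of how much larger the observed fractions would have to be to indicate a structural origin.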

In most cases, complexes regarded as difficult have a small interface region (see Figure 9 for examples: 2F8S, 2DLC, 1YYW). The number of interfacial beads on both the receptor and the ligand side for ~80% of difficult structures is within the lowest 20% of interfacial bead numbers for all complexes (Figure 7). Typically in such cases, a native-like geometry is found but scores badly, as alternative solutions provide more, albeit usually worse, contacts. In contrast, in 2 of 11 difficult cases (1F7U, 1FFY) the interface size is exceedingly large (close to the maximum of the observed interfacial bead number range). For those complexes, native-like solutions are not found at all during systematic search, as a tightly packed, large binding region precludes successful docking of rigid partners, even though the crystallographic structures correspond to stable energy minima. In general, the size of the protein interface appears to be slightly more important for docking efficiency. Perhaps this is due to the greater contribution of protein partners to specific recognition (22), which is a consequence of greater variability among protein polymer building blocks in comparison with just four types of standard RNA nucleotides.
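The statement about the "lowest 20% of interfacial bead numbers" can be made concrete with a min-max normalization of the kind used for Figure 7. The sketch below assumes that "lowest 20%" refers to the normalized range (0 to 100) rather than to a percentile. The function names and bead counts are invented for illustration.

```python
def normalize_0_100(values):
    """Min-max scale values so the lowest maps to 0 and the highest to 100."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [100.0 * (v - lo) / (hi - lo) for v in values]


def fraction_in_low_range(values, is_difficult, cutoff=20.0):
    """Fraction of difficult complexes whose normalized value is <= cutoff."""
    scaled = normalize_0_100(values)
    difficult = [s for s, d in zip(scaled, is_difficult) if d]
    if not difficult:
        return 0.0
    return sum(s <= cutoff for s in difficult) / len(difficult)


# Made-up interfacial bead counts, for illustration only.
bead_counts = [10, 12, 15, 30, 45, 60, 18, 55]
difficult_flags = [True, True, True, False, False, True, True, False]
frac = fraction_in_low_range(bead_counts, difficult_flags)
```

In this toy data set four of the five difficult complexes fall in the lowest fifth of the range, while the remaining one sits at the maximum, mirroring the two kinds of difficult case discussed above.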

The absolute size (accessible surface area) of the binding partners seems to have little influence on docking efficiency. Structures of different sizes are in general evenly distributed among difficult and non-difficult complexes (Figure 7), with the exception of the three largest ligands. Two of them (2F8S, 1SER), however, appear to have rather small interface regions, while the third one (1FFY) was already mentioned as having an extremely large and tight interface.

Improving docking performance

As noted above, proper scoring of docked geometries appears to be the major factor that affects docking performance. A possible way to increase scoring efficiency is to use information about the expected interface location in order to amplify ligand-receptor interactions originating from this area. To date, a number of methods have been developed for the prediction of RNA binding sites on proteins (18-23). Unfortunately, predicting protein binding sites on RNA is a much more difficult task and, apart from general characterization of the protein binding interface on RNA (13-17), to our knowledge no predictive methods exist to tackle this problem. For this reason,

Table 2. Relative fractions (in %) for two types of docking errors

Parameter set    Training set    Test set
                 E1     E2       E1     E2
0                60     40       71     29
1                72     28       88     12
2                84     16       80     20
3                62     38       80     20
C                68     32       100    0

E1: the hit is found but scores improperly; E2: the hit is not found at all during systematic search.

Figure 7. Left: the distribution of easy, medium and difficult docking structures, and an estimate of their relative frequencies assuming no correlation between docking efficiency and structural properties. Right: the distribution of structural features among difficult complexes; R_IB, L_IB: the number of interfacial beads in receptor and ligand; R_S, L_S: receptor and ligand accessible surface area. Feature ranges are normalized, with 0 and 100 corresponding to the lowest and highest observed values, respectively.




only the influence of protein interface predictions will be considered.

Instead of evaluating the usefulness of particular approaches for the prediction of the RNA binding region in proteins, a generic estimate of the predictive power required to improve scoring was carried out. The true protein interface was assumed to consist of beads involved in the formation of native contacts with RNA (see above for definition). Its status was perturbed to the desired level of specificity and selectivity by randomly introducing some false positive and false negative interfacial beads into the protein structure. Finally, for each such prediction, the two hundred lowest-energy solutions of the systematic docking search for each complex were rescored with the interactions for ligand interfacial beads scaled up by a weight factor (W). An alternative scheme, in which only attractive interactions were scaled, was also considered, but the results were qualitatively the same and, hence, are not shown. Due to the random character of bead reassignment, interface perturbation and rescoring of docking results were repeated five times for each combination of prediction quality and weight factor, and the average numbers of best ranked native-like geometries were recorded.
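The perturb-and-rescore procedure described above can be sketched as follows. This is a toy model under stated assumptions, not the ATTRACT implementation: each pose is represented as a hypothetical mapping from protein bead index to that bead's interaction energy, prediction quality is expressed as sensitivity and specificity in the usual statistical sense, and all function names and numbers are invented for illustration.

```python
import random


def perturb_interface(true_interface, all_beads, sensitivity, specificity, rng):
    """Build a noisy interface prediction of a given quality.

    Keeps a random `sensitivity` share of the true interfacial beads (the
    rest become false negatives) and adds a random (1 - specificity) share
    of the remaining beads as false positives.
    """
    true_interface = set(true_interface)
    non_interface = sorted(set(all_beads) - true_interface)
    n_keep = round(sensitivity * len(true_interface))
    n_false_pos = round((1.0 - specificity) * len(non_interface))
    kept = rng.sample(sorted(true_interface), n_keep)
    false_pos = rng.sample(non_interface, n_false_pos)
    return set(kept) | set(false_pos)


def rescore(pose, predicted_interface, weight):
    """Sum per-bead energies, scaling predicted interfacial beads by weight."""
    return sum(
        energy * (weight if bead in predicted_interface else 1.0)
        for bead, energy in pose.items()
    )


def rank_of_native(poses, native_index, predicted_interface, weight):
    """1-based rank of the native-like pose (lowest energy ranks first)."""
    scores = [rescore(p, predicted_interface, weight) for p in poses]
    return 1 + sum(s < scores[native_index] for s in scores)


def top_ranked_fraction(poses, native_index, true_interface, all_beads,
                        sensitivity, specificity, weight, repeats=5, seed=0):
    """Share of repeated random perturbations that rank the native pose first."""
    rng = random.Random(seed)
    top = 0
    for _ in range(repeats):
        predicted = perturb_interface(true_interface, all_beads,
                                      sensitivity, specificity, rng)
        if rank_of_native(poses, native_index, predicted, weight) == 1:
            top += 1
    return top / repeats


# Toy data: pose 0 is native-like, pose 1 is a decoy that avoids the interface.
all_beads = range(10)
true_interface = {0, 1, 2, 3}
poses = [{0: -2.0, 1: -2.0, 2: -1.0}, {6: -3.0, 7: -3.0}]
```

In this toy case the decoy wins without weighting (-6.0 versus -5.0), while a perfect prediction with W = 2 lowers the native-like score to -10.0 and moves it to rank 1. The five repeats follow the text; the seed is an arbitrary choice that keeps the sketch reproducible.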

Moderate scaling (W = 1.25) of the predicted native interactions appears to have, in general, a small beneficial effect on scoring efficiency (Figure 8). Even with relatively low prediction quality, it provides up to a ~10% increase in top ranked hits. Greater weighting factors allow better results to be reached, up to a ~30% increase in optimal scores. This effect tends to saturate, however, and interaction scaling above W = 5 does not improve the results, even for an ideally predicted interface (with sensitivity and selectivity of 1). This is understandable, as predictions affect only the protein binding partner and thus false solutions, in which a correct protein region binds to an incorrect RNA region, can also benefit from interaction scaling.

Interestingly, higher W values require better prediction quality in order to improve scoring efficiency. In fact, poor predictions tend even to worsen the original (W = 1) results, and this effect becomes increasingly evident with growing W. At the same time, high sensitivity appears to be somewhat more beneficial than specificity, suggesting that finding as many interfacial pseudoatoms as possible, even at the expense of false positive predictions, is preferable to capturing only a few, but truly interfacial, beads. Indeed, in the latter case, some false solutions that partially overlap with the correct binding region may easily benefit from high W values, provided that they involve contact through individual promoted interfacial beads.
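To make the trade-off between the two quality measures explicit, the sketch below scores a predicted interface against the true one. It assumes the usual definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP), which the text does not spell out. The function name and bead indices are hypothetical.

```python
def prediction_quality(predicted, true_interface, all_beads):
    """Return (sensitivity, specificity) of a predicted set of interfacial beads."""
    predicted = set(predicted)
    true_interface = set(true_interface)
    all_beads = set(all_beads)
    tp = len(predicted & true_interface)              # correctly predicted beads
    fn = len(true_interface - predicted)              # missed interfacial beads
    fp = len(predicted - true_interface)              # wrongly promoted beads
    tn = len(all_beads - predicted - true_interface)  # correctly ignored beads
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity


# Made-up example: 20 protein beads, 5 of them truly interfacial.
beads = range(20)
interface = {0, 1, 2, 3, 4}
broad = prediction_quality({0, 1, 2, 3, 4, 5, 6, 7}, interface, beads)
narrow = prediction_quality({0, 1}, interface, beads)
```

The broad prediction recovers every interfacial bead at the cost of three false positives, whereas the narrow one contains only true interfacial beads but misses most of the interface. The discussion above suggests the former kind of prediction is the more useful for rescoring.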

Comparing the above considerations with the expected effectiveness of currently available methods for RNA binding site prediction (Figure 8) indicates that making use of their results would improve scoring efficiency by 10-20% with moderate interaction scaling (W



proper ligand placement and its scoring, a significant obstacle arises due to local and global conformational changes that likely occur upon binding and need to be predicted for successful docking. Addressing this issue is beyond the scope of the present report, which is focused on the development of an intermolecular interaction potential. Nonetheless, the performance of rigid docking with the consensus parameter set was checked on the few available unbound structures of both binding partners in order to estimate the level of structural deformation that still does not preclude successful docking.

Unbound structures were required to have at least 95% sequence identity with the bound version over the overlapping region, both for RNA and for proteins. As there are very few such instances, for bound RNA in the form of a straight double helix for which an unbound protein partner exists, the 3D structure was modeled using the secondary structure as an input. The program Assemble (36) was used for this task. The criteria applied so far for the detection of a hit were extended to incorporate solutions of acceptable quality according to the CAPRI classification, that is having iRMSD

Figure 9. Examples of crystallographic structures (marked by PDB id) for difficult cases (upper row), successful unbound docking (middle row) and unsuccessful unbound docking (lower row). Green: bound receptor; blue: unbound receptor; gray: native ligand geometry; red: best docked solution. Below the PDB id is the rank of the presented solution and its iRMSD/fNC.




4.0 Å and 0.1 fNC 2.0 Å and fNC ≥ 0.3.
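As an illustration of the fNC quantity used in these criteria, and assuming that fNC denotes the fraction of native contacts reproduced by a docked solution, a minimal sketch is given below. Contacts are taken as already enumerated (receptor bead, ligand bead) index pairs. How a contact is defined geometrically is left to the paper's earlier definition and is not modeled here. The function name and the contact lists are hypothetical.

```python
def fraction_native_contacts(native_contacts, docked_contacts):
    """Fraction of native bead-bead contacts that also occur in a docked pose.

    Both arguments are iterables of (receptor_bead, ligand_bead) index pairs.
    """
    native = set(native_contacts)
    if not native:
        raise ValueError("native contact set is empty")
    return len(native & set(docked_contacts)) / len(native)


# Made-up contact lists, for illustration only.
native = [(1, 10), (2, 11), (3, 12), (4, 13)]
docked = [(1, 10), (2, 11), (5, 14)]
fnc = fraction_native_contacts(native, docked)
```

Contacts present only in the docked pose, such as (5, 14) here, do not enter the ratio. For that reason fNC is reported together with iRMSD rather than used alone.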

Though statistics based on just the few presented cases are not particularly meaningful, they allow the results to be divided into two classes.

In cases when the protein partner binds on the surface of a double RNA helix (Figure 9, 1DFU, 1JBR, 2AZ2, 2RKJ), docking results appear to be more sensitive to distortion on the protein side than on the RNA side (Table 3). RMSD of up to 2.8 Å on the RNA side did not preclude finding a hit or an acceptable solution within the top 20 solutions when docking of the protein in its bound conformation was considered. At the same time, for the bound RNA-unbound protein pair, there were no correspondingly high ranked solutions, even for relatively low protein RMSD. Such a result may be expected, as in those cases specific binding occurs due to generally movable protein side chains, while the topology of the RNA minor and major grooves remains relatively unchanged. The stability of helical RNA and the rather loose binding geometries also account for quite successful docking of two partners in their unbound conformations: in all three cases, solutions were found within the 100 top geometries.

The situation is more difficult in the case of protein binding to single-stranded termini of RNA chains (Figure 9, 1TTT, 2E9T). Here, the distortion of unbound RNA allows only poorly scored solutions of acceptable quality to be found. Also, as binding is generally tighter than in the previous cases, even a small change in protein structure (RMSD of 0.8 Å for 2E9T) gives an unfavorable score for the hit found. In both cases, docking of two unbound partners was unsuccessful.

The remaining complex (Figure 9, 1K8W) does not belong to the above classification. Here, binding involves a long RNA loop with most bases in a flipped-out geometry. The docking of unbound protein to bound RNA is surprisingly successful, but solutions for the unbound RNA conformation are not found. This, however, need not result from changes in the geometry of the binding site on RNA, but rather from the fact that the unbound RNA represents a complete molecule (tRNA) whose fold outside the region overlapping with the bound structure causes significant steric clashes with the bound protein, thus precluding any successful binding. Whether tRNA needs to refold upon binding, or the crystallographic structure of the complex involving incomplete RNA does not represent the real geometry, remains an open question.

CONCLUDING REMARKS

A new coarse-grained force field for protein-RNA interactions, targeted at docking applications, was presented. It is compatible with the earlier representation developed for protein complexes and suitable for use within the ATTRACT docking protocol (28), based on intermolecular potential energy minimization. The force field was parametrized and tested with the use of 110 crystallographic structures of protein-RNA complexes. It showed good performance in systematic docking search for protein-RNA binding geometries when the two partners were considered in their bound conformations. The presented results were obtained without any additional constraints based on information regarding the putative binding site location; however, the possible role of such information in augmenting docking efficiency was demonstrated.

The application to unbound structures showed promising results in the cases with loose binding interfaces, for which moderate deformations of global structure do not much affect the topology of native contacts. In the cases where complex formation involves local conformational changes and the creation of sterically tight, interpenetrating binding interfaces, rigid body docking was, as expected, unsuccessful, though for reasons not related to force field performance.

Addressing the issue of global and local conformational changes in protein and RNA binding partners is a significant challenge, one that concerns not only the docking but also the folding community. An important problem, particularly affecting the development of docking methodologies, is the small amount of available structural data representing both the geometry of a complex and its components in free forms. The force field presented here, providing a reasonable estimate of intermolecular interaction energy at low computational expense, is a good basis for coupling with existing and future methods accounting for molecular deformability at different structural levels. Such a combination, leading to flexible protein-RNA docking, is at the center of our ongoing efforts.

Table 3. Results for docking with unbound structures

Complex       Receptor          Ligand            br:bl   ur:bl   br:ul      ur:ul
1DFU_mn:p     364D_abc (2.8)    1B75_a (3.8)      3       1       [18]       [4]
1JBR_d:b      430D_a (2.8)      1AQZ_a (0.6)      14      18      30[2]      100
1K8W_b:a      2K4C_a (2.8)      1R3F_a (1.5)      1               2[1]
1TTT_d:a      6TNA_a (2.6)      1TUI_a (10.1)     1       [480]   291
2RKJ_c:a,b    1Y0Q_a (2.6)      1Y42_aa (1.1)     2       [13]    [>1k]      [>1k]
2AZ2_cd:ab    model (1.4)       2B9Z_ab (1.5)     3       3[1]    >1k[349]   >1k[43]
2E9T_bc:a     model (3.3)       1U09_a (0.8)      2       [712]   >1k        [-]

Subscripts at PDB id (written here after an underscore) denote involved chains; numbers in brackets correspond to RMSD (P atoms for receptor and Cα for ligand) versus the bound structure (in the case of multiple NMR models an average value is provided); br, bl, ur, ul: bound/unbound receptor/ligand; numbers correspond to the ranks of hits or of acceptable solutions (in square brackets). Asterisks denote NMR structures.
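The rank notation of Table 3 combines two numbers in one cell: the rank of a hit, and in square brackets the rank of an acceptable solution. The helper below is a hypothetical reading aid, not part of the original work. It assumes that `>1k` means "ranked worse than 1000" (mapped to infinity) and that a missing entry or `-` means that no such solution was reported.

```python
import math
import re

# Optional bare token (hit rank) followed by an optional [token] (acceptable rank).
_CELL = re.compile(r"^\s*([^\[\]\s]+)?\s*(?:\[([^\[\]]+)\])?\s*$")


def _rank(token):
    """Convert one token to an int rank, math.inf for '>...', or None."""
    if token is None or token.strip() == "-":
        return None
    token = token.strip()
    if token.startswith(">"):
        return math.inf  # assumed: outside the reported range
    return int(token)


def parse_rank_cell(cell):
    """Split a Table 3 cell into (hit_rank, acceptable_rank)."""
    match = _CELL.match(cell)
    if match is None:
        raise ValueError(f"unrecognized rank cell: {cell!r}")
    return _rank(match.group(1)), _rank(match.group(2))
```

For example, `parse_rank_cell("30[2]")` separates a hit at rank 30 from an acceptable solution at rank 2, and `parse_rank_cell("[18]")` reports that only an acceptable solution was found.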




SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

Funding for open access charge: DFG (Deutsche Forschungsgemeinschaft) grant (Za153/19-1).

Conflict of interest statement. None declared.

REFERENCES

1. Eddy,S.R. (2001) Non-coding RNA genes and the modern RNA world. Nat. Rev. Genet., 2, 919-929.

2. Storz,G. (2002) An expanding universe of noncoding RNAs. Science, 296, 1260-1263.

3. Matera,A.G., Terns,R.M. and Terns,M.P. (2007) Non-coding RNAs: lessons from the small nuclear and small nucleolar RNAs. Nat. Rev. Mol. Cell. Biol., 8, 209-220.

4. Lunde,B.M., Moore,C. and Varani,G. (2007) RNA-binding proteins: modular design for efficient function. Nat. Rev. Mol. Cell. Biol., 8, 479-490.

5. Hogg,J.R. and Collins,K. (2008) Structured non-coding RNAs and the RNP renaissance. Curr. Opin. Chem. Biol., 12, 684-689.

6. Glisovic,T., Bachorik,J.L., Yong,J. and Dreyfuss,G. (2008) RNA-binding proteins and post-transcriptional gene regulation. FEBS Lett., 582, 1977-1986.

7. Wimberly,B.T., Brodersen,D.E., Clemons,W.M., Morgan-Warren,R.J., Carter,A.P., Vonrhein,C., Hartsch,T. and Ramakrishnan,V. (2000) Structure of the 30S ribosomal subunit. Nature, 407, 327-339.

8. Ban,N., Nissen,P., Hansen,J., Moore,P.B. and Steitz,T.A. (2000) The complete atomic structure of the large ribosomal subunit at 2.4 Å resolution. Science, 289, 905-920.

9. Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235-242.

10. Moreira,I.S., Fernandes,P.A. and Ramos,M.J. (2009) Protein-protein docking dealing with the unknown. J. Comput. Chem., 31, 317-342.

11. Cowieson,N.P., Kobe,B. and Martin,J.L. (2008) United we stand: combining structural methods. Curr. Opin. Struct. Biol., 18, 617-622.

12. Steven,A.C. and Baumeister,W. (2008) The future is hybrid. J. Struct. Biol., 163, 186-195.

13. Allers,J. and Shamoo,Y. (2001) Structure-based analysis of protein-RNA interactions using the program ENTANGLE. J. Mol. Biol., 311, 75-86.

14. Jones,S., Daley,D.T., Luscombe,N.M., Berman,H.M. and Thornton,J.M. (2001) Protein-RNA interactions: a structural analysis. Nucleic Acids Res., 29, 943-954.

15. Morozova,N., Allers,J., Myers,J. and Shamoo,Y. (2006) Protein-RNA interactions: exploring binding patterns with a three-dimensional superposition analysis of high resolution structures. Bioinformatics, 22, 2746-2752.

16. Ellis,J.J., Broom,M. and Jones,S. (2007) Protein-RNA interactions: structural analysis and functional classes. Proteins, 66, 903-911.

17. Bahadur,R.P., Zacharias,M. and Janin,J. (2008) Dissecting protein-RNA recognition sites. Nucleic Acids Res., 36, 2705-2716.

18. Terribilini,M., Lee,J.-H., Yan,C., Jernigan,R.L., Honavar,V. and Dobbs,D. (2006) Prediction of RNA binding sites in proteins from amino acid sequence. RNA, 12, 1450-1462.

19. Wang,L. and Brown,S.J. (2006) BindN: a web-based tool for efficient prediction of DNA and RNA binding sites in amino acid sequences. Nucleic Acids Res., 34, W243-W248.

20. Kumar,M., Gromiha,M.M. and Raghava,G.P.S. (2008) Prediction of RNA binding sites in a protein using SVM and PSSM profile. Proteins, 71, 189-194.

21. Cheng,C.-W., Su,E.C.-Y., Hwang,J.-K., Sung,T.-Y. and Hsu,W.-L. (2008) Predicting RNA-binding sites of proteins using support vector machines and evolutionary information. BMC Bioinformatics, 9(Suppl 12), S6.

22. Pérez-Cano,L. and Fernández-Recio,J. (2009) Optimal protein-RNA area, OPRA: a propensity-based method to identify RNA-binding sites on proteins. Proteins, 78, 25-35.

23. Liu,Z.-P., Wu,L.-Y., Wang,Y., Zhang,X.-S. and Chen,L. (2010) Prediction of protein-RNA binding sites by a random forest method with combined features. Bioinformatics, 26, 1616-1622.

24. Chen,Y., Kortemme,T., Robertson,T., Baker,D. and Varani,G. (2004) A new hydrogen-bonding potential for the design of protein-RNA interactions predicts specific contacts and discriminates decoys. Nucleic Acids Res., 32, 5147-5162.

25. Zheng,S., Robertson,T.A. and Varani,G. (2007) A knowledge-based potential function predicts the specificity and relative binding energy of RNA-binding proteins. FEBS J., 274, 6378-6391.

26. Pérez-Cano,L., Solernou,A., Pons,C. and Fernández-Recio,J. (2010) Structural prediction of protein-RNA interaction by computational docking with propensity-based statistical potentials. Pacific Sympos. Biocomput., 15, 269-280.

27. Fernández-Recio,J. and Sternberg,M.J.E. (2010) The 4th meeting on the critical assessment of predicted interaction (CAPRI) held at the Mare Nostrum, Barcelona. Proteins, 78, 3065-3066.

28. Zacharias,M. (2003) Protein-protein docking with a reduced protein model accounting for side-chain flexibility. Protein Sci., 12, 1271-1282.

29. Fiorucci,S. and Zacharias,M. (2010) Binding site prediction and improved scoring during flexible protein-protein docking with ATTRACT. Proteins, 78, 3131-3139.

30. Smith,T.F. and Waterman,M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol., 147, 195-197.

31. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403-410.

32. Dolinsky,T.J., Nielsen,J.E., McCammon,J.A. and Baker,N.A. (2004) PDB2PQR: an automated pipeline for the setup of Poisson-Boltzmann electrostatics calculations. Nucleic Acids Res., 32, W665-W667.

33. Ben-Naim,A. (1997) Statistical potentials extracted from protein structures: are these meaningful potentials? J. Chem. Phys., 107, 3698-3706.

34. Baker,C.M. and Grant,G.H. (2007) Role of aromatic amino acids in protein-nucleic acid recognition. Biopolymers, 85, 456-470.

35. Mendez,R., Leplae,R., Lensink,M.F. and Wodak,S.J. (2005) Assessment of CAPRI predictions in rounds 3-5 shows progress in docking procedures. Proteins, 60, 150-169.

36. Jossinet,F., Ludwig,T.E. and Westhof,E. (2010) Assemble: an interactive graphical tool to analyze and build RNA architectures at the 2D and 3D levels. Bioinformatics, 26, 2057-2059.

http://nar.oxfordjournals.org/cgi/content/full/gkr636/DC1
