

JMB—MS 692 Cust. Ref. No. FC 19/94 [SGML]

J. Mol. Biol. (1995) 251, 308–326

Global Fold Determination from a Small Number of Distance Restraints

A. Aszódi¹*, M. J. Gradwell² and W. R. Taylor¹

¹Division of Mathematical Biology, ²Division of Molecular Structure, National Institute for Medical Research, The Ridgeway, Mill Hill, London, NW7 1AA, UK

*Corresponding author

We have designed a distance geometry-based method for obtaining the tertiary fold of a protein from a limited number of structure-specific distance restraints and the secondary structure assignment. Interresidue distances were predicted from patterns of conserved hydrophobic amino acids deduced from multiple alignments. A simple model chain representing the protein was then folded by projecting its distance matrix into Euclidean spaces with gradually decreasing dimensionality until a final three-dimensional embedding was achieved. Tangled conformations produced by the projection steps were eliminated using a novel filtering algorithm. Information on various aspects of protein structure such as accessibility and chirality was incorporated into the conformation refinement, increasing the robustness of the algorithm. The method successfully identified the correct folds of three small proteins from a small number of restraints, indicating that it could serve as a useful computational tool in protein structure determination from NMR data.

© 1995 Academic Press Limited

Keywords: protein folding; distance geometry; structure determination; distance prediction; molecular modelling

Introduction

Most interactions governing the formation of native protein structures can be coded in terms of interresidue or interatomic distances, providing a set of internal coordinates that can be processed conveniently by distance geometry methods in simulations. On the experimental side, the impressive development of NMR spectroscopy now enables the determination of the molecular conformation of small proteins in solution. Structural information from NMR experiments comes in the form of distance restraints (i.e. acceptable interatomic distance ranges flanked by lower and upper bounds), usually estimated from 2D NOE spectra. Plausible conformations that conform to these distance restraints can be generated by a wide variety of computational tools (Kuntz et al., 1989; Havel, 1991; James, 1994), most of which fall into one of the broad categories of restrained molecular dynamics and distance geometry methods. The latter are natural candidates for the NMR structure determination problem, since they enable the simultaneous satisfaction of all distance restraints in addition to the holonomic constraints (bond lengths, angles, etc.) imposed upon the structure and generate model conformations that are compatible with the information obtained from the experiments. Frequently a hybrid approach is chosen whereby the distance geometry program generates a family of plausible structures that can then be refined by molecular dynamics simulations.

The route from the raw NMR spectra to the Brookhaven databank is often tortuous: the distance restraints may be insufficient, too lax or inconsistent. Robust algorithms are needed that can produce adequate global folds even if the restraints are not tight or there are just a few of them. A theoretical analysis of the minimum number of restraints necessary for a successful structure determination, based on statistical mechanics, was carried out by Gutin & Shakhnovich (1994). Smith-Brown et al. (1993) folded various protein chains by a Monte Carlo method guided by a small number of distance restraints, while Hoch & Stern (1992) compared a restrained molecular dynamics optimisation with the classic distance geometry protocol (Crippen & Havel, 1988). Their results suggest that it is possible to design reasonably robust methods that can locate the correct fold using a small amount of distance information. Similar conclusions have been drawn when the problem has been approached from a structure prediction viewpoint (Taylor, 1993; Dandekar & Argos, 1994).

Abbreviations used: 2D and 3D, two- and three-dimensional; NOE, nuclear Overhauser effect; PDB, Protein Data Bank.

0022–2836/95/320308–19 $08.00/0 © 1995 Academic Press Limited

We have shown that model polypeptide chains can be folded into compact three-dimensional conformations possessing a hydrophobic core and secondary structure using distance geometry techniques (Aszódi & Taylor, 1994a,b). The model chains were random 1:1 copolymers of “hydrophobic” and “hydrophilic” monomers, and the interresidue distances were set according to the hydrophobicity of the residue pairs. Although the chains folded into protein-like conformations, they could not be regarded as models of any particular protein. A logical extension of these studies was to make the simulations more realistic by modelling the polypeptide chain more accurately and by supplying external distance restraints to the program. In order to increase the robustness of the algorithm, a considerable amount of background information about various aspects of protein structure such as hydrophobic packing, interresidue distance distribution, general topological properties and chirality was incorporated.

Algorithms

The algorithm presented in this work is the extension of the gradual projection approach implemented in the program DRAGON-2 (Aszódi & Taylor, 1994b) and the core algorithm remained essentially the same. To avoid repetition, only the additions and improvements to the method are described below.

Strategy

The objective of our calculations was to simulate the determination of the structures of small monomeric proteins by NMR. The following pieces of information were assumed to be available.

(1) The sequence of the protein.
(2) A set of sequences homologous to that of the protein to be modelled.
(3) A full assignment of secondary structure derived possibly from short-range NOEs.
(4) A set of long-range distance restraints in the form of lower and upper distance bounds between amino acid side-chains.

The sequence and secondary structure assignment information was obtained from the Protein Data Bank (Bernstein et al., 1977). First, a multiple sequence alignment was constructed that provided information about the conservation of individual amino acid residues. The conservation information, coupled with hydrophobicity data, was in turn used to predict the distances between those residues for which no extra distance data were available. These predicted distances, together with simulated NOE-derived restraints, were submitted to the gradual projection algorithm. The number of simulated NOE restraints was varied to determine the robustness of the approach. For each restraint set, 25 model structures were generated. The same model chains, complete with secondary structure assignment and distance restraint data were also submitted to the program X-PLOR (Brünger, 1992) and the resulting model conformations served as controls.

Figure 1. The geometry of the model polypeptide chain. A, The backbone is built of Cα atoms and the side-chains are represented by pseudo-Cβ atoms corresponding to the 20 natural amino acids. The different chemical identities are indicated by the different sizes and patterns of the pseudo-Cβ atoms. B, The ith pseudo-Cβ atom (open circle) lay in the plane of the (i − 1)th, ith and (i + 1)th Cα atoms (filled circles). The distance d_αβ between a Cα atom and its corresponding pseudo-Cβ atom as well as the van der Waals radii of the pseudo-Cβ atoms r_β depended on the amino acid type of the monomer (see Table 1).

The model chain

The model chain was represented as a linear heteropolymer of the 20 natural amino acids. The geometric and physicochemical properties of the amino acid monomers were described by a set of attributes as follows.

Chain geometry

The polypeptide chain was modelled by an α-carbon backbone and single Cβ atoms (Figure 1A) representing amino acid side-chains (“lollipop model”, Levitt, 1976). The positions of Cβ atoms were determined by the backbone conformation: the position of the ith dummy Cβ atom was obtained by reflecting the midpoint between the (i − 1)th and (i + 1)th Cα atoms through the ith Cα atom (Figure 1B) as described (Aszódi & Taylor, 1994b). However, for each amino acid monomer, the distance between the Cα and Cβ atoms (d_αβ) now corresponded to the average Cα to centroid distances observed in native proteins (Table 1). Volume exclusion was modelled by centering hard van der Waals spheres on the atoms. For Cα atoms, the van der Waals radius was set to r_α = 2.0 Å, regardless of amino acid type. For the Cβ atoms, the van der Waals radii r_β were set so that the van der Waals sphere had the same volume as the corresponding side-chain type in native proteins (Table 1).

Table 1. Properties of the model amino acids

                  Monomer properties
Amino acid    r_β (Å)    d_αβ (Å)    H
Ala           1.76       1.53        1.73
Cys           2.03       2.06        0.84
Asp           2.23       2.47        0.03
Glu           2.48       3.11        0.01
Phe           2.79       3.41        1.48
Gly           0.00       0.00        1.27
His           2.61       3.15        0.06
Ile           2.60       2.33        3.46
Lys           2.67       3.44        0.03
Leu           2.60       2.61        2.56
Met           2.61       2.96        0.86
Asn           2.35       2.45        0.01
Pro           2.21       1.88        0.18
Gln           2.57       3.07        0.03
Arg           2.88       4.11        0.00
Ser           1.94       1.89        0.49
Thr           2.25       1.95        0.59
Val           2.38       1.97        2.46
Trp           3.07       3.87        0.74
Tyr           2.88       3.76        0.59

r_β is the van der Waals radius of the pseudo-Cβ side-chain atom. d_αβ is the distance between the Cα and pseudo-Cβ atoms. H is the hydrophobicity of the residue according to Levitt.

Other attributes

In addition to the geometry data described above, each monomer in the chain had attributes describing hydrophobicity and conservation. Levitt’s hydrophobicity scale (Levitt, 1978) was chosen from the many scales available (Table 1), but the program can accommodate any scale provided the values are shifted so that all of them are non-negative. Conservation of each residue in the model sequence was calculated from the multiple alignment as described below.

Chirality

In all distance geometry-based methods chirality needs special treatment, since the distance matrices of a point set and its mirror image are identical. Also, objects that were chiral in N-dimensional spaces become achiral in N + 1 dimensions so that handedness consistency cannot be guaranteed in the gradual projection method (Aszódi & Taylor, 1994b). While the ultimate source of structural asymmetry in proteins is the consistent handedness of the amino acids, our model chains were built from symmetric monomers and the correct 3D chirality was imposed upon them in a separate refinement stage. The handedness of helices and the asymmetric twist of β-sheets were adjusted through the torsion angle about the main-chain H bonds as described (Aszódi & Taylor, 1994b).

Interresidue distances

Although the model chain monomers were composed of two atoms, the positions of the fake Cβ atoms were determined by the Cα backbone conformation and the Cα to centroid separation data, therefore the model chain could fully be described by the Cα–Cα distances. These distances, henceforth referred to as interresidue distances, fell in the following categories.

(1) “Hard” distances: these were determined by the protein chain geometry, e.g. the virtual Cα–Cα bond lengths and distances between residues within the same secondary structural element and were assumed to be accurately known.

(2) “Experimental” distances: these were distance restraints specific to the target molecule and were assumed to have been supplied by experimental measurements such as NMR spectroscopy, usually with less accuracy than the hard distances.

(3) “Soft” distances: this large category contained all the remaining distances, of which little more was known than that they fell between the lower and upper limits determined by the bump distances and the estimated diameter of the molecule, respectively.

Each category required different treatment, as outlined below.

Hard distances

One of the few distances that can be predicted with any certainty is the virtual Cα bond length d_1 = 3.8 Å. The separation between second neighbours is more variable; in the present study these were set to an average value of d_2 = 6.0 Å (Figure 1B) but allowing for a moderate wobble of the Cα virtual bond angles (Aszódi & Taylor, 1994a).

Since the secondary structure assignment was assumed to be known, the Cα distances derived from ideal helical and sheet conformations (Pauling & Corey, 1951a,b; Wako & Scheraga, 1982) were incorporated into the matrix of desired distances. Most of these values are fairly constant due to the strict geometric requirements imposed upon the protein chain by backbone hydrogen-bonding.

Simulated NOE distance restraints

Simulated long-range distance restraints were obtained from the PDB structures of the proteins to be modelled. (Long-range here refers to sequential rather than spatial separation: in general, only distances between residues separated by at least five other residues in the sequence were considered.) All pairwise distances between these amino acid side-chain centroids were calculated and those shorter than a threshold of 5.0 Å were selected. These selected distances were converted to lower and upper distance bounds by subtracting and adding 2.0 Å, respectively. In order to obtain an abundant set of restraints, the threshold was set to 7.0 Å. To simulate the more realistic experimental case when the number of long-range NOEs may be rather small, a variable fraction of randomly chosen restraints were gradually removed from the basic 5.0 Å threshold set to produce more and more sparse sets, following Hoch & Stern (1992). In general, one set was constructed for a given number of restraints. The sensitivity of the method to the choice of restraints was tested by generating 20 different random datasets containing the same number of restraints for each target structure.
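The restraint simulation described above can be sketched in a few lines of Python. This is an illustrative reading of the text, not the authors' code: the function names, the choice to read "at least five other residues" as j − i > 5, the clamping of the lower bound at zero and the seeded random thinning are all assumptions made here.

```python
import math
import random
from itertools import combinations

def simulated_restraints(centroids, threshold=5.0, tolerance=2.0, min_gap=5):
    """Return (i, j, lower, upper) tuples for long-range centroid pairs.

    centroids : list of (x, y, z) side-chain centroid coordinates in Å
    threshold : pairs closer than this are selected (5.0 Å in the text)
    tolerance : subtracted and added to form the bounds (2.0 Å in the text)
    min_gap   : pairs with j - i <= min_gap are skipped (assumed reading of
                "separated by at least five other residues")
    """
    restraints = []
    for i, j in combinations(range(len(centroids)), 2):
        if j - i <= min_gap:
            continue  # not long-range in the sequential sense
        d = math.dist(centroids[i], centroids[j])
        if d < threshold:
            # Clamping the lower bound at zero is an assumption of this sketch.
            restraints.append((i, j, max(d - tolerance, 0.0), d + tolerance))
    return restraints

def thin(restraints, keep, seed=0):
    """Keep a random subset of `keep` restraints (hypothetical helper)."""
    return random.Random(seed).sample(restraints, keep)

# Made-up coordinates: residues 0 and 7 are 3 Å apart, the rest are far away.
coords = [(0.0, 0.0, 0.0)] + [(0.0, 20.0 * k, 0.0) for k in range(1, 7)] + [(3.0, 0.0, 0.0)]
full_set = simulated_restraints(coords)
```

Calling `thin(full_set, n)` with decreasing `n` would produce the progressively sparser sets; different `seed` values would give different random datasets of the same size.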

Soft distances

In previous versions of DRAGON, these virtually unknown distances were adjusted so that pairs of “hydrophobic” residues should come close together. This crude model of the hydrophobic effect was significantly refined in the present program. Interresidue distances were predicted from the pairwise conserved hydrophobicity score (Taylor, 1991), based on the assumption that pairs of residues that are both conserved and hydrophobic are likely to be close together in the core of the molecule.

Sequences homologous to the target sequences were obtained by searching the SwissProt database (Bairoch & Boeckmann, 1991) and a final set that originated from a wide range of evolutionarily distant organisms showing only a modest level of homology to the model sequence was selected by hand. Multiple alignments of these sequences were generated by the MULTAL algorithm (Taylor, 1988) as implemented in the CAMELEON package (Version 3.0C, Oxford Molecular) using the default parameter setting.

The conservation $g_i$ at the ith position of an alignment of N sequences was measured by the average of the pairwise similarity scores of all amino acids in the position:

$$g_i = \frac{2}{N(N-1)} \sum_{j=1}^{N-1} \sum_{k=j+1}^{N} M(R_{ij}, R_{ik}) \tag{1}$$

where $R_{ij}$ is the type of the amino acid in the jth sequence in alignment position i and $M(\cdot, \cdot)$ is an entry in the PAM250 amino acid similarity matrix (Dayhoff et al., 1978). The similarity matrix was scaled so that all entries were ≥ 0 by subtracting the smallest entry from all the others. The similarity score between a gap and any amino acid was always set to zero. The conservation value was normalised by the maximal score encountered among the amino acid pairs:

$$c_i = \frac{g_i}{\max_{j<k} M(R_{ij}, R_{ik})} \tag{2}$$

ensuring that $0 \le c_i \le 1$. The pairwise hydrophobic packing score could then be defined as:

$$h_{ij} = c_i H_i + c_j H_j \tag{3}$$

where $c_i$ is the conservation and $H_i$ is the average Levitt hydrophobicity (Levitt, 1978) of the amino acids in the ith sequential position. The hydrophobic scores were converted into predicted distances via the transformation:

$$d_{ij} = -p_1 h_{ij}^{p_2} + p_3 \tag{4}$$

where the three parameters $p_1, p_2, p_3 > 0$ were estimated by non-linear regression so that the distribution of the predicted distances matched the observed interresidue distance distribution in a representative set of small monomeric proteins (Aszódi & Taylor, 1995).
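Equations (1) to (4) can be traced with a short Python sketch. The similarity table below is a made-up toy with three entries, not PAM250, and the parameter values passed to `predicted_distance` are arbitrary placeholders rather than the regression estimates; the function names are likewise introduced here for illustration only.

```python
from itertools import combinations

def conservation(column, sim):
    """Equations (1) and (2): normalised conservation of one alignment column.

    column : residues at this position, one per sequence, "-" for a gap
    sim    : dict mapping frozenset({a, b}) to a non-negative similarity score
    """
    def score(a, b):
        if a == "-" or b == "-":
            return 0.0  # gap against anything scores zero
        return sim[frozenset((a, b))]

    pair_scores = [score(a, b) for a, b in combinations(column, 2)]
    if not pair_scores or max(pair_scores) == 0.0:
        return 0.0  # assumption: undefined ratio treated as no conservation
    g = sum(pair_scores) / len(pair_scores)  # equals 2/(N(N-1)) times the double sum
    return g / max(pair_scores)

def packing_score(c_i, H_i, c_j, H_j):
    """Equation (3): pairwise conserved hydrophobicity score."""
    return c_i * H_i + c_j * H_j

def predicted_distance(h, p1, p2, p3):
    """Equation (4): predicted distance from the packing score."""
    return -p1 * h ** p2 + p3

# Toy, already non-negative similarity scores (NOT PAM250 values).
toy_sim = {
    frozenset("L"): 6.0,
    frozenset("I"): 5.0,
    frozenset(("L", "I")): 4.0,
}
c_full = conservation(["L", "L", "L"], toy_sim)
c_gappy = conservation(["L", "I", "-"], toy_sim)
```

A fully conserved column gives a conservation of 1, while gaps pull the value down because each gap pair contributes zero to the average but not to the maximum.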

Distance adjustments

For hard and soft distance data, the actual values in the distance matrix were “massaged” towards the corresponding desired values by a convex function (Aszódi & Taylor, 1994a). The extent of the adjustment was regulated by “strictness” values, which were close to 1 for hard distances and close to 0 for soft distances. In energetic terms, this adjustment strategy is equivalent to replacing the detailed description of the interresidue potentials by their corresponding minima.

Adjustment of “experimental” distances was carried out so that if an interresidue distance was shorter than the lower simulated NOE limit then it was increased, if it was longer than the upper limit then it was reduced. No adjustment was performed if the distance in question fell between the two limits. This strategy corresponds to a potential well with a flat bottom.
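The two adjustment rules can be illustrated with a minimal Python sketch. The text does not give the functional form of the convex function or the size of the correction applied to violated restraints, so the linear blend in `adjust_towards` and the snap-to-bound behaviour in `adjust_experimental` are assumptions made purely for illustration.

```python
def adjust_towards(d, desired, strictness):
    """Move a hard or soft distance towards its desired value.

    Assumption: a linear blend weighted by `strictness` (1 = fully enforced,
    0 = left unchanged). The published function may differ.
    """
    return (1.0 - strictness) * d + strictness * desired

def adjust_experimental(d, lower, upper):
    """Flat-bottomed adjustment for a simulated NOE restraint.

    Assumption: a violating distance is moved exactly onto the violated
    bound; distances inside [lower, upper] are returned untouched.
    """
    if d < lower:
        return lower  # too short: increase
    if d > upper:
        return upper  # too long: reduce
    return d
```

With a strictness near 1 a hard distance is pulled almost all the way to its ideal value, whereas a soft distance with a strictness near 0 barely moves, which matches the qualitative behaviour described above.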

For all interresidue distances, a minimal and maximal separation can be calculated. The minimal value, representing steric repulsion, was determined by the sum of the appropriate van der Waals radii; the maximal value was determined by the expected maximal separation for two residues linked by a fully extended chain or by the estimated diameter of the molecule (Aszódi & Taylor, 1994a), whichever was smaller. These minimal and maximal limits were applied to all residue pairs as hard restraints.
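A possible reading of these limits in Python is given below. Approximating the fully extended separation as the virtual bond length (3.8 Å) times the sequence separation is an assumption of this sketch, as are the function names; the molecular diameter is taken as a caller-supplied estimate.

```python
def separation_limits(i, j, radius_i, radius_j, diameter, bond=3.8):
    """Return (minimal, maximal) allowed separation for residues i and j.

    radius_i, radius_j : van der Waals radii of the two atoms in Å
    diameter           : estimated diameter of the molecule in Å
    bond               : virtual C-alpha bond length; the extended-chain
                         estimate bond * |i - j| is an assumption
    """
    lower = radius_i + radius_j
    upper = min(bond * abs(i - j), diameter)
    return lower, upper

def clamp(d, lower, upper):
    """Force a distance into the [lower, upper] interval."""
    return max(lower, min(d, upper))

near = separation_limits(0, 3, 2.0, 2.0, 30.0)   # limited by chain length
far = separation_limits(0, 10, 2.0, 2.0, 30.0)   # limited by the diameter
```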

Residue accessibility

Accessibility was calculated by the “cone” algorithm (Aszódi & Taylor, 1994b) with minor modifications, the most important being that the cone for the kth residue was required to contain only those residues closer than 8.0 Å. This local approach improved the estimation of burial for residues situated near crevices on the protein surface and resulted in a considerable increase in computing speed.

In previous versions of DRAGON, residue burial adjustment was based upon a simple logic: it was undesirable for hydrophobic residues to be on the surface, and equally undesirable for hydrophilics to reside in the interior. Since in the present study the residues had different chemical identities corresponding to the 20 natural amino acids with 20 different hydrophobicity values, a more detailed treatment was necessary. To this end the distributions of conic accessibility values for all 20 amino acids in a reference set of protein structures were approximated by histograms. For each amino acid type, the central 80% portion of its accessibility distribution was considered acceptable and if a residue in the model had an accessibility value within this range then no adjustment was performed. If a residue had an accessibility value that fell into the lower 10% of the accessibility distribution corresponding to its amino acid type, then that residue was considered too exposed and it was moved towards the centre of the molecule. Similarly, residues that were more buried than 90% of the native residues of the same amino acid type were moved towards the surface. The extent of the adjustment in both cases was graded according to the actual position in the accessibility distribution curve; wild outliers were adjusted more strictly.
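The 80% acceptance rule can be sketched as follows. The reference values here are synthetic placeholders rather than real conic accessibilities, the empirical-fraction estimate stands in for the histograms mentioned in the text, and the function names and return labels are hypothetical. The mapping of the lower tail to an inward move follows the wording of the paragraph above.

```python
def tail_position(value, reference):
    """Fraction of reference accessibility values that are <= value."""
    return sum(1 for r in reference if r <= value) / len(reference)

def burial_action(value, reference, tail=0.10):
    """Classify a model residue against its amino-acid-specific distribution.

    Returns "move_inward" for the lower tail, "move_outward" for the upper
    tail and "ok" for the central 80% (with the default tail of 10%).
    """
    q = tail_position(value, reference)
    if q <= tail:
        return "move_inward"
    if q > 1.0 - tail:
        return "move_outward"
    return "ok"

# Synthetic reference distribution for one amino acid type (placeholder data).
reference = [k / 100.0 for k in range(100)]
```

A graded adjustment, as described in the text, could scale the size of the move by how far `tail_position` lies beyond the 10% or 90% mark; that scaling is not specified in enough detail here to reproduce.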

Tangles

The various distance geometry approaches based on the projection algorithm just generate point coordinates from interpoint distances and have no information about the connectivity of the polypeptide chain. As a result, tangled conformations are frequently observed. Such structures are very few, if not completely absent, among naturally occurring folded polypeptides (Connolly et al., 1980) and therefore tangled conformations produced by the distance geometry algorithm must be filtered out. To this end a simple and fast heuristic was devised that can correct most tangled conformations.

First, the polypeptide chain was divided into segments according to its secondary structure. The basic idea of tangle filtering was that different segments were not allowed to penetrate each other. To this end, chain segments were represented with suitably chosen sets of tetrahedra and penetration was detected if another segment intersected at least one of these tetrahedra.

The concept of containment detection is easier to understand in a unidimensional example (Figure 2A). If the point $P_{in}$ is between the two points $R_0$ and $R_1$, then its position vector $\vec{P}_{in}$ can be expressed as a linear combination of the position vectors $\vec{R}_0$ and $\vec{R}_1$:

$$\vec{P}_{in} = s_0 \vec{R}_0 + s_1 \vec{R}_1 \tag{5}$$

so that the coefficients $s_0$ and $s_1$ obey the following relations:

$$s_0 + s_1 = 1, \qquad 0 \le s_0, s_1 \le 1 \tag{6}$$

The two points $R_0$ and $R_1$ define a unidimensional simplex. By analogy, the point $P_{in}$ is lying within a tetrahedron (the 3D simplex) defined by the four points $R_0, R_1, R_2, R_3$ if the coefficients $s_i$, $i = 0, \ldots, 3$ in the linear combination:

$$\vec{P}_{in} = \sum_{i=0}^{3} s_i \vec{R}_i \tag{7}$$

satisfy the following relations:

$$\sum_{i=0}^{3} s_i = 1, \qquad \forall s_i : 0 \le s_i \le 1 \tag{8}$$

Figure 2. The tetrahedral containment algorithm. A, The unidimensional case. B, The three-dimensional case. See the text for explanation and details.

The coefficients can be found as follows. By subtracting $\vec{R}_0$ from both sides of equation (7) and expressing $s_0 = 1 - (s_1 + s_2 + s_3)$, which follows from equation (8), the equation:

$$\vec{p} = s_1 \vec{r}_1 + s_2 \vec{r}_2 + s_3 \vec{r}_3 \tag{9}$$

is obtained, where $\vec{p} = \vec{P}_{in} - \vec{R}_0$ and $\vec{r}_i = \vec{R}_i - \vec{R}_0$. In other words, the vector $\vec{s} = (s_1, s_2, s_3)$ corresponds to the position vector $\vec{p}$ of P in a local coordinate system with origin $R_0$ and (non-orthogonal) base vectors $\vec{r}_1, \vec{r}_2, \vec{r}_3$ (Figure 2B). Equation (9) can be recast in matrix form:

$$\mathbf{R} \cdot \vec{s} = \vec{p} \tag{10}$$

where the columns of the matrix $\mathbf{R}$ are the base vectors:

$$\mathbf{R} = [\vec{r}_1 \,|\, \vec{r}_2 \,|\, \vec{r}_3] \tag{11}$$
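The containment test of equations (7) to (11) can be tried out with the short Python sketch below. It is restricted to ordinary 3D coordinates and solves the 3 × 3 system by Cramer's rule, which is a stand-in chosen here to keep the example free of external libraries and is not presented as the solver used in the paper. The function names and the `eps` tolerance are hypothetical.

```python
def sub(a, b):
    """Component-wise difference of two 3D points."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def det3(c1, c2, c3):
    """Determinant of the 3x3 matrix whose columns are c1, c2, c3."""
    return (c1[0] * (c2[1] * c3[2] - c2[2] * c3[1])
            - c2[0] * (c1[1] * c3[2] - c1[2] * c3[1])
            + c3[0] * (c1[1] * c2[2] - c1[2] * c2[1]))

def tetra_coefficients(p, r0, r1, r2, r3, eps=1e-12):
    """Return (s0, s1, s2, s3) for point p, or None for a flat tetrahedron."""
    e1, e2, e3 = sub(r1, r0), sub(r2, r0), sub(r3, r0)  # base vectors, eq. (11)
    q = sub(p, r0)                                      # local position, eq. (9)
    det = det3(e1, e2, e3)
    if abs(det) < eps:
        return None  # the four points do not span a 3D tetrahedron
    s1 = det3(q, e2, e3) / det
    s2 = det3(e1, q, e3) / det
    s3 = det3(e1, e2, q) / det
    s0 = 1.0 - (s1 + s2 + s3)  # normalisation from eq. (8)
    return (s0, s1, s2, s3)

def inside(p, r0, r1, r2, r3):
    """True if p lies within (or on the boundary of) the tetrahedron."""
    s = tetra_coefficients(p, r0, r1, r2, r3)
    return s is not None and all(0.0 <= x <= 1.0 for x in s)

# Unit tetrahedron used as a worked example.
R0, R1, R2, R3 = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)
centre_coeffs = tetra_coefficients((0.25, 0.25, 0.25), R0, R1, R2, R3)
```

For the point (0.25, 0.25, 0.25) all four coefficients equal 0.25, so it is inside; for (1, 1, 1) the coefficient of the origin vertex is −2, so it is outside.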

Equation (10) can be solved by singular value decomposition (Rozsa, 1991), which automatically finds the 3D subspace spanned by the tetrahedron even in D > 3 hyperspaces where the matrix $\mathbf{R}$ has D rows and three columns. For a proper tetrahedron, $\mathbf{R}$ has a rank $r(\mathbf{R}) = 3$. If the four points do not span a tetrahedron because they lie in a subspace of fewer than three dimensions, this condition is indicated by the decomposition algorithm returning a rank $r(\mathbf{R}) < 3$.

Having found the coefficient vector $\vec{s} = (s_1, s_2, s_3)$, the missing coefficient $s_0$ can now be obtained by simply observing that $s_0 = 1 - (s_1 + s_2 + s_3)$. If all coefficients are in the range 0 to 1, then the point P is inside the tetrahedron. To decide whether a portion of a line between two points $P_A$ and $P_B$ is contained by a tetrahedron even if the points themselves are outside, we observe that every point P on the $P_A$:$P_B$ segment can be expressed as the linear combination:

$$\vec{P} = \lambda \vec{P}_A + (1 - \lambda) \vec{P}_B, \qquad 0 \le \lambda \le 1 \tag{12}$$

see equation (5), and the tetrahedral coefficients $s_i$ for P can be obtained from those of $P_A$ and $P_B$ ($a_i$ and $b_i$, respectively) because:

$$\vec{P} = \sum_{i=0}^{3} s_i \vec{R}_i = \sum_{i=0}^{3} \left(\lambda a_i + (1 - \lambda) b_i\right) \vec{R}_i \tag{13}$$

If there is a range of $0 \le \lambda_{min} \le \lambda_{max} \le 1$ such that all coefficients $s_i$ satisfy the conditions in equation (8) then the segment $P_A$:$P_B$ intersects the tetrahedron.

The chain segments were represented by tetrahedra as follows. Helices were made up by tetrahedra defined by the points (i, i + 1, i + 2, i + 3), (i + 1, i + 2, i + 3, i + 4), . . . where i is the index of the first Cα atom in the helix (Figure 3A). Sheets were treated as sets of non-overlapping tetrahedra made up by the four Cα atoms surrounding each main-chain H bond (Figure 3B). Finally, no tetrahedra were put on coil segments. Tangle detection was then carried out following each projection step by testing all tetrahedra in helices and sheets for containing pieces of the chain from all other segments. If two segments were found to intersect then both were moved away from each other as rigid bodies. The procedure was repeated until no more intersections were found.

Figure 3. Sets of tetrahedra superimposed on secondary structures. The thick lines symbolise the Cα-backbones, the various thin lines represent the tetrahedra. A, Helices. Only every second tetrahedron is shown for clarity. B, Sheets.

Model quality assessment

Rigid-body structural alignment

Model structures were compared with the known PDB structures by weighted rigid-body rotation using McLachlan’s algorithm (McLachlan, 1979). Only the backbone Cα atoms were used in the comparison. The weights $w_i$ were derived from the B-factors $B_i$ in the PDB entries†, so that flexible portions of the target structure with large B-factors were given less weight:

$$w_i = \frac{B_{max} - B_i}{B_{max} - B_{min}} \tag{14}$$

where $B_{min}$ and $B_{max}$ were the smallest and largest B-factor in the known structure, respectively. The similarities were then expressed as Cα coordinate RMS deviations.

† The B-factors of the tendamistat structure (3AIT) were simulated from RMS deviations between a set of NMR structures.

Precision and accuracy

Typically, the analysis of NMR-derived data leads to a collection of plausible structures. Since the native structure is unknown, direct assessment of modelling accuracy via RMS deviations is impossible. Accuracy may instead be determined by comparing the simulated NMR spectra calculated from the models with the experimental spectrum in a manner analogous to the R-factor determination in X-ray crystallography (James, 1994). In our simulation study, we had the advantage of knowing the target structures and therefore accuracy was monitored through RMS deviations as described above.

The precision, on the other hand, is easy to determine; a useful measure could be the average RMS deviation of the individual models from their average. Whereas a high level of precision (i.e. a set of very similar model structures) does not necessarily mean that the native conformation was found, a low level of precision usually indicates that the modelling algorithm did not perform adequately. The RMS Cα backbone deviations between the individual models and their average were determined and the average of these RMS values was used as an indication of modelling precision.

Topology comparison

Model structure topologies were compared with the corresponding native topologies by visual inspection. The target and model backbone coordinates were smoothed by repeatedly scanning a moving average window along the chain, and the smoothed models aligned to the target were displayed and inspected using a simple molecular graphics program developed in our laboratory. Tangle checks were performed on the unsmoothed model backbones.
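The backbone smoothing step can be pictured with a small Python sketch. The window width, the number of passes and the truncation of the window at the chain ends are not stated in the text, so the defaults below are assumptions, and `smooth` is a hypothetical name.

```python
def smooth(coords, window=3, passes=1):
    """Moving-average smoothing of a list of (x, y, z) backbone coordinates.

    window : number of consecutive residues averaged (assumed odd)
    passes : how many times the window is scanned along the chain
    The window is truncated at the chain ends (an assumption of this sketch).
    """
    half = window // 2
    for _ in range(passes):
        smoothed = []
        for i in range(len(coords)):
            chunk = coords[max(0, i - half):min(len(coords), i + half + 1)]
            n = len(chunk)
            smoothed.append(tuple(sum(c[k] for c in chunk) / n for k in range(3)))
        coords = smoothed
    return coords

# Made-up zigzag trace: smoothing reduces the amplitude of the zigzag.
zigzag = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 0.0, 0.0), (3.0, 1.0, 0.0), (4.0, 0.0, 0.0)]
once = smooth(zigzag)
```

Repeated passes flatten local wiggles further, which makes the overall path of the chain easier to compare by eye.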

The model folds were classified into correct, slightly incorrect, incorrect and mirror image topologies. Folds in which just one secondary structure element was misplaced were regarded as slightly incorrect; folds with two or more misplaced segments were classified as incorrect topologies. Topologies that were mirror images of the native topology formed a special group.

X-PLOR controls

The performance of the DRAGON algorithm was compared with that of X-PLOR Version 3.1 (Brünger, 1992), a popular tool frequently used for NMR structure determination. The model residue and parameter files were written for X-PLOR so that all data corresponded to the conventions used by DRAGON as closely as possible.

Chain geometry

A general carbon atom type was defined for representing backbone Cα atoms with a van der Waals radius of 2.0 Å. The monomer side-chains were described by 20 pseudo-Cβ atom types, one for each amino acid type, using the same van der Waals radii and Cα–Cβ distances as in DRAGON (Table 1). The coplanarity of the atoms in the monomers was maintained by defining the appropriate improper axes of rotation. The virtual Cα bond angles and the Cα–Cα–Cβ angles were allowed to vary around their mean values as in DRAGON. This was achieved by incorporating the equivalent distance ranges Cα(i − 1)–Cα(i + 1), Cα(i − 1)–Cβ(i) and Cβ(i)–Cα(i + 1) as constraints throughout the calculations. The force constants were 1000 kcal/mol per Å² for the pseudobonds and 500 kcal/mol per rad² for the pseudoimproper rotation axes and the distance ranges restraining the pseudobond angles. These values were the same for all monomers and were kept constant throughout the simulations.

Secondary structure geometry was maintained as follows. For helices, the Cα(i)–Cα(i + 3) distances were restrained with an upper limit of 6.9 Å. In order to obtain right-handed helices, the virtual Cα bond torsional angles were also loosely constrained to 48(±20)°. β-Sheet geometry was imposed by restraining the direct and adjacent cross-strand Cα distances with target values of 5.1 Å and 6.1 Å, respectively. The distances between equivalent carbon atoms two strands away from each other on a β-sheet were also constrained with target values of 10 Å. These constraints were enforced by a soft, asymptotic square-well function (Nilges et al., 1988) with a fixed force constant of 50 kcal/mol per Å².
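The general shape of such a restraint term can be sketched as follows. This is a simplified stand-in and not the exact functional form implemented in X-PLOR: the energy is zero inside the allowed range, harmonic for small violations and linear for large ones. The switching distance `switch` and the function name are assumptions made for illustration; the force constant of 50 kcal/mol per Å² and the role of the asymptote slope follow the text.

```python
def soft_square_well(d, lower, upper, k=50.0, slope=1.0, switch=0.5):
    """Energy (kcal/mol) of a distance d (in Å) under a flat-bottomed restraint.

    Simplified piecewise form: zero between the bounds, k * dev**2 for
    violations up to `switch` (an assumed value), then a linear tail with
    the given asymptote slope so that large violations are penalised gently.
    """
    if d < lower:
        dev = lower - d
    elif d > upper:
        dev = d - upper
    else:
        return 0.0
    if dev <= switch:
        return k * dev * dev
    # Linear continuation; the energy is continuous at dev == switch.
    return k * switch * switch + slope * (dev - switch)
```

For a helical Cα(i)–Cα(i + 3) pair with an upper limit of 6.9 Å, `soft_square_well(7.0, 0.0, 6.9)` gives a small harmonic penalty, whereas a grossly violated distance grows only linearly, which keeps badly misfolded starting structures from being dominated by a few restraints.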

Simulated annealing protocol

For each subset of long-range simulated NOE distance restraints, a family of 25 conformations was calculated. The NOE restraints were applied to the pseudo-Cβ atoms as in DRAGON-3, and were enforced by the same square-well potential as described above for the α-helices and β-sheets. Non-bonded interactions were represented by a quartic hard-sphere potential. A simulated annealing protocol, based on a method by Nilges et al. (1988), was used, starting from extended conformations. During dynamics, temperature coupling (Berendsen et al., 1984) was used and bond stability was imposed using the SHAKE method (Ryckaert et al., 1977; van Gunsteren & Berendsen, 1977). Initially the force constant for the non-bonded interaction potential was 0.002 kcal/mol per Å² with the atomic radii set to 1.2 times the values used in DRAGON. The slope of the asymptote for the distance restraint function was 0.1. Using these values, 50 cycles of restrained Powell minimization (Powell, 1977) were performed. The atomic velocities were scaled to a Maxwellian population at 4000 K and 80 ps of molecular dynamics was calculated, with a target temperature of 4000 K. The slope of the asymptote for the distance restraint potential was increased to 1.0 and a further 40 ps of high-temperature dynamics was calculated. The system was then cooled to a simulated temperature of 100 K in 25 K steps, and for each target temperature 0.39 ps of restrained dynamics was calculated. At each cycle, the sizes of the atomic radii relative to the values used in DRAGON were scaled down from 1.2 and the force constant for the non-bonded potential was scaled up from 0.002 exponentially such that the final values were 1.0 and 4.0, respectively. Finally, 1000 cycles of Powell minimization were performed.
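The cooling stage lends itself to a small worked example. The sketch below tabulates one plausible reading of the schedule: the temperature falls linearly from 4000 K to 100 K in 25 K steps while the radius factor and the non-bonded force constant are interpolated geometrically between their initial and final values. Whether the first cooling cycle runs at 4000 K or at 3975 K, and the exact interpolation rule, are assumptions; `cooling_schedule` is a hypothetical helper name.

```python
def cooling_schedule(t_start=4000.0, t_end=100.0, t_step=25.0,
                     radius_start=1.2, radius_end=1.0,
                     k_start=0.002, k_end=4.0):
    """Return a list of (temperature, radius_factor, force_constant) tuples.

    Both scaled quantities change by a constant factor per cycle
    (geometric, i.e. exponential in the cycle index).
    """
    n_steps = int(round((t_start - t_end) / t_step))
    schedule = []
    for i in range(n_steps + 1):
        frac = i / n_steps
        temperature = t_start - i * t_step
        radius = radius_start * (radius_end / radius_start) ** frac
        k_nonbonded = k_start * (k_end / k_start) ** frac
        schedule.append((temperature, radius, k_nonbonded))
    return schedule
```

With these settings the schedule has 157 temperature points, so at 0.39 ps per point the cooling stage corresponds to roughly 61 ps of dynamics.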

Implementation

DRAGON-3 was implemented as an ANSI C program by A.A. The models presented here were produced by running DRAGON-3 in batch mode on various SGI computers. Execution time for a 3ICB model (75 residues) was about two minutes on an SGI Challenge S equipped with an R4400/150 MHz processor. X-PLOR was run on a Sun Sparcstation 10 model 41, and a 3ICB model with 86 long-range restraints was generated in about 35 minutes. The corresponding execution time for DRAGON on the same machine was about four minutes. The plots were generated using the ACE/gr program by Paul J. Turner; the protein cartoons were drawn by the RasMol package (author Roger Sayle).

Data

For the purposes of this study, a set of well-resolved small monomeric proteins was needed that represented the major structural classes. The target structures (Brookhaven codes are given in parentheses) chosen for analysis were bovine vitamin D-dependent Ca²⁺-binding protein (3ICB), tendamistat (3AIT) and thioredoxin (2TRX).

Bovine vitamin D-dependent Ca²⁺-binding protein (3ICB)

This protein represented the all-α structural class. It is 75 residues long and contains four α-helices and a small irregular helix. The X-ray structure was determined at 2.3 Å resolution (Szebenyi & Moffat, 1986). The following sequences were aligned with the 3ICB sequence (PIR access codes in parentheses): troponin C from chick (P02588) and from the Japanese horseshoe crab (P15159), Caenorhabditis elegans calmodulin-like protein (P04630), β-parvalbumin from Xenopus laevis (P05940) and from Latimeria chalmersii (P02623), human S100-L protein (P29034) and rat calpactin I light chain (P05943).

Tendamistat (3AIT)

This 74-residue long protein, an α-amylase inhibitor from Streptomyces tendae, is a sandwich of two three-strand antiparallel β-sheets, and it contains two disulphide bonds. The structure of the PDB entry was determined by constrained energy minimisation from solution NMR measurements using the AMBER 3.0 protocol (Billeter et al., 1990). The sequence database search provided a set of relatively close homologues to the S. tendae sequence and another set of very distant, spurious matches. The multiple alignment was therefore constructed from various α-amylase inhibitor sequences from other Streptomyces species (PIR access codes P01093, P07512, P09921, P20078 and P20596).

Thioredoxin (2TRX)

Thioredoxin is an electron transport protein from Escherichia coli. The PDB entry contains two chemically identical chains in the unit cell designated A and B. Since parts of chain B are disordered, chain A was chosen as the target structure. Thioredoxin is 108 residues long and contains a five-strand mixed β-sheet in the core, shielded by five helices in a fold similar to that of flavodoxin (4FXN). The structure was determined at a resolution of 1.68 Å (Katti et al., 1990). The multiple alignment was constructed from the following protein sequences (PIR access codes in parentheses): thioredoxin from Anabaena (P20857), Saccharomyces cerevisiae (P22217), Aspergillus nidulans (P29429) and Pisum sativum (P29450), protein S–S isomerase (EC 5.3.4.1) precursor from mouse (P09103) and from man (P07237), bloodstream-specific protein 2 precursor from Trypanosoma brucei (P12865) and a sequence described as a "probable ERP72 protein homolog" from C. elegans.

Results

Bovine vitamin D-dependent Ca²⁺-binding protein

Conserved hydrophobicity

The conserved hydrophobicity score deduced from the multiple alignment highlighted the conserved components of the hydrophobic core (Figure 4). With the exception of Leu23 and Val61, all hydrophobic residues with a conservation higher than 0.75 were in helical regions.

Figure 4. Distribution of conserved hydrophobic residues in the bovine vitamin D-dependent Ca²⁺-binding protein (3ICB). A, Conservation (lower graph) and conserved hydrophobicity (upper graph) profiles obtained from the multiple alignment. B, Position of hydrophobic side-chains with a conservation score higher than 0.75.

Model quality dependence on the number of restraints

Four sets of simulated NOE restraints were defined for the models, containing 86, 19, 10 and 5 long-range restraints, respectively. An additional dataset containing no restraints served as an internal control.

Figure 5. Simulation results for the bovine vitamin D-dependent Ca²⁺-binding protein (3ICB). A, Average RMS deviation of the models from the PDB structure as a function of the number of long-range distance restraints per residue. Circles and squares represent the models generated by DRAGON-3 and X-PLOR, respectively. B, Results of the visual topology checks. Open, hatched and filled bars symbolise correct, slightly incorrect and seriously incorrect topologies, respectively. Chequered bars symbolise mirror images. For every restraint set, the left and right bars represent DRAGON (D) and X-PLOR (X) models, respectively. C, Effect of the choice of restraints. The average RMS deviations for 20 datasets are depicted as filled circles (the abscissae are the RMS values and the points are shifted vertically for clarity). The error bars correspond to the standard deviations of the RMS values within the datasets. The pooled distribution of the model RMS deviations is plotted in the background.

With 86 long-range restraints, both DRAGON and X-PLOR correctly identified the native fold, the average RMS deviation being slightly better for the X-PLOR structures than for the DRAGON structures. As the number of restraints decreased, DRAGON performed increasingly better than X-PLOR. With 19 restraints, the average RMS deviations of the models produced by the two programs were about the same, X-PLOR still having a slight advantage. With ten or five restraints, however, the DRAGON structures still had essentially correct topologies (save occasional misplaced helices), while the X-PLOR models became increasingly inaccurate (Figure 5A and B). Without any long-range restraints, X-PLOR failed to produce compact structures at all, which rendered the comparison meaningless (Table 2).

Although the average RMS deviation of the DRAGON structures was as high as 10 Å in the no-restraints case, the program still correctly identified the native topology in 40% of the runs and always produced compact, globular structures (Figure 6).


Table 2. RMS deviation of model structures from target 3ICB

| Number of restraints | Restraints per residue | RMS (Å), DRAGON | RMS (Å), X-PLOR | S.D. (Å), DRAGON | S.D. (Å), X-PLOR |
|---|---|---|---|---|---|
| 86 | 1.14 | 2.88 | 2.39 | 0.21 | 0.17 |
| 19 | 0.25 | 4.93 | 4.44 | 0.48 | 1.52 |
| 10 | 0.13 | 6.26 | 7.33 | 2.04 | 2.51 |
| 5 | 0.07 | 6.32 | 11.2 | 1.64 | 2.38 |
| 0 | 0.00 | 10.0 | 21.3 | 1.50 | 3.20 |

Averages and standard deviations of the individual RMS values are tabulated as a function of the number of simulated NOE restraints for models produced by DRAGON and X-PLOR. Entries significantly (P < 0.01) lower than their counterparts are set in boldface.

Model quality dependence on the choice of restraints

The sensitivity to the choice of the distances was tested on 20 different datasets each containing ten restraints. For each dataset, 25 models were generated by DRAGON. The pooled distribution of the RMS values of these 500 models was unimodal with a maximum at 5.0 Å; the overall average RMS value was 6.55(±1.76) Å. The average RMS deviations of the individual datasets spanned a range from 5.37 Å to 8.61 Å (Figure 5C).

Due to the much higher CPU time requirements of the X-PLOR calculations, five representative datasets were chosen among the 20 datasets described above and 25 models were generated from these by X-PLOR. For the same dataset, DRAGON-3 performed consistently better than X-PLOR, as indicated by comparison of the corresponding RMS averages (Table 3).

Table 3. Comparison of representative ten-restraint datasets for 3ICB

| RMS (Å), DRAGON | RMS (Å), X-PLOR | S.D. (Å), DRAGON | S.D. (Å), X-PLOR |
|---|---|---|---|
| 5.37 | 7.14 | 1.06 | 3.03 |
| 5.88 | 8.16 | 1.46 | 2.87 |
| 6.43 | 7.76 | 1.06 | 2.50 |
| 6.71 | 8.40 | 1.62 | 1.72 |
| 8.33 | 12.1 | 1.39 | 2.54 |

Averages and standard deviations of the individual RMS values are tabulated as a function of the number of simulated NOE restraints for models produced by DRAGON and X-PLOR. Entries significantly (P < 0.01) lower than their counterparts are set in boldface.

Tendamistat

Conserved hydrophobicity

The hydrophobic core of tendamistat is not so well defined as that of 3ICB (Figure 7). The edge strands 52–57 in sheet 1 and 67–73 in sheet 2 possess no hydrophobic residues with a conservation score higher than 0.75, while the remaining strands contain mainly smaller hydrophobic residues. Two conserved hydrophobic residues, Trp18 and Ala28, occupy exposed positions, while the flexible N-terminal tail of the molecule contains two more conserved hydrophobics, Val4 and Ala8.

Figure 7. Distribution of conserved hydrophobic residues in tendamistat (3AIT). Symbols are as in Figure 4. Note the exposed hydrophobic residues Trp18 and Ala28.

Model quality dependence on the number of restraints

The tendamistat datasets contained 120, 46, 10, 5 and 0 long-range simulated NOE restraints. For this small β-sandwich protein, DRAGON always performed significantly better than X-PLOR, the latter producing numerous mirror-image topologies and heavily tangled structures (Figure 8A and B). The DRAGON models were essentially correct down to the five long-range restraint case, where packing became loose but the native topology was still preserved (Figure 9). Slightly incorrect topologies (5 out of 25) were observed only without any long-range restraints (Table 4).

From a modelling point of view, the structure of tendamistat has a few inconvenient features. The N-terminal end of the molecule is rather mobile, which is reflected in the B-factor entries in the PDB file (obtained from the RMS deviations between a set of NMR structures). The presence of conserved apolar residues occupying exposed positions on the surface, which is not uncommon among small proteins involved in non-covalent protein-protein interactions such as enzyme inhibition, can confuse DRAGON's hydrophobic core-building heuristic, which forces exposed hydrophobic residues towards the centre of the molecule if no extra information is available. With just five long-range distance restraints DRAGON therefore was also unable to pack the β-sheets together.

Model quality dependence on the choice of restraints

The assessment was performed in the same way as for 3ICB models (see above). The 20 distance sets, each containing ten restraints, were submitted to DRAGON and 25 models were generated for each set. The pooled distribution of the RMS deviations of these 500 models was bimodal, with maxima at approximately 6.0 and 9.0 Å. The overall average RMS value was 7.49(±1.52) Å. The average RMS deviations of the individual datasets spanned a range from 5.24 Å to 9.31 Å (Figure 8C).

In the majority of cases DRAGON again performed significantly better than X-PLOR when the two methods were compared using five representative datasets (Table 5).

Thioredoxin

Conserved hydrophobicity

Thioredoxin has the most complex structure among the proteins modelled in this study. It contains a five-strand mixed β-sheet in the core, shielded by helices, and represented a challenge to both programs. The key conserved hydrophobic residues distribute evenly in the core (Figure 10), but the overall level of conservation was somewhat lower than for the two other target proteins, probably due to the larger number of sequences (12) in the multiple alignment. Therefore only 11 hydrophobic residues had conservation scores higher than 0.75. One of these, Trp31, occupies an exposed position at the beginning of the second helix (residues 32 to 49).

Figure 8. Simulation results for tendamistat (3AIT). Symbols are as in Figure 5.


Table 4. RMS deviation of model structures from target 3AIT

| Number of restraints | Restraints per residue | RMS (Å), DRAGON | RMS (Å), X-PLOR | S.D. (Å), DRAGON | S.D. (Å), X-PLOR |
|---|---|---|---|---|---|
| 120 | 1.62 | 3.72 | 5.29 | 0.16 | 2.19 |
| 46 | 0.62 | 4.12 | 5.99 | 0.22 | 2.06 |
| 10 | 0.14 | 5.76 | 7.32 | 0.56 | 1.46 |
| 5 | 0.07 | 7.70 | 9.85 | 0.36 | 1.06 |
| 0 | 0.00 | 9.38 | 11.9 | 0.66 | 1.30 |

Data and symbols are as in Table 2.

Model quality dependence on the number of restraints

The NOE distance datasets contained 161, 48, 30, 20 and 0 simulated restraints. Starting from the two largest restraint sets, DRAGON always found the correct topology and the average RMS deviations of the models were good. X-PLOR, on the other hand, produced several mirror-image topologies even for the largest dataset and the structures were occasionally tangled as well, giving rise to significantly higher average RMS values. When the number of restraints was reduced to 30, DRAGON produced two distinct subsets of solutions: 12 models out of 25 were just as good as those obtained from the large dataset runs, possessing correct topologies with RMS values around 4.5 Å, while the other half were less satisfactory, with incorrect topologies (Figure 11A and B). At this stage, the models produced by X-PLOR were comparable with those of DRAGON, with the exception that the number of mirror images and other seriously incorrect topologies was higher (Figure 12). On further reduction of the number of restraints, both programs produced models with high RMS values and about one-third of the models had correct topologies (Table 6).

Model quality dependence on the choice of restraints

The most interesting behaviour was expected at the "crossover point" where two distinct families of models were produced from a 30-restraint set. The 20 distance sets, each containing 30 restraints, were therefore submitted to DRAGON and for each set 25 models were generated. The pooled distribution of the RMS deviations of the 500 models was multimodal, containing four distinct peaks at 4.5 Å, 6.5 Å, 9 Å and 10.5 Å, respectively. The overall average RMS value was 7.97(±2.35) Å. The average RMS deviations of the individual datasets spanned a range from 5.3 Å to 10.1 Å (Figure 11C). The standard deviations of RMS values were high for most datasets, indicating that a considerable range of the overall distribution was sampled, producing subsets of high and low-quality models from the same restraints.

DRAGON and X-PLOR performed similarly when compared using representative 30-restraint datasets. In four out of five cases, there was no significant difference between the average RMS values (Table 7).

With only a small number of restraints, DRAGON performed significantly better, because X-PLOR failed to pack the helices against the central sheet and therefore the RMS deviations were much larger than those obtained for the compact DRAGON structures. It was clear, however, that neither program could produce useful results if fewer than 20 long-range restraints were available. Comparison with the results obtained from the simulations of 3ICB and 3AIT suggests that the minimal number of long-range restraints necessary for correct fold identification is a non-linear function of chain length.

Table 5. Comparison of representative ten-restraint datasets for 3AIT

| RMS (Å), DRAGON | RMS (Å), X-PLOR | S.D. (Å), DRAGON | S.D. (Å), X-PLOR |
|---|---|---|---|
| 5.24 | 6.45 | 0.36 | 1.68 |
| 6.22 | 7.61 | 0.43 | 1.44 |
| 7.35 | 8.19 | 0.70 | 1.57 |
| 8.58 | 10.6 | 0.74 | 1.20 |
| 8.83 | 7.79 | 0.54 | 0.87 |

Data and symbols are as in Table 3.

Figure 6. Model Cα backbones aligned to the native CABP structure 3ICB. Residue positions are colour-coded from blue (N terminus) to red (C terminus) in a smooth range. A and B, Models generated from 19 long-range restraints by DRAGON and X-PLOR, respectively. X-PLOR produced better structures in this case, except one outlier, which had a slightly incorrect topology. The DRAGON models were less accurate but all 25 models had correct topologies. C and D, Models generated from five long-range restraints by DRAGON and X-PLOR. Of the 25 DRAGON models, 20 still had correct topologies, whereas 21 X-PLOR models were incorrect. Note that the X-PLOR model images were scaled down to fit in the screen.

Figure 9. Model Cα backbones aligned with the native tendamistat structure 3AIT. Colour codes as in Figure 6. A and B, Models generated from 46 long-range restraints by DRAGON and X-PLOR, respectively. DRAGON produced markedly better results, whereas the X-PLOR models were more variable. C and D, Models generated from five long-range restraints by DRAGON and X-PLOR. All DRAGON models still had correct topologies, while the majority of the X-PLOR models became incorrect.

Figure 10. Distribution of conserved hydrophobic residues in thioredoxin (2TRX). Symbols are as in Figure 4. Note the exposed hydrophobic residue Trp31.

Figure 11. Simulation results for thioredoxin (2TRX). Symbols are as in Figure 5.

Figure 12. Model Cα backbones aligned with the native thioredoxin structure 2TRX (A-chain). Colour codes as in Figure 6. A and B, Models generated from 48 long-range restraints by DRAGON and X-PLOR, respectively. The DRAGON models were correct in all cases, while X-PLOR produced a number of incorrect results. C and D, Models generated from 20 long-range restraints by DRAGON and X-PLOR. Both methods located the correct topologies in about ten cases out of 25 but the overall model quality was poor.

Discussion

Accuracy

In most cases, DRAGON produced more accurate results than X-PLOR, especially when the number of distance restraints was small. It should be kept in mind that in a real NMR structure determination problem the maximal accuracy is limited by the quality of the experimental data. This aspect was modelled by establishing an artificial ±2 Å range for all simulated restraints, which of course may be variable when real experimental distance estimates are used. Also, an important simplification in the present study was that side-chains were replaced by a single pseudo-Cβ atom and the restraints were imposed on these, whereas in reality the NOE-derived restraints apply to the H atoms. It is possible, however, to convert individual inter-H distance restraints into side-chain centroid distance restraints. These restraints can be used by DRAGON to generate a set of starting structures, which in turn can be refined by another program such as X-PLOR using the original H–H distance restraints applied to a full-atom representation of the molecule.

In the absence of detailed structural information, usually a number of different folds are generated. The gradual projection algorithm of DRAGON was often capable of finding the correct topology even under unfavourable circumstances due to the ease with which large structural rearrangements can be made in high-dimensional spaces. This correlates with the observation that local minima can be avoided by performing energy minimisation in four or more dimensions (Crippen, 1982; Purisima & Scheraga, 1986).

Precision and sampling

The precision of the DRAGON models was almost always significantly better than that of the models produced by X-PLOR, judged by the RMS deviation values of the individual models measured with respect to the corresponding average structures (Table 8).

Precision, however, cannot be evaluated separately from the sampling properties of the algorithm in question. While it is practically impossible to perform an exhaustive search in protein conformational spaces, good sampling at least should ensure that a reliable, unbiased estimate is obtained.

The sampling properties of distance geometry-based techniques were assessed in detail by Havel (1991). The "randomized metrization" method used in his programs, which coupled the selection of uniformly distributed trial distances with triangle inequality smoothing, provided very good results. An efficient version of this algorithm called "partial metrization" was implemented by Kuszewski et al. (1992). A similar approach was used in the DRAGON simulations, which started with a random initial distance matrix, but the triangle inequality smoothing was performed indirectly, through adjusting the elements of the metric matrix in a self-consistent loop (Aszódi & Taylor, 1994a). This smoothed distance matrix, subsequently projected into a high-dimensional Euclidean space, corresponded to a much greater structural variability than can be obtained from starting with a random coil conformation in 3D. Our previous studies with the predecessors of the present algorithm (Aszódi & Taylor, 1994b) showed that in the absence of specific distance restraints a wide variety of structures was generated. Taking into account the low standard deviation of model RMS values obtained in the present study, it can be concluded that the algorithm delivered high precision and good sampling at the same time.

Table 6. RMS deviation of model structures from target 2TRX

| Number of restraints | Restraints per residue | RMS (Å), DRAGON | RMS (Å), X-PLOR | S.D. (Å), DRAGON | S.D. (Å), X-PLOR |
|---|---|---|---|---|---|
| 161 | 1.49 | 3.86 | 4.96 | 0.13 | 2.55 |
| 48 | 0.44 | 4.57 | 6.54 | 0.62 | 3.02 |
| 30 | 0.28 | 8.64 | 8.61 | 2.17 | 3.62 |
| 20 | 0.19 | 10.6 | 11.3 | 1.0 | 2.7 |
| 0 | 0.00 | 11.4 | 14.1 | 1.6 | 2.5 |

Data and symbols are as in Table 2.

Robustness

In the present context, the robustness of an algorithm means the ability to produce satisfactory results even if the input data are scarce or unreliable. For an NMR structure determination problem, the lack of interresidue distance data could be compensated for by applying various heuristics based on our general knowledge about proteins. DRAGON employs a wide range of such heuristics to achieve its goal; the most important of these are the modelling of the hydrophobic effect, the elimination of tangled conformations and the chirality checks.

Compactness and the hydrophobic effect

The hydrophobic effect was modelled by reducing the distances between hydrophobic residues and by constantly monitoring the accessibility of every side-chain in the molecule. The construction of the hydrophobic core was directed by the identification of the conserved hydrophobic residues. When no external distance restraints were available, this approach still succeeded in producing compact structures, often with the correct topology, whereas X-PLOR usually generated loose tangles "floating" randomly. It must be noted, however, that DRAGON was more successful with α-helices, which often provide clear structural orientation in the form of their hydrophobic moments.

The projection into spaces of gradually decreasing dimensionality also facilitated the formation of compact structures, since this operation effectively compresses the point set being projected. Various checks were applied to ensure that the correct point density was maintained through the series of projections (Aszódi & Taylor, 1994b).

Table 7. Comparison of representative 30-restraint datasets for 2TRX

| RMS (Å), DRAGON | RMS (Å), X-PLOR | S.D. (Å), DRAGON | S.D. (Å), X-PLOR |
|---|---|---|---|
| 5.30 | 6.48 | 1.67 | 3.07 |
| 6.84 | 6.03 | 1.96 | 2.28 |
| 7.46 | 8.56 | 1.74 | 3.67 |
| 8.48 | 10.3 | 2.50 | 2.73 |
| 9.37 | 10.2 | 2.37 | 2.63 |

Data and symbols are as in Table 3.

Table 8. RMS deviation of model structures from their respective average structure

| Target | Number of restraints | Restraints per residue | RMS (Å), DRAGON | RMS (Å), X-PLOR |
|---|---|---|---|---|
| 3ICB | 86 | 1.14 | 0.97 | 0.90 |
| 3ICB | 19 | 0.25 | 2.48 | 2.59 |
| 3ICB | 10 | 0.13 | 4.03 | 5.43 |
| 3ICB | 5 | 0.07 | 4.62 | 7.00 |
| 3ICB | 0 | 0.00 | 7.69 | 10.7 |
| 3AIT | 120 | 1.62 | 1.69 | 4.00 |
| 3AIT | 46 | 0.62 | 1.73 | 4.36 |
| 3AIT | 10 | 0.14 | 2.87 | 5.78 |
| 3AIT | 5 | 0.07 | 2.61 | 5.35 |
| 3AIT | 0 | 0.00 | 3.38 | 5.72 |
| 2TRX | 161 | 1.49 | 1.54 | 4.79 |
| 2TRX | 48 | 0.44 | 1.92 | 5.71 |
| 2TRX | 30 | 0.28 | 8.06 | 6.62 |
| 2TRX | 20 | 0.19 | 4.20 | 7.60 |
| 2TRX | 0 | 0.00 | 6.57 | 8.87 |

Bold figures indicate entries significantly lower than their counterparts.

Tangles

Another, often overlooked, problem in simulations is tangling. Tangles are very rare if not totally absent in native polypeptide folds, yet they frequently occur in models. Distance geometry techniques that rely on metric matrix projection are particularly liable to generate this kind of artifact, especially when larger molecules are modelled, which can fold into more complex patterns. Tangles may trap the chain in an incorrect conformation and thereby hinder the convergence of the algorithm, rendering it less robust. Tangles are difficult to detect by computer because of the lack of a reliable mathematical definition. (Knots, which can exactly be defined in topological terms, occur in closed curves only, while protein backbones, in general, are open curves.) The heuristic employed in DRAGON-3 relied on the backbone hydrogen-bond topology of secondary structures and therefore cannot be considered universal: in its present implementation, it cannot detect tangles in a chain with no secondary structure at all. (Although the present algorithm can be extended, further complexity would incur a severe performance penalty.) Most tangles were nevertheless successfully eliminated in the simulations, indicating that an advantageous trade-off was achieved between generality and practicality.
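The geometric core of the tetrahedron-based tangle check described in the Methods can be sketched in a few lines. The sketch is illustrative only: the function names are hypothetical, and the 3 × 3 linear system is solved by Cramer's rule here rather than by the matrix decomposition mentioned in the text. A point is expressed in tetrahedral coefficients that sum to one, and a segment intersects the tetrahedron if some λ in the range 0 to 1 keeps every interpolated coefficient non-negative.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def tetra_coeffs(p, verts):
    """Return (s0, s1, s2, s3) with p = sum(s_i * R_i) and sum(s_i) = 1.

    Returns None for a degenerate (flat) tetrahedron.
    """
    r0 = verts[0]
    edges = [[verts[i][k] - r0[k] for k in range(3)] for i in (1, 2, 3)]
    rhs = [p[k] - r0[k] for k in range(3)]
    # The columns of the matrix are the edge vectors R_i - R_0.
    mat = [[edges[j][k] for j in range(3)] for k in range(3)]
    d = det3(mat)
    if abs(d) < 1e-12:
        return None
    s = []
    for j in range(3):
        mj = [[rhs[k] if jj == j else edges[jj][k] for jj in range(3)]
              for k in range(3)]
        s.append(det3(mj) / d)
    return (1.0 - sum(s), s[0], s[1], s[2])

def segment_hits_tetra(pa, pb, verts, eps=1e-9):
    """True if some point lam*pa + (1 - lam)*pb, 0 <= lam <= 1, lies inside."""
    a, b = tetra_coeffs(pa, verts), tetra_coeffs(pb, verts)
    if a is None or b is None:
        return False
    lam_min, lam_max = 0.0, 1.0
    for ai, bi in zip(a, b):
        slope = ai - bi  # s_i(lam) = bi + lam * slope must stay >= 0
        if abs(slope) < eps:
            if bi < -eps:
                return False
        elif slope > 0:
            lam_min = max(lam_min, -bi / slope)
        else:
            lam_max = min(lam_max, -bi / slope)
    return lam_min <= lam_max + eps
```

Because the coefficients sum to one, requiring all of them to be non-negative already guarantees that none exceeds one, so only the lower bound needs to be tracked along the segment.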

Chirality

Chirality presents a persistent problem to all distance geometry-based techniques, since the handedness information is not contained in the distance matrix and therefore even a perfect set of distance restraints corresponds to both the correct topology and its mirror image. Fortunately, incorrect mirror images can be filtered out at the tertiary structural level since the correct chirality of the secondary structural elements is known. If the handedness of helices and sheets is carefully monitored, as is done in the DRAGON algorithm, then the mirror image topology will be an energetically slightly unfavourable diastereomer, rather than an indistinguishable enantiomer, of the correct topology.
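The reason distances alone cannot fix handedness is easy to verify numerically. In the sketch below an idealised Cα helix (radius 2.3 Å, rise 1.5 Å and 100° per residue, used purely for illustration) is reflected through a plane: all pairwise distances are unchanged, while the signed volume spanned by four consecutive points changes sign. The function names are hypothetical and the signed volume is only one of several possible handedness measures.

```python
import math

def helix(n, radius=2.3, rise=1.5, turn_deg=100.0):
    """Idealised C-alpha trace of a helix with n residues."""
    step = math.radians(turn_deg)
    return [(radius * math.cos(step * i), radius * math.sin(step * i), rise * i)
            for i in range(n)]

def mirror(points):
    """Reflect all points through the yz plane."""
    return [(-x, y, z) for x, y, z in points]

def distances(points):
    """All pairwise distances, i.e. the upper triangle of the distance matrix."""
    return [math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]]

def signed_volume(a, b, c, d):
    """Triple product (b - a) . ((c - a) x (d - a)); its sign encodes handedness."""
    u = [b[k] - a[k] for k in range(3)]
    v = [c[k] - a[k] for k in range(3)]
    w = [d[k] - a[k] for k in range(3)]
    return (u[0] * (v[1] * w[2] - v[2] * w[1])
            - u[1] * (v[0] * w[2] - v[2] * w[0])
            + u[2] * (v[0] * w[1] - v[1] * w[0]))
```

Checking the sign of such a quantity for every helical turn is one simple way to tell a model from its mirror image once the secondary structure assignment is known.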

Spatial distribution of distance restraints

The distribution of the interresidue distance restraints within the molecule determines the quality of the simulation to a considerable degree, as indicated by our results obtained from randomly chosen distance sets containing the same number of restraints. Although theoretically it might be possible to find the best arrangement of a given number of restraints for a particular structure, in practice the freedom of choice usually does not exist. Our simulations suggest that restraints between secondary structure elements alone are, in general, not sufficient to determine the overall fold correctly. Restraints between coils are also important, as indicated by the tendamistat simulations where the long, flexible chain segments showed a definite tendency to intercalate between the sheets, thus giving rise to incorrect models.

Comparison with other methods

Comparing the performance of DRAGON-3 with other methods is difficult, because of the differences in the published simulation protocols. For example, Smith-Brown et al. (1993) impose simulated Cα–Cα restraints with a constant 2 Å difference between the lower and upper bounds onto a polyglycine backbone. On the other hand, Hoch & Stern (1992) use a lollipop chain similar to ours, but the distance restraints are applied to the Cα and pseudo-Cβ atoms as well, making the interpretation of the results more difficult (see above). Ideally, one should compare all available methods using the same protocol, which is, unfortunately, impractical. We therefore decided to compare DRAGON-3 with X-PLOR, a widely used and very flexible tool. The simplified chain representation of DRAGON-3 was transferred easily to the X-PLOR protocol, enabling us to make meaningful comparisons.

The task given to both algorithms was by no means trivial. The distance restraint sets were designed to be inaccurate; the lower and upper distance bounds were separated by 4 Å, a range twice as wide as that used in the study of Smith-Brown et al. (1993). Also, the simulated NOE restraints were applied to the pseudo-Cβ atoms, while accuracy was tested by Cα coordinate RMS deviations; thus, the restraints had only an indirect effect on the backbone conformations.
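A loose restraint of this kind can be sketched as follows. This is an illustration only: the text states the 4 Å separation between the bounds, but the random placement of the window around the true distance, as well as the names `simulated_restraint` and `violation`, are assumptions made for the example and do not describe the protocol actually used.

```python
import random

WINDOW = 4.0  # Å, separation of lower and upper bounds (value taken from the text)


def simulated_restraint(true_distance, rng, window=WINDOW):
    """Return (lower, upper) bounds `window` apart that bracket the true distance.

    Assumption for this sketch: the window is shifted by a random offset,
    so the true distance lies anywhere between the two bounds.
    """
    offset = rng.uniform(0.0, window)
    lower = max(0.0, true_distance - offset)  # a distance bound cannot be negative
    return lower, lower + window


def violation(distance, lower, upper):
    """Amount by which a model distance falls outside [lower, upper]; 0 if satisfied."""
    if distance < lower:
        return lower - distance
    if distance > upper:
        return distance - upper
    return 0.0


rng = random.Random(0)  # fixed seed so that the example is reproducible
true_distances = [3.8, 5.0, 10.0, 25.3]  # hypothetical pseudo-Cβ distances in Å
restraints = [simulated_restraint(d, rng) for d in true_distances]

for d, (lower, upper) in zip(true_distances, restraints):
    print(f"true {d:5.1f} Å  bounds [{lower:5.2f}, {upper:5.2f}]  "
          f"violation {violation(d, lower, upper):.2f}")
```

Every model distance inside the 4 Å window satisfies the restraint equally well, which is why such data constrain the backbone only weakly.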

The results indicated that DRAGON had a definite advantage in situations where robustness became important, due to the extra background information derived from the conserved hydrophobicity patterns, and the detangling procedure. DRAGON also has the appealing property that it always converges to a compact three-dimensional conformation.

Conclusion

We have presented here a distance geometry-based approach that was designed to incorporate general background knowledge about proteins in order to produce acceptable results even when specific structural information is scarce. The aim of our work has been to show that topologically correct solutions to sparse distance data are more easily obtained when general properties of proteins are incorporated, giving rise to compact, tangle-free conformations with well-defined hydrophobic cores. The method of attaining these attributes in the final model is secondary and, although we preferred a distance geometry-based approach, similar results might well have been attained through optimisation in Euclidean space. However, we did not simply encode these general heuristics in one of the several current refinement methods, as their orientation to detailed atomic representation is not ideally suited to the broad (topological) emphasis we desired. We regard our method as suitable for generating an ensemble of good starting conformations that might then be further refined by a program such as X-PLOR. A molecular modelling tool based on this tandem approach should help experimentalists to obtain more reliable structures from NMR data.

Acknowledgements

The authors thank Dr J. Feeney and Dr A. Sali for their helpful comments and suggestions.

References

Aszodi, A. & Taylor, W. R. (1994a). Folding polypeptide alpha-carbon backbones by distance geometry methods. Biopolymers, 34, 489–505.

Aszodi, A. & Taylor, W. R. (1994b). Secondary structure formation in model polypeptide chains. Protein Eng. 7, 633–644.

Aszodi, A. & Taylor, W. R. (1995). Estimating polypeptide α-carbon distances from multiple sequence alignments. J. Math. Chem. 17, 167–184.

Bairoch, A. & Boeckmann, B. (1991). The SWISS-PROT protein sequence data bank. Nucl. Acids Res. 19, 2247–2249.

Berendsen, H. J. C., Postma, J. P. M., van Gunsteren, W. F., Dinola, A. & Haak, J. (1984). Molecular dynamics with coupling to an external bath. J. Chem. Phys. 81, 3684–3690.

Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F., Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977). The protein databank: a computer-based archival file for macromolecular structures. J. Mol. Biol. 112, 535–542.

Billeter, M., Schaumann, T., Braun, W. & Wuthrich, K. (1990). Restrained energy refinement with two different algorithms and force fields of the structure of the alpha-amylase inhibitor tendamistat determined by NMR in solution. Biopolymers, 29, 695–706.

Brunger, A. T. (1992). X-PLOR Manual. Yale University, New Haven, CT.

Connolly, M. L., Kuntz, I. D. & Crippen, G. M. (1980). Linked and threaded loops in proteins. Biopolymers, 19, 1167–1182.

Crippen, G. M. (1982). Conformational analysis by energy embedding. J. Comp. Chem. 3, 471–476.

Crippen, G. M. & Havel, T. F. (1988). Distance Geometry and Molecular Conformation. Chemometrics Research Studies Press, Wiley, New York.

Dandekar, T. & Argos, P. (1994). Folding the main-chain of small proteins with the genetic algorithm. J. Mol. Biol. 236, 844–861.

Dayhoff, M. O., Schwartz, R. M. & Orcutt, B. C. (1978). A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure (Dayhoff, M. O., ed.), vol. 5, suppl. 3, pp. 345–352, Nat. Biomed. Res. Foundation, Washington DC.

Gutin, A. M. & Shakhnovich, E. I. (1994). Statistical mechanics of polymers with distance constraints. J. Chem. Phys. 100, 5290–5293.

Havel, T. F. (1991). An evaluation of computational strategies for use in the determination of protein structure from distance constraints obtained by nuclear magnetic resonance. Prog. Biophys. Mol. Biol. 56, 43–78.

Hoch, J. C. & Stern, A. S. (1992). A method for determining overall protein fold from NMR distance restraints. J. Biomol. NMR, 2, 535–543.

James, T. L. (1994). Computational strategies pertinent to NMR solution structure determination. Curr. Opin. Struct. Biol. 4, 275–284.

Katti, S. K., LeMaster, D. M. & Eklund, H. (1990). Crystal structure of thioredoxin from Escherichia coli at 1.68 Å resolution. J. Mol. Biol. 212, 167–184.

Kuntz, I. D., Thomason, J. F. & Oshiro, C. M. (1989). Distance geometry. Methods Enzymol. 177, 159–204.

Kuszewski, J., Nilges, M. & Brunger, A. T. (1992). Sampling and efficiency of metric matrix distance geometry: a novel partial metrization algorithm. J. Biomol. NMR, 2, 33–56.

Levitt, M. (1976). A simplified representation of protein conformations for rapid simulation of protein folding. J. Mol. Biol. 104, 59–107.

Levitt, M. (1978). Conformational preferences of amino acids in globular proteins. Biochemistry, 17, 4277–4285.

McLachlan, A. D. (1979). Gene duplications in the structural evolution of chymotrypsin. J. Mol. Biol. 128, 49–79.

Nilges, M., Gronenborn, A. M., Brunger, A. T. & Clore, G. M. (1988). Determination of three-dimensional structures of proteins by simulated annealing with interproton distance restraints. Application to crambin, potato carboxypeptidase and barley serine protease inhibitor 2. Protein Eng. 2, 27–38.

Pauling, L. & Corey, R. B. (1951a). The pleated sheet, a new configuration of polypeptide chains. Proc. Natl Acad. Sci. USA, 37, 251–256.

Pauling, L. & Corey, R. B. (1951b). The structure of synthetic polypeptides. Proc. Natl Acad. Sci. USA, 37, 241–250.

Powell, M. J. D. (1977). Restart procedures for the Conjugate Gradient Method. Math. Progr. 12, 241–254.

Purisima, E. O. & Scheraga, H. A. (1986). An approach to the multiple-minima problem by relaxing dimensionality. Proc. Natl Acad. Sci. USA, 83, 2782–2786.

Rozsa, P. (1991). Linearis algebra es alkalmazasai. Tankonyvkiado, Budapest (in Hungarian).

Rykaert, J. P., Ciccotti, G. & Berendsen, H. J. C. (1977). Numerical integration of the Cartesian equations of motion of a system with constraints: molecular dynamics of n-alkanes. J. Comp. Phys. 23, 327–341.

Smith-Brown, M. J., Kominos, D. & Levy, R. M. (1993). Global folding of proteins using a limited number of distance constraints. Protein Eng. 6, 605–614.

Szebenyi, D. M. E. & Moffat, K. (1986). The refined structure of vitamin D-dependent calcium-binding protein from bovine intestine. Molecular details, ion binding and implications for the structure of other calcium-binding proteins. J. Biol. Chem. 162, 8761–8777.

Taylor, W. R. (1988). A flexible method to align large numbers of biological sequences. J. Mol. Evol. 28, 161–169.

Taylor, W. R. (1991). Towards protein tertiary fold prediction using distance and motif constraints. Protein Eng. 4, 853–870.

Taylor, W. R. (1993). Protein fold refinement: building models from idealised folds using motif constraints and multiple sequence data. Protein Eng. 6, 593–604.

van Gunsteren, W. F. & Berendsen, H. J. C. (1977). Algorithms for macromolecular dynamics and constrained dynamics. Mol. Phys. 74, 1311–1327.

Wako, H. & Scheraga, H. A. (1982). Distance-constraint approach to protein folding. I. Statistical analysis of protein conformations in terms of distances between residues. J. Protein Chem. 1, 5–45.

Edited by F. Cohen

(Received 19 December 1994; accepted in revised form 23 May 1995)