
J. Mol. Biol. (1987) 195, 957-961

Prediction of Protein Secondary Structure and Active Sites using the Alignment of Homologous Sequences

The prediction of protein secondary structure (α-helices, β-sheets and coil) is improved by 9% to 66% using the information available from a family of homologous sequences. The approach is based both on averaging the Garnier et al. (1978) secondary structure propensities for aligned residues and on the observation that insertions and high sequence variability tend to occur in loop regions between secondary structures. Accordingly, an algorithm first aligns a family of sequences and a value for the extent of sequence conservation at each position is obtained. This value modifies a Garnier et al. prediction on the averaged sequence to yield the improved prediction. In addition, from the sequence conservation and the predicted secondary structure, many active site regions of enzymes can be located (26 out of 43) with limited over-prediction (8 extra). The entire algorithm is fully automatic and is applicable to all structural classes of globular proteins.

More than 3700 protein sequences are known (e.g. see Barker et al., 1986) and this wealth of biological information has highlighted the need for accurate and automatic methods to predict protein conformation and function from primary structure (for reviews, see Sternberg, 1986; Taylor, 1986a). However, Kabsch & Sander (1983) have recently evaluated the accuracies of three widely used and general methods of predicting secondary structure. They considered more structures than the data set on which the algorithms were developed and found that the accuracies for a three-state prediction (α-helix, β-sheet and coil) were 56%, Robson and co-workers (Garnier et al., 1978); 50%, Chou & Fasman (1978); and 59%, Lim (1974).

Since these three methods were developed in the 1970s, there have been several new algorithms reported. Taylor & Thornton (1983, 1984) started with Robson's approach as it is both probabilistic and simple to program (see below) and improved it by an average of 7.5% when applied to the α/β class of proteins. A length-dependent template for the β-strand/α-helix/β-strand motif modified the likelihood of α and β structure. Another recent approach by Cohen et al. (1983, 1986) is based on pattern recognition of specific residue types and has been applied to the α/β class of proteins and to predict turns in all structural classes. However, the work of both Taylor & Thornton (1983, 1984) and Cohen et al. (1983, 1986) requires the assignment of structural class (Levitt & Chothia, 1976) and this still cannot be reliably obtained.

With the increase in number of known protein structures (Bernstein et al., 1977) several groups (Sweet, 1986; Nishikawa & Ooi, 1986; Levin et al., 1986) have recently explored a different approach. These methods are based on recognizing a sequence relationship between a segment of the polypeptide chain of the unknown structure and a data base of sequences and conformations from the known structures. The accuracies for a three-state prediction range between 59% and 63%.

Today when one wishes to predict the secondary structure of a protein, one often has sequences from a family of homologous molecules. This provides additional information that needs to be incorporated into predictions. In this paper an algorithm is presented that uses the observation that when sequences are aligned the regions of insertions and low sequence conservation tend to occur in the loop regions connecting the regular secondary structures. This aspect is quantified and modifies the standard algorithm of Garnier et al. (1978) to yield an improved prediction. In addition, residues central to the activity of an enzyme are conserved in the family of homologous molecules. From the aligned sequences and predicted secondary structure, an algorithm is developed to identify potential functionally important residues (FIRs)†.

† Abbreviation used: FIRs, functionally important residues.

© 1987 Academic Press Inc. (London) Ltd.

The first step of the algorithm is to obtain an alignment of all the sequences. A standard method to align two protein (or nucleic acid) sequences is the dynamic programming approach of Needleman & Wunsch (1970). A matrix of similarity (identity, chemical property or observed substitution) between pairs of amino acids is chosen and the algorithm establishes an alignment of the two sequences, including insertions, that yields the best score. A previous study (Barton & Sternberg, 1986) has evaluated the accuracy of sequence alignments on the basis of a benchmark obtained from structural superpositions of the two molecules. It was shown that if the score is greater than 6 standard deviations from the mean score for random sequences, then the residues in secondary structures generally are aligned to >75% accuracy. Thus features in the aligned sequences could be used to locate secondary structures. Although this algorithm could be generalized to align many sequences simultaneously, computer memory and time requirements are prohibitive (e.g. see Murata et al., 1985). Therefore, a simpler approach for multiple sequence alignment was developed.
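The pairwise step can be sketched as follows. This is a minimal illustration of global alignment by dynamic programming in its common textbook form with a linear gap penalty, not the authors' program; the function name `needleman_wunsch`, the `identity` scoring function and the gap penalty of -1.0 are assumptions made for the example.

```python
def identity(x, y):
    """Toy similarity matrix (assumption): 1.0 for identical residues, else 0.0."""
    return 1.0 if x == y else 0.0


def needleman_wunsch(a, b, score=identity, gap=-1.0):
    """Return (best score, aligned a, aligned b) for a global alignment."""
    n, m = len(a), len(b)
    # f[i][j] holds the best score for aligning a[:i] with b[:j].
    f = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        f[i][0] = i * gap
    for j in range(1, m + 1):
        f[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            f[i][j] = max(
                f[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # pair two residues
                f[i - 1][j] + gap,                            # gap in b
                f[i][j - 1] + gap,                            # gap in a
            )
    # Trace back from the corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and f[i][j] == f[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and (j == 0 or f[i][j] == f[i - 1][j] + gap):
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return f[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

In practice the `score` argument would be replaced by whichever similarity matrix is chosen (identity, chemical property or observed substitution).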

The method applies the Needleman & Wunsch (1970) algorithm at each stage. Firstly, sequence 2 is aligned with sequence 1, then sequence 3 is aligned with the alignment of 1 and 2 by using a similarity score obtained from the mean score at each position for 3 versus 2 and 3 versus 1. This procedure is then continued for sequences 4 to N. Once all N sequences have been aligned, sequence 1 is realigned to the 2 to N sequences; then 2 against 1, 3, 4, ..., N, etc. One further pass is required to produce an alignment that appears consistent by visual inspection. The algorithm, without any optimization, requires 90 minutes of VAX 11/750 c.p.u. time to align 60 sequences of about 110 residues. This alignment procedure was applied to 11 families of sequences (Table 1) obtained from the Protein Information Resource data bank (Barker et al., 1986), which were chosen so that one member of the family had a crystallographic secondary structure assignment (Bernstein et al., 1977).
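The mean score used when a new sequence is aligned against an existing alignment can be illustrated with a short sketch. The function name `column_score`, the `identity` similarity and the choice to let a gap in the existing alignment contribute 0.0 are assumptions for this example; the text does not state how gaps are treated at this step.

```python
def identity(x, y):
    """Toy similarity (assumption): 1.0 for identical residues, else 0.0."""
    return 1.0 if x == y else 0.0


def column_score(residue, column, score=identity, gap_char="-"):
    """Mean similarity of one residue against the residues already aligned
    at one position (e.g. residue of sequence 3 versus sequences 1 and 2)."""
    if not column:
        raise ValueError("column must contain at least one aligned residue")
    total = 0.0
    for other in column:
        # Assumption: a gap in the existing alignment contributes 0.0.
        total += 0.0 if other == gap_char else score(residue, other)
    return total / len(column)
```

A value of this kind would take the place of the single pairwise similarity inside the dynamic programming recursion.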

To obtain a benchmark against which improvements can be assessed, a standard Robson prediction (Garnier et al., 1978) was performed on the protein with known secondary structure. In outline, the Robson method evaluates the likelihood of residue i being in an α-helix by the addition of the empirical α-helix-forming propensities of 17 residues from i−8 to i+8. Similarly, the likelihoods of the residue being in a β-sheet, turn or coil are evaluated and the conformation state with the maximum likelihood is predicted for residue i.
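The window summation can be sketched as below. This is an outline only: the nested table `info[state][offset][residue]` and the values in `TOY_INFO` are hypothetical placeholders, not the published Garnier et al. (1978) parameters.

```python
STATES = ("helix", "sheet", "turn", "coil")
HALF_WINDOW = 8  # residues i-8 .. i+8, 17 in total


def predict_states(sequence, info):
    """Predict one state per residue by summing window contributions.

    info[state][offset][residue] is the contribution that a residue at
    position i+offset makes to `state` at position i (placeholder layout).
    """
    prediction = []
    for i in range(len(sequence)):
        totals = {}
        for state in STATES:
            total = 0.0
            for offset in range(-HALF_WINDOW, HALF_WINDOW + 1):
                j = i + offset
                if 0 <= j < len(sequence):  # skip positions beyond the chain ends
                    total += info[state][offset].get(sequence[j], 0.0)
            totals[state] = total
        # The state with the maximum summed likelihood is predicted.
        prediction.append(max(STATES, key=lambda s: totals[s]))
    return prediction


# Toy table (assumption): only the central residue contributes.
TOY_INFO = {s: {o: {} for o in range(-HALF_WINDOW, HALF_WINDOW + 1)} for s in STATES}
TOY_INFO["helix"][0]["A"] = 1.0
TOY_INFO["sheet"][0]["V"] = 1.0
TOY_INFO["coil"][0]["S"] = 1.0
```

With the toy table, `predict_states("AVS", TOY_INFO)` assigns helix, sheet and coil to the three residues in turn.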

One approach for incorporating the information from the family of sequences, suggested when the Robson algorithm was developed, is to average the α, β, coil and turn parameters for all the aligned residues at each position. In our algorithm each insertion is scored as 0.0, which is a neutral value as Robson propensities are both positive and negative numbers.
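A minimal sketch of this averaging, with each insertion scored as 0.0, is given below. The propensity values in `TOY_PROPENSITY` are invented for illustration and the simplified per-residue table is an assumption; the real parameters depend on window position.

```python
STATES = ("helix", "sheet", "turn", "coil")

# Placeholder propensities (assumption), not the Garnier et al. values.
TOY_PROPENSITY = {
    "A": {"helix": 100.0, "sheet": -50.0, "turn": -20.0, "coil": -30.0},
    "G": {"helix": -80.0, "sheet": -40.0, "turn": 60.0, "coil": 50.0},
}


def averaged_propensities(column, propensity=TOY_PROPENSITY, gap_char="-"):
    """Average each state's propensity over all sequences at one aligned position."""
    if not column:
        raise ValueError("column must contain at least one entry")
    totals = dict.fromkeys(STATES, 0.0)
    for residue in column:
        if residue == gap_char:
            continue  # an insertion is scored as 0.0 for every state
        for state in STATES:
            totals[state] += propensity[residue][state]
    # Gaps still count in the denominator, so they pull the mean towards 0.0.
    return {state: totals[state] / len(column) for state in STATES}
```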

There is, however, more information about secondary structure available from aligned sequences than that simply obtained by averaging the residue propensities. The crystal structures of protein families show that sequence insertion and sections of high sequence variability occur in loop regions between secondary structures (e.g. see Greer, 1981). In addition, residues involved in secondary structure packing tend to be hydrophobic (Lesk & Chothia, 1980). We have quantified the extent of sequence conservation at a position i along the chain by a "conservation number" C_i, which ranges from 0 to 1. C_i is based on a representation of the Venn diagram of the chemical properties of the amino acids (Taylor, 1986a,b). The aim is to have a high conservation number when similar chemical types of amino acids occur at a position, and a low value when there is high variability of residues or an insertion. Thus a conserved residue scored 1.0 and substitutions between residues with the same chemical properties 0.9. For chemically different amino acids 0.1 was subtracted from 0.9 every time one of the chemical properties was different.

Table 2 lists the amino acids and assigns a yes or no value to ten chemical properties. Gaps and unknown residues are modelled by assigning a yes value for each property. If an invariant residue occurs at a position then C_i is 1.0. When a set of residues occur at a position, each property is considered in turn and if one amino acid differs in

Table 1
Prediction of secondary structure

                                                            Average          Accuracy of prediction (%)
                                                  No. of    conservation,    Robson on   Robson on   Robson +
Reference protein                        Class    residues  C_av             reference   multiple    conservation

(1) Haemoglobin (β-chain)                (α/α)    146       0.4              66.0        64.2        71.2
(2) Cytochrome c                         (α/α)    103       0.4              52.4        57.2        59.2
(3) Myoglobin                            (α/α)    153       0.5              67.0        68.6        71.8
(4) Immunoglobulin Fab V                 (β/β)    117       0.2              60.6        67.5        68.3
(5) Kallikrein                           (β/β)    n.l.      0.5              54.3        60.3        63.3
(6) Lactate dehydrogenase                (α/β)    329       0.6              56.0        58.9        65.3
(7) Dihydrofolate reductase              (α/β)    162       0.2              53.0        52.0        n.l.
(8) Triose phosphate isomerase           (α/β)    247       0.7              61.1        66.3        69.6
(9) Phospholipase A2                     (α+β)    123       0.4              44.5        53.4        61.8
(10) Ribonuclease S (bovine)             (α+β)    125       0.6              64.5        66.9        73.4
(11) Lysozyme (human)                    (α+β)    130       0.8              53.1        59.2        68.4

Average                                                                      57.5        61.3        66.1

n.l., not legible in this copy. The column "No. seq. aligned" is also not legible and is omitted.

The Brookhaven files (Bernstein et al., 1977) corresponding to the protein numbers in the Table are: (1) 2MHB; (2) 3CYT; (3) 1MBN; (4) 3FAB; (5) 2PTN; (6) 4LDH; (7) 1DFR; (8) 1TIM; (9) 1BP2; (10) 1RNS; (11) 1LYZ. The structural class of the protein is indicated. The value of C_av and the number of aligned sequences provide some guide as to the extent of sequence variation, and inspection of the results shows that there is no direct relationship between these values and the improvement in secondary structure prediction. However, the proteins were chosen so that there was enough variation in sequence to provide additional information, but the sequences were homologous enough that they should have very similar secondary structures.

Letters to the Editor

Table 2
Properties of amino acid residues

Properties (columns): Hydrophobic, Positive, Negative, Polar, Charged, Small, Tiny, Aliphatic, Aromatic, Proline.

Amino acids (rows): Ile, Leu, Val, Cys, Ala, Gly, Met, Phe, Tyr, Trp, His, Lys, Arg, Glu, Gln, Asp, Asn, Ser, Thr, Pro, Asx, Glx, Gap, Unknown.

[The yes entries in the body of the table are not legible in this copy.]

Y, yes value.

the yes or no assignment of chemical property, a count (P) is incremented. The conservation number is calculated as:

$$C_i = 0.9 - 0.1 \times P.$$

If P = 10, then C_i is set to 0.0 rather than −0.1. Thus if only Ile and Leu occur at a position, then P = 0 and C_i = 0.9, reflecting the chemical similarity of the residues. If Ile and Glu occur, P = 5 and C_i = 0.4. The effect of a gap is to lower the value of C_i. Thus Ile and a gap yield a value for P of 7 and C_i = 0.2. If Ile, Glu and a gap occur then P = 10 and C_i = 0.0.
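The calculation of the conservation number can be sketched as below. The yes assignments in `TOY_TABLE` are placeholders, not the entries of Table 2: they are sized only so that the worked examples in the text are reproduced (three yes values for Ile and Leu, two for Glu, none shared). The function name and the one-letter codes are likewise assumptions.

```python
PROPERTIES = (
    "hydrophobic", "positive", "negative", "polar", "charged",
    "small", "tiny", "aliphatic", "aromatic", "proline",
)
GAP = "-"

# PLACEHOLDER yes-assignments (assumption), chosen to match the worked examples.
TOY_TABLE = {
    "I": {"hydrophobic", "aliphatic", "small"},
    "L": {"hydrophobic", "aliphatic", "small"},
    "E": {"negative", "charged"},
}


def conservation_number(column, table=TOY_TABLE):
    """Conservation number for the residues found at one aligned position."""
    distinct = set(column)
    if len(distinct) == 1 and GAP not in distinct:
        return 1.0  # invariant residue
    p = 0
    for prop in PROPERTIES:
        # A gap is modelled as "yes" for every property.
        answers = {(True if res == GAP else (prop in table[res])) for res in distinct}
        if len(answers) == 2:  # at least one residue differs in this property
            p += 1
    # 0.9 - 0.1 * P, floored at 0.0; integer arithmetic avoids rounding noise.
    return max(0, 9 - p) / 10
```

With these placeholder assignments, Ile with Leu gives 0.9, Ile with Glu gives 0.4, Ile with a gap gives 0.2, and Ile, Glu and a gap together give 0.0, matching the examples above.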

After C_i is calculated for each position, C_i is averaged over three residues (i−1, i, i+1), to yield a "smoothed conservation number" CS_i. When CS_i is plotted along the sequence (Fig. 1) and the regular secondary structures marked on the plot, the lowest values of CS_i occur most frequently in the loop regions. Thus this plot alone is helpful in suggesting the locations of secondary structures.

These observations can be quantified to improve secondary structure prediction. The aim is to penalize the prediction of secondary structure where there is a low conservation number. First, the average conservation number over the entire sequence (C_av) is obtained. The average conservation number is subtracted from the smoothed conservation number and this value is multiplied by a constant A (i.e. [CS_i − C_av] × A). This value is added to the averaged conformational parameters for α-helices and β-sheets in the aligned sequences, so that positions less conserved than average receive a negative contribution. The constant A is included to highlight differences in conservation. Optimum values were found to be A = 150 for C_av < 0.55 and A = 250 for C_av > 0.55.
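The smoothing and the conservation term can be sketched together. The handling of the chain ends (averaging over the neighbours that exist), taking C_av as the mean of the unsmoothed values, and using A = 250 when C_av equals 0.55 exactly are assumptions; the text does not specify them.

```python
def smooth(c):
    """Smoothed conservation: mean of C at i-1, i and i+1.

    Assumption: at the chain ends only the neighbours that exist are averaged.
    """
    smoothed = []
    for i in range(len(c)):
        window = c[max(0, i - 1): i + 2]
        smoothed.append(sum(window) / len(window))
    return smoothed


def scaling_constant(c_av):
    """A = 150 for C_av < 0.55, otherwise 250 (the boundary case is an assumption)."""
    return 150.0 if c_av < 0.55 else 250.0


def conservation_terms(c):
    """Value (CS_i - C_av) * A to add to the helix and sheet parameters at each position."""
    if not c:
        return []
    c_av = sum(c) / len(c)
    a = scaling_constant(c_av)
    return [(cs - c_av) * a for cs in smooth(c)]
```

Positions more conserved than average receive a positive term, which promotes helix and sheet, and poorly conserved positions receive a negative term.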

Thus in the more conserved sequences, gaps and major variation in chemical type of residue lead to a greater penalty in secondary structure prediction. In addition, a gap was assigned the coil and turn propensities of 0.0, and a helix and sheet propensity as the average of the Gly value between i−3 and i+3. Thus the helix and sheet-breaking character of Gly is used to penalize further the prediction of secondary structures where there are gaps introduced. Trials showed that a 2% improvement occurred when a gap was scored as a Gly rather than simply a propensity of 0.0.

Figure 1. Conservation plots showing predicted and X-ray secondary structure, and active segments. The smoothed conservation number (CS_i) is plotted along the sequence (horizontal axis: alignment number). The large arrow denotes the average conservation number for the entire sequence. P and X denote predicted and X-ray secondary structures. The arrowheads denote predicted and X-ray active segments.

The conservation plot and the predicted secondary structure can be used to locate the functionally important regions of enzymes. These include residues directly involved in catalysis as well as binding sites. FIRs may consist of one or more sequential residues, and within a family of enzymes are frequently invariant. The definitions are taken from the Brookhaven Data Bank SITE record (Bernstein et al., 1977), or if not available, from the literature. The algorithm developed is: (1) an active segment is predicted within a five-residue segment if n or more residues are invariant, where n = 4 if C_av > 0.5, otherwise n = 3; (2) an active segment is predicted if in a predicted loop region there is one invariant residue at position i (i.e. C_i = 1.0) and the smoothed conservation obeys the rules CS_i > 1.50 × C_av and CS_i > 0.7. The first part of the algorithm quantifies the extent of sequence invariance (n) judged against the background of the overall similarity of the aligned sequences (C_av). The second part uses the principle that invariant residues in loop regions are more suggestive of FIRs than invariant residues within the secondary structure core. The algorithm is at present based on general principles of FIRs, and a general analysis of the location of FIRs in proteins is in progress, which will then be used to refine the algorithm.
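The two rules for active segments can be sketched as follows. The function name, the argument layout and the choice to flag every position of a qualifying five-residue window are assumptions made for illustration; the text does not define how overlapping windows are merged into segments.

```python
def predict_active_positions(c, cs, c_av, in_loop):
    """Return sorted alignment positions flagged as possible FIRs.

    c       : conservation number C_i per position
    cs      : smoothed conservation number CS_i per position
    c_av    : average conservation number over the sequence
    in_loop : True where the predicted secondary structure is a loop
    """
    n_required = 4 if c_av > 0.5 else 3
    invariant = [ci == 1.0 for ci in c]
    flagged = set()

    # Rule 1: n or more invariant residues within a five-residue segment.
    for start in range(len(c) - 4):
        window = range(start, start + 5)
        if sum(invariant[i] for i in window) >= n_required:
            flagged.update(window)  # assumption: flag the whole window

    # Rule 2: an invariant residue in a predicted loop with high smoothed conservation.
    for i in range(len(c)):
        if in_loop[i] and invariant[i] and cs[i] > 1.5 * c_av and cs[i] > 0.7:
            flagged.add(i)

    return sorted(flagged)
```

Rule 1 adapts the required number of identities to the overall similarity of the family, while rule 2 needs the secondary structure prediction as an input.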

Table 1 gives the accuracies of the predictions of secondary structure of the 11 proteins that were considered, as they had both a known secondary structure and more than four homologous sequences. When the Robson algorithm was applied to just the single sequence, the average accuracy of a three-state prediction (coil and turn being considered as one state) was 57% (Table 1). The alignment of the sequences and the use of an averaged propensity for all residues at an aligned position yielded an average improvement of 4% to 61%.
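The three-state accuracy quoted here is a per-residue percentage with coil and turn merged. A minimal sketch, assuming one-letter state codes H (helix), E (sheet), T (turn) and C (coil) that are not taken from the paper:

```python
def three_state_accuracy(predicted, observed):
    """Percentage of residues whose predicted state matches the observed state,
    with turn (T) and coil (C) treated as a single state."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("state strings must be non-empty and of equal length")

    def collapse(state):
        return "C" if state == "T" else state

    matches = sum(collapse(p) == collapse(o) for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)
```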

Figure 1 illustrates the conservation plot and the secondary structure prediction for three proteins. The minima of the conservation plot generally occur in the loop regions between the secondary structures and thus provide information to help in structure prediction. When the averaged Robson prediction is modified by the conservation plot, the accuracy increases to 66%. This represents a 9% improvement on the prediction on one sequence (Table 1, Fig. 1) and a 5% improvement from the averaged prediction. For each of the 11 proteins, which cover all the structural classes, there is an improvement in prediction. It is important that there is increased accuracy for the α+β proteins, which are not amenable to improvement by a template approach. Inspection of the results shows that, as intended, the improvements occur both by the promotion of α and β structure in the regions of high sequence conservation and by penalizing the prediction of these regular secondary structures in sections of low sequence conservation.

Table 3 gives the results of the prediction of theFIRs. The algorithm located 26 out of the 43 FIRs inthe seven enzyme families. In addition, only eightsegments were overpredicted and of these fiveincluded a Cys in a conserved disulphide bridge.The prediction of the FIRs is also illustrated inFigure l. Table 3 also shows the results ofpredicting FIRs with cruder algorithms. Simplyconsidering a single identity leads to a high level ofoverprediction (136 extra FIRs). In contrast, thesearch for three consecutive identities locates farfewer FIRs than the proposed algorithm.

Plots of sequence conservation or variability have been presented previously. One major application is the work of Kabat (1976), who plotted sequence variability for the immunoglobulin domains and highlighted the variable loops that form the antigen binding regions. However, this paper presents the use of such plots for general secondary structure prediction and formalizes concepts that were previously incorporated by hand in an unsystematic manner. Further work is required on many aspects of the algorithm. The best method for quantifying sequence variation, including gaps, needs to be determined. An alternative simple approach instead of using the Venn diagram would be to use an average value of the sequence alignment score (Barker et al., 1986). However, an analysis of the types of residue substitutions and

[Figure 1 panels: Phospholipase, Ribonuclease, Lysozyme; horizontal axis: alignment number.]

Table 3
Prediction of functionally important residues

                              No. of FIR   1 identity          3 consecutive       Algorithm
Enzyme                        segments     located   extra     located   extra     located   extra   Cys in extra

Kallikrein                    n.l.         n.l.      n.l.      n.l.      3         2         3       2
Lactate dehydrogenase         n.l.         n.l.      n.l.      n.l.      1         3         1       0
Triose phosphate isomerase    n.l.         n.l.      n.l.      n.l.      1         5         1       0
Dihydrofolate reductase       n.l.         n.l.      n.l.      n.l.      0         5         0       0
Phospholipase A2              n.l.         n.l.      n.l.      n.l.      0         4         0       0
Ribonuclease S                n.l.         n.l.      n.l.      n.l.      1         4         2       2
Lysozyme                      n.l.         n.l.      n.l.      n.l.      2         3         1       1

Total                         43           38        136       18        8         26        8       5

n.l., not legible in this copy.

The results of predicting FIRs with different approaches are given: 1 identity refers to simply locating 1 invariant residue along the sequence; 3 consecutive denotes a search for 3 consecutive identities. The results of the proposed algorithm are then given.

gap insertions that occur in the regular secondary structures and in their connections would yield empirical scoring schemes for the conservation value and scaling factor (A) used in this algorithm. This analysis should quantify the level of sequence similarity within the family and relate this to the expected improvement in prediction. If the sequences are too similar then there is little additional information, but if there is widespread sequence variation then the assumption of identical secondary structure no longer holds. Similarly, an analysis of the location of active site residues in proteins would improve the prediction of FIRs.

The aim of this work was to develop an approach for predicting secondary structure and FIRs that is automatic and without any subjective judgement. Furthermore, with their probabilistic nature, the algorithms are flexible enough to cope with some errors in sequence or alignment. The 9% improvement in secondary structure prediction for all structural classes will provide a basis for further improvements incorporating feedback of structural class to modify cut-off parameters (Garnier et al., 1978) and to select suitable templates (Taylor & Thornton, 1983, 1984). The identification of possible FIRs can be used to locate target residues for site-specific mutagenesis.

We thank the SERC for support and Professor T. L.Blundell and Dr Janet Thornton for helpful discussion.

Markéta J. Zvelebil
Geoffrey J. Barton
William R. Taylor
Michael J. E. Sternberg

Laboratory of Molecular Biology
Department of Crystallography
Birkbeck College
Malet Street
London WC1E 7HX, U.K.

Received 10 December 1986

References

Barker, W. C., Hunt, L. T., Orcutt, B. C., George, D. G., Yeh, L. S., Chen, H. R., Blomquist, M. C., Johnson, G. C., Siebel-Ross, E. I. & Ledley, R. S. (1986). Protein Information Resource, National Biomedical Research Foundation, Georgetown University, Washington D.C.

Barton, G. J. & Sternberg, M. J. E. (1987). Protein Engineering, 1, 89-94.

Bernstein, F. C., Koetzle, T., William, G. J. B., Meyer, E., Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977). J. Mol. Biol. 112, 535-542.

Chou, P. Y. & Fasman, G. D. (1978). Advan. Enzymol. 47, 45-148.

Cohen, F. E., Abarbanel, R. M., Kuntz, I. D. & Fletterick, R. J. (1983). Biochemistry, 22, 4894-4904.

Cohen, F. E., Abarbanel, R. M., Kuntz, I. D. & Fletterick, R. J. (1986). Biochemistry, 25, 266-275.

Garnier, J., Osguthorpe, D. J. & Robson, B. (1978). J. Mol. Biol. 120, 97-120.

Greer, J. (1981). J. Mol. Biol. 153, 1027-1042.

Kabat, E. A. (1976). Structural Concepts in Immunology and Immunochemistry, 2nd edit., Holt, Rinehart & Winston, New York.

Kabsch, W. & Sander, C. (1983). FEBS Letters, 155, 179-182.

Lesk, A. M. & Chothia, C. (1980). J. Mol. Biol. 136, 225-270.

Levin, J. M., Robson, B. & Garnier, J. (1986). FEBS Letters, 205, 303-308.

Levitt, M. & Chothia, C. (1976). Nature (London), 261, 552-558.

Lim, V. I. (1974). J. Mol. Biol. 88, 873-894.

Murata, M., Richardson, J. S. & Sussman, J. L. (1985). Proc. Nat. Acad. Sci., U.S.A. 82, 3073-3077.

Needleman, S. B. & Wunsch, C. D. (1970). J. Mol. Biol. 48, 443-453.

Nishikawa, K. & Ooi, T. (1986). Biochim. Biophys. Acta, 871, 45-54.

Sternberg, M. J. E. (1986). Anti-Cancer Drug Design, 1, 169-178.

Sweet, R. M. (1986). Biopolymers, 25, 1565-1577.

Taylor, W. R. (1986a). J. Theor. Biol. 119, 205-218.

Taylor, W. R. (1986b). J. Theor. Biol. 188, 233-258.

Taylor, W. R. (1987). In Nucleic Acid and Protein Sequence Analysis: A Practical Approach (Bishop, M. & Rawlings, C., eds), pp. 285-321, IRL Press, Oxford.

Taylor, W. R. & Thornton, J. M. (1983). Nature (London), 301, 540-542.

Taylor, W. R. & Thornton, J. M. (1984). J. Mol. Biol. 173, 487-514.

Edited by A. Klug
