Assessment of Protein Domain Classifications: SCOP, CATH, Dali and EVEREST

Elon Portugaly 1, Nathan Linial 1, Michal Linial 2∗

1School of Computer Science & Engineering. 2Dept. of Biological Chemistry, Inst. of Life Sciences.

The Hebrew University of Jerusalem, Israel

Abstract

Background: SCOP is a manual classification of protein domain structures. CATH is a classification of protein domain structures created through a combination of manual and automatic methods. The Dali Domain Classification (henceforth Dali) is an automatically generated classification of protein domain structures. EVEREST is an automatically generated classification of protein domains that uses sequence information alone. We present a systematic comparison of the aforementioned systems.

Methodology / Principal Findings: We focus on the proper classification of each domain, rather than on the exact determination of its boundaries. We show a tradeoff between the granularity of classification and the level of agreement among the classification systems: the coarser the granularity, the lower the agreement. SCOP and CATH generally agree with each other at fine and medium granularity levels of their hierarchies, but disagree at coarser levels. The agreement between SCOP and CATH exceeds the quality of matching between Dali and either one of them at each given granularity level. Furthermore, nearly all Dali families are of fine granularity. Although EVEREST uses no structural information, it agrees with SCOP to the same degree as does Dali. EVEREST’s agreement with CATH even exceeds that of Dali. The granularity of EVEREST families is generally between SCOP family level and the CATH S level.

Conclusions: The medium granularity levels form a twilight zone where SCOP and CATH agree, whereas the automatic methods do not match them. Beyond that twilight zone the disagreement between SCOP and CATH becomes so high that no global reference is available. We suggest that a reconsideration of the classifications at these levels is due. EVEREST, a sequence-only method, performs as well as or better than Dali, a structure-only method. All data files generated during this study are available at http://www.everest.cs.huji.ac.il/3d-assessment/.

1 Introduction

The Protein Data Bank (1) contains over 35,000 records documenting experimentally solved protein structures. The two most significant efforts to classify these data are SCOP (2) and CATH (3). Both systems first parse the protein record into domains, and then classify the domains into a shallow hierarchy of domain families. SCOP classification is created manually, while CATH is mostly automatic, but uses manual input in several key points of the classification process.

The Dali Domain Classification is an automatically created classification of protein structures into domain families (4). First each protein chain is decomposed to domains using the DomainParser2 program (5). Then the similarity between each pair of domains is calculated using the Dali structural comparison algorithm (6). Finally an average linkage clustering algorithm is applied to the calculated similarities to define a hierarchical classification of the domains that comprises 6 predefined levels. Below we refer to the Dali Domain Classification as Dali.

∗To whom correspondence should be addressed. email: [email protected]

EVEREST (EVolutionary Ensembles of REcurrent SegmenTs) is an automatic computational process that identifies protein domains and classifies them into families. The inputs to the process are a database of all protein sequences (here taken from Swiss-Prot 49.2 (7)), and a training set of known protein families (here a random subset of Pfam A release 19.0 (8)). The latter is used as a training set with which to exemplify to the system the notion of a domain family, but not to derive the characteristics of specific families. The output of the process is a set of HMMER HMMs, defining the EVEREST domain families (9). The EVEREST process begins by constructing a database of protein segments that emerge in an all vs. all pairwise sequence comparison. It then proceeds to cluster these segments, choosing the best clusters using machine learning techniques, and creating a statistical model for each of them. This procedure is then iterated: the aforementioned statistical models are used to scan all protein sequences, to recreate a database of segments and to cluster them again.

Several studies have attempted to provide a qualitative and quantitative description of the similarities and differences between SCOP, CATH and Dali. The Dali binary tree of hierarchical domain classification, prior to the final stage at which it gets split into its six predefined levels, was evaluated against SCOP superfamily classification in (10). Day et al. compare SCOP fold level, CATH topology level and Dali’s coarsest level. They mark the population of domain pairs that are co-classified by one system, and measure the fraction of this population that is also co-classified by the other system (11). We further discuss these studies and compare our results with theirs later in this manuscript. In a related research track, several groups have developed automatic algorithms for the prediction of SCOP or CATH classifications. In (12), Getz et al. compare SCOP, CATH and Dali for single-domain protein chains only, and describe a process for the automatic assignment of SCOP folds and CATH topologies from Dali structural comparison scores. Camoglu et al. developed a decision tree technique to combine the outputs of two sequence and three structure comparison tools and predict the SCOP classification of protein structures (13). The same task is tackled using machine vision techniques in (14).

Figure 1 gives the number of families at each level of each of the systems under consideration. Note that EVEREST contains more than 5 times as many domains as each of the other classifications. This is because EVEREST often finds several domains that cover overlapping regions. Usually such domains are contained in each other, a property that reflects a true biological relationship among the relevant families. For a discussion of relationships between EVEREST families, see (15).

The goal of this study is to quantify the extent of agreement and disagreement between the four domain classifications. We show that, generally, the manual methods, CATH and SCOP, agree on their fine and medium granularity families but disagree on the coarse granularity families. Both EVEREST and Dali are capable of reconstructing most of the fine granularity SCOP families. EVEREST, despite being based on sequence alone, performs as well as, or even better than, Dali, which is based on structure alone.

2 Results

Our aim was to quantify the extent of agreement and disagreement between the four domain classifications. Let us discuss three exemplary cases of possible relationships between two classifications. In case (i), the two classification systems perfectly agree on some family. For example, CATH architecture Trefoil (2.80) is in perfect agreement with SCOP fold beta-Trefoil (b.42): all 80 domains of CATH 2.80 match domains of SCOP b.42, and vice-versa. In case (ii), two systems agree on the classification, but use different granularities to define families, as in the classification of Globin domains by CATH and SCOP, illustrated in Figure 2. In this example the CATH Globins homologous-superfamily (1.10.490.10) is a union of three different SCOP families (a.1.1.1, a.1.1.2 and a.1.1.4), and the SCOP Globin-like superfamily (a.1.1) is a union of the CATH Globins (1.10.490.10) and Phycocyanins (1.10.490.20) homologous-superfamilies. Finally, case (iii), illustrated by Figure 3, shows that sometimes domains classified as one family by one system are classified into several unrelated families by another system. These three cases serve to illustrate the complex relationships of different systems.

In our analysis not all four classification systems were treated equally: SCOP and CATH were used as reference classifications for each other and for the other two systems. We considered four levels of hierarchy in SCOP: class, fold, superfamily and family, and five levels of hierarchy in CATH: class, architecture, topology, homologous-superfamily and S-level (see Figure 1). In each of our analyses either SCOP or CATH served as a reference. Each of the other classifications was compared with this reference classification. This was done by mapping each family of the tested classification to the reference classification. A tested family could map to a reference family at any level of the reference hierarchy. To accommodate differences in the granularity of classification, we also allowed mapping “in between” levels. This was carried out by allowing a matching between a tested family and a subset of sibling reference families. A quality score was assigned to each possible matching, with zero being the best quality, and values above one for matches between totally unrelated families. The best scoring match was defined to be the mapping of the tested family to the reference classification. Given a threshold on the quality of the matching (tolerance), it was also possible that the tested family failed to be mapped to the reference classification. Section 4.3 (Mapping a tested domain family onto a reference hierarchical classification of domains) describes in detail the mapping process and the quality assigned to the mapping. Returning to the three examples from the previous paragraph, in case (i), CATH 2.80 was mapped with no penalty (score 0) to SCOP b.42, and vice versa. Likewise, in case (ii) CATH 1.10.490.10 was mapped to the union of SCOP a.1.1.1, a.1.1.2 and a.1.1.4 with score 0, SCOP a.1.1 was mapped to the union of CATH 1.10.490.10 and 1.10.490.20 with score 0, and CATH 1.10.490 was mapped to SCOP a.1.1 with score 3/221 ≈ 0.01 (Figure 2). In case (iii), given any reasonable tolerance level, CATH 1.10.8 could not be mapped to SCOP at all, because the best scoring match for it would be to a subset of a SCOP family within the c.37 fold (c.37.1.20). This subset of family c.37.1.20 contains 13 of the 38 CATH 1.10.8 domains, leaving 25 as false positives, i.e. a score of 25/38 ≈ 0.66 (Figure 3).

2.1 Granularity of classification vs. agreement level

Figure 4 clearly reveals a tradeoff between the granularity of the families that are being tested (as manifested by the level to which they map) and the proportion of families that can be successfully mapped. It transpires that families of finer levels of classification are more likely to match a reference classification than families higher up in the classification hierarchy. Note how the tradeoff for the manual methods is better than the tradeoff for the automatic methods. Nevertheless, approximately 80% of the EVEREST families successfully map to SCOP and likewise approximately 80% of them successfully map to CATH. Furthermore, EVEREST, a system that is based on sequence data alone, performs as well as (with respect to SCOP) or even better than (with respect to CATH) Dali, a system that has access to structural information.

2.2 CATH and SCOP agree on the lower levels and disagree at higher levels

Figure 5 provides a closer look at the distribution of mapping levels. Observe first the mapping of CATH into SCOP and of SCOP into CATH. The lower levels of the two hierarchies map quite well onto each other. However, at CATH topology level the proportion of families that do not match any SCOP family is above 30%, and the figure only rises for SCOP folds and classes and for CATH architectures and classes (the bar above X in the histograms). At the lower levels, one can trace an interleaving pattern in which CATH S-level families are clearly sub-families of SCOP families, which in turn are usually somewhere between CATH S-level families and CATH homologous-superfamilies. CATH homologous-superfamilies map anywhere between SCOP family and superfamily levels (with 10% mapping below SCOP family level). The interleaving pattern fades above this level, as 45% of SCOP superfamilies map to CATH homologous-superfamilies, 24% of them do not map to CATH at all, and only 18% map above the CATH homologous-superfamily level. It seems possible that those SCOP superfamilies that should have mapped above the CATH homologous-superfamily level in terms of their granularity are the superfamilies that did not map at all.


2.3 Automatic sequence-based classification outperforms the automatic structure-based classification

Dali families generally match the low levels of the SCOP hierarchy. Ascending the Dali hierarchy does indeed correlate with an ascent of the SCOP hierarchy, though the pace is very slow. This is accompanied by a fast decay in the fraction of Dali families that can be successfully mapped onto the SCOP hierarchy. Setting 30% as a limit to the proportion of unmappable families we allow, we stop at Dali level 3, which generally maps below SCOP family level. Mapping Dali onto CATH yields quite a different picture. Throughout all levels of the Dali hierarchy, a large fraction (above 30%) of the Dali families do not agree with CATH. Considering that both Dali level 6 families and CATH S-level families map below SCOP families, we conclude that both Dali and CATH provide refinements of the SCOP families, but that Dali refinements do not match those of CATH. This is reasonable since all Dali classifications, including those at the finest level 6, are based on structure comparison, while CATH S-level is a sequence (PSI-BLAST (16)) based refinement of the CATH homologous-superfamily level.

Finally, EVEREST maps reasonably well to the low levels of both SCOP and CATH. In both mappings, about 20% of the EVEREST families fail to map (the bar above X in the histograms). The rest of the families are generally finer than SCOP families, but coarser than CATH S-level families. This becomes evident from a direct observation of the distributions of EVEREST mappings, as well as from a comparison between the EVEREST mappings into CATH and the SCOP family level mapping into SCOP.

2.4 Analysis of coverage of reference classification

What parts of the SCOP and CATH hierarchy are covered by the tested classifications? Figure 6 shows the proportion of each level of the reference hierarchy that is matched to families in each level of the tested hierarchies. Note, first, that all tested classifications except EVEREST achieve nearly 100% coverage of reference domains, and that EVEREST covers over 90% of both SCOP and CATH domains.

Observe, again, the tradeoff between covering the higher levels of the reference hierarchy and successfully mapping the general tested family. CATH covers about 50% of the SCOP families and superfamilies. An additional 10% of the SCOP families and superfamilies are covered by CATH as members of unions of families. Most of this coverage is achieved by CATH levels S and H, which contain a very small number of families that disagree with SCOP. CATH also covers about a quarter of the SCOP folds, and 40% more as members of unions of folds; however, the CATH levels that provide this coverage contain large fractions of families that disagree with SCOP. Thus CATH lower levels agree with SCOP, and cover most of its lower levels, but coverage of the higher levels of SCOP introduces many CATH families that disagree with SCOP.

About 80% of SCOP family and superfamily level families are successfully mapped to CATH, and they cover 15% of the CATH S-level families (48% of them are covered as members of unions) and 60% of the CATH H level families. Some CATH H level families are covered as members of unions by SCOP family and superfamily levels, but most of those covered this way, and all of the CATH T and A level families covered by SCOP, are covered by SCOP fold and class levels, which contain a large fraction of families that disagree with CATH. Again, the lower levels of SCOP agree with CATH, and cover a large fraction of its lower levels. The higher levels of SCOP do not agree with CATH, and even those higher level families that do agree with CATH do not cover many of its families.

EVEREST covers slightly more SCOP families than does Dali. While Dali covers more SCOP superfamilies and some SCOP folds, at the expense of a large number of families not matching SCOP at all, EVEREST stops at a relatively low error rate, covering only 10% of the SCOP superfamilies. About 30% of the CATH S-level families and 40% of CATH homologous-superfamilies are matched by EVEREST, and an additional 50% of the S-level families are covered as members of unions of sibling families. This is higher coverage, at lower error rates, than Dali’s.

An imperfect coverage of a reference hierarchical classification can range between two extreme scenarios, as illustrated by panels A and B in Figure 7. In the first scenario, illustrated in panel A, families of all parts of the protein world are identified by the tested classification, but every part is only identified at a single granularity level. Formally speaking, the covered families in this scenario form a root vs. leaves cut-set in the reference classification tree. In the second scenario, illustrated by panel B, some parts of the protein world are missed altogether, whereas in other parts families of several granularity levels are identified. The bar graphs in the figure are given in an attempt to see which of the two scenarios is more prevalent in real coverage data. Contrast the bar graphs of the first and second scenarios. In scenario A, since all lowest level families have a covered ancestor, the bars reach 100%. In scenario B only a half of the lowest level families have a covered ancestor, and all of those are covered already at the lowest level, therefore the height of all bars is 50%. In the first scenario, since no family is covered at more than one level, the blue segment of each bar starts where the blue segment of the bar to its right ends. In the second scenario every covered ancestor family is an ancestor of families that are already covered, therefore the blue segments all overlap in their y coordinates. Observe now the bar graphs describing real coverage data. EVEREST and Dali cover 63% and 64% of the SCOP world, respectively, and 86% and 78% of the CATH world, respectively. Almost all this coverage takes place at SCOP family level and at the CATH union of S families level. There is very little y-axis overlap between the blue segments for both EVEREST and Dali covering both SCOP and CATH. The bars in the lowest graphs of panels C and D reach 95% and 89% respectively, indicating that nearly all SCOP families are covered at some granularity level by CATH and vice versa. Furthermore, there is some y-axis overlap of the blue segments, indicating that a large fraction of the families are covered at more than one level.

The coverage of the SCOP and CATH worlds is uniform across all SCOP and CATH classes. We have divided SCOP families and CATH S-level families by their classes, and measured the fraction of the families within each class that have a covered ancestor (up to the levels of SCOP fold and CATH architecture). Table 1 provides this data for the major SCOP and CATH classes. While some variation is observed, the coverage of no class is remarkably different from that of other classes. Differential coverage by family size was also checked. No significant difference was observed (not shown).

2.5 A comparison with the literature

In (10) it is stated that 58% of SCOP superfamilies are perfectly reconstructed by Dali (strict monophyly in their terms). In Figure 6 we report that 27% (83 out of 309) of the SCOP superfamilies are matched by Dali families. This number includes imperfect matches (see section 4.3: Mapping a tested domain family onto a reference hierarchical classification of domains). If only perfect matches are considered, then only 66 (21%) of the SCOP superfamilies are found by Dali in our data. The disagreement between these two figures (58% vs. 21%) could stem from several differences in the setting. We believe that the most significant difference is that we only count SCOP superfamilies that contain at least two SCOP families, whereas Dietmann et al. count all non-singleton superfamilies. Indeed, when we count all SCOP superfamilies excluding singletons, we find that 42% are perfectly reconstructed by Dali. Thus, most of the difference between our coverage result and their coverage result is due to SCOP superfamilies that are much easier to reconstruct because they are in fact SCOP families. In addition, the two studies employ different versions of Dali and SCOP. Also, in (10) the full binary tree of putative domain families that is the output of the Dali agglomerative clustering procedure was examined. Here, in contrast, we examine only the final output of the Dali Domain Dictionary, namely the binary tree cut at 6 predefined levels. The whole tree provides more putative domain families, and thus more opportunities to reconstruct SCOP superfamilies.

In (11), Day et al. compare SCOP at the fold level, CATH at the topology level and Dali at the highest level. To quantify the level of agreement between two classifications, they enumerate pairs of domains that are co-classified (i.e. classified to the same fold / topology / highest level Dali family) in one system, and check what fraction of those pairs are also co-classified in the other system. We have repeated their analysis on our datasets, and have arrived at similar figures when comparing SCOP and CATH. There it is reported that 81% of the SCOP co-classified domain pairs are also co-classified by CATH, and that 35% of the CATH co-classified domain pairs are also co-classified by SCOP. Our calculations gave 86% and 40% respectively. However, we find that this measure has a severe drawback: it is extremely sensitive to the sizes of families, because the number of pairs grows quadratically with the size of the base set. Family sizes in the databases, in turn, depend on several extraneous and irrelevant factors such as database redundancy and redundancy removal processes. For example, the largest CATH topology is 3.40.50 (Rossmann fold). In our non-redundant database it contains 1,047 domains, and when matched to the SCOP hierarchy, it matches a union of 55 (out of 116) SCOP folds of class c (Alpha and beta proteins). On the other hand, no SCOP family, at any level, is matched to CATH 3.40.50. To comprehend the extent to which this measurement is biased towards large families, we have repeated our calculations, excluding CATH 3.40.50. The fraction of CATH co-classified pairs, excluding 3.40.50, that were co-classified by SCOP fold level is 62%. Thus the exclusion of a single topology increases the success rate from 40% to 62%.
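The sensitivity of the pair-counting measure to family size can be illustrated with a minimal Python sketch. It is not the code used in this study or in (11); the function name, the toy labels and the toy family sizes are assumptions made for illustration only. The sketch counts the domain pairs co-classified by one system and the fraction of them that are also co-classified by the other.

```python
from collections import Counter
from math import comb


def coclassified_fraction(system_a, system_b):
    """Fraction of domain pairs co-classified in system_a that are also
    co-classified in system_b. Each argument maps domain id -> family label."""
    shared = system_a.keys() & system_b.keys()
    sizes_a = Counter(system_a[d] for d in shared)
    sizes_ab = Counter((system_a[d], system_b[d]) for d in shared)
    pairs_a = sum(comb(n, 2) for n in sizes_a.values())
    pairs_both = sum(comb(n, 2) for n in sizes_ab.values())
    return pairs_both / pairs_a if pairs_a else float("nan")


# Toy data (hypothetical): one large family of 10 domains that the second
# system splits in two, and five small families on which both systems agree.
system_a = {f"big{i}": "T_big" for i in range(10)}
system_b = {f"big{i}": f"F{i % 2}" for i in range(10)}
for k in range(5):
    for j in range(2):
        system_a[f"s{k}_{j}"] = f"T{k}"
        system_b[f"s{k}_{j}"] = f"G{k}"

with_big = coclassified_fraction(system_a, system_b)

# Repeat the calculation after excluding the single large family.
small_a = {d: f for d, f in system_a.items() if f != "T_big"}
without_big = coclassified_fraction(small_a, system_b)
print(with_big, without_big)
```

In this toy setting five of the six families are reproduced exactly, yet the single large family contributes 45 of the 50 co-classified pairs and therefore dominates the score.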

3 Discussion

This study is a careful assessment of state-of-the-art domain classification methods. There appears to be a division of the levels of structural classification into fine granularity (approximately up to SCOP family level), medium granularity (approximately up to SCOP superfamily level) and coarse granularity. EVEREST and Dali generally agree with SCOP and CATH at the fine granularity levels. CATH and SCOP agree with each other on the fine and medium granularity classifications; however, the coarse granularity levels contain many families on which CATH and SCOP disagree. Thus the medium granularity levels are a twilight zone, where the two manual methods agree, whereas the automatic methods are not able to match them. Beyond that twilight zone, from CATH topology level and up, the disagreement between CATH and SCOP becomes so high that in this range no global reference is available and a meaningful comparison is still feasible only for specific individual families. This state of affairs calls for a rethinking of the classifications at these levels.

Many families of any of the four systems map in between the SCOP and CATH levels. This is evidence that the current specific definitions of granularity levels are somewhat artificial, and that classification of structures should follow a more gradual hierarchy, a view that has been advocated by Yang and Honig in (17–19).

What may be our most encouraging finding is that present state-of-the-art sequence-only automatic classification methods are capable of recognizing fine granularity protein structural families. Moreover, EVEREST, our automatically generated classification of protein domains that is based on sequence data alone, fares as well as, or even better than, Dali, the state-of-the-art automatically generated classification of protein domains that is based on structural data. Dali is based only on the geometric information derived from the protein structure records. It is reasonable to expect that an automatic classification that employs both geometrical and sequence information would outperform both EVEREST and Dali.

Our domain matching rule requires that the intersection of a tested domain and a reference domain be at least 80% of either one of them. An alternative rule, that the intersection should cover a high fraction of both domains, is used in (11). This difference may appear rather technical and insignificant, but it is actually quite profound. The condition that the intersection (nearly) covers either one of the domains allows for matching between methods that define domains based on different criteria. For instance, this rule makes it possible to match an active site domain with the structural domain that includes it. The EVEREST process does not attempt to determine boundaries that match structural domains, and indeed only 14% of the EVEREST domains match SCOP domains when one requires that the intersection covers 80% of both domains. This is not necessarily a drawback of EVEREST, since there are many different useful definitions for protein domains aside from the structural one. Nevertheless, it is still an interesting challenge to develop a version of EVEREST that seeks to identify structural domain boundaries from sequence.

In the process of deriving the EVEREST families, a random set of Pfam families was used for the purpose of teaching the system what a domain family is. We plan to repeat the process of generating EVEREST families with a different training set such as SCOP superfamilies. It is quite conceivable that such a variation of the EVEREST process would yield EVEREST families of coarser granularity. EVEREST has been shown to reconstruct Pfam families with high coverage and accuracy (20). It would be interesting to see whether good performance with respect to Pfam can be retained while increasing granularity of EVEREST families.

Several key features of our analysis distinguish it from earlier published studies. First, we do not aim to provide a single number describing the agreement between two classifications. Any such scoring scheme is bound to overlook important issues regarding coverage versus accuracy and agreement along the granularity axes. Redundancy of data and non-uniformity of family size have been shown to significantly skew analyses of this type. In an attempt to overcome such biases, the use of non-redundant sets of sequences has become a standard practice. However, other forms of data redundancy should be considered as well. We were careful to consider families of any granularity level only if they constitute a merge of at least two families of the next finer level, and have shown that failing to do so leads to an overoptimistic estimation of agreement at a fixed granularity level, due to the abundance of families that are incorrectly considered coarser than they are. By assigning all families of a given granularity level equal weight in our analysis we have overcome the bias that results from variation in family size. Measures counting co-classification of pairs of objects are particularly sensitive to such bias, and should only be employed in conjunction with a carefully designed weighting scheme to counterbalance it.

4 Methods

4.1 Input Databases

PDB protein sequences were downloaded from the PDB server in February 2006 (1). SCOP domains were taken from ASTRAL release 1.69 (21). CATH release 2.6.0 was used. The Dali domain dictionary was downloaded from the DALI web-site in October 2006 (22). EVEREST domains were taken from the EVEREST release 2.0 scan of the PDB (20).

4.2 Generating a common ground for the SCOP, CATH, Dali and EVEREST domain sets

SCOP, CATH and Dali domains defined on PDB chains that did not appear in our version of PDB, and SCOP domains split across more than one chain, were discarded. To overcome minor inconsistencies in the sequences obtained from PDB, SCOP and CATH, every SCOP and CATH segment was aligned to the relevant PDB chain, using dynamic programming, with unit gap and mismatch penalties (alignments are global in the segment and local in the PDB chain). The best alignment was accepted, provided the total penalty was not greater than 10% of the score of the original segment when compared to itself (thresholds of 5% and 1% were also tested, with no substantial changes in the final results; see inconsistency threshold 5% in Figure 8). Thus the boundaries of SCOP and CATH segments were mapped to PDB sequences. Domains were discarded if one or more of their segments were discarded. Dali domain boundaries were accepted as presented in the Dali files. In Dali, every domain is assigned a representative domain, and the representative domains are classified. The Dali files do not distinguish between a single domain with multiple segments and multiple domains matched to the same representative domain on the same chain. Therefore, we generated two sets of files for Dali: in the first set, every segment was considered an independent domain; in the second, all segments defined on the same chain and matched to the same representative domain were considered one, multi-segmented, domain. Later on, when Dali families were matched to reference families, we picked the best scoring version for each family.
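The segment-to-chain alignment described above can be sketched in Python as follows. This is a minimal illustration and not the program used in this study. The function names are hypothetical, and the acceptance test assumes that the score of a segment compared to itself equals its length, which the text does not state explicitly.

```python
def align_segment_to_chain(segment, chain):
    """Align all of `segment` (global) to any stretch of `chain` (local).

    Gaps and mismatches each cost one unit. Returns (penalty, start, end)
    such that the segment is aligned to chain[start:end].
    """
    n = len(chain)
    prev_cost = [0] * (n + 1)          # an empty segment may start anywhere
    prev_start = list(range(n + 1))
    for i in range(1, len(segment) + 1):
        cur_cost = [i] + [0] * n       # segment prefix aligned to gaps only
        cur_start = [0] * (n + 1)
        for j in range(1, n + 1):
            mismatch = 0 if segment[i - 1] == chain[j - 1] else 1
            cur_cost[j], cur_start[j] = min(
                (prev_cost[j - 1] + mismatch, prev_start[j - 1]),  # (mis)match
                (prev_cost[j] + 1, prev_start[j]),       # segment residue vs gap
                (cur_cost[j - 1] + 1, cur_start[j - 1]),  # chain residue vs gap
            )
        prev_cost, prev_start = cur_cost, cur_start
    end = min(range(n + 1), key=lambda j: prev_cost[j])
    return prev_cost[end], prev_start[end], end


def is_acceptable(penalty, segment, max_fraction=0.10):
    """Assumption: the self-comparison score of a segment is its length."""
    return penalty <= max_fraction * len(segment)


# Toy sequences (hypothetical): an exact copy of part of the chain, and the
# same segment with a single substitution.
chain = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
exact = "IAKQRQISFVKSHFSRQLEE"
mutated = "IAKQRQISFVWSHFSRQLEE"
print(align_segment_to_chain(exact, chain))
print(align_segment_to_chain(mutated, chain))
```

Under the stated assumption, the segment with one substitution in 20 residues is accepted at the 10% threshold, and its boundaries on the chain are given by the returned start and end positions.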

All methods, with the exception of EVEREST, are assumed to be able to classify all input protein structures. A PDB chain can be missing for one of these methods only due to version inconsistencies, and not due to a shortcoming of the method. Therefore, any PDB chain that is missing from any of SCOP, CATH or Dali was discarded.

The PDB sequences downloaded are highly redundant. We have created a non-redundant subset of those sequences by allowing one sequence to represent another if a BLAST (16) alignment between them covers at least 95% of each one of them, and the alignment e-value is at most 1e−90. The final domain sets used in this study are those that are defined on this non-redundant set of sequences. The analysis was repeated on a sequence set composed of all unique sequences with no substantial changes in final results (no nrdb in Figure 8).

EVEREST domains were defined by scanning the non-redundant set of PDB sequences with the EVEREST family models, as described in (20).

The numbers of chains and domains retained in each stage of the mapping process are listed in Table 2.
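The redundancy rule described earlier in this section (an alignment covering at least 95% of each sequence, with e-value at most 1e−90) can be sketched in Python as follows. The text does not specify how representatives are chosen, so the greedy longest-first pass below is an assumption, as are the function name and the input layout (pre-parsed alignment records rather than raw BLAST output).

```python
def select_representatives(lengths, hits, min_cov=0.95, max_evalue=1e-90):
    """Pick a non-redundant subset of sequences.

    lengths: dict sequence id -> sequence length.
    hits: iterable of (id_a, id_b, covered_a, covered_b, evalue), where
          covered_x is the number of residues of x inside the alignment.
    Assumption: representatives are chosen greedily, longest sequence first.
    """
    similar = {seq: set() for seq in lengths}
    for a, b, covered_a, covered_b, evalue in hits:
        if (evalue <= max_evalue
                and covered_a >= min_cov * lengths[a]
                and covered_b >= min_cov * lengths[b]):
            similar[a].add(b)
            similar[b].add(a)

    representatives, represented = [], set()
    for seq in sorted(lengths, key=lambda s: (-lengths[s], s)):
        if seq in represented:
            continue
        representatives.append(seq)
        represented.add(seq)
        represented |= similar[seq]
    return representatives


# Toy data (hypothetical ids and values).
lengths = {"A": 100, "B": 98, "C": 100, "D": 50}
hits = [
    ("A", "B", 97, 97, 1e-120),   # passes both criteria: A represents B
    ("A", "C", 100, 100, 1e-50),  # e-value too high
    ("A", "D", 50, 50, 1e-100),   # covers only half of A
]
representatives = select_representatives(lengths, hits)
print(representatives)
```

Requiring coverage of both sequences keeps a short fragment (D in the toy data) from being represented by a longer chain that merely contains it.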

4.3 Mapping a tested domain family onto a reference hierarchical classification of domains

To map a tested domain family onto a reference classification of domains, we first project each tested domain onto the set of reference domains. Each tested domain is compared with all reference domains on the same chain. We note three quantities: the sum of the lengths of the segments of the tested domain (the tested domain length), the sum of the lengths of the segments of the reference domain (the reference domain length) and the sum of the lengths of the intersection of all segments of the tested domain with all segments of the reference domain (the intersection length). If the intersection length is at least 80% of either the tested domain length or the reference domain length, then we say that the tested domain has collected the reference domain. The projection of the tested family is defined as the set of reference domains collected by domains of the tested family. We define the penalty of matching the tested family with a specific reference family as the size of the symmetric difference between the projection of the tested family and the reference family. The penalty of matching the tested family with a specific union of (sibling) reference families is defined in the same manner. Finally, each tested family is mapped to the reference family, or union of sibling reference families, that provides the minimal penalty. To achieve common ground across different family sizes, the quality of the match is defined as the penalty divided by the size of the projection.
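The procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions and not the code used in this study: segments are inclusive (start, end) residue ranges that do not overlap within a domain, sibling families are disjoint, and all function names and toy data are hypothetical. Under the disjointness assumption the minimal-penalty union can be found without enumerating subsets, because a sibling lowers the penalty exactly when more of its members lie inside the projection than outside it; this shortcut is our illustration and is not taken from the original procedure. The sketch handles a single group of siblings; the full mapping would repeat this at every level of the reference hierarchy.

```python
def total_length(segments):
    """Sum of segment lengths; segments are inclusive (start, end) ranges."""
    return sum(end - start + 1 for start, end in segments)


def intersection_length(tested, reference):
    """Residues shared by any tested segment and any reference segment."""
    shared = 0
    for t_start, t_end in tested:
        for r_start, r_end in reference:
            shared += max(0, min(t_end, r_end) - max(t_start, r_start) + 1)
    return shared


def collects(tested, reference, min_fraction=0.8):
    """True if the intersection is at least 80% of either domain length."""
    shared = intersection_length(tested, reference)
    return (shared >= min_fraction * total_length(tested)
            or shared >= min_fraction * total_length(reference))


def match_quality(projection, reference_members):
    """Penalty (size of the symmetric difference) over projection size."""
    return len(projection ^ reference_members) / len(projection)


def best_sibling_union(projection, siblings):
    """Union of (disjoint) sibling families with minimal penalty."""
    chosen = {name for name, members in siblings.items()
              if len(members & projection) > len(members - projection)}
    union = set().union(*(siblings[name] for name in chosen))
    return chosen, union


# Toy data (hypothetical): a projection of ten reference domains and three
# sibling reference families.
projection = set(range(1, 11))
siblings = {
    "A": {1, 2, 3, 4, 5, 6},
    "B": {7, 8, 9, 20},
    "C": {10, 30, 31, 32},
}
chosen, union = best_sibling_union(projection, siblings)
quality = match_quality(projection, union)
print(sorted(chosen), quality)
```

In the toy data the tested family maps to the union of A and B with a penalty of 2 (one projected domain outside the union, one union member outside the projection), which just meets the tolerance of 0.2 used in this study.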

We also employed the following more stringent modification of the above procedure. Only tested domains that collect exactly one reference domain are allowed to contribute to the projection. Other tested domains (i.e. those that do not collect any reference domain, and those that collect more than one) are considered false positives with respect to all reference families. This analysis is labeled single match only in Figure 8.
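A possible reading of this stringent variant is sketched below in Python. The input format (a mapping from each tested domain to the set of reference domains it collects) and the function names are hypothetical. Charging one unit of penalty per excluded tested domain is an assumption: the text says only that such domains count as false positives with respect to every reference family.

```python
from typing import Dict, Set, Tuple


def single_match_projection(collected_by_domain: Dict[str, Set[str]]) -> Tuple[Set[str], int]:
    """Return the projection built only from tested domains that collect exactly
    one reference domain, plus the number of tested domains that were excluded."""
    proj: Set[str] = set()
    excluded = 0
    for refs in collected_by_domain.values():
        if len(refs) == 1:
            proj |= refs
        else:
            excluded += 1   # collects none, or more than one
    return proj, excluded


def stringent_penalty(collected_by_domain: Dict[str, Set[str]],
                      reference_family: Set[str]) -> int:
    """Symmetric difference plus one false positive per excluded tested domain
    (the unit cost per excluded domain is an assumption of this sketch)."""
    proj, excluded = single_match_projection(collected_by_domain)
    return len(proj ^ reference_family) + excluded
```

Because the excluded domains add the same amount to the penalty of every candidate reference family, they do not change which family is the best match; they only lower the quality of that match.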

It is quite common that a family would have only one child family (e.g. SCOP fold “S-adenosylmethionine synthetase”, which contains only one superfamily, of the same name). In such cases, we disregard the parent family. When the parent and child families are reference families, we would map tested families to the child family (even though both parent and child family would induce the same cost). When the parent and child families are test families, we do not map the parent family, and count it neither as a successful match nor as a missed match. According to the same rule, families that contain only one domain are not counted either.
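The rule for single-child families can be expressed as a simple filter over the hierarchy. The sketch below is illustrative only: the hierarchy is assumed to be given as a mapping from each family to its direct children (child families, or domains at the lowest level), and the identifiers are hypothetical placeholders rather than real SCOP entries.

```python
from typing import Dict, List


def counted_families(children: Dict[str, List[str]]) -> List[str]:
    """Keep only families with at least two direct children. A family with a
    single child family duplicates that child, and a family with a single
    domain is not counted either."""
    return [family for family, kids in children.items() if len(kids) >= 2]


# Hypothetical hierarchy: one fold with a single superfamily, which has two families.
example = {
    "fold_X": ["superfamily_X1"],               # single child: disregarded
    "superfamily_X1": ["family_a", "family_b"],
    "family_a": ["domain_1", "domain_2"],
    "family_b": ["domain_3"],                   # single domain: not counted
}
kept = counted_families(example)
```

Here `kept` contains `superfamily_X1` and `family_a`, which corresponds to the counts of non-trivial families given in parentheses in Figure 1.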

Given a threshold on the quality of the matching (tolerance), it is possible that the tested family cannot be mapped to the reference classification at all. We used a tolerance level of 0.2, i.e. a symmetric difference between tested and reference families of at most 20% of the size of the tested family.

For the purpose of repeating the domain-pair analysis of Day et al., we have used their domain matching rule: instead of requiring that the intersection cover at least 80% of either one of the two domains, it is required that the intersection cover at least 80% of each one of the two domains (this variation is labeled require coverage of both domains in Figure 8). We have also repeated the process with our domain matching rule, obtaining higher percentages, but qualitatively the same results.

Several hand-picked parameters were used throughout the analysis. To evaluate the sensitivity of our results to these parameters, we ran several variations of the analysis. Figure 8 shows the results of these variations in a format similar to that of Figure 4. In addition to the aforementioned variations, the following variations are shown: tolerance at 0.1 and tolerance at 0.3: consider a match successful at various levels of tolerance; uncorrected boundaries: boundaries of SCOP and CATH domains taken literally from files instead of being mapped to PDB sequence by sequence comparison; only unsplit domains: domains composed of more than one segment removed from analysis. While the proportion of successfully mapped families can vary, especially at low rates of success, the average levels to which families map remain quite stable. Furthermore, the general conclusions drawn throughout this paper remain valid under all variations.

Data files generated during this analysis are available for download at http://www.everest.cs.huji.ac.il/3d-assessment/.



5 Funding

E.P. is supported by the Sudarsky Center for Computational Biology. This work is partially funded by NoE (Framework VI) BioSapiens consortium.

References

[1] Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, et al. (2000) The Protein Data Bank. Nucleic Acids Res 28:235-42.

[2] Murzin AG, Brenner SE, Hubbard T, Chothia C (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247:536-540.

[3] Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, et al. (1997) CATH - a hierarchic classification of protein domain structures. Structure 5:1093-1108.

[4] Holm L, Sander C (1998) Dictionary of recurrent domains in protein structures. Proteins 33:88-96.

[5] Guo JT, Xu D, Kim D, Xu Y (2003) Improving the performance of DomainParser for structural domain partition using neural network. Nucleic Acids Res 31:944-952.

[6] Holm L, Sander C (1993) Protein structure comparison by alignment of distance matrices. J Mol Biol 233:123-138.

[7] Wu CH, Apweiler R, Bairoch A, Natale DA, Barker WC, et al. (2006) The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res 34:D187-91.

[8] Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, et al. (2002) The Pfam protein families database. Nucleic Acids Res 30:276-80.

[9] Eddy SR (2001). HMMER: Profile hidden Markov models for biological sequence analysis. [http://hmmer.wustl.edu/].

[10] Dietmann S, Holm L (2001) Identification of homology in protein structure classification. Nat Struct Biol 8:953-957.

[11] Day R, Beck DA, Armen RS, Daggett V (2003) A consensus view of fold space: combining SCOP, CATH, and the Dali Domain Dictionary. Protein Sci 12:2150-2160.

[12] Getz G, Vendruscolo M, Sachs D, Domany E (2002) Automated assignment of SCOP and CATH protein structure classifications from FSSP scores. Proteins 46:405-415.

[13] Camoglu O, Can T, Singh AK, Wang YF (2005) Decision tree based information integration for automated protein classification. J Bioinform Comput Biol 3:717-742.

[14] Chi PH, Shyu CR, Xu D (2006) A fast SCOP fold classification system using content-based E-Predict algorithm. BMC Bioinformatics 7:362.

[15] Portugaly E, Harel A, Linial N, Linial M (2006) EVEREST: automatic identification and classification of protein domains in all protein sequences. BMC Bioinformatics 7:277.

[16] Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389-402.

[17] Yang AS, Honig B (2000) An integrated approach to the analysis and modeling of protein sequences and structures. I. Protein structural alignment and a quantitative measure for protein structural distance. J Mol Biol 301:665-678.

[18] Yang AS, Honig B (2000) An integrated approach to the analysis and modeling of protein sequences and structures. II. On the relationship between sequence and structural similarity for proteins that are not obviously related in sequence. J Mol Biol 301:679-689.

[19] Yang AS, Honig B (2000) An integrated approach to the analysis and modeling of protein sequences and structures. III. A comparative study of sequence conservation in protein structural families using multiple structural alignments. J Mol Biol 301:691-711.

[20] Portugaly E, Harel A, Linial N, Linial M (2006) EVEREST: automatic identification and classification of protein domains in all protein sequences. BMC Bioinformatics 7:277.

[21] Chandonia JM, Walker NS, Lo Conte L, Koehl P, Levitt M, et al. (2002) ASTRAL compendium enhancements. Nucleic Acids Res 30:260-3.

[22] Dietmann S, Park J, Notredame C, Heger A, Lappe M, et al. (2001) A fully automatic evolutionary classification of protein folds: Dali Domain Dictionary version 3. Nucleic Acids Res 29:55-57.



Tables

Table 1 - Coverage of lowest level families by their class

Class         Size in families   EVEREST (%)   SCOP (%)   CATH (%)   Dali (%)
SCOP α        252                54            -          88         62
SCOP β        252                63            -          87         63
SCOP α/β      292                67            -          89         75
SCOP α + β    296                63            -          94         63
CATH α        410                83            88         -          73
CATH β        506                88            98         -          81
CATH αβ       831                87            83         -          80

Table 2 - Database sizes along the pre-processing stage

Stage           PDB          SCOP          CATH          Dali          EVEREST
                (# chains)   (# domains)   (# domains)   (# domains)   (# domains)
Downloaded      74,345       67,210        67,054        83,608        -
Mapped to PDB   -            65,957        61,922        74,863        -
Common Chains   37,946       48,396        54,588        57,828        -
Unique Chains   12,293       15,802        17,619        19,062        174,281
Non-redundant   8,081        9,670         10,379        11,076        56,436



[Figure 1 panels]
A (SCOP): Domains: 9,670; Classes: 7 (7); Folds: 668 (102); Superfamilies: 1,093 (309); Families: 1,987 (1,204)
B (CATH): Domains: 10,379; Classes: 4 (3); Architectures: 39 (23); Topologies: 834 (125); H-Superfamilies: 1,436 (722); S: 5,405 (1,785)
C (Dali): Domains: 10,699; Level 1: 1,845 (419); Level 2: 2,478 (621); Level 3: 3,510 (796); Level 4: 4,920 (712); Level 5: 6,240 (157); Level 6: 6,462 (1,585)
D (EVEREST): Domains: 56,436; Families: 10,034 (6,689)

Figure 1: Schematic representation of the SCOP, CATH, Dali and EVEREST classifications. Number of families at each level of SCOP (A), CATH (B), Dali (C) and EVEREST (D), after pre-processing as described under section 4.2: Generating a common ground for the SCOP, CATH, Dali and EVEREST domain sets. In parentheses, number of families with at least two child-families, i.e. families that are a non-trivial combination of families from a lower level. Note that the scale for panel D is different from the scale for the other panels. See discussion at the end of section 1: Introduction regarding the larger number of EVEREST domains.



[Figure 2 diagram labels]
SCOP: a.1.1 Globin-like (221); a.1.1.1 Truncated hemoglobin (6); a.1.1.2 Globins (187); a.1.1.3 Phycocyanin-like phycobilisome proteins (26); a.1.1.4 Nerve tissue mini-hemoglobin (neural globin) (2); f.1.1.1 Colicin (2); f.1.2.1 Diphtheria toxin, middle domain (1)
CATH: 1.10.490 Globin-like; 1.10.490.10 Globins; 1.10.490.10.9; 1.10.490.20 Phycocyanins; 1.10.490.30 Colicin; 1.10.490.40 Diphtheria toxin, middle domain

Figure 2: SCOP and CATH families of Globin-like domains. SCOP families are given in red, CATH families in blue. Striped lines surround families that appear both in SCOP and in CATH. Number of SCOP domains in the family is given in parentheses. Note how CATH homologous-superfamily 1.10.490.10 includes all and only domains from SCOP families a.1.1.2, a.1.1.1 and a.1.1.4. When mapping CATH 1.10.490.10 onto the SCOP hierarchy, it will be matched without penalty to the union of a.1.1.2, a.1.1.1 and a.1.1.4, placing it in between family and superfamily levels. Likewise, SCOP a.1.1 matches the union of CATH 1.10.490.10 and 1.10.490.20 without penalty, placing it between homologous-superfamily and topology levels. On the other hand, CATH 1.10.490 cannot be mapped to the union of SCOP a.1.1 with f.1.2.1 and f.1.1.1 because these SCOP superfamilies are not siblings - they do not share the same fold. Therefore, CATH 1.10.490 will be mapped to SCOP a.1.1 with a penalty of 3, which comes from considering the 3 domains of f.1.2.1 and f.1.1.1 as false positives; normalization would lead to a score of 3/221 ≈ 0.01.
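The two mappings discussed in the caption of Figure 2 can be reproduced with a small Python sketch. Only the family sizes are taken from the figure; the domain identifiers are placeholders, and `best_sibling_union` is a hypothetical helper. The paper does not state how the minimal-penalty union is found. The sketch relies on the assumption that sibling families are disjoint, so each family can be considered on its own: adding it to the union lowers the symmetric difference exactly when more than half of its domains lie in the projection.

```python
from typing import Dict, List, Set, Tuple


def make_family(name: str, size: int) -> Set[str]:
    """Create `size` placeholder domain identifiers for one SCOP family."""
    return {f"{name}#{i}" for i in range(size)}


# Family sizes as given in Figure 2; the domain identifiers are placeholders.
SCOP_FAMILIES: Dict[str, Set[str]] = {
    "a.1.1.1": make_family("a.1.1.1", 6),
    "a.1.1.2": make_family("a.1.1.2", 187),
    "a.1.1.3": make_family("a.1.1.3", 26),
    "a.1.1.4": make_family("a.1.1.4", 2),
    "f.1.1.1": make_family("f.1.1.1", 2),
    "f.1.2.1": make_family("f.1.2.1", 1),
}


def best_sibling_union(projection: Set[str],
                       siblings: Dict[str, Set[str]]) -> Tuple[List[str], int]:
    """Pick the union of sibling families with the smallest symmetric difference
    to the projection. Adding family F changes the penalty by
    |F - projection| - |F & projection|, independently of the other families."""
    change = {name: len(fam - projection) - len(fam & projection)
              for name, fam in siblings.items()}
    chosen = sorted(name for name, delta in change.items() if delta < 0)
    if not chosen:                      # the union must contain at least one family
        chosen = [min(change, key=change.get)]
    union: Set[str] = set().union(*(siblings[name] for name in chosen))
    return chosen, len(projection ^ union)


globin_siblings = {n: f for n, f in SCOP_FAMILIES.items() if n.startswith("a.1.1.")}

# CATH 1.10.490.10 holds all and only the domains of a.1.1.1, a.1.1.2 and a.1.1.4.
proj_h = SCOP_FAMILIES["a.1.1.1"] | SCOP_FAMILIES["a.1.1.2"] | SCOP_FAMILIES["a.1.1.4"]
families_h, penalty_h = best_sibling_union(proj_h, globin_siblings)

# CATH 1.10.490 holds all domains shown in the figure; compare it with superfamily a.1.1.
proj_t = set().union(*SCOP_FAMILIES.values())
a_1_1 = set().union(*globin_siblings.values())
penalty_t = len(proj_t ^ a_1_1)
```

The first mapping selects a.1.1.1, a.1.1.2 and a.1.1.4 with penalty 0, and the second yields penalty 3. Dividing 3 by the projection size (224 domains in this toy set) or by the size of a.1.1 (221, as in the caption) gives about 0.013 in both cases, which rounds to 0.01.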



Figure 3: CATH topology Helicase, Ruva Protein; domain 3, and intersecting SCOP folds. CATH topology 1.10.8: Helicase, Ruva Protein; domain 3, containing 38 domains, is represented by the blue circle. Each SCOP fold containing at least one of those 38 domains is represented by a red circular sector. The area of the intersection of each red sector and the blue circle is proportional to the number of CATH 1.10.8 domains that are classified by SCOP to the fold, while the total area of the sector is proportional to the total number of domains in the fold. Fold indices are given in red, the number of domains in the intersection of CATH 1.10.8 and the fold is given in blue, and the number of domains in the fold that do not belong to CATH 1.10.8 is given in green. The SCOP folds are: c.37: P-loop containing nucleoside triphosphate hydrolases; a.97: An anticodon-binding domain of class I aminoacyl-tRNA synthetases; a.156: S13-like H2TH domain; a.120: gene 59 helicase assembly protein; d.14: Ribosomal protein S5 domain 2-like; f.23: Single transmembrane helix; c.66: S-adenosyl-L-methionine-dependent methyltransferases; a.5: RuvA C-terminal domain-like; a.8: immunoglobulin/albumin-binding domain-like. Note that CATH 1.10.8 domains span 4 different SCOP classes, and that most SCOP folds that contain CATH 1.10.8 domains also contain domains outside CATH 1.10.8. A web-tool generating figures similar to this one is available at http://www.everest.cs.huji.ac.il/3d-assessment.



[Figure 4 plot. Panels: SCOP (left), CATH (right). Y-axis: Fraction Successfully Mapped (0 to 1). X-axis: Mapping Level (SCOP panel: Class, Fold, Superfam., Family; CATH panel: Class, Arch., Topo., Homo., S). Series: EVEREST, Dali, CATH (SCOP panel); EVEREST, Dali, SCOP (CATH panel).]

Figure 4: The tradeoff between the granularity of a family and its chance of matching a reference family. Test families were mapped to the SCOP (left panel) and CATH (right panel) hierarchical classifications, as described in section 4.3: Mapping a tested domain family onto a reference hierarchical classification of domains. Test families were grouped according to their test classification levels: class, architecture, topology, homologous-superfamily and S levels for CATH; class, fold, superfamily and family for SCOP; Levels 1-6 for Dali. EVEREST has no levels, and thus all of its families formed one group. Each marker represents one such group, the rightmost marker on each line corresponding to the finest granularity group, the leftmost corresponding to the coarsest. X-axis: Average reference level to which successfully mapped families of the group were mapped. Y-axis: Proportion of the families of the group that were successfully mapped.


[Figure 5 bar charts; Y-axis 0 to 1 in every panel; number of tested families in parentheses.]
Reference: SCOP (X-axis: X, C, F, S, F). Panels: EVEREST Family (6429); CATH Class (3), Arch. (23), Topo. (124), Homo. (718), S (1799); Dali 1 (408), 2 (602), 3 (776), 4 (696), 5 (152), 6 (1740).
Reference: CATH (X-axis: X, C, A, T, H, S). Panels: EVEREST Family (6042); SCOP Class (7), Fold (102), SF (307), Family (1257); Dali 1 (394), 2 (577), 3 (750), 4 (665), 5 (150), 6 (1877).

Figure 5: Matching of families of EVEREST, Dali, SCOP and CATH onto the SCOP or CATH hierarchies. Each panel describes a mapping of families from a certain level of a certain classification into one of the two reference hierarchies. The height of the bars above the X symbol gives the proportion of tested families that could not be mapped (at a tolerance level of 0.2) to the reference hierarchy. The height of bars above the other symbols describes the proportion of tested families that mapped to the given level (for SCOP, the levels are, from left to right: C - class, F - fold, S - superfamily and F - family; for CATH they are: C - class, A - architecture, T - topology, H - homologous-superfamily and S - S-level). Bars in between symbols describe the proportion of families that mapped in between the corresponding levels of the reference hierarchy.



[Figure 6 bar charts; Y-axis 0 to 1 in every panel.]
Reference: SCOP (X-axis: X, C, F, S, F, D). Part A panels: EVEREST Family; CATH Class, Arch., Topo., Homo., S; Dali 1-6. Part B panels: All (one per tested classification).
Reference: CATH (X-axis: X, C, A, T, H, S, D). Part A panels: EVEREST Family; SCOP Class, Fold, SF, Family; Dali 1-6. Part B panels: All (one per tested classification).

Figure 6: Coverage of SCOP and CATH families by EVEREST, Dali, SCOP and CATH families. A: Each panel describes a mapping of families from a certain level of a certain tested classification into one of the two reference hierarchies. B: Each panel describes the total coverage of all reference families by all families from a certain tested classification (including singleton domains). The height of the bars above the X symbol gives the proportion of tested families that could not be mapped (at a tolerance level of 0.2) to the reference hierarchy. The height of the blue bars above the other symbols describes the proportion of reference families of the indicated level that had (at least one) tested family mapped to them. The height of the green bars above the other symbols describes the proportion of reference families of the indicated level that were members of (at least one) union of sibling reference families that had (at least one) tested family mapped to them (excluding those families that were counted in the blue bar). SCOP levels are indicated by the following letters, from left to right: C - class, F - fold, S - superfamily, F - family and D - domain. CATH levels are indicated by the following letters, from left to right: C - class, A - architecture, T - topology, H - homologous-superfamily, S - S-level and D - domain.



[Figure 7 graphics; Y-axis 0 to 1 in every bar graph. A, B: dendrograms with bar graphs. C: bar graphs for EVEREST, Dali and CATH over SCOP levels C, F, S, F. D: bar graphs for EVEREST, Dali and SCOP over CATH levels C, A, T, H, S.]

Figure 7: Relationships between coverage of different granularity levels. The dendrograms in A and B illustrate two imperfect coverage scenarios. Black (white) circles represent reference families covered (uncovered) by the tested classification. The fraction of families covered at the lower level is equal in both scenarios. In A, families from all parts of the protein world are covered, but every part is only covered at one granularity level. In B, not all parts of the protein world are covered, but the covered parts are covered at several granularity levels. The bar graphs summarize this information for the two scenarios and for real coverage data. In A and B, the bars in the graphs correspond to the levels of the dendrograms. For each level of the reference hierarchy (including unions of sibling families as intermediate levels in C and D), we measure the fraction of lowest level families (SCOP family in C and CATH S-level family in D) that have a covered ancestor at or below that level (given by the total height of the bar) and the fraction of lowest level families that have a covered ancestor exactly at that level (given by the height of the blue part of the bar). C: Coverage of SCOP. Levels are indicated by the following letters, from left to right: C - class, F - fold, S - superfamily and F - family. D: Coverage of CATH. Levels are indicated by the following letters, from left to right: C - class, A - architecture, T - topology, H - homologous-superfamily and S - S-level. The tested classifications are indicated above the bar graphs. We use the term ancestor here in the general sense, i.e. a family is an ancestor of itself.
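The two quantities plotted in the bar graphs of Figure 7 can be computed as in the following Python sketch. It is a simplified illustration: each lowest-level family is assumed to be given with its lineage, listed from the coarsest level to the family itself, and the intermediate levels formed by unions of sibling families are left out. The function name and the toy identifiers are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def coverage_by_level(lineages: Dict[str, List[str]],
                      covered: Set[str]) -> List[Tuple[float, float]]:
    """For each level k (0 = coarsest), return a pair of fractions of
    lowest-level families:
      - with a covered ancestor at level k or at any finer level (total bar height),
      - with a covered ancestor exactly at level k (blue part of the bar).
    A family counts as an ancestor of itself."""
    n_levels = len(next(iter(lineages.values())))
    n_families = len(lineages)
    result = []
    for k in range(n_levels):
        exact = sum(1 for lin in lineages.values() if lin[k] in covered)
        at_or_below = sum(1 for lin in lineages.values()
                          if any(node in covered for node in lin[k:]))
        result.append((at_or_below / n_families, exact / n_families))
    return result


# Toy hierarchy: class -> fold -> superfamily -> family (placeholder names).
toy_lineages = {
    "fam1": ["c", "f1", "s1", "fam1"],
    "fam2": ["c", "f1", "s1", "fam2"],
    "fam3": ["c", "f2", "s2", "fam3"],
    "fam4": ["c", "f2", "s3", "fam4"],
}
toy_result = coverage_by_level(toy_lineages, covered={"s1", "fam3"})
```

In the toy example, three of the four families have a covered ancestor somewhere in their lineage, two of them at the superfamily level and one at the family level, so the total bar height is 0.75 for the three coarsest levels and 0.25 at the family level.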



[Figure 8 plots. Panels: SCOP (left), CATH (right), in a top and a bottom row. Y-axis: Fraction Successfully Mapped (0 to 1). X-axis: Mapping Level.
Marker colors: EVEREST; Dali 1; Dali 3; Dali 6; CATH Class / SCOP Class; CATH Arch. / SCOP Fold; CATH Topo. / SCOP Superfam.; CATH Homo. / SCOP Family; CATH S.
Marker shapes, top: normal settings; single match only; require coverage of both domains; tolerance at 0.1; tolerance at 0.3.
Marker shapes, bottom: normal settings; no nrdb; inconsistency threshold 5%; uncorrected boundaries; only unsplit domains.]

Figure 8: Stability of the procedures. Test families were mapped to the SCOP (left panel) and CATH (right panel) hierarchical classifications, as described in section 4.3: Mapping a tested domain family onto a reference hierarchical classification of domains. Each marker shape describes a different modification to the procedure described in section 4: Methods. As in Figure 4, test families were grouped according to their test classification levels: class, architecture, topology, homologous-superfamily and S levels for CATH; class, fold, superfamily and family for SCOP; Levels 1, 3, 6 for Dali. EVEREST has no levels, and thus all of its families formed one group. Marker colors indicate the family group described. X-axis: Average reference level to which successfully mapped families of the group were mapped. Y-axis: Proportion of the families of the group that were successfully mapped.
