scope: structural classification of proteins—extended,...

6
SCOPe: Structural Classification of Proteins—extended, integrating SCOP and ASTRAL data and classification of new structures Naomi K. Fox 1 , Steven E. Brenner 1,2, * and John-Marc Chandonia 1 1 Physical Biosciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA and 2 Department of Plant and Microbial Biology, University of California, Berkeley, CA 94720, USA Received September 16, 2013; Revised November 1, 2013; Accepted November 7, 2013 ABSTRACT Structural Classification of Proteins—extended (SCOPe, http://scop.berkeley.edu) is a database of protein structural relationships that extends the SCOP database. SCOP is a manually curated ordering of domains from the majority of proteins of known structure in a hierarchy according to structural and evolutionary relationships. Development of the SCOP 1.x series concluded with SCOP 1.75. The ASTRAL compendium provides several databases and tools to aid in the analysis of the protein structures classified in SCOP, particularly through the use of their sequences. SCOPe extends version 1.75 of the SCOP database, using automated curation methods to classify many structures released since SCOP 1.75. We have rigorously benchmarked our automated methods to ensure that they are as accurate as manual curation, though there are many proteins to which our methods cannot be applied. SCOPe is also partially manually curated to correct some errors in SCOP. SCOPe aims to be backward com- patible with SCOP, providing the same parseable files and a history of changes between all stable SCOP and SCOPe releases. SCOPe also incorpor- ates and updates the ASTRAL database. The latest release of SCOPe, 2.03, contains 59 514 Protein Data Bank (PDB) entries, increasing the number of struc- tures classified in SCOP by 55% and including more than 65% of the protein structures in the PDB. BACKGROUND Nearly all proteins have structural similarities with other proteins and, in many of these cases, share a common evolutionary origin. The Structural Classification of Proteins (SCOP) database (1–4), which marks its 20th an- niversary this year, aimed to provide a detailed and com- prehensive description of the structural and evolutionary relationships between all proteins of known structure. As such, it provides a broad survey of known protein folds, detailed information about the close relatives of any par- ticular protein and a framework for future research and classification. By analogy with taxonomy, SCOP was created as a hierarchy of several levels where the fundamental unit of classification is a ‘domain’ in the experimentally determined protein structure. The hierarchy of SCOP domains comprises the following levels: ‘Species’ repre- senting a distinct protein sequence and its naturally occurring or artificially created variants; ‘Protein’ grouping together similar sequences of essentially the same functions that either originate from different biolo- gical species or represent different isoforms within the same species; ‘Family’ containing proteins with similar se- quences but typically distinct functions and ‘Superfamily’ bridging together protein families with common func- tional and structural features inferred to be from a common evolutionary ancestor. Near the root, the basis of classification is purely structural: structurally similar superfamilies are grouped into ‘Folds’, which are further arranged into ‘Classes’ based mainly on their secondary structure content and organization. The ASTRAL compendium (5–7) is a collection of software and databases, partially derived from SCOP, that aid research into protein structure and evolution. ASTRAL provides sequences and coordinate files for all SCOP domains, as well as sequences for all Protein Data Bank (PDB, 8) chains that are classified in SCOP. Chemically modified amino acids are translated back to the original sequence, and sequences are curated to elim- inate errors resulting from the automated parsing of PDB files. Because the majority of sequences in the PDB are *To whom correspondence should be addressed. Tel: +510 643 9131; Fax: +510 666 2505; Email: [email protected] D304–D309 Nucleic Acids Research, 2014, Vol. 42, Database issue Published online 3 December 2013 doi:10.1093/nar/gkt1240 ß The Author(s) 2013. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. at University of California, Berkeley on August 19, 2014 http://nar.oxfordjournals.org/ Downloaded from

Upload: others

Post on 19-Jan-2021

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: SCOPe: Structural Classification of Proteins—extended, …compbio.berkeley.edu/people/brenner/pubs/fox-2013-nar... · 2014. 8. 19. · determined protein structure. The hierarchy

SCOPe Structural Classification ofProteinsmdashextended integrating SCOPand ASTRAL data and classificationof new structuresNaomi K Fox1 Steven E Brenner12 and John-Marc Chandonia1

1Physical Biosciences Division Lawrence Berkeley National Laboratory Berkeley CA 94720 USA and2Department of Plant and Microbial Biology University of California Berkeley CA 94720 USA

Received September 16 2013 Revised November 1 2013 Accepted November 7 2013

ABSTRACT

Structural Classification of Proteinsmdashextended(SCOPe httpscopberkeleyedu) is a database ofprotein structural relationships that extends theSCOP database SCOP is a manually curatedordering of domains from the majority of proteinsof known structure in a hierarchy according tostructural and evolutionary relationshipsDevelopment of the SCOP 1x series concludedwith SCOP 175 The ASTRAL compendiumprovides several databases and tools to aid in theanalysis of the protein structures classified in SCOPparticularly through the use of their sequencesSCOPe extends version 175 of the SCOPdatabase using automated curation methods toclassify many structures released since SCOP 175We have rigorously benchmarked our automatedmethods to ensure that they are as accurate asmanual curation though there are many proteinsto which our methods cannot be applied SCOPe isalso partially manually curated to correct someerrors in SCOP SCOPe aims to be backward com-patible with SCOP providing the same parseablefiles and a history of changes between all stableSCOP and SCOPe releases SCOPe also incorpor-ates and updates the ASTRAL database The latestrelease of SCOPe 203 contains 59 514 Protein DataBank (PDB) entries increasing the number of struc-tures classified in SCOP by 55 and including morethan 65 of the protein structures in the PDB

BACKGROUND

Nearly all proteins have structural similarities with otherproteins and in many of these cases share a common

evolutionary origin The Structural Classification ofProteins (SCOP) database (1ndash4) which marks its 20th an-niversary this year aimed to provide a detailed and com-prehensive description of the structural and evolutionaryrelationships between all proteins of known structure Assuch it provides a broad survey of known protein foldsdetailed information about the close relatives of any par-ticular protein and a framework for future research andclassification

By analogy with taxonomy SCOP was created as ahierarchy of several levels where the fundamental unitof classification is a lsquodomainrsquo in the experimentallydetermined protein structure The hierarchy of SCOPdomains comprises the following levels lsquoSpeciesrsquo repre-senting a distinct protein sequence and its naturallyoccurring or artificially created variants lsquoProteinrsquogrouping together similar sequences of essentially thesame functions that either originate from different biolo-gical species or represent different isoforms within thesame species lsquoFamilyrsquo containing proteins with similar se-quences but typically distinct functions and lsquoSuperfamilyrsquobridging together protein families with common func-tional and structural features inferred to be from acommon evolutionary ancestor Near the root the basisof classification is purely structural structurally similarsuperfamilies are grouped into lsquoFoldsrsquo which are furtherarranged into lsquoClassesrsquo based mainly on their secondarystructure content and organization

The ASTRAL compendium (5ndash7) is a collection ofsoftware and databases partially derived from SCOPthat aid research into protein structure and evolutionASTRAL provides sequences and coordinate files for allSCOP domains as well as sequences for all Protein DataBank (PDB 8) chains that are classified in SCOPChemically modified amino acids are translated back tothe original sequence and sequences are curated to elim-inate errors resulting from the automated parsing of PDBfiles Because the majority of sequences in the PDB are

To whom correspondence should be addressed Tel +510 643 9131 Fax +510 666 2505 Email scopecompbioberkeleyedu

D304ndashD309 Nucleic Acids Research 2014 Vol 42 Database issue Published online 3 December 2013doi101093nargkt1240

The Author(s) 2013 Published by Oxford University PressThis is an Open Access article distributed under the terms of the Creative Commons Attribution License (httpcreativecommonsorglicensesby30) whichpermits unrestricted reuse distribution and reproduction in any medium provided the original work is properly cited

at University of C

alifornia Berkeley on A

ugust 19 2014httpnaroxfordjournalsorg

Dow

nloaded from

very similar to others ASTRAL provides representativesubsets of proteins that span the set of classified proteinstructures or domains while alleviating bias toward well-studied proteins The highest quality representative ineach subset is chosen using AEROSPACI scores (7)which provide a numeric estimate of the quality and pre-cision of crystallographically determined structures

The SCOP database was first released in December 1994and contained all 3091 entries in the PDB at that time (1)Table 1 provides a history of all stable SCOP releasesincluding the number of months required to curate eachrelease The last comprehensive release was SCOP 171released in October 2006 The increase in time requiredto fully classify each comprehensive SCOP releaseshown in Table 1 highlights the need to abet manual clas-sification with automated methods Although SCOP wasfully manually curated prior to version 173 an automatedclassification method discussed later in this manuscriptwas introduced in SCOP 173 SCOP levels above lsquoSpeciesrsquowere hand-curated in all versions of SCOP these includethe lsquoSuperfamilyrsquo level in which human expertise was usedto annotate many remote homologs that cannot bereliably identified as homologous by current computa-tional methods Development of the SCOP 1x seriesconcluded with SCOP 175 released in June 2009 It clas-sified 38 221 PDB entries more than 70 of the PDBstructures available at that time However due toincreasing growth of the PDB SCOP 175 now coversfewer than half of the protein structures that are currentlyavailable

In this article we describe SCOPmdashextended (SCOPe)a database that extends SCOP 175 with the aim of

providing ongoing classification of new PDB structuresin the context of the SCOP hierarchy backward com-patibility with SCOP manual correction of errors andstable releases suitable for benchmarking More than800 new structures are now being deposited to thePDB in an average month (8) too many for aworkflow based entirely on manual curation to keeppace We have therefore developed new methods forautomatically classifying structures similar to thosealready in SCOP and we have benchmarked it rigor-ously with the goal of achieving accuracy comparable tofully hand-curated SCOP releases (ie through SCOP171) However these methods do not classify structuresdissimilar to those already classified manually Newreleases of ASTRAL are now derived from the SCOPeclassification and ASTRAL data are integrated into theSCOPe website SCOPe also includes data from allprevious releases of SCOP and ASTRAL that featurestable identifiers (ie since SCOP 155) to facilitateweb-based comparison of releases of SCOP andSCOPe SCOPe releases 201 202 and 203 do notadd additional manually curated entries at thelsquoFamilyrsquo level or above Rather domains from newprotein structures are classified into the manuallycurated hierarchy of SCOP 175 and many existingSCOP 175 domains were corrected or movedStatistics on all SCOP and SCOPe releases summariesand a full history of changes and other information areavailable from the SCOPe website (httpscopberkeleyedu) together with copies of our relational (MySQL)database and parseable files containing all SCOPeSCOP and ASTRAL data

Table 1 SCOP and SCOPe growth and benchmarking

Release Freezedate

Releasedate

Months torelease

Total PDBentries

Total PDBentriesclassified

New PDBentries used inbenchmark

PDB depositionrate per month

Percent of newentries classifiableby currentautomated method

SCOP 155 2001ndash03 2001ndash07 4 13 300 13 228 na 258 naSCOP 157 2001ndash10 2002ndash01 3 14 825 14 736 1508 275 49SCOP 159 2002ndash03 2002ndash05 2 16 057 15 985 1249 270 47SCOP 161 2002ndash09 2002ndash11 2 17 498 17 411 1426 304 51SCOP 163 2003ndash03 2003ndash06 3 19 036 18 951 1540 351 50SCOP 165 2003ndash08 2003ndash12 4 20 699 20 619 1668 374 51SCOP 167 2004ndash05 2005ndash02 9 24 131 24 036 3417 436 52SCOP 169 2004ndash10 2005ndash07 9 26 101 25 972 1936 454 46SCOP 171 2005ndash01 2006ndash10 21 27 821 27 599 1627 474 45SCOP 173 2007ndash09 2007ndash11 2 44 169 34 494 6895 593 58SCOP 175 2009ndash02 2009ndash06 4 53 830 38 221 3727 632 48SCOPe 201

(formerly 175A)2012ndash02 2012ndash03 1 76 528 49 219 na 775 na

SCOPe 202(formerly 175B)

2012ndash11 2013ndash01 2 83 643 49 674 na 816 na

SCOPe 203(formerly 175C)

2013ndash08 2013ndash10 2 90 812 59 514 na na na

The number of new entries added in each release of SCOP that used stable identifiers For each release the lsquofreeze datersquo or date at which no newPDB entries were to be classified in the release is given In practice some entries released just after the freeze date were sometimes included Thetotal number of PDB entries that contained protein structures were not obsolete as of the freeze date or which were included in each release isgiven as well as the number of PDB entries that were included in each release Release 171 was the most recent comprehensive SCOP release (ieone in which nearly all PDB entries available prior to the freeze date were classified) The average rate at which PDB entries were deposited eachmonth is also given measured over the 6 months before and after (if applicable) the freeze date

Nucleic Acids Research 2014 Vol 42 Database issue D305

at University of C

alifornia Berkeley on A

ugust 19 2014httpnaroxfordjournalsorg

Dow

nloaded from

BACKWARD COMPATIBILITY

To facilitate use of SCOPe data by SCOP and ASTRALusers we provide SCOPe and ASTRAL data in parseablefiles in the same formats as the previous SCOP andASTRAL releases SCOPe uses the same stable identifiers(eg sunid sid sccs) as were used for prior releases ofSCOP and ASTRAL (356) and the same protocols pre-viously used to assign new SCOP and ASTRAL identifiersare currently being used in SCOPe A history of allchanges between consecutive releases of SCOP andSCOPe is available on the SCOPe website

ADDING NEW ENTRIES AND CORRECTINGERRORS IN SCOP

Using our new automation procedure (described in thenext section) we have added 21 293 PDB files to SCOPethat were not classified in SCOP 175 (23 of the PDB)We have also modified some previous entries that weremanually curated or classified via a previous automatedmethod introduced in SCOP 173 (4) Based on carefulexamination of differences between automated classifica-tion and manually curated domains we detected and fixederrors in 70 manually curated domains in SCOP (out of105 detected with differences) of which 62 were changedby more than 10 residues The fraction of manually clas-sified domains in SCOP 175 with substantial errors wecould detect was 008 (62 of 80 140) reflecting the ex-tremely high accuracy of manual classification in theSCOP database Typical examples of common errors areshown in Figure 1 Often these errors were the result of atypo during manual curation and would result in theimproper inclusion or exclusion of a region of sequencefrom a domain We reassigned domain boundaries to 5054domains that had been classified with the SCOP 173 auto-mated method The vast majority of these were smallchanges but 231 differed substantially requiring adomain to be split into two or merged with another orits boundary to be changed by more than 10 residues Thisalso represents a low error rate 08 (231 of 30 660)although it is roughly 10 times higher than the error ratefor manually curated domains We also noticed a numberof both manually and automatically curated domains thathad been classified in the wrong lsquoProteinrsquo or lsquoSpeciesrsquoentry We have corrected 252 manually curated domainsand 6285 previous automatically classified domains

AUTOMATED CLASSIFICATION METHODS

We have implemented a new automated classification al-gorithm based on sequence similarity to previously clas-sified entries and benchmarked this method against allstable releases of SCOP from 157 to 175An automated method for classifying new PDB entries

based on previously classified entries was introduced inSCOP 173 (4) however rigorous benchmarking wasnot used to validate this method We have foundinconsistencies between manually curated domains andthose determined with this method Figure 1 includes anexample of such an inconsistency Other methods such as

SUPERFAMILY (910) and SCOPmap (11) have beendeveloped to automatically classify new PDB structuresin the context of SCOP and have the advantage ofoperating on proteins somewhat dissimilar from thosealready classified However when validated againstmanually curated domains these methods have errorrates that are too high to be suitable for incorporatingtheir results directly into SCOP For example SCOPmapreports a 5 error rate in benchmarking experiments Ouraim in developing a new automated classification algo-rithm was to maximize the number of domains thatcould be classified with an error rate indistinguishablefrom manual curation (see discussion of manualcuration error rate above)

The SCOP automated curation pipeline has been com-pletely rewritten in SCOPe We are replacing all domainsthat were predicted with the less accurate SCOP 173 auto-mated method with newer predictions Over the course ofbenchmarking our new method we found the methodwould have been able to automatically assign domainsfor approximately half of the PDB entries that weremanually curated in prior versions of SCOP with no sub-stantial differences (ie gt10 residue difference in anydomain boundary) between the automatically assignedand manually curated domain boundaries Over thecourse of benchmarking we found some differencesthat upon expert examination were determined to bethe result of manual curation errors in prior versions ofSCOP These are discussed later in the manuscript andwere corrected in SCOPe 203

One of the main challenges in automatically assigningdomains based on previously classified homologs is thatdifferent homologs may have different observed residues(ie those present in the ATOM records of PDB data)SCOP domains in a structure are defined relative to theobserved residues It is sometimes challenging to correctlyassign observed residues in a new structure that were notobserved in a previously classified structure Another chal-lenge is in variance in manually curated boundaries Wefound that in multi-domain chains with long linkerregions the exact boundaries between domains within afamily can differ substantially This is because fullymanually curated releases of SCOP (those before 173)aimed to classify every residue in each PDB structureeven though it is sometimes unclear to which if eitherof the adjacent domains residues in a linker regionshould be assigned

We now describe our algorithm for predicting domainsand classifying them in the SCOP hierarchy We firstcreate a BLAST database containing the SEQRES-basedsequences for each domain in SCOP and SCOPe Thenfor each newly released PDB chain we BLAST itsSEQRES sequence against the domain sequencedatabase BLAST performs local alignment returningthe start and end of the alignment (the lsquohitrsquo) for thequery sequence and the target sequence as well as theE-value We collect only the BLAST alignments wherethe E-value is at least as significant at 104 and the align-ment covers most of the target domain (defined as missingat most 10 residues from each end) We group the align-ments by the PDB chain that the targets belong to and

D306 Nucleic Acids Research 2014 Vol 42 Database issue

at University of C

alifornia Berkeley on A

ugust 19 2014httpnaroxfordjournalsorg

Dow

nloaded from

rank the groups by the total number of residues coveredby the hits on the query chain We then use the top-ranking group of BLAST-based alignments to annotatedomains in the query sequence If the nearest PDB chainend or gap in ATOM records is within 10 residues of analignment end we extend the domain to include theseresidues The purpose of extending the BLAST hit is toclassify every observed residue in the chain the 10-residuelimitation makes it very unlikely that the extension willinclude a new domain If the BLAST boundaries areoutside the observed residues in the query chain theassigned domain is shortened to include only observedresidues

After making predictions for each chain we applied aset of criteria to determine whether each domain predic-tion was high confidence (ie sufficiently accurate to beincluded in SCOPe without further manual inspection)We first exclude from high confidence those chains thatare low-resolution (3 A resolution or above) ribosomal orsynthetic (due to those being classified outside the firstseven classes of SCOP) or those that are homologous togenetic domains classified in SCOP (due to the difficulty ofcorrectly automating predictions that include multiplePDB chains)

In cases where an entire PDB chain was predicted to bea single SCOPe domain the prediction was deemed highconfidence if the target domain also comprised its entire

PDB chain For PDB chains that were divided intomultiple SCOPe domains additional criteria were usedto determine whether the predicted domains were highconfidence First we restricted the chains whose predic-tions we placed in the high confidence set to those whicheither (i) had 100 sequence identity with the target chainused to make the predictions or (ii) had exactly twodomains each composed of only one contiguous regionof residues We plan to extend the method to three ormore domain chains and multi-region domains in thefuture Second if any region found in the ATOMrecords of the chain was longer than 10 residues and notassigned to any domain we removed the chain from thehigh-confidence set Third we required the BLAST hitsused for the two domains in the chain not be to thesame target SCOP domain (on the theory that domainduplications are more likely to require a specialized algo-rithm or manual inspection to ensure no structuralchanges such as domain swapping)To fully classify these predictions in the SCOP hier-

archy we also had to assign levels below lsquoSuperfamilyrsquoBased on our benchmarking we developed heuristic rulesto classify domains at the lsquoFamilyrsquo lsquoProteinrsquo and lsquoSpeciesrsquolevels In cases where the protein or family could not bereliably matched to an existing SCOP entity we created anew protein or family called lsquoautomated matchesrsquo ratherthan risking inaccurate classification

Domain should include entire chain

Chain with Error from Manual Curation

Chain with Error from Manual Curation

Chain with Error from Manual Curation

Chain with Error from SCOP 173 Automated Curation

d1tqya2 d1e5ma2

One strand of beta sheet assignedto wrong domain by manual curation

e5ma2

Homolog (Correct)Alternative Structure (Correct)

d1oyvi_

d2p8qa1 d1qgka_

d1pjua1

Domain should be split into twodomains at this location

Homolog (Correct)

d1qzfa2d1seja2 fa2

Homolog (Correct)

Inconsistentinclusion offirst helixin domain

(a)

(c) (d)

(b)

Figure 1 Errors identified during benchmarking We detected errors in 70 manually curated domains by running benchmarking and manuallyinspecting predicted domains that did not sufficiently match the manually annotated domains These errors in domain boundaries in multi-domainchains were manually fixed in SCOPe 203 We also detected and fixed inconsistencies in 5054 domains that had been predicted and classified with theSCOP 173 automated method We review some of the types of errors detected (a) The SCOP 173 automated method used to predict domaind2p8qa1 had included approximately half the residues in the chain This was inconsistent with all other manually curated entries in its species-levelclade that included the entire chain (b) A strand of beta sheet was included in the d1tqya2 domain by manual curation (c) All of chain I from 1oyvhad been placed into a single domain (d) The manually curated domain d1seja2 excluded the first helix in the chain

Nucleic Acids Research 2014 Vol 42 Database issue D307

at University of C

alifornia Berkeley on A

ugust 19 2014httpnaroxfordjournalsorg

Dow

nloaded from

Thus far our focus has been on classifying chains thathave high sequence similarity with previously classifiedchains in SCOP and therefore have relied solely onsequence data The addition of structural informationshould expand the set of domains that can be classifiedwith high confidence in the future

BENCHMARKING RESULTS

To validate the new automated method we performedbenchmarking against all SCOP releases with stable iden-tifiers (ie releases 155ndash175) All PDB entries that wereadded between each pair of consecutive releases wereautomatically classified based on the earlier release andcompared with the manually curated domains in the sub-sequent release A predicted domain was consideredidentified and classified correctly if it was placed in thecorrect superfamily and its boundaries differed from themanually curated boundaries by no more than 10 residuesTable 1 lists the fraction of PDB entries that could beclassified using this method for each SCOP release Anon-trivial example of automated classification is shownin Figure 2 Our method predicted 20 048 domains thatmatched manually curated domains within the 10-residueerror tolerance and predicted 105 domains that differed bymore than 10 residues We reviewed all differences of morethan 10 residues and found they were either the result oferrors in SCOP 175 (discussed above) or in linker regionsin which neither the manually curated domain nor theprediction could be determined to be more accurate Wealso compared domains that had been added by theprevious automated method (4) which had added 30 852new domains in total to versions 173 and 175 Wecompared domains predicted by our new method tothose added by the old method Of the 15 660 domainpairs compared (the new automated method is more con-servative than the old one and therefore fewer domainsare classified) 125 were found to have domain boundariesthat differed by more than 10 residues (this is lower than

the 231 such domains we corrected in SCOPe discussedabove because the benchmark did not include additionalSCOPe domains based on manually curated domains fromSCOP 175) These differences were also the result oferrors in the previous automated method or ambiguouslinker regions

NEW WEBSITE

The SCOPe website offers a modest redesign of the SCOPwebsite presenting all SCOPe SCOP and ASTRAL datathrough a single unified web interface Many objects thatwere difficult to find in the original website such as thechange history are now available under tabs Thumbnailimages were automatically generated to show each domainon its own and in several structural contexts and these aredisplayed as part of the browser A fully JavaScriptJSmol-based viewer was added to enable visitors to viewdomains in three-dimensions in isolation in context ofthe chain or in context of the entire PDB structure TheSCOPe website can display data from all versions ofSCOPe SCOP and ASTRAL since release 155 All dataare stored in a relational (MySQL) database which is alsoavailable for download

STABLE AND PERIODIC UPDATES

Stable releases will potentially include manual curation ofthe SCOPe hierarchy correction of errors in previouslyclassified domains or changes to our classificationworkflow (eg validation of our automated sequence-based classification protocol with structure comparison)However in an effort to stay more closely synchronizedwith the PDB we also plan to supplement these stablereleases with periodic updates (approximately monthly)Our infrastructure automatically imports and classifiesnew PDB files on a weekly basis Starting with SCOPe202 we have begun to release periodic updates that addnewly released PDB entries to the SCOPe classification

Automated domainboundary at residue 225

Manually curated domainboundary at residue 223

d1vj5a2(grey)

d1vj5a1(color)

1

2 224

226

554

544

BLAST against DB of all SCOP domains

1vj5 chain A

hit 2 (d1ek1a2)

hit 1 (d1ek1a1)

query sequence

Figure 2 Automated curation example This figure depicts an example of applying the automated method for domain prediction and classification to1vj5 chain A released on 2004-04-27 We attempted to automatically classify it into SCOP 167 based only on domains defined in SCOP 165 1vj5Ahas 554 residues of which residues 2-547 are observed (found in the ATOM records in PDB data) Two significant BLAST hits were found to theclassified chain 1ek1A which has a distinct sequence from 1vj5A but also has 554 residues of which residues 4-19 48-66 and 90-544 are observedThe two BLAST hits include residues 2-224 and 226-544 in 1vj5A The final predicted domains in 1vj5A are 2-225 and 226-547 The manuallyannotated domains for 1vj5A are 2-223 and 224-547 Since the end of each predicted domain differs from the manually annotated domain by at most10 residues this domain prediction is deemed to fall within the error tolerance for validation

D308 Nucleic Acids Research 2014 Vol 42 Database issue

at University of C

alifornia Berkeley on A

ugust 19 2014httpnaroxfordjournalsorg

Dow

nloaded from

These periodic updates add new PDB entries to thecurrent release without affecting any previously classifieddomains The newly classified entries are visible in theweb interface and in downloadable files such as theSCOP-compatible parseable files and MySQL databaseSequences for the newly added chains and domains willnot be added to the ASTRAL representative subsets untilthe subsequent stable release of ASTRAL

The periodic updates are not intended to replace stablereleases because the latter are commonly used for bench-marking both will be available for download through theSCOPe website Stable releases will be assigned versionnumbers in the current format (explicitly labeled stableon the website and downloadable files eg 203-stable)while updates to stable releases will be named accordingto the most recent stable version appended with the releasedate (eg 203-2013-12-01)

ACKNOWLEDGEMENTS

We would like to thank all the other SCOP authorsAlexey G Murzin Antonina Andreeva Dave HoworthLoredana Lo Conte Bartlett G Ailey Tim JP Hubbardand Cyrus Chothia

FUNDING

This work is supported by the National Institutes ofHealth (NIH) [R01-GM073109] through the USDepartment of Energy under Contract No DE-AC02-05CH11231 Funding for open access charge NIH[R01-GM073109]

Conflict of interest statement None declared

REFERENCES

1 MurzinAG BrennerSE HubbardT and ChothiaC (1995)SCOP a structural classification of proteins database for theinvestigation of sequences and structures J Mol Biol 247536ndash540

2 BrennerSE ChothiaC HubbardTJ and MurzinAG (1996)Understanding protein structure using scop for foldinterpretation Methods Enzymol 266 635ndash643

3 Lo ConteL BrennerSE HubbardT ChothiaC andMurzinAG (2002) SCOP database in 2002 refinementsaccommodate structural genomics Nucleic Acids Res 30264ndash267

4 AndreevaA HoworthD ChandoniaJM BrennerSEHubbardTJ ChothiaC and MurzinAG (2008) Data growthand its impact on the SCOP database new developments NucleicAcids Res 36 D419ndashD425

5 BrennerSE KoehlP and LevittM (2000) The ASTRALcompendium for protein structure and sequence analysis NucleicAcids Res 28 254ndash256

6 ChandoniaJM WalkerNS Lo ConteL KoehlP LevittMand BrennerSE (2002) ASTRAL compendium enhancementsNucleic Acids Res 30 260ndash263

7 ChandoniaJM HonG WalkerNS Lo ConteL KoehlPLevittM and BrennerSE (2004) The ASTRAL Compendium in2004 Nucleic Acids Res 32 D189ndashD192

8 BermanHM WestbrookJ FengZ GillilandG BhatTNWeissigH ShindyalovIN and BournePE (2000) The ProteinData Bank Nucleic Acids Res 28 235ndash242

9 GoughJ KarplusK HugheyR and ChothiaC (2001)Assignment of homology to genome sequences using a library ofhidden Markov models that represent all proteins of knownstructure J Mol Biol 313 903ndash919

10 WilsonD PethicaR ZhouY TalbotC VogelC MaderaMChothiaC and GoughJ (2009) SUPERFAMILYmdashsophisticatedcomparative genomics data mining visualization and phylogenyNucleic Acids Res 37 D380ndashD386

11 CheekS QiY KrishnaSS KinchLN and GrishinNV(2004) SCOPmap automated assignment of proteinstructures to evolutionary superfamilies BMC Bioinformatics 5197

Nucleic Acids Research 2014 Vol 42 Database issue D309

at University of C

alifornia Berkeley on A

ugust 19 2014httpnaroxfordjournalsorg

Dow

nloaded from

Page 2: SCOPe: Structural Classification of Proteins—extended, …compbio.berkeley.edu/people/brenner/pubs/fox-2013-nar... · 2014. 8. 19. · determined protein structure. The hierarchy

very similar to others ASTRAL provides representativesubsets of proteins that span the set of classified proteinstructures or domains while alleviating bias toward well-studied proteins The highest quality representative ineach subset is chosen using AEROSPACI scores (7)which provide a numeric estimate of the quality and pre-cision of crystallographically determined structures

The SCOP database was first released in December 1994and contained all 3091 entries in the PDB at that time (1)Table 1 provides a history of all stable SCOP releasesincluding the number of months required to curate eachrelease The last comprehensive release was SCOP 171released in October 2006 The increase in time requiredto fully classify each comprehensive SCOP releaseshown in Table 1 highlights the need to abet manual clas-sification with automated methods Although SCOP wasfully manually curated prior to version 173 an automatedclassification method discussed later in this manuscriptwas introduced in SCOP 173 SCOP levels above lsquoSpeciesrsquowere hand-curated in all versions of SCOP these includethe lsquoSuperfamilyrsquo level in which human expertise was usedto annotate many remote homologs that cannot bereliably identified as homologous by current computa-tional methods Development of the SCOP 1x seriesconcluded with SCOP 175 released in June 2009 It clas-sified 38 221 PDB entries more than 70 of the PDBstructures available at that time However due toincreasing growth of the PDB SCOP 175 now coversfewer than half of the protein structures that are currentlyavailable

In this article we describe SCOPmdashextended (SCOPe)a database that extends SCOP 175 with the aim of

providing ongoing classification of new PDB structuresin the context of the SCOP hierarchy backward com-patibility with SCOP manual correction of errors andstable releases suitable for benchmarking More than800 new structures are now being deposited to thePDB in an average month (8) too many for aworkflow based entirely on manual curation to keeppace We have therefore developed new methods forautomatically classifying structures similar to thosealready in SCOP and we have benchmarked it rigor-ously with the goal of achieving accuracy comparable tofully hand-curated SCOP releases (ie through SCOP171) However these methods do not classify structuresdissimilar to those already classified manually Newreleases of ASTRAL are now derived from the SCOPeclassification and ASTRAL data are integrated into theSCOPe website SCOPe also includes data from allprevious releases of SCOP and ASTRAL that featurestable identifiers (ie since SCOP 155) to facilitateweb-based comparison of releases of SCOP andSCOPe SCOPe releases 201 202 and 203 do notadd additional manually curated entries at thelsquoFamilyrsquo level or above Rather domains from newprotein structures are classified into the manuallycurated hierarchy of SCOP 175 and many existingSCOP 175 domains were corrected or movedStatistics on all SCOP and SCOPe releases summariesand a full history of changes and other information areavailable from the SCOPe website (httpscopberkeleyedu) together with copies of our relational (MySQL)database and parseable files containing all SCOPeSCOP and ASTRAL data

Table 1 SCOP and SCOPe growth and benchmarking

Release Freezedate

Releasedate

Months torelease

Total PDBentries

Total PDBentriesclassified

New PDBentries used inbenchmark

PDB depositionrate per month

Percent of newentries classifiableby currentautomated method

SCOP 155 2001ndash03 2001ndash07 4 13 300 13 228 na 258 naSCOP 157 2001ndash10 2002ndash01 3 14 825 14 736 1508 275 49SCOP 159 2002ndash03 2002ndash05 2 16 057 15 985 1249 270 47SCOP 161 2002ndash09 2002ndash11 2 17 498 17 411 1426 304 51SCOP 163 2003ndash03 2003ndash06 3 19 036 18 951 1540 351 50SCOP 165 2003ndash08 2003ndash12 4 20 699 20 619 1668 374 51SCOP 167 2004ndash05 2005ndash02 9 24 131 24 036 3417 436 52SCOP 169 2004ndash10 2005ndash07 9 26 101 25 972 1936 454 46SCOP 171 2005ndash01 2006ndash10 21 27 821 27 599 1627 474 45SCOP 173 2007ndash09 2007ndash11 2 44 169 34 494 6895 593 58SCOP 175 2009ndash02 2009ndash06 4 53 830 38 221 3727 632 48SCOPe 201

(formerly 175A)2012ndash02 2012ndash03 1 76 528 49 219 na 775 na

SCOPe 202(formerly 175B)

2012ndash11 2013ndash01 2 83 643 49 674 na 816 na

SCOPe 203(formerly 175C)

2013ndash08 2013ndash10 2 90 812 59 514 na na na

The number of new entries added in each release of SCOP that used stable identifiers For each release the lsquofreeze datersquo or date at which no newPDB entries were to be classified in the release is given In practice some entries released just after the freeze date were sometimes included Thetotal number of PDB entries that contained protein structures were not obsolete as of the freeze date or which were included in each release isgiven as well as the number of PDB entries that were included in each release Release 171 was the most recent comprehensive SCOP release (ieone in which nearly all PDB entries available prior to the freeze date were classified) The average rate at which PDB entries were deposited eachmonth is also given measured over the 6 months before and after (if applicable) the freeze date

Nucleic Acids Research 2014 Vol 42 Database issue D305

at University of C

alifornia Berkeley on A

ugust 19 2014httpnaroxfordjournalsorg

Dow

nloaded from

BACKWARD COMPATIBILITY

To facilitate use of SCOPe data by SCOP and ASTRALusers we provide SCOPe and ASTRAL data in parseablefiles in the same formats as the previous SCOP andASTRAL releases SCOPe uses the same stable identifiers(eg sunid sid sccs) as were used for prior releases ofSCOP and ASTRAL (356) and the same protocols pre-viously used to assign new SCOP and ASTRAL identifiersare currently being used in SCOPe A history of allchanges between consecutive releases of SCOP andSCOPe is available on the SCOPe website

ADDING NEW ENTRIES AND CORRECTINGERRORS IN SCOP

Using our new automation procedure (described in thenext section) we have added 21 293 PDB files to SCOPethat were not classified in SCOP 175 (23 of the PDB)We have also modified some previous entries that weremanually curated or classified via a previous automatedmethod introduced in SCOP 173 (4) Based on carefulexamination of differences between automated classifica-tion and manually curated domains we detected and fixederrors in 70 manually curated domains in SCOP (out of105 detected with differences) of which 62 were changedby more than 10 residues The fraction of manually clas-sified domains in SCOP 175 with substantial errors wecould detect was 008 (62 of 80 140) reflecting the ex-tremely high accuracy of manual classification in theSCOP database Typical examples of common errors areshown in Figure 1 Often these errors were the result of atypo during manual curation and would result in theimproper inclusion or exclusion of a region of sequencefrom a domain We reassigned domain boundaries to 5054domains that had been classified with the SCOP 173 auto-mated method The vast majority of these were smallchanges but 231 differed substantially requiring adomain to be split into two or merged with another orits boundary to be changed by more than 10 residues Thisalso represents a low error rate 08 (231 of 30 660)although it is roughly 10 times higher than the error ratefor manually curated domains We also noticed a numberof both manually and automatically curated domains thathad been classified in the wrong lsquoProteinrsquo or lsquoSpeciesrsquoentry We have corrected 252 manually curated domainsand 6285 previous automatically classified domains

AUTOMATED CLASSIFICATION METHODS

We have implemented a new automated classification al-gorithm based on sequence similarity to previously clas-sified entries and benchmarked this method against allstable releases of SCOP from 157 to 175An automated method for classifying new PDB entries

based on previously classified entries was introduced inSCOP 173 (4) however rigorous benchmarking wasnot used to validate this method We have foundinconsistencies between manually curated domains andthose determined with this method Figure 1 includes anexample of such an inconsistency Other methods such as

SUPERFAMILY (910) and SCOPmap (11) have beendeveloped to automatically classify new PDB structuresin the context of SCOP and have the advantage ofoperating on proteins somewhat dissimilar from thosealready classified However when validated againstmanually curated domains these methods have errorrates that are too high to be suitable for incorporatingtheir results directly into SCOP For example SCOPmapreports a 5 error rate in benchmarking experiments Ouraim in developing a new automated classification algo-rithm was to maximize the number of domains thatcould be classified with an error rate indistinguishablefrom manual curation (see discussion of manualcuration error rate above)

The SCOP automated curation pipeline has been com-pletely rewritten in SCOPe We are replacing all domainsthat were predicted with the less accurate SCOP 173 auto-mated method with newer predictions Over the course ofbenchmarking our new method we found the methodwould have been able to automatically assign domainsfor approximately half of the PDB entries that weremanually curated in prior versions of SCOP with no sub-stantial differences (ie gt10 residue difference in anydomain boundary) between the automatically assignedand manually curated domain boundaries Over thecourse of benchmarking we found some differencesthat upon expert examination were determined to bethe result of manual curation errors in prior versions ofSCOP These are discussed later in the manuscript andwere corrected in SCOPe 203

One of the main challenges in automatically assigningdomains based on previously classified homologs is thatdifferent homologs may have different observed residues(ie those present in the ATOM records of PDB data)SCOP domains in a structure are defined relative to theobserved residues It is sometimes challenging to correctlyassign observed residues in a new structure that were notobserved in a previously classified structure Another chal-lenge is in variance in manually curated boundaries Wefound that in multi-domain chains with long linkerregions the exact boundaries between domains within afamily can differ substantially This is because fullymanually curated releases of SCOP (those before 173)aimed to classify every residue in each PDB structureeven though it is sometimes unclear to which if eitherof the adjacent domains residues in a linker regionshould be assigned

We now describe our algorithm for predicting domainsand classifying them in the SCOP hierarchy We firstcreate a BLAST database containing the SEQRES-basedsequences for each domain in SCOP and SCOPe Thenfor each newly released PDB chain we BLAST itsSEQRES sequence against the domain sequencedatabase BLAST performs local alignment returningthe start and end of the alignment (the lsquohitrsquo) for thequery sequence and the target sequence as well as theE-value We collect only the BLAST alignments wherethe E-value is at least as significant at 104 and the align-ment covers most of the target domain (defined as missingat most 10 residues from each end) We group the align-ments by the PDB chain that the targets belong to and

D306 Nucleic Acids Research 2014 Vol 42 Database issue

at University of C

alifornia Berkeley on A

ugust 19 2014httpnaroxfordjournalsorg

Dow

nloaded from

rank the groups by the total number of residues coveredby the hits on the query chain We then use the top-ranking group of BLAST-based alignments to annotatedomains in the query sequence If the nearest PDB chainend or gap in ATOM records is within 10 residues of analignment end we extend the domain to include theseresidues The purpose of extending the BLAST hit is toclassify every observed residue in the chain the 10-residuelimitation makes it very unlikely that the extension willinclude a new domain If the BLAST boundaries areoutside the observed residues in the query chain theassigned domain is shortened to include only observedresidues

After making predictions for each chain we applied aset of criteria to determine whether each domain predic-tion was high confidence (ie sufficiently accurate to beincluded in SCOPe without further manual inspection)We first exclude from high confidence those chains thatare low-resolution (3 A resolution or above) ribosomal orsynthetic (due to those being classified outside the firstseven classes of SCOP) or those that are homologous togenetic domains classified in SCOP (due to the difficulty ofcorrectly automating predictions that include multiplePDB chains)

In cases where an entire PDB chain was predicted to bea single SCOPe domain the prediction was deemed highconfidence if the target domain also comprised its entire

PDB chain For PDB chains that were divided intomultiple SCOPe domains additional criteria were usedto determine whether the predicted domains were highconfidence First we restricted the chains whose predic-tions we placed in the high confidence set to those whicheither (i) had 100 sequence identity with the target chainused to make the predictions or (ii) had exactly twodomains each composed of only one contiguous regionof residues We plan to extend the method to three ormore domain chains and multi-region domains in thefuture Second if any region found in the ATOMrecords of the chain was longer than 10 residues and notassigned to any domain we removed the chain from thehigh-confidence set Third we required the BLAST hitsused for the two domains in the chain not be to thesame target SCOP domain (on the theory that domainduplications are more likely to require a specialized algo-rithm or manual inspection to ensure no structuralchanges such as domain swapping)To fully classify these predictions in the SCOP hier-

archy we also had to assign levels below lsquoSuperfamilyrsquoBased on our benchmarking we developed heuristic rulesto classify domains at the lsquoFamilyrsquo lsquoProteinrsquo and lsquoSpeciesrsquolevels In cases where the protein or family could not bereliably matched to an existing SCOP entity we created anew protein or family called lsquoautomated matchesrsquo ratherthan risking inaccurate classification

Domain should include entire chain

Chain with Error from Manual Curation

Chain with Error from Manual Curation

Chain with Error from Manual Curation

Chain with Error from SCOP 173 Automated Curation

d1tqya2 d1e5ma2

One strand of beta sheet assignedto wrong domain by manual curation

e5ma2

Homolog (Correct)Alternative Structure (Correct)

d1oyvi_

d2p8qa1 d1qgka_

d1pjua1

Domain should be split into twodomains at this location

Homolog (Correct)

d1qzfa2d1seja2 fa2

Homolog (Correct)

Inconsistentinclusion offirst helixin domain

(a)

(c) (d)

(b)

Figure 1 Errors identified during benchmarking We detected errors in 70 manually curated domains by running benchmarking and manuallyinspecting predicted domains that did not sufficiently match the manually annotated domains These errors in domain boundaries in multi-domainchains were manually fixed in SCOPe 203 We also detected and fixed inconsistencies in 5054 domains that had been predicted and classified with theSCOP 173 automated method We review some of the types of errors detected (a) The SCOP 173 automated method used to predict domaind2p8qa1 had included approximately half the residues in the chain This was inconsistent with all other manually curated entries in its species-levelclade that included the entire chain (b) A strand of beta sheet was included in the d1tqya2 domain by manual curation (c) All of chain I from 1oyvhad been placed into a single domain (d) The manually curated domain d1seja2 excluded the first helix in the chain

Nucleic Acids Research 2014 Vol 42 Database issue D307

at University of C

alifornia Berkeley on A

ugust 19 2014httpnaroxfordjournalsorg

Dow

nloaded from

Thus far our focus has been on classifying chains thathave high sequence similarity with previously classifiedchains in SCOP and therefore have relied solely onsequence data The addition of structural informationshould expand the set of domains that can be classifiedwith high confidence in the future

BENCHMARKING RESULTS

To validate the new automated method we performedbenchmarking against all SCOP releases with stable iden-tifiers (ie releases 155ndash175) All PDB entries that wereadded between each pair of consecutive releases wereautomatically classified based on the earlier release andcompared with the manually curated domains in the sub-sequent release A predicted domain was consideredidentified and classified correctly if it was placed in thecorrect superfamily and its boundaries differed from themanually curated boundaries by no more than 10 residuesTable 1 lists the fraction of PDB entries that could beclassified using this method for each SCOP release Anon-trivial example of automated classification is shownin Figure 2 Our method predicted 20 048 domains thatmatched manually curated domains within the 10-residueerror tolerance and predicted 105 domains that differed bymore than 10 residues We reviewed all differences of morethan 10 residues and found they were either the result oferrors in SCOP 175 (discussed above) or in linker regionsin which neither the manually curated domain nor theprediction could be determined to be more accurate Wealso compared domains that had been added by theprevious automated method (4) which had added 30 852new domains in total to versions 173 and 175 Wecompared domains predicted by our new method tothose added by the old method Of the 15 660 domainpairs compared (the new automated method is more con-servative than the old one and therefore fewer domainsare classified) 125 were found to have domain boundariesthat differed by more than 10 residues (this is lower than

the 231 such domains we corrected in SCOPe discussedabove because the benchmark did not include additionalSCOPe domains based on manually curated domains fromSCOP 175) These differences were also the result oferrors in the previous automated method or ambiguouslinker regions

NEW WEBSITE

The SCOPe website offers a modest redesign of the SCOPwebsite presenting all SCOPe SCOP and ASTRAL datathrough a single unified web interface Many objects thatwere difficult to find in the original website such as thechange history are now available under tabs Thumbnailimages were automatically generated to show each domainon its own and in several structural contexts and these aredisplayed as part of the browser A fully JavaScriptJSmol-based viewer was added to enable visitors to viewdomains in three-dimensions in isolation in context ofthe chain or in context of the entire PDB structure TheSCOPe website can display data from all versions ofSCOPe SCOP and ASTRAL since release 155 All dataare stored in a relational (MySQL) database which is alsoavailable for download

STABLE AND PERIODIC UPDATES

Stable releases will potentially include manual curation ofthe SCOPe hierarchy correction of errors in previouslyclassified domains or changes to our classificationworkflow (eg validation of our automated sequence-based classification protocol with structure comparison)However in an effort to stay more closely synchronizedwith the PDB we also plan to supplement these stablereleases with periodic updates (approximately monthly)Our infrastructure automatically imports and classifiesnew PDB files on a weekly basis Starting with SCOPe202 we have begun to release periodic updates that addnewly released PDB entries to the SCOPe classification

Automated domainboundary at residue 225

Manually curated domainboundary at residue 223

d1vj5a2(grey)

d1vj5a1(color)

1

2 224

226

554

544

BLAST against DB of all SCOP domains

1vj5 chain A

hit 2 (d1ek1a2)

hit 1 (d1ek1a1)

query sequence

Figure 2 Automated curation example This figure depicts an example of applying the automated method for domain prediction and classification to1vj5 chain A released on 2004-04-27 We attempted to automatically classify it into SCOP 167 based only on domains defined in SCOP 165 1vj5Ahas 554 residues of which residues 2-547 are observed (found in the ATOM records in PDB data) Two significant BLAST hits were found to theclassified chain 1ek1A which has a distinct sequence from 1vj5A but also has 554 residues of which residues 4-19 48-66 and 90-544 are observedThe two BLAST hits include residues 2-224 and 226-544 in 1vj5A The final predicted domains in 1vj5A are 2-225 and 226-547 The manuallyannotated domains for 1vj5A are 2-223 and 224-547 Since the end of each predicted domain differs from the manually annotated domain by at most10 residues this domain prediction is deemed to fall within the error tolerance for validation

D308 Nucleic Acids Research 2014 Vol 42 Database issue

at University of C

alifornia Berkeley on A

ugust 19 2014httpnaroxfordjournalsorg

Dow

nloaded from

These periodic updates add new PDB entries to thecurrent release without affecting any previously classifieddomains The newly classified entries are visible in theweb interface and in downloadable files such as theSCOP-compatible parseable files and MySQL databaseSequences for the newly added chains and domains willnot be added to the ASTRAL representative subsets untilthe subsequent stable release of ASTRAL

The periodic updates are not intended to replace stablereleases because the latter are commonly used for bench-marking both will be available for download through theSCOPe website Stable releases will be assigned versionnumbers in the current format (explicitly labeled stableon the website and downloadable files eg 203-stable)while updates to stable releases will be named accordingto the most recent stable version appended with the releasedate (eg 203-2013-12-01)

ACKNOWLEDGEMENTS

We would like to thank all the other SCOP authorsAlexey G Murzin Antonina Andreeva Dave HoworthLoredana Lo Conte Bartlett G Ailey Tim JP Hubbardand Cyrus Chothia

FUNDING

This work is supported by the National Institutes ofHealth (NIH) [R01-GM073109] through the USDepartment of Energy under Contract No DE-AC02-05CH11231 Funding for open access charge NIH[R01-GM073109]

Conflict of interest statement None declared

REFERENCES

1 MurzinAG BrennerSE HubbardT and ChothiaC (1995)SCOP a structural classification of proteins database for theinvestigation of sequences and structures J Mol Biol 247536ndash540

2 BrennerSE ChothiaC HubbardTJ and MurzinAG (1996)Understanding protein structure using scop for foldinterpretation Methods Enzymol 266 635ndash643

3 Lo ConteL BrennerSE HubbardT ChothiaC andMurzinAG (2002) SCOP database in 2002 refinementsaccommodate structural genomics Nucleic Acids Res 30264ndash267

4 AndreevaA HoworthD ChandoniaJM BrennerSEHubbardTJ ChothiaC and MurzinAG (2008) Data growthand its impact on the SCOP database new developments NucleicAcids Res 36 D419ndashD425

5 BrennerSE KoehlP and LevittM (2000) The ASTRALcompendium for protein structure and sequence analysis NucleicAcids Res 28 254ndash256

6 ChandoniaJM WalkerNS Lo ConteL KoehlP LevittMand BrennerSE (2002) ASTRAL compendium enhancementsNucleic Acids Res 30 260ndash263

7 ChandoniaJM HonG WalkerNS Lo ConteL KoehlPLevittM and BrennerSE (2004) The ASTRAL Compendium in2004 Nucleic Acids Res 32 D189ndashD192

8 BermanHM WestbrookJ FengZ GillilandG BhatTNWeissigH ShindyalovIN and BournePE (2000) The ProteinData Bank Nucleic Acids Res 28 235ndash242

9 GoughJ KarplusK HugheyR and ChothiaC (2001)Assignment of homology to genome sequences using a library ofhidden Markov models that represent all proteins of knownstructure J Mol Biol 313 903ndash919

10 WilsonD PethicaR ZhouY TalbotC VogelC MaderaMChothiaC and GoughJ (2009) SUPERFAMILYmdashsophisticatedcomparative genomics data mining visualization and phylogenyNucleic Acids Res 37 D380ndashD386

11 CheekS QiY KrishnaSS KinchLN and GrishinNV(2004) SCOPmap automated assignment of proteinstructures to evolutionary superfamilies BMC Bioinformatics 5197

Nucleic Acids Research 2014 Vol 42 Database issue D309

at University of C

alifornia Berkeley on A

ugust 19 2014httpnaroxfordjournalsorg

Dow

nloaded from

Page 3: SCOPe: Structural Classification of Proteins—extended, …compbio.berkeley.edu/people/brenner/pubs/fox-2013-nar... · 2014. 8. 19. · determined protein structure. The hierarchy

BACKWARD COMPATIBILITY

To facilitate use of SCOPe data by SCOP and ASTRALusers we provide SCOPe and ASTRAL data in parseablefiles in the same formats as the previous SCOP andASTRAL releases SCOPe uses the same stable identifiers(eg sunid sid sccs) as were used for prior releases ofSCOP and ASTRAL (356) and the same protocols pre-viously used to assign new SCOP and ASTRAL identifiersare currently being used in SCOPe A history of allchanges between consecutive releases of SCOP andSCOPe is available on the SCOPe website

ADDING NEW ENTRIES AND CORRECTINGERRORS IN SCOP

Using our new automation procedure (described in thenext section) we have added 21 293 PDB files to SCOPethat were not classified in SCOP 175 (23 of the PDB)We have also modified some previous entries that weremanually curated or classified via a previous automatedmethod introduced in SCOP 173 (4) Based on carefulexamination of differences between automated classifica-tion and manually curated domains we detected and fixederrors in 70 manually curated domains in SCOP (out of105 detected with differences) of which 62 were changedby more than 10 residues The fraction of manually clas-sified domains in SCOP 175 with substantial errors wecould detect was 008 (62 of 80 140) reflecting the ex-tremely high accuracy of manual classification in theSCOP database Typical examples of common errors areshown in Figure 1 Often these errors were the result of atypo during manual curation and would result in theimproper inclusion or exclusion of a region of sequencefrom a domain We reassigned domain boundaries to 5054domains that had been classified with the SCOP 173 auto-mated method The vast majority of these were smallchanges but 231 differed substantially requiring adomain to be split into two or merged with another orits boundary to be changed by more than 10 residues Thisalso represents a low error rate 08 (231 of 30 660)although it is roughly 10 times higher than the error ratefor manually curated domains We also noticed a numberof both manually and automatically curated domains thathad been classified in the wrong lsquoProteinrsquo or lsquoSpeciesrsquoentry We have corrected 252 manually curated domainsand 6285 previous automatically classified domains

AUTOMATED CLASSIFICATION METHODS

We have implemented a new automated classification al-gorithm based on sequence similarity to previously clas-sified entries and benchmarked this method against allstable releases of SCOP from 157 to 175An automated method for classifying new PDB entries

based on previously classified entries was introduced inSCOP 173 (4) however rigorous benchmarking wasnot used to validate this method We have foundinconsistencies between manually curated domains andthose determined with this method Figure 1 includes anexample of such an inconsistency Other methods such as

SUPERFAMILY (910) and SCOPmap (11) have beendeveloped to automatically classify new PDB structuresin the context of SCOP and have the advantage ofoperating on proteins somewhat dissimilar from thosealready classified However when validated againstmanually curated domains these methods have errorrates that are too high to be suitable for incorporatingtheir results directly into SCOP For example SCOPmapreports a 5 error rate in benchmarking experiments Ouraim in developing a new automated classification algo-rithm was to maximize the number of domains thatcould be classified with an error rate indistinguishablefrom manual curation (see discussion of manualcuration error rate above)

The SCOP automated curation pipeline has been com-pletely rewritten in SCOPe We are replacing all domainsthat were predicted with the less accurate SCOP 173 auto-mated method with newer predictions Over the course ofbenchmarking our new method we found the methodwould have been able to automatically assign domainsfor approximately half of the PDB entries that weremanually curated in prior versions of SCOP with no sub-stantial differences (ie gt10 residue difference in anydomain boundary) between the automatically assignedand manually curated domain boundaries Over thecourse of benchmarking we found some differencesthat upon expert examination were determined to bethe result of manual curation errors in prior versions ofSCOP These are discussed later in the manuscript andwere corrected in SCOPe 203

One of the main challenges in automatically assigningdomains based on previously classified homologs is thatdifferent homologs may have different observed residues(ie those present in the ATOM records of PDB data)SCOP domains in a structure are defined relative to theobserved residues It is sometimes challenging to correctlyassign observed residues in a new structure that were notobserved in a previously classified structure Another chal-lenge is in variance in manually curated boundaries Wefound that in multi-domain chains with long linkerregions the exact boundaries between domains within afamily can differ substantially This is because fullymanually curated releases of SCOP (those before 173)aimed to classify every residue in each PDB structureeven though it is sometimes unclear to which if eitherof the adjacent domains residues in a linker regionshould be assigned

We now describe our algorithm for predicting domainsand classifying them in the SCOP hierarchy We firstcreate a BLAST database containing the SEQRES-basedsequences for each domain in SCOP and SCOPe Thenfor each newly released PDB chain we BLAST itsSEQRES sequence against the domain sequencedatabase BLAST performs local alignment returningthe start and end of the alignment (the lsquohitrsquo) for thequery sequence and the target sequence as well as theE-value We collect only the BLAST alignments wherethe E-value is at least as significant at 104 and the align-ment covers most of the target domain (defined as missingat most 10 residues from each end) We group the align-ments by the PDB chain that the targets belong to and

D306 Nucleic Acids Research 2014 Vol 42 Database issue

at University of C

alifornia Berkeley on A

ugust 19 2014httpnaroxfordjournalsorg

Dow

nloaded from

rank the groups by the total number of residues coveredby the hits on the query chain We then use the top-ranking group of BLAST-based alignments to annotatedomains in the query sequence If the nearest PDB chainend or gap in ATOM records is within 10 residues of analignment end we extend the domain to include theseresidues The purpose of extending the BLAST hit is toclassify every observed residue in the chain the 10-residuelimitation makes it very unlikely that the extension willinclude a new domain If the BLAST boundaries areoutside the observed residues in the query chain theassigned domain is shortened to include only observedresidues

After making predictions for each chain we applied aset of criteria to determine whether each domain predic-tion was high confidence (ie sufficiently accurate to beincluded in SCOPe without further manual inspection)We first exclude from high confidence those chains thatare low-resolution (3 A resolution or above) ribosomal orsynthetic (due to those being classified outside the firstseven classes of SCOP) or those that are homologous togenetic domains classified in SCOP (due to the difficulty ofcorrectly automating predictions that include multiplePDB chains)

In cases where an entire PDB chain was predicted to bea single SCOPe domain the prediction was deemed highconfidence if the target domain also comprised its entire

PDB chain For PDB chains that were divided intomultiple SCOPe domains additional criteria were usedto determine whether the predicted domains were highconfidence First we restricted the chains whose predic-tions we placed in the high confidence set to those whicheither (i) had 100 sequence identity with the target chainused to make the predictions or (ii) had exactly twodomains each composed of only one contiguous regionof residues We plan to extend the method to three ormore domain chains and multi-region domains in thefuture Second if any region found in the ATOMrecords of the chain was longer than 10 residues and notassigned to any domain we removed the chain from thehigh-confidence set Third we required the BLAST hitsused for the two domains in the chain not be to thesame target SCOP domain (on the theory that domainduplications are more likely to require a specialized algo-rithm or manual inspection to ensure no structuralchanges such as domain swapping)To fully classify these predictions in the SCOP hier-

archy we also had to assign levels below lsquoSuperfamilyrsquoBased on our benchmarking we developed heuristic rulesto classify domains at the lsquoFamilyrsquo lsquoProteinrsquo and lsquoSpeciesrsquolevels In cases where the protein or family could not bereliably matched to an existing SCOP entity we created anew protein or family called lsquoautomated matchesrsquo ratherthan risking inaccurate classification

Domain should include entire chain

Chain with Error from Manual Curation

Chain with Error from Manual Curation

Chain with Error from Manual Curation

Chain with Error from SCOP 173 Automated Curation

d1tqya2 d1e5ma2

One strand of beta sheet assignedto wrong domain by manual curation

e5ma2

Homolog (Correct)Alternative Structure (Correct)

d1oyvi_

d2p8qa1 d1qgka_

d1pjua1

Domain should be split into twodomains at this location

Homolog (Correct)

d1qzfa2d1seja2 fa2

Homolog (Correct)

Inconsistentinclusion offirst helixin domain

(a)

(c) (d)

(b)

Figure 1 Errors identified during benchmarking We detected errors in 70 manually curated domains by running benchmarking and manuallyinspecting predicted domains that did not sufficiently match the manually annotated domains These errors in domain boundaries in multi-domainchains were manually fixed in SCOPe 203 We also detected and fixed inconsistencies in 5054 domains that had been predicted and classified with theSCOP 173 automated method We review some of the types of errors detected (a) The SCOP 173 automated method used to predict domaind2p8qa1 had included approximately half the residues in the chain This was inconsistent with all other manually curated entries in its species-levelclade that included the entire chain (b) A strand of beta sheet was included in the d1tqya2 domain by manual curation (c) All of chain I from 1oyvhad been placed into a single domain (d) The manually curated domain d1seja2 excluded the first helix in the chain

Nucleic Acids Research 2014 Vol 42 Database issue D307

at University of C

alifornia Berkeley on A

ugust 19 2014httpnaroxfordjournalsorg

Dow

nloaded from

Thus far our focus has been on classifying chains thathave high sequence similarity with previously classifiedchains in SCOP and therefore have relied solely onsequence data The addition of structural informationshould expand the set of domains that can be classifiedwith high confidence in the future

BENCHMARKING RESULTS

To validate the new automated method we performedbenchmarking against all SCOP releases with stable iden-tifiers (ie releases 155ndash175) All PDB entries that wereadded between each pair of consecutive releases wereautomatically classified based on the earlier release andcompared with the manually curated domains in the sub-sequent release A predicted domain was consideredidentified and classified correctly if it was placed in thecorrect superfamily and its boundaries differed from themanually curated boundaries by no more than 10 residuesTable 1 lists the fraction of PDB entries that could beclassified using this method for each SCOP release Anon-trivial example of automated classification is shownin Figure 2 Our method predicted 20 048 domains thatmatched manually curated domains within the 10-residueerror tolerance and predicted 105 domains that differed bymore than 10 residues We reviewed all differences of morethan 10 residues and found they were either the result oferrors in SCOP 175 (discussed above) or in linker regionsin which neither the manually curated domain nor theprediction could be determined to be more accurate Wealso compared domains that had been added by theprevious automated method (4) which had added 30 852new domains in total to versions 173 and 175 Wecompared domains predicted by our new method tothose added by the old method Of the 15 660 domainpairs compared (the new automated method is more con-servative than the old one and therefore fewer domainsare classified) 125 were found to have domain boundariesthat differed by more than 10 residues (this is lower than

the 231 such domains we corrected in SCOPe discussedabove because the benchmark did not include additionalSCOPe domains based on manually curated domains fromSCOP 175) These differences were also the result oferrors in the previous automated method or ambiguouslinker regions

NEW WEBSITE

The SCOPe website offers a modest redesign of the SCOPwebsite presenting all SCOPe SCOP and ASTRAL datathrough a single unified web interface Many objects thatwere difficult to find in the original website such as thechange history are now available under tabs Thumbnailimages were automatically generated to show each domainon its own and in several structural contexts and these aredisplayed as part of the browser A fully JavaScriptJSmol-based viewer was added to enable visitors to viewdomains in three-dimensions in isolation in context ofthe chain or in context of the entire PDB structure TheSCOPe website can display data from all versions ofSCOPe SCOP and ASTRAL since release 155 All dataare stored in a relational (MySQL) database which is alsoavailable for download

STABLE AND PERIODIC UPDATES

Stable releases will potentially include manual curation ofthe SCOPe hierarchy correction of errors in previouslyclassified domains or changes to our classificationworkflow (eg validation of our automated sequence-based classification protocol with structure comparison)However in an effort to stay more closely synchronizedwith the PDB we also plan to supplement these stablereleases with periodic updates (approximately monthly)Our infrastructure automatically imports and classifiesnew PDB files on a weekly basis Starting with SCOPe202 we have begun to release periodic updates that addnewly released PDB entries to the SCOPe classification

Automated domainboundary at residue 225

Manually curated domainboundary at residue 223

d1vj5a2(grey)

d1vj5a1(color)

1

2 224

226

554

544

BLAST against DB of all SCOP domains

1vj5 chain A

hit 2 (d1ek1a2)

hit 1 (d1ek1a1)

query sequence

Figure 2 Automated curation example This figure depicts an example of applying the automated method for domain prediction and classification to1vj5 chain A released on 2004-04-27 We attempted to automatically classify it into SCOP 167 based only on domains defined in SCOP 165 1vj5Ahas 554 residues of which residues 2-547 are observed (found in the ATOM records in PDB data) Two significant BLAST hits were found to theclassified chain 1ek1A which has a distinct sequence from 1vj5A but also has 554 residues of which residues 4-19 48-66 and 90-544 are observedThe two BLAST hits include residues 2-224 and 226-544 in 1vj5A The final predicted domains in 1vj5A are 2-225 and 226-547 The manuallyannotated domains for 1vj5A are 2-223 and 224-547 Since the end of each predicted domain differs from the manually annotated domain by at most10 residues this domain prediction is deemed to fall within the error tolerance for validation

D308 Nucleic Acids Research 2014 Vol 42 Database issue

at University of C

alifornia Berkeley on A

ugust 19 2014httpnaroxfordjournalsorg

Dow

nloaded from

These periodic updates add new PDB entries to thecurrent release without affecting any previously classifieddomains The newly classified entries are visible in theweb interface and in downloadable files such as theSCOP-compatible parseable files and MySQL databaseSequences for the newly added chains and domains willnot be added to the ASTRAL representative subsets untilthe subsequent stable release of ASTRAL

The periodic updates are not intended to replace stablereleases because the latter are commonly used for bench-marking both will be available for download through theSCOPe website Stable releases will be assigned versionnumbers in the current format (explicitly labeled stableon the website and downloadable files eg 203-stable)while updates to stable releases will be named accordingto the most recent stable version appended with the releasedate (eg 203-2013-12-01)

ACKNOWLEDGEMENTS

We would like to thank all the other SCOP authorsAlexey G Murzin Antonina Andreeva Dave HoworthLoredana Lo Conte Bartlett G Ailey Tim JP Hubbardand Cyrus Chothia

FUNDING

This work is supported by the National Institutes ofHealth (NIH) [R01-GM073109] through the USDepartment of Energy under Contract No DE-AC02-05CH11231 Funding for open access charge NIH[R01-GM073109]

Conflict of interest statement None declared

REFERENCES

1 MurzinAG BrennerSE HubbardT and ChothiaC (1995)SCOP a structural classification of proteins database for theinvestigation of sequences and structures J Mol Biol 247536ndash540

2 BrennerSE ChothiaC HubbardTJ and MurzinAG (1996)Understanding protein structure using scop for foldinterpretation Methods Enzymol 266 635ndash643

3 Lo ConteL BrennerSE HubbardT ChothiaC andMurzinAG (2002) SCOP database in 2002 refinementsaccommodate structural genomics Nucleic Acids Res 30264ndash267

4 AndreevaA HoworthD ChandoniaJM BrennerSEHubbardTJ ChothiaC and MurzinAG (2008) Data growthand its impact on the SCOP database new developments NucleicAcids Res 36 D419ndashD425

5 BrennerSE KoehlP and LevittM (2000) The ASTRALcompendium for protein structure and sequence analysis NucleicAcids Res 28 254ndash256

6 ChandoniaJM WalkerNS Lo ConteL KoehlP LevittMand BrennerSE (2002) ASTRAL compendium enhancementsNucleic Acids Res 30 260ndash263

7 ChandoniaJM HonG WalkerNS Lo ConteL KoehlPLevittM and BrennerSE (2004) The ASTRAL Compendium in2004 Nucleic Acids Res 32 D189ndashD192

8 BermanHM WestbrookJ FengZ GillilandG BhatTNWeissigH ShindyalovIN and BournePE (2000) The ProteinData Bank Nucleic Acids Res 28 235ndash242

9 GoughJ KarplusK HugheyR and ChothiaC (2001)Assignment of homology to genome sequences using a library ofhidden Markov models that represent all proteins of knownstructure J Mol Biol 313 903ndash919

10 WilsonD PethicaR ZhouY TalbotC VogelC MaderaMChothiaC and GoughJ (2009) SUPERFAMILYmdashsophisticatedcomparative genomics data mining visualization and phylogenyNucleic Acids Res 37 D380ndashD386

11 CheekS QiY KrishnaSS KinchLN and GrishinNV(2004) SCOPmap automated assignment of proteinstructures to evolutionary superfamilies BMC Bioinformatics 5197

Nucleic Acids Research 2014 Vol 42 Database issue D309

at University of C

alifornia Berkeley on A

ugust 19 2014httpnaroxfordjournalsorg

Dow

nloaded from

Page 4: SCOPe: Structural Classification of Proteins—extended, …compbio.berkeley.edu/people/brenner/pubs/fox-2013-nar... · 2014. 8. 19. · determined protein structure. The hierarchy

rank the groups by the total number of residues coveredby the hits on the query chain We then use the top-ranking group of BLAST-based alignments to annotatedomains in the query sequence If the nearest PDB chainend or gap in ATOM records is within 10 residues of analignment end we extend the domain to include theseresidues The purpose of extending the BLAST hit is toclassify every observed residue in the chain the 10-residuelimitation makes it very unlikely that the extension willinclude a new domain If the BLAST boundaries areoutside the observed residues in the query chain theassigned domain is shortened to include only observedresidues

After making predictions for each chain we applied aset of criteria to determine whether each domain predic-tion was high confidence (ie sufficiently accurate to beincluded in SCOPe without further manual inspection)We first exclude from high confidence those chains thatare low-resolution (3 A resolution or above) ribosomal orsynthetic (due to those being classified outside the firstseven classes of SCOP) or those that are homologous togenetic domains classified in SCOP (due to the difficulty ofcorrectly automating predictions that include multiplePDB chains)

In cases where an entire PDB chain was predicted to bea single SCOPe domain the prediction was deemed highconfidence if the target domain also comprised its entire

PDB chain For PDB chains that were divided intomultiple SCOPe domains additional criteria were usedto determine whether the predicted domains were highconfidence First we restricted the chains whose predic-tions we placed in the high confidence set to those whicheither (i) had 100 sequence identity with the target chainused to make the predictions or (ii) had exactly twodomains each composed of only one contiguous regionof residues We plan to extend the method to three ormore domain chains and multi-region domains in thefuture Second if any region found in the ATOMrecords of the chain was longer than 10 residues and notassigned to any domain we removed the chain from thehigh-confidence set Third we required the BLAST hitsused for the two domains in the chain not be to thesame target SCOP domain (on the theory that domainduplications are more likely to require a specialized algo-rithm or manual inspection to ensure no structuralchanges such as domain swapping)To fully classify these predictions in the SCOP hier-

archy we also had to assign levels below lsquoSuperfamilyrsquoBased on our benchmarking we developed heuristic rulesto classify domains at the lsquoFamilyrsquo lsquoProteinrsquo and lsquoSpeciesrsquolevels In cases where the protein or family could not bereliably matched to an existing SCOP entity we created anew protein or family called lsquoautomated matchesrsquo ratherthan risking inaccurate classification

Domain should include entire chain

Chain with Error from Manual Curation

Chain with Error from Manual Curation

Chain with Error from Manual Curation

Chain with Error from SCOP 173 Automated Curation

d1tqya2 d1e5ma2

One strand of beta sheet assignedto wrong domain by manual curation

e5ma2

Homolog (Correct)Alternative Structure (Correct)

d1oyvi_

d2p8qa1 d1qgka_

d1pjua1

Domain should be split into twodomains at this location

Homolog (Correct)

d1qzfa2d1seja2 fa2

Homolog (Correct)

Inconsistentinclusion offirst helixin domain

(a)

(c) (d)

(b)

Figure 1 Errors identified during benchmarking We detected errors in 70 manually curated domains by running benchmarking and manuallyinspecting predicted domains that did not sufficiently match the manually annotated domains These errors in domain boundaries in multi-domainchains were manually fixed in SCOPe 203 We also detected and fixed inconsistencies in 5054 domains that had been predicted and classified with theSCOP 173 automated method We review some of the types of errors detected (a) The SCOP 173 automated method used to predict domaind2p8qa1 had included approximately half the residues in the chain This was inconsistent with all other manually curated entries in its species-levelclade that included the entire chain (b) A strand of beta sheet was included in the d1tqya2 domain by manual curation (c) All of chain I from 1oyvhad been placed into a single domain (d) The manually curated domain d1seja2 excluded the first helix in the chain

Nucleic Acids Research 2014 Vol 42 Database issue D307

at University of C

alifornia Berkeley on A

ugust 19 2014httpnaroxfordjournalsorg

Dow

nloaded from

Thus far our focus has been on classifying chains thathave high sequence similarity with previously classifiedchains in SCOP and therefore have relied solely onsequence data The addition of structural informationshould expand the set of domains that can be classifiedwith high confidence in the future

BENCHMARKING RESULTS

To validate the new automated method we performedbenchmarking against all SCOP releases with stable iden-tifiers (ie releases 155ndash175) All PDB entries that wereadded between each pair of consecutive releases wereautomatically classified based on the earlier release andcompared with the manually curated domains in the sub-sequent release A predicted domain was consideredidentified and classified correctly if it was placed in thecorrect superfamily and its boundaries differed from themanually curated boundaries by no more than 10 residuesTable 1 lists the fraction of PDB entries that could beclassified using this method for each SCOP release Anon-trivial example of automated classification is shownin Figure 2 Our method predicted 20 048 domains thatmatched manually curated domains within the 10-residueerror tolerance and predicted 105 domains that differed bymore than 10 residues We reviewed all differences of morethan 10 residues and found they were either the result oferrors in SCOP 175 (discussed above) or in linker regionsin which neither the manually curated domain nor theprediction could be determined to be more accurate Wealso compared domains that had been added by theprevious automated method (4) which had added 30 852new domains in total to versions 173 and 175 Wecompared domains predicted by our new method tothose added by the old method Of the 15 660 domainpairs compared (the new automated method is more con-servative than the old one and therefore fewer domainsare classified) 125 were found to have domain boundariesthat differed by more than 10 residues (this is lower than

the 231 such domains we corrected in SCOPe discussedabove because the benchmark did not include additionalSCOPe domains based on manually curated domains fromSCOP 175) These differences were also the result oferrors in the previous automated method or ambiguouslinker regions

NEW WEBSITE

The SCOPe website offers a modest redesign of the SCOPwebsite presenting all SCOPe SCOP and ASTRAL datathrough a single unified web interface Many objects thatwere difficult to find in the original website such as thechange history are now available under tabs Thumbnailimages were automatically generated to show each domainon its own and in several structural contexts and these aredisplayed as part of the browser A fully JavaScriptJSmol-based viewer was added to enable visitors to viewdomains in three-dimensions in isolation in context ofthe chain or in context of the entire PDB structure TheSCOPe website can display data from all versions ofSCOPe SCOP and ASTRAL since release 155 All dataare stored in a relational (MySQL) database which is alsoavailable for download

STABLE AND PERIODIC UPDATES

Stable releases will potentially include manual curation ofthe SCOPe hierarchy correction of errors in previouslyclassified domains or changes to our classificationworkflow (eg validation of our automated sequence-based classification protocol with structure comparison)However in an effort to stay more closely synchronizedwith the PDB we also plan to supplement these stablereleases with periodic updates (approximately monthly)Our infrastructure automatically imports and classifiesnew PDB files on a weekly basis Starting with SCOPe202 we have begun to release periodic updates that addnewly released PDB entries to the SCOPe classification

Automated domainboundary at residue 225

Manually curated domainboundary at residue 223

d1vj5a2(grey)

d1vj5a1(color)

1

2 224

226

554

544

BLAST against DB of all SCOP domains

1vj5 chain A

hit 2 (d1ek1a2)

hit 1 (d1ek1a1)

query sequence

Figure 2 Automated curation example This figure depicts an example of applying the automated method for domain prediction and classification to1vj5 chain A released on 2004-04-27 We attempted to automatically classify it into SCOP 167 based only on domains defined in SCOP 165 1vj5Ahas 554 residues of which residues 2-547 are observed (found in the ATOM records in PDB data) Two significant BLAST hits were found to theclassified chain 1ek1A which has a distinct sequence from 1vj5A but also has 554 residues of which residues 4-19 48-66 and 90-544 are observedThe two BLAST hits include residues 2-224 and 226-544 in 1vj5A The final predicted domains in 1vj5A are 2-225 and 226-547 The manuallyannotated domains for 1vj5A are 2-223 and 224-547 Since the end of each predicted domain differs from the manually annotated domain by at most10 residues this domain prediction is deemed to fall within the error tolerance for validation

D308 Nucleic Acids Research 2014 Vol 42 Database issue

at University of C

alifornia Berkeley on A

ugust 19 2014httpnaroxfordjournalsorg

Dow

nloaded from

These periodic updates add new PDB entries to thecurrent release without affecting any previously classifieddomains The newly classified entries are visible in theweb interface and in downloadable files such as theSCOP-compatible parseable files and MySQL databaseSequences for the newly added chains and domains willnot be added to the ASTRAL representative subsets untilthe subsequent stable release of ASTRAL

The periodic updates are not intended to replace stablereleases because the latter are commonly used for bench-marking both will be available for download through theSCOPe website Stable releases will be assigned versionnumbers in the current format (explicitly labeled stableon the website and downloadable files eg 203-stable)while updates to stable releases will be named accordingto the most recent stable version appended with the releasedate (eg 203-2013-12-01)

ACKNOWLEDGEMENTS

We would like to thank all the other SCOP authorsAlexey G Murzin Antonina Andreeva Dave HoworthLoredana Lo Conte Bartlett G Ailey Tim JP Hubbardand Cyrus Chothia

FUNDING

This work is supported by the National Institutes ofHealth (NIH) [R01-GM073109] through the USDepartment of Energy under Contract No DE-AC02-05CH11231 Funding for open access charge NIH[R01-GM073109]

Conflict of interest statement None declared

REFERENCES

1 MurzinAG BrennerSE HubbardT and ChothiaC (1995)SCOP a structural classification of proteins database for theinvestigation of sequences and structures J Mol Biol 247536ndash540

2 BrennerSE ChothiaC HubbardTJ and MurzinAG (1996)Understanding protein structure using scop for foldinterpretation Methods Enzymol 266 635ndash643

3 Lo ConteL BrennerSE HubbardT ChothiaC andMurzinAG (2002) SCOP database in 2002 refinementsaccommodate structural genomics Nucleic Acids Res 30264ndash267

4 AndreevaA HoworthD ChandoniaJM BrennerSEHubbardTJ ChothiaC and MurzinAG (2008) Data growthand its impact on the SCOP database new developments NucleicAcids Res 36 D419ndashD425

5 BrennerSE KoehlP and LevittM (2000) The ASTRALcompendium for protein structure and sequence analysis NucleicAcids Res 28 254ndash256

6 ChandoniaJM WalkerNS Lo ConteL KoehlP LevittMand BrennerSE (2002) ASTRAL compendium enhancementsNucleic Acids Res 30 260ndash263

7 ChandoniaJM HonG WalkerNS Lo ConteL KoehlPLevittM and BrennerSE (2004) The ASTRAL Compendium in2004 Nucleic Acids Res 32 D189ndashD192

8 BermanHM WestbrookJ FengZ GillilandG BhatTNWeissigH ShindyalovIN and BournePE (2000) The ProteinData Bank Nucleic Acids Res 28 235ndash242

9 GoughJ KarplusK HugheyR and ChothiaC (2001)Assignment of homology to genome sequences using a library ofhidden Markov models that represent all proteins of knownstructure J Mol Biol 313 903ndash919

10 WilsonD PethicaR ZhouY TalbotC VogelC MaderaMChothiaC and GoughJ (2009) SUPERFAMILYmdashsophisticatedcomparative genomics data mining visualization and phylogenyNucleic Acids Res 37 D380ndashD386

11 CheekS QiY KrishnaSS KinchLN and GrishinNV(2004) SCOPmap automated assignment of proteinstructures to evolutionary superfamilies BMC Bioinformatics 5197

Nucleic Acids Research 2014 Vol 42 Database issue D309

at University of C

alifornia Berkeley on A

ugust 19 2014httpnaroxfordjournalsorg

Dow

nloaded from

Page 5: SCOPe: Structural Classification of Proteins—extended, …compbio.berkeley.edu/people/brenner/pubs/fox-2013-nar... · 2014. 8. 19. · determined protein structure. The hierarchy

Thus far our focus has been on classifying chains thathave high sequence similarity with previously classifiedchains in SCOP and therefore have relied solely onsequence data The addition of structural informationshould expand the set of domains that can be classifiedwith high confidence in the future

BENCHMARKING RESULTS

To validate the new automated method we performedbenchmarking against all SCOP releases with stable iden-tifiers (ie releases 155ndash175) All PDB entries that wereadded between each pair of consecutive releases wereautomatically classified based on the earlier release andcompared with the manually curated domains in the sub-sequent release A predicted domain was consideredidentified and classified correctly if it was placed in thecorrect superfamily and its boundaries differed from themanually curated boundaries by no more than 10 residuesTable 1 lists the fraction of PDB entries that could beclassified using this method for each SCOP release Anon-trivial example of automated classification is shownin Figure 2 Our method predicted 20 048 domains thatmatched manually curated domains within the 10-residueerror tolerance and predicted 105 domains that differed bymore than 10 residues We reviewed all differences of morethan 10 residues and found they were either the result oferrors in SCOP 175 (discussed above) or in linker regionsin which neither the manually curated domain nor theprediction could be determined to be more accurate Wealso compared domains that had been added by theprevious automated method (4) which had added 30 852new domains in total to versions 173 and 175 Wecompared domains predicted by our new method tothose added by the old method Of the 15 660 domainpairs compared (the new automated method is more con-servative than the old one and therefore fewer domainsare classified) 125 were found to have domain boundariesthat differed by more than 10 residues (this is lower than

the 231 such domains we corrected in SCOPe discussedabove because the benchmark did not include additionalSCOPe domains based on manually curated domains fromSCOP 175) These differences were also the result oferrors in the previous automated method or ambiguouslinker regions

NEW WEBSITE

The SCOPe website offers a modest redesign of the SCOPwebsite presenting all SCOPe SCOP and ASTRAL datathrough a single unified web interface Many objects thatwere difficult to find in the original website such as thechange history are now available under tabs Thumbnailimages were automatically generated to show each domainon its own and in several structural contexts and these aredisplayed as part of the browser A fully JavaScriptJSmol-based viewer was added to enable visitors to viewdomains in three-dimensions in isolation in context ofthe chain or in context of the entire PDB structure TheSCOPe website can display data from all versions ofSCOPe SCOP and ASTRAL since release 155 All dataare stored in a relational (MySQL) database which is alsoavailable for download

STABLE AND PERIODIC UPDATES

Stable releases will potentially include manual curation ofthe SCOPe hierarchy correction of errors in previouslyclassified domains or changes to our classificationworkflow (eg validation of our automated sequence-based classification protocol with structure comparison)However in an effort to stay more closely synchronizedwith the PDB we also plan to supplement these stablereleases with periodic updates (approximately monthly)Our infrastructure automatically imports and classifiesnew PDB files on a weekly basis Starting with SCOPe202 we have begun to release periodic updates that addnewly released PDB entries to the SCOPe classification

Automated domainboundary at residue 225

Manually curated domainboundary at residue 223

d1vj5a2(grey)

d1vj5a1(color)

1

2 224

226

554

544

BLAST against DB of all SCOP domains

1vj5 chain A

hit 2 (d1ek1a2)

hit 1 (d1ek1a1)

query sequence

Figure 2 Automated curation example This figure depicts an example of applying the automated method for domain prediction and classification to1vj5 chain A released on 2004-04-27 We attempted to automatically classify it into SCOP 167 based only on domains defined in SCOP 165 1vj5Ahas 554 residues of which residues 2-547 are observed (found in the ATOM records in PDB data) Two significant BLAST hits were found to theclassified chain 1ek1A which has a distinct sequence from 1vj5A but also has 554 residues of which residues 4-19 48-66 and 90-544 are observedThe two BLAST hits include residues 2-224 and 226-544 in 1vj5A The final predicted domains in 1vj5A are 2-225 and 226-547 The manuallyannotated domains for 1vj5A are 2-223 and 224-547 Since the end of each predicted domain differs from the manually annotated domain by at most10 residues this domain prediction is deemed to fall within the error tolerance for validation

D308 Nucleic Acids Research 2014 Vol 42 Database issue

at University of C

alifornia Berkeley on A

ugust 19 2014httpnaroxfordjournalsorg

Dow

nloaded from

These periodic updates add new PDB entries to thecurrent release without affecting any previously classifieddomains The newly classified entries are visible in theweb interface and in downloadable files such as theSCOP-compatible parseable files and MySQL databaseSequences for the newly added chains and domains willnot be added to the ASTRAL representative subsets untilthe subsequent stable release of ASTRAL

The periodic updates are not intended to replace stablereleases because the latter are commonly used for bench-marking both will be available for download through theSCOPe website Stable releases will be assigned versionnumbers in the current format (explicitly labeled stableon the website and downloadable files eg 203-stable)while updates to stable releases will be named accordingto the most recent stable version appended with the releasedate (eg 203-2013-12-01)

ACKNOWLEDGEMENTS

We would like to thank all the other SCOP authorsAlexey G Murzin Antonina Andreeva Dave HoworthLoredana Lo Conte Bartlett G Ailey Tim JP Hubbardand Cyrus Chothia

FUNDING

This work is supported by the National Institutes ofHealth (NIH) [R01-GM073109] through the USDepartment of Energy under Contract No DE-AC02-05CH11231 Funding for open access charge NIH[R01-GM073109]

Conflict of interest statement None declared

REFERENCES

1 MurzinAG BrennerSE HubbardT and ChothiaC (1995)SCOP a structural classification of proteins database for theinvestigation of sequences and structures J Mol Biol 247536ndash540

2 BrennerSE ChothiaC HubbardTJ and MurzinAG (1996)Understanding protein structure using scop for foldinterpretation Methods Enzymol 266 635ndash643

3 Lo ConteL BrennerSE HubbardT ChothiaC andMurzinAG (2002) SCOP database in 2002 refinementsaccommodate structural genomics Nucleic Acids Res 30264ndash267

4 AndreevaA HoworthD ChandoniaJM BrennerSEHubbardTJ ChothiaC and MurzinAG (2008) Data growthand its impact on the SCOP database new developments NucleicAcids Res 36 D419ndashD425

5 BrennerSE KoehlP and LevittM (2000) The ASTRALcompendium for protein structure and sequence analysis NucleicAcids Res 28 254ndash256

6 ChandoniaJM WalkerNS Lo ConteL KoehlP LevittMand BrennerSE (2002) ASTRAL compendium enhancementsNucleic Acids Res 30 260ndash263

7 ChandoniaJM HonG WalkerNS Lo ConteL KoehlPLevittM and BrennerSE (2004) The ASTRAL Compendium in2004 Nucleic Acids Res 32 D189ndashD192

8 BermanHM WestbrookJ FengZ GillilandG BhatTNWeissigH ShindyalovIN and BournePE (2000) The ProteinData Bank Nucleic Acids Res 28 235ndash242

9 GoughJ KarplusK HugheyR and ChothiaC (2001)Assignment of homology to genome sequences using a library ofhidden Markov models that represent all proteins of knownstructure J Mol Biol 313 903ndash919

10 WilsonD PethicaR ZhouY TalbotC VogelC MaderaMChothiaC and GoughJ (2009) SUPERFAMILYmdashsophisticatedcomparative genomics data mining visualization and phylogenyNucleic Acids Res 37 D380ndashD386

11 CheekS QiY KrishnaSS KinchLN and GrishinNV(2004) SCOPmap automated assignment of proteinstructures to evolutionary superfamilies BMC Bioinformatics 5197

Nucleic Acids Research 2014 Vol 42 Database issue D309

at University of C

alifornia Berkeley on A

ugust 19 2014httpnaroxfordjournalsorg

Dow

nloaded from

Page 6: SCOPe: Structural Classification of Proteins—extended, …compbio.berkeley.edu/people/brenner/pubs/fox-2013-nar... · 2014. 8. 19. · determined protein structure. The hierarchy

These periodic updates add new PDB entries to thecurrent release without affecting any previously classifieddomains The newly classified entries are visible in theweb interface and in downloadable files such as theSCOP-compatible parseable files and MySQL databaseSequences for the newly added chains and domains willnot be added to the ASTRAL representative subsets untilthe subsequent stable release of ASTRAL

The periodic updates are not intended to replace stablereleases because the latter are commonly used for bench-marking both will be available for download through theSCOPe website Stable releases will be assigned versionnumbers in the current format (explicitly labeled stableon the website and downloadable files eg 203-stable)while updates to stable releases will be named accordingto the most recent stable version appended with the releasedate (eg 203-2013-12-01)

ACKNOWLEDGEMENTS

We would like to thank all the other SCOP authorsAlexey G Murzin Antonina Andreeva Dave HoworthLoredana Lo Conte Bartlett G Ailey Tim JP Hubbardand Cyrus Chothia

FUNDING

This work is supported by the National Institutes ofHealth (NIH) [R01-GM073109] through the USDepartment of Energy under Contract No DE-AC02-05CH11231 Funding for open access charge NIH[R01-GM073109]

Conflict of interest statement None declared

REFERENCES

1 MurzinAG BrennerSE HubbardT and ChothiaC (1995)SCOP a structural classification of proteins database for theinvestigation of sequences and structures J Mol Biol 247536ndash540

2 BrennerSE ChothiaC HubbardTJ and MurzinAG (1996)Understanding protein structure using scop for foldinterpretation Methods Enzymol 266 635ndash643

3 Lo ConteL BrennerSE HubbardT ChothiaC andMurzinAG (2002) SCOP database in 2002 refinementsaccommodate structural genomics Nucleic Acids Res 30264ndash267

4 AndreevaA HoworthD ChandoniaJM BrennerSEHubbardTJ ChothiaC and MurzinAG (2008) Data growthand its impact on the SCOP database new developments NucleicAcids Res 36 D419ndashD425

5 BrennerSE KoehlP and LevittM (2000) The ASTRALcompendium for protein structure and sequence analysis NucleicAcids Res 28 254ndash256

6 ChandoniaJM WalkerNS Lo ConteL KoehlP LevittMand BrennerSE (2002) ASTRAL compendium enhancementsNucleic Acids Res 30 260ndash263

7 ChandoniaJM HonG WalkerNS Lo ConteL KoehlPLevittM and BrennerSE (2004) The ASTRAL Compendium in2004 Nucleic Acids Res 32 D189ndashD192

8 BermanHM WestbrookJ FengZ GillilandG BhatTNWeissigH ShindyalovIN and BournePE (2000) The ProteinData Bank Nucleic Acids Res 28 235ndash242

9 GoughJ KarplusK HugheyR and ChothiaC (2001)Assignment of homology to genome sequences using a library ofhidden Markov models that represent all proteins of knownstructure J Mol Biol 313 903ndash919

10 WilsonD PethicaR ZhouY TalbotC VogelC MaderaMChothiaC and GoughJ (2009) SUPERFAMILYmdashsophisticatedcomparative genomics data mining visualization and phylogenyNucleic Acids Res 37 D380ndashD386

11 CheekS QiY KrishnaSS KinchLN and GrishinNV(2004) SCOPmap automated assignment of proteinstructures to evolutionary superfamilies BMC Bioinformatics 5197

Nucleic Acids Research 2014 Vol 42 Database issue D309

at University of C

alifornia Berkeley on A

ugust 19 2014httpnaroxfordjournalsorg

Dow

nloaded from