Charting online OMICS resources: A navigational chart for clinical researchers



REVIEW

Charting online OMICS resources:

A navigational chart for clinical researchers

Juan Antonio Vizcaíno*, Michael Mueller*, Henning Hermjakob and Lennart Martens

EMBL Outstation, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK

The life sciences have sprouted several popular and successful OMICS technologies that span all levels of biological information transfer. Ever since the start of the Human Genome Project, the then revolutionary idea to make all resulting data publicly available has been central to all of the efforts across OMICS technologies. As a result, a great variety of publicly available data repositories and resources is currently available to the research community. This widespread availability of data does come at the price of increased confusion on the part of the users, especially for those that see the OMICS technologies as tools to help unravel a larger biological or clinical question. We therefore provide a comprehensive overview of the available resources across OMICS fields, with a special emphasis on those databases that are relevant to the study of proteins. Additionally, we also describe various integrative systems that have been established, and highlight new developments in the field that can revolutionize the way in which live data integration is achieved over the internet.

Received: March 30, 2008
Revised: May 27, 2008

Accepted: June 17, 2008

Keywords:

Bioinformatics / Databases / Proteomics

18 Proteomics Clin. Appl. 2009, 3, 18–29

1 Introduction

The field of biology has been thoroughly influenced by the gradual build-up of a toolbox of high-throughput OMICS technologies that span all levels of information transfer in living organisms. Genes and their regulation are analyzed by genomics, RNA transcripts have become the focus of transcriptomics, and the translated proteins and their co-translational modification or PTM are the subject of proteomics. The workings of the proteins in the cell can further be analyzed by looking at the metabolites they generate, bringing us to the field of metabolomics. These OMICS technologies all owe their success to the availability of improved instrumentation, and to the results obtained by the genome sequencing efforts. The massive availability of sequences that bootstrapped the field of genomics also first introduced the community to the informatics challenges that all OMICS technologies have faced: how to manage, analyze, and disseminate the vast amounts of data that are generated by the application of these technologies. While the community has tackled the data management and analysis challenges by building a variety of sophisticated databases and software tools, probably the most interesting innovation consisted of the policy to provide the finalized results freely to the community via the internet. As a direct consequence of this revolutionary data dissemination policy (similar information on chemical entities remains largely proprietary, for instance), a plethora of databases and resources have come into existence over the past years. Due to this vast number of resources however, it is not always easy for researchers to quickly find the appropriate resource to consult for a particular piece of information, thus proving that availability of data does not necessarily imply easy access to this data. In this review, we therefore present a comprehensive overview of the major publicly available resources across the OMICS fields. We also highlight several resources that integrate information from different resources, including emerging new approaches for data integration. Since the study of proteins provides unique information that can be extremely valuable to clinical applications (see for instance the two recent review issues of Proteomics Clinical Applications [1, 2]), special attention is given to proteomics resources in this review. As prime effectors in the cell, proteins form the bridge between the informational and metabolic branches of the OMICS technologies, with genomics and transcriptomics studying the transfer of information towards the construction of proteins, which in turn perform the metabolic tasks that can be monitored via metabolomics.

Correspondence: Dr. Lennart Martens, EMBL Outstation, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
E-mail: [email protected]
Fax: +44-1223-494-484

Abbreviations: DAS, Distributed Annotation System; EBI, European Bioinformatics Institute; EMBL, European Molecular Biology Laboratory; EST, expressed sequence tag; GOLD, Genomes On Line Database; HMDB, Human Metabolome Database; NCBI, National Center for Biotechnology Information; PDB, Protein Data Bank; PICR, Protein Identifier Cross-Reference; UniProtKB, UniProt Knowledgebase

* Both these authors contributed equally to this work.

DOI 10.1002/prca.200800082

© 2009 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.clinical.proteomics-journal.com

In principle, proteomics data can be divided into three categories: protein sequence information, experimental data, and functional annotations. Before the application of MS to protein analysis, direct sequencing of proteins using the Edman degradation technique used to be the standard mode of protein identification. MS has enabled rapid, high-throughput identification of proteins in complex mixtures.

Experimental data are generated by a wide range of experimental approaches, including yeast-two-hybrid screens and coimmunoprecipitation experiments to study protein interactions, or MS analyses to identify and quantify proteins in complex mixtures and study their PTMs. Finally, functional annotations are derived from the experimental results either manually (very reliable but time- and resource-consuming), or automatically, through computational analysis (typically less reliable, but achieving much greater throughput).

Clinical researchers are probably most interested in the knowledge (i.e., the annotations), although often a particular annotation of special interest might be worth tracing back to the underlying experimental results for additional validation analysis or better understanding.

Throughout this review, Fig. 1 can be used as a guide to position the resources in the overall flow of information. In this figure, a distinction is also made between primary, secondary, and integrative resources by means of three columns. Primary resources are closer to the original data and experimental results, and can be considered low-level databases primarily aimed at experts or for detailed data analysis. Secondary resources provide substantial additional information on top of the primary data, and can often be extensively crosslinked to primary and even other secondary resources. As such, secondary resources provide useful starting points when an entity is known (e.g., a protein) and additional information about this entity is sought. Finally, integrative resources represent efforts to collapse relevant data across OMICS fields into one view. Compared to primary or secondary resources, these systems usually present quite different visualizations of their data. Integrative resources can thus provide insight on the Systems Biology level, and allow the tracking of an entity throughout the different levels of biological information transfer. It should be noted that the boundaries between the different types of resources are often quite vague, which can sometimes interfere with the clear categorization of a database.

Additionally, we also provide a table as Supporting Information that summarizes all the resources that are cited in the text, including their corresponding web addresses (URLs).

2 General biological databases

2.1 Genomics databases

There are three primary nucleotide databases in the world: GenBank in the USA [3], the European Molecular Biology Laboratory Nucleotide Sequence Database (EMBL) in Europe [4], and the DNA DataBank of Japan (DDBJ) [5]. Together, these three databases comprise the International Nucleotide Sequence Database Collaboration (INSDC), a long-standing consortium in which nucleotide data is exchanged daily to ensure a uniform and comprehensive collection of sequence information in each participating resource. However, custom substructuring of the database into sections is not necessarily replicated between INSDC databases, and is instead decided internally by each member. For example, in order to facilitate searches, the National Center for Biotechnology Information (NCBI) has decided to split GenBank into three subset databases called: “GSS,” “EST,” and “CoreNucleotide.” GSS records contain first-pass, single-read genomic sequences, and rarely include annotated biological features. The EST database contains all expressed sequence tag (EST) data included in GenBank and finally, “CoreNucleotide” contains sequences from all remaining divisions of GenBank [6].

From a clinical perspective however, it may not even be known initially whether the genome sequence of a particular organism of interest (e.g., a pathogenic microorganism) is available or not. An answer to this question can quickly be found in the Genomes On Line Database (GOLD) [7], which is a highly valuable, independent resource that contains detailed information about ongoing efforts in genome, EST, and metagenome sequencing projects worldwide. In addition to sequence data, GOLD also contains information about the institutions involved, the funding agencies behind them, and relevant publications for each project. At the moment of writing, GOLD contains data on 4135 sequencing projects.

2.2 Transcriptomics

ArrayExpress in Europe [8], Gene Expression Omnibus (GEO) in the USA [9], and CiBEX in Japan [10] are the three databases recommended by the Microarray and Gene Expression Data (MGED) society [11] for depositions of publication-related microarray data. ArrayExpress is divided into two parts: the ArrayExpress repository, a public archive of microarray data, and the ArrayExpress Data Warehouse, which is a database of gene expression profiles that were selected from the repository and consistently reannotated [8].

Figure 1. This figure summarizes the various resources discussed in this review, and organizes them by their biological information transfer level (arrow on left) as well as by the level of additional annotation they contain (columns). See main text for details.

Another important source of transcriptomics data is EST data. EST-based projects are underway for numerous organisms, generating millions of short (200–800 bp in length), single-pass nucleotide sequence reads derived from cDNA libraries. At present, ESTs are used for, among other purposes, gene discovery, genome annotation, gene structure identification, the study of alternative transcription, single nucleotide polymorphism (SNP) characterization and proteome analysis [12]. As described above, the three major nucleotide databases also have major sections devoted to ESTs. Additionally, the NCBI’s UniGene [6] aims to create a database of unique genes and represents a nonredundant set of gene-oriented clusters generated from ESTs. UniGene clusters are created for organisms for which there are 70 000 or more ESTs in GenBank, and have been used as a source of unique sequences in the fabrication of microarrays.

A database of cancer-related EST sequences from human tumors and the corresponding healthy tissues was set up as part of NCI’s Cancer Genome Anatomy Project. The project aims to determine gene expression in the different stages of cancer development with the long-term goal to improve diagnostics and treatment [13].

Microarray-based expression profiles are accessible through the GNF Symatlas, which provides access to expression patterns of protein-coding genes across 79 human tissues [14].

2.3 Metabolomics

The metabolome is the complete complement of all small molecule (<1500 Da) metabolites found in a specific cell, organ or organism. As such, metabolomics could be defined as the high-throughput identification, quantification, and characterization of the small molecule metabolites in the metabolome [15]. The Human Metabolome Project (HMP) was launched in 2004 in order to identify and quantify all detectable metabolites at a concentration higher than 1 µM in the human body. Additionally, the Human Metabolome Database (HMDB) was created. The most relevant feature of the HMDB from a clinical point of view concerns its extensive links to additional information on metabolic diseases, normal and abnormal metabolite concentration ranges (in many different biofluids like urine, blood, or cerebrospinal fluid), mutation/SNP data, and to the genes, enzymes, reactions, and pathways associated with many diseases of interest [16]. At the moment of writing the HMDB contains 6586 metabolite entries, which are linked to 28 unique pathways.

3 Chemical compounds databases

Although chemical databases do not strictly cover a particular OMICS field, we have decided to include them briefly in this review due to their obvious and potentially great importance for clinical users. PubChem [6] was launched in 2004 by the US National Institutes of Health (NIH). It is divided into three connected databases (PubChem Compound, PubChem Substance, and PubChem BioAssay) that constitute a highly reputable and versatile resource [17].

DrugBank [18] is a manually curated resource that contains information about drugs linked to protein or drug target sequences. Each database entry is called a DrugCard, and contains information not only devoted to drug/chemical data, but also to pharmacological, pharmacogenomic, and molecular biology data.

Other public chemical databases are DSSTOX [19] (containing toxicity information), eMolecules (www.emolecules.com), NMRShiftDB [20] (including chemical structures and associated NMR shift assignments), ChemSpider [21], KEGG [22] (which will be treated in more detail below in this text), and ChEBI [23]. For a complete review covering the capabilities of different chemical repositories see [17].

4 The unique information available through proteomics data

Excitement about the importance and potential contributions of microarray technology to biology and medicine has been intense. However, despite many successes, this technology necessarily carries certain limitations. These can be related to technological issues such as poor specificity and/or reproducibility, or the need to develop better tools for data analysis (reviewed in [24]). Perhaps more importantly, microarrays are not able to predict changes in active, mature protein levels in a quantitative way. Quite often the results obtained in microarray and Real-Time PCR and/or Western blot experiments are contradictory. A variety of reasons can explain these discrepancies, for instance the fact that microarrays cannot take into account protein degradation rates [25]. Lu et al. [26] found that only slightly more than 70% of Saccharomyces cerevisiae, and about half of Escherichia coli protein levels are determined by transcriptional regulation. Microarray and proteomics measurements can thus be considered to be complementary techniques. Indeed, a combination of both OMICS technologies would ideally enable the researcher to differentiate between changes due to transcriptional regulation, and changes that are related to alterations in translation rates or protein stability [25].

Another topic that is only accessible to proteomics is the large-scale measurement of co- or post-translational protein modifications and their quantitative changes upon perturbations to the cell. For example, Olsen et al. [27] measured the global level of protein phosphorylation in HeLa cells in response to epidermal growth factor stimulation.


Finally, proteomics can be used in clinical medicine for the identification of biomarkers. A biomarker could be defined as a measurable or assessable entity that provides diagnostic, prognostic, or treatment-oriented information, which can drive patient care. A potential biomarker must fulfill four criteria: it must be easily attainable, confer adequate sensitivity, retain adequate specificity, and it must lead to patient benefit [28]. While microarray-based approaches have also been used for biomarker discovery (e.g., see [29]), their clinical usefulness may ultimately be more limited than proteomics-based approaches. Indeed, apart from the inability of microarrays to detect important changes in protein modification or proteolytic degradation, proteomics also has access to secreted, circulating proteins that present highly convenient targets for detection and quantification. Blood, or the plasma or serum derived from it, is a logical source for biomarker detection because it is exposed to all organs of the body. As such, the circulatory system can present an integrated view on the state of the organism at any given time. However, this integrative aspect can also confound analysis in certain cases as a weak signal may be swamped by non-disease related fluctuations. Other proximal body fluids can therefore also be used as a source. These include urine, saliva, and cerebrospinal fluid (CSF). Many reviews about the application of proteomics for the discovery and detection of biomarkers are available in the literature [28, 30–32].

5 Current proteomics data resources

UniProt (The UNIversal PROTein resource) is the most comprehensive data repository on protein sequence and functional annotation. It is maintained by a collaboration of the Swiss Institute of Bioinformatics (SIB), the Protein Information Resource (PIR), and the European Bioinformatics Institute (EBI) [33]. It has four components, each of them optimized for different user profiles: the UniProt Knowledgebase (UniProtKB), the UniProt Reference Clusters (UniRef), the UniProt archive (UniParc), and the UniProt Metagenomic and Environmental Sequences database (UniMES).

The first component, UniProtKB, is divided into two sections: UniProtKB/Swiss-Prot [34] and UniProtKB/TrEMBL. The difference between them is found in the quality of annotations. UniProtKB/Swiss-Prot contains high quality annotation extracted from the literature and computational analyses curated by experts. The annotations include, amongst others: protein function(s), protein domains and sites, PTMs, subcellular location(s), tissue specificity, structure, interactions, and diseases associated with deficiencies or abnormalities. UniProtKB/TrEMBL on the other hand contains the translations of all coding sequences (CDS) present in the EMBL/GenBank/DDBJ nucleotide sequence databases, excluding some types of data such as pseudogenes. UniProtKB/TrEMBL records are annotated automatically based on computational analyses.

At the moment of writing, UniProtKB includes cross-references from 124 external databases, some of which are clearly of clinical interest. These include the Online Mendelian Inheritance in Man (OMIM) database [35] (treated in more detail below in this text), as well as several genomic databases from potential pathogens such as BuruList (Mycobacterium ulcerans), EchoBase [36] and Ecogene [37] (E. coli), LegioList (Legionella pneumophila) [38], MypuList (Mycoplasma pulmonis), TubercuList (M. tuberculosis), Stygene (Salmonella typhimurium), or the European Hepatitis C Virus database [39]. Additionally, a special effort has been made over the last years to annotate viruses in UniProtKB/Swiss-Prot, with a particular focus on important human pathogens such as HIV, Influenza, Hepatitis C, Rabies, SARS, Ebola, Dengue, and Yellow Fever viruses [33].

The second component of UniProt is called UniRef and provides clustered sets of all sequences from the UniProtKB database and selected UniProt Archive records to obtain complete coverage of sequences at different resolutions (100, 90, and 50% sequence identity), while hiding redundant sequences [40]. The third component, the UniProt archive (UniParc) is a repository that reflects the history of all protein sequences [41]. The last component, UniMES, was recently added and contains data from metagenomic projects such as the Global Ocean Sampling Expeditions (GOS) [42].
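To give a feel for what clustering at different identity resolutions means in practice, the following Python sketch greedily groups a handful of sequences at 100, 90, and 50% thresholds. It is only an illustration and not the procedure used to build UniRef: the similarity function is a crude stand-in based on Python's difflib rather than an alignment-based sequence identity, and all accessions and sequences are hypothetical.

```python
from __future__ import annotations
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Crude stand-in for sequence identity (assumption: difflib ratio, no real alignment)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(seqs: dict[str, str], threshold: float) -> dict[str, list[str]]:
    """Assign each sequence to the first representative it matches at >= threshold."""
    clusters: dict[str, list[str]] = {}
    # Visit longer sequences first so that each representative is the longest member.
    for acc, seq in sorted(seqs.items(), key=lambda kv: (-len(kv[1]), kv[0])):
        for rep in clusters:
            if similarity(seqs[rep], seq) >= threshold:
                clusters[rep].append(acc)
                break
        else:
            clusters[acc] = [acc]  # no match: this sequence starts a new cluster
    return clusters


# Hypothetical accessions and sequences, for illustration only.
toy = {
    "P1": "MKTAYIAKQRQISFVKSHFS",
    "P2": "MKTAYIAKQRQISFVKSHFS",  # identical to P1
    "P3": "MKTAYIAKQRQISFVKSHFA",  # one residue differs from P1
    "P4": "GGGGGGGGGGGGGGGGGGGG",  # unrelated
    "P5": "MKTAYIAKQRWWWWWWWWWW",  # shares only the first half with P1
}

clusters_by_level = {t: greedy_cluster(toy, t) for t in (1.0, 0.9, 0.5)}
for level, clusters in clusters_by_level.items():
    print(f"{int(level * 100)}%: {clusters}")
```

Lower thresholds merge more sequences into fewer clusters, which is the sense in which redundant sequences are hidden at coarser resolutions.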

Whereas UniProt contains protein data with no species restrictions (including for instance data from pathogens and disease vectors), the Human Protein Reference database [43] (HPRD) contains only human protein data. This protein information resource provides curated extensive information including domain architecture, protein functions, protein–protein interactions, PTMs, enzyme–substrate relationships, subcellular localization, tissue expression, and disease association of genes. The recently released Human Proteinpedia is the community annotated part of HPRD and contains protein data from a diversity of platforms [44].

All of the above resources provide additional annotation or crossreferences relevant to clinical researchers on top of the protein sequences. These databases should therefore not be confused with more technical, sequence-centric databases that specifically aim to provide comprehensive coverage of a proteome in protein sequence terms. Usually these latter databases serve a more technical use, such as supporting BLAST or MS searches. Examples of this type of database include the NCBI’s nonredundant database, and the EBI’s International Protein Index (IPI) [45].

5.1 MS data repositories

MS is the main method for the identification and quantification of proteins. Therefore, several repositories have been established to store protein and peptide identifications derived from MS data. The three main resources in the field are GPMDB [46], PeptideAtlas [47], and the Proteomics IDEntifications database (PRIDE) [48]. Additional resources include PepSeeker [49], the Genome Annotating Proteomic Pipeline (GAPP) [50], Tranche (http://tranche.proteomecommons.org), MAPU [51], OPD [52], and the previously cited Human Proteinpedia [44]. For a complete review covering the capabilities of different proteomic MS repositories see [53].

The abovementioned three main repositories are all web-based portals that can be used for data mining, data visualization, data sharing, and crossvalidation. Currently, there is a need for increased data annotation in these resources, for example regarding the separation techniques, instrumentation, identification software, and sequence databases used. PRIDE is the most complete database in terms of metadata associated with peptide identifications, since it contains numerous experimental details of the protocols followed by the submitters [53]. Additionally, a Venn diagram tool allows across-experiment comparison of protein identifications stored in PRIDE [48]. The detailed metadata in PRIDE has enabled analyses of large datasets which have proven to yield very interesting information for the field [54, 55].

5.2 Specialized protein databases

In addition to those cited previously, there are numerous specialized protein databases available to the scientific community. Resources devoted to 3-D structures of proteins and other biological macromolecules include the Protein Data Bank (PDB, USA) [56], the Macromolecular Structure Database (MSD, Europe) [57], PDBj (Japan), and the Biological Magnetic Resonance Data Bank (BMRB, USA) [58]. All of these belong to the wwPDB consortium [59, 60].

A number of resources capture information on protein interactions, for instance IntAct [61], MINT [62], DIP [63], BIND [64], MPact [65], and BioGrid [66]. See [67] for more information on the particularities of these databases. All of them are now members of the IMEx consortium [68], which is committed to the exchange of interaction data as well as the coordination of literature curation efforts across the resources. As a result, data content in these resources is becoming more and more integrated and consistent, while their combined coverage is increasing more quickly.

Furthermore, there are highly specialized protein family databases, generally maintained by experts in the field, which contain information related to a specific family or group of proteins. For an exhaustive list, see [69]. For example, there are databases devoted to peptidases (MEROPS [70]), kinases (KinG [71]), or membrane transporters (TransportDB [72]). For clinicians, some particularly interesting resources are the Defensins Knowledgebase [73] and Peptaibol [74] databases, both devoted to antimicrobial peptides, and FUNPEP (http://swift.cmbi.kun.nl/swift/FUNPEP/gergo/), which stores information on low-complexity peptides capable of forming amyloid plaque in Alzheimer’s disease.

Another resource that should be mentioned due to its relevance to clinical research is the Human Protein Atlas [75]. The project was initiated in 2003 with the aim to produce a complete atlas of human protein expression profiles in healthy and cancer tissues. The process includes the generation of antibodies, analysis of selected samples, validation of the results, and publication of the obtained data via a public web site. At the moment of writing (version 4.1) the Human Protein Atlas contains images generated by target specific staining of human tissues with more than 6000 different antibodies.

PhosphoSitePlus (www.phosphosite.org) is a database of experimentally determined protein modifications allowing the user to browse curated MS/MS experiments by disease. Finally, the Human 2-D PAGE databases for proteome analyses in health and disease (http://proteomics.cancer.dk) contains 2-D PAGE data and images from breast and bladder cancer and various other tissues.

The last group of protein related databases discussed in this review are the protein signature databases. These resources are built on the assumption that proteins with similar sequences should have similar biological functions. For a group of related sequences that contains one or more sequences with known function(s), the functional annotation is transitively inferred for the uncharacterized members of the group. There are a number of signature protein databases [76–85], which are all conveniently integrated into a single resource: InterPro [86]. Signatures from different databases that describe the same domain, family, repeat, active site, binding site, or PTM are grouped into single InterPro entries. Additionally, each InterPro entry contains high-quality manual annotation providing useful information about each protein signature [87]. Protein matches are calculated using InterProScan [88], a tool that is also available for user query sequence searches (http://www.ebi.ac.uk/InterProScan/).
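The transitive inference described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how InterPro or any of its member databases is implemented: the signature names, protein identifiers, and function labels are all hypothetical, and group membership is simply given as input rather than computed from sequence matches.

```python
from __future__ import annotations


def propagate_annotations(
    groups: dict[str, list[str]],
    known: dict[str, list[str]],
) -> dict[str, list[str]]:
    """Infer functions for uncharacterized proteins from annotated members of the same group."""
    inferred: dict[str, set[str]] = {}
    for signature, members in groups.items():
        # Collect every function known for any member of this signature group.
        functions = {f for m in members for f in known.get(m, ())}
        if not functions:
            continue  # nothing to transfer if no member is characterized
        for m in members:
            if m not in known:
                inferred.setdefault(m, set()).update(functions)
    return {m: sorted(fs) for m, fs in inferred.items()}


# Hypothetical signature groups and annotations, for illustration only.
groups = {
    "SIG_A": ["prot1", "prot2", "prot3"],
    "SIG_B": ["prot4", "prot5"],
}
known = {"prot1": ["kinase activity"]}

inferred = propagate_annotations(groups, known)
print(inferred)
```

Members of the group without any characterized protein receive no annotation, which reflects the requirement that at least one sequence in the group has a known function.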

6 Integrating data resources across biological information levels

6.1 Static resources

Although large volumes of high-throughput proteomics data are currently being captured, it is still early days for the effective integration of this data. This is particularly lamentable in the light of the emerging field of Systems Biology, with its need to integrate biological data that comes from a variety of sources and approaches including genomics, proteomics, transcriptomics, and metabolomics. Several efforts have already been established to carry out such integration, yet each of these is essentially built on a static model (Fig. 2A). As cited above, UniProt [33] presents an integrative layer by including a vast array of crossreferences to a greatvariety of resources.
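The practical difference between a static model and a layer that queries each resource live can be sketched with a small Python example. It is an illustration only: the `Resource` class, the resource names, and the records are hypothetical in-memory stand-ins, and no real database or web service interface is being modelled.

```python
from __future__ import annotations
import copy


class Resource:
    """Hypothetical stand-in for one online database."""

    def __init__(self, name: str, records: dict[str, str]):
        self.name = name
        self.records = records

    def lookup(self, entity: str) -> str | None:
        return self.records.get(entity)


def build_static_warehouse(resources: list[Resource]) -> dict[str, dict[str, str]]:
    """Static model: copy every record once into a single aggregated store."""
    return {r.name: copy.deepcopy(r.records) for r in resources}


def query_static(warehouse: dict[str, dict[str, str]], entity: str) -> dict[str, str]:
    """Answer from the aggregated copy, which is only as fresh as the last build."""
    return {name: recs[entity] for name, recs in warehouse.items() if entity in recs}


def query_live(resources: list[Resource], entity: str) -> dict[str, str]:
    """Dynamic model: ask every resource at query time."""
    hits = {}
    for r in resources:
        value = r.lookup(entity)
        if value is not None:
            hits[r.name] = value
    return hits


# Hypothetical resources and records, for illustration only.
sequence_db = Resource("sequence_db", {"protein_X": "MKTAYIAKQR"})
expression_db = Resource("expression_db", {"protein_X": "detected in liver"})
resources = [sequence_db, expression_db]

warehouse = build_static_warehouse(resources)

# One source is updated after the warehouse was built.
expression_db.records["protein_X"] = "detected in liver and kidney"

stale = query_static(warehouse, "protein_X")
fresh = query_live(resources, "protein_X")
print(stale["expression_db"])
print(fresh["expression_db"])
```

The aggregated copy keeps returning the record as it was at build time until it is rebuilt, whereas the live query reflects the update immediately, at the cost of contacting every resource for each query.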

The Reference Sequences (RefSeq) database [89], maintained at the NCBI, provides curated views across transcripts, proteins, and genomic regions, plus computationally derived nucleotide sequences and proteins. The RefSeq collection is nonredundant and is used internationally as a standard for genome annotation.

Figure 2. Two types of integration of biological data across several resources. Panel (A) illustrates a static approach, in which an integrative database is built by aggregating information from various sources, and (B) presents the more recent developments aimed at establishing a dynamic integrative layer (in this case based on BioMarts) for each resource, which allows each relevant resource to be queried live over the internet by a single query.

Another integrative system is provided by the EBI Sequence Retrieval System (SRS) [90], which represents a knowledgebase approach in which data from a variety of biological databases is collected in a single resource. SRS then provides a specialized query language on top of these aggregated databases that allows complex searches to be carried out. Since the different OMICS technologies ultimately derive from the availability of genome sequences, many other integrative systems are based on this primordial resource. We will discuss these genome browsers next.

The Ensembl project, a joint project between the EBI and the Wellcome Trust Sanger Institute, provides a comprehensive genome information system consisting of data storage, integration, analysis, and visualization of a wide variety of biological data [91]. The current release (50, July 2008) of Ensembl contains data from 39 species. Most of them are chordates, but there are other organisms as well, such as S. cerevisiae, Caenorhabditis elegans, the fruit fly, and two clinically important mosquitoes (Anopheles gambiae and Aedes aegypti). For each species, a core database contains the DNA sequences, gene annotations, and external references. Additional data are also provided (except in the case of low-coverage genomes) such as EST genes, external annotation sets, and variation data. Finally, during the last year, a functional genomics database was introduced, initially for human and mouse, to support functional data types assayed by whole-genome tiling arrays or high-throughput sequencing [91]. The Ensembl genome browser provides visualization for this wide variety of biological data.

Other genome browsers are also available, such as those based at the University of California Santa Cruz (UCSC browser) [92] or the NCBI’s Map Viewer [6].

From a clinical perspective, it would be equally interesting to be able to browse genomes from smaller organisms like bacteria and fungi. Integr8 [93] provides an overview of the biology of each organism and a detailed statistical analysis of its genome and proteome using both textual and graphical information. A complementary resource for comparative genomic analysis is Genome Reviews [94]. In order to further unify these resources, however, Integr8 and Genome Reviews will be discontinued soon and will become Ensembl Genomes, providing the same look and feel for these organisms as the one provided by the current Ensembl.

There are other general biological databases that are based on data integration across OMICS fields, and that can prove very useful for clinical researchers. The first of these is Reactome [95], a curated resource for human pathway data. A very valuable feature included in Reactome is the “SkyPainter.” It allows researchers to upload their own datasets, e.g., a list of gene identifiers, and visualize them on top of the reaction map in the web page. The SkyPainter also automatically calculates and highlights those pathways whose members are enriched in the uploaded set of genes.
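To make the notion of pathway enrichment concrete, the following minimal sketch scores each pathway with a one-sided hypergeometric test, a common choice for over-representation analysis. This is an assumption made for illustration: the statistic actually used by Reactome is not described here, and the pathway names, gene identifiers, and population size are hypothetical.

```python
from math import comb


def enrichment_p_value(population, pathway_size, sample_size, overlap):
    """Return P(X >= overlap) under the hypergeometric distribution.

    population:   total number of genes considered
    pathway_size: number of those genes that belong to the pathway
    sample_size:  number of genes in the uploaded list
    overlap:      number of uploaded genes found in the pathway
    """
    upper = min(pathway_size, sample_size)
    total = comb(population, sample_size)
    tail = sum(
        comb(pathway_size, k) * comb(population - pathway_size, sample_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def rank_pathways(pathways, uploaded_genes, population):
    """Rank pathways by enrichment p-value (smallest, i.e. most enriched, first)."""
    uploaded = set(uploaded_genes)
    scores = []
    for name, members in pathways.items():
        overlap = len(uploaded & members)
        p = enrichment_p_value(population, len(members), len(uploaded), overlap)
        scores.append((name, overlap, p))
    return sorted(scores, key=lambda s: s[2])


# Hypothetical toy data: 20 genes in total, two pathways of five genes each.
pathways = {
    "pathway_A": {"g1", "g2", "g3", "g4", "g5"},
    "pathway_B": {"g6", "g7", "g8", "g9", "g10"},
}
uploaded = ["g1", "g2", "g3", "g6"]
ranked = rank_pathways(pathways, uploaded, population=20)
```

Here three of the four uploaded genes fall into pathway_A, so it is ranked ahead of pathway_B. The sketch assumes that every uploaded gene belongs to the stated population; a real analysis would also correct for testing many pathways at once.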

Another popular and comprehensive database featuring information on biological pathways is KEGG. As of January 2008, KEGG comprises 19 databases that integrate genomic, chemical, and systemic functional information [22]. KEGG PATHWAY contains curated metabolic and signaling pathways in species ranging from prokaryotes to humans. It also contains curated information about several human diseases, including neurodegenerative disorders, infectious diseases, metabolic disorders, and cancers. Other KEGG databases relevant for medical and pharmaceutical applications are the recently launched KEGG DISEASE (released in January 2008) and KEGG DRUG [22].

Another useful resource, the OMIM database, provides a unique knowledgebase of the known relations between human genes and genetic disorders. It was started in order to support human genetics research and education as well as the practice of clinical genetics [35]. The Online Mendelian Inheritance in Animals (OMIA) database is the equivalent resource for animals other than human and mouse [96].

Furthermore, the pharmacogenetics and pharmacogenomics knowledge base (PharmGKB) [97] is a resource that integrates information related to gene-drug relationships, gene variation, genomics, gene-disease relationships, drug action, and pathways. Finally, there are several immunological databases available. Among them we here highlight dbMHC, a genetic and clinical database of human MHC [6], and IPD, the Immuno Polymorphism Database [98] (for an exhaustive list, see [69]).

6.2 Dynamic resources

The databases discussed in the previous section all provide static views across OMICS data. In order to obtain the latest data from all the individual, specialized databases they rely on, these resources need to update their crossreferences, rebuild their core data schema, or rely on manual curation (Fig. 2A). Interestingly, a new approach involving federation of data storage across the internet has recently been developed that can support live integrative queries. As a result, the query will have access to the very latest information across all of the resources involved, without requiring any action on the part of the integrative layer (Fig. 2B). We therefore call such approaches dynamic.
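The difference between the two models can be sketched in a few lines of code. The classes below are purely illustrative stand-ins of our own (no real resource is implemented this way, and the record contents are invented): the static integrator copies its sources once and is stale until rebuilt, whereas the dynamic integrator forwards each query to the sources at query time.

```python
class Source:
    """Stand-in for a specialized database that keeps receiving new records."""

    def __init__(self, name, records):
        self.name = name
        self.records = dict(records)

    def lookup(self, key):
        return self.records.get(key)


class StaticIntegrator:
    """Aggregates a copy of every source once; the copy ages until rebuilt."""

    def __init__(self, sources):
        self.snapshot = {s.name: dict(s.records) for s in sources}

    def query(self, key):
        return {name: recs.get(key) for name, recs in self.snapshot.items()}


class DynamicIntegrator:
    """Keeps references to the sources and asks each of them at query time."""

    def __init__(self, sources):
        self.sources = list(sources)

    def query(self, key):
        return {s.name: s.lookup(key) for s in self.sources}


# Hypothetical records keyed by a made-up protein accession.
proteomics = Source("proteomics", {"P12345": "2 peptides"})
pathways = Source("pathways", {"P12345": "glycolysis"})

static = StaticIntegrator([proteomics, pathways])
dynamic = DynamicIntegrator([proteomics, pathways])

# One source is updated after both integrators were set up.
proteomics.records["P12345"] = "5 peptides"

static_view = static.query("P12345")
dynamic_view = dynamic.query("P12345")
```

After the update, the static view still reports the old proteomics record, while the dynamic view reflects the new one without any rebuild of the integrative layer.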

One of the tools that provide direct integrative capabilities, as well as a wide variety of protocols and interfaces that other integrative resources can build on, is BioMart. Originally developed within the Ensembl project (and initially called “EnsMart”) [99], BioMart is a query-oriented data management system that does not require any programming knowledge, although it can also be accessed programmatically using web services or software libraries written in the popular Perl and Java languages.

BioMart access is also available for many other data resources, which enables the integration of information across several types of biological data through the built-in ability of BioMart to perform across-Mart queries. The list of supported databases can be found at the BioMart Project website (http://www.biomart.org) and comprises different genomics, transcriptomics (e.g., ArrayExpress Data Warehouse, Pancreatic expression database), proteomics (e.g., PRIDE, PepSeeker), and other databases such as Reactome. Users can go to each individual BioMart or to the BioMart central server and combine searches across different data resources. The current version of BioMart does not yet allow more than two resources to be combined, however. For a practical introduction on how to use BioMart via the web, see [100].
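Programmatic access through the web service typically involves sending a small XML query document to a Mart. The sketch below only builds such a document and does not contact any server. The overall layout (a Query element containing a Dataset with Filter and Attribute children) follows the BioMart XML query format as we understand it; the helper `build_biomart_query` is our own, and the dataset, filter, and attribute names are illustrative assumptions that should be checked against the Mart being queried.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, filters, attributes):
    """Build a BioMart-style XML query string (layout assumed, see text)."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",   # request tab-separated output
        "header": "0",
        "uniqueRows": "1",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # Filters restrict which records are returned.
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    # Attributes select which columns are returned for those records.
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# Illustrative names only; they are assumptions, not verified identifiers.
xml_query = build_biomart_query(
    dataset="hsapiens_gene_ensembl",
    filters={"chromosome_name": "21"},
    attributes=["ensembl_gene_id", "external_gene_name"],
)
```

The resulting string would then be submitted to the Mart's web service endpoint, for example as the value of a query parameter in an HTTP request; the exact endpoint depends on the server and is not shown here.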

A second way to integrate biological information is provided by the Distributed Annotation System (DAS) protocol [101, 102]. DAS allows sequence annotations to be decentralized among multiple annotators and integrated on an as-needed basis by client-side software. Annotation servers are specialized for returning lists of annotations across defined regions of the sequence, for instance identified peptides from a proteomics database, known domains from a domain database, and sequenced ESTs from a genomics database. A DAS client then visualizes these annotations for the user. The EBI’s Dasty2 (http://www.ebi.ac.uk/dasty) viewer is an example of such a client. Dasty2 provides the user with all available biological information about a particular protein from a number of annotation servers in a single page. Another DAS-based viewer is the 3-D structure viewer SPICE [103], which can show DAS annotations directly on a protein structure. DAS servers are provided by a wide variety of resources including Ensembl [91]. Some other examples of DAS-enabled web sites and applications can be found at http://www.biodas.org.
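The client-side integration step can be sketched as follows. The code builds a features request URL, parses feature lists returned by two annotation servers, and merges them by position. The URL pattern and the element names follow the DAS features response in simplified form, as far as we are aware; the server address, source names, accession, and sample responses are hypothetical, and no network request is made.

```python
import xml.etree.ElementTree as ET


def features_url(server, source, segment, start=None, stop=None):
    """Build a DAS-style 'features' request URL for one sequence segment."""
    seg = segment if start is None else f"{segment}:{start},{stop}"
    return f"{server.rstrip('/')}/das/{source}/features?segment={seg}"


def parse_features(xml_text, server_name):
    """Extract a flat list of annotations from one server response."""
    root = ET.fromstring(xml_text)
    features = []
    for seg in root.iter("SEGMENT"):
        for feat in seg.findall("FEATURE"):
            features.append({
                "server": server_name,
                "segment": seg.get("id"),
                "label": feat.get("label"),
                "type": feat.findtext("TYPE"),
                "start": int(feat.findtext("START")),
                "end": int(feat.findtext("END")),
            })
    return features


def merge_annotations(responses):
    """Combine annotations from several servers, ordered along the sequence."""
    merged = []
    for server_name, xml_text in responses.items():
        merged.extend(parse_features(xml_text, server_name))
    return sorted(merged, key=lambda f: (f["start"], f["end"]))


# Hypothetical, simplified responses from two annotation servers.
PEPTIDE_RESPONSE = """<DASGFF><GFF><SEGMENT id="P12345" start="1" stop="300">
<FEATURE id="pep1" label="identified peptide"><TYPE id="peptide">peptide</TYPE>
<START>120</START><END>131</END></FEATURE>
</SEGMENT></GFF></DASGFF>"""

DOMAIN_RESPONSE = """<DASGFF><GFF><SEGMENT id="P12345" start="1" stop="300">
<FEATURE id="dom1" label="kinase domain"><TYPE id="domain">domain</TYPE>
<START>40</START><END>290</END></FEATURE>
</SEGMENT></GFF></DASGFF>"""

url = features_url("http://das.example.org", "peptides", "P12345", 1, 300)
merged = merge_annotations({
    "proteomics": PEPTIDE_RESPONSE,
    "domains": DOMAIN_RESPONSE,
})
```

A viewer in the style described above would render the merged list as tracks along the protein sequence; because each server is queried separately, adding a further annotation source only requires one more entry in the `responses` mapping.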

A frequently encountered but often underestimated issue that can make proteomics information integration challenging is the problem of heterogeneous and often dynamic identifiers or accession numbers referring to the same protein in different databases. The major reference protein databases (UniProtKB, Ensembl, and RefSeq) maintain a comprehensive list of crossreferences to each other, but full coverage is difficult to achieve since they have different production systems and release schedules. Additionally, smaller, more specialized databases might not have been included in this crossreferencing process. The result is that users still need to query multiple sources to ensure that they have a complete picture, including the latest information available. In order to overcome this situation, the Protein Identifier Cross-Referencing (PICR) service was launched last year at the EBI [104]. Originally built as an extension to annotate and crossreference data in the PRIDE database, the PICR service is also accessible through a web application (http://www.ebi.ac.uk/Tools/picr). The mapping algorithm relies on the UniParc database, a data warehouse of crossreferences based on 100% sequence identity. PICR maps across proteins from over 70 distinct source databases, including the most commonly used ones. Users can query the service with a list of protein identifiers or sequences, and PICR will return the corresponding, most up-to-date protein identifiers in the selected target database(s).
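The principle of mapping identifiers through 100% sequence identity can be shown with a toy example. The sketch below is not the PICR interface: it uses a small in-memory table as a stand-in for a sequence archive, and all accessions and sequences are made up. Two identifiers are considered equivalent only if their sequences are exactly identical.

```python
from collections import defaultdict


def build_indexes(entries):
    """Index (database, accession, sequence) triples by sequence and by accession."""
    by_sequence = defaultdict(set)
    by_accession = {}
    for database, accession, sequence in entries:
        seq = sequence.upper()
        by_sequence[seq].add((database, accession))
        by_accession[accession] = seq
    return by_sequence, by_accession


def map_identifier(accession, target_database, by_sequence, by_accession):
    """Return target-database accessions whose sequence is identical to the query's."""
    seq = by_accession.get(accession)
    if seq is None:
        return []  # unknown identifier
    return sorted(acc for db, acc in by_sequence[seq] if db == target_database)


# Hypothetical archive content; RS_B differs from the others by one residue.
ENTRIES = [
    ("UniProtKB", "UP_A", "MKTAYIAK"),
    ("RefSeq", "RS_A", "MKTAYIAK"),
    ("Ensembl", "EN_A", "MKTAYIAK"),
    ("RefSeq", "RS_B", "MKTAYIAR"),
]

by_sequence, by_accession = build_indexes(ENTRIES)
refseq_hits = map_identifier("UP_A", "RefSeq", by_sequence, by_accession)
no_hits = map_identifier("RS_B", "UniProtKB", by_sequence, by_accession)
```

The single-residue difference in RS_B is enough to keep it out of the mapping, which illustrates why an identity-based approach is precise but will not link isoforms or variants of the same protein.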

Commercial solutions for data integration are available as well, such as the software ProteinCenter (Proxeon Biosystems). This package can integrate systems biology data from different sources with multiple proteomic repositories [53].

7 Future perspectives and outlook

It is clear that a vast array of resources and repositories is already in place to collect and disseminate data across the different OMICS fields. However, as most of these resources are focused on either a specialized field or a technology platform, the availability of these data does not translate directly into accessibility for nonspecialists in these fields. This shortcoming is more strongly felt in those fields of the life sciences that rely on OMICS techniques as tools rather than as research goals in their own right. Since this user community includes cell biologists as well as clinical researchers, there is an obvious need to provide novel, integrative tools aimed specifically at these users. Although this integrative bioinformatics effort has already materialized in several of the resources we discussed here, the exciting new possibilities provided by dynamic integrative layers such as BioMart and DAS will certainly enable even more powerful interfaces for systems biology research. Indeed, by efficiently joining together federated and diverse resources, such dynamic integration approaches will be able to seamlessly include low-level, technology-specific resources, as well as the high-level, static integrative views such as the genome browsers or UniProt. Ultimately, through leveraging the interconnections established by the static integrative resources, as well as within-domain translators such as PICR, systems built on top of such dynamic integrative layers will be able to collect and visualize a broad range of live information about a protein, tissue, pathway, disease, or custom question of interest. By spanning the different levels of biological information transfer, as well as the different levels of detailed data that are available for each of these, such an interface will provide a veritable treasure trove of biological and clinical knowledge that is certain to contribute greatly towards fulfilling the promises of the life sciences.

J. A. V. is a Postdoctoral Fellow of the “Especialización en Organismos Internacionales” program from the Spanish Ministry of Education and Science. This work has been supported by the “ProDaC” grant LSHG-CT-2006-036814 of the European Union. The authors would like to thank Rolf Apweiler for his support.

The authors have declared no conflict of interest.

8 References

[1] Dunn, M. J., Proteomics – Clinical Applications Reviews 2008, Part 2. Proteomics Clin. Appl. 2008, 2, 287–289.

[2] Dunn, M. J., Proteomics – Clinical Applications Reviews 2008, Part 1. Proteomics Clin. Appl. 2008, 2, 119–120.

[3] Benson, D., Karsch-Mizrachi, I., Lipman, D., Ostell, J., Wheeler, D., GenBank. Nucleic Acids Res. 2008, 36, D25–D30.

[4] Cochrane, G., Akhtar, R., Aldebert, P., Althorpe, N. et al., Priorities for nucleotide trace, sequence and annotation data capture at the Ensembl Trace Archive and the EMBL Nucleotide Sequence Database. Nucleic Acids Res. 2008, 36, D5–D12.

[5] Sugawara, H., Abe, T., Gojobori, T., Tateno, Y., DDBJ working on evaluation and classification of bacterial genes in INSDC. Nucleic Acids Res. 2007, 35, D13–D15.

[6] Wheeler, D., Barrett, T., Benson, D., Bryant, S. et al., Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2008, 36, D13–D21.

[7] Liolios, K., Mavromatis, K., Tavernarakis, N., Kyrpides, N., The Genomes On Line Database (GOLD) in 2007: Status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res. 2008, 36, D475–D479.

[8] Parkinson, H., Kapushesky, M., Shojatalab, M., Abeygunawardena, N. et al., ArrayExpress – A public database of microarray experiments and gene expression profiles. Nucleic Acids Res. 2007, 35, D747–D750.

[9] Barrett, T., Troup, D., Wilhite, S., Ledoux, P. et al., NCBI GEO: Mining tens of millions of expression profiles – Database and tools update. Nucleic Acids Res. 2007, 35, D760–D765.

[10] Ikeo, K., Ishi-i, J., Tamura, T., Gojobori, T., Tateno, Y., CIBEX: Center for information biology gene expression database. C. R. Biol. 2003, 326, 1079–1082.

[11] Ball, C. A., Brazma, A., MGED standards: Work in progress. Omics 2006, 10, 138–144.

[12] Nagaraj, S., Gasser, R., Ranganathan, S., A hitchhiker’s guide to expressed sequence tag (EST) analysis. Brief. Bioinform. 2007, 8, 6–21.

[13] Brentani, H., Caballero, O. L., Camargo, A. A., da Silva, A. M. et al., The generation and utilization of a cancer-oriented representation of the human transcriptome by using expressed sequence tags. Proc. Natl. Acad. Sci. USA 2003, 100, 13418–13423.

[14] Su, A. I., Wiltshire, T., Batalov, S., Lapp, H. et al., A gene atlas of the mouse and human protein-encoding transcriptomes. Proc. Natl. Acad. Sci. USA 2004, 101, 6062–6067.

[15] German, J., Hammock, B., Watkins, S., Metabolomics: Building on a century of biochemistry to guide human health. Metabolomics 2005, 1, 3–9.

[16] Wishart, D., Tzur, D., Knox, C., Eisner, R. et al., HMDB: The Human Metabolome Database. Nucleic Acids Res. 2007, 35, D521–D526.

[17] Williams, A. J., Public chemical compound databases. Curr. Opin. Drug Discov. Dev. 2008, 11, 393–404.

[18] Wishart, D. S., Knox, C., Guo, A. C., Cheng, D. et al., DrugBank: A knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res. 2008, 36, D901–D906.

[19] Richard, A. M., Williams, C. R., Distributed structure-searchable toxicity (DSSTox) public database network: A proposal. Mutat. Res. 2002, 499, 27–52.

[20] Steinbeck, C., Krause, S., Kuhn, S., NMRShiftDB-constructing a free chemical information system with open-source components. J. Chem. Inf. Comput. Sci. 2003, 43, 1733–1739.

[21] Williams, A. J., ChemSpider and its demanding web: Building a structure centric community for chemists. Chem. Int. 2008, 30, 30.

[22] Kanehisa, M., Araki, M., Goto, S., Hattori, M. et al., KEGG for linking genomes to life and the environment. Nucleic Acids Res. 2008, 36, D480–D484.

[23] Degtyarenko, K., de Matos, P., Ennis, M., Hastings, J. et al., ChEBI: A database and ontology for chemical entities of biological interest. Nucleic Acids Res. 2008, 36, D344–D350.

[24] Fathallah-Shaykh, H., Microarrays: Applications and pitfalls. Arch. Neurol. 2005, 62, 1669–1672.

[25] Cox, J., Mann, M., Is proteomics the new genomics? Cell 2007, 130, 395–398.

[26] Lu, P., Vogel, C., Wang, R., Yao, X., Marcotte, E., Absolute protein expression profiling estimates the relative contributions of transcriptional and translational regulation. Nat. Biotechnol. 2007, 25, 117–124.

[27] Olsen, J., Blagoev, B., Gnad, F., Macek, B. et al., Global, in vivo, and site-specific phosphorylation dynamics in signaling networks. Cell 2006, 127, 635–648.

[28] Kohn, E., Azad, N., Annunziata, C., Dhamoon, A., Whiteley, G., Proteomics as a tool for biomarker discovery. Dis. Markers 2007, 23, 411–417.

[29] Bast, R. C., Jr., Badgwell, D., Lu, Z., Marquez, R. et al., New tumor markers: CA125 and beyond. Int. J. Gynecol. Cancer 2005, 15, 274–281.

[30] Kischel, P., Waltregny, D., Castronovo, V., Identification of accessible human cancer biomarkers using ex vivo chemical proteomic strategies. Expert Rev. Proteomics 2007, 4, 727–739.

[31] Matharoo-Ball, B., Ball, G., Rees, R., Clinical proteomics: Discovery of cancer biomarkers using mass spectrometry and bioinformatics approaches – A prostate cancer perspective. Vaccine 2007, 25, B110–B121.

[32] Reid, J., Parker, C., Borchers, C., Protein arrays for biomarker discovery. Curr. Opin. Mol. Ther. 2007, 9, 216–221.

[33] The UniProt Consortium, The universal protein resource (UniProt). Nucleic Acids Res. 2008, 36, D190–D195.

[34] Boutet, E., Lieberherr, D., Tognolli, M., Schneider, M., Bairoch, A., UniProtKB/Swiss-Prot: The manually annotated section of the UniProt KnowledgeBase. Methods Mol. Biol. (Clifton, NJ) 2007, 406, 89–112.

[35] Hamosh, A., Scott, A., Amberger, J., Bocchini, C., McKusick, V., Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 2005, 33, D514–D517.

[36] Misra, R., Horler, R., Reindl, W., Goryanin, I., Thomas, G., EchoBASE: An integrated post-genomic database for Escherichia coli. Nucleic Acids Res. 2005, 33, D329–D333.

[37] Rudd, K., EcoGene: A genome sequence database for Escherichia coli K-12. Nucleic Acids Res. 2000, 28, 60–64.

[38] Fang, G., Ho, C., Qiu, Y., Cubas, V. et al., Specialized microbial databases for inductive exploration of microbial genome sequences. BMC Genomics 2005, 6, 14.

[39] Combet, C., Garnier, N., Charavay, C., Grando, D. et al., euHCVdb: The European hepatitis C virus database. Nucleic Acids Res. 2007, 35, D363–D366.

[40] Suzek, B., Huang, H., McGarvey, P., Mazumder, R., Wu, C., UniRef: Comprehensive and nonredundant UniProt reference clusters. Bioinformatics (Oxford, England) 2007, 23, 1282–1288.

[41] Leinonen, R., Diez, F., Binns, D., Fleischmann, W. et al., UniProt archive. Bioinformatics (Oxford, England) 2004, 20, 3236–3237.

[42] Yooseph, S., Sutton, G., Rusch, D., Halpern, A. et al., The Sorcerer II Global Ocean Sampling expedition: Expanding the universe of protein families. PLoS Biol. 2007, 5, e16.

[43] Mishra, G. R., Suresh, M., Kumaran, K., Kannabiran, N. et al., Human protein reference database – 2006 update. Nucleic Acids Res. 2006, 34, D411–D414.

[44] Mathivanan, S., Ahmed, M., Ahn, N., Alexandre, H. et al., Human Proteinpedia enables sharing of human protein data. Nat. Biotechnol. 2008, 26, 164–167.

[45] Kersey, P. J., Duarte, J., Williams, A., Karavidopoulou, Y. et al., The International Protein Index: An integrated database for proteomics experiments. Proteomics 2004, 4, 1985–1988.

[46] Craig, R., Cortens, J., Beavis, R., Open source system for analyzing, validating, and storing protein identification data. J. Proteome Res. 2004, 3, 1234–1242.

[47] Desiere, F., Deutsch, E., King, N., Nesvizhskii, A. et al., The PeptideAtlas project. Nucleic Acids Res. 2006, 34, D655–D658.

[48] Jones, P., Cote, R., Cho, S., Klie, S. et al., PRIDE: New developments and new datasets. Nucleic Acids Res. 2008, 36, D878–D883.

[49] McLaughlin, T., Siepen, J., Selley, J., Lynch, J. et al., PepSeeker: A database of proteome peptide identifications for investigating fragmentation patterns. Nucleic Acids Res. 2006, 34, D649–D654.

[50] Shadforth, I., Xu, W., Crowther, D., Bessant, C., GAPP: A fully automated software for the confident identification of human peptides from tandem mass spectra. J. Proteome Res. 2006, 5, 2849–2852.

[51] Zhang, Y., Zhang, Y., Adachi, J., Olsen, J. et al., MAPU: Max-Planck Unified database of organellar, cellular, tissue and body fluid proteomes. Nucleic Acids Res. 2007, 35, D771–D779.

[52] Prince, J., Carlson, M., Wang, R., Lu, P., Marcotte, E., The need for a public proteomics repository. Nat. Biotechnol. 2004, 22, 471–472.

[53] Mead, J., Shadforth, I., Bessant, C., Public proteomic MS repositories and pipelines: Available tools and biological applications. Proteomics 2007, 7, 2769–2786.

[54] Klie, S., Martens, L., Vizcaíno, J. A., Côté, R. et al., Analyzing large-scale proteomics projects with latent semantic indexing. J. Proteome Res. 2008, 7, 182–191.

[55] Mueller, M., Vizcaíno, J. A., Jones, P., Côté, R. et al., Analysis of the experimental detection of central nervous system related genes in human brain and cerebrospinal fluid datasets. Proteomics 2008, 8, 1138–1148.

[56] Berman, H., The Protein Data Bank: A historical perspective. Acta Crystallogr. 2008, 64, 88–95.

[57] Tagari, M., Tate, J., Swaminathan, G., Newman, R. et al., E-MSD: Improving data deposition and structure quality. Nucleic Acids Res. 2006, 34, D287–D290.

[58] Ulrich, E., Akutsu, H., Doreleijers, J., Harano, Y. et al., BioMagResBank. Nucleic Acids Res. 2008, 36, D402–D408.

[59] Berman, H., Henrick, K., Nakamura, H., Markley, J., The worldwide Protein Data Bank (wwPDB): Ensuring a single, uniform archive of PDB data. Nucleic Acids Res. 2007, 35, D301–D303.

[60] Henrick, K., Feng, Z., Bluhm, W., Dimitropoulos, D. et al., Remediation of the protein data bank archive. Nucleic Acids Res. 2008, 36, D426–D433.

[61] Kerrien, S., Alam-Faruque, Y., Aranda, B., Bancarz, I. et al., IntAct – Open source resource for molecular interaction data. Nucleic Acids Res. 2007, 35, D561–D565.

[62] Chatr-Aryamontri, A., Ceol, A., Palazzi, L., Nardelli, G. et al., MINT: The Molecular INTeraction database. Nucleic Acids Res. 2007, 35, D572–D574.

[63] Salwinski, L., Miller, C., Smith, A., Pettit, F. et al., The Database of Interacting Proteins: 2004 update. Nucleic Acids Res. 2004, 32, D449–D451.

[64] Alfarano, C., Andrade, C., Anthony, K., Bahroos, N. et al., The Biomolecular Interaction Network Database and related tools 2005 update. Nucleic Acids Res. 2005, 33, D418–D424.

[65] Guldener, U., Munsterkotter, M., Oesterheld, M., Pagel, P. et al., MPact: The MIPS protein interaction resource on yeast. Nucleic Acids Res. 2006, 34, D436–D441.

[66] Breitkreutz, B., Stark, C., Reguly, T., Boucher, L. et al., The BioGRID Interaction Database: 2008 update. Nucleic Acids Res. 2008, 36, D637–D640.

[67] Mathivanan, S., Periaswamy, B., Gandhi, T. K. B., Kandasamy, K. et al., An evaluation of human protein-protein interaction data in the public domain. BMC Bioinformatics 2006, 7, S19.

[68] Ideker, T., Valencia, A., Bioinformatics in the human interactome project. Bioinformatics (Oxford, England) 2006, 22, 2973–2974.

[69] Galperin, M., The Molecular Biology Database Collection: 2008 update. Nucleic Acids Res. 2008, 36, D2–D4.

[70] Rawlings, N., Morton, F., Kok, C., Kong, J., Barrett, A., MEROPS: The peptidase database. Nucleic Acids Res. 2008, 36, D320–D325.

[71] Krupa, A., Abhinandan, K., Srinivasan, N., KinG: A database of protein kinases in genomes. Nucleic Acids Res. 2004, 32, D153–D155.

[72] Ren, Q., Chen, K., Paulsen, I., TransportDB: A comprehensive database resource for cytoplasmic membrane transport systems and outer membrane channels. Nucleic Acids Res. 2007, 35, D274–D279.

[73] Seebah, S., Suresh, A., Zhuo, S., Choong, Y. et al., Defensins knowledgebase: A manually curated database and information source focused on the defensins family of antimicrobial peptides. Nucleic Acids Res. 2007, 35, D265–D268.

[74] Whitmore, L., Wallace, B., The Peptaibol Database: A database for sequences and structures of naturally occurring peptaibols. Nucleic Acids Res. 2004, 32, D593–D594.

[75] Hober, S., Uhlén, M., Human protein atlas and the use of microarray technologies. Curr. Opin. Biotechnol. 2008, 19, 30–35.

[76] Hulo, N., Bairoch, A., Bulliard, V., Cerutti, L. et al., The PROSITE database. Nucleic Acids Res. 2006, 34, D227–D230.

[77] Attwood, T., Bradley, P., Flower, D., Gaulton, A. et al., PRINTS and its automatic supplement, prePRINTS. Nucleic Acids Res. 2003, 31, 400–402.

[78] Finn, R., Tate, J., Mistry, J., Coggill, P. et al., The Pfam protein families database. Nucleic Acids Res. 2008, 36, D281–D288.

[79] Letunic, I., Copley, R., Pils, B., Pinkert, S. et al., SMART 5: Domains in the context of genomes and networks. Nucleic Acids Res. 2006, 34, D257–D260.

[80] Selengut, J., Haft, D., Davidsen, T., Ganapathy, A. et al., TIGRFAMs and Genome Properties: Tools for the assignment of molecular function and biological process in prokaryotic genomes. Nucleic Acids Res. 2007, 35, D260–D264.

[81] Wu, C., Nikolskaya, A., Huang, H., Yeh, L. et al., PIRSF: Family classification system at the Protein Information Resource. Nucleic Acids Res. 2004, 32, D112–D114.

[82] Mi, H., Guo, N., Kejariwal, A., Thomas, P., PANTHER version 6: Protein sequence and function evolution data with expanded representation of biological pathways. Nucleic Acids Res. 2007, 35, D247–D252.

[83] Wilson, D., Madera, M., Vogel, C., Chothia, C., Gough, J., The SUPERFAMILY database in 2007: Families and functions. Nucleic Acids Res. 2007, 35, D308–D313.

[84] Yeats, C., Lees, J., Reid, A., Kellam, P. et al., Gene3D: Comprehensive structural and functional annotation of genomes. Nucleic Acids Res. 2008, 36, D414–D418.

[85] Bru, C., Courcelle, E., Carrere, S., Beausse, Y. et al., The ProDom database of protein domain families: More emphasis on 3D. Nucleic Acids Res. 2005, 33, D212–D215.

[86] Mulder, N., Apweiler, R., Attwood, T., Bairoch, A. et al., New developments in the InterPro database. Nucleic Acids Res. 2007, 35, D224–D228.

[87] Mulder, N., Kersey, P., Pruess, M., Apweiler, R., In silico characterization of proteins: UniProt, InterPro and Integr8. Mol. Biotechnol. 2008, 38, 165–177.

[88] Mulder, N., Apweiler, R., InterPro and InterProScan: Tools for protein sequence classification and comparison. Methods Mol. Biol. (Clifton, NJ) 2007, 396, 59–70.

[89] Pruitt, K., Tatusova, T., Maglott, D., NCBI reference sequences (RefSeq): A curated nonredundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2007, 35, D61–D65.

[90] Zdobnov, E., Lopez, R., Apweiler, R., Etzold, T., The EBI SRS server-new features. Bioinformatics (Oxford, England) 2002, 18, 1149–1150.

[91] Flicek, P., Aken, B., Beal, K., Ballester, B. et al., Ensembl 2008. Nucleic Acids Res. 2008, 36, D707–D714.

[92] Karolchik, D., Kuhn, R., Baertsch, R., Barber, G. et al., The UCSC Genome Browser Database: 2008 update. Nucleic Acids Res. 2008, 36, D773–D779.

[93] Kersey, P., Bower, L., Morris, L., Horne, A. et al., Integr8 and Genome Reviews: Integrated views of complete genomes and proteomes. Nucleic Acids Res. 2005, 33, D297–D302.

[94] Sterk, P., Kersey, P., Apweiler, R., Genome Reviews: Standardizing content and representation of information about complete genomes. Omics 2006, 10, 114–118.

[95] Vastrik, I., D’Eustachio, P., Schmidt, E., Joshi-Tope, G. et al., Reactome: A knowledge base of biologic pathways and processes. Genome Biol. 2007, 8, R39.

[96] Lenffer, J., Nicholas, F., Castle, K., Rao, A. et al., OMIA (Online Mendelian Inheritance in Animals): An enhanced platform and integration into the Entrez search interface at NCBI. Nucleic Acids Res. 2006, 34, D599–D601.

[97] Hernandez-Boussard, T., Whirl-Carrillo, M., Hebert, J., Gong, L. et al., The pharmacogenetics and pharmacogenomics knowledge base: Accentuating the knowledge. Nucleic Acids Res. 2008, 36, D913–D918.

[98] Robinson, J., Waller, M., Fail, S., Marsh, S., The IMGT/HLA and IPD databases. Hum. Mutat. 2006, 27, 1192–1199.

[99] Kasprzyk, A., Keefe, D., Smedley, D., London, D. et al., EnsMart: A generic system for fast and flexible access to biological data. Genome Res. 2004, 14, 160–169.

[100] Spudich, G., Fernandez-Suarez, X., Birney, E., Genome browsing with Ensembl: A practical overview. Brief. Funct. Genomics Proteomics 2007, 6, 202–219.

[101] Dowell, R., Jokerst, R., Day, A., Eddy, S., Stein, L., The distributed annotation system. BMC Bioinformatics 2001, 2, 7.

[102] Prlic, A., Down, T., Kulesha, E., Finn, R. et al., Integrating sequence and structural biology with DAS. BMC Bioinformatics 2007, 8, 333.

[103] Prlic, A., Down, T. A., Hubbard, T. J. P., Adding Some SPICE to DAS. Bioinformatics 2005, 21, ii40–ii41.

[104] Cote, R., Jones, P., Martens, L., Kerrien, S. et al., The Protein Identifier Cross-Referencing (PICR) service: Reconciling protein identifiers across multiple source databases. BMC Bioinformatics 2007, 8, 401.
