advancing glycomics: implementation strategies at - glycobiology

9
Glycobiology vol. 16 no. 5 pp. 82R–90R, 2006 doi:10.1093/glycob/cwj080 Advance Access publication on February 14, 2006 © The Author 2006. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: [email protected]. The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact [email protected] 82R Advancing glycomics: Implementation strategies at the Consortium for Functional Glycomics Rahul Raman, Maha Venkataraman, Subu Ramakrishnan, Wei Lang, S. Raguram, and Ram Sasisekharan 1 Biological Engineering Division, Center for Biomedical Engineering, Center for Environmental Health Sciences, Massachusetts Institute of Technology; Harvard-Massachusetts Institute of Technology Division of Health Sciences and Technology, 77 Massachusetts Avenue, Cambridge, MA 02139 Received on December 21, 2005; revised on January 16, 2006; accepted on January 19, 2006 Glycomics —an integrated approach to study structure–function relationships of complex carbohydrates (or glycans)—is an emerging field in this age of post-genomics. Realizing the importance of glycomics, many large scale research initiatives have been established to generate novel resources and technol- ogies to advance glycomics. These initiatives are generating and cataloging diverse data sets necessitating the development of bioinformatic platforms to acquire, integrate, and dissemi- nate these data sets in a meaningful fashion. With the consor- tium for functional glycomics (CFG) as the model system, this review discusses databases and the bioinformatics platform developed by this consortium to advance glycomics. Key words: bioinformatics/CFG/glycan/glycomics Introduction In comparison with genomics and proteomics, advancing glycomics is faced with unique and intriguing challenges. These challenges arise from two fundamental aspects of gly- can structure–function relationships. First, the biosynthesis of glycans is a non–template-driven process involving a coor- dinated expression of multiple glycosyltransferases, some of which have additional tissue-specific isoforms (Varki et al., 1999; Lowe and Marth, 2003; Taylor and Drickamer, 2003). Second, understanding the biochemical basis of glycan– protein interactions in the context of a biological pathway is complicated by the multivalency and the graded affinity involving an ensemble of glycan structures making multiple contacts with multivalent binding sites on proteins (Collins and Paulson, 2004). The above issues have also made it chal- lenging to develop databases and bioinformatics tools for glycomics (von der Lieth et al., 2004; Raman et al., 2005). Based on the above, the evolution of information in glycomics has been different from that of genomics and proteomics. Although large volumes of DNA and protein sequence data were being generated in databases such as GenBank and SwissProt, much less information on glycan structures was available at that time. For example, early attempts to construct a glycan database resulted in the Complex Carbohydrate Structure Database (CCSD) which contained about 2000 literature citations (Doubet et al., 1989) along with a set of computational tools (CarbBank) to query and access information. CCSD rapidly grew into the primary repository for glycan structures with over 50,000 citations. However, challenges in structural charac- terization of glycans in the past led to numerous glycan structures with unspecified linkage information which com- plicated the databasing and search engine development. Moreover, during the development of CCSD, a clear blue- print was not available for establishing the kinds of data that needed to be stored and annotated along with the gly- can structures in this database. Owing to these and other challenges, further development of CCSD was discontinued. Important breakthroughs in glycobiology fueled by rapid technology development have led to a “glyco-renaissance” in the past few years. Recognizing the need to take an integrated approach to understand glycan structure–function relationships, several international collaborative efforts such as the Consor- tium for Functional Glycomics (CFG; an international initiative funded by National Institute of General Medical Sciences), Complex Carbohydrate Research Center (www.ccrc.uga.edu), EuroCarbDB (www.eurocarb.org), and Human Disease Gly- comics/Proteome Initiative (www.hgpi.jp) have been established. Motivated by the need to address the challenges in glycomics, these collaborative efforts are developing novel resources and state-of-the-art technologies for advancing this field. Impor- tantly, these initiatives are putting significant resources into developing databases and bioinformatics platforms to integrate and disseminate glycomics data to the scientific community. This review discusses the utility of the glycomics databases developed by the CFG highlighting the CFG’s synergistic approach to col- laborate with other large-scale initiatives and collectively address challenges in advancing glycomics. CFG bioinformatics platform for glycomics To address the central issue of decoding structure–function relationships of glycan–protein interactions, the CFG is organized into scientific Cores that have developed technolo- gies to generate novel data sets. These diverse data sets are derived from (1) gene expression of glycosyltransferases and 1 To whom correspondence should be addressed; e-mail: [email protected] Downloaded from https://academic.oup.com/glycob/article/16/5/82R/572357 by guest on 02 December 2021

Upload: others

Post on 09-Feb-2022

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Advancing glycomics: Implementation strategies at - Glycobiology

Glycobiology vol. 16 no. 5 pp. 82R–90R, 2006 doi:10.1093/glycob/cwj080 Advance Access publication on February 14, 2006

© The Author 2006. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: [email protected]. The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display theopen access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journaland Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact [email protected] 82R

Advancing glycomics: Implementation strategies at the Consortium for Functional Glycomics

Rahul Raman, Maha Venkataraman, Subu Ramakrishnan, Wei Lang, S. Raguram, and Ram Sasisekharan1

Biological Engineering Division, Center for Biomedical Engineering, Center for Environmental Health Sciences, Massachusetts Institute of Technology; Harvard-Massachusetts Institute of Technology Division of Health Sciences and Technology, 77 Massachusetts Avenue, Cambridge, MA 02139

Received on December 21, 2005; revised on January 16, 2006; accepted on January 19, 2006

Glycomics—an integrated approach to study structure–functionrelationships of complex carbohydrates (or glycans)—is anemerging field in this age of post-genomics. Realizing theimportance of glycomics, many large scale research initiativeshave been established to generate novel resources and technol-ogies to advance glycomics. These initiatives are generatingand cataloging diverse data sets necessitating the developmentof bioinformatic platforms to acquire, integrate, and dissemi-nate these data sets in a meaningful fashion. With the consor-tium for functional glycomics (CFG) as the model system, thisreview discusses databases and the bioinformatics platformdeveloped by this consortium to advance glycomics.

Key words: bioinformatics/CFG/glycan/glycomics

Introduction

In comparison with genomics and proteomics, advancingglycomics is faced with unique and intriguing challenges.These challenges arise from two fundamental aspects of gly-can structure–function relationships. First, the biosynthesisof glycans is a non–template-driven process involving a coor-dinated expression of multiple glycosyltransferases, some ofwhich have additional tissue-specific isoforms (Varki et al.,1999; Lowe and Marth, 2003; Taylor and Drickamer, 2003).Second, understanding the biochemical basis of glycan–protein interactions in the context of a biological pathway iscomplicated by the multivalency and the graded affinityinvolving an ensemble of glycan structures making multiplecontacts with multivalent binding sites on proteins (Collinsand Paulson, 2004). The above issues have also made it chal-lenging to develop databases and bioinformatics tools forglycomics (von der Lieth et al., 2004; Raman et al., 2005).

Based on the above, the evolution of information inglycomics has been different from that of genomics and

proteomics. Although large volumes of DNA and proteinsequence data were being generated in databases such asGenBank and SwissProt, much less information on glycanstructures was available at that time. For example, earlyattempts to construct a glycan database resulted in theComplex Carbohydrate Structure Database (CCSD) whichcontained about 2000 literature citations (Doubet et al.,1989) along with a set of computational tools (CarbBank)to query and access information. CCSD rapidly grew intothe primary repository for glycan structures with over50,000 citations. However, challenges in structural charac-terization of glycans in the past led to numerous glycanstructures with unspecified linkage information which com-plicated the databasing and search engine development.Moreover, during the development of CCSD, a clear blue-print was not available for establishing the kinds of datathat needed to be stored and annotated along with the gly-can structures in this database. Owing to these and otherchallenges, further development of CCSD was discontinued.

Important breakthroughs in glycobiology fueled by rapidtechnology development have led to a “glyco-renaissance” in thepast few years. Recognizing the need to take an integratedapproach to understand glycan structure–function relationships,several international collaborative efforts such as the Consor-tium for Functional Glycomics (CFG; an international initiativefunded by National Institute of General Medical Sciences),Complex Carbohydrate Research Center (www.ccrc.uga.edu),EuroCarbDB (www.eurocarb.org), and Human Disease Gly-comics/Proteome Initiative (www.hgpi.jp) have been established.Motivated by the need to address the challenges in glycomics,these collaborative efforts are developing novel resources andstate-of-the-art technologies for advancing this field. Impor-tantly, these initiatives are putting significant resources intodeveloping databases and bioinformatics platforms to integrateand disseminate glycomics data to the scientific community. Thisreview discusses the utility of the glycomics databases developedby the CFG highlighting the CFG’s synergistic approach to col-laborate with other large-scale initiatives and collectively addresschallenges in advancing glycomics.

CFG bioinformatics platform for glycomics

To address the central issue of decoding structure–functionrelationships of glycan–protein interactions, the CFG isorganized into scientific Cores that have developed technolo-gies to generate novel data sets. These diverse data sets arederived from (1) gene expression of glycosyltransferases and1To whom correspondence should be addressed; e-mail: [email protected]

Dow

nloaded from https://academ

ic.oup.com/glycob/article/16/5/82R

/572357 by guest on 02 Decem

ber 2021

Page 2: Advancing glycomics: Implementation strategies at - Glycobiology

Advancing glycomics

83R

glycan-binding proteins (GBPs), (2) phenotyping analysis oftransgenic mice, (3) mass spectrometric profiling of glycanstructures isolated from cells and tissues, and (4) screeningglycan affinity of proteins using novel glycan arrays. It isclear that there is a need to cut across these diverse data setsto begin understanding the fundamental structure–functionrelationships of glycans. A critical component that enablesthis process is a bioinformatics platform to store, integrate,and process the information generated by the above methodsand disseminate them in a meaningful fashion via the Inter-net to the scientific community worldwide.

To implement an informatics framework for integratingdiverse data sets, the Bioinformatics Core (Core B) of theCFG constructed relational databases. The blueprint of theCFG database is the data model or ontology diagram thatcaptures data definitions and inter-relationships (Figure 1B).It is important to keep the complexity of this database notcome in the way of the researchers who deposit data into thedatabase and query the database for accessing specific datasets. To accomplish this goal, the CFG bioinformatics plat-form has developed a three-tier architecture. This comprisesof the relational database in the backend (implemented using

Fig. 1. CFG bioinformatics platform. Shown in panel A is the three-tier architecture developed to provide flexibility to handle evolving data relationships and to facilitate seamless data acquisition and dissemination. Shown in panel B is a schematic of the data model where the key data objects and their con-nectivity with other objects are indicated.

GBP

Molecule Pages

Glycan

Biological Source

or Sample

Literature

Citations

GBP

Mouse Phenotyping

Hematology, Histology,

Immunology & Metabolism

analysis

MALDI-MS Glyan Profiling

Analysis of Cells, Tissues

from human & mice

[including transgenic]

Glycan-GBP Interactions

Glycan array to screen GBPs for

primary glycan ligand specificities

and identify candidate ligands

Glyco-gene Expression

Focused DNA microarrays to obtain

gene expression of glycan

biosynthesis enzymes and GBPs

Glycosylation

Patways

Glycosyltransferase

Glycosyltransferase

Molecule Pages

Object-based Relational Database

[Back-End]

Middleware Application Layer

Seamless interface between

user and database

Static Web Pages

Quick updates to

Web resources

Dynamic Pages

Database driven

updates and queries

A

B

Dow

nloaded from https://academ

ic.oup.com/glycob/article/16/5/82R

/572357 by guest on 02 Decem

ber 2021

Page 3: Advancing glycomics: Implementation strategies at - Glycobiology

R. Raman et al.

84R

Oracle) with an object-oriented middleware application layer(implemented in Java environment) and a front-end userinterface (Figure 1A). The middleware application layerbridges the user interface to the underlying relational data-base in a seamless fashion and thus facilitates the data acqui-sition and dissemination process.

Two types of data acquisition tools have been imple-mented. First, tools have been developed to automaticallycapture information pertaining to GBPs and glycosyltrans-ferases from public databases, such as SwissProt, GenBank,LocusLink, and so on. Second, graphical user interface-based Web forms have been developed to enable aresearcher to directly deposit data into the database. Themiddleware application layer is designed to automaticallyannotate the input data based on the data model in the rela-tional database and organize the data sets into the appro-priate relational tables in the backend database.

Core B has also developed user-friendly interfaces to navi-gate the diverse CFG data sets in a seamless fashion. Thesedata dissemination interfaces were released to the public ear-lier this year fuelling the interests of numerous researchersand thus increasing the number of participating investigatorsin the CFG. The various CFG data dissemination interfacesare discussed in the following (URLs provided in Table I).

Scientific core data dissemination

As stated above, there are four primary data sets generatedby the CFG scientific Cores. Each of these data sets compriseof hierarchical layers of information starting from a sum-mary presentation (in PDF or Excel format) of the data tothe individual parameters and the data files. The dynamicdata dissemination interfaces facilitate the navigation of datasets through these hierarchical layers giving users access allthe way down to the individual raw data files. Furthermore,the standardized protocols for performing the different anal-yses by each scientific Core are also available on the frontpage of each of the dissemination interfaces.

Glycan analysis data

The Analytical Glycotechnology Core utilizes the matrix-assisted laser desorption/ionization mass spectrometry(MALDI-MS) methodology to obtain profiles of glycansderived from tissues and cells of mice (wild-type [Comelli et al.,2006] and knock-out strains) and humans. Each annotated

glycan structure in the MALDI-MS profile represents the bestlevel of information captured based on its mass, composition,biosynthesis, and biological source. More detailed analysissuch as tandem MS-MS fragmentation of specific mass ions(these ions are highlighted in a shaded box in the MALDI-MSprofile) is used to resolve isobaric structures. The MALDI-MSprofiling data provides a high throughput snapshot of the mostlikely glycan structures derived from specific cells and tissues.

The current dissemination interface of the glycan analysisdata organizes the sample information at different hierarchicallevels from species→tissue→sample→N-/O-linked glycan pro-file→high/low molecular weight glycans (in cases where highand low molecular weight glycans are analyzed separately).The final annotated spectrum is available in both JPG andPDF for viewing, comparison, and printing purposes. In addi-tion to the annotated spectra, the ASCII file containing m/zand intensity values and the raw binary format which is gener-ated by the MALDI-MS instrumentation is also accessible tothe user. Currently, the annotated image files are being con-verted to a digitized format comprising of the m/z peak, inten-sity, IUPAC representation of the most likely set of glycanstructures, and links to corresponding structures in the glycanstructures database. The novel structures that are obtained byrigorous structural characterization methods such as tandemMSn and gas chromatography-MS linkage analysis areentered directly into the glycan structures database.

Glyco-gene microarray data

The CFG has developed glyco-gene DNA microarrays toanalyze the expression of genes for GBP and glycan biosyn-thesis enzymes. The CFG glyco-gene microarrays havebeen successfully utilized by many participating investiga-tors (Smith et al., 2005; Comelli et al., 2006) to analyze awide range of samples such as different tumor cell lines,cells from wild-type and knock-out mice strains, cells undermechanical stress, and cells in other pathological states.

The data obtained from the glyco-gene DNA microarrayis organized based on the scope of the experiment, informa-tion on the samples used in the experiment (in the MAIMEstandard format), and the various data files. The data filesinclude the standard set of Affymetrix files—.CEL, .CHP,.DAT, .RPT and .EXP—and an ASCII-based read-out filecontaining probeset ID, signal intensity, present/absentcall, and expectation value for each probe in the gene chip.The ASCII-based readout file is translated by the middle-ware application layer, and the information regarding the

Table I. URLs for the CFG data dissemination pages

Description URL

CFG home page http://www.functionalglycomics.org

GBP molecule pages http://www.functionalglycomics.org/glycomics/molecule/jsp/gbpMolecule-home.jsp

Glycan database http://www.functionalglycomics.org/glycomics/molecule/jsp/carbohydrate/carbMoleculeHome.jsp

Glycosylation pathways http://www.functionalglycomics.org/static/gt/gtdb.shtml

Glycan profiling data http://www.functionalglycomics.org/glycomics/publicdata/glycoprofiling.jsp

Glycan array screening data http://www.functionalglycomics.org/glycomics/publicdata/primaryscreen.jsp

Gene microarray data http://www.functionalglycomics.org/glycomics/publicdata/microarray.jsp

Mouse line phenotyping data http://www.functionalglycomics.org/glycomics/publicdata/phenotyping.jsp Dow

nloaded from https://academ

ic.oup.com/glycob/article/16/5/82R

/572357 by guest on 02 Decem

ber 2021

Page 4: Advancing glycomics: Implementation strategies at - Glycobiology

Advancing glycomics

85R

expression of each gene is automatically captured in thetables of the relational database. The dissemination inter-face provides a structured navigation of the microarraydata sets starting from the experiment and sample informa-tion to downloading individual data files for each sample.Since the expression data is captured at the level of eachgene represented on the microarray, tools to download theexpression for a specific set of genes across the differentsamples for a given experiment have also been implementedas a part of the dissemination interface. Furthermore, thedissemination interface also includes links to perform low andhigh level data analysis using the downloaded microarraydata. The user-friendly dissemination of the gene microarraydata has motivated efforts to develop data-mining tools forthe prediction of glycan structures based on gene expres-sion profile (Kawano et al., 2005).

Mouse phenotype data

Advances in whole organism genetics have providedinsights into linking the role of glycosylation and glycandiversity to whole organism phenotype (Lowe and Marth,2003). Motivated by these efforts, the CFG dedicated twoscientific Core facilities to generate transgenic mice withknockouts in glycosyltransferase and GBP genes (MouseTransgenics Core) and to perform a battery of phenotypinganalyses on these transgenic mice (Mouse PhenotypingCore). The analyses done by the Mouse Phenotyping Corecan be classified into hematology, histology, immunology,metabolism, and behavior, and they provide a large volume ofnew data at the whole organism level. The data sets includefluorescence activated cell sorting analysis, histologicalstaining of different tissues, and a wide range of parameterspertaining to oxygen consumption, motor reflexes, bloodcount, and coagulation. As a result, diverse data formatssuch as PDF, JPG/TIFF images, and Excel spreadsheetsare uploaded into the database and are automatically orga-nized into the appropriate tables. In addition, a summaryon each of the broad type of analysis, which contains spe-cific interpretations of the Core Director along with the mostsignificant graphs or parameter values, is also accessible viathe database.

Similar to the other Core dissemination interfaces, thetopmost layer of the mouse phenotyping data interfacecomprises of a list of transgenic mice that have been pheno-typed along with the links to data summaries, experimentaldetails, and raw data files. The experimental details providea description of the tests performed and comprehensiveinformation on the mice used in the studies. The raw dataprovide access to individual parameters and image files(e.g., histological staining of individual tissues).

Glycan–protein interaction screening data

To expand the current knowledge of sequence-specificglycan–protein interactions, the Glycan Synthesis andGlycan–Protein Interaction Cores of the CFG are devel-oping glycan arrays comprising of hundreds of syntheticand biological glycans. Two different types of arrayformats have been developed. The first format is amicrowell-based array where the glycans are biotinylatedand applied to streptavidin-coated microwells. The sec-

ond format is a N-hydroxysuccinimide activated glassslide array where the glycans are covalently printed onthe slide using standard printing technologies availablefor making DNA microarrays (Blixt et al., 2004). Theprinted array has better signal to noise ratio and alsofacilitates expansion of the number of glycans that can beprinted on the plate. These glycan arrays are becomingwidely utilized by the scientific community for screeningseveral proteins such as plant and animal lectins, antibod-ies, and proteins on pathogen cell surfaces to identifynovel glycan ligand specificity of these proteins (Guoet al., 2004; Bochner et al., 2005; Tateno et al., 2005; vanVliet et al., 2005; Singh et al., 2006). More recently, a newtechnique to derivatize the glycans using a fluorescent taghas enabled developing glycan arrays with the quantifica-tion of each glycan on the array (Xia et al., 2005).

Annotation tools have been developed to automati-cally parse the readout files derived from the screeninganalysis and store the information on the glycan struc-ture, mean signal intensity, signal to noise ratio, andstandard error in the database. At the top level of thedata dissemination page, the information is organizedinto the protein analyzed, experiment information, andthe name of the investigator. The data prepared by theCore are available as an Excel format with some of thelegacy data containing PDF data summaries. An interac-tive interface that provides a two-dimensional false colorimaging of the array (based on signal intensities) and abar chart of signal intensity versus glycan ID is alsoavailable. This interface allows the user to seamlesslynavigate from the click on potential high signal intensityregions (in the two-dimensional representation and bargraph) to provide more information on that glycan struc-ture in the glycan structures database.

In addition to the hierarchical dissemination of thediverse data sets generated by each scientific Core, the rela-tionships between data sets at the level of the sample arealso captured in the relational database (Figure 2). Forexample, the CFG has begun integrating its tools to deriveorthogonal data sets such as gene expression profile andglycan analysis on tissues and pure cell populations derivedfrom human and mouse (Comelli et al., 2006). Such inte-gration enables researchers to begin correlating glycandiversity of a cell or tissue with the gene expression profileof the glycan biosynthesis enzymes and also to understandthe glycan ‘signature’ of phenotypically distinct cells andtissues derived from different mice strains.

GBP molecule pages

An emerging concept in data integration is the moleculepage interface which provides a portal to information anddata ranging from molecule to mouse (Li et al., 2002;Raman et al., 2005). The molecule page interface developedby CFG captures information pertaining to different fami-lies of human and mouse GBPs. The CFG classification ofGBP families is—C-type lectins, galectins, siglecs, and other.The CFG molecule pages contain three main components(1) automatic acquisition of information from other publicdatabases on that molecule, (2) automatic interface with

Dow

nloaded from https://academ

ic.oup.com/glycob/article/16/5/82R

/572357 by guest on 02 Decem

ber 2021

Page 5: Advancing glycomics: Implementation strategies at - Glycobiology

R. Raman et al.

86R

CFG data pertaining to that molecule, and (3) contributionfrom experts on that particular molecule. The GBP moleculepages are organized based on the above classification andsubfamilies. Each class is linked to a list of proteins in that

class, and each protein is linked to its appropriate moleculepage. The information pertaining to the protein is organizedinto six categories that are presented as clickable tabs in themolecule page interface.

Fig. 2. Integration of CFG data. The relational database and the three-tier architecture not only facilitate structuring the four major types of scientific Core data based on defined relationships, but they also facilitate interconnectivity between diverse data sets. An example how the histology staining of spleen tissue from wild-type and fucosyltransferase VII knockout is integrated with the glycan profiling of these tissues is highlighted.

Dow

nloaded from https://academ

ic.oup.com/glycob/article/16/5/82R

/572357 by guest on 02 Decem

ber 2021

Page 6: Advancing glycomics: Implementation strategies at - Glycobiology

Advancing glycomics

87R

The General tab comprises of name of the molecule, itssynonyms obtained from the SwissProt database, and asummary of the molecule that is contributed by experts(work is in progress to obtain expert contribution for mole-cule pages). The Reference tab comprises of link to NCBI’sPubMed database that automatically searches this databaseusing the name of the protein and synonyms as searchfields. It also contains links to other portals such as NCBI’sEntrez Gene, Source data portal (developed at StanfordUniversity). The Genome tab has information on the genename, the cDNA sequence with links to BLAST server forsequence alignment, gene expression profiles available inpublic databases such as NCBI’s Gene Expression Omni-bus database, and Symatlas database of Novartis founda-tion. Currently, work is in progress to interface the geneexpression data obtained from the CFG’s glyco-genemicroarrays.

The Proteome tab has information pertaining to the pro-tein sequence with links to the SwissProt and the PDB data-bases (if three-dimensional crystal structures are available).Furthermore, a schematic of the domain organization ofthe entire protein family is provided. The Glycome tab is anunique contribution from the CFG which provides infor-mation on candidate ligands or known counter receptorsfor that GBP (provided by expert contribution) and accessto glycan array data if that protein was screened on theCFG glycan array (Figure 3). More recently, using the toolsdeveloped to extract glycan structures from PDB (Luttekeet al., 2004), information on the glycan ligands used in thecrystal structure studies of that GBP (if available) withlinks to the appropriate PDB entries is provided. Finally,the Biology tab provides a summary of physiological andpathological roles of that GBP and an interface to CFGphenotyping analysis (if available) of transgenic mice with aknockout of that protein.

Glycan structures database

The CFG glycan structures database represents one of themany important efforts (Cooper et al., 2001; Hashimotoet al., 2005; Lutteke et al., 2005) to develop a standardizedrepository of glycan structures information. This databasewas developed to meet three main objectives. The firstobjective was to facilitate the assignment of peaks in theMALDI-MS glycan profiling of tissues. The second objec-tive was to capture relationship between candidate ligandson the glycan array and their corresponding binding pro-teins. The third objective was to capture information onglycan structures that are being published in the literature(Figure 1B). The current repository was built starting frommammalian structures in the CCSD, curated structuresobtained from a private database (developed by GlycomindsLtd., Lod, Israel). To this repository, the glycan structuresgenerated by CFG (Glycan Synthesis Core and those on theglycan array) were added along with synthesis protocols foreach glycan synthesized by Glycan Synthesis Core. Thisdatabase also comprises of a large number of theoreticallygenerated mammalian N-linked glycans based on biosynthesisrules that are primarily utilized for annotation of theMALDI-MS glycan profiles.

As stated earlier, the glycan array data have alreadybeen integrated with the glycan structures database. Uponaccessing a glycan on the glycan array, the list of proteinsfor which that structure was identified as a high-affinitybinder is also available along with the entire data set ofthat screening experiment. This integration not onlyenables the identification of a high-affinity glycan ligandfor a given protein, but it also provides information onwhat other proteins were identified as high-affinity bindersto the same glycan ligand. Currently, efforts are ongoing tointegrate the glycan-profiling data with the glycan struc-tures database.

The glycan structures in the database can be searched andretrieved using different search criteria such as molecularweight, composition, biological source, linear nomenclature,and citation. Another useful feature is the substructuresearch interface that allows users to either build from commontemplates (core and extension) or modify imported glycansubstructures or motifs and search for these motifs in thedatabase. This interface also facilitates development of gly-can biosynthesis pathway interface to the glycan structuresdatabase (see Glycosylation Pathways Interface) and enteringa new glycan structure into the database.

Given that the glycan structures databases represent acommon focus of all of the large glycomics initiatives, there isa practical need to develop standardized formats for exchangeof information on the structures. Collaborative efforts betweenlarge-scale initiatives are in progress to evaluate XMLformats (Kikuchi et al., 2005; Sahoo et al., 2005) for consis-tent description of glycan structures in different databases.

Glycosylation pathways interface

Understanding the glycosylation pathways involved inglycan biosynthesis is the primary link to understand thestructure–function relationship of glycans in the contextof how genotype influences whole organism phenotype.The KEGG (Hashimoto et al., 2005) and the CAZy(http://afmb.cnrs-mrs.fr/CAZY/) databases are currentlythe major sources of information on glycan biosynthesisenzymes. The CFG glycosylation pathways interface pro-vides a set of composite glycan structures representing thecore, terminal, and common extension units of N-linked,O-linked glycans and glycolipids. Each linkage in thecomposite structure interface is directly linked to the cor-responding family of the glycosyltransferase in the CAZydatabase. The KEGG database provides a pathway rep-resentation for the biosynthesis of different glycans.There are 98 genes corresponding to human glycosylationpathways that have been annotated (assigned to a specificglycosidic linkage) in the KEGG database.

To expand this annotation, the CFG is collaboratingwith CAZy, KEGG, and other experts. The current listcomprises around 200 annotated genes which will soonbecome available via the KEGG and CFG databases. Tofacilitate the inputs from the different initiatives, experts inthe CFG have constructed detailed composite structures ofcore, extension, and terminal regions of different glycans.Each monosaccharide in this composite structure and thelinkage between the specific monosaccharides are numbered

Dow

nloaded from https://academ

ic.oup.com/glycob/article/16/5/82R

/572357 by guest on 02 Decem

ber 2021

Page 7: Advancing glycomics: Implementation strategies at - Glycobiology

R. Raman et al.

88R

Fig. 3. GBP molecule pages. Shown is the example of the Glycome tab of the molecule page of human dendritic cell intercellular adhesion molecule grab-bing nonintegrin (DC-SIGN) which is a type II C-type lectin. The information on primary glycan specificity and proposed glycoprotein/glycolipid counter receptors is provided by expert investigators. Also shown is the link to the glycan array data generated for this protein which is automatically associated with the molecule page. Finally, the glycan ligands extracted from the PDB with the PDB identifiers of the protein–glycan complexes are shown.

Dow

nloaded from https://academ

ic.oup.com/glycob/article/16/5/82R

/572357 by guest on 02 Decem

ber 2021

Page 8: Advancing glycomics: Implementation strategies at - Glycobiology

Advancing glycomics

89R

to facilitate the annotation process (Figure 4). The CFG isdeveloping molecule page interfaces for glycosyltrans-ferases (analogous to the GBP molecule pages) with links tothe relevant information in KEGG and CAZy databases.The current CFG glycosylation pathway interfaces will bemodified to include the expanded composite structure,where clicking on each linkage would provide access to theglycosyltransferase molecule pages.

Summary

The beginning of this millennium has marked a promisingera for advancing glycomics where large-scale interna-tional initiatives such as CFG are generating valuableresources and data sets that are openly available to the sci-entific community. The establishment of these initiativeshas sparked an explosive growth of novel contributions tothe glycomics field. Central to the progress of these effortsis the development of bioinformatics platforms to acquireand disseminate diverse data sets across the world usinguser-friendly interfaces via the Internet. Another impor-tant aspect of these large glycomics initiatives is the moti-vation to take a collaborative approach for integrating theresources and data generated by the different initiatives aspointed out by the Editorial (2005) in Nature Methods.This collaboration and integration is critical for providing

access to diverse data sets in different databases thatenable scientists to begin answering important questionson glycan diversity, glycan–protein interactions that arefundamental to understanding the structure–function rela-tionships of glycans.

Acknowledgments

This work was supported by the National Institute of Gen-eral Medical Sciences Glue Grant U54 GM62116. Theauthors thank present and past members of the CFG Bioin-formatics Core including Ganesh Venkataraman, ChipongKwan, Eric Berry, Nishla Keiser, and Ishan Capila. In addi-tion, the authors thank the members of the other Cores, theSteering Committee, and the participating investigators ofthe CFG for their contribution to the Bioinformatics Core.

Conflict of interest statement

None declared.

Abbreviations

CCSD, Complex Carbohydrate Structure Database; CFG,consortium for functional glycomics; GBP, glycan-binding

Fig. 4. Composite structure glycosylation pathways. The composite structures of complex type N-linked glycan along with the common type II extension unit and the known terminal monosaccharides are shown. Each monosaccharide is numbered to facilitate the annotation of the family of glycosyltrans-ferases that are involved in the synthesis of the linkage between specific monosaccharides. The glycosyltransferase molecule pages are being developed and would be accessible by clicking on specific linkages in the composite structure interface.

Dow

nloaded from https://academ

ic.oup.com/glycob/article/16/5/82R

/572357 by guest on 02 Decem

ber 2021

Page 9: Advancing glycomics: Implementation strategies at - Glycobiology

R. Raman et al.

90R

protein; MALDI-MS, matrix-assisted laser desorption/ion-ization mass spectrometry.

References

Blixt, O., Head, S., Mondala, T., Scanlan, C., Huflejt, M.E., Alvarez, R.,Bryan, M.C., Fazio, F., Calarese, D., and others. (2004) Printed cova-lent glycan array for ligand profiling of diverse glycan binding pro-teins. Proc. Natl. Acad. Sci. U. S. A., 101, 17033–17038.

Bochner, B.S., Alvarez, R.A., Mehta, P., Bovin, N.V., Blixt, O., White,J.R., and Schnaar, R.L. (2005) Glycan array screening reveals a candi-date ligand for Siglec-8. J. Biol. Chem., 280, 4307–4312.

Collins, B.E. and Paulson, J.C. (2004) Cell surface biology mediated bylow affinity multivalent protein-glycan interactions. Curr. Opin. Chem.Biol., 8, 617–625.

Comelli, E.M., Head, S.R., Gilmartin, T., Whisenant, T., Haslam, S.M.,North, S.J., Wong, N.K., Kudo, T., Narimatsu, H., Esko, J.D., andothers. (2006) A focused microarray approach to functional glycomics:transcriptional regulation of the glycome. Glycobiology, 16, 117–131.

Cooper, C.A., Harrison, M.J., Wilkins, M.R., and Packer, N.H. (2001)GlycoSuiteDB: a new curated relational database of glycoprotein glycanstructures and their biological sources. Nucleic Acids Res., 29, 332–335.

Doubet, S., Bock, K., Smith, D., Darvill, A., and Albersheim, P. (1989)The complex carbohydrate structure database. Trends Biochem. Sci.,14, 475–477.

Editorial (2005) Sweet collaborations. Nat. Methods, 2, 799.Guo, Y., Feinberg, H., Conroy, E., Mitchell, D.A., Alvarez, R., Blixt, O.,

Taylor, M.E., Weis, W.I., and Drickamer, K. (2004) Structural basisfor distinct ligand-binding and targeting properties of the receptorsDC-SIGN and DC-SIGNR. Nat. Struct. Mol. Biol., 11, 591–598.

Hashimoto, K., Goto, S., Kawano, S., Aoki-Kinoshita, K.F., Ueda, N.,Hamajima, M., Kawasaki, T., and Kanehisa, M. (2005) KEGG as aglycome informatics resource. Glycobiology. Epub ahead of print.

Kawano, S., Hashimoto, K., Miyama, T., Goto, S., and Kanehisa, M.(2005) Prediction of glycan structures from gene expression data basedon glycosyltransferase reactions. Bioinformatics, 21, 3976–3982.

Kikuchi, N., Kameyama, A., Nakaya, S., Ito, H., Sato, T., Shikanai, T.,Takahashi, Y., and Narimatsu, H. (2005) The carbohydrate sequencemarkup language (CabosML): an XML description of carbohydratestructures. Bioinformatics, 21, 1717–1718.

Li, J., Ning, Y., Hedley, W., Saunders, B., Chen, Y., Tindill, N., Hannay, T.,and Subramaniam, S. (2002) The molecule pages database. Nature, 420,716–717.

Lowe, J.B. and Marth, J.D. (2003) A genetic approach to Mammalian gly-can function. Annu. Rev. Biochem., 72, 643–691.

Lutteke, T., Bohne-Lang, A., Loss, A., Goetz, T., Frank, M., and von derLieth, C.W. (2005) GLYCOCIENCES.de: an internet portal to sup-port glycomics and glycobiology research. Glycobiology. Epub aheadof print.

Lutteke, T., Frank, M., and von der Lieth, C.W. (2004) Data mining theprotein data bank: automatic detection and assignment of carbohy-drate structures. Carbohydr. Res., 339, 1015–1020.

Raman, R., Raguram, S., Venkataraman, G., Paulson, J.C., andSasisekharan, R. (2005) Glycomics: an integrated systems approach tostructure-function relationships of glycans. Nat. Methods, 2, 817–824.

Sahoo, S.S., Thomas, C., Sheth, A., Henson, C., and York, W.S. (2005)GLYDE-an expressive XML standard for the representation of glycanstructure. Carbohydr. Res., 340, 2802–2807.

Singh, T., Wu, J.H., Peumans, W.J., Rouge, P., Van Damme, E.J.,Alvarez, R.A., Blixt, O., and Wu, A.M. (2006) Carbohydrate specific-ity of an insecticidal lectin isolated from the leaves of Glechoma heder-acea (ground ivy) towards mammalian glycoconjugates. Biochem. J.,393, 331–341.

Smith, F.I., Qu, Q., Hong, S.J., Kim, K.S., Gilmartin, T.J., and Head,S.R. (2005) Gene expression profiling of mouse postnatal cerebellardevelopment using oligonucleotide microarrays designed to detectdifferences in glycoconjugate expression. Gene Expr. Patterns, 5,740–749.

Tateno, H., Crocker, P.R., and Paulson, J.C. (2005) Mouse Siglec-F andhuman Siglec-8 are functionally convergent paralogs that are selec-tively expressed on eosinophils and recognize, 6′-sulfo-sialyl Lewis Xas a preferred glycan ligand. Glycobiology, 15, 1125–1135.

Taylor, M.E. and Drickamer, K. (2003) Introduction to Glycobiology.Oxford University Press, Oxford and New York.

van Vliet, S.J., van Liempt, E., Saeland, E., Aarnoudse, C.A., Appelmelk, B.,Irimura, T., Geijtenbeek, T.B., Blixt, O., Alvarez, R., van Die, I., andvan Kooyk, Y. (2005) Carbohydrate profiling reveals a distinctive rolefor the C-type lectin MGL in the recognition of helminth parasites andtumor antigens by dendritic cells. Int. Immunol., 17, 661–669.

Varki, A., Cummings, R., Esko, J., Freeze, H., Hort E., and Marth, T.(1999) Essentials of Glycobiology. Cold Spring Harber LaboratoryPress, New York.

von der Lieth, C.W., Bohne-Lang, A., Lohmann, K.K., and Frank, M.(2004) Bioinformatics for glycomics: status, methods, requirementsand perspectives. Brief. Bioinform., 5, 164–178.

Xia, B., Kawar, Z.S., Ju, T., Alvarez, R.A., Sachdev, G.P., andCummings, R.D. (2005) Versatile fluorescent derivatization of glycansfor glycomic analysis. Nat. Methods, 2, 845–850.

Dow

nloaded from https://academ

ic.oup.com/glycob/article/16/5/82R

/572357 by guest on 02 Decem

ber 2021