adgo 2.0: interpreting microarray data and list of genes...

ADGO 2.0: interpreting microarray data and listof genes using composite annotationsSang-Mun Chi1, Jin Kim2, Seon-Young Kim3,* and Dougu Nam4,*

1School of Computer Science and Engineering, Kyungsung University, Busan, 2School of Computer Scienceand Engineering, Seoul National University, Seoul, 3Medical Genomics Research Center, Korea ResearchInstitute of Bioscience and Biotechnology, Daejeon and 4School of Nano-Bioscience and Chemical Engineering,UNIST, Ulsan, Rep. of Korea

Received February 21, 2011; Revised April 19, 2011; Accepted May 3, 2011

ABSTRACT

ADGO 2.0 is a web-based tool that provides com-posite interpretations for microarray data compar-ing two sample groups as well as lists of genes fromdiverse sources of biological information. Someother tools also incorporate composite annotationssolely for interpreting lists of genes but usuallyprovide highly redundant information. This new ver-sion has the following additional features: first,it provides multiple gene set analysis methods formicroarray inputs as well as enrichment analysesfor lists of genes. Second, it screens redundantcomposite annotations when generating and pri-oritizing them. Third, it incorporates union and sub-tracted sets as well as intersection sets. Lastly,users can upload their own gene sets (e.g. predictedmiRNA targets) to generate and analyze new com-posite sets. The first two features are unique toADGO 2.0. Using our tool, we demonstrate analysesof a microarray dataset and a list of genes for T-celldifferentiation. The new ADGO is available at http://www.btool.org/ADGO2.

INTRODUCTION

High-throughput omics experiments often produce lists ofgenes, and their biological interpretations have been ofsubstantial interest. Typical approaches examine the extentof the overlap between a list of genes and predefinedannotated gene sets using hypergeometric distribution,chi-square or Fisher’s exact test, which may be dubbedcollectively as gene list analysis (GLA) (1). For micro-arrays, each gene has its own score (e.g. two sample t-stat-istic or fold-change value) and an alternative approach,called gene set analysis (GSA), is applicable without select-ing a list of genes (2). In many cases, the ‘interpretation’ of

large-scale data indicates investigating the enrichment ofpre-set knowledge within the given data. Accordingly,such enrichment analyses are widespread over omics re-search regardless of the data analyzed (microarray, massspectrometry, ChIP-chip or next-generation sequencing).In addition, a number of algorithms and tools have beendeveloped in this context (1–3).

In both approaches (GSA and GLA), the predefinedgene sets play key roles in biological interpretations.Such gene sets are usually derived from biological data-bases such as Gene Ontology (4) or KEGG (5), where theyshare a common biological annotation for pathways, func-tions, cellular localizations or targets of a common tran-scription factor (TF), for instance. One importantproblem with most existing methods is that they handleonly gene sets with unary annotations, thus limiting thediscriminating power of the method employed. Forexample, suppose we want to examine whether a givenlist of genes is enriched with the putative targets of someTF. Because most gene sets that share a common TFbinding site are dominated with false positive targets,this simple approach may not be very successful whenused to uncover the relevant TFs. However, if we takeintersections between the putative TF target sets and thegene sets of Gene Ontology, some of them may definebiologically more relevant gene sets, which then may beenriched with the gene list. With this rationale, compositeannotation gene sets were introduced for GSA (6) andGLA (7), respectively. Thereafter, several software toolswere developed for GLA based on composite annotations(8–10). ADGO (6) and ProfCom (9) use Boolean set op-erations (intersection, union and subtraction) to generatecomposite gene sets, and GENECODIS (11) andCOFECO (10) employ an association rule-mining algo-rithm to extract co-occurring annotations. In any case,the composite interpreters usually display quite a longand redundant list of significant gene sets, many ofwhich largely overlap each other. Therefore, removing

*To whom correspondence should be addressed. D.Nam. Tel: +82 52 217 2525; Fax: +82 52 217 2509; Email: [email protected] may also be addressed to Seon-Young Kim. Tel: +82 42 879 8116; Fax: +82 42 879 8119; Email: [email protected]

W302–W306 Nucleic Acids Research, 2011, Vol. 39, Web Server issue Published online 29 May 2011doi:10.1093/nar/gkr392

� The Author(s) 2011. Published by Oxford University Press.This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

at Ulsan N

atl Inst of Science & T

echnology on August 9, 2013

http://nar.oxfordjournals.org/D

ownloaded from

http://nar.oxfordjournals.org/

redundancy and abstraction appear to be an importantissue when utilizing composite annotations. Here, wesuggest three criteria for filtering composite gene sets forGSA and GLA.

(1) If a composite set is largely overlapped with somesingle set over a threshold, that set should be removeda priori. In other words, if the members of sets arevery similar to each other, the single annotationshould have priority.

(2) A significant intersection or union set should bescreened, if any of the single sets used to generatethem are also significant.

(3) A significant subtracted set should be screened, if thesingle set that contains the subtracted set is alsosignificant.

Taking into account these considerations, we constructedweb-based software called ADGO 2.0 to provide compos-ite interpretations for both microarrays and lists of genes.The previous version of ADGO was designed to illustratethe idea of using composite annotations for GSA andprovided a single GSA method (6). The current versionwas totally rebuilt considering the automatic updating of(composite) gene sets, and extended in terms of bothcoverage and methods.

MATERIALS AND METHODS

Supported analyses

ADGO 2.0 currently supports analyses of eight popularorganisms (Homo sapiens,Musmusculus,Rattus norvegicus,Saccharomyces cerevisiae, Arabidopsis thaliana,Caenorhabditis elegans, Drosophila melanogaster andEscherichia coli.), and from four to seven kinds of anno-tations (GO terms for biological processes, cellular com-ponents, and molecular functions, KEGG pathways,chromosome and cis-regulatory motifs, and OMIM) areprovided depending on the species selected. The user canchoose one of the four methods (Z-statistic, gene permu-tation, sample permutation and Gene Set EnrichmentAnalysis (GSEA)) for GSA and the two methods(Fisher’s exact test and hypergeometric distribution) forGLA. Only applicable methods are displayed dependingon the format of input data.

Construction of annotation gene set databases

For all of the gene sets from the seven annotation categ-ories included in ADGO 2.0, we applied three types ofBoolean set operations (intersection, union and subtrac-tion) to each pair of gene sets across different categories:the 10–20% rule was applied for intersections and subtrac-tion (6). In other words, a pair of single gene sets wasrequired to have at least 10 genes in common and thetwo subtracted sets were required to contain at least20% of the genes in each single set. Union operationswere applied if a pair of gene sets has five or moreelements in common. The ‘subtraction’ of two annotationsets A and B is denoted by A� B, which is the intersectionof A and the complement of B. To ensure the generated

composite set is a genuinely novel set, it was comparedwith each single set and discarded if it has any overlapwith some single set over a threshold (‘FilteringComposite Sets’). The overlap (%) is computed by theportion of intersection between the composite set and asingle set over the union of these two sets. For the 60%threshold, �27% of all composite sets are screened for thethree categories of Gene Ontology. All of these single andcomposite gene sets are prepared in the server in advanceand retrieved according to the user’s choice of gene setcategories. One important feature of ADGO 2.0 is thatthe user can upload and analyze his/her own annotationgene sets. If the user chooses some of the built-in gene setsand uploads user gene set data, the server then generatesad hoc composite sets and shows the computation results.For this reason, it takes much more time for analyzing theuser’s gene sets.

Processing methods

If the user uploads microarray data or a list of genes, theserver detects the file format and displays relevant analysismethods and other options. For a microarray input, fourgene set analysis methods are available. Among them,‘Z-test’ (12) and ‘Gene permutation’ (13) are gene ran-domization methods, and ‘Sample permutation’ (13) and‘GSEA’ (14) are sample randomization methods. We usedthe average t-value for the set score in the gene or samplepermutation methods. The Z-test is a parametric methodand is the fastest. GSEA is the most widely used butusually takes more time for computing. This becomesproblematic when analyzing composite annotations, asthe number of gene sets to be handled increases in a quad-ratic manner against the number of usual single annota-tions. For this reason, we newly realized the algorithmin C++. We fixed the power of the gene score as p=1to deploy the weighted Kolmogorov–Smirnov gene setstatistic. For a gene list input, the ‘Fisher’s exact test’and ‘Hypergeometric distribution’ are provided for theanalysis method.For both GSA and GLA, we provide two types of fil-

tering methods for significant composite sets: The ‘StrongType’ and the ‘Weak Type’. For the strong-type option, asignificant composite set is screened if a single set involvedin generating the composite set is also significant. For theweak-type option, a significant composite set is displayedif it has a smaller P-value than those of the individualsingle sets. Therefore, the weak-type option yields morecomposite sets that are significant.

Input data types and options

For microarray input, the user can upload microarraydata with two sample groups. The first column shouldbe the header for gene IDs and the sample data valuesshould follow in the next columns. ADGO 2.0 acceptsboth single and dual channel gene IDs for microarrayinput. For a single channel input, the probe IDs forAffymetrix, Illumina and Agilent chips are supported.For a dual channel input, five types of gene IDs (genesymbol, Ensemble, Entrez, Refseq and Uniprot) as wellas the systematic names for Saccaromyces cerevisiae are

Nucleic Acids Research, 2011, Vol. 39, Web Server issue W303

at Ulsan N




ownloaded from


supported. The first sample group data should appear inthe first k columns of data values and the second groupdata should follow in the next l columns. k and l should bespecified in the ‘Sample Size’ option. ADGO 2.0 thencomputes the two-sample t-statistic or average fold-change values for each gene and proceeds. We haveanother option for microarray input. If the user wantsto use gene scores other than the two-sample t-statisticor fold-change values, he/she can directly use the genescore data (a single value for each gene).For a gene list input, the same dual channel IDs are

supported. We constructed the reference (background)genes for GLA by merging all the genes contained ineach annotation set. In any type of input data, the usercan also paste the data into the ‘Paste data panel’ withoutuploading the data file. More detailed information is avail-able from our web site (http://www.btool.org/ADGO2).

Outputs

If the user executes the analysis, the server shows a list ofsignificant gene sets (single and composite) along withgene set names, gene set id’s, members of each gene set,P-values, False Discovery Rate (FDR) Q-values andBonferroni’s corrected P-values (Figure 1). The membersof each gene set are listed in a descending order of theirassociation strength. The gene set list is sorted accordingto the FDR Q-values. Certainly, composite annotationsincrease the number of annotation sets to be analyzed tochange the analysis results more or less. However, theFDR Q-values reflect the increased number of gene sets

and provide the adjusted significance threshold. The com-putation results are also downloadable as a text file.

Analysis example

We present an example of gene expression data analysisby ADGO 2.0 to demonstrate its utility. Schones et al. (15)compared the gene expression patterns between restingand activated T cells to understand the molecularchanges that occur during the T-cell differentiation(GEO number: GSE10437). We computed the foldchanges of each gene in the microarray data to test aGSA. We chose the ‘Z test’ method (Q-value cutoff:0.01) and two annotation categories: KEGG andChromosome. We checked three types of composite sets:‘Single’ + ‘Intersection’ + ‘Subtraction’. Many KEGGpathways related to immune responses including‘Autoimmune thyroid disease’, ‘Antigen processing andpresentation’ and ‘Graft-versus-host disease’ were not sig-nificant by themselves when we used only KEGGcategories. However, they were significantly induced ifwe excluded the genes in the chromosome 6p21.3 set.Interestingly, the same immune-related gene sets were sig-nificantly downregulated when intersected with the 6p21.3set. This suggests that some part of the chromosomeregion 6p21.3 is locked during T-cell differentiationwhile other immune-related genes outside this region areactivated. Indeed, this intersection set contained manyHLA (human leukocyte antigen) genes that were mostlydownregulated [i.e. HLA-DPB1 (�3.19), HLA-DRA(�1.97), HLA-DMA (�1.92), HLA-DRB1 (�1.71),

Figure 1. Z-test results for the T-cell differentiation data set. See the text for an explanation and detailed options. If the user clicks ‘view’, themembers of each significant annotation set as well as their scores are shown.

W304 Nucleic Acids Research, 2011, Vol. 39, Web Server issue

at Ulsan N




ownloaded from


HLA-E (�1.66), HLA-DBM (�1.54)], significantly affect-ing the overall patterns of immune-related gene sets. SeeFigure 1 for the list of significant gene sets. This exampleillustrates how bi-directional expression patterns within agene set can be described precisely using compositeannotations.

We then selected the 200 most induced genes to test aGLA. Using Fisher’s exact test and the three GeneOntology categories, we interpreted the list. In the firsttrial, we chose the ‘Strong Type’ option, and we obtaineda list of gene sets associated with the input list, many ofwhich, as a single set, had strong relevance with T-celldifferentiation. Examples include the cytokine-relatedprocesses ‘JAK-STAT cascade’, ‘regulation of tyrosinephosphorylation’, ‘immunoglobulin production’, ‘regula-tion of T cell activation’ and ‘T-cell proliferation’ (15,16).Additionally, some subtracted sets associated with ‘ribo-some’ were detected. We then chose the ‘Weak Type’ toinvestigate more specific patterns within each single set.Figure 2 shows the computation results. Several inter-sected and subtracted gene sets appeared on top ranks.Most ‘tyrosine phosphorylation’ gene sets showedstronger patterns when intersected with ‘growth factoractivity’. For example, ‘regulation of peptidyl-tyrosinephosphorylation’ had a Q-value 1:377� 10�4 in thestrong-type analysis, but the intersection of ‘regulationof peptidyl-tyrosine phosphorylation’ and ‘growth factoractivity’ had a much better Q-value of 7:208� 10�7 in theweak-type analysis. The former single set originally con-tained 83 genes in total and actually included eightmembers from the input list, while the latter intersectionset had a much smaller number of genes (25 in total), butincluded seven members from the input list. This featuremakes the composite sets more precise descriptors of theenrichment patterns.

DISCUSSION AND CONCLUSION

ADGO 2.0 is currently a unique tool that supports GSAmethods based on composite annotations. We added GLAmethods to this new version. It provides several widelyused GSA methods including fast GSEA (14). Severalother tools [e.g. GENECODIS (11), ProfCom (9), andCOFECO (10)] provide analyses via composite annota-tions only for GLA. GENECODIS and COFECOemploy the same type of algorithm and focus on the inter-section of two or more annotation gene sets, whileProfCom generates more general types of composite setsusing Boolean set operations (up to five single sets).Shortcomings with these tools are that they display allthe significant gene sets without filtering redundant infor-mation (GENECODIS and COFECO) or a partial list ofgene sets identified by a greedy search algorithm(ProfCom). ADGO 2.0 generates composite gene setsbased on Boolean operations of two overlapping singlesets and screens composite sets that have redundant infor-mation for both GSA and GLA. We may also considercomposite sets generated by three or more single sets asProfCom does, but this will increase the computationalcomplexity prohibitively and make it quite complicatedto establish legitimate rules (e.g. inclusion and exclusionrules) to screen the redundant information.Using our tool, we demonstrated how to incorporate

and interpret significant composite annotations whenanalyzing microarray data and list of genes. Note thatmany significant composite terms are hard to interpretclearly. In most cases, we may not find evidence fromthe literature because complex biological patterns havebeen rarely explored so far. Therefore, our tool may beused for an explorative research. If a composite pattern isobserved repeatedly across many data sets, it may bevalidated experimentally.

Figure 2. Enriched gene sets for a list of upregulated genes in T-cell differentiation. The weak-type filtering criterion is applied for significantcomposite sets.

Nucleic Acids Research, 2011, Vol. 39, Web Server issue W305

at Ulsan N




ownloaded from


One interesting future work with our tool may be toinvestigate the regulatory interactions between regulators(TF or miRNA) and pathways using gene expressionprofiles. Because sequence-based predictions of thetargets of TF or miRNA inevitably include abundantfalse positives, taking the intersections of the putativetarget genes with other gene sets may be useful forexploring specific patterns in regulatory networks.

FUNDING

Basic Science Research Program through a NationalResearch Foundation (NRF) grant funded by theKorean government (MEST) (No. 2011-0015107);National R & D Program for Cancer Control of theMinistry for Health and Welfare, Republic of Korea(1020360) (to S.-Y.K.). Funding for open access charge:NRF grant funded by Korean government (No. 2011-0015107).

Conflict of interest statement. None declared.

REFERENCES

1. Khatri,P. and Draghici,S. (2005) Ontological analysis of geneexpression data: current tools, limitations, and open problems.Bioinformatics, 21, 3587–3595.

2. Nam,D. and Kim,S.Y. (2008) Gene-set approach for expressionpattern analysis. Brief. Bioinform., 9, 189–197.

3. Huang da,W., Sherman,B.T. and Lempicki,R.A. (2009)Bioinformatics enrichment tools: paths toward the comprehensivefunctional analysis of large gene lists. Nucleic Acids Res., 37, 1–13.

4. Ashburner,M., Ball,C.A., Blake,J.A., Botstein,D., Butler,H.,Cherry,J.M., Davis,A.P., Dolinski,K., Dwight,S.S., Eppig,J.T.et al. (2000) Gene ontology: tool for the unification of biology.The Gene Ontology Consortium. Nat. Genet., 25, 25–29.

5. Kanehisa,M., Araki,M., Goto,S., Hattori,M., Hirakawa,M.,Itoh,M., Katayama,T., Kawashima,S., Okuda,S., Tokimatsu,T.

et al. (2008) KEGG for linking genomes to life and theenvironment. Nucleic Acids Res., 36, D480–484.

6. Nam,D., Kim,S.B., Kim,S.K., Yang,S., Kim,S.Y. and Chu,I.S.(2006) ADGO: analysis of differentially expressed gene sets usingcomposite GO annotation. Bioinformatics, 22, 2249–2253.

7. Antonov,A.V. and Mewes,H.W. (2006) Complex functionality ofgene groups identified from high-throughput data. J. Mol. Biol.,363, 289–296.

8. Carmona-Saez,P., Chagoyen,M., Tirado,F., Carazo,J.M. andPascual-Montano,A. (2007) GENECODIS: a web-based toolfor finding significant concurrent annotations in gene lists.Genome Biol., 8, R3.

9. Antonov,A.V., Schmidt,T., Wang,Y. and Mewes,H.W. (2008)ProfCom: a web tool for profiling the complex functionalityof gene groups identified from high-throughput data.Nucleic Acids Res., 36, W347–W351.

10. Sun,C.H., Kim,M.S., Han,Y. and Yi,G.S. (2009) COFECO:composite function annotation enriched by protein complex data.Nucleic Acids Res., 37, W350–W355.

11. Nogales-Cadenas,R., Carmona-Saez,P., Vazquez,M., Vicente,C.,Yang,X., Tirado,F., Carazo,J.M. and Pascual-Montano,A. (2009)GeneCodis: interpreting gene lists through enrichmentanalysis and integration of diverse biological information.Nucleic Acids Res., 37, W317–W322.

12. Kim,S.Y. and Volsky,D.J. (2005) PAGE: parametric analysis ofgene set enrichment. BMC Bioinformatics, 6, 144.

13. Tian,L., Greenberg,S.A., Kong,S.W., Altschuler,J., Kohane,I.S.and Park,P.J. (2005) Discovering statistically significant pathwaysin expression profiling studies. Proc. Natl Acad. Sci. USA, 102,13544–13549.

14. Subramanian,A., Tamayo,P., Mootha,V.K., Mukherjee,S.,Ebert,B.L., Gillette,M.A., Paulovich,A., Pomeroy,S.L.,Golub,T.R., Lander,E.S. et al. (2005) Gene set enrichmentanalysis: a knowledge-based approach for interpretinggenome-wide expression profiles. Proc. Natl Acad. Sci. USA, 102,15545–15550.

15. Schones,D.E., Cui,K., Cuddapah,S., Roh,T.Y., Barski,A.,Wang,Z., Wei,G. and Zhao,K. (2008) Dynamic regulationof nucleosome positioning in the human genome. Cell, 132,887–898.

16. O’Shea,J.J. and Murray,P.J. (2008) Cytokine signaling modules ininflammatory responses. Immunity, 28, 477–487.

W306 Nucleic Acids Research, 2011, Vol. 39, Web Server issue

at Ulsan N




ownloaded from


adgo 2.0: interpreting microarray data and list of genes...

Documents