annotations - arxiv.org · annotations kimberly glass1 ;2 3, and michelle girvan 4 5 1biostatistics...

15
Finding New Order in Biological Functions from the Network Structure of Gene Annotations Kimberly Glass 1,2,3* , and Michelle Girvan 3,4,5 1 Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA 2 Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA 3 Physics Department, University of Maryland, College Park, MD, USA 4 Institute for Physical Science and Technology, University of Maryland, College Park, MD, USA 5 Santa Fe Institute, Santa Fe, NM, USA The Gene Ontology (GO) provides biologists with a controlled terminology that describes how genes are associated with functions and how functional terms are related to each other. These term-term relationships encode how scientists conceive the organization of biological functions, and they take the form of a directed acyclic graph (DAG). Here, we propose that the network structure of gene-term annotations made using GO can be employed to establish an alternate natural way to group the functional terms which is different from the hierarchical structure established in the GO DAG. Instead of relying on an externally defined organization for biological functions, our method connects biological functions together if they are performed by the same genes, as indicated in a compendium of gene annotation data from numerous different experiments. We show that grouping terms by this alternate scheme is distinct from term relationships defined in the ontological structure and provides a new framework with which to describe and predict the functions of experimentally identified sets of genes. Tools and supplemental material related to this work can be found online at http://www.networks.umd.edu. 1. INTRODUCTION The Gene Ontology (GO) [4][34] has been around for over a decade, during which time it has been widely uti- lized both to validate and to predict the results of biolog- ical experiments (see, for example [11, 17, 19, 20, 25, 38, 39]). The structure of the ontology, where different “cate- gories” or terms are related to each other in a hierarchical fashion, provides a well-established format with which to classify and subclassify all biological functions and pro- cesses. This classification approach is well-structured and well-characterized, however, we seek to determine if it is the only natural way in which to classify this type of biological information. We address two main questions. First, does there exist another natural way to organize the functional terms that is distinct from the ontological organization? Secondly, if such an alternate classification exists, can it be used to interpret biological data? In recent years, complex networks tools have been used alongside traditional bioinformatics techniques to study many different kinds of biological networks [27], including, but not limited to, gene regulatory networks [24, 33], protein-protein interaction networks [18, 36], and metabolic networks [16, 40]. Developments in network theory provide the computational tools needed to calcu- late global properties of such networks, lending insights into the behavior of the systems represented by these * contact: [email protected] networks. For example, many networks exhibit commu- nity structure, meaning that there are clusters of nodes in the network within which edges are relatively dense [13]. Within the field of complex networks, many recent research papers [21, 26, 28, 31] have focused on the de- velopment of methods to detect such module in various types of networks [26][21] in a computationally efficient and accurate manner [6]. In this study, we leverage the community structure in gene annotation networks to de- velop an endogenous organization of biological functions. Ontologies are utilized across many disciplines in- cluding economics, artificial intelligence, engineering, li- brary science, and biomedical informatics (for example, [5, 22, 32]). The Gene Ontology, specifically, describes the relationships between different biological concepts or functions [4]. It breaks these concepts into three main do- mains, or distinct ontologies: “Biological Process” (BP), describing sets of molecular events, “Molecular Func- tion” (MF), describing the activities of gene products, and “Cellular Component” (CC), describing parts of a cell or its external environment. Each of the three pri- mary domains in GO takes the form of a directed acyclic graph (DAG), in which “child” functional categories, or “terms”, are subclassified under one or more “par- ent” terms. Each parent and all its subsequent progeny therefore define multiple, overlapping, sets of terms, or “branches” in GO. Using GO, genes are annotated to in- dividual terms representing their particular role in a cell, and these annotations are transitive up the relationships in the DAG such that each “parent” term takes on all the gene annotations associated with any of its progeny [35]. arXiv:1210.0024v2 [q-bio.QM] 3 May 2013

Upload: others

Post on 21-Jul-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Annotations - arxiv.org · Annotations Kimberly Glass1 ;2 3, and Michelle Girvan 4 5 1Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA 2Department

Finding New Order in Biological Functions from the Network Structure of GeneAnnotations

Kimberly Glass 1,2,3∗, and Michelle Girvan 3,4,5

1Biostatistics and Computational Biology,Dana-Farber Cancer Institute, Boston, MA, USA

2Department of Biostatistics,Harvard School of Public Health, Boston, MA, USA

3Physics Department, University of Maryland,College Park, MD, USA

4Institute for Physical Science and Technology,University of Maryland, College Park, MD, USA

5Santa Fe Institute, Santa Fe, NM, USA

The Gene Ontology (GO) provides biologists with a controlled terminology that describes howgenes are associated with functions and how functional terms are related to each other. Theseterm-term relationships encode how scientists conceive the organization of biological functions, andthey take the form of a directed acyclic graph (DAG). Here, we propose that the network structureof gene-term annotations made using GO can be employed to establish an alternate natural way togroup the functional terms which is different from the hierarchical structure established in the GODAG. Instead of relying on an externally defined organization for biological functions, our methodconnects biological functions together if they are performed by the same genes, as indicated in acompendium of gene annotation data from numerous different experiments. We show that groupingterms by this alternate scheme is distinct from term relationships defined in the ontological structureand provides a new framework with which to describe and predict the functions of experimentallyidentified sets of genes. Tools and supplemental material related to this work can be found onlineat http://www.networks.umd.edu.

1. INTRODUCTION

The Gene Ontology (GO) [4][34] has been around forover a decade, during which time it has been widely uti-lized both to validate and to predict the results of biolog-ical experiments (see, for example [11, 17, 19, 20, 25, 38,39]). The structure of the ontology, where different “cate-gories” or terms are related to each other in a hierarchicalfashion, provides a well-established format with which toclassify and subclassify all biological functions and pro-cesses. This classification approach is well-structured andwell-characterized, however, we seek to determine if it isthe only natural way in which to classify this type ofbiological information. We address two main questions.First, does there exist another natural way to organizethe functional terms that is distinct from the ontologicalorganization? Secondly, if such an alternate classificationexists, can it be used to interpret biological data?

In recent years, complex networks tools have beenused alongside traditional bioinformatics techniques tostudy many different kinds of biological networks [27],including, but not limited to, gene regulatory networks[24, 33], protein-protein interaction networks [18, 36], andmetabolic networks [16, 40]. Developments in networktheory provide the computational tools needed to calcu-late global properties of such networks, lending insightsinto the behavior of the systems represented by these

∗contact: [email protected]

networks. For example, many networks exhibit commu-nity structure, meaning that there are clusters of nodesin the network within which edges are relatively dense[13]. Within the field of complex networks, many recentresearch papers [21, 26, 28, 31] have focused on the de-velopment of methods to detect such module in varioustypes of networks [26][21] in a computationally efficientand accurate manner [6]. In this study, we leverage thecommunity structure in gene annotation networks to de-velop an endogenous organization of biological functions.

Ontologies are utilized across many disciplines in-cluding economics, artificial intelligence, engineering, li-brary science, and biomedical informatics (for example,[5, 22, 32]). The Gene Ontology, specifically, describesthe relationships between different biological concepts orfunctions [4]. It breaks these concepts into three main do-mains, or distinct ontologies: “Biological Process” (BP),describing sets of molecular events, “Molecular Func-tion” (MF), describing the activities of gene products,and “Cellular Component” (CC), describing parts of acell or its external environment. Each of the three pri-mary domains in GO takes the form of a directed acyclicgraph (DAG), in which “child” functional categories,or “terms”, are subclassified under one or more “par-ent” terms. Each parent and all its subsequent progenytherefore define multiple, overlapping, sets of terms, or“branches” in GO. Using GO, genes are annotated to in-dividual terms representing their particular role in a cell,and these annotations are transitive up the relationshipsin the DAG such that each “parent” term takes on all thegene annotations associated with any of its progeny [35].

arX

iv:1

210.

0024

v2 [

q-bi

o.Q

M]

3 M

ay 2

013

Page 2: Annotations - arxiv.org · Annotations Kimberly Glass1 ;2 3, and Michelle Girvan 4 5 1Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA 2Department

2

In the past there has only been minimal investigation ofhow biological functions might be related to each otheroutside of the ontology structure, with the majority fo-cusing on discovering individual links between functions[19] rather than investigating the structure as a whole. Inthis work, we propose an alternate classification of func-tional terms that relies on gene-term annotations ratherthan ontological relationships.

Our complex networks approach to organizing biologi-cal functions using annotations made to the Gene Ontol-ogy is outlined in Figure 1. We begin by considering termrelationships defined by the GO hierarchy. We then addin gene-term annotation information collected from nu-merous different experiments and encapsulate these con-nections in the form of a bipartite network. Next, weuse this bipartite network to construct another networkdescribing the relationships between functional termsbased on shared gene annotations. We apply communitystructure finding algorithms to partition this annotation-driven network into communities of terms and comparethese communities to branches (ontological groupings ofterms) from the GO hierarchy. We show that, althoughthere are some similarities, there are also very strong dif-ferences between the two ways of organizing terms. Fi-nally, we test the applicability of the community-derivedclassification, utilizing functional analysis techniques toevaluate the enrichment of cancer signatures (sets ofgenes associated with cancer) in both term communitiesand GO branches. We find that certain signatures areenriched primarily in our term communities and not GObranches. Therefore, we suggest that by linking func-tional terms based on shared genes, we can create analternate, biologically meaningful, network-derived orga-nization of terms that is both distinct from the GO DAGand can be used to investigate biological systems. Toolsand supplemental material related to this work can befound online at http://www.networks.umd.edu.

2. METHODS

2.1. Characterizing Gene Ontology Annotations ina Bipartite Graph

In the following analysis we explore if there exists analternate, natural way to classify terms that is distinctfrom the ontology structure. To begin, we use term-termontology relationships and gene-term annotation infor-mation for human genes downloaded from the GO web-site (geneontology.org) to construct a gene-term bipartitenetwork. We choose to represent this network in the formof an nG × nT adjacency matrix, where nG is the totalnumber of genes and nT is the total number of termslisted in the annotation file. In this matrix a value of oneindicates a known connection between the correspondinggene and term, and a value of zero indicates that the gene

Gene Ontology Term Hierarchy (DAG)

Gene-Term Annotations (Bipartite Graph)

Annotation-Driven Term Network

Annotation-Driven Term Communities

evaluate enrichment in experimental

gene sets

incorporate annotations

project network

partition terms

compare structure

Branch1 Branch2 Branch3 Comm1 Comm2 Comm3

FIG. 1: Visual representation of our approach. First, we sum-marize gene annotations made to functional terms in the GeneOntology hierarchy as a gene-term bipartite graph. Fromthese gene-term relationships, we project a term-term net-work. We partition this network into communities and com-pare those term communities to branches of terms in theDAG. Finally, we perform functional enrichment analysis onexperimentally-defined gene sets using both the term commu-nities and GO branches.

is not associated with that term. Thus,

Bpi =

{1 if gene p is annotated to term i

0 if gene p is not annotated to term i. (1)

This bipartite network represents a summary of the re-lationships between 18930 genes and 15033 functionalterms, derived from many different types of biologicalevidence and contributed to by multiple laboratories[10].We note that although GO is broken into three primarydomains and gene-annotations are made to the ontologyfor many species, for simplicity in the following analysiswe will combine information from all three domains anduse annotation information only that pertains to humangenes. Domain-specific and comparative species analysisis provided in the Supplemental Material.

2.2. Constructing a Term Network from GeneOntology Annotations

Next, we used gene-term annotations to construct anetwork representing term-term relationships. Using thebipartite network (Equation 1) one could create a termnetwork by simply joining together any pair of terms thatshare common genes; however, the number of genes an-notated to each term has a heavy-tailed distribution (seeSupplemental Figure S1(a)), thus this approach wouldlose a large amount of information as connections be-tween pairs of highly-annotated terms would be given thesame weight as connections between pairs of terms withonly a few annotations. We correct for the skewed term

Page 3: Annotations - arxiv.org · Annotations Kimberly Glass1 ;2 3, and Michelle Girvan 4 5 1Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA 2Department

3

degree distribution by constructing a diagonal weightingmatrix, w, and then projecting a term network T , whoseedges are modified by this weighting matrix:

wij =δij

nG∑q=1

Bqi

, T = w′B′Bw. (2)

The values of Tij take a maximum value of one whenterms i and j each only have the same single gene an-notation and a minimum value of zero when none of thegenes annotated to term i are annotated to term j. Theuse of the weighting matrix emphasizes the weights ofnetwork edges between low degree terms. Since theseterms represent biological functions performed by only ahandful of genes, we believe this weighting is more likelyto capture highly-specific shared biological information.

2.3. Identifying Communities of GO terms

Finally, we seek to identify the community structurein annotation-driven term-term relationships, or clustersof terms within which there are many or high-weight re-lationships in our projected network (Equation 2), butbetween which there are only few or low-weight relation-ships. In order to quantify the strength of communitystructure we use a quantity known as modularity [28].Modularity (Q) can be defined as:

Q =1

2m

∑ij

[Aij −

(1 +

r

〈k〉

)kikj2m

]δ(xi, xj) (3)

where δ is the Kronecker delta function, xi is the com-munity of node i, ki is the degree of node i, A is theadjacency matrix, a matrix with values representing theweight between nodes i and j, and m is the total weightof the edges in the network[3, 26]. Traditionally, in orderto partition a network into communities, the resolutionparameter, r in Equation 3, is set equal to zero. Vary-ing this value allows one to look for alternate divisionsof a network into communities at different scales, or res-olutions, with r > 0 uncovering sub-structures in thenetwork [3].

We used a weighted version of the Fast Greedy Com-munity Structure algorithm [6] to investigate the com-munity structure of our term network, and found fifty-six communities at maximum modularity. We then im-plemented a modified version of the Fast Greedy thatmaximizes modularity for non-zero values of the reso-lution parameter in order to find many different viablepartitions. We varied the resolution parameter severalorders of magnitude and found 11491 different commu-nities (see Supplemental Table S1). We gave our com-munities numeric identities that vary from TC:0000001to TC:0011491 and will refer to them as such in thefollowing analysis. The different values of the resolu-tion parameter were chosen to give community sizes that

were roughly similar to those defined by the branches atdifferent levels of the GO DAG (see Supplemental Fig-ure S1(a)). Like GO branches, which represent overlap-ping sets of functional categories rather than one discreetpartition of terms, communities found at different reso-lutions are highly overlapping and represent functionalstructure at many different levels of specificity.

3. RESULTS: AN ALTERNATE “NATURAL”GROUPING OF GO TERMS

3.1. Term Communities and GO BranchesRepresent Distinct Collections of Biological

Functions

To better understand the relationships between thecommunities found at different resolutions, we visualizedthe term communities with ten or more members for thesix lowest values of resolution used (Figure 2). In thisvisualization each community is represented by a singlecircle, whose radius scales as the log of the number ofterms belonging to that community and whose color cor-responds to the percentage of members from each pri-mary domain that belong to that community. Betweenthe communities found at adjacent resolutions, we draw aline from a community at a higher resolution to a commu-nity at a lower resolution if at least 10% of the membersof the community from the higher resolution also belongto the community at the lower resolution. The thick-ness of the line is indicative of the overlap between thetwo communities. For more details on the visualizationprocess see the Supplemental Material.

The structure of annotation-driven term relationshipsis distinct from the structure of those relationships asdefined by GO branches. This is evidenced clearly bythe fact that, although each GO branch can only belongto one primary ontology, and thus would be pure yel-low, cyan or magenta in this type of visualization, com-munities, even smaller ones and those found at higherresolutions, generally contain members from multiple on-tologies, resulting in a rainbow of colors. We also observethat communities at higher resolutions do not merely rep-resent the “splitting apart” of communities at lower reso-lutions (represented by a child community only connect-ing to a single parent), but instead each resolution oftenbrings about a new way of partitioning the network. Ananalogous visualization of GO branches reveals a similara complex partitioning of terms in the GO DAG, albeitsegregated by primary domain (see Supplemental FigureS2).

Next we directly compared the membership of the termcommunities with that of branches in the GO DAG. Inorder to quantify the similarity between each commu-nity and branch, we calculated the Jaccard similarity,which takes the value J(x, y) = |x ∩ y|/|x ∪ y|. Then, foreach community (x), we determined the correspondingbranch (y) that has the highest overlap in membership

Page 4: Annotations - arxiv.org · Annotations Kimberly Glass1 ;2 3, and Michelle Girvan 4 5 1Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA 2Department

4

BP

CC MF

FIG. 2: Visualization of communities (circles) of GO terms found at the six lowest levels of resolution (rows), in increasingorder (top to bottom). The width of the line connecting two communities is proportional to the percentage of terms in the childcommunity that are also in the parent community. The size of communities is proportional to the log of the number of termsin the community. Color represents the normalized percentage of terms in the community which belong to the BP (yellow),MF (cyan) and CC (magenta) primary domains.

by this measure: Jm(x) = max{J(x, y) : y ∈ Y }, andvice versa. Because the exact value of the Jaccard sim-ilarity is highly sensitive to incremental changes in setmembership when comparing sets with only a few mem-bers, we will limit all the following analysis to communi-ties and branches that contain ten or more terms in orderto focus on the most robust results. Figure 3(a) showsthe distribution of Jm comparing these 2370 communitiesand 2151 branches. Although a handful of communitiesand branches are quite similar to each other, the ma-jority of communities are dissimilar to the GO Branchesand vice versus. We have repeated this analysis con-structing the term network and corresponding partitionsthree more times, using annotations specific to each ofthe three primary ontologies, and observe similar results(see Supplemental Figure S1(b)-1(d) and SupplementalTable S2).

To better interpret these values, we selected severalcommunities to inspect more closely. First we selecteda community with a very high Jm value to inspect (Fig-ure 3(b)). TC:0007391 is most similar to GO:0070570(“regulation of neuron projection regeneration”) withJm = 0.6667. It is interesting that in addition to mem-bers from the BP domain, TC:0007391 also includes twomembers from the MF and CC domains, “neutrophin re-ceptor activity” and “perineuronal net” respectively, theformer of which is involved in the regeneration of injuredaxons [12] while the degradation of the latter has beenshown to favor axon regeneration [30]. This indicatesthat terms found in the community but not the branchare consistent with known biology.

Next we selected TC:0000936, which is most simi-lar (Jm = 0.1667) to GO:0060538 (Figure 3(c)). Wenote that that the dissimilarity found between this com-munity and branch cannot be attributed to communitymembership from multiple primary domains, as all ofTC:0000936’s members belong to the “Biological Pro-cess” primary domain. Interestingly, the branch de-fined by GO:0060538 has members that belong to elevendistinct communities, demonstrating that not only are

communities often distinct from branches, within thebranches themselves the annotation-driven classificationis often very distinct from the defined ontological rela-tionships. This pair is a representative example of themaximal shared information that is typically found be-tween a community and branches, therefore we concludethat although there is occasional similarity between ourfound communities and GO branches, the communitiesare not simply a recapitulation of the DAG.

3.2. Capturing the Biological Information in TermCommunities

We have illustrated that our term communities rep-resent a natural partitioning of functional terms thatis distinct from the GO DAG, however, the biologicalmeaning of these communities is, at this point, unclear.On a mathematical level they represent sets of biologi-cal functions that are generally performed by the samecollection of genes. Labeling and understanding the bio-logical meaning behind these communities is vital if theyare to have the same wide-range applications as the GObranches. Therefore, in order to easily interpret the con-tents of an individual community we choose to summa-rize the descriptions of its member terms in the form ofa word cloud, coloring each word in the cloud based onthe normalized percentage of times the members is it de-rived from belong to each primary domain, and scalingthe size of each word by the statistical enrichment of itsfrequency in that community (for details see Supplemen-tal Material). We illustrate the biological content of twocommunities in Figure 4(a)-4(b).

The word clouds illustrate a richness of biologicalinformation in term communities. Although Com-munity TC:0000400 (Figure 4(a)) contains 335 mem-bers harking from all three primary domains, the wordcloud presentation easily summarizes this information.The individual words are often contained in terms as-sociated with multiple domains, resulting in a com-plex coloration, but reveal that this community in-

Page 5: Annotations - arxiv.org · Annotations Kimberly Glass1 ;2 3, and Michelle Girvan 4 5 1Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA 2Department

5

0 0.25 0.5 0.75 10

100

200

300

Maximum Jaccard (Jm

)

# of

Bra

nche

s/C

omm

uniti

es

BranchesCommunities

(a)Distribution of Jm for Branches and Communities

(b)TC:0007391, Jm = 0.6667 with GO:0070570

(c)TC:0000936, Jm = 0.1667 with GO:0060538

FIG. 3: A comparison of branches in the GO DAG and termcommunities found by partitioning the term network. (A)Distribution of Jm, the maximum similarity a communityor branch with ten or more members has compared to allother branches or communities with ten or more members,respectively. Although a small number of communities andbranches have similar memberships, most are highly dissimi-lar. (B)-(C) Two example comparisons between communitiesand branches: (B) TC:0007391 compared to GO:0070570, and(C) TC:0000936 compared to GO:0060538. In each panel onthe left hand side a community and its inter-community con-nections in the annotation-driven term network is shown andon the right hand side the branch with which that commu-nity has the the highest Jaccard similarity is illustrated. Inthe right panel edges represent the ontological associationsdefined by the Gene Ontology term hierarchy. Each termmember of the community or branch is colored both by its as-sociated primary domain (inner color - BP:yellow, MF:cyan,CC:magenta) and its community membership (outer color),determined at the same resolution value as the illustratedcommunity. Terms common between each community andbranch pair are circled.

cludes biological concepts related to various types ofRNA, including “rRNA”, “tRNA”, “mRNA”, “LSU-rRNA”, “SSU-rRNA”, “ncRNA”, “RNA-polymerase”and more. In contrast, TC:0000061 contains manywords related to the heart such as “cardiac”, “muscle”,“ventricle”, “ventricular” and “heart” (Figure 4(b)).Neither community is very similar to any particularbranch in GO, although they represent similar biolog-ical information. TC:0000400 is most similar (Jm =0.22146) to GO:0016070 or “RNA metabolic process”,and TC:0000061 has the highest similarity (Jm = 0.149)with GO:0072358, or “cardiovascular system develop-ment.”

We point out that one can also represent the biologicalinformation contained in branches in the form of wordclouds, although, because the members of each branchcan only belong to one of the three primary domains,all the words in the cloud will be the same color. Twobranches are illustrated in Figure 4(c)-4(d). The first,GO:0000003, or “reproduction” clearly contains termspertaining to sex-related processes as it contains wordssuch as “female”, “sex”, “prostate” and “male.” Simi-larly, the cloud for GO:0002376, whose parent term nameis “immune system process” contains words pertaining tothe immune system.

3.3. Term Communities can be used to Evaluateand Predict Genetic Function

Finally, we wanted to test how our communities mightbe used in one common application of the Gene Ontology:functional enrichment analysis. To begin, we downloadeda collection of experimentally derived genes sets fromthe Gene Signatures Database (GeneSigDB) [8]. Thisdatabase is a manual curation of previously publishedgene expression signatures, focusing primarily on can-cer and stem cell signatures [7], and includes 509 humansignatures that contain at least 100 and less than 1000genes annotated in the Gene Ontology. To perform thefunctional enrichemnt analysis we use Annotation En-richment Analysis [14] since it has been shown to betterestimate the biological functions of experimentally de-rived sets of genes, and has a conceptual framework con-ducive to estimating functional enrichment between setsof terms and genes, rather than simply between two setsof genes (for more details see Supplemental Material).

In general, we observe that term communities con-tain slightly more statistically enriched associations withthese experimental signatures than GO branches (whichare widely used in functional enrichment analysis) andwe verified that this level of enrichment is absent for ran-domly constructed communities (see Supplemental Fig-ure S3). Knowing that our communities are statisticallyassociated with experimental gene signatures, we nextsought to know if there was a context in which our termcommunities captured biological information from thesesignatures that is missed by the branches, or vice versus.

Page 6: Annotations - arxiv.org · Annotations Kimberly Glass1 ;2 3, and Michelle Girvan 4 5 1Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA 2Department

6

Thus we selected gene signatures that were significantlyenriched (p < 10−6) in at least one community/branchbut not significantly enriched (p > 5 × 10−5) in anyGO branch/community, respectively. Figure 4(e) showsa heat map of the enrichment values for the nineteensignatures that met this criteria across any communityor branch statistically enriched in at least one of thosesignatures.

It is immediately striking that of these signatures,the majority are enriched in communities and not GObranches. Two signatures, in particular, are enrichedin a collection of communities. The first, an embryonicstem cell signature [37], represents genes that are up-regulated in cardiomyocytes compared to non-selectedembryoid bodies and hESC. The communities repre-sented in this signature contain several different themes,all consistent with the expected properties of genes se-lected from stem cells and related to the heart. Thecorresponding clouds emphasize words such as “cardiac”and “muscle” (TC:0000012), “actin”, “myosin”, and “fil-ament” (TC:0000249), “morphogenesis” and “develop-ment” (TC:0000365), “blood”, “pressure” and “contrac-tion” (TC:0000582), with the other clouds generally con-taining these words in different combinations (see for ex-ample TC:0000061, illustrated in Figure 4(b)). The sec-ond signature is a list of bladder cancer specific genes[29]. Most of the words emphasized by the commu-nity clouds are related to cell proliferation. For exam-ple, it is enriched in TC:0000400 (illustrated in Fig-ure 4(a)) and the other clouds emphasize words suchas “cell-cycle”, “mitotic”, “meiotic”, “checkpoint”, “re-pair”, “replication”, “recombination”, “telomere”, “spin-dle”, “complex”, “DNA”, “chromosome”, “histone”, and“methylation”. Although the connection to the bladderis not obvious, the connection to cancer and the high rateof cell proliferation in tumor cells [2] is apparent.

4. CONCLUSION

The network structure of gene annotations made tothe Gene Ontology has not previously been exploitedin a manner that reveals an organization of biologicalfunction unique from the published hierarchical classifi-cation of the GO DAG. By analyzing functional annota-tion data we were able to construct an alternate, natural,and biologically-relevant way in which to categorize cel-lular functions. This categorization is structurally andconceptually distinct from the GO DAG and allows us touncover multiple, strong connections between terms thatdo not share a parent/child relationship. It takes advan-tage of a large amount of data from a variety of sourcesand creates a classification scheme that is motivated pri-marily by the data reported rather than the organizationof human conceptions.

The term communities defined in this work representan integration of information across all three primarydomains in GO that, to the authors’ knowledge, has

(a)TC:0000400 WordCloud

(b)TC:0000061 WordCloud

(c)GO:0000003 WordCloud

(d)GO:0002376 WordCloud

GO

:000

6139

GO

:000

9987

GO

:000

5622

GO

:004

4424

GO

:003

2501

GO

:004

2221

GO

:000

2376

GO

:005

0896

TC:0

0000

12TC

:000

0061

TC:0

0002

49TC

:000

0365

TC:0

0004

10TC

:000

0488

TC:0

0005

82TC

:000

0611

TC:0

0008

36TC

:000

0881

TC:0

0015

01TC

:000

2025

TC:0

0000

00TC

:000

0062

TC:0

0001

16TC

:000

0122

TC:0

0001

52TC

:000

0247

TC:0

0002

56TC

:000

0400

TC:0

0009

60TC

:000

0004

TC:0

0000

59TC

:000

0114

TC:0

0001

63TC

:000

0063

TC:0

0001

18TC

:000

0179

TC:0

0002

61TC

:000

0057

TC:0

0002

58TC

:000

0369

TC:0

0008

69TC

:000

0165

−log10 p−value

Ovarian (Ouellet ’05)Bone (Heller ’08)

Sarcoma (Lee ’04)Leukemia (Walker ’06)StemCell (Osada ’05)

HeadandNeck (Bredel ’06)Sarcoma (Missiaglia ’10)

Viral (Urosevic ’07)EmbryonicStemCell (Xu ’09)

Bladder (Osman ’06)Sarcoma (Filion ’09)

Breast (Creighton ’08)Intestine (Vecchi ’07)Breast (Mazurek ’05)

HeadandNeck (Roepman ’06)Lung (Hou ’10)

Breast (Mira ’09)Colon (Grade ’06)

Leukemia (Calln ’08) 0

2

4

6

(e)Annotation Enrichment Analysis

FIG. 4: (A-D) Term Communities (TC:0000400, TC:0000061)and branches (GO:0009607, GO:0050896) summarized asword clouds. In each case the color of a word representshow often the term description containing that word be-longs to each of the primary domains (BP:yellow, MF:cyan,CC:magenta, also see Figure 2 for mixed-domain coloration)and size represents that word’s statistical enrichment in thatcommunity/branch. (E) A heat map showing the statisti-cal enrichment of selected cancer signatures (see text) in GObranches and term communities.

not previously been investigated systematically. Usingthe simple principle of co-annotation we suggest thatin the future biological concepts from other classifica-tion databases could also be analyzed or even combinedwith these results. We concede that the communities de-fined here likely do not represent the only way to groupfunctional terms outside of the ontology structure. Evenso, we believe that our functional enrichment analysisdemonstrates that these term communities, in particular,are more than a mathematical phenomenon and have ahigh potential to be used to better interpret biologicaldata.

Page 7: Annotations - arxiv.org · Annotations Kimberly Glass1 ;2 3, and Michelle Girvan 4 5 1Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA 2Department

7

Acknowledgements

We wish to thank Geet Duggal for supplying an im-plementation of the Fast Greedy Community Structure

algorithm that included the resolution parameter.

[1] http://www.softpedia.com/progdownload/ibm-word-cloud-generator-download-160283.html.URL http://www.softpedia.com/progDownload/

IBM-Word-Cloud-Generator-Download-160283.html.[2] M Andreeff, Goodrich DW, and Pardee AB. Holland-

Frei Cancer Medicine. BC Decker, Hamilton, ON, 5thedition, 2000.

[3] Alex Arenas, Alberto Fernandez, and SergioGomez. Analysis of the structure of complexnetworks at different resolution levels. e-pub:http://arxiv.org/abs/physics/0703218, January2008. doi: 10.1088/1367-2630/10/5/053039. URLhttp://dx.doi.org/10.1088/1367-2630/10/5/053039.

[4] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein,H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S.Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E.Richardson, M. Ringwald, G. M. Rubin, and G. Sher-lock. Gene ontology: tool for the unification of biology.the gene ontology consortium. Nature genetics, 25(1):25–29, May 2000. ISSN 1061-4036. doi: 10.1038/75556.URL http://dx.doi.org/10.1038/75556.

[5] B. Chandrasekaran, John R. Josephson, and V. RichardBenjamins. What are ontologies, and why do we needthem? IEEE Intelligent Systems, 14(1):20–26, January1999. ISSN 1541-1672. doi: 10.1109/5254.747902. URLhttp://dx.doi.org/10.1109/5254.747902.

[6] Aaron Clauset, M. E. J. Newman, and CristopherMoore. Finding community structure in very large net-works. Physical Review E, 70(6):066111+, December2004. doi: 10.1103/PhysRevE.70.066111. URL http:

//dx.doi.org/10.1103/PhysRevE.70.066111.[7] Aedın C. Culhane, Thomas Schwarzl, Razvan Sul-

tana, Kermshlise C. Picard, Shaita C. Picard, Tim H.Lu, Katherine R. Franklin, Simon J. French, GeraldPapenhausen, Mick Correll, and John Quackenbush.GeneSigDB–a curated database of gene expression signa-tures. Nucleic acids research, 38(Database issue):D716–D725, January 2010. ISSN 1362-4962. doi: 10.1093/nar/gkp1015. URL http://dx.doi.org/10.1093/nar/

gkp1015.[8] Aedın C. Culhane, Markus S. Schroder, Razvan Sul-

tana, Shaita C. Picard, Enzo N. Martinelli, CarolineKelly, Benjamin Haibe-Kains, Misha Kapushesky, Anne-Alyssa St Pierre, William Flahive, Kermshlise C. Pi-card, Daniel Gusenleitner, Gerald Papenhausen, NiallO’Connor, Mick Correll, and John Quackenbush. Gen-eSigDB: a manually curated database and resource foranalysis of gene expression signatures. Nucleic Acids Re-search, 40(D1):D1060–D1066, January 2012. ISSN 1362-4962. doi: 10.1093/nar/gkr901. URL http://dx.doi.

org/10.1093/nar/gkr901.[9] Leon Danon, Albert Dıaz-Guilera, Jordi Duch, and

Alex Arenas. Comparing community structure iden-tification. Journal of Statistical Mechanics: The-

ory and Experiment, 2005(09):P09008, September 2005.ISSN 1742-5468. doi: 10.1088/1742-5468/2005/09/P09008. URL http://dx.doi.org/10.1088/1742-5468/

2005/09/P09008.[10] Emily C. Dimmer, Rachael P. Huntley, Yasmin Alam-

Faruque, Tony Sawford, Claire O’Donovan, Maria J.Martin, Benoit Bely, Paul Browne, Wei Mun Chan,Ruth Eberhardt, Michael Gardner, Kati Laiho, DuncanLegge, Michele Magrane, Klemens Pichler, Diego Pog-gioli, Harminder Sehra, Andrea Auchincloss, KristianAxelsen, Marie-Claude C. Blatter, Emmanuel Boutet,Silvia Braconi-Quintaje, Lionel Breuza, Alan Bridge,Elizabeth Coudert, Anne Estreicher, Livia Famiglietti,Serenella Ferro-Rojas, Marc Feuermann, Arnaud Gos,Nadine Gruaz-Gumowski, Ursula Hinz, Chantal Hulo,Janet James, Silvia Jimenez, Florence Jungo, GuillaumeKeller, Phillippe Lemercier, Damien Lieberherr, PatrickMasson, Madelaine Moinat, Ivo Pedruzzi, Sylvain Poux,Catherine Rivoire, Bernd Roechert, Michael Schneider,Andre Stutz, Shyamala Sundaram, Michael Tognolli, Ly-die Bougueleret, Ghislaine Argoud-Puy, Isabelle Cusin,Paula Duek-Roggli, Ioannis Xenarios, and Rolf Apweiler.The UniProt-GO annotation database in 2011. Nucleicacids research, 40(Database issue):D565–D570, January2012. ISSN 1362-4962. doi: 10.1093/nar/gkr1048. URLhttp://dx.doi.org/10.1093/nar/gkr1048.

[11] Lude Franke, Harm van Bakel, Like Fokkens, Edwin D.de Jong, Michael Egmont-Petersen, and Cisca Wijmenga.Reconstruction of a functional human gene network, withan application for prioritizing positional candidate genes.American journal of human genetics, 78(6):1011–1025,June 2006. ISSN 0002-9297. doi: 10.1086/504300. URLhttp://dx.doi.org/10.1086/504300.

[12] H. Funakoshi, J. Frisen, G. Barbany, T. Timmusk,O. Zachrisson, V. M. Verge, and H. Persson. Differentialexpression of mRNAs for neurotrophins and their recep-tors after axotomy of the sciatic nerve. The Journal of cellbiology, 123(2):455–465, October 1993. ISSN 0021-9525.URL http://view.ncbi.nlm.nih.gov/pubmed/8408225.

[13] M. Girvan and M. E. J. Newman. Community struc-ture in social and biological networks. Proceedings of theNational Academy of Sciences, 99(12):7821–7826, June2002. ISSN 0027-8424. doi: 10.1073/pnas.122653799.URL http://dx.doi.org/10.1073/pnas.122653799.

[14] Kimberly Glass and Michelle Girvan. Annotation en-richment analysis: An alternative method for evalu-ating the functional properties of gene sets. e-pub:http://arxiv.org/abs/1208.4127, August 2012. URLhttp://arxiv.org/abs/1208.4127.

[15] Kimberly Glass, Edward Ott, Wolfgang Losert, andMichelle Girvan. Implications of functional similarityfor gene regulatory interactions. Journal of the RoyalSociety, Interface / the Royal Society, February 2012.ISSN 1742-5662. doi: 10.1098/rsif.2011.0585. URLhttp://dx.doi.org/10.1098/rsif.2011.0585.

Page 8: Annotations - arxiv.org · Annotations Kimberly Glass1 ;2 3, and Michelle Girvan 4 5 1Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA 2Department

8

[16] Roger Guimera and Luis A. Nunes Amaral. Functionalcartography of complex metabolic networks. Nature, 433(7028):895–900, February 2005. ISSN 0028-0836. doi: 10.1038/nature03288. URL http://dx.doi.org/10.1038/

nature03288.[17] Da W. Huang, Brad T. Sherman, Qina Tan, Joseph

Kir, David Liu, David Bryant, Yongjian Guo, RobertStephens, Michael W. Baseler, H. Clifford Lane, andRichard A. Lempicki. David bioinformatics resources:expanded annotation database and novel algorithms tobetter extract biology from large gene lists. Nucl. AcidsRes., 35(Web Server issue):gkm415+, June 2007. ISSN1362-4962. doi: 10.1093/nar/gkm415. URL http://dx.

doi.org/10.1093/nar/gkm415.[18] H. Jeong, S. P. Mason, A. L. Barabasi, and Z. N. Oltvai.

Lethality and centrality in protein networks. Nature, 411(6833):41–42, May 2001. ISSN 0028-0836. doi: 10.1038/35075138. URL http://dx.doi.org/10.1038/35075138.

[19] O. D. King, R. E. Foulger, S. S. Dwight, J. V. White, andF. P. Roth. Predicting gene function from patterns ofannotation. Genome research, 13(5):896–904, May 2003.ISSN 1088-9051. doi: 10.1101/gr.440803. URL http:

//dx.doi.org/10.1101/gr.440803.[20] Insuk Lee, Shailesh V. Date, Alex T. Adai, and Ed-

ward M. Marcotte. A probabilistic functional networkof yeast genes. Science, 306(5701):1555–1558, November2004. ISSN 1095-9203. doi: 10.1126/science.1099511.URL http://dx.doi.org/10.1126/science.1099511.

[21] E. A. Leicht and M. E. J. Newman. Commu-nity structure in directed networks. Physical ReviewLetters, 100:118703+, March 2008. doi: 10.1103/PhysRevLett.100.118703. URL http://dx.doi.org/10.

1103/PhysRevLett.100.118703.[22] Usakali Maki, editor. The Economic World View: Stud-

ies in the Ontology of Economics. The Press Syndicateof the University of Cambridge, Cambridge, UK, 2001.

[23] Marina Meila. Comparing clusterings by the variationof information. Learning Theory and Kernel Machines,pages 173–187, 2003. URL http://www.springerlink.

com/content/rdda4r5bm9ajlbw3.[24] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan,

D. Chklovskii, and U. Alon. Network motifs: Sim-ple building blocks of complex networks. Science, 298(5594):824–827, October 2002. ISSN 1095-9203. doi: 10.1126/science.298.5594.824. URL http://dx.doi.org/

10.1126/science.298.5594.824.[25] Sara Mostafavi and Quaid Morris. Fast integration of

heterogeneous data sources for predicting gene func-tion with limited annotation. Bioinformatics, 26(14):1759–1765, July 2010. ISSN 1367-4811. doi: 10.1093/bioinformatics/btq262. URL http://dx.doi.org/10.

1093/bioinformatics/btq262.[26] M. E. Newman. Analysis of weighted networks. Phys Rev

E Stat Nonlin Soft Matter Phys, 70(5 Pt 2), November2004. ISSN 1539-3755. URL http://view.ncbi.nlm.

nih.gov/pubmed/15600716.[27] M. E. J. Newman. The structure and func-

tion of complex networks. SIAM Review, 45(2):167–256, 2003. URL http://scitation.

aip.org/getabs/servlet/GetabsServlet?prog=

normal&id=SIREAD000045000002000167000001&idtype=

cvips&gifs=yes.[28] M. E. J. Newman and M. Girvan. Finding and evaluating

community structure in networks. Physical Review E, 69

(2):026113+, February 2004. doi: 10.1103/physreve.69.026113. URL http://dx.doi.org/10.1103/physreve.

69.026113.[29] Iman Osman, Dean F. Bajorin, Tung-Tien T. Sun, Hong

Zhong, Diah Douglas, Joseph Scattergood, Run Zheng,Mark Han, K. Wayne Marshall, and Choong-Chin C.Liew. Novel blood biomarkers of human urinary blad-der cancer. Clinical cancer research : an official jour-nal of the American Association for Cancer Research, 12(11 Pt 1):3374–3380, June 2006. ISSN 1078-0432. doi:10.1158/1078-0432.CCR-05-2081. URL http://dx.doi.

org/10.1158/1078-0432.CCR-05-2081.[30] Erika Pastrana, Maria Teresa T. Moreno-Flores, Este-

ban N. Gurzov, Jesus Avila, Francisco Wandosell, andJavier Diaz-Nido. Genes associated with adult axon re-generation promoted by olfactory ensheathing cells: anew role for matrix metalloproteinase 2. The Journal ofneuroscience : the official journal of the Society for Neu-roscience, 26(20):5347–5359, May 2006. ISSN 1529-2401.doi: 10.1523/JNEUROSCI.1111-06.2006. URL http:

//dx.doi.org/10.1523/JNEUROSCI.1111-06.2006.[31] Mason A. Porter, Jukka-Pekka Onnela, and Pe-

ter J. Mucha. Communities in networks. e-pub:http://arxiv.org/abs/0902.3788, September 2009. URLhttp://arxiv.org/abs/0902.3788.

[32] Simon B. Shum, Enrico Motta, and John Domingue.ScholOnto: an Ontology-Based digital library serverfor research documents and discourse. InternationalJournal on Digital Libraries, 3:237–248, 2000. URLhttp://citeseerx.ist.psu.edu/viewdoc/summary?

doi=10.1.1.36.3835.[33] Ricard V. Sole, Ramon F. Cancho, Jose M. Montoya,

and Sergi Valverde. Selection, tinkering, and emergencein complex networks. Complex., 8(1):20–33, September2002. ISSN 1076-2787. URL http://portal.acm.org/

citation.cfm?id=770715.770720.[34] Robert Stevens, Carole A. Goble, and Sean Bechhofer.

Ontology-based knowledge representation for bioinfor-matics. Brief Bioinform, 1(4):398–414, January 2000.doi: 10.1093/bib/1.4.398. URL http://dx.doi.org/10.

1093/bib/1.4.398.[35] The gene ontology consortium. Creating the gene ontol-

ogy resource: design and implementation. Genome Res.,11(8):1425–1433, August 2001. ISSN 1088-9051. doi:10.1101/gr.180801. URL http://dx.doi.org/10.1101/

gr.180801.[36] Andreas Wagner. The yeast protein interaction network

evolves rapidly and contains few redundant duplicategenes. Mol Biol Evol, 18(7):1283–1292, July 2001. ISSN0737-4038. URL http://mbe.oxfordjournals.org/cgi/

content/abstract/18/7/1283.[37] Xiu Qin Q. Xu, Set Yen Y. Soo, William Sun, and

Robert Zweigerdt. Global expression profile of highlyenriched cardiomyocytes derived from human embryonicstem cells. Stem cells (Dayton, Ohio), 27(9):2163–2174,September 2009. ISSN 1549-4918. doi: 10.1002/stem.166.URL http://dx.doi.org/10.1002/stem.166.

[38] Xuerui Yang, Yang Zhou, Rong Jin, and Christina Chan.Reconstruct modular phenotype-specific gene networksby knowledge-driven matrix factorization. Bioinformat-ics, 25(17):2236–2243, September 2009. ISSN 1367-4811.doi: 10.1093/bioinformatics/btp376. URL http://dx.

doi.org/10.1093/bioinformatics/btp376.[39] Ahrim Youn, David J. Reiss, and Werner Stuetzle.

Page 9: Annotations - arxiv.org · Annotations Kimberly Glass1 ;2 3, and Michelle Girvan 4 5 1Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA 2Department

9

Learning transcriptional networks from the integrationof ChIP-chip and expression data in a non-parametricmodel. Bioinformatics, 26(15):1879–1886, August 2010.doi: 10.1093/bioinformatics/btq289. URL http://dx.

doi.org/10.1093/bioinformatics/btq289.[40] Jing Zhao, Hong Yu, Jianhua Luo, Z. Cao, and Yixue

Li. Complex networks theory for analyzing metabolicnetworks. Chinese Science Bulletin, 51(13):1529–1537–1537, July 2006. ISSN 1001-6538. doi: 10.1007/s11434-006-2015-2. URL http://dx.doi.org/10.1007/

s11434-006-2015-2.

Page 10: Annotations - arxiv.org · Annotations Kimberly Glass1 ;2 3, and Michelle Girvan 4 5 1Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA 2Department

10

Supplemental Material

In this section we provide additional materials meantto compliment and expand upon the analysis in “Find-ing New Order in Biological Functions from the NetworkStructure of Gene Annotations.” In the first section,“Communities Found Across Multiple Resolutions”, weprovide more detail about how we used the resolutionparameter to generate multiple, overlapping, yet uniquecommunities of terms. In the second section, “Domain-specific Analysis” we provide analysis of communities ofterms compared to branches within each primary domainof GO. Next, in “Visualizing GO Branches” we illustraterelationships between GO branches at different “levels”of the GO DAG. In the section entitled “Functional En-richment Analysis”, we briefly explain the methodologyused to evaluate the statistical enrichment of gene setsin the term communities and GO branches, and pro-vide analysis showing that this enrichment is absent forrandomly generated term communities or randomly gen-erated sets of genes. Finally, in “Comparative SpeciesAnalysis” we construct and analyze annotation-driventerm networks using gene annotations for sixteen addi-tional organisms and compare those networks with eachother as well as the human results presented in the maintext.

For information regarding the methodology used toillustrate the term communities or generate the wordclouds in the main text, see “Supplemental Methods”below.

4.1. Communities Found Across MultipleResolutions

In order to find additional viable partitions of theannotation-driven term network, representing differentresolutions, we varied the weighting parameter (see r inEquation 3, and “Methods” in the main text). In addi-tion to the fundamental partition (r = 0) we chose val-ues of r in geometrically increasing steps ranging fromr = 2−4 = 0.25 to r = 210 = 1024 such that the final dis-tribution of community sizes would resemble that of thebranches. This procedure resulted in fourteen differentpartitions of the terms in the annotation-driven networksuch that, initially, each term can be assigned to exactlyfourteen communities, one at each resolution. We em-phasize that while communities at any given resolutiondo not overlap, communities at different resolutions canbe highly overlapping.

We point out that it is possible for the membership of acommunity found at one resolution to be identical to themembership of a community found at another resolution.In order to eliminate completely redundant communityinformation from our found communities, at each resolu-tion, we determined if any of the communities found atthat resolution were identical in membership as a commu-nity found at a lower resolution. If so, we “collapsed” the

Value of r Number of Number of New(Resolution) Communities Communities

0 56 56

0.25 80 51

0.5 89 44

1 116 67

2 146 95

4 193 148

8 320 257

16 576 422

32 1007 703

64 1557 965

128 2409 1439

256 3585 1965

512 4983 2375

1024 6730 2904

TABLE S1: The number of communities found at each valueof resolution used. As the resolution is increased, the numberof communities found at that resolution increases as well.

two community assignments into that of the communityfrom the lower resolution. For example, of the 80 com-munities found at a resolution value of r = 0.25, 29 hadan identical membership to one of the 56 communitiesfound at r = 0. We removed those 29 “redundant” com-munities from the r = 0.25 partitioning, retaining onlytheir r = 0 assignment, and record that at r = 0.25 only51 additional communities are found. Similarly, of the 89communities found at a resolution of r = 0.5, 45 of themare identical in membership to one of the 107 unique com-munities found at r = 0 or r = 0.25. These “redundant”communities were removed and we record only 44 addi-tional communities found at r = 0.5. This procedure wasrepeated for the remaining resolutions, resulting in 11491“unique” communities (see Table S1). The cumulativedistribution of the number of members in these commu-nities is shown in Figure S1(a). For comparison, on thesame graph we also plot the cumulative distribution ofthe number of annotations made to GO branches. Thesedistributions are heavy-tailed and demonstrate that thenumber and sizes of branches and communities are simi-lar. The heavy-tailed distribution for branches is a resultof the hierarchical DAG structure as the members of eachbranch are also members of their parent branch(es) (see[14, 15]).

4.2. Domain-specific Analysis

The Gene Ontology is broken into three, fully indepen-dent, primary domains, each of which takes the form ofa directed acyclic graph [35]. In the main text we com-bined information from all three primary domains to con-struct the annotation-driven term network. Because ofthis, the dissimilarity between the GO branches and the

Page 11: Annotations - arxiv.org · Annotations Kimberly Glass1 ;2 3, and Michelle Girvan 4 5 1Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA 2Department

11

1 10 100 1,000 10,00010

0

101

102

103

104

Number of Members (N)

# of

Bra

nche

s/C

omm

uniti

es(M

embe

rs≥

N)

BranchesCommunities

(a)Distribution of Branch andCommunity Sizes

0 0.2 0.4 0.6 0.8 10

50

100

150

200

Maximum Jaccaard

Num

ber

of B

ranc

hes/

Com

mun

ities

BranchesCommunities

(b)Biological Process Terms

0 0.2 0.4 0.6 0.8 10

20

40

60

80

100

Maximum Jaccaard

Num

ber

of B

ranc

hes/

Com

mun

ities

BranchesCommunities

(c)Molecular Function Terms

0 0.2 0.4 0.6 0.8 10

10

20

30

40

50

Maximum Jaccaard

Num

ber

of B

ranc

hes/

Com

mun

ities

BranchesCommunities

(d)Cellular Component Terms

FIG. S1: (A) The cumulative Distribution for the sizes of all branches in the Gene Ontology and all unique term communitiesfound at the various resolutions. (B-D) Distribution of Jm, the maximum similarity a domain-specific community or branchwith ten or more members has compared to all other domain-specific branches or communities with ten or more members,respectively. Although a small number of communities and branches have similar memberships, most are very dissimilar.

term communities we find by partitioning this networkmay be at least partially attributable to the fact thatmembers of a particular community can belong to anyof the primary domains whereas members of a particularGO branch must all belong to the same primary domainbased on the construction of the hierarchy. To addressthe extent of this issue, we constructed three “domain-specific” term networks by using only terms specific toa particular primary domain and the gene annotationsmade to those specific terms. We partitioned each ofthese networks using the same resolution values as wereused previously and, as with the partitions using all an-notations, we retained only the “unique” set of communi-ties for the following analysis (see Section 4.1). Table S2shows the number of communities found and the num-ber of branches defined in GO for each of the primarydomains.

Next we compared the membership of these communi-ties and branches using the Jaccard similarity (see “TermCommunities and GO Branches Represent Distinct Col-lections of Biological Functions” in the main text). Wedid the comparison three times, each time confining thecomparison to those branches defined by a particular pri-mary domain and the term communities derived from thenetwork constructed using those terms and their corre-sponding gene annotations. The distribution of Jm (Fig-

Number of Number of

Branches Communities

All Terms 15033 11491

BP Terms 10192 9043

MF terms 3634 5423

CC Terms 1207 2107

TABLE S2: The number of branches defined within each pri-mary domain as well as the number of communities foundby varying the resolution parameter and partitioning a term-network derived by gene annotations made to terms in thisprimary domain. The large size of the “Biological Process”primary domain compared to the others is evident.

ure S1), the maximum similarity a community has toany branch, or vice versus, for each of the domains, is notstrikingly different from that presented when using infor-mation collected from all three domains (see Figure 3(a)in the main text). There might be slightly more similar-ity when directly comparing communities and branchesderived from either the “Molecular Function”, or “Cel-lular Component” primary domains, but we remind thereader that the majority (approximately two-thirds) ofthe terms in GO belong to the “Biological Process” pri-mary domain, and this distribution (Figure S1(b)) ispractically indistinguishable from the one presented inFigure 3(a) of the main text.

4.3. Visualizing GO Branches

To better understand the relationships between thebranches in the Gene Ontology, we visualized branches ina manner similar to the way we visualized the relation-ships between term communities found at different reso-lutions (see Figure 2 in the main text). First, we deter-mined a “level” for each GO branch in order to segregatethe branches in a manner similar to the resolution param-eter. To begin, the head nodes of the three primary do-mains (“Biological Process”, “Molecular Function”, and“Cellular Component”) were assigned to level one. Next,we determined branches for which the head-node is aterm that has a parent-child relationship with only oneof these three level-one terms, and assigned those termsa level of two. Continuing, head-nodes that have parent-child relationships with only level-one or level-two termswere given a level assignment of three, and so on. Thisassignment was repeated until all head-nodes were as-signed a “level”. Branch levels were then defined basedon the level of its head-node.

We visualized branches with one-hundred or moreterm-members at the six lowest levels (Figure S2), lin-ing up branches with the same level assignment horizon-tally. Each branch is represented by a single circle, whoseradius scales as the log of the number of terms belong-

Page 12: Annotations - arxiv.org · Annotations Kimberly Glass1 ;2 3, and Michelle Girvan 4 5 1Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA 2Department

12

FIG. S2: Visualization of branches of GO terms. Each branch is represented as a circle whose radius scales as the log of thenumber of terms in the branch. The width of the line connecting two branches is proportional to the percentage of terms in thechild branch that are also in the parent branch. Color represents whether the terms in the branch belong to the BP (yellow),MF (cyan) or CC (magenta) primary domain.

ing to that branch and whose color represents whetherthe terms in the branch belong to the BP (yellow), MF(cyan) or CC (magenta) primary domain. Between thebranches found at adjacent levels, we draw a line froma branch at a higher level to a branch at a lower levelif at least 10% of the members of the branch from thehigher level also belong to the branch at the lower level.The thickness of the line is indicative of the overlap inmembership between the two branches.

Unsurprisingly, we observe three distinct sets of inter-connected branches corresponding to branches belongingto each of the three primary domains. Like in the visualrepresentation of the term communities across differentresolutions (see Figure 2 in the main text), we observea lot of “cross-talk” between branches at adjacent levels,whereby a branch at a given level is very likely to con-tain members from multiple branches at a lower level. Wenote that in this representation, an individual term canbe a member of multiple branches at the same “level”.This is in contrast to segregating communities by reso-lution, in which case each term only appears once on agiven resolution-row. As a consequence, the inter-levelconnections between branches are somewhat structurallydifferent from connections between communities found atadjacent resolutions. Namely, branches that share a termwill necessarily also share a set of parent branches. Re-dundancy of the same term member(s) across multiplebranches at a given level is visually evident among “Bi-ological Process” (yellow) branches, where there existsgroups of branches at each level that connect primarilyto the same set of branches at a lower level.

4.4. Functional Enrichment Analysis

We also wanted to test how our communities might beused in functional enrichment analysis, a very commonapplication of the Gene Ontology. Traditionally, eachbranch of GO is collapsed to its head node and all thegenes annotated to that head node are grouped into one“set.” (Note that this set is the same as the genes an-notated to the entire branch because of the propagationof gene annotation assignment, see the “Introduction” inthe main text). Recently there has been evidence thatthis approach over-simplifies the complex structure of theGene Ontology and has the potential to mis-represent the

enrichment of gene sets in branches [14]. Therefore, wechoose to use Annotation Enrichment Analysis (AEA)to evaluate the functional enrichment of experimentally-derived gene sets in both the GO branches and our termcommunities. AEA allows the user to specify a collectionof terms and a collection of genes and uses a randomiza-tion protocol to evaluate the probability that these twosets are more connected than by chance.

We ran AEA using one million randomizations defin-ing collections of genes using a public database of CancerSignatures [8] and using collections of terms defined by(1) membership in the term communities; or (2) member-ship in GO branches. We plot the number of pairs of termand gene “collections” that are enriched beneath varioussignificance thresholds (Figure S3(a)) and observe clearstatistical enrichment of experimentally-derived gene sig-natures in both term communities and GO brancheswith several thousand community-signature and branch-signature pairs enriched at a p-value significance less than10−6.

Next, we wished to verify that this enrichment was notan artifact of the community construction – namely, wewanted to verify that experimental gene signatures arenot enriched in random collections of terms and that termcommunities are not enriched in random collections ofgenes. Therefore we constructed “random” term commu-nities by taking the community assignments of terms andswapping term labels. Similarly, we constructed “ran-dom” gene sets by randomly swapping gene labels. Thisgives us random term communities and random gene setswith both the same size and relative overlap as the realterm communities and the experimentally-defined signa-tures. We then ran AEA an additional four times, using:(1) random communities and the experimentally-definedsignatures; (2) term communities and random gene sets;(3) branches and random gene sets; and (4) random com-munities and random gene sets. The result of AEA indi-cated no enrichment using either random term commu-nities or random sets of genes (Figure S3(a)), showingthat the term communities generated by partitioning theannotation-driven term network contain useful biologicalinformation.

AEA evaluates the overlap in annotations made by aset of genes or to a set of terms. Other, traditional func-tional enrichment analysis procedures, however, often useFisher’s Exact Test (FET) to evaluate the overlap in two

Page 13: Annotations - arxiv.org · Annotations Kimberly Glass1 ;2 3, and Michelle Girvan 4 5 1Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA 2Department

13

p ≤ 10−6 p ≤ 10−5 p ≤ 10−4 p ≤ 10−3 p ≤ 10−2 p ≤ 10−10

1000

2000

3000

4000

5000

6000

7000

≥8000

Num

ber

of C

ompa

rison

sM

eetin

g S

igni

fican

ce L

evel GO Branches (Cancer Signatures)

GO Branches (Random Signatures)Term Communities (Cancer Signatures)Term Communities (Random Signatures)Random Communities (Cancer Signatures)Random Communities (Random Signatures)

(a)Annotation Enrichment Analysis

p ≤ 10−6 p ≤ 10−5 p ≤ 10−4 p ≤ 10−3 p ≤ 10−2 p ≤ 10−10

30000

60000

90000

120000

150000

180000

210000

≥240000

Num

ber

of C

ompa

rison

sM

eetin

g S

igni

fican

ce L

evel GO Branches (Cancer Signatures)

GO Branches (Random Signatures)Term Communities (Cancer Signatures)Term Communities (Random Signatures)Random Communities (Cancer Signatures)Random Communities (Random Signatures)

(b)Fisher’s Exact Test

FIG. S3: Level of statistical enrichment found when comparing branches, communities, and randomly generated communitieswith either cancer signatures or randomly generated gene sets. Only branches and our identified term communities showstatistical enrichment in experimentally-derived gene signatures, with slightly more enrichment for the communities comparedto the branches.

sets of genes – one defined by a gene set or signatureof interest, and the other defined by taking the set ofgenes annotated to all the terms in a GO branch (sameas the genes annotated to the parent node). Althoughthere is evidence that such analysis is highly sensitiveto the annotation degree of genes and terms [14], wewanted to see if our communities were enriched in cancersignatures using this more traditional approach. There-fore, for each GO branch, term community and randomcommunity, we took the collection of genes annotatedto all terms in that branch/community/random commu-nity, and assigned this set of genes to represent thatbranch/community/random-community. We then eval-uated the significance of overlap between theses sets ofgenes and sets of genes as defined by cancer signatures,or random sets of genes.

As with AEA, both term communities and GObranches show enrichment in experimentally-defined genesignatures using FET, with term communities perhapshaving a slightly greater level of enrichment (FigureS3(b)). At first it is surprising to observe that randomcommunities also show a large amount of enrichment inthe experimental gene signatures – much more than ei-ther the branches or real communities!! In retrospect,however, this serves to highlight a known weakness ofFET to incorrectly over-estimate the significance of over-lap when genes in a set contain a higher than expectednumber of annotations. Whereas our random gene setsrepresent a random sampling from all genes annotated

to GO, the genes collected in the signatures publishedby GeneSigDB are biased in that they are generally an-notated much more frequently to GO than one wouldexpect by chance (see [14]). Furthermore, by taking allgenes annotated to a collection of terms, highly anno-tated genes are also more likely to be represented in thegene sets representing the branches, term communities,and the random communities. The enrichment of the ex-perimental gene signatures in the random communities,therefore, is attributable to the fact that FET finds sig-nificant overlap between sets containing an abundanceof highly annotated genes, independent of the biologicalcontent of those sets. For more discussion, see [14].

4.5. Comparative Species Analysis

Even though the Gene Ontology hierarchy establishes aspecies-independent terminology, one could imagine con-structing networks of terms using species-specific geneannotations, and thereby constructing species-specificterm communities. These communities would reflect thebiological terms that are performed by the same sets ofgenes in a particular species and wouldn’t necessarily bethe same across different species. In this section we eval-uate the similarity between partitions of GO terms de-rived from the annotations of seventeen model organisms,including thale cress (Arabidopsis thaliana), Escherichiacoli, slime mold (Dictyostelium discoideum), Aspergillus

Page 14: Annotations - arxiv.org · Annotations Kimberly Glass1 ;2 3, and Michelle Girvan 4 5 1Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA 2Department

14

Ara

bido

psis

thal

iana

Esc

heric

hia

coli

Dic

tyos

teliu

m d

isco

ideu

mA

sper

gillu

s ni

dula

nsC

andi

da a

lbic

ans

Sac

char

omyc

es c

erev

isia

eS

chiz

osac

char

omyc

es p

ombe

Cae

norh

abdi

tis e

lega

nsD

roso

phila

mel

anog

aste

rD

anio

rerio

Gal

lus

gallu

sS

us s

crof

aB

os ta

urus

Can

ine

lupu

s fa

mili

aris

Mus

mus

culu

sR

attu

s no

rveg

icus

Hom

o sa

pien

s

Variation of Inform

ation

Thale cress E. coli

Slime mold A. nidulans

YeastBudding yeast Fission yeast

WormFruit fly

ZebrafishChicken

PigCowDog

MouseRat

Human0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

2

FIG. S4: The variation of information between term par-titions generated from species-specific projected term net-works. Labels along the x-axis are the scientific name for thespecies and labels along the y-axis are the common name forthe species. Although the partitions are more similar thanrandom (V I ≈ 2.4), they are still far from being identical(V I = 0), indicating that these species-specific communitiesof functional terms carry distinct information.

nidulans, three types of yeast including Candida albicans,budding yeast (Saccharomyces cerevisiae), and fissionyeast (Schizosaccharomyces pombe), worm (Caenorhab-ditis elegans), fruit fly (Drosophila melanogaster), ze-brafish (Danio rerio), chicken (Gallus gallus), pig (Susscrofa), cow (Bos taurus), dog (Canine lupus familiaris),mouse (Mus muculus), rat (Rattus norvagicus) and hu-man (Homo sapiens). We downloaded gene annotationfiles for each of these species and projected term-termnetworks (see Section “Constructing a Term Networkfrom Gene Ontology Annotations” in the main text). Wethen partitioned each of these networks into communities(see Section “Identifying Communities of GO terms” inthe main text). For simplicity we choose to focus only onthe fundamental partition (resolution parameter r = 0,see Equation 3 in the main text). This results in exactlyone discreet partitioning of GO terms associated witheach species.

Comparing community structure identification is anongoing area of research in the complex systems field andthere are multiple proposed methods for comparing twodiscreet partitions of a set of nodes (e.g. [9, 23]). Onewe will employ here is the variation of information (V I)[23]:

V I(X,Y ) = H(X) +H(Y )− 2MI(X,Y ) (4)

where H(X) is the Shannon’s entropy associated with apartition, X, and MI(X,Y ) is the mutual information

between two partitions, X and Y . V I(X,Y ) representsa distance between the information shared between twopartitions, X and Y , with V I(X,Y ) = 0 indicating iden-tical partitions.

We calculated the VI between the partitions for everypair of species, using only terms common to both parti-tions when different sets of terms are annotated in thedifferent species. Figure S4 shows these values. We alsocalculated a VI value for a random shuffling of commu-nity assignments in each pair of species and observe that“random” VI takes a value of approximately 2.4 ± 0.12.The actual VI is generally lower than this random valueshowing that there is some shared information; however,most comparisons are much closer to this random valuethan to “perfect” agreement (V I = 0), indicating thatthese term partitions are far from identical.

Interestingly, there is a relatively higher level of simi-larity between the species belonging to the Fungi King-dom (A.nidulans, yeast, budding yeast and fission yeast).On the other hand there is higher dissimilarity betweenanimals belonging to the Chordata phylum (zebrafish,chicken, pig, cow, dog, mouse, rat, human), both betweeneach other and compared to the other species. This couldbe a consequence of evolutionary diversity playing a moredominant role in the organization of biological functionin these organisms, perhaps through more complex regu-latory mechanisms such as epigenetics. The exception iswhen comparing mouse, rat and human, which is not en-tirely surprising given the extent to which mouse is usedto mimic the human system in laboratory experiments.

Although some of the differences between the species-specific term communities may be due to variations in theannotation practices among groups that supply annota-tions to GO, it is also likely that they reflect real, biolog-ical differences in the cellular organization of these sys-tems. We suggest that using these partitions of terms in aspecies-specific context may enhance the results of func-tional analysis for these model organisms. Furthermore,identifying the exact differences between these communi-ties may uncover important cellular properties of variousspecies, an investigation we leave to future work.

Supplemental Methods

In this section we provide additional information onthe methodology used to illustrate the term communitiesacross different resolutions (Figure 2 in the main text)and generate the word clouds representing the biologicalcontent of those communities (Figure 4 in the main text).

4.6. Illustrating Community Structure at differentresolutions

To better understand the relationships between thecommunities found at different resolutions, we visual-ized the uniquely-found term communities with ten or

Page 15: Annotations - arxiv.org · Annotations Kimberly Glass1 ;2 3, and Michelle Girvan 4 5 1Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA 2Department

15

more members for the six lowest values of resolution used(r = {0, 0.25, 0.5, 1, 2, 4}) (Figure 2 in the main text).We line up the communities found at each resolution andvisualize each as a circle whose radius scales as the logof the number of term members found in that commu-nity and whose color corresponds to the percentage of“Biological Process”, “Molecular Function” and “Cellu-lar Component” terms that belong to that community.In other words, for each community we count the num-ber of members in that community from the “BiologicalProcess” domain and divide by the number of membersin the entire “Biological Process” domain. We then dothe same things for the other two domains. After thesepercentages are calculated, within each community theyare “normalized” by dividing by the maximum found per-centage such that the vector representing that commu-nity’s domain content has at least one member with avalue of one. We can think of these values in terms ofa three-part cmy color vector. The normalization pro-cess causes communities with an equal percentage fromall three primary domains to be colored black (cmy colorvector equal to [1,1,1]), and those with members onlyfrom one primary domain to be exactly yellow ([0,0,1]) for“Biological Process”, cyan ([1,0,0]) for “Molecular Func-tion”, or magenta ([0,1,0]) for “Cellular Component”.

Between the communities found at adjacent resolu-tions, we draw a line from a community at a higher reso-lution to a community at a lower resolution if at least 10%of the members of the community from the higher resolu-tion also belong to the community at the lower resolution.Line thickness scales linearly based on the percentage ofmembers of the community from the higher resolutionthat belong to the community at the lower resolution.Note that although connections are only made betweencommunities in adjacent resolutions, sometimes the par-ent community is identical to another community foundat an even lower resolution, in which case the connectionis made from the child community to the copy of theparent at its lowest found resolution.

4.7. Capturing Biological Information In WordClouds

In order to easily interpret the contents of our commu-nities we summarize the information contained in each inthe form of word clouds using a free word-cloud makingprogram [1]. This program automatically configures theorientation of the words in the clouds, but we manuallyassign each word a relative size and color to representthat word’s statistical enrichment in the community andthe primary domain that word is representing in the com-munity, respectively.

To begin, for each community, we determine all thedescriptions corresponding to the member terms of thatcommunity and count the number of times an individualword appears across all these descriptions. Then, for eachof these words, we calculate the statistical enrichment (p-

value) of the frequency of that word in the communitycompared to its frequency across the descriptions of allGO terms using the hypergeometric probability:

p = P (N ≥ Nwc|Nw, Nc, Ntot) =

min[Nw,Nc]∑i=Nwc

(Nc

i

)(Ntot−Nc

Nw−i)(

Ntot

Nw

) ,

(5)where Nwc is the number of times that word appearsin the term descriptions specific to a community, Nw isthe number of times that word appears across all termdescriptions, Nc is the number of individual words in theterm descriptions specific to a community, and Ntot is thenumber of individual words across all term descriptions.We scale the sizes of the words in the word cloud as−log10(p) such that words with the lowest probability ofbeing in the community by chance are given the largestsize, and those one might expect by chance are given asize close to zero.

We also colored each word based on the percentage oftimes the terms that word comes from in a communitybelongs to each of the primary domains. Specifically, foreach instance of a word in the term descriptions, we de-termine the domain assignment of that term and givethat word instance the same domain assignment. Tocolor the word in a community-specific context, we de-termine the domain assignments made to all instancesof that word in the community. We count these in-stances and divide by the domain assignments made to allwords. This will generate a three-part vector represent-ing the percentage of the primary domain represented bythe word in the community. We “normalize” this vectorby dividing by the maximum found percentage, resultingin a three-part cmy color vector has at least one mem-ber with a value of one. The described normalizationcauses a word with an equal percentage from all threeprimary domains to be colored black (cmy color vectorequal to [1,1,1]), and those with members only from oneprimary domain to be exactly yellow ([0,0,1]) for “Bio-logical Process”, cyan ([1,0,0]) for “Molecular Function”,or magenta ([0,1,0]) for “Cellular Component”.