a graph-based semantic similarity measure for the gene ontology

November 4, 2011 10:16 WSPC/185-JBCB S0219720011005641

Journal of Bioinformatics and Computational Biology, Vol. 9, No. 6 (2011) 681–695 © Imperial College Press. DOI: 10.1142/S0219720011005641

A GRAPH-BASED SEMANTIC SIMILARITY MEASURE FOR THE GENE ONTOLOGY

MARCO A. ALVAREZ

Department of Computer Science, Utah State University

Logan, Utah 84322, [email protected]

CHANGHUI YAN

Department of Computer Science, North Dakota State University

Fargo, North Dakota 58102, [email protected]

Received 27 February 2011
Revised 23 June 2011
Accepted 24 June 2011

Existing methods for calculating semantic similarities between pairs of Gene Ontology (GO) terms and gene products often rely on external databases like Gene Ontology Annotation (GOA) that annotate gene products using the GO terms. This dependency leads to some limitations in real applications. Here, we present a semantic similarity algorithm (SSA) that relies exclusively on the GO. When calculating the semantic similarity between a pair of input GO terms, SSA takes into account the shortest path between them, the depth of their nearest common ancestor, and a novel similarity score calculated between the definitions of the involved GO terms. In our work, we use SSA to calculate semantic similarities between pairs of proteins by combining pairwise semantic similarities between the GO terms that annotate the involved proteins. The reliability of SSA was evaluated by comparing the resulting semantic similarities between proteins with the functional similarities between proteins derived from expert annotations or sequence similarity. Comparisons with existing state-of-the-art methods showed that SSA is highly competitive with the other methods. SSA provides a reliable measure for semantic similarity independent of external databases of functional-annotation observations.

Keywords: Semantics; ontology; graph.

1. Introduction

The Gene Ontology (GO) project1 maintains a dynamic, structured, precisely defined, and controlled vocabulary of terms for describing the roles of genes and gene products. Complementarily, the Gene Ontology Annotation (GOA) project2 aims to provide high-quality GO annotations to gene products. Database entries


J. Bioinform. Comput. Biol. 2011.09:681-695. Downloaded from www.worldscientific.com by MONASH UNIVERSITY on 11/26/14. For personal use only.

http://dx.doi.org/10.1142/S0219720011005641



generated manually and electronically by the GO and GOA projects are available for a new generation of methods that aim to estimate functional relationships between gene products by exploiting their semantic relatedness. The common procedure of these methods is that, given two gene products, semantic similarities between their annotating GO terms are calculated and then combined to obtain a semantic similarity score between the gene products. Semantic similarity has been applied to estimate functional relationships between gene products in different tasks, such as protein–protein interaction and expression data,3 clustering of genes in pathways,4–7 analyzing expression profiles of gene products,8 and estimating protein function similarity.9–12

The calculation of ontology-based semantic similarities between terms has long been a central problem in natural language processing and information retrieval.13 On the other hand, ontologies have recently become a popular topic in biomedical research, where several semantic similarity methods have been proposed in the literature, as reviewed by Pesquita et al.14 In the calculation of semantic similarities, some methods depend on external databases composed of functional-annotation observations. For example, Pesquita et al.,14 Resnik,15 Jiang and Conrath,16 and Lin17 considered the Information Content (IC) of the involved GO terms, which is derived from their frequency in the entire GOA database. In this context, the GOA is an external database that consists of functional-annotation observations. There are some limitations to these methods. First, they are sensitive to the functional-annotation observations. For example, as new genes (or gene products) are discovered and annotated with GO terms, the frequencies of GO terms in the GOA databases will change and, consequently, so will the IC of GO terms. Thus, even though the GO itself has not changed, the semantic similarities given by these measures will change. Second, as recommended by some researchers, the IC of GO terms should be computed on a per-species basis. Thus, for species that have poor or incomplete genome annotation, these methods cannot compute a reasonable semantic similarity. A way to avoid these problems is to use semantic methods that do not rely on functional-annotation observations such as those in GOA.

Here, we propose a semantic similarity algorithm (SSA) that depends exclusively on the GO. SSA was originally proposed for calculating semantic similarity between English words using the WordNet ontology.18 Herein, we propose a new version of SSA with an improved formulation adapted to the GO. There have been some purely structural semantic methods that only rely on the structure of the GO (e.g. see Refs. 4, 19 and 20). One of the main advantages of our method is that it does not assume a consistent quantity of parent–child specialization across relations in the gene ontology, as most purely structural methods do. Furthermore, we show how SSA can be easily extended to calculate semantic similarity between gene products. In this study, we use proteins as examples of gene products, but it is worth pointing out that the GO also contains functional annotations for other gene products like non-coding RNA. Given a pair of input proteins, SSA considers a subgraph of the GO, including all terms annotating the input proteins and the


respective ancestors of such terms in the GO. Then, SSA exploits the subgraph to calculate pairwise semantic similarities between the GO terms annotating the input proteins. Finally, such pairwise semantic similarities are combined to obtain a semantic similarity score for the two proteins. The reliability of our approach is evaluated by comparing the resulting semantic similarities between proteins against functional similarities between proteins derived from expert annotations.

2. Methods

Given two proteins, a subgraph from the GO is created. This subgraph contains all the GO terms annotating the proteins and their ancestors in the GO. Based on this subgraph, SSA is used to calculate pairwise similarities between those GO terms that annotate the input proteins. Finally, we combine the resultant pairwise similarities to obtain a semantic similarity score between the two proteins. Starting with Sec. 2.1, we describe this approach step by step.

2.1. Subgraph that consists of the GO terms

For a pair of input proteins, all their annotating GO terms are collected. Then a subgraph of the GO (denoted Gsim) is created. Each vertex in Gsim is a GO term that either annotates one or both of the input proteins or is an ancestor of the annotating terms. Each edge corresponds to a relationship between the two corresponding terms. In the GO, there are three types of relationships between GO terms: “is–a,” “part–of,” and “regulates.” We focus only on the two most common types of relationships: “is–a” and “part–of.” As the GO includes three different ontologies, the resulting subgraph will be different depending on which ontology is being used. For example, Fig. 1 shows the resulting subgraph for UniProtKB proteins P17252 and P18907 when the Cellular Component (CC) ontology is used.
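The construction of Gsim can be sketched as a traversal from the annotating terms up to the root. The following is a minimal illustration, not the authors' implementation: the toy ontology `TOY_PARENTS`, its term names, and the function `build_subgraph` are hypothetical, and the relationship labels `is_a`, `part_of` and `regulates` are simply the names chosen here for the three relationship types.

```python
from collections import deque

# Hypothetical toy ontology: child term -> list of (parent term, relationship).
# The term names are placeholders, not real GO identifiers.
TOY_PARENTS = {
    "root": [],
    "A": [("root", "is_a")],
    "B": [("root", "is_a")],
    "C": [("A", "is_a"), ("B", "part_of")],
    "D": [("B", "is_a")],
    "E": [("D", "regulates")],
}


def build_subgraph(annotations_p1, annotations_p2, parents,
                   allowed=("is_a", "part_of")):
    """Collect the annotating terms of both proteins plus all their
    ancestors, following only the allowed relationship types."""
    vertices = set()
    edges = set()
    queue = deque(set(annotations_p1) | set(annotations_p2))
    while queue:
        term = queue.popleft()
        if term in vertices:
            continue  # already expanded
        vertices.add(term)
        for parent, relation in parents.get(term, []):
            if relation in allowed:
                edges.add((term, parent, relation))
                queue.append(parent)
    return vertices, edges


# Protein 1 is annotated with C, protein 2 with D (toy annotations).
vertices, edges = build_subgraph({"C"}, {"D"}, TOY_PARENTS)
print(sorted(vertices))  # ['A', 'B', 'C', 'D', 'root']
```

Passing `allowed=("is_a",)` restricts the subgraph to “is–a” edges only, which is one way the choice of relationship types could be made configurable.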

2.2. Semantic similarity between GO terms

Given two GO terms, t1 and t2, the semantic similarity between the two terms is calculated on their respective subgraph Gsim using the equation

$$\mathrm{SSA}(t_1, t_2) = \frac{\mathrm{spsim}(t_1, t_2) + \mathrm{nca}(t_1, t_2) + \mathrm{ld}(t_1, t_2)}{3}, \tag{1}$$

where spsim is the distance of the shortest path between t1 and t2 in Gsim converted into a similarity value, nca is a score proportional to the depth of the nearest common ancestor (NCA) of t1 and t2 in Gsim, and ld is a similarity score between the definitions of the two terms. All these functions return values in the range of 0 to 1. Equation (1) takes the average of the three functions and also gives a value between 0 and 1. We have also explored different weighting strategies by analyzing the contributions and correlation of the three functions. No weighting strategy could achieve consistent improvement on all the datasets; that is, although some weighting choices gave minor improvement on some datasets, they also achieved worse




Fig. 1. Subgraph for proteins P17252 (Protein kinase C alpha type) and P18907 (Sodium/potassium-transporting ATPase subunit alpha-1) from the CC ontology. All the ancestors of the annotating terms are also included. There are five terms annotating only protein P17252, six terms annotating only protein P18907, and one term annotating both proteins.

performance in other datasets. In this research, we also explored the feasibility of calculating SSA(t1, t2) without considering the similarity score between the definitions of the two terms. In those experiments, SSA(t1, t2) was just the average of spsim and nca. The three functions in Eq. (1) are detailed in the following sections.
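The two ways of combining the component scores described above (all three functions, or only spsim and nca) can be written as one small helper. This is a sketch under the assumption that the component scores have already been computed; the function name `ssa` and its optional `ld` argument are illustrative choices.

```python
def ssa(spsim, nca, ld=None):
    """Average the component scores as in Eq. (1).

    If ld is None, only spsim and nca are averaged, which corresponds to
    the variant that ignores term definitions.
    """
    parts = [spsim, nca] if ld is None else [spsim, nca, ld]
    for value in parts:
        if not 0.0 <= value <= 1.0:
            raise ValueError("component scores are expected in [0, 1]")
    return sum(parts) / len(parts)


with_definitions = ssa(0.9, 0.6, 0.3)   # average of three scores
without_definitions = ssa(0.9, 0.5)     # average of two scores
```

Because every component lies in [0, 1], the average does too, so no further normalization is needed.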

2.2.1. spsim(): Shortest distance between two terms

One challenge in calculating semantic similarities derived from ontologies is that ontology relationships do not reflect equal semantic distances between terms.




In the case of the GO, edges closer to the root usually represent larger differences in function (or other properties) than edges farther away from the root. To address this challenge, we assign weights (in the range of [0, 1]) to edges using Eq. (2), so that edges closer to the root have higher weights.

$$\mathrm{weight}(t_i, t_j) = 1 - \frac{\mathrm{depth}(t_i) + \mathrm{depth}(t_j)}{2 \cdot \mathit{max}}, \tag{2}$$

where ti and tj are the endpoint vertices (terms) of the edge, depth(ti) and depth(tj) are their corresponding depths in the graph, and max is the maximum depth in the respective ontology. We define the length of a path as the sum over the weights of the edges on the path. In SSA, the length of the shortest path between t1 and t2 is first calculated and then converted into a similarity score, spsim(ti, tj), using Eq. (3).

$$\mathrm{spsim}(t_i, t_j) = \left( \frac{\mathrm{sp}(t_i, t_j)}{\mathit{max}} - 1 \right)^2, \tag{3}$$

where sp(ti, tj) is the length of the shortest path from node ti to node tj, and max is the maximum depth in the ontology. Note that in Eq. (3) we normalize the value of sp(ti, tj) by max, which is the maximum depth. It can be easily shown that after weighting edges according to Eq. (2), the (weighted) length of the longest path between two GO terms in the graph is less than or equal to the maximum depth (max). Thus, the normalization provides values in the interval [0, 1]. In order to convert distances into similarity values, we subtract 1 from the normalized distance and apply a quadratic function. In this way, longer distances between two terms give lower scores of similarity. We have also tried a linear function (absolute function) instead of a quadratic function to convert distance into similarity. No improvement was observed.
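To make Eqs. (2) and (3) concrete, the sketch below weights the edges of a small toy graph, finds the weighted shortest path with Dijkstra's algorithm, and converts it into a similarity. Several details are assumptions made for illustration: edges are traversed in both directions (so that a path can pass through a common ancestor), the toy terms and depths are invented, and `MAX_DEPTH = 4` stands in for the maximum depth of the ontology.

```python
import heapq


def edge_weight(depth_i, depth_j, max_depth):
    """Eq. (2): edges closer to the root receive higher weights."""
    return 1.0 - (depth_i + depth_j) / (2.0 * max_depth)


def shortest_path_length(edges, depth, max_depth, source, target):
    """Weighted shortest path between two terms (Dijkstra's algorithm).

    edges is an iterable of (child, parent) pairs; they are treated as
    undirected here, which is an assumption of this sketch.
    """
    adjacency = {}
    for u, v in edges:
        w = edge_weight(depth[u], depth[v], max_depth)
        adjacency.setdefault(u, []).append((v, w))
        adjacency.setdefault(v, []).append((u, w))
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, w in adjacency.get(node, []):
            candidate = dist + w
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return float("inf")  # no path between the two terms


def spsim(sp, max_depth):
    """Eq. (3): normalize the path length and convert it to a similarity."""
    return (sp / max_depth - 1.0) ** 2


# Toy data: C and D are both children of A, which is a child of the root.
depth = {"root": 0, "A": 1, "C": 2, "D": 2}
edges = [("A", "root"), ("C", "A"), ("D", "A")]
MAX_DEPTH = 4

sp = shortest_path_length(edges, depth, MAX_DEPTH, "C", "D")
print(sp, spsim(sp, MAX_DEPTH))  # 1.25 0.47265625
```

In this toy graph each of the two edges on the path C, A, D has weight 1 − 3/8 = 0.625, while the edge between A and the root has the larger weight 0.875, which illustrates how edges near the root contribute more distance.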

2.2.2. nca(): Depth of NCA

The depth of the NCA is calculated by Eq. (4), which is the depth of the nearest term that is a common ancestor of ti and tj, normalized by the maximum depth of the corresponding ontology in GO. Therefore, nca() is limited to the interval [0, 1].

$$\mathrm{nca}(t_i, t_j) = \frac{\mathrm{dnca}(t_i, t_j)}{\mathit{max}}, \tag{4}$$

where dnca() simply returns the depth of the nearest common ancestor. Note that in this formula, the deeper the NCA is, the more similar the two terms are.
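A possible reading of Eq. (4) in code is given below. Two points are assumptions of this sketch rather than statements from the text: the "nearest" common ancestor is taken to be the deepest one, and a term counts as its own ancestor. The parent map, depths and `max_depth` value are toy data.

```python
def ancestors(term, parents):
    """Return the term itself plus every term reachable via parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def nca(t_i, t_j, parents, depth, max_depth):
    """Eq. (4): depth of the nearest common ancestor, normalized by max.

    'Nearest' is interpreted as the deepest common ancestor (assumption).
    """
    common = ancestors(t_i, parents) & ancestors(t_j, parents)
    if not common:
        return 0.0  # terms from unrelated graphs share no ancestor
    return max(depth[t] for t in common) / max_depth


# Toy ontology: child -> list of parents.
toy_parents = {"root": [], "A": ["root"], "B": ["root"],
               "C": ["A"], "D": ["A", "B"]}
toy_depth = {"root": 0, "A": 1, "B": 1, "C": 2, "D": 2}

print(nca("C", "D", toy_parents, toy_depth, max_depth=4))  # 0.25
```

Here C and D share the ancestors A and root; A is the deeper one (depth 1), so the score is 1/4.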

2.2.3. ld(): Similarity between the definitions of the GO terms

In the GO, each term is associated with a definition, which is a text note that provides a rich source of knowledge about the meaning of the term. For example, the definition of the term GO:0051210 (isotropic cell growth) is “The process by which a cell irreversibly increases in size uniformly in all directions. In general, a rounded




cell morphology reflects isotropic cell growth.” We define the long definition of a term as the union of the term’s name and its definition. In the case of GO:0051210, the corresponding long definition is: “isotropic cell growth. The process by which a cell irreversibly increases in size uniformly in all directions. In general, a rounded cell morphology reflects isotropic cell growth.” The similarity score between the long definitions of t1 and t2 is given by ld(t1, t2). First, we refine every term’s long definition in the GO by removing common words (e.g. of, a, the, is, etc.) and applying the Porter algorithm21 for stemming. Then, for each term, we create a long definition vector in the n-dimensional ontology space, where n is the total number of unique stemmed words found in the long definitions of all terms in the ontology. Each value in the long definition vector represents the tf-idf (term frequency–inverse document frequency) weight for the corresponding word. This weight evaluates how important a word is to its long definition. A high tf-idf weight is reached by a high word frequency in the long definition and a low occurrence of the word in the collection of long definitions in the ontology. The final similarity score between the long definition vectors of two terms is their cosine similarity, defined by Eq. (5).

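$$\mathrm{ld}(t_i, t_j) = \frac{\overrightarrow{ld_i} \cdot \overrightarrow{ld_j}}{\lVert \overrightarrow{ld_i} \rVert \cdot \lVert \overrightarrow{ld_j} \rVert}, \tag{5}$$

where $\overrightarrow{ld_i}$ and $\overrightarrow{ld_j}$ are the long definition vectors for terms ti and tj, respectively.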
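The tf-idf and cosine steps can be illustrated with a short, dependency-free sketch. It is not the authors' pipeline: Porter stemming is omitted for brevity, the stop-word list is a small hypothetical one, the tf-idf variant (relative term frequency times the logarithm of the inverse document frequency) is one common choice that the text does not specify, and the three "long definitions" are toy strings keyed by made-up identifiers.

```python
import math
import re
from collections import Counter

# Small illustrative stop-word list (assumption; the paper's list is not given).
STOP_WORDS = {"of", "a", "an", "the", "is", "in", "by", "which", "to", "and"}


def tokenize(text):
    """Lowercase, keep alphabetic words, drop stop words (no stemming here)."""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOP_WORDS]


def tfidf_vectors(long_definitions):
    """Map each term id to a sparse vector {word: tf-idf weight}."""
    tokens = {term: tokenize(text) for term, text in long_definitions.items()}
    n_docs = len(tokens)
    doc_freq = Counter()
    for words in tokens.values():
        doc_freq.update(set(words))
    vectors = {}
    for term, words in tokens.items():
        counts = Counter(words)
        vectors[term] = {
            w: (c / len(words)) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        }
    return vectors


def cosine(u, v):
    """Eq. (5): cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy long definitions; the identifiers T1..T3 are placeholders.
DEFS = {
    "T1": "isotropic cell growth. The process by which a cell irreversibly "
          "increases in size uniformly in all directions.",
    "T2": "anisotropic cell growth. The process by which a cell irreversibly "
          "increases in size in one direction.",
    "T3": "protein kinase activity. Catalysis of the phosphorylation of a protein.",
}

vectors = tfidf_vectors(DEFS)
related = cosine(vectors["T1"], vectors["T2"])    # shares several words
unrelated = cosine(vectors["T1"], vectors["T3"])  # shares no words, so 0.0
```

Since all tf-idf weights are non-negative, the cosine stays within [0, 1], consistent with the range required by Eq. (1).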

2.3. Semantic similarity between proteins

Although SSA was designed for calculating semantic similarity between GO terms, we can easily combine term similarities to obtain the semantic similarity between proteins as follows. Let A(Pk) = {tk1, tk2, ...} be the set of non-redundant GO terms annotating protein Pk. Then, given two input proteins Pi and Pj with annotation sets A(Pi) = {ti1, ti2, ..., tim} and A(Pj) = {tj1, tj2, ..., tjn}, we obtain the m × n similarity matrix M, where M(a, b) is the semantic similarity between terms tia and tjb given by SSA(tia, tjb). Based on this matrix, we can calculate the semantic similarity between Pi and Pj using three different methods described in previous research,4,8,9,11,12,22 namely the maximum (MAX), the average (AVE) and the best match average (BMA). MAX takes the maximum value from the similarity matrix M and considers it as the semantic similarity between the proteins. AVE calculates the average of all entries in the matrix and assigns it to the semantic similarity between the proteins. BMA considers the maximum values in each row (row maxima) and each column (column maxima). The average of row maxima and that of column maxima are first calculated. Then the average of the two is the semantic similarity between the proteins, as shown in Eq. (6).

$$P_{\mathrm{bma}}(P_i, P_j) = \frac{\mathrm{rowmax}(P_i, P_j) + \mathrm{colmax}(P_i, P_j)}{2}, \tag{6}$$




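where

$$\mathrm{rowmax}(P_i, P_j) = \frac{1}{m} \cdot \sum_{a=1}^{m} \max_{1 \le b \le n} \mathrm{SSA}(t_{ia}, t_{jb}),$$

$$\mathrm{colmax}(P_i, P_j) = \frac{1}{n} \cdot \sum_{b=1}^{n} \max_{1 \le a \le m} \mathrm{SSA}(t_{ia}, t_{jb}).$$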

Intuitively, the maximum value in a row (or column) corresponds to the best hit when a term annotating a protein is compared with all the terms annotating another protein. The average over row maxima (rowmax) reflects the best hits when comparing terms from the protein associated with the rows against the protein associated with the columns. Conversely, the average over column maxima (colmax) reflects the best hits when comparing terms from the protein associated with the columns against the protein associated with the rows. Thus, the score given by the best match average method (Pbma) reflects the best scores in both directional comparisons.
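The three combination strategies can be compared side by side on a small similarity matrix. The sketch below assumes the matrix of pairwise SSA scores is already available as a list of rows; the function name `combine` and the example values are illustrative.

```python
def combine(matrix):
    """Combine an m x n matrix of term similarities into protein-level
    scores using MAX, AVE and BMA (Eq. (6))."""
    m = len(matrix)
    n = len(matrix[0])
    flat = [value for row in matrix for value in row]
    # Average of the best hit for each row term against all column terms.
    row_max = sum(max(row) for row in matrix) / m
    # Average of the best hit for each column term against all row terms.
    col_max = sum(max(matrix[a][b] for a in range(m)) for b in range(n)) / n
    return {
        "MAX": max(flat),
        "AVE": sum(flat) / len(flat),
        "BMA": (row_max + col_max) / 2.0,
    }


# Toy matrix: protein Pi has 2 annotating terms (rows), Pj has 3 (columns).
scores = combine([
    [0.9, 0.2, 0.4],
    [0.1, 0.8, 0.3],
])
# MAX is 0.9, AVE is about 0.45, BMA is about 0.775.
```

In this example the row maxima (0.9 and 0.8) average to 0.85 and the column maxima (0.9, 0.8 and 0.4) average to 0.7, so BMA falls between the optimistic MAX and the diluted AVE.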

2.4. Protein function similarity derived from Pfam annotations

In our work, the reliability of SSA is evaluated by comparing the resulting semantic similarities against protein function similarity derived from expert annotations. Functional similarities between proteins were derived from the Pfam database,23 as described by Couto et al.12 Thus, let F(Pi) = {fi1, fi2, ..., fim} be the set of Pfam annotations for protein Pi and F(Pj) = {fj1, fj2, ..., fjn} be that of Pj; then the functional similarity between the two proteins can be estimated using the Jaccard coefficient between the two annotation sets as shown in Eq. (7).

$$P_{\mathrm{pfam}}(P_i, P_j) = \frac{|F(P_i) \cap F(P_j)|}{|F(P_i) \cup F(P_j)|} \tag{7}$$
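Eq. (7) is the Jaccard coefficient of two annotation sets, which takes only a few lines of code. In this sketch the family labels such as `"FAM_A"` are placeholders rather than real Pfam accessions, and returning 0.0 when both sets are empty is a convention chosen here.

```python
def pfam_similarity(f_i, f_j):
    """Eq. (7): Jaccard coefficient between two sets of family annotations."""
    f_i, f_j = set(f_i), set(f_j)
    union = f_i | f_j
    if not union:
        return 0.0  # convention for two unannotated proteins (assumption)
    return len(f_i & f_j) / len(union)


# One shared family out of three distinct families overall.
similarity = pfam_similarity({"FAM_A", "FAM_B"}, {"FAM_B", "FAM_C"})
```

Two proteins with identical annotation sets score 1, and proteins with no family in common score 0.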

2.5. Evaluation of the reliability of SSA by comparing the resulting semantic similarities with functional similarities derived from Pfam

In our experiments, for a set of proteins, their pairwise semantic similarities were calculated using SSA. Separately, the pairwise functional similarities were calculated based on their Pfam annotations as mentioned above. As in Couto et al.,12 Pearson’s correlation coefficient (PCC) between the semantic similarities and functional similarities was used for performance comparisons. The values of PCC range from −1 to 1. In this context, higher values of PCC indicate better performance in the calculation of semantic similarity.
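The evaluation step amounts to computing the PCC between two equally long lists of scores, one entry per protein pair. A plain-Python sketch follows; the function name `pearson` and the sample values are illustrative, and the input lists are assumed to be aligned so that position k in both lists refers to the same protein pair.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two aligned score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy scores for four protein pairs (illustrative values only).
semantic = [0.9, 0.4, 0.7, 0.1]
functional = [1.0, 0.5, 0.5, 0.0]
pcc = pearson(semantic, functional)
```

A PCC close to 1 would mean that protein pairs ranked as semantically similar also tend to share Pfam annotations.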




2.6. Datasets

We downloaded revision 1.723 of the GO. The GO contains three structured controlled ontologies that describe gene products in terms of their associated biological process (BP), molecular function (MF), and CC. Using a different ontology will yield a different subgraph. For the set of proteins, we used release 15.6 of UniProtKB/Swiss-Prot, which is the most comprehensive and highly annotated publicly accessible protein sequence database, having archived more than 6.6 million proteins through a combination of manual and electronic techniques. On the other hand, the GOA project employs both manual and electronic methods for the association of GO terms to UniProtKB entries. Both methods are strictly controlled to produce high-quality GO annotations.2 We used release 74.0 of GOA-Uniprot as the database for protein annotations.

In our preliminary experiments, we explored a few variations of SSA. To compare the performance of different versions, we selected the top 500 proteins with the largest total number of annotations in GOA-Uniprot and created subsets with the top 100, 200, 300, 400, and 500 most annotated proteins. These subsets will be referred to as T-100, T-200, T-300, T-400, and T-500, respectively. We ensured that all selected proteins existed in UniProtKB/Swiss-Prot and had at least one annotation from each of the three GO ontologies in GOA-Uniprot. We also made sure that the selected proteins have at least one Pfam-A annotation by checking the online service provided by the Pfam database.

3. Results

3.1. BMA outperforms other methods in combining semantic similarities between GO terms

We first compared the three different methods, namely the maximum method (MAX), the average method (AVE) and the best match average method (BMA), for combining pairwise semantic similarities between GO terms to obtain a semantic similarity between a pair of proteins. We used the simplest version of SSA, which considered only “is–a” relationships when constructing the subgraph and did not consider the similarity score between the definitions of the two terms. Figure 2 shows the results when the MF ontology was used to build the subgraph. The results show that the BMA outperforms MAX and AVE in all the datasets used. We used Fisher’s transformation to test the statistical significance of the difference. In all cases, the BMA outperforms MAX and AVE with a p-value less than 0.0001. The same trend was observed when the BP and CC ontologies were used to build the subgraph. We also repeated this experiment using the different versions of SSA that we explored in this research (see details in the next section); the same trend was observed. Our results here are consistent with the claim of Pesquita et al.9 Since BMA outperforms AVE and MAX, in all of the following experiments




Fig. 2. Correlations between semantic similarities and functional similarities for different combination methods. This figure shows that BMA outperforms MAX and AVE for the MF ontology. The same trend was observed when the BP and the CC ontologies were used.

BMA was chosen to combine semantic similarities between GO terms to obtain the semantic similarity between proteins.
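For readers unfamiliar with Fisher's transformation, the sketch below shows the textbook z-test for comparing two correlation coefficients measured on independent samples. The text does not state which variant of the test or which sample sizes were used, so this version, the function names, and the example numbers are assumptions, and the sketch is not intended to reproduce the reported p-values.

```python
import math


def fisher_z(r):
    """Fisher's transformation of a correlation coefficient."""
    return math.atanh(r)


def compare_correlations(r1, n1, r2, n2):
    """Two-sided z-test for the difference between two correlations.

    Assumes the two correlations come from independent samples of sizes
    n1 and n2 (an assumption of this sketch).
    """
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (fisher_z(r1) - fisher_z(r2)) / standard_error
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Illustrative numbers only: two correlations, each over 1000 observations.
z, p_value = compare_correlations(0.60, 1000, 0.50, 1000)
```

When the same protein pairs are scored by both methods, the two correlations are not independent, and a test for dependent correlations would be the more careful choice.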

3.2. Exploring the usefulness of the definitions of GO terms and that of “part–of” relationships

In the GO, each term is associated with a text definition. To the best of our knowledge, no existing GO-based method explores definitions in the calculation of semantic similarity. In this section, we test whether considering the content of the term definitions improves the performance in the calculation of semantic similarity. When taking term definitions into account, a function [i.e. ld(t1, t2)] that measures the similarity between the definitions of two GO terms is used.

In the GO, there are three types of relationships between GO terms: “is–a,” “part–of,” and “regulates.” We will also explore the contributions of the most common types of relationships in calculating semantic similarity, to be precise, “is–a” and “part–of” relationships. If a type of relationship is taken into account, then the edges corresponding to this type of relationship are included in the subgraph as described in the first section of Methods. Consequently, we performed experiments using distinct versions of SSA, described in Table 1.

Note that in SSAv2 and SSAv4, both the edges corresponding to “is–a” relationships and the edges corresponding to “part–of” relationships are included in

Table 1. Different versions of SSA used in our experiments.

| Version | Relationships considered | Term definitions considered |
|---|---|---|
| SSAv1 | “is–a” | No |
| SSAv2 | “is–a” and “part–of” | No |
| SSAv3 | “is–a” | Yes |
| SSAv4 | “is–a” and “part–of” | Yes |




the subgraph, while in SSAv1 and SSAv3, only the edges corresponding to “is–a” relationships are included. For each version of SSA, we calculated pairwise semantic similarities between proteins. Then the correlation between the resulting semantic similarities and functional similarities was used to evaluate the performance of each version. This step was repeated for each of the five datasets (T-100, T-200, T-300, T-400, and T-500) and for each of the three ontologies in the GO. Table 2 shows the results of the experiments. Among the four versions, SSAv3 achieves the best performance. We tested the significance of the difference between SSAv3 and other versions using Fisher’s transformation. The p-values are shown in parentheses in Table 2. In most of the cases, SSAv3 outperforms other versions with a p-value less than 0.05, a conventional threshold for significance tests. All but one of the insignificant p-values occur in the cases when T-100 is used, presumably due to the relatively small size of T-100. Since SSAv3 and SSAv4 achieve the same results on the MF ontology, the significance test was not performed between SSAv3 and SSAv4 on the MF ontology.

SSAv3 yields higher correlations than SSAv1 across all datasets and all ontologies. The only difference between SSAv1 and SSAv3 is that SSAv3 takes into account the similarity between the definitions of GO terms and SSAv1 does not. Thus, we can conclude that taking into account the definitions of GO terms can improve the performance. This conclusion is also supported by comparing the results of SSAv2 with those of SSAv4. Table 2 also shows that the correlations

Table 2. Correlations between protein function similarity and the semantic similarities given by the four different versions of SSA.

| Ontology | Dataset | SSAv1 | SSAv2 | SSAv3 | SSAv4 |
|---|---|---|---|---|---|
| BP | T-100 | 0.79 (0.0793) | 0.79 (0.0793) | 0.81 | 0.80 (0.242) |
| BP | T-200 | 0.68 (0.0024) | 0.66 (<0.0001) | 0.70 | 0.68 (0.0024) |
| BP | T-300 | 0.73 (0.0174) | 0.70 (<0.0001) | 0.74 | 0.73 (0.0174) |
| BP | T-400 | 0.64 (<0.0001) | 0.61 (<0.0001) | 0.66 | 0.64 (<0.0001) |
| BP | T-500 | 0.59 (<0.0001) | 0.56 (<0.0001) | 0.61 | 0.59 (<0.0001) |
| MF | T-100 | 0.77 (0.242) | 0.77 (0.242) | 0.78 | 0.78 |
| MF | T-200 | 0.62 (0.0024) | 0.62 (0.0024) | 0.64 | 0.64 |
| MF | T-300 | 0.65 (0.0174) | 0.65 (0.0174) | 0.66 | 0.66 |
| MF | T-400 | 0.57 (0.0024) | 0.57 (0.0024) | 0.58 | 0.58 |
| MF | T-500 | 0.51 (<0.0001) | 0.51 (<0.0001) | 0.53 | 0.53 |
| CC | T-100 | 0.53 (0.0174) | 0.51 (<0.0001) | 0.56 | 0.54 (0.0793) |
| CC | T-200 | 0.39 (0.0024) | 0.37 (<0.0001) | 0.41 | 0.40 (0.0793) |
| CC | T-300 | 0.43 (<0.0001) | 0.41 (<0.0001) | 0.46 | 0.44 (<0.0001) |
| CC | T-400 | 0.34 (<0.0001) | 0.32 (<0.0001) | 0.36 | 0.35 (0.0024) |
| CC | T-500 | 0.30 (<0.0001) | 0.29 (<0.0001) | 0.32 | 0.31 (<0.0001) |

Note: BP: Biological Process, MF: Molecular Function, CC: Cellular Component. We tested the significance of the difference between SSAv3 and other versions using Fisher’s transformation. The p-values are shown in parentheses. Since SSAv3 and SSAv4 achieve the same results on the MF ontology, the significance test was not performed between SSAv3 and SSAv4 on the MF ontology.




achieved by SSAv2 are lower than or equal to those of SSAv1 while the correlations achieved by SSAv4 are lower than or equal to those of SSAv3. From this, we can conclude that the use of “part–of” relationships does not improve the performance. When the MF ontology is used to build the graph, SSAv2 achieves the same results as SSAv1 while SSAv4 achieves the same results as SSAv3. This is because there are only 3 “part–of” relationships in the MF ontology, which is too few to affect the results. In comparison, there are 3446 “part–of” relationships in the BP ontology and 944 in the CC ontology.

When the dataset and the method are fixed, SSA achieves much lower correlations when the CC ontology is used to build the subgraph than when BP or MF is used. One possible explanation is that the BP and MF ontologies describe gene products using terms associated with biological processes and molecular functions, both of which are directly related to protein function. Consequently, the semantic similarities calculated using these ontologies correlate better with functional similarities. In contrast, the CC ontology describes gene products using terms associated with CCs, which are only indirectly related to protein function. Another possible cause is that proteins in the dataset have different numbers of annotations from the three ontologies. For example, in dataset T-500, the 500 proteins have a total of 16248 unique annotations from the BP ontology, 5029 unique annotations from the MF ontology and only 4298 unique annotations from the CC ontology. In Table 2, we can observe a general trend of decreasing performance from T-100 to T-500 (with the only exception of T-300). Note that from T-100 to T-500, the average number of annotations per protein decreases. These results suggest that having more annotations per protein in the dataset leads to more reliable functional similarity estimation. This claim is also supported by the results from Xu et al.3

3.3. Comparisons with previous methods

In this section, we compare SSA against previously published methods. We used the CESSM web server9 (http://xldb.fc.ul.pt/biotools/cessm/) developed by the XLDB research group at the University of Lisbon. CESSM currently implements 11 semantic similarity methods: simGIC, simUI,9 and the average, maximum and best-match average combinations of the three different term similarity methods proposed by Resnik,15 Lin,17 and Jiang and Conrath.16 CESSM assumes that similar sequences lead to similar function and considers sequence similarity between a pair of proteins as the functional similarity between them. In CESSM, sequence similarity is calculated using RRBS,9 which is a relative measure of sequence similarity based on BLAST bitscores. In the evaluation of a semantic similarity method, CESSM compares the pairwise semantic similarities between proteins given by the method with the pairwise sequence similarities between the proteins. CESSM uses resolution to evaluate how well semantic similarities match sequence similarities. A detailed description of resolution can be found in Ref. 9. Briefly, first, the semantic similarity is plotted against the sequence similarity. Then, the data points are




Table 3. Resolutions for SSA and the 11 other methods implemented in CESSM.

| Algorithm | Resolution |
|---|---|
| SSA (Best Match Average) | **0.972** |
| simUI | 0.967 |
| Resnik (Best Match Average) | 0.958 |
| simGIC | 0.956 |
| Lin (Best Match Average) | 0.571 |
| Lin (Average) | 0.451 |
| Resnik (Average) | 0.415 |
| Resnik (Maximum) | 0.381 |
| Lin (Maximum) | 0.248 |
| Jiang (Best Match Average) | 0.241 |
| Jiang (Average) | 0.175 |
| Jiang (Maximum) | 0.084 |

Note: Highest resolution is indicated in bold.

divided into sequence similarity intervals, with all intervals having an equal number of data points. Let y be the variable representing the averaged semantic similarity of the intervals and x be the variable representing the averaged sequence similarity of the intervals. Last, the behavior of y versus x is modeled using two Normal cumulative distribution functions (NCDF) as y = a + b·NCDF1(x) + c·NCDF2(x). Then, the resolution of the semantic similarity method is given by the sum of the scale factors, i.e., b + c. Based on the authors’ definition, resolution is the relative intensity with which variations in the sequence similarity scale are translated into the semantic similarity scale. A higher resolution value means that the semantic similarity method has a higher capability to distinguish between different levels of protein function. Therefore, a method with higher resolution performs better than a method with lower resolution.
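The ingredients of the resolution measure can be sketched as follows: equal-count binning of the data points, the two-NCDF model, and the sum of its scale factors. This is only an illustration of the description above and not the CESSM implementation; the curve-fitting step that estimates the parameters is omitted, and the location and spread parameters `mu1`, `sigma1`, `mu2`, `sigma2` are names introduced here because the text does not name them.

```python
import math


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def equal_count_intervals(pairs, n_intervals):
    """Sort (sequence_similarity, semantic_similarity) pairs by sequence
    similarity, split them into intervals with equal numbers of points, and
    return the averaged (x, y) of each interval. Points left over when the
    total is not divisible by n_intervals are dropped in this sketch."""
    ordered = sorted(pairs)
    size = len(ordered) // n_intervals
    if size == 0:
        raise ValueError("more intervals than data points")
    points = []
    for k in range(n_intervals):
        chunk = ordered[k * size:(k + 1) * size]
        mean_x = sum(x for x, _ in chunk) / size
        mean_y = sum(y for _, y in chunk) / size
        points.append((mean_x, mean_y))
    return points


def model(x, a, b, mu1, sigma1, c, mu2, sigma2):
    """y = a + b * NCDF1(x) + c * NCDF2(x)."""
    return a + b * normal_cdf(x, mu1, sigma1) + c * normal_cdf(x, mu2, sigma2)


def resolution(b, c):
    """Resolution is the sum of the two scale factors."""
    return b + c


# Toy data: four (sequence similarity, semantic similarity) points, two bins.
binned = equal_count_intervals(
    [(0.1, 0.2), (0.3, 0.4), (0.5, 0.6), (0.7, 0.8)], n_intervals=2)
```

Since each NCDF rises from 0 to 1, the modeled y spans from a to a + b + c, which is why b + c measures how much of the semantic similarity scale is covered as sequence similarity varies.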

Our results from Sec. 3.2 have confirmed that SSAv3 achieves better performance than other versions of SSA. Thus, we use CESSM to compare SSAv3 with the methods implemented in CESSM. In this comparison only the MF ontology is used, because, as stated by Lord et al.,11 BP and CC ontologies presented poor correlation with sequence similarity. Table 3 presents the resolution scores achieved by different methods. It shows that SSA achieves the highest resolution, outperforming other previously published methods. In addition to the high resolution, SSA has the advantage that it does not rely on an external database of functional-annotation observations in the calculation of semantic similarity. In comparison, all the other methods shown in Table 3 need such a database (in this case, the annotations in GOA).

4. Conclusions

This paper presents a new method, SSA, for the GO-based calculation of semantic similarity between gene products. In addition to its highly competitive performance, a main difference between SSA and previous methods is that SSA relies exclusively on the GO and does not assume a consistent quantity of parent–child specialization across relations in the gene ontology, as most purely structural methods do. Other contributions of this work include: (1) we explored the inclusion of definitions of GO terms in the semantic similarity calculation and confirmed that using such information can improve the performance; and (2) we explored the usefulness of “part–of” and “is–a” relationships in the semantic similarity calculation and showed that taking into account “part–of” relationships does not help to improve the performance.

The evaluation of semantic-similarity measures is especially difficult. In previous publications, the evaluation method was based on how well semantic similarity correlated with sequence similarity or functional similarity. Such an evaluation method assumes that there exists a positive correlation between them, and that the higher the correlation, the better the semantic-similarity measure. In this research, we follow the same assumptions made in previous studies and test semantic similarity against functional similarity derived from Pfam annotations. Although it is probably true that there exists a positive correlation between them, the claim that higher correlations mean better semantic-similarity measures is still debatable. Further studies are needed to test the validity of these assumptions. Another pressing need is to develop a gold standard for comparing different semantic-similarity measures. To date, CESSM is probably the only publicly available tool that is close to a gold standard.

Acknowledgments

The authors would like to thank the XLDB Research Team of the University of Lisbon for providing an online tool for the evaluation of GO-based semantic similarity measures. In particular, we thank Catia Pesquita for all the kind support she provided during the use of their tool. This project was partially supported by NIH Grant Number P20 RR016471 from the INBRE Program of the National Center for Research Resources.

References

1. Ashburner M et al., Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium, Nature Genet 25(1):25–29, 2000.

2. Barrell D et al., The GOA database in 2009: An integrated Gene Ontology Annotation resource, Nucl Acids Res 37(Suppl 1):D396–D403, 2009.

3. Xu T, Du L, Zhou Y, Evaluation of GO-based functional similarity measures using S. cerevisiae protein interaction and expression profile data, BMC Bioinform 9(1):472, 2008.

4. Wang JZ et al., A new method to measure the semantic similarity of GO terms, Bioinform 23(10):1274–1281, 2007.

5. Sheehan B et al., A relation based measure of semantic similarity for gene ontology annotations, BMC Bioinform 9(1):468, 2008.

6. Nagar A, Al-Mubaid H, A new path length measure based on GO for gene similarity with evaluation using sgd pathways, in IEEE International Symposium on Computer-Based Medical Systems, 2008.

7. Du Z et al., G-sesame: Web tools for GO-term-based gene similarity analysis and knowledge discovery, Nucl Acids Res 37(Suppl 2):W345–W349, 2009.

8. Sevilla JL et al., Correlation between gene expression and GO semantic similarity, IEEE/ACM Transactions on Computational Biology and Bioinformatics 2(4):330–338, 2005.

9. Pesquita C et al., Metrics for GO based protein semantic similarity: A systematic evaluation, BMC Bioinform 9(Suppl):5, 2008.

10. Mistry M, Pavlidis P, Gene ontology term overlap as a measure of gene functional similarity, BMC Bioinform 9(1):327, 2008.

11. Lord PW et al., Investigating semantic similarity measures across the gene ontology: The relationship between sequence and annotation, Bioinformatics 19(10):1275–1283, 2003.

12. Couto FM, Silva MJ, Coutinho PM, Measuring semantic similarity between gene ontology terms, Data and Knowledge Engng 16(1):137–152, 2007.

13. Budanitsky A, Hirst G, Evaluating wordnet-based measures of lexical semantic relatedness, Comp Linguistics 32(1):13–47, 2006.

14. Pesquita C et al., Semantic similarity in biomedical ontologies, PLOS Comput Biol 5(7):e1000443, 2009.

15. Resnik P, Using information content to evaluate semantic similarity in a taxonomy, in International Joint Conference on Artificial Intelligence, 1995.

16. Jiang JJ, Conrath DW, Semantic similarity based on corpus statistics and lexical taxonomy, in International Conference Research on Computational Linguistics, 1997.

17. Lin D, An information-theoretic definition of similarity, in International Conference on Machine Learning, 1998.

18. Alvarez MA, Lim S, A graph modeling of semantic similarity between words, in International Conference on Semantics Computing, 2007.

19. Ruths T, Ruths D, Nakhleh L, GS2: An efficiently computable measure of GO-based similarity of gene sets, Bioinformatics 25(9):1178–1184, 2009.

20. Chabalier J, Mosser J, Burgun A, A transversal approach to predict gene product networks from ontology-based similarity, BMC Bioinform 8(1):235, 2007.

21. Porter MF, An algorithm for suffix stripping, Program 14(3):130–137, 1980.

22. Schlicker A et al., A new measure for functional similarity of gene products based on gene ontology, BMC Bioinform 7(302), 2006.

23. Finn RD et al., The pfam protein families database, Nucl Acids Res 36(Suppl 1):D281–D288, 2008.

Marco A. Alvarez received his B.Sc. in Computer Science from the Department of Computing and Statistics, Federal University of Mato Grosso do Sul, Campo Grande, Brazil, in 1997. He also received an M.Sc. in Computer Science from the Mathematical Sciences and Computing Institute, University of Sao Paulo, Sao Carlos, Brazil, in 1999. He worked as a Professor in the Computer Engineering Program, Dom Bosco Catholic University, Campo Grande-MS, Brazil, from 1999 to 2004. He is a Ph.D. candidate in Computer Science at the Department of Computer Science, Utah State University. His main research interests include machine learning, bioinformatics, computer vision, and higher education in computing.

Changhui Yan received his Ph.D. in Computer Science and Bioinformatics & Computational Biology from Iowa State University, Ames, IA, USA, in 2005. He is an Assistant Professor at the Department of Computer Science of North Dakota State University, Fargo, ND, USA.
