

Evaluating the Significance of Protein Functional

Similarity Based on Gene Ontology

BOGUMIL M. KONOPKA,1 TOMASZ GOLDA,1 and MALGORZATA KOTULSKA1

ABSTRACT

Gene ontology is among the most successful ontologies in the biomedical domain. It is used to describe, unambiguously, protein molecular functions, cellular localizations, and processes in which proteins participate. The hierarchical structure of gene ontology allows quantifying protein functional similarity by application of algorithms that calculate semantic similarities. The scores, however, are meaningless without a given context. Here, we propose how to evaluate the significance of protein function semantic similarity scores by comparing them to reference distributions calculated for randomly chosen proteins. In the study, thresholds for significant functional semantic similarity, in four representative annotation corpuses, were estimated. We also show that the score significance is influenced by the number and specificity of gene ontology terms that are annotated to compared proteins. While proteins with a greater number of terms tend to yield higher similarity scores, proteins with more specific terms produce lower scores. The estimated significance thresholds were validated using protein sequence–function and structure–function relationships. Taking into account the term number and term specificity improves the distinction between significant and insignificant semantic similarity comparisons.

Key words: gene ontology, protein function, semantic similarity.

1. INTRODUCTION

The rapid growth of reliable biological databases, which are freely available to the scientific community, is clear evidence of impressive progress in biomedical research, especially in high-throughput sequencing technologies. Optimistic estimates state that the human interactome will be solved within the next decade (Nebel, 2012). However, if the available experimental data are to be useful, they need careful annotation and further processing. Computer science has provided the ontology framework to support processing and analyzing large amounts of data. Ontologies allow for modeling and describing any domain of interest in a unified and standardized way (Gruber, 1995).

One of the most often used biomedical ontologies is the ontology of genes and gene products, the Gene Ontology (GO) (Ashburner et al., 2000). GO is used to describe three different aspects of proteins that are coded by genes. These aspects are molecular function, which standardizes the naming of molecular actions performed by proteins; biological process, which groups terms describing biological processes in which proteins participate; and cellular component, with terms that can be used for describing the location within the cell where the protein is active. Those three vocabularies form separate subontologies; each is a directed acyclic graph with terms interconnected mainly by is_a and part_of relations. The most general terms form the upper part of GO, and their specificity increases with increasing depth in the ontology graph.

1 Institute of Biomedical Engineering and Instrumentation, Wroclaw University of Technology, Wroclaw, Poland.

JOURNAL OF COMPUTATIONAL BIOLOGY, Volume 21, Number 11, 2014, Pp. 809–822. © Mary Ann Liebert, Inc. DOI: 10.1089/cmb.2014.0181

GO is widely used in analyzing and interpreting the outcome of high-throughput genomic or proteomic

experiments, for example, gene expression microarray experiments (Haugen et al., 2010; Gruca et al., 2011;

Warita et al., 2012). Huang et al. (2009) provided a comprehensive review of methods used for the

so-called GO enrichment analysis. GO annotations have also been used in studies that investigated protein

sequence, structure, and function relationships (Hvidsten et al., 2009; Pascual-Garcia et al., 2010). In

Konopka et al. (2012) we applied protein GO terms in a protein model quality assessment program.

The GO term hierarchy, which represents relations between terms, allows for calculating similarities

between term semantics. This supports automated inference based on annotated data and allows for

quantification of concepts that would be hard to quantify otherwise. Functional similarity of proteins is an

example of such a concept.

A number of approaches for calculating semantic similarity (SemSim) of protein annotations have been

proposed. They can be organized into algorithms based on term annotation frequency (Lord et al., 2003;

Schlicker et al., 2006), ontology structure (Pekar and Staab, 2002; Wu et al., 2005, 2006), and vector space

model (Chabalier et al., 2007; Benabderrahmane et al., 2010). There are also a number of hybrid approaches

(Pesquita et al., 2007; Wang et al., 2007; Othman et al., 2008). Pesquita et al. (2009) provided a

comprehensive review of those methods.

Similarity between proteins in the aspects covered by GO, that is, function, process, and location, is hard to quantify without the formal description supplied by the ontology. For instance, there are no experiments that directly assess the functional similarity of proteins. Thus, there is no gold standard for comparing and assessing algorithms that calculate SemSim. The resemblance of proteins in the GO-described aspects can be estimated indirectly from other protein features. Sequence similarity has been used extensively as a benchmark measure (Lord et al., 2003; Xu et al., 2008; Pesquita et al., 2008; del Pozo et al., 2008). It was established that sequence similarity is more strongly related to molecular function than to biological process and cellular component (Lord et al., 2003). On the other hand, biological process and cellular component similarities are more strongly related to protein–protein interactions (PPI) (Wu et al., 2006; Xu et al., 2008). The assumption that underlies the use of PPI in the assessment of SemSim is that proteins that contribute to the same biological process are more likely to interact. Several other sources of information have also been used to evaluate the performance of SemSim scores, for example, gene expression correlations (Xu et al., 2008), Pfam, or enzyme commission annotations (Alvarez et al., 2011). However, regardless of how well an algorithm correlates protein GO-based SemSim with actual biological similarities, a calculated similarity score is meaningless without a biological context.

In this work, we propose a method to evaluate the significance of SemSim scores. In the approach, the

significance of a SemSim score is estimated in relation to a reference distribution of semantic similarities

calculated for a large number of randomly chosen protein pairs. We evaluate the hypothesis that a

SemSim score equal to the 95% significance level is not only significant from the statistical point of view

but can also be treated as a biologically meaningful threshold.

2. METHODS

2.1. Data sets

Four data sets of protein GO term annotations were compiled for the study: a multispecies set of proteins deposited in the Protein Data Bank (PDB) (Berman et al., 2000), and three organism-specific sets: human (Homo sapiens), ecoli (Escherichia coli strain K12), and athaliana (Arabidopsis thaliana). Protein sequence redundancy was controlled. PDB, human, and athaliana annotation files were downloaded from ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/ (PDB as of March 5, 2012; human and athaliana as of April 17, 2012). The ecoli annotations were downloaded from www.geneontology.org/GO.downloads.annotations.shtml as of April 17, 2012. The human and athaliana annotation projects originally used protein UniProt IDs. PDB and ecoli protein IDs, which were not originally UniProt IDs, were mapped to unique UniProt IDs. The respective protein amino acid sequences were downloaded from uniprot.org, and sequence redundancy of the sets was reduced with CD-HIT (Li and Godzik, 2006). We followed a three-step procedure suggested by the authors of the application (Li, 2012). First, the redundancy was limited from 100% to 80%, then from 80% to 60%, and finally from 60% to 30%. In the analysis, we used the 100%, 60%, and 30% redundancy sets.

We focused only on protein function–structure relation; hence, only annotations of MF terms were

considered in the study. In terms of evidence codes all available annotations were used (including those

inferred from electronic annotations).

2.2. Protein GO annotation set parameters

Every protein in our data set was annotated with a set of GO terms. We investigated the influence of

two annotation set parameters on the value of SemSim. These parameters were (1) annotation set size

(#GO), the number of unique GO terms annotated to the protein, and (2) annotation set specificity (SPEC),

the GO hierarchy depth of the most specific term; specificity of a single GO term was its depth in the GO

hierarchy.
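As a minimal sketch of these two parameters, the snippet below computes #GO and SPEC for one protein on a toy hierarchy. The term names and the `PARENTS` mapping are hypothetical, and measuring depth as the longest path to the root (with the root at depth 0) is an assumption; the text does not state which depth convention was used.

```python
# Hypothetical toy hierarchy: term -> list of parent terms (not real GO identifiers).
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalytic_activity": ["root"],
    "protein_binding": ["binding"],
    "kinase_binding": ["protein_binding"],
}


def term_depth(term, parents=PARENTS):
    """Depth of a term, assumed here to be the longest path to the root (root = 0)."""
    direct_parents = parents[term]
    if not direct_parents:
        return 0
    return 1 + max(term_depth(p, parents) for p in direct_parents)


def annotation_parameters(terms, parents=PARENTS):
    """Return (#GO, SPEC) for the GO terms annotated to one protein."""
    unique_terms = set(terms)
    n_go = len(unique_terms)  # annotation set size: number of unique terms
    spec = max(term_depth(t, parents) for t in unique_terms)  # depth of the most specific term
    return n_go, spec


# Duplicate terms are counted once; the deepest term sets SPEC.
print(annotation_parameters(["binding", "kinase_binding", "binding"]))
```

With a real ontology, a term can have several parents, so the shortest and the longest path to the root may differ; whichever convention is chosen should be applied consistently to all proteins.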

2.3. Semantic similarity

Pairwise SemSim of GO terms was calculated with Wang’s algorithm (Wang et al., 2007), a GO graph-based approach. For each term $GO_i$ in the ontology, a semantic value $SV(GO_i)$ was calculated based on the semantic contributions $S_{GO_i}(t)$ of $GO_i$ and its ancestor terms:

$$SV(GO_i) = \sum_{t \in T_{GO_i}} S_{GO_i}(t), \tag{1}$$

where $T_{GO_i}$ is the set of ancestor terms of the term $GO_i$, together with $GO_i$ itself. The semantic contribution of a term $t$ to $GO_i$, $S_{GO_i}(t)$, is defined iteratively as

$$\begin{cases} S_{GO_i}(GO_i) = 1 \\ S_{GO_i}(t) = \max\{\, w_e \cdot S_{GO_i}(t') \mid t' \in \text{children of}(t) \,\}, & t \neq GO_i \end{cases} \tag{2}$$

where $w_e$ is the weight of the relation between terms. The weights suggested by Wang et al. (2007) are 0.8 and 0.6 for is_a and part_of, respectively.

SemSim of two terms was calculated as

$$Similarity_{Wang}(GO_i, GO_j) = \frac{\sum_{t \in T_{GO_i} \cap T_{GO_j}} \left( S_{GO_i}(t) + S_{GO_j}(t) \right)}{SV(GO_i) + SV(GO_j)}. \tag{3}$$

The SemSim of proteins is the similarity of the GO term sets annotated to the compared proteins. In Wang’s algorithm, the best-matching average approach is applied. First, for each term in one set, the best-matching term in the second set is identified:

$$Similarity_{Wang}(GO, GOset) = \max_{i = 1, \ldots, m} Similarity_{Wang}(GO, GO_i), \tag{4}$$

where $m$ is the number of terms in the GO set. Then, all best-matching pairwise similarities are averaged:

$$SemSim(GOset_A, GOset_B) = \frac{\sum_{i=1}^{m} Similarity_{Wang}(go_i, GOset_B) + \sum_{j=1}^{n} Similarity_{Wang}(go_j, GOset_A)}{m + n}. \tag{5}$$

Here $m$ and $n$ are the numbers of terms annotated to proteins A and B, $go_i$ runs over the terms of $GOset_A$, and $go_j$ runs over the terms of $GOset_B$. From now on in the text, we refer to $SemSim(GOset_A, GOset_B)$ as SemSim. This is the semantic similarity value that is analyzed in the study.
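Equations 1 to 5 can be traced on a small example. The following is a minimal sketch, not the implementation used in the study: the five-term ontology and all function names are hypothetical, and only the two relation weights (0.8 for is_a, 0.6 for part_of) come from the text.

```python
# Relation weights suggested by Wang et al. (2007), as quoted in the text.
WEIGHTS = {"is_a": 0.8, "part_of": 0.6}

# Hypothetical toy ontology: term -> list of (parent, relation) edges.
PARENTS = {
    "root": [],
    "a": [("root", "is_a")],
    "b": [("root", "is_a")],
    "c": [("a", "is_a")],
    "d": [("a", "part_of")],
}


def s_values(term, parents=PARENTS):
    """Semantic contributions S_term(t) for the term itself and all its ancestors (Eq. 2)."""
    s = {term: 1.0}
    stack = [term]
    while stack:
        child = stack.pop()
        for parent, relation in parents[child]:
            candidate = WEIGHTS[relation] * s[child]
            if candidate > s.get(parent, 0.0):  # keep the maximum over all paths
                s[parent] = candidate
                stack.append(parent)  # re-visit so the improvement propagates upward
    return s


def term_similarity(go_i, go_j, parents=PARENTS):
    """Pairwise term similarity (Eq. 3); SV is the sum of the S-values (Eq. 1)."""
    s_i, s_j = s_values(go_i, parents), s_values(go_j, parents)
    shared = set(s_i) & set(s_j)
    numerator = sum(s_i[t] + s_j[t] for t in shared)
    return numerator / (sum(s_i.values()) + sum(s_j.values()))


def best_match(term, go_set, parents=PARENTS):
    """Best-matching similarity of one term against a set of terms (Eq. 4)."""
    return max(term_similarity(term, other, parents) for other in go_set)


def semsim(set_a, set_b, parents=PARENTS):
    """Best-match average over both annotation sets (Eq. 5)."""
    total = sum(best_match(t, set_b, parents) for t in set_a)
    total += sum(best_match(t, set_a, parents) for t in set_b)
    return total / (len(set_a) + len(set_b))


print(round(term_similarity("c", "d"), 3))
print(round(semsim(["c", "b"], ["d"]), 3))
```

In this toy graph, terms "c" and "d" share the ancestors "a" and "root", so their similarity is driven entirely by how much of each term's semantic value sits in those shared ancestors. A term compared with itself scores 1, which is why identical annotation sets give the maximal SemSim.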

2.4. Score significance–reference distributions

To evaluate the significance of the SemSim of two proteins, we propose comparing the calculated SemSim to a reference distribution of SemSim scores acquired for a large set of protein pairs. We assume that two proteins are significantly similar in terms of SemSim if their similarity is greater than the threshold that corresponds to a p-value of 0.05 in the reference SemSim distribution. The 0.95 quantile is an estimator of this threshold. The threshold means that only 5% of all protein pairs from a test set produce a greater SemSim score.

In the study, we used two approaches to generating reference random distributions, which we call the

all-against-all random distribution (all-vs-all) and the subset-against-all random distribution (subset-vs-all). These

methods are described below.

2.5. The all-vs-all random distribution

In the all-vs-all approach, SemSim was calculated for 2500 randomly chosen protein pairs from a given set.

An exemplary distribution is presented in Figure 1b. A six-number summary of a distribution

was obtained. The summary included 0.05, 0.25, 0.5, 0.75, and 0.95 quantiles, and the mean (Fig. 1a).

From here on, we refer to those parameters as min, max, Q0.05, Q0.25, median, Q0.75, Q0.95,

and mean. The procedure was repeated 50 times for a data set, and then average values of all parameters

and their standard deviations were calculated. The all-vs-all approach was used to obtain general estimates

of SemSim significance thresholds.
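The all-vs-all procedure can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the study: the function names are hypothetical, the quantile uses simple linear interpolation, each pair is drawn as two distinct proteins, and a Jaccard overlap on made-up annotations stands in for the actual SemSim function. The standard deviations reported in the study are omitted for brevity.

```python
import random
from statistics import mean

QUANTILES = (0.05, 0.25, 0.5, 0.75, 0.95)


def quantile(sorted_values, q):
    """Empirical quantile with linear interpolation between order statistics."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def all_vs_all_summary(proteins, semsim, n_pairs=2500, n_repeats=50, seed=0):
    """Average the quantile and mean summary over repeated random-pair samples."""
    rng = random.Random(seed)
    summaries = []
    for _ in range(n_repeats):
        # One reference distribution: scores of n_pairs random pairs of distinct proteins.
        scores = sorted(semsim(*rng.sample(proteins, 2)) for _ in range(n_pairs))
        summary = {f"Q{q}": quantile(scores, q) for q in QUANTILES}
        summary["mean"] = mean(scores)
        summaries.append(summary)
    # Average every statistic over the repetitions.
    return {key: mean(s[key] for s in summaries) for key in summaries[0]}


# Made-up annotations and a stand-in similarity (Jaccard overlap), for illustration only.
toy_annotations = {f"P{i}": {i % 3, 10 + i % 5} for i in range(30)}


def toy_semsim(protein_a, protein_b):
    terms_a, terms_b = toy_annotations[protein_a], toy_annotations[protein_b]
    return len(terms_a & terms_b) / len(terms_a | terms_b)


summary = all_vs_all_summary(list(toy_annotations), toy_semsim, n_pairs=200, n_repeats=5)
print(summary["Q0.95"])  # estimate of the static significance threshold for the toy set
```

A pair is then called significantly similar when its score exceeds the averaged Q0.95 value.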

2.6. The subset-vs-all distribution

In the subset-vs-all approach, random proteins from a subset of proteins were compared against random

proteins from a more general protein set (e.g., subset—proteins with 3 annotations; general set—proteins

with any annotations). Again, 2500 pairs were compared and the resulting distribution was parametrized

with Q0.05, Q0.25, median, Q0.75, Q0.95, and mean (Fig. 1). The parameters were averaged over 50

repetitions of distribution generation. The subset-vs-all approach was used in the study to investigate the

influence of annotation parameters SPEC and #GO on the results of SemSim comparison.

2.7. Corrected SemSim significance thresholds

In each data set, proteins were grouped based on their annotation set #GO and SPEC. Significance

thresholds were estimated for each group of proteins separately, following the subset-vs-all protocol.

Calculated Q0.95 values were organized into a 2D table, where the significance threshold was a function of

#GO and SPEC parameters. If the number of proteins in a particular group was lower than 10, the threshold

estimated for this group was considered as not representative and was replaced by the static significance

threshold estimated for the respective data set (PDB, human, ecoli, and athaliana).
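The lookup with its fallback rule can be sketched as below. The function names and the dictionary layout are hypothetical. The static value 0.695 is the Q0.95 of the full PDB set from Table 1, and the (1, 7) entry echoes the example discussed in section 3.4; the remaining table entries and all group sizes are invented for illustration.

```python
MIN_GROUP_SIZE = 10  # groups with fewer proteins fall back to the static threshold


def corrected_threshold(n_go, spec, q95_table, group_sizes, static_threshold):
    """Q0.95 threshold for a (#GO, SPEC) group, or the static one for small groups."""
    key = (n_go, spec)
    if group_sizes.get(key, 0) < MIN_GROUP_SIZE:
        return static_threshold
    return q95_table[key]


def is_significant(score, n_go, spec, q95_table, group_sizes, static_threshold):
    """True when the SemSim score exceeds the group-specific threshold."""
    return score > corrected_threshold(n_go, spec, q95_table, group_sizes, static_threshold)


# Illustrative values only; see the lead-in for which numbers come from the text.
q95_table = {(1, 7): 0.40, (1, 3): 0.65, (4, 12): 0.55}
group_sizes = {(1, 7): 120, (1, 3): 300, (4, 12): 4}
STATIC_PDB = 0.695

print(is_significant(0.45, 1, 7, q95_table, group_sizes, STATIC_PDB))
print(is_significant(0.45, 1, 3, q95_table, group_sizes, STATIC_PDB))
```

The same score of 0.45 is judged differently in the two calls because the groups have different reference distributions, which is the point of the correction.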

FIG. 1. An exemplary distribution of semantic similarities calculated for randomly chosen proteins. The boxplot in (a) shows a six-number summary of the distribution in (b). Quantiles 0.05, 0.25, 0.5, 0.75, and 0.95 are marked by whiskers and the box; the mean was marked by a yellow diamond. The distribution shape (b) was fitted with an extreme value distribution model. [Plot: x axis, SemSim (0.0 to 1.0); y axis, density (0.0 to 2.5).]

2.8. Model fitting

Distributions generated in the all-vs-all and subset-vs-all experiments were fitted with model distributions. Out of several models tested, the best fit was provided by the generalized extreme value (GEV) distribution. We used the evd package in R (Stephenson, 2012) for fitting. A GEV model is defined by three parameters, that is, location, scale, and shape. As with the distribution statistics, those model parameters were averaged over 50 generated reference distributions. The goodness of fit was assessed using the Anderson–Darling goodness-of-fit test from the ADGofTest package in R.
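For reference, the three parameters enter the GEV model as follows. This is the standard textbook form of the distribution, written with location $\mu$, scale $\sigma$, and shape $\xi$ (symbols chosen here, not taken from the text); the quantile formula is given only to show how a threshold such as Q0.95 could be read from a fitted model, which the text does not say was done.

```latex
% GEV cumulative distribution function (shape \xi \neq 0, scale \sigma > 0)
G(x;\mu,\sigma,\xi) = \exp\left\{-\left[1+\xi\,\frac{x-\mu}{\sigma}\right]^{-1/\xi}\right\},
\qquad 1+\xi\,\frac{x-\mu}{\sigma} > 0 .

% Limiting case \xi \to 0 (Gumbel)
G(x;\mu,\sigma,0) = \exp\left\{-\exp\left(-\frac{x-\mu}{\sigma}\right)\right\} .

% Quantile for probability p, obtained by solving G(x_p) = p (\xi \neq 0)
x_p = \mu + \frac{\sigma}{\xi}\left[(-\ln p)^{-\xi} - 1\right],
\qquad \text{e.g. } p = 0.95 .
```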

2.9. Protein sequence–function relation

BLAST was used to run an all-against-all sequence comparison for proteins in the PDB annotation set.

Hits that yield an e-value below $10^{-4}$ were retained. For measuring protein sequence similarity, the relative

reciprocal BLAST score (RRBS) measure proposed by Pesquita et al. (2008) was used. It was defined as

$$RRBS(A, B) = \frac{BLAST_{bit\,score}(A, B) + BLAST_{bit\,score}(B, A)}{BLAST_{bit\,score}(A, A) + BLAST_{bit\,score}(B, B)}, \tag{6}$$

where A and B are compared sequences.
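Equation 6 translates directly into a small helper. The function and parameter names below are hypothetical, and the bit scores in the example call are invented; in practice they would come from the BLAST output for the two directions of the comparison and for each sequence against itself.

```python
def rrbs(bits_ab, bits_ba, bits_aa, bits_bb):
    """Relative reciprocal BLAST score computed from four bit scores (Eq. 6)."""
    return (bits_ab + bits_ba) / (bits_aa + bits_bb)


# Invented bit scores for one pair of sequences A and B.
print(rrbs(bits_ab=120.0, bits_ba=118.0, bits_aa=400.0, bits_bb=380.0))
```

Dividing by the self-hit scores normalizes for sequence length and composition, so a sequence compared with itself scores 1.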

2.10. Protein structure–function relation

The structure–function relation was examined on a set of representative SCOP database (Murzin et al.,

1995) nonredundant structures (30% sequence similarity cutoff, 5901 protein structures) and their structural

neighbors (SNs) identified with DALI (Holm and Rosenstrom, 2010). SNs of SCOP proteins along with

their structural similarity scores were retrieved from DALI database server (http://ekhidna.biocenter.helsinki.fi/dali/start

as of May 2012) (Holm and Rosenstrom, 2010). SemSim values between SCOP

proteins and their SNs were calculated as defined above.

Structural resemblance of proteins was measured with DALI Z-score. As reported by DALI authors, all

Z-score values greater than 2 are significant hits. The authors have also estimated an empirical Z-score

cutoff value for structural ‘‘strong match’’ hits, which is

$$Z_{cutoff} = \frac{n}{10} - 4, \tag{7}$$

where n is the length of a protein (Holm et al., 2008).
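The two Z-score criteria can be combined in a short helper. This is an illustrative sketch: the function names and the three category labels are hypothetical, while the significance limit of 2 and the length-dependent cutoff of Equation 7 are taken from the text.

```python
SIGNIFICANT_Z = 2.0  # DALI hits with a Z-score above this value are reported as significant


def strong_match_cutoff(length):
    """Empirical 'strong match' Z-score cutoff for a protein of the given length (Eq. 7)."""
    return length / 10 - 4


def classify_hit(z_score, length):
    """Label a structural neighbor by its DALI Z-score (labels are illustrative)."""
    if z_score <= SIGNIFICANT_Z:
        return "not significant"
    if z_score > strong_match_cutoff(length):
        return "strong match"
    return "significant"


print(strong_match_cutoff(200))
print(classify_hit(20.0, 200))
```

Because the cutoff grows with protein length, the same Z-score can be a strong match for a short protein and only a significant hit for a long one.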

3. RESULTS AND DISCUSSION

3.1. Estimating static significance thresholds

In order to assess the significance of SemSim scores, we generated and investigated reference distributions of SemSim between random proteins (see all-vs-all generation in Methods). Four annotation corpora were analyzed, that is, the PDB, human, ecoli, and athaliana sets. Each corpus was analyzed at three different levels of sequence redundancy: the redundant set, and sets with redundancy reduced to 60% and 30% (for the procedure, see Methods section).

Distributions acquired for all corpora share the same general shape (Fig. 2). The difference between

median and average values indicates that the distributions are asymmetric. This is also confirmed by the

differences between min–median and median–Q0.95 distances. The latter distance is significantly greater.

With the exception of the human set, we were able to fit the distributions with GEV model distributions (see

Methods). The p-values provided by the Anderson–Darling goodness-of-fit test ranged from 0.05 to 0.12.

The shapes of distributions are quite similar; however, for the PDB set, the distribution is significantly

shifted toward greater values (Fig. 2 and Table 1). In this set, the median and Q0.95 values are much higher

than the values acquired for other sets. We believe that the reason for this difference is the multiorganism

content of the PDB. The set contains proteins that play similar roles in different organisms, which increases

the overall similarity of proteins in the set.

There is no clear, consistent relation between parameters of distributions and sequence redundancy in the

sets. In ecoli, there are almost no differences in distribution parameters between different redundancy

cutoffs. In athaliana and human sets, the median and average values rise when redundancy decreases, while

in the PDB set, the median and average do not change significantly between GOA and GOA60, and in the

least redundant set, GOA30, they fall.

In general, the threshold of the significant similarity, quantile 0.95, falls in the range from 0.602 (human

GOA) to 0.752 (athaliana GOA30). In the PDB set, the value does not change with redundancy and it

equals approximately 0.7. In human and athaliana sets, Q0.95 moves toward higher values with decreasing

redundancy. The athaliana set turned out special, since in less redundant versions of the set, we observed

increasing ratios of proteins sharing exactly the same annotations. This resulted in a greater number of

protein pairs that acquired maximal SemSim (i.e., 1) when generating reference distributions. This led to

increased values of all the statistics calculated for the athaliana sets, including the Q0.95.

Table 1. Differences in Semantic Similarity Score Median and Q0.95 Values Between Different Gene Ontology Annotation Sets

Data set     Redundancy (%)   Median   Q0.95
PDB          100              0.360    0.695
PDB          60               0.362    0.700
PDB          30               0.348    0.705
Human        100              0.212    0.602
Human        60               0.262    0.651
Human        30               0.284    0.678
Ecoli        100              0.277    0.621
Ecoli        60               0.277    0.621
Ecoli        30               0.278    0.618
Athaliana    100              0.256    0.645
Athaliana    60               0.271    0.688
Athaliana    30               0.288    0.752

FIG. 2. The summary of SemSim distributions calculated for randomly chosen proteins. SemSim scores were investigated in four annotation corpora: one multiorganism set (the PDB [red]), and three single-organism sets, that is, human (green), ecoli (blue), and athaliana (magenta). Three different redundancy cutoffs were investigated: 100 (thick box), 60 (dash), and 30. The boxes show quantiles of distributions Q0.25, median, and Q0.75; whiskers show Q0.05 and Q0.95; average values are marked by squares. The parameters were averaged over 50 experiments. [Plot: x axis, data set redundancy (100, 60, 30) for PDB, HUMAN, ECOLI, and ATHALIANA; y axis, function similarity (0 to 1).]

The analysis of different organism data sets showed that the reference distributions of SemSim are very

similar in terms of the shape; however, they vary in terms of values of statistics. The differences in

parameter values may be partially explained by the fact that each annotation set comes from a different

annotation project. Proteins are annotated with different techniques, and scientists involved in each project

may be focused on annotation of different aspects or different types of proteins. We observe that the score

significance differs between annotation corpora; therefore, significance thresholds should be estimated

separately for every corpus prior to using SemSim on proteins of the set.

3.2. GO term specificity influence

Although SemSim scores for two different pairs of proteins may be the same, the actual meaning of the

comparisons may differ. For example, high similarity of proteins described with detailed GO terms, that is,

terms located at lower levels of the ontology graph, is more meaningful than the same similarity value

calculated between two proteins annotated with some generic terms. This is called the shallow annotation

problem (Wang et al., 2007). To test the influence of specificity of annotations on the significance

thresholds, we first grouped proteins by their specificity levels (see Methods for specificity definition).

Then, we generated reference SemSim distributions by calculating SemSim value between proteins with a

certain specificity and proteins sampled from the whole set (see Methods, subset-vs-all distribution).

In all analyzed corpora, we could observe a general trend that Q0.95 values decreased as the specificity

of annotations increased (Fig. 3). This confirmed the shallow annotation problem described by Wang et al.

(2007). Detailed terms, which are located deep down the GO graph, are semantically less similar to other

terms compared to more generic terms located in the middle or top part of the GO hierarchy. That is why

subset-vs-all distributions are shifted toward lower values for more specifically described proteins.

FIG. 3. The influence of the specificity of protein annotations on distributions of semantic similarity. Proteins with a given specificity were compared against all other proteins in their annotation corpora: (a) PDB, (b) ecoli, (c) human, and (d) athaliana. The general tendency is that distributions move toward lower values as the specificity of annotations increases. The boxplot whiskers mark the Q0.05 and Q0.95 quantiles. The box marks the Q0.25, Q0.5, and Q0.75 quantiles. The mean values are marked by diamonds. [Panels a to d: x axis, GO specificity (spec1 to spec13 or spec14); y axis, SemSim (0.0 to 1.0).]

Although the trend holds in general, some exceptions can be noticed. For instance, in the PDB set, the

Q0.95 at SPEC equal to 2 is significantly higher than in all remaining subsets (Fig. 3a). We found that it is

because many comparisons in this set yield the maximum value of 1. In this subset, a great number of

proteins are annotated with a single term, GO:0005515 ("protein binding"), which yields multiple perfect-match

scores.

3.3. GO term number influence

In order to evaluate the influence of the number of GO terms annotated to proteins (#GO) on their

semantic comparisons, we carried out two experiments. These experiments were run separately for each

corpus. First, the annotations were divided into subsets, based on #GO. Each subset, with the exception of

the last one, comprised proteins with exactly the same #GO. The last subset grouped all proteins with more

than 9 GO terms. Then, the reference distributions were generated for each subset with (1) the all-vs-all

procedure (see Methods) and (2) subset-vs-all (see Supplementary Material and Supplementary Fig. S1,

available online at www.liebertpub.com/cmb). Both experiments lead to consistent conclusions, but here

we present the results only for the all-vs-all approach. The full description of the subset-vs-all can be found

in the Supplementary Material.

In general, in all analyzed data sets, the semantic similarities showed positive correlation with #GO (Fig. 4).

A clear rising trend of median and mean values was observed. The exceptions were subsets of athaliana

with #GO > 8. In those subsets, the reference distributions did not follow the rising trend observed in PDB,

human, and ecoli annotation corpora. However, the numbers of proteins in those athaliana subsets were

low; therefore, the acquired results may not be representative.

The experiments showed that the number of GO terms that describe a protein has a significant influence on calculated semantic similarities. Proteins that have more terms tend to be more similar to all other proteins than proteins with only one or two terms. This may be because a greater number of terms annotating a protein raises the probability of finding well-matching terms among those annotating other proteins.

FIG. 4. The influence of the number of annotated GO terms on the SemSim distribution; the number of terms was controlled in both proteins of each compared pair. The study was performed in four annotation corpora: (a) PDB, (b) ecoli, (c) human, and (d) athaliana. Proteins with more GO terms tend to yield higher semantic similarities. GO, gene ontology. [Panels a to d: x axis, GO number (GOn1 to GOn9); y axis, SemSim (0.0 to 1.0).]

3.4. Specificity/GO term number combined influence

We investigated the joint influence of #GO and annotation specificity on the acquired reference

distributions in all corpora. Each annotation set was divided into bins containing proteins with the same

specificity level and #GO. Then, for each bin, reference distributions were generated using the subset-vs-all

procedure (see Methods).

Figure 5 shows the calculated Q0.95 values, which are the SemSim significance thresholds for given

subsets of proteins. For a fixed #GO, the Q0.95 quantiles (Fig. 5) and median values (Supplementary Fig.

S2) of reference distributions decrease as specificity increases. However, the larger the #GO, the weaker the

effect. The experiment again shows that the significance and the actual meaning of SemSim may be

different depending on annotation specificity and the number of GO terms annotated to proteins. For

instance, if a protein is described with a single GO term with a specificity of 7, then a comparison that

yields a SemSim score greater than 0.4 means that the similarity of the two proteins is significant and

nonrandom, because only 5% out of all proteins can produce a score greater than that (Fig. 5). Conversely,

the SemSim of 0.4 is not meaningful if the protein is described with a less specific GO term, for example, one with a specificity of 3.

In that case, there are plenty of proteins that can yield a similar SemSim score out of sheer luck. Similarly,

the SemSim of 0.4 is less significant if the specificity of the description remains unchanged but there are

more GO terms.

The most rapid change in median and Q0.95 values occurs between #GO 1 and 2. GO term number and

term specificity have an antagonistic influence on the SemSim of proteins. The value of Q0.95, estimated in

the last study, could be used to take into account these effects when evaluating the significance of SemSim

scores. Instead of using a static threshold to determine whether a similarity between a protein of interest and

another protein is significant, we can use a threshold that depends on the characteristics of the GO term

description of the protein of interest, that is, #GO terms and specificity, such as those presented in Figure 5.

[Figure 5: heat map. Horizontal axis: GO number (ticks GOn1 to GOn9). Vertical axis: GO specificity (Spec_1 to Spec_14). Color scale: 0.0 to 1.0.]

FIG. 5. Combined influence of protein annotation specificity and the number of GO terms on SemSim distributions in

the PDB set. The heat map shows the changes of the reference distribution Q0.95 values, depending on #GO and

specificity of protein annotation. Red dots mark subsets of proteins (of given specificity and set size) that were smaller

than 10; for those small subsets, the acquired parameter values might not be representative.
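A threshold that depends on #GO and specificity can be sketched as a lookup table. The Python code below is a minimal sketch under several assumptions: each protein is represented by a hypothetical record `(n_go, specificity, reference_scores)`, scores are pooled per bin, and bins with fewer than 10 proteins fall back to a static threshold. The fallback rule is our illustrative choice here, not a procedure stated in the text, and all function names and score values are hypothetical.

```python
from collections import defaultdict

MIN_BIN_SIZE = 10  # bins with fewer proteins are treated as unreliable (assumption)


def quantile(values, q):
    """Empirical quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] * (1 - frac) + ordered[upper] * frac


def build_threshold_table(records, q=0.95):
    """Map each (n_go, specificity) bin to (Q0.95 of pooled scores, number of proteins)."""
    pooled = defaultdict(list)
    counts = defaultdict(int)
    for n_go, specificity, scores in records:
        key = (n_go, specificity)
        pooled[key].extend(scores)
        counts[key] += 1
    return {key: (quantile(vals, q), counts[key]) for key, vals in pooled.items() if vals}


def dynamic_threshold(table, n_go, specificity, static_threshold):
    """Return the bin-specific threshold, or the static one for missing or small bins."""
    entry = table.get((n_go, specificity))
    if entry is None or entry[1] < MIN_BIN_SIZE:
        return static_threshold
    return entry[0]


# Hypothetical records: ten proteins in bin (#GO=1, specificity=7), one in (#GO=1, specificity=3).
reference_scores = [i / 50 for i in range(21)]  # made-up scores from 0.00 to 0.40
records = [(1, 7, reference_scores) for _ in range(10)]
records.append((1, 3, [0.2, 0.5, 0.9]))
table = build_threshold_table(records)

STATIC = 0.69  # static SemSim cutoff quoted for the PDB set in the Fig. 9 caption
for n_go, spec in [(1, 7), (1, 3)]:
    thr = dynamic_threshold(table, n_go, spec, STATIC)
    print(f"#GO={n_go}, specificity={spec}: threshold={thr:.2f}, SemSim 0.4 significant: {0.4 > thr}")
```

With real data, the table would be filled from the subset-vs-all reference distributions, and the same SemSim score could be significant in one bin and insignificant in another.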

3.5. Protein sequence–function relation

We examined how the statistical significance of SemSim reflects similarities in other aspects of proteins.

We used sequence and structure (section 3.6) similarities as benchmarks to validate the estimated

significance thresholds.

We ran an all-against-all BLAST search for protein sequences in all data sets. All BLAST hits with

e-values below 10^-4 were retained. For each protein pair, RRBS sequence similarity was calculated as

proposed by Pesquita et al. (2008) (for details, see Methods). Functional SemSim is strongly related to

sequence similarity, and it increases rapidly with increasing RRBS. Figure 6 presents the

sequence–function relation for proteins in the PDB set. Unexpectedly, even for low sequence similarity values, that is,

in the RRBS range (0, 0.1), the functional resemblance of proteins is quite high and the distribution

statistics of all box plots in that range are greater than those of the SemSim distribution acquired for random

proteins from the PDB set presented in section 3.1 (Fig. 2). Pascual-Garcia et al. (2010) reported that below

a certain level of sequence similarity, a “structure divergence explosion” might be observed. This could

happen because multiple mutations erase the evolutionary information carried by the sequence. In such cases,

sequence similarity measures are unable to distinguish between random and related proteins. Confusion

matrices in Figure 7 summarize the quantitative evaluation of the compliance between significant sequence

and function similarities in every annotation corpus.
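The sequence similarity step can be sketched as follows. This is a minimal Python sketch that assumes RRBS is the sum of the two reciprocal BLAST bit scores divided by the sum of the two self-hit bit scores; the Methods section gives the exact definition used in this study. The input layout (a dictionary of `(query, subject)` pairs), the treatment of a missing reciprocal hit as a bit score of 0, and all protein identifiers and scores are hypothetical.

```python
E_VALUE_CUTOFF = 1e-4  # hits with e-values below this value are retained


def rrbs(bits_ab, bits_ba, bits_aa, bits_bb):
    """Relative reciprocal BLAST score under the assumed definition."""
    denom = bits_aa + bits_bb
    if denom <= 0:
        raise ValueError("self-hit bit scores must be positive")
    return (bits_ab + bits_ba) / denom


def rrbs_for_pairs(hits):
    """hits maps (query, subject) -> (bit_score, e_value); returns RRBS per unordered pair."""
    kept = {pair: bits for pair, (bits, e_value) in hits.items() if e_value < E_VALUE_CUTOFF}
    scores = {}
    for a, b in kept:
        if a == b:
            continue  # self-hits only serve as normalization
        first, second = sorted((a, b))
        if (first, second) in scores:
            continue  # pair already handled from the reciprocal hit
        self_first = kept.get((first, first))
        self_second = kept.get((second, second))
        if self_first is None or self_second is None:
            continue  # cannot normalize without both self-hits
        scores[(first, second)] = rrbs(
            kept.get((first, second), 0.0),  # missing direction counted as 0 (assumption)
            kept.get((second, first), 0.0),
            self_first,
            self_second,
        )
    return scores


# Hypothetical BLAST results: (bit score, e-value).
hits = {
    ("P1", "P1"): (200.0, 1e-50),
    ("P2", "P2"): (100.0, 1e-30),
    ("P3", "P3"): (150.0, 1e-40),
    ("P1", "P2"): (60.0, 1e-10),
    ("P2", "P1"): (60.0, 1e-10),
    ("P1", "P3"): (20.0, 0.5),  # filtered out by the e-value cutoff
}
scores = rrbs_for_pairs(hits)
print(scores)
```

Under this definition, identical proteins score 1, and the value decreases toward 0 as the reciprocal hits become weaker relative to the self-hits.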

In all examined data sets in the region where RRBS is more reliable (≥ 0.1), the great majority of protein

pairs showed significant functional SemSim. The lowest rate observed was 84.1% (PDB set), which means

that 84.1% of protein pairs with significant sequence similarity also had significantly similar molecular

functions. However, in the RRBS “twilight” region (0, 0.1), the rate of compliance was much lower,

34.2% in the PDB set. This might be because BLAST comparisons were filtered with a stringent

e-value (e ≤ 10^-4), and although in our study some of them were treated as insignificant, they might in

fact be nonrandom. Still, the overall accuracies reported in the bottom right-hand corners of the matrices confirmed

the general compliance between sequence and function similarity.
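The compliance rates and overall accuracies can be computed from a simple two-by-two tally. The Python sketch below is illustrative only: it assumes each protein pair is given as a hypothetical tuple `(rrbs, semsim, semsim_threshold)`, so that a static threshold is the same value for every pair while a dynamic threshold may differ from pair to pair. Function names and example values are made up for illustration.

```python
from collections import Counter

RRBS_CUTOFF = 0.1  # sequence similarity counted as significant above this value


def confusion_matrix(pairs):
    """Tally pairs by (sequence significant, function significant)."""
    counts = Counter()
    for rrbs, semsim, semsim_threshold in pairs:
        seq_sig = rrbs > RRBS_CUTOFF
        fun_sig = semsim > semsim_threshold
        counts[(seq_sig, fun_sig)] += 1
    return counts


def summarize(counts):
    """Return (overall accuracy, share of sequence-significant pairs that are also function-significant)."""
    both = counts[(True, True)]
    seq_only = counts[(True, False)]
    fun_only = counts[(False, True)]
    neither = counts[(False, False)]
    total = both + seq_only + fun_only + neither
    accuracy = (both + neither) / total if total else float("nan")
    seq_total = both + seq_only
    compliance = both / seq_total if seq_total else float("nan")
    return accuracy, compliance


# Hypothetical pairs evaluated against a static SemSim threshold of 0.69.
pairs = [
    (0.50, 0.90, 0.69),
    (0.80, 0.95, 0.69),
    (0.30, 0.50, 0.69),
    (0.05, 0.80, 0.69),
    (0.02, 0.20, 0.69),
]
counts = confusion_matrix(pairs)
accuracy, compliance = summarize(counts)
print(f"accuracy={accuracy:.2f}, compliance={compliance:.2f}")
```

Replacing the third tuple element with a per-pair dynamic threshold allows the static and dynamic schemes to be compared on the same pairs.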

Although high sequence similarity is a strong rationale for assuming functional similarity, functional

similarity does not necessarily mean significant sequence similarity. The study confirmed that if two

proteins are significantly related, based on sequence similarity, their functional SemSim exceeds our

proposed thresholds.

[Figure 6: box plots. Horizontal axis: Sequence similarity (RRBS), 0.1 to 1. Vertical axis: Functional Similarity, 0 to 1. Dashed line: Q0.95.]

FIG. 6. Sequence–function relation in the PDB set investigated with function SemSim. A strict BLAST search

criterion was used to select protein pairs for comparison (e-value < 10^-4). Data points were binned into 0.01-wide bins

and then box-plotted. Boxes denote Q0.25, median, and Q0.75 quantiles; whiskers mark 0.05 and 0.95 quantiles; box

average values are marked by red squares connected with a line. The blue dashed line marks the significance threshold

value (Q0.95) inferred for functional SemSim in the PDB set.

The overall accuracies reported for dynamic SemSim thresholds were greater than those acquired for

static ones in all annotation corpora. It means that adjusting the significance threshold depending on the

#GO terms and specificity of terms is beneficial and may prevent missing some actual functional semantic

similarities. Unfortunately, only a minor improvement was observed, which may suggest that a better model

for estimating SemSim significance is needed. In this study, the dynamic SemSim thresholds were

estimated by taking into account #GO and GO term specificity of only one protein in a compared pair.

Developing a model that would include the information regarding GO annotations of both proteins might

further improve the relation. This will be the subject of our future studies.

3.6. Protein structure–function relation

We also validated our method of estimating significant functional similarities using the structure–function

relation. DALI (Holm and Rosenstrom, 2010) was used for calculations of structural similarities in

a representative set of protein structures (for details, see Methods). According to DALI authors, all

structural neighbors returned by a DALI Lite database search are significant hits (Z-score >2) (Holm and

Rosenstrom, 2010). In general, the calculated values of functional SemSim are in agreement with this

statement (Fig. 8); however, the SemSim boxplots for the least similar structural neighbors are very similar to boxplots

derived for randomly chosen proteins in the PDB set (Fig. 2). This fact suggests that either those hits are not

in fact significant or that structural similarity at such a low level does not result in functional resemblance.

FIG. 7. Confusion matrices summarize the quantitative evaluation of the compliance between significant sequence

and function similarities in every annotation corpus. Sequence similarity was counted as significant if it was greater

than an arbitrarily set RRBS threshold of 0.1. The significance of function similarity was estimated using static SemSim

thresholds proposed in section 3.1 and dynamic thresholds proposed in section 3.4. Numbers in green cells summarize

cases in which sequence–function similarities according to the thresholds were concordant, while numbers in red cells

summarize their divergence.

For the analyzed set of proteins, the overall strong-match Z-score cutoff equals 27.6 (for details, see

Methods). Based on that, a confusion matrix of structural–functional compliance was built. An

overwhelming majority of DALI strong matches had a SemSim over our estimated Q0.95 SemSim cutoff (Fig.

9), which strongly supports our approach to estimating SemSim significance.

[Figure 8: box plots. Horizontal axis: Structural Similarity (DALI), 10 to 70. Vertical axis: Functional Similarity, 0 to 1. Dashed line: Q0.95.]

FIG. 8. Structure–function relation investigated in the PDB set with the use of function SemSim. The plot shows functional

similarities between SCOP proteins and their structural neighbors identified with DALI. Data points were binned into 72

bins and box-plotted. Boxes denote Q0.25, median, and Q0.75 quantiles; whiskers mark 0.05 and 0.95 quantiles; box

average values are marked by red squares connected with a line. The blue dashed line marks the significance threshold

value (Q0.95) inferred for functional SemSim in the PDB set.

FIG. 9. Confusion matrix evaluates the agreement between significant structural and functional

similarity. DALI Z-score of 27.6 was used as the structural strong-match cutoff, while the SemSim

of 0.69 was used as a significant functional similarity cutoff.

4. CONCLUSIONS

In this study, we proposed a novel methodology to evaluate the significance of SemSim of protein

ontological descriptions. We applied it to protein molecular function GO terms and Wang’s semantic

similarity; however, the methodology can also be used with other GO subontologies and other measures of

semantic similarity. Our approach is based on the statistical analysis of semantic similarity reference

distributions. We proposed to use the 0.95 quantile as a threshold between significant and insignificant semantic

similarities: similarities greater than this threshold can be considered significant and nonrandom, since

only 5% of randomly chosen protein pairs could produce similarity scores equal to or greater than it. Four

representative annotation corpora were analyzed (PDB, human, ecoli, and athaliana). The significance thresholds

differ between annotation sets; therefore, such thresholds should be calculated separately for every set. We

also showed that the significance of SemSim may change, depending on the number and the level of detail of

GO terms used to annotate a protein. Proteins annotated with detailed GO terms tend to yield lower SemSim.

On the other hand, a greater number of GO terms annotated to proteins is usually associated with higher

SemSim. Based on sequence–function and structure–function relations, we have shown that taking into

account these two effects can improve the usefulness of SemSim measures.

AUTHORS’ CONTRIBUTIONS

B.M.K. proposed the concept of the study, prepared the data sets, carried out data analysis, and

participated in writing the article. B.M.K. and T.G. developed the software and ran the calculations. M.K.

participated in the design of the study, data analysis, and writing the article. All authors read and approved

the final article.

ACKNOWLEDGMENTS

B.M.K. would like to acknowledge the financial support from the “Mloda Kadra” Fellowship cofinanced by

the European Union within the European Social Fund. Part of the calculations was done at the Wroclaw Centre

for Networking and Supercomputing.

AUTHOR DISCLOSURE STATEMENT

The authors declare no competing interests.

REFERENCES

Alvarez, M.A., Qi, X., and Yan, C. 2011. A shortest-path graph kernel for estimating gene product semantic similarity.

J. Biomed. Semantics 2, 3.

Ashburner, M., Ball, C.A., Blake, J.A., et al. 2000. Gene ontology: tool for the unification of biology. Nat. Genet. 25,

25–29.

Benabderrahmane, S., Smail-Tabbone, M., Poch, O., et al. 2010. IntelliGO: a new vector-based semantic similarity

measure including annotation origin. BMC Bioinform. 11, 588.

Berman, H.M., Westbrook, J., Feng, Z., et al. 2000. The Protein Data Bank. Nucleic Acids Res. 28, 235–242.

Chabalier, J., Mosser, J., and Burgun, A. 2007. A transversal approach to predict gene product networks from ontology-

based similarity. BMC Bioinform. 8, 235.

del Pozo, A., Pazos, F., and Valencia, A. 2008. Defining functional distances over Gene Ontology. BMC Bioinform. 9, 50.

Gruber, T.R. 1995. Toward principles for the design of ontologies used for knowledge sharing. Int. J. Hum. Comput.

Stud. 43, 907–928.

Gruca, A., Sikora, M., and Polanski, A. 2011. RuleGO: a logical rules-based tool for description of gene groups by

means of Gene Ontology. Nucleic Acids Res. 39, W293–W301.

Haugen, A.C., Di Prospero, N.A., Parker, J.S., et al. 2010. Altered gene expression and DNA damage in peripheral

blood cells from Friedreich’s ataxia patients: cellular model of pathology. PLoS Genet. 6, e1000812.

Holm, L., and Rosenstrom, P. 2010. Dali server: conservation mapping in 3D. Nucleic Acids Res. 38, 545–549.

Holm, L., Kaariainen, S., Rosenstrom, P., et al. 2008. Searching protein structure databases with DaliLite v.3.

Bioinformatics 24, 2780–2781.

Huang da, W., Sherman, B.T., and Lempicki, R.A. 2009. Bioinformatics enrichment tools: paths toward the

comprehensive functional analysis of large gene lists. Nucleic Acids Res. 37, 1–13.

Hvidsten, T.R., Lægreid, A., Kryshtafovych, A., et al. 2009. A comprehensive analysis of the structure-function

relationship in proteins based on local structure similarity. PLoS ONE 4, e6266.

Konopka, B.M., Nebel, J.-C., and Kotulska, M. 2012. Quality assessment of protein model-structures based on

structural and functional similarities. BMC Bioinform. 13, 242.


Li, W. CD-HIT User’s Guide. Available at: http://weizhong-lab.ucsd.edu/cd-hit/wiki/doku.php?id=cd-hit_user_guide.

Accessed April 23, 2012.

Li, W., and Godzik, A. 2006. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide

sequences. Bioinformatics 22, 1658–1659.

Lord, P.W., Stevens, R.D., Brass, A., et al. 2003. Semantic similarity measures as tools for exploring the gene ontology.

Pac. Symp. Biocomput. 8, 601–612.

Murzin, A.G., Brenner, S.E., Hubbard, T., et al. 1995. SCOP: a structural classification of proteins database for the

investigation of sequences and structures. J. Mol. Biol. 247, 536–540.

Nebel, J.C. 2012. Proteomics and bioinformatics soon to resolve the human structural interactome. J. Proteomics

Bioinform. 5, xi–xii.

Othman, R.M., Deris, S., and Illias, R.M. 2008. A genetic similarity algorithm for searching the gene ontology terms

and annotating anonymous protein sequences. J. Biomed. Inform. 41, 65–81.

Pascual-Garcia, A., Abia, D., Mendez, R., et al. 2010. Quantifying the evolutionary divergence of protein structures: the

role of function change and function conservation. Proteins 78, 181–196.

Pekar, V., and Staab, S. 2002. Taxonomy learning: factoring the structure of a taxonomy into a semantic classification

decision. Proceedings of the Nineteenth Conference on Computational Linguistics.

Pesquita, C., Faria, D., Bastos, H., et al. 2007. Evaluating GO-based semantic similarity measures. ISMB/ECCB SIG

Meeting Program Materials. Available at: www.psb.ugent.be/cbd/cco/Bio-Ontologies2007.pdf

Pesquita, C., Faria, D., Bastos, H., et al. 2008. Metrics for GO based protein semantic similarity: a systematic

evaluation. BMC Bioinform. 9, S4.

Pesquita, C., Faria, D., Falcao, A.O., et al. 2009. Semantic similarity in biomedical ontologies. PLoS Comput. Biol. 5,

e1000443.

Schlicker, A., Domingues, F.S., Rahnenfuhrer, J., et al. 2006. A new measure for functional similarity of gene products

based on Gene Ontology. BMC Bioinform. 7, 302.

Stephenson, A. 2012. Functions for extreme value distributions. Available at: http://cran.r-project.org/web/packages/evd/

Wang, J.Z., Du, Z., Payattakool, R., et al. 2007. A new method to measure the semantic similarity of GO terms.

Bioinformatics 23, 1274–1281.

Warita, K., Mitsuhashi, T., Tabuchi, Y., et al. 2012. Microarray and Gene Ontology analyses reveal downregulation of

DNA repair and apoptotic pathways in diethylstilbestrol-exposed testicular Leydig cells. J. Toxicol. Sci. 37, 287–295.

Wu, H., Su, Z., Mao, F., et al. 2005. Prediction of functional modules based on comparative genome analysis and gene

ontology application. Nucleic Acids Res. 33, 2822–2837.

Wu, X., Zhu, L., Guo, J., et al. 2006. Prediction of yeast protein–protein interaction network: insights from the Gene

Ontology and annotations. Nucleic Acids Res. 34, 2137–2150.

Xu, T., Du, L., and Zhou, Y. 2008. Evaluation of GO-based functional similarity measures using S. cerevisiae protein

interaction and expression profile data. BMC Bioinform. 9, 472.

Address correspondence to:

Dr. Bogumil M. Konopka

Institute of Biomedical Engineering and Instrumentation

Wroclaw University of Technology

Wybrzeze Wyspianskiego 27

50 370 Wroclaw

Poland

E-mail: [email protected]
