NEW SEMANTIC SIMILARITY TECHNIQUES OF CONCEPTS APPLIED IN THE

BIOMEDICAL DOMAIN AND WORDNET

by

Hoa A. Nguyen, B.Eng.

THESIS

Presented to the Faculty of

The University of Houston Clear Lake

In Partial Fulfillment

of the Requirements

for the Degree

MASTER OF SCIENCE

THE UNIVERSITY OF HOUSTON-CLEAR LAKE

December 2006


NEW SEMANTIC SIMILARITY TECHNIQUES OF CONCEPTS APPLIED IN THE BIOMEDICAL DOMAIN AND WORDNET

by

Hoa A. Nguyen

APPROVED BY

__________________________________________ Hisham Al-Mubaid, Ph.D., Chair

__________________________________________ Said Bettayeb, Ph.D., Committee Member

__________________________________________ Gary D. Boetticher, Ph.D., Committee Member

__________________________________________ Robert Ferebee, Ph.D., Associate Dean

__________________________________________ Sadegh Davari, Ph.D., Dean


ACKNOWLEDGEMENTS

I would like to thank my thesis advisor, Dr. Hisham Al-Mubaid, for his incredible assistance, guidance, and great patience over the past two years. Dr. Al-Mubaid was very supportive and cooperative during all stages of this work; without his guidance and help, this work would not have been completed. I would also like to thank him for his great help and advice regarding my career, in selecting Ph.D. programs and in communicating with universities and professors.

I would also like to acknowledge and thank the other members of my thesis committee, Dr. Gary D. Boetticher and Dr. Said Bettayeb, for their assistance and guidance in this work.

Finally, I would like to thank the people who helped me obtain some of the datasets and tools needed for this work, in particular Dr. Mona T. Diab for providing the verb dataset used in some of the experiments for comparison and evaluation purposes.


ABSTRACT

NEW SEMANTIC SIMILARITY TECHNIQUES OF CONCEPTS APPLIED IN THE BIOMEDICAL DOMAIN AND WORDNET

Hoa A. Nguyen, M.S.
The University of Houston-Clear Lake, 2006

Thesis Chair: Hisham Al-Mubaid

Semantic similarity techniques compute the semantic similarity (the shared information) between two concepts according to language or domain resources such as ontologies, taxonomies, and corpora. They are important components of most information retrieval and knowledge-based systems. This thesis presents new ontology-based techniques for measuring the semantic similarity between concepts. The proposed measures rely on three features: (1) (cross-modified) path length, (2) the common specificity of concepts in the ontology, and (3) the local granularity of clusters. The new common specificity and granularity features are applied in computing the semantic similarity of concepts within a single ontology or across multiple ontologies. The key contribution is a novel cross-ontology approach for measuring the similarity of concepts dispersed across multiple ontologies in a unified framework. The proposed techniques were evaluated extensively in the biomedical domain and the general English domain. The experimental results demonstrated that the proposed similarity measures outperform comparable existing techniques.


TABLE OF CONTENTS

1. INTRODUCTION
2. BACKGROUND AND SIMILAR WORK
   2.1 WordNet
   2.2 UMLS
   2.3 Unified Framework
   2.4 MeSH
   2.5 MEDLINE
   2.6 SNOMED-CT
   2.7 Semantic Similarity, Semantic Distance, Relatedness, and Transformation
   2.8 Polysemy of Concept and Semantic Similarity of Concept Class
   2.9 Traditional Ontology-Based Semantic Measures
      2.9.1 Ontology-Structure-Based Measures
      2.9.2 Information-Based Measures
   2.10 Previous Work in the Biomedical Domain
3. A NEW ONTOLOGY-BASED SEMANTIC DISTANCE APPROACH
   3.1 Method of Semantic Similarity
   3.2 The New Feature: Common Specificity Feature
   3.3 Rules and Assumptions
   3.4 The Proposed Semantic Distance Approach
   3.5 Evaluation
      3.5.1 Dataset
      3.5.2 Experiments and Results
      3.5.3 Discussion
4. THE PROPOSED CLUSTER-BASED APPROACH
   4.1 The Need for a New Approach
   4.2 Local Granularity and Local Concept Specificity
   4.3 The Adapted Common Specificity Feature
   4.4 Rules and Assumptions
   4.5 The Proposed Cluster-Based Approach
      4.5.1 Single Cluster Similarity
      4.5.2 Cross-Cluster Semantic Similarity
   4.6 Evaluation
      4.6.1 Datasets
      4.6.2 Experiments and Results
      4.6.3 Discussion
5. USING MEDLINE AS STANDARD CORPUS FOR SEMANTIC SIMILARITY OF CONCEPTS IN THE BIOMEDICAL DOMAIN
   5.1 The Need for a Standard Corpus in the Biomedical Domain
   5.2 Semantic Similarity
   5.3 Evaluation
      5.3.1 Information Sources
      5.3.2 Dataset
      5.3.3 Experimental Results
   5.4 Discussion
6. THE PROPOSED COMBINATION-BASED (HYBRID) APPROACH
   6.1 Motivation
   6.2 Semantic Similarity Features
      6.2.1 Path Feature and Depth Feature
      6.2.2 The Adapted Common Specificity Feature
   6.3 The Combination-Based (Hybrid) Approach
   6.4 Evaluation
      6.4.1 Information Source
      6.4.2 Datasets
      6.4.3 Experimental Results
   6.5 Discussion
7. THE PROPOSED CROSS-CLUSTER APPROACH FOR SEMANTIC SIMILARITY OF CONCEPTS IN WORDNET
   7.1 The Need for a Cross-Cluster Semantic Approach for WordNet
   7.2 The Proposed Cross-Cluster Semantic Distance Approach
   7.4 Evaluation
      7.4.1 Information Source
      7.4.2 Evaluation Method and Dataset
      7.4.3 Experiments and Results
      7.4.5 Discussion
8. SEMANTIC SIMILARITY OF VERBS AND NOUNS IN WORDNET
   8.1 Motivation
   8.2 Information Source and Datasets
   8.3 Semantic Similarity in Verb Cluster
      8.3.1 Traditional Measures
      8.3.2 Hybrid Measure and Cross-Cluster Measure
   8.4 Semantic Similarity of Open-Class Words in WordNet
   8.5 Discussion
9. SEMANTIC SIMILARITY OF CONCEPTS IN A UNIFIED FRAMEWORK: THE PROPOSED CROSS-ONTOLOGY APPROACH
   9.1 The Need for Cross-Ontology Approach
   9.2 The Adapted Common Specificity Feature
   9.3 Local Ontology Granularity
   9.4 The Proposed Cross-Ontology Similarity Approach
      9.4.1 Rules and Assumptions
      9.4.2 Single Ontology Similarity
      9.4.3 Cross-Ontology Semantic Similarity
      9.4.4 Choosing the Secondary Ontologies
   9.5 Evaluation
      9.5.1 Testing Dataset
      9.5.2 Tools and Information Sources
      9.5.3 Experimental Results
   9.6 Discussion
10. DISCUSSION AND FUTURE WORK
   10.1 Directions
      10.1.1 Adapting Existing Ontology-based Measures for Cross-Ontology Similarity
      10.1.2 Semantic Similarity and Application in Information Retrieval
      10.1.3 The Need for Topic Similarity and a New Information Retrieval Model
   10.2 Discussion
   10.3 Conclusion
11. REFERENCES
Appendix A


LIST OF FIGURES

Figure 1. Overlap of concepts of ontologies in UMLS framework.
Figure 2. Graphical view of MeSH ontology by MeSH browser.
Figure 3. A fragment of MeSH.
Figure 4. A fragment of two clusters in ontology.
Figure 5. Results of correlations with human scores for four measures using SNOMED-CT.
Figure 6. Results of correlations with human scores for four measures using MeSH.
Figure 7. Illustration of the three information-based measures with human scores.
Figure 8. Fragment of ontology.
Figure 9. Two fragments from two ontologies.
Figure 10. Connecting two ontology fragments.
Figure 11. Two fragments of SNOMED-CT (left) and MeSH (right).


LIST OF TABLES

Table 1. Dataset 1: 30 medical term pairs sorted in the order of the averaged physicians' scores.
Table 2. Absolute correlation of the four measures relative to human ratings.
Table 3. Dataset 2: 36 medical term pairs with five similarity scores: Human, Path length (PATH), Wu and Palmer (WUP), Leacock and Chodorow (LCH), and the proposed measure (SemDist), using the MeSH ontology.
Table 4. Absolute correlations with human scores for all measures using SNOMED-CT on Dataset 1, Dataset 2, and Dataset 3.
Table 5. Absolute correlations with human scores for all measures using MeSH on Dataset 1, Dataset 2, and Dataset 3.
Table 6. The improvements that SemDist achieved over the average of the three other similar techniques using SNOMED-CT with three datasets.
Table 7. The improvements that SemDist achieved over the average of the three other similar techniques using MeSH with three datasets.
Table 8. Format of the MH_Freq_count file.
Table 9. Absolute correlations of information-based measures.
Table 10. Similarity features of 8 similarity measures.
Table 11. A subset of human mean ratings for the Rubenstein-Goodenough (RG) set.
Table 12. Training dataset: 19 medical term pairs of Dataset 2 found in WordNet.
Table 13. Results of absolute correlations of the proposed measure with human ratings using the training dataset with different parameter values.
Table 14. Absolute correlations with human ratings for the proposed measures using the RG dataset (65 pairs).
Table 15. Absolute correlations with RG human ratings using the SemCor and Brown corpora and WordNet 2.0 for four combination-based measures.
Table 16. Absolute correlations with human judgments for the proposed measures using the RG dataset.
Table 17. Absolute correlations with RG human ratings using two corpora and WordNet 2.0 for 3 information content-based measures.
Table 18. Absolute correlations with RG human ratings of ontology-based measures.
Table 19. Mean human ratings of the RD dataset of verb pairs.
Table 20. Absolute correlations with RD human ratings using SemCor and Brown.
Table 21. Absolute correlations with RD human ratings of ontology-structure-based measures.
Table 22. Absolute correlations with RG human ratings of seven measures.
Table 23. Similarity of verbs given by the Hybrid measure using WordNet 2.0 and SemCor with human ratings (HContext) in the Resnik and Diab (RD) dataset.
Table 24. Similarity of words in the RGRD (65 noun pairs + 27 verb pairs) dataset using ...
Table 25. Similarity of verbs in the RGRD (65 noun pairs + 27 verb pairs) dataset using the Cross-Cluster measure.
Table 26. Absolute correlation of the proposed approach on the RG dataset and WordNet 2.0.
Table 27. Absolute correlations of the proposed approach using WordNet and MeSH.
Table 28. Absolute correlations of the proposed approach using WordNet and SNOMED-CT.
Table 29. Biomedical Dataset 2 (36 pairs) and RG dataset (65 pairs, in italics) with human similarity scores (Human) and SemDist's scores using WordNet and MeSH.


LIST OF PUBLICATIONS

A major part of this thesis has been published in the following papers:

1. Nguyen, H.A. and Al-Mubaid, H. A New Ontology-based Semantic Similarity Measure for the Biomedical Domain. In Proc. IEEE International Conference on Granular Computing (GrC'06), GA, USA, May 2006.

2. Al-Mubaid, H. and Nguyen, H.A. A Cluster-Based Approach for Semantic Similarity in the Biomedical Domain. In Proc. the 28th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBS), New York, USA, September 2006.

3. Al-Mubaid, H. and Nguyen, H.A. Using MEDLINE as Standard Corpus for Measuring Semantic Similarity of Concepts in the Biomedical Domain. In Proc. the 2006 IEEE 6th Symposium on Bioinformatics & Bioengineering (BIBE-06), Washington D.C., USA, October 2006, pp. 315-319.

4. Nguyen, H.A. and Al-Mubaid, H. A Combination-based Semantic Similarity Approach Using Multiple Information Sources. In Proc. the 2006 IEEE International Conference on Information Reuse and Integration (IEEE IRI 2006), Hawaii, USA, September 2006.

5. Al-Mubaid, H. and Nguyen, H.A. A Cross-Cluster Approach for Measuring Semantic Similarity between Concepts. In Proc. the 2006 IEEE International Conference on Information Reuse and Integration (IEEE IRI 2006), Hawaii, USA, September 2006.

6. Al-Mubaid, H. and Nguyen, H.A. Semantic Distance of Concepts within a Unified Framework in the Biomedical Domain. In the 22nd Annual ACM Symposium on Applied Computing, Seoul, Korea, forthcoming March 2007.


1. INTRODUCTION

Ontology-based semantic similarity techniques, also called semantic similarity measures, estimate the semantic similarity between two hierarchically organized concepts in a given ontology or taxonomy. In a given ontology (e.g., WordNet or MeSH), each node contains a set of synonymous terms and is called a "concept node"; each concept node represents one sense of a concept. Two terms are synonymous if they belong to the same node in the ontology tree. Thus, an ontology or taxonomy is a hierarchical, tree-structured organization of the terms and concepts in a language or domain. Semantic similarity is the inverse of semantic distance: two concepts may belong to two different nodes in an ontology tree, and the distance between their nodes determines the similarity between the two concepts. The terms "semantic distance" and "semantic similarity" are therefore used interchangeably in this thesis, since the conversion from distance to similarity, or vice versa, is a direct operation (Section 2.7).

Semantic similarity techniques are becoming important components of most information retrieval (IR), information extraction (IE), and other intelligent knowledge-based systems. In IR, for example, semantic similarity measures play a crucial role in determining an optimal match between query terms and retrieved documents and in ranking the results. With the fast growth of biomedical databases such as PubMed [31], retrieving biomedical documents effectively is an increasingly important task in this domain. This thesis focuses on investigating and developing new approaches for measuring semantic similarity between terms and concepts using language or domain resources such as ontologies, taxonomies, and corpora. The proposed measures are applied in the biomedical domain and the general English domain.

The main contribution of this thesis is a new combination of semantic features and the five variations of semantic similarity measures proposed and evaluated in the course of this master's thesis research. The five variations are: (1) a new semantic similarity measure (Chapter 3); (2) a cluster-based semantic similarity measure for multiple clusters in a single ontology (Chapter 4); (3) a hybrid semantic similarity measure using multiple information sources (Chapter 6); (4) a cross-cluster semantic similarity measure using multiple information sources (Chapter 7); and (5) a cross-ontology semantic measure in a unified framework (Chapter 9). Besides these five techniques, the thesis includes two further investigations: (1) using MEDLINE as a standard corpus for measuring the semantic similarity of MeSH concepts (Chapter 5), and (2) measuring the semantic similarity of nouns and verbs in WordNet (Chapter 8). The final chapter (Chapter 10) discusses more advanced work as future directions of this thesis. Compared with existing techniques, the experimental results show that all of the proposed measures perform well and are very promising for computing the semantic similarity of concepts in the two domains. Some of the presented work (for example, in Chapters 5 and 9) introduces newly devised methods and techniques that lay the groundwork for further advances in these tasks.

The rest of this thesis is organized as follows:

Chapter 2: A review of background and related work, including ontologies in the general English and biomedical domains, definitions, ontology-based semantic similarity measures, and previous work in the biomedical domain.

Chapter 3: The first proposed semantic distance measure (NA), applied in the biomedical domain. It combines two semantic distance features in one measure so as to combine the strengths and complement the weaknesses of some existing ontology-based measures.

Chapter 4: An extension of the NA measure into the Cluster-Based approach by taking into account the granularity of clusters in the ontology.

Chapter 5: There is no standard text corpus to serve as a secondary information source for the information-based measures. Chapter 5 investigates using MEDLINE, the most comprehensive database in the biomedical domain, as a standard corpus for measuring the information content (IC) of MeSH concepts used in information-based measures.

Chapters 6-8: New semantic similarity techniques applied to the WordNet ontology in the general English domain, including methods for measuring the semantic similarity of nouns, of verbs, and of both nouns and verbs on one similarity scale.

Chapter 9: A novel cross-ontology semantic distance approach in a unified framework within the biomedical domain for measuring the semantic distance of concepts within a single ontology or across ontologies. This approach extends the Cluster-Based approach discussed in Chapter 4.

Chapter 10: A general and more comprehensive discussion of this thesis, together with thoughts and ideas about new directions for future work.

Appendix A: A framework, called MeSHSimPack, for measuring the semantic similarity of MeSH concepts with a number of ontology-based semantic similarity measures, including information-based measures.


2. BACKGROUND AND SIMILAR WORK

2.1 WordNet

WordNet [13] is a semantic lexicon for the English language developed at Princeton University, and it has been a valuable resource in human language technology and artificial intelligence for many years. English nouns, verbs, adjectives, and adverbs are organized into synonym sets (synsets), each representing one underlying lexical concept, and different relations link the synsets. The IS-A relations form IS-A taxonomies over the noun and verb synsets and are the majority of relations in WordNet. In the noun taxonomies these relations are hypernym/hyponym and holonym/meronym relations, while in the verb taxonomies they are hypernym/troponym relations. WordNet 2.0 contains nine noun taxonomies with an average depth of 13 and 554 verb taxonomies with an average depth of 2. For more detailed information about WordNet, please refer to [13].
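To make these IS-A taxonomies concrete, the short sketch below walks a noun synset up its hypernym chain and computes an edge-counting path length and the least common subsumer. It assumes the NLTK package with its bundled WordNet data, which is newer than the WordNet 2.0 release used in this thesis, so exact depths and distances may differ.

# Sketch: exploring WordNet IS-A (hypernym) taxonomies with NLTK.
# Assumes NLTK is installed and its WordNet data has been downloaded
# (nltk.download('wordnet')); this is a newer WordNet than version 2.0.
from nltk.corpus import wordnet as wn

car = wn.synset('car.n.01')          # a noun concept node (synset)
vehicle = wn.synset('vehicle.n.01')

# Walk up the IS-A (hypernym) chain from 'car' to the root.
node = car
while node.hypernyms():
    node = node.hypernyms()[0]
    print(node.name())

# Edge-counting shortest path and least common subsumer (LCS).
print(car.shortest_path_distance(vehicle))
print(car.lowest_common_hypernyms(vehicle))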

2.2 UMLS

The Unified Medical Language System (UMLS) project was started at the National Library of Medicine (NLM) in 1986 [31], with one of its objectives being to help interpret and understand medical meanings across systems. It consists of three main knowledge sources: the Metathesaurus, the Semantic Network, and the SPECIALIST Lexicon & Lexical Tools. The current version (2006AC) of the Metathesaurus contains more than 1.3 million concepts and 6.4 million unique concept names from more than 100 different source vocabularies and supports 17 languages. The Metathesaurus is built from the electronic versions of various thesauri, classifications, code sets, and lists of controlled terms used in patient care, health services billing, public health statistics, indexing and cataloging of biomedical literature, and basic, clinical, and health services research. These are referred to as the "source vocabularies" of the Metathesaurus. The controlled vocabularies or terminologies in these resources are expressed hierarchically, with the major relations between concepts being IS-A relations (actually broader/narrower-than relations); therefore, these sources are also called ontologies or taxonomies. Unlike ontologies and taxonomies in other domains, such as WordNet, the ontologies in the biomedical domain (UMLS) do not allow multiple inheritance. The ontologies in the UMLS Metathesaurus overlap in their sets of UMLS concepts, as shown in Figure 1. Each ontology is designed for specific purposes in the biomedical domain; for example, the MeSH thesaurus/ontology is built for cataloging, indexing, and searching the MEDLINE database [37].

Figure 1. Overlap of concepts of ontologies in UMLS framework.

2.3 Unified Framework

UMLS [26] and NCI [38] are instances of the unified framework defined in this thesis. The unified framework in this thesis adopts the characteristics of the UMLS framework and has the following main characteristics:

- The framework covers a set of concepts forming a controlled vocabulary in a specific domain, such as the biomedical domain.

- The framework includes (1) a Semantic Network, which represents the possible semantic relations between concepts in the controlled vocabulary, and (2) a Metathesaurus comprising many ontologies, each of which covers a subset of the concepts of the framework. The ontologies in the framework overlap in their concepts, as in Figure 1 (which shows the overlap among UMLS source ontologies such as SNOMED-CT, NCI, and MeSH). Each of them reflects the view of a specific community or is constructed for specific tasks in the domain.

- The concept classes, concepts, and terms have the characteristics discussed in the following section on the MeSH ontology, as an example of an ontology in a unified framework.

2.4 MeSH

MeSH, which stands for Medical Subject Headings [33,39], is one of the main source vocabularies (terminologies/concepts) used in UMLS. Its primary purpose is to support the indexing, cataloguing, and retrieval of the medical literature articles stored in the NLM MEDLINE database, and it includes about 16 high-level categories (taxonomies/subtrees), as shown in Figure 2.

Figure 2. Graphical view of MeSH ontology by MeSH browser.


The MeSH ontology database [32] has about 23K MeSH Descriptors. A Descriptor is often broader than a single concept and so may consist of a class of concepts; concepts, in turn, correspond to a class of terms which are synonymous with each other [32]. A Descriptor is thus a class of concepts whose meanings are close together. Each MeSH Descriptor has a preferred concept, which is the MeSH heading; a MeSH heading is therefore the representative concept of a concept class containing synonymous concepts. The following hierarchy shows the relations between a Descriptor and its concepts and terms:

- Concept Class (Descriptor)
  - Concept
    - Term

One concept class (Descriptor) can have many concepts that are close to each other in meaning, and each concept can have many synonymous terms. One term is chosen as the preferred term of a concept, and one concept is chosen as the preferred concept of the concept class; this preferred concept is the heading of the concept class (Descriptor). The same concept structure applies to all other ontologies in UMLS. The following example shows the concept structure of a MeSH heading:

- Achlorhydria (Heading / Descriptor)
  - Achlorhydria (preferred concept)
    - Achlorhydria (term)
    - Achylia Gastrica (term)
  - Hypochlorhydria (concept)
    - Hypochlorhydria (term)
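As a rough sketch of this Descriptor-Concept-Term structure, the Achlorhydria example above could be represented as follows; the class and field names here are illustrative only and are not the actual MeSH XML schema.

# Sketch of the Descriptor -> Concept -> Term structure described above.
# Class and field names are illustrative, not the MeSH XML schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Concept:
    terms: List[str]          # synonymous terms; terms[0] is the preferred term
    preferred: bool = False   # True for the Descriptor's preferred concept

@dataclass
class Descriptor:
    heading: str              # the MeSH heading (name of the preferred concept)
    concepts: List[Concept] = field(default_factory=list)

achlorhydria = Descriptor(
    heading="Achlorhydria",
    concepts=[
        Concept(terms=["Achlorhydria", "Achylia Gastrica"], preferred=True),
        Concept(terms=["Hypochlorhydria"]),
    ],
)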

2.5 MEDLINE

MEDLINE [38], the NLM's premier bibliographic database covering the fields of medicine, nursing, dentistry, veterinary medicine, the health care system, and the preclinical sciences, contains about 14 million research abstracts dating back to the 1950s from more than 4,800 biomedical journals published in the United States and 70 other countries. It is thus considered the main source of literature and textual data for bioinformatics research. Each record in MEDLINE is a cited article, which is assigned 10-15 MeSH terms (MeSH main headings) by indexers, with the major topics (MeSH major headings) indicated by an asterisk (*) [36]. Indexers typically use the most specific MeSH term available.

2.6 SNOMED-CT

SNOMED-CT, which stands for Systematized Nomenclature of Medicine Clinical Terms [35], was included in UMLS in May 2004 (2004AA). It is the result of a collaboration between the College of American Pathologists (CAP) and the United Kingdom's National Health Service, and it is a comprehensive clinical terminology covering diseases, clinical findings, and procedures, comprising concepts, terms, and relationships to represent clinical information. The current version contains more than 360,000 concepts, 975,000 synonyms, and 1,450,000 relationships organized into 18 hierarchies/subtrees/categories.

In this thesis, the term "ontology" describes the concepts of a given domain and the relationships between them, and it denotes any kind of IS-A or hierarchical tree in which concepts are organized hierarchically by IS-A relations (is-a-kind-of, is-a-part-of). Although the hierarchical relations in the biomedical domain within the UMLS framework are broader/narrower-than relations, they are treated as IS-A relations in semantic computing.

2.7 Semantic Similarity, Semantic Distance, Relatedness, and Transformation

While "semantic similarity" is concerned with likeness, "relatedness" seeks to determine the relation between two terms or concepts. For example, "car" and "driver" are related but not very similar, whereas "car" and "vehicle" are similar to some degree. Relatedness is thus more general than similarity. Furthermore, semantic distance is the inverse of semantic similarity: the smaller the distance between two concepts, the more similar they are.


To ensure that the conversion from semantic distance to semantic similarity does not change the absolute correlation value, the following transformation function is used:

Sim(C1, C2) = MaxDist − Dist(C1, C2)    (1)

where Dist is the semantic distance between the two concepts, MaxDist is the maximum distance between any two concepts, and Sim is the converted semantic similarity of the two concepts. In any case, absolute correlation is used in this thesis to evaluate the performance of the approaches.
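The following small numerical sketch of Equation 1 uses made-up scores (not data from this thesis) to show that converting distances to similarities with MaxDist − Dist flips the sign of the Pearson correlation with human ratings while leaving its absolute value unchanged.

# Toy illustration of Equation 1: Sim = MaxDist - Dist.
# The numbers below are invented for illustration only.
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

human = [4.0, 3.0, 2.5, 1.5, 1.0]      # human similarity ratings
dist = [1.0, 2.0, 3.0, 5.0, 6.0]       # semantic distances from some measure
sim = [max(dist) - d for d in dist]    # Equation 1

print(pearson(human, dist))            # negative correlation
print(pearson(human, sim))             # same magnitude, positive sign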

In this thesis, the term "semantic measure" denotes both semantic distance measures and semantic similarity measures. Likewise, the term "semantic similarity" is also used to refer to both semantic similarity and semantic distance.

2.8 Polysemy of Concept and Semantic Similarity of Concept Class

Because concepts are polysemous in natural language and in the biomedical domain, for clarity, in this thesis the term "concept node" refers to a particular "sense" of a concept class, and the term "concept" refers to a "concept class". A concept class in the MeSH ontology, for example, can have many senses (concept nodes) and appears at many positions (nodes) in the ontology. For instance, the MeSH heading "Achlorhydria" has two tree numbers in the XML database:

<TreeNumberList>
  <TreeNumber>C06.405.748.045</TreeNumber>
  <TreeNumber>C18.452.076.087</TreeNumber>
</TreeNumberList>

Since a heading (preferred concept) is the representative concept of a concept class (Descriptor), the similarity of two terms contained in two concept classes is the similarity of the two headings that represent those concept classes. Each concept class has a set of concept nodes, so there can be many node-to-node similarities between two concept classes. The similarity of the two headings is then taken as the maximum of these similarities. The similarity of two terms within the same concept class is the maximum similarity.
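A minimal sketch of this maximum-over-senses rule, using MeSH tree numbers as the concept nodes of each concept class. The path-length helper below estimates node-counting distance from the longest common prefix of two tree numbers in the same category tree; it is illustrative and not the implementation used in the thesis, and the second concept class uses a hypothetical tree number.

# Sketch: distance of two concept classes = the best (minimum) distance over
# all pairs of their concept nodes, i.e. the maximum similarity.
# Concept nodes are represented by MeSH tree numbers.

def path_length(tree1: str, tree2: str) -> int:
    """Node-counting path length between two MeSH tree numbers,
    via their longest common prefix (their LCS)."""
    a, b = tree1.split("."), tree2.split(".")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    # nodes above the LCS on each side, plus the LCS node itself
    return (len(a) - common) + (len(b) - common) + 1

def class_distance(nodes1, nodes2):
    return min(path_length(t1, t2) for t1 in nodes1 for t2 in nodes2)

achlorhydria = ["C06.405.748.045", "C18.452.076.087"]
other_class = ["C06.405.748.398"]    # hypothetical tree number
print(class_distance(achlorhydria, other_class))   # 3: best pair of senses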

2.9 Traditional Ontology-Based Semantic Measures

Ontology-based semantic similarity measures are those that use an ontology as the primary information source. They can be roughly grouped into two groups, as follows:

2.9.1 Ontology-Structure-Based Measures

Most semantic similarity measures based on the structure of an ontology rely on the (shortest) path length between two concept nodes and/or the depths of the concept nodes in the IS-A hierarchy tree. The primitive Path length measure was first developed and applied in the MeSH ontology; most later measures were developed and applied in WordNet. Some of the WordNet-based measures are Path length [22], Wu and Palmer [30], Leacock and Chodorow [10], and Li et al. [12]. These measures are discussed in more detail in Section 3.1.

2.9.2 Information-Based Measures

The information-based approaches are grounded in information theory and use a text corpus as a secondary information source beside the primary ontology. They all use the information content (IC) of concept nodes, derived from the IS-A relations and corpus statistics. Among the WordNet-based measures, some information-based measures are Resnik [23], Jiang and Conrath [9], and Lin [11]. These measures are discussed in more detail in Chapters 5-7.

2.10 Previous Work in the Biomedical Domain

Rada et al. [22] first proposed a semantic distance measure and applied it in the biomedical domain using the MeSH ontology. The semantic distance between two concept nodes is the shortest path length between them. This Path length measure is actually a simplified version of spreading activation theory [5, 21, 22]. Caviedes and Cimino (2004) [6] implemented the shortest Path length measure, called CDist, based on the shortest distance between two concept nodes in the ontology, and evaluated it on the MeSH, SNOMED, and ICD9 ontologies using correlation with human ratings. Another recent work on semantic similarity and relatedness in the biomedical domain was conducted by Pedersen, Pakhomov, and Patwardhan (2005) [20], who applied a corpus-based context vector approach to measure relatedness. Their context vector approach is ontology-free but requires training text, for which they used text data from a Mayo Clinic corpus of medical notes. Their method was evaluated using human judgments (a collected set of 30 medical term pairs annotated by 3 physicians and 9 experts) and compared with five other measures.


3. A NEW ONTOLOGY-BASED SEMANTIC DISTANCE APPROACH

3.1 Method of Semantic Similarity

The Path length measure is the primitive ontology-based measure: it computes the semantic distance between two concept nodes as the shortest path length between them in the ontology. Rada et al. [22] proposed this Path length measure as a potential measure for the biomedical domain. Consider, for example, the fragment of the MeSH ontology shown in Figure 3. This fragment is from the seventh category tree, "Biological Sciences", which is assigned the letter G in MeSH.

Figure 3. A fragment of MeSH (the root "Biological Science [G]" with children "Biological Sciences [G01]" and "Environment and Public Health [G03]"; G01 in turn has children "Biology [G01.273]" and "Biotechnology [G01.550]").

In this fragment (Figure 3), the path length between "Biological Sciences [G01]" and "Environment and Public Health [G03]" is 3 using node counting. The path length between "Biology [G01.273]" and "Biotechnology [G01.550]" is also 3. Thus, the Path length measure gives the same similarity in both cases. Intuitively, however, the similarity between "Biological Sciences [G01]" and "Environment and Public Health [G03]" is less than the similarity between "Biology [G01.273]" and "Biotechnology [G01.550]", as the latter two concepts lie at a lower level in the hierarchy tree and share more information. Measures based only on path length, such as Path length (DistPath) [22] and Leacock and Chodorow (LCH) [10], give the same similarity for these two pairs because path length is the only feature they use:

DistPath(C1, C2) = d(C1, C2)    (2)

Sim_LCH(C1, C2) = −log( d(C1, C2) / (2 × D) )    (3)

where d(C1, C2) is the shortest path length between the two concept nodes and D is the maximum depth of the ontology.
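To make the node-counting example concrete, the sketch below computes shortest path lengths on the Figure 3 fragment using a simple child-to-parent map; it reproduces the two path lengths of 3 discussed above.

# Sketch: node-counting path length on the Figure 3 MeSH fragment,
# encoded as a child -> parent map.
parent = {
    "G01": "G",         # Biological Sciences -> Biological Science
    "G03": "G",         # Environment and Public Health -> Biological Science
    "G01.273": "G01",   # Biology -> Biological Sciences
    "G01.550": "G01",   # Biotechnology -> Biological Sciences
}

def ancestors(node):
    """The path from a node up to the root, including the node itself."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def path_length(c1, c2):
    """Shortest path length between two nodes, counting nodes."""
    up1, up2 = ancestors(c1), ancestors(c2)
    lcs = next(n for n in up1 if n in up2)       # least common subsumer
    return up1.index(lcs) + up2.index(lcs) + 1   # count the LCS node once

print(path_length("G01", "G03"))          # 3
print(path_length("G01.273", "G01.550"))  # 3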

Therefore, the specificity of concepts should be taken into account by considering the depths of the concepts. The measure of Wu and Palmer [30], on the other hand, measures the semantic similarity of concepts by taking only the depths of the concept nodes into account. Wu and Palmer assumed that "within one conceptual domain, the similarity of two concepts is defined by how closely they are related in the hierarchy." They proposed the following measure:

Sim(C1, C2) = (2 × N3) / (N1 + N2 + 2 × N3)    (4)

where N3 is the depth of the least common subsumer of the two concept nodes, and N1 and N2 are the path lengths from each concept node to the LCS, respectively. (The least common subsumer, LCS(C1, C2), of two concept nodes C1 and C2 is the lowest node that can be a parent of both C1 and C2; for example, in Figure 3, LCS(G01.273, G01.550) = G01 and LCS(G01, G03) = G.) The Wu and Palmer formula can be rewritten as:

Sim(C1, C2) = (2 × Depth(LCS(C1, C2))) / (Depth(C1) + Depth(C2))    (5)
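A short sketch of Equation 5 on the same Figure 3 fragment; depth is counted in nodes from the root, which is one common convention and an assumption here, so absolute values may differ slightly from the thesis's counting.

# Sketch of the Wu & Palmer measure (Equation 5) on the Figure 3 fragment.
parent = {"G01": "G", "G03": "G", "G01.273": "G01", "G01.550": "G01"}

def ancestors(node):
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def depth(node):
    return len(ancestors(node))              # root "G" has depth 1

def wup(c1, c2):
    up1, up2 = ancestors(c1), ancestors(c2)
    lcs = next(n for n in up1 if n in up2)   # least common subsumer
    return 2 * depth(lcs) / (depth(c1) + depth(c2))

print(wup("G01.273", "G01.550"))  # ~0.667: siblings deep in the tree
print(wup("G01", "G03"))          # 0.500:  siblings near the root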


Li et al. [12] proposed a combination-based measure that combines several existing ontology-based semantic features. Each feature has parameters that determine its weight in the combination, and a training phase is used to find the optimal parameter values. The features employed in this measure are the path length, the depth of the LCS of the two concept nodes, and the IC of the LCS of the two concept nodes as a kind of local density. They combined these three features using ten different strategies, from linear to nonlinear approaches. Although many combinations of the three features were investigated, a nonlinear approach that combines only the shortest path length and the depth of the LCS obtained the highest correlation with the human ratings dataset; that is, the optimal strategy does not use the information content (corpus-based) feature. The optimal strategy of this measure is given by:

S(w1, w2) = e^(−a·l) × (e^(b·h) − e^(−b·h)) / (e^(b·h) + e^(−b·h))    (6)

where a ≥ 0 and b ≥ 0 are parameters that scale the contributions of the shortest path length l between the two concepts and the depth h of the LCS of the two concepts, respectively.

However, this measure has a limitation that violates some of the intuitions and assumptions of ontology-based similarity [11]: it gives different similarity values for different identical pairs.

Using the formula of Li et al., the similarity of the identical pair G01-G01, for which h = 1 and l = 0, is:

S(G01, G01) = (e^b − e^(−b)) / (e^b + e^(−b))

while the similarity of the identical pair G01.273-G01.273, for which h = 2 and l = 0, is:

S(G01.273, G01.273) = (e^(2b) − e^(−2b)) / (e^(2b) + e^(−2b))

Therefore, the similarity of the first pair differs from the similarity of the second pair.
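The sketch below evaluates Equation 6 for these two identical pairs with illustrative parameter values (a = 0.2, b = 0.6, values commonly cited for this measure but treated as assumptions here) to show numerically that the two self-similarities differ.

# Sketch: Li et al.'s measure (Equation 6) yields different values for
# different identical pairs. a=0.2 and b=0.6 are illustrative values.
import math

def li_similarity(l, h, a=0.2, b=0.6):
    """S = exp(-a*l) * (exp(b*h) - exp(-b*h)) / (exp(b*h) + exp(-b*h))."""
    num = math.exp(b * h) - math.exp(-b * h)
    den = math.exp(b * h) + math.exp(-b * h)
    return math.exp(-a * l) * num / den

print(li_similarity(l=0, h=1))  # identical pair G01-G01         -> ~0.537
print(li_similarity(l=0, h=2))  # identical pair G01.273-G01.273 -> ~0.834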

Moreover, the information content of the LCS is a kind of "weighted depth" of the LCS; in other words, it is the same kind of feature, since the depth (length) of a concept is calculated by summing all the links/nodes from the root to that node, while the weighted depth of a concept (the information content of that concept node) is calculated from all the weighted links from the root to that concept node [9, 23]. Furthermore, the local density (i.e., information content), link type, and link strength are also factors that affect semantic similarity [26]. The information content (IC) of a concept node is measured from corpus statistics; however, there is no standard corpus in the biomedical domain, and therefore only ontology-structure-based features are investigated and used in this first proposed measure.

As discussed above, both the path length (path) feature and the depth (depth) feature should be used in the measure. The LCS node determines what two concept nodes share in common. The measure of Li et al. takes the specificity of concept nodes into account by utilizing the depth of the LCS in the semantic computation; however, its combination of features violates some of the intuitive rules of ontology-based similarity discussed above. The measure of Wu and Palmer, in turn, takes only the depths of the concept nodes into account, either skipping the most important feature, path length (Equation 5), or leaving the contributions of the two features unweighted (Equation 4). Motivated by this, this thesis proposes approaches that complement and combine the strengths of some existing measures and integrate additional semantic features for more advanced computation. The following sections clarify the rules and assumptions for measuring the semantic distance/similarity of concepts in the ontology that the first proposed semantic distance measure must satisfy.

3.2 The New Feature: Common Specificity Feature

Besides the path length feature, which is an important feature taken into account in the measure, the proposed measure also uses the depth of the LCS of the two concepts as a measure of their specificity, to improve performance. The LCS node of two concepts C1 and C2 determines the "common specificity" of C1 and C2 in the ontology; therefore, the common specificity of two concepts is measured by finding the depth of their LCS node and scaling this depth against the depth D of the ontology as follows:

CSpec(C1, C2) = D − depth(LCS(C1, C2))    (7)

where D is the depth of the ontology. The CSpec(C1, C2) feature thus determines the common specificity of the two concepts in the ontology. The smaller the common specificity value of two concept nodes, the more information they share, and thus the more similar they are.

3.3 Rules and Assumptions

The two features discussed above are combined in the proposed measure using some intuitive rules and assumptions. The path length (shortest path length) is used in the usual way: the similarity of two concepts is higher when the distance between them is smaller. These intuitions are summarized in the following three rules to be satisfied:

Rule R1: The shorter the distance (path length) between two concept nodes in the ontology, the more similar they are.

Rule R2: Lower-level pairs of concept nodes are semantically closer (more similar) than higher-level pairs.

Rule R3: The maximum similarity is reached when the two concept nodes are the same node in the ontology.

Besides satisfying the above three rules, the proposed measure also satisfies the following two assumptions about semantic similarity:

Assumption A1: Logarithmic functions are the universal law of semantic distance. Exponential-decay functions are the universal law of stimulus generalization in the psychological sciences [28], so the logarithm (the inverse of exponentiation) is used for semantic distance.

This thesis further assumes that a non-linear combination is the optimal approach for combining semantic features: Rule R3 requires that when the two concept nodes are the same node, the similarity must reach its maximum regardless of the other features, so a non-linear combination of features should be used. Therefore, another assumption is needed:

Assumption A2: A non-linear function is the universal law for combining semantic similarity features.

3.4 The Proposed Semantic Distance Approach

As discussed above, there are two features to combine: the path length and the common specificity given by Equation 7. When the two concept nodes are the same node, the path length is 1 (using node counting), and by Rule R3 the semantic distance must then reach its minimum regardless of the CSpec feature (recall that semantic distance is the inverse of semantic similarity). Therefore, a product of the semantic distance features should be used to combine them. Applying Rules R1, R2, and R3 and the two assumptions, the proposed semantic distance measure is given as follows:

SemDist(C1, C2) = log( (Path − 1)^α × (CSpec)^β + k )    (8)

where α > 0 and β > 0 are the contribution factors of the two features; k is a constant; LCS is the least common subsumer of the two concepts; CSpec (i.e., CSpec(C1, C2)) is calculated using Equation 7; and Path is the length of the shortest path between the two concept nodes. To ensure that the distance is positive and the combination is non-linear, k must be greater than or equal to one (k ≥ 1); in this thesis, k = 1 is used in the experiments. When two concept nodes have a path length of 1 using node counting (i.e., they are the same node in the ontology), their semantic distance (SemDist) equals zero (i.e., maximum similarity) regardless of the common specificity feature.
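The following sketch puts Equations 7 and 8 together on the Figure 3 fragment, using k = 1 as stated above and illustrative weights α = β = 1; the thesis treats α and β as tunable contribution factors, so these weights are assumptions.

# Sketch of the proposed SemDist measure (Equations 7 and 8) on the
# Figure 3 fragment. alpha = beta = 1 and k = 1 are illustrative defaults.
import math

parent = {"G01": "G", "G03": "G", "G01.273": "G01", "G01.550": "G01"}
D = 3   # depth of this toy ontology fragment

def ancestors(node):
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def sem_dist(c1, c2, alpha=1.0, beta=1.0, k=1.0):
    up1, up2 = ancestors(c1), ancestors(c2)
    lcs = next(n for n in up1 if n in up2)            # least common subsumer
    path = up1.index(lcs) + up2.index(lcs) + 1        # node counting
    cspec = D - len(ancestors(lcs))                   # Equation 7
    return math.log((path - 1) ** alpha * cspec ** beta + k)   # Equation 8

print(sem_dist("G01", "G01"))          # 0.0:   same node, maximum similarity
print(sem_dist("G01.273", "G01.550"))  # ~1.10: deep siblings (closer)
print(sem_dist("G01", "G03"))          # ~1.61: siblings near the root (farther)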

3.5 Evaluation

3.5.1 Dataset

There is no standard human-rated set of concept/term pairs for semantic similarity in the biomedical domain. Thus, to evaluate the proposed approach, the dataset of 30 concept pairs from Pedersen et al. (2005) [20] (Dataset 1), which was annotated by 3 physicians and 9 medical index experts, was used. Each pair was annotated on a 4-point scale: practically synonymous, related, marginally related, and unrelated.

Table 1 contains all the pairs of this dataset. The average correlation between the physicians is 0.68, and between the experts it is 0.78. Because there are more experts than physicians, and the correlation (agreement) between the experts (0.78) is higher than that between the physicians (0.68), the experts' rating scores can be assumed to be more reliable than the physicians' rating scores.

Only 25 out of the 30 term pairs were found in MeSH using the MeSH browser, version 2006 [32]; since some terms could not be found, 25 pairs were used in the experiments (Pedersen et al. [20] tested 29 of the 30 concept pairs, as one pair was not found in SNOMED-CT). The term pairs in bold in Table 1 are those containing a term that was not found in MeSH, and they were excluded from the experiments.

Table 1. Dataset 1: 30 medical term pairs sorted in the order of the averaged physicians' scores.

Concept 1                               Concept 2                   Phys.    Expert
Renal failure                           Kidney failure              4.0000   4.0000
Heart                                   Myocardium                  3.3333   3.0000
Stroke                                  Infarct                     3.0000   2.7778
Abortion                                Miscarriage                 3.0000   3.3333
Delusion                                Schizophrenia               3.0000   2.2222
Congestive heart failure                Pulmonary edema             3.0000   1.4444
Metastasis                              Adenocarcinoma              2.6667   1.7778
Calcification                           Stenosis                    2.6667   2.0000
Diarrhea                                Stomach cramps              2.3333   1.3333
Mitral stenosis                         Atrial fibrillation         2.3333   1.3333
Chronic obstructive pulmonary disease   Lung infiltrates            2.3333   1.8889
Rheumatoid arthritis                    Lupus                       2.0000   1.1111
Brain tumor                             Intracranial hemorrhage     2.0000   1.3333
Carpal tunnel syndrome                  Osteoarthritis              2.0000   1.1111
Diabetes mellitus                       Hypertension                2.0000   1.0000
Acne                                    Syringe                     2.0000   1.0000
Antibiotic                              Allergy                     1.6667   1.2222
Cortisone                               Total knee replacement      1.6667   1.0000
Pulmonary embolus                       Myocardial infarction       1.6667   1.2222
Pulmonary Fibrosis                      Lung Cancer                 1.6667   1.4444
Cholangiocarcinoma                      Colonoscopy                 1.3333   1.0000
Lymphoid hyperplasia                    Laryngeal Cancer            1.3333   1.0000
Multiple Sclerosis                      Psychosis                   1.0000   1.0000
Appendicitis                            Osteoporosis                1.0000   1.0000
Rectal polyp                            Aorta                       1.0000   1.0000
Xerostomia                              Alcoholic cirrhosis         1.0000   1.0000
Peptic ulcer disease                    Myopia                      1.0000   1.0000
Depression                              Cellulitis                  1.0000   1.0000
Varicose vein                           Entire knee meniscus        1.0000   1.0000
Hyperlipidemia                          Metastasis                  1.0000   1.0000

3.5.2 Experiments and Results

In these experiments, only one testing dataset was used and there was no training dataset; therefore, the default parameters of the proposed measure were used to validate it, and the Li et al. measure was excluded from the evaluation because it needs a training phase to find its optimal parameters.


Table 2. Absolute correlation of the four measures relative to human ratings.

Measure                  Phys. (rank)   Expert (rank)   Both (rank)
SemDist                  0.666 (2)      0.862 (1)       0.836 (1)
Leacock and Chodorow     0.672 (1)      0.856 (2)       0.833 (2)
Wu and Palmer            0.652 (3)      0.794 (4)       0.778 (4)
Path length              0.631 (4)      0.742 (3)       0.734 (3)

Semantic distance/similarity values for the 25 pairs were calculated using the proposed measure and three other ontology-based semantic distance/similarity measures. All the measures use node counting for path lengths and for depths of concept nodes. For pairs in which a term belongs to more than one category tree, only its position(s) in the same category as the other term is (are) taken into account. Table 2 shows, for the four measures, the correlations with the human ratings of the physicians, the experts, and both combined, with the ranks in parentheses. These correlation values show that the proposed measure is ranked first relative to the experts' judgments and relative to both sets of judgments combined, and second relative to the physicians' judgments. As discussed, the experts' ratings are more reliable than the physicians', so overall the proposed measure performs very well and has great potential.

3.5.3 Discussion

The proposed semantic distance measure has been introduced and has shown its potential; however, the experiments also reveal some limitations. There is no training phase to find optimal parameters for the proposed measure. The dataset is small, and a small part of it cannot be found exactly in the MeSH ontology; those pairs are matched to closely related terms in MeSH. Moreover, the dataset was originally created for, and experimented with, the SNOMED-CT ontology, so most of its terms are found in SNOMED-CT. It should also be noted that this is a relatedness dataset, and semantic distance/similarity measures cannot capture relatedness, which lowers their apparent performance; it is neither fair nor logical to compare relatedness measures and semantic distance/similarity measures on a relatedness dataset, as in [20]. Given these limitations, the next chapter presents more advanced experiments and evaluations using one more semantic similarity dataset, and compares two UMLS ontologies on the semantic similarity of terms.


4. THE PROPOSED CLUSTER-BASED APPROACH

4.1 The Need for a New Approach

The semantic distance measure proposed above, called the NA measure, combines two semantic distance features and weights their contributions to similarity. It was designed to complement the weaknesses of some existing measures. However, one question stands out: is it sufficient for semantic computation over ontologies whose clusters have different granularity degrees? To answer that question, let us first investigate how the local granularity of a cluster affects semantic similarity.

4.2 Local Granularity and Local Concept Specificity

In this work, the term “cluster” is used to denote a subtree or category tree of ontology,

for example, the MeSH ontology has 16 category trees, as in Figure 2. The following example explains the effect of cluster granularity on local concept specificity. Let us

consider, for example, a fragment of ontology showing two clusters as in Figure 4. The

specificity of a concept c in cluster C is defined as follows:

spec(c) = depth(c) / depthC (9)


where depthC is the depth of cluster C, and spec(c) ∈ [0,1]. Note that spec(c) = 1 when the concept c is a leaf node in cluster C. Then, in Figure 4, the specificity of a3 and b3 is calculated as follows:

spec(a3) = 3/4 = 0.75

spec(b3) = 3/3 = 1.00

Thus, the specificity of b3 (1.00) is higher than that of a3 (0.75), even though their depths are equal: b3 is more specific within its cluster than a3 because it lies further down towards the bottom of its cluster. Therefore, the local granularity of clusters should be taken into account as a feature, one that most existing measures using the ontology structure (IS-A relations) as the primary information source do not consider.

4.3 The Adapted Common Specificity Feature

In the Cluster-Based approach, the common specificity feature of two concept nodes is

calculated within the cluster. The least common subsumer (LCS) node of two concept

nodes C1 and C2 determines the common specificity of C1 and C2 in the cluster. So the

common specificity of two concept nodes is calculated by finding the depth of their LCS

node and relating this depth to the depth D of the cluster as follows:

CSpec(C1,C2) = D − depth(LCS(C1,C2)) (10)

where D is the depth of the cluster. Thus, the CSpec(C1,C2) feature determines the

“common specificity” of two concepts in the cluster. The smaller the common specificity

value of two concept nodes, the more they share information, and thus the more they are

similar.
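The following minimal Python sketch illustrates the specificity and common specificity features (Equations 9 and 10) on a small hand-built cluster. The cluster shape and node names are assumptions made for illustration only (they mimic the style of Figure 4 but are not the thesis data or implementation); node counting is used, with the cluster root at depth 1.

# Sketch of Equations 9 and 10 on a toy cluster given as a child -> parent map.
parent = {"a2": "a1", "a3": "a2", "a4": "a2", "a5": "a1", "a6": "a3"}

def ancestors(c):
    chain = [c]
    while c in parent:
        c = parent[c]
        chain.append(c)
    return chain                      # c, parent(c), ..., cluster root

def depth(c):
    return len(ancestors(c))          # node counting: depth(cluster root) = 1

def lcs(c1, c2):
    up2 = set(ancestors(c2))
    return next(a for a in ancestors(c1) if a in up2)   # deepest shared ancestor

D = max(depth(c) for c in list(parent) + ["a1"])         # depth of the cluster

def spec(c):
    return depth(c) / D                                  # Equation 9

def cspec(c1, c2):
    return D - depth(lcs(c1, c2))                        # Equation 10

print(spec("a3"), cspec("a3", "a5"))   # e.g. 0.75 and 3 for this toy cluster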


4.4 Rules and Assumptions

Like in the NA measure, two features are taken into account: Path length feature and

Common specificity feature. However, in this Cluster-Based approach, the feature of

local granularity is utilized and integrated into the measure; therefore, the intuitive rules

in this case are slightly different from the rules of the above proposed NA measure as

follows:

Rule R3: The semantic similarity scale reflects the degree of similarity of pairs of concepts comparably, whether the concepts lie in one cluster or across clusters. This rule ensures that mapping cluster 1 onto cluster 2 does not deteriorate the similarity scale.

Rule R4: The semantic similarity must obey local cluster’s similarity rules as follows:

Rule R4.1 (R1): The shorter the distance (path length) between two concept nodes in

the ontology, the more they are similar.

Rule R4.2 (R2): Lower level pairs of concept nodes are semantically closer (more

similar) than higher level pairs.

Rule R4.3 (R3): The maximum similarity is reached when the two concept nodes are

the same node in the ontology.

Like the above proposed NA measure, the Cluster-Based measure also satisfies the two

above assumptions (A1 and A2).

4.5 The Proposed Cluster-Based Approach

4.5.1 Single Cluster Similarity

In single cluster, the local granularity of the cluster is not considered as there is only one

single cluster. Two features are combined: path length and the common specificity

(CSpect) given by Equation 10. When the two concept nodes are the same node then path

length will be 1 (using node counting), and so the semantic distance value must reach the

minimum regardless of CSpec feature by rule R4.3 (recall the semantic distance is the

inverse of semantic similarity). Therefore, product of semantic distance features for

Page 37: SEMANTIC SIMILARITY TECHNIQUES

25

combination of features should be used. By applying Rules R3, R4 and the two

assumptions, the proposed measure for a single cluster is:

SemDist(C1, C2) = log((Path − 1)^α × (CSpec)^β + k) (11)

where α > 0 and β > 0 are contribution factors of the two features; k is a constant; LCS is the least common subsumer of the two concept nodes; and Path is the length of the shortest path between the two concept nodes. To ensure the distance is positive and the combination is non-linear, k must be greater than or equal to one (k ≥ 1); k = 1 is used in the experiments. When two concept nodes have a path length of 1 (Path = 1) using node counting (i.e., they are the same node in the ontology), their semantic distance (SemDist) equals zero (i.e. maximum similarity) regardless of the common specificity feature.
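A minimal Python sketch of the single-cluster measure in Equation 11 follows; it assumes that Path and CSpec have already been obtained by node counting as above, and the numeric inputs are illustrative.

# Sketch of Equation 11. alpha, beta are the contribution factors and k the
# constant described in the text (k = 1 by default).
import math

def sem_dist(path, cspec, alpha=1.0, beta=1.0, k=1.0):
    return math.log((path - 1) ** alpha * cspec ** beta + k)

# Identical nodes: Path = 1 by node counting, so the distance is log(k) = 0
# for k = 1, i.e. maximum similarity, regardless of CSpec (rule R4.3).
print(sem_dist(path=1, cspec=3))
print(sem_dist(path=5, cspec=3))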

4.5.2 Cross-Cluster Semantic Similarity

In cross-cluster semantic similarity, to measure the semantic similarity between two

concept nodes (C1 and C2), there are four cases depending on the positions of the two

concept nodes within the clusters of the ontology. The cluster with the greatest depth is designated the main cluster (called the primary cluster), and the semantic features from all other clusters are scaled to this cluster's scale-level. All remaining clusters are secondary clusters. The four cases are as follows:

Case 1: Similarity within the Primary Cluster: If the two concept nodes occur in the

primary cluster, then the similarity in this case is the same as the similarity within a single cluster [Equation 11], discussed in section 4.5.1.

Case 2: Cross-Cluster Similarity: In this case, one of the two concept nodes belongs to the primary cluster while the other is in a secondary cluster, and the LCS of the two concept nodes is the global root node, which belongs to both clusters. This technique does not affect the scale of the CSpec feature of the primary cluster. The common specificity is then given as:

CSpec(C1, C2) = CSpec_primary = D_primary − 1 (12)

where Dprimary is the depth of the primary cluster. The root is the LCS of the two concept

nodes in this case. The path between the two concept nodes passes through two clusters

having different granularity degrees. The portion of the path length that belongs to the

secondary cluster is in scale of granularity different from that of the primary cluster, and

thus, it is needed to convert (is leveled) into primary cluster scale-level as follows.

Figure 4. A fragment of two clusters in ontology.

The Cross-Cluster Path Length Feature: The path length between two concept nodes (C1

and C2) is computed by adding up the two shortest path lengths from the two nodes to

their LCS node (their LCS is the root). For example, in Figure 4, for the two concept

nodes (a3, b3), the LCS is the root r. So, the path length between a3 and b3 is calculated

as follows:

Path(C1,C2) = d1 + d2 -1 (13)

such that: d1 = d(a3, root) and d2 = d(b3, root), where d(a3, root) is the path length from

the root r to node a3 ; and similarly d(b3, root) is the path length from r to b3. Notice that

the root node is counted twice, so one is subtracted in Equation 13. It is noticed here that



the densities or granularities of the two clusters are in different scales. Then, the portion

of the path length in the secondary cluster is scaled into the primary cluster’s scale-level.

The cluster containing a3 has higher depth, and then it’s the primary cluster, and the

cluster containing b3 is the secondary. The granularity rate of the primary cluster over the

secondary cluster for the common specificity feature is:

1D 1DCSpecRate

2

1

−−

= (14)

where (D1-1) and (D2 -1) are maximum common specificity values of the primary and

secondary clusters, respectively. The granularity rate, PathRate, of path length feature

for the primary cluster over the secondary cluster is given by:

PathRate = (2D1 − 1) / (2D2 − 1) (15)

where (2D1-1) and (2D2 -1) are maximum path length values of any two nodes in the

primary and secondary clusters, respectively. Following Rule R3, d2 in Equation 13 is

converted into the primary cluster as follows:

d'2 = PathRate × d2 (16)

This new path length d’2 reflects the path length of the second concept node to the LCS

relative to the primary cluster’s path length feature scale. Applying Equation 16, the path

length between two concept nodes in primary cluster scale is as follows:

Path(C1, C2) = d1 + PathRate × d2 − 1 (17)

Path(C1, C2) = d1 + ((2D1 − 1) / (2D2 − 1)) × d2 − 1 (18)

Finally, the semantic distance between two concept nodes is given as follows:


CSpec(C1, C2) = D_primary − 1 (19)

SemDist(C1, C2) = log((Path − 1)^α × (CSpec)^β + k) (20)

Case 3: Similarity within a Single Secondary Cluster: The third case is when the two

concept nodes are in a single secondary cluster. Then the semantic features, Path and CSpec, must be converted to the primary cluster's scales as follows:

Path(C1, C2) = Path(C1, C2)_secondary × PathRate (21)

CSpec(C1, C2) = CSpec(C1, C2)_secondary × CSpecRate (22)

SemDist(C1, C2) = log((Path − 1)^α × (CSpec)^β + k) (23)

where Path(C1, C2)_secondary and CSpec(C1, C2)_secondary are the Path and CSpec between C1 and C2 calculated in the secondary cluster; and PathRate and CSpecRate are computed by Equations 15 and 14, respectively.

Case 4: Similarity within Multiple Secondary Clusters: In this case, the two concept

nodes are in two secondary clusters Csi and Csj (i.e., none of them exists in the primary

cluster). Then, one of the two secondary clusters acts momentarily as a primary to

calculate the semantic features (viz. Path and CSpec) using Case-2 above. That is, the

semantic features, Path and CSpec, will be computed according to Case-2 by assuming

temporarily that Csi and Csj are primary and secondary clusters, although both are secondary, so as to scale and unify the CSpec and Path features between them. Then,

the semantic distance (SemDist) is computed using Case-3 to scale the features (again) to

the scale-level of the primary cluster (Cp).
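The following Python sketch walks through the cross-cluster scaling of Cases 2 and 3 (Equations 12 to 23). The cluster depths, node-to-root distances, and secondary-cluster feature values are illustrative assumptions, not values from the thesis experiments.

# Sketch of the cross-cluster Cluster-Based computation.
import math

def sem_dist(path, cspec, alpha=1.0, beta=1.0, k=1.0):
    return math.log((path - 1) ** alpha * cspec ** beta + k)   # Equations 11/20/23

D1, D2 = 12, 4                             # depths of primary and secondary clusters
cspec_rate = (D1 - 1) / (D2 - 1)           # Equation 14
path_rate = (2 * D1 - 1) / (2 * D2 - 1)    # Equation 15

# Case 2: one node in the primary cluster (d1 hops to the root) and one in the
# secondary cluster (d2 hops to the root); their LCS is the global root.
d1, d2 = 5, 3
path = d1 + path_rate * d2 - 1             # Equations 17/18
cspec = D1 - 1                             # Equations 12/19
print(sem_dist(path, cspec))

# Case 3: both nodes inside the secondary cluster; both features are scaled up
# to the primary cluster's scale-level.
path_secondary, cspec_secondary = 4, 2
print(sem_dist(path_secondary * path_rate,      # Equation 21
               cspec_secondary * cspec_rate))   # Equation 22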

4.6. Evaluation

For the experiments, the two ontologies MeSH and SNOMED-CT were used as information sources for the semantic measures, and two datasets were used for evaluation.


4.6.1 Datasets

The first dataset is Dataset 1, shown in Table 1. Another biomedical dataset was also used, containing 36 MeSH term pairs [8]; the human scores in this dataset are the averages of scores given by reliable doctors. The UMLSKS browser [34] was used for SNOMED-CT terms, and the MeSH Browser [39] for MeSH terms. Table 3 shows Dataset 2 along with the human scores and the scores of the four measures calculated using the MeSH ontology. The pairs with scores "*" are excluded from the experiments.

Table 3. Dataset 2: 36 medical term pairs with five similarity scores: Human, Path length (PATH), Wu and Palmer (WUP), Leacock and Chodorow (LCH), and proposed measure (SemDist); using MeSH ontology

Concept 1 | Concept 2 | Human | PATH | WUP | LCH | SemDist
Anemia | Appendicitis | 0.031 | 8 | 0.364 | 1.099 | 4.263
Meningitis | Tricuspid Atresia | 0.031 | 8 | 0.364 | 1.099 | 4.263
Sinusitis | Mental Retardation | 0.031 | 8 | 0.364 | 1.099 | 4.263
Dementia | Atopic Dermatitis | 0.062 | 9 | 0.333 | 0.981 | 4.394
Acquired Immunodeficiency Syndrome | Congenital Heart Defects | 0.062 | 7 | 0.400 | 1.232 | 4.111
Bacterial Pneumonia | Malaria | 0.156 | 8 | 0.364 | 1.099 | 4.263
Osteoporosis | Patent Ductus Arteriosus | 0.156 | 9 | 0.333 | 0.981 | 4.394
Amino Acid Sequence | Anti Bacterial Agents | 0.156 | 12 | 0.154 | 0.693 | 4.804
Otitis Media | Infantile Colic | 0.156 | 10 | 0.308 | 0.876 | 4.511
Hyperlipidemia | Hyperkalemia | 0.156 | 5 | 0.667 | 1.569 | 3.497
Neonatal Jaundice | Sepsis | 0.187 | 8 | 0.364 | 1.099 | 4.263
Asthma | Pneumonia | 0.375 | 4 | 0.727 | 1.792 | 3.219
Hypothyroidism | Hyperthyroidism | 0.406 | 3 | 0.800 | 2.079 | 2.833
Sarcoidosis | Tuberculosis | 0.406 | 11 | 0.286 | 0.78 | 4.615
Sickle Cell Anemia | Iron Deficiency Anemia | 0.437 | 6 | 0.667 | 1.386 | 3.584
Adenovirus | Rotavirus | 0.437 | 6 | 0.615 | 1.386 | 3.714
Lactose Intolerance | Irritable Bowel Syndrome | 0.468 | 6 | 0.667 | 1.386 | 3.584
Hypertension | Kidney Failure | 0.500 | 9 | 0.333 | 0.981 | 4.394
Diabetic Nephropathy | Diabetes Mellitus | 0.500 | 3 | 0.800 | 2.079 | 2.833
Pulmonary Valve Stenosis | Aortic Valve Stenosis | 0.531 | 3 | 0.833 | 2.079 | 2.708
Hepatitis B | Hepatitis C | 0.562 | 3 | 0.857 | 2.079 | 2.565
Vaccines | Immunity | * | * | * | * | *
Psychology | Cognitive Science | * | * | * | * | *
Failure to Thrive | Malnutrition | 0.625 | 8 | 0.364 | 1.099 | 4.263
Urinary Tract Infection | Pyelonephritis | 0.656 | 5 | 0.667 | 1.569 | 3.497
Migraine | Headache | 0.718 | 9 | 0.429 | 0.981 | 4.291
Myocardial Ischemia | Myocardial Infarction | 0.750 | 2 | 0.923 | 2.485 | 1.946
Carcinoma | Neoplasm | 0.750 | 4 | 0.667 | 1.792 | 3.332
Breast Feeding | Lactation | 0.843 | 1 | 1.000 | 3.178 | 0.000
Seizures | Convulsions | 0.843 | 1 | 1.000 | 3.178 | 0.000
Pain | Ache | 0.875 | 1 | 1.000 | 3.178 | 0.000
Malnutrition | Nutritional Deficiency | 0.875 | 1 | 1.000 | 3.178 | 0.000
Down Syndrome | Trisomy 21 | 0.875 | 1 | 1.000 | 3.178 | 0.000
Measles | Rubeola | 0.906 | 1 | 1.000 | 3.178 | 0.000
Antibiotics | Antibacterial Agents | 0.937 | 1 | 1.000 | 3.178 | 0.000
Chicken Pox | Varicella | 0.968 | 1 | 1.000 | 3.178 | 0.000


4.6.2 Experiments and Results

All the measures use node counting for path lengths and depths of concept nodes. As there is no training phase, the two features (Path and CSpec) are assumed to contribute equally to similarity; that is, the default parameters (α=1 and β=1) are used in all experiments. Out of the 30 pairs of Dataset 1, only 25 pairs were found in MeSH and 29 pairs in SNOMED-CT. For the four pairs that were not found in MeSH but were found in SNOMED-CT, the average distance/similarity values of the concept nodes most closely related to each of them were calculated, so 29 pairs were used for both MeSH and SNOMED-CT in total. Out of the 36 pairs of Dataset 2, 34 pairs were found in SNOMED-CT and all 36 pairs were found in MeSH, so the 34 pairs that exist in both ontologies were used in the experiments (the two pairs that were not found are marked "*" in Table 3). Furthermore, Dataset 1 and Dataset 2 were combined into one larger dataset (Dataset 3). The results of correlations with human scores on the three datasets, using the MeSH and SNOMED-CT ontologies, are shown in Tables 4 and 5 and Figures 5 and 6.

Table 4. Absolute correlations with human scores for all measures using SNOMED-CT on Dataset 1, Dataset 2, and Dataset 3

Measure | Dataset 1 (rank) | Dataset 2 (rank) | Dataset 3 (rank)
SemDist | 0.665 (1) | 0.735 (1) | 0.726 (1)
Leacock and Chodorow | 0.431 (2) | 0.677 (3) | 0.600 (2)
Wu and Palmer | 0.296 (3) | 0.686 (2) | 0.498 (3)
Path length | 0.254 (4) | 0.586 (4) | 0.422 (4)
Average | 0.412 | 0.671 | 0.562

Table 5. Absolute correlations with human scores for all measures using MeSH on Dataset 1, Dataset 2, and Dataset 3

Measure | Dataset 1 (rank) | Dataset 2 (rank) | Dataset 3 (rank)
SemDist | 0.863 (1) | 0.825 (1) | 0.841 (1)
Leacock and Chodorow | 0.857 (2) | 0.820 (2) | 0.836 (2)
Wu and Palmer | 0.794 (3) | 0.811 (3) | 0.808 (3)
Path length | 0.744 (4) | 0.765 (4) | 0.764 (4)
Average | 0.815 | 0.805 | 0.812


[Figure: correlation with human scores (y-axis) against Datasets 1-3 (x-axis) for SemDist (proposed), Leacock & Chodorow, Wu & Palmer, and Path Length.]

Figure 5. Results of correlations with human scores for four measures using SNOMED-CT.

[Figure: correlation with human scores (y-axis) against Datasets 1-3 (x-axis) for SemDist (proposed), Leacock & Chodorow, Wu & Palmer, and Path Length.]

Figure 6. Results of correlations with human scores for four measures using MeSH.

Table 6. The improvements that SemDist achieved over the average of the three other similar techniques using SNOMED-CT with three datasets

Correlations | Dataset 1 | Dataset 2 | Dataset 3
Average of the 3 similar measures | 0.327 | 0.650 | 0.507
SemDist | 0.665 | 0.735 | 0.726
Improvement | 103% | 13% | 43%


Table 7. The improvements that SemDist achieved over the average of the three other similar techniques using MeSH with three datasets

Correlations | Dataset 1 | Dataset 2 | Dataset 3
Average of the 3 similar measures | 0.798 | 0.799 | 0.803
SemDist | 0.863 | 0.825 | 0.841
Improvement | 8.1% | 3.3% | 4.8%

4.6.3 Discussion

Tables 4 and 5 show that the proposed Cluster-Based measure, SemDist, achieves the best

correlations with human similarity scores and ranks #1 with two ontologies and on three

datasets. These results confirm that SemDist is effective in computing semantic similarity and outperforms the three other measures in all six experiments. Leacock and

Chodorow measure achieves the second best correlations in five of the six experiments,

while Wu and Palmer measure gives the third best correlations in five of six experiments

and the second best correlation in one experiment. Path length measure achieves the

lowest correlations in all six experiments. These results seem realistic since Leacock and

Chodorow measure uses path length scaled by depth of ontology, and thus, outperforms

both Wu and Palmer measure, which uses only depths of concept nodes, and the Path

length measure. To be more specific, Leacock and Chodorow measure measures the

similarity by using the path length scaled by the maximum path length of two concept

nodes in the ontology, whereas, Wu and Palmer measure uses depth of LCS of the two

concept nodes scaled by the summation of the depths of the two concept nodes.

SemDist outperforms the other measures more significantly with SNOMED-CT than with MeSH because of the higher specificity of SNOMED-CT (with a depth of around 18) compared to MeSH (with a depth of around 12).

The average correlations of measures in Tables 4 and 5 and the improvements that

SemDist achieved over the average correlations are shown in Tables 6 and 7 for

SNOMED-CT and MeSH, respectively. From these results in Tables 6 and 7, we

observe that SemDist achieved an average improvement of 53% using SNOMED-CT,


while using MeSH, the average improvement is 5.4%. This suggests that SemDist is a good choice for ontologies with high specificity, where the new CSpec feature has a more positive impact on the correlation results. Even with MeSH, the 5.4% average improvement can be considered significant given the limited resources of human-scored datasets in this domain. Furthermore, Tables 4 and 5

show that all four measures perform better in MeSH than in SNOMED-CT.


5. USING MEDLINE AS STANDARD CORPUS FOR SEMANTIC SIMILARITY

OF CONCEPTS IN THE BIOMEDICAL DOMAIN

5.1 The Need for a Standard Corpus in the Biomedical Domain

After the work of Rada et al. [22], a number of ontology-structure-based measures

[10,12,30,17] that use IS-A relations of concepts in computation, and information-based

measures [9,11,23] that use both IS-A relations and corpus-based feature (information

content) have been proposed and applied using WordNet. Typically, the information-

based measures use standard corpora as secondary information sources to compute

similarity between two given terms. However, there is no standard corpus in the biomedical domain to serve as a secondary information source for information-based measures.

In this work, the feasibility of using MEDLINE as standard corpus and MeSH ontology

for measuring semantic similarity between biomedical concepts is investigated. Most of

the semantic similarity work in the biomedical domain uses only IS-A relations in

ontology (e.g. MeSH, SNOMED-CT) for computing the similarity between the

biomedical terms. In this work, however, information-based semantic measures are used

that use biomedical text corpus in computing the similarity between terms.


5.2 Semantic Similarity

The first information-based semantic similarity approach was introduced by Resnik [23], in which the similarity of two concepts is the maximum information content among the concepts that subsume them in the taxonomy hierarchy [Equation 24]. The

information content of a concept depends on the probability of encountering an instance

of that concept in a corpus, and the information content is calculated as the negative log likelihood of the probability [Equation 28]. That is, the probability of a concept is

determined by the frequency of occurrence of the concept and its subconcepts in the

corpus [Equation 27]. As the information-based measures use corpus statistics, these

similarity measures can be adapted well to particular applications using suitable corpora.

For more information about the pure information-based approach, please refer to Resnik's work [23]. Following Resnik's work, some information-based measures were introduced

to improve the performance of pure information-based approach by considering the

weight/strength of edges/links between concept nodes in ontology. The links between

ontology nodes are not equal in term of strength/weight, and link strength can be

determined by local density, information content, and link type [9,26]. The measure of Jiang and Conrath [9] determines the similarity of two concept nodes by calculating the "weighted path" between them, summing up all weighted links on the path between them [Equation 25]. The measure of Lin [Equation 26] is similar to the measure of Wu and Palmer [Equation 5]; however, Lin's measure uses the information content of concept nodes instead of their depths. In fact, the depth is replaced by the "weighted depth". The following are the formulas of the Resnik, Jiang and Conrath, and Lin measures. They

all use information content (IC) of individual concept nodes C1 and C2 or/and LCS (least

common subsumer) of C1 and C2:

1) Resnik

Sim(C1, C2) = IC(LCS(C1, C2)) (24)

2) Jiang and Conrath

Sim(C1, C2) = IC(C1) + IC(C2) − 2 × IC(LCS(C1, C2)) (25)


3) Lin

Sim(C1, C2) = 2 × IC(LCS(C1, C2)) / (IC(C1) + IC(C2)) (26)

Table 8. Format of MH_Freq_count file

MeSH Heading | Frequency as main heading (MH) | Frequency as major heading (MJ)
Pressure | 41324 | 2637
Hydrolysis | 41318 | 35
Haplorhini | 41256 | 3311
Colonic Neoplasms | 41207 | 1619
Energy Metabolism | 41203 | 10902
Hela Cells | 41007 | 409
Heart Diseases | 40984 | 4385
Brain Chemistry | 40972 | 12420
Uterine Cervical Neoplasms | 40969 | 3133
Thrombosis | 40929 | 3562

5.3 Evaluation

5.3.1 Information Sources

In order to evaluate these semantic measures in the biomedical domain, a biomedical

ontology, a biomedical text corpus, and a test dataset of biomedical terms pairs are

needed, with each term pair scored for similarity by human domain experts. Then, for each pair, a similarity score was computed by each of the three methods (Equations 24, 25, 26), and the correlation was computed between the similarity scores and the human scores. The MeSH ontology, one of the core ontologies in UMLS, was used to obtain the hierarchical relations of concepts, and MEDLINE was

used as text corpus to get occurrence frequencies of concepts. The frequencies of MeSH

concepts in MEDLINE are stored in files (available from US National Library of

Medicine NLM at http://mbr.nlm.nih.gov/Download/index.shtml#Freq). For each MeSH

heading, there are two types of frequency:

MH: frequency of that heading as a main heading in MEDLINE corpus.

MJ: frequency of that concept as a major heading in MEDLINE corpus.


Both types of frequencies are used in the experiments. The MH_freq_count file contains

frequencies of all MeSH headings. The format of this file is shown in Table 8. Each row

shows one MeSH heading in the 1st column, its frequency as main heading (MH), and its

frequency as major heading (MJ) in MEDLINE.

The information content technique in the biomedical domain is slightly different from the original technique of Resnik [18] in the way frequencies of MeSH headings are counted in MEDLINE: each MeSH heading occurring in a document is counted only once for that document.

The concept probability of a concept (MeSH heading) c is computed as follows:

p(c) = frq(c) / N (27)

where frq(c) is the frequency of concept c, obtained by summing the frequencies of c and all of its subconcepts in the corpus, and N is the total frequency of all concepts. The information

content (IC) of a concept c is then given by:

IC(c) = - log p(c) (28)
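The following minimal Python sketch illustrates Equations 27 and 28: turning raw MeSH-heading counts (MH or MJ frequencies, as in Table 8) into concept probabilities and information content. The tiny hierarchy and the count values are illustrative assumptions, not MEDLINE data.

# Sketch of Equations 27-28 with frequency propagation up a toy hierarchy.
import math

parent = {"Heart Diseases": "Cardiovascular Diseases",
          "Thrombosis": "Cardiovascular Diseases"}
raw_freq = {"Cardiovascular Diseases": 100, "Heart Diseases": 40984, "Thrombosis": 40929}

def frq(concept):
    # frequency of the concept plus all of its subconcepts (Equation 27)
    return raw_freq.get(concept, 0) + sum(frq(c) for c, p in parent.items() if p == concept)

N = sum(raw_freq.values())            # total frequency mass of all concepts

def ic(concept):
    p = frq(concept) / N              # Equation 27
    return -math.log(p)               # Equation 28

print(ic("Heart Diseases"), ic("Cardiovascular Diseases"))   # the root gets IC = 0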

5.3.2 Dataset

Dataset 2 containing 36 MeSH term pairs was used in experiments as a strictly semantic

similarity dataset.

Table 9. Absolute correlations of information-based measures

Measure | MeSH Main Heading (MH) | MeSH Major Heading (MJ)
Resnik | 0.731 | 0.731
Lin | 0.781 | 0.786
Jiang and Conrath | 0.808 | 0.820
Average | 0.773 | 0.779


[Figure: correlation with human scores for Resnik, Lin, Jiang & Conrath, and their average, using MH and MJ frequencies.]

Figure 7. Illustration of the three information-based measures with human scores.

The similarity scores for this dataset were computed by the three information-based measures using the MH and MJ frequencies to calculate the information content of each concept.

5.3.3 Experimental Results

Two kinds of frequencies (MH and MJ) were used to calculate IC of concepts. Table 9

contains the results of correlation with human scores for the three measures, with IC calculated according to the two types of frequencies (viz. MH and MJ), and Figure 7

contains illustrations of these results. The results in Table 9 show that all measures

perform very well having fairly high correlations with human ratings using both kinds of

frequencies/ICs. It is noticed that the measure of Jiang and Conrath achieves the highest correlation with human scores, while the measure of Resnik gives the lowest correlations, although the differences among the three methods are not very large. One of the

reasons for the lower correlations of Resnik's measure compared to the other two measures is that it is based only on the IC of the LCS of the two concepts [Equation 24], ignoring the specificity of the individual concepts, whereas the other two measures are based on a combination of three ICs, namely the IC of concept 1, the IC of concept 2, and the IC of their LCS (Equations 25 and 26). The average correlations of all measures using the MJ and MH frequencies are very close (Table 9, Figure 7). Each measure produces

very close correlations using MH and MJ which indicates that, in general, term usage and

frequency distributions in MEDLINE as MH and MJ are fairly consistent. Thus, these

results demonstrate that MEDLINE can provide a very good insight into the semantic


similarity between biomedical (MeSH) terms. It should be mentioned that not every biomedical term is a MeSH heading/concept or can be found in the MEDLINE frequency files. Yet,

MEDLINE is the largest and most comprehensive text and literature database for

biomedical research. Thus, it can be considered as the most reliable information source.

Determining the similarity between biomedical terms is a rather important task that is

needed in many applications. For example, in information retrieval in the biomedical

domain, there is a need to determine the best match between the query/keywords and the

retrieved documents. Integrating multiple resources for information extraction and

knowledge discovery is another application that can benefit greatly from semantic

similarity.

5.4 Discussion

This work lays a first brick for further advances and more structure in this task. The previous semantic similarity work in the biomedical domain used

ontologies only as primary information sources. The main contribution of this work is the

application of information-based semantic similarity measures into the biomedical

domain using MEDLINE, the most comprehensive resource of textual information in this

domain. Experiment results show that MEDLINE is an effective resource for computing

semantic similarity between biomedical terms and concepts. The experimental results

demonstrated that information-based similarity measures can achieve high correlations

with human similarity scores.


6. THE PROPOSED COMBINATION-BASED (HYBRID) APPROACH

6.1 Motivation

This section presents a new analysis/view of the semantic features that make up semantic measures, as well as an analysis of the strengths and weaknesses of semantic similarity

measures based on this view. To combine several existing measures’ strengths and

complement their weaknesses in semantic computing, a combination-based measure is

proposed as a hybrid measure (Hybrid) that uses IS-A relations in the ontology information source for the path length and depth features, and uses a corpus for the information content of concept nodes to augment these two features. This work also shows how to use corpus statistics/IC effectively in semantic computing in the general English domain.

6.2 Semantic Similarity Features

6.2.1 Path Feature and Depth Feature

The first and most basic approach to measuring semantic distance/similarity between two concept nodes in an ontology is to find the shortest distance between their nodes. This approach, called Path length, was proposed by Rada et al. [22] as a potential approach in the biomedical domain. After that, a number of ontology-based similarity approaches have

biomedical domain. After that, a number of ontology-based similarity approaches have

been introduced which use IS-A relations in ontology as primary information source.

Most of these measures can be roughly divided into two groups. The first group includes

ontology-structure-based measures (i.e. Path length [22], Leacock and Chodorow [10],


Wu and Palmer [30]) and the second group includes information-based measures that use

ontology structure and corpus-based features (i.e. Resnik [23], Jiang and Conrath [9], Lin

[11]). Both groups use IS-A relations in ontology as information source for computing

the similarity. The two main features of measures used in both groups are: (1) path

feature and (2) depth feature. Path feature can be measured by (i) simple node counting,

(ii) edge/link counting, or (iii) by “weighted path” (Jiang and Conrath [9]) using IC of

concept nodes. The weighted path between two concept nodes C1 and C2 is measured by

summing up all weighted links on the shortest path between C1 and C2. The depth

feature, on the other hand, can be measured by node counting, edge/link counting or by

"weighted depth", which was first developed by Resnik [23]. The weighted depth or

information-based approach measures the similarity of two concept nodes by finding IC

of the least common subsumer (LCS) node of them in the ontology. The information

content of a concept node depends on the probability of encountering an instance of it in

a corpus, and the information content is calculated as the negative log likelihood of the probability [Equation 28], which is determined by the frequency of occurrence of the concept and its subconcepts in the corpus [Equation 27].

Table 10. Similarity features of 8 similarity measures

Measure | Path | Depth
Path length | * | none
Leacock and Chodorow | * | none
Wu and Palmer | none | *
Resnik | none | **
Jiang and Conrath | ** | none
Lin | none | **
Li et al. | * | *
Hybrid (proposed) | * | **

* denotes path length or depth length; ** denotes weighted path or weighted depth; "none" denotes that the feature is not used by the measure.

Path feature is an important feature that contributes significantly to semantic similarity.

Let us consider a fragment of ontology in Figure 4 containing concept nodes ai and bi.


Path length measure and Leacock and Chodorow measure do not use the depth feature as a property of concepts; hence they give the same similarity for pairs that have the same path length (e.g. pair a2-a5 and pair a1-b1) regardless of their specificity in the ontology. Table 10 summarizes the features used by seven existing similarity measures along with the proposed measure.

Six of the measures in Table 10 use either the path or the depth feature but not both, and can therefore be grouped into: (1) path-based measures (Path length, Leacock and

Chodorow, and Jiang and Conrath) and (2) depth-based measures (Wu and Palmer,

Resnik, and Lin).

Li et al. is the measure that combines the two features of path length and depth. However, it has limitations, as discussed above in section 3.1. The newly proposed measure combines the weighted path and weighted depth features in one measure, since path length and depth are special cases of weighted path and weighted depth. In the weighted path and weighted depth approaches, the links between ontology nodes are not equal in terms of strength/weight, and link strength can be determined by local density, information content, and link type [26]. However, the weighted path approach of Jiang and Conrath has a limitation: it takes into account the IC of individual concept nodes and is therefore affected by the use of a small corpus, as some words may not occur in small corpora; such words will always have minimum similarity with any other word. By using path length, the relationships between any concepts present in the ontology can be seen intuitively; therefore, node counting is used for the path feature (path length). Besides the path length feature, the weighted depth is used as a kind of specificity of concept nodes in the measure.

6.2.2 The Adapted Common Specificity Feature

The LCS node of two given concept nodes determines their common specificity in

ontology. The common specificity of two concept nodes in ontology based on ontology

structure and corpus is defined as follows:


CSpec(C1,C2) = ICmax - IC(LCS(C1,C2)) (29)

where ICmax is the maximum IC of concept nodes in the ontology. The CSpec feature

determines the common specificity of two concept nodes in the ontology based on given

corpus and ontology structure. The smaller the common specificity value of two concept nodes, the more information they share, and thus the more similar they are. When the IC of the LCS of two concept nodes (C1 and C2) reaches ICmax, that is,

IC(LCS(C1, C2)) = ICmax,

then the two concept nodes reach the highest common specificity, which equals zero:

CSpec(C1, C2) = 0.

6.3 The Combination-Based (Hybrid) Approach

One of the contributions of this work is the adapted common specificity feature integrated into the proposed measure, which allows it to perform stably with any corpus size. The proposed measure also satisfies the three single-ontology intuitive rules (R1, R2, R3) and the two assumptions (A1 and A2) in section 3.3. The proposed Hybrid approach is as follows:

SemDist(C1, C2) = log((Path − 1)^α × (CSpec)^β + k) (30)

where α > 0 and β > 0 are contribution factors of the two features (Path and CSpec(C1, C2)), and k is a constant. Path is the shortest path length between the two concept nodes using node counting. If k is zero, the combination is linear; to ensure the distance is positive and the combination is non-linear, k must be greater than or equal to one (k ≥ 1). When two concept nodes have a path length of 1 using node counting (Path = 1), they have the minimum semantic distance (i.e., maximum similarity), which equals zero regardless of the common specificity feature.
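The following minimal Python sketch puts Equations 29 and 30 together: the path length comes from the ontology (node counting) and CSpec comes from corpus-based information content. The IC values and path lengths are illustrative placeholders, not values from the thesis experiments.

# Sketch of the Hybrid measure (Equations 29-30).
import math

def hybrid_sem_dist(path, ic_lcs, ic_max, alpha=1.0, beta=1.0, k=1.0):
    cspec = ic_max - ic_lcs                                     # Equation 29
    return math.log((path - 1) ** alpha * cspec ** beta + k)   # Equation 30

# A specific LCS (high IC) gives a small CSpec and hence a small distance.
print(hybrid_sem_dist(path=3, ic_lcs=9.5, ic_max=10.0))
# A very general LCS (low IC) gives a large CSpec and a larger distance.
print(hybrid_sem_dist(path=3, ic_lcs=1.0, ic_max=10.0))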


6.4 Evaluation

6.4.1 Information Source

WordNet 2.0, a semantic lexicon for the English language developed at Princeton University, was used as the primary information source. The Perl module WordNet::Similarity was used, building on the existing measure implementations developed by Pedersen et al. [19]. Resnik's technique [23] was used to calculate the IC of concepts, particularly nouns, based on their frequencies. In these experiments, the Brown corpus [7] or the SemCor corpus [15] was used. The frequency frq(c) of a concept node c was computed by counting all occurrences in the corpus of the concepts contained in or subsumed by concept node c. The concept node probability is then computed directly as:

p(c) = frq(c) / N (31)

where N is the total number of nouns in the corpus that are also presented in WordNet.

The information content of concept c is then given by:

IC(c) = - log p(c) (32)
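The following minimal Python sketch illustrates the same counting idea (Equations 31 and 32) using NLTK's WordNet corpus reader rather than the WordNet::Similarity Perl package used in this work. The tiny token list stands in for the Brown/SemCor corpus, and, as a simplification, each noun is credited to its first sense and first hypernym path only (Resnik's scheme spreads counts over all senses).

# Sketch of frequency-based IC over WordNet nouns (requires the NLTK WordNet data).
import math
from collections import Counter
from nltk.corpus import wordnet as wn

tokens = ["car", "automobile", "bicycle", "fruit", "apple"]   # toy "corpus"
freq = Counter()
counted = 0
for tok in tokens:
    synsets = wn.synsets(tok, pos=wn.NOUN)
    if not synsets:
        continue
    counted += 1
    # credit the synset and everything that subsumes it
    for ancestor in synsets[0].hypernym_paths()[0]:
        freq[ancestor] += 1

N = counted   # nouns in the corpus that are also present in WordNet

def ic(synset):
    return -math.log(freq[synset] / N) if freq[synset] else float("inf")

print(ic(wn.synset("entity.n.01")))            # the root subsumes everything, so IC = 0
print(ic(wn.synsets("car", pos=wn.NOUN)[0]))   # a more specific concept has higher IC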

6.4.2 Datasets

There are two well-known benchmark datasets of term pairs that were scored by human

experts for semantic similarity for general English. The first set (RG) is collected by

Rubenstein and Goodenough [25], and covers 51 subjects containing 65 pairs of words on

a scale from “highly synonymous” to “semantically unrelated” (Table 11 contains only

subset of this dataset). The second dataset (MC) was collected by Miller and Charles [14]

in a similar experiment conducted 25 years after Rubenstein and Goodenough collected

RG set, and contains 30 pairs extracted from the 65 pairs of RG, and covers 38 human

subjects.


Table 11. A subset of human mean ratings for the Rubenstein-Goodenough (RG) set

Top 5 pairs | RG Rating | Last 5 pairs | RG Rating
cord-smile | 0.02 | cushion-pillow | 3.84
rooster-voyage | 0.04 | cemetery-graveyard | 3.88
noon-string | 0.04 | automobile-car | 3.92
fruit-furnace | 0.05 | midday-noon | 3.94
autograph-shore | 0.06 | gem-jewel | 3.94

Table 12. Training dataset: 19 medical term pairs of Dataset 2 found in WordNet

Concept 1 | Concept 2 | Human
Anemia | Appendicitis | 0.031
Sinusitis | Mental Retardation | 0.031
Dementia | Atopic Dermatitis | 0.062
Osteoporosis | Patent Ductus Arteriosus | 0.156
Hypothyroidism | Hyperthyroidism | 0.406
Sarcoidosis | Tuberculosis | 0.406
Adenovirus | Rotavirus | 0.437
Hypertension | Kidney Failure | 0.500
Hepatitis B | Hepatitis C | 0.562
Vaccines | Immunity | 0.593
Psychology | Cognitive Science | 0.593
Urinary Tract Infection | Pyelonephritis | 0.656
Migraine | Headache | 0.718
Carcinoma | Neoplasm | 0.750
Breast Feeding | Lactation | 0.843
Seizures | Convulsions | 0.843
Pain | Ache | 0.875
Down Syndrome | Trisomy 21 | 0.875
Measles | Rubeola | 0.906

6.4.3 Experimental Results

Most previous relevant work used the MC dataset to validate and compare approaches [9,11,12,23] because of missing concepts in earlier versions of WordNet. MC could be used as a training dataset and RG as a testing dataset; however, MC is a subset of RG, and all 65 pairs of the RG dataset can now be found in WordNet 2.0. The whole RG dataset was therefore used to test the Hybrid (proposed) measure and compare it with other measures. A training step was also needed to tune the proposed measure for optimal parameters, and it is more effective to use a dataset completely different from RG for training. Lacking such a dataset, Dataset 2 from the biomedical domain [8] was used; the human scores in this dataset are the averages of scores given by reliable doctors. As this dataset contains biomedical terms, part of its pairs cannot be found in WordNet. Table 12 shows the part of this dataset that can be found in WordNet, which contains 19 biomedical term pairs. These pairs were then used to train the proposed measure for optimal parameters. Table 13 shows some experimental results using the two corpora.

Table 13. Results of absolute correlations of the proposed measure with human ratings using the training dataset with different parameter values

Parameter values | α=1, β=1, k=1 | α=2, β=1, k=1 | α=3, β=1, k=1 | α=3, β=1, k=2 | α=3, β=1, k=3
SemDist (SemCor Corpus) | 0.717 | 0.741 | 0.747 | 0.743 | 0.739
SemDist (Brown Corpus) | 0.698 | 0.729 | 0.739 | 0.735 | 0.733

When α = 3 and β = 1, the performances of SemDist are very close and reach the highest correlations with human scores using either the SemCor corpus or the Brown corpus (Table 13). The results in Table 13 show that α should be greater than β to obtain higher correlations, which implies that the Path feature contributes more to the semantic similarity than the CSpec feature [Equation 29]. The testing was conducted using the RG test set (65 pairs) with the SemCor and Brown corpora; the results are in Table 14.
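The parameter-tuning step just described can be sketched as a small grid search: choose the (α, β, k) setting that maximises the absolute correlation with the human scores of the training pairs. The (Path, CSpec) features and human scores below are illustrative placeholders, not the actual training data; statistics.correlation requires Python 3.10+.

# Sketch of tuning alpha, beta, k on a training set by maximising correlation.
import math
from itertools import product
from statistics import correlation   # Pearson's r (Python 3.10+)

def sem_dist(path, cspec, alpha, beta, k):
    return math.log((path - 1) ** alpha * cspec ** beta + k)   # Equation 30

pair_features = [(2, 0.5), (4, 2.0), (6, 3.5), (9, 5.0)]   # hypothetical (Path, CSpec)
human_scores = [0.9, 0.6, 0.4, 0.1]                        # hypothetical ratings

best = max(
    ((abs(correlation([sem_dist(p, c, a, b, k) for p, c in pair_features], human_scores)),
      a, b, k)
     for a, b, k in product([1, 2, 3], [1], [1, 2, 3])),   # grid as in Table 13
    key=lambda t: t[0])
print(best)   # (best absolute correlation, alpha, beta, k)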

Table 14. Absolute correlations with human ratings for the proposed measure using the RG dataset (65 pairs)

Measure | Optimal Parameters | Correlation
SemDist (SemCor corpus) | α=3, β=1, k=1 | 0.873
SemDist (SemCor corpus) | α=3, β=1, k=2 | 0.873
SemDist (SemCor corpus) | α=3, β=1, k=3 | 0.874
SemDist (Brown corpus) | α=3, β=1, k=1 | 0.872
SemDist (Brown corpus) | α=3, β=1, k=2 | 0.874
SemDist (Brown corpus) | α=3, β=1, k=3 | 0.874


The results of the RG experiments in Table 14 show that the Hybrid measure produces

good and stable performance with this set. Furthermore, the correlation results in Tables 13 and 14 show that it can perform well with any corpus size and reaches very good correlations on the RG dataset. The performances of other information-based measures were also investigated on the two corpora using the RG dataset and WordNet 2.0; the results are in Table 15. The results in Table 15 show clearly that the Hybrid measure outperforms the other information-based measures on the two corpora. Most of the measures perform significantly better using the Brown corpus than using the SemCor corpus. Moreover, the Resnik measure shows more stable performance across both corpora compared with Jiang and Conrath and Lin.

Table 15. Absolute correlations with RG human ratings using SemCor and Brown corpora and WordNet 2.0 for four combination-based measures

Measure | Correlation with RG (SemCor) | Correlation with RG (Brown)
SemDist | 0.874 | 0.874
Resnik | 0.807 | 0.830
Jiang and Conrath | 0.650 | 0.854
Lin | 0.728 | 0.853

Table 15 shows that the Hybrid measure gives the highest and most stable correlations with human scores on the two corpora. SemCor [15] is a sense-tagged subset of the Brown corpus [7]; its words have been manually tagged with their appropriate senses by human experts. However, this corpus (~200,000 words) is considerably smaller than the Brown corpus (~1 million words), which is plain text with no annotations. The measures of Jiang and Conrath and Lin do not perform as well with the SemCor corpus as they do with the Brown corpus (Table 15). Moreover, the Resnik measure performs better than the Jiang and Conrath and Lin measures using SemCor (Table 15) precisely because Lin and Jiang and Conrath use the IC of individual concept nodes; in a small corpus like SemCor, some words do not occur, which hurts the performance of such measures. The Hybrid measure is not affected by small corpus size because it uses the IC of the LCS node of the two concept nodes.


6.5. Discussion

The Hybrid measure performs quite well, attaining an impressive correlation (0.874), which is the best correlation with human ratings reported to date on the benchmark RG dataset. The proposed measure combines the strengths of several traditional approaches. It uses a new feature (CSpec) that contributes well to performance by relating the IC of the least common subsumer of two given concepts to the maximum IC of concepts in the ontology. The comparative experimental results demonstrate that the measure is very competitive and outperforms the existing ontology-based measures on benchmark datasets. Furthermore, the Hybrid measure can be adapted to obtain optimum performance in a specific domain through an effective training strategy, and it can perform well with any corpus size.


7. THE PROPOSED CROSS-CLUSTER APPROACH FOR SEMANTIC

SIMILARITY OF CONCEPTS IN WORDNET

7.1. The Need for a Cross-Cluster Semantic Approach for WordNet

In this work, the term “cluster” is used to denote a taxonomy in WordNet, while in

biomedical ontology, “cluster” refers to taxonomy, category tree or subtree (e.g. in

UMLS [31]). In WordNet, as discussed in more detail later, all noun taxonomies are

grouped into one “noun cluster” and all verb taxonomies are grouped into one “verb

cluster". In this work, only the semantics of nouns and verbs are considered.

Figure 8. Fragment of Ontology



As the noun cluster of WordNet was the first to be richly developed, most researchers limited their work to this cluster [9,10,12,23,26]. Resnik and Diab [24] first examined the similarity of verbs in the verb cluster; they considered verb similarity to differ from noun similarity in several aspects, because verb representations are generally viewed as possessing properties that nouns do not, such as syntactic subcategorization restrictions, selectional preferences, and event structure, with dependencies among these properties. However, there is no work concerning the measurement of semantic similarity of all open-class words in one scale system using ontology-based measures. The reason for considering the scale system is that the similarity scales of the noun cluster and the verb cluster differ: as discussed above, the average depth of the noun cluster is 13 while the average depth of the verb cluster is 2. It is necessary not to be limited to noun similarity only, skipping other kinds of words (e.g. verbs), in applications such as word sense disambiguation, IR, etc.

For example, in Figure 8, the distance (path length) between b1 and b3 is 3 by node

counting, and this value represents the maximum distance (minimum similarity) in the cluster containing the concepts bi, while a path length of 3 in the cluster of the concepts ai, for example between a1 and a3, is on a different scale and is not the maximum distance.

7.2. The Proposed Cross-Cluster Semantic Distance Approach

This approach is a variation of the Cluster-Based approach in which the CSpec feature is calculated as in the Hybrid approach [Equation 29]. The previous work showed that the Hybrid approach performs very well and stably in the general English domain. In this work, it is extended to measure the semantic similarity of open-class words in WordNet. The Cross-Cluster approach is thus an extension of the Hybrid approach and a variation of the Cluster-Based approach. It therefore satisfies the two rules (R3 and R4) and

the two assumptions (A1 and A2).

In cross-cluster similarity, there are four cases depending on whether the concepts occur

in primary or in secondary clusters. The four cases are as follows:


Case 1: Similarity within the Primary Cluster: If the two concept nodes occur in the

primary cluster then the similarity in this case is treated as similarity within single cluster

[ Equation 30] discussed in section 6.3.

Case 2: Cross-Cluster Similarity: In this case, the LCS of two concept nodes is the

root node which belongs to two clusters. The secondary cluster is connected as a child of

the root of the primary cluster. This technique does not affect the scale of the common

specificity feature of the primary cluster. The common specificity is then given as

follows:

CSpec(C1, C2) = CSpec_primary = IC_primary (33)

where IC_primary is the maximum IC of concept nodes in the primary cluster (the primary cluster information content). The shortest path between the two concept nodes goes through two clusters having different granularity degrees; therefore, the part of this path length in the secondary cluster has to be converted into the primary cluster's path scale as follows.

The Cross-Cluster Path length Feature: Let us consider again the example, shown in

Figure 4. The root node is the node that connects all clusters. The path length between

two nodes is computed by adding up the two (shortest) path lengths of two nodes to their

LCS node (their LCS is the root). For example, in Figure 4, for the nodes (a3 and b3), the

LCS is the root node. The path length between a3 and b3 is calculated as follows:

Path(C1,C2)= d1 + d2 - 1, (34)

such that, d1 = d(a3, root) and d2 = d(b3, root),

where d(a3, root) is the shortest path (path length) from the root node to node a3 ; and

similarly d(b3, root) is the shortest path from the root to b3. The root node is counted twice because of the node counting approach, so one is subtracted in Equation 34. Note that Path here is the cross-cluster path length between the two concept nodes, and the densities


(or granularities) of the two clusters are in different scales, so the path between the two concept nodes through their LCS crosses different scales. Following the previous discussion of local concept specificity, let us call the first cluster, which contains a3, the "primary cluster", and the second cluster, which contains b3, the "secondary cluster". The granularity rate of the primary cluster over the secondary cluster for the common specificity feature is:

CSpecRate = IC_primary / IC_secondary (35)

where IC_primary and IC_secondary are the information contents (maximum ICs) of the primary and secondary clusters, respectively. The granularity rate of the primary cluster over the

secondary cluster for the path feature is given by:

PathRate = (2D1 − 1) / (2D2 − 1) (36)

where (2D1-1) and (2D2 -1) are maximum shortest path values of two concept nodes in

the local primary and local secondary clusters, respectively. Following Rule R3, d2 in the

secondary cluster, in Equation 34, is converted to the primary cluster's scale as follows:

d'2 = PathRate × d2 (37)

This new distance d'2 reflects the path length of the second concept to the LCS relative to the path scale of the primary cluster. Applying Equation 37, the path length (Equation 34) between the two concept nodes in the primary cluster scale is as follows:

Path(C1, C2) = d1 + PathRate × d2 − 1 (38)

Path(C1, C2) = d1 + ((2D1 − 1) / (2D2 − 1)) × d2 − 1 (39)

Finally, the semantic distance (SemDist) between two concept nodes is given as follows:


CSpec(C1, C2) = IC_primary (40)

SemDist(C1, C2) = log((Path − 1)^α × (CSpec)^β + k) (41)

Case 3: Similarity within the Secondary Cluster: In this case, both concept nodes are

in a single secondary cluster. Then the semantic distance features must be converted to

primary cluster’s scales as follows:

Path(C1, C2) = Path(C1, C2)_secondary × PathRate (42)

CSpec(C1, C2) = CSpec(C1, C2)_secondary × CSpecRate (43)

SemDist(C1, C2) = log((Path − 1)^α × (CSpec)^β + k) (44)

where Path(C1, C2)_secondary and CSpec(C1, C2)_secondary are the Path and CSpec between C1

and C2 in the secondary cluster; and PathRate and CSpecRate are computed in Equations

36 and 35, respectively.

Case 4: Similarity within Multiple Secondary Clusters: In this case, the two concept

nodes are in two secondary clusters Csi and Csj (i.e., none of them exists in the primary

cluster). Then, one of the two secondary clusters acts momentarily as a primary to

calculate the semantic features (viz. Path and CSpec) using Case-2 above. (That is, the

semantic features, Path and CSpec, will be computed according to Case-2 by assuming

temporarily that Csi and Csj are primary and secondary clusters although both are

secondarys, so that to scale and unify the CSpec and Path features between them). Then,

the semantic distance (SemDist) is computed using Case-3 to scale the features (again) to

the scale-level of the primary cluster (Cp).
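The following Python sketch illustrates the WordNet cross-cluster computation (Equations 33 to 44), assuming the noun cluster is the primary cluster and the verb cluster the secondary one. The depths (18 and 14) follow section 7.4.2; the maximum-IC values and the node-to-root distances are illustrative placeholders.

# Sketch of the Cross-Cluster approach with IC-based CSpec scaling.
import math

def sem_dist(path, cspec, alpha=3.0, beta=1.0, k=1.0):
    return math.log((path - 1) ** alpha * cspec ** beta + k)   # Equations 41/44

D_noun, D_verb = 18, 14                  # cluster depths (section 7.4.2)
ic_noun_max, ic_verb_max = 12.0, 9.0     # hypothetical maximum ICs per cluster

path_rate = (2 * D_noun - 1) / (2 * D_verb - 1)   # Equation 36
cspec_rate = ic_noun_max / ic_verb_max            # Equation 35

# Case 2: a noun concept d1 hops from the root and a verb concept d2 hops away.
d1, d2 = 6, 3
path = d1 + path_rate * d2 - 1                    # Equations 38/39
print(sem_dist(path, ic_noun_max))                # CSpec = IC_primary (Eqs. 33/40)

# Case 3: both concepts inside the verb cluster; scale both features up.
path_verb, cspec_verb = 4, 2.5
print(sem_dist(path_verb * path_rate, cspec_verb * cspec_rate))   # Eqs. 42-43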

7.4 Evaluation

7.4.1 Information Source

WordNet 2.0, a semantic lexicon for the English language developed at Princeton University, was used as the primary information source. The Perl module WordNet::Similarity was used, building on the existing measure implementations developed by Pedersen et al. [19]. Resnik's technique [23] was used to calculate the IC of concepts, particularly nouns and verbs, based on their frequencies. In these experiments, the Brown corpus [7] or the SemCor corpus [15] was used.

7.4.2 Evaluation Method and Dataset

The proposed measure is evaluated using a single ontology (WordNet) but with more than one cluster (cross-cluster) to show the effectiveness of the proposed technique for handling cluster granularity differences within the same ontology.

The noun cluster, which connects all noun taxonomies in WordNet, is considered the primary cluster and has a depth of 18. The verb cluster, which connects all verb taxonomies, is considered the secondary cluster and has a depth of 14. The depths of the two clusters (noun cluster and verb cluster) show that their granularities are significantly different. The RG dataset contains 65 noun pairs, and part of these pairs contain nouns that also have one or more verb senses. The proposed measure is evaluated on the RG dataset as follows: if a word has two parts of speech, the part of speech (POS) that matches the POS of the other concept is considered; if both words have noun and verb senses, only the noun-noun and verb-verb pairs are taken into account.

As discussed, all 65 pairs of RG can now be found in WordNet 2.0; therefore, the whole RG dataset was used for testing the proposed measure and comparing it with other measures. However, the proposed measure needs a training step to tune its parameters. Unfortunately, there is no other standard dataset in the general English domain that has human ratings (recall that MC is a subset of RG). Therefore, the training phase discussed in section 6.4.3 above was used to obtain the optimum parameters.


7.4.3 Experiments and Results

The experiments were conducted using the RG dataset (65 pairs) with both corpora, and the results are in Table 16. These results (Table 16) demonstrate that SemDist produces good and stable performance on the RG terms. The measure achieves the same correlation (0.873) using either of the two corpora; thus, the proposed measure, SemDist, can perform well with any corpus size. Furthermore, the performance of three other relevant measures was investigated with the two corpora using the RG dataset. Table 17 shows the correlation results of these three information-based measures with the RG human ratings using the two corpora. The results in Tables 17 and 18 also show that SemDist outperforms these three measures significantly when the SemCor corpus is used: with SemCor, SemDist achieves a correlation with human scores (0.873) that is almost 20% higher than the average correlation (0.728) of the three methods (Table 17). When the Brown corpus is used, SemDist performs slightly better than these measures. Table 17 also shows that the three measures perform significantly better using Brown than using SemCor, and that the proposed measure, like the Resnik measure, performs well on both corpora because these two measures do not use the IC of individual concept nodes; hence, their performance is not affected much by small corpora such as SemCor, where some words may not occur, making the similarity of those words to any other word reach the minimum.

Table 16. Absolute correlations with human judgments for the proposed measures

using the RG dataset

Measure | Optimal Parameters | Correlation
SemDist (SemCor corpus) | α=3, β=1, k=1 | 0.873
SemDist (Brown corpus) | α=3, β=1, k=1 | 0.873


Table 17. Absolute correlations with RG human ratings using two corpora and WordNet 2.0 for 3 information content-based measures

Measure | SemCor | Brown
Resnik | 0.807 | 0.830
Jiang and Conrath | 0.650 | 0.854
Lin | 0.728 | 0.853
Average | 0.728 | 0.846

The proposed measure, SemDist, reaches an impressive correlation of 0.873 with human ratings and ranks #1 (Table 18), which shows the great potential of the approach and the strength of the combination strategy. Because the correlation results of all measures are already high, even a small improvement is significant.

The experiments on a single ontology, WordNet, with multiple clusters show the effectiveness of the proposed approach, which performs quite well, attaining an impressive correlation (0.873); this is the best correlation with human ratings reported to date on the benchmark RG dataset. In the experimental results, the proposed measure achieved an improvement of ~20% over the average correlation of three of the similar measures using the standard SemCor corpus.

Table 18. Absolute correlations with RG human ratings of ontology-based measures

  Measure                            RG      Rank
  SemDist (using Brown or SemCor)    0.873   1
  Leacock and Chodorow               0.858   2
  Jiang and Conrath (using Brown)    0.854   3
  Lin (using Brown)                  0.853   4
  Resnik (using Brown)               0.830   5
  Wu and Palmer                      0.811   6
  Path Length                        0.798   7


7.4.5 Discussion

A limitation of this work is that no dataset containing both nouns and verbs is available for a better evaluation of the cross-cluster approach. It is nevertheless clear and intuitive that the approach is effective for computing similarity in an ontology such as WordNet, where many clusters have greatly different granularities. In the next chapter, the similarity of verbs is investigated, as well as the similarity of nouns and verbs within one similarity scale system.


8. SEMANTIC SIMILARITY OF VERBS AND NOUNS IN WORDNET

8.1 Motivation

There are four parts of speech (POS) of open-class words in WordNet. The taxonomies of nouns form the noun cluster and the taxonomies of verbs form the verb cluster. This chapter focuses on the semantic similarity of verbs in the verb cluster by evaluating both the existing semantic measures and the proposed measures. In addition, the similarity of nouns and verbs is investigated within one similarity scale system, in the context of real applications (e.g., word sense disambiguation (WSD)) that take both nouns and verbs into account.

8.2 Information Source and Datasets

WordNet 2.0 was again used as the primary information source for the semantic measures. The WordNet::Similarity Perl package was used for the existing measures and for implementing the new measures. The RG dataset was used as the noun dataset. For verbs, this work used the verb dataset of Resnik and Diab [24]. In their work on measuring semantic similarity between verbs, Resnik and Diab [24] compiled a dataset of 27 verb pairs (called the RD dataset) and collected human ratings (human similarity scores) for all verb pairs in the dataset. They collected two kinds of human similarity scores for each pair: one obtained by presenting the human subjects with the verbs only, called the HNoContext score, and the other obtained by presenting the verbs within their contexts, called the HContext score. Table 19 contains the RD dataset with the human ratings.


Table 19. Mean human ratings of RD dataset of verb pairs

  Verb 1       Verb 2        HNoContext   HContext
  wiggle       rotate        2.80         2.20
  prick        compose       0.00         0.00
  crinkle      boggle        0.00         0.40
  hack         unfold        0.00         0.00
  wash         sap           1.20         0.40
  compress     unionize      1.80         1.00
  percolate    unionize      0.00         0.00
  chill        toughen       1.40         0.80
  fill         inject        4.60         2.40
  whisk        deflate       0.00         0.40
  compose      manufacture   4.00         2.80
  obsess       disillusion   1.20         0.00
  loosen       inflate       0.00         0.40
  swagger      waddle        3.20         1.60
  loosen       open          3.00         1.80
  displease    disillusion   2.80         0.80
  dissolve     dissipate     4.20         3.40
  plunge       bathe         2.20         1.60
  lean         kneel         2.60         1.80
  embellish    decorate      4.60         4.00
  neutralize   energize      0.20         0.20
  initiate     enter         3.20         2.60
  open         inflate       0.60         0.80
  unfold       divorce       1.60         0.60
  bathe        kneel         0.00         0.00
  festoon      decorate      5.00         4.20
  weave        enrich        3.00         0.25

8.3 Semantic Similarity in Verb Cluster

The traditional ontology-based semantic measures are those that do not combine semantic features in one measure and therefore have no parameters in their formulas. In contrast, combination-based measures such as Li et al., the Hybrid measure, and the Cross-Cluster measure do have parameters in their formulas. The parameters of the path and depth features in the Li et al. measure reflect the contributions of those features to similarity, whereas the parameters in the Hybrid and Cross-Cluster measures reflect the relative contributions of the lexical representation and the corpus to similarity.



8.3.1 Traditional Measures

Six traditional semantic measures were used to compute the similarity of the verb pairs in the RD dataset, and the correlations of these measures with the human ratings with and without context (HContext and HNoContext) were calculated. Table 20 shows the correlations of the information-based measures using the two corpora, and Table 21 shows the correlations of the structure-based measures.

Table 20. Absolute correlations with RD human ratings using the SemCor and Brown corpora for three information-based measures

                       Using SemCor              Using Brown
  Measure              HNoContext   HContext     HNoContext   HContext
  Resnik               0.633        0.724        0.623        0.712
  Jiang and Conrath    0.444        0.572        0.418        0.451
  Lin                  0.525        0.638        0.476        0.526
  Average              0.534        0.645        0.506        0.563

Table 21. Absolute correlations with RD human ratings of ontology-structure-based measures

  Measure                 HNoContext   HContext
  Leacock and Chodorow    0.492        0.683
  Wu and Palmer           0.599        0.715
  Path Length             0.470        0.659
  Average                 0.520        0.686
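For illustration, the sketch below reproduces the general evaluation procedure (maximum similarity over verb-sense pairs, then absolute Pearson correlation with the HContext ratings) using NLTK's WordNet interface and information-content files instead of the WordNet 2.0 / WordNet::Similarity setup actually used here, so its numbers will not match the tables; only three of the 27 RD pairs from Table 19 are shown:

```python
from math import sqrt
from nltk.corpus import wordnet as wn, wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')        # Brown-based information content counts
pairs = [("wiggle", "rotate", 2.20), ("fill", "inject", 2.40),
         ("festoon", "decorate", 4.20)]         # (verb1, verb2, HContext); full list in Table 19

def best_score(w1, w2, simfn):
    """Maximum similarity over all verb-sense combinations; failures count as 0."""
    best = 0.0
    for a in wn.synsets(w1, pos=wn.VERB):
        for b in wn.synsets(w2, pos=wn.VERB):
            try:
                s = simfn(a, b)
            except Exception:                   # e.g. verb senses with no common hypernym
                s = None
            if s is not None and s > best:
                best = s
    return best

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

measures = {"Path Length":   lambda a, b: a.path_similarity(b),
            "Wu and Palmer": lambda a, b: a.wup_similarity(b),
            "Resnik":        lambda a, b: a.res_similarity(b, brown_ic),
            "Lin":           lambda a, b: a.lin_similarity(b, brown_ic)}

human = [h for _, _, h in pairs]
for name, fn in measures.items():
    scores = [best_score(w1, w2, fn) for w1, w2, _ in pairs]
    print(name, round(abs(pearson(scores, human)), 3))   # absolute correlation with HContext
```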

Tables 20 and 21 show that all the measures give similarity scores that correlate more highly with the context ratings than with the no-context ratings. The average correlations in Table 20 show that, as a secondary information source, the SemCor corpus contributes significantly more to similarity than the Brown corpus: the average correlation of the information-based measures with the context ratings using SemCor (0.645) is 14.6% higher than the corresponding average using Brown (0.563). Table 22 shows the correlations of all measures, with SemCor used as the corpus for the information-based measures.


Table 22. Absolute correlations with RD human ratings of six measures, using the SemCor corpus for the information-based measures

  Measure                 HNoContext   HContext
  Resnik                  0.633        0.724
  Wu and Palmer           0.599        0.715
  Leacock and Chodorow    0.492        0.683
  Path Length             0.470        0.659
  Lin                     0.525        0.638
  Jiang and Conrath       0.444        0.672

Table 22 shows that the depth-based measures, Resnik and Wu and Palmer, outperform the others, ranking #1 and #2, respectively. The correlation results in Table 22 also show that the corpus contributes significantly to similarity. However, the average correlation with the context ratings of the structure-based measures (Table 21) is higher than that of the information-based measures. Therefore, to examine the contributions of the lexical hierarchy representation of the verb cluster and of the corpus within the single verb cluster, the Hybrid measure was used to study their effects on similarity.

8.3.2 Hybrid Measure and Cross-Cluster Measure

Previous experiments in this thesis show that using the SemCor corpus as the secondary information source gives better results than the Brown corpus in the verb cluster; therefore, the SemCor corpus was used for the Hybrid measure. As there is no other verb dataset for training the measure, the parameters were varied to observe the effect of the contribution parameters on similarity. Table 23 shows the absolute correlation results obtained while changing the parameters.
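The parameter sweep behind Table 23 can be sketched as follows. This is only an outline: semdist_hybrid stands in for the Hybrid distance function of this thesis, and rd_pairs / hcontext stand for the RD verb pairs and their HContext ratings.

```python
from itertools import product
from math import sqrt

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def sweep(rd_pairs, hcontext, semdist_hybrid,
          alphas=(1, 2), betas=(1, 2, 3, 4), ks=(1, 2, 4, 6, 8, 10, 30)):
    """Try each (alpha, beta, k) setting and report |correlation| with the human ratings."""
    results = {}
    for alpha, beta, k in product(alphas, betas, ks):
        dists = [semdist_hybrid(v1, v2, alpha=alpha, beta=beta, k=k) for v1, v2 in rd_pairs]
        # Distances are inversely related to similarity, so the absolute correlation is reported.
        results[(alpha, beta, k)] = abs(pearson(dists, hcontext))
    return results
```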


Table 23. Similarity of verbs given by the Hybrid measure using WordNet 2.0 and SemCor, correlated with the human ratings (HContext) of the Resnik and Diab (RD) dataset

  Parameters   α=1,β=1   α=1,β=2   α=2,β=1   α=1,β=3   α=1,β=4
  k=1          0.788     0.802     0.768     0.802     0.800
  k=2          0.794     0.810     0.767     0.809     0.806
  k=4          0.792     0.814     0.761     0.815     0.812
  k=6          0.788     0.814     0.756     0.817     0.814
  k=8          0.783     0.813     0.750     0.818     0.816
  k=10         0.779     0.812     0.745     0.818     0.817
  k=30         0.754     0.796     0.716     0.815     0.819

The contribution parameters of the features behave quite differently in the verb cluster than in the noun cluster: in the best-performing cases, the contribution parameter of the corpus-based common specificity feature (β) is larger than that of the path length feature (α), which is based on the lexical hierarchy representation. This indicates that, for similarity purposes, the lexical representation of the noun cluster is better developed than that of the verb cluster.

As discussed earlier, the different granularities of different clusters yield different similarity scale systems, especially in WordNet, where the noun cluster has a depth of about 18 while the verb cluster has a depth of about 14; the two similarity scale systems are therefore different. Furthermore, in applications such as WSD and IR, there is a need to compute the similarity of nouns and verbs in one scale system, which can only be achieved with the Cross-Cluster approach, since it takes into account the granularity of the local clusters in the ontology.

8.4 Semantic Similarity of Open-Class Words in WordNet

In order to evaluate the effect of the granularity feature of the Cross-Cluster approach, the performance of the Hybrid approach, which does not take granularity into account, was used as a baseline. A combined dataset (RGRD) was used, which is a combination of the 65 pairs


of nouns in the RG dataset and the 27 pairs of verbs in the RD dataset. Tables 24 and 25 show the correlations of the Cross-Cluster approach and the baseline approach while tuning the parameters, since no training dataset is available.

Table 24. Similarity of words in the RGRD (65 noun pairs + 27 verb pairs) dataset using the Hybrid measure

  Parameters   α=1,β=1   α=2,β=1   α=3,β=1   α=1,β=2   α=1,β=3
  k=1          0.840     0.839     0.835     0.838     0.832

Although the Cross-Cluster approach can measure the similarity of a noun and a verb across clusters (since nouns can have one or more verb senses and verbs can have one or more noun senses), only pairs of noun senses and pairs of verb senses were considered here.

Table 25. Similarity of words in the RGRD (65 noun pairs + 27 verb pairs) dataset using the Cross-Cluster measure

  Parameters   α=1,β=1   α=2,β=1   α=3,β=1   α=1,β=2   α=1,β=3
  k=1          0.817     0.809     0.802     0.820     0.817

The correlations in Tables 24 and 25 show that taking the granularity feature into account helps improve performance. Furthermore, the Cross-Cluster approach can be even more useful for measuring the similarity of concepts across clusters when the local clusters in the ontology have very different granularity degrees.

8.5 Discussion

The contribution of this work is twofold: (1) the experimental results in Tables 22 and 23 confirm that when the hierarchy representation of the ontology is not richly developed, its contribution will be less than the contribution of a good corpus as an


information source to similarity; (2) The experimental results in Tables 24 and 25 also

show that the granularity feature should be taken into account as an important feature of

semantic similarity across clusters.


9. SEMANTIC SIMILARITY OF CONCEPTS IN A UNIFIED FRAMEWORK:

THE PROPOSED CROSS-ONTOLOGY APPROACH

We have discussed two classes of ontology-based measures in the previous chapters. In this chapter, we discuss two more groups of semantic measures; these measures do not use the IS-A relations in the ontology but instead use properties of the concepts in the ontology.

9.1 The Need for Cross-Ontology Approach

Ontology-based semantic similarity measures can be roughly divided into four groups. The first group consists of ontology-structure-based measures [10, 12, 22, 30], which use the IS-A relations in the ontology as the only information source (i.e., ontology-only measures). The second group consists of information-based measures [9, 11, 23], which use the IS-A relations in the ontology as the primary information source and text corpus statistics as a secondary information source in estimating the similarity between two terms. The third group consists of feature-based measures. Feature-based measures, such as the Tversky approach [29], do not use IS-A relations or text corpus statistics, but instead use a function of concept properties such as the gloss or definition/context of the concepts in the ontology [29]. The last group consists of hybrid measures, such as the Rodriguez and Egenhofer (2003) approach [27], which combine semantic features from the above three groups; the Rodriguez and Egenhofer approach is based on the Tversky approach and uses the depth of the ontology. Feature-based measures are based on set theory, while the ontology-based


measures in the first two groups are derived from spreading activation theory [5, 21, 23]. One of the assumptions of spreading activation theory is that the semantic network is organized along the lines of semantic similarity [23]: the more properties two concepts share, the more links there are between them and the more closely related they are. Although the most primitive semantic distance/similarity measure (path length) was first developed and applied in the MeSH ontology [23], most of the later work centered on WordNet. The existing similarity measures in the four groups can only measure the similarity of two concepts within a single ontology, except for the hybrid measure proposed by Rodriguez and Egenhofer [27], which can measure the similarity of concepts in a single ontology or across ontologies. That measure uses a matching process over synonym sets, semantic neighborhoods, and distinguishing features.

Most of the previous work on semantic similarity in the biomedical domain [1-3, 6, 17, 20] focuses on the similarity of concepts within a single ontology. There are a number of ontologies in the biomedical domain, each of which covers only a subset of the UMLS concepts; therefore, every ontology is missing some concepts. This problem of concepts/terms missing from a given ontology makes it impossible to measure the semantic similarity of the missing concepts. For some applications, such as IR in the biomedical domain [16], there is a need to measure the similarity of all concepts in UMLS. Furthermore, constructing a single ontology for all UMLS concepts is very costly and challenging, because each source represents the view of the community that developed it, and each view is suitable for only a few specific tasks. Therefore, there is a need to measure the semantic similarity of concepts in the UMLS Metathesaurus using the existing sources. With this motivation, an ontology-structure-based semantic distance/similarity approach is proposed that can measure semantic similarity within a single ontology as well as across ontologies in a unified framework such as UMLS. The proposed measure is adapted from (and is an extension of) the Cluster-Based approach, which was developed to compute the similarity between two terms across multiple clusters within a single ontology.


9.2 The Adapted Common Specificity Feature

In this work, the adapted common specificity feature (discussed in Sections 4.3 and 6.2.2) is extended to the cross-ontology approach. This feature takes into account the depth of the least common subsumer of two concepts and the depth of the ontology. The least common subsumer (LCS) node of two concepts C1 and C2 determines the specificity of C1 and C2 in the ontology. The common specificity of two concept nodes is measured by finding the depth of their LCS node and scaling it by the depth D of the ontology as follows:

CSpec(C1,C2) = D − Depth(LCS(C1,C2)) (45)

where D is the depth of the ontology. Thus the CSpec feature determines the common

specificity of two concept nodes in the ontology. The less the common specificity value

of two concepts, the more they have shared information, and thus the more they are

similar. When the depth of LCS of two concept nodes reaches D, the two concept nodes

have the highest common specificity in the ontology which equals to zero (i.e.,

CSpec(C1,C2) = 0).

Figure 9. Two fragments from two ontologies: ontology OA, rooted at r1 and containing the concept nodes ai, and ontology OB, rooted at r2 and containing the concept nodes bi, with the mapping a9 = b2.


9.3 Local Ontology Granularity

Let us consider two fragments from two ontologies, as in Figure 9. The first ontology, OA, contains concepts ai, and the second ontology, OB, contains concepts bi. The depth of ontology OA is 5 and the depth of ontology OB is 4 (by node counting). The relationship between two concept nodes belonging to two different ontologies, such as nodes a3 and b3 in Figure 9, cannot be seen directly; however, once the two ontologies are mapped, their relationship can be seen intuitively in the tree. On the other hand, different ontologies have different granularity degrees; hence, their similarity scales also differ. The effect of the granularity differences between ontologies can be seen by examining the local specificity of concept nodes. The specificity spec(Ci) of a concept Ci in an ontology is defined as follows:

spec(Ci) = Depth(Ci) / Depth (46)

where Depth is the depth of the ontology containing Ci and spec(Ci) ∈ [0, 1]. Note that spec(Ci) = 1 when Ci is a leaf node in the ontology. Then, following Equation 46, the

specificity of a3 and b3, in Figure 9, is calculated as follows:

spec(a3)=4/5 = 0.8

spec(b3) =4/4 = 1.0

Therefore, the local specificity of b3 (1.0) is greater than the local specificity of a3 (0.8), even though the depths of a3 and b3 are equal. That is, b3 is more specific within its ontology than a3, because it lies further down toward the bottom of its ontology and because of the difference in granularity degrees of the two ontologies. Therefore, the local granularities of the ontologies should be taken into account when measuring the semantic similarity of concept nodes across ontologies.


9.4 The Proposed Cross-Ontology Similarity Approach

We want to extend the Cluster-Based approach (Chapter 4) for measuring the semantic similarity of concept nodes to the cross-ontology scale. An ontology is then treated as a cluster; that is, the cluster here is one ontology, and two ontologies can overlap in their sets of controlled concepts. These are ontologies within a unified framework, as discussed in Section 2.3. The following rules and assumptions have to be satisfied in the proposed approach.

9.4.1 Rules and Assumptions

The proposed measure combines all the semantic features discussed above in one measure in an effective and logical way. The following intuitive rules and assumptions should be fulfilled when measuring semantic distance/similarity across ontologies:

Rule R5: The semantic similarity (distance) scale system reflects the degree of similarity of pairs of concept nodes comparably within a single ontology and across ontologies. This rule ensures that mapping ontology OB (called the secondary ontology) to ontology OA (called the primary ontology) does not distort the similarity scale of the primary ontology.

Rule R6: The semantic similarity must obey the local ontology's similarity rules as follows:

Rule R6.1: The shorter the distance between two concept nodes in the ontology, the more similar they are.

Rule R6.2: Lower-level pairs of concept nodes are semantically closer (more similar) than higher-level pairs (i.e., the more information/attributes the two concept nodes share, the more similar they are).

Rule R6.3: The maximum similarity of two concept nodes is reached when they are the same node in the ontology.


Like the above proposed NA measure, the cross-ontology measure also satisfies the two

above assumptions (A1 and A2).

9.4.2 Single Ontology Similarity

In a single ontology, there are two features to combine: the path length (shortest path length) and the common specificity given by Equation 45. When the two concept nodes are the same node (the two concepts are synonymous or identical), the path length is 1 (Path = 1), and the semantic distance value must then reach the minimum regardless of the CSpec feature, by Rule R6.3 (recall that semantic distance is the inverse of semantic similarity). Therefore, the features are combined using the product of the semantic distance features; these constraints may not be satisfied if other combinations of the two features are used. By applying Rules R5 and R6 and the two assumptions (A1 and A2), the proposed measure for a single ontology is:

SemDist(C1,C2) = log((Path − 1)^α × (CSpec)^β + k) (47)

where α > 0 and β > 0 are contribution factors of the two features (Path and CSpec), k is a constant, and Path is the shortest path length between the two concept nodes. If k is zero, the combination is linear; to ensure that the distance is positive and the combination is non-linear, k must be greater than or equal to one (k ≥ 1). When two concepts have a path length of 1 (Path = 1) using node counting, their semantic distance (SemDist) equals zero (assuming k = 1) according to Equation 47, regardless of the CSpec feature.
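A minimal sketch of Equations 45 and 47 on a toy taxonomy is given below; the tiny parent map and helper names are illustrative only and stand in for a real ontology such as WordNet or MeSH:

```python
import math

# Toy IS-A taxonomy as a child -> parent map; the root has parent None.
PARENT = {"root": None, "entity": "root", "animal": "entity",
          "bird": "animal", "crane": "bird", "rooster": "bird"}

def chain_to_root(c):
    """Nodes from c up to the root, inclusive (node counting)."""
    nodes = [c]
    while PARENT[nodes[-1]] is not None:
        nodes.append(PARENT[nodes[-1]])
    return nodes

D = max(len(chain_to_root(c)) for c in PARENT)            # depth of the ontology

def depth(c):
    return len(chain_to_root(c))                           # depth of a node, node counting

def lcs(c1, c2):
    anc1 = chain_to_root(c1)
    return next(a for a in chain_to_root(c2) if a in anc1) # deepest common ancestor

def cspec(c1, c2):
    return D - depth(lcs(c1, c2))                          # Equation 45

def path_len(c1, c2):
    return depth(c1) + depth(c2) - 2 * depth(lcs(c1, c2)) + 1   # shortest path, node counting

def semdist(c1, c2, alpha=1.0, beta=1.0, k=1.0):
    return math.log((path_len(c1, c2) - 1) ** alpha * cspec(c1, c2) ** beta + k)  # Equation 47

print(semdist("crane", "rooster"))   # siblings under "bird" -> small distance
print(semdist("crane", "entity"))    # farther apart -> larger distance
```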

9.4.3 Cross-Ontology Semantic Similarity

In cross-ontology semantic similarity, there are four cases depending on whether the

concept nodes occur in primary or in secondary ontologies. The four cases are as follows:


Case 1: Similarity within the Primary Ontology: If the two concept nodes occur in the primary ontology, the similarity is treated as similarity within a single ontology, using Equation 47 as discussed in Section 9.4.2.

Figure 10. Connecting the two ontology fragments of Figure 9 by merging the equivalent nodes a9 and b2 into a single Bridge node.

Case 2: Cross-Ontology Similarity (Primary-Secondary):

The Common Specificity Feature: In this case, the two concepts belong to two different ontologies. The secondary ontology is connected to the primary ontology by joining the associated/common nodes of the two ontologies (e.g., a9 and b2 in Figure 9). Two ontologies may have many common or equivalent concept nodes; two concepts in two ontologies are equivalent if they refer to the same concept. For example, in Figure 9, suppose that b2 and a9 refer to the same concept (b2 = a9); then b2 and a9 are merged into one node called a Bridge, as in Figure 10, which shows how the two ontologies are mapped and where the Bridge appears. As there can be more than one Bridge node when mapping two ontologies, there can be more than one LCS node ({LCSn}) for the two concepts. The LCS node of two concept nodes (C1, C2) belonging to two ontologies is the LCS of the first node C1 in the primary ontology and a Bridge node, that is:

LCSn(C1,C2) = LCS(C1, Bridgen) (48)

where C1 belongs to the primary ontology (nodes ai) and C2 belongs to the secondary ontology (nodes bi). The path length between two concept nodes in two ontologies passes through



the Bridge node and traverses two ontologies having different granularity degrees. The portion of the path length that lies in the secondary ontology is therefore converted into the primary ontology's scale for the path feature, as explained next.

The Cross-Ontology Path Length Feature: The typical way to calculate the path length between two concept nodes is to add up the two path lengths from each of them to their LCS node. In the cross-ontology approach, the path length between two concept nodes is calculated by adding up the two path lengths from each of them to a Bridge node. For example, the path length between a3 and b3 in Figure 10 is calculated as follows:

Path(a3, b3) = d1 + d2 – 1 (49)

such that:

d1 = d(a3, Bridge), and

d2 = d(b3, Bridge),

where d(a3, Bridge) is the length of the shortest path from a3 to the Bridge, and similarly for d(b3, Bridge). The Bridge is counted twice under node counting, so one is subtracted in Equation 49. Since this is a cross-ontology path, Path(a3, b3) crosses different scales, i.e., d1 and d2 are on different scales. Following our discussion of specificity in Sections 2.2 and 3.1, let us call the first ontology (which contains the nodes ai) the primary ontology, and the second ontology (which contains the nodes bi) the secondary ontology. The granularity rate of the primary ontology over the secondary ontology for the common specificity feature is:

CSpecRate = (D1 − 1) / (D2 − 1) (50)

where (D1-1) and (D2 -1) are maximum common specificity values of the primary and

secondary ontologies respectively (D1 and D2 are depth of primary ontology and

secondary ontology respectively). The granularity rate of the primary ontology over the

secondary ontology for the path feature is given by:


PathRate = (2D1 − 1) / (2D2 − 1) (51)

where (2D1-1) and (2D2 -1) are maximum path values of two concept nodes in the

primary and secondary ontology respectively. Following Rule R5, d2 (in Equation 49) in

the secondary ontology is scaled to the primary ontology as follows:

d'2 = PathRate × d2 (52)

This new path length d’2 reflects the path length of the second concept node to the Bridge

node relative to the primary ontology granularity scale of path feature. Applying

Equation 52, the cross path length between the two concept nodes in primary ontology

scale of path feature is given as follows:

Path(C1,C2) = d1 + PathRate × d2 − 1 (53)

Path(C1,C2) = d1 + ((2D1 − 1) / (2D2 − 1)) × d2 − 1 (54)

Recall that there can be more than one Bridge node, therefore, there can be more than one

path length between the two concept nodes ({Pathn}). Finally, the semantic distance

(SemDist) between two concept nodes is given as follows:

CSpecn(C1,C2) = D1 − Depth(LCS(C1, Bridgen)) (55)

SemDistn(C1,C2) = log((Pathn − 1)^α × (CSpecn)^β + k) (56)

SemDist(C1,C2) = min_n { SemDistn(C1,C2) } (57)

where Pathn is the path length of two concepts calculated via Bridgen. The semantic

distance between two concepts is chosen as the minimum among all possible paths.
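The Case-2 computation (Equations 51-57) can be sketched as follows, assuming the within-ontology distances to each Bridge node and the LCS depths have already been computed; all names and numbers are illustrative:

```python
import math

def cross_ontology_semdist(bridges, D1, D2, alpha=1.0, beta=1.0, k=1.0):
    """Case 2: C1 in the primary ontology, C2 in the secondary ontology.

    `bridges` is a list of (d1, d2, lcs_depth) tuples, one per Bridge node:
      d1        -- node-counting path length from C1 to the bridge in the primary ontology
      d2        -- node-counting path length from C2 to the bridge in the secondary ontology
      lcs_depth -- depth of LCS(C1, Bridge) in the primary ontology
    D1 and D2 are the depths of the primary and secondary ontologies.
    """
    path_rate = (2 * D1 - 1) / (2 * D2 - 1)             # Equation 51
    best = float("inf")
    for d1, d2, lcs_depth in bridges:
        path_n = d1 + path_rate * d2 - 1                 # Equations 52-54
        cspec_n = D1 - lcs_depth                         # Equation 55
        dist_n = math.log((path_n - 1) ** alpha * cspec_n ** beta + k)   # Equation 56
        best = min(best, dist_n)                         # Equation 57: keep the minimum
    return best

# Example using the Figure 10 fragments (one bridge a9 = b2), with illustrative numbers:
print(cross_ontology_semdist(bridges=[(4, 2, 2)], D1=5, D2=4))
```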

Case 3: Similarity within a Single Secondary Ontology: The third case arises when the two concept nodes are both in a single secondary ontology. The semantic


distance features in this case must be converted to the primary ontology's scales for the two features as follows:

Path(C1,C2) = Path(C1,C2)secondary × PathRate (58)

CSpec(C1,C2) = CSpec(C1,C2)secondary × CSpecRate (59)

SemDist(C1,C2) = log((Path − 1)^α × (CSpec)^β + k) (60)

where Path(C1,C2)secondary and CSpec(C1,C2)secondary are the Path and CSpec between C1 and C2 computed in the secondary ontology, and PathRate and CSpecRate are given by Equations 51 and 50.
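A corresponding sketch for Case 3, again with illustrative names, simply rescales the two features computed inside the secondary ontology (Equations 58-60, using the rates of Equations 50 and 51) before applying the same distance formula:

```python
import math

def case3_semdist(path_secondary, cspec_secondary, D1, D2, alpha=1.0, beta=1.0, k=1.0):
    """Both concepts lie in one secondary ontology; rescale features to the primary ontology."""
    path = path_secondary * (2 * D1 - 1) / (2 * D2 - 1)       # Equation 58 with Equation 51
    cspec = cspec_secondary * (D1 - 1) / (D2 - 1)             # Equation 59 with Equation 50
    return math.log((path - 1) ** alpha * cspec ** beta + k)  # Equation 60

print(case3_semdist(path_secondary=3, cspec_secondary=2, D1=18, D2=14))
```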

Case 4: Similarity within Multiple Secondary Ontologies: The fourth case occurs when the two concept nodes are in two different secondary ontologies (i.e., neither of them exists in the primary ontology). In this case, one of the two secondary ontologies acts momentarily as a primary ontology to calculate the semantic features (viz. Path and CSpec) using Case 2 above. Then, the semantic similarity is computed using Case 3 to scale the features (again) to the scale level of the primary ontology.

9.4.4 Choosing the Secondary Ontologies

In the biomedical domain, within the UMLS framework, many ontologies overlap in their sets of UMLS concepts, so one problem stands out: which ontology should be chosen as the secondary ontology? Let us examine the four cases above again.

Case 1: In this case there is only the primary ontology, so there is no need to choose a secondary ontology.

Case 2: In this case the second concept may belong to many ontologies in the unified framework (i.e., UMLS), and the problem is which ontology should be mapped onto the primary ontology for similarity. The proposed cross-ontology approach uses a strategy for choosing the secondary ontology, given the primary ontology, that is mainly based on


two points. The first is that the more the two ontologies overlap, the better it is for the similarity of two concepts dispersed across them. The second is that the secondary ontology should be the one with the higher granularity degree. For this, a metric is proposed to measure the "goodness" of choosing a secondary ontology: the higher the goodness value, the better the ontology is as a secondary ontology for mapping. The metric is as follows:

goodness(Op, Os) = ( |Op ∩ Os| / |Op ∪ Os| ) × ( Ds / Dp ) (61)

where:

- Op is the primary ontology and Os is a source ontology being examined for its goodness as the secondary ontology.
- Op ∩ Os is the set of concepts common to the two ontologies.
- Op ∪ Os is the union of the two ontologies' concept sets.
- Dp and Ds are the depths of the primary ontology and the source ontology, respectively.

Case 3: In this case, the two concepts are both in one source ontology; however, many source ontologies may contain both concepts, and the problem is which source ontology should be chosen as the secondary ontology. Here, Equation 61 is used to determine the secondary ontology. Note that Case 3 includes the situation in which the two concepts belong to one source (secondary) ontology but one of them also belongs to the primary ontology.

Case 4: In this case, the two concepts belong to two different source ontologies, but many source ontologies may contain each concept, so the problem is which source ontology should be chosen for each of the two concepts. First, among the source ontologies containing the first concept, the one with the highest granularity degree is chosen. Then the goodness metric, Equation 61, is used to determine which of the ontologies containing the second concept is most suitable as the secondary ontology, with the ontology chosen for the first concept acting as a temporary primary ontology.
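A small sketch of this selection strategy (Equation 61) is shown below; each ontology is modelled simply as a set of concept identifiers plus a depth, and all names and numbers are illustrative:

```python
def goodness(primary_concepts, primary_depth, source_concepts, source_depth):
    """Equation 61: overlap ratio of the two concept sets times the depth ratio Ds/Dp."""
    overlap = len(primary_concepts & source_concepts) / len(primary_concepts | source_concepts)
    return overlap * (source_depth / primary_depth)

def choose_secondary(primary, candidates):
    """Pick the candidate source ontology with the highest goodness w.r.t. the primary."""
    concepts_p, depth_p = primary
    return max(candidates,
               key=lambda name: goodness(concepts_p, depth_p, *candidates[name]))

# Toy data: concept-ID sets and depths for a primary ontology and two candidate sources.
primary = ({"C1", "C2", "C3", "C4"}, 10)
candidates = {"SourceA": ({"C3", "C4", "C5"}, 12),
              "SourceB": ({"C4", "C6"}, 16)}
print(choose_secondary(primary, candidates))   # SourceA: larger overlap outweighs SourceB's depth
```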


9.5. Evaluation

9.5.1 Testing Dataset

To evaluate the approach across ontologies, a dataset containing term pairs as in Cases 2, 3, and 4 should be used. For example, for Case 2, the test set should contain concept pairs (C1, C2) such that one concept (C1) belongs only to the primary ontology and the other concept (C2) belongs to a secondary ontology, with both ontologies in the unified framework. Since no such dataset with human ratings exists, datasets from two domains were combined: the general English domain and the biomedical domain. For this, the RG dataset and Datasets 1 and 2 were used in the experiments.

9.5.2 Tools and Information Sources

WordNet 2.0 was used as the primary ontology, and MeSH [33] and SNOMED-CT [35] were used as secondary ontologies. The Perl module WordNet::Similarity, developed by Pedersen et al. [19], was used to implement the proposed approach for measuring the semantic distance of concepts found in WordNet 2.0. The MeSH database and the MeSH Browser, available at http://www.nlm.nih.gov/mesh/meshhome.html, were used to obtain information on biomedical terms in MeSH, and the UMLSKS Browser, available at http://umlsks.nlm.nih.gov, was used to obtain information on biomedical concepts in SNOMED-CT.

9.5.3 Experimental Results

9.5.3.1 Experiments on Single Ontology: WordNet

The proposed approach was first evaluated on a single ontology. On a single ontology, the approach performs very well, surpassing other existing measures in the biomedical domain in the previous experiments. In this single-ontology experiment, the RG dataset and WordNet 2.0 were used, and the results are given in Table 26 using the default parameters (α=1, β=1, k=1). The purpose of this experiment is simply to show that the method achieves sound


results of correlation using a standard dataset (RG) and a large and reliable WordNet

ontology.

Table 26. Absolute correlation of the proposed approach on the RG dataset and WordNet 2.0

  No.   Parameters        Correlation
  1     α=1, β=1, k=1     0.815

To evaluate the approach across ontologies, a dataset containing term pairs dispersed over two ontologies is needed. For this, the RG dataset (65 pairs) was combined with the two biomedical datasets in three combinations as follows:

(a) RG (65 pairs) + Dataset 1 (30 pairs): total 95 pairs.

(b) RG (65 pairs) + Dataset 2 (36 pairs): total 101 pairs.

(c) RG + Dataset 1 + Dataset 2: total 131 pairs.

WordNet was used for the RG words/terms, and MeSH or SNOMED-CT was used for the terms/concepts of Dataset 1 and Dataset 2. WordNet was considered the primary ontology while MeSH/SNOMED-CT was the secondary ontology. Then, on these three dataset combinations, (a)-(c), two evaluations were conducted: one using WordNet and MeSH, and the other using WordNet and SNOMED-CT.

Table 27. Absolute correlations of the proposed approach using WordNet and MeSH

  No.   Dataset                                                                             Pairs   Correlation
  1     WordNet (RG, 65 pairs) + MeSH (Dataset 1, 25 pairs)                                 90      0.808
  2     WordNet (RG, 65 pairs) + MeSH (Dataset 2, 36 pairs)                                 101     0.804
  3     WordNet (RG, 65 pairs) + MeSH (Dataset 1, 25 pairs) + MeSH (Dataset 2, 36 pairs)    126     0.814

  Average number of tested pairs: 105.7; average correlation: 0.809


9.5.3.2 Experiments Using WordNet and MeSH

In these experiments, WordNet was used as the primary, general-domain ontology and MeSH was used as the secondary ontology. Three experiments were conducted using the three dataset combinations (a), (b), and (c). In the first experiment, using combination (a), only 25 pairs (out of the 30 pairs in Dataset 1) were found in MeSH. Thus, the similarity of the 65 RG pairs was computed as within the single WordNet ontology, and the 25 term pairs were handled with the cross-ontology technique (Case 3). In the second experiment, dataset combination (b) was tested using WordNet and MeSH. In the third experiment, the three datasets were combined, combination (c), with a total of 126 pairs distributed between WordNet and MeSH. The results are shown in Table 27. In these experiments, the proposed method achieved an average correlation of ~81% with the human scores using, on average, ~106 term pairs and two ontologies. The complete per-pair results using combination (b) (the second experiment) are shown in Table 29. The human rating scores of the RG dataset were converted to the [0-1] scale to be compatible with the human ratings of Dataset 2.
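The rescaling of the RG ratings is straightforward; a minimal sketch, assuming the standard 0-4 RG rating scale, is:

```python
def to_unit_scale(rg_score, max_score=4.0):
    """Map an RG human rating from the 0-4 scale onto [0, 1]."""
    return rg_score / max_score

print(to_unit_scale(3.92))   # e.g. a near-synonym pair maps to 0.98
```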

Table 28. Absolute correlations of the proposed approach using WordNet and SNOMED-CT

  No.   Dataset                                                                                            Pairs   Correlation
  1     WordNet (RG, 65 pairs) + SNOMED-CT (Dataset 1, 29 pairs)                                           94      0.778
  2     WordNet (RG, 65 pairs) + SNOMED-CT (Dataset 2, 34 pairs)                                           99      0.700
  3     WordNet (RG, 65 pairs) + SNOMED-CT (Dataset 1, 29 pairs) + SNOMED-CT (Dataset 2, 34 pairs)         128     0.757

  Average number of tested pairs: 107; average correlation: 0.745


9.6. Discussion

A cross-ontology semantic distance/similarity approach has been presented and applied in the biomedical domain; it can, however, be applied in other domains within a unified framework. One of the problems in measuring semantic similarity between concepts using an ontology is that certain terms in the dataset are missing from the underlying ontology. This problem stands out more clearly in specific domains (e.g., the bioinformatics domain) than in general domains. For example, in biomedical IR there is a great need to measure the semantic similarity between biomedical terms/concepts and documents [16], and there are several potential ontologies. It can very well be that not all the concepts are found in a single ontology (that is, the concepts are dispersed over more than one ontology). In that case, concepts missing from the ontology cannot be measured for similarity and are simply skipped; see, for example, [15]. This work discussed and evaluated an ontology-based approach that can measure the semantic similarity of concepts in a single ontology or in multiple ontologies (cross-ontology) within a unified framework such as UMLS or NCI. This work lays a foundation for further structure and advances in ontology integration and cross-ontology research in the biomedical domain. The experimental results show that the proposed approach is very promising and performs quite well, with very good correlations with human scores.


Table 29. Biomedical Dataset 2 (36 pairs) and RG dataset (65 pairs) with human similarity scores (Human) and SemDist's scores using WordNet and MeSH

  Concept 1                             Concept 2                   Human   SemDist

  Biomedical Dataset 2 (36 pairs):
  Anemia                                Appendicitis                0.031   4.69
  Meningitis                            Tricuspid Atresia           0.031   4.69
  Sinusitis                             Mental Retardation          0.031   4.69
  Dementia                              Atopic Dermatitis           0.062   4.83
  Acquired Immunodeficiency Syndrome    Congenital Heart Defects    0.062   4.54
  Bacterial Pneumonia                   Malaria                     0.156   4.69
  Osteoporosis                          Patent Ductus Arteriosus    0.156   4.83
  Amino Acid Sequence                   Anti Bacterial Agents       0.156   5.24
  Otitis Media                          Infantile Colic             0.156   4.94
  Hyperlipidemia                        Hyperkalemia                0.156   3.92
  Neonatal Jaundice                     Sepsis                      0.156   4.69
  Asthma                                Pneumonia                   0.187   3.64
  Hypothyroidism                        Hyperthyroidism             0.357   3.25
  Sarcoidosis                           Tuberculosis                0.406   5.05
  Sickle Cell Anemia                    Iron Deficiency Anemia      0.406   4.01
  Adenovirus                            Rotavirus                   0.437   4.14
  Lactose Intolerance                   Irritable Bowel Syndrome    0.468   4.01
  Hypertension                          Kidney Failure              0.500   4.83
  Diabetic Nephropathy                  Diabetes Mellitus           0.500   3.25
  Pulmonary Valve Stenosis              Aortic Valve Stenosis       0.531   3.12
  Hepatitis B                           Hepatitis C                 0.562   2.97
  Vaccines                              Immunity                    0.593   4.79
  Psychology                            Cognitive Science           0.593   2.47
  Failure to Thrive                     Malnutrition                0.625   4.69
  Urinary Tract Infection               Pyelonephritis              0.656   3.92
  Migraine                              Headache                    0.718   4.72
  Myocardial Ischemia                   Myocardial Infarction       0.750   2.33
  Carcinoma                             Neoplasm                    0.750   3.75
  Breast Feeding                        Lactation                   0.843   0.00
  Seizures                              Convulsions                 0.843   0.00
  Pain                                  Ache                        0.875   0.00
  Malnutrition                          Nutritional Deficiency      0.875   0.00
  Down Syndrome                         Trisomy 21                  0.875   0.00
  Measles                               Rubeola                     0.906   0.00
  Antibiotics                           Antibacterial Agents        0.937   0.00
  Chicken Pox                           Varicella                   0.968   0.00

  RG dataset (65 pairs):
  cord                                  smile                       0.005   5.26
  rooster                               voyage                      0.010   5.78
  noon                                  string                      0.010   5.24
  fruit                                 furnace                     0.013   4.44
  autograph                             shore                       0.015   5.32
  automobile                            wizard                      0.028   5.20
  mound                                 stove                       0.035   4.44
  grin                                  implement                   0.045   5.40
  asylum                                fruit                       0.048   4.44
  asylum                                monk                        0.098   5.02
  graveyard                             madhouse                    0.105   5.42
  glass                                 magician                    0.110   4.66
  boy                                   rooster                     0.110   4.97
  cushion                               jewel                       0.113   4.44
  monk                                  slave                       0.143   4.04
  asylum                                cemetery                    0.198   5.18
  coast                                 forest                      0.213   4.51
  grin                                  lad                         0.220   5.40
  shore                                 woodland                    0.225   4.33
  monk                                  oracle                      0.228   4.60
  boy                                   sage                        0.240   4.26
  automobile                            cushion                     0.243   4.65
  mound                                 shore                       0.243   3.97
  lad                                   wizard                      0.248   4.04
  forest                                graveyard                   0.250   4.98
  food                                  rooster                     0.273   5.34
  cemetery                              woodland                    0.295   4.98
  shore                                 voyage                      0.305   5.32
  bird                                  woodland                    0.310   4.80
  coast                                 hill                        0.315   3.97
  furnace                               implement                   0.343   4.26
  crane                                 rooster                     0.353   4.16
  hill                                  woodland                    0.370   4.33
  car                                   journey                     0.388   5.40
  cemetery                              mound                       0.423   5.08
  glass                                 jewel                       0.445   4.51
  magician                              oracle                      0.455   4.44
  crane                                 implement                   0.593   3.97
  brother                               lad                         0.603   4.04
  sage                                  wizard                      0.615   4.26
  oracle                                sage                        0.653   4.19
  bird                                  crane                       0.658   3.33
  bird                                  cock                        0.658   2.30
  food                                  fruit                       0.673   4.73
  brother                               monk                        0.685   2.48
  asylum                                madhouse                    0.760   2.30
  furnace                               stove                       0.778   4.60
  magician                              wizard                      0.803   0.00
  hill                                  mound                       0.823   0.00
  cord                                  string                      0.853   2.56
  glass                                 tumbler                     0.863   2.48
  grin                                  smile                       0.865   0.00
  serf                                  slave                       0.865   3.69
  journey                               voyage                      0.895   2.48
  autograph                             signature                   0.898   2.48
  coast                                 shore                       0.900   2.56
  forest                                woodland                    0.913   0.00
  implement                             tool                        0.915   2.56
  cock                                  rooster                     0.920   0.00
  boy                                   lad                         0.955   2.56
  cushion                               pillow                      0.960   2.56
  cemetery                              graveyard                   0.970   0.00
  automobile                            car                         0.980   0.00
  midday                                noon                        0.985   0.00
  gem                                   jewel                       0.985   0.00


10. DISCUSSION AND FUTURE WORK

10.1 Directions

10.1.1 Adapting Existing Ontology-based Measures for Cross-Ontology Similarity

In this thesis, a cross-ontology approach for measuring the semantic similarity of concepts in a unified framework has been introduced and explained. The approach is based on the following points: (1) mapping two ontologies based on the overlap between their sets of concept nodes, (2) taking the granularity degrees of the ontologies into account, and (3) a strategy for choosing the secondary ontology for missing concepts. This approach can also be applied to existing ontology-structure-based measures, adapting them for measuring the semantic similarity of concepts in a unified framework. For example, the cross-ontology path length feature can be employed in the Path Length and Leacock and Chodorow measures.

10.1.2 Semantic Similarity and Application in Information Retrieval

PubMed [36] is a service of the U.S. National Library of Medicine (NLM) that includes

over 16 million citations from MEDLINE and other life science journals for biomedical

articles back to the 1950s.

A Case Study: Semantic Similarity of Concepts in IR in Biomedical Domain


Previous work by Mao and Chu [16] shows that a concept-based vector space model (VSM) performs better than a stem-based VSM in medical document retrieval (concept-based VSM > stem-based VSM). The concept-based VSM uses MeSH concepts, so documents and queries are represented by MeSH headings. Moreover, a concept-based VSM that takes "concept interrelation" into account (a concept-interrelation-based VSM), by using semantic similarity measures to represent the interrelations of concepts, improves on the plain concept-based VSM (concept-interrelation-based VSM > concept-based VSM). Furthermore, the most important and comprehensive databases in the biomedical domain, such as MEDLINE, contain concept-structured records; for example, each citation of an article in MEDLINE contains a set of cited/indexed MeSH concepts. In addition, most IR/search systems [36] and IR research in this domain use concepts limited to MeSH. One of the reasons is that there has been no technique for measuring the semantic similarity of all UMLS concepts dispersed across multiple ontologies.

One of the most popular search engines in the biomedical domain is PubMed-Entrez, developed by the NLM. It is a Boolean search engine: the input text string is parsed into MeSH terms and text words, and it therefore uses the MeSH thesaurus for indexing. As a result, PubMed-Entrez is limited to MeSH headings only. The following example clearly shows the limitation of using only a single terminology source in retrieval.

Figure 11. Two fragments of SNOMED-CT (left) and MeSH (right).


The MeSH thesaurus (ontology) contains only about 23K headings, or concept scopes, which is a small subset of the roughly 1.3 million concepts (concept classes) in UMLS. A problem stands out when a user wants to search for a concept/entity that is not found in the MeSH thesaurus: the results may not really satisfy the query. For example, the concept "Stomach cramps" is not found in MeSH but is found in SNOMED-CT. When a query with "Stomach cramps" is entered into PubMed-Entrez, the query is parsed as follows:

("stomach"[MeSH Terms] OR Stomach[Text Word]) AND (("muscle cramp"[TIAB] NOT Medline[SB]) OR "muscle cramp"[MeSH Terms] OR cramps[Text Word])

The query is parsed into two MeSH headings, (1) "stomach" and (2) "muscle cramp", because the MeSH thesaurus/ontology does not contain the concept "stomach cramps". Clearly, the search engine should not be limited to MeSH headings only.

Moreover, in their work, Mao and Chu [16] used an ontology-structure-based semantic similarity measure that they developed themselves to calculate the interrelationship (similarity) between concepts. In fact, there are many semantic measures whose performance differs depending on the application, the information sources, and so on; therefore, that work also has some limitations.

According to the above discussion, there are some questions that most current research in biomedical IR cannot answer:

1. What search model is best suited for concept-structured biomedical databases: a concept-interrelation-based VSM, a Boolean model, or other models?

2. Is there a need to develop a "new" IR model, such as a combination of concept-based and stem-based/phrase-based techniques?

3. What set of vocabulary sources in UMLS is best used for IR in the biomedical domain?


4. In concept-interrelation-based IR models, an interrelation between concepts is represented by a similarity value given by a semantic similarity measure. The issue is which measure is most suitable for these tasks, since different measures and different groups of measures (ontology-structure-based measures, information-based measures, etc.) perform differently in different applications and situations.

10.1.3 The Need for Topic Similarity and a New Information Retrieval Model

Each MEDLINE record contains a citation indexed by about 10-15 MeSH headings; therefore, with (concept-interrelation-based) VSMs, the similarity score between two documents represented as two vectors will tend to be low, since each record/document contains only a small number of cited concepts. A Boolean retrieval model, utilizing the indexing of MeSH headings for each document, should therefore be a good choice, and the extended Boolean model should be a good model for retrieving documents in PubMed/MEDLINE. However, the concepts in the MeSH ontology are represented hierarchically and hence support semantic search.

In the (extended) Boolean model, a document containing "A1 and B1" will not strongly satisfy a search query like "A and B" when A1 and B1 are subconcepts of A and B, respectively. Thus, a new semantic model should be developed to exploit the semantics of the concepts represented in the ontology. This leads to the development of topic similarity as the core technique of the new semantic model, instead of the semantic measures applied in VSMs.
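As an illustration of the kind of semantic matching that the plain (extended) Boolean model lacks, the following sketch (purely illustrative; not part of any existing system) treats a query concept as satisfied by the concept itself or any of its subconcepts:

```python
def satisfies(query_concepts, doc_concepts, is_a):
    """Semantic Boolean AND: every query concept must be matched by a document
    concept that is either the same concept or one of its subconcepts."""
    def matches(q, d):
        return d == q or is_a(d, q)          # d is q itself or a descendant of q
    return all(any(matches(q, d) for d in doc_concepts) for q in query_concepts)

# Toy IS-A relation: A1 is a subconcept of A, B1 of B.
PARENTS = {"A1": "A", "B1": "B"}
def is_a(child, ancestor):
    while child in PARENTS:
        child = PARENTS[child]
        if child == ancestor:
            return True
    return False

# A document indexed with {A1, B1} satisfies the query "A AND B" under semantic matching.
print(satisfies({"A", "B"}, {"A1", "B1"}, is_a))   # True
```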

In addition, combinations of text-word-based models with other models, such as Boolean models and VSMs, should be investigated, with the contributions of the components weighted in the combination approach.


10.2 Discussion

This thesis centers on semantic similarity techniques in two domains: the general English domain and the biomedical domain. Through the application of existing techniques to the biomedical domain, the development of new techniques applied in both domains, and several investigations, we make the following observations:

1. The application of existing ontology-structure-based measures to the biomedical domain yields good results when tested on two datasets using two ontologies.

2. In the general English domain (WordNet), the effect of the common specificity feature of two concept nodes cannot be seen, because the datasets are small (e.g., MC contains only 30 term pairs, rated by 38 human subjects) and the term pairs are dispersed over many taxonomies; hence, the effect of specificity does not show up in the experiments [12]. In the biomedical domain, however, the ontology covers concepts of a specific domain and a large number of term pairs belong to a single taxonomy, so the specificity feature can be seen clearly. We observed that, in the good-performance cases in this domain, the path length and common specificity features contribute equally to semantic similarity.

3. Although the information content technique was developed to augment the pure ontology-structure-based measures, most previous work shows that information-based measures using information content do not surpass the ontology-structure-based measures [4]. This thesis (Chapters 6-8) shows how to use a text corpus effectively in measuring the semantic similarity of concepts. Furthermore, the experiments in Chapter 8 present a new view of using semantic features by considering the respective contributions of the lexical representation of concepts in the network and of corpus statistics to similarity. In this direction, we obtained promising results in computing the semantic similarity of verbs in WordNet, given that the verb network is not richly developed.


4. Most of the semantic similarity work in the biomedical domain, especially work using the MeSH ontology, is limited to the ontology itself as the primary information source; therefore, there is a need to investigate the creation of standard corpora in this domain. The experimental results in Chapter 5 show that MEDLINE is a promising corpus for this domain, especially for MeSH concepts.

5. The proposed cross-ontology approach is essentially based on the granularity of the ontologies and on the mapping approach; hence, it can be applied to adapt other existing ontology-structure-based measures for cross-ontology semantic similarity. The proposed cross-ontology approach is a novel approach within a unified framework in the biomedical domain. In the unified framework, all the concepts are unified, unlike the general English domain, where the same concept can have different names, which makes mapping ontologies difficult.

10.3 Conclusion

This thesis introduces and presents a number of approaches for computing semantic similarity/distance in the general English domain as well as in the biomedical domain. Semantic distance is the inverse of semantic similarity, and semantic similarity techniques compute the semantic similarity (i.e., the common shared information) of concepts or concept classes according to certain language or domain resources such as ontologies, taxonomies, and corpora. The thesis also presents the related work and relevant techniques in the biomedical domain and discusses some new directions related to semantic similarity, topic similarity, and information retrieval (IR) in this domain. The key contribution of this thesis is a novel semantic distance approach that can measure the semantic distance/similarity between two concepts in a unified framework comprising many ontologies that overlap in a set of controlled concepts. The proposed techniques have been extensively evaluated in the biomedical and general English domains. The experimental results confirmed the superiority and efficiency of the


proposed techniques in computing semantic similarity/distance within a single ontology and across multiple ontologies.

In future work, we would like to develop and collect a cross-ontology semantic similarity dataset in the biomedical domain for evaluating semantic similarity techniques and supporting research on this task. We will further investigate and explore the various IR models and model combinations to be adapted and applied to the biomedical domain, in order to benefit from the proposed cross-ontology semantic techniques and to exploit the numerous biomedical ontologies and taxonomies within UMLS.


11. REFERENCES

[1] Al-Mubaid, H. and Nguyen, H.A. A Cluster-Based Approach for Semantic Similarity

in the Biomedical Domain, In Proc. The 28th Annual International Conference of the

IEEE Engineering in Medicine and Biology Society EMBS’06, New York, USA,

September 2006.

[2] Al-Mubaid, H. and Nguyen, H.A. Using MEDLINE as Standard Corpus for

Measuring Semantic Similarity of Concepts in the Biomedical Domain. In Proc. The

2006 IEEE 6th Symposium on Bioinformatics & Bioengineering BIBE-06,

Washington D.C., USA, October 2006. pp.315-319.

[3] Al-Mubaid, H. and Nguyen, H.A. Semantic Distance of Concepts within a Unified

Framework in the Biomedical Domain. Accepted paper, 22nd Annual ACM

Symposium on Applied Computing SAC’07, forthcoming March 2007.

[4] Budanitsky, A. and Hirst, G. Evaluating WordNet-based measures of semantic

distance, Computational Linguistics, vol.32,1, March 2006.

[5] Collins, A. and Loftus, E. A spreading activation theory of semantic processing. Psychological Review, 82, 407-428, 1975.

[6] Caviedes, J. and Cimino, J. Towards the development of a conceptual distance metric

for the UMLS. Journal of Biomedical Informatics 37,77-85, 2004.

[7] Francis, W.N. and Kucera, H. Brown Corpus Manual—Revised and Amplified, Dept.

of Linguistics, Brown Univ., Providence, R.I., 1979.

[8] Hliaoutakis, A. Semantic Similarity Measures in MeSH Ontology and their

application to Information Retrieval on Medline. Master’s thesis, Technical

University of Crete, Greek. 2005.


[9] Jiang, J.J, and Conrath, D.W. Semantic similarity based on corpus statistics and

lexical ontology. In Proc. on International Conference on Research in Computational

Linguistics, 19–33,1997.

[10] Leacock, C., and Chodorow, M. Combining local context and WordNet similarity

for word sense identification. In Fellbaum, C., ed., WordNet: An electronic lexical

database. MIT press.265-283, 1998.

[11] Lin, D. An information-theoretic definition of similarity. In Proc. of the Int’l

Conference on Machine Learning, 1998.

[12] Li, Y., Bandar, Z. A. and McLean D., An Approach for Measuring Semantic

Similarity between Words Using Multiple Information Sources. IEEE Transactions

on Knowledge and Data Engineering, 15, 4(2003), 871-882.

[13] Miller, G.A. WordNet: A Lexical Database for English Comm. ACM 38,11(1995),

39-41.

[14] Miller, G.A and Charles, W.G. Contextual Correlates of Semantic similarity.

Language and Cognitive Processes, 6, 1(1991), 1-28.

[15] Miller, G.A., Leacock, C., Randee,T and Bunker, R.T. A semantic concordance. In

Proc. of the 3rd DARPA workshop on Human Language Technology, pp.303–308.

Plainsboro, New Jersey, 1993.

[16] Mao, W. and Chu, W.W. Free-text medical document retrieval via phrase-based

vector space model, In Proc. AMIA Symp 2002 ; ()489-93,2002.


[17] Nguyen, H.A. and Al-Mubaid, H. A New Ontology-based Semantic Similarity

Measure for the Biomedical Domain. In Proc. IEEE International Conference on

Granular Computing GrC’06 , GA,USA, May 2006.

[18] Nguyen, H.A. and Al-Mubaid, H. A Combination-based Semantic Similarity

Approach Using Multiple Information Sources. In Proc. The 2006 IEEE International

Conference on Information Reuse and Integration IEEE IRI 2006, Hawaii, USA,

September 2006.

[19] Pedersen,T., Patwardhan, S., and Michelizzi, J. WordNet::Similarity-Measuring The

Relatedness of Concepts,”. In Proc. of the Nineteenth National Conference on

Artificial Intelligence (AAAI-04). San Jose, CA, 2004.

[20] Pedersen,T., Pakhomov, S. and Patwardhan,S. Measures of Semantic Similarity and

Relatedness in the Medical Domain, University of Minnesota Digital Technology

Center Research Report DTC 2005/12.

[21] Quillian, M.R. Semantic Memory, In Minsky, M.(Ed.), Semantic Information

Processing, MIT Press, Cambridge, MA, 1968.

[22] Rada, R., Mili, H. Bicknell, E. and Blettner, M. Development and Application of a

Metric on Semantic Net. IEEE Transactions on Systems, Man and Cybernetics,

19,1(1989),17-30.

[23] Resnik, P. Using information content to evaluate semantic similarity in ontology. In

Proc. of the 14th intl Joint Conference on Artificial Intelligence,448–453,1995.

[24] Resnik, P. and Diab, M. Measuring Verb Similarity. In Proc. of the Twenty-Second Annual Meeting of the Cognitive Science Society (COGSCI 2000), Philadelphia, August 2000.


[25] Rubenstein, H. and Goodenough, J.B. Contextual Correlates of Synonymy. Comm. ACM, 8, 627-633, 1965.

[26] Richardson, R., Smeaton, A.F., and Murphy, J. Using WordNet as a Knowledge Base for Measuring Semantic Similarity. Working Paper CA-1294, School of Computer Applications, Dublin City University, Dublin, 1994.

[27] Rodriguez, M.A. and Egenhofer, M.J. Determining Semantic Similarity Among Entity Classes from Different Ontologies. IEEE Transactions on Knowledge and Data Engineering, 15, 2(2003), 442-456.

[28] Shepard, R.N. Toward a Universal Law of Generalization for Psychological Science. Science, 237, 1317-1323, 1987.

[29] Tversky, A. Features of Similarity. Psychological Review, 84, 4(1977), 327-352.

[30] Wu, Z. and Palmer, M. Verb Semantics and Lexical Selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, 133-138, 1994.

[31] UMLS: Unified Medical Language System. Available: http://www.nlm.nih.gov/research/umls/

[32] XML MeSH. Available: http://www.nlm.nih.gov/mesh/xmlmesh.html

[33] MeSH. Available: http://www.nlm.nih.gov/mesh/meshhome.html

[34] UMLSKS. Available: http://umlsks.nlm.nih.gov


[35] SNOMED-CT. Available: http://www.snomed.org/index.html

[36] PubMed. Available: http://www.ncbi.nlm.nih.gov

[37] MEDLINE. Available: http://www.cas.org/ONLINE/DBSS/medliness.html

[38] NCI. Available: http://www.cancer.gov/cancertopics/terminologyresources

[39] MeSH Browser. Available: http://www.nlm.nih.gov/mesh/MBrowser.html

[40] The Semantic Vocabulary Interoperation Project. Available: http://lsdis.cs.uga.edu/~kashyap/projects/SVIP/


Appendix A

MeSHSimPack: A LIBRARY FOR MEASURING SEMANTIC SIMILARITY OF MESH CONCEPTS

1. Introduction

MeSHSimPack is a C# library for measuring the semantic distance/similarity between MeSH headings. The library includes two main modules: (1) MeSHQueryData, for querying information about MeSH headings, and (2) MeSHSimilarity, for measuring the semantic distance/similarity of concepts with eight implemented ontology-based semantic measures. Of these eight measures, three are semantic distance measures and five are semantic similarity measures. The information-based measures among them (Resnik, Jiang and Conrath, and Lin) use MEDLINE as the corpus for the information content of MeSH headings.

2. Semantic Measures

Ontology-based semantic similarity measures use the hierarchical relations of an ontology as their primary information source and may use a corpus as a secondary information source; they are derived from spreading activation theory. By contrast, distributional similarity approaches, which are based on concept/word occurrences, can also use the "scope note" or "gloss" of a concept in the ontology. The framework focuses on implementing ontology-based semantic measures, including Path Length, Wu and Palmer, Leacock and Chodorow, Li et al., NA, Resnik, Jiang and Conrath, and Lin.
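To make the path/depth-based family above concrete, the following is a minimal sketch (in C#, the library's language) of two of the listed measures, Wu and Palmer and Leacock and Chodorow. It assumes the shortest path lengths and depths have already been obtained from the ontology, and it illustrates the standard formulas rather than the MeSHSimPack implementation itself.

    using System;

    static class MeasureFormulasSketch
    {
        // Wu & Palmer: sim(c1, c2) = 2*N3 / (N1 + N2 + 2*N3), where N1 and N2 are
        // the path lengths from c1 and c2 to their lowest common subsumer (LCS),
        // and N3 is the depth of the LCS in the ontology.
        public static double WuPalmer(int n1, int n2, int lcsDepth) =>
            (2.0 * lcsDepth) / (n1 + n2 + 2.0 * lcsDepth);

        // Leacock & Chodorow: sim(c1, c2) = -log( len(c1, c2) / (2 * D) ), where
        // len is the shortest path between the two concepts and D is the maximum
        // depth of the taxonomy.
        public static double LeacockChodorow(int shortestPath, int maxDepth) =>
            -Math.Log(shortestPath / (2.0 * maxDepth));
    }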

3. MeSHSimPack

MeSHSimPack has two main component modules and one database. The components of the framework are as follows:

3.1 MeSHQueryData

MeSHQueryData is the interface module to the MeSH database. It provides functions, exposed as APIs, for querying information about MeSH headings/terms in the MeSH database. This module can also be used by other applications in the biomedical domain that rely on the MeSH ontology, such as information retrieval, information extraction, and semantic computing.
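As an illustration only, the fragment below sketches how such a query API might be called. The class, method, and property names (MeSHQueryData, GetHeading, TreeNumbers, ScopeNote) and the namespace are assumptions made for the example, not the library's documented interface.

    using System;
    using MeSHSimPack;                 // assumed namespace of the library

    class QueryExample
    {
        static void Main()
        {
            // Assumed constructor argument: the name of the local relational MeSH database.
            var query = new MeSHQueryData("LocalMeSH");

            // Assumed API: look up a heading and read its tree numbers and scope note.
            var heading = query.GetHeading("Myocardial Infarction");
            Console.WriteLine(string.Join(", ", heading.TreeNumbers));
            Console.WriteLine(heading.ScopeNote);
        }
    }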

3.2 MeSHSimilarity

MeSHSimilarity is the main module, in which the eight ontology-based semantic measures are implemented. Each measure takes two terms/headings (and, where applicable, additional parameters) as inputs and returns a numerical value expressing their semantic distance/similarity on that measure's own scale.
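A hypothetical call sequence is sketched below; the type, enum, and method names are again assumptions chosen for the example rather than the module's actual signatures, but they mirror the contract just described: two headings in, one score out on the selected measure's scale.

    using System;
    using MeSHSimPack;                 // assumed namespace of the library

    class SimilarityExample
    {
        static void Main()
        {
            var sim = new MeSHSimilarity();   // assumed type name

            // Assumed API: compute the score of a pair of headings under a chosen measure.
            double wup = sim.Compute("Anemia", "Leukemia", Measure.WuPalmer);
            double lin = sim.Compute("Anemia", "Leukemia", Measure.Lin);

            Console.WriteLine($"Wu & Palmer: {wup:F3}   Lin: {lin:F3}");
        }
    }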

Figure 12. MeSHSimPack components.

Figure 13 shows a web-based interface to the MeSHSimilarity module, which takes two MeSH headings/terms as inputs and produces their semantic distance/similarity in the MeSH ontology.

3.3 LocalMeSH and MeSHConverter


The original MeSH database is distributed in two formats: the MeSH XML database, which contains files in XML format, and the MeSH ASCII database, which contains files in ASCII format. For performance and convenience, a tool called MeSHConverter was therefore developed to convert the MeSH XML database into a local relational database (LocalMeSH). The MeSH XML database consists of three main files:

- Descriptors (main headings) (desc200x.xml): characterize the subject matter or content.

- Qualifiers (qual200x.xml): are used with descriptors and afford a means of grouping together documents concerned with a particular aspect of a subject.

- Supplementary Concept Records (supp200x.xml).

Currently, MeSHConverter converts only the desc200x.xml file into the relational database, extracting the information needed for the semantic computations.
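The conversion step could look roughly like the sketch below. It assumes the standard element names of the MeSH descriptor XML (DescriptorRecord, DescriptorUI, DescriptorName, TreeNumber) and an assumed target table Descriptor(UI, Name, TreeNumber); the actual MeSHConverter schema is not specified here, and the generated INSERT statements stand in for whatever database access layer the tool uses.

    using System;
    using System.Xml.Linq;

    class DescConverterSketch
    {
        static void Main(string[] args)
        {
            // args[0] is the path to a descriptor file, e.g. desc200x.xml.
            XDocument doc = XDocument.Load(args[0]);

            foreach (var rec in doc.Descendants("DescriptorRecord"))
            {
                string ui   = (string)rec.Element("DescriptorUI");
                string name = (string)rec.Element("DescriptorName")?.Element("String");

                // One row per (descriptor, tree number) pair; quoting is kept naive
                // because this is only an illustration of the mapping.
                foreach (var tn in rec.Descendants("TreeNumber"))
                {
                    Console.WriteLine(
                        $"INSERT INTO Descriptor (UI, Name, TreeNumber) VALUES ('{ui}', '{name}', '{tn.Value}');");
                }
            }
        }
    }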

3.4 ICGenerator

The information-based measures require the information content (IC) of concepts when measuring semantic similarity; therefore, a tool called ICGenerator was developed to update the MeSH database with the IC of each concept node. The tool takes the MH_Freq_Count file as input and updates the LocalMeSH database as output.
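A minimal sketch of the IC computation is shown below, assuming MH_Freq_Count is a tab-separated file of heading/frequency pairs. Depending on how that file is produced, counts may still need to be propagated up the MeSH hierarchy before taking the logarithm, and the write-back into LocalMeSH is omitted here.

    using System;
    using System.IO;
    using System.Linq;

    class IcGeneratorSketch
    {
        static void Main(string[] args)
        {
            // args[0] is assumed to be a tab-separated "heading<TAB>frequency" file
            // such as MH_Freq_Count.
            var freq = File.ReadLines(args[0])
                           .Select(line => line.Split('\t'))
                           .ToDictionary(parts => parts[0], parts => double.Parse(parts[1]));

            double total = freq.Values.Sum();

            foreach (var entry in freq)
            {
                // Standard corpus-based information content: IC(c) = -log( freq(c) / N ).
                double ic = -Math.Log(entry.Value / total);
                Console.WriteLine($"{entry.Key}\t{ic:F4}");
            }
        }
    }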

4. Discussion

A C# library has been introduced that supports measuring the semantic similarity of MeSH headings through implementations of existing ontology-based measures. The library can be integrated into semantic-similarity-based applications or used for semantic similarity research in the biomedical domain. As a continuation of this work, a framework will be investigated and implemented that supports measuring the semantic similarity of UMLS concepts dispersed across multiple ontologies in the UMLS Metathesaurus.


Figure 13. Web-based interface of MeSHSimilarity.