NEW SEMANTIC SIMILARITY TECHNIQUES OF CONCEPTS APPLIED IN THE
BIOMEDICAL DOMAIN AND WORDNET
by
Hoa A. Nguyen, B.Eng.
THESIS
Presented to the Faculty of
The University of Houston-Clear Lake
In Partial Fulfillment
of the Requirements
for the Degree
MASTER OF SCIENCE
THE UNIVERSITY OF HOUSTON-CLEAR LAKE
December 2006
NEW SEMANTIC SIMILARITY TECHNIQUES OF CONCEPTS APPLIED IN THE BIOMEDICAL DOMAIN AND WORDNET
by
Hoa A. Nguyen
APPROVED BY
__________________________________________ Hisham Al-Mubaid, Ph.D., Chair
__________________________________________ Said Bettayeb, Ph.D., Committee Member
__________________________________________ Gary D. Boetticher, Ph.D., Committee Member
__________________________________________ Robert Ferebee, Ph.D., Associate Dean
__________________________________________ Sadegh Davari, Ph.D., Dean
ACKNOWLEDGEMENTS

I would like to thank my thesis advisor, Dr. Hisham Al-Mubaid, for his incredible
assistance, guidance, and great patience over the past two years. Dr. Al-Mubaid was very
supportive and very cooperative during all stages of this work; without his guidance and
help, this work would not have been completed. I would also like to thank him for his
great help and advice on my career, in selecting Ph.D. programs and in communicating with
universities and professors.
Also, I would like to acknowledge and thank the other members of my thesis committee,
Dr. Gary D. Boetticher and Dr. Said Bettayeb, for their assistance and guidance in this
work.
Finally, I would like to thank the people who helped me in obtaining some of the datasets
and tools needed for this work, in particular, Dr. Mona T. Diab for giving us the verb
dataset that we used in some of the experiments for comparison and evaluation purposes.
ABSTRACT
NEW SEMANTIC SIMILARITY TECHNIQUES OF CONCEPTS APPLIED IN THE BIOMEDICAL DOMAIN AND WORDNET
Hoa A. Nguyen, M.S. The University of Houston-Clear Lake, 2006
Thesis Chair: Hisham Al-Mubaid
Semantic similarity techniques compute the semantic similarity (common shared
information) between two concepts according to certain language or domain resources
such as ontologies, taxonomies, and corpora. Semantic similarity techniques
constitute important components of most information retrieval and knowledge-based
systems. This thesis presents new techniques for measuring the semantic similarity
between concepts based on ontologies. The proposed measures are based on three
features: (1) (cross-modified) path length, (2) common specificity of concepts in the
ontology, and (3) local granularity of clusters. The new features of common specificity and
granularity are applied in computing the semantic similarity of concepts within a single
ontology or across multiple ontologies. The key contribution is a novel cross-ontology
approach for measuring the similarity of concepts dispersed across multiple ontologies in a
unified framework. The proposed techniques were evaluated extensively in the biomedical
domain and the general English domain. The experimental results demonstrated the
effectiveness and superiority of our similarity measures compared with existing similar
techniques.
TABLE OF CONTENTS
1. INTRODUCTION
2. BACKGROUND AND SIMILAR WORK
2.1 WordNet
2.2 UMLS
2.3 Unified Framework
2.4 MeSH
2.5 MEDLINE
2.6 SNOMED-CT
2.7 Semantic Similarity, Semantic Distance and Relatedness and Transformation
2.8 Polysemy of Concept and Semantic Similarity of Concept Class
2.9 Traditional Ontology-Based Semantic Measures
2.9.1 Ontology-Structure-Based Measures
2.9.2 Information-Based Measures
2.10 Previous Work in the Biomedical Domain
3. A NEW ONTOLOGY-BASED SEMANTIC DISTANCE APPROACH
3.1 Method of Semantic Similarity
3.2 The New Feature: Common Specificity Feature
3.3 Rules and Assumptions
3.4 The Proposed Semantic Distance Approach
3.5 Evaluation
3.5.1 Dataset
3.5.2 Experiments and Results
3.5.3 Discussion
4. THE PROPOSED CLUSTER-BASED APPROACH
4.1 The Need for a New Approach
4.2 Local Granularity and Local Concept Specificity
4.3 The Adapted Common Specificity Feature
4.4 Rules and Assumptions
4.5 The Proposed Cluster-Based Approach
4.5.1 Single Cluster Similarity
4.5.2 Cross-Cluster Semantic Similarity
4.6 Evaluation
4.6.1 Datasets
4.6.2 Experiments and Results
4.6.3 Discussion
5. USING MEDLINE AS STANDARD CORPUS FOR SEMANTIC SIMILARITY OF CONCEPTS IN THE BIOMEDICAL DOMAIN
5.1 The Need for a Standard Corpus in the Biomedical Domain
5.2 Semantic Similarity
5.3 Evaluation
5.3.1 Information Sources
5.3.2 Dataset
5.3.3 Experimental Results
5.4 Discussion
6. THE PROPOSED COMBINATION-BASED (HYBRID) APPROACH
6.1 Motivation
6.2 Semantic Similarity Features
6.2.1 Path Feature and Depth Feature
6.2.2 The Adapted Common Specificity Feature
6.3 The Combination-Based (Hybrid) Approach
6.4 Evaluation
6.4.1 Information Source
6.4.2 Datasets
6.4.3 Experimental Results
6.5 Discussion
7. THE PROPOSED CROSS-CLUSTER APPROACH FOR SEMANTIC SIMILARITY OF CONCEPTS IN WORDNET
7.1 The Need for a Cross-Cluster Semantic Approach for WordNet
7.2 The Proposed Cross-Cluster Semantic Distance Approach
7.4 Evaluation
7.4.1 Information Source
7.4.2 Evaluation Method and Dataset
7.4.3 Experiments and Results
7.4.5 Discussion
8. SEMANTIC SIMILARITY OF VERBS AND NOUNS IN WORDNET
8.1 Motivation
8.2 Information Source and Datasets
8.3 Semantic Similarity in Verb Cluster
8.3.1 Traditional Measures
8.3.2 Hybrid Measure and Cross-Cluster Measure
8.4 Semantic Similarity of Open-Class Words in WordNet
8.5 Discussion
9. SEMANTIC SIMILARITY OF CONCEPTS IN A UNIFIED FRAMEWORK: THE PROPOSED CROSS-ONTOLOGY APPROACH
9.1 The Need for Cross-Ontology Approach
9.2 The Adapted Common Specificity Feature
9.3 Local Ontology Granularity
9.4 The Proposed Cross-Ontology Similarity Approach
9.4.1 Rules and Assumptions
9.4.2 Single Ontology Similarity
9.4.3 Cross-Ontology Semantic Similarity
9.4.4 Choosing the Secondary Ontologies
9.5 Evaluation
9.5.1 Testing Dataset
9.5.2 Tools and Information Sources
9.5.3 Experimental Results
9.6 Discussion
10. DISCUSSION AND FUTURE WORK
10.1 Directions
10.1.1 Adapting Existing Ontology-based Measures for Cross-Ontology Similarity
10.1.2 Semantic Similarity and Application in Information Retrieval
10.1.3 The Need for Topic Similarity and a New Information Retrieval Model
10.2 Discussion
10.3 Conclusion
11. REFERENCES
Appendix A
LIST OF FIGURES
Figure 1. Overlap of concepts of ontologies in UMLS framework.
Figure 2. Graphical view of MeSH ontology by MeSH browser.
Figure 3. A fragment of MeSH.
Figure 4. A fragment of two clusters in ontology.
Figure 5. Results of correlations with human scores for four measures using SNOMED-CT.
Figure 6. Results of correlations with human scores for four measures using MeSH.
Figure 7. Illustration of the three information-based measures with human scores.
Figure 8. Fragment of ontology.
Figure 9. Two fragments from two ontologies.
Figure 10. Connecting two ontology fragments.
Figure 11. Two fragments of SNOMED-CT (left) and MeSH (right).
LIST OF TABLES
Table 1. Dataset 1: 30 medical term pairs sorted in the order of the averaged physicians' scores.
Table 2. Absolute correlation of the four measures relative to human ratings.
Table 3. Dataset 2: 36 medical term pairs with five similarity scores: Human, Path length (PATH), Wu and Palmer (WUP), Leacock and Chodorow (LCH), and proposed measure (SemDist); using MeSH ontology.
Table 4. Absolute correlations with human scores for all measures using SNOMED-CT on Dataset 1, Dataset 2, and Dataset 3.
Table 5. Absolute correlations with human scores for all measures using MeSH on Dataset 1, Dataset 2, and Dataset 3.
Table 6. The improvements that SemDist achieved over the average of the three other similar techniques using SNOMED-CT with three datasets.
Table 7. The improvements that SemDist achieved over the average of the three other similar techniques using MeSH with three datasets.
Table 8. Format of MH_Freq_count file.
Table 9. Absolute correlations of information-based measures.
Table 10. Similarity features of 8 similarity measures.
Table 11. A subset of human mean ratings for the Rubenstein-Goodenough (RG) set.
Table 12. Training dataset: 19 medical term pairs of Dataset 2 found in WordNet.
Table 13. Results of absolute correlations of the proposed measure with human ratings using the training dataset with different parameter values.
Table 14. Absolute correlations with human ratings for the proposed measures using the RG dataset (65 pairs).
Table 15. Absolute correlations with RG human ratings using SemCor and Brown corpora and WordNet 2.0 for four combination-based measures.
Table 16. Absolute correlations with human judgments for the proposed measures using the RG dataset.
Table 17. Absolute correlations with RG human ratings using two corpora and WordNet 2.0 for 3 information content-based measures.
Table 18. Absolute correlations with RG human ratings of ontology-based measures.
Table 19. Mean human ratings of RD dataset of verb pairs.
Table 20. Absolute correlations with RD human ratings using SemCor and Brown.
Table 21. Absolute correlations with RD human ratings of ontology-structure-based measures.
Table 22. Absolute correlations with RG human ratings of seven measures.
Table 23. Similarity of verbs given by Hybrid measure using WordNet 2.0 and SemCor with human rating (HContext) in Resnik and Diab (RD) dataset.
Table 24. Similarity of words in RGRD (65 noun pairs + 27 verb pairs) dataset using …
Table 25. Similarity of verbs in RGRD (65 noun pairs + 27 verb pairs) dataset using Cross-Cluster measure.
Table 26. Absolute correlation of proposed approach on the RG dataset and WordNet 2.0.
Table 27. Absolute correlations of the proposed approach using WordNet and MeSH.
Table 28. Absolute correlations of the proposed approach using WordNet and SNOMED-CT.
Table 29. Biomedical Dataset 2 (36 pairs) and RG dataset (65 pairs, in italics) with human similarity scores (Human) and SemDist's scores using WordNet and MeSH.
LIST OF PUBLICATIONS
Major parts of this thesis are published in the following papers:
1. Nguyen, H.A. and Al-Mubaid, H. A New Ontology-based Semantic Similarity
Measure for the Biomedical Domain. In Proc. IEEE International Conference on
Granular Computing GrC'06, GA, USA, May 2006.
2. Al-Mubaid, H. and Nguyen, H.A. A Cluster-Based Approach for Semantic Similarity
in the Biomedical Domain, In Proc. The 28th Annual International Conference of the
IEEE Engineering in Medicine and Biology Society EMBS, New York, USA, September
2006.
3. Al-Mubaid, H. and Nguyen, H.A. Using MEDLINE as Standard Corpus for
Measuring Semantic Similarity of Concepts in the Biomedical Domain. In Proc. The
2006 IEEE 6th Symposium on Bioinformatics & Bioengineering BIBE-06, Washington
D.C., USA, October 2006, pp. 315-319.
4. Nguyen, H.A. and Al-Mubaid, H. A Combination-based Semantic Similarity
Approach Using Multiple Information Sources. In Proc. The 2006 IEEE International
Conference on Information Reuse and Integration IEEE IRI 2006, Hawaii, USA,
September 2006.
5. Al-Mubaid, H. and Nguyen, H.A. A Cross-Cluster Approach for Measuring Semantic
Similarity between Concepts. In Proc. The 2006 IEEE International Conference on
Information Reuse and Integration IEEE IRI 2006, Hawaii, USA, September 2006.
6. Al-Mubaid, H. and Nguyen, H.A. Semantic Distance of Concepts within a Unified
Framework in the Biomedical Domain. In The 22nd Annual ACM Symposium on Applied
Computing, Seoul, Korea, forthcoming March 2007.
1. INTRODUCTION
Ontology-based semantic similarity techniques or approaches, also called semantic
similarity measures, estimate the semantic similarity between two hierarchically
expressed concepts in a given ontology or taxonomy. In a given ontology (e.g., WordNet
or MeSH), each node contains a set of synonymous terms and is called a "concept
node"; each concept node represents one sense of a concept (as in WordNet). Two terms are
synonymous if they belong to the same node in the ontology tree. Thus, an ontology or
taxonomy is a hierarchical tree-structured organization of the terms and concepts in a
language or domain. Semantic similarity is the inverse of semantic distance: two
concepts may belong to two different nodes in an ontology tree, and the distance
between their nodes determines the similarity between these two concepts. We can
therefore use the terms "semantic distance" and "semantic similarity" interchangeably,
as the conversion from distance to similarity, or vice versa, is a direct
operation (Section 2.7).
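Since the conversion between distance and similarity is a direct operation, it can be sketched in a few lines of Python. The linear map below is a generic illustration only, not the specific conversion this thesis defines in Section 2.7:

```python
def distance_to_similarity(dist, max_dist):
    """Map a semantic distance in [0, max_dist] onto a similarity in [0, 1].

    One generic linear conversion; other monotone maps such as
    sim = 1 / (1 + dist) serve the same purpose.
    """
    if not 0 <= dist <= max_dist:
        raise ValueError("distance out of range")
    return 1.0 - dist / max_dist

# Identical concepts (distance 0) are maximally similar:
print(distance_to_similarity(0, 10))   # 1.0
# The farthest-apart concepts have similarity 0:
print(distance_to_similarity(10, 10))  # 0.0
```

Any monotonically decreasing map preserves the ranking of concept pairs, which is what the correlation-based evaluations later in this thesis measure.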
Semantic similarity techniques are becoming important components of most
information retrieval (IR), information extraction (IE), and other intelligent knowledge-
based systems. For example, in IR, semantic similarity measures play a crucial role in
determining an optimal match between query terms and retrieved documents and in
ranking the results. With the fast growth of biomedical databases such as PubMed [31],
retrieving biomedical documents effectively plays a very important role in this
domain. This thesis focuses on investigating and developing new semantic similarity
approaches for measuring semantic similarity between terms and concepts within certain
language or domain resources like ontologies, taxonomies, and corpora. The proposed
semantic measures are applied in the biomedical domain and general English domain.
The main contributions of this thesis are a new combination of semantic features and
five variations of semantic similarity measures that are proposed and evaluated in the
course of this master's thesis research. The five variations of semantic similarity measures
are: (1) a new semantic similarity measure [chapter 3], (2) a cluster-based semantic
similarity measure for multiple clusters in a single ontology [chapter 4], (3) a hybrid
semantic similarity measure using multiple information sources [chapter 6], (4) a cross-
cluster semantic similarity measure using multiple information sources [chapter 7], and
(5) a cross-ontology semantic measure in a unified framework [chapter 9]. Besides these
five semantic techniques, this thesis includes two further investigations: (1) an
investigation of using MEDLINE as a standard corpus for measuring semantic similarity
of MeSH concepts [chapter 5], and (2) an investigation of measuring semantic similarity of
nouns and verbs in WordNet [chapter 8]. Furthermore, the final chapter of this thesis
[chapter 10] discusses future directions for this work. The experimental results, compared
with existing similar techniques, demonstrate that all of the proposed measures perform
well and are very promising for computing the semantic similarity of concepts in the two
domains. Some of the presented work (for example, in chapters 5 and 9) introduces newly
devised methods and techniques that lay the groundwork for further advances in these
tasks.
The rest of this thesis is organized as follows:
Chapter 2: A review of background and related work; this includes ontologies in the
general English and biomedical domains, definitions, ontology-based semantic similarity
measures, and previous work in the biomedical domain.
Chapter 3: This chapter presents the first proposed semantic distance measure (NA)
and applies it in the biomedical domain. It combines two semantic distance
features in one measure so as to combine the strengths, and complement the weaknesses,
of some existing ontology-based measures.
Chapter 4: This chapter presents an extension of the NA measure into the Cluster-Based
approach by taking into account the granularity of clusters in the ontology.
Chapter 5: There is no standard text corpus to serve as a secondary information source for
information-based measures in the biomedical domain. Chapter 5 presents an investigation
of using MEDLINE, the most comprehensive database in the biomedical domain, as a
standard corpus for measuring the information content (IC) of MeSH concepts used in
information-based measures.
Chapters 6-8: These chapters present and discuss new semantic similarity techniques
applied to the WordNet ontology in the general English domain. These techniques
include methods for measuring the semantic similarity of nouns, of verbs, and of both
nouns and verbs on one similarity scale.
Chapter 9: This chapter presents and explains a novel cross-ontology semantic distance
approach in a unified framework within the biomedical domain for measuring the semantic
distance of concepts within a single ontology or across ontologies. This approach is an
extension of the Cluster-Based approach discussed in Chapter 4.
Chapter 10: Chapter 10 presents a general and more comprehensive discussion of this
thesis, along with some thoughts and ideas about new directions for future work.
Appendix A: Appendix A presents a framework, called MeSHSimPack, for measuring the
semantic similarity of MeSH concepts with a number of ontology-based semantic similarity
measures, including information-based measures.
2. BACKGROUND AND SIMILAR WORK
2.1 WordNet
WordNet [13] is a semantic lexicon for the English language developed at Princeton
University. WordNet has been a valuable resource for human language technology and
artificial intelligence research for many years. English nouns, verbs, adjectives, and
adverbs are organized into synonym sets (synsets), each representing one underlying
lexical concept, and different relations link the synsets. The IS-A relations form IS-A
taxonomies over the noun and verb synsets and constitute the majority of relations in
WordNet. In noun taxonomies the main relations are hypernym/hyponym (IS-A) and
holonym/meronym (part-whole), while verb taxonomies use hypernym/troponym relations.
In WordNet 2.0, there are nine noun taxonomies with an average depth of 13 and 554 verb
taxonomies with an average depth of 2. For more detailed information on WordNet, please
refer to [13].
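The IS-A structure described above can be illustrated with a small hand-built taxonomy fragment. The data and helper functions below are a toy sketch of our own (not the real WordNet or its API); tracing paths through hypernym links in this way is the basis of the path-length feature used throughout this thesis:

```python
# A hand-built fragment of a noun IS-A taxonomy (toy data, not real WordNet).
# Each synset maps to its hypernym (parent); the root has parent None.
hypernym = {
    "entity": None,
    "animal": "entity",
    "dog": "animal",
    "cat": "animal",
}

def path_to_root(synset):
    """Return the list of synsets from `synset` up to the root."""
    path = []
    while synset is not None:
        path.append(synset)
        synset = hypernym[synset]
    return path

def path_length(a, b):
    """Number of IS-A edges on the shortest path between a and b."""
    pa, pb = path_to_root(a), path_to_root(b)
    common = set(pa) & set(pb)
    # Lowest common subsumer: the shared ancestor closest to the two nodes.
    lcs = min(common, key=lambda s: min(pa.index(s), pb.index(s)))
    return pa.index(lcs) + pb.index(lcs)

print(path_length("dog", "cat"))     # 2 (dog -> animal -> cat)
print(path_length("dog", "entity"))  # 2
```

Because the ontologies considered here are trees (no multiple inheritance), walking each concept's hypernym chain to the root and meeting at the lowest common subsumer is sufficient to find the shortest path.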
2.2 UMLS
The Unified Medical Language System (UMLS) project started at the National Library of
Medicine (NLM) in 1986 [31]; one of its objectives is to help interpret and
understand medical meanings across systems. It consists of three main knowledge
sources: the Metathesaurus, the Semantic Network, and the SPECIALIST Lexicon & Lexical Tools.
The current version (2006AC) of the Metathesaurus contains more than 1.3 million concepts
and 6.4 million unique concept names from more than 100 different source vocabularies
and supports 17 languages. The Metathesaurus is built from the electronic versions of
various thesauri, classifications, code sets, and lists of controlled terms used in patient
care, health services billing, public health statistics, indexing and cataloging of
biomedical literature, and/or basic, clinical, and health services research. These are
referred to as the "source vocabularies" of the Metathesaurus. The controlled vocabularies
or terminologies in these sources are expressed hierarchically, with the major relations
between concepts being IS-A relations (actually broader-than/narrower-than); these
sources are therefore also called ontologies or taxonomies. Unlike
ontologies and taxonomies in other domains, such as WordNet, the ontologies in the
biomedical domain (UMLS) do not allow multiple inheritance. The ontologies in the UMLS
Metathesaurus overlap in their sets of UMLS concepts, as in Figure 1. Each ontology is
designed for specific purposes in the biomedical domain; for example, the MeSH thesaurus/
ontology is built for cataloging, indexing, and searching the MEDLINE database [37].
Figure 1. Overlap of concepts of ontologies in UMLS framework.
2.3 Unified Framework
The UMLS [26] and NCI [38] frameworks are instances of the unified framework defined in this
thesis. The unified framework in this thesis is modeled on the characteristics of the UMLS
framework and has the following main characteristics:
- The framework covers a set of concepts forming a controlled vocabulary in a
specific domain, such as the biomedical domain.
- The framework includes (1) a Semantic Network, which represents the possible semantic
relations between concepts in the controlled vocabulary, and (2) a
Metathesaurus comprising many ontologies, each of which covers a subset of the concepts of the
framework. The ontologies in the framework overlap in their concepts, as in Figure 1.
Each of them reflects the view of a specific community or is constructed for specific
tasks in the domain.
- Concept classes, concepts, and terms have the characteristics discussed in the
following section, which uses the MeSH ontology as an example of an ontology in a unified
framework.
2.4 MeSH
MeSH (Medical Subject Headings) [33,39] is one of the main source
vocabularies (terminologies/concepts) used in UMLS; its primary purpose is to support the
indexing, cataloging, and retrieval of medical literature articles stored in the
NLM MEDLINE database. It includes about 16 high-level categories
(taxonomies/subtrees), as in Figure 2.
Figure 2. Graphical view of MeSH ontology by MeSH browser.
The MeSH ontology database [32] contains about 23K MeSH Descriptors. A Descriptor is
often broader than a single concept and so may consist of a class of concepts. Concepts, in
turn, correspond to classes of terms that are synonymous with each other [32]. A
Descriptor is thus a class of concepts whose meanings lie close together. Each MeSH
Descriptor has a preferred concept, which is the MeSH heading; a MeSH
heading therefore represents a concept class containing synonymous concepts. The
following hierarchy shows the relations between a Descriptor and its concepts and
terms:

- Concept Class (Descriptor)
  - Concept
    - Term

One concept class (Descriptor) can have many concepts that are close to each other
in meaning, and each concept can have many synonymous terms. One term is chosen as
the preferred term of each concept, and one concept is chosen as the preferred concept of
the concept class; it is the heading of the concept class (Descriptor). This concept
structure applies to all other ontologies in UMLS. The following example shows the
concept structure of a MeSH heading:
- Achlorhydria (Heading)
  - Achlorhydria (preferred concept)
    - Achlorhydria (term)
    - Achylia Gastrica (term)
  - Hypochlorhydria (concept)
    - Hypochlorhydria (term)
2.5 MEDLINE
MEDLINE [38], the NLM's premier bibliographic database, covers the fields of medicine, nursing, dentistry, veterinary medicine, the health care system, and the preclinical sciences. It contains about 14 million research abstracts dating back to the 1950s, drawn from more than 4,800 biomedical journals published in the United States and 70 other countries, and is thus considered the main source of literature and textual data for bioinformatics research. Each record in MEDLINE is a cited article that is assigned 10-15 MeSH terms (MeSH main headings) by indexers, with major topics (MeSH major headings) indicated by an asterisk (*) [36]. Indexers typically use the most specific MeSH term available.
2.6 SNOMED-CT
SNOMED-CT, which stands for Systemized Nomenclature of Medicine Clinical Terms [35], was included in UMLS in May 2004 (2004AA). The result of a collaboration between the College of American Pathologists (CAP) and the United Kingdom's National Health Service, it is a comprehensive clinical terminology covering diseases, clinical findings, and procedures, comprising concepts, terms, and relationships to represent clinical information. The current version contains more than 360,000 concepts, 975,000 synonyms, and 1,450,000 relationships organized into 18 hierarchies/subtrees/categories.
In this thesis, the term “ontology” describes the concepts and the relationships between them in a given domain, and denotes any kind of IS-A tree or hierarchical tree in which concepts are organized hierarchically by IS-A relations (is-a-kind-of, is-a-part-of). Although the hierarchical relations in the biomedical domain within the UMLS framework are broader-than/narrower-than relations, they are treated as IS-A relations in semantic computing.
2.7 Semantic Similarity, Semantic Distance, Relatedness, and Transformation
While “semantic similarity” is concerned with likeness, “relatedness” seeks to determine the relation between two terms/concepts. For example, “car” and “driver” are related but not very similar, whereas “car” and “vehicle” are similar to some degree. Relatedness is thus more general than similarity. Furthermore, semantic distance is the inverse of semantic similarity: the smaller the distance between two concepts, the more similar they are.
To ensure that the conversion from semantic distance to semantic similarity does not change the absolute correlation value, the following transformation function is used:

Sim(C1,C2) = MaxDist − Dist(C1,C2)    (1)

where Dist is the semantic distance between the two concepts, MaxDist is the maximum distance between any two concepts, and Sim is the resulting semantic similarity of the two concepts.
However, in this thesis, absolute correlation is used to evaluate the performance of the approaches.
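The transformation in Equation 1 can be sketched in a few lines of Python; the distance values below are hypothetical illustration values, not results from any experiment in this thesis.

```python
# Equation 1: Sim(C1, C2) = MaxDist - Dist(C1, C2).
# The distance values below are hypothetical, for illustration only.
def dist_to_sim(dist, max_dist):
    """Convert a semantic distance into a semantic similarity."""
    return max_dist - dist

distances = [0.0, 1.5, 3.0, 4.0]   # hypothetical Dist values for four pairs
max_dist = max(distances)          # MaxDist over the pairs
print([dist_to_sim(d, max_dist) for d in distances])   # [4.0, 2.5, 1.0, 0.0]
```

Because the transformation is linear with a negative slope, it flips the sign of a correlation with human ratings but leaves its absolute value unchanged, which is why absolute correlation is reported.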
In this thesis, the term “semantic measure” denotes both semantic distance measures and semantic similarity measures. Furthermore, the term “semantic similarity” is also used to denote both semantic similarity and semantic distance.
2.8 Polysemy of Concept and Semantic Similarity of Concept Class
Because of the polysemy of concepts in natural language and in the biomedical domain, for clarity this thesis uses the term “concept node” to refer to a particular sense of a concept class, and the term “concept” to refer to a concept class. A concept class in the MeSH ontology, for example, can have many senses/concept nodes and appear at many positions (nodes) in the ontology. For example, the MeSH heading “Achlorhydria” has two tree numbers in the XML database:
<TreeNumberList>
  <TreeNumber>C06.405.748.045</TreeNumber>
  <TreeNumber>C18.452.076.087</TreeNumber>
</TreeNumberList>
Since a heading/preferred concept represents a concept class (Descriptor), the similarity of two terms contained in two concept classes is the similarity of the two headings that represent those classes. Each concept class has a set of concept nodes, so there are many candidate similarities between the two sets of concept nodes; the similarity of the two headings is chosen as the maximum among them. The similarity of two terms within one concept class is therefore the maximum similarity.
2.9 Traditional Ontology-Based Semantic Measures
Ontology-based semantic similarity measures are those that use an ontology as the primary information source. They can be roughly divided into two groups:
2.9.1 Ontology-Structure-Based Measures
Most semantic similarity measures based on the structure of an ontology rely on the (shortest) path length between two concept nodes and/or the depths of the concept nodes in the IS-A hierarchy tree. The primitive Path length measure was first developed and applied in the MeSH ontology; most later measures were developed and applied in WordNet. Some of the WordNet-based measures are Path length [22], Wu and Palmer [30], Leacock and Chodorow [10], and Li et al. [12]. These measures are discussed in more detail in section 3.1.
2.9.2 Information-Based Measures
The information-based approaches build on information theory and use a text corpus as a secondary information source beside the primary ontology. They all use the information content (IC) of concept nodes, derived from the IS-A relations and corpus statistics. Among WordNet-based measures, the information-based ones include Resnik [23], Jiang and Conrath [9], and Lin [11]. These measures are discussed in sections 5-7.
2.10 Previous Work in the Biomedical Domain
Rada et al. [22] first proposed a semantic distance measure and applied it to the biomedical domain using the MeSH ontology: the semantic distance between two concept nodes is the shortest path length between them. This Path length measure is actually a simplified version of spreading activation theory [5,21,22]. Caviedes and Cimino (2004) [6] implemented the shortest path length measure, called CDist, based on the shortest distance between two concept nodes in the ontology, and evaluated it on the MeSH, SNOMED, and ICD9 ontologies by correlation with human ratings. Another recent work on semantic similarity and relatedness in the biomedical domain was conducted by Pedersen, Pakhomov, and Patwardhan (2005) [20], who applied a corpus-based context vector approach to measure relatedness. Their context vector approach is ontology-free but requires training text, for which they used the Mayo Clinic corpus of medical notes. Their proposed method was evaluated against human judgments (a collected set of 30 medical term pairs annotated by 3 physicians and 9 experts) and compared with five other measures.
3. A NEW ONTOLOGY-BASED SEMANTIC DISTANCE APPROACH
3.1 Method of Semantic Similarity
The Path length measure is the primitive ontology-based measure: it takes the semantic distance between two concept nodes to be the length of the shortest path between them in the ontology. Rada et al. [22] proposed this Path length measure as a potential measure for the biomedical domain. Consider, for example, the fragment of the MeSH ontology shown in Figure 3, taken from the seventh category tree, “Biological Science”, which is assigned the letter G in MeSH.
Figure 3. A fragment of MeSH.
In this fragment (Figure 3), the path length between “Biological Sciences [G01]” and “Environment and Public Health [G03]” is 3 using node counting. The path length between “Biology [G01.273]” and “Biotechnology [G01.550]” is also 3, so the Path length measure assigns the same similarity in both cases. Intuitively, however, the similarity between “Biological Sciences [G01]” and “Environment and Public Health [G03]” is less than the similarity between “Biology [G01.273]” and “Biotechnology [G01.550]”, since the latter two concepts lie at a lower level of the hierarchy tree and share more information. Measures based only on path length, such as Path length (DistPath) [22] and Leacock and Chodorow (LCH) [10], give the same similarity for these two pairs because path length is the only feature they use:
DistPath(C1,C2) = d(C1,C2)    (2)

SimLCH(C1,C2) = −log( d(C1,C2) / (2·D) )    (3)

where d(C1,C2) is the shortest path length between C1 and C2, and D is the maximum depth of the ontology.
Therefore, the specificity of concepts should be taken into account through concept depth. The measure of Wu and Palmer [30], however, computes semantic similarity from the depths of concept nodes only. Wu and Palmer assumed that “within one conceptual domain, the similarity of two concepts is defined by how closely they are related in the hierarchy,” and proposed the following measure:
Sim(C1,C2) = 2·N3 / (N1 + N2 + 2·N3)    (4)

where N3 is the depth of the least common subsumer of the two concept nodes, and N1 and N2 are the path lengths from each concept node to the LCS, respectively. (The least common subsumer, LCS(C1,C2), of two concept nodes C1 and C2 is the lowest node that is an ancestor of both C1 and C2; for example, in Figure 3, LCS(G01.273, G01.550) = G01 and LCS(G01, G03) = G.) The Wu and Palmer formula can be rewritten as:
Sim(C1,C2) = 2·Depth(LCS(C1,C2)) / (Depth(C1) + Depth(C2))    (5)
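As a sketch, the structural quantities used so far (shortest path length with node counting, depth, and LCS) can be computed on the Figure 3 fragment. The child-to-parent table below encodes only that toy fragment, so the depths are not real MeSH depths.

```python
import math

# Toy encoding of the Figure 3 MeSH fragment as child -> parent links.
parent = {"G01": "G", "G03": "G", "G01.273": "G01", "G01.550": "G01"}

def path_to_root(c):
    """Return the list of nodes from c up to the root (inclusive)."""
    nodes = [c]
    while c in parent:
        c = parent[c]
        nodes.append(c)
    return nodes

def depth(c):                       # node counting: the root has depth 1
    return len(path_to_root(c))

def lcs(c1, c2):                    # least common subsumer
    ancestors = set(path_to_root(c1))
    for c in path_to_root(c2):
        if c in ancestors:
            return c

def path_len(c1, c2):               # shortest path length, node counting
    return depth(c1) + depth(c2) - 2 * depth(lcs(c1, c2)) + 1

def sim_wup(c1, c2):                # Wu and Palmer, Equation 5
    return 2 * depth(lcs(c1, c2)) / (depth(c1) + depth(c2))

D = 3                               # depth of this toy fragment
def sim_lch(c1, c2):                # Leacock and Chodorow, Equation 3
    return -math.log(path_len(c1, c2) / (2 * D))

# Both pairs have path length 3, so Path and LCH cannot separate them...
print(path_len("G01", "G03"), path_len("G01.273", "G01.550"))    # 3 3
# ...but Wu and Palmer can, via the depth of the LCS:
print(round(sim_wup("G01", "G03"), 3))                           # 0.5
print(round(sim_wup("G01.273", "G01.550"), 3))                   # 0.667
```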
Li et al. [12] proposed a combination-based measure that combines several existing ontology-based semantic features, each weighted by parameters, and used a training phase to find the optimal parameter values. The features employed are the path length, the depth of the LCS of the two concept nodes, and the IC of the LCS as a kind of local density. They combined these three features using ten different strategies, from linear to nonlinear. Although many combinations of the three features were investigated, a nonlinear strategy combining only the shortest path length and the depth of the LCS achieved the highest correlation with the human ratings dataset; notably, this optimal strategy uses neither information content nor any corpus-based feature. The optimal strategy is given by:
S(w1,w2) = e^(−a·l) × (e^(b·h) − e^(−b·h)) / (e^(b·h) + e^(−b·h))    (6)

where a ≥ 0 and b ≥ 0 are parameters that scale the contributions of the shortest path length l between the two concepts and the depth h of their LCS, respectively.
However, this measure has a limitation: it violates some of the intuitions and assumptions of ontology-based similarity [11], because it gives different similarity values for different identical pairs. Using the formula of Li et al.:

S(w1,w2) = e^(−a·l) × (e^(b·h) − e^(−b·h)) / (e^(b·h) + e^(−b·h))

the similarity of the identical pair G01-G01, for which h = 1 and l = 0, is:

S(G01, G01) = (e^b − e^(−b)) / (e^b + e^(−b))

while the similarity of the identical pair G01.273-G01.273, for which h = 2 and l = 0, is:

S(G01.273, G01.273) = (e^(2b) − e^(−2b)) / (e^(2b) + e^(−2b))

Therefore, the similarity of the first pair differs from the similarity of the second pair.
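The violation can be checked numerically; the sketch below evaluates Equation 6 for the two identical pairs. The parameter values a = 0.2 and b = 0.6 are example settings, not necessarily the trained optima of [12].

```python
import math

# Equation 6 of Li et al.; the second factor equals tanh(b*h).
def sim_li(l, h, a=0.2, b=0.6):
    return math.exp(-a * l) * math.tanh(b * h)

# Two identical pairs (l = 0) whose LCS depths differ:
s1 = sim_li(0, 1)   # G01-G01         (h = 1)
s2 = sim_li(0, 2)   # G01.273-G01.273 (h = 2)
print(s1 == s2)     # False: identical pairs get different similarities
```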
Moreover, the information content of the LCS is a kind of “weighted depth” of the LCS; in other words, it is the same kind of feature as depth, since the depth of a concept node is computed by counting all the links/nodes from the root to that node, while the weighted depth of a concept node (its information content) is computed from all the weighted links from the root to that node [9,23]. Furthermore, local density (i.e., information content), link type, and link strength are also factors that affect semantic similarity [26]. The information content (IC) of a concept node is estimated from corpus statistics; however, there is no standard corpus in the biomedical domain, so this first proposed measure investigates and uses only ontology-structure-based features.
As discussed above, both the path length (path) feature and the depth (depth) feature should be used in the measure. The LCS node determines what the two concept nodes share in common. The measure of Li et al. takes the specificity of concept nodes into account by using the depth of the LCS, but its combination of features violates some of the intuitive rules of ontology-based similarity discussed above. The measure of Wu and Palmer, on the other hand, takes only the depths of concept nodes into account, skipping the most important feature, path length [Equation 5], or leaves the contributions of the two features unweighted [Equation 4]. Motivated by this, this thesis proposes approaches that complement and combine the strengths of existing measures and integrate additional semantic features for more advanced computation. The following sections state the rules and assumptions that the first proposed semantic distance measure must satisfy when measuring the semantic distance/similarity of concepts in an ontology.
3.2 The New Feature: Common Specificity Feature
Besides the path length feature, the proposed measure uses the depth of the LCS of two concepts as their specificity, which effectively improves performance. The LCS node of two concepts C1 and C2 determines the “common specificity” of C1 and C2 in the ontology. The common specificity of two concepts is therefore measured by finding the depth of their LCS node and scaling this depth by the depth D of the ontology:

CSpec(C1,C2) = D − depth(LCS(C1,C2))    (7)

where D is the depth of the ontology. The CSpec(C1,C2) feature thus captures the common specificity of the two concepts in the ontology: the smaller the common specificity value of two concept nodes, the more information they share, and thus the more similar they are.
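On the Figure 3 fragment, Equation 7 can be sketched as follows; depths use node counting, and D = 3 is the depth of that toy fragment only.

```python
# Equation 7: CSpec(C1, C2) = D - depth(LCS(C1, C2)).
depth = {"G": 1, "G01": 2, "G03": 2, "G01.273": 3, "G01.550": 3}
D = max(depth.values())              # depth of the toy fragment

def cspec(lcs_node):
    return D - depth[lcs_node]

print(cspec("G"))     # CSpec(G01, G03):         LCS is G   -> 3 - 1 = 2
print(cspec("G01"))   # CSpec(G01.273, G01.550): LCS is G01 -> 3 - 2 = 1
```

The lower-level pair gets the smaller CSpec value, matching the intuition that it shares more information.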
3.3 Rules and Assumptions
The two features discussed above are combined in the proposed measure using some intuitive rules and assumptions. The path length (shortest path length) is used in the usual way: the similarity of two concepts is higher when the distance between them is smaller. These intuitions are summarized in the following three rules to be satisfied:
Rule R1: The shorter the distance (path length) between two concept nodes in the ontology, the more similar they are.
Rule R2: Lower-level pairs of concept nodes are semantically closer (more similar) than higher-level pairs.
Rule R3: The maximum similarity is reached when the two concept nodes are the same node in the ontology.
Besides satisfying the above three rules, the proposed measure also satisfies two assumptions about semantic similarity:

Assumption A1: Logarithmic functions are the universal law of semantic distance.

Exponential-decay functions are the universal law of stimulus generalization in the psychological sciences [28]; accordingly, the logarithm (the inverse of exponentiation) is used for semantic distance. This thesis further assumes that a non-linear combination is the optimal way to combine semantic features: Rule R3 requires that when the two concept nodes are the same node, the similarity reaches its maximum regardless of the other features, so a non-linear combination of features should be used. Therefore, a second assumption is needed:

Assumption A2: A non-linear function is the universal law for combining semantic similarity features.
3.4 The Proposed Semantic Distance Approach
There are two features to combine, as discussed above: the path length and the common specificity given by Equation 7. When the two concept nodes are the same node, the path length is 1 (using node counting), and the semantic distance must then reach its minimum regardless of the CSpec feature, by rule R3 (recall that semantic distance is the inverse of semantic similarity). Therefore, the features should be combined by taking the product of the semantic distance features. Applying rules R1, R2, and R3 and the two assumptions, the proposed semantic distance measure is:

SemDist(C1,C2) = log( (Path − 1)^α × CSpec^β + k )    (8)

where α > 0 and β > 0 are the contribution factors of the two features; k is a constant; CSpec (i.e., CSpec(C1,C2), involving the least common subsumer LCS of the two concepts) is calculated by Equation 7; and Path is the length of the shortest path between the two concept nodes. To ensure that the distance is positive and the combination is non-linear, k must be greater than or equal to one (k ≥ 1); k = 1 is used in the experiments in this thesis. When two concept nodes have a path length of 1 (Path = 1) using node counting (i.e., they are the same node in the ontology), their semantic distance (SemDist) equals zero (i.e., maximum similarity), regardless of the common specificity feature.
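A minimal sketch of Equation 8, with the default parameter values used in the thesis (α = β = 1, k = 1) and the CSpec values of the Figure 3 fragment:

```python
import math

# Equation 8: SemDist(C1, C2) = log((Path - 1)^alpha * CSpec^beta + k).
def sem_dist(path, cspec, alpha=1.0, beta=1.0, k=1):
    return math.log(((path - 1) ** alpha) * (cspec ** beta) + k)

# Figure 3 fragment (D = 3): both pairs have Path = 3 but different CSpec.
print(round(sem_dist(3, 2), 3))   # (G01, G03),         CSpec = 2: 1.609
print(round(sem_dist(3, 1), 3))   # (G01.273, G01.550), CSpec = 1: 1.099
print(sem_dist(1, 2))             # identical nodes (Path = 1):    0.0
```

The higher-level pair thus gets the larger distance (rule R2), and identical nodes get distance zero (rule R3) regardless of CSpec.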
3.5 Evaluation
3.5.1 Dataset
There is no standard human-rated set of concept/term pairs for semantic similarity in the biomedical domain. Thus, to evaluate the proposed approach, the dataset of 30 concept pairs from Pedersen et al. (2005) [20] (Dataset 1) was used; it was annotated by 3 physicians and 9 medical index experts, with each pair rated on a 4-point scale: practically synonymous, related, marginally related, and unrelated.
Table 1 lists all the pairs of this dataset. The average correlation between the physicians is 0.68, and between the experts 0.78. Because there are more experts than physicians, and the agreement among the experts (0.78) is higher than among the physicians (0.68), the experts' rating scores can be assumed to be more reliable than the physicians'.

Only 25 of the 30 term pairs were found in MeSH using MeSH Browser version 2006 [32]; because some terms could not be found, 25 pairs were used in the experiments. (Pedersen et al. [20] tested 29 of the 30 concept pairs, as one pair was not found in SNOMED-CT.) The term pairs in bold in Table 1 are those containing a term that was not found in MeSH; they were excluded from the experiments.
Table 1. Dataset 1: 30 medical term pairs sorted in the order of the averaged physician scores

Concept 1 | Concept 2 | Phys. | Expert
Renal failure | Kidney failure | 4.0000 | 4.0000
Heart | Myocardium | 3.3333 | 3.0000
Stroke | Infarct | 3.0000 | 2.7778
Abortion | Miscarriage | 3.0000 | 3.3333
Delusion | Schizophrenia | 3.0000 | 2.2222
Congestive heart failure | Pulmonary edema | 3.0000 | 1.4444
Metastasis | Adenocarcinoma | 2.6667 | 1.7778
Calcification | Stenosis | 2.6667 | 2.0000
Diarrhea | Stomach cramps | 2.3333 | 1.3333
Mitral stenosis | Atrial fibrillation | 2.3333 | 1.3333
Chronic obstructive pulmonary disease | Lung infiltrates | 2.3333 | 1.8889
Rheumatoid arthritis | Lupus | 2.0000 | 1.1111
Brain tumor | Intracranial hemorrhage | 2.0000 | 1.3333
Carpal tunnel syndrome | Osteoarthritis | 2.0000 | 1.1111
Diabetes mellitus | Hypertension | 2.0000 | 1.0000
Acne | Syringe | 2.0000 | 1.0000
Antibiotic | Allergy | 1.6667 | 1.2222
Cortisone | Total knee replacement | 1.6667 | 1.0000
Pulmonary embolus | Myocardial infarction | 1.6667 | 1.2222
Pulmonary Fibrosis | Lung Cancer | 1.6667 | 1.4444
Cholangiocarcinoma | Colonoscopy | 1.3333 | 1.0000
Lymphoid hyperplasia | Laryngeal Cancer | 1.3333 | 1.0000
Multiple Sclerosis | Psychosis | 1.0000 | 1.0000
Appendicitis | Osteoporosis | 1.0000 | 1.0000
Rectal polyp | Aorta | 1.0000 | 1.0000
Xerostomia | Alcoholic cirrhosis | 1.0000 | 1.0000
Peptic ulcer disease | Myopia | 1.0000 | 1.0000
Depression | Cellulitis | 1.0000 | 1.0000
Varicose vein | Entire knee meniscus | 1.0000 | 1.0000
Hyperlipidemia | Metastasis | 1.0000 | 1.0000
3.5.2 Experiments and Results
In these experiments, only a single test dataset was used, with no training dataset; therefore, the default parameters of the proposed measure were used to validate it, and the Li et al. measure was excluded from the evaluation because it needs a training phase to set its optimal parameters.
Table 2. Absolute correlation of the four measures relative to human ratings

Measure | Phys. (rank) | Expert (rank) | Both (rank)
SemDist | 0.666 (2) | 0.862 (1) | 0.836 (1)
Leacock and Chodorow | 0.672 (1) | 0.856 (2) | 0.833 (2)
Wu and Palmer | 0.652 (3) | 0.794 (4) | 0.778 (4)
Path length | 0.631 (4) | 0.742 (3) | 0.734 (3)
Semantic distance/similarity values for the 25 pairs were calculated using the proposed measure and the other three ontology-based semantic distance/similarity measures. All the measures use node counting for path lengths and for depths of concept nodes. For pairs in which a term belongs to more than one category tree, only its position(s) in the same category as the other term are taken into account. Table 2 shows, for the four measures, the correlation with the human ratings of the physicians, the experts, and both (physicians and experts), with the ranks in parentheses. These correlation values (Table 2) show that the proposed measure ranks first in correlation relative to the experts' judgments and relative to both groups combined; relative to the physicians' judgments it ranks second. However, as discussed above, the experts' ratings are more reliable than the physicians', so overall the proposed measure performs very well and shows great potential.
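The evaluation metric can be sketched as follows: the absolute Pearson correlation between a measure's scores and the human ratings, here computed on hypothetical rating values rather than the actual Dataset 1 scores.

```python
# Absolute Pearson correlation, computed with plain Python only.
def abs_pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return abs(cov / (sx * sy))

human = [4.0, 3.0, 2.0, 1.0]   # hypothetical human ratings for four pairs
dists = [0.0, 1.1, 2.9, 4.3]   # hypothetical SemDist values for the same pairs
print(round(abs_pearson(human, dists), 3))
```

Distances correlate negatively with similarity ratings, so taking the absolute value lets distance and similarity measures be compared on the same footing.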
3.5.3 Discussion
The proposed semantic distance measure has been introduced and has proved potential and promising; however, the experiments also reveal some limitations. There was no training phase to find optimal parameters for the proposed measure. The dataset is small, and a small part of it cannot be found exactly in the MeSH ontology; those pairs were matched to closely related terms in MeSH. Moreover, the dataset was originally created for, and its experiments conducted on, the SNOMED-CT ontology, so most of its terms are found in SNOMED-CT. It should also be noted that the dataset is a relatedness dataset; semantic distance/similarity measures cannot capture relatedness, which lowers their measured performance, so comparing relatedness measures with semantic distance/similarity measures on a relatedness dataset, as in [20], is neither fair nor logical. In light of these limitations, the next section presents more extensive experiments and evaluation using an additional semantic similarity dataset, and compares two UMLS ontologies on the semantic similarity of terms.
4. THE PROPOSED CLUSTER-BASED APPROACH
4.1 The Need for a New Approach
The semantic distance measure proposed above, called the NA measure, combines two semantic distance features and weights their contributions to similarity; it complements some weaknesses of existing measures. However, one question stands out: is this enough for semantic computing over ontologies whose clusters have different degrees of granularity? To answer this question, let us first investigate how the local granularity of a cluster affects semantic similarity.
4.2 Local Granularity and Local Concept Specificity
In this work, the term “cluster” denotes a subtree or category tree of an ontology; for example, the MeSH ontology has about 16 category trees, as in Figure 2. The following example explains the effect of cluster granularity on local concept specificity. Consider, for example, a fragment of an ontology showing two clusters, as in Figure 4. The specificity of a concept c in cluster C is defined as:

spec(c) = depth(c) / depthC    (9)
where depthC is the depth of cluster C, and spec(c) ∈ [0,1]. Note that spec(c) = 1 when the concept c is a leaf node of cluster C. In Figure 4, the specificity of a3 and b3 is calculated as follows:

spec(a3) = 3/4 = 0.75
spec(b3) = 3/3 = 1.00

Thus, the specificity of b3 (1.00) is greater than that of a3 (0.75), even though their depths are equal: b3 is more specific within its cluster than a3 because it lies further down toward the bottom of its cluster. The local granularity of clusters should therefore be taken into account as a feature, yet most existing measures that use the ontology structure (IS-A relations) as their primary information source do not consider it.
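Equation 9 on the Figure 4 example can be sketched directly:

```python
# Equation 9: spec(c) = depth(c) / depth of the cluster containing c.
def spec(depth_c, depth_cluster):
    return depth_c / depth_cluster

print(spec(3, 4))   # a3: depth 3 in a cluster of depth 4 -> 0.75
print(spec(3, 3))   # b3: depth 3 in a cluster of depth 3 -> 1.0
```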
4.3 The Adapted Common Specificity Feature
In the Cluster-Based approach, the common specificity feature of two concept nodes is calculated within the cluster. The least common subsumer (LCS) node of two concept nodes C1 and C2 determines the common specificity of C1 and C2 in the cluster, so the common specificity of two concept nodes is calculated by finding the depth of their LCS node and scaling this depth by the depth D of the cluster:

CSpec(C1,C2) = D − depth(LCS(C1,C2))    (10)

where D is the depth of the cluster. The CSpec(C1,C2) feature thus captures the “common specificity” of the two concepts within the cluster: the smaller the common specificity value of two concept nodes, the more information they share, and thus the more similar they are.
4.4 Rules and Assumptions
As in the NA measure, two features are taken into account: the path length feature and the common specificity feature. In the Cluster-Based approach, however, the local granularity feature is also utilized and integrated into the measure, so the intuitive rules differ slightly from those of the NA measure proposed above:

Rule R3: The semantic similarity scale reflects the degree of similarity of concept pairs comparably within one cluster and across clusters. This rule ensures that mapping cluster 1 onto cluster 2 does not distort the similarity scale.
Rule R4: The semantic similarity must obey each local cluster's similarity rules:
Rule R4.1 (R1): The shorter the distance (path length) between two concept nodes in the ontology, the more similar they are.
Rule R4.2 (R2): Lower-level pairs of concept nodes are semantically closer (more similar) than higher-level pairs.
Rule R4.3 (R3): The maximum similarity is reached when the two concept nodes are the same node in the ontology.
Like the NA measure proposed above, the Cluster-Based measure also satisfies the two assumptions stated earlier (A1 and A2).
4.5 The Proposed Cluster-Based Approach
4.5.1 Single Cluster Similarity
Within a single cluster, the local granularity of the cluster is not considered, since there is only one cluster. Two features are combined: the path length and the common specificity (CSpec) given by Equation 10. When the two concept nodes are the same node, the path length is 1 (using node counting), and the semantic distance must then reach its minimum regardless of the CSpec feature, by rule R4.3 (recall that semantic distance is the inverse of semantic similarity). Therefore, the features should be combined by taking the product of the semantic distance features. Applying rules R3 and R4 and the two assumptions, the proposed measure for a single cluster is:

SemDist(C1,C2) = log( (Path − 1)^α × CSpec^β + k )    (11)

where α > 0 and β > 0 are the contribution factors of the two features; k is a constant; LCS is the least common subsumer of the two concept nodes; and Path is the length of the shortest path between the two concept nodes. To ensure that the distance is positive and the combination is non-linear, k must be greater than or equal to one (k ≥ 1); k = 1 is used in the experiments. When two concept nodes have a path length of 1 (Path = 1) using node counting (i.e., they are the same node in the ontology), their semantic distance (SemDist) equals zero (i.e., maximum similarity), regardless of the common specificity feature.
4.5.2 Cross-Cluster Semantic Similarity
For cross-cluster semantic similarity, measuring the similarity between two concept nodes C1 and C2 involves four cases, depending on the positions of the two nodes within the clusters of the ontology. The cluster with the greatest depth is designated the main cluster (called the primary cluster), and the semantic features from all other clusters are scaled to this cluster's scale; all the remaining clusters are secondary clusters. The four cases are as follows:

Case 1: Similarity within the Primary Cluster: If both concept nodes occur in the primary cluster, the similarity is the same as the single-cluster similarity [Equation 11] discussed in section 4.5.1.

Case 2: Cross-Cluster Similarity: In this case, one of the two concept nodes belongs to the primary cluster while the other is in a secondary cluster, and the LCS of the two nodes is the global root node, which belongs to both clusters. This does not affect the scale of the CSpec feature of the primary cluster, so the common specificity is given by:

CSpec(C1,C2) = CSpecprimary = Dprimary − 1    (12)

where Dprimary is the depth of the primary cluster. The root is the LCS of the two concept nodes in this case. The path between the two concept nodes passes through two clusters with different degrees of granularity: the portion of the path length that lies in the secondary cluster is on a granularity scale different from the primary cluster's, and thus must be converted (leveled) to the primary cluster's scale as follows.
Figure 4. A fragment of two clusters in ontology.
The Cross-Cluster Path Length Feature: The path length between the two concept nodes C1 and C2 is computed by adding the two shortest path lengths from the two nodes to their LCS node (here, the root). For example, in Figure 4, for the two concept nodes (a3, b3), the LCS is the root r, so the path length between a3 and b3 is calculated as:

Path(C1,C2) = d1 + d2 − 1    (13)

where d1 = d(a3, root) and d2 = d(b3, root); d(a3, root) is the path length from the root r to node a3, and similarly d(b3, root) is the path length from r to b3. Notice that the root node is counted twice, so one is subtracted in Equation 13. Note also that the densities or granularities of the two clusters are on different scales, so the portion of the path length in the secondary cluster must be scaled to the primary cluster's scale. The cluster containing a3 has the greater depth, so it is the primary cluster, and the cluster containing b3 is the secondary cluster. The granularity rate of the primary cluster over the secondary cluster for the common specificity feature is:
CSpecRate = (D1 − 1) / (D2 − 1)    (14)

where (D1 − 1) and (D2 − 1) are the maximum common specificity values of the primary and secondary clusters, respectively. The granularity rate PathRate of the path length feature for the primary cluster over the secondary cluster is given by:

PathRate = (2·D1 − 1) / (2·D2 − 1)    (15)
where (2·D1 − 1) and (2·D2 − 1) are the maximum path lengths between any two nodes in the primary and secondary clusters, respectively. Following Rule R3, d2 in Equation 13 is converted to the primary cluster's scale as follows:

d'2 = PathRate × d2    (16)

This new path length d'2 expresses the path length from the second concept node to the LCS on the primary cluster's path length scale. Applying Equation 16, the path length between the two concept nodes on the primary cluster's scale is:

Path(C1,C2) = d1 + PathRate × d2 − 1    (17)

Path(C1,C2) = d1 + ((2·D1 − 1)/(2·D2 − 1)) × d2 − 1    (18)
Finally, the semantic distance between the two concept nodes is given by:

CSpec(C1,C2) = Dprimary − 1    (19)

SemDist(C1,C2) = log( (Path − 1)^α × CSpec^β + k )    (20)
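The Case 2 computation (Equations 12-20) can be sketched for the pair (a3, b3) of Figure 4. The depth values below are read off the figure under node counting and should be treated as illustrative assumptions; α = β = 1 and k = 1 as before.

```python
import math

def sem_dist(path, cspec, alpha=1.0, beta=1.0, k=1):
    """Equation 20 with the thesis defaults alpha = beta = k = 1."""
    return math.log(((path - 1) ** alpha) * (cspec ** beta) + k)

# Figure 4, pair (a3, b3); illustrative depths under node counting.
D1, D2 = 4, 3      # primary / secondary cluster depths
d1, d2 = 4, 4      # d(a3, root) and d(b3, root)

path_rate = (2 * D1 - 1) / (2 * D2 - 1)   # Equation 15: (2*4-1)/(2*3-1) = 1.4
path = d1 + path_rate * d2 - 1            # Equations 13 and 16-18
cspec = D1 - 1                            # Equations 12 and 19
print(round(path, 2), cspec)              # 8.6 3
print(round(sem_dist(path, cspec), 3))
```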
Case 3: Similarity within a Single Secondary Cluster: The third case arises when the two concept nodes are in a single secondary cluster. The semantic features must then be converted to the primary cluster's scales for the two features, Path and CSpec, as follows:

Path(C1,C2) = Path(C1,C2)secondary × PathRate    (21)

CSpec(C1,C2) = CSpec(C1,C2)secondary × CSpecRate    (22)

SemDist(C1,C2) = log( (Path − 1)^α × CSpec^β + k )    (23)

where Path(C1,C2)secondary and CSpec(C1,C2)secondary are the Path and CSpec between C1 and C2 computed within the secondary cluster, and PathRate and CSpecRate are given by Equations 15 and 14, respectively.
Case 4: Similarity within Multiple Secondary Clusters: In this case, the two concept
nodes are in two secondary clusters Csi and Csj (i.e., none of them exists in the primary
cluster). Then one of the two secondary clusters acts temporarily as the primary cluster
to calculate the semantic features (viz. Path and CSpec) using Case 2 above. That is, the
semantic features Path and CSpec are computed according to Case 2 by temporarily
treating Csi and Csj as the primary and secondary clusters, although both are secondary,
in order to scale and unify the CSpec and Path features between them. Then the semantic
distance (SemDist) is computed using Case 3 to scale the features (again) to the
scale-level of the primary cluster (Cp).
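The Case-3 rescaling (Equations 21–23) can be sketched as follows; the depths, feature values, and function names are hypothetical illustration only, reusing the rates defined in Equations 14 and 15.

```python
import math

def cspec_rate(D1, D2):
    return (D1 - 1) / (D2 - 1)            # Eq. 14

def path_rate(D1, D2):
    return (2 * D1 - 1) / (2 * D2 - 1)    # Eq. 15

def semdist_case3(D_primary, D_secondary, path_sec, cspec_sec,
                  alpha=1.0, beta=1.0, k=1):
    """Case 3: both concepts lie in one secondary cluster; their Path and
    CSpec values are rescaled to the primary cluster's scale (Eqs. 21-23)."""
    path = path_sec * path_rate(D_primary, D_secondary)      # Eq. 21
    cspec = cspec_sec * cspec_rate(D_primary, D_secondary)   # Eq. 22
    return math.log((path - 1) ** alpha * cspec ** beta + k)  # Eq. 23

# Hypothetical example: features computed in a secondary cluster of depth 8
# are rescaled to a primary cluster of depth 12 before combining them.
print(round(semdist_case3(12, 8, 4, 5), 3))
```

Case 4 then reduces to applying Case 2 between the two secondary clusters and passing the result through this Case-3 rescaling.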
4.6. Evaluation
For the experiments, the MeSH and SNOMED-CT ontologies were used as information
sources for the semantic measures, and two datasets were used for evaluation.
4.6.1 Datasets
The first dataset is Dataset 1, shown in Table 1. Another biomedical dataset was also
used, containing 36 MeSH term pairs [8]. The human scores in this dataset are the
averaged scores of medical doctors. The UMLSKS browser [34] was used for SNOMED-CT
terms, and the MeSH Browser [39] for MeSH terms. Table 3 shows Dataset 2 along with
the human scores and the scores of four measures calculated using the MeSH ontology.
The pairs with scores "*" are excluded from the experiments.
Table 3. Dataset 2: 36 medical term pairs with five similarity scores: Human, Path length (PATH), Wu and Palmer (WUP), Leacock and Chodorow (LCH), and the proposed measure (SemDist); using the MeSH ontology

Concept 1 | Concept 2 | Human | PATH | WUP | LCH | SemDist
Anemia | Appendicitis | 0.031 | 8 | 0.364 | 1.099 | 4.263
Meningitis | Tricuspid Atresia | 0.031 | 8 | 0.364 | 1.099 | 4.263
Sinusitis | Mental Retardation | 0.031 | 8 | 0.364 | 1.099 | 4.263
Dementia | Atopic Dermatitis | 0.062 | 9 | 0.333 | 0.981 | 4.394
Acquired Immunodeficiency Syndrome | Congenital Heart Defects | 0.062 | 7 | 0.400 | 1.232 | 4.111
Bacterial Pneumonia | Malaria | 0.156 | 8 | 0.364 | 1.099 | 4.263
Osteoporosis | Patent Ductus Arteriosus | 0.156 | 9 | 0.333 | 0.981 | 4.394
Amino Acid Sequence | Anti Bacterial Agents | 0.156 | 12 | 0.154 | 0.693 | 4.804
Otitis Media | Infantile Colic | 0.156 | 10 | 0.308 | 0.876 | 4.511
Hyperlipidemia | Hyperkalemia | 0.156 | 5 | 0.667 | 1.569 | 3.497
Neonatal Jaundice | Sepsis | 0.187 | 8 | 0.364 | 1.099 | 4.263
Asthma | Pneumonia | 0.375 | 4 | 0.727 | 1.792 | 3.219
Hypothyroidism | Hyperthyroidism | 0.406 | 3 | 0.800 | 2.079 | 2.833
Sarcoidosis | Tuberculosis | 0.406 | 11 | 0.286 | 0.78 | 4.615
Sickle Cell Anemia | Iron Deficiency Anemia | 0.437 | 6 | 0.667 | 1.386 | 3.584
Adenovirus | Rotavirus | 0.437 | 6 | 0.615 | 1.386 | 3.714
Lactose Intolerance | Irritable Bowel Syndrome | 0.468 | 6 | 0.667 | 1.386 | 3.584
Hypertension | Kidney Failure | 0.500 | 9 | 0.333 | 0.981 | 4.394
Diabetic Nephropathy | Diabetes Mellitus | 0.500 | 3 | 0.800 | 2.079 | 2.833
Pulmonary Valve Stenosis | Aortic Valve Stenosis | 0.531 | 3 | 0.833 | 2.079 | 2.708
Hepatitis B | Hepatitis C | 0.562 | 3 | 0.857 | 2.079 | 2.565
Vaccines | Immunity | * | * | * | * | *
Psychology | Cognitive Science | * | * | * | * | *
Failure to Thrive | Malnutrition | 0.625 | 8 | 0.364 | 1.099 | 4.263
Urinary Tract Infection | Pyelonephritis | 0.656 | 5 | 0.667 | 1.569 | 3.497
Migraine | Headache | 0.718 | 9 | 0.429 | 0.981 | 4.291
Myocardial Ischemia | Myocardial Infarction | 0.750 | 2 | 0.923 | 2.485 | 1.946
Carcinoma | Neoplasm | 0.750 | 4 | 0.667 | 1.792 | 3.332
Breast Feeding | Lactation | 0.843 | 1 | 1.000 | 3.178 | 0.000
Seizures | Convulsions | 0.843 | 1 | 1.000 | 3.178 | 0.000
Pain | Ache | 0.875 | 1 | 1.000 | 3.178 | 0.000
Malnutrition | Nutritional Deficiency | 0.875 | 1 | 1.000 | 3.178 | 0.000
Down Syndrome | Trisomy 21 | 0.875 | 1 | 1.000 | 3.178 | 0.000
Measles | Rubeola | 0.906 | 1 | 1.000 | 3.178 | 0.000
Antibiotics | Antibacterial Agents | 0.937 | 1 | 1.000 | 3.178 | 0.000
Chicken Pox | Varicella | 0.968 | 1 | 1.000 | 3.178 | 0.000
4.6.2 Experiments and Results
All the measures use node counting for the path lengths and depths of concept nodes. As
there is no training phase, the two features (Path and CSpec) are assumed to contribute
equally to similarity; that is, the default parameters (α=1 and β=1) are used in all
experiments. Out of the 30 pairs of Dataset 1, only 25 pairs were found in MeSH and 29
pairs in SNOMED-CT. For the four pairs that were not found in MeSH but were found in
SNOMED-CT, the average distance/similarity values of the concept nodes most related
to each of them were calculated, so there were 29 pairs in both MeSH and SNOMED-CT in
total. Out of the 36 pairs of Dataset 2, 34 pairs were found in SNOMED-CT and all 36
pairs were found in MeSH, so the 34 pairs that exist in both ontologies were used in the
experiments (the two pairs that were not found are marked "*" in Table 3). Furthermore,
Dataset 1 and Dataset 2 were combined into one larger dataset (Dataset 3). The results
of the correlations with human scores using the three datasets, experimented on the
MeSH and SNOMED-CT ontologies, are shown in Tables 4 and 5 and Figures 5 and 6.
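The evaluation metric used throughout (correlation with human scores) can be reproduced with a small Pearson-correlation helper. The four pairs below are taken from Table 3 (human vs. SemDist scores); since SemDist is a distance, it correlates negatively with similarity, so the absolute correlation is reported.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Human similarity vs. SemDist distance for four pairs from Table 3:
# Anemia-Appendicitis, Asthma-Pneumonia, Hepatitis B-Hepatitis C, Pain-Ache
human   = [0.031, 0.375, 0.562, 0.875]
semdist = [4.263, 3.219, 2.565, 0.000]

r = pearson(human, semdist)
print(round(abs(r), 2))   # → 0.96 (distance correlates negatively with similarity)
```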
Table 4. Absolute correlations with human scores for all measures using SNOMED-CT on Dataset 1, Dataset 2, and Dataset 3
Measure | Dataset 1 (rank) | Dataset 2 (rank) | Dataset 3 (rank)
SemDist | 0.665 (1) | 0.735 (1) | 0.726 (1)
Leacock and Chodorow | 0.431 (2) | 0.677 (3) | 0.600 (2)
Wu and Palmer | 0.296 (3) | 0.686 (2) | 0.498 (3)
Path length | 0.254 (4) | 0.586 (4) | 0.422 (4)
Average | 0.412 | 0.671 | 0.562
Table 5. Absolute correlations with human scores for all measures using MeSH on Dataset 1, Dataset 2, and Dataset 3

Measure | Dataset 1 (rank) | Dataset 2 (rank) | Dataset 3 (rank)
SemDist | 0.863 (1) | 0.825 (1) | 0.841 (1)
Leacock and Chodorow | 0.857 (2) | 0.820 (2) | 0.836 (2)
Wu and Palmer | 0.794 (3) | 0.811 (3) | 0.808 (3)
Path length | 0.744 (4) | 0.765 (4) | 0.764 (4)
Average | 0.815 | 0.805 | 0.812
Figure 5. Results of correlations with human scores for four measures using
SNOMED-CT.
Figure 6. Results of correlations with human scores for four measures using MeSH.
Table 6. The improvements that SemDist achieved over the average of the three other similar techniques using SNOMED-CT with three datasets
Correlations of | Dataset 1 | Dataset 2 | Dataset 3
Average of the 3 similar measures | 0.327 | 0.650 | 0.507
SemDist | 0.665 | 0.735 | 0.726
Improvement | 103% | 13% | 43%
Table 7. The improvements that SemDist achieved over the average of the three other similar techniques using MeSH with three datasets
Correlations of | Dataset 1 | Dataset 2 | Dataset 3
Average of the 3 similar measures | 0.798 | 0.799 | 0.803
SemDist | 0.863 | 0.825 | 0.841
Improvement | 8.1% | 3.3% | 4.8%
4.6.3 Discussion
Tables 4 and 5 show that the proposed cluster-based measure, SemDist, achieves the best
correlations with human similarity scores and ranks first with both ontologies on all
three datasets. These results confirm that SemDist is effective in computing semantic
similarity and outperforms the three other measures in all six experiments. The Leacock
and Chodorow measure achieves the second-best correlations in five of the six
experiments, while the Wu and Palmer measure gives the third-best correlations in five
of six experiments and the second-best in one. The Path length measure achieves the
lowest correlations in all six experiments. These results seem reasonable since the
Leacock and Chodorow measure uses path length scaled by the depth of the ontology and
thus outperforms both the Wu and Palmer measure, which uses only depths of concept
nodes, and the Path length measure. More specifically, the Leacock and Chodorow measure
computes similarity using the path length scaled by the maximum path length of two
concept nodes in the ontology, whereas the Wu and Palmer measure uses the depth of the
LCS of the two concept nodes scaled by the sum of the depths of the two concept nodes.
It is noticeable that SemDist outperforms the other measures more significantly with
SNOMED-CT than with MeSH because of the higher specificity of SNOMED-CT (with a depth
of around 18) compared to MeSH (with a depth of around 12).
The average correlations of the measures in Tables 4 and 5, and the improvements that
SemDist achieved over those averages, are shown in Tables 6 and 7 for SNOMED-CT and
MeSH, respectively. From the results in Tables 6 and 7, we observe that SemDist
achieved an average improvement of 53% using SNOMED-CT, while using MeSH the average
improvement is 5.4%. This suggests that SemDist is a good choice for ontologies with
high specificity, where the new CSpec feature has more positive impact on the
correlation results. Even with MeSH, where the average improvement is only 5.4%, this
improvement can be considered significant given the limited resources of human-scored
datasets in this domain. Furthermore, Tables 4 and 5 show that all four measures
perform better on MeSH than on SNOMED-CT.
5. USING MEDLINE AS STANDARD CORPUS FOR SEMANTIC SIMILARITY
OF CONCEPTS IN THE BIOMEDICAL DOMAIN
5.1 The Need for a Standard Corpus in the Biomedical Domain
After the work of Rada et al. [22], a number of ontology-structure-based measures
[10,12,30,17] that use IS-A relations of concepts, and information-based measures
[9,11,23] that use both IS-A relations and a corpus-based feature (information content),
have been proposed and applied using WordNet. Typically, the information-based measures
use standard corpora as secondary information sources to compute the similarity between
two given terms. However, there is no standard corpus in the biomedical domain to serve
as a secondary information source for information-based measures.
In this work, the feasibility of using MEDLINE as a standard corpus, together with the
MeSH ontology, for measuring semantic similarity between biomedical concepts is
investigated. Most semantic similarity work in the biomedical domain uses only IS-A
relations in an ontology (e.g., MeSH, SNOMED-CT) for computing the similarity between
biomedical terms. In this work, however, information-based semantic measures are used
that employ a biomedical text corpus in computing the similarity between terms.
5.2 Semantic Similarity
The first information-based semantic similarity approach was introduced by Resnik [23],
in which the similarity of two concepts is the maximum of the information content of
the concepts that subsume them in the taxonomy hierarchy [Equation 24]. The information
content of a concept depends on the probability of encountering an instance of that
concept in a corpus, and is calculated as the negative log likelihood of that
probability [Equation 28]. The probability of a concept is determined by the frequency
of occurrence of the concept and its subconcepts in the corpus [Equation 27]. As the
information-based measures use corpus statistics, these similarity measures can adapt
well to particular applications when suitable corpora are used. For more information
about the pure information-based approach, please refer to Resnik's work [23].
Following Resnik's work, some information-based measures were introduced to improve the
performance of the pure information-based approach by considering the weight/strength
of the edges/links between concept nodes in the ontology. The links between ontology
nodes are not equal in terms of strength/weight, and link strength can be determined by
local density, information content, and link type [9,26]. The measure of Jiang and
Conrath [9] determines the similarity of two concept nodes by calculating the "weighted
path" between them, summing up all weighted links between them [Equation 25]. The
measure of Lin [Equation 26] is similar to the measure of Wu and Palmer [Equation 5];
however, the Lin measure uses the information content of concept nodes instead of their
depths. In effect, the depth is replaced by a "weighted depth". The following are the
formulas of the Resnik, Jiang and Conrath, and Lin measures. They all use the
information content (IC) of the individual concept nodes C1 and C2 and/or of the LCS
(least common subsumer) of C1 and C2:
1) Resnik
Sim(C1,C2) = IC(LCS(C1,C2)) (24)
2) Jiang and Conrath
Sim(C1,C2) = IC(C1) + IC(C2) − 2 × IC(LCS(C1,C2)) (25)
3) Lin
Sim(C1,C2) = 2 × IC(LCS(C1,C2)) / (IC(C1) + IC(C2)) (26)
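Given IC values, Equations 24–26 are one-liners; the IC values below are hypothetical, chosen only to illustrate the three formulas. Note that Equation 25 yields a distance-like quantity (smaller means more similar), while the other two yield similarities.

```python
def resnik(ic_lcs):
    return ic_lcs                                  # Eq. 24: IC of the LCS alone

def jiang_conrath(ic_c1, ic_c2, ic_lcs):
    return ic_c1 + ic_c2 - 2 * ic_lcs              # Eq. 25: distance (0 = identical)

def lin(ic_c1, ic_c2, ic_lcs):
    return 2 * ic_lcs / (ic_c1 + ic_c2)            # Eq. 26: similarity in (0, 1]

# Hypothetical IC values for two concepts and their LCS
ic_c1, ic_c2, ic_lcs = 5.0, 4.0, 2.0
print(resnik(ic_lcs),
      jiang_conrath(ic_c1, ic_c2, ic_lcs),
      round(lin(ic_c1, ic_c2, ic_lcs), 3))   # → 2.0 5.0 0.444
```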
Table 8. Format of MH_Freq_count file
MeSH Heading | MH | MJ
Pressure | 41324 | 2637
Hydrolysis | 41318 | 35
Haplorhini | 41256 | 3311
Colonic Neoplasms | 41207 | 1619
Energy Metabolism | 41203 | 10902
Hela Cells | 41007 | 409
Heart Diseases | 40984 | 4385
Brain Chemistry | 40972 | 12420
Uterine Cervical Neoplasms | 40969 | 3133
Thrombosis | 40929 | 3562
5.3 Evaluation
5.3.1 Information Sources
In order to evaluate these semantic measures in the biomedical domain, a biomedical
ontology, a biomedical text corpus, and a test dataset of biomedical term pairs are
needed, with each term pair scored for similarity by human domain experts. Then, for
each pair, a similarity score was computed by each of the three methods (Equations 24,
25, 26), and the correlation between the computed similarity scores and the human
scores was found. The MeSH ontology, one of the core ontologies in UMLS, was used to
obtain the hierarchical relations of concepts, and MEDLINE was used as the text corpus
to obtain the occurrence frequencies of concepts. The frequencies of MeSH concepts in
MEDLINE are stored in files (available from the US National Library of Medicine (NLM)
at http://mbr.nlm.nih.gov/Download/index.shtml#Freq). For each MeSH heading, there are
two types of frequency:
MH: frequency of that heading as a main heading in the MEDLINE corpus.
MJ: frequency of that heading as a major heading in the MEDLINE corpus.
Both types of frequencies are used in the experiments. The MH_freq_count file contains
the frequencies of all MeSH headings. The format of this file is shown in Table 8. Each
row shows one MeSH heading in the first column, its frequency as a main heading (MH),
and its frequency as a major heading (MJ) in MEDLINE.
The information content technique in the biomedical domain differs slightly from
Resnik's original technique [23] in the way the frequencies of MeSH headings are
counted in MEDLINE: each MeSH heading that occurs in a document is counted only once
for that document.
The concept probability of a concept (MeSH heading) c is computed as follows:
p(c) = frq(c) / N (27)
where frq(c) is the frequency of concept c, obtained by summing the frequencies of c
and all its subconcepts in the corpus, and N is the total frequency of all concepts. The information
content (IC) of a concept c is then given by:
IC(c) = - log p(c) (28)
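A minimal sketch of Equations 27–28 using a few of the MH counts from Table 8. For brevity this restricts N to the listed headings and omits the summation over subconcepts that frq(c) requires in the full method.

```python
import math

# MH frequencies for a few MeSH headings, taken from Table 8
mh_freq = {
    "Pressure": 41324,
    "Hydrolysis": 41318,
    "Haplorhini": 41256,
    "Thrombosis": 40929,
}

N = sum(mh_freq.values())          # total frequency mass (Eq. 27's N, restricted here)

def information_content(concept):
    p = mh_freq[concept] / N       # Eq. 27: p(c) = frq(c) / N
    return -math.log(p)            # Eq. 28: IC(c) = -log p(c)

print(round(information_content("Thrombosis"), 3))   # ≈ 1.393
```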
5.3.2 Dataset
Dataset 2 containing 36 MeSH term pairs was used in experiments as a strictly semantic
similarity dataset.
Table 9. Absolute correlations of information-based measures with human scores

Measure | MeSH Main Heading (MH) | MeSH Major Heading (MJ)
Resnik | 0.731 | 0.731
Lin | 0.781 | 0.786
Jiang and Conrath | 0.808 | 0.820
Average | 0.773 | 0.779
Figure 7. Illustration of the three information-based measures with human scores.
Table 9 shows the correlations with the human scores for the three information-based
measures, using the MH and MJ frequencies for calculating the information content of
each concept.
5.3.3 Experimental Results
Two kinds of frequencies (MH and MJ) were used to calculate the IC of concepts. Table 9
contains the correlations with human scores for the three measures with IC calculated
according to the two types of frequencies (viz. MH and MJ), and Figure 7 illustrates
these results. The results in Table 9 show that all measures perform very well, having
fairly high correlations with human ratings using both kinds of frequencies/ICs. The
measure of Jiang and Conrath achieves the highest correlation with human scores, while
the measure of Resnik gives the lowest, though the differences among the three methods
are not very significant. One reason for the lower correlations of Resnik's measure
compared to the other two is that Resnik's measure is based only on the IC of the LCS
of the two concepts [Equation 24], ignoring the specificity of the concepts themselves,
whereas the other two measures are based on a combination of three ICs, namely the IC
of concept 1, the IC of concept 2, and the IC of their LCS (Equations 25 and 26). The
average correlations of all measures using the MJ and MH frequencies are very close
(Table 9, Figure 7). Each measure produces very close correlations using MH and MJ,
which indicates that, in general, term usage and frequency distributions in MEDLINE as
MH and MJ are fairly consistent. Thus, these results demonstrate that MEDLINE can
provide very good insight into the semantic similarity between biomedical (MeSH) terms.
It should be mentioned that not every biomedical term is a MeSH heading/concept or can
be found in the MEDLINE frequency files. Yet, MEDLINE is the largest and most
comprehensive text and literature database for biomedical research, and thus it can be
considered the most reliable information source.
Determining the similarity between biomedical terms is an important task needed in many
applications. For example, in information retrieval in the biomedical domain, there is
a need to determine the best match between the query keywords and the retrieved
documents. Integrating multiple resources for information extraction and knowledge
discovery is another application that can benefit greatly from semantic similarity.
5.4 Discussion
This work lays a first brick for further advances and more structure in this task.
Previous semantic similarity work in the biomedical domain used ontologies only as
primary information sources. The main contribution of this work is the application of
information-based semantic similarity measures to the biomedical domain using MEDLINE,
the most comprehensive resource of textual information in this domain. The experimental
results show that MEDLINE is an effective resource for computing semantic similarity
between biomedical terms and concepts, and demonstrate that information-based
similarity measures can achieve high correlations with human similarity scores.
6. THE PROPOSED COMBINATION-BASED (HYBRID) APPROACH
6.1 Motivation
This section presents a new analysis/view of the semantic features that make up
semantic measures, as well as an analysis of the strengths and weaknesses of semantic
similarity measures based on this view. To combine the strengths of several existing
measures and complement their weaknesses in semantic computing, a combination-based
measure is proposed as a hybrid measure (Hybrid) that uses IS-A relations in the
ontology information source for the path length and depth features, and uses a corpus
for the information content of concept nodes to augment these two features. This work
also shows how to use corpus statistics/IC effectively in semantic computing in the
general English domain.
6.2 Semantic Similarity Features
6.2.1 Path Feature and Depth Feature
The first and most primitive approach to measuring the semantic distance/similarity
between two concept nodes in an ontology is to find the shortest distance between their
nodes. This approach, called Path length, was proposed by Rada et al. [22] as a
potential approach in the biomedical domain. Since then, a number of ontology-based
similarity approaches have been introduced which use IS-A relations in the ontology as
the primary information source. Most of these measures can be roughly divided into two
groups. The first group includes ontology-structure-based measures (i.e., Path length
[22], Leacock and Chodorow [10], Wu and Palmer [30]), and the second group includes
information-based measures that use the ontology structure and corpus-based features
(i.e., Resnik [23], Jiang and Conrath [9], Lin [11]). Both groups use IS-A relations in
the ontology as the information source for computing the similarity. The two main
features used by the measures in both groups are (1) the path feature and (2) the depth
feature. The path feature can be measured by (i) simple node counting, (ii) edge/link
counting, or (iii) a "weighted path" (Jiang and Conrath [9]) using the IC of concept
nodes. The weighted path between two concept nodes C1 and C2 is measured by summing all
weighted links on the shortest path between C1 and C2. The depth feature, on the other
hand, can be measured by node counting, edge/link counting, or by the "weighted depth",
which was first developed by Resnik [23]. The weighted depth, or information-based,
approach measures the similarity of two concept nodes by finding the IC of their least
common subsumer (LCS) node in the ontology. The information content of a concept node
depends on the probability of encountering an instance of it in a corpus, and is
calculated as the negative log likelihood of that probability [Equation 28], which is
determined by the frequency of occurrence of the concept and its subconcepts in the
corpus [Equation 27].
Table 10. Similarity features of 8 similarity measures
Measure | Path | Depth
Path length | * | none
Leacock and Chodorow | * | none
Wu and Palmer | none | *
Resnik | none | **
Jiang and Conrath | ** | none
Lin | none | **
Li et al. | * | *
Hybrid (proposed) | * | **

* denotes path length or depth length; ** denotes weighted path or weighted depth; "none" denotes that the feature is not used by the measure.
The path feature is an important feature that contributes significantly to semantic
similarity. Let us consider the fragment of an ontology in Figure 4 containing concept
nodes ai and bi. The Path length measure and the Leacock and Chodorow measure do not
use the depth feature as a property of concepts; hence they give the same similarity
for pairs that have the same path length (e.g., pair a2-a5 and pair a1-b1) regardless
of their specificity in the ontology. Table 10 summarizes the features used by seven
existing similarity measures along with the proposed measure.
Six of the measures in Table 10 use either the path or the depth feature but not both;
they can therefore be grouped into (1) path-based measures (Path length, Leacock and
Chodorow, and Jiang and Conrath) and (2) depth-based measures (Wu and Palmer, Resnik,
and Lin).
Li et al. is the one measure that combines the path length and depth features. However,
it has limitations, as discussed in Section 3.1. The newly proposed measure combines
the weighted path and weighted depth features in one measure, as path length and depth
length are special cases of weighted path and weighted depth. In the weighted path and
weighted depth approaches, the links between ontology nodes are not equal in terms of
strength/weight, and link strength can be determined by local density, information
content, and link type [26]. However, the weighted path approach of Jiang and Conrath
has a limitation in that it takes into account the individual ICs of individual concept
nodes; it is therefore affected by the use of a small corpus, as some words may not
occur in small corpora, and such words will always have minimal similarity with any
other word. By using path length, all the relationships between the concepts present in
the ontology can be seen intuitively. Therefore, node counting is used for the path
feature (path length). Besides the path length feature, the weighted depth is used as a
kind of specificity of concept nodes in the measure.
6.2.2 The Adapted Common Specificity Feature
The LCS node of two given concept nodes determines their common specificity in the
ontology. The common specificity of two concept nodes, based on the ontology structure
and a corpus, is defined as follows:
CSpec(C1,C2) = ICmax - IC(LCS(C1,C2)) (29)
where ICmax is the maximum IC of the concept nodes in the ontology. The CSpec feature
determines the common specificity of two concept nodes in the ontology based on the
given corpus and the ontology structure. The smaller the common specificity value of
two concept nodes, the more information they share, and thus the more similar they are.
When the IC of the LCS of two concept nodes C1 and C2 reaches ICmax, that is,
IC(LCS(C1,C2)) = ICmax,
then the two concept nodes reach the highest common specificity, which equals zero:
CSpec(C1,C2) = 0.
6.3 The Combination-Based (Hybrid) Approach
One of the contributions of this work is the adapted common specificity feature
integrated into the proposed measure, which allows it to perform stably with any corpus
size. The proposed measure also satisfies the three single-ontology intuitive rules
(R1, R2, R3) and the two assumptions (A1 and A2) in Section 3.3. The proposed Hybrid
approach is as follows:
SemDist(C1, C2) = log((Path − 1)^α × CSpec^β + k) (30)
where α > 0 and β > 0 are the contribution factors of the two features, Path and
CSpec(C1, C2), and k is a constant. Path is the shortest path length between the two
concept nodes using node counting. If k is zero, the combination is linear; to ensure
that the distance is positive and the combination is non-linear, k must be greater than
or equal to one (k ≥ 1). When two concept nodes have a path length of 1 using node
counting (Path = 1), they have the minimum semantic distance (i.e., maximum
similarity), which equals zero regardless of the common specificity feature.
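Equations 29–30 can be sketched directly; the IC values below are hypothetical, and the defaults α = 3, β = 1 anticipate the tuned values reported later. With k = 1, Path = 1 gives log(1) = 0, matching the boundary behavior described above.

```python
import math

def cspec(ic_max, ic_lcs):
    """Adapted common specificity (Eq. 29)."""
    return ic_max - ic_lcs

def semdist_hybrid(path, ic_max, ic_lcs, alpha=3.0, beta=1.0, k=1):
    """Hybrid semantic distance (Eq. 30); path uses node counting."""
    return math.log((path - 1) ** alpha * cspec(ic_max, ic_lcs) ** beta + k)

# Hypothetical values: ic_max is the maximum IC in the ontology
print(semdist_hybrid(1, ic_max=10.0, ic_lcs=4.0))   # → 0.0 (Path = 1, k = 1)
print(round(semdist_hybrid(5, ic_max=10.0, ic_lcs=4.0), 3))
```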
6.4 Evaluation
6.4.1 Information Source
WordNet 2.0, a semantic lexicon for the English language developed at Princeton
University, was used as the primary information source. The Perl module
WordNet::Similarity, developed by Pedersen et al. [19], was used, building on its
existing implementations of the measures. Resnik's technique [23] was used to calculate
the IC of concepts, particularly for nouns, based on their frequencies. In these
experiments, the Brown corpus [7] and the SemCor corpus [15] were used. The frequency
frq(c) of a concept node c was computed by counting all occurrences in the corpus of
the concepts contained in or subsumed by the concept node c. The concept node
probability is then computed directly as:
p(c) = frq(c) / N (31)
where N is the total number of nouns in the corpus that are also present in WordNet.
The information content of concept c is then given by:
IC(c) = - log p(c) (32)
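The key step in this counting scheme is that an occurrence of a concept also counts toward every concept that subsumes it. A minimal sketch on a tiny hypothetical IS-A fragment (the taxonomy and counts are invented for illustration):

```python
import math

# Toy IS-A taxonomy: child -> parent (hypothetical WordNet-like fragment)
parent = {"car": "vehicle", "truck": "vehicle", "vehicle": "entity"}
own_count = {"car": 30, "truck": 10, "vehicle": 5, "entity": 0}

# frq(c): own occurrences plus occurrences of everything subsumed by c
frq = dict(own_count)
for node, count in own_count.items():
    p = parent.get(node)
    while p is not None:               # propagate each node's count to all ancestors
        frq[p] += count
        p = parent.get(p)

N = sum(own_count.values())            # Eq. 31's N: total (noun) occurrences
ic = {c: -math.log(frq[c] / N) for c in frq if frq[c] > 0}
print(frq["vehicle"], round(ic["vehicle"], 3))   # → 45 0.0
```

Note that the root subsumes everything, so its probability is 1 and its IC is 0, as expected from Equation 32.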
6.4.2 Datasets
There are two well-known benchmark datasets of term pairs scored by human experts for
semantic similarity in general English. The first set (RG), collected by Rubenstein and
Goodenough [25], covers 51 subjects and contains 65 pairs of words rated on a scale
from "highly synonymous" to "semantically unrelated" (Table 11 contains only a subset
of this dataset). The second dataset (MC) was collected by Miller and Charles [14] in a
similar experiment conducted 25 years after Rubenstein and Goodenough collected the RG
set; it contains 30 pairs extracted from the 65 pairs of RG and covers 38 human
subjects.
Table 11. A subset of human mean ratings for the Rubenstein-Goodenough (RG) set

First 5 pairs (least similar):
cord - smile | 0.02
rooster - voyage | 0.04
noon - string | 0.04
fruit - furnace | 0.05
autograph - shore | 0.06

Last 5 pairs (most similar):
cushion - pillow | 3.84
cemetery - graveyard | 3.88
automobile - car | 3.92
midday - noon | 3.94
gem - jewel | 3.94
Table 12. Training dataset: 19 medical term pairs of Dataset 2 found in WordNet

Concept 1 | Concept 2 | Human
Anemia | Appendicitis | 0.031
Sinusitis | Mental Retardation | 0.031
Dementia | Atopic Dermatitis | 0.062
Osteoporosis | Patent Ductus Arteriosus | 0.156
Hypothyroidism | Hyperthyroidism | 0.406
Sarcoidosis | Tuberculosis | 0.406
Adenovirus | Rotavirus | 0.437
Hypertension | Kidney Failure | 0.500
Hepatitis B | Hepatitis C | 0.562
Vaccines | Immunity | 0.593
Psychology | Cognitive Science | 0.593
Urinary Tract Infection | Pyelonephritis | 0.656
Migraine | Headache | 0.718
Carcinoma | Neoplasm | 0.750
Breast Feeding | Lactation | 0.843
Seizures | Convulsions | 0.843
Pain | Ache | 0.875
Down Syndrome | Trisomy 21 | 0.875
Measles | Rubeola | 0.906
6.4.3 Experimental Results
Most of the previous relevant work used the MC dataset to validate and compare
approaches [9,11,12,23] because of missing concepts in previous versions of WordNet. MC
could be used as a training dataset and RG as a testing dataset; however, MC is a
subset of RG, and all 65 pairs of the RG dataset can now be found in WordNet 2.0. The
whole RG dataset was therefore used to test the Hybrid (proposed) measure and compare
it with the other measures. A training step was also needed to find optimal parameters
for the proposed measure, and it is more effective to use a dataset completely
different from RG for training. Given the lack of such datasets, Dataset 2 from the
biomedical domain [8] was used. The human scores in this dataset are the averaged
scores of medical doctors. As this dataset contains biomedical terms, some of its pairs
cannot be found in WordNet. Table 12 shows the part of this dataset that can be found
in WordNet, containing 19 biomedical term pairs. These pairs were then used to train
the proposed measure for optimal parameters. Table 13 shows the experimental results
using the two corpora.
Table 13. Absolute correlations of the proposed measure with human ratings using the training dataset with different parameter values

Parameter values | α=1, β=1, k=1 | α=2, β=1, k=1 | α=3, β=1, k=1 | α=3, β=1, k=2 | α=3, β=1, k=3
SemDist (SemCor corpus) | 0.717 | 0.741 | 0.747 | 0.743 | 0.739
SemDist (Brown corpus) | 0.698 | 0.729 | 0.739 | 0.735 | 0.733
With α = 3 and β = 1, the performance of SemDist is very close across the two corpora
and reaches the highest correlations with human scores using either the SemCor corpus
or the Brown corpus (Table 13). The results in Table 13 also indicate that α should be
greater than β to obtain higher correlations. This implies that the Path feature
contributes more to the semantic similarity than the CSpec feature [Equation 29]. The
testing was conducted using the RG test set (65 pairs) with the SemCor and Brown
corpora; the results are in Table 14.
Table 14. Absolute correlations with human ratings for the proposed measure using the RG dataset (65 pairs)

Measure | Optimal parameters | Correlation
SemDist (SemCor corpus) | α=3, β=1, k=1 | 0.873
SemDist (SemCor corpus) | α=3, β=1, k=2 | 0.873
SemDist (SemCor corpus) | α=3, β=1, k=3 | 0.874
SemDist (Brown corpus) | α=3, β=1, k=1 | 0.872
SemDist (Brown corpus) | α=3, β=1, k=2 | 0.874
SemDist (Brown corpus) | α=3, β=1, k=3 | 0.874
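The parameter tuning described above can be sketched as a simple grid search over (α, β, k), maximizing the absolute Pearson correlation with human scores on a training set. The training tuples below are hypothetical placeholders, not the thesis data, and serve only to make the search runnable.

```python
import math
from itertools import product

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def semdist(path, cspec, alpha, beta, k):
    return math.log((path - 1) ** alpha * cspec ** beta + k)   # Eq. 30

# Hypothetical training pairs: (path, cspec, human similarity score)
train = [(2, 1.0, 0.90), (4, 2.5, 0.55), (7, 4.0, 0.20), (10, 6.0, 0.05)]

# Pick the (alpha, beta, k) triple with the highest |correlation| on the training set
best = max(
    product([1, 2, 3], [1], [1, 2, 3]),
    key=lambda p: abs(pearson([h for _, _, h in train],
                              [semdist(d, c, *p) for d, c, _ in train])),
)
print(best)
```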
The RG results in Table 14 show that the Hybrid measure produces good and stable
performance on this set. Furthermore, the correlation results in Tables 13 and 14 show
that it performs well with any corpus size and reaches very good correlations on the RG
dataset. The performance of the other information-based measures was also investigated
on the two corpora using the RG dataset and WordNet 2.0; the results are in Table 15.
They show clearly that the Hybrid measure outperforms the other information-based
measures on both corpora. Most of the measures perform significantly better using the
Brown corpus than using the SemCor corpus. Moreover, the Resnik measure shows good
stability in performance across both corpora compared with Jiang and Conrath and Lin.
Table 15. Absolute correlations with RG human ratings using the SemCor and Brown corpora and WordNet 2.0 for four combination-based measures

Measure | Using SemCor | Using Brown
SemDist | 0.874 | 0.874
Resnik | 0.807 | 0.830
Jiang and Conrath | 0.650 | 0.854
Lin | 0.728 | 0.853
Table 15 shows that the Hybrid measure gives the highest and most stable correlations
with human scores on the two corpora. SemCor [15] is a sense-tagged subset of the Brown
corpus [7]; the words in the corpus have been manually tagged with their appropriate
senses by human experts. However, this corpus (~200,000 words) is considerably smaller
than the Brown corpus (~1 million words), which is plain text with no annotations. The
measures of Jiang and Conrath and Lin do not perform as well using the SemCor corpus as
they do with the Brown corpus (Table 15). Moreover, the Resnik measure performs better
than the Jiang and Conrath and Lin measures using SemCor (Table 15) precisely because
the Lin and Jiang and Conrath measures use the IC of individual concept nodes, and in a
small corpus like SemCor some words do not occur, affecting the performance of such
measures. The Hybrid measure is not affected by small corpus size because it uses the
IC of the LCS node of the two concept nodes.
6.5. Discussion
The Hybrid measure presented here performs quite well, attaining an impressive
correlation (0.874) that is the best result reported to date against human ratings
on the benchmark RG dataset. The proposed measure combines the strengths of
several traditional approaches. It uses a new feature (CSpec), which contributes
to performance by scaling the IC of the least common subsumer of the two given
concepts against the maximum IC over all concepts. The comparative experimental
results demonstrate that the measure is very competitive and outperforms the
existing ontology-based measures on benchmark datasets. Furthermore, the Hybrid
measure can be tuned for optimum performance in a specific domain through an
effective training strategy, and it performs well with corpora of any size.
7. THE PROPOSED CROSS-CLUSTER APPROACH FOR SEMANTIC
SIMILARITY OF CONCEPTS IN WORDNET
7.1. The Need for a Cross-Cluster Semantic Approach for WordNet
In this work, the term "cluster" denotes a taxonomy in WordNet, while in
biomedical ontologies "cluster" refers to a taxonomy, category tree, or subtree
(e.g., in UMLS [31]). In WordNet, as discussed in more detail later, all noun
taxonomies are grouped into one "noun cluster" and all verb taxonomies into one
"verb cluster". This work is concerned only with the semantics of nouns and verbs.
Figure 8. Fragment of Ontology
Because the noun cluster of WordNet was the first to be richly developed, most
researchers limited their work to this cluster [9,10,12,23,26]. Resnik and Diab
[24] were the first to examine the similarity of verbs within the verb cluster;
they considered verb similarity to differ from noun similarity in several
respects, because verb representations are generally viewed as possessing
properties that nouns do not, such as syntactic subcategorization restrictions,
selectional preferences, and event structure, with dependencies among these
properties. However, no previous work measures the semantic similarity of all
open-class words on one scale system using ontology-based measures. The scale
system matters because the similarity scales of the noun cluster and the verb
cluster differ: as discussed above, the average depth of the noun cluster is 13
while the average depth of the verb cluster is 2. Applications such as word sense
disambiguation and IR should not be limited to noun similarity alone while
skipping other word classes (e.g., verbs).
For example, in Figure 8 the distance (path length) between b1 and b3 is 3 by node
counting, and this value represents the maximum distance (minimum similarity) in
the cluster containing the concepts bi, whereas a path length of 3 in the cluster
of the concepts ai, for example between a1 and a3, is on a different scale and is
not the maximum distance.
7.2. The Proposed Cross-Cluster Semantic Distance Approach
This approach is a variation approach of the Cluster-Based approach in which the CSpec
feature is calculated as in the Hybrid approach [Equation 29]. The previous work has
proved that the Hybrid approach performs very well and stably in the general English
domain. In this work, it is extended for measuring semantic similarity of open words in
WordNet. The Cross-Cluster approach is then an extension of Hybrid approach and is a
variation of Cluster-Based approach. It therefore satisfies the two rules (R3 and R4) and
the two assumptions (A1 and A2).
In cross-cluster similarity, there are four cases depending on whether the concepts occur
in primary or in secondary clusters. The four cases are as follows:
Case 1: Similarity within the Primary Cluster: If the two concept nodes occur in
the primary cluster, then the similarity is treated as similarity within a single
cluster (Equation 30), as discussed in section 6.3.
Case 2: Cross-Cluster Similarity: In this case, the LCS of the two concept nodes
is the root node, which belongs to both clusters. The secondary cluster is
connected as a child of the root of the primary cluster. This technique does not
affect the scale of the common specificity feature of the primary cluster. The
common specificity is then given as follows:

CSpec(C1,C2) = CSpec_primary = IC_primary (33)

where IC_primary is the maximum IC over the concept nodes in the primary cluster
(the primary cluster's information content). The shortest path between the two
concept nodes passes through two clusters having different granularity degrees;
therefore, the part of this path lying in the secondary cluster has to be
converted to the primary cluster's path scale, as follows.
The Cross-Cluster Path Length Feature: Let us consider again the example shown in
Figure 4. The root node is the node that connects all clusters. The path length
between two nodes is computed by adding up the two (shortest) path lengths from
the nodes to their LCS node (here, the root). For example, in Figure 4, for the
nodes a3 and b3, the LCS is the root node, and the path length between a3 and b3
is calculated as follows:

Path(C1,C2) = d1 + d2 - 1, (34)

such that d1 = d(a3, root) and d2 = d(b3, root),

where d(a3, root) is the shortest path (path length) from the root node to node
a3, and similarly d(b3, root) is the shortest path from the root to b3. The root
node is counted twice under the node-counting approach, so one is subtracted
(Equation 34). Note that Path here is the path length between the two concept
nodes "cross-cluster", and the densities (or granularities) of the two clusters
are on different scales, so the path between two concept nodes through their LCS
crosses different scales. Following the earlier discussion of the local
specificity of concepts, call the first cluster, which contains a3, the "primary
cluster", and the second cluster, which contains b3, the "secondary cluster". The
granularity rate of the primary cluster over the secondary cluster for the common
specificity feature, based on the ontology, is:
CSpecRate = IC_primary / IC_secondary (35)

where IC_primary and IC_secondary are the information contents of the primary and
secondary clusters, respectively. The granularity rate of the primary cluster
over the secondary cluster for the path feature is given by:
PathRate = (2D1 - 1) / (2D2 - 1) (36)

where (2D1 - 1) and (2D2 - 1) are the maximum shortest-path values between two
concept nodes in the local primary and local secondary clusters, respectively.
Following Rule R3, d2 in the secondary cluster (Equation 34) is converted to the
primary cluster's scale as follows:

d'2 = PathRate × d2 (37)

This new distance d'2 reflects the path length from the second concept to the LCS
relative to the path scale of the primary cluster. Applying Equation 37, the path
length (Equation 34) between the two concept nodes on the primary cluster's scale
becomes:

Path(C1,C2) = d1 + PathRate × d2 - 1 (38)

Path(C1,C2) = d1 + ((2D1 - 1) / (2D2 - 1)) × d2 - 1 (39)
Finally, the semantic distance (SemDist) between two concept nodes is given as follows:
CSpec(C1,C2) = IC_primary (40)

SemDist(C1,C2) = log((Path - 1)^α × (CSpec)^β + k) (41)
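The Case-2 computation (Equations 35-41) can be sketched as follows. This is a
minimal illustration, not the thesis implementation; it assumes log base 2 (the
thesis writes log without a base) and takes the depths and IC value as given
inputs.

```python
import math

def cross_cluster_semdist(d1, d2, D1, D2, ic_primary,
                          alpha=3.0, beta=1.0, k=1.0):
    """Case-2 cross-cluster semantic distance (Equations 36-41).

    d1, d2     -- shortest paths (node counting) from C1 and C2 to the root
    D1, D2     -- depths of the primary and secondary clusters
    ic_primary -- maximum IC over the primary cluster's nodes (Eq. 33/40)
    """
    # Equation 36: granularity rate for the path feature.
    path_rate = (2 * D1 - 1) / (2 * D2 - 1)
    # Equations 37-38: rescale d2 and form the cross-cluster path length.
    path = d1 + path_rate * d2 - 1
    # Equation 40: CSpec is the primary cluster's information content.
    cspec = ic_primary
    # Equation 41 (log base 2 assumed here).
    return math.log2((path - 1) ** alpha * cspec ** beta + k)
```

With the WordNet depths used later (noun cluster 18, verb cluster 14), PathRate =
35/27 ≈ 1.30, so each secondary-cluster edge counts for roughly 1.3
primary-cluster edges.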
Case 3: Similarity within the Secondary Cluster: In this case, both concept nodes
are in a single secondary cluster, and the semantic distance features must be
converted to the primary cluster's scales as follows:

Path(C1,C2) = Path(C1,C2)_secondary × PathRate (42)

CSpec(C1,C2) = CSpec(C1,C2)_secondary × CSpecRate (43)

SemDist(C1,C2) = log((Path - 1)^α × (CSpec)^β + k) (44)

where Path(C1,C2)_secondary and CSpec(C1,C2)_secondary are the Path and CSpec
between C1 and C2 within the secondary cluster, and PathRate and CSpecRate are
computed by Equations 36 and 35, respectively.
Case 4: Similarity within Multiple Secondary Clusters: In this case, the two
concept nodes are in two different secondary clusters Csi and Csj (i.e., neither
occurs in the primary cluster). One of the two secondary clusters then acts
momentarily as a primary cluster so that the semantic features (Path and CSpec)
can be calculated using Case 2 above; that is, the features are computed by
temporarily treating Csi and Csj as primary and secondary clusters, although both
are secondary, in order to scale and unify the CSpec and Path features between
them. The semantic distance (SemDist) is then computed using Case 3 to scale the
features (again) to the scale level of the primary cluster (Cp).
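The dispatch among the four cases can be sketched as follows; `classify_case` is
a hypothetical helper name, and the per-case distance computations themselves are
those given by Equations 30 and 41-44.

```python
def classify_case(cluster1, cluster2, primary):
    """Pick which of the four cases (Section 7.2) applies, given the
    cluster that each concept node belongs to."""
    if cluster1 == primary and cluster2 == primary:
        return 1  # Case 1: both nodes in the primary cluster
    if cluster1 == primary or cluster2 == primary:
        return 2  # Case 2: LCS is the root shared by the two clusters
    if cluster1 == cluster2:
        return 3  # Case 3: both nodes in one secondary cluster
    return 4      # Case 4: two different secondary clusters
```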
7.4 Evaluation
7.4.1 Information Source
WordNet 2.0, a semantic lexicon for the English language developed at Princeton
University, was used as the primary information source. The Perl module
WordNet::Similarity, developed by Pedersen et al. [19], was used, inheriting its
existing implementations of the measures. Resnik's technique [23] was used to
calculate the IC of concepts, particularly nouns and verbs, based on their
frequencies. In these experiments the Brown corpus [7] or the SemCor corpus [15]
was used.
7.4.2 Evaluation Method and Dataset
The proposed measure is evaluated using a single ontology (WordNet) but with more
than one cluster (cross-cluster), to show the effectiveness of the proposed
technique in handling cluster-granularity differences within the same ontology.
The noun cluster, which connects all noun taxonomies in WordNet, is taken as the
primary cluster and has a depth of 18. The verb cluster, which connects all verb
taxonomies, is taken as the secondary cluster and has a depth of 14. The depths of
the two clusters show that their granularities are significantly different. The
RG dataset contains 65 noun pairs, some of which contain nouns that also have one
or more verb senses. The proposed measure is evaluated on the RG dataset as
follows: if a word has two parts of speech, the part of speech (POS) that matches
the POS of the other concept is used, and if two words have both noun and verb
senses, only the noun-noun and verb-verb pairs are taken into account.
As discussed, all 65 RG pairs can now be found in WordNet 2.0; therefore, the
whole RG dataset was used for testing the measure and comparing it with other
measures. However, the proposed measure needs a training step to tune for optimal
parameters. Unfortunately, there is no other standard general-English dataset
with human ratings (recall that MC is a subset of RG). Therefore, the training
procedure discussed in section 6.4.3 above was used to obtain the optimum
parameters.
7.4.3 Experiments and Results
The experiments were conducted using the RG (65 pairs) dataset with both corpora,
and the results are in Table 16. These results demonstrate that SemDist produces
good, stable performance on the RG pairs. The measure achieves the same
correlation (0.873) using either of the two corpora; thus, the proposed measure,
SemDist, performs well with corpora of any size. Furthermore, the performance of
three other relevant measures was investigated with the two corpora using the RG
dataset. Table 17 shows the correlations of these three information-based
measures with the RG human ratings using the two corpora. The results in Tables
17 and 18 also show that SemDist outperforms the three measures significantly
when the SemCor corpus is used: with SemCor, SemDist achieves a correlation with
the human scores (0.873) that is almost 20% higher than the average correlation
(0.728) of the three methods (Table 17). When the Brown corpus is used, SemDist
performs slightly better than these measures. Note in Table 17 that the three
measures perform significantly better with Brown than with SemCor. Table 17 also
shows that the proposed measure, like the Resnik measure, performs well on both
corpora; the two measures do not use the IC of specific concept nodes, so their
performance is not much affected by small corpora such as SemCor, in which some
words may not occur at all, driving their similarity to any other word to the
minimum.
Table 16. Absolute correlations with human judgments for the proposed measures
using the RG dataset
Measure Optimal Parameters Correlation
SemDist (SemCor corpus) α =3, β =1, k=1 0.873
SemDist (Brown corpus) α =3, β =1, k=1 0.873
Table 17. Absolute correlations with RG human ratings using two corpora and WordNet 2.0 for 3 information content-based measures
Measure             SemCor   Brown
Resnik              0.807    0.830
Jiang and Conrath   0.650    0.854
Lin                 0.728    0.853
Average             0.728    0.846
The proposed measure, SemDist, reaches an impressive correlation of 0.873 with
the human ratings and ranks #1 (Table 18), which shows the potential of the
approach and the soundness of the combination strategy. Because the correlation
results of the measures are already so high, even a small improvement is
significant.
The experiments on a single ontology, WordNet, with multiple clusters show the
efficiency of the proposed approach, which performs quite well, attaining an
impressive correlation (0.873) that is the best result reported to date against
human ratings on the benchmark RG dataset. In the experimental results, the
proposed measure achieved an improvement of ~20% over the average correlation of
three similar measures using the standard SemCor corpus.
Table 18. Absolute correlations with RG human ratings of ontology-based
measures

Measure                           RG      Rank
SemDist (using Brown or SemCor)   0.873   1
Leacock and Chodorow              0.858   2
Jiang and Conrath (using Brown)   0.854   3
Lin (using Brown)                 0.853   4
Resnik (using Brown)              0.830   5
Wu and Palmer                     0.811   6
Path Length                       0.798   7
7.4.5 Discussion
A limitation of this work is that there is no dataset containing both nouns and
verbs for a better evaluation of the cross-cluster approach. It is clear and
intuitive that the approach is efficient for computing similarity in an ontology
such as WordNet, where there are many clusters with greatly different
granularities. In the next chapter, the similarity of verbs is examined, as well
as the similarity of both nouns and verbs in one similarity scale system.
8. SEMANTIC SIMILARITY OF VERBS AND NOUNS IN WORDNET
8.1 Motivation
There are four parts of speech (POS) of open-class words in WordNet. The
taxonomies of nouns form the noun cluster and the taxonomies of verbs form the
verb cluster. This chapter focuses on the semantics of verbs in the verb cluster,
evaluating both existing semantic measures and the proposed measures. In
addition, the similarity of nouns and verbs is investigated in one similarity
scale system, in the context of real applications (e.g., word sense
disambiguation (WSD)) that take both nouns and verbs into account.
8.2 Information Source and Datasets
WordNet 2.0 was again used as the primary information source for the semantic
measures. The WordNet::Similarity Perl package was used both for the existing
measures and for implementing the new measures. The RG dataset served as the noun
dataset. For verbs, this work used the verb dataset of Resnik and Diab [24]. In
their work on measuring semantic similarity between verbs, Resnik and Diab [24]
compiled a dataset of 27 verb pairs (the RD dataset) and collected human ratings
(human similarity scores) for all of them. They collected two kinds of human
similarity scores for each pair: one obtained by showing the human subjects the
verbs alone, called the HNoContext score, and one obtained by showing the verbs
with their contexts, called the HContext score. Table 19 contains the RD dataset
with the human ratings.
Table 19. Mean human ratings of RD dataset of verb pairs
8.3 Semantic Similarity in Verb Cluster
The traditional ontology-based semantic measures are those that do not combine
semantic features in one measure and therefore have no parameters in their
formulas. Combination-based measures such as Li et al., Hybrid, or Cross-Cluster
do have parameters. In the Li et al. measure, the parameters of the path and
depth features reflect their contributions to similarity, whereas the parameters
of the Hybrid and Cross-Cluster measures reflect the relative contributions of
the lexical representation and the corpus to similarity.
Verb 1       Verb 2        HNoContext   HContext
wiggle       rotate        2.80         2.20
prick        compose       0.00         0.00
crinkle      boggle        0.00         0.40
hack         unfold        0.00         0.00
wash         sap           1.20         0.40
compress     unionize      1.80         1.00
percolate    unionize      0.00         0.00
chill        toughen       1.40         0.80
fill         inject        4.60         2.40
whisk        deflate       0.00         0.40
compose      manufacture   4.00         2.80
obsess       disillusion   1.20         0.00
loosen       inflate       0.00         0.40
swagger      waddle        3.20         1.60
loosen       open          3.00         1.80
displease    disillusion   2.80         0.80
dissolve     dissipate     4.20         3.40
plunge       bathe         2.20         1.60
lean         kneel         2.60         1.80
embellish    decorate      4.60         4.00
neutralize   energize      0.20         0.20
initiate     enter         3.20         2.60
open         inflate       0.60         0.80
unfold       divorce       1.60         0.60
bathe        kneel         0.00         0.00
festoon      decorate      5.00         4.20
weave        enrich        3.00         0.25
8.3.1 Traditional Measures
Six traditional semantic measures were used to compute the similarity of the verb
pairs in the RD dataset, and the correlations of the measures with the human
ratings, both with and without context (HContext and HNoContext), were
calculated. Table 20 shows the correlations of the information-based measures
using the two corpora, and Table 21 shows the correlations of the structure-based
measures.
Table 20. Absolute correlations with RD human ratings using SemCor and Brown corpora for four information-based measures
                    Using SemCor              Using Brown
Measure             HNoContext   HContext     HNoContext   HContext
Resnik              0.633        0.724        0.623        0.712
Jiang and Conrath   0.444        0.572        0.418        0.451
Lin                 0.525        0.638        0.476        0.526
Average             0.534        0.645        0.506        0.563
Table 21. Absolute correlations with RD human ratings of ontology-structure-based measures
Measure                HNoContext   HContext
Leacock and Chodorow   0.492        0.683
Wu and Palmer          0.599        0.715
Path Length            0.470        0.659
Average                0.520        0.686
Tables 20 and 21 show that all the measures produce similarity scores more
correlated with the context ratings than with the no-context ratings. The average
correlations in Table 20 show that, as a secondary information source, the SemCor
corpus contributes to similarity significantly more than the Brown corpus: the
average correlation of the information-based measures with the context ratings
using SemCor (0.645) is 14.6% higher than the corresponding average using Brown
(0.563). Table 22 shows the correlations of all the measures, with SemCor used
for the information-based ones.
Table 22. Absolute correlations with RD human ratings of seven measures using the SemCor corpus for the information-based measures

Measure                HNoContext   HContext
Resnik                 0.633        0.724
Wu and Palmer          0.599        0.715
Leacock and Chodorow   0.492        0.683
Path Length            0.470        0.659
Lin                    0.525        0.638
Jiang and Conrath      0.444        0.672
Note in Table 22 that the depth-based measures, Resnik and Wu and Palmer,
outperform the others, ranking #1 and #2, respectively. The correlation results
in Table 22 also show that the corpus contributes significantly to similarity.
However, the average correlation with the context ratings of the structure-based
measures (Table 21) is higher than that of the information-based measures.
Therefore, to examine the contributions of the lexical hierarchy representation
of the verb cluster and of the corpus within the single verb cluster, the Hybrid
measure was used to examine their effects on similarity.
8.3.2 Hybrid Measure and Cross-Cluster Measure
The previous experiments in this thesis show that, in the verb cluster, using the
SemCor corpus as the secondary information source gives better results than using
the Brown corpus; therefore, the SemCor corpus was used for the Hybrid measure.
As there is no other verb dataset for training the measure, the parameters were
varied to observe their effect on similarity. Table 23 shows the absolute
correlation results while changing the parameters.
Table 23. Similarity of Verbs given by Hybrid measure using WordNet 2.0 and SemCor with human rating (HContext) in Resnik and Diab (RD) dataset
Parameter values   α=1,β=1   α=1,β=2   α=2,β=1   α=1,β=3   α=1,β=4
k=1                0.788     0.802     0.768     0.802     0.800
k=2                0.794     0.810     0.767     0.809     0.806
k=4                0.792     0.814     0.761     0.815     0.812
k=6                0.788     0.814     0.756     0.817     0.814
k=8                0.783     0.813     0.750     0.818     0.816
k=10               0.779     0.812     0.745     0.818     0.817
k=30               0.754     0.796     0.716     0.815     0.819
The contribution parameters of the features behave very differently in the verb
cluster compared with the noun cluster: in the well-performing cases, the
contribution parameter of the corpus-based common specificity feature (β) exceeds
that of the path length feature (α), which is based on the lexical hierarchy
representation. This indicates that, for similarity purposes, the lexical
representation of the noun cluster is better than that of the verb cluster.
As discussed earlier, the different granularities of different clusters yield
different similarity scale systems, especially in WordNet, where the noun cluster
has a depth of about 18 while the verb cluster has a depth of about 14; the two
similarity scale systems are therefore different. Furthermore, applications such
as WSD and IR need to compute the similarity of nouns and verbs in one scale
system, which can only be achieved with the Cross-Cluster approach, which takes
the granularity of the local clusters in the ontology into account.
8.4 Semantic Similarity of Open-Class Words in WordNet
In order to evaluate the effect of the granularity feature of the Cross-Cluster
approach, the performance of the Hybrid approach, which does not take granularity
into account, was used as a baseline. A combined dataset (RGRD) was used,
comprising the 65 noun pairs of the RG dataset and the 27 verb pairs of the RD
dataset. Tables 24 and 25 show the correlations of the Cross-Cluster approach and
the baseline approach while tuning the parameters, since no training dataset is
available.
Table 24. Similarity of words in RGRD (65 noun pairs + 27 verb pairs) dataset using Hybrid measure

Parameter values   α=1,β=1   α=2,β=1   α=3,β=1   α=1,β=2   α=1,β=3
k=1                0.840     0.839     0.835     0.838     0.832
Furthermore, although the Cross-Cluster approach can measure the similarity of a
noun and a verb across clusters (since nouns can have one or more verb senses and
verbs one or more noun senses), only pairs of noun senses and pairs of verb
senses were considered here.
Table 25. Similarity of Verbs in RGRD (65 noun pairs+27 verb pairs) dataset using
Cross-Cluster measure
Parameter values   α=1,β=1   α=2,β=1   α=3,β=1   α=1,β=2   α=1,β=3
k=1                0.817     0.809     0.802     0.820     0.817
The correlations in Tables 24 and 25 show that taking the granularity feature
into account helps improve performance. Furthermore, the Cross-Cluster approach
can be even more useful for measuring cross-cluster similarity of concepts when
the local clusters in the ontology have very different granularity degrees.
8.5 Discussion
The contribution of this chapter is twofold: (1) the experimental results in
Tables 22 and 23 confirm that when the hierarchy representation of an ontology is
not richly developed, its contribution to similarity is less than that of a good
corpus as information source; and (2) the experimental results in Tables 24 and
25 show that the granularity feature should be taken into account as an important
feature of semantic similarity across clusters.
9. SEMANTIC SIMILARITY OF CONCEPTS IN A UNIFIED FRAMEWORK:
THE PROPOSED CROSS-ONTOLOGY APPROACH
Two classes of ontology-based measures have been discussed in the previous
chapters. This chapter introduces two more groups of semantic measures, which do
not use the IS-A relations in the ontology but instead use properties of the
concepts in the ontology.
9.1 The Need for Cross-Ontology Approach
Ontology-based semantic similarity measures can be roughly divided into four
groups. The first group comprises the ontology-structure-based measures [10, 12,
22, 30], which use the IS-A relations in the ontology as the only information
source (i.e., ontology-only measures). The second group comprises the
information-based measures [9, 11, 23], which use the IS-A relations in the
ontology as the primary information source and text corpus statistics as a
secondary information source in estimating the similarity between two terms. The
third group comprises the feature-based measures, such as the Tversky approach
[29], which use neither IS-A relations nor corpus statistics but instead a
function of concept properties such as the gloss or definition/context of the
concepts in the ontology [29]. The last group comprises hybrid measures, such as
the Rodriguez and Egenhofer (2003) approach [27], which combine semantic features
from the above three groups; the Rodriguez and Egenhofer approach is based on the
Tversky approach and also uses the depth of the ontology. Feature-based measures
are based on set theory, while the measures in the first two groups are based on
spreading activation theory [5, 21, 23]. One assumption of spreading activation
theory is that the semantic network is organized along the lines of semantic
similarity [23]: the more properties two concepts share, the more links there are
between them and the more closely related they are. Although the most primitive
semantic distance/similarity measure (path length) was first developed and
applied in the MeSH ontology [23], most of the later work centered on WordNet.
The existing similarity measures in the four groups can only measure the
similarity of two concepts in a single ontology, except the hybrid measure
proposed by Rodriguez and Egenhofer [27], which can measure the similarity of
concepts in a single ontology or across ontologies using a matching process over
synonym sets, semantic neighborhoods, and distinguishing features.
Most of the previous work on semantic similarity in the biomedical domain
[1-3,6,17,20] focuses on the similarity of concepts within a single ontology.
There are a number of ontologies in the biomedical domain, each covering a subset
of the UMLS concepts; therefore, every ontology is missing some concepts. This
makes it impossible to measure the semantic similarity of the missing concepts.
For some applications, such as IR in the biomedical domain [16], there is a need
to measure the similarity of all concepts in UMLS. Furthermore, constructing a
single ontology for all UMLS concepts is very costly and challenging, as each
source represents the view of the community that developed it, and each view
suits only a few specific tasks. Therefore, there is a need to measure the
semantic similarity of concepts in the UMLS Metathesaurus using the existing
sources. With this motivation, an ontology-structure-based semantic
distance/similarity approach is proposed that can measure semantic similarity
within a single ontology as well as across ontologies in a unified framework such
as UMLS. The proposed measure is adapted from (and is an extension of) the
Cluster-Based approach, which was developed to compute the similarity between two
terms across multiple clusters within a single ontology.
9.2 The Adapted Common Specificity Feature
In this work, the adapted common specificity feature (discussed in Sections 4.3
and 6.2.2) is extended to the cross-ontology approach. This feature takes into
account the depth of the least common subsumer of two concepts and the depth of
the ontology. The least common subsumer (LCS) node of two concepts C1 and C2
determines the specificity of C1 and C2 in the ontology. The common specificity
of two concept nodes is measured by finding the depth of their LCS node and
scaling it by the depth D of the ontology as follows:

CSpec(C1,C2) = D - Depth(LCS(C1,C2)) (45)

where D is the depth of the ontology. Thus the CSpec feature determines the
common specificity of two concept nodes in the ontology. The smaller the common
specificity value of two concepts, the more information they share, and thus the
more similar they are. When the depth of the LCS of two concept nodes reaches D,
the two concept nodes have the highest common specificity in the ontology, which
equals zero (i.e., CSpec(C1,C2) = 0).
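Equation 45 can be sketched over a toy taxonomy as follows; the parent map and
node names are hypothetical, and depths are by node counting (the root has depth
1).

```python
def depth(node, parent):
    """Depth by node counting: the root (no parent) has depth 1."""
    d = 1
    while node in parent:
        node = parent[node]
        d += 1
    return d

def ancestors(node, parent):
    """The node itself plus all of its ancestors up to the root."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain

def cspec(c1, c2, parent, D):
    """CSpec(C1,C2) = D - Depth(LCS(C1,C2))   (Equation 45)."""
    up2 = set(ancestors(c2, parent))
    # The first ancestor of c1 that also subsumes c2 is the LCS.
    lcs = next(a for a in ancestors(c1, parent) if a in up2)
    return D - depth(lcs, parent)

# Toy IS-A tree: r -> a1 -> {a2 -> a4, a3}; ontology depth D = 4.
parent = {"a1": "r", "a2": "a1", "a3": "a1", "a4": "a2"}
```

For example, cspec("a4", "a3", parent, 4) is 2 (their LCS a1 has depth 2), while
cspec("a4", "a4", parent, 4) is 0, the highest common specificity, matching the
text's remark about an LCS at depth D.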
Figure 9. Two fragments from two ontologies.
(Ontology OA: root r1, nodes a1-a9; ontology OB: root r2, nodes b1-b3; mapping: a9 = b2.)
9.3 Local Ontology Granularity
Consider the two ontology fragments in Figure 9. The first ontology, OA, contains
the concepts ai, and the second ontology, OB, contains the concepts bi. The depth
of ontology OA is 5 and the depth of ontology OB is 4 (by node counting). The
relationship between two concept nodes belonging to two different ontologies,
such as a3 and b3 in Figure 9, cannot be seen directly; however, when the two
ontologies are mapped, the relationship between them can be seen intuitively in
the tree. On the other hand, different ontologies have different granularity
degrees, and hence the similarity scales of the ontologies also differ. The
effect of granularity differences between ontologies can be seen by examining the
local specificity of concept nodes. Define the specificity spec(Ci) of a concept
Ci in an ontology as follows:

spec(Ci) = Depth(Ci) / Depth (46)

where Depth is the depth of the ontology containing Ci, and spec(Ci) ∈ [0, 1].
Note that spec(Ci) = 1 when Ci is a leaf node of the ontology. Following Equation
46, the specificities of a3 and b3 in Figure 9 are calculated as follows:
spec(a3)=4/5 = 0.8
spec(b3) =4/4 = 1.0
Therefore, the local specificity of b3 (1.0) is greater than the local
specificity of a3 (0.8), even though the depths of the concepts a3 and b3 are
equal. That is, b3 is more specific within its ontology than a3, because it lies
further down toward the bottom of its ontology and because the granularity
degrees of the two ontologies differ. Therefore, the local granularities of the
ontologies should be taken into account when measuring the semantic similarity of
concept nodes across ontologies.
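The local specificity computation of Equation 46, with the Figure 9 depths, is
simply:

```python
def spec(depth_ci, depth_ontology):
    """Local specificity of a concept (Equation 46): its depth scaled
    by the depth of the ontology that contains it."""
    return depth_ci / depth_ontology

# Figure 9: a3 is at depth 4 in ontology OA (depth 5),
# and b3 is at depth 4 in ontology OB (depth 4).
print(spec(4, 5), spec(4, 4))  # 0.8 1.0
```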
9.4 The Proposed Cross-Ontology Similarity Approach
The aim is to extend the Cluster-Based approach (Chapter 4) to measure the
semantic similarity of concept nodes at the cross-ontology scale. An ontology is
then treated as a cluster; that is, each cluster here is one ontology, and two
ontologies can overlap in their sets of controlled concepts. These are ontologies
in a unified framework, as discussed in section 2.3. The following rules and
assumptions must be satisfied by the proposed approach.
9.4.1 Rules and Assumptions
The proposed measure combines all the semantic features discussed above in one
measure in an effective and logical way. The following intuitive rules and
assumptions should be fulfilled when measuring semantic distance/similarity
across ontologies:
Rule R5: The semantic similarity (distance) scale system reflects the degree of
similarity of pairs of concept nodes comparably, whether within a single ontology
or across ontologies. This rule ensures that mapping ontology OB (the secondary
ontology) to ontology OA (the primary ontology) does not deteriorate the
similarity scale of the primary ontology.
Rule R6: The semantic similarity must obey local ontology’s similarity rules as follows:
Rule R6.1: The shorter the distance between two concept nodes in the ontology, the
more they are similar.
Rule R6.2: Lower level pairs of concept nodes are semantically closer (more similar)
than higher level pairs (i.e. the more the two concept nodes share
information/attributes, the more similar they are).
Rule R6.3: The maximum similarity of two concept nodes is reached when they are
the same node in the ontology.
Like the measures proposed above, the cross-ontology measure also satisfies the
two assumptions above (A1 and A2).
9.4.2 Single Ontology Similarity
In a single ontology, there are two features to combine: path length (the shortest path
length) and common specificity, given by Equation 45. When the two concept nodes are the
same node (the two concepts are synonymous or identical), the path length will be 1
(Path = 1), and the semantic distance value must then reach its minimum regardless of the
CSpec feature, by Rule R6.3 (recall that semantic distance is the inverse of semantic
similarity). Therefore, to combine the features, the product of the semantic distance
features is used; these constraints may not be satisfied if other combinations of the two
features are used. By applying Rules R5 and R6 and the two assumptions (A1 and A2), the
proposed measure for a single ontology is:
SemDist(C1, C2) = log((Path - 1)^α × (CSpec)^β + k) (47)
where α > 0 and β > 0 are contribution factors of the two features (Path and CSpec); k is
a constant; and Path is the shortest path length between the two concept nodes. If k is
zero, the combination is linear; to ensure that the distance is positive and the
combination is non-linear, k must be greater than or equal to one (k ≥ 1). When two
concepts have a path length of 1 (Path = 1) using node counting, their semantic distance
(SemDist) equals zero (assuming k = 1) according to Equation 47, regardless of the CSpec
feature.
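Equation 47 can be sketched in a few lines. This is a minimal illustration, not the thesis implementation: it assumes the natural logarithm (the thesis does not fix the log base) and that Path and CSpec have already been computed from the ontology.

```python
import math

def sem_dist(path: int, cspec: float,
             alpha: float = 1.0, beta: float = 1.0, k: float = 1.0) -> float:
    """Semantic distance of Equation 47: log((Path - 1)^alpha * CSpec^beta + k).

    `path` is the shortest path length between the two concept nodes
    (node counting); `cspec` is their common specificity.
    """
    return math.log((path - 1) ** alpha * cspec ** beta + k)

# Identical concepts: Path = 1, so the distance is log(k) = 0 for k = 1,
# regardless of CSpec (Rule R6.3).
print(sem_dist(path=1, cspec=3.0))  # 0.0
```

Note how the product form enforces Rule R6.3 automatically: the (Path - 1) factor zeroes out the whole CSpec contribution when the two nodes coincide.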
9.4.3 Cross-Ontology Semantic Similarity
In cross-ontology semantic similarity, there are four cases depending on whether the
concept nodes occur in primary or in secondary ontologies. The four cases are as follows:
71
Case 1: Similarity within the Primary Ontology: If the two concept nodes occur in the
primary ontology then the similarity in this case is treated as similarity within single
ontology using Equation 47 discussed in section 9.4.2.
Figure 10. Connecting two ontology fragments.
Case 2: Cross-Ontology Similarity (Primary-Secondary):
The Common Specificity Feature: In this case, the two concepts belong to two different
ontologies. The secondary ontology is connected to the primary ontology by joining the
associated/common nodes (e.g., a9 and b2 in Figure 9) of the two ontologies. However, two
ontologies may have many common or equivalent concept nodes; two concepts in two
ontologies are equivalent if they refer to the same concept. For example, in Figure 9,
suppose that b2 and a9 refer to the same concept (b2 = a9); then b2 and a9 are merged
into one node called a Bridge, as in Figure 10, which shows how the two ontologies
are mapped and how the Bridge appears. As there can be more than one Bridge node
when mapping two ontologies, there can be more than one LCS node ({LCSn}) for the
two concepts. The LCS node of two concept nodes (C1, C2) belonging to two ontologies
is the LCS of the first node C1 in the primary ontology and a Bridge node, that is:
LCSn (C1,C2) = LCS(C1, Bridgen) (48)
such that C1 belongs to the primary ontology (the ai nodes) while C2 belongs to the
secondary ontology (the bi nodes). The path length between two concept nodes in two
ontologies passes through
the Bridge node and goes through two ontologies having different granularity degrees.
The part of path length in secondary ontology is then converted into primary ontology’s
scale of path feature as explained next.
The Cross-Ontology Path length Feature: The typical way to calculate the path length
between two concept nodes is by adding up the two path lengths, from each of them to
the LCS node. In the cross-ontology approach, the path length between two concept nodes
is calculated by adding up the two path lengths from each of them to the Bridge node. For
example, the path length between a3 and b3 in Figure 10 is calculated as follows:
Path(a3, b3) = d1 + d2 – 1 (49)
such that:
d1 = d(a3, Bridge), and
d2 = d(b3, Bridge),
where d(a3, Bridge) is the path length (shortest path) from a3 to the Bridge, and
similarly for d(b3, Bridge). In this case, the Bridge node is counted twice because of
the node counting approach, so one is subtracted in Equation 49. Since this is a
cross-ontology path, Path(a3, b3)
crosses different scales, i.e., d1 and d2 are in different scales. According to our discussion
of specificity in sections 2.2 and 3.1, let us call the first ontology (which contains ai) the
primary ontology, and call the second ontology (which contains bi) the secondary
ontology. The granularity rate of the primary ontology over the secondary ontology for
the common specificity feature is:
CSpecRate = (D1 - 1) / (D2 - 1) (50)
where (D1-1) and (D2 -1) are maximum common specificity values of the primary and
secondary ontologies respectively (D1 and D2 are depth of primary ontology and
secondary ontology respectively). The granularity rate of the primary ontology over the
secondary ontology for the path feature is given by:
PathRate = (2D1 - 1) / (2D2 - 1) (51)
where (2D1-1) and (2D2 -1) are maximum path values of two concept nodes in the
primary and secondary ontology respectively. Following Rule R5, d2 (in Equation 49) in
the secondary ontology is scaled to the primary ontology as follows:
d'2 = PathRate × d2 (52)
This new path length d'2 reflects the path length from the second concept node to the
Bridge node relative to the primary ontology's granularity scale of the path feature.
Applying Equation 52, the cross-ontology path length between the two concept nodes, in
the primary ontology's scale of the path feature, is given as follows:
Path(C1, C2) = d1 + PathRate × d2 - 1 (53)

Path(C1, C2) = d1 + ((2D1 - 1) / (2D2 - 1)) × d2 - 1 (54)
Recall that there can be more than one Bridge node; therefore, there can be more than one
path length between the two concept nodes ({Pathn}). Finally, the semantic distance
(SemDist) between the two concept nodes is given as follows:
CSpecn(C1, C2) = D1 - Depth(LCS(C1, Bridgen)) (55)

SemDistn(C1, C2) = log((Pathn - 1)^α × (CSpecn)^β + k) (56)

SemDist(C1, C2) = minn{SemDistn(C1, C2)} (57)
where Pathn is the path length of two concepts calculated via Bridgen. The semantic
distance between two concepts is chosen as the minimum among all possible paths.
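The Case-2 computation (Equations 49 through 57) can be sketched as follows. This is a minimal illustration assuming the per-bridge path lengths and common specificities have already been extracted from the two ontologies; the function and parameter names are hypothetical.

```python
import math

def cross_sem_dist(d1_list, d2_list, cspec_list, D1, D2,
                   alpha=1.0, beta=1.0, k=1.0):
    """Case-2 distance: for each Bridge node n, combine d1 = path(C1, Bridge_n)
    with d2 = path(C2, Bridge_n) rescaled by PathRate, then take the minimum
    distance over all bridges (Equation 57).

    d1_list / d2_list / cspec_list hold, per bridge, the two path lengths and
    the common specificity CSpec_n = D1 - Depth(LCS(C1, Bridge_n)).
    D1 and D2 are the depths of the primary and secondary ontologies.
    """
    path_rate = (2 * D1 - 1) / (2 * D2 - 1)  # Equation 51
    dists = []
    for d1, d2, cspec in zip(d1_list, d2_list, cspec_list):
        path = d1 + path_rate * d2 - 1                                  # Eqs. 53-54
        dists.append(math.log((path - 1) ** alpha * cspec ** beta + k))  # Eq. 56
    return min(dists)                                                    # Eq. 57
```

When D1 = D2 the rate is 1 and the formula degenerates to the single-ontology measure, which is exactly what Rule R5 requires.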
Case 3: Similarity within a Single Secondary Ontology: The third case occurs when
the two concept nodes are both in a single secondary ontology. The semantic
distance features in this case must be converted to the primary ontology's scales of the
two features as follows:
Path(C1, C2) = Path(C1, C2)secondary × PathRate (58)

CSpec(C1, C2) = CSpec(C1, C2)secondary × CSpecRate (59)

SemDist(C1, C2) = log((Path - 1)^α × (CSpec)^β + k) (60)
where Path(C1,C2) secondary and CSpec(C1,C2)secondary are the Path and CSpec between C1
and C2 in the secondary ontology; and PathRate and CSpecRate are computed in
Equations 51 and 50.
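Case 3 can be sketched under the same assumptions (hypothetical names; the rates come directly from Equations 50 and 51):

```python
import math

def case3_sem_dist(path_sec, cspec_sec, D1, D2, alpha=1.0, beta=1.0, k=1.0):
    """Case-3 distance (Equations 58-60): both concepts live in one secondary
    ontology, so Path and CSpec measured there are rescaled to the primary
    ontology's granularity before applying the distance formula."""
    path_rate = (2 * D1 - 1) / (2 * D2 - 1)   # Equation 51
    cspec_rate = (D1 - 1) / (D2 - 1)          # Equation 50
    path = path_sec * path_rate               # Equation 58
    cspec = cspec_sec * cspec_rate            # Equation 59
    return math.log((path - 1) ** alpha * cspec ** beta + k)  # Equation 60
```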
Case 4: Similarity within Multiple Secondary Ontologies: The fourth case occurs when the
two concept nodes are in two different secondary ontologies (i.e., neither of them exists
in the primary ontology). In this case, one of the two secondary ontologies acts
momentarily as a primary ontology to calculate the semantic features (viz., Path and
CSpec) using Case 2 above. Then, the semantic similarity is computed using Case 3 to
scale the features (again) to the scale level of the primary ontology.
9.4.4 Choosing the Secondary Ontologies
In the biomedical domain within the UMLS framework, many ontologies overlap in a set of
UMLS concepts; therefore, one problem stands out: which ontology should be chosen as the
secondary ontology? Let us examine again the four cases above.
Case 1: In this case, there is only one primary ontology; therefore, there is no need to
choose a secondary ontology.
Case 2: In this case, the second concept may belong to many ontologies in the unified
framework (i.e., UMLS); the problem is which ontology should be chosen for mapping into
the primary ontology for similarity. The proposed cross-ontology approach chooses the
secondary ontology based mainly on two points. The first is that the more the two
ontologies overlap, the better they serve the similarity of two concepts dispersed across
them. The second is that the secondary ontology should be one with a high granularity
degree. Accordingly, a metric is proposed to measure the "goodness" of choosing a
secondary ontology: the higher the goodness value, the better that ontology serves as the
secondary ontology for mapping. The metric is as follows:
goodness(Op, Os) = (|Op ∩ Os| / |Op ∪ Os|) × (Ds / Dp) (61)

where:
- Op is the primary ontology and Os is a source ontology examined for its goodness as a
candidate secondary ontology;
- Op ∩ Os is the set of common concepts of the two ontologies;
- Op ∪ Os is the union of the two sets of concepts of the two ontologies;
- Dp and Ds are the depths of the primary and the candidate secondary ontology,
respectively.
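Equation 61 can be sketched as follows, assuming the overlap factor is the Jaccard ratio of the two concept sets implied by the intersection and union terms; the names are illustrative.

```python
def goodness(primary_concepts: set, source_concepts: set,
             depth_primary: int, depth_source: int) -> float:
    """Equation 61: overlap (Jaccard ratio) of the two concept sets, weighted
    by the granularity ratio Ds/Dp of the candidate secondary ontology."""
    common = primary_concepts & source_concepts
    union = primary_concepts | source_concepts
    return (len(common) / len(union)) * (depth_source / depth_primary)

# A candidate with larger overlap and a deeper hierarchy scores higher.
a = goodness({"c1", "c2", "c3"}, {"c2", "c3", "c4"}, depth_primary=10, depth_source=12)
b = goodness({"c1", "c2", "c3"}, {"c3", "c4", "c5"}, depth_primary=10, depth_source=8)
assert a > b
```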
Case 3: In this case, the two concepts are both in one source ontology; however, many
source ontologies may contain both concepts. The problem is which source ontology should
be chosen as the secondary ontology. Here, Equation 61 is used to determine the secondary
ontology. Note that Case 3 includes the case in which the two concepts belong to one
source (secondary) ontology but one of them also belongs to the primary ontology.
Case 4: In this case, the two concepts belong to two different source ontologies.
However, many source ontologies may contain the concepts, so the problem is which source
ontology should be chosen for each of the two concepts. First, among the source
ontologies containing the first concept, the one with the highest granularity degree is
chosen. Then, with the ontology chosen for the first concept acting as a temporary
primary ontology, the goodness metric (Equation 61) is used to determine which of the
ontologies containing the second concept is most suitable as the secondary ontology.
9.5. Evaluation
9.5.1 Testing Dataset
To evaluate the approach in cross-ontology settings, a dataset containing term pairs as
in Cases 2, 3, and 4 should be used. For example, for Case 2, concept pairs (C1, C2) such
that one concept (C1) belongs only to the primary ontology and the other concept (C2)
belongs to a secondary ontology, with both ontologies in the unified framework, should be
used for testing. Since no such dataset with human ratings exists, we combined datasets
from two domains: the general English domain and the biomedical domain. For that, the RG
dataset and Datasets 1 and 2 were used in the experiments.
9.5.2 Tools and Information Sources
WordNet 2.0 was used as the primary ontology, and MeSH [23, 24] and SNOMED-CT [23, 25]
were used as secondary ontologies. The Perl module WordNet::Similarity developed by
Pedersen et al. [19] was used to implement the proposed approach for measuring the
semantic distance of concepts found in WordNet 2.0. The MeSH database and MeSH Browser,
available at http://www.nlm.nih.gov/mesh/meshhome.html, were used to get information on
biomedical terms in MeSH; and the UMLSKS Browser, available at
http://umlsks.nlm.nih.gov, was used to get information on biomedical concepts in
SNOMED-CT.
9.5.3 Experimental Results
9.5.3.1 Experiments on Single Ontology: WordNet
The proposed approach was first evaluated on a single ontology, where it performs very
well, surpassing other existing measures in the biomedical domain in previous
experiments. In this single-ontology experiment, the RG dataset and WordNet 2.0 were
used; the results, with the default parameters (α=1, β=1, k=1), are shown in Table 26.
The purpose of this experiment is to show that the method achieves sound correlation
results using a standard dataset (RG) and the large, reliable WordNet ontology.
Table 26. Absolute correlation of proposed approach on the RG dataset and WordNet 2.0

No.  Parameters       Correlation
1    α=1, β=1, k=1    0.815
To evaluate the approach in cross-ontology settings, a dataset containing term pairs
dispersed across two ontologies is needed. For that, the RG dataset (65 pairs) was
combined with the two biomedical datasets in three combinations as follows:
(a) RG (65 pairs) + Dataset 1 (30 pairs): total 95 pairs.
(b) RG (65 pairs) + Dataset 2 (36 pairs): total 101 pairs.
(c) RG + Dataset 1 + Dataset 2: total 131 pairs.
WordNet was used for RG words/terms, and MeSH or SNOMED-CT was used for
terms/concepts of Dataset 1 and Dataset 2. Moreover, WordNet was considered the
primary while MeSH/SNOMED-CT was the secondary ontology. Then, on these three
dataset combinations, (a) – (c), two evaluations were conducted, one using WordNet and
MeSH, and the other using WordNet and SNOMED-CT.
Table 27. Absolute correlations of the proposed approach using WordNet and MeSH

No.  Dataset                                                                                          Correlation
1    WordNet (RG, 65 pairs) + MeSH (Dataset 1, 25 pairs) (90 pairs)                                   0.808
2    WordNet (RG, 65 pairs) + MeSH (Dataset 2, 36 pairs) (101 pairs)                                  0.804
3    WordNet (RG, 65 pairs) + MeSH (Dataset 1, 25 pairs) + MeSH (Dataset 2, 36 pairs) (126 pairs)     0.814
Average number of tested pairs: 105.7; Average correlation: 0.809
9.5.3.2 Experiments Using WordNet and MeSH
In these experiments, WordNet was used as the primary (general) ontology and MeSH as the
secondary ontology. Three experiments were conducted using the three dataset combinations
(a), (b), and (c). In the first experiment, using combination (a), only 25 pairs (out of
the 30 pairs in Dataset 1) were found in MeSH. Thus, the similarity of 65 pairs was
computed within WordNet as a single ontology, and that of 25 term pairs with the
cross-ontology technique (Case 3). In the second experiment, dataset combination (b) was
tested using WordNet and MeSH. In the third experiment, the three datasets were combined,
combination (c), with a total of 126 pairs distributed between WordNet and MeSH. The
results are in Table 27. In these experiments, the proposed method achieved an average of
~81% correlation with human scores using, on average, ~106 term pairs and two ontologies.
The complete per-pair results using combination (b) (the 2nd experiment) are shown in
Table 29. The human rating scores in the RG dataset are converted to the [0-1] scale to
be compatible with the human ratings in Dataset 2.
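The evaluation steps above (rescaling the human ratings and computing the absolute correlation with SemDist) can be sketched as follows. This is a minimal illustration, not the exact evaluation script used in the thesis; it assumes the RG ratings are on a [0-4] scale, and it takes the absolute value because SemDist is a distance, so its raw correlation with similarity ratings is negative.

```python
from statistics import mean

def rescale(ratings, lo=0.0, hi=4.0):
    """Map human ratings from their original [lo-hi] scale to [0-1]."""
    return [(r - lo) / (hi - lo) for r in ratings]

def abs_pearson(xs, ys):
    """Absolute Pearson correlation, as reported in Tables 26-28."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return abs(cov / (sx * sy))

print(rescale([0, 2, 4]))                    # [0.0, 0.5, 1.0]
print(abs_pearson([1, 2, 3], [6, 4, 2]))     # 1.0 (perfect inverse relation)
```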
Table 28. Absolute correlations of the proposed approach using WordNet and SNOMED-CT

No.  Dataset                                                                                                     Correlation
1    WordNet (RG, 65 pairs) + SNOMED-CT (Dataset 1, 29 pairs) (94 pairs)                                         0.778
2    WordNet (RG, 65 pairs) + SNOMED-CT (Dataset 2, 34 pairs) (99 pairs)                                         0.700
3    WordNet (RG, 65 pairs) + SNOMED-CT (Dataset 1, 29 pairs) + SNOMED-CT (Dataset 2, 34 pairs) (128 pairs)      0.757
Average number of tested pairs: 107; Average correlation: 0.745
9.6. Discussion
A cross-ontology semantic distance/similarity approach has been presented and applied in
the biomedical domain; however, it can be applied in other domains within a unified
framework. One of the problems in measuring semantic similarity between concepts using an
ontology is that certain terms in the dataset are missing from the underlying ontology.
This problem stands out more clearly in specific domains (e.g., the bioinformatics
domain) than in general domains. For example, in biomedical IR, there is a great need to
measure the semantic similarity between biomedical terms/concepts and documents [13], and
there are several potential ontologies. It can very well be that not all the concepts are
found in a single ontology (that is, the concepts are dispersed over more than one
ontology). In this case, concepts that are missing from the ontology cannot be measured
for similarity and are skipped; see, for example, [15]. This work discussed and evaluated
an ontology-based approach that can measure the semantic similarity of concepts in a
single ontology or in multiple ontologies (cross-ontology) within a unified framework
such as UMLS or NCI. This work lays a foundation for more structure and further advances
in ontology integration and cross-ontology research in the biomedical domain. The
experimental results show that the proposed approach is very promising and performs quite
well, with very good correlations with human scores.
Table 29. Biomedical Dataset 2 (36 pairs) and RG dataset (65 pairs, in italics) with human similarity scores (Human) and SemDist scores using WordNet and MeSH

Concept 1 | Concept 2 | Human | SemDist
Anemia | Appendicitis | 0.031 | 4.69
Meningitis | Tricuspid Atresia | 0.031 | 4.69
Sinusitis | Mental Retardation | 0.031 | 4.69
Dementia | Atopic Dermatitis | 0.062 | 4.83
Acquired Immunodeficiency Syndrome | Congenital Heart Defects | 0.062 | 4.54
Bacterial Pneumonia | Malaria | 0.156 | 4.69
Osteoporosis | Patent Ductus Arteriosus | 0.156 | 4.83
Amino Acid Sequence | Anti Bacterial Agents | 0.156 | 5.24
Otitis Media | Infantile Colic | 0.156 | 4.94
Hyperlipidemia | Hyperkalemia | 0.156 | 3.92
Neonatal Jaundice | Sepsis | 0.156 | 4.69
Asthma | Pneumonia | 0.187 | 3.64
Hypothyroidism | Hyperthyroidism | 0.357 | 3.25
Sarcoidosis | Tuberculosis | 0.406 | 5.05
Sickle Cell Anemia | Iron Deficiency Anemia | 0.406 | 4.01
Adenovirus | Rotavirus | 0.437 | 4.14
Lactose Intolerance | Irritable Bowel Syndrome | 0.468 | 4.01
Hypertension | Kidney Failure | 0.500 | 4.83
Diabetic Nephropathy | Diabetes Mellitus | 0.500 | 3.25
Pulmonary Valve Stenosis | Aortic Valve Stenosis | 0.531 | 3.12
Hepatitis B | Hepatitis C | 0.562 | 2.97
Vaccines | Immunity | 0.593 | 4.79
Psychology | Cognitive Science | 0.593 | 2.47
Failure to Thrive | Malnutrition | 0.625 | 4.69
Urinary Tract Infection | Pyelonephritis | 0.656 | 3.92
Migraine | Headache | 0.718 | 4.72
Myocardial Ischemia | Myocardial Infarction | 0.750 | 2.33
Carcinoma | Neoplasm | 0.750 | 3.75
Breast Feeding | Lactation | 0.843 | 0.00
Seizures | Convulsions | 0.843 | 0.00
Pain | Ache | 0.875 | 0.00
Malnutrition | Nutritional Deficiency | 0.875 | 0.00
Down Syndrome | Trisomy 21 | 0.875 | 0.00
Measles | Rubeola | 0.906 | 0.00
Antibiotics | Antibacterial Agents | 0.937 | 0.00
Chicken Pox | Varicella | 0.968 | 0.00
cord | smile | 0.005 | 5.26
rooster | voyage | 0.010 | 5.78
noon | string | 0.010 | 5.24
fruit | furnace | 0.013 | 4.44
autograph | shore | 0.015 | 5.32
automobile | wizard | 0.028 | 5.20
mound | stove | 0.035 | 4.44
grin | implement | 0.045 | 5.40
asylum | fruit | 0.048 | 4.44
asylum | monk | 0.098 | 5.02
graveyard | madhouse | 0.105 | 5.42
glass | magician | 0.110 | 4.66
boy | rooster | 0.110 | 4.97
cushion | jewel | 0.113 | 4.44
monk | slave | 0.143 | 4.04
asylum | cemetery | 0.198 | 5.18
coast | forest | 0.213 | 4.51
grin | lad | 0.220 | 5.40
shore | woodland | 0.225 | 4.33
monk | oracle | 0.228 | 4.60
boy | sage | 0.240 | 4.26
automobile | cushion | 0.243 | 4.65
mound | shore | 0.243 | 3.97
lad | wizard | 0.248 | 4.04
forest | graveyard | 0.250 | 4.98
food | rooster | 0.273 | 5.34
cemetery | woodland | 0.295 | 4.98
shore | voyage | 0.305 | 5.32
bird | woodland | 0.310 | 4.80
coast | hill | 0.315 | 3.97
furnace | implement | 0.343 | 4.26
crane | rooster | 0.353 | 4.16
hill | woodland | 0.370 | 4.33
car | journey | 0.388 | 5.40
cemetery | mound | 0.423 | 5.08
glass | jewel | 0.445 | 4.51
magician | oracle | 0.455 | 4.44
crane | implement | 0.593 | 3.97
brother | lad | 0.603 | 4.04
sage | wizard | 0.615 | 4.26
oracle | sage | 0.653 | 4.19
bird | crane | 0.658 | 3.33
bird | cock | 0.658 | 2.30
food | fruit | 0.673 | 4.73
brother | monk | 0.685 | 2.48
asylum | madhouse | 0.760 | 2.30
furnace | stove | 0.778 | 4.60
magician | wizard | 0.803 | 0.00
hill | mound | 0.823 | 0.00
cord | string | 0.853 | 2.56
glass | tumbler | 0.863 | 2.48
grin | smile | 0.865 | 0.00
serf | slave | 0.865 | 3.69
journey | voyage | 0.895 | 2.48
autograph | signature | 0.898 | 2.48
coast | shore | 0.900 | 2.56
forest | woodland | 0.913 | 0.00
implement | tool | 0.915 | 2.56
cock | rooster | 0.920 | 0.00
boy | lad | 0.955 | 2.56
cushion | pillow | 0.960 | 2.56
cemetery | graveyard | 0.970 | 0.00
automobile | car | 0.980 | 0.00
midday | noon | 0.985 | 0.00
gem | jewel | 0.985 | 0.00
10. DISCUSSION AND FUTURE WORK
10.1 Directions
10.1.1 Adapting Existing Ontology-based Measures for Cross-Ontology Similarity
In this thesis, the cross-ontology approach for measuring the semantic similarity of
concepts in a unified framework has been introduced and explained. This approach is based
on the following points: (1) the mapping of two ontologies based on the overlap between
them in a set of concept nodes, (2) the consideration of the granularity degrees of the
ontologies, and (3) a strategy for choosing the secondary ontology for missing concepts.
This approach can then be applied to existing ontology-structure-based measures to adapt
them for measuring semantic similarity of concepts in a unified framework. For example,
the cross-ontology path length feature can be employed in the Path length and Leacock and
Chodorow measures.
10.1.2 Semantic Similarity and Application in Information Retrieval
PubMed [36] is a service of the U.S. National Library of Medicine (NLM) that includes
over 16 million citations from MEDLINE and other life science journals for biomedical
articles back to the 1950s.
A Case Study: Semantic Similarity of Concepts in IR in Biomedical Domain
Previous work by Mao and Chu [16] shows that the concept-based vector space model (VSM)
performs better than the stem-based VSM in medical document retrieval (concept-based
VSM > stem-based VSM). The concept-based VSM uses MeSH concepts, with documents and
queries represented by MeSH headings. Moreover, a concept-based VSM that takes into
account "concept interrelation" (a concept-interrelation-based VSM), using semantic
similarity techniques (measures) to represent the interrelation of concepts, improves
performance over the plain concept-based VSM (concept-interrelation-based VSM >
concept-based VSM). Furthermore, the most important and comprehensive databases in the
biomedical domain, such as MEDLINE, contain concept-structure-based records; for example,
each citation of an article in MEDLINE contains a set of cited/indexed MeSH concepts.
Most IR/search systems [36] and IR research in this domain use concepts limited to MeSH
concepts. One of the reasons is that there is no technique to measure the semantic
similarity of all UMLS concepts dispersed across multiple ontologies.
One of the most popular search engines for the biomedical domain is PubMed-Entrez,
developed by NLM. It is a Boolean search engine: the input text string is parsed into
MeSH terms and text words, and the MeSH thesaurus is used for indexing. Therefore,
PubMed-Entrez is limited to MeSH headings only. The following example clearly shows the
limitation of using only a single terminology source in retrieval:
Figure 11. Two fragments of SNOMED-CT (left) and MeSH (right).
The MeSH thesaurus (ontology) contains only about 23K headings or concept scopes, a small
subset of the roughly 1.3 million UMLS concepts (concept classes). One problem stands
out: if a user wants to search for a concept/entity that is not found in the MeSH
thesaurus, the results may not really satisfy the query. For example, the concept
"Stomach cramps" is not found in MeSH, but it is in SNOMED-CT. When a query with "Stomach
cramps" is input into PubMed-Entrez, the query is parsed as follows:
("stomach"[MeSH Terms] OR Stomach[Text Word]) AND (("muscle cramp"[TIAB] NOT
Medline[SB]) OR "muscle cramp"[MeSH Terms] OR cramps[Text Word])
The query is parsed into two MeSH headings, (1) "stomach" and (2) "muscle cramp", as the
MeSH thesaurus/ontology does not contain the concept "stomach cramps". Clearly, the
search engine should not be limited to MeSH headings only.
Moreover, in their work, Mao and Chu [16] used an ontology-structure-based semantic
similarity measure that they developed on their own to calculate the interrelationship
(similarity) between concepts. In fact, there are many semantic measures whose
performance differs with the application, information sources, etc. Therefore, that work
also has some limitations.
According to the above discussion, there are some questions that most current research in
biomedical IR cannot answer:
1. What search model is best suited for concept-structure-based biomedical databases:
the concept-interrelation-based VSM, the Boolean model, or other models?
2. Is there a need to develop a "new" IR model, such as a combination of concept-based
and stem-based/phrase-based techniques?
3. What set of vocabulary sources in UMLS is best used for IR in the biomedical domain?
4. In concept-interrelation-based IR models, an interrelation of concepts is represented
by a similarity value given by a semantic similarity measure. The issue is which measure
is most suitable for these tasks, as different measures and different groups of measures
(ontology-structure-based measures, information-based measures, etc.) perform differently
in different applications and situations.
10.1.3 The Need for Topic Similarity and a New Information Retrieval Model
Each MEDLINE record contains a citation indexed by about 10-15 MeSH headings; therefore,
using (concept-interrelation-based) VSMs, the similarity score between two documents
represented as two vectors would be low, as each record/document contains a small number
of cited concepts. A Boolean retrieval model utilizing the MeSH-heading indexing of each
document should therefore be a good choice; moreover, the extended Boolean model should
be a good model for retrieving documents in PubMed/MEDLINE. However, the concepts in the
MeSH ontology are represented hierarchically and hence support semantic search.
In the (extended) Boolean model, a document containing "A1 and B1" will not satisfy a
search query like "A and B", where A1 and B1 are subconcepts of A and B, respectively.
Thus, a new semantic model should be developed to utilize the semantics of concepts
represented in the ontology. This leads to the development of topic similarity as the
core technique of the new semantic model, instead of semantic measures applied in VSMs.
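The limitation can be illustrated with a toy subsumption check (the hierarchy and concept names here are hypothetical, not MeSH): a strict Boolean AND fails on a document indexed only with subconcepts, while a hierarchy-aware match succeeds.

```python
# Toy is-a hierarchy: child -> parent (hypothetical concepts, not MeSH).
PARENT = {"A1": "A", "B1": "B", "A": None, "B": None}

def ancestors(c):
    """The concept itself plus all of its ancestors in the hierarchy."""
    out = set()
    while c is not None:
        out.add(c)
        c = PARENT[c]
    return out

def boolean_and(doc_concepts, query):
    """Strict Boolean AND: every query concept must be indexed verbatim."""
    return all(q in doc_concepts for q in query)

def semantic_and(doc_concepts, query):
    """A query concept is satisfied by any indexed subconcept of it."""
    expanded = set().union(*(ancestors(c) for c in doc_concepts))
    return all(q in expanded for q in query)

doc = {"A1", "B1"}                   # document indexed with subconcepts only
print(boolean_and(doc, ["A", "B"]))  # False: strict Boolean match fails
print(semantic_and(doc, ["A", "B"])) # True: subsumption satisfies the query
```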
The combination of text-word-based models with other models, such as Boolean models and
VSMs, could also be explored, with the contribution parameters weighted in the
combination approach.
10.2 Discussion
This thesis centers on semantic similarity techniques in two domains: the general English
domain and the biomedical domain. Through applying existing techniques to the biomedical
domain, developing new techniques for the two domains, and several investigations, we
make the following observations:
1. The application of existing ontology-structure-based measures to the biomedical domain
yields good results, as tested on two datasets using two ontologies.
2. In the general English domain (WordNet), the effect of the common specificity feature
of the two concept nodes cannot be seen, because the datasets are small (e.g., MC
contains 30 term pairs rated by 38 human subjects) and the term pairs are dispersed
across many taxonomies; hence, the specificity effect cannot be observed in experiments
[12]. However, in the biomedical domain, where the ontology covers concepts of a specific
domain and the number of term pairs belonging to a single taxonomy is high, the
specificity feature of concepts can be seen clearly. We observed that the two features,
path length and common specificity of two concepts, contribute equally to semantic
similarity in the good-performance cases in this domain.
3. While the technique of information content was developed to augment the pure
ontology-structure-based measures, most previous work shows that information-based
measures using information content do not surpass the ontology-structure-based measures
[4]. This thesis (Chapters 6-8) presents how to use a text corpus effectively in
measuring the semantic similarity of concepts. Furthermore, the experiments in Chapter 8
present a new view of using semantic features by considering the contributions of the
lexical representation of concepts in the network and of corpus statistics to similarity.
In this direction, we obtained promising results in computing the semantic similarity of
verbs in WordNet, given that the network of verbs is not richly developed.
4. Most of the semantic similarity work in the biomedical domain, especially using the
MeSH ontology, is limited to the ontology itself as the primary information source;
therefore, there is a need to investigate the creation of standard corpora in this
domain. The experimental results in Chapter 5 show that MEDLINE is a promising corpus in
this domain, especially for MeSH concepts.
5. The proposed cross-ontology approach is basically based on the granularity of the
ontologies and the mapping approach; hence, it can be applied to adapt other existing
ontology-structure-based measures for cross-ontology semantic similarity. The proposed
cross-ontology approach is a novel approach within a unified framework in the biomedical
domain. In the unified framework, all the concepts are unified, unlike in the general
English domain, where two identical concepts can have different names, making ontology
mapping difficult.
10.3 Conclusion
This thesis introduces and presents a number of approaches for computing semantic
similarity/distance in the general English domain as well as in the biomedical domain.
Semantic distance is the inverse of semantic similarity, and semantic similarity
techniques are used to compute the semantic similarity (i.e., common shared information)
of concepts or concept classes according to certain language or domain resources such as
ontologies, taxonomies, and corpora. The thesis also presents the related work and
relevant techniques in the biomedical domain and discusses some of the new directions
related to semantic similarity, topic similarity, and information retrieval (IR) in the
biomedical domain. The key contribution of this thesis is a novel semantic distance
approach that can measure the semantic distance/similarity between two concepts in a
unified framework comprising many ontologies that overlap in a set of controlled
concepts. The proposed techniques have been extensively evaluated in the biomedical and
general English domains. The experimental results confirmed the superiority and
efficiency of the proposed techniques in computing semantic similarity/distance within a
single ontology and across multiple ontologies.
In future work, we would like to develop and collect a cross-ontology semantic similarity
dataset in the biomedical domain for evaluating semantic similarity techniques and to
support research in this task. We will further investigate and explore the various IR
models and model combinations to be adapted and applied to the biomedical domain, to
benefit from the proposed cross-ontology semantic techniques and to exploit the numerous
biomedical ontologies and taxonomies within UMLS.
11. REFERENCES
[1] Al-Mubaid, H. and Nguyen, H.A. A Cluster-Based Approach for Semantic Similarity
in the Biomedical Domain, In Proc. The 28th Annual International Conference of the
IEEE Engineering in Medicine and Biology Society EMBS’06, New York, USA,
September 2006.
[2] Al-Mubaid, H. and Nguyen, H.A. Using MEDLINE as Standard Corpus for
Measuring Semantic Similarity of Concepts in the Biomedical Domain. In Proc. The
2006 IEEE 6th Symposium on Bioinformatics & Bioengineering BIBE-06,
Washington D.C., USA, October 2006. pp.315-319.
[3] Al-Mubaid, H. and Nguyen, H.A. Semantic Distance of Concepts within a Unified
Framework in the Biomedical Domain. Accepted paper, 22nd Annual ACM
Symposium on Applied Computing SAC’07, forthcoming March 2007.
[4] Budanitsky, A. and Hirst, G. Evaluating WordNet-based measures of semantic
distance. Computational Linguistics, vol. 32, no. 1, March 2006.
[5] Collins, A.M. and Loftus, E.F. A spreading-activation theory of semantic processing.
Psychological Review, 82, 407-428, 1975.
[6] Caviedes, J. and Cimino, J. Towards the development of a conceptual distance metric
for the UMLS. Journal of Biomedical Informatics 37,77-85, 2004.
[7] Francis, W.N. and Kucera, H. Brown Corpus Manual—Revised and Amplified, Dept.
of Linguistics, Brown Univ., Providence, R.I., 1979.
[8] Hliaoutakis, A. Semantic Similarity Measures in MeSH Ontology and their Application
to Information Retrieval on MEDLINE. Master's thesis, Technical University of Crete,
Greece, 2005.
[9] Jiang, J.J. and Conrath, D.W. Semantic similarity based on corpus statistics and
lexical taxonomy. In Proc. of the International Conference on Research in Computational
Linguistics, 19-33, 1997.
[10] Leacock, C., and Chodorow, M. Combining local context and WordNet similarity
for word sense identification. In Fellbaum, C., ed., WordNet: An electronic lexical
database. MIT press.265-283, 1998.
[11] Lin, D. An information-theoretic definition of similarity. In Proc. of the Int’l
Conference on Machine Learning, 1998.
[12] Li, Y., Bandar, Z.A. and McLean, D. An Approach for Measuring Semantic
Similarity between Words Using Multiple Information Sources. IEEE Transactions
on Knowledge and Data Engineering, 15(4), 871-882, 2003.
[13] Miller, G.A. WordNet: A Lexical Database for English. Comm. ACM, 38(11),
39-41, 1995.
[14] Miller, G.A. and Charles, W.G. Contextual Correlates of Semantic Similarity.
Language and Cognitive Processes, 6(1), 1-28, 1991.
[15] Miller, G.A., Leacock, C., Tengi, R. and Bunker, R.T. A semantic concordance. In
Proc. of the 3rd DARPA Workshop on Human Language Technology, 303-308,
Plainsboro, New Jersey, 1993.
[16] Mao, W. and Chu, W.W. Free-text medical document retrieval via phrase-based
vector space model. In Proc. AMIA Symposium 2002, 489-493, 2002.
[17] Nguyen, H.A. and Al-Mubaid, H. A New Ontology-based Semantic Similarity
Measure for the Biomedical Domain. In Proc. of the IEEE International Conference
on Granular Computing GrC'06, GA, USA, May 2006.
[18] Nguyen, H.A. and Al-Mubaid, H. A Combination-based Semantic Similarity
Approach Using Multiple Information Sources. In Proc. of the 2006 IEEE
International Conference on Information Reuse and Integration IEEE IRI 2006,
Hawaii, USA, September 2006.
[19] Pedersen, T., Patwardhan, S. and Michelizzi, J. WordNet::Similarity - Measuring
the Relatedness of Concepts. In Proc. of the Nineteenth National Conference on
Artificial Intelligence (AAAI-04), San Jose, CA, 2004.
[20] Pedersen, T., Pakhomov, S. and Patwardhan, S. Measures of Semantic Similarity
and Relatedness in the Medical Domain. University of Minnesota Digital Technology
Center Research Report DTC 2005/12.
[21] Quillian, M.R. Semantic Memory. In Minsky, M. (Ed.), Semantic Information
Processing, MIT Press, Cambridge, MA, 1968.
[22] Rada, R., Mili, H., Bicknell, E. and Blettner, M. Development and Application of a
Metric on Semantic Nets. IEEE Transactions on Systems, Man and Cybernetics,
19(1), 17-30, 1989.
[23] Resnik, P. Using information content to evaluate semantic similarity in a taxonomy.
In Proc. of the 14th International Joint Conference on Artificial Intelligence,
448-453, 1995.
[24] Resnik, P. and Diab, M. Measuring Verb Similarity. In Proc. of the Twenty-Second
Annual Meeting of the Cognitive Science Society (COGSCI 2000), Philadelphia,
August 2000.
[25] Rubenstein, H. and Goodenough, J.B. Contextual Correlates of Synonymy. Comm.
ACM, 8, 627-633, 1965.
[26] Richardson, R., Smeaton, A.F. and Murphy, J. Using WordNet as a Knowledge
Base for Measuring Semantic Similarity. Working Paper CA-1294, School of
Computer Applications, Dublin City Univ., Dublin, 1994.
[27] Rodriguez, M.A. and Egenhofer, M.J. Determining Semantic Similarity Among
Entity Classes from Different Ontologies. IEEE Transactions on Knowledge and
Data Engineering, 15(2), 442-456, 2003.
[28] Shepard, R.N. Toward a universal law of generalization for psychological science.
Science, 237, 1317-1323, 1987.
[29] Tversky, A. Features of similarity. Psychological Review 84(4): 327-352, 1977.
[30] Wu, Z., and Palmer, M. Verb semantics and lexical selection. In 32nd Annual
Meeting of the Association for Computational Linguistics, 133–138, 1994.
[31] UMLS: Unified Medical Language System. Available:
http://www.nlm.nih.gov/research/umls/
[32] XML MeSH. Available:
http://www.nlm.nih.gov/mesh/xmlmesh.html
[33] MeSH. Available:
http://www.nlm.nih.gov/mesh/meshhome.html
[34] UMLSKS. Available:
http://umlsks.nlm.nih.gov
[35] SNOMED-CT. Available:
http://www.snomed.org/index.html
[36] PubMed. Available:
http://www.ncbi.nlm.nih.gov
[37] MEDLINE. Available:
http://www.cas.org/ONLINE/DBSS/medliness.html
[38] NCI. Available:
http://www.cancer.gov/cancertopics/terminologyresources
[39] MeSH Browser. Available:
http://www.nlm.nih.gov/mesh/MBrowser.html
[40] The Semantic Vocabulary Interoperation Project. Available: http://lsdis.cs.uga.edu/~kashyap/projects/SVIP/
Appendix A
MeSHSimPack: A LIBRARY FOR MEASURING SEMANTIC SIMILARITY OF MESH CONCEPTS
1. Introduction
MeSHSimPack is a C# library for measuring the semantic distance/similarity between
MeSH headings. The library includes two main modules: (1) MeSHQueryData, for
querying information about MeSH headings, and (2) MeSHSimilarity, for measuring the
semantic distance/similarity of concepts using eight implemented ontology-based
semantic measures. Of these eight measures, three are semantic distance measures and
five are semantic similarity measures. Some of them are information-based measures
(Resnik, Jiang and Conrath, and Lin) that use MEDLINE as the corpus for computing the
information content of MeSH headings.
2. Semantic Measures
Ontology-based semantic similarity measures use the hierarchical relations of the
ontology as their primary information source, and may additionally use a corpus as a
secondary information source. They are derived from spreading activation theory. In
contrast, distributional similarity approaches, which are based on concept/word
occurrences, can also use the "scope note" or "gloss" of a concept in the ontology. The
framework focuses on implementing ontology-based semantic measures, including Path
Length, Wu and Palmer, Leacock and Chodorow, Li et al., NA, Resnik, Jiang and
Conrath, and Lin.
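For concreteness, the core formulas behind several of these measures can be sketched as
follows. This is an illustrative Python rendering of the published definitions, not code
from the (C#) library itself; the shortest path length, the taxonomy depth, and the
information-content (IC) values are assumed to have been obtained from the ontology
and corpus beforehand:

```python
import math

def leacock_chodorow(path_len, max_depth):
    # Leacock and Chodorow: negative log of the shortest path length
    # normalized by twice the maximum depth of the taxonomy.
    return -math.log(path_len / (2.0 * max_depth))

def resnik(ic_lcs):
    # Resnik: similarity is the information content of the least
    # common subsumer (LCS) of the two concepts.
    return ic_lcs

def lin(ic_lcs, ic_c1, ic_c2):
    # Lin: IC of the LCS scaled by the concepts' own IC values.
    return 2.0 * ic_lcs / (ic_c1 + ic_c2)

def jiang_conrath(ic_lcs, ic_c1, ic_c2):
    # Jiang and Conrath: a distance, not a similarity
    # (0 means the two concepts are identical).
    return ic_c1 + ic_c2 - 2.0 * ic_lcs
```

Note that Jiang and Conrath yields a distance while the others yield similarities, which is
why the library reports results on each measure's own scale.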
3. MeSHSimPack
MeSHSimPack has two main component modules and one database. The components of
the framework are as follows:
3.1 MeSHQueryData
MeSHQueryData is the interface module to the MeSH database. It provides its
implemented functions as APIs for querying information about MeSH headings/terms in
the MeSH database. This module can also be used by other applications in the biomedical
domain that use the MeSH ontology, such as information retrieval, information
extraction, and semantic computing.
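Purely as an illustration of the kind of lookup MeSHQueryData exposes, the sketch
below queries the tree numbers of a heading from a relational MeSH table. The function
name and table schema here are assumptions for the example, not the library's actual
API, and the sketch is in Python rather than C#:

```python
import sqlite3

def tree_numbers(con, heading):
    # Return all MeSH tree numbers recorded for the given heading.
    rows = con.execute(
        "SELECT tree_number FROM descriptor WHERE name = ?", (heading,))
    return [r[0] for r in rows]
```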
3.2 MeSHSimilarity
MeSHSimilarity is the main module, in which the eight ontology-based semantic
measures are implemented. Each measure takes two terms/headings, and possibly
additional parameters, as inputs and returns a numerical value indicating their semantic
distance/similarity on that measure's scale.
Figure 12. MeSHSimPack components.
Figure 13 shows the web-based interface of the MeSHSimilarity module, which takes
two MeSH headings/terms as inputs and produces their semantic distance/similarity in
the MeSH ontology.
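To make the module's behavior concrete, the following self-contained toy mirrors what
MeSHSimilarity does for a path-based measure: locate two headings in an is-a hierarchy
and return a score. The miniature hierarchy and the Python rendering are illustrative only;
the actual library works over the full MeSH tree in C#:

```python
# Toy is-a hierarchy (child -> parent); the root's parent is None.
parent = {
    "Disease": None,
    "Heart Disease": "Disease",
    "Infection": "Disease",
    "Myocardial Infarction": "Heart Disease",
    "Endocarditis": "Heart Disease",
}

def ancestors(node):
    # Chain of nodes from 'node' up to the root, inclusive.
    chain = []
    while node is not None:
        chain.append(node)
        node = parent[node]
    return chain

def wu_palmer_sim(c1, c2):
    # Wu and Palmer: 2*depth(LCS) / (depth(c1) + depth(c2)),
    # where the root has depth 1.
    a1, a2 = ancestors(c1), ancestors(c2)
    lcs = next(n for n in a1 if n in a2)   # least common subsumer
    depth = lambda n: len(ancestors(n))
    return 2.0 * depth(lcs) / (depth(c1) + depth(c2))

print(wu_palmer_sim("Myocardial Infarction", "Endocarditis"))  # siblings
print(wu_palmer_sim("Myocardial Infarction", "Infection"))     # more distant
```

As expected, sibling headings under the same subsumer score higher than headings that
meet only at the root.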
3.3 LocalMeSH and MeSHConverter
The original MeSH database is distributed in two formats: the MeSH XML database,
which contains files in XML format, and the MeSH ASCII database, which contains files
in ASCII format. For performance and convenience, a tool called MeSHConverter was
developed to convert the MeSH XML database into a relational database (LocalMeSH).
The MeSH XML database has three main files:
- Descriptors (main headings) (desc200x.xml): characterize the subject matter or
content.
- Qualifiers (qual200x.xml): used together with descriptors; they afford a means of
grouping documents concerned with a particular aspect of a subject.
- Supplementary Concept Records (supp200x.xml).
Currently, MeSHConverter converts only the desc200x.xml file into the relational
database, extracting the information needed for semantic computation.
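The conversion idea can be sketched as follows: stream through desc200x.xml, pull out
each descriptor's unique identifier, name, and tree numbers, and store them in a relational
table. The element names follow the MeSH descriptor XML format, but the table layout
below is an assumption for illustration, not necessarily LocalMeSH's actual schema (and
the sketch is in Python, whereas MeSHConverter is C#):

```python
import sqlite3
import xml.etree.ElementTree as ET

def convert(desc_xml_path, db_path):
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS descriptor
                   (ui TEXT, name TEXT, tree_number TEXT)""")
    # iterparse keeps memory bounded on the large descriptor file.
    for _, rec in ET.iterparse(desc_xml_path):
        if rec.tag != "DescriptorRecord":
            continue
        ui = rec.findtext("DescriptorUI")
        name = rec.findtext("DescriptorName/String")
        for tn in rec.iterfind("TreeNumberList/TreeNumber"):
            con.execute("INSERT INTO descriptor VALUES (?,?,?)",
                        (ui, name, tn.text))
        rec.clear()  # free the parsed element
    con.commit()
    return con
```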
3.4 ICGenerator
The information-based measures require the information content (IC) of concepts when
measuring semantic similarity; therefore, a tool called ICGenerator was developed to
update the MeSH database with the IC of each concept node. This tool takes the
MH_Freq_Count file as input and updates the LocalMeSH database as output.
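The quantity ICGenerator computes is the standard corpus-based information content,
IC(c) = -log(freq(c)/N), where freq(c) is the concept's frequency count (from MEDLINE,
via MH_Freq_Count) and N is the total count. A minimal Python sketch, with the
smoothing choice an assumption of this example:

```python
import math

def information_content(freq, total):
    # IC(c) = -log(freq(c) / N). Unseen concepts are smoothed to a
    # count of 1 so that the IC value stays finite.
    return -math.log(max(freq, 1) / total)
```

More frequent (more general) concepts get lower IC, and the root of the taxonomy, which
subsumes everything, gets IC close to zero.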
4. Discussion
A C# library has been introduced that supports measuring the semantic similarity of
MeSH headings through implementations of existing ontology-based measures. The
library can be integrated into semantic-similarity-based applications or used for semantic
similarity research in the biomedical domain. As a continuation of this work, a
framework will be investigated and implemented that supports measuring the semantic
similarity of UMLS concepts dispersed across the many ontologies of the UMLS
Metathesaurus.
Figure 13. Web-based interface of MeSHSimilarity.