
Research Article

An Ontology-Based Semantic Similarity Measure Considering Multi-Inheritance in Biomedicine

Fengqin Yang,1,2 Yuanyuan Xing,3 Hongguang Sun,1,2 Tieli Sun,1,4 and Siya Chen1

1 School of Computer Science and Information Technology, Northeast Normal University, Changchun 130117, China
2 Key Laboratory of Intelligent Information Processing of Jilin Universities, Changchun 130117, China
3 Students' Affairs Division, Qingdao Technological University, Qingdao 266033, China
4 College of Humanities and Science, Northeast Normal University, Changchun 130117, China

Correspondence should be addressed to Hongguang Sun; sunhg889@nenu.edu.cn and Tieli Sun; suntl@nenu.edu.cn

Received 12 February 2015; Accepted 27 April 2015

Academic Editor: Chih-Cheng Hung

Copyright © 2015 Fengqin Yang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Computation of semantic similarity between words for text understanding is a vital issue in many applications such as word sense disambiguation, document categorization, and information retrieval. In recent years, different paradigms have been proposed to compute semantic similarity based on different ontologies and knowledge resources. In this paper, we propose a new similarity measure combining both superconcepts of the evaluated concepts and their common specificity feature. The common specificity feature considers the depth of the Least Common Subsumer (LCS) of two concepts and the depth of the ontology to obtain more semantic evidence. The multiple inheritance phenomenon in a large and complex taxonomy is taken into account by considering all superconcepts of the evaluated concepts. We evaluate and compare the correlation obtained by our measure with human scores against other existing measures, exploiting SNOMED CT as the input ontology. The experimental evaluations show the applicability of the measure on different datasets and confirm the efficiency and simplicity of our proposed measure.

1. Introduction

In the last few years, the amount of available electronic information has increased sharply in many research areas such as biomedicine, education, psychology, linguistics, cognitive science, and artificial intelligence. As we know, most of the information sources are presented in unstructured or semistructured textual formats. Hence, it is an urgent issue to process the text information from a semantic perspective. Understood as the degree of taxonomical proximity, semantic similarity computes the likeness between words and plays a very important part in the above-mentioned fields, in tasks such as word sense disambiguation [1], word spelling correction [2], automatic language translation [3], document categorization or clustering [4], information extraction and retrieval [5–7], detection of redundancy, and ontology learning [8, 9]. It is worth mentioning that many applications of semantic similarity computation are discussed in the biomedical domain due to the availability of numerous medical ontologies and resources that organize medical concepts into hierarchies. For example, semantic similarity between concepts of ontologies such as the Gene Ontology [10, 11] was computed with the aim of assessing protein functional similarity [6].

As mentioned above, semantic similarity is relevant to many research areas. Designing accurate computing methods is important for improving the performance of applications dependent on semantic similarity. Essentially, semantic similarity measures assess a score between a pair of words, making use of the information from some predefined knowledge sources (such as ontologies or domain corpora) containing the semantic evidence. Therefore, the accuracy of semantic similarity approaches relies on the knowledge sources. So far, many semantic similarity measures have been proposed, which can be classified according to the domain knowledge exploited and the theoretical principles. The semantic similarity measures can be roughly divided into several categories as follows: (1) measures based on the taxonomical structure of the ontology, which estimate semantic similarity by counting the number of nodes or edges separating two concepts [12–14]; even though these methods are the most intuitive and easy to implement, they work properly only with consistent and rich ontologies; (2) measures utilizing the information content (IC) of concepts, which exploit the notion of IC, defined as a measure of the semantic information of concepts and computed by counting the occurrence of words in large corpora [15–17]; their shortcomings are that it is necessary to perform time-consuming analysis of corpora and that the IC values depend on the considered corpora; (3) measures using the amount of cooccurrences between word contexts, which construct context vectors of concepts by extracting contextual words (within a fixed window of context) from a corpus of textual documents including the evaluated concepts and compute the similarity of concepts as the cosine of the angle between their context vectors [11, 18, 19]. Similar to the methods mentioned in category (2), the availability and suitability of corpora affect the applicability of these measures.

Hindawi Publishing Corporation, Mathematical Problems in Engineering, Volume 2015, Article ID 305369, 9 pages, http://dx.doi.org/10.1155/2015/305369

Usually, these measures can obtain good performance when we employ large and general purpose knowledge bases like WordNet [20]. Some of them have been applied to the biomedical field using domain information extracted from clinical data or relevant medical ontologies such as SNOMED CT (https://uts.nlm.nih.gov/home.html) [21, 22] or MeSH (http://www.nlm.nih.gov/mesh/meshhome.html) [22, 23] in the Unified Medical Language System (UMLS) [22], and authors compared these measures and analyzed and evaluated them over certain datasets to determine their advantages and limitations with respect to the background knowledge source [24–26].

In this paper, we first review and investigate different measures for semantic similarity computation. Then, we propose a new measure considering the multiple inheritance in ontologies and the common specificity feature of the evaluated concepts in order to obtain a more accurate similarity between concepts. Finally, we evaluate the proposed measure using two datasets of biomedical term pairs scored for similarity by human experts, exploiting SNOMED CT as the input ontology. We compare the correlation obtained by our measure with human scores against other measures. The experimental evaluations confirm the efficiency of the proposed measure.

The rest of the paper is organized as follows. Section 2 investigates the basic methods for semantic similarity, including the taxonomy-based measures, the IC-based measures, and the context vector measures. Section 3 presents the proposed measure for semantic similarity and its main advantages. Section 4 evaluates and compares the measure against the analyzed measures using SNOMED CT as the input ontology. Section 5 analyzes and discusses the experimental results. The final section presents the conclusions of the paper.

2. Existing Measures for Computing Semantic Similarity

The existing measures for semantic similarity are discussed as follows.

2.1. Measures Based on the Taxonomical Structure. The simplest way of computing similarity for concepts is the measure based on path length developed by Rada et al. [13]. The measure quantifies the shortest distance between the two concept nodes $c_1$ and $c_2$:

$$\mathrm{dis}(c_1, c_2) = N_1 + N_2, \tag{1}$$

where $N_1$ and $N_2$ stand for the minimum number of is-a links from $c_1$ and $c_2$ to their LCS, respectively.

Wu and Palmer [14] introduced a measure based on path length that also considers the depth of the concepts in the hierarchy. It is based on the assumption that concepts lower down in the taxonomy are more similar than those higher up:

$$\mathrm{sim}_{W\&P}(c_1, c_2) = \frac{2 \times N_3}{N_1 + N_2 + 2 \times N_3}, \tag{2}$$

where $N_3$ is the number of is-a relations from the LCS of the evaluated concepts to the root of the ontology. The similarity value ranges from 1 (for identical concepts) to 0.

Leacock and Chodorow [12] proposed a measure for similarity in which the shortest path length between two concepts was scaled by twice the maximum depth $D$ of the hierarchy:

$$\mathrm{sim}_{L\&C}(c_1, c_2) = -\log\left(\frac{N_1 + N_2 + 1}{2 \times D}\right). \tag{3}$$

Besides, there are other measures for semantic similarity based on structure. Li et al. [27] developed a measure combining the depth of the ontology and the shortest path:

$$\mathrm{sim}_{Li}(c_1, c_2) = e^{-\alpha\,\mathrm{path}(c_1, c_2)} \cdot \frac{e^{\beta h} - e^{-\beta h}}{e^{\beta h} + e^{-\beta h}}, \tag{4}$$

where $h$ is the minimum depth of the LCS in the hierarchy, $\mathrm{path}(c_1, c_2)$ is the shortest path between the two concepts, and $\alpha \ge 0$ and $\beta > 0$ stand for the contribution of the shortest path and the depth, respectively. The optimal parameters for the measure were $\alpha = 0.2$ and $\beta = 0.6$.
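As a minimal sketch, formulas (1) to (4) can be written as plain functions of the link counts. The function names are hypothetical, and the natural logarithm in formula (3) is an assumption, since the text does not fix a log base.

```python
import math


def dis_rada(n1: int, n2: int) -> int:
    """Formula (1): number of is-a links from c1 and c2 to their LCS."""
    return n1 + n2


def sim_wu_palmer(n1: int, n2: int, n3: int) -> float:
    """Formula (2): n3 is the number of is-a links from the LCS to the root.

    Undefined (division by zero) when both concepts are the root itself.
    """
    return (2 * n3) / (n1 + n2 + 2 * n3)


def sim_leacock_chodorow(n1: int, n2: int, max_depth: int) -> float:
    """Formula (3): path length scaled by twice the maximum depth D."""
    return -math.log((n1 + n2 + 1) / (2 * max_depth))


def sim_li(path_len: int, h: int, alpha: float = 0.2, beta: float = 0.6) -> float:
    """Formula (4): the depth factor is exactly tanh(beta * h)."""
    return math.exp(-alpha * path_len) * math.tanh(beta * h)


if __name__ == "__main__":
    # Toy counts: c1 is 1 link below the LCS, c2 is 2 links below it,
    # and the LCS is 3 links below the root of a depth-10 hierarchy.
    print(dis_rada(1, 2))
    print(sim_wu_palmer(1, 2, 3))
    print(sim_leacock_chodorow(1, 2, 10))
    print(sim_li(path_len=3, h=3))
```

Writing the depth factor of formula (4) as `math.tanh` makes it visible that it saturates at 1 for deep LCS nodes, so the path term dominates there.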

Al-Mubaid and Nguyen [24] proposed a cluster-based measure combining path length and common specificity that considers the depth of the LCS of two concepts and the depth $D$ of the ontology. They defined the clusters as the branches of the ontology with respect to the root node. The common specificity of concepts $c_1$ and $c_2$ is defined as follows:

$$\mathrm{CSpec}(c_1, c_2) = D_c - \mathrm{Depth}(\mathrm{LCS}(c_1, c_2)), \tag{5}$$

where $D_c$ is the depth of the cluster including concepts $c_1$ and $c_2$. Thus, the $\mathrm{CSpec}(c_1, c_2)$ feature determines the "common specificity" of two concepts in the cluster. The smaller the common specificity value of two concept nodes is, the more information they share, and thus the more similar they are. The semantic distance measure is defined as follows:

$$\mathrm{SemDist}_{A\&N}(c_1, c_2) = \log\left((\mathrm{Path} - 1)^{\alpha} \times (\mathrm{CSpec})^{\beta} + k\right), \tag{6}$$

where $\alpha$ and $\beta$ ($\alpha > 0$, $\beta > 0$) are contribution factors of the two features, $k$ is a constant which must be greater than or equal to 1 to ensure that the distance is positive and the combination is nonlinear, and Path is the length of the shortest path between the two concept nodes.
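Formulas (5) and (6) can be sketched as follows. The default values `alpha = beta = k = 1`, the natural logarithm, and the convention that Path is counted in nodes (so that Path = 1 for identical concepts) are assumptions made for this illustration; the text above does not state them.

```python
import math


def common_specificity(cluster_depth: int, lcs_depth: int) -> int:
    """Formula (5): CSpec = D_c - Depth(LCS(c1, c2))."""
    return cluster_depth - lcs_depth


def sem_dist_an(path: int, cspec: int,
                alpha: float = 1.0, beta: float = 1.0, k: float = 1.0) -> float:
    """Formula (6): log((Path - 1)^alpha * CSpec^beta + k).

    `path` is assumed to be counted in nodes, so path >= 1.
    """
    if k < 1:
        raise ValueError("k must be >= 1 so that the distance is non-negative")
    return math.log((path - 1) ** alpha * cspec ** beta + k)


if __name__ == "__main__":
    cspec = common_specificity(cluster_depth=6, lcs_depth=4)
    print(sem_dist_an(path=4, cspec=cspec))  # log(3 * 2 + 1)
    print(sem_dist_an(path=1, cspec=cspec))  # identical concepts under the node-count assumption
```

With `k = 1`, identical concepts get distance `log(1) = 0`, which is the behaviour the constraint on `k` is meant to guarantee.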

Batet et al. [25] proposed a similarity measure which takes into account all the superconcepts (subsumers of the evaluated terms) belonging to all the possible paths between the concept nodes and defined the measure as the ratio between the amount of nonshared information and the sum of shared and nonshared information. The similarity measure of two concepts considering multiple inheritance is defined as follows:

$$\mathrm{sim}_{B\&S}(c_1, c_2) = -\log_2 \frac{|T(c_1) \cup T(c_2)| - |T(c_1) \cap T(c_2)|}{|T(c_1) \cup T(c_2)|}, \tag{7}$$

where $T(c_i) = \{c_j \in C \mid c_j \text{ is superconcept of } c_i\} \cup \{c_i\}$ and $C$ is the concept set.

From the measures introduced above, we can conclude that the LCS of two concepts plays a vital part in the computation of semantic similarity. Some measures use only the ontology as background knowledge (no corpus with domain data is needed), which makes them heavily dependent on the ontology itself. Besides, the minimum path (shortest path) also plays an important part. For large ontologies such as SNOMED CT, one or both concepts may inherit from several is-a hierarchies. In this case, Al-Mubaid and Nguyen [24] used the minimum path to get the maximum semantic similarity. However, this may omit much other available taxonomical knowledge: multiple possible paths can exist between any two concepts, but only the shortest one is selected among those paths. To solve the problem, we can take all the superconcepts into account and try to get more semantic evidence in the case of multiple inheritance, which makes the measure for semantic similarity more accurate.
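To illustrate how all superconcepts enter the computation under multiple inheritance, the following sketch evaluates formula (7) on a small taxonomy. The five-node taxonomy and all identifiers are hypothetical and serve only as an example; they are not taken from SNOMED CT.

```python
import math

# Hypothetical toy taxonomy: concept -> list of direct is-a parents.
PARENTS = {
    "c1": ["c3", "c4"],  # c1 inherits from two hierarchies
    "c2": ["c4"],
    "c3": ["c5"],
    "c4": ["c5"],
    "c5": [],            # root
}


def superconcept_set(concept: str) -> set:
    """T(c): every ancestor on every is-a path, plus the concept itself."""
    seen = {concept}
    stack = [concept]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def sim_batet(c1: str, c2: str) -> float:
    """Formula (7): -log2 of the share of nonshared superconcepts."""
    t1, t2 = superconcept_set(c1), superconcept_set(c2)
    union, shared = len(t1 | t2), len(t1 & t2)
    ratio = (union - shared) / union
    # For identical concepts the ratio is 0 and the similarity is unbounded.
    return math.inf if ratio == 0 else -math.log2(ratio)


if __name__ == "__main__":
    print(sorted(superconcept_set("c1")))
    print(sim_batet("c1", "c2"))
    print(sim_batet("c1", "c1"))
```

Because `superconcept_set` follows every parent link, the superconcept `c3` contributes to the result even though it does not lie on the shortest path between `c1` and `c2`.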

From a domain-independent point of view, most path-based measures rely on large and general purpose taxonomies in order to achieve high accuracy. Usually, researchers choose WordNet to apply these measures because of its perfect structure. However, the coverage of biomedical terms in WordNet is so limited that the accuracy of similarity assessments for medical terms is poor [11, 28]. So Pedersen et al. [11], Al-Mubaid and Nguyen [24], and Batet et al. [25] adapted these measures to the biomedical domain by exploiting SNOMED CT as the input ontology.

2.2. Measures Based on Information Content. These measures evaluate the similarity of concepts depending on the amount of shared information between two concepts. According to information theory, concepts are evaluated by their IC. IC can be considered as a measure which quantifies the amount of information that a given concept expresses when appearing in a taxonomy. Resnik [17] stated that concept semantic similarity depends on the amount of shared information between them.

In Resnik's seminal work [17], IC is computed according to $p(c)$, representing the probability of occurrence of a concept $c$ in a corpus:

$$\mathrm{IC}(c) = -\log p(c). \tag{8}$$

Usually, in a general context, $p(c)$ estimation is severely hampered due to textual ambiguity and data sparseness problems [29]. In fact, the tagged corpora about domain information like biomedicine are limited. Authors [15, 17] estimated concept appearance from SemCor [30], that is, a semantically tagged text consisting of 100 passages from the Brown Corpus based on WordNet. Because the manual tagging scheme is based on the fine grained structure of word senses covered by WordNet, the $p(c)$ estimation is accurate but is limited to the coverage of the corpus, which covers less than 13% of the available word senses in WordNet.

To guarantee the consistency of similarity computation, the coherence of the $p(c)$ computation with the taxonomical structure should be taken into account. Meanwhile, to compute the value, both the explicit appearances of concept $c$ and its specializations must be considered. Thus, Resnik [17] proposed the measure for calculating $p(c)$ shown as formula (9):

$$p(c) = \frac{\sum_{n \in w(c)} \mathrm{count}(n)}{N}, \tag{9}$$

where $w(c)$ is the set of words subsumed by concept $c$ and $N$ is the total number of observed corpus terms, excluding those that are not subsumed by any WordNet class.

Resnik [17] noted that concepts at a lower level are usually more specialized and share more information, which is represented by the LCS of both concepts in the taxonomy. So the higher the IC of the subsumer of the concepts is, the more similar the concepts are. Based on this premise, Resnik measures the similarity as the IC of the LCS of the concepts:

$$\mathrm{sim}_{res}(c_1, c_2) = \mathrm{IC}(\mathrm{LCS}(c_1, c_2)). \tag{10}$$

To tackle the problem of Resnik's measure that the similarity value is the same for any two pairs of concepts that have the same LCS, both Jiang and Conrath [15] and Lin [16] improved Resnik's measure.

Lin measured the similarity as the ratio between the IC of the LCS of the two concepts and the summation of the IC of the two concepts:

$$\mathrm{sim}_{lin}(c_1, c_2) = \frac{2 \times \mathrm{IC}(\mathrm{LCS}(c_1, c_2))}{\mathrm{IC}(c_1) + \mathrm{IC}(c_2)}. \tag{11}$$

Jiang and Conrath calculated the dissimilarity between concepts, from which the similarity of concepts follows, with formula (12):

$$\mathrm{dis}_{J\&C}(c_1, c_2) = (\mathrm{IC}(c_1) + \mathrm{IC}(c_2)) - 2 \times \mathrm{IC}(\mathrm{LCS}(c_1, c_2)). \tag{12}$$
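The IC-based formulas (8) to (12) can be sketched on a toy example. The single-inheritance taxonomy, the concept names, and the corpus counts below are all made up for illustration, counts are attached to concepts rather than words for brevity, and the natural logarithm is an assumption.

```python
import math

# Hypothetical single-inheritance taxonomy: concept -> parent (None for the root).
PARENT = {
    "disease": None,
    "heart disease": "disease",
    "lung disease": "disease",
    "myocarditis": "heart disease",
    "endocarditis": "heart disease",
}
# Made-up corpus counts of explicit appearances of each concept.
COUNT = {"disease": 10, "heart disease": 6, "lung disease": 8,
         "myocarditis": 3, "endocarditis": 3}
TOTAL = sum(COUNT.values())  # N in formula (9)


def ancestors(concept):
    """The concept followed by its chain of superconcepts up to the root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def p(concept):
    """Formula (9): appearances of the concept and of all its specializations."""
    subsumed = sum(n for c, n in COUNT.items() if concept in ancestors(c))
    return subsumed / TOTAL


def ic(concept):
    """Formula (8)."""
    return -math.log(p(concept))


def lcs(c1, c2):
    """Deepest common subsumer (well defined here because each concept has one parent)."""
    others = set(ancestors(c2))
    return next(a for a in ancestors(c1) if a in others)


def sim_resnik(c1, c2):
    return ic(lcs(c1, c2))                            # formula (10)


def sim_lin(c1, c2):
    return 2 * ic(lcs(c1, c2)) / (ic(c1) + ic(c2))    # formula (11)


def dis_jc(c1, c2):
    return ic(c1) + ic(c2) - 2 * ic(lcs(c1, c2))      # formula (12)


if __name__ == "__main__":
    a, b = "myocarditis", "endocarditis"
    print(lcs(a, b), sim_resnik(a, b), sim_lin(a, b), dis_jc(a, b))
```

In this toy corpus the root has $p = 1$ and therefore IC 0, so any pair whose LCS is the root gets a Resnik similarity of 0 regardless of how specific the two concepts are.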

As mentioned above, the IC-based similarity assessments will not get the best performance when we use a general purpose resource such as WordNet or SemCor, due to their limited coverage of biomedical terms. For this reason, Pedersen et al. [11] applied these measures to the biomedical domain by exploiting SNOMED CT and computing the IC of concepts using the Mayo Clinic Corpus of Clinical Notes as a domain corpus.

2.3. Context Vector Measures Computing Semantic Relatedness. Patwardhan and Pedersen [19] proposed a measure of semantic relatedness that represents a concept with a context vector. It is more flexible than similarity measures, since the information source for the context vectors is a raw corpus of text and the concepts do not need to be connected via a path of relations in an ontology. They built gloss vectors corresponding to each concept in WordNet using the cooccurrence information along with the WordNet definitions. In their experiments, the glosses seemed to contain content rich terms, which would distinguish various concepts much better than text drawn from a more generic corpus. The WordNet glosses can be viewed as a corpus of contexts consisting of about 1.4 million words. The gloss vector measure got the highest correlation with respect to human judgment using different benchmarks [19].

Pedersen et al. [11] constructed cooccurrence vectors that represent the contextual profile of concepts and applied the measure to the biomedical field. In their study, they created context vectors corresponding to each concept in SNOMED CT using a set of word vectors. The word vectors for all words occurring in the clinical notes were produced with the Mayo Clinic Corpus of Clinical Notes. In these vectors, the window size of the context is one line of the text.

Then, the semantic relatedness of two concepts $c_1$ and $c_2$ is computed as the cosine of the angle between their context vectors with formula (13):

$$\mathrm{rel}_{vector}(c_1, c_2) = \frac{\vec{v}_1 \cdot \vec{v}_2}{|\vec{v}_1| \cdot |\vec{v}_2|}, \tag{13}$$

where $\vec{v}_1$ and $\vec{v}_2$ are the context vectors corresponding to $c_1$ and $c_2$, respectively. Note that the context vector measure performs differently depending on the choice of clinical notes in the experiments. In other words, this measure depends heavily on the availability and quality of the corpora.
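Formula (13) is the usual cosine of two vectors. A minimal sketch follows; the cooccurrence counts are invented for illustration, and returning 0 for an all-zero vector is a convention chosen here, not something stated in the text.

```python
import math


def rel_vector(v1, v2):
    """Formula (13): cosine of the angle between two context vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention for a concept with no observed context
    return dot / (norm1 * norm2)


if __name__ == "__main__":
    # Made-up cooccurrence counts over the same four context words.
    heart = [3, 4, 0, 1]
    myocardium = [2, 5, 0, 0]
    syringe = [0, 0, 6, 1]
    print(rel_vector(heart, myocardium))
    print(rel_vector(heart, syringe))
```

Since the vectors are built from corpus counts, the result changes whenever the corpus or the context window changes, which is the dependency on corpus quality noted above.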

3. Proposed Measure for Computing the Semantic Similarity

According to the analysis of similarity measures above, measures based on IC or context vectors depend on corpora. As a matter of fact, corpora consist of unstructured or semistructured textual data. Corpora need to be preprocessed to obtain enough semantic information, which brings a heavy computational burden. In the biomedical domain, it is very difficult to get enough clinical data due to the sensitivity of patient data, which may cause data sparseness problems. For these reasons, the applicability of these measures may be hampered by the availability of enough suitable data. On the contrary, path-based measures only use the structure of ontologies and do not require preprocessing of text data, which gives them low computational complexity. But every coin has two sides. Path-based measures are very simple, and they cannot obtain enough semantic evidence to perform better than other measures like IC-based measures and context vector relatedness measures.

In path-based measures, when the path length between each of the concepts and their LCS is calculated, only the minimum path length is kept for use, even if all paths may be calculated. However, in the taxonomy, one or both concepts may inherit from several is-a hierarchies. For this reason, there exists a case of multiple inheritance [25, 31], especially in large and complex taxonomies (e.g., SNOMED CT) including thousands of interrelated concepts in the hierarchies. For example, in Figure 1, if we choose the minimum distance among all the paths between $c_1$ and $c_2$, the contributions of the noncommon superconcepts of the two concepts in the taxonomy are omitted, which affects the accuracy of the measures.

In this paper, we significantly improve the measure of Al-Mubaid and Nguyen [24] and propose a new modified measure for semantic similarity combining both superconcepts of the evaluated concepts and the common specificity feature, which can capture more semantic evidence. The proposed measure can achieve better performance than other measures based on structure while keeping their simplicity.

To take into account the contributions of all noncommon superconcepts of the evaluated concepts, we consider the concepts themselves and all their noncommon superconcepts, instead of the minimum path length, to capture more semantic evidence for similarity. Besides, in our measure we also consider the common specificity feature of the concept nodes, scaled by the depth of the concept nodes and the depth of their LCS. Thus, we combine noncommon superconcepts with common specificity in order to get more semantic information for computing similarity between two concepts.

Let $c_i$ stand for the $i$th concept of an ontology. Then $N(c_i)$ is defined as the set of all superconcepts of $c_i$, including $c_i$ itself. Thus, the number of noncommon superconcepts of concepts $c_1$ and $c_2$ can be defined as follows:

$$\mathrm{NonComSub}(c_1, c_2) = |N(c_1) \cup N(c_2)| - |N(c_1) \cap N(c_2)|. \tag{14}$$

Here, the NonComSub value can be an indication of the path length of the two concepts. For example, the number of noncommon superconcepts for concepts $c_1$ and $c_2$ is 3 in Figure 1.

On the other hand, the common specificity feature is defined as follows [24]:

$$\mathrm{ComSpec}(c_1, c_2) = D - \mathrm{Depth}(\mathrm{LCS}(c_1, c_2)), \tag{15}$$

where $D$ is the depth of the ontology and the ComSpec feature determines the common specificity of two evaluated concepts. The smaller the ComSpec value of two concepts, the more information they share and thus the more similar they are.


Figure 1: Ontology with multi-inheritance (nodes C1, C2, C3, C4, and C5).

Then, we use a logarithm function of NonComSub and ComSpec to represent semantic distance, which is inverse to semantic similarity. Therefore, the semantic distance between concepts $c_1$ and $c_2$ is defined as follows:

$$\mathrm{SemD}(c_1, c_2) = \log(\mathrm{NonComSub} \times \mathrm{ComSpec} + 1). \tag{16}$$

It is worth mentioning that any concept can be compared with itself; in this case, NonComSub is 0 and the semantic distance is $\log(1) = 0$. The measure can therefore be applied to all concepts without checking whether or not the two concepts compared are distinct.
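The proposed distance of formulas (14) to (16) can be sketched end to end on a small taxonomy. Several details are assumptions made only for this illustration: the five-node taxonomy is hypothetical and was chosen so that NonComSub(c1, c2) = 3, the value reported for Figure 1 (the actual edges of Figure 1 are not given in the text); the root has depth 1; the depth of a node with several parents follows its longest is-a path; the LCS is the deepest common superconcept; and the logarithm is natural.

```python
import math

# Hypothetical taxonomy: concept -> direct is-a parents.
PARENTS = {
    "c1": ["c3", "c4"],  # multiple inheritance
    "c2": ["c4"],
    "c3": ["c5"],
    "c4": ["c5"],
    "c5": [],            # root
}


def n_set(concept):
    """N(c): all superconcepts of c, including c itself."""
    result = {concept}
    for parent in PARENTS[concept]:
        result |= n_set(parent)
    return result


def depth(concept):
    """Assumed convention: root depth is 1, longest is-a path otherwise."""
    parents = PARENTS[concept]
    return 1 if not parents else 1 + max(depth(p) for p in parents)


ONTOLOGY_DEPTH = max(depth(c) for c in PARENTS)  # D in formula (15)


def non_com_sub(c1, c2):
    """Formula (14)."""
    n1, n2 = n_set(c1), n_set(c2)
    return len(n1 | n2) - len(n1 & n2)


def com_spec(c1, c2):
    """Formula (15), taking the deepest common superconcept as the LCS."""
    lcs_depth = max(depth(c) for c in n_set(c1) & n_set(c2))
    return ONTOLOGY_DEPTH - lcs_depth


def sem_d(c1, c2):
    """Formula (16)."""
    return math.log(non_com_sub(c1, c2) * com_spec(c1, c2) + 1)


if __name__ == "__main__":
    print(non_com_sub("c1", "c2"), com_spec("c1", "c2"), sem_d("c1", "c2"))
    print(sem_d("c1", "c1"))  # a concept compared with itself
```

Under these assumptions no special case is needed for identical concepts: the product inside the logarithm is 0, and the `+ 1` keeps the argument at 1.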

4. Evaluation

Measures of semantic similarity are usually evaluated by comparing the computed similarity values against human judgments using a correlation coefficient. The higher the correlation with the human experts' similarity scores is, the better the measure is.

In the biomedical field, there are no standard human rating datasets for semantic similarity like the manually rated concept sets created by Rubenstein and Goodenough [32] and Miller and Charles [33]. Pedersen et al. [11] stated that it is necessary to choose sets of words manually scored for the evaluation of concept semantic similarity measures in biomedicine. In their research, they created a set of 30 concept pairs regarding medical disorders with the help of Mayo Clinic experts. The set was annotated by three physicians and nine medical coders. All three physicians are specialists in the area of rheumatology. Finally, after a series of processing steps, the average of the similarity values of the 30 concept pairs was normalized to a scale between 1 and 4. The average correlation between physicians is 0.68, while the average correlation between medical coders is 0.78. The 30 medical term pairs with averaged experts' similarity scores are shown in Table 1.

Pedersen et al. [11] applied the standard to evaluate the path-based and the IC-based measures by exploiting the SNOMED CT taxonomy as domain ontology and the Mayo Clinical Corpus and Thesaurus as corpora, respectively. Medical coders had a better understanding of the notion of similarity because they were pretrained during the construction of the original dataset. So medical coders' ratings seem to reproduce better the concept of (taxonomic) similarity, whereas physicians' ratings seem to represent a more general concept of (taxonomic and nontaxonomic) relatedness. Al-Mubaid and Nguyen [24] only compared their concept similarity values against coders' ratings, while Batet et al. [25] made a comparison between the concept similarity values obtained and the scores of both groups.

In this paper, in order to compare the performance of our measure with other measures in the biomedical domain objectively, we used the set of 30 concept pairs from [11] (called Dataset 1) and the set of 36 biomedical term pairs from [6] (called Dataset 2) as experimental datasets. The 36 medical term pairs in Dataset 2 with averaged human similarity scores are shown in Table 2, in which the human scores are the average evaluated scores of reliable doctors. We use the UMLS Knowledge Source (UMLSKS) browser (https://uts.nlm.nih.gov/home.html) for SNOMED CT to get information on the terms in the two datasets. The comparative results using Dataset 1 with respect to both physicians and medical coders and the results using Dataset 2 with respect to human scores are shown in Table 3. Note that our measure computes semantic distance, while human ratings represent similarity, so a linear transformation should be performed.
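The evaluation step can be sketched as a linear transformation of the distances followed by a Pearson correlation against the human ratings. The particular transformation used here (maximum distance minus distance) is an assumption, since the text does not specify one, and all numbers below are invented toy values, not results from Table 3.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def distance_to_similarity(distances):
    """Assumed linear transformation: similarity = max distance - distance."""
    top = max(distances)
    return [top - d for d in distances]


if __name__ == "__main__":
    human_ratings = [4.0, 3.0, 2.0, 1.0]   # toy ratings on the 1 to 4 scale
    distances = [0.0, 1.1, 1.8, 2.3]       # toy semantic distances
    similarities = distance_to_similarity(distances)
    print(pearson(human_ratings, similarities))
    print(pearson(human_ratings, distances))
```

Any decreasing linear transformation gives the same result, because it only flips the sign of the correlation between ratings and raw distances.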

5. Discussion

For Dataset 1, there are 29 out of 30 concept pairs in SNOMED CT, and the average correlation between physicians is 0.68, while the average correlation between medical coders is 0.78. The evaluation results show that the path-based similarity measures obtain correlations lower than 0.36 and 0.66 for physicians and coders, respectively. This indicates that the accuracy of path-based measures is limited. These measures utilize the minimum path length without considering multiple inheritance, which causes much useful semantic evidence to be ignored. The IC-based measures generally improve on the results of most measures based on path length, with the highest values being 0.75 for coders and 0.60 for physicians. Their lowest correlation is 0.62 for coders, which outperforms most structure-based measures with the exception of $\mathrm{SemDist}_{A\&N}$ (0.66) and $\mathrm{sim}_{B\&S}$ (0.79). The coverage of biomedical terms in domain corpora is limited. Due to the high dependency on the availability of domain data, the accuracy of the IC-based similarity approaches is hampered.

The measure $\mathrm{sim}_{B\&S}$ shows good performance, considering both common and uncommon information of the evaluated concepts. However, when using it to compute semantic similarity, we should check whether the two concepts are the same; otherwise we would get an infinitely large similarity value for identical concepts.
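A short derivation, using only the definitions given with formulas (7), (14), and (16), shows where the unbounded value comes from and why the proposed distance does not need the same check. For $c_1 = c_2 = c$:

$$
\begin{aligned}
T(c_1) = T(c_2) \;&\Rightarrow\; |T(c_1) \cup T(c_2)| = |T(c_1) \cap T(c_2)| = |T(c)|, \\
\mathrm{sim}_{B\&S}(c, c) &= -\log_2 \frac{|T(c)| - |T(c)|}{|T(c)|} = -\log_2 0 \;\to\; +\infty, \\
\mathrm{SemD}(c, c) &= \log(0 \times \mathrm{ComSpec}(c, c) + 1) = \log 1 = 0.
\end{aligned}
$$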

With respect to the context vector measure, there are four cases with changing corpus size and corpus selection. From the results, we can see that the best correlations are 0.84 for the physicians and 0.75 for the coders under the condition of 1 million notes involving only the diagnostic section. That is, the correlation value with the context vector measure is higher than ours only if 1 million notes are used to create the vectors. Moreover, the data corpus used to create vectors was constructed by physicians of the same clinic, and so there were some limitations in the way of interpreting and formalizing knowledge. Thus, the context vector measure strongly depends on the amount and quality of the background corpus. The measure can show good performance only in the particular situation and domain. To obtain a higher correlation, the size and quality of the information sources must be chosen carefully. The results show that the accuracy of the approach decreases noticeably for other corpus configurations. For example, when 100,000 notes involving all sections are used, the correlations drop to 0.41 for physicians and 0.53 for coders.

Table 1: Dataset 1: 30 medical term pairs with physicians' and coders' similarity scores [11].

| Term 1 | Term 2 | Physician ratings (averaged) | Coder ratings (averaged) |
| --- | --- | --- | --- |
| Renal failure | Kidney failure | 4.0 | 4.0 |
| Heart | Myocardium | 3.3 | 3.0 |
| Stroke | Infarct | 3.0 | 2.8 |
| Abortion | Miscarriage | 3.0 | 3.3 |
| Delusion | Schizophrenia | 3.0 | 2.2 |
| Congestive heart failure | Pulmonary edema | 3.0 | 1.4 |
| Metastasis | Adenocarcinoma | 2.7 | 1.8 |
| Calcification | Stenosis | 2.7 | 2.0 |
| Diarrhea | Stomach cramps | 2.3 | 1.3 |
| Mitral stenosis | Atrial fibrillation | 2.3 | 1.3 |
| Chronic obstructive pulmonary disease | Lung infiltrates | 2.3 | 1.9 |
| Rheumatoid arthritis | Lupus | 2.0 | 1.1 |
| Brain tumor | Intracranial hemorrhage | 2.0 | 1.3 |
| Carpal tunnel syndrome | Osteoarthritis | 2.0 | 1.1 |
| Diabetes mellitus | Hypertension | 2.0 | 1.0 |
| Acne | Syringe | 2.0 | 1.0 |
| Antibiotic | Allergy | 1.7 | 1.2 |
| Cortisone | Total knee replacement | 1.7 | 1.0 |
| Pulmonary embolus | Myocardial infarction | 1.7 | 1.2 |
| Pulmonary fibrosis | Lung cancer | 1.7 | 1.4 |
| Cholangiocarcinoma | Colonoscopy | 1.3 | 1.0 |
| Lymphoid hyperplasia | Laryngeal cancer | 1.3 | 1.0 |
| Multiple sclerosis | Psychosis | 1.0 | 1.0 |
| Appendicitis | Osteoporosis | 1.0 | 1.0 |
| Rectal polyp | Aorta | 1.0 | 1.0 |
| Xerostomia | Alcoholic cirrhosis | 1.0 | 1.0 |
| Peptic ulcer disease | Myopia | 1.0 | 1.0 |
| Depression | Cellulitis | 1.0 | 1.0 |
| Varicose vein | Entire knee meniscus | 1.0 | 1.0 |
| Hyperlipidemia | Metastasis | 1.0 | 1.0 |

The correlations obtained by our measure are 0.67 for physicians and 0.77 for coders, while the average correlation between physicians is 0.68 and the average correlation between medical coders is 0.78. The proposed measure clearly improves on $\mathrm{SemDist}_{A\&N}$ (0.77 versus 0.66). The correlation for physicians is higher than that of the other measures except for the context vector measure, and the correlation for coders is higher than that of the other measures except for $\mathrm{sim}_{B\&S}$ (0.79). The correlation is 0.75 considering both sets of experts, which is rather high among all the measures mentioned in this paper. Our measure gets higher correlation values than all the IC-based measures shown in Table 3.

From Table 3, we can see that the correlation values for coders are always higher than those for physicians, except for the context vector measure, which suggests that medical coders' ratings, with more pretraining, are more reliable than physicians' ratings. As a result, many similarity measures are compared against the similarity scores of medical coders to obtain better correlations.

For Dataset 2, we can find 34 out of 36 concept pairs in SNOMED CT. We compare our measure with other structure-based similarity measures with respect to human scores. The correlation of the Al-Mubaid and Nguyen measure is 0.735, which is higher than that of the other existing measures. However, the correlation value obtained by our measure is 0.774. The comparative result (0.774 versus 0.735) shows that the proposed measure outperforms the other measures shown in Table 3.


Table 2: Dataset 2: 36 medical term pairs with averaged human similarity scores [6].

| Term 1 | Term 2 | Human |
| --- | --- | --- |
| Anemia | Appendicitis | 0.031 |
| Dementia | Atopic dermatitis | 0.062 |
| Bacterial pneumonia | Malaria | 0.156 |
| Osteoporosis | Patent ductus arteriosus | 0.156 |
| Amino acid sequence | Antibacterial agents | 0.156 |
| Acquired immunodeficiency syndrome | Congenital heart defects | 0.062 |
| Otitis media | Infantile colic | 0.156 |
| Meningitis | Tricuspid atresia | 0.031 |
| Sinusitis | Mental retardation | 0.031 |
| Hypertension | Kidney failure | 0.500 |
| Hyperlipidemia | Hyperkalemia | 0.156 |
| Hypothyroidism | Hyperthyroidism | 0.406 |
| Sarcoidosis | Tuberculosis | 0.406 |
| Vaccines | Immunity | 0.593 |
| Asthma | Pneumonia | 0.375 |
| Diabetic nephropathy | Diabetes mellitus | 0.500 |
| Lactose intolerance | Irritable bowel syndrome | 0.468 |
| Urinary tract infection | Pyelonephritis | 0.656 |
| Neonatal jaundice | Sepsis | 0.187 |
| Sickle cell anemia | Iron deficiency anemia | 0.437 |
| Psychology | Cognitive science | 0.593 |
| Adenovirus | Rotavirus | 0.437 |
| Migraine | Headache | 0.718 |
| Myocardial ischemia | Myocardial infarction | 0.750 |
| Hepatitis B | Hepatitis C | 0.562 |
| Carcinoma | Neoplasm | 0.750 |
| Pulmonary valve stenosis | Aortic valve stenosis | 0.531 |
| Failure to thrive | Malnutrition | 0.625 |
| Breast feeding | Lactation | 0.843 |
| Antibiotics | Antibacterial agents | 0.937 |
| Seizures | Convulsions | 0.843 |
| Pain | Ache | 0.875 |
| Malnutrition | Nutritional deficiency | 0.875 |
| Measles | Rubeola | 0.906 |
| Chicken pox | Varicella | 0.968 |
| Down syndrome | Trisomy 21 | 0.875 |

From the experimental results on the two datasets, we find that our measure performs much better than almost all the similarity measures, including the path-based measures and the IC-based measures. We make significant improvements to the measure of Al-Mubaid and Nguyen in the case of multi-inheritance, which exists in SNOMED CT.

In addition, we adopt the p value of the Pearson correlation coefficient as a measure of the significance of the relation between human ratings of similarity and computational values [30]. The smaller the p value is, the more significant the relation is. In all cases, the p values for our results on both Dataset 1 and Dataset 2 are less than 0.001 (a 0.1% chance), which indicates that the correlation values obtained using our measure are statistically significant.
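As a rough illustration of how such a correlation can be computed, the following Python sketch implements the Pearson coefficient directly from its definition. The function name `pearson_r` and the toy score lists are assumptions made for this example; they are not the paper's data, and the sketch does not compute the p value.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances (unnormalised; the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: human ratings versus scores from some measure.
human = [0.031, 0.500, 0.750, 0.968]
computed = [0.10, 0.45, 0.70, 0.95]
r = pearson_r(human, computed)
```

A value of `r` close to 1 would indicate that the computed scores rank and scale the term pairs much as the human raters do.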

In summary, our measure is based only on the ontology structure, and it provides comparatively high accuracy without any dependency on data preprocessing or corpus availability. Meanwhile, it keeps the simplicity of structure-based measures and overcomes their shortcoming that the multi-inheritance phenomenon is not fully considered. The measure nonlinearly combines the evaluated concepts' noncommon information, considering the different taxonomical hierarchies, with their common specificity feature, which yields a clear improvement over other measures. Note that multiple inheritance, in which concepts may be subsumed by several superconcepts, exists in the largest and most widely used knowledge sources, such as SNOMED CT, MeSH in the UMLS, or WordNet. In the experiments, our measure performs very well in ontologies with multi-inheritance. When the input ontology does not have multiple inheritance of concepts, the measure can also be accurate. Therefore, the accuracy and applicability of our approach are significant, especially in the biomedical domain.

6 Conclusions

In this paper, we propose a similarity measure that nonlinearly combines the evaluated concepts' noncommon information, considering the different taxonomical hierarchies, with their common specificity feature. The measure keeps its simplicity and requires no parameter tuning. In the experiments, we use SNOMED CT, a large and detailed ontology with multiple inheritance between concepts, as the input ontology. The experimental results show that our measure has rather high correlation values with respect to both physicians and coders, and the measure outperforms most approaches based on taxonomical structure, IC, and context vectors.

Measuring the semantic similarity of concepts within multiple ontologies has recently become more and more important. As future work, we will extend the measure to multiple ontologies such as SNOMED CT, MeSH, and WordNet.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.


Table 3: Correlations across measures for physicians, coders, and both on 29 pairs of Dataset 1, and correlations across measures for human scores on 34 pairs of Dataset 2, using SNOMED CT as the ontology.

| Measure | Dataset 1: Physicians | Dataset 1: Coders | Dataset 1: Both | Dataset 2: Human |
|---|---|---|---|---|
| Path length | 0.36 | 0.51 | 0.48 | 0.586 |
| Leacock and Chodorow | 0.35 | 0.50 | 0.47 | 0.677 |
| Wu and Palmer | NA | 0.29 | NA | 0.686 |
| Li et al. | NA | 0.37 | NA | 0.694 |
| Choi and Kim | NA | 0.15 | NA | 0.440 |
| Al-Mubaid and Nguyen (SemDist A&N) | NA | 0.66 | NA | 0.735 |
| Montserrat Batet et al. (Sim B&S) | 0.60 | 0.79 | 0.73 | NA |
| Resnik | 0.45 | 0.62 | 0.55 | NA |
| Lin | 0.60 | 0.75 | 0.69 | NA |
| Jiang and Conrath | 0.45 | 0.62 | 0.55 | NA |
| Context vector (1m notes, diagnostic section) | 0.84 | 0.75 | 0.76 | NA |
| Context vector (1m notes, all sections) | 0.62 | 0.68 | 0.69 | NA |
| Context vector (100000 notes, diagnostic section) | 0.56 | 0.59 | 0.60 | NA |
| Context vector (100000 notes, all sections) | 0.41 | 0.53 | 0.51 | NA |
| Our measure | 0.67 | 0.77 | 0.75 | 0.774 |

Acknowledgments

This paper was sponsored by the Jilin Provincial Science and Technology Department of China (Grant nos. 20130206041GX and 20120302), the Jilin Province Development and Reform Committee of China (Grant no. 2013C036-5[2013]779), the Changchun Science and Technology Bureau of China (Grant no. 14KT009), and the Doctoral Program of Higher Education of China (no. 20110043110011). The authors appreciate the help of Professor David Sánchez in their research. He provided them with useful information and friendly advice.

References

[1] S. Patwardhan, S. Banerjee, and T. Pedersen, "Using measures of semantic relatedness for word sense disambiguation," in Proceedings of the 4th International Conference on Computational Linguistics and Intelligent Text Processing, Mexico City, Mexico, February 2003, pp. 241–257, Springer, Berlin, Germany, 2003.

[2] A. Budanitsky and G. Hirst, "Evaluating wordnet-based measures of lexical semantic relatedness," Computational Linguistics, vol. 32, no. 1, pp. 13–47, 2006.

[3] R. L. Cilibrasi and P. M. B. Vitányi, "The google similarity distance," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 3, pp. 370–383, 2007.

[4] S. Aseervatham and Y. Bennani, "Semi-structured document categorization with a semantic kernel," Pattern Recognition, vol. 42, no. 9, pp. 2067–2076, 2009.

[5] J. Atkinson, A. Ferreira, and E. Aravena, "Discovering implicit intention-level knowledge from natural-language texts," Knowledge-Based Systems, vol. 22, no. 7, pp. 502–508, 2009.

[6] A. Hliaoutakis, "Semantic similarity measures in the MESH ontology and their application to information retrieval on Medline," Tech. Rep., Technical University of Crete (TUC), Department of Electronic & Computer Engineering, 2005.

[7] M. Stevenson and M. A. Greenwood, "A semantic approach to IE pattern induction," in Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics (ACL '05), pp. 379–386, Ann Arbor, Mich, USA, June 2005.

[8] D. Sánchez and A. Moreno, "Learning non-taxonomic relationships from web documents for domain ontology construction," Data & Knowledge Engineering, vol. 64, no. 3, pp. 600–623, 2008.

[9] D. Sánchez, Domain ontology learning from the web [Ph.D. thesis], VDM Publishing, Saarbrücken, Germany, 2008.

[10] F. M. Couto, M. J. Silva, and P. M. Coutinho, "Measuring semantic similarity between gene ontology terms," Data & Knowledge Engineering, vol. 61, no. 1, pp. 137–152, 2007.

[11] T. Pedersen, S. V. S. Pakhomov, S. Patwardhan, and C. G. Chute, "Measures of semantic similarity and relatedness in the biomedical domain," Journal of Biomedical Informatics, vol. 40, no. 3, pp. 288–299, 2007.

[12] C. Leacock and M. Chodorow, "Combining local context and WordNet similarity for word sense identification," in WordNet: An Electronic Lexical Database, C. Fellbaum, Ed., pp. 265–283, MIT Press, Cambridge, Mass, USA, 1998.

[13] R. Rada, H. Mili, E. Bicknell, and M. Blettner, "Development and application of a metric on semantic nets," IEEE Transactions on Systems, Man and Cybernetics, vol. 19, no. 1, pp. 17–30, 1989.

[14] Z. Wu and M. Palmer, "Verb semantics and lexical selection," in Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL '94), pp. 133–138, ACM, Las Cruces, NM, USA, 1994.

[15] J. Jiang and D. Conrath, "Semantic similarity based on corpus statistics and lexical taxonomy," in Proceedings of the International Conference on Research in Computational Linguistics, pp. 19–33, Taipei, Taiwan, 1997.

[16] D. Lin, "An information-theoretic definition of similarity," in Proceedings of the 15th International Conference on Machine Learning (ICML '98), pp. 296–304, Morgan Kaufmann, Madison, Wis, USA, 1998.

[17] P. Resnik, "Using information content to evaluate semantic similarity in a taxonomy," in Proceedings of the 14th International Joint Conference on Artificial Intelligence, pp. 448–453, Montreal, Canada, 1995.

[18] S. Patwardhan, Incorporating dictionary and corpus information into a context vector measure of semantic relatedness [M.S. Science thesis], Department of Computer Science, University of Minnesota, Duluth, Minn, USA, 2003.

[19] S. Patwardhan and T. Pedersen, "Using WordNet-based context vectors to estimate the semantic relatedness of concepts," in Proceedings of the EACL Workshop, pp. 1–8, Trento, Italy, 2006.

[20] C. Fellbaum, WordNet: An Electronic Lexical Database, MIT Press, Cambridge, Mass, USA, 1998.

[21] K. A. Spackman, "SNOMED CT milestones: endorsements are added to already-impressive standards credentials," Healthcare Informatics, vol. 21, no. 9, pp. 54–56, 2004.

[22] R. Kleinsorge, C. Tilley, and J. Willis, Unified Medical Language System (UMLS) Basics, 2000, http://www.nlm.nih.gov/research/umls/pdf/UMLS_Basics.pdf.

[23] S. J. Nelson, D. Johnston, and B. L. Humphreys, "Relationships in medical subject headings," in Relationships in the Organization of Knowledge, pp. 171–184, K. A. Publishers, 2001.

[24] H. Al-Mubaid and H. A. Nguyen, "A cluster-based approach for semantic similarity in the biomedical domain," in Proceedings of the 28th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBS '06), pp. 2713–2717, IEEE, New York, NY, USA, August-September 2006.

[25] M. Batet, D. Sánchez, and A. Valls, "An ontology-based measure to compute semantic similarity in biomedicine," Journal of Biomedical Informatics, vol. 44, no. 1, pp. 118–125, 2011.

[26] G. Pirró, "A semantic similarity metric combining features and intrinsic information content," Data and Knowledge Engineering, vol. 68, no. 11, pp. 1289–1308, 2009.

[27] Y. Li, Z. A. Bandar, and D. McLean, "An approach for measuring semantic similarity between words using multiple information sources," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 4, pp. 871–882, 2003.

[28] A. Burgun and O. Bodenreider, "Comparing terms, concepts and semantic classes in WordNet and the unified medical language system," in Proceedings of the NAACL Workshop on WordNet and Other Lexical Resources: Applications, Extensions and Customizations, pp. 77–82, Pittsburgh, Pa, USA, June 2001.

[29] E. Brill, "Processing natural language without natural language processing," in Proceedings of the 4th International Conference on Computational Linguistics and Intelligent Text Processing, pp. 360–369, Mexico City, Mexico, 2003.

[30] G. Miller, C. Leacock, R. Tengi, and R. T. Bunker, "A semantic concordance," in Proceedings of the ARPA Workshop on Human Language Technology, Association for Computational Linguistics, pp. 303–308, Princeton, NJ, USA, March 1993.

[31] D. Sánchez, M. Batet, and D. Isern, "Ontology-based information content computation," Knowledge-Based Systems, vol. 24, no. 2, pp. 297–303, 2011.

[32] H. Rubenstein and J. B. Goodenough, "Contextual correlates of synonymy," Communications of the ACM, vol. 8, no. 10, pp. 627–633, 1965.

[33] G. A. Miller and W. G. Charles, "Contextual correlates of semantic similarity," Language and Cognitive Processes, vol. 6, no. 1, pp. 1–28, 1991.



the most intuitive and easy to implement, they suffer from the limitation that they require consistent and rich ontologies to work properly; (2) measures utilizing the information content (IC) of concepts, which are methods exploiting the notion of IC, defined as a measure of the semantic information of concepts and computed by counting the occurrence of words in large corpora [15–17]; their shortcomings are that it is necessary to perform time-consuming analysis of corpora and that the IC values depend on the considered corpora; (3) measures using the amount of co-occurrences between word contexts, which are approaches constructing context vectors of concepts by extracting contextual words (within a fixed window of context) from a corpus of textual documents including the evaluated concepts, and computing the similarity of concepts as the cosine of the angle between their context vectors [11, 18, 19]. Similar to the methods mentioned in category (2), the availability and suitability of corpora affect the applicability of these measures.

Usually, these measures obtain good performance when we employ large, general-purpose knowledge bases like WordNet [20]. Some of them have been applied to the biomedical field using domain information extracted from clinical data or relevant medical ontologies, such as SNOMED CT (https://uts.nlm.nih.gov/home.html) [21, 22] or MeSH (http://www.nlm.nih.gov/mesh/meshhome.html) [22, 23] in the Unified Medical Language System (UMLS) [22]. Several authors have compared these measures, analyzing and evaluating them over certain datasets to determine their advantages and limitations with respect to the background knowledge source [24–26].

In this paper, we first review and investigate different measures for semantic similarity computation. Then, we propose a new measure considering the multiple inheritance in ontologies and the common specificity feature of the evaluated concepts in order to obtain a more accurate similarity between concepts. Finally, we evaluate the proposed measure using two datasets of biomedical term pairs scored for similarity by human experts, exploiting SNOMED CT as the input ontology. We compare the correlation obtained by our measure with human scores against that of other measures. The experimental evaluations confirm the efficiency of the proposed measure.

The rest of the paper is organized as follows. Section 2 investigates the basic methods for semantic similarity, including the taxonomy-based measures, the IC-based measures, and the context vector measures. Section 3 presents the proposed measure for semantic similarity and its main advantages. Section 4 evaluates and compares the measure against the analyzed measures using SNOMED CT as the input ontology. Section 5 analyzes and discusses the experimental results. The final section presents the conclusions of the paper.

2 Existing Measures for Computing Semantic Similarity

The existing measures for semantic similarity are discussed as follows.

2.1 Measures Based on the Taxonomical Structure. The simplest way of computing similarity for concepts is the measure based on path length developed by Rada et al. [13]. The measure quantifies the shortest distance between the two concept nodes $c_1$ and $c_2$:

$$\mathrm{dis}(c_1, c_2) = N_1 + N_2 \tag{1}$$

where $N_1$ and $N_2$ stand for the minimum number of is-a links from $c_1$ and $c_2$ to their LCS, respectively.
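The path-length distance can be sketched in Python as follows. The toy taxonomy `PARENTS`, its concept names, and the helper names are hypothetical, and the sketch assumes that, under multiple inheritance, the LCS is taken to be the common subsumer that minimises $N_1 + N_2$.

```python
from collections import deque

# Hypothetical toy taxonomy: each concept maps to its direct is-a parents.
PARENTS = {
    "root": [],
    "disease": ["root"],
    "infection": ["disease"],
    "lung_disease": ["disease"],
    "pneumonia": ["infection", "lung_disease"],  # multiple inheritance
    "asthma": ["lung_disease"],
}

def ancestor_distances(concept, parents):
    """Minimum number of is-a links from `concept` to each of its subsumers.

    The concept itself is included at distance 0.
    """
    dist = {concept: 0}
    queue = deque([concept])
    while queue:
        node = queue.popleft()
        for parent in parents[node]:
            if parent not in dist:  # breadth-first, so first visit is shortest
                dist[parent] = dist[node] + 1
                queue.append(parent)
    return dist

def path_distance(c1, c2, parents):
    """dis(c1, c2) = N1 + N2, minimised over all common subsumers."""
    d1 = ancestor_distances(c1, parents)
    d2 = ancestor_distances(c2, parents)
    common = d1.keys() & d2.keys()
    if not common:
        raise ValueError("concepts share no common subsumer")
    return min(d1[a] + d2[a] for a in common)
```

In this toy taxonomy, `path_distance("pneumonia", "asthma", PARENTS)` goes through `lung_disease`, one is-a link from each concept.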

Wu and Palmer [14] introduced a measure based on path length that also takes into account the depth of the concepts in the hierarchy. It is based on the assumption that concepts lower down in the taxonomy are more similar than those higher up:

$$\mathrm{sim}_{W\&P}(c_1, c_2) = \frac{2 \times N_3}{N_1 + N_2 + 2 \times N_3} \tag{2}$$

where $N_3$ is the number of is-a relations from the LCS of the evaluated concepts to the root of the ontology. The similarity value ranges from 1 (for identical concepts) to 0.
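A minimal Python sketch of the Wu and Palmer formula is given below. The function name `sim_wup` and the choice to pass the three link counts directly are assumptions for illustration; in practice the counts would come from an ontology traversal.

```python
def sim_wup(n1, n2, n3):
    """Wu and Palmer similarity from is-a link counts.

    n1, n2: links from each concept to their LCS.
    n3: links from the LCS to the root of the ontology.
    """
    total = n1 + n2 + 2 * n3
    if total == 0:
        # Both concepts are the root itself; the ratio is undefined.
        raise ValueError("similarity is undefined when both concepts are the root")
    return 2 * n3 / total
```

For example, two identical concepts three links below the root give `sim_wup(0, 0, 3) == 1.0`, while two concepts whose LCS is the root give 0.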

Leacock and Chodorow [12] proposed a measure for similarity in which the shortest path length between two concepts is scaled by twice the maximum depth $D$ of the hierarchy:

$$\mathrm{sim}_{L\&C}(c_1, c_2) = -\log\left(\frac{N_1 + N_2 + 1}{2 \times D}\right) \tag{3}$$
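The Leacock and Chodorow formula can be sketched in Python as below. The function name `sim_lc` is hypothetical, and the natural logarithm is an assumption, since the base of the logarithm is not stated here.

```python
import math

def sim_lc(n1, n2, max_depth):
    """Leacock and Chodorow similarity from is-a link counts.

    n1, n2: links from each concept to their LCS.
    max_depth: maximum depth D of the hierarchy (must be positive).
    Natural logarithm is assumed; the base only rescales the scores.
    """
    if max_depth <= 0:
        raise ValueError("max_depth must be positive")
    return -math.log((n1 + n2 + 1) / (2 * max_depth))
```

Shorter paths give larger values: with `max_depth=5`, identical concepts score `log(10)`, and concepts two links apart score `log(10/3)`.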

There are also other structure-based measures for semantic similarity. Li et al. [27] developed a measure combining the depth of the ontology and the shortest path:

$$\mathrm{sim}_{Li}(c_1, c_2) = e^{-\alpha\,\mathrm{path}(c_1, c_2)} \cdot \frac{e^{\beta h} - e^{-\beta h}}{e^{\beta h} + e^{-\beta h}} \tag{4}$$

where $h$ is the minimum depth of the LCS in the hierarchy, $\mathrm{path}(c_1, c_2)$ is the shortest path between the two concepts, and $\alpha \ge 0$ and $\beta > 0$ stand for the contribution of the shortest path and the depth, respectively. The optimal parameters for the measure were $\alpha = 0.2$ and $\beta = 0.6$.
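The measure of Li et al. can be sketched in Python as follows. The function name `sim_li` is hypothetical; the sketch relies on the identity that the depth factor in (4) equals the hyperbolic tangent of $\beta h$.

```python
import math

def sim_li(path_len, lcs_depth, alpha=0.2, beta=0.6):
    """Li et al. similarity.

    path_len: shortest path length between the two concepts.
    lcs_depth: depth h of their LCS in the hierarchy.
    alpha, beta: defaults are the optimal values reported in the text.
    """
    if alpha < 0 or beta <= 0:
        raise ValueError("requires alpha >= 0 and beta > 0")
    # (e^{bh} - e^{-bh}) / (e^{bh} + e^{-bh}) is tanh(b * h).
    return math.exp(-alpha * path_len) * math.tanh(beta * lcs_depth)
```

The score decays exponentially with path length and saturates toward 1 as the LCS gets deeper, so an LCS at the root (depth 0) yields a similarity of 0.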

Al-Mubaid and Nguyen [24] proposed a cluster-based measure combining path length and common specificity, which considers the depth of the LCS of two concepts and the depth $D$ of the ontology. They defined the clusters as the branches of the ontology with respect to the root node. The common specificity of concepts $c_1$ and $c_2$ is defined as follows:

$$\mathrm{CSpec}(c_1, c_2) = D_c - \mathrm{Depth}(\mathrm{LCS}(c_1, c_2)) \tag{5}$$

where $D_c$ is the depth of the cluster including concepts $c_1$ and $c_2$. Thus, the $\mathrm{CSpec}(c_1, c_2)$ feature determines the "common specificity" of two concepts in the cluster. The smaller the common specificity value of two concept nodes is, the more information they share, and thus the more similar they are. The semantic distance measure is defined as follows:

$$\mathrm{SemDist}_{A\&N}(c_1, c_2) = \log\left((\mathrm{Path} - 1)^{\alpha} \times (\mathrm{CSpec})^{\beta} + k\right) \tag{6}$$
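Equations (5) and (6) can be sketched in Python as below. The function names are hypothetical, and the default values $\alpha = \beta = k = 1$ are assumptions chosen for illustration, as is the convention that the path length is at least 1 so that identical concepts yield a distance of 0.

```python
import math

def cspec(cluster_depth, lcs_depth):
    """Common specificity: cluster depth D_c minus the depth of the LCS."""
    return cluster_depth - lcs_depth

def sem_dist_an(path, cluster_depth, lcs_depth, alpha=1.0, beta=1.0, k=1.0):
    """Semantic distance combining path length and common specificity.

    path: path length between the concepts (assumed >= 1 here).
    alpha, beta, k: illustrative defaults, not values taken from the text.
    """
    spec = cspec(cluster_depth, lcs_depth)
    if path < 1 or spec < 0:
        raise ValueError("requires path >= 1 and lcs_depth <= cluster_depth")
    return math.log((path - 1) ** alpha * spec ** beta + k)
```

With these defaults, a deeper LCS lowers the distance: `sem_dist_an(3, 10, 8)` is smaller than `sem_dist_an(3, 10, 4)`.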




$$\mathrm{sim}_{B\&S}(c_1, c_2) = -\log_2 \frac{|T(c_1) \cup T(c_2)| - |T(c_1) \cap T(c_2)|}{|T(c_1) \cup T(c_2)|} \tag{7}$$

where $T(c_i) = \{c_j \in C \mid c_j \text{ is a superconcept of } c_i\} \cup \{c_i\}$ and $C$ …
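A set-based measure of this form can be sketched in Python as follows. The toy taxonomy and the function names are hypothetical, and returning infinity for identical superconcept sets (where the ratio is 0) is a choice made for this sketch only.

```python
import math

# Hypothetical toy taxonomy: each concept maps to its direct is-a parents.
PARENTS = {
    "root": [],
    "disease": ["root"],
    "infection": ["disease"],
    "lung_disease": ["disease"],
    "pneumonia": ["infection", "lung_disease"],  # multiple inheritance
    "asthma": ["lung_disease"],
}

def superconcepts(concept, parents):
    """T(c): every superconcept reachable over is-a links, plus c itself."""
    seen = {concept}
    stack = [concept]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def sim_bs(c1, c2, parents):
    """-log2 of the share of non-common superconcepts among all superconcepts."""
    t1 = superconcepts(c1, parents)
    t2 = superconcepts(c2, parents)
    union = t1 | t2
    non_common = len(union) - len(t1 & t2)
    if non_common == 0:
        return math.inf  # identical superconcept sets; log2(0) is unbounded
    return -math.log2(non_common / len(union))
```

Because every is-a parent is followed, a concept with several parents contributes all of its superconcepts, which is how multiple inheritance enters the score.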






$$\mathrm{IC}(c) = -\log p(c) \tag{8}$$

$$p(c) = \frac{\sum_{n \in w(c)} \mathrm{count}(n)}{N} \tag{9}$$

$$\mathrm{sim}_{res}(c_1, c_2) = \mathrm{IC}(\mathrm{LCS}(c_1, c_2)) \tag{10}$$
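The corpus-based IC computation can be sketched in Python as follows. The counts, the mapping `SUBSUMED`, and the function names are hypothetical; the sketch assumes that $w(c)$ is the set of terms subsumed by concept $c$, that $N$ is the total number of observed terms, and that the logarithm is natural.

```python
import math

# Hypothetical corpus counts per term.
COUNTS = {"disease": 10, "pneumonia": 6, "asthma": 4}

# Assumed w(c): the terms subsumed by each concept, including itself.
SUBSUMED = {
    "disease": {"disease", "pneumonia", "asthma"},
    "pneumonia": {"pneumonia"},
    "asthma": {"asthma"},
}

TOTAL = sum(COUNTS.values())  # N: total number of observed terms

def probability(concept):
    """p(c): share of corpus occurrences that fall under the concept."""
    return sum(COUNTS[n] for n in SUBSUMED[concept]) / TOTAL

def information_content(concept):
    """IC(c) = -log p(c); general concepts carry little information."""
    return -math.log(probability(concept))

def sim_resnik(lcs):
    """Resnik similarity is the IC of the LCS, supplied here by the caller."""
    return information_content(lcs)
```

In this toy corpus the most general concept covers every occurrence, so its IC is 0, and any pair whose LCS is that concept gets a Resnik similarity of 0.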




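$$\mathrm{sim}_{lin}(c_1, c_2) = \frac{2 \times \mathrm{IC}(\mathrm{LCS}(c_1, c_2))}{\mathrm{IC}(c_1) + \mathrm{IC}(c_2)} \tag{11}$$

$$\mathrm{dis}_{J\&C}(c_1, c_2) = \mathrm{IC}(c_1) + \mathrm{IC}(c_2) - 2 \times \mathrm{IC}(\mathrm{LCS}(c_1, c_2)) \tag{12}$$

$$\cos(v_1, v_2) = \frac{v_1 \cdot v_2}{|v_1| \cdot |v_2|} \tag{13}$$

where $v_1$ and $v_2$ …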












… $(c_i)$ is … $c_i$ itself.

$$\mathrm{NonComSub}(c_1, c_2) = |N(c_1) \cup N(c_2)| - |N(c_1) \cap N(c_2)| \tag{14}$$
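The count in (14) can be sketched in Python as below, under the assumption that $N(c)$ denotes the set of all superconcepts of $c$ together with $c$ itself. The toy taxonomy and the function names are hypothetical.

```python
# Hypothetical toy taxonomy: each concept maps to its direct is-a parents.
PARENTS = {
    "root": [],
    "disease": ["root"],
    "infection": ["disease"],
    "lung_disease": ["disease"],
    "pneumonia": ["infection", "lung_disease"],  # multiple inheritance
    "asthma": ["lung_disease"],
}

def superconcepts(concept, parents):
    """Assumed N(c): all superconcepts over every is-a path, plus c itself."""
    seen = {concept}
    stack = [concept]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def non_com_sub(c1, c2, parents):
    """|N(c1) U N(c2)| - |N(c1) n N(c2)|: number of non-common superconcepts."""
    n1 = superconcepts(c1, parents)
    n2 = superconcepts(c2, parents)
    return len(n1 | n2) - len(n1 & n2)
```

The result equals the size of the symmetric difference of the two sets, so identical concepts give 0 and every additional unshared parent branch raises the count.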


… $c_1$ and $c_2$ is 3 in …

$$\ldots\,(c_1, c_2)) \tag{15}$$



[Figure: concept nodes C1, C2, C3, C4, and C5]





