A Study of Hybrid Similarity Measures for Semantic Relation Extraction

Intelligent Database Systems Lab

Presenter : BEI-YI JIANG

Authors : UNIVERSITÉ CATHOLIQUE DE LOUVAIN, BELGIUM

2012. ASSOCIATION FOR COMPUTING MACHINERY


Outline

• Motivation
• Objectives
• Methodology
• Experiments
• Conclusions
• Comments


Motivation

• The quality of the relations provided by existing extractors is still lower than the quality of the manually constructed relations.

• Most studies still do not take the whole range of existing measures into account; they combine different methods only sporadically.


Objectives

• To develop new relation extraction methods.

• The approach is a systematic analysis of 16 baseline measures and of their combinations, using 8 fusion methods and 3 techniques for selecting the combination set.


Methodology

• norm function

• similarity scores

• knn function
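The slide names the norm and knn functions but their definitions did not survive. As a minimal sketch, assuming that norm is a min-max rescaling of a similarity matrix to [0, 1] and that knn keeps the k highest-scoring neighbours of each term as extracted relations (both definitions, and the names `norm`, `knn`, and `k`, are assumptions made for illustration):

```python
import numpy as np

def norm(S):
    """Min-max rescale a similarity matrix to [0, 1] (assumed definition)."""
    S = np.asarray(S, dtype=float)
    lo, hi = S.min(), S.max()
    if hi == lo:
        return np.zeros_like(S)
    return (S - lo) / (hi - lo)

def knn(S, k):
    """Return pairs (i, j) where j is among the k terms most similar to i."""
    S = np.asarray(S, dtype=float)
    relations = set()
    for i in range(S.shape[0]):
        scores = S[i].copy()
        scores[i] = -np.inf  # a term is not its own neighbour
        order = np.argsort(-scores, kind="stable")
        for j in order[:k]:
            relations.add((i, int(j)))
    return relations

# Hypothetical 3-term similarity matrix.
S = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.3],
              [0.1, 0.3, 1.0]])
relations = knn(norm(S), k=1)
```

Here each term contributes exactly k relations; a global threshold on the scores would be an equally plausible reading.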


Methodology-Single Similarity Measures

• Measures Based on a Semantic Network (5)
– exploit the lengths of the shortest paths between terms in a network
– probability of terms derived from a corpus
– Wu and Palmer, Leacock and Chodorow, Resnik, Jiang and Conrath, and Lin
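To make the path-based idea concrete, here is a small sketch of the Wu and Palmer measure, 2·depth(lcs) / (depth(a) + depth(b)), on a hypothetical toy taxonomy. The `PARENT` dictionary and all helper names are illustrative assumptions, with the root counted as depth 1; a real setup would query a semantic network such as WordNet instead.

```python
# Hypothetical toy taxonomy: child -> parent (the root has parent None).
PARENT = {
    "entity": None,
    "animal": "entity",
    "plant": "entity",
    "dog": "animal",
    "cat": "animal",
    "oak": "plant",
}

def path_to_root(term):
    """List of nodes from the term up to the root, inclusive."""
    path = [term]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path

def depth(term):
    """Number of nodes on the path to the root (root has depth 1)."""
    return len(path_to_root(term))

def lowest_common_subsumer(a, b):
    """Deepest node that is an ancestor of (or equal to) both terms."""
    ancestors_a = set(path_to_root(a))
    for node in path_to_root(b):  # walks upward from b
        if node in ancestors_a:
            return node
    raise ValueError("terms do not share a root")

def shortest_path_length(a, b):
    """Number of edges between two terms in the tree."""
    lcs = lowest_common_subsumer(a, b)
    return depth(a) + depth(b) - 2 * depth(lcs)

def wu_palmer(a, b):
    """Wu and Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    lcs = lowest_common_subsumer(a, b)
    return 2.0 * depth(lcs) / (depth(a) + depth(b))
```

In this toy tree, `wu_palmer("dog", "cat")` is higher than `wu_palmer("dog", "oak")` because the first pair shares a deeper common ancestor.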


• Web-based Measures (3)
– Web search engines
– rely on the number of times the terms co-occur in the documents
– Normalized Google Distance (NGD)
– Measures of Semantic Relatedness (MSR)
– YAHOO!, BING, GOOGLE over the domain wikipedia.org
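The co-occurrence idea behind NGD can be sketched from hit counts alone. The function below implements the standard NGD formula; the argument names and the handling of zero counts are assumptions, and the slide does not say how the distance is turned into a similarity score.

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from hit counts.

    fx, fy : number of documents containing each term
    fxy    : number of documents containing both terms
    n      : total number of indexed documents
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return float("inf")  # assumed convention when a count is zero
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))
```

Terms that always occur together get a distance of 0, and the distance grows as the joint count shrinks relative to the individual counts.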



• Corpus-based Measures (5)
– Distributional Measures
› Bag-of-words Distributional Analysis (BDA)
› Syntactic Distributional Analysis (SDA)
– Pattern-based Measure
› PatternWiki
– Other Corpus-based Measures
› Latent Semantic Analysis (LSA)
› Normalized Google Distance (NGD)



• Definition-based Measures (3)
– WktWiki
– Gloss Vectors
– Extended Lesk



Methodology- Hybrid Similarity Measures

• Combination Methods
– Input: a set of similarity matrices {S1, . . . , SK} produced by K single measures
– Output: a combined similarity matrix Scmb
› 1. Mean
› 2. Mean-Nnz
› 3. Mean-Zscore
› 4. Median
› 5. Max
› 6. Rank Fusion
› 7. Relation Fusion
› 8. Logit


• Combination Methods
– Mean. A mean of K pairwise similarity scores.
– Mean-Nnz. A mean of those pairwise similarity scores which have a non-zero value.
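The two averaging rules can be sketched with NumPy as follows. The function names are illustrative, and returning 0 for a pair on which every measure scores zero is an assumption.

```python
import numpy as np

def combine_mean(matrices):
    """Element-wise mean of K similarity matrices."""
    return np.mean(np.stack(matrices), axis=0)

def combine_mean_nnz(matrices):
    """Element-wise mean over the non-zero scores only."""
    stack = np.stack(matrices).astype(float)
    counts = (stack != 0).sum(axis=0)   # how many measures scored each pair
    totals = stack.sum(axis=0)          # zeros add nothing to the sum
    # Pairs with no non-zero score keep the value 0 (assumed convention).
    return np.divide(totals, counts, out=np.zeros_like(totals), where=counts > 0)
```

Mean-Nnz avoids penalising a pair just because some measure has no coverage for it, which Mean would treat as a genuine score of zero.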



• Combination Methods
– Mean-Zscore. A mean of K similarity scores transformed into Z-scores.
– Median. A median of K pairwise similarities.
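A rough sketch of these two rules, assuming each measure is standardised with the mean and standard deviation taken over all entries of its own matrix (the slide does not state over which scores the Z-score statistics are computed):

```python
import numpy as np

def combine_mean_zscore(matrices):
    """Mean of the K matrices after converting each one to Z-scores."""
    standardised = []
    for S in matrices:
        S = np.asarray(S, dtype=float)
        sd = S.std()
        # A constant matrix carries no information; map it to zeros.
        standardised.append((S - S.mean()) / sd if sd > 0 else np.zeros_like(S))
    return np.mean(np.stack(standardised), axis=0)

def combine_median(matrices):
    """Element-wise median of K similarity matrices."""
    return np.median(np.stack(matrices), axis=0)
```

Standardising first puts measures with different score ranges on a common scale before averaging.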



• Combination Methods
– Max. A maximum of K pairwise similarities.
– Rank Fusion.
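Max follows directly from its description. The slide gives no definition for Rank Fusion, so the sketch below assumes one plausible reading: each score is replaced by its neighbour rank within its row (rank 1 is the most similar term), and the combined value is the mean of those ranks, so lower values mean closer terms. All function names are illustrative.

```python
import numpy as np

def combine_max(matrices):
    """Element-wise maximum of K similarity matrices."""
    return np.max(np.stack(matrices), axis=0)

def row_ranks(S):
    """Replace each score by its rank within its row (1 = most similar)."""
    S = np.asarray(S, dtype=float)
    order = np.argsort(-S, axis=1, kind="stable")
    ranks = np.empty_like(order)
    rows = np.arange(S.shape[0])[:, None]
    ranks[rows, order] = np.arange(1, S.shape[1] + 1)
    return ranks

def combine_rank_fusion(matrices):
    """Mean of per-measure neighbour ranks (assumed definition)."""
    return np.mean(np.stack([row_ranks(S) for S in matrices]), axis=0)
```

Working on ranks rather than raw scores makes the combination insensitive to the scale of each measure.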



• Combination Methods
– Relation Fusion.
– Logit.
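Neither method is defined on the slide, so the following is only a sketch under stated assumptions. Relation Fusion is read here as: each measure nominates its k nearest neighbours per term, encoded as a 0/1 adjacency matrix, and the combined score is the share of measures that nominate the pair. Logit is read as a logistic function of a weighted sum of the K scores; the `weights` and `bias` arguments are hypothetical placeholders for coefficients that would be learnt from labelled relation pairs.

```python
import numpy as np

def knn_adjacency(S, k):
    """0/1 matrix marking the k most similar terms in each row."""
    S = np.asarray(S, dtype=float).copy()
    np.fill_diagonal(S, -np.inf)  # exclude self-relations
    top = np.argsort(-S, axis=1, kind="stable")[:, :k]
    A = np.zeros(S.shape)
    A[np.arange(S.shape[0])[:, None], top] = 1.0
    return A

def combine_relation_fusion(matrices, k):
    """Share of measures whose top-k neighbours include each pair (assumed)."""
    return np.mean(np.stack([knn_adjacency(S, k) for S in matrices]), axis=0)

def combine_logit(matrices, weights, bias):
    """Logistic function of a weighted sum of the K scores (assumed form)."""
    stack = np.stack(matrices).astype(float)
    z = bias + np.tensordot(weights, stack, axes=1)
    return 1.0 / (1.0 + np.exp(-z))
```

Unlike the unsupervised rules, the Logit reading needs training data, which is why its weights appear as inputs here rather than being computed.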



• Combination Sets
– Expert choice of measures
– Forward stepwise procedure
– Logistic regression
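A forward stepwise procedure can be sketched as a greedy loop: start from an empty set and keep adding the measure that most improves a quality score until no addition helps. The sketch below assumes the candidate set is combined with a plain mean and that `score_fn` is a caller-supplied quality function where higher is better (for example, agreement with a gold standard); both choices are illustrative assumptions.

```python
import numpy as np

def forward_stepwise(matrices, score_fn):
    """Greedily pick measures whose mean-combination maximises score_fn."""
    selected, best_score = [], -np.inf
    remaining = list(range(len(matrices)))
    while remaining:
        # Score every candidate extension of the current set.
        trial = {}
        for idx in remaining:
            chosen = [matrices[i] for i in selected + [idx]]
            trial[idx] = score_fn(np.mean(np.stack(chosen), axis=0))
        best_idx = max(trial, key=trial.get)
        if trial[best_idx] <= best_score:
            break  # no candidate improves the score; stop
        selected.append(best_idx)
        best_score = trial[best_idx]
        remaining.remove(best_idx)
    return selected, best_score
```

The procedure returns the indices of the selected measures in the order they were added, together with the final score.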



Experiments

• Evaluation
– Human Judgements Datasets
› MC, RG, WordSim353
– Semantic Relations Datasets
› BLESS, SN



Conclusions

• The results have shown that the hybrid measures outperform the single measures on all datasets.

• A combination of 15 baseline corpus-, web-, network-, and dictionary-based measures with Logistic Regression provided the best results.


Comments

• Advantages
– higher performance

• Applications
