
Semantic Similarity Measures for Semantic Relation Extraction

Alexander Panchenko
Center for Natural Language Processing (CENTAL)

Université catholique de Louvain – [email protected]

September 21, 2012



Plan

Introduction

Pattern-Based Similarity Measures

Hybrid Semantic Similarity Measures



Semantic Similarity Measures

1. A similarity measure $s_{ij} = \mathrm{sim}(c_i, c_j) \to [0, 1]$
• $c_i, c_j$ – terms
• $s_{ij}$ – high for semantic relations $\langle c_i, c_j \rangle$
  • synonyms, hyponyms, co-hyponyms
• $s_{ij}$ – low for other pairs $\langle c_i, c_j \rangle$

2. Semantic similarity measures are useful for NLP/IR:
• WSD (Patwardhan et al., 2003)
• Query Expansion (Hsu et al., 2006)
• QA (Sun et al., 2005)
• Text Categorization (Tikk et al., 2003)
• Text Similarity (Saric et al., 2012)


State of the Art

• WordNet-based measures
  • WuPalmer (1994), LeacockChodorow (1998), Resnik (1995)
  • rely on manually crafted resources
  • highest precision, limited coverage
• Dictionary-based measures
  • ExtendedLesk (Banerjee and Pedersen, 2003), GlossVectors (Patwardhan and Pedersen, 2006) and WiktionaryOverlap (Zesch et al., 2008)
  • rely on manually crafted resources
  • high precision, limited coverage
• Corpus-based measures
  • ContextWindow (Van de Cruys, 2010), SyntacticContext (Lin, 1998), LSA (Landauer et al., 1998)
  • no semantic resources are needed
  • low precision, high recall
• Combined, e.g. WikiRelate! (Strube and Ponzetto, 2006) . . .


Plan

Introduction

Pattern-Based Similarity Measures
• Introduction
• Lexico-Syntactic Patterns
• Semantic Similarity Measures
• Results
• Conclusion

Hybrid Semantic Similarity Measures
• Introduction
• Features: Single Similarity Measures
• Hybrid Similarity Measures
• Results
• Conclusion


Introduction

Reference Paper

• Panchenko A., Morozova O., Naets H. “A Semantic Similarity Measure Based on Lexico-Syntactic Patterns”. In Proceedings of KONVENS 2012, pp. 174–178, 2012

Try a Demo

• http://serelex.cental.be/



Lexico-Syntactic Patterns




General architecture

• 6 classical Hearst (1992) patterns
• 12 further patterns
• extracting hypernyms, co-hyponyms and synonyms

The main transducer

• A cascade of FSTs
• Unitex

The 2nd pattern

• Allow for language variation, preserving precision
• Compare to surface-based patterns (Bollegala et al., 2007)

Explicit extraction rules

• positive/negative contexts,
• dictionaries,
• insertions of adjectives, . . .

Patterns are applied to corpora

• No preprocessing is needed
• 250Mb blocks
• 1 block ≈ 1 hour @ Intel i5 [email protected]



Patterns extract concordances

• such diverse {[occupations]} as {[doctors]}, {[engineers]} and {[scientists]}[PATTERN=1]
• such {non-alcoholic [sodas]} as {[root beer]} and {[cream soda]}[PATTERN=1]
• {traditional [food]}, such as {[sandwich]}, {[burger]}, and {[fry]}[PATTERN=2]

Number of concordances:

• WaCkypedia – 1,196,468
• ukWaC – 2,227,025
• WaCkypedia + ukWaC – 3,423,493
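The annotated concordances can be turned into pair counts (the quantity the reranking slides call $e_{ij}$). The following is a minimal sketch, assuming the bracket markup shown above; the regular expressions and the helper names `parse_concordance` and `count_pairs` are illustrative and are not taken from the PatternSim code.

```python
import re
from collections import Counter
from itertools import combinations

# A term is written as [term] inside a {...} noun-phrase bracket.
TERM_RE = re.compile(r"\{[^{}\[\]]*\[([^\[\]]+)\][^{}]*\}")
PATTERN_RE = re.compile(r"\[PATTERN=(\d+)\]")


def parse_concordance(line):
    """Return (terms, pattern_id) for one annotated concordance line."""
    terms = [t.strip() for t in TERM_RE.findall(line)]
    match = PATTERN_RE.search(line)
    pattern_id = int(match.group(1)) if match else None
    return terms, pattern_id


def count_pairs(lines):
    """Count in how many concordances each unordered term pair co-occurs."""
    e = Counter()
    for line in lines:
        terms, _ = parse_concordance(line)
        for a, b in combinations(sorted(set(terms)), 2):
            e[(a, b)] += 1
    return e


concordances = [
    "such diverse {[occupations]} as {[doctors]}, {[engineers]} and {[scientists]}[PATTERN=1]",
    "such {non-alcoholic [sodas]} as {[root beer]} and {[cream soda]}[PATTERN=1]",
    "{traditional [food]}, such as {[sandwich]}, {[burger]}, and {[fry]}[PATTERN=2]",
]
e = count_pairs(concordances)
```

Pairs are stored once, with the two terms in alphabetical order, so `e[("doctors", "occupations")]` holds the count for that pair regardless of the order in the text.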




General procedure




Reranking

• Efreq. No re-ranking.

$$s_{ij} = e_{ij}$$

$s_{ij}$ – semantic similarity between terms $c_i, c_j \in C$
$e_{ij}$ – frequency of co-occurrence of $c_i$ and $c_j$ in concordances $K$

• Efreq-Rfreq. Penalizes terms strongly related to many words.

$$s_{ij} = \frac{2 \cdot \alpha \cdot e_{ij}}{e_{i*} + e_{*j}},$$

$e_{i*}$ – number of concordances containing word $c_i$
$\alpha$ – expected number of semantically related words per term
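The Efreq-Rfreq formula can be sketched in a few lines. This is an illustration under two assumptions that the slide does not state: $e_{i*}$ is computed as the sum of $e_{ij}$ over all partners of $c_i$, and pairs are stored once as unordered tuples. The counts and the value of `alpha` are hypothetical.

```python
from collections import defaultdict


def efreq_rfreq(e, alpha):
    """s_ij = 2 * alpha * e_ij / (e_i* + e_*j) for unordered pair counts e."""
    # e_i*: total co-occurrence count of a term over all of its partners.
    totals = defaultdict(int)
    for (ci, cj), count in e.items():
        totals[ci] += count
        totals[cj] += count
    return {
        (ci, cj): 2.0 * alpha * count / (totals[ci] + totals[cj])
        for (ci, cj), count in e.items()
    }


# Hypothetical counts: "item" co-occurs with many terms, so its pairs are penalized.
e = {("cat", "tiger"): 4, ("cat", "item"): 4, ("item", "book"): 6, ("item", "car"): 6}
s = efreq_rfreq(e, alpha=1.0)
```

Although both pairs with "cat" were extracted four times, the pair with the general word "item" receives the lower score.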


• Efreq-Rnum. Penalizes terms strongly related to many words:

$$s_{ij} = \frac{2 \cdot \mu_b \cdot e_{ij}}{b_{i*} + b_{*j}},$$

$b_{i*} = \sum_{j : e_{ij} \ge \beta} 1$ – number of extractions with a frequency $\ge \beta$
$\mu_b = \frac{1}{|C|} \sum_{i=1}^{|C|} b_{i*}$ – an average number of relations per term

• Efreq-Cfreq. Penalizes relations to general words, e.g. “item”.

$$s_{ij} = \frac{P(c_i, c_j)}{P(c_i) P(c_j)}$$

$P(c_i, c_j) = \frac{e_{ij}}{\sum_{ij} e_{ij}}$ – extraction probability of the pair $\langle c_i, c_j \rangle$
$P(c_i) = \frac{f_i}{\sum_i f_i}$ – probability of the word $c_i$
$f_i$ – frequency of $c_i$ in the corpus

• Efreq-Rnum-Cfreq-Pnum. Combines previous formulas + pattern redundancy.

$$s_{ij} = \sqrt{p_{ij}} \cdot \frac{2 \cdot \mu_b}{b_{i*} + b_{*j}} \cdot \frac{P(c_i, c_j)}{P(c_i) P(c_j)}.$$

$p_{ij} \in \{1, \dots, 18\}$ – number of patterns that extracted the relation $\langle c_i, c_j \rangle$
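The combined Efreq-Rnum-Cfreq-Pnum score can be sketched as follows. This is an illustration, not the authors' implementation: the function name `rerank`, the default `beta=1`, the choice of taking the vocabulary $C$ from the extracted pairs, and all input numbers are assumptions.

```python
import math
from collections import defaultdict


def rerank(e, f, p, beta=1):
    """Efreq-Rnum-Cfreq-Pnum scores for unordered term pairs.

    e: {(ci, cj): e_ij} pair extraction counts
    f: {ci: f_i} corpus frequencies of the terms
    p: {(ci, cj): p_ij} number of distinct patterns that extracted the pair
    """
    # b_i*: number of partners of a term extracted at least beta times.
    b = defaultdict(int)
    for (ci, cj), count in e.items():
        if count >= beta:
            b[ci] += 1
            b[cj] += 1
    terms = {t for pair in e for t in pair}
    mu_b = sum(b[t] for t in terms) / len(terms)  # average relations per term

    total_e = sum(e.values())
    total_f = sum(f.values())
    scores = {}
    for (ci, cj), count in e.items():
        if b[ci] + b[cj] == 0:  # neither term has a relation above the threshold
            continue
        rnum = 2.0 * mu_b / (b[ci] + b[cj])
        cfreq = (count / total_e) / ((f[ci] / total_f) * (f[cj] / total_f))
        scores[(ci, cj)] = math.sqrt(p[(ci, cj)]) * rnum * cfreq
    return scores


# Hypothetical inputs: "item" is frequent in the corpus and found by one pattern only.
e = {("cat", "tiger"): 4, ("cat", "item"): 4}
f = {"cat": 10, "tiger": 5, "item": 100}
p = {("cat", "tiger"): 4, ("cat", "item"): 1}
scores = rerank(e, f, p)
```

With equal extraction counts, the pair confirmed by more patterns and involving the rarer word ends up with the higher score.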



Results


Correlation with Human Judgements

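term, $c_i$    term, $c_j$    judgement, $s$   sim, $\hat{s}$   judgement, $r$   sim, $\hat{r}$
tiger          cat            7.35             0.85             1                3
book           paper          7.46             0.95             2                2
computer       keyboard       7.62             0.81             3                1
...            ...            ...              ...              ...              ...
possibility    girl           1.94             0.25             64               65
sugar          approach       0.88             0.05             65               23

Here $s$ and $r$ are the human judgements and their ranks; $\hat{s}$ and $\hat{r}$ are the scores of the similarity measure and their ranks.

Data:
• WordSim353 – 353 term pairs (Finkelstein et al., 2002)
• MC – 30 term pairs (Miller and Charles, 1991)
• RG – 65 term pairs (Rubenstein and Goodenough, 1965)

Criteria:
• Pearson correlation: $\rho = \frac{\mathrm{cov}(s, \hat{s})}{\sigma(s)\,\sigma(\hat{s})}$
• Spearman’s correlation: $r = \frac{\mathrm{cov}(r, \hat{r})}{\sigma(r)\,\sigma(\hat{r})}$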
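Both criteria are easy to compute directly. The sketch below implements Pearson correlation and then obtains Spearman's correlation as the Pearson correlation of the rank vectors; it assumes there are no tied scores, and the four score pairs are hypothetical values rather than rows of any dataset.

```python
import math


def pearson(x, y):
    """Pearson correlation: cov(x, y) / (sigma(x) * sigma(y))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """Rank 1 = highest score; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


# Hypothetical scores for four term pairs.
human = [9.0, 7.5, 3.0, 1.0]
measure = [0.90, 0.95, 0.20, 0.05]
rho = pearson(human, measure)
r = spearman(human, measure)
```

Here the measure swaps the two top pairs, so the rank vectors are [1, 2, 3, 4] and [2, 1, 3, 4] and Spearman's correlation is 0.8.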


Semantic Relation Ranking

term, $c_i$   term, $c_j$   relation type, $t$
judge         adjudicate    syn
judge         arbitrate     syn
judge         chancellor    syn
...           ...           ...
judge         pc            random
judge         fare          random
judge         lemon         random

• BLESS (Baroni and Lenci, 2011)
  • 26554 relations
  • hypernyms, co-hyponyms, meronyms, associations, attributes, random relations
• SN (Panchenko and Morozova, 2012)
  • 14682 relations
  • synonyms, co-hyponyms, hyponyms, random relations
• $\frac{|R_{random}|}{|R|} \approx 0.5$



• Based on the number of correctly ranked relations.
• $R$ – all non-random relations
• $\hat{R}(k)$ – top $k$% of the ranked relations of the target terms

Criteria

• Precision: $P(k) = \frac{|R \cap \hat{R}(k)|}{|\hat{R}(k)|}$,
• Recall: $R(k) = \frac{|R \cap \hat{R}(k)|}{|R|}$,
• We use P(10), P(20), P(50), R(50).



• Precision P(50%) = 6/7 ≈ 0.86

term, $c_i$   term, $c_j$   relation type   $s_{ij}$
aficionado    enthusiast    syn             0.07197
aficionado    fan           syn             0.05195
aficionado    admirer       syn             0.01964
aficionado    addict        syn             0.01326
aficionado    devotee       syn             0.01163
aficionado    foundling     random          0.00777
aficionado    fanatic       syn             0.00414
aficionado    adherent      syn             0.00353
aficionado    capital       random          0.00232
aficionado    statute       random          0.00029
aficionado    blot          random          0.00025
aficionado    meddler       random          0.00005
aficionado    enlargement   random          0.00003
aficionado    bawdyhouse    random          0.00000
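The precision value for the "aficionado" example can be checked with a short sketch. The function name `precision_recall_at_k` and the rounding rule (the top k% is rounded up to a whole number of relations) are assumptions made for illustration.

```python
import math


def precision_recall_at_k(ranked, k):
    """ranked: (term, relation_type) pairs sorted by decreasing similarity.

    k is a percentage; the top k% of the ranked list count as extracted.
    """
    top_n = math.ceil(len(ranked) * k / 100)
    extracted = ranked[:top_n]
    correct = sum(1 for _, rel in extracted if rel != "random")
    relevant = sum(1 for _, rel in ranked if rel != "random")
    return correct / top_n, correct / relevant


# Candidates for the target "aficionado", in the order of the table above.
ranked = [
    ("enthusiast", "syn"), ("fan", "syn"), ("admirer", "syn"), ("addict", "syn"),
    ("devotee", "syn"), ("foundling", "random"), ("fanatic", "syn"),
    ("adherent", "syn"), ("capital", "random"), ("statute", "random"),
    ("blot", "random"), ("meddler", "random"), ("enlargement", "random"),
    ("bawdyhouse", "random"),
]
p50, r50 = precision_recall_at_k(ranked, 50)
```

The top 50% contains 7 relations, 6 of which are synonyms, so both P(50) and R(50) equal 6/7 for this target.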



Figure: Precision-Recall graphs calculated on the BLESS dataset: (a) PatternSim measures; (b) the best PatternSim measure versus baselines.

Semantic Relation Extraction

Figure: Semantic relation extraction: precision at k.

• 49 words – vocabulary of the RG dataset
• three annotators, binary annotations

Conclusion

• We presented a similarity measure based on manually crafted lexico-syntactic patterns.
• The measure provides results comparable to the baselines and does not require semantic resources.
• Future work – using a supervised model to
  • combine different factors;
  • tune the meta-parameters.

Data: http://cental.fltr.ucl.ac.be/team/~panchenko/sim-eval/

Code: http://github.com/cental/patternsim/

Demo: http://serelex.cental.be/



Introduction


Reference Paper

• Panchenko A., Morozova O. “A Study of Hybrid Similarity Measures for Semantic Relation Extraction”. In Proceedings of the Workshop on Innovative Hybrid Approaches to the Processing of Textual Data, EACL 2012, pp. 10–18, 2012


The State of the Art

• A multitude of complementary measures were proposed to extract synonyms, hypernyms, and co-hyponyms
• Most of them are based on one of the 5 key approaches:
  1. distributional analysis (Lin, 1998b)
  2. web as a corpus (Cilibrasi and Vitanyi, 2007)
  3. lexico-syntactic patterns (Bollegala et al., 2007)
  4. semantic networks (Resnik, 1995)
  5. definitions of dictionaries or encyclopedias (Zesch et al., 2008a)
• Some attempts were made to combine measures (Curran, 2002; Cederberg and Widdows, 2003; Mihalcea et al., 2006; Agirre et al., 2009; Yang and Callan, 2009)
• However, most studies still do not take into account all 5 existing extraction approaches.


Contributions

• A systematic analysis of
  • 16 baseline similarity measures of 5 key extraction principles
  • their combinations with 8 fusion methods
• Hybrid similarity measures based on all the 5 extraction approaches:
  1. distributional analysis
  2. Web as a corpus
  3. lexico-syntactic patterns
  4. semantic networks
  5. definitions of dictionaries or encyclopedias


Single and Hybrid Similarity Measures

• 16 single measures
  • 5 measures based on a semantic network
  • 3 web-based measures
  • 5 corpus-based measures
    • 2 distributional
    • 1 lexico-syntactic patterns
    • 2 other co-occurrence based
  • 3 definition-based measures
• 64 hybrid measures
  • 8 combination methods
  • 8 measure sets obtained with 3 measure selection techniques



Features: Single Similarity Measures




Measures Based on a Semantic Network

1. Wu and Palmer (1994)
2. Leacock and Chodorow (1998)
3. Resnik (1995)
4. Jiang and Conrath (1997)
5. Lin (1998)

Data:
• WordNet 3.0
• SemCor corpus

Variables:
• Lengths of the shortest paths between terms in the network
• Probability of terms derived from a corpus

Coverage: 155,287 English terms encoded in WordNet 3.0.
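As an illustration of a path-based network measure, the sketch below computes the Wu and Palmer score, 2 · depth(lcs) / (depth(a) + depth(b)), where lcs is the deepest common ancestor of the two terms. It runs on a tiny hypothetical taxonomy rather than WordNet, and it assumes depth is counted in nodes so that the root has depth 1.

```python
# Toy is-a taxonomy (hypothetical, not WordNet): child -> parent.
PARENT = {
    "animal": "entity",
    "feline": "animal",
    "canine": "animal",
    "cat": "feline",
    "tiger": "feline",
    "dog": "canine",
}


def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))


def wu_palmer(a, b):
    """2 * depth(lcs) / (depth(a) + depth(b)), lcs = deepest common ancestor."""
    ancestors_a = set(path_to_root(a))
    # The first common node on b's path to the root is the deepest one.
    lcs = next(n for n in path_to_root(b) if n in ancestors_a)
    return 2.0 * depth(lcs) / (depth(a) + depth(b))
```

In this toy network `wu_palmer("cat", "tiger")` is 0.75 (common ancestor "feline"), while `wu_palmer("cat", "dog")` is 0.5 (common ancestor "animal").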



Web-based Measures

Normalized Google Distance (NGD) (Cilibrasi and Vitanyi, 2007)

6. NGD-Yahoo!
7. NGD-Bing
8. NGD-Google over wikipedia.org domain

Data: number of times the terms co-occur in the documents as indexed by an IR system.

Variables:
• number of hits returned by query "$c_i$"
• number of hits returned by query "$c_i$ AND $c_j$"

Coverage: huge vocabulary in dozens of languages.
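The two hit counts above are combined by the NGD formula, NGD(x, y) = (max(log f(x), log f(y)) − log f(x, y)) / (log N − min(log f(x), log f(y))), where N is the number of indexed documents. A minimal sketch follows; the hit counts and the index size are illustrative assumptions, not fresh search-engine results, and note that NGD is a distance, so smaller values mean more closely related terms.

```python
import math


def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from hit counts.

    fx, fy: hits for each term; fxy: hits for "x AND y"; n: index size.
    """
    if fxy == 0:
        return float("inf")  # the terms never co-occur
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


# Illustrative hit counts (assumed values).
d = ngd(fx=46_700_000, fy=12_200_000, fxy=2_630_000, n=8_058_044_651)
```

With these counts the distance is about 0.44; two terms that always occur together would get a distance of 0.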




Corpus-based Measures

9. Bag-of-words Distributional Analysis (BDA) (Sahlgren, 2006)
10. Syntactic Distributional Analysis (SDA) (Curran, 2003)

Data: WaCkypedia (800M tokens) and PukWaC (2000M tokens) corpora (Baroni et al., 2009)

Variables:
• feature vector based on the context window
• feature vector based on the syntactic context

Coverage: a word must occur in the corpora.
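A context-window feature vector can be sketched as a bag of the words that appear within a fixed distance of the target. In the example below the window size of 2, the use of cosine to compare vectors, and the toy corpus are all assumptions chosen for illustration; the slide does not fix these details.

```python
import math
from collections import Counter, defaultdict


def context_vectors(tokens, window=2):
    """Bag-of-words context vector for every token type in a token list."""
    vectors = defaultdict(Counter)
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[target][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


tokens = "the cat eats fish . the tiger eats meat . the car needs fuel .".split()
vecs = context_vectors(tokens, window=2)
sim_cat_tiger = cosine(vecs["cat"], vecs["tiger"])
sim_cat_car = cosine(vecs["cat"], vecs["car"])
```

"cat" and "tiger" share the contexts "the" and "eats", so they come out more similar than "cat" and "car", which share only "the".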





11. A measure based on lexico-syntactic patterns

Data: WaCkypedia corpus (800M tokens)

Method:
• 10 patterns for hypernymy extraction: 6 Hearst (1992) patterns + 4 other patterns
  • such diverse {[occupations]} as {[doctors]}, {[engineers]} and {[scientists]}[PATTERN=1]
• Efreq: the semantic similarity $s_{ij}$ between terms $c_i, c_j \in C$ is the number of term co-occurrences in the same concordance, $n_{ij}$, normalized by its maximum:

$$\mathrm{sim}(c_i, c_j) = s_{ij} = \frac{n_{ij}}{\max_{ij}(n_{ij})}.$$




12. Latent Semantic Analysis (LSA) on TASA corpus (Landauer and Dumais, 1997)
13. NGD on Factiva corpus (Veksler et al., 2008)



Definition-based Measures

14. Extended Lesk (Banerjee and Pedersen, 2003)
15. GlossVectors (Patwardhan and Pedersen, 2006)

Data: WordNet glosses.

Variables:
• bag-of-words vector of a term $c_i$ derived from the glosses
• relation between words $(c_i, c_j)$ in the network

Coverage: 117,659 glosses encoded in WordNet 3.0

16. WktWiki – BDA on definitions of Wiktionary and Wikipedia¹

Data: Wikipedia abstracts, Wiktionary.

Method:
• Definition = abstract of the Wikipedia article with title "$c_i$" + glosses, examples, quotations, related words, categories from Wiktionary for $c_i$
• Represent a definition as a bag-of-words vector
• Calculate similarities with cosine
• Update similarities according to relations in the Wiktionary.

Coverage: Wiktionary: 536,594 glosses, Wikipedia: 3.8M articles

¹ The method stems from the work of Zesch et al. (2008)


Hybrid Similarity Measures




Combination Methods

• The goal of a combination method is to produce “better” similarity scores than the scores of single measures.
• A combination method takes as input $\{S_1, \dots, S_K\}$ produced by $K$ single measures and outputs $S_{cmb}$.
• $s^k_{ij} \in S_k$ is a pairwise similarity score of terms $c_i$ and $c_j$ produced by the $k$-th measure.
• We tested 8 combination methods.

1. Mean. A mean of $K$ pairwise similarity scores:

$$S_{cmb} = \frac{1}{K} \sum_{k=1}^{K} S_k \iff s^{cmb}_{ij} = \frac{1}{K} \sum_{k=1}^{K} s^k_{ij}.$$

2. Mean-Nnz. A mean of scores having non-zero value:

$$s^{cmb}_{ij} = \frac{1}{|\{k : s^k_{ij} > 0,\ k = 1, \dots, K\}|} \sum_{k=1}^{K} s^k_{ij}.$$

3. Mean-Zscore. A mean of scores transformed into Z-scores:

$$S_{cmb} = \frac{1}{K} \sum_{k=1}^{K} \frac{S_k - \mu_k}{\sigma_k},$$

where $\mu_k$ and $\sigma_k$ are the mean and the standard deviation of the scores of the $k$-th measure ($S_k$).
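The three mean-based combinations can be sketched with the standard library only. The function names are illustrative; using the population standard deviation for the Z-score and returning 0 when all scores of a pair are zero are assumptions not stated on the slide.

```python
import statistics


def mean_combine(scores):
    """Mean: average of the K scores of one term pair."""
    return sum(scores) / len(scores)


def mean_nnz_combine(scores):
    """Mean-Nnz: sum of the scores divided by the number of non-zero scores."""
    nonzero = [s for s in scores if s > 0]
    return sum(scores) / len(nonzero) if nonzero else 0.0


def mean_zscore_combine(measures):
    """Mean-Zscore over K measures, each a list of scores for the same pairs.

    Returns one combined score per pair. Each measure must have a
    non-zero standard deviation.
    """
    normalized = []
    for scores in measures:
        mu = statistics.mean(scores)
        sigma = statistics.pstdev(scores)  # population std; an assumption
        normalized.append([(s - mu) / sigma for s in scores])
    k = len(measures)
    return [sum(column) / k for column in zip(*normalized)]
```

For the scores [0.2, 0.0, 0.4] of one pair, Mean gives 0.2 while Mean-Nnz gives 0.3, because the measure that has no score for the pair is not counted. Mean-Zscore puts measures with different scales, for example [1, 2, 3] and [10, 20, 30], on a common footing before averaging.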


4. Median. A median of $K$ pairwise similarities:

$$s^{cmb}_{ij} = \mathrm{median}(s^1_{ij}, \dots, s^K_{ij}).$$

5. Max. A maximum of $K$ pairwise similarities:

$$s^{cmb}_{ij} = \max(s^1_{ij}, \dots, s^K_{ij}).$$

6. RankFusion. A mean of scores converted to ranks:

$$s^{cmb}_{ij} = \frac{1}{K} \sum_{k=1}^{K} r^k_{ij},$$

where $r^k_{ij}$ is the rank corresponding to the similarity score $s^k_{ij}$.
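Median, Max and RankFusion can be sketched in the same style. The rank convention used here (rank 1 for the highest score, ties broken by position) is an assumption; under it, a lower fused value means a more similar pair.

```python
import statistics


def median_combine(scores):
    """Median of the K scores of one term pair."""
    return statistics.median(scores)


def max_combine(scores):
    """Maximum of the K scores of one term pair."""
    return max(scores)


def to_ranks(scores):
    """Rank 1 = highest similarity; ties are broken by position (an assumption)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks


def rank_fusion(measures):
    """Mean rank per pair over K measures, each a list of scores for the same pairs."""
    rank_lists = [to_ranks(scores) for scores in measures]
    k = len(measures)
    return [sum(column) / k for column in zip(*rank_lists)]


# Two hypothetical measures scoring the same three term pairs.
fused = rank_fusion([[0.9, 0.1, 0.5], [0.8, 0.3, 0.2]])
```

Both measures rank the first pair highest, so it gets the fused value 1.0; the other two pairs swap places between the measures and both end up at 2.5.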


7. RelationFusion.
• Takes the union of the top relations found by each measure separately.
• A relation extracted by several measures has more weight.
• See (Panchenko and Morozova, 2012) for details.

8. Logit. A supervised combination of similarity measures
• Training a binary classifier (a Logistic Regression) on a set of manually constructed semantic relations $R$ (BLESS or SN)
• Positive training examples are “meaningful” relations (synonyms, hyponyms, co-hyponyms, associations)
• Negative training examples are pairs of semantically unrelated words (generated randomly and verified manually).
• A relation $\langle c_i, t, c_j \rangle \in R$ is represented with an $N$-dimensional vector of pairwise similarities: $x_{ij} = (s^1_{ij}, \dots, s^N_{ij})$.
• Category $y_{ij}$:

$$y_{ij} = \begin{cases} 0 & \text{if } \langle c_i, t, c_j \rangle \text{ is a random relation} \\ 1 & \text{otherwise} \end{cases}$$

• Using the model $(w_1, \dots, w_K)$ to combine measures:

$$s^{cmb}_{ij} = \frac{1}{1 + e^{-z}}, \quad z = w_0 + \sum_{k=1}^{K} w_k s^k_{ij}.$$
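Once the weights are learned, applying the Logit combination is a single logistic function call. The sketch below shows only this application step; the weights and the bias are hypothetical values picked for illustration, whereas in practice they would come from training a logistic regression on BLESS or SN.

```python
import math


def logit_combine(scores, weights, bias):
    """s_cmb = 1 / (1 + exp(-z)), with z = w_0 + sum_k w_k * s_k."""
    z = bias + sum(w * s for w, s in zip(weights, scores))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights for K = 3 measures; real ones come from training.
weights = [2.0, 1.0, 3.0]
bias = -3.0
related = logit_combine([0.9, 0.8, 0.7], weights, bias)
unrelated = logit_combine([0.1, 0.0, 0.05], weights, bias)
```

The combined score lies in (0, 1) and can be read as the estimated probability that the pair is a non-random relation: here the first pair scores above 0.5 and the second well below it.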




Measure Selection

A problem

Number of ways to choose which of 16 single measures to combine (non-empty subsets):

$$2^{16} - 1 = 65{,}535$$

• Expert choice of measures – 5, 9 and 15 measures
• Forward Stepwise Procedure – 7, 8a, 8b, 10 measures
• Analysis of LR weights – 12 measures
• The best predictors: C-BDA, C-SDA, C-LSA-Tasa, D-WktWiki, D-GlossVectors, D-ExtendedLesk.



Results


Single Similarity Measures

Figure: Performance of 16 single similarity measures on human judgement datasets (MC, RG, WordSim353). The best scores in a group are in bold.

Figure: Performance of 16 single similarity measures on human judgement datasets (MC, RG, WordSim353) and semantic relation datasets (BLESS and SN). The best scores in a group are in bold.

Figure: Performance of 16 single and 8 hybrid similarity measures on human judgement datasets (MC, RG, WordSim353) and semantic relation datasets (BLESS and SN). The best scores in a group (single/hybrid) are in bold; the very best scores are in grey.

Figure: Precision-Recall graphs calculated on the BLESS dataset of (a) 16 single measures and the best hybrid measure H-Logit-E15; (b) 8 hybrid measures.



Conclusion:

• We have undertaken a study of 16 baseline measures, 8 combination methods, and 3 measure selection techniques.
• The proposed hybrid measures:
  • use all 5 main types of baseline measures;
  • outperform the single measures on all datasets.
• The best results were provided by
  • a combination of 15 corpus-, web-, network-, and definition-based measures
  • with Logistic Regression
  • $\rho = 0.870$, P(20) = 0.987, R(50) = 0.814.

Thank you! Questions?
