

Using WordNet to Supplement Corpus

Statistics

Rose Hoberman and Roni Rosenfeld

November 14, 2002

Sphinx Lunch Nov 2002


Data, Statistics, and Sparsity

• Statistical approaches need large amounts of data

• Even with lots of data, there is a long tail of infrequent events (in 100MW, over half of word types occur only once or twice)

• Problem: Poor statistical estimation of rare events

• Proposed Solution: Augment data with linguistic or semantic knowledge (e.g. dictionaries, thesauri, knowledge bases, ...)


WordNet

• Large semantic network, groups words into synonym sets

• Links sets with a variety of linguistic and semantic relations

• Hand-built by linguists (theories of human lexical memory)

• Small sense-tagged corpus


WordNet: Size and Shape

• Size: 110K synsets, lexicalized by 140K lexical entries

– 70% nouns
– 17% adjectives
– 10% verbs
– 3% adverbs

• Relations: 150K

– 60% hypernym/hyponym (IS-A)
– 30% similar to (adjectives), member of, part of, antonym
– 10% ...


WordNet Example: Paper IS-A ...

• paper → material, stuff → substance, matter → physical object → entity

• composition, paper, report, theme → essay → writing ... abstraction

→ assignment ... work ... human act

• newspaper, paper → print media ... instrumentality → artifact → entity

• newspaper, paper, newspaper publisher → publisher, publishing house → firm, house, business firm → business, concern → enterprise → organization → social group → group, grouping

• ...


This Talk

• Derive numerical word similarities from WordNet noun taxonomy.

• Examine usefulness of WordNet for two language modelling tasks:

1. Improve perplexity of bigram LM (trained on very little data)

• Combine bigram data of rare words with similar but more common proxies

• Use WN to find similar words

2. Find words which tend to co-occur within a sentence.

• Long distance correlations often semantic
• Use WN to find semantically related words


Measuring Similarity in a Taxonomy

• Structure of taxonomy lends itself to calculating distances (or similarities)

• Simplest distance measure: length of shortest path (in edges)

• Problem: edges often span different semantic distances

• For example:
  plankton IS-A living thing
  rabbit IS-A leporid ... IS-A mammal IS-A vertebrate IS-A ... animal IS-A living thing
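The unevenness of edge lengths can be seen on a toy IS-A chain. This is a minimal sketch with a hypothetical hand-built taxonomy (the `parent` table and the intermediate concepts are invented for illustration, loosely following the slide's plankton/rabbit example):

```python
# Hypothetical toy IS-A taxonomy (child -> parent). The chain lengths are
# invented to mirror the slide's point: one edge can span a huge semantic
# distance (plankton -> living thing) while elsewhere many edges cover less.
parent = {
    "plankton": "living thing",
    "rabbit": "leporid",
    "leporid": "lagomorph",
    "lagomorph": "mammal",
    "mammal": "vertebrate",
    "vertebrate": "chordate",
    "chordate": "animal",
    "animal": "living thing",
}

def edges_to_ancestor(word, ancestor):
    """Count IS-A edges from a word up to a given ancestor."""
    steps = 0
    while word != ancestor:
        word = parent[word]
        steps += 1
    return steps

print(edges_to_ancestor("plankton", "living thing"))  # 1 edge
print(edges_to_ancestor("rabbit", "living thing"))    # 7 edges
```

Under a shortest-path measure, plankton would look far more "similar" to living thing than rabbit is, which is exactly the problem the slide raises.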


Measuring Similarity using Information Content

• Resnik’s method: use structure and corpus statistics

• Counts from corpus ⇒ probability of each concept in the taxonomy ⇒ “information content” of a concept

• Similarity between concepts = the information content of their least common ancestor: sim(c1, c2) = − log(p(lca(c1, c2)))

• Other similarity measures subsequently proposed
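Resnik's measure can be sketched on a toy taxonomy. This is a minimal illustration, not the actual implementation: the taxonomy and counts below are invented, whereas in practice p(c) comes from corpus frequencies, with each word occurrence credited to its concept and all of that concept's ancestors.

```python
import math

# Hypothetical toy taxonomy and invented word counts.
parent = {"cat": "mammal", "dog": "mammal", "mammal": "animal",
          "trout": "fish", "fish": "animal", "animal": None}
counts = {"cat": 10, "dog": 15, "trout": 5, "mammal": 0, "fish": 0, "animal": 0}

def ancestors(c):
    """Chain from a concept up to the root, inclusive."""
    chain = [c]
    while parent[c] is not None:
        c = parent[c]
        chain.append(c)
    return chain

# Propagate counts upward: p(c) covers concept c and everything below it.
total = sum(counts.values())
cum = {c: 0 for c in parent}
for c, n in counts.items():
    for a in ancestors(c):
        cum[a] += n

def info_content(c):
    return -math.log(cum[c] / total)

def resnik_sim(c1, c2):
    """Information content of the least common ancestor."""
    shared = set(ancestors(c2))
    lca = next(a for a in ancestors(c1) if a in shared)
    return info_content(lca)

print(resnik_sim("cat", "dog"))    # IC of "mammal": moderately informative
print(resnik_sim("cat", "trout"))  # IC of "animal": 0, the root covers everything
```

Note how the root has probability 1 and hence zero information content, so concepts related only through very general ancestors get low similarity.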


Similarity between Words

• Each word has many senses (multiple nodes in taxonomy)

• Resnik’s word similarity: max similarity between any of their senses

• Alternative definition: the weighted sum of sim(c1, c2) over all pairs of senses c1 of w1 and c2 of w2, where more frequent senses are weighted more heavily.

• For example:
  TURKEY vs. CHICKEN
  TURKEY vs. GREECE
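The two word-similarity definitions can be sketched as follows. Everything here is a hypothetical stand-in: the sense inventories, sense frequencies, and concept similarities are invented numbers playing the role of WordNet senses, a sense-tagged corpus, and a concept measure like Resnik's.

```python
# Invented sense inventories and statistics for illustration only.
senses = {
    "turkey": ["turkey_bird", "Turkey_country"],
    "chicken": ["chicken_bird"],
}
# Relative frequency of each sense (in practice, from a sense-tagged corpus).
sense_freq = {"turkey_bird": 0.7, "Turkey_country": 0.3, "chicken_bird": 1.0}
# Hypothetical concept-level similarities.
sense_sim = {("turkey_bird", "chicken_bird"): 5.0,
             ("Turkey_country", "chicken_bird"): 0.5}

def wsim_max(w1, w2):
    """Resnik's word similarity: best similarity over any sense pair."""
    return max(sense_sim[(c1, c2)] for c1 in senses[w1] for c2 in senses[w2])

def wsim_weighted(w1, w2):
    """Alternative: frequency-weighted sum over all sense pairs."""
    return sum(sense_freq[c1] * sense_freq[c2] * sense_sim[(c1, c2)]
               for c1 in senses[w1] for c2 in senses[w2])

print(wsim_max("turkey", "chicken"))       # 5.0, driven by the bird senses
print(wsim_weighted("turkey", "chicken"))  # 0.7*5.0 + 0.3*0.5 = 3.65
```

The weighted version discounts the bird-sense match by how often "turkey" actually means the country, which is the intuition behind weighting frequent senses more heavily.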


Improving Bigram Perplexity

• Combat sparseness → define equivalence classes and pool data

• Automatic clustering, distributional similarity, ...

• But... for rare words not enough info to cluster reliably

• Test whether bigram distributions of semantically similar words (according to WordNet) can be combined to reduce the bigram perplexity of rare words


Combining Bigram Distributions

• Simple linear interpolation

• ps(·|t) = (1 − λ) pgt(·|t) + λ pml(·|s)

• Optimize λ using 10-way cross-validation on the training set

• Evaluate by comparing the perplexity of ps(·|t) on a new test set with the baseline model pgt(·|t).
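The interpolation and its evaluation can be sketched with toy distributions. The numbers below are invented; in the slides, pgt is a Good-Turing-smoothed bigram for the target word t and pml a maximum-likelihood bigram for the proxy s.

```python
import math

# Invented toy bigram distributions over successor words.
p_gt_t = {"a": 0.5, "b": 0.3, "c": 0.2}   # smoothed bigram after target t
p_ml_s = {"a": 0.1, "b": 0.8, "c": 0.1}   # ML bigram after proxy s

def p_interp(w, lam):
    """Linear interpolation: ps(w|t) = (1 - lam)*pgt(w|t) + lam*pml(w|s)."""
    return (1 - lam) * p_gt_t[w] + lam * p_ml_s.get(w, 0.0)

def perplexity(test_words, lam):
    logp = sum(math.log(p_interp(w, lam)) for w in test_words)
    return math.exp(-logp / len(test_words))

# Held-out successors of t; the proxy's data helps when they overlap.
held_out = ["b", "b", "a", "b"]
baseline = perplexity(held_out, 0.0)  # pure pgt model
# Crude grid search over lambda (the slides use cross-validation instead).
interpolated = min(perplexity(held_out, lam / 10) for lam in range(11))
print(baseline, interpolated)  # interpolated <= baseline, since lam=0 is allowed
```

Since λ = 0 recovers the baseline, the best interpolated model can never be worse on the data used to pick λ; the experimental question is whether the gain survives on unseen test data.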


Ranking Proxies

• Score each candidate proxy s for target word t

1. WordNet similarity score: wsimmax(t, s)

2. KL Divergence: D(pgt(·|t) || pml(·|s))

3. Training set perplexity reduction of word s, i.e. the improvement in perplexity of ps(·|t) compared to the 10-way cross-validated model.

4. Random: choose proxy randomly

• Choose highest ranked proxy (ignore actual scales of scores)
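The KL-divergence ranking (method 2) can be sketched directly. The distributions are invented toy numbers; a real setup would smooth pml so no candidate assigns zero probability where the target distribution does not.

```python
import math

def kl_divergence(p, q):
    """D(p || q): how badly q models p. Lower = better proxy."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)

# Invented target distribution and two candidate proxy distributions.
p_target = {"a": 0.5, "b": 0.3, "c": 0.2}
candidates = {
    "s1": {"a": 0.45, "b": 0.35, "c": 0.2},   # close to the target
    "s2": {"a": 0.1, "b": 0.1, "c": 0.8},     # very different
}
best = min(candidates, key=lambda s: kl_divergence(p_target, candidates[s]))
print(best)  # "s1", the distribution closer to the target's
```

Note the asymmetry of the ranking: it scores how well the proxy's bigram distribution covers the target's, which matches the direction D(pgt(·|t) || pml(·|s)) on the slide.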


Experiments

• 140MW of Broadcast News

– Test: 40MW reserved for testing
– Train: 9 random subsets of training data (1MW - 100MW)

• From nouns occurring in WordNet:

– 150 target words (occurred < 2 times in 1MW)
– 2000 candidate proxies (occurred > 50 times in 1MW)


Methodology

For each training corpus size:

• Find highest scoring proxy for each target word and each ranking method

• Target word: ASPIRATIONS
  Best proxies: SKILLS, DREAMS, DREAM/DREAMS, HILL

• Create interpolated models and calculate perplexity reduction on test set

• Average perplexity reduction: weighted average of the perplexity reduction achieved for each target word, weighted by the frequency of each target word in the test set
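The frequency-weighted average in the last step is simple arithmetic; here is a sketch with invented target words and numbers:

```python
# Hypothetical per-target perplexity reductions (in percent) and how often
# each target word appears in the test set. All numbers are invented.
pp_reduction = {"aspirations": 4.0, "leotard": 1.5, "quasar": -0.5}
test_freq = {"aspirations": 30, "leotard": 10, "quasar": 10}

total = sum(test_freq.values())
avg = sum(pp_reduction[w] * test_freq[w] / total for w in pp_reduction)
print(avg)  # (4.0*30 + 1.5*10 - 0.5*10) / 50 = 2.6
```

Weighting by test frequency means a gain on a target that actually shows up often counts for more than a gain on one that barely occurs.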

Figure 1: Perplexity reduction as a function of training data size for four similarity measures (WordNet, Random, KLdiv, TrainPP).

Figure 2: Perplexity reduction as a function of proxy rank for four similarity measures (random, WNsim, KLdiv, cvPP).


Error Analysis

%    Type of Relation         Examples
45   Not an IS-A relation     rug-arm, glove-scene
40   Missing or weak in WN    aluminum-steel, bomb-shell
15   Present in WN            blizzard-storm

Table 1: Classification of best proxies for 150 target words.

• Each target word ⇒ proxy with largest test PP reduction ⇒ categorized relation

• Also a few topical relations (TESTAMENT-RELIGION) and domain-specific relations (BEARD-MAN)


Modelling Semantic Coherence

• N-grams only model short distances

• In real sentences, content words come from the same semantic domain

• Want to find long-distance correlations

• Incorporate semantic similarity constraint into exponential LM


Modelling Semantic Coherence II

• Find words that co-occur within a sentence.

• Association statistics from data only reliable for high frequency words

• Long-distance associations are semantic

• Use WN?


Experiments

• “Cheating experiment” to evaluate usefulness of WN

• Derive similarities from WN for only frequent words

• Compare to a measure of association calculated from large amounts of data (the “ground truth”)

• Question: are these two measures correlated?


“Ground Truth”

• 500,000 noun pairs

• Expected number of chance co-occurrences > 5

• Word pair association (Yule’s Q statistic):

  Q = (C11·C22 − C12·C21) / (C11·C22 + C12·C21)

                 Word 1 Yes   Word 1 No
  Word 2 Yes       C11          C12
  Word 2 No        C21          C22

• Q ranges from -1 to 1
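Yule's Q is a one-liner given the 2x2 contingency counts; the counts below are invented for illustration:

```python
# Yule's Q from a 2x2 sentence co-occurrence contingency table.
# C11 = sentences containing both words, C22 = neither,
# C12 / C21 = exactly one of the two. All counts here are invented.
def yules_q(c11, c12, c21, c22):
    return (c11 * c22 - c12 * c21) / (c11 * c22 + c12 * c21)

# A strongly associated pair: co-occurs far more often than chance.
print(yules_q(50, 10, 10, 930))     # close to +1
# An independent pair: co-occurrence matches the chance rate.
print(yules_q(10, 100, 100, 1000))  # 0
```

The diagonal products make Q symmetric in the two words and bound it in [-1, 1]: positive when the words attract each other within sentences, negative when they repel.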


Figure 3: Looking for correlation: WordNet similarity scores versus Q scores for 10,000 noun pairs.

[Figure: density of Q scores for all word pairs vs. pairs with WordNet similarity wsim > 6]

Only 0.1% of word pairs have WordNet similarity scores above 5, and only 0.03% are above 6.

Figure 4: Comparing the effectiveness of the two WordNet word similarity measures (maximum vs. weighted), plotted as precision against recall.


Relation Type       Num        Examples

WN                  277 (163)
  part/member        87 (15)   finger-hand, student-school
  phrase isa         65 (47)   death tax IS-A tax
  coordinates        41 (31)   house-senate, gas-oil
  morphology         30 (28)   hospital-hospitals
  isa                28 (23)   gun-weapon, cancer-disease
  antonyms           18 (13)   majority-minority
  reciprocal          8 (6)    actor-director, doctor-patient
non-WN              461
  topical           336        evidence-guilt, church-saint
  news and events   102        iraq-weapons, glove-theory
  other              23        END of the SPECTRUM

Table 2: Error Analysis


Conclusions?

• Very small bigram PP improvement when little data available

• Words with very high WN similarity do tend to co-occur within sentences

• However, recall is poor because most related pairs are topical (though WN is adding topical links)

• Limited types and quantities of relationships in WordNet compared to the spectrum of relationships found in real data

• WN word similarities are a weak source of knowledge for these two tasks


Possible Improvements, Other Directions?

• Interpolation weights should depend on ...

– data AND WordNet score
– relative frequency of target and proxy word

• Improve WN similarity measure

– consider frequency of senses but don’t dilute strong relations
– info content misleading for rare but high-level concepts
– learn a function from large amounts of data?
– learn which parts of taxonomy are more reliable/complete?

• Consider alternative framework

– class → word / word → class / class ← word / word ← class
– provide WN with more constraints (from data)
