Graph-based Analysis of Semantic Drift in Espresso-like Bootstrapping Algorithms

EMNLP 2008

23/04/22

Mamoru Komachi†, Taku Kudo‡, Masashi Shimbo† and Yuji Matsumoto†

†Nara Institute of Science and Technology, Japan
‡Google Inc.


Bootstrapping and semantic drift

Bootstrapping grows a small set of seed instances into a larger set of extracted instances

Generic patterns = patterns that co-occur with many (irrelevant) instances

[Figure: the seed instance "iPod" and the co-occurrence patterns "Buy # at Apple Store" and "# for sale"; extracted instances: "iPhone", "MacBook Air", "car", "house"]


Main contributions of this work


1. Suggest a parallel between semantic drift in Espresso-like bootstrapping and topic drift in HITS (Kleinberg, 1999)

2. Solve semantic drift with graph kernels used in the link analysis community

Espresso Algorithm [Pantel and Pennacchiotti, 2006]

Repeat
  Pattern extraction
  Pattern ranking
  Pattern selection
  Instance extraction
  Instance ranking
  Instance selection
Until a stopping criterion is met


Espresso uses a pattern-instance matrix A for ranking patterns and instances: a |P|×|I|-dimensional matrix holding the (normalized) pointwise mutual information (pmi) between patterns and instances


$[A]_{p,i} = \mathrm{pmi}(p,i) \,/\, \max_{p,i} \mathrm{pmi}(p,i)$

Rows are pattern indices (1, 2, ..., p, ..., |P|); columns are instance indices (1, 2, ..., i, ..., |I|)
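As a rough illustration of how such a matrix could be filled in, the sketch below builds a max-normalized PMI matrix from a table of pattern-instance co-occurrence counts. The function name `normalized_pmi_matrix`, the toy counts, the probability-based PMI estimate, and the clipping of negative PMI values to 0 are all assumptions made for this example; the slides do not specify these details.

```python
import numpy as np

def normalized_pmi_matrix(counts):
    """Return a |P| x |I| matrix of PMI values divided by the largest PMI.

    counts[p, i] is the number of times pattern p co-occurs with instance i.
    Assumptions (not from the slides): PMI is estimated from relative
    frequencies, and cells with zero count or negative PMI are set to 0.
    """
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)   # pattern totals
    col = counts.sum(axis=0, keepdims=True)   # instance totals
    expected = row * col / total              # count expected under independence
    pmi = np.zeros_like(counts)
    observed = counts > 0
    pmi[observed] = np.log(counts[observed] / expected[observed])
    pmi = np.maximum(pmi, 0.0)                # clip negative PMI (assumption)
    top = pmi.max()
    return pmi / top if top > 0 else pmi

# Hypothetical counts: 3 patterns (rows) x 3 instances (columns).
toy_counts = [[8, 1, 0],
              [1, 6, 2],
              [2, 2, 2]]
A = normalized_pmi_matrix(toy_counts)
```

After normalization the largest entry of `A` is 1, which matches the role of the max-pmi denominator in the definition of A.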

Pattern/instance ranking in Espresso

p = pattern score vector
i = instance score vector
A = pattern-instance matrix


$\mathbf{p} \leftarrow \frac{1}{|I|} A \mathbf{i}$ ... pattern ranking

$\mathbf{i} \leftarrow \frac{1}{|P|} A^{T} \mathbf{p}$ ... instance ranking

Reliable instances are supported by reliable patterns, and vice versa

|P| = number of patterns
|I| = number of instances

The factors 1/|I| and 1/|P| are normalization factors that keep the score vectors from growing too large


For graph-theoretic analysis, we will introduce 3 simplifications to Espresso

First simplification

Compute the pattern-instance matrix
Repeat
  Pattern extraction
  Pattern ranking
  Pattern selection
  Instance extraction
  Instance ranking
  Instance selection



Simplification 1
  Remove pattern/instance extraction steps
  Instead, pre-compute all patterns and instances once at the beginning of the algorithm

Second simplification

Compute the pattern-instance matrix
Repeat
  Pattern ranking
  Pattern selection
  Instance ranking
  Instance selection



Simplification 2
  Remove pattern/instance selection steps, which retain only the highest-scoring k patterns / m instances for the next iteration (i.e., reset the scores of other items to 0)
  Instead, retain the scores of all patterns and instances
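The selection step that Simplification 2 removes can be pictured with a small sketch: keep the k highest scores and reset everything else to 0. The helper name `keep_top_k` and the example scores are hypothetical, and how the original Espresso breaks ties is not stated on the slides.

```python
import numpy as np

def keep_top_k(scores, k):
    """Keep the k highest scores and reset all other scores to 0.

    Hypothetical helper illustrating the selection heuristic; ties are
    broken arbitrarily by the sort order.
    """
    scores = np.asarray(scores, dtype=float)
    kept = np.zeros_like(scores)
    top = np.argsort(-scores)[:k]   # indices of the k largest scores
    kept[top] = scores[top]
    return kept

# Example: only the two best-scoring items survive to the next iteration.
selected = keep_top_k([0.1, 0.5, 0.3, 0.9], k=2)
```

Simplified Espresso corresponds to skipping this step entirely, so every pattern and instance keeps its score from one iteration to the next.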

Third simplification

Compute the pattern-instance matrix
Repeat
  Pattern ranking
  Instance ranking
Until score vectors p and i converge

Simplification 3
  No early stopping, i.e., run until convergence

Simplified Espresso

Input
  Initial score vector of seed instances, e.g. $\mathbf{i} = (0, 1, 0, 0)$
  Pattern-instance co-occurrence matrix A

Main loop
  Repeat
    $\mathbf{p} \leftarrow \frac{1}{|I|} A \mathbf{i}$ ... pattern ranking
    $\mathbf{i} \leftarrow \frac{1}{|P|} A^{T} \mathbf{p}$ ... instance ranking
  Until i and p converge

Output
  Instance and pattern score vectors i and p
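A minimal NumPy sketch of the Simplified Espresso loop follows. It is an illustration under stated assumptions, not the authors' implementation: the function name, the toy matrix `A_toy`, the tolerance, and the rescaling of i to unit length after each round are all choices made here. The rescaling is added because the 1/|I| and 1/|P| factors only change the magnitude of the scores, so convergence is checked on the direction of i.

```python
import numpy as np

def simplified_espresso(A, seed, tol=1e-10, max_iter=1000):
    """Iterate pattern and instance ranking until the instance scores settle.

    A    : |P| x |I| pattern-instance matrix
    seed : initial instance score vector (nonzero)
    Assumption: i is rescaled to unit length every round so that
    convergence of the ranking direction can be tested.
    """
    A = np.asarray(A, dtype=float)
    n_patterns, n_instances = A.shape
    i = np.asarray(seed, dtype=float)
    i = i / np.linalg.norm(i)
    p = np.zeros(n_patterns)
    for _ in range(max_iter):
        p = A @ i / n_instances            # pattern ranking
        new_i = A.T @ p / n_patterns       # instance ranking
        new_i = new_i / np.linalg.norm(new_i)
        converged = np.linalg.norm(new_i - i) < tol
        i = new_i
        if converged:
            break
    return i, p

# Hypothetical 3-pattern x 4-instance matrix.
A_toy = np.array([[1.0, 0.8, 0.1, 0.0],
                  [0.2, 0.9, 0.7, 0.1],
                  [0.0, 0.3, 0.6, 1.0]])

scores_from_seed_2, _ = simplified_espresso(A_toy, [0, 1, 0, 0])
scores_from_seed_4, _ = simplified_espresso(A_toy, [0, 0, 0, 1])
```

On this toy matrix the two runs start from different seeds, which makes it easy to compare the resulting instance rankings side by side.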


Simplified Espresso is HITS

Simplified Espresso = HITS in a bipartite graph whose adjacency matrix is A

Problem

No matter which seed you start with, the same instance is always ranked topmost

Semantic drift (also called topic drift in HITS)

The ranking vector i tends to the principal eigenvector of $A^{T}A$ as the iteration proceeds, regardless of the seed instances!
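The eigenvector claim can be made explicit with a short power-iteration argument. The notation below is introduced only for this illustration and does not appear on the slides: superscripts $(t)$ index iterations, and $\lambda_1 > \lambda_2 \ge \dots$ and $\mathbf{v}_k$ denote the eigenvalues and orthonormal eigenvectors of $A^{T}A$, with $c_k$ the coordinates of the seed vector in that basis.

```latex
\begin{align*}
\mathbf{i}^{(t+1)}
  &= \frac{1}{|P|} A^{T} \mathbf{p}^{(t+1)}
   = \frac{1}{|I|\,|P|} A^{T} A \, \mathbf{i}^{(t)}
  \quad\Longrightarrow\quad
  \mathbf{i}^{(t)} \propto (A^{T} A)^{t} \, \mathbf{i}^{(0)} \\
\mathbf{i}^{(0)}
  &= \sum_{k} c_k \mathbf{v}_k
  \quad\Longrightarrow\quad
  (A^{T} A)^{t} \, \mathbf{i}^{(0)}
   = \lambda_1^{t} \Bigl( c_1 \mathbf{v}_1
     + \sum_{k \ge 2} c_k \bigl( \tfrac{\lambda_k}{\lambda_1} \bigr)^{t} \mathbf{v}_k \Bigr)
\end{align*}
```

Assuming $\lambda_1$ is strictly larger than the other eigenvalues and $c_1 \neq 0$, every term with $k \ge 2$ vanishes as $t$ grows, so the direction of $\mathbf{i}^{(t)}$ approaches $\mathbf{v}_1$ whatever the seed was. The seed only affects the constant $c_1$, which does not change the ranking.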

How about the original Espresso?

Original Espresso has heuristics not present in Simplified Espresso
  Early stopping
  Pattern and instance selection

Do these heuristics really help reduce semantic drift?


Experiments on semantic drift

Do the heuristics in original Espresso help reduce drift?


Word sense disambiguation task of Senseval-3 English Lexical Sample

Predict the sense of “bank”

… the financial benefits of the bank (finance) 's employee package ( cheap mortgages and pensions, etc ) , bring this up to …

In that same year I was posted to South Shields on the south bank (bank of the river) of the River Tyne and quickly became aware that I had an enormous burden

Possibly aligned to water a sort of bank(???) by a rushing river.

Training instances are annotated with their sense

Predict the sense of the target word in the test set


Word sense disambiguation by Original Espresso

Seed instance = the instance whose sense is to be predicted

System output = k-nearest neighbor (k=3)

Heuristics of Espresso
  Pattern and instance selection
    # of patterns to retain: p=20 (increase p by 1 on each iteration)
    # of instances to retain: m=100 (increase m by 100 on each iteration)
  Early stopping
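One plausible reading of the k-nearest-neighbor output step is sketched below: rank the sense-annotated training instances by their final instance scores and take a majority vote among the top k. The function name `predict_sense`, the toy scores, and the voting rule are assumptions for illustration; the slides only state that k=3 nearest neighbors are used.

```python
from collections import Counter

import numpy as np

def predict_sense(scores, senses, k=3):
    """Majority vote over the senses of the k highest-scoring training instances.

    scores : instance scores for the training instances (assumed to come
             from a run seeded with the test instance)
    senses : sense label of each training instance, in the same order
    """
    scores = np.asarray(scores, dtype=float)
    nearest = np.argsort(-scores)[:k]              # k best-scoring instances
    votes = Counter(senses[j] for j in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical scores and labels for five training instances of "bank".
toy_scores = [0.9, 0.1, 0.8, 0.7, 0.2]
toy_senses = ["finance", "river", "river", "river", "finance"]
predicted = predict_sense(toy_scores, toy_senses, k=3)
```

Here the three best-scoring instances carry the labels finance, river, and river, so the vote returns "river".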


Convergence process of Espresso

[Figure: curves for Original Espresso, Simplified Espresso, and the most frequent sense (baseline)]

Heuristics in Espresso help reduce semantic drift (however, early stopping is required for optimal performance)

Semantic drift occurs: the output is always the most frequent sense, regardless of input

Learning curve of Original Espresso: per-sense breakdown

[Figure: curves for the most frequent sense and for other senses]

The number of most-frequent-sense predictions increases

Recall for infrequent senses worsens even with original Espresso


Summary: Espresso and semantic drift

Semantic drift happens because Espresso is designed like HITS
  HITS gives the same ranking list regardless of seeds

Some heuristics reduce semantic drift
  Early stopping is crucial for optimal performance

Still, these heuristics require many parameters to be calibrated, but calibration is difficult



Q. What caused drift in Espresso?
A. Espresso's resemblance to HITS

HITS is an importance computation method (it gives a single ranking list for any seeds)

Why not use another type of link analysis measure, one that takes seeds into account?

A "relatedness" measure (it gives different rankings for different seeds)


The regularized Laplacian kernel
  A relatedness measure
  Takes higher-order relations into account
  Has only one parameter

Graph Laplacian: $L = D - A$

Regularized Laplacian matrix: $R_\beta = \sum_{n=0}^{\infty} \beta^{n} (-L)^{n} = (I + \beta L)^{-1}$

A: adjacency matrix of the graph
D: (diagonal) degree matrix
β: parameter

Each column of $R_\beta$ gives the rankings relative to a node
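The closed form can be checked numerically on a small graph. The sketch below is an illustration only: the function name, the 4-node path graph `W_toy`, and the 50-term truncation of the series are choices made for this example, and how the graph is built from the pattern-instance matrix is not shown on this slide. The series converges only when β times the largest eigenvalue of L is below 1, which holds for this toy graph at β = 10^-2.

```python
import numpy as np

def regularized_laplacian(adjacency, beta):
    """Return R_beta = (I + beta * L)^{-1} with L = D - A.

    adjacency : symmetric adjacency matrix of the graph
    beta      : the single parameter of the kernel
    """
    W = np.asarray(adjacency, dtype=float)
    D = np.diag(W.sum(axis=1))        # diagonal degree matrix
    L = D - W                         # graph Laplacian
    return np.linalg.inv(np.eye(len(W)) + beta * L)

# Hypothetical graph: a path 0 - 1 - 2 - 3.
W_toy = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
beta = 1e-2
R = regularized_laplacian(W_toy, beta)

# Truncated series sum_{n=0}^{49} beta^n (-L)^n for comparison.
L_toy = np.diag(W_toy.sum(axis=1)) - W_toy
series = sum(beta ** n * np.linalg.matrix_power(-L_toy, n) for n in range(50))

ranking_from_node_0 = np.argsort(-R[:, 0])
ranking_from_node_3 = np.argsort(-R[:, 3])
```

Unlike the single list produced by an importance measure, the columns of `R` order the nodes differently depending on the node they are read from: on this path graph, each column ranks its own node first and the others by distance.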

Evaluation of regularized Laplacian

Comparison to (Agirre et al. 2006)


Algorithm                          F measure
Most frequent sense (baseline)     54.5
HyperLex                           64.6
PageRank                           64.6
Simplified Espresso                44.1
Espresso (after convergence)       46.9
Espresso (optimal stopping)        66.5
Regularized Laplacian (β=10^-2)    67.1

WSD on all nouns in Senseval-3

Outperforms other graph-based methods

Espresso needs optimal stopping to achieve equivalent performance


Conclusions

Semantic drift in Espresso is a parallel form of topic drift in HITS

The regularized Laplacian reduces semantic drift: it is inherently a relatedness measure (≠ importance measure)


Future work Investigate if a similar analysis is applicable to a wider class of bootstrapping algorithms (including co-training)

Try other popular tasks of bootstrapping such as named entity extraction
  Selection of seed instances matters


Algorithm                          Most frequent sense    Other senses
Simplified Espresso                100.0                  0.0
Espresso (after convergence)       100.0                  30.2
Espresso (optimal stopping)        94.4                   67.4
Regularized Laplacian (β=10^-2)    92.1                   62.8


Label prediction of “bank” (F measure)

The regularized Laplacian keeps high recall for infrequent senses

Espresso suffers from semantic drift (unless stopped at the optimal stage)

The regularized Laplacian is mostly stable across values of its parameter


Pattern/instance ranking in Espresso

Score for pattern p:

$r_\pi(p) = \dfrac{\sum_{i \in I} \frac{\mathrm{pmi}(i,p)}{\max \mathrm{pmi}} \, r_\iota(i)}{|I|}$

Score for instance i:

$r_\iota(i) = \dfrac{\sum_{p \in P} \frac{\mathrm{pmi}(i,p)}{\max \mathrm{pmi}} \, r_\pi(p)}{|P|}$

$\mathrm{pmi}(i,p) = \log \dfrac{|x,p,y|}{|x,*,y| \, |*,p,*|}$

p: pattern
i: instance
P: set of patterns
I: set of instances
pmi: pointwise mutual information
max pmi: max of pmi in all the patterns and instances
