Transcript
Page 1: Term Informativeness for Named Entity Detection

Term Informativeness for Named Entity Detection

Jason D. M. Rennie, MIT

Tommi Jaakkola, MIT

Page 2: Term Informativeness for Named Entity Detection

Information Extraction

President Bush signed the Central America Free Trade Agreement into law Tuesday…

Who What When

Page 3: Term Informativeness for Named Entity Detection

Named Entity Detection

President Bush signed the Central America Free Trade Agreement into law Tuesday, hailing the seven-nation pact as an open-door policy that will benefit U.S. exporters and seed prosperity and democracy in Central America and the Dominican Republic.

Page 4: Term Informativeness for Named Entity Detection

Informal Communication

• Other Sources of Information
  – E-mail
  – Web Bulletin Boards
  – Mailing Lists

• More specialized, up-to-date information

• But, harder to extract

Page 5: Term Informativeness for Named Entity Detection

IE for Informal Comm.

SUBJECT: Two New Ipswich Seafood Joints to Open Soon.

ALL HOUNDS ON DECK! #1 Across from the new HS, at the old White Cap Seafood is a renovated new joint and the sign says "Salt Box". I suspect they are opening soon; they look ready. Lets hope its great as there is too much 'just average' around here. #2: In the…

Page 6: Term Informativeness for Named Entity Detection

NED for Informal Comm.

Subject: finale harvard square

has anyone been to the recently opened finale in harvard square?

Page 7: Term Informativeness for Named Entity Detection

Restaurant Bulletin Board

• Gathered from a Restaurant BBoard
  – 6 sets of ~100 posts
  – 132 threads
  – Applied Ratnaparkhi’s POS tagger
  – Hand-labeled each token In/Out of restaurant name

Page 8: Term Informativeness for Named Entity Detection

Detecting Named Entities

[Diagram labels: Named Entity, Informative, Bursty]

Page 9: Term Informativeness for Named Entity Detection

Quantifying Informativeness

[Figure: example terms “the”, “clandestine”, and “Brazil” shown against Document 1, Document 2, Document 3]

Page 10: Term Informativeness for Named Entity Detection

A Little History…

Z-measure [Brookes,1968]

Inverse Doc. Freq. [Jones,1973]

xI [Bookstein & Swanson, 1974]

Residual IDF [Church & Gale, 1995]

Gain [Papineni, 2001]

Page 11: Term Informativeness for Named Entity Detection

Main Idea

• Informative words are:
  – Rare (IDF)
  – Modal (Mixture Score)

• Rarity and Modality are independent qualities

• We quantify informativeness using a product of IDF and Mixture Score
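The product score can be sketched in a few lines of Python. This is only an illustration: it assumes the common definition IDF = -log(df / N) with a natural logarithm, which the slides do not spell out, and the function names `idf` and `informativeness` are hypothetical. The mixture score is taken as an input here.

```python
import math

def idf(doc_freq, num_docs):
    """Inverse document frequency (assumed form): -log of the fraction
    of documents that contain the term. Rarer terms score higher."""
    return -math.log(doc_freq / num_docs)

def informativeness(doc_freq, num_docs, mixture_score):
    """Product of rarity (IDF) and modality (mixture score)."""
    return idf(doc_freq, num_docs) * mixture_score

# A term found in 10 of 100 documents with a (made-up) mixture score of 2.0.
example = informativeness(10, 100, 2.0)
```

Because the two factors are multiplied, a term needs to be both rare and modal to rank highly; a high value on only one of them is damped by the other.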

Page 12: Term Informativeness for Named Entity Detection

Binomial Distribution

Page 13: Term Informativeness for Named Entity Detection

Term Frequency Distributions

[Figure: per-document term counts for “the” and “Brazil”; extracted values: 7, 0, 4, 0, 8, 0, 5, 5, 6, 0]

Page 14: Term Informativeness for Named Entity Detection

Mixture Models

[Figure: mixture model illustration; extracted labels: 0.1%, 5%, 10%, 90%, 0, 5]

Page 15: Term Informativeness for Named Entity Detection

Modality

• Modal words fit a mixture much better than a single binomial

• We separately fit the binomial and mixture models to each term frequency distribution

• We quantify modality by comparing the fitness of the two models
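The single-binomial side of this comparison can be sketched as follows. The sketch assumes each document is modeled as n tokens with x occurrences of the term and one shared rate; the function names and the toy counts are hypothetical, not taken from the talk's data.

```python
import math

def fit_binomial(counts, lengths):
    """Maximum-likelihood rate: total occurrences over total tokens."""
    return sum(counts) / sum(lengths)

def binomial_loglik(counts, lengths, theta):
    """Log-likelihood of per-document counts under one binomial rate."""
    total = 0.0
    for x, n in zip(counts, lengths):
        total += (math.log(math.comb(n, x))
                  + x * math.log(theta)
                  + (n - x) * math.log(1 - theta))
    return total

lengths = [100] * 5
spread = [2, 2, 2, 2, 2]    # term spread evenly over documents
bursty = [0, 0, 0, 10, 0]   # same total, concentrated in one document

ll_spread = binomial_loglik(spread, lengths, fit_binomial(spread, lengths))
ll_bursty = binomial_loglik(bursty, lengths, fit_binomial(bursty, lengths))
```

Both toy terms get the same fitted rate, but the bursty one is much less likely under a single binomial. That gap is what leaves room for a mixture to fit better.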

Page 16: Term Informativeness for Named Entity Detection

Learning Mixture Parameters

Use gradient descent to learn the mixing weight and the two binomial parameters (λ, θ1, θ2)
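A minimal sketch of fitting a two-binomial mixture by gradient ascent on the log-likelihood (equivalently, gradient descent on its negative). Several choices here are assumptions rather than the authors' procedure: the symbols lam, t1, t2 for the mixing weight and the two rates, the logit parameterization, the asymmetric starting point, and the step-halving rule. The term is assumed to occur at least once and to be reasonably rare.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def component_logs(x, n, lam, t1, t2):
    """Log of each weighted component likelihood for one document
    (binomial coefficients omitted; they do not depend on the parameters)."""
    la = math.log(lam) + x * math.log(t1) + (n - x) * math.log(1 - t1)
    lb = math.log(1 - lam) + x * math.log(t2) + (n - x) * math.log(1 - t2)
    return la, lb

def mixture_loglik(counts, lengths, lam, t1, t2):
    total = 0.0
    for x, n in zip(counts, lengths):
        la, lb = component_logs(x, n, lam, t1, t2)
        m = max(la, lb)
        total += m + math.log(math.exp(la - m) + math.exp(lb - m))
    return total

def fit_mixture(counts, lengths, steps=2000, lr=0.1):
    rate = sum(counts) / sum(lengths)
    # Logits of (lam, t1, t2); asymmetric start so the components can separate.
    z = [0.0,
         math.log(2 * rate / (1 - 2 * rate)),
         math.log(0.5 * rate / (1 - 0.5 * rate))]
    best = mixture_loglik(counts, lengths, *map(sigmoid, z))
    for _ in range(steps):
        lam, t1, t2 = map(sigmoid, z)
        g = [0.0, 0.0, 0.0]
        for x, n in zip(counts, lengths):
            la, lb = component_logs(x, n, lam, t1, t2)
            m = max(la, lb)
            ea, eb = math.exp(la - m), math.exp(lb - m)
            r1 = ea / (ea + eb)            # responsibility of component 1
            g[0] += r1 - lam               # d loglik / d logit(lam)
            g[1] += r1 * (x - n * t1)      # d loglik / d logit(t1)
            g[2] += (1 - r1) * (x - n * t2)
        trial = [zi + lr * gi for zi, gi in zip(z, g)]
        ll = mixture_loglik(counts, lengths, *map(sigmoid, trial))
        if ll >= best:
            z, best = trial, ll            # accept only non-decreasing steps
        else:
            lr *= 0.5                      # otherwise shrink the step size
    lam, t1, t2 = map(sigmoid, z)
    return lam, t1, t2, best

counts = [0, 0, 0, 5, 0, 0, 6, 0, 0, 0]   # toy bursty term
lengths = [100] * 10
lam, t1, t2, ll_mix = fit_mixture(counts, lengths)
```

Working in logit space keeps all three parameters inside (0, 1) without explicit constraints, and rejecting steps that lower the likelihood keeps the toy optimizer from diverging.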

Page 17: Term Informativeness for Named Entity Detection

Comparing Fitness

• Use log-odds to compare fitness of the two models
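One plausible way to write this comparison is sketched below. The symbols are assumptions, since the slide's formula did not survive: x_i is the term's count in document i, n_i the document length, θ the single-binomial rate, and λ, θ_1, θ_2 the mixture parameters. The binomial coefficients appear in both models and cancel in the ratio.

```latex
\[
\text{score}(w) \;=\; \log
\frac{\displaystyle\max_{\lambda,\theta_1,\theta_2}\ \prod_i
      \Big[\lambda\,\theta_1^{x_i}(1-\theta_1)^{n_i-x_i}
      + (1-\lambda)\,\theta_2^{x_i}(1-\theta_2)^{n_i-x_i}\Big]}
     {\displaystyle\max_{\theta}\ \prod_i \theta^{x_i}(1-\theta)^{n_i-x_i}}
\]
```

Under this form the score is never negative, because setting θ_1 = θ_2 recovers the single binomial; larger values mean the mixture explains the term's counts much better.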

Page 18: Term Informativeness for Named Entity Detection

Top Mixture Score Words

Token Score Rest. Occur. (occurrences in restaurant names / total occurrences)

sichaun 99.62 31/52

fish 50.59 7/73

was 48.79 0/483

speed 44.69 16/19

tacos 43.77 4/19

Page 19: Term Informativeness for Named Entity Detection

Independence

Rareness (IDF)

Modality (Mixture Score)

?

Page 20: Term Informativeness for Named Entity Detection

Correlation Coefficient

Score Pair Corr. Coefficient

IDF/Mixture -.0139

IDF/RIDF .4113

Mixture/RIDF .7380
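For reference, a correlation coefficient between two score lists can be computed as below. This sketch assumes the Pearson coefficient, which the slide does not name, and the function name `pearson` is hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Each list would hold one score per vocabulary term, in the same term order. A value near zero, as reported for IDF/Mixture, indicates no linear relationship between the two scores.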

Page 21: Term Informativeness for Named Entity Detection

Top Words Overlap Plot

• Two sorted lists
  – Sorted by IDF
  – Sorted by Mixture Score

• Look at % overlap among top N in both lists

• Plot % overlap as we vary N

• Independent scores would produce line along diagonal
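The overlap computation described above can be sketched as follows, assuming each score is stored as a dict from term to value. The name `top_n_overlap` and the toy scores are hypothetical.

```python
def top_n_overlap(scores_a, scores_b, n):
    """Percent of terms shared by the top-n lists of two scorings."""
    top_a = set(sorted(scores_a, key=scores_a.get, reverse=True)[:n])
    top_b = set(sorted(scores_b, key=scores_b.get, reverse=True)[:n])
    return 100.0 * len(top_a & top_b) / n

# Toy scores over a four-word vocabulary.
score_a = {"w1": 3.0, "w2": 2.0, "w3": 1.0, "w4": 0.0}
score_b = {"w1": 0.0, "w2": 3.0, "w3": 2.0, "w4": 1.0}
curve = [top_n_overlap(score_a, score_b, n) for n in range(1, 5)]
```

For two unrelated rankings of V terms, the expected overlap at N is N/V of the list, so the curve grows linearly in N and reaches 100% only at N = V. That is the diagonal the slide refers to.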

Page 22: Term Informativeness for Named Entity Detection

Overlap Plot

[Plot: percent overlap (y-axis) vs. number of top words (x-axis); curves: IDF/Mixture, IDF/RIDF]

Page 23: Term Informativeness for Named Entity Detection

Top IDF*Mixture Words

Token Score Rest. Occur. (occurrences in restaurant names / total occurrences)

sichaun 379.97 31/52

villa 197.08 10/11

tokyo 191.72 7/11

ribs 181.57 0/13

speed 156.23 16/19

Page 24: Term Informativeness for Named Entity Detection

Intro to NED Experiments

• Task: Identify Restaurant Names

• Use standard NED features (capitalization, punctuation, POS) as “Baseline”

• Add informativeness score as an additional feature

• Use F1 Breakeven as performance metric
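One common way to compute a breakeven point is sketched here: rank tokens by classifier score and mark as positive exactly as many tokens as there are true positives, so that precision and recall coincide (and F1 equals both). Whether the authors computed it this way is an assumption, and `f1_breakeven` is a hypothetical name.

```python
def f1_breakeven(scores, labels):
    """Precision (= recall = F1) when the top-k scored tokens are predicted
    positive, with k set to the number of truly positive tokens.
    Assumes at least one positive label."""
    k = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:k])
    return hits / k

# Toy example: 1 marks a token inside a restaurant name.
value = f1_breakeven([0.9, 0.8, 0.7, 0.6, 0.5], [1, 0, 1, 1, 0])
```

A single number of this kind avoids having to pick a decision threshold when comparing feature sets.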

Page 25: Term Informativeness for Named Entity Detection

NED Experiments

Feature Set F1 Breakeven

Baseline 55.0%

IDF 56.0%

Mixture 56.0%

IDF,Mixture 56.9%

Residual IDF 57.4%

IDF*RIDF 58.5%

IDF*Mixture 59.3%

(Rows are ordered from lowest to highest F1; lower rows are better.)

Page 26: Term Informativeness for Named Entity Detection

Summary

• Traditional syntax-based features are not enough for IE in e-mail & bulletin boards

• We used term occurrence statistics to construct an informativeness score (IDF*Mixture)

• We found IDF*Mixture to be useful for identifying topic-centric words and named entities

Page 27: Term Informativeness for Named Entity Detection

Discussion

• Phrases

• Foreign languages, Speech

• Co-reference resolution, context tracking

• Collaborative filtering

