
Page 1: Deriving a Web-Scale Commonsense Fact Database

Deriving a Web-Scale Commonsense Fact Database

Niket Tandon, Gerard de Melo, Gerhard Weikum (Max Planck Institute for Informatics)

Aug 11, 2011

Page 2: Deriving a Web-Scale Commonsense Fact Database

SOME TRIVIAL FACTS

Apples are green, red, juicy, or sweet, but not fast or funny.

Parks and meadows are green or lively, but not black or slow.

Keys are kept in a pocket, but not in the air.

Question: How do computers know this? Solution: Build a commonsense knowledge base.

Page 3: Deriving a Web-Scale Commonsense Fact Database

INTRODUCTION

What is the problem? Harvest commonsense facts from text:

– Flower is soft => hasProperty(flower, soft)
– Room is part of house => partOf(room, house)

Why is it hard? Commonsense facts are rarely mentioned in text, and natural language text is noisy.

What is required to tackle the problem? A Web-scale corpus. But a Web-scale corpus is hard to get! Use Web-scale N-grams => this poses interesting research challenges.

Page 4: Deriving a Web-Scale Commonsense Fact Database

MESSAGE OF THE TALK

N-grams simulate a larger corpus.

Existing information extraction models must be carefully adapted for harvesting commonsense facts.

Page 5: Deriving a Web-Scale Commonsense Fact Database

AGENDA

1. Introduction
2. Pattern-based information extraction model
3. Web N-grams
4. Pattern ranking
5. Extracting and ranking facts

Page 6: Deriving a Web-Scale Commonsense Fact Database

AGENDA

1. Introduction
2. Pattern-based information extraction model
3. Web N-grams
4. Pattern ranking
5. Extracting and ranking facts

Example seeds and pattern: (fire, hot), (ice, cold), (flower, beautiful) with the pattern "X is very Y"

Page 7: Deriving a Web-Scale Commonsense Fact Database

GOOD SEEDS => (GOOD) PATTERNS

Text: • He bought very sweet apples
Seed facts: • hasProperty: <apple, sweet>
Candidate patterns:

Page 8: Deriving a Web-Scale Commonsense Fact Database

GOOD SEEDS => (GOOD) PATTERNS

Text: • He bought very sweet apples
Seed facts: • hasProperty: <apple, sweet>
Candidate patterns: [hasProperty] • "He bought very Y X"

Page 9: Deriving a Web-Scale Commonsense Fact Database

GOOD SEEDS => (GOOD) PATTERNS

Text: • He bought very sweet apples • Apples and sweet potato are delicious
Seed facts: • hasProperty: <apple, sweet>
Candidate patterns: [hasProperty] • "He bought very Y X" • "X and Y potato are delicious"

Page 10: Deriving a Web-Scale Commonsense Fact Database

GOOD SEEDS => (GOOD) PATTERNS

Text:
• He bought very sweet apples
• Apples and sweet potato are delicious
• He kept the keys in pocket

Seed facts:
• hasProperty: <apple, sweet>
• hasLocation: <key, pocket>

Candidate patterns:
• [hasProperty]: "He bought very Y X", "X and Y potato are delicious"
• [hasLocation]: "He kept the X in Y"

Page 11: Deriving a Web-Scale Commonsense Fact Database

GOOD PATTERNS => (GOOD) TUPLES

Text: He kept the butter in refrigerator
Pattern: [hasLocation] "He kept the X in Y"
Extracted tuple: [hasLocation] <butter, refrigerator>

Page 12: Deriving a Web-Scale Commonsense Fact Database

MODEL

Seeds => Pattern Induction => Pattern Ranking => Fact Extraction and Ranking

Induced patterns, e.g.:
[hasProperty]: "He bought very Y X", "X and Y potato are delicious"
[hasLocation]: "He kept the X in Y"

Page 13: Deriving a Web-Scale Commonsense Fact Database

STATE OF THE ART - PATTERN BASED IE

Dipre - Brin ‘98 Snowball - Agichtein et al. ‘00 KnowItAll - Etzioni et al. ’04

Observations: Low Recall on easily available corpora (large corpus is

difficult to get) Low Precision when applied to our corpus

13

Page 14: Deriving a Web-Scale Commonsense Fact Database

AGENDA

1. Introduction
2. Pattern-based information extraction model
3. Web N-grams
4. Pattern ranking
5. Extracting and ranking facts

The corpus we use to extract facts is Web-scale N-grams.

Page 15: Deriving a Web-Scale Commonsense Fact Database

WEB-SCALE N-GRAMS

N-gram: sequence of N consecutive word tokens e.g. the apples are very red

Web-scale N-gram statistics derived from trillion words e.g. the apples are very red 12000

Google N-grams, Microsoft N-grams, Yahoo N-grams N-gram dataset limitations

Length <= 5 cannot! => the apple that he was eating was very red

But... Most commonsense relations fit this small context Sheer Volume of data

15
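A minimal sketch of fact extraction over such an N-gram corpus, assuming a tab-separated file of "n-gram<TAB>frequency" lines; the file name and the two example patterns are illustrative, not the talk's actual pattern set.

    import re
    from collections import defaultdict

    # Relation -> pattern mapping; these example patterns are hypothetical.
    PATTERNS = [
        ("hasProperty", re.compile(r"^(\w+) (?:is|are) very (\w+)$")),
        ("hasLocation", re.compile(r"^he kept the (\w+) in (\w+)$")),
    ]

    def extract(ngram_file):
        """Scan '<n-gram><TAB><count>' lines and aggregate frequencies of
        (relation, x, y) tuples whose n-gram matches a pattern."""
        counts = defaultdict(int)
        with open(ngram_file, encoding="utf-8") as f:
            for line in f:
                ngram, _, freq = line.rstrip("\n").rpartition("\t")
                for relation, pattern in PATTERNS:
                    match = pattern.match(ngram.lower())
                    if match:
                        counts[(relation, match.group(1), match.group(2))] += int(freq)
        return counts

    # extract("5grams.tsv") might yield {("hasProperty", "fire", "hot"): 12000, ...}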

Page 16: Deriving a Web-Scale Commonsense Fact Database

EXAMPLE OF COMMONSENSE RELATIONS


Page 17: Deriving a Web-Scale Commonsense Fact Database

OUR APPROACH

Use ConceptNet data as seeds to harvest commonsense facts from the Google N-grams corpus.
ConceptNet: MIT's common sense knowledge base, constructed by crowd-sourcing and further processed.

We take a very large number of seeds, which avoids drift across iterations.

We consider variations of seeds for nouns (plural forms): [key, pocket], [keys, pocket], [keys, pockets]

This gives a very large number of potential patterns, but most are noise. We constrain patterns by part-of-speech tags: X<noun> is very Y<adjective>. We then need to carefully rank the potential patterns (see the sketch below).
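A minimal sketch of the seed-variation and POS-constraint steps described above, assuming NLTK's tagger is available; the naive "+s" pluralization and the function names are illustrative.

    import nltk  # assumes the 'averaged_perceptron_tagger' model is installed

    def seed_variants(x, y):
        """Generate simple plural variants of a seed pair, e.g.
        (key, pocket) -> (keys, pocket), (key, pockets), (keys, pockets)."""
        return [(a, b) for a in (x, x + "s") for b in (y, y + "s")]

    def induce_pattern(ngram, x, y):
        """Turn an n-gram containing the seed pair into a candidate pattern,
        kept only if X is tagged as a noun and Y as an adjective."""
        tokens = ngram.split()
        if x not in tokens or y not in tokens:
            return None
        tags = dict(nltk.pos_tag(tokens))
        if tags[x].startswith("NN") and tags[y].startswith("JJ"):
            return " ".join("X" if t == x else "Y" if t == y else t
                            for t in tokens)
        return None

    # induce_pattern("the apples are very red", "apples", "red")
    # -> "the X are very Y"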

Page 18: Deriving a Web-Scale Commonsense Fact Database

AGENDA

1. Introduction
2. Pattern-based information extraction model
3. Web N-grams
4. Pattern ranking
5. Extracting and ranking facts

One dirty fish spoils the whole pond!

Page 19: Deriving a Web-Scale Commonsense Fact Database

EXISTING PATTERN RANKING APPROACH: PMI

PMI score for a pattern p with matching seeds (x, y):
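The slide's formula is not reproduced in this transcript; a standard PMI-style pattern score of this kind (an assumption, not necessarily the exact definition used in the talk) is

    \mathrm{PMI}(p) = \sum_{(x,y) \in S} \log \frac{f(x, p, y)}{f(x, *, y)\, f(*, p, *)}

where S is the seed set, f(x, p, y) is the corpus frequency of pattern p instantiated with the pair (x, y), and the starred terms are marginal frequencies.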

Problems:
• Raw frequencies, not distinct seeds: bias towards rare events (strings containing the seed words by chance)
• Frequencies alone are not enough (spam, boilerplate text)

Page 20: Deriving a Web-Scale Commonsense Fact Database

PATTERN RANKING – OBSERVATION 1

Observation 1:
• The number of seeds a pattern matches follows a power law: s(x) ≈ a·x^k
• Unreliable patterns are likely to be in the tail

Question 1: Can we find the patterns that are not in the tail?

The score based on Observation 1 employs the gradient of the power-law curve (rather than a fixed threshold).
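A minimal sketch of such a gradient-based cutoff, assuming patterns are ranked by the number of distinct seeds they match; the slope bound is a hypothetical parameter, not a value from the talk.

    import numpy as np

    def head_patterns(seed_counts, min_slope=-1.5):
        """Keep the head of the power-law curve: rank patterns by #distinct
        seeds matched and cut where the log-log slope drops below min_slope."""
        ranked = sorted(seed_counts.items(), key=lambda kv: -kv[1])
        counts = np.array([c for _, c in ranked], dtype=float)
        ranks = np.arange(1, len(counts) + 1, dtype=float)
        slopes = np.gradient(np.log(counts), np.log(ranks))
        cut = next((i for i, s in enumerate(slopes) if s < min_slope),
                   len(ranked))
        return [p for p, _ in ranked[:cut]]

    # head_patterns({"X is very Y": 900, "He bought very Y X": 40, ...})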

Page 21: Deriving a Web-Scale Commonsense Fact Database

PATTERN RANKING – OBSERVATION 2

Observation 2:
• Some patterns match many seeds but match all sorts of things, e.g. "<X> and <Y>" matches seeds of many different relations that we have
• PMI does not consider the number of relations matched

Question 2: Can we penalize such patterns?

Score of a pattern:
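The score's exact definition is not reproduced here; one plausible form that rewards relation-specific patterns (an assumption, not necessarily the talk's formula) is

    s_2(p, r) = \frac{|S_r(p)|}{\sum_{r'} |S_{r'}(p)|}

where S_r(p) is the set of relation-r seed pairs matched by pattern p, so a pattern matching seeds of many relations scores low.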

Page 22: Deriving a Web-Scale Commonsense Fact Database

PATTERN RANKING – OUR APPROACH

Combined pattern score: the scores based on Observations 1 and 2 are combined using a logistic function.
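The combined score's exact form is not shown in the transcript; a generic logistic combination (an assumption about its precise shape) would be

    q(p) = \frac{1}{1 + e^{-(w_1 s_1(p) + w_2 s_2(p) + w_0)}}

where s_1 and s_2 are the scores from Observations 1 and 2 and the weights w_i are assumed free parameters.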

Page 23: Deriving a Web-Scale Commonsense Fact Database

IMPROVEMENT OVER PMI IN PATTERN RANKING (ISA RELATION)

Top-Ranked Patterns (PMI)    Top-Ranked Patterns (q)
Varsity <Y> <X>              Men <Y>/<X>
<Y> MLB <X>                  <Y> : <X> </S>
<Y> <X> Boys                 <Y> <X> </S>
<Y> Posters <X>              Basketball <Y> - <X> </S>
<Y> - <X> Basketball         <Y> such as <X>
<Y> MLB <X> NBA              <S> <X> <Y>
<Y> Badminton <X>            <X> and other <Y>

Example: "<X> and other <Y>" matches "San Francisco and other cities".

Page 24: Deriving a Web-Scale Commonsense Fact Database

AGENDA

1. Introduction
2. Pattern-based information extraction model
3. Web N-grams
4. Pattern ranking
5. Extracting and ranking facts

Page 25: Deriving a Web-Scale Commonsense Fact Database

ESTIMATE FACT CONFIDENCE : SIMPLE APPROACH

recipes yummy[16:130, 19:51, 21:55, 98:219, 10:80, 63:180, 29:51, 121:57]

title unique[3:111,2:63,114:91,1:213,0:788,41:246,55:95,22:112,18:75,9:48,60:64,14:71]

apples nutritious[12:144]

applet unable[11:62]

25

Pattern idFrequency

• Matches several patterns• Matches few patterns• Matches few patterns

Gives low recall but high precision

Good tuples match many patterns

Page 26: Deriving a Web-Scale Commonsense Fact Database

OUR FACT RANKING APPROACH

These pattern-count feature vectors are used to learn a decision tree, which gives facts with an estimated confidence (see the sketch below the table).

          P1         P2         P3         …   label
Tuple 1   f(T1,P1)   f(T1,P2)   f(T1,P3)   …   Positive
Tuple 2   f(T2,P1)   f(T2,P2)   f(T2,P3)   …   Negative
Tuple 3   f(T3,P1)   f(T3,P2)   f(T3,P3)   …   Positive
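A minimal sketch of this step with scikit-learn; the feature matrix below stands in for the f(T, P) pattern counts and is illustrative only.

    from sklearn.tree import DecisionTreeClassifier

    # Rows: tuples; columns: per-pattern match counts f(T, P) (made-up values).
    X = [[130, 51, 55],   # e.g. (recipes, yummy)
         [111, 63, 91],   # e.g. (title, unique)
         [144, 0, 0]]     # e.g. (apples, nutritious)
    y = ["Positive", "Negative", "Positive"]

    clf = DecisionTreeClassifier().fit(X, y)
    # predict_proba yields an estimated confidence for a new tuple's vector
    print(clf.predict_proba([[62, 0, 0]]))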

Page 27: Deriving a Web-Scale Commonsense Fact Database

RECAP: MODEL

1. Construct seeds from ConceptNet
2. Pattern induction over Google 5-grams
3. Pattern ranking (match many seeds, but not too many)
4. Fact extraction with clean patterns over Google 5-grams

Page 28: Deriving a Web-Scale Commonsense Fact Database

EXPERIMENTAL SETUP

Test data (true and false labels): randomly chosen high-confidence facts from ConceptNet

Precision and recall computed using 10-fold cross-validation over the test data

Classifier used: decision trees with adaptive boosting (see the sketch below)
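A minimal sketch of this evaluation setup, assuming scikit-learn >= 1.2; the data and hyperparameters are illustrative stand-ins, not the talk's.

    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Pattern-count features and binary labels (1 = true fact); made-up data.
    X = np.array([[130, 51, 55], [111, 63, 91], [144, 0, 0], [62, 0, 0]] * 5)
    y = np.array([1, 0, 1, 0] * 5)

    # Decision trees with adaptive boosting, scored by 10-fold cross-validation.
    clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=3),
                             n_estimators=50)
    print("precision:", cross_val_score(clf, X, y, cv=10, scoring="precision").mean())
    print("recall:", cross_val_score(clf, X, y, cv=10, scoring="recall").mean())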

Page 29: Deriving a Web-Scale Commonsense Fact Database

RESULTS – MORE THAN 200 MILLION FACTS EXTRACTED

Relation          Precision (%)   #Facts Extracted   Relative Recall (%)
CapableOf         77                     907,173     45
Causes            88                   3,218,388     49
Desires           58                   4,386,685     69
HasPrerequisite   82                   5,336,630     65
HasProperty       62                   2,976,028     48
IsA               62                  11,694,235     27
LocatedNear       71                  13,930,656     61
PartOf            71                  11,175,349     58
SimilarSize       74                   8,640,737     49
.. many others    …                            …     …

Extension of ConceptNet by orders of magnitude.

Page 30: Deriving a Web-Scale Commonsense Fact Database

FURTHER DIRECTIONS

Tune the system towards higher precision to release a high-quality knowledge base

Applications enabled by a commonsense knowledge base

Page 31: Deriving a Web-Scale Commonsense Fact Database

TAKE HOME MESSAGE

N-grams simulate a larger corpus: N-grams embed patterns and frequency

Novel pattern ranking adapted for the N-gram corpus: PMI is not the best choice in our case

The extracted fact matrix extends ConceptNet by more than 200x!

Page 32: Deriving a Web-Scale Commonsense Fact Database

Thank you!
[email protected]

hasProperty(flower, *)

Page 33: Deriving a Web-Scale Commonsense Fact Database
Page 34: Deriving a Web-Scale Commonsense Fact Database

ADDITIONAL SLIDES FOLLOW

Page 35: Deriving a Web-Scale Commonsense Fact Database

INACCURACIES IN CONCEPTNET• Properties are wrongly compounded

– HasProperty(apple, green yellow red)[usually] 1• Score of zero to correct tuples

– HasProperty(wallet, black)[] 0• Negated scores are infact commonsense facts

– HasProperty(jeans, blue)[not] 1• Confusing polarity for machine consumption

– HasProperty(jeans, blue)[not] 1– HasProperty(jeans, blue)[often] 1– HasProperty(jeans, blue)[usually] 1

• Wrongly labeled as hasProperty.– HasProperty(literature, book)[] 1

• Some are just facts but not commonsense– HasProperty(high point gibraltar, rock gibraltar 426 m)[] 1

Page 36: Deriving a Web-Scale Commonsense Fact Database

RELATED WORK

Information extraction approaches:
• Rule based. Pros: high precision. Cons: low recall.
• Pattern based (iterative). Pros: high recall. Cons: low precision (drift).
• Joint inference. Pros: high precision. Cons: scalability.
• Pattern based (non-iterative). Pros: high precision, high recall, no drift.

Web-scale IE corpora:
• Small corpus, small domain. Pros: manual rules manageable. Cons: low recall, low precision.
• Larger corpus. Pros: better precision (better statistics). Cons: low recall.
• Use a search engine. Pros: high precision (reliable statistics). Cons: run time, top-K result limits.
• N-grams (easy access). Pros: high precision, high recall, good runtime.

CSK acquisition:
• Human supplied. Pros: precise and rich. Cons: expensive, low recall.
• Hard-coded rules. Pros: very precise. Cons: very expensive, low recall.
• Use a search engine. Pros: simple, precise. Cons: run time, top-K result limits.
• (Re)use knowledge. Pros: pros of manual + search engine.

Page 37: Deriving a Web-Scale Commonsense Fact Database

SYNTHETIC TRAINING DATA GENERATION

• Build a seed-overlap matrix across relations (atLocation, causes, hasProperty, isA, …) using Jaccard similarity Sim(a, b)
• If Sim ~ 0, the relations are unrelated
• Combine seeds from unrelated relations to generate incorrect (negative) tuples, as in the sketch below
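A minimal sketch of this negative-sampling idea; the function names and the eps/n parameters are illustrative, not from the talk.

    import random

    def jaccard(a, b):
        """Jaccard similarity between two relations' seed sets."""
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 0.0

    def negatives(seeds_by_relation, target, n=100, eps=0.01):
        """Sample negative tuples for `target` by drawing seed pairs from
        relations whose seed overlap with `target` is (near) zero."""
        unrelated = [r for r, s in seeds_by_relation.items()
                     if r != target and jaccard(s, seeds_by_relation[target]) < eps]
        pool = [pair for r in unrelated for pair in seeds_by_relation[r]]
        return random.sample(pool, min(n, len(pool)))

    # negatives({"hasProperty": {("apple", "sweet")},
    #            "atLocation": {("key", "pocket")}}, "hasProperty")
    # -> [("key", "pocket")]  (an incorrect hasProperty tuple)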

Page 38: Deriving a Web-Scale Commonsense Fact Database

ALL RESULTS