
1

Autonomous Web-scale Information Extraction

Doug Downey
Advisor: Oren Etzioni
Department of Computer Science and Engineering
Turing Center
University of Washington

2

Web Information Extraction

…cities such as Chicago… => Chicago ∈ City

C such as x => x ∈ C [Hearst, 1992]

…Edison invented the light bulb… => (Edison, light bulb) ∈ Invented

x V y => (x, y) ∈ V

e.g., KnowItAll [Etzioni et al., 2005], TextRunner [Banko et al., 2007], others [Pasca et al., 2007]
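For concreteness, a minimal sketch (in Python, not the KnowItAll or TextRunner implementation) of applying a Hearst-style "C such as x" pattern; the regex and sentences are illustrative assumptions:

import re

# Hearst-style pattern: "<concept> such as <Capitalized Instance>"
PATTERN = re.compile(r"\b(\w+) such as ((?:[A-Z]\w+)(?: [A-Z]\w+)*)")

sentences = [
    "We toured cities such as Chicago last summer.",
    "He studied inventors such as Edison in school.",
]

for sentence in sentences:
    for match in PATTERN.finditer(sentence):
        concept, instance = match.group(1), match.group(2)
        # e.g. ("cities", "Chicago") -> candidate fact: Chicago ∈ City
        print(f"{instance} ∈ {concept}")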

3

Identifying correct extractions

…mayors of major cities such as Giuliani… => Giuliani ∈ City

Supervised IE: hand-label examples of each concept

Not possible on the Web (far too many concepts)

=> Unsupervised IE (UIE)

How can we automatically identify correct extractions for any concept without hand-labeled data?

4

KnowItAll Hypothesis (KH)

Extractions that occur more frequently in distinct sentences in the corpus are more likely to be correct.

Repetitions of the same error are relatively rare

…mayors of major cities such as Giuliani…
…hotels in popular cities such as Marriot…

Misinformation is the exception rather than the rule

“Elvis killed JFK” – 200 hits
“Oswald killed JFK” – 3000 hits

5

Redundancy

KH can identify many correct statements because the Web is highly redundant

– same facts repeated many times, in many ways – e.g., “Edison invented the light bulb” – 10,000 hits

(but leveraging the KH is a little tricky => probabilistic model)

Thesis: We can identify correct extractions without labeled data using a probabilistic model of redundancy.

6

1) Background
2) KH as a general problem structure
• Monotonic Feature Model
3) URNS model
• How does probability increase with repetition?
4) Challenge: The “long tail”
• Unsupervised language models

Outline

7

Classical Supervised Learning

?

Learn function from x = (x1, …, xd) to y ∈ {0, 1}, given labeled examples (x, y)

x1

x2

8

Semi-Supervised Learning (SSL)

Learn function from x = (x1, …, xd) to y ∈ {0, 1}, given labeled examples (x, y) and unlabeled examples (x)

x1

x2

9

Monotonic Features

x1

x2

Learn function from x = (x1, …, xd) to y ∈ {0, 1}, given monotonic feature x1

and unlabeled examples (x)

10

Monotonic Features

x1

x2

Learn function from x = (x1, …, xd) to y ∈ {0, 1}, given monotonic feature x1 and unlabeled examples (x)

P(y=1 | x1) increases with x1

11

Common Structure

Task | Monotonic Feature
UIE | “C such as x” [Etzioni et al., 2005]
Word Sense Disambiguation | “plant and animal species” [Yarowsky, 1995]
Information Retrieval | search query [Kwok & Grunfield, 1995; Thompson & Turtle, 1995]
Document Classification | topic word, e.g. “politics” [McCallum & Nigam, 1999; Gliozzo, 2005]
Named Entity Recognition | contains(“Mr.”) [Collins & Singer, 1998]

12

The MF model is provably distinct from standard smoothness assumptions in SSL:
Cluster assumption
Manifold assumption
=> MFs can complement other methods

Unlike co-training, the MF model doesn’t require labeled data or pre-defined “views”.

Isn’t this just ___ ?

13

One MF implies PAC-learnability without labeled data
…when the MF is conditionally independent of other features & is minimally informative
Corollary to the co-training theorem [Blum and Mitchell, 1998]

MFs provide more information (vs. labels) about unlabeled examples as the feature space grows:
As the number of features increases,
information gain due to MFs stays constant, vs.
information gain due to labeled examples falls (under assumptions)

Theoretical Results

14

MFA: Given MFs and unlabeled data:
Use the MFs to produce noisy labels
Train any classifier

Classification with the MF Model
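A hedged sketch of this idea, assuming scikit-learn's LogisticRegression stands in for "any classifier" and using synthetic data; the thresholds and feature layout are illustrative, not the thesis's MFA implementation:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic unlabeled data: column 0 plays the role of the monotonic feature
# (e.g. extraction frequency); the remaining columns are ordinary features.
X = rng.normal(size=(1000, 5))
X[:, 0] = rng.exponential(scale=2.0, size=1000)   # monotonic feature >= 0

# Step 1: use the MF alone to produce noisy labels.  High MF values are
# labeled positive, low values negative (quantile thresholds are illustrative).
hi, lo = np.quantile(X[:, 0], [0.8, 0.2])
noisy_pos = X[X[:, 0] >= hi]
noisy_neg = X[X[:, 0] <= lo]

X_noisy = np.vstack([noisy_pos, noisy_neg])
y_noisy = np.concatenate([np.ones(len(noisy_pos)), np.zeros(len(noisy_neg))])

# Step 2: train any standard classifier on the noisy labels.
clf = LogisticRegression(max_iter=1000).fit(X_noisy, y_noisy)
print(clf.predict_proba(X[:5])[:, 1])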

15

20 Newsgroups dataset (MF:newsgroup name)

Vs. Two SSL baselines (NB + EM, LP)

Without labeled data:

Experimental Results

16

MFA-SSL provides a 15% error reduction for 100-400 labeled examples.

MFA-BOTH provides a 31% error reduction for 0-800 labeled examples.

Experimental Results

17

Bad News: confusable MFs

For more complex tasks, monotonicity is insufficient

Example: City extractions

MF: extraction frequency with e.g., “cities such as x”

…also an MF for:
has skyscrapers
has an opera house
located on Earth, …

Extraction | MF value
New York | 1488
Chicago | 999
Los Angeles | 859
… | …
Twisp | 1
Northeast | 1

18

Performance of MFA in UIE

19

MFA for SSL in UIE

20

1) Background
2) KH as a general problem structure
• Monotonic Feature Model
3) URNS model
• How does probability increase with repetition?
4) Challenge: The “long tail”
• Unsupervised language models

Outline

21

If an extraction x appears k times in a set of n distinct occurrences of the pattern, what is the probability that x ∈ C?

Consider a single pattern suggesting C , e.g.,

countries such as x

Redundancy: Single Pattern

22

“…countries such as Saudi Arabia…”

“…countries such as the United States…”

“…countries such as Saudi Arabia…”

“…countries such as Japan…”

“…countries such as Africa…”

“…countries such as Japan…”

“…countries such as the United Kingdom…”

“…countries such as Iraq…”

“…countries such as Afghanistan…”

“…countries such as Australia…”

C = Country

n = 10 occurrences

Redundancy: Single Pattern

23

C = Country
n = 10

Extraction | k | Pnoisy-or
Saudi Arabia | 2 | 0.99
Japan | 2 | 0.99
United States | 1 | 0.9
Africa | 1 | 0.9
United Kingdom | 1 | 0.9
Iraq | 1 | 0.9
Afghanistan | 1 | 0.9
Australia | 1 | 0.9

p = probability the pattern yields a correct extraction; here p = 0.9

Noisy-or ignores:
– Sample size (n)
– Distribution of C

Naïve Model: Noisy-Or

Pnoisy-or(x ∈ C | x seen k times) = 1 – (1 – p)^k

[Agichtein & Gravano, 2000; Lin et al. 2003]
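A minimal sketch of the noisy-or estimate above (Python; p = 0.9 as in the example):

def p_noisy_or(k: int, p: float = 0.9) -> float:
    """P(x in C | x seen k times) under the noisy-or model: 1 - (1 - p)^k."""
    return 1.0 - (1.0 - p) ** k

# Saudi Arabia seen twice, Iraq seen once (from the n = 10 example above).
print(p_noisy_or(2))  # 0.99
print(p_noisy_or(1))  # 0.9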

24

C = Country
n ≈ 50,000

Extraction | k | Pnoisy-or
United States | 3899 | 0.9999…
China | 1999 | 0.9999…
. . .
OilWatch Africa | 1 | 0.9
Religion Paraguay | 1 | 0.9
Chicken Mole | 1 | 0.9
Republics of Kenya | 1 | 0.9
Atlantic Ocean | 1 | 0.9

C = Country
n = 10

Extraction | k | Pnoisy-or
Saudi Arabia | 2 | 0.99
Japan | 2 | 0.99
United States | 1 | 0.9
Africa | 1 | 0.9
United Kingdom | 1 | 0.9
Iraq | 1 | 0.9
Afghanistan | 1 | 0.9
Australia | 1 | 0.9

As sample size increases, noisy-or becomes inaccurate.

Needed in Model: Sample Size

25

C = Country
n ≈ 50,000

Extraction | k | Pnoisy-or
United States | 3899 | 0.9999…
China | 1999 | 0.9999…
. . .
OilWatch Africa | 1 | 0.9
Religion Paraguay | 1 | 0.9
Chicken Mole | 1 | 0.9
Republics of Kenya | 1 | 0.9
Atlantic Ocean | 1 | 0.9

Needed in Model: Distribution of C

Pfreq(x ∈ C | x seen k times) = 1 – (1 – p)k/n

26

C = Country
n ≈ 50,000

Extraction | k | Pfreq
United States | 3899 | 0.9999…
China | 1999 | 0.9999…
. . .
OilWatch Africa | 1 | 0.05
Religion Paraguay | 1 | 0.05
Chicken Mole | 1 | 0.05
Republics of Kenya | 1 | 0.05
Atlantic Ocean | 1 | 0.05

Needed in Model: Distribution of C

Pfreq(x ∈ C | x seen k times) = 1 – (1 – p)k/n

27

C = City
n ≈ 50,000

Extraction | k | Pfreq
New York | 1488 | 0.9999…
Chicago | 999 | 0.9999…
. . .
El Estor | 1 | 0.05
Nikki | 1 | 0.05
Ragaz | 1 | 0.05
Villegas | 1 | 0.05
Northeastwards | 1 | 0.05

C = Country
n ≈ 50,000

Extraction | k | Pfreq
United States | 3899 | 0.9999…
China | 1999 | 0.9999…
. . .
OilWatch Africa | 1 | 0.05
Religion Paraguay | 1 | 0.05
Chicken Mole | 1 | 0.05
Republics of Kenya | 1 | 0.05
Atlantic Ocean | 1 | 0.05

Probability x ∈ C depends on the distribution of C.

Needed in Model: Distribution of C

28

…cities such as Tokyo…

Urn for C = City, containing balls labeled:
Tokyo, U.K., Sydney, Cairo, Tokyo, Tokyo, Atlanta, Atlanta, Yakima, Utah, U.K.

My solution: URNS Model

29

C – set of unique target labels

E – set of unique error labels

num(C) – distribution of target labels

num(E) – distribution of error labels

Urn – Formal Definition

30

distribution of target labels: num(C) = {2, 2, 1, 1, 1}

distribution of error labels: num(E) = {2, 1}

Urn for C = City, containing balls labeled:
U.K., Sydney, Cairo, Tokyo, Tokyo, Atlanta, Atlanta, Yakima, Utah, U.K.

Urn Example

31

If an extraction x appears k times in a set of n distinct occurrences of the pattern, what is the probability that x ∈ C?

Computing Probabilities

32

Given that an extraction x appears k times in n draws from the urn (with replacement), what is the probability that x ∈ C?

where s is the total number of balls in the urn

Computing Probabilities
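The closed-form expression itself is not reproduced in this transcript, so here is only a hedged Monte Carlo sketch of the same question, simulating draws with replacement from an urn with the multiplicities of the example above (the function name and trial count are illustrative):

import random
from collections import Counter

def urn_probability(num_C, num_E, n, k, trials=20000, seed=0):
    """Monte Carlo estimate of P(label in C | label seen exactly k times in n
    draws with replacement).  num_C / num_E are multiplicity lists, e.g. the
    slide's example num(C) = [2, 2, 1, 1, 1], num(E) = [2, 1]."""
    rng = random.Random(seed)
    balls = ([("C", i) for i, m in enumerate(num_C) for _ in range(m)] +
             [("E", i) for i, m in enumerate(num_E) for _ in range(m)])
    hits_C = hits_total = 0
    for _ in range(trials):
        counts = Counter(rng.choice(balls) for _ in range(n))
        for (kind, _), c in counts.items():
            if c == k:
                hits_total += 1
                hits_C += (kind == "C")
    return hits_C / hits_total if hits_total else float("nan")

print(urn_probability([2, 2, 1, 1, 1], [2, 1], n=10, k=2))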

33

URNS without labeled data

Needed: num(C), num(E)

Assumed to be Zipf

Frequency of ith element ∝ i^(−z)

With assumptions, learn Zipfian parameters for any class C from unlabeled data alone
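A small sketch of the Zipf assumption, generating illustrative multiplicities with frequency of the ith label proportional to i^(−z) (the parameters below are assumptions, not learned values):

def zipf_multiplicities(num_labels: int, z: float, scale: float = 1000.0):
    """Frequency of the i-th most common label is proportional to i**(-z)."""
    return [max(1, round(scale * (i ** -z))) for i in range(1, num_labels + 1)]

# Illustrative target and error distributions for an URNS-style urn.
num_C = zipf_multiplicities(num_labels=100, z=1.0)
num_E = zipf_multiplicities(num_labels=500, z=1.0, scale=50.0)
print(num_C[:5], num_E[:5])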

34

Observed frequency distribution = a mixture of a C Zipf distribution (weight p) and an E Zipf distribution (weight 1 – p).

p is constant across C, for a given pattern; num(E) is constant across C.

=> Learn num(C) from unlabeled data!

URNS without labeled data

35

C = City
n ≈ 50,000

Extraction | k | PURNS
New York | 1488 | 0.9999…
Chicago | 999 | 0.9999…
. . .
El Estor | 1 | 0.63
Nikki | 1 | 0.63
Ragaz | 1 | 0.63
Villegas | 1 | 0.63
Cres | 1 | 0.63
Northeastwards | 1 | 0.63

C = Country
n ≈ 50,000

Extraction | k | PURNS
United States | 3899 | 0.9999…
China | 1999 | 0.9999…
. . .
OilWatch Africa | 1 | 0.03
Religion Paraguay | 1 | 0.03
Chicken Mole | 1 | 0.03
Republics of Kenya | 1 | 0.03
Atlantic Ocean | 1 | 0.03
New Zeland | 1 | 0.03

Probabilities Assigned by URNS

36

[Bar chart: deviation from ideal log likelihood (0 to 5) for the classes City, Film, Country, and MayorOf, comparing urns, noisy-or, and pmi.]

URNS’s probabilities are 15-22x closer to optimal.

Probability Accuracy

37

Sensitivity Analysis

URNS assumes num(E), p are constant

If we alter parameter choices substantially, URNS still outperforms noisy-or, PMI by at least 8x

Most sensitive to p

p ~ 0.85 is relatively consistent across randomly selected classes from Wordnet(solvents, devices, thinkers, relaxants, mushrooms, mechanisms, resorts, flies, tones, machines, …)

38

Multiple urns:
Target label frequencies are correlated across urns
Error label frequencies can be uncorrelated

Phrase | Hits
“Omaha and other cities” | 950
“Illinois and other cities” | 24,400
“cities such as Omaha” | 930
“cities such as Illinois” | 6

Multiple Extraction Patterns

39

Benefits from Multiple Urns

Precision at K:

K | Single | Multiple
10 | 1.0 | 1.0
20 | 0.9875 | 1.0
50 | 0.925 | 0.955
100 | 0.8375 | 0.845
200 | 0.7075 | 0.71

Using multiple URNS reduces error by 29%.
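For reference, precision at K as used in the table above, as a minimal sketch (the ranked correctness flags are made up for illustration):

def precision_at_k(ranked_correctness, k: int) -> float:
    """Fraction of the top-k ranked extractions that are correct."""
    top = ranked_correctness[:k]
    return sum(top) / len(top)

# 1 = correct extraction, 0 = error, in ranked order (illustrative data).
ranked = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
print(precision_at_k(ranked, 5))   # 0.8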

40

URNS vs. MFA

41

URNS + MFA in SSL

MFA-ssl (urns) reduces error by 6%, on average.

42

URNS: Learnable from unlabeled data

All URNS parameters can be learned from unlabeled data alone [Theorem 20]

URNS implies PAC learnability from unlabeled data alone [Theorem 21]

Even with confusable MFs (i.e. even without conditional independence)

(with assumptions)

43

Parameters Learnable (1)

We can express the URNS model as:

a Compound Poisson Process.

The mixture gC(·) + gE(·) can be learned, given enough samples [Loh, 1993].

Task: learn the power-law distributions gC(·), gE(·) from their sum.

44

Parameters Learnable (2)

Assume:

Sufficiently high frequency => only target elements

Sufficiently low frequency => only errors

Then:

gC() + gE() =

45

1) Background
2) KH as a general problem structure
• Monotonic Feature Model
3) URNS model
• How does probability increase with repetition?
4) Challenge: The “long tail”
• Unsupervised language models

Outline

46

[Plot: number of times each extraction appears in the pattern (y-axis, 0 to 500) vs. frequency rank of the extraction (x-axis, 0 to 100,000).]

A mixture of correct and incorrect

e.g., (Dave Shaver, Pickerington), (Ronald McDonald, McDonaldland)

Tend to be correct

e.g., (Bloomberg, New York City)

Challenge: the “long tail”

47

Mayor McCheese

48

Strategy:
1) Model how common extractions occur in text

2) Rank sparse extractions by fit to model

Assessing Sparse Extractions

49

Terms in the same class tend to appear in similar contexts.

Context | Hits with Chicago | Hits with Twisp
“cities including __” | 42,000 | 1
“__ and other cities” | 37,900 | 0
“__ hotels” | 2,000,000 | 1,670
“mayor of __” | 657,000 | 82

The Distributional Hypothesis

50

Precomputed – scalable

Handle sparsity

Unsupervised Language Models

51

cities such as Chicago , Boston ,

But Chicago isn’t the best

cities such as Chicago , Boston ,

Los Angeles and Chicago .

Compute dot products between vectors of common and sparse extractions [cf. Ravichandran et al. 2005]

Chicago: <… 1 2 1 …> counted over contexts such as “such as <x> , Boston”, “But <x> isn’t the”, “Angeles and <x> .”

Baseline: context vectors
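A minimal sketch of this baseline, assuming context vectors are simple counts of context strings and similarity is their dot product (the contexts and counts below are illustrative):

from collections import Counter

def context_vector(occurrences):
    """Count the contexts (strings with the extraction replaced by <x>) in
    which a term appears."""
    return Counter(occurrences)

def dot(u: Counter, v: Counter) -> int:
    return sum(u[ctx] * v[ctx] for ctx in u.keys() & v.keys())

chicago = context_vector(["cities such as <x> , Boston",
                          "cities such as <x> , Boston",
                          "But <x> isn't the best"])
twisp = context_vector(["mayor of <x>", "cities such as <x> , Boston"])

print(dot(chicago, twisp))   # overlap on one shared context -> 2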

52

Twisp: <. . . 0 0 0 1 . . .> (sparse context vector)

HMM(Twisp): <0.14 0.01 … 0.06> over hidden states t = 1, 2, …, N

The HMM provides a “distributional summary”:
Compact (efficient – 10-50x less data retrieved)
Dense (accurate – 23-46% error reduction)

HMM Compresses Context Vectors

53

Task: Ranking sparse TextRunner extractions.

Metric: Area under precision-recall curve.

Language models reduce missing area by 39% over nearest competitor.

Experimental Results

Method | Headquartered | Merged | … | Average
Frequency | 0.710 | 0.784 | … | 0.713
PL | 0.651 | 0.851 | … | 0.785
LM | 0.810 | 0.908 | … | 0.851

54

Summary of Thesis

Formalization of Monotonic Features (MFs)

One MF enables PAC learnability from unlabeled data alone [Corollary 4.1]

MFs provide greater information gain vs. labels as the feature space increases in size [Theorem 8]

The MF model is formally distinct from other SSL approaches [Theorems 9 and 10]

The MF model is insufficient when “subconcepts” are present [Proposition 12]

55

Summary: MFs (Continued)

MFA: General SSL algorithm for MFs

Given MFs, MFA performance is equivalent to a state-of-the-art SSL algorithm with 160 labeled examples. [Table 2.1]

Even when MFs are not given, MFA can detect MFs in SSL, reducing error by 16%. [Figure 2.5]

MFA is not effective for UIE [Table 2.2 & Figure 2.6]

56

Summary: URNS

URNS: Formal model of redundancy in IE

Describes how probability increases with MF value [Proposition 13]

Models corroboration among multiple extraction mechanisms (multiple urns) [Proposition 14]

57

URNS Theoretical Results

Uniform Special Case (USC)
Odds in USC increase exponentially with repetition [Theorem 15]
Error decreases exponentially when parameters are known [Theorem 16]

Zipfian Case (ZC)
Closed-form expression for ZC probability given parameters, and for odds given repetitions [Theorem 17]
Error in ZC is bounded above by K / n^(1−ε) for any ε > 0 when parameters are known [Theorem 19]

58

URNS Theoretical Results (cont.)

Zipfian Case (ZC)
In ZC, with probability 1 − δ, the parameters of URNS can be estimated with error < ε, for all δ, ε > 0, given sufficient data [Theorem 20]

In ZC, URNS guarantees PAC learnability given only unlabeled data, provided the MF is sufficiently informative and a “separability” criterion is met in the concept space [Theorem 21]

59

URNS Experimental Results

Supervised Learning [Table 3.3]
19% error reduction over noisy-or
10% error reduction over logistic regression
Comparable performance to SVM

Semi-supervised IE [Figure 3.4]
6% error reduction over LP

Unsupervised IE [Figure 3.2]
1500% error reduction over noisy-or
2200% error reduction over PMI

Improved Efficiency [Table 3.2]
8x faster than PMI

60

Other Applications of URNS

Estimating extraction precision and recall [Table 3.7]

Identifying synonymous objects and relations (RESOLVER) [Yates & Etzioni, 2007]

Identifying functional relations in text [Ritter et al., 2008]

61

Assessing Sparse Extractions

Hidden Markov Model assessor (HMM-T):
Error reduction of 23-46% over context vectors on the typechecking task [Table 4.1]
Error reduction of 28% over context vectors on sparse unary extractions [Table 4.2]
10-50x more efficient than context vectors

Sparse extraction assessment with language models:
Error reduction of 39% over previous work [Table 4.3]
Massively more scalable than previous techniques

62

Acknowledgements: Oren Etzioni

Mike Cafarella
Pedro Domingos
Susan Dumais

Eric Horvitz
Alan Ritter

Stef Schoenmackers
Stephen Soderland

Dan Weld

63

64

65

Extraction is sometimes “easy”: generic extraction patterns

…cities such as Chicago… => Chicago ∈ City

C such as x => x ∈ C [Hearst, 1992]

But most sentences are “tough”:

We walked the tree-lined streets of the bustling metropolis that is Atlanta.

Extracting Atlanta ∈ City requires: syntactic parsing (Atlanta -> is -> metropolis) and subclass discovery (metropolis(x) => city(x))

Challenging & difficult to scale e.g. [Collins, 1997; Snow & Ng 2006]

Web IE without labeled examples

66

Extraction is sometimes “easy”: generic extraction patterns

…cities such as Chicago… => Chicago ∈ City

C such as x => x ∈ C [Hearst, 1992]

But most sentences are “tough”:

We walked the tree-lined streets of the bustling metropolis that is Atlanta.

“cities such as Atlanta” – 21,600 Hits

Web IE without labeled examples

67

Web IE without labeled examples

Extraction is sometimes “easy”: generic extraction patterns

…cities such as Chicago… => Chicago ∈ City

C such as x => x ∈ C [Hearst, 1992]

…Bloomberg, mayor of New York City… => (Bloomberg, New York City) ∈ Mayor

x, C of y => (x, y) ∈ C

The scale and redundancy of the Web make a multitude of facts “easy” to extract.

68

http://www.cs.washington.edu/research/textrunner/

[Banko et al., 2007]

TextRunner Search

69

Extraction patterns make errors:

“Erik Jonsson, CEO of Texas Instruments, mayor of Dallas from 1964-1971, and…”


But…

Task: Assess which extractions are correct Without hand-labeled examples At Web-scale

Thesis: “We can assess extraction correctness by leveraging redundancy and probabilistic models.”

70

1) Motivation

2) Background on Web IE

3) Estimating extraction correctness URNS model of redundancy

[Downey et al., IJCAI 2005]

(Distinguished Paper Award)

4) Challenge: The “long tail”

5) Machine learning generalization

Outline

71

1) Repetition
2) Multiple patterns

Phrase | Hits
“Chicago and other cities” | 94,400
“Illinois and other cities” | 23,100
“cities such as Chicago” | 42,500
“cities such as Illinois” | 7

Redundancy – Two Intuitions

Goal: a formal model of these intuitions.

Given a term x and a set of sentences containing extraction patterns for a class C, what is the probability that x ∈ C?

72

Given a term x and a set of sentences containing extraction patterns for a class C, what is the probability that x ∈ C?

If an extraction x appears k times in a set of n distinct occurrences of the pattern, what is the probability that x ∈ C?

Consider a single pattern suggesting C , e.g.,

countries such as x

Redundancy: Single Pattern

73

“…countries such as Saudi Arabia…”

“…countries such as the United States…”

“…countries such as Saudi Arabia…”

“…countries such as Japan…”

“…countries such as Africa…”

“…countries such as Japan…”

“…countries such as the United Kingdom…”

“…countries such as Iraq…”

“…countries such as Afghanistan…”

“…countries such as Australia…”

C = Country

n = 10 occurrences

Redundancy: Single Pattern

74

C = Country
n = 10

Extraction | k | Pnoisy-or
Saudi Arabia | 2 | 0.99
Japan | 2 | 0.99
United States | 1 | 0.9
Africa | 1 | 0.9
United Kingdom | 1 | 0.9
Iraq | 1 | 0.9
Afghanistan | 1 | 0.9
Australia | 1 | 0.9

p = probability the pattern yields a correct extraction; here p = 0.9

Noisy-or ignores:
– Sample size (n)
– Distribution of C

Naïve Model: Noisy-Or

Pnoisy-or(x ∈ C | x seen k times) = 1 – (1 – p)^k

[Agichtein & Gravano, 2000; Lin et al. 2003]

75

C = Country
n ≈ 50,000

Extraction | k | Pnoisy-or
United States | 3899 | 0.9999…
China | 1999 | 0.9999…
. . .
OilWatch Africa | 1 | 0.9
Religion Paraguay | 1 | 0.9
Chicken Mole | 1 | 0.9
Republics of Kenya | 1 | 0.9
Atlantic Ocean | 1 | 0.9

C = Country
n = 10

Extraction | k | Pnoisy-or
Saudi Arabia | 2 | 0.99
Japan | 2 | 0.99
United States | 1 | 0.9
Africa | 1 | 0.9
United Kingdom | 1 | 0.9
Iraq | 1 | 0.9
Afghanistan | 1 | 0.9
Australia | 1 | 0.9

As sample size increases, noisy-or becomes inaccurate.

Needed in Model: Sample Size

76

C = Country
n ≈ 50,000

Extraction | k | Pnoisy-or
United States | 3899 | 0.9999…
China | 1999 | 0.9999…
. . .
OilWatch Africa | 1 | 0.9
Religion Paraguay | 1 | 0.9
Chicken Mole | 1 | 0.9
Republics of Kenya | 1 | 0.9
Atlantic Ocean | 1 | 0.9

Needed in Model: Distribution of C

Pfreq(x ∈ C | x seen k times) = 1 – (1 – p)k/n

77

C = Country
n ≈ 50,000

Extraction | k | Pfreq
United States | 3899 | 0.9999…
China | 1999 | 0.9999…
. . .
OilWatch Africa | 1 | 0.05
Religion Paraguay | 1 | 0.05
Chicken Mole | 1 | 0.05
Republics of Kenya | 1 | 0.05
Atlantic Ocean | 1 | 0.05

Needed in Model: Distribution of C

Pfreq(x ∈ C | x seen k times) = 1 – (1 – p)k/n

78

C = City
n ≈ 50,000

Extraction | k | Pfreq
New York | 1488 | 0.9999…
Chicago | 999 | 0.9999…
. . .
El Estor | 1 | 0.05
Nikki | 1 | 0.05
Ragaz | 1 | 0.05
Villegas | 1 | 0.05
Northeastwards | 1 | 0.05

C = Country
n ≈ 50,000

Extraction | k | Pfreq
United States | 3899 | 0.9999…
China | 1999 | 0.9999…
. . .
OilWatch Africa | 1 | 0.05
Religion Paraguay | 1 | 0.05
Chicken Mole | 1 | 0.05
Republics of Kenya | 1 | 0.05
Atlantic Ocean | 1 | 0.05

Probability x ∈ C depends on the distribution of C.

Needed in Model: Distribution of C

79

…cities such as Tokyo…

Urn for C = City, containing balls labeled:
Tokyo, U.K., Sydney, Cairo, Tokyo, Tokyo, Atlanta, Atlanta, Yakima, Utah, U.K.

My solution: URNS Model

80

C – set of unique target labels

E – set of unique error labels

num(C) – distribution of target labels

num(E) – distribution of error labels

Urn – Formal Definition

81

distribution of target labels: num(C) = {2, 2, 1, 1, 1}

distribution of error labels: num(E) = {2, 1}

Urn for C = City, containing balls labeled:
U.K., Sydney, Cairo, Tokyo, Tokyo, Atlanta, Atlanta, Yakima, Utah, U.K.

Urn Example

82

If an extraction x appears k times in a set of n distinct occurrences of the pattern, what is the probability that x ∈ C?

Computing Probabilities

83

Given that an extraction x appears k times in n draws from the urn (with replacement), what is the probability that x ∈ C?

where s is the total number of balls in the urn

Computing Probabilities

84

Multiple urns:
Target label frequencies are correlated across urns
Error label frequencies can be uncorrelated

Phrase | Hits
“Chicago and other cities” | 94,400
“Illinois and other cities” | 23,100
“cities such as Chicago” | 42,500
“cities such as Illinois” | 7

Multiple Extraction Patterns

85

URNS without labeled data

Needed: num(C), num(E)

Assumed to be Zipf

Frequency of ith element ∝ i^(−z)

With assumptions, learn Zipfian parameters for any class C from unlabeled data alone

86

Observed frequency distribution = a mixture of a C Zipf distribution (weight p) and an E Zipf distribution (weight 1 – p).

p is constant across C, for a given pattern; num(E) is constant across C.

=> Learn num(C) from unlabeled data!

URNS without labeled data

87

C = City
n ≈ 50,000

Extraction | k | PURNS
New York | 1488 | 0.9999…
Chicago | 999 | 0.9999…
. . .
El Estor | 1 | 0.63
Nikki | 1 | 0.63
Ragaz | 1 | 0.63
Villegas | 1 | 0.63
Cres | 1 | 0.63
Northeastwards | 1 | 0.63

C = Country
n ≈ 50,000

Extraction | k | PURNS
United States | 3899 | 0.9999…
China | 1999 | 0.9999…
. . .
OilWatch Africa | 1 | 0.03
Religion Paraguay | 1 | 0.03
Chicken Mole | 1 | 0.03
Republics of Kenya | 1 | 0.03
Atlantic Ocean | 1 | 0.03
New Zeland | 1 | 0.03

Probabilities Assigned by URNS

88

[Bar chart: deviation from ideal log likelihood (0 to 5) for the classes City, Film, Country, and MayorOf, comparing urns, noisy-or, and pmi.]

URNS’s probabilities are 15-22x closer to optimal.

Probability Accuracy

89

Computation is efficient: continuous Zipf & Poisson approximations => closed-form expression for P(x ∈ C | evidence)

vs. Pointwise Mutual Information (PMI) [Etzioni et al. 2005]:

PMI is computed with search engine hit counts (inspired by [Turney, 2000])

URNS requires no hit count queries (~8x faster)

Scalability

90

Probabilistic model of redundancy

Accurate without hand-labeled examples: 15-22x improvement in accuracy

Scalable: 8x faster

[Downey et al., IJCAI 2005]

URNS: Contributions

91

1) Motivation

2) Background on Web IE

3) Estimating extraction correctness

4) Challenge: The “long tail” Language models to the rescue

[Downey et al., ACL 2007]

5) Machine learning generalization

Outline

92

[Plot: number of times each extraction appears in the pattern (y-axis, 0 to 500) vs. frequency rank of the extraction (x-axis, 0 to 100,000).]

A mixture of correct and incorrect

e.g., (Dave Shaver, Pickerington), (Ronald McDonald, McDonaldland)

Tend to be correct

e.g., (Bloomberg, New York City)

Challenge: the “long tail”

93

Mayor McCheese

94

Strategy:
1) Model how common extractions occur in text
2) Rank sparse extractions by fit to model

Unsupervised language models: precomputed (scalable), handle sparsity

Assessing Sparse Extractions

95

The “distributional hypothesis”: instances of the same relationship tend to appear in similar contexts.

…David B. Shaver was elected as the new mayor of Pickerington, Ohio.

http://www.law.capital.edu/ebriefsarchive/Summer2004/ClassActionsLeft.asp

…Mike Bloomberg was elected as the new mayor of New York City.

http://www.queenspress.com/archives/coverstories/2001/issue52/coverstory.htm

Assessing Sparse Extractions

96

Type errors are common:

Alexander the Great conquered Egypt… => (Great, Egypt) ∈ Conquered

Locally acquired malaria is now uncommon… => (Locally, malaria) ∈ Acquired

Type checking

97

cities such as Chicago , Boston ,

But Chicago isn’t the best

cities such as Chicago , Boston ,

Los Angeles and Chicago .

Compute dot products between vectors of common and sparse extractions [cf. Ravichandran et al. 2005]

Chicago: <… 1 2 1 …> counted over contexts such as “such as <x> , Boston”, “But <x> isn’t the”, “Angeles and <x> .”

Baseline: context vectors (1)

98

Miami: <. . . 71 25 1 513 . . .>
Twisp: <. . . 0 0 0 1 . . .>
(counts over contexts such as “when he visited X”, “he visited X and”, “visited X and other”, “X and other cities”)

Problems:
Vectors are large
Intersections are sparse

Baseline: context vectors (2)

99

Hidden Markov Model (HMM)

Hidden states ti, ti+1, ti+2, ti+3 – unobserved
Observed words wi, wi+1, wi+2, wi+3, e.g. “cities such as Seattle”

Hidden states ti ∈ {1, …, N} (N fairly small)

Train on unlabeled data – P(ti | wi = w) is N-dim. distributional summary of w

– Compare extractions using KL divergence
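A hedged sketch of the comparison step, assuming the distributional summaries P(t | w) have already been computed by a trained HMM (the vectors below are made up; only the KL comparison is illustrated):

import numpy as np

def kl(p: np.ndarray, q: np.ndarray) -> float:
    """KL(p || q) for two discrete distributions over the N hidden states."""
    eps = 1e-12
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Illustrative precomputed summaries P(t | w); in the thesis these come from
# an HMM trained on unlabeled text, with N fairly small.
p_seattle = np.array([0.60, 0.25, 0.10, 0.05])
p_twisp   = np.array([0.55, 0.30, 0.10, 0.05])   # sparse, but similar states
p_utah    = np.array([0.05, 0.10, 0.30, 0.55])   # different state profile

print(kl(p_seattle, p_twisp))   # small -> Twisp type-checks as a city
print(kl(p_seattle, p_utah))    # large -> Utah does not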

100

Twisp: <. . . 0 0 0 1 . . .> (sparse context vector)

P(t | Twisp): <0.14 0.01 … 0.06> over hidden states t = 1, 2, …, N

Distributional summary P(t | w):
Compact (efficient – 10-50x less data retrieved)
Dense (accurate – 23-46% error reduction)

HMM Compresses Context Vectors

101

Is Pickerington of the same type as Chicago?

Chicago , Illinois
Pickerington , Ohio

Extraction | “<x> , Illinois” | “<x> , Ohio” | …
Chicago | 291 | 0 | …
Pickerington | 0 | 1 | …

=> Context vectors say no, the dot product is 0!

Example

102

HMM Generalizes:

Chicago , Illinois

Pickerington , Ohio

Example

103

Task: Ranking sparse TextRunner extractions.

Metric: Area under precision-recall curve.

Language models reduce missing area by 39% over nearest competitor.

Experimental Results

Method | Headquartered | Merged | … | Average
Frequency | 0.710 | 0.784 | … | 0.713
PL | 0.651 | 0.851 | … | 0.785
LM | 0.810 | 0.908 | … | 0.851

104

No hand-labeled data

Scalability: language models are precomputed => can be queried at interactive speed

Improved accuracy over previous work [Downey et al., ACL 2007]

REALM: Contributions

105

1) Motivation

2) Background on Web IE

3) Estimating extraction correctness

4) Challenge: The “long tail”

5) Machine learning generalization Monotonic Features

[Downey et al., 2008 (submitted)]

Outline

106

Common Structure

Task | Hint | Bootstrap
Web IE | “x, C of y” | Distributional Hypothesis
Word Sense Disambiguation | “plant and animal species” | One sense per context, one sense per discourse [Yarowsky, 1995]
Information Retrieval | search query | Pseudo-relevance feedback [Kwok & Grunfield, 1995; Thompson & Turtle, 1995]
Document Classification | topic word, e.g. “politics” | Semi-supervised Learning [McCallum & Nigam, 1999; Gliozzo, 2005]

107

Common Structure

Task | Hint | Bootstrap
Web IE | “x, C of y” | Distributional Hypothesis
Word Sense Disambiguation | “plant and animal species” | One sense per context, one sense per discourse [Yarowsky, 1995]
Information Retrieval | search query | Pseudo-relevance feedback [Kwok & Grunfield, 1995; Thompson & Turtle, 1995]
Document Classification | topic word, e.g. “politics” | Bag-of-words and EM [McCallum & Nigam, 1999; Gliozzo, 2005]

Classification of examples x = (x1, …, xd) into classes y ∈ {0, 1}

Identity of a monotonic feature xi such that: P(y = 1 | xi) increases strictly monotonically with xi

108

Classical Supervised Learning

?

Learn function from x = (x1, …, xd) to y ∈ {0, 1}, given labeled examples (x, y)

x1

x2

109

Semi-Supervised Learning (SSL)

Learn function from x = (x1, …, xd) to y ∈ {0, 1}, given labeled examples (x, y) and unlabeled examples (x)

x1

x2

110

Monotonic Features

x1

x2

Learn function from x = (x1, …, xd) to y ∈ {0, 1}, given monotonic feature x1

and unlabeled examples (x)

111

Monotonic Features

x1

x2

Learn function from x = (x1, …, xd) to y ∈ {0, 1}, given monotonic feature x1

and unlabeled examples (x)

112

Monotonic Features

x1

x2

Learn function from x = (x1, …, xd) to y ∈ {0, 1}, given monotonic feature x1

and unlabeled examples (x)

113

1. No labeled data, MFs given (MA): with noisy labels from MFs, train any classifier

2. Labeled data, no MFs given (MA-SSL): detect MFs from labeled data, then run MA

3. Labeled data and MFs given (MA-BOTH): run MA with given & detected MFs

Exploiting MF Structure

114

20 Newsgroups dataset

Task: Given text, determine newsgroup of origin

(MFs: newsgroup name)

Without labeled data:

Experimental Results

115

MA-SSL provides a 15% error reduction for 100-400 labeled examples.

MA-BOTH provides a 31% error reduction for 0-800 labeled examples.

Experimental Results

116

Co-training: requires labeled examples and known views

Semi-supervised smoothness assumptions:
Cluster assumption
Manifold assumption
…both provably distinct from MF structure

Relationship to other approaches

117

Best known methods for IE without labeled data

Probabilities of correctness (URNS):
Massive improvements in accuracy (15-22x)

Handling sparse data (language models):
Vastly more scalable than previous work
Accuracy wins (39% error reduction)

Generalization beyond IE:
Monotonic Feature abstraction – widely applicable
Accuracy wins in document classification

Summary of Results

118

IE => Web IE. But we still need:

A coherent knowledge base
Is MayorOf(Chicago, Daley) the same “Chicago” as Starred-in(Chicago, Zeta-Jones)?
Future work: entity resolution, schema discovery

Improved accuracy and coverage
Currently we ignore character/document features, recursive structure, etc.
Future work: more sophisticated language models (e.g. PCFGs)

Conclusions and Future Work

119

Thanks!

Acknowledgements: Oren Etzioni

Mike Cafarella
Pedro Domingos
Susan Dumais

Eric Horvitz
Stef Schoenmackers

Dan Weld

120

Self-Supervised Learning

Setting | Input examples | Output
Supervised | Labeled | Classifier
Semi-supervised | Labeled & Unlabeled | Classifier
Self-supervised | Unlabeled | Classifier
Unsupervised | Unlabeled | Clustering

121

Language modeling for IE: REALM is simple and ignores:
Character- or document-level features
Web structure
Recursive structure (PCFGs)

Goal: “x won an Oscar for playing a villain…” – what is P(x)?

From facts to knowledge: entity resolution and inference

Future Work

122

Named entity location: lexical statistics improve the state of the art [Downey et al., IJCAI 2007]

Modeling Web search: characterizing user behavior [Downey et al., SIGIR 2007] (poster); [Liebling et al., 2008] (submitted)

Predictive models [Downey et al., IJCAI 2007]

Other Work

123

Web Fact-Finding

Who has won three or more Academy Awards?

124

Web Fact-Finding
Problems:

User has to pick the right words, often a tedious process:

"world foosball champion in 1998“ – 0 hits“world foosball champion” 1998 – 2 hits, no answer

What if I could just ask for P(x) in“x was world foosball champion in 1998?”

How far can language modeling and the distributional hypothesis take us?

125

Miami: <. . . 98 0 20 250 30 513 . . .>
Twisp: <. . . 5 0 1 2 1 1 . . .>
Star Wars: <. . . 1 1000 0 2 1 1 . . .>
(counts over contexts such as “X soundtrack”, “he visited X and”, “cities such as X”, “X and other cities”, “X lodging”)

KnowItAll Hypothesis

Distributional Hypothesis

126

Miami: <. . . 98 0 20 250 30 513 . . .>
Twisp: <. . . 5 0 1 2 1 1 . . .>
Star Wars: <. . . 1 1000 0 2 1 1 . . .>
(counts over contexts such as “X soundtrack”, “he visited X and”, “cities such as X”, “X and other cities”, “X lodging”)

KnowItAll Hypothesis

Distributional Hypothesis

127

invent in real time

TextRunner

Ranked by frequency

REALM improves precision of the top 20 extractions by an average of 90%.

128

Tarantella, Santa Cruz

International Business Machines Corporation, Armonk

Mirapoint, Sunnyvale

ALD, Sunnyvale

PBS, Alexandria

General Dynamics, Falls Church

Jupitermedia Corporation, Darien

Allegro, Worcester

Trolltech, Oslo

Corbis, Seattle

TR Precision: 40% REALM Precision: 100%

Improving TextRunner: Example (1)

“headquartered” Top 10:
company, Palo Alto

held company, Santa Cruz

storage hardware and software, Hopkinton

Northwestern Mutual, Tacoma

1997, New York City

Google, Mountain View

PBS, Alexandria

Linux provider, Raleigh

Red Hat, Raleigh

TI, Dallas

TR Precision: 40%

129

Arabs, Rhodes

Arabs, Istanbul

Assyrians, Mesopotamia

Great, Egypt

Assyrians, Kassites

Arabs, Samarkand

Manchus, Outer Mongolia

Vandals, North Africa

Arabs, Persia

Moors, Lagos

TR Precision: 60% REALM Precision: 90%

Improving TextRunner: Example (2)

“conquered” Top 10:
Great, Egypt

conquistador, Mexico

Normans, England

Arabs, North Africa

Great, Persia

Romans, part

Romans, Greeks

Rome, Greece

Napoleon, Egypt

Visigoths, Suevi Kingdom

TR Precision: 60%

130

Previous n-gram technique (1)

1) Form a context vector for each extracted argument:…

cities such as Chicago , Boston ,

But Chicago isn’t the best

cities such as Chicago , Boston ,

Los Angeles and Chicago .

2) Compute dot products between extractions and seeds in this space [cf. Ravichandran et al. 2005].

<… 1 2 1 …> counted over contexts: “such as <x> , Boston”, “But <x> isn’t the”, “Angeles and <x> .”

131

Miami: <. . . 71 25 1 513 . . .>
Twisp: <. . . 0 0 0 1 . . .>
(counts over contexts such as “when he visited X”, “he visited X and”, “visited X and other”, “X and other cities”)

Problems:
Vectors are large
Intersections are sparse

Previous n-gram technique (2)

132

Miami: <. . . 71 25 1 513 . . .> (context vector)

P(t | Miami): <0.14 0.01 … 0.06> over hidden states t = 1, 2, …, N

Latent state distribution P(t | w):
Compact (efficient – 10-50x less data retrieved)
Dense (accurate – 23-46% error reduction)

Compressing Context Vectors

133

Example: N-Grams on Sparse Data

Is Pickerington of the same type as Chicago?

Chicago , Illinois
Pickerington , Ohio

Extraction | “<x> , Illinois” | “<x> , Ohio” | …
Chicago | 291 | 0 | …
Pickerington | 0 | 1 | …

=> N-grams say no, the dot product is 0!

134

HMM Generalizes:

Chicago , Illinois

Pickerington , Ohio

Example: HMM-T on Sparse Data

135

HMM-T Limitations

Learning iterations take time proportional to (corpus size × T^(k+1))

T = number of latent states

k = HMM order

We use limited values T=20, k=3 Sufficient for typechecking (Santa Clara is a city) Too coarse for relation assessment

(Santa Clara is where Intel is headquartered)

136

The REALM Architecture: two steps for assessing R(arg1, arg2)

Typechecking
Ensure arg1 and arg2 are of the proper type for R, e.g. rule out MayorOf(Intel, Santa Clara)
Leverages all occurrences of each arg

Relation Assessment
Ensure R actually holds between arg1 and arg2, e.g. rule out MayorOf(Giuliani, Seattle)

Both steps use pre-computed language models => scales to Open IE

137

Type checking isn’t enough: “NY Mayor Giuliani toured downtown Seattle.”

Want: How do arguments behave in relation to each other?

Relation Assessment

138

N-gram language model:

P(wi, wi-1, … wi-k)

arg1, arg2 often far apart => large k (inaccurate)

REL-GRAMS (1)

139

Relational Language Model (REL-GRAMS):

For any two arguments e1, e2:

P(wi, wi-1, … wi-k | wi = e1, e1 near e2)

k can be small – REL-GRAMS still captures entity relationships
Mitigate sparsity with the BM25 metric (from IR)

Combine with HMM-T by multiplying ranks.

REL-GRAMS (2)
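A minimal sketch of the rank combination mentioned above, multiplying HMM-T and REL-GRAMS ranks (the extraction names and ranks below are illustrative):

def combine_ranks(hmm_rank: dict, relgram_rank: dict) -> list:
    """Combine HMM-T and REL-GRAMS rankings by multiplying ranks (lower is
    better); ties are broken arbitrarily by the sort."""
    combined = {e: hmm_rank[e] * relgram_rank[e] for e in hmm_rank}
    return sorted(combined, key=combined.get)

hmm_rank = {"(Bloomberg, New York City)": 1,
            "(Giuliani, Seattle)": 3,
            "(Shaver, Pickerington)": 2}
relgram_rank = {"(Bloomberg, New York City)": 2,
                "(Giuliani, Seattle)": 3,
                "(Shaver, Pickerington)": 1}
print(combine_ranks(hmm_rank, relgram_rank))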

140

Experiments

Task: Re-rank sparse TextRunner extractions for Conquered, Founded, Headquartered, Merged.
REALM vs.

TextRunner (TR) – frequency ordering (equivalent to PMI [Etzioni et al, 2005] and Urns [Downey et al, 2005])

Pattern Learning (PL) – based on Snowball [Agichtein 2000]

HMM-T and REL-GRAMS in isolation

141

Learning num(C) and num(E)

From untagged data: an ill-posed problem
• num(C) can vary wildly with C, e.g., countries vs. cities vs. mayors

Assume:
1) Consistent precision of a single co-occurrence, e.g., in a randomly drawn phrase “C such as x”, x ∈ C about p of the time (0.9 for [Etzioni, 2005])
2) num(E) is constant for all C
3) num(C) is Zipf

Estimate num(C) from untagged data using EM [Downey et al. 2005] (Also: multiple contexts)

142

URNS without labeled data

[Figure: observed frequency vs. frequency rank, decomposed (with weights p and 1 – p) into target and error frequency distributions.]

P(x ∈ C) in “C such as x”: assumed ~0.9

Error distribution: assumed large, with Zipf parameter 1.0

143

URNS without labeled data

[Figure: observed frequency vs. frequency rank, decomposed (with weights p and 1 – p) into the target distribution num(C) and the error distribution num(E).]

num(C) can vary wildly (e.g. cities vs. countries); it is learned from unlabeled data using EM.

144

Distributional Similarity

Naïve Approach – find sentences containing seed1&seed2 or arg1&arg2:

Compare context distributions:

P(wb,…, we | seed1, seed2 )

P(wb,…, we | arg1, arg2)

But e – b can be large

Many parameters, sparse data => inaccuracy

wb … wh seed1 wh+2 … wi seed2 wi+2 … we

wb … wh arg1 wh+2 … wi arg2 wi+2 … we

145

http://www.cs.washington.edu/research/textrunner/

TextRunner Search

146

Large textual corpora are redundant,

and we can use this observation to bootstrap extraction and classification models 

from minimally labeled, or even completely unlabeled data.

Thesis

147

Supervised classification task:
Feature space X of d-tuples x = (x1, …, xd)
Binary output space Y = {0, 1}
Inputs: labeled examples DL = {(x, y)} ~ P(x, y)
Output: concept c: X -> {0, 1} that approximates P(y | x).

Monotonic Features

148

Semi-supervised classification task:
Feature space X of d-tuples x = (x1, …, xd)
Binary output space Y = {0, 1}
Inputs:
Labeled examples DL = {(x, y)} ~ P(x, y)
Unlabeled examples DU = {(x)} ~ P(x)
Output: concept c: X -> {0, 1} that approximates P(y | x).

Monotonic Features

Smaller

149

Semi-supervised classification task:
Feature space X of d-tuples x = (x1, …, xd)
Binary output space Y = {0, 1}
Inputs:
Labeled examples DL = {(x, y)} ~ P(x, y)
Unlabeled examples DU = {(x)} ~ P(x)
Monotonic features M ⊆ {1,…,d} such that: P(y=1 | xi) increases strictly monotonically with xi for all i ∈ M.

Output: concept c: X -> {0, 1} that approximates P(y | x).

Potentially empty!

Monotonic Features

150

Problem: num(C) can vary wildly, e.g. cities vs. countries

Assume: num(C), num(E) Zipf distributed
freq. of ith element ∝ i^(−z)
p and num(E) independent of C

Learn num(C) from unlabeled data alone, with Expectation Maximization

URNS without labeled data
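As a much simpler stand-in for the EM procedure referenced above, a hedged sketch that fits the Zipf exponent z by a least-squares line in log-log space over observed extraction counts (the counts are illustrative):

import numpy as np

def fit_zipf_exponent(counts):
    """Fit z in freq(i) proportional to i**(-z) by least squares on log-log
    ranked counts.  A crude stand-in for the EM procedure, not the thesis's
    estimator."""
    counts = np.sort(np.asarray(counts, dtype=float))[::-1]
    ranks = np.arange(1, len(counts) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(counts), 1)
    return -slope

observed = [3899, 1999, 1200, 800, 400, 200, 90, 40, 10, 3]
print(round(fit_zipf_exponent(observed), 2))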

151

20 Newsgroups dataset

Task: Given text, determine newsgroup of origin

(MFs: newsgroup name)

Without labeled data:

Experimental Results

152

Typecheck each arg by comparing HMM’s distributional summaries:

Rank arguments in ascending order of f(arg)

f(arg) = (1/|seeds|) Σi KL( P(t | seedi) || P(t | arg) )

HMM Type-checking

153

Classical Supervised Learning

?

Learn function from x = (x1, …, xd) to y ∈ {0, 1}, given labeled examples (x, y)

x1

x2

154

Semi-supervised Learning (SSL)

Learn function from x = (x1, …, xd) to y ∈ {0, 1}, given labeled examples (x, y) and unlabeled examples (x)

x1

x2

155

Self-supervised Learning

x1

x2

Learn function from x = (x1, …, xd) to y ∈ {0, 1}, given unlabeled examples (x)

156

Self-supervised Learning

x1

x2

Learn function from x = (x1, …, xd) to y ∈ {0, 1}, given unlabeled examples (x), and the system labels its own examples

157

Self-supervised Learning

Setting | Input examples | Output
Supervised | Labeled | Classifier
Semi-supervised | Labeled & Unlabeled | Classifier
Self-supervised | Unlabeled | Classifier
Unsupervised | Unlabeled | Clustering

158

Supervised classification task:
Feature space X of d-tuples x = (x1, …, xd)
Binary output space Y = {0, 1}
Inputs: labeled examples DL = {(x, y)} ~ P(x, y)
Output: concept c: X -> {0, 1} that approximates P(y | x).

Monotonic Features

159

Semi-supervised classification task:
Feature space X of d-tuples x = (x1, …, xd)
Binary output space Y = {0, 1}
Inputs:
Labeled examples DL = {(x, y)} ~ P(x, y)
Unlabeled examples DU = {(x)} ~ P(x)
Output: concept c: X -> {0, 1} that approximates P(y | x).

Monotonic Features

Smaller

160

Semi-supervised classification task:
Feature space X of d-tuples x = (x1, …, xd)
Binary output space Y = {0, 1}
Inputs:
Labeled examples DL = {(x, y)} ~ P(x, y)
Unlabeled examples DU = {(x)} ~ P(x)
Monotonic features M ⊆ {1,…,d} such that: P(y=1 | xi) increases strictly monotonically with xi for all i ∈ M.

Output: concept c: X -> {0, 1} that approximates P(y | x).

Potentially empty!

Monotonic Features
