
1

Autonomous Web-scale Information Extraction

Doug Downey
Advisor: Oren Etzioni
Department of Computer Science and Engineering
Turing Center
University of Washington

2

Web Information Extraction

…cities such as Chicago… => Chicago ∈ City

C such as x => x ∈ C [Hearst, 1992]

…Edison invented the light bulb… => (Edison, light bulb) ∈ Invented

x V y => (x, y) ∈ V

e.g., KnowItAll [Etzioni et al., 2005], TextRunner [Banko et al., 2007], others [Pasca et al., 2007]
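For concreteness, a minimal sketch (in Python, not the KnowItAll or TextRunner implementation) of applying a Hearst-style "C such as x" pattern; the regex and sentences are illustrative assumptions:

import re

# Hearst-style pattern: "<concept> such as <Capitalized Instance>"
PATTERN = re.compile(r"\b(\w+) such as ((?:[A-Z]\w+)(?: [A-Z]\w+)*)")

sentences = [
    "We toured cities such as Chicago last summer.",
    "He studied inventors such as Edison in school.",
]

for sentence in sentences:
    for match in PATTERN.finditer(sentence):
        concept, instance = match.group(1), match.group(2)
        # e.g. ("cities", "Chicago") -> candidate fact: Chicago ∈ City
        print(f"{instance} ∈ {concept}")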

3

Identifying correct extractions

…mayors of major cities such as Giuliani… => Giuliani ∈ City

Supervised IE: hand-label examples of each concept

Not possible on the Web (far too many concepts)

=> Unsupervised IE (UIE)

How can we automatically identify correct extractions for any concept without hand-labeled data?

4

KnowItAll Hypothesis (KH)

Extractions that occur more frequently in distinct sentences in the corpus are more likely to be correct.

Repetitions of the same error are relatively rare

…mayors of major cities such as Giuliani…
…hotels in popular cities such as Marriot…

Misinformation is the exception rather than the rule

“Elvis killed JFK” – 200 hits
“Oswald killed JFK” – 3000 hits

5

Redundancy

KH can identify many correct statements because the Web is highly redundant

– same facts repeated many times, in many ways – e.g., “Edison invented the light bulb” – 10,000 hits

(but leveraging the KH is a little tricky => probabilistic model)

Thesis: We can identify correct extractions without labeled data using a probabilistic model of redundancy.

6

1) Background
2) KH as a general problem structure
• Monotonic Feature Model
3) URNS model
• How does probability increase with repetition?
4) Challenge: The “long tail”
• Unsupervised language models

Outline

7

Classical Supervised Learning

?

Learn function from x = (x1, …, xd) to y ∈ {0, 1}, given labeled examples (x, y)

x1

x2

8

Semi-Supervised Learning (SSL)

Learn function from x = (x1, …, xd) to y ∈ {0, 1}, given labeled examples (x, y) and unlabeled examples (x)

x1

x2

9

Monotonic Features

x1

x2

Learn function from x = (x1, …, xd) to y ∈ {0, 1}, given monotonic feature x1

and unlabeled examples (x)

10

Monotonic Features

x1

x2

Learn function from x = (x1, …, xd) to y ∈ {0, 1}, given monotonic feature x1 and unlabeled examples (x)

P(y=1 | x1) increases with x1

11

Common Structure

Task | Monotonic Feature
UIE | “C such as x” [Etzioni et al., 2005]
Word Sense Disambiguation | “plant and animal species” [Yarowsky, 1995]
Information Retrieval | search query [Kwok & Grunfield, 1995; Thompson & Turtle, 1995]
Document Classification | topic word, e.g. “politics” [McCallum & Nigam, 1999; Gliozzo, 2005]
Named Entity Recognition | contains(“Mr.”) [Collins & Singer, 1998]

12

The MF model is provably distinct from standard smoothness assumptions in SSL:
Cluster assumption
Manifold assumption
=> MFs can complement other methods

Unlike co-training, the MF model doesn’t require labeled data or pre-defined “views”.

Isn’t this just ___ ?

13

One MF implies PAC-learnability without labeled data
…when the MF is conditionally independent of other features & is minimally informative
Corollary to the co-training theorem [Blum and Mitchell, 1998]

MFs provide more information (vs. labels) about unlabeled examples as the feature space grows:
As the number of features increases,
information gain due to MFs stays constant, vs.
information gain due to labeled examples falls (under assumptions)

Theoretical Results

14

MFA: Given MFs and unlabeled data:
Use the MFs to produce noisy labels
Train any classifier

Classification with the MF Model
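A hedged sketch of this idea, assuming scikit-learn's LogisticRegression stands in for "any classifier" and using synthetic data; the thresholds and feature layout are illustrative, not the thesis's MFA implementation:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic unlabeled data: column 0 plays the role of the monotonic feature
# (e.g. extraction frequency); the remaining columns are ordinary features.
X = rng.normal(size=(1000, 5))
X[:, 0] = rng.exponential(scale=2.0, size=1000)   # monotonic feature >= 0

# Step 1: use the MF alone to produce noisy labels.  High MF values are
# labeled positive, low values negative (quantile thresholds are illustrative).
hi, lo = np.quantile(X[:, 0], [0.8, 0.2])
noisy_pos = X[X[:, 0] >= hi]
noisy_neg = X[X[:, 0] <= lo]

X_noisy = np.vstack([noisy_pos, noisy_neg])
y_noisy = np.concatenate([np.ones(len(noisy_pos)), np.zeros(len(noisy_neg))])

# Step 2: train any standard classifier on the noisy labels.
clf = LogisticRegression(max_iter=1000).fit(X_noisy, y_noisy)
print(clf.predict_proba(X[:5])[:, 1])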

15

20 Newsgroups dataset (MF:newsgroup name)

Vs. Two SSL baselines (NB + EM, LP)

Without labeled data:

Experimental Results

16

MFA-SSL provides a 15% error reduction for 100-400 labeled examples.

MFA-BOTH provides a 31% error reduction for 0-800 labeled examples.

Experimental Results

17

Bad News: confusable MFs

For more complex tasks, monotonicity is insufficient

Example: City extractions

MF: extraction frequency with e.g., “cities such as x”

…also an MF for:
has skyscrapers
has an opera house
located on Earth, …

Extraction | MF value
New York | 1488
Chicago | 999
Los Angeles | 859
… | …
Twisp | 1
Northeast | 1

18

Performance of MFA in UIE

19

MFA for SSL in UIE

20

1) Background
2) KH as a general problem structure
• Monotonic Feature Model
3) URNS model
• How does probability increase with repetition?
4) Challenge: The “long tail”
• Unsupervised language models

Outline

21

If an extraction x appears k times in a set of n distinct occurrences of the pattern, what is the probability that x ∈ C?

Consider a single pattern suggesting C , e.g.,

countries such as x

Redundancy: Single Pattern

22

“…countries such as Saudi Arabia…”

“…countries such as the United States…”

“…countries such as Saudi Arabia…”

“…countries such as Japan…”

“…countries such as Africa…”

“…countries such as Japan…”

“…countries such as the United Kingdom…”

“…countries such as Iraq…”

“…countries such as Afghanistan…”

“…countries such as Australia…”

C = Country

n = 10 occurrences

Redundancy: Single Pattern

23

C = Country
n = 10

Extraction | k | Pnoisy-or
Saudi Arabia | 2 | 0.99
Japan | 2 | 0.99
United States | 1 | 0.9
Africa | 1 | 0.9
United Kingdom | 1 | 0.9
Iraq | 1 | 0.9
Afghanistan | 1 | 0.9
Australia | 1 | 0.9

p = probability the pattern yields a correct extraction; here p = 0.9

Noisy-or ignores:
– Sample size (n)
– Distribution of C

Naïve Model: Noisy-Or

Pnoisy-or(x ∈ C | x seen k times) = 1 – (1 – p)^k

[Agichtein & Gravano, 2000; Lin et al. 2003]
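A minimal sketch of the noisy-or estimate above (Python; p = 0.9 as in the example):

def p_noisy_or(k: int, p: float = 0.9) -> float:
    """P(x in C | x seen k times) under the noisy-or model: 1 - (1 - p)^k."""
    return 1.0 - (1.0 - p) ** k

# Saudi Arabia seen twice, Iraq seen once (from the n = 10 example above).
print(p_noisy_or(2))  # 0.99
print(p_noisy_or(1))  # 0.9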

24

C = Country
n ≈ 50,000

Extraction | k | Pnoisy-or
United States | 3899 | 0.9999…
China | 1999 | 0.9999…
. . .
OilWatch Africa | 1 | 0.9
Religion Paraguay | 1 | 0.9
Chicken Mole | 1 | 0.9
Republics of Kenya | 1 | 0.9
Atlantic Ocean | 1 | 0.9

C = Country
n = 10

Extraction | k | Pnoisy-or
Saudi Arabia | 2 | 0.99
Japan | 2 | 0.99
United States | 1 | 0.9
Africa | 1 | 0.9
United Kingdom | 1 | 0.9
Iraq | 1 | 0.9
Afghanistan | 1 | 0.9
Australia | 1 | 0.9

As sample size increases, noisy-or becomes inaccurate.

Needed in Model: Sample Size

25

C = Country
n ≈ 50,000

Extraction | k | Pnoisy-or
United States | 3899 | 0.9999…
China | 1999 | 0.9999…
. . .
OilWatch Africa | 1 | 0.9
Religion Paraguay | 1 | 0.9
Chicken Mole | 1 | 0.9
Republics of Kenya | 1 | 0.9
Atlantic Ocean | 1 | 0.9

Needed in Model: Distribution of C

Pfreq(x ∈ C | x seen k times) = 1 – (1 – p)k/n

26

C = Country
n ≈ 50,000

Extraction | k | Pfreq
United States | 3899 | 0.9999…
China | 1999 | 0.9999…
. . .
OilWatch Africa | 1 | 0.05
Religion Paraguay | 1 | 0.05
Chicken Mole | 1 | 0.05
Republics of Kenya | 1 | 0.05
Atlantic Ocean | 1 | 0.05

Needed in Model: Distribution of C

Pfreq(x ∈ C | x seen k times) = 1 – (1 – p)k/n

27

C = City
n ≈ 50,000

Extraction | k | Pfreq
New York | 1488 | 0.9999…
Chicago | 999 | 0.9999…
. . .
El Estor | 1 | 0.05
Nikki | 1 | 0.05
Ragaz | 1 | 0.05
Villegas | 1 | 0.05
Northeastwards | 1 | 0.05

C = Country
n ≈ 50,000

Extraction | k | Pfreq
United States | 3899 | 0.9999…
China | 1999 | 0.9999…
. . .
OilWatch Africa | 1 | 0.05
Religion Paraguay | 1 | 0.05
Chicken Mole | 1 | 0.05
Republics of Kenya | 1 | 0.05
Atlantic Ocean | 1 | 0.05

Probability x ∈ C depends on the distribution of C.

Needed in Model: Distribution of C

28

…cities such as Tokyo…

Urn for C = City, containing balls labeled:
Tokyo, U.K., Sydney, Cairo, Tokyo, Tokyo, Atlanta, Atlanta, Yakima, Utah, U.K.

My solution: URNS Model

29

C – set of unique target labels

E – set of unique error labels

num(C) – distribution of target labels

num(E) – distribution of error labels

Urn – Formal Definition

30

distribution of target labels: num(C) = {2, 2, 1, 1, 1}

distribution of error labels: num(E) = {2, 1}

Urn for C = City, containing balls labeled:
U.K., Sydney, Cairo, Tokyo, Tokyo, Atlanta, Atlanta, Yakima, Utah, U.K.

Urn Example

31

If an extraction x appears k times in a set of n distinct occurrences of the pattern, what is the probability that x ∈ C?

Computing Probabilities

32

Given that an extraction x appears k times in n draws from the urn (with replacement), what is the probability that x ∈ C?

where s is the total number of balls in the urn

Computing Probabilities
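The closed-form expression itself is not reproduced in this transcript, so here is only a hedged Monte Carlo sketch of the same question, simulating draws with replacement from an urn with the multiplicities of the example above (the function name and trial count are illustrative):

import random
from collections import Counter

def urn_probability(num_C, num_E, n, k, trials=20000, seed=0):
    """Monte Carlo estimate of P(label in C | label seen exactly k times in n
    draws with replacement).  num_C / num_E are multiplicity lists, e.g. the
    slide's example num(C) = [2, 2, 1, 1, 1], num(E) = [2, 1]."""
    rng = random.Random(seed)
    balls = ([("C", i) for i, m in enumerate(num_C) for _ in range(m)] +
             [("E", i) for i, m in enumerate(num_E) for _ in range(m)])
    hits_C = hits_total = 0
    for _ in range(trials):
        counts = Counter(rng.choice(balls) for _ in range(n))
        for (kind, _), c in counts.items():
            if c == k:
                hits_total += 1
                hits_C += (kind == "C")
    return hits_C / hits_total if hits_total else float("nan")

print(urn_probability([2, 2, 1, 1, 1], [2, 1], n=10, k=2))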

33

URNS without labeled data

Needed: num(C), num(E)

Assumed to be Zipf

Frequency of ith element ∝ i^(−z)

With assumptions, learn Zipfian parameters for any class C from unlabeled data alone
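A small sketch of the Zipf assumption, generating illustrative multiplicities with frequency of the ith label proportional to i^(−z) (the parameters below are assumptions, not learned values):

def zipf_multiplicities(num_labels: int, z: float, scale: float = 1000.0):
    """Frequency of the i-th most common label is proportional to i**(-z)."""
    return [max(1, round(scale * (i ** -z))) for i in range(1, num_labels + 1)]

# Illustrative target and error distributions for an URNS-style urn.
num_C = zipf_multiplicities(num_labels=100, z=1.0)
num_E = zipf_multiplicities(num_labels=500, z=1.0, scale=50.0)
print(num_C[:5], num_E[:5])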

34

Observed frequency distribution = a mixture of a C Zipf distribution (weight p) and an E Zipf distribution (weight 1 – p).

p is constant across C, for a given pattern; num(E) is constant across C.

=> Learn num(C) from unlabeled data!

URNS without labeled data

35

C = City
n ≈ 50,000

Extraction | k | PURNS
New York | 1488 | 0.9999…
Chicago | 999 | 0.9999…
. . .
El Estor | 1 | 0.63
Nikki | 1 | 0.63
Ragaz | 1 | 0.63
Villegas | 1 | 0.63
Cres | 1 | 0.63
Northeastwards | 1 | 0.63

C = Country
n ≈ 50,000

Extraction | k | PURNS
United States | 3899 | 0.9999…
China | 1999 | 0.9999…
. . .
OilWatch Africa | 1 | 0.03
Religion Paraguay | 1 | 0.03
Chicken Mole | 1 | 0.03
Republics of Kenya | 1 | 0.03
Atlantic Ocean | 1 | 0.03
New Zeland | 1 | 0.03

Probabilities Assigned by URNS

36

[Bar chart: deviation from ideal log likelihood (0 to 5) for the classes City, Film, Country, and MayorOf, comparing urns, noisy-or, and pmi.]

URNS’s probabilities are 15-22x closer to optimal.

Probability Accuracy

37

Sensitivity Analysis

URNS assumes num(E), p are constant

If we alter parameter choices substantially, URNS still outperforms noisy-or, PMI by at least 8x

Most sensitive to p

p ~ 0.85 is relatively consistent across randomly selected classes from Wordnet(solvents, devices, thinkers, relaxants, mushrooms, mechanisms, resorts, flies, tones, machines, …)

38

Multiple urns:
Target label frequencies are correlated across urns
Error label frequencies can be uncorrelated

Phrase | Hits
“Omaha and other cities” | 950
“Illinois and other cities” | 24,400
“cities such as Omaha” | 930
“cities such as Illinois” | 6

Multiple Extraction Patterns

39

Benefits from Multiple Urns

Precision at K:

K | Single | Multiple
10 | 1.0 | 1.0
20 | 0.9875 | 1.0
50 | 0.925 | 0.955
100 | 0.8375 | 0.845
200 | 0.7075 | 0.71

Using multiple URNS reduces error by 29%.
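For reference, precision at K as used in the table above, as a minimal sketch (the ranked correctness flags are made up for illustration):

def precision_at_k(ranked_correctness, k: int) -> float:
    """Fraction of the top-k ranked extractions that are correct."""
    top = ranked_correctness[:k]
    return sum(top) / len(top)

# 1 = correct extraction, 0 = error, in ranked order (illustrative data).
ranked = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
print(precision_at_k(ranked, 5))   # 0.8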

40

URNS vs. MFA

41

URNS + MFA in SSL

MFA-ssl (urns) reduces error by 6%, on average.

42

URNS: Learnable from unlabeled data

All URNS parameters can be learned from unlabeled data alone [Theorem 20]

URNS implies PAC learnability from unlabeled data alone [Theorem 21]

Even with confusable MFs (i.e. even without conditional independence)

(with assumptions)

43

Parameters Learnable (1)

We can express the URNS model as:

a Compound Poisson Process.

The mixture gC(·) + gE(·) can be learned, given enough samples [Loh, 1993].

Task: learn the power-law distributions gC(·), gE(·) from their sum.

44

Parameters Learnable (2)

Assume:

Sufficiently high frequency => only target elements

Sufficiently low frequency => only errors

Then:

gC() + gE() =

45

1) Background
2) KH as a general problem structure
• Monotonic Feature Model
3) URNS model
• How does probability increase with repetition?
4) Challenge: The “long tail”
• Unsupervised language models

Outline

46

[Plot: number of times each extraction appears in the pattern (y-axis, 0 to 500) vs. frequency rank of the extraction (x-axis, 0 to 100,000).]

A mixture of correct and incorrect

e.g., (Dave Shaver, Pickerington), (Ronald McDonald, McDonaldland)

Tend to be correct

e.g., (Bloomberg, New York City)

Challenge: the “long tail”

47

Mayor McCheese

48

Strategy:
1) Model how common extractions occur in text

2) Rank sparse extractions by fit to model

Assessing Sparse Extractions

49

Terms in the same class tend to appear in similar contexts.

Context | Hits with Chicago | Hits with Twisp
“cities including __” | 42,000 | 1
“__ and other cities” | 37,900 | 0
“__ hotels” | 2,000,000 | 1,670
“mayor of __” | 657,000 | 82

The Distributional Hypothesis

50

Precomputed – scalable

Handle sparsity

Unsupervised Language Models

51

cities such as Chicago , Boston ,

But Chicago isn’t the best

cities such as Chicago , Boston ,

Los Angeles and Chicago .

Compute dot products between vectors of common and sparse extractions [cf. Ravichandran et al. 2005]

Chicago: <… 1 2 1 …> counted over contexts such as “such as <x> , Boston”, “But <x> isn’t the”, “Angeles and <x> .”

Baseline: context vectors
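A minimal sketch of this baseline, assuming context vectors are simple counts of context strings and similarity is their dot product (the contexts and counts below are illustrative):

from collections import Counter

def context_vector(occurrences):
    """Count the contexts (strings with the extraction replaced by <x>) in
    which a term appears."""
    return Counter(occurrences)

def dot(u: Counter, v: Counter) -> int:
    return sum(u[ctx] * v[ctx] for ctx in u.keys() & v.keys())

chicago = context_vector(["cities such as <x> , Boston",
                          "cities such as <x> , Boston",
                          "But <x> isn't the best"])
twisp = context_vector(["mayor of <x>", "cities such as <x> , Boston"])

print(dot(chicago, twisp))   # overlap on one shared context -> 2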

52

Twisp: <. . . 0 0 0 1 . . .> (sparse context vector)

HMM(Twisp): <0.14 0.01 … 0.06> over hidden states t = 1, 2, …, N

The HMM provides a “distributional summary”:
Compact (efficient – 10-50x less data retrieved)
Dense (accurate – 23-46% error reduction)

HMM Compresses Context Vectors

53

Task: Ranking sparse TextRunner extractions.

Metric: Area under precision-recall curve.

Language models reduce missing area by 39% over nearest competitor.

Experimental Results

Method | Headquartered | Merged | … | Average
Frequency | 0.710 | 0.784 | … | 0.713
PL | 0.651 | 0.851 | … | 0.785
LM | 0.810 | 0.908 | … | 0.851

54

Summary of Thesis

Formalization of Monotonic Features (MFs)

One MF enables PAC learnability from unlabeled data alone [Corollary 4.1]

MFs provide greater information gain vs. labels as the feature space increases in size [Theorem 8]

The MF model is formally distinct from other SSL approaches [Theorems 9 and 10]

The MF model is insufficient when “subconcepts” are present [Proposition 12]

55

Summary: MFs (Continued)

MFA: General SSL algorithm for MFs

Given MFs, MFA performance is equivalent to a state-of-the-art SSL algorithm with 160 labeled examples. [Table 2.1]

Even when MFs are not given, MFA can detect MFs in SSL, reducing error by 16%. [Figure 2.5]

MFA is not effective for UIE [Table 2.2 & Figure 2.6]

56

Summary: URNS

URNS: Formal model of redundancy in IE

Describes how probability increases with MF value [Proposition 13]

Models corroboration among multiple extraction mechanisms (multiple urns) [Proposition 14]

57

URNS Theoretical Results

Uniform Special Case (USC)
Odds in USC increase exponentially with repetition [Theorem 15]
Error decreases exponentially when parameters are known [Theorem 16]

Zipfian Case (ZC)
Closed-form expression for ZC probability given parameters, and for odds given repetitions [Theorem 17]
Error in ZC is bounded above by K / n^(1−ε) for any ε > 0 when parameters are known [Theorem 19]

58

URNS Theoretical Results (cont.)

Zipfian Case (ZC)
In ZC, with probability 1 − δ, the parameters of URNS can be estimated with error < ε, for all δ, ε > 0, given sufficient data [Theorem 20]

In ZC, URNS guarantees PAC learnability given only unlabeled data, provided the MF is sufficiently informative and a “separability” criterion is met in the concept space [Theorem 21]

59

URNS Experimental Results

Supervised Learning [Table 3.3]
19% error reduction over noisy-or
10% error reduction over logistic regression
Comparable performance to SVM

Semi-supervised IE [Figure 3.4]
6% error reduction over LP

Unsupervised IE [Figure 3.2]
1500% error reduction over noisy-or
2200% error reduction over PMI

Improved Efficiency [Table 3.2]
8x faster than PMI

60

Other Applications of URNS

Estimating extraction precision and recall [Table 3.7]

Identifying synonymous objects and relations (RESOLVER) [Yates & Etzioni, 2007]

Identifying functional relations in text [Ritter et al., 2008]

61

Assessing Sparse Extractions

Hidden Markov Model assessor (HMM-T):
Error reduction of 23-46% over context vectors on the typechecking task [Table 4.1]
Error reduction of 28% over context vectors on sparse unary extractions [Table 4.2]
10-50x more efficient than context vectors

Sparse extraction assessment with language models:
Error reduction of 39% over previous work [Table 4.3]
Massively more scalable than previous techniques

62

Acknowledgements: Oren Etzioni

Mike Cafarella
Pedro Domingos
Susan Dumais

Eric Horvitz
Alan Ritter

Stef Schoenmackers
Stephen Soderland

Dan Weld

63

64

65

Extraction is sometimes “easy”: generic extraction patterns

…cities such as Chicago… => Chicago ∈ City

C such as x => x ∈ C [Hearst, 1992]

But most sentences are “tough”:

We walked the tree-lined streets of the bustling metropolis that is Atlanta.

Extracting Atlanta ∈ City requires: syntactic parsing (Atlanta -> is -> metropolis) and subclass discovery (metropolis(x) => city(x))

Challenging & difficult to scale e.g. [Collins, 1997; Snow & Ng 2006]

Web IE without labeled examples

66

Extraction is sometimes “easy”: generic extraction patterns

…cities such as Chicago… => Chicago ∈ City

C such as x => x ∈ C [Hearst, 1992]

But most sentences are “tough”:

We walked the tree-lined streets of the bustling metropolis that is Atlanta.

“cities such as Atlanta” – 21,600 Hits

Web IE without labeled examples

67

Web IE without labeled examples

Extraction is sometimes “easy”: generic extraction patterns

…cities such as Chicago… => Chicago ∈ City

C such as x => x ∈ C [Hearst, 1992]

…Bloomberg, mayor of New York City… => (Bloomberg, New York City) ∈ Mayor

x, C of y => (x, y) ∈ C

The scale and redundancy of the Web make a multitude of facts “easy” to extract.

68

http://www.cs.washington.edu/research/textrunner/

[Banko et al., 2007]

TextRunner Search

69

Extraction patterns make errors:

“Erik Jonsson, CEO of Texas Instruments, mayor of Dallas from 1964-1971, and…”


But…

Task: Assess which extractions are correct Without hand-labeled examples At Web-scale

Thesis: “We can assess extraction correctness by leveraging redundancy and probabilistic models.”

70

1) Motivation

2) Background on Web IE

3) Estimating extraction correctness URNS model of redundancy

[Downey et al., IJCAI 2005]

(Distinguished Paper Award)

4) Challenge: The “long tail”

5) Machine learning generalization

Outline

71

1) Repetition
2) Multiple patterns

Phrase | Hits
“Chicago and other cities” | 94,400
“Illinois and other cities” | 23,100
“cities such as Chicago” | 42,500
“cities such as Illinois” | 7

Redundancy – Two Intuitions

Goal: a formal model of these intuitions.

Given a term x and a set of sentences containing extraction patterns for a class C, what is the probability that x ∈ C?

72

Given a term x and a set of sentences containing extraction patterns for a class C, what is the probability that x ∈ C?

If an extraction x appears k times in a set of n distinct occurrences of the pattern, what is the probability that x ∈ C?

Consider a single pattern suggesting C , e.g.,

countries such as x

Redundancy: Single Pattern

73

“…countries such as Saudi Arabia…”

“…countries such as the United States…”

“…countries such as Saudi Arabia…”

“…countries such as Japan…”

“…countries such as Africa…”

“…countries such as Japan…”

“…countries such as the United Kingdom…”

“…countries such as Iraq…”

“…countries such as Afghanistan…”

“…countries such as Australia…”

C = Country

n = 10 occurrences

Redundancy: Single Pattern

74

C = Country
n = 10

Extraction | k | Pnoisy-or
Saudi Arabia | 2 | 0.99
Japan | 2 | 0.99
United States | 1 | 0.9
Africa | 1 | 0.9
United Kingdom | 1 | 0.9
Iraq | 1 | 0.9
Afghanistan | 1 | 0.9
Australia | 1 | 0.9

p = probability the pattern yields a correct extraction; here p = 0.9

Noisy-or ignores:
– Sample size (n)
– Distribution of C

Naïve Model: Noisy-Or

Pnoisy-or(x ∈ C | x seen k times) = 1 – (1 – p)^k

[Agichtein & Gravano, 2000; Lin et al. 2003]

75

C = Country
n ≈ 50,000

Extraction | k | Pnoisy-or
United States | 3899 | 0.9999…
China | 1999 | 0.9999…
. . .
OilWatch Africa | 1 | 0.9
Religion Paraguay | 1 | 0.9
Chicken Mole | 1 | 0.9
Republics of Kenya | 1 | 0.9
Atlantic Ocean | 1 | 0.9

C = Country
n = 10

Extraction | k | Pnoisy-or
Saudi Arabia | 2 | 0.99
Japan | 2 | 0.99
United States | 1 | 0.9
Africa | 1 | 0.9
United Kingdom | 1 | 0.9
Iraq | 1 | 0.9
Afghanistan | 1 | 0.9
Australia | 1 | 0.9

As sample size increases, noisy-or becomes inaccurate.

Needed in Model: Sample Size

76

C = Country
n ≈ 50,000

Extraction | k | Pnoisy-or
United States | 3899 | 0.9999…
China | 1999 | 0.9999…
. . .
OilWatch Africa | 1 | 0.9
Religion Paraguay | 1 | 0.9
Chicken Mole | 1 | 0.9
Republics of Kenya | 1 | 0.9
Atlantic Ocean | 1 | 0.9

Needed in Model: Distribution of C

Pfreq(x ∈ C | x seen k times) = 1 – (1 – p)k/n

77

C = Country
n ≈ 50,000

Extraction | k | Pfreq
United States | 3899 | 0.9999…
China | 1999 | 0.9999…
. . .
OilWatch Africa | 1 | 0.05
Religion Paraguay | 1 | 0.05
Chicken Mole | 1 | 0.05
Republics of Kenya | 1 | 0.05
Atlantic Ocean | 1 | 0.05

Needed in Model: Distribution of C

Pfreq(x ∈ C | x seen k times) = 1 – (1 – p)k/n

78

C = City
n ≈ 50,000

Extraction | k | Pfreq
New York | 1488 | 0.9999…
Chicago | 999 | 0.9999…
. . .
El Estor | 1 | 0.05
Nikki | 1 | 0.05
Ragaz | 1 | 0.05
Villegas | 1 | 0.05
Northeastwards | 1 | 0.05

C = Country
n ≈ 50,000

Extraction | k | Pfreq
United States | 3899 | 0.9999…
China | 1999 | 0.9999…
. . .
OilWatch Africa | 1 | 0.05
Religion Paraguay | 1 | 0.05
Chicken Mole | 1 | 0.05
Republics of Kenya | 1 | 0.05
Atlantic Ocean | 1 | 0.05

Probability x ∈ C depends on the distribution of C.

Needed in Model: Distribution of C

79

…cities such as Tokyo…

Urn for C = City, containing balls labeled:
Tokyo, U.K., Sydney, Cairo, Tokyo, Tokyo, Atlanta, Atlanta, Yakima, Utah, U.K.

My solution: URNS Model

80

C – set of unique target labels

E – set of unique error labels

num(C) – distribution of target labels

num(E) – distribution of error labels

Urn – Formal Definition

81

distribution of target labels: num(C) = {2, 2, 1, 1, 1}

distribution of error labels: num(E) = {2, 1}

Urn for C = City, containing balls labeled:
U.K., Sydney, Cairo, Tokyo, Tokyo, Atlanta, Atlanta, Yakima, Utah, U.K.

Urn Example

82

If an extraction x appears k times in a set of n distinct occurrences of the pattern, what is the probability that x ∈ C?

Computing Probabilities

83

Given that an extraction x appears k times in n draws from the urn (with replacement), what is the probability that x ∈ C?

where s is the total number of balls in the urn

Computing Probabilities

84

Multiple urns:
Target label frequencies are correlated across urns
Error label frequencies can be uncorrelated

Phrase | Hits
“Chicago and other cities” | 94,400
“Illinois and other cities” | 23,100
“cities such as Chicago” | 42,500
“cities such as Illinois” | 7

Multiple Extraction Patterns

85

URNS without labeled data

Needed: num(C), num(E)

Assumed to be Zipf

Frequency of ith element ∝ i^(−z)

With assumptions, learn Zipfian parameters for any class C from unlabeled data alone

86

Observed frequency distribution = a mixture of a C Zipf distribution (weight p) and an E Zipf distribution (weight 1 – p).

p is constant across C, for a given pattern; num(E) is constant across C.

=> Learn num(C) from unlabeled data!

URNS without labeled data

87

C = City
n ≈ 50,000

Extraction | k | PURNS
New York | 1488 | 0.9999…
Chicago | 999 | 0.9999…
. . .
El Estor | 1 | 0.63
Nikki | 1 | 0.63
Ragaz | 1 | 0.63
Villegas | 1 | 0.63
Cres | 1 | 0.63
Northeastwards | 1 | 0.63

C = Country
n ≈ 50,000

Extraction | k | PURNS
United States | 3899 | 0.9999…
China | 1999 | 0.9999…
. . .
OilWatch Africa | 1 | 0.03
Religion Paraguay | 1 | 0.03
Chicken Mole | 1 | 0.03
Republics of Kenya | 1 | 0.03
Atlantic Ocean | 1 | 0.03
New Zeland | 1 | 0.03

Probabilities Assigned by URNS

88

[Bar chart: deviation from ideal log likelihood (0 to 5) for the classes City, Film, Country, and MayorOf, comparing urns, noisy-or, and pmi.]

URNS’s probabilities are 15-22x closer to optimal.

Probability Accuracy

89

Computation is efficient: continuous Zipf & Poisson approximations => closed-form expression for P(x ∈ C | evidence)

vs. Pointwise Mutual Information (PMI) [Etzioni et al. 2005]:

PMI is computed with search engine hit counts (inspired by [Turney, 2000])

URNS requires no hit count queries (~8x faster)

Scalability

90

Probabilistic model of redundancy

Accurate without hand-labeled examples: 15-22x improvement in accuracy

Scalable: 8x faster

[Downey et al., IJCAI 2005]

URNS: Contributions

91

1) Motivation

2) Background on Web IE

3) Estimating extraction correctness

4) Challenge: The “long tail” Language models to the rescue

[Downey et al., ACL 2007]

5) Machine learning generalization

Outline

92

[Plot: number of times each extraction appears in the pattern (y-axis, 0 to 500) vs. frequency rank of the extraction (x-axis, 0 to 100,000).]

A mixture of correct and incorrect

e.g., (Dave Shaver, Pickerington), (Ronald McDonald, McDonaldland)

Tend to be correct

e.g., (Bloomberg, New York City)

Challenge: the “long tail”

93

Mayor McCheese

94

Strategy:
1) Model how common extractions occur in text
2) Rank sparse extractions by fit to model

Unsupervised language models: precomputed (scalable), handle sparsity

Assessing Sparse Extractions

95

The “distributional hypothesis”: instances of the same relationship tend to appear in similar contexts.

…David B. Shaver was elected as the new mayor of Pickerington, Ohio.

http://www.law.capital.edu/ebriefsarchive/Summer2004/ClassActionsLeft.asp

…Mike Bloomberg was elected as the new mayor of New York City.

http://www.queenspress.com/archives/coverstories/2001/issue52/coverstory.htm

Assessing Sparse Extractions

96

Type errors are common:

Alexander the Great conquered Egypt… => (Great, Egypt) ∈ Conquered

Locally acquired malaria is now uncommon… => (Locally, malaria) ∈ Acquired

Type checking

97

cities such as Chicago , Boston ,

But Chicago isn’t the best

cities such as Chicago , Boston ,

Los Angeles and Chicago .

Compute dot products between vectors of common and sparse extractions [cf. Ravichandran et al. 2005]

Chicago: <… 1 2 1 …> counted over contexts such as “such as <x> , Boston”, “But <x> isn’t the”, “Angeles and <x> .”

Baseline: context vectors (1)

98

Miami: <. . . 71 25 1 513 . . .>
Twisp: <. . . 0 0 0 1 . . .>
(counts over contexts such as “when he visited X”, “he visited X and”, “visited X and other”, “X and other cities”)

Problems:
Vectors are large
Intersections are sparse

Baseline: context vectors (2)

99

Hidden Markov Model (HMM)

Hidden states ti, ti+1, ti+2, ti+3 – unobserved
Observed words wi, wi+1, wi+2, wi+3, e.g. “cities such as Seattle”

Hidden states ti ∈ {1, …, N} (N fairly small)

Train on unlabeled data – P(ti | wi = w) is N-dim. distributional summary of w

– Compare extractions using KL divergence
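A hedged sketch of the comparison step, assuming the distributional summaries P(t | w) have already been computed by a trained HMM (the vectors below are made up; only the KL comparison is illustrated):

import numpy as np

def kl(p: np.ndarray, q: np.ndarray) -> float:
    """KL(p || q) for two discrete distributions over the N hidden states."""
    eps = 1e-12
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Illustrative precomputed summaries P(t | w); in the thesis these come from
# an HMM trained on unlabeled text, with N fairly small.
p_seattle = np.array([0.60, 0.25, 0.10, 0.05])
p_twisp   = np.array([0.55, 0.30, 0.10, 0.05])   # sparse, but similar states
p_utah    = np.array([0.05, 0.10, 0.30, 0.55])   # different state profile

print(kl(p_seattle, p_twisp))   # small -> Twisp type-checks as a city
print(kl(p_seattle, p_utah))    # large -> Utah does not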

100

Twisp: <. . . 0 0 0 1 . . .> (sparse context vector)

P(t | Twisp): <0.14 0.01 … 0.06> over hidden states t = 1, 2, …, N

Distributional summary P(t | w):
Compact (efficient – 10-50x less data retrieved)
Dense (accurate – 23-46% error reduction)

HMM Compresses Context Vectors

101

Is Pickerington of the same type as Chicago?

Chicago , Illinois
Pickerington , Ohio

Extraction | “<x> , Illinois” | “<x> , Ohio” | …
Chicago | 291 | 0 | …
Pickerington | 0 | 1 | …

=> Context vectors say no, the dot product is 0!

Example

102

HMM Generalizes:

Chicago , Illinois

Pickerington , Ohio

Example

103

Task: Ranking sparse TextRunner extractions.

Metric: Area under precision-recall curve.

Language models reduce missing area by 39% over nearest competitor.

Experimental Results

Method | Headquartered | Merged | … | Average
Frequency | 0.710 | 0.784 | … | 0.713
PL | 0.651 | 0.851 | … | 0.785
LM | 0.810 | 0.908 | … | 0.851

104

No hand-labeled data

Scalability: language models are precomputed => can be queried at interactive speed

Improved accuracy over previous work [Downey et al., ACL 2007]

REALM: Contributions

105

1) Motivation

2) Background on Web IE

3) Estimating extraction correctness

4) Challenge: The “long tail”

5) Machine learning generalization Monotonic Features

[Downey et al., 2008 (submitted)]

Outline

106

Common Structure

Task | Hint | Bootstrap
Web IE | “x, C of y” | Distributional Hypothesis
Word Sense Disambiguation | “plant and animal species” | One sense per context, one sense per discourse [Yarowsky, 1995]
Information Retrieval | search query | Pseudo-relevance feedback [Kwok & Grunfield, 1995; Thompson & Turtle, 1995]
Document Classification | topic word, e.g. “politics” | Semi-supervised Learning [McCallum & Nigam, 1999; Gliozzo, 2005]

107

Common Structure

Task | Hint | Bootstrap
Web IE | “x, C of y” | Distributional Hypothesis
Word Sense Disambiguation | “plant and animal species” | One sense per context, one sense per discourse [Yarowsky, 1995]
Information Retrieval | search query | Pseudo-relevance feedback [Kwok & Grunfield, 1995; Thompson & Turtle, 1995]
Document Classification | topic word, e.g. “politics” | Bag-of-words and EM [McCallum & Nigam, 1999; Gliozzo, 2005]

Classification of examples x = (x1, …, xd) into classes y ∈ {0, 1}

Identity of a monotonic feature xi such that: P(y = 1 | xi) increases strictly monotonically with xi

108

Classical Supervised Learning

?

Learn function from x = (x1, …, xd) to y ∈ {0, 1}, given labeled examples (x, y)

x1

x2

109

Semi-Supervised Learning (SSL)

Learn function from x = (x1, …, xd) to y ∈ {0, 1}, given labeled examples (x, y) and unlabeled examples (x)

x1

x2

110

Monotonic Features

x1

x2

Learn function from x = (x1, …, xd) to y ∈ {0, 1}, given monotonic feature x1

and unlabeled examples (x)

111

Monotonic Features

x1

x2

Learn function from x = (x1, …, xd) to y ∈ {0, 1}, given monotonic feature x1

and unlabeled examples (x)

112

Monotonic Features

x1

x2

Learn function from x = (x1, …, xd) to y ∈ {0, 1}, given monotonic feature x1

and unlabeled examples (x)

113

1. No labeled data, MFs given (MA): with noisy labels from MFs, train any classifier

2. Labeled data, no MFs given (MA-SSL): detect MFs from labeled data, then run MA

3. Labeled data and MFs given (MA-BOTH): run MA with given & detected MFs

Exploiting MF Structure

114

20 Newsgroups dataset

Task: Given text, determine newsgroup of origin

(MFs: newsgroup name)

Without labeled data:

Experimental Results

115

MA-SSL provides a 15% error reduction for 100-400 labeled examples.

MA-BOTH provides a 31% error reduction for 0-800 labeled examples.

Experimental Results

116

Co-training: requires labeled examples and known views

Semi-supervised smoothness assumptions:
Cluster assumption
Manifold assumption
…both provably distinct from MF structure

Relationship to other approaches

117

Best known methods for IE without labeled data

Probabilities of correctness (URNS):
Massive improvements in accuracy (15-22x)

Handling sparse data (language models):
Vastly more scalable than previous work
Accuracy wins (39% error reduction)

Generalization beyond IE:
Monotonic Feature abstraction – widely applicable
Accuracy wins in document classification

Summary of Results

118

IE => Web IE. But we still need:

A coherent knowledge base
Is MayorOf(Chicago, Daley) the same “Chicago” as Starred-in(Chicago, Zeta-Jones)?
Future work: entity resolution, schema discovery

Improved accuracy and coverage
Currently we ignore character/document features, recursive structure, etc.
Future work: more sophisticated language models (e.g. PCFGs)

Conclusions and Future Work

119

Thanks!

Acknowledgements: Oren Etzioni

Mike Cafarella
Pedro Domingos
Susan Dumais

Eric Horvitz
Stef Schoenmackers

Dan Weld

120

Self-Supervised Learning

Setting | Input examples | Output
Supervised | Labeled | Classifier
Semi-supervised | Labeled & Unlabeled | Classifier
Self-supervised | Unlabeled | Classifier
Unsupervised | Unlabeled | Clustering

121

Language modeling for IE: REALM is simple and ignores:
Character- or document-level features
Web structure
Recursive structure (PCFGs)

Goal: “x won an Oscar for playing a villain…” – what is P(x)?

From facts to knowledge: entity resolution and inference

Future Work

122

Named entity location: lexical statistics improve the state of the art [Downey et al., IJCAI 2007]

Modeling Web search: characterizing user behavior [Downey et al., SIGIR 2007] (poster); [Liebling et al., 2008] (submitted)

Predictive models [Downey et al., IJCAI 2007]

Other Work

123

Web Fact-Finding

Who has won three or more Academy Awards?

124

Web Fact-Finding
Problems:

User has to pick the right words, often a tedious process:

"world foosball champion in 1998“ – 0 hits“world foosball champion” 1998 – 2 hits, no answer

What if I could just ask for P(x) in“x was world foosball champion in 1998?”

How far can language modeling and the distributional hypothesis take us?

125

Miami: <. . . 98 0 20 250 30 513 . . .>
Twisp: <. . . 5 0 1 2 1 1 . . .>
Star Wars: <. . . 1 1000 0 2 1 1 . . .>
(counts over contexts such as “X soundtrack”, “he visited X and”, “cities such as X”, “X and other cities”, “X lodging”)

KnowItAll Hypothesis

Distributional Hypothesis

126

Miami: <. . . 98 0 20 250 30 513 . . .>
Twisp: <. . . 5 0 1 2 1 1 . . .>
Star Wars: <. . . 1 1000 0 2 1 1 . . .>
(counts over contexts such as “X soundtrack”, “he visited X and”, “cities such as X”, “X and other cities”, “X lodging”)

KnowItAll Hypothesis

Distributional Hypothesis

127

invent in real time

TextRunner

Ranked by frequency

REALM improves precision of the top 20 extractions by an average of 90%.

128

Tarantella, Santa Cruz

International Business Machines Corporation, Armonk

Mirapoint, Sunnyvale

ALD, Sunnyvale

PBS, Alexandria

General Dynamics, Falls Church

Jupitermedia Corporation, Darien

Allegro, Worcester

Trolltech, Oslo

Corbis, Seattle

TR Precision: 40% REALM Precision: 100%

Improving TextRunner: Example (1)

“headquartered” Top 10:
company, Palo Alto

held company, Santa Cruz

storage hardware and software, Hopkinton

Northwestern Mutual, Tacoma

1997, New York City

Google, Mountain View

PBS, Alexandria

Linux provider, Raleigh

Red Hat, Raleigh

TI, Dallas

TR Precision: 40%

129

Arabs, Rhodes

Arabs, Istanbul

Assyrians, Mesopotamia

Great, Egypt

Assyrians, Kassites

Arabs, Samarkand

Manchus, Outer Mongolia

Vandals, North Africa

Arabs, Persia

Moors, Lagos

TR Precision: 60% REALM Precision: 90%

Improving TextRunner: Example (2)

“conquered” Top 10:
Great, Egypt

conquistador, Mexico

Normans, England

Arabs, North Africa

Great, Persia

Romans, part

Romans, Greeks

Rome, Greece

Napoleon, Egypt

Visigoths, Suevi Kingdom

TR Precision: 60%

130

Previous n-gram technique (1)

1) Form a context vector for each extracted argument:…

cities such as Chicago , Boston ,

But Chicago isn’t the best

cities such as Chicago , Boston ,

Los Angeles and Chicago .

2) Compute dot products between extractions and seeds in this space [cf. Ravichandran et al. 2005].

<… 1 2 1 …> counted over contexts: “such as <x> , Boston”, “But <x> isn’t the”, “Angeles and <x> .”

131

Miami: <. . . 71 25 1 513 . . .>
Twisp: <. . . 0 0 0 1 . . .>
(counts over contexts such as “when he visited X”, “he visited X and”, “visited X and other”, “X and other cities”)

Problems:
Vectors are large
Intersections are sparse

Previous n-gram technique (2)

132

Miami: <. . . 71 25 1 513 . . .> (context vector)

P(t | Miami): <0.14 0.01 … 0.06> over hidden states t = 1, 2, …, N

Latent state distribution P(t | w):
Compact (efficient – 10-50x less data retrieved)
Dense (accurate – 23-46% error reduction)

Compressing Context Vectors

133

Example: N-Grams on Sparse Data

Is Pickerington of the same type as Chicago?

Chicago , Illinois
Pickerington , Ohio

Extraction | “<x> , Illinois” | “<x> , Ohio” | …
Chicago | 291 | 0 | …
Pickerington | 0 | 1 | …

=> N-grams say no, the dot product is 0!

134

HMM Generalizes:

Chicago , Illinois

Pickerington , Ohio

Example: HMM-T on Sparse Data

135

HMM-T Limitations

Learning iterations take time proportional to (corpus size × T^(k+1))

T = number of latent states

k = HMM order

We use limited values T=20, k=3 Sufficient for typechecking (Santa Clara is a city) Too coarse for relation assessment

(Santa Clara is where Intel is headquartered)

136

The REALM Architecture: two steps for assessing R(arg1, arg2)

Typechecking
Ensure arg1 and arg2 are of the proper type for R, e.g. rule out MayorOf(Intel, Santa Clara)
Leverages all occurrences of each arg

Relation Assessment
Ensure R actually holds between arg1 and arg2, e.g. rule out MayorOf(Giuliani, Seattle)

Both steps use pre-computed language models => scales to Open IE

137

Type checking isn’t enough: “NY Mayor Giuliani toured downtown Seattle.”

Want: How do arguments behave in relation to each other?

Relation Assessment

138

N-gram language model:

P(wi, wi-1, … wi-k)

arg1, arg2 often far apart => large k (inaccurate)

REL-GRAMS (1)

139

Relational Language Model (REL-GRAMS):

For any two arguments e1, e2:

P(wi, wi-1, … wi-k | wi = e1, e1 near e2)

k can be small – REL-GRAMS still captures entity relationships
Mitigate sparsity with the BM25 metric (from IR)

Combine with HMM-T by multiplying ranks.

REL-GRAMS (2)
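A minimal sketch of the rank combination mentioned above, multiplying HMM-T and REL-GRAMS ranks (the extraction names and ranks below are illustrative):

def combine_ranks(hmm_rank: dict, relgram_rank: dict) -> list:
    """Combine HMM-T and REL-GRAMS rankings by multiplying ranks (lower is
    better); ties are broken arbitrarily by the sort."""
    combined = {e: hmm_rank[e] * relgram_rank[e] for e in hmm_rank}
    return sorted(combined, key=combined.get)

hmm_rank = {"(Bloomberg, New York City)": 1,
            "(Giuliani, Seattle)": 3,
            "(Shaver, Pickerington)": 2}
relgram_rank = {"(Bloomberg, New York City)": 2,
                "(Giuliani, Seattle)": 3,
                "(Shaver, Pickerington)": 1}
print(combine_ranks(hmm_rank, relgram_rank))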

140

Experiments

Task: Re-rank sparse TextRunner extractions for Conquered, Founded, Headquartered, Merged.
REALM vs.

TextRunner (TR) – frequency ordering (equivalent to PMI [Etzioni et al, 2005] and Urns [Downey et al, 2005])

Pattern Learning (PL) – based on Snowball [Agichtein 2000]

HMM-T and REL-GRAMS in isolation

141

Learning num(C) and num(E)

From untagged data: an ill-posed problem
• num(C) can vary wildly with C, e.g., countries vs. cities vs. mayors

Assume:
1) Consistent precision of a single co-occurrence, e.g., in a randomly drawn phrase “C such as x”, x ∈ C about p of the time (0.9 for [Etzioni, 2005])
2) num(E) is constant for all C
3) num(C) is Zipf

Estimate num(C) from untagged data using EM [Downey et al. 2005] (Also: multiple contexts)

142

URNS without labeled data

[Figure: observed frequency vs. frequency rank, decomposed (with weights p and 1 – p) into target and error frequency distributions.]

P(x ∈ C) in “C such as x”: assumed ~0.9

Error distribution: assumed large, with Zipf parameter 1.0

143

URNS without labeled data

[Figure: observed frequency vs. frequency rank, decomposed (with weights p and 1 – p) into the target distribution num(C) and the error distribution num(E).]

num(C) can vary wildly (e.g. cities vs. countries); it is learned from unlabeled data using EM.

144

Distributional Similarity

Naïve Approach – find sentences containing seed1&seed2 or arg1&arg2:

Compare context distributions:

P(wb,…, we | seed1, seed2 )

P(wb,…, we | arg1, arg2)

But e – b can be large

Many parameters, sparse data => inaccuracy

wb … wh seed1 wh+2 … wi seed2 wi+2 … we

wb … wh arg1 wh+2 … wi arg2 wi+2 … we

145

http://www.cs.washington.edu/research/textrunner/

TextRunner Search

146

Large textual corpora are redundant,

and we can use this observation to bootstrap extraction and classification models 

from minimally labeled, or even completely unlabeled data.

Thesis

147

Supervised classification task:
Feature space X of d-tuples x = (x1, …, xd)
Binary output space Y = {0, 1}
Inputs: labeled examples DL = {(x, y)} ~ P(x, y)
Output: concept c: X -> {0, 1} that approximates P(y | x).

Monotonic Features

148

Semi-supervised classification task:
Feature space X of d-tuples x = (x1, …, xd)
Binary output space Y = {0, 1}
Inputs:
Labeled examples DL = {(x, y)} ~ P(x, y)
Unlabeled examples DU = {(x)} ~ P(x)
Output: concept c: X -> {0, 1} that approximates P(y | x).

Monotonic Features

Smaller

149

Semi-supervised classification task:
Feature space X of d-tuples x = (x1, …, xd)
Binary output space Y = {0, 1}
Inputs:
Labeled examples DL = {(x, y)} ~ P(x, y)
Unlabeled examples DU = {(x)} ~ P(x)
Monotonic features M ⊆ {1,…,d} such that: P(y=1 | xi) increases strictly monotonically with xi for all i ∈ M.

Output: concept c: X -> {0, 1} that approximates P(y | x).

Potentially empty!

Monotonic Features

150

Problem: num(C) can vary wildly, e.g. cities vs. countries

Assume: num(C), num(E) Zipf distributed
freq. of ith element ∝ i^(−z)
p and num(E) independent of C

Learn num(C) from unlabeled data alone, with Expectation Maximization

URNS without labeled data
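As a much simpler stand-in for the EM procedure referenced above, a hedged sketch that fits the Zipf exponent z by a least-squares line in log-log space over observed extraction counts (the counts are illustrative):

import numpy as np

def fit_zipf_exponent(counts):
    """Fit z in freq(i) proportional to i**(-z) by least squares on log-log
    ranked counts.  A crude stand-in for the EM procedure, not the thesis's
    estimator."""
    counts = np.sort(np.asarray(counts, dtype=float))[::-1]
    ranks = np.arange(1, len(counts) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(counts), 1)
    return -slope

observed = [3899, 1999, 1200, 800, 400, 200, 90, 40, 10, 3]
print(round(fit_zipf_exponent(observed), 2))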

151

20 Newsgroups dataset

Task: Given text, determine newsgroup of origin

(MFs: newsgroup name)

Without labeled data:

Experimental Results

152

Typecheck each arg by comparing HMM’s distributional summaries:

Rank arguments in ascending order of f(arg)

f(arg) = (1/|seeds|) Σi KL( P(t | seedi) || P(t | arg) )

HMM Type-checking

153

Classical Supervised Learning

?

Learn function from x = (x1, …, xd) to y ∈ {0, 1}, given labeled examples (x, y)

x1

x2

154

Semi-supervised Learning (SSL)

Learn function from x = (x1, …, xd) to y ∈ {0, 1}, given labeled examples (x, y) and unlabeled examples (x)

x1

x2

155

Self-supervised Learning

x1

x2

Learn function from x = (x1, …, xd) to y ∈ {0, 1}, given unlabeled examples (x)

156

Self-supervised Learning

x1

x2

Learn function from x = (x1, …, xd) to y ∈ {0, 1}, given unlabeled examples (x), and the system labels its own examples

157

Self-supervised Learning

Setting | Input examples | Output
Supervised | Labeled | Classifier
Semi-supervised | Labeled & Unlabeled | Classifier
Self-supervised | Unlabeled | Classifier
Unsupervised | Unlabeled | Clustering

158

Supervised classification task:
Feature space X of d-tuples x = (x1, …, xd)
Binary output space Y = {0, 1}
Inputs: labeled examples DL = {(x, y)} ~ P(x, y)
Output: concept c: X -> {0, 1} that approximates P(y | x).

Monotonic Features

159

Semi-supervised classification task:
Feature space X of d-tuples x = (x1, …, xd)
Binary output space Y = {0, 1}
Inputs:
Labeled examples DL = {(x, y)} ~ P(x, y)
Unlabeled examples DU = {(x)} ~ P(x)
Output: concept c: X -> {0, 1} that approximates P(y | x).

Monotonic Features

Smaller

160

Semi-supervised classification task:
Feature space X of d-tuples x = (x1, …, xd)
Binary output space Y = {0, 1}
Inputs:
Labeled examples DL = {(x, y)} ~ P(x, y)
Unlabeled examples DU = {(x)} ~ P(x)
Monotonic features M ⊆ {1,…,d} such that: P(y=1 | xi) increases strictly monotonically with xi for all i ∈ M.

Output: concept c: X -> {0, 1} that approximates P(y | x).

Potentially empty!

Monotonic Features
