Open IE and Universal Schema Discovery
Heng Ji
Acknowledgement: some slides from Daniel Weld and Dan Roth
Traditional, Supervised I.E.
Raw Data
Labeled Training
Data
Learning Algorithm
Extractor
Kirkland-based Microsoft is the largest software company. Boeing moved its headquarters to Chicago in 2003. Hank Levy was named chair of Computer Science & Engr.
…
HeadquarterOf(<company>,<city>)
Solutions
Open IE, Universal Schema Discovery, Concept Typing
Open Information Extraction
The challenge of Web extraction is to do Open Information Extraction: the number of relations is unbounded, and the Web corpus contains billions of documents.
How open IE systems work
Learn a general model of how relations are expressed (in a particular language), based on unlexicalized features such as part-of-speech tags (e.g., identify a verb).
Learn domain-independent regular expressions (e.g., punctuation, commas).
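A toy illustration of the unlexicalized idea: using only POS tags, take a noun group, a following verb group, and a second noun group as an (arg1, relation, arg2) triple. The tag groupings and the hand-supplied tagging below are drastic simplifying assumptions, not any particular system's rules.

```python
def extract_triples(tagged):
    """tagged: list of (token, POS) pairs -> list of (arg1, rel, arg2) triples."""
    triples = []
    n = len(tagged)

    def read_group(j, prefixes):
        # consume a maximal run of tokens whose POS starts with one of the prefixes
        toks = []
        while j < n and tagged[j][1].startswith(prefixes):
            toks.append(tagged[j][0])
            j += 1
        return toks, j

    i = 0
    while i < n:
        arg1, j = read_group(i, ("NNP", "NN"))
        if not arg1:
            i += 1
            continue
        rel, k = read_group(j, ("VB", "IN", "TO"))
        # require at least one verb in the relation group
        if not any(tag.startswith("VB") for _, tag in tagged[j:k]):
            i = j
            continue
        arg2, _ = read_group(k, ("NNP", "NN"))
        if arg2:
            triples.append((" ".join(arg1), " ".join(rel), " ".join(arg2)))
        i = k if arg2 else j
    return triples

# a real system would run a POS tagger; the tags here are supplied by hand
sent = [("Microsoft", "NNP"), ("is", "VBZ"), ("based", "VBN"),
        ("in", "IN"), ("Kirkland", "NNP")]
triples = extract_triples(sent)  # [("Microsoft", "is based in", "Kirkland")]
```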
Methods for Open IE
Self Supervision: Kylin (Wikipedia), Shrinkage & Retraining
Hearst Patterns, PMI Validation, Subclass Extraction
Pattern Learning, Structural Extraction
List Extraction & WebTables, TextRunner
Kylin: Self-Supervised Information Extraction from Wikipedia
[Wu & Weld CIKM 2007]
Its county seat is Clearfield.
As of 2005, the population density was 28.2/km².
Clearfield County was created in 1804 from parts of Huntingdon and Lycoming Counties but was administered as part of Centre County until 1812.
2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water.
From infoboxes to a training set
Kylin Architecture
Schema Mapping
Heuristics: Edit History, String Similarity
• Experiments: Precision 94%, Recall 87%
• Future: Integrated Joint Inference
Person: birth_date, birth_place, name, other_names, … ↔ Performer: birthdate, location, name, othername, …
Main Lesson: Self Supervision
Find structured data source Use heuristics to generate training data
E.g. Infobox attributes & matching sentences
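The infobox-matching heuristic can be sketched as follows. The infobox and sentences are illustrative; note how noisy the heuristic is — "Clearfield" also matches a sentence about Clearfield County.

```python
def generate_training_data(infobox, sentences):
    """Kylin-style self-supervision sketch: a sentence becomes a training
    example for an attribute extractor when the infobox value for that
    attribute appears verbatim in the sentence."""
    examples = []
    for attr, value in infobox.items():
        for sent in sentences:
            if value in sent:
                examples.append((sent, attr, value))
    return examples

# illustrative infobox and article sentences (Clearfield County, PA)
infobox = {"county_seat": "Clearfield", "created": "1804"}
sentences = [
    "Its county seat is Clearfield.",
    "Clearfield County was created in 1804 from parts of Huntingdon "
    "and Lycoming Counties.",
]
examples = generate_training_data(infobox, sentences)
# 3 examples: the county_seat value also (spuriously) matches the second sentence
```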
The KnowItAll System
Predicates: Country(X)
Domain-independent Rule Templates
<class> “such as” NP
Bootstrapping
Extraction Rules: “countries such as” NP
Discriminators: “country X”
Extractor ← World Wide Web
Extractions: Country(“France”)
Assessor
Validated Extractions: Country(“France”), prob=0.999
Unary predicates: instances of a class
Unary predicates: instanceOf(City), instanceOf(Film), instanceOf(Company), …
Good recall and precision from generic patterns:
<class> “such as” X
X “and other” <class>
Instantiated rules:
“cities such as” X / X “and other cities”
“films such as” X / X “and other films”
“companies such as” X / X “and other companies”
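The two generic patterns can be instantiated directly as regular expressions; a minimal sketch, where the NP shape (one or two capitalized words) is a crude simplifying assumption:

```python
import re

def hearst_extract(text, class_plural):
    """Collect candidate instances of a class (given as its plural noun,
    e.g. "cities") via the two generic patterns."""
    np = r"([A-Z][a-z]+(?: [A-Z][a-z]+)?)"  # rough proper-NP shape (assumption)
    patterns = [
        rf"{class_plural} such as {np}",    # <class> "such as" X
        rf"{np} and other {class_plural}",  # X "and other" <class>
    ]
    found = set()
    for pat in patterns:
        found.update(re.findall(pat, text))
    return found

text = ("He toured cities such as Boston, while "
        "Tukwila and other cities were skipped.")
candidates = hearst_extract(text, "cities")  # {"Boston", "Tukwila"}
```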
Recall – Precision Tradeoff
High-precision rules apply to only a small percentage of sentences on the Web.
           hits for “X”    “cities such as X”   “X and other cities”
Boston     365,000,000     15,600,000           12,000
Tukwila    1,300,000       73,000               44
Gjatsk     88              34                   0
Hadaslav   51              1                    0
“Redundancy-based extraction” ignores all but the unambiguous references.
Limited Recall with Binary Rules
Relatively high recall for unary rules:
“companies such as” X 2,800,000 Web hits
X “and other companies” 500,000 Web hits
Low recall for binary rules:
X “is the CEO of Microsoft” 160 Web hits
X “is the CEO of Wal-mart” 19 Web hits
X “is the CEO of Continental Grain” 0 Web hits
X “, CEO of Microsoft” 6,700 Web hits
X “, CEO of Wal-mart” 700 Web hits
X “, CEO of Continental Grain” 2 Web hits
Examples of Extraction Errors
Rule: countries such as X => instanceOf(Country, X)
“We have 31 offices in 15 countries such as London and France.”
=> instanceOf(Country, London), instanceOf(Country, France)
Rule: X “and other cities” => instanceOf(City, X)
“A comparative breakdown of the cost of living in Klamath County and other cities follows.”
=> instanceOf(City, Klamath County)
“Generate and Test” Paradigm
1. Find extractions from generic rules
2. Validate each extraction
Assign a probability that each extraction is correct: use search-engine hit counts to compute PMI (pointwise mutual information) between the extraction and “discriminator” phrases for the target concept.
PMI-IR: P.D.Turney, “Mining the Web for synonyms: PMI-IR versus LSA on TOEFL”. In Proceedings of ECML, 2001.
Computing PMI Scores
Measures mutual information between the extraction and target concept.
I = an instance of a target concept instanceOf(Country, “France”)
D = a discriminator phrase for the concept “ambassador to X”
D+I = insert instance into discriminator phrase“ambassador to France”
PMI(I, D) = |hits(D + I)| / |hits(I)|
Example of PMI
Discriminator: “countries such as X”. Instance: “France” vs. “London”. PMI for France >> PMI for London (2 orders of magnitude). We need features for the probability update that distinguish “high” PMI from “low” PMI for a discriminator.
“countries such as France”: 27,800 hits; “France”: 14,300,000 hits
PMI(France) = 27,800 / 14,300,000 ≈ 1.94E-3
“countries such as London”: 71 hits; “London”: 12,600,000 hits
PMI(London) = 71 / 12,600,000 ≈ 5.63E-6
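The two scores above can be reproduced directly from the quoted hit counts; a small sketch (in KnowItAll the counts would come from live search-engine queries):

```python
def pmi(hits_d_plus_i, hits_i):
    """PMI as defined on the slide: |hits(D + I)| / |hits(I)|."""
    return hits_d_plus_i / hits_i

pmi_france = pmi(27_800, 14_300_000)  # ~1.94e-3
pmi_london = pmi(71, 12_600_000)      # ~5.63e-6
ratio = pmi_france / pmi_london       # France exceeds London by over 2 orders of magnitude
```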
Chicago Unmasked
City sense Movie sense
Unmasked PMI for the recessive (movie) sense, excluding hits from the dominant (city) sense:
PMI_unmask = |hits(Chicago − City + Movie)| / |hits(Chicago − City)|
Impact of Unmasking on PMI
Name          Recessive   Original   Unmask   Boost
Washington    city        0.50       0.99     96%
Casablanca    city        0.41       0.93     127%
Chevy Chase   actor       0.09       0.58     512%
Chicago       movie       0.02       0.21     972%
How to Increase Recall?
RL: learn class-specific patterns. “Headquartered in <city>”
SE: recursively extract subclasses. “Scientists such as physicists and chemists”
LE: extract lists of items (~ Google Sets).
List Extraction (LE)
1. Query engine with known items.
2. Learn a wrapper for each result page.
3. Collect a large number of lists.
4. Sort items by number of list “votes”.
LE+A = sort the list according to the Assessor.
Evaluation: Web recall, at precision = 0.9.
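The voting step above can be sketched as follows; the scraped lists here are made up, and in practice they would come from wrappers learned over result pages:

```python
from collections import Counter

def rank_by_votes(lists):
    """Each scraped list casts one vote per distinct item; rank items by votes."""
    votes = Counter()
    for items in lists:
        votes.update(set(items))  # de-duplicate within a list
    return [item for item, _ in votes.most_common()]

scraped = [
    ["Boston", "Chicago", "Seattle"],
    ["Boston", "Seattle", "Tukwila"],
    ["Boston", "Chicago"],
]
ranking = rank_by_votes(scraped)  # "Boston" (3 votes) ranks first
```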
TextRunner
TextRunner works in two phases.
1. Using a conditional random field, the extractor learns to assign labels to each of the words in a sentence.
2. Extracts one or more textual triples that aim to capture (some of) the relationships in each sentence.
Information Redundancy Assumption (Banko et al., 2008)
Performance Comparison of Traditional IE vs. Open IE
General Remarks
Open IE methods exploit data-driven techniques (e.g., domain-independent patterns) to extract relation or event tuples. This dramatically enhanced the scalability of IE, but:
• They obtain substantially lower recall than traditional IE.
• They rely heavily on information redundancy to validate extraction results, and thus suffer from the “long-tail” knowledge sparsity problem.
• They are incapable of generalizing lexical contexts in order to name new fact types.
• They over-simplify the IE problem: IE != sentence-level relation extraction; what about entity and event types?
• They focused on making the types of relations and events unrestricted, while still using other IE components, such as name tagging and semantic role labeling, for pre-defined types.
Universal Schema
Discover a wide range of domain-independent fact types:
• manually (AMR, extended NE),
• or automatically by relation clustering based on coreferential arguments (e.g., Go_off represented by “plant”, “set_off”, and “injure” because they share coreferential arguments),
• or by combining patterns with external knowledge sources such as Freebase or query logs.
Some name tagging work focused on extracting more fine-grained types beyond the traditional entity types (person, organization, and geo-political entities).
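The relation-clustering idea above can be sketched with a toy single-link procedure: triggers whose tuples share a (coreference-resolved) argument string are merged into one candidate fact type. The tuples, the argument strings, and the single-link merge are illustrative assumptions, not the published method.

```python
from collections import defaultdict

def cluster_by_arguments(event_tuples):
    """event_tuples: (trigger, arg1, arg2). Triggers sharing any resolved
    argument end up in one cluster (single-link merge)."""
    by_arg = defaultdict(set)
    for trigger, a1, a2 in event_tuples:
        by_arg[a1].add(trigger)
        by_arg[a2].add(trigger)
    clusters = []
    for group in by_arg.values():
        merged = set(group)
        rest = []
        for c in clusters:
            if c & merged:   # overlapping trigger sets -> same cluster
                merged |= c
            else:
                rest.append(c)
        clusters = rest + [merged]
    return clusters

# "plant", "set_off", "injure" share the argument "the bomb" -> one Go_off-like type
event_tuples = [
    ("plant",   "the attacker", "the bomb"),
    ("set_off", "the attacker", "the bomb"),
    ("injure",  "the bomb",     "civilians"),
    ("hire",    "the company",  "engineers"),
]
clusters = cluster_by_arguments(event_tuples)
```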
Open Questions
Is there a closed set of universal schema?
How many knowledge sources should we use for discovering universal schema? What are they?
Many knowledge bases are available; how can we unify them (automatically)?
Concept Typing
Chunking, Bootstrapping, Clustering with Contexts
Concept Mention Extraction (Wang et al., 2013)
Concept Mention Extraction (Tsai et al., 2013)
Identify and categorize mentions of concepts (Gupta and Manning, 2011): TECHNIQUE and APPLICATION.
“We apply support vector machines on text classification.”
Unsupervised bootstrapping algorithm (Yarowsky, 1995; Collins and Singer, 1999).
The proposed algorithm:
1. Extract noun phrases (Punyakanok and Roth, 2001).
2. For each category, initialize a decision list by seeds.
3. For several rounds:
   1. Annotate NPs using the decision lists.
   2. Extract top features from newly annotated phrases, and add them into the decision lists.
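The bootstrapping loop above can be sketched as follows. The seed features, the substring-match labeling rule, and the word-unigram features are illustrative assumptions, far simpler than the cited algorithms.

```python
from collections import Counter

def bootstrap(nps, seeds, rounds=2, top_k=2):
    """Grow per-category decision lists from seed features over several rounds."""
    decision_lists = {cat: set(feats) for cat, feats in seeds.items()}
    for _ in range(rounds):
        # 1. annotate NPs using the current decision lists
        labeled = {cat: [] for cat in decision_lists}
        for np in nps:
            for cat, feats in decision_lists.items():
                if any(f in np for f in feats):  # crude substring match
                    labeled[cat].append(np)
                    break
        # 2. add top features from newly annotated phrases
        for cat, phrases in labeled.items():
            counts = Counter(w for p in phrases for w in p.split())
            for feat, _ in counts.most_common(top_k):
                decision_lists[cat].add(feat)
    return decision_lists

nps = ["support vector machines", "svm classifier",
       "text classification", "named entity recognition"]
seeds = {"TECHNIQUE": ["svm"], "APPLICATION": ["classification"]}
learned = bootstrap(nps, seeds)
```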
Seeds
[Figure: four papers with concept mentions and their citations. “support vector machine”, “svm-based classification”, “svm”, and “maximal_margin_classifiers” cite (Cortes, 1995) and (Vapnik, 1995); “c4.5” and “decision_trees” cite (Quinlan, 1993).]
• c4.5, decision trees
• support vector machine, svm-based classification
• svm, maximal margin classifiers
Citation-Context Based Concept Clustering (CitClus): cluster mentions into semantically coherent concepts.
1. Group concept mentions by citation context.
2. Merge clusters based on lexical similarity between mentions in the clusters.
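The two CitClus steps can be sketched as follows. The mention/citation pairs mirror the figure above; using a shared word token as the lexical-similarity test is an illustrative simplification.

```python
import re
from collections import defaultdict

def citclus(mention_citations):
    """mention_citations: (mention, citation) pairs -> list of mention clusters."""
    # 1. group concept mentions by citation context
    groups = defaultdict(set)
    for mention, citation in mention_citations:
        groups[citation].add(mention)

    def tokens(cluster):
        return {t for m in cluster for t in re.split(r"\W+", m) if t}

    # 2. single-link merge of clusters whose mentions share a word token
    merged = []
    for group in groups.values():
        cluster = set(group)
        rest = []
        for c in merged:
            if tokens(c) & tokens(cluster):
                cluster |= c
            else:
                rest.append(c)
        merged = rest + [cluster]
    return merged

mentions = [
    ("support vector machine", "(Vapnik, 1995)"),
    ("svm", "(Vapnik, 1995)"),
    ("maximal margin classifiers", "(Vapnik, 1995)"),
    ("svm-based classification", "(Cortes, 1995)"),
    ("c4.5", "(Quinlan, 1993)"),
    ("decision trees", "(Quinlan, 1993)"),
]
clusters = citclus(mentions)  # SVM mentions merge; decision-tree mentions stay separate
```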
Remarks
Where to draw the line for different granularities?
Maybe the type of each phrase should be a path in the ontology structure?
Is an absolute cold-start extraction possible?
How to use other resources such as scenario models and social cognitive theories?
Paper presentations
Lifu’s presentation + Phrase Typing Exercise