
Page 1:

Open IE and Universal Schema Discovery

Heng Ji

[email protected]

Acknowledgement: some slides from Daniel Weld and Dan Roth

Page 2:

Traditional, Supervised I.E.

Raw Data

Labeled Training Data

Learning Algorithm

Extractor

Kirkland-based Microsoft is the largest software company.
Boeing moved its headquarters to Chicago in 2003.
Hank Levy was named chair of Computer Science & Engr.

HeadquarterOf(<company>,<city>)

Page 3:

Solutions


Open IE
Universal Schema Discovery
Concept Typing

Page 4:

Open Information Extraction

The challenge of Web extraction is to do Open Information Extraction: an unbounded number of relations, over a Web corpus containing billions of documents.

Page 5:

How open IE systems work

Learn a general model of how relations are expressed (in a particular language), based on unlexicalized features such as part-of-speech tags (e.g., identify a verb).

Learn domain-independent regular expressions (e.g., punctuation, commas).
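As an illustration of the first idea, here is a minimal Python sketch, assuming the sentence arrives already tokenized and POS-tagged; the tag filter and helper names are illustrative, not from any particular Open IE system. One unlexicalized noun-verb-noun pattern yields a candidate relation tuple regardless of domain.

```python
# Minimal sketch: domain-independent, unlexicalized extraction driven by POS tags.
# Input is assumed to be pre-tagged (word, POS) pairs; the pattern and helper
# names are illustrative, not from any specific Open IE system.

def extract_tuple(tagged):
    """Find an (arg1, relation, arg2) candidate via a noun-verb-noun pattern."""
    nouns = [i for i, (_, pos) in enumerate(tagged) if pos.startswith("NN")]
    verbs = [i for i, (_, pos) in enumerate(tagged) if pos.startswith("VB")]
    for v in verbs:
        left = [i for i in nouns if i < v]
        right = [i for i in nouns if i > v]
        if left and right:
            arg1 = tagged[left[-1]][0]          # nearest noun to the left
            rel = " ".join(w for w, pos in tagged[v:right[0]]
                           if pos.startswith(("VB", "IN", "TO")))
            arg2 = tagged[right[0]][0]          # nearest noun to the right
            return (arg1, rel, arg2)
    return None

tagged = [("Microsoft", "NNP"), ("is", "VBZ"), ("based", "VBN"),
          ("in", "IN"), ("Kirkland", "NNP")]
print(extract_tuple(tagged))   # ('Microsoft', 'is based in', 'Kirkland')
```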

Page 6:

Methods for Open IE

Self Supervision

Kylin (Wikipedia), Shrinkage & Retraining

Hearst Patterns, PMI Validation, Subclass Extraction

Pattern Learning, Structural Extraction

List Extraction & WebTables, TextRunner

Page 7:

Kylin: Self-Supervised Information Extraction from Wikipedia

[Wu & Weld CIKM 2007]

Its county seat is Clearfield.

As of 2005, the population density was 28.2/km².

Clearfield County was created in 1804 from parts of Huntingdon and Lycoming Counties but was administered as part of Centre County until 1812.

2,972 km² (1,147 mi²) of it is land and 17 km²  (7 mi²) of it (0.56%) is water.

From infoboxes to a training set

Page 8:

Kylin Architecture

Page 9:

Schema Mapping

Heuristics: Edit History, String Similarity

• Experiments: Precision 94%, Recall 87%
• Future: Integrated Joint Inference

Person: birth_date, birth_place, name, other_names, …
Performer: birthdate, location, name, othername, …

Page 10:

Main Lesson: Self Supervision

Find a structured data source
Use heuristics to generate training data, e.g., infobox attributes & matching sentences
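A minimal sketch of this heuristic; the infobox and sentences below are illustrative, echoing the Clearfield County example. Any sentence containing an infobox attribute's value becomes a positive training example for that attribute's extractor.

```python
# Sketch of Kylin-style self-supervision: pair infobox attribute values with
# article sentences that mention them to build a labeled training set.
# The example infobox and article sentences are illustrative.

infobox = {
    "county_seat": "Clearfield",
    "established": "1804",
}
sentences = [
    "Its county seat is Clearfield.",
    "Clearfield County was created in 1804 from parts of Huntingdon and Lycoming Counties.",
    "As of 2005, the population density was 28.2/km².",
]

def build_training_set(infobox, sentences):
    """Return (sentence, attribute, value) triples where the value occurs in the sentence."""
    examples = []
    for attr, value in infobox.items():
        for sent in sentences:
            if value in sent:
                examples.append((sent, attr, value))
    return examples

for sent, attr, value in build_training_set(infobox, sentences):
    print(f"{attr} = {value!r}  <-  {sent}")
```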

Page 11:

The KnowItAll System

Predicates: Country(X)

Domain-independent Rule Templates: <class> "such as" NP

Bootstrapping

Extraction Rules: "countries such as" NP

Discriminators: "country X"

Extractor (runs over the World Wide Web)

Extractions: Country("France")

Assessor

Validated Extractions: Country("France"), prob = 0.999
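A toy sketch of the bootstrapping step, assuming a small in-memory corpus in place of the Web; the regular expression and the stub assessor score are assumptions, not KnowItAll's actual implementation.

```python
import re

# Sketch of the pipeline on a toy corpus: instantiate the domain-independent
# template <class> "such as" NP for one predicate, extract candidates, and
# pass them to a (stub) assessor. Corpus and assessor score are illustrative.

TEMPLATE = r'{label} such as ((?:[A-Z][a-z]+ ?)+)'

def instantiate(class_label):
    """Turn the generic rule template into a class-specific extraction rule."""
    return re.compile(TEMPLATE.format(label=class_label))

def extract(rule, corpus):
    return [m.group(1).strip() for sent in corpus for m in rule.finditer(sent)]

def assess(instance):
    """Stub assessor: a real system would use PMI with discriminator phrases."""
    return 0.9  # placeholder probability

corpus = [
    "He has visited countries such as France and Japan.",
    "We celebrate holidays such as Easter in spring.",
]
rule = instantiate("countries")
for inst in extract(rule, corpus):
    print(f'Country("{inst}"), prob={assess(inst)}')
```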

Page 12:

Unary predicates: instances of a class

Unary predicates: instanceOf(City), instanceOf(Film), instanceOf(Company), …

Good recall and precision from generic patterns:

<class> "such as" X
X "and other" <class>

Instantiated rules:
"cities such as" X,    X "and other cities"
"films such as" X,     X "and other films"
"companies such as" X, X "and other companies"

Page 13:

Recall – Precision Tradeoff

High precision rules apply to only a small percentage of sentences on Web

            hits for "X"     "cities such as X"   "X and other cities"
Boston      365,000,000      15,600,000           12,000
Tukwila     1,300,000        73,000               44
Gjatsk      88               34                   0
Hadaslav    51               1                    0

“Redundancy-based extraction” ignores all but the unambiguous references.

Page 14:

Limited Recall with Binary Rules

Relatively high recall for unary rules:

“companies such as” X 2,800,000 Web hits

X “and other companies” 500,000 Web hits

Low recall for binary rules:

X “is the CEO of Microsoft” 160 Web hits

X “is the CEO of Wal-mart” 19 Web hits

X “is the CEO of Continental Grain” 0 Web hits

X “, CEO of Microsoft” 6,700 Web hits

X “, CEO of Wal-mart” 700 Web hits

X “, CEO of Continental Grain” 2 Web hits

Page 15:

Examples of Extraction Errors

Rule: countries such as X => instanceOf(Country, X)

“We have 31 offices in 15 countries such as London and France.”

=> instanceOf(Country, London), instanceOf(Country, France)

Rule: X and other cities => instanceOf(City, X)
"A comparative breakdown of the cost of living in Klamath County and other cities follows."

=> instanceOf(City, Klamath County)

Page 16:

“Generate and Test” Paradigm

1. Find extractions from generic rules
2. Validate each extraction

Assign a probability that each extraction is correct
Use search engine hit counts to compute PMI (pointwise mutual information) between the extraction and "discriminator" phrases for the target concept

PMI-IR: P.D.Turney, “Mining the Web for synonyms: PMI-IR versus LSA on TOEFL”. In Proceedings of ECML, 2001.

Page 17:

Computing PMI Scores

Measures mutual information between the extraction and target concept.

I = an instance of a target concept: instanceOf(Country, "France")

D = a discriminator phrase for the concept: "ambassador to X"

D + I = insert the instance into the discriminator phrase: "ambassador to France"

PMI(I, D) = |Hits(D + I)| / |Hits(I)|
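A sketch of this assessment step, with search engine hit counts stubbed out using the numbers from the next slide:

```python
# Sketch of the PMI assessment step. Real hit counts would come from a search
# engine; here they are hard-coded from the example on the next slide.

HITS = {
    "countries such as France": 27_800,
    "France": 14_300_000,
    "countries such as London": 71,
    "London": 12_600_000,
}

def hits(query):
    return HITS.get(query, 0)

def pmi(instance, discriminator):
    """PMI(I, D) = |Hits(D + I)| / |Hits(I)|, with D + I the instantiated phrase."""
    d_plus_i = discriminator.replace("X", instance)
    return hits(d_plus_i) / max(hits(instance), 1)

print(pmi("France", "countries such as X"))   # ~1.9e-3
print(pmi("London", "countries such as X"))   # ~5.6e-6
```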

Page 18:

Example of PMI

Discriminator: "countries such as X"
Instance: "France" vs. "London"
PMI for France >> PMI for London (about 2 orders of magnitude)
Need features for the probability update that distinguish "high" PMI from "low" PMI for a discriminator

"countries such as France": 27,800 hits
"France": 14,300,000 hits
PMI = 27,800 / 14,300,000 ≈ 1.9E-3

"countries such as London": 71 hits
"London": 12,600,000 hits
PMI = 71 / 12,600,000 ≈ 5.6E-6

Page 19:


Chicago Unmasked

City sense vs. movie sense

(Figure: hit counts comparing |Hits(Chicago, City)| and |Hits(Chicago, Movie, City)|; the dominant city sense masks the recessive movie sense, which unmasking corrects for.)

Page 20:


Impact of Unmasking on PMI

Name          Recessive sense   Original PMI   Unmasked PMI   Boost
Washington    city              0.50           0.99           96%
Casablanca    city              0.41           0.93           127%
Chevy Chase   actor             0.09           0.58           512%
Chicago       movie             0.02           0.21           972%

Page 21:


How to Increase Recall?

RL: learn class-specific patterns, e.g., "Headquartered in <city>"

SE: recursively extract subclasses, e.g., "Scientists such as physicists and chemists"

LE: extract lists of items (~ Google Sets)

Page 22:


List Extraction (LE)

1. Query the engine with known items.
2. Learn a wrapper for each result page.
3. Collect a large number of lists.
4. Sort items by number of list "votes".

LE+A = sort the list according to the Assessor.

Evaluation: Web recall, at precision = 0.9.
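A minimal sketch of steps 3-4, assuming the wrapper-induction step has already produced the candidate lists (the lists below are made up): each list casts one vote per item, and items are ranked by vote count.

```python
from collections import Counter

# Sketch of the voting step in List Extraction: each harvested list casts one
# vote per item it contains; items are ranked by vote count.
# The harvested lists below are illustrative.

harvested_lists = [
    ["France", "Japan", "Brazil"],
    ["France", "Japan", "Canada", "Paris"],   # noisy list with a bad item
    ["Japan", "Brazil", "France"],
]

votes = Counter()
for lst in harvested_lists:
    votes.update(set(lst))        # one vote per list, even if an item repeats

for item, count in votes.most_common():
    print(item, count)
# France/Japan get 3 votes, Brazil 2, Canada/Paris 1, so noise sinks to the bottom
```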

Page 23:

TextRunner

Page 24:

TextRunner works in two phases.

1. Using a conditional random field, the extractor learns to assign labels to each of the words in a sentence.

2. Extracts one or more textual triples that aim to capture (some of) the relationships in each sentence.
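A sketch of phase 2 only: the CRF of phase 1 is replaced here by a hard-coded label sequence (the ENT/REL/O tags are illustrative), so the triple-assembly step can be shown on its own.

```python
# Sketch of TextRunner's two phases on one sentence. Phase 1 is normally a
# CRF over unlexicalized features; here it is replaced by a hard-coded label
# sequence so the triple-assembly step (phase 2) can be shown on its own.

tokens = ["Boeing", "moved", "its", "headquarters", "to", "Chicago"]
labels = ["ENT",    "REL",   "REL", "REL",          "REL", "ENT"]   # stub for phase 1

def assemble_triples(tokens, labels):
    """Phase 2: read off (arg1, relation phrase, arg2) spans from the labels."""
    spans, current = [], None
    for tok, lab in zip(tokens, labels):
        if lab == "O":
            current = None
            continue
        if current and current[0] == lab:
            current[1].append(tok)          # extend the current span
        else:
            current = (lab, [tok])          # start a new span
            spans.append(current)
    triples = []
    for i in range(len(spans) - 2):
        a, r, b = spans[i], spans[i + 1], spans[i + 2]
        if (a[0], r[0], b[0]) == ("ENT", "REL", "ENT"):
            triples.append((" ".join(a[1]), " ".join(r[1]), " ".join(b[1])))
    return triples

print(assemble_triples(tokens, labels))
# [('Boeing', 'moved its headquarters to', 'Chicago')]
```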

Page 25:

Information Redundancy Assumption (Banko et al., 2008)


Page 26:

Performance Comparison of Traditional IE vs. Open IE


Page 27:
Page 28:

General Remarks


Exploit data-driven methods (e.g., domain-independent patterns) to extract relation or event tuples

This dramatically enhances the scalability of IE, but obtains substantially lower recall than traditional IE

Heavily relies on information redundancy to validate extraction results, and thus suffers from the "long-tail" knowledge sparsity problem

Incapable of generalizing lexical contexts in order to name new fact types

Over-simplified IE problem: IE != sentence-level relation extraction; what about entity and event types?

These systems focused on making the types of relations and events unrestricted, while still using other IE components such as name tagging and semantic role labeling for pre-defined types

Page 29:

Universal Schema


Discover a wide range of domain-independent fact types manually (e.g., AMR, extended NE),

or automatically, by relation clustering based on coreferential arguments (e.g., Go_off represented by "plant", "set_off", and "injure" because they share coreferential arguments),

or by combining patterns with external knowledge sources such as Freebase or query logs

Some name tagging work has focused on extracting more fine-grained types beyond the traditional entity types (person, organization, and geo-political entities)
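A sketch of the automatic route, under strong simplifying assumptions: each relation trigger is paired with the (already coreference-resolved) argument entities it was observed with, and triggers whose argument sets overlap are greedily merged. The observations below echo the Go_off example and are illustrative.

```python
# Sketch of relation clustering based on coreferential arguments: triggers that
# are observed with the same (coreference-resolved) arguments end up in one
# cluster. The observations below are illustrative.

observations = {
    "plant":   {("bomber-1", "bomb-7"), ("bomber-2", "device-3")},
    "set_off": {("bomber-1", "bomb-7")},
    "injure":  {("bomb-7", "victims-4")},   # shares bomb-7 with the others
    "hire":    {("company-9", "worker-5")},
}

def cluster_by_shared_arguments(obs):
    """Greedily merge triggers whose argument-entity sets intersect."""
    clusters = []
    for trigger, pairs in obs.items():
        entities = {e for pair in pairs for e in pair}
        for cluster in clusters:
            if cluster["entities"] & entities:
                cluster["triggers"].add(trigger)
                cluster["entities"] |= entities
                break
        else:
            clusters.append({"triggers": {trigger}, "entities": set(entities)})
    return [c["triggers"] for c in clusters]

print(cluster_by_shared_arguments(observations))
# [{'plant', 'set_off', 'injure'}, {'hire'}] -> one Go_off-like cluster
```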

Page 30:

Open Questions


Is there a closed set of universal schemas?

How many knowledge sources should we use for discovering universal schemas? What are they?

Many knowledge bases are available; how do we unify them (automatically)?

Page 31:

Concept Typing


Chunking
Bootstrapping
Clustering with Contexts

Page 32:

Concept Mention Extraction (Wang et al., 2013)


Page 33:

Concept Mention Extraction (Tsai et al., 2013)

Identify and categorize mentions of concepts (Gupta and Manning, 2011): TECHNIQUE and APPLICATION
"We apply support vector machines on text classification."

Unsupervised bootstrapping algorithm (Yarowsky, 1995; Collins and Singer, 1999)

The proposed algorithm:
1. Extract noun phrases (Punyakanok and Roth, 2001).
2. For each category, initialize a decision list with seeds.
3. For several rounds:
   1. Annotate NPs using the decision lists.
   2. Extract top features from newly annotated phrases, and add them to the decision lists.
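A toy sketch of steps 2-3, with the NP extraction of step 1 assumed done; the bag-of-context-words features and the one-feature-per-round setting are simplifications, not the original algorithm's choices.

```python
from collections import Counter

# Toy sketch of the bootstrapping loop: seeds -> annotate NPs via decision
# lists -> add top context features back to the lists. NPs are given here as
# (phrase, context words); features and thresholds are simplifications.

nps = [
    ("support vector machines", {"apply", "classification"}),
    ("conditional random fields", {"apply", "labeling"}),
    ("text classification", {"task", "benchmark"}),
    ("named entity recognition", {"task", "labeling"}),
]
decision_lists = {
    "TECHNIQUE":   {"support vector machines"},      # seed phrases
    "APPLICATION": {"text classification"},
}

for _ in range(2):                                   # a few bootstrapping rounds
    for category, dlist in decision_lists.items():
        # 1. Annotate NPs whose phrase or context matches the decision list.
        annotated = [(p, ctx) for p, ctx in nps if p in dlist or ctx & dlist]
        # 2. Add the most frequent context feature of the annotated NPs.
        feature_counts = Counter(f for _, ctx in annotated for f in ctx)
        for feature, _ in feature_counts.most_common(1):
            dlist.add(feature)
        dlist.update(p for p, _ in annotated)        # remember annotated phrases

print(decision_lists["TECHNIQUE"])    # now also covers "conditional random fields"
print(decision_lists["APPLICATION"])  # now also covers "named entity recognition"
```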


Page 34:

Seeds


Page 35:

(Figure: concept mentions in several papers, e.g. "support vector machine", "svm-based classification", "svm", "maximal margin classifiers", "c4.5", "decision trees", shown with their citation contexts (Vapnik, 1995), (Cortes, 1995), (Quinlan, 1993); mentions sharing citation context form initial groups, which are then merged into concept clusters.)

Citation-Context Based Concept Clustering (CitClus): cluster mentions into semantically coherent concepts

1. Group concept mentions by citation context.
2. Merge clusters based on lexical similarity between mentions in the clusters.
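A minimal sketch of the two steps; the mention-citation data loosely follows the figure above, and the prefix-based lexical similarity is an assumed stand-in for whatever measure the original method uses.

```python
from collections import defaultdict

# Sketch of CitClus: (1) group concept mentions by shared citation context,
# (2) merge groups whose mentions are lexically similar. The data below is
# illustrative; the prefix-based similarity is an assumption.

mention_citations = {
    "support vector machine":     "(Cortes, 1995)",
    "svm-based classification":   "(Cortes, 1995)",
    "svm":                        "(Vapnik, 1995)",
    "maximal margin classifiers": "(Vapnik, 1995)",
    "c4.5":                       "(Quinlan, 1993)",
    "decision trees":             "(Quinlan, 1993)",
}

# Step 1: one initial cluster per citation context.
groups = defaultdict(set)
for mention, citation in mention_citations.items():
    groups[citation].add(mention)

def tokens(cluster):
    return {tok for mention in cluster for tok in mention.split()}

def lexically_similar(c1, c2):
    """Assumed criterion: some token of one cluster is a prefix of a token of the other."""
    return any(len(a) >= 3 and len(b) >= 3 and (a.startswith(b) or b.startswith(a))
               for a in tokens(c1) for b in tokens(c2))

# Step 2: greedily merge lexically similar clusters.
merged = []
for cluster in groups.values():
    for existing in merged:
        if lexically_similar(existing, cluster):
            existing |= cluster
            break
    else:
        merged.append(set(cluster))

print(merged)
# Two clusters: {support vector machine, svm-based classification, svm,
#                maximal margin classifiers} and {c4.5, decision trees}
```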

Page 36:

Remarks


Where do we draw the line between different granularities?

Maybe the type of each phrase should be a path in the ontology structure?

Is an absolute cold-start extraction possible?

How to use other resources such as scenario models and social cognitive theories?

Page 37:

Paper presentations


Page 38:

Lifu’s presentation + Phrase Typing Exercise
