Relation Extraction from the Web using Distant Supervision

Isabelle Augenstein, Diana Maynard, Fabio Ciravegna
Department of Computer Science, University of Sheffield, UK

{i.augenstein,d.maynard,f.ciravegna}@dcs.shef.ac.uk

November 28, 2014

EKAW 2014

2

•  Large knowledge bases are useful for search, question answering etc. but far from complete

•  Approach: automatic knowledge base population (KBP) methods using Web information extraction (IE)
   1)  Extracting entities and relations between them from text on Web pages
   2)  Combining information from several sources to populate KBs

Problem

3

•  Why can’t we just use existing tool X?

Motivation

4

•  Why can’t we just use existing tool X?

•  IE methods requiring manual effort
   •  Manually crafted extraction patterns, e.g. “X is a professor at Y”
   •  Supervised learning: statistical models, manually annotated training data as input
   Ø  Biased towards a domain, e.g. Biology, newswire, Wikipedia

•  IE methods requiring no manual effort
   •  Unsupervised learning: discovering patterns, clustering
      Ø  Difficult to map to schema
   •  Bootstrapping: learning patterns iteratively starting with prior knowledge, e.g. list of names
      Ø  “Semantic drift”

Existing Approaches

5

•  Requirements
   •  Works for Web text
   •  Extract with respect to knowledge base
   •  No manual effort required

•  What can we do?
   •  Use knowledge base to train statistical model
   •  Distant supervision: automatically label text with relations from knowledge base, train machine learning classifier
   Ø  Extract relations with respect to KB, no manual effort

Proposed Approach

6

Creating positive & negative training examples → Feature Extraction → Classifier Training → Prediction of New Relations

Distant Supervision

7

Creating positive & negative training examples → Feature Extraction → Classifier Training → Prediction of New Relations

Distant Supervision = Supervised learning + Automatically generated training data

Distant Supervision

8

“If two entities participate in a relation, any sentence that contains those two entities might express that relation.” (Mintz, 2009)

Amy Jade Winehouse was a singer and songwriter known for her eclectic mix of musical genres including R&B, soul and jazz.
Blur helped to popularise the Britpop genre.
Beckham rose to fame with the all-female pop group Spice Girls.

Name (different lexicalisations) | Genre
Amy Winehouse, Amy Jade Winehouse, Wino, … | R&B, soul, jazz, …
Blur, … | Britpop, …
Spice Girls, … | pop, …
… | …
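The labelling heuristic on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the system described in the talk: the `KB` and `SUBJECT_NAMES` dictionaries and the functions `mentions` and `label_sentence` are hypothetical names, the knowledge base is reduced to the toy table above, and entity recognition is replaced by plain string matching.

```python
import re

# Hypothetical toy knowledge base: (subject, property) -> object lexicalisations.
KB = {
    ("Amy Winehouse", "genre"): ["R&B", "soul", "jazz"],
    ("Blur", "genre"): ["Britpop"],
    ("Spice Girls", "genre"): ["pop"],
}

# Alternative lexicalisations of each subject, as in the table above.
SUBJECT_NAMES = {
    "Amy Winehouse": ["Amy Winehouse", "Amy Jade Winehouse", "Wino"],
    "Blur": ["Blur"],
    "Spice Girls": ["Spice Girls"],
}


def mentions(text, phrase):
    """Return True if phrase occurs in text and is not part of a longer word."""
    pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
    return re.search(pattern, text) is not None


def label_sentence(sentence):
    """Return the (subject, property, object) triples the sentence might express."""
    labels = []
    for (subject, prop), objects in KB.items():
        # The sentence must mention the subject under any of its lexicalisations.
        if not any(mentions(sentence, name) for name in SUBJECT_NAMES[subject]):
            continue
        # Every object of the relation found in the same sentence yields a label.
        for obj in objects:
            if mentions(sentence, obj):
                labels.append((subject, prop, obj))
    return labels
```

Calling `label_sentence("Blur helped to popularise the Britpop genre.")` yields one positive training example for the genre relation. The assumption is deliberately loose ("might express"), which is why noisy labels need to be filtered later.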

Distant Supervision

9

•  Collect corpus
   •  From Web, using search patterns containing relation

•  Relation identification
   •  Recognise all entities in sentences
   •  Check if sentences contain subject, object of relations

•  Seed selection
   •  Discover, then discard potentially noisy training data

•  Extract features
   •  Standard features: context words, part of speech tags (noun, verb) etc.

•  Train classifier

•  Apply to hold-out part of corpus
   •  Same relation identification procedure as for training data
   •  Extracting relations across sentence boundaries

•  Integrate / combine results
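To make the "Extract features" step concrete, here is a small Python sketch of context-word features for one relation candidate. It is an assumption about what "standard features" could look like, not the feature set used in the paper: the function name `context_features`, the feature-name prefixes and the `window` parameter are hypothetical, and part-of-speech tags are left out because they would require an external tagger.

```python
def context_features(tokens, subj_span, obj_span, window=2):
    """Bag-of-words context features for one (subject, object) candidate.

    subj_span and obj_span are (start, end) token indices, end exclusive.
    """
    # Order the two mentions by position in the sentence.
    first, second = sorted([subj_span, obj_span])
    features = {}
    # Words between the two mentions.
    for tok in tokens[first[1]:second[0]]:
        features["between=" + tok.lower()] = 1
    # Up to `window` words to the left of the first mention.
    for tok in tokens[max(0, first[0] - window):first[0]]:
        features["left=" + tok.lower()] = 1
    # Up to `window` words to the right of the second mention.
    for tok in tokens[second[1]:second[1] + window]:
        features["right=" + tok.lower()] = 1
    # Whether the subject appears before the object.
    features["subject_first"] = int(subj_span <= obj_span)
    return features


tokens = "Blur helped to popularise the Britpop genre .".split()
features = context_features(tokens, subj_span=(0, 1), obj_span=(5, 6))
```

Feature dictionaries of this shape can be passed to most off-the-shelf classifiers after vectorisation; which classifier the authors used is not stated on the slide.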

Distant Supervision System

10

•  Collect corpus
   •  From Web, using search patterns containing relation

•  Relation identification
   •  Recognise all entities in sentences
   •  Check if sentences contain subject, object of relations

•  Seed selection
   •  Discover, then discard potentially noisy training data

•  Extract features
   •  Standard features

•  Train classifier

•  Apply to hold-out part of corpus
   •  Same as for training data
   •  Extracting relations across sentences

•  Integrate / combine results

Distant Supervision System

Research described in paper

11

•  Web crawl corpus, created using entity-specific search queries, consisting of 1 million Web pages

Class | Property / Relation
Book | author, characters, publication date, genre, original language
Musical Artist | album, active (start), active (end), genre, record label, origin, track
Film | release date, director, producer, language, genre, actor, character
Politician | birthdate, birthplace, educational institution, nationality, party, religion, spouses
Organisation | industry, employees, city, country, date founded, founders
Educational Institution | school type, mascot, colours, city, country, date founded
River | origin, mouth, length, basin countries, contained by

Evaluation: Corpus

12

Generating training data: is it that easy?

Let It Be is the twelfth album by The Beatles which contains their hit single Let It Be.

Name | Album | Track
The Beatles | Let It Be | Let It Be
… | … | …

Seed Selection

13

Generating training data: is it that easy?

Let It Be is the twelfth album by The Beatles which contains their hit single Let It Be.

Name | Album | Track
The Beatles | Let It Be | Let It Be
… | … | …

•  Use ‘Let It Be’ mentions as positive training examples for album or for track?

•  Problem: if both mentions of ‘Let It Be’ are used to extract features for both album and track, wrong weights are learnt

•  How can such ambiguous examples be detected?
   •  Develop methods to detect, then automatically discard potentially ambiguous training data

Seed Selection

14

Ambiguity within an entity
•  Example: Let It Be is the twelfth album by The Beatles which contains their hit single Let It Be.
•  Let It Be can be both an album and a track of the musical artist The Beatles
•  For every relation, consisting of a subject, a property and an object (s, p, o), is the subject related to (at least) two different objects with the same lexicalisation which express two different relations?

•  Unam:
   •  Retrieve the number of such senses using the Freebase API
   •  Discard the lexicalisation of the object as positive training data if it has at least two different senses within an entity
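A minimal Python sketch of the Unam check follows. It assumes the knowledge base is available as an in-memory list of triples rather than through the Freebase API; `TRIPLES` and `unambiguous_seeds` are hypothetical names, and the data is a toy example.

```python
from collections import defaultdict

# Hypothetical toy triples: (subject, property, object lexicalisation).
TRIPLES = [
    ("The Beatles", "album", "Let It Be"),
    ("The Beatles", "track", "Let It Be"),
    ("The Beatles", "album", "Abbey Road"),
    ("The Beatles", "track", "Come Together"),
]


def unambiguous_seeds(triples):
    """Keep only triples whose object lexicalisation has one sense per subject."""
    # Collect the properties under which each (subject, lexicalisation) pair occurs.
    senses = defaultdict(set)
    for subject, prop, obj in triples:
        senses[(subject, obj)].add(prop)
    # Discard lexicalisations with at least two senses within the same entity.
    return [
        (subject, prop, obj)
        for subject, prop, obj in triples
        if len(senses[(subject, obj)]) < 2
    ]


seeds = unambiguous_seeds(TRIPLES)
```

Here both "Let It Be" triples are dropped, so neither mention in the example sentence is used as a positive training example for album or track.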

Seed Selection

15

Ambiguity across classes
•  Example: common names of book authors or common genres, e.g. “Jack mentioned that he read On the Road”, in which Jack is falsely recognised as the author Jack Kerouac.

•  Stop: discard lexicalisations that are stopwords
•  Stat: estimate how ambiguous a lexicalisation of an object is compared to other lexicalisations of objects of the same relation
   •  For every lexicalisation of an object of a relation, retrieve the number of senses using the Freebase API (example: for Jack n=1066)
   •  Compute the frequency distribution per relation with min, max, median (50th percentile), lower quartile (25th percentile) and upper quartile (75th percentile) (example: for author: min=0, max=3059, median=10, lower=4, upper=32)
   •  For every lexicalisation of an object of a relation, if the number of senses > upper quartile (or the lower quartile, or median, depending on the model), discard it (example: 1066 > 32 -> Jack will be discarded)
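The Stat filter can be sketched with the Python standard library. This is an illustration under stated assumptions: sense counts are given as a plain dictionary instead of being fetched from the Freebase API, all counts except Jack's (n=1066, from the slide) are invented, and the quartile interpolation of `statistics.quantiles` may differ from the one used in the paper. The names `sense_count_stats` and `stat_filter` are hypothetical.

```python
import statistics


def sense_count_stats(sense_counts):
    """Min, quartiles and max of the sense counts for one relation."""
    lower, median, upper = statistics.quantiles(sense_counts, n=4, method="inclusive")
    return {
        "min": min(sense_counts),
        "lower": lower,
        "median": median,
        "upper": upper,
        "max": max(sense_counts),
    }


def stat_filter(senses_by_lexicalisation, threshold="upper"):
    """Keep lexicalisations whose sense count does not exceed the threshold."""
    stats = sense_count_stats(list(senses_by_lexicalisation.values()))
    cutoff = stats[threshold]
    return {
        lex: n for lex, n in senses_by_lexicalisation.items() if n <= cutoff
    }


# Hypothetical sense counts for objects of the "author" relation.
author_senses = {
    "Jack Kerouac": 2,
    "Kerouac": 4,
    "Jane Austen": 3,
    "Austen": 10,
    "Tolkien": 6,
    "J. R. R. Tolkien": 1,
    "Jack": 1066,
}

kept = stat_filter(author_senses, threshold="upper")
```

With these toy counts, "Jack" exceeds the upper quartile and is discarded while "Jack Kerouac" is kept. Passing `threshold="median"` or `threshold="lower"` gives the stricter variants mentioned above.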

Seed Selection

16

•  Seed selection
   •  Statistical methods for discarding noisy training data improve precision, e.g. Musical Artist: 0.62 -> 0.74; Politician: 0.85 -> 0.86

•  Relation candidate recognition
   •  Using additional methods to recognise named entities which do not rely on existing tools increases number of extractions

•  Information integration
   •  Statistical methods for information integration improve results over simple combination
      Overall precision: Simple: 0.74, Strategic combination: 0.86

•  Extracting across sentence boundaries
   •  Improves precision as well as recall, up to 5 times the number of single extractions, on average twice as many extractions combined
      Overall precision: 0.8 -> 0.86

Results / Key Findings

17

•  Distant supervision allows knowledge bases to be populated automatically, without manual effort

•  Distant supervision can be applied to any domain (focus of this work: Web data)

•  Seed selection, improved named entity recognition, strategies for information integration and extracting relations across sentence boundaries improve performance

•  Additional heuristics for named entity recognition work, but the approach still relies on existing tools for that
   Ø  More work on unsupervised named entity recognition needed

•  Web pages do not only contain text, but also lists, tables etc.
   Ø  More data that can be integrated

Conclusions / Future Work

18

Thank you for your attention!

Questions?
