Declarative Multilingual Information Extraction with SystemT Alan Akbik and Laura Chiticariu IBM Research | Almaden Institute of Computer Science (ICSI), Berkeley, California 11/17/2016

Upload: laura-chiticariu

Post on 19-Jan-2017


TRANSCRIPT

Page 1: Declarative Multilingual Information Extraction with SystemT

Declarative Multilingual Information Extraction with SystemT
Alan Akbik and Laura Chiticariu
IBM Research | Almaden
Institute of Computer Science (ICSI), Berkeley, California
11/17/2016

Page 2: Declarative Multilingual Information Extraction with SystemT

Outline

• Use cases
• Requirements
• SystemT
• SystemT Classes
• Multilingual SRL

Page 3: Declarative Multilingual Information Extraction with SystemT

Case Study 1: Social Data Analysis

Customer 360º

[Diagram labels: Social Media; Product catalog, Customer Master Data, …; Text Analytics; 360º Profile (Products, Interests, Relationships, Personal Attributes, Life Events); Statistical Analysis, Report Gen.]

[Pie chart: Bank 1 15%, Bank 2 15%, Bank 3 5%, Bank 4 21%, Bank 5 20%, Bank 6 23%]

• Scale: 450M+ tweets a day, 100M+ consumers, …
• Breadth: Buzz, intent, sentiment, life events, personal attributes, …
• Complexity: Sarcasm, wishful thinking, …

Page 4: Declarative Multilingual Information Extraction with SystemT

Case Study 2: Machine Analysis

[Diagram: Raw Logs (Web Server Logs, Custom Application Logs, Database Logs) → Parse → Parsed Log Records → Extract → Linkage Information → Integrate → End-to-End Application Sessions → Analyze → Graphical Models (Anomaly detection) and Correlation Analysis]

[Correlation Analysis table, columns Cause, Effect, Delay, Correlation; the two visible rows show Delay 5 min with Correlation 0.85 and Delay 10 min with Correlation 0.62]

• Every customer has unique components with unique log record formats
• Need to quickly customize all stages of analysis for these custom log data sources

Page 5: Declarative Multilingual Information Extraction with SystemT

Outline

• Use cases
• Requirements
• SystemT
• SystemT Classes
• Multilingual SRL

Page 6: Declarative Multilingual Information Extraction with SystemT

Enterprise Requirements

• Expressivity
– Need to express complex algorithms for a variety of tasks and data sources
• Scalability
– Large number of documents and large-size documents
• Transparency
– Every customer’s data and problems are unique in some way
– Need to easily comprehend, debug and enhance extractors

Page 7: Declarative Multilingual Information Extraction with SystemT

Expressivity Example: Different Kinds of Parses

Natural Language: We are raising our tablet forecast.

[Dependency Tree over the sentence; node labels S, NP, DET; edge labels subj, pred, obj]

Machine Log: Oct 1 04:12:24 9.1.1.3 41865: %PLATFORM_ENV-1-DUAL_PWR: Faulty internal power supply B detected

Parsed fields:
Time: Oct 1 04:12:24
Host: 9.1.1.3
Process: 41865
Category: %PLATFORM_ENV-1-DUAL_PWR
Message: Faulty internal power supply B detected
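The machine-log parse on this slide can be approximated with a single regular expression. The sketch below is written in Python rather than AQL; the pattern, the `parse_log_line` name and the field keys are assumptions made for illustration, and the pattern only covers lines shaped like the slide's example.

```python
import re

# Hypothetical pattern for lines shaped like the slide's example:
# "<Mon> <day> <hh:mm:ss> <host> <pid>: <%CATEGORY>: <message>"
LOG_LINE = re.compile(
    r"^(?P<time>[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+"
    r"(?P<process>\d+):\s+"
    r"(?P<category>%[A-Z0-9_-]+):\s+"
    r"(?P<message>.*)$"
)


def parse_log_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


record = parse_log_line(
    "Oct 1 04:12:24 9.1.1.3 41865: "
    "%PLATFORM_ENV-1-DUAL_PWR: Faulty internal power supply B detected"
)
print(record)
```

Every new log format would need its own pattern, which is the customization cost the machine-analysis case study points at.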

Page 8: Declarative Multilingual Information Extraction with SystemT

Expressivity Example: Fact Extraction (Tables)

Singapore 2012 Annual Report (136 pages PDF)

• Identify line item for Operating expenses from Income statement (financial table in PDF document)
• Identify note breaking down Operating expenses line item, and extract opex components

Page 9: Declarative Multilingual Information Extraction with SystemT

Scalability Example

• Social Media: Twitter alone has 500M+ messages / day; 1TB+ per day
• Financial Data: SEC alone has 20M+ filings, several TBs of data, with documents ranging from a few KBs to a few MBs
• Machine Data: One application server under moderate load at medium logging level → 1GB of logs per day

Page 10: Declarative Multilingual Information Extraction with SystemT

Transparency Example: Need to incorporate in-situ data

• Intel's 2013 capex is elevated at 23% of sales, above average of 16%
• FHLMC reported $4.4bn net loss and requested $6bn in capital from Treasury.
• I'm still hearing from clients that Merrill's website is better.
• I need to go back to Walmart, Toys R Us has the same toy $10 cheaper!

Callouts: Entity of interest; Customer or competitor?; Good or bad?

Page 11: Declarative Multilingual Information Extraction with SystemT

Outline

• Use cases
• Requirements
• SystemT
• SystemT Classes
• Multilingual SRL

Page 12: Declarative Multilingual Information Extraction with SystemT

SystemT Architecture

SystemT: IBM Research project started in 2006
More than 35 papers across ACL, EMNLP, VLDB, PODS and SIGMOD

• AQL Language: Specify extractor semantics declaratively
• Optimizer: Choose efficient execution plan that implements semantics
• Operator Runtime

Example AQL Extractor

create view PersonPhone as
select P.name as person, N.number as phone
from Person P, PhoneNumber N, Sentence S
where
  Follows(P.name, N.number, 0, 30)
  and Contains(S.sentence, P.name)
  and Contains(S.sentence, N.number)
  and ContainsRegex(/\b(phone|at)\b/,
        SpanBetween(P.name, N.number));

Pattern: <Person> followed by <PhoneNum> within 0-30 chars, within a single sentence, where the text in between contains “phone” or “at”

Fundamental Results & Theorems

• Expressivity: The class of extraction tasks expressible in AQL is a strict superset of that expressible through cascaded regular automata.
• Performance: For any acyclic token-based finite state transducer T, there exists an operator graph G such that evaluating T and G has the same computational complexity.
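To make the semantics of the PersonPhone view concrete, here is a minimal Python sketch of the four predicates evaluated by brute force over character spans. It is not how SystemT executes AQL (the optimizer picks the plan); `Span`, `span_of`, `follows`, `contains` and `person_phone` are hypothetical names, and the input spans are assumed to come from upstream extractors.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    begin: int  # inclusive character offset
    end: int    # exclusive character offset


def span_of(text, fragment):
    """Demo helper: span of the first occurrence of fragment in text."""
    begin = text.index(fragment)
    return Span(begin, begin + len(fragment))


def follows(first, second, min_chars, max_chars):
    """True if `second` starts min_chars..max_chars after `first` ends."""
    gap = second.begin - first.end
    return min_chars <= gap <= max_chars


def contains(outer, inner):
    """True if `inner` lies completely inside `outer`."""
    return outer.begin <= inner.begin and inner.end <= outer.end


CUE = re.compile(r"\b(phone|at)\b")


def person_phone(text, persons, phones, sentences):
    """Nested-loop evaluation of the view's where clause."""
    pairs = []
    for p in persons:
        for n in phones:
            if not follows(p, n, 0, 30):
                continue
            if not any(contains(s, p) and contains(s, n) for s in sentences):
                continue
            if CUE.search(text[p.end:n.begin]):  # text between the two spans
                pairs.append((text[p.begin:p.end], text[n.begin:n.end]))
    return pairs


text = "Call Anna at 555-5555 today. James left."
persons = [span_of(text, "Anna"), span_of(text, "James")]
phones = [span_of(text, "555-5555")]
sentences = [span_of(text, "Call Anna at 555-5555 today."),
             span_of(text, "James left.")]
pairs = person_phone(text, persons, phones, sentences)
print(pairs)
```

The nested loops fix one evaluation order by hand; the declarative form leaves that choice to the optimizer.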

Page 13: Declarative Multilingual Information Extraction with SystemT

Expressivity: Rich Set of Operators

Language to express NLP Algorithms → AQL

Core Operators: Tokenization, Parts of Speech, Dictionaries, Regular Expressions, Span Operations, Relational Operations, Aggregation Operations

Also shown: Deep Syntactic Parsing, ML Training & Scoring, Semantic Role Labels, HTML/XML Detagging, …

Page 14: Declarative Multilingual Information Extraction with SystemT

Expressivity: AQL vs. Custom Java Code

Java Implementation of Sentence Boundary Detection

package com.ibm.avatar.algebra.util.sentence;

import java.io.BufferedWriter;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.regex.Matcher;

public class SentenceChunker
{
  private Matcher sentenceEndingMatcher = null;

  public static BufferedWriter sentenceBufferedWriter = null;

  private HashSet<String> abbreviations = new HashSet<String> ();

  public SentenceChunker ()
  {
  }

  /** Constructor that takes in the abbreviations directly. */
  public SentenceChunker (String[] abbreviations)
  {
    // Generate the abbreviations directly.
    for (String abbr : abbreviations) {
      this.abbreviations.add (abbr);
    }
  }

  /**
   * @param doc the document text to be analyzed
   * @return true if the document contains at least one sentence boundary
   */
  public boolean containsSentenceBoundary (String doc)
  {
    String origDoc = doc;

    /*
     * Based on getSentenceOffsetArrayList()
     */
    // int dotpos, quepos, exclpos, newlinepos;
    int boundary;
    int currentOffset = 0;

    do {
      /* Get the next tentative boundary for the sentenceString */
      setDocumentForObtainingBoundaries (doc);
      boundary = getNextCandidateBoundary ();

      if (boundary != -1) {
        String candidate = doc.substring (0, boundary + 1);
        String remainder = doc.substring (boundary + 1);

        /*
         * Looks at the last character of the String. If this last
         * character is part of an abbreviation (as detected by
         * REGEX) then the sentenceString is not a fullSentence and
         * "false" is returned
         */
        // while (!(isFullSentence(candidate) &&
        // doesNotBeginWithCaps(remainder))) {
        while (!(doesNotBeginWithPunctuation (remainder)
          && isFullSentence (candidate))) {

          /* Get the next tentative boundary for the sentenceString */
          int nextBoundary = getNextCandidateBoundary ();
          if (nextBoundary == -1) {
            break;
          }
          boundary = nextBoundary;
          candidate = doc.substring (0, boundary + 1);
          remainder = doc.substring (boundary + 1);
        }

        if (candidate.length () > 0) {
          // sentences.addElement(candidate.trim().replaceAll("\n", " "));
          // sentenceArrayList.add(new Integer(currentOffset + boundary + 1));
          // currentOffset += boundary + 1;

          // Found a sentence boundary. If the boundary is the last
          // character in the string, we don't consider it to be
          // contained within the string.
          int baseOffset = currentOffset + boundary + 1;
          if (baseOffset < origDoc.length ()) {
            // System.err.printf("Sentence ends at %d of %d\n",
            // baseOffset, origDoc.length());
            return true;
          }
          else {
            return false;
          }
        }
        // origDoc.substring(0,currentOffset));
        // doc = doc.substring(boundary + 1);
        doc = remainder;
      }
    }
    while (boundary != -1);

    // If we get here, didn't find any boundaries.
    return false;
  }

  public ArrayList<Integer> getSentenceOffsetArrayList (String doc)
  {
    ArrayList<Integer> sentenceArrayList = new ArrayList<Integer> ();

    // String origDoc = doc;
    // int dotpos, quepos, exclpos, newlinepos;
    int boundary;
    int currentOffset = 0;

    sentenceArrayList.add (new Integer (0));

    do {
      /* Get the next tentative boundary for the sentenceString */
      setDocumentForObtainingBoundaries (doc);
      boundary = getNextCandidateBoundary ();

      if (boundary != -1) {
        String candidate = doc.substring (0, boundary + 1);
        String remainder = doc.substring (boundary + 1);

        /*
         * Looks at the last character of the String. If this last character
         * is part of an abbreviation (as detected by REGEX) then the
         * sentenceString is not a fullSentence and "false" is returned
         */
        // while (!(isFullSentence(candidate) &&
        // doesNotBeginWithCaps(remainder))) {
        while (!(doesNotBeginWithPunctuation (remainder) &&
          isFullSentence (candidate))) {

          /* Get the next tentative boundary for the sentenceString */
          int nextBoundary = getNextCandidateBoundary ();
          if (nextBoundary == -1) {
            break;
          }
          boundary = nextBoundary;
          candidate = doc.substring (0, boundary + 1);
          remainder = doc.substring (boundary + 1);
        }

        if (candidate.length () > 0) {
          sentenceArrayList.add (new Integer (currentOffset + boundary + 1));
          currentOffset += boundary + 1;
        }
        // origDoc.substring(0,currentOffset));
        doc = remainder;
      }
    }
    while (boundary != -1);

    if (doc.length () > 0) {
      sentenceArrayList.add (new Integer (currentOffset + doc.length ()));
    }

    sentenceArrayList.trimToSize ();
    return sentenceArrayList;
  }

  private void setDocumentForObtainingBoundaries (String doc)
  {
    sentenceEndingMatcher = SentenceConstants.
      sentenceEndingPattern.matcher (doc);
  }

  private int getNextCandidateBoundary ()
  {
    if (sentenceEndingMatcher.find ()) {
      return sentenceEndingMatcher.start ();
    }
    else
      return -1;
  }

  private boolean doesNotBeginWithPunctuation (String remainder)
  {
    Matcher m = SentenceConstants.punctuationPattern.matcher (remainder);
    return (!m.find ());
  }

  private String getLastWord (String cand)
  {
    Matcher lastWordMatcher = SentenceConstants.lastWordPattern.matcher (cand);
    if (lastWordMatcher.find ()) {
      return lastWordMatcher.group ();
    }
    else {
      return "";
    }
  }

  /*
   * Looks at the last character of the String. If this last character is
   * part of an abbreviation (as detected by REGEX)
   * then the sentenceString is not a fullSentence and "false" is returned
   */
  private boolean isFullSentence (String cand)
  {
    // cand = cand.replaceAll("\n", " "); cand = " " + cand;
    Matcher validSentenceBoundaryMatcher =
      SentenceConstants.validSentenceBoundaryPattern.matcher (cand);
    if (validSentenceBoundaryMatcher.find ()) return true;

    Matcher abbrevMatcher = SentenceConstants.abbrevPattern.matcher (cand);
    if (abbrevMatcher.find ()) {
      return false; // Means it ends with an abbreviation
    }
    else {
      // Check if the last word of the sentenceString has an entry in the
      // abbreviations dictionary (like Mr etc.)
      String lastword = getLastWord (cand);
      if (abbreviations.contains (lastword)) { return false; }
    }
    return true;
  }
}

Equivalent AQL Implementation

create dictionary AbbrevDict from file
  'abbreviation.dict';

create view SentenceBoundary as
select R.match as boundary
from ( extract regex /(([\.\?!]+\s)|(\n\s*\n))/
       on D.text as match from Document D ) R
where
  Not(ContainsDict('AbbrevDict',
    CombineSpans(LeftContextTok(R.match, 1), R.match)));
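For readers who do not know AQL, the rule in the SentenceBoundary view can be sketched in a few lines of Python. This is an approximation under stated assumptions, not a port of either implementation: the regex is copied from the AQL view, `LAST_TOKEN` stands in for `LeftContextTok(R.match, 1)`, the dictionary is assumed to hold entries such as "Mr." with the trailing period, and the dictionary test is simplified to exact membership.

```python
import re

# Candidate boundaries, same pattern as the AQL view: a run of
# sentence-ending punctuation plus whitespace, or a blank line.
BOUNDARY = re.compile(r"(([.?!]+\s)|(\n\s*\n))")

# Stand-in for LeftContextTok(match, 1): the word right before the match.
LAST_TOKEN = re.compile(r"\w+$")


def sentence_boundaries(text, abbreviations):
    """Return (begin, end) offsets of boundaries not preceded by an abbreviation."""
    boundaries = []
    for match in BOUNDARY.finditer(text):
        left = LAST_TOKEN.search(text[:match.start()])
        token = left.group() if left else ""
        # Stand-in for CombineSpans(left token, match), without trailing space.
        candidate = token + match.group().strip()
        if candidate in abbreviations:
            continue  # e.g. "Mr." or "Dr." is not a sentence end
        boundaries.append((match.start(), match.end()))
    return boundaries


text = "Mr. Smith arrived. He called Dr. Jones."
boundaries = sentence_boundaries(text, {"Mr.", "Dr.", "St."})
print(boundaries)
```

As in the AQL view, punctuation at the very end of the text is not reported, because the pattern requires trailing whitespace.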

Page 15: Declarative Multilingual Information Extraction with SystemT

Transparency in SystemT

• Clear and simple provenance → Ease of comprehension, Ease of debugging, Ease of enhancement
• Explains output data in terms of input data, intermediate data, and the transformation
• For predicate-based rule languages (e.g., SQL), can be computed automatically
[Liu et al., VLDB’10]

Input text: Anna at James St. office (555-5555) …

create view PersonPhone as
select P.name as person, N.number as phone
from Person P, PhoneNumber N
where Follows(P.name, N.number, 0, 30);

Person
id | name
t1 | Anna
t2 | James

PhoneNumber
id | number
t3 | 555-5555

PersonPhone
person | phone | provenance
Anna | 555-5555 | t1 ∧ t3
James | 555-5555 | t2 ∧ t3
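The provenance idea on this slide can be sketched as follows: each output tuple carries the identifiers of the input tuples that produced it. The Python below is an illustrative sketch, not SystemT's provenance mechanism; the tuple ids `t1`..`t3` follow the slide, while `span_of` and `person_phone_with_provenance` are hypothetical names.

```python
def span_of(text, fragment):
    """Demo helper: (begin, end) offsets of the first occurrence of fragment."""
    begin = text.index(fragment)
    return (begin, begin + len(fragment))


def person_phone_with_provenance(text, persons, phones):
    """Join persons and phones on Follows(0, 30) and record input tuple ids.

    `persons` and `phones` map a tuple id to a (begin, end) span.
    """
    rows = []
    for person_id, (p_begin, p_end) in persons.items():
        for phone_id, (n_begin, n_end) in phones.items():
            if 0 <= n_begin - p_end <= 30:  # Follows(P.name, N.number, 0, 30)
                rows.append({
                    "person": text[p_begin:p_end],
                    "phone": text[n_begin:n_end],
                    "provenance": (person_id, phone_id),  # read as t_i AND t_j
                })
    return rows


text = "Anna at James St. office (555-5555)"
persons = {"t1": span_of(text, "Anna"), "t2": span_of(text, "James")}
phones = {"t3": span_of(text, "555-5555")}
rows = person_phone_with_provenance(text, persons, phones)
for row in rows:
    print(row)
```

The second row pairs the street name "James" with the phone number; its provenance points at input tuple t2, which is where a developer would start debugging.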

Page 16: Declarative Multilingual Information Extraction with SystemT

Machine Learning in SystemT

• AQL provides a foundation of transparency

• Next step: Add machine learning without losing transparency

• Major machine learning efforts:

– Embeddable Models in AQL

– Learning using AQL as target language


Page 17: Declarative Multilingual Information Extraction with SystemT

Machine Learning in SystemT

• AQL provides a foundation of transparency

• Next step: Add machine learning without losing transparency

• Major machine learning efforts

– Embeddable Models in AQL

• Model Training and Scoring → integration with SystemML
• Semantic Role Labeling [Akbik et al., ACL’15,’16, EMNLP’16, COLING’16]
• Text Normalization [Zhang et al., ACL’13; Li & Baldwin, NAACL’15]

– Learning using AQL as target language

Page 18: Declarative Multilingual Information Extraction with SystemT

Machine Learning in SystemT

• AQL provides a foundation of transparency

• Next step: Add machine learning without losing transparency

• Major machine learning efforts

– Embeddable Models in AQL

– Learning using AQL as target language

• Low-level features: regular expressions [Li et al., EMNLP’08],

dictionaries [Li et al., CIKM’11, Roy et al., SIGMOD’13]

• Rule induction [Nagesh et al., EMNLP’12]

• Rule refinement [Liu et al., VLDB’10]

• Ongoing research


Page 19: Declarative Multilingual Information Extraction with SystemT

Web Tools Overview

Ease of

Programming

Ease of

Sharing


Page 20: Declarative Multilingual Information Extraction with SystemT

Outline

• Use cases
• Requirements
• SystemT
• SystemT Classes
• Multilingual SRL

Page 21: Declarative Multilingual Information Extraction with SystemT

SystemT in Universities

Timeline: 2013, 2014, 2015, 2016, 2017

• UC Santa Cruz (full Graduate class)
• U. Washington (Grad)
• U. Oregon (Undergrad)
• U. Aalborg, Denmark (Grad)
• UIUC (Grad)
• U. Maryland Baltimore County (Undergrad)
• UC Irvine (Grad)
• NYU Abu-Dhabi (Undergrad)
• U. Washington (Grad)
• U. Oregon (Undergrad)
• U. Maryland Baltimore County (Undergrad)
• …
• UC Santa Cruz, 3 lectures in one Grad class
• SystemT MOOC

Page 22: Declarative Multilingual Information Extraction with SystemT

SystemT in Universities: Testimonials

Alon Halevy, November 2015
"The tutorial of System T was an important hands-on component of a Ph.D. course on Structured Data on the Web at the University of Aalborg in Denmark taught by Alon Halevy. The tutorial did a great job giving the students the feeling for the challenges involved in extracting structured data from text."

Shimei Pan, University of Maryland, Baltimore County, May 2016
"The Information Extraction with SystemT class, which was offered for the first time at UMBC, has been a wonderful experience for me and my students. I really enjoyed teaching this course. Students were also very enthusiastic. Based on the feedback from my students, they have learned a lot. Some of my students even want me to offer this class every semester."

Page 23: Declarative Multilingual Information Extraction with SystemT

SystemT in Universities: Summary

• Professors like teaching it
• Students do interesting projects and have fun!
– Identify character relationships in Game of Thrones
– Determine the cause and treatment for PTSD patients
– Extract job experiences and skills from resumes
– …

Page 24: Declarative Multilingual Information Extraction with SystemT

SystemT MOOC

• Bring SystemT to 100,000+ students
• Target Audience: NLP Engineer, Data Scientist, e.g., University students (grad, undergrad), industry professionals
• Learning Path with two classes

TA0105: Text Analytics: Getting Results with SystemT
– Basics of IE with SystemT, learning to write extractors using the web tool, learn AQL
– 5 lectures + 1 hands-on lab + final exam

TA0106EN: Advanced Text Analytics: Getting Results with SystemT
– Advanced material on declarative approach and SystemT optimization
– 2 lectures + final exam

Page 25: Declarative Multilingual Information Extraction with SystemT

Outline

• Use cases
• Requirements
• SystemT
• SystemT Classes
• Multilingual SRL

Page 26: Declarative Multilingual Information Extraction with SystemT

The World’s Most Spoken Languages

• 7,102 known languages
• 23 most spoken languages: > 4.1 billion people
https://walizahid.com/2015/06/the-worlds-most-spoken-languages/
• Our goal: Enable Text Analytics over Multilingual Text

Page 27: Declarative Multilingual Information Extraction with SystemT

Proposed Approach: Crosslingual Text Analytics

Problem: High costs to handle multilingual data

Current Status:
– One language at a time
– Query language and data must be in the same language
– Separate parsers / features for each language
– Separate text analytics must be developed

[Diagram: English Text Data → English NLP → English Text Analytics; German Text Data → German NLP → German Text Analytics; French Text Data → French NLP → French Text Analytics; Chinese Text Data → Chinese NLP → Chinese Text Analytics]

Crosslingual Text Analytics: Parse and develop against unified abstraction

[Diagram: English, Chinese, French and German Text Data → Unified NLP → Crosslingual Text Analytics]

Semantic parser
• arbitrary languages
• unified feature set

Universal extractors
• Develop once
• Run for all languages

Page 28: Declarative Multilingual Information Extraction with SystemT

Semantic Role Labeling

• SRL focuses on semantics, not syntax
• Useful for many applications (Shen and Lapata, 2007; Maqsud et al., 2014)
• Problem: What type of labels are valid across languages?
– Lexical, morphological and syntactic labels differ greatly
– But shallow semantic labels remain stable
• Frame semantics crosslingually valid?

Example annotations:
[A0 Dirk] [Break.01 broke] [A1 the window] [A2 with a hammer].
[A1 The window] was [Break.01 broken] by [A0 Dirk].
[A1 The window] [Break.01 broke].

Frames:
Break.01: A0 – Breaker, A1 – Thing broken, A2 – Instrument, A3 – Pieces
Break.15: A0 – Journalist, exposer; A1 – Story, thing exposed
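The point of the Break.01 example is that the role labels stay the same while the syntax changes. A small Python sketch makes this checkable; the data layout and the names `FRAMES`, `ANNOTATIONS`, `role_fillers` and `describe` are assumptions for illustration, and the role assignments are hand-written from the slide rather than produced by a parser.

```python
# Role inventory for the frame shown on the slide.
FRAMES = {
    "break.01": {"A0": "breaker", "A1": "thing broken",
                 "A2": "instrument", "A3": "pieces"},
}

# Hand-written annotations of the three example sentences.
ANNOTATIONS = [
    {"text": "Dirk broke the window with a hammer.", "frame": "break.01",
     "roles": {"A0": "Dirk", "A1": "the window", "A2": "with a hammer"}},
    {"text": "The window was broken by Dirk.", "frame": "break.01",
     "roles": {"A1": "The window", "A0": "Dirk"}},
    {"text": "The window broke.", "frame": "break.01",
     "roles": {"A1": "The window"}},
]


def role_fillers(annotations, frame, role):
    """Collect the lower-cased fillers of one role across all sentences."""
    return [a["roles"][role].lower() for a in annotations
            if a["frame"] == frame and role in a["roles"]]


def describe(annotation):
    """Replace role labels by their human-readable descriptions."""
    names = FRAMES[annotation["frame"]]
    return {names[role]: filler for role, filler in annotation["roles"].items()}


print(role_fillers(ANNOTATIONS, "break.01", "A1"))
print(describe(ANNOTATIONS[1]))
```

Active, passive and intransitive variants all yield the same A1 filler, which is what makes the labels usable as a stable feature.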

Page 29: Declarative Multilingual Information Extraction with SystemT

SRL Resources

• English SRL
– FrameNet
– PropBank

Frame set:
Buy.01: A0 – Buyer, A1 – Thing bought, A2 – Seller, A3 – Price paid, A4 – Benefactive
Pay.01: A0 – Payer, A1 – Money, A2 – Being paid, A3 – Commodity

Corpus of annotated text data:
The station bought Cosby reruns for record prices (Buy.01: A0, A1, A3)
… spurred dollar buying by Japanese institutions. (Spur.01: A1; Buy.01: A0)
The company paid 2 million US$ to buy …. (Pay.01: A0, A1; Buy.01)

Page 30: Declarative Multilingual Information Extraction with SystemT

SRL for other languages?

• English SRL
– FrameNet
– PropBank

• Other languages
– Chinese Proposition Bank
– Hindi Proposition Bank
– German FrameNet
– French? Spanish? Russian? Arabic? …

Problems:
1. Limited coverage
2. Language-specific formalisms

Example of differing role sets:
订购: A0: buyer, A1: commodity, A2: seller
order.02: A0: orderer, A1: thing ordered, A2: benefactive, ordered-for, A3: source

We want different languages to share the same semantic labels

Page 31: Declarative Multilingual Information Extraction with SystemT

Shared Frames Across Languages

Multilingual input text:
[A1 WhatsApp] was [Buy.01 bought] by [A0 Facebook]
[A0 Facebook] hat [A1 WhatsApp] [Buy.01 gekauft]
[A0 Facebook] a [Buy.01 acheté] [A1 WhatsApp]

Crosslingual representation:
buy.01 (A0: Facebook, A1: WhatsApp)
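Once every language is parsed into the same frames and roles, one extractor can serve all of them. The Python sketch below illustrates that "develop once, run for all languages" idea under the assumption that a semantic parser has already produced the annotations; `PARSED` is hand-written from the slide and `acquisitions` is a hypothetical rule, not part of SystemT.

```python
# Hand-written parser output for the three sentences on the slide.
PARSED = [
    {"lang": "en", "text": "WhatsApp was bought by Facebook",
     "frame": "buy.01", "roles": {"A0": "Facebook", "A1": "WhatsApp"}},
    {"lang": "de", "text": "Facebook hat WhatsApp gekauft",
     "frame": "buy.01", "roles": {"A0": "Facebook", "A1": "WhatsApp"}},
    {"lang": "fr", "text": "Facebook a acheté WhatsApp",
     "frame": "buy.01", "roles": {"A0": "Facebook", "A1": "WhatsApp"}},
]


def acquisitions(parsed_sentences):
    """Language-independent rule: a buy.01 frame with A0 and A1 yields a pair."""
    results = []
    for sentence in parsed_sentences:
        roles = sentence["roles"]
        if sentence["frame"] == "buy.01" and "A0" in roles and "A1" in roles:
            results.append((sentence["lang"], roles["A0"], roles["A1"]))
    return results


found = acquisitions(PARSED)
for lang, buyer, bought in found:
    print(f"{lang}: {buyer} bought {bought}")
```

The rule never inspects the surface text, so word order and voice differences between the three languages do not matter.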

Page 32: Declarative Multilingual Information Extraction with SystemT

• 1. Generation of Proposition Banks for Arbitrary Languages

• 2. Development of a SotA Semantic Role Labeler

• 3. Application to Use Cases

Research


Page 33: Declarative Multilingual Information Extraction with SystemT

Our Goal

• Generate SRL resources for many other languages
• Shared frame set
• Minimal effort

Frame set:
Buy.01: A0 – Buyer, A1 – Thing bought, A2 – Seller, A3 – Price paid, A4 – Benefactive
Pay.01: A0 – Payer, A1 – Money, A2 – Being paid, A3 – Commodity

Corpus of annotated text data:
Il faut qu’il y ait des responsables (Need.01: A0)
Je suis responsable pour le chaos (Be.01: A1, A2, AM-PRD)
Les services postaux ont acheté des … (Buy.01: A0)
[A further label group, Be.01 with A2 and A1, appears on the slide; its sentence is not recoverable from the transcript]

Page 34: Declarative Multilingual Information Extraction with SystemT

Generating Universal Proposition Banks

Current practice:
• Annotator training (months)
• Annotation (years)
• Must be repeated for each language!

Idea: Annotation projection in parallel corpora

Example: TV subtitles
English subtitles: I would buy that for a dollar! (BUYER: I; ITEM: that; PRICE: for a dollar)
German subtitles: Das würde ich für einen Dollar kaufen (ITEM: Das; BUYER: ich; PRICE: für einen Dollar)
Projection: labels on the English sentence are transferred to the aligned German words
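Annotation projection can be sketched as copying each labeled source token's label across a word alignment. The Python below is a bare-bones illustration, not the authors' pipeline: the alignment is written by hand (real systems obtain it from an automatic word aligner), the labels are token-level, and `project_labels` is a hypothetical name.

```python
def project_labels(source_labels, alignment, target_length):
    """Copy each labeled source token's label to its aligned target token."""
    target_labels = [None] * target_length
    for src_index, label in source_labels.items():
        tgt_index = alignment.get(src_index)
        if tgt_index is not None:  # unaligned source tokens are skipped
            target_labels[tgt_index] = label
    return target_labels


english = ["I", "would", "buy", "that", "for", "a", "dollar", "!"]
german = ["Das", "würde", "ich", "für", "einen", "Dollar", "kaufen"]

# Token-level labels on the English side (indices into `english`).
source_labels = {0: "BUYER", 2: "buy.01", 3: "ITEM",
                 4: "PRICE", 5: "PRICE", 6: "PRICE"}

# Hand-written alignment, English index -> German index (an assumption).
# The final "!" has no German counterpart and stays unaligned.
alignment = {0: 2, 1: 1, 2: 6, 3: 0, 4: 3, 5: 4, 6: 5}

projected = project_labels(source_labels, alignment, len(german))
labeled = list(zip(german, projected))
print(labeled)
```

The sketch trusts the alignment and the source labels completely, so any error in either one is copied straight into the target annotation.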

Page 35: Declarative Multilingual Information Extraction with SystemT

However: Projections Not Always Possible

English sentence: We need to hold people responsible (Need.01; Hold.01: A0, A1, A3)
Target sentence: Il faut qu’il y ait des responsables (literally: There need to be those responsible)
Projected onto the target: Need.01, and Hold.01 with A1. Incorrect projection!

Main error sources:
• Translation shift
• Source-language SRL errors

Page 36: Declarative Multilingual Information Extraction with SystemT

Improving Projection Quality

• Two-step process of filtering and bootstrapping (ACL ’15)
– Filters to detect translation shift, block projections (more precision at cost of recall)
– Bootstrap learning to increase recall
– Generate 7 universal proposition banks from 3 language groups

Page 37: Declarative Multilingual Information Extraction with SystemT

Train Multilingual SRL

• Online demo for SRL in 9 languages (ACL ’16)

Page 38: Declarative Multilingual Information Extraction with SystemT

Low-resource Languages & Crowdsourcing

• Apply approach to low-resource languages (EMNLP ’16)
– Bengali, Malayalam, Tamil
– Fewer sources of parallel data
– Almost no NLP: No syntactic parsing, lemmatization etc.
• Crowdsourcing for data curation

Page 39: Declarative Multilingual Information Extraction with SystemT

Multilingual Aliasing

• Expert curation of frame mappings (COLING ’16)
• Problem: Target language frame lexicon automatically generated from alignments
– False frames
– Redundant frames
• Two-step expert curation approach
– Step 1: Filter
– Step 2: Merge
• Example: drehen
– “to turn” (rotate)
– “to film”

Page 40: Declarative Multilingual Information Extraction with SystemT

Multilingual Aliasing: Example

Predicate: drehen

Original
drehen.01 | TURN.01 | rotation | A0: turner, A1: thing turning, AM: direction
drehen.02 | TURN.02 | transformation | A0: causer of transformation, A1: thing changing, A2: end state, A3: start state | incorrect frame, filtered
drehen.03 | FILM.01 | record on film | A0: recorder, A1: thing filmed
drehen.04 | SHOOT.03 | record on film | A0: videographer, A1: subject filmed, A2: medium

Filtered
drehen.01 | TURN.01 | rotation | A0: turner, A1: thing turning, AM: direction
drehen.03 | FILM.01 | record on film | A0: recorder, A1: thing filmed
drehen.04 | SHOOT.03 | record on film | A0: videographer, A1: subject filmed, A2: medium | synonyms, merged

Merged
drehen.01 | TURN.01 | rotation | A0: turner, A1: thing turning, AM: direction
drehen.03 | FILM.01, SHOOT.03 | record on film | A0: recorder, A1: thing filmed
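The two curation steps can be sketched as plain dictionary operations. In the Python below the expert decisions (which frame to reject, which frames to merge) are inputs supplied by hand; the lexicon layout and the `curate` function are hypothetical and only mirror the drehen example on this slide.

```python
def curate(lexicon, rejected, merges):
    """Apply expert decisions to an automatically generated frame lexicon.

    lexicon:  target-language frame id -> list of aliased English frames
    rejected: frame ids an expert marked as incorrect (step 1, filter)
    merges:   (keep, drop) pairs of synonymous frame ids (step 2, merge)
    """
    # Step 1: Filter. Copy the alias lists so the input stays untouched.
    curated = {frame_id: list(aliases)
               for frame_id, aliases in lexicon.items()
               if frame_id not in rejected}
    # Step 2: Merge. Fold the dropped frame's aliases into the kept frame.
    for keep, drop in merges:
        curated[keep].extend(curated.pop(drop))
    return curated


lexicon = {
    "drehen.01": ["TURN.01"],   # rotation
    "drehen.02": ["TURN.02"],   # transformation (incorrect for drehen)
    "drehen.03": ["FILM.01"],   # record on film
    "drehen.04": ["SHOOT.03"],  # record on film (synonym of drehen.03)
}

curated = curate(lexicon,
                 rejected={"drehen.02"},
                 merges=[("drehen.03", "drehen.04")])
print(curated)
```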

Page 41: Declarative Multilingual Information Extraction with SystemT

Multilingual Aliasing

• Expert curation of frame mappings (COLING ’16)
– 60 person hours per language
– Chinese, French and German
– More salient senses, improved quality

Page 42: Declarative Multilingual Information Extraction with SystemT

Automatic Generation of Proposition Banks

• Project English PropBank frame and role annotation onto target language sentences
• A variety of strategies to increase quality:
– Filtering of translation shift
– Crowdsourced data curation
– Expert frame alignments
• Result: Generated universal proposition banks for 8 languages
• Currently preparing open source release

Page 43: Declarative Multilingual Information Extraction with SystemT

• 1. Generation of Proposition Banks for Arbitrary Languages

• 2. Development of a SotA Semantic Role Labeler

• 3. Application to Use Cases

Research


Page 44: Declarative Multilingual Information Extraction with SystemT

Instance-based Semantic Role Labeling

• Problem: Current SRL error-prone
• Strong local bias, many low-frequency exceptions, difficult to generalize

Page 45: Declarative Multilingual Information Extraction with SystemT

Instance-based Semantic Role Labeling

• Problem: Current SRL error-prone
– Strong local bias
– Many low-frequency exceptions, difficult to generalize
• K-SRL: Instance-based learning
– K-nearest neighbors
• Composite features
• Significantly outperforms all other current systems

[Charts: In-domain and Out-of-domain results]
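As a rough illustration of instance-based labeling, the following Python sketch runs a generic k-nearest-neighbors majority vote over feature sets. It is not the K-SRL system: the similarity measure (count of shared features), the feature names and the toy training instances are all assumptions. The composite feature joins dependency label and voice, so that a passive subject can be told apart from an active one.

```python
from collections import Counter


def features(dep, voice):
    """Two atomic features plus one composite feature combining them."""
    return frozenset({f"dep={dep}", f"voice={voice}",
                      f"dep={dep}+voice={voice}"})


def knn_label(query, training, k=3):
    """Majority vote over the k training instances sharing most features."""
    ranked = sorted(training,
                    key=lambda inst: len(query & inst[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training instances: (feature set, role label).
training = [
    (features("nsubj", "active"), "A0"),
    (features("dobj", "active"), "A1"),
    (features("nsubj", "passive"), "A1"),
    (features("nsubj", "passive"), "A1"),
    (features("pobj", "passive"), "A0"),
    (features("nsubj", "active"), "A0"),
]

passive_subject = knn_label(features("nsubj", "passive"), training)
active_subject = knn_label(features("nsubj", "active"), training)
print(passive_subject, active_subject)
```

With only the atomic feature `dep=nsubj`, both queries would look alike; the composite feature is what pulls each one toward the right neighbors in this toy data.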

Page 46: Declarative Multilingual Information Extraction with SystemT

• 1. Generation of Proposition Banks for Arbitrary Languages

• 2. Development of a SotA Semantic Role Labeler

• 3. Application to Use Cases

Research


Page 47: Declarative Multilingual Information Extraction with SystemT

ActionAPI - Demo


Page 48: Declarative Multilingual Information Extraction with SystemT

Thank you!

• For more information…
– Visit our website: http://ibm.co/1Cdm1Mj
– Visit BigInsights Knowledge Center: http://ibm.co/1DIouEv
– Learning at your own pace: SystemT MOOC
– Contact us
• [email protected]
• [email protected]

SystemT Team: Alan Akbik, Laura Chiticariu, Marina Danilevsky, Howard Ho, Rajasekar Krishnamurthy, Yunyao Li, Sriram Raghavan, Frederick Reiss, Shivakumar Vaithyanathan, Huaiyu Zhu

Page 49: Declarative Multilingual Information Extraction with SystemT

SystemT: Declarative Information Extraction

Challenge: Building extractors for enterprise applications requires an information extraction system that is expressive, efficient, transparent and usable. Existing solutions are either rule-based solutions built on cascading grammars, which have expressivity and efficiency issues, or black-box solutions built on machine learning, which lack transparency.

Our Solution: A declarative information extraction system with cost-based optimization, a high-performance runtime and novel development tooling, based on a solid theoretical foundation [PODS’13, PODS’14], shipping with 10+ IBM products.

• Declarative AQL language: AQL is a declarative language that can be used to build extractors outperforming the state of the art [ACL’10]
• Multilingual SRL-enabled [ACL’15, ACL’16, EMNLP’16, COLING’16]
• Development Environment: Discovery tools for AQL development; a suite of novel development tooling leveraging machine learning and HCI [EMNLP’08, VLDB’10, ACL’11, CIKM’11, ACL’12, EMNLP’12, CHI’13, SIGMOD’13, ACL’13, VLDB’15, NAACL’15]
• Cost-based optimization for text-centric operations [ICDE’08, ICDE’11, FPL’13, FPL’14]
• SystemT Runtime: Highly embeddable runtime with high throughput and small memory footprint [SIGMOD Record’09, SIGMOD’09]

[Diagram: Input Documents → SystemT Runtime → Extracted Objects]

AQL Extractor

create view ProductMention as
select ...
from ...
where ...

create view IntentToBuy as
select ...
from ...
where ...

. . .

For details and Online Class visit: https://ibm.biz/BdF4GQ