Using the English Resource Grammar to extend fact extraction capabilities, v1.1

International Technology Alliance in Network & Information Sciences
David Mott, IBM UK
Stephen Poteet, Anne Kao, Ping Xue, Boeing Research & Technology
Ann Copestake, University of Cambridge
ITA Fall Meeting, October 2013


Page 2: Research Objectives

Extraction of facts in Controlled English (CE) from Natural Language (NL) documents
• express the document in a formal but still readable way
• extracted facts can be used to infer new information

Facilitate configuration of NL processing tools in CE
• human analyst can be more involved in the NL processing
• a common model of linguistics, grammar, and semantics

Provide rationale for linguistic and analytic processing
• human can better understand and review the reasoning
• facilitate evaluation of the quality of the reasoning

We are not tasked with creating fundamental breakthroughs in the theory of NL processing

Page 3: Supporting the analyst

[Diagram labels: documents (doc27); Reference data; other data; Linked data web; Structured data; NLP; CE Tools; CE Facts; Analyst's Conceptual Model; Assumptions; Uncertainty; Inference; Rationale; Argumentation; Query; Requirements; Product]

The analyst does not have time to read all the reports

Page 4: Working Scenario

Imagine you are an analyst in a team, being asked to provide high value information about events on the ground

Based upon reports and background reference material:
• You want to extract basic facts from these reports and to infer new information
• You want to have “new ideas” and implement this quickly without IT involvement
• You want to understand and review the collaborative reasoning of the team which may contain differing skills

02/03/10 - ET: 0855hrs -- Cell phone call from unidentified male (7115452376) in Bayaa to an unidentified male (7438604901) in Saydiyah //MGRSCOOR: 38S MB 37 77//. The caller stated: “I will need new carpet for my house.” The receiver asked: “How big is the house?” The reply was: “I have a large family.” The receiver said, “I will see what I can do.” The call lasted 15 seconds

Source: SYNCOIN simulated reports. Graham, Rimland, & Hall (2011). A COIN-inspired Synthetic Dataset for Qualitative Evaluation of Hard and Soft Fusion Systems. Proc. 14th International Conference on Information Fusion, Chicago, IL.

Page 5: The state of the BPP11 research

We are using CE
• as the target language for expressing facts
• as the shared model of the concepts being expressed
• as the language to configure NL systems
  • Detecting structures in phrases
  • Mapping language expressions to concepts
• as the way to reveal reasoning performed by a collaborative team

[Diagram labels: Text; Phrase structures; Generic Semantics; Domain Semantics; Facts; High Value Facts; Controlled English; Analyst's Reasoning]

Page 6: Motivation for using DELPH-IN linguistics

• Collaborate with DELPH-IN consortium, to extend our NL and fact extraction capabilities
• ERG is a high-coverage, high-precision English grammar, developed over 20 years
• MRS is the representation of semantics
• PET parser is an efficient parser
• Explore Controlled English as possible facilitator for the use of DELPH-IN linguistic resources
• Provide opportunity to research into deeper semantic processing
  • contribute to the NL research community

[Diagram labels: Typed Feature Structures; English Resource Grammar, Stanford; Linguistic Knowledge Builder, Cambridge; PET parser; Minimal Recursion Semantics, Cambridge; Japanese, German, Norwegian, Thai, Chinese, Spanish, ...; Translation]

Page 7: Integrating CE and the ERG

• Use ERG (and PET) to parse sentences and provide phrase structures
• Use MRS to express generic semantics
• Represent domain semantics in MRS, by extending generic semantics
• Research into the integration of domain semantics and linguistic processing

[Diagram labels: Text; Phrase structures; Generic Semantics; Domain Semantics; Facts; High Value Facts; Controlled English; Analyst’s Reasoning; ERG; MRS?]

Page 8: Raw ERG system output

[Screenshots: PARSE TREE (syntax); MRS (semantics)]

We will turn this into CE

Page 9: Defining the ERG lexicon in CE

Transformation between the ERG structures (Typed Feature Structures) and CE:

checkpoint_n1 := n_-_c_le & [ ORTH < "checkpoint" >, SYNSEM [ LKEYS.KEYREL.PRED "_checkpoint_n_1_rel", PHON.ONSET con ] ].

there is a count noun named checkpoint_n1 that is written as the word |checkpoint| and is a form of the noun sense ‘_checkpoint_n_1_rel’.

Is this easier to understand?

Mapping between generic semantics and specific semantics (the user has to define this link):

the noun sense ‘_checkpoint_n_1_rel’ expresses the entity concept ‘checkpoint’.

the noun sense ‘_carpet_n_1_rel’ expresses the entity concept ‘carpet’.
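The transformation from a lexical entry to a CE sentence can be sketched in Python. This is only an illustration, not the project's actual converter: the regular expression handles just the one-line entry shape shown on this slide, and the `LEXICAL_TYPE_TO_CE` table and the function name `tdl_to_ce` are assumptions introduced here.

```python
import re

# Assumed, simplified pattern for a one-line TDL lexical entry like the one
# on this slide. Real TDL is much richer than this.
ENTRY = re.compile(
    r'(?P<name>\w+)\s*:=\s*(?P<type>[\w-]+)\s*&.*?'
    r'ORTH\s*<\s*"(?P<orth>[^"]+)"\s*>.*?'
    r'PRED\s*"(?P<pred>[^"]+)"',
    re.DOTALL,
)

# Hypothetical mapping from ERG lexical types to CE concept names.
LEXICAL_TYPE_TO_CE = {"n_-_c_le": "count noun"}


def tdl_to_ce(tdl: str) -> str:
    """Render one lexical entry as a CE sentence (straight quotes for simplicity)."""
    m = ENTRY.search(tdl)
    if m is None:
        raise ValueError("unsupported TDL entry")
    concept = LEXICAL_TYPE_TO_CE[m["type"]]
    return (
        f"there is a {concept} named {m['name']} that is written as the word "
        f"|{m['orth']}| and is a form of the noun sense '{m['pred']}'."
    )


TDL = ('checkpoint_n1 := n_-_c_le & [ ORTH < "checkpoint" >, '
       'SYNSEM [ LKEYS.KEYREL.PRED "_checkpoint_n_1_rel", PHON.ONSET con ] ].')
print(tdl_to_ce(TDL))
```

The link from the noun sense to the entity concept ‘checkpoint’ is deliberately not generated here, since the slide states that the user has to define it.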

Page 10: Defining ERG grammar rules in CE

basic_head_initial := basic_binary_headed_phrase & [ HD-DTR #head, NH-DTR #non-head, ARGS < #head, #non-head > ].

Subcomponents of the phrase are the “head daughter” followed by the “non head daughter”.

there is a linguistic frame named f1 that defines the basic-head-initial PH and
has the sequence ( the sign A0 , and the sign A1 ) as subcomponents and
has the statement that ( the basic-head-initial PH has the sign A0 as HD-DTR and has the sign A1 as NH-DTR ) as semantics.

[Diagram labels: a basic-head-initial; ARGS; a list; 0TH; a sign; 1ST; a sign; HD-DTR; a thing; NH-DTR; a thing]

Page 11: Three stage approach to defining MRS in CE

1. Generate raw representation of:
  • elementary predications (EPs) as objects with predicate and arguments
  • scope information between EPs
  • features of the entities involved

2. Extract intermediate, but generic, concepts describing the raw MRS:
  • patterns of quantification
  • …

3. Transform into domain specific CE concepts
  • using links between the predicate and the CE concept
  • …
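Stage 1 can be illustrated with a small Python sketch that holds each elementary predication as an object and prints it as a raw CE sentence. The `EP` class, the `ORDINALS` table and the function name are assumptions made for this illustration. The two example EPs use the identifiers that appear in the carpet example later in the deck.

```python
from dataclasses import dataclass

# Argument positions are named as in the CE sentences of this deck.
ORDINALS = ["zeroth", "first", "second", "third"]


@dataclass(frozen=True)
class EP:
    """One MRS elementary predication: identifier, predicate, argument variables."""
    ident: str
    predicate: str
    args: tuple


def ep_to_ce(ep: EP) -> str:
    """Render one EP as a raw CE sentence (straight quotes for simplicity)."""
    clauses = [f"is an instance of the mrs predicate '{ep.predicate}'"]
    clauses += [
        f"has the thing {arg} as {ORDINALS[i]} argument"
        for i, arg in enumerate(ep.args)
    ]
    return f"the mrs elementary predication #{ep.ident} " + " and ".join(clauses) + "."


RAW = [
    EP("ep7_3", "_udef_q_rel", ("x9_8",)),
    EP("ep7_5", "_carpet_n_1_rel", ("x9_8",)),
]
for ep in RAW:
    print(ep_to_ce(ep))
```

Scope information and entity features, the other two parts of the raw representation, are left out of this sketch.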

Page 12: Step 1 - CE version of raw MRS

x5 – “I”

x9 – “new carpet”

x5 “needs” x9

Still needs to be turned into more understandable concepts…

Page 13: 3 Steps to Domain Semantics

Raw

the mrs elementary predication #ep7_3 is an instance of the mrs predicate ‘_udef_q_rel’ and has the thing x9_8 as zeroth argument.

the mrs elementary predication #ep7_5 is an instance of the mrs predicate ‘_carpet_n_1_rel’ and has the thing x9_8 as zeroth argument.

the mrs elementary predication #ep7_3 equals modulo quantifiers the mrs elementary predication #ep7_5.

Intermediate (rule to detect quantifier pattern in MRS)

there is an indefinite quantification named q2 that is on the thing x9_8 and has the mrs predicate ‘_carpet_n_1_rel’ as sense.

Domain

the mrs predicate ‘_carpet_n_1_rel’ expresses the entity concept ‘carpet’.

if ( there is an indefinite quantification Q that is on the thing T and has the mrs predicate MRS as sense ) and ( the mrs predicate MRS expresses the entity concept EC ) then ( the thing T is an EC ).

the thing x9_8 is a carpet.
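The intermediate and domain steps can be sketched in Python as two small functions over the raw EPs. This is a rough approximation rather than the CE rule engine: EPs are reduced to (predicate, zeroth argument) pairs, the set `INDEFINITE_QUANTIFIERS` is an assumption about which predicates count as indefinite quantifiers, and the scope ("equals modulo quantifiers") information is ignored.

```python
# Raw EPs as (predicate, zeroth-argument) pairs, taken from the carpet example.
RAW_EPS = [("_udef_q_rel", "x9_8"), ("_carpet_n_1_rel", "x9_8")]

# Assumption: which mrs predicates are treated as indefinite quantifiers.
INDEFINITE_QUANTIFIERS = {"_udef_q_rel"}

# User-defined links: which mrs predicate expresses which entity concept.
EXPRESSES = {"_carpet_n_1_rel": "carpet"}


def detect_quantifications(eps):
    """Intermediate step: pair each quantified thing with the non-quantifier
    predicate that shares its variable (the 'sense' of the quantification)."""
    found = []
    for q_pred, q_var in eps:
        if q_pred not in INDEFINITE_QUANTIFIERS:
            continue
        for pred, var in eps:
            if var == q_var and pred not in INDEFINITE_QUANTIFIERS:
                found.append((q_var, pred))
    return found


def to_domain_facts(quantifications):
    """Domain step: if the sense expresses an entity concept EC, the thing is an EC."""
    facts = []
    for thing, sense in quantifications:
        concept = EXPRESSES.get(sense)
        if concept is None:
            continue  # no user-defined link, so no domain fact
        article = "an" if concept[0] in "aeiou" else "a"
        facts.append(f"the thing {thing} is {article} {concept}.")
    return facts


print(to_domain_facts(detect_quantifications(RAW_EPS)))
```

Keeping the two steps separate mirrors the point of the slide: the intermediate pattern is generic, and only the `EXPRESSES` table carries domain knowledge.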

Page 14: Facts extracted from example sentence

02/03/10 - ET: 0855hrs -- Cell phone call from unidentified male (7115452376) in Bayaa to an unidentified male (7438604901) in Saydiyah //MGRSCOOR: 38S MB 37 77//. The caller stated: “I will need new carpet for my house.” The receiver asked: “How big is the house?” The reply was: “I have a large family.” The receiver said, “I will see what I can do.” The call lasted 15 seconds

If other reports can add to information on the man x5_8 then we may know who is requiring new carpets, and could predict future events?

This requires a number of linguistic and domain specific steps

Page 15: Discussion

• DELPH-IN community have developed excellent Natural Language capabilities
• We are integrating the “ERG system” and expressing lexicon, grammar rules and semantics in CE
• However in the ERG system, the semantics are not completely separated from the linguistic structures
  • we propose intermediate semantic structures in CE, for bridging the gap between generic and domain semantics
• We are introducing domain semantics to represent facts in CE
  • provides a “target” for output of the ERG system
  • opportunity to explore how this can affect parsing of sentences
• Much needs to be done
  • improve integration
  • extend intermediate MRS
  • obtain rationale
  • feedback of semantic reasoning into the parsing mechanisms, to help adding/understanding of rules

Page 16: Extra

Page 17: Information Flow

[Diagram labels: Text; ERG rules & types; ERG lexicon; PET parser; PET parse tree; MRS; Parse tree as CE; Raw MRS as CE; Stanford Parser; shallow processing; CE lexicon; CE linguistic frames; Conceptual model; CE facts. Annotation: Use same transformation to be consistent]

Red links have been partially implemented

Page 18: Rationale

“the group of things x10 has the entity concept survey as categorisation.”

The rationale from the elementary predicates is:

How do we get the rationale FOR the elementary predicates?
• could follow the parse tree + the TFS definitions, but need a link between parse tree and MRS, which is so far not available

Page 19: A layered Conceptual Model

Meta Model
  concepts: Concept, Entity Concept, Relation Concept, Conceptual Model
  relations: belongs to, has as domain

Semiotics
  concepts: Thing, Meaning, Symbol
  relations: stands for, expresses

General Semantics
  concepts: Agent, Spatial Entity, Temporal Entity, Situation, Container
  relations: has as agent role, is contained in

Linguistic
  concepts: Sentence, Phrase, Word, Noun, Fragment, Linguistic Frame
  relations: has as dependent, is parsed from, expresses

Analysts Domain Model
  concepts: Place, Person, Village, Communication, IED, Facility, ....
  relations: is located in, monitors

Our Semiotic Triangle, based on [Ogden, C. K. and Richards, I. A. (1923). ]

Page 20: The ERG system architecture

• PET is run under Linux (Debian) in an Oracle VirtualBox image
• A Prolog program provides a web service for parsing sentences and turning the result into CE
• Aiming to integrate to our CE Store

[Diagram labels: sentence; PROLOG web service; PET parser with ERG; parse tree and MRS; PROLOG CE generator; CE parse tree and MRS]

Page 21: Feedback of domain reasoning to the parsing?

We want the domain to affect the parse, e.g.:
• creating new lexical entries and grammar rules prior to parsing

But we also want arbitrary domain reasoning to affect the parse at runtime

Could this:
• rule out inconsistent parses
• provide disambiguations, and dialog context?

[Diagram labels: ERG/PET; DOMAIN REASONER; facts; constraints on linguistic phenomena; ERG; DOMAIN MODEL; lexical entries, grammar rules]

Page 22: Linking text to domain situations

Page 23: Working out the “requirer”

This can only be done by analysis of the communications as a whole (including anaphoric reference)

02/03/10 - ET: 0855hrs -- Cell phone call from unidentified male (7115452376) in Bayaa to an unidentified male (7438604901) in Saydiyah //MGRSCOOR: 38S MB 37 77//. The caller stated: “I will need new carpet for my house.” The receiver asked: “How big is the house?” The reply was: “I have a large family.” The receiver said, “I will see what I can do.” The call lasted 15 seconds

[Diagram labels: STEP C; STEP A]

• Step C needs knowledge of the structure of the report and of communications
• Step A needs linguistic knowledge
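One way to picture the report-structure knowledge involved is a Python sketch that attributes each quoted utterance to the role that spoke it, so that a first-person “I” inside the quote can be tied to that role. This is only one reading of the problem. The pattern `The <role> <verb>: “<quote>”` is an assumed report convention and the function names are hypothetical. The utterance introduced by “The reply was:” names no speaker, so this sketch leaves it unresolved.

```python
import re

REPORT = (
    "The caller stated: “I will need new carpet for my house.” "
    "The receiver asked: “How big is the house?” "
    "The reply was: “I have a large family.” "
    "The receiver said, “I will see what I can do.”"
)

# Assumed convention: 'The <role> <verb>: “<quote>”' (or a comma after the verb).
UTTERANCE = re.compile(r"The (caller|receiver) (?:stated|asked|said)[:,]\s*“([^”]*)”")


def attribute_quotes(report):
    """Return (role, quote) pairs for every utterance with an explicit speaker."""
    return [(m.group(1), m.group(2)) for m in UTTERANCE.finditer(report)]


def resolve_first_person(report):
    """Keep the quotes that contain the pronoun 'I', paired with the speaking role."""
    return [
        (role, quote)
        for role, quote in attribute_quotes(report)
        if re.search(r"\bI\b", quote)
    ]


for role, quote in resolve_first_person(REPORT):
    print(f"'I' refers to the {role} in: {quote}")
```

Resolving the unattributed reply, and deciding that the caller is the “requirer”, would need the whole-communication analysis described above.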

Page 24: Example CE rules

DOMAIN RULE

if
  ( the communication C has the agent A as initiator ) and
  ( the agent A is located in the place P )
then
  ( the communication C is from the place P ).

LINGUISTIC RULE

if
  ( the mrs elementary predication EP is an instance of the mrs predicate '_in_p_rel'
    and has the thing T as first argument
    and has the thing C as second argument )
then
  ( the thing T is contained in the container C ).
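The two rules can be approximated in Python as functions over simple fact tuples. This sketch does not reproduce how the CE Store evaluates rules. Facts are assumed to be (subject, relation, object) triples, EPs are assumed to be (predicate, argument list) pairs indexed by argument position, and the identifiers `call1`, `caller1`, `e1`, `x1` and `x2` are made up for the example.

```python
def linguistic_rule(eps):
    """An '_in_p_rel' EP with first argument T and second argument C
    yields the fact 'T is contained in C'."""
    return {
        (args[1], "is contained in", args[2])
        for pred, args in eps
        if pred == "_in_p_rel"
    }


def domain_rule(facts):
    """If C has A as initiator and A is located in P, then C is from P."""
    derived = set()
    for comm, rel1, agent in facts:
        if rel1 != "has as initiator":
            continue
        for subject, rel2, place in facts:
            if rel2 == "is located in" and subject == agent:
                derived.add((comm, "is from", place))
    return derived


# Hypothetical inputs: argument lists are indexed zeroth, first, second.
eps = [("_in_p_rel", ["e1", "x1", "x2"])]
facts = {
    ("call1", "has as initiator", "caller1"),
    ("caller1", "is located in", "Bayaa"),
}

print(linguistic_rule(eps))
print(domain_rule(facts))
```

The linguistic rule reads only MRS structure, while the domain rule reads only domain facts, which is the separation the two labels on the slide point to.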

Page 25: Domain Situations

[Diagram: four situations (a requirement, a production, a delivery, a usage), each linked to the material by “has as material”. Agent links: a requirement “is requested by” an agent and “is requested from” an agent; a production “is produced by” an agent; a delivery “is delivered by” an agent and “is delivered to” an agent; a usage “is performed by” an agent. A further relation is labelled “needs”.]

are these the same agent?

Page 26: CE representation for parse tree

Page 27: Defining ERG grammar rules in CE

Analysis of the rules for hd_cmp_u_c

basic_head_initial := basic_binary_headed_phrase &
  [ HD-DTR #head,
    NH-DTR #non-head,
    ARGS < #head, #non-head > ].

Ordered sequence of subcomponents, head daughter followed by non head daughter

headed_phrase := phrase &
  [ SYNSEM.LOCAL [ CAT [ HEAD head & #head, HC-LEX #hclex ],
                   AGR #agr, CONJ #conj ],
    HD-DTR.SYNSEM.LOCAL local & [ CAT [ HEAD #head, HC-LEX #hclex ],
                                  AGR #agr, CONJ #conj ] ].

Some info is passed up from head daughter to “this” phrase
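The passing up of information from the head daughter can be sketched in Python by copying the values that the `#head`, `#hclex`, `#agr` and `#conj` tags tie together. This is a loose approximation: typed feature structures share these values by identity through unification, whereas the sketch copies them between flat dictionaries keyed by path. The feature values in the example are invented placeholders.

```python
# Paths under SYNSEM.LOCAL that headed_phrase ties to the head daughter.
SHARED_PATHS = ["CAT.HEAD", "CAT.HC-LEX", "AGR", "CONJ"]


def pass_up_from_head(head_daughter_local: dict) -> dict:
    """Return the mother's SYNSEM.LOCAL values inherited from HD-DTR.
    Other features of the daughter are not passed up by this rule."""
    return {path: head_daughter_local[path] for path in SHARED_PATHS}


# Invented placeholder values for one head daughter.
head_daughter = {
    "CAT.HEAD": "verb",
    "CAT.HC-LEX": "+",
    "AGR": "agr1",
    "CONJ": "conj1",
    "CAT.VAL": "not shared by this rule",
}
print(pass_up_from_head(head_daughter))
```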


Page 29: Calling ERG system from Word