a semantic best-effort approach for extracting structured discourse graphs from wikipedia

42
Copyright 2009 Digital Enterprise Research Institute. All rights reserved. Digital Enterprise Research Institute www.deri.i e A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia André Freitas, Danilo Carvalho, J. C. P. da Silva, Sean O’Riain, Edward Curry

Upload: andre-freitas

Post on 10-May-2015

931 views

Category:

Documents


0 download

DESCRIPTION

Most information extraction approaches available today have either focused on the extraction of simple relations or in scenarios where data extracted from texts should be normalized into a database schema or ontology. Some relevant information present in natural language texts, however, can be irregular, highly contextualized, with complex semantic dependency relations, poorly structured, and intrinsically ambiguous. These characteristics should also be supported by an information extraction approach. To cope with this scenario, this work introduces a seman- tic best-effort information extraction approach, which targets an information extraction scenario where text information is extracted under a pay-as-you-go data quality perspective, trading high-accuracy, schema consistency and terminological normalization for domain-independency, context capture, wider extraction scope and maximization of the text semantics extraction and representation. A semantic information ex- traction framework (Graphia) is implemented and evaluated over the Wikipedia corpus.

TRANSCRIPT

Page 1: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Copyright 2009 Digital Enterprise Research Institute. All rights reserved.

Digital Enterprise Research Institute www.deri.ie

A Semantic Best-Effort Approach for Extracting Structured

Discourse Graphs from WikipediaAndré Freitas, Danilo Carvalho, J. C. P. da

Silva, Sean O’Riain, Edward Curry

Page 2: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

Outline

Motivation Representation

Requirements Semantic Best-effort Representation

Extraction Graphia Extractor Preliminary Evaluation Extraction Examples

Conclusion

Page 3: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

Motivation

Page 4: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

Motivation

Page 5: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

Motivation

Page 6: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

Motivation

Page 7: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

Motivation

Linked Data Terminological and structural regularity Shared semantic agreement between data consumers

Natural language texts No terminological or structural

regularity Highly contextualized Complex semantic dependency

relations Ambiguity Information selection/normalization

- vocabulary constraints+ entity-centric+ pay-as-you-go data semantics = semantic best-effort

Page 8: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

Motivation

Vocabulary-independent (schema-free queries) How to abstract users from knowing the data

representation? Semantic matching

Schemaless databases in the limit demands vocabulary-independency

How information extraction is reshaped in this scenario?

Page 9: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

Motivational Scenario

What is the relationship between Barack Obama and Indonesia?

Sentence: From age six to ten, Obama attended local schools in Jakarta, including Besuki Public School and St Francis Assisi School.

Semantic Best-effort Extraction

Entity-centric text representation

Page 10: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

Representation

Page 11: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

Computational Linguistics Perspective

What is already there to represent NL? Discourse Representation Theory (DRT) Semantic Role Labeling (SRL)

Page 12: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

Discourse Representation Theory (DRT) “The key idea behind (...) Discourse Representation Theory

is that each new sentence of a discourse is interpreted in the context provided by the sentences preceding it.”

van Eijck and Kamp Models propositions in discourse (multiple sentences). Discourse representation structures (DRS).

John enters a card. Every card is green.

Page 13: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

Semantic Role Labeling (SRL)

Shallow semantic parsing. Detection of arguments associated with a predicate. Associated semantic types to arguments.

Bill cut his hair with a razor

[Agent Bill] cut [Patient his hair] [Instrument with a razor.]

Page 14: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

Semantic Best-Effort

Objectives: Entity-centric & Standardized: easier to integrate with

other resources Remove the formal constraints and the ‘baggage’ from

existing approaches Representation robust to extraction limitations/errors

Page 15: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

Semantic Best-Effort Requirements

Text segmentation into (s,p,o)s Context representation Conceptual model independency Resolve co-references (pay-as-you-go) Represent recurrent discourse structures Standardized representation (RDF(S)) Principled interpretation (compositionality)

Page 16: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

Examples

- Text segmentation into (s,p,o)s

- Context representation- Resolve co-references (pay-as-you-go)- Conceptual model independency

Page 17: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

Examples

- Context representation

Page 18: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

Examples

Page 19: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

Examples

- Represent recurrent discourse structures

Page 20: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

Examples

- Represent recurrent discourse structures

- Resolve co-references (pay-as-you-go)

Page 21: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

Examples

- Represent recurrent discourse structures

Page 22: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

SDG Elements

Named, non-named entities and properties Quantifiers & operators Triple Trees Context elements Co-Referential elements Resolved & normalized entities

Page 23: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

Graph Patterns

Page 24: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

[[Interpretation]]

Graph traversal – deref sequence

Page 25: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

Extraction

Page 26: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

SBE Graph Extraction Tool

Page 27: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

Extraction Pipeline Architecture

Subject Predicate Object Prepositional phrase & Noun complement Reification Time

Page 28: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

Preliminary Evaluation

1033 relations (triples) from 150 sentences from 5 randomly selected Wikipedia articles

Manually classified the graphs: error categories and accuracy.

Page 29: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

Preliminary Evaluation

Page 30: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

Other Extraction Examples

Page 31: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

Other Extraction Examples

Page 32: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

Other Extraction Examples

Page 33: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

Other Extraction Examples

Page 34: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

Other Extraction Examples

Page 35: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

Other Extraction Examples

Page 36: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

Other Extraction Examples

Page 37: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

Other Extraction Examples

Page 38: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

Other Extraction Examples

Page 39: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

Other Extraction Examples

Page 40: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

Other Extraction Examples

Page 41: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

Other Extraction Examples

Page 42: A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

Conclusion

Main direction for improvement is completeness Aligned with the pay-as-you-go scenario

Still need to define clear criteria for what you can’t extract There is still a long way to go (e.g. complex subordination) Investigation using existing n-ary relations patterns Context (reification) should be a first-class citizen in the

representation of natural language Focus on getting the semantic pivots (rigid designators)

right Worth putting effort on enumerable patterns (timestamps,

operators)