premapper: improving entity extraction accuracy in the digital humanities

Post on 18-Dec-2014

IBM Haifa Research Lab

© 2014 IBM Corporation

PreMapper: Improving Entity Extraction Accuracy in the Digital Humanities

Cormac Hampson (cormac.hampson@scss.tdc.ie)

Ella Rabinovich (ellak@il.ibm.com), Sara Porat (porat@il.ibm.com)

Maya Koleva (maya.koleva@commetric.com), Ivan Uzunov (ivan.uzunov@commetric.com)

Owen Conlan (owen.conlan@scss.tcd.ie)

What is CULTURA?

• Digital humanities portal supporting the exploration of cultural heritage collections by a range of different users

• Professional researchers and historians

• Students with little or no experience of a particular archive

• There are three digitised collections in the portal

• 1641 Depositions (http://cultura-project.eu/1641/)

• Bureau of Military History - 1916 Rising (http://cultura-project.eu/1916)

• IPSA Collection (http://cultura-project.eu/ipsa)

Smart Content Analysis with Entity-Relationship Extraction

• A powerful technique for injecting semantics into unstructured text

• Employs Natural Language Processing (NLP)

• Involves training a model on a dataset and/or using prior knowledge (e.g., dictionaries) so that specific entities can be identified within the text

• Each collection introduces its own entity-relationship model

• Entities, e.g., Person, Location, Event

• Entity attributes, e.g., Person.occupation, Deposition.mentioned_date

• Relations between entities, e.g., Person at Location
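
The entity-relationship model above can be pictured as a small data structure. The following is a minimal sketch only, assuming a generic in-memory representation; the class names `Entity` and `Relation`, their fields, and the sample values are hypothetical and are not taken from the CULTURA implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """An extracted entity such as a Person, Location or Event."""
    entity_id: str
    entity_type: str  # e.g. "Person", "Location", "Event"
    attributes: dict = field(default_factory=dict)  # e.g. {"occupation": ...}


@dataclass
class Relation:
    """A typed link between two entities, e.g. Person "at" Location."""
    source_id: str
    relation_type: str
    target_id: str


# Hypothetical sample values, not drawn from any CULTURA collection.
person = Entity("p1", "Person", {"occupation": "merchant"})
place = Entity("l1", "Location", {"name": "Dublin"})
link = Relation(person.entity_id, "at", place.entity_id)
```

Keeping attributes in a free-form dictionary is one way to let each collection define its own attribute set (Person.occupation, Deposition.mentioned_date) without changing the classes.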

Entity Extraction – Example

<title> first-name last-name

sir Robert Andrew
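
A pattern such as `<title> first-name last-name` could be approximated with a regular expression. This is an illustrative sketch only, not the extraction pipeline used in CULTURA; the list of titles and the function name `extract_people` are assumptions made for the example.

```python
import re

# Assumed list of honorifics; a real extractor would use a larger dictionary.
TITLES = ("sir", "lord", "lady", "captain")

# The title is matched case-insensitively; both names must be capitalised.
PERSON_PATTERN = re.compile(
    r"\b(?P<title>(?i:" + "|".join(TITLES) + r"))\s+"
    r"(?P<first_name>[A-Z][a-z]+)\s+"
    r"(?P<last_name>[A-Z][a-z]+)\b"
)


def extract_people(text):
    """Return one dict per '<title> first-name last-name' match in text."""
    return [match.groupdict() for match in PERSON_PATTERN.finditer(text)]


people = extract_people("the said sir Robert Andrew deposeth")
```

A rule this strict misses lower-cased or misspelled names, which is the kind of error that noisy historical text tends to produce.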

Manual Updates of Extracted Entities - Motivation

• Automatic entity extraction cannot be fully accurate

• The 1641 Depositions collection introduces additional difficulty due to noisy text and inconsistent grammar and spelling

• Extraction errors can damage a curator’s trust in the automatic processing, as well as an end user’s overall confidence in the system

• Approaches to improve the accuracy of entity extraction are of major benefit to the CULTURA environment

Entities Visualisation and Modification with PreMapper

• PreMapper is a web-based visualization and analysis tool that is integrated into the CULTURA environment

• Provides visualisation and editing of entities, maps, flows and relationships between individuals and groups

• Entities (people, organizations) are represented by nodes; links represent relationships between these nodes

Manual Changes of Extracted Entities

PreMapper enables curators of the collection to make manual changes to the extracted entities using a GUI

• Add/delete/update entity

• Merge two entities into a single entity (entity disambiguation)

• Add/delete relationship between entities

The entity “sir phelim” can be merged with the entity “phelim neil” if an expert deems that these entities refer to the same person
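
The merge operation can be sketched on a small node-and-link graph. This is a hedged illustration under simple assumptions, not PreMapper's actual data model or API: the class `EntityGraph`, its methods, and the placeholder locations are all hypothetical.

```python
class EntityGraph:
    """Minimal in-memory graph: nodes are entities, edges are relationships."""

    def __init__(self):
        self.nodes = {}     # entity name -> set of aliases absorbed by merges
        self.edges = set()  # (source, relation_type, target) triples

    def add_entity(self, name):
        self.nodes.setdefault(name, set())

    def add_relationship(self, source, relation_type, target):
        self.add_entity(source)
        self.add_entity(target)
        self.edges.add((source, relation_type, target))

    def merge(self, keep, drop):
        """Fold entity `drop` into `keep`, re-pointing its relationships."""
        if keep == drop:
            return
        aliases = self.nodes.pop(drop)
        self.nodes[keep] |= aliases | {drop}
        rewired = set()
        for source, relation_type, target in self.edges:
            source = keep if source == drop else source
            target = keep if target == drop else target
            if source != target:  # skip self-loops created by the merge
                rewired.add((source, relation_type, target))
        self.edges = rewired


# Hypothetical data: the two locations are placeholders.
graph = EntityGraph()
graph.add_relationship("sir phelim", "at", "location A")
graph.add_relationship("phelim neil", "at", "location B")
graph.merge(keep="phelim neil", drop="sir phelim")
```

Recording the dropped name as an alias keeps the original spelling available after the merge, which may help an expert review or revert the change later.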

General Flow

Entity Disambiguation via PreMapper

• The task of determining the identity of entities mentioned in the text

• e.g., based on entity’s key attributes

• Entity disambiguation in historical content is one of the main challenges for CULTURA's professional users

• Are “sir Phelim” and “Phelim o neil” the same person?

• Are “Rob. Meredith” and “Robert Meredith” the same person?

• Entity scope matters (disambiguation of entities found in the same deposition vs. entities found in different depositions)

• Non-functional challenges

• Authorization – who is allowed to make changes?

• Personalization – what is the scope of a specific change (specific researcher, group of researchers, the entire professional community)?

• Verification – who verifies the changes?
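
Questions such as whether “Rob. Meredith” and “Robert Meredith” are the same person could be surfaced to an expert as merge candidates. The sketch below assumes plain string similarity from Python's standard `difflib` module; the function names, the 0.6 threshold, and the deposition identifiers are hypothetical, and the approach is not the one described in these slides.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalise(name):
    """Lower-case a name and drop punctuation such as the dot in 'Rob.'."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())


def merge_candidates(mentions, threshold=0.6):
    """Yield (name_a, name_b, score, same_deposition) for similar names.

    `mentions` is a list of (name, deposition_id) pairs.
    """
    for (name_a, dep_a), (name_b, dep_b) in combinations(mentions, 2):
        score = SequenceMatcher(None, normalise(name_a), normalise(name_b)).ratio()
        if score >= threshold:
            yield name_a, name_b, round(score, 2), dep_a == dep_b


# Hypothetical deposition identifiers, used only to show the scope flag.
mentions = [
    ("Rob. Meredith", "dep-1"),
    ("Robert Meredith", "dep-1"),
    ("sir Phelim", "dep-2"),
]
candidates = list(merge_candidates(mentions))
```

The `same_deposition` flag reflects the point about scope: a match within one deposition is arguably stronger evidence than a match across depositions. String similarity only proposes candidates, and pairs that share little text, such as “sir Phelim” and “Phelim o neil”, may fall below the threshold, so the final decision stays with the expert.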

Summary and Future Work

• Entity-relationship extraction is a powerful technique for extracting structured information from unstructured documents

• PreMapper is a visualization tool that allows domain experts to improve the accuracy of the entity-relationship data

• Domain experts' feedback is important in refining the user interface with the CULTURA environment

• It becomes vital when entity extraction is error-prone, as with the 1641 Depositions collection that contains a lot of noise and misspellings

• Future work includes further exploration and design of the fully integrated end-to-end solution

http://staging1.commetric.com:8080/cultura/?q=1641&ids=836062r034&nodeTypeId=7&layout=circle#svg-graph-editor-switch
