PreMapper: Improving Entity Extraction Accuracy in the Digital Humanities

Cormac Hampson
Ella Rabinovich, Sara Porat
Maya Koleva, Ivan Uzunov
Owen Conlan

IBM Haifa Research Lab
© 2014 IBM Corporation

Upload: ella-rabinovich

Post on 18-Dec-2014

What is CULTURA

• Digital humanities portal supporting the exploration of cultural heritage collections by a range of different users

• Professional researchers and historians

• Students with little or no experience of a particular archive

• There are three digitised collections in the portal

• 1641 Depositions (http://cultura-project.eu/1641/)

• Bureau of Military History - 1916 Rising (http://cultura-project.eu/1916)

• IPSA Collection (http://cultura-project.eu/ipsa)

Smart Content Analysis with Entity-Relationship Extraction

• A powerful technique for injecting semantics into unstructured text

• Employing Natural Language Processing (NLP)

• Involving training a model on a dataset and/or using prior knowledge (e.g., dictionaries) so that specific entities can be identified within the text

• Each collection introduces its own entity-relationship model

• Entities, e.g., Person, Location, Event

• Entity attributes, e.g., Person.occupation, Deposition.mentioned_date

• Relations between entities, e.g., Person at Location
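
The entity-relationship model described above can be sketched as a pair of small data structures. This is only an illustration of the idea, not the schema used in CULTURA: the class names `Entity` and `Relation`, their field names, and all sample values are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Entity:
    """One extracted entity; field names are hypothetical."""
    entity_id: str
    entity_type: str                      # e.g. "Person", "Location", "Event"
    name: str
    # Entity attributes, e.g. {"occupation": ...} for Person.occupation
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class Relation:
    """A typed link between two entities, referenced by their ids."""
    source_id: str
    relation_type: str                    # e.g. "at" for "Person at Location"
    target_id: str


# Placeholder values for illustration only, not taken from any collection.
person = Entity("e1", "Person", "Robert Andrew", {"occupation": "unknown"})
place = Entity("e2", "Location", "example-town")
link = Relation(person.entity_id, "at", place.entity_id)
```

Keeping attributes in a dictionary lets each collection define its own attribute set without changing the class, which matches the point that every collection brings its own model.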

Entity Extraction – Example

<title> first-name last-name

sir Robert Andrew
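
A rule such as "<title> first-name last-name" can be approximated with a regular expression. The sketch below is a simplified stand-in for a dictionary-driven rule and is not the extraction pipeline used in the project; the `TITLES` list, the `PERSON_PATTERN` name and the `extract_people` helper are all hypothetical.

```python
import re

# Hypothetical dictionary of titles; a real one would be curated per collection.
TITLES = ["sir", "lord", "lady", "captain", "mr", "mrs"]

# <title> first-name last-name: the title is matched case-insensitively,
# the two name tokens must start with a capital letter.
PERSON_PATTERN = re.compile(
    r"\b(?i:(?P<title>" + "|".join(TITLES) + r"))\.?\s+"
    r"(?P<first>[A-Z][a-z]+)\s+(?P<last>[A-Z][a-z]+)\b"
)


def extract_people(text):
    """Return one dict (title, first, last) per match of the pattern."""
    return [m.groupdict() for m in PERSON_PATTERN.finditer(text)]


people = extract_people("sir Robert Andrew")
```

A strict pattern like this misses names with irregular capitalisation or spelling, which is one reason rule-based extraction alone struggles on noisy historical text.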

Manual Updates of Extracted Entities - Motivation

• Automatic entity extraction cannot provide full accuracy

• The 1641 Depositions collection introduces additional difficulty due to its noisy text and inconsistent grammar and spelling

• Extraction errors can damage a curator’s trust in the automatic processing, as well as an end user’s overall confidence in the system

• Approaches to improve the accuracy of entity extraction are of major benefit to the CULTURA environment

Entities Visualisation and Modification with PreMapper

• PreMapper is a web-based visualization and analysis tool that is integrated into the CULTURA environment

• Provides visualisation and editing of entities, maps, flows and relationships between individuals and groups

• Entities (people, organizations) are represented by nodes; links represent relationships between these nodes

Manual Changes of Extracted Entities

PreMapper enables curators of the collection to make manual changes to the extracted entities using a GUI:

• Add/delete/update entity

• Merge two entities into a single entity (entity disambiguation)

• Add/delete relationship between entities

The entity “sir phelim” can be merged with the entity “phelim neil” if an expert deems that these entities refer to the same person
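
The editing operations listed above can be sketched on a simple node-link structure. This is a minimal illustration assuming entities are identified by name and links are (source, relation, target) triples; the `EntityGraph` class, its method names, the "at" relation and the "example-location" node are hypothetical and do not describe PreMapper's internals.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class EntityGraph:
    """Nodes are entity names; links are (source, relation, target) triples."""
    nodes: set[str] = field(default_factory=set)
    links: set[tuple[str, str, str]] = field(default_factory=set)

    def add_entity(self, name: str) -> None:
        self.nodes.add(name)

    def delete_entity(self, name: str) -> None:
        # Removing a node also removes every link that touches it.
        self.nodes.discard(name)
        self.links = {l for l in self.links if name not in (l[0], l[2])}

    def add_relationship(self, source: str, relation: str, target: str) -> None:
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both entities must exist before linking them")
        self.links.add((source, relation, target))

    def delete_relationship(self, source: str, relation: str, target: str) -> None:
        self.links.discard((source, relation, target))

    def merge(self, keep: str, absorb: str) -> None:
        """Merge `absorb` into `keep`, re-pointing its links to `keep`."""
        if keep not in self.nodes or absorb not in self.nodes:
            raise KeyError("both entities must exist before merging")
        if keep == absorb:
            return
        rewired = set()
        for s, r, t in self.links:
            new_s = keep if s == absorb else s
            new_t = keep if t == absorb else t
            if new_s == new_t and s != t:
                continue  # link joined the two merged entities; skip the self-loop
            rewired.add((new_s, r, new_t))
        self.links = rewired
        self.nodes.discard(absorb)


graph = EntityGraph()
for name in ("sir phelim", "phelim neil", "example-location"):
    graph.add_entity(name)
graph.add_relationship("sir phelim", "at", "example-location")
graph.merge(keep="phelim neil", absorb="sir phelim")
```

After the merge, the relationship that belonged to "sir phelim" is attached to "phelim neil", so no information recorded against the absorbed entity is lost.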

General Flow

Entity Disambiguation via PreMapper

• The task of determining the identity of entities mentioned in the text

• e.g., based on an entity’s key attributes

• Entity disambiguation in historical content is one of the main challenges for CULTURA’s professional users

• Are “sir Phelim” and “Phelim o neil” the same person?

• Are “Rob. Meredith” and “Robert Meredith” the same person?

• Entity scope matters (disambiguating entities found in the same deposition vs. entities found in different depositions)

• Non-functional challenges

• Authorization – who is allowed to make changes?

• Personalization – what is the scope of a specific change (specific researcher, group of researchers, the entire professional community)?

• Verification – who verifies the changes?
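
One way to support the expert is to suggest merge candidates automatically and leave the decision to the curator. The sketch below compares names with a character-level similarity from Python's standard `difflib` module and applies a stricter threshold when the two mentions come from different depositions. It is an assumption-laden illustration, not the method used in PreMapper: the `TITLES` set, the `Mention` class, the function names and both threshold values are invented for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from difflib import SequenceMatcher

# Hypothetical titles to ignore when comparing names.
TITLES = {"sir", "mr", "mrs", "lord", "lady", "captain"}


@dataclass(frozen=True)
class Mention:
    name: str
    deposition_id: str


def normalise(name: str) -> str:
    """Lower-case the name, drop full stops and titles, collapse whitespace."""
    tokens = name.lower().replace(".", " ").split()
    return " ".join(t for t in tokens if t not in TITLES)


def name_similarity(a: str, b: str) -> float:
    """Character-level similarity of the normalised names, between 0 and 1."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def is_merge_candidate(a: Mention, b: Mention,
                       same_deposition_threshold: float = 0.6,
                       cross_deposition_threshold: float = 0.8) -> bool:
    """Suggest a merge; thresholds are illustrative and stricter across depositions."""
    same_scope = a.deposition_id == b.deposition_id
    threshold = same_deposition_threshold if same_scope else cross_deposition_threshold
    return name_similarity(a.name, b.name) >= threshold


# Deposition ids "d1" and "d2" are placeholders.
same_doc = is_merge_candidate(Mention("sir Phelim", "d1"), Mention("Phelim o neil", "d1"))
cross_doc = is_merge_candidate(Mention("sir Phelim", "d1"), Mention("Phelim o neil", "d2"))
abbreviated = is_merge_candidate(Mention("Rob. Meredith", "d1"), Mention("Robert Meredith", "d2"))
```

With these illustrative thresholds, the “sir Phelim” pair is suggested only when both mentions occur in the same deposition, while the “Rob. Meredith” pair is suggested even across depositions. Such a function only proposes candidates; who may accept them, for whom the change applies, and who verifies it remain the non-functional questions listed above.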

Summary and Future Work

• Entity-relationship extraction is a powerful technique for extracting structured information from unstructured documents

• PreMapper is a visualization tool that allows domain experts to improve the accuracy of the entity-relationship data

• Domain experts’ feedback is important in refining the user interface with the CULTURA environment

• It becomes vital when entity extraction is error-prone, as with the 1641 Depositions collection that contains a lot of noise and misspellings

• Future work includes further exploration and design of the fully integrated end-to-end solution

http://staging1.commetric.com:8080/cultura/?q=1641&ids=836062r034&nodeTypeId=7&layout=circle#svg-graph-editor-switch
