BiographyNet: Linking the world of History


Page 1: BiographyNet: Linking the world of History

BiographyNet: Linking the world of History

Serge ter Braake, Antske Fokkens, Niels Ockeloen, Susan Legêne, Guus Schreiber, Piek Vossen, et al.

The Network Institute, VU University Amsterdam
http://wm.cs.vu.nl http://www.biographynet.nl

October 2013

Page 2: BiographyNet: Linking the world of History

BiographyNet: Linking the world of history
General project info, February 2014

Overview

Overview of this presentation
• Introduction of the project
  • What is E-history?
  • Project goals
• Short overview of use cases
  • Illustrative use case example
• Text mining using NLP
  • Challenges
  • Preliminary results
• Why provenance is important
  • Requirements from the perspective of the Historian
  • Requirements from the perspective of the Computer scientist
• The BiographyNet schema
  • Extending the schema with Provenance
  • Aggregated provenance information
  • Detailed provenance information
• Demonstrator Interface
  • First ideas and sketches

Page 3: BiographyNet: Linking the world of History

What is BiographyNet?

BiographyNet: Extracting relations between people, places and historic events
• Multidisciplinary E-History Project

Page 4: BiographyNet: Linking the world of History

What is E-history?

E-humanities
Investigates what can be done in the humanities with modern techniques that could not be done before, or only with a great deal of effort.

E-history
Subdomain of E-humanities which aims at improving existing methods of historical research rather than introducing a whole new way of doing historical research. *

* Zaagsma, G.: Doing history in the digital age: history as a hybrid practice (2013)
http://gerbenzaagsma.org/blog/16-03-2013/doing-history-digital-age-history-hybrid-practice


Page 6: BiographyNet: Linking the world of History

What is BiographyNet?

BiographyNet: Extracting relations between people, places and historic events
• Multidisciplinary E-History Project
• Funded by the Netherlands eScience Center
• Partners are the Netherlands eScience Center, the Huygens/ING Institute of the Royal Dutch Academy of Sciences and VU University Amsterdam
• Starting point: The Biographical Portal of the Netherlands - http://www.biografischportaal.nl
  • 125,000 short biographical descriptions with limited meta data from a variety of Dutch biographical dictionaries
  • 76,000 individuals

Page 7: BiographyNet: Linking the world of History

Short biographical descriptions with limited meta data

[Bar chart: Individuals with available information (%). Axis: percentage, 0 to 120. Bars: Name, Category, Gender, Date of Death, Date of Birth, Place of Birth, Place of Death, Occupation, Religion, Father, Mother, Claim to Fame, Partner, Text.]

Page 8: BiographyNet: Linking the world of History

Project Goals

Main project goals
• Provide a richer historic knowledge base by creating a semantic layer on top of the data from the Biographical Portal
  • Convert the available data to RDF (first conversion available)
  • Enrichments (NLP) and Aggregations
  • Link to other sources
• Inspire Historians in setting up new research projects by providing them with interesting leads
  • Development of a demonstrator
  • Quantitative analysis, visualisation and browsing techniques
• Re-usable deliverables
  • Open-source release of the platform for analyzing texts about people
  • Methodology for extraction of a relation network between people, places and events

Page 9: BiographyNet: Linking the world of History

Use Case Overview

Currently 12 use cases developed, involving quantitative analysis, relation discovery, thematic research, etc.
• Simple:
  • Group analysis of Governors-general of the Dutch Indies
• More complex:
  • When did Dutch elites get involved with the ‘New World’?
• Highly complex:
  • What can we say about nationalism in biographical dictionaries from the nineteenth and twentieth century?

Page 10: BiographyNet: Linking the world of History

Illustrative use case

Governors-General of the Dutch Indies
• Highest Official in the Dutch Indies (1610-1949)
• 129 Biographies describing 71 individuals
• What can we say about these men as a group?
• What properties did they need to have to be appointed?
  • Personal qualities
  • Relations (already more difficult)

Page 11: BiographyNet: Linking the world of History

Governors General: Data Mining

Focus on the following information
• Family connections
  • Parents
  • Partner
  • Children
• Dates
  • Birth
  • Appointment
  • Death
• Motivation
  • Education
  • Religion
  • Reasons for appointment
  • Reasons for leaving the office

Page 12: BiographyNet: Linking the world of History

Governors General: Time and effort

Manual analysis
“More than one full week to manually mine this information from the Biography Portal.” (Serge ter Braake)

The question
“Can a historian do this with (almost) the same results in less than an hour when using the demonstrator?”

Page 13: BiographyNet: Linking the world of History

Text mining using Natural Language Processing (NLP)

Basic System for data enrichment using text:
• Identifying meta data in text
  • Linguistically naïve supervised machine learning
• Linguistic processing
  • Detection of (co-referenced) named-entities (persons, places and dates) and events
  • Concept identification

Page 14: BiographyNet: Linking the world of History

NLP: Challenges

Challenges for NLP within BiographyNet:
• Deal with alternative spelling
  • Texts vary from 19th century Dutch to contemporary Dutch
  • Variations in the naming of people and places
• OCR-ed texts contain errors
• The methods used may introduce bias:
  • Example: Location identification with GeoNames
    Heuristic: When there are multiple possibilities, take the one in, or closest to, The Netherlands
  • Problem: ‘America’ is a place in The Netherlands, but what about trade with the New World?
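The location heuristic on this slide can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the candidate records, their approximate coordinates and the reference point `NL_CENTER` are made up for the example, and no real GeoNames call is made.

```python
from math import asin, cos, radians, sin, sqrt

# Assumption: a rough reference point inside the Netherlands (lat, lon).
NL_CENTER = (52.2, 5.3)


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def pick_candidate(candidates):
    """Prefer a candidate inside NL; otherwise take the one closest to NL_CENTER."""
    return min(
        candidates,
        key=lambda c: (c["country"] != "NL", haversine_km(c["coords"], NL_CENTER)),
    )


# Hypothetical lookup result for the place name 'America' (approximate coordinates).
candidates = [
    {"name": "America", "country": "NL", "coords": (51.44, 5.98)},  # village in Limburg
    {"name": "America", "country": "US", "coords": (39.8, -98.6)},  # rough centre of the USA
]
chosen = pick_candidate(candidates)
print(chosen["country"])
```

The sketch makes the bias visible: a biography about trade with the New World would still be resolved to the Dutch village, which is one reason to record such heuristics alongside the results.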

Page 15: BiographyNet: Linking the world of History

NLP: Preliminary results – Governors

[Bar chart: Presence of information in text vs. meta data (% of 71 individuals). Axis: 0 to 100. Series: meta data, text. Categories: marriage, multiple marriage, partners, Children (number), Children (names), Age (start function), Place of Birth, Place of Death, Studies, Previous career, Reason job end, Last job, Family connections, Religion.]

Page 16: BiographyNet: Linking the world of History

Towards the demonstrator

Before development of the actual demonstrator can commence, we first need to:
• Convert the data of the Biography Portal to RDF
  • Prevent loss of information
• Devise a schema
  • Structure the data
  • Provide compatibility with other interesting sources
  • Facilitate the recording of provenance information on the manipulation of the data

Page 17: BiographyNet: Linking the world of History

Requirements from the perspective of the Historian

Two main requirements for the demonstrator:
• A trace back to all original sources (texts and meta data) involved in producing a certain result
  • Which sources were used for the overall outcome and how often?
  • What potentially relevant data was excluded from the end result?
  • Which piece of data led to a specific result (e.g. the age of a specific governor at his appointment)?
• Insight into the processes manipulating and selecting the data
  • Indication of overall performance: Focus on recall or precision?
  • A global description of the heuristics used should be provided
  • Indication of responsibility: Who to contact when results are called into question?

Page 18: BiographyNet: Linking the world of History

Requirements from the perspective of the Computer Scientist / Computational Linguist

Reproducing results:
• Reproducing results in NLP is non-trivial
• Details in implementations or experimental setup can influence results up to a point where they tell a different story
• Clear registration of all steps involved and storage of intermediate system output can improve reproducibility
• Systematic testing can help to gain insight into the variation of the outcome of our systems and hence lead to more insight into their performance

Antske Fokkens, Marieke van Erp, Marten Postma, Ted Pedersen, Piek Vossen and Nuno Freire (2013) Offspring from Reproduction Problems: What Replication Failure Teaches Us. In: Proceedings of ACL 2013, Sofia, Bulgaria, August 2013.

Page 19: BiographyNet: Linking the world of History

Requirements from the perspective of the Computer Scientist / Computational Linguist

Translation into requirements for the demonstrator:
• Facilitate Replication and Reproduction
  • Recording of information on the tools used, such as creator, version number, etc.
  • Recording of any kind of pre- / post-processing done on input/output data
  • Recording of the intention behind the various steps in the NLP pipeline, including assumptions made and possible biases
  • Intermediate results need to be preserved for debugging purposes
• The schema needs to be both generic and flexible
  • NLP pipeline design can change
  • Future tools and their formats are not yet known

Page 20: BiographyNet: Linking the world of History

The BiographyNet Schema

Foundations of the schema:
• Based on the structure of the original XML files
• Needs to facilitate the coupling of different biographies of the same person, without compromising the original data
• Needs to facilitate the incorporation of several enrichments, following from NLP, as well as aggregations
• Compatible with existing schemas such as the Europeana Data Model, PROV, P-PLAN, DC terms, etc.

Page 21: BiographyNet: Linking the world of History

The conversion process

Purely syntactic conversion
• Preserve the original structure of the data
• Prevent loss of information
• Allow for reinterpretation of the original data in the future

Data Preservation

Very simplified BP XML Example:

<XML>
  <BioDes>
    <!-- Source Meta Data -->
    <FileDes>
      <Author></Author>
    </FileDes>
    <!-- Person Meta Data -->
    <PersonDes>
      <Name></Name>
    </PersonDes>
    <!-- Biographical Text -->
    <BioPart>
      <Snippet></Snippet>
    </BioPart>
  </BioDes>
</XML>

Page 22: BiographyNet: Linking the world of History

The conversion process

Conversion steps:
• Retrieval of XML dump of the Biography Portal
• Initial conversion to ‘crude’ RDF
  • Using ClioPatria and the XMLRDF tool for ClioPatria
• RDF restructuring
  • Correction of purely syntactic inefficiencies in the data
• TODO: Linking to other sources
  • Essential step in the ‘Linked Data’ philosophy
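To make the idea of a purely syntactic, ‘crude’ conversion concrete, here is a minimal sketch in plain Python. It does not use ClioPatria or its XMLRDF tool and is not the project's conversion; it only shows how XML structure can be mirrored one-to-one as triples without interpreting the content. The namespace `BASE`, the generated node names and the sample values are assumptions.

```python
import xml.etree.ElementTree as ET
from itertools import count

BASE = "http://example.org/bp/"  # hypothetical namespace, for illustration only


def xml_to_triples(xml_text):
    """Mirror the XML tree as triples: one node per element, leaf text as a literal."""
    ids = count(1)
    triples = []

    def visit(elem):
        node = f"{BASE}node{next(ids)}"
        triples.append((node, "rdf:type", BASE + elem.tag))
        for child in elem:
            if len(child) == 0 and not child.attrib:
                # Leaf element: keep its text unchanged as a literal value.
                triples.append((node, BASE + child.tag, (child.text or "").strip()))
            else:
                # Nested element: link the parent node to the child's node.
                triples.append((node, BASE + child.tag, visit(child)))
        return node

    visit(ET.fromstring(xml_text))
    return triples


# Sample document following the simplified BP XML structure; the values are made up.
sample = """<BioDes>
  <FileDes><Author>Example Author</Author></FileDes>
  <PersonDes><Name>Thorbecke</Name></PersonDes>
  <BioPart><Snippet>Johan Rudolph Thorbecke werd in 1798 geboren</Snippet></BioPart>
</BioDes>"""

triples = xml_to_triples(sample)
for triple in triples:
    print(triple)
```

Because nothing is interpreted or dropped at this stage, the original structure can be reinterpreted later; restructuring and linking happen in separate steps.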

Page 23: BiographyNet: Linking the world of History

Adding Provenance Information

Provenance information is information on how Entities come into existence
• What are entities?
  • Documents, Articles, Pictures, etc.
  • Basically anything that can be ‘produced’ by something or someone
• What kind of information?
  • Who did what?
  • Using which entities?
  • In which processes?
• Why use the PROV-DM, i.e. PROV-O?
  • PROV-DM is now an official W3C recommendation
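The three questions on this slide map onto a handful of PROV-O terms. The sketch below is a minimal illustration using plain tuples rather than an RDF library; the `prov:` terms (`prov:Entity`, `prov:Activity`, `prov:Agent`, `prov:used`, `prov:wasGeneratedBy`, `prov:wasAssociatedWith`) come from PROV-O, while every `ex:` identifier is hypothetical.

```python
# Hypothetical example graph: one NER run that turns a biography text into a birth event.
triples = [
    ("ex:bio_text", "rdf:type", "prov:Entity"),
    ("ex:birth_event", "rdf:type", "prov:Entity"),
    ("ex:ner_run", "rdf:type", "prov:Activity"),
    ("ex:ner_tool", "rdf:type", "prov:Agent"),
    ("ex:ner_run", "prov:used", "ex:bio_text"),               # using which entities?
    ("ex:birth_event", "prov:wasGeneratedBy", "ex:ner_run"),  # in which process?
    ("ex:ner_run", "prov:wasAssociatedWith", "ex:ner_tool"),  # who did it?
]


def objects(subject, predicate):
    """Return all objects of the triples matching the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def describe(entity):
    """For an entity, list the generating process, the entities it used and the agents."""
    return [
        {
            "process": activity,
            "used": objects(activity, "prov:used"),
            "who": objects(activity, "prov:wasAssociatedWith"),
        }
        for activity in objects(entity, "prov:wasGeneratedBy")
    ]


print(describe("ex:birth_event"))
```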

Page 24: BiographyNet: Linking the world of History

Provenance in BiographyNet

Based on the requirements for the demonstrator, provenance needs to be modeled:
• From several perspectives:
  • Information involved: Sources, but also NER input data, etc.
  • Processes involved: All steps in enrichment, aggregation, etc.
  • People involved: Who was responsible for pipeline, tool, etc.
• At multiple levels:
  • An aggregated level, i.e. per enrichment: Targeted at the Historian
  • A detailed level, i.e. all individual processes: Targeted at the Computer Scientist and computational linguist
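One way to picture the two levels: the detailed level stores every individual process, and the aggregated level is derived from it by walking back from a result to its original sources and responsible parties. The following is a sketch under assumptions, not the BiographyNet schema; all step, entity and agent names are hypothetical.

```python
# Detailed level: every individual process with its inputs, output and responsible agent.
steps = [
    {"activity": "tokenize", "inputs": ["bio_text_1"], "output": "tokens", "agent": "tokenizer v1"},
    {"activity": "ner", "inputs": ["tokens"], "output": "entities", "agent": "ner tool v2"},
    {
        "activity": "aggregate",
        "inputs": ["entities", "meta_data_1"],
        "output": "birth_event",
        "agent": "pipeline owner",
    },
]


def aggregated_view(result):
    """Aggregated level: trace a result back to its original sources and contacts."""
    produced_by = {step["output"]: step for step in steps}
    sources, contacts, stack = set(), set(), [result]
    while stack:
        item = stack.pop()
        step = produced_by.get(item)
        if step is None:
            # Nothing in the pipeline produced this item, so it is an original source.
            sources.add(item)
        else:
            contacts.add(step["agent"])
            stack.extend(step["inputs"])
    return {"sources": sorted(sources), "contacts": sorted(contacts)}


summary = aggregated_view("birth_event")
print(summary)
```

The historian would see only `summary`; the full `steps` list remains available for debugging the pipeline.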

Page 25: BiographyNet: Linking the world of History

Recap: Why is provenance info important for BiographyNet?

Needed to ensure credibility of the demonstrator, to evaluate its performance and to improve the academic status of the tool
• One needs to be able to validate results
  • Replication: Retrieving the same results later using the demonstrator
  • Reproducibility: Manually by the historian
• The aggregated level – Targeted at the historian
  • Which original sources were involved?
  • Who to contact in case results are called into question?
• The detailed level – Targeted at the computer scientist
  • Detailed information on each individual step
  • Allows for debugging the internal processing pipeline

Page 26: BiographyNet: Linking the world of History

BiographyNet: Schema illustration

http://www.biographynet.nl/schema


Page 27: BiographyNet: Linking the world of History

BiographyNet: Enrichment example

[Schema diagram. Labels: Thorbecke; Biographical Description (source: NNBW) with File Meta Data, Person Meta Data (“Thorbecke”; Event: Birth, 1798) and Biography Parts; Biographical Description (Enrichment, NLP Pipeline, prov:plan) with Person Meta Data (Event: Birth, Zwolle, 1798-01-14).]

Biography text: “Johan Rudolph Thorbecke werd in 1798 geboren op 14 januari in Zwolle en komt uit een half-Duitse…”
(English: “Johan Rudolph Thorbecke was born in 1798 on 14 January in Zwolle and comes from a half-German…”)
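As a toy illustration of how the birth date and place in this example could be pulled from the text, the sketch below uses a single regular expression. It is deliberately naive and is not the project's NLP pipeline: the pattern only covers this one sentence shape, and the function and variable names are assumptions.

```python
import re

DUTCH_MONTHS = {
    "januari": 1, "februari": 2, "maart": 3, "april": 4, "mei": 5, "juni": 6,
    "juli": 7, "augustus": 8, "september": 9, "oktober": 10, "november": 11, "december": 12,
}

# Matches only the shape "werd in YEAR geboren op DAY MONTH in PLACE".
BIRTH = re.compile(
    r"werd in (?P<year>\d{4}) geboren op (?P<day>\d{1,2}) (?P<month>[a-z]+) in (?P<place>[A-Z][\w-]+)"
)


def extract_birth(text):
    """Return the birth date (ISO format) and place, or None if the pattern does not match."""
    m = BIRTH.search(text)
    if m is None or m["month"] not in DUTCH_MONTHS:
        return None
    date = f"{int(m['year']):04d}-{DUTCH_MONTHS[m['month']]:02d}-{int(m['day']):02d}"
    return {"date": date, "place": m["place"]}


text = (
    "Johan Rudolph Thorbecke werd in 1798 geboren op 14 januari in Zwolle "
    "en komt uit een half-Duitse…"
)
birth = extract_birth(text)
print(birth)
```

Spelling variation, OCR errors and freer sentence structure would quickly defeat such a pattern, which is why the project relies on machine learning and linguistic processing instead.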

Page 28: BiographyNet: Linking the world of History

More than just Provenance:

Provenance and Plans (P-PLAN):* Represent the plans that guided the execution of scientific processes
• ‘Plans’ describe the original idea behind an activity
  • Each ‘Plan’ can consist of one or more ‘Steps’
  • Each ‘Step’ corresponds to an ‘Activity’
• ‘Variables’ describe the input/output of an activity
  • Structure, format, quantity, etc.
  • Each ‘Variable’ corresponds with an input/output ‘Entity’ of an ‘Activity’
• ‘Plans’ have their own provenance info
  • E.g. who was responsible for the creation of a plan?

*Daniel Garijo, Yolanda Gil; http://www.opmw.org/model/p-plan

Page 29: BiographyNet: Linking the world of History

Why model plans besides provenance?

P-PLAN is used to model not only what actually happened, but also what was supposed to happen
• Forces the recording of what an activity and its input/output should look like
  • Provides abstract description of original idea behind activity
  • As such, can provide info on heuristics and assumptions
• Allows for comparing the actual activity and its input/output with the original plan and its variables
  • Do they differ from each other and to what extent?
  • Makes finding errors much easier, as more information is available about what the input/output should look like
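The comparison between plan and execution can be sketched as below. This is a simplified stand-in and does not use the P-PLAN vocabulary itself: the `Variable` and `Entity` classes, their fields and the example formats are assumptions chosen to mirror the ‘Variable’ and ‘Entity’ notions on the slides.

```python
from dataclasses import dataclass


@dataclass
class Variable:
    """Planned input or output of a step: what it should look like."""
    name: str
    expected_format: str


@dataclass
class Entity:
    """Actual input or output of an executed activity."""
    corresponds_to: str  # name of the Variable this entity realises
    actual_format: str


def deviations(variables, entities):
    """List where the executed activity differs from what the plan prescribed."""
    actual = {entity.corresponds_to: entity for entity in entities}
    problems = []
    for var in variables:
        entity = actual.get(var.name)
        if entity is None:
            problems.append(f"{var.name}: missing")
        elif entity.actual_format != var.expected_format:
            problems.append(
                f"{var.name}: expected {var.expected_format}, got {entity.actual_format}"
            )
    return problems


# Hypothetical plan for an NER step and one actual run of it.
plan_vars = [Variable("tokens_in", "tokenized text"), Variable("entities_out", "entity list")]
run_entities = [Entity("tokens_in", "raw text")]
report = deviations(plan_vars, run_entities)
print(report)
```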

Page 30: BiographyNet: Linking the world of History

BiographyNet: Schema illustration


Page 31: BiographyNet: Linking the world of History

[Schema diagram. Node labels: Activity, Plan, Entity (input and output), Variable (input and output), Agent, Association, Person, NLP Tool.]

Page 32: BiographyNet: Linking the world of History

Interface: Focus

• The interface should be easy to use
• The demonstrator should inspire historians to undertake new research and give direction, rather than being the ‘closing factor’ in their research
• The interface should allow users to ‘fine tune’ the results returned upon an initial action

Page 33: BiographyNet: Linking the world of History

Interface: Options

• Query composition
• Faceted browsing
• A combination

Page 34: BiographyNet: Linking the world of History

Interface: Query composition

• Drop down boxes to select ‘Verbs’, data elements and relations

Page 35: BiographyNet: Linking the world of History

Interface: Faceted browsing

• No explicit querying, but convergence of the data through browsing and selecting
• Provides better feedback to the user
• Allows for more direct and easier adjustment of the selected data

Page 36: BiographyNet: Linking the world of History

Interface: Faceted browsing

Page 37: BiographyNet: Linking the world of History

Interface: A combination

• Query composition combined with faceted browsing
• Create new facets by defining a query
  – The result of the query is available as a subset of the data by selecting the defined facet
  – As such, combinable with other facets
• Method to integrate ‘open’ querying of the data into a general interface and visualization
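The combination of query composition and faceted browsing can be illustrated with a small sketch: a query defines a facet, the facet is a subset of the data, and selecting several facets intersects those subsets. The records, field names and facet names below are made up, and this is not the demonstrator's design.

```python
# Hypothetical mini data set; all values are invented for illustration.
people = [
    {"id": "p1", "occupation": "governor-general", "birth_place": "Amsterdam", "birth_year": 1650},
    {"id": "p2", "occupation": "governor-general", "birth_place": "Hoorn", "birth_year": 1587},
    {"id": "p3", "occupation": "painter", "birth_place": "Amsterdam", "birth_year": 1620},
]

facets = {}


def define_facet(name, query):
    """Store the ids of all records matching a query as a new, selectable facet."""
    facets[name] = {person["id"] for person in people if query(person)}


def select(*names):
    """Selecting several facets narrows the data to the intersection of their subsets."""
    return set.intersection(*(facets[name] for name in names))


define_facet("governors", lambda p: p["occupation"] == "governor-general")
define_facet("born in Amsterdam", lambda p: p["birth_place"] == "Amsterdam")
define_facet("born before 1600", lambda p: p["birth_year"] < 1600)  # facet from an 'open' query

selection = select("governors", "born in Amsterdam")
print(selection)
```

Because a query-defined facet is just another subset, it combines with the predefined facets in the same way, which is what lets ‘open’ querying fit into a general browsing interface.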

Page 38: BiographyNet: Linking the world of History

Interface: A combination

[Diagram. Labels: Question Analysis, Selection Process, Results, Data, Facets.]

Page 39: BiographyNet: Linking the world of History

Interface: Demonstrator

Time and place are primary elements

[Interface sketch. Labels: Results, ?]

Page 40: BiographyNet: Linking the world of History


Page 41: BiographyNet: Linking the world of History

Current Status

Main components of the demonstrator
• Initial schema available
  • Schema models enrichments and aggregations alongside original sources
  • Allows for storing various levels of provenance information
  • Model will be adapted while progressing with building the demonstrator
• Initial conversion to RDF available
  • Structure according to devised schema
  • Next step is linking to external sources
• Initial NLP system setup available
  • Preliminary results comparable with manual use case
• Interface
  • First ideas and sketches

Page 42: BiographyNet: Linking the world of History

Thank you for your attention

www.biographynet.nl

Feel free to ask questions
