Linking historical ship records to a newspaper archive

Andrea Bravo Balado, Victor de Boer, Guus Schreiber
VU University Amsterdam

Talk at Histoinformatics, 10 November 2014, Barcelona

Context: dutchshipsandsailors.nl/

Dutch Ships and Sailors (DSS) datasets

Results published as Linked Data

Data visualizations

This study

• An increasing number of historical databases are being digitized
• Finding matching occurrences of the same object in different datasets is both relevant (for historical research) and non-trivial
  – “Instance mapping”
• This paper: a case study of linking ship instances in two maritime datasets

Focus on methodology

• This study is not about developing new techniques
• This study is about methodology:
  – What combination of existing techniques gets the “best” result?
  – What the “best” result is depends on context (i.e., the goal of the historical research)
• This is a case study, so be wary of generalization

Data

• Muster rolls (Northern Dutch Maritime Museum)
  – Period: 1803-1937
  – 77,043 records of 34,552 seamen
  – 17,098 mentions of 4,935 ships
• Newspaper archive (Dutch National Library)
  – Period: 1618-1995
  – 7K newspapers, 9M pages (coverage: 10%)
  – Text generated via OCR

Timeline newspapers in the archive

Example muster roll record (in Dutch)

Example newspaper article (in Dutch)

Approach

• Generate candidate set of links
• Apply two types of filters to the candidate set
  – Domain-specific filtering
    • Using domain heuristics about ship identification
  – Text classification of newspaper articles
    • Determine whether the article is about a ship
• Combine filters

Baseline generation

• Find all ship instances in the muster rolls
• Query newspaper archive for first 100 hits with this name
  – API: http://www.delpher.nl/
• Result set is expected to have high recall but low precision
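The baseline step can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the slide only names the Delpher site, so the `search` callback, its signature, the ship identifiers, and the `fake_search` stand-in are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical signature: search(name, limit) -> list of article identifiers.
# The real archive API is not shown in the talk, so it is injected by the caller.
SearchFn = Callable[[str, int], List[str]]


def generate_candidates(
    ships: Dict[str, str],
    search: SearchFn,
    max_hits: int = 100,
) -> List[Tuple[str, str]]:
    """Link every ship instance to at most `max_hits` articles found by its name."""
    candidates = []
    for ship_id, name in ships.items():
        for article_id in search(name, max_hits)[:max_hits]:
            candidates.append((ship_id, article_id))
    return candidates


def fake_search(name: str, limit: int) -> List[str]:
    """Stand-in archive returning made-up hits; replace with a real client."""
    return [f"{name}-article-{i}" for i in range(150)][:limit]


# Invented ship identifiers and names, for illustration only.
ships = {"ship-1": "De Hoop", "ship-2": "Anna"}
links = generate_candidates(ships, fake_search)
print(len(links))
```

Matching on the name alone is what makes this set broad (high recall) and noisy (low precision); the filters described next operate on its output.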

Evaluation

• No gold standard
• Manual assessment of all links is infeasible
• Sampling method for evaluating candidates
  – 50 candidates per technique
  – 3 assessors (domain expert plus two authors)
  – Inter-observer agreement: Cohen’s kappa = 0.65
• Recall: approximation, based on the estimated number of correct links (using the baseline)
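As a sketch of the two measures mentioned above: the first function computes Cohen's kappa for one pair of assessors (the slide does not say how the three assessors were combined into a single value), and the second shows one plausible reading of the recall approximation, namely estimated correct links kept by a filter divided by estimated correct links in the baseline. That reading, the function names, and the numbers in the example calls are assumptions, not figures from the study.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: both raters pick the same label independently.
    expected = sum(count_a[k] * count_b[k] for k in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def approx_recall(
    n_filtered: int, p_filtered: float, n_baseline: int, p_baseline: float
) -> float:
    """Estimated correct links kept / estimated correct links in the baseline."""
    return (n_filtered * p_filtered) / (n_baseline * p_baseline)


# Made-up judgements from two assessors on four sampled links.
rater_1 = [True, True, False, False]
rater_2 = [True, False, False, False]
kappa = cohen_kappa(rater_1, rater_2)

# Made-up sizes and sampled precisions, purely to show the arithmetic.
recall = approx_recall(n_filtered=100, p_filtered=0.9, n_baseline=1000, p_baseline=0.3)
print(round(kappa, 2), round(recall, 2))
```

Because both precisions come from samples of 50, the resulting recall figure inherits their sampling error, which is why the slides call it approximate.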

Domain-specific filtering

• Heuristic 1: co-occurrence of name of ship captain
  – Common practice in historical maritime documentation
• Heuristic 2: date of newspaper article is within ship lifetime (as indicated by muster roll)
  – Average life span of ship is 30 years
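The two heuristics could be expressed as simple predicates over a candidate link, as in the sketch below. The `Candidate` fields, the plain substring match, the optional `margin` parameter, and the sample record are all hypothetical; the talk does not describe how names were matched or how the ship lifetime was bounded.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    captain: str        # captain's surname from the muster roll
    first_year: int     # earliest muster-roll mention of the ship
    last_year: int      # latest muster-roll mention of the ship
    article_text: str   # OCR text of the newspaper article
    article_year: int   # publication year of the article


def mentions_captain(c: Candidate) -> bool:
    """Heuristic 1: the captain's name co-occurs in the article text."""
    return c.captain.casefold() in c.article_text.casefold()


def within_lifetime(c: Candidate, margin: int = 0) -> bool:
    """Heuristic 2: the article year falls inside the ship's recorded lifetime."""
    return c.first_year - margin <= c.article_year <= c.last_year + margin


# Invented record, for illustration only.
example = Candidate(
    captain="Jansen",
    first_year=1850,
    last_year=1870,
    article_text="Aangekomen: schip De Hoop, kapt. JANSEN, van Riga.",
    article_year=1862,
)
print(mentions_captain(example), within_lifetime(example))
```

Exact substring matching will miss names damaged by OCR, so a real pipeline would probably need some tolerance for spelling variation.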

Text classification

• Task: decide whether a newspaper article is about a ship
• Two techniques used
  – Naive Bayes and Support Vector Machine (SVM) with Sequential Minimal Optimisation (SMO)
  – WEKA implementation
  – Training set: 200 samples (121 positive, 79 negative)
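To make the classification task concrete, here is a small pure-Python multinomial Naive Bayes with add-one smoothing. It is a stand-in for illustration only: the study used the WEKA implementations, and the whitespace tokenizer, the function names, and the four English training sentences are invented for this sketch (the real training set had 200 Dutch samples).

```python
import math
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lower-case and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def train(samples: List[Tuple[str, bool]]):
    """Count documents and words per class; expects both classes to be present."""
    class_counts = Counter()
    word_counts = {True: Counter(), False: Counter()}
    for text, label in samples:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return class_counts, word_counts, vocab


def is_about_ship(text: str, model) -> bool:
    """Return True if the 'about a ship' class has the higher log score."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label in (True, False):
        # Log prior plus log likelihood of each known word, add-one smoothed.
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:
                score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return scores[True] > scores[False]


# Invented toy training data.
samples = [
    ("the ship sailed from the harbour with captain jansen", True),
    ("schooner arrived in port with cargo", True),
    ("the council voted on the new tax", False),
    ("market prices for grain rose today", False),
]
model = train(samples)
print(is_about_ship("captain of the schooner arrived", model))
```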

Configuration

• Filter 1a: captain name
• Filter 1b: time restriction
• Filter 2: combine filters 1a + 1b
• Filter 2 + text classification
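One way to organise these four configurations is as named combinations of boolean filters applied to the same candidate set. The sketch below assumes each candidate already carries the outcome of every individual check; the dictionary keys, the `all_of` helper, and the toy candidates are hypothetical.

```python
from typing import Callable, Dict, List

Candidate = Dict[str, bool]
Filter = Callable[[Candidate], bool]


def all_of(*filters: Filter) -> Filter:
    """Build a filter that keeps a candidate only if every given filter does."""
    return lambda candidate: all(f(candidate) for f in filters)


def run_configurations(
    candidates: List[Candidate], configs: Dict[str, Filter]
) -> Dict[str, List[Candidate]]:
    """Apply each named configuration to the full candidate set."""
    return {
        name: [c for c in candidates if keep(c)]
        for name, keep in configs.items()
    }


def captain(c: Candidate) -> bool:
    return c["captain"]


def lifetime(c: Candidate) -> bool:
    return c["in_lifetime"]


def about_ship(c: Candidate) -> bool:
    return c["about_ship"]


configs = {
    "1a": captain,
    "1b": lifetime,
    "2": all_of(captain, lifetime),
    "2+text": all_of(captain, lifetime, about_ship),
}

# Invented candidates with precomputed check results.
candidates = [
    {"captain": True, "in_lifetime": True, "about_ship": True},
    {"captain": True, "in_lifetime": False, "about_ship": True},
    {"captain": False, "in_lifetime": True, "about_ship": False},
    {"captain": True, "in_lifetime": True, "about_ship": False},
]
kept = run_configurations(candidates, configs)
print({name: len(links) for name, links in kept.items()})
```

Each added filter can only shrink the kept set, so a stricter configuration can raise precision but never recall.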

Results

Analysis

• Captain’s name turns out to be a strong heuristic
• Time restriction much less useful
• When combined, precision becomes very high, at the cost of (approximate) recall
• Text classification has high precision (no false positives)
• Text classification combined with heuristic filtering has negative effect

Discussion

• Interestingly, the historian preferred very high precision at the cost of recall

• Consequently, 16K links published as Linked Data (precision 0.96; approximate recall 0.13)

• Links are to departure/arrival listings, but also to shipwrecks and sales

• In case of good heuristics the contribution of generic techniques is at best minimal

• Absence of gold standard is realistic

Limitations

• Evaluation
  – 50 samples
  – Choice of assessors
  – Approximation of recall
• Data
  – OCR quality of newspaper articles
  – Digitized newspaper archive covers only 10%

Acknowledgements

• Jurjen Leinenga, domain expert
• CLARIN-NL http://www.clarin.nl
• BiographyNet, Netherlands eScience Center http://esciencecenter.nl

• Online appendix with details of results at http://dx.doi.org/10.6084/m9.figshare.1189228

QUESTION TIME