Links ICOS 2014 Glasgow Utrecht Leiden Large scale harvesting of variants of proper names Gerrit Bloothooft, UiL-OTS, Utrecht University Marijn Schraagen, LIACS, Leiden University The Netherlands [email protected] [email protected]


Links
ICOS 2014 Glasgow

Utrecht Leiden

Large scale harvesting of variants of proper names

Gerrit Bloothooft, UiL-OTS, Utrecht University
Marijn Schraagen, LIACS, Leiden University

The Netherlands

[email protected]@liacs.leidenuniv.nl


name variants

• different versions of a name, that can denote the same object

• requires a proof that the same object is involved (in at least one example)
– not always easy
– rarely explicitly provided


proper names in historical sources

Lots of variation

– spelling variation: Dirk - Dirck
– suffix variation: Willem - Willempje
– abbreviation: Willem - Wim
– translation: Willem - Guillaume, Willem - Wilhelmus
– typos (digitization): Willem - Aillem
– …


variation!

GuljelmusWllhelmusWlhelmusWIllem(Willem)WiellemWlllemGujlelniusWllemWiIllemWijllemWihelmusWillemjWikllemWwillemWilllemGuilleamWilleamWillemWil.lemWilemGuileamWillelminiwillemWiilemGuillemWeillemGuilelmisWil;helmusWilhlemWelhelmus

WiillemWiehelmusWulhelmusWillem)WilehelmusWoillemWihhelmusWeijlemWillelmusWi;;emWilehlmusWuhelmGuilelmusWilhlelmusWillem(se)WilalemWullemWillem.W#ilhelmusGuillelmusWliiemWlihelmusWilelmusWillemmWileemWìllemWillememWolhelmusWechelmusGuilllelmusWilemm

W.ilhelmusWillem]Willemh\WillemWïllemw8illemWilhellmusWilhelm.WilmhelmusWilhelmunsWilhelmuaWilhelmoswilhelmnusWilhelmnusWilhelmuesGuilleaummeWilhelmumGuilhelmusWillemlWilhelmanusWilhelmjusWilhelmesGuilliaummeWilhelmasWillemnWilhelmusWilhelmnsWillhelmusGuiliaumeWilllenGuiilleaume

GuilliaumeWillenisGuiliermoWilempjenWillempjenWillepjenGuilliermoWittemWillen!WilhlennWijlenWielenWillenwilhemWillempkeGuilleaumeWilhellemusWilhekmusGuiileaumeWilleaumeWilhelmuusGuylleaumeGuileaummeGuileaumeWilhelemusGuilleaumaWillewmGuillesmusGuïllermoGuilermoGuiilermo

GuillermoGuillerlmusGuijlleaumeWilheminusWilhelhmusGuillaumguillaumGueillaumWilhemusGuilhemusWielhemusWilhehmusWilhelminusWilhelmienusWilherlmusWilhermusWeilhimWilhiemWilheimWilheinWillaumGuillaimwilhemusWilhelnmusWoalterWillhemGuillhemWilheemWilhemWölhelmWilhelimus

WilhelusWillaimWillemermanWiechemWiloemWilhelmiusWilhelmijsGuilhelmisWilhelmjsWilhelmisWillemhelmusWilloemWilhelnusWeilhelmusWwilhelmusWylhelmuswWilhelmus(Wilhelmus)(WilhelmusWilhelmüsWilhelmus\GuiljameWilhelmus?WilhelmusHubertusWilheelmusWilhelmmusWielhelmusWilhhelmusWiilhelmusWEilhelmuswilhelmus

Wilhelmus)WilhelmussWilhelmusStephanusWIlhelmusWillkemWilkhelmusWilhelmiemWilhelmigsWillmeWilmeWilhelmusHenricusWilhelmusTheodorusWilhelmushenricusWilhelmusnWilhelmusznEilhelmusilhelmusIlhelmusWillemcusWilhelmusJohannesWilhelmushubertusWilhelmuwWwilhwlmusGuilliaamGuiliamGuiliaamGuillieamGuilliamWiliamWilnelmusWillwm

GuillieaumeWilhwlmusWilhwelmusWillumGuillumWilliamWilhlemusWilielmusWillielmusGüilielmusGuililmusGuileilmusGuïllielmusGuilielmusGuillijaamWillemusWiiliamGuilemusGuillemusWillemmusWilehmusWilemusWillliamWieliamGuillielmusWilhlmusGuillmusWiliaamWilhmusGuiilmusWilmus

GuilmusAillemJohannesWilhelmusJohanneswilhelmusCornelisWilhelmusGulliëlmusGuliëlmusGijlliaumeGüliëlmusGuli?lmusGuijelmusGulielmusGuiëlmusGiliaumeGilliaumeGilliaummeGuihelmusGuikelmusGullielmusGuielmusJannwillemJanwillemJanWillemJanWilhelmusMartinusWilhelmusQwillem


challenge

name variation is difficult to model, therefore:

• learn variation in person names from use of names in real life (let data speak for itself)

• automatically from big data


required

• big data
– with many references to individuals

• true person resolution
– proof that the same individual is concerned
– even with data that contain name variants


big data

• Dutch vital registration (who-was-who 2011), 1811 to early 20th century
– 4.1 million birth certificates (~30%)
– 3.1 million marriage certificates (~90%)
– 7.6 million death certificates (~65%)

55 million name references to persons


source names

1,052,000 different full first names (composite): Jan, Johanna Maria Cornelia
111,900 different female first names (singular): Maria
82,700 different male first names (singular): Jan
681,000 different surnames (prefixes included): Bakker, de Vries
600,000 different surnames (prefixes excluded): Vries


information per person

• first name person (child, bride or groom, deceased)
• first name father
• surname father
• first name mother
• surname mother (always maiden name in The Netherlands)
• age person


person resolution

• assumption: the available information identifies a person uniquely (if there is exact matching)

• relaxed assumption: one of the four parental names (first name or surname of the father or mother) is not needed for true person resolution


example

Johanna Endt

• marries in 1858 as the 29-year-old daughter of Gerrit Endt and Dorothea Kerbert

• dies in 1882 as the 54-year-old daughter of Gerrit Endt and Doortje Kerbert

~1829, Johanna, Gerrit, Endt, Kerbert, Dorothea
~1828, Johanna, Gerrit, Endt, Kerbert, Doortje
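
The approximate birth years in the two records follow from subtracting the stated age from the year of the event. A minimal sketch of that arithmetic, assuming a hypothetical helper `approx_birth_year` and a one-year tolerance when comparing the two estimates (the slides do not state which tolerance was used):

```python
def approx_birth_year(event_year: int, age: int) -> int:
    """Estimate the birth year from the year of an event and the stated age."""
    return event_year - age


marriage = approx_birth_year(1858, 29)  # from the marriage certificate
death = approx_birth_year(1882, 54)     # from the death certificate

# Assumed tolerance of one year; not taken from the slides.
same_person_plausible = abs(marriage - death) <= 1
print(marriage, death, same_person_plausible)
```

The two estimates differ by one year because an age in whole years does not pin down the birth year exactly.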


test of assumption (of true person resolution)

• consider all matches between birth and death certificates with exact matching of all information

• leave out one name per match

• count number of multiple matches

result: only 85 out of 1,107,162 matches are not unique
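
The test can be sketched roughly as follows. This is an illustration under assumptions, not the authors' implementation: the record layout, the function names, and the choice to leave out only parental names are all hypothetical.

```python
from collections import Counter

# Hypothetical record layout (not from the slides):
# (first name, birth year, father's first name, father's surname,
#  mother's first name, mother's surname)
PARENT_FIELDS = (2, 3, 4, 5)


def reduced_key(record, dropped):
    """Return the record with one parental name left out, tagged by position."""
    return tuple(v for i, v in enumerate(record) if i != dropped) + (dropped,)


def count_non_unique(births, deaths):
    """Count exact birth-death matches that become ambiguous when any one
    parental name is ignored."""
    death_set = set(deaths)
    # Number of distinct death records sharing each reduced key.
    reduced_counts = Counter(
        reduced_key(d, f) for d in death_set for f in PARENT_FIELDS
    )
    non_unique = 0
    for b in births:
        if b not in death_set:
            continue  # only exact matches are tested
        if any(reduced_counts[reduced_key(b, f)] > 1 for f in PARENT_FIELDS):
            non_unique += 1
    return non_unique


# Made-up records for illustration only.
births = [
    ("Jan", 1830, "Piet", "Bakker", "Maria", "Vries"),
    ("Anna", 1840, "Kees", "Smit", "Trijn", "Mulder"),
]
deaths = births + [("Jan", 1830, "Piet", "Bakker", "Marie", "Vries")]
print(count_non_unique(births, deaths))
```

For scale, 85 non-unique matches out of 1,107,162 is roughly 0.008% of all matches.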


harvesting name variant pairs (procedure)

• identify all record pairs of individuals (over birth, marriage and death certificates) that exactly share
– first name of the individual
– approximate year of birth
– three out of four names of parents (first names and surnames)

• collect pairs of the remaining name, if different
– Christiena - Christina
– Bloothooft - Bloothoofd
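
The procedure can be sketched as a grouping step: for each of the four parental name fields in turn, group records on everything else and pair up the differing values of the field left out. This is a minimal sketch under assumptions. The record layout and `harvest_pairs` are hypothetical, and the approximate birth year is compared exactly here, without any tolerance.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical record layout (not from the slides):
# (first name, approx. birth year, father's first name, father's surname,
#  mother's first name, mother's surname)
PARENT_FIELDS = (2, 3, 4, 5)


def harvest_pairs(records):
    """Return {(name_a, name_b): token count} for records that agree on
    everything except one of the four parental names."""
    pair_tokens = defaultdict(int)
    for dropped in PARENT_FIELDS:
        buckets = defaultdict(list)
        for rec in records:
            key = tuple(v for i, v in enumerate(rec) if i != dropped)
            buckets[key].append(rec[dropped])
        for names in buckets.values():
            for a, b in combinations(names, 2):
                if a != b:
                    pair_tokens[tuple(sorted((a, b)))] += 1
    return dict(pair_tokens)


# Made-up records for illustration only.
records = [
    ("Neeltje", 1830, "Jan", "Smit", "Christiena", "Bakker"),
    ("Neeltje", 1830, "Jan", "Smit", "Christina", "Bakker"),
    ("Piet", 1850, "Kees", "Bloothooft", "Anna", "Mulder"),
    ("Piet", 1850, "Kees", "Bloothoofd", "Anna", "Mulder"),
]
print(harvest_pairs(records))
```

Grouping on a key avoids comparing every record with every other record, which matters at the scale of tens of millions of name references.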


harvesting name variant pairs (results)

female first names: 48,600 pairs, 246,500 tokens
male first names: 31,900 pairs, 183,000 tokens
surnames: 177,000 pairs, 374,900 tokens

average:
first names: 5 to 6 tokens per variant pair
surnames: 2 tokens per variant pair
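
The averages can be checked directly from the counts above. A small sketch (the dictionary and function names are illustrative only):

```python
# (pairs, tokens) as reported on the slide
counts = {
    "female first names": (48_600, 246_500),
    "male first names": (31_900, 183_000),
    "surnames": (177_000, 374_900),
}


def tokens_per_pair(table):
    """Average number of tokens (occurrences) per distinct variant pair."""
    return {kind: tokens / pairs for kind, (pairs, tokens) in table.items()}


ratios = tokens_per_pair(counts)
for kind, ratio in ratios.items():
    print(f"{kind}: {ratio:.1f} tokens per pair")
```

This gives about 5.1 and 5.7 for female and male first names and about 2.1 for surnames, in line with the rounded averages on the slide.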


so far so good, but

• the original certificates are not error-free

– variants found can be due to errors in the source, errors made during transcription, or typos

• theoretical issue: what is a name variant, and what is an error?


example

in the source documents:

Pieter born as son of Jacob Houtlosser and Aafje Spruit, died as son of Jacob Houtlosser and Grietje Spruit

variant Aafje – Grietje ?


variants and errors

distinction is difficult to make

• variants share the same lemma and errors do not

• deciding this requires onomastic expertise (which we would like to avoid: let the data speak for itself)


variants and errors

• Variants
– Willem - Wilhelm
– Willem - Guillaume
– Willem - W8llem (no indication of different lemma)

• Errors
– Grietje - Aafje
– Fijtje - Sijtje (understandable reading error but different lemma)


methods for cleaning

• using name dictionaries with lemmas
– to accept name pairs

• using known non-variants
– to reject name pairs

• rules
– to accept name pairs

all with manual intervention (< 2%)
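
The three methods could be combined along the following lines. This is only a sketch under assumptions: the order of the checks, the function `classify_pair`, the lemma labels, and the example rule are hypothetical and not taken from the slides.

```python
def classify_pair(a, b, lemma_of, known_non_variants, rules):
    """Return 'accept', 'reject', or 'manual' for a harvested name pair."""
    # 1. Dictionary: accept when the two names share at least one lemma.
    if lemma_of.get(a, set()) & lemma_of.get(b, set()):
        return "accept"
    # 2. Known non-variants: reject.
    if frozenset((a, b)) in known_non_variants:
        return "reject"
    # 3. Rules: accept when any rule recognises the pair.
    if any(rule(a, b) for rule in rules):
        return "accept"
    # Anything left is passed on for manual inspection.
    return "manual"


# Illustrative inputs; the lemma label is a placeholder, not dictionary data.
lemma_of = {"Willem": {"WILLEM"}, "Guillaume": {"WILLEM"}}
known_non_variants = {frozenset(("Grietje", "Aafje"))}


def ck_rule(a, b):
    """Hypothetical spelling rule: treat 'ck' and 'k' as equivalent."""
    return a.lower().replace("ck", "k") == b.lower().replace("ck", "k")


for pair in [("Willem", "Guillaume"), ("Aafje", "Grietje"),
             ("Dirk", "Dirck"), ("Fijtje", "Sijtje")]:
    print(pair, classify_pair(*pair, lemma_of, known_non_variants, [ck_rule]))
```

A name maps to a set of lemmas here because, as the dictionary slide notes, some names have multiple lemmas.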


cleaning | name dictionaries

• dictionary of Dutch first names (20,000), but
– lemmas too detailed
– names with multiple lemmas
– only 8% of all first name pairs share lemma in dictionary (43% of tokens)


results, in variant pairs

• female first name pairs: 34,800 accepted, 13,900 errors (29%)

• male first name pairs: 22,500 accepted, 9,400 errors (29%)

• surname pairs: 120,100 accepted, 57,100 errors (32%)


very many variant pairs (Willemina)

WILMINA - WILMIJNA WILLEMJE - WILLEMPJE WELLEMTJE - WILLEMTJE WILMTJE - WILLEMPJE WILLEMTJE - WILEMTJE WILHELMINA - WILLEMPJE WILLEPMJE - WILLEMTJE WILLEMPIE - WILLEMPJE WELLEMTJE - WELLIMTJE WELLEMTJE - WOLLEMTJE WILLEMIJNTJE - WILLEMPJE WILLEMIJNTJE - WLLEMIJNTJE WLLEMIJNTJE - WILLEMPJE WILLEMIJN - WILLEMIJNA WILHELMINA - WILLEMINA WILLEMTIEN - WILMTIEN WILLEMTIEN - WILLEMTJE WILEHELMINA - WILHELMINE WILLEMKE - WILLEMKEN WILLEMKEN - WILLEKEN WILLEMINA - WILLEMINE WILLEMINA - WILLIMINA WILLEMIENA - WILLEMINA WILLEMINA - WILLEMPJE WIHELMINA - WILHELMINA WILLEMKE - WILLENKE WILLEMIJNTJE - WILEMIJNTJE WILHEMINA - WILLEMINA WILLEMKEN - WILMKEN WILLEMPJE - WILLEMTJE WILLEMIJNTE - WILLEMIJNTJE WILLEMIJNTJE - WILLEMYNTJE WILLEMPTJE - WILLEMTJE WILLEMIJNTJE - WILLEMTJE WILLEMIJNTJE - WILLEMYNA WILLEMYNA - WILLEMIJNA WILLEMPJE - WILSJE WILEMPJE - WILLEMPJE WILLEMIJNTJE - WILLEMEINTJE WILLEMIINTJE - WILLEMIJNTJE WILLEMINA - WILLEMINTJE WILLEMINA - WILELMINA WILHELMINA - WILHELMINE WILLEMIJN - WILLEMPJE WILLEMIJN - WILLEMTJE WILLEMINA - WILLEMIJN WILLEMIJNTJE - WILLEMINTJE WILLEMIJNTJE - WILLEMEIJNTJE WILLEMIJN - WILLEMIJNTJE

WILHELMINA - WILLEMIJNA WILHELMIMA - WILHELMINA WILHELMINA - WILHLEMINA WILHELMIJNA - WILHELMINA WILLEMKE - WILLEMPJE WILLEPMJE - WILLEMKE WILLEPMJE - WILLEMPJE WILLEMIJNTJE - WILLEMINA WILHELMA - WILLEMIJNA WILLEMINA - WILLLEMINA WILLEINTJE - WILLEMPJE WILHELMIJNA - WILLEMIJNA WILHELMINA - WILHELMUS WILLEMINA - WILHELMUS WILHELMIA - WILHELMINA WILLEMTIEN - WILTIEN WILLEKE - WILLEMKE WILHELMINA - WILHLMINA WILHELMINA - WILHEMINA WILLEMPTJE - WILLEMTJEN WILLEMIEN - WILLEMTIEN WILLEM - WILLEMPJE WILLEMINA - WILLEMIJNE WILTIEN - WILMTIEN WILMKE - WILLEMKEN WELHELMINA - WILHELMINA GUILLIELMINE - GUILLELMINE WILLEMTIEN - WILLEMPIEN WILHELMIENA - WILHELMINA WILMINA - WILMIENA WILLEMKE - WILLEMTIEN WELLEMTJE - WELMTJE WILLEMIN - WILHELMINA WILMTJE - WILLEMTJE WILLEMINA - WILMINA WILLELMIN - WILHELMINA GUILLIELMINE - WILHELMINA WILLEMINA - WILLEMKE WILEMIJNA - WILLEMIJNA WILLEMTIJN - WILLEMTJE WILLEMINA - WILLEMMINA WILLEMIJNE - WILLEMIJNA WILLEMS - WILLEMINA WILLEMINE - WILLELMINA WILLEMKE - WILMKE WILLEMIJNTJE - WILLEMIENTJE WILLEMINA - WILLEMIMA WILLEMA - WILLEMINA WILLEMINA - WILLEMEIJNTJE

WILHELINA - WILHELMINA WILLEMKEN - WILLENKE WILLEMINA - WILLEMTJE WILLEMIJNTJE - WILLIMPJE WILHELMINA - WILLEMIJNTJE WULLEMPJE - WILLEMPJE WILLEMINA - WELLEMINA WILHELMINE - WILLEMINE WILLEMIJN - WILHELMINA WILLEMIJNE - WILHELMINA WILLEMPTJE - WILMPTJE WILHELM - WILHELMI WILLEMIEN - WILHELMINA WILLEMINA - WILLEMKEN WILHELMA - WILHELMINA WILHELMINE - WILLEMINA WILLEMIN - WILLEMINA GUILLEMINE - WILHELMINE WILLEMIENTJE - WILLEMEINTJE WILLMINA - WILHELMINA WILLEMIJNA - WILEMINA WILLEMINA - WILLMINA GUILLELMINE - WILHELMINE WILLEMIJNTJE - WILMIENA WILLEM - WILLEMS WILHELMINA - WILMINA WILMPJE - WILLEMTJE WILLEMINA - WILLEMIENTJE WILLEMKE - WILLEMTJE WILLEMKE - WILLEMPKE WILLEMIJNTJE - WILLEMKEN WILLEMIJNTJE - WILLEMIJNTIE WILLEMPJE - WILEMTJE WILLEMINA - WILMIJNTJE WILLEINTJE - WILLEMTJE WILLEMTJEN - WILLEMPJE WILLEMTJE - WILLMEPJE WILLEMINA - WILHELMIMA GUILLIELMINE - GUILIELMINE WILLEMPIEN - WILLEMPJE WILHELMINA - WILLEMTJE WILLEMINA - WILLEMEINTJE WILLEMIEN - WILLEMIN WILLEMINA - WILMPJE WILMINE - WILLEMINE WILKENS - WILKES WILLEMINE - WILMINA WILLEMTJEN - WILLMEPJE WIILEMINA - WILLEMINA

WILEHELMINA - WILHELMINA WILHELMINA - WILLEMDINA WILLEMKEN - WILHELMINA WILLEMIENTJE - WILLEMIJNA WILLEMA - WILLEMS WILLEMPJEN - WILLEMTJEN WILLEMPIEN - WILLEMTJE WILHELHERMINA - WILHELMINA GUILLEMINE - WILHELMINA WILLEMIJNTJE - WILMIJNTJE WILLEMPJE - WILMPJE WILLEMINE - WILLEMIENE WILLEMINA - WILLEMSEN WILLEMPKE - WILLEMPJE GUILLELMINE - GUILLELMINA WILLEMIENA - WILLEMPJE WILLEMIJNTIE - WILLEMPJE WILLELMINA - WILLEMINA GUILLEMINE - GUILLELMINA WILLEMIENA - WILHELMIENA WILLEMINA - WILHELMIENA WILELMINA - WILHELMINA GUILLEMINA - GUILLELMINE WILLEMKE - WILEMKE WILLEMKE - WILLEM WILLEMTJEN - WILLEMTIJN WILLEMPIEN - WILLEMPJEN WILLEMJE - WILLEMTJE WILLEMKEN - WILLEM WILEMIJNA - WILMIJNA WILHELMINA - WILLEMIENA WILLEMTJE - WILLEMTJEN WILLEMTIEN - WILLEMS WILLEMTIEN - WILLEMPJE GUILHELMINE - GUILLELMINE WILLEMKE - WIMPKE WILHELMINA - WILKELINA WILHELLEMINA - WILHELMINA WILEMINA - WILLEMINA WILLEMJEN - WILLEMKEN WILMINE - WILLEMINA WILHELMIN - WILHELMINA WILLEMPJ - WILLEMPJE

and many more


name clusters

• variant pairs (are interconnected)
– Jan - Johannes
– Jan - Joannes
– Jan - Johan
– Johannes - Johan, etc.

• create cluster Jan {Jan, Johannes, Johan}
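
One common way to turn interconnected pairs into clusters is to treat names as nodes and accepted pairs as edges, then take the connected components. The sketch below uses a union-find structure for this. It is an assumption that the clusters were built this way: the slides do not name the algorithm, nor say how the cluster label (here "Jan") is chosen.

```python
def build_clusters(pairs):
    """Group names into clusters: two names end up together when a chain
    of variant pairs connects them (connected components)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for name in list(parent):
        clusters.setdefault(find(name), set()).add(name)
    return list(clusters.values())


pairs = [("Jan", "Johannes"), ("Jan", "Joannes"),
         ("Jan", "Johan"), ("Johannes", "Johan")]
clusters = build_clusters(pairs)
print(clusters)
```

With the four pairs listed on the slide, all four spellings (Jan, Johannes, Joannes, Johan) fall into a single cluster. A chain of accepted pairs is enough to merge two names, so a single wrongly accepted pair can join two unrelated clusters, which is why the cleaning step comes first.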


name clusters

• male first names: 1,221 clusters (16,487 names, 20%)
• female first names: 1,530 clusters (23,816 names, 21%)
– compares to the number of lemmas in the Dutch dictionary of first names, vd Schaar 1964

• surnames: 11,686 clusters (93,839 names, 17%)
– compares to the number in the Dutch surnames overview (without many variants), Winkler 1885


conclusions

• person name variants need proof from true person links

• expert knowledge is necessary because errors cannot be distinguished fully automatically from true variants (but manual intervention is needed for < 2%)

• final results are promising as a starting point to create a national repository of proven name variants
