Chalice / Edinburgh Geoparser at CA2011



DESCRIPTION

Slides hurriedly put together based on Claire Grover's slides for Pelagios workshop

TRANSCRIPT

Page 1: Chalice / Edinburgh Geoparser at CA2011

www.inf.ed.ac.uk

Institute for Language, Cognition and Computation

Chalice – Linked Data and Historic Place-names

Jo Walsh [email protected]

Kate Byrne, Richard Tobin, Claire Grover

Page 2: Chalice / Edinburgh Geoparser at CA2011

Overview of the Edinburgh Geoparser

• System to automatically recognise place names in text and disambiguate them with respect to a gazetteer (ambiguous names such as Athens or Springfield).

• Patchy development over the past few years, funded by a variety of projects and applied to a range of data sets:

– GeoCrossWalk

– BOPCRIS

– GeoDigRef (Histpop, BOPCRIS, BL)

– Embedding GeoCrossWalk (Stormont Papers)

– SYNC3 (online news)

– Chalice (EPNS)

– Unlock

• Main concern has been to keep it generally usable while applying it to specific data sets.

Page 3: Chalice / Edinburgh Geoparser at CA2011

Overview of the Edinburgh Geoparser

[Pipeline diagram, in two stages]

Geotagging: input (.txt, .html, .xml) → format conversion → tokenisation → POS tagging → lemmatisation → named entity recognition → .geotagged.xml

Georesolution: gazetteer lookup → resolution (file labels on the diagram: .geotagged.xml, .gaz.xml)
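The two stages can be illustrated with a toy sketch. This is not the Edinburgh Geoparser's own code or method: the gazetteer, its rough population figures, and the "most populous candidate wins" rule are all assumptions chosen only to show geotagging followed by georesolution.

```python
import re

# Hypothetical mini-gazetteer: name -> candidate places.
# Population figures are rough, illustrative values only.
GAZETTEER = {
    "Athens": [
        {"name": "Athens", "country": "GR", "population": 660000},
        {"name": "Athens", "country": "US", "population": 120000},
    ],
    "Springfield": [
        {"name": "Springfield", "country": "US", "admin": "IL", "population": 115000},
        {"name": "Springfield", "country": "US", "admin": "MA", "population": 155000},
    ],
}


def geotag(text):
    """Stage 1 (geotagging): return capitalised tokens found in the gazetteer."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return [t for t in tokens if t[0].isupper() and t in GAZETTEER]


def georesolve(names):
    """Stage 2 (georesolution): pick one candidate per name.

    Assumed heuristic: prefer the most populous candidate.
    """
    return {n: max(GAZETTEER[n], key=lambda c: c["population"]) for n in names}


mentions = geotag("The delegates travelled from Springfield to Athens.")
resolved = georesolve(mentions)
print(mentions)
print(resolved["Athens"]["country"])
```

A real system would use the linguistic processing steps listed in the diagram rather than a capitalisation test, and would weigh more evidence than population when choosing between candidates.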

Page 4: Chalice / Edinburgh Geoparser at CA2011
Page 5: Chalice / Edinburgh Geoparser at CA2011
Page 6: Chalice / Edinburgh Geoparser at CA2011

Chalice

• Connecting Historical Authorities with Linked Data, Contexts, and Entities.

• Part of jiscEXPO - "exposing digital content for education and research".

• The project is exploring the viability of creating a historical gazetteer from digitized volumes from the English Place-Name Society (EPNS).

• Partners:

– CDDA, Queen’s University, Belfast

– School of Informatics, Edinburgh

– EDINA, Edinburgh

– CeRch, King’s College London

Page 7: Chalice / Edinburgh Geoparser at CA2011

English Place-Name Survey

• At the Institute of Name Studies in Nottingham

• 80+ volumes covering English counties

• Over 1000 years of place-name history

• Started in 1925 and still going!

Page 8: Chalice / Edinburgh Geoparser at CA2011
Page 9: Chalice / Edinburgh Geoparser at CA2011

Archaeology and Place-names and History

• "The first point, already noted repeatedly but so important that it cannot be too strongly emphasised, is that historical evidence is documentary and therefore direct evidence only of a state of mind; that archaeological evidence is material and therefore direct evidence only of practical skills, technological processes, aesthetic interests and physical sequences; and that place-name evidence is linguistic and therefore direct evidence only of language and speech habits. Indirect inferences may be drawn in each case, and the evidence of place-names may be used to throw light on the date, nature and extent of settlements, on the movements of peoples and their relationships to each other, on certain aspects of their organisation and on many of the other problems that concern the historian and the archaeologist. But in all these cases the inferences depend to some extent on assumptions and they must be examined carefully before they are accepted as valid." – F.T. Wainwright

Page 10: Chalice / Edinburgh Geoparser at CA2011

Chalice data

• Cheshire

– Cheshire Part I. EPNS Volume 44, 1970

– Cheshire Part II. EPNS Volume 45, 1970

– Cheshire Part III. EPNS Volume 46, 1971

– Cheshire Part IV. EPNS Volume 47, 1972

– Cheshire Part V (1:i). EPNS Volume 48, 1981

– Cheshire Part V (1:ii). EPNS Volume 54, 1981

• Small samples from:

– Berkshire, Buckinghamshire (Vol. 2), Cambridgeshire (Vol. 19), Derbyshire (Vols. 27-29), Hertfordshire (Vol. 15)

• Shropshire: Pimhill Hundred (born digital)

Page 11: Chalice / Edinburgh Geoparser at CA2011

EPNS

• Parishes are organised by the hundreds to which they belong.

• Towns and villages are referred to as townships, organised by the parish to which they belong.

• Township descriptions often contain descriptions of buildings, bridges, lanes, woods and farms.

• River and major road names are described separately from the inhabited place descriptions.

• Entries record names and spellings that have been attested in historical sources, and the etymology of names or name parts.

• In Chalice we focus on capturing parishes, townships, sub-townships and attestations.
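The nesting described above (hundred, parish, township, with attestations attached to a place) can be sketched as a small data structure. The class and field names are hypothetical and not taken from the Chalice code; the example places come from the Willaston entry shown later in the slides.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Attestation:
    spelling: str  # historical form of the name
    source: str    # abbreviation of the source document
    date: str      # kept as text (assumption: dates may not always be a single year)


@dataclass
class Place:
    name: str
    level: str  # "hundred", "parish", "township" or "sub-township"
    parent: Optional["Place"] = None
    attestations: List[Attestation] = field(default_factory=list)

    def path(self):
        """Walk up the hierarchy from this place to its hundred."""
        node, parts = self, []
        while node is not None:
            parts.append(f"{node.name} ({node.level})")
            node = node.parent
        return " < ".join(parts)


wirral = Place("Wirral", "hundred")
neston = Place("Neston", "parish", parent=wirral)
willaston = Place("Willaston", "township", parent=neston)
print(willaston.path())
```

Storing only a parent link keeps each record small while still letting any place report the parish and hundred it belongs to.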

Page 12: Chalice / Edinburgh Geoparser at CA2011
Page 13: Chalice / Edinburgh Geoparser at CA2011

The start of the entry for the township of Willaston in the parish of Neston in Wirral Hundred.

Page 14: Chalice / Edinburgh Geoparser at CA2011
Page 15: Chalice / Edinburgh Geoparser at CA2011
Page 16: Chalice / Edinburgh Geoparser at CA2011
Page 17: Chalice / Edinburgh Geoparser at CA2011
Page 18: Chalice / Edinburgh Geoparser at CA2011

Turtle-like version

# Default prefix added so that the ":" names resolve; the slide left it undeclared.
@prefix : <#> .
@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix gn: <http://www.geonames.org/ontology#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix chalice: <http://made.up.domain.name/chalice/> .

:Bosley a chalice:Place ;
    dc:title "Bosley" .

:Boselega a chalice:PlaceName ;
    dc:title "Boselega" .

:attested a chalice:PlaceNameAttestation ;
    chalice:place :Bosley ;
    chalice:known_as :Boselega ;
    chalice:source :DB ;
    chalice:date 1086 .

:DB a chalice:Source ;
    dc:title "Domesday Book" .
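As a rough illustration of how statements like the attestation above might be generated from extracted records, the sketch below writes them out as N-Triples using plain string formatting. The `BASE` namespace, the function names, and the URI pattern for attestations are all hypothetical; the `chalice:` namespace is the made-up one from the slide.

```python
CHALICE = "http://made.up.domain.name/chalice/"  # made-up namespace, as on the slide
DC_TITLE = "http://purl.org/dc/elements/1.1/title"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/chalice-data/"  # hypothetical base for instance URIs


def uri(value):
    """Format a URI as an N-Triples term."""
    return f"<{value}>"


def literal(value):
    """Format a plain string literal, escaping backslashes and quotes."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def attestation_triples(place, form, source, source_title, year):
    """Return N-Triples lines for one place, one name form and one attestation."""
    p, n, s = BASE + place, BASE + form, BASE + source
    a = f"{BASE}attestation/{place}-{form}-{year}"  # assumed URI pattern
    triples = [
        (p, RDF_TYPE, uri(CHALICE + "Place")),
        (p, DC_TITLE, literal(place)),
        (n, RDF_TYPE, uri(CHALICE + "PlaceName")),
        (n, DC_TITLE, literal(form)),
        (s, RDF_TYPE, uri(CHALICE + "Source")),
        (s, DC_TITLE, literal(source_title)),
        (a, RDF_TYPE, uri(CHALICE + "PlaceNameAttestation")),
        (a, CHALICE + "place", uri(p)),
        (a, CHALICE + "known_as", uri(n)),
        (a, CHALICE + "source", uri(s)),
        (a, CHALICE + "date", literal(year)),  # plain literal; a typed date may be preferable
    ]
    return [f"{uri(subj)} {uri(pred)} {obj} ." for subj, pred, obj in triples]


lines = attestation_triples("Bosley", "Boselega", "DB", "Domesday Book", 1086)
print("\n".join(lines))
```

Names containing spaces or other characters that are not valid in URIs would need encoding before being used this way.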

Page 19: Chalice / Edinburgh Geoparser at CA2011

Issues

• OCR quality needs to be high: not just recognising characters correctly but getting font and layout information right.

• Variation in use of layout and font to indicate structure

• Different volumes reflect different decisions about where place name information should be put.

Page 20: Chalice / Edinburgh Geoparser at CA2011

Linking Data

• A URI for each place-name

• Links to information about each attestation

• Links to nearby places

• Links to other sources of place-name references

– Geonames.org (variable quality, wide usage)

– Ordnance Survey Open Data (also variable quality)

• Then links from and between documentary sources

Page 21: Chalice / Edinburgh Geoparser at CA2011

Integrating (with) other sources

• Series of use cases by Stuart Dunn at KCL

• Victoria County History

• Clergy of the Church of England Database

• Archaeology Data Service

Page 22: Chalice / Edinburgh Geoparser at CA2011
Page 23: Chalice / Edinburgh Geoparser at CA2011
Page 24: Chalice / Edinburgh Geoparser at CA2011
Page 25: Chalice / Edinburgh Geoparser at CA2011
Page 26: Chalice / Edinburgh Geoparser at CA2011

GAP & Ancient Place-names

• Based on Pleiades set of ancient place names but extended in two ways:

• by matching Pleiades place names against GeoNames place names in the same location and adding the GeoNames alternative names to the Pleiades+ list:

– adds three alternative names for the single Pleiades entry for "Autricum" ("Chartrez", "Chartres", "Shartr"), because "Autricum" is present in both Pleiades and GeoNames, with the same approximate location

• (We don't want to simply take places directly from GeoNames because, when we tried it, we were swamped with irrelevant modern places having names corresponding to ancient toponyms.)
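The matching step described above can be sketched as follows. This is only an illustration under stated assumptions: the record layout, the function names, the 20 km threshold, the approximate coordinates and the second, far-away GeoNames record are all invented for the example and are not taken from the GAP code or data.

```python
import math


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance using the haversine formula."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def build_pleiades_plus(pleiades, geonames, max_km=20.0):
    """Add GeoNames names to a Pleiades entry when name and location both match."""
    plus = []
    for p in pleiades:
        names = [p["name"]]
        for g in geonames:
            g_names = [g["name"]] + g["alternatenames"]
            same_name = p["name"] in g_names
            near = distance_km(p["lat"], p["lon"], g["lat"], g["lon"]) <= max_km
            if same_name and near:
                names += [n for n in g_names if n not in names]
        plus.append({**p, "names": names})
    return plus


# Approximate, illustrative coordinates.
pleiades = [{"name": "Autricum", "lat": 48.45, "lon": 1.49}]
geonames = [
    {"name": "Chartres", "lat": 48.45, "lon": 1.48,
     "alternatenames": ["Autricum", "Chartrez", "Shartr"]},
    # Hypothetical same-named place far away: rejected by the distance test.
    {"name": "Autricum", "lat": 10.0, "lon": 10.0, "alternatenames": ["Elsewhere"]},
]
plus = build_pleiades_plus(pleiades, geonames)
print(plus[0]["names"])
```

Requiring both a name match and a nearby location is what keeps unrelated modern places with ancient-sounding names out of the list.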

Page 27: Chalice / Edinburgh Geoparser at CA2011

Pleiades+(+)

• Pleiades+: get alternative names for places that match in geonames

• Pleiades++ is a runtime supercharging bit:

– if place X isn't in Pleiades+,

– look at "synonym ring" of alternative names in geonames

– try all of those against Pleiades+

mysql> select distinct p.name, p.plid, p.geonameId, p.fclass, p.fcode,
    ->        p.country, p.latitude, p.longitude, p.population, p.normname
    -> from plplus p
    -> join geonames.alternatename a on p.name=a.alternatename
    -> join geonames.geoname g on a.geonameid=g.geonameid
    -> join geonames.alternatename a2 on a2.geonameid=g.geonameid
    -> where a2.alternatename="Egypt";
+----------+---------+-----------+--------+-------+---------+------------+------------+------------+----------+
| name     | plid    | geonameId | fclass | fcode | country | latitude   | longitude  | population | normname |
+----------+---------+-----------+--------+-------+---------+------------+------------+------------+----------+
| Aegyptus |     766 |         0 |        |       |         | 32.5000000 | 32.5000000 |          0 | aegyptus |
| Aegyptus |  981503 |         0 |        |       |         | 27.5000000 | 26.5476190 |          0 | aegyptus |
| Aigyptos | 1001943 |         0 |        |       |         | 32.5000000 | 32.5000000 |          0 | aigyptos |
+----------+---------+-----------+--------+-------+---------+------------+------------+------------+----------+
3 rows in set (0.05 sec)
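The runtime fallback can be tried out on a tiny in-memory database. The sketch below uses Python's sqlite3 module instead of MySQL, keeps only the name and plid columns, and joins the alternate-name table to itself on geonameid rather than going through the geoname table; the geonameid value 1 is a placeholder, not a real GeoNames identifier, and the `lookup` function is hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE plplus (name TEXT, plid INTEGER);
    CREATE TABLE alternatename (geonameid INTEGER, alternatename TEXT);
    INSERT INTO plplus VALUES
        ('Aegyptus', 766), ('Aegyptus', 981503), ('Aigyptos', 1001943);
    -- geonameid 1 is a placeholder, not the real GeoNames identifier
    INSERT INTO alternatename VALUES
        (1, 'Egypt'), (1, 'Aegyptus'), (1, 'Aigyptos');
""")


def lookup(name):
    """Try Pleiades+ directly, then fall back to the GeoNames synonym ring."""
    direct = db.execute(
        "SELECT DISTINCT name, plid FROM plplus WHERE name = ? ORDER BY plid",
        (name,),
    ).fetchall()
    if direct:
        return direct
    # Fallback: every alternative name sharing a geonameid with `name`
    # is tried against Pleiades+.
    return db.execute(
        """SELECT DISTINCT p.name, p.plid
           FROM plplus p
           JOIN alternatename a  ON p.name = a.alternatename
           JOIN alternatename a2 ON a2.geonameid = a.geonameid
           WHERE a2.alternatename = ?
           ORDER BY p.plid""",
        (name,),
    ).fetchall()


print(lookup("Egypt"))
```

Checking Pleiades+ first means the wider synonym-ring search only runs for names that would otherwise go unmatched.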

Page 28: Chalice / Edinburgh Geoparser at CA2011

www.inf.ed.ac.uk

Institute for Language, Cognition and Computation

Thanks

http://chalice.blogs.edina.ac.uk

http://unlock.edina.ac.uk/text.html