Palimpsest: Mining Literary Edinburgh Beatrice Alex (@bea_alex, @LitPalimpsest) University of Edinburgh, School of Informatics #TDHSS, Aberdeen, Sept 22nd 2015

Page 1 (slides: homepages.inf.ed.ac.uk/balex/publications/PalimpsestSlides.pdf)

Palimpsest: Mining Literary Edinburgh

Beatrice Alex (@bea_alex, @LitPalimpsest)

University of Edinburgh, School of Informatics

#TDHSS, Aberdeen, Sept 22nd 2015

Page 2

Palimpsest

AHRC (Big Data) project: 01/2014-03/2015

Literature, University of Edinburgh
James Loxley, Professor of Early Modern Literature
Miranda Anderson, Research Fellow

Informatics, University of Edinburgh
Jon Oberlander, Professor of Epistemics
Beatrice Alex, Research Fellow in Text Mining
Claire Grover, Senior Research Fellow

SACHI: St Andrews Human Computer Interaction Research
Aaron Quigley, Director of SACHI & Chair of Human Computer Interaction
David Harris-Birtill, Research Fellow
Uta Hinrichs, Research Fellow

EDINA
James Reid, Workgroup Leader, Geoservices
Nicola Osborne, Social Media Officer

Page 3

Prototype

Page 4

I visited Edinburgh with languid eyes and mind; and yet that city might have interested the most unfortunate being. Clerval did not like it so well as Oxford; for the antiquity of the latter city was pleasing to him. But the beauty and regularity of the new town of Edinburgh, its romantic castle and its environs, the most delightful in the world, Arthur’s Seat, St. Bernard’s Well, and the Pentland Hills, compensated him for the change and filled him with cheerfulness and admiration.

Mary Shelley, Frankenstein

Frankenstein

Page 5

Edinburgh: Picturesque Notes

But it is not only pipers who have vanished, many a solid bulk of masonry has been likewise spirited into the air. Here, for example, is the shape of a heart let into the causeway. This was the site of the Tolbooth, the Heart of Midlothian, a place old in story and namefather to a noble book.

Stevenson, Edinburgh: Picturesque Notes

Page 6

Trainspotting

These burds ur gaun oantay us aboot how fuckin beautiful Edinburgh is, and how lovely the fuckin castle is oan the hill ower the gairdins n aw that shite. That's aw they tourist cunts ken though, the castle n Princes Street, n the High Street. Like whin Monny's auntie came ower fae that wee village oan that Island oaf the west coast ay Ireland, wi aw her bairns. The wifey goes up tae the council fir a hoose. The council sais tae her, whair's it ye want tae fuckin stey, like? The woman sais, ah want a hoose in Princes Street lookin oantay the castle.…Perr cunt jist liked the look ay the street whin she came oaf the train, thoat the whole fuckin place wis like that. The cunts in the council jist laugh n stick the cunt n one ay they hoatline joabs in West Granton, thit nae cunt else wants. Instead ay a view ay the castle, she's goat a view ay the gasworks. That's how it fuckin works in real life, if ye urnae a rich cunt wi a big fuckin hoose n plenty poppy.

Irvine Welsh, Trainspotting

Page 7

Palimpsest Workflow

[Workflow diagram]

DIGITISED DOCUMENTS: HathiTrust collection, British Library Nineteenth Century Books, National Library of Scotland collection, Oxford Text Archive, Project Gutenberg, ...

DOCUMENT RETRIEVAL & FILTERING: Ranked lists of Edinburgh-specific candidates

MANUAL CURATION: Curation of Edinburgh-specific literature

EDINBURGH GAZETTEER: gazetteer of Edinburgh place names and their latitude/longitude pairs or shape files derived from several sources

TEXT MINING: fine-grained location extraction and geo-referencing using the Edinburgh Geoparser

RELATIONAL DATABASE: geo-referenced locations, snippets, meta data

USER INTERFACES

Example ranked list of Edinburgh-specific candidates:

24.189 The Journal of Sir Walter Scott (Scott, Walter)
22.079 Robert Louis Stevenson (Black, Margaret Moyes)
20.725 The Modern Scottish Minstrel, Volumes I-VI. (Various)
19.610 Spare Hours (Brown, John)
17.181 The Heart of Mid-Lothian (Scott, Walter)
15.369 The Works of Robert Louis Stevenson (Stevenson, Robert L.)
15.018 Rab and His Friends and Other Papers (Brown, John)
14.177 Greyfriars Bobby (Atkinson, Eleanor)
...
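The document retrieval and filtering stage produces ranked lists of Edinburgh-specific candidates. The slides do not give the scoring formula, so the following is only a minimal sketch of one way such a ranking could work: it counts gazetteer place names per 1,000 tokens. The place list, the function names and the scoring rule are all assumptions made for illustration, not the project's method.

```python
import re
from collections import Counter

# Hypothetical mini-gazetteer; the project's real gazetteer is far larger.
EDINBURGH_PLACES = [
    "Arthur's Seat",
    "Princes Street",
    "High Street",
    "Pentland Hills",
    "Tolbooth",
]


def place_counts(text, places=EDINBURGH_PLACES):
    """Count case-insensitive whole-phrase matches of each place name."""
    counts = Counter()
    for place in places:
        pattern = r"\b" + re.escape(place) + r"\b"
        n = len(re.findall(pattern, text, flags=re.IGNORECASE))
        if n:
            counts[place] = n
    return counts


def edinburgh_score(text):
    """Toy score (an assumption): place-name mentions per 1,000 tokens."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return 1000.0 * sum(place_counts(text).values()) / len(tokens)


def rank_documents(docs):
    """Return (score, title) pairs, most Edinburgh-specific first.

    `docs` maps a title to the full text of the document.
    """
    scored = [(edinburgh_score(text), title) for title, text in docs.items()]
    return sorted(scored, reverse=True)
```

A ranking like this only proposes candidates; in the workflow, the ranked list is then passed to manual curation.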

Page 8

Palimpsest Workflow

Big data IN

Small data OUT

Page 9

Datasets

HathiTrust collection (all worldwide public domain material)

British Library Nineteenth Century Books collection

English Project Gutenberg books

Oxford Text Archive data

National Library of Scotland data

Limited set of copyrighted material, if author/publisher agrees (Irvine Welsh, Muriel Spark, Alexander McCall Smith ...)

Page 10

Text Analytics Tasks

Retrieve literary works which are at least partly set in Edinburgh from all literature accessible to us.

Devise a method for identifying “loco-specificity” in literature automatically based on input from literary scholars.

Create a fine-grained location gazetteer for Edinburgh.

Identify and geo-reference locations (including street names and buildings) using the Edinburgh Geoparser.

Extract snippets and compute interestingness.

Assist in de-duplication.
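To make the geo-referencing and snippet tasks above concrete, here is a minimal sketch that uses plain dictionary lookup against a tiny gazetteer. It is not the Edinburgh Geoparser and does not reproduce its behaviour. The gazetteer entries, the approximate coordinates, the `georeference` function and the 40-character snippet window are all assumptions made for illustration.

```python
import re

# Hypothetical gazetteer entries: name -> (latitude, longitude).
# Coordinates are rough illustrative values, not project data.
GAZETTEER = {
    "Arthur's Seat": (55.944, -3.162),
    "Princes Street": (55.952, -3.197),
}


def georeference(text, gazetteer=GAZETTEER, window=40):
    """Find gazetteer names in `text` and attach coordinates and a snippet.

    Returns one record per mention, ordered by character offset.
    """
    mentions = []
    for name, (lat, lon) in gazetteer.items():
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text):
            # Snippet: up to `window` characters on each side of the mention.
            start = max(0, match.start() - window)
            end = min(len(text), match.end() + window)
            mentions.append({
                "name": name,
                "lat": lat,
                "lon": lon,
                "offset": match.start(),
                "snippet": text[start:end].strip(),
            })
    return sorted(mentions, key=lambda record: record["offset"])
```

Exact string matching like this misses spelling variants and cannot disambiguate names shared by several places, which is why the project relies on a dedicated geoparser and a curated fine-grained gazetteer.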

Page 11

Multiple Sources and Formats

Page 12

Assisted Curation

Page 13

Discovery

Page 14

Genre

Unless meta data is available for genre, our method cannot easily distinguish between genres.

Ted Underwood, How to find English-language fiction, poetry, and drama in HathiTrust. 29/12/2014

Page 15

Outputs

The fine-grained Edinburgh gazetteer and the Geoparser can be used for future research.

User interfaces are available at litlong.org via:

a web-based visualisation

a mobile app

an API

Queries on GitHub: https://github.com/LitPalimpsest/Palimpsest

Page 16

LitLong.org

Page 17

LTG

Ongoing projects of the Edinburgh Language Technology Group:

Palimpsest: Mining Literary Edinburgh (AHRC).

UK Connectivity: Analysis of social media (British Council).

BotaniTours: Information aggregation and presentation of botanical points of interest in the Scottish Borders (Smart Tourism/dot.rural).

Trading Consequences: Text mining trends in commodity trading from large 19th century text collections (Digging into Data).

Text Mining Careers: Mining of career profiles (Challenge Investment Fund, UoE).

New: Text mining brain scan reports for clinical neurologists.

Page 18

Thank You

LTG: www.ltg.ed.ac.uk

Edinburgh Geoparser: https://www.ltg.ed.ac.uk/software/geoparser/

Palimpsest: palimpsest.blogs.edina.ac.uk

LitLong: litlong.org

Twitter: @LitPalimpsest

Contact:

Beatrice Alex | [email protected]