Enabling Complex Analysis of Large-Scale Digital Collections: Humanities Research, High Performance Computing, and transforming access to British Library Digital Collections

Melissa Terras, James Baker, James Hetherington, David Beavan, Martin Zaltz Austwick, Anne Welsh, Helen O'Neill, Will Finley, Oliver Duke-Williams, and Adam Farquhar

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Exceptions: quotations, embeds from external sources, logos, and marked images.

Data, code, viz: github.com/UCL-dataspring

Upload: james-baker

Post on 08-Jan-2017



Overview

Barriers to computational approaches:

● fragmentation of communities, resources, and tools;
● lack of interoperability;
● lack of technical skills


Method

60k books from the British Library:

● 17th - 19th century
● 224GB compressed ALTO XML
● UCL High Performance Computing
● 4 humanities researchers
● Research questions to computational queries
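To make "research questions to computational queries" concrete, the following is a minimal sketch of the first step such a query needs: pulling the words out of one ALTO XML page. It relies only on the fact that ALTO stores each recognised word in the CONTENT attribute of a String element. The function name alto_words and the sample page are illustrative assumptions, not taken from the project's code.

```python
import xml.etree.ElementTree as ET


def alto_words(xml_text):
    """Return the words of one ALTO page, in document order."""
    root = ET.fromstring(xml_text)
    words = []
    for element in root.iter():
        # Drop any namespace prefix so the sketch works across ALTO versions.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "String" and "CONTENT" in element.attrib:
            words.append(element.attrib["CONTENT"])
    return words


# Hypothetical, hand-written page used only to exercise the function.
SAMPLE_PAGE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="The"/><SP/><String CONTENT="cholera"/>
    </TextLine>
    <TextLine>
      <String CONTENT="spread"/><SP/><String CONTENT="quickly"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

if __name__ == "__main__":
    print(alto_words(SAMPLE_PAGE))
```

On a cluster, a function of this kind would typically be applied to each page or book independently, which is what makes the workload straightforward to spread across many nodes.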


UCL’s Legion Cluster supercomputing facility. Photo: Tony Slade, © UCL Creative Media Services (all rights reserved)


Results

It worked!

● Case Study 1: History of Medicine
● Case Study 2: History of Images
● Technical barriers
● Search ‘recipes’


Case Study 1: History of Medicine

Oliver Duke-Williams, UCL


Case Study 2: History of Images

Will Finley, Sheffield


Technical

Major sticking point:

● Using humanities data on HPCs

Best practice recommendations:

● Derived datasets
● Normalisations
● Documenting decisions
● Fixed/defined dataset
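One way to read these recommendations together is sketched below: a derived dataset is produced by a fixed list of normalisations, and a small manifest records which steps were applied and a digest of the exact source they were applied to. The specific normalisation steps, the function names, and the manifest fields are all assumptions made for illustration; the slides do not say which normalisations the project used.

```python
import hashlib
import json
import re

# Assumed normalisation steps, listed in the order they are applied.
NORMALISATION_STEPS = [
    "replace long s (ſ) with s",
    "lowercase",
    "collapse runs of whitespace",
]


def normalise(text):
    """Apply the steps named in NORMALISATION_STEPS, in order."""
    text = text.replace("ſ", "s")
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()


def derive(records):
    """Build a derived dataset plus a manifest documenting how it was made.

    `records` maps a (hypothetical) book identifier to its raw text.
    """
    derived = {book_id: normalise(text) for book_id, text in records.items()}
    # Hash the source so the derived data is tied to a fixed, defined input.
    source_bytes = json.dumps(
        records, sort_keys=True, ensure_ascii=False
    ).encode("utf-8")
    manifest = {
        "source_sha256": hashlib.sha256(source_bytes).hexdigest(),
        "book_count": len(records),
        "steps": NORMALISATION_STEPS,
    }
    return derived, manifest


if __name__ == "__main__":
    derived, manifest = derive({"book-001": "The  Cholera\nſpread"})
    print(derived)
    print(json.dumps(manifest, indent=2))
```

Keeping the manifest next to the derived data means a later reader can see which decisions shaped the numbers without rerunning anything.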


Generic searches:

● for all variants of a word
● that return keywords in context traced over time
● for a word or phrase that ignore another word or phrase
● for a word when in close proximity to a second word
● based on image metadata
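The text-based searches in this list can be sketched in a few lines each. The version below is a minimal illustration on plain strings, assuming the text has already been extracted from the XML; all function names, the prefix-matching definition of "variants", and the two sample books are assumptions rather than the project's actual recipes. Searches based on image metadata are not covered.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def variants(tokens, stem):
    """Distinct tokens starting with `stem` (a crude stand-in for variants)."""
    return sorted({t for t in tokens if t.startswith(stem)})


def keyword_in_context(tokens, keyword, width=2):
    """Return (left, keyword, right) triples for every hit."""
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


def excluding(tokens, word, unwanted_next):
    """Positions of `word` not immediately followed by `unwanted_next`."""
    return [
        i for i, token in enumerate(tokens)
        if token == word and tokens[i + 1:i + 2] != [unwanted_next]
    ]


def near(tokens, word, other, window=3):
    """Positions of `word` that lie within `window` tokens of `other`."""
    other_positions = [i for i, token in enumerate(tokens) if token == other]
    return [
        i for i, token in enumerate(tokens)
        if token == word and any(abs(i - j) <= window for j in other_positions)
    ]


def hits_per_year(books, keyword):
    """Count keyword occurrences per year; `books` is (year, text) pairs."""
    counts = Counter()
    for year, text in books:
        counts[year] += tokenize(text).count(keyword)
    return dict(counts)


# Hypothetical sample data, invented for this sketch.
BOOKS = [
    (1832, "The cholera spread in London. Doctors feared the cholera epidemic."),
    (1849, "Cholera returned, and vaccination was debated by vaccinators."),
]

if __name__ == "__main__":
    first = tokenize(BOOKS[0][1])
    print(keyword_in_context(first, "london"))
    print(near(first, "cholera", "epidemic", window=2))
    print(hits_per_year(BOOKS, "cholera"))
```

Each function takes a token list and returns plain Python values, so the same recipe can be run on one book interactively or mapped over the full collection.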


Conclusions

Recommendations for enabling complex analysis of large-scale digital collections in the humanities:

● 1 Invest in research software engineer capacity to deploy and maintain openly licensed large-scale digital collections from across the GLAM sector in order to facilitate research in the arts, humanities and social and historical sciences.

● 2 Invest in training library staff to run these initial queries in collaboration with humanities faculty, to support work with subsets of data that are produced, and to document and manage resulting code and derived data.


Special thanks to UCL Research Computing and British Library Digital Research for their hard work and support!
