Post on 08-Jan-2016
Developing a Digital Library for the Humanities
• Gregory Crane (gcrane@tufts.edu)
• Winnick Family Chair in Technology and Entrepreneurship
• Professor of Classics
• Director, Perseus Digital Library Project
• Http://www.perseus.tufts.edu/About/grc.html
Perseus Digital Library
• On-going areas of Development
– 1987: DL on Classical Greek Culture
– 1993: History of Science
– 1996: Began work on Latin and Rome
– 1997: Early Modern English
– 1999: History and Topography of London
– 2000: Ancient Egyptian Giza
– 2000: Slavery and the US Civil War
Partner Institutions
• Max Planck Institute for the History of Science (Berlin)
• Museum of Fine Arts, Boston
• Stoa Publishing Consortium
• New Variorum Shakespeare Series, Modern Language Association
• Special Collections at Tufts, Brandeis, the University of Pennsylvania
On-Going Support
• National Endowment for the Humanities (DLI2, Preservation & Access, Education)
• National Science Foundation (DLI2)
• Fund for the Improvement of Postsecondary Education, Dept of Ed.
• Max Planck Society
The Whole greater than the sum
• Tufts Health Sciences Database:
• An on-line Medical School Curriculum
– First iteration: 70% of the value
– Second iteration: 90%
– Third iteration: 130%
• “Data” and “system” interact in increasingly dynamic ways.
Persistent value over time & space
• “How many ages hence / Shall this our lofty scene be acted over, / In states unborn and accents yet unknown?”
– Cassius in Julius Caesar
• How do we structure data for
– Contemporary users we can’t directly anticipate?
– Systems not yet designed?
Radically New Documents
• Reconstructions of Historical Spaces, e.g.
– UVA’s Crystal Palace (London)
– UCLA’s Rome and VR Lab
• Integrating Virtual Spaces with Sources
– Museum of Fine Arts, Tombs at Giza
– Greek Sculpture
– The Streets of 19th Century London
Traditional Docs Rethought
• Concordance: “Obsolete”
• Bibliographies — databases
• Encyclopedias — automatic linking
• Lexica and lexicography:
– Automatically discovered semantic relations
– THEN lexicographic work
Development is two part
• Ultimate end: Radically new docs?
• Short term: Electronic Incunabula
– New Variorum Shakespeare
– Electronic Marlowe
– Tallis Street Maps
• FIRST we thoroughly analyze what we have
• THEN radical redesign emerges
Technology outruns Practice
• The 3D Reconstruction/Virtual Space
– Cutting edge technology
– Still nascent scholarly practices
• Mature Document Structures
– Textual Notes: 1908 Richard 3
– Traditional Text Citations: 1887 Commentary
The More Things Stay the same...
• “Content” can remain unchanged
• “Presentation” is dynamic and flexible
– The Dictionary knows what you are reading
– Citations -> Bidirectional links
– Automatic Linking by keyword
– Text and Atlas: Plot sites in a document
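The “Citations -> Bidirectional links” point can be illustrated with a minimal sketch: given one-way citations, an index built at display time lets a reader travel from a cited passage back to everything that cites it. The passage identifiers and the `build_link_index` function are hypothetical names chosen for illustration, not part of Perseus.

```python
from collections import defaultdict


def build_link_index(citations):
    """Turn one-way (citing, cited) pairs into lookups in both directions."""
    outgoing = defaultdict(set)  # citing passage -> passages it cites
    incoming = defaultdict(set)  # cited passage -> passages that cite it
    for citing, cited in citations:
        outgoing[citing].add(cited)
        incoming[cited].add(citing)
    return outgoing, incoming


# Hypothetical identifiers, for illustration only.
citations = [
    ("commentary:note-12", "Thuc. 1.22"),
    ("lexicon:logos", "Thuc. 1.22"),
    ("commentary:note-12", "Hdt. 1.1"),
]
outgoing, incoming = build_link_index(citations)

# A reader of Thuc. 1.22 can now see which notes and entries refer to it.
print(sorted(incoming["Thuc. 1.22"]))
```

The stored content (the citations themselves) is unchanged; only the presentation layer gains the reverse direction.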
Current Paradigm: DL Diplomacy
• Monolithic Systems (e.g., Perseus!)
– One way to view each document
• Intercommunication via metadata
– DL as metadata for “opaque” objects
• Major Problems
– Renting access, rather than collecting content
– All publications become ephemera
Three Strategies
• 1) The Editing Problem:
– How do real authors create structured docs?
• 2) Developing Radically New Docs:
– Archimedes DL on Mechanics
– MFA Excavations at Giza
• 3) Radical Repurposing of Print
– Bolles Collection on London
Bolles Collection at Tufts
• Documenting the history and topography of London and its environs
– 35 “full-size” maps
– 320 more specialized maps
– 400 books (284 linear feet of shelf space)
– 1,000 pamphlets
– “Paper Hypertexts”
• 10,000+ “extra illustrations”
Bolles Electronic Archive
• A Testbed for the Perseus Digital Library
• “Level 5” TEI Encoded Full Text
– Quotes, languages, proper names, dates, money
• High-end OCR and Double Keyboarding
– OCR ideal for some but not all
– Keyboarding much the best, money permitting
Bolles — Initial Texts
• Five Million Words now in L5 TEI
– Will exceed 10 million by year’s end
• Surveys of London History and Topography
– Stow, Maitland, Wilkinson, Allen, Thornbury
• Commentary on social conditions
– Mayhew, Archer, Hollingshead, Booth
• Literary works with London as backdrop
– Defoe, Dickens, “Sherlock Holmes”
Images
• 10,000 Grayscale Images
– Mainly engravings of people and places
– “Opportunistic” metadata (= captions & context)
• 2,400 Contemporary Images
– Well catalogued and geo-referenced
• QTVR Panoramas
• 70 Tallis Map “Elevations”
Geospatial Data
• Bartholomew 1:5000 Data set for London
– Modern data as reference and interchange
• Historical maps georeferenced to the Bartholomew data
– 10 so far (c. 2 hours each)
– Urban maps do not easily “line up”
– How to create an historical GIS?
• GPS Waypoints
– As of May 2000, good to within 10 m or better
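As a rough illustration of what georeferencing a scanned map to modern reference data involves, the sketch below fits an affine transform from three control points (pixel position on the scan paired with a position in the reference grid). This is an assumed, simplified technique chosen for the example: the slides do not say which method was used, and an affine fit cannot absorb the local distortions that make urban maps fail to “line up”. All coordinates and function names are hypothetical.

```python
def solve3(m, v):
    """Solve the 3x3 linear system m * p = v using Cramer's rule."""
    def det(a):
        return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
                - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
                + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))

    d = det(m)
    if abs(d) < 1e-12:
        raise ValueError("control points are collinear; pick three others")
    solution = []
    for col in range(3):
        replaced = [row[:] for row in m]
        for r in range(3):
            replaced[r][col] = v[r]
        solution.append(det(replaced) / d)
    return solution


def fit_affine(control_points):
    """control_points: exactly three ((px, py), (x, y)) pairs.

    Returns a function mapping scan pixels to reference coordinates.
    """
    if len(control_points) != 3:
        raise ValueError("this sketch needs exactly three control points")
    m = [[float(px), float(py), 1.0] for (px, py), _ in control_points]
    a, b, c = solve3(m, [x for _, (x, _y) in control_points])
    d, e, f = solve3(m, [y for _, (_x, y) in control_points])
    return lambda px, py: (a * px + b * py + c, d * px + e * py + f)


# Hypothetical control points: scan pixels -> reference grid coordinates.
to_reference = fit_affine([
    ((0, 0), (1000.0, 2000.0)),
    ((100, 0), (1200.0, 2000.0)),
    ((0, 100), (1000.0, 1800.0)),
])
print(to_reference(50, 50))
```

With more than three control points one would normally switch to a least-squares fit or a piecewise (rubber-sheet) transform, which is closer to what distorted historical street plans need.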
Feature Extraction
• Easy identification: Dates, Money
• Known Keywords and Classes
– The Getty TGN (1 million places and lon/lats)
– The Bartholomew Gazetteer (10,000)
– Indices to Maps (e.g. Cruchley 1826, 4200)
– The Index/Abstract of the DNB (30,000+)
• Clean-up with rule-based Proper Name classification: Mr NAME; NAME street
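The pipeline on this slide (pattern-based dates and money, lookup against known name lists, then rule-based clean-up of ambiguous names) can be sketched as follows. The regular expressions, the tiny stand-in gazetteer, and the `extract_features` function are assumptions made for illustration; they are not the project's actual rules.

```python
import re

# Pattern-based features and the two clean-up rules named on the slide.
RULES = [
    ("date", re.compile(r"\b1[0-9]{3}\b")),                    # four-digit years
    ("money", re.compile(r"£\s?\d[\d,]*(?:\.\d+)?")),          # pound amounts
    ("person", re.compile(r"\bMr\.?\s+[A-Z][a-z]+")),          # Mr NAME
    ("street", re.compile(r"\b[A-Z][a-z]+\s+[Ss]treet\b")),    # NAME street
]
CAPITALIZED = re.compile(r"\b[A-Z][a-z]+\b")


def extract_features(text, gazetteer):
    """Return (kind, matched text) pairs found in text."""
    features = []
    claimed = []  # character spans already explained by a rule
    for kind, pattern in RULES:
        for match in pattern.finditer(text):
            features.append((kind, match.group(0)))
            claimed.append(match.span())
    # Gazetteer lookup, skipping names a rule has already classified.
    for match in CAPITALIZED.finditer(text):
        inside = any(s <= match.start() and match.end() <= e for s, e in claimed)
        if match.group(0) in gazetteer and not inside:
            features.append(("place", match.group(0)))
    return features


# A stand-in for a real gazetteer; "Coleman" is deliberately ambiguous.
gazetteer = {"Aldgate", "Coleman"}
sample = "In 1666 Mr Coleman paid £20 for a house in Coleman Street, near Aldgate."
for feature in extract_features(sample, gazetteer):
    print(feature)
```

Here the two rules keep “Coleman” from being reported as a bare place name: one occurrence is classified as a person and the other as part of a street.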
“Runtime” Links
• Runtime links supplement in-file tagging
• 1) Where metadata is less precise
– Metadata from unedited headers and captions
• 2) Where the source does not contain data
– If no dates, then scan for them
• Use tagging for “high confidence” data
– Ideal situation: automated tags hand proofed
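The “if no dates, then scan for them” rule can be sketched as a fallback: prefer dates that were tagged in the file, and only scan the running text when none are present. The `<date value="...">` markup, the year pattern, and the `dates_for` function are assumptions for this example; the actual Bolles encoding may differ.

```python
import re
import xml.etree.ElementTree as ET

YEAR = re.compile(r"\b1[0-9]{3}\b")  # crude: any four-digit year 1000-1999


def dates_for(passage_xml):
    """Return (date, provenance) pairs for one passage.

    In-file tags count as high-confidence data; scanned dates are only a
    runtime supplement and are labelled as such.
    """
    root = ET.fromstring(passage_xml)
    tagged = [
        d.get("value") or (d.text or "").strip()
        for d in root.iter("date")
    ]
    if tagged:
        return [(value, "tagged") for value in tagged]
    text = "".join(root.itertext())
    return [(year, "scanned") for year in YEAR.findall(text)]


print(dates_for('<p>Rebuilt in <date value="1667">the following year</date>.</p>'))
print(dates_for("<p>The fire of 1666 destroyed much of the ward.</p>"))
```

Keeping the provenance label alongside each date lets a display layer treat hand-proofed tags and runtime guesses differently.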
Strategic Questions
• “Editions” a foundation for scholarship
• Where does the editor’s job start?
• How does the editor’s job change?
• How do we define “Corpus Editors”?
– People with domain expertise in content
– Expertise in software and Library systems
• Need for scholarly automated processing
Delivering Integrated Data
• “Good” and “rough” maps for Cicero’s Letters
• Coleman delivers quite useful results
• Map locates Coleman Street.
• Streets in description of "Portsoken Ward”.
• Historical Views of this section of London
• Timeline 1: A Linear History
• Timeline 2: “Encyclopedic Scatter”
Further Work
• Disambiguation, auto-cataloguing, Time/Space
• VR Interface: Tallis 1, 2 and Headset
• New challenging document types
• Geospatial Data in Patterson's Journeys
• Urban data in Booth and City Directories
– Tallis Map for Oxford Street with overall and more focused directories
Research Projects
• Robert Jacob and VR Interfaces
– Figure: Tallis VR Conversion 1.
– Figure: Tallis VR Conversion 2.
– Figure: Head mounted VR navigation.
• Holly Taylor and Cognitive Analysis
– Spatial Cognition
– Text Comprehension
Conclusions
• Baseline Knowledge Environment
– Practical and useful
• “Corpus Editions”
• Midway between editions and library digitization
• Requires a new configuration of skills
• The “Diplomatic” Federated DL model is weak
– Need access to full data for visualizations
Perseus Document Manager
• Works with XML
– Multiple granularities: sentence, section, chapter
– Deals with overlapping doc hierarchies
– Combines internal and external metadata
– Our metadata is in RDF and can be expressed as XML
• Since all data and metadata -> XML
– Well suited to Federated DL Applications
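To make “multiple granularities” concrete, the sketch below pulls the same XML text out at chapter, section, or sentence level. The element names (`div` with a `type` attribute, `s` for sentences), the sample text, and the `chunks` function are assumptions for illustration and say nothing about how the Perseus Document Manager is implemented; overlapping hierarchies are not handled here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document with nested chapter/section/sentence units.
DOC = """<text>
  <div type="chapter" n="1">
    <div type="section" n="1"><s n="1">Aldgate stands east.</s> <s n="2">It was rebuilt.</s></div>
    <div type="section" n="2"><s n="3">Portsoken lies beyond.</s></div>
  </div>
</text>"""


def chunks(root, granularity):
    """Return (label, text) pairs at the requested granularity."""
    if granularity == "sentence":
        nodes = root.iter("s")
    else:
        nodes = (d for d in root.iter("div") if d.get("type") == granularity)
    # Join all nested text, then collapse runs of whitespace.
    return [(n.get("n"), " ".join("".join(n.itertext()).split())) for n in nodes]


root = ET.fromstring(DOC)
for level in ("chapter", "section", "sentence"):
    print(level, chunks(root, level))
```

One stored document can then serve a reader who wants a single sentence in context as well as one who wants the whole chapter.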
Scalable DL
• SGML/XML need translation for display
– Can’t maintain stylesheets for millions of docs
• Intelligent display of various DTDs
– “Cheaply” acquires XML/SGML docs
– Individual custom style sheets allowed
• Integration of Geo-spatial Data
• Multilingual support, feature extraction
• Integrated multi-resolution image support
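One way to read “can’t maintain stylesheets for millions of docs” is that display rules should be keyed to element names, with a generic fallback for anything unrecognized, rather than written per document. The sketch below shows that idea under stated assumptions: the rule table, the HTML output, and the `render` function are hypothetical and are not the Perseus display code.

```python
import xml.etree.ElementTree as ET
from html import escape

# Assumed mapping from source element names to HTML tags. A custom table
# could be swapped in for an individual collection.
DEFAULT_RULES = {"head": "h2", "p": "p", "title": "em", "q": "q"}


def render(node, rules=DEFAULT_RULES):
    """Render an element tree to HTML; unknown elements become <span>."""
    tag = rules.get(node.tag, "span")
    inner = escape(node.text or "")
    for child in node:
        inner += render(child, rules) + escape(child.tail or "")
    # Keep the original element name as a class so CSS can still target it.
    return f'<{tag} class="{escape(node.tag)}">{inner}</{tag}>'


print(render(ET.fromstring("<p>See <title>Survey of London</title> &amp; maps.</p>")))
print(render(ET.fromstring("<placeName>Aldgate</placeName>")))
```

Because unknown elements still render, a newly acquired document displays acceptably before anyone has written rules for its DTD.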
Perseus Document Manager
• Short term development:
– Collecting new datasets to the Perseus DL (leveraging Internet 2 investment)
– Adding value, e.g.: Sources for the History of Mechanics (Max Planck); Duke Databank of Documentary Papyri; books, maps etc. on the City of London; Shakespeare and Early Modern English
Perseus Document Manager
• Longer Term: Distribution of the System
• How best to maintain and expand the system?
– Open source?
– Commercial Licensing?
– Wait for third party to match PDM features?
Automatic Integration
• Content Analysis: Various Languages
• Time: extracting and visualizing dates
• Space: Integrating historical Geographic Data
• Names: establishing authority lists
– Getty Thesaurus of Geographic Names (names and coordinates)
– Encyclopedias, e.g., Harpers, DNB (names and dates)
Our Research Agenda
• Developing self-sustaining models
– Publication of documents
– Maintenance of software
• Exploring Problem Sets in different domains
– E.g., sparse data (antiquity) vs. rich (London)
• Helping humanists rethink their position
– Reaching new audiences
– Changing habits
Technology matters: e.g. 19th c. Printing in England
• 20th Century Radio/Film/TV: ambiguous
• 19th Century Print Technology
– 1810: c. 10,000 copies for a successful book; audience for literature mainly upper class
– 1850: hundreds of thousands; audience vastly expands, huge numbers read Dickens, etc.
• 21st Century Network Technology?
The Future?
• Two models:
– Reproduce current world in new form: narrow/expensive distribution
– Think about how that world may change: broader/inexpensive distribution
• What happens now sets the stage for …
– “talk show” cyber culture? or
– a new dispersal of intellectual life?