Crowdsourced Manuscript Transcription, Ben Brumfield, Roots and Routes 2012
What it isn't
We'll concentrate on web-based tools for extracting text from images, not addressing:
• Oral History
• Video
• Audio Transcription
• Image Manipulation
• Transcription/Facsimile Display
Tools for these tasks do exist, however.
Indexing
• Structured Data
• Extracts from Text vs. Representing Text
• Databases for Search and Analysis
• Granular Quality Control
• Gamification
Editing
• Books, Diaries, Letters, Articles
• Representing Text
• Traditional Editorial Workflow
• Digital or Print Editions
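The difference between indexing and editing can be sketched as two data shapes. This is only an illustration, not how any of the tools discussed here stores data: the class names, field names, and sample values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class IndexedRecord:
    """Indexing: structured fields extracted from one entry on an image."""
    image_id: str
    fields: Dict[str, str]  # named values pulled out of the text


@dataclass
class TranscribedPage:
    """Editing: the running text of a page, kept in reading order."""
    image_id: str
    lines: List[str] = field(default_factory=list)

    def text(self) -> str:
        # Join the transcribed lines back into one block of text.
        return "\n".join(self.lines)


# Hypothetical sample data, invented for illustration only.
record = IndexedRecord("census_p12.jpg", {"surname": "Harrison", "year": "1851"})
page = TranscribedPage("diary_p03.jpg", ["Rained all day.", "Mended the fence."])

# An index answers field-level lookups; a transcript supports reading
# and full-text search.
print(record.fields["surname"])
print("fence" in page.text())
```

An indexed record discards everything except the chosen fields, which is why it suits databases and granular quality control. A transcribed page keeps the whole text, which is what an edition needs.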
Origins (Traditions)
• OCR Correction
• Documentary Editing
• Genealogy
• Natural Science
• Astronomy
Split this into 5 slides
Online Tools
• Recent (none older than 2005)
• Influenced by origin
• Still pretty raw
• Most require tech expertise for set-up and customization
• All require making trade-offs
Selection Factors
• Source Material
• Transcript Purpose
• Organizational/Project Management Fit
• Financial and Technical Resources
Source Material
Evaluating your source material:
• Is it of interest to anyone else?
• Is it under copyright?
• Does it need restricted access?
• Is it composed of documents or records?
• Is it non-textual?
• How complex is the layout? How important is that layout?
Purpose
How will you be using the transcribed data?
• Traditional print editions
• Searchable online editions
• Do you want to use the system to analyze the text?
• How do you want to analyze the text?
• Is public engagement a goal?
• Should the transcripts be open?
Organizational/Project Management Fit
• How important is traditional editorial workflow?
• Will you rely on volunteers? How will you motivate them?
• What is the duration of the project?
• Is there a "final version"?
• Is TEI a mandate?
Financial and Technical Resources
Do you have or need:
• System administrators to install non-hosted software?
• Money to pay hosting costs?
• Programming skills to customize a tool?
• Money to pay programmers for customization?
• Support for ongoing costs, however small, to keep the site running?
Technical Questions to Answer
• Where are the images now?
• How do images get into the system?
• How do transcripts get out of the system?
• How mature is the underlying technology?
• How configurable is the technology?
• How does the system work with the public face of your project?
• Where does the metadata live?
• Who will maintain this? How long?
• How many sites are using this system?
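As one way to think about "How do transcripts get out of the system?", here is a minimal sketch of exporting page transcripts to a TEI-style XML file with Python's standard library. It assumes transcripts are available as (image name, text) pairs; the function name `pages_to_tei`, the header wording, and the sample data are hypothetical, and the output is a bare skeleton that has not been validated against the TEI schema.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def pages_to_tei(title, pages):
    """Build a minimal TEI-style document from (image_name, text) pairs."""
    # Serialize TEI elements without a prefix (default namespace).
    ET.register_namespace("", TEI_NS)

    def q(tag):
        # Qualify a tag name with the TEI namespace.
        return "{%s}%s" % (TEI_NS, tag)

    root = ET.Element(q("TEI"))
    header = ET.SubElement(root, q("teiHeader"))
    file_desc = ET.SubElement(header, q("fileDesc"))
    title_stmt = ET.SubElement(file_desc, q("titleStmt"))
    ET.SubElement(title_stmt, q("title")).text = title
    pub = ET.SubElement(file_desc, q("publicationStmt"))
    ET.SubElement(pub, q("p")).text = "Unpublished export."
    src = ET.SubElement(file_desc, q("sourceDesc"))
    ET.SubElement(src, q("p")).text = "Transcribed from page images."

    body = ET.SubElement(ET.SubElement(root, q("text")), q("body"))
    for image_name, text in pages:
        # A page break pointing at the image, then that page's text.
        ET.SubElement(body, q("pb"), {"facs": image_name})
        ET.SubElement(body, q("p")).text = text
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample data, invented for illustration only.
tei_xml = pages_to_tei("Diary", [("p1.jpg", "Rained all day.")])
print(tei_xml)
```

If a tool cannot produce something like this, or at least plain text tied to image names, getting transcripts into an edition later will take custom programming.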
Wikisource
Pro:
• MediaWiki plus its add-on modules (e.g. print-on-demand, export).
• Wikimedia community.
• Incredibly mature.
Con:
• Wikimedia policy.
• Public editing.
• Limited mark-up.
Bentham Transcription Desk
Pro:
• MediaWiki is very mature.
• TEI Toolbar (can also be used on other systems)
• Deployed outside original project.
Con:
• Development efforts halted.
Scripto
Pro:
• Team at CHNM has a great track record.
• Your CMS is your public face.
• MediaWiki is very mature.
• Deployed and under active development.
Con:
• Your CMS handles all metadata.
• Mark-up is extremely limited.
FromThePage
Pro:
• Designed for intensive editing and indexing.
• Semantic mark-up and analysis.
• Hosting available.
Con:
• Single developer (me).
• No TEI mark-up.
Islandora TEI Editor
Caveat: I don't know much about this tool or this team.
• Based on Drupal and Fedora
• Supports TEI via friendly interface
• Many Drupal-based projects are considering it.
T-PEN
Caveat: I don't know much about this tool.
• Designed for medieval manuscripts.
• Supports TEI natively.
• Line-by-line interface.
• Hosted version available.
Scribe
Pro:
• Excellent for complex layout or non-documentary transcription.
• Zooniverse team is large, well-funded, experienced.
• Configurable.
Con:
• No automated tool for loading images or viewing transcript database (yet!)
• No concept of image-as-a-text.
Pybossa
Caveat: I don't know much about this tool or this team.
• Open Knowledge Foundation's crowdsourcing task management tool.
• Designed for tabular data.
• Google Spreadsheet data entry.
• Extremely young.
TextLab
Caveat: I don't know much about this tool or this team.
• Melville Electronic Library.
• Direct addition of TEI tags to image.