IMPACT at OCR Summit
OCR Summit Meeting, Initiative for Digital Humanities, Media and Culture, Texas A&M University, 17-18 October 2011, College Station, TX, United States.
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
An Experimental Workflow Development Platform for Historical Document Digitisation
Clemens Neudecker, KB National Library of the Netherlands
Background
IMPACT – Improving Access to Text (2008 – 2011)
From a technical perspective: > 20 software components for solving specific issues
Prototyping new algorithms, improving commercial solutions
Different frameworks (C, C++, Java, etc.), platforms (Win/Linux) + 3rd party applications
“One ring to rule them all…”
IMPACT Interoperability Framework (IIF)
Main requirements
Behavioural:
- Minimize integration effort
- Minimize deployment effort
- Maximize usability
- Maximize scalability
Functional:
- Modular
- Transparent
- Expandable
- Open source
- Platform independent
Framework integration
Simple-to-use generic wrapper that exposes command line tools as web services
Architecture
IMPACT Interoperability Framework: Technologies
- Java
- Apache Maven
- Apache Tomcat
- Apache Axis2+Synapse
- Taverna Workflow Engine
IMPACT Interoperability Framework: Dataset
- more than 600,000 images from digital libraries
- more than 50,000 ground truth transcriptions
Generic Web Service Wrapper
Only requirement: Command Line Application
HTML form
Source code available on GitHub: https://github.com/impactcentre/toolwrapper
Easy integration: developers can focus on their application and have to worry less about integration = higher quality software components
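The wrapping idea can be illustrated with a minimal Python sketch. It is not the toolwrapper's actual mechanism: `ToolSpec`, `invoke` and the `UPPERCASE` stand-in tool are hypothetical names chosen for illustration, and the only assumption taken from the slide is that the wrapped tool is a command line application.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List


@dataclass
class ToolSpec:
    """Hypothetical description of a wrapped command line tool."""
    name: str
    command: List[str]  # argument list; tokens may contain {placeholders}


def invoke(spec: ToolSpec, **params: str) -> dict:
    """Fill in the placeholders, run the tool, return a service-style response."""
    # Each token is formatted on its own and no shell is involved,
    # so a request value can never turn into extra arguments.
    argv = [token.format(**params) for token in spec.command]
    completed = subprocess.run(argv, capture_output=True, text=True, timeout=60)
    return {
        "tool": spec.name,
        "success": completed.returncode == 0,
        "output": completed.stdout,
        "log": completed.stderr,
    }


# Stand-in for a real OCR component: a tiny command line program
# that prints its first argument in upper case.
UPPERCASE = ToolSpec(
    name="uppercase",
    command=[sys.executable, "-c",
             "import sys; print(sys.argv[1].upper())", "{text}"],
)

if __name__ == "__main__":
    print(invoke(UPPERCASE, text="impact"))
```

A web layer (HTML form, SOAP or REST endpoint) would only need to map request fields to the keyword arguments of `invoke`; the tool developer supplies nothing beyond the command description.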
Workflows
OCR workflow = data pipeline
Building blocks = processing modules
Integration = interaction between nodes (mashups)
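The pipeline view can be sketched in a few lines of Python. This is only an illustration of "building blocks = processing modules": `build_workflow` and the three text-cleaning modules are hypothetical stand-ins, not IMPACT components, and in the framework itself the composition is done in Taverna rather than in code.

```python
from functools import reduce
from typing import Callable

# A processing module takes text in and hands text on to the next module.
Module = Callable[[str], str]


def build_workflow(*modules: Module) -> Module:
    """Chain modules so that the output of one is the input of the next."""
    def run(data: str) -> str:
        return reduce(lambda value, module: module(value), modules, data)
    return run


def dehyphenate(text: str) -> str:
    """Rejoin words that were split across a line break."""
    return text.replace("-\n", "")


def normalise_whitespace(text: str) -> str:
    """Collapse runs of spaces and line breaks into single spaces."""
    return " ".join(text.split())


def replace_long_s(text: str) -> str:
    """Map the historical long s to a modern s."""
    return text.replace("\u017f", "s")


workflow = build_workflow(dehyphenate, normalise_whitespace, replace_long_s)

if __name__ == "__main__":
    print(workflow("Hi\u017ftorical docu-\nments   digiti\u017fed"))
```

Because every module has the same signature, modules can be swapped or reordered without touching the others, which is the property the workflow approach relies on.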
Collaboration with
Workflow management
Web 2.0 style registry: myExperiment
Local client: Taverna Workbench
Web client: Project website
Local client: Taverna Workbench
Background: BioSciences
Developed and maintained by myGrid, UK
Available for Windows/Linux/OSX and as open source (Java)
Web client: Taverna Server/Workflow Parser
SOAP/REST API
Remote execution of workflows
Community
Web 2.0 style workflow registry
Community of experts
Sharing of resources
Knowledge exchange
A central meeting point for users and researchers
Compute cluster
Enterprise Service Bus receives requests from users and distributes the load to the available worker nodes
Main effect: process parallelization, load distribution, fail-over
Test deployment on Dutch Supercomputing Cloud HPC
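Load distribution and fail-over can be illustrated with a small Python sketch. It is an assumption-laden toy, not how the Enterprise Service Bus is configured: `Dispatcher`, `make_worker` and `broken_worker` are hypothetical names, workers are plain functions instead of remote nodes, and requests are handled one after another, so parallelization is not shown.

```python
from typing import Callable, List, Sequence

# A worker node is modelled as a function from request to result.
Worker = Callable[[str], str]


class Dispatcher:
    """Round-robin dispatch that skips workers which raise an error."""

    def __init__(self, workers: Sequence[Worker]) -> None:
        if not workers:
            raise ValueError("at least one worker is required")
        self._workers: List[Worker] = list(workers)
        self._next = 0  # index of the worker to try first

    def submit(self, request: str) -> str:
        count = len(self._workers)
        for offset in range(count):
            index = (self._next + offset) % count
            try:
                result = self._workers[index](request)
            except Exception:
                continue  # fail-over: try the next worker in the ring
            # Load distribution: the following request starts one worker later.
            self._next = (index + 1) % count
            return result
        raise RuntimeError(f"all {count} workers failed for {request!r}")


def make_worker(name: str) -> Worker:
    """Create a healthy stand-in worker that tags the request with its name."""
    def worker(request: str) -> str:
        return f"{name}:{request}"
    return worker


def broken_worker(request: str) -> str:
    """Stand-in for a node that is down."""
    raise ConnectionError("node unreachable")


if __name__ == "__main__":
    bus = Dispatcher([make_worker("a"), broken_worker, make_worker("b")])
    for page in ("page1", "page2", "page3"):
        print(bus.submit(page))
```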
Dataset
Representative and annotated dataset of significant size, with metadata, ground truth and search facilities
Evaluation features
Text based comparison of result with ground truth, using Levenshtein distance method
Layout based comparison of result with ground truth, using the Page Analysis and Ground Truth Elements (PAGE) Framework
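The text based comparison can be sketched as follows. The Levenshtein distance is the standard dynamic programming formulation; normalising it by the ground truth length into a character error rate is an assumption made here for illustration, since the slides do not say how the distance is reported, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text: str, ground_truth: str) -> float:
    """Edit distance relative to the length of the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ocr_text, ground_truth) / len(ground_truth)


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(character_error_rate("Tbe quick", "The quick"))
```

Only two rows of the distance table are kept at a time, so memory use stays proportional to the shorter of the two texts, which matters when whole pages are compared.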
Outlook
Online service for testing/evaluation/processing
Results Repository (WebDAV, POI)
Extending the scope:
Workflows for linguistic analysis: CLARIN
Workflows for preservation: SCAPE
Even better scalability: MapReduce/Hadoop
Supported by a community of developers & practitioners
Summary
- Availability of resources (images, ground truth and tools) to the international research community
- A common baseline for transparent evaluation and comparison
- Ready-to-use components, reproducible experiments
- Sharing of results and know-how
- Enable scalability for prototypes/data intensive workflows
- Simple and uniform user interface for all embedded tools
- Consolidation of support and maintenance
Thank you! Questions?