JHOVE2 A Next-Generation Architecture for Format-Aware Preservation Processing Stephen Abrams Harvard University Evan Owens Portico Tom Cramer Stanford University Digital Library Federation Fall Forum Philadelphia, November 5-7, 2007


Page 1

JHOVE2

A Next-Generation Architecture for Format-Aware Preservation Processing

Stephen Abrams, Harvard University

Evan Owens, Portico

Tom Cramer, Stanford University

Digital Library Federation Fall Forum, Philadelphia, November 5-7, 2007

Page 2

JHOVE2 project

• Two-year, NDIIPP-funded collaborative project to develop a “next generation” architecture for format-aware preservation processing

– Harvard University: Stephen Abrams, Gary McGath, Robin Wendler

– Portico: Evan Owens, John Meyer, Sheila Morrissey

– Stanford University: Tom Cramer, Richard Anderson, Hannah Frost, Rachel Gollub, Nancy Hoebelheinrich, Keith Johnson

• Open source

– Educational Community License (ECL)

– SourceForge

Page 3

JHOVE2 project goals

• Refactor the existing architecture

– Rectify known inefficiencies and idiosyncrasies

– Simplify the process of integration

– Encourage third-party extensions

• Provide enhancements

– Separate identification from validation

– Standardized error handling

– Standardized handling of validation profiles

– Standardized reporting using METS, with XSL transform

– More sophisticated data model

– Arbitrary processing modules

Page 4

JHOVE2 project goals

• Develop modules

– Signature-based identification using DROID

– Validation and characterization

– Symbolic display of selected binary formats

– API-level editing capability

– Policy-based assessment
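To illustrate what signature-based identification means as a step separate from validation, here is a minimal sketch that matches the leading bytes of a stream against a small table of well-known signatures. The `SIGNATURES` table and the `identify` function are hypothetical names for illustration; this is not DROID's signature file or API, and a real identifier would consult a far richer signature registry.

```python
from typing import Optional

# Well-known leading ("magic") bytes for a few formats named in these slides.
# Illustrative table only; it is not DROID's signature file.
SIGNATURES = [
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"%PDF-", "PDF"),
    (b"II*\x00", "TIFF"),        # little-endian TIFF header
    (b"MM\x00*", "TIFF"),        # big-endian TIFF header
    (b"\xff\xd8\xff", "JPEG"),
]


def identify(data: bytes) -> Optional[str]:
    """Return a format name if the leading bytes match a known signature.

    This only identifies a candidate format; it says nothing about whether
    the stream is well-formed or valid, which would be a separate step.
    """
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return None
```

A match here only nominates a format. Deciding whether the stream actually conforms to that format's specification would be left to a validation module.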

Page 5

Data model

• Implicit assumption in JHOVE

– 1 object = 1 file = 1 format

• But what about…

– TIFF with embedded ICC profile and XMP metadata: 1 object = 1 file = 3 formats

– JPEG 2000 JPX fragmentation: 1 object = n files = 1 format

– ESRI Shapefile: 1 object = 3 files = 3 formats

• JHOVE2 will support processing of complex aggregate objects and nested formatted bit streams

– 1 object = n files = m formats
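One way to picture the "1 object = n files = m formats" model is as a small tree: an object holds files, and each file holds a formatted bit stream that may nest further bit streams. The sketch below is an assumption made for illustration, not the JHOVE2 data model; the class names (`DigitalObject`, `SourceFile`, `Bitstream`), the file paths, and the format labels are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Bitstream:
    """A formatted bit stream that may contain nested formatted bit streams."""
    format_name: str
    nested: List["Bitstream"] = field(default_factory=list)

    def formats(self) -> List[str]:
        # Depth-first list of this stream's format and all nested formats.
        found = [self.format_name]
        for child in self.nested:
            found.extend(child.formats())
        return found


@dataclass
class SourceFile:
    path: str
    content: Bitstream


@dataclass
class DigitalObject:
    """An aggregate object made of one or more files."""
    files: List[SourceFile] = field(default_factory=list)

    def formats(self) -> List[str]:
        found: List[str] = []
        for source in self.files:
            found.extend(source.content.formats())
        return found


# 1 object = 1 file = 3 formats (TIFF with embedded ICC profile and XMP)
tiff_object = DigitalObject([
    SourceFile("image.tif", Bitstream("TIFF", [Bitstream("ICC"), Bitstream("XMP")])),
])

# 1 object = 3 files = 3 formats (Shapefile; format labels are illustrative)
shapefile_object = DigitalObject([
    SourceFile("roads.shp", Bitstream("Shapefile main file")),
    SourceFile("roads.shx", Bitstream("Shapefile index")),
    SourceFile("roads.dbf", Bitstream("dBASE table")),
])
```

The same two classes cover both cases from the slide: nesting handles several formats inside one file, and the file list handles one object spread across several files.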

Page 6

Common “backplane”

• Outer loop is an iteration over digital objects

• Inner loop of processes applied against each object, passing a common memory structure

while (has-another-object) {
    while (has-another-process) {
        process(object, state);
    }
}

[Diagram labels: iterator, object, modules, common data passed between modules, METS writer, XSL, XSLT display]
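The nested loop on this slide can be sketched in runnable form as follows. This is a minimal illustration under stated assumptions, not the JHOVE2 engine: the state is assumed to be a fresh dictionary per object, and `run_backplane`, `identify_by_extension`, and `mark_known` are hypothetical names standing in for real modules.

```python
from typing import Any, Callable, Dict, Iterable, List

State = Dict[str, Any]
Process = Callable[[str, State], None]


def identify_by_extension(obj: str, state: State) -> None:
    # Stand-in for an identification module (hypothetical logic).
    state["format"] = "TIFF" if obj.endswith(".tif") else "unknown"


def mark_known(obj: str, state: State) -> None:
    # Stand-in for a later module that reads what an earlier one wrote.
    state["known"] = state["format"] != "unknown"


def run_backplane(objects: Iterable[str], processes: List[Process]) -> Dict[str, State]:
    results: Dict[str, State] = {}
    for obj in objects:                 # outer loop: iterate over digital objects
        state: State = {}               # common memory structure (assumed per object)
        for process in processes:       # inner loop: apply each process in order
            process(obj, state)
        results[obj] = state
    return results


results = run_backplane(["a.tif", "b.bin"], [identify_by_extension, mark_known])
```

Because every process receives the same state, a later module can build on what an earlier one recorded, which is what lets identification and validation live in separate modules.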

Page 7

Validation

• There is a useful distinction between well-formedness, validity, renderability, and usability

– Well-formedness and validity are “bright line” determinations relative to a specification

– Renderability is a “bright line” determination relative to a specific rendering tool

– Usability is a “fuzzy” determination relative to local policies and heuristics

Page 8

Policy-based assessment

• Evaluate objects based on prior characterization and locally-defined policy rules and heuristics, for example:

– Risk of technological obsolescence

– Risk of transformative loss

• Codify assessment methodologies and best practice recommendations

• Develop a formal language in which to express policy rules

• Implement a rules engine
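As a rough idea of what rules-based assessment over prior characterization could look like, the sketch below evaluates a list of rules against a dictionary of object properties. Everything here is hypothetical: the `Rule` class, the `assess` function, the property names (`format`, `compression`), and the two example rules are assumptions for illustration, not a proposed policy language or the project's design.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Characterization = Dict[str, Any]


@dataclass
class Rule:
    name: str
    applies: Callable[[Characterization], bool]  # predicate over characterization
    finding: str                                 # message reported when it applies


def assess(properties: Characterization, rules: List[Rule]) -> List[str]:
    """Return one finding per rule whose predicate holds for this object."""
    return [f"{rule.name}: {rule.finding}" for rule in rules if rule.applies(properties)]


# Hypothetical local policy: the watch list and property names are invented.
RULES = [
    Rule(
        "obsolescence",
        lambda p: p.get("format") in {"ExampleLegacyFormat"},
        "format is on the local watch list",
    ),
    Rule(
        "transformative-loss",
        lambda p: p.get("compression") == "lossy",
        "migration may compound lossy compression",
    ),
]

findings = assess({"format": "TIFF", "compression": "lossy"}, RULES)
```

Keeping the rules as data, separate from the engine that evaluates them, is one way local policy could change without touching the characterization modules.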

Page 9

Format support

• Audio: AIFF, WAVE

• Color: ICC

• Document: PDF

• GIS: Shapefile

• Image: GIF, JPEG, JPEG 2000, TIFF

• Text: ASCII, HTML, SGML, UTF-8, XML

Page 10

Schedule

• 6 months of community outreach, requirements gathering, and design

• 6 months implementation of core APIs and the engine

• 1 year implementation of modules

• Continual prototyping and re-factoring

Page 11

Questions (for you)?

• Do you care about the open source license (ECL)?

• Do you care about the distribution platform (SourceForge)?

• Do you have functional requirements or use cases?

– How do you use JHOVE today?

– What needs doesn’t it meet?

• What types of policy assessments do you perform?

– How do you quantify risk?

– What is your underlying assessment model?

• Are you aware of existing expression languages and engines for rules-based assessment?

Page 12

Questions (for you)?

• What can we do to facilitate integration into existing (or planned) systems and workflows?

• What can we do to facilitate third-party development and extension?

– What help would you need to implement your own modules?

– Would you be interested in a co-development arrangement with the JHOVE2 project?

• Do you have interesting test files that you are willing to share?

Page 13

Questions (from you)?