Digital Preservation Dale Flecker Stephen Abrams February 15, 2007 HUL University Library Council

Page 1: Digital Preservation

Digital Preservation

Dale Flecker, Stephen Abrams

February 15, 2007

HUL University Library Council

Page 2: Digital Preservation

Agenda

I The problem

II What has Harvard been doing?

III What more do we need to do?

Page 3: Digital Preservation

I The problem …

Page 4: Digital Preservation

… is twofold

• Keeping the bits

• Keeping the bits useful

Page 5: Digital Preservation

Keeping the bits

• Digital things are amazingly easy to destroy!

– Bad guys want to do damage

– Hardware/software fails

– People make mistakes

• The slip of a finger, or an unnoticed consequence of a change, happens easily and is potentially catastrophic

Page 6: Digital Preservation

Destruction is not always apparent

Data not used regularly is always at risk of unintended and unnoticed damage.

(Note that archival copies can be pretty invisible…)

Page 7: Digital Preservation

Keeping bits useful

Digital materials are fragile!!!

They depend on technologies for their vitality… and those technologies age and disappear rapidly.

Page 8: Digital Preservation

Fragility

• Using digital content requires mediation by hardware and software

• Hardware and software must understand the format of the content

• Hardware and software technologies change continually

Page 9: Digital Preservation

Fragility

• Old technology will break

• New technology frequently does not understand old formats

Page 10: Digital Preservation

II What has Harvard been doing?

Internally …

Page 11: Digital Preservation

Digital Repository Service (DRS)

• Secure, professionally managed environment

– Manage data rigorously, with discipline, and in accordance with community best practices

• Redundant, heterogeneous, distributed storage with periodic media migration…

Page 12: Digital Preservation

Digital Repository Service (DRS)

• Know what data you have

– What are the logical objects (“works”, not files)?

– What are the technical characteristics of those objects?

• Check the data continuously

• Manage access to stored objects
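One common way to "check the data continuously" is a fixity audit: record a checksum for each object at ingest, then periodically recompute and compare. The sketch below is illustrative only and does not describe how the DRS is implemented; the in-memory `store`, the `manifest` layout, and the object identifiers are assumptions made for the example.

```python
import hashlib


def digest(data):
    """Return the SHA-256 hex digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def audit(store, manifest):
    """Return the ids of objects that are missing or whose bytes no longer
    match the digest recorded at ingest."""
    damaged = []
    for object_id, expected in manifest.items():
        data = store.get(object_id)
        if data is None or digest(data) != expected:
            damaged.append(object_id)
    return sorted(damaged)


# Hypothetical repository contents, keyed by made-up object ids.
store = {"obj-1": b"archival master", "obj-2": b"use copy"}

# At ingest: record a digest for every object.
manifest = {object_id: digest(data) for object_id, data in store.items()}

# Later: simulate silent damage to one object, then run the audit.
store["obj-2"] = b"use copY"
damaged = audit(store, manifest)
print(damaged)
```

An audit like this only detects damage; repairing it still depends on having an intact redundant copy elsewhere.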

Page 13: Digital Preservation

Format

• Understanding formats is fundamental to preservation

ffd8ffe000104a46494600010201008300830000ffed0fb050686f746f73686f7020332e30003842494d03e90a5072696e7420496e666f000000007800000000004800480000000002f40240ffeeffee030602520347052803fc00020000004800480000000002d80228000100000064000000010003030300000001270f0001000100000000000000000000000060080019019000000000000000000000000000000000000000000000000000000000000000003842494d03ed0a5265736f6c7574696f6e0000000010008313a3000200 ...

Page 14: Digital Preservation

Format

• Understanding formats is fundamental to preservation

ffd8ffe000104a46494600010201008300830000ffed0fb050686f746f73686f7020332e30003842494d03e90a5072696e7420496e666f000000007800000000004800480000000002f40240ffeeffee030602520347052803fc00020000004800480000000002d80228000100000064000000010003030300000001270f0001000100000000000000000000000060080019019000000000000000000000000000000000000000000000000000000000000000003842494d03ed0a5265736f6c7574696f6e0000000010008313a3000200 ...

SOI, APP0 (JFIF 1.2), APP13 (IPTC), APP2 (ICC), DQT, SOF0 (183x512), DRI, DHT, SOS, ECS0, RST0, ECS1, RST1, ECS2, ...
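The step from raw hex to named segments can be sketched in a few lines. The example below walks the first 24 bytes of the hex dump shown on this slide and reads each JPEG marker and its declared length. It is a simplified illustration, not a validator: it assumes markers are not preceded by padding bytes, assumes the APP0 segment immediately follows SOI, and stops at the start of scan data. The function names are made up for the example.

```python
import struct

# First 24 bytes of the hex dump on this slide.
PREFIX = bytes.fromhex("ffd8ffe000104a46494600010201008300830000ffed0fb0")

NAMES = {0xD8: "SOI", 0xD9: "EOI", 0xDA: "SOS", 0xDB: "DQT",
         0xC0: "SOF0", 0xC4: "DHT", 0xDD: "DRI"}


def marker_name(code):
    """Map a marker code (the byte after 0xFF) to its conventional name."""
    if 0xE0 <= code <= 0xEF:
        return f"APP{code - 0xE0}"
    if 0xD0 <= code <= 0xD7:
        return f"RST{code - 0xD0}"
    return NAMES.get(code, f"0x{code:02X}")


def list_segments(data):
    """Return (name, declared_length) pairs for the marker segments whose
    headers are present in data. Stand-alone markers get length 0."""
    segments = []
    pos = 0
    while pos + 2 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        code = data[pos + 1]
        pos += 2
        if code in (0xD8, 0xD9) or 0xD0 <= code <= 0xD7:
            segments.append((marker_name(code), 0))  # no length field
            continue
        if pos + 2 > len(data):
            break  # length field is cut off
        (length,) = struct.unpack(">H", data[pos:pos + 2])
        segments.append((marker_name(code), length))
        if code == 0xDA:
            break  # entropy-coded scan data follows; stop here
        pos += length  # the length counts its own two bytes
    return segments


def jfif_version(data):
    """Return 'major.minor' if a JFIF APP0 segment directly follows SOI."""
    if data[6:11] == b"JFIF\x00":
        return f"{data[11]}.{data[12]}"
    return None


segments = list_segments(PREFIX)
version = jfif_version(PREFIX)
print(segments, version)
```

On this prefix the walk finds SOI, an APP0 segment of 16 bytes, and the header of an APP13 segment declaring 4016 bytes, and the version bytes read as 1.2, which matches the start of the marker sequence on the slide.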

Page 16: Digital Preservation

Format

• Formats vary significantly in their “preservability”

• Keeping multiple versions of a given piece of content for different purposes is frequently wise

– E.g. archival master, production master, use copy

Page 17: Digital Preservation

Format

• Some criteria for “preservability” (from LC)

– Disclosure (how well documented?)

– Adoption (how widely used?)

– Transparency (is compression used?)

– Self-documenting (good!)

– External dependencies (self-sufficiency is good)

– Patents (could limit preservation actions)

– DRM/encryption (what if decryption key is not available?)

Page 18: Digital Preservation

Metadata

• The basis of decision-making for preservation

– Technical metadata

• What format is this in?

• What format options are used?

– Structural metadata

• If I change this, what else is affected?

Page 19: Digital Preservation

Metadata

– Administrative metadata

• Who has the right to make decisions about this?

– Relationship metadata

• Are there other versions of this object?

– How do these affect my preservation strategy?

– Provenance metadata

• Where did this come from?

• What changes has it already undergone?
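The five kinds of metadata listed on these two slides can be pictured as one record per object, with each kind answering its own question. The sketch below is only an illustration of that grouping; the class name, field names, and sample values are all hypothetical and do not correspond to any DRS, METS, or PREMIS schema.

```python
from dataclasses import dataclass, field


@dataclass
class PreservationRecord:
    """Hypothetical grouping of the metadata kinds named on the slides."""
    object_id: str
    technical: dict = field(default_factory=dict)       # format, format options
    structural: list = field(default_factory=list)      # ids affected by a change
    administrative: dict = field(default_factory=dict)  # who may decide
    relationships: dict = field(default_factory=dict)   # other versions
    provenance: list = field(default_factory=list)      # origin and change history

    def record_change(self, description):
        """Append a change to the provenance trail."""
        self.provenance.append(description)


# Made-up example values.
record = PreservationRecord(
    object_id="img-001",
    technical={"format": "JPEG", "options": {"baseline": True}},
    structural=["book-042/page-7"],
    administrative={"decision_owner": "collection manager"},
    relationships={"archival_master": "img-001-master"},
    provenance=["scanned from print"],
)
record.record_change("migrated to new storage media")
print(record.provenance)
```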

Page 20: Digital Preservation

Guidelines for “preservable” objects

The least expensive and most effective preservation measure is to think about the future when an object is created!

(Guidelines on format, metadata, archival masters, etc.)

Page 21: Digital Preservation

JHOVE (JSTOR/Harvard Object Validation Environment)

A widely used tool for format identification, validation, and characterization.

Page 22: Digital Preservation

JHOVE (JSTOR/Harvard Object Validation Environment)

When an object is ingested:

• Determine its format (“identify”)

• Ensure that it is properly formed (“validate”)

• Extract meaningful technical metadata (“characterize”)
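The three ingest steps can be illustrated with a toy example for a single format. The sketch below is not JHOVE's API and is far shallower than a real validator: it handles only GIF, treats "properly formed" as a recognized signature, a minimum length, and a trailer byte, and extracts just the version and logical screen size. The function names and the sample bytes are assumptions made for the example.

```python
import struct

GIF_SIGNATURES = (b"GIF87a", b"GIF89a")


def identify(data):
    """Identify: return a format name based on the leading signature bytes."""
    return "GIF" if data[:6] in GIF_SIGNATURES else None


def validate(data):
    """Validate (shallowly): signature, room for the 7-byte logical screen
    descriptor, and the 0x3B trailer byte at the end."""
    return identify(data) == "GIF" and len(data) >= 14 and data[-1] == 0x3B


def characterize(data):
    """Characterize: extract basic technical metadata from the header."""
    width, height = struct.unpack("<HH", data[6:10])  # little-endian
    return {
        "format": "GIF",
        "version": data[3:6].decode("ascii"),
        "width": width,
        "height": height,
    }


def ingest(data):
    """Run the three steps in order, refusing objects that fail a step."""
    if identify(data) is None:
        raise ValueError("unknown format")
    if not validate(data):
        raise ValueError("not well-formed")
    return characterize(data)


# Minimal made-up sample: header, screen descriptor, trailer, no image data.
sample = b"GIF89a" + struct.pack("<HH", 183, 512) + b"\x00\x00\x00" + b"\x3b"
metadata = ingest(sample)
print(metadata)
```

Keeping the steps separate means a failure is reported at the stage where it occurs: an unknown format is distinguished from a known format that is malformed.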

Page 23: Digital Preservation

DRS: what’s managed today

As of January 2007, 5.6M files and 22 TB, excluding Google and web archiving

Page 24: Digital Preservation

II What has Harvard been doing?

Externally…

Page 25: Digital Preservation

E-journal archiving

• “How can we ensure that licensed e-journal content will remain usable over time?”

• Mellon-funded study

• Explored technical formats, content types, transactions and dataflows, validation, systems requirements, contractual requirements, business models

• Harvard’s proposed model largely implemented by Portico

Page 26: Digital Preservation

Technical Metadata for Digital Still Images

• “What are the appropriate technical metadata necessary for the preservation of images?”

• Standardized as NISO Z39.87

• Expressed in the MIX schema

– Maintained by LC

• The basis for DRS image technical metadata

Page 27: Digital Preservation

METS (Metadata Encoding and Transmission Standard)

• “Is there a generic packaging form for digital content?”

• For example,

– Digital books

– Audio works

– Images (archival master, production master, deliverables)

• Useful for exchange of objects between repositories

• Maintained by LC

Page 28: Digital Preservation

Core audio metadata

• “What are the appropriate technical metadata necessary for the preservation of audio?”

• Standardized as AES X-098

• Used as the basis for DRS audio technical metadata

Page 29: Digital Preservation

PDF/A

• “PDF defines too many options; is there a ‘flavor’ that will be more ‘preservable’ over time?”

• Requires, recommends, and restricts PDF functionality to enhance preservability

• Standardized as ISO 19005

Page 30: Digital Preservation

PREMIS (PREservation Metadata: Implementation Strategies)

• “What are the general metadata elements necessary to preserve digital content over time?”

• OCLC/RLG-sponsored work group

• Recommendations and best practices for preservation metadata

– Core elements, data dictionary, implementation strategies, cooperative projects

Page 31: Digital Preservation

PREMIS (PREservation Metadata: Implementation Strategies)

• Report on current practices and recommended metadata elements available

• Maintained by LC

Page 32: Digital Preservation

AIHT (Archive Ingest and Handling Test)

• “What difficulties can we expect to arise during the exchange of content between heterogeneous repositories?”

• LC-funded project to investigate exchange of complex data between preservation repositories

• Harvard, Stanford, Johns Hopkins, Old Dominion ingest and exchange web archive data

Page 33: Digital Preservation

GDFR (Global Digital Format Registry)

• “What will we need to know in the future about formats in use today, and how will we know it?”

• Shared registry of preservation-related information about technical formats

• Reduce work for repositories to create and maintain information about objects they ingest…

Page 34: Digital Preservation

GDFR (Global Digital Format Registry)

• Enables sharing of format expertise

• Directed by Harvard, implemented by OCLC

• Funded by Mellon Foundation

Page 35: Digital Preservation

Registry of Digital Masters

• “How can I find out who has accepted archival responsibility for a given piece of content?”

• Initially reformatted materials; intention to expand to born-digital

• DLF project

• Implemented by and housed at OCLC

Page 36: Digital Preservation

Repository certification

• “Why should a collection manager trust a digital repository?”

• RLG/OCLC report on Trusted Repository Attributes

• RLG/NARA Digital Repository Certification Task Force…

Page 37: Digital Preservation

Repository certification

• Recommend structure and metrics of an international process for certifying preservation repositories

– Organizational role and structure, staff size and skill, formal operations and documentation, appropriate technical infrastructure and facilities, on-going funding, and “hand-off” plan, etc.

• CRL Auditing and Certification project

Page 38: Digital Preservation

Key activities elsewhere

• ISO 14721 OAIS (Open Archival Information System)

• LC NDIIPP (National Digital Information Infrastructure and Preservation Program)

• Web archiving (IA, IIPC)

• NARA ERA (Electronic Records Archives)

• Digital Curation Centre

• PLANETS

Page 39: Digital Preservation

III What more do we need to do?

Page 40: Digital Preservation

Evolution: from projects to program

• Digital preservation requires a continual, proactive program

– You can’t just stop and start

– Time frames are MUCH shorter than for preservation of physical collections

• Need to define scope and role of our preservation efforts

• Investment required in both technology and staffing

Page 41: Digital Preservation

Preservation lifecycle

• Creation

– Format and technical specification choices

– Accompanying metadata

– Packaging for ingest

• Ingest

– Validation

– Normalization

Page 42: Digital Preservation

Preservation lifecycle

• Assumption of preservation responsibility

• Monitoring

– When is intervention necessary?

• Changes to the technical environment

• Changes to user expectations

• Planning

– Significant properties

• All preservation decisions involve choice; how to choose what to preserve?

Page 43: Digital Preservation

Preservation lifecycle

• Intervention (preserving usability)

– Re-acquisition

– Re-generation from an archival master

– Migration before necessary (“just in case”)

– Migration at point of request (“just in time”)

– Emulation of obsolete technology in contemporary environment

– Universal Virtual Computer (UVC)

• Rewrite necessary software to run on technology-agnostic “virtual” computer
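The difference between “just in case” and “just in time” migration is mainly a question of when the conversion cost is paid. The sketch below contrasts the two under loud assumptions: the `Repository` class is hypothetical, and `migrate` is a stand-in converter (Latin-1 text to UTF-8) chosen only because it is easy to run, not because it resembles a real format migration.

```python
def migrate(old_bytes):
    """Stand-in converter (an assumption): Latin-1 text to UTF-8 text."""
    return old_bytes.decode("latin-1").encode("utf-8")


class Repository:
    """Hypothetical store that keeps archival masters untouched and holds
    migrated copies alongside them."""

    def __init__(self, masters):
        self.masters = dict(masters)  # original bytes, never modified
        self.migrated = {}            # derived copies in the current format
        self.conversions = 0          # how many conversions have been run

    def _convert(self, object_id):
        self.migrated[object_id] = migrate(self.masters[object_id])
        self.conversions += 1

    def migrate_all(self):
        """Just in case: convert every object before anyone asks for it."""
        for object_id in self.masters:
            if object_id not in self.migrated:
                self._convert(object_id)

    def get(self, object_id):
        """Just in time: convert on first request, then reuse the copy."""
        if object_id not in self.migrated:
            self._convert(object_id)
        return self.migrated[object_id]


masters = {"a": b"caf\xe9", "b": b"na\xefve"}

eager = Repository(masters)
eager.migrate_all()            # pays for both conversions up front

lazy = Repository(masters)
first = lazy.get("a")          # pays for one conversion, on request
lazy.get("a")                  # reuses the migrated copy
print(eager.conversions, lazy.conversions)
```

Either way the archival masters are left untouched, so a flawed conversion can be redone from the original.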

Page 44: Digital Preservation

Preservation lifecycle

• Intervention (continued)

– Save for digital archeologists

• After intervention

– Post-intervention quality assurance

– Documenting the process of change

• Succession planning

– What do we do when we want to get out of the repository business?

Page 45: Digital Preservation

Staffing and responsibilities

• Technical

– Infrastructure maintenance

– Monitor technological change

– Integration into larger preservation environment

– Preservation planning

• Curatorial

– Preservation intervention will involve trade-offs

• What attributes need to be preserved?

• Cost/benefit analysis

Page 46: Digital Preservation

Immediate challenges

• Google

– Substantial increase in scale (both number and size)

– “Dark” content; no expectation of current access

• Web archiving

– Explosion of data types

– No forethought on format selection and technical specifications

– No metadata

– Some failure may be inevitable

Page 47: Digital Preservation

Coming soon?

• Institutional repository (IR) to enhance scholarly communication and preserve scholarly creations

– Similar to web archiving: objects not typically created with preservation in mind, nor accompanied by metadata

• “Just in case” local copies of licensed content

– May necessitate increased sophistication of IPR management

Page 48: Digital Preservation

Longer term issues

• Economics – What can we afford to preserve?

• Scale – How much can we preserve?

• Selection – What do we leave for others?

• Federation – Can we share responsibilities for preservation?

– Copies in independent environments are safest

• Certification – Do we need formal certification?

– Note Section 108 revision

• Education – Who at Harvard needs to understand?