Day 2, workshop 4, Inge Van Nieuwerburgh

(meta)data standards for digital archiving DISH 2009 @ Rotterdam Universiteitsbibliotheek Gent – MMLab UGent

Upload: dish09

Post on 11-May-2015

Page 1: Day 2, workshop 4, Inge Van Nieuwerburgh

(Meta)data standards for digital archiving

DISH 2009 @ Rotterdam

Universiteitsbibliotheek Gent – MMLab UGent

Page 2: Day 2, workshop 4, Inge Van Nieuwerburgh

Summary

• Introduction
• Defining the problem
• State of the art:
  • OAIS
  • Data formats
  • Metadata schemas
  • Declarative containers
• Layered Metadata Model
• Best practices

Page 3: Day 2, workshop 4, Inge Van Nieuwerburgh

Introduction

Page 4: Day 2, workshop 4, Inge Van Nieuwerburgh

BOM Vl: Preservation and disclosure of multimedia data in Flanders

Flemish project – 1.5 years

Cross-sectoral: broadcasters, archival institutions, the cultural sector and libraries.

Studies:
• Needs for preservation
• Selection
• Metadata standards & exchange formats
• Digital rights
• Supply and distribution models

Page 5: Day 2, workshop 4, Inge Van Nieuwerburgh

Defining the problem

Page 6: Day 2, workshop 4, Inge Van Nieuwerburgh

Problems when archiving digital information

Problem 1.
• Analogue formats are disappearing and have to be replaced by digital alternatives.
• Rapid growth of data.
• Discrepancy between the short life span of digital technology and the need for long-term archiving.

Page 7: Day 2, workshop 4, Inge Van Nieuwerburgh

Problems when archiving digital information

Problem 2.
• In digital form, information is abstract, independent of the storage medium. The abstract information has to be preserved, not the medium.

Page 8: Day 2, workshop 4, Inge Van Nieuwerburgh

Problems when archiving digital information

But also consider…

Page 9: Day 2, workshop 4, Inge Van Nieuwerburgh

Growth of the storage capacity of desktop computers (HanKwang 2008)

Page 10: Day 2, workshop 4, Inge Van Nieuwerburgh

Evolution of used file formats (PRONOM)

Timeline (axis: 1980 – 1990 – 2000):
• ’84 – TGA
• ’84 – GEM Raster
• ’85 – BMP
• ’86 – TIFF 3
• ’87 – TIFF 4
• ’87 – GIF87
• ’88 – TIFF 5
• ’89 – GIF89
• ’92 – TIFF 6
• ’92 – JPEG
• ’92 – MrSID
• ’96 – PNG 1.0
• ’99 – PNG 1.2
• ’00 – JPEG2000
• ’03 – SVG

Page 11: Day 2, workshop 4, Inge Van Nieuwerburgh

Evolution format derivatives

MIME type image/tiff:
• TIFF (all versions)
• TIFF/IT
• TIFF G4/LZW/UNC
• Digital Negative Format (DNG)
• GeoTIFF
• Pyramid TIFF
• …

Source: PRONOM Technical Registry [http://www.nationalarchives.gov.uk/pronom/]

Page 12: Day 2, workshop 4, Inge Van Nieuwerburgh

Long-term risks

[Diagram: risks plotted against time (axis: 1980 – 1990 – 2000)]
• Bit errors/bugs
• File format changes
• Changing technology
• Organizational changes
• Interpretation of the format

Page 13: Day 2, workshop 4, Inge Van Nieuwerburgh

Study: state of the art (meta)data standards

Page 14: Day 2, workshop 4, Inge Van Nieuwerburgh

What is a digital archive?

• What a digital archive is NOT:
  • mass storage for active applications and data
  • a networked backup solution
• What a digital archive is:
  • Storage of digital information with historical, scientific, financial or legal value in the long term.
  • Platform-independent access to digital information for 50, 100 years or longer.

Page 15: Day 2, workshop 4, Inge Van Nieuwerburgh

OAIS

Page 16: Day 2, workshop 4, Inge Van Nieuwerburgh

Open Archival Information System (OAIS)

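• Reference model for the description of digital archives.
• Developed by the Consultative Committee for Space Data Systems (CCSDS, founded in 1982), whose members include:
  • NASA (US)
  • ESA (EU)
  • RSA (Russia)
  • NASDA (Japan)
  • …
• ISO standard 14721 (CCSDS recommendation since 2002, published as ISO 14721:2003)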

Page 17: Day 2, workshop 4, Inge Van Nieuwerburgh

OAIS model

• Consists of 3 parts:
  1. Description of an archival system: responsibilities, procedures and a common terminology.
  2. Functional model: all processes needed for the long-term preservation of digital information.
  3. Information model: describes the stored digital information.

Page 18: Day 2, workshop 4, Inge Van Nieuwerburgh

OAIS functional model

Page 19: Day 2, workshop 4, Inge Van Nieuwerburgh

Standards

• Need to explore the necessary, recommended and generally used standards:
  • technical schemas
  • descriptive schemas
  • preservation schemas
  • structural schemas
• What are the different metadata schemas (if any) used in the different cultural sectors?

Page 20: Day 2, workshop 4, Inge Van Nieuwerburgh

Data formats

Page 21: Day 2, workshop 4, Inge Van Nieuwerburgh

What

• Raw data takes up more and more storage
• Need to compress: compression standards
  • video: MPEG-2, H.264/MPEG-4 AVC, Motion JPEG2000
  • audio: MP3, AAC
  • images: JPEG, TIFF
• Need for container formats for the exchange of A/V material
  • MXF, AVI, WMA, MP4

Page 22: Day 2, workshop 4, Inge Van Nieuwerburgh

Metadata schemas

Page 23: Day 2, workshop 4, Inge Van Nieuwerburgh

What

• Descriptive metadata
• Administrative metadata
• Preservation metadata
• Technical metadata
• Usage data

Page 24: Day 2, workshop 4, Inge Van Nieuwerburgh

Standards

• Especially for descriptive metadata: differences between sectors => preferred standard per sector?
  • Differences in detail
  • Differences in structure
  • Differences in relations
• Preservation metadata: PREMIS
• Conceptual models

Page 25: Day 2, workshop 4, Inge Van Nieuwerburgh

Declarative containers

Page 26: Day 2, workshop 4, Inge Van Nieuwerburgh

What

• Compound information objects, combining descriptive, administrative and/or structural metadata
• Advantage: they are easy to exchange and reuse
• Some examples:
  • METS
  • MPEG-21 DIDL: describes complex digital objects
  • LOM: learning objects
  • ORE: model to describe aggregations

Page 27: Day 2, workshop 4, Inge Van Nieuwerburgh

Layered Metadata Model

Page 28: Day 2, workshop 4, Inge Van Nieuwerburgh

How to proceed?

• Need for a layered metadata model to manage a digital archive
• Why? Too many differences between data models
• Need for common ground

Page 29: Day 2, workshop 4, Inge Van Nieuwerburgh

Solution: layered metadata model

• Model in different layers:
  • A generic top-level descriptive metadata schema (DC)
  • A refined standard per sector, to preserve the metadata in detail
  • + Preservation metadata, technical metadata and rights metadata

Page 30: Day 2, workshop 4, Inge Van Nieuwerburgh

Layered metadata model

[Diagram: the layers of the model]
• Descriptive metadata: Dublin Core
• Preservation metadata: PREMIS
• Rights metadata: PREMIS, MPEG-21/REL, INDECS, ODRL, XrML
• Technical metadata: PREMIS, MPEG-7, Z39.87, AudioMD, VideoMD, TextMD
• Example objects: MARCXML, TIFF, PSD
• Referenced standards: MARC Standard, TIFF Standard
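A layered record of this kind could be held in a structure like the one below. This is only a minimal sketch: the class name, the field names and the `search` helper are hypothetical and do not come from the project deliverable. It illustrates the idea that queries touch only the common Dublin Core layer, while the sector-specific record is kept untouched next to it.

```python
from dataclasses import dataclass, field


@dataclass
class LayeredRecord:
    """One archived object described in layers (all names are hypothetical)."""
    identifier: str
    dublin_core: dict[str, list[str]]  # generic top layer: elements are optional and repeatable
    sector_metadata: str               # original sector record kept verbatim, e.g. MARCXML
    sector_schema: str                 # name of the sector standard, e.g. "MARC21"
    preservation: dict = field(default_factory=dict)  # e.g. PREMIS-style information
    technical: dict = field(default_factory=dict)
    rights: dict = field(default_factory=dict)


def search(records, element, term):
    """Query the common Dublin Core layer only, whatever the sector schema is."""
    term = term.lower()
    return [
        record.identifier
        for record in records
        if any(term in value.lower() for value in record.dublin_core.get(element, []))
    ]
```

A library record and a broadcaster record can then be searched together, for example with `search(records, "title", "map")`, even though their detailed metadata follow different standards.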

Page 31: Day 2, workshop 4, Inge Van Nieuwerburgh

Layered Metadata Model

Descriptive Model: Dublin Core

• Most interoperable, cross-sectoral.
• Greatest common divisor of all metadata models.
• All fields are repeatable and optional.

Mapping from the own metadata model to DC.

• Dublin Core as pidgin:
  • DC as common layer above the own metadata.
  • DC as model for querying.
  • Discovery and identification of digital objects.

Page 32: Day 2, workshop 4, Inge Van Nieuwerburgh

Layered Metadata Model

Descriptive Model: Dublin Core

How to disseminate as DC?
• A crosswalk to DC has been made for the most important metadata models used in the sectors:
  • Libraries: MARC21
  • A/V sector: P/Meta
  • Arts sector and museums: CDWA and SPECTRUM
  • Archiving sector: ISAD(G) and EAD
• Crosswalks can be used to disseminate the DC records via OAI-PMH, GRDDL (XSLT), a mapping API (D2RQ), or ontology linking.
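At its simplest, a crosswalk is a lookup table from source fields to Dublin Core elements. The sketch below assumes a MARC21 record that is already parsed into (tag, subfield, value) tuples. The handful of mappings is an illustrative subset chosen for this example, not the crosswalk produced by the project, and `marc_to_dc` is a hypothetical helper name.

```python
# Illustrative subset of a MARC21 -> Dublin Core mapping (an assumption, not the project's table).
MARC_TO_DC = {
    ("100", "a"): "creator",
    ("245", "a"): "title",
    ("260", "b"): "publisher",
    ("260", "c"): "date",
    ("650", "a"): "subject",
}


def marc_to_dc(fields):
    """Map (tag, subfield, value) tuples to a dict of repeatable DC elements."""
    dc = {}
    for tag, code, value in fields:
        element = MARC_TO_DC.get((tag, code))
        if element is not None:  # fields without a mapping are dropped in the DC view
            dc.setdefault(element, []).append(value)
    return dc
```

Fields without a DC counterpart are lost in the DC view, which is why the detailed sector record is preserved alongside it.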

Page 33: Day 2, workshop 4, Inge Van Nieuwerburgh

Layered Metadata Model

Preservation Model: PREMIS

• Administrative metadata + rights metadata: assist in the management of the digital objects.
• Technical metadata: assists access (conversion or emulation).
• Preservation metadata: tracks the provenance, i.e. the history of all actions on an object.

Page 34: Day 2, workshop 4, Inge Van Nieuwerburgh

Layered Metadata Model

Preservation Model: PREMIS

Page 35: Day 2, workshop 4, Inge Van Nieuwerburgh

Layered Metadata Model

Preservation Model: PREMIS

Objects:
• Describes the objects to be preserved in a technical manner.
• 3 subclasses:
  • Bitstream
  • File
  • Representation
• Facilitates the conversion or emulation process.

Page 37: Day 2, workshop 4, Inge Van Nieuwerburgh

Layered Metadata Model

Preservation Model: PREMIS

Agents:
• Aggregates information about agents (persons, organisations, software) associated with rights management and preservation events in the life of a data object.
• No direct relation between Agent and Object; an agent:
  • may hold or grant one or more rights
  • may carry out, authorize, or compel one or more events.
• Agents must be identified uniquely.

Page 38: Day 2, workshop 4, Inge Van Nieuwerburgh

Layered Metadata Model

Preservation Model: PREMIS

Events:
• Actions that modify objects should always be recorded. Other actions, such as copying an object for backup purposes, may be recorded in an Event entity.
• Stored separately from the digital object.
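The separation between objects and their events can be sketched as an append-only log that refers to objects and agents by identifier only. The field names below loosely follow PREMIS semantic units such as eventType and eventOutcome, but the classes themselves are hypothetical and are not a PREMIS implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    event_id: str
    event_type: str   # e.g. "migration" or "fixity check"
    date_time: str    # ISO 8601 timestamp
    object_id: str    # reference to the object, not the object itself
    agent_id: str     # reference to the agent that carried out the event
    outcome: str


class EventLog:
    """Append-only store, kept apart from the digital objects."""

    def __init__(self):
        self._events = []

    def record(self, event_type, object_id, agent_id, outcome):
        event = Event(
            event_id=f"event-{len(self._events) + 1}",
            event_type=event_type,
            date_time=datetime.now(timezone.utc).isoformat(),
            object_id=object_id,
            agent_id=agent_id,
            outcome=outcome,
        )
        self._events.append(event)
        return event

    def history(self, object_id):
        """Provenance of one object: every recorded event, oldest first."""
        return [e for e in self._events if e.object_id == object_id]
```

Because an event only carries identifiers, the link between an agent and an object always runs through an event, in line with the Agents slide.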

Page 39: Day 2, workshop 4, Inge Van Nieuwerburgh

Layered Metadata Model

Preservation Model: PREMIS

Rights:
• The minimum core rights information that a preservation repository must know is what rights or permissions it has to carry out actions related to objects within the repository.
• These may be granted by copyright law, by statute, or by a license agreement with the rightsholder.

Page 40: Day 2, workshop 4, Inge Van Nieuwerburgh

Layered Metadata Model

Preservation Model: PREMIS

Intellectual Entity:
• Descriptive metadata: out of scope for PREMIS.
• Dublin Core

Page 41: Day 2, workshop 4, Inge Van Nieuwerburgh

Layered Metadata Model

Preservation Model: PREMIS

PREMIS OWL:
• Semantic (OWL) ontology following the data dictionary of PREMIS 2.0.
• Published online: http://multimedialab.elis.ugent.be/users/samcoppe/ontologies/Premis/premis.owl
• Documentation online: http://multimedialab.elis.ugent.be/users/samcoppe/ontologies/Premis/index.html

Page 42: Day 2, workshop 4, Inge Van Nieuwerburgh

Best practices

Or: how to minimize risks

Page 43: Day 2, workshop 4, Inge Van Nieuwerburgh

Best Practice # 1: Store technical metadata

Source: Adrian Brown, National Archives UK; “Developing Practical Approaches to Active Preservation”

Page 44: Day 2, workshop 4, Inge Van Nieuwerburgh

Bitrot/Software errors

• No storage device is perfect and eternal.
• David Rosenthal, Stanford University: “Bit Preservation: A Solved Problem?”
  • A bit half-life of 8 x 10^17 years gives a 50% chance that 1 petabyte survives a century without errors.
• Comparable studies by Carnegie Mellon University, Google and CERN
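The half-life figure can be checked with a few lines of arithmetic. The sketch below assumes that every bit fails independently with an exponential decay (a bit survives t years with probability 2^(-t/H) for half-life H) and that 1 petabyte is 10^15 bytes. Both are assumptions of this example and the function names are hypothetical. Under that model all N bits survive with probability 2^(-N·t/H), so a 50% chance requires H = N·t.

```python
import math

BITS_PER_PETABYTE = 8 * 10**15  # assumes 1 PB = 10^15 bytes


def survival_probability(num_bits, years, half_life_years):
    """Chance that none of num_bits independent bits flips within the given time."""
    return 2.0 ** (-num_bits * years / half_life_years)


def required_half_life(num_bits, years, target_probability=0.5):
    """Bit half-life needed so that all bits survive with the target probability."""
    return num_bits * years * math.log(2) / -math.log(target_probability)


print(required_half_life(BITS_PER_PETABYTE, 100))          # about 8e17 years
print(survival_probability(BITS_PER_PETABYTE, 100, 8e17))  # about 0.5
```

The required half-life grows linearly with both the volume and the retention period: ten petabytes kept for a century would already need 8 x 10^18 years per bit under the same model.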

Page 45: Day 2, workshop 4, Inge Van Nieuwerburgh

Bitrot/Software errors

• Volker Heydegger, University of Cologne: “Analysing the Impact of File Formats on Digital Integrity”

Page 46: Day 2, workshop 4, Inge Van Nieuwerburgh

Best Practice # 2: Preserve preservation metadata

• Checksums
• Digital signatures
• Provenance
• …
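A checksum is only useful if it is stored as preservation metadata and recomputed later. Below is a minimal fixity check using Python's standard hashlib. SHA-256 is chosen here as an example algorithm, and the function names are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path, stored_checksum):
    """True if the file still matches the checksum stored at ingest."""
    return sha256_of(path) == stored_checksum
```

A failed check shows that the bits have changed, but not why. Recording each check as an event makes it possible to narrow down when the damage occurred.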

Page 47: Day 2, workshop 4, Inge Van Nieuwerburgh

Interpretation risks

One of the coolest and oldest dwarf stars ever found.

Page 48: Day 2, workshop 4, Inge Van Nieuwerburgh

Best Practice # 3: Representation metadata

• Time
• Place
• Wavelengths/calibration data
• Provenance

Page 49: Day 2, workshop 4, Inge Van Nieuwerburgh

Technology Changes

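Raw bytes (hex dump):
4b50 0403 0014 0000 0008 0cdb 282e 7d22 ddaa 0243 0001 ab00 0002 000f 0000 6341 5f65 666f

Machine instructions (assembly):
INC $D020
DEC $D020
JMP $2000
LDX $D020
INX
STX $D020
JMP $2000
LDA $5000

[Diagram: raw data + Documentation = Information. Further labels: Syntax, Semantics]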
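The hex dump on this slide becomes readable once the format documentation is at hand. The sketch below assumes that the dump lists little-endian 16-bit words (as the default output of the Unix hexdump tool does) and that the two fused groups on the slide are separate 4-digit words. Under those assumptions the bytes start with the signature of a ZIP local file header, whose fixed 30-byte layout is decoded here with the standard struct module.

```python
import struct

# Words as shown on the slide, with the two fused groups split into 4-digit words.
WORDS = ("4b50 0403 0014 0000 0008 0cdb 282e 7d22 ddaa 0243 "
         "0001 ab00 0002 000f 0000 6341 5f65 666f").split()

# Each printed word shows its two bytes swapped; reverse them to get the byte stream.
raw = b"".join(bytes.fromhex(word)[::-1] for word in WORDS)

# ZIP local file header: 30 fixed bytes, all fields little-endian.
(signature, version, flags, method, mod_time, mod_date,
 crc32, compressed_size, uncompressed_size,
 name_length, extra_length) = struct.unpack("<4sHHHHHIIIHH", raw[:30])

print(signature)          # b'PK\x03\x04'
print(method)             # 8, which means "deflate" in the ZIP specification
print(uncompressed_size)  # 174848
print(name_length)        # 15
print(raw[30:])           # b'Ace_of', the first 6 of the 15 file name bytes
```

Without the documented layout (the syntax) and the meaning of each field (the semantics), the same 36 bytes are just numbers, which is the point of the slide.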

Page 50: Day 2, workshop 4, Inge Van Nieuwerburgh

Best Practice # 4: Do not trust software

• It is an illusion to think that software will always offer access to the archived data.

• Computer software is an active component in the archive and it knows only two possible states:

1. It works and is maintained.

2. It does not work and is not maintained.

Page 51: Day 2, workshop 4, Inge Van Nieuwerburgh

Best Practice # 4: Do not trust software (cont.)

• Case 2: Software does not work, is not maintained:
  – The documentation metadata has to contain the source code of the original software.
  – Emulation has to be provided for; the metadata has to contain all the emulation parameters.
• Case 1: Software works, is maintained:
  – The archive has the software.
  – The user has the software.
  – Both cases have a dynamic metadata layer with all the software aspects needed to access the data.

Page 52: Day 2, workshop 4, Inge Van Nieuwerburgh

Descriptive metadata

• Are descriptive metadata (or other access tools like thumbnails, previews) data or metadata?
• Non-discussion: ‘metadata’ is a relative term.
• As data:
  • Advantage: descriptive metadata are ‘core business’, too valuable not to be archived.
  • Disadvantage: this type of data is very dynamic.
• As metadata:
  • Advantage: metadata are dynamic; they can be adapted to the needs of the archive.
  • Disadvantage: which descriptive model has to be used: MARC, EAD, P/Meta, …?

Page 53: Day 2, workshop 4, Inge Van Nieuwerburgh

Best Practice # 5: Store descriptive metadata as data

Provide a broadly accepted descriptive model like Dublin Core

• Dublin Core describes the ‘Who’, ‘What’, ‘Where’, ‘When’ and ‘How’.

• Sector specific descriptive metadata models have finer granularity.

• Use international standards (MARC, EAD, P/Meta).

Page 54: Day 2, workshop 4, Inge Van Nieuwerburgh

Want to know more?

Book (in Dutch): “(Meta)datastandaarden voor digitale archieven”, Bastijns, Paul; Coppens, Sam; Corneillie, Siska; Hochstenbach, Patrick et al. (2009); full text available at http://hdl.handle.net/1854/LU-480734

Deliverable Layered metadata model: http://hdl.handle.net/1854/LU-764194

Page 55: Day 2, workshop 4, Inge Van Nieuwerburgh

Q & A