Preservation Metadata and PREMIS
Vilas Wuwongse, Asian Institute of Technology
Outline
• Introduction to Metadata
• What is preservation metadata?
• Why is preservation metadata needed?
• How to create preservation metadata
• PREMIS
• Conclusion
• Acknowledgement
Metadata
• Metadata is often defined as “structured data about data”.
• It records one or more characteristics of the data:
– name, description, purpose, creation date and time, creator, and other basic information
• Example:
– Library catalogues
Metadata Categories (1)
• Descriptive
– identifies and describes a resource: title, author, abstract, and keywords
• Structural
– records relationships within and among resource objects, e.g. a web page made up of HTML, image, CSS, and JavaScript files that link to other files
• Technical (for physical files)
– records technical information that applies to any file type: software/hardware environment, checksums, digital signatures, image width, elapsed time
Metadata Categories (2)
• Administrative
– provides information to help manage a resource, such as when and how it was created, its file type and other technical information, and who can access it
– Two important subsets:
  • Rights management metadata, dealing with intellectual property rights
  • Preservation metadata, containing the information needed to archive and preserve a resource
Preservation Metadata (1)
• Information that is essential to ensure long-term accessibility of digital resources
• A verifier of the past
• A communication to the future
• A best guess on the future: no prescriptive list of metadata elements is available
• Must be able to exist independently of the systems that were used to create it
Preservation Metadata (2)
• Sometimes considered a subset of
– administrative metadata, assisting in the management of information
– technical metadata, assisting access to the digital content and ensuring that digital resources can be rendered as they were originally
• Basic functional objectives [OCLC]:
– providing knowledge about the actions needed to maintain digital resources over the long term
– ensuring that digital resources can be rendered as they were originally
Information Included in Preservation Metadata
• Provenance
– Describes the history of creation, ownership, access, and change
• Authenticity
– Ensures trustworthiness (does the digital resource render as it did originally?)
• Preservation activities
– Records the processes that support preservation, such as migration
• Technical environment
– Gives the name and version of the hardware, platform, OS, and software required to render the digital resource
• Rights management
– States the intellectual property rights and agreements that must be observed when a preservation process is executed, e.g. does the creator allow his or her work to be copied?
Why Preservation Metadata?
Preservation metadata helps the implementation of preservation policies
Preservation Policies (1)
• Define how to manage digital assets in a repository to avert the risk of content loss, in terms of, e.g.,
– data storage requirements
– preservation actions
– responsibilities
Preservation Policies (2)
• Specify preservation goals to ensure that:
– digital content is within the physical control of the repository
– digital content can be uniquely and persistently identified and retrieved in the future
– all information is available so that digital content can be understood by its designated user community
– significant characteristics of the digital assets are preserved even as data carriers or physical representations change
Preservation Policies (3)
• Specify preservation goals to ensure that:
– physical media are cared for
– digital objects remain renderable or executable
– digital objects remain whole and unimpaired, and it is clear how all the parts relate to each other
– digital objects are what they purport to be
• All of these preservation functions depend on the availability of preservation metadata
I won’t give you a blueprint or a concrete model for running a restaurant.
But I’ll show you WHAT you have to consider, and HOW, when planning to run a restaurant business.
To Begin
OAIS Background
• Reference Model for an Open Archival Information System (OAIS)
– Development led by the Consultative Committee for Space Data Systems (CCSDS)
– Issued as CCSDS Recommendation (Blue Book) 650.0-B-1 (January 2002)
– Also adopted as ISO 14721:2003
OAIS Model (1)
• Conventional categories
– Administrative, Descriptive (e.g. MARC, Dublin Core), Structural
• OAIS model categories
– Preservation Description Information
  • Reference Information: to enumerate and describe identifiers
  • Provenance Information: to document the history of the content information (creation, modification, custody)
  • Context Information: to document the relationship of the content to its environment
  • Fixity Information: to document authentication mechanisms
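A minimal Python sketch of how Fixity Information might be recorded and checked. It assumes SHA-256 as the digest algorithm (the slides do not name one), and the function names `compute_digest` and `fixity_ok` are illustrative only.

```python
import hashlib


def compute_digest(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hex message digest of `data` for the named algorithm."""
    return hashlib.new(algorithm, data).hexdigest()


def fixity_ok(data: bytes, recorded_digest: str, algorithm: str = "sha256") -> bool:
    """Compare a freshly computed digest with the one stored as metadata."""
    return compute_digest(data, algorithm) == recorded_digest.lower()


# Hypothetical content; in a repository this would be the stored bitstream.
content = b"Thailand map, province layer"
recorded = compute_digest(content)           # stored at ingest time

print(fixity_ok(content, recorded))          # content unchanged
print(fixity_ok(content + b"!", recorded))   # content altered
```

The recorded digest is the piece of metadata; the check itself is repeated over time and only compares against it.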
OAIS Model (2)
– Content Information
  • Content Data Object
  • Representation Information: the information needed for proper rendering, understanding, and interpretation of a digital object's content
– Packaging Information
– Descriptive Information: the information used to aid searching, ordering, and retrieval of the objects
OAIS Model (3): Metadata
[Diagram: components of an Archival Information Package, with example metadata elements]
• Packaging Information: binds the digital object and its associated metadata into an identifiable unit or package (i.e., an Archival Information Package)
• Descriptive Information: helps users of the archive to locate and access information of potential interest
• Content Information
– Content Data Object: the original target of preservation
– Representation Information: needed to make the data object understandable to the designated user community (e.g. three numbers interpreted as a date)
  • Structure Information: interprets the bits by organizing them into specific data types, groups of data types, and other higher-level meanings
  • Semantic Information: provides additional meaning for the interpretation of the content. For example, structural information may identify a bit stream as ASCII text characters, while semantic information might indicate that the text is in English
  • Content Data Object Description: details the characteristics and features of the Content Data Object itself that are necessary to render and understand its content
  • Environment Description: describes a hardware/software environment capable of rendering or displaying the Content Data Object in the form in which it currently exists in the archival store
• Preservation Description Information: necessary to manage the preservation of the Content Information
– Reference Information: enumerates and describes identifiers assigned to the Content Information such that it can be referred to unambiguously, both internally and externally to the archive (e.g., ISBN, URN)
– Context Information: documents the relationships of the Content Information to its environment (e.g., why it was created, relationships to other Content Information)
– Provenance Information: documents the history of the Content Information
– Fixity Information: information validating the authenticity of the Content Information
Element labels shown in the diagram:
• Dublin Core: DC.Title, DC.Creator, DC.Subject, DC.Description, DC.Publisher, DC.Contributor, DC.Date, DC.Type, DC.Format, DC.Identifier, DC.Source, DC.Language, DC.Coverage
• Relationships: Reason for Creation, Is Version Of, Has Version, Is Replaced By, Replaces (migration), Is Required By, Requires, Is Part Of, Has Part, Is Referenced By, References, Is Format Of, Has Format, Same Intellectual Content As
• Ingest Process History: Institution, Event Date/Time, Event Type, Event Description
• Preservation History: Institution, Action Date/Time, Action Type, Action Description, Technical Device
• Authentication: Digital Signature / Watermark / Time Stamp, Checksum, Encryption, Documentation of Authentication Mechanism
• Content Data Object Description: Directory structure and file naming conventions; Content type; Component types and their relationships; File description; Installation requirements; Size; Access inhibitors; Access facilitators; Significant properties; Functionality; Description of rendered content; Quirks; Documentation
• Rights: Access Status; Rights Information; Copyright Statement; Patent Statement; Archiving Permission; Use Conditions; Actors; Actions; Permitted by statute; Permitted by license; Encryption details; Contacts / Rights Holders
• Software Environment: Rendering Programs, Operating System
• Hardware Environment: Computational Resources, Storage, Peripherals
• Other labels: Archival System Identification, Global Identification, Resource Description, Rights Information, Full-Text Description (for normalising full-text XML)
OAIS Functional Entities
There are three types of information package:
• the Submission Information Package (SIP), which conveys the information provided to the archive by the user and deposit system
• the Archival Information Package (AIP), which is the stored archival version of the information
• the Dissemination Information Package (DIP), which is the version of the information available to users
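The three package types can be sketched as one data structure that moves through two steps. This is a simplified illustration, not part of OAIS itself: the class `InformationPackage`, the functions `ingest` and `disseminate`, the metadata keys, and the choice to add a SHA-256 digest at ingest are all assumptions.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class InformationPackage:
    kind: str                      # "SIP", "AIP", or "DIP"
    content: bytes
    metadata: dict = field(default_factory=dict)


def ingest(sip: InformationPackage) -> InformationPackage:
    """Turn a submitted package into the stored archival version."""
    if sip.kind != "SIP":
        raise ValueError("ingest expects a SIP")
    metadata = dict(sip.metadata)
    # Assumption: the archive adds fixity information when it stores the AIP.
    metadata["fixity_sha256"] = hashlib.sha256(sip.content).hexdigest()
    return InformationPackage("AIP", sip.content, metadata)


def disseminate(aip: InformationPackage, public_keys=("title",)) -> InformationPackage:
    """Derive the user-facing version, keeping only selected metadata."""
    if aip.kind != "AIP":
        raise ValueError("disseminate expects an AIP")
    metadata = {k: v for k, v in aip.metadata.items() if k in public_keys}
    return InformationPackage("DIP", aip.content, metadata)


sip = InformationPackage("SIP", b"map data", {"title": "Thailand Map", "depositor": "AIT"})
aip = ingest(sip)
dip = disseminate(aip)
print(aip.kind, sorted(aip.metadata))
print(dip.kind, dip.metadata)
```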
What?
• PREservation Metadata: Implementation Strategies
• Sponsored by Library of Congress (LOC)
• People usually refer to “PREMIS” as the “Data Dictionary”
• Represented in XML format
PREMIS Data Dictionary
• Set of Semantic Units (called Metadata Elements when they are implemented)
• Metadata for digital objects so that they
– can be read from media
– can be rendered
– are stored securely
– can be tracked as their formats change
• Metadata Scope
– Format-specific: e.g. audio, video, image, …
– Implementation-specific: how to access it (by application)
– Descriptive metadata: data properties, as in MARC, DC
– Detailed information (for media or hardware)
– Agent information: e.g. people, organizations, or software
– Rights information: e.g. license, permission
Where is PREMIS?
PREMIS presents itself as a coordinator among several types of metadata, so that preservation functions can be performed on all digital resources.
Thus, PREMIS is a small core at the heart of preservation metadata
Intellectual Entities
Examples:
• Rabbit Run by John Updike (a book)
• “Maggie at the beach” (a photograph)
• The Library of Congress Website (a website)
• The Library of Congress: American Memory Home page (a web page)

• Set of content that is considered a single intellectual unit for purposes of management and description (e.g., a book, a photograph, a map, a database)
• May include other Intellectual Entities (e.g. a website that includes a web page)
• **Has one or more digital representations**
• Not fully described in the PREMIS DD, but can be linked to in metadata describing a digital representation
Objects
Examples:
• chapter1.pdf (a file)
• chapter1.pdf + chapter2.pdf + chapter3.pdf (representation of a book with 3 chapters)
• TIFF file containing a header and 2 images (2 bitstreams (images), each with its own set of properties (semantic units): e.g., identifiers, technical metadata, inhibitors, …)

• Discrete unit of information in digital form
• **Objects are what the repository actually preserves**
• Three types of Object:
– FILE: named and ordered sequence of bytes that is known by an operating system
– REPRESENTATION: set of files, including structural metadata, that, taken together, constitute a complete rendering of an Intellectual Entity
– BITSTREAM: data within a file with properties relevant for preservation purposes (but needing additional structure or reformatting to be a stand-alone file)
Object Examples: Thailand Map
Intellectual Entity: Thailand Map
• Object 1 (Representation): it can be a web page that contains 3 files: HTML, CSS, JPEG
• Object 2 (File): 1 JPEG file
• Object 3 (File): 1 TIFF file that includes 3 bitstreams of images of map layers: province, mountain, river
Example types of object for the preservation of Thailand Map
Object Example: book in two versions
Intellectual Entity: Da Vinci Code by Dan Brown
• Representation 1: page image version
– File 1: page1.tiff
– File 2: page2.tiff
– File N: pageN.tiff
– File N+1: METS.xml
• Representation 2: ebook version
– File 1: book.lit
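The book example can be written down as nested data to make the Intellectual Entity / Representation / File / Bitstream layering concrete. This is only a sketch: the class names mirror the PREMIS terms, but the fields, the helper `build_book_example`, and the page count of 3 are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class File:
    name: str
    bitstreams: List[str] = field(default_factory=list)  # e.g. images inside a TIFF


@dataclass
class Representation:
    label: str
    files: List[File] = field(default_factory=list)


@dataclass
class IntellectualEntity:
    title: str
    representations: List[Representation] = field(default_factory=list)


def build_book_example(page_count: int) -> IntellectualEntity:
    """Build the two-representation book from the slide with N page images."""
    pages = [File(f"page{i}.tiff") for i in range(1, page_count + 1)]
    page_images = Representation("Page image version", pages + [File("METS.xml")])
    ebook = Representation("ebook version", [File("book.lit")])
    return IntellectualEntity("Da Vinci Code by Dan Brown", [page_images, ebook])


book = build_book_example(3)
for rep in book.representations:
    print(rep.label, [f.name for f in rep.files])
```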
Events
Examples:
• Validation Event: use some tools to verify that chapter1.pdf is a valid PDF file
• Ingest Event: transform an OAIS SIP into an AIP
• Migration Event: create a new version of an Object in an up-to-date format

• An action that involves or impacts at least one Object or Agent associated with or known by the preservation repository
• Helps document digital provenance. Can track the history of an Object through the chain of Events that occur during the Object's lifecycle
• Determining which Events are in scope is up to the repository (e.g., Events which occur before ingest, or after de-accession)
eventType: Event Type Description
• capture: the process whereby a repository actively obtains an object
• compression: the process of coding data to save storage space or transmission time
• creation: the act of creating a new object
• deaccession: the process of removing an object from the inventory of a repository
• decompression: the process of reversing the effects of compression
• decryption: the process of converting encrypted data to plaintext
• deletion: the process of removing an object from repository storage
• digital signature validation: the process of determining that a decrypted digital signature matches an expected value
• dissemination: the process of retrieving an object from repository storage and making it available to users
• fixity check: the process of verifying that an object has not been changed in a given period
• ingestion: the process of adding objects to a preservation repository
• message digest calculation: the process by which a message digest (“hash”) is created
• migration: a transformation of an object creating a version in a more contemporary format
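A short sketch of how a chain of Events can document the provenance of an Object. The `Event` class, its field names, and `object_history` are illustrative assumptions; the controlled vocabulary is just the list of event types from the slide.

```python
from dataclasses import dataclass
from datetime import datetime

# Event types as listed on the slide.
EVENT_TYPES = {
    "capture", "compression", "creation", "deaccession", "decompression",
    "decryption", "deletion", "digital signature validation", "dissemination",
    "fixity check", "ingestion", "message digest calculation", "migration",
}


@dataclass(frozen=True)
class Event:
    event_type: str
    event_date_time: datetime
    linking_object_identifier: str

    def __post_init__(self):
        # Reject types outside the vocabulary so records stay comparable.
        if self.event_type not in EVENT_TYPES:
            raise ValueError(f"unknown eventType: {self.event_type}")


def object_history(events, object_id):
    """Return the event types recorded for one object, oldest first."""
    related = [e for e in events if e.linking_object_identifier == object_id]
    return [e.event_type for e in sorted(related, key=lambda e: e.event_date_time)]


events = [
    Event("migration", datetime(2011, 5, 1), "chapter1.pdf"),
    Event("ingestion", datetime(2010, 1, 15), "chapter1.pdf"),
    Event("fixity check", datetime(2010, 6, 1), "chapter1.pdf"),
    Event("ingestion", datetime(2010, 2, 1), "chapter2.pdf"),
]
print(object_history(events, "chapter1.pdf"))
```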
Agents
Examples:
• Rathachai Chawuthai (a person)
• Asian Institute of Technology (an organization)
• Dark Archive in the Sunshine State implementation (a system)
• JHOVE version 1.0 (a software program)

• Person, organization, or software program/system associated with an Event or a Right (permission statement)
• Agents are associated only indirectly with Objects, through Events or Rights
• Not defined in detail in the PREMIS DD; not considered core preservation metadata beyond identification
Rights Statements
Example:
• Rathachai Chawuthai grants the AIT digital repository permission to make three copies of metadata_fundamentals.pdf for preservation purposes.

• An agreement with a rights holder that grants permission for the repository to undertake an action(s) associated with an Object(s) in the repository.
• Not a full rights expression language; focuses exclusively on permissions that take the form:
– Agent X grants Permission Y to the repository in regard to Object Z.
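The "Agent X grants Permission Y in regard to Object Z" form maps naturally onto a small record plus a lookup. The sketch below is an assumption-laden illustration: `RightsStatement`, `is_permitted`, the act name "replicate", and the `max_count` restriction are invented for the example and are not PREMIS semantic units.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class RightsStatement:
    agent: str                        # Agent X, the rights holder
    act: str                          # Permission Y
    object_id: str                    # Object Z
    max_count: Optional[int] = None   # optional restriction on repetitions


def is_permitted(statements: Iterable[RightsStatement], act: str,
                 object_id: str, already_done: int = 0) -> bool:
    """True if some statement still allows `act` on `object_id`."""
    for s in statements:
        if s.act == act and s.object_id == object_id:
            if s.max_count is None or already_done < s.max_count:
                return True
    return False


# The example from the slide: three copies for preservation purposes.
grant = RightsStatement("Rathachai Chawuthai", "replicate",
                        "metadata_fundamentals.pdf", max_count=3)

print(is_permitted([grant], "replicate", "metadata_fundamentals.pdf", already_done=2))
print(is_permitted([grant], "replicate", "metadata_fundamentals.pdf", already_done=3))
```

Anything not explicitly granted is treated as not permitted, which matches the permission-only focus described above.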
Semantic units pertaining to objects: technical metadata
• objectIdentifier
• preservationLevel
• significantProperties
• objectCategory
• objectCharacteristics
– fixity
– size
– format
– creatingApplication
– inhibitors
– extension
• originalName
• storage
• environment
• signatureInformation
• relationship
• linkingEventIdentifier
• linkingIntellectualEntityIdentifier
• linkingRightsStatementIdentifier
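Since PREMIS is represented in XML, a few of these semantic units can be serialized with Python's standard library. This is a simplified sketch, not a schema-valid record: it assumes the PREMIS version 2 namespace URI, uses "local" as a placeholder identifier type and SHA-256 as the digest algorithm, and omits most units listed above. The helper names `q` and `build_object` are illustrative.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace URI of the PREMIS version 2 schema.
PREMIS_NS = "info:lc/xmlns/premis-v2"
ET.register_namespace("premis", PREMIS_NS)


def q(tag: str) -> str:
    """Qualify a tag name with the PREMIS namespace."""
    return f"{{{PREMIS_NS}}}{tag}"


def build_object(identifier: str, digest: str, size: int, format_name: str) -> ET.Element:
    """Build a partial object record: identifier plus objectCharacteristics."""
    obj = ET.Element(q("object"))
    ident = ET.SubElement(obj, q("objectIdentifier"))
    ET.SubElement(ident, q("objectIdentifierType")).text = "local"  # placeholder
    ET.SubElement(ident, q("objectIdentifierValue")).text = identifier
    chars = ET.SubElement(obj, q("objectCharacteristics"))
    fixity = ET.SubElement(chars, q("fixity"))
    ET.SubElement(fixity, q("messageDigestAlgorithm")).text = "SHA-256"
    ET.SubElement(fixity, q("messageDigest")).text = digest
    ET.SubElement(chars, q("size")).text = str(size)
    fmt = ET.SubElement(chars, q("format"))
    designation = ET.SubElement(fmt, q("formatDesignation"))
    ET.SubElement(designation, q("formatName")).text = format_name
    return obj


# Hypothetical values for illustration only.
record = build_object("chapter1.pdf", "0" * 64, 1024, "PDF")
print(ET.tostring(record, encoding="unicode"))
```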
Semantic units pertaining to Events: provenance and preservation activity
• eventIdentifier
• eventType
• eventDateTime
• eventDetail
• eventOutcome
• eventOutcomeDetail
• linkingAgentIdentifier
• linkingObjectIdentifier
Semantic units pertaining to Rights
• rightsStatement
– rightsStatementIdentifier
– rightsBasis
– copyrightInformation
– licenseInformation
– statuteInformation
– rightsGranted
  • act
  • restriction
  • termOfGrant
– linkingObjectIdentifier
– linkingAgentIdentifier
• rightsExtension
METS Background
• XML based
• Describes the structure of digital objects and associates various kinds of metadata with their components
• Uses the XML Schema facility for combining vocabularies from different Namespaces for extensibility
• Metadata is categorized into separate sections (embedded or linked)
• Records the names and locations of the files that comprise those objects (embedded or linked)
• Records a map of hyperlinks between components
• Associates executable behaviour with the components
The Structure of a METS file
• METS header
• dmdSec: descriptive metadata
• amdSec: administrative metadata
• fileSec: file inventory
• structMap: structural map
• behaviorSec: behaviour metadata
<fileGrp ID="munahi010-aaa-fgrp-0001">
  <file GROUPID="0" ID="munahi010-aaa-0001-0" MIMETYPE="image/tiff" ADMID="munahi010-aaa-tmd-0001-0">
    <FLocat LOCTYPE="URL" xlink:href="file://hfs.ox.ac.uk/data/odl/munahi010/digObjects/aaa/0/munahi010-aaa-0001.tiff"/>
  </file>
  <file GROUPID="6" ID="munahi010-aaa-0001-6" MIMETYPE="image/jpeg" ADMID="munahi010-aaa-tmd-0001-6">
    <FLocat LOCTYPE="URL" xlink:href="http:odl/munahi010/digObjects/aaa/6/munahi010-aaa-0001-6.jpg"/>
  </file>
  <file GROUPID="3" ID="munahi010-aaa-0001-3" MIMETYPE="image/jpeg" ADMID="munahi010-aaa-tmd-0001-3">
    <FLocat LOCTYPE="URL" xlink:href="http:odl/munahi010/digObjects/aaa/3/munahi010-aaa-0001-3.jpg"/>
  </file>
</fileGrp>
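A file inventory like the one above can be read back with Python's standard library. The sketch assumes the fragment sits inside a `mets` root that declares the METS and XLink namespaces (the slide shows only the inner `fileGrp`), and it reuses just the first `file` entry; `list_files` is an illustrative name.

```python
import xml.etree.ElementTree as ET

NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

# Assumption: wrapper elements and namespace declarations added for parsing.
SAMPLE = """<mets xmlns="http://www.loc.gov/METS/"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <fileSec>
    <fileGrp ID="munahi010-aaa-fgrp-0001">
      <file GROUPID="0" ID="munahi010-aaa-0001-0" MIMETYPE="image/tiff"
            ADMID="munahi010-aaa-tmd-0001-0">
        <FLocat LOCTYPE="URL"
                xlink:href="file://hfs.ox.ac.uk/data/odl/munahi010/digObjects/aaa/0/munahi010-aaa-0001.tiff"/>
      </file>
    </fileGrp>
  </fileSec>
</mets>"""


def list_files(mets_xml: str):
    """Return (ID, MIME type, location) for every file in the inventory."""
    root = ET.fromstring(mets_xml)
    inventory = []
    for file_el in root.iterfind(".//mets:file", NS):
        flocat = file_el.find("mets:FLocat", NS)
        # xlink:href is a namespaced attribute, so it needs the full URI form.
        href = flocat.get(f"{{{NS['xlink']}}}href") if flocat is not None else None
        inventory.append((file_el.get("ID"), file_el.get("MIMETYPE"), href))
    return inventory


for entry in list_files(SAMPLE):
    print(entry)
```

The ADMID attribute on each file is what ties the inventory entry to its administrative metadata elsewhere in the same document.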
The inside of a METS file
<mdWrap MIMETYPE="text/xml" MDTYPE="MODS" LABEL="MODS Metadata">
  <xmlData>
    <mods:mods>
      <mods:titleInfo>
        <mods:title>Cobbett's parliamentary history of England, from the Norman Conquest, in 1066 to the year, 1803 : from which last-mentioned epoch it is continued downwards in the work entitled, &quot;The parliamentary debates&quot;</mods:title>
      </mods:titleInfo>
      <mods:titleInfo type="alternative">
        <mods:title>Cobbett's Parliamentary History - volume 2</mods:title>
      </mods:titleInfo>
      <mods:name>
        <mods:namePart>$aGreat Britain. Parliament.</mods:namePart>
        <mods:role>
          <mods:roleTerm type="code" authority="marcrelator">spn</mods:roleTerm>
        </mods:role>
      </mods:name>
    </mods:mods>
  </xmlData>
</mdWrap>
METS with PREMIS as OAIS Information Package
• OAIS repository functions for which METS is often used are submission or exchange (SIP), archiving (AIP), and dissemination (DIP)
• A METS package is a good candidate for realization of an information object in an OAIS repository
• PREMIS satisfies the need for Preservation Description Information: provenance, context, reference, and fixity
• PREMIS is an elaboration and translation of the OAIS information model into implementable semantic units
Why do we need guidelines for using PREMIS with METS?
• Contents of each information package may vary depending on its function within a repository
• Need to determine how to include representation metadata and associate it with package components
• PREMIS data entities (objects, events, rights, agents) do not map perfectly to METS categories for representation metadata (techMD, digiProvMD, rightsMD, sourceMD)
• There are redundant elements between the two standards
• Both have extensibility mechanisms
• Flexibility of both standards requires implementation choices
• Predictability will enhance the ability for exchange with minimal human intervention
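One possible implementation choice can be sketched in code: wrap each PREMIS entity in a METS administrative metadata subsection. The placement table below (object in techMD, event and agent in digiprovMD, rights in rightsMD) is an assumption for illustration, not a statement of what the guidelines require; `PLACEMENT`, `ORDER`, and `wrap_in_amdsec` are hypothetical names, and the PREMIS namespace URI assumes version 2.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
PREMIS_NS = "info:lc/xmlns/premis-v2"   # assumption: PREMIS version 2

# Assumed placement of PREMIS entities in amdSec subsections.
PLACEMENT = {"object": "techMD", "event": "digiprovMD",
             "rights": "rightsMD", "agent": "digiprovMD"}
# Order in which the subsections are emitted inside amdSec.
ORDER = ["techMD", "rightsMD", "sourceMD", "digiprovMD"]


def wrap_in_amdsec(premis_elements):
    """Wrap (entity_name, element) pairs in mdWrap/xmlData under one amdSec."""
    amd = ET.Element(f"{{{METS_NS}}}amdSec")
    ordered = sorted(premis_elements, key=lambda pair: ORDER.index(PLACEMENT[pair[0]]))
    for n, (entity, element) in enumerate(ordered, start=1):
        section = ET.SubElement(amd, f"{{{METS_NS}}}{PLACEMENT[entity]}",
                                ID=f"{entity}-{n}")
        wrap = ET.SubElement(section, f"{{{METS_NS}}}mdWrap", MDTYPE="PREMIS")
        xml_data = ET.SubElement(wrap, f"{{{METS_NS}}}xmlData")
        xml_data.append(element)
    return amd


# Empty placeholder PREMIS elements, for illustration only.
event = ET.Element(f"{{{PREMIS_NS}}}event")
obj = ET.Element(f"{{{PREMIS_NS}}}object")
amd_sec = wrap_in_amdsec([("event", event), ("object", obj)])
print([child.tag.split("}")[1] for child in amd_sec])
```

Fixing such a table in advance is one way to get the predictability that exchange between repositories depends on.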
Guidelines for Using PREMIS with METS for Exchange
http://www.loc.gov/standards/premis/guidelines-premismets.pdf
Benefits of using PREMIS in METS
• Packages together metadata necessary for digital preservation in a predictable format
• PREMIS provides technical and event metadata
• METS provides structural metadata
• Both standards are
– openly available
– flexible
– extensible
– maintained by an open process
• Provides an exchange standard between repositories
Conclusions
• Information preservation supports the preservation of an organization’s identity
• An organization must have a preservation policy
• A preservation policy is realized by means of preservation metadata
• The PREMIS Data Dictionary provides a critical piece of reliable digital preservation infrastructure comprising technology, standards, and best practice
• The PREMIS Data Dictionary is a building block with which effective, sustainable digital preservation strategies can be implemented for various domains
• PREMIS is being widely implemented, and experience using it needs to be shared
URLs
• PREMIS Maintenance Activity: http://www.loc.gov/standards/premis/
• PREMIS Data Dictionary for Preservation Metadata, version 2.1: http://www.loc.gov/standards/premis/v2/premis-dd-2-1.pdf