the role of file formats in digital preservation: opportunities and threats erpatraining on file...

30
The Role of File Formats in Digital Preservation: Opportunities and Threats ErpaTraining on File Formats for Preservation Vienna, May 10-11, 2004 Frank Moehle [email protected] Swiss Federal Archives Project ARELDA www.bundesarchiv.ch

Upload: arleen-burke

Post on 13-Jan-2016

214 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: The Role of File Formats in Digital Preservation: Opportunities and Threats ErpaTraining on File Formats for Preservation Vienna, May 10-11, 2004 Frank

The Role of File Formats in Digital Preservation: Opportunities and Threats

ErpaTraining on File Formats for Preservation

Vienna, May 10-11, 2004

Frank Moehle [email protected]

Swiss Federal Archives

Project ARELDA www.bundesarchiv.ch

Page 2: The Role of File Formats in Digital Preservation: Opportunities and Threats ErpaTraining on File Formats for Preservation Vienna, May 10-11, 2004 Frank

Project ARELDA Frank Moehle 2

Agenda

1. Introduction to File Formats2. Text like Documents3. Images and Graphics4. Audio and Video5. Conclusions

Page 3: The Role of File Formats in Digital Preservation: Opportunities and Threats ErpaTraining on File Formats for Preservation Vienna, May 10-11, 2004 Frank

Project ARELDA Frank Moehle 3

Data vs. File Format

Some popular data formats:HTML, SGML, XML, SQL, etc.

But all of them can be stored in the same file format…

Text File

SQL

HTML

XML

Page 4: The Role of File Formats in Digital Preservation: Opportunities and Threats ErpaTraining on File Formats for Preservation Vienna, May 10-11, 2004 Frank

Project ARELDA Frank Moehle 4

What a File Format is

A file is nothing but a sequence of bits

A file format defines how to interpret the contents of a file.

Typical file formats have a file header (metadata), followed by actual (primary) data

Page 5: The Role of File Formats in Digital Preservation: Opportunities and Threats ErpaTraining on File Formats for Preservation Vienna, May 10-11, 2004 Frank

Project ARELDA Frank Moehle 5

Text vs. binary files

P18 80 0 1 1 0 0 1 11 1 0 0 1 1 0 00 0 1 1 0 0 1 11 1 0 0 1 1 0 00 0 1 1 0 0 1 11 1 0 0 1 1 0 00 0 1 1 0 0 1 11 1 0 0 1 1 0 0

P48 8aFaFaFaF

Page 6: The Role of File Formats in Digital Preservation: Opportunities and Threats ErpaTraining on File Formats for Preservation Vienna, May 10-11, 2004 Frank

Project ARELDA Frank Moehle 6

Proprietary vs. Open Formats

Proprietary Full Documentation

not always available License and patent

rules may apply License agreements

subject to change Restrictions for use

and modifications may apply

Open Unlimited use No license fees No patent owners Full Documentation

Available Open for self-made

modifications

Page 7: The Role of File Formats in Digital Preservation: Opportunities and Threats ErpaTraining on File Formats for Preservation Vienna, May 10-11, 2004 Frank

Project ARELDA Frank Moehle 7

Digital Preservation Needs

Integrity uncorrupted, safe storage

Understandability Intellectual understandability of content and context

Originality Data structure and appearance

Authenticity Authentication of author and authentification of provenance and evidence

Accessibility Readable and usable

Page 8: The Role of File Formats in Digital Preservation: Opportunities and Threats ErpaTraining on File Formats for Preservation Vienna, May 10-11, 2004 Frank

Project ARELDA Frank Moehle 8Archive Managementaccepts the OAIS responsibilities

Producer

Registered Producers

~ OAIS Archive

Business Plan, Strategic Goals

AIP

SIP

Ingest

Descriptive Metadata Data

Management

AIP

Access

Descriptive Metadata

DIP

Agreements

Standards, PoliciesPreservation Planning

Monitoring

User Request

Archival Storage

AdministrationCustomer Services

Consumer

Designated Communities

request

The Preservation Process

Page 9: The Role of File Formats in Digital Preservation: Opportunities and Threats ErpaTraining on File Formats for Preservation Vienna, May 10-11, 2004 Frank

Project ARELDA Frank Moehle 9

The Ingest Process

The ingest process needs Unambiguous file format specs Validation tools for quality assurance File format converters

And is supported by Standardization organizations (ISO, …) Registries (PRONOM)

Page 10: The Role of File Formats in Digital Preservation: Opportunities and Threats ErpaTraining on File Formats for Preservation Vienna, May 10-11, 2004 Frank

Project ARELDA Frank Moehle 10

Kind of File Formats

The diversity of file types requires a whole set of file format specs for Text documents Data bases Still and animated graphics Audio recordings Video sequences Etc.

Page 11: The Role of File Formats in Digital Preservation: Opportunities and Threats ErpaTraining on File Formats for Preservation Vienna, May 10-11, 2004 Frank

Project ARELDA Frank Moehle 11

Requirements

Syntactical and semantical specs are public

Standardized by an recognized organization like ISO, ANSI, ITEF, …

Widely used and accepted Free of patent and license fees

Page 12: The Role of File Formats in Digital Preservation: Opportunities and Threats ErpaTraining on File Formats for Preservation Vienna, May 10-11, 2004 Frank

Project ARELDA Frank Moehle 12

Requirements (cont’d)

Free of any cryptographical techniques (risk of loosing keys)

No compression techniques (loss of reduncancies)

Specification should be self-contained

File format should be media-independent

Page 13: The Role of File Formats in Digital Preservation: Opportunities and Threats ErpaTraining on File Formats for Preservation Vienna, May 10-11, 2004 Frank

Project ARELDA Frank Moehle 13

Migration of Archival Data

Migration is needed due to technological obsolescences

Migration isThe periodic transfer from one HW and/or SW configuration to another, or from one generation of computer technology to a subsequent generation.

Page 14: The Role of File Formats in Digital Preservation: Opportunities and Threats ErpaTraining on File Formats for Preservation Vienna, May 10-11, 2004 Frank

Project ARELDA Frank Moehle 14

Migration Pros and Cons

Migration is good to Preserve integrity of digital data Retain ability to retrieve/access/use data Exploit technological progress

But … Exact digital copying is not always possible Maintaining compatibility not guaranteed Migration is time-consuming, costly, complex

and error-prone

Page 15: The Role of File Formats in Digital Preservation: Opportunities and Threats ErpaTraining on File Formats for Preservation Vienna, May 10-11, 2004 Frank

Project ARELDA Frank Moehle 15

Text-like Documents (1)

Office documents (often MS Word or Excel Files) can be preserved as PostScript, PDF, DSSSL, RTF, ASCII,

SGML, TIFF, CGM, PNG PostScript, PDF, RTF are proprietary DSSSL, SGML not (yet?) widely used CGM has multiple variants in use

Page 16: The Role of File Formats in Digital Preservation: Opportunities and Threats ErpaTraining on File Formats for Preservation Vienna, May 10-11, 2004 Frank

Project ARELDA Frank Moehle 16

Text-like Documents (2)

‘Best practice’ in the Swiss Federal Archive: PDF and TIFF No proprietary extensions allowed Private fields and values will be ignored Mulitpage TIFF allowed with restrictions

Pages belong to the same document or Pages display the same content with

different resolutions

Page 17: The Role of File Formats in Digital Preservation: Opportunities and Threats ErpaTraining on File Formats for Preservation Vienna, May 10-11, 2004 Frank

Project ARELDA Frank Moehle 17

Database Preservation

Extraction

Record creatingoffice

Swiss FederalArchives

A1

A0 A1

Database

Ingestion Use

Customers:- Offices- Citizens

Database

Database

Database in archival format

Electronicarchives

Original database Restored database

Swiss FederalArchives

A2

Analysing and Interpreting

Metadata

Verifying Reload

Page 18: The Role of File Formats in Digital Preservation: Opportunities and Threats ErpaTraining on File Formats for Preservation Vienna, May 10-11, 2004 Frank

Project ARELDA Frank Moehle 18

Database Extraction

A0analysing and

archiving

Archive files =archived database

a) Analyses the data logic (e.g. schemas, tables, constraints, views)b) Shows results of analysis in GUIc) Interactive dialog to modify results of analysis (e.g. include/exclude objects)d) Write export files with data logic (XML reference file, SQL3 DDL files) and table content (flat files)

SQL-Server

SQL3 DDLfile

flat files withtable content

create … 1,'TEXT',2,…

XMLreference file

<!ELEMENT …

Oracle Adabas D Basis Plus

DB2

Informix

MySQL

PostgreSQL

Sybase MS Access

Most commondatabaseproducts in thefederaladministration

Expert mode (JDBCinterface with scripting

language)

Generic mode(JDBC interface)

Page 19: The Role of File Formats in Digital Preservation: Opportunities and Threats ErpaTraining on File Formats for Preservation Vienna, May 10-11, 2004 Frank

Project ARELDA Frank Moehle 19

Raster Image Example

Page 20: The Role of File Formats in Digital Preservation: Opportunities and Threats ErpaTraining on File Formats for Preservation Vienna, May 10-11, 2004 Frank

Project ARELDA Frank Moehle 20

Image Compression

Lossy compression: Loss of Quality JPEG, JPEG2000

Lossless compression TIFF, PBM, PNG, etc.

Page 21: The Role of File Formats in Digital Preservation: Opportunities and Threats ErpaTraining on File Formats for Preservation Vienna, May 10-11, 2004 Frank

Project ARELDA Frank Moehle 21

Loss of Redundancy

Only one bit of a Byte is corrupted in this image:No error recovery due to compression!

Page 22: The Role of File Formats in Digital Preservation: Opportunities and Threats ErpaTraining on File Formats for Preservation Vienna, May 10-11, 2004 Frank

Project ARELDA Frank Moehle 22

Lossy Compression2 kB JPEG

4 kB JPEG

8 kB JPEG

Page 23: The Role of File Formats in Digital Preservation: Opportunities and Threats ErpaTraining on File Formats for Preservation Vienna, May 10-11, 2004 Frank

Project ARELDA Frank Moehle 23

Vector Graphic Files

Vector graphics are composed of geometric figures, splines, polynomials, etc.

Thus, scaling, rotating, and other operations can be performed efficiently

Possible formats: PS, PDF, SVG

Page 24: The Role of File Formats in Digital Preservation: Opportunities and Threats ErpaTraining on File Formats for Preservation Vienna, May 10-11, 2004 Frank

Project ARELDA Frank Moehle 24

Vector Graphics Example

Page 25: The Role of File Formats in Digital Preservation: Opportunities and Threats ErpaTraining on File Formats for Preservation Vienna, May 10-11, 2004 Frank

Project ARELDA Frank Moehle 25

Preservation Strategy

End-user formats Preservation formats

Short-term file formats for usage, e.g. GIF, MP3, etc

File formats for long-term preservation, e.g. TIFF, WAVE, etc.

Derived from (On the fly or caching)

Page 26: The Role of File Formats in Digital Preservation: Opportunities and Threats ErpaTraining on File Formats for Preservation Vienna, May 10-11, 2004 Frank

Project ARELDA Frank Moehle 26

Opportunities

Up-to-date documents can be read/accessed

Workable documents can be processed with current software on current hardware

Quality may be improved and new features may become available

Page 27: The Role of File Formats in Digital Preservation: Opportunities and Threats ErpaTraining on File Formats for Preservation Vienna, May 10-11, 2004 Frank

Project ARELDA Frank Moehle 27

Opportunities (cont’d)

Up-to-date file formats require less education than historical ones

Up-to-date and widely used file formats are better supported by software producers than old, obsolete file formats

Documents are “alive” as they have to be migrated periodically

Page 28: The Role of File Formats in Digital Preservation: Opportunities and Threats ErpaTraining on File Formats for Preservation Vienna, May 10-11, 2004 Frank

Project ARELDA Frank Moehle 28

Threats

Migration is not an established, uniform process

Format conversion has an intrinsic risk of data corruption

The migration process requires a high effort (e.g. quality assurance)

Too many file formats require too much resources for tracking their development

Page 29: The Role of File Formats in Digital Preservation: Opportunities and Threats ErpaTraining on File Formats for Preservation Vienna, May 10-11, 2004 Frank

Project ARELDA Frank Moehle 29

Threats (cont’d)

New features of new file formats may affect derivative creation

Migration affects watermarks, signatures, or other cryptographic techniques for “fixity”

Unpredictable migration cycles make staff planning difficult

Page 30: The Role of File Formats in Digital Preservation: Opportunities and Threats ErpaTraining on File Formats for Preservation Vienna, May 10-11, 2004 Frank

Project ARELDA Frank Moehle 30

Q & A

Q & ADiscussion