leslie johnston: challenges of preserving every digital format, 2012

The Challenges of Preserving Every Digital Format on the Face of the Planet Leslie Johnston March 26, 2012

Upload: lljohnston

Post on 25-May-2015


TRANSCRIPT

Page 1: Leslie Johnston: Challenges of Preserving Every Digital Format, 2012

The Challenges of Preserving Every Digital Format on the Face of the Planet

Leslie Johnston

March 26, 2012

Page 2

Well, not every format

But we often have little or no control over what comes into the Library of Congress Digital Collections, and we manage and preserve a wide variety of formats.

Page 3

What are examples of some of the collecting and preservation challenges?

Page 4

NATIONAL DIGITAL NEWSPAPER PROGRAM chroniclingamerica.loc.gov/

A partnership between the National Endowment for the Humanities and the Library of Congress:

Enhance access to American newspapers

Sustainable digital collection

Scalable, phased, cost-effective management

The program has:

Multiple producers (25 now, ultimately 54)

Digitization standards (http://loc.gov/ndnp/)

Free and open public access

APIs for machine access and automated processes

Files

TIFFs, JPEGs, JPEG 2000s, and XML.

Over 4 million newspaper pages ingested to date

Over 250 TB of data

Page 5
Page 6

WEB ARCHIVING http://www.loc.gov/webarchiving/ lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html

The Library has been archiving the web since 2000. Subject area specialists curate the collections, and Library catalogers create collection-level metadata records. The collections include:

U.S. elections

Web sites created by members of the House and Senate

Thematic collections around events, such as elections in the Philippines, the Iraq war, and the appointment of Supreme Court Justices.

Collections around an area of study, such as Legal “Blawgs”

The file formats include every format possible on the web. The collection comprises approximately 5 billion files in 300 TB.

Page 7
Page 8

NATIONAL DIGITAL INFORMATION INFRASTRUCTURE & PRESERVATION PROGRAM digitalpreservation.gov

Page 9

CONTENT TYPES

Images and Text

Audio Visual

Geospatial

Web Sites

Page 10

PACKARD CAMPUS NATIONAL AUDIO-VISUAL CENTER

Preserving Film, Broadcast Television, and Audio

The Packard Campus supports a variety of preservation workflows, including those for obsolete physical formats such as wire recordings, wax cylinders, and 2" videotape. The Campus is fully equipped to play back and preserve all antique film, video and sound formats, and to maintain that capability far into the future.

The facility also handles born-digital video and audio received directly from producers.

The formats include MPEG-4, MP3, BWF, AVI, and a wide variety of specialized commercial formats.

Page 11

eDEPOSIT FOR eSERIALS

eDeposit for eSerials is a collaborative effort between the U.S. Copyright Office and the Library of Congress.

Copyright Mandatory Deposit represents the largest acquisitions channel for the Library. In general, all U.S. publishers are legally required to submit for deposit two copies of each of their publications to the Copyright Office. This mechanism has allowed the Library to build the collection and to preserve the publications.

eSerials became subject to mandatory deposit in January 2010, with the publication of a new interim regulation. Demands began in June 2010 and files began to arrive in October 2010.

The files must come to the Library “as published”, in whatever formats they were originally issued. This means a wide variety of XML content and metadata, HTML, and PDFs.

Page 12

WORLD DIGITAL LIBRARY www.wdl.org

Deliver historically significant primary materials from cultures around the world to an international multilingual audience

Over 100 participating partner institutions, and contributions from over 40 institutions so far.

Representing all 193 UNESCO member countries.

Maps, prints, photographs, rare books, manuscripts, journals, sound recordings, and motion pictures.

Metadata in Arabic, Chinese, French, English, Portuguese, Russian, and Spanish.

JPEG 2000s, PDFs, XML.

Page 13
Page 14

THE TWITTER ARCHIVE

Every public tweet since Twitter’s launch in March 2006.

We have a historic 2006-2010 archive and ongoing access to new tweets.

We do not receive personal account information, linked images, or linked web page content.

Tweets will not move into the archive until six months after their initial posting.

The Library’s researcher services will not recreate Twitter, and the archive cannot be openly accessible.

We are testing various technologies, and entering a pilot phase with test researchers. We will announce it when the archive is open to all researchers.

The collection comprises only a few TB, but over 80 billion tweets.

An FAQ is available online at: http://blogs.loc.gov/loc/2010/04/the-library-and-twitter-an-faq/

Page 15

So how are we making this easier for the Library to manage?

Page 16

Preservation Infrastructure

• The Library developed the BagIt transfer specification for the movement of files between and within organizations.

• http://www.digitalpreservation.gov/documents/bagitspec.pdf

• The Library inventories all incoming files, and is inventorying all digital content.

• We maintain multiple copies of files on servers and on tape, in geographically distributed locations.
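
To make the BagIt idea concrete, here is a minimal Python sketch of what packaging and checking a transfer might look like. It is an illustration only, not the Library's tooling: the function names (md5_of, make_bag, verify_bag) are hypothetical, MD5 is just one checksum algorithm a bag may use, and the sketch writes only the bagit.txt declaration, the data/ payload directory, and a payload manifest, leaving out the optional parts of the specification. The version string 0.97 is an assumption about the draft current at the time of this talk.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_bag(source_dir: Path, bag_dir: Path) -> None:
    """Copy source_dir into bag_dir/data and write the bag declaration and manifest."""
    payload = bag_dir / "data"
    shutil.copytree(source_dir, payload)  # also creates bag_dir itself
    # Bag declaration; the version number is an assumption (see lead-in).
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    # One manifest line per payload file: "<checksum>  <path relative to the bag>".
    lines = []
    for path in sorted(p for p in payload.rglob("*") if p.is_file()):
        relative = path.relative_to(bag_dir).as_posix()
        lines.append(f"{md5_of(path)}  {relative}\n")
    (bag_dir / "manifest-md5.txt").write_text("".join(lines), encoding="utf-8")


def verify_bag(bag_dir: Path) -> list[str]:
    """Return the payload paths that are missing or no longer match the manifest."""
    failures = []
    manifest = (bag_dir / "manifest-md5.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        expected, relative = line.split(maxsplit=1)
        target = bag_dir / relative
        if not target.is_file() or md5_of(target) != expected:
            failures.append(relative)
    return failures


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "incoming"
        source.mkdir()
        (source / "page1.txt").write_text("front page", encoding="utf-8")
        bag = Path(tmp) / "bag"
        make_bag(source, bag)
        print(verify_bag(bag))  # prints []
        # Simulate damage to one copy of the file.
        (bag / "data" / "page1.txt").write_text("corrupted", encoding="utf-8")
        print(verify_bag(bag))  # prints ['data/page1.txt']
```

The same manifest can serve as a simple inventory of what arrived, and running the check against each stored copy is one way to confirm that geographically distributed copies still agree.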

Page 17

Preservation Partnerships

The Library cannot collect everything on its own, so it works as part of:

The National Digital Stewardship Alliance http://www.digitalpreservation.gov/ndsa/

The International Internet Preservation Consortium http://netpreserve.org/about/index.php

among others…

Page 18

What are the Library’s strategies for formats?

• The Library has documented sustainability factors for file formats.

• http://www.digitalpreservation.gov/formats/

• For cases where we do have control over what comes in, we have a “Best Edition” Preferred Formats statement, which is currently being updated.

• http://www.copyright.gov/circs/circ07b.pdf

• The Library is developing Format Preservation Action Plans.

Page 19

DISCUSSION?

Leslie Johnston

Chief of Repository Development

Manager of Technical Architecture Initiatives, [email protected]