An Introduction to Digital Preservation at the Library of Congress
Post on 17-Dec-2014
An Introduction to Digital Preservation
at the Library of Congress
Leslie Johnston, Library of Congress
NDIIPP: National Digital Information Infrastructure and Preservation Program
MISSION: Ensure access over time to a rich body of digital content through establishment of a national network of partners committed to selecting, collecting and preserving at-risk digital information.
http://www.digitalpreservation.gov/
NDIIPP
Learn By Doing
Catalyze Activity
Support Collaboration
NDIIPP Focus Areas
Digital Content
Partnerships: Government, Industry, Academia
Technical Infrastructure
Education
Access Drives Preservation
There are Important Non-Technical Issues
Legal: intellectual property, copyright, privacy, national security classification
Collaboration: new models needed for institutions and communities to work together
Institutional culture: staff need new skills, new policies need to be made, leaders need to integrate analog and digital
Cost: many cost variables; economic sustainability is an issue
Digital Content can be Copyrighted, Private, Confidential
Societal norms and expectations for privacy are shifting
Especially on the Internet
Data mining and other techniques allow for new kinds of access and new policies
Email – public and personal
Personal digital archives in special collections
Economic Issues
http://ncdd.nl/en/document/EnglishSummary.pdf
http://brtf.sdsc.edu/biblio/BRTF_Final_Report.pdf
Hard to know the ongoing costs for digital preservation, lots of variables
Institutions often need to support the preservation of analog and digital collections with tight budgets
Demonstrate value of preserved digital content through use and reuse
Organizational Issues
http://ncdd.nl/en/document/EnglishSummary.pdf
Digital preservation is a big challenge
New models are needed for institutions, communities to work together
Preservationists need to be involved much earlier in the lifecycle of a digital object
A variety of new skills and training opportunities are needed.
Examples of Digital Preservation Initiatives
Open Planets Foundation
European project using a solution adopted by national heritage organizations and others
National Archives and Records Administration
Developing Electronic Records Archives system to meet federal records management and archival needs
National Library of New Zealand
Developing National Digital Heritage Archive for digital collections
International Internet Preservation Consortium
Group of national libraries and other organizations collaborating in web content preservation and developing common tools
What are examples of some of the collecting and preservation challenges at the Library of Congress?
National Digital Newspaper Program
chroniclingamerica.loc.gov/
Some researchers want to search for stories in historic newspapers.
Some researchers want to mine newspaper OCR for trends across time periods and geographic areas. Requests have come in to analyze all 6 million pages.
The site gets approximately 5 million views per day.
The program has:
Multiple producers (25 now, ultimately 54)
Free and open public access
APIs for machine access and automated processes
Files
TIFFs, JPEGs, JPEG 2000s, and XML.
Over 6 million newspaper pages ingested to date
Over 250 TB of data
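The request to mine newspaper OCR for trends across time periods could look roughly like the following minimal sketch. It assumes the OCR text has already been exported as (year, text) pairs; that input shape, the function name, and the sample pages are hypothetical and are not part of the program's actual API.

```python
import re
from collections import Counter


def term_counts_by_year(pages, term):
    """Count whole-word, case-insensitive occurrences of `term` per year.

    `pages` is an iterable of (year, ocr_text) pairs. This input shape is
    an assumption made for illustration only.
    """
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    counts = Counter()
    for year, text in pages:
        counts[year] += len(pattern.findall(text))
    # Return the years in chronological order.
    return dict(sorted(counts.items()))


# Invented sample pages, standing in for real OCR output.
sample_pages = [
    (1898, "War news from Cuba. The war continues."),
    (1898, "Local fair opens."),
    (1905, "No war news today."),
]
trend = term_counts_by_year(sample_pages, "war")
```

Real OCR is noisy, so a production analysis would likely need fuzzy matching and normalization by the number of pages per year; the sketch leaves both out.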
eDeposit for eSerials
eDeposit for eSerials is a collaborative effort between the U.S. Copyright Office and the Library of Congress.
Copyright Mandatory Deposit represents the largest acquisitions channel for the Library. In general, all U.S. publishers are legally required to submit for deposit two copies of each of their publications to the Copyright Office. This mechanism has allowed the Library to build the collection and to preserve the publications.
eSerials became subject to mandatory deposit in January 2010, with the publication of a new interim regulation. Demands began in June 2010 and files began to arrive in October 2010.
The files must come to the Library “as published” – in whatever their original formats are. This means a wide variety of XML content and metadata, HTML, and PDFs. We have received 49 different file extensions…so far.
Packard Campus National Audio-Visual Center
Preserving Film, Broadcast Television, and Audio
The Packard Campus runs a variety of preservation workflows, including those for obsolete physical formats such as wire recordings, wax cylinders, and 2" videotape. The Campus is fully equipped to play back and preserve all antique film, video and sound formats, and to maintain that capability far into the future.
The facility also handles born-digital video and audio received directly from producers.
The formats include MPEG-4, MP3, BWF, AVI, and a wide variety of specialized commercial formats.
Over 3.5 PB of files.
Web Archiving
http://www.loc.gov/webarchiving/
lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html
The Library has been archiving the web since 2000. Subject area specialists curate the collections, and Library catalogers create collection-level metadata records.
Websites are complex objects
multiple formats
interrelated elements
distributed authors
ownership is not transparent
The concept of publishing on the Web doesn’t match the legal definition
The volume of content is immense
Website publishing technology is constantly changing
When we began archiving election web sites, we imagined users browsing through the web pages. But when our first researchers came to the Library, they wanted to mine the collections
Files
Every format possible on the web
Approximately 7 billion files
Over 400 TB
The Twitter Archive
Every public tweet since Twitter’s launch in March 2006.
The Library’s researcher services will not recreate Twitter, and the archive cannot be openly accessible.
Research requests have included users looking for their own Twitter history, the study of the geographic spread of news, the study of the spread of epidemics, and the study of the transmission of new uses of language.
The collection comprises only a few TB, but holds tens of billions of tweets.
A White Paper is available at http://blogs.loc.gov/loc/2013/01/update-on-the-twitter-archive-at-the-library-of-congress/
status
privacy
commercial
personal
events
social media
visualization
social science
Libraries/archives/museums have reasons to engage with individuals about personal digital preservation
May bring in personal digital collections
Raise institutional visibility
Answer patron questions
Guidance for the general public on saving their digital stuff: documents, photos, music, video, email, websites etc.
Public Events
Further How-to’s and tutorials
“Personal Archiving Day”
http://www.digitalpreservation.gov/personalarchiving/
Personal Digital Archiving
What are some of the technological challenges of managing and preserving large digital collections in many formats, and making them available for re-use?
Sheer amount.
Huge variation in file formats.
Unclear and undocumented rights.
Security.
Missing metadata.
Data citation and identifier issues.
Discovery expectations: discovery across collections and institutions together.
Cost.
I will mention infrastructure only in passing
There are scale issues related to:
Bandwidth
Storage
Backup and tape archiving
Software development
Staffing for processing
Preservation Architecture
There is no national preservation architecture, system, or storage backend.
Highly variable institution by institution, but commonalities in backend repository systems, ingest models, and discovery models.
Community- and discipline-based repositories, often with an unclear relationship to libraries or archives.
Multiple methods for certifying the trust level for a repository.
Agreed upon protocols and mechanisms for the transfer of files, but no single standard for the interchange of files and metadata between environments.
Synchronization and versioning are not just a technical challenge; they complicate management, preservation, and access.
And at the Library of Congress?
The Library has an active digital reformatting program across all formats.
The Library is currently modifying its preservation and collection security policies around digital collections.
The Library has repository services that inventory its file assets and maintain multiple copies of files on servers and on tape, in geographically distributed locations.
The Library developed the BagIt transfer specification for the movement of files between and within organizations.
http://www.digitalpreservation.gov/documents/bagitspec.pdf
The Library has documented sustainability factors for file formats. http://www.digitalpreservation.gov/formats/
For cases where we do have control over what comes in, we have a “Best Edition” Preferred Formats statement, which is currently being updated.
http://www.copyright.gov/circs/circ07b.pdf
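As an illustration of the BagIt transfer layout mentioned on this slide, here is a minimal sketch that packages a directory as a bag using only the Python standard library. It assumes the later 1.0 version of the specification (RFC 8493) and a SHA-256 manifest; the function name make_bag and the choice to copy rather than move the payload are assumptions for the example. This is not the Library's own tooling.

```python
import hashlib
import shutil
from pathlib import Path


def make_bag(source_dir, bag_dir):
    """Copy `source_dir` into a new bag at `bag_dir` and write the tag files.

    `bag_dir` must not already contain a `data` directory.
    """
    source = Path(source_dir)
    bag = Path(bag_dir)
    data = bag / "data"

    # The payload lives under data/; copytree creates missing parent folders.
    shutil.copytree(source, data)

    # The bag declaration: version and tag-file character encoding.
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # The payload manifest: one "<checksum>  <path relative to bag>" per file.
    lines = []
    for path in sorted(p for p in data.rglob("*") if p.is_file()):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(bag).as_posix()}\n")
    (bag / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")
    return bag
```

The receiving side can recompute each checksum and compare it with the manifest to confirm that nothing was lost or altered in transit. The sketch reads each file fully into memory, which would need to change for very large files.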
What are the Library’s strategies for formats?
The Library has documented sustainability factors for file formats.
http://www.digitalpreservation.gov/formats/
For cases where we do have control over what comes in, we have a “Best Edition” Preferred Formats statement, which is currently being updated.
http://www.copyright.gov/circs/circ07b.pdf
The Library is ready to start developing Digital Format Preservation Action Plans.
What are the Digital Preservation Services?
We must develop sufficient infrastructure for distributed, replicated preservation storage.
We will spend an increasing amount of time auditing our files and storage to ensure that no issues have arisen.
We may need to process all files to create a variety of derivatives that are more sustainable, and that might be required for various forms of use and analysis before ingesting them and providing access.
We must develop sufficient infrastructure to support large scale discovery.
We are comfortable with self-service through the institutional repository model, but can libraries ingest, manage and provide access to an increasing number of digital collections without any mediation?
We are providing quite a bit of guidance to researchers on digital preservation standards and personal digital preservation.
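The auditing of files and storage mentioned on this slide is commonly done as a fixity check: recompute each file's checksum and compare it with a stored value. The following is a minimal sketch under stated assumptions. The inventory is a plain dictionary of relative path to SHA-256 hex digest, and the names sha256_of and audit are hypothetical. It does not describe the Library's repository services.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(root, expected):
    """Compare files under `root` with an inventory of expected checksums.

    `expected` maps a relative path to its SHA-256 hex digest. The result
    maps each problem path to "missing" or "checksum mismatch"; an empty
    result means every listed file is present and unchanged.
    """
    problems = {}
    for rel_path, want in expected.items():
        target = Path(root) / rel_path
        if not target.is_file():
            problems[rel_path] = "missing"
        elif sha256_of(target) != want:
            problems[rel_path] = "checksum mismatch"
    return problems
```

Running the same audit against each replicated copy shows which copy is damaged, so that it can be repaired from a good one.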
And where are the digital preservation innovations?
The Cloud is a supplement – NOT a replacement – for local preservation storage resources.
In content characterization tools, such as JHOVE and DROID and FITS, so we can understand the risks inherent in the files in our collections.
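To give a feel for what content characterization involves, here is a deliberately simplified sketch that identifies a file's format from its leading "magic" bytes instead of trusting its extension. It is a stand-in for illustration only: JHOVE, DROID and FITS are far more thorough, and the signature table and function names below are assumptions chosen for this example.

```python
# A few well-known leading byte signatures. Real tools use much larger,
# curated signature registries and also validate internal structure.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
    (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "JPEG 2000 (JP2)"),
]


def identify(header: bytes) -> str:
    """Return a format name for the given leading bytes, or "unknown"."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def identify_file(path) -> str:
    """Read the first bytes of a file and identify its format."""
    with open(path, "rb") as handle:
        return identify(handle.read(16))
```

A file named report.pdf whose first bytes are not "%PDF-" would come back as "unknown" or as another format, which is the kind of mismatch that flags a preservation risk.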
In the adaptation and use of forensics tools for the creation of complete and authentic copies of unique digital media.
In virtualization and emulation technologies used to recreate environments needed for digital preservation and for access.
Preservation Partnerships are a Necessary Innovation
The Library cannot collect everything on its own, so works as part of:
The National Digital Stewardship Alliance http://www.digitalpreservation.gov/ndsa/
The International Internet Preservation Consortium http://netpreserve.org/about/index.php
among others…
What is Success for any Digital Preservation Initiative?
Success must be measured in concrete goals and deliverables that are widely and openly distributed.
Success is also measured in enthusiasm, participation, and in adoption by the community.
Summary
Digital information presents tough issues in terms of preservation and access
Libraries and archives must address these issues even though there are no ideal solutions and some open questions
Progress is evident through the application of shared concepts
Initiatives are underway around the world testing different approaches to preservation
There are a number of significant non-technical issues
Digital preservation is also relevant on the personal level
NDIIPP Digital Preservation Outreach
http://www.digitalpreservation.gov/formats/index.shtml
The Library of Congress “Sustainability of Digital Formats” site, which analyzes the preservation merits of a variety of digital file formats.
http://www.digitalpreservation.gov
The Library of Congress Digital Preservation web site
http://blogs.loc.gov/digitalpreservation
The NDIIPP blog “The Signal”: Where we post, and discuss, the many issues, news items and project updates about digital preservation and library technology, both inside and outside of the Library of Congress.
Leslie Johnston
Library of Congress
lesliej@loc.gov