20100401 정영임 da 전략 tft_0330



Page 1: 20100401 정영임 da 전략 tft_0330

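Considerations for Data Preservation across the Information Life Cycle

- Internal Seminar of the National Digital Archiving Strategy Research TF -

2010. 4. 1.

정영임, Knowledge Base Department, Information Distribution Division, Korea Institute of Science and Technology Information (KISTI)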


Table of Contents

1. Digital Archiving in the Framework of Information Life Cycle Management

2. Creation
3. Acquisition
4. Cataloging/Identification
5. Storage
6. Preservation
7. Access


Digital Archiving in the Framework of Information Life Cycle Management

Digital archiving framework
– Considered at all stages of information life cycle management
– Information life cycle
  • Creation
  • Acquisition
  • Cataloging/Identification
  • Storage
  • Preservation
  • Access


Creation

Creation
– Defined as the act of producing the information product in the broadest sense
– Should be regarded as the starting point of long-term archiving and preservation

Suggestion of provision of a preservation indicator for creators
– U.S. Department of Agriculture's Digital Publications Preservation Steering Committee

Establishment of guidelines for creators
– Oak Ridge National Laboratory, USA
  • A Guide To Record Series Supporting Epidemiological Studies Conducted for the Department of Energy
  • Limits on software
  • Format and layout of the documents


Creation

Adoption of Standard Descriptive Languages
– Standards groups incorporate XML and RDF architectures

Attachment of Metadata to Digital Content


Acquisition and Collection Development

Three main aspects of the acquisition of digital objects
– Collection policies
– Gathering methods
– Intellectual property concerns


Establishment of Collection Policies

Collection policies
– Selecting What to Archive
  • Purpose
    – For Dark Archiving: Back issues
    – For Light Archiving: Current issues
  • Criteria
    – Ease of content acquisition
    – Quality of contents
    – Utilization
    – On-going access fee
  • Content Type Coverage: E-journals/R&D Reports/Patents/Scientific Data
– Determining Extent
– Archiving Links
– Refreshing the Archived Contents


Considerations on Gathering Method

Gathering methods
– Hand selection
  • Value Judgment and Retention Scheduling (Edinburgh University Library)
    – Not preserved
    – Preserved for defined period
    – Preserved indefinitely
– Automatic selection
  • National Library of Sweden: Automatic acquisition without making value judgments (priority: periodicals, static documents, HTML pages >> conferences, usenet groups, ftp archives)
  • EVA project: Establishment of time limits to avoid overloading
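The three retention outcomes listed under hand selection can be sketched as a small data model. This is only an illustration: the `Retention`, `Record`, and `should_keep` names and the review-date rule are assumptions made for the example, not part of the Edinburgh University Library scheme.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Retention(Enum):
    # The three outcomes named on the slide.
    NOT_PRESERVED = "not preserved"
    DEFINED_PERIOD = "preserved for defined period"
    INDEFINITE = "preserved indefinitely"


@dataclass
class Record:
    title: str
    retention: Retention
    keep_until: Optional[date] = None  # only meaningful for DEFINED_PERIOD


def should_keep(record: Record, today: date) -> bool:
    """Return True while the record is still covered by its retention schedule."""
    if record.retention is Retention.INDEFINITE:
        return True
    if record.retention is Retention.DEFINED_PERIOD:
        # Assumed rule: keep through the scheduled end date, then release.
        return record.keep_until is not None and today <= record.keep_until
    return False
```

A record marked `DEFINED_PERIOD` without an end date is treated as not kept, which forces the value judgment to be recorded explicitly.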


Considerations on Intellectual Property Concerns

Reliance on Legislation
– Freedom of Information Act 2001
  • The public may have unrestricted access to certain records. (Consider what categories of information may need to be viewed by the public; these records need to remain accessible at all times.)
– In the absence of international digital deposit legislation, practice varies
  • PANDORA project seeks permission from the copyright owner
  • Swedish and Finnish national library projects do not contact the owners

Making Agreements with Content Providers
– E-journal: Publishers or academic associations
  • CLIR/DLF draft model license, NESLi2 Standard license model
  • Agreement of Cornell University with publishers
– Government document: Open to public
– Scientific data: individual creators or data centers
  • Arts and Humanities Data Service provides information on what is needed for a digital archive and what creators are likely to be willing to deposit


Agreement of Cornell University with Publishers

Topics identified in the agreement (Thomas and Kroch, 2000)
– The general responsibilities of the publishers and Cornell
– Characteristics of the data, accompanying metadata, and any additional documentation that are to be deposited
– Guidelines on transmission methods and media for deposit
– Procedures for the deposit
– Procedures and protocols Cornell will use to verify the arrival and completeness of the data
– Rights of the depositing organizations to audit the repository
– The respective roles, responsibilities, and rights of Cornell and the data producers with regard to the data
– Articulation of Cornell's responsibilities and capabilities with regard to the accessioning, description, management, and even transformation of the deposited data
– Access policies for users of the repository, and how they may vary over time
– Conditions on the use of the data, and again how they may vary over time
– Fees (if any) associated with the deposit
– Cornell's ability to share the data with partners to create an agreed-upon level of redundancy
– Clarification of issues surrounding copyright retained by authors


Identification and Cataloging

Identification
– Provision of a unique key for finding the digital object and linking the object to other related objects

Cataloging in the form of metadata
– Support for organization, access and curation


Persistent Identification

Problems in using URL as Identifier
– Use of the server as location identifier can result in a lack of persistence over time, both for the source object and for any linked objects
– URLs nevertheless continue to be used

New approaches to persistent identification
– OCLC: PURLs
– ACS: Digital Object Identifier (DOI), MN (Manuscript Number)
– DTIC: Handle® system
– AAS: Bibcode, PubRef numbers
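What the approaches above share is a level of indirection: citations carry a stable identifier, and a resolver maps it to the current location. The sketch below illustrates that idea only. The `Resolver` class, the `hypothetical:0001` identifier, and the example.org URLs are invented for the example and do not reproduce the PURL, DOI, or Handle protocols.

```python
from typing import Dict


class Resolver:
    """Maps a persistent identifier to the object's current location."""

    def __init__(self) -> None:
        self._locations: Dict[str, str] = {}

    def register(self, pid: str, url: str) -> None:
        self._locations[pid] = url

    def move(self, pid: str, new_url: str) -> None:
        # Only the resolver entry changes; every citation of `pid` stays valid.
        if pid not in self._locations:
            raise KeyError(f"unknown identifier: {pid}")
        self._locations[pid] = new_url

    def resolve(self, pid: str) -> str:
        return self._locations[pid]


resolver = Resolver()
resolver.register("hypothetical:0001", "http://old-server.example.org/report.pdf")
resolver.move("hypothetical:0001", "http://new-server.example.org/archive/report.pdf")
```

After the move, `resolver.resolve("hypothetical:0001")` returns the new location, whereas a citation that embedded the old URL directly would now be broken.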


Creation of Metadata at Cataloging Stage (1/3)

Creation Method of Metadata
– Manual creation of metadata
– Automatic generation of metadata
  • A project by US Environmental Protection Agency
  • Defense Information Technology Testbed project


Creation of Metadata at Cataloging Stage (2/3)

Formats of Descriptive Metadata
– E-journal
  • Full MARC cataloging
    – Traditional library cataloging standards
    – NLA's PANDORA Archive
  • Current development of descriptive metadata standards
    – MARCXML, MODS (Metadata Object Description Schema)
– Web-based resources
  • Dublin Core-like format
  • EVA project
– Non-textual data
  • Identification of metadata elements needed for non-textual data types such as images, video, multimedia and others
    – Z39.87 NISO/AIIM Technical metadata for digital still images
    – AES X089 core audio metadata


Creation of Metadata at Cataloging Stage (3/3)

Management of Heterogeneous Metadata Formats
– Translation between various metadata formats
– Key to the development of networked, heterogeneous archives
– Adoption of packaging metadata standards
  • Open Archival Information System (OAIS) Reference Model
    – Developed by the Consultative Committee for Space Data Systems (CCSDS) and adopted as an ISO standard (ISO 14721)
    – Encapsulates specific metadata as needed for each object type in a consistent data model
  • Metadata Encoding and Transmission Standard (METS)
    – Produced by the Library of Congress Standards Office and the Digital Library Federation
    – Provides a framework for holding all types of metadata for a digital object
  • Others
    – MPEG-21 Digital Item Declaration Language
    – IMS Global Learning Consortium Content Packaging Standards
    – Sharable Content Object Reference Model (SCORM)
    – CCSDS XML Packaging scheme
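Translation between metadata formats is commonly implemented as a crosswalk table. The sketch below maps a handful of MARC tags to Dublin Core elements. The four-entry table is a deliberately simplified illustration (real crosswalks handle subfields and many more tags), and the record values and the `crosswalk` helper are hypothetical.

```python
from typing import Dict, List

# Simplified illustration: MARC tag -> Dublin Core element.
MARC_TO_DC: Dict[str, str] = {
    "100": "creator",    # main entry, personal name
    "245": "title",      # title statement
    "260": "publisher",  # publication information
    "650": "subject",    # topical subject heading
}


def crosswalk(marc_record: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Translate a tag -> values record into Dublin Core element -> values."""
    dc: Dict[str, List[str]] = {}
    for tag, values in marc_record.items():
        element = MARC_TO_DC.get(tag)
        if element is None:
            continue  # unmapped fields are dropped: translation can lose detail
        dc.setdefault(element, []).extend(values)
    return dc


# Hypothetical sample record.
record = {
    "100": ["Example, Author"],
    "245": ["An Example Report"],
    "650": ["Digital preservation", "Metadata"],
    "999": ["local note"],
}
dublin_core = crosswalk(record)
```

The dropped `999` field shows why translation alone is lossy, which is one motivation for packaging standards that carry the original metadata alongside any translated form.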


Development of Technical Model for Storage

Recommendations for developing a technical model for the repository (Cornell University)
– Establishing a baseline of e-journal software and file format needs
– Specifying the archival repository
– Specifying monitoring tools that will flag documents within the repository that require migration
– Specifying a baseline hardware and software infrastructure to house the repository
– Exploring the need and implementation models for redundancy in the repository
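A monitoring tool of the kind recommended above could, in its simplest form, compare each stored document's format against a watch list. The sketch below is a rough assumption-laden illustration: the `StoredDocument` type, the `flag_for_migration` function, and the contents of `AT_RISK_FORMATS` are all hypothetical and are not taken from the Cornell proposal.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set

# Hypothetical watch list of formats the repository plans to migrate away from.
AT_RISK_FORMATS: Set[str] = {"wordperfect", "legacy-word"}


@dataclass
class StoredDocument:
    identifier: str
    file_format: str


def flag_for_migration(
    docs: Iterable[StoredDocument],
    at_risk: Set[str] = AT_RISK_FORMATS,
) -> List[str]:
    """Return identifiers of documents whose format is on the watch list."""
    return [d.identifier for d in docs if d.file_format.lower() in at_risk]


holdings = [
    StoredDocument("doc-1", "PDF"),
    StoredDocument("doc-2", "WordPerfect"),
    StoredDocument("doc-3", "SGML"),
]
flagged = flag_for_migration(holdings)
```

Keeping the watch list as data rather than code lets the repository update it as formats age without changing the tool itself.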


Issues on Changing Storage Media

Problem of changing storage media
– Block size, tape size and tape drive mechanisms have changed over time.

Common Solution
– Data migration to new storage systems
  • High cost and imperfect transfer are still issues.
  • Check/validation algorithms are extremely important.
  • Manual checking is still necessary.
  • Atmospheric Radiation Monitoring Center plans to migrate to new storage systems every 4-5 years
    – Each data migration will take 6-12 months
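One common form of check/validation for a storage migration is a fixity check: compute a cryptographic digest of each file before and after the copy and compare them. The sketch below uses SHA-256 from Python's standard `hashlib`. The function names are illustrative, and the slides do not say which algorithms any of the cited centers actually use.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so that large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_migration(source: Path, target: Path) -> bool:
    """Return True if the migrated copy is bit-for-bit identical to the source."""
    return sha256_of(source) == sha256_of(target)
```

A matching digest only shows that the bits were copied intact. It says nothing about whether the content is still interpretable on the new system, which is consistent with the point above that manual checking remains necessary.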


Issues on Terabytes of Data Storage

Problem of dealing with large-scale data
– Extensive validation routines are needed to ensure the quality of the information as it is migrated
  • NCBI has 30 Ph.D.s reviewing the information manually, even after it has passed a variety of validation algorithms
  • Similar costs are incurred for
    – Corrections and additions to particular records
    – Maintenance of a history of changes
    – Approval by the owner of all changes controlled by NCBI

Common Solution
– Large-scale data can be stored in different file formats
  • Biological sequence data is held in simple ASCII files for preservation purposes.
  • Data in a structured database is provided for searching, reporting and maintenance
– Extensive tasks can be transitioned to a non-profit consortium
  • Protein Data Bank: Research Collaboratory for Structural Bioinformatics


Preservation

Long-term preservation
– No common agreement on the definition of long-term preservation

Main aspects of preservation
– Selection of digital preservation strategies/technologies
– Cycle for hardware/software migration
  • No specific investigation of the hardware/software migration cycle has been done.
  • Depending on the particular technologies and subject disciplines, it can vary from 2 to 10 years.
– Preservation of the “look and feel” of digital contents


Digital Preservation Strategies

Bitstream Copying
Refreshing
Durable/Persistent Media
Technology Preservation
Digital Archaeology
Analog Backups
Migration (SW, HW migration)
Replication
Reliance on Standards
Normalization
Canonicalization
Emulation
Encapsulation
Universal Virtual Computer


Hardware and Software Migration

Problems with Migration
– Migration is not guaranteed to work for all data types
– Migration of information products that use sophisticated software features is unreliable
– Generally, there is no backward compatibility, and where it is possible, there is certainly loss of integrity in the result.

Emulation as an alternative to migration
– Encapsulates the behavior of the hardware/software with the objects
  • MS Word 2000 document with metadata indicating how to reconstruct the document at the engineering level
– Creates an emulation registry identifying the HW/SW environment and providing information on how to recreate the environment
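An emulation registry as described above can be pictured as a lookup from an object's format to a description of the environment needed to render it. The sketch below is a minimal illustration under that assumption. The `Environment` fields, the `EmulationRegistry` class, and every value in the sample entry are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Environment:
    hardware: str
    operating_system: str
    application: str
    rebuild_notes: str  # how to recreate the environment


class EmulationRegistry:
    """Identifies the HW/SW environment required for each registered format."""

    def __init__(self) -> None:
        self._by_format: Dict[str, Environment] = {}

    def register(self, file_format: str, environment: Environment) -> None:
        self._by_format[file_format] = environment

    def lookup(self, file_format: str) -> Optional[Environment]:
        # None signals that no environment has been recorded for this format.
        return self._by_format.get(file_format)


registry = EmulationRegistry()
registry.register(
    "MS Word 2000 document",
    Environment(
        hardware="hypothetical x86 PC",
        operating_system="hypothetical desktop OS image",
        application="MS Word 2000",
        rebuild_notes="hypothetical pointer to installation media and setup steps",
    ),
)
```

A lookup that returns `None` is itself useful information: it marks formats for which no rendering environment has been documented yet.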


Advantages and Disadvantages of Preservation Strategies


Selection of Preservation Strategies

A schematic diagram for the selection of preservation techniques for digital information.

(Lee et al., 2002)


Preservation of the Look and Feel

Format of materials
– In order to save the “look and feel” of material
  • TIFF
    – The most prevalent format for organizations involved in the conversion of paper back files
      » E.g., JSTOR
    – This does not allow the embedded references to be active hyperlinks
  • SGML/HTML
    – Used by many large publishers after years of converting publication systems from proprietary formats to SGML
    – American Astronomical Society (AAS)
  • PDF
    – The most prevalent format for purely electronic documents, used for both formal publications and grey literature
    – National Library of Sweden
    – Concerns remain for long-term preservation
      » It may not be accepted as a legal depository form because of its proprietary nature


Normalization vs. Native Formats

Normalization
– Process of converting the native format to a standard format
  • AAS, ACS transform the incoming file into SGML-tagged ASCII format
    – Electronic master copy is able to serve as the robust electronic archival copy.
    – Well-tagged copy can be updated periodically, at very little cost.
    – It takes advantage of advances in both technology and standards.
      » Content remains unchanged, but the public electronic version can be updated to remain compatible with browsers and other access technology
– Examples of data normalization provided by the data community
  • NASA Distributed Active Archive Centers
    – Transform incoming satellite and ground monitoring information into standard Common Data Format
  • U.K.'s National Digital Archive of Datasets
    – Transforms the native format into one of its own devising
  • Normalized formats are considered to be the archival versions
– Intellectual property question


Reliance on Standards

Emphasis on Standards
– DOE OSTI
  • Limited the number of acceptable input formats
  • Text in SGML (and its relatives HTML and XML), PDF, WordPerfect and Word
  • Image in TIFF Group 4 and PDF Image


Preservation Strategies Used in Major Projects


CSI: CISTI Csi, ECO: OCLC Electronic Collections Online, EJO: OhioLINK Electronic Journal Center, KB: KB e-Depot, KOP: Kopal DDB, LA: LOCKSS Alliance, LANL: Los Alamos National Laboratory Research Library, NLA: National Library of Australia PANDORA, OSP: Ontario Scholars Portal, PMC: PubMed Central, PORT: Portico


Issues on Access

Access Mechanisms
– Access and display mechanisms
  • Providing access
  • Restricting access

Rights Management and Security Requirements
– Security and version control
– Creation of metadata to manage encryption, watermarks, digital signatures


Access Mechanisms

Providing Access
– NLM's Profiles in Science
  • Creates an electronic archive of the photographs, text, video, etc.
  • Electronic archive is used to create new access versions as access mechanisms change
– Providing access technologies
  • Super Distribution
  • Value-chain support

Restricting Access
– Usage rule
– Persistent protection


Access

Rights Management and Security Requirements
– Most difficult access issues for digital archiving
– Security and version control impact digital archiving
  • Rights management includes providing or restricting access as appropriate
  • Content protection technologies
    – Contents Encryption
    – Trusted Environment
– Metadata for managing encryption, watermarks and digital signatures needs to be created.


References

CLIR, 2002. The State of Digital Preservation: An International Perspective [online] [cited 2009-07-23]

Hodge, 2000. Best Practices for Digital Archiving: An Information Life Cycle Approach, D-Lib Magazine: 6(1) [online] [cited 2009-07-23] <http://www.dlib.org/dlib/january00/01hodge.html>

Hodge et al., 2004. Digital Preservation and Permanent Access to Scientific Information [online] [cited 2009-07-23]

ICPSR, 2009. Digital Preservation Management: Implementing Short-term Strategies for Long-term Problems [online] [cited 2009-12-03] <http://www.icpsr.umich.edu/dpm/index.html>

Kenney, A. R., Entlich, R., Hirtle, P. B., McGovern, N. Y. and Buckley, E. L., 2006. E-Journal Archiving Metes and Bounds: A Survey of the Landscape [online] [cited 2009-12-03]

Lee, K., Slattery, O., Lu, R., Tang, X. and McCrary, V., 2002. The State of the Art and Practice in Digital Preservation, Journal of Research of the National Institute of Standards and Technology: 107(1), 93-106.

Thomas, S. E. and Kroch, C. A., 2000. Project Harvest: The Cornell University Library's Proposal to The Andrew W. Mellon Foundation To Develop a Repository for E-Journals [online] [cited 2010-03-26] <http://www.diglib.org/preserve/cornell-prop.htm>

Edinburgh University Library Digital Archives Research Project. A report and recommendations
