importance of infrastructure independence

Importance of Infrastructure Independence. http://dice.unc.edu [email protected] Richard Marciano: Professor @ SILS; Chief Scientist for Persistent Archives and Digital Preservation @ RENCI; Executive Director @ Data Intensive Cyber Environments (DICE) Center; Lab Director of Sustainable Archives & Library Technologies (SALT) @ DICE Center


Post on 05-Jan-2016


Page 1: Importance of Infrastructure Independence

Importance of Infrastructure Independence

http://dice.unc.edu

[email protected]

Richard Marciano

Professor @ SILS

Chief Scientist for Persistent Archives and Digital Preservation @ RENCI

Executive Director @ Data Intensive Cyber Environments (DICE) Center

Lab Director of Sustainable Archives & Library Technologies (SALT) @ DICE Center

Page 2: Importance of Infrastructure Independence

Vice Chancellor

Research and Economic

Development

Tony Waldrop

Executive Director

Richard Marciano

Director

Reagan Moore

Director of R&D

Arcot Rajasekar

Mike Wan

Wayne Schroder

Sheau-yen Chen

Lucas Gilbert

Bing Zhu

Richard Marciano

Professor SILS

Sustainable Archives &

Library Technologies

(SALT)

Arcot Rajasekar

Professor SILS

Data Grid and

Policy

Helen Tibbo

Professor SILS

Digital Curation @

Carolina

Jose-Marie Griffiths

Javed Mostafa

Jan Prins

STAFF

RESEARCH FELLOWS RESEARCH UNITS

Paul Tooby

Antoine de Torcy

STAFF

TBD, Program Officer

Chien-yi Hou

STAFF

Doctoral Fellows:

Michael Brown

Heather Bowden

Kaitlin Costello

Sarah Ramdeen

STAFF

Chien-yi Hou

TBD

STAFF

Page 3: Importance of Infrastructure Independence

Sustainable Archives & Library Technologies

SALT: a metaphor for “Data Curation”?

• annotation
• collection
• conditioning
• preservation
• current & future use

“We define this discipline of ‘data curation’ as the practice of collection, annotation, conditioning and preservation of data for both current and future use.”

Helen Tibbo & Bryan Heidorn

Page 4: Importance of Infrastructure Independence

Curation Topics

• Motivation for curation
  • Access today
  • Preservation - access tomorrow
• Preservation community concepts
  • Representation information for records
  • Representation information for policies
  • Representation information for processes
• Rule-oriented data systems
  • Automate curation processes
  • Enforce curation policies
  • Verify assertions about curation results

Page 5: Importance of Infrastructure Independence

Curation Processes

• Extract record from the creation environment and import into the digital library or preservation environment
• Context
  • Properties about the record creation
  • Description of the record content
  • Description of the record type
  • Description of the record structure
• Assert that the record can be viewed and manipulated in the future

Page 6: Importance of Infrastructure Independence

Curation (Preservation) Is an Active Process

• Preservation is communication with the future
  • Continually migrate the record from the current data management environment into the next management environment
  • At the point in time when the migration occurs, both the old and new technologies are present
• Use data grids to support interoperability across technologies
  • Manage the name spaces for identifying records, archivists, storage systems
  • Decouple access mechanisms from storage systems

Page 7: Importance of Infrastructure Independence

Maintain Control of the Curation Environment

• Insert data management infrastructure between the records and the current technology
  • Distributed server architecture
• Protect the records from changes in the environment
  • Ensure that the curation properties are maintained
  • Ensure that the curation policies are enforced
  • Verify assessment criteria

Page 8: Importance of Infrastructure Independence

Use Cases (1)

• DCAPE: Distributed Custodial Archival Preservation Environments
  • Build a distributed production preservation environment that meets the needs of archival repositories for trusted archival preservation services
  • Develop preservation policies for state archives, university archives and cultural institutions
  • Use iRODS to implement and deliver the resulting services

Page 9: Importance of Infrastructure Independence

DCAPE: Distributed Custodial Archival Preservation

Purpose:

Build a distributed production preservation environment that meets the needs of archival repositories for trusted archival preservation services

Distributed partnership of 11 institutions: 33 people

* STATES:

- California

- Kansas

- Michigan

- Kentucky

- North Carolina

- New York

* UNIVERSITIES:

- Tufts University

- West Virginia University

- UNC (SILS/RENCI)

* CULTURAL ENTITIES:

- Getty Research Institute

* INTERNATIONAL PARTNERS:

- Carleton University (Geomatics and Cartographic Research Centre)

Richard Marciano, Professor SILS
Reagan Moore, Professor SILS
Chien-yi Hou, Research Associate SILS
John Gallagher, Dir. of Research Mgt. and Admin RENCI

Kelly Eubank, Electronic Records Archivist
Druscie Simpson, IT Administrator
David Minor, Programmer
Ed Southern, State Archivist
Jennifer Ricker, Digital Collections Manager
Amy Rudersdorf, Director of Digital Information Mgt.

Page 10: Importance of Infrastructure Independence

Overview of iRODS Architecture

Delivery of Preservation Services

[Diagram labels: Archivist A (automatic replication service requested); Archivist B (validation service for a collection); iRODS Data System; iRODS Metadata Catalog; NC State Archives; NC State Library; Getty Research Inst.]

Services can be invoked for automatic replication, generation of audit trails, e-mail notification of activity, ingestion of multiple files, format obsolescence, etc.

Page 11: Importance of Infrastructure Independence

What are Data Grids?

Data Grids are “middleware services”
• Software that sits between applications and data sources

So, What?

Page 12: Importance of Infrastructure Independence

What are Data Grids Good For?

Data Grids allow you to access data:
• In any format
  • Files, databases, streams, web, programs,…
  • Documents, images, data, sensor packets, tables,…
• Stored in any type of storage system
  • File Systems, tape silos, object ring buffers, sensor streams,
• Stored anywhere over a wide area network
  • Across organizational, administrative and security boundaries
• Without having to know the system addresses, paths, protocols, commands, etc. needed to retrieve it!

Page 13: Importance of Infrastructure Independence

What are Data Grids Good For?

• Scalability
  • Millions of files
  • Petabytes of Data
• Evolvability
  • Infrastructure Independence
  • Across Generations of Software
• Extensibility
  • Deal with Technologies not yet Dreamed of

Page 14: Importance of Infrastructure Independence

What are Data Grids Good For?

• Collections Managed by the DICE Center:

• 1+PetaBytes, 170+ Million files

• Multi-disciplinary Scientific Data

• Astronomy, Cosmology

• Neuro Science, Cell-Signalling & other Bio-medical Informatics

• Environmental & Ecological Data

• Educational (web) & Research Data (Chem, Phys,…)

• Earthquake Data, Seismic Simulations

• Real-time Sensor Data

• Growing at 1TB a day

• Supporting large projects: TPAP, TeraGrid, NVO, SCEC, SEEK/Kepler, GEON, ROADNet, JCSG, AfCS, SIO Explorer, SALK, PAT …

Page 15: Importance of Infrastructure Independence

Data Grids

• Storage Resource Broker (SRB)
  • Initially funded by DARPA in 1996
  • Current version is 3.5.0, released Dec 3, 2007
  • Production system used internationally
• Integrated Rule-Oriented Data System (iRODS)
  • Funded by NSF SDCI and NARA
  • Current version is 1.1, released June 2008

Page 16: Importance of Infrastructure Independence

Design Implications

• Heterogeneous storage systems
  • Data stored in file systems, archives, databases
• Global name spaces
  • Files
  • Users
  • Resources
• Persistent access controls
  • Constraints between name spaces
• Consistent state information
  • Properties of files, collections, resources, users

Page 17: Importance of Infrastructure Independence

Data Grids

• Data virtualization
  • Provide the persistent, global identifiers needed to manage distributed data
  • Provide standard operations for interacting with heterogeneous storage systems
  • Provide standard actions for interacting with clients
• Trust virtualization
  • Manage authentication and authorization
  • Enable access controls on data, metadata, storage
• Federation
  • Controlled sharing of name spaces, files, and metadata between independent data grids
  • Data grid chaining / Central archives / Master-slave data grids / Peer-to-Peer data grids
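The data virtualization idea above can be illustrated with a minimal sketch. It assumes a toy in-memory catalog; the names `Replica`, `LogicalCatalog`, the resource names and the paths are all hypothetical and are not part of SRB or iRODS. The point it shows is that clients hold only the persistent logical name, so a record can move to new storage without the name changing.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    resource: str       # hypothetical storage resource name
    physical_path: str  # location inside that resource


@dataclass
class LogicalCatalog:
    """Toy stand-in for a metadata catalog: logical name -> replicas."""
    entries: dict = field(default_factory=dict)

    def register(self, logical_name, replica):
        self.entries.setdefault(logical_name, []).append(replica)

    def resolve(self, logical_name):
        # Clients only ever see the logical name; replicas are looked up here.
        return list(self.entries.get(logical_name, []))

    def migrate(self, logical_name, old_resource, new_replica):
        """Swap the replica held on old_resource; the logical name is unchanged."""
        self.entries[logical_name] = [
            new_replica if r.resource == old_resource else r
            for r in self.entries[logical_name]
        ]


catalog = LogicalCatalog()
name = "/archive/records/r1.pdf"  # hypothetical logical name
catalog.register(name, Replica("disk-a", "/vol1/0001"))
catalog.register(name, Replica("tape-b", "T42:17"))

# Retire the old disk: same logical name, new physical home.
catalog.migrate(name, "disk-a", Replica("object-c", "bucket/r1"))
resources = [r.resource for r in catalog.resolve(name)]
```

After the migration the record is still found under the same logical name, which is the property that makes storage technology replaceable.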

Page 18: Importance of Infrastructure Independence

What are Data Grids Good For?

TPAP - NARA Transcontinental Persistent Archive Prototype

Federation of Seven Independent Data Grids

[Diagram: seven sites, each with its own MCAT: U Md, UCSD, Georgia Tech, NARA II, NARA I, Rocket Center, U NC]

Extensible Environment, can federate with additional research and education sites. Each data grid uses different vendor products.

Page 19: Importance of Infrastructure Independence

Federation Across Spatial Scales

• International collaborations
  • Australian Research Collaboration Service (ARCS)
  • Sustaining Heritage Access through Multivalent ArchiviNg (SHAMAN)
  • Cinegrid
• National collaborations
  • Temporal Dynamics of Learning Center (TDLC)
  • Ocean Observatories Initiative (OOI)
  • NARA Transcontinental Persistent Archive Prototype (TPAP)
• Regional collaborations
  • LSU data grid
  • HASTAC humanities data grid
  • Distributed Custodial Archive Preservation Environment (DCAPE)
• State collaborations
  • RENCI data grid
  • North Carolina State Library
• Institutional repositories
  • Carolina Digital Repository
  • SIO Repository

Page 20: Importance of Infrastructure Independence

Ten Years of Data Grid 1.0 - What’s Missing

• Automatic Policy Execution
• Increasing Scale
• Managing System Administration
• Visualization
• Virtualization
• Customization

Page 21: Importance of Infrastructure Independence

Data Grids 2.0 – Policies in Action!

Specify Policies
• “Make X Copies of Accessioned Records”

Break Policies Down into Rules
• “Put one copy at Rocket Center”
• “Put one copy at UCSD”
• “Verify Copies are Identical”

Break Rules Down into Micro-Services
• “Put one copy at Rocket Center.”
  • Read File
  • Copy File
  • Create Checksum
  • Copy Checksum
  • Etc.

Micro-Services Can Be Combined into Complex Workflows
Execute them: Periodically, On-demand, Delayed Start, Anywhere on the network
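The policy, rule and micro-service layering on this slide can be sketched roughly as follows. This is an illustration only, assuming storage sites modeled as in-memory dictionaries; the function names are hypothetical and are not real iRODS micro-service names.

```python
import hashlib

# Micro-services: small single-purpose operations (hypothetical names).
def read_file(store, site, name):
    return store[site][name]

def copy_file(store, src_site, dst_site, name):
    store.setdefault(dst_site, {})[name] = read_file(store, src_site, name)

def create_checksum(store, site, name):
    return hashlib.sha256(read_file(store, site, name)).hexdigest()

# Rules: short chains of micro-services.
def put_one_copy_at(store, src_site, dst_site, name):
    copy_file(store, src_site, dst_site, name)
    return create_checksum(store, dst_site, name)

def verify_copies_identical(store, sites, name):
    # All sites must report the same checksum for the record.
    return len({create_checksum(store, s, name) for s in sites}) == 1

# Policy: "make X copies of accessioned records", expressed as rules.
def replicate_policy(store, src_site, target_sites, name):
    for site in target_sites:
        put_one_copy_at(store, src_site, site, name)
    return verify_copies_identical(store, [src_site, *target_sites], name)


store = {"ingest": {"record-001": b"accessioned record"}}
ok = replicate_policy(store, "ingest", ["Rocket Center", "UCSD"], "record-001")
```

Keeping each micro-service tiny is what lets the same pieces be recombined into other workflows, such as a periodic integrity check that only runs the verification rule.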

Page 22: Importance of Infrastructure Independence

Rule-based Data Management

• Associate Rules with Combinations of:
  • Data Objects
  • Collections
  • User Groups
  • Storage Systems
• For Example:
  • Particular User Groups when Accessing a Particular Collection
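One way to picture rules attached to combinations of collections, user groups and storage systems is a small matcher like the sketch below. The `Rule` class, its fields and the action names are assumptions made for illustration, not the iRODS rule language; a field left as `None` is treated as "any".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    action: str                        # hypothetical action name
    collection: Optional[str] = None   # None means "any collection"
    user_group: Optional[str] = None   # None means "any user group"
    storage: Optional[str] = None      # None means "any storage system"

    def matches(self, collection, user_group, storage):
        pairs = (
            (self.collection, collection),
            (self.user_group, user_group),
            (self.storage, storage),
        )
        return all(want is None or want == got for want, got in pairs)


def select_actions(rules, collection, user_group, storage):
    """Return the actions of every rule that applies to this access."""
    return [r.action for r in rules if r.matches(collection, user_group, storage)]


rules = [
    Rule("write_audit_record"),  # applies to every access
    Rule("require_irb_approval", collection="clinical", user_group="students"),
    Rule("replicate_to_tape", storage="disk-a"),
]

# A particular user group accessing a particular collection.
actions = select_actions(rules, "clinical", "students", "disk-b")
```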

Page 23: Importance of Infrastructure Independence

Evolution of Data Grid Technology

• Shared collections
  • Enable researchers at multiple institutions to collaborate on research by sharing data
  • Focus was on performance, scalability
• Digital libraries
  • Support provenance information and discovery
  • Integrated with digital library front end services
• Preservation environments
  • Support preservation policies
  • Build rule-based data management system

Page 24: Importance of Infrastructure Independence


Infrastructure Independence

• Use data grids to preserve records independently of the choice of technology
  • Management of archives properties
• Map technology components to preservation principles
  • Capabilities that support preservation requirements
• Construct preservation environment from components
  • Archival engineering perspective
• Use infrastructure independence to enable use of new technology
  • View that new technology is an opportunity instead of a challenge

Page 25: Importance of Infrastructure Independence

Overview of iRODS Architecture

Overview of iRODS Data System

[Diagram labels: User (can search, access, add and manage data & metadata); iRODS Data System; iRODS Data Server (disk, tape, etc.); iRODS Metadata Catalog (keeps track of data)]

*Access data with Web-based Browser or iRODS GUI or Command Line clients.

Page 26: Importance of Infrastructure Independence
Page 27: Importance of Infrastructure Independence

Building a Shared Collection

Have collaborators at multiple sites, each with different administration policies, different types of storage systems, different naming conventions.

Assemble a self-consistent, persistent distributed shared collection

[Diagram labels: UNC @ Chapel Hill, Duke, NCSU, DB]

Page 28: Importance of Infrastructure Independence

Using a Data Grid - Details

[Diagram labels: iRODS Server #1 (Rule Engine); iRODS Server #2 (Rule Engine); Metadata Catalog (Rule Base, DB)]

• User asks for data
• Data request goes to iRODS Server #1
• Server looks up information in catalog
• Catalog tells which iRODS server has data
• 1st server asks 2nd for data
• The 2nd iRODS server applies rules
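The request path on this slide (user, first server, catalog lookup, second server, rules) can be walked through with a toy model. Everything here is a hypothetical sketch: the `Catalog` and `Server` classes, the user names and the logical path are invented for illustration, and the "rule engine" is reduced to a single predicate.

```python
class Catalog:
    """Toy metadata catalog: logical name -> name of the server holding the data."""

    def __init__(self):
        self.locations = {}

    def lookup(self, logical_name):
        return self.locations[logical_name]


class Server:
    """Toy data grid server with a one-function 'rule engine'."""

    def __init__(self, name, catalog, rule=None):
        self.name = name
        self.catalog = catalog
        self.peers = {}
        self.files = {}
        # Default rule allows every request.
        self.rule = rule or (lambda user, logical_name: True)

    def request(self, user, logical_name):
        # The contacted server asks the catalog which server holds the data.
        holder_name = self.catalog.lookup(logical_name)
        holder = self if holder_name == self.name else self.peers[holder_name]
        # The holding server applies its own rules before releasing data.
        return holder.serve(user, logical_name)

    def serve(self, user, logical_name):
        if not self.rule(user, logical_name):
            raise PermissionError(f"{user} may not read {logical_name}")
        return self.files[logical_name]


catalog = Catalog()
server1 = Server("server1", catalog)
server2 = Server("server2", catalog, rule=lambda user, name: user == "alice")
server1.peers["server2"] = server2

path = "/zone/home/alice/survey.csv"  # hypothetical logical name
server2.files[path] = b"survey results"
catalog.locations[path] = "server2"

# The user contacts server1, but the data and the rule check live on server2.
data = server1.request("alice", path)
```

The user never needs to know which server stores the file; the rule is enforced where the data lives, regardless of which server received the request.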

Page 29: Importance of Infrastructure Independence

iRODS - Integrated Rule Oriented Data System

1. Shared collection assembled from data distributed across remote storage locations
2. Server-side workflow environment in which procedures are executed at remote storage locations
3. Policy enforcement engine, with computer actionable rules applied at the remote storage locations
4. Validation environment for assessment criteria
5. Consensus building system for establishing a collaboration (policies, data formats, semantics, shared collection)

Page 30: Importance of Infrastructure Independence

Overview of iRODS Architecture

iRODS Shows Unified “Virtual Collection”

[Diagram labels: User with client, views & manages data; user sees single “Virtual Collection”; My Data (disk, tape, database, filesystem, etc.); Partner’s Data (disk, tape, database, filesystem, etc.)]

The iRODS Data Grid installs in a “layer” over existing or new data, letting you view, manage, and share part or all of diverse data in a unified Collection.

Page 31: Importance of Infrastructure Independence


Engineering Approach

• Preservation Principles - enumerate assertions
  • Authenticity, integrity, chain of custody, respect des fonds
• Preservation Standards - select relevant set
  • Architecture, metadata, submission, format, assessment
• Preservation Engineering - define capabilities
  • Infrastructure independence, scalability, federation
• Preservation Technology - integrate components
  • Data grids, digital libraries, workflows
• Preservation Management - automate policies
  • Interoperability, policies, capabilities, verification

Page 32: Importance of Infrastructure Independence

Preservation Standards

• Architectural Model
  • OAIS, Reference Model for an Open Archival Information System
  • Representation information for each record
  • Submission / Archival / Dissemination Information Package (SIP / AIP / DIP)
  • Data grid - Storage Resource Broker (SRB), integrated Rule Oriented Data System (iRODS)
  • Digital Library - DSpace services, Fedora digital library middleware
• Metadata
  • Dublin Core
  • LCDRG, NARA Life Cycle Data Requirements Guide
  • PREMIS, Preservation Metadata Implementation Strategies
• Metadata organization
  • MPEG-21, ISO/IEC TR 21000-1: MPEG-21 Multimedia Framework
  • METS, Metadata Encoding and Transmission Standard
  • OAIS, Reference Model for an Open Archival Information System
• Submission / Harvesting
  • Producer Archive Interface (NASA)
  • OAI-PMH, Open Archives Initiative - Protocol for Metadata Harvesting
• Data format
  • pdf, xml, (4000 formats retrievable on web crawls)
• Assessment criteria
  • RLG/NARA TRAC - Trustworthy Repositories Audit & Certification: Criteria and Checklist.
    http://wiki.digitalrepositoryauditandcertification.org/pub/Main/ReferenceInputDocuments/trac.pdf

Page 33: Importance of Infrastructure Independence

Policy-Virtualization: Automate Operations

• System-centric Policies & Obligations:
  • Manage retention, disposition, distribution, replication, integrity, authenticity, chain of custody, access controls, representation information, descriptive information requirement, logical arrangement, audit trails, authorization, authentication
• Domain-specific Policies:
  • Identification & Extraction of Metadata
  • Ingestion Control for Provenance Attribution
  • Processing of Data on Ingestion
    • Creation of multi-resolution images, type-identification, anonymization,…
  • Processing of Data on Access
    • IRB Approval for data access, Data sub-setting, Merging of multiple images, conversion, redaction, …

Page 34: Importance of Infrastructure Independence

Evolution of Data Grid Technology

• Shared collections
  • Enable researchers at multiple institutions to collaborate on research by sharing data
  • Focus was on performance, scalability
• Digital libraries
  • Support provenance information and discovery
  • Integrated with digital library front end services
• Preservation environments
  • Support preservation policies
  • Build rule-based data management system
• Differ in choice of management policies