National Archives and Records Administration Integrated Rules Ordered Data System (IRODS)...
National Archives and Records Administration
Integrated Rules Ordered Data System ("IRODS") Technology Research:
Digital Preservation Technology in a SOA Technical Context
Robert Chadduck
Principal Technologist
Electronic Records Archives Program
The National Archives and Records Administration
Synopsis of 18 April 2007 Invited Presentation by Dr. Reagan Moore, Ph.D.
Distinguished Scientist
San Diego Supercomputer Center
to NITRD HCI&IM Coordinating Group
Open Source, University-based Technology Research collaboratively supported by NSF/Office of CyberInfrastructure & NARA
Scientific Data CollectionsScientific Data Collections
Reagan W. Moore
Wayne Schroeder
Mike Wan
Arcot Rajasekar
Richard Marciano
{moore, schroede, mwan, sekar, marciano}@sdsc.edu
http://www.sdsc.edu/srb
http://irods.sdsc.edu/
Data Collections
• NSF Cyberinfrastructure projects
  • Digital holdings for a scientific discipline
• Simulation applications
  • Output from supercomputers
• Real-time sensor systems
  • Observational data
• Scientific laboratories
  • Experimental data
Scientific Data Management
• Data collections
  • Data organization
• Data grids
  • Data sharing
• Data publication
  • Digital libraries
• Data preservation
  • Persistent archives
• SDSC uses generic data grid technology to support all data management applications
GBs = GBs of data stored; Files = 1000's of files; Users = users with ACLs. A "-" marks a cell with no value in the source.

                   |     5/17/02      |          6/30/04           |          4/23/07
Project            |     GBs |  Files |     GBs |   Files |  Users |     GBs |   Files |  Users
Data Grid
NSF / NVO          |  17,800 |  5,139 |  51,380 |   8,690 |     80 | 119,278 |  17,828 |    100
NSF / NPACI        |   1,972 |  1,083 |  17,578 |   4,694 |    380 |  36,514 |   7,483 |    380
Hayden             |   6,800 |     41 |   7,201 |     113 |    178 |   8,013 |     161 |    227
Pzone              |     438 |     31 |     812 |      47 |     49 |  25,681 |  14,793 |     68
NSF / LDAS-SALK    |     239 |      1 |   4,562 |      16 |     66 | 193,959 |     196 |     67
NSF / SLAC-JCSG    |     514 |     77 |   4,317 |     563 |     47 |  20,620 |   2,152 |     55
NSF / TeraGrid     |       - |      - |  80,354 |     685 |  2,962 | 293,539 |   8,038 |  3,267
NIH / BIRN         |       - |      - |   5,416 |   3,366 |    148 |  20,800 |  33,748 |    424
NCAR               |       - |      - |       - |       - |      - |   1,567 |       8 |      2
LCA                |       - |      - |       - |       - |      - |   1,834 |      39 |      2
Digital Library
NSF / LTER         |     158 |      3 |     233 |       6 |     35 |     260 |      41 |     36
NSF / Portal       |      33 |      5 |   1,745 |      48 |    384 |   2,620 |      53 |    460
NIH / AfCS         |      27 |      4 |     462 |      49 |     21 |     733 |      94 |     21
NSF / SIO Explorer |      19 |      1 |   1,734 |     601 |     27 |   2,750 |   1,202 |     27
NSF / SCEC         |       - |      - |  15,246 |   1,737 |     52 | 168,931 |   3,545 |     73
LLNL               |       - |      - |       - |       - |      - |  13,784 |   1,374 |      5
CHRON              |       - |      - |       - |       - |      - |   6,398 |   2,064 |      5
Persistent Archive
NARA               |       7 |      2 |      63 |      81 |     58 |   3,793 |   4,983 |     58
NSF / NSDL         |       - |      - |   2,785 |  20,054 |    119 |   5,699 |  50,600 |    136
UCSD Libraries     |       - |      - |     127 |     202 |     29 |     190 |     208 |     29
NHPRC / PAT        |       - |      - |       - |       - |      - |   1,888 |     521 |     28
RoadNet            |       - |      - |       - |       - |      - |   2,608 |     975 |     30
UCTV               |       - |      - |       - |       - |      - |   7,359 |       2 |      5
LOC                |       - |      - |       - |       - |      - |   9,693 |     256 |      8
Earth Sci          |       - |      - |       - |       - |      - |   3,794 |     511 |      5
TOTAL              |   28 TB |  6 mil |  194 TB |  40 mil |  4,635 |  961 TB | 153 mil |  5,516
Data Management Challenges
• Authenticity
  • Manage descriptive metadata for each file
  • Manage access controls
  • Manage consistent updates to administrative metadata
• Integrity
  • Manage checksums
  • Replicate files
  • Synchronize replicas
  • Federate data grids
• Infrastructure independence
  • Manage collection properties
  • Manage interactions with storage systems
  • Manage distributed data
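The integrity items above (manage checksums, replicate files, synchronize replicas) can be illustrated with a minimal sketch. It is not how SRB or iRODS implement these functions; the function names, the file names, and the choice of SHA-256 are assumptions made for illustration only.

```python
from __future__ import annotations

import hashlib
import tempfile
from pathlib import Path


def file_checksum(path: Path, algorithm: str = "sha256") -> str:
    """Return the hex digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_stale_replicas(recorded: str, replicas: list[Path]) -> list[Path]:
    """Return the replicas whose current checksum differs from the recorded one."""
    return [p for p in replicas if file_checksum(p) != recorded]


def demo() -> list[str]:
    """Create one original and two replicas, corrupt one, and report it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        original = root / "record.dat"
        original.write_bytes(b"archival record")
        # In a data grid this value would be kept as administrative metadata.
        recorded = file_checksum(original)

        good = root / "replica_a.dat"
        good.write_bytes(b"archival record")
        bad = root / "replica_b.dat"
        bad.write_bytes(b"archival recorD")  # simulated corruption

        return [p.name for p in find_stale_replicas(recorded, [good, bad])]


if __name__ == "__main__":
    print(demo())
```

A replica flagged this way would then be re-synchronized from a copy whose checksum still matches the recorded value.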
Generic Infrastructure
• Data grids manage data distributed across multiple types of storage systems
  • File systems, tape archives, object ring buffers
• Data grids manage collection attributes
  • Provenance, descriptive, system metadata
• Data grids manage technology evolution
  • At the point in time when new technology is available, both the old and new systems can be integrated
Data Grids
• SRB - Storage Resource Broker
  • Persistent naming of distributed data
  • Management of data stored in multiple types of storage systems
  • Organization of data as a shared collection with descriptive metadata, access controls, audit trails
• iRODS - integrated Rule-Oriented Data System
  • Rules control execution of remote micro-services
  • Manage persistent state information
  • Validate assertions about collection
  • Automate execution of management policies
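Persistent naming over multiple storage systems can be pictured as a catalog that maps a stable logical name to whatever physical replicas currently hold the data. The sketch below is a toy model under that assumption; the class names, resource names, and paths are hypothetical and do not reflect the SRB or iRODS catalog schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Replica:
    resource: str       # logical name of a storage resource (hypothetical)
    physical_path: str  # location inside that storage system


@dataclass
class LogicalCatalog:
    """Maps persistent logical names to the replicas that currently hold the data."""
    entries: dict[str, list[Replica]] = field(default_factory=dict)

    def register(self, logical_name: str, replica: Replica) -> None:
        self.entries.setdefault(logical_name, []).append(replica)

    def migrate(self, logical_name: str, old_resource: str, new_replica: Replica) -> None:
        """Replace the copy on old_resource; the logical name does not change."""
        kept = [r for r in self.entries[logical_name] if r.resource != old_resource]
        self.entries[logical_name] = kept + [new_replica]

    def resolve(self, logical_name: str) -> list[Replica]:
        return list(self.entries[logical_name])


if __name__ == "__main__":
    catalog = LogicalCatalog()
    name = "/zone/collection/record-001"
    catalog.register(name, Replica("old-tape-archive", "/tape/0042/rec001"))
    catalog.migrate(name, "old-tape-archive", Replica("new-disk-pool", "/pool/a/rec001"))
    print(catalog.resolve(name))
```

Users and applications keep referring to the logical name, so a storage migration only changes the catalog entry.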
Preservation Management

Data Management Environment    | Conserved Properties | Control Mechanisms  | Remote Operations
Management Functions           | Assessment Criteria  | Management Policies | Capabilities
  Data grid - Management virtualization
Data Management Infrastructure | Persistent State     | Rules               | Micro-services
  Data grid - Data and trust virtualization
Physical Infrastructure        | Database             | Rule Engine         | Storage System
iRODS - integrated Rule-Oriented Data System
Rule-based Data Management
• Map from management policies to rules controlling execution of remote micro-services
• Manage persistent state information for results of each micro-service execution
• Support an additional three logical name spaces
  • Rules
  • Micro-services
  • Persistent state information
• Constitutes representation information for preservation environments
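The mapping from a policy to rules, micro-services, and persistent state can be sketched as three lookup tables, one per logical name space. This is a toy model, not the iRODS rule engine: the rule name, the micro-service names (msi_checksum, msi_size), and the state layout are all hypothetical.

```python
from __future__ import annotations

import hashlib
from typing import Callable

# A micro-service here takes an object's bytes and returns a result to record.
MicroService = Callable[[bytes], str]


def msi_checksum(data: bytes) -> str:
    """Hypothetical micro-service: compute a SHA-256 checksum."""
    return hashlib.sha256(data).hexdigest()


def msi_size(data: bytes) -> str:
    """Hypothetical micro-service: report the object size in bytes."""
    return str(len(data))


# Micro-service name space: logical name -> implementation.
MICRO_SERVICES: dict[str, MicroService] = {"checksum": msi_checksum, "size": msi_size}

# Rule name space: a policy name mapped to an ordered set of micro-services.
RULES: dict[str, list[str]] = {"ingest_policy": ["checksum", "size"]}


def apply_rule(rule_name: str, object_name: str, data: bytes,
               state: dict[tuple[str, str], str]) -> None:
    """Run each micro-service of the rule and record its result as persistent state."""
    for service_name in RULES[rule_name]:
        state[(object_name, service_name)] = MICRO_SERVICES[service_name](data)


if __name__ == "__main__":
    persistent_state: dict[tuple[str, str], str] = {}
    apply_rule("ingest_policy", "record-001", b"abc", persistent_state)
    print(persistent_state)
```

Because every result is stored under a named key, a later assessment can be expressed as a query over the recorded state rather than a re-run of the services.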
Example Rules
• Rule composed of four parts:
  • Name | condition | micro-service set | recovery
• Rule to automate replication of data for a specific collection
acPostProcForPut |$objPath like /tempZone/home/rods/nvo/* | msiSysReplDataObj(nvoReplResc,null) | nop
• Rule types
  • Internal, administrative, user-defined
  • Atomic, deferred, periodic
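The four-part rule layout above can be split and checked with a short sketch. It is not the iRODS parser: splitting naively on "|" and treating "like" with "*" as a shell-style wildcard are assumptions made for illustration, and the object paths in the usage lines are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass
class Rule:
    name: str
    condition: str
    micro_services: str
    recovery: str


def parse_rule(text: str) -> Rule:
    """Split 'name | condition | micro-service set | recovery' into its four parts."""
    parts = [part.strip() for part in text.split("|")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 parts, got {len(parts)}")
    return Rule(*parts)


def condition_matches(condition: str, variables: dict[str, str]) -> bool:
    """Evaluate only the '$var like pattern' form; an empty condition always matches."""
    if not condition:
        return True
    variable, operator, pattern = condition.split(None, 2)
    if operator != "like" or not variable.startswith("$"):
        raise ValueError(f"unsupported condition: {condition!r}")
    # Assumption: '*' behaves as a wildcard, as in shell-style matching.
    return fnmatchcase(variables[variable[1:]], pattern)


if __name__ == "__main__":
    rule = parse_rule(
        "acPostProcForPut |$objPath like /tempZone/home/rods/nvo/* "
        "| msiSysReplDataObj(nvoReplResc,null) | nop"
    )
    print(rule)
    print(condition_matches(rule.condition, {"objPath": "/tempZone/home/rods/nvo/image.fits"}))
    print(condition_matches(rule.condition, {"objPath": "/tempZone/home/rods/other/image.fits"}))
```

Under these assumptions the rule fires only for objects stored beneath the nvo collection, which is what limits the replication policy to that collection.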
Management Virtualization
• Standard policies expressed as rules
  • Integrity
    • Validation of checksums
    • Synchronization of replicas
    • Data distribution
    • Data retention
    • Access controls
  • Authenticity
    • Chain of custody - audit trails
    • Required preservation metadata - templates
    • Generation of AIPs, DIPs
New Capabilities
• Management capabilities
  • Rules to validate assessment criteria
  • Access controls on rules
  • Time-dependent access controls
  • Access controls on each micro-service
  • Redaction, access controls on structures in a file
  • Rule to parse audit trails, verify consistency of system
• Data grid evolution
  • Dynamic addition of new rules / micro-services / persistent state information
  • Rules to control migration from old management policies to new management policies
• Federation
  • Migration of rules and micro-services with data
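One way to read "time-dependent access controls" is as a grant that carries a validity window. The sketch below assumes that reading; the TimedGrant type, its fields, and the user names are hypothetical and are not taken from iRODS.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TimedGrant:
    user: str
    permission: str
    valid_from: date
    valid_until: date  # inclusive


def is_allowed(grants: list[TimedGrant], user: str, permission: str, on: date) -> bool:
    """Return True if any grant for this user and permission covers the given day."""
    return any(
        g.user == user
        and g.permission == permission
        and g.valid_from <= on <= g.valid_until
        for g in grants
    )


if __name__ == "__main__":
    grants = [TimedGrant("alice", "read", date(2007, 1, 1), date(2007, 12, 31))]
    print(is_allowed(grants, "alice", "read", date(2007, 4, 18)))
    print(is_allowed(grants, "alice", "read", date(2008, 1, 1)))
```

Expressed as a rule condition, a check of this kind would let access expire or open up on a schedule without manual changes to the ACLs.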
Federation Between Data Grids
Data Grid
• Logical resource name space
• Logical user name space
• Logical file name space
• Logical rule name space
• Logical micro-service name
• Logical persistent state
Data Collection B
Data Access Methods (Web Browser, DSpace, OAI-PMH)
Data Grid
• Logical resource name space
• Logical user name space
• Logical file name space
• Logical rule name space
• Logical micro-service name
• Logical persistent state
Data Collection A
Digital Preservation
• Preservation is communication with the future
  • How do we migrate records onto new technology (information syntax, encoding format, storage infrastructure, access protocols)?
  • SRB - Storage Resource Broker data grid provides the interoperability mechanisms needed to manage multiple versions of technology
• Preservation manages communication from the past
  • What information do we need from the past to make assertions about preservation assessment criteria (authenticity, integrity, chain of custody)?
  • iRODS - integrated Rule-Oriented Data System
For More Information
Reagan W. Moore
San Diego Supercomputer Center
http://www.sdsc.edu/srb/
http://irods.sdsc.edu/
For Additional Information and Developments
http://irods.sdsc.edu/index.php/Main_Page
Robert Chadduck
Principal Technologist
Electronic Records Archives Program
The National Archives and Records Administration
telephone: 301-827-1585
robert.chadduck at nara.gov