data preservation
National Partnership for Advanced Computational Infrastructure
San Diego Supercomputer Center
Preservation and Long Term Access to Data and Records in a Knowledge-based Society
Reagan W. Moore
San Diego Supercomputer Center
[email protected]
http://www.npaci.edu/DICE/
Data and Knowledge Systems Group
Staff
• Reagan Moore
• Ilkai Altintas
• Chaitan Baru
• Sheau Yen Chen
• Charles Cowart
• Amarnath Gupta
• George Kremenek
• M. Kulrul
• Bertram Ludäscher
• Richard Marciano
• A. Memon
• XuFei Qian
• Roman Olshanowsky
• Arcot Rajasekar
• Abe Singer
• Michael Wan
• Ilya Zaslavsky
• Bing Zhu
Graduate Students
• A. Bagchi
• S. Bansal
• A. Behere
• R. Bharath
• S. Bharath
• L. Sui
Undergraduate Interns
• N. Cotofana
• D. Le
• J. Trang
• L. Yin
• +/- NN
Topics
• Building persistent archives
• Data grids
• Authenticity mechanisms
• Managing technology evolution
• Knowledge-based access
Archival Processes
• Appraisal - determine the archivable content
• Accession - determine the initial physical location for the data, and the relationship of the new collection to existing collections
• Arrangement - add administration control, describe the information content (provenance, authenticity, structure, administrative), and decompose digital objects into their components as needed
• Description - complete the definition of collection attributes by iterating between arrangement, reformatting, and representation
• Preservation - build an archivable form of the digital entities, characterize the collection context, and manage their storage
• Access - provide query mechanisms for discovering, retrieving, and presenting the digital entities
ERA Concept model
Common Approach (digital library, persistent archive, data grid)
• Logical name space used to organize digital entities, and associate attributes
• Separation of information management from data storage management
• Definition of abstraction mechanisms for dealing with repositories
• Emergence of need for knowledge management
SDSC Storage Resource Broker & Meta-data Catalog: Levels of Abstraction
[Diagram: layered architecture; labels recovered from the figure]
• Application: C, C++, Libraries; Unix Shell; Java, NT Browsers; Web, WSDL; Prolog Predicate; Linux I/O; DLL / Python
• Consistency Management / Authorization-Authentication
• Logical Name Space; Latency Management; Data Transport; Metadata Transport
• Catalog Abstraction: Databases (DB2, Oracle, Sybase)
• Storage Abstraction: Archives (HPSS, ADSM, UniTree, DMF); Databases (DB2, Oracle, Postgres); File Systems (Unix, NT, Mac OSX); HRM
• Other labels: Prime Server; Clients; Servers
Authenticity
• Guarantee that the data has not been changed
– Collection owned data, only accessible through the data handling system
– Support roles defining access (curation, owner, annotation, read)
– Support access controls mapping users to roles
• Audit trails that record all operations on files
• Digital signatures - cryptographic checksums
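The three authenticity mechanisms above (role-based access, audit trails, checksums) can be sketched together in a few lines. This is a minimal illustration, not the SRB implementation: the `Collection` class, the permission sets attached to each role, and the choice of SHA-256 are all assumptions made for the example, and a plain checksum stands in for a full digital signature.

```python
import hashlib
from dataclasses import dataclass, field

# Role names come from the slide; the permission sets are assumed for illustration.
ROLE_PERMISSIONS = {
    "curation": {"read", "annotate", "write"},
    "owner": {"read", "annotate", "write"},
    "annotation": {"read", "annotate"},
    "read": {"read"},
}


def checksum(data: bytes) -> str:
    """Cryptographic checksum recorded at ingest and re-checked on access."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class Collection:
    """Hypothetical collection that owns its data and mediates every access."""
    user_roles: dict                                  # user -> role
    objects: dict = field(default_factory=dict)       # name -> (data, checksum)
    audit_trail: list = field(default_factory=list)   # (user, op, name, allowed)

    def _authorize(self, user, operation, name):
        role = self.user_roles.get(user)
        allowed = operation in ROLE_PERMISSIONS.get(role, set())
        # Every attempted operation is recorded, whether or not it is allowed.
        self.audit_trail.append((user, operation, name, allowed))
        if not allowed:
            raise PermissionError(f"{user} may not {operation} {name}")

    def put(self, user, name, data: bytes):
        self._authorize(user, "write", name)
        self.objects[name] = (data, checksum(data))

    def get(self, user, name) -> bytes:
        self._authorize(user, "read", name)
        data, recorded = self.objects[name]
        if checksum(data) != recorded:
            raise ValueError(f"checksum mismatch for {name}")
        return data
```

Because all reads and writes go through the collection, a change made outside of it shows up as a checksum mismatch on the next `get`, and a refused write still leaves an entry in the audit trail.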
Managing Technology Evolution
• Data grids provide interoperability mechanisms to access data in multiple administration domains and multiple types of storage systems.
• Persistent archives migrate collections from old technology to new technology to support presentation on new systems
• Both require the ability to access heterogeneous systems
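One way to picture the shared requirement is a common driver interface placed in front of unlike storage systems, so that migrating a collection is a loop over that interface. The sketch below is an assumption-laden illustration: `StorageDriver`, `DictStore`, `LogStore`, and `migrate` are hypothetical names, and both back ends are in-memory stand-ins rather than real archives or file systems.

```python
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Hypothetical uniform interface over heterogeneous storage systems."""

    @abstractmethod
    def read(self, name: str) -> bytes: ...

    @abstractmethod
    def write(self, name: str, data: bytes) -> None: ...

    @abstractmethod
    def names(self) -> list: ...


class DictStore(StorageDriver):
    """Stand-in for a file system: one entry per name."""

    def __init__(self):
        self._files = {}

    def read(self, name):
        return self._files[name]

    def write(self, name, data):
        self._files[name] = data

    def names(self):
        return sorted(self._files)


class LogStore(StorageDriver):
    """Stand-in for an append-only archive: the latest record for a name wins."""

    def __init__(self):
        self._records = []

    def read(self, name):
        for recorded_name, data in reversed(self._records):
            if recorded_name == name:
                return data
        raise KeyError(name)

    def write(self, name, data):
        self._records.append((name, data))

    def names(self):
        return sorted({recorded_name for recorded_name, _ in self._records})


def migrate(old: StorageDriver, new: StorageDriver) -> int:
    """Copy every object from old to new, verifying each copy by reading it back."""
    moved = 0
    for name in old.names():
        data = old.read(name)
        new.write(name, data)
        if new.read(name) != data:
            raise IOError(f"verification failed for {name}")
        moved += 1
    return moved
```

The same interface serves both uses named above: a data grid would call `read` on whichever driver holds the data, and a persistent archive would call `migrate` when a storage technology is retired.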
Presentation of Digital Objects
[Diagram: Storage System, Operating System, Application, Digital Object, Display System]
Technology Management - Emulation
[Diagram: New Storage System, New Operating System, Old Application, Digital Object, New Display System; annotation: Wrap Application]
Technology Management
[Diagram: New Storage System, New Operating System, Old Application, Digital Object, New Display System; annotation: Add Operating System Call]
Technology Management
[Diagram: Old Storage System, New Operating System, Old Application, Digital Object, Old Display System; annotations: Add Operating System Call (shown twice)]
Technology Management - Migration
[Diagram: New Storage System, New Operating System, New Application, Digital Object, New Display System; annotation: Migrate Encoding Format]
Technology Management - SDSC
[Diagram: Old Storage System, New Operating System, New Application, Digital Object, Old Display System; annotations: Wrap Storage System, Wrap Display System, Migrate Encoding Format]
Accessing Archived Data
• Name transparency
– Access data without knowing the file name
– Map from attributes to a local file name
• Location transparency
– Access data without knowing where it is stored
– Map from global file name to local file name
• Collection transparency
– Access data without knowing the collection attributes
– Map from concept space to collection attributes
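The three mappings can be read as a chain of lookups: concept to attributes, attributes to a global (logical) name, and global name to a local file name. The sketch below assumes simple in-memory tables; every table, path, resource name, and function name is hypothetical and chosen only to make the chain visible.

```python
# Collection transparency: concept space -> collection attributes (assumed data).
CONCEPT_TO_ATTRIBUTES = {"rainfall": {"variable": "precipitation"}}

# Name transparency: attributes recorded for each global (logical) name.
ATTRIBUTES = {
    "/demo/climate/obs-001": {"variable": "precipitation", "year": 1999},
    "/demo/climate/obs-002": {"variable": "temperature", "year": 1999},
}

# Location transparency: global name -> replicas as (resource, local file name).
REPLICAS = {
    "/demo/climate/obs-001": [
        ("archive-a", "/tape/vol7/f0193"),
        ("disk-b", "/data/c/obs1.dat"),
    ],
    "/demo/climate/obs-002": [("disk-b", "/data/c/obs2.dat")],
}


def find_by_attributes(query: dict) -> list:
    """Return the global names whose attributes match every key in the query."""
    return sorted(
        name
        for name, attrs in ATTRIBUTES.items()
        if all(attrs.get(key) == value for key, value in query.items())
    )


def locate(global_name: str, preferred_resource=None) -> tuple:
    """Resolve a global name to one replica, preferring a resource if it holds one."""
    replicas = REPLICAS[global_name]
    for resource, local_name in replicas:
        if resource == preferred_resource:
            return resource, local_name
    return replicas[0]


def find_by_concept(concept: str) -> list:
    """Translate a concept into attribute values, then search by those attributes."""
    return find_by_attributes(CONCEPT_TO_ATTRIBUTES[concept])
```

A caller can start from any level: `find_by_concept("rainfall")` needs no knowledge of attribute names, `find_by_attributes({"year": 1999})` needs no file names, and `locate(...)` hides where the bytes are stored.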
Information Management - Logical Name Space
• Set of attributes to describe digital entities that are registered into the logical name space
• SRB metadata - Unix file system semantics
• Provenance metadata - Dublin Core
• Resource metadata - User access control lists
• Discipline metadata - User defined attributes
• Each digital entity may have unique attributes
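As a rough picture of what registering an entity might involve, the sketch below keeps the four kinds of metadata listed above as separate dictionaries on each entity. The class names, field names, and sample keys are assumptions for illustration (only `title`, `creator`, and `date` are actual Dublin Core elements); this is not the MCAT schema.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalEntity:
    """Hypothetical record for one entity registered in a logical name space."""
    logical_name: str
    system: dict = field(default_factory=dict)      # Unix-style: size, owner, ...
    provenance: dict = field(default_factory=dict)  # Dublin Core elements
    access: dict = field(default_factory=dict)      # user -> permission
    discipline: dict = field(default_factory=dict)  # user-defined attributes


class LogicalNameSpace:
    """Hypothetical registry that organizes entities by logical name."""

    def __init__(self):
        self._entities = {}

    def register(self, entity: DigitalEntity) -> None:
        if entity.logical_name in self._entities:
            raise ValueError(f"{entity.logical_name} is already registered")
        self._entities[entity.logical_name] = entity

    def list_collection(self, collection: str) -> list:
        """List all logical names under a collection, like a recursive ls."""
        prefix = collection.rstrip("/") + "/"
        return sorted(n for n in self._entities if n.startswith(prefix))

    def describe(self, logical_name: str) -> dict:
        """Flatten the four metadata categories into 'category.key' entries."""
        entity = self._entities[logical_name]
        merged = {}
        for category in ("system", "provenance", "access", "discipline"):
            for key, value in getattr(entity, category).items():
                merged[f"{category}.{key}"] = value
        return merged
```

Since the discipline attributes are a free-form dictionary, two entities in the same collection can carry entirely different attribute sets, which matches the point that each digital entity may have unique attributes.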
Knowledge Management - Discovery across Collections
• Mapping from collection attributes to discipline concepts
– Make queries based on discipline concepts
• Characterization of relationships between attributes
– Semantic / logical - cross-walks
– Procedural / temporal - records management
– Structural / spatial - GIS
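A semantic cross-walk can be as simple as a table stating which attribute in each collection expresses a given discipline concept; a concept-level query is then rewritten per collection. The sketch below covers only that semantic case, and the collections, attribute names, and records in it are invented for illustration.

```python
# Cross-walk: discipline concept -> attribute name used by each collection.
# All names and records below are hypothetical.
CROSSWALK = {
    "author": {"library": "creator", "museum": "artist"},
}

COLLECTIONS = {
    "library": [
        {"creator": "A. Author", "title": "Field Notes"},
        {"creator": "B. Writer", "title": "Survey Report"},
    ],
    "museum": [
        {"artist": "A. Author", "title": "Harbor Sketch"},
    ],
}


def query_by_concept(concept: str, value: str) -> list:
    """Search every collection that maps the concept, using its own attribute name."""
    hits = []
    for collection, attribute in sorted(CROSSWALK[concept].items()):
        for record in COLLECTIONS[collection]:
            if record.get(attribute) == value:
                hits.append((collection, record["title"]))
    return hits
```

The caller asks about an `author` without knowing that one collection stores it as `creator` and another as `artist`; procedural and structural relationships would need richer rules than a lookup table.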
Knowledge Based Data Grids
[Diagram: three levels (Knowledge, Information, Data) crossed with three service columns (Ingest Services, Management, Access Services)]
• Knowledge: Relationships Between Concepts; Knowledge Repository for Rules; Knowledge or Topic-Based Query / Browse
• Information: Attributes, Semantics; Information Repository; Attribute-based Query
• Data: Fields, Containers, Folders; Storage (Replicas, Persistent IDs); Feature-based Query
• Interface labels: MCAT/HDF; Grids; XML DTD; SDLIP; XTM DTD; Rules - KQL
• Other labels: (Model-based Access); (Data Handling System - SRB)
Further Information
http://www.npaci.edu/DICE