National Partnership for Advanced Computational Infrastructure
Collection-based Persistent Archives
Reagan W. Moore
Associate Director, Data Intensive Computing
San Diego Supercomputer Center
[email protected]
http://www.npaci.edu/DICE
Topics
• Experiences learned building a prototype Persistent Archive
  • Information model
  • Hierarchical levels of information
  • Interoperability mechanisms
• Application to workshop topics
  • Ingestion methodology
  • Data set identification
  • Certification of archives
Persistent Archive Goals
• Provide collection based archive
  • Data set relevance is organized by the collection
• Provide information model to describe the context for the data collection
  • Enough information is needed to be able to dynamically create the collection from archived information
• Decouple collection creation from digital object archiving
• Provide accessioning system to turn data sets into digital objects
  • Accessioning is independent of the final collection
NARA Persistent Archive Prototype
• Demonstrate ability to ingest, archive, recreate, query, and present a digital object from a 1 million record E-mail collection (RFC1036)
  • 2.5 GB of data
  • 6 required fields
  • 13 optional fields
  • User defined fields (over 1000)
• Determine information model needed for persistent archive
Key Concepts Learned
• Information model
  • Semi-structured representation of information - XML
  • Infrastructure independent representation of information context - XML DTD
  • Differentiation between information context for digital objects, collection, and presentation
    • DTD for objects
    • DTD for collection
    • XSL style sheets for presentation
• Instantiation software for creating the collection from the information model
• XML databases now appearing
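The instantiation step can be illustrated with a minimal sketch. It assumes the collection is rebuilt as a relational table whose columns come from an archived list of attribute tags, and that each archived digital object is an XML document carrying those tags. The function name `instantiate_collection`, the table name `collection`, and the choice of SQLite are illustrative assumptions, not the prototype's actual software.

```python
import sqlite3
import xml.etree.ElementTree as ET


def instantiate_collection(attribute_names, object_documents):
    """Rebuild a queryable collection from archived information.

    attribute_names: attribute tags named by the archived collection model.
    object_documents: XML strings, one per archived digital object.
    Returns an in-memory SQLite connection holding the collection.
    """
    db = sqlite3.connect(":memory:")
    quoted = [f'"{name}"' for name in attribute_names]
    column_defs = ", ".join(f"{q} TEXT" for q in quoted)
    db.execute(
        f"CREATE TABLE collection (object_id INTEGER PRIMARY KEY, {column_defs})"
    )
    insert_sql = (
        f"INSERT INTO collection ({', '.join(quoted)}) "
        f"VALUES ({', '.join('?' for _ in quoted)})"
    )
    for doc in object_documents:
        root = ET.fromstring(doc)
        # Read each attribute from the object's own tags; a missing tag is stored as NULL.
        row = [root.findtext(f".//{name}") for name in attribute_names]
        db.execute(insert_sql, row)
    db.commit()
    return db
```

Because the table is derived only from the archived attribute list and the archived objects, the same inputs could be loaded into a different database later, which is the point of keeping collection creation separate from object archiving.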
Hierarchy of Information Contexts
• Digital object context
  • Meta-data to define the structure of the object
  • When publishing a digital object, must also publish the context of the object
• Use collections to organize objects
  • Meta-data to define the structure of the collection
  • When publishing a collection, must also publish the information needed to organize the collection
• Use presentation context to control access
  • Meta-data to define structure of presentation
XML DTD for
<!ELEMENT rfc1036_mesg (headers, body)>
<!ELEMENT headers (required_headers, optional_headers, other_headers)>
<!ELEMENT body (#PCDATA)>
<!ELEMENT required_headers (From, Date, Newsgroups, Subject, Message-ID, Path)>
<!ELEMENT optional_headers (Followup-To?, Expires?, Reply-To?, Sender?, References?, Control?, Distribution?, Keywords?, Summary?, Approved?, Lines?, Xref?, Organization?)>
<!ELEMENT other_headers (other+)>
<!-- 6 required header keywords -->
<!ELEMENT From (#PCDATA)>
<!ELEMENT Date (#PCDATA)>
<!ELEMENT Newsgroups (#PCDATA)>
<!ELEMENT Subject (#PCDATA)>
<!ELEMENT Message-ID (#PCDATA)>
<!ELEMENT Path (#PCDATA)>
<!ATTLIST From seqno CDATA #REQUIRED>
<!ATTLIST Date seqno CDATA #REQUIRED>
<!ATTLIST Newsgroups seqno CDATA #REQUIRED>
<!ATTLIST Subject seqno CDATA #REQUIRED>
<!ATTLIST Message-ID seqno CDATA #REQUIRED>
<!ATTLIST Path seqno CDATA #REQUIRED>
Formatted Message Using XML DTD
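A message formatted against the DTD could be produced along the following lines. This is a minimal sketch using Python's standard `email` and `xml.etree.ElementTree` modules. The function name `tag_message`, the meaning of `seqno` (taken here to be the message's sequence number in the collection), and the `name` attribute on `other` elements are illustrative assumptions; the DTD excerpt does not define them.

```python
import xml.etree.ElementTree as ET
from email import message_from_string

# Header names from the DTD; "Followup-To" uses the RFC 1036 spelling.
REQUIRED = ["From", "Date", "Newsgroups", "Subject", "Message-ID", "Path"]
OPTIONAL = ["Followup-To", "Expires", "Reply-To", "Sender", "References",
            "Control", "Distribution", "Keywords", "Summary", "Approved",
            "Lines", "Xref", "Organization"]


def tag_message(raw, seqno):
    """Turn one raw message into an rfc1036_mesg element by adding attribute tags."""
    msg = message_from_string(raw)
    root = ET.Element("rfc1036_mesg")
    headers = ET.SubElement(root, "headers")

    required = ET.SubElement(headers, "required_headers")
    for name in REQUIRED:
        value = msg[name]
        if value is None:
            raise ValueError(f"missing required header: {name}")
        # Assumption: seqno is the message's sequence number in the collection.
        ET.SubElement(required, name, {"seqno": str(seqno)}).text = value

    optional = ET.SubElement(headers, "optional_headers")
    for name in OPTIONAL:
        if msg[name] is not None:
            ET.SubElement(optional, name).text = msg[name]

    # User defined headers; the element layout here is hypothetical.
    other = ET.SubElement(headers, "other_headers")
    known = {n.lower() for n in REQUIRED + OPTIONAL}
    for name, value in msg.items():
        if name.lower() not in known:
            ET.SubElement(other, "other", {"name": name}).text = value

    ET.SubElement(root, "body").text = msg.get_payload()
    return root
```

Calling `ET.tostring(tag_message(raw, 1), encoding="unicode")` would serialize the tagged message. The sketch does not validate against the DTD; the standard library has no DTD validator, so that check would need a separate tool.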
Key Concepts Learned
• Digital object encapsulation
  • Minimize the number of times a digital object must be touched
  • Once archived, a digital object should only be retrieved when requested by a user
• Implies meta-data stored with digital objects should only describe the objects
• Collection and presentation meta-data should be stored separately
Persistent Archive Requirements
• Distributed environment to ensure separable components
  • Accession workbench
  • Archive
  • Presentation platform
• Data handling mechanisms for interoperability as basis for system evolution
  • No tightly coupled systems
  • Unique names are only used by the data handling system
  • Use of containers to aggregate digital objects for storage
  • Implies a hierarchical naming scheme
    • Collection / container / digital object
Electronic Records Archive (ERA)
[Figure: ERA diagram with four stages: Accession, Archives, Reference, Transfer. Labeled components: Accessioning Workbench (snap-in), media handlers (tape, disk, CD, FTP), wrapper and unwrapper, metadata repository, records repository, database metadata wrapper record, Reference Workbench (snap-in), arrangement, ARC catalog, order fulfillment, retrieve records, query and reference tools, presentation over Internet / intranet. Record types: text, image, photo, video, audio, geographical information system, compound records, web.]
Federation of Data Collections into Digital Libraries
[Figure: data collections federated into digital libraries. Labels as extracted: DPOSS Sky Survey, 2MASS Sky Survey, NASA Catalog, NS DigLib, Wash. Brain Image, UCLA Brain Image, MSU Brain Image, UCSD Neuroscience, CEED / ESA, REINAS, U Md Archive, ADL, Elib - Flora, ESS DigLib, Protein Data Bank, Wash U Genome, U H Mol Trajectory, MS DigLib, UC Calif Finding Aids, UMDL Social Science, AMICO Image Library, NARA Persistent Archive, U Wisc. Video Lib., Pacific Rim DL.]
Conclusions
• Ingestion
  • Infrastructure independent representation for digital objects
  • Infrastructure independent representation for information model
  • Turn data sets into digital objects by adding attribute tags
• Aggregate digital objects in containers for storage
Conclusions
• Data set identification
  • Unique names only required by data handling system
• Attribute based access through collection
• Hierarchical naming
  • Collection / Container / Digital object
  • Finding Aid for collection / Data handling system ID for container / Unique ID for object
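The naming scheme above can be sketched as a small data structure. This is an illustration only: the class names `ObjectName` and `CollectionCatalog`, the sample attribute names, and the in-memory list are hypothetical and stand in for whatever the collection and data handling system actually use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectName:
    collection: str  # located through a finding aid
    container: str   # ID assigned by the data handling system
    object_id: str   # unique ID of the digital object

    def path(self):
        """Collection / container / digital object, as one hierarchical name."""
        return f"{self.collection}/{self.container}/{self.object_id}"


class CollectionCatalog:
    """Attribute based access: users query attributes, never unique names."""

    def __init__(self):
        self._entries = []  # list of (ObjectName, attribute dict)

    def register(self, object_name, **attributes):
        self._entries.append((object_name, attributes))

    def find(self, **query):
        """Return the names of all objects whose attributes match the query."""
        return [
            object_name
            for object_name, attrs in self._entries
            if all(attrs.get(key) == value for key, value in query.items())
        ]
```

A user would call something like `catalog.find(newsgroup="comp.test")` and only the data handling system would ever resolve the returned names to storage locations.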
Conclusions
• Certification of persistent archive
  • Demonstrate the ability to provide an infrastructure independent representation for
    • Finding aids for locating collections
    • Information model for building collection
    • Data handling system container IDs for storage access
    • Digital object attribute tags
• Demonstrate the ability to use information models to create finding aids, collections, and access interfaces on new technology
• Demonstrate the ability to independently migrate any component of the architecture
Further Information
http://www.npaci.edu/DICE
NARA Persistent Archive
Application | | Infrastructure
Accessioning Workbench | Information Model | User interface / Analysis tools
Finding Aids | | Federation / Mediation of Collections
Information discovery | Markup Language | Digital Library Services
Collection migration system | | Collection Management
Storage Resource Broker | Meta-data | Data Handling System
HPSS / file system | | Archive Storage
Context Based Objects
• For data to be useful, the context must be defined
  • Data format - binary/integer representation
  • Physical meaning - units
  • Structure - geometry
  • Relevance - feature annotation
  • Semantics - data dictionary for attributes
• Context is preserved as meta-data attributes
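As a hedged illustration of preserving context as meta-data attributes, the sketch below packages raw numeric values together with their format, units, structure, and an annotation. The element names (`digital_object`, `context`, `format`, `units`, `structure`, `relevance`, `data`) and the function name `wrap_with_context` are hypothetical; a data dictionary for semantics is left out to keep the example short.

```python
import struct
import xml.etree.ElementTree as ET


def wrap_with_context(values, units, shape, annotation):
    """Package raw values with the context needed to interpret them later."""
    obj = ET.Element("digital_object")
    ctx = ET.SubElement(obj, "context")
    # Data format: how the bytes in <data> are encoded.
    ET.SubElement(ctx, "format").text = "float64, big-endian"
    # Physical meaning: units of the values.
    ET.SubElement(ctx, "units").text = units
    # Structure: array geometry, e.g. "2x3".
    ET.SubElement(ctx, "structure").text = "x".join(str(n) for n in shape)
    # Relevance: free-text feature annotation.
    ET.SubElement(ctx, "relevance").text = annotation
    payload = struct.pack(f">{len(values)}d", *values)
    ET.SubElement(obj, "data", {"encoding": "hex"}).text = payload.hex()
    return obj
```

A reader who knows nothing about the original application can recover the numbers from the `format` entry alone, and the remaining entries say what the numbers mean.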
Information Models for Organization of Data
Digital Object Attributes
Collection Attributes
Presentation Attributes
Information Models for Access to Data
Presentation of data from multiple digital libraries
Collections from federated databases
Digital object Attributes
Common Information Model
• eXtensible Markup Language (XML)
  • Use tags to define semantic context for components of the data set
• Document Type Definition (DTD)
  • Provides semi-structured representation for organizing tags that can be applied to groups of digital objects
• Development of standards for tags
  • Digital sky, Protein Data Bank, Neuroscience brain images
  • California Digital Library - Art Museum Image Consortium
Information Management Hierarchy
• Presentation / Information Discovery / Analysis
  • Visualization - Shastra, 3D visualization tools
  • Presentation information model - XSL style sheet
• Collection organization
  • Meta-data catalog - MCAT
  • Collection information model - XML DTD
• Data handling
  • Storage Resource Broker - SRB
• Storage
  • Archival storage system - HPSS
  • Digital object model - XML DTD
Open Grid Architecture to Encourage Interoperability
[Figure: architecture diagram with the labels Application, Data Handling Systems, Storage Resources, Remote Procedure Execution, Data Model Management, Storage System Description, Information Discovery, Dynamic Info Discovery.]
Technology Sources
• Archive Community
  • IEEE Mass Storage Systems Technical Committee
  • Scalable storage systems
• Digital Library Community
  • NSF Digital Library Initiative, Phase II
  • Information management mediation - XML
• Supercomputer Community
  • Scalable analysis platforms
• Grid Forum
  • Data handling systems for interoperability
• Archivist Community / Library Community
  • Management policies and standards
Technology Sources
[Figure: the same architecture diagram (Application, Data Handling Systems, Storage Resources, Remote Procedure Execution, Data Model Management, Storage System Description, Information Discovery, Dynamic Info Discovery), with regions labeled Digital Library and Computational Grid.]
Information Management Architecture
• Digital library community technologies
  • Distributed information resources
  • Digital library interoperability protocols - SDLIP
  • Mediation of information using XML - MIX
• Grid Forum technologies
  • Support for distributed services / procedures
  • Inter-realm authentication
    • GSI Grid Security Infrastructure
• Data handling system
  • Storage Resource Broker, Meta-data Catalog
Grid Forum Data Access Architecture
[Figure: architecture diagram with the labels Application, Data Handling Systems, Storage Resources, Remote Procedure Execution, Data Model Management, Storage System Description, Information Discovery, Dynamic Info Discovery. Annotations in the figure, as extracted:]
• API that provides “glue” to underlying storage, QoS, etc. [GASS, IBP, SRB]
• API that provides “glue” to underlying data handling systems (security, scheduling, QoS, access protocol, data format/model, adaptivity, info discovery, location control)
• DPSS, DFS, NFS, HPSS, ADSM, DMF, Unitree, NASstore, DB2, Oracle, Informix, Sybase, O2, ObjectStore, Objectivity
• Armada, D’agents, FEL, ADR, GRAM, SRB, Java, CORBA, plus authentication and authorization
• GloPerf, Netlogger, NWS
• Condor, GASS, NILE, SRB, I-2 caching, ADR
• DTD, ADR, object class
• LDAP, Database, Flat file, Object database
Data Handling System Capabilities
• SDSC Storage Resource Broker
  • Protocol transparency
    • Common API for access to remote data resources
    • Explicit drivers for each type of storage system
  • Name transparency
    • Attribute based access to data
  • Location transparency
    • Distribution of collection across multiple physical resources
  • Time transparency
    • Minimization of latency for data access
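The first three transparencies can be sketched in a few lines. This is not the SRB API: the classes `StorageDriver`, `MemoryDriver`, and `Broker`, and all method names, are hypothetical and only illustrate the idea of one common call over explicit per-system drivers, with data selected by attributes and spread across resources.

```python
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Common interface; one explicit driver per type of storage system."""

    @abstractmethod
    def read(self, physical_name):
        """Return the bytes stored under a system-specific name."""


class MemoryDriver(StorageDriver):
    """Stand-in for a real driver (file system, archive, database)."""

    def __init__(self, blobs):
        self._blobs = dict(blobs)

    def read(self, physical_name):
        return self._blobs[physical_name]


class Broker:
    def __init__(self):
        self._drivers = {}   # resource name -> driver
        self._catalog = []   # (attributes, resource name, physical name)

    def add_resource(self, resource, driver):
        self._drivers[resource] = driver

    def register(self, attributes, resource, physical_name):
        self._catalog.append((dict(attributes), resource, physical_name))

    def get(self, **query):
        """Fetch the first object matching the attributes, wherever it lives."""
        for attrs, resource, physical_name in self._catalog:
            if all(attrs.get(k) == v for k, v in query.items()):
                return self._drivers[resource].read(physical_name)
        raise KeyError(query)
```

The caller of `get` supplies only attributes. Which resource holds the data, what it is called there, and which protocol reaches it are all resolved inside the broker.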
SDSC Storage Resource Broker & Meta-data Catalog
[Figure: an Application on top of the SRB. Labels for the MCAT: Dublin Core, Resource, User, Application Meta-data. Driver labels: File SID, DBLobj SID, Obj SID. Storage systems: ADSM, HPSS, DB2, Oracle, Unix.]
Time Transparency
• How to minimize latency
  • Prefetch data to local high performance disk, so that all accesses can be done at high speed from local resources
• How to maximize access rate
  • Composite or aggregate data into a single data set to avoid multiple accesses
  • Stream data at high rates using parallel I/O, amortizing the access latency by the volume of data that is delivered
• How to avoid congestion
  • Replicate data across multiple servers
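The amortization argument can be made concrete with a small calculation. The model below is an assumption for illustration: each access pays a fixed latency and the data then streams at a constant bandwidth. The function name and the numbers in the usage note are hypothetical, not measurements from the slides.

```python
def effective_rate(volume_bytes, latency_s, bandwidth_bytes_per_s, n_accesses=1):
    """Sustained delivery rate in bytes per second.

    Assumes every access pays latency_s once, and that the total volume
    then streams at bandwidth_bytes_per_s.
    """
    total_time = n_accesses * latency_s + volume_bytes / bandwidth_bytes_per_s
    return volume_bytes / total_time
```

With a 20 s access latency and 10 MB/s bandwidth, fetching 1 GB as one aggregated data set takes 120 s, about 8.3 MB/s. Fetching the same 1 GB as 1000 separate objects takes 20,100 s, about 50 kB/s. Aggregation and streaming help because the fixed latency is paid once rather than per object.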
SRB Containers - Managing Archive Latency
• Create container in a logical storage resource containing at least one “cacheable” resource
• Create objects in containers
• “Cache” daemon will move filled containers to archive
• Synch and purge APIs
[Figure: an SRB client and an SRB server over distributed storage resources (UNIX, HPSS), showing a container and cached containers.]
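The container mechanism can be sketched as follows. This is a simplified illustration, not the SRB implementation: the class names `Container` and `ContainerCache`, the dictionary standing in for the archive, and the rule that a container is moved as soon as the next object does not fit are all assumptions.

```python
class Container:
    """Aggregates many small objects into one unit of archive storage."""

    def __init__(self, container_id, capacity):
        self.container_id = container_id
        self.capacity = capacity
        self.data = bytearray()
        self.index = {}  # object name -> (offset, length)

    def add(self, name, payload):
        """Append an object; return False if it does not fit."""
        if len(self.data) + len(payload) > self.capacity:
            return False
        self.index[name] = (len(self.data), len(payload))
        self.data += payload
        return True

    def read(self, name):
        offset, length = self.index[name]
        return bytes(self.data[offset:offset + length])


class ContainerCache:
    """Fills containers on the cache resource and moves full ones to the archive."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.archive = {}  # stands in for the archival storage system
        self._count = 0
        self.current = self._new_container()

    def _new_container(self):
        self._count += 1
        return Container(f"c{self._count:04d}", self.capacity)

    def store(self, name, payload):
        """Store an object and return the ID of the container that holds it."""
        if not self.current.add(name, payload):
            self.sync_and_purge()
            if not self.current.add(name, payload):
                raise ValueError("object larger than container capacity")
        return self.current.container_id

    def sync_and_purge(self):
        """Copy the filled container to the archive and start a fresh one."""
        self.archive[self.current.container_id] = self.current
        self.current = self._new_container()
```

The archive then sees a few large containers instead of many small files, so one archive access can serve every object in the container.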
Generality of Information Infrastructure
• Same information model needed to manage
  • Federation in space
    • Metacomputing environment
    • Interoperable services for digital libraries
  • Migration over time
    • Collection creation and update
    • Persistent archive
• Same storage systems needed to support
  • Supercomputer center data
  • Discipline specific data collections
  • Digital library collections
Art Museum Image Consortium
• Demonstrated
  • Support for heterogeneous digital objects
  • Automated conversion of meta-data to XML DTD
  • Validation of meta-data
  • XSL style sheet for presenting information
AMICO Meta-data Conversion to XML
AMICO Presentation Interface
1. AMICO XML Data Records
2. XSL StyleSheet Script
3. Rendered Output
National Partnership for Advanced Computational Infrastructure
• Facilitate the conduct of science through development of knowledge resources
  • Publish - Data collection infrastructure
  • Info discovery - Digital Library infrastructure
  • Data access - Data handling infrastructure
• Apply to federal, state, and university projects
  • NSF / DOE / NASA / USPTO / NARA / Census Bureau
  • California Digital Library
  • UCSD - Pacific Rim Digital Library Alliance
Publishing Scientific Data
[Figure: diagram linking applications and libraries through shared infrastructure. Infrastructure labels: Archival Storage, Data Storage, Information Management, Collection Building, Digital Library, Applications. Libraries: CDL, UCB - Elib, UCSB - ADL, Stanford - SDLIP, U Michigan - UMDL. Applications: Digital Sky, Neuroscience, Protein Data Bank, Molecular Structures, Earth Systems Science.]
NPACI is a National Partnership of Partnerships
46 institutions
20 states
4 countries
5 national labs
Many projects (new and old)
Vendors and industry
Government agencies
National Partnership for Advanced Computational Infrastructure
• Provide Teraflops / Petabyte capable systems for use by national academic community
  • Current systems at the San Diego Supercomputer Center
    • 250 Gflops peak computation rate
      – IBM SP, CRAY T3E
    • 250 Terabyte archive capacity, 100 TB in archive
      – High Performance Storage System
  • By end of year
    • 1 TFlop peak computation rate
      – IBM SP
    • 500 Terabyte archive capacity
Challenges
• Facilitate access to high-end resources
  • Support data intensive computing
• Facilitate access to distributed data resources
  • Support information discovery
• Minimize complexity of user interfaces
  • Provide unifying data access system
• Requires information management infrastructure
Bio-Informatics
Application | | Infrastructure
Structural Comparison (n x n) | Information Model | User interface / Analysis tools
Mediation of Information using XML / Extensible Meta-data Catalog | | Federation / Mediation of Collections
Protein Data Bank Services | Markup Language | Digital Library Services
PDB / Genome / Molecular Trajectory Collections | | Collection Management
Storage Resource Broker | Meta-data | Data Handling System
HPSS / file system | | Archive Storage
Art Museum Image Consortium - AMICO
Application | | Infrastructure
Classroom lectures | Information Model | User interface / Analysis tools
Mediation of Information using XML | | Federation / Mediation of Collections
Internet Explorer - XSL stylesheets | Markup Language | Digital Library Services
AMICO Collection | | Collection Management
Storage Resource Broker | Meta-data | Data Handling System
HPSS / file system | | Archive Storage
National Virtual Observatory
Application | | Infrastructure
Astronomer’s Workbench | Information Model | User interface / Analysis tools
Correlation catalogs | | Federation / Mediation of Collections
Statistical Analysis | Markup Language | Digital Library Services
2MASS / DPOSS / SDSS / NOAA | | Collection Management
Storage Resource Broker | Meta-data | Data Handling System
HPSS / file system | | Archive Storage
California Digital Library
Application | | Infrastructure
Research / Education / Public web-based access | Information Model | User interface / Analysis tools
Mediation of Information using XML / Extensible Meta-data Catalog | | Federation / Mediation of Collections
Electronic Notebook / Infoscapes | Markup Language | Digital Library Services
AMICO / ADEPT / UCB Flora collection | | Collection Management
Storage Resource Broker | Meta-data | Data Handling System
HPSS / file system | | Archive Storage