Building an Extensible File System via Policy-based Data Management
Hao Xu, Jewel H. Ward, Mike Conway, Arcot Rajasekar, Reagan W. Moore
(iRODS Consortium, http://irods.org)
File System
• Essential functions:
  - Ingest, Store, Access
• Modern file systems are built on top of traditional file systems:
  - Google File System, Amazon S3, Hadoop Distributed File System
  - Driven by the need of a target application
  - Customized toward the target application domain
Data Management Needs in Archive and Scientific Communities
• Discoverability
• Complex Metadata
• Workflow Management
• Data Sharing
• Provenance
• Long Term Preservation
• Technology Migration
• Interoperability Between Infrastructures
Challenges
Can generic infrastructure meet the needs of a diverse set of data management domains?
Flexibility to Define a Wide Range of Application Domain Policies
• User Community → Policies
• File ingest operations:
  - Authentication
  - Authorization
  - Storage Quota
  - Aggregation
  - Resource Selection
  - Replication
  - File Retention
  - Metadata
Infrastructure Support For Non-standard Application Domain Operations
• Standard file system operations have robust support:
  - Metadata
  - Auditing
  - Access Control List
• Non-standard operations that are implemented as a library do not have direct support from the file system. Examples:
  - Preservation – OAIS: SIP, AIP, DIP packages
  - Digital library – Provenance & discovery metadata
  - Processing pipeline – Format transformation
Interoperability with Other Infrastructures
• Emergent scalability mechanisms:
  - Organization change
    · List → Tree → Graph (Internet) → Search
  - Data structure change
    · Files, tables, streams
  - Property enforcement expectations
    · Reproducible data-driven research
• Separation of how files are stored, accessed, and manipulated
Policy-based Data Management
Policy = Metadata + Procedure
• Purpose: reason a collection is assembled
• Properties: attributes needed to ensure the purpose
• Policies: controls for enforcing desired properties
  - Procedural policy. Example: when an object is ingested, run a workflow
  - Assertional policy. Example: a file has three or more replicas
• Metadata: persistent state
  - State information (consistency in a distributed environment)
  - Generated through application of procedures
• Procedures: operations performed within the system
  - What to run: functions that implement the policies
  - How to verify: validation that metadata conforms to the desired purpose
[Figure: concept graph. Nodes: Collection, Purpose, Property, Policy, Procedure, Metadata, Periodic Assessment Criteria. Edge labels: Defines, Controls, Updates, SubType.]
Policy-based Data Management
[Figure: concept graph. Nodes: Collection, Purpose, Attribute, Property, Policy, Procedure, Metadata, Digital Object, Periodic Assessment Criteria. Edge labels: Defines, Has, Isa, Controls, Updates, SubType.]
Policy-based Data Management - Collection
[Figure: concept graph. Nodes: Collection, Purpose, Attribute, Property, Policy, Procedure, Metadata, Digital Object, Periodic Assessment Criteria. Features: Completeness, Correctness, Consensus, Consistency. Properties: Integrity, Authenticity, Access control. Edge labels: Defines, Has, HasFeature, Isa, Controls, Updates, SubType.]
Policy-based Data Management – Collection Properties
[Figure: concept graph. Nodes: Collection, Purpose, Attribute, Property, Policy, Procedure, Metadata, Digital Object, Periodic Assessment Criteria. Features: Completeness, Correctness, Consensus, Consistency. Properties: Integrity, Authenticity, Access control. Policies: Replication Policy, Checksum Policy, Quota Policy, Data Type Policy. Edge labels: Defines, Has, HasFeature, Isa, Controls, Updates, SubType.]
Policy-based Data Management – Collection Policies
[Figure: concept graph. Nodes: Collection, Purpose, Attribute, Property, Policy, Procedure, Workflow, Function, Operation, Metadata, Digital Object, Periodic Assessment Criteria. Features: Completeness, Correctness, Consensus, Consistency. Properties: Integrity, Authenticity, Access control. Policies: Replication Policy, Checksum Policy, Quota Policy, Data Type Policy. Functions: GetUserACL, SetDataType, SetQuota, DataObjRepl, SysChksumDataObj. Edge labels: Defines, Has, HasFeature, Isa, Chains, Controls, Updates, SubType.]
Policy-based Data Management – Collection Procedures
[Figure: concept graph. Nodes: Collection, Purpose, Attribute, Property, Policy, Procedure, Workflow, Function, Operation, Metadata, Digital Object, Periodic Assessment Criteria. Features: Completeness, Correctness, Consensus, Consistency. Properties: Integrity, Authenticity, Access control. Policies: Replication Policy, Checksum Policy, Quota Policy, Data Type Policy. Functions: GetUserACL, SetDataType, SetQuota, DataObjRepl, SysChksumDataObj. Metadata: DATA_ID, DATA_REPL_NUM, DATA_CHECKSUM. Edge labels: Defines, Has, HasFeature, Isa, Chains, Controls, Updates, SubType.]
Policy-based Data Management – Persistent State
[Figure: concept graph. Nodes: Collection, Purpose, Attribute, Property, Policy, Procedure, Client Action, Policy Enforcement Point, Workflow, Function, Operation, Metadata, Digital Object, Periodic Assessment Criteria. Features: Completeness, Correctness, Consensus, Consistency. Properties: Integrity, Authenticity, Access control. Policies: Replication Policy, Checksum Policy, Quota Policy, Data Type Policy. Functions: GetUserACL, SetDataType, SetQuota, DataObjRepl, SysChksumDataObj. Metadata: DATA_ID, DATA_REPL_NUM, DATA_CHECKSUM. Edge labels: Defines, Has, HasFeature, Isa, Invokes, Chains, Controls, Updates, SubType.]
Policy-based Data Management – Policy Enforcement
Example of Policy-based Data Management
Policy-based infrastructure: integrated Rule Oriented Data System (iRODS)
• Biology
  - Cognitive Science: Temporal Dynamics of Learning Center
  - Human genome: Broad Institute, Wellcome Trust Sanger Institute, NGS
  - Medicine: Sick Kids Hospital
  - Neuroscience: International Neuroinformatics Coordinating Facility
  - Plant genome: the iPlant Collaborative
  - Phylogenetics: Phylogenetics at CC IN2P3
• Computer Science
  - Network research: GENI experimental network
• Earth Sciences
  - Atmospheric science: NASA Langley Atmospheric Sciences Center
  - Climate: NOAA National Climatic Data Center; NASA Center for Climate Simulations
  - Ecology: CEED Caveat Emptor Ecological Data
  - Hydrology: Institute for the Environment, UNC-CH; Hydroshare
  - Oceanography: Ocean Observatories Initiative
  - Seismology: Southern California Earthquake Center
• Engineering
  - Education repository: CIBER-U
• Physics
  - Astrophysics: Auger supernova search
  - Cosmic Ray: AMS experiment on the International Space Station
  - Dark Matter Physics: Edelweiss II
  - High Energy Physics: BaBar / Stanford Linear Accelerator
  - Neutrino Physics: T2K and dChooz neutrino experiments
  - Optical Astronomy: National Optical Astronomy Observatory
  - Particle Physics: Indra multi-detector collaboration at IN2P3
  - Quantum Chromodynamics: IN2P3
  - Radio Astronomy: Cyber Square Kilometer Array, TREND, BAOradio
• Social Science
  - Odum, TerraPop
Policy Applications
• Pre-process policy
  - Applied before an operation is done
• Operation
  - May be policy controlled
• Post-process policy
  - Applied after the operation is done
• Are these sufficient to handle the wide diversity of data management applications?
• Does this minimize the number of required operations?
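The pre-process / operation / post-process pattern can be illustrated with a minimal Python sketch. This is not iRODS code: `with_policies`, `check_quota`, `store`, and `record_checksum` are hypothetical names chosen for illustration, and the quota and checksum policies are just two examples of what a community might plug in.

```python
import hashlib

def with_policies(operation, pre=(), post=()):
    """Wrap an operation with pre-process and post-process policies."""
    def wrapped(ctx):
        for policy in pre:          # applied before the operation is done
            policy(ctx)
        result = operation(ctx)     # the operation itself
        for policy in post:         # applied after the operation is done
            policy(ctx, result)
        return result
    return wrapped

# Hypothetical policies and operation, for illustration only.
def check_quota(ctx):
    if ctx["used"] + len(ctx["data"]) > ctx["quota"]:
        raise PermissionError("quota exceeded")

def store(ctx):
    ctx["storage"][ctx["name"]] = ctx["data"]
    ctx["used"] += len(ctx["data"])
    return ctx["name"]

def record_checksum(ctx, name):
    ctx["metadata"][name] = hashlib.sha256(ctx["storage"][name]).hexdigest()

ingest = with_policies(store, pre=[check_quota], post=[record_checksum])

ctx = {"name": "a.txt", "data": b"hello", "quota": 10, "used": 0,
       "storage": {}, "metadata": {}}
ingest(ctx)
```

Because the policies are passed in as lists, a different community could swap the quota check for an authorization check without touching `store`.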
Policy (Workflow) in Hydrology
[Figure: RHESSys workflow diagram. Boxes shown: choose gauge or outlet (HIS); extract drainage area (NHDPlus); Digital Elevation Model (DEM); slope, aspect; streams (NHD); roads (DOT); strata; basin, hillslope, patch, stream network; nested watershed structure; land use, leaf area index, phenology, soil data; soil and vegetation parameter files; worldfile, flowtable; RHESSys. Data sources labelled: NLCD (EPA), Landsat TM, MODIS, USDA.]
RHESSys workflow to develop a nested watershed parameter file (worldfile) containing a nested ecogeomorphic object framework, and full, initial system state.
For each box, create a micro-service to automate task, and chain into a workflow
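Chaining one micro-service per box can be sketched in Python as plain function composition. The step names below echo a few boxes from the diagram, but their bodies are placeholders that only pass strings along; they are assumptions for illustration and do not reflect how RHESSys or iRODS micro-services are actually implemented.

```python
from functools import reduce

def chain(*steps):
    """Compose steps left to right; each step takes and returns a context dict."""
    return lambda ctx: reduce(lambda acc, step: step(acc), steps, ctx)

# Hypothetical stand-ins for three boxes of the workflow diagram.
def choose_outlet(ctx):
    return {**ctx, "outlet": ctx["gauge_id"]}

def extract_drainage_area(ctx):
    return {**ctx, "drainage_area": f"area-for-{ctx['outlet']}"}

def build_worldfile(ctx):
    return {**ctx, "worldfile": f"worldfile({ctx['drainage_area']})"}

workflow = chain(choose_outlet, extract_drainage_area, build_worldfile)
result = workflow({"gauge_id": "gauge-001"})
print(result["worldfile"])
```

Each step reads only what earlier steps produced, so a box can be replaced or a new one inserted without rewriting the rest of the chain.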
Policies in Software Defined Networking
Control selection of network paths
[Figure: components shown: Rule Engine, GraphDB, Data Policies, Network Policies, OF Controller, three iRODS Servers, iCAT.]
Policy in Data Storage
Aggregation / Caching / Replication
Queen Mary University of London
Source: Di Lodovico et al.
Indexing Policies
[Figure: indexing architecture. Labels shown: iRODS (Data, Metadata), DataBook Rules, Message Passing (AMQP), Indexing Framework (OSGi), Indexing Service, Indexer, External Index, VIVO, Search UI. Index contents: Text, Metadata, Events.]
Policies in Digital Libraries
• SILS LifeTime Library
  - Student collections range from 2 GBytes to 150 GBytes
  - Number of files from 2,000 to 12,000
• Library management policies
  - Replication, Checksums, Versioning, Strict access controls, Quotas, Metadata catalog replication, Installation environment archiving
• Ingestion policies
  - Automated synchronization of student directory with LifeTime Library
  - Automated loading of MP3 metadata
Policies in Archives
Formal Aspects of Policy-based Data Management
Domain Model
• Entities
  - Data Object, Replica, Collection, User, Resource, Rules, Metadata, Access
• Relations
  - (Collection) contains (Data Object)
  - (Resource) stores (Replica)
  - (Replica) replicates (Data Object)
  - (User) owns (Data Object)
  - (User) is granted (Access)
  - (Access) is granted on (Data Object)
• Operations
  - Get, put, replicate, etc.
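The entities and relations above can be written down as a small Python data model. This is a sketch under assumptions: the field names, the `level` attribute on `Access`, and the helper `replicas_of` are hypothetical, and the Rules and Metadata entities are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass(frozen=True)
class User:
    name: str

@dataclass(frozen=True)
class Resource:
    name: str

@dataclass(frozen=True)
class DataObject:
    name: str
    owner: User                 # (User) owns (Data Object)

@dataclass(frozen=True)
class Replica:
    replicates: DataObject      # (Replica) replicates (Data Object)
    stored_on: Resource         # (Resource) stores (Replica)

@dataclass(frozen=True)
class Access:
    granted_to: User            # (User) is granted (Access)
    granted_on: DataObject      # (Access) is granted on (Data Object)
    level: str                  # hypothetical field, e.g. "read"

@dataclass
class Collection:
    name: str
    contains: List[DataObject] = field(default_factory=list)  # (Collection) contains (Data Object)

def replicas_of(obj, replicas):
    """All replicas that replicate the given data object."""
    return [r for r in replicas if r.replicates == obj]

alice = User("alice")
obj = DataObject("survey.csv", owner=alice)
coll = Collection("home", contains=[obj])
replicas = [Replica(obj, Resource("disk-a")), Replica(obj, Resource("tape-b"))]
grant = Access(alice, obj, level="read")
print(len(replicas_of(obj, replicas)))
```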
Policy
• A policy is implemented as a set of procedures defined in terms of the Domain Model
  - Assertion about state: "A file has three or more replicas"
    · A procedure to maintain state consistency: replication rule acPostProcForPut
    · (Hardware, human errors) A procedure to check state consistency: periodic integrity check
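The three parts of this policy (an assertion, a procedure that maintains it, and a procedure that checks it) can be sketched in Python. This is not the iRODS rule language: `post_proc_for_put` is a hypothetical stand-in for a rule attached to acPostProcForPut, and the catalog is modelled as a plain dictionary from object name to the set of resources holding a replica.

```python
MIN_REPLICAS = 3
RESOURCES = ["r1", "r2", "r3", "r4"]   # hypothetical storage resources

def assertion_holds(catalog, obj):
    """Assertion about state: the object has three or more replicas."""
    return len(catalog.get(obj, set())) >= MIN_REPLICAS

def post_proc_for_put(catalog, obj):
    """Maintain consistency: replicate until the assertion holds."""
    for resource in RESOURCES:
        if assertion_holds(catalog, obj):
            break
        catalog[obj].add(resource)

def put(catalog, obj, resource):
    """Ingest an object, then run the post-process policy."""
    catalog.setdefault(obj, set()).add(resource)
    post_proc_for_put(catalog, obj)

def periodic_integrity_check(catalog):
    """Check consistency: report objects that violate the assertion."""
    return sorted(o for o in catalog if not assertion_holds(catalog, o))

catalog = {}
put(catalog, "a.dat", "r2")
put(catalog, "b.dat", "r4")
catalog["b.dat"].discard("r1")          # simulate a hardware or human error
print(periodic_integrity_check(catalog))
```

The periodic check matters because the maintenance procedure runs only at ingest; replicas lost later are caught only by re-evaluating the assertion.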
Example of Formalism Using Monad
• Monad recap:
  - A monad represents computations (possibly with side effects; in our example, assume only state change)
• Monad constructors:
  - return: trivial computation that returns a value
  - x >> y: do x, then y
  - x >>= y: feed the return value of x into y
• Monad laws (m is a computation; f, g, h are functions that return computations):
  - return x >>= f = f x (Left Id)
  - m >>= return = m (Right Id)
  - (m >>= g) >>= h = m >>= (λx. g x >>= h) (Associative)
    · (A B) C => A (B C)
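The constructors and laws can be tried out in Python by modelling a computation as a function from a state to a pair (value, new state). This is one possible encoding, a state monad, chosen because the slides assume only state change. The names `unit`, `bind`, and `then` stand for return, >>=, and >>; `tick` and `double` are hypothetical computations used only to exercise the laws.

```python
def unit(x):
    """return x: yield x and leave the state untouched."""
    return lambda s: (x, s)

def bind(m, f):
    """m >>= f: run m, then feed its value into f."""
    def run(s):
        x, s1 = m(s)
        return f(x)(s1)
    return run

def then(m, n):
    """m >> n: run m, discard its value, then run n."""
    return bind(m, lambda _: n)

# Two small state-changing computations (hypothetical examples).
def tick(x):
    return lambda s: (x + 1, {**s, "ticks": s.get("ticks", 0) + 1})

def double(x):
    return lambda s: (x * 2, {**s, "last": x})

s0 = {}
m = tick(0)

# Each law is checked on one sample input; this is a spot check, not a proof.
left_id = bind(unit(3), tick)(s0) == tick(3)(s0)
right_id = bind(m, unit)(s0) == m(s0)
assoc = (bind(bind(m, tick), double)(s0)
         == bind(m, lambda x: bind(tick(x), double))(s0))
print(left_id, right_id, assoc)
```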
Domain Model
• Entities:
  - DataObject, Content, Replica, Resource
• Relations:
  - replica: r = replica(o, i)
    r is the replica of o at resource i
  - replicas: r ∈ replicas(o)
    r is a replica of o
Domain Model
• Basic operations:
  - read: read r (read content of replica r)
  - write: write c r (write content c to replica r)
  - aread: aread i (read ith latest audit log entry)
  - awrite: awrite s r (append to audit log (s, r))
  - repl: repl o (replicate o to all resources)
  - newest: newest o (the newest replica of object o)
Complex Operations and Policy Enforcement Points
• Complex operations:
  - oread: oread o (read the content of object o)
  - owrite: owrite c o (write content c to object o)
• Defined in terms of basic operations + Policy Enforcement Points (PEPs):
  - op args = pre args >>= op' args >>= post args
• We define oread and owrite:
  - oread o = pre o >>= read >>= post o
  - owrite c o = pre c o >>= write c >>= post c o
Basic Semantics
• Only one resource i
  - oread
    · pre = return (replica o i): read replica of object o
    · post = return: return content of replica
  - owrite
    · pre = return (replica o i): write replica of object o
    · post = return: simply return
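The single-resource semantics can be run in a minimal Python sketch, assuming the composition rule op args = pre args >>= op' args >>= post args and a state-monad encoding in which a computation maps a state to (value, new state). The helpers `make_oread` and `make_owrite`, the resource name, and the dictionary layout of the state are assumptions for illustration.

```python
def unit(x):                      # return
    return lambda s: (x, s)

def bind(m, f):                   # >>=
    def run(s):
        x, s1 = m(s)
        return f(x)(s1)
    return run

def replica(o, i):
    """The replica of object o at resource i, modelled as a key."""
    return (o, i)

def read(r):
    return lambda s: (s["replicas"][r], s)

def write(c):
    def to_replica(r):
        return lambda s: (None, {**s, "replicas": {**s["replicas"], r: c}})
    return to_replica

# Complex operations are basic operations wrapped by pre and post PEPs.
def make_oread(pre, post):
    return lambda o: bind(bind(pre(o), read), post(o))

def make_owrite(pre, post):
    return lambda c, o: bind(bind(pre(c, o), write(c)), post(c, o))

I = "resource-1"                  # the only resource (hypothetical name)
oread = make_oread(pre=lambda o: unit(replica(o, I)), post=lambda o: unit)
owrite = make_owrite(pre=lambda c, o: unit(replica(o, I)),
                     post=lambda c, o: unit)

s0 = {"replicas": {}}
_, s1 = owrite("hello", "obj")(s0)
content, s2 = oread("obj")(s1)
print(content)
```

Changing the system's behaviour then means supplying different `pre` and `post` functions while `read` and `write` stay the same.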
Auditing
• One resource i + audit log
  - oread
    · pre = awrite "read" o >> return (replica o i): audit + read replica of object o
    · post = return: return content of replica
  - owrite
    · pre = awrite "write" o >> return (replica o i): audit + write replica of object o
    · post = return: simply return
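Auditing changes only the pre-process step, which a short Python sketch makes concrete. It uses the same state-monad encoding as a modelling assumption; the audit log is a list in the state, and treating `aread 0` as the latest entry is an assumption, since the slides do not say where the index starts.

```python
def unit(x):                      # return
    return lambda s: (x, s)

def bind(m, f):                   # >>=
    def run(s):
        x, s1 = m(s)
        return f(x)(s1)
    return run

def then(m, n):                   # >>
    return bind(m, lambda _: n)

def read(r):
    return lambda s: (s["replicas"][r], s)

def write(c):
    def to_replica(r):
        return lambda s: (None, {**s, "replicas": {**s["replicas"], r: c}})
    return to_replica

def awrite(action, o):
    """Append (action, o) to the audit log."""
    return lambda s: (None, {**s, "audit": s["audit"] + [(action, o)]})

def aread(i):
    """Read the ith latest audit entry (assumption: 0 is the latest)."""
    return lambda s: (s["audit"][-1 - i], s)

I = "resource-1"                  # the only resource (hypothetical name)

def oread(o):
    pre = then(awrite("read", o), unit((o, I)))      # audit, then pick replica
    return bind(bind(pre, read), unit)               # post = return

def owrite(c, o):
    pre = then(awrite("write", o), unit((o, I)))     # audit, then pick replica
    return bind(bind(pre, write(c)), unit)           # post = return

s0 = {"replicas": {}, "audit": []}
_, s1 = owrite("hello", "obj")(s0)
content, s2 = oread("obj")(s1)
latest, _ = aread(0)(s2)
print(content, latest)
```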
Replication
• Multiple resources
  - oread
    · pre = return (replica o i): read arbitrary replica i of object o
    · post = return: return content of replica
  - owrite
    · pre = return (replica o i'): write arbitrary replica i' of object o
    · post = λx.(repl o >> return x): replicate and return
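With several resources, only the post-process step of owrite changes. The Python sketch below keeps the state-monad encoding as a modelling assumption. Two further assumptions: the arbitrary resource is passed in as an explicit argument `i`, and `repl` copies the content of the newest replica, tracked in a "newest" map, to every resource.

```python
RESOURCES = ["r1", "r2", "r3"]    # hypothetical resource names

def unit(x):                      # return
    return lambda s: (x, s)

def bind(m, f):                   # >>=
    def run(s):
        x, s1 = m(s)
        return f(x)(s1)
    return run

def then(m, n):                   # >>
    return bind(m, lambda _: n)

def read(r):
    return lambda s: (s["replicas"][r], s)

def write(c):
    def to_replica(r):
        o, i = r
        def run(s):
            return None, {**s,
                          "replicas": {**s["replicas"], r: c},
                          "newest": {**s["newest"], o: i}}
        return run
    return to_replica

def repl(o):
    """Replicate the newest content of o to all resources."""
    def run(s):
        content = s["replicas"][(o, s["newest"][o])]
        copies = {(o, i): content for i in RESOURCES}
        return None, {**s, "replicas": {**s["replicas"], **copies}}
    return run

def oread(o, i):
    return bind(bind(unit((o, i)), read), unit)          # post = return

def owrite(c, o, i):
    post = lambda x: then(repl(o), unit(x))              # replicate and return
    return bind(bind(unit((o, i)), write(c)), post)

s0 = {"replicas": {}, "newest": {}}
_, s1 = owrite("v1", "obj", "r2")(s0)
_, s2 = owrite("v2", "obj", "r3")(s1)
print([oread("obj", i)(s2)[0] for i in RESOURCES])
```

Because every write is followed by `repl`, a read may pick any resource and still see the latest content, which is why the read-side pre-process step can choose a replica arbitrarily.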
Policy-based Data Management Concept Graph
[Figure: complete concept graph, annotated with counts. Purpose (5 main types; subtypes shown: Archive, Data grid, Collection, Digital Library, Processing Pipeline); Property (7 default); Policy (11 default); Procedure (11 default); Clients (50); Policy Enforcement Points (72); Micro-service (350); Persistent State Information (338). Other nodes: Collection, Attribute, Digital Object, Workflow, Operation, Periodic Assessment Criteria. Features: Completeness, Correctness, Consensus, Consistency. Properties: Integrity, Authenticity, Access control. Policies: Replication Policy, Checksum Policy, Quota Policy, Data Type Policy. Micro-services: msiGetUserACL, msiSetDataType, msiSetQuota, msiDataObjRepl, msiSysChksumDataObj. Persistent state: DATA_ID, DATA_REPL_NUM, DATA_CHECKSUM. Edge labels: Defines, Has, HasFeature, Isa, SubType, Controls, Updates, Invokes, Chains.]
iRODS Distributed Data Management
iRODS data grid
Integrated Rule Oriented Data System
Open source software: http://irods.org
Supported by the iRODS Consortium