IBM Research Laboratory in Haifa
zFS - A Reliable Distributed ObS-based File System
Julian Satran and Avi Teperman
zFS Background
zFS is part of continued research on storage:
- Started with the Distributed Sharing Facility (DSF)
- Continued with the Antara Object Store Device
- The ObS standard is under development by SNIA and is a formal T10 project
- zFS will move to the future ObS
zFS is an attempt to explore a completely distributed file system based on Object Storage.
zFS Goals
- Operate well on a few machines or on thousands of machines
- Be built from off-the-shelf components with Object Storage
- Use the memory of all machines as a global cache
- Achieve almost linear scalability
How zFS Goals are Achieved
Scalability is achieved by:
- Separating storage-space management from file management
  - Storage-space management is encapsulated in the ObS
- Dynamically distributing file management
  - No central server
  - No single point of failure
- Machines can be dynamically added to/removed from zFS
zFS Architecture
SAN File Systems Architecture
- Many computers connected to many storage devices by a high-speed network
- Clients can have non-mediated access to storage, which improves storage access
SAN File Systems Challenges: Security and Protection
Security vs. protection:
- Protection: against buggy clients, inadvertent access, etc.; useful inside and outside the glass house
- Security: against intentional attempts at unauthorized access; essential outside the glass house
SAN security/protection today:
- Done at the unit level: zoning/fencing/masking are hard to use and operate at the LU level
- There are too many actively used blocks to provide block-level security; the coordination overhead is too high
- Only trusted clients are assumed
SAN File Systems Challenges: Scalability
- Scalability is not an issue when volumes are partitioned among hosts
- However, shared access is one of the touted benefits of a SAN
- For shared read-write access, hosts must coordinate their usage of blocks
- File systems must coordinate the allocation of blocks to files
- Coordination can result in:
  - False contention between hosts allocating space from different logical units
  - Additional communication (it may be possible to piggyback some of the information)
Object Store Security
- All operations are secured by a credential
- Security is achieved by the cooperation of:
  - The admin, which authenticates/authorizes clients and generates credentials
  - The ObS, which validates the credential a client presents; the credential is cryptographically hardened, and the ObS and admin share a secret
- The goals of Object Store security are:
  - Increased protection/security, at the level of objects rather than LUs; hosts do not access metadata directly
  - Allow non-trusted clients to sit on the SAN
  - Allow shared access to storage without giving clients access to all data on a volume
[Diagram: the client sends an authorization request to the Security Admin and receives a credential; the client presents the credential to the Object Store, which validates it using the secret it shares with the admin. A minimal sketch of this flow follows.]
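To make the flow concrete, here is a minimal sketch in Go. It assumes the credential is an HMAC over the capability fields (object ID, rights, expiry) keyed by the admin/ObS shared secret; the actual T10 OSD credential format is richer, and all names here are illustrative, not zFS APIs.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"time"
)

// Credential is a hypothetical capability: which object, what rights, until when.
// The real T10 OSD capability format is richer; this only illustrates the flow.
type Credential struct {
	ObjectID uint64
	Rights   string // e.g. "read", "write"
	Expiry   int64  // Unix seconds
	Tag      []byte // HMAC over the fields, keyed by the admin/ObS shared secret
}

func payload(objectID uint64, rights string, expiry int64) []byte {
	buf := make([]byte, 16+len(rights))
	binary.BigEndian.PutUint64(buf[0:], objectID)
	binary.BigEndian.PutUint64(buf[8:], uint64(expiry))
	copy(buf[16:], rights)
	return buf
}

// The Security Admin signs a credential with the shared secret.
func adminIssue(secret []byte, objectID uint64, rights string, ttl time.Duration) Credential {
	exp := time.Now().Add(ttl).Unix()
	mac := hmac.New(sha256.New, secret)
	mac.Write(payload(objectID, rights, exp))
	return Credential{ObjectID: objectID, Rights: rights, Expiry: exp, Tag: mac.Sum(nil)}
}

// The ObS validates the credential a client presents, without talking to the admin.
func obsValidate(secret []byte, c Credential) bool {
	if time.Now().Unix() > c.Expiry {
		return false
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(payload(c.ObjectID, c.Rights, c.Expiry))
	return hmac.Equal(c.Tag, mac.Sum(nil))
}

func main() {
	secret := []byte("admin-obs-shared-secret")
	cred := adminIssue(secret, 42, "read", time.Minute)
	fmt.Println("credential valid:", obsValidate(secret, cred)) // true
}
```

Because the ObS validates the tag locally, clients never see the secret and the admin is off the data path.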
zFS Components
- Object Store Device (ObS)
  - zFS assumes that ObSs have a fail-over mechanism
  - ObS is rather recent; the Working Group is approaching a first version of the spec
- Lease Manager (LMGR)
- File Manager (FMGR)
- Transaction Manager (TMGR)
- Front End / Cache (FE/Cache)
- No single point of failure
zFS Components: Object Store Device (ObS)
- Using an ObS allows zFS to focus on file management and scalability
- Allocation of storage is done in the ObS
- Security is handled by the Security Admin and the ObS
- Cooperative caching poses a security challenge
zFS Components: Lease Manager (LMGR)
The need for a lease manager stems from the following facts:
- A locking mechanism is required to control access to the disks
- In SAN file systems, clients can write directly to ObSs
- zFS uses a locking discipline for concurrency control
zFS Components: Lease Manager (LMGR)
To reduce the ObS's overhead, the following mechanism is used (a sketch follows):
- Each ObS is associated with one lease manager
- The ObS maintains, and grants to its LMGR, one major lease
- The LMGR grants object leases to the FMGRs requesting them
- The FMGR grants range leases to the FEs requesting them
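A minimal Go sketch of this lease hierarchy, under assumed types (nothing here is the zFS API): the ObS hands its single major lease to one LMGR, which subdivides it into per-object leases for FMGRs; the range leases an FMGR grants to FEs would follow the same pattern one level down.

```go
package main

import (
	"fmt"
	"time"
)

// Lease is a time-limited grant; holders must renew before expiry.
type Lease struct {
	Holder string
	Expiry time.Time
}

func (l Lease) valid() bool { return time.Now().Before(l.Expiry) }

// ObS grants one major lease, covering the whole object store, to a single LMGR.
type ObS struct {
	major *Lease
}

func (o *ObS) grantMajor(lmgr string, ttl time.Duration) (Lease, bool) {
	if o.major != nil && o.major.valid() && o.major.Holder != lmgr {
		return Lease{}, false // another LMGR still holds the major lease
	}
	l := Lease{Holder: lmgr, Expiry: time.Now().Add(ttl)}
	o.major = &l
	return l, true
}

// LMGR subdivides its major lease into per-object leases for FMGRs.
type LMGR struct {
	major   Lease
	objects map[uint64]*Lease // object ID -> current object lease
}

func (m *LMGR) grantObject(obj uint64, fmgr string, ttl time.Duration) (Lease, string, bool) {
	if cur, ok := m.objects[obj]; ok && cur.valid() && cur.Holder != fmgr {
		return Lease{}, cur.Holder, false // redirect: another FMGR manages this object
	}
	l := Lease{Holder: fmgr, Expiry: time.Now().Add(ttl)}
	m.objects[obj] = &l
	return l, "", true
}

func main() {
	obs := &ObS{}
	major, _ := obs.grantMajor("lmgr-1", 30*time.Second)
	lm := &LMGR{major: major, objects: map[uint64]*Lease{}}
	_, _, ok := lm.grantObject(7, "fmgr-A", 10*time.Second)
	_, holder, ok2 := lm.grantObject(7, "fmgr-B", 10*time.Second)
	fmt.Println(ok, ok2, holder) // true false fmgr-A
}
```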
zFS Components: Lease Manager (LMGR)
- Together, the LMGR and FMGR provide a distributed and safe leasing operation
- We prefer leases over locks to reduce state-related issues, which make recovery complex
- Leases incur renewal overhead, but it is negligible: leases are renewed periodically, with one operation per renewing machine (sketched below)
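A small sketch of what "one operation per renewing machine" could look like, with illustrative names: each machine batches all the leases it holds into a single periodic renewal call instead of renewing each lease individually.

```go
package main

import (
	"fmt"
	"time"
)

// leaseTable models the grantor's view of lease expiries.
type leaseTable struct {
	expiry map[string]time.Time // lease ID -> expiry
}

// renewAll extends every held lease in one call: this is the single
// operation per renewing machine that keeps renewal overhead negligible.
func (t *leaseTable) renewAll(ids []string, ttl time.Duration) {
	deadline := time.Now().Add(ttl)
	for _, id := range ids {
		t.expiry[id] = deadline
	}
}

func main() {
	t := &leaseTable{expiry: map[string]time.Time{}}
	held := []string{"obj-7/range-0-4096", "obj-7/range-8192-12288"}
	ticker := time.NewTicker(50 * time.Millisecond) // renew well before expiry
	defer ticker.Stop()
	for i := 0; i < 3; i++ {
		<-ticker.C
		t.renewAll(held, 200*time.Millisecond)
		fmt.Println("renewed", len(held), "leases with one call")
	}
}
```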
zFS Components: File Manager (FMGR)
- Each file is managed by exactly one file manager
- Each lease request on the file is mediated by the FMGR
- The first client opening a file creates an instance of the file manager (in the current implementation this instance is local to the client)
- The FMGR interacts with the proper LMGR to get the object lease, and grants range leases to the FEs
zFS Components: File Manager (FMGR)
The FMGR keeps track of:
- Where (on which FE) each file's extents reside
- Each file's leases
If client X requests a page and lease that reside on client Y, the FMGR directs the FE/Cache on Y to send the requested page and lease directly to X (sketched below).
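A minimal sketch of this third-party transfer, with made-up message types (none of this is the zFS wire protocol): the FMGR knows which FE caches the page and tells Y's FE to forward the page and lease directly to X, bypassing the ObS.

```go
package main

import "fmt"

// page is one cached unit of file data (the lease travels with it here).
type page struct {
	file   string
	offset int64
	data   []byte
}

// fe is a front end with a local cache and an inbox for forwarded pages.
type fe struct {
	cache map[string]page // "file:offset" -> page
	inbox chan page
}

func key(file string, off int64) string { return fmt.Sprintf("%s:%d", file, off) }

// forward implements the FMGR's instruction: Y sends the page straight to X.
func (y *fe) forward(file string, off int64, x *fe) {
	if p, ok := y.cache[key(file, off)]; ok {
		x.inbox <- p // memory-to-memory, bypassing the ObS
	}
}

// fmgr tracks which FE holds each page (and its lease).
type fmgr struct {
	location map[string]*fe
}

func (m *fmgr) request(file string, off int64, x *fe) {
	if y := m.location[key(file, off)]; y != nil && y != x {
		y.forward(file, off, x)
	}
}

func main() {
	x := &fe{cache: map[string]page{}, inbox: make(chan page, 1)}
	y := &fe{cache: map[string]page{}, inbox: make(chan page, 1)}
	y.cache["f:0"] = page{file: "f", offset: 0, data: []byte("hello")}
	m := &fmgr{location: map[string]*fe{"f:0": y}}
	m.request("f", 0, x)
	fmt.Printf("X received %q from Y's cache\n", (<-x.inbox).data)
}
```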
zFS Components: File Manager (FMGR)
File manager assignment is dynamic (a sketch follows):
- The FE requests a lease from the local FMGR (fmgr_i)
- fmgr_i requests the object lease from the LMGR
- The LMGR checks:
  - If no other FMGR holds the object lease, it grants the object lease to fmgr_i
  - If another FMGR (fmgr_k) has the object lease, the LMGR returns its address to fmgr_i
- fmgr_i then instructs the FE to request the lease from fmgr_k
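A Go sketch of this assignment protocol, with illustrative names: the LMGR tracks which FMGR holds each object lease and redirects later requesters to the current holder.

```go
package main

import "fmt"

// lmgr remembers which FMGR currently holds each object lease.
type lmgr struct {
	holder map[uint64]string // object ID -> FMGR holding its lease
}

// requestObjectLease returns (granted, redirectTo).
func (l *lmgr) requestObjectLease(obj uint64, fmgr string) (bool, string) {
	if cur, ok := l.holder[obj]; ok && cur != fmgr {
		return false, cur // fmgr_k already manages this object
	}
	l.holder[obj] = fmgr
	return true, ""
}

// openFile returns the FMGR the FE should talk to for this file.
func openFile(l *lmgr, obj uint64, localFMGR string) string {
	ok, other := l.requestObjectLease(obj, localFMGR)
	if ok {
		return localFMGR // the local FMGR now manages the file
	}
	return other // FE is told to request range leases from the existing manager
}

func main() {
	l := &lmgr{holder: map[uint64]string{}}
	fmt.Println(openFile(l, 7, "fmgr@hostA")) // first opener: fmgr@hostA
	fmt.Println(openFile(l, 7, "fmgr@hostB")) // redirected to fmgr@hostA
}
```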
zFS Components: Transaction Manager (TMGR)
- Metadata operations touch several objects
- To ensure file system consistency, zFS implements them as distributed transactions
- All metadata operations are handled by the TMGR (a generic illustration follows)
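The deck lists distributed transactions under future research and does not specify the TMGR's protocol, so the following is only a generic two-phase-commit illustration, with made-up types, of why a metadata operation spanning several objects needs atomic handling; it is not the zFS mechanism.

```go
package main

import (
	"errors"
	"fmt"
)

// participant is one object touched by a metadata operation.
type participant struct {
	name     string
	prepared bool
}

func (p *participant) prepare() error {
	p.prepared = true // in a real system: log intent durably, lock the object
	return nil
}

func (p *participant) commit() { fmt.Println(p.name, "committed") }
func (p *participant) abort()  { fmt.Println(p.name, "aborted") }

// createFile updates both the parent directory object and the new file
// object atomically: either both see the change or neither does.
func createFile(dir, file *participant) error {
	for _, p := range []*participant{dir, file} {
		if err := p.prepare(); err != nil {
			dir.abort()
			file.abort()
			return errors.New("create aborted")
		}
	}
	dir.commit()
	file.commit()
	return nil
}

func main() {
	_ = createFile(&participant{name: "dir-object"}, &participant{name: "file-object"})
}
```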
zFS Components: Front End / Cache
FE:
- Runs on every client machine
- Presents the standard file system API to the application/user
- Provides access to zFS files and directories
Cache:
- Gives other machines access to zFS data and metadata held in local memory
- Since data is transferred from memory to memory, there is a security issue if one client is untrusted; this requires further research
zFS Architecture
zFS read() Operation
zFS write() Operation: Only Read Leases
zFS write() Operation: One Write Lease
zFS Failure Handling: LMGR_i Failed
- The failure is detected by all FMGRs that hold leases for objects on ObS_i
- Each FMGR informs all FEs holding files on ObS_i to flush their dirty data and release the files
- The FMGRs instantiate a new LMGR_i, which tries to get the ObS_i major lease
- Once the previous major lease expires, one LMGR_i gets the major lease and all the others are terminated (sketched below)
- Operation on ObS_i resumes
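A minimal sketch of the recovery race, with assumed types and timings: several replacement LMGRs may be instantiated, but the ObS hands out the major lease only after the old one expires, and only one candidate wins; the rest terminate.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// obs holds the current major lease and its expiry.
type obs struct {
	mu     sync.Mutex
	holder string
	expiry time.Time
}

// tryAcquireMajor succeeds only once the previous major lease has expired,
// and only for the first candidate to arrive after that.
func (o *obs) tryAcquireMajor(candidate string, ttl time.Duration) bool {
	o.mu.Lock()
	defer o.mu.Unlock()
	if time.Now().Before(o.expiry) { // previous major lease still valid
		return false
	}
	o.holder, o.expiry = candidate, time.Now().Add(ttl)
	return true
}

func main() {
	o := &obs{holder: "lmgr-old", expiry: time.Now().Add(100 * time.Millisecond)}
	time.Sleep(120 * time.Millisecond) // wait out the old major lease
	var wg sync.WaitGroup
	for _, c := range []string{"lmgr-new-1", "lmgr-new-2"} {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			if o.tryAcquireMajor(name, time.Second) {
				fmt.Println(name, "won the major lease; the others terminate")
			}
		}(c)
	}
	wg.Wait()
}
```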
zFS Failure Handling: FMGR_i Failed
- The failure is detected by all FEs that hold leases granted by FMGR_i
- Each FE flushes all its dirty data to ObS_i and releases the files
- The FEs instantiate a new local FMGR_i
- When an application requests a range lease for file F, the local FMGR requests an object lease from the LMGR
- If another local FMGR already got the object lease, its address is passed back to the FE via the requesting FMGR
- The FE connects to the correct FMGR and operation continues
- All of this is transparent to the application
zFS Future Research
- Distributed transactions: handling metadata operations in a distributed, scalable manner
- Add a security mechanism: investigate how the cooperative cache integrates with the security model of the Object Store
Related Documents
zFS web site: http://www.haifa.il.ibm.com/projects/storage/zFS/index.html
DSF web site: http://www.haifa.il.ibm.com/projects/systems/dsf.html
Publications:
- zFS - A Scalable Distributed File System Using Object Disks; O. Rodeh & A. Teperman; MSST 2003
- Group Communication - Still Complex After All These Years; R. Golding & O. Rodeh; SRDS 2003
- Object Store Based SAN File Systems; J. Satran & A. Teperman; SSCCII-2004
Current Status
- zFS is implemented on Linux kernel 2.4.19
- Currently, all components work except the TMGR
- Initial results show a ~20% improvement on reads with the cooperative cache
Related Work
Lustre:
- Clients
- Cluster control system (MDS)
- Storage targets (programmable ObS)
StorageTank:
- Clients and MDS (on a different IP network)
- Storage: currently standard SCSI disks over a SAN
- Integration with Object Store
xFS and XFS:
- xFS is similar to zFS, but with a static distribution policy
- xFS implemented cooperative caching: M. D. Dahlin, C. J. Mather, R. Y. Wang, T. E. Anderson and D. A. Patterson, "A Quantitative Analysis of Cache Policies for Scalable Network File Systems", Computer Science Division, University of California at Berkeley