Exporting Storage Systems in a Scalable Manner with pNFS
Dean Hildebrand
Advisor: Peter Honeyman
Center for Information Technology Integration
University of Michigan
Page 2
Outline
Motivation
pNFS Overview
pNFS Prototype
Experiments
Page 3
Motivation: ASCI Example
[Figure: ASCI Platform, Data Storage and File System Architecture. Compute nodes, file system nodes, an HPSS archive, a visualization cluster, studio displays, and login nodes connected by InfiniBand or GbE I/O networks and system data and control networks, with NFS access from the login networks. From the ASCI Technology Prospectus, July 2001.]
Page 4
Motivation: HPC Out-Of-Band File Systems
Asymmetric
Direct storage access
Separate metadata server(s)
Object based: Lustre, Panasas ActiveScale
Block based: EMC HighRoad, IBM SAN FS
File based: NASD NFS
Symmetric
Direct storage access
Each node is a fully capable client and metadata server
Examples: IBM GPFS, Red Hat GFS, PolyServe Matrix Server
Page 5
Motivation: NFS and OOB File Systems
Issues:
Single server bottleneck
Extra level of indirection
Symmetric OOB File Systems
Asymmetric OOB File Systems
Page 6
Problem Statement
HPC OOB File System Issues
Interoperability
Cost
Proprietary
Remote access performance (NFS, CIFS)
NFSv4 Issues
Many-to-one relationship of NFSv4 clients to server
Cannot scale with exported storage
Page 7
pNFS
IETF NFSv4 protocol extension
Scale with the underlying file system
Clients perform direct I/O to storage
Escape NFSv4 block size restrictions
Single file access
File system independent
Support all layout maps (block, object, file, etc.)
Create a global namespace of disparate HPC file systems
Interoperate with standard NFSv4 clients and servers
Storage still accessible through the NFSv4 server
Operate over any NFSv4 infrastructure
Support existing storage protocols and infrastructures
Examples: SBC over Fibre Channel or iSCSI, NFSv4
Page 8
pNFS Architecture
Page 9
NFSv4 Extensions (1/2)
LAYOUTGET operation
Retrieves file layout information
Valid until returned or file close
Arguments: File handle, Offset, Extent, I/O type, State identifier, Maximum count and cookie
Results: Offset, Extent, Cookie, Opaque layout
Other Operations: LAYOUTCOMMIT, LAYOUTRETURN, CB_LAYOUTRECALL, GETDEVICEINFO, GETDEVICELIST
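As a rough illustration of the exchange above, the LAYOUTGET arguments and results can be pictured as the C structures below; the field names and types are assumptions chosen for readability, not the XDR definitions from the pNFS draft.

#include <stdint.h>

/* Illustrative only: not the on-the-wire XDR from the pNFS draft. */
struct layoutget_args {
    uint8_t  fh[128];      /* NFSv4 file handle (opaque)             */
    uint64_t offset;       /* start of the byte range of interest    */
    uint64_t extent;       /* length of the byte range               */
    uint32_t iomode;       /* I/O type: read or read/write           */
    uint8_t  stateid[16];  /* state identifier from OPEN/LOCK        */
    uint32_t maxcount;     /* maximum layout size the client accepts */
    uint64_t cookie;       /* continuation cookie for large layouts  */
};

struct layoutget_res {
    uint64_t offset;       /* range actually covered by the layout     */
    uint64_t extent;
    uint64_t cookie;       /* resume point if the layout was truncated */
    uint32_t layout_len;   /* length of the opaque layout blob         */
    uint8_t *layout;       /* opaque layout, decoded by a layout driver,
                              never interpreted by generic NFSv4 code  */
};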
Page 10
NFSv4 Extensions (2/2)
Layout Driver
Interprets layout information
File layout protocol specific
Support standard and non-standard storage protocols
Multiple per client
Multiple per file system
LAYOUT_CLASSES file system attribute
I/O Driver
Performs raw I/O to storage nodes
Examples: Myrinet GM, InfiniBand
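One purely hypothetical way to picture the split between the two drivers in C is sketched below; the structure and function names are illustrative only, and the prototype described later instead exposes these hooks through the standard Linux file_operations interface.

#include <stdint.h>
#include <sys/types.h>

/* Hypothetical operations tables, for illustration only. */
struct layout_driver {
    uint32_t layout_class;                            /* advertised via LAYOUT_CLASSES */
    void *(*decode)(const void *opaque, size_t len);  /* parse the opaque layout       */
    void  (*release)(void *layout);                   /* drop a cached layout          */
};

struct io_driver {
    /* Raw I/O to the storage nodes named in the decoded layout,
       e.g., over Myrinet GM or InfiniBand. */
    ssize_t (*read)(void *layout, char *buf, size_t count, off_t pos);
    ssize_t (*write)(void *layout, const char *buf, size_t count, off_t pos);
};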
Page 11
pNFS Prototype
NFSv4 on Linux 2.6.9-rc3
Exports the PVFS2 1.0.1 OOB file system
Page 12
PVFS2 Overview
Developed at Argonne National Laboratory
Algorithmic file layout
Currently supports round-robin striping
LAYOUTCOMMIT, LAYOUTRETURN, CB_LAYOUTRECALL not required
No locking subsystem
No data caching
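Because the layout is algorithmic, a client can compute where any byte of a file lives. The sketch below shows generic round-robin striping arithmetic, not PVFS2's actual distribution code; stripe_size and nnodes would come from the layout's distribution parameters.

#include <stdint.h>

struct stripe_loc {
    uint32_t node;      /* index into the layout's set of file handles */
    uint64_t obj_off;   /* offset within that node's object            */
};

/* Map a logical file offset to (storage node, object offset)
   under round-robin striping. */
static struct stripe_loc rr_locate(uint64_t file_off,
                                   uint64_t stripe_size, uint32_t nnodes)
{
    struct stripe_loc loc;
    uint64_t unit = file_off / stripe_size;        /* which stripe unit      */
    loc.node    = (uint32_t)(unit % nnodes);       /* round-robin node       */
    loc.obj_off = (unit / nnodes) * stripe_size    /* whole units before it  */
                + (file_off % stripe_size);        /* offset inside the unit */
    return loc;
}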
Page 13
File Layout Retrieval Mechanism
LAYOUTGET sent on either:
Application read/write request
File open
pNFS server uses ioctl to retrieve layout
Client caches layout information
PVFS2 file layout consists of:
Size of layout
File system id
Number of file handles
Set of file handles (includes storage node info)
Distribution id
Distribution parameters, e.g., stripe size
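Read roughly as a C structure, the layout contents listed above might look like the sketch below; the field names are illustrative and do not reproduce the prototype's actual encoding.

#include <stdint.h>

struct pvfs2_layout {          /* illustrative field names only */
    uint32_t  layout_size;     /* size of the encoded layout            */
    uint32_t  fs_id;           /* PVFS2 file system id                  */
    uint32_t  num_handles;     /* number of data file handles           */
    uint64_t *handles;         /* file handles, incl. storage node info */
    uint32_t  dist_id;         /* distribution id (round-robin)         */
    uint64_t  stripe_size;     /* distribution parameter                */
};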
Page 14
PVFS2 Layout and I/O Driver
Registers on load
Standard Linux API
file_operations structure
Ioctl injects opaque layout
I/O Driver uses a specialized protocol over TCP/IP
ssize_t (*read) (struct file *filp, char __user *buf, size_t count, loff_t *pos);           /* direct read from PVFS2 storage nodes */
ssize_t (*write) (struct file *filp, const char __user *buf, size_t count, loff_t *pos);    /* direct write to PVFS2 storage nodes */
int (*ioctl) (struct inode *inode, struct file *file, unsigned int cmd, unsigned long arg); /* exchanges the opaque layout */
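A hedged sketch of the retrieval path through this interface: the pNFS server calls the registered driver's ioctl on the PVFS2 file to pull out the opaque layout it returns in LAYOUTGET. The command code PVFS2_GET_LAYOUT and the buffer structure are assumed names for illustration, not the prototype's real definitions.

#include <linux/fs.h>      /* struct file, struct inode, file_operations */
#include <linux/ioctl.h>   /* _IOR */

/* Assumed command code and argument struct, for illustration only. */
struct layout_buf {
    void  *data;    /* caller-supplied buffer for the opaque layout   */
    size_t len;     /* in: buffer size; out: bytes of layout returned */
};
#define PVFS2_GET_LAYOUT _IOR('p', 1, struct layout_buf)

static int fetch_layout(struct file *filp, struct layout_buf *buf)
{
    struct inode *inode = filp->f_dentry->d_inode;   /* Linux 2.6.9 field name */
    /* Kernel-side call into the layout driver's ioctl entry point. */
    return filp->f_op->ioctl(inode, filp, PVFS2_GET_LAYOUT, (unsigned long)buf);
}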
Page 15
pNFS Prototype Architecture
Page 16
Experimental Setup
Forty 2 GHz dual-Opteron nodes with 2 GB RAM
PVFS2 storage nodes use software RAID 0 across four ATA drives
PVFS2 1.0.1 with 16 storage nodes and 1 metadata server
pNFS server, PVFS2 client, and PVFS2 metadata server on single machine
Maximum 23 clients (46 processes)
Gigabit Ethernet
Linux 2.6.9-rc3
IOZone to measure aggregate I/O throughput
Warm read cache
Writes are immediately committed to disk
Page 17
Write Experiments – Separate Files
Page 18
Write Experiment – Single File
Page 19
Read Experiment – Separate Files
Page 20
Read Experiment – Single File
Page 21
Discussion
Layout and I/O drivers enable scalability
Avoid indirection penalty and single server bottleneck of NFSv4
Standard I/O protocol reduces development and support cost
Validated opaque layout design
Opaque file layout information and a standard layout driver interface enable underlying file system independence
Obviates the need for a proprietary file system client
A single pNFS client can interact with multiple parallel file systems on multiple platforms
Issue: Scalability of the LAYOUTGET operation
Page 22
Related Work
Scalability of NFS: Bigfoot-NFS, Expand, nfsp, NASD NFS
EMC HighRoad
NFS- or CIFS-based control channel and block-based data channel
Facilitates data sharing
Limited to the block-based EMC Symmetrix storage system
Storage Resource Broker (SRB)
Lacks parallel data access to multiple storage endpoints
Lacks integration with the local file system
GridFTP
No consistency protocol
Lacks integration with the local file system
Page 23
Future Work
pNFS Protocol:
Locking
Security
Different issues for file-, object-, and block-based systems
Layout management and delegation
Client data cache consistency
pNFS/PVFS2 Prototype:
MPI-IO support
General:
Large file layouts
Strided file layouts
Multiple pNFS servers
Page 24
More Information
pNFS – http://www.pdl.cmu.edu/pNFS/
NFSv4 – http://www.nfsv4.org/
CITI Linux NFSv4 Projects – http://www.citi.umich.edu/