TRANSCRIPT
May 4, 2005
Storage Aggregation for Performance & Availability:
The Path from Physical RAID to Virtual Objects
Univ Minnesota, Third Intelligent Storage Workshop
Garth Gibson, CTO, Panasas, and Assoc. Prof., CMU
Slide 2 April 26, 2005
Cluster Computing: The New Wave
Traditional Computing:
– Monolithic Computers, Monolithic Storage
– Single data path
– Issues: complex scaling, limited BW & I/O, islands of storage, inflexible, expensive
– Scale to bigger box
Cluster Computing:
– Linux Compute Cluster
– Parallel data paths
– But lower $ per Gbps
– Scale: file & total BW, file & total capacity, load & capacity balancing
Slide 3 April 26, 2005
Panasas: Next Gen Cluster Storage
Scalable performance
Parallel data paths to compute nodes
Scale clients, network and capacity
As capacity grows, performance grows
Simplified and dynamic management
Robust, shared file access by many clients
Seamless growth within single namespace eliminates time-consuming admin tasks
Integrated HW/SW solution
Optimizes performance and manageability
Ease of integration and support
ActiveScale Storage Cluster
Diagram: a Compute Cluster connects over parallel data paths to Object Storage Devices and over a control path to Metadata Managers. Single step: perform the job directly from the high-I/O Panasas Storage Cluster.
Slide 5 April 26, 2005
Birth of RAID (1986-1991)
Member of 4th Berkeley RISC CPU design team (SPUR: 84-89)
Dave Patterson decides CPU design is a “solved” problem
Sends me to figure out how storage plays in SYSTEM PERFORMANCE
IBM 3380 disk is 4 arms in a 7.5 GB washing machine box
SLED: Single Large Expensive Disk
New PC industry demands cost-effective 100 MB 3.5" disks
Enabled by new SCSI embedded-controller architecture
Use many PC disks for parallelism (SIGMOD '88: A Case for RAID)
P.S. $10-20 per MB (~1000X now), 100 MB/arm (~1000X now), 20-30 IO/sec/arm (5X now)
Slide 6 April 26, 2005
But RAID is really about Availability
Arrays have more Hard Disk Assemblies (HDAs) -- more failures
Apply replication and/or error/erasure detection codes
Mirroring wastes 50% of space; RAID wastes only 1/N
Mirroring halves, and RAID 5 quarters, small-write bandwidth (sketched below)
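To make that small-write penalty concrete, here is a minimal sketch (my own illustration, not code from the talk) of RAID 5 parity arithmetic: a small write reads the old data and old parity, XORs both with the new data to form the new parity, then writes data and parity, i.e. four I/Os where a mirrored write needs two.

    # Hypothetical illustration of RAID 5 parity and the small-write update rule.
    def xor_blocks(*blocks):
        """XOR equal-length byte blocks."""
        out = bytearray(len(blocks[0]))
        for b in blocks:
            for i, byte in enumerate(b):
                out[i] ^= byte
        return bytes(out)

    data = [bytes([d]) * 4 for d in (1, 2, 3)]   # one stripe: three data blocks
    parity = xor_blocks(*data)                   # plus one parity block

    # Small write to data[1]: read old data and old parity (2 reads),
    # compute new parity, write new data and new parity (2 writes) = 4 I/Os.
    new_block = bytes([9]) * 4
    parity = xor_blocks(parity, data[1], new_block)   # new parity = old parity XOR old data XOR new data
    data[1] = new_block
    assert parity == xor_blocks(*data)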
Slide 7 April 26, 2005
Off to CMU & More Availability
Parity Declustering “spreads RAID groups” to reduce MTTR
Each parity disk block protects fewer than all (C) data disk blocks
Virtualizing RAID group lessens recovery work
Faster recovery or better user response time during recovery or mixture of both
RAID over X?
X = Independent fault domains
“Disk” is easiest “X”
Parity declustering is the first step in RAID virtualization (see the arithmetic sketch below)
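As a back-of-the-envelope illustration (the (G-1)/(C-1) figure is the standard parity-declustering arithmetic, not a number from this slide): with parity groups of size G declustered over C > G disks, each surviving disk reads only about (G-1)/(C-1) of its contents to rebuild a failed disk, which is where the faster recovery or better user response time comes from.

    # Assumed declustering arithmetic: fraction of each surviving disk read during rebuild.
    def recovery_fraction(group_size, num_disks):
        return (group_size - 1) / (num_disks - 1)

    print(recovery_fraction(10, 10))   # conventional RAID 5 over 10 disks: 1.0 (read every survivor fully)
    print(recovery_fraction(5, 21))    # declustered, G=5 over C=21 disks: 0.2 (20% of each survivor)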
Slide 9 April 26, 2005
Storage Interconnect Evolution
Outboard circuitry increases over time (VLSI density)
Hardware (#hosts, #disks, #paths) sharing increases over time
Logical (information) sharing limited by host SW
1995: Fibre Channel packetizes SCSI over a near-general network
Packetized SCSI will go over any network (iSCSI)
What do we do next with device intelligence?
Slide 10 April 26, 2005
Storage as First Class Network Citizen
Direct transfer between client and storage
Exploit scalable switched cluster area networking
Split file service into: primitives (in drive) and policies (in manager)
Slide 11 April 26, 2005
NASD Architecture
Before NASD there were store-and-forward Server-Attached Disks (SAD)
Move access control and consistency out-of-band, and cache those decisions
Raise storage abstraction: encapsulate layout, offload data access
Slide 12 April 26, 2005
Fine Grain Access Enforcement
Diagram: NASD capability protocol among the client, file manager, and NASD drive; the file manager and the drive share a Secret Key, step 2 travels over private communication, and the MACs provide NASD integrity/privacy.
1: Client to file manager: request for access
2: File manager to client: CapArgs, CapKey, where CapArgs = ObjID, Version, Rights, Expiry, ... and CapKey = MAC_SecretKey(CapArgs)
3: Client to NASD: CapArgs, Req, NonceIn, ReqMAC, where ReqMAC = MAC_CapKey(Req, NonceIn)
4: NASD to client: Reply, NonceOut, ReplyMAC, where ReplyMAC = MAC_CapKey(Reply, NonceOut)
State of the art is a VPN of all out-of-band clients and all sharable data and metadata
Accident prone & vulnerable to subverted client; analogy to single-address space computing
Object Storage uses a digitally signed, object-specific capability on each request (see the sketch below)
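A minimal sketch of that capability scheme, assuming HMAC-SHA1 as the MAC (the real NASD/OSD MAC construction may differ): the manager and the drive share SecretKey, the manager derives CapKey = MAC_SecretKey(CapArgs) and hands it to the client, and the drive can then verify every request without contacting the manager.

    import hashlib, hmac

    def mac(key, *parts):
        return hmac.new(key, b"|".join(parts), hashlib.sha1).digest()

    secret_key = b"shared by file manager and NASD drive"   # illustrative value

    # Step 2: the manager builds CapArgs and CapKey for the client.
    cap_args = b"ObjID=42,Version=7,Rights=read,Expiry=2005-12-31"
    cap_key = mac(secret_key, cap_args)

    # Step 3: the client signs its request with CapKey.
    req, nonce_in = b"READ bytes 0-65535", b"nonce-123"
    req_mac = mac(cap_key, req, nonce_in)

    # The drive recomputes CapKey from CapArgs and checks ReqMAC before serving the request.
    assert mac(mac(secret_key, cap_args), req, nonce_in) == req_mac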
Slide 14 April 26, 2005
Today’s Ubiquitous NFS
ADVANTAGES
Familiar, stable & reliable
Widely supported by vendors
Competitive market
DISADVANTAGES
Capacity doesn’t scale
Bandwidth doesn’t scale
Cluster by customer-exposed namespace partitioning
Diagram: clients on the host net mount file servers, each exporting a sub-file system backed by disk arrays on the storage net.
Slide 15 April 26, 2005
Scale Out 1: Forwarding NFS Servers
Bind many file servers into single system image with forwarding
Mount point binding less relevant, allows DNS-style balancing, more manageable
Control and data traverse mount point path (in band) passing through two servers
Single file and single file system bandwidth limited by backend server & storage
Tricord, Spinnaker
Diagram: clients on the host net mount a file server cluster; requests are forwarded between servers before reaching the disk arrays on the storage net.
Slide 16 April 26, 2005
Scale Out File Service w/ Out-of-Band
Client sees many storage addresses, accesses in parallel
Zero file servers in data path allows high bandwidth thru scalable networking
Mostly built on block-based SANs where servers trust all clients
E.g.: IBM SanFS, IBM GPFS, EMC HighRoad, Sun QFS, SGI CXFS, etc
Beginning to be built on Object storage (SANs too)
E.g. Panasas DirectFlow, Lustre,
Diagram: clients access storage devices directly and in parallel, with the file servers out of the data path.
Slide 18 April 26, 2005
Object Storage Architecture
An evolutionary improvement to standard SCSI storage interface (OSD)
Offload most data path work from server to intelligent storage
Finer granularity of security: protect & manage one file at a time
Raises level of abstraction: Object is container for “related” data
Storage understands how the blocks of a "file" are related -> self-management
Per-object extensible attributes: a key expansion of storage functionality
Block-Based Disk vs. Object-Based Disk (Source: Intel)
              Block-Based Disk           Object-Based Disk
Operations:   Read block, Write block    Create, Delete, Read, Write object
Addressing:   Block range                [object, byte range]
Allocation:   External                   Internal
Security:     At Volume level            At Object level
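A toy sketch (my own, not the T10 OSD interface) contrasting the two columns above: an object-based device exposes create/delete/read/write on [object, byte range] and allocates space internally, while a block device only reads and writes externally allocated block ranges.

    # Hypothetical in-memory object storage device illustrating the OSD-style interface.
    class ToyOSD:
        def __init__(self):
            self.objects = {}      # object id -> bytearray; allocation is internal to the device
            self.next_id = 1

        def create(self):
            oid, self.next_id = self.next_id, self.next_id + 1
            self.objects[oid] = bytearray()
            return oid

        def delete(self, oid):
            del self.objects[oid]

        def write(self, oid, offset, data):
            obj = self.objects[oid]
            if len(obj) < offset + len(data):
                obj.extend(b"\0" * (offset + len(data) - len(obj)))   # grow the object as needed
            obj[offset:offset + len(data)] = data

        def read(self, oid, offset, length):
            return bytes(self.objects[oid][offset:offset + length])

    osd = ToyOSD()
    oid = osd.create()
    osd.write(oid, 0, b"hello object world")
    print(osd.read(oid, 6, 6))   # b'object'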
Slide 19 April 26, 2005
Why I Like Objects
Objects are flexible, autonomous
Variable-length data with layout metadata encapsulated
Share one access control decision
Amortize object metadata
Defer allocation, smarter caching
With extensible attributes
E.g. size, timestamps, ACLs, +
Some updated inline by device
Extensible programming model
Metadata server decisions are signed & cached at clients, enforced at device
Rights and object map small relative to block allocation map
Clients can be untrusted b/c limited exposure of bugs & attacks (only authorized object data)
Cache decisions (& maps) replaced transparently -- dynamic remapping -- virtualization
SNIA Shared Storage Model, 2001
Slide 20 April 26, 2005
OSD is now an ANSI Standard
T10/SCSI ratifies OSD command set 9/27/04
Co-chaired by IBM and Seagate; the protocol is a general framework (transport independent)
www.snia.org/tech_activities/workgroups/osd & www.t10.org/ftp/t10/drafts/osd/osd-r10.pdf
KEY: Resulting standard will build options for customers looking to deploy objects
Timeline 1995-2005: CMU NASD research leads to the NSIC NASD working group and then to the SNIA/T10 OSD standard; key players include Lustre and Panasas.
Slide 22 April 26, 2005
Object Storage Systems
16-Port GE Switch Blade
Orchestrates system activity
Balances objects across OSDs
“Smart” disk for objects
2 SATA disks – 240/500 GB
Stores up to 5 TB per shelf; 4 Gbps per shelf to cluster
Disk array subsystem
E.g. LLNL with Lustre
Prototype Seagate OSD
Highly integrated, single disk
Expect wide variety of Object Storage Devices
Slide 23 April 26, 2005
BladeServer Storage Cluster
DirectorBlade
StorageBlade
Integrated GE Switch
Shelf Front: 1 DirectorBlade, 10 StorageBlades
Shelf Rear
Midplane routes GE, power
Battery Module (2 power units)
Slide 24 April 26, 2005
How Do Panasas Objects Scale?
File Comprised of:
User Data, Attributes, Layout
Scalable object map: 1. Purple OSD & object, 2. Gold OSD & object, 3. Red OSD & object; plus stripe size and RAID level
Scale capacity, bandwidth, and reliability by striping according to a small map (see the sketch below)
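A rough sketch of that idea (hypothetical layout, not the actual DirectFLOW map format): the per-file map lists the component OSDs and objects plus the stripe unit size and RAID level, and a byte offset is resolved to one component with simple arithmetic, so the map stays small no matter how large the file grows.

    # Hypothetical per-file object map: component objects plus stripe unit and RAID level.
    file_map = {
        "components": [("purple-osd", 101), ("gold-osd", 202), ("red-osd", 303)],
        "stripe_unit": 64 * 1024,   # bytes per stripe unit
        "raid_level": 0,            # illustration only; real maps may specify RAID 1 or 5
    }

    def locate(byte_offset, fmap):
        """Map a file byte offset to (OSD, object id, offset within that object)."""
        unit = fmap["stripe_unit"]
        n = len(fmap["components"])
        stripe_index, unit_offset = divmod(byte_offset, unit)
        osd, obj = fmap["components"][stripe_index % n]
        return osd, obj, (stripe_index // n) * unit + unit_offset

    print(locate(200 * 1024, file_map))   # ('purple-osd', 101, 73728)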
Slide 25 April 26, 2005
Object Storage Bandwidth
Chart: aggregate bandwidth (GB/sec, 0 to 12) versus number of Object Storage Devices (0 to 350). Scalable bandwidth demonstrated with GE switching (lab results).
Slide 26 April 26, 2005
ActiveScale SW Architecture
Diagram: ActiveScale software architecture.
DirectFLOW client (UNIX): POSIX app over the DirectFLOW installable file system with client-side RAID, local buffer cache, and VFS; OSD/iSCSI over TCP/IP to the StorageBlades.
NFS and CIFS clients (UNIX, Windows): POSIX apps via NFS and NT apps via CIFS reach the DirectorBlades.
ActiveScale DirectorBlades: ActiveScale protocol servers (NFS, NDMP, CIFS, DirectFLOW) with manager DB; realm & performance managers plus web management server; DirectFLOW virtual sub-managers (FileMgr, QuotaMgr, StorMgr); management agent, NTP + DHCP server, RPC; a DirectFLOW client with RAID and buffer cache on Linux; OSD/iSCSI over TCP/IP to the StorageBlades.
DirectFLOW StorageBlades: DFLOW fs, RAID 0, and zero-copy NVRAM cache behind OSD/iSCSI over TCP/IP.
Slide 27 April 26, 2005
DirectFLOW Client
Standard installable file system for Linux
POSIX syscalls with common performance exceptions
Reduce serialization & media updates (e.g. durable read timestamps)
Add create parameters for file width, RAID level, stripe unit size
Add open in concurrent-write mode, where the app handles consistency
Mount options
Local (NFS) mount anywhere in client, or
Global (AFS) mount at “/panfs” on all clients
Cache consistency and callbacks
Need a callback on a file in order to cache any file state
Exclusive, read-only, read-write (non cacheable), concurrent write
Callbacks are really leases, so AWOL clients lose their promises (see the sketch below)
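A tiny sketch of the lease idea (assumed semantics, simplified from this slide): a client may cache file state only while it holds an unexpired callback of a cacheable mode; once the lease runs out, an AWOL client has lost its promise.

    import time

    # Hypothetical client-side callback lease: cached state is valid only while the lease is live.
    class CallbackLease:
        def __init__(self, mode, duration_s=30.0):
            self.mode = mode                        # e.g. "exclusive", "read-only", "concurrent-write"
            self.expires = time.time() + duration_s

        def may_cache(self):
            # Only cacheable modes, and only until the callback (lease) expires.
            return self.mode in ("exclusive", "read-only") and time.time() < self.expires

    lease = CallbackLease("exclusive", duration_s=0.1)
    print(lease.may_cache())   # True while the lease is live
    time.sleep(0.2)
    print(lease.may_cache())   # False: the promise has lapsed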
Slide 28 April 26, 2005
Fault Tolerance
Overall up/down state of blades
Subset of managers track overall state with heartbeats
Maintain identical state with quorum/consensus
Per file RAID: no parity for unused capacity
RAID level per file: small files are mirrored, RAID 5 for large files (see the policy sketch below)
First step toward policy-driven quality of storage associated with data
Scalable rebuild rate
Large files striped on all blades
Recon XOR on all managers
Rebuilt files spread on all blades
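A hypothetical sketch of such a per-file RAID policy, choosing mirroring for small files and RAID 5 for large ones so parity is only computed for data that actually exists; the 64 KB threshold is my own placeholder, not a Panasas parameter.

    # Hypothetical per-file RAID policy: small files mirrored, large files RAID 5.
    SMALL_FILE_THRESHOLD = 64 * 1024   # placeholder threshold, not from the talk

    def choose_raid_level(file_size_bytes):
        return "RAID-1 (mirror)" if file_size_bytes <= SMALL_FILE_THRESHOLD else "RAID-5"

    for size in (4 * 1024, 10 * 1024 * 1024):
        print(size, "->", choose_raid_level(size))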
Slide 29 April 26, 2005
Manageable Storage Clusters
Snapshots: consistency for copying, backing up
Copy-on-write duplication of contents of objects
Named as “…/.snapshot/JulianSnapTimestamp/filename”
Snaps can be scheduled, auto-deleted
Soft volumes: grow management without physical constraints
Volumes can be quota bounded, unbounded, or just send email on threshold
Multiple volumes can share space of a set of shelves (double disk failure domain)
Capacity and load balancing: seamless use of growing set of blades
All blades track capacity & load; manager aggregates & ages utilization metrics
Unbalanced systems influence allocation; can trigger moves
Adding a blade simply makes the system unbalanced for a while (see the balancing sketch below)
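A minimal sketch of the balancing loop (the exponential aging and the selection rule are my own assumptions): each blade reports capacity and load, the manager keeps an aged utilization metric per blade, and new allocations are steered toward the least-utilized blades, so a newly added, empty blade simply attracts allocations until the system evens out.

    # Hypothetical aged-utilization tracker steering new allocations to under-used blades.
    class Balancer:
        def __init__(self, alpha=0.3):
            self.alpha = alpha   # aging factor for the running average
            self.util = {}       # blade -> aged utilization in [0, 1]

        def report(self, blade, utilization):
            old = self.util.get(blade, utilization)
            self.util[blade] = (1 - self.alpha) * old + self.alpha * utilization

        def pick_blade(self):
            # Allocate on the blade with the lowest aged utilization.
            return min(self.util, key=self.util.get)

    b = Balancer()
    for blade, u in [("sb1", 0.8), ("sb2", 0.7), ("sb3", 0.0)]:   # sb3 just added, still empty
        b.report(blade, u)
    print(b.pick_blade())   # 'sb3': the new blade fills first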
Slide 31 April 26, 2005
Scale Out 1b: Any NFS Server Will Do
Single server does all data transfer in single system image
Servers share access to all storage and “hand off” role of accessing storage
Control and data traverse mount point path (in band) passing through one server
Hand-off is cheap: the difference between 100% and 0% colocated front and back end was about zero SFS97 NFS op/s and a < 10% latency increase for Panasas NFS
Allows single file & single filesystem capacity & bandwidth to scale with server cluster
Diagram: clients on the host net may mount any server in the file server cluster; the servers share access to the disk arrays on the storage net.
Slide 32 April 26, 2005
Performance & Scalability for all Workloads
Objects: breakthrough data throughput AND random I/O
Source: SPEC.org & Panasas
Slide 34 April 26, 2005
Los Alamos Lightning
1400 nodes and 60 TB (120 TB): ability to deliver ~3 GB/s* (~6 GB/s)
Diagram: Lightning's 1400 nodes connect through a switch to 12 Panasas shelves.
Slide 35 April 26, 2005
Progress in the Field
Seismic processing, genomics sequencing, HPC simulations, post-production rendering
Slide 37 April 26, 2005
Out-of-Band Interoperability Issues
Diagram: clients running a vendor X kernel patch/RPM access vendor X file servers and storage out-of-band.
ADVANTAGES
Capacity scaling
Faster bandwidth scaling
DISADVANTAGES
Requires client kernel addition
Many non-interoperable solutions
Not necessarily able to replace NFS
EXAMPLE FEATURES
POSIX Plus & Minus
Global mount point
Fault tolerant cache coherence
RAID 0, 1, 5 & snapshots
Distributed metadata and online growth, upgrade
Slide 38 April 26, 2005
Standards Efforts: Parallel NFS (pNFS)
December 2003 U. Michigan workshop
Whitepapers at www.citi.umich.edu/NEPS
U. Michigan CITI, EMC, IBM, Johns Hopkins, Network Appliance, Panasas, Spinnaker Networks, Veritas, ZForce
NFSv4 extensions enabling clients to access data in parallel storage devices of multiple standard flavors
Parallel NFS request routing / file virtualization
Extend block or object "layout" information to clients, enabling parallel direct access
IETF pNFS Documents:
draft-gibson-pnfs-problem-statement-01.txt
draft-gibson-pnfs-reqs-00.txt
draft-welch-pnfs-ops-01.txt
Slide 39 April 26, 2005
Multiple Data Server Protocols
Inclusiveness favors success
Three (or more) flavors of out-of-band metadata attributes:
BLOCKS: SBC/FCP/FC or SBC/iSCSI ... for files built on blocks
OBJECTS: OSD/iSCSI/TCP/IP/GE for files built on objects
FILES: NFS/ONCRPC/TCP/IP/GE for files built on subfiles
Inode-level encapsulation in server and client code
Diagram: client apps run over a pNFS installable file system whose disk driver speaks 1. SBC (blocks), 2. OSD (objects), or 3. NFS (files) directly to storage; the pNFS server sits on a local filesystem; NFSv4 is extended with orthogonal "disk" metadata attributes, which the server grants to and revokes from clients.
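A schematic sketch of that "multiple data server protocols" idea (the names and data shapes are illustrative, not the NFSv4.1 wire format): the metadata server grants the client a layout of one of several flavors, and the client dispatches its direct I/O to the matching driver.

    # Hypothetical dispatch over pNFS layout flavors (blocks, objects, files).
    def read_blocks(extents, offset, length):
        return f"SBC read of {length} bytes at {offset} via extents {extents}"

    def read_objects(components, offset, length):
        return f"OSD read of {length} bytes at {offset} from components {components}"

    def read_subfiles(servers, offset, length):
        return f"NFS read of {length} bytes at {offset} from data servers {servers}"

    DISPATCH = {"blocks": read_blocks, "objects": read_objects, "files": read_subfiles}

    def read_via_layout(layout, offset, length):
        # The metadata server granted 'layout'; the client goes directly to storage with it.
        return DISPATCH[layout["type"]](layout["where"], offset, length)

    layout = {"type": "objects", "where": [("osd1", 17), ("osd2", 23)]}
    print(read_via_layout(layout, 0, 65536))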
Slide 40 April 26, 2005
Object Storage Value Proposition
Scalable Shared Storage Value Proposition
Larger surveys; more active analysis
Applications leverage full bandwidth
Simplified storage management
Results gateway
Diagram: a Linux compute cluster reaches the Panasas storage cluster over DirectFLOW (large datasets, checkpointing); analysis & visualization tools and an analysis & simulation results gateway access it over NFS and CIFS.