TRANSCRIPT
May 4, 2005
Storage Aggregation for Performance & Availability:
The Path from Physical RAID to Virtual Objects
Univ Minnesota, Third Intelligent Storage Workshop
Garth Gibson, CTO, Panasas, and Assoc. Prof., CMU
Slide 2 April 26, 2005
Cluster Computing: The New Wave
Traditional Computing:
– Monolithic Computers, Monolithic Storage
– Single data path
– Issues: complex scaling, limited BW & I/O, islands of storage, inflexible, expensive
– Scale to bigger box
Cluster Computing:
– Linux Compute Cluster
– Parallel data paths
– But lower $ per Gbps
– Scale: file & total BW, file & total capacity, load & capacity balancing
Slide 3 April 26, 2005
Panasas: Next Gen Cluster Storage
Scalable performance
Parallel data paths to compute nodes
Scale clients, network and capacity
As capacity grows, performance grows
Simplified and dynamic management
Robust, shared file access by many clients
Seamless growth within single namespace eliminates time-consuming admin tasks
Integrated HW/SW solution
Optimizes performance and manageability
Ease of integration and support
ActiveScale Storage Cluster
Diagram: a Compute Cluster connects over parallel data paths to Object Storage Devices and over a control path to Metadata Managers. Single step: perform the job directly from the high-I/O Panasas Storage Cluster.
Slide 5 April 26, 2005
Birth of RAID (1986-1991)
Member of 4th Berkeley RISC CPU design team (SPUR: 84-89)
Dave Patterson decides CPU design is a “solved” problem
Sends me to figure out how storage plays in SYSTEM PERFORMANCE
IBM 3380 disk is 4 arms in a 7.5 GB washing machine box
SLED: Single Large Expensive Disk
New PC industry demands cost-effective 100 MB 3.5" disks
Enabled by new SCSI embedded-controller architecture
Use many PC disks for parallelism (SIGMOD '88: A Case for RAID)
P.S. $10-20 per MB (~1000X now), 100 MB/arm (~1000X now), 20-30 IO/sec/arm (5X now)
Slide 6 April 26, 2005
But RAID is really about Availability
Arrays have more Hard Disk Assemblies (HDAs) -- more failures
Apply replication and/or error/erasure detection codes
Mirroring wastes 50% of space; RAID wastes only 1/N
Mirroring halves, and RAID 5 quarters, small-write bandwidth (sketched below)
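To make that small-write penalty concrete, here is a minimal sketch (my own illustration, not code from the talk) of RAID 5 parity arithmetic: a small write reads the old data and old parity, XORs both with the new data to form the new parity, then writes data and parity, i.e. four I/Os where a mirrored write needs two.

    # Hypothetical illustration of RAID 5 parity and the small-write update rule.
    def xor_blocks(*blocks):
        """XOR equal-length byte blocks."""
        out = bytearray(len(blocks[0]))
        for b in blocks:
            for i, byte in enumerate(b):
                out[i] ^= byte
        return bytes(out)

    data = [bytes([d]) * 4 for d in (1, 2, 3)]   # one stripe: three data blocks
    parity = xor_blocks(*data)                   # plus one parity block

    # Small write to data[1]: read old data and old parity (2 reads),
    # compute new parity, write new data and new parity (2 writes) = 4 I/Os.
    new_block = bytes([9]) * 4
    parity = xor_blocks(parity, data[1], new_block)   # new parity = old parity XOR old data XOR new data
    data[1] = new_block
    assert parity == xor_blocks(*data)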
Slide 7 April 26, 2005
Off to CMU & More Availability
Parity Declustering “spreads RAID groups” to reduce MTTR
Each parity disk block protects fewer than all (C) data disk blocks
Virtualizing RAID group lessens recovery work
Faster recovery or better user response time during recovery or mixture of both
RAID over X?
X = Independent fault domains
“Disk” is easiest “X”
Parity declustering is the first step in RAID virtualization (see the arithmetic sketch below)
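As a back-of-the-envelope illustration (the (G-1)/(C-1) figure is the standard parity-declustering arithmetic, not a number from this slide): with parity groups of size G declustered over C > G disks, each surviving disk reads only about (G-1)/(C-1) of its contents to rebuild a failed disk, which is where the faster recovery or better user response time comes from.

    # Assumed declustering arithmetic: fraction of each surviving disk read during rebuild.
    def recovery_fraction(group_size, num_disks):
        return (group_size - 1) / (num_disks - 1)

    print(recovery_fraction(10, 10))   # conventional RAID 5 over 10 disks: 1.0 (read every survivor fully)
    print(recovery_fraction(5, 21))    # declustered, G=5 over C=21 disks: 0.2 (20% of each survivor)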
Slide 9 April 26, 2005
Storage Interconnect Evolution
Outboard circuitry increases over time (VLSI density)
Hardware (#hosts, #disks, #paths) sharing increases over time
Logical (information) sharing limited by host SW
1995: Fibre Channel packetizes SCSI over a near-general network
Packetized SCSI will go over any network (iSCSI)
What do we do next with device intelligence?
Slide 10 April 26, 2005
Storage as First Class Network Citizen
Direct transfer between client and storage
Exploit scalable switched cluster area networking
Split file service into: primitives (in drive) and policies (in manager)
Slide 11 April 26, 2005
NASD Architecture
Before NASD there were store-and-forward Server-Attached Disks (SAD)
Move access control and consistency out-of-band, and cache those decisions
Raise storage abstraction: encapsulate layout, offload data access
Slide 12 April 26, 2005
Fine Grain Access Enforcement
Diagram: NASD capability protocol among the client, file manager, and NASD drive; the file manager and the drive share a Secret Key, step 2 travels over private communication, and the MACs provide NASD integrity/privacy.
1: Client to file manager: request for access
2: File manager to client: CapArgs, CapKey, where CapArgs = ObjID, Version, Rights, Expiry, ... and CapKey = MAC_SecretKey(CapArgs)
3: Client to NASD: CapArgs, Req, NonceIn, ReqMAC, where ReqMAC = MAC_CapKey(Req, NonceIn)
4: NASD to client: Reply, NonceOut, ReplyMAC, where ReplyMAC = MAC_CapKey(Reply, NonceOut)
State of the art is a VPN of all out-of-band clients and all sharable data and metadata
Accident prone & vulnerable to subverted client; analogy to single-address space computing
Object Storage uses a digitally signed, object-specific capability on each request (see the sketch below)
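A minimal sketch of that capability scheme, assuming HMAC-SHA1 as the MAC (the real NASD/OSD MAC construction may differ): the manager and the drive share SecretKey, the manager derives CapKey = MAC_SecretKey(CapArgs) and hands it to the client, and the drive can then verify every request without contacting the manager.

    import hashlib, hmac

    def mac(key, *parts):
        return hmac.new(key, b"|".join(parts), hashlib.sha1).digest()

    secret_key = b"shared by file manager and NASD drive"   # illustrative value

    # Step 2: the manager builds CapArgs and CapKey for the client.
    cap_args = b"ObjID=42,Version=7,Rights=read,Expiry=2005-12-31"
    cap_key = mac(secret_key, cap_args)

    # Step 3: the client signs its request with CapKey.
    req, nonce_in = b"READ bytes 0-65535", b"nonce-123"
    req_mac = mac(cap_key, req, nonce_in)

    # The drive recomputes CapKey from CapArgs and checks ReqMAC before serving the request.
    assert mac(mac(secret_key, cap_args), req, nonce_in) == req_mac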
Slide 14 April 26, 2005
Today’s Ubiquitous NFS
ADVANTAGES
Familiar, stable & reliable
Widely supported by vendors
Competitive market
DISADVANTAGES
Capacity doesn’t scale
Bandwidth doesn’t scale
Cluster by customer-exposed namespace partitioning
Diagram: clients on the host net mount file servers, each exporting a sub-file system backed by disk arrays on the storage net.
Slide 15 April 26, 2005
Scale Out 1: Forwarding NFS Servers
Bind many file servers into single system image with forwarding
Mount point binding less relevant, allows DNS-style balancing, more manageable
Control and data traverse mount point path (in band) passing through two servers
Single file and single file system bandwidth limited by backend server & storage
Tricord, Spinnaker
Diagram: clients on the host net mount a file server cluster; requests are forwarded between servers before reaching the disk arrays on the storage net.
Slide 16 April 26, 2005
Scale Out File Service w/ Out-of-Band
Client sees many storage addresses, accesses in parallel
Zero file servers in data path allows high bandwidth thru scalable networking
Mostly built on block-based SANs where servers trust all clients
E.g.: IBM SanFS, IBM GPFS, EMC HighRoad, Sun QFS, SGI CXFS, etc
Beginning to be built on Object storage (SANs too)
E.g. Panasas DirectFlow, Lustre,
Diagram: clients access storage devices directly and in parallel, with the file servers out of the data path.
Slide 18 April 26, 2005
Object Storage Architecture
An evolutionary improvement to standard SCSI storage interface (OSD)
Offload most data path work from server to intelligent storage
Finer granularity of security: protect & manage one file at a time
Raises level of abstraction: Object is container for “related” data
Storage understands how the blocks of a "file" are related -> self-management
Per-object extensible attributes: a key expansion of storage functionality
Block-Based Disk vs. Object-Based Disk (Source: Intel)
              Block-Based Disk           Object-Based Disk
Operations:   Read block, Write block    Create, Delete, Read, Write object
Addressing:   Block range                [object, byte range]
Allocation:   External                   Internal
Security:     At Volume level            At Object level
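A toy sketch (my own, not the T10 OSD interface) contrasting the two columns above: an object-based device exposes create/delete/read/write on [object, byte range] and allocates space internally, while a block device only reads and writes externally allocated block ranges.

    # Hypothetical in-memory object storage device illustrating the OSD-style interface.
    class ToyOSD:
        def __init__(self):
            self.objects = {}      # object id -> bytearray; allocation is internal to the device
            self.next_id = 1

        def create(self):
            oid, self.next_id = self.next_id, self.next_id + 1
            self.objects[oid] = bytearray()
            return oid

        def delete(self, oid):
            del self.objects[oid]

        def write(self, oid, offset, data):
            obj = self.objects[oid]
            if len(obj) < offset + len(data):
                obj.extend(b"\0" * (offset + len(data) - len(obj)))   # grow the object as needed
            obj[offset:offset + len(data)] = data

        def read(self, oid, offset, length):
            return bytes(self.objects[oid][offset:offset + length])

    osd = ToyOSD()
    oid = osd.create()
    osd.write(oid, 0, b"hello object world")
    print(osd.read(oid, 6, 6))   # b'object'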
Slide 19 April 26, 2005
Why I Like Objects
Objects are flexible, autonomous
Variable-length data with layout metadata encapsulated
Share one access control decision
Amortize object metadata
Defer allocation, smarter caching
With extensible attributes
E.g. size, timestamps, ACLs, +
Some updated inline by device
Extensible programming model
Metadata server decisions are signed & cached at clients, enforced at device
Rights and object map small relative to block allocation map
Clients can be untrusted b/c limited exposure of bugs & attacks (only authorized object data)
Cache decisions (& maps) replaced transparently -- dynamic remapping -- virtualization
SNIA Shared Storage Model, 2001
Slide 20 April 26, 2005
OSD is now an ANSI Standard
T10/SCSI ratifies OSD command set 9/27/04
Co-chaired by IBM and Seagate; the protocol is a general framework (transport independent)
www.snia.org/tech_activities/workgroups/osd & www.t10.org/ftp/t10/drafts/osd/osd-r10.pdf
KEY: Resulting standard will build options for customers looking to deploy objects
Timeline 1995-2005: CMU NASD research leads to the NSIC NASD working group and then to the SNIA/T10 OSD standard; key players include Lustre and Panasas.
Slide 22 April 26, 2005
Object Storage Systems
16-Port GE Switch Blade
Orchestrates system activity
Balances objects across OSDs
“Smart” disk for objects
2 SATA disks – 240/500 GB
Stores up to 5 TB per shelf; 4 Gbps per shelf to cluster
Disk array subsystem
E.g. LLNL with Lustre
Prototype Seagate OSD
Highly integrated, single disk
Expect wide variety of Object Storage Devices
Slide 23 April 26, 2005
BladeServer Storage Cluster
DirectorBlade
StorageBlade
Integrated GE Switch
Shelf Front: 1 DirectorBlade, 10 StorageBlades
Shelf Rear
Midplane routes GE, power
Battery Module (2 power units)
Slide 24 April 26, 2005
How Do Panasas Objects Scale?
File Comprised of:
User Data, Attributes, Layout
Scalable object map: 1. Purple OSD & object, 2. Gold OSD & object, 3. Red OSD & object; plus stripe size and RAID level
Scale capacity, bandwidth, and reliability by striping according to a small map (see the sketch below)
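A rough sketch of that idea (hypothetical layout, not the actual DirectFLOW map format): the per-file map lists the component OSDs and objects plus the stripe unit size and RAID level, and a byte offset is resolved to one component with simple arithmetic, so the map stays small no matter how large the file grows.

    # Hypothetical per-file object map: component objects plus stripe unit and RAID level.
    file_map = {
        "components": [("purple-osd", 101), ("gold-osd", 202), ("red-osd", 303)],
        "stripe_unit": 64 * 1024,   # bytes per stripe unit
        "raid_level": 0,            # illustration only; real maps may specify RAID 1 or 5
    }

    def locate(byte_offset, fmap):
        """Map a file byte offset to (OSD, object id, offset within that object)."""
        unit = fmap["stripe_unit"]
        n = len(fmap["components"])
        stripe_index, unit_offset = divmod(byte_offset, unit)
        osd, obj = fmap["components"][stripe_index % n]
        return osd, obj, (stripe_index // n) * unit + unit_offset

    print(locate(200 * 1024, file_map))   # ('purple-osd', 101, 73728)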
Slide 25 April 26, 2005
Object Storage Bandwidth
Chart: aggregate bandwidth (GB/sec, 0 to 12) versus number of Object Storage Devices (0 to 350). Scalable bandwidth demonstrated with GE switching (lab results).
Slide 26 April 26, 2005
ActiveScale SW Architecture
Diagram: ActiveScale software architecture.
DirectFLOW client (UNIX): POSIX app over the DirectFLOW installable file system with client-side RAID, local buffer cache, and VFS; OSD/iSCSI over TCP/IP to the StorageBlades.
NFS and CIFS clients (UNIX, Windows): POSIX apps via NFS and NT apps via CIFS reach the DirectorBlades.
ActiveScale DirectorBlades: ActiveScale protocol servers (NFS, NDMP, CIFS, DirectFLOW) with manager DB; realm & performance managers plus web management server; DirectFLOW virtual sub-managers (FileMgr, QuotaMgr, StorMgr); management agent, NTP + DHCP server, RPC; a DirectFLOW client with RAID and buffer cache on Linux; OSD/iSCSI over TCP/IP to the StorageBlades.
DirectFLOW StorageBlades: DFLOW fs, RAID 0, and zero-copy NVRAM cache behind OSD/iSCSI over TCP/IP.
Slide 27 April 26, 2005
DirectFLOW Client
Standard installable file system for Linux
POSIX syscalls with common performance exceptions
Reduce serialization & media updates (e.g. durable read timestamps)
Add create parameters for file width, RAID level, stripe unit size
Add open in concurrent-write mode, where the app handles consistency
Mount options
Local (NFS) mount anywhere in client, or
Global (AFS) mount at “/panfs” on all clients
Cache consistency and callbacks
Need a callback on a file in order to cache any file state
Exclusive, read-only, read-write (non cacheable), concurrent write
Callbacks are really leases, so AWOL clients lose their promises (see the sketch below)
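A tiny sketch of the lease idea (assumed semantics, simplified from this slide): a client may cache file state only while it holds an unexpired callback of a cacheable mode; once the lease runs out, an AWOL client has lost its promise.

    import time

    # Hypothetical client-side callback lease: cached state is valid only while the lease is live.
    class CallbackLease:
        def __init__(self, mode, duration_s=30.0):
            self.mode = mode                        # e.g. "exclusive", "read-only", "concurrent-write"
            self.expires = time.time() + duration_s

        def may_cache(self):
            # Only cacheable modes, and only until the callback (lease) expires.
            return self.mode in ("exclusive", "read-only") and time.time() < self.expires

    lease = CallbackLease("exclusive", duration_s=0.1)
    print(lease.may_cache())   # True while the lease is live
    time.sleep(0.2)
    print(lease.may_cache())   # False: the promise has lapsed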
Slide 28 April 26, 2005
Fault Tolerance
Overall up/down state of blades
Subset of managers track overall state with heartbeats
Maintain identical state with quorum/consensus
Per file RAID: no parity for unused capacity
RAID level per file: small files are mirrored, RAID 5 for large files (see the policy sketch below)
First step toward policy-driven quality of storage associated with data
Scalable rebuild rate
Large files striped on all blades
Recon XOR on all managers
Rebuilt files spread on all blades
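A hypothetical sketch of such a per-file RAID policy, choosing mirroring for small files and RAID 5 for large ones so parity is only computed for data that actually exists; the 64 KB threshold is my own placeholder, not a Panasas parameter.

    # Hypothetical per-file RAID policy: small files mirrored, large files RAID 5.
    SMALL_FILE_THRESHOLD = 64 * 1024   # placeholder threshold, not from the talk

    def choose_raid_level(file_size_bytes):
        return "RAID-1 (mirror)" if file_size_bytes <= SMALL_FILE_THRESHOLD else "RAID-5"

    for size in (4 * 1024, 10 * 1024 * 1024):
        print(size, "->", choose_raid_level(size))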
Slide 29 April 26, 2005
Manageable Storage Clusters
Snapshots: consistency for copying, backing up
Copy-on-write duplication of contents of objects
Named as “…/.snapshot/JulianSnapTimestamp/filename”
Snaps can be scheduled, auto-deleted
Soft volumes: grow management without physical constraints
Volumes can be quota bounded, unbounded, or just send email on threshold
Multiple volumes can share space of a set of shelves (double disk failure domain)
Capacity and load balancing: seamless use of growing set of blades
All blades track capacity & load; manager aggregates & ages utilization metrics
Unbalanced systems influence allocation; can trigger moves
Adding a blade simply makes the system unbalanced for a while (see the balancing sketch below)
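A minimal sketch of the balancing loop (the exponential aging and the selection rule are my own assumptions): each blade reports capacity and load, the manager keeps an aged utilization metric per blade, and new allocations are steered toward the least-utilized blades, so a newly added, empty blade simply attracts allocations until the system evens out.

    # Hypothetical aged-utilization tracker steering new allocations to under-used blades.
    class Balancer:
        def __init__(self, alpha=0.3):
            self.alpha = alpha   # aging factor for the running average
            self.util = {}       # blade -> aged utilization in [0, 1]

        def report(self, blade, utilization):
            old = self.util.get(blade, utilization)
            self.util[blade] = (1 - self.alpha) * old + self.alpha * utilization

        def pick_blade(self):
            # Allocate on the blade with the lowest aged utilization.
            return min(self.util, key=self.util.get)

    b = Balancer()
    for blade, u in [("sb1", 0.8), ("sb2", 0.7), ("sb3", 0.0)]:   # sb3 just added, still empty
        b.report(blade, u)
    print(b.pick_blade())   # 'sb3': the new blade fills first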
Slide 31 April 26, 2005
Scale Out 1b: Any NFS Server Will Do
Single server does all data transfer in single system image
Servers share access to all storage and “hand off” role of accessing storage
Control and data traverse mount point path (in band) passing through one server
Hand-off is cheap: the difference between 100% and 0% colocated front and back end was about zero SFS97 NFS op/s and a < 10% latency increase for Panasas NFS
Allows single file & single filesystem capacity & bandwidth to scale with server cluster
Diagram: clients on the host net may mount any server in the file server cluster; the servers share access to the disk arrays on the storage net.
Slide 32 April 26, 2005
Performance & Scalability for all Workloads
Objects: breakthrough data throughput AND random I/O
Source: SPEC.org & Panasas
Slide 34 April 26, 2005
Los Alamos Lightning
1400 nodes and 60 TB (120 TB): ability to deliver ~3 GB/s* (~6 GB/s)
Diagram: Lightning's 1400 nodes connect through a switch to 12 Panasas shelves.
Slide 35 April 26, 2005
Progress in the Field
Seismic processing, genomics sequencing, HPC simulations, post-production rendering
Slide 37 April 26, 2005
Out-of-Band Interoperability Issues
Diagram: clients running a vendor X kernel patch/RPM access vendor X file servers and storage out-of-band.
ADVANTAGES
Capacity scaling
Faster bandwidth scaling
DISADVANTAGES
Requires client kernel addition
Many non-interoperable solutions
Not necessarily able to replace NFS
EXAMPLE FEATURES
POSIX Plus & Minus
Global mount point
Fault tolerant cache coherence
RAID 0, 1, 5 & snapshots
Distributed metadata and online growth, upgrade
Slide 38 April 26, 2005
Standards Efforts: Parallel NFS (pNFS)
December 2003 U. Michigan workshop
Whitepapers at www.citi.umich.edu/NEPS
U. Michigan CITI, EMC, IBM, Johns Hopkins, Network Appliance, Panasas, Spinnaker Networks, Veritas, ZForce
NFSv4 extensions enabling clients to access data in parallel storage devices of multiple standard flavors
Parallel NFS request routing / file virtualization
Extend block or object "layout" information to clients, enabling parallel direct access
IETF pNFS Documents:
draft-gibson-pnfs-problem-statement-01.txt
draft-gibson-pnfs-reqs-00.txt
draft-welch-pnfs-ops-01.txt
Slide 39 April 26, 2005
Multiple Data Server Protocols
Inclusiveness favors success
Three (or more) flavors of out-of-band metadata attributes:
BLOCKS: SBC/FCP/FC or SBC/iSCSI ... for files built on blocks
OBJECTS: OSD/iSCSI/TCP/IP/GE for files built on objects
FILES: NFS/ONCRPC/TCP/IP/GE for files built on subfiles
Inode-level encapsulation in server and client code
Diagram: client apps run over a pNFS installable file system whose disk driver speaks 1. SBC (blocks), 2. OSD (objects), or 3. NFS (files) directly to storage; the pNFS server sits on a local filesystem; NFSv4 is extended with orthogonal "disk" metadata attributes, which the server grants to and revokes from clients.
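A schematic sketch of that "multiple data server protocols" idea (the names and data shapes are illustrative, not the NFSv4.1 wire format): the metadata server grants the client a layout of one of several flavors, and the client dispatches its direct I/O to the matching driver.

    # Hypothetical dispatch over pNFS layout flavors (blocks, objects, files).
    def read_blocks(extents, offset, length):
        return f"SBC read of {length} bytes at {offset} via extents {extents}"

    def read_objects(components, offset, length):
        return f"OSD read of {length} bytes at {offset} from components {components}"

    def read_subfiles(servers, offset, length):
        return f"NFS read of {length} bytes at {offset} from data servers {servers}"

    DISPATCH = {"blocks": read_blocks, "objects": read_objects, "files": read_subfiles}

    def read_via_layout(layout, offset, length):
        # The metadata server granted 'layout'; the client goes directly to storage with it.
        return DISPATCH[layout["type"]](layout["where"], offset, length)

    layout = {"type": "objects", "where": [("osd1", 17), ("osd2", 23)]}
    print(read_via_layout(layout, 0, 65536))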
Slide 40 April 26, 2005
Object Storage Value Proposition
Scalable Shared Storage Value Proposition
Larger surveys; more active analysis
Applications leverage full bandwidth
Simplified storage management
Results gateway
Diagram: a Linux compute cluster reaches the Panasas storage cluster over DirectFLOW (large datasets, checkpointing); analysis & visualization tools and an analysis & simulation results gateway access it over NFS and CIFS.