cloudopen - 08/29/2012

65
ceph – a unified distributed storage system sage weil cloudopen – august 29, 2012

Upload: inktank

Post on 28-Nov-2014

761 views

Category:

Technology


0 download

DESCRIPTION

Sage Weil's slides from CloudOpen - Aug 2012

TRANSCRIPT

Page 1: CloudOpen - 08/29/2012

ceph – a unified distributed storage system

sage weilcloudopen – august 29, 2012

Page 2: CloudOpen - 08/29/2012

outline

● why you should care● what is it, what it does● how it works

● architecture

● how you can use it● librados● radosgw● RBD● file system

● who we are, why we do this

Page 3: CloudOpen - 08/29/2012

why should you care about anotherstorage system?

Page 4: CloudOpen - 08/29/2012

requirements

● diverse storage needs● object storage● block devices (for VMs) with snapshots, cloning● shared file system with POSIX, coherent caches● structured data... files, block devices, or objects?

● scale● terabytes, petabytes, exabytes● heterogeneous hardware● reliability and fault tolerance

Page 5: CloudOpen - 08/29/2012

time

● ease of administration● no manual data migration, load balancing● painless scaling

● expansion and contraction● seamless migration

Page 6: CloudOpen - 08/29/2012

cost

● linear function of size or performance● incremental expansion

● no fork-lift upgrades

● no vendor lock-in● choice of hardware● choice of software

● open

Page 7: CloudOpen - 08/29/2012

what is ceph?

Page 8: CloudOpen - 08/29/2012

unified storage system

● objects● native● RESTful

● block● thin provisioning, snapshots, cloning

● file● strong consistency, snapshots

Page 9: CloudOpen - 08/29/2012

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowingapps to directlyaccess RADOS,with support forC, C++, Java,Python, Ruby,and PHP

LIBRADOS

A library allowingapps to directlyaccess RADOS,with support forC, C++, Java,Python, Ruby,and PHP

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

APPAPP APPAPP HOST/VMHOST/VM CLIENTCLIENT

Page 10: CloudOpen - 08/29/2012

open source

● LGPLv2● copyleft● ok to link to proprietary code

● no copyright assignment● no dual licensing● no “enterprise-only” feature set

● active community● commercial support

Page 11: CloudOpen - 08/29/2012

distributed storage system

● data center scale● 10s to 10,000s of machines● terabytes to exabytes

● fault tolerant● no single point of failure● commodity hardware

● self-managing, self-healing

Page 12: CloudOpen - 08/29/2012

ceph object model

● pools● 1s to 100s● independent namespaces or object collections● replication level, placement policy

● objects● bazillions● blob of data (bytes to gigabytes)● attributes (e.g., “version=12”; bytes to kilobytes)● key/value bundle (bytes to gigabytes)

Page 13: CloudOpen - 08/29/2012

why start with objects?

● more useful than (disk) blocks● names in a single flat namespace● variable size● simple API with rich semantics

● more scalable than files● no hard-to-distribute hierarchy● update semantics do not span objects● workload is trivially parallel

Page 14: CloudOpen - 08/29/2012

HUMANHUMAN COMPUTERCOMPUTER DISKDISK

DISKDISK

DISKDISK

DISKDISK

DISKDISK

DISKDISK

DISKDISK

Page 15: CloudOpen - 08/29/2012

HUMANHUMAN COMPUTERCOMPUTER DISKDISK

DISKDISK

DISKDISK

DISKDISK

DISKDISK

DISKDISK

DISKDISK

HUMANHUMAN

HUMANHUMAN

Page 16: CloudOpen - 08/29/2012

DISKDISK

DISKDISK

DISKDISK

DISKDISK

DISKDISK

DISKDISK

DISKDISK

DISKDISK

DISKDISK

DISKDISK

DISKDISK

DISKDISK

HUMANHUMAN

HUMANHUMAN

HUMANHUMAN

HUMANHUMAN

HUMANHUMAN

HUMANHUMAN

HUMANHUMANHUMANHUMAN

HUMANHUMANHUMANHUMAN

HUMANHUMAN

HUMANHUMANHUMANHUMAN

HUMANHUMAN

HUMANHUMAN

HUMANHUMAN

HUMANHUMAN

HUMANHUMAN

HUMANHUMAN

HUMANHUMAN

HUMANHUMAN

HUMANHUMAN (actually more like this…)

(COMPUTER)(COMPUTER)

Page 17: CloudOpen - 08/29/2012

DISKDISK

HUMANHUMAN

HUMANHUMAN

HUMANHUMAN

DISKDISK

DISKDISK

DISKDISK

DISKDISK

DISKDISK

DISKDISK

DISKDISK

DISKDISK

DISKDISK

DISKDISK

DISKDISK

COMPUTERCOMPUTER

COMPUTERCOMPUTER

COMPUTERCOMPUTER

COMPUTERCOMPUTER

COMPUTERCOMPUTER

COMPUTERCOMPUTER

COMPUTERCOMPUTER

COMPUTERCOMPUTER

COMPUTERCOMPUTER

COMPUTERCOMPUTER

COMPUTERCOMPUTER

COMPUTERCOMPUTER

Page 18: CloudOpen - 08/29/2012

DISK

FS

DISK DISK

OSD

DISK DISK

OSD OSD OSD OSD

FS FS FSFS btrfsxfsext4

MMM

Page 19: CloudOpen - 08/29/2012

Monitors:

• Maintain cluster membership and state

• Provide consensus for distributed decision-making

• Small, odd number

• These do not serve stored objects to clients

M

Object Storage Daemons (OSDs):• At least three in a cluster• One per disk or RAID group• Serve stored objects to clients• Intelligently peer to perform

replication tasks

Page 20: CloudOpen - 08/29/2012

M

M

M

HUMAN

Page 21: CloudOpen - 08/29/2012

data distribution

● all objects are replicated N times● objects are automatically placed, balanced, migrated

in a dynamic cluster● must consider physical infrastructure

● ceph-osds on hosts in racks in rows in data centers

● three approaches● pick a spot; remember where you put it● pick a spot; write down where you put it● calculate where to put it, where to find it

Page 22: CloudOpen - 08/29/2012

CRUSH• Pseudo-random placement

algorithm

• Fast calculation, no lookup

• Repeatable, deterministic

• Ensures even distribution

• Stable mapping

• Limited data migration

• Rule-based configuration

• specifiable replication

• infrastructure topology aware

• allows weighting

Page 23: CloudOpen - 08/29/2012

10 10 01 01 10 10 01 11 01 1010 10 01 01 10 10 01 11 01 10

1010 1010 0101 0101 1010 1010 0101 1111 0101 1010

hash(object name) % num pg

CRUSH(pg, cluster state, policy)

Page 24: CloudOpen - 08/29/2012

10 10 01 01 10 10 01 11 01 1010 10 01 01 10 10 01 11 01 10

1010 1010 0101 0101 1010 1010 0101 1111 0101 1010

Page 25: CloudOpen - 08/29/2012

RADOS

● monitors publish osd map that describes cluster state● ceph-osd node status (up/down, weight, IP)● CRUSH function specifying desired data distribution

● object storage daemons (OSDs)● safely replicate and store object● migrate data as the cluster changes over time● coordinate based on shared view of reality

● decentralized, distributed approach allows● massive scales (10,000s of servers or more)● the illusion of a single copy with consistent behavior

M

Page 26: CloudOpen - 08/29/2012

CLIENTCLIENT

??

Page 27: CloudOpen - 08/29/2012
Page 28: CloudOpen - 08/29/2012
Page 29: CloudOpen - 08/29/2012

CLIENT

??

Page 30: CloudOpen - 08/29/2012

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowingapps to directlyaccess RADOS,with support forC, C++, Java,Python, Ruby,and PHP

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

APPAPP APPAPP HOST/VMHOST/VM CLIENTCLIENT

Page 31: CloudOpen - 08/29/2012

LIBRADOSLIBRADOS

MM

MM

MM

APPAPP

native

Page 32: CloudOpen - 08/29/2012

LLLIBRADOS

• Provides direct access to RADOS for applications

• C, C++, Python, PHP, Java• No HTTP overhead

Page 33: CloudOpen - 08/29/2012

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowingapps to directlyaccess RADOS,with support forC, C++, Java,Python, Ruby,and PHP

LIBRADOS

A library allowingapps to directlyaccess RADOS,with support forC, C++, Java,Python, Ruby,and PHP

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

APPAPP APPAPP HOST/VMHOST/VM CLIENTCLIENT

Page 34: CloudOpen - 08/29/2012

MM

MM

MM

LIBRADOSLIBRADOS

RADOSGWRADOSGW

APPAPP

native

REST

LIBRADOSLIBRADOS

RADOSGWRADOSGW

APPAPP

Page 35: CloudOpen - 08/29/2012

RADOS Gateway:• REST-based interface to

RADOS• Supports buckets,

accounting• Compatible with S3 and

Swift applications

Page 36: CloudOpen - 08/29/2012

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowingapps to directlyaccess RADOS,with support forC, C++, Java,Python, Ruby,and PHP

LIBRADOS

A library allowingapps to directlyaccess RADOS,with support forC, C++, Java,Python, Ruby,and PHP

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

APPAPP APPAPP HOST/VMHOST/VM CLIENTCLIENT

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

Page 37: CloudOpen - 08/29/2012

DISKDISK

COMPUTERCOMPUTER

DISKDISK

DISKDISK

DISKDISK

DISKDISK

DISKDISK

DISKDISK

DISKDISK

DISKDISK

DISKDISK

DISKDISK

DISKDISK

COMPUTERCOMPUTER

COMPUTERCOMPUTER

COMPUTERCOMPUTER

COMPUTERCOMPUTER

COMPUTERCOMPUTER

COMPUTERCOMPUTER

COMPUTERCOMPUTER

COMPUTERCOMPUTER

COMPUTERCOMPUTER

COMPUTERCOMPUTER

COMPUTERCOMPUTER

COMPUTERCOMPUTER

DISKDISK

Page 38: CloudOpen - 08/29/2012

DISKDISK

DISKDISK

DISKDISK

DISKDISK

DISKDISK

DISKDISK

DISKDISK

DISKDISK

DISKDISK

DISKDISK

DISKDISK

DISKDISK

COMPUTERCOMPUTER

COMPUTERCOMPUTER

COMPUTERCOMPUTER

COMPUTERCOMPUTER

COMPUTERCOMPUTER

COMPUTERCOMPUTER

COMPUTERCOMPUTER

COMPUTERCOMPUTER

COMPUTERCOMPUTER

COMPUTERCOMPUTER

COMPUTERCOMPUTER

COMPUTERCOMPUTER

VMVM

VMVM

VMVM

Page 39: CloudOpen - 08/29/2012

MM

MM

MM

VMVM

LIBRADOSLIBRADOS

LIBRBDLIBRBD

VIRTUALIZATION CONTAINERVIRTUALIZATION CONTAINER

Page 40: CloudOpen - 08/29/2012

LIBRADOSLIBRADOS

MM

MM

MM

LIBRBDLIBRBD

CONTAINERCONTAINER

LIBRADOSLIBRADOS

LIBRBDLIBRBD

CONTAINERCONTAINERVMVM

Page 41: CloudOpen - 08/29/2012

LIBRADOSLIBRADOS

MM

MM

MM

KRBD (KERNEL MODULE)KRBD (KERNEL MODULE)

HOSTHOST

Page 42: CloudOpen - 08/29/2012

RADOS Block Device:• Storage of virtual disks in RADOS• Decouples VMs and containers

• Live migration!• Images are striped across the cluster• Snapshots!• Support in

• Qemu/KVM

• OpenStack, CloudStack

• Mainline Linux kernel

Page 43: CloudOpen - 08/29/2012

HOW DO YOU

SPIN UP

THOUSANDS OF VMs

INSTANTLY

AND

EFFICIENTLY?

Page 44: CloudOpen - 08/29/2012

144 0 0 0 0 = 144

instant copy

Page 45: CloudOpen - 08/29/2012

4144

CLIENT

write

write

write

= 148

write

Page 46: CloudOpen - 08/29/2012

4144

CLIENTread

read

read

= 148

Page 47: CloudOpen - 08/29/2012

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowingapps to directlyaccess RADOS,with support forC, C++, Java,Python, Ruby,and PHP

LIBRADOS

A library allowingapps to directlyaccess RADOS,with support forC, C++, Java,Python, Ruby,and PHP

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

APPAPP APPAPP HOST/VMHOST/VM CLIENTCLIENT

Page 48: CloudOpen - 08/29/2012

MM

MM

MM

CLIENTCLIENT

0110

0110

datametadata

Page 49: CloudOpen - 08/29/2012

MM

MM

MM

Page 50: CloudOpen - 08/29/2012

Metadata Server• Manages metadata for a

POSIX-compliant shared filesystem• Directory hierarchy• File metadata (owner,

timestamps, mode, etc.)• Stores metadata in RADOS• Does not serve file data to

clients• Only required for shared

filesystem

Page 51: CloudOpen - 08/29/2012

one tree

three metadata servers

??

Page 52: CloudOpen - 08/29/2012
Page 53: CloudOpen - 08/29/2012
Page 54: CloudOpen - 08/29/2012
Page 55: CloudOpen - 08/29/2012
Page 56: CloudOpen - 08/29/2012

DYNAMIC SUBTREE PARTITIONING

Page 57: CloudOpen - 08/29/2012

recursive accounting

● ceph-mds tracks recursive directory stats● file sizes ● file and directory counts● modification time

● virtual xattrs present full stats● efficient

$ ls ­alSh | headtotal 0drwxr­xr­x 1 root            root      9.7T 2011­02­04 15:51 .drwxr­xr­x 1 root            root      9.7T 2010­12­16 15:06 ..drwxr­xr­x 1 pomceph         pg4194980 9.6T 2011­02­24 08:25 pomcephdrwxr­xr­x 1 mcg_test1       pg2419992  23G 2011­02­02 08:57 mcg_test1drwx­­x­­­ 1 luko            adm        19G 2011­01­21 12:17 lukodrwx­­x­­­ 1 eest            adm        14G 2011­02­04 16:29 eestdrwxr­xr­x 1 mcg_test2       pg2419992 3.0G 2011­02­02 09:34 mcg_test2drwx­­x­­­ 1 fuzyceph        adm       1.5G 2011­01­18 10:46 fuzycephdrwxr­xr­x 1 dallasceph      pg275     596M 2011­01­14 10:06 dallasceph

Page 58: CloudOpen - 08/29/2012

snapshots

● volume or subvolume snapshots unusable at petabyte scale● snapshot arbitrary subdirectories

● simple interface● hidden '.snap' directory● no special tools

$ mkdir foo/.snap/one # create snapshot$ ls foo/.snapone$ ls foo/bar/.snap_one_1099511627776 # parent's snap name is mangled$ rm foo/myfile$ ls -F foobar/$ ls -F foo/.snap/onemyfile bar/$ rmdir foo/.snap/one # remove snapshot

Page 59: CloudOpen - 08/29/2012

multiple protocols, implementations

● Linux kernel client● mount -t ceph 1.2.3.4:/ /mnt● export (NFS), Samba (CIFS)

● ceph-fuse● libcephfs.so

● your app● Samba (CIFS)● Ganesha (NFS)● Hadoop (map/reduce) kernel

libcephfs

ceph fuseceph-fuse

your app

libcephfsSamba

libcephfsGanesha

NFS SMB/CIFS

libcephfsHadoop

Page 60: CloudOpen - 08/29/2012

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowingapps to directlyaccess RADOS,with support forC, C++, Java,Python, Ruby,and PHP

LIBRADOS

A library allowingapps to directlyaccess RADOS,with support forC, C++, Java,Python, Ruby,and PHP

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

APPAPP APPAPP HOST/VMHOST/VM CLIENTCLIENT

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

NEARLYAWESOME

AWESOMEAWESOME

AWESOME

AWESOME

Page 61: CloudOpen - 08/29/2012

why we do this

● limited options for scalable open source storage ● proprietary solutions

● expensive● don't scale (well or out)● marry hardware and software

● industry needs to change

Page 62: CloudOpen - 08/29/2012

who we are

● Ceph created at UC Santa Cruz (2007)● supported by DreamHost (2008-2011)● Inktank (2012)

● Los Angeles, Sunnyvale, San Francisco, remote

● growing user and developer community● Linux distros, users, cloud stacks, SIs, OEMs

http://ceph.com/

Page 63: CloudOpen - 08/29/2012

thanks

BoF tonight @ 5:15

sage weil

[email protected]

@liewegas

http://github.com/ceph

http://ceph.com/

Page 65: CloudOpen - 08/29/2012

why we like btrfs

● pervasive checksumming● snapshots, copy-on-write● efficient metadata (xattrs)● inline data for small files● transparent compression● integrated volume management

● software RAID, mirroring, error recovery● SSD-aware

● online fsck● active development community