![Page 1: Ceph Block Devices: A Deep Divevideos.cdn.redhat.com/summit2015/presentations/... · Ceph Motivating Principles All components must scale horizontally There can be no single point](https://reader034.vdocuments.site/reader034/viewer/2022043005/5f8c470e59866206986d8fd3/html5/thumbnails/1.jpg)
Ceph Block Devices: A Deep Dive
Josh Durgin, RBD Lead
June 24, 2015
![Page 2: Ceph Block Devices: A Deep Divevideos.cdn.redhat.com/summit2015/presentations/... · Ceph Motivating Principles All components must scale horizontally There can be no single point](https://reader034.vdocuments.site/reader034/viewer/2022043005/5f8c470e59866206986d8fd3/html5/thumbnails/2.jpg)
Ceph Motivating Principles
● All components must scale horizontally
● There can be no single point of failure
● The solution must be hardware agnostic
● Should use commodity hardware
● Self-manage wherever possible
● Open Source (LGPL)
● Move beyond legacy approaches
  – client/cluster instead of client/server
  – Ad hoc HA
![Page 3: Ceph Block Devices: A Deep Divevideos.cdn.redhat.com/summit2015/presentations/... · Ceph Motivating Principles All components must scale horizontally There can be no single point](https://reader034.vdocuments.site/reader034/viewer/2022043005/5f8c470e59866206986d8fd3/html5/thumbnails/3.jpg)
Ceph Components
● RGW: A web services gateway for object storage, compatible with S3 and Swift
● LIBRADOS: A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
● RADOS: A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
● RBD: A reliable, fully-distributed block device with cloud platform integration
● CEPHFS: A distributed file system with POSIX semantics and scale-out metadata management
[Diagram labels: APP, HOST/VM, CLIENT]
![Page 5: Ceph Block Devices: A Deep Divevideos.cdn.redhat.com/summit2015/presentations/... · Ceph Motivating Principles All components must scale horizontally There can be no single point](https://reader034.vdocuments.site/reader034/viewer/2022043005/5f8c470e59866206986d8fd3/html5/thumbnails/5.jpg)
Storing Virtual Disks
[Diagram: a VM on a hypervisor, which uses LIBRBD to reach the RADOS cluster and its monitors (M).]
![Page 6: Ceph Block Devices: A Deep Divevideos.cdn.redhat.com/summit2015/presentations/... · Ceph Motivating Principles All components must scale horizontally There can be no single point](https://reader034.vdocuments.site/reader034/viewer/2022043005/5f8c470e59866206986d8fd3/html5/thumbnails/6.jpg)
Kernel Module
[Diagram: a Linux host, which uses KRBD to reach the RADOS cluster and its monitors (M).]
![Page 7: Ceph Block Devices: A Deep Divevideos.cdn.redhat.com/summit2015/presentations/... · Ceph Motivating Principles All components must scale horizontally There can be no single point](https://reader034.vdocuments.site/reader034/viewer/2022043005/5f8c470e59866206986d8fd3/html5/thumbnails/7.jpg)
RBD
● Stripe images across entire cluster (pool)
● Read-only snapshots
● Copy-on-write clones
● Broad integration
  – QEMU, libvirt
  – Linux kernel
  – iSCSI (STGT, LIO)
  – OpenStack, CloudStack, OpenNebula, Ganeti, Proxmox, oVirt
● Incremental backup (relative to snapshots)
![Page 8: Ceph Block Devices: A Deep Divevideos.cdn.redhat.com/summit2015/presentations/... · Ceph Motivating Principles All components must scale horizontally There can be no single point](https://reader034.vdocuments.site/reader034/viewer/2022043005/5f8c470e59866206986d8fd3/html5/thumbnails/8.jpg)
![Page 10: Ceph Block Devices: A Deep Divevideos.cdn.redhat.com/summit2015/presentations/... · Ceph Motivating Principles All components must scale horizontally There can be no single point](https://reader034.vdocuments.site/reader034/viewer/2022043005/5f8c470e59866206986d8fd3/html5/thumbnails/10.jpg)
RADOS
● Flat namespace within a pool
● Rich object API
  – Bytes, attributes, key/value data
  – Partial overwrite of existing data
  – Single-object compound atomic operations
  – RADOS classes (stored procedures)
● Strong consistency (CP system)
● Infrastructure aware, dynamic topology
● Hash-based placement (CRUSH)
● Direct client to server data path
![Page 11: Ceph Block Devices: A Deep Divevideos.cdn.redhat.com/summit2015/presentations/... · Ceph Motivating Principles All components must scale horizontally There can be no single point](https://reader034.vdocuments.site/reader034/viewer/2022043005/5f8c470e59866206986d8fd3/html5/thumbnails/11.jpg)
RADOS Components
OSDs:
10s to 1000s in a cluster
One per disk (or one per SSD, RAID group…)
Serve stored objects to clients
Intelligently peer for replication & recovery
Monitors:
Maintain cluster membership and state
Provide consensus for distributed decision-making
Small, odd number (e.g., 5)
Not part of data path
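The preference for a small, odd number of monitors can be illustrated with a little arithmetic. This is a minimal sketch that assumes the usual majority-quorum rule for consensus; the function names are illustrative and not part of any Ceph API.

```python
def quorum_size(monitors: int) -> int:
    """Smallest majority of the monitor set."""
    return monitors // 2 + 1


def tolerated_failures(monitors: int) -> int:
    """Monitors that can fail while a majority is still reachable."""
    return monitors - quorum_size(monitors)


# Going from 3 to 4 monitors does not raise the number of tolerated failures.
for n in (3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

Under this assumption an even-sized set adds a machine without adding fault tolerance, which is one reason to prefer odd counts.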
![Page 13: Ceph Block Devices: A Deep Divevideos.cdn.redhat.com/summit2015/presentations/... · Ceph Motivating Principles All components must scale horizontally There can be no single point](https://reader034.vdocuments.site/reader034/viewer/2022043005/5f8c470e59866206986d8fd3/html5/thumbnails/13.jpg)
Metadata
● rbd_directory
  – Maps image name to id, and vice versa
● rbd_children
  – Lists clones in a pool, indexed by parent
● rbd_id.$image_name
  – The internal id, locatable using only the user-specified image name
● rbd_header.$image_id
  – Per-image metadata
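As a rough illustration of how these object names fit together, the sketch below resolves a user-visible image name to its header. A plain dictionary stands in for a pool; the helper names, the example id, and the header fields are hypothetical, and a real client would read these objects through librados.

```python
# Toy model: a dict stands in for a RADOS pool (object name -> contents).
# Only the object-name patterns come from the slide; the rest is illustrative.

def id_object(image_name: str) -> str:
    """Name of the object that stores an image's internal id."""
    return f"rbd_id.{image_name}"


def header_object(image_id: str) -> str:
    """Name of the object that stores per-image metadata."""
    return f"rbd_header.{image_id}"


def open_header(pool: dict, image_name: str) -> dict:
    """Look up the id by user-specified name, then fetch the header."""
    image_id = pool[id_object(image_name)]
    return pool[header_object(image_id)]


pool = {
    "rbd_id.vm-disk": "10074b0dc51",  # hypothetical internal id
    "rbd_header.10074b0dc51": {"size": 10 * 2**30, "order": 22},
}
header = open_header(pool, "vm-disk")
print(header["size"])
```

The indirection through the id means the header object does not have to be renamed when the image name changes.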
![Page 14: Ceph Block Devices: A Deep Divevideos.cdn.redhat.com/summit2015/presentations/... · Ceph Motivating Principles All components must scale horizontally There can be no single point](https://reader034.vdocuments.site/reader034/viewer/2022043005/5f8c470e59866206986d8fd3/html5/thumbnails/14.jpg)
Data
● rbd_data.* objects
  – Named based on offset in image
  – Non-existent to start with
  – Plain data in each object
  – Snapshots handled by RADOS
  – Often sparse
![Page 15: Ceph Block Devices: A Deep Divevideos.cdn.redhat.com/summit2015/presentations/... · Ceph Motivating Principles All components must scale horizontally There can be no single point](https://reader034.vdocuments.site/reader034/viewer/2022043005/5f8c470e59866206986d8fd3/html5/thumbnails/15.jpg)
Striping
● Objects are uniformly sized
  – Default is simple 4MB divisions of device
● Randomly distributed among OSDs by CRUSH
● Parallel work is spread across many spindles
● No single set of servers responsible for image
● Small objects lower OSD usage variance
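The default layout can be sketched as simple integer division. This example assumes data objects are named `rbd_data.<image id>.<object number as 16 hex digits>`; the slides only say that names are based on the offset, so treat the exact format, the function names, and the image id as assumptions.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # default 4 MB objects, per the slide


def locate(image_id: str, offset: int, object_size: int = OBJECT_SIZE):
    """Map a byte offset in the image to (object name, offset inside it)."""
    object_no, inner = divmod(offset, object_size)
    # Assumed naming scheme: rbd_data.<image id>.<object number, 16 hex digits>
    return f"rbd_data.{image_id}.{object_no:016x}", inner


def split_extent(image_id: str, offset: int, length: int,
                 object_size: int = OBJECT_SIZE):
    """Split one image extent into per-object (name, offset, length) pieces."""
    pieces = []
    end = offset + length
    while offset < end:
        name, inner = locate(image_id, offset, object_size)
        chunk = min(object_size - inner, end - offset)
        pieces.append((name, inner, chunk))
        offset += chunk
    return pieces


# A 6 MB write starting 3 MB into the image touches three objects.
for piece in split_extent("abc123", 3 * 2**20, 6 * 2**20):
    print(piece)
```

Because each piece targets a different object, and CRUSH places objects independently, a large sequential request naturally fans out across several OSDs.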
![Page 16: Ceph Block Devices: A Deep Divevideos.cdn.redhat.com/summit2015/presentations/... · Ceph Motivating Principles All components must scale horizontally There can be no single point](https://reader034.vdocuments.site/reader034/viewer/2022043005/5f8c470e59866206986d8fd3/html5/thumbnails/16.jpg)
I/O
[Diagram: a VM on a hypervisor; LIBRBD, with a cache, reaches the RADOS cluster and its monitors (M).]
![Page 17: Ceph Block Devices: A Deep Divevideos.cdn.redhat.com/summit2015/presentations/... · Ceph Motivating Principles All components must scale horizontally There can be no single point](https://reader034.vdocuments.site/reader034/viewer/2022043005/5f8c470e59866206986d8fd3/html5/thumbnails/17.jpg)
Snapshots
● Object granularity
● Snapshot context [list of snap ids, latest snap id]
  – Stored in rbd image header (self-managed)
  – Sent with every write
● Snapshot ids managed by monitors
● Deleted asynchronously
● RADOS keeps per-object overwrite stats, so diffs are easy
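To make the role of the snapshot context more concrete, here is a deliberately simplified model of one object receiving writes that carry a context. It is not the real RADOS logic: the class and its fields are hypothetical, and it ignores deletes, snapshot removal, and objects created after a snapshot was taken.

```python
class ToyObject:
    """One object as a toy OSD might track it (not the real RADOS logic)."""

    def __init__(self):
        self.head = b""    # current data
        self.seq = 0       # newest snap id this object has seen
        self.clones = {}   # snap id -> data preserved for that snapshot

    def write(self, data: bytes, snap_ids: list, snap_seq: int):
        """Apply a write carrying a snapshot context (snap ids, latest id)."""
        newer = [s for s in snap_ids if s > self.seq]
        if newer and self.head:
            # First write since a new snapshot: preserve the old data once.
            self.clones[max(newer)] = self.head
        self.seq = max(self.seq, snap_seq)
        self.head = data

    def read(self, snap_id=None) -> bytes:
        """Read the head, or the data as of a snapshot."""
        if snap_id is None:
            return self.head
        for s in sorted(self.clones):
            if s >= snap_id:
                return self.clones[s]
        return self.head  # object unchanged since that snapshot


obj = ToyObject()
obj.write(b"v1", [], 7)     # no snapshots exist yet
obj.write(b"v2", [8], 8)    # first write after snap 8: old data is kept
obj.write(b"v3", [8], 8)    # later writes do not clone again
print(obj.read(), obj.read(snap_id=8))
```

The point of the model is that the object store needs no separate "take snapshot" pass over the data: the context attached to each write is enough to decide when old contents must be preserved.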
![Page 18: Ceph Block Devices: A Deep Divevideos.cdn.redhat.com/summit2015/presentations/... · Ceph Motivating Principles All components must scale horizontally There can be no single point](https://reader034.vdocuments.site/reader034/viewer/2022043005/5f8c470e59866206986d8fd3/html5/thumbnails/18.jpg)
Snapshots
[Diagram: LIBRBD issues a write with snap context ([], 7).]
![Page 19: Ceph Block Devices: A Deep Divevideos.cdn.redhat.com/summit2015/presentations/... · Ceph Motivating Principles All components must scale horizontally There can be no single point](https://reader034.vdocuments.site/reader034/viewer/2022043005/5f8c470e59866206986d8fd3/html5/thumbnails/19.jpg)
Snapshots
[Diagram: LIBRBD creates snap 8.]
![Page 20: Ceph Block Devices: A Deep Divevideos.cdn.redhat.com/summit2015/presentations/... · Ceph Motivating Principles All components must scale horizontally There can be no single point](https://reader034.vdocuments.site/reader034/viewer/2022043005/5f8c470e59866206986d8fd3/html5/thumbnails/20.jpg)
Snapshots
[Diagram: LIBRBD sets its snap context to ([8], 8).]
![Page 21: Ceph Block Devices: A Deep Divevideos.cdn.redhat.com/summit2015/presentations/... · Ceph Motivating Principles All components must scale horizontally There can be no single point](https://reader034.vdocuments.site/reader034/viewer/2022043005/5f8c470e59866206986d8fd3/html5/thumbnails/21.jpg)
Snapshots
[Diagram: LIBRBD writes with snap context ([8], 8).]
![Page 22: Ceph Block Devices: A Deep Divevideos.cdn.redhat.com/summit2015/presentations/... · Ceph Motivating Principles All components must scale horizontally There can be no single point](https://reader034.vdocuments.site/reader034/viewer/2022043005/5f8c470e59866206986d8fd3/html5/thumbnails/22.jpg)
Snapshots
[Diagram: LIBRBD writes with snap context ([8], 8); the object is now shown with a "Snap 8" copy.]
![Page 23: Ceph Block Devices: A Deep Divevideos.cdn.redhat.com/summit2015/presentations/... · Ceph Motivating Principles All components must scale horizontally There can be no single point](https://reader034.vdocuments.site/reader034/viewer/2022043005/5f8c470e59866206986d8fd3/html5/thumbnails/23.jpg)
Watch/Notify
● establish stateful 'watch' on an object
  – client interest persistently registered with object
  – client keeps connection to OSD open
● send 'notify' messages to all watchers
  – notify message (and payload) sent to all watchers
  – notification (and reply payloads) on completion
● strictly time-bounded liveness check on watch
  – no notifier falsely believes we got a message
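The watch/notify pattern can be sketched with an in-memory stand-in. This is a toy model, not the librados interface: the class, method names, and payload are hypothetical, and the time-bounded liveness check is left out entirely.

```python
class ToyWatchedObject:
    """In-memory stand-in for an object that supports watch/notify."""

    def __init__(self):
        self._watchers = {}  # watcher name -> callback

    def watch(self, name, callback):
        """Register persistent interest in this object."""
        self._watchers[name] = callback

    def unwatch(self, name):
        self._watchers.pop(name, None)

    def notify(self, payload):
        """Deliver payload to every watcher and collect their replies."""
        return {name: cb(payload) for name, cb in self._watchers.items()}


metadata = {"snaps": []}   # stands in for the image header contents
cached = {"snaps": []}     # a hypervisor's cached view of that header


def on_notify(payload):
    # On notification, re-read the metadata and acknowledge.
    cached["snaps"] = list(metadata["snaps"])
    return "ack"


header = ToyWatchedObject()
header.watch("hypervisor", on_notify)

metadata["snaps"].append("snap8")           # e.g. the rbd CLI adds a snapshot
replies = header.notify("header updated")   # returns once every watcher replied
print(replies, cached)
```

The notifier learns that every watcher has seen the change only when `notify` returns, which is what lets a management tool safely proceed after altering an image that a running VM has open.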
![Page 24: Ceph Block Devices: A Deep Divevideos.cdn.redhat.com/summit2015/presentations/... · Ceph Motivating Principles All components must scale horizontally There can be no single point](https://reader034.vdocuments.site/reader034/viewer/2022043005/5f8c470e59866206986d8fd3/html5/thumbnails/24.jpg)
Watch/Notify
[Diagram: a hypervisor and /usr/bin/rbd, each using LIBRBD, with "watch" arrows.]
![Page 25: Ceph Block Devices: A Deep Divevideos.cdn.redhat.com/summit2015/presentations/... · Ceph Motivating Principles All components must scale horizontally There can be no single point](https://reader034.vdocuments.site/reader034/viewer/2022043005/5f8c470e59866206986d8fd3/html5/thumbnails/25.jpg)
Watch/Notify
[Diagram: a hypervisor and /usr/bin/rbd, each using LIBRBD, with an "Add snapshot" step.]
![Page 26: Ceph Block Devices: A Deep Divevideos.cdn.redhat.com/summit2015/presentations/... · Ceph Motivating Principles All components must scale horizontally There can be no single point](https://reader034.vdocuments.site/reader034/viewer/2022043005/5f8c470e59866206986d8fd3/html5/thumbnails/26.jpg)
Watch/Notify
[Diagram: a hypervisor and /usr/bin/rbd, each using LIBRBD, with "notify" arrows.]
![Page 27: Ceph Block Devices: A Deep Divevideos.cdn.redhat.com/summit2015/presentations/... · Ceph Motivating Principles All components must scale horizontally There can be no single point](https://reader034.vdocuments.site/reader034/viewer/2022043005/5f8c470e59866206986d8fd3/html5/thumbnails/27.jpg)
Watch/Notify
[Diagram: a hypervisor and /usr/bin/rbd, each using LIBRBD, with "notify ack" and "notify complete" steps.]
![Page 28: Ceph Block Devices: A Deep Divevideos.cdn.redhat.com/summit2015/presentations/... · Ceph Motivating Principles All components must scale horizontally There can be no single point](https://reader034.vdocuments.site/reader034/viewer/2022043005/5f8c470e59866206986d8fd3/html5/thumbnails/28.jpg)
Watch/Notify
[Diagram: the hypervisor's LIBRBD with a "read metadata" step.]
![Page 29: Ceph Block Devices: A Deep Divevideos.cdn.redhat.com/summit2015/presentations/... · Ceph Motivating Principles All components must scale horizontally There can be no single point](https://reader034.vdocuments.site/reader034/viewer/2022043005/5f8c470e59866206986d8fd3/html5/thumbnails/29.jpg)
Clones
● Copy-on-write (and optionally read, in Hammer)
● Object granularity
● Independent settings
  – striping, feature bits, object size can differ
  – can be in different pools
● Clones are based on protected snapshots
  – 'protected' means they can't be deleted
● Can be flattened
  – Copy all data from parent
  – Remove parent relationship
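The clone behavior listed above (reads falling through to the parent, copy-up on first write, flattening) can be modeled in a few lines. This is a toy sketch with hypothetical names and 4-byte objects; it says nothing about how librbd actually implements these steps.

```python
class ToyImage:
    """Image as a sparse map of object number -> data, with optional parent."""

    def __init__(self, parent=None):
        self.objects = {}
        self.parent = parent  # a protected snapshot, modeled as a ToyImage

    def read(self, object_no, size=4):
        if object_no in self.objects:
            return self.objects[object_no]
        if self.parent is not None:
            return self.parent.read(object_no, size)  # fall through to parent
        return bytes(size)  # never-written objects read as zeros

    def write(self, object_no, offset, data):
        if object_no not in self.objects:
            # Copy-up: bring the whole parent object into the clone first.
            self.objects[object_no] = self.read(object_no)
        buf = bytearray(self.objects[object_no])
        buf[offset:offset + len(data)] = data
        self.objects[object_no] = bytes(buf)

    def flatten(self):
        """Copy every object still served by an ancestor, then detach."""
        ancestor = self.parent
        while ancestor is not None:
            for n, data in ancestor.objects.items():
                self.objects.setdefault(n, data)
            ancestor = ancestor.parent
        self.parent = None


parent = ToyImage()
parent.write(0, 0, b"base")
parent.write(1, 0, b"data")

clone = ToyImage(parent=parent)
clone.write(0, 0, b"B")   # copy-up object 0, then overwrite its first byte
clone.flatten()           # object 1 is copied; the parent link is removed
print(clone.read(0), clone.read(1), parent.read(0))
```

Working at object granularity keeps the copy-up cost bounded: a small write to a clone copies at most one object from the parent.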
![Page 30: Ceph Block Devices: A Deep Divevideos.cdn.redhat.com/summit2015/presentations/... · Ceph Motivating Principles All components must scale horizontally There can be no single point](https://reader034.vdocuments.site/reader034/viewer/2022043005/5f8c470e59866206986d8fd3/html5/thumbnails/30.jpg)
Clones - read
[Diagram: LIBRBD, clone, and parent.]
![Page 31: Ceph Block Devices: A Deep Divevideos.cdn.redhat.com/summit2015/presentations/... · Ceph Motivating Principles All components must scale horizontally There can be no single point](https://reader034.vdocuments.site/reader034/viewer/2022043005/5f8c470e59866206986d8fd3/html5/thumbnails/31.jpg)
Clones - write
[Diagram: LIBRBD, clone, and parent.]
![Page 32: Ceph Block Devices: A Deep Divevideos.cdn.redhat.com/summit2015/presentations/... · Ceph Motivating Principles All components must scale horizontally There can be no single point](https://reader034.vdocuments.site/reader034/viewer/2022043005/5f8c470e59866206986d8fd3/html5/thumbnails/32.jpg)
Clones - write
[Diagram: LIBRBD, clone, and parent, with a "read" step.]
![Page 33: Ceph Block Devices: A Deep Divevideos.cdn.redhat.com/summit2015/presentations/... · Ceph Motivating Principles All components must scale horizontally There can be no single point](https://reader034.vdocuments.site/reader034/viewer/2022043005/5f8c470e59866206986d8fd3/html5/thumbnails/33.jpg)
Clones - write
[Diagram: LIBRBD, clone, and parent, with a "copy up, write" step.]
![Page 34: Ceph Block Devices: A Deep Divevideos.cdn.redhat.com/summit2015/presentations/... · Ceph Motivating Principles All components must scale horizontally There can be no single point](https://reader034.vdocuments.site/reader034/viewer/2022043005/5f8c470e59866206986d8fd3/html5/thumbnails/34.jpg)
Ceph and OpenStack
![Page 35: Ceph Block Devices: A Deep Divevideos.cdn.redhat.com/summit2015/presentations/... · Ceph Motivating Principles All components must scale horizontally There can be no single point](https://reader034.vdocuments.site/reader034/viewer/2022043005/5f8c470e59866206986d8fd3/html5/thumbnails/35.jpg)
Virtualized Setup
● Secret key stored by libvirt
● XML defining VM fed in, includes
  – Monitor addresses
  – Client name
  – QEMU block device cache setting
    ● Writeback recommended
  – Bus on which to attach block device
    ● virtio-blk/virtio-scsi recommended
    ● IDE ok for legacy systems
  – Discard options (ide/scsi only), I/O throttling if desired
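A disk definition carrying the items above might be generated as follows. This is a sketch based on libvirt's network-disk XML as commonly used with RBD; the function name, monitor host names, client name, and secret UUID are placeholders, and the exact elements should be checked against the libvirt documentation for the version in use.

```python
import xml.etree.ElementTree as ET


def rbd_disk_xml(pool, image, monitors, client, secret_uuid,
                 cache="writeback", bus="virtio", dev="vda"):
    """Build a libvirt <disk> element for an RBD image (illustrative only)."""
    disk = ET.Element("disk", type="network", device="disk")
    # QEMU block device cache setting; writeback is the recommended mode.
    ET.SubElement(disk, "driver", name="qemu", type="raw", cache=cache)
    # Client name plus a reference to the secret key stored by libvirt.
    auth = ET.SubElement(disk, "auth", username=client)
    ET.SubElement(auth, "secret", type="ceph", uuid=secret_uuid)
    # Image location and monitor addresses.
    source = ET.SubElement(disk, "source", protocol="rbd",
                           name=f"{pool}/{image}")
    for host, port in monitors:
        ET.SubElement(source, "host", name=host, port=str(port))
    # Bus on which to attach the block device.
    ET.SubElement(disk, "target", dev=dev, bus=bus)
    return ET.tostring(disk, encoding="unicode")


xml_text = rbd_disk_xml(
    pool="rbd",
    image="vm-disk",
    monitors=[("mon1.example.com", 6789), ("mon2.example.com", 6789)],
    client="libvirt",                                    # placeholder client
    secret_uuid="00000000-0000-0000-0000-000000000000",  # placeholder UUID
)
print(xml_text)
```

Only the secret's UUID appears in the XML; the key itself stays in libvirt's secret store, matching the first bullet above.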
![Page 36: Ceph Block Devices: A Deep Divevideos.cdn.redhat.com/summit2015/presentations/... · Ceph Motivating Principles All components must scale horizontally There can be no single point](https://reader034.vdocuments.site/reader034/viewer/2022043005/5f8c470e59866206986d8fd3/html5/thumbnails/36.jpg)
Virtualized Setup
[Diagram: libvirt takes vm.xml and launches qemu -enable-kvm ...; the VM runs in QEMU, which uses LIBRBD with its cache.]
![Page 37: Ceph Block Devices: A Deep Divevideos.cdn.redhat.com/summit2015/presentations/... · Ceph Motivating Principles All components must scale horizontally There can be no single point](https://reader034.vdocuments.site/reader034/viewer/2022043005/5f8c470e59866206986d8fd3/html5/thumbnails/37.jpg)
Kernel RBD
● rbd map sets everything up
● /etc/ceph/rbdmap is like /etc/fstab
● udev adds handy symlinks:
  – /dev/rbd/$pool/$image[@snap]
● striping v2 and later feature bits not supported yet
● Can be used to back LIO, NFS, SMB, etc.
● No specialized cache, page cache used by filesystem on top
![Page 38: Ceph Block Devices: A Deep Divevideos.cdn.redhat.com/summit2015/presentations/... · Ceph Motivating Principles All components must scale horizontally There can be no single point](https://reader034.vdocuments.site/reader034/viewer/2022043005/5f8c470e59866206986d8fd3/html5/thumbnails/38.jpg)
![Page 39: Ceph Block Devices: A Deep Divevideos.cdn.redhat.com/summit2015/presentations/... · Ceph Motivating Principles All components must scale horizontally There can be no single point](https://reader034.vdocuments.site/reader034/viewer/2022043005/5f8c470e59866206986d8fd3/html5/thumbnails/39.jpg)
What's new in Hammer
● Copy-on-read
● Librbd cache enabled in safe mode by default
● Readahead during boot
● LTTng tracepoints
● Allocation hints
● Cache hints
● Exclusive locking (off by default for now)
● Object map (off by default for now)
![Page 40: Ceph Block Devices: A Deep Divevideos.cdn.redhat.com/summit2015/presentations/... · Ceph Motivating Principles All components must scale horizontally There can be no single point](https://reader034.vdocuments.site/reader034/viewer/2022043005/5f8c470e59866206986d8fd3/html5/thumbnails/40.jpg)
Infernalis
● Easier space tracking (rbd du)
● Faster differential backups
● Per-image metadata
  – Can persist rbd options
● RBD journaling
● Enabling new features on the fly
![Page 41: Ceph Block Devices: A Deep Divevideos.cdn.redhat.com/summit2015/presentations/... · Ceph Motivating Principles All components must scale horizontally There can be no single point](https://reader034.vdocuments.site/reader034/viewer/2022043005/5f8c470e59866206986d8fd3/html5/thumbnails/41.jpg)
Future Work
● Kernel client catch up
● RBD mirroring
● Consistency groups
● QoS for RADOS (policy at rbd level)
● Active/Active iSCSI
● Performance improvements
  – Newstore OSD backend
  – Improved cache tiering
  – Finer-grained caching
  – Many more