ERASURE CODING AND CACHE TIERING
SAGE WEIL - SNIA SDC 2014.09.16
ARCHITECTURE
CEPH MOTIVATING PRINCIPLES
● All components must scale horizontally
● There can be no single point of failure
● The solution must be hardware agnostic
● Should use commodity hardware
● Self-manage whenever possible
● Open source
ARCHITECTURAL COMPONENTS
● RGW: a web services gateway for object storage, compatible with S3 and Swift
● RBD: a reliable, fully-distributed block device with cloud platform integration
● CEPHFS: a distributed file system with POSIX semantics and scale-out metadata management
● LIBRADOS: a library allowing apps (C, C++, Java, Python, Ruby, PHP) to directly access RADOS
● RADOS: a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
[Diagram: apps use RGW or LIBRADOS, hosts/VMs use RBD, and clients use CEPHFS, all layered on RADOS]
ROBUST SERVICES BUILT ON RADOS
THE RADOS GATEWAY
[Diagram: applications speak REST to RADOSGW instances (reached over a local socket); each RADOSGW uses LIBRADOS to talk to the RADOS cluster and its monitors (M)]
MULTI-SITE OBJECT STORAGE
[Diagram: two sites, US-EAST and EU-WEST; at each, a web application's app server talks to a Ceph Object Gateway (RGW) backed by its own Ceph storage cluster]
RADOSGW MAKES RADOS WEBBY
● RADOSGW: REST-based object storage proxy
● Uses RADOS to store objects
  – Stripes large RESTful objects across many RADOS objects
● API supports buckets and accounts
● Usage accounting for billing
● Compatible with S3 and Swift applications
STORING VIRTUAL DISKS
[Diagram: a VM's virtual disk is provided by its hypervisor through LIBRBD, which stores the data in the RADOS cluster (monitors M)]
KERNEL MODULE
[Diagram: a Linux host accesses the same RADOS cluster directly via the KRBD kernel module]
RBD FEATURES
● Stripe images across entire cluster (pool)
● Read-only snapshots
● Copy-on-write clones
● Broad integration
– Qemu
– Linux kernel
– iSCSI (STGT, LIO)
– OpenStack, CloudStack, OpenNebula, Ganeti, Proxmox
● Incremental backup (relative to snapshots)
SEPARATE METADATA SERVER
[Diagram: a Linux host's kernel client sends metadata operations to the MDS and reads/writes file data directly to the RADOS cluster (monitors M)]
SCALABLE METADATA SERVERS
● METADATA SERVER (MDS) manages metadata for a POSIX-compliant shared filesystem
  – Directory hierarchy
  – File metadata (owner, timestamps, mode, etc.)
● Clients stripe file data in RADOS
  – MDS not in the data path
● MDS stores metadata in RADOS
  – Key/value objects
● Dynamic MDS cluster scales to 10s or 100s
● Only required for the shared filesystem
RADOS
RADOS
● Flat object namespace within each pool
● Rich object API (librados)
– Bytes, attributes, key/value data
– Partial overwrite of existing data
– Single-object compound operations
– RADOS classes (stored procedures)
● Strong consistency (CP system)
● Infrastructure aware, dynamic topology
● Hash-based placement (CRUSH)
● Direct client to server data path
RADOS CLUSTER
[Diagram: an application talks directly to the RADOS cluster: many OSDs plus a handful of monitors (M)]
RADOS COMPONENTS
● OSDs: 10s to 1000s in a cluster
  – One per disk (or one per SSD, RAID group, …)
  – Serve stored objects to clients
  – Intelligently peer for replication and recovery
● Monitors (M):
  – Maintain cluster membership and state
  – Provide consensus for distributed decision-making
  – Small, odd number (e.g., 5)
  – Not part of the data path
OBJECT STORAGE DAEMONS
[Diagram: each OSD daemon runs on top of a local filesystem (xfs, btrfs, or ext4) on its own disk; monitors (M) run alongside the OSDs]
DATA PLACEMENT
WHERE DO OBJECTS LIVE?
[Diagram: an application has an object to store; which OSD in the cluster should hold it?]
A METADATA SERVER?
[Diagram: option 1: the application (1) asks a central metadata service where the object lives, then (2) contacts that location]
CALCULATED PLACEMENT
[Diagram: option 2: the application applies a function F to the object name, mapping it to a static partition of servers (A-G, H-N, O-T, U-Z) with no lookup]
CRUSH
[Diagram: objects are hashed into placement groups (PGs), and CRUSH maps each PG onto OSDs in the cluster]
CRUSH IS A QUICK CALCULATION
[Diagram: the client computes an object's PG and target OSDs itself and talks to the RADOS cluster directly, with no lookup service]
CRUSH AVOIDS FAILED DEVICES
[Diagram: when an OSD fails, CRUSH maps the affected PG to a different OSD and the client follows automatically]
CRUSH: DECLUSTERED PLACEMENT
● Each PG independently maps to a pseudorandom set of OSDs
● PGs that map to the same OSD generally have replicas that do not
● When an OSD fails, each PG it stored will generally be re-replicated by a different OSD
  – Highly parallel recovery
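The declustering effect is easy to simulate. Below is a sketch in which a simple hash stands in for CRUSH (the `pg_to_osds` helper is invented for illustration, not the real algorithm); it shows that one OSD failure leaves recovery work spread across most of the cluster:

```python
import hashlib

def pg_to_osds(pg, num_osds, replicas=3):
    """Deterministically map a PG to a pseudorandom set of distinct OSDs.
    A toy stand-in for CRUSH's independent per-PG placement."""
    osds, attempt = [], 0
    while len(osds) < replicas:
        h = hashlib.sha256(f"{pg}:{attempt}".encode()).digest()
        osd = int.from_bytes(h[:4], "big") % num_osds
        if osd not in osds:
            osds.append(osd)
        attempt += 1
    return osds

num_osds, num_pgs = 100, 4096
placement = {pg: pg_to_osds(pg, num_osds) for pg in range(num_pgs)}

# If OSD 0 fails, the surviving replicas of its PGs live on many different
# OSDs, so re-replication proceeds in parallel across much of the cluster.
affected = [pg for pg, osds in placement.items() if 0 in osds]
recovery_sources = {o for pg in affected for o in placement[pg] if o != 0}
print(len(affected), "PGs affected,", len(recovery_sources), "recovery sources")
```

With a few thousand PGs, nearly every surviving OSD participates in recovery, which is what makes recovery "highly parallel".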
CRUSH: DYNAMIC DATA PLACEMENT
● CRUSH: pseudo-random placement algorithm
  – Fast calculation, no lookup
  – Repeatable, deterministic
● Statistically uniform distribution
● Stable mapping
  – Limited data migration on change
● Rule-based configuration
  – Infrastructure topology aware
  – Adjustable replication
  – Weighting
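CRUSH itself is beyond a slide, but several of the advertised properties (deterministic, no lookup table, stable mapping) can be illustrated with plain rendezvous hashing; this sketch is an analogy, not Ceph's algorithm:

```python
import hashlib

def score(obj, osd):
    return hashlib.sha256(f"{obj}:{osd}".encode()).hexdigest()

def place(obj, osds):
    """Rendezvous (highest-random-weight) hashing: deterministic, computed
    from names alone, and stable -- removing one OSD only moves the
    objects that lived on it."""
    return max(osds, key=lambda osd: score(obj, osd))

osds = [f"osd.{i}" for i in range(10)]
objs = [f"obj-{i}" for i in range(1000)]
before = {o: place(o, osds) for o in objs}

survivors = [d for d in osds if d != "osd.3"]
after = {o: place(o, survivors) for o in objs}

# Only objects that were on osd.3 moved; everything else stayed put.
moved = [o for o in objs if before[o] != after[o]]
print(len(moved), "objects moved of", len(objs))
```

This is the "stable mapping / limited data migration" property: placement changes are proportional to the capacity that changed, not to the whole cluster.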
DATA IS ORGANIZED INTO POOLS
[Diagram: objects are written into named pools (A, B, C, D); each pool has its own set of placement groups, which CRUSH distributes across the cluster]
TIERED STORAGE
TWO WAYS TO CACHE
● Within each OSD
  – Combine SSD and HDD for each OSD
  – Make localized promote/demote decisions
  – Leverage existing tools
    ● dm-cache, bcache, FlashCache
    ● Variety of caching controllers
  – We can help with hints
● Cache on separate devices/nodes
  – Different hardware for different tiers
    ● Slow nodes for cold data
    ● High-performance nodes for hot data
  – Add, remove, scale each tier independently
    ● Unlikely to choose the right ratios at procurement time
TIERED STORAGE
[Diagram: an application talks to a replicated cache pool layered in front of an erasure coded backing pool, both within one Ceph storage cluster]
RADOS TIERING PRINCIPLES
● Each tier is a RADOS pool
– May be replicated or erasure coded
● Tiers are durable
– e.g., replicate across SSDs in multiple hosts
● Each tier has its own CRUSH policy
– e.g., map to SSD devices/hosts only
● librados clients adapt to tiering topology
– Transparently direct requests accordingly
● e.g., to cache
– No changes to RBD, RGW, CephFS, etc.
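In practice this topology is expressed with pool-level commands. The following is a sketch against hypothetical pool names (`hot-ssd` as the cache tier, `cold-ec` as the EC base pool); consult the Ceph cache-tiering documentation for the authoritative option list:

```shell
# Attach a replicated SSD pool as a cache tier in front of an EC pool
ceph osd tier add cold-ec hot-ssd
ceph osd tier cache-mode hot-ssd writeback
ceph osd tier set-overlay cold-ec hot-ssd

# Track object temperature with bloom-filter hit sets
ceph osd pool set hot-ssd hit_set_type bloom
ceph osd pool set hot-ssd hit_set_count 8
ceph osd pool set hot-ssd hit_set_period 3600

# Bound the cache and set flush/evict targets
ceph osd pool set hot-ssd target_max_bytes 1000000000000
ceph osd pool set hot-ssd cache_target_dirty_ratio 0.4
ceph osd pool set hot-ssd cache_target_full_ratio 0.8
```

Because clients learn the tiering topology from the OSD map, nothing above requires changes on the client side.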
WRITE INTO CACHE POOL
[Diagram: the client WRITEs to the writeback cache pool (SSD) and gets an ACK from it; the backing pool (HDD) is untouched]
WRITE INTO CACHE POOL
[Diagram: if the object is not yet cached, it is first PROMOTEd from the backing pool (HDD); the WRITE then lands in the cache pool and is ACKed]
READ (CACHE HIT)
[Diagram: the client's READ is answered directly by the cache pool (READ REPLY)]
READ (CACHE MISS)
[Diagram: on a miss, the cache pool can REDIRECT the READ so the backing pool serves the READ REPLY]
READ (CACHE MISS)
[Diagram: alternatively, the object is PROMOTEd into the cache pool, which then serves the READ REPLY]
ESTIMATING TEMPERATURE
● Each PG constructs in-memory bloom filters
– Insert records on both read and write
– Each filter covers configurable period (e.g., 1 hour)
– Tunable false positive probability (e.g., 5%)
– Maintain most recent N filters on disk
● Estimate temperature
– Has object been accessed in any of the last N periods?
– ...in how many of them?
– Informs flush/evict decision
● Estimate “recency”
– How many periods have elapsed since the object was last accessed?
– Informs read miss behavior: promote vs redirect
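The hit-set mechanism can be sketched in a few lines; the `BloomFilter` class and all parameters below are illustrative, not Ceph's implementation:

```python
import hashlib

class BloomFilter:
    """Tiny bloom filter: insert-only set membership with tunable false
    positives and no false negatives. One of these per PG per period."""
    def __init__(self, nbits=1024, nhashes=4):
        self.nbits, self.nhashes, self.bits = nbits, nhashes, 0
    def _positions(self, key):
        for i in range(self.nhashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.nbits
    def insert(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p
    def __contains__(self, key):
        return all(self.bits >> p & 1 for p in self._positions(key))

# Keep the N most recent per-period filters (newest first).  Temperature
# counts how many recent periods saw the object; recency counts how many
# periods back its most recent access was.
hit_sets = [BloomFilter() for _ in range(4)]
hit_sets[0].insert("obj-hot"); hit_sets[1].insert("obj-hot")
hit_sets[3].insert("obj-cold")

def temperature(obj): return sum(obj in hs for hs in hit_sets)
def recency(obj):
    for i, hs in enumerate(hit_sets):
        if obj in hs: return i
    return len(hit_sets)

print(temperature("obj-hot"), recency("obj-hot"))
print(temperature("obj-cold"), recency("obj-cold"))
```

A hot object scores high temperature and low recency, so a read miss for it is worth a promote; a cold one can simply be redirected to the base tier.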
AGENT: FLUSH COLD DATA
[Diagram: the tiering agent FLUSHes cold dirty objects from the cache pool to the backing pool (HDD), which ACKs the writeback]
TIERING AGENT
● Each PG has an internal tiering agent
– Manages PG based on administrator defined policy
● Flush dirty objects
– When pool reaches target dirty ratio
– Tries to select cold objects
– Marks objects clean when they have been written back to the base pool
● Evict clean objects
– Greater “effort” as pool/PG size approaches target size
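A toy version of one agent pass might look like this (the policy and data layout are invented for illustration; Ceph's actual agent uses a graduated effort model):

```python
def agent_pass(objects, target_dirty_ratio, target_size):
    """One pass of a toy tiering agent: flush the coldest dirty objects
    while the dirty ratio exceeds its target, then evict the coldest
    clean objects while the pool is over its target size.  Each object
    is a dict with 'dirty' (bool) and 'temp' (hit-set access count)."""
    flushed, evicted = [], []
    dirty = [n for n, o in objects.items() if o["dirty"]]
    while dirty and len(dirty) / len(objects) > target_dirty_ratio:
        coldest = min(dirty, key=lambda n: objects[n]["temp"])
        objects[coldest]["dirty"] = False   # written back to base pool
        flushed.append(coldest)
        dirty.remove(coldest)
    while len(objects) > target_size:
        clean = [n for n, o in objects.items() if not o["dirty"]]
        coldest = min(clean, key=lambda n: objects[n]["temp"])
        del objects[coldest]                # dropped from the cache
        evicted.append(coldest)
    return flushed, evicted

cache = {
    "a": {"dirty": True,  "temp": 9},
    "b": {"dirty": True,  "temp": 1},
    "c": {"dirty": False, "temp": 5},
    "d": {"dirty": False, "temp": 0},
}
flushed, evicted = agent_pass(cache, target_dirty_ratio=0.3, target_size=3)
print(flushed, evicted)
```

Here the cold dirty object is flushed first (making it evictable), and the coldest clean object is evicted to bring the pool back under its size target.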
READ ONLY CACHE TIER
[Diagram: with a read-only cache tier, WRITEs go straight to the (replicated) backing pool and are ACKed there; READs are served from the cache pool, with objects PROMOTEd on demand]
ERASURE CODING
ERASURE CODING
● Replicated pool
  – Full copies of stored objects
  – Very high durability
  – 3x (200% overhead)
  – Quicker recovery
● Erasure coded pool
  – One copy plus parity
  – Cost-effective durability
  – 1.5x (50% overhead)
  – Expensive recovery
[Diagram: an object is stored as three full copies in the replicated pool, versus data chunks 1-4 plus parity chunks X and Y in the erasure coded pool]
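The overhead figures follow directly from the chunk counts:

```python
def overhead(k, m):
    """Raw-to-usable storage ratio for a k data + m coding chunk scheme.
    Replication with r copies is the degenerate case k=1, m=r-1."""
    return (k + m) / k

# 3x replication: every byte stored three times
print(overhead(1, 2))
# 4+2 erasure code, as in the diagram: survives loss of any 2 chunks
print(overhead(4, 2))
```

Both schemes here tolerate two failures, but the erasure code does it at half the raw storage cost.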
ERASURE CODING SHARDS
[Diagram: an object in an erasure coded pool is split across OSDs as data shards 1-4 plus coding shards X and Y]
ERASURE CODING SHARDS
[Stripe layout in the erasure coded pool: data shards on OSDs 1-4, coding shards on OSDs X and Y]

  OSD 1 | OSD 2 | OSD 3 | OSD 4 | OSD X | OSD Y
    0   |   1   |   2   |   3   |   A   |   A'
    4   |   5   |   6   |   7   |   B   |   B'
    8   |   9   |  10   |  11   |   C   |   C'
   12   |  13   |  14   |  15   |   D   |   D'
   16   |  17   |  18   |  19   |   E   |   E'

● Variable stripe size
● Zero-fill shards (logically) in partial tail stripe
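The striping and tail-stripe zero-fill can be sketched as follows (an illustrative helper, not Ceph's code; coding chunks like X and Y would be computed per stripe):

```python
def stripe(data, k, chunk_size):
    """Split an object into stripes of k chunks each, zero-filling the
    tail stripe so every shard in a stripe has equal length."""
    stripe_bytes = k * chunk_size
    padded = data + b"\0" * (-len(data) % stripe_bytes)
    shards = [bytearray() for _ in range(k)]
    for off in range(0, len(padded), stripe_bytes):
        for i in range(k):
            shards[i] += padded[off + i * chunk_size: off + (i + 1) * chunk_size]
    return [bytes(s) for s in shards]

# 10 bytes, k=4, 2-byte chunks: one full stripe plus a zero-filled tail
shards = stripe(b"ABCDEFGHIJ", k=4, chunk_size=2)
print(shards)
```

Shard 0 ends up holding chunks 0 and 4 of the layout above, shard 1 chunks 1 and 5, and so on; the tail stripe is logically padded with zeroes.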
PRIMARY
[Diagram: within the erasure coded pool's PG (shards 1-4, X, Y), one OSD acts as the primary and coordinates client I/O]
EC READ
[Diagram: the client sends its READ to the PG's primary OSD]
EC READ
[Diagram: the primary issues sub-READS to enough shard OSDs to reconstruct the requested data]
EC READ
[Diagram: the primary assembles the shards and returns the READ REPLY to the client]
EC WRITE
[Diagram: the client sends its WRITE to the PG's primary OSD]
EC WRITE
[Diagram: the primary encodes the update and issues sub-WRITES to all shard OSDs]
EC WRITE
[Diagram: once the shards are persisted, the primary sends the ACK to the client]
EC WRITE: DEGRADED
[Diagram: with one shard OSD down, the primary still issues sub-WRITES to the remaining shard OSDs]
EC WRITE: PARTIAL FAILURE
[Diagram: the primary issues sub-WRITES, but only some shard OSDs apply them before a failure interrupts the write]
EC WRITE: PARTIAL FAILURE
[Diagram: the shards are now inconsistent: some OSDs hold new version B while others still hold version A]
EC RESTRICTIONS
● Overwrite in place will not work in general
● Log and 2PC would increase complexity, latency
● We chose to restrict allowed operations
– create
– append (on stripe boundary)
– remove (keep previous generation of object for some time)
● These operations can all easily be rolled back locally
– create → delete
– append → truncate
– remove → roll back to previous generation
● Object attrs preserved in existing PG logs (they are small)
● Key/value data is not allowed on EC pools
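The local rollbacks can be modeled with a tiny op log; this `ECShardLog` class is invented for illustration of the create/append/remove inverses above:

```python
class ECShardLog:
    """Toy per-shard log of roll-backable operations, mirroring the
    restricted EC op set: create -> delete, append -> truncate,
    remove -> keep the previous generation around."""
    def __init__(self):
        self.data = None          # current object contents (or None)
        self.prev_gen = None      # retained generation for remove
        self.log = []             # (op, undo-info) entries
    def create(self):
        self.data = bytearray(); self.log.append(("create", None))
    def append(self, chunk):
        self.log.append(("append", len(self.data)))   # remember old length
        self.data += chunk
    def remove(self):
        self.prev_gen, self.data = self.data, None
        self.log.append(("remove", None))
    def rollback(self):
        op, undo = self.log.pop()
        if op == "create": self.data = None            # delete
        elif op == "append": del self.data[undo:]      # truncate
        elif op == "remove": self.data, self.prev_gen = self.prev_gen, None

s = ECShardLog()
s.create(); s.append(b"AAA"); s.append(b"BBB")
s.rollback()                  # undo the divergent append locally
print(bytes(s.data))
```

A shard that applied the partial write (version B) pops its last log entry and truncates back to version A, with no need to consult the other shards.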
EC WRITE: PARTIAL FAILURE
[Diagram: on recovery, the PG finds the mixed B/A state across its shards]
EC WRITE: PARTIAL FAILURE
[Diagram: the partially applied write is rolled back locally, returning every shard to version A]
EC RESTRICTIONS
● This is a small subset of allowed librados operations
– Notably cannot (over)write any extent
● Coincidentally, these operations are also inefficient for erasure codes
– Generally require read/modify/write of affected stripe(s)
● Some applications can consume EC directly
– RGW (no object data update in place)
● Others can combine EC with a cache tier (RBD, CephFS)
– Replication for warm/hot data
– Erasure coding for cold data
– Tiering agent skips objects with key/value data
WHICH ERASURE CODE?
● The EC algorithm and implementation are pluggable
– jerasure (free, open, and very fast)
– ISA-L (Intel library; optimized for modern Intel procs)
– LRC (local recovery code – layers over existing plugins)
● Parameterized
– Pick k and m, stripe size
● OSD handles data path, placement, rollback, etc.
● Plugin handles
– Encode and decode
– Given these available shards, which ones should I fetch to satisfy a read?
– Given these available shards and these missing shards, which ones should I fetch to recover?
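A minimal plugin-style encode/decode, using XOR parity as a k=2, m=1 code (real plugins like jerasure and ISA-L generalize this to arbitrary k and m):

```python
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data):
    """Split into two data shards plus one XOR parity shard: any single
    lost shard can be rebuilt from the surviving two."""
    half = (len(data) + 1) // 2
    d0 = data[:half].ljust(half, b"\0")
    d1 = data[half:].ljust(half, b"\0")   # zero-fill the tail
    return [d0, d1, xor_bytes(d0, d1)]

def decode(shards):
    """Rebuild a missing shard (None) from the surviving two and return
    the object bytes."""
    d0, d1, p = shards
    if d0 is None: d0 = xor_bytes(d1, p)
    if d1 is None: d1 = xor_bytes(d0, p)
    return d0 + d1

shards = encode(b"HELLOCEPH!")
shards[1] = None                   # lose a data shard
print(decode(shards))
```

The plugin's job is exactly this pair of operations plus the shard-selection questions above; the OSD handles everything else (placement, the write path, rollback).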
![Page 65: ERASURE CODING AND CACHE TIERING - SNIA · 2019-02-11 · Within each OSD – Combine SSD and HDD for each OSD – Make localized promote/demote decisions – Leverage existing tools](https://reader030.vdocuments.site/reader030/viewer/2022011813/5e466d49324bc27c1c19c8ca/html5/thumbnails/65.jpg)
66
COST OF RECOVERY
[Diagram: a failed 1 TB OSD whose contents must be rebuilt]
68
COST OF RECOVERY (REPLICATION)
1 TB OSD
1 TB
![Page 68: ERASURE CODING AND CACHE TIERING - SNIA · 2019-02-11 · Within each OSD – Combine SSD and HDD for each OSD – Make localized promote/demote decisions – Leverage existing tools](https://reader030.vdocuments.site/reader030/viewer/2022011813/5e466d49324bc27c1c19c8ca/html5/thumbnails/68.jpg)
69
COST OF RECOVERY (REPLICATION)
1 TB OSD
.01 TB
.01 TB
.01 TB
.01 TB
...
...
.01 TB .01 TB
![Page 69: ERASURE CODING AND CACHE TIERING - SNIA · 2019-02-11 · Within each OSD – Combine SSD and HDD for each OSD – Make localized promote/demote decisions – Leverage existing tools](https://reader030.vdocuments.site/reader030/viewer/2022011813/5e466d49324bc27c1c19c8ca/html5/thumbnails/69.jpg)
70
COST OF RECOVERY (REPLICATION)
1 TB OSD
1 TB
![Page 70: ERASURE CODING AND CACHE TIERING - SNIA · 2019-02-11 · Within each OSD – Combine SSD and HDD for each OSD – Make localized promote/demote decisions – Leverage existing tools](https://reader030.vdocuments.site/reader030/viewer/2022011813/5e466d49324bc27c1c19c8ca/html5/thumbnails/70.jpg)
71
COST OF RECOVERY (EC)
[Diagram: under erasure coding with k = 4, rebuilding the failed 1 TB OSD reads 1 TB from each of four surviving shards — 4 TB in total]
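The comparison on these slides reduces to simple arithmetic (toy numbers from the diagrams, not measurements): replication resends one copy of the lost data, while EC must read k surviving shards for every stripe it reconstructs.

```python
# Back-of-envelope recovery read volume for a failed OSD.
def recovery_reads_tb(failed_tb: float, scheme: str, k: int = 1) -> float:
    if scheme == "replication":
        return failed_tb          # one surviving replica is read and resent
    if scheme == "ec":
        return failed_tb * k      # k shards read per reconstructed stripe
    raise ValueError(scheme)

print(recovery_reads_tb(1.0, "replication"))   # 1.0 TB read
print(recovery_reads_tb(1.0, "ec", k=4))       # 4.0 TB read
```

This k-fold recovery read amplification is the motivation for the local recovery codes on the next slide.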
72
LOCAL RECOVERY CODE (LRC)
[Diagram: CEPH STORAGE CLUSTER — an erasure coded pool with an object's shards on OSDs 1, 2, 3, 4, X, and Y, plus local recovery shards on OSDs A, B, and C]
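A toy sketch of the LRC idea (the grouping below is illustrative, not Ceph's actual layout): extra local parities over small subsets of shards let a single lost shard be rebuilt from its local group instead of fetching k shards cluster-wide.

```python
# XOR helper over equal-length shards.
def xor(*shards: bytes) -> bytes:
    out = bytes(len(shards[0]))
    for s in shards:
        out = bytes(a ^ b for a, b in zip(out, s))
    return out

d = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]   # k = 4 data shards
local1 = xor(d[0], d[1])                   # local parity over group {d0, d1}
local2 = xor(d[2], d[3])                   # local parity over group {d2, d3}

# Lose d[0]: plain EC would read k = 4 shards to reconstruct it. With LRC,
# read only the local group — surviving member d[1] plus local1: 2 reads.
recovered = xor(d[1], local1)
assert recovered == d[0]
```

The trade-off is extra storage for the local parities in exchange for fewer reads (and less network traffic) in the common single-failure case; multi-shard failures still fall back to the global code.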
73
BIG THANKS TO
● Ceph
– Loic Dachary (CloudWatt, FSF France, Red Hat)
– Andreas Peters (CERN)
– Sam Just (Inktank / Red Hat)
– David Zafman (Inktank / Red Hat)
● jerasure
– Jim Plank (University of Tennessee)
– Kevin Greenan (Box)
ROADMAP
75
WHAT'S NEXT
● Erasure coding
– Allow (optimistic) client reads directly from shards
– ARM optimizations for jerasure
● Cache pools
– Better agent decisions (when to flush or evict)
– Supporting different performance profiles
● e.g., slow or “cheap” flash that reads just as fast
– Complex topologies
● Multiple read-only cache tiers in multiple sites
● Tiering
– Support “redirects” to cold tier below base pool
– Dynamic spin-down
76
OTHER ONGOING WORK
● Performance optimization (SanDisk, Mellanox)
● Alternative OSD backends
– leveldb, rocksdb, LMDB
– hybrid key/value and file system
● Messenger (network layer) improvements
– RDMA support (libxio – Mellanox)
– Event-driven TCP implementation (UnitedStack)
● Multi-datacenter RADOS replication
● CephFS
– Online consistency checking
– Performance, robustness