what's new in luminous and beyond
TRANSCRIPT
![Page 1: What's new in Luminous and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022030318/5a65f4997f8b9a723f8b4c43/html5/thumbnails/1.jpg)
WHAT’S NEW IN LUMINOUS AND BEYOND
SAGE WEIL – RED HAT – 2017.11.06
![Page 2: What's new in Luminous and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022030318/5a65f4997f8b9a723f8b4c43/html5/thumbnails/2.jpg)
2
UPSTREAM RELEASES
Jewel (LTS)
Spring 2016
Kraken
Fall 2016
Luminous
Summer 2017
12.2.z
10.2.z
14.2.z
WE ARE HERE
Mimic
Spring 2018
Nautilus
Winter 2019
13.2.z
● New release cadence
● Named release every 9 months
● Backports for 2 releases
● Upgrade up to 2 releases at a time (e.g., Luminous → Nautilus)
![Page 3: What's new in Luminous and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022030318/5a65f4997f8b9a723f8b4c43/html5/thumbnails/3.jpg)
3 LUMINOUS
![Page 4: What's new in Luminous and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022030318/5a65f4997f8b9a723f8b4c43/html5/thumbnails/4.jpg)
4
BLUESTORE: STABLE AND DEFAULT
● New OSD backend
– consumes raw block device(s) – no more XFS
– embeds rocksdb for metadata
● Fast on both HDDs (~2x) and SSDs (~1.5x)
– Similar to FileStore on NVMe, where the device is not the bottleneck
● Smaller journals
– happily uses fast SSD partition(s) for internal metadata, or NVRAM for journal
● Full data checksums (crc32c, xxhash, etc.)
● Inline compression (zlib, snappy)
– policy driven by global or per-pool config, and/or client hints
● Stable and default
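The inline compression mentioned above is driven by per-pool properties; a minimal sketch, assuming a hypothetical pool named `mypool`:

```shell
# Enable inline compression on one pool (pool name is hypothetical).
ceph osd pool set mypool compression_algorithm snappy
# Policy: none | passive (only on client hint) | aggressive | force
ceph osd pool set mypool compression_mode aggressive
```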
![Page 5: What's new in Luminous and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022030318/5a65f4997f8b9a723f8b4c43/html5/thumbnails/5.jpg)
5
HDD: RANDOM WRITE
[Chart: Bluestore vs Filestore HDD random write throughput (MB/s) vs IO size (4–4096); series: Filestore, Bluestore (wip-bitmap-alloc-perf), BS (Master-a07452d), BS (Master-d62a4948), Bluestore (wip-bluestore-dw)]
[Chart: Bluestore vs Filestore HDD random write IOPS vs IO size; same series]
![Page 6: What's new in Luminous and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022030318/5a65f4997f8b9a723f8b4c43/html5/thumbnails/6.jpg)
6
HDD: MIXED READ/WRITE
[Chart: Bluestore vs Filestore HDD random RW throughput (MB/s) vs IO size (4–4096); series: Filestore, Bluestore (wip-bitmap-alloc-perf), BS (Master-a07452d), BS (Master-d62a4948), Bluestore (wip-bluestore-dw)]
[Chart: Bluestore vs Filestore HDD random RW IOPS vs IO size; same series]
![Page 7: What's new in Luminous and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022030318/5a65f4997f8b9a723f8b4c43/html5/thumbnails/7.jpg)
7
RGW ON HDD+NVME, EC 4+2
[Chart: 4+2 erasure coding RadosGW write tests – throughput (MB/s) across 1/4/128/512 buckets with 1 and 4 RGW servers, plus a rados bench baseline; 32 MB objects, 24 HDD/NVMe OSDs on 4 servers, 4 clients; series: Filestore 512 KB chunks, Filestore 4 MB chunks, Bluestore 512 KB chunks, Bluestore 4 MB chunks]
![Page 8: What's new in Luminous and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022030318/5a65f4997f8b9a723f8b4c43/html5/thumbnails/8.jpg)
8
RBD OVER ERASURE CODED POOLS
● aka erasure code overwrites
● requires BlueStore to perform reasonably
● significant improvement in efficiency over 3x replication
– 2+2 → 2x overhead, 4+2 → 1.5x overhead (vs 3x for three-way replication)
● small writes slower than replication
– early testing showed 4+2 is about half as fast as 3x replication
● large writes faster than replication
– less IO to device
● implementation still does the “simple” thing
– all writes update a full stripe
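The setup described above can be sketched end to end; the profile and pool names here are hypothetical, and reasonable performance assumes BlueStore OSDs on Luminous:

```shell
# 4+2 profile, EC data pool with overwrites enabled, replicated metadata pool
ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
ceph osd pool create ecdata 64 64 erasure ec42
ceph osd pool set ecdata allow_ec_overwrites true
# Image metadata stays in the replicated 'rbd' pool; data goes to the EC pool.
rbd create --size 10G --data-pool ecdata rbd/myimage
```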
![Page 9: What's new in Luminous and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022030318/5a65f4997f8b9a723f8b4c43/html5/thumbnails/9.jpg)
9
CEPH-MGR
● ceph-mgr
– new management daemon to supplement ceph-mon (monitor)
– easier integration point for python management logic
– integrated metrics
● ceph-mon scaling
– offload pg stats from mon to mgr
– validated 10K OSD deployment (“Big Bang III” @ CERN)
● restful: new REST API
● prometheus, influx, zabbix
● dashboard: built-in web dashboard
– webby equivalent of 'ceph -s'
[Icons: M (mon), G (mgr), ??? – time for new iconography]
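The mgr modules listed above are enabled individually; for example:

```shell
ceph mgr module enable dashboard     # built-in web dashboard
ceph mgr module enable prometheus    # metrics exporter
ceph mgr services                    # show the endpoints enabled modules expose
```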
![Page 10: What's new in Luminous and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022030318/5a65f4997f8b9a723f8b4c43/html5/thumbnails/10.jpg)
10
![Page 11: What's new in Luminous and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022030318/5a65f4997f8b9a723f8b4c43/html5/thumbnails/11.jpg)
11
CEPH-METRICS
![Page 12: What's new in Luminous and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022030318/5a65f4997f8b9a723f8b4c43/html5/thumbnails/12.jpg)
12
ASYNCMESSENGER
● new network Messenger implementation
– event driven
– fixed-size thread pool
● RDMA backend (ibverbs)
– built by default
– limited testing, but seems stable!
● DPDK backend
– prototype!
![Page 13: What's new in Luminous and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022030318/5a65f4997f8b9a723f8b4c43/html5/thumbnails/13.jpg)
13
PERFECTLY BALANCED OSDS (FINALLY!)
● CRUSH choose_args
– alternate weight sets for individual rules
– fixes two problems:
● imbalance – run numeric optimization to adjust weights to balance PG distribution for a pool (or the whole cluster)
● multipick anomaly – adjust weights per position to correct for low-weighted devices (e.g., a mostly empty rack)
– backward-compatible ‘compat weight-set’ for imbalance only
● pg upmap
– explicitly map individual PGs to specific devices in the OSDMap
– requires luminous+ clients
● Automagic, courtesy of the ceph-mgr ‘balancer’ module
– ceph balancer mode crush-compat
– ceph balancer on
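When every client is Luminous or newer, the upmap mode can be used instead of crush-compat; a sketch:

```shell
ceph osd set-require-min-compat-client luminous  # upmap needs luminous+ clients
ceph balancer mode upmap
ceph balancer on
ceph balancer status
```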
![Page 14: What's new in Luminous and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022030318/5a65f4997f8b9a723f8b4c43/html5/thumbnails/14.jpg)
14
RADOS MISC
● CRUSH device classes
– mark OSDs with a class (hdd, ssd, etc.)
– out-of-box rules to map to a specific class of devices within the same hierarchy
● streamlined disk replacement
● require_min_compat_client – simpler, safer configuration
● annotated/documented config options
● client backoff on stuck PGs or objects
● better EIO handling
● peering and recovery speedups
● fast OSD failure detection
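Steering a pool onto one device class looks roughly like this (the rule and pool names are hypothetical):

```shell
# Classes are normally auto-detected; to override, clear and reset:
ceph osd crush rm-device-class osd.3
ceph osd crush set-device-class ssd osd.3
# Rule replicating across hosts using only ssd-class devices:
ceph osd crush rule create-replicated fast default host ssd
ceph osd pool set mypool crush_rule fast
```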
![Page 15: What's new in Luminous and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022030318/5a65f4997f8b9a723f8b4c43/html5/thumbnails/15.jpg)
15
S3 / Swift
Erasure coding
Multisite federation
Multisite replication
NFS
Encryption
Tiering
Deduplication
RADOSGW
Compression
![Page 16: What's new in Luminous and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022030318/5a65f4997f8b9a723f8b4c43/html5/thumbnails/16.jpg)
16
RGW METADATA SEARCH

[Diagram: three zones (A, B, C) on separate clusters (A, B, C), each running RADOSGW atop LIBRADOS with its own monitors; the zones are connected over REST, with one zone dedicated to serving metadata search queries]
![Page 17: What's new in Luminous and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022030318/5a65f4997f8b9a723f8b4c43/html5/thumbnails/17.jpg)
17
RGW MISC
● NFS gateway
– NFSv4 and v3
– full object access (not general purpose!)
● dynamic bucket index sharding
– automatic (finally!)
● inline compression
● encryption
– follows the S3 encryption APIs
● S3 and Swift API odds and ends
RADOSGW
LIBRADOS
![Page 18: What's new in Luminous and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022030318/5a65f4997f8b9a723f8b4c43/html5/thumbnails/18.jpg)
18
Erasure coding
Multisite mirroring
Persistent client cache
Consistency groups
Encryption
iSCSI
Trash
RBD
![Page 19: What's new in Luminous and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022030318/5a65f4997f8b9a723f8b4c43/html5/thumbnails/19.jpg)
19
RBD
● RBD over erasure coded pools
– rbd create --data-pool <ecpoolname> ...
● RBD mirroring improvements
– cooperative HA daemons
– improved Cinder integration
● iSCSI
– LIO tcmu-runner, librbd (full feature set)
● Kernel RBD improvements
– exclusive locking, object map
![Page 20: What's new in Luminous and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022030318/5a65f4997f8b9a723f8b4c43/html5/thumbnails/20.jpg)
20
CEPHFS
● multiple active MDS daemons (finally!)
● subtree pinning to a specific daemon
● directory fragmentation on by default
– (snapshots still off by default)
● so many tests
● so many bugs fixed
● kernel client improvements
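Multiple active MDS daemons and subtree pinning are driven from the CLI and extended attributes; a sketch with a hypothetical filesystem name and mount point:

```shell
ceph fs set cephfs max_mds 2                        # allow two active MDS ranks
setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/projects  # pin this subtree to rank 1
```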
![Page 21: What's new in Luminous and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022030318/5a65f4997f8b9a723f8b4c43/html5/thumbnails/21.jpg)
21 MIMIC
![Page 22: What's new in Luminous and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022030318/5a65f4997f8b9a723f8b4c43/html5/thumbnails/22.jpg)
22
OSD refactor
Tiering
Client-side caches
Metrics
Dedup
QoS
Self-management
Multi-site federation
![Page 23: What's new in Luminous and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022030318/5a65f4997f8b9a723f8b4c43/html5/thumbnails/23.jpg)
23
CONTAINERS
● ceph-container (https://github.com/ceph/ceph-container)
– multi-purpose container image
– plan to publish upstream releases as containers on download.ceph.com
● ceph-helm (https://github.com/ceph/ceph-helm)
– Helm charts to deploy Ceph in Kubernetes
● openshift-ansible
– deployment of Ceph in OpenShift
● mgr-based dashboard/GUI
– management and metrics (based on ceph-metrics, https://github.com/ceph/ceph-metrics)
● leverage Kubernetes for daemon scheduling
– MDS, RGW, mgr, NFS, iSCSI, CIFS, rbd-mirror, …
● streamlined provisioning and management
– adding nodes, replacing failed disks, …
![Page 24: What's new in Luminous and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022030318/5a65f4997f8b9a723f8b4c43/html5/thumbnails/24.jpg)
24
RADOS
● The future is:
– NVMe
– DPDK/SPDK
– a futures-based programming model for the OSD
– painful but necessary
● BlueStore and rocksdb optimization
– rocksdb level0 compaction
– alternative KV store?
● RedStore?
– considering an NVMe-focused backend beyond BlueStore
● Erasure coding plugin API improvements
– new codes with less IO for single-OSD failures
[Diagram: OSD stack – MESSENGER, OSD, OBJECTSTORE, DEVICE]
![Page 25: What's new in Luminous and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022030318/5a65f4997f8b9a723f8b4c43/html5/thumbnails/25.jpg)
25
EASE OF USE ODDS AND ENDS
● centralized config management
– all config on the mons – no more ceph.conf
– ceph config …
● PG autoscaling
– less reliance on the operator to pick the right values
● progress bars
– recovery, rebalancing, etc.
● async recovery
– fewer high-latency outliers during recovery
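The centralized config interface looks roughly like this (the option and daemon names are illustrative):

```shell
ceph config set osd osd_max_backfills 2   # applies to all OSDs, stored on the mons
ceph config get osd.0 osd_max_backfills   # what one daemon will use
ceph config dump                          # everything stored centrally
```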
![Page 26: What's new in Luminous and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022030318/5a65f4997f8b9a723f8b4c43/html5/thumbnails/26.jpg)
26
QUALITY OF SERVICE
● Ongoing background development
– dmclock distributed QoS queuing
– minimum reservations and priority weighting
● Range of policies
– IO type (background, client)
– pool-based
– client-based
● Theory is complex
● Prototype is promising, despite its simplicity
● Basics will land in Mimic
![Page 27: What's new in Luminous and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022030318/5a65f4997f8b9a723f8b4c43/html5/thumbnails/27.jpg)
27
![Page 28: What's new in Luminous and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022030318/5a65f4997f8b9a723f8b4c43/html5/thumbnails/28.jpg)
28
CEPH STORAGE CLUSTER
TIERING
● new RADOS ‘redirect’ primitive
– basically a symlink, transparent to librados
– replace “sparse” cache tier with base pool “index”
[Diagram: today, the application writes through a replicated cache pool over a base pool (HDD and/or erasure); with the redirect primitive, the application talks to a replicated SSD base pool that acts as an “index” over slow pools (EC, …) in the Ceph storage cluster]
![Page 29: What's new in Luminous and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022030318/5a65f4997f8b9a723f8b4c43/html5/thumbnails/29.jpg)
29
CEPH STORAGE CLUSTER
DEDUPLICATION WIP
● Generalize redirect to a “manifest”
– map of offsets to object “fragments” (vs a full object copy)
● Break objects into chunks
– fixed size, or content fingerprint
● Store chunks in content-addressable pool
– name object by sha256(content)
– reference count chunks
● TBD
– inline or post?
– policy (when to promote inline, etc.)
– agent home (embed in OSD, …)
[Diagram: the application writes to a base pool (replicated, SSD), which references chunks in a slow EC pool and a CAS dedup pool]
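The chunk-and-fingerprint idea above can be sketched with plain shell tools (paths and chunk size are illustrative; this is not Ceph code): split content into fixed-size chunks, name each by its sha256, and identical chunks collapse to a single stored copy.

```shell
# Two of the three 8-byte chunks are identical, so they share one fingerprint.
rm -f /tmp/chunk.*
printf 'AAAAAAAABBBBBBBBAAAAAAAA' > /tmp/obj
split -b 8 /tmp/obj /tmp/chunk.
counts=$(for c in /tmp/chunk.*; do sha256sum "$c" | awk '{print $1}'; done \
         | sort | uniq -c | awk '{print $1}' | sort | tr '\n' ' ')
echo "$counts"    # one fingerprint seen once, one seen twice: "1 2 "
```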
![Page 30: What's new in Luminous and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022030318/5a65f4997f8b9a723f8b4c43/html5/thumbnails/30.jpg)
30
CephFS
● Integrated NFS gateway management
– HA nfs-ganesha service
– API configurable, integrated with Manila
● snapshots!
● quota support in the kernel client
![Page 31: What's new in Luminous and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022030318/5a65f4997f8b9a723f8b4c43/html5/thumbnails/31.jpg)
31
CLIENT CACHES!
● RGW
– persistent read-only cache on NVMe
– fully consistent (only caches immutable “tail” rados objects)
– Mass Open Cloud
● RBD
– persistent read-only cache of immutable clone parent images
– writeback cache for improving write latency
● cluster image remains crash-consistent if client cache is lost
● CephFS
– kernel client already uses kernel fscache facility
![Page 32: What's new in Luminous and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022030318/5a65f4997f8b9a723f8b4c43/html5/thumbnails/32.jpg)
32
GROWING DEVELOPER COMMUNITY
![Page 33: What's new in Luminous and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022030318/5a65f4997f8b9a723f8b4c43/html5/thumbnails/33.jpg)
33
● Red Hat
● Mirantis
● SUSE
● ZTE
● China Mobile
● XSky
● Digiware
● Intel
● Kylin Cloud
● Easystack
● Istuary Innovation Group
● Quantum
● Mellanox
● H3C
● UnitedStack
● Deutsche Telekom
● Reliance Jio Infocomm
● OVH
● Alibaba
● DreamHost
● CERN
GROWING DEVELOPER COMMUNITY
![Page 34: What's new in Luminous and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022030318/5a65f4997f8b9a723f8b4c43/html5/thumbnails/34.jpg)
34
OPENSTACK VENDORS
![Page 35: What's new in Luminous and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022030318/5a65f4997f8b9a723f8b4c43/html5/thumbnails/35.jpg)
35
CLOUD OPERATORS
![Page 36: What's new in Luminous and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022030318/5a65f4997f8b9a723f8b4c43/html5/thumbnails/36.jpg)
36
HARDWARE AND SOLUTION VENDORS
![Page 37: What's new in Luminous and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022030318/5a65f4997f8b9a723f8b4c43/html5/thumbnails/37.jpg)
37
???
![Page 38: What's new in Luminous and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022030318/5a65f4997f8b9a723f8b4c43/html5/thumbnails/38.jpg)
38
● Mailing list and IRC
– http://ceph.com/IRC
● Ceph Developer Monthly
– first Wednesday of every month
– video conference (Bluejeans)
– alternating APAC- and EMEA-friendly times
● Github
– https://github.com/ceph/
● Ceph Days
– http://ceph.com/cephdays/
● Meetups
– http://ceph.com/meetups
● Ceph Tech Talks
– http://ceph.com/ceph-tech-talks/
● ‘Ceph’ Youtube channel
– (google it)
– @ceph
GET INVOLVED
![Page 39: What's new in Luminous and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022030318/5a65f4997f8b9a723f8b4c43/html5/thumbnails/39.jpg)
CEPHALOCON APAC – 2018.03.22 and 23
BEIJING, CHINA