COMMUNITY UPDATE AT OPENSTACK SUMMIT BOSTON


COMMUNITY UPDATE

SAGE WEIL – RED HAT 2017.05.11

UPSTREAM RELEASES

● Jewel (LTS) – Spring 2016 – 10.2.z

● Kraken – Fall 2016

● Luminous (LTS) – Spring 2017 – 12.2.z

WE ARE HERE

● Mimic – Fall 2017?

● Nautilus? (LTS) – Spring 2018? – 14.2.z

LUMINOUS

BLUESTORE: STABLE AND DEFAULT

● New OSD backend

– consumes raw block device(s) – no more XFS

– embeds rocksdb for metadata

● Fast on both HDDs (~2x) and SSDs (~1.5x)

– Similar to FileStore on NVMe, where the device is not the bottleneck

● Smaller journals

– happily uses fast SSD partition(s) for internal metadata, or NVRAM for journal

● Full data checksums (crc32c, xxhash, etc.)

● Inline compression (zlib, snappy)

– policy driven by global or per-pool config, and/or client hints

● Stable and default
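The checksum and compression policy above is controlled by BlueStore config options and per-pool properties; a minimal sketch, assuming a pool named mypool (pool name and algorithm choice are illustrative):

    # global BlueStore defaults (ceph.conf)
    #   bluestore csum type = crc32c
    #   bluestore compression algorithm = snappy
    #   bluestore compression mode = aggressive

    # per-pool compression policy, overriding the global default
    ceph osd pool set mypool compression_algorithm snappy
    ceph osd pool set mypool compression_mode aggressive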

HDD: RANDOM WRITE

[Chart: Bluestore vs Filestore HDD Random Write Throughput – throughput (MB/s) vs IO size (4–4096); series: Filestore, Bluestore (wip-bitmap-alloc-perf), BS (Master-a07452d), BS (Master-d62a4948), Bluestore (wip-bluestore-dw)]

[Chart: Bluestore vs Filestore HDD Random Write IOPS – IOPS vs IO size (4–4096); same series]

HDD: MIXED READ/WRITE

[Chart: Bluestore vs Filestore HDD Random RW Throughput – throughput (MB/s) vs IO size (4–4096); series: Filestore, Bluestore (wip-bitmap-alloc-perf), BS (Master-a07452d), BS (Master-d62a4948), Bluestore (wip-bluestore-dw)]

[Chart: Bluestore vs Filestore HDD Random RW IOPS – IOPS vs IO size (4–4096); same series]

RGW ON HDD+NVME, EC 4+2

[Chart: 4+2 Erasure Coding RadosGW Write Tests – throughput (MB/s) for 32MB objects, 24 HDD/NVMe OSDs on 4 servers, 4 clients; workloads: 1/4/128/512 buckets with 1 and 4 RGW servers, plus a rados bench baseline; series: Filestore 512KB Chunks, Filestore 4MB Chunks, Bluestore 512KB Chunks, Bluestore 4MB Chunks]

RBD OVER ERASURE CODED POOLS

● aka erasure code overwrites

● requires BlueStore to perform reasonably

● significant improvement in efficiency over 3x replication

– 2+2 → 2x, 4+2 → 1.5x storage overhead (vs 3x for triple replication)

● small writes slower than replication

– early testing showed 4+2 is about half as fast as 3x replication

● large writes faster than replication

– less IO to device

● implementation still does the “simple” thing

– all writes update a full stripe
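A minimal sketch of enabling this on Luminous, assuming an erasure-coded pool named ecpool and an image named myimage (both names illustrative); RBD keeps its metadata in a replicated pool and places data in the EC pool:

    # partial overwrites on an EC pool (wants BlueStore OSDs for reasonable performance)
    ceph osd pool create ecpool 64 64 erasure
    ceph osd pool set ecpool allow_ec_overwrites true

    # image header/metadata live in the (replicated) rbd pool; data goes to ecpool
    rbd create --size 1T --data-pool ecpool rbd/myimage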


CEPH-MGR

● ceph-mgr

– new management daemon to supplement ceph-mon (monitor)

– easier integration point for python management logic

– integrated metrics

● make ceph-mon scalable again

– offload pg stats from mon to mgr

– push to 10K OSDs (planned “big bang 3” @ CERN)

● new REST API

– pecan

– based on previous Calamari API

● built-in web dashboard

– webby equivalent of 'ceph -s'

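A hedged sketch of enabling the dashboard and REST API mgr modules in Luminous (module names as shipped; the self-signed certificate step is only for testing):

    # built-in web dashboard
    ceph mgr module enable dashboard

    # REST API (pecan-based 'restful' module)
    ceph mgr module enable restful
    ceph restful create-self-signed-cert
    ceph restful create-key admin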

ASYNCMESSENGER

● new network Messenger implementation
– event driven
– fixed-size thread pool

● RDMA backend (ibverbs)
– built by default
– limited testing, but seems stable!

● DPDK backend
– prototype!
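To try the RDMA backend, the messenger type and RDMA device are chosen in ceph.conf; a minimal sketch, assuming a Mellanox NIC exposed as mlx5_0 (the device name is illustrative):

    [global]
    ms type = async+rdma
    ms async rdma device name = mlx5_0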


PERFECTLY BALANCED OSDS (FINALLY)

● CRUSH choose_args

– alternate weight sets for individual rules
– complete flexibility to optimize weights, etc.
– fixes two problems:

● imbalance – run numeric optimization to adjust weights to balance PG distribution for a pool (or cluster)

● multipick anomaly – adjust weights per position to correct for low-weighted devices (e.g., a mostly empty rack)

– backward compatible with pre-luminous clients for the imbalance case

● pg upmap

– explicitly map individual PGs to specific devices in the OSDMap
– simple offline optimizer balances PGs, by pool or by cluster
– requires luminous+ clients (see the sketch after this list)
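A hedged sketch of driving upmap-based balancing through the mgr balancer module in Luminous (module and mode names as shipped; assumes every client is luminous or newer):

    # upmap entries are only understood by luminous+ clients
    ceph osd set-require-min-compat-client luminous

    # let the balancer module optimize placement with pg-upmap-items
    ceph mgr module enable balancer
    ceph balancer mode upmap
    ceph balancer on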


RADOS MISC

● CRUSH device classes
– mark OSDs with class (hdd, ssd, etc)
– out-of-box rules to map to specific class of devices within the same hierarchy

● streamlined disk replacement

● require_min_compat_client – simpler, safer configuration

● annotated/documented config options

● client backoff on stuck PGs or objects

● better EIO handling

● peering and recovery speedups

● fast OSD failure detection
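Device classes let one CRUSH hierarchy serve mixed media; a minimal sketch (class, rule, and pool names are illustrative; classes are normally detected automatically):

    # classes are usually auto-detected; they can also be set by hand
    ceph osd crush set-device-class ssd osd.12

    # replicated rule that only picks OSDs of class ssd under the default root
    ceph osd crush rule create-replicated fast-ssd default host ssd

    # point a pool at the new rule
    ceph osd pool set mypool crush_rule fast-ssd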


RADOSGW

● S3 / Swift
● Erasure coding
● Multisite federation
● Multisite replication
● NFS
● Encryption
● Tiering
● Deduplication
● Compression

RGW METADATA SEARCH

[Diagram: multisite layout – Zones A, B, and C, each a RADOSGW on LIBRADOS over its own cluster (Clusters A, B, C with monitors), tied together via REST]

RGW MISC

● NFS gateway
– NFSv4 and v3
– full object access (not general purpose!)

● dynamic bucket index sharding
– automatic (finally!)

● inline compression

● encryption
– follows the S3 encryption APIs

● S3 and Swift API odds and ends

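Resharding can also be triggered by hand with radosgw-admin; a hedged sketch (bucket name and shard count are illustrative; automatic resharding is governed by the rgw dynamic resharding option):

    # inspect the current bucket index sharding
    radosgw-admin bucket stats --bucket=mybucket

    # queue a reshard to 64 index shards and run the resharder
    radosgw-admin reshard add --bucket=mybucket --num-shards=64
    radosgw-admin reshard process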

RBD

● Erasure coding
● Multisite mirroring
● Persistent client cache
● Consistency groups
● Encryption
● iSCSI
● Trash

RBD

● RBD over erasure coded pool
– rbd create --data-pool <ecpoolname> ...

● RBD mirroring improvements
– cooperative HA daemons
– improved Cinder integration

● iSCSI
– LIO tcmu-runner, librbd (full feature set)

● Kernel RBD improvements
– exclusive locking, object map
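A hedged sketch of per-image journaling-based mirroring (pool, image, and peer names are illustrative; an rbd-mirror daemon must run against each cluster):

    # choose per-image mirroring for the pool (mode can also be "pool")
    rbd mirror pool enable rbd image

    # journaling (which requires exclusive-lock) must be enabled on the image
    rbd feature enable rbd/myimage exclusive-lock journaling
    rbd mirror image enable rbd/myimage

    # register the remote cluster as a peer
    rbd mirror pool peer add rbd client.rbd-mirror-peer@remote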

[Diagram: the Ceph component stack, with APP, HOST/VM, and CLIENT consumers on top]

RADOS – A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS – A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD – A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

RADOSGW – A bucket-based REST gateway, compatible with S3 and Swift

CEPH FS – A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

(Status: RADOS, LIBRADOS, RBD, RADOSGW – AWESOME; CEPH FS – NEARLY AWESOME)

2017 =

RGW – A web services gateway for object storage, compatible with S3 and Swift

LIBRADOS – A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)

RADOS – A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors

RBD – A reliable, fully-distributed block device with cloud platform integration

CEPHFS – A distributed file system with POSIX semantics and scale-out metadata management

OBJECT BLOCK FILE

FULLY AWESOME


CEPHFS

● multiple active MDS daemons (finally!)

● subtree pinning to specific daemon

● directory fragmentation on by default
– (snapshots still off by default)

● so many tests

● so many bugs fixed

● kernel client improvements
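A hedged sketch of turning this on for an existing filesystem, assuming the Luminous-era allow_multimds switch and a filesystem named cephfs (names and the pinned path are illustrative):

    # allow and activate a second active MDS
    ceph fs set cephfs allow_multimds true
    ceph fs set cephfs max_mds 2

    # pin a directory subtree to MDS rank 1 (run on a mounted client)
    setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/archive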


AFTER LUMINOUS?

● OSD refactor
● Tiering
● Client-side caches
● Metrics
● Dedup
● QoS
● Self-management
● Multi-site federation

MIMIC

RADOS

● Peering optimizations

● IO path refactor and optimization
– async, state-driven, futures
– painful but necessary

● BlueStore and rocksdb optimization
– rocksdb level0 compaction
– alternative KV store?

● Erasure coding plugin API improvements
– new codes with less IO for single-OSD failures

[Diagram: OSD stack – MESSENGER, OSD, OBJECTSTORE, DEVICE]

QUALITY OF SERVICE

● Ongoing background development
– dmclock distributed QoS queuing
– minimum reservations and priority weighting

● Range of policies
– IO type (background, client)
– pool-based
– client-based

● Theory is complex

● Prototype is promising, despite simplicity

● Missing management framework

TIERING

● new RADOS ‘redirect’ primitive

– basically a symlink, transparent to librados

– replace “sparse” cache tier with base pool “index”

[Diagram: old model – APPLICATION → CACHE POOL (REPLICATED) → BASE POOL (HDD AND/OR ERASURE); new model – APPLICATION → BASE POOL (REPLICATED, SSD) redirecting to slow pools (EC, ...)]

DEDUPLICATION (WIP)

● Generalize redirect to a “manifest”

– map of offsets to object “fragments” (vs a full object copy)

● Break objects into chunks

– fixed size, or content fingerprint

● Store chunks in content-addressable pool

– name object by sha256(content)

– reference count chunks

● TBD

– inline or post?

– policy (when to promote inline, etc.)

– agent home (embed in OSD, …)

[Diagram: APPLICATION → BASE POOL (REPLICATED, SSD), with manifests pointing into a slow EC pool and a CAS DEDUP POOL]
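A purely illustrative sketch of the chunk-and-fingerprint idea – fixed-size chunks named by the sha256 of their content and stored in a hypothetical content-addressable pool called cas; this is not the in-progress OSD implementation:

    # split an object into fixed 4 MB chunks
    split -b 4M object.bin chunk.

    # name each chunk by its content fingerprint and store it in the CAS pool;
    # identical chunks map to the same object name and are stored only once
    for c in chunk.*; do
      h=$(sha256sum "$c" | awk '{print $1}')
      rados -p cas put "$h" "$c"
    done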


CEPH-MGR: METRICS, MGMT

● Metrics aggregation

– short-term time series out of the box
– no persistent state
– streaming to an external platform (Prometheus, …)

● Host self-management functions
– automatic CRUSH optimization
– identification of slow devices
– steer IO away from busy devices
– device failure prediction

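A hedged sketch of exposing these metrics to Prometheus via the mgr module that ships in Luminous (9283 is the module's default scrape port; the prometheus.yml fragment is illustrative):

    # enable the prometheus exporter module on ceph-mgr
    ceph mgr module enable prometheus

    # prometheus.yml scrape config fragment pointing at the active mgr
    #   - job_name: 'ceph'
    #     static_configs:
    #       - targets: ['ceph-mgr-host:9283']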


ARM

● aarch64 builds

– centos7, ubuntu xenial
– have some, but awaiting more build hardware in community lab

● thank you to partners!

● ppc64

● armv7l
– http://ceph.com/community/500-osd-ceph-cluster/

WDLABS MICROSERVER UPDATE

  Gen 2                  Gen 3
  ARM 32-bit             ARM 64-bit
  uBoot                  UEFI
  1GB DDR                Up to 4GB DDR
  2x 1GbE SGMII          2x 2.5GbE SGMII
  Basic Security         Enhanced Security
  No Flash option        Flash option
  I2C Management         Integrated Management

Planning Gen3 availability to PoC partners. Contact [email protected]

CLIENT CACHES!

● RGW

– persistent read-only cache on NVMe

– fully consistent (only caches immutable “tail” rados objects)

– Mass Open Cloud

● RBD

– persistent read-only cache of immutable clone parent images

– writeback cache for improving write latency

● cluster image remains crash-consistent if client cache is lost

● CephFS

– kernel client already uses kernel fscache facility


GROWING DEVELOPER COMMUNITY


● Red Hat

● Mirantis

● SUSE

● ZTE

● China Mobile

● XSky

● Digiware

● Intel

● Kylin Cloud

● Easystack

● Istuary Innovation Group

● Quantum

● Mellanox

● H3C

● UnitedStack

● Deutsche Telekom

● Reliance Jio Infocomm

● OVH

● Alibaba

● DreamHost

● CERN

OPENSTACK VENDORS

CLOUD OPERATORS

HARDWARE AND SOLUTION VENDORS

???

GET INVOLVED

● Mailing list and IRC

– http://ceph.com/IRC

● Ceph Developer Monthly

– first Weds of every month

– video conference (Bluejeans)

– alternating APAC- and EMEA-friendly times

● Github

– https://github.com/ceph/

● Ceph Days

– http://ceph.com/cephdays/

● Meetups

– http://ceph.com/meetups

● Ceph Tech Talks

– http://ceph.com/ceph-tech-talks/

● ‘Ceph’ Youtube channel

– (google it)

● Twitter

– @ceph


THANK YOU!

Sage Weil
CEPH PRINCIPAL ARCHITECT

[email protected]

@liewegas