What Is a Ceph (and Why Do I Care)? OpenStack Storage, Colorado OpenStack Meetup, October 14, 2014


DESCRIPTION

October 2014 overview of the Ceph distributed storage system architecture, integration with OpenStack, and plans for future development.

TRANSCRIPT

14 OCT 2014, Colorado OpenStack Meetup

HISTORICAL TIMELINE

2004: Project starts at UCSC
2006: Open source
2010: Mainline Linux kernel
2011: OpenStack integration
MAY 2012: Launch of Inktank
2012: CloudStack integration
SEPT 2012: Production-ready Ceph
2013: Xen integration
OCT 2013: Inktank Ceph Enterprise launch
FEB 2014: RHEL-OSP certification
APR 2014: Inktank acquired by Red Hat

10 years in the making

OPENSTACK USER SURVEY, 05/2014

[Chart: OpenStack User Survey (May 2014) results broken out by Dev/QA, Proof of Concept, and Production deployments]

A STORAGE REVOLUTION

Traditional storage: proprietary hardware, proprietary software, support & maintenance.
The new model: standard hardware, open source software, enterprise products & services.

[Diagram: the same computers and disks appear in both stacks; only the software and support model changes]

ARCHITECTURE


ARCHITECTURAL COMPONENTS

RADOS: A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
LIBRADOS: A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
RGW: A web services gateway for object storage, compatible with S3 and Swift
RBD: A reliable, fully-distributed block device with cloud platform integration
CEPHFS: A distributed file system with POSIX semantics and scale-out metadata management

[Diagram: APP tiers use LIBRADOS and RGW, HOST/VM tiers use RBD, and CLIENT tiers use CEPHFS, all on top of RADOS]


OBJECT STORAGE DAEMONS

[Diagram: each OSD daemon manages one disk through a local filesystem (btrfs, xfs, ext4, zfs?), alongside a small set of monitors (M)]

RADOS CLUSTER

[Diagram: an application talking directly to a RADOS cluster of storage nodes and five monitors (M)]

RADOS COMPONENTS

OSDs:
• 10s to 10,000s in a cluster
• One per disk (or one per SSD, RAID group…)
• Serve stored objects to clients
• Intelligently peer for replication & recovery

Monitors:
• Maintain cluster membership and state
• Provide consensus for distributed decision-making
• Small, odd number
• Do not serve stored objects to clients
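For orientation, a minimal ceph.conf sketch for a small cluster with three monitors; the fsid, host names, and addresses below are made-up placeholders, and a real deployment tunes far more than this:

    [global]
    fsid = a7f64266-0894-4f1e-a635-d0aeaca0e993       # placeholder cluster UUID
    mon initial members = node1, node2, node3         # a small, odd number of monitors
    mon host = 192.168.0.1, 192.168.0.2, 192.168.0.3  # placeholder addresses
    auth cluster required = cephx
    auth service required = cephx
    auth client required = cephx
    osd pool default size = 3                         # three replicas per object by default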

WHERE DO OBJECTS LIVE?

[Diagram: an application holding an object, unsure which node in the cluster should store it]

A METADATA SERVER?

[Diagram: one option is a central metadata server: the application (1) asks it where the object lives, then (2) talks to the node that stores it]

CALCULATED PLACEMENT

[Diagram: a better option is a placement function F: the application computes an object's location itself, e.g. by static name ranges A-G, H-N, O-T, U-Z]

EVEN BETTER: CRUSH!

[Diagram: objects are hashed into placement groups (PGs), and CRUSH maps each placement group onto nodes in the cluster]

CRUSH IS A QUICK CALCULATION

[Diagram: the client computes an object's placement group and target OSDs itself, then talks straight to the RADOS cluster with no lookup service in the path]

CRUSH: DYNAMIC DATA PLACEMENT

CRUSH: Pseudo-random placement algorithm
• Fast calculation, no lookup
• Repeatable, deterministic
Statistically uniform distribution
Stable mapping
• Limited data migration on change
Rule-based configuration
• Infrastructure topology aware
• Adjustable replication
• Weighting

CRUSH

[Diagram: an object is assigned to a placement group, and the placement group is mapped to OSDs]

hash(object name) % num pg → placement group
CRUSH(pg, cluster state, rule set) → ordered list of OSDs
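As a rough illustration of the first step only (this is not Ceph's actual hash or code), mapping an object name onto one of a fixed number of placement groups might look like the Python below; the object name and PG count are made up:

    import hashlib

    def object_to_pg(object_name, pg_num):
        # Hash the name and fold it into the pool's PG count.
        # Ceph itself uses a different hash (rjenkins) and a "stable mod",
        # but the idea is the same: same name in, same PG out, on every client.
        digest = hashlib.md5(object_name.encode()).hexdigest()
        return int(digest, 16) % pg_num

    # With 128 PGs in the pool, every client computes the same PG for this object:
    print(object_to_pg("vm-disk-1.rbd", 128))

CRUSH then takes that placement group, the current cluster map, and the rule set and deterministically produces the ordered list of OSDs holding it, which is why no lookup service sits in the data path.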



ACCESSING A RADOS CLUSTER

[Diagram: an application links LIBRADOS, which speaks the native socket protocol to the monitors and OSDs of the RADOS cluster to read and write objects]

LIBRADOS: RADOS ACCESS FOR APPS

LIBRADOS:
• Direct access to RADOS for applications
• C, C++, Python, PHP, Java, Erlang
• Direct access to storage nodes
• No HTTP overhead
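A minimal sketch of what LIBRADOS access looks like from Python, assuming the python-rados bindings, a reachable cluster described by the local ceph.conf, and an existing pool named 'data' (the pool and object names are placeholders):

    import rados

    # Connect using the local configuration and keyring
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Open an I/O context on an existing pool
    ioctx = cluster.open_ioctx('data')
    ioctx.write_full('hello-object', b'hello ceph')   # store an object
    print(ioctx.read('hello-object'))                 # read it back

    ioctx.close()
    cluster.shutdown()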


THE RADOS GATEWAY

[Diagram: applications speak REST to RADOSGW instances; each RADOSGW uses LIBRADOS over the native socket protocol to reach the RADOS cluster]

RADOSGW MAKES RADOS WEBBY

RADOSGW:
• REST-based object storage proxy
• Uses RADOS to store objects
• API supports buckets, accounts
• Usage accounting for billing
• Compatible with S3 and Swift applications
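Because RADOSGW speaks the S3 dialect, an ordinary S3 client can be pointed at it. A sketch with the classic boto library; the endpoint, credentials, and bucket name are placeholders:

    import boto
    import boto.s3.connection

    conn = boto.connect_s3(
        aws_access_key_id='ACCESS_KEY',               # placeholder RGW credentials
        aws_secret_access_key='SECRET_KEY',
        host='rgw.example.com',                       # placeholder gateway endpoint
        is_secure=False,
        calling_format=boto.s3.connection.OrdinaryCallingFormat(),
    )

    bucket = conn.create_bucket('my-bucket')          # bucket and objects live in RADOS
    key = bucket.new_key('hello.txt')
    key.set_contents_from_string('hello from radosgw')
    print(key.get_contents_as_string())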


STORING VIRTUAL DISKS

[Diagram: a VM's virtual disk is provided by the hypervisor through LIBRBD, backed by the RADOS cluster]

SEPARATE COMPUTE FROM STORAGE

[Diagram: two hypervisors, each with LIBRBD, share the same RADOS cluster, so a VM's disk is not tied to any single host]

KERNEL MODULE FOR MAX FLEXIBLE!

[Diagram: a Linux host maps an RBD image directly through the KRBD kernel module, no hypervisor required]

RBD STORES VIRTUAL DISKS

RADOS BLOCK DEVICE:
• Storage of disk images in RADOS
• Decouples VMs from host
• Images are striped across the cluster (pool)
• Snapshots
• Copy-on-write clones
• Support in:
  ▪ Mainline Linux kernel (2.6.39+)
  ▪ Qemu/KVM, native Xen coming soon
  ▪ OpenStack, CloudStack, Nebula, Proxmox
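A small sketch of creating an image, a snapshot, and a copy-on-write clone with the python-rbd bindings; pool and image names are made up, and clones need a format 2 image with the layering feature, hence the extra arguments:

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')                 # assumes an 'rbd' pool exists

    rbd_inst = rbd.RBD()
    rbd_inst.create(ioctx, 'vm-base', 10 * 1024**3,   # 10 GiB image
                    old_format=False,
                    features=rbd.RBD_FEATURE_LAYERING)

    image = rbd.Image(ioctx, 'vm-base')
    image.create_snap('golden')                       # point-in-time snapshot
    image.protect_snap('golden')                      # required before cloning
    image.close()

    # Copy-on-write clone of the protected snapshot, e.g. for a new VM
    rbd_inst.clone(ioctx, 'vm-base', 'golden', ioctx, 'vm-clone-1',
                   features=rbd.RBD_FEATURE_LAYERING)

    ioctx.close()
    cluster.shutdown()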

RBD SNAPSHOTS

• Export snapshots to geographically dispersed data centers
  ▪ Institute disaster recovery
• Export incremental snapshots
  ▪ Minimize network bandwidth by only sending changes
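To see what an incremental export would actually carry, the python-rbd bindings expose a diff interface that reports only the extents changed since a given snapshot. This is a sketch (the image and snapshot names come from the previous example); in practice the rbd command line's export-diff / import-diff workflow is the usual tool for shipping increments between sites:

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')

    changed = []

    def record_extent(offset, length, exists):
        # Called once per extent that differs from the 'golden' snapshot
        changed.append((offset, length, exists))

    image = rbd.Image(ioctx, 'vm-base')
    image.diff_iterate(0, image.size(), 'golden', record_extent)
    print("changed extents:", changed)

    image.close()
    ioctx.close()
    cluster.shutdown()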


SEPARATE METADATA SERVER

[Diagram: a Linux host mounts CephFS via the kernel module; metadata operations go through the metadata server while file data flows directly to the RADOS cluster]

SCALABLE METADATA SERVERS

METADATA SERVER:
• Manages metadata for a POSIX-compliant shared filesystem
  ▪ Directory hierarchy
  ▪ File metadata (owner, timestamps, mode, etc.)
• Stores metadata in RADOS
• Does not serve file data to clients
• Only required for shared filesystem

CALAMARI


CALAMARI ARCHITECTURE

[Diagram: the Calamari master runs on an admin node and talks to minion agents installed on every OSD and monitor node in the Ceph storage cluster]

USE CASES

WEB APPLICATION STORAGE

[Diagram: web application app servers speak S3/Swift to Ceph Object Gateways (RGW), which store their objects in the Ceph Storage Cluster (RADOS)]

MULTI-SITE OBJECT STORAGE

[Diagram: two sites, US-EAST and EU-WEST, each with its own web application, Ceph Object Gateway (RGW), and Ceph Storage Cluster]

ARCHIVE / COLD STORAGE

[Diagram: the application writes to a replicated cache pool, which fronts an erasure coded backing pool within the same Ceph Storage Cluster]

ERASURE CODING

[Diagram: in a replicated pool an object is stored as full copies; in an erasure coded pool it is stored as data chunks 1-4 plus coding chunks X and Y]

Replicated pool:
• Full copies of stored objects
• Very high durability
• Quicker recovery

Erasure coded pool:
• One copy plus parity
• Cost-effective durability
• Expensive recovery

ERASURE CODING: HOW DOES IT WORK?

[Diagram: an object is split into data chunks (1, 2, 3, 4) plus coding chunks (X, Y), and each chunk is written to a different OSD in the erasure coded pool]
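To make the trade-off concrete, a quick back-of-the-envelope comparison in Python; the k=4, m=2 profile mirrors the four data chunks plus the X and Y coding chunks in the diagram, and the capacity figure is made up:

    usable_tb = 100.0                      # logical data to store

    # 3x replication: every object is stored in full, three times
    replicated_raw = usable_tb * 3         # 300 TB of raw capacity

    # Erasure coding with k = 4 data chunks and m = 2 coding chunks
    k, m = 4, 2
    ec_raw = usable_tb * (k + m) / k       # 150 TB of raw capacity

    # Both layouts survive the loss of any two OSDs holding the object,
    # but the erasure coded pool pays for the savings with more expensive recovery.
    print(replicated_raw, ec_raw)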

CACHE TIERING

[Diagram: in writeback mode the Ceph client reads and writes against the cache pool, and data is flushed to and promoted from the replicated backing pool behind it]

CACHE TIERING

[Diagram: in read-only mode the client writes directly to the replicated backing pool, while reads are served from the cache pool]

WEBSCALE APPLICATIONS

[Diagram: web application app servers talk the native protocol (LIBRADOS) directly to the Ceph Storage Cluster (RADOS), with no gateway in the path]

ARCHIVE / COLD STORAGE

[Diagram: an application using a replicated cache pool in front of an erasure coded backing pool, with Ceph Storage Clusters at Site A and Site B]

CEPH BLOCK DEVICE (RBD)

DATABASES

[Diagram: MySQL / MariaDB hosts use RBD through the Linux kernel, speaking the native protocol to the Ceph Storage Cluster (RADOS)]

WHAT ABOUT CEPH AND OPENSTACK?

CEPH AND OPENSTACK

[Diagram: OpenStack services sit on top of the RADOS cluster: Keystone and Swift-compatible object storage go through RADOSGW (LIBRADOS), Cinder and Glance use LIBRBD, and Nova hypervisors attach volumes via LIBRBD]
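For orientation, a hedged sketch of the Cinder and Glance settings that point those services at RBD pools; pool names, user names, and the secret UUID are placeholders, and exact option names and sections vary by OpenStack release:

    # cinder.conf (block storage service): illustrative values
    [DEFAULT]
    volume_driver = cinder.volume.drivers.rbd.RBDDriver
    rbd_pool = volumes
    rbd_user = cinder
    rbd_secret_uuid = 457eb676-33da-42ec-9a8c-9293d545c337

    # glance-api.conf (image service): illustrative values
    [DEFAULT]
    default_store = rbd
    rbd_store_pool = images
    rbd_store_user = glance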

OPENSTACK ADDITIONS

Juno:
• Enable cloning for RBD-backed ephemeral disks

Kilo:
• Volume migration from one backend to another
• Implement proper snapshotting for Ceph-based ephemeral disks
• Improve backup in Cinder

Future Ceph Roadmap

CEPH ROADMAP

Upcoming releases: Giant, Hammer, I-Release. Planned items across them include:
• RBD df
• Object versioning
• RBD mirroring
• Calamari
• Alternative web server for RGW
• Object expiration
• Performance improvements (in every release)

NEXT STEPS

WHAT NOW?

Getting Started with Ceph
• Read about the latest version of Ceph: http://ceph.com/docs
• Deploy a test cluster using ceph-deploy: http://ceph.com/qsg
• Deploy a test cluster on the AWS free tier using Juju: http://ceph.com/juju
• Ansible playbooks for Ceph: https://www.github.com/alfredodeza/ceph-ansible

Getting Involved with Ceph
• Most discussion happens on the mailing lists ceph-devel and ceph-users. Join or view archives at http://ceph.com/list
• IRC is a great place to get help (or help others!): #ceph and #ceph-devel. Details and logs at http://ceph.com/irc
• Download the code: http://www.github.com/ceph
• The tracker manages bugs and feature requests. Register and start looking around at http://tracker.ceph.com
• Doc updates and suggestions are always welcome. Learn how to contribute docs at http://ceph.com/docwriting
