build a cloud day - chicago

Post on 28-Nov-2014

332 Views

Category:

Technology

4 Downloads

Preview:

Click to see full reader

DESCRIPTION

 

TRANSCRIPT

SCALING STORAGE WITH CEPH

Ross Turk, Inktank

WHO?

Ross TurkVP Community, Inktank

ross@inktank.com @rossturk

inktank.com | ceph.com

me

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowingapps to directlyaccess RADOS,with support forC, C++, Java,Python, Ruby,and PHP

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

APP APP HOST/VM CLIENT

IN THE BEGINNINGMagic Madzik, Flickr / CC BY 2.0

EARLY INFORMATION STORAGEChico.Ferreira, Flickr / CC BY 2.0

WRITING > CAVE PAINTINGSkevingessner, Flickr / CC BY-SA 2.0

x1000

==x1

PEOPLE BEGIN WRITING A LOTMoyan_Brenn, Flickr / CC BY-ND 2.0

WRITING IS T IME-CONSUMINGtrekkyandy, Flickr / CC BY 2.0

THE INDUSTRIALIZATION OF WRITINGFateDenied, Flickr / CC BY 2.0

x1000

==x1

+magnet =tape magnetic tape

STORAGE BECOMES MECHANICALErik Pitti, Wikipedia / CC BY-ND 2.0

HUMAN

COMPUTER TAPE

HUMAN

ROCK

HUMAN

INK

PAPER

COMPUTERS NEED PEOPLE TO WORKUSDAgov, Flickr / CC BY 2.0

HUMAN

COMPUTER TAPE

11101011 10110110 10110101 10101001 00100100 01001001 10100100 10100101 01011010 01101010 10101010 10101010 01010110 01010011

==

THROUGHPUT BECOMES IMPORTANTZane Luke, Flickr / CC BY-ND 2.0

LAZ0R B3AMS CHANGE EVERYTHING!!Jeff Kubina, Flickr / CC-BY-SA 2.0

HARD DRIVES ARE TOTALLY BET TER

amazing spinny hard drives sucky stupid tapeslow

EVERYTHING GETS MESSYRob!, Flickr / CC BY 2.0

000

aa

acab

ba

111010

bb bc

110

010

111

dc

101

da000

110

001

010

011

db

owner: rturkcreated: aug12

last viewed: aug17size: 42025perms: 644

11101011 10110110 10110101 10101001 00100100 01001001 10100100 10100101 01011010 01101010 10101010 10101010

file

000

aa

acab

ba

111010

bb bc

110

010

111

dc

101

da000

110

001

010

db 0110

WE OUTGROW THE HARD DRIVEMr. T in DC, Flickr / CC BY 2.0

HUMAN COMPUTER DISK

DISK

DISK

DISK

DISK

DISK

DISK

HUMAN

HUMAN

(COMPUTER)

DISK

DISK

DISK

DISK

DISK

DISK

DISK

DISK

DISK

DISK

DISK

DISK

HUMAN

HUMAN

HUMAN

HUMAN

HUMAN

HUMAN

HUMANHUMAN

HUMANHUMAN

HUMAN

HUMANHUMAN

HUMAN

HUMAN

HUMAN

HUMAN

HUMAN

HUMAN

HUMAN

HUMAN

HUMAN (actually more like this…)

DISKCOMPUTE

R

HUMAN

HUMAN

HUMAN

DISKCOMPUTE

RDISK

COMPUTER

DISKCOMPUTE

RDISK

COMPUTER

DISKCOMPUTE

RDISK

COMPUTER

DISKCOMPUTE

RDISK

COMPUTER

DISKCOMPUTE

RDISK

COMPUTER

DISKCOMPUTE

R

000

aa

acab

ba

111010

bb bc

110

010

111

dc

101

da000

110

001

010

011

dbX

pace: quickdriver: frog

license: expiredexpression: agog11101011 10110110

10110101 10101001 00100100 01001001 10100100 10100101 01011010 01101010 10101010 10101010

object

DISKCOMPUTE

R

APP

DISKCOMPUTE

RDISK

COMPUTER

DISKCOMPUTE

RDISK

COMPUTER

DISKCOMPUTE

RDISK

COMPUTER

DISKCOMPUTE

RDISK

COMPUTER

DISKCOMPUTE

RDISK

COMPUTER

DISKCOMPUTE

R

DISK

COMPUTER

DISKCOMPUTE

RDISK

COMPUTER

DISKCOMPUTE

RDISK

COMPUTER

DISKCOMPUTE

RDISK

COMPUTER

DISKCOMPUTE

RDISK

COMPUTER

DISKCOMPUTE

RDISK

COMPUTER

DISKCOMPUTE

R

COMPUTER

DISK

DISK

DISKCOMPUTE

RDISK

COMPUTER

DISKCOMPUTE

RDISK

COMPUTER

DISKCOMPUTE

RDISK

COMPUTER

DISKCOMPUTE

RDISK

COMPUTER

DISKCOMPUTE

RDISK

COMPUTER

DISKCOMPUTE

R

COMPUTER

VM

VM

VM

STOR AG E TH ROUG H OUT H I STORYTime-scale: Roughly logarithmic. Content: Whatever the opposite of “scientific” is.

Writing

Computers

Shared storage

Distributed storage

Cloud computing

Ceph

Painting

DISKCOMPUTE

R

HUMAN

HUMAN

HUMAN

DISKCOMPUTE

RDISK

COMPUTER

DISKCOMPUTE

RDISK

COMPUTER

DISKCOMPUTE

RDISK

COMPUTER

DISKCOMPUTE

RDISK

COMPUTER

DISKCOMPUTE

RDISK

COMPUTER

DISKCOMPUTE

R

DISKCOMPUTE

RDISK

COMPUTER

DISKCOMPUTE

RDISK

COMPUTER

DISKCOMPUTE

RDISK

COMPUTER

DISKCOMPUTE

RDISK

COMPUTER

DISKCOMPUTE

RDISK

COMPUTER

DISKCOMPUTE

RDISK

COMPUTER

DC

DC

DC

DC

DC

DC

DC

DC

DC

DC

DC

DC

HUMAN

HUMAN

HUMAN

DC

DC

DC

DC

DC

DC

DC

DC

DC

DC

DC

DC

STORAGE APPLIANCESMichael Moll, Wikipedia / CC BY-SA 2.0

6.4 MILL ION SQFT OF FACTORIESDude94111, Flickr / CC BY 2.0

TECHNOLOGY IS A COMMODITYRaeAllen, Flickr / CC-BY 2.0

COMMODITY PRICES FLUCTUATE

May-07 May-08 May-09 May-10 May-11 May-12

Hardware Appliances are Mysterious Black Boxes Abode of Chaos, Flickr / CC BY 2.0

DC

DC

DC

DC

DC

DC

DC

DC

DC

DC

DC

DC

HUMAN[DEVELOPER]

!!

DC

DC

DC

DC

D

C

DC

DC

DC

DC

DC

DC

DC

C++

DC

DC

DC

DC

D

C

DC

DC

DC

DC

DC

DC

DC

C++X

THE WORLDNEEDS

AN OPEN STORAGE TECHNOLOGY

THATSCALES

SAGE WEIL

Co-founder of DreamHost

Inventor of Ceph

CEO of Inktank

OPEN SOURCE

philosophy design

OPEN SOURCE SPREADS IDEASorchidgalore, Flickr / CC BY 2.0

OPEN SOURCE

COMMUNITY-FOCUSED

philosophy design

WE ARE SMARTER TOGETHERrturk, Linkedin Inmap

CEPH BELONGS TO ALL OF USwackybadger, Flickr / CC BY 2.0

OPEN SOURCE

COMMUNITY-FOCUSED

SCALABLE

philosophy design

CEPH IS BUILT TO SCALE

Too much for a book

Too much for a drive

Too much for a computer

Too much for a room

Ceph

Too much for a cave

OPEN SOURCE

COMMUNITY-FOCUSED

SCALABLE

NO SINGLE POINT OF FAILURE

philosophy design

ARILOMAX CALIFORNICUSaroid, Flickr / CC BY 2.0

THE OCTOPUS (A METAPHOR)I love speaking in metaphors.

single pointof failure

highly-availablereplicated

THE BEEHIVE (ANOTHER METAPHOR)blumenbiene, Flickr / CC BY 2.0

OPEN SOURCE

COMMUNITY-FOCUSED

SCALABLE

NO SINGLE POINT OF FAILURE

SOFTWARE BASED

philosophy design

DC

DC

DC

DC

D

C

DC

DC

DC

DC

DC

DC

DC

C++

DC

DC

DC

DC

D

C

DC

DC

DC

DC

DC

DC

DC

C++✔

OPEN SOURCE

COMMUNITY-FOCUSED

SCALABLE

NO SINGLE POINT OF FAILURE

SOFTWARE BASEDSELF-

MANAGING

philosophy design

DISKS = JUST T INY RECORD PLAYERSjon_a_ross, Flickr / CC BY 2.0

D

55 times / day

=D

DD

x 1 MILLION

DD

DD

IT ALL STARTED WITH A DREAM

+

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowingapps to directlyaccess RADOS,with support forC, C++, Java,Python, Ruby,and PHP

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

APP APP HOST/VM CLIENT

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowingapps to directlyaccess RADOS,with support forC, C++, Java,Python, Ruby,and PHP

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

APP APP HOST/VM CLIENT

DISK

FS

DISK DISK

OSD

DISK DISK

OSD OSD OSD OSD

FS FS FSFS btrfsxfsext4

MMM

M

M

M

HUMAN

Monitors: Maintain cluster map Provide consensus for

distributed decision-making Must have an odd number These do not serve stored

objects to clients

MOSDs: One per disk

(recommended) At least three in a cluster Serve stored objects to

clients Intelligently peer to perform

replication tasks Supports object classes

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowingapps to directlyaccess RADOS,with support forC, C++, Java,Python, Ruby,and PHP

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

APP APP HOST/VM CLIENT

LIBRADOS

M

M

M

APP

native

LLIBRADOS Provides direct access

to RADOS for applications

C, C++, Python, PHP, Java

No HTTP overhead

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowingapps to directlyaccess RADOS,with support forC, C++, Java,Python, Ruby,and PHP

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

APP APP HOST/VM CLIENT

M

M

M

native

REST

APP

LIBRADOS

RADOSGWLIBRADOS

RADOSGW

APP

RADOS Gateway: REST-based interface

to RADOS Supports buckets,

accounting Compatible with S3

and Swift applications

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowingapps to directlyaccess RADOS,with support forC, C++, Java,Python, Ruby,and PHP

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

APP APP HOST/VM CLIENT

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

M

M

M

VM

LIBRADOS

LIBRBD

VIRTUALIZATION CONTAINER

LIBRADOS

M

M

M

LIBRBD

CONTAINER

LIBRADOS

LIBRBD

CONTAINERVM

LIBRADOS

M

M

M

KRBD (KERNEL MODULE)

HOST

RADOS Block Device: Storage of virtual disks in

RADOS Allows decoupling of VMs

and containers Live migration!

Images are striped across the cluster

Boot support in QEMU, KVM, and OpenStack Nova

Mount support in the Linux kernel

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowingapps to directlyaccess RADOS,with support forC, C++, Java,Python, Ruby,and PHP

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

APP APP HOST/VM CLIENT

M

M

M

CLIENT

0110

datametadata

Metadata Server Manages metadata for a

POSIX-compliant shared filesystem Directory hierarchy File metadata (owner,

timestamps, mode, etc.) Stores metadata in RADOS Does not serve file data to

clients Only required for shared

filesystem

WHAT MAKES CEPH

UNIQUE?

HOW DO YOU F IND YOUR KEYS?azmeen, Flickr / CC BY 2.0

APP??

DC

DC

DC

DC

DC

DC

DC

DC

DC

DC

DC

DC

APP

DC

DC

DC

DC

DC

DC

DC

DC

DC

DC

DC

DC

A-G

H-N

O-T

U-Z

F*

I ALWAYS PUT MY KEYS ON THE HOOKvitamindave, Flickr / CC BY 2.0

APP

DC

DC

DC

DC

DC

DC

DC

DC

DC

DC

DC

DC

DEAR DIARY: KEYS = IN THE KITCHENBarnaby, Flickr / CC BY 2.0

HOW DO YOUFIND YOUR KEYS

WHEN YOUR HOUSEIS

INFINITELY BIGAND

ALWAYS CHANGING?

THE ANSWER: CRUSH!!pasukaru76, Flickr / CC SA 2.0

10 10 01 01 10 10 01 11 01 10

10 10 01 01 10 10 01 11 01 10

hash(object name) % num pg

CRUSH(pg, cluster state, rule set)

10 10 01 01 10 10 01 11 01 10

10 10 01 01 10 10 01 11 01 10

CRUSH Pseudo-random

placement algorithm Ensures even distribution Repeatable, deterministic Rule-based configuration

Replica count Infrastructure topology Weighting

CLIENT

??

CLIENT

??

LIBRADOS

M

M

M

VM

LIBRBD

VIRTUALIZATION CONTAINER

HOW DO YOUSPIN UP

THOUSANDS OF VMsINSTANTLY

ANDEFFICIENTLY?

144 0 0 0 0

instant copy

= 144

4144

CLIENT

write

write

write

= 148

write

4144

CLIENTread

read

read

= 148

HOW DO YOUMANAGE

DIRECTORY HEIRARCHYWITHOUT

ASINGLE POINT OF

FAILURE?

FILESYSTEMS REQUIRE METADATABarnaby, Flickr / CC BY 2.0

M

M

M

CLIENT

0110

M

M

M

one tree

three metadata servers

??

DYNAMIC SUBTREE PARTITIONING

AND NOWBACKPEDALING

ALMOSTEVERYTHING

WORKS

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowingapps to directlyaccess RADOS,with support forC, C++, Java,Python, Ruby,and PHP

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

APP APP HOST/VM CLIENT

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

NEARLYAWESOME

AWESOMEAWESOME

AWESOME

AWESOME

LAN SCALE!!*

*OR REALLY REALLY SCARY FAST WAN

CEPH AND CLOUDSTACKtableatny, Flickr / CC BY 2.0

RBD SUPPORT IN CLOUDSTACK

Allows storage of virtual disks inside RADOS Works with KVM only right now No snapshots yet

Upcoming in CloudStack 4 More information can be found on the mailing list:

ceph-devel / incubator-cloudstack-dev: http://article.gmane.org/gmane.comp.file-systems.ceph.devel/7505

QUESTIONS?

Ross TurkVP Community, Inktank

ross@inktank.com @rossturk

inktank.com | ceph.com

top related