gigaom structure sf - june 2012

Post on 28-Nov-2014

376 Views

Category:

Technology

3 Downloads

Preview:

Click to see full reader

DESCRIPTION

Our VP, Community Ross Turk's slides from his Structure talk in July 2012.

TRANSCRIPT

Inktank Delivering the Future of Storage

ME ME ME ME ME ME.

2

I made a slide today. It’s all about me.

ME ME ME ME ME ME.

3

I made a slide today. It’s all about me.

Ross Turk VP Community, Inktank ross@inktank.com @rossturk inktank.com | ceph.com

4

5

RADOS A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW A bucket-based REST gateway, compatible with S3 and Swift

APP APP HOST/VM CLIENT

Let’s Start With a Good, Old-Fashioned Origin Story JD Hancock, Flickr / CC BY 2.0 6

The Evolution of Storage A brief history of information storage technology

7

Cave Paintings: The Earliest Form (maybe) of Information Storage Chico.Ferreira, Flickr / CC BY 2.0 8

Technology Review: Cave Painting The good •  Low cost per smudge • Multitouch

The bad •  Limited storage capacity

•  10 caveman ideas per wall

• No support for CIFS

9

10

x1000

== x1

= + HUMAN WRITING

Technology Review: Books and Libraries The good •  Cost per scroll is high

•  Can be eased w/slave labor

The bad • No automatic replication

•  Must complete backups before Caesar’s invasion of Egypt!

11

Books (Strahov, Prague Library) Moyan_Brenn, Flickr / CC BY-ND 2.0 12

Printing Press FateDenied, Flickr / CC BY 2.0 13

14

x1000

== x1

+ magnet = tape magnetic tape

IBM System 360 Tape Drives Erik Pitti, Wikipedia / CC BY-ND 2.0

15

16

HUMAN COMPUTER TAPE

HUMAN ROCK

HUMAN

INK

PAPER

17

11101011 10110110 10110101 10101001 00100100 01001001 10100100 10100101 01011010 01101010 10101010 10101010 01010110 01010011

==

Tape Is Stupid Mrs. Gemstone, Flickr / CC BY-SA 2.0 18

Computers Need Programmers (and Operators) USDAgov, Flickr / CC BY 2.0 19

20

HUMAN COMPUTER TAPE

Throughput Becomes Important rfduck, Flickr / CC BY-ND 2.0 21

Hard Drive Jeff Kubina, Flickr / CC-BY-SA 2.0 22

Hard Drives Are Totally Better

23

amazing spinny hard drives sucky stupid tape

24

000

aa

ac ab

ba

111010

bb bc

110

010 111

dc

101

da 000

110 001

010 011 db

25

owner: rturk created: aug12

last viewed: aug17 size: 42025 perms: 644 11101011 10110110 10110101

10101001 00100100 01001001 10100100 10100101 01011010 01101010 10101010 10101010

file

26

000

aa

ac ab

ba

111010

bb bc

110

010 111

dc

101

da 000

110 001

010 db 01 10

Humanity Outgrows the Hard Drive Mr. T in DC, Flickr / CC BY 2.0 27

28

HUMAN COMPUTER

DISK

DISK

DISK

DISK

DISK

DISK

DISK

What Happens When Two HUMANs Need Access to the Same Resource? wFourier, Flickr / CC BY 2.0 29

30

HUMAN COMPUTER

DISK

DISK

DISK

DISK

DISK

DISK

DISK

HUMAN

HUMAN

31

(COMPUTER)

DISK

DISK

DISK

DISK

DISK

DISK

DISK

DISK

DISK

DISK

DISK

DISK

HUMAN

HUMAN

HUMAN

HUMAN HUMAN

HUMAN

HUMAN HUMAN

HUMAN HUMAN

HUMAN

HUMAN HUMAN

HUMAN

HUMAN

HUMAN

HUMAN

HUMAN

HUMAN

HUMAN

HUMAN

HUMAN (actually more like this…)

32

DISK COMPUTER

HUMAN

HUMAN

HUMAN

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

33

texture: crunch flavor: smoke, salt

nutrition: none color: bacon

11101011 10110110 10110101 10101001 00100100 01001001 10100100 10100101 01011010 01101010 10101010 10101010

object

34

000

aa

ac ab

ba

111010

bb bc

110

010 111

dc

101

da 000

110 001

010 011 db X

35

DISK COMPUTER

APP

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

36

DISK

COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

COMPUTER

DISK

37

DISK

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

COMPUTER

VM

VM

VM

The Current State of Storage How people store information today, and why it’s still not perfect yet

38

How Much Store Things All Human History!! Time-scale: Roughly logarithmic. Content: Whatever the opposite of “scientific” is. 39

Writing

Computers

Shared storage

Distributed storage

Cloud computing

Ceph

Painting

40

DISK COMPUTER

HUMAN

HUMAN

HUMAN

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

41

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

DISK COMPUTER

42

D C

D C

D C

D C

D C

D C

D C

D C

D C

D C

D C

D C

43

HUMAN

HUMAN

HUMAN

D C

D C

D C

D C

D C

D C

D C

D C

D C

D C

D C

D C

Storage Hardware Michael Moll, Wikipedia / CC BY-SA 2.0 44

6.4 Million Square Feet of Expensive Factory Buildings Dude94111, Flickr / CC BY 2.0 45

Storage Hardware Vendors Have Bills to Pay CarbonNYC, Flickr / CC BY 2.0 46

…Which Means That Customers Do Too 401K 2012, Flickr / CC BY-SA 2.0 47

Technology Is Becoming a Commodity RaeAllen, Flickr / CC-BY 2.0 48

Commodity Prices Fluctuate

49

May-07 May-08 May-09 May-10 May-11 May-12

Growing With Hardware Appliances First PB •  Proprietary

storage hardware • Well-known

storage vendor $14 b’zillion

Second PB •  Proprietary storage

hardware •  Same storage

vendor Another $14 b’zillion

50

DC

DC

DC

DC

DC

DC

DC

DC

DC

DC

DC

DC

DC

DC

DC

DC

DC

DC

DC

DC

DC

DC

DC

DC

Dedicated Hardware Appliances Are OLD TECHNOLOGY Paul Keller, Flickr / CC BY 2.0 51

52 Source: http://www.cpubenchmark.net/high_end_cpus.html

53

FLAGSHIP PRODUCT

“I'm sick of paying for hardware with a three-year-old proc in it!” Mel B., Flickr / CC BY 2.0 54

Hardware Appliances are Mysterious Black Boxes Abode of Chaos, Flickr / CC BY 2.0 55

56

DC

DC

DC

DC

D

C

DC

DC

DC

DC

DC

DC

DC

C++

57

DC

DC

DC

DC

D

C

DC

DC

DC

DC

DC

DC

DC

C++ X

58

DC

DC

DC

DC

DC

DC

DC

DC

DC

DC

DC

DC

HUMAN [DEVELOPER]

!!

Give More Money To The Big Proprietary Vendors It will make them very, very happy. 59

Storage Should Be Better

People need storage solutions that… • …are open • …are easy to manage • …satisfy their requirements

•  performance •  functional •  financial

60

The Birth of a New Storage Solution We think our roots are showing

61

DreamHost 62

Sage Weil Co-founder of DreamHost Inventor of Ceph CEO of Inktank

63

DreamHost DreamHost is staffed by extraordinarily hip people 64

65

+

New Monthly Code Commits

66

0

100

200

300

400

500

600

700

2004-06 2005-07 2006-07 2007-07 2008-07 2009-07 2010-07 2011-07

Ceph Starts Popping Up

67

68

OPEN SOURCE

philosophy design

Open Source is the Best Way to Spread Ideas orchidgalore, Flickr / CC BY 2.0 69

70

OPEN SOURCE

COMMUNITY-FOCUSED

philosophy design

All of Us Are Smarter Than Some of Us rturk, Linkedin Inmap 71

72

OPEN SOURCE

COMMUNITY-FOCUSED

SCALABLE

philosophy design

Ceph is Built to Scale Time-scale: Roughly logarithmic. Content: Whatever the opposite of “scientific” is. 73

Too much for a book

Too much for a drive

Too much for a computer

Too much for a room

Ceph

Too much for a cave

74

OPEN SOURCE

COMMUNITY-FOCUSED

SCALABLE

NO SINGLE POINT OF FAILURE

philosophy design

Ariolimax Californicus aroid, Flickr / CC BY 2.0 75

The Octopus (A Metaphor) I love speaking in metaphors. 76

single point of failure

replicated replicated

The Beehive (A Better Metaphor) blumenbiene, Flickr / CC BY 2.0 77

78

OPEN SOURCE

COMMUNITY-FOCUSED

SCALABLE

NO SINGLE POINT OF FAILURE

SOFTWARE BASED

philosophy design

79

DC

DC

DC

DC

D

C

DC

DC

DC

DC

DC

DC

DC

C++

80

DC

DC

DC

DC

D

C

DC

DC

DC

DC

DC

DC

DC

C++ ✔

81

OPEN SOURCE

COMMUNITY-FOCUSED

SCALABLE

NO SINGLE POINT OF FAILURE

SOFTWARE BASED SELF-

MANAGING

philosophy design

Hard Drives Are Tiny Record Players and They Fail Often jon_a_ross, Flickr / CC BY 2.0 82

83

D

55 times / day

= D

D D

x 1 MILLION

D D

D D

Enter: Ceph An architectural and functional overview of the Ceph system

84

85

RADOS A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW A bucket-based REST gateway, compatible with S3 and Swift

APP APP HOST/VM CLIENT

86

RADOS A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW A bucket-based REST gateway, compatible with S3 and Swift

APP APP HOST/VM CLIENT

87

DISK

FS

DISK DISK

OSD

DISK DISK

OSD OSD OSD OSD

FS FS FS FS btrfs xfs ext4

M M M

88

M

M

M

HUMAN

89

Monitors: • Maintain cluster map •  Provide consensus for

distributed decision-making • Must have an odd number •  These do not serve stored

objects to clients

M

OSDs: • One per disk (recommended) •  At least three in a cluster •  Serve stored objects to

clients •  Intelligently peer to perform

replication tasks •  Supports object classes

90

RADOS A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW A bucket-based REST gateway, compatible with S3 and Swift

APP APP HOST/VM CLIENT

LIBRADOS

M

M

M

91

APP

socket

L

92

LIBRADOS •  Provides direct access to

RADOS for applications •  C, C++, Python, PHP, Java • No HTTP overhead

93

RADOS A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW A bucket-based REST gateway, compatible with S3 and Swift

APP APP HOST/VM CLIENT

94

M

M

M

LIBRADOS RADOSGW

APP

socket

REST

95

RADOS Gateway: •  REST-based interface to

RADOS •  Supports buckets,

accounting •  Compatible with S3 and

Swift applications

96

RADOS A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

CEPH FS A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW A bucket-based REST gateway, compatible with S3 and Swift

APP APP HOST/VM CLIENT

RBD A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

97

M

M

M

VM

LIBRADOS LIBRBD

VIRTUALIZATION CONTAINER

LIBRADOS

98

M

M

M

LIBRBD CONTAINER

LIBRADOS LIBRBD

CONTAINER VM

LIBRADOS

99

M

M

M

KRBD (KERNEL MODULE) HOST

100

RADOS Block Device: •  Storage of virtual disks in

RADOS •  Allows decoupling of VMs and

containers •  Live migration!

•  Images are striped across the cluster

•  Boot support in QEMU, KVM, and OpenStack Nova

•  Mount support in the Linux kernel

101

RADOS A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW A bucket-based REST gateway, compatible with S3 and Swift

APP APP HOST/VM CLIENT

102

M

M

M

CLIENT

01 10

data metadata

103

Metadata Server • Manages metadata for a

POSIX-compliant shared filesystem •  Directory hierarchy •  File metadata (owner,

timestamps, mode, etc.) •  Stores metadata in RADOS •  Does not serve file data to

clients • Only required for shared

filesystem

What Makes Ceph Unique? Part one: CRUSH

104

105

APP ??

D C

D C

D C

D C

D C

D C

D C

D C

D C

D C

D C

D C

106

APP

D C

D C

D C

D C

D C

D C

D C

D C

D C

D C

D C

D C

How Long Did It Take You To Find Your Keys This Morning? azmeen, Flickr / CC BY 2.0 107

Dear Diary: Today I Put My Keys on the Kitchen Counter Barnaby, Flickr / CC BY 2.0 108

109

APP

D C

D C

D C

D C

D C

D C

D C

D C

D C

D C

D C

D C

A-G

H-N

O-T

U-Z

F*

I Always Put My Keys on the Hook By the Door vitamindave, Flickr / CC BY 2.0 110

HOW DO YOU FIND YOUR KEYS

WHEN YOUR HOUSE IS

INFINITELY BIG AND

ALWAYS CHANGING?

111

The Answer: CRUSH!!!!! pasukaru76, Flickr / CC SA 2.0 112

113

10 10 01 01 10 10 01 11 01 10

10 10 01 01 10 10 01 11 01 10

hash(object name) % num pg

CRUSH(pg, cluster state, rule set)

114

10 10 01 01 10 10 01 11 01 10

10 10 01 01 10 10 01 11 01 10

115

CRUSH •  Pseudo-random placement

algorithm •  Ensures even distribution •  Repeatable, deterministic •  Rule-based configuration

•  Replica count •  Infrastructure topology •  Weighting

116

CLIENT

??

117

118

119

CLIENT

??

What Makes Ceph Unique Part two: thin provisioning

120

LIBRADOS

121

M

M

M

VM

LIBRBD VIRTUALIZATION CONTAINER

HOW DO YOU SPIN UP

THOUSANDS OF VMs INSTANTLY

AND EFFICIENTLY?

122

144

123

0 0 0 0

instant copy

= 144

4 144

124

CLIENT

write

write

write

= 148

write

4 144

125

CLIENT read

read

read

= 148

What Makes Ceph Unique? Part three: clustered metadata

126

Metadata for a POSIX-Compliant Filesystem Barnaby, Flickr / CC BY 2.0 127

128

M

M

M

CLIENT

01 10

129

M

M

M

130

one tree

three metadata servers

??

131

132

133

134

135

DYNAMIC SUBTREE PARTITIONING

And Now: Backpedaling

136

137

ALMOST EVERYTHING

WORKS

138

RADOS A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

RADOSGW A bucket-based REST gateway, compatible with S3 and Swift

APP APP HOST/VM CLIENT

CEPH FS A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

NEARLY AWESOME

AWESOME AWESOME

AWESOME

AWESOME

139

LAN SCALE!! *

* OR REALLY REALLY SCARY FAST WAN

What is Inktank? I really like your polo shirt, please tell me what it means!

140

141

Who? •  Ceph’s inventor and (most) developers

142

Why? •  To ensure the long-term success of Ceph

•  To help companies adopt Ceph through services, support, training, and consulting

143

When? •  Founded: December 28, 2011

•  Brand Launched: April 2012

144

What do we want from you?? •  Try Ceph! Tell us what you think. Ask if

you need help. Help others if you can!

•  Are you a company? Consider dedicating dev resources to the project.

145

Questions?

146

Ross Turk VP Community, Inktank ross@inktank.com @rossturk inktank.com | ceph.com

top related