
Ghislain Fourny

Big Data Fall 2019

3. Object Storage

satina / 123RF Stock Photo

2

Where are we?

Last lecture:

Reminder on

relational databases

3

Poll

https://eduapp-app1.ethz.ch/

Go now to:

or install EduApp 2.x

4

Where are we?

Relational databases

fit on a

single machine

5

Where are we?

Petabytes

do not fit

on a

single machine

6

The lecture journey

Monolithic

Relational

Database

Modular

"Big Data"

Technology

Stack

7

Not reinventing the wheel

99% of what we learned

with 48 years of

SQL and relational

can be reused

8

Important take-away points

Relational algebra:

Selection

Projection

Grouping

Sorting

Joining

9

Important take-away points

Language

SQL

Declarative languages

Functional languages

Optimizations

Query plans

Indices

10

Important take-away points

What a table is made of

Table

Rows

Columns

Primary key

11

Important take-away points

Consistency constraints

Tabular integrity

Domain integrity

Atomic integrity (1st normal form)

Boyce-Codd normal form

SQL

12

Important take-away points

Consistency constraints

Tabular integrity

Domain integrity

Atomic integrity (1st normal form)

2nd, 3rd, Boyce-Codd normal form

NEW

NEW

SQL

NoSQL

Heterogeneous data

Nested data

Denormalized data

NEW

13

Important take-away points

Transactions

Atomicity

Consistency

Isolation

Durability

ACID

14

Important take-away points

Transactions

Atomicity

Consistency

Isolation

Durability

Atomic Consistency

Availability

Partition tolerance

Eventual Consistency

NEW

NEW

NEW

NEW

CAP

ACID

15

The stack

16

The stack

Storage

Encoding

Syntax

Data models

Validation

Processing

Indexing

Data stores

User interfaces

Querying

17

The stack: Storage

Storage

Local filesystem

NFS

GFS

HDFS

S3

Azure Blob Storage

18

The stack:

Encoding

Encoding

ASCII

ISO-8859-1

UTF-8

BSON

19

The stack: Syntax

Syntax

Text

CSV

XML

JSON

RDF/XML

Turtle

XBRL

20

The stack:

Data models

Data models

Tables: Relational model

Trees: XML Infoset, XDM

Graphs: RDF

Cubes: OLAP

21

The stack:

Validation

Validation

XML Schema

JSON Schema

Relational schemas

XBRL taxonomies

22

The stack:

Processing

Processing

Two-phase processing:

MapReduce

DAG-driven processing:

Tez, Spark, Flink, Ray

Elastic computing:

EC2

23

The stack:

Indexing

Indexing

Key-value stores

Hash indices

B-Trees

Geographical indices

Spatial indices

24

The stack:

Data stores

Data stores

RDBMS

(Oracle/IBM/Microsoft)

MongoDB

CouchBase

ElasticSearch

Hive

HBase

MarkLogic

Cassandra

...

25

The stack: Querying

Querying

SQL

XQuery

JSONiq

N1QL

MDX

SPARQL

REST APIs

26

The stack: User interfaces (UI)

User interfaces

Excel

Access

Tableau

Qlikview

BI tools

27

The stack: Storage

Storage: Today!

28

Storage: from a single machine to a cluster

boscorelli / 123RF Stock Photo

29

Storage

Data needs to be stored somewhere

Database

Storage

30

Let's start from the 70s...

Vitaly Korovin / 123RF Stock Photo

31

File storage


Files

organized

in a

hierarchy

32

What is a file made of?

Content

Metadata

File

+

33

File Metadata

$ ls -l

total 48
drwxr-xr-x   5 gfourny  staff   170 Jul 29 08:11 2009
drwxr-xr-x  16 gfourny  staff   544 Aug 19 14:02 Exercises
drwxr-xr-x  11 gfourny  staff   374 Aug 19 14:02 Learning Objectives
drwxr-xr-x  18 gfourny  staff   612 Aug 19 14:52 Lectures
-rw-r--r--   1 gfourny  staff  1788 Aug 19 14:04 README.md

Fixed "schema"

34

File Content: Block storage

Files' content stored in blocks

1 2 3
4 5 6
7 8

35

Local storage

Local Machine

36

Local storage

Local Machine

LAN (NAS)

LAN = local-area network

NAS = network-attached storage

37

Local storage

Local Machine

LAN (NAS)

WAN

LAN = local-area network

NAS = network-attached storage

WAN = wide-area network

38

Scaling Issues

Aleksandr Elesin / 123RF Stock Photo

1,000 files

39

Scaling Issues

Aleksandr Elesin / 123RF Stock Photo

1,000 files

1,000,000 files

40

Scaling Issues

Aleksandr Elesin / 123RF Stock Photo

1,000 files

1,000,000 files

1,000,000,000 files

41

Poll

https://eduapp-app1.ethz.ch/

Go now to:

or install EduApp 2.x

42

Analogy: Tesla batteries (up to 2017)

43

Better performance: Explicit Block Storage

1 2 3

4 5 6

7 8

Application

(Control over locality of blocks)

44

So how do we make this scale?


1. We throw away

the hierarchy!

45

So how do we make this scale?

2. We make

metadata flexible

46

So how do we make this scale?

3. We make the data model trivial

ID

47

So how do we make this scale?

4. We use commodity hardware

48

... and we get Object Storage

49

"Black-box" objects

... and we get Object Storage

50

"Black-box" objects

Flat and global key-value model

... and we get Object Storage

51

"Black-box" objects

Flat and global key-value model

Flexible metadata

... and we get Object Storage

52

"Black-box" objects

Flat and global key-value model

Flexible metadata

Commodity hardware

... and we get Object Storage

53

Scale

boscorelli / 123RF Stock Photo

54

One machine's not good enough. How do we scale?

55

Approach 1: scaling up

56

Approach 1: scaling up

57

Approach 2: scaling out

58

Approach 2: scaling out

59

Approach 2: scaling out

60

Approach 2: scaling out

61

Hardware price comparison

Scale out

Scale up

62

Approach 3: be smart

Viktorija Reuta / 123RF Stock Photo

63

Approach 3: be smart

Viktorija Reuta / 123RF Stock Photo

“You can have a second computer once you’ve shown you know how to use the first one.”

Paul Barham

64

In this lecture

Approach 2

Scale out

65

Data centers

boscorelli / 123RF Stock Photo

66

Poll

https://eduapp-app1.ethz.ch/

Go now to:

or install EduApp 2.x

67

Numbers - computing

1,000-100,000 machines in a data center

1-100 cores per server

68

Numbers - storage

1-16 TB local storage

per server

16 GB-6 TB of RAM per server

69

Numbers - network

1-100 GB/s network bandwidth

for a server

70

Racks

Height in "rack units"

(e.g., 42 RU)

71

Racks

Modular:

- servers

- storage

- routers

- ...

72

Rack servers

Lenovo ThinkServer RD430 Rack Server

1-4 RU

73

Amazon S3

74

S3 Model

75

S3 Model

76

S3 Model

Bucket ID

77

S3 Model

Bucket ID

78

S3 Model

Bucket ID

+

Object ID

79

Scalability

Max. 5 TB

80

Scalability

100 buckets/account (more upon request)

81

Durability

99.999999999% (loss of 1 in 10^11 objects per year)

82

Availability

99.99% (down about 1 h/year)

83

More about SLAs

99%

99.9%

99.99%

84

More about SLA

SLA Outage

99% 4 days/year

99.9% 9 hours/year

99.99% 53 minutes/year

99.999% 6 minutes/year

99.9999% 32 seconds/year

99.99999% 4 seconds/year
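
As a sanity check, the outage figures above follow directly from the availability percentage, assuming a 365-day year and that availability means uptime fraction; a minimal Python sketch (variable names are made up):

    # Allowed downtime per year = (1 - availability) * one year.
    SECONDS_PER_YEAR = 365 * 24 * 3600

    for availability in [0.99, 0.999, 0.9999, 0.99999, 0.999999, 0.9999999]:
        downtime_s = (1 - availability) * SECONDS_PER_YEAR
        print(f"{availability:.5%} -> {downtime_s / 3600:8.2f} h/year"
              f"  ({downtime_s:10.1f} s/year)")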

85

More about SLA

Amazon's approach:

Response time < 10 ms

in 99.9% of the cases

(rather than average or median)

86

API

REST

SOAP

Driver (JDBC, ...)

87

The new era: the CAP theorem

Yet

another

impossibility

triangle

88

The new era: the CAP theorem

Consistency

Availability Partition tolerance

89

(Atomic) Consistency

All nodes see the same data.

90

Availability

It is possible to query the database at all times.

91

Partition tolerance

The database continues to function even if

the network gets partitioned.

92

Consistency + Partition tolerance

93

Consistency + Partition tolerance

Update

94

Consistency + Partition tolerance

Propagation

95

Consistency + Partition tolerance

Not available until the network is connected again!

96

REST APIs

97

HTTP protocol

Sir Tim Berners-Lee

Version RFC

1.0 RFC 1945

1.1 RFC 2616, RFC 7230

2.0 RFC 7540

98

Resources

Resource (URI)

http://www.ethz.ch/

http://www.mywebsite.ch/api/collection/foo/object/bar

urn:isbn:0123456789

mailto:[email protected]

99

Resources

Resource (URL)

http://www.ethz.ch/

http://www.mywebsite.ch/api/collection/foo/object/bar

100

Resources

Resource (URN)

urn:isbn:0123456789

mailto:[email protected]

101

Resources

Resource (URI)

http://www.mywebsite.ch/api/collection/foo/object/bar?id=foobar#head

102

Resources

Resource (URI)

http://www.mywebsite.ch/api/collection/foo/object/bar?id=foobar#head

scheme

103

Resources

Resource (URI)

http://www.mywebsite.ch/api/collection/foo/object/bar?id=foobar#head

authority

104

Resources

Resource (URI)

http://www.mywebsite.ch/api/collection/foo/object/bar?id=foobar#head

path

105

Resources

Resource (URI)

http://www.mywebsite.ch/api/collection/foo/object/bar?id=foobar#head

query

106

Resources

Resource (URI)

http://www.mywebsite.ch/api/collection/foo/object/bar?id=foobar#head

fragment
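
To make the decomposition concrete, here is a small sketch using Python's standard library to split the example URI into exactly these components (urlparse calls the authority "netloc"):

    from urllib.parse import urlparse

    # Split the example URI into the components named above.
    uri = "http://www.mywebsite.ch/api/collection/foo/object/bar?id=foobar#head"
    parts = urlparse(uri)

    print(parts.scheme)    # 'http'
    print(parts.netloc)    # 'www.mywebsite.ch' (the authority)
    print(parts.path)      # '/api/collection/foo/object/bar'
    print(parts.query)     # 'id=foobar'
    print(parts.fragment)  # 'head'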

107

HTTP Methods

GET

PUT

DELETE

POST

108

HTTP Protocol

Request:
Method URI
[Header]
[Body]

Response:
Status code
[Header]
[Body]

109

Example

GET /index.html HTTP/1.1
Host: www.example.com

HTTP/1.1 200 OK
Date: Tue, 25 Sep 2018 09:48:34 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 138
Last-Modified: Wed, 08 Jan 2003 23:11:55 GMT
Server: Apache/1.3.3.7 (Unix) (Red-Hat/Linux)
ETag: "3f80f-1b6-3e1cb03b"
Accept-Ranges: bytes
Connection: close

<html>
  <head> <title>An Example Page</title> </head>
  <body> Hello World, this is a very simple HTML document. </body>
</html>

Source: Wikipedia
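
For illustration, the same request can also be issued programmatically; a minimal sketch with Python's standard http.client, using www.example.com as in the example above (real responses may of course differ):

    import http.client

    # Issue the same GET request as in the example above.
    conn = http.client.HTTPConnection("www.example.com")
    conn.request("GET", "/index.html")
    response = conn.getresponse()

    print(response.status, response.reason)        # e.g. 200 OK
    print(response.getheader("Content-Type"))      # e.g. text/html; charset=UTF-8
    body = response.read()                         # the HTML document
    conn.close()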

110

GET

(Side-effect free)

111

PUT

(Idempotent)

112

DELETE

113

POST


Most generic: side effects

114

REST with S3: Buckets

http://bucket.s3.amazonaws.com

115

REST with S3: Objects

http://bucket.s3.amazonaws.com/object-name

116

S3 REST API

PUT Bucket
DELETE Bucket
GET Bucket

PUT Object
DELETE Object
GET Object
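
In practice these verbs are usually called through an SDK rather than by hand-writing HTTP; a rough sketch with boto3 (not covered on the slides; it assumes configured AWS credentials and the default region, and the bucket/object names below are made up):

    import boto3

    # Assumes configured AWS credentials and the default region; the bucket
    # and object names below are made up.
    s3 = boto3.client("s3")

    s3.create_bucket(Bucket="my-example-bucket")                          # PUT Bucket
    s3.put_object(Bucket="my-example-bucket", Key="my-image.jpg",
                  Body=b"...bytes...")                                    # PUT Object
    obj = s3.get_object(Bucket="my-example-bucket", Key="my-image.jpg")   # GET Object
    data = obj["Body"].read()
    s3.delete_object(Bucket="my-example-bucket", Key="my-image.jpg")      # DELETE Object
    s3.delete_bucket(Bucket="my-example-bucket")                          # DELETE Bucket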

117

Example

GET /my-image.jpg HTTP/1.1

Host: bucket.s3.amazonaws.com

Date: Tue, 26 Sep 2017 10:55:00 GMT

Authorization: authorization string

118

Folders: is S3 a file system?

Physical (object keys):

/food/fruits/orange
/food/fruits/strawberry
/food/vegetables/tomato
/food/vegetables/turnip
/food/vegetables/lettuce

Logical (browsing):

food
  fruits
    orange
    strawberry
  vegetables
    tomato
    turnip
    lettuce

119

http://<bucket-name>.s3-website-us-east-1.amazonaws.com/

Static website hosting

120

More on Storage

121

Replication

Fault tolerance

122

Faults

Local (node failure)

versus

Regional (natural catastrophe)

123

Regions

Songkran Khamlue / 123RF Stock Photo

124

Regions

Songkran Khamlue / 123RF Stock Photo

125

Regions

Songkran Khamlue / 123RF Stock Photo

126

Storage Class

Standard: High availability

Standard – Infrequent Access: Less availability, cheaper storage, cost for retrieving

Amazon Glacier: Low cost, hours to GET

127

Azure Blob Storage

128

Overall comparison Azure vs. S3

              S3                 Azure
Object ID     Bucket + Object    Account + Container + Blob
Object API    Blackbox           Block / Append / Page
Limit         5 TB               4.78 TB (block), 195 GB (append), 8 TB (page)

129

Azure Architecture: Storage Stamp

Front-Ends

Partition Layer

Stream Layer

Virtual IP address

Account name

Partition name

Object name

130

Azure Architecture: One storage stamp

10-20 racks * 18 storage nodes/rack (30PB)

131

Azure Architecture: Keep some buffer

kept below 70-80% of the storage capacity

132

Storage Replication

Front-Ends

Partition Layer

Stream Layer

Intra-stamp replication (synchronous)

133

Storage Replication

Front-Ends

Partition Layer

Stream Layer

Front-Ends

Partition Layer

Stream Layer

Inter-stamp replication (asynchronous)

134

Location Services

Front-Ends

Partition Layer

Stream Layer

Location Services

DNS

Virtual IP (primary) Virtual IP

Account name

mapped to one

Virtual IP

(primary stamp)

Front-Ends

Partition Layer

Stream Layer

135

Location Services

Location Services

DNS

North America

Europe

Asia

136

Location Services

Front-Ends

Partition Layer

Stream Layer

DNS

Front-Ends

Partition Layer

Stream Layer

Account name

Primary stamp's VIP

Partition + Object

137

Key-value storage

Sergey Nivens / 123RF Stock Photo

138

Can we consider object storage a database?

139

Issue: latency

S3: ~100-300 ms    vs.    Typical database: 1-9 ms

140

Key-value stores

ID

1. Similar data model to object storage

141

Key-value stores

2. Smaller objects

5 TB (S3)    vs.    400 KB (Dynamo)

142

Key-value stores

3. No Metadata

143

Key-value stores: data model

144

Basic API

get(key)

put(key, value)
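
A minimal in-memory sketch of this basic API, purely for illustration (not how S3 or Dynamo are actually implemented; the class name is made up):

    # A key-value store exposes just get and put (illustration only).
    class KeyValueStore:
        def __init__(self):
            self._data = {}

        def get(self, key):
            return self._data.get(key)

        def put(self, key, value):
            self._data[key] = value

    store = KeyValueStore()
    store.put("key1", b"A")
    print(store.get("key1"))  # b'A'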

145

Basic API

get(key) value

146

Basic API

get(key) value

put(key, other value)

147

Dynamo API

get(key) value

148

Dynamo API

get(key) value, context

149

Dynamo API

get(key) value, context

put(key, context, value)

(more on context later)
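
A sketch of how the extended interface differs: get also returns an opaque context, and put must hand it back. Here the context is just a counter; Dynamo uses vector clocks, as discussed later (class and variable names are made up):

    # get() also returns an opaque context; put() must hand it back.
    # Here the context is just a counter; Dynamo uses vector clocks (see later).
    class DynamoLikeStore:
        def __init__(self):
            self._data = {}  # key -> (value, context)

        def get(self, key):
            return self._data.get(key, (None, 0))  # (value, context)

        def put(self, key, context, value):
            self._data[key] = (value, context + 1)

    store = DynamoLikeStore()
    value, context = store.get("key1")
    store.put("key1", context, b"B")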

150

Key-value stores: why do we simplify?

Simplicity More features

151

Key-value stores: why do we simplify?

Simplicity: Eventual consistency

More features: Consistency

152

Key-value stores: why do we simplify?

Simplicity: Eventual consistency, Performance

More features: Consistency, Overhead

153

Key-value stores: why do we simplify?

Simplicity: Eventual consistency, Performance, Scalability

More features: Consistency, Overhead, Monolithic

154

Key Value

Key-value stores

Which is the most efficient data structure for querying this?

155

Key Value

Key-value stores: logical model

Associative

Array

(aka Map)

156

Enter nodes: physical level

157

Design principles: incremental stability

158

Design principles: incremental stability

159

Design principles: incremental stability

160

Design principles: symmetry

161

Design principles: decentralization

162

Design principles: heterogeneity

163

Physical layer: Peer to peer networks

164

Distributed Hash Tables: Chord

hashed

n-bit ID

(Dynamo: 128 bits)

165

IDs are organized in a logical ring (mod 2^n)

0000000000 ... 0100000000 ... 1000000000 ... 1100000000 ... 1111111111: 2^n possible IDs

166

Each Node picks a 128-bit hash (randomly)

167

Nodes are (logically) placed on the ring (mod 2^n)

168

ID stored at next node clockwise


169

ID stored at next node clockwise


Domain of

responsibility
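
A compact sketch of this ring logic in Python: node names and keys are hashed onto the same 128-bit ring, and a key is stored on the next node clockwise (the Ring class and node names are made up for illustration):

    import bisect
    import hashlib

    # Node names and keys are hashed onto the same 128-bit ring; a key is
    # stored on the next node clockwise (node names are made up).
    RING_BITS = 128

    def ring_position(name: str) -> int:
        return int(hashlib.md5(name.encode()).hexdigest(), 16) % (2 ** RING_BITS)

    class Ring:
        def __init__(self, nodes):
            self._ring = sorted((ring_position(n), n) for n in nodes)

        def responsible_node(self, key: str) -> str:
            positions = [pos for pos, _ in self._ring]
            i = bisect.bisect_right(positions, ring_position(key)) % len(self._ring)
            return self._ring[i][1]

    ring = Ring(["n1", "n2", "n3", "n4"])
    print(ring.responsible_node("key1"))  # one of n1..n4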

170

Adding and removing nodes

171

Adding and removing nodes

172

Adding and removing nodes

173

Adding and removing nodes

Needs to

be transferred

174

Adding and removing nodes

Needs to

be transferred

These nodes are

not affected

175

Adding and removing nodes

176

Adding and removing nodes

177

Adding and removing nodes

178

Adding and removing nodes

Needs to

be transferred

179

Adding and removing nodes

Needs to

be transferred

But what if the node failed?

180

Initial design: one range

181

Duplication: 2 ranges

182

Duplication: 2 ranges

183

Duplication: N ranges

184

Chord: Finger tables

Credits: Thomas Hofmann

185

Chord: Finger tables

O(log(number of nodes))

Credits: Thomas Hofmann

186

Dynamo: Preference lists

Key Nodes

a n1, n2

b n1, n2

c n2, n3

d n2, n3

e n2, n3

f n2, n3

g n2, n3

h n2, n3

i n3, n4

j n3, n4

k n3, n4

l n3, n4

Every node

knows about the ranges of

all other nodes

187

Dynamo: Preference lists

Key Node

1, 2, 3, ... At least N nodes

188

Dynamo: Preference lists

Key Node

1, 2, 3, ... First node is called the coordinator

189

Dynamo: Preference lists

Key Node

1, 2, 3, ... Reads from at least R nodes

190

Dynamo: Preference lists

Key Node

1, 2, 3, ... Writes confirmed (synchronously)

from at least W nodes

191

Quorum

R + W > N
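
The point of R + W > N is that any W replicas that acknowledged a write and any R replicas contacted by a read must overlap in at least one node (|overlap| >= R + W - N >= 1), so some read replica always holds the latest confirmed write. A tiny illustration with made-up node names:

    # Pigeonhole argument: |written & read_from| >= W + R - N >= 1 when R + W > N.
    N, R, W = 3, 2, 2
    assert R + W > N

    written   = {"n1", "n2"}    # the W replicas that acknowledged the write
    read_from = {"n2", "n3"}    # the R replicas contacted by the read
    assert written & read_from  # at least one replica returns the latest write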

192

Initial request

Load balancer

193

Initial request

Load balancer Random node

194

Initial request

Load balancer Random node Coordinator

195

Initial request

Load balancer Random node Coordinator

N-1 Nodes hosting replicas

196

Distributed Hash Tables: Pros

Highly scalable

Robust against failure

Self organizing

Credits: Thomas Hofmann

197

Distributed Hash Tables: Cons

Lookup, no search

Data integrity

Security issues

Credits: Thomas Hofmann

198

Issue 1


199

Issue 1: randomness out of luck


200

Issue 2: heterogeneous performance

201

So...

How can we

• artificially increase the number of nodes,

and

• bring some elasticity to account for performance differences?

202

Virtual nodes are the answer: tokens


203

Virtual nodes: tokens

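
A sketch of the token idea: each physical node is assigned several ring positions, and more powerful machines can be given more tokens (the weights below are made up):

    import random

    # Each physical node gets several tokens (ring positions); more powerful
    # machines can be given more tokens (the weights below are made up).
    TOKENS_PER_NODE = {"n1": 3, "n2": 3, "n3": 6}

    ring = []  # list of (position, physical node)
    for node, tokens in TOKENS_PER_NODE.items():
        for _ in range(tokens):
            ring.append((random.randrange(2 ** 128), node))
    ring.sort()

    # Lookups work as before: hash the key, walk clockwise to the next token,
    # and store the key on that token's physical node.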

204

Deleting a node

205

Deleting a node

206

Deleting a node

207

Adding a node

208

Adding a node

209

Vector clocks

put by Node A

210

Vector clocks

put by Node A

([A, 1])

211

Vector clocks

put by Node A

put by Node A

([A, 1])

([A, 2])

212

Vector clocks

put by Node A

put by Node A

put by Node B

([A, 1])

([A, 2])

([A, 2], [B, 1])

213

Vector clocks

put by Node A

put by Node A

put by Node B

put by Node C

([A, 1])

([A, 2])

([A, 2], [B, 1]) ([A, 2], [C, 1])

214

Vector clocks

put by Node A

put by Node A

put by Node B

put by Node C

reconcile and put by Node A

([A, 1])

([A, 2])

([A, 2], [B, 1]) ([A, 2], [C, 1])

215

Vector clocks

put by Node A

put by Node A

put by Node B

put by Node C

reconcile and put by Node A

([A, 1])

([A, 2])

([A, 2], [B, 1]) ([A, 2], [C, 1])

([A, 3], [B, 1], [C, 1])
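
A minimal vector-clock sketch that replays the sequence above: clocks are dicts from node to counter, with an increment, a dominance test, and a merge used for reconciliation (function names are made up):

    # A clock is a dict: node -> counter.
    def increment(clock, node):
        clock = dict(clock)
        clock[node] = clock.get(node, 0) + 1
        return clock

    def descends(a, b):
        # True if the version with clock a is equal to or newer than clock b.
        return all(a.get(node, 0) >= counter for node, counter in b.items())

    def merge(a, b):
        return {node: max(a.get(node, 0), b.get(node, 0)) for node in set(a) | set(b)}

    d1 = increment({}, "A")             # ([A, 1])
    d2 = increment(d1, "A")             # ([A, 2])
    d3 = increment(d2, "B")             # ([A, 2], [B, 1])
    d4 = increment(d2, "C")             # ([A, 2], [C, 1])
    print(descends(d3, d4), descends(d4, d3))   # False False: concurrent versions
    d5 = increment(merge(d3, d4), "A")  # reconcile: ([A, 3], [B, 1], [C, 1])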

216

Context

put by Node A

put by Node A

put by Node B

put by Node C

reconcile and put by Node A

([A, 1])

([A, 2])

([A, 2], [B, 1]) ([A, 2], [C, 1])

([A, 3], [B, 1], [C, 1])

context

217

Duplication: factor 3

218

Vector clocks

incoming request

put(key1, A)

219

Vector clocks

incoming request

put(key1, A)

Key Node

key1 n1, n2, n3

n1

n2

n3

220

Vector clocks

incoming request

put(key1, A)

Key Node

key1 n1, n2, n3

coordinator

n2

n3

221

Vector clocks

incoming request

put(key1, A)

coordinator

n2

n3 New context

[ (n1, 1) ]

222

Vector clocks

coordinator

n2

n3

A, [ (n1, 1) ]

223

Vector clocks

coordinator

n2

n3

A, [ (n1, 1) ]

A, [ (n1, 1) ]

A, [ (n1, 1) ]

replication

replication

224

Vector clocks

coordinator

n2

n3

A, [ (n1, 1) ]

A, [ (n1, 1) ]

A, [ (n1, 1) ] get(key1)

225

Vector clocks

coordinator

n2

n3

A, [ (n1, 1) ]

A, [ (n1, 1) ]

A, [ (n1, 1) ]

gathering all versions

gathering all versions

get(key1)

226

Vector clocks

coordinator

n2

n3

A, [ (n1, 1) ]

A, [ (n1, 1) ]

A, [ (n1, 1) ]

A, [ (n1, 1) ]

A, [ (n1, 1) ]

A, [ (n1, 1) ]

get(key1)

227

Vector clocks

coordinator

n2

n3

A, [ (n1, 1) ]

A, [ (n1, 1) ]

A, [ (n1, 1) ]

A, [ (n1, 1) ]

get(key1)

228

Vector clocks

coordinator

n2

n3

A, [ (n1, 1) ]

A, [ (n1, 1) ]

A, [ (n1, 1) ]

Returning A, [ (n1, 1) ] (context)

get(key1)

229

Vector clocks

coordinator

n2

n3

A, [ (n1, 1) ]

A, [ (n1, 1) ]

A, [ (n1, 1) ]

incoming request

put(key1, [ (n1, 1) ], B)

B, [ (n1, 2) ]  (new context: +1)

230

Vector clocks

coordinator

n2

n3

A, [ (n1, 1) ]

A, [ (n1, 1) ]

A, [ (n1, 1) ]

B, [ (n1, 2) ]

231

Vector clocks

coordinator

n2

n3

A, [ (n1, 1) ]

A, [ (n1, 1) ]

A, [ (n1, 1) ]

B, [ (n1, 2) ]

B, [ (n1, 2) ]

B, [ (n1, 2) ]

replication

replication

232

Vector clocks

coordinator

n2

n3

A, [ (n1, 1) ]

A, [ (n1, 1) ]

A, [ (n1, 1) ]

B, [ (n1, 2) ]

B, [ (n1, 2) ]

B, [ (n1, 2) ]

get(key1)

233

Vector clocks

coordinator

n2

n3

A, [ (n1, 1) ]

A, [ (n1, 1) ]

A, [ (n1, 1) ]

B, [ (n1, 2) ]

B, [ (n1, 2) ]

B, [ (n1, 2) ]

gathering all versions

gathering all versions

A, [ (n1, 1) ]

B, [ (n1, 2) ]

get(key1)

234

Vector clocks

coordinator

n2

n3

A, [ (n1, 1) ]

A, [ (n1, 1) ]

A, [ (n1, 1) ]

B, [ (n1, 2) ]

B, [ (n1, 2) ]

B, [ (n1, 2) ]

gathering all versions

gathering all versions

A, [ (n1, 1) ]

B, [ (n1, 2) ]

Returning

B, [ (n1, 2) ]

maximum element

get(key1)

235

Vector clocks

coordinator

n2

n3

A, [ (n1, 1) ]

A, [ (n1, 1) ]

A, [ (n1, 1) ]

B, [ (n1, 2) ]

B, [ (n1, 2) ]

B, [ (n1, 2) ]

incoming request

put(key1, [ (n1, 2) ], C)

C, [ (n1, 3) ]  (new context: +1)

236

Vector clocks

n2

n3

A, [ (n1, 1) ]

A, [ (n1, 1) ]

A, [ (n1, 1) ]

B, [ (n1, 2) ]

B, [ (n1, 2) ]

B, [ (n1, 2) ]

C, [ (n1, 3) ]

network partition

237

Vector clocks

n2

n3

A, [ (n1, 1) ]

A, [ (n1, 1) ]

A, [ (n1, 1) ]

B, [ (n1, 2) ]

B, [ (n1, 2) ]

B, [ (n1, 2) ]

C, [ (n1, 3) ]

network partition

gathering all versions

interim

coordinator

get(key1)

238

Vector clocks

n2

n3

A, [ (n1, 1) ]

A, [ (n1, 1) ]

A, [ (n1, 1) ]

B, [ (n1, 2) ]

B, [ (n1, 2) ]

B, [ (n1, 2) ]

C, [ (n1, 3) ]

network partition

Returning

B, [ (n1, 2) ]

gathering all versions

interim

coordinator

get(key1)

239

Vector clocks

n2

n3

A, [ (n1, 1) ]

A, [ (n1, 1) ]

A, [ (n1, 1) ]

B, [ (n1, 2) ]

B, [ (n1, 2) ]

B, [ (n1, 2) ]

C, [ (n1, 3) ]

network partition

incoming request

put(key1, [ (n1, 2) ], D)

D, [ (n1, 2), (n2, 1) ]  (new context)

interim

coordinator

240

Vector clocks

interim coordinator

n2

n3

A, [ (n1, 1) ]

A, [ (n1, 1) ]

A, [ (n1, 1) ]

B, [ (n1, 2) ]

B, [ (n1, 2) ]

B, [ (n1, 2) ]

C, [ (n1, 3) ]

network partition

D, [ (n1, 2), (n2, 1) ]

241

Vector clocks

n2

n3

A, [ (n1, 1) ]

A, [ (n1, 1) ]

A, [ (n1, 1) ]

B, [ (n1, 2) ]

B, [ (n1, 2) ]

B, [ (n1, 2) ]

C, [ (n1, 3) ]

D, [ (n1, 2), (n2, 1) ]

242

Vector clocks

coordinator

n2

n3

A, [ (n1, 1) ]

A, [ (n1, 1) ]

A, [ (n1, 1) ]

B, [ (n1, 2) ]

B, [ (n1, 2) ]

B, [ (n1, 2) ]

C, [ (n1, 3) ]

D, [ (n1, 2), (n2, 1) ]

243

Vector clocks

coordinator

n2

n3

A, [ (n1, 1) ]

A, [ (n1, 1) ]

A, [ (n1, 1) ]

B, [ (n1, 2) ]

B, [ (n1, 2) ]

B, [ (n1, 2) ]

C, [ (n1, 3) ]

D, [ (n1, 2), (n2, 1) ]

get(key1)

244

Vector clocks

coordinator

n2

n3

A, [ (n1, 1) ]

A, [ (n1, 1) ]

A, [ (n1, 1) ]

B, [ (n1, 2) ]

B, [ (n1, 2) ]

B, [ (n1, 2) ]

C, [ (n1, 3) ]

D, [ (n1, 2), (n2, 1) ]

gathering all versions

gathering all versions

A, [ (n1, 1) ]

B, [ (n1, 2) ]

C, [ (n1, 3) ]

D, [ (n1, 2), (n2, 1) ]

get(key1)

245

Vector clocks

A, [ (n1, 1) ]

B, [ (n1, 2) ]

C, [ (n1, 3) ]

D, [ (n1, 2), (n2, 1) ]

246

Directed Acyclic Graph

A, [ (n1, 1) ]

B, [ (n1, 2) ]

C, [ (n1, 3) ] D, [ (n1, 2), (n2, 1) ]

247

Directed Acyclic Graph

A, [ (n1, 1) ]

B, [ (n1, 2) ]

C, [ (n1, 3) ]   D, [ (n1, 2), (n2, 1) ]   (not comparable)

maximal elements
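
For illustration, the maximal (non-dominated) versions among the four clocks above can be computed directly; C and D come out because neither descends from the other (a sketch, with made-up helper names):

    # The four versions from the example, as node -> counter dicts.
    def descends(a, b):
        return all(a.get(n, 0) >= c for n, c in b.items())

    versions = {
        "A": {"n1": 1},
        "B": {"n1": 2},
        "C": {"n1": 3},
        "D": {"n1": 2, "n2": 1},
    }

    # A version is maximal if no *other* version descends from it.
    maximal = [name for name, clock in versions.items()
               if not any(descends(other, clock) and other != clock
                          for other in versions.values())]
    print(maximal)  # ['C', 'D']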

248

Vector clocks

coordinator

n2

n3

A, [ (n1, 1) ]

A, [ (n1, 1) ]

A, [ (n1, 1) ]

B, [ (n1, 2) ]

B, [ (n1, 2) ]

B, [ (n1, 2) ]

C, [ (n1, 3) ]

D, [ (n1, 2), (n2, 1) ]

get(key1)

gathering all versions

gathering all versions

A, [ (n1, 1) ]

B, [ (n1, 2) ]

C, [ (n1, 3) ]

D, [ (n1, 2), (n2, 1) ]

249

Vector clocks

coordinator

n2

n3

A, [ (n1, 1) ]

A, [ (n1, 1) ]

A, [ (n1, 1) ]

B, [ (n1, 2) ]

B, [ (n1, 2) ]

B, [ (n1, 2) ]

C, [ (n1, 3) ]

D, [ (n1, 2), (n2, 1) ]

get(key1)

gathering all versions

gathering all versions

A, [ (n1, 1) ]

B, [ (n1, 2) ]

C, [ (n1, 3) ]

D, [ (n1, 2), (n2, 1) ]

return

C, D, [ (n1, 3), (n2, 1) ]

250

Vector clocks

coordinator

n2

n3

A, [ (n1, 1) ]

A, [ (n1, 1) ]

A, [ (n1, 1) ]

B, [ (n1, 2) ]

B, [ (n1, 2) ]

B, [ (n1, 2) ]

C, [ (n1, 3) ]

D, [ (n1, 2), (n2, 1) ]

251

Vector clocks

A, [ (n1, 1) ]

B, [ (n1, 2) ]

D, [ (n1, 2), (n2, 1) ]

C, [ (n1, 3) ]

A, [ (n1, 1) ]

B, [ (n1, 2) ]

D, [ (n1, 2), (n2, 1) ]

C, [ (n1, 3) ]

A, [ (n1, 1) ]

B, [ (n1, 2) ]

D, [ (n1, 2), (n2, 1) ]

C, [ (n1, 3) ]

synchronization

252

Vector clocks

D, [ (n1, 3), (n2, 1) ]

C, [ (n1, 3) ]

D, [ (n1, 3), (n2, 1) ]

C, [ (n1, 3) ]

D, [ (n1, 3), (n2, 1) ]

C, [ (n1, 3) ]

Cleanup

A, [ (n1, 1) ]

B, [ (n1, 2) ]

C, [ (n1, 3) ] D, [ (n1, 2), (n2, 1) ]

253

Vector clocks

D, [ (n1, 3), (n2, 1) ]

C, [ (n1, 3) ]

D, [ (n1, 3), (n2, 1) ]

C, [ (n1, 3) ]

D, [ (n1, 3), (n2, 1) ]

C, [ (n1, 3) ]

incoming request

put(key1,

[ (n1, 3), (n2, 1) ],

E)

(Client semantically solved

the conflict between C and D)

254

Vector clocks

D, [ (n1, 3), (n2, 1) ]

C, [ (n1, 3) ]

D, [ (n1, 3), (n2, 1) ]

C, [ (n1, 3) ]

D, [ (n1, 3), (n2, 1) ]

C, [ (n1, 3) ]

incoming request

put(key1,

[ (n1, 3), (n2, 1) ],

E)

E, [ (n1, 4), (n2, 1) ]

255

Vector clocks

D, [ (n1, 3), (n2, 1) ]

C, [ (n1, 3) ]

D, [ (n1, 3), (n2, 1) ]

C, [ (n1, 3) ]

D, [ (n1, 3), (n2, 1) ]

C, [ (n1, 3) ]

E, [ (n1, 4), (n2, 1) ]

E, [ (n1, 4), (n2, 1) ]

E, [ (n1, 4), (n2, 1) ]

256

Vector clocks

D, [ (n1, 3), (n2, 1) ]

C, [ (n1, 3) ]

D, [ (n1, 3), (n2, 1) ]

C, [ (n1, 3) ]

D, [ (n1, 3), (n2, 1) ]

C, [ (n1, 3) ]

E, [ (n1, 4), (n2, 1) ]

E, [ (n1, 4), (n2, 1) ]

E, [ (n1, 4), (n2, 1) ]

C, [ (n1, 3) ] D, [ (n1, 2), (n2, 1) ]

E, [ (n1, 4), (n2, 1) ]

257

Version history (all times)

A, [ (n1, 1) ]

B, [ (n1, 2) ]

C, [ (n1, 3) ] D, [ (n1, 2), (n2, 1) ]

E, [ (n1, 4), (n2, 1) ]  (absolute maximum)

258

Lower latency

Partition-aware client Coordinator

N-1 Nodes hosting replicas

Hinted handoff

259

Merkle Trees

What if hinted replicas get lost?

What if the complexity in replica deltas increases?

260

Merkle Trees

What if hinted replicas get lost?

What if the complexity in replica deltas increases?

Anti-entropy protocol

261

Merkle Trees

Key range (one per virtual node)
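
A rough sketch of the idea: each replica builds a Merkle tree over its key range, and two replicas compare root hashes first, descending only into subtrees whose hashes differ (the hash choice and helper names are made up):

    import hashlib

    # Leaves are hashes of the (key, value) pairs in the range; inner nodes
    # hash their children; replicas compare roots first and descend only
    # where hashes differ.
    def h(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def merkle_levels(items):
        # items: sorted list of (key, value) pairs in this key range
        level = [h(k.encode() + b"\x00" + v) for k, v in items]
        levels = [level]
        while len(level) > 1:
            level = [h(level[i] + (level[i + 1] if i + 1 < len(level) else b""))
                     for i in range(0, len(level), 2)]
            levels.append(level)
        return levels  # levels[-1][0] is the root hash

    replica1 = merkle_levels([("a", b"1"), ("b", b"2"), ("c", b"3")])
    replica2 = merkle_levels([("a", b"1"), ("b", b"2"), ("c", b"X")])
    print(replica1[-1][0] == replica2[-1][0])  # False: the ranges differ somewhere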

262

Last words

263

Amazon mindset

264

Azure mindset

Graphs

Tables

Trees

Key-value paradigm

265

Take away messages: how to scale out?

§ Simplify the model!

§ Buy cheap hardware!

§ Remove schemas!