Dynamo and BigTable in Light of the CAP Theorem

22953 Research Seminar: Databases and Data Mining, November 2013, Open University

Grisha Weintraub


DESCRIPTION

For a long time, relational database management systems were the only solution for persistent data storage. However, with the phenomenal growth of data, this conventional way of storing data has become problematic. To manage the exponentially growing data traffic, the largest information technology companies, such as Google, Amazon and Yahoo, have developed alternative solutions that store data in what have come to be known as NoSQL databases. Among the NoSQL features are a flexible schema, horizontal scaling and no ACID support. NoSQL databases store and replicate data in distributed systems, often across datacenters, to achieve scalability and reliability. The CAP theorem states that any networked shared-data system (e.g. NoSQL) can have at most two of three desirable properties:
• consistency (C) – equivalent to having a single up-to-date copy of the data
• availability (A) of that data, for reads and writes
• tolerance to network partitions (P)
Because of this inherent tradeoff, it is necessary to sacrifice one of these properties. The general belief is that designers cannot sacrifice P and therefore face a difficult choice between C and A. This seminar presents two NoSQL databases: Amazon's Dynamo, which sacrifices consistency, thereby achieving very high availability, and Google's BigTable, which guarantees strong consistency while providing only best-effort availability.

TRANSCRIPT

Page 1

Dynamo and BigTable in Light of the CAP Theorem

22953 Research Seminar: Databases and Data Mining, November 2013, Open University

Grisha Weintraub

Page 2

Overview

• Introduction to DDBS and NoSQL

• CAP Theorem

• Dynamo (AP)

• BigTable (CP)

• Dynamo vs. BigTable

Page 3

Distributed Database Systems

• Data is stored across several sites that share no physical component.

• Systems that run on each site are independent of each other.
• Appears to the user as a single system.

Page 4

Distributed Data Storage

Partitioning: Data is partitioned into several fragments that are stored at different sites.
– Horizontal – by rows.
– Vertical – by columns.

Replication: The system maintains multiple copies of the data, stored at different sites.

Replication and partitioning can be combined!

Page 5

Partitioning

Original table:

key | value
----|------
x   | 5
y   | 7
z   | 10
w   | 12

Vertical partitioning – split by columns: site A stores the key column and site B stores the value column.

Locality of reference – data is most likely to be updated and queried locally.

Horizontal partitioning – split by rows: site A stores (x, 5); site B stores (y, 7) and (z, 10); site C stores (w, 12).

Page 6

Replication

Original table: the key/value pairs (x, 5), (y, 7), (z, 10), (w, 12).

Each pair is replicated at two of four sites:
– Site A stores (x, 5) and (y, 7).
– Site B stores (x, 5) and (z, 10).
– Site C stores (y, 7) and (w, 12).
– Site D stores (z, 10) and (w, 12).

Pros – increased availability of data and faster query evaluation.
Cons – increased cost of updates and complexity of concurrency control.

Page 7

Updating distributed data

• Quorum voting (Gifford, SOSP '79) – see the sketch below:
– N – the number of replicas.
– At least R copies must be read.
– At least W copies must be written.
– Requirement: R + W > N.
– Example: N = 10, R = 4, W = 7.

• Read-any write-all: R = 1, W = N.
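As a small sketch (my own illustration, not from the slides), the R + W > N condition can be checked exhaustively: every set of R readers must share at least one replica with every set of W writers, so a read always sees the latest write.

from itertools import combinations

def quorums_intersect(n: int, r: int, w: int) -> bool:
    """True if every read quorum (any r of n replicas) overlaps
    every write quorum (any w of n replicas)."""
    replicas = range(n)
    return all(set(rq) & set(wq)
               for rq in combinations(replicas, r)
               for wq in combinations(replicas, w))

assert quorums_intersect(5, 2, 4)      # R + W = 6 > 5: quorums always overlap
assert not quorums_intersect(5, 2, 3)  # R + W = 5: a read can miss the write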

Page 8

NoSQL

• No SQL:
– Not an RDBMS.
– Not using the SQL language.
– Not only SQL?

• Flexible schema

• Horizontal scalability

• Relaxed consistency → high performance & availability

Page 9

Overview

• Introduction to DDBS and NoSQL √

• CAP Theorem

• Dynamo (AP)

• BigTable (CP)

• Dynamo vs. BigTable

Page 10

CAP Theorem

• Eric A. Brewer. Towards robust distributed systems (Invited Talk) , July 2000

• S. Gilbert and N. Lynch, Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services, June 2002

• Eric A. Brewer. CAP twelve years later: How the 'rules' have changed, February 2012

• S. Gilbert and N. Lynch, Perspectives on the CAP Theorem, February 2012

Page 11

CAP Theorem

• Consistency – equivalent to having a single up-to-date copy of the data.

• Availability - every request received by a non-failing node in the system must result in a response.

• Partition tolerance - the network will be allowed to lose arbitrarily many messages sent from one node to another.

Theorem – You can have at most two of these properties for any shared-data system.

Page 12

CAP Theorem

[Diagram: a client writes x=5 to one replica while another client reads x from a second replica and receives 5 – the system behaves as consistent, available, and partition tolerant.]

Page 13

CAP Theorem - Proof

[Diagram: three replicas, each holding x=0, with the network partitioned between them.]

A write x=5 is applied on one side of the partition, then a read x? is issued on the other side:
1. If the read returns x=0 – not consistent.
2. If the read gets no response – not available.

Page 14

CAP – 2 of 3

• Trivial:
– The trivial system that ignores all requests meets these requirements.

• Best-effort availability:
– Read-any write-all systems become unavailable only when messages are lost.

• Examples:
– Distributed database systems, BigTable.

[Venn diagram: Consistency and Partition tolerance are kept; Availability is forfeited.]

Page 15

CAP – 2 of 3

[Venn diagram: Availability and Partition tolerance are kept; Consistency is forfeited.]

• Trivial:
– The service can trivially return the initial value in response to every request.

• Best-effort consistency:
– A quorum-based system, modified to time out lost messages, will return inconsistent (and, in particular, stale) data only when messages are lost.

• Examples:
– Web caches, Dynamo.

Page 16

CAP – 2 of 3

[Venn diagram: Consistency and Availability are kept; Partition tolerance is forfeited.]

• If there are no partitions, it is clearly possible to provide consistent, available data (e.g. read-any write-all).

• Does choosing CA make sense? Eric Brewer:
– “The general belief is that for wide-area systems, designers cannot forfeit P and therefore have a difficult choice between C and A.”
– “If the choice is CA, and then there is a partition, the choice must revert to C or A.”

Page 17

Overview

• Introduction to DDBS and NoSQL √

• CAP Theorem √

• Dynamo (AP)

• BigTable (CP)

• Dynamo vs. BigTable

Page 18

Dynamo - Introduction

• Highly available key-value storage system.
• Provides an “always-on” experience.
• Prefers availability over consistency.
• “Customers should be able to view and add items to their shopping cart even if disks are failing, network routes are flapping, or data centers are being destroyed by tornados.”

Page 19

Dynamo - API

• Distributed hash table:

– put(key, context, object) – associates the given object with the specified key and context.
• context – metadata about the object, including information such as the object's version.

– get(key) – returns the object to which the specified key is mapped, or a list of objects with conflicting versions, along with a context.
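To make the shape of this interface concrete, here is a toy in-memory sketch in Python (names like DynamoLikeStore are mine, not Amazon's; a real client talks to remote replicas):

from dataclasses import dataclass

@dataclass
class VersionedObject:
    value: bytes
    context: dict   # opaque version metadata, e.g. a vector clock

class DynamoLikeStore:
    """Toy stand-in for the put/get interface described above."""
    def __init__(self):
        self._data: dict[str, list[VersionedObject]] = {}

    def put(self, key: str, context: dict, obj: bytes) -> None:
        # A real system uses `context` to supersede older versions;
        # here every version is kept, mimicking divergent replicas.
        self._data.setdefault(key, []).append(VersionedObject(obj, context))

    def get(self, key: str) -> list[VersionedObject]:
        # May return several conflicting versions, each with its context.
        return self._data.get(key, [])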

Page 20

Dynamo - Partitioning

• Naive approach:
– Hash the key.
– Apply modulo n (n = number of nodes).

Example with 4 nodes:
key = “John Smith”
hash(key) = 19
19 mod 4 = 3 → the key is placed on node 3.

Adding or deleting nodes makes a total mess!
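A quick illustration of the mess (my own example, with an assumed key set): under mod-n placement, adding a single node remaps most keys.

def place(key: str, n: int) -> int:
    return hash(key) % n              # naive placement: hash, then mod n

keys = [f"user-{i}" for i in range(10_000)]
moved = sum(place(k, 4) != place(k, 5) for k in keys)
print(f"{moved / len(keys):.0%} of keys move when going from 4 to 5 nodes")
# Roughly 80% of keys change nodes, invalidating most placements at once.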

Page 21

Dynamo - Partitioning

• Consistent hashing (STOC '97):
– Each node is assigned to a random position on the ring.
– The key is hashed to a fixed point on the ring.
– The node is chosen by walking clockwise from the hash location.

Adding/deleting nodes → uneven partitioning!

[Diagram: nodes A-G placed on the ring; hash(key) lands between two nodes and the key is served by the next node clockwise.]

Page 22

Dynamo - Partitioning

• Virtual nodes:
– Each physical node is assigned to multiple points in the ring.
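A compact sketch of a consistent-hash ring with virtual nodes (illustrative only; Dynamo's actual ring uses MD5-based tokens and considerably more machinery):

import bisect, hashlib

class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Each physical node is assigned `vnodes` points on the ring.
        self._ring = sorted((self._hash(f"{node}#{i}"), node)
                            for node in nodes for i in range(vnodes))
        self._points = [p for p, _ in self._ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def node_for(self, key: str) -> str:
        # Walk clockwise: first point at or after hash(key), wrapping around.
        i = bisect.bisect_left(self._points, self._hash(key)) % len(self._points)
        return self._ring[i][1]

ring = HashRing(["A", "B", "C", "D"])
print(ring.node_for("John Smith"))    # the node that owns this key

With virtual nodes, a newly added node takes small slices of the ring from all existing nodes, which keeps the partitioning even.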

Page 23

Dynamo - Replication

• Each data item is replicated at N nodes.

• Preference list – the list of nodes that store a particular key.

• Coordinator – the node that handles read/write operations, typically the first in the preference list.

[Diagram: N = 3 – hash(key) lands on the ring and the key is stored at the next three nodes clockwise, which form its preference list.]

Page 24

Dynamo - Data Versioning

• Eventual consistency (replicas are updated asynchronously):
– A put() call may return to its caller before the update has been applied at all the replicas.
– A get() call may return many versions of the same object.

• Reconciliation:
– Syntactic – the system resolves conflicts automatically (e.g. the new version overwrites the previous one).
– Semantic – the client resolves conflicts (e.g. by merging shopping-cart items).

Page 25

Dynamo - Data Versioning

• Semantic reconciliation – shopping-cart example:
– Two replicas of the cart diverge: one version holds {item1, item4}, the other {item1, item6}.
– get() returns both conflicting versions: {item1, item4}, {item1, item6}.
– The client merges them and writes back: put({item1, item4, item6}).

What about deletes?!

Page 26

Dynamo - Data Versioning

• Vector clocks (Lamport, CACM '78) – see the sketch below:

– A list of <node, counter> pairs.

– One vector clock is associated with every version of every object.

– If the counters on the first object's clock are less than or equal to the corresponding counters in the second clock, then the first is an ancestor of the second and can be forgotten.

– Passed as part of the “context” parameter of put() and get().
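A small sketch (illustrative, not Dynamo's code) of the ancestor test between two vector clocks, represented here as node → counter maps:

def descends(a: dict[str, int], b: dict[str, int]) -> bool:
    """True if clock `a` is an ancestor of (or equal to) clock `b`:
    every counter in `a` is <= the corresponding counter in `b`."""
    return all(c <= b.get(node, 0) for node, c in a.items())

v1 = {"Sx": 1}              # first write, handled by node Sx
v2 = {"Sx": 2, "Sy": 1}     # later writes through Sx and then Sy
v3 = {"Sx": 2, "Sz": 1}     # a concurrent branch through Sz

assert descends(v1, v2)     # v1 is an ancestor of v2 and can be forgotten
assert not descends(v2, v3) and not descends(v3, v2)  # conflict: keep both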

Page 27

Dynamo – get() and put()

• Two strategies that a client can use:

– A load balancer that selects a node based on load information: the client does not have to link any Dynamo-specific code into its application.

– A partition-aware client library: can achieve lower latency because it skips a potential forwarding step.

Page 28

Dynamo – get() and put()

• Quorum-like system:
– R – the number of nodes that must participate in a successful read.
– W – the number of nodes that must participate in a successful write.

• put() (sketched below):
– The coordinator generates the vector clock for the new version and writes the new version locally.
– The coordinator sends the new version to the N nodes; if at least W-1 other nodes respond, the write is successful.

• get():
– The coordinator requests all data versions from the N nodes; if at least R-1 other nodes respond, it returns the data to the client.
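A rough, self-contained sketch of the write path above (the toy Replica class is mine; real Dynamo coordinates over the network):

class Replica:
    def __init__(self, name):
        self.name, self.store, self.up = name, {}, True

    def write(self, key, value, clock):
        if not self.up:
            raise ConnectionError(self.name)
        self.store[key] = (value, clock)

def coordinator_put(key, value, clock, replicas, w):
    """Send the new version to the N replicas; succeed once `w` have written.
    replicas[0] is the coordinator, whose local write is the first ack."""
    acks = 0
    for node in replicas:
        try:
            node.write(key, value, clock)
            acks += 1
        except ConnectionError:
            continue                    # failed replica: try the rest
        if acks >= w:
            return True
    return False

nodes = [Replica(n) for n in "ABC"]
nodes[2].up = False                     # one replica is down
print(coordinator_put("cart", b"...", {"A": 1}, nodes, w=2))  # True: 2 of 3 acks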

Page 29

Dynamo – Handling Failures

• Sloppy quorum and hinted handoff:
– All read and write operations are performed on the first N healthy nodes.
– If a node is down, its replica is sent to another node; the data received at that node is called a “hinted replica”.
– When the original node recovers, the hinted replica is written back to it.

[Diagram: hash ring A-G; a write aimed at a failed node is handed off, with a hint, to the next healthy node on the ring.]

Page 30

Dynamo – Handling Failures

• Anti-entropy:
– A protocol to keep the replicas synchronized.
– To detect inconsistencies – a Merkle tree:
• A hash tree whose leaves are hashes of the values of individual keys.
• Parent nodes higher in the tree are hashes of their respective children.
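A minimal sketch of the idea (illustrative; Dynamo maintains a separate Merkle tree per key range and compares trees level by level):

import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(values: list[bytes]) -> bytes:
    """Leaves are hashes of individual values; parents hash their children."""
    level = [h(v) for v in values]
    while len(level) > 1:
        if len(level) % 2:                       # duplicate last node if odd
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

replica_a = [b"x=5", b"y=7", b"z=10", b"w=12"]
replica_b = [b"x=5", b"y=7", b"z=11", b"w=12"]   # one divergent key
print(merkle_root(replica_a) == merkle_root(replica_b))   # False

Equal roots mean the ranges match; differing roots mean the replicas descend into the differing subtree to find the out-of-sync keys, exchanging only hashes instead of whole values.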

Page 31

Dynamo – Membership detection

• Gossip-based protocol:
– Periodically, each node contacts another, randomly selected node in the network.
– The nodes compare their membership histories and reconcile them.

[Diagram: nodes A-G on the ring exchange and reconcile their membership views pairwise.]
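A toy sketch of the reconciliation step (my own formulation): each view maps member → version, and a gossip round merges two views by keeping the freshest entry for every member.

import random

def gossip_round(views: dict[str, dict[str, int]]) -> None:
    """Each node contacts one random peer; both end up with the merged view."""
    for node in list(views):
        peer = random.choice([n for n in views if n != node])
        merged = {m: max(views[node].get(m, 0), views[peer].get(m, 0))
                  for m in views[node].keys() | views[peer].keys()}
        views[node], views[peer] = dict(merged), dict(merged)

views = {"A": {"A": 3, "B": 1}, "B": {"B": 2, "C": 1}, "C": {"C": 4}}
for _ in range(5):
    gossip_round(views)
print(views)   # after a few rounds the views have (very likely) converged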

Page 32

Dynamo - Summary

Page 33

Overview

• Introduction to DDBS and NoSQL √

• CAP Theorem √

• Dynamo (AP) √

• BigTable (CP)

• Dynamo vs. BigTable

Page 34

BigTable - Introduction

• Distributed storage system for managing structured data that is designed to scale to a very large size.

Page 35

BigTable – Data Model

• Bigtable is a sparse, distributed, persistent multidimensional sorted map.

• The map is indexed by a row key, column key, and a timestamp.

• (row_key, column_key, time) → string
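In plain Python terms, the logical model behaves like a sorted map keyed by (row, column, timestamp). This is only a mental model to make the indexing concrete, not how Bigtable actually stores data:

# Logical model: sorted map from (row_key, column_key, timestamp) -> string.
table: dict[tuple[str, str, int], str] = {}

table[("29", "name", 1)] = "Bob"       # older version, written at t1
table[("29", "name", 2)] = "Robert"    # newer version, written at t2

print(table[("29", "name", 2)])        # -> "Robert"

# The map is sorted by key, so a scan over a row range touches a
# contiguous slice -- the property that makes tablets possible.
for key in sorted(table):
    print(key, "->", table[key])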

Page 36

BigTable – Data Model

Bigtable representation (sparse – only populated cells are stored):

user_id | name                  | phone  | email
--------|-----------------------|--------|------------------
15      | John                  | 178145 |
29      | Bob (t1), Robert (t2) |        | [email protected]

Here user_id is the row_key, the column names are column_keys, and each cell version carries a timestamp:

(29, name, t2) → “Robert”

RDBMS approach (every column is present, with NULLs):

user_id | name | phone  | email
--------|------|--------|------------------
15      | John | 178145 | null
29      | Bob  | null   | [email protected]

Page 37

BigTable – Data Model

• Columns are grouped into column families:
– Column key syntax: family:optional_qualifier – e.g. in contactInfo:email, “contactInfo” is the column family and “email” is the optional qualifier.

Bigtable representation:

user_id | name | contactInfo:phone | contactInfo:email
--------|------|-------------------|------------------
15      | John | 17814552          | [email protected]

RDBMS approach (two normalized tables):

user_id | name
--------|-----
15      | John

user_id | type  | value
--------|-------|------------------
15      | phone | 178145
15      | email | [email protected]

Page 38

BigTable – Data Model

• Rows:
– The row keys in a table are arbitrary strings.
– Every read or write of data under a single row key is atomic.
– Data is maintained in lexicographic order by row key.
– Each row range is called a tablet, which is the unit of distribution and load balancing.

Example row, with user_id 15 as the row_key:

user_id | name | contactInfo:phone | contactInfo:email
--------|------|-------------------|------------------
15      | John | 178145            | [email protected]

Page 39

BigTable – Data Model

• Column Families:
– Column keys are grouped into sets called column families.
– A column family must be created before data can be stored under any column key in that family.
– Access control and both disk and memory accounting are performed at the column-family level.

Example column key: contactInfo:email, in the same row as above.

Page 40

BigTable – Data Model

• Timestamps:
– Each cell in a Bigtable can contain multiple versions of the same data; these versions are indexed by timestamp.
– Versions are stored in decreasing timestamp order.
– Timestamps may be assigned:
• by Bigtable (real time, in microseconds)
• by the client application
– Older versions are garbage-collected.

Example: the cell (29, name) holds “Bob” at timestamp t1 and “Robert” at t2.

Page 41

BigTable – Data Model

• Tablets:
– Large tables are broken into tablets at row boundaries.
– A tablet holds a contiguous range of rows.
– Approximately 100-200 MB of data per tablet.

Example: rows with id 15000-20000 form Tablet 1; rows 20001-25000 form Tablet 2.

Page 42

BigTable – API

• Metadata operations:
– Creating and deleting tables and column families, modifying access-control rights.

• Client operations:
– Write/delete values.
– Read values.
– Scan row ranges.

// Open the table
Table *T = OpenOrDie("/bigtable/users");

// Update name and delete a phone
RowMutation r1(T, "29");
r1.Set("name:", "Robert");
r1.Delete("contactInfo:phone");
Operation op;
Apply(&op, &r1);

Page 43

BigTable – Building Blocks

• GFS – large-scale distributed file system.
(Ghemawat, Gobioff, and Leung, “The Google File System,” Dec. 2003)

• Chubby – distributed lock service.
(Burrows, “The Chubby Lock Service for Loosely-Coupled Distributed Systems,” Nov. 2006)

• SSTable – file format to store BigTable data.

Page 44

BigTable – Building Blocks

• GFS:
– Files are broken into chunks (typically 64 MB).
– The master manages metadata.
– Data transfers happen directly between clients and chunkservers.

[Diagram: the client asks the GFS master for chunk locations, then transfers data directly with the chunkservers.]

Page 45

BigTable – Building Blocks

• Chubby:
– Provides a namespace of directories and small files:
• Each directory or file can be used as a lock.
• Reads and writes to a file are atomic.
– Chubby clients maintain sessions with Chubby:
• When a client's session expires, it loses any locks.
– Highly available:
• Five active replicas, one of which is elected to be the master.
• Live when a majority of the replicas are running and can communicate with each other.

Page 46

BigTable – Building Blocks

• SSTable:

– Stored in GFS.

– A persistent, ordered, immutable map from keys to values.

– Provided operations:
• Get a value by key.
• Iterate over all key/value pairs in a specified key range.
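A toy in-memory sketch of that interface (a real SSTable is an immutable file in GFS with a block index, not a Python object):

import bisect

class SSTable:
    """Immutable, ordered map: point lookups and in-order range scans."""
    def __init__(self, items: dict[str, str]):
        self._keys = sorted(items)
        self._values = [items[k] for k in self._keys]

    def get(self, key: str):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return None

    def scan(self, start: str, end: str):
        """Yield (key, value) for start <= key < end, in key order."""
        i = bisect.bisect_left(self._keys, start)
        while i < len(self._keys) and self._keys[i] < end:
            yield self._keys[i], self._values[i]
            i += 1

t = SSTable({"15": "John", "29": "Robert", "31": "Lisa"})
print(t.get("29"))                 # Robert
print(list(t.scan("15", "30")))    # [('15', 'John'), ('29', 'Robert')]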

Page 47

BigTable – System Structure

• Three major components:

– Client library.

– Master (exactly one):
• Assigns tablets to tablet servers.
• Detects the addition and expiration of tablet servers.
• Balances tablet-server load.
• Garbage-collects files in GFS.
• Handles schema changes such as table and column-family creations.

– Tablet servers (multiple, dynamically added):
• Each manages 10-100 tablets.
• Handles read and write requests to its tablets.
• Splits tablets that have grown too large.

Page 48

BigTable – System Structure

[Diagram: client library, a single master, and many tablet servers, built on top of GFS and Chubby.]

Page 49

BigTable – Tablet Location

• A three-level hierarchy, analogous to that of a B+ tree, is used to store tablet location information.

• The client library caches tablet locations.

Page 50

BigTable – Tablet Assignment

• Tablet server:
– When a tablet server starts, it creates, and acquires an exclusive lock on, a uniquely named file in a specific Chubby directory – the servers directory.

• Master:
– Grabs a unique master lock in Chubby, which prevents concurrent master instantiations.
– Scans the servers directory in Chubby to find the live servers.
– Communicates with every live tablet server to discover which tablets are already assigned to each server.
– If a tablet server reports that it has lost its lock, or if the master was unable to reach a server during its last several attempts, the master deletes the server's lock file and reassigns its tablets.
– Scans the METADATA table to find unassigned tablets and reassigns them.

Page 51

BigTable - Tablet Assignment

[Diagram: the master holds its lock in Chubby and tracks assignments in the METADATA table; Tablet Server 1 serves tablets 1 and 8, Tablet Server 2 serves tablets 2 and 7.]

Page 52

BigTable – Tablet Serving

• Writes:
– Updates are committed to a commit log.
– Recently committed updates are stored in memory, in a memtable.
– Older updates are stored in a sequence of SSTables.

• Reads:
– A read operation is executed on a merged view of the sequence of SSTables and the memtable.
– Since the SSTables and the memtable are sorted, the merged view can be formed efficiently.
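A sketch of why the merge is cheap when all sources are sorted (illustrative only; here newer sources shadow older ones for the same key):

import heapq

def merged_view(memtable: dict, sstables: list[dict]):
    """Merge sorted sources; earlier sources are newer and win on duplicates."""
    sources = [sorted(memtable.items())] + [sorted(t.items()) for t in sstables]
    # Tag entries with their source age so the newest version sorts first.
    tagged = ([(k, age, v) for k, v in src] for age, src in enumerate(sources))
    seen = set()
    for key, _, value in heapq.merge(*tagged):
        if key not in seen:            # first occurrence is the newest version
            seen.add(key)
            yield key, value

memtable = {"29": "Robert"}
sstables = [{"29": "Bob", "15": "John"}, {"31": "Lisa"}]
print(list(merged_view(memtable, sstables)))
# [('15', 'John'), ('29', 'Robert'), ('31', 'Lisa')]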

Page 53

BigTable - Compactions

• Minor compaction:
– Converts the memtable into an SSTable.
– Reduces memory usage.
– Reduces log reads during recovery.

• Merging compaction:
– Merges the memtable and a few SSTables.
– Reduces the number of SSTables.

• Major compaction:
– A merging compaction that results in a single SSTable.
– No deletion records, only live data.
– A good place to apply the policy “keep only N versions”.

Page 54

BigTable – Bloom Filters

Bloom filter:
1. An empty array a of m bits, all set to 0.
2. A hash function h, such that h hashes each element to one of the m array positions with a uniform random distribution.
3. To add an element e: a[h(e)] = 1.

Example:
S1 = {“John Smith”, “Lisa Smith”, “Sam Doe”, “Sandra Dee”}

0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0
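A small sketch with k hash functions (the slide uses a single h for simplicity; real filters use several, derived here from one digest – my own construction):

import hashlib

class BloomFilter:
    def __init__(self, m: int = 16, k: int = 3):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, e: str):
        # Derive k array positions from one digest of the element.
        d = hashlib.sha256(e.encode()).digest()
        return [int.from_bytes(d[2 * i:2 * i + 2], "big") % self.m
                for i in range(self.k)]

    def add(self, e: str):
        for p in self._positions(e):
            self.bits[p] = 1

    def might_contain(self, e: str) -> bool:
        # False = definitely absent; True = maybe (false positives possible).
        return all(self.bits[p] for p in self._positions(e))

bf = BloomFilter()
for name in ["John Smith", "Lisa Smith", "Sam Doe", "Sandra Dee"]:
    bf.add(name)
print(bf.might_contain("John Smith"))   # True
print(bf.might_contain("Tim Apple"))    # almost certainly False

Asking the filter before touching an SSTable on disk is what enables the read-path saving described on the next slide.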

Page 55

BigTable – Bloom Filters

• A read first asks the per-SSTable Bloom filter whether the SSTable might contain the sought row/column pair; a negative answer lets Bigtable skip that SSTable entirely, drastically reducing the number of disk seeks required for read operations!

Page 56

Overview

• Introduction to DDBS and NoSQL √

• CAP Theorem √

• Dynamo (AP) √

• BigTable (CP) √

• Dynamo vs. BigTable

Page 57

Dynamo vs. BigTable

               | Dynamo         | BigTable
---------------|----------------|---------------------
Data model     | key-value      | multidimensional map
Operations     | by key         | by key range
Partitioning   | random         | ordered
Replication    | sloppy quorum  | only in GFS
Architecture   | decentralized  | hierarchical
Consistency    | eventual       | strong (*)
Access control | no             | column family

Page 58

Overview

• Introduction to DDBS and NoSQL √

• CAP Theorem √

• Dynamo (AP) √

• BigTable (CP) √

• Dynamo vs. BigTable √

Page 59

References

• R. Ramakrishnan and J. Gehrke, Database Management Systems, 3rd edition, pp. 736-751.

• S. Gilbert and N. Lynch, “Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services,” ACM SIGACT News, June 2002, pp. 51-59.

• E. Brewer, “CAP Twelve Years Later: How the 'Rules' Have Changed,” IEEE Computer 45, 2 (Feb. 2012), pp. 23-29.

• S. Gilbert and N. Lynch, “Perspectives on the CAP Theorem,” IEEE Computer 45, 2 (Feb. 2012), pp. 30-36.

• G. DeCandia et al., “Dynamo: Amazon's Highly Available Key-Value Store,” Proc. 21st ACM SIGOPS Symp. on Operating Systems Principles (SOSP '07), ACM, 2007, pp. 205-220.

• F. Chang et al., “Bigtable: A Distributed Storage System for Structured Data,” Proc. 7th USENIX Symp. on Operating Systems Design and Implementation (OSDI '06), USENIX, 2006, pp. 205-218.

• S. Ghemawat, H. Gobioff, and S.-T. Leung, “The Google File System,” Proc. 19th ACM SOSP (Dec. 2003), pp. 29-43.

Page 60