jeff terrace michael j. freedman

28
Jeff Terrace Jeff Terrace Michael J. Freedman Michael J. Freedman Object Storage on CRAQ Object Storage on CRAQ High throughput chain replication High throughput chain replication for for read-mostly workloads read-mostly workloads

Upload: domani

Post on 21-Jan-2016

56 views

Category:

Documents


0 download

DESCRIPTION

Object Storage on CRAQ High throughput chain replication for read-mostly workloads. Jeff Terrace Michael J. Freedman. Data Storage Revolution. Relational Databases Object Storage (put/get) Dynamo PNUTS CouchDB MemcacheDB Cassandra. Speed Scalability Availability Throughput - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Jeff Terrace Michael J. Freedman

Jeff TerraceJeff TerraceMichael J. FreedmanMichael J. Freedman

Object Storage on CRAQObject Storage on CRAQ

High throughput chain replication forHigh throughput chain replication forread-mostly workloadsread-mostly workloads

Page 2: Jeff Terrace Michael J. Freedman

Data Storage Revolution

• Relational Databases

• Object Storage (put/get)– Dynamo– PNUTS– CouchDB– MemcacheDB– Cassandra

SpeedScalabilityAvailabilityThroughput

No Complexity

Page 3: Jeff Terrace Michael J. Freedman

Eventual Consistency

Manager

Replica

ReplicaReplica

Write Request

Read RequestReplica

B

A

Read Request

Page 4: Jeff Terrace Michael J. Freedman

Eventual Consistency

• Writes ordered after commit

• Reads can be out-of-order or stale

• Easy to scale, high throughput

• Difficult application programming model

Page 5: Jeff Terrace Michael J. Freedman

Traditional Solution to Consistency

Manager

Replica

Replica

ReplicaReplica

Write Request

Two-Phase Commit:

1.Prepare2.Vote: Yes3.Commit4.Ack

Page 6: Jeff Terrace Michael J. Freedman

Strong Consistency

• Reads and Writes strictly ordered

• Easy programming

• Expensive implementation

• Doesn’t scale well

Page 7: Jeff Terrace Michael J. Freedman

Our Goal

• Easy programming

• Easy to scale, high throughput

Page 8: Jeff Terrace Michael J. Freedman

Chain Replication

Manager

Replica

Replica

ReplicaReplica

HEAD TAIL

Write Request Read

Request

W1R1W2R2R3

van Renesse & Schneider(OSDI 2004)

W1

R1

R2

W2

R3

Page 9: Jeff Terrace Michael J. Freedman

Chain Replication

• Strong consistency

• Simple replication

• Increases write throughput

• Low read throughput

• Can we increase throughput?

• Insight:– Most applications are read-heavy (100:1)

Page 10: Jeff Terrace Michael J. Freedman

CRAQ

• Two states per object – clean and dirty

Replica TAILReplicaReplicaHEAD

Read Request

Read Request

Read Request

Read Request

Read Request

V1 V1 V1 V1 V1

Page 11: Jeff Terrace Michael J. Freedman

CRAQ

• Two states per object – clean and dirty

• If latest version is clean, return value

• If dirty, contact tail for latest version number

Replica TAILReplicaReplicaHEAD

V1 V1 V1 V1 V1

Write Request

,V2 ,V2 ,V2

Read Request

V1

Read Request 1V1

,V2 V2V2V2V2V2

2V2

Page 12: Jeff Terrace Michael J. Freedman

Multicast Optimizations

• Each chain forms group

• Tail multicasts ACKs

Replica TAILReplicaReplicaHEAD

V1 V1 V1 V1,V2 ,V2 ,V2 ,V2 V2V2V2V2V2

Page 13: Jeff Terrace Michael J. Freedman

Multicast Optimizations

• Each chain forms group

• Tail multicasts ACKs

• Head multicasts write data

Replica TAILReplicaReplicaHEAD

V2 V2 V2 V2

Write Request

,V3 ,V3 ,V3 ,V3 V2,V3V3

Page 14: Jeff Terrace Michael J. Freedman

CRAQ Benefits

• From Chain Replication– Strong consistency– Simple replication– Increases write throughput

• Additional Contributions– Read throughput scales :

• Chain Replication with Apportioned Queries

– Supports Eventual Consistency

Page 15: Jeff Terrace Michael J. Freedman

High Diversity

• Many data storage systems assume locality– Well connected, low latency

• Real large applications are geo-replicated– To provide low latency– Fault tolerance

(source: Data Center Knowledge)

Page 16: Jeff Terrace Michael J. Freedman

TAILTAIL

Multi-Datacenter CRAQ

HEAD

HEAD

Replica

Replica

Replica

Replica

Replica

Replica

Replica

Replica

Replica

Replica

TAILTAILReplic

aReplic

a

Replica

Replica

DC1

DC2

DC3

Page 17: Jeff Terrace Michael J. Freedman

Multi-Datacenter CRAQ

HEAD

HEAD

Replica

Replica

Replica

Replica

Replica

Replica

Replica

Replica

Replica

Replica

TAILTAILReplic

aReplic

a

Replica

Replica

ClientClient

DC1

DC2

DC3

ClientClient

Page 18: Jeff Terrace Michael J. Freedman

Motivation

1. Popular vs. scarce objects

2. Subset relevance

3. Datacenter diversity

4. Write locality

Solution

1. Specify chain size

2. List datacenters− dc1, dc2, … dcN

3. Separate sizes– dc1, chain_size1, …

4. Specify master

Chain Configuration

Page 19: Jeff Terrace Michael J. Freedman

Master Datacenter

HEAD

HEAD

Replica

Replica

Replica

Replica

Replica

Replica

Replica

Replica

Replica

Replica

DC1

DC2

HEAD

HEAD

Writer

Writer

TAILTAIL

TAILTAIL

Replica

Replica

Replica

Replica

DC3

Page 20: Jeff Terrace Michael J. Freedman

Implementation

• Approximately 3,000 lines of C++

• Uses Tame extensions to SFS asynchronousI/O and RPC libraries

• Network operations use Sun RPC interfaces

• Uses Yahoo’s ZooKeeper for coordination

Page 21: Jeff Terrace Michael J. Freedman

Coordination Using ZooKeeper

• Stores chain metadata

• Monitors/notifies about node membership

CRAQ

CRAQ CRAQCRAQ

CRAQCRAQ

CRAQCRAQ

CRAQCRAQ

CRAQCRAQ

CRAQ

CRAQ

CRAQCRAQ

CRAQCRAQ

DC1

DC3

DC2

ZooKeeper

ZooKeeper

ZooKeeper

ZooKeeper

ZooKeeper

ZooKeeper

Page 22: Jeff Terrace Michael J. Freedman

Evaluation

• Does CRAQ scale vs. CR?

• How does write rate impact performance?

• Can CRAQ recover from failures?

• How does WAN effect CRAQ?

• Tests use Emulab network emulation testbed

Page 23: Jeff Terrace Michael J. Freedman

Read Throughput as Writes Increase

1x-

3x-

7x-

Page 24: Jeff Terrace Michael J. Freedman

Failure Recovery (Read Throughput)

Page 25: Jeff Terrace Michael J. Freedman

Failure Recovery (Latency)

Time (s) Time (s)

Page 26: Jeff Terrace Michael J. Freedman

Geo-replicated Read Latency

Page 27: Jeff Terrace Michael J. Freedman

If Single Object Put/Get Insufficient

• Test-and-Set, Append, Increment– Trivial to implement– Head alone can evaluate

• Multiple object transaction in same chain– Can still be performed easily– Head alone can evaluate

• Multiple chains– An agreement protocol (2PC) can be used– Only heads of chains need to participate– Although degrades performance (use

carefully!)

Page 28: Jeff Terrace Michael J. Freedman

Summary

• CRAQ Contributions?– Challenges trade-off of consistency vs.

throughput

• Provides strong consistency

• Throughput scales linearly for read-mostly

• Support for wide-area deployments of chains

• Provides atomic operations and transactions

ThankYou

Questions?