TRANSCRIPT
Phase 2 cleanup and look ahead
Jeff Chase
CPS 512 Fall 2015
Part 1. Managing replicated server groups
These questions pertain to managing server groups with replication, as in e.g., Chubby, Dynamo, and the classical Replicated State Machine (RSM) model introduced in the 1978 Lamport paper Time, Clocks, and the Ordering of Events…. Answer each question with a few phrases or maybe a sentence or two. [40 points]
a) Consistent Hashing algorithms select server replicas to handle each object (e.g., each key/value pair). They use hash functions to assign each object and each server a value within a range referred to as a “ring” or “unit circle”. In what sense is the range a “ring” or “circle”, and why is it useful/necessary to view it this way?
The range is a “ring” because the lowest value in the range is treated as the immediate successor of the highest value in the range for the purpose of finding the replicas for an object. The N replicas for an object are the N servers whose values are the nearest successors of the object’s value on the ring. If there are not N servers whose values are higher than the object’s value, then the search wraps around from the end of the range (the highest value) to the start of the range (the lowest value).
b) In practice, it is important for consistent hashing systems to assign each server multiple values for multiple points on the ring. Why?
If a server leaves the system, its immediate successor(s) on the ring take responsibility for its objects. By giving each server multiple values (tokens or virtual servers), the server’s load is spread more evenly among many servers if it leaves. Similarly, if a new server is assigned multiple values on the ring, it draws load from more servers. Another benefit is that a server’s load is proportional to the number of tokens it has, so it is easy to weight each server’s load to match its capacity.
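To make the wrap-around and virtual nodes concrete, here is a minimal sketch of a consistent hashing ring (hypothetical names such as ring_hash and tokens_per_server; not the implementation any of these systems actually uses):

```python
import hashlib
from bisect import bisect_right

def ring_hash(key: str) -> int:
    # Map any string to a point on the ring [0, 2^32).
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

class ConsistentHashRing:
    def __init__(self, servers, tokens_per_server=8):
        # Each server is hashed at several points ("virtual nodes" / tokens).
        self.points = sorted(
            (ring_hash(f"{s}#{i}"), s)
            for s in servers
            for i in range(tokens_per_server)
        )
        self.hashes = [h for h, _ in self.points]

    def replicas(self, key, n):
        # Walk clockwise from the key's point; wrap around the end of the
        # range (this wrap is what makes the range a "ring").
        start = bisect_right(self.hashes, ring_hash(key))
        chosen = []
        for i in range(len(self.points)):
            server = self.points[(start + i) % len(self.points)][1]
            if server not in chosen:
                chosen.append(server)
            if len(chosen) == n:
                break
        return chosen

ring = ConsistentHashRing(["serverA", "serverB", "serverC", "serverD"])
print(ring.replicas("some-key", n=3))   # the 3 distinct successors on the ring
```

With several tokens per server, removing a server hands its keys to many different successors rather than dumping them all on a single neighbor.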
c) In the classical RSM model, it is important for all replicas to apply all operations in the same order. How do Chubby, Dynamo, and the Lamport RSM system achieve this safety property during failure-free operation?
Chubby and Dynamo: primary/coordinator assigns a sequence number for each update; replicas apply updates in sequence number order. In Chubby the primary is always unique and assigns a total order to the updates. Dynamo may fall back to a partial order using vector clocks if the primary/coordinator is not unique (generally because of a failure). Lamport RSM: operations are stamped with the requester’s logical clock. If two ops have the same stamp, their order is determined by their senders’ unique IDs. Thus the order is total.
CPS 512/590 first midterm exam, 10/6/2015
Dynamo vs. Chubby (replication)
• Chubby: primary/backup, synchronous, consensus
– Write to all, wait for at least a majority to accept (W > N/2)
• Dynamo: no designated primary! A key may have multiple coordinators (if views diverge).
– Asynchronous: write to all, but may wait for W < N/2 to accept.
[Figure: client requests flow to the Chubby primary, vs. to a Dynamo coordinator for the key.]
Part 1. Managing replicated server groups…
d) The Lamport RSM system is fully asynchronous and does not handle failures of any kind. Even so, it requires each peer to acknowledge every operation back to its sender. Why is this necessary, i.e., what could go wrong if acknowledgements are not received?
The acknowledgements inform the ack receiver of the ack sender’s logical clock. Since each node receives messages from a given peer in the order they are sent (program order), each ack allows the ack receiver to know that it has received any operations sent with an earlier timestamp from the ack sender. If acks from a peer are lost, the other peers cannot commit any operations, because they cannot know if the peer has sent an earlier request that is still in the network.
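A minimal sketch of this ordering rule, assuming FIFO channels (hypothetical class and field names, not the paper's exact presentation): each node applies the operation at the head of its timestamp-ordered queue only after it has heard a message, possibly just an ack, with a larger timestamp from every peer.

```python
import heapq

class LamportNode:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers                          # IDs of all other nodes
        self.clock = 0
        self.queue = []                             # pending ops: (ts, sender, op)
        self.latest_seen = {p: 0 for p in peers}    # last timestamp heard from each peer

    def on_message(self, ts, sender, payload):
        # Lamport clock rule: jump past any timestamp we have seen.
        self.clock = max(self.clock, ts) + 1
        self.latest_seen[sender] = ts
        if payload is not None:                     # an operation (acks carry None)
            heapq.heappush(self.queue, (ts, sender, payload))
        return self._apply_ready()

    def _apply_ready(self):
        # Apply an op only when every peer has sent something with a later
        # timestamp: with FIFO channels, no earlier op can still be in flight.
        applied = []
        while self.queue and all(self.latest_seen[p] > self.queue[0][0]
                                 for p in self.peers):
            applied.append(heapq.heappop(self.queue))
        return applied
```

Ties on timestamps are broken by the sender ID in the (ts, sender, op) tuple, so all nodes apply operations in the same total order.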
Part 2. Leases
These questions pertain to leases and leased locks, as used in Chubby, Lab #2, and distributed file systems like NFSv4. Answer each question with a few phrases or maybe a sentence or two. [50 points]
a) In Chubby (and in Lab #2), applications acquire and release locks with explicit acquire() and release() primitives. Leases create the possibility that an application thread holding a lock (the lock holder) loses the lock before releasing it. Under what conditions can this occur?
A network partition. That is the only case! (OK, it could also happen if the lock service fails.) Otherwise, the client renews the lease as needed until the application releases the lock. The lock server does not deny a lease renewal request from a client, even if another client is contending for the lock.
For this question it was not enough to say “the lease expires before the application releases the lock”, because a correct client renews the lease in that scenario, and the lease is lost only if some failure prevents the renewal from succeeding.
Lease renewal
[Figure: timing diagram — the app calls acq(), the client acquires the lock from the server (acquire/grant), and the app computes; the client renews the lease (renew/grant) even while another client's acquire causes a recall; the app finishes computing and calls rel(), the client releases, and the server grants the lock to the waiting client.]
Lock server honors lease renewal request despite pending recall / acquire from another client.
Lock token caching
[Figure: timing diagram — the app calls acq(), the client acquires the lock (acquire/grant), the app computes and calls rel(); the client caches the lock and keeps renewing the lease (renew/grant); a later acq() is granted locally from the cached lock; when a recall arrives and the lock is free, the client releases it and the server grants it to the other client.]
Client may cache lock ownership for future use, even after application releases the lock. Client honors recall if the lock is free.
Part 2. Leases…
b) In this circumstance (a), how does the lock client determine that the lease has been lost? How does the application determine that the lock has been lost? Be sure to answer for both Chubby and your Lab#2 code.
For both, the lock client knows the lease is lost if the lease expires, i.e., the client receives no renewal response before the expiration time.
Chubby: the application receives an exception/notification that it is in jeopardy and (later) that it has lost its session with Chubby.
Lab #2: I accepted various answers. You were not required to handle this case. In class, I suggested that you renew leases before they expire (by some delta less than the lock hold time in the application), and do not grant an acquire while a renewal is pending. That avoids this difficult case. Note that it is not sufficient to notify the application with a message: for actors, the notification message languishes at the back of the mailbox while the actor’s thread is busily and happily computing, believing that it still holds the lock.
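A minimal sketch of that suggested policy (hypothetical names; your Lab #2 structure and the Chubby client library differ): renew the lease some margin before it expires, and do not hand the lock to the application while a renewal is pending.

```python
import time

class LeasedLock:
    RENEW_MARGIN = 2.0   # renew this many seconds before expiration (assumed delta)

    def __init__(self, server):
        self.server = server              # hypothetical lock-server stub
        self.expires_at = 0.0
        self.renewal_pending = False

    def acquire(self):
        # Block the application while a renewal is in flight, so we never
        # grant a lock whose lease might silently be lost mid-renewal.
        while self.renewal_pending:
            time.sleep(0.05)
        if time.time() >= self.expires_at:
            self.expires_at = self.server.acquire_lease()
        return True

    def maybe_renew(self):
        # Called periodically by a background timer.
        if time.time() >= self.expires_at - self.RENEW_MARGIN:
            self.renewal_pending = True
            try:
                self.expires_at = self.server.renew_lease()
            finally:
                self.renewal_pending = False
```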
c) Chubby maintains a single lease for each client's session with Chubby, rather than a lease for each individual lock that a client might hold. What advantages/benefits does this choice offer for Chubby and its clients?
It is more efficient: the cost of handling lease renewals is amortized across all locks held by a client. It is also more tolerant to transient network failures and server fail-over delays because Chubby renews a client’s session lease automatically on every message received from a client. If the session lease is lost, then the client loses all locks and all other Chubby resources that it holds.
d) Network file systems including NFSv4 use leased locks to maintain consistency of file caches on the clients. In these systems, what events cause a client to request to acquire a lock from the server?
Read and write requests (system calls) by the application to operate on files.
e) In network file systems the server can recall (callback) a lock from a client, to force the client to release the lock for use by another client. What actions must a client take on its file cache before releasing the lock?
If it is a write lease: push any pending writes to the server. Note that this case differs from (a) and (b) in that the NFS client is itself the application that acquires/releases locks, and it can release on demand if the server recalls the lease.
Part 3. Handling writes
These questions pertain to committing/completing writes in distributed storage systems, including Chubby and Dynamo. Answer each question with a few phrases or maybe a sentence or two. [50 points]
a) Classical quorum replication systems like Chubby can commit a write only after agreement from at least a majority of servers. Why is it necessary for a majority of servers to agree to the write?
Generally, majority quorums ensure that any two quorums intersect in at least one node that has seen any given write. In Chubby, majority write is part of a consensus protocol that ensures that the primary knows about all writes even after a fail-over: a majority must vote to elect the new primary, and that majority set must include at least one node that saw any given write.
b) Dynamo allows writes on a given key to commit after agreement from less than a majority of servers (i.e., replicas for the key). What advantages/benefits does this choice offer for Dynamo and its clients?
Lower latency and better availability: it is not necessary to wait for responses from the slower servers, so writes complete quickly even if a majority of servers are slow or unreachable.
c) Given that Chubby and Dynamo can complete a write after agreement from a subset of replicas, how do the other replicas learn of the write under normal failure-free operation?
They are the same: the primary/coordinator sends (multicasts) the write to all replicas. Although they do not wait for all responses, all replicas receive the update from the primary/coordinator if there are no failures.
d) What happens in these systems if a read operation retrieves a value from a server that has not yet learned of a recently completed write? Summarize the behavior for Chubby and Dynamo in one sentence each.
Generally, quorum systems read from multiple replicas: at least one of these replicas was also in the last write set (R+W>N), and so some up-to-date replica returns a fresh value and supersedes any stale data read from other replicas. Dynamo behaves similarly, but it is possible that the read set does not overlap the last write set (R+W<N+1), and in this case alone a Dynamo read may return stale data. For Chubby, this was unintentionally a trick question: the stale read case “cannot happen” because Chubby serves all reads from the primary. (Chubby’s consensus protocol pushes all pending writes to the new primary on fail-over). I accepted a wide range of reasonable answers for Chubby.
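A minimal sketch of the quorum rule behind these answers (hypothetical replica API with store and fetch; replicas are contacted sequentially here for simplicity, where real systems multicast in parallel): a write waits for W acknowledgements, a read collects R versioned replies and returns the freshest; if R + W > N the read set intersects the last write set, so stale replies are superseded.

```python
def quorum_write(replicas, key, value, version, W):
    # Send to all replicas, return once the first W acknowledgements arrive.
    acks = 0
    for r in replicas:
        if r.store(key, value, version):
            acks += 1
        if acks >= W:
            return True          # remaining replicas learn the write lazily
    return False

def quorum_read(replicas, key, R):
    # Collect R replies and return the freshest (highest-version) value.
    replies = []
    for r in replicas:
        replies.append(r.fetch(key))   # (version, value), possibly stale
        if len(replies) >= R:
            break
    # If R + W > N, at least one reply comes from the last write quorum.
    return max(replies, key=lambda reply: reply[0])
```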
Quorum write
[Figure: the primary or coordinator multicasts the write to the backups, waits for a quorum of "accept" responses, and then commits the write; lagging accepts arrive later.]
Quorum read
[Figure: as above, plus a read sent to multiple replicas; the read completes once enough results arrive, and a fresh result from a replica in the last write quorum supersedes stale results from lagging replicas.]
Part 3. Handling writes…
e) Classical ACID transaction systems differ from Dynamo and Chubby in that they require agreement from every server involved in a transaction in order to commit (e.g., as in the two-phase commit protocol). Why?
Atomicity (A) requires that each server that stores any data accessed by the transaction must know its commit order relative to other transactions. This is necessary to ensure that all servers apply data updates in the same (serializable) commit order.
Each server must also know whether/when the transaction commits or aborts so that it may log updates to persistent storage (on commit) or roll them back (on abort), and release any locks held by the transaction. This is necessary for the “ID” properties.
Part 4. CAP and all that
The now-famous “CAP theorem” deals with tradeoffs between Consistency and Availability in practical distributed systems. Answer these questions about CAP. Please keep your answers tight. [60 points]
a) Eric Brewer's paper CAP Twelve Years Later summarizes the CAP theorem this way: “The CAP theorem asserts that any networked shared-data system can have only two of three desirable properties.” It then goes on to say that “2 of 3 is misleading”, citing a blog by Coda Hale that suggests a system cannot truly be “CA”, and that the only real choice is between “CP” and “AP”. Do you agree? Explain.
A good argument is: any system could be subject to network partitions. The question is, how does the system behave if a partition occurs and (as a result) some server cannot reach some peer that it contacts to help serve some request? One option is to block or fail the request, which sacrifices availability (A). Another option is to go ahead and serve the request anyway, which sacrifices consistency (C). There are no other options. In other words, you can’t choose not to be “partition-tolerant” (P). If a partition occurs your application will lose either C or A: the question is which.
C-A-P: “choose two”
[Figure: Venn diagram of Consistency, Availability, and Partition-resilience, from Dr. Eric Brewer's “CAP theorem”.]
CA: available, and consistent, unless there is a partition.
AP: a reachable replica provides service even in a partition, but may be inconsistent.
CP: always consistent, even in a partition, but a reachable replica may deny service if it is unable to agree with the others (e.g., quorum).
Getting precise about CAP #1
• What does consistency mean?
• Consistency = the ability to implement an atomic data object served by multiple nodes.
• Requires linearizability of ops on the object.
– Total order for all operations, consistent with causal order, observed by all nodes.
– Also called one-copy serializability (1SR): the object behaves as if there is only one copy, with operations executing in sequence.
– Also called atomic consistency (“atomic”).
Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services. Seth Gilbert, Nancy Lynch. MIT manuscript.
Getting precise about CAP #3
• Theorem. It is impossible to implement an atomic data object that is available in all executions.
– Proof. Partition the network. A write on one side is not seen by a read on the other side, but the read must return a response.
• Corollary. Applies even if messages are delayed arbitrarily, but no message is lost.
– Proof. The service cannot tell the difference.
Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services. Seth Gilbert, Nancy Lynch. MIT manuscript.
Looking ahead: Spanner
• Tabular data (semi-relational), SQL-like queries
• Table data is sharded into “tablets” (from BigTable).
– Each tablet is replicated in a Paxos group (many tablets per group).
– Globally distributed
• ACID transactions across many tablets.
• Uses 2PL and 2PC
– But each participant is a replicated Paxos group!
• New goal: snapshot-isolated reads in the past.
– Keep multiple versions of each item.
– Timestamp all versions with a unique timestamp for the transaction (T) that wrote it, conforming to serialization order (hard).
– A read @ any timestamp gives an isolated read transaction! (A multi-version sketch follows.)
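A minimal sketch of the multi-version storage this goal requires (hypothetical structure, not Spanner's actual tablet format): each key keeps a list of timestamped versions, and a snapshot read at timestamp s returns the latest version no newer than s.

```python
from collections import defaultdict

class MultiVersionStore:
    def __init__(self):
        # key -> list of (commit_timestamp, value), kept sorted by timestamp
        self.versions = defaultdict(list)

    def commit_write(self, key, value, commit_ts):
        # Timestamps are assigned in serialization order (the hard part).
        self.versions[key].append((commit_ts, value))
        self.versions[key].sort(key=lambda v: v[0])

    def snapshot_read(self, key, read_ts):
        # Return the latest version written at or before read_ts.
        candidates = [v for ts, v in self.versions[key] if ts <= read_ts]
        return candidates[-1] if candidates else None

store = MultiVersionStore()
store.commit_write("x", "old", commit_ts=10)
store.commit_write("x", "new", commit_ts=20)
print(store.snapshot_read("x", read_ts=15))   # -> "old": an isolated read in the past
```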
Paxos: voting among groups of nodes
[Figure: Paxos message exchange between a self-appointed leader L and nodes N — phase 1a Propose (“Can I lead b?”), 1b Promise (“OK, but”), phase 2a Accept (“v?”), 2b Ack (“OK”), phase 3 Commit (“v!”). The leader waits for a majority after phase 1 and again after phase 2; nodes log their promises and accepts, and the value is safe once a majority has accepted it.]
You will see references to a Paxos state machine: it refers to a group of nodes that cooperate using the Paxos algorithm to keep a system with replicated state safe and available (to the extent possible under prevailing conditions). We will discuss it later.
Paxos: Properties
• Paxos is an asynchronous consensus algorithm.
• Paxos (like 2PC) is guaranteed safe.
– Consensus is a stable property: once reached it is never violated; the agreed value is not changed.
• Paxos (like 2PC) is not guaranteed live.
– Consensus is reached if “a large enough subnetwork...is nonfaulty for a long enough time.”
– Otherwise Paxos might never terminate.
– Paxos is more robust than 2PC, which is vulnerable to failure of the leader.
• Equivalent: Raft, Viewstamped Replication (VR). (A minimal acceptor sketch follows.)
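A minimal single-decree sketch of the acceptor side of the exchange pictured above (hypothetical message tuples, not any production Paxos implementation): an acceptor promises not to accept lower-numbered ballots in phase 1, and in phase 2 accepts a value only if it has not promised a higher ballot; reporting any previously accepted value back to the new leader is what keeps the agreed value stable.

```python
class PaxosAcceptor:
    def __init__(self):
        self.promised = -1          # highest ballot promised so far
        self.accepted_ballot = -1
        self.accepted_value = None

    def on_prepare(self, ballot):                       # phase 1a -> 1b
        if ballot > self.promised:
            self.promised = ballot
            # Promise, reporting any value already accepted so the new
            # leader must re-propose it (consensus, once reached, is stable).
            return ("promise", self.accepted_ballot, self.accepted_value)
        return ("nack", self.promised, None)

    def on_accept(self, ballot, value):                 # phase 2a -> 2b
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted_ballot = ballot
            self.accepted_value = value
            return ("accepted", ballot)
        return ("nack", self.promised)
```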
● Replicated log => replicated state machine: all servers execute the same commands in the same order
● Consensus module ensures proper log replication
● System makes progress as long as any majority of servers are up
● Failure model: fail-stop (not Byzantine), delayed/lost messages
[From the Raft Consensus Algorithm slides, March 3, 2013.]
Goal: Replicated Log
[Figure: three servers, each with a Consensus Module, a Log (add, jmp, mov, shl), and a State Machine; clients submit commands (e.g., shl) to the servers.]
2PC: Two-Phase Commit
[Figure: message exchange between the transaction manager/coordinator (TM/C) and the resource managers/participants (RM/P) — precommit or prepare (“commit or abort?”), vote (“here's my vote”), decide, notify (“commit/abort!”).]
RMs validate the transaction and prepare by logging their local updates and decisions.
TM logs commit/abort (the commit point).
If the vote is unanimous to commit, decide to commit; else decide to abort. (A coordinator sketch follows.)
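A minimal sketch of the coordinator's side of this protocol (hypothetical participant API with prepare and finish): collect votes, log the decision at the commit point, then notify the participants; any "no" vote or timeout forces abort.

```python
def two_phase_commit(coordinator_log, participants, tx):
    # Phase 1: ask every participant to prepare and vote.
    votes = []
    for p in participants:
        try:
            votes.append(p.prepare(tx))    # participant logs its updates, then votes
        except TimeoutError:
            votes.append(False)            # a silent participant counts as "no"

    decision = "commit" if all(votes) else "abort"
    coordinator_log.append((tx, decision))  # commit point: the decision is durable

    # Phase 2: notify everyone; reinitiated on recovery if the coordinator fails here.
    for p in participants:
        p.finish(tx, decision)
    return decision
```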
Handling Failures in 2PC
• How to ensure consensus if a site fails during the 2PC protocol?
• Case 1. A participant P fails before preparing.
– Either P recovers and votes to abort, or C times out and aborts.
• Case 2. Each P votes to commit, but C fails before committing.
– Participants wait until C recovers and notifies them of the decision to abort. The outcome is uncertain until C recovers.
Handling Failures in 2PC, continued
• Case 3. P or C fails during phase 2, after the outcome is determined.
– Carry out the decision by reinitiating the 2PC protocol on recovery.
– Note: this is an important general technique: log intentions and restart the protocol on recovery.
– If C fails, the outcome is uncertain until C recovers.
• What if C never recovers?
• 2PC is safe, but not live.
Paxos vs. 2PC
• The fundamental difference is that leader failure can block 2PC, while Paxos is non-blocking (wait-free).
– 2PC stakes everything on a single round with a single leader, but “Paxos keeps trying”.
– Paxos: if agreement fails, then “have another round”.
• Both use logging for continuity across failures.
• But there are differences in problem setting...
– 2PC: agents have “multiple choice with veto power”.
• Unanimity is required to commit (strictly harder).
– Paxos: the consensus value is dictated by the first leader to control a majority: “majority rules”.
– Paxos: quorum writes with “first writer wins forever”.
● At any given time, each server is either:
Leader: handles all client interactions, log replication
● At most 1 viable leader at a time
Follower: completely passive (issues no RPCs, responds to incoming RPCs)
Candidate: used to elect a new leader
● Normal operation: 1 leader, N-1 followers
Server States
[Figure: state transition diagram — a server starts as a Follower; a timeout starts an election and moves it to Candidate; receiving votes from a majority of servers makes the Candidate the Leader; a timeout in Candidate starts a new election; a Candidate that discovers the current leader or a higher term returns to Follower; a Leader that discovers a server with a higher term “steps down” to Follower.]
Terms
● Time is divided into terms: an election followed by normal operation under a single leader
● At most 1 leader per term
● Some terms have no leader (failed election)
● Each server maintains a current term value
● Key role of terms: identify obsolete information (see the sketch below)
[Figure: timeline of Term 1 through Term 5 — each term begins with an election and continues with normal operation; a split vote yields a term with no leader.]
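A minimal sketch of the term rule (hypothetical field names, simplified from the Raft paper): every message carries the sender's term, and a server that sees a higher term adopts it and steps down to follower, which is how obsolete leaders and candidates are retired; messages from lower terms are rejected as obsolete.

```python
class RaftServer:
    def __init__(self):
        self.current_term = 0
        self.state = "follower"     # follower | candidate | leader

    def observe_term(self, term_in_message):
        # Any message with a higher term makes us update our term and step down.
        if term_in_message > self.current_term:
            self.current_term = term_in_message
            self.state = "follower"
            return "stepped_down"
        # Messages from older terms carry obsolete information.
        if term_in_message < self.current_term:
            return "reject_obsolete"
        return "ok"
```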
Timestamp Invariants
• Timestamp order == commit order
• Timestamp order respects global wall-time order
[Figure: transactions T1–T4 on a timeline; their timestamps respect commit order and global wall-time order. (OSDI 2012)]
TrueTime
• “Global wall-clock time” with bounded uncertainty
[Figure: TT.now() returns an interval [earliest, latest] of width 2ε around true time. (OSDI 2012)]
Timestamps and TrueTime
[Figure: a transaction T acquires locks, picks s = TT.now().latest, then observes a commit wait — waiting until TT.now().earliest > s, roughly 2 × average ε — before releasing its locks. (OSDI 2012)]
Commit Wait and Replication
[Figure: a transaction T acquires locks, picks s, and starts consensus; it achieves consensus among the replicas and completes its commit wait, then notifies the slaves and releases its locks. (OSDI 2012) A commit-wait sketch follows.]
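A minimal sketch of the commit-wait rule pictured in these two figures (hypothetical TrueTime stand-in; the real TrueTime is driven by GPS and atomic clocks): pick s = TT.now().latest while holding locks, and release the locks only after TT.now().earliest has passed s, so s is guaranteed to be in the past at every node once the locks are released.

```python
import time

class TrueTime:
    """Hypothetical stand-in: a clock with a known error bound epsilon (seconds)."""
    def __init__(self, epsilon):
        self.epsilon = epsilon

    def now(self):
        t = time.time()
        return (t - self.epsilon, t + self.epsilon)   # (earliest, latest)

def commit_with_wait(tt, apply_writes, release_locks):
    s = tt.now()[1]                 # pick s = TT.now().latest (locks already held)
    apply_writes(s)                 # stamp the writes with s
    while tt.now()[0] <= s:         # commit wait: until TT.now().earliest > s
        time.sleep(tt.epsilon / 10)
    release_locks()                 # s is now definitely in the past everywhere
    return s
```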
Bigtable: A Distributed Storage System for Structured Data
Written by: Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber, Google, Inc.
Presented by: Manoher Shatha & Naveen Kumar Ratkal
Database vs. Bigtable
– Relational model: most databases support it; Bigtable does not.
– Atomic transactions: in a database all transactions are atomic; Bigtable's support is limited.
– Data types: a database supports many data types; Bigtable stores uninterpreted strings of characters.
– ACID: databases yes; Bigtable no.
– Operations: insert, delete, update, etc. (database) vs. read, write, update, delete, etc. (Bigtable).
Data Model
[Figure 1: Web Table — row key “com.cnn.www”; the “contents:” column holds <html> page versions at timestamps t3, t5, and t6; the “anchor:cnnsi.com” and “anchor:my.look.ca” cells hold the anchor text “CNN” (t9) and “CNN.com” (t8).]
Bigtable is a multidimensional map.
The map is indexed by row key, column key, and timestamp:
i.e., (row: string, column: string, time: int64) → string.
Rows are ordered lexicographically by row key.
The row range for a table is dynamically partitioned; each row range is called a “tablet”.
Columns: the syntax is family:qualifier.
Cells can store multiple timestamped versions of data. (A sketch of this map follows.)
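A minimal sketch of this multidimensional map (hypothetical class MiniBigtable; the real Bigtable stores it in tablets and SSTables, as described below): a mapping from (row, column, timestamp) to an uninterpreted string, where the column key is family:qualifier and a cell can hold several timestamped versions.

```python
class MiniBigtable:
    def __init__(self):
        # (row, column, timestamp) -> uninterpreted string
        self.cells = {}

    def put(self, row, column, timestamp, value):
        self.cells[(row, column, timestamp)] = value

    def get(self, row, column):
        # Return all timestamped versions of one cell, newest first.
        versions = [(ts, v) for (r, c, ts), v in self.cells.items()
                    if r == row and c == column]
        return sorted(versions, reverse=True)

    def scan_row_range(self, start_row, end_row):
        # Rows are kept in lexicographic order by row key; a contiguous
        # row range like this is what gets split off as a "tablet".
        keys = sorted(k for k in self.cells if start_row <= k[0] < end_row)
        return [(k, self.cells[k]) for k in keys]

t = MiniBigtable()
t.put("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")
t.put("com.cnn.www", "contents:", 6, "<html>...")
print(t.get("com.cnn.www", "contents:"))
```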
Tablet & Table
Tablets contain some range of rows.
[Figure (from Erik Paulson's presentation): a tablet covering the row range “aardvark” to “apple” is made up of SSTables, each consisting of 64K blocks plus an index; the next tablet covers “apple_two” to “boat”.]
Bigtable contains tables, each table is a sequence of tablets, and each tablet contains a set of rows (100-200MB).
SSTables
This is the underlying file format used to store Bigtable data.
SSTables are immutable.
If new data is added, a new SSTable is created.
Old SSTable is set out for garbage collection.
[Figure (from Erik Paulson's presentation): an SSTable consists of 64K blocks plus a block index.]
Tablet Serving
Commit log stores the updates that are made to the data.
Updates are stored in-memory: memtable.
Apply mutations in memory
Log mutations with WAL and group commit
Common log file per tablet server
Periodically write memtable out as an SST
Reads go through the memtable and the SSTable sequence (see the sketch below).
[Figure: tablet representation, taken from the Bigtable paper — a write op goes to the tablet log in GFS and to the memtable in memory; a read op merges the memtable with the SSTable files in GFS.]
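A minimal sketch of this write/read path (hypothetical in-memory stand-ins for the GFS log and SSTable files): log each mutation first, apply it to the memtable, read through the memtable and then the SSTables newest-first, and periodically flush the memtable as a new SSTable.

```python
class TabletServer:
    def __init__(self):
        self.commit_log = []     # stand-in for the tablet log in GFS (the WAL)
        self.memtable = {}       # recent updates, in memory
        self.sstables = []       # immutable dicts standing in for SSTable files

    def write(self, key, value):
        self.commit_log.append((key, value))   # log the mutation first (WAL)
        self.memtable[key] = value             # then apply it in memory

    def read(self, key):
        # Read through the memtable, then the SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for sst in reversed(self.sstables):
            if key in sst:
                return sst[key]
        return None

    def minor_compaction(self):
        # Periodically write the memtable out as a new (immutable) SSTable.
        self.sstables.append(dict(self.memtable))
        self.memtable.clear()
        self.commit_log.clear()   # the flushed mutations no longer need the log
```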
“A small database”
Keep your favorite data structures, in your favorite programming language, in volatile memory.
Push a “pickled” checkpoint (snapshot) to a file in non-volatile memory periodically.
Log transaction events as they occur, and push the log ahead of commit (write-ahead logging, WAL). A minimal sketch follows.
[Birrell/Jones/Wobber, SOSP 87]
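A minimal sketch of this recipe in Python (a dict standing in for “your favorite data structures”; hypothetical file names): write-ahead log each update before applying it, and periodically push a pickled checkpoint of the in-memory state.

```python
import os
import pickle

class SmallDatabase:
    def __init__(self, log_path="db.log", snap_path="db.snap"):
        self.log_path, self.snap_path = log_path, snap_path
        self.data = {}                     # your favorite data structure, in memory

    def put(self, key, value):
        # Write-ahead: push the log record to stable storage before applying.
        with open(self.log_path, "a") as log:
            log.write(repr((key, value)) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.data[key] = value

    def checkpoint(self):
        # Push a "pickled" snapshot of memory to a file, then truncate the log.
        with open(self.snap_path, "wb") as snap:
            pickle.dump(self.data, snap)
        open(self.log_path, "w").close()
```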
Redo: one way to use a Log
• Entries for T record the writes by T (or operations in T).
– Redo logging
• To recover, read the checkpoint and replay committed log entries.
– “Redo” by reissuing writes or reinvoking the methods.
– Redo in order (old to new)
– Skip the records of uncommitted Ts
• No T can be allowed to affect the database until T commits.
– Technique: write-ahead logging (a recovery sketch follows the log figure below)
[Figure: a log, old to new — each record carries a Log Sequence Number (LSN) and a Transaction ID (XID), e.g., LSN 11/XID 18, LSN 12/XID 18, LSN 13/XID 19, and LSN 14/XID 18, which is a commit record.]
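A minimal sketch of redo recovery over a log of this shape (hypothetical record format with lsn, xid, and op fields): find the committed transactions, load the checkpoint, and replay their writes old to new, skipping records of uncommitted transactions.

```python
import pickle

def recover(snapshot_path, log_records):
    # log_records: list of dicts like
    #   {"lsn": 12, "xid": 18, "op": "write", "key": k, "value": v}
    #   {"lsn": 14, "xid": 18, "op": "commit"}
    with open(snapshot_path, "rb") as snap:
        data = pickle.load(snap)                    # start from the checkpoint

    committed = {r["xid"] for r in log_records if r["op"] == "commit"}

    # Redo in order (old to new), skipping records of uncommitted transactions.
    for r in sorted(log_records, key=lambda rec: rec["lsn"]):
        if r["op"] == "write" and r["xid"] in committed:
            data[r["key"]] = r["value"]
    return data
```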