TRANSCRIPT
Phase 2 cleanup and look ahead
Jeff Chase
CPS 512 Fall 2015
Part 1. Managing replicated server groups
These questions pertain to managing server groups with replication, as in e.g., Chubby, Dynamo, and the classical Replicated State Machine (RSM) model introduced in the 1978 Lamport paper Time, Clocks, and the Ordering of Events…. Answer each question with a few phrases or maybe a sentence or two. [40 points]
a) Consistent Hashing algorithms select server replicas to handle each object (e.g., each key/value pair). They use hash functions to assign each object and each server a value within a range referred to as a “ring” or “unit circle”. In what sense is the range a “ring” or “circle”, and why is it useful/necessary to view it this way?
The range is a “ring” because the lowest value in the range is treated as the immediate successor of the highest value in the range for the purpose of finding the replicas for an object. The N replicas for an object are the N servers whose values are the nearest successors of the object’s value on the ring. If there are not N servers whose values are higher than the object’s value, then the search wraps around from the end of the range (the highest value) to the start of the range (the lowest value).
b) In practice, it is important for consistent hashing systems to assign each server multiple values for multiple points on the ring. Why?
If a server leaves the system, its immediate successor(s) on the ring take responsibility for its objects. By giving each server multiple values (tokens or virtual servers), the server’s load is spread more evenly among many servers if it leaves. Similarly, if a new server is assigned multiple values on the ring, it draws load from more servers. Another benefit is that a server’s load is proportional to the number of tokens it has, so it is easy to weight each server’s load to match its capacity.
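To make the wrap-around and virtual nodes concrete, here is a minimal sketch of a consistent hashing ring (hypothetical names such as ring_hash and tokens_per_server; not the implementation any of these systems actually uses):

```python
import hashlib
from bisect import bisect_right

def ring_hash(key: str) -> int:
    # Map any string to a point on the ring [0, 2^32).
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

class ConsistentHashRing:
    def __init__(self, servers, tokens_per_server=8):
        # Each server is hashed at several points ("virtual nodes" / tokens).
        self.points = sorted(
            (ring_hash(f"{s}#{i}"), s)
            for s in servers
            for i in range(tokens_per_server)
        )
        self.hashes = [h for h, _ in self.points]

    def replicas(self, key, n):
        # Walk clockwise from the key's point; wrap around the end of the
        # range (this wrap is what makes the range a "ring").
        start = bisect_right(self.hashes, ring_hash(key))
        chosen = []
        for i in range(len(self.points)):
            server = self.points[(start + i) % len(self.points)][1]
            if server not in chosen:
                chosen.append(server)
            if len(chosen) == n:
                break
        return chosen

ring = ConsistentHashRing(["serverA", "serverB", "serverC", "serverD"])
print(ring.replicas("some-key", n=3))   # the 3 distinct successors on the ring
```

With several tokens per server, removing a server hands its keys to many different successors rather than dumping them all on a single neighbor.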
c) In the classical RSM model, it is important for all replicas to apply all operations in the same order. How do Chubby, Dynamo, and the Lamport RSM system achieve this safety property during failure-free operation?
Chubby and Dynamo: primary/coordinator assigns a sequence number for each update; replicas apply updates in sequence number order. In Chubby the primary is always unique and assigns a total order to the updates. Dynamo may fall back to a partial order using vector clocks if the primary/coordinator is not unique (generally because of a failure). Lamport RSM: operations are stamped with the requester’s logical clock. If two ops have the same stamp, their order is determined by their senders’ unique IDs. Thus the order is total.
CPS 512/590 first midterm exam, 10/6/2015
Dynamo vs. Chubby (replication)
• Chubby: primary/backup, synchronous, consensus
– Write to all, wait for at least a majority to accept (W > N/2)
• Dynamo: no designated primary! A key may have multiple coordinators (if views diverge).
– Asynchronous: write to all, but may wait for W < N/2 to accept.
[Figure: client requests flow to the Chubby primary, vs. to a Dynamo coordinator for the key.]
Part 1. Managing replicated server groups…
d) The Lamport RSM system is fully asynchronous and does not handle failures of any kind. Even so, it requires each peer to acknowledge every operation back to its sender. Why is this necessary, i.e., what could go wrong if acknowledgements are not received?
The acknowledgements inform the ack receiver of the ack sender’s logical clock. Since each node receives messages from a given peer in the order they are sent (program order), each ack allows the ack receiver to know that it has received any operations sent with an earlier timestamp from the ack sender. If acks from a peer are lost, the other peers cannot commit any operations, because they cannot know if the peer has sent an earlier request that is still in the network.
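A minimal sketch of this ordering rule, assuming FIFO channels (hypothetical class and field names, not the paper's exact presentation): each node applies the operation at the head of its timestamp-ordered queue only after it has heard a message, possibly just an ack, with a larger timestamp from every peer.

```python
import heapq

class LamportNode:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers                          # IDs of all other nodes
        self.clock = 0
        self.queue = []                             # pending ops: (ts, sender, op)
        self.latest_seen = {p: 0 for p in peers}    # last timestamp heard from each peer

    def on_message(self, ts, sender, payload):
        # Lamport clock rule: jump past any timestamp we have seen.
        self.clock = max(self.clock, ts) + 1
        self.latest_seen[sender] = ts
        if payload is not None:                     # an operation (acks carry None)
            heapq.heappush(self.queue, (ts, sender, payload))
        return self._apply_ready()

    def _apply_ready(self):
        # Apply an op only when every peer has sent something with a later
        # timestamp: with FIFO channels, no earlier op can still be in flight.
        applied = []
        while self.queue and all(self.latest_seen[p] > self.queue[0][0]
                                 for p in self.peers):
            applied.append(heapq.heappop(self.queue))
        return applied
```

Ties on timestamps are broken by the sender ID in the (ts, sender, op) tuple, so all nodes apply operations in the same total order.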
Part 2. Leases
These questions pertain to leases and leased locks, as used in Chubby, Lab #2, and distributed file systems like NFSv4. Answer each question with a few phrases or maybe a sentence or two. [50 points]
a) In Chubby (and in Lab #2), applications acquire and release locks with explicit acquire() and release() primitives. Leases create the possibility that an application thread holding a lock (the lock holder) loses the lock before releasing it. Under what conditions can this occur?
A network partition. That is the only case! (OK, it could also happen if the lock service fails.) Otherwise, the client renews the lease as needed until the application releases the lock. The lock server does not deny a lease renewal request from a client, even if another client is contending for the lock.
For this question it was not enough to say “the lease expires before the application releases the lock”, because a correct client renews the lease in that scenario, and the lease is lost only if some failure prevents the renewal from succeeding.
Lease renewal
[Figure: timing diagram — the app calls acq(), the client acquires the lock from the server (acquire/grant), and the app computes; the client renews the lease (renew/grant) even while another client's acquire causes a recall; the app finishes computing and calls rel(), the client releases, and the server grants the lock to the waiting client.]
Lock server honors lease renewal request despite pending recall / acquire from another client.
Lock token caching
[Figure: timing diagram — the app calls acq(), the client acquires the lock (acquire/grant), the app computes and calls rel(); the client caches the lock and keeps renewing the lease (renew/grant); a later acq() is granted locally from the cached lock; when a recall arrives and the lock is free, the client releases it and the server grants it to the other client.]
Client may cache lock ownership for future use, even after application releases the lock. Client honors recall if the lock is free.
Part 2. Leases…
b) In this circumstance (a), how does the lock client determine that the lease has been lost? How does the application determine that the lock has been lost? Be sure to answer for both Chubby and your Lab#2 code.
For both, the lock client knows the lease is lost if the lease expires, i.e., the client receives no renewal response before the expiration time.
Chubby: the application receives an exception/notification that it is in jeopardy and (later) that it has lost its session with Chubby.
Lab #2: I accepted various answers. You were not required to handle this case. In class, I suggested that you renew leases before they expire (by some delta less than the lock hold time in the application), and do not grant an acquire while a renewal is pending. That avoids this difficult case. Note that it is not sufficient to notify the application with a message: for actors, the notification message languishes at the back of the mailbox while the actor’s thread is busily and happily computing, believing that it still holds the lock.
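A minimal sketch of that suggested policy (hypothetical names; your Lab #2 structure and the Chubby client library differ): renew the lease some margin before it expires, and do not hand the lock to the application while a renewal is pending.

```python
import time

class LeasedLock:
    RENEW_MARGIN = 2.0   # renew this many seconds before expiration (assumed delta)

    def __init__(self, server):
        self.server = server              # hypothetical lock-server stub
        self.expires_at = 0.0
        self.renewal_pending = False

    def acquire(self):
        # Block the application while a renewal is in flight, so we never
        # grant a lock whose lease might silently be lost mid-renewal.
        while self.renewal_pending:
            time.sleep(0.05)
        if time.time() >= self.expires_at:
            self.expires_at = self.server.acquire_lease()
        return True

    def maybe_renew(self):
        # Called periodically by a background timer.
        if time.time() >= self.expires_at - self.RENEW_MARGIN:
            self.renewal_pending = True
            try:
                self.expires_at = self.server.renew_lease()
            finally:
                self.renewal_pending = False
```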
c) Chubby maintains a single lease for each client's session with Chubby, rather than a lease for each individual lock that a client might hold. What advantages/benefits does this choice offer for Chubby and its clients?
It is more efficient: the cost of handling lease renewals is amortized across all locks held by a client. It is also more tolerant to transient network failures and server fail-over delays because Chubby renews a client’s session lease automatically on every message received from a client. If the session lease is lost, then the client loses all locks and all other Chubby resources that it holds.
d) Network file systems including NFSv4 use leased locks to maintain consistency of file caches on the clients. In these systems, what events cause a client to request to acquire a lock from the server?
Read and write requests (system calls) by the application to operate on files.
e) In network file systems the server can recall (callback) a lock from a client, to force the client to release the lock for use by another client. What actions must a client take on its file cache before releasing the lock?
If it is a write lease: push any pending writes to the server. Note that this case differs from (a) and (b) in that the NFS client is itself the application that acquires/releases locks, and it can release on demand if the server recalls the lease.
Part 3. Handling writes
These questions pertain to committing/completing writes in distributed storage systems, including Chubby and Dynamo. Answer each question with a few phrases or maybe a sentence or two. [50 points]
a) Classical quorum replication systems like Chubby can commit a write only after agreement from at least a majority of servers. Why is it necessary for a majority of servers to agree to the write?
Generally, majority quorums ensure that any two quorums intersect in at least one node that has seen any given write. In Chubby, majority write is part of a consensus protocol that ensures that the primary knows about all writes even after a fail-over: a majority must vote to elect the new primary, and that majority set must include at least one node that saw any given write.
b) Dynamo allows writes on a given key to commit after agreement from less than a majority of servers (i.e., replicas for the key). What advantages/benefits does this choice offer for Dynamo and its clients?
Lower latency and better availability: it is not necessary to wait for responses from the slower servers, so writes complete quickly even if a majority of servers are slow or unreachable.
c) Given that Chubby and Dynamo can complete a write after agreement from a subset of replicas, how do the other replicas learn of the write under normal failure-free operation?
They are the same: the primary/coordinator sends (multicasts) the write to all replicas. Although they do not wait for all responses, all replicas receive the update from the primary/coordinator if there are no failures.
d) What happens in these systems if a read operation retrieves a value from a server that has not yet learned of a recently completed write? Summarize the behavior for Chubby and Dynamo in one sentence each.
Generally, quorum systems read from multiple replicas: at least one of these replicas was also in the last write set (R+W>N), and so some up-to-date replica returns a fresh value and supersedes any stale data read from other replicas. Dynamo behaves similarly, but it is possible that the read set does not overlap the last write set (R+W<N+1), and in this case alone a Dynamo read may return stale data. For Chubby, this was unintentionally a trick question: the stale read case “cannot happen” because Chubby serves all reads from the primary. (Chubby’s consensus protocol pushes all pending writes to the new primary on fail-over). I accepted a wide range of reasonable answers for Chubby.
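A minimal sketch of the quorum rule behind these answers (hypothetical replica API with store and fetch; replicas are contacted sequentially here for simplicity, where real systems multicast in parallel): a write waits for W acknowledgements, a read collects R versioned replies and returns the freshest; if R + W > N the read set intersects the last write set, so stale replies are superseded.

```python
def quorum_write(replicas, key, value, version, W):
    # Send to all replicas, return once the first W acknowledgements arrive.
    acks = 0
    for r in replicas:
        if r.store(key, value, version):
            acks += 1
        if acks >= W:
            return True          # remaining replicas learn the write lazily
    return False

def quorum_read(replicas, key, R):
    # Collect R replies and return the freshest (highest-version) value.
    replies = []
    for r in replicas:
        replies.append(r.fetch(key))   # (version, value), possibly stale
        if len(replies) >= R:
            break
    # If R + W > N, at least one reply comes from the last write quorum.
    return max(replies, key=lambda reply: reply[0])
```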
Quorum write
[Figure: the primary or coordinator multicasts the write to the backups, waits for a quorum of "accept" responses, and then commits the write; lagging accepts arrive later.]
Quorum read
[Figure: as above, plus a read sent to multiple replicas; the read completes once enough results arrive, and a fresh result from a replica in the last write quorum supersedes stale results from lagging replicas.]
Part 3. Handling writes…
e) Classical ACID transaction systems differ from Dynamo and Chubby in that they require agreement from every server involved in a transaction in order to commit (e.g., as in the two-phase commit protocol). Why?
Atomicity (A) requires that each server that stores any data accessed by the transaction must know its commit order relative to other transactions. This is necessary to ensure that all servers apply data updates in the same (serializable) commit order.
Each server must also know whether/when the transaction commits or aborts so that it may log updates to persistent storage (on commit) or roll them back (on abort), and release any locks held by the transaction. This is necessary for the “ID” properties.
Part 4. CAP and all that
The now-famous “CAP theorem” deals with tradeoffs between Consistency and Availability in practical distributed systems. Answer these questions about CAP. Please keep your answers tight. [60 points]
a) Eric Brewer's paper CAP Twelve Years Later summarizes the CAP theorem this way: “The CAP theorem asserts that any networked shared-data system can have only two of three desirable properties.” It then goes on to say that “2 of 3 is misleading”, citing a blog by Coda Hale that suggests a system cannot truly be “CA”, and that the only real choice is between “CP” and “AP”. Do you agree? Explain.
A good argument is: any system could be subject to network partitions. The question is, how does the system behave if a partition occurs and (as a result) some server cannot reach some peer that it contacts to help serve some request? One option is to block or fail the request, which sacrifices availability (A). Another option is to go ahead and serve the request anyway, which sacrifices consistency (C). There are no other options. In other words, you can’t choose not to be “partition-tolerant” (P). If a partition occurs your application will lose either C or A: the question is which.
C-A-P: “choose two”
[Figure: Venn diagram of Consistency, Availability, and Partition-resilience, from Dr. Eric Brewer's “CAP theorem”.]
CA: available, and consistent, unless there is a partition.
AP: a reachable replica provides service even in a partition, but may be inconsistent.
CP: always consistent, even in a partition, but a reachable replica may deny service if it is unable to agree with the others (e.g., quorum).
Getting precise about CAP #1
• What does consistency mean?
• Consistency = the ability to implement an atomic data object served by multiple nodes.
• Requires linearizability of ops on the object.
– Total order for all operations, consistent with causal order, observed by all nodes.
– Also called one-copy serializability (1SR): the object behaves as if there is only one copy, with operations executing in sequence.
– Also called atomic consistency (“atomic”).
Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services. Seth Gilbert, Nancy Lynch. MIT manuscript.
Getting precise about CAP #3
• Theorem. It is impossible to implement an atomic data object that is available in all executions.
– Proof. Partition the network. A write on one side is not seen by a read on the other side, but the read must return a response.
• Corollary. Applies even if messages are delayed arbitrarily, but no message is lost.
– Proof. The service cannot tell the difference.
Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services. Seth Gilbert, Nancy Lynch. MIT manuscript.
Looking ahead: Spanner
• Tabular data (semi-relational), SQL-like queries
• Table data is sharded into “tablets” (from BigTable).
– Each tablet is replicated in a Paxos group (many tablets per group).
– Globally distributed
• ACID transactions across many tablets.
• Uses 2PL and 2PC
– But each participant is a replicated Paxos group!
• New goal: snapshot-isolated reads in the past.
– Keep multiple versions of each item.
– Timestamp all versions with a unique timestamp for the transaction (T) that wrote it, conforming to serialization order (hard).
– A read @ any timestamp gives an isolated read transaction! (A multi-version sketch follows.)
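A minimal sketch of the multi-version storage this goal requires (hypothetical structure, not Spanner's actual tablet format): each key keeps a list of timestamped versions, and a snapshot read at timestamp s returns the latest version no newer than s.

```python
from collections import defaultdict

class MultiVersionStore:
    def __init__(self):
        # key -> list of (commit_timestamp, value), kept sorted by timestamp
        self.versions = defaultdict(list)

    def commit_write(self, key, value, commit_ts):
        # Timestamps are assigned in serialization order (the hard part).
        self.versions[key].append((commit_ts, value))
        self.versions[key].sort(key=lambda v: v[0])

    def snapshot_read(self, key, read_ts):
        # Return the latest version written at or before read_ts.
        candidates = [v for ts, v in self.versions[key] if ts <= read_ts]
        return candidates[-1] if candidates else None

store = MultiVersionStore()
store.commit_write("x", "old", commit_ts=10)
store.commit_write("x", "new", commit_ts=20)
print(store.snapshot_read("x", read_ts=15))   # -> "old": an isolated read in the past
```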
Paxos: voting among groups of nodes
[Figure: Paxos message exchange between a self-appointed leader L and nodes N — phase 1a Propose (“Can I lead b?”), 1b Promise (“OK, but”), phase 2a Accept (“v?”), 2b Ack (“OK”), phase 3 Commit (“v!”). The leader waits for a majority after phase 1 and again after phase 2; nodes log their promises and accepts, and the value is safe once a majority has accepted it.]
You will see references to a Paxos state machine: it refers to a group of nodes that cooperate using the Paxos algorithm to keep a system with replicated state safe and available (to the extent possible under prevailing conditions). We will discuss it later.
Paxos: Properties
• Paxos is an asynchronous consensus algorithm.
• Paxos (like 2PC) is guaranteed safe.
– Consensus is a stable property: once reached it is never violated; the agreed value is not changed.
• Paxos (like 2PC) is not guaranteed live.
– Consensus is reached if “a large enough subnetwork...is nonfaulty for a long enough time.”
– Otherwise Paxos might never terminate.
– Paxos is more robust than 2PC, which is vulnerable to failure of the leader.
• Equivalent: Raft, Viewstamped Replication (VR). (A minimal acceptor sketch follows.)
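A minimal single-decree sketch of the acceptor side of the exchange pictured above (hypothetical message tuples, not any production Paxos implementation): an acceptor promises not to accept lower-numbered ballots in phase 1, and in phase 2 accepts a value only if it has not promised a higher ballot; reporting any previously accepted value back to the new leader is what keeps the agreed value stable.

```python
class PaxosAcceptor:
    def __init__(self):
        self.promised = -1          # highest ballot promised so far
        self.accepted_ballot = -1
        self.accepted_value = None

    def on_prepare(self, ballot):                       # phase 1a -> 1b
        if ballot > self.promised:
            self.promised = ballot
            # Promise, reporting any value already accepted so the new
            # leader must re-propose it (consensus, once reached, is stable).
            return ("promise", self.accepted_ballot, self.accepted_value)
        return ("nack", self.promised, None)

    def on_accept(self, ballot, value):                 # phase 2a -> 2b
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted_ballot = ballot
            self.accepted_value = value
            return ("accepted", ballot)
        return ("nack", self.promised)
```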
● Replicated log => replicated state machine: all servers execute the same commands in the same order
● Consensus module ensures proper log replication
● System makes progress as long as any majority of servers are up
● Failure model: fail-stop (not Byzantine), delayed/lost messages
[From the Raft Consensus Algorithm slides, March 3, 2013.]
Goal: Replicated Log
[Figure: three servers, each with a Consensus Module, a Log (add, jmp, mov, shl), and a State Machine; clients submit commands (e.g., shl) to the servers.]
2PC: Two-Phase Commit
[Figure: message exchange between the transaction manager/coordinator (TM/C) and the resource managers/participants (RM/P) — precommit or prepare (“commit or abort?”), vote (“here's my vote”), decide, notify (“commit/abort!”).]
RMs validate the transaction and prepare by logging their local updates and decisions.
TM logs commit/abort (the commit point).
If the vote is unanimous to commit, decide to commit; else decide to abort. (A coordinator sketch follows.)
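A minimal sketch of the coordinator's side of this protocol (hypothetical participant API with prepare and finish): collect votes, log the decision at the commit point, then notify the participants; any "no" vote or timeout forces abort.

```python
def two_phase_commit(coordinator_log, participants, tx):
    # Phase 1: ask every participant to prepare and vote.
    votes = []
    for p in participants:
        try:
            votes.append(p.prepare(tx))    # participant logs its updates, then votes
        except TimeoutError:
            votes.append(False)            # a silent participant counts as "no"

    decision = "commit" if all(votes) else "abort"
    coordinator_log.append((tx, decision))  # commit point: the decision is durable

    # Phase 2: notify everyone; reinitiated on recovery if the coordinator fails here.
    for p in participants:
        p.finish(tx, decision)
    return decision
```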
Handling Failures in 2PC
• How to ensure consensus if a site fails during the 2PC protocol?
• Case 1. A participant P fails before preparing.
– Either P recovers and votes to abort, or C times out and aborts.
• Case 2. Each P votes to commit, but C fails before committing.
– Participants wait until C recovers and notifies them of the decision to abort. The outcome is uncertain until C recovers.
Handling Failures in 2PC, continued
• Case 3. P or C fails during phase 2, after the outcome is determined.
– Carry out the decision by reinitiating the 2PC protocol on recovery.
– Note: this is an important general technique: log intentions and restart the protocol on recovery.
– If C fails, the outcome is uncertain until C recovers.
• What if C never recovers?
• 2PC is safe, but not live.
Paxos vs. 2PC
• The fundamental difference is that leader failure can block 2PC, while Paxos is non-blocking (wait-free).
– 2PC stakes everything on a single round with a single leader, but “Paxos keeps trying”.
– Paxos: if agreement fails, then “have another round”.
• Both use logging for continuity across failures.
• But there are differences in problem setting...
– 2PC: agents have “multiple choice with veto power”.
• Unanimity is required to commit (strictly harder).
– Paxos: the consensus value is dictated by the first leader to control a majority: “majority rules”.
– Paxos: quorum writes with “first writer wins forever”.
● At any given time, each server is either:
Leader: handles all client interactions, log replication
● At most 1 viable leader at a time
Follower: completely passive (issues no RPCs, responds to incoming RPCs)
Candidate: used to elect a new leader
● Normal operation: 1 leader, N-1 followers
Server States
[Figure: state transition diagram — a server starts as a Follower; a timeout starts an election and moves it to Candidate; receiving votes from a majority of servers makes the Candidate the Leader; a timeout in Candidate starts a new election; a Candidate that discovers the current leader or a higher term returns to Follower; a Leader that discovers a server with a higher term “steps down” to Follower.]
Terms
● Time is divided into terms: an election followed by normal operation under a single leader
● At most 1 leader per term
● Some terms have no leader (failed election)
● Each server maintains a current term value
● Key role of terms: identify obsolete information (see the sketch below)
[Figure: timeline of Term 1 through Term 5 — each term begins with an election and continues with normal operation; a split vote yields a term with no leader.]
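A minimal sketch of the term rule (hypothetical field names, simplified from the Raft paper): every message carries the sender's term, and a server that sees a higher term adopts it and steps down to follower, which is how obsolete leaders and candidates are retired; messages from lower terms are rejected as obsolete.

```python
class RaftServer:
    def __init__(self):
        self.current_term = 0
        self.state = "follower"     # follower | candidate | leader

    def observe_term(self, term_in_message):
        # Any message with a higher term makes us update our term and step down.
        if term_in_message > self.current_term:
            self.current_term = term_in_message
            self.state = "follower"
            return "stepped_down"
        # Messages from older terms carry obsolete information.
        if term_in_message < self.current_term:
            return "reject_obsolete"
        return "ok"
```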
Timestamp Invariants
• Timestamp order == commit order
• Timestamp order respects global wall-time order
[Figure: transactions T1–T4 on a timeline; their timestamps respect commit order and global wall-time order. (OSDI 2012)]
TrueTime
• “Global wall-clock time” with bounded uncertainty
[Figure: TT.now() returns an interval [earliest, latest] of width 2ε around true time. (OSDI 2012)]
Timestamps and TrueTime
[Figure: a transaction T acquires locks, picks s = TT.now().latest, then observes a commit wait — waiting until TT.now().earliest > s, roughly 2 × average ε — before releasing its locks. (OSDI 2012)]
Commit Wait and Replication
[Figure: a transaction T acquires locks, picks s, and starts consensus; it achieves consensus among the replicas and completes its commit wait, then notifies the slaves and releases its locks. (OSDI 2012) A commit-wait sketch follows.]
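A minimal sketch of the commit-wait rule pictured in these two figures (hypothetical TrueTime stand-in; the real TrueTime is driven by GPS and atomic clocks): pick s = TT.now().latest while holding locks, and release the locks only after TT.now().earliest has passed s, so s is guaranteed to be in the past at every node once the locks are released.

```python
import time

class TrueTime:
    """Hypothetical stand-in: a clock with a known error bound epsilon (seconds)."""
    def __init__(self, epsilon):
        self.epsilon = epsilon

    def now(self):
        t = time.time()
        return (t - self.epsilon, t + self.epsilon)   # (earliest, latest)

def commit_with_wait(tt, apply_writes, release_locks):
    s = tt.now()[1]                 # pick s = TT.now().latest (locks already held)
    apply_writes(s)                 # stamp the writes with s
    while tt.now()[0] <= s:         # commit wait: until TT.now().earliest > s
        time.sleep(tt.epsilon / 10)
    release_locks()                 # s is now definitely in the past everywhere
    return s
```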
Bigtable: A Distributed Storage System for Structured Data
Written by: Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber, Google, Inc.
Presented by: Manoher Shatha & Naveen Kumar Ratkal
Database vs. Bigtable
– Relational model: most databases support it; Bigtable does not.
– Atomic transactions: in a database all transactions are atomic; Bigtable's support is limited.
– Data types: a database supports many data types; Bigtable stores uninterpreted strings of characters.
– ACID: databases yes; Bigtable no.
– Operations: insert, delete, update, etc. (database) vs. read, write, update, delete, etc. (Bigtable).
Data Model
[Figure 1: Web Table — row key “com.cnn.www”; the “contents:” column holds <html> page versions at timestamps t3, t5, and t6; the “anchor:cnnsi.com” and “anchor:my.look.ca” cells hold the anchor text “CNN” (t9) and “CNN.com” (t8).]
Bigtable is a multidimensional map.
The map is indexed by row key, column key, and timestamp:
i.e., (row: string, column: string, time: int64) → string.
Rows are ordered lexicographically by row key.
The row range for a table is dynamically partitioned; each row range is called a “tablet”.
Columns: the syntax is family:qualifier.
Cells can store multiple timestamped versions of data. (A sketch of this map follows.)
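A minimal sketch of this multidimensional map (hypothetical class MiniBigtable; the real Bigtable stores it in tablets and SSTables, as described below): a mapping from (row, column, timestamp) to an uninterpreted string, where the column key is family:qualifier and a cell can hold several timestamped versions.

```python
class MiniBigtable:
    def __init__(self):
        # (row, column, timestamp) -> uninterpreted string
        self.cells = {}

    def put(self, row, column, timestamp, value):
        self.cells[(row, column, timestamp)] = value

    def get(self, row, column):
        # Return all timestamped versions of one cell, newest first.
        versions = [(ts, v) for (r, c, ts), v in self.cells.items()
                    if r == row and c == column]
        return sorted(versions, reverse=True)

    def scan_row_range(self, start_row, end_row):
        # Rows are kept in lexicographic order by row key; a contiguous
        # row range like this is what gets split off as a "tablet".
        keys = sorted(k for k in self.cells if start_row <= k[0] < end_row)
        return [(k, self.cells[k]) for k in keys]

t = MiniBigtable()
t.put("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")
t.put("com.cnn.www", "contents:", 6, "<html>...")
print(t.get("com.cnn.www", "contents:"))
```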
Tablet & Table
Tablets contain some range of rows.
[Figure (from Erik Paulson's presentation): a tablet covering the row range “aardvark” to “apple” is made up of SSTables, each consisting of 64K blocks plus an index; the next tablet covers “apple_two” to “boat”.]
Bigtable contains tables, each table is a sequence of tablets, and each tablet contains a set of rows (100-200MB).
SSTables
This is the underlying file format used to store Bigtable data.
SSTables are immutable.
If new data is added, a new SSTable is created.
Old SSTable is set out for garbage collection.
[Figure (from Erik Paulson's presentation): an SSTable consists of 64K blocks plus a block index.]
Tablet Serving
Commit log stores the updates that are made to the data.
Updates are stored in-memory: memtable.
Apply mutations in memory
Log mutations with WAL and group commit
Common log file per tablet server
Periodically write memtable out as an SST
Reads go through the memtable and the SSTable sequence (see the sketch below).
[Figure: tablet representation, taken from the Bigtable paper — a write op goes to the tablet log in GFS and to the memtable in memory; a read op merges the memtable with the SSTable files in GFS.]
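A minimal sketch of this write/read path (hypothetical in-memory stand-ins for the GFS log and SSTable files): log each mutation first, apply it to the memtable, read through the memtable and then the SSTables newest-first, and periodically flush the memtable as a new SSTable.

```python
class TabletServer:
    def __init__(self):
        self.commit_log = []     # stand-in for the tablet log in GFS (the WAL)
        self.memtable = {}       # recent updates, in memory
        self.sstables = []       # immutable dicts standing in for SSTable files

    def write(self, key, value):
        self.commit_log.append((key, value))   # log the mutation first (WAL)
        self.memtable[key] = value             # then apply it in memory

    def read(self, key):
        # Read through the memtable, then the SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for sst in reversed(self.sstables):
            if key in sst:
                return sst[key]
        return None

    def minor_compaction(self):
        # Periodically write the memtable out as a new (immutable) SSTable.
        self.sstables.append(dict(self.memtable))
        self.memtable.clear()
        self.commit_log.clear()   # the flushed mutations no longer need the log
```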
“A small database”
Keep your favorite data structures, in your favorite programming language, in volatile memory.
Push a “pickled” checkpoint (snapshot) to a file in non-volatile memory periodically.
Log transaction events as they occur, and push the log ahead of commit (write-ahead logging, WAL). A minimal sketch follows.
[Birrell/Jones/Wobber, SOSP 87]
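A minimal sketch of this recipe in Python (a dict standing in for “your favorite data structures”; hypothetical file names): write-ahead log each update before applying it, and periodically push a pickled checkpoint of the in-memory state.

```python
import os
import pickle

class SmallDatabase:
    def __init__(self, log_path="db.log", snap_path="db.snap"):
        self.log_path, self.snap_path = log_path, snap_path
        self.data = {}                     # your favorite data structure, in memory

    def put(self, key, value):
        # Write-ahead: push the log record to stable storage before applying.
        with open(self.log_path, "a") as log:
            log.write(repr((key, value)) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.data[key] = value

    def checkpoint(self):
        # Push a "pickled" snapshot of memory to a file, then truncate the log.
        with open(self.snap_path, "wb") as snap:
            pickle.dump(self.data, snap)
        open(self.log_path, "w").close()
```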
Redo: one way to use a Log
• Entries for T record the writes by T (or operations in T).
– Redo logging
• To recover, read the checkpoint and replay committed log entries.
– “Redo” by reissuing writes or reinvoking the methods.
– Redo in order (old to new)
– Skip the records of uncommitted Ts
• No T can be allowed to affect the database until T commits.
– Technique: write-ahead logging (a recovery sketch follows the log figure below)
[Figure: a log, old to new — each record carries a Log Sequence Number (LSN) and a Transaction ID (XID), e.g., LSN 11/XID 18, LSN 12/XID 18, LSN 13/XID 19, and LSN 14/XID 18, which is a commit record.]
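A minimal sketch of redo recovery over a log of this shape (hypothetical record format with lsn, xid, and op fields): find the committed transactions, load the checkpoint, and replay their writes old to new, skipping records of uncommitted transactions.

```python
import pickle

def recover(snapshot_path, log_records):
    # log_records: list of dicts like
    #   {"lsn": 12, "xid": 18, "op": "write", "key": k, "value": v}
    #   {"lsn": 14, "xid": 18, "op": "commit"}
    with open(snapshot_path, "rb") as snap:
        data = pickle.load(snap)                    # start from the checkpoint

    committed = {r["xid"] for r in log_records if r["op"] == "commit"}

    # Redo in order (old to new), skipping records of uncommitted transactions.
    for r in sorted(log_records, key=lambda rec: rec["lsn"]):
        if r["op"] == "write" and r["xid"] in committed:
            data[r["key"]] = r["value"]
    return data
```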