chapter 6 wenbing zhao department of electrical and computer engineering cleveland state university...
TRANSCRIPT
Chapter 6
Wenbing Zhao
Department of Electrical and Computer Engineering
Cleveland State University
[email protected]@ieee.org
Building Dependable Distributed Systems
Building Dependable Distributed Systems, Copyright Wenbing Zhao 1

Outline
Distributed consensus and Paxos algorithms
  Distributed consensus problem
  Classic Paxos
  Multi-Paxos

Distributed Consensus
Distributed consensus is a fundamental problem in distributed computing
  Ensures replica consistency
Asynchronous distributed system
  No upper bound on processing time
  No upper bound on clock drift rate
  No upper bound on network delay
In an asynchronous distributed system, you cannot tell a crashed process from a slow one, even if you can assume that messages are sequenced and retransmitted (an arbitrary number of times) so that they eventually get through

FLP Impossibility Result
FLP (Fischer, Lynch, and Paterson) impossibility result
  A single faulty process can prevent a deterministic algorithm from reaching consensus
  Because a slow process is indistinguishable from a crashed one
Chandra/Toueg
  Showed that the FLP impossibility applies to many problems, not just consensus
  In particular, they show that FLP applies to group membership and reliable multicast
  So these practical problems are impossible to solve deterministically in asynchronous systems
  They also look at the weakest condition under which consensus can be solved
Ways to bypass the impossibility result
  Use an unreliable failure detector
  Use a randomized consensus algorithm

Distributed Consensus
Older generations of consensus algorithms rely on an unreliable failure detector to exclude failed processes from the consensus consideration
  Difficult to understand and harder to prove correct
The Paxos algorithm introduced by Lamport
  Separates safety and liveness properties
  "Paxos is among the simplest and most obvious of distributed algorithms" (Lamport)
  Consensus can be reached during periods of synchrony
  No consensus is possible if the system is very asynchronous, but Paxos guarantees that there is no disagreement

Consensus Problem
Safety:
  Only a value that has been proposed may be chosen
  If a value is chosen by a process, then the same value must be chosen by any other process that has chosen a value
  If a process learns a value, then the value must have been chosen by some process
Liveness:
  Some proposed value is eventually chosen and, if a value has been chosen, then a process can eventually learn the value

The Paxos Algorithm: Consensus for Asynchronous Distributed Systems
Contribution: safety and liveness are considered separately. Safety is always guaranteed, and liveness is ensured during periods of synchrony
Participants of the algorithm are divided into three categories
  Proposers: those who propose values
  Acceptors: those who decide which value to choose
  Learners: those who are interested in learning the value chosen

The Paxos Algorithm
How to choose a value
  Use a single acceptor: straightforward but not fault tolerant
  Use a number of acceptors: a value is chosen if a majority of the acceptors have accepted it
Accepted does not mean chosen. However, if a value has been chosen, it must have been accepted first
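
The majority rule above works because any two majorities of the same acceptor set share at least one acceptor. The following is a minimal sketch of that idea; the names `is_chosen` and `majorities` and the acceptor ids are hypothetical and not part of the slides.

```python
from itertools import combinations
from typing import List, Set


def is_chosen(accepted_by: Set[str], all_acceptors: Set[str]) -> bool:
    """A value is chosen once a majority of all acceptors have accepted it."""
    return len(accepted_by & all_acceptors) > len(all_acceptors) // 2


def majorities(all_acceptors: Set[str]) -> List[Set[str]]:
    """Enumerate every subset that forms a majority (illustration only)."""
    members = sorted(all_acceptors)
    need = len(members) // 2 + 1
    return [set(c)
            for k in range(need, len(members) + 1)
            for c in combinations(members, k)]


acceptors = {"a1", "a2", "a3", "a4", "a5"}  # hypothetical acceptor ids

# Any two majorities intersect, so two different values cannot both be chosen.
all_intersect = all(q1 & q2
                    for q1 in majorities(acceptors)
                    for q2 in majorities(acceptors))
```

With five acceptors, a value accepted by two of them is accepted but not chosen; it becomes chosen only when a third acceptor accepts it.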

The Paxos Algorithm
Requirements for choosing a value
  P1. An acceptor must accept the first proposal that it receives
  P2. If a proposal with value v is chosen, then every higher-numbered proposal that is chosen has value v
Since the proposal numbers are totally ordered, P2 guarantees the safety property

The Paxos Algorithm
How to guarantee P2?
  P2a: If a proposal with value v is chosen, then every higher-numbered proposal accepted by any acceptor has value v
But what if an acceptor that has never accepted v accepts a proposal with value v'?
  P2b: If a proposal with value v is chosen, then every higher-numbered proposal issued by any proposer has value v
P2b implies P2a, which implies P2

The Paxos Algorithm
How to ensure P2b?
  P2c: For any v and n, if a proposal with value v and number n is issued, then there is a set S consisting of a majority of acceptors such that either
    (a) no acceptor in S has accepted any proposal numbered less than n, or
    (b) v is the value of the highest-numbered proposal among all proposals numbered less than n accepted by the acceptors in S

The Paxos Algorithm
To ensure P2c, an acceptor must promise:
  It will not accept any more proposals numbered less than n, once it has responded to a prepare request numbered n

The Paxos Algorithm
Phase 1.
  (a) A proposer selects a proposal number n and sends a prepare request with number n to a majority of acceptors.
  (b) If an acceptor receives a prepare request with number n greater than that of any prepare request to which it has already responded, then it responds to the request with a promise not to accept any more proposals numbered less than n and with the highest-numbered proposal (if any) that it has accepted.

The Paxos Algorithm
Phase 2.
  (a) If the proposer receives a response to its prepare requests (numbered n) from a majority of acceptors, then it sends an accept request to each of those acceptors for a proposal numbered n with a value v, where v is the value of the highest-numbered proposal among the responses, or is any value if the responses reported no proposals.
  (b) If an acceptor receives an accept request for a proposal numbered n, it accepts the proposal unless it has already responded to a prepare request having a number greater than n.
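
The two phases can be sketched as a small in-memory simulation. This is only an illustration under simplifying assumptions: calls are synchronous, nothing is lost or persisted, and the names `Acceptor`, `on_prepare`, `on_accept`, and `propose` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Proposal = Tuple[int, str]  # (proposal number, value)


@dataclass
class Acceptor:
    promised: int = -1                   # highest prepare number responded to
    accepted: Optional[Proposal] = None  # highest-numbered proposal accepted

    def on_prepare(self, n: int) -> Tuple[bool, Optional[Proposal]]:
        """Phase 1(b): promise and report the accepted proposal, or refuse."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def on_accept(self, n: int, value: str) -> bool:
        """Phase 2(b): accept unless a higher-numbered prepare was answered."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors: List[Acceptor], n: int, my_value: str) -> Optional[str]:
    """Run both phases for proposal number n; return the chosen value or None."""
    majority = len(acceptors) // 2 + 1
    # Phase 1(a): send prepare(n) and collect promises.
    promises = []
    for acceptor in acceptors:
        ok, prior = acceptor.on_prepare(n)
        if ok:
            promises.append((acceptor, prior))
    if len(promises) < majority:
        return None
    # Phase 2(a): adopt the value of the highest-numbered reported proposal.
    reported = [prior for _, prior in promises if prior is not None]
    value = max(reported, key=lambda p: p[0])[1] if reported else my_value
    acks = sum(acceptor.on_accept(n, value) for acceptor, _ in promises)
    return value if acks >= majority else None


acceptors = [Acceptor() for _ in range(3)]
first = propose(acceptors, 1, "x")   # no prior proposals, so "x" is chosen
second = propose(acceptors, 2, "y")  # must adopt the already accepted "x"
stale = propose(acceptors, 1, "z")   # number too low, prepare is refused
```

The second proposer wanted "y" but ends up proposing "x", which is how Phase 2(a) keeps an already chosen value from being replaced.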

The Paxos Algorithm (figure)

Importance of Keeping Promises (not accepting older proposals)
  Figure: not a problem if a value has been chosen
  Figure: safety is violated without the promise when proposers compete
  Figure: safety is ensured by the promise when proposers compete

Multi-Paxos: Paxos for State Machine Replication
Client: partially assumes the role of a proposer
Server replicas: acceptors
Value to be agreed on: total ordering of requests sent by clients
Total ordering is accomplished by running a sequence of instances of Paxos
  Each instance is assigned a sequence number, representing the total ordering of the request that is chosen
  Value to be chosen: the request chosen for the instance

Multi-Paxos: Paxos for State Machine Replication
Client: partially assumes the role of a proposer
  Only proposes a value (i.e., the request it sends) without the corresponding proposal number
  A server replica, the primary, decides on the proposal number
Primary: essentially the proposer in Paxos
  Proposes a sequence number to request binding
  Propagates the value chosen (i.e., total ordering info) to other replicas (i.e., learners)
Initial membership is known with a sole primary
  First phase can be omitted during normal operation
  When the primary is suspected, a new primary is elected (view change)
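
A minimal sketch of normal operation as described above, with the first phase omitted and the primary binding sequence numbers to client requests. The class and method names (`Replica`, `Primary`, `order`) are hypothetical, and message passing is replaced by direct calls.

```python
from typing import Dict, List, Optional, Tuple


class Replica:
    """Acceptor and learner; one Paxos instance per sequence number."""

    def __init__(self) -> None:
        self.accepted: Dict[int, Tuple[int, str]] = {}  # seq -> (view, request)

    def on_accept(self, view: int, seq: int, request: str) -> bool:
        # Phase 1 is skipped in normal operation; just record the binding.
        self.accepted[seq] = (view, request)
        return True


class Primary:
    """Sole proposer of the current view."""

    def __init__(self, backups: List[Replica], view: int = 0) -> None:
        self.backups = backups
        self.view = view
        self.next_seq = 1
        self.ordered: Dict[int, str] = {}  # seq -> request chosen

    def order(self, request: str) -> Optional[int]:
        """Bind the next sequence number to the request and try to get it chosen."""
        seq = self.next_seq
        self.next_seq += 1
        # The primary's own acceptance counts as one vote.
        votes = 1 + sum(b.on_accept(self.view, seq, request) for b in self.backups)
        if votes > (len(self.backups) + 1) // 2:
            self.ordered[seq] = request
            return seq
        return None


backups = [Replica(), Replica()]  # f = 1, so 2f+1 = 3 replicas in total
primary = Primary(backups)
seq_a = primary.order("deposit")
seq_b = primary.order("withdraw")
```

Each call to `order` corresponds to one Paxos instance, and the sequence number returned is the position of that request in the total order.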

Multi-Paxos: Paxos for State Machine Replication
Figure: normal operation of Multi-Paxos

Multi-Paxos: Checkpointing and Garbage Collection
Paxos is open-ended: it never terminates
  A proposer is allowed to initiate a new proposal even if every acceptor has accepted a proposal
  An acceptor must remember the last proposal that it has accepted and the highest proposal number it has promised
In Multi-Paxos, every replica must remember such info for every instance of Paxos: this needs unbounded memory
Solution: periodic checkpointing, e.g., once for every n requests
  Garbage collect logs after taking each checkpoint
  A request or control msg needed by a slow replica may no longer be available after a checkpoint => state transfer
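
The checkpointing scheme can be sketched as follows. This is an illustration only: the class `ReplicaLog`, its fields, and the toy application state (a list of executed requests) are assumptions, not the author's design.

```python
from typing import Dict, List, Optional, Tuple


class ReplicaLog:
    """Message log that is truncated at every checkpoint."""

    def __init__(self, checkpoint_interval: int) -> None:
        self.interval = checkpoint_interval
        self.entries: Dict[int, str] = {}  # seq -> request, since last checkpoint
        self.state: List[str] = []         # toy application state
        self.stable_checkpoint = 0         # seq# of the last checkpoint
        self.snapshot: List[str] = []      # state saved at that checkpoint

    def execute(self, seq: int, request: str) -> None:
        self.entries[seq] = request
        self.state.append(request)
        if seq % self.interval == 0:
            self.take_checkpoint(seq)

    def take_checkpoint(self, seq: int) -> None:
        self.stable_checkpoint = seq
        self.snapshot = list(self.state)
        # Garbage collect every log entry covered by the checkpoint.
        self.entries = {s: r for s, r in self.entries.items() if s > seq}

    def fetch(self, seq: int) -> Optional[Tuple]:
        """Serve a slow replica: the log entry if it is still available,
        otherwise a state transfer from the last checkpoint."""
        if seq in self.entries:
            return ("entry", self.entries[seq])
        if seq <= self.stable_checkpoint:
            return ("state_transfer", self.stable_checkpoint, self.snapshot)
        return None


log = ReplicaLog(checkpoint_interval=3)
for i in range(1, 5):
    log.execute(i, f"r{i}")
```

After request 3 the log keeps only request 4, so a slow replica asking for request 2 receives the checkpointed state instead of the original message.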

Multi-Paxos: Leader Election and View Change
Leader election: can be done using a full Paxos instance
New primary needs to determine if a value has been chosen in each incomplete instance of Paxos
Leader election and history determination can be done in a single full Paxos instance: view change

Multi-Paxos: View Change
A set of 2f+1 replicas, replica id: 0, 1, …, 2f
History of system: a sequence of views
  Each view: one and only one primary
  Initially replica 0 assumes the primary role for v=0
  Subsequently, replicas take the primary role in a round-robin fashion
To ensure liveness
  A replica starts a view change timer on the initiation of each instance of Paxos
  If the replica does not learn the request chosen before the timer expires => suspect the primary
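
The round-robin rule and the timer check can be written in a few lines. The function names and the explicit `now` / `started_at` timestamps are assumptions made for illustration.

```python
def primary_of(view: int, f: int) -> int:
    """Replica id that acts as primary in a view, for 2f+1 replicas with ids 0..2f."""
    return view % (2 * f + 1)


def should_suspect(started_at: float, now: float, timeout: float, learned: bool) -> bool:
    """Suspect the primary if the chosen request was not learned before the timer expired."""
    return (not learned) and (now - started_at) >= timeout


# With f = 1 there are three replicas, so the primary role cycles 0, 1, 2, 0, ...
primaries = [primary_of(v, f=1) for v in range(5)]
```

Because the primary of each view is a fixed function of the view number, every replica can work out who the next primary is without extra communication.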

View Change
On suspecting the primary, a replica broadcasts a view change message to all
The current primary, if it is wrongly suspected, joins the view change anyway (i.e., it steps down from the primary role)
A replica joins the view change even if its view change timer has not expired yet
On joining a view change, a replica stops accepting normal control msgs and responds only to checkpointing and view change msgs

View Change
View change msg contains
  New view #
  Seq# of last stable checkpoint
  A set of accepted records since last stable checkpoint
    Each record consists of view#, seq#, request msg
On receiving f+1 view change msgs, the new primary sends a new view msg
  Includes a set of accept msgs
    Includes all accepted msgs reported in the view change msgs
    When a gap in seq# is detected, creates an accept request with a no-op
A replica accepts the new view msg if it has not installed a newer view
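
One way the new primary might assemble the accept set is sketched below. The message layout (a dict with `view`, `checkpoint`, and `accepted` keys), the `NOOP` marker, the use of the highest reported checkpoint as the lower bound, and the choice of the record with the highest view# for a given seq# are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

NOOP = "no-op"                  # hypothetical marker for a filler request
Record = Tuple[int, int, str]   # (view#, seq#, request msg)


def build_new_view(new_view: int, view_changes: List[dict], f: int) -> Optional[List[Record]]:
    """Return the accept msgs for the new view, or None without f+1 view change msgs."""
    msgs = [m for m in view_changes if m["view"] == new_view]
    if len(msgs) < f + 1:
        return None
    low = max(m["checkpoint"] for m in msgs)  # entries up to here are checkpointed
    best: Dict[int, Tuple[int, str]] = {}     # seq# -> (view#, request)
    for m in msgs:
        for view, seq, request in m["accepted"]:
            if seq > low and (seq not in best or view > best[seq][0]):
                best[seq] = (view, request)
    if not best:
        return []
    high = max(best)
    # Fill every gap in the sequence numbers with a no-op request.
    return [(new_view, seq, best[seq][1] if seq in best else NOOP)
            for seq in range(low + 1, high + 1)]


view_changes = [
    {"view": 1, "checkpoint": 10, "accepted": [(0, 11, "a"), (0, 13, "c")]},
    {"view": 1, "checkpoint": 10, "accepted": [(0, 11, "a")]},
]
accepts = build_new_view(1, view_changes, f=1)
```

Here no replica reported seq# 12, so the new primary proposes a no-op for it and the sequence stays contiguous.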

Dynamic Paxos
Designed to accommodate reconfiguration
Extends the majority concept to quorum
  Classic Paxos => uses static quorum
  Dynamic quorum: quorum size may change dynamically
  Cheap Paxos uses dynamic quorum

Dynamic Paxos
Fewer replicas are required by using spare replicas and reconfiguration, provided no other fault occurs during reconfiguration
  Without reconfiguration, 3f+1 replicas can only tolerate up to 3f/2 faulty replicas
  2f+1 active replicas plus f spares can tolerate up to 3f-1 faulty replicas via substitution and reconfiguration
    As long as 1 active replica and 1 spare are operating

Dynamic Paxos
Reconfiguration request must be totally ordered with respect to regular application requests
A reconf request includes both new membership and quorum definition
Replicas in the new membership should not accept msgs unrelated to reconf from replicas that have been excluded from the membership
  External replicas should not be allowed to participate in the consensus step
  A replica mistakenly excluded can join via reconfiguration

Cheap Paxos
Cheap Paxos is a special instance of Dynamic Paxos
Aims to minimize involvement of spare replicas
  Enables the use of f+1 active replicas to tolerate f faults, provided that sufficient spares are available (f or more)
  Active replicas are referred to as main replicas
  Spare replicas are referred to as auxiliary replicas

Cheap Paxos
Primary quorum
  Consists of all active replicas
Secondary quorum
  Must be formed by a majority of the combined replicas
  Consists of at least one main replica => ensures intersection between primary and secondary quorums
Question: what if only one active replica is left?
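
The two quorum definitions can be expressed as simple set predicates. This is a sketch: the function names and replica ids are hypothetical, and "combined replicas" is taken to mean the main and auxiliary replicas of the current configuration.

```python
from typing import Set


def is_primary_quorum(members: Set[str], active_main: Set[str]) -> bool:
    """The primary quorum consists of all active (main) replicas."""
    return bool(active_main) and members == active_main


def is_secondary_quorum(members: Set[str], main: Set[str], auxiliary: Set[str]) -> bool:
    """A majority of the combined replicas that includes at least one main replica."""
    combined = main | auxiliary
    has_majority = len(members & combined) > len(combined) / 2
    has_main = bool(members & main)
    return has_majority and has_main


# f = 1: two main replicas and one auxiliary replica (hypothetical ids).
main = {"m0", "m1"}
auxiliary = {"a0"}
primary_quorum = {"m0", "m1"}
secondary_quorum = {"m0", "a0"}
```

Since the secondary quorum must contain a main replica and the primary quorum contains all active main replicas, the two quorums share at least one replica as long as that main replica is active.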

Cheap Paxos
Upon detection of the failure of an active replica, a reconfiguration request is issued
  New primary quorum still consists of all surviving active replicas
  When the reconfig request is executed, switch to the new configuration
Auxiliary replicas are not bothered unless a reconfiguration is necessary
What if the primary fails => view change

Cheap Paxos: View Change
For history information: the new primary must collect info from every active replica except the old primary
For approval of the primary role, the new primary must collect votes from all surviving active replicas plus one or more auxiliary replicas => a secondary quorum
The secondary quorum is used to complete all Paxos instances started by the old primary but not yet completed

Cheap Paxos
Replicas in the secondary quorum must propagate their knowledge to all replicas prior to moving back to the primary quorum
  So that auxiliary replicas do not have to keep all requests and control msgs
How to achieve this
  Primary notifies all replicas not in the secondary quorum of its latest state
  A main replica (if it is not in the secondary quorum) requests retransmission of missing msgs
  An auxiliary replica keeps the new info from the primary, purges old data, and acks to the primary
  Primary resumes ordering requests after it receives an ack from every replica

Example (four figure slides)

Fast Paxos
Multi-Paxos: uses one round of P1 for all instances of Paxos
Fast Paxos: aims to further reduce the cost of consensus by running one P2a for all instances of Paxos
  Enables an acceptor to select a value (provided by a client) unilaterally and send P2b to the primary
  However, there is no free lunch => we need more replicas to tolerate f faulty nodes (certainly more than 2f+1)
  We will also have to deal with collision recovery

Fast Paxos: Basic Steps
Fast Paxos runs in rounds and each round has up to two phases
In each round,
  1st phase: prepare phase to enable the coordinator to solicit the status and promises
  2nd phase: for the coordinator to select a value to be voted on
  When an acceptor has responded to P1a in round i, it has participated in round i
  When an acceptor has sent P2b in response to P2a, it has cast its vote

Fast Paxos: Basic Steps
Starting from an initial membership, phase 1 will always enable the coordinator to select any value for P2b
  Need to run P1 once for all rounds
  Need to run P2a once for all rounds => eliminates another communication step
A classic round is used only when consensus is not reached in the fast round
  Due to coordinator failure
  Due to a collision
Fast Paxos: Normal Operation

Fast Paxos: Collision Recovery
In the presence of multiple clients, acceptors may choose different requests (values)
The coordinator must collect the values from a quorum of replicas and select the right value
  On detecting a collision, i.e., the presence of multiple values, the coordinator initiates a classic round
  Phase 1 can be omitted in the classic round because the coordinator has already collected enough info

Fast Paxos: Quorum Requirement
A quorum based on majority will not work
  Coordinator cannot rule out any value in the presence of a collision
  Any value could have been chosen
The coordinator must select a value that has been chosen or might have been chosen
  Intuition: coordinator selects a value only if it is present in the majority of the votes collected
  Need to make sure that a value in the minority could not have been chosen

Fast Paxos: Quorum Requirement
How to ensure that a value in the minority could not have been chosen
  First, it cannot be chosen in the same fast round => the majority value makes it impossible
  Second, make sure any two fast quorums (Rf) intersect in more than half of |Rf| acceptors
  Third, make sure any fast quorum Rf and any classic quorum Rc intersect in more than half of |Rc| acceptors

Fast Paxos: Quorum Requirement
Let the total number of acceptors be n, the number of faults tolerated in a fast round be e, and the number of faults tolerated in a classic round be f
Derive inequalities
  (n-f)+(n-f)-n > 0
  (n-e)+(n-e)-n > (n-e)/2
  (n-e)+(n-f)-n > (n-f)/2
Reduce to n > 2f and n > 2e+f
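
The reduction can be worked out step by step. The derivation below is a sketch that assumes e <= f (a fast round tolerates no more faults than a classic round); this assumption is not stated on the slide, but it explains why only two conditions remain.

```latex
\begin{align*}
(n-f)+(n-f)-n > 0
  &\iff n > 2f \\
(n-e)+(n-e)-n > \tfrac{n-e}{2}
  &\iff 2(n-2e) > n-e \iff n > 3e \\
(n-e)+(n-f)-n > \tfrac{n-f}{2}
  &\iff 2(n-e-f) > n-f \iff n > 2e+f
\end{align*}
% Assuming e \le f, we have 2e+f \ge 3e, so n > 2e+f already implies n > 3e.
% The remaining conditions are therefore n > 2f and n > 2e+f.
```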

Fast Paxos: Quorum Requirement
First quorum formation: maximize e => e = f, n > 3f => |Rc| = n-f > 2f => n = 3f+1
Second quorum formation: maximize f
  f < n/2
  e <= n/4
  If e=1, n=4; |Rf| = |Rc| = 3
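
The quorum arithmetic above can be checked numerically. In this sketch the helper names are hypothetical, and the three conditions are the slide's inequalities reduced to n > 2f, n > 3e, and n > 2e+f.

```python
from typing import Dict


def satisfies_requirements(n: int, e: int, f: int) -> bool:
    """Check the three quorum intersection inequalities for n acceptors."""
    classic_classic = (n - f) + (n - f) - n > 0
    fast_fast = (n - e) + (n - e) - n > (n - e) / 2
    fast_classic = (n - e) + (n - f) - n > (n - f) / 2
    return classic_classic and fast_fast and fast_classic


def min_acceptors(e: int, f: int) -> int:
    """Smallest n with n > 2f, n > 3e and n > 2e+f."""
    return max(2 * f, 3 * e, 2 * e + f) + 1


def quorum_sizes(n: int, e: int, f: int) -> Dict[str, int]:
    """Fast quorum size |Rf| = n-e and classic quorum size |Rc| = n-f."""
    return {"fast": n - e, "classic": n - f}


n = min_acceptors(e=1, f=1)
sizes = quorum_sizes(n, e=1, f=1)
```

For e = f = 1 this gives four acceptors and quorums of three, which matches the last line of the slide; more generally, e = f yields n = 3f+1.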

Fast Paxos: Coordinator Value Selection Rule
If no acceptor has cast any vote, the coordinator gets to select any value for P2b
If there is only a single value in the votes, select that value
If there are multiple values, select the one that is in the majority, if one is present. Otherwise, select any value
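
The selection rule can be sketched as a small function. The name `select_value`, the use of `None` for an acceptor that has not voted, and counting the majority over all replies collected from the quorum are assumptions made for illustration; `default` stands for whatever value the coordinator is free to pick.

```python
from collections import Counter
from typing import List, Optional


def select_value(votes: List[Optional[str]], default: str) -> str:
    """Apply the coordinator's rule to the votes reported by a quorum."""
    cast = [v for v in votes if v is not None]
    if not cast:
        return default               # no votes cast: any value may be selected
    counts = Counter(cast)
    if len(counts) == 1:
        return cast[0]               # a single value appears in the votes
    value, count = counts.most_common(1)[0]
    if count > len(votes) / 2:
        return value                 # multiple values, one of them in the majority
    return default                   # no majority value: any value may be selected


no_votes = select_value([None, None, None], default="x")
single = select_value(["a", None, None], default="x")
majority = select_value(["a", "a", "b"], default="x")
split = select_value(["a", "b", None], default="x")
```

In the split case no value could have been chosen in the fast round under the quorum requirement, so the coordinator is free to use its own value.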