HQ Replication: Efficient Quorum Agreement for Reliable Distributed Systems
James Cowling¹, Daniel Myers¹, Barbara Liskov¹, Rodrigo Rodrigues², Liuba Shrira³
¹MIT CSAIL  ²INESC-ID and Instituto Superior Técnico  ³Brandeis University
Byzantine Fault Tolerance
› Reliable client-server distributed systems
  » Server replicated across group of replica machines
› General operations
› Bounded number f of Byzantine replicas
› Must ensure correct system state
  » Consistent ordering of client operations
State of the Art
› Approaches:
  » State Machine Replication – BFT
      3f+1 replicas
  » Byzantine Quorums – Q/U
      5f+1 replicas
      Increased performance
      Degradation when writes contend
Contributions
› Low overhead Byzantine Fault Tolerance
  » Performance of Byzantine Quorums without 5f+1 replicas or contention degradation
› Hybrid Quorum scheme for Byzantine Fault Tolerance
  » Quorum approach in the normal case
  » Use Byzantine agreement to resolve write contention
Outline
› Current Approaches
› HQ Replication
› BFT Improvements
› Performance Evaluation
› Conclusions
State Machine Replication
› BFT - Castro and Liskov TOCS ’02
  » Operations ordered by primary
  » Agreed upon by replicas
[Diagram: messages among Client, Primary, and Replicas 2-4 across the phases Request, Pre-Prepare, Prepare, Commit, Reply]
Byzantine Quorums
› Q/U - Abd-El-Malek et al. SOSP ’05
› Client controlled protocol
  » Replicas order operations independently
› Optimistic
  » Best case one-phase protocol
  » Worst case unbounded
      Randomized backoff
[Diagram: messages between Client and Replicas 1-6 across the phases Update, Reply]
Advantages/Disadvantages
BFT
› Good
  » 3f+1 replicas
  » Bounded number of phases
› Bad
  » Higher latency
  » Quadratic communication
Q/U
› Good
  » Best-case performance
      One-phase write
      Low replica load
› Bad
  » 5f+1 replicas
  » Degraded performance when writes contend
HQ Replication
› 3f+1 replicas
› Supports general operations
› No all-to-all communication in the normal case
› BFT used to resolve contention
HQ Replication
[Diagram: messages between Client and Replicas 1-4 across the phases Write1, Write1 OK, Write2, Write2 OK]
› One-phase read
› Two-phase write
High-level Write Protocol
› Two-phase write protocol
› Phase 1:
  » Client obtains timestamp grant from each replica
› Phase 2:
  » Client forms certificate from 2f+1 matching grants
  » Sends to replicas to complete write
Grants
› Promise to execute operation at given sequence number
  » Assuming agreement from quorum
› Grant
  » Client ID
  » Object ID
  » Hash over requested operation
  » Sequence Number (timestamp)
  » Replica signature
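The grant fields listed above can be sketched as a small record type. This is only an illustration: the field names, the `make_grant` and `verify_grant` helpers, and the use of an HMAC as a stand-in for the replica signature are assumptions, not the authors' implementation.

```python
import hashlib
import hmac
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Grant:
    client_id: int
    object_id: int
    op_hash: bytes     # hash over the requested operation
    seq_no: int        # sequence number (timestamp)
    replica_id: int
    signature: bytes   # stand-in for the replica signature (assumption: HMAC)


def _payload(client_id, object_id, op_hash, seq_no, replica_id) -> bytes:
    # Deterministic encoding of the signed fields.
    return repr((client_id, object_id, op_hash.hex(), seq_no, replica_id)).encode()


def make_grant(replica_id, replica_key, client_id, object_id, operation, seq_no) -> Grant:
    """Build a grant for `operation` (bytes) at `seq_no`, authenticated with replica_key."""
    op_hash = hashlib.sha256(operation).digest()
    payload = _payload(client_id, object_id, op_hash, seq_no, replica_id)
    signature = hmac.new(replica_key, payload, hashlib.sha256).digest()
    return Grant(client_id, object_id, op_hash, seq_no, replica_id, signature)


def verify_grant(grant: Grant, replica_key: bytes) -> bool:
    """Recompute the authentication tag and compare it in constant time."""
    payload = _payload(grant.client_id, grant.object_id, grant.op_hash,
                       grant.seq_no, grant.replica_id)
    expected = hmac.new(replica_key, payload, hashlib.sha256).digest()
    return hmac.compare_digest(expected, grant.signature)
```

Because the tag covers every field, a grant whose sequence number or operation hash has been altered no longer verifies.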
Certificates
› Certificate
  » Quorum (2f+1) matching grants
› Proves quorum of replicas agree to ordering of operation
  » Uniquely identify client, operation and sequential ordering
  » Existence of certificate precludes existence of conflicting certificate
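A minimal sketch of how a client might assemble a certificate from the grants it has collected. The `Grant` tuple and the `form_certificate` helper are hypothetical names, and signature checking is deliberately left out; the sketch only shows the "2f+1 matching grants from distinct replicas" rule.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Grant(NamedTuple):
    replica_id: int
    client_id: int
    object_id: int
    op_hash: str
    seq_no: int


def quorum_size(f: int) -> int:
    return 2 * f + 1


def form_certificate(grants: List[Grant], f: int) -> Optional[List[Grant]]:
    """Return 2f+1 matching grants from distinct replicas, or None if no quorum matches."""
    # Count at most one grant per replica.
    by_replica: Dict[int, Grant] = {}
    for g in grants:
        by_replica.setdefault(g.replica_id, g)

    # Group grants that agree on client, object, operation hash and sequence number.
    groups: Dict[Tuple[int, int, str, int], List[Grant]] = {}
    for g in by_replica.values():
        key = (g.client_id, g.object_id, g.op_hash, g.seq_no)
        groups.setdefault(key, []).append(g)

    for matching in groups.values():
        if len(matching) >= quorum_size(f):
            ordered = sorted(matching, key=lambda g: g.replica_id)
            return ordered[:quorum_size(f)]
    return None
```

With 3f+1 replicas, two disjoint groups of 2f+1 matching grants cannot both exist, which is why at most one group can reach quorum size for a given sequence number.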
Replica State
› Multiple independent objects
› State per-object
  » Certificate supporting most recent write
  » Operation status
      Active – Write in progress, outstanding grant
      Quiescent – No current write operation
Write Phase 1
› Client sends write request to replicas
  » If quiescent, replica assigns new grant to client
  » If active, replica sends currently outstanding grant
› Several Possibilities
  » All grants match
  » Grants for different client
  » Grants conflict
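The phase 1 rules above can be sketched from both sides: what a replica returns, and how a client sorts the replies into the three cases. The names `ReplicaObject` and `classify` are hypothetical, grants are reduced to `(client_id, seq_no, operation)` tuples, and signatures are omitted.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Grant = Tuple[int, int, str]  # (client_id, seq_no, operation)


@dataclass
class ReplicaObject:
    seq_no: int = 0                      # sequence number of the last executed write
    outstanding: Optional[Grant] = None  # None means the object is quiescent

    def handle_write_request(self, client_id: int, operation: str) -> Grant:
        # Quiescent: assign a new grant to this client.
        if self.outstanding is None:
            self.outstanding = (client_id, self.seq_no + 1, operation)
        # Active: return the currently outstanding grant, whoever holds it.
        return self.outstanding


def classify(grants: List[Grant], client_id: int) -> str:
    """Sort a set of phase 1 replies into the three cases a client can observe."""
    if not grants:
        raise ValueError("no grants received")
    if len(set(grants)) > 1:
        return "conflict"            # grants conflict: request resolution
    if grants[0][0] != client_id:
        return "different-client"    # grants for another client: perform writeback
    return "match"                   # all grants match: proceed to phase 2
```

A client that sees "match" can move to phase 2, one that sees "different-client" completes the other client's write first, and one that sees "conflict" asks for contention resolution.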
Isolated Write
[Animation: client 1 and replicas 1-3. Grants are written <client, seq no, operation>i for replica i.]
› Initially, each replica holds State: Quiescent, Client: ?, Seq No: 0, Operation: ?
› Client 1 sends Write A to each replica
› Each replica moves to State: Active, Client: 1, Seq No: 1, Operation: A
› Replica i returns Grant <1,1,A>i (i = 1, 2, 3)
› Matching grants: Phase 2 write
  » Client 1 sends Cert {G1,G2,G3} to each replica
› Each replica executes A
› Each replica moves to State: Quiescent, Client: 1, Seq No: 1, Operation: A
› Each replica returns Result A to client 1
› Write Complete
Incomplete Write
[Animation: clients 1 and 2 and replicas 1-3. Grants are written <client, seq no, operation>i for replica i.]
› Initially, each replica holds State: Quiescent, Client: ?, Seq No: 0, Operation: ?
› Client 1 sends Write A to each replica
› Each replica moves to State: Active, Client: 1, Seq No: 1, Operation: A
› Replica i returns Grant <1,1,A>i to client 1
› Client 1 slow or failed
› Client 2 sends Write B to each replica
› Replicas active: Return current grant
  » Replica i returns Grant <1,1,A>i to client 2
› Grants for different client: Perform Writeback
  » Client 2 sends Cert {G1,G2,G3}, Write B to each replica
› Each replica executes A
› Each replica moves to State: Quiescent, Client: 1, Seq No: 1, Operation: A
› Each replica then handles Write B and moves to State: Active, Client: 2, Seq No: 2, Operation: B
› Replica i returns Grant <2,2,B>i to client 2
› Matching grants: Phase 2 write
Write Contention
[Animation: clients 1 and 2 and replicas 1-3. Grants are written <client, seq no, operation>i for replica i.]
› Initially, each replica holds State: Quiescent, Client: ?, Seq No: 0, Operation: ?
› Client 1 sends Write A and client 2 sends Write B concurrently
› Replicas 1 and 2 receive Write A first and move to State: Active, Client: 1, Seq No: 1, Operation: A
› Replica 3 receives Write B first and moves to State: Active, Client: 2, Seq No: 1, Operation: B
› Grants returned: Grant <1,1,A>1, Grant <1,1,A>2, Grant <2,1,B>3
› Conflicting grants: Request resolution
  » Resolve Request carrying Cert {G1,G2,G3} is sent to each replica
› Contention Resolution
› Each replica executes A, then executes B
› Each replica moves to State: Quiescent, Client: 2, Seq No: 2, Operation: B
› Each replica returns Result A to client 1
› Each replica returns Result B to client 2
Contention Resolution
› BFT module used to resolve contention
  » Establish sequential order on contending ops
› On receiving resolve request:
  » Freeze local object state
  » Send state to primary
› Primary runs BFT on combined state
› Replicas execute contending operations
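A rough sketch of the last two steps, under loud assumptions: the BFT agreement itself is not modelled, the frozen state is reduced to each replica's outstanding `(client_id, seq_no, operation)` grant, and ordering contending operations by client ID is just one possible deterministic rule chosen for illustration. The function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Grant = Tuple[int, int, str]  # (client_id, seq_no, operation)


def order_contending_ops(frozen: Dict[int, Optional[Grant]]) -> List[Tuple[int, str]]:
    """Collect the distinct contending operations from the frozen replica states
    and put them in a deterministic order (assumption: sorted by client ID)."""
    ops = {(g[0], g[2]) for g in frozen.values() if g is not None}
    return sorted(ops)


def execute_in_order(last_seq_no: int,
                     ordered_ops: List[Tuple[int, str]]) -> List[Tuple[int, int, str]]:
    """Assign consecutive sequence numbers to the agreed order.
    Returns (seq_no, client_id, operation) entries, one per executed operation."""
    log = []
    seq_no = last_seq_no
    for client_id, operation in ordered_ops:
        seq_no += 1
        log.append((seq_no, client_id, operation))
    return log
```

Applied to the contention example, where replicas 1 and 2 hold a grant for client 1's Write A and replica 3 holds a grant for client 2's Write B, this yields A at sequence number 1 followed by B at sequence number 2.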
Additional Details
› Read protocol
› State transfer
› Multi-object transactions
› Performance enhancements
Performance Enhancements
› Preferred quorums
  » Core protocol run by only 2f+1 replicas
› Symmetric-key cryptography
  » Authenticators instead of signatures
      Collection of 3f+1 MACs <m_{i,1}, m_{i,2}, …, m_{i,n}>
  » Lower CPU overhead
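An authenticator of this kind can be sketched as a vector with one MAC per replica, each computed with the key the sender shares with that replica. The choice of HMAC-SHA256 and the helper names are assumptions for illustration; the slides do not specify the MAC construction.

```python
import hashlib
import hmac
from typing import List


def make_authenticator(message: bytes, pairwise_keys: List[bytes]) -> List[bytes]:
    """Compute one MAC per receiver; pairwise_keys[j] is shared with receiver j."""
    return [hmac.new(key, message, hashlib.sha256).digest() for key in pairwise_keys]


def verify_entry(message: bytes, authenticator: List[bytes], index: int, key: bytes) -> bool:
    """A receiver checks only its own entry, using the key it shares with the sender."""
    expected = hmac.new(key, message, hashlib.sha256).digest()
    return hmac.compare_digest(expected, authenticator[index])
```

Each receiver can validate only its own entry, which is the trade-off against a signature that any party can check.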
BFT Improvements
› Preferred quorums
  » Reduces degree of quadratic communication
› Single MAC per message
  » Significant improvements over authenticators
Non-Contention Message Overhead
Messages sent/received at each replica per write request
Non-Contention Bandwidth Use
Total bandwidth at each replica per write request
Experimental Setup
› HQ and BFT prototypes deployed on Emulab
  » Up to 16 replicas (f=5), 200 clients (4 per machine)
› New BFT codebase
› Implement counter service
  » Negligible operation payload
  » Multiple objects
      Private non-contention objects
      Shared contention object
Non-contention Throughput
Maximum operation throughput
Resilience to Contention
Throughput degradation with increasing write-contention
BFT Batching
› BFT allows batching at primary
› Greatly reduces internal protocol communication
› Increased delay
[Diagram: messages among Client, Primary, and Replicas 1-3 across the phases Request, Pre-Prepare, Prepare, Commit, Reply; annotated "once per batch"]
Batched Performance
Effect of BFT batching on maximum write throughput
Recommendations
› Use Q/U when
  » Latency critical
  » Contention low
  » 5f+1 replicas acceptable
› Use HQ when
  » Low latency important
  » Moderate contention
› Use BFT when
  » Contention high
  » Throughput more important than latency
Conclusions
› First Byzantine Quorum protocol with 3f+1 replicas
  » Supports general operations
  » Resilient to Byzantine clients
› Introduced Hybrid technique
  » Resolve contention without performance degradation
  » Applicable to general quorum systems
› Found optimized BFT to perform well under high load
Questions?
Further Details
› HQ Replication: Properties and optimizations
  » James Cowling, Daniel Myers, Barbara Liskov, Rodrigo Rodrigues and Liuba Shrira. Technical Memo In Prep., MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, Massachusetts, 2006.
› Contact:
  » [email protected]
  » http://people.csail.mit.edu/cowling/
Write-back Operation
› Write certificate paired with a subsequent request
› Used to ensure progress with slow replicas or clients
  » Completes phase 2 for a slow client
  » Advances state of slow replicas
› Replica processes write phase 2 based on certificate, then the paired request
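A minimal sketch of a replica handling a write-back: it first completes phase 2 for the write named in the certificate, then treats the paired request as a fresh phase 1 request. The class and method names are hypothetical, grants are plain `(client_id, seq_no, operation)` tuples, and signature checks and state transfer are not modelled.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Grant = Tuple[int, int, str]  # (client_id, seq_no, operation)


@dataclass
class ReplicaObject:
    seq_no: int = 0
    outstanding: Optional[Grant] = None
    executed: List[str] = field(default_factory=list)

    def phase1(self, client_id: int, operation: str) -> Grant:
        if self.outstanding is None:
            self.outstanding = (client_id, self.seq_no + 1, operation)
        return self.outstanding

    def phase2(self, cert: List[Grant], f: int) -> bool:
        # A certificate needs 2f+1 matching grants (signatures not checked here).
        if len(cert) < 2 * f + 1 or len(set(cert)) != 1:
            return False
        _client_id, seq_no, operation = cert[0]
        if seq_no > self.seq_no + 1:
            return False  # replica is too far behind; state transfer not modelled
        if seq_no == self.seq_no + 1:
            self.executed.append(operation)
            self.seq_no = seq_no
            self.outstanding = None  # object becomes quiescent again
        return True  # an already-executed write is simply acknowledged

    def writeback(self, cert: List[Grant], f: int,
                  client_id: int, operation: str) -> Optional[Grant]:
        # Process the certificate first, then the paired request.
        if not self.phase2(cert, f):
            return None
        return self.phase1(client_id, operation)
```

In the incomplete-write scenario, client 2 would call `writeback` with client 1's certificate and its own Write B, and receive a grant for B at the next sequence number.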
Backups…
Slow Replicas
› Some grants in quorum have old timestamp
› Perform writeback to slow replicas, using certificate provided with highest grant
  » Brings replicas up to date and solicits new grants
Why 3f+1?
› 3f+1 replicas
  » f of which can be faulty
› 2f+1 agree on any ordering
  » f of these may be Byzantine
  » The remaining f may be slow
› Maximum of 2f can respond with old system state, but not 2f+1
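The counting argument can be checked with a few lines of arithmetic. The helper names below are hypothetical; the sketch only shows that any two quorums of 2f+1 out of 3f+1 replicas share at least f+1 members, so at least one shared member is not faulty.

```python
def min_quorum_overlap(n: int, q: int) -> int:
    """Smallest possible intersection of two size-q subsets of n replicas."""
    return max(0, 2 * q - n)


def quorum_properties(f: int) -> dict:
    n = 3 * f + 1          # total replicas
    q = 2 * f + 1          # quorum size
    overlap = min_quorum_overlap(n, q)
    return {
        "replicas": n,
        "quorum": q,
        "min_overlap": overlap,                 # equals f + 1
        "honest_in_overlap": overlap - f,       # at least one non-faulty replica
    }
```

For f = 1 this gives 4 replicas, quorums of 3, and a minimum overlap of 2, of which at most one replica can be Byzantine.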
› Won’t HQ have a higher rate of contention than Q/U, since it is two-phase (higher latency)?
  » No. The contention window lasts only from when the first replica receives the phase 1 request until the last replica receives it. It is therefore independent of the protocol being two-phase, and is actually smaller than in Q/U.