Prophecy: Using History for High‐Throughput Fault Tolerance
Siddhartha Sen
Joint work with Wyatt Lloyd and Mike Freedman
Princeton University
Non‐crash failures happen
Model as Byzantine (malicious)
Mask Byzantine faults
[Figure: Clients and a Replicated service]
Throughput
Linearizability (strong consistency)
Byzantine fault tolerance (BFT)
• Low throughput
• Modifies clients
• Long‐lived sessions
Prophecy
• High throughput + good consistency
• No free lunch:
  – Read‐mostly workloads
  – Slightly weakened consistency
Byzantine fault tolerance (BFT)
• Low throughput → D‐Prophecy
• Modifies clients → Prophecy
• Long‐lived sessions
Traditional BFT reads
[Figure: Clients send a read to the Replica Group; each replica runs the application; "Agree?"]
A cache solution
[Figure: Clients and Replica Group; each replica holds a cache next to the application; "Agree?"]
Problems:
• Huge cache
• Invalidation
A compact cache
[Figure: Clients and Replica Group; each replica holds a cache next to the application]
Store sketches in place of full requests and responses:
Requests        Responses
sketch(req1)    sketch(resp1)
sketch(req2)    sketch(resp2)
sketch(req3)    sketch(resp3)
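A minimal Python sketch of the compact cache above, under stated assumptions. The slides do not say how sketches are computed, so `sketch` here is a hypothetical truncated SHA-256 digest; `table`, `remember`, and `matches` are illustrative names, not part of Prophecy's code.

```python
import hashlib

def sketch(data: bytes, size: int = 8) -> bytes:
    """Return a short, fixed-size digest of data (assumed sketch function)."""
    return hashlib.sha256(data).digest()[:size]

# Compact cache: maps sketch(request) -> sketch(response).
table = {}

def remember(request: bytes, response: bytes) -> None:
    """Record the sketch of the response currently expected for request."""
    table[sketch(request)] = sketch(response)

def matches(request: bytes, response: bytes) -> bool:
    """True if response agrees with the sketch stored for request."""
    return table.get(sketch(request)) == sketch(response)
```

Each entry costs a fixed number of bytes regardless of page size, which is what addresses the "huge cache" problem; a request with no entry never matches.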
A sketcher
[Figure: Clients and Replica Group; each replica holds a sketcher next to the application]
Executing a read
[Figure: animation of a read between Clients and the Replica Group; legend: sketch, webpage; labels: key‐value store, replicated state machine, "Agree?"]
Fast, load‐balanced reads
Maintain a fresh cache
Did we achieve linearizability?
NO!
Executing a read
[Figure: animation of a read between Clients and the Replica Group; legend: sketch, webpage; "Agree?"]
Fast reads may be stale
Load balancing
[Figure: reads spread across the Replica Group; legend: sketch, webpage; "Agree?"]
Pr(k stale) = g^k
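A small worked example of the formula Pr(k stale) = g^k. The slide does not define g; the sketch below assumes g is the probability that a single fast read is served by a replica that returns a stale response (for example, the fraction of faulty replicas), and that each read picks a replica independently at random. The function name is illustrative.

```python
def prob_all_stale(g: float, k: int) -> float:
    """Probability that k consecutive fast reads are all stale,
    assuming each is independently stale with probability g."""
    return g ** k

# Example: 1 of 4 replicas returns stale data, 3 consecutive reads.
print(prob_all_stale(0.25, 3))  # 0.015625
```

Under these assumptions the chance of repeated stale reads falls off geometrically in k.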
D‐Prophecy vs. BFT
Traditional BFT:
• Each replica executes read
• Linearizability
D‐Prophecy:
• One replica executes read
• “Delay‐once” linearizability
[Figure: Clients and Replica Group]
Byzantine fault tolerance (BFT)
• Low throughput → D‐Prophecy
• Modifies clients → Prophecy
• Long‐lived sessions
Key‐exchange overhead
[Figure: plot annotated with 11% and 3%]
Internet services
[Figure: Clients and Replica Group]
A proxy solution
[Figure: Clients, a trusted Proxy running the Sketcher, Replica Group]
Consolidate sketchers
Sketcher must be fail‐stop
• Trust middlebox already
• Small and simple
Executing a read
[Figure: animation of a read q passing from Clients through the trusted Sketcher to the Replica Group; the Sketcher holds a Req/Resp table with an entry for s(q)]
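The read path animated above can be sketched roughly as follows. This is an illustration under stated assumptions, not Prophecy's implementation: `sketch` is a hypothetical truncated SHA-256 digest, replicas are modeled as plain callables, and `replicated_read` uses a simple majority vote as a stand-in for the actual BFT agreement protocol. All class and method names are invented for this example.

```python
import hashlib
import random
from collections import Counter

def sketch(data: bytes) -> bytes:
    """Assumed sketch function: a truncated SHA-256 digest."""
    return hashlib.sha256(data).digest()[:8]

class Sketcher:
    def __init__(self, replicas, rng=None):
        self.replicas = replicas    # callables: request -> response
        self.table = {}             # sketch(request) -> sketch(response)
        self.rng = rng or random.Random()
        self.fast = 0               # reads served by a single replica
        self.replicated = 0         # reads that needed the whole group

    def replicated_read(self, request: bytes) -> bytes:
        # Stand-in for BFT agreement: take the most common response.
        responses = [replica(request) for replica in self.replicas]
        response, _ = Counter(responses).most_common(1)[0]
        self.table[sketch(request)] = sketch(response)  # refresh the table
        self.replicated += 1
        return response

    def read(self, request: bytes) -> bytes:
        # Fast path: ask one randomly chosen replica (load balancing).
        response = self.rng.choice(self.replicas)(request)
        if self.table.get(sketch(request)) == sketch(response):
            self.fast += 1
            return response
        # Sketch missing or mismatched: fall back to the replica group.
        return self.replicated_read(request)
```

In this model a fast read is accepted whenever its sketch matches the table, so a replica that returns an old response matching an old table entry would go undetected; that is one way to see why fast reads may be stale.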
Prophecy
[Figure: Clients, trusted Sketcher with Req/Resp table holding s(q), Replica Group]
Fast, load‐balanced reads
Fast reads may be stale
Delay‐once linearizability
⟨W, R, W, W, R, R, W, R⟩
Read‐after‐write property
Example application
• Upload embarrassing photos
  1. Remove colleagues from ACL
  2. Upload photos
  3. (Refresh)
• Weak consistency may reorder
• Delay‐once preserves order
Byzantine fault tolerance (BFT)
• Low throughput → D‐Prophecy
• Modifies clients → Prophecy
• Long‐lived sessions
Implementation
• Modified PBFT
  – PBFT is stable, complete
  – Competitive with Zyzzyva et al.
• C++, Tamer async I/O
  – Sketcher: ∼2000 LOC
  – PBFT library: ∼1140 LOC
  – PBFT client: ∼1000 LOC
Evaluation
• Prophecy vs. proxied‐PBFT
  – Proxied systems
• D‐Prophecy vs. PBFT
  – Non‐proxied systems
• We will study:
  – Performance on “null” workloads
  – Performance with real replicated service
  – Where system bottlenecks, how to scale
Basic setup
[Figure: Clients (100, concurrent), Sketcher, Replica Group (PBFT)]
Fraction of failed fast reads
Alexa top sites: < 15%
Small benefit on null reads
Apache webserver setup
[Figure: Clients, Sketcher, Replica Group]
Large benefit on real workload
[Figure: plot annotated with 3.7x and 2.0x]
Benefit grows with work
[Figure: plot annotated with 94μs (Apache)]
Null workloads are misleading!
Single sketcher bottlenecks
Scaling out
Scales linearly with replicas
Summary
• Prophecy good for Internet services
  – Fast, load‐balanced reads
• D‐Prophecy good for traditional services
• Prophecy scales linearly while PBFT stays flat
• Limitations:
  – Read‐mostly workloads (measurement study corroborates)
  – Delay‐once linearizability (useful for many apps)
Thank You
Additional slides
Transitions
• Prophecy good for read‐mostly workloads
• Are transitions rare in practice?
Measurement study
• Alexa top sites
• Access main page every 20 sec for 24 hrs
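As a rough illustration of how such a study could count transitions, the Python sketch below compares digests of consecutive page snapshots. The slides do not say how transitions were detected, so the SHA-256 comparison and the names `count_transitions` and `SAMPLES_PER_SITE` are assumptions made for this example.

```python
import hashlib

# One fetch every 20 seconds for 24 hours.
SAMPLES_PER_SITE = 24 * 60 * 60 // 20  # 4320 fetches per site

def count_transitions(snapshots):
    """Count how often a snapshot differs from the one fetched before it."""
    digests = [hashlib.sha256(page).digest() for page in snapshots]
    return sum(1 for prev, cur in zip(digests, digests[1:]) if prev != cur)

def transition_fraction(snapshots):
    """Fraction of consecutive fetch pairs whose content changed."""
    pairs = len(snapshots) - 1
    return count_transitions(snapshots) / pairs if pairs > 0 else 0.0
```

A low transition fraction for a site means most consecutive reads return identical content, which is the read‐mostly case the talk targets.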
Mostly static content
[Figure: plot annotated with 15%]
Dynamic content
• Rabin fingerprinting on transitions
• 43% differ by single contiguous change
• Sampled 4000 of them, over half due to:
  – Load balancing directives
  – Random IDs in links, function parameters