CSC 536 Lecture 9
Outline
- Case study: Amazon Dynamo
- Brewer’s CAP theorem
- Recovery
Dynamo: Amazon’s key-value storage system
Amazon Dynamo
A data store for applications that require:

- primary-key access to data
- data size < 1 MB
- scalability
- high availability
- fault tolerance
- really low latency

No need for:

- a relational DB: complexity and ACID properties imply little parallelism and low availability
- stringent security: the service is used only by internal services
Amazon apps that use Dynamo
Perform simple read/write operations on single, small (< 1 MB) data objects, each identified by a unique key:

- best seller lists
- shopping carts
- customer preferences
- session management
- sales rank
- product catalog
- etc.
Design Considerations
“ … customers should be able to view and add items to their shopping cart even if disks are failing, network routes are flapping, or data centers are being destroyed by tornados.”
Design Considerations
“Always writeable”

- users must always be able to add to / delete from the shopping cart
- no update is rejected because of failure or concurrent writes
- data must be replicated across data centers
- resolve conflicts during reads, not writes
  - let each application decide for itself how to resolve conflicts

Single administrative domain

- all nodes are trusted (no Byzantine failures) because the service is not intended for external users
Design Considerations
Unstructured data

- no need for hierarchical namespaces
- no need for a relational schema

Very high availability and low latency

- “At least 99.9% of read and write operations to be performed within a few hundred milliseconds”
- “average” or “median” latency is not good enough

Avoid routing requests through multiple nodes

- which would slow things down

Avoid ACID guarantees

- ACID guarantees tend to have poor availability
Design Considerations
Incremental scalability

- adding a single node should not affect the system significantly

Decentralization and symmetry

- all nodes have the same responsibilities
- favor P2P techniques over centralized control
- no single point of failure

Take advantage of node heterogeneity

- nodes with larger disks should store more data
Dynamo API
A key is associated with each stored item.

Operations that are supported:

- get(key): locates the object replicas associated with the key and returns the object, or a list of objects along with version numbers
- put(key, context, item): determines where the item replicas should be placed based on the item key and writes the replicas to disk

The context encodes system metadata about the item, including version information.
Partitioning Algorithm
For scalability, Dynamo makes use of a large number of nodes across clusters and data centers.

Also for scalability, Dynamo must balance the load, using a hash function to map data items to nodes.

To ensure incremental scalability, Dynamo uses consistent hashing.
Partitioning Algorithm
Consistent hashing

- the hash function produces an m-bit number, which defines a circular name space
- each data item has a key and is mapped to a number in the name space obtained using Hash(key)
- nodes are assigned numbers randomly in the name space
- a data item is then assigned to the first clockwise node: the successor function Succ()

In consistent hashing the effect of adding a node is localized: on average, only K/n objects must be remapped (K = # of keys, n = # of nodes).
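As a rough illustration, the Succ() lookup can be sketched with a sorted list of node positions and a binary search. This is only a sketch under simple assumptions; the `Ring` class, 32-bit hash width, and node names are invented for the example, not Dynamo’s actual implementation:

```python
import bisect
import hashlib

def h(s: str, m: int = 32) -> int:
    """Map a string to an m-bit position on the circular name space."""
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % (2 ** m)

class Ring:
    """Consistent-hash ring: Succ(key) is the first node clockwise from Hash(key)."""
    def __init__(self, nodes):
        placed = sorted((h(n), n) for n in nodes)
        self.points = [p for p, _ in placed]   # sorted node positions
        self.nodes = [n for _, n in placed]

    def succ(self, key: str) -> str:
        # first position strictly clockwise of the key, wrapping around
        i = bisect.bisect_right(self.points, h(key)) % len(self.nodes)
        return self.nodes[i]

ring = Ring(["node-a", "node-b", "node-c"])
owner = ring.succ("cart:12345")   # the node responsible for this key
```

Adding a fourth node only reassigns the keys that fall between the new node and its predecessor; every other key keeps its owner, which is the localized-remapping property above.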
Load Distribution
Problem: random assignment of nodes to positions in the ring may produce a non-uniform distribution of data.

Solution: virtual nodes

- assign several random numbers to each physical node: one corresponds to the physical node, the others to virtual ones

Advantages

- if a node becomes unavailable, its load is easily and evenly dispersed across the available nodes
- when a node becomes available, it accepts a roughly equivalent amount of load from the other available nodes
- the number of virtual nodes that a node is responsible for can be decided based on its capacity
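A minimal sketch of capacity-based virtual nodes: each physical node gets one ring token per unit of capacity, so a node with twice the capacity ends up owning roughly twice the keys. Token-naming scheme, capacities, and node names here are hypothetical:

```python
import bisect
import hashlib
from collections import Counter

def h(s: str) -> int:
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % (2 ** 32)

def build_ring(capacities):
    """Place one ring token per virtual node; bigger nodes get more tokens."""
    return sorted((h(f"{node}#vnode-{i}"), node)
                  for node, cap in capacities.items()
                  for i in range(cap))

def owner(tokens, key):
    """The physical node behind the first token clockwise from Hash(key)."""
    points = [p for p, _ in tokens]
    i = bisect.bisect_right(points, h(key)) % len(tokens)
    return tokens[i][1]

# a node with twice the capacity hosts twice as many virtual nodes
tokens = build_ring({"big-node": 200, "small-node": 100})
load = Counter(owner(tokens, f"item-{i}") for i in range(3000))
```

With enough tokens, `load` ends up split roughly 2:1 in favor of the larger node, which is the capacity-aware balancing the slide describes.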
Failures
Amazon has a number of data centers, each consisting of a number of clusters of commodity machines.

- individual machines fail regularly
- sometimes entire data centers fail due to power outages, network partitions, tornados, etc.

To handle failures

- items are replicated
- replicas are spread not only across a cluster but across multiple data centers
Replication
Data is replicated at N nodes.

Succ(key) = coordinator node

The coordinator replicates the object at the N-1 successor nodes in the ring, skipping virtual nodes corresponding to already-used physical nodes.

- Preference list: the list of nodes that store a particular key
- there are actually > N nodes on the preference list, in order to ensure N “healthy” nodes at all times
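The clockwise walk that skips virtual nodes of an already-used physical node can be sketched as below. The ring layout (3 physical nodes, 4 virtual nodes each) and names are made up for illustration:

```python
import bisect
import hashlib

def h(s: str) -> int:
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % (2 ** 32)

# hypothetical ring: 3 physical nodes, 4 virtual nodes each
tokens = sorted((h(f"{node}#vn{i}"), node)
                for node in ("A", "B", "C") for i in range(4))

def preference_list(key: str, n: int) -> list:
    """The first n *distinct* physical nodes clockwise from Hash(key);
    the first entry is the coordinator, Succ(key)."""
    start = bisect.bisect_right([p for p, _ in tokens], h(key))
    result = []
    for step in range(len(tokens)):
        node = tokens[(start + step) % len(tokens)][1]
        if node not in result:       # skip repeats of a physical node
            result.append(node)
        if len(result) == n:
            break
    return result

plist = preference_list("cart:42", n=3)   # coordinator plus 2 successors
```

In a real deployment the walk would continue past N to keep spare “healthy” candidates on the list, as the slide notes.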
Data Versioning
Dynamo provides eventual consistency.

Updates can be propagated to replicas asynchronously

- a put() call may return before all replicas have been updated
- why? to provide low latency and high availability
- implication: a subsequent get() may return stale data

Some apps can be designed to work in this environment

- e.g., the “add-to/delete-from cart” operation
- it’s okay to add to an old cart, as long as all versions of the cart are eventually reconciled

Note: eventual consistency
Data Versioning
Dynamo treats each modification as a new (and immutable) version of the object.

Multiple versions can exist at the same time

- usually, new versions contain the old versions – no problem
- sometimes concurrent updates and failures generate conflicting versions, e.g., if there has been a network partition
Parallel Version Branches
Vector clocks are used to identify causally related versions and parallel (concurrent) versions.

- for causally related versions, accept the final version as the “true” version
- for parallel (concurrent) versions, use some reconciliation technique to resolve the conflict; the reconciliation technique is app-dependent

Typically this is handled by merging

- for add-to-cart operations, nothing is lost
- for delete-from-cart operations, deleted items might reappear after the reconciliation
Parallel Version Branches example
Dk([Sx,i], [Sy,j]): object Dk with vector clock ([Sx,i], [Sy,j]), where [Sx,i] indicates i updates by server Sx and [Sy,j] indicates j updates by server Sy.
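The causal-vs-concurrent test on such clocks can be sketched as follows, representing a clock [Sx,i], [Sy,j] as a dict. The branch names (D3, D4) and server names are illustrative, not taken from a specific slide figure:

```python
def dominates(a: dict, b: dict) -> bool:
    """True if clock a has seen every update b has (b happened-before-or-equals a)."""
    return all(a.get(server, 0) >= n for server, n in b.items())

def compare(a: dict, b: dict) -> str:
    if dominates(a, b) and dominates(b, a):
        return "equal"
    if dominates(a, b):
        return "a-after-b"     # causally related: a supersedes b
    if dominates(b, a):
        return "b-after-a"
    return "concurrent"        # parallel branches: need app-level reconciliation

# two versions updated on different servers after a common ancestor
d3 = {"Sx": 2, "Sy": 1}
d4 = {"Sx": 2, "Sz": 1}
```

Here `compare(d3, d4)` reports the versions as concurrent, the situation where Dynamo returns both versions to the application for reconciliation.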
Execution of get( ) and put( )
Operations can originate at any node in the system.

Clients may

- route the request through a load-balancing coordinator node, or
- use client software that routes the request directly to the coordinator for that object

The coordinator contacts R nodes for reading and W nodes for writing, where R + W > N.
“Sloppy Quorum”
put(): the coordinator writes to the first N healthy nodes on the preference list

- if W writes succeed, the write is considered successful

get(): the coordinator reads from N nodes and waits for R responses

- if they agree, return the value
- if they disagree but are causally related, return the most recent value
- if they are causally unrelated, apply app-specific reconciliation techniques and write back the corrected version
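The read-side logic can be sketched as below: drop any response whose vector clock is strictly dominated by another (causally superseded), and merge whatever concurrent versions survive with an app-supplied function. The `reconcile` helper and the shopping-cart example data are invented for illustration:

```python
def dominates(a: dict, b: dict) -> bool:
    """Clock a has seen at least as many updates as b from every server."""
    return all(a.get(server, 0) >= n for server, n in b.items())

def reconcile(replies, merge):
    """Reduce R replica responses [(vector_clock, value)] to one answer:
    drop causally dominated versions; if several concurrent versions
    survive, combine them with the app-specific `merge` function."""
    frontier = []
    for clock, value in replies:
        if any(dominates(other, clock) and other != clock
               for other, _ in replies):
            continue                       # stale, superseded version
        if (clock, value) not in frontier:
            frontier.append((clock, value))
    if len(frontier) == 1:
        return frontier[0][1]
    return merge([value for _, value in frontier])

# two concurrent shopping-cart versions plus one stale one
replies = [
    ({"Sx": 2, "Sy": 1}, {"book"}),
    ({"Sx": 2, "Sz": 1}, {"pen"}),
    ({"Sx": 1}, {"book"}),        # dominated by the first reply
]
cart = reconcile(replies, lambda versions: set().union(*versions))
```

Merging by set union is the “nothing is lost on add-to-cart” behavior described earlier; deletes can resurface, exactly as the slides warn.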
Hinted Handoff
What if a write operation can’t reach the first N nodes on the preference list?

To preserve availability and durability, store the replica temporarily on another node in the preference list

- accompanied by a metadata “hint” that remembers where the replica should be stored
- this other node will eventually deliver the update to the correct node when it recovers

Hinted handoff ensures that read and write operations don’t fail because of network partitioning or node failures.
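The placement decision can be sketched as a pure function: each of the N intended homes either stores its replica directly, or a healthy stand-in further down the preference list stores it with a hint naming the intended home. Node names and the function shape are hypothetical:

```python
def put_with_handoff(preference_list, healthy, n):
    """Return where each of the n replicas lands as (node, hint) pairs.
    hint is None for a replica on its intended home node; otherwise it
    names the home node the stand-in should hand the data back to once
    that node recovers. Assumes the preference list holds > n nodes."""
    homes = preference_list[:n]                  # intended first-N nodes
    stand_ins = iter(p for p in preference_list[n:] if p in healthy)
    placements = []
    for home in homes:
        if home in healthy:
            placements.append((home, None))
        else:
            placements.append((next(stand_ins), home))   # carry a hint
    return placements

# preference list deliberately longer than N, as the slides note
plist = ["A", "B", "C", "D", "E"]
placements = put_with_handoff(plist, healthy={"A", "C", "D", "E"}, n=3)
```

With B down, its replica lands on D tagged with the hint "B"; on B’s recovery, D forwards the replica home and can delete its copy.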
Handling Permanent Failures
Hinted replicas may be lost before they can be returned to the right node.

Other problems may also cause replicas to be lost or fall out of agreement.

Merkle trees allow two nodes to compare a set of replicas and determine fairly easily

- whether or not they are consistent
- where the inconsistencies are
Merkle trees
Merkle trees have leaves whose values are hashes of the values associated with keys (one key per leaf).

- parent nodes contain hashes of their children
- eventually, the root contains a hash that represents everything in that replica

Each node maintains a separate Merkle tree for each key range (the set of keys covered by a virtual node) it hosts.

- to detect inconsistency between two sets of replicas, compare the roots
- the source of an inconsistency can be located by recursively comparing children
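A small sketch of that comparison, assuming both replicas index the same key range in the same leaf order (the tree-building details, such as duplicating the last node of an odd level, are arbitrary choices for the example):

```python
import hashlib

def sha(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build(leaves):
    """Bottom-up hash levels; levels[-1] holds the single root hash."""
    levels = [[sha(leaf) for leaf in leaves]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        if len(prev) % 2:                 # pad odd levels by duplication
            prev = prev + [prev[-1]]
        levels.append([sha(prev[i] + prev[i + 1])
                       for i in range(0, len(prev), 2)])
    return levels

def diff(a, b):
    """Leaf indices where two replicas' key ranges disagree, found by
    comparing roots and descending only into unequal subtrees."""
    ta, tb = build(a), build(b)
    if ta[-1] == tb[-1]:
        return []                         # roots match: replicas agree
    suspects = [0]                        # node indices at current level
    for level in range(len(ta) - 2, -1, -1):
        nxt = []
        for i in suspects:
            for child in (2 * i, 2 * i + 1):
                if child < len(ta[level]) and ta[level][child] != tb[level][child]:
                    nxt.append(child)
        suspects = nxt
    return suspects

replica_a = [b"v1", b"v2", b"v3", b"v4"]
replica_b = [b"v1", b"x",  b"v3", b"v4"]
```

Only O(log K) hashes travel between the two nodes when a single key diverges, which is why comparing roots first is cheap.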
Membership and Failure Detection
Temporary failures of nodes are possible but shouldn’t cause load re-balancing.

Additions and deletions of nodes are explicitly executed by an administrator.

A gossip-based protocol is used to ensure that every node eventually has a consistent view of the membership list, which records the nodes and the key ranges each is responsible for.
Gossip-based Protocol
Periodically, each node contacts another node in the network, randomly selected
Nodes compare their membership histories and reconcile them
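One plausible shape for such a round, assuming membership histories are version-numbered entries that merge by taking the highest version seen (the data layout here is an assumption, not Dynamo’s wire format):

```python
import random

def gossip_round(views, rng=random):
    """One gossip round: every node contacts a randomly selected peer and
    the pair reconcile membership histories, keeping the higher version
    number seen for each member."""
    for node in list(views):
        peer = rng.choice([n for n in views if n != node])
        merged = dict(views[node])
        for member, version in views[peer].items():
            merged[member] = max(merged.get(member, 0), version)
        views[node] = dict(merged)
        views[peer] = dict(merged)

# three nodes, each initially knowing only its own membership entry
views = {
    "A": {"A": 1},
    "B": {"B": 1},
    "C": {"C": 1},
}
for _ in range(10):     # a few rounds suffice to converge w.h.p.
    gossip_round(views)
```

Because merging only ever takes the maximum version, no membership update is lost, and views spread epidemically until every node eventually holds the full list.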
Load Balancing for Additions and Deletions
When a node is added, it acquires key/value pairs from other nodes in the network.

- nodes learn of the added node through the gossip protocol and contact it to offer their keys, which are then transferred once accepted
- when a node is removed, a similar process happens in reverse

Experience has shown that this approach leads to a relatively uniform distribution of key/value pairs across the system.
Summary

| Problem | Technique | Advantage |
|---|---|---|
| Partitioning | Consistent hashing | Incremental scalability |
| High availability for writes | Vector clocks, reconciled during reads | Version size is decoupled from update rates |
| Temporary failures | Sloppy quorum, hinted handoff | Provides high availability and durability guarantee when some of the replicas are not available |
| Permanent failures | Anti-entropy using Merkle trees | Synchronizes divergent replicas in the background |
| Membership & failure detection | Gossip-based protocol | Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information |
Summary
- High scalability, including incremental scalability
- Very high availability is possible, at the cost of consistency
- App developers can customize the storage system to emphasize performance, durability, or consistency
  - the primary parameters are N, R, and W

Dynamo shows that decentralization and eventual consistency can provide a satisfactory platform for hosting highly-available applications.
Dynamo vs BigTable
Different types of data storage, designed for different needs

- Dynamo optimizes latency
- BigTable emphasizes throughput

More precisely

- Dynamo tends to emphasize network-partition fault tolerance and availability, at the expense of consistency
- BigTable tends to emphasize network-partition fault tolerance and consistency, over availability
Brewer’s CAP theorem
It is impossible for a distributed data store to simultaneously provide all three of:

- Consistency (C)
- Availability (A)
- Partition tolerance (P)

Conjectured by Brewer in 2000

Formally “proven” by Gilbert & Lynch in 2002
Brewer’s CAP theorem
Assume two nodes storing replicated data on opposite sides of a partition.

- allowing at least one node to update state will cause the nodes to become inconsistent, thus forfeiting C
- likewise, if the choice is to preserve consistency, one side of the partition must act as if it is unavailable, thus forfeiting A
- only when the nodes can communicate is it possible to preserve both consistency and availability, thereby forfeiting P

Naïve implication (the “2 out of 3” view): since, for wide-area systems, designers cannot forfeit P, they must make a difficult choice between C and A.
What about latency?
Latency and partitions are related.

Operationally, the essence of CAP takes place during a partition-caused timeout, a period when the program must make a fundamental decision:

- block/cancel the operation, and thus decrease availability, or
- proceed with the operation, and thus risk inconsistency

The first results in high latency (waiting until the partition is repaired); the second results in possible inconsistency.
Brewer’s CAP theorem
A more sophisticated view

- because partitions are rare, there is little reason to forfeit C or A when the system is not partitioned
- the choice between C and A can occur many times within the same system, at very fine granularity: not only can subsystems make different choices, but the choice can change according to the operation or even the specific data or user

The three properties are more continuous than binary

- availability is a percentage between 0 and 100 percent
- different consistency models exist
- different kinds of system partition can be defined
Brewer’s CAP theorem
- BigTable is a “CP-type system”
- Dynamo is an “AP-type system”
- Yahoo’s PNUTS is an “AP-type system”
  - maintains remote copies asynchronously
  - makes the “local” replica the master, which decreases latency
  - works well in practice because a single user’s data master is naturally located according to the user’s (normal) location
- Facebook uses a “CP-type system”
  - the master copy is always in one location
  - a user typically has a closer but potentially stale copy
  - when users update their pages, the update goes to the master copy directly, as do all the user’s reads for about 20 seconds, despite the higher latency; after 20 seconds, the user’s traffic reverts to the closer copy

“AP” and “CP” are really rough generalizations.
Recovery
Recovery
Error recovery: replace a present erroneous state with an error-free state.

Backward recovery: bring the system into a previously correct state

- need to record the system’s state from time to time (checkpoints)
- example: retransmit a lost message

Forward recovery: bring the system to a correct new state from which it can continue to execute

- only works with known errors
- example: error correction
Backward recovery
Backward recovery is typically used: it is more general.

However

- recovery is expensive
- sometimes we can’t go back (e.g., a file has been deleted)
- checkpoints are expensive

Solution for the last point: message logging

- sender-based
- receiver-based
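Sender-based logging can be sketched as below: the sender keeps every message it sent, so a crashed receiver can restore its last checkpoint and replay the tail of the log instead of forcing the sender to roll back. This toy model assumes deterministic redelivery and that all of the receiver’s state derives from this one sender; the class names are invented:

```python
class Sender:
    """Sender-based message logging: log each message before sending."""
    def __init__(self):
        self.log = []

    def send(self, receiver, msg):
        self.log.append(msg)
        receiver.deliver(msg)

class Receiver:
    def __init__(self):
        self.state = []          # here: just the messages applied so far
        self.checkpoint = []

    def deliver(self, msg):
        self.state.append(msg)

    def take_checkpoint(self):
        self.checkpoint = list(self.state)

    def crash_and_recover(self, sender_log):
        # backward recovery: restore the checkpoint, then replay the
        # logged messages delivered after it, in their original order
        self.state = list(self.checkpoint)
        for msg in sender_log[len(self.checkpoint):]:
            self.deliver(msg)

p, q = Sender(), Receiver()
p.send(q, "m1")
q.take_checkpoint()
p.send(q, "m2")
p.send(q, "m3")
q.crash_and_recover(p.log)       # q is rebuilt without rolling p back
```

If the receiver were non-deterministic, replaying the same inputs would not guarantee the same state, which is exactly the complication the following slides explore.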
Checkpoints: Common approach
Periodically make a “big” checkpoint; then, more frequently, make incremental additions to it.

- for example, the checkpoint could be copies of some files, or of a database
- looking ahead, the incremental data could be the “operations” run on the database since the last transaction finished (committed)
Problems with checkpoints
P and Q are interacting
Each makes independent checkpoints now and then
Problems with checkpoints
Q crashes and rolls back to checkpoint
Problems with checkpoints
Q crashes and rolls back to checkpoint
It will have “forgotten” message from P
Problems with checkpoints
… Yet Q may even have replied.
Who would care? Suppose reply was “OK to release the cash. Account has been debited”
Two related concerns
First, Q needs to see that request again, so that it will reenter the state in which it sent the reply
Need to regenerate the input request
But if Q is non-deterministic, it might not repeat those actions even with identical input
So that might not be “enough”
Rollback can leave inconsistency!
In this example, we see that checkpoints must somehow be coordinated with communication
If we allow programs to communicate and don’t coordinate checkpoints with message passing, system state becomes inconsistent even if individual processes are otherwise healthy
More problems with checkpoints
P crashes and rolls back
More problems with checkpoints
P crashes and rolls back
Will P “reissue” the same request? Recall our non-determinism assumption: it might not!
Solution?
One idea: if a process rolls back, roll others back to a consistent state
- If a message was sent after the checkpoint, roll the receiver back to a state before that message was received
- If a message was received after the checkpoint, roll the sender back to a state prior to sending it
- Assumes channels will be “empty” after doing this
Solutions?
Q crashes and rolls back
Q rolled back to a state before the request was received and before the reply was sent.
Solution?
P must also roll back
Now it won’t upset us if P happens not to resend the same request
Implementation

Implementing independent checkpointing requires that dependencies are recorded, so that processes can jointly roll back to a consistent global state.

Let CPi(m) be the m-th checkpoint taken by process Pi, and let INTi(m) denote the interval between CPi(m-1) and CPi(m).

- when Pi sends a message in interval INTi(m), Pi attaches to it the pair (i, m)
- when Pj receives a message with attachment (i, m) in interval INTj(n), Pj records the dependency INTi(m) → INTj(n)
- when Pj takes checkpoint CPj(n), it logs this dependency as well

When Pi rolls back to checkpoint CPi(m-1), we need to ensure that all processes that have received messages from Pi sent in interval INTi(m) are rolled back to a checkpoint preceding the receipt of such messages…
When Pi rolls back to checkpoint CPi(m-1), Pj will have to roll back to at least checkpoint CPj(n-1); further rolling back may be necessary…
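The cascade implied by “further rolling back may be necessary” is just the transitive closure of the logged dependency relation. A simplified sketch (it tracks only which intervals are directly invalidated, ignoring that rolling a process back also discards its own later intervals):

```python
def rollback_set(dependencies, start):
    """Given logged dependencies INTi(m) -> INTj(n), each encoded as a
    pair of (process, interval) tuples, compute every interval that is
    invalidated when `start` is rolled back: the transitive closure of
    the dependency relation from `start`."""
    doomed = {start}
    frontier = [start]
    while frontier:
        src = frontier.pop()
        for sender_interval, receiver_interval in dependencies:
            if sender_interval == src and receiver_interval not in doomed:
                doomed.add(receiver_interval)
                frontier.append(receiver_interval)
    return doomed

# hypothetical log: P's interval 2 was seen by Q in interval 1, which
# in turn was seen by R in interval 3; an unrelated edge from R to S
deps = [
    (("P", 2), ("Q", 1)),
    (("Q", 1), ("R", 3)),
    (("R", 1), ("S", 1)),
]
cascade = rollback_set(deps, ("P", 2))
```

When the closure chains all the way back through earlier intervals, this is exactly the domino effect shown on the next slides.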
Problems with checkpoints
But now we can get a cascade effect
Problems with checkpoints
Q crashes, restarts from checkpoint…
Problems with checkpoints
Forcing P to roll back for consistency…
Problems with checkpoints
The new inconsistency forces Q to roll back ever further
Problems with checkpoints
The new inconsistency forces P to roll back ever further
This is a “cascaded” rollback, or “domino effect”.

It arises when the creation of checkpoints is uncoordinated with respect to communication.

- it can force a system to roll back to its initial state
- clearly undesirable in the extreme case…
- it could be avoided in our example if we had a log for the channel from P to Q
Sometimes an action is “external” to the system, and we can’t roll back.

Suppose that P is an ATM machine:

- P asks: “Can I give Ken $100?”
- Q debits the account and says “OK”
- P gives out the money

We can’t roll P back in this case, since the money is already gone.
Bigger issue is non-determinism
P’s actions could be tied to something random
For example, perhaps a timeout caused P to send this message
After rollback these non-deterministic events might occur in some other order
Results in a different behavior, like not sending that same request… yet Q saw it, acted on it, and even replied!
Issue has two sides
One involves reconstructing P’s message to Q in our examples
We don’t want P to roll back, since it might not send the same message
But if we had a log with P’s message in it we would be fine: we could just replay it
The other is that Q might not send the same response (non-determinism)
If Q did send a response and doesn’t send the identical one again, we must roll P back
Options?
One idea is to coordinate the creation of checkpoints and logging of messages
In effect, find a point at which we can pause the system
All processes make a checkpoint in a coordinated way: the consistent snapshot (seen that, done that)
Then resume
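The pause/checkpoint/resume idea can be sketched as a three-phase coordinator. This is a hypothetical, cooperative-process model (the `Process` and `coordinated_snapshot` names are mine): quiescing all sends first is what guarantees no message crosses the snapshot line.

```python
# Sketch of coordinated checkpointing, assuming every process cooperates
# with the coordinator (the whole point of the next slide is that real
# black-box components often don't).

class Process:
    def __init__(self, name):
        self.name = name
        self.state = 0
        self.sending_allowed = True

    def do_work(self, amount):
        self.state += amount          # stand-in for real computation

    def checkpoint(self):
        return (self.name, self.state)

def coordinated_snapshot(processes):
    # Phase 1: pause -- no process may send while the snapshot is taken,
    # so no message can cross the snapshot line.
    for p in processes:
        p.sending_allowed = False
    # Phase 2: every process checkpoints at the same logical instant;
    # together the checkpoints form a consistent snapshot.
    snapshot = [p.checkpoint() for p in processes]
    # Phase 3: resume normal operation.
    for p in processes:
        p.sending_allowed = True
    return snapshot

p, q = Process("P"), Process("Q")
p.do_work(3)
q.do_work(5)
snap = coordinated_snapshot([p, q])
```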
Why isn’t this common?
Often we can’t control processes we didn’t code ourselves
Most systems have many black-box components
Can’t expect them to implement the checkpoint/rollback policy
Hence it isn’t really practical to do coordinated checkpointing if it includes system components
Why isn’t this common?
Further concern: not every process can make a checkpoint “on request”
Might be in the middle of a costly computation that left big data structures around
Or might adopt the policy that “I won’t do checkpoints while I’m waiting for responses from black-box components”
This interferes with coordination protocols
Implications?
Ensure that devices, timers, etc., can behave identically if we roll a process back and then restart it
Knowing that programs will re-do identical actions eliminates need to cascade rollbacks
Implications?
Must also cope with thread preemption
Occurs when we use lightweight threads, as in Java or C#
Thread scheduler might context switch at times determined by when an interrupt happens
Must force the same behavior again later, when restarting, or program could behave differently
Determinism
Despite these issues, often see mechanisms that assume determinism
Basically they are saying
Either don’t use threads, timers, I/O from multiple incoming channels, shared memory, etc.
Or use a “determinism forcing mechanism”
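One common shape for a determinism forcing mechanism is record-and-replay: log every nondeterministic value on the first run, and feed the log back on restart so the process makes the same decisions. A toy sketch (all names here are illustrative, not a real library):

```python
# Toy record/replay "determinism forcing mechanism": nondeterministic
# sources (clock reads, random draws, interleavings) are funneled through
# one logging point so a restarted run repeats the original's choices.
import random

class DeterminismLog:
    def __init__(self):
        self.log = []
        self.pos = 0
        self.replaying = False

    def nondet(self, produce):
        """produce() is the real nondeterministic source (clock, RNG, ...)."""
        if self.replaying:
            value = self.log[self.pos]     # replay the recorded value
            self.pos += 1
        else:
            value = produce()              # record it on the first run
            self.log.append(value)
        return value

    def start_replay(self):
        self.replaying, self.pos = True, 0

def run(dlog, rng):
    # Toy "program" whose behavior depends on random draws.
    return [dlog.nondet(lambda: rng.randint(0, 9)) for _ in range(5)]

dlog = DeterminismLog()
rng = random.Random()
first = run(dlog, rng)      # normal run: values get recorded
dlog.start_replay()
second = run(dlog, rng)     # "restarted" run: values come from the log
```

Even though the second run draws from the same RNG, it sees the logged values, so its behavior is identical to the first run.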
With determinism…
We can revisit the checkpoint rollback problem and do much better
Eliminates need for cascaded rollbacks
But we do need a way to replay the identical inputs that were received after the checkpoint was made
Forces us to think about keeping logs of the channels between processes
Two popular options
Receiver-based logging
Log received messages; like an “extension” of the checkpoint
Sender-based logging
Log messages when you send them; ensures you can resend them if needed
Why do these work?
Recall the reasons for cascaded rollback
A cascade occurs if Q received a message and replied to it, then rolled back to “before” that happened
With message logging, Q can regenerate the input and re-read the message
With these varied options
When Q rolls back we can
Re-run Q with identical inputs if
Q is deterministic, or
Nobody saw messages from Q after the checkpoint state was recorded, or
We roll back the receivers of those messages
An issue: deterministic programs often crash in the identical way if we force identical execution