Consistency, Replication and Fault Tolerance
Introduction
Consistency Models
Distribution Protocols
Consistency Protocols
Fault Tolerance
Introduction: Reasons and Main Issue
• Two primary reasons for replicating data in DS: reliability and performance.
• Reliability: the system can continue working after one replica crashes by simply switching to one of the other replicas; it also becomes possible to provide better protection against corrupted data.
• Performance: when the number of processes accessing data managed by a server increases, performance can be improved by replicating the server and dividing the work; also, a copy of the data can be placed in the proximity of the processes using it, reducing data-access time.
• Consistency issue: keeping all replicas up-to-date.
Data-Centric Consistency Models
Consistency model: a contract between processes and the data store: if processes agree to obey certain rules, the store promises to work correctly.
Data store: any data, available by means of shared memory, shared database, or a file system in DS. A data store may be physically distributed across multiple machines.
Each process that can access data from the store is assumed to have a local (or nearby) copy of the entire store available. Write operations are propagated to the other copies.
Consistency is discussed in the context of read and write operations on the data store. An operation that changes the data is classified as a write operation; otherwise, it is regarded as a read operation.
Data-Centric Consistency Models
The general organization of a logical data store, physically distributed and replicated across multiple processes.
Consistency Models: Strict Consistency
Strict consistency: Any read on a data item x returns a value corresponding to the result of the most recent write on x.
The definition implicitly assumes the existence of absolute global time, so that the determination of “most recent” is unambiguous.
Uniprocessor systems traditionally observe strict consistency. In a DS, however, it is impossible to make all writes instantaneously visible to all processes and to maintain an absolute global time order.
In the following, the symbol Wi(x)a (Ri(x)b) is used to indicate a write by (read from) process Pi to data item x with value a (returning value b).
Strict Consistency
Behavior of two processes, operating on the same data item.
(a) A strictly consistent store.
(b) A store that is not strictly consistent.
Consistency Models: Sequential Consistency
It is impossible to implement strict consistency in DS (why?). Furthermore, experience shows that users can often manage quite well with weaker consistency models.
Sequential consistency (Lamport, 1979): A data store is said to be sequentially consistent if it satisfies the following condition:
The result of any execution is the same as if the (read and write) operations by all processes on the data store were executed in some sequential order and the operations of each individual process appear in this sequence in the order specified by its program.
Consistency Models: Sequential Consistency
In the sequential consistency model, when processes run concurrently, possibly on different machines, any interleaving of read and write operations is acceptable behavior, but all processes should see the same interleaving of operations.
The examples on the next slide show that time plays no role in sequential consistency.
Sequential consistency is comparable to serializability in the case of transactions. The former is defined in terms of read/write operations, the latter in terms of transactions.
The sequential consistency model is useful because programmers are taught to write programs in such a way that the exact order of statement execution does not matter. When such an order is essential, synchronization operations should be used.
Sequential Consistency (1)
a) A sequentially consistent data store.
b) A data store that is not sequentially consistent.
Sequential Consistency (2)
Three concurrently executing processes; x, y, and z are initially 0.

Process P1:  x = 1;  print(y, z);
Process P2:  y = 1;  print(x, z);
Process P3:  z = 1;  print(x, y);
Sequential Consistency (3)
Four valid execution sequences for the processes of the previous slide. The vertical axis is time.
(a)
x = 1;
print(y, z);
y = 1;
print(x, z);
z = 1;
print(x, y);
Prints: 001011
Signature: 001011

(b)
x = 1;
y = 1;
print(x, z);
print(y, z);
z = 1;
print(x, y);
Prints: 101011
Signature: 101011

(c)
y = 1;
z = 1;
print(x, y);
print(x, z);
x = 1;
print(y, z);
Prints: 010111
Signature: 110101

(d)
y = 1;
x = 1;
z = 1;
print(x, z);
print(y, z);
print(x, y);
Prints: 111111
Signature: 111111
Consistency Models: Weak Consistency
It is reasonable to let the process finish its critical section (for read/write operations) and then make sure that the final results are sent everywhere.
Using synchronization variables, weak consistency models have the following three properties:
(1) Accesses to synchronization variables associated with a data store are sequentially consistent;
(2) No operation on a synchronization variable is allowed to be performed until all previous writes have been completed everywhere;
(3) No read or write operation on data items are allowed to be performed until all previous operations to synchronization variables have been performed.
Weak Consistency (1)
A program fragment in which some variables may be kept in registers.
int a, b, c, d, e, x, y;        /* variables */
int *p, *q;                     /* pointers */
int f(int *p, int *q);          /* function prototype */

a = x * x;                      /* a stored in register */
b = y * y;                      /* b as well */
c = a*a*a + b*b + a*b;          /* used later */
d = a * a * c;                  /* used later */
p = &a;                         /* p gets address of a */
q = &b;                         /* q gets address of b */
e = f(p, q);                    /* function call */
Weak Consistency (2)
a) A valid sequence of events for weak consistency.
b) An invalid sequence for weak consistency.
Summary of Consistency Models
Consistency   Description
Strict        Absolute time ordering of all shared accesses matters.
Sequential    All processes see all shared accesses in the same order; accesses are not ordered in time.

Consistency   Description (with synchronization variable)
Weak          Shared data can be counted on to be consistent only after a synchronization is done.
Distribution Protocols: Replica Placement
Several ways of distributing (propagating) updates to replicas, independent of the supported consistency model, have been proposed.
Replica Placement: deciding where, when, and by whom copies of the data store are to be placed.
Three different types of copies can be distinguished: permanent replicas, server-initiated replicas, and client-initiated replicas, logically organized as shown on the next slide.
Permanent replicas: the initial set of replicas constituting a distributed data store.
Replica Placement
The logical organization of different kinds of copies of a data store into three concentric rings.
Distribution Protocols: Replica Placement
Server-initiated replicas: copies of a data store for enhancing performance. They are created at the initiative of the (owner of the) data store.
For example, it may be worthwhile to install a number of such replicas of a Web server in regions where many requests are coming from.
One of the major problems with such replicas is to decide exactly where and when the replicas should be created or deleted.
Server-initiated replication is gradually increasing in popularity, especially in the context of Web hosting services. Such hosting services can dynamically replicate files to servers close to demanding clients.
Server-Initiated Replicas
Counting access requests from different clients.
Distribution Protocols: Replica Placement
Client-initiated replicas: copies created at the initiative of clients, known as caches.
In principle, managing the cache is left entirely to the client, but there are many occasions in which the client can rely on participation from the data store to inform it when the cached data have become stale.
Placement of client caches is relatively simple: a cache is normally placed on the same machine as its client, or on a machine shared by clients in the same LAN.
Data are generally kept in a cache for a limited amount of time, to prevent extremely stale data from being used or simply to make room for other data.
Distribution Protocols: Update Propagation
Update operations on replicas are generally initiated at a client and subsequently forwarded to one of the copies. From there, the update should be propagated to other copies, while ensuring consistency.
What is to be propagated: there are three possibilities:
(1) a notification of an update (invalidation protocol);
(2) data from one copy to another;
(3) the update operation to other copies (active replication).
In the invalidation protocol of (1), the other copies are informed that an update has taken place and that the data they hold are no longer valid. It uses little network bandwidth and is suitable for a relatively small read-to-write ratio.
Distribution Protocols: Update Propagation
Transferring the modified data in (2) is useful when the read-to-write ratio is relatively high. It is also possible to log the changes and transfer only those logs to save bandwidth, and multiple modifications can be packed into a single message to save communication overhead.
In the active replication of (3), updates can often be propagated at minimal bandwidth cost, provided the size of the parameters associated with an operation is relatively small. However, more processing power may be required by each replica, especially for complex operations.
Whether updates are pushed or pulled: the push-based approach is referred to as a server-based protocol, while the pull-based approach is referred to as a client-based protocol.
Distribution Protocols: Update Propagation
Push-based approach: updates are propagated to other replicas without those replicas even asking for them. It is often used between permanent and server-initiated replicas, where a relatively high degree of consistency is needed.
Pull-based approach: a server or client requests another server to send it any updates it has at that moment. It is efficient when the read-to-update ratio is relatively low (e.g., in the case of client caches).
Whether unicast or multicast should be used: with unicast, a server that updates a replica sends N separate update messages, one to each of the N other servers. With multicast, the underlying network takes care of sending a multicast message efficiently to multiple receivers.
Pull versus Push Protocols
A comparison between push-based and pull-based protocols in the case of multiple client, single server systems.
Issue                     Push-based                                Pull-based
State of server           List of client replicas and caches        None
Messages sent             Update (and possibly fetch update later)  Poll and update
Response time at client   Immediate (or fetch-update time)          Fetch-update time
Consistency Protocols: Primary-Based Protocols
Consistency protocol: describing an implementation of a specific consistency model, including sequential consistency, weak consistency with synchronization variable, as well as atomic transactions.
Primary-Based Protocol: Each data item x in the data store has an associated primary, which is responsible for coordinating write operations on x.
A distinction can be made as to whether the primary is fixed at a remote server or if write operations can be carried out locally after moving the primary to the process where the write operation is initiated.
Remote-Write Protocols (1)
Primary-based remote-write protocol with a fixed server to which all read and write operations are forwarded.
The simplest primary-based (remote-write) protocol is the one in which all read and write operations are carried out at a single (remote) server. Data are not replicated at all; this scheme is traditionally used in client-server systems.
Remote-Write Protocols (2)
The primary-backup (remote-write) protocols allow processes to perform read operations on a locally available copy, but require write operations to be forwarded to a (fixed) primary copy.
This provides a straightforward implementation of sequential consistency, as the primary can order all incoming writes.
Local-Write Protocols (1)
In the simple primary-based (local-write) protocols, there is only a single copy of each data item x: whenever a process wants to perform an operation on some data item, the item is first transferred to that process, and then the operation is performed locally.
Local-Write Protocols (2)
In the primary-backup (local-write) protocols, the primary copy migrates between processes that wish to perform a write operation. The main advantage is that multiple, successive write operations can be carried out locally, while reading processes can still access their local copies.
Consistency Protocols: Replicated-Write Protocols
Replicated-write protocols: write operations can be carried out at multiple replicas.
Active replication: each replica has an associated process that carries out update operations. Updates are generally propagated by means of the write operation that causes the update, though it is also possible to send the result of the update instead.
Quorum-based protocols: the basic idea is to require clients to request and acquire the permission of multiple servers before reading or writing a replicated data item.
Dependability: Basic Concepts
Availability
Reliability
Safety
Maintainability
Fault → Error → Failure
Faults: transient, intermittent, permanent
Failure Models
Type of failure             Description
Crash failure (fail-stop)   A server halts, but is working correctly until it halts.
Omission failure            A server fails to respond to incoming requests.
  Receive omission          A server fails to receive incoming messages.
  Send omission             A server fails to send messages.
Timing failure              A server's response lies outside the specified time interval.
Response failure            The server's response is incorrect.
  Value failure             The value of the response is wrong.
  State-transition failure  The server deviates from the correct flow of control.
Arbitrary failure           A server may produce arbitrary responses at arbitrary times.
Error recovery
Software fault tolerance process:
– Error (or fault) detection & diagnosis
– Isolation/containment: prevent further damage & error propagation
– Recovery

Backward recovery:
– Restore/rollback the system to a previously saved state
– Checkpoints: saving system states at predetermined points, on stable storage that will not be affected by failures

Forward recovery:
– Find a new state from which the system can continue operation
– Examples: a degraded mode of the previous error-free state; error compensation
Backward recovery (I)
Can be used regardless of the damage sustained by the system state
– Requires no knowledge of the errors in the state
– The only knowledge required is that the relevant prior state is error-free
Can handle unpredictable errors caused by residual design faults
– … if the errors do not affect the recovery mechanism
A general recovery scheme
– Application-independent
– Uniform pattern of error detection & recovery
Particularly suited to recovery from transient faults
– After the recovery, the error may have gone away!
– Restarting with the checkpointed state will not reproduce the same fault
Backward recovery (II)
Significant resources required to perform checkpointing & recovery
The system must be halted temporarily
Danger of domino effect:– One process rolls back to its previous checkpoint– … which in turn causes another to roll further back (to its own checkpoint)– … which in turn causes the first process to roll back further …
Forward recovery (I)
Key supporting concept: redundancy
– Hardware
– Software
– Data
– Temporal ("diversity")
Self-checking components
– Switch from a failed to a non-failed component executing the same task code
Fault masking
– Error compensation is continuously applied
– Voting schemes
Error compensation
– Based on an algorithm that uses redundancy/diversity
– … multiple processes, executed in parallel, that deliver potential results
Forward recovery (II)
Fairly efficient in terms of space/time overhead
Suitable if the fault is an anticipated one
– E.g., potential loss of data: keep redundant copies
– Allows for optimization of recovery
Application-specific technique
Can only remove predictable errors
– Requires knowledge of the error
– Cannot aid in recovery if the state is damaged beyond "specification-wise recoverability"
– Depends on the ability to accurately detect the occurrence of a fault
Primarily used when the delay of backward recovery is not acceptable
Flat vs Hierarchical Groups (I)
Process resilience by replicating processes into groups
Group membership protocols
Flat vs Hierarchical Groups (II)
Flat groups:
– Symmetrical (no special roles)
– No single point of failure
– Complex operation protocols (e.g., voting)
Hierarchical groups:
– Coordinator is a single point of failure
Group membership:
– Group server
– Distributed management (e.g., reliable multicast)
Open issues:
• Detection of failed processes?
• Join/leave must be synchronous with data messages!
• How to rebuild a group after a major failure?
Failure Masking & Replication
Having a group of identical processes allows us to mask >= 1 faulty process(es)
– Primary-backup protocols (hierarchical organization)
  • Election among backups to select a new primary
– Replicated-write protocols (flat process groups)
  • Active replication
  • Quorum protocols
k-fault-tolerant system:
• Fail-silent processes: group size = k + 1
• Byzantine failures: group size = 2k + 1
• Assuming that processes do not team up!
Two-Phase Commit (2PC)
Originally stated by J. Gray (1978)
Coordinator - the component that coordinates commitment at Home(T)
Participant - a resource manager accessed by T
A participant P is ready to commit T if all of T’s after-images at P are in stable storage – And therefore can be redone
The coordinator must not commit T until all participants are ready
– If P isn't ready, T commits, and P fails, then P can't commit when it recovers.
Case of Commit (Protocol – Phase I)
[Message flow: Coordinator → Participant: Request-to-Prepare; Participant → Coordinator: Prepared; Coordinator → Participant: Commit; Participant → Coordinator: Done]
Coordinator:
sends Request-to-Prepare msg to each participant
waits for all participants to vote
Participant:
votes Prepared if it’s ready to commit
may vote No for any reason
may delay voting indefinitely
Case of Abort (Protocol – Phase II)
[Message flow: Coordinator → Participant: Request-to-Prepare; Participant → Coordinator: No; Coordinator → Participant: Abort; Participant → Coordinator: Done]
If coordinator receives Prepared from all participants, it decides to commit.
The transaction is now marked committed. Otherwise, it decides to abort.
The coordinator sends its decision to all participants (i.e. Commit or Abort)
Participants acknowledge receipt of Commit or Abort by replying Done.
Uncertainty in 2PC (I)
Before it votes, a participant can abort unilaterally
After a participant votes Prepared and before it receives the coordinator’s decision, it is uncertain. It can’t unilaterally commit or abort during its uncertainty period.
[Message flow: Coordinator → Participant: Request-to-Prepare; Participant → Coordinator: Prepared; Coordinator → Participant: Commit; Participant → Coordinator: Done. The participant's uncertainty period lasts from sending Prepared until receiving the decision.]
Uncertainty (II)
The coordinator is never uncertain
If a participant fails or is disconnected from the coordinator while it's uncertain, at recovery it must find out the decision
– The coordinator must store its decision in persistent storage

(In P. Bernstein's words) Bad News Theorems:
• Uncertainty in commit protocols cannot be eliminated
• Independent recovery of failed participants is not guaranteed
Blocking
A participant must await a repair/compensation action before continuing.
While blocked, a participant may not release resources
For every possible commit protocol, a communications failure can cause a participant to become blocked.
This holds not just for 2PC !
Independent recovery
Ideally, a recovered participant can decide to commit or abort without communicating with others
No commit protocol can guarantee independent recovery of failed participants
What to do if the coordinator or a participant times out waiting for a message ?
Failure handling in 2PC (I)
• Participant times out waiting for coordinator's Request-to-Prepare
  – It decides to abort.
• Coordinator times out waiting for a participant's vote
  – It decides to abort.
• A participant that voted Prepared times out waiting for the coordinator's decision
  – It's blocked.
  – Use a termination protocol to decide what to do.
  – Naive termination protocol: wait until the coordinator recovers.
• The coordinator times out waiting for Done msgs
  – It must re-solicit them, so it can forget the decision.
Three Phase Commit (3PC)
Prevents blocking in the absence of communication failures.
– It can be made resilient to communication failures, but then it may block!
3PC is much more complex than 2PC, but only marginally improves reliability (it prevents some blocking situations).
– 3PC is therefore not used much in practice.
Main idea:
– Becoming certain & deciding to commit are separate steps.
– Ensures that if any operational process is uncertain, then no (failed or operational) process has committed.
– In the termination protocol, if the operational processes are all uncertain, they can decide to abort (avoids blocking).
3PC Finite state machine for a participant
1. (Begin phase 1) Coordinator C sends Request-to-prepare to all participants
2. Participants vote Prepared or No (just like 2PC)
3. If C receives Prepared from all participants, then (begin phase 2) it sends Pre-Commit to all participants.
4. Participants wait for Abort or Pre-Commit. Participant acknowledges Pre-commit.
5. After C receives ACKs from all participants, or times out on some of them, it (begin phase 3) sends Commit to all participants (that are up)
3PC Failure Handling
If coordinator times out before receiving Prepared from all participants, it decides to abort.
Coordinator ignores participants that don’t ack its Pre-Commit.
Participants that voted Prepared and timed out waiting for Pre-Commit or Commit use the termination protocol.
The termination protocol is where the complexity lies.