
Page 1

Consistency, Replication and Fault Tolerance

Introduction

Consistency Models

Distribution Protocols

Consistency Protocols

Fault Tolerance

Page 2

Introduction: Reasons and Main Issue

• Two primary reasons for replicating data in DS: reliability and performance.

• Reliability: the system can continue working after one replica crashes by simply switching to one of the other replicas; replication also makes it possible to provide better protection against corrupted data.

• Performance: when the number of processes that need to access data managed by a server grows, performance can be improved by replicating the server and dividing the work; in addition, a copy of the data can be placed close to the processes using it, reducing data-access time.

• Consistency issue: keeping all replicas up-to-date.

Page 3

Data-Centric Consistency Models

Consistency model: a contract between processes and the data store: if processes agree to obey certain rules, the store promises to work correctly.

Data store: any collection of data made available by means of shared memory, a shared database, or a file system in a DS. A data store may be physically distributed across multiple machines.

Each process that can access data from the store is assumed to have a local (or nearby) copy of the entire store available. Write operations are propagated to the other copies.

Consistency is discussed in the context of read and write operations on the data store. An operation that changes the data is classified as a write operation; otherwise it is regarded as a read operation.

Page 4

Data-Centric Consistency Models

The general organization of a logical data store, physically distributed and replicated across multiple processes.

Page 5

Consistency Models: Strict Consistency

Strict consistency: Any read on a data item x returns a value corresponding to the result of the most recent write on x.

The definition implicitly assumes the existence of absolute global time, so that the determination of “most recent” is unambiguous.

Uniprocessor systems traditionally observe strict consistency. In a DS, however, it is impossible for all writes to be instantaneously visible to all processes and for an absolute global time order to be maintained.

In the following, Wi(x)a denotes a write by process Pi to data item x with value a, and Ri(x)b denotes a read from x by process Pi returning value b.

Page 6

Strict Consistency

Behavior of two processes, operating on the same data item.

(a) A strictly consistent store.

(b) A store that is not strictly consistent.

Page 7

Consistency Models: Sequential Consistency

It is impossible to implement strict consistency in DS (why?). Furthermore, experience shows that users can often manage quite well with weaker consistency models.

Sequential consistency (Lamport, 1979): A data store is said to be sequentially consistent if it satisfies the following condition:

The result of any execution is the same as if the (read and write) operations by all processes on the data store were executed in some sequential order and the operations of each individual process appear in this sequence in the order specified by its program.

Page 8

Consistency Models: Sequential Consistency

In the sequential consistency model, when processes run concurrently, possibly on different machines, any interleaving of read and write operations is acceptable behavior, but all processes must see the same interleaving of operations.

The examples on the next slides show that time plays no role in sequential consistency.

Sequential consistency is comparable to serializability in the case of transactions: the former is defined in terms of read/write operations, while the latter is defined in terms of transactions.

The sequential consistency model is useful because programmers are taught to write programs in such a way that the exact order of statement execution does not matter. When such an order is essential, synchronization operations should be used.

Page 9

Sequential Consistency (1)

a) A sequentially consistent data store.

b) A data store that is not sequentially consistent.

Page 10

Sequential Consistency (2)

Three concurrently executing processes; the shared variables x, y, and z are initially 0.

Process P1          Process P2          Process P3

x = 1;              y = 1;              z = 1;
print(y, z);        print(x, z);        print(x, y);

Page 11

Sequential Consistency (3)

Four valid execution sequences for the processes of the previous slide (statements are listed in the order in which they execute, i.e., time runs downward):

(a) x = 1;  print(y, z);  y = 1;  print(x, z);  z = 1;  print(x, y);
    Prints: 001011    Signature: 001011

(b) x = 1;  y = 1;  print(x, z);  print(y, z);  z = 1;  print(x, y);
    Prints: 101011    Signature: 101011

(c) y = 1;  z = 1;  print(x, y);  print(x, z);  x = 1;  print(y, z);
    Prints: 010111    Signature: 110101

(d) y = 1;  x = 1;  z = 1;  print(x, z);  print(y, z);  print(x, y);
    Prints: 111111    Signature: 111111

The signature is the 6-digit output of P1, P2, and P3 concatenated in that fixed order, whereas "Prints" lists the digits in the order they were produced.
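
The four sequences above are only a sample. Below is a small C sketch, not part of the original slides, that mechanically enumerates all 6!/(2!·2!·2!) = 90 interleavings consistent with each process's program order and collects the distinct output signatures they can produce; all names in it are made up for illustration.

#include <stdio.h>
#include <string.h>

static int x, y, z;          /* shared data items, all 0 initially      */
static char out[3][3];       /* per-process output, two digits each     */
static char seen[64][7];     /* distinct signatures found so far        */
static int nseen;

static void step(int p, int s)   /* execute statement s of process p    */
{
    if (s == 0) {                       /* first statement: a write     */
        if (p == 0) x = 1;
        else if (p == 1) y = 1;
        else z = 1;
    } else {                            /* second statement: a print    */
        int a = (p == 0) ? y : x;
        int b = (p == 2) ? y : z;
        sprintf(out[p], "%d%d", a, b);
    }
}

static void record(void)             /* remember this run's signature   */
{
    char sig[7];
    sprintf(sig, "%s%s%s", out[0], out[1], out[2]);
    for (int i = 0; i < nseen; i++)
        if (strcmp(seen[i], sig) == 0) return;
    strcpy(seen[nseen++], sig);
}

static void explore(int done[3])     /* backtracking over interleavings */
{
    int all = 1;
    for (int p = 0; p < 3; p++) {
        if (done[p] < 2) {
            all = 0;
            int sx = x, sy = y, sz = z;      /* save state              */
            char so[3]; strcpy(so, out[p]);
            step(p, done[p]);
            done[p]++;
            explore(done);
            done[p]--;                       /* undo and backtrack      */
            x = sx; y = sy; z = sz;
            strcpy(out[p], so);
        }
    }
    if (all) record();                       /* all six statements done */
}

int main(void)
{
    int done[3] = {0, 0, 0};
    explore(done);
    printf("%d distinct signatures are reachable, e.g. %s\n", nseen, seen[0]);
    return 0;
}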

Page 12

Consistency Models: Weak Consistency

Not every process needs to see every intermediate write: it is reasonable to let a process finish its critical section (a series of read/write operations) and only then make sure that the final results are propagated everywhere.

Using synchronization variables, weak consistency models have the following three properties:

(1) Accesses to synchronization variables associated with a data store are sequentially consistent;

(2) No operation on a synchronization variable is allowed to be performed until all previous writes have been completed everywhere;

(3) No read or write operations on data items are allowed to be performed until all previous operations on synchronization variables have been performed (a usage sketch follows below).
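
A minimal sketch, not taken from the slides, of how a program might use a synchronization variable S under weak consistency; synchronize() is a hypothetical primitive standing in for the data store's synchronization operation (here only a stub so the fragment compiles).

static int S;                      /* synchronization variable            */
static int x;                      /* shared, replicated data item        */

static void synchronize(int *sync_var)
{
    (void)sync_var;                /* in a real store: push local writes
                                      everywhere and pull in remote ones  */
}

static void writer(void)
{
    x = 1;                         /* intermediate value: other processes
                                      may or may not ever see it          */
    x = 2;
    synchronize(&S);               /* property (2): completes only after
                                      both writes are visible everywhere  */
}

static void reader(void)
{
    synchronize(&S);               /* property (3): no read proceeds until
                                      previous syncs have taken effect    */
    int v = x;                     /* sees 2 if the writer's sync was
                                      ordered before this one             */
    (void)v;
}

int main(void)
{
    writer();
    reader();
    return 0;
}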

Page 13

Weak Consistency (1)

A program fragment in which some variables may be kept in registers.

int a, b, c, d, e, x, y;        /* variables */
int *p, *q;                     /* pointers */
int f(int *p, int *q);          /* function prototype */

a = x * x;                      /* a stored in register */
b = y * y;                      /* b as well */
c = a*a*a + b*b + a*b;          /* used later */
d = a * a * c;                  /* used later */
p = &a;                         /* p gets address of a */
q = &b;                         /* q gets address of b */
e = f(p, q);                    /* function call */

Page 14

Weak Consistency (2)

a) A valid sequence of events for weak consistency.

b) An invalid sequence for weak consistency.

Page 15

Summary of Consistency Models

Consistency    Description

Strict         Absolute time ordering of all shared accesses matters.
Sequential     All processes see all shared accesses in the same order; accesses are not ordered in time.

Consistency    Description (models using a synchronization variable)

Weak           Shared data can be counted on to be consistent only after a synchronization is done.

Page 16

Distribution Protocols: Replica Placement

Several ways of distributing (propagating) updates to replicas, independent of the supported consistency model, have been proposed.

Replica Placement: deciding where, when, and by whom copies of the data store are to be placed.

Three different types of copies can be distinguished: permanent replicas, server-initiated replicas, and client-initiated replicas, logically organized as shown on the next slide.

Permanent replicas: the initial set of replicas constituting a distributed data store.

Page 17

Replica Placement

The logical organization of different kinds of copies of a data store into three concentric rings.

Page 18

Distribution Protocols: Replica Placement

Server-initiated replicas: copies of a data store for enhancing performance. They are created at the initiative of the (owner of the) data store.

For example, it may be worthwhile to install a number of such replicas of a Web server in regions where many requests are coming from.

One of the major problems with such replicas is to decide exactly where and when the replicas should be created or deleted.

Server-initiated replication is gradually increasing in popularity, especially in the context of Web hosting services. Such hosting services can dynamically replicate files to servers close to demanding clients.

Page 19

Server-Initiated Replicas

Counting access requests from different clients.

Page 20

Distribution Protocols: Replica Placement

Client-initiated replicas: copies created at the initiative of clients, known as caches.

In principle, managing the cache is left entirely to the client, but there are many occasions in which the client can rely on participation from the data store to inform it when cached data have become stale.

Placement of client caches is relatively simple: a cache is normally placed on the same machine as its client, or on a machine shared by clients on the same LAN.

Data are generally kept in a cache for a limited amount of time, to prevent extremely stale data from being used, or simply to make room for other data.
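
A small sketch of such a cache entry with a simple expiry time; the field names and fetch_from_server() are illustrative assumptions, not part of the slides.

#include <stdio.h>
#include <time.h>

struct cache_entry {
    int    value;
    time_t expires;                       /* do not serve past this time */
};

static int fetch_from_server(void)        /* stand-in for the real fetch */
{
    return 42;
}

static int cached_read(struct cache_entry *e, int ttl_seconds)
{
    time_t now = time(NULL);
    if (now >= e->expires) {              /* entry too old: refresh it   */
        e->value   = fetch_from_server();
        e->expires = now + ttl_seconds;
    }
    return e->value;
}

int main(void)
{
    struct cache_entry e = { 0, 0 };      /* initially expired            */
    printf("%d\n", cached_read(&e, 30));  /* fetches, then caches for 30s */
    printf("%d\n", cached_read(&e, 30));  /* served from the cache        */
    return 0;
}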

Page 21

Distribution Protocols: Update Propagation

Update operations on replicas are generally initiated at a client and subsequently forwarded to one of the copies. From there, the update should be propagated to other copies, while ensuring consistency.

What is to be propagated: there are three possibilities:

(1) a notification of an update (invalidation protocol);

(2) data from one copy to another;

(3) the update operation to other copies (active replication).

In the invalidation protocol of (1), the other copies are informed that the data item has been updated and that their copy is no longer valid. It uses little network bandwidth and works best when the read-to-write ratio is relatively small.
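
A minimal sketch of the replica side of an invalidation protocol: the invalidation message carries no data, so the replica only marks the item stale and re-fetches it lazily on the next read. fetch_from_primary() is a placeholder for the actual transfer, not a real API.

#include <stdio.h>

struct replica_item {
    int value;
    int valid;                 /* 0 = stale, must re-fetch before use */
};

static int fetch_from_primary(void)       /* placeholder for the RPC   */
{
    return 42;
}

static void on_invalidate(struct replica_item *it)
{
    it->valid = 0;                        /* tiny message, no payload  */
}

static int local_read(struct replica_item *it)
{
    if (!it->valid) {                     /* stale: pull a fresh copy  */
        it->value = fetch_from_primary();
        it->valid = 1;
    }
    return it->value;
}

int main(void)
{
    struct replica_item it = { 0, 0 };          /* starts out stale    */
    printf("read -> %d\n", local_read(&it));    /* triggers a fetch    */
    on_invalidate(&it);                         /* update elsewhere    */
    printf("read -> %d\n", local_read(&it));    /* fetches again       */
    return 0;
}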

Page 22

Distribution Protocols: Update Propagation

Transferring the modified data, as in (2), is useful when the read-to-write ratio is relatively high. It is also possible to log the changes and transfer only those logs to save bandwidth, and multiple modifications can be packed into a single message to reduce communication overhead.

With active replication, as in (3), updates can often be propagated at minimal bandwidth cost, provided the size of the parameters associated with an operation is relatively small. However, more processing power may be required at each replica, especially for complex operations.

Whether updates are pushed or pulled: the push-based approach is also referred to as a server-based protocol, while the pull-based approach is referred to as a client-based protocol.

Page 23

Distribution Protocols: Update Propagation

Push-based approach: updates are propagated to other replicas without those replicas asking for them. It is often used between permanent and server-initiated replicas when a relatively high degree of consistency is needed.

Pull-based approach: a server or client requests another server to send it whatever updates it has at that moment; this is often used by client caches. It is efficient when the read-to-update ratio is relatively low (e.g., in the case of a client cache).

Unicast or multicast: with unicast, a server that updates a replica sends its update to the N other servers as N separate update messages, one to each server. With multicast, the underlying network takes care of sending a single multicast message efficiently to multiple receivers.

Page 24

Pull versus Push Protocols

A comparison between push-based and pull-based protocols in the case of multiple-client, single-server systems.

Issue                     Push-based                                 Pull-based

State of server           List of client replicas and caches         None
Messages sent             Update (and possibly fetch update later)   Poll and update
Response time at client   Immediate (or fetch-update time)           Fetch-update time

Page 25

Consistency Protocols: Primary-Based Protocols

Consistency protocol: describes an implementation of a specific consistency model, such as sequential consistency, weak consistency with a synchronization variable, or atomic transactions.

Primary-Based Protocol: Each data item x in the data store has an associated primary, which is responsible for coordinating write operations on x.

A distinction can be made as to whether the primary is fixed at a remote server or if write operations can be carried out locally after moving the primary to the process where the write operation is initiated.

Page 26

Remote-Write Protocols (1)

Primary-based remote-write protocol with a fixed server to which all read and write operations are forwarded.

The simplest primary-based (remote-write) protocol is the one in which all read and write operations are carried out at a single (remote) server. Data are not replicated at all; this scheme is traditionally used in client-server systems.

Page 27

Remote-Write Protocols (2)

The primary-backup (remote-write) protocols allow processes to perform read operations on a locally available copy, but require write operations to be forwarded to a (fixed) primary copy.

They provide a straightforward implementation of sequential consistency, as the primary can order all incoming writes.
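
A sketch of that write path, with in-process arrays standing in for remote backup servers and the message exchange reduced to ordinary function calls; none of the names come from the slides.

#include <stdio.h>

#define NBACKUPS 3
#define NITEMS   8

static int primary[NITEMS];
static int backup[NBACKUPS][NITEMS];      /* simulated remote copies    */

static void send_update(int b, int item, int value)
{
    backup[b][item] = value;              /* stands in for a message    */
}

/* Executed at the primary when a client write arrives.  Because every
 * write funnels through this single point, the primary imposes one
 * global order on all updates, which is what yields sequential
 * consistency in this protocol.                                         */
static void primary_handle_write(int item, int value)
{
    primary[item] = value;                        /* 1. update locally   */
    for (int b = 0; b < NBACKUPS; b++)            /* 2. push to backups  */
        send_update(b, item, value);
    /* 3. a real protocol now waits for all acknowledgements and only    */
    /*    then acknowledges the client's write                           */
}

int main(void)
{
    primary_handle_write(2, 99);
    printf("backup 1, item 2 = %d\n", backup[1][2]);
    return 0;
}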

Page 28

Local-Write Protocols (1)

In the simple primary-based (local-write) protocol, there is only a single copy of each data item x: whenever a process wants to perform an operation on some data item, the item is first transferred to that process, after which the operation is performed locally.

Page 29

Local-Write Protocols (2)

In the primary-backup (local-write) protocols, the primary copy migrates between processes that wish to perform a write operation. The main advantage is that multiple, successive write operations can be carried out locally, while reading processes can still access their local copies.

Page 30

Consistency Protocols: Replicated-Write Protocols

Replicated-write protocols: write operations can be carried out at multiple replicas.

Active replication: each replica has an associated process that carries out update operations. Updates are generally propagated by means of the write operation that causes the update, i.e., the operation itself is forwarded to each replica; it is also possible to propagate the updated data instead.

Quorum-based protocols: the basic idea is to require clients to request and acquire the permission of multiple servers before reading or writing a replicated data item.
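
The slide states only the basic idea; the standard constraints used in such voting schemes (Gifford, 1979) relate the read quorum NR and write quorum NW to the number of replicas N. The small sketch below checks them; the N = 12 example sizes are a commonly used illustration, not something from the slides.

#include <stdio.h>

/* NR + NW > N   ensures every read quorum overlaps every write quorum,
 *               so read/write conflicts are detected;
 * 2*NW > N      ensures any two write quorums overlap, so write/write
 *               conflicts are detected.                                  */
static int quorum_ok(int n, int nr, int nw)
{
    return (nr + nw > n) && (2 * nw > n);
}

int main(void)
{
    printf("N=12 NR=3  NW=10 -> %s\n", quorum_ok(12, 3, 10) ? "ok" : "invalid");
    printf("N=12 NR=7  NW=6  -> %s\n", quorum_ok(12, 7, 6)  ? "ok" : "invalid");
    printf("N=12 NR=1  NW=12 -> %s\n", quorum_ok(12, 1, 12) ? "ok" : "invalid");
    return 0;
}

The last line corresponds to read-one, write-all: any single replica may serve a read, but every write must reach all replicas.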

Page 31

Dependability: Basic Concepts

Availability

Reliability

Safety

Maintainability

Fault → Error → Failure

Faults:
– Transient
– Intermittent
– Permanent

Page 32

Failure Models

Type of failure                Description

Crash failure (fail-stop)      A server halts, but is working correctly until it halts
Omission failure               A server fails to respond to incoming requests
  Receive omission             A server fails to receive incoming messages
  Send omission                A server fails to send messages
Timing failure                 A server's response lies outside the specified time interval
Response failure               The server's response is incorrect
  Value failure                The value of the response is wrong
  State transition failure     The server deviates from the correct flow of control
Arbitrary failure              A server may produce arbitrary responses at arbitrary times

Page 33

Error recovery

Software fault tolerance process:
– Error (or fault) detection & diagnosis
– Isolation/containment: prevent further damage & error propagation
– Recovery

Backward recovery
– Restore/rollback the system to a previously saved state
– Checkpoints: saving system states at predetermined points, on stable storage that will not be affected by failures (a sketch follows below)

Forward recovery
– Find a new state from which the system can continue operation
– Examples: a degraded mode of the previous error-free state; error compensation
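
A minimal sketch of checkpoint-and-rollback, using an ordinary file named checkpoint.dat as a stand-in for stable storage; the state structure is purely illustrative.

#include <stdio.h>

struct app_state { int counter; double balance; };

static int save_checkpoint(const struct app_state *s)
{
    FILE *f = fopen("checkpoint.dat", "wb");
    if (!f) return -1;
    size_t ok = fwrite(s, sizeof *s, 1, f);
    fclose(f);
    return ok == 1 ? 0 : -1;
}

static int rollback(struct app_state *s)
{
    FILE *f = fopen("checkpoint.dat", "rb");
    if (!f) return -1;                     /* no checkpoint to roll back to */
    size_t ok = fread(s, sizeof *s, 1, f);
    fclose(f);
    return ok == 1 ? 0 : -1;
}

int main(void)
{
    struct app_state s = { 0, 100.0 };
    save_checkpoint(&s);                   /* predetermined checkpoint      */
    s.counter = 7; s.balance = -1.0;       /* ... an error corrupts state   */
    rollback(&s);                          /* backward recovery: restore    */
    printf("counter=%d balance=%.1f\n", s.counter, s.balance);
    return 0;
}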

Page 34

Backward recovery (I)

Can be used regardless of the damage sustained by the system state
– Requires no knowledge of the errors in the state
– The only knowledge required is that the relevant prior state is error-free

Can handle unpredictable errors caused by residual design faults
– ... if the errors do not affect the recovery mechanism

A general recovery scheme
– Application-independent
– Uniform pattern of error detection & recovery

Particularly suited to recovery from transient faults
– After the recovery, the error may have gone away!
– Restarting with the checkpointed state will not produce the same fault

Page 35

Backward recovery (II)

Significant resources required to perform checkpointing & recovery

The system must be halted temporarily

Danger of domino effect:
– One process rolls back to its previous checkpoint
– ... which in turn causes another to roll further back (to its own checkpoint)
– ... which in turn causes the first process to roll back further ...

Page 36

Forward recovery (I)

Key supporting concept: redundancy
– Hardware
– Software
– Data
– Temporal ("diversity")

Self-checking components
– Switch from a failed to a non-failed component executing the same task code

Fault masking
– Error compensation is continuously applied
– Voting schemes

Error compensation
– Based on an algorithm that uses redundancy/diversity
– ... multiple processes, executed in parallel, that deliver potential results

Page 37

Forward recovery (II)

Fairly efficient in terms of space/time overhead

Suitable if the fault is an anticipated one
– E.g., potential loss of data → keep redundant copies
– Allows for optimization of recovery

Application-specific technique

Can only remove predictable errors
– Requires knowledge of the error
– Cannot aid in recovery if the state is damaged beyond "specification-wise recoverability"
– Depends on the ability to accurately detect the occurrence of a fault

Primarily used when the delay of backward recovery is not acceptable

Page 38

Flat vs Hierarchical Groups (I)

Process resilience by replicating processes into groups

Group membership protocols

Page 39

Flat vs Hierarchical Groups (II)

Flat groups:
– Symmetrical (no special roles)
– No single point of failure
– Complex operation protocols (e.g., voting)

Hierarchical groups:
– Coordinator is a single point of failure

Group membership:
– Group server
– Distributed management (e.g., using reliable multicast)

Open questions:
• Detection of failed processes?
• Join/leave must be synchronous with data messages!
• How to rebuild a group after a major failure?

Page 40

Failure Masking & Replication

Having a group of identical processes allows us to mask one or more faulty processes

– Primary-backup protocols: hierarchical organization
  • Election among backups to select a new primary

– Replicated-write protocols: flat process groups
  • Active replication
  • Quorum protocols

K-fault-tolerant system:
• Fail-silent processes: group size = k + 1
• Byzantine failures: group size = 2k + 1
• Assuming that processes do not team up!
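
A tiny sketch of the group sizes stated above, assuming replicas fail independently and do not collude; the function name is illustrative.

#include <stdio.h>

static int group_size(int k, int byzantine)
{
    return byzantine ? 2 * k + 1    /* a correct majority outvotes k liars  */
                     : k + 1;       /* one surviving fail-silent member is
                                       enough to keep the service running   */
}

int main(void)
{
    printf("k=2, fail-silent: %d processes\n", group_size(2, 0));  /* 3 */
    printf("k=2, Byzantine:   %d processes\n", group_size(2, 1));  /* 5 */
    return 0;
}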

Page 41

Two-Phase Commit (2PC)

Originally stated by J. Gray (1978)

Coordinator - the component that coordinates commitment at Home(T)

Participant - a resource manager accessed by T

A participant P is ready to commit T if all of T's after-images at P are in stable storage
– and can therefore be redone

The coordinator must not commit T until all participants are ready
– If P isn't ready, T commits, and P fails, then P can't commit when it recovers.

Page 42

Case of Commit (Protocol – Phase I)

Message flow in the commit case: the coordinator sends Request-to-Prepare, the participant replies Prepared, the coordinator sends Commit, and the participant replies Done.

Coordinator:

sends Request-to-Prepare msg to each participant

waits for all participants to vote

Participant:

votes Prepared if it’s ready to commit

may vote No for any reason

may delay voting indefinitely

Page 43

Case of Abort (Protocol – Phase II)

Message flow in the abort case: the coordinator sends Request-to-Prepare, a participant replies No, the coordinator sends Abort, and participants reply Done.

If coordinator receives Prepared from all participants, it decides to commit.

The transaction is now marked committed. Otherwise, it decides to abort.

The coordinator sends its decision to all participants (i.e. Commit or Abort)

Participants acknowledge receipt of Commit or Abort by replying Done.
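
A sketch of the coordinator's side of both phases; message exchange is simulated with ordinary function calls, and participant_vote() and force_log() are stand-ins for a network receive and stable-storage logging, not a real API.

#include <stdio.h>

#define NPART 3

enum vote     { VOTE_PREPARED, VOTE_NO };
enum decision { DECIDE_COMMIT, DECIDE_ABORT };

static enum vote participant_vote(int p)
{
    (void)p;
    return VOTE_PREPARED;           /* pretend every participant is ready */
}

static void force_log(enum decision d)
{
    (void)d;                        /* a real coordinator writes d to
                                       stable storage before phase II     */
}

static enum decision coordinator_run(void)
{
    enum decision d = DECIDE_COMMIT;

    /* Phase I: send Request-to-Prepare and collect votes.  A single No   */
    /* (or a timeout) is enough to abort.                                 */
    for (int p = 0; p < NPART; p++)
        if (participant_vote(p) != VOTE_PREPARED)
            d = DECIDE_ABORT;

    force_log(d);                   /* the decision is now durable        */

    /* Phase II: send Commit/Abort to everyone and wait for Done, after   */
    /* which the decision may be forgotten.                               */
    for (int p = 0; p < NPART; p++)
        printf("-> participant %d: %s\n", p,
               d == DECIDE_COMMIT ? "COMMIT" : "ABORT");
    return d;
}

int main(void)
{
    coordinator_run();
    return 0;
}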

Page 44

Uncertainty in 2PC (I)

Before it votes, a participant can abort unilaterally

After a participant votes Prepared and before it receives the coordinator’s decision, it is uncertain. It can’t unilaterally commit or abort during its uncertainty period.

Message flow: Request-to-Prepare, Prepared, Commit, Done, with the participant's uncertainty period running from the moment it sends Prepared until it receives the coordinator's decision.

Page 45

Uncertainty (II)

The coordinator is never uncertain

If a participant fails or is disconnected from the coordinator while it's uncertain, at recovery it must find out the decision
– The coordinator must therefore store its decision in persistent storage

(In P. Bernstein's words) Bad News Theorems:
• Uncertainty in commit protocols cannot be eliminated
• Independent recovery of failed participants is not guaranteed

Page 46

Blocking

A participant must await a repair/compensation action before continuing.

While blocked, a participant may not release resources.

For every possible commit protocol, a communications failure can cause a participant to become blocked.

This holds not just for 2PC !

Page 47

Independent recovery

Ideally, a recovered participant can decide to commit or abort without communicating with others

No commit protocol can guarantee independent recovery of failed participants

What to do if the coordinator or a participant times out waiting for a message ?

Page 48

Failure handling in 2PC (I)

• Participant times out waiting for the coordinator's Request-to-Prepare
– It decides to abort.

• Coordinator times out waiting for a participant's vote
– It decides to abort.

• A participant that voted Prepared times out waiting for the coordinator's decision
– It's blocked.
– Use a termination protocol to decide what to do.
– Naive termination protocol: wait until the coordinator recovers.

• The coordinator times out waiting for Done messages
– It must re-solicit them, so that it can forget the decision.

Page 49

Three Phase Commit (3PC)

Prevents blocking in the absence of communication failures
– It can be made resilient to communication failures, but then it may block!

3PC is much more complex than 2PC, but only marginally improves reliability (it prevents some blocking situations)
– 3PC is therefore not used much in practice

Main idea:
– Becoming certain and deciding to commit are separate steps.
– Ensures that if any operational process is uncertain, then no (failed or operational) process has committed.
– In the termination protocol, if the operational processes are all uncertain, they can decide to abort (this avoids blocking).

Page 50

3PC: finite state machine for a participant

1. (Begin phase 1) Coordinator C sends Request-to-prepare to all participants

2. Participants vote Prepared or No (just like 2PC)

3. If C receives Prepared from all participants, then (begin phase 2) it sends Pre-Commit to all participants.

4. Participants wait for Abort or Pre-Commit. Participant acknowledges Pre-commit.

5. After C receives ACKs from all participants, or times out on some of them, it (begin phase 3) sends Commit to all participants (that are up)
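
A sketch of the states a 3PC participant passes through and the termination rule stated earlier (if all operational processes are still uncertain, they may abort); the names are illustrative, not a standard API.

#include <stdio.h>

enum p3c_state {
    P3C_INITIAL,        /* has not voted yet: may still abort unilaterally */
    P3C_UNCERTAIN,      /* voted Prepared, waiting for Pre-Commit or Abort */
    P3C_PRECOMMITTED,   /* got Pre-Commit: certain, but not yet committed  */
    P3C_COMMITTED,
    P3C_ABORTED
};

/* Termination rule: if every operational process is still uncertain, no  */
/* process anywhere can have committed, so the group may safely abort     */
/* instead of blocking.                                                    */
static int may_abort_on_timeout(const enum p3c_state *peers, int n)
{
    for (int i = 0; i < n; i++)
        if (peers[i] != P3C_UNCERTAIN)
            return 0;               /* someone is already past uncertainty */
    return 1;
}

int main(void)
{
    enum p3c_state peers[3] = { P3C_UNCERTAIN, P3C_UNCERTAIN, P3C_UNCERTAIN };
    printf("all uncertain -> may abort: %d\n", may_abort_on_timeout(peers, 3));
    peers[1] = P3C_PRECOMMITTED;
    printf("one pre-committed -> may abort: %d\n", may_abort_on_timeout(peers, 3));
    return 0;
}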

Page 51

3PC Failure Handling

If coordinator times out before receiving Prepared from all participants, it decides to abort.

Coordinator ignores participants that don’t ack its Pre-Commit.

Participants that voted Prepared and timed out waiting for Pre-Commit or Commit use the termination protocol.

The termination protocol is where the complexity lies.