

LECTURE NOTES: DISTRIBUTED SYSTEM (ECS-701) MUKESH KUMAR DEPARTMENT OF INFORMATION TECHNOLOGY

I.T.S ENGINEERING COLLEGE, GREATER NOIDA PLOT NO: 46, KNOWLEDGE PARK 3, GREATER NOIDA

UNIT-1 DISTRIBUTED SYSTEMS

Basic Issues
1. What is a Distributed System?
2. Examples of Distributed Systems
3. Advantages and Disadvantages
4. Design Issues with Distributed Systems
5. Preliminary Course Topics

What is a Distributed System?
A distributed system is a collection of autonomous computers linked by a computer network that appears to the users of the system as a single computer.
Some comments:

System architecture: the machines are autonomous; this means they are computers which, in principle, could work independently;

The user’s perception: the distributed system is perceived as a single system solving a certain problem (even though, in reality, we have several computers placed in different locations).

By running distributed system software the computers are enabled to:

1. Coordinate their activities 2. Share resources: hardware, software, data.

Examples of Distributed Systems

1. Network of workstations

Personal workstations + processors not assigned to specific users.

Single file system, with all files accessible from all machines in the same way and using the same path name.

For a certain command the system can look for the best place (workstation) to execute it.
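The "best place to execute" idea can be sketched as a simple placement decision: pick the least-loaded workstation. A minimal illustration in Python (the workstation names and load figures are made up):

```python
# Hypothetical sketch: choosing the least-loaded workstation to run a command.
# Node names and load values are illustrative, not from the lecture notes.

def pick_workstation(loads):
    """Return the name of the workstation with the lowest reported load."""
    return min(loads, key=loads.get)

loads = {"ws1": 0.82, "ws2": 0.15, "ws3": 0.47}
best = pick_workstation(loads)  # "ws2": it has the lowest load
```

A real system would also weigh network distance to the files the command needs, not only CPU load.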


2. Automatic banking (teller machine) system

Primary requirements: security and reliability.

Consistency of replicated data.

Concurrent transactions (operations which involve accounts in different banks; simultaneous access from several users, etc).

Fault tolerance

Some more examples of Distributed Real-Time Systems

Synchronization of physical clocks

Scheduling with hard time constraints

Real-time communication

Fault tolerance

Advantages of Distributed Systems
Performance: very often a collection of processors can provide higher performance (and a better price/performance ratio) than a centralized computer.
Distribution: many applications involve, by their nature, spatially separated machines (banking, commercial, automotive systems).
Reliability (fault tolerance): if some of the machines crash, the system can survive.
Incremental growth: as requirements on processing power grow, new machines can be added incrementally.
Sharing of data/resources: shared data is essential to many applications (banking, computer supported cooperative work, reservation systems); other resources can also be shared (e.g. expensive printers).

Disadvantages of Distributed Systems
Difficulties of developing distributed software: what should operating systems, programming languages and applications look like?


Networking problems: the network infrastructure creates several problems which have to be dealt with: loss of messages, overloading, ...
Security problems: sharing generates the problem of data security.

Design Issues with Distributed Systems
Design issues that arise specifically from the distributed nature of the application:

Transparency

Communication

Performance

Scalability

Heterogeneity

Openness

Reliability & fault tolerance

Security

1. Transparency
Issue: How to achieve the single system image? How to "fool" everyone into thinking that the collection of machines is a "simple" computer?

Access transparency - local and remote resources are accessed using identical operations.
Location transparency - users cannot tell where hardware and software resources (CPUs, files, databases) are located; the name of the resource should not encode its location.
Migration (mobility) transparency - resources should be free to move from one location to another without having their names changed.
Replication transparency - the system is free to make additional copies of files and other resources (for purposes of performance and/or reliability), without the users noticing. Example: with several copies of a file, a request accesses the copy closest to the client.
Concurrency transparency - the users will not notice the existence of other users in the system (even if they access the same resources).
Failure transparency - applications should be able to complete their tasks despite failures occurring in certain components of the system.
Performance transparency - load variation should not lead to performance degradation. This could be achieved by automatic reconfiguration in response to changes of the load; it is difficult to achieve.


2. Communication
Issue: Components of a distributed system have to communicate in order to interact. This implies support at two levels:

1. Networking infrastructure (interconnections & network software).
2. Appropriate communication primitives and models and their implementation:
   a. Communication primitives:
      i. send
      ii. receive
      iii. remote procedure call (RPC)
   b. Communication models:
      i. Client-server communication: implies a message exchange between two processes: the process which requests a service and the one which provides it.
      ii. Group multicast: the target of a message is a set of processes, which are members of a given group.
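The request/reply interaction built from send and receive primitives can be sketched in Python. Here in-process queues stand in for the network channel (an assumption made for illustration; a real system would use sockets or a messaging layer):

```python
# Minimal request/reply sketch: queues model the send/receive primitives.
import queue
import threading

request_q = queue.Queue()   # client -> server channel
reply_q = queue.Queue()     # server -> client channel

def server():
    op, args = request_q.get()          # receive(request)
    if op == "add":
        reply_q.put(("ok", sum(args)))  # send(reply)
    else:
        reply_q.put(("error", "unknown operation"))

threading.Thread(target=server, daemon=True).start()
request_q.put(("add", [2, 3]))          # send(request)
status, result = reply_q.get()          # receive(reply) blocks until served
# status == "ok", result == 5
```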

3. Performance
Several factors influence the performance of a distributed system:

The performance of individual workstations.

The speed of the communication infrastructure.

Extent to which reliability (fault tolerance) is provided (replication and preservation of coherence imply large overheads).

Flexibility in workload allocation: for example, idle processors (workstations) could be allocated automatically to a user’s task.

4. Scalability
The system should remain efficient even with a significant increase in the number of users and resources connected:

Cost of adding resources should be reasonable;

Performance loss with increased number of users and resources should be controlled;

Software resources should not run out (number of bits allocated to addresses, number of entries in tables, etc.)

5. Heterogeneity
Distributed applications are typically heterogeneous:

Different hardware: mainframes, workstations, PCs, servers, etc.;

Different software: UNIX, MS-Windows, IBM OS/2, Real-time OSs, etc.;


Unconventional devices: teller machines, telephone switches, robots, manufacturing systems, etc.;

Diverse networks and protocols: Ethernet, FDDI, ATM, TCP/IP, Novell Netware, etc.

An additional software layer called middleware is used to mask heterogeneity.

6. Openness
One of the important features of distributed systems is openness and flexibility:

Every service is equally accessible to every client (local or remote);

It is easy to implement, install and debug new services;

Users can write and install their own services.

Key aspects of openness:

Standard interfaces and protocols (like Internet communication protocols)

Support of heterogeneity (by adequate middleware, like CORBA)

7. Reliability and Fault Tolerance
One of the main goals of building distributed systems is improvement of reliability.

Availability: if machines go down, the system should keep working with the reduced amount of resources.

There should be a very small number of critical resources;

Critical resources: resources which have to be up in order for the distributed system to work.

Key pieces of hardware and software (critical resources) should be replicated, i.e. if one of them fails another one takes over - redundancy.

Data on the system must not be lost, and copies stored redundantly on different servers must be kept consistent.

The more copies kept, the better the availability, but keeping consistency becomes more difficult.

Fault-tolerance is a main issue related to reliability: the system has to detect faults and act in a reasonable way:

Mask the fault: continue to work with possibly reduced performance but without loss of data/ information.

Fail gracefully: react to the fault in a predictable way and possibly stop functionality for a short period, but without loss of data/information.


8. Security
Security of information resources:

Confidentiality: Protection against disclosure to unauthorized persons

Integrity: Protection against alteration and corruption

Availability: Keep the resource accessible

Security risks are associated with free access: distributed systems must allow communication between programs/users/resources on different computers, and the appropriate use of resources by different users has to be guaranteed.

MODELS OF DISTRIBUTED SYSTEMS
Basic Elements:

Resources in a distributed system are shared between users. They are normally encapsulated within one of the computers and can be accessed from other computers by communication.

Each resource is managed by a program, the resource manager; it offers a communication interface enabling the resource to be accessed by its users.

Resource managers can in general be modeled as processes. If the system is designed according to an object-oriented methodology, resources are encapsulated in objects.

1. Architectural Models
Issue: How are responsibilities distributed between system components and how are these components placed?

Client-server model

Peer-to-peer

Variations of the above two:

Proxy server

Mobile code

Mobile agents

Network computers

Thin clients

Mobile devices


Client-Server Model
The system is structured as a set of processes, called servers, that offer services to the users, called clients. The client-server model is usually based on a simple request/reply protocol, implemented with send/receive primitives or using remote procedure calls (RPC) or remote method invocation (RMI):

The client sends a request (invocation) message to the server asking for some service;

The server does the work and returns a result (e.g. the data requested) or an error code if the work could not be performed.

A server can itself request services from other servers; thus, in this new relation, the server itself acts like a client.

Some problems with client-server:

Centralization of service ⇒ poor scaling

Limitations: capacity of server, bandwidth of network connecting the server.

Peer-to-Peer Model

All processes (objects) play a similar role.

Processes (objects) interact without particular distinction between clients and servers.

The pattern of communication depends on the particular application.

A large number of data objects are shared; any individual computer holds only a small part of the application database.

Processing and communication loads for access to objects are distributed across many computers and access links.

This is the most general and flexible model.


Peer-to-peer tries to solve the problems of centralized systems.

It distributes shared resources widely and shares computing and communication loads.

Problems with peer-to-peer: high complexity due to the need to:

cleverly place individual objects;

retrieve the objects;

maintain a potentially large number of replicas.

2. Interaction Models
Issue: How do we handle time? Are there time limits on process execution, message delivery, and clock drifts?

Synchronous distributed systems

Asynchronous distributed systems

Synchronous Distributed Systems
Main features:

Lower and upper bounds on execution time of processes can be set.

Transmitted messages are received within a known bounded time.

Drift rates between local clocks have a known bound.

Important consequences:

In a synchronous distributed system there is a notion of global physical time (with a known relative precision depending on the drift rate).

Only synchronous distributed systems have a predictable behavior in terms of timing. Only such systems can be used for hard real-time applications.

In a synchronous distributed system it is possible and safe to use timeouts in order to detect failures of a process or communication link.
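A hedged sketch of this timeout-based failure detection: under the synchronous assumption the reply time has a known bound, so a missed deadline can safely be interpreted as a failure. The 0.2-second bound and the silent peer below are illustrative, not taken from the notes:

```python
# Failure detection by timeout: safe only when the reply-time bound is known.
import queue

def wait_for_reply(ch, bound):
    """Declare the peer failed if no reply arrives within the known bound."""
    try:
        return ch.get(timeout=bound)
    except queue.Empty:
        return "peer presumed failed"

ch = queue.Queue()
# No process ever replies on this channel, so the timeout fires.
verdict = wait_for_reply(ch, bound=0.2)  # -> "peer presumed failed"
```

In an asynchronous system the same timeout could fire for a merely slow peer, which is why the conclusion "failed" is only safe under the synchronous bounds.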

It is difficult and costly to implement synchronous distributed systems.

Asynchronous Distributed Systems
Many distributed systems (including those on the Internet) are asynchronous.

No bound on process execution time (nothing can be assumed about speed, load, and reliability of computers).

No bound on message transmission delays (nothing can be assumed about speed, load, reliability of interconnections)

No bounds on drift rates between local clocks.


Important consequences: 1. In an asynchronous distributed system there is no global physical time.

Reasoning can be only in terms of logical time (see lecture on time and state).

2. Asynchronous distributed systems are unpredictable in terms of timing.
3. No timeouts can be used.
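When timeouts are nevertheless used for failure detection, retransmissions can produce duplicated messages; a common counter-measure (sketched below with made-up message ids and operations) is to tag each message with an id and execute each id at most once:

```python
# At-most-once duplicate filtering: retransmissions carry the same message id,
# and the receiver executes each id only once. Ids/payloads are illustrative.

seen = set()
executed = []

def deliver(msg_id, payload):
    """Execute the operation only if this message id has not been seen."""
    if msg_id in seen:
        return  # duplicate caused by a retransmission: drop it
    seen.add(msg_id)
    executed.append(payload)

deliver(1, "debit 100")
deliver(1, "debit 100")   # retransmitted duplicate: ignored
deliver(2, "credit 50")
# executed == ["debit 100", "credit 50"]
```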

Asynchronous systems are widely and successfully used in practice. In practice, timeouts are used with asynchronous systems for failure detection. However, additional measures have to be applied in order to avoid duplicated messages, duplicated execution of operations, etc.

3. Fault Models
Issue: What kind of faults can occur and what are their effects?

Omission faults

Arbitrary faults

Timing faults

1. Faults can occur both in processes and communication channels. The reason can be both software and hardware faults.

2. Fault models are needed in order to build systems with predictable behavior in case of faults (systems which are fault tolerant).

3. Of course, such a system will function according to the predictions only as long as the real faults behave as defined by the "fault model". If not, no guarantees can be given.

4. These issues will be discussed in some of the following chapters and in particular in the chapter on “Recovery and Fault Tolerance”.

Omission Faults A processor or communication channel fails to perform actions it is supposed to do. This means that the particular action is not performed!

We do not have an omission fault if:

o An action is delayed (regardless how long) but finally executed.

o An action is executed with an erroneous result.

With synchronous systems, omission faults can be detected by timeouts.

If we are sure that messages arrive, a timeout will indicate that the sending process has crashed. Such a system has 'fail-stop' behavior.


Arbitrary (Byzantine) Faults

This is the most general and worst possible fault semantics.

Intended processing steps or communications are omitted or/and unintended ones are executed.

Results may not come at all or may come but carry wrong values.

Timing Faults
Timing faults can occur in synchronous distributed systems, where time limits are set on process execution, communications, and clock drifts. A timing fault occurs if any of these time limits is exceeded.

COMMUNICATION IN DISTRIBUTED SYSTEMS

Communication Models and their Layered Implementation
Communication between distributed objects is performed by means of two models:

1. Remote Method Invocation (RMI)
2. Remote Procedure Call (RPC)

RMI, as well as RPC, is based on request and reply primitives. Request and reply are implemented on top of the network protocol (e.g. TCP or UDP in the case of the Internet).

Network Protocol
Middleware and distributed applications have to be implemented on top of a network protocol. Such a protocol is implemented as several layers. In the case of the Internet, TCP (Transmission Control Protocol) and UDP (User Datagram Protocol) are both transport protocols implemented on top of the Internet Protocol (IP).

TCP (Transmission Control Protocol)
TCP is a reliable protocol. TCP guarantees the delivery to the receiving process of all data delivered by the sending process, in the same order.


TCP implements additional mechanisms on top of IP to meet reliability guarantees.

o Sequencing: A sequence number is attached to each transmitted segment (packet). At the receiver side, no segment is delivered until all lower numbered segments have been delivered.

o Flow control: The sender takes care not to overwhelm the receiver (or intermediate nodes). This is based on periodic acknowledgements received by the sender from the receiver.

o Retransmission and duplicate handling: If a segment is not acknowledged within a specified timeout, the sender retransmits it. Based on the sequence number, the receiver is able to detect and reject duplicates.

o Buffering: Buffering is used to balance the flow between sender and receiver. If the receiving buffer is full, incoming segments are dropped. They will not be acknowledged and the sender will retransmit them.

o Checksum: Each segment carries a checksum. If a received segment does not match its checksum, it is dropped (and will be retransmitted).

UDP (User Datagram Protocol) UDP is a protocol that does not guarantee reliable transmission.

UDP offers no guarantee of delivery. According to the IP, packets may be dropped because of congestion or network error. UDP adds no additional reliability mechanism to this.

UDP provides a means of transmitting messages with minimal additional costs or transmission delays above those due to IP transmission. Its use is restricted to applications and services that do not require reliable delivery of messages.

If reliable delivery is requested with UDP, reliability mechanisms have to be implemented at the application level.
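One way such application-level reliability can look is stop-and-wait with sequence numbers and retransmission. The sketch below simulates a lossy channel with a plain function (an assumption for illustration; a real implementation would sit on UDP sockets):

```python
# Stop-and-wait reliability sketch: retransmit a numbered packet until acked.

def send_reliable(payload, seq, channel, max_tries=5):
    """Retransmit until the receiver acknowledges this sequence number."""
    for _ in range(max_tries):
        ack = channel(seq, payload)   # returns None when the packet is lost
        if ack == seq:
            return True
    return False

drops = iter([True, True, False])     # first two transmissions are lost
received = []

def lossy_channel(seq, payload):
    if next(drops, False):
        return None                   # packet dropped, no ack
    received.append((seq, payload))
    return seq                        # ack carries the sequence number

ok = send_reliable("hello", seq=0, channel=lossy_channel)
# ok is True and received == [(0, "hello")] after two simulated losses
```

A receiver would additionally reject duplicate sequence numbers, mirroring TCP's duplicate handling described above.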

Request and Reply Primitives
Communication between processes and objects in a distributed system is performed by message passing. In a typical scenario (e.g. the client-server model) such communication is through request and reply messages.


Request-Reply Communication in a Client-Server Model
The system is structured as a group of processes (objects), called servers, that deliver services to clients.

Remote Method Invocation (RMI) and Remote Procedure Call (RPC)
The goal: make, for the programmer, distributed computing look like centralized computing.
The solution:

Asking for a service is solved by the client issuing a simple method invocation or procedure call; because the server can be on a remote machine this is a remote invocation (call).

RMI (RPC) is transparent: the calling object (procedure) is not aware that the called one is executing on a different machine, and vice versa.

Implementation of RMI


Who are the players?

Object A asks for a service

Object B delivers the service

1. The proxy for object B

If an object A holds a remote reference to a (remote) object B, there exists a proxy object for B on the machine which hosts A. The proxy is created when the remote object reference is used for the first time. For each method in B there exists a corresponding method in the proxy.

The proxy is the local representative of the remote object ⇒ the remote invocation from A to B is initially handled like a local one from A to the proxy for B.

At invocation, the corresponding proxy method marshals the arguments and builds the message to be sent, as a request, to the server.

After reception of the reply, the proxy unmarshals the received message and returns the results to the invoker.
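The proxy's marshalling role can be sketched as follows; json stands in for the real marshalling format and a plain function call stands in for the transport, so all names here are illustrative assumptions:

```python
# Sketch of a proxy: marshal the method name and arguments into a request
# message, send it, and unmarshal the reply.
import json

def remote_object(request_msg):
    """Server side: unmarshal the request, invoke, marshal the result."""
    call = json.loads(request_msg)
    if call["method"] == "upper":
        return json.dumps({"result": call["args"][0].upper()})

class Proxy:
    """Local representative of the remote object."""
    def __init__(self, transport):
        self.transport = transport
    def upper(self, text):
        request = json.dumps({"method": "upper", "args": [text]})  # marshal
        reply = self.transport(request)                            # send/receive
        return json.loads(reply)["result"]                         # unmarshal

proxy = Proxy(remote_object)
value = proxy.upper("rmi")   # looks like a local call; value == "RMI"
```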

2. The skeleton for object B

On the server side, there exists a skeleton object corresponding to a class, if an object of that class can be accessed by RMI. For each method in B there exists a corresponding method in the skeleton.

The skeleton receives the request message, unmarshals it and invokes the corresponding method in the remote object; it waits for the result and marshals it into the message to be sent with the reply.

A part of the skeleton is also called dispatcher. The dispatcher receives a request from the communication module, identifies the invoked method and directs the request to the corresponding method of the skeleton.

The implementation is as follows:

Object A and Object B belong to the application.

Remote reference module and communication module belong to the middleware.

The proxy for B and the skeleton for B represent the so called RMI software. They are situated at the border between middleware and application and usually can be generated automatically with help of available tools that are delivered together with the middleware software.


Communication module

The communication modules on the client and server are responsible for carrying out the exchange of messages which implement the request/reply protocol needed to execute the remote invocation.

The particular messages exchanged and the way errors are handled depend on the RMI semantics which is implemented (see slide 40).

Remote reference module

The remote reference module translates between local and remote object references. The correspondence between them is recorded in a remote object table.

Remote object references are initially obtained by a client from a so-called binder that is part of the global name service (it is not part of the remote reference module). Here servers register their remote objects and clients look up services.

Remote Procedure Call (RPC)
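The notes do not prescribe a particular RPC framework, but the idea can be demonstrated with Python's standard-library xmlrpc modules (one possible choice, not the only one): the client invokes add as if it were a local call, while the procedure actually executes in the server process.

```python
# Minimal RPC demo with the stdlib xmlrpc modules over localhost.
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(lambda a, b: a + b, "add")
port = server.server_address[1]          # port 0 lets the OS pick a free port

# Serve exactly one request in the background, then the thread exits.
threading.Thread(target=server.handle_request, daemon=True).start()

client = ServerProxy(f"http://127.0.0.1:{port}")
total = client.add(2, 3)                 # marshalled, sent, executed remotely
server.server_close()
# total == 5
```

The ServerProxy object plays exactly the proxy role described above: it marshals the call into an XML request and unmarshals the reply.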


TIME AND STATE IN DISTRIBUTED SYSTEMS
Because each machine in a distributed system has its own clock, there is no notion of global physical time. The n crystals on the n computers will run at slightly different rates, causing the clocks gradually to get out of synchronization and give different values.

Problems:

Time-triggered systems: these are systems in which certain activities are scheduled to occur at predefined moments in time. If such activities are to be coordinated over a distributed system, we need a coherent notion of time. Example: time-triggered real-time systems.

Maintaining the consistency of distributed data is often based on the time when a certain modification has been performed. Example: a make program.

The make-program example
When the programmer has finished changing some source files he starts make; make examines the times at which all object and source files were last modified and decides which source files have to be recompiled. In the given example, although P.c is modified after P.o has been generated, because of clock drift the time assigned to P.c is smaller, i.e. P.c will not be recompiled for the new version!

Solutions:
1. Synchronization of physical clocks

Computer clocks are synchronized with one another to an achievable, known, degree of accuracy ⇒ within the bounds of this accuracy we can coordinate activities on different computers using each computer’s local clock.

Physical clock synchronization is needed for distributed real-time systems.


2. Logical clocks

In many applications we are not interested in the physical time at which events occur; what is important is the relative order of events! The make-program is such an example.

In such situations we don’t need synchronized physical clocks. Relative ordering is based on a virtual notion of time - logical time.

Logical time is implemented using logical clocks.

Lamport's Logical Clocks
1. The order of events occurring at different processes is critical for many distributed applications. Example: P.o_created and P.c_created.

2. Ordering can be based on two simple situations:
   I. If two events occurred in the same process then they occurred in the order observed within the respective process;
   II. Whenever a message is sent between processes, the event of sending the message occurred before the event of receiving it.

3. Ordering by Lamport is based on the happened-before relation (denoted by →):
   I. a → b, if a and b are events in the same process and a occurred before b;
   II. a → b, if a is the event of sending a message m in a process, and b is the event of the same message m being received by another process;
   III. If a → b and b → c, then a → c (the relation is transitive).

4. If a → b, we say that event a causally affects event b. The two events are causally related.

5. There are events which are not related by the happened-before relation. If both a → e and e → a are false, then a and e are concurrent events; we write a || e.

Example: P1, P2, P3: processes; a, b, c, d, e, f: events.
a → b, c → d, e → f, b → c, d → f
a → c, a → d, a → f, b → d, b → f, ...
a || e, c || e, ...


6. Using physical clocks, the happened-before relation cannot be captured. It is possible that b → c and at the same time Tb > Tc (Tb is the physical time of b).

7. Logical clocks can be used in order to capture the happened-before relation.

A logical clock is a monotonically increasing software counter.

There is a logical clock CPi at each process Pi in the system.

The value of the logical clock is used to assign timestamps to events. CPi(a) is the timestamp of event a in process Pi.

There is no relationship between a logical clock and any physical clock. To capture the happened-before relation, logical clocks have to be implemented so that if a → b, then C(a) < C(b).

Implementation Rules of Lamport's Logical Clocks
Implementation of logical clocks is performed using the following rules for updating the clocks and transmitting their values in messages:
[R1]:

CPi is incremented before each event is issued at process Pi: CPi := CPi + 1.
[R2]:

a) When ‘a’ is the event of sending a message ‘m’ from process Pi, then the timestamp tm = CPi(a) is included in ‘m’ (CPi(a) is the logical clock value obtained after applying rule R1).

b) On receiving message ‘m’ by process Pj, its logical clock CPj is updated as follows: CPj := max(CPj, tm).

c) The new value of CPj is used to timestamp the event of receiving message ‘m’ by Pj (applying rule R1).

If ‘a’ and ‘b’ are events in the same process and ‘a’ occurred before ‘b’, then a→b, and (by R1) C(a) < C(b).

If ‘a’ is the event of sending a message ‘m’ in a process, and ‘b’ is the event of the same message ‘m’ being received by another process, then a→b, and (by R2) C(a) < C(b).

If a → b and b → c, then a → c, and (by induction) C(a) < C(c).
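Rules R1 and R2 can be sketched directly in Python (the class and method names are mine, not from the notes):

```python
# Lamport clock sketch: increment before each event (R1); on receive, take
# the maximum of the local clock and the message timestamp, then apply R1 (R2).

class LamportClock:
    def __init__(self):
        self.time = 0
    def event(self):                 # R1: local event
        self.time += 1
        return self.time
    def send(self):                  # R2a: timestamp piggybacked on the message
        return self.event()
    def receive(self, tm):           # R2b + R2c: max, then increment
        self.time = max(self.time, tm)
        return self.event()

p1, p2 = LamportClock(), LamportClock()
a = p1.event()        # a = 1
b = p1.send()         # b = 2; the message carries tm = 2
c = p2.receive(b)     # c = max(0, 2) + 1 = 3, so C(b) < C(c) as required
```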


Problems with Lamport's Logical Clocks
Lamport's logical clocks impose only a partial order on the set of events; pairs of distinct events generated by different processes can have identical timestamps.

For certain applications a total ordering is needed; they consider that no two events can occur at the same time.

In order to enforce total ordering a global logical timestamp is introduced:

o the global logical timestamp of an event a occurring at process Pi, with logical timestamp CPi(a), is a pair (CPi(a), i), where i is an identifier of process Pi;

o we define (CPi(a), i) < (CPj(b), j) if and only if CPi(a) < CPj(b), or CPi(a) = CPj(b) and i < j.
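Since the pair (CPi(a), i) is compared lexicographically, Python tuples model it directly; the clock values below are made up for illustration:

```python
# Total ordering with (clock, process-id) pairs: tuples compare
# lexicographically, so equal clocks are tie-broken by the process id.

e1 = (5, 1)   # clock 5 at process 1
e2 = (5, 2)   # clock 5 at process 2: same clock value, higher process id
e3 = (4, 3)   # clock 4 at process 3

ordered = sorted([e1, e2, e3])
# ordered == [(4, 3), (5, 1), (5, 2)]: ties broken by process identifier
```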

Disadvantages of Lamport's Clocks
Lamport's logical clocks are not powerful enough to perform a causal ordering of events.

If a → b, then C(a) < C(b). However, the reverse is not always true (if the events occurred in different processes):

If C(a) < C(b), then a → b is not necessarily true. (It is only guaranteed that b → a is not true.)

In the figure C(e) < C(b), however there is no causal relation from event e to event b. By just looking at the timestamps of the events, we cannot say whether two events are causally related or not.

We would like messages to be processed according to their causal order. We would like to use the associated timestamp for this purpose.

Process P3 receives messages M1, M2, and M3. send(M1) → send(M2), send(M1) → send(M3), send(M3) || send(M2)


M1 has to be processed before M2 and M3. However, P3 does not have to wait for M3 in order to process M2 first, even though M3’s logical clock timestamp is smaller than M2’s (the two sends are concurrent).

Vector Clocks

Vector clocks give the ability to decide whether two events are causally related or not by simply looking at their timestamps.

Each process Pi has a clock CPi, which is an integer vector of length n (n is the number of processes).

The value of CPi is used to assign timestamps to events in process Pi. CvPi(a) is the timestamp of event a in process Pi.

CPi[i], the ith entry of CPi, corresponds to Pi’s own logical time.

CPi[j], j ≠ i, is Pi’s "best guess" of the logical time at Pj.

CPi[j] indicates the (logical) time of occurrence of the last event at Pj which is in a happened before relation to the current event at Pi.

Implementation of Vector Clocks

Implementation of vector clocks is performed using the following rules for updating the clocks and transmitting their values in messages:

[R1]:

CPi is incremented before each event is issued at process Pi: CPi[i] := CPi[i] + 1.

[R2]: a) When ‘a’ is the event of sending a message ‘m’ from process Pi, then the

timestamp tm = CPi(a) is included in ‘m’ (CPi(a) is the vector clock value obtained after applying rule R1).

b) On receiving message ‘m’ by process Pj, its vector clock CPj is updated as follows: ∀k ∈ {1, 2, .., n}, CPj[k] := max(CPj[k], tm[k]).

c) The new value of CPj is used to timestamp the event of receiving message ‘m’ by Pj (applying rule R1).

For any two vector timestamps u and v, we have:

u = v if and only if ∀i, u[i] = v[i]

u ≤ v if and only if ∀i, u[i] ≤ v[i]

u < v if and only if (u ≤ v ∧ u ≠ v)

u || v if and only if ¬(u < v) ∧ ¬(v < u)
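Rules [R1]–[R2] and the comparison relations above can be sketched together; the class and helper names are illustrative assumptions.

```python
# A sketch of vector clocks for n processes, with the relations defined
# above; names are illustrative, not from the notes.

class VectorClock:
    def __init__(self, n, i):
        self.v = [0] * n          # CPi starts as all zeros
        self.i = i                # index of the owning process (0-based)

    def tick(self):
        # [R1]: increment the own entry before each event
        self.v[self.i] += 1

    def send(self):
        # [R2a]: tick, then return the timestamp tm carried by the message
        self.tick()
        return list(self.v)

    def receive(self, tm):
        # [R2b]: component-wise maximum, then [R2c] applies [R1]
        self.v = [max(x, y) for x, y in zip(self.v, tm)]
        self.tick()

def leq(u, v):                    # u <= v  iff  for all i, u[i] <= v[i]
    return all(x <= y for x, y in zip(u, v))

def lt(u, v):                     # u < v   iff  u <= v and u != v
    return leq(u, v) and u != v

def concurrent(u, v):             # u || v  iff  not(u < v) and not(v < u)
    return not lt(u, v) and not lt(v, u)

p0, p1 = VectorClock(3, 0), VectorClock(3, 1)
tm = p0.send()                    # p0's clock becomes [1, 0, 0]
p1.receive(tm)                    # p1's clock becomes [1, 1, 0]
assert lt(tm, p1.v)               # the send causally precedes the receive
```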


Causal Order

Two events a and b are causally related if and only if Cv(a) < Cv(b) or Cv(b) < Cv(a); otherwise the events are concurrent.

With vector clocks we get the property which we missed for Lamport’s logical clocks:

o a → b if and only if Cv(a) < Cv(b). Thus, by just looking at the timestamps of the events, we can say whether two events are causally related or not.

Causal Ordering of Messages Using Vector Clocks

We would like messages to be processed according to their causal order. If Send(M1) → Send(M2), then every recipient of both messages M1 and M2 must receive M1 before M2.

Basic Idea: A message is delivered to a process only if the message immediately preceding it (in the causal ordering) has already been delivered to the process. Otherwise, the message is buffered.

We assume that processes communicate using broadcast messages. (There exist similar protocols for non-broadcast communication too.) The events of interest here are the sendings of messages ⇒ vector clocks are incremented only for message sends.

Implementation of Causal Order

BIRMAN-SCHIPER-STEPHENSON PROTOCOL (BSS)

Implementation of the protocol is based on the following rules:

[R1]:

a) Before broadcasting a message m, a process Pi increments the vector clock: CPi[i] := CPi[i] + 1.

b) The timestamp tm = CPi is included in m. [R2]: The receiving side, at process Pj, delays the delivery of message ‘m’ coming from Pi until both the following conditions are satisfied:


1. CPj[i] = tm[i] - 1
2. ∀k ∈ {1,2,..,n} - {i}, CPj[k] ≥ tm[k]

Delayed messages are queued at each process in a queue sorted by their vector timestamps; concurrent messages are ordered by their time of arrival.

[R3]: When a message is delivered at process Pj, its vector clock CPj is updated according to rule [R1:b] of the vector clock implementation.

tm[i] - 1 indicates how many messages originating from Pi precede m.

Step [R2.1] ensures that process Pj has received all the messages originating from Pi that precede m.

Step [R2.2] ensures that Pj has received all those messages received by Pi before sending m.
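The delivery conditions [R2.1] and [R2.2] can be checked with a small predicate. This is a sketch: `cpj` is Pj’s vector clock, `tm` the message timestamp, `i` the sender’s index (zero-based here), and the concrete vectors are made-up illustrations.

```python
# A sketch of the BSS delivery test [R2]. Since clocks are incremented
# only on sends, tm[i] counts Pi's broadcasts so far.

def can_deliver(cpj, tm, i):
    # [R2.1]: Pj has received all earlier broadcasts from Pi
    cond1 = cpj[i] == tm[i] - 1
    # [R2.2]: Pj has received everything Pi had received before sending m
    cond2 = all(cpj[k] >= tm[k] for k in range(len(cpj)) if k != i)
    return cond1 and cond2

# M2 from P1 was broadcast after P1 received M1 from P0, so a process
# that has not yet received M1 must buffer M2:
assert not can_deliver([0, 0, 0], tm=[1, 1, 0], i=1)
assert can_deliver([1, 0, 0], tm=[1, 1, 0], i=1)
```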

SCHIPER-EGGLI-SANDOZ PROTOCOL (SES)

The basic idea is the same as in BSS, but there is no need for broadcast messages. Each process maintains a vector V_P of size N - 1, where N is the number of processes in the system. V_P is a vector of tuples (P’, t): P’ is the destination process id and t a vector timestamp.

tm: logical time of sending message m
Tpi: present logical time at Pi

Initially, V_P is empty. The algorithm is as follows.

Sending a Message:

Send message M, time stamped tm, along with V_P1 to P2.

Insert (P2, tm) into V_P1. Overwrite the previous value of (P2,t), if any.

The pair (P2, tm) itself is not sent to P2. Any future message carrying (P2, tm) in its V_P cannot be delivered to P2 until tm < Tp2.

Delivering a message

If V_M (in the message) does not contain any pair (P2, t), it can be delivered.

/* (P2, t) exists */ If t ≥ Tp2, buffer the message. (Don’t deliver).

else (t < Tp2) deliver it.

What does the condition t ≥ Tp2 imply?

t is the message’s vector timestamp.

t ≥ Tp2 -> for some j, t[j] > Tp2[j]


This implies some events occurred without P2’s knowledge in other processes. So P2 decides to buffer the message.

When t < Tp2, message is delivered & Tp2 is updated with the help of V_P2 (after the merge operation).

Example:

M1 from P2 to P1: M1 + Tm (=<0,1,0>) + Empty V_P2

M2 from P2 to P3: M2 + Tm (<0, 2, 0>) + (P1, <0,1,0>)

M3 from P3 to P1: M3 + <0,2,2> + (P1, <0,1,0>)

M3 gets buffered because: o Tp1 is <0,0,0>, t in (P1, t) is <0,1,0> & so Tp1 < t

When M1 is received by P1: o Tp1 becomes <1,1,0>, by rules 1 and 2 of vector clock.

After updating Tp1, P1 checks buffered M3. o Now, Tp1 > t [in (P1, <0,1,0>)]. o So M3 is delivered.

On delivering the message: o Merge V_M (in the message) with V_P2 as follows. o If (P,t) is not there in V_P2, merge it in. o If (P,t) is present in V_P2, t is updated with max(t[i] in V_M, t[i] in V_P2) {component-wise maximum}. o A message cannot be delivered until t in V_M is greater than t in V_P2. o Update site P2’s local logical clock. o Check buffered messages after the local logical clock update.
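Following the worked example above, the delivery check can be sketched as below. The dict-based representation of V_M is an assumption, and "t < Tp2" is read as component-wise domination of t by the receiver’s clock, which matches the example.

```python
# A sketch of the SES delivery check; V_M is modelled as a dict
# {destination id: vector t}, an assumption about representation.

def deliverable(v_m, pid, tp):
    if pid not in v_m:                 # no pair (P2, t): deliver at once
        return True
    t = v_m[pid]
    return all(tj <= cj for tj, cj in zip(t, tp))

# M3 arrives at P1 carrying (P1, <0,1,0>) while Tp1 = <0,0,0>: buffered.
assert not deliverable({0: [0, 1, 0]}, 0, [0, 0, 0])
# After M1 is delivered, Tp1 = <1,1,0>: M3 becomes deliverable.
assert deliverable({0: [0, 1, 0]}, 0, [1, 1, 0])
```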


Global States

Problem: How to collect and record a consistent global state in a distributed system. Why is this a problem? Because there is no global clock (no coherent notion of time) and no shared memory! Consider a bank system with two accounts A and B at two different sites; we transfer $50 between A and B.

In general, a global state consists of a set of local states and a set of states of the communication channels.

The state of the communication channel in a consistent global state should be the sequence of messages sent along the channel before the sender’s state was recorded, excluding the sequence of messages received along the channel before the receiver’s state was recorded.

It is difficult to record channel states to ensure the above rule ⇒ global states are very often recorded without using channel states.

This is the case in the definition below.

Formal Definition

LSi is the local state of process Pi. Beside other information, the local state also includes a record of all messages sent and received by the process.

We consider the global state GS of a system, as the collection of the local states of its processes: GS = {LS1, LS2, ..., LSn}.

A certain global state can be consistent or not!

send(Mij) denotes the event of sending message Mij from Pi to Pj;

rec(Mij) denotes the event of receiving message Mij by Pj.

send(Mij) ∈ LSi if and only if the sending event occurred before the local state was recorded;

rec(Mij) ∈ LSj if and only if the receiving event occurred before the local state was recorded.

transit(LSi, LSj) = {Mij | send(Mij) ∈ LSi ∧ rec(Mij) ∉ LSj}
inconsistent(LSi, LSj) = {Mij | send(Mij) ∉ LSi ∧ rec(Mij) ∈ LSj}

Consistent Global State

A global state GS = {LS1, LS2, ..., LSn} is consistent if and only if: ∀i, ∀j: 1 ≤ i, j ≤ n :: inconsistent(LSi, LSj) = ∅

In a consistent global state for every received message a corresponding send event is recorded in the global state.

In an inconsistent global state, there is at least one message whose receive event is recorded but its send event is not recorded.

Transitless Global State

A global state GS = {LS1, LS2, ..., LSn} is transitless if and only if: ∀i, ∀j: 1 ≤ i, j ≤ n :: transit(LSi, LSj) = ∅

All messages recorded as sent are also recorded as received.

Strongly Consistent Global State

A global state is strongly consistent if it is consistent and transitless. A strongly consistent state corresponds to a consistent state in which all messages recorded as sent are also recorded as received.
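Since every local state records the messages it has sent and received, the definitions reduce to set comparisons. This is a sketch; the dict representation of a local state and the message ids are assumptions.

```python
# A sketch of the consistency definitions: each local state is modelled
# as a dict recording ids of messages sent and received.

def consistent(gs):
    sent = set().union(*(ls["sent"] for ls in gs))
    recv = set().union(*(ls["recv"] for ls in gs))
    return recv <= sent      # every recorded receive has a recorded send

def strongly_consistent(gs):
    sent = set().union(*(ls["sent"] for ls in gs))
    recv = set().union(*(ls["recv"] for ls in gs))
    return recv == sent      # consistent and transitless

ls1 = {"sent": {"m1"}, "recv": set()}
ls2 = {"sent": set(), "recv": {"m1"}}
ls3 = {"sent": set(), "recv": {"m2"}}   # m2's send was never recorded

assert strongly_consistent([ls1, ls2])
assert not consistent([ls1, ls3])                            # inconsistent
assert consistent([ls1]) and not strongly_consistent([ls1])  # m1 in transit
```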


The global state is seen as a collection of the local states, without explicitly capturing the state of the channels.

Example: {LS11, LS22, LS32} is inconsistent; {LS12, LS23, LS33} is consistent; {LS11, LS21, LS31} is strongly consistent.

Global State Recording: Chandy-Lamport Algorithm

The algorithm records a collection of local states which give a consistent global state of the system. In addition it records the state of the channels which is consistent with the collected global state.

Such a recorded "view" of the system is called a snapshot.

We assume that processes are connected through unidirectional channels and that message delivery is FIFO.

We assume that the graph of processes and channels is strongly connected (there exists a path between any two processes).

The algorithm is based on the use of a special message, snapshot token, in order to control the state collection process.

How to collect a global state

A process Pi records its local state LSi and later sends a message ‘m’ to Pj; LSj at Pj has to be recorded before Pj has received m.

The state SChij of the channel Chij consists of all messages that process Pi sent before recording LSi and which have not been received by Pj when recording LSj.

A snapshot is started at the request of a particular process Pi, for example, when it suspects a deadlock because of long delay in accessing a resource; Pi then records its state LSi and, before sending any other message, it sends a token to every Pj that Pi communicates with.

When Pj receives a token from Pi, and this is the first time it received a token, it must record its state before it receives the next message from Pi. After recording its state Pj sends a token to every process it communicates with, before sending them any other message.
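A single process’s part of the token protocol described above might be sketched as follows. This is a rough illustration, not the algorithm’s original code: the `Process` class, channel ids, and the `send` callback are assumptions, and local-state recording is a placeholder.

```python
# A per-process sketch of the snapshot rules, assuming FIFO channels.
# The message transport (the send callback) is left abstract.

class Process:
    def __init__(self, pid, out_channels, in_channels):
        self.pid = pid
        self.out = out_channels
        self.inputs = in_channels
        self.recorded = False
        self.local_state = None
        self.chan_state = {}          # incoming channel -> recorded messages
        self.recording = set()        # incoming channels still being recorded

    def record(self, send):
        # [SR1]: record the local state; [SR2]: token on every outgoing
        # channel, before any other message is sent
        self.recorded = True
        self.local_state = f"LS-{self.pid}"   # placeholder local state
        for ch in self.out:
            send(ch, "TOKEN")

    def start_snapshot(self, send):   # run by the initiating process
        self.record(send)
        self.recording = set(self.inputs)

    def on_message(self, ch, msg, send):
        if msg == "TOKEN":            # [RR1]
            if not self.recorded:     # first token: this channel's state is empty
                self.chan_state[ch] = []
                self.record(send)
                self.recording = set(self.inputs) - {ch}
            else:                     # token ends recording on this channel
                self.chan_state.setdefault(ch, [])
                self.recording.discard(ch)
        elif self.recorded and ch in self.recording:
            # message received after recording, before the token: channel state
            self.chan_state.setdefault(ch, []).append(msg)
```

A short run: a process that first sees a token on channel c1 records its state, forwards a token, and then records message m5 arriving on c3 as part of c3’s channel state.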


Algorithm

Rule for sender Pi: /* performed by the initiating process and by any other process at the reception of the first token */

[SR1]: Pi records its state.

[SR2]: Pi sends a token on each of its outgoing channels.

Rule for receiver Pj: /* executed whenever Pj receives a token from another process Pi on channel Chij */

[RR1]: If Pj has not yet recorded its state
Then Record the state of the channel: SChij := ∅.
Follow the "Rule for sender".
Else Record the state of the channel: SChij := M, where M is the set of messages that Pj received from Pi after Pj recorded its state and before Pj received the token on Chij.

End if.

Huang’s Termination Detection Algorithm

Model:

Processes can be active or idle

Only active processes send messages

An idle process can become active on receiving a computation message

Active process can become idle at any time

Termination: all processes are idle and no computation messages are in transit

A global snapshot can also be used to detect termination

One controlling agent has weight 1 initially

All other processes are idle initially and have weight 0

Computation starts when controlling agent sends a computation message to a process

An idle process becomes active on receiving a computation message

B(DW) – computation message with weight DW. Can be sent only by the controlling agent or an active process

Page 27: Distributed Systems-Unit 1

LECTURE NOTES: DISTRIBUTED SYSTEM (ECS-701) MUKESH KUMAR DEPARTMENT OF INFORMATION TECHNOLOGY

I.T.S ENGINEERING COLLEGE, GREATER NOIDA PLOT NO: 46, KNOWLEDGE PARK 3, GREATER NOIDA

C(DW) – control message with weight DW, sent by active processes to controlling agent when they are about to become idle

Let current weight at process = W

1. Send of B(DW): • Find W1, W2 such that W1 > 0, W2 > 0, W1 + W2 = W • Set W = W1 and send B(W2)

2. Receive of B(DW): • W=W + DW; • if idle, become active

3. Send of C(DW): • send C(W) to controlling agent • Become idle

4. Receive of C(DW): • W = W + DW • if W = 1, declare “termination”
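Rules 1–4 can be sketched with exact fractional weights so that the invariant (all weights sum to 1) holds without floating-point drift. The `Controller`/`Worker` split and the method names are illustrative assumptions; halving the weight on each send is one valid choice of W1, W2.

```python
# A sketch of Huang's weight-throwing rules; exact fractions keep
# the total weight invariant at 1.
from fractions import Fraction

class Controller:
    def __init__(self):
        self.weight = Fraction(1)
        self.terminated = False

    def send_computation(self, proc):      # rule 1 at the controlling agent
        half = self.weight / 2             # W1 = W2 = W/2 (one valid split)
        self.weight -= half
        proc.receive_computation(half)

    def receive_control(self, dw):         # rule 4
        self.weight += dw
        if self.weight == 1:
            self.terminated = True         # declare termination

class Worker:
    def __init__(self, controller):
        self.weight = Fraction(0)
        self.active = False
        self.controller = controller

    def receive_computation(self, dw):     # rule 2
        self.weight += dw
        self.active = True

    def send_computation(self, proc):      # rule 1 at an active process
        half = self.weight / 2
        self.weight -= half
        proc.receive_computation(half)

    def become_idle(self):                 # rule 3: return weight, go idle
        self.controller.receive_control(self.weight)
        self.weight = Fraction(0)
        self.active = False

c = Controller()
w1, w2 = Worker(c), Worker(c)
c.send_computation(w1)       # controller keeps 1/2, w1 gets 1/2
w1.send_computation(w2)      # w1 keeps 1/4, w2 gets 1/4
w1.become_idle()
assert not c.terminated      # 1/4 of the weight is still out at w2
w2.become_idle()
assert c.terminated and c.weight == 1
```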