
LECTURE NOTES: DISTRIBUTED SYSTEM (ECS-701) MUKESH KUMAR DEPARTMENT OF INFORMATION TECHNOLOGY

I.T.S ENGINEERING COLLEGE, GREATER NOIDA PLOT NO: 46, KNOWLEDGE PARK 3, GREATER NOIDA

UNIT-3 Agreement Protocols

System Model of Agreement Problems: Agreement problems have been studied under the following system model:

1. There are n processors in the system, and at most m processors can be faulty.
2. The processors can directly communicate with other processors by message passing. Thus, the system is logically fully connected.
3. A receiver processor always knows the identity of the sender processor of a message.
4. The communication medium is reliable, and only processors are prone to failures.

NEED FOR AGREEMENT ALGORITHMS

1. Model of Processor Failures:
   a. Crash fault
   b. Omission fault
   c. Malicious (Byzantine) fault

2. Synchronous vs. asynchronous computation:
   a. In a synchronous computation, a process knows all the messages it expects to receive.
   b. The agreement problem is not solvable in an asynchronous system.
   c. We will solve the agreement problem assuming synchronous computation.

3. Authenticated and Non-authenticated Messages:

a. In an authenticated message system, a faulty processor cannot forge a message or change the contents of a received message before relaying it to others.

b. In a non-authenticated message system, a processor has no way to verify the authenticity of a received message. A non-authenticated message is called an oral message.

4. Performance aspects:

The performance of an agreement protocol is generally determined by the following metrics:

i. Time
ii. Message traffic
iii. Storage overhead


CLASSIFICATION OF AGREEMENT PROBLEMS:

1. Byzantine Agreement Problem: An arbitrarily chosen processor (the source processor) broadcasts its initial value to all other processors. A protocol for Byzantine agreement must satisfy the following:

a. All non-faulty processors agree on the same value.
b. If the source processor is non-faulty, then the value agreed upon by all non-faulty processors must be the initial value of the source.
c. If the source processor is faulty, the non-faulty processors may agree on any common value.
d. It is irrelevant what value faulty processors agree on, or whether they agree on a value at all.

2. The Consensus Problem: In the consensus problem, every processor broadcasts its initial value to all other processors; the initial values of different processors may differ. A protocol for reaching consensus must satisfy the following:

a. All non-faulty processors agree on the same single value.
b. If the initial value of every non-faulty processor is v, then all non-faulty processors must agree on v.
c. If the initial values of the non-faulty processors differ, the non-faulty processors may agree on any common value.
d. It is irrelevant what value faulty processors agree on, or whether they agree on a value at all.

3. The Interactive Consistency Problem: In the interactive consistency problem, every processor broadcasts its initial value to all other processors; the initial values of different processors may differ. A protocol for reaching interactive consistency must satisfy the following:

a. All non-faulty processors agree on the same vector (v1, v2, ..., vn).
b. If the i-th processor is non-faulty and its initial value is vi, then the i-th value agreed on by all non-faulty processors must be vi.
c. If the j-th processor is faulty, the non-faulty processors may agree on any common value for the j-th element; the actual value held by the faulty processor is irrelevant.


Solution to the Byzantine agreement problem

a. There is an upper bound on the number of faulty processors that can be tolerated.
b. To reach agreement on a common value, non-faulty processors must be free from the influence of faulty processors.
c. If faulty processors dominate in number, they can prevent the non-faulty processors from reaching a consensus.
d. It is impossible to reach a consensus if the number of faulty processors, m, exceeds (n-1)/3.

Lamport-Shostak-Pease Algorithm (OM(m))

a. Requires m+1 rounds of message exchange and assumes a fully connected network.
b. Solves the Byzantine agreement problem for 3m+1 or more processors in the presence of at most m faulty processors.

The recursive algorithm

Algorithm OM(0):

a. The source processor sends its value to every processor.
b. Each processor uses the value it receives from the source.

Algorithm OM(m), m > 0:

a. The source processor sends its value to every processor.
b. For each i, let vi be the value processor i receives from the source. Processor i acts as the new source and initiates Algorithm OM(m-1), in which it sends the value vi to each of the n-2 other processors.
c. For each i and each j (j != i), let vj be the value processor i received from processor j in the above step using Algorithm OM(m-1). Processor i uses the value majority(v1, v2, ..., vn-1).
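The recursion can be illustrated with a small, self-contained simulation. This is only a sketch under simplifying assumptions: faulty behaviour is modelled by a `send` helper that flips the value for some recipients, the ATTACK/RETREAT value domain and the RETREAT default follow the usual presentation of the algorithm, and the names (`om`, `majority`, `send`) are introduced here purely for illustration.

```python
from collections import Counter

def majority(values):
    """Return the majority value, or the default RETREAT if there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else "RETREAT"

def om(source, value, processors, m, faulty):
    """Simulate Algorithm OM(m); returns {lieutenant: decided value}.

    source     -- id of the commanding processor for this round
    value      -- the value the source wants to send
    processors -- ids of the lieutenants (the source excluded)
    m          -- number of traitors the recursion still tolerates
    faulty     -- set of faulty processor ids (a faulty sender flips the
                  value for odd-numbered recipients, just to create conflict)
    """
    def send(frm, v, to):
        if frm in faulty and to % 2:
            return "RETREAT" if v == "ATTACK" else "ATTACK"
        return v

    if m == 0:
        # OM(0): each lieutenant simply uses the value received from the source.
        return {p: send(source, value, p) for p in processors}

    # OM(m), m > 0: each lieutenant relays the value it received by acting
    # as the source of OM(m-1) for the other lieutenants.
    received = {p: send(source, value, p) for p in processors}
    decided = {}
    for i in processors:
        others = [p for p in processors if p != i]
        # Values processor i obtains from the sub-rounds OM(m-1).
        sub = [om(j, received[j], [q for q in processors if q != j], m - 1, faulty)[i]
               for j in others]
        decided[i] = majority([received[i]] + sub)
    return decided

# Example: n = 4 processors (source 0 plus lieutenants 1..3), m = 1 traitor.
print(om(source=0, value="ATTACK", processors=[1, 2, 3], m=1, faulty={3}))
```

With n = 4 and one faulty lieutenant, both non-faulty lieutenants decide on the source's value, as required by conditions (a) and (b) of the Byzantine agreement problem.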

Performance

a. O(n^m) message complexity.


APPLICATIONS OF AGREEMENT ALGORITHMS

1. Fault-Tolerant Clock Synchronization

a. Sites maintain physical clocks that are closely synchronized with one another, even in the presence of Byzantine failures.
b. Because physical clocks drift, they need periodic resynchronization.

Assumptions:

a. All clocks are initially synchronized to approximately the same value.
b. A non-faulty process's clock runs at approximately the correct rate.
c. A non-faulty process can read the clock value of another non-faulty process with at most a small error ϵ.

A clock synchronization algorithm must satisfy:

a. At any time, the values of the clocks of all non-faulty processes are approximately equal.
b. There is a small bound on the amount by which the clock of a non-faulty process is changed during each resynchronization.

Interactive Convergence Algorithm

a. Clocks are initially synchronized.
b. Clocks are resynchronized often enough that two non-faulty clocks never differ by more than δ.
c. Each process reads the clock values of all other processes and sets its own clock to the average of these values.
d. If a clock value differs from the process's own clock by more than δ, the process replaces that value with its own clock value when taking the average.
e. This algorithm brings the clocks of non-faulty processes closer together.
f. It is assumed that all processes execute the algorithm instantaneously.
g. It is assumed that the error in reading another process's clock is zero.
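A minimal sketch of one resynchronization round follows, assuming each process can already read every other process's clock with negligible error (assumption g above); the names `DELTA`, `interactive_convergence`, and the sample readings are illustrative only.

```python
DELTA = 10.0  # maximum allowed skew between two non-faulty clocks

def interactive_convergence(clocks, own_index):
    """Return the new clock value for process `own_index`.

    clocks -- list of clock readings, one per process (own reading included)
    """
    own = clocks[own_index]
    # Replace any reading that differs from our own clock by more than DELTA
    # with our own value, then take the average.
    adjusted = [c if abs(c - own) <= DELTA else own for c in clocks]
    return sum(adjusted) / len(adjusted)

# Example: process 3 is faulty and reports a wildly wrong clock value.
readings = [100.0, 102.0, 99.0, 500.0]
print([interactive_convergence(readings, i) for i in range(3)])  # non-faulty processes
```

With δ = 10, the faulty reading of 500 is ignored by every non-faulty process, so the resulting averages stay close to one another.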

Interactive Consistency Algorithm

The Interactive Consistency Algorithm improves on the Interactive Convergence Algorithm in the following ways:

a. Any two processes compute approximately the same value for a given process's clock by taking the median of the clock values rather than the mean.
b. If a process is non-faulty, then every non-faulty process obtains approximately the correct value for that process's clock.


Therefore, if a majority of processes are non-faulty, the median of all the clock values is either approximately equal to a good clock’s value or it lies between the values of two good clocks
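The median step itself is simple to sketch; in the full algorithm the vector of clock values would first be agreed upon through an interactive consistency (Byzantine agreement) exchange, which is not shown here, and the function name is illustrative.

```python
import statistics

def resynchronize(clock_values):
    """Set the clock to the median of the exchanged clock readings."""
    return statistics.median(clock_values)

# One bad clock cannot drag the median far, unlike the mean.
print(resynchronize([100.0, 102.0, 99.0, 500.0]))  # -> 101.0
```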

2. Atomic Commit in Distributed Databases: the Byzantine agreement protocol can be used in the two-phase commit algorithm. In the atomic commit problem, the sites of a distributed database must agree on whether to commit or abort a transaction.

a. In the first phase of the atomic commit, sites execute their part of the distributed transaction and broadcast their decision (commit or abort) to all other sites.

b. In the second phase, each site, based on what it received from the other sites in the first phase, decides whether to commit or abort its part of the distributed transaction.

c. For all sites to reach the same decision, every site must receive identical responses from the other sites.

d. If some sites may behave maliciously, a Byzantine agreement protocol is used to reach a common decision on the distributed transaction.
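A minimal sketch of the phase-two decision rule is shown below. Collecting the votes is exactly the step where a Byzantine agreement protocol would be used when sites can be malicious; here the votes are simply passed in as a list, and the function name is illustrative.

```python
def atomic_commit_decision(votes):
    """votes -- list of 'commit' / 'abort' decisions broadcast in phase one."""
    # A site commits only if every participating site voted to commit.
    return "commit" if all(v == "commit" for v in votes) else "abort"

print(atomic_commit_decision(["commit", "commit", "commit"]))  # commit
print(atomic_commit_decision(["commit", "abort", "commit"]))   # abort
```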

Distributed File System: A file system is responsible for the organization, storage, retrieval, naming, sharing, and protection of files. File systems provide directory services, which convert a file name (possibly a hierarchical one) into an internal identifier (e.g. inode, FAT index). They contain a representation of the file data itself and methods for accessing it (read/write). The file system is responsible for controlling access to the data and for performing low-level operations such as buffering frequently used data and issuing disk I/O requests. Goals in designing a distributed file system:

Access transparency: Clients are unaware that files are distributed and can access them in the same way as local files are accessed.

Location transparency: A consistent name space exists encompassing local as well as remote files. The name of a file does not give its location.

Concurrency transparency: All clients have the same view of the state of the file system. This means that if one process is modifying a file, any other processes on the same system or remote systems that are accessing the files will see the modifications in a coherent manner.


Failure transparency: The client and client programs should operate correctly after a server failure.

Heterogeneity: File service should be provided across different hardware and operating system platforms.

Scalability: The file system should work well in small environments (1 machine, a dozen machines) and also scale gracefully to huge ones (hundreds through tens of thousands of systems).

Replication transparency: To support scalability, we may wish to replicate files across multiple servers. Clients should be unaware of this.

Migration transparency: Files should be able to move around without the client's knowledge.

Support fine-grained distribution of data: To optimize performance, we may wish to locate individual objects near the processes that use them.

Tolerance for network partitioning: The entire network or certain segments of it may be unavailable to a client during certain periods (e.g. disconnected operation of a laptop). The file system should be tolerant of this.

Distributed File System Concepts

i. File Service: A file service is a specification of what the file system offers to clients. A file server is the implementation of a file service and runs on one or more machines.

Figure: File service architecture

ii. File: A file itself contains a name, data, and attributes (such as owner, size, creation time, access rights). An immutable file is one that, once created, cannot be changed. Immutable files are easy to cache and to replicate across servers since their contents are guaranteed to remain unchanged.


Protection used in Distributed File Systems:

Capabilities: Each user is granted a ticket (capability) from some trusted source for each object to which it has access. The capability specifies what kinds of access are allowed.

Access control lists: Each file has a list of users associated with it and access permissions per user. Multiple users may be organized into an entity known as a group.

File service types

To provide a remote system with file service, we will have to select one of two models of operation.

1. Upload/Download model: In this model, there are two fundamental operations:

a. Read file: Transfers an entire file from the server to the requesting client.

b. Write file: Copies the file back to the server.

It is a simple model and efficient in that it provides local access to the file while it is being used. Three problems are evident: it can be wasteful if the client needs access to only a small amount of the file data; it can be problematic if the client does not have enough space to cache the entire file; and it raises the question of what happens when others need to modify the same file.

2. Remote Access Model: The file service provides remote operations such as open, close, read bytes, write bytes, get attributes, etc. The file system itself runs on the servers.

The drawback of this approach is that the servers are accessed for the duration of file access rather than just once to download the file and once to upload it.
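To make the contrast with the upload/download model concrete, the sketch below shows a remote-access style interface in which every operation refers to server-side state; the `RemoteFileService` class and its operations are hypothetical, not a real file-service API, and the network transport is omitted.

```python
class RemoteFileService:
    """Toy in-process stand-in for a remote-access file server."""
    def __init__(self):
        self._files = {}          # file name -> bytearray with the contents
        self._next_handle = 0
        self._open = {}           # handle -> file name

    def open(self, name):
        self._files.setdefault(name, bytearray())
        self._next_handle += 1
        self._open[self._next_handle] = name
        return self._next_handle

    def read(self, handle, offset, count):
        data = self._files[self._open[handle]]
        return bytes(data[offset:offset + count])

    def write(self, handle, offset, data):
        buf = self._files[self._open[handle]]
        buf[offset:offset + len(data)] = data

    def close(self, handle):
        del self._open[handle]

# Each call below would be one request/response over the network in a real system.
svc = RemoteFileService()
h = svc.open("/notes/unit3.txt")
svc.write(h, 0, b"agreement protocols")
print(svc.read(h, 0, 9))   # b'agreement'
svc.close(h)
```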


Another important distinction in providing file service is that between the directory service and the file service.

3. Directory service: In the context of file systems, the directory service maps human-friendly textual names for files to their internal locations, which can then be used by the file service.

4. File service: It provides the file interface (this is mentioned above).

5. Client module: This is the client-side interface for file and directory service. It provides a local file system interface to client software (for example, the vnode file system layer of a UNIX kernel).

Architecture of Distributed File System:

Figure: Architecture of a distributed file system


Figure: Typical data access in a distributed file system

Naming issues

In designing a distributed file service, we should consider whether all machines (and processes) should have the exact same view of the directory hierarchy. We might also wish to consider whether the name space on all machines should have a global root directory (super root) so that files can be accessed as, for example, //server/path. This is a model that was adopted by the Apollo Domain System, an early distributed file system, and more recently by the web community in the construction of a uniform resource locator (URL).


Goals in name resolution:

Location transparency: With location transparency, the path name of a file gives no hint of where the file is located. For instance, we may refer to a file as //server1/dir/file. The server (server1) can move anywhere without the client caring, so we have location transparency. However, if the file moves to server2, things will not work.

Location independence: With location independence, files can be moved without their names changing. Hence, if machine or server names are embedded into path names, we do not achieve location independence.

It is desirable to have access transparency, so that applications and users can access remote files just as they access local files. To facilitate this, the remote file system name space should be syntactically consistent with the local name space. One way of accomplishing this is to redefine the way files are named and to require an explicit syntax for identifying remote files. This can cause legacy applications to fail and user discontent (users will have to learn a new way of naming their files). An alternative solution is to use a file system mounting mechanism to overlay portions of another file system over a node in the local directory structure.

Mechanisms for building a Distributed File System

Mounting: Mounting is used in the local environment to construct a uniform name space from separate file systems (which reside on different disks or partitions) as well as to incorporate special-purpose file systems into the name space (e.g. /proc on many UNIX systems allows file system access to processes). A remote file system can be mounted at a particular point in the local directory tree. Attempts to access files and directories under that node are directed to the driver for that file system.


To summarize, our naming options are:

Machine and path naming (machine:path, ./machine/path).

Mount remote file systems onto the local directory hierarchy (merging the two name spaces).

Provide a single name space which looks the same on all machines.

The first two of these options are relatively easy to implement.

Caching

• To ensure reasonable performance of a file system, some form of caching is needed.
• In a local file system the rationale for caching is to reduce disk I/O.
• In a distributed file system (DFS) the rationale is to reduce both network traffic and disk I/O.
• In a DFS the client caches can be located either in primary memory or on disk.
• The server will always keep a cache in primary memory, in the same way as in a local file system.
• The block size of the cache in a DFS can vary from the size of a disk block to an entire file.

Cache Location

Where should the cached data be stored: on disk or in main memory? Disk caches have one clear advantage over main-memory caches: they survive even if the machine crashes. Main-memory caches have several other advantages:
• They allow diskless workstations.
• Data can be fetched more quickly from main memory than from disk.
• The server caches will always be in main memory. If the client caches are also located in main memory, a single caching mechanism can be built for both server and client.


The technology trend towards larger and less expensive memory has reduced the need for disk caches. If a disk cache is used, a main-memory cache is still needed for performance reasons, so in that case both types of cache are used.

Cache Update Policy

The policy used to write modified data back to the server's master copy has a critical effect on the system's performance and reliability. Update policies:

a. Write-through: The simplest and most reliable strategy. Write operations must wait until the data is written to the server. The effect is that the cache is only used for read operations.

b. Delayed write: Modifications are written to the cache and then written to the server at a later time. Write operations become quicker, and if data are overwritten before they are sent to the server, only the last update needs to be written to the server.

c. Write-on-close: While the file is open, only the local cache is used; data is written to the file server only when the file is closed. For files that are open for long periods and frequently modified, this gives better performance than delayed write.
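A small sketch contrasting write-through and delayed write follows; write-on-close corresponds to calling `flush()` when the file is closed. The `server` here is just a dictionary standing in for the remote file server, and the class names are illustrative only.

```python
class WriteThroughCache:
    def __init__(self, server):
        self.server, self.cache = server, {}

    def write(self, name, data):
        self.cache[name] = data
        self.server[name] = data        # every write waits for the server

class DelayedWriteCache:
    def __init__(self, server):
        self.server, self.cache, self.dirty = server, {}, set()

    def write(self, name, data):
        self.cache[name] = data         # only the local cache is updated now
        self.dirty.add(name)

    def flush(self):                    # called later (periodically, or on close)
        for name in self.dirty:
            self.server[name] = self.cache[name]
        self.dirty.clear()

server = {}
c = DelayedWriteCache(server)
c.write("f", b"v1")
c.write("f", b"v2")     # overwritten before flushing: only v2 reaches the server
c.flush()
print(server)           # {'f': b'v2'}
```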

Cache Consistency

Whenever caches are used, a method is needed to verify that the content of the cache is consistent with the master copy. This problem is more difficult in a DFS than in a local file system because every client has its own cache, whereas in a local file system all processes share one cache. There are two approaches to verifying the validity of cached data:

a. Client-initiated: The client initiates a validity check in which it contacts the server and checks whether the cache is consistent with the master copy. Choosing the validation frequency is the problem: if validation is done too often, both the network and the server may be heavily loaded.

b. Server-initiated: The server records, for each client, the files that it is caching. Inconsistency is possible whenever a file that is cached by other clients is modified. Whenever a file is updated, a message is sent to the clients that have the file cached, informing them that their cached copy is invalid.


Remote Services

When a client needs service from a server on another machine, a message is sent to the server requesting the service, and the server sends back a message with the requested data. A common way to achieve this is Remote Procedure Call (RPC): the idea is that an RPC should look like a normal subroutine call to the client. Another possibility is to use sockets directly. Using sockets directly in file system code, however, has a few disadvantages:

1. Sockets may not be available on all systems.
2. Making a connection using sockets requires knowledge of socket names. This is a type of system configuration data that should not be compiled into file system code.

RPC

1. RPC is actually a programming API (Application Programming Interface); the actual communication still uses message passing (and sockets).
2. An RPC is translated into a message sent to a certain port on the server machine.
3. A port is the address of a certain process at the server, for example the file server process.
4. When calling a local subroutine, the subroutine name is translated to the memory address of the subroutine by the linker.
5. When using RPC, the RPC subroutine name is instead translated to the address of a communication routine, and a message is passed as a parameter.

But how does the client know which port number to use? There are two methods:

1. A static port number is compiled into the communication routine.
2. Dynamic translation: the system has a server (portmap) that is called to get the port number of a specified service. When portmap is used, every service calls portmap at startup to register its port number.
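A toy illustration of dynamic port resolution follows: the client stub looks the port up at call time instead of having it compiled in. The dictionaries `PORTMAP` and `SERVERS` merely simulate the registry and the network; no real messages, sockets, or marshalling are involved, and all names are illustrative.

```python
PORTMAP = {}          # service name -> port number
SERVERS = {}          # port number  -> {procedure name: callable}

def register(service, port, procedures):
    """What a service does at startup: register its port with the portmap."""
    PORTMAP[service] = port
    SERVERS[port] = procedures

def rpc_call(service, procedure, *args):
    """Client stub: resolve the port dynamically, then 'send' the request."""
    port = PORTMAP[service]                        # dynamic translation via portmap
    request = {"proc": procedure, "args": args}    # would be marshalled into a message
    return SERVERS[port][request["proc"]](*request["args"])

# A trivial "file server" exposing one remote procedure.
register("fileserver", 2049, {"read_bytes": lambda name, n: name[:n]})
print(rpc_call("fileserver", "read_bytes", "lecture-notes.txt", 7))  # 'lecture'
```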

Stateless Server

1. A stateless server avoids the problems related to crashes. A client just retransmits requests if it gets no response.

2. The price for the more robust stateless server is reduced performance and some constraints on the design of the DFS.

3. Because a client resends a message if it does not get an answer in a specified time, the server may receive the same request several times.


4. This means that the operations must be idempotent, that is, they must give the same result if executed several times.
5. Self-contained read and write operations have this property if they use an absolute file position.
6. Destructive operations such as remove are more problematic.
7. Server-initiated methods for cache validation are inherently stateful and cannot be used.
8. UNIX read/write operations with file descriptors and implicit file positions are inherently stateful and cannot be used directly with a stateless server.
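A sketch of an idempotent, self-contained read, assuming a POSIX system: the request carries the file name, an absolute offset, and a byte count, so the server keeps no per-client state and a retransmitted request returns the same bytes. The function name and demo file are illustrative.

```python
import os

def stateless_read(path, offset, count):
    """Read `count` bytes at absolute position `offset` of file `path`."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, count, offset)   # the position is explicit, not implicit
    finally:
        os.close(fd)

# Retransmitting the same request yields the same result:
with open("demo.txt", "wb") as f:
    f.write(b"stateless file service")
print(stateless_read("demo.txt", 10, 4))   # b'file'
print(stateless_read("demo.txt", 10, 4))   # b'file' again (idempotent)
```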

Stateless Versus Stateful Server

1. In all communication protocols that use a connection mechanism, state information for the connection is stored at the server for as long as the connection is valid.

2. Examples of state information are descriptors for open files and read/write position pointers.

3. A stateless server does not store state information concerning clients. This requires that a datagram protocol is used where every packet has complete information regarding the request.

4. The advantage of storing state information is improved performance.
5. The disadvantage of state information is that it creates problems if either a client or a server crashes.
6. If a server crashes, the state information is lost and has to be recreated in some way.
7. If a client crashes, the server needs to detect this so that it can reclaim the space allocated to storing the state of crashed client processes.


Case Study: NFS

NFS is implemented using the Virtual File System (VFS) abstraction, which is now used by many different operating systems.

Distributed Shared Memory (DSM)

What Is DSM?

1. Distributed shared memory (DSM) implements the shared memory model in distributed systems, which have no physical shared memory.

2. The shared memory model provides a virtual address space shared between all nodes.

3. To overcome the high cost of communication in distributed systems, DSM systems move data to the location of access.

How Does DSM Work?

1. Data moves between main memory and secondary memory (within a node) and between main memories of different nodes.

2. Each data object is owned by a node:
   a. The initial owner is the node that created the object.
   b. Ownership can change as the object moves from node to node.

3. When a process accesses data in the shared address space, the mapping manager maps shared memory address to physical memory (local or remote)


Figure: Architecture of distributed shared memory

Advantages of distributed shared memory (DSM)

1. Data sharing is implicit, hiding data movement (as opposed to ‘Send’/‘Receive’ in message passing model)

2. Passing data structures containing pointers is easier (in message passing model data moves between different address spaces)

3. Moving the entire object to the user takes advantage of locality of reference.
4. Less expensive to build than a tightly coupled multiprocessor system: off-the-shelf hardware, no expensive interface to shared physical memory.
5. Very large total physical memory across all nodes: large programs can run more efficiently.
6. No serialized access to a common bus for shared physical memory, as in multiprocessor systems.
7. Programs written for shared memory multiprocessors can be run on DSM systems with minimal changes.


IMPLEMENTING DISTRIBUTED SHARED MEMORY

The Central Server Algorithm

1. A central server maintains all shared data:
   a. Read request: returns the data item.
   b. Write request: updates the data and returns an acknowledgement message.
2. Implementation:
   a. A timeout is used to resend a request if the acknowledgement fails to arrive.
   b. Sequence numbers associated with requests can be used to detect duplicate write requests.
   c. If an application's request to access shared data fails repeatedly, a failure condition is sent to the application.
3. Issues: performance and reliability.
4. Possible solutions:
   a. Partition the shared data between several servers.
   b. Use a mapping function to distribute and locate the data.
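A minimal sketch of the central-server algorithm, including the sequence-number check that detects duplicate writes caused by timeout-and-resend; the class and message names are illustrative only.

```python
class CentralServer:
    def __init__(self):
        self.data = {}          # shared data item -> value
        self.last_seq = {}      # client id -> last write sequence number applied

    def read(self, item):
        return self.data.get(item)

    def write(self, client, seq, item, value):
        if self.last_seq.get(client, -1) >= seq:
            return "ack"        # duplicate request (a resend): do not reapply
        self.data[item] = value
        self.last_seq[client] = seq
        return "ack"

server = CentralServer()
server.write("client-A", seq=1, item="x", value=42)
server.write("client-A", seq=1, item="x", value=42)   # resend after a lost ack
print(server.read("x"))    # 42, applied exactly once
```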

The Migration Algorithm

1. Operation:
   a. Ship (migrate) the entire data object (page, block) containing the data item to the requesting location.
   b. Allow only one node to access a shared data item at a time.
2. Advantages:
   a. Takes advantage of locality of reference.
   b. DSM can be integrated with the virtual memory (VM) system at each node:
      i. Make the DSM page size a multiple of the VM page size.
      ii. A locally held shared memory page can be mapped into the VM page address space.
      iii. If the page is not local, the fault handler migrates the page and removes it from the address space at the remote node.
3. To locate a remote data object:
   a. Use a location server.
   b. Maintain hints at each node.
   c. Broadcast a query.
4. Issues:
   a. Only one node can access a data object at a time.
   b. Thrashing can occur; to minimize it, set a minimum time that a data object must reside at a node.
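A sketch of the migration algorithm using a location-server dictionary: the whole block is shipped to the requesting node, which then accesses it locally. All names and data structures here are illustrative.

```python
location = {"block-7": "node-1"}          # block id -> node currently holding it
memories = {"node-1": {"block-7": [0] * 4}, "node-2": {}}

def access(node, block):
    holder = location[block]
    if holder != node:                    # block is remote: migrate it here
        memories[node][block] = memories[holder].pop(block)
        location[block] = node
    return memories[node][block]          # now a purely local access

access("node-2", "block-7")[0] = 99       # node-2 faults the block in and writes it
print(location["block-7"], memories)      # node-2 now holds (and owns) the block
```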


The Read-Replication Algorithm

1. Replicates data objects to multiple nodes.
2. The DSM keeps track of the locations of data objects.
3. Multiple nodes can have read access, or one node write access (multiple-readers, one-writer protocol).
4. After a write, all copies are invalidated or updated.
5. The DSM has to keep track of the locations of all copies of data objects. Examples of implementations:
   a. IVY: the owner node of a data object knows all nodes that have copies.
   b. PLUS: a distributed linked list tracks all nodes that have copies.

Advantage: Read-replication can lead to substantial performance improvements if the ratio of reads to writes is large.
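A sketch of the multiple-readers/one-writer scheme: the DSM layer records which nodes hold replicas and invalidates all other copies before a write proceeds. The class and node names are illustrative only.

```python
class ReadReplicationDSM:
    def __init__(self):
        self.value = {}          # block -> current value
        self.copies = {}         # block -> set of nodes holding a replica

    def read(self, node, block):
        self.copies.setdefault(block, set()).add(node)   # node gets a replica
        return self.value.get(block)

    def write(self, node, block, new_value):
        # Invalidate every other copy before the write executes.
        invalidated = self.copies.get(block, set()) - {node}
        self.copies[block] = {node}
        self.value[block] = new_value
        return invalidated        # these nodes must re-read to see the block again

dsm = ReadReplicationDSM()
dsm.value["b"] = 1
dsm.read("n1", "b"); dsm.read("n2", "b")
print(dsm.write("n3", "b", 2))    # copies at n1 and n2 were invalidated
```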

The Full–Replication Algorithm

1. An extension of the read-replication algorithm: multiple nodes can read and multiple nodes can write (multiple-readers, multiple-writers protocol).
2. Issue: consistency of data with multiple writers.
3. Solution: use of a gap-free sequencer:
   a. All writes are sent to the sequencer.
   b. The sequencer assigns a sequence number and sends the write request to all sites that have copies.
   c. Each node performs writes in sequence-number order.
   d. A gap in the sequence numbers indicates a missing write request; the node asks for retransmission of the missing write requests.
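A sketch of the gap-free sequencer: writes are stamped with consecutive sequence numbers, and a replica that observes a gap knows that a write was lost and must be retransmitted. The classes below are illustrative only.

```python
class Sequencer:
    def __init__(self):
        self.next_seq = 0

    def stamp(self, write):
        self.next_seq += 1
        return (self.next_seq, write)     # (sequence number, write request)

class Replica:
    def __init__(self):
        self.applied = 0                  # highest sequence number applied so far
        self.data = {}

    def deliver(self, seq, write):
        if seq != self.applied + 1:
            return f"gap detected: missing writes {self.applied + 1}..{seq - 1}"
        item, value = write
        self.data[item] = value
        self.applied = seq
        return "applied"

seq = Sequencer(); replica = Replica()
s1 = seq.stamp(("x", 1)); s2 = seq.stamp(("x", 2)); s3 = seq.stamp(("y", 3))
print(replica.deliver(*s1))   # applied
print(replica.deliver(*s3))   # gap detected: write 2 was lost, ask for retransmission
```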

Memory Coherence

1. DSM systems are based on:
   a. Replicated shared data objects.
   b. Concurrent access to data objects at many nodes.
2. Coherent memory: the value returned by a read operation is the expected value (e.g., the value of the most recent write).
3. A mechanism that controls and synchronizes accesses is needed to maintain memory coherence.
4. Sequential consistency: A system is sequentially consistent if
   a. the result of any execution of the operations of all processors is the same as if they were executed in some sequential order, and


   b. the operations of each processor appear in this sequence in the order specified by its program.

5. General consistency:
   a. All copies of a memory location (replicas) eventually contain the same data when all writes issued by every processor have completed.
6. Processor consistency:
   a. Operations issued by a processor are performed in the order they are issued.
   b. Operations issued by several processors may not be performed in the same order (e.g., simultaneous reads of the same location by different processors may yield different results).
7. Weak consistency:
   a. Memory is consistent only (immediately) after a synchronization operation.
   b. A regular data access can be performed only after all previous synchronization accesses have completed.
8. Release consistency:
   a. A further relaxation of weak consistency.
   b. Synchronization operations must be consistent with each other only within a processor.
   c. Synchronization operations: Acquire (i.e., lock) and Release (i.e., unlock).
   d. Sequence: Acquire; regular accesses; Release.

Coherence Protocols

Issues:

1. How do we ensure that all replicas have the same information?
2. How do we ensure that nodes do not access stale data?

Write-invalidate protocol

1. A write to shared data invalidates all copies except one before the write executes.
2. Invalidated copies are no longer accessible.

Advantages: good performance when there are
1. Many updates between reads.
2. Per-node locality of reference.

Disadvantages:
1. Invalidations are sent to all nodes that have copies.
2. Inefficient if many nodes access the same object.


Examples: most DSM systems, including IVY, Clouds, Dash, Memnet, Mermaid, and Mirage.

Write-update protocol

1. A write to shared data causes all copies to be updated (the new value is sent, instead of an invalidation).
2. More difficult to implement.

Design Issues in DSM

Granularity: the size of the shared memory unit.

1. If DSM page size is a multiple of the local virtual memory (VM) management page size (supported by hardware), then DSM can be integrated with VM, i.e. use the VM page handling.

2. Advantages vs. disadvantages of a large page size:
   a. (+) Exploits locality of reference.
   b. (+) Less overhead in page transport.
   c. (-) More contention for a page by many processes.
3. Advantages vs. disadvantages of a small page size:
   a. (+) Less contention.
   b. (+) Less false sharing (a page containing two items that are not shared but are needed by two different processes).
   c. (-) More page traffic.
4. Examples:
   a. PLUS: page size 4 Kbytes; the unit of memory access is a 32-bit word.
   b. Clouds, Munin: an object is the unit of shared data.

Page replacement

1. Replacement algorithm (e.g. LRU) must take into account page access modes: shared, private, read-only, writable

2. Example: LRU with access modes.
   a. Private (local) pages are replaced before shared ones.
   b. Private pages are swapped to disk.
   c. Shared pages are sent over the network to their owner.
   d. Read-only pages may be discarded (the owners have a copy).
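A sketch of victim selection that respects the access modes listed above, assuming read-only pages are the cheapest to evict (they can simply be discarded), then private pages (swapped to local disk), and finally shared pages (sent back to their owner). The page-table layout and the eviction order are illustrative assumptions.

```python
from collections import OrderedDict

def choose_victim(pages):
    """pages -- OrderedDict name -> mode, ordered from least to most recently used."""
    for preferred_mode in ("read-only", "private", "shared"):
        for name, mode in pages.items():          # iterates in LRU order
            if mode == preferred_mode:
                return name, mode
    return None

lru = OrderedDict([("p1", "shared"), ("p2", "private"), ("p3", "read-only")])
print(choose_victim(lru))   # ('p3', 'read-only'): discarded, since the owner keeps a copy
```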