joel crichlow distributed systems: computing over networks, phi managing distributed resources

JOEL CRICHLOW, DISTRIBUTED SYSTEMS: COMPUTING OVER NETWORKS, PHI

MANAGING DISTRIBUTED RESOURCES

ISSUES

• Naming and Addressing
• Sharing
• Availability and Reliability
• Replication
• Privacy and Security

NAMING AND ADDRESSING

• Identify
  • node/group/user
  • root-directory/sub-directory/filename

• Locate/Find
  • Location Independence
  • Mapping
  • Name Servers

NAME SERVERS

• Allocate the address translation responsibilities to a name server

• Users interact with the client machines through symbolic names

• The clients communicate with a name server, which does the name-to-address resolution

[Figure: a Client, a Name Server and an Other server, with the interactions between them numbered 1 to 3]

NAME SERVERS

• The name server may be designed to answer requests for the name of a resource/service given its address

• Performance
  • Table entries for critical resources may be held in nonvolatile primary storage
  • Caching at server, caching at clients

• Cooperating Name Servers
  • Replication, Partitioning
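The name server behaviour described on these slides can be sketched in a few lines. This is a minimal illustration only: the class names `NameServer` and `Client`, the resource name `printer` and the address `10.0.0.7` are all hypothetical, and a real name service would add replication, partitioning and cache expiry.

```python
class NameServer:
    """Toy name server: maps symbolic names to addresses and back."""

    def __init__(self):
        self._by_name = {}
        self._by_address = {}

    def register(self, name, address):
        self._by_name[name] = address
        self._by_address[address] = name

    def resolve(self, name):
        """Name-to-address resolution; None if the name is unknown."""
        return self._by_name.get(name)

    def reverse(self, address):
        """Address-to-name lookup; None if the address is unknown."""
        return self._by_address.get(address)


class Client:
    """Client that caches answers so repeated lookups skip the server."""

    def __init__(self, server):
        self._server = server
        self._cache = {}
        self.server_requests = 0  # counts requests that reached the server

    def lookup(self, name):
        if name in self._cache:
            return self._cache[name]
        self.server_requests += 1
        address = self._server.resolve(name)
        if address is not None:
            self._cache[name] = address
        return address


ns = NameServer()
ns.register("printer", "10.0.0.7")  # hypothetical entry
client = Client(ns)
first = client.lookup("printer")    # goes to the name server
second = client.lookup("printer")   # answered from the client cache
```

The second lookup never reaches the server, which is the performance benefit of caching at clients.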

DOMAIN NAME SYSTEM

• Distributed Name Service
• Multi-level set of domains
• Partitioning
• Replication
• Caching
• IPv4 (32 bits), IPv6 (128 bits)

DNS: IPV4 ADDRESS FORMATS

Each address is 32 bits, drawn as four 8-bit fields.

Class A: 0 | Network | Host
Class B: 10 | Network | Host
Class C: 110 | Network | Host
Class D: 1110 | Multicast address
Class E: 11110 | Reserved for future use
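As a rough illustration of how the leading bits select the class, the sketch below classifies a dotted-decimal address by its first octet. The function name `ipv4_class` is hypothetical, and the numeric ranges are simply the decimal equivalents of the leading-bit patterns for classes A to D.

```python
def ipv4_class(address):
    """Return the classful address class implied by the first octet."""
    first_octet = int(address.split(".")[0])
    if not 0 <= first_octet <= 255:
        raise ValueError("first octet out of range")
    if first_octet < 128:   # leading bit 0
        return "A"
    if first_octet < 192:   # leading bits 10
        return "B"
    if first_octet < 224:   # leading bits 110
        return "C"
    if first_octet < 240:   # leading bits 1110 (multicast)
        return "D"
    return "E"              # remaining values fall in the reserved range
```

For example, `ipv4_class("172.16.0.1")` returns `"B"` because 172 is 10101100 in binary.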

DNS

• A slow but steady transition to IPv6 is taking place

• IPv6 is not interoperable with IPv4; therefore, a transition technology is needed

• Tunneling places IPv6 packets within IPv4 packets

• The Dual-stack implementation allows both protocols to run in the same network

[Figure: IPv6 (v6) and IPv4 (v4) network segments]

DIRECTORY SERVICE

• The name service can be incorporated into a more comprehensive directory service, which supports not only locating services and resources but also supplying information on people

• X.500, defined by CCITT and ISO, is a good early example of such a directory service

• Several other directory services exist
  • A notable example, based on X.500, is LDAP, the Lightweight Directory Access Protocol, which uses the TCP/IP stack

SHARING

• Access Control
• Scheduling
• Allocation
• Sharing Primary Memory

SHARING

Access Control List (ACL): per-resource list

ACL for Resource0

Staff     RE
System    RWE
Student   R

SHARING

Capability List - CL

            System Class CL        Resource0 CL
Resource0   Capability with RWE    Capability with RWE
Resource1   Capability with RE     Capability with RE
Resource2   Capability with E      Capability with E
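The two access-control structures can be contrasted in a short sketch. It reuses the rights shown on the slides (R, W, E); the dictionary layout and the function names `acl_allows` and `capability_allows` are assumptions made for illustration.

```python
# ACL: stored with the resource, keyed by the class of user.
ACL_RESOURCE0 = {"Staff": "RE", "System": "RWE", "Student": "R"}

# Capability list: stored with the holder, keyed by resource.
SYSTEM_CL = {"Resource0": "RWE", "Resource1": "RE", "Resource2": "E"}


def acl_allows(acl, user_class, right):
    """Check the resource's ACL for the requested right."""
    return right in acl.get(user_class, "")


def capability_allows(capability_list, resource, right):
    """Check the holder's capability list for the requested right."""
    return right in capability_list.get(resource, "")
```

With an ACL the question is asked of the resource ("may Student write?"), whereas with a capability list it is asked of the holder ("does System hold a write capability for Resource1?").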

SHARING

• Scheduling
  • Pool of identical resources
  • Only one resource

• Allocation
  • Local vs remote resources
  • Mutually exclusive access
  • Indefinite postponement

• Hardware
• Software

• Consistency

SHARING PRIMARY MEMORY

• Distributed Shared Memory
  • Shareable Unit
    • Physical block
    • Logical block
  • Synchronization
  • Consistency

DISTRIBUTED SHARED MEMORY

Sequential vs Release Consistency

Process
begin
    a = 0
    b = 0
    a = a + 1
    b = b + 1
end

Process
begin
    acquire-lock(CS)
    a = a + 1
    b = b + 1
    release-lock(CS)
end
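The lock-based version can be mimicked with a small single-threaded simulation. It is only a sketch of the release-consistency idea, under the assumption that a node works on a local copy inside the critical section and publishes its updates when the lock is released; the classes `SharedMemory` and `Node` are hypothetical.

```python
class SharedMemory:
    """The globally visible copy of the shared variables."""

    def __init__(self):
        self.values = {"a": 0, "b": 0}
        self.lock_owner = None


class Node:
    """A process that updates a local copy while it holds the lock."""

    def __init__(self, name, memory):
        self.name = name
        self.memory = memory
        self.local = {}

    def acquire_lock(self):
        if self.memory.lock_owner is not None:
            raise RuntimeError("lock is already held")
        self.memory.lock_owner = self.name
        self.local = dict(self.memory.values)  # fetch an up-to-date copy

    def increment(self, variable):
        self.local[variable] += 1              # local update only

    def release_lock(self):
        self.memory.values.update(self.local)  # publish updates at release
        self.memory.lock_owner = None


memory = SharedMemory()
process = Node("P1", memory)
process.acquire_lock()
process.increment("a")
process.increment("b")
during = dict(memory.values)   # what other nodes could see before release
process.release_lock()
after = dict(memory.values)    # what other nodes see after release
```

Other nodes see neither update until the release, so the two increments need only one propagation step instead of one per write.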

PAGED DSM

[Figure: the PMT of a process, with entry fields L/NL, S/NS, SSA/PM and page frame, linked to the Page Manager, whose Page Table maps a shared-page-ID to a page in DSM within the shared paged global memory]

Process PMT entries indicate loaded/not-loaded (L/NL), shared/not-shared (S/NS), if not-shared the secondary storage address (SSA), if shared the link to the DSM Page Manager (PM), and if loaded the page frame number in the local memory
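One way to picture such an entry is as a small record plus a lookup rule. The sketch below is an assumption-laden illustration: the field names and the function `locate` are hypothetical, and it only encodes the three cases named in the caption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PMTEntry:
    loaded: bool                                      # L/NL
    shared: bool                                      # S/NS
    page_frame: Optional[int] = None                  # set when loaded
    secondary_storage_address: Optional[int] = None   # SSA, when not shared
    shared_page_id: Optional[int] = None              # link to the PM, when shared


def locate(entry):
    """Say where a reference to this page should be directed."""
    if entry.loaded:
        return ("frame", entry.page_frame)
    if entry.shared:
        return ("page-manager", entry.shared_page_id)
    return ("secondary-storage", entry.secondary_storage_address)
```

A loaded page is served from local memory regardless of whether it is shared; only a page fault needs the Page Manager or secondary storage.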

LOGICAL DSM

• Linda
  • The ‘tuple-space’ model of parallel programming
  • It consists of two types of logical tuples: process tuples and data tuples
  • Process tuples are active and can execute; data tuples are passive
  • Process tuples can execute simultaneously
  • When a process tuple is finished executing, it turns into a data tuple
  • There are four basic primitives in Linda: out, in, rd, eval

• Orca
  • An object-based, language-based DSM

• Component Technologies and Java
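The four Linda primitives can be imitated with a toy tuple space. This is a simplified sketch, not Linda itself: the class `TupleSpace` is hypothetical, `None` is used as a wildcard in patterns, `in_` and `rd` return `None` instead of blocking when nothing matches, and `eval_` runs the function immediately rather than as a concurrent process tuple. The trailing underscores avoid clashing with Python's `in` keyword and `eval` built-in.

```python
class TupleSpace:
    def __init__(self):
        self._tuples = []

    def out(self, *fields):
        """Place a data tuple in the space."""
        self._tuples.append(tuple(fields))

    def _find(self, pattern):
        for candidate in self._tuples:
            if len(candidate) == len(pattern) and all(
                p is None or p == f for p, f in zip(pattern, candidate)
            ):
                return candidate
        return None

    def rd(self, *pattern):
        """Read a matching tuple without removing it."""
        return self._find(pattern)

    def in_(self, *pattern):
        """Remove and return a matching tuple."""
        found = self._find(pattern)
        if found is not None:
            self._tuples.remove(found)
        return found

    def eval_(self, func, *args):
        """Run a 'process tuple'; its result becomes a data tuple."""
        self.out(*func(*args))


space = TupleSpace()
space.out("count", 1)
space.eval_(lambda x: ("square", x * x), 4)
read_result = space.rd("square", None)
taken = space.in_("count", None)
taken_again = space.in_("count", None)
```

After `eval_` finishes, its result sits in the space as an ordinary data tuple, mirroring the statement that a finished process tuple turns into a data tuple.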

AVAILABILITY AND RELIABILITY

• Performance
• Service Outcomes
• How Reachable
  • LAN
  • WAN


WAN
• The number of possible routes through the network between user and resource
• The channel capacity through the various communication links
• The communication protocols employed


Processor and Memory Upgrades
• Faster Processor
• More Memory
• Caches
• Secondary Memory

CACHING

• Locality principle
• Cache consistency
• Cacheable and non-cacheable data
• Write-through
• Copy-back
• Write-invalidate
• Write-update
• Snoopy cache
• Directory scheme


Software Design

[Figure: several clients sending requests to a queue that is serviced by the SERVER]


Databases
• Partitioning
• Replication
• Replicated Dictionary
• Queries and Sub-queries

Make a reservation for Dorothy Swift on a red sports car to be picked up in New York on (date and time given), a small hatchback to be picked up by Jill Plain in Los Angeles on (date and time given) and a station wagon for Jack Baggage in London on (date and time given).


Find the relevant relations (or objects) quickly

A replicated dictionary is required

Once the relations (objects) are located, a decision must be made quickly on what should be shipped

The request can be split into three queries

The sub-query processing is then done at the three sites

Alternatively, the pertinent records or pages could be shipped from the remote sites for processing at the initiator

AVAILABILITY AND RELIABILITY: MEMCACHED

• Distributed databases also form the backbone for many dynamic web-based applications

• A key approach to improving availability in such systems is to cache recently referenced data in memory

• A noteworthy software tool that provides such a caching service is Memcached (memcached.org)
  • It is a free and open source distributed caching service that uses memory as a cache for data objects that are normally stored on the back-end database
  • It uses a key-value store with a hash table that can be distributed across many computers, a pool of servers
  • When the table is full new arrivals are accommodated by removing old data based on LRU (Least Recently Used) order
  • The client uses the input key and a hashing algorithm to locate a server from among that client’s list of Memcached servers
  • That server then uses the key to store the key-value pair into its internal hash table
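The two-step lookup described above (hash the key to pick a server, then use the key in that server's table) can be sketched as follows. This is not the Memcached API or its actual hashing scheme: the classes `CacheServer` and `CacheClient` are hypothetical, SHA-256 modulo the pool size stands in for whatever hashing algorithm a real client uses, and capacity is counted in entries rather than bytes.

```python
import hashlib
from collections import OrderedDict


class CacheServer:
    """One server in the pool: a bounded hash table with LRU eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.table = OrderedDict()

    def set(self, key, value):
        if key in self.table:
            self.table.move_to_end(key)
        self.table[key] = value
        if len(self.table) > self.capacity:
            self.table.popitem(last=False)  # evict the least recently used

    def get(self, key):
        if key not in self.table:
            return None
        self.table.move_to_end(key)         # mark as recently used
        return self.table[key]


class CacheClient:
    """Client that hashes each key to one server from its own list."""

    def __init__(self, servers):
        self.servers = servers

    def server_for(self, key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return self.servers[int.from_bytes(digest, "big") % len(self.servers)]

    def set(self, key, value):
        self.server_for(key).set(key, value)

    def get(self, key):
        return self.server_for(key).get(key)


pool = [CacheServer(capacity=2), CacheServer(capacity=2)]
cache = CacheClient(pool)
cache.set("user:42", "Dorothy Swift")
hit = cache.get("user:42")
miss = cache.get("user:43")
```

A miss (`None` here) is the signal for the application to fetch the object from the back-end database and then `set` it in the cache.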

REPLICATION

Maintaining copies of resources at separate nodes in the network can:
• Improve the pattern of communication traffic
• Help load sharing
• Reduce response times
• Offer an alternative when a resource becomes unavailable

REPLICATION

• How many Copies
• Replicas as members of a Group
• Membership Service
  • CreateGroup
  • JoinGroup
  • LeaveGroup
  • A member may leave the group voluntarily or through failure

GROUP CHANGES

• A new member joining or a member leaving changes the composition of the group
  • If the membership changes during the multicast of a message, how should the outcome of the multicast be classified?
  • If the replicas were participating in transaction processing and a member failed, what happens next? What happens if the failed member was, in fact, the coordinator of the activity?

• It is necessary that group changes be known and that a group change be synchronized with other pertinent group activity

• In the ISIS toolkit, group changes are handled by the maintenance of group views
  • A view captures the current membership list and bears a unique identifier (i.e. a sequence number)
  • A group view i+1 differs from its immediate predecessor i either by the addition of a new member or by the departure of a member, voluntarily or through failure

• Group activity, e.g. message passing, can then be associated with a particular view
  • If the view changes before an activity is complete, a decision can be made with respect to the outcome of that activity
  • Coordinating view changes is primarily a message delivery issue
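The idea of tagging activity with a view can be sketched as below. This is not the ISIS toolkit interface: the classes `View` and `Group` and their methods are hypothetical, and the sketch keeps a single local record of the view rather than coordinating it among members.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    view_id: int          # sequence number of the view
    members: frozenset    # current membership list


class Group:
    def __init__(self, members):
        self.view = View(0, frozenset(members))

    def join(self, member):
        # View i+1 differs from view i by one added member.
        self.view = View(self.view.view_id + 1, self.view.members | {member})

    def leave(self, member):
        # View i+1 differs from view i by one departed member.
        self.view = View(self.view.view_id + 1, self.view.members - {member})

    def send(self, payload):
        """Associate a message with the view in which it was sent."""
        return (self.view.view_id, payload)

    def is_current(self, message):
        """True if the message belongs to the view now installed."""
        return message[0] == self.view.view_id


group = Group({"p1", "p2"})
message = group.send("update x")
before_change = group.is_current(message)
group.leave("p2")                     # voluntary departure or detected failure
after_change = group.is_current(message)
```

A message found to belong to an earlier view is exactly the case where a decision about the outcome of the activity has to be made.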

RELIABILITY OF MESSAGE DELIVERY

• Unreliable multicast
  • Deliver the message to all the members of the group without acknowledgement

• Reliable multicast
  • Ensure that some (if not all) members of the group receive the message

• Atomic multicast
  • All operational members in the group receive the message or none of them do

RELIABILITY OF MESSAGE DELIVERY

• Any member of the group can fail during the multicast
• What if the originator fails?
  • The originator must be monitored to determine the failure
  • An effective way to do this is to require that the originator multicast an “I’m alive” message periodically
  • On determining the failure another member must assume the role of originator and attempt to complete the multicast

• An election algorithm is invoked when a member of the group concludes that the leader has failed

• The Bully Algorithm can be used
  • The biggest bloke on the block always wins
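The outcome of a Bully election can be illustrated with a compact simulation. It is a sketch under simplifying assumptions: members are identified by integers, "biggest" means the highest identifier, the set of live members is known to the function, and the message exchange (election, answer, coordinator) is collapsed into recursion. The function name `bully_election` is hypothetical.

```python
def bully_election(initiator, alive):
    """Return the leader elected when `initiator` starts an election.

    `alive` is the set of identifiers of members that are still operational.
    """
    if initiator not in alive:
        raise ValueError("a failed member cannot start an election")
    # The initiator challenges every member with a higher identifier.
    higher = [member for member in alive if member > initiator]
    if not higher:
        # Nobody bigger answered, so the initiator becomes the leader.
        return initiator
    # A bigger live member answers and takes over the election.
    return bully_election(min(higher), alive)
```

For instance, if member 6 was the leader and has failed, `bully_election(2, {1, 2, 3, 5})` hands the election upward until member 5 wins.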

MESSAGE ORDERING

• Unordered
• Totally ordered
  • Centralized sequencing
  • Distributed sequencing, ISIS
  • Clocks
• Causally ordered
  • Vector timestamps
• Sync ordered

TOTALLY ORDERED: CENTRALIZED

• A single member, the sequencer, is responsible for allocating the sequence number to a message

• Before a message can be delivered a sequence number must be obtained from the sequencer

• Lower numbered messages are processed before higher numbered ones

• Members keep a record of the next sequence number expected (or the last one received) so that should an out-of-order one arrive, it can be held back until the correct one in the sequence arrives
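A minimal sketch of centralized sequencing with a hold-back step follows. The classes `Sequencer` and `Member` are hypothetical, and network delivery is replaced by direct method calls so that an out-of-order arrival can be staged by hand.

```python
import itertools


class Sequencer:
    """The single member that allocates sequence numbers."""

    def __init__(self):
        self._counter = itertools.count(1)

    def next_number(self):
        return next(self._counter)


class Member:
    """Delivers messages strictly in sequence-number order."""

    def __init__(self):
        self.next_expected = 1
        self.hold_back = {}    # out-of-order arrivals wait here
        self.delivered = []

    def receive(self, number, message):
        self.hold_back[number] = message
        # Release every message that is now in sequence.
        while self.next_expected in self.hold_back:
            self.delivered.append(self.hold_back.pop(self.next_expected))
            self.next_expected += 1


sequencer = Sequencer()
first = (sequencer.next_number(), "debit")
second = (sequencer.next_number(), "credit")

member = Member()
member.receive(*second)                  # arrives early, so it is held back
held = list(member.delivered)
member.receive(*first)                   # fills the gap, both are delivered
```

The sequencer is a single point of failure and a potential bottleneck, which is the motivation for distributed sequencing.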

TOTALLY ORDERED: CENTRALIZED/DISTRIBUTED

Incoming messages are held on a hold-back queue, where the final ordering is established before a message is moved to a stable queue for processing

TOTALLY ORDERED: DISTRIBUTED

• In ISIS each member records Fmax, the largest final number agreed, and Pmax, its own largest proposed number

• On receiving a message with a proposed number, each receiving member i responds to the initiator with its own proposed number, computed as Max(Fmax, Pmax) + 1 + i/n, where n is the number of members

• Each member will place the message with its own proposed number on its hold-back queue

• The initiator collects all the proposals from which it selects the largest number

• All members are then notified of this final number
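The proposal rule can be worked through in a short simulation. It is a sketch only: the class `IsisMember` and the function `multicast` are hypothetical, members are numbered 1 to n, message passing is replaced by direct calls, and exact fractions are used so that the i/n term is not blurred by floating-point rounding.

```python
from fractions import Fraction


class IsisMember:
    def __init__(self, index, group_size):
        self.index = index              # i
        self.n = group_size             # n
        self.f_max = Fraction(0)        # Fmax: largest final number agreed
        self.p_max = Fraction(0)        # Pmax: own largest proposed number
        self.hold_back = {}             # message id -> current number

    def propose(self, message_id):
        proposal = max(self.f_max, self.p_max) + 1 + Fraction(self.index, self.n)
        self.p_max = proposal
        self.hold_back[message_id] = proposal
        return proposal

    def finalize(self, message_id, final_number):
        self.f_max = max(self.f_max, final_number)
        self.hold_back[message_id] = final_number


def multicast(message_id, members):
    """The initiator collects the proposals and announces the largest."""
    final_number = max(member.propose(message_id) for member in members)
    for member in members:
        member.finalize(message_id, final_number)
    return final_number


members = [IsisMember(i, 3) for i in (1, 2, 3)]
first_final = multicast("m1", members)
second_final = multicast("m2", members)
```

In the first round the three proposals are 4/3, 5/3 and 2, so the final number is 2; the i/n term keeps proposals from different members distinct.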

CAUSALLY ORDERED

• Happened-before ordering
• Vector timestamps
  • Let replicas exist in different versions and ensure that these versions can be causally ordered
  • The version number is expressed as a vector with an entry for the number of messages received from each member

• For example, the timestamp (2, 4, 3, 1, 3) would indicate that there are 5 members in the group and the member holding this vector timestamp has received 2 messages from member 1, 4 from member 2, and so on

CAUSALLY ORDERED

• Two versions Vi and Vj are causally ordered if and only if each entry in Vi’s vector timestamp is less than or equal to the corresponding entry in Vj’s vector timestamp

• The multicast message must carry a vector timestamp provided by the initiating member

• The initiating member generates it by incrementing by 1 its own entry in the vector timestamp of the replica that it owns

• For example, if the replica owned by member 3 has the vector timestamp (2, 4, 3, 1, 3) and it initiates a multicast relating to this replica then it would update its timestamp to (2, 4, 4, 1, 3)

• Receiving members can use the arriving timestamp to establish causal order
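The comparison and update rules for vector timestamps are small enough to state directly in code. The function names `causally_precedes` and `stamp_multicast` are hypothetical; members are numbered from 1, as in the examples on the slides.

```python
def causally_precedes(vi, vj):
    """True if every entry of vi is <= the corresponding entry of vj."""
    return len(vi) == len(vj) and all(x <= y for x, y in zip(vi, vj))


def stamp_multicast(timestamp, member):
    """Timestamp carried by a multicast initiated by `member` (1-based)."""
    updated = list(timestamp)
    updated[member - 1] += 1
    return tuple(updated)


current = (2, 4, 3, 1, 3)
sent = stamp_multicast(current, member=3)   # member 3 initiates a multicast
other = (3, 4, 3, 1, 3)                     # a version held elsewhere
```

Here `sent` is `(2, 4, 4, 1, 3)`, matching the slide's example, and `current` causally precedes it. The versions `sent` and `other` cannot be ordered in either direction, so they are concurrent.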

SYNC-ORDERED

• Put everything “in-sync”
• A sync-ordered message m divides messages received into two mutually exclusive sets at all members: a set 1 of messages received before m and a set 2 of messages received after m

• Recall group view changes

PRIVACY AND SECURITY

• Protection
• Cryptography
• Secret Key Cryptography
• Public Key Cryptography
• Digital Signatures
• Kerberos and others

CRYPTOGRAPHY

[Figure: block diagram of cryptographic message transfer from A to B. Principal A feeds the key and plaintext to the encryption algorithm to produce ciphertext; Principal B feeds the ciphertext and key to the decryption algorithm to recover the plaintext.]

SECRET KEY DISTRIBUTION

Secret key authentication using a protocol derived from Needham-Schroeder (principals A and B, server S)

1. A -> S: A, B, NA
2. S -> A: KA{NA, B, KAB, KB{KAB, A, t}}
3. A -> B: KB{KAB, A, t}, KAB{NAA}
4. B -> A: KAB{NAA - 1}, NB
5. A -> B: KAB{NB - 1}

PUBLIC KEY AUTHENTICATION

Public key authentication protocol from Needham-Schroeder-Lowe (principals A and B, server S)

1. A -> S: A, B
2. S -> A: DKS{B, EKB, t}
3. A -> B: EKB{A, NA}
4. B -> S: B, A
5. S -> B: DKS{A, EKA, t}
6. B -> A: EKA{NA, NB, B}
7. A -> B: EKB{NB}

DIGITAL SIGNATURES

• Verification of electronic document
• Public key cryptography provides a simple mechanism for digital signatures
• Principal A can send a signed message M to principal B with two levels of encryption as follows: EKB{DKA{M}}
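The two levels of encryption in EKB{DKA{M}} can be traced with textbook RSA on tiny numbers. This is strictly a toy: the primes, exponents and the message value 1234 are arbitrary assumptions, there is no padding, and keys this small offer no security. The helper names `make_keypair` and `apply_key` are hypothetical, and `pow(e, -1, phi)` needs Python 3.8 or later.

```python
def make_keypair(p, q, e):
    """Return (public key, private key) as (exponent, modulus) pairs."""
    n = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)     # modular inverse of e
    return (e, n), (d, n)


def apply_key(key, value):
    """Apply a public or private key to an integer smaller than the modulus."""
    exponent, modulus = key
    if not 0 <= value < modulus:
        raise ValueError("value must be smaller than the modulus")
    return pow(value, exponent, modulus)


EKA, DKA = make_keypair(61, 53, 17)   # A's public and private keys
EKB, DKB = make_keypair(89, 97, 5)    # B's keys; B's modulus exceeds A's

M = 1234
signed = apply_key(DKA, M)            # DKA{M}: only A can produce this
sent = apply_key(EKB, signed)         # EKB{DKA{M}}: only B can open this

opened = apply_key(DKB, sent)         # B removes the outer layer
recovered = apply_key(EKA, opened)    # B checks the signature with A's public key
```

B recovering a meaningful M with A's public key is the evidence that A signed it; the outer layer with EKB keeps the message private to B.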

DIGITAL SIGNATURES

• The Message Digest
  • A message digest function MD transforms the variable length message M into a fixed-length bit string MD(M) called the message digest, such that
    • it is computationally infeasible to find two messages with the same message digest
    • given M it is easy to compute MD(M)
    • given MD(M) it is effectively impossible to generate M
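The fixed-length property is easy to observe with a standard hash function. SHA-256 from Python's `hashlib` is used here as one possible choice of MD; the slides do not name a particular function, and the wrapper name `message_digest` and the sample messages are hypothetical.

```python
import hashlib


def message_digest(message: bytes) -> str:
    """Return the SHA-256 digest of the message as a hex string."""
    return hashlib.sha256(message).hexdigest()


short_digest = message_digest(b"pay 10")
long_digest = message_digest(b"pay 10" * 1000)
```

Both digests are 64 hex characters (256 bits) even though the second message is a thousand times longer, and signing the digest rather than M itself keeps the signature small.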

PRIVACY AND SECURITY

• Kerberos
  • Key Distribution Centre (KDC)
  • Authentication Server (AS)
  • Ticket Granting Server (TGS)

• PGP
• PEM
• SSL
