Joel Crichlow, Distributed Systems: Computing over Networks, PHI: Managing Distributed Resources
Post on 19-Dec-2015
MANAGING DISTRIBUTED RESOURCES
ISSUES
• Naming and Addressing
• Sharing
• Availability and Reliability
• Replication
• Privacy and Security
NAMING AND ADDRESSING
• Identify
  • node/group/user
  • root-directory/sub-directory/filename
• Locate/Find
  • Location Independence
  • Mapping
  • Name Servers
NAME SERVERS
• Allocate the address translation responsibilities to a name server
• Users use symbolic names with which they interact with the client machines
• The clients communicate with a name server which does the name to address resolution
[Figure: a Client, a Name Server, and an Other server, with three numbered interactions (1, 2, 3)]
NAME SERVERS
• The name server may also be designed to answer reverse requests: the name of a resource/service given its address
• Performance
  • Table entries for critical resources may be held in nonvolatile primary storage
  • Caching at server, caching at clients
• Cooperating Name Servers
  • Replication, Partitioning
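The name-server ideas above (name-to-address resolution, address-to-name requests, and caching at clients) can be sketched as follows. This is a minimal illustration, not a real name service: the class names `NameServer` and `Client`, the method names, and the sample name and address are all hypothetical.

```python
from typing import Dict, Optional


class NameServer:
    """Toy name server mapping symbolic names to addresses (illustrative only)."""

    def __init__(self) -> None:
        self._by_name: Dict[str, str] = {}
        self._by_addr: Dict[str, str] = {}

    def register(self, name: str, address: str) -> None:
        self._by_name[name] = address
        self._by_addr[address] = name

    def resolve(self, name: str) -> Optional[str]:
        """Name-to-address resolution."""
        return self._by_name.get(name)

    def reverse(self, address: str) -> Optional[str]:
        """Address-to-name request."""
        return self._by_addr.get(address)


class Client:
    """Client that caches answers so repeated lookups skip the server."""

    def __init__(self, server: NameServer) -> None:
        self._server = server
        self._cache: Dict[str, str] = {}
        self.server_queries = 0  # counts round trips to the name server

    def lookup(self, name: str) -> Optional[str]:
        if name in self._cache:
            return self._cache[name]
        self.server_queries += 1
        address = self._server.resolve(name)
        if address is not None:
            self._cache[name] = address
        return address


ns = NameServer()
ns.register("printer", "10.0.0.7")  # hypothetical name and address
client = Client(ns)
first = client.lookup("printer")   # goes to the name server
second = client.lookup("printer")  # answered from the client cache
```

A real client cache would also need an expiry or invalidation policy, which this sketch omits.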
DOMAIN NAME SYSTEM
• Distributed Name Service
• Multi-level set of domains
• Partitioning
• Replication
• Caching
• IPv4 (32 bits), IPv6 (128 bits)
DNS: IPV4 ADDRESS FORMATS
Each address is 32 bits (four 8-bit fields).
Class   Leading bits   Remaining bits
A       0              Network (7 bits), Host (24 bits)
B       10             Network (14 bits), Host (16 bits)
C       110            Network (21 bits), Host (8 bits)
D       1110           Multicast address
E       1111           Reserved for future use
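The class of an IPv4 address follows from the leading bits of its first octet. The sketch below is illustrative only: the function name `ipv4_class` is hypothetical, and every address that is not class A to D is reported as class E.

```python
def ipv4_class(address: str) -> str:
    """Return the classful category (A-E) of a dotted-quad IPv4 address."""
    octets = [int(part) for part in address.split(".")]
    if len(octets) != 4 or any(not 0 <= o <= 255 for o in octets):
        raise ValueError(f"not a dotted-quad IPv4 address: {address!r}")
    first = octets[0]
    if first & 0b10000000 == 0:           # leading bit  0
        return "A"
    if first & 0b11000000 == 0b10000000:  # leading bits 10
        return "B"
    if first & 0b11100000 == 0b11000000:  # leading bits 110
        return "C"
    if first & 0b11110000 == 0b11100000:  # leading bits 1110 (multicast)
        return "D"
    return "E"                            # remaining patterns: reserved
```

For example, `ipv4_class("172.16.0.1")` inspects 172 (binary 10101100) and reports class B.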
DNS
• A slow but steady transition to IPv6 is taking place
• IPv6 is not interoperable with IPv4; therefore, a transition technology is needed
• Tunneling places IPv6 packets within IPv4 packets
• The Dual-stack implementation allows both protocols to run in the same network
[Figure: transition diagram with IPv6 (v6) and IPv4 (v4) network segments]
DIRECTORY SERVICE
• The name service can be incorporated into a more comprehensive directory service, which allows not only the locating of services and resources but also the supplying of information on people
• X.500, defined by CCITT and ISO, is a good early example of such a directory service
• Several other directory services exist
  • A notable example, based on X.500, is LDAP, the Lightweight Directory Access Protocol, which uses the TCP/IP stack
SHARING
• Access Control
• Scheduling
• Allocation
• Sharing Primary Memory
SHARING
Access Control List (ACL): per-resource list
ACL for Resource0
User class   Rights
Staff        RE
System       RWE
Student      R
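A per-resource ACL can be checked with a simple lookup. The sketch below encodes the Resource0 list shown above; it assumes R, W and E stand for read, write and execute, and the names `ACL_RESOURCE0` and `is_allowed` are hypothetical.

```python
from typing import Dict, Set

# ACL for Resource0 (assumed meaning: R = read, W = write, E = execute).
ACL_RESOURCE0: Dict[str, Set[str]] = {
    "Staff": set("RE"),
    "System": set("RWE"),
    "Student": set("R"),
}


def is_allowed(acl: Dict[str, Set[str]], user_class: str, right: str) -> bool:
    """Return True if user_class holds the requested right in this ACL."""
    # A class that is absent from the list has no rights at all.
    return right in acl.get(user_class, set())
```

Here `is_allowed(ACL_RESOURCE0, "Staff", "W")` is false because Staff holds only R and E.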
SHARING
Capability List (CL)
System Class CL   Resource0 CL
Resource0   Capability with RWE   Capability with RWE
Resource1   Capability with RE    Capability with RE
Resource2   Capability with E     Capability with E
SHARING
• Scheduling
  • Pool of identical resources
  • Only one resource
• Allocation
  • Local vs remote resources
  • Mutually exclusive access
  • Indefinite postponement
• Hardware
• Software
• Consistency
SHARING PRIMARY MEMORY
• Distributed Shared Memory
• Shareable Unit
  • Physical block
  • Logical block
• Synchronization
• Consistency
DISTRIBUTED SHARED MEMORY
Sequential vs Release Consistency
Process
begin
    a = 0
    b = 0
    a = a + 1
    b = b + 1
end
Process
begin
    acquire-lock(CS)
    a = a + 1
    b = b + 1
    release-lock(CS)
end
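The lock-protected process above can be imitated on a single machine with Python threads. This is only an analogy, since `threading.Lock` guards ordinary shared memory rather than a DSM, but it shows the acquire/update/release pattern that release consistency relies on: updates made inside the critical section need only be visible to others once the lock is released. The function name `process`, the thread count and the iteration count are arbitrary choices.

```python
import threading

a = 0
b = 0
cs = threading.Lock()  # plays the role of the lock named CS


def process(iterations: int) -> None:
    """Repeat the critical section from the second process."""
    global a, b
    for _ in range(iterations):
        with cs:  # acquire-lock(CS) on entry, release-lock(CS) on exit
            a = a + 1
            b = b + 1


threads = [threading.Thread(target=process, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the lock in place every increment is applied, so both counters end at 4 × 1000.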
PAGED DSM
[Figure: a process PMT entry with fields L/NL, S/NS, SSA/PM and page frame; the shared-page ID leads through the Page Manager's page table to the page in the shared paged global memory (DSM)]
Process PMT entries indicate: loaded/not-loaded (L/NL); shared/not-shared (S/NS); if not shared, the secondary storage address (SSA); if shared, the link to the DSM Page Manager (PM); and, if loaded, the page frame number in local memory.
LOGICAL DSM
• Linda
  • The ‘tuple-space’ model of parallel programming
  • It consists of two types of logical tuples: process tuples and data tuples
  • Process tuples are active and can execute; data tuples are passive
  • Process tuples can execute simultaneously
  • When a process tuple is finished executing, it turns into a data tuple
  • There are four basic primitives in Linda: out, in, rd, eval
• Orca
  • An object-based, language-based DSM
• Component Technologies and Java
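The four Linda primitives can be sketched in a few lines of Python. This is a toy, single-machine tuple space, not Linda itself: the class name `TupleSpace` is hypothetical, `in` and `eval` are spelled `in_` and `eval_` because of Python naming, `None` is assumed to act as a wildcard in patterns, and a thread stands in for a process tuple.

```python
import threading
from typing import Any, Callable, List, Optional, Tuple


class TupleSpace:
    """Toy tuple space with out, rd, in_ and eval_ (illustrative only)."""

    def __init__(self) -> None:
        self._tuples: List[Tuple[Any, ...]] = []
        self._cond = threading.Condition()

    def out(self, *fields: Any) -> None:
        """Add a passive data tuple to the space."""
        with self._cond:
            self._tuples.append(tuple(fields))
            self._cond.notify_all()

    def _find(self, pattern: Tuple[Any, ...]) -> Optional[Tuple[Any, ...]]:
        for candidate in self._tuples:
            if len(candidate) == len(pattern) and all(
                p is None or p == c for p, c in zip(pattern, candidate)
            ):
                return candidate
        return None

    def rd(self, *pattern: Any) -> Tuple[Any, ...]:
        """Block until a matching tuple exists, then return a copy of it."""
        with self._cond:
            while (found := self._find(pattern)) is None:
                self._cond.wait()
            return found

    def in_(self, *pattern: Any) -> Tuple[Any, ...]:
        """Block until a matching tuple exists, then remove and return it."""
        with self._cond:
            while (found := self._find(pattern)) is None:
                self._cond.wait()
            self._tuples.remove(found)
            return found

    def eval_(self, tag: str, func: Callable[..., Any], *args: Any) -> threading.Thread:
        """Start an active 'process tuple'; it becomes a data tuple when done."""
        worker = threading.Thread(target=lambda: self.out(tag, func(*args)))
        worker.start()
        return worker


space = TupleSpace()
space.out("point", 3, 4)
space.eval_("square", lambda x: x * x, 7)
read_copy = space.rd("point", None, None)  # tuple stays in the space
taken = space.in_("square", None)          # tuple is removed from the space
```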
AVAILABILITY AND RELIABILITY
• Performance
• Service Outcomes
• How Reachable
  • LAN
  • WAN
AVAILABILITY AND RELIABILITY
WAN
• The number of possible routes through the network between user and resource
• The channel capacity through the various communication links
• The communication protocols employed
AVAILABILITY AND RELIABILITY
Processor and Memory Upgrades
• Faster Processor
• More Memory
• Caches
• Secondary Memory
CACHING
• Locality principle
• Cache consistency
• Cacheable and non-cacheable data
• Write-through
• Copy-back
• Write-invalidate
• Write-update
• Snoopy cache
• Directory scheme
AVAILABILITY AND RELIABILITY
Software Design
[Figure: a SERVER with a request queue fed by three clients]
AVAILABILITY AND RELIABILITY
Databases
• Partitioning
• Replication
• Replicated Dictionary
• Queries and Sub-queries
Example request: Make a reservation for Dorothy Swift on a red sports car to be picked up in New York on (date and time given), a small hatchback to be picked up by Jill Plain in Los Angeles on (date and time given), and a station wagon for Jack Baggage in London on (date and time given).
AVAILABILITY AND RELIABILITY
• Find the relevant relations (or objects) quickly
  • A replicated dictionary is required
• Once the relations (objects) are located, a decision must be made quickly on what should be shipped
  • The request can be split into three queries
  • The sub-query processing is then done at the three sites
  • Alternatively, the pertinent records or pages could be shipped from the remote sites for processing at the initiator
AVAILABILITY AND RELIABILITY: MEMCACHED
• Distributed databases also form the backbone for many dynamic web-based applications
• A key approach to improving availability in such systems is to cache the recently referenced data into memory
• A noteworthy software tool that provides such a caching service is Memcached (memcached.org)
  • It is a free and open source distributed caching service that uses memory as a cache for data objects that are normally stored on the back-end database
  • It uses a key-value store with a hash table that can be distributed across many computers, a pool of servers
  • When the table is full, new arrivals are accommodated by removing old data based on LRU (Least Recently Used) order
  • The client uses the input key and a hashing algorithm to locate a server from among that client’s list of Memcached servers
  • That server then uses the key to store the key-value pair into its internal hash table
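The two mechanisms described above (client-side hashing to choose a server, and LRU eviction at each server) can be sketched as follows. This is not the Memcached API or its implementation: `CacheServer` and `CacheClient` are hypothetical names, SHA-256 with a modulo is an assumed stand-in for whatever hashing a real client library uses, and capacity is counted in entries rather than bytes.

```python
import hashlib
from collections import OrderedDict
from typing import List, Optional


class CacheServer:
    """One cache node: a bounded key-value table with LRU eviction."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._table: "OrderedDict[str, str]" = OrderedDict()

    def set(self, key: str, value: str) -> None:
        if key in self._table:
            self._table.move_to_end(key)     # refresh recency on overwrite
        self._table[key] = value
        if len(self._table) > self.capacity:
            self._table.popitem(last=False)  # evict the least recently used

    def get(self, key: str) -> Optional[str]:
        if key not in self._table:
            return None
        self._table.move_to_end(key)         # a hit makes the entry recent
        return self._table[key]


class CacheClient:
    """Client that hashes each key to pick one server from its own list."""

    def __init__(self, servers: List[CacheServer]) -> None:
        self._servers = servers

    def _pick(self, key: str) -> CacheServer:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % len(self._servers)
        return self._servers[index]

    def set(self, key: str, value: str) -> None:
        self._pick(key).set(key, value)

    def get(self, key: str) -> Optional[str]:
        return self._pick(key).get(key)


servers = [CacheServer(capacity=2) for _ in range(3)]
cache = CacheClient(servers)
cache.set("user:1", "Ada")  # hypothetical key and value
```

Because the client, not the servers, decides placement, the servers need no knowledge of one another; the trade-off is that simple modulo hashing remaps most keys when the server list changes.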
REPLICATION
Maintaining copies of resources at separate nodes in the network can:
• Improve the pattern of communication traffic
• Help load sharing
• Reduce response times
• Offer an alternative when a resource becomes unavailable
REPLICATION
• How many Copies
• Replicas as members of a Group
• Membership Service
  • CreateGroup
  • JoinGroup
  • LeaveGroup
  • A member may leave the group voluntarily or through failure
GROUP CHANGES
• A new member joining or a member leaving changes the composition of the group
  • If the membership changes during the multicast of a message, how should the outcome of the multicast be classified?
  • If the replicas were participating in transaction processing and a member failed, what happens next? What happens if the failed member was, in fact, the coordinator of the activity?
• It is necessary that group changes be known and that a group change be synchronized with other pertinent group activity
• In the ISIS toolkit, group changes are handled by the maintenance of group views
  • A view captures the current membership list and bears a unique identifier (i.e. a sequence number)
  • A group view i+1 differs from its immediate predecessor i either by the addition of a new member or by the departure of a member, voluntarily or through failure
• Group activity, e.g. message passing, can then be associated with a particular view
  • If the view changes before an activity is complete, a decision can be made with respect to the outcome of that activity
  • Coordinating view changes is primarily a message delivery issue
RELIABILITY OF MESSAGE DELIVERY
• Unreliable multicast
  • Deliver the message to all the members of the group without acknowledgement
• Reliable multicast
  • Ensure that some (if not all) members of the group receive the message
• Atomic multicast
  • All operational members in the group receive the message or none of them do
RELIABILITY OF MESSAGE DELIVERY
• Any member of the group can fail during the multicast
• What if the originator fails?
  • The originator must be monitored to determine the failure
  • An effective way to do this is to require that the originator multicast an “I’m alive” message periodically
  • On determining the failure, another member must assume the role of originator and attempt to complete the multicast
• An election algorithm is invoked when a member of the group concludes that the leader has failed
• The Bully Algorithm can be used
  • The biggest bloke on the block always wins
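The outcome of a Bully election can be sketched with a synchronous simulation. This is a simplification under stated assumptions: members are identified by integers, the `alive` map stands in for timeouts and "I'm alive" monitoring, and message exchange is replaced by a recursive call in which a higher operational member takes over the election. The function name `bully_election` is hypothetical.

```python
from typing import Dict


def bully_election(alive: Dict[int, bool], starter: int) -> int:
    """Return the ID of the member that ends up as leader.

    alive maps member ID -> whether that member is operational.
    """
    if not alive.get(starter, False):
        raise ValueError("the member starting the election must be operational")
    higher_alive = [m for m in alive if m > starter and alive[m]]
    if not higher_alive:
        # Nobody bigger answered: the starter announces itself as leader.
        return starter
    # A bigger member answered and takes over the election.
    return bully_election(alive, min(higher_alive))
```

For instance, with members 1 to 5 where member 5 (the old leader) has failed, an election started by member 2 ends with member 4 as leader.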
MESSAGE ORDERING
• Unordered
• Totally ordered
  • Centralized sequencing
  • Distributed sequencing, ISIS
  • Clocks
• Causally ordered
  • Vector timestamps
• Sync ordered
TOTALLY ORDERED: CENTRALIZED
• A single member, the sequencer, is responsible for allocating the sequence number to a message
• Before a message can be delivered a sequence number must be obtained from the sequencer
• Lower numbered messages are processed before higher numbered ones
• Members keep a record of the next sequence number expected (or the last one received) so that should an out-of-order one arrive, it can be held back until the correct one in the sequence arrives
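Centralized sequencing with a hold-back queue can be sketched as follows. The class names `Sequencer` and `Member` are hypothetical, numbering is assumed to start at 0, and network transport is replaced by direct method calls.

```python
from typing import Dict, List


class Sequencer:
    """The single member that allocates sequence numbers."""

    def __init__(self) -> None:
        self._next = 0

    def next_number(self) -> int:
        number = self._next
        self._next += 1
        return number


class Member:
    """Group member that delivers messages strictly in sequence order."""

    def __init__(self) -> None:
        self.expected = 0                    # next sequence number expected
        self.hold_back: Dict[int, str] = {}  # out-of-order arrivals wait here
        self.delivered: List[str] = []

    def receive(self, seq: int, message: str) -> None:
        self.hold_back[seq] = message
        # Deliver every message that is now next in the sequence.
        while self.expected in self.hold_back:
            self.delivered.append(self.hold_back.pop(self.expected))
            self.expected += 1


sequencer = Sequencer()
numbered = [(sequencer.next_number(), text) for text in ("a", "b", "c")]
member = Member()
member.receive(*numbered[2])  # arrives early, so it is held back
held_only = list(member.delivered)
member.receive(*numbered[0])
member.receive(*numbered[1])  # releases "b" and then the held "c"
```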
TOTALLY ORDERED: CENTRALIZED/DISTRIBUTED
Incoming messages are held on a hold-back queue, where the final ordering is established before a message is moved to a stable queue for processing
TOTALLY ORDERED: DISTRIBUTED
• In ISIS each member records Fmax, the largest final number agreed, and Pmax, its own largest proposed number
• On receiving a message with a proposed number, each receiving member i responds to the initiator with its own proposed number, computed as Max(Fmax, Pmax) + 1 + i/n, where n is the number of members
• Each member will place the message with its own proposed number on its hold-back queue
• The initiator collects all the proposals from which it selects the largest number
• All members are then notified of this final number
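The proposal and agreement steps can be sketched with exact fractions. This is an illustration under assumptions, not the ISIS implementation: members are assumed to be indexed i = 0 .. n-1 so that the i/n term stays below 1 and only breaks ties, Fmax and Pmax are assumed to start at 0, and the names `IsisMember` and `agree_on_number` are hypothetical. Message passing and the hold-back queue are omitted.

```python
from fractions import Fraction
from typing import List


class IsisMember:
    def __init__(self, index: int, group_size: int) -> None:
        self.index = index            # i, assumed to lie in 0 .. n-1
        self.group_size = group_size  # n
        self.f_max = Fraction(0)      # largest final number agreed so far
        self.p_max = Fraction(0)      # largest number this member proposed

    def propose(self) -> Fraction:
        """Compute Max(Fmax, Pmax) + 1 + i/n and remember it as Pmax."""
        self.p_max = (
            max(self.f_max, self.p_max) + 1 + Fraction(self.index, self.group_size)
        )
        return self.p_max

    def accept_final(self, final: Fraction) -> None:
        self.f_max = max(self.f_max, final)


def agree_on_number(members: List[IsisMember]) -> Fraction:
    """Initiator's view: collect proposals, pick the largest, notify all."""
    proposals = [m.propose() for m in members]
    final = max(proposals)
    for m in members:
        m.accept_final(final)
    return final


members = [IsisMember(i, 3) for i in range(3)]
first_final = agree_on_number(members)   # proposals 1, 4/3, 5/3
second_final = agree_on_number(members)  # proposals 8/3, 3, 10/3
```

The i/n term makes proposals from different members distinct, so the largest proposal is never ambiguous.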
CAUSALLY ORDERED
• Happened-before ordering
• Vector timestamps
  • Let replicas exist in different versions and ensure that these versions can be causally ordered
  • The version number is expressed as a vector with an entry for the number of messages received from each member
  • For example, the timestamp (2, 4, 3, 1, 3) would indicate that there are 5 members in the group and the member holding this vector timestamp has received 2 messages from member 1, 4 from member 2, and so on
CAUSALLY ORDERED
• Two versions Vi and Vj are causally ordered if and only if each entry in Vi’s vector timestamp is less than or equal to the corresponding entry in Vj’s vector timestamp
• The multicast message must carry a vector timestamp provided by the initiating member
• The initiating member generates it by incrementing by 1 its own entry in the vector timestamp of the replica that it owns
• For example, if the replica owned by member 3 has the vector timestamp (2, 4, 3, 1, 3) and it initiates a multicast relating to this replica then it would update its timestamp to (2, 4, 4, 1, 3)
• Receiving members can use the arriving timestamp to establish causal order
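The vector timestamp rules above translate directly into code. In this sketch the function names `causally_precedes` and `stamp_for_multicast` are hypothetical, and members are numbered from 1 as in the slide example.

```python
from typing import Tuple

Timestamp = Tuple[int, ...]


def causally_precedes(vi: Timestamp, vj: Timestamp) -> bool:
    """True if every entry of vi is <= the corresponding entry of vj."""
    if len(vi) != len(vj):
        raise ValueError("timestamps must cover the same group")
    return all(x <= y for x, y in zip(vi, vj))


def stamp_for_multicast(current: Timestamp, member: int) -> Timestamp:
    """Increment the initiating member's own entry (member is 1-based)."""
    updated = list(current)
    updated[member - 1] += 1
    return tuple(updated)


before = (2, 4, 3, 1, 3)
after = stamp_for_multicast(before, member=3)  # member 3 initiates a multicast
other = (3, 4, 3, 1, 3)                        # a version updated elsewhere
```

Here `before` precedes `after`, while `after` and `other` are concurrent: neither is entry-wise less than or equal to the other.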
SYNC-ORDERED
• Put everything “in-sync”
• A sync-ordered message m divides messages received into two mutually exclusive sets at all members: a set 1 of messages received before m and a set 2 of messages received after m
• Recall group view changes
PRIVACY AND SECURITY
• Protection
• Cryptography
• Secret Key Cryptography
• Public Key Cryptography
• Digital Signatures
• Kerberos and others
CRYPTOGRAPHY
[Figure: block diagram of cryptographic message transfer from A to B. Principal A feeds the key and plaintext to the encryption algorithm, producing ciphertext; Principal B feeds the ciphertext and the key to the decryption algorithm, recovering the plaintext.]
SECRET KEY DISTRIBUTION
Secret key authentication using a protocol derived from Needham-Schroeder (principals A and B, server S)
1. A → S: A, B, NA
2. S → A: KA{NA, B, KAB, KB{KAB, A, t}}
3. A → B: KB{KAB, A, t}, KAB{NAA}
4. B → A: KAB{NAA - 1}, NB
5. A → B: KAB{NB - 1}
PUBLIC KEY AUTHENTICATION
Public key authentication protocol from Needham-Schroeder-Lowe (principals A and B, server S)
1. A → S: A, B
2. S → A: DKS{B, EKB, t}
3. A → B: EKB{A, NA}
4. B → S: B, A
5. S → B: DKS{A, EKA, t}
6. B → A: EKA{NA, NB, B}
7. A → B: EKB{NB}
DIGITAL SIGNATURES
• Verification of electronic documents
• Public key cryptography provides a simple mechanism for digital signatures
• Principal A can send a signed message M to principal B with two levels of encryption as follows: EKB{DKA{M}}
DIGITAL SIGNATURES
• The Message Digest
• A message digest function MD transforms the variable length message M into a fixed-length bit string MD(M) called the message digest, such that
  • it is computationally infeasible to find two messages with the same message digest
  • given M it is easy to compute MD(M)
  • given MD(M) it is effectively impossible to generate M
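The fixed-length property is easy to observe with a standard hash function. The sketch below uses SHA-256 from Python's `hashlib` as an assumed choice of MD; the slides do not name a particular function, and the messages are made up.

```python
import hashlib


def message_digest(message: bytes) -> str:
    """Return MD(M) as a hexadecimal string (SHA-256 is an assumed choice)."""
    return hashlib.sha256(message).hexdigest()


short_digest = message_digest(b"pay Bob 10")
long_digest = message_digest(b"pay Bob 10" * 1000)
```

Both digests are 64 hexadecimal characters (256 bits) even though the inputs differ in length by a factor of 1000, and the same input always yields the same digest.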
PRIVACY AND SECURITY
• Kerberos
  • Key Distribution Centre (KDC)
  • Authentication Server (AS)
  • Ticket Granting Server (TGS)
• PGP
• PEM
• SSL