distributed computation

Lucas Shen, Aug 5, 2014. A presentation on "A Note on Distributed Computing" by Jim Waldo, Geoff Wyant, Ann Wollrath, Sam Kendall


DESCRIPTION

Presentation on the paper.

TRANSCRIPT

Page 1: Distributed computation

Lucas Shen Aug/5/2014

A Note on Distributed Computing, by Jim Waldo, Geoff Wyant, Ann Wollrath, Sam Kendall

Page 2: Distributed computation

✤ Why this subject? For who?

✤ Terminology

✤ Unified vision

✤ What’s the problem?

✤ Example: NFS @Sun

✤ Conclusion

Page 3: Distributed computation

Why this subject? For who?

[Diagram: the landscape an app faces today: cloud providers (Google, Azure, Amazon, Dropbox), IaaS and SaaS offerings, simple instances, CPU and GPU clusters, Hadoop, Spark; relevant to both designers and programmers.]

Page 4: Distributed computation

Terminology

<Local computing> programs are confined to a single address space.

<Distributed computing> programs make calls into other address spaces, possibly even on another machine.
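To make the terminology concrete, here is a minimal Python sketch (not from the paper; the helper names are made up for illustration): a local call confined to one address space versus the same computation carried out in another address space, where arguments and results must cross a process boundary in serialized form.

```python
import subprocess
import sys

def add_local(a, b):
    # Local computing: a plain call, confined to one address space.
    return a + b

def add_remote(a, b):
    # Distributed computing in miniature: the work happens in another
    # address space (a child process), so arguments and results must
    # cross the process boundary in a serialized form (text here).
    result = subprocess.run(
        [sys.executable, "-c", f"print({a} + {b})"],
        capture_output=True, text=True, check=True,
    )
    return int(result.stdout)

print(add_local(2, 3))   # 5
print(add_remote(2, 3))  # 5, but computed in a separate address space
```

The two calls return the same answer, which is exactly the "unified vision" the paper challenges: the boundary crossing is invisible in the interface.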

Page 5: Distributed computation

Unified Vision

from the programmer’s point of view, there is no essential distinction between objects that share an address space and objects that are on two machines with different architectures located on different continents.

Page 6: Distributed computation

How?

1. write the application without worrying about where objects are located and how their communication is implemented.

2. tune performance by “concretizing” object locations and communication methods.

3. test with “real bullets” (e.g., networks being partitioned, machines going down)

Page 7: Distributed computation

Advantages of doing so:

✤ Changes can be made at any granularity, from the entire system down to an individual object.

✤ As long as the interfaces between objects remain constant, the implementations of those objects can be altered at will.

✤ An object can be repaired and the repair installed without worrying that the change will affect the other objects that make up the system.

Page 8: Distributed computation

Based on what beliefs?

1. there is a single natural object-oriented design for a given application, regardless of the context in which that application will be deployed

2. failure and performance issues are tied to the implementation of the components of an application, and consideration of these issues should be left out of an initial design

3. the interface of an object is independent of the context in which that object is used.

Page 9: Distributed computation


What’s wrong?

✤ Local and distributed computing are very different. You should take the differences into account from the very beginning.

✤ You? Who?

Page 10: Distributed computation

Stop avoiding problems

Designer vs Programmer

The danger lies in promoting the myth that “remote access and local access are exactly the same” and not enforcing the myth.

Page 11: Distributed computation

Differences

✤ Latency

✤ Memory Access

✤ Partial failure

Page 12: Distributed computation

Latency

✤ local vs remote object invocation: a difference of four to five orders of magnitude in latency

✤ the designer must decide which objects should be local and which could be remote

✤ two solutions:

1. Ignore the issue and hope that hardware advances will make the difference irrelevant

2. Build tools that let one see the pattern of communication between the objects that make up an application, then tune the system accordingly
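The size of that gap can be felt with a toy Python sketch; the 10 ms round trip is a made-up stand-in for a real network, not a measurement, and the function names are hypothetical:

```python
import time

def local_get():
    # An in-process call: typically tens of nanoseconds to microseconds.
    return 42

def remote_get(simulated_rtt=0.01):
    # A pretend remote call: sleep stands in for a ~10 ms network round trip.
    time.sleep(simulated_rtt)
    return 42

start = time.perf_counter()
for _ in range(1000):
    local_get()
local_elapsed = time.perf_counter() - start

start = time.perf_counter()
remote_get()
remote_elapsed = time.perf_counter() - start

# A single simulated remote call outweighs a thousand local ones.
print(remote_elapsed > local_elapsed)  # True
```

This is why the placement of objects, local or remote, cannot be treated as a mere tuning detail.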

Page 13: Distributed computation

Memory access

✤ pointers: a pointer that is valid in the local address space is meaningless in another address space

✤ two choices:

1. all memory access is mediated by an underlying system, such as distributed shared memory

2. the programmer must be aware of the different kinds of memory access


Page 14: Distributed computation

Partial failure

✤ Component failures are common, not exceptional

✤ there is no common agent that can determine which component has failed and inform the others of that failure

✤ since there is no global state in a distributed system, how do we detect failures and recover from them quickly?
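The resulting ambiguity can be sketched in Python; `FlakyServer` and its `deposit` call are invented for illustration. The client cannot distinguish "server failed before doing the work" from "server did the work but the reply was lost", so its only recourse, retrying, can double-apply a non-idempotent operation.

```python
class FlakyServer:
    """Toy server whose reply can be lost after the work is done."""
    def __init__(self):
        self.balance = 0
        self.fail_once = True

    def deposit(self, amount):
        self.balance += amount                   # the side effect happens...
        if self.fail_once:
            self.fail_once = False
            raise ConnectionError("reply lost")  # ...but the reply never arrives

server = FlakyServer()

def deposit_with_retry(amount, retries=2):
    # From the client's side, a lost reply looks identical to a lost request.
    for _ in range(retries):
        try:
            server.deposit(amount)
            return
        except ConnectionError:
            continue

deposit_with_retry(10)
print(server.balance)  # 20: the retry applied the deposit a second time
```

A purely local call never fails halfway like this, which is why interfaces designed as if everything were local cannot express partial failure.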

Page 15: Distributed computation

Two paths

1. design interfaces of objects as if they were all local

✤ fragile and not robust in any sense

2. design interfaces as if they were all remote

✤ worst case scenario

✤ introduces unnecessary guarantees for objects that are never intended to be used remotely

why so hard?

A distributed system has no single point of resource allocation, synchronization, or failure recovery, and is thus conceptually very different.

Compare GFS, with its single master node, to a fully distributed design.

Page 16: Distributed computation

Lesson learned : NFS@Sun

✤ NFS: Sun’s distributed file system

✤ Designers were unwilling to change the interface to the file system to reflect the distributed nature of file access.

✤ an example of a non-distributed API (open, read, write, close) reimplemented in a distributed way

Page 17: Distributed computation

Soft mount: NFS@Sun

✤ exposes network or server failures to the client program: read and write operations return a failure status much more often than in the single-system case

✤ programs written with no allowance for these failures can easily corrupt the files used by the program.
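That failure mode can be sketched in Python; `SoftMountedFile` is an invented stand-in (real NFS errors surface through the OS file API), but the pattern of swallowed write errors is the same:

```python
class SoftMountedFile:
    """Toy stand-in for a file on a soft-mounted NFS volume:
    writes start failing once the server becomes unreachable."""
    def __init__(self, writes_until_outage=2):
        self.data = []
        self.writes_until_outage = writes_until_outage

    def write(self, chunk):
        if self.writes_until_outage == 0:
            raise OSError("NFS server not responding")
        self.writes_until_outage -= 1
        self.data.append(chunk)

def careless_copy(chunks, f):
    # A program "written with no allowance for these failures":
    # errors are swallowed, so the file silently ends up truncated.
    for chunk in chunks:
        try:
            f.write(chunk)
        except OSError:
            pass  # pretend nothing happened

f = SoftMountedFile()
careless_copy(["a", "b", "c", "d"], f)
print("".join(f.data))  # "ab": the last two chunks were silently dropped
```

A program written against the single-system API has no habit of checking these errors, so the corruption goes unnoticed until the file is read back.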

Page 18: Distributed computation

Hard mount: NFS@Sun

✤ the application hangs until the server comes back up

✤ one server crashes, and many workstations—even those apparently having nothing to do with that server—freeze

Page 19: Distributed computation

why?

✤ The limitations on the reliability and robustness of NFS are not due to the implementation of the parts of that system.

✤ In NFS, an interface designed for non-distributed computing, where partial failure is not possible, was reimplemented in a distributed setting.

✤ These limitations on robustness have in turn limited the scalability of NFS.

Page 20: Distributed computation

conclusion (knowing the difference is the start of advancement), as of 1994

✤ They are different, and you should take the differences seriously.

✤ Be conscious of those differences at all stages of the design and implementation of distributed applications.

✤ Organizations: can allocate research and engineering resources more wisely. Rather than spending those resources trying to paper over the differences between the two kinds of computing, they can be directed at improving the performance and reliability of each.

✤ Engineers: have to know whether they are sending messages to local or remote objects, and must access those objects differently.

✤ As users of today's cloud services, we find they work pretty well. But if we want to build a private cloud or a cluster in the garage, we need to handle these details ourselves.

Page 21: Distributed computation


Thanks for your time