

Heterogeneous parallel and distributed computing

V.S. Sunderam a,*, G.A. Geist b

a Department of Mathematics and Computer Science, Emory University, Atlanta, GA 30322, USA
b Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA

Abstract

Heterogeneous network-based distributed and parallel computing is gaining increasing acceptance as an alternative or complementary paradigm to multiprocessor-based parallel processing as well as to conventional supercomputing. While algorithmic and programming aspects of heterogeneous concurrent computing are similar to their parallel processing counterparts, system issues, partitioning and scheduling, and performance aspects are significantly different. In this paper, we discuss the evolution of heterogeneous concurrent computing, in the context of the parallel virtual machine (PVM) system, a widely adopted software system for network computing. In particular, we highlight the system level infrastructures that are required, aspects of parallel algorithm development that most affect performance, system capabilities and limitations, and tools and methodologies for effective computing in heterogeneous networked environments. We also present recent developments and experiences in the PVM project, and comment on ongoing and future work. © 1999 Elsevier Science B.V. All rights reserved.

Keywords: Heterogeneous computing; Networked computing; Cluster computing; Message passing interface (MPI); Parallel virtual machine (PVM); NAS parallel benchmark; Parallel I/O; Meta computing

1. Introduction

We discuss parallel and distributed computing in networked heterogeneous environments. As used in this paper, these terms, as well as "concurrent" computing, refer to the simultaneous execution of the components of a single application on multiple processing elements. While this definition might also apply to most other



notions of parallel processing, we make a deliberate distinction, to highlight certain attributes of the methodologies and systems discussed herein – namely, loose coupling, physical and logical independence of the processing elements, and heterogeneity. These characteristics distinguish heterogeneous concurrent computing from traditional parallel processing, normally performed on homogeneous, tightly coupled platforms which possess some degree of physical independence but are logically coherent.

Concurrent computing, in various forms, is becoming increasingly popular as a methodology for many classes of applications, particularly those in the high-performance and scientific computing arenas. This is due to numerous benefits that accrue, both from the applications as well as the systems perspectives. However, in order to fully exploit these advantages, a substantial framework is required – in the form of novel programming paradigms and models, systems support, toolkits, and performance analysis and enhancement mechanisms. In this paper, we focus on the latter aspects, namely the systems infrastructures, functionality, and performance issues in concurrent computing.

1.1. Heterogeneous, networked, and cluster computing

One of the major goals of concurrent computing systems is to support heterogeneity. Heterogeneous computing refers to architectures, models, systems, and applications that comprise substantively different components, as well as to techniques and methodologies that address issues that arise when computing in heterogeneous environments. While this definition encompasses numerous systems, including reconfigurable architectures, mixed-mode arithmetic, special purpose hardware, and even vector and input–output units, we restrict ourselves to systems that are comprised of networked, independent, general purpose computers that may be used in a coherent and unified manner. Thus, heterogeneous systems may consist of scalar, vector, parallel, and graphics machines that are interconnected by one or more (types of) networks, and support one or more programming environment/operating system. In such environments, heterogeneity occurs in several forms:
· System architecture – heterogeneous systems may consist of SIMD, MIMD, scalar, and vector computers.
· Machine architecture – individual processing elements may differ in their instruction sets and/or data representation.
· Machine configurations – even when processing elements are architecturally identical, differences such as clock speeds and memory contribute to heterogeneity.
· External influences – as heterogeneous systems are normally built in general purpose environments, external resource demands can (and often do) induce heterogeneity into processing elements that are identical in architecture and configuration, and further, cause dynamic variations in interconnection network capacity.
· Interconnection networks – may be optical or electrical, local or wide-area, high or low speed, and may employ several different protocols.



· Software – from the infrastructure point of view, the underlying operating systems are often different in heterogeneous systems; from the applications point of view, in addition to operating systems aspects, different programming models, languages, and support libraries are available in heterogeneous systems.
Research in heterogeneous systems is in progress in several areas [1,2] including applications, paradigm development, mapping, scheduling, reconfiguration, etc., but the primary thrust has thus far been in systems, methodologies, and toolkits [22]. This latter thrust has been highly productive and successful, with several systems in production-level use at hundreds of installations worldwide. The body of this paper will discuss the parallel virtual machine (PVM) system that has evolved into a popular and effective methodology for heterogeneous concurrent computing.

1.2. Applications perspective

From the point of view of application development, heterogeneous computing is attractive, since it inherently supports function parallelism, with the added potential of executing subtasks on best-suited architectures. It is well known that different types of algorithms are well matched to different machine architectures and configurations, and at least in the abstract sense, heterogeneous computing permits this matching to be realized, resulting in optimality in application execution as well as in resource utilization. However, in practice, this scenario may be difficult to achieve for reasons of availability, applicability, and the existence of appropriate mapping and scheduling tools. Nevertheless, the concept is an attractive one and several research efforts are in progress in this area [3,4].

In this respect, many classes of applications that would benefit substantively from heterogeneous computing have been identified. For example, a critically important problem which is ideally suited to heterogeneous computing is global climate modeling. Simulation of the global climate is a particularly difficult challenge because of the wide range of time and space scales governing the behavior of the atmosphere, the oceans, and the surface. Parallel GCM codes require distinct component modules representing the atmosphere, ocean and surface, and process modules representing phenomena like radiation and convection. Sampling, updating and manipulating this data requires scalar, vector, MIMD and SIMD paradigms, many of which can be performed concurrently. Another application domain that could exploit heterogeneous computing is computer vision. Vision problems generally require processing at three levels: high, medium and low. Low-level and some medium-level vision tasks often involve regular data flow and iconic operations. This type of computation is well matched to mesh-connected SIMD machines. Medium-grained MIMD machines are more suitable for various high-level and some medium-level vision tasks which are communication-intensive and in which the flow of data is not regular. Coarse-grained MIMD machines are best matched for high-level vision tasks such as image understanding/recognition and symbolic processing.

As previously mentioned, however, the above aspect of heterogeneous concurrent computing is still in its infancy. Proof-of-concept research and experiments have



demonstrated the viability of exploiting application heterogeneity, and many others are evolving. On the other hand, the systems aspect has matured significantly, to the extent that robust environments are now available for production execution of traditional parallel applications while providing stable testbeds for the evolving, truly heterogeneous, applications [12,13]. We discuss the systems facet of heterogeneous concurrent computing in the remainder of the paper.

2. The historical perspective

2.1. Heterogeneous concurrent computing systems

Heterogeneous computing systems [5,6] evolved in the late 1980s and shared some common goals and requirements:
· To effectively provide access to significant amounts of computing resources in a cost-effective manner, usually by utilizing already available resources.
· To exploit the existing software infrastructure and facilities (e.g., editors, compilers, debuggers) that are available on individual computer systems in a cluster.
· To provide an effective programming model and interface, generally based on explicit parallelism and the message passing paradigm.
· To support transparency in terms of architecture, processor type, task location, network communication, and resource allocation.
· To achieve the best possible performance, subject to the inherent limitations of the processors and networks involved; some systems also attempt to be non-intrusive by suspending execution in deference to higher priority activities.
Several of the above goals were met, at least by the most popular network-computing/heterogeneous processing systems. Other goals, such as exploiting heterogeneity, sophisticated job and resource management, automatic parallelization, and graphical interfaces are still being pursued. Since then, PVM gained substantially in popularity, and the MPI standard evolved in the mid 1990s – both are still in widespread use. We briefly outline some of the earlier systems and their salient features before discussing PVM in depth, and commenting on the latest trends in metacomputing.

2.2. The Linda model and system

Linda [7] is a concurrent programming model that has evolved from a Yale University research project. The primary concept in Linda is that of a "tuple space", an abstraction via which cooperating processes communicate. This central theme of Linda has been proposed as an alternative paradigm to the two traditional methods of parallel processing, namely, that based on shared-memory, and on message passing. The tuple space concept is essentially an abstraction of distributed shared-memory, with one important difference (tuple spaces are associative), and several minor distinctions (destructive and non-destructive reads, and different coherency semantics are possible). Applications use the Linda model by embedding explicitly,

1702 V.S. Sunderam, G.A. Geist / Parallel Computing 25 (1999) 1699±1721

Page 5: Heterogeneous parallel and distributed computing

within cooperating sequential programs, constructs that manipulate (insert/retrieve tuples) the tuple space.

From the application point of view, Linda [8] is a set of programming language extensions for facilitating parallel programming. The Linda model is a scheme built upon an associative memory referred to as tuple space. It provides a shared-memory abstraction for process communication without requiring the underlying hardware to physically share memory. Tuples are collections of fields logically "welded" to form persistent storage items. They are the basic tuple space storage units. Parallel processes exchange data by generating, reading, and consuming them. To update a tuple, the tuple is removed from tuple space, modified, and returned to tuple space. Restricting tuple space modification in this manner creates an implicit locking mechanism ensuring proper synchronization of multiple accesses.

The "Linda system" usually refers to a specific (sometimes portable) implementation of software that supports the Linda programming model. System software is provided that establishes and maintains tuple spaces, and that is used in conjunction with libraries that appropriately interpret and execute Linda primitives. Depending on the environment (shared-memory multiprocessors, message passing parallel computers, networks of workstations, etc.), the tuple space mechanism is implemented using different techniques, and with varying degrees of efficiency. Recently, a new system technique has been proposed, at least nominally related to the Linda project. This scheme, termed "Piranha", proposes a proactive approach to concurrent computing – the idea being that computational resources (viewed as active agents) seize computational tasks from a well-known location based on availability and suitability. Again, this scheme may be implemented on multiple platforms, and manifested as a "Piranha system" or "Linda–Piranha system".

2.3. P4 and Parmacs

P4 is a library of macros and subroutines developed at Argonne National Laboratory for programming a variety of parallel machines. The P4 system [9] supports both the shared-memory model (based on monitors) and the distributed-memory model (using message passing). For the shared-memory model of parallel computation, P4 provides a set of primitives from which monitors can be constructed, as well as a set of useful monitors. For the distributed-memory model, P4 provides typed send and receive operations, and creation of processes according to a text file describing group and process structure.

Process management in the P4 system is based on a configuration file that specifies the host pool, the object file to be executed on each machine, the number of processes to be started on each host (intended primarily for multiprocessor systems), and other auxiliary information. Two issues are noteworthy in regard to the process management mechanism in P4. First, there is the notion of a "master" process and "slave" processes, and multilevel hierarchies may be formed to implement what is termed a cluster model of computation. Second, the primary mode of process creation is static, via the configuration file; dynamic process creation is

V.S. Sunderam, G.A. Geist / Parallel Computing 25 (1999) 1699±1721 1703

Page 6: Heterogeneous parallel and distributed computing

possible only by a statically created process that must invoke a special P4 function that spawns a new process on the local machine. However, despite these restrictions, a variety of application paradigms may be implemented in the P4 system in a fairly straightforward manner.

Message passing in the P4 system is achieved through the use of traditional send and recv primitives, parameterized almost exactly as in other message passing systems. Several variants are provided for semantics such as heterogeneous exchange, and blocking or non-blocking transfer. A significant proportion of the burden of buffer allocation and management, however, is left to the user. Apart from basic message passing, P4 also offers a variety of global operations, including broadcast, global maxima and minima, and barrier synchronization. Shared-memory support via monitors is a facility that distinguishes P4 from other systems. However, this feature is not distributed shared-memory, but rather a portable mechanism for shared address space programming on true shared-memory multiprocessors.

Parmacs is a project that is closely related to the P4 effort. Essentially, Parmacs is a set of macro extensions to the P4 system developed at GMD [10]. It originated in an effort to provide FORTRAN interfaces to the P4 system, but is now a significantly enhanced package that provides a variety of high-level abstractions, mostly dealing with global operations. Parmacs provides macros for logically configuring a set of P4 processes; for example, the macro torus produces a suitable configuration file for use by P4 that results in a logical process configuration corresponding to a 3-d torus. Other logical topologies, including general graphs, may also be implemented, and Parmacs provides macros used in conjunction with send and recv to achieve topology-specific communications within executing programs.

2.4. Message passing interface (MPI)

In 1992 a group of about 30 people from universities, government laboratories, and industry began meeting to specify a message passing interface. It was felt that the definition of a message passing standard provides vendors with a clearly defined base set of routines that they can implement efficiently, or in some cases provide hardware support for, thereby enhancing performance. In 1994 the MPI-1 specification was published, defining 128 functions divided into five categories: point-to-point communication, collective communication, groups and context, process topologies, and the profiling interface.

While MPI-1 defined a message passing API, it was not portable across heterogeneous clusters of computers because MPI-1 defined no standard way to start processes. Thus, in 1995 the MPI forum began to meet again to define MPI-2. MPI-2 specified an additional 200 functions in several new areas: I/O, one-sided communication, process spawning, and extended collective operations. The MPI-2 specification was published in 1997.

The goal of the MPI specification is to develop a widely used standard for writing message passing programs. MPI-1 is widely supported by the parallel computer vendors, and work has begun to implement MPI-2 functions.
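As an illustration of the MPI-1 point-to-point routines mentioned above, the following minimal sketch (not drawn from the standard or from this paper; the tag value is arbitrary) has rank 0 send one integer to rank 1. How the two processes are launched is left to the particular implementation, as noted above.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);      /* tag 0, to rank 1 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}

The sketch assumes at least two processes; anything beyond the send/receive pair (collective operations, topologies, derived datatypes) follows the same pattern of explicit library calls.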



3. The PVM system

3.1. PVM overview

PVM is a software system that permits the utilization of a heterogeneous network of parallel and serial computers as a unified, general, and flexible concurrent computational resource. The PVM system [11] initially supported the message passing, shared-memory, and hybrid paradigms, thus allowing applications to use the most appropriate computing model for the entire application or for individual subalgorithms. However, support for emulated shared-memory was omitted as the system evolved, since the message passing paradigm was the model of choice for most scientific parallel processing applications. Processing elements in PVM may be scalar machines, distributed and shared-memory multiprocessors, vector supercomputers and special purpose graphics engines, thereby permitting the use of the best-suited computing resource for each component of an application.

The PVM system is composed of a suite of user interface primitives and supporting software that together enable concurrent computing on loosely coupled networks of processing elements. PVM may be implemented on a hardware base consisting of different machine architectures, including single CPU systems, vector machines, and multiprocessors. These computing elements may be interconnected by one or more networks, which may themselves be different (e.g., one implementation of PVM operates on Ethernet, the Internet, and a fiber optic network). These computing elements are accessed by applications via a standard interface that supports common concurrent processing paradigms in the form of well-defined primitives that are embedded in procedural host languages. Application programs are composed of components that are subtasks at a moderately large level of granularity. During execution, multiple instances of each component may be initiated. Fig. 1 depicts a simplified architectural overview of the PVM computing model as well as the system.

Application programs view the PVM system as a general and flexible parallel computing resource. A translucent layering permits flexibility while retaining the ability to exploit particular strengths of individual machines on the network. The PVM user interface is strongly typed; support for operating in a heterogeneous environment is provided in the form of special constructs that selectively perform machine dependent data conversions where necessary. Inter-instance communication constructs include those for the exchange of data structures as well as high-level primitives such as broadcast, barrier synchronization, mutual exclusion, and rendezvous. Application programs under PVM may possess arbitrary control and dependency structures. In other words, at any point in the execution of a concurrent application, the processes in existence may have arbitrary relationships between each other and, further, any process may communicate and/or synchronize with any other.

The PVM system is composed of two parts. The first part is a daemon, called pvmd, that executes on all the computers comprising the virtual machine. PVM is designed so that any user having normal access rights to each host in the pool

V.S. Sunderam, G.A. Geist / Parallel Computing 25 (1999) 1699±1721 1705

Page 8: Heterogeneous parallel and distributed computing

may install and operate the system. To run a PVM application, the user executes the daemons on a selected host pool, and the set of daemons cooperate via distributed algorithms to initialize the virtual machine. The PVM application can then be started by executing a program on any of these machines. The usual method is for this manually started program to spawn other application processes, using PVM facilities. Multiple users may configure overlapping virtual machines, and each user can execute several PVM applications simultaneously. The second part of the system is a library of PVM interface routines (libpvm.a). This library contains user callable routines for message passing, spawning processes, coordinating tasks, and modifying the virtual machine. The installation process for PVM is straightforward; PVM does not require special privileges to be installed. Anyone with a valid login on the hosts can do so, by following a simple sequence of steps for obtaining the distribution via the Web or by FTP, compiling, and installing.
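The following hedged sketch illustrates the usual pattern just described: a manually started master spawns worker tasks and exchanges messages with them through libpvm calls. The worker executable name ("worker"), the message tags, and the assumption that each worker returns a single integer are illustrative choices, not part of PVM itself.

#include <stdio.h>
#include <pvm3.h>

#define NWORKERS 4

int main(void)
{
    int tids[NWORKERS], i, n, result, sum = 0;

    /* Spawn the worker tasks anywhere in the virtual machine. */
    n = pvm_spawn("worker", (char **)0, PvmTaskDefault, "", NWORKERS, tids);
    if (n < NWORKERS)
        fprintf(stderr, "only %d of %d workers started\n", n, NWORKERS);

    for (i = 0; i < n; i++) {            /* send each worker its input, tag 1 */
        pvm_initsend(PvmDataDefault);    /* default (XDR) encoding for heterogeneity */
        pvm_pkint(&i, 1, 1);
        pvm_send(tids[i], 1);
    }
    for (i = 0; i < n; i++) {            /* gather one integer reply per worker, tag 2 */
        pvm_recv(-1, 2);
        pvm_upkint(&result, 1, 1);
        sum += result;
    }
    printf("sum of worker results = %d\n", sum);

    pvm_exit();                          /* leave the virtual machine */
    return 0;
}

The corresponding worker would unpack its input with pvm_upkint( ), perform its share of the computation, and return its result with tag 2.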

3.2. PVM programming

Developing applications for the PVM system follows, in a general sense at least, the traditional paradigm for programming distributed-memory multiprocessors such as the Intel family of hypercubes. This is true for both the logistical aspects of programming as well as for algorithm development. However, there are significant

Fig. 1. PVM system overview: (a) PVM computing model; (b) PVM architectural overview.



differences in terms of (a) task management, especially issues concerning dynamic process creation, naming and addressing; (b) initialization phases prior to actual computation; (c) granularity choices; and (d) heterogeneity. These issues must be kept in mind during the general programming process for PVM, and attention paid to factors that impact functionality and performance. In PVM, the issue of workload allocation is of particular importance, subsequent to establishing process structure, because of the heterogeneous and multiprogrammed nature of the underlying hosts that inherently cause load imbalances. Therefore, data decomposition or partitioning should not assume that all processing elements are equally capable or equally available. Function decomposition is better suited, since it divides the work based on different operations or functions. In a sense, the PVM computing model supports function decomposition at the component level (components are fundamentally different programs that perform different operations) and data decomposition at the instance level, i.e., within a component, the same program operates on different portions of the data.

In order to utilize the PVM system, applications must evolve through two stages. The first concerns development of the distributed-memory parallel version of the application algorithm(s); this phase is common to the PVM system as well as to other distributed-memory multiprocessors. The actual parallelization decisions fall into two major categories – those related to structure, and those related to efficiency. For structural decisions in parallelizing applications, the major decisions to be made include the choice of model to be used, i.e., crowd computation (based on peer-to-peer process structures) vs. tree computation (based on hierarchical process structures), and data decomposition vs. function decomposition. Decisions with respect to efficiency when parallelizing for distributed-memory environments are generally oriented towards minimizing the frequency and volume of communications. It is typically in this latter respect that the parallelization process differs for PVM and hardware multiprocessors: for PVM environments based on networks, large granularity generally leads to better performance. With this qualification, the parallelization process is very similar for PVM and for other distributed-memory environments, including hardware multiprocessors.

The parallelization of applications may be done either ab initio, from existing sequential versions, or from existing parallel versions. In the first two cases, the stages involved are to select an appropriate algorithm for each of the subtasks in the application, usually from published descriptions – or by inventing a parallel algorithm. These algorithms are then coded in the language of choice (C, C++, or FORTRAN 77 for PVM) and interfaced with each other as well as with process management and other constructs. Parallelization from existing sequential programs also follows certain general guidelines, primary among which is to decompose loops, beginning with outermost loops and working inward. In this process, the main concern is to detect dependencies and partition loops such that dependencies are preserved while allowing for concurrency. This parallelization process is described in numerous textbooks and papers on parallel computing, though few textbooks discuss the practical and specific aspects of transforming a sequential program to a parallel one.



Existing parallel programs may be based on either the shared-memory or distributed-memory paradigms. Converting existing shared-memory programs to PVM is similar to converting from sequential code, when the shared-memory versions are based on vector or loop level parallelism. In the case of explicit shared-memory programs, the primary task is to locate synchronization points and replace these with message passing. In order to convert existing distributed-memory parallel code to PVM, the main task is to convert from one set of concurrency constructs to another. Typically, existing distributed-memory parallel programs are written either for hardware multiprocessors or for other networked environments such as P4 or Express. In both cases, the major changes required are with regard to process management. For example, in the Intel family of distributed-memory multiprocessors (DMMPs), it is common for processes to be started from an interactive shell command line. Such a paradigm should be replaced for PVM by either a master program or a node program that takes responsibility for process spawning. With regard to interaction, there is, fortunately, a great deal of commonality between the message passing calls in various programming environments. The major differences between PVM and other systems in this context are with regard to (a) process management and process addressing schemes; (b) virtual machine configuration/reconfiguration and its impact on executing applications; (c) heterogeneity in messages as well as the aspect of heterogeneity that deals with different architectures and data representations; and (d) certain unique and specialized features such as signaling, task scheduling methods, etc.

3.3. Fault tolerance issues

Fault tolerance is a critical issue for any large scale scientific computer application. Long-running simulations, which can take days or even weeks to execute, must be given some means to gracefully handle faults in the system or the application tasks. Without fault detection and recovery it is unlikely that such applications will ever complete. For example, consider a large simulation running on dozens of workstations. If one of those many workstations should crash or be rebooted, then tasks critical to the application might disappear. Additionally, if the application hangs or fails, it may not be immediately obvious to the user. Many hours could be wasted before it is discovered that something has gone wrong. Further, there are several types of applications that explicitly require a fault tolerant execution environment, due to safety or level of service requirements. In any case, it is essential that there be some well-defined scheme for identifying system and application faults and automatically responding to them, or at least providing timely notification to the user in the event of failure.

PVM has supported a basic fault notification scheme for some time. Under the control of the user, tasks can register with PVM to be "notified" when the status of the virtual machine changes or when a task fails. This notification comes in the form of special event messages that contain information about the particular event. A task can "post" a notify for any of the tasks from which it expects to receive a message. In this scenario, if a task dies, the receiving task will get a notify message in place of any



expected message. The notify message allows the task an opportunity to respond tothe fault without hanging or failing.

Similarly, if a specific host like an I/O server is critical to the application, then the application tasks can post notifies for that host. The tasks will then be informed if that server exits the virtual machine, and they can allocate a new I/O server. This type of virtual machine notification is also useful in controlling computing resources. When a host exits from the virtual machine, tasks can utilize the notify messages to reconfigure themselves to the remaining resources. When a new host computer is added to the virtual machine, tasks can be notified of this as well. This information can be used to redistribute load or expand the computation to utilize the new resource. Several systems have been designed specifically for this purpose, including the WoDi system [21], which uses Condor [20] on top of PVM.

There are several important issues to consider when providing a fault notification scheme. For example, a task might request notification of an event after it has already occurred. PVM immediately generates a notify message in response to any such "after the fact" request. For example, if a "task exit" notification request is posted for a task that has already exited, a notify message is immediately returned. Similarly, if a "host exit" request is posted for a host that is no longer part of the virtual machine, a notify message is immediately returned. It is possible for a "host add" notification request to occur simultaneously with the addition of a new host into the virtual machine. To alleviate this race condition, the user must poll the virtual machine after the notify request to obtain the complete virtual machine configuration. Subsequently, PVM can then reliably deliver any new "host add" notifies.
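A minimal sketch of the "task exit" notification mechanism described above is given below; the tag values, the helper function, and the assumption that the notify message body carries the identifier of the exited task are illustrative rather than prescriptive.

#include <stdio.h>
#include <pvm3.h>

#define DATA_TAG 10
#define EXIT_TAG 99    /* delivered by PVM if the watched peer dies */

void wait_for_data(int peer_tid)
{
    int bufid, len, tag, src, value, deadtid;

    /* Ask PVM to send us a message tagged EXIT_TAG if peer_tid exits. */
    pvm_notify(PvmTaskExit, EXIT_TAG, 1, &peer_tid);

    bufid = pvm_recv(-1, -1);               /* receive from anyone, any tag */
    pvm_bufinfo(bufid, &len, &tag, &src);   /* inspect what actually arrived */

    if (tag == EXIT_TAG) {
        pvm_upkint(&deadtid, 1, 1);         /* notify body: tid of the dead task (assumed) */
        fprintf(stderr, "task t%x exited; recovering\n", deadtid);
        /* ... respawn the worker or redistribute its work here ... */
    } else if (tag == DATA_TAG) {
        pvm_upkint(&value, 1, 1);
        printf("received %d from t%x\n", value, src);
    }
}

The same pattern applies to "host exit" and "host add" events, with the receiving task reconfiguring itself to the changed virtual machine instead of handling a single failed peer.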

3.4. Current status and outlook

The latest version of PVM (PVM 3.4) works with both Windows NT and Unix hosts. The new features included in PVM 3.4 allow users to develop much more flexible, dynamic, and fault tolerant applications.

PVM 3.4 includes 12 new functions. These functions provide the biggest leap in PVM capabilities since PVM 3.0 came out in 1993. The functions provide communication contexts, message handlers, and a tuple space called the message box.

The ability to send messages in different communication contexts is a fundamental requirement for parallel tools and applications that must interact with each other. It is also a requirement for the development of safe parallel libraries. Context is a unique system-created tag, which is sent with each message. A matching receive function must match the context, destination, and message tag fields for the message to be received (wild cards are allowed for destination and message tag, but not for context). In the past, PVM applications had to divide up the message tag space to mimic context capabilities. With PVM 3.4 there are built-in functions to create, set, and free context values.
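A brief sketch of how a library might use these functions is shown below. The routine names pvm_newcontext( ), pvm_setcontext( ), and pvm_freecontext( ) follow the PVM 3.4 release; the surrounding library_init( ) routine and the tag values are purely illustrative.

#include <pvm3.h>

/* Illustrative only: create a private context for library-internal traffic.
 * The peer must adopt the same context (the value can be shipped to it in
 * the default base context, as done here) before tagged exchanges can match. */
int library_init(int peer_tid)
{
    int new_ctx, old_ctx;

    new_ctx = pvm_newcontext();          /* system-wide unique context value */

    pvm_initsend(PvmDataDefault);        /* tell the peer which context to use, */
    pvm_pkint(&new_ctx, 1, 1);           /* sent in the default base context    */
    pvm_send(peer_tid, 0);

    old_ctx = pvm_setcontext(new_ctx);   /* switch this task to the new context */
    /* ... messages sent and received here cannot collide with the
     *     application's own tags, which remain in the base context ... */
    pvm_setcontext(old_ctx);             /* restore the caller's context */

    return new_ctx;                      /* caller releases it later with pvm_freecontext() */
}

In effect, the private context plays the role that disjoint tag ranges played in older PVM applications, as noted above.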

By defining the context to be system-wide unique, PVM continues to allow the dynamic generation and destruction of tasks. And by defining that all PVM tasks have a base context by default, all existing PVM applications continue to work



unchanged. The combination of these features allows parallel tools developers to create visualization and monitoring packages that can attach to existing PVM applications, extract information, and detach without concern about interfering with the application.

The ability in the future to dynamically plug in middle layer tools and applications is predicated on the existence of a communication context paradigm similar, if not identical, to that of PVM 3.4.

PVM has always had message handlers internally, which were used for controlling the virtual machine. In PVM 3.4 the ability to define and delete message handlers has been raised up to the user level. To add a message handler, an application task calls:

handler_id = pvm_addmhf(src, tag, context, function);

Thereafter, whenever a message arrives at this task with the specified source, message tag, and communication context, the specified function is executed. The function is passed the pointer to the message so that the handler may unpack the message if required. PVM 3.4 places no restrictions on the complexity of the function. It is free to make system calls or other PVM calls.

With the functionality provided by pvm_addmhf( ) it is possible to build one-sided communication, active messages, applications that trigger other applications on certain events, fault recovery tools and schedulers, and so on. For example, instead of an error inside an application printing an error message, the event could be made to invoke a parallel debugger focused on the area of the problem. Another example would be a distributed data mining application that finds an interesting correlation and triggers a response in all the associated searching tasks. The existence of pvm_addmhf( ) allows tasks within an application to dynamically adapt and take on new functionality whenever a message handler is invoked.
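The sketch below registers a user-level handler with pvm_addmhf( ); the handler signature (a function receiving the identifier of the arrived message buffer), the tag value, and the use of pvm_getcontext( ) for the matching context are assumptions made for illustration.

#include <stdio.h>
#include <pvm3.h>

#define CTL_TAG 77                      /* illustrative control tag */

/* Invoked automatically when a matching message arrives; the argument is
 * assumed to identify the delivered message buffer, ready for unpacking. */
static int control_handler(int mid)
{
    int command;

    pvm_setrbuf(mid);                   /* make the delivered buffer active (assumed) */
    pvm_upkint(&command, 1, 1);
    fprintf(stderr, "control message: command %d\n", command);
    /* e.g., trigger a checkpoint, attach a monitor, or adapt behaviour */
    return 0;
}

void install_control_handler(void)
{
    int handler_id;

    /* src = -1 accepts any sender; match in the task's current context. */
    handler_id = pvm_addmhf(-1, CTL_TAG, pvm_getcontext(), control_handler);
    (void)handler_id;                   /* retain if the handler is to be removed later */
}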

In future systems the ability to dynamically add new functionality will have to be extended to include the underlying system as well as the user tasks. One could envision a message handler defined inside the virtual machine daemons that, when triggered by the application, would spawn off intelligent agents to seek out the requested software module from Web repositories. These trusted "children" agents could retrieve the module and use another message handler to cause the daemon to load the module, incorporating its new features.

In a typical message passing system, messages are transient, and the focus is often on making their existence as short as possible, i.e., to decrease latency and increase bandwidth. There are many situations in distributed applications seen today in which programming would be much easier if there were a way to have persistent messages. This is the purpose of the new message box feature in PVM 3.4. The message box is an internal tuple space in the virtual machine. Tasks can use regular PVM pack routines to create an arbitrary message and then use pvm_putinfo( ) to place this message into the message box with an associated name. Copies of this message can be retrieved by any PVM task that knows the name. And if the name is unknown or changing dynamically, then pvm_getmboxinfo( ) can be used to find the list of names active in the message box. The four functions that make up the message box in PVM 3.4 are:



index = pvm_putinfo(name, msgbuf, flag)
pvm_recvinfo(name, index, flag)
pvm_delinfo(name, index, flag)
pvm_getmboxinfo(pattern, names[], struct info[])

The flag defines the properties of the stored message, such as: who is allowed to delete this message, does this name allow multiple instances of messages, does a put overwrite the message? The flag argument also allows extension of this interface as PVM 3.4 users give us feedback on how they use the features of message boxes.
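The following sketch illustrates one such use: a server task advertises its task identifier under a well-known name, and a client retrieves it later. The name string, the index value, and the PvmMboxDefault flag are illustrative assumptions rather than prescriptions.

#include <pvm3.h>

/* Called by the service: leave a message holding our tid in the message box. */
void advertise_service(void)
{
    int mytid = pvm_mytid();
    int bufid;

    bufid = pvm_initsend(PvmDataDefault);   /* create and select a send buffer */
    pvm_pkint(&mytid, 1, 1);
    pvm_putinfo("io-server", bufid, PvmMboxDefault);
}

/* Called by a client: retrieve the advertised tid by name; returns -1 if absent. */
int lookup_service(void)
{
    int tid = -1;
    int bufid;

    bufid = pvm_recvinfo("io-server", 0, PvmMboxDefault);  /* index 0: first instance (assumed) */
    if (bufid >= 0) {
        pvm_setrbuf(bufid);          /* make it the active receive buffer before unpacking */
        pvm_upkint(&tid, 1, 1);
    }
    return tid;
}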

While the tuple space could be used as a distributed shared-memory, similar to theLinda system [8], the granularity of the PVM 3.4 implementation is better suited tolarge grained data storage.

Here are just a few of the many potential message box uses. A visualization tool spontaneously comes to life and finds out where and how to connect to a large distributed simulation. A scheduling tool retrieves information left by a resource monitor. A new team member learns how to connect to an ongoing collaboration. A debugging tool retrieves a message left by a performance monitor that indicates which of the thousands of tasks is most likely a bottleneck. Many of these capabilities are directly applicable to adaptable environments, and some method to have persistent messages will be a part of future virtual machine design.

The addition of communication contexts, message handlers, and message boxes tothe PVM environment allows developers to take a big leap forward in the capabilitiesof their distributed applications. PVM 3.4 is a useful tool for the development ofmuch more dynamic, fault tolerant distributed applications.

3.5. MPI and its relationship to PVM

PVM is built around the concept of a virtual machine, which is a dynamic collection of (potentially heterogeneous) computational resources managed as a single parallel computer. The virtual machine concept is fundamental to the PVM perspective and provides the basis for heterogeneity, portability, and encapsulation of functions that constitute PVM. PVM provides features like fault tolerance and interoperability which are not a part of MPI. In contrast, MPI has focused on message passing and explicitly states that resource management and the concept of a virtual machine are outside the scope of the MPI (1 and 2) standard.

The PVM API has continuously evolved over the years to satisfy user requests for additional features and to keep up with fast-changing network and computing technology. In contrast to the PVM API, the MPI-1 API was specified by a committee of about 40 high-performance computing experts from research and industry in a series of meetings in 1993–1994, and was defined as a fixed, unchanging standard. The impetus for developing MPI was that each massively parallel processor (MPP) vendor was creating its own proprietary message passing API. In this scenario it was not possible to write a portable parallel application. MPI is intended to be a standard message passing specification that each MPP vendor would implement on their system. The MPP vendors need to be able to deliver high performance, and this



became the focus of the MPI design. Given this design focus, MPI is expected toalways be faster than PVM on MPP hosts.

MPI-1 contains the following main features:
· A large set of point-to-point communication routines (by far the richest set of any library to date).
· A large set of collective communication routines for communication among groups of processes.
· A communication context that provides support for the design of safe parallel software libraries.
· The ability to specify communication topologies.
· The ability to create derived datatypes that describe messages of non-contiguous data.
MPI-1 users soon discovered that their applications were not portable across a network of workstations because there was no standard method to start MPI tasks on separate hosts. Different MPI implementations used different methods. In 1995 the MPI committee began meeting to design the MPI-2 specification to correct this problem and to add additional communication functions to MPI, including:
· MPI_SPAWN functions to start MPI processes.
· One-sided communication functions such as put and get.
· MPI_IO.
· Language bindings for C++.
The MPI-2 specification was finished in June 1997. The MPI-2 document adds an additional 200 functions to the 128 functions specified in the MPI-1 API. This makes MPI a much richer source of communication methods than PVM.
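As a brief illustration of the MPI-2 process creation facility listed above, the hedged sketch below spawns four copies of a worker executable (the name "worker" is an assumption) and obtains an intercommunicator for subsequent communication with them.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm children;
    int errcodes[4];

    MPI_Init(&argc, &argv);

    /* Start 4 instances of "worker"; root rank 0 supplies the arguments. */
    MPI_Comm_spawn("worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &children, errcodes);

    /* Communication with the spawned tasks now goes through "children",
     * an intercommunicator, e.g. MPI_Send(..., children). */

    MPI_Finalize();
    return 0;
}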

4. Representative results in network computing

4.1. The NAS parallel benchmarks

The Numerical Aerodynamic Simulation (NAS) program of the National Aeronautics and Space Administration (NASA) has devised and published a suite of benchmarks [14] for the performance analysis of highly parallel computers. These benchmarks are designed to substantially exercise the processor, memory, and communication systems of current generation parallel computers. They are specified only algorithmically; except for a few constraints, implementors are free to select optimal language constructs and implementation techniques.

The complete benchmark suite consists of eight applications, five of which are termed kernels because they form the core of many classes of aerodynamic applications; the remaining three are simulated CFD applications. The five kernels and their vital characteristics are listed in Table 1. NASA periodically publishes performance results obtained either from internal experiments or from those conducted by third party implementors on various supercomputers and parallel machines. The de facto yardstick used to compare these performance results is a single processor of the Cray Y-MP, executing a sequential version of the same application.



The five NPB kernels, with the exception of the embarrassingly parallel application, are all highly communication intensive when parallelized for message passing systems. As such, they form a rigorous suite of quasi-real applications that heavily exercise system facilities and also provide insights into bottlenecks and hot-spots for specific distributed-memory architectures. In order to investigate the viability of clusters and heterogeneous concurrent systems for such applications, the NPB kernels were ported to execute on the PVM system. Detailed discussions and analyses are presented in [15,16]; here we describe our experiences with two representative kernels, using Ethernet and FDDI-based clusters.

The V-cycle multigrid kernel involves the solution of a discrete Poisson problem ∇²u = v with periodic boundary conditions on a 256 × 256 × 256 grid. v is 0 at all coordinates except for 10 specific points where it is +1.0 and 10 specific points where it is −1.0. The PVM version of this application was derived by substantially modifying an Intel hypercube version: data is partitioned along one dimension of the grid, shadow boundaries are maintained, and nearest neighbor communications are performed. Several optimizations were also incorporated, primarily to maximize utilization of network capacity, and to reduce some communication. Results for the multigrid kernel under PVM are shown in Table 2.

From the table it can be seen that the PVM implementation performs at good toexcellent levels, despite the large volume of communication which accounts for up to35% of the overall execution time. It may also be observed that the communications

Table 2
V-cycle multigrid: PVM timings

Platform                          Time (s)   Comm. volume (MB)   Comm. time (s)   Bandwidth (KB/s)
4 × IBM RS6000/550 (Ethernet)     293        96                  101              973
4 × IBM RS6000/550 (FDDI)         185        96                  18               5461
8 × IBM RS6000/320 (FDDI)         110        192                 32               6144
Cray Y-MP/1                       54         -                   -                -

Table 1
NAS parallel benchmarks: kernel characteristics

Benchmark code            Problem size   Memory (MB)   Cray time (s)   Operation count
Embarrassingly parallel   2^28           8             126             2.67 × 10^10
V-cycle multigrid         256^3          453.6         22              3.90 × 10^9
Conjugate gradient        2 × 10^6       83.2          12              1.51 × 10^9
3-d FFT PDE               256^2 × 128    343.2         29              5.6 × 10^9
Integer sort              2^23           248.8         11              7.81 × 10^8



bandwidths obtained at the application level are a significant percentage of the theoretical limit for both types of networks. Finally, the eight processor cluster achieves one-half the speed of a single processor of the Cray Y-MP, at an estimated one-fourth of the cost.

The conjugate gradient kernel is an application that approximates the smallest eigenvalue of a symmetric positive-definite sparse matrix. The critical portion of the code is a matrix–vector multiplication, requiring the exchange of subvectors in the partitioning scheme used. In this exercise also, the PVM version was implemented for optimal performance, with modifications once again focusing on reducing communication volume and interference. Results from our experiments are shown in Table 3.

This table also exhibits certain interesting characteristics. Like the multigrid application, the conjugate gradient kernel is able to obtain near theoretical communications bandwidth, particularly on the Ethernet, and a four processor cluster of high-performance workstations performs at one-fourth the speed of a Cray Y-MP/1. Another notable observation is that with an increase in the number of processors, the communication volume increases, thereby resulting in lowered speedups. Our results from these two NPB kernels indicate both the power and the limitations of concurrent network-based computing, i.e., that with high-speed, high-capacity networks, cluster performance is competitive with that of supercomputers; that it is possible to harness nearly the full potential and capacity of processing elements and networks; but that scaling, load imbalance, and latency limitations are inevitable with the use of the general purpose processors and networks that most cluster environments are built from.

4.2. Polymer chains and scale-invariant phenomena

The particular problem studied in this work is one in which some fundamental aspects of the statistical mechanics of polymer solutions [17] are investigated. In this experiment, we focus on a linear chain which has a restricted interaction with the medium; that is, there are forbidden regions (infinite energy barrier), and the

Table 3
Conjugate gradient: PVM timings

Platform                           Time (s)   Comm. volume (MB)   Comm. time (s)   Bandwidth (KB/s)
4 × IBM RS6000/550 (Ethernet)      203        130                 124              1074
4 × IBM RS6000/550 (FDDI)          82         130                 24               5433
16 × Sun Sparc SS1+ (Ethernet)     620        370                 360              1074
Cray Y-MP/1                        22         -                   -                -



chain is confined to other parts of the medium. If the forbidden regions occur randomly and the minimum length scale of these is much smaller than the size of the polymer in homogeneous media, then this problem can be modeled by a self-avoiding walk (SAW) on a randomly diluted lattice. Thus, we first create a percolation cluster [18] on a finite grid (of size L³) by randomly diluting the grid (i.e., removing sites as inaccessible to the chain) with probability 1 − p. The remaining sites then form connected components called clusters. Above a certain threshold p_c, there exists one cluster that spans the grid from end to end, and this is the cluster of interest. On this disordered cluster, a starting point of a SAW is chosen randomly, and then all SAWs of a predetermined number of steps N are generated by a depth-first type search algorithm. At each of the N steps, various conformational properties of the chain are measured, such as the moments of the end-to-end distance R_N and radius of gyration S_N, and those of the number of chains C_N. These are then averaged over the ensemble of SAWs on the particular disorder configuration. This is repeated for a large number of disorder configurations, and finally both linear and logarithmic means are calculated over the disorder ensemble.

The polymer simulation problem is typical of most Monte Carlo problems in that it possesses a simple repetitive structure. The main routine initializes various arrays for statistics and calls a slave routine which computes samples, and periodically communicates them to the monitor(s). A few hours of effort were required to parallelize the original code using PVM and a related tool called ECLIPSE [23]. In 11 different experiments, each corresponding to a particular network configuration of arbitrarily chosen machines, we used between 16 and 192 geographically dispersed processors; results from these experiments are reported in Table 4.

The parallelization has made an otherwise impossible interactive experimentation process possible. Previous exercises conducted on the Cray Y-MP were only permissible in batch mode, because of the computing time and cost involved, and allowed for little interactive experimentation. Further, limited Cray access time made the entire experimentation process difficult – restrictions that many researchers encounter on supercomputers as well as on MPPs. In contrast, our experience is that a computing environment flexibly allowing for computation nodes that are Sun/IBM workstations, Intel i860 nodes, iPSC/2 nodes or Sequent processors improves the entire interactive experimentation process by an order of magnitude. Moreover,

Table 4
Polymer physics application: PVM/ECLIPSE timings (seconds)

Columns: # of procs; Sun SS; IBM RS6000; Sun SS + RS6000; Intel i860; SS + IBM + i860; Average Equiv Cray
15 procs: 7101, 1992, 9226, 6106, 12735
31 procs: 3194, 1366, 4661, 3074, 13250
63 procs: 1015, 2108, 1516, 11552
127 procs: 492, 893, 692, 10526
191 procs: 574, 574, 11493



if a subset of nodes fail, or a subnetwork goes down, the rest of the computation remains active, resulting in increased computation time but a graceful degradation of service.

5. Discussion

Our experiences with PVM and related systems have been very valuable. From the research perspective, the important issues in heterogeneous network computing have been highlighted, and the benefits and limitations of this paradigm have been demonstrated. The project has also provided a deeper insight into important research issues that warrant further investigation and has led to the evolution of novel techniques that are promising and should be pursued further. Like many experimental research projects, PVM has also had an important side benefit: it has produced software systems that are pragmatic and robust enough for actual production quality use by applications in a variety of domains. From the point of view of applications developers, PVM has provided a cost effective and technically viable solution for high-performance concurrent computing, as indicated by the large number of installations adopting this system, often as their primary concurrent computing facility. Both these facets of PVM have encouraged us to continue research and development, both on aspects that enhance the effectiveness and efficiency of the systems software, as well as on new features and techniques. We describe below two new projects that we have undertaken to improve the efficacy of network-based concurrent computing.

5.1. Parallel I/O

In the rapidly evolving field of heterogeneous concurrent computing, and even in conventional parallel processing, an overwhelmingly large proportion of the focus has thus far been on "standalone", "in-core" problems, mostly of a scientific or mathematical nature. However, as this model of computing matures and becomes an increasingly viable technology, the need to incorporate features other than fast computation and communication will become apparent – the most pressing of these is likely to be an input/output framework, preferably one that is itself parallel in nature. Parallel I/O will not only be required by scientific applications that manipulate large data sets or require out-of-core computations, but also by non-scientific applications, especially those in transaction processing, distributed databases, and other high performance commercial applications.

Motivated by this reasoning, we have embarked on a project to provide parallel I/O facilities in network-based concurrent computing systems, with PVM as the first target environment. PIOUS [19] is an input/output system that provides process groups access to permanent storage within a heterogeneous network computing environment. PIOUS is a parallel distributed file server that achieves a high level of performance by exploiting the combined file I/O and buffer cache capacities of



multiple interconnected computer systems. Fault tolerance is achieved by exploiting the redundancy of storage media.

To better support parallel applications, PIOUS implements a parallel access file object called a parafile and provides varying levels of concurrency control for process group members. PIOUS is a file service built on top of existing file systems and accessed via library calls. PIOUS is itself implemented as a group of cooperating processes within (an enhanced version of) the PVM distributed computing framework. Because of its modular design and utilization of existing standards, PIOUS is easily ported to other distributed computing environments providing similar functionality. An alpha version has been implemented, and preliminary results, in our evaluation of both functionality and performance, are very encouraging.

5.2. Threads-based concurrent computing

At present, most heterogeneous network-based concurrent computing systems, including PVM, support the process-based model of computation. In other words, the unit of parallelism is a process. At the other end of the spectrum are traditional shared-memory and vector computers, where parallelism can be at the loop or even instruction level. While the process-based model has many advantages, there are compelling reasons to consider network concurrent computing using a finer granularity, if this is possible without sacrificing efficiency.

Our preliminary analyses suggest that an attractive option is a model where threads, or light-weight processes, are the units of parallelism. We are investigating a scenario in which processes, instead of possessing one thread of control, would be composed of multiple threads; threads within a process would cooperate via their common, shared address space, while message passing would be used for communication between threads on different processes or processors. This scheme is advantageous from several viewpoints. First, there is a trend among manufacturers of almost all workstations and general-purpose systems towards multiprocessing on a small scale, i.e., systems with 2-8 CPUs are becoming increasingly common. In the conventional use of systems such as PVM, multiple processes may execute on these systems, but will communicate via (relatively expensive) message passing, even though physical shared memory is available. With a threads approach, multiple light-weight processes on one physical machine with several CPUs would interact via shared memory, thus reducing communication overheads significantly.
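The following C sketch illustrates the hybrid organization described above, assuming POSIX threads and the standard PVM 3 C library; the thread count, message tag, and workload are arbitrary. Threads of one process accumulate partial sums through their common address space, and a single PVM message then carries the combined result to another process (here the parent task, if one exists).

/* Minimal sketch of the hybrid model: threads cooperate via shared memory
 * within a process; PVM message passing carries the result between processes.
 * Compile with -lpthread and the PVM 3 library (-lpvm3). */
#include <pthread.h>
#include <stdio.h>
#include <pvm3.h>

#define NTHREADS 4
#define N        1000000L

static double partial[NTHREADS];      /* shared by all threads of this process */

static void *worker(void *arg)
{
    int id = *(int *)arg;
    double sum = 0.0;
    /* Each thread sums its slice of the iteration space. */
    for (long i = id; i < N; i += NTHREADS)
        sum += 1.0 / (double)(i + 1);
    partial[id] = sum;                /* written to shared memory, no messages */
    return NULL;
}

int main(void)
{
    pthread_t th[NTHREADS];
    int ids[NTHREADS];
    double total = 0.0;

    for (int i = 0; i < NTHREADS; i++) {
        ids[i] = i;
        pthread_create(&th[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(th[i], NULL);
    for (int i = 0; i < NTHREADS; i++)
        total += partial[i];

    /* Inter-process communication still uses message passing, e.g. PVM:
     * send the combined result to this task's parent, if it was spawned. */
    int parent = pvm_parent();
    if (parent > 0) {
        pvm_initsend(PvmDataDefault);
        pvm_pkdouble(&total, 1, 1);
        pvm_send(parent, 1 /* message tag */);
    }
    printf("local total = %f\n", total);
    pvm_exit();
    return 0;
}

Only one message leaves the process, while all intra-process cooperation goes through shared memory, which is precisely the saving that motivates the threads-based model.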

In the above situation, as well as on single-CPU machines, threads have another important benefit, namely, they are a natural and effective mechanism for latency hiding. In other words, multiple threads within a single process have the potential for increased CPU utilization (and therefore better overall application performance), because when a thread is blocked on an external communication activity, other threads could continue to perform useful work. Finally, threads are also attractive from the programming point of view for applications that are best expressed in terms of fine-grained computations, e.g., Ising model computations, branch-and-bound algorithms, etc. Currently, programming such applications requires extensive manual housekeeping and data/control structure management: a threads-based model would significantly enhance the software engineering aspect of such applications. We are currently experimenting with such a threads-based model and expect preliminary results to be available soon.
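The latency-hiding argument can be made concrete with a short pthreads sketch: one thread blocks on ``communication'' (simulated here by sleep(), standing in for a blocking receive such as pvm_recv()), while a second thread keeps the CPU busy with useful work. The loop bound and sleep duration are arbitrary.

/* Illustrative latency hiding with two threads: while the communication
 * thread is blocked, the compute thread continues to use the CPU. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *comm_thread(void *arg)
{
    (void)arg;
    sleep(1);                         /* stand-in for a blocking receive */
    puts("communication finished");
    return NULL;
}

static void *compute_thread(void *arg)
{
    double *acc = arg;
    /* Useful work proceeds while the other thread is blocked. */
    for (long i = 1; i <= 50000000L; i++)
        *acc += 1.0 / (double)i;
    return NULL;
}

int main(void)
{
    pthread_t comm, comp;
    double acc = 0.0;

    pthread_create(&comm, NULL, comm_thread, NULL);
    pthread_create(&comp, NULL, compute_thread, &acc);
    pthread_join(comm, NULL);
    pthread_join(comp, NULL);

    printf("accumulated %f while communication was in flight\n", acc);
    return 0;
}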

5.3. Ongoing trends in metacomputing

The computing model proposed by PVM and similar systems has proven to be a viable technology in a variety of distributed computing settings. Since then, however, other developments that impact network computing have also taken place, notably in object-oriented systems, support for high-speed networks, and high-throughput batch-oriented cluster computing. Recently, more ambitious projects propounding very large scale metacomputing and computational grids have emerged; pioneering efforts include the Globus [24] and Legion [25] systems. Brief outlines of these systems are provided, as they are representative of the state of the art in metacomputing.

Legion: The Legion research project at the University of Virginia aims to provide an architecture for designing and building system services that present the illusion of a single virtual machine. Persistence, security, improved response time, and greater throughput are among its many design goals, but the key characteristic of the system is its ambition of presenting a transparent, single virtual machine interface to the user. Legion aims at presenting a seamless computing environment with a single name-space, but supports multiple programming languages (and models) and interoperability. It is an object-oriented system that attempts to exploit inheritance, reuse, and encapsulation; the distributed object programming system Mentat (a precursor to Legion) is in fact the basis for programming the first public release of Legion.

Globus: Globus seeks to enable the construction of computational grids providing robust access to high-performance distributed computing (HPDC) resources. Globus is another very ambitious project; its research foci are: (1) resource location and management strategies; (2) enhanced unicast and multicast communication via the Nexus subsystem; and (3) security, especially data access and authentication issues. Additional areas of interest include monitoring subsystems, mechanisms for distributed access to application and system data, and management of application executables [24]. The Globus architecture is based on Nexus, a multithreaded communications framework, which interacts with metacomputing information services and resource allocation and authentication modules to emulate a metacomputing infrastructure. Several programming interfaces are provided, thereby enabling flexibility.

NetSolve: A somewhat different, client-server-based approach is adopted by NetSolve [26], a computational framework that allows users to access computational resources distributed across the network. NetSolve offers the ability to search for computational resources on a network, choose the best available resource based on a number of parameters, solve a problem (with retry for fault tolerance), and return the answer to the user. Resources used by NetSolve are computational servers that run on different hosts and may provide both generic and specialized capabilities. The system provides a framework that allows these servers to be interfaced with virtually any numerical software. Access is achieved through a variety of interfaces; two that have been developed are a MATLAB interface and a graphical Java interface. It is also possible to call NetSolve from C or FORTRAN programs using a NetSolve library API.

IceT: Along somewhat related lines, a project called IceT has been in development. IceT shares its basis in loosely coupled heterogeneous network computing with Legion, Globus, and NetSolve, but there are a number of technical and philosophical differences in this project. First, IceT views metacomputing as a collaborative activity, where users contribute resources to a pool, which they then timeshare. Contributions, with different levels of access assigned to each type, may be to open pools or to restricted ones, and may be updated or withdrawn when the owner so desires. The second major difference between IceT and other metacomputing systems is IceT's emphasis on process mobility and portability. To cater to the inherently dynamic nature of the IceT virtual machine and potentially changing needs of applications, uploading and soft-installation of application components is fundamental to IceT's mode of operation. The two major differentiating aspects of IceT outlined above lead to the potential for other novel features that the system exploits. The basis in extensible classes for access to IceT features immediately enables user enhancement of system-level facilities, i.e., the capability to modify or extend the virtual machine (VM) dynamically. IceT environments can soft-install system-level modules during runtime, under control of users, e.g., insertion of a distributed shared-memory programming API for a portion of an application, or temporary deployment of a lossy compression module when transferring a large image. Furthermore, this capability, coupled with the collaborative nature of IceT and its uploading/soft-install provisions, facilitates implementation of ``agents'', both for user-level as well as system-oriented tasks.

Harness: Harness [27,28] is another metacomputing system that is beginning to evolve as a framework for next-generation heterogeneous network computing. Harness builds on the concept of the distributed virtual machine that was pioneered by PVM research, but fundamentally recreates this idea and explores dynamic capabilities beyond what PVM can supply. The Harness project is focused on developing three key capabilities within the framework of a heterogeneous distributed computing environment that includes (like PVM) everything from laptops running Windows 95/NT to multiprocessor supercomputers running Unix. First, Harness proposes the notion of a parallel plug-in interface that allows users or applications to dynamically customize, adapt, and extend the environment's features. Second, Harness integrally incorporates distributed peer-to-peer control, which prevents a single point of failure (in contrast to typical client/server control schemes). This greatly enhances the fault tolerance that is available to large, long-running simulations. Finally, Harness supports multiple distributed virtual machines that can collaborate, merge, or split. This feature provides a framework for collaborative simulations, as in IceT.


Acknowledgements

Supported by NASA grant NAG 2-828, DoE grant DE-FG05-91ER25105, and NSF grants CCR-9118787, ASC-9214149, and CCR-9523544.

References

[1] R.F. Freund, H.J. Siegel (Eds.), Heterogeneous processing (special issue), IEEE Computer 26 (6) (1993).
[2] V.S. Sunderam, R.F. Freund (Eds.), J. Parallel Distributed Comput. 21 (3) (1994).
[3] R.F. Freund, SuperC or distributed heterogeneous HPC 2 (4) (1991) 349–355.
[4] J. Potter, Associative Computing, Plenum Press, New York, 1992.
[5] L.H. Turcotte, A survey of software environments for exploiting networked computing resources, Technical Report, ERCCFS, Mississippi State University, June 1993.
[6] D.Y. Cheng, A survey of parallel programming languages and tools, NAS Systems Division Technical Report RND 93 005, NASA Ames Research Center, March 1993.
[7] L. Patterson et al., Construction of a fault tolerant distributed tuple space, in: Proceedings of the 1993 Symposium on Applied Computing, Indianapolis, February 1993.
[8] D. Gelernter, Domesticating parallelism, IEEE Computer 19 (8) (1986) 12–16.
[9] J. Boyle et al., Portable Programs for Parallel Processors, Holt, Rinehart & Winston, New York, 1987.
[10] R. Hempel, The ANL/GMD macros (Parmacs) in FORTRAN for portable parallel programming using message passing, GMD Technical Report, November 1991.
[11] V.S. Sunderam, PVM: a framework for parallel distributed computing, Concurrency: Practice and Experience 2 (4) (1990) 315–339.
[12] V. Rego, V.S. Sunderam, Experiments in concurrent stochastic simulation: the ECLIPSE paradigm, Journal of Parallel and Distributed Computing 14 (1) (1992) 66–84.
[13] H. Nakanishi, V. Rego, V.S. Sunderam, Superconcurrent simulation of polymer chains on heterogeneous networks, in: Proceedings of the Fifth IEEE Supercomputing Conference, Minneapolis, November 1992.
[14] D.H. Bailey et al., The NAS parallel benchmarks, International Journal of Supercomputer Applications 5 (3) (1991) 63–73.
[15] S.M. White, Implementing the NAS benchmarks on virtual parallel machines, M.S. thesis, Emory University, April 1993.
[16] S.M. White, A. Anders, V.S. Sunderam, Performance optimization of the NAS NPB kernels under PVM, in: Proceedings of Distributed Computing for Aeroscience Applications, Moffett Field, October 1993.
[17] P.J. Flory, Statistical Mechanics of Chain Molecules, Interscience, New York, 1969.
[18] D. Stauffer, Introduction to Percolation Theory, Taylor & Francis, London, 1985.
[19] S.A. Moyer, V. Sunderam, Parallel I/O for distributed systems: issues and implementation, Future Generation Computer Systems 12 (1) (1996).
[20] M. Litzkow, M. Livny, M.W. Mutka, Condor – a hunter of idle workstations, in: Proceedings of the Eighth International Conference on Distributed Computing Systems, June 1988, pp. 104–111.
[21] J. Pruyne, M. Livny, WoDi: a framework for parallel computing on unreliable resources, PVM Users' Group Meeting, Santa Fe, NM, February 1996.
[22] TotalView Debugger, World Wide Web, http://www.etnus.com/tw/tvover.htm, January 1999.
[23] H. Nakanishi, V. Rego, V.S. Sunderam, On the effectiveness of superconcurrent computation on heterogeneous networks, Journal of Parallel and Distributed Computing 24 (2) (1995) 177–190.
[24] I. Foster, C. Kesselman, The Globus project: a status report, in: Proceedings of the Heterogeneous Computing Workshop, IPPS/SPDP '98, Orlando, FL, April 1998.
[25] M. Lewis, A. Grimshaw, The core Legion object model, in: Proceedings of the Fifth IEEE International Symposium on High-Performance Distributed Computing, August 1996.
[26] H. Casanova, J. Dongarra, NetSolve: a network solver for solving computational science problems, The International Journal of Supercomputer Applications and High-Performance Computing 11 (3) (1997).
[27] J. Dongarra, G. Fagg, A. Geist, J. Kohl, P. Papadopoulos, S. Scott, M. Migliardi, V.S. Sunderam, Harness: heterogeneous adaptable reconfigurable networked systems, in: Proceedings of the Seventh High-Performance Distributed Computing Symposium, Chicago, IL, August 1998, pp. 358–359.
[28] M. Migliardi, V. Sunderam, A. Geist, J. Dongarra, Dynamic reconfiguration and virtual machine management in the Harness metacomputing system, in: Proceedings of ISCOPE98, Santa Fe, NM, December 1998.
