
Page 1:

PVM and MPI: What Else is Needed For Cluster Computing?

Al Geist
Oak Ridge National Laboratory
www.csm.ornl.gov/~geist

DAPSYS/EuroPVM-MPI
Balatonfured, Hungary
September 11, 2000

Page 2:

EuroPVM-MPI: Dedicated to the hottest developments of PVM and MPI

PVM and MPI are the most widely used tools for parallel programming

The hottest trend driving PVM and MPI today is PC clusters running Linux and/or Windows

This talk will look at the gaps in what PVM and MPI provide for cluster computing, what role the GRID may play, and what is happening to fill the gaps…

Page 3:

PVM Latest News

New release this summer – PVM 3.4.3 includes:
• Optimized msgbox routines – more scalable, more robust
• New Beowulf-Linux port – allows clusters to be behind firewalls yet work together
• Smart virtual machine startup – automatically determines the reason for "Can't start pvmd"
• Works with Windows 2000 – InstallShield version available, improved Win32 communication performance

New third-party PVM software: PythonPVM 0.9, JavaPVM, a PVM port using the SCI interface
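To make the msgbox feature concrete, a minimal sketch using the standard PVM 3.4 message-box calls (pvm_putinfo/pvm_recvinfo); the mailbox name and payload are made up for illustration, and error checking is omitted:

```c
/* Minimal sketch of the PVM 3.4 message-box ("msgbox") calls that the
 * 3.4.3 release optimizes.  The mailbox name and payload are made up
 * for illustration; error checking is omitted. */
#include <stdio.h>
#include <pvm3.h>

int main(void)
{
    char addr[64];
    int bufid;

    /* Publish a string under a well-known name in the virtual machine. */
    pvm_initsend(PvmDataDefault);
    pvm_pkstr("tcp://node1:5000");
    pvm_putinfo("my-service-address", pvm_getsbuf(), PvmMboxDefault);

    /* Any task in the VM can later look the entry up by name. */
    bufid = pvm_recvinfo("my-service-address", 0, PvmMboxDefault);
    pvm_setrbuf(bufid);
    pvm_upkstr(addr);
    printf("service address: %s\n", addr);

    pvm_exit();
    return 0;
}
```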

Page 4:

Ten Years of Cluster Computing
Building a Cluster Computing Environment for the 21st Century

[Timeline figure, 1989–2000: PVM-1, PVM-2, PVM-3, PVM-3.4, and Harness; MPI-1, MPI-2, and I-MPI; the shift from networks of workstations to PC clusters; and wide-area GRID experiments.]

Page 5:

TOP500 Trends – Next 10 Years
Even the largest machines are clusters

[Plot: TOP500 performance in GFlop/s (log scale, 0.1 to 1,000,000) from Jun-93 extrapolated to Nov-09 for the N=1, N=10, and N=500 entries; the trend lines put the #500 entry at 1 TFlop/s in 2005 and the peak machine at 1 PFlop/s in 2010; recent ASCI (IBM, Compaq) machines are marked.]

http://www.netlib.org/benchmark/top500.html

Page 6:

Trend in Affordable PC Clusters

PC clusters are cost effective from a hardware perspective; many universities and companies can afford 16 to 100 nodes.

System administration is an overlooked cost:
• people to maintain the cluster
• software written for each cluster
• higher failure rates for COTS hardware

Presently there is a lack of tools for managing large clusters.

www.csm.ornl.gov/torc

Page 7:

Cluster Computing Tools
C3 command-line cluster toolset

C3 is a command-line toolset for system administration and user-level operations on a single cluster. C3 functions may also be called from a program, as the sketch after the list illustrates. C3 is multithreaded, and each function executes in parallel.

Functions:
• cl_pushimage() – push a system image across the cluster (only executable by the sysadmin)
• cl_shutdown() – shut down specified nodes
• cl_push() – push files/directories across the cluster
• cl_rm() – remove files from multiple nodes
• cl_get() – gather cluster files to one location
• cl_ps() – return the results of a multi-node ps
• cl_kill() – kill an application across the entire cluster
• cl_exec() – execute any command across specified nodes

Software available: www.csm.ornl.gov/torc
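Since the slide says C3 functions can be called from a program but gives only the function names, here is a hedged sketch of such a call; the prototypes are hypothetical, assumed purely for illustration and not the real C3 API:

```c
/* Hypothetical sketch: the slide names the C3 functions but not their
 * signatures, so the prototypes below are illustrative assumptions,
 * not the real C3 API. */
#include <stdio.h>

/* Assumed prototypes: a node list plus a command or source/dest path. */
extern int cl_push(char **nodes, int nnodes, const char *src, const char *dest);
extern int cl_exec(char **nodes, int nnodes, const char *command);

int main(void)
{
    char *nodes[] = { "node1", "node2", "node3" };

    /* Stage an input file on every node, then run a job there. */
    if (cl_push(nodes, 3, "input.dat", "/tmp/input.dat") != 0)
        fprintf(stderr, "cl_push failed\n");
    if (cl_exec(nodes, 3, "/usr/local/bin/solver /tmp/input.dat") != 0)
        fprintf(stderr, "cl_exec failed\n");
    return 0;
}
```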

Page 8:

Visualization Using Clusters
Lowering the cost of high-performance graphics

• VTK ported to Linux clusters – visualization toolkit making use of the C3 cluster package
• AVS Express ported to PC clusters – expensive but a standard package asked for by applications; requires an AVS site license to eliminate the cost of individual node licenses
• Cumulvs plug-in for AVS Express – combines the interactive visualization and computational steering of Cumulvs with the visualization tools of AVS

Page 9:

Cluster Computing Tools
M3C tool suite

A suite of user-interface tools for system administration and simultaneous monitoring of multiple PC clusters. Written in Java, with web-based remote access.

Growing list of plug-in modules:
• Reserve nodes within or across clusters
• Submit jobs to the queue system
• Monitor cluster nodes (a PAPI interface is also being added)
• Install software on selected nodes
• Reboot, shut down, add users, etc. on selected cluster nodes
• Display properties of nodes

www.csm.ornl.gov/torc

Page 10:

OSCAR
National consortium for cluster software

OSCAR is a collection of the best known software for building, programming, and using clusters. The collection effort is led by a national consortium that includes IBM, SGI, Intel, ORNL, NCSA, and MCS Software; other vendors are invited.

Goals:
• Bring uniformity to cluster creation and use
• Make clusters more broadly acceptable
• Foster commercial versions of the cluster software

www.csm.ornl.gov/oscar

For more details see Stephen Scott (talk Monday 12:00, DAPSYS track).

Page 11:

ORNL M3C Tool
Architecture designed to work both within and across organizations

[Architecture diagram: a Java-applet-based M3C GUI front end connects through a GUI proxy to URL/CGI back ends at sites such as ORNL, UTK, and SDSC; each back end drives C3 scripts/monitors, custom scripts, or third-party scripts on its clusters; the front end and back ends interface through XML files.]

The M3C proxy allows one sysadmin to monitor and update multiple clusters; the M3C GUI allows a user to submit and monitor jobs.

Page 12:

GRID "Ubiquitous" Computing

The GRID Forum is helping define higher-level services:
• Information services
• Uniform naming, locating, and allocating of distributed resources
• Data management and access
• Single log-on security

MPI and PVM are often seen as lower-level capabilities that GRID frameworks support: Globus, NetSolve, Condor, Legion, Neos, SinRG.

Page 13:

Cumulvs – Collaborative Computational Steering

Cumulvs was the initial reason we started Harness.

Recent highlights:
• Release of a new version
• Development of a CAVE viewer
• Works with PNL Global Arrays
• Made CCA compliant

Common Component Architecture (CCA): a DOE effort to provide a standard for interoperability of high-performance components developed by many different groups in different languages or frameworks. (Cumulvs's CCA compliance comes through a collective port.)

http://z.ca.sandia.gov/~cca-forum/port-spec

Page 14:

HARNESS
Exploring new capabilities in heterogeneous distributed computing
www.epm.ornl.gov/harness

Goal: building on our experience and success with PVM, create a fundamentally new heterogeneous virtual machine based on three research concepts:

• Parallel plug-in environment – extend the concept of a plug-in to the parallel computing world; dynamic, with no restrictions on functions.
• Distributed peer-to-peer control – no single point of failure, unlike typical client/server models.
• Multiple distributed virtual machines that merge/split – provide a means for short-term sharing of resources and collaboration between teams.

Page 15:

HARNESS Virtual Machine
Scalable distributed control and component-based daemon

[Diagram: a virtual machine spanning Hosts A–D, each running a component-based HARNESS daemon; the daemons are customized and extended by dynamically adding plug-ins (e.g. process control, user features); operation within the VM uses distributed control; the VM can merge/split with another VM.]

Page 16:

HARNESS Latest News
Provide a practical environment and illustrate extensibility

• Harness core (beta release ready; see talk Monday 16:30, Track 1) – task library and Harness daemon software; provides an API to load and unload plug-ins, plus distributed control.
• PVM plug-in (stalled for the summer, now back on track) – provides a PVM API veneer to support existing PVM applications.
• Fault-tolerant MPI plug-in (see talk Monday 16:50, Track 1) – provides the MPI API for the 30 most-used functions; semantics adjusted to allow recovery from a corrupted communicator.
• VIA communication plug-in (looking at multi-interface transfer) – illustrates how different low-level communication plug-ins can be used within Harness, and provides high performance.

Page 17:

Parallel Plug-in Research
For a heterogeneous distributed virtual machine

One research goal is to understand and implement a dynamic parallel plug-in environment. This provides a method for many users to extend Harness in much the same way that third-party serial plug-ins extend Netscape, Photoshop, and Linux; the sketch after this list shows the serial mechanism being generalized.

Tough research problems include:
• Heterogeneity (which has delayed the 'C' H-core development)
• Synchronization
• Dynamic installation
• Interoperation – between the same plug-in on different tasks, between a task plug-in and a daemon plug-in, and between daemon plug-ins

Partial success so far.
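For reference, the serial plug-in mechanism that Harness generalizes is, on Unix systems, ordinary dynamic loading. A minimal sketch with dlopen/dlsym; the plug-in path and entry-point name are made-up assumptions for illustration:

```c
/* Minimal sketch of the serial plug-in mechanism (dynamic loading)
 * that Harness extends to the parallel setting.  The plug-in path and
 * entry-point name are illustrative assumptions. */
#include <stdio.h>
#include <dlfcn.h>

int main(void)
{
    void *handle = dlopen("./myplugin.so", RTLD_NOW);
    if (handle == NULL) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return 1;
    }

    /* Look up the plug-in's entry point and invoke it. */
    int (*plugin_init)(void) = (int (*)(void))dlsym(handle, "plugin_init");
    if (plugin_init != NULL)
        plugin_init();

    dlclose(handle);    /* "unload" - the hard part in the parallel case
                           is doing this consistently on every task */
    return 0;
}
```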

Page 18:

Fault-Tolerant MPI
Motivation

As application and machine sizes grow, the MTBF becomes less than the application run time.

The MPI standard is based on a static model, so any decrease in the number of tasks leads to a corrupted communicator (MPI_COMM_WORLD).

The goal is to develop an MPI plug-in that takes advantage of Harness robustness to offer a range of recovery alternatives to an MPI application – not just another MPI implementation.

• FT-MPI follows the syntax of the MPI standard
• Communication performance is on par with MPICH
• Presently uses PVM 3.4.3 fault recovery until Harness is ready

Page 19:

Fault-Tolerant MPI
Recovery requires MPI semantic changes

The key step in MPI recovery is creating a communicator that the application can use to continue. This is accomplished by modifying the semantics of two MPI functions:

MPI_COMM_CREATE (comm, group, newcomm)
MPI_COMM_SPLIT (comm, color, key, newcomm)

Each creates a new communicator that contains all surviving processes, and MPI_COMM_WORLD is allowed to be specified as both the input and the output communicator.
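A minimal sketch of the recovery idiom this enables, assuming FT-MPI's modified semantics (under standard MPI these calls would not shrink a communicator past failed processes); failure detection is left abstract:

```c
/* Sketch of the FT-MPI recovery idiom: rebuild a usable communicator
 * from the survivors of a corrupted world communicator.  Relies on
 * FT-MPI's modified semantics described above; failure detection
 * (e.g. via an MPI error handler) is left abstract. */
#include <mpi.h>

void rebuild_world(MPI_Comm *world)
{
    MPI_Group group;
    MPI_Comm survivors;

    MPI_Comm_group(*world, &group);
    /* Under FT-MPI the new communicator contains all surviving
     * processes; FT-MPI even allows MPI_COMM_WORLD itself to be
     * specified as both the input and the output argument. */
    MPI_Comm_create(*world, group, &survivors);
    MPI_Group_free(&group);

    *world = survivors;   /* the application continues on this comm */
}
```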

Page 20:

Symmetric Peer-to-Peer Distributed Control
Characteristics

• No single point (or set of points) of failure for Harness: it survives as long as one member still lives.
• All members know the state of the virtual machine, and their knowledge is kept consistent with respect to the order of changes of state (an important parallel programming requirement!).
• No member is more important than any other at any instant, i.e. there isn't a pass-around "control token".

Page 21:

Distributed Control
Harness two-phase arbitration

Harness kernels on each host have an arbitrary priority assigned to them (new kernels are always given the lowest priority), and the VM state is held by each kernel.

[Diagram of the virtual machine ring: a task on one host requests that a new host be added; (1) the kernel sends the host/T#/data to its neighbor in the ring; (2) each kernel adds the request to its list of pending changes; (3) …]
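A minimal sketch of a two-phase ring update in the spirit of the arbitration above; all names, types, and helper functions are illustrative assumptions, not the Harness implementation:

```c
/* Illustrative sketch (not the Harness code) of a two-phase update on
 * a ring of kernels: phase 1 circulates the proposal and each kernel
 * records it as pending; phase 2 circulates the commit and each kernel
 * applies it to its copy of the VM state. */
typedef struct { int seq; char op[32]; char host[64]; } change_t;

/* Assumed helpers, hypothetical for this sketch. */
extern void append_pending(const change_t *c);   /* remember proposal   */
extern void commit_pending(const change_t *c);   /* apply to VM state   */
extern void send_to_neighbor(const change_t *c, int phase, int origin);

void on_ring_message(const change_t *c, int phase, int my_id, int origin)
{
    if (phase == 1) {
        append_pending(c);
        if (my_id == origin)
            send_to_neighbor(c, 2, origin);  /* proposal survived the ring: commit */
        else
            send_to_neighbor(c, 1, origin);  /* keep circulating the proposal */
    } else {
        commit_pending(c);
        if (my_id != origin)
            send_to_neighbor(c, 2, origin);  /* commit stops once it returns */
    }
}
```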

Page 22:

Harness Distributed Control
Control is scalable, asynchronous, and parallel

[Diagram: the control ring supports multiple simultaneous updates (e.g. addhost), fast host adding, fast host delete or recovery from a fault, and parallel recovery from multiple host failures. Scalable design: 1 <= S <= P.]

Page 23:

For More Information
Also, a copy of these slides

Follow the links from my web site: www.csm.ornl.gov/~geist