

J. Parallel Distrib. Comput. 65 (2005) 934–948, www.elsevier.com/locate/jpdc

Performance analysis of dynamic load balancing algorithms with variable number of processors

Saeed Iqbal (a), Graham F. Carey (b,∗)

(a) Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, TX 78712-1085, USA

(b) ICES, University of Texas at Austin, Austin, Texas, USA

Received 15 January 2004; received in revised form 13 October 2004; accepted 11 April 2005. Available online 23 May 2005.

Abstract

In modern parallel adaptive mesh computations the problem size varies during simulation. In this study we investigate the comparative behavior of four load balancing algorithms when the number of processors is dynamically changed during the lifetime of a multistage parallel computation. The focus is on communication and data movement overheads, total parallel runtime and total resource consumption. We demonstrate the main ideas for the case of six adaptive mesh refinement (AMR) applications with different kinds of growth patterns. The results presented are for a 32 processor Intel cluster connected by Ethernet.

© 2005 Published by Elsevier Inc.

Keywords: Adaptive mesh refinement (AMR); Partitioning; Dynamic scheduling; Resource utilization; Cluster performance

1. Introduction

Static partitioning of parallel adaptive mesh computations will cause load imbalance among processors at runtime. In addition, the problem size and resource needs will change dynamically. On clusters and other parallel platforms, a dynamic scheduling strategy may be feasible for such adaptive mesh refinement (AMR) applications. As parallel architectures evolve and adaptive computations are more widely applied, efficient scheduling strategies that are sensitive to application class and parallel architecture will be needed.

Ideally, the architecture of a parallel system should be selected optimally for a given computation class and problem size. In current practice, the number of processors used to solve a problem is fixed for the entire lifetime of the computation. This implies a known a priori upper bound on the problem size. Another possible approach is to initially start

∗ Corresponding author. ICES, University of Texas at Austin, Austin, TX 78712-1085, USA. Fax: +1 512 232 3357.

E-mail addresses: [email protected] (S. Iqbal), [email protected] (G.F. Carey).

0743-7315/$ - see front matter © 2005 Published by Elsevier Inc. doi:10.1016/j.jpdc.2005.04.003

with a fixed number of processors determined according to an initial grid problem size, and then change the number of processors working on the computation dynamically at runtime as the grid is progressively refined or coarsened. The selection of the number of processors for a simulation may be based on problem size, processor speed, latency, memory availability and other factors.

To efficiently execute parallel computations on parallel distributed systems two key issues must be resolved. First, a number of processors should be selected appropriately according to the problem size and expected communication overhead. Second, the workload should be balanced among the selected processors proportional to their computational power [16]. Using the maximum number of available processors in parallel is generally not optimal for all problem sizes. In addition, problem size may vary during a simulation. In some classes of applications, such as the dynamically adaptive grid simulations considered later, the problem size usually changes with an unpredictable growth rate during the solution process due to AMR. In these computations, it is impossible to predict the computational load and communication requirements in advance. Hence, it is difficult to estimate the optimal number of processors until


immediately prior to execution of a simulation “stage” (assuming the simulation needs multiple stages). In some cases, using a larger number of processors can cause redundant communication and data movement. This is a wasteful use of computing resources and may even significantly increase total execution time because of latency and other overheads.

In the present work we investigate adaptive approaches for dynamically varying the parallel partitions and processor allocation as the number of processors varies during adaptive mesh simulations. We use a simple heuristic based on memory requirements to select the number of processors: if the memory requirements of a given problem size fluctuate and exceed the memory available, then we can vary the number of processors and repartition the problem accordingly; if communication facilities are very efficient (short latency), it may be advantageous to add processors to take advantage of extra processing capability and obtain the result more rapidly even if memory utilization is less efficient. On the other hand, if communication and/or data movement is extremely slow and costly, reducing the number of processors may be the optimal choice, subject of course to the memory constraint. In this study we propose a simple heuristic to guide the process of dynamically varying the number of processors and investigate the comparative behavior of four load-balancing algorithms.

Several efficient parallel dynamic load balancing algorithms exist. A key issue in such dynamic load balancing is to simultaneously minimize communication and data movement overhead. The situation with dynamically changing problem size in AMR is clearly complex since the problem size may increase or decrease and the mesh distribution in the domain changes dynamically in a manner that is not usually known in advance. The quality of partitioning (or repartitioning) during adaptive dynamic load balancing is important to the overall efficiency of a parallel simulation. The quality of partitioning determines the communication overhead. As explained later, it is estimated using the total number of edge-cuts in the associated graph representation of the domain.

In a previous study [21], jobs in a parallel processing system are modeled as a single queue of computation stages, where each stage may require a different number of processors. In [21] both a single job (model 1) and a stream of jobs (model 2) are analyzed. The study further gives analytically derived optimal design points for parallel systems while taking into account performance metrics such as speedup, response time, efficiency and power, for highly idealized models. The study also investigates effects of architectural and network parameters on a single parallel multistage job, which is similar to the model originally proposed in [15], and later refined in [11,32]. In the present study, we explicitly take into account imbalance among processors and interprocessor communication time overhead as proposed by Sevcik [32]. In addition, we account for data movement time overhead among processors during a parallel computation. We do not, however, consider the I/O overhead because

the class of problems considered in this study is typically not I/O intensive. We define “resource consumption” as the product of the number of active processors and the length of time they are occupied at a particular stage. The metric is similar to the notion of “Nodedays” used in [33].

The outline of the paper is as follows: the background concepts for this study and the six benchmark AMR computations used in this study are described in Section 2; modeling of parallel adaptive mesh computations is shown in Section 3; the problem definition and scheduling algorithm are given in Section 4; the details of the experimental methodology and results, followed by a discussion, are given in Section 5; and conclusions are presented in Section 6.

2. Background

2.1. Adaptive mesh refinement

In AMR for steady state problems, a coarse unstructured grid is first generated, usually as a uniprocessor calculation. A quality partitioning of the grid is inexpensively determined because of the small problem size. Parallel simulation is carried out on this sub-domain decomposition. The solution is post-processed to compute error indicators and the mesh is locally refined (as either a uniprocessor or parallel calculation, the latter being more complicated).

AMR schemes fall generally into three main categories [7]: (1) h refinement, which involves subdivision of cells or edges; (2) p refinement, which involves increasing the local degree (and local degrees of freedom) of a fixed set of cells; (3) hp refinement, in which both cell division and local polynomial enrichment are possible. The mesh is repartitioned if needed, using an efficient dynamic repartitioning algorithm based on the prior partition and refinement pattern.

2.2. Model computations

This study uses six model parallel computations as benchmarks. The associated meshes are improved by adaptive h-refinement and exhibit different kinds of behavior typically found in adaptive computations. The first four model computations correspond to AMR for dynamic problems with moving line, plane and point singularities. The last two cases are for “stationary” problems with AMR from an initial coarse grid and using two error tolerances. Table 1 shows the six benchmarks. The adaptive refinement strategy in this study is based on longest edge bisection of tetrahedra. The method proceeds as follows: (1) tetrahedral elements are identified for local refinement; (2) next, their longest edges are selected and then bisected; (3) the procedure progresses using a skeleton strategy where the faces of a tetrahedron define its skeleton. The two faces of the tetrahedron containing a longest edge to be bisected are subdivided and then other faces and the interior are further subdivided to complete the local refinement; (4) elements adjacent to bisected edges are


Table 1
Six model computations on an L-shaped domain (L), for moving singularity (MS), line (1D) and plane (2D) singularity, and steady state/uniform refinement (SS)

Computation  Stages  AMR behavior
L-1D          7      Moving 1D (line) singularity on a side.
L-2D          9      Moving 2D (plane) singularity.
L-MS1        44      Slow moving point singularity in the domain.
L-MS2        16      Point singularity oscillated between two points.
L-SS1        11      Uniform refinement with relaxed error tolerance.
L-SS2         6      Uniform refinement with tight error tolerance.

also refined on a continuing longest edge propagation path. This implies that the refinement propagates away from a selected element; (5) the path terminates when the bisected edge is also the longest edge of the adjacent element. Adaptive coarsening proceeds in an analogous fashion to restore elements in a previously refined grid (derefinement). Further details are provided in [27,28].
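The propagation logic in steps (4) and (5) can be sketched abstractly. The following is a minimal illustrative model of a propagation path (not the paper's tetrahedral skeleton implementation), in which `longest_edge` and `neighbor_on` are assumed callbacks supplied by the mesh data structure:

```python
def propagation_path(start, longest_edge, neighbor_on):
    """Trace a longest-edge propagation path, in simplified form.

    longest_edge(elem)    -> the longest edge of an element
    neighbor_on(elem, e)  -> element sharing edge e, or None on the boundary

    Bisecting an element's longest edge forces the neighbor across that
    edge to be refined too; the path terminates when the bisected edge is
    also the longest edge of the adjacent element, or at the boundary.
    """
    path, t = [start], start
    e = longest_edge(t)
    while True:
        n = neighbor_on(t, e)
        if n is None:                 # domain boundary: path ends
            return path
        path.append(n)
        if longest_edge(n) == e:      # termination condition of step (5)
            return path
        t, e = n, longest_edge(n)     # propagate via the neighbor's longest edge

# Hypothetical three-element chain: A's longest edge e1 is shared with B,
# B's longest edge e2 is shared with C, and e2 is also C's longest edge.
longest = {"A": "e1", "B": "e2", "C": "e2"}
nbr = {("A", "e1"): "B", ("B", "e2"): "C"}
print(propagation_path("A", longest.get, lambda t, e: nbr.get((t, e))))
```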

3. Modeling parallel adaptive mesh computations

We consider a class of parallel computations as made up of a number of stages. Each stage corresponds in our model to an AMR step for a given physics model. The AMR process involves a dynamically adapting graph. Fig. 1 shows N adaptation stages of a typical computation. For convenience, we conceptually divide each stage into four distinct phases. Each stage after the first starts with a data movement phase, where data needs to be moved among processors based on the results of the load balancing phase of the previous stage. The choice of load balancing algorithm influences the amount of data movement and communication. The second phase is the computation phase, where “useful work” is done on the data on each processor. The associated communication during the computation phase is modeled as the third phase. The last phase is load balancing. The total time spent on data movement, communication and load balancing can then be viewed as the “overhead” due to parallel processing. These phases are considered in more detail in the following subsections. Ideally, the overhead should be very low, which would mean good computational scalability with respect to the number of processors. In this study we assume a global load balancing operation is done at each stage.

Fig. 2(a) shows the pattern for a static schedule, a nine-stage computation on eight processors. The entire computation is partitioned across all eight processors at each stage. A fixed number of processors is appropriate for fixed-size problems, which can use the entire memory of the parallel machine. Fig. 2(b) shows the same nine-stage computation on eight processors, where the number of processors increases linearly at each stage as the AMR proceeds at runtime. The adaptive computation is modeled as


Fig. 1. The four phases of a stage. A dynamic parallel computation is shown with N stages of evolution. Each stage is modeled as four distinct phases: data movement, computation, communication and (repartitioning)/load balancing.


Fig. 2. A nine stage computation (0–8) is scheduled on eight processors. The phases are data movement, computation, communication and load balancing. (a) Shows a static schedule: all eight processors are active for the entire lifetime of the computation. (b) Shows a dynamic schedule: the number of processors is varied during the lifetime of the computation.

a graph, with graph nodes corresponding to elements and graph edges connecting elements that share a face or an edge. To decompose the domain of the computation among processors, we partition the graph into subgraphs. The total execution time at each stage is estimated, assuming no overlap, as follows:

T_Total = T_DM + T_CM + T_CP + T_LB,   (1)

where T_DM is the data movement time, T_CM is the communication time, T_CP is the computation time, and T_LB is the load balancing time.
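In code, Eq. (1) and the overhead notion defined above amount to the following (a trivial sketch; time units are arbitrary, function names are ours):

```python
def stage_time(t_dm, t_cm, t_cp, t_lb):
    """Eq. (1): total stage runtime, assuming the four phases do not overlap."""
    return t_dm + t_cm + t_cp + t_lb

def overhead_fraction(t_dm, t_cm, t_cp, t_lb):
    """Fraction of the stage spent on parallel-processing overhead,
    i.e. everything except the useful computation T_CP."""
    return (t_dm + t_cm + t_lb) / stage_time(t_dm, t_cm, t_cp, t_lb)
```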


Table 2
Data associated with an element

Data     Num.  Bytes  Comments
Integer    9     36   IDs etc.
Float      1      4   Element weight
Double     3     24   X, Y, Z coordinates
Node       4    192   Four nodes
Float      2      8   Values

Table 3
Data associated with a node

Data     Num.  Bytes  Comments
Integer    1      4   IDs
Double     3     24   X, Y, Z coordinates
Float      5     20   Values

3.1. Data movement phase

During the data movement phase, some elements of the AMR mesh or corresponding graph are moved among processors. For example, all structural and node-associated data of each such element (its state) has to be moved from a source processor to a destination processor. Data movement time is proportional to the complexity of the state associated with each element. In this study we measure the actual data movement time. We also measure the total and maximum number of elements moved among processors in each stage. Furthermore, in some cases, new data structures have to be built at the destination processor. The type of data associated with an element in this study is shown in Tables 2 and 3. Here, the total data moved per element is 264 bytes, and we measure the amount of data movement among processors in each stage. As expected, the mean data movement time shows strong correlation to the maximum number of elements moved [16].
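The 264-byte figure can be checked directly against Tables 2 and 3: each of an element's four nodes carries 4 + 24 + 20 = 48 bytes, so the element total is 36 + 4 + 24 + 4 × 48 + 8 = 264 bytes. A sketch (constant and function names are ours):

```python
# Per-node state from Table 3 (bytes): 1 int ID, 3 double coords, 5 float values.
NODE_BYTES = 1 * 4 + 3 * 8 + 5 * 4            # 4 + 24 + 20 = 48

# Per-element state from Table 2 (bytes): 9 int IDs, 1 float weight,
# 3 double coords, 4 nodes, 2 float values.
ELEMENT_BYTES = 9 * 4 + 1 * 4 + 3 * 8 + 4 * NODE_BYTES + 2 * 4

def data_movement_bytes(elements_moved):
    """Total payload for re-allocating `elements_moved` elements."""
    return elements_moved * ELEMENT_BYTES

print(ELEMENT_BYTES)   # 264, as stated in the text
```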

3.2. Computation phase

The second phase is the computation phase. In this phase, each processor performs some useful operation on its local data. In the case of the finite element example, this can be some form of explicit or implicit computation (e.g., constructing and solving a large linear system associated with the simulation). In this study we assume an implicit sparse linear system is solved by a parallel Krylov subspace solver with ILUT preconditioning. In Krylov subspace solvers the work per iteration depends significantly on the parallel matrix-vector product (MVP) and dot product performance. We model the time needed to solve the linear system by

T_Computation = N_Steps × N_Iterations × T_Iteration,   (2)

N_ele-mat = (nodes/element) × (dofs/node),   (3)

N_Iterations = K × sqrt(n_global-matrix),   (4)

FLOPS_Iteration = 4 × N_BNP × (N_ele-mat)^2,   (5)

T_Iteration = FLOPS_Iteration / FLOPS_effective,   (6)

where N_Steps is the number of Newton steps; N_ele-mat is the size of the element matrix, with nodes/element being 4 in our case and dofs/node (i.e. degrees of freedom per node) being 3 in our case; K models the effect of preconditioning and is taken as K = 3.5 for the application class of interest; N_BNP is the number of elements on the processor with the maximum load (i.e. the bottleneck processor) for a given iteration. FLOPS_effective indicates the FLOPS achieved by the machine and FLOPS_Iteration denotes the actual number of FLOPS per iteration. The model is validated for different sizes of linear systems and numbers of processors.
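The solver model of Eqs. (2)-(6) is straightforward to transcribe. The sketch below uses the paper's constants (4 nodes/element, 3 dofs/node, K = 3.5) as defaults, but any argument values passed to it are illustrative:

```python
import math

def solver_time(n_steps, n_global, n_bnp, flops_effective,
                nodes_per_element=4, dofs_per_node=3, K=3.5):
    """Estimate the computation-phase time from Eqs. (2)-(6).

    n_steps         -- number of Newton steps (N_Steps)
    n_global        -- order of the global matrix (n_global-matrix)
    n_bnp           -- elements on the bottleneck processor (N_BNP)
    flops_effective -- sustained FLOP rate of the machine
    """
    n_ele_mat = nodes_per_element * dofs_per_node     # Eq. (3): 12 here
    n_iterations = K * math.sqrt(n_global)            # Eq. (4)
    flops_per_iter = 4 * n_bnp * n_ele_mat ** 2       # Eq. (5)
    t_iteration = flops_per_iter / flops_effective    # Eq. (6)
    return n_steps * n_iterations * t_iteration       # Eq. (2)
```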

3.3. Communication phase

The third phase represents the total time required for communication. We have modeled the entire time spent on communication as one phase, as shown in Fig. 1. It should be noted that during the solution of the parallel iterative solver each iteration requires some computation followed by communication. The communication phase represents the total time spent on communication during these iterations. There is no overlap between the communication and data movement phases. We estimate the total communication time from the characteristics of the communication network. The dominant operation in our iterative solver is a parallel MVP. The MVP is invoked numerous times and dominates the communication requirements. For an MVP, the dominant pattern is neighbor-to-neighbor communication. Hence, the total message size between processors is estimated from the number of edges shared.

3.3.1. Communication characteristics

Fig. 3 shows the total time to send and receive messages of different lengths between two processors for two kinds of communication networks, namely Ethernet and Myrinet. Note that on Ethernet, below a threshold of 512 bytes the communication time is dominated by message startup latency. Beyond 512 bytes the total communication time grows monotonically. Myrinet is about an order of magnitude faster than Ethernet. Total message transmission time is an important characteristic of communication networks; it includes latency, software buffering and hardware effects.

3.3.2. Communication model

The total communication time per iteration is estimated from the number of message start-ups and the lengths of messages between neighbor processors. The number of message startups is equal to the number of subgraphs connected to a particular subgraph in the partitioning. The length of the



Fig. 3. Communication time between two processors for different message sizes on Ethernet and Myrinet. On average Myrinet is effectively seven times faster than Ethernet. For short messages (less than 1 K) the communication time is independent of message size. For longer message sizes communication time increases monotonically (almost linearly) with message length.

messages is proportional to the number of edge cuts of a particular subgraph. The total communication time (ms) is estimated from the characteristics given in Fig. 3. The communication time per iteration can be modeled in the form [4]:

T_iteration = α(Neighbors_max) + β(Message),   (7)

where Message is the size of a message between two processors, Neighbors_max is the maximum number of neighbor processors, and α, β are machine-specific parameters.
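A direct transcription of Eq. (7); in practice the α and β values would have to be fitted from measurements such as those in Fig. 3, and none are given here, so any values passed in are purely illustrative:

```python
def comm_time_per_iteration(neighbors_max, message_bytes, alpha, beta):
    """Eq. (7): per-iteration communication time for the parallel MVP.

    alpha -- per-message startup (latency) cost, machine specific
    beta  -- per-byte transmission cost, machine specific
    """
    return alpha * neighbors_max + beta * message_bytes
```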

In our model, the total number of edge cuts (Total EC in the tables) is used as a metric to estimate communication requirements. For uniform degree graphs (graphs with vertices having approximately the same number of edges), there is a correlation between edge-cuts and inter-processor communication costs [31]. This edge-cut metric has proved a useful guide for scaling to estimate communication time. However, it does not address the effects of message bundling and message length (see also [14]). As in Oliker et al. [26], Total V and Max V are introduced as measures of data movement. More specifically, we measure the total number of finite elements re-allocated during a data movement phase.

3.4. Load balancing phase

The fourth phase is the actual time needed to run the load balancing algorithm. We use the actual measured partitioning time for the multilevel scratch-remap (MSR) [20] partitioning algorithm. In a previous study [17] we compared four different dynamic load balancing algorithms when the number of processors is dynamically varied: recursive inertial bisection (RIB) [34], space filling curves (SFC) [29], MSR [20] and multilevel diffusive (MDIF) [20], in an architecture-independent manner. MSR showed the best performance (minimized edge-cuts and data movement among processors). Here we study the effect of varying the number of processors on the parallel runtime of computations.

4. Problem definition and algorithm

This study compares two approaches to scheduling the parallel applications. First, a simple way to schedule is to use all available processors throughout the lifetime of a simulation. This is the “static schedule”. Second, consider starting the simulation from a single processor. We refer to this as a “dynamic schedule”: as the size of the problem grows, a single processor is not able to handle the problem, so more processors are added. There are a number of reasons that a single processor may not be able to handle the problem. The most obvious is memory limitation. If the size of a problem (number of elements) grows beyond the combined storage capacity of the processors currently in use, then more processors are needed. Another reason to add more processors is to share the computational load: if the problem is highly parallelizable and the results are needed in a short time, then it is advantageous to share the computational load among more processors to reduce execution time. However, the overhead due to parallel processing is (usually) an increasing function of the number of processors. Therefore, each instance of the problem size will have an optimal number of processors ranging from 1 through the maximum available. The key idea behind changing the number of processors is that using the optimal number of processors should improve resource consumption. The minimum number of processors during each stage depends on the number of finite elements and processor memory. Such a schedule, which uses the minimum possible number of processors, is called a memory-limited dynamic schedule.

4.1. Scheduling algorithm

Assume we are given a parallel distributed system with P processors, each having available memory M. The parallel computation is initialized on a single processor. A coarse unstructured grid is first generated, usually on a single node of the multiprocessor. The maximum number of elements that can be processed on a processor is calculated from the state associated with an element and the available memory of the processor. After the first stage, the unstructured mesh goes through an AMR step based on the error indicators. The elements are refined using the longest edge bisection algorithm. If the total estimated memory required by the elements after refinement is more than the total memory of the currently assigned processors, then more processors are added. The domain is repartitioned among the enlarged set of processors and all relevant elements are moved to the new processor based on


the partition. Similarly, in coarsening, as the memory need decreases the number of processors may be reduced. This reduction in the number of processors will move data among them. The total edge cut and total data movement are recorded at each step. The above procedure (refinement/coarsening through data movement) is repeated for each stage of the computation. After the final stage, the total edge cuts and total data movement for all the stages are summed to give the final edge-cut count and data movement during all the stages of the computation. The algorithm is given below.

Algorithm
A memory-limited dynamic schedule is used to select the number of processors at each stage:

1. Initialize the computation for Stage = 1.
2. For each Stage from 1 to Total Stages:

(a) Estimate the number of processors needed for the current Stage (e.g., if the anticipated memory requirement of the computation is more than the total memory of the currently allocated processors, add more processors).
(b) The mesh is adaptively refined or coarsened. (This will change the problem size and create a load imbalance among processors.)
(c) The load is balanced among all processors using the parallel MSR algorithm (or any of the four load balancing algorithms). (All associated data movement required for the load balance is done during the load balancing.)
(d) The data movement time and partitioning time are measured.
(e) Compute edge-cuts among processors and the number of neighbors of each processor.
(f) The computation time is estimated using the solver model.
(g) The communication time is estimated from the total message lengths and neighbor processors in each phase.
(h) The total runtime of a stage is the sum over the four phases (data movement, computation, communication and load balancing).
(i) The total resource consumption of a stage is estimated as #processors × runtime.
(j) The mesh elements that need to be refined or coarsened in the next stage are identified and an estimate of the problem size at the next stage is made.
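Steps 2(a) and (i) can be sketched as follows; the capacity is expressed in elements per processor, as in the example of Section 5.1, and the function names are ours:

```python
import math

def memory_limited_schedule(stage_elements, capacity, p_max):
    """Step 2(a): at each stage use the fewest processors whose combined
    capacity (in elements) holds the current mesh, capped at p_max."""
    return [min(max(math.ceil(n / capacity), 1), p_max) for n in stage_elements]

def resource_consumption(procs, runtimes):
    """Step (i), summed over stages: #processors x stage runtime."""
    return sum(p * t for p, t in zip(procs, runtimes))
```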

5. Experimental evaluation

All experiments in this study are done on a 32 processor cluster in the CFDLAB at UT Austin. The cluster contains 16 nodes, each node being a dual-processor 550 MHz Pentium III with 1 GB RAM. The nodes are connected by a 3COM SuperStack 24-port fast Ethernet switch. The nodes can be connected by Ethernet or Myrinet by linking with the appropriate library. The MPI (MPICH) [12] library is used for all communication. Each node runs RedHat Linux v7.3. We use the ZOLTAN [6,10] partitioning, load balancing and data movement library.

In this study we investigate the behavior of four common dynamic load-balancing algorithms when the number of processors is changed during the lifetime of a multistage parallel computation. The ideas are demonstrated for a class of problems involving dynamically adapting unstructured grids for simulations on commodity-cluster parallel distributed systems. The number of processors varies with problem size according to the memory requirements. The effect of varying the number of processors on communication and data movement overheads is recorded.

5.1. Example

Consider L-1D, the first test case in Table 1, an AMR computation with seven stages. Assume we have a maximum of 8 processors, and each processor can store data associated with a maximum of 320 elements.
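Under these assumptions the schedule follows from simple ceiling division over the stage sizes, which reproduces column 3 of Table 4:

```python
import math

# Stage sizes of L-1D (Table 4, column 2) and the assumed per-processor
# capacity of 320 elements from the example.
elements = [128, 1680, 1905, 2547, 1088, 1439, 2171]
CAPACITY = 320

procs = [math.ceil(n / CAPACITY) for n in elements]
print(procs)   # [1, 6, 6, 8, 4, 5, 7], matching column 3 of Table 4
```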

The number of processors used in the resulting memory-limited dynamic schedule of L-1D is shown in column 3 of Table 4. For example, stage 1 has 128 elements that can be easily processed on a single processor; stage 2 has 1680 elements and requires at least six processors. Tables 5 and 6 show the total edge cuts and total data movement incurred during the seven stages according to the above-mentioned schedule. For comparison we have also provided the static schedule data in Table 6 (first row), where the total number of processors is fixed (eight in this case) throughout all the stages. These results are for SFC repartitioning. Table 6 also shows results when the load balancing algorithm is changed to RIB, MSR and MDIF. As shown in Table 6 there is an improvement (reduction of total edge-cuts and data movement) in parallel overhead in all four algorithms using dynamic schedules compared to static schedules. MSR, however, generates high quality partitions and has the least overhead.

In Table 5, the maximum number of processors is changed from 8 to 16 (32), and we assume each processor can store a maximum of 160 (80) elements. In each case, the updated memory-limited schedules are generated using the method described above. The total number of edge-cuts and the data movement decrease substantially for the dynamic schedule on 16 and 32 processors.
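
The memory-limited schedule above can be reproduced with a few lines of code: each stage simply uses the smallest number of processors whose combined capacity holds that stage's elements. The sketch below (function and variable names are ours) regenerates the Table 4 schedule from the stage sizes.

```python
import math

def memory_limited_schedule(stage_elements, capacity, max_procs):
    """Fewest processors per stage whose combined memory holds the
    stage's elements (a sketch of the memory-limited dynamic schedule)."""
    schedule = []
    for n in stage_elements:
        procs = math.ceil(n / capacity)
        if procs > max_procs:
            raise ValueError(f"stage with {n} elements exceeds total memory")
        schedule.append(procs)
    return schedule

# L-1D stage sizes from Table 4, 320 elements per processor, 8 processors:
stages = [128, 1680, 1905, 2547, 1088, 1439, 2171]
print(memory_limited_schedule(stages, 320, 8))  # [1, 6, 6, 8, 4, 5, 7]
```

Halving the per-processor capacity to 160 (80) elements yields the corresponding 16 (32) processor schedules used for Table 5.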

5.2. Results

In this section the dynamic schedule described above is evaluated on the six benchmark computations described in Table 1. First we show changes in total edge-cuts and data movement for these computations. Later we


Table 4
Number of processors (#Proc.) at each stage according to the dynamic schedule when a maximum of eight processors are available

Stage   Elements   #Proc.
1        128       1
2       1680       6
3       1905       6
4       2547       8
5       1088       4
6       1439       5
7       2171       7

Table 5
Total edge cuts and data movement with variable processors using SFC

Proc.   Schedule   Total EC   Total DM
16      Static     4350       19 168
        Dynamic    3649       16 684
32      Static     5856       20 220
        Dynamic    4976       18 522

Table 6
Total edge cuts and data movement with a maximum of eight processors using different algorithms

Algorithm   Schedule   Total EC   Total DM
SFC         Static     3102       16 928
            Dynamic    2495       13 604
RIB         Static     3053       15 724
            Dynamic    2363       17 210
MSR         Static     1608       15 552
            Dynamic    1319       14 864
MDIF        Static     2577       14 488
            Dynamic    2189       14 502

show how the dynamic schedule affects total runtime and resource consumption.

5.2.1. Effect on total edge cuts and data movement

Figs. 4–9 show how the total edge cuts and total data movement incurred during these computations compare to the static schedule. In each figure, chart (a) shows the relative total edge cuts and chart (b) shows the relative data movement. In each case the static schedule is the baseline, so a value less than 1.0 indicates an improvement. Each chart shows results for the four load balancing algorithms, illustrating the sensitivity of each algorithm to varying the number of processors at runtime. To study the scalability of the dynamic schedule, results are shown for 8, 16 and 32 processors.
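
Reading these charts amounts to normalizing each dynamic-schedule total by its static baseline. As a small illustration (using the eight-processor edge-cut totals from Table 6):

```python
# (static, dynamic) total edge cuts for eight processors, from Table 6.
edge_cuts = {"SFC": (3102, 2495), "RIB": (3053, 2363),
             "MSR": (1608, 1319), "MDIF": (2577, 2189)}

# Values below 1.0 mean the dynamic schedule reduced the metric.
relative = {alg: round(dyn / stat, 2) for alg, (stat, dyn) in edge_cuts.items()}
print(relative)  # {'SFC': 0.8, 'RIB': 0.77, 'MSR': 0.82, 'MDIF': 0.85}
```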

The results shown in Figs. 4 and 5 compare the static and dynamic schedules for the model computations with moving point singularities (L-MS1 and L-MS2). In general, for the

[Figure: relative total edge cuts (a) and relative total data movement (b) versus algorithm (RIB, SFC, MSR, MDIF), with bars for 8, 16 and 32 processors.]

Fig. 4. Partitioning and data-movement data for L-MS1: (a) relative total edge cuts; (b) relative total data movement.

two model problems with moving point singularities, the dynamic schedule can reduce both the total edge cuts and total data movement compared to the static schedule. The first problem has a singularity oscillating between two points in the domain, and the reduction in total edge cuts is more than 20%. The reduction in data movement is 2–17%, depending on the number of processors and the load balancing algorithm. The results shown in Figs. 6 and 7 compare the static and dynamic schedules for model computations with 1D and 2D singularities (L-1D and L-2D). In these two cases, there is a decrease of 12–40% in edge-cuts. The corresponding changes in total data movement were variable, ranging from a 20% decrease to a 186% increase. The results shown in Figs. 8 and 9 compare the static and dynamic schedules for model computations with uniform refinement under different tolerances (L-SS1 and L-SS2). In all cases, there is a decrease in total edge-cuts of 12–44% and a substantial increase in data movement.


[Figure: relative total edge cuts (a) and relative total data movement (b) versus algorithm (RIB, SFC, MSR, MDIF), with bars for 8, 16 and 32 processors.]

Fig. 5. Partitioning and data-movement data for L-MS2: (a) relative total edge cuts; (b) relative total data movement.

5.2.2. Effects on total runtime and resource consumption

In general, as Figs. 10–15 show, the dynamic schedule consistently improves resource consumption. The ratio improves with an increasing number of processors: there is about 30–40% less resource consumption for 16 and 32 processors. This reduction in resource consumption comes, of course, at the cost of longer run-times.

To gain further insight, we can divide the six computations into three groups based on how fast the problem size changes at each stage; the problem size is proportional to the total number of elements. The first group consists of two computations (L-SS1 and L-SS2) in which the number of elements increases exponentially at each stage. The runtime and resource consumption results for this group are shown in Figs. 10 and 11. The figures show a consistent pattern of less resource consumption at comparable or even faster run-times.

The second group contains three computations (L-1D, L-2D and L-MS1) with rapid and high amplitude variation in

[Figure: relative total edge cuts (a) and relative total data movement (b) versus algorithm (RIB, SFC, MSR, MDIF), with bars for 8, 16 and 32 processors.]

Fig. 6. Partitioning and data-movement data for L-1D: (a) relative total edge cuts; (b) relative total data movement.

the number of elements at each stage. The runtime and resource consumption results for the second group are shown in Figs. 12–14. The figures show reduced resource consumption in each case. However, there is no clear pattern in the run-times. (The L-1D computation is communication intensive.)

The third group contains a single computation (L-MS2) with high frequency, low amplitude variation in the number of elements. For this case there is no saving in resource consumption, and the run-times tend to increase with the number of processors, as shown in Fig. 15, due to increased communication overhead.

5.3. Analysis and discussion

Table 7 shows the average decrease in total edge-cuts and increase in data movement due to the dynamic schedule relative to the static schedule. The multilevel algorithms


[Figure: relative total edge cuts (a) and relative total data movement (b) versus algorithm (RIB, SFC, MSR, MDIF), with bars for 8, 16 and 32 processors.]

Fig. 7. Partitioning and data-movement data for L-2D: (a) relative total edge cuts; (b) relative total data movement.

(MSR and MDIF) reduce total edge-cuts by about 23%. Both multilevel algorithms usually produce very good quality partitions, so a further 23% decrease is substantial. The reduction for the geometric repartitioning algorithms is about 20–25%. The maximum reduction in total edge-cuts is recorded for RIB (25%), but this does not translate into the best partitions, because geometric partitions have relatively poor quality. The best partitions are generated by the MSR algorithm, and the further 23% reduction in edge-cuts due to the dynamic schedule improves its performance even more.

SFC reduces total edge-cuts by about 20% and causes 30% more data movement. The 20% reduction is usually on a poor quality partition, and data movement is excessive because SFC does not consider the previous assignment of elements while generating new partitions. Therefore there is no clear advantage in using SFC with dynamic scheduling if data movement is expensive on a particular architecture.

Among the multilevel algorithms, MSR causes less data movement (24%, compared to 40% for MDIF). The rea-

[Figure: relative total edge cuts (a) and relative total data movement (b) versus algorithm (RIB, SFC, MSR, MDIF), with bars for 8, 16 and 32 processors.]

Fig. 8. Partitioning and data-movement data for L-SS1: (a) relative total edge cuts; (b) relative total data movement.

son for this difference in data movement is that MDIF takes the previous assignment of elements into account when calculating new partitions. In an effort to minimize movement of elements to distant processors, MDIF tries to move elements to neighbors, which move elements to their neighbors, and so on. This pattern creates waves of element redistribution among processors: in essence, elements have to "diffuse" through the processors, and the disturbance propagates through more processors before elements arrive at their final destinations. Diffusion is slower when the number of processors is changed between successive load balancing stages. MSR, unlike MDIF, does not take any previous assignment of elements into account; elements are moved in one direct step to their new processors, and as a result there is less data movement.
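
The diffusive behavior described above can be illustrated with a toy model (this is not ZOLTAN's MDIF implementation; the update rule and coefficient are illustrative). Load moves only between neighboring processors, so a concentration of excess load takes many sweeps to spread across a chain:

```python
def diffusion_step(loads, alpha=0.5):
    """One synchronous sweep on a 1-D processor chain: each pair of
    neighbors exchanges a fraction of its load difference. Load is
    conserved; imbalance decays gradually, in 'waves'."""
    new = loads[:]
    for i in range(len(loads) - 1):
        flow = alpha * (loads[i] - loads[i + 1]) / 2  # computed from old values
        new[i] -= flow
        new[i + 1] += flow
    return new

# All excess load starts on processor 0; it must diffuse hop by hop.
loads = [800.0, 0.0, 0.0, 0.0]
for _ in range(20):
    loads = diffusion_step(loads)
print([round(x) for x in loads])  # approaches the balanced [200, 200, 200, 200]
```

A scratch-remap approach such as MSR instead computes the new partition directly and moves each element once, which is why it incurs less data movement when the processor count changes.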

Tables 8 and 9 show the reduction in total edge-cuts and increase in data movement due to the dynamic schedule compared to the static schedule. For point singularity type computations there is a decrease in total edge-cuts and total


[Figure: relative total edge cuts (a) and relative total data movement (b) versus algorithm (RIB, SFC, MSR, MDIF), with bars for 8, 16 and 32 processors.]

Fig. 9. Partitioning and data-movement data for L-SS2: (a) relative total edge cuts; (b) relative total data movement.

data movement under the dynamic schedule. For line singularity type computations the decrease in total edge-cuts is about 15–20%. However, the percentage reduction in total edge-cuts is consistently above 30% for the plane singularity and uniform refinement cases.

The relative advantage of using dynamic scheduling depends on the communication cost of the parallel architecture. Note that, during the entire lifetime of a computation, the data movement overhead is incurred once per load balancing step, while communication between processors can occur many times between any two load balancing steps; the communication overhead usually occurs once per iteration of the solver. Data migration cost is extremely dependent on the architecture, communication facilities and application-specific data structures [10]. Communication overhead on a parallel computer can be proportional to the total edge-cuts or to the maximum number of edge-cuts. MSR produces high-quality partitions and minimizes data movement. In addition, due to the dynamic schedule it shows a

[Figure: total parallel runtime in seconds (a) and total resource consumption (b) versus maximum available processors (8, 16, 32), for static and dynamic schedules.]

Fig. 10. Comparison of runtime and resource consumption for various numbers of maximum available processors for L-SS1: (a) runtime; (b) resource consumption.

further decrease in edge-cuts (23%) with a modest increase in data movement (24%). Hence it is likely to produce the best partitions on most architectures.
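
The relative weight of the two overheads can be summarized in a simple cost model (purely illustrative; the coefficients below are hypothetical): migration cost is paid once per load-balancing step, while edge-cut communication is paid on every solver iteration in between, so reducing edge-cuts usually matters more than reducing migration.

```python
def parallel_overhead(lb_steps, iters_per_stage, migration_cost, comm_per_iter):
    """Overhead model from the text: data movement once per load-balancing
    step, communication once per solver iteration between steps."""
    return lb_steps * migration_cost + lb_steps * iters_per_stage * comm_per_iter

# Seven stages, 100 solver iterations per stage: even with a per-step
# migration cost 50x the per-iteration communication cost, communication
# accounts for two-thirds of the total overhead.
print(parallel_overhead(7, 100, migration_cost=50.0, comm_per_iter=1.0))  # 1050.0
```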

6. Related work

The problem of assigning multiple communicating modules of a parallel program onto the processing nodes of a distributed memory multiprocessor has been extensively studied in the literature. Bokhari [5] has shown that the associated mapping problem is NP-hard. Given this intractability, most load balancing approaches proposed in the literature are based on heuristics.

In modern adaptive computations the problem size can change dynamically with time. In such computations dynamic load balancing is critical to achieve scalability and efficient resource utilization. The extremely rich de-


[Figure: total parallel runtime in seconds (a) and total resource consumption (b) versus maximum available processors (8, 16, 32), for static and dynamic schedules.]

Fig. 11. Comparison of runtime and resource consumption for various numbers of maximum available processors for L-SS2: (a) runtime; (b) resource consumption.

sign space of dynamic load balancing allows designing a variety of mapping heuristics. Recently, several researchers have studied important aspects of these heuristics [2,9,30,35,22,13,8,3,18,1].

A major class of dynamic load balancing strategies is application specific [23,25]; such partitioning and dynamic load-balancing strategies fit more closely to application-specific needs. For example, Zaki et al. [36] propose a customized, dynamic load-balancing strategy for data-mining applications and examine the behavior of global versus local and centralized versus distributed load-balancing strategies. They show that different schemes are preferable for different applications under varying program and system parameters.

On the other hand, substantial effort has been invested in developing general purpose dynamic load balancing frameworks that are applicable to a wide range of applications. The idea is to abstract load balancing details from the user by providing simple-to-use interfaces and library functions.

[Figure: total parallel runtime in seconds (a) and total resource consumption (b) versus maximum available processors (8, 16, 32), for static and dynamic schedules.]

Fig. 12. Comparison of runtime and resource consumption for various numbers of maximum available processors for L-2D: (a) runtime; (b) resource consumption.

Typically, these libraries contain efficient implementations of various parallel partitioning algorithms and data movement functions. Early attempts to develop such frameworks were based on the observation that, in a network, personal workstations typically have low utilization and might be dedicated to parallel computations during these idle periods. An early successful project of this type is Condor [24]. Several such distributed computing frameworks allow changing the number of compute nodes dynamically according to the availability of compute nodes, e.g. [37]. However, when parallel computations are executed on these frameworks the number of compute nodes is fixed during the lifetime of the program. These frameworks can be extended, or new middleware developed [19], to allow changing the number of compute nodes during the lifetime of parallel computations. As shown in the present study, such a feature could enable more efficient utilization.

As distributed computing environments become more heterogeneous, the problem of assigning tasks to processors be-


[Figure: total parallel runtime in seconds (a) and total resource consumption (b) versus maximum available processors (8, 16, 32), for static and dynamic schedules.]

Fig. 13. Comparison of runtime and resource consumption for various numbers of maximum available processors for L-1D: (a) runtime; (b) resource consumption.

comes more complex due to the differing processing power of compute nodes. An algorithm for assigning tasks to processors in heterogeneous environments is proposed in [18]; it uses artificial intelligence techniques to reduce the search space and can find sub-optimal assignments for medium size problems. Combining that algorithm with the ability to dynamically vary the number of compute nodes, as suggested here, can be useful in a heterogeneous environment because it allows appropriate selection of compute nodes. In addition, using a minimal number of processors can substantially reduce the search space.

In general, load balancing strategies can also be divided into centralized and distributed. The algorithm proposed here does not fall clearly into either category, as it is distributed within a subset of the total available compute nodes. From this point of view it can be classified as centralized, distributed or semi-distributed. A semi-distributed algorithm is proposed in [1]; it uses two-level hierarchical control by partitioning the interconnect into in-

[Figure: total parallel runtime in seconds (a) and total resource consumption (b) versus maximum available processors (8, 16, 32), for static and dynamic schedules.]

Fig. 14. Comparison of runtime and resource consumption for various numbers of maximum available processors for L-MS1: (a) runtime; (b) resource consumption.

dependent symmetrical regions. The semi-distributed algorithm shows an improvement in response time and resource utilization.

7. Conclusions and remarks

The dynamic scheduling strategy of varying the number of processors during the lifetime of a parallel multistage computation can significantly affect the total parallel runtime and total resource consumption. In general, on the test architecture (Ethernet and Pentium III), the total parallel runtime is increased under the dynamic schedule compared to the static schedule. This increase is due to the tendency of the dynamic schedule to use the minimum number of processors needed, in contrast to the static schedule, which uses all available processors. However, the sensitivity analysis shows that as processor speed increases, i.e. as processors are able to provide higher sustained FLOPS, the dynamic sched-


[Figure: total parallel runtime in seconds (a) and total resource consumption (b) versus maximum available processors (8, 16, 32), for static and dynamic schedules.]

Fig. 15. Comparison of runtime and resource consumption for various numbers of maximum available processors for L-MS2: (a) runtime; (b) resource consumption.

Table 7
The average reduction in total edge-cuts (EC) and increase in data movement (DM) for four algorithms across all six domains

Algorithm   EC%     DM%
SFC         −20.0   30.0
RIB         −25.2   6.0
MSR         −23.0   24.0
MDIF        −23.0   40.0

Note that these reductions show the additional decrease in edge-cuts and increase in data movement due to the dynamic schedule.

ule can consistently yield a lower parallel runtime [16]. This improved performance arises because the dynamic schedule involves less communication overhead (fewer total messages and fewer neighbors). In the extreme case of infinitely fast processors, the parallel runtime will obviously be limited by the parallel overhead only. Conversely, increasing the network speed diminishes the relative advantage of the dynamic schedule because the percentage of time spent in

Table 8
Percentage reduction in total edge-cuts due to the dynamic schedule for various algorithms

Singularity   SFC   RIB   MSR   MDIF
Point         20    15    15    18
Line          18    21    17    17
Plane         33    33    32    35
Uniform       33    38    32    30

Table 9
Percentage increase in data movement due to the dynamic schedule for various algorithms

Singularity   SFC   RIB   MSR   MDIF
Point         −15   −5    −5    −10
Line          −18   5     5     −4
Plane         130   120   115   117
Uniform       250   105   150   225

data movement and communication decreases. Assuming infinitely fast networks, the static schedule will always outperform the dynamic schedule, because the dynamic schedule uses fewer processors and hence each processor carries more computational load. However, this use of the resource may still be preferred. In summary, considering total parallel runtime only, the dynamic schedule is attractive if, in future parallel architectures, processor speed increases at a faster rate than network speed. The resource consumption of the dynamic schedule is consistently better than that of the static schedule.
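
The resource-consumption advantage follows directly from the metric: processor-seconds summed over stages. A sketch (the per-stage timings are hypothetical, and for simplicity the same timings are used for both schedules, which is an optimistic simplification for the dynamic case):

```python
def resource_consumption(procs_per_stage, stage_times):
    """Processor-seconds summed over stages."""
    return sum(p * t for p, t in zip(procs_per_stage, stage_times))

stage_times = [1.0, 6.0, 7.0, 9.0, 4.0, 5.0, 8.0]  # hypothetical runtimes
dynamic = [1, 6, 6, 8, 4, 5, 7]                    # Table 4 schedule
static = [8] * 7                                   # all eight processors, every stage

print(resource_consumption(static, stage_times))   # 320.0
print(resource_consumption(dynamic, stage_times))  # 248.0
```

The dynamic schedule consumes fewer processor-seconds whenever some stage needs fewer than the maximum number of processors, even if its wall-clock runtime is somewhat longer.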

An increase in any factor that raises the percentage of time spent in the computation phase will diminish the relative advantage of the dynamic schedule. The amount of data associated with an element affects the percentage of time spent in computation. Likewise, a large linear system takes longer to solve, because the parallel solution time for a linear system typically grows at least as O(n^1.5), where n is the size of the linear system. The relative advantage of the dynamic schedule increases for computations whose elements carry less data (and hence require less computation per element). The preconditioning in the schemes considered here does not influence the relative advantage of dynamic scheduling, because it affects the computation and the parallel overhead equally. Preconditioning does, however, affect the absolute parallel runtime by decreasing the total number of iterations in an iterative solver. In this study we assume that the memory requirement of building the preconditioner is not a factor.

Results show that the dynamic scheduling strategy of varying the number of processors during the computation's lifetime decreases the total communication requirement by 20–25% compared to the static scheduling strategy. The dynamic strategy does, however, cause more data movement and an increase in the load per processor. The increase in data movement depends on the load balancing algorithm. The


multilevel scratch-remap (MSR) algorithm gives the best performance among the four algorithms compared in this study. The relative advantage of the dynamic approach is greatest in computations where the growth rate is very high and the size of the computation grows monotonically.

Varying the number of processors at runtime can also benefit low-power, high-performance parallel system designs. The improvement in runtime and resource consumption becomes more attractive when there is high overhead associated with the communication network. Varying the number of processors seems most appropriate for parallel applications where the parallelization overhead is high. High overhead can arise for a number of reasons, such as a relatively slow network (e.g., Internet or grid computing) or high I/O requirements (e.g., computational finance/economics, weather prediction).

Acknowledgments

We thank the anonymous reviewers whose comments have improved this article. We express our appreciation to Margarida Jacome for helpful suggestions. This work has been supported in part by DoD HPCMP PET Contract No. N62306-01-D-7110, NSF Contract No. ACI-0081791 and Sandia National Laboratories. The authors would like to express their gratitude to M.A. Padron, J.P. Suarez and A. Plaza for their help in related mesh generation and adaptation studies.

References

[1] I. Ahmad, A. Ghafoor, Semi-distributed load balancing for massively parallel multicomputer systems, IEEE Trans. Software Engrg. 17 (10) (1991) 987–1004.

[2] K. Antonis, J. Garofalakis, I. Mourtos, P. Spirakis, A hierarchical adaptive distributed algorithm for load balancing, J. Parallel Distrib. Comput. 64 (1) (2004) 151–162.

[3] A. Barak, A. Shiloh, A distributed load-balancing policy for a multicomputer, Software Practice Exper. 15 (9) (1985) 901–913.

[4] E. Barragy, G.F. Carey, R. van de Geijn, Performance and scalability of finite element analysis for distributed parallel computation, J. Parallel Distrib. Comput. 21 (1994) 202–212.

[5] S.H. Bokhari, On the mapping problem, IEEE Trans. Comput. C-30 (3) (1981) 207–214.

[6] E. Boman, K. Devine, B. Hendrickson, M.S. John, C. Vaughan, ZOLTAN: a dynamic load balancing library for parallel computational applications, Technical Report SAND99-1377, Sandia National Laboratories, Albuquerque, NM, 1999.

[7] G.F. Carey, Computational Grids: Generation, Adaption, and Solution Strategies, Taylor and Francis, London, 1997.

[8] T.L. Casavant, J.G. Kuhl, Analysis of three dynamic load balancing strategies with varying global information requirements, in: Proceedings of the 7th International Conference on Distributed Computing Systems, 1987, pp. 185–192.

[9] S.P. Dandamudi, Sensitivity evaluation of dynamic load sharing in distributed systems, IEEE Concurrency 6 (3) (1998) 62–72.

[10] K. Devine, B. Hendrickson, E. Boman, M.S. John, C. Vaughan, Design of dynamic load-balancing tools for parallel applications, in: Proceedings of the 14th International Conference on Supercomputing, ACM Press, Santa Fe, NM, 2000, pp. 110–118.

[11] E. Gelenbe, Multiprocessor Performance, Wiley, NY, 1989.

[12] W. Gropp, E. Lusk, N. Doss, A. Skjellum, A high-performance, portable implementation of the MPI message passing interface standard, Parallel Comput. 22 (6) (1996) 789–828.

[13] A. Ha'c, T.J. Johnson, Sensitivity study of load balancing algorithm in a distributed system, J. Parallel Distrib. Comput. (1989) 85–89.

[14] B. Hendrickson, Graph partitioning and parallel solvers: has the emperor no clothes?, in: Irregular'98, Springer, Berlin, 1998, pp. 218–225.

[15] J. Huang, On the behavior of algorithms in a multiprocessing environment, Ph.D. Thesis, Department of Computer Science, UCLA, 1988.

[16] S. Iqbal, Load balancing strategies for parallel architectures, Ph.D. Thesis, University of Texas at Austin, May 2003.

[17] S. Iqbal, G.F. Carey, M.A. Padron, J.P. Suarez, A. Plaza, Load balancing with variable number of processors on commodity clusters, in: Proceedings of HPC '02, San Diego, 2002, pp. 135–140.

[18] M. Kafeel, I. Ahmad, Optimal task assignment in heterogeneous distributed computing systems, IEEE Concurrency 6 (3) (1998) 42–51.

[19] L.V. Kalé, S. Kumar, J. DeSouza, M. Potnuru, S. Bandhakavi, Faucets: efficient resource allocation in the computational grid, in: Proceedings of the 2004 International Conference on Parallel Processing, 2004.

[20] G. Karypis, V. Kumar, ParMETIS: parallel graph partitioning and sparse matrix ordering library, Technical Report #97-060, Department of Computer Science, University of Minnesota, 1997.

[21] L. Kleinrock, J.-H. Huang, On parallel processing systems: Amdahl's law generalized and some results on optimal design, IEEE Trans. Software Engrg. 18 (5) (1992) 434–447.

[22] Z. Lan, V.E. Taylor, G. Bryan, A novel dynamic load balancing scheme for parallel systems, J. Parallel Distrib. Comput. 62 (12) (2002) 1763–1781.

[23] A. Laszloffy, J. Long, A.K. Patra, Simple data management, scheduling and solution strategies for managing the irregularities in parallel adaptive hp finite element simulations, Parallel Comput. 26 (13–14) (2000) 1765–1788.

[24] M. Litzkow, M. Livny, M.W. Mutka, Condor: a hunter of idle workstations, in: Proceedings of the 8th IEEE International Conference on Distributed Computing Systems, IEEE, New York, 1988, pp. 104–111.

[25] M.G. Norman, P. Thanisch, Models of machines and computation for mapping in multicomputers, ACM Comput. Surv. 25 (3) (1993) 263–302.

[26] L. Oliker, R. Biswas, H.N. Gabow, Parallel tetrahedral mesh adaptation with dynamic load balancing, Parallel Comput. 26 (2000) 1583–1608.

[27] A. Plaza, G.F. Carey, Local refinement of simplicial grids based on the skeleton, Appl. Numer. Math. 32 (2000) 195–218.

[28] A. Plaza, M.A. Padron, G.F. Carey, A 3D refinement/de-refinement algorithm for solving evolution problems, Appl. Numer. Math. 32 (2000) 401–418.

[29] H. Sagan, Space-Filling Curves, Springer, Berlin, 1994.

[30] L. Sanglu, X. Li, A scalable load balancing system for NOWs, ACM Oper. Systems Rev. 32 (3) (1998) 55–63.

[31] K. Schloegel, G. Karypis, V. Kumar, Graph partitioning for high performance scientific simulations, in: CRPC Parallel Computing Handbook, Morgan Kaufmann, Los Altos, CA, 2000.

[32] K.C. Sevcik, Characterization of parallelism in applications and their use in scheduling, Perform. Eval. Rev. 17 (1) (1989) 171–180.

[33] A. Streit, On job scheduling for HPC-clusters and the dynP scheduler, in: Proceedings of the Eighth International Conference on High Performance Computing (HiPC 2001), Lecture Notes in Computer Science, vol. 2228, Springer, Berlin, 2001, pp. 58–67.

[34] V.E. Taylor, B. Nour-Omid, A study of the factorization fill-in for a parallel implementation of the finite element method, Internat. J. Numer. Methods Engrg. 37 (1994) 3809–3823.


[35] J. Watts, S. Taylor, A practical approach to dynamic load balancing, IEEE Trans. Parallel Distrib. Systems 9 (3) (1998) 235–248.

[36] M.J. Zaki, W. Li, S. Parthasarathy, Customized dynamic load balancing for a network of workstations, Technical Report, Computer Science Department, University of Rochester, Rochester, NY 14627, 1995.

[37] S. Zhou, X. Zheng, J. Wang, P. Delisle, Utopia: a load sharing facility for large, heterogeneous distributed computer systems, Software Practice Exper. 23 (12) (1993) 1305–1336.

Dr. Saeed Iqbal received his Ph.D. in Computer Engineering in May 2003 from the University of Texas at Austin. He has an M.S. in Computer Engineering and a B.S. in Electrical Engineering. His current work is on grid computing and performance analysis of commodity clusters used for high performance computing. He has worked at Intel Microcomputer Software Labs (MSL) in Portland, OR and Austin, TX. His interests include parallel and distributed computing, micro-architecture, neural computing and system architecture. He is a member of IEEE, ACM and the IEEE Computer Society.

Dr. Graham F. Carey is a member of the interdisciplinary Institute for Computational Engineering and Sciences and a Professor in the Department of Aerospace Engineering and Engineering Mechanics at the University of Texas at Austin. He is Director of the Computational Fluid Dynamics Laboratory and holder of the Richard B. Curran Centennial Chair in Engineering. Prof. Carey has a B.S. (Hons.) degree from Australia, and M.S. and Ph.D. degrees from the University of Washington at Seattle. His research and teaching activities primarily deal with techniques in computational mechanics, particularly finite element methods. Related research experience includes periods as a research faculty member in Civil Engineering in Australia (1966–68), and as a research engineer at the Boeing Company, Seattle (1968–70), during which time he worked on finite element formulation and computation of nonlinear problems. Publications by Dr. Carey include several textbooks on finite elements and over 200 publications in the general area of computer simulation and finite element technology. He is co-editor of the Wiley international journal Communications in Numerical Methods in Engineering.