H-BSP: A Hierarchical BSP Computation Model


The Journal of Supercomputing, 18, 179–200, 2001. © 2001 Kluwer Academic Publishers. Manufactured in The Netherlands.

HOJUNG CHA AND DONGHO LEE [email protected]

Department of Computer Science, Kwangwoon University, Seoul, 139–701 Korea

Final version accepted September 7, 2000

Abstract. This paper presents a new parallel computing model, called H-BSP, which adds a hierarchical concept to the BSP (Bulk Synchronous Parallel) computing model. An H-BSP program consists of a number of BSP groups which are dynamically created at run time and executed in a hierarchical fashion. H-BSP allows algorithm designers to develop more efficient algorithms by utilizing processor locality in the program. Based on the distributed memory model, H-BSP provides a group-based programming paradigm and supports divide-and-conquer algorithms efficiently. This paper describes the structure of the H-BSP model, its complexity analysis, and some examples of H-BSP algorithms. Also presented are the performance characteristics of H-BSP algorithms based on simulation analysis. Simulation results show that H-BSP takes advantage of processor locality and performs well in low-bandwidth networks and in constant-valence architectures such as the two-dimensional mesh. The results also show that H-BSP predicts algorithm performance better than BSP, due to its locality-preserving nature.

Keywords: computation model, BSP, processor locality, scalable computing, analysis.

1. Introduction

Many parallel computers have been built in recent years and some of them are in wide use for scientific and engineering applications. The use of parallel computers, however, has not been as successful as researchers anticipated. This is mainly because parallel software is architecture-dependent and there are many types of parallel hardware. Parallel computers are, in general, built on diverse hardware principles, which makes it difficult to write and analyze portable software across a wide range of hardware platforms. Considering the rapid advance in hardware manufacturing technology, the architecture-dependent nature of current parallel software is an obstacle to its widespread use.

In contrast with the diversity of parallel hardware designs and programming models, sequential computing, based on the von Neumann model of computation, has been successful in producing much portable software. The von Neumann model has provided a unified model upon which hardware and software are independently developed with performance improvements in mind. It also enables the development of a well-founded formal theory for algorithm analysis. One of the recent efforts in parallel computing research is to develop an architecture-independent and yet practical computing model.

Depending on the level of abstraction on parallel computers, parallel computer models can be classified into four different categories [1]: the machine model, the architecture model, the computation model and the programming model. Among these, the computation model gives an abstract view of the architecture model and defines how computation on the underlying architecture is achieved. It is mainly used for the development of architecture-independent parallel algorithms and for their performance analyses. It should provide a theoretical base for algorithm analysis in a simple framework. In other words, a computation model should reflect the core technology of the architecture model in order to be able to predict the performance of an algorithm on real machines.

One of the representative parallel computational models is PRAM [2], which extends the concept of the RAM (Random Access Machine) of sequential computing. PRAM has many advantages as a computational model due to its simplicity, but its over-simplified assumptions are far from reality and thus unsuitable for developing practical parallel software. Most computational models are in fact architecture-dependent, since performance cannot be compromised. In an effort to consider theoretical and practical implementation issues together, research on new parallel computational models has recently been active. Such models aim to provide architecture-independent, general-purpose computation by supporting various kinds of parallel architectures, in the hope of enabling architecture-independent parallel software [2, 3]. There are, of course, negative opinions about the feasibility of a general-purpose, architecture-independent parallel model, but there are many hopeful signs. Hardware manufacturing technology is converging and people generally agree on how a parallel machine should be built. Distributed shared memory machines, for example, inherit the benefits of both shared-memory and distributed-memory architectures. The need to develop hardware-independent parallel software makes a general-purpose computational model even more pressing. Research on architecture-independent parallel models has mainly been conducted from two viewpoints. One is to develop enhanced PRAMs which partially address the limits of the original PRAM. The other is to overcome the fundamental limits of PRAM and develop more practical computational models. The latter includes HPRAM [1], LogP [4], C3 [5], BSP [6, 7] and so on.

Among the practical computational models recently suggested, BSP in particular has attracted much interest, since it is considered a bridging model that relates parallel hardware and software in a consistent manner. BSP reflects key implementation parameters in the model and thus enables the development of architecture-independent parallel software, and it provides a consistent algorithm analysis technique for a wide range of parallel architectures. It is, however, a global memory model in which a non-local memory access requires global communication and synchronization. As a result, an application exhibiting communication locality does not benefit from the processor locality features of the underlying hardware. LogP, like BSP, is designed to be a realistic model of parallel computation. LogP enables the algorithm designer to implement efficient parallel algorithms by encouraging the use of communication schedules and allowing the overlapping of computation with communication. However, as with BSP, it requires primitives for global operations and does not take into account the processor locality of the underlying hardware.

This paper proposes a hierarchical BSP (H-BSP) model which adopts BSP as its submodel and offers a mechanism to take advantage of processor locality, thus enabling the development of more efficient algorithms. The paper is structured as follows. Section 2 briefly describes the BSP computational model. Section 3 presents the principles of H-BSP and its algorithm analysis technique. Some examples of H-BSP algorithms are then described in section 4 with their performance analyses. Simulation results are given in section 5 in order to validate the proposed H-BSP model and analyze its predicted performance. Section 6 summarizes the work.

2. The BSP model

PRAM offers a simple and ideal view of parallel programming, but it does not consider implementation issues. Variants of PRAM, such as Phase PRAM, APRAM, BPRAM and LPRAM, have been suggested as practical PRAM models to overcome its fundamental implementation difficulties. Valiant's BSP (Bulk Synchronous Parallel) model is one such effort. BSP was motivated by the effort to implement PRAM efficiently on distributed memory machines. Valiant proved that a PRAM application can be run on the BSP model as efficiently as on the PRAM model; i.e., with algorithm complexity differing only by a constant factor. BSP has gradually been accepted as an architecture-independent parallel computation model which enables the development of more practical parallel algorithms by considering key hardware parameters.

In the following, the structure and principles of the BSP model are briefly described, together with its complexity analysis technique.

2.1. Principles

BSP consists of three components: a set of processors with local memories, a message-passing communication network, and a bulk synchronizer for processor synchronization. BSP is a two-level memory model which distinguishes local and non-local memory. Non-local memory is accessed via the following primitives and takes constant time regardless of location:

BspRead(ProcessorID, Source, Destination, length);
BspWrite(ProcessorID, Source, Destination, length)

Processor synchronization in BSP is explicitly conducted by the BspSync() global operation. The computation between two consecutive BspSync() calls is called a superstep, and each superstep consists of a computation phase and a communication phase. The result of a communication operation issued in the current superstep becomes valid in the next superstep, after the global synchronization at the end of the current superstep. Figure 1 illustrates the concept of a BSP superstep.
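As an illustration of these primitives, the following C-style sketch shows the shape of a single superstep. The paper gives only the primitives' names and argument lists, so the C signatures below and the helper superstep_example are assumptions made for illustration.

————————————————————————————————————
/* Assumed C bindings for the primitives above; the paper specifies only
   BspRead/BspWrite(ProcessorID, Source, Destination, length) and BspSync(). */
void BspRead(int processor_id, void *source, void *destination, int length);
void BspWrite(int processor_id, void *source, void *destination, int length);
void BspSync(void);

/* One superstep: a computation phase on locally held data, followed by a
   communication phase whose results become valid only after the barrier. */
void superstep_example(int my_pid, int p, double *local, double *remote, int n)
{
    /* computation phase: use data delivered in the previous superstep */
    for (int i = 0; i < n; i++)
        local[i] += remote[i];

    /* communication phase: fetch the next processor's block of n doubles */
    BspRead((my_pid + 1) % p, local, remote, (int)(n * sizeof(double)));

    /* global synchronization ends the superstep; remote[] is usable next */
    BspSync();
}
————————————————————————————————————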

Figure 1. The BSP superstep: a computation step followed by a communication and synchronization step, performed by every processor.

BSP provides a good theoretical background for designing portable and predictable parallel applications by abstracting various hardware and software architectures into an architecture-independent computation model. The algorithm designer need not understand the underlying hardware characteristics, but considers a few performance parameters in order to develop an application which runs on various machines with predictable performance.

There are four parameters in BSP which specify the performance characteristics of a given parallel architecture:

• p: Number of processors
• s: Processor speed
• l: Global synchronization cost
• g: Communication cost

g is related to the communication bandwidth of the underlying network: an h-relation, a communication pattern in which each processor sends and receives at most h messages, can be done in g × h time. Together, g and l specify the performance characteristics of various parallel architectures. Table 1 shows the l and g values for representative architectures.

As shown in the table, l and g depend on the processor count. In order to develop architecture-independent BSP applications, the application programmer should consider parameters such as the problem size (N), the number of processors (p) and the performance parameters l and g. By parameterizing a program with l and g, the same program can effectively be ported to another architecture with different l and g.

Table 1. Examples of l and g

Network      l           g
2D Mesh      O(√p)       O(√p)
Butterfly    O(log p)    O(log p)
Hypercube    O(log p)    O(1)
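As a small worked illustration of the g × h cost (the numbers are assumed, not taken from the paper): if each processor sends and receives at most h = 64 words, with g = 5 cycles per word and l = 2000 cycles, then the h-relation of one superstep costs

$$g \times h = 5 \times 64 = 320 \text{ cycles},$$

and the global synchronization that ends the superstep adds l = 2000 cycles.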


2.2. Complexity analysis

A BSP algorithm is a sequence of supersteps, and the overall complexity of an algorithm is obtained by combining the complexities of its supersteps. Let $n_{comp}$ and $n_{comm}$, respectively, be the computation and communication time of the busiest processor in a superstep. Then the complexity of a single superstep, $T_{superstep}$, is defined as

$$T_{superstep} = n_{comp} + n_{comm} \times g + l$$

The overall complexity of a BSP algorithm A consisting of S supersteps, $T_{BSP}(A)$, is obtained as follows:

$$T_{BSP}(A) = T_{superstep_1} + \cdots + T_{superstep_S} = N_{comp} + N_{comm} \times g + S \times l$$

Here, $N_{comp}$ and $N_{comm} \times g$ are the sums of the computation time and the communication time over all supersteps, respectively.
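As a worked instance of this formula (our example, not the paper's), consider summing n numbers on p processors in S = 2 supersteps. In the first superstep each processor adds its n/p local numbers and writes the partial sum to processor 0, which is a (p − 1)-relation; in the second, processor 0 adds the p partial sums. The busiest-processor costs are therefore $N_{comp} = n/p + p$ and $N_{comm} = p - 1$, giving

$$T_{BSP}(\text{sum}) = (n/p + p) + (p - 1) \times g + 2l.$$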

3. The H-BSP model

BSP provides a consistent algorithm analysis technique for a wide range of parallel architectures. It is, however, a global memory model in which a non-local memory access requires global communication and synchronization. As a result, an application exhibiting locality of communication does not benefit from the processor locality features of the underlying hardware [3].

One effort to introduce processor locality into a computational model is H-PRAM, which uses PRAM as its submodel. H-PRAM shares the benefits of PRAM in principle, but the model is not well suited to implementation on distributed memory architectures. We now introduce the hierarchical BSP (H-BSP) model, which incorporates an application's locality into the model. H-BSP inherits the BSP principles, but its hierarchical structure reflects the locality of processor communication and synchronization. In the following, the structure and principles of H-BSP are presented, followed by its algorithm analysis technique.

3.1. Principles

In addition to the basic BSP principles, H-BSP uses a special mechanism to split the entire system into a number of smaller groups, as shown in Figure 2. The system is dynamically split or merged at run time. Each group behaves as an independent BSP system, and the groups communicate in an asynchronous fashion. The working mechanism of H-BSP is similar to that of the process tree in UNIX. H-BSP can therefore be regarded as an enhanced BSP augmented with group fork and join functionality.

Figure 2. Structure of the H-BSP model: processor/memory pairs (P/M) connected by a message-passing communication network, with a bulk synchronizer and a group split mechanism.

Figure 3 shows, with an example, how an H-BSP algorithm works in general. The level-1 BSP, called the root BSP, initially consists of 6 processors. It then splits into two level-2 groups: a 2-processor BSP and a 4-processor BSP. The split process continues until no further split is possible; i.e., until every BSP group is a single-processor group. The dynamically created BSPs run, within each group, in a bulk synchronous fashion, but they are independent of each other and run asynchronously. At any time, only the leaf BSP groups are in the active state. Their parents, the non-terminal BSP groups, wait until their children BSPs (the leaf BSPs) terminate. In other words, a group of leaf BSPs with the same parent synchronize with each other at the end of their runs. Consequently the leaf BSPs are destroyed and the parent becomes active. This process continues until the root BSP becomes active again and terminates its run; at this point, the entire algorithm terminates.

Figure 3. The H-BSP concept: a 6-processor root BSP (level 1) forks into a 2-processor and a 4-processor BSP (level 2), which fork in turn into 1-processor BSPs (level 3); groups fork and join at each level, leaf BSPs are in the active state, and nonterminal BSPs are inactive.

H-BSP can be considered a generalized BSP, since an H-BSP consisting of a single-level root BSP is the same as BSP. Computation is done on a group basis, as are communication and synchronization. The local memory of a processor in an H-BSP group consists of a private area and a shared area. The private area is for local operation, whereas the shared area is common to all processors in the group and is accessed via h-relations.

The group split in H-BSP is done either implicitly or explicitly. With the implicit mechanism, a group of processors is physically selected in such a way that the selected processors are contiguously located in the underlying communication network and the group diameter is minimized. Let p be the number of processors in the system, G the number of groups created, and $p_i$ the number of processors in the i-th group, so that $p = \sum_{i=0}^{G-1} p_i$. The implicit split mechanism is then written as follows:

BspForkAuto(p_0 : SubAlgorithm_0(ArgList_0),
            p_1 : SubAlgorithm_1(ArgList_1),
            ...,
            p_{G-1} : SubAlgorithm_{G-1}(ArgList_{G-1}))

Each BSP group and the processors in the group are assigned a new group ID and new processor IDs, respectively. When the group sizes are equal and the sub-algorithms running on each BSP group are the same, the following simpler notation is used:

BspForkAuto(G, p_G, SubAlgorithm(ArgumentList))

The following is pseudo-code for the H-BSP example shown in Figure 3.

————————————————————————————————————
Pseudo Code: H-BSP
level1(pG, r)    /* pG: number of processors in a group, r: level */
{
    BspForkAuto(pG × 1/3 : level2(pG × 1/3, r + 1),
                pG × 2/3 : level2(pG × 2/3, r + 1));
}
level2(pG, r)
{
    BspForkAuto(2, pG × 1/2, level3(pG × 1/2, r + 1));
}
level3(pG, r)
{
    local computation;
}
————————————————————————————————————
Code 1. Pseudo-code for Figure 3.


In contrast to the implicit split, the explicit split mechanism specifies the group ID of each processor explicitly. Here the number of processors in a group need not be specified, as the group members are uniquely determined by the group IDs (Note 1). The explicit split mechanism does not guarantee independence among the groups, but it enables an arbitrary form of group creation.

Let G be the number of groups to be created and GID the group ID each processor specifies for itself. The explicit split is then defined as follows:

BspForkDirect(G, GID, SubAlgorithm_0(ArgumentList),
              ...,
              SubAlgorithm_{G-1}(ArgumentList))

When all the groups run the same algorithm, the definition simplifies to

BspForkDirect(G, GID, SubAlgorithm(ArgumentList))

BspForkDirect can be useful for handling if or switch structures. A processor running in a BSP group can branch, depending on the value of a conditional variable, into a part of the code which has a different superstep structure. In this case, the groups split off by BspForkDirect at the branch point run asynchronously until the end of the conditional structure, where group synchronization takes place.

With the addition of the group split mechanism, H-BSP has two types of synchronization: S-Sync and H-Sync. S-Sync is the classical superstep synchronization within a group, whereas H-Sync is a hierarchical synchronization among sibling groups which have the same parent. The complexity of H-Sync depends on the processor structure on which the hierarchical synchronization is implemented. It is a function of G, the number of groups, and $p_G$, the number of processors in the parent group, and is written $s_H(G, p_G)$. On a mesh it is, in general, a function of $p_G$, whereas on a hypercube it is a function of G.

3.2. Complexity analysis

An H-BSP algorithm is managed by the root BSP, and its complexity becomes the overall complexity of the algorithm. Hence, the complexity of an H-BSP algorithm is obtained by combining the complexity of the BSP algorithm performed at the root BSP with the sum of the complexities required for each group split at the root. $T_{H\text{-}BSP}(A)$, the complexity of an H-BSP algorithm A, is denoted as

$$T_{H\text{-}BSP}(A) = \text{(complexity of BSP algorithm performed at root BSP)} + \text{(total complexity of group splits at root BSP)}$$

The group split complexity reflects the time from the split to the join, and it depends on the most costly group among the children. Let p be the number of processors at the root BSP, f the number of group splits at the root BSP, $G_i$ the number of BSP groups created by the i-th split ($1 \le i \le f$), gid the created group's ID ($0 \le gid \le G_i - 1$), and $GT(i, gid)$ the complexity of the gid-th group resulting from the i-th group split. Then the i-th group split complexity, $FT_i$, is defined as

$$FT_i = \max\{\, GT(i, gid) \mid 0 \le gid \le G_i - 1 \,\} + s_H(G_i, p)$$

and the total complexity of the group splits at the root BSP becomes $\sum_{i=1}^{f} FT_i$. Here, the complexity of the gid-th group, $GT(i, gid)$, is recursively computed in the same fashion.

Now, the overall complexity of an H-BSP algorithm A is defined as

$$T_{H\text{-}BSP}(A) = N^{root}_{comp} + N^{root}_{comm} \times g + S^{root} \times l + \sum_{i=1}^{f} FT_i$$

Note that the first part, $N^{root}_{comp} + N^{root}_{comm} \times g + S^{root} \times l$, is the complexity of the BSP algorithm performed at the root BSP and is directly derived from the analysis in section 2.2. Figure 4 illustrates the analysis mechanism.

Figure 4. Complexity analysis for H-BSP: over time, the level-1 root BSP performs group splits 1, ..., f; the i-th split forks level-2 groups gid = 0, ..., G_i − 1 and contributes FT_i to the total.
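As a small worked instance of this recursion, consider the H-BSP of Figure 3. The 6-processor root BSP performs a single split (f = 1) into $G_1 = 2$ groups, a 2-processor BSP and a 4-processor BSP, so its overall complexity is

$$T_{H\text{-}BSP}(A) = N^{root}_{comp} + N^{root}_{comm} \times g + S^{root} \times l + \max\{GT(1, 0),\, GT(1, 1)\} + s_H(2, 6),$$

where GT(1, 0) and GT(1, 1) are expanded by the same rule for the splits that the two level-2 groups themselves perform.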


3.3. Model review

The BSP model provides a practical algorithm development platform because it incorporates components which PRAM does not account for. However, BSP does not take advantage of processor locality, as it uses a uniform memory access model in order to avoid dependency on the interconnection network.

H-BSP uses BSP as its submodel and dynamically divides the system into smaller groups. This way it still inherits the BSP principles while taking advantage of processor locality. While BSP is not able to jump to another superstep structure at conditional statements, H-BSP can execute more flexible BSP code since its groups run asynchronously. Furthermore, BSP poses difficulties in building libraries: a library must have the same superstep structure as the calling code, because every processor executes in a bulk synchronous fashion. Opal [9], one of the BSP languages, uses a group concept running on independent processors for constructing libraries. Unlike H-BSP's dynamic group creation, Opal maintains static groups and communicates via a rendezvous mechanism like that found in Ada. Opal maintains library code on a separate group, executed upon request. However, as Opal relies on the rendezvous mechanism for synchronous communication, its performance analysis is complicated [9]. On the other hand, H-BSP makes library construction relatively easy due to its hierarchical group structure, and it provides a simpler performance analysis mechanism than that of Opal. H-BSP is also better suited than the BSP model itself to the languages originally developed for BSP, such as GL [10] and GPL [10]. For example, GL and GPL use tree structures in the language, and these are better utilized in H-BSP than in BSP.

H-PRAM, one of the representative computational models, also provides a grouping facility as in H-BSP. Its performance, however, is restricted by the inherent limits of PRAM. H-PRAM is supposed to support shared memory efficiently on distributed memory architectures, but performance degrades due to the strongly-coherent memory model of PRAM. An H-PRAM algorithm running on a single level is equivalent to a PRAM algorithm; efficient execution on a real machine is thus not easy, as the communication bandwidth cannot be fully utilized when memory pipelining is not considered [1]. On the other hand, H-BSP performs well as it is based on a weakly-coherent memory model and inherits the pipelining feature of BSP. The H-BSP model can also be implemented and adopted on current and future computer systems in an efficient manner.

Other related computation models are the LogP and C3 models. LogP is an asynchronous model which is particularly suited to performance prediction and algorithm analysis. LogP makes communication scheduling possible and enables the efficient use of system resources by overlapping communication and computation. However, LogP's asynchronous communication makes algorithms complicated, and the algorithm designer must consider the algorithm as a whole rather than as a simple structure such as BSP's superstep. LogP restricts the number of messages which can be in transit from or to any processor at any time, which requires the algorithm designer to regulate communication bandwidth. As a program becomes complicated, the LogP model makes performance prediction more difficult, and therefore a global synchronization similar to that of BSP is required to handle the situation in a more predictable way [4]. This eventually incurs the same inefficiency as in BSP, since the model does not consider processor locality. The LogP model is considered to be less architecture-independent than BSP or H-BSP. Meanwhile, the C3 model is similar to BSP as it also uses the superstep concept. C3 introduces many performance parameters, especially for congestion effects. C3 also makes use of processor locality, but its performance analysis is more complicated than H-BSP's. Furthermore, as most C3 algorithms are based on the mesh, its architecture independence is less visible [5].

4. Examples

In this section, we describe FFT (Fast Fourier Transform) and QuickSort algorithms developed for the H-BSP model and analyze their performance.

4.1. FFT algorithm

Figure 5 illustrates the structure of an FFT graph. Each row has n nodes, the row index i satisfies 0 ≤ i ≤ log(n), and each node in a row is assigned a sequential ID (0, ..., n − 1). Node j of row i (0 ≤ j ≤ n − 1) is connected to the node in row i + 1 which has the same ID and to the node whose ID differs only in the i-th MSB (i.e., whose ID is j xor 2^(log(n)−(i+1))).
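The connection rule amounts to flipping a single bit of the node ID. A minimal C helper reflecting the rule above (the function name is ours):

————————————————————————————————————
/* ID of the cross-edge neighbour of node j in row i of an FFT graph with
   n nodes per row (n a power of two): flip the i-th MSB of j, i.e. compute
   j xor 2^(log(n) - (i + 1)). Valid for rows i = 0, ..., log(n) - 1. */
unsigned fft_partner(unsigned j, unsigned i, unsigned log_n)
{
    return j ^ (1u << (log_n - (i + 1)));
}
————————————————————————————————————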

We now present BSP and H-BSP algorithms for the FFT problem; their algorithm complexities are analyzed and compared. Also presented is the relative benefit of H-BSP over the H-PRAM implementation of the FFT algorithm.

4.1.1. The FFT algorithm for BSP. Figure 5 shows the mapping of an FFT graph with 8 nodes per row to a BSP with 4 processors (dotted circles). When there are n FFT nodes and p BSP processors, n/p FFT nodes are sequentially assigned to each processor. Since the FFT graph consists of log(n) communication and computation steps, the BSP FFT algorithm consists of log(n) supersteps. As each superstep performs n/p computations and log(p) of the supersteps require (n/p)-relation communications, the complexity of the BSP FFT algorithm is as follows:

$$T_{BSP}(FFT) = O\!\left(\frac{n}{p}\log n + \frac{n}{p}\log p \times g + \log p \times l\right)$$

Figure 5. FFT graph (n = 8) and its mapping to the BSP and H-BSP models (p = 4); dotted circles denote BSP processors and larger circles BSP groups.

The pseudo-code of the BSP FFT algorithm is as follows.

————————————————————————————————————
Algorithm BSP FFT:
FFT()
{
    for (i = 0; i < log(p); i++) {
        BspRead(PID xor 2^(log(p)−(i+1)), node, temp, n/p);
        BspSync();
        for (j = 0; j < n/p; j++)
            node[j] = node[j] operator temp[j];
    }
    Seq_FFT(node, n/p);    /* Sequential FFT Algorithm */
}
————————————————————————————————————
Code 2. The BSP FFT algorithm.

We now analyze the complexity of the BSP FFT algorithm on mesh and hypercube architectures. In the case of the mesh, since g = O(√p) and l = O(√p) (as shown in Table 1),

$$T^{Mesh}_{BSP}(FFT) = O\!\left(\frac{n}{p}\log n + \frac{n}{p}\log p \times \sqrt{p} + \log p \times \sqrt{p}\right) = O\!\left(\frac{n}{p}\left(\log n + \log p \times \sqrt{p}\right)\right).$$

In the case of the hypercube, since g = O(1) and l = O(log p),

$$T^{Hypercube}_{BSP}(FFT) = O\!\left(\frac{n}{p}\log n + \frac{n}{p}\log p + \log^2 p\right) = O\!\left(\frac{n}{p}\log n + \log^2 p\right).$$

4.1.2. The FFT algorithm for H-BSP. In H-BSP, an FFT graph can be mapped in the form of a binary tree (circles in Figure 5). When there are p processors, the total number of levels is log(p) + 1, and the BSP groups at the same level consist of the same number of processors. At the lowest level (level 0), single-processor BSP groups sequentially execute their n/p FFT nodes. From level 1 to level log(p), each BSP group executes n/p computations and one superstep which consists of an (n/p)-relation. Upon completion of the BSP groups with the same parent, the groups are synchronized via the H-Sync operation. The complexity of every level except level 0 is O(n/p + n/p × g + l). The complexity of level 0 is O(n/p × log(n/p)), as the complexity of the sequential FFT algorithm is O(n × log(n)). The complexity of hierarchical synchronization is $\sum_{r=1}^{\log p} s_H(G, p_G)$, as described in section 3.1. The performance parameters g and l in H-BSP are a function of $p_G$, not of p as in BSP. Therefore,

$$T_{H\text{-}BSP}(FFT) = O\!\left(\frac{n}{p}\log\frac{n}{p} + \sum_{r=1}^{\log p}\left(\frac{n}{p} + \frac{n}{p}\,g(p_G) + l(p_G)\right) + \sum_{r=1}^{\log p} s_H(G, p_G)\right).$$

Since the number of processors in a group halves as the level increases, $p_G = p/2^{r-1}$. Also, as the sums of g and l over the levels satisfy

$$\sum_{r=1}^{\log p} g(p/2^{r-1}) = \sum_{r=1}^{\log p} g(2^r), \qquad \sum_{r=1}^{\log p} l(p/2^{r-1}) = \sum_{r=1}^{\log p} l(2^r),$$

the final complexity is as follows:

$$T_{H\text{-}BSP}(FFT) = O\!\left(\frac{n}{p}\log n + \sum_{r=1}^{\log p}\left(\frac{n}{p}\,g(2^r) + l(2^r)\right) + \sum_{r=1}^{\log p} s_H(2, 2^r)\right).$$

This equation is now applied to the mesh and hypercube architectures. In the case of the mesh, the sums of the g's and the l's are, respectively,

$$\sum_{r=1}^{\log p} g(2^r) = \sum_{r=1}^{\log p} O(\sqrt{2^r}) = O(\sqrt{p}), \qquad \sum_{r=1}^{\log p} l(2^r) = \sum_{r=1}^{\log p} O(\sqrt{2^r}) = O(\sqrt{p}).$$

Also, the complexity of hierarchical synchronization is a function of $p_G$, and the following equation holds:

$$\sum_{r=1}^{\log p} s_H(2, 2^r) = \sum_{r=1}^{\log p} O(\sqrt{2^r}) = O(\sqrt{p}).$$

Thus,

$$T^{Mesh}_{H\text{-}BSP}(FFT) = O\!\left(\frac{n}{p}\log n + \frac{n}{p}\sqrt{p} + \sqrt{p} + \sqrt{p}\right) = O\!\left(\frac{n}{p}\left(\log n + \sqrt{p}\right)\right).$$

In the case of the hypercube, $s_H$ is a function of G, so each join costs $s_H(2, 2^r) = O(\log 2) = O(1)$ and the total complexity of hierarchical synchronization is $\sum_{r=1}^{\log p} s_H(2, 2^r) = \sum_{r=1}^{\log p} O(\log 2) = O(\log p)$. Also, since on the hypercube g = O(1) and l = O(log $p_G$), the sums of the g's and the l's are

$$\sum_{r=1}^{\log p} g(2^r) = \sum_{r=1}^{\log p} O(1) = O(\log p), \qquad \sum_{r=1}^{\log p} l(2^r) = \sum_{r=1}^{\log p} O(\log 2^r) = O(\log^2 p).$$


Therefore the final complexity of the H-BSP FFT on the hypercube becomes

$$T^{Hypercube}_{H\text{-}BSP}(FFT) = O\!\left(\frac{n}{p}\log n + \frac{n}{p}\log p + \log^2 p\right) = O\!\left(\frac{n}{p}\log n + \log^2 p\right).$$

The pseudo-code for the H-BSP FFT algorithm is as follows.

————————————————————————————————————
Algorithm H-BSP FFT:
FFT(pG, r)
{
    if (r == log(p) + 1) {
        Seq_FFT(node, n/p);    /* Sequential FFT Algorithm */
    }
    else {
        BspRead(PID xor 2^(log(pG)−1), node, temp, n/p);
        BspSync();
        for (j = 0; j < n/p; j++)
            node[j] = node[j] operator temp[j];
        BspForkAuto(2, 1/2 × pG, FFT(1/2 × pG, r + 1));
    }
}
————————————————————————————————————
Code 3. The H-BSP FFT algorithm.

Table 2 shows the performance comparison of the FFT algorithms on the BSP, H-PRAM and H-BSP models. As shown in the table, H-PRAM performs better than BSP on the mesh, but worse on the hypercube. The reason is that on the mesh H-PRAM takes advantage of its processor locality, whereas on the hypercube BSP's use of memory pipelining is more prominent. Meanwhile, as H-BSP inherits both characteristics (processor locality and memory pipelining), it matches the better of BSP and H-PRAM on both architectures. The complexity of H-BSP on the mesh is the same as that of H-PRAM, but H-BSP has better potential for efficient implementation on real machines due to its superstep concept.

Table 2. Performance comparison of three models for the FFT algorithm

Model              Mesh                                 Hypercube
BSP FFT            O(n/p × (log n + log p × √p))        O(n/p × log n + log² p)
H-PRAM FFT [12]    O(n/p × (log n + √p))                O(n/p × (log n + log² p))
H-BSP FFT          O(n/p × (log n + √p))                O(n/p × log n + log² p)


4.2. QuickSort algorithm

H-BSP is appropriate for solving divide-and-conquer types of problems due to its hierarchical structure. This section describes a quicksort algorithm, QuickSort, for H-BSP and compares its performance with the BSP version.

The QuickSort algorithm consists of divide and conquer steps. In the divide step, an array A[q..r] is divided into A[q..s] and A[s+1..r] such that x ≤ y for all x ∈ A[q..s] and y ∈ A[s+1..r]. The split is decided by an element called the pivot. In the conquer step, the split subarrays are recursively sorted and then merged into a single sorted array.

Figure 6 shows the running process of the H-BSP QuickSort algorithm. When there are n numbers to be sorted and p = 2^d (d = 1, 2, ...) processors, n/p numbers are allocated to each processor. To select a pivot, the median-based pivot selection proposed by Wagar [11] is used in this paper. The pivot selection scheme generally affects the performance of a QuickSort algorithm. Wagar's scheme observes that when the distribution of the numbers allocated to each processor matches that of the entire input, selecting local medians as pivots results in near-optimal sorting. This method can easily be implemented on each processor by sorting the initial data and maintaining the sorted order thereafter.

Figure 6. QuickSort for H-BSP: at level 1 the 8 processors (IDs 000 to 111) split on the first ID bit (0??, 1??), at level 2 on the second (00?, 01?, 10?, 11?), and so on.

The H-BSP QuickSort works as follows. The medians of the initial data assigned to each processor are obtained, one of them is selected as the pivot, and the pivot is transmitted to every processor. According to this pivot, the local data in each processor is divided into two blocks, one of smaller and one of larger elements. The blocks are then exchanged with the processor whose ID differs in the r-th MSB. For a nondecreasing sort, the smaller block is allocated to the processor whose r-th MSB is 0, and vice versa. During the exchange process, data are merged so that local order is maintained. After the exchange and merge process completes, the overall system is split into two subsystems and this process is recursively performed within each subsystem. After d recursive rounds, the smallest block is sorted on the processor with ID 0 and the rest of the data is sorted on the other processors according to their processor IDs.
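The local work in each round is cheap precisely because every processor keeps its block sorted. The following C sketch (our illustration of the scheme; the function names are ours) shows the two local operations used in the analysis below: splitting the sorted block at the pivot by binary search, and merging the kept block with the block received from the partner.

————————————————————————————————————
/* Index of the first element greater than pivot in the sorted array
   a[0..n-1]; an O(log n) binary search divides the local block in two. */
static int split_point(const int *a, int n, int pivot)
{
    int lo = 0, hi = n;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (a[mid] <= pivot) lo = mid + 1;
        else hi = mid;
    }
    return lo;
}

/* Merge two sorted runs into dst so that local order is maintained after
   the exchange with the partner processor; O(na + nb). */
static void merge_sorted(const int *a, int na, const int *b, int nb, int *dst)
{
    int i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        dst[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) dst[k++] = a[i++];
    while (j < nb) dst[k++] = b[j++];
}
————————————————————————————————————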

We now analyze the complexity of the H-BSP QuickSort algorithm (Note 2). The algorithm consists of four parts: initial local sort, pivot selection, pivot transmission, and data division/exchange. The overall complexity is obtained by analyzing and summing these four parts. For the initial sorting of the local data, since the number of initial data items on each processor is n/p, the complexity of this part is O(n/p × log(n/p)). Pivot selection requires only O(1), as the local data is already sorted. Pivot transmission is done in O(log(p) × (g + l)) per round by using the BSP broadcasting scheme. The time required for broadcasting on the hypercube over all rounds is

$$\sum_{r=1}^{d} O(\log(p_G) \times (1 + \log(p_G))) = \sum_{r=1}^{d} O(\log(2^r) \times \log(2^r)) = \sum_{r=1}^{d} O(r \times r) = O(d(d+1)(2d+1)/6) = O(\log^3 p),$$

as the number of processors in a BSP group at level r is $p_G = p/2^{r-1}$ and g = O(1), l = O(log $p_G$). In the case of the mesh, since within a group of size $p_G$ both g and l are $O(\sqrt{p_G})$, the broadcasting time is

$$\sum_{r=1}^{d} O(\log(p_G) \times \sqrt{p_G}) = \sum_{r=1}^{d} O(\log(2^r) \times \sqrt{2^r}) = \sum_{r=1}^{d} O(r \times \sqrt{2^r}) = O(d \times \sqrt{2^d}) = O(\log p \times \sqrt{p}).$$

Dividing the local data of n/p numbers into two blocks requires O(log(n/p)), as the data is already sorted. The complexity of the data exchange is O(n/p × g(p_G) + l(p_G)), as the communication is an (n/p)-relation. The merge process after the data exchange is O(n/p). The overall complexity is therefore

$$T^{Hypercube}_{H\text{-}BSP}(QuickSort) = O\!\left(\frac{n}{p}\log\frac{n}{p} + \log^3 p + \sum_{r=1}^{d}\left(\frac{n}{p}\,g(p_G) + l(p_G)\right)\right) = O\!\left(\frac{n}{p}\log n + \log^3 p\right)$$

$$T^{Mesh}_{H\text{-}BSP}(QuickSort) = O\!\left(\frac{n}{p}\log\frac{n}{p} + \log p \times \sqrt{p} + \sum_{r=1}^{d}\left(\frac{n}{p}\,g(p_G) + l(p_G)\right)\right) = O\!\left(\frac{n}{p}\left(\log n + \sqrt{p}\right) + \log p \times \sqrt{p}\right).$$


On the other hand, if the same procedure is applied to the BSP model, the communication and synchronization costs increase due to BSP's global communication and synchronization. For instance, the broadcasting time for the hypercube takes

$$\sum_{r=1}^{d} O(\log(2^r) \times (1 + \log p)) = \sum_{r=1}^{d} O(r \times \log p) = O\!\left(\frac{d^2 + d}{2} \times \log p\right) = O(\log^3 p).$$

In the case of the mesh, it is

$$\sum_{r=1}^{d} O(\log(2^r) \times (\sqrt{p} + \sqrt{p})) = \sum_{r=1}^{d} O(r \times \sqrt{p}) = O\!\left(\frac{d^2 + d}{2} \times \sqrt{p}\right) = O(\log^2 p \times \sqrt{p}).$$

Here, as g and l are functions of p, the final complexity on each architecture is as follows:

$$T^{Hypercube}_{BSP}(QuickSort) = O\!\left(\frac{n}{p}\log\frac{n}{p} + \log^3 p + \sum_{r=1}^{d}\left(\frac{n}{p}\,g(p) + l(p)\right)\right) = O\!\left(\frac{n}{p}\log n + \log^3 p\right)$$

$$T^{Mesh}_{BSP}(QuickSort) = O\!\left(\frac{n}{p}\log\frac{n}{p} + \log^2 p \times \sqrt{p} + \sum_{r=1}^{d}\left(\frac{n}{p}\,g(p) + l(p)\right)\right) = O\!\left(\frac{n}{p}\left(\log n + \log p \times \sqrt{p}\right) + \log^2 p \times \sqrt{p}\right)$$

Table 3 shows the performance comparison of the H-BSP and BSP QuickSorts. Their performance is the same on the hypercube whereas, in the case of the mesh, the H-BSP model performs better than BSP. This is because H-BSP takes advantage of the processor locality characteristics of the mesh.


Table 3. Performance comparison of the H-BSP and BSP QuickSort algorithms

Model    Mesh                                           Hypercube
H-BSP    O(n/p × (log n + √p) + log p × √p)             O(n/p × log n + log³ p)
BSP      O(n/p × (log n + log p × √p) + log² p × √p)    O(n/p × log n + log³ p)

5. Simulation analysis

In order to validate the proposed H-BSP model and analyze its predicted performance, a simulation system has been developed. The simulator is based on discrete event-driven simulation and is implemented in a UNIX environment.

5.1. Simulator

Figure 7 shows an overview of the simulation system. The simulator consists of an architecture simulation system, an H-BSP simulation system and an application program system. The architecture simulation system implements a hardware-specific simulation platform and provides low-level communication primitives. The H-BSP simulation system provides the basic H-BSP communication primitives, synchronization primitives, and group split primitives. The application program system runs an H-BSP application on top of the H-BSP simulation system.

Figure 7. Structure of the simulator: an H-BSP algorithm is ported onto the H-BSP simulation layer, which runs on architecture simulation layers for the hypercube and the 2D mesh.

The performance characteristics of an H-BSP algorithm are the same as those of BSP when running on a machine, such as a hypercube, where the communication bandwidth increases linearly with the processor count [6]. In this paper, a two-dimensional mesh is used as the target architecture for analysis, as it exhibits the processor locality feature. The split process for the mesh is more difficult than for the hypercube, so a non-trivial indexing mechanism is needed to partition the mesh processors and implement H-BSP's split function. The simulator uses a Peano indexing [12] mechanism which minimizes, by recursively allocating 2 × 2 processor patterns, the communication distance within a processor group.
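The paper does not spell out the Peano indexing itself. As an illustration of the same idea of recursive 2 × 2 allocation, the following C sketch computes a Morton (Z-order) index, a related space-filling curve under which every aligned run of 4^m consecutive indices occupies a 2^m × 2^m submesh; the simulator's Peano indexing [12] differs in detail but serves the same locality-preserving purpose.

————————————————————————————————————
/* Morton (Z-order) index of mesh position (x, y) on a 2^k x 2^k mesh,
   obtained by interleaving the bits of x and y. Shown only as a stand-in
   illustration of recursive 2x2 indexing, not the simulator's actual
   Peano mechanism. */
unsigned morton_index(unsigned x, unsigned y, unsigned k)
{
    unsigned idx = 0;
    for (unsigned b = 0; b < k; b++) {
        idx |= ((x >> b) & 1u) << (2 * b);       /* even bits from x */
        idx |= ((y >> b) & 1u) << (2 * b + 1);   /* odd bits from y  */
    }
    return idx;
}
————————————————————————————————————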

5.2. Results

Figure 8 shows the analytical and simulation results for the FFT algorithm running on BSP and H-BSP. The X-axis and Y-axis stand for, respectively, the number of processors and the simulation time. The communication bandwidth of the underlying architecture is assumed to be 10 MBytes/sec and the number of input nodes (n) in the FFT algorithm is set to 4096.

Figure 8. Simulation results for the FFT algorithm on the mesh: (A) computation time, (B) communication time, (C) synchronization time and (D) total execution time versus the number of processors (n = 4096, packet size = 32 bytes, bandwidth = 10 MB/s); (E) execution time versus problem size n (64 processors); (F) execution time versus communication bandwidth (256 processors, n = 4096). Units: million processor cycles.

Figure 8(A) shows that the simulation results for BSP and H-BSP are the same when considering computation only; this coincides with the analytical results in section 4.1. Figure 8(B) shows the results for the communication overhead. As shown in the graph, the simulation results are again the same for BSP and H-BSP, because the communication patterns are identical in both cases. The analytical predictions, however, differ, and the simulation result for H-BSP is closer to its analytical prediction than in the BSP case. This shows that performance is better predicted in H-BSP, as the model properly reflects processor locality in the analysis. Figure 8(C) compares the synchronization costs. The graph shows that the synchronization cost of H-BSP is smaller than that of BSP and that the gap increases as the system size grows, reflecting the fact that H-BSP exploits the processor locality feature in its performance model. Figure 8(D) shows the overall performance of the two models, counting computation, communication and synchronization costs together. As shown in the figure, H-BSP yields improved performance and its performance prediction is better than that of BSP.

Figure 8(E) is a performance graph obtained by varying the number of input nodes in the FFT algorithm. It shows that the input size has relatively little influence on the overall system performance. Figure 8(F) shows the effect of communication bandwidth on H-BSP performance. As the communication bandwidth increases, the performance gap between BSP and H-BSP becomes less visible, since the communication overhead of BSP decreases more prominently. This shows that H-BSP is potentially most attractive in systems where the processor speed and the underlying communication bandwidth are in sharp contrast.

Figure 9 shows the simulation results for the QuickSort algorithm. The performance characteristics of the QuickSort algorithm on H-BSP and BSP are similar to those observed in the FFT simulation.

Figure 9. Simulation results for the QuickSort algorithm on the mesh: (A) execution time versus the number of processors (n = 4096, bandwidth = 10 MB/s); (B) execution time versus problem size n (64 processors); (C) execution time versus communication bandwidth (256 processors, n = 4096). Units: million processor cycles.

6. Conclusion

A well-designed parallel computation model enables algorithm designers to develop architecture-independent parallel algorithms which can efficiently be implemented on a variety of machines. BSP is one of the representative models proposed with this purpose and provides an abstraction for developing architecture-independent parallel algorithms. It is, however, a uniform memory model and requires increased communication bandwidth as the system size grows.

This paper presented a variant of the BSP model which can take advantage of the potential processor locality of the underlying hardware. H-BSP dynamically splits a BSP algorithm into a group of smaller BSPs at run time and relates them in a hierarchical fashion. This way, the BSP principles are inherited and more efficient algorithms can be developed through the use of processor locality. H-BSP provides a group-based programming paradigm and naturally supports divide-and-conquer algorithms. Presented in the paper are H-BSP algorithms for the FFT and QuickSort problems; their performance is predicted and simulated for mesh architectures. Simulation results show that the H-BSP model takes advantage of processor locality and performs well in low-bandwidth networks and in constant-valence architectures such as the two-dimensional mesh. The results also show that the H-BSP model can predict algorithm performance better than the BSP model, due to its locality-preserving nature.

In order for H-BSP to become a practical model, it is necessary to devise more H-BSP algorithms and to develop suitable programming environments. Also required are efficient group split mechanisms for diverse hardware architectures.

Acknowledgments

This work was supported by grant No. 95-0100-19-01-3 from the Basic Research Program of the Korea Science and Engineering Foundation.


Notes

1. FORK [8], one of the PRAM languages, adopts similar group split semantics.
2. The analysis is based on the assumptions underlying Wagar's pivot selection.

References

1. Todd Heywood and Sanjay Ranka. A practical hierarchical model of parallel computation. I: The model. Journal of Parallel and Distributed Computing, 16:212–232, 1992.

2. W.F. McColl. General purpose parallel computing. In A. Gibbons and P. Spirakis, eds., Lectures on Parallel Computation, pp. 337–391. Cambridge University Press, Cambridge, UK, 1993.

3. D.B. Skillicorn. Architecture-independent parallel computation. IEEE Computer, 23:38–50, 1990.

4. D. Culler, R. Karp, D. Patterson, A. Sahay, K. Schauser, E. Santos, R. Subramonian, and T. von Eicken. LogP: Towards a realistic model of parallel computation. In Proceedings of the 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1993, pp. 1–12.

5. Susanne E. Hambrusch and Ashfaq A. Khokhar. C3: A parallel model for coarse-grained machines. Technical report, Purdue University, January 1995.

6. Leslie G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33:103–111, 1990.

7. Thomas Cheatham, Amr Fahmy, Dan C. Stefanescu, and Leslie G. Valiant. Bulk synchronous parallel computing: a paradigm for transportable software. Technical report TR-36-94, Harvard University, December 1994.

8. Christoph Keßler and Helmut Seidl. Making FORK practical. In Workshop on Models of Parallel Computation, Universiteit Utrecht, January 1995.

9. Simon Knee. Program development and performance prediction on BSP machines using Opal. Technical report PRG-TR-18-1994, Oxford University, August 1994.

10. W.F. McColl. BSP programming. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, May 1994.

11. B.A. Wagar. Hyperquicksort: A fast sorting algorithm for hypercubes. In Proceedings of the Second Conference on Hypercube Multiprocessors, 1987, pp. 292–299.

12. Todd Heywood and Sanjay Ranka. A practical hierarchical model of parallel computation: Binary tree and FFT algorithms. Journal of Parallel and Distributed Computing, 16:233–249, 1992.