808 ieee transactions on parallel and distributed systems



Efficient Multicast on Irregular Switch-Based Cut-Through Networks with Up-Down Routing

Ram Kesavan and Dhabaleswar K. Panda, Senior Member, IEEE

Abstract: The irregular switch-based network of workstations is fast becoming a cost-effective platform for high performance computing. This paper presents efficient multicasting with reduced link contention on irregular switch-based cut-through interconnection using the popular up*/down* (UD) routing and unicast message passing. First, it is proven that, for an arbitrary irregular network with UD routing, it is not possible to create an ordered list of nodes to implement an arbitrary multicast in a link contention-free manner with a minimal number of communication steps. Next, three different multicast algorithms are proposed with their respective node orderings to reduce link contention: switch-based ordering (SO), switch-based hierarchical ordering (SHO), and chain concatenation ordering (CCO). A variation of the binomial tree-based communication pattern, with unicast message passing, is used on the above orderings to implement multicast. Then, the problem of node contention is described in the case when multiple multicasts occur concurrently in a system. Using source-based information, the CCO algorithm is modified to propose a source-partitioned chain concatenation ordering (SPCCO) algorithm. It is also shown how the SPCCO algorithm reduces the effect of node contention at the cost of link contention. Using detailed simulation experiments, the proposed multicast algorithms are compared with each other as well as with the naive random ordering (RO) algorithm for a range of system sizes, switch sizes, message lengths, input buffer sizes, degrees of connectivity, destination set sizes, and communication start-up times. For the case of single multicast, the CCO algorithm is shown to be the best to implement multicast with reduced link contention and minimum latency. For the case of multiple multicasts, the SPCCO algorithm is shown to be the best when the start-up overhead dominates the propagation overhead and the CCO algorithm is shown to be the best otherwise. The results also highlight the importance of reducing link contention when designing efficient multicast, even for systems with large input buffers in the switches. Thus, these results demonstrate significant potential to be applied to current and future generation NOW systems with irregular interconnection.

Index Terms: Parallel computer architecture, cut-through routing, wormhole routing, multicast, broadcast, collective communication, switch-based networks, irregular networks, networks of workstations.

1 INTRODUCTION

MULTICAST/BROADCAST is a common collective communication operation as defined by the MPI standard [23]. Parallel systems supporting distributed memory or distributed-shared memory programming paradigms require fast implementation of multicast and broadcast operations in order to support various application and system level data distribution functions. Multicast and broadcast are also used for other collective communication operations like barrier synchronization and global combining [21], [26]. Since broadcast is a special case of multicast (multicast to all nodes in the system), we will consider multicast for the remainder of this paper. However, it must be noted that all the developed algorithms and theories in this paper apply to broadcast as well.

Current generation parallel systems like IBM SP2 [39], Intel Paragon [13], Cray T3E [31], and Stanford FLASH use the cut-through switching technique due to its inherent advantages, like low-latency communication and reduced communication hardware overhead [24]. These systems provide a very small buffer space at each hop, which results in links getting held up by blocked worms. Also, these systems use regular network topologies (such as meshes, tori, hypercubes, multistage interconnection networks, etc.) with various deadlock-free routing schemes. Such regular topologies have important mathematical properties that make message communication easier by making message routing simpler, lowering the average distance per communication, and/or increasing the bisection bandwidth [9]. For such regular cut-through networks, many multicast/broadcast algorithms have been proposed in the literature in recent years [3], [8], [14], [16], [20], [22], [28].

More recently, cut-through switching is being applied to switch-based interconnects, like Myrinet [2] and ServerNet [12], to build networks of workstations, or NOWs (also called workstation clusters), for cost-effective parallel computing. In contrast to traditional parallel systems, these switches provide larger buffers at the input ports. This allows the trailing flits of a blocked worm to be pooled into the buffers, thus freeing links that would have otherwise been held up. Also, such switch-based networks typically have irregular topologies to allow the construction of scalable systems with incremental expansion capability. This flexibility allows easy addition and deletion of nodes to the computing environment, making the overall environment more amenable to network reconfigurations and resistant to faults. However, these topologies do not possess many of the attractive mathematical properties of the regular topologies. This makes the routing schemes on such systems quite complicated. There are routing schemes [1], [6], [29], [30], [32] that have been proposed on such systems to achieve deadlock-free, adaptive routing. The complex nature of such routing schemes also leads to difficulty in implementing a multicast/broadcast operation in a contention-free manner.

808 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 12, NO. 8, AUGUST 2001

• R. Kesavan is with Network Appliance, Inc., 495 Java East Drive, Sunnyvale, CA 94089. E-mail: [email protected].

• D.K. Panda is with the Department of Computer and Information Science, Ohio State University, Columbus, OH 43210. E-mail: [email protected].

Manuscript received 15 Oct. 1998; revised 7 Aug. 2000; accepted 21 Oct. 2000. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number 108043.

1045-9219/01/$10.00 © 2001 IEEE

Multicast algorithms are typically hierarchical in nature to achieve reduced latency. In these algorithms, some nodes work as intermediate nodes which receive a copy of the message from the source and forward it to other nodes. Typically, tree-structured algorithms are used to minimize the number of communication startups (steps) required for multicast [4], [22]. The efficiency of an algorithm is determined by the required number of startups for a multicast to complete and the degree of link contention experienced among the messages of the multicast. For regular networks with e-cube routing, the concept of a dimension-ordered chain has been developed [22] to implement contention-free multicast with minimum latency. However, for irregular cut-through networks with adaptive routing, developing such contention-free multicast algorithms is a nontrivial task.

The goal of this paper is to develop efficient multicast algorithms for irregular switch-based networks. We consider the popular deadlock-free routing scheme called up*/down* (UD) routing, similar to that used in DEC AN1 networks [30]. In addition to providing deadlock-freedom, this routing provides adaptive communication between nodes in an irregular network. With respect to such routing, we first prove that no ordered chain, similar to that proposed in [22], exists to implement contention-free multicast in $\lceil \log_2(d+1) \rceil$ steps for $d$ destinations. Next, we develop multicast algorithms which 1) minimize the number of communication startups (steps) for a given number of destinations and 2) minimize contention among the communication steps.

We assume a system consisting of $S$ switches with $k$ ports per switch. We propose three different multicast algorithms with their respective orderings of destinations. The first algorithm, switch-based ordering (SO), groups the destinations based on the switches to which they are connected, to generate an ordered list of destinations. This algorithm implements multicast in $\lceil \log_2(d+1) \rceil$ steps, with contention among the steps. The second algorithm, switch-based hierarchical ordering (SHO), provides enhancement by using a two-step hierarchical multicast (interswitch and intraswitch). This algorithm implements a multicast with up to $(\lceil \log_2 L_1 \rceil + \lceil \log_2 k \rceil)$ steps, where a leader node set of size $L_1$ is generated after grouping the destinations based on the switches. This algorithm guarantees that the final up to $\lceil \log_2 k \rceil$ intraswitch steps are contention-free. Finally, we propose a chain concatenation ordering (CCO) algorithm. For a given network and a set of destinations, this algorithm first determines chains of switches (defined as partial-ordered-chains or POCs) which can allow contention-free multicast within themselves. These POCs are concatenated to generate the overall ordered list in order to minimize contention.
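As a rough illustration of the grouping idea and the step counts quoted above, the following Python sketch orders destinations by their attached switch and evaluates the two step expressions. It is not the paper's SO or SHO procedure: the function names, the `switch_of` mapping, the choice to visit switches in increasing ID order, and the assumption of one leader per participating switch are all hypothetical placeholders.

```python
from collections import defaultdict


def ceil_log2(n):
    """Integer ceil(log2(n)) for n >= 1, computed without floating point."""
    return (n - 1).bit_length()


def switch_based_order(dests, switch_of):
    """Group destinations by attached switch and flatten into one list.

    Assumption: switches are visited in increasing ID order; the overview
    above does not fix this order.
    """
    groups = defaultdict(list)
    for node in dests:
        groups[switch_of[node]].append(node)
    return [n for sw in sorted(groups) for n in sorted(groups[sw])]


def so_steps(d):
    """Binomial-tree step count ceil(log2(d + 1)) for d destinations."""
    return ceil_log2(d + 1)


def sho_step_bound(dests, switch_of, k):
    """Upper bound ceil(log2 L1) + ceil(log2 k) on SHO steps.

    Assumption: the leader set holds one node per participating switch.
    """
    leaders = len({switch_of[n] for n in dests})
    return ceil_log2(leaders) + ceil_log2(k)


# Hypothetical example: five destinations spread over three switches.
switch_of = {"a": 2, "b": 0, "c": 2, "d": 1, "e": 0}
dests = ["a", "b", "c", "d", "e"]
order = switch_based_order(dests, switch_of)
print(order, so_steps(len(dests)), sho_step_bound(dests, switch_of, 8))
```

Destinations that share a switch end up adjacent in the list, which is the property the switch-based orderings rely on.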

Then, we analyze the performance of the proposed CCO algorithm for the scenario where multiple multicasts occur simultaneously in the system. This scenario is a common occurrence in parallel numerical and scientific applications, distributed shared memory systems, etc. In these operations, destination sets of different concurrent multicasts often overlap, leading to nodes participating concurrently in multiple multicasts. We discuss the problem of node contention in such multiple multicasts and describe a technique to reduce such node contention [18], [16]. Using this technique, which relies on source-based information, we propose a source-partitioned chain concatenation ordering (SPCCO) algorithm. We show how the SPCCO algorithm reduces node contention at the expense of increased link contention. In the remainder of this paper, we refer to link contention simply as contention, whereas we refer to node contention specifically as node contention.

We then compare the four proposed algorithms using extensive detailed simulation experiments. In addition to comparing these algorithms with each other, we compare them against a naive random ordering (RO) algorithm which is used in MPICH [11], an implementation of MPI. We first use single multicast experiments to isolate the effect of each of the following parameters on the algorithms: system size, switch size, message length, input buffer size, degree of connectivity, destination set size, and communication start-up time. Finally, we study the latency of these schemes under increasing multicast load with a variation of a few selected parameters. This study gives us an understanding of how these schemes behave in realistic multiple multicast traffic. Another important issue that has never been studied is the relevance of reducing link contention for multicast algorithms on systems with switches having large input buffers. In other words, is it meaningful at all to consider link contention as a factor during the design of multicast algorithms on systems with large input buffers? Also, as the size of input buffers increases in current-day switches, does link contention become less and less of a factor?

Our simulation results clearly show that the CCO algorithm is capable of implementing multicast with reduced latency for the single multicast scenario. These results also show that the relative performance improvement of the CCO algorithm, with respect to the other algorithms, does not decrease with increase in input buffer size (even with input buffer size of four times the message length). This gives us strong evidence that reducing contention is very important while designing multicast algorithms for systems with large input buffer sizes. The multiple multicast experiment results show that the SPCCO and the CCO algorithms perform the best in terms of latency and throughput achievable in the network. The relative performance of these two algorithms depends on whether the communication start-up time dominates the message propagation time or otherwise. Therefore, we conclude that the SPCCO and the CCO algorithms show significant potential to be applied to current and future generation NOW systems with irregular interconnection.

Several multicast schemes have been recently proposed and evaluated for networks of workstations with cut-through switching. In [29], Qiao and Ni have proposed a deadlock-free, adaptive routing scheme for irregular networks with cut-through switches. The routing is based on Eulerian trails. In this paper, we have considered the deadlock-free, adaptive UD routing scheme proposed in [30] due to its simplicity and commercial implementation. Multicast schemes using extra network interface support on Myrinet have been proposed in [41], [5]. Our emphasis in this paper has been on developing alternative multicast algorithms without using any additional network interface support and evaluating their relative performance. In [17], [34], [33], we have shown how the CCO algorithm can be integrated with the smart network interface approach taken in [41] to build more efficient multicast algorithms with lower contention.

In [7], Cohen et al. have proposed protocols for multicasting and broadcasting on cut-through networks. In this work, it is shown that multicasting can be performed in $\log_2 D$ steps in a link contention-free manner in any network which allows minimal routing. However, the basic nature of irregular networks makes the construction of minimal routing schemes very difficult. Indeed, UD routing is nonminimal. Therefore, the results of [7] cannot be applied to UD routing. In [19], Hadas et al. have proposed optimal contention-free multicasting using unicast messages. Although this paper assumes the UD routing, there is a further restriction on the routes some messages can take; these routes are called relaxed up-first paths. This further restriction permits the construction of a contention-free multicast for irregular networks. However, the routing scheme is obviously not strict UD routing. The results presented in this paper provide unicast-based multicast solutions for systems supporting the strict UD routing without any constraints.

The rest of the paper is organized as follows: Section 2 provides an overview of irregular networks and some associated issues related to routing. Section 3 shows why implementing contention-free multicast in irregular networks is a nontrivial problem. Section 4 presents the three multicast algorithms in detail. Section 5 discusses the problem of node contention for multiple multicast traffic and proposes the SPCCO algorithm. Simulation experiments and results comparing the relative merits of the multicasting schemes are presented in Section 6. Finally, concluding remarks are made in Section 7.

2 IRREGULAR NETWORKS

In this section, we provide models for irregular switch-based networks and the associated cut-through switches. Issues related to UD routing for such a network are discussed.

2.1 Network Model

Fig. 1a shows a typical parallel system using switch-based interconnect with irregular topology. Such a network consists of a set of switches where each switch can have a set of ports. The system in the figure consists of eight switches with eight ports per switch. Some of the ports in each switch are connected to processors/workstations, some ports are connected to ports of other switches to provide connectivity between the processors, and some ports are left open for future connections. Such connectivity is typically irregular and the only thing that is guaranteed is that the network is connected. Thus, the interconnection topology of the network can be denoted by a graph $G = (V, E)$, where $V$ is the set of switches and $E$ is the set of bidirectional links between the switches [2], [30]. Fig. 1b shows the interconnection graph for the irregular network in Fig. 1a. It is to be noted that all links are bidirectional and multiple links between two switches are possible. A typical switch-based irregular network can be described by using the following parameters:

• $P$: number of processors,

• $S$: number of switches,

• $k$: number of ports per switch,

• $f$: fraction of the total number of ports in the system which are connected to processors, $P = fSk$,

• $c$: percentage connectivity out of remaining $(1 - f)Sk$ ports for interconnection.

We assume $f = 0.5$ in this paper, so half the switch ports of the network are connected to processors. Such a configuration allows a system with a given number of processors to be built using a lower number of switches while allowing a reasonable number of external communication ports per processor [12]. We vary $c$ in our model to provide different types of irregular connectivity.
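To make the parameter relationships concrete, the sketch below evaluates $P = fSk$ and the $(1 - f)Sk$ interconnection ports for the eight-switch, eight-port example with $f = 0.5$. The class and property names are illustrative, and the value $c = 75$ is an arbitrary choice for the example rather than a figure from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IrregularNetworkParams:
    S: int    # number of switches
    k: int    # ports per switch
    f: float  # fraction of all ports connected to processors
    c: float  # percentage of the remaining ports used for switch links

    @property
    def total_ports(self) -> int:
        return self.S * self.k

    @property
    def processors(self) -> int:
        # P = f * S * k
        return round(self.f * self.total_ports)

    @property
    def interconnect_ports(self) -> int:
        # Ports not attached to processors: (1 - f) * S * k
        return self.total_ports - self.processors

    @property
    def connected_interconnect_ports(self) -> int:
        # c percent of the remaining ports are wired to other switches
        return round(self.c / 100 * self.interconnect_ports)


# Eight switches, eight ports each, f = 0.5; c = 75 is a made-up value.
net = IrregularNetworkParams(S=8, k=8, f=0.5, c=75)
print(net.processors, net.interconnect_ports, net.connected_interconnect_ports)
```

Lowering `c` leaves more ports open, which yields sparser and typically more irregular connectivity.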

2.2 Switch Model

Fig. 2 shows the architecture of a generic switch with $k$ ports. Each port consists of one input and one output link. As shown in Fig. 1a, a port can be connected to the port of another switch, a workstation, or kept open. A switch is wired to the workstation through a network interface card which is typically plugged into the I/O bus of the workstation. The switch can implement different types of switching techniques: cut-through or store-and-forward. In this paper, we assume switches implementing cut-through switching. Each port consists of an input and an output buffer. Although these buffers only need to be big enough to capture the header flit of an incoming worm so that the routing decision can be made as soon as the header flit arrives, deeper buffers are usually required to perform flow control efficiently across long links. A $k$-port switch typically provides a $k \times k$ crossbar connectivity in order to enable a concurrent transfer of messages from the input buffers to any of the output buffers [2], [30], [35], [38], [39]. However, in many instances, some routing restrictions are used to achieve deadlock-free routing. We consider some of these issues in the following section.

Fig. 1. (a) An example system with switch-based interconnect and irregular topology. (b) Corresponding interconnection graph G.

2.3 Routing Issues

Several deadlock-free routing schemes have been proposed in the literature for irregular networks [2], [12], [29], [30]. In this paper, we assume the routing scheme for our irregular network to be similar to that used in Autonet [30] due to its simplicity and its commercial implementation. Such routing allows adaptivity and is deadlock-free.

In this routing scheme, a breadth-first spanning tree (BFS) on graph $G$ is first computed using a distributed algorithm. The algorithm has the property that all nodes will eventually agree on a unique spanning tree. Now, the edges of $G$ can be partitioned into tree edges and cross edges. According to the property of BFS trees, a cross edge does not connect two switches which are at a difference of more than one level in the tree. Deadlock-free routing is based on a loop-free assignment of direction to the operational links. In particular, the "up" end of each link is defined as: 1) the end whose switch is closer to the root in the spanning tree, or 2) the end whose switch has the lower UID (unique ID), if both ends are at switches at the same tree level. Links looped back to the same switch are omitted from the configuration. The result of this assignment is that the directed links do not form loops. Fig. 3 shows in bold the links belonging to the BFS spanning tree embedded on the interconnection graph shown in Fig. 1. The assignment of the "up" direction to the links on this network is illustrated. The "down" direction is along the reverse direction of the link.
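The direction assignment described above can be sketched in a few lines of Python. This is a centralized illustration only: the scheme in [30] computes the tree with a distributed algorithm, and the adjacency-dict representation, the function names, and the small four-switch graph are assumptions made for the example.

```python
from collections import deque


def bfs_levels(adj, root):
    """Return the BFS tree level of every switch reachable from root."""
    level = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    return level


def up_end(u, v, level):
    """Return the switch at the 'up' end of link (u, v).

    Rule 1: the end closer to the root. Rule 2: on a tie in level, the
    end with the lower UID. Self-loops are assumed to be removed already.
    """
    if level[u] != level[v]:
        return u if level[u] < level[v] else v
    return min(u, v)


# Hypothetical graph: switch 0 is the root; link 1-2 is a cross edge.
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
level = bfs_levels(adj, root=0)
print(level, up_end(1, 2, level), up_end(3, 1, level))
```

Because every link points "up" toward a strictly smaller (level, UID) pair, following "up" links can never return to the starting switch, which is why the directed links form no loops.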

To eliminate deadlocks while still allowing all links to be used, this routing uses the following up/down rule: A legal route must traverse zero or more links in the "up" direction followed by zero or more links in the "down" direction. Putting it in the negative, a packet may never traverse a link along the "up" direction after having traversed one in the "down" direction. Details of this routing scheme can be found in [30]. This routing is also referred to as up*/down* routing or UD routing.

In order to implement the above routing, each switch has an indexed forwarding table. When a worm reaches a switch, the destination address is captured from the header flit of the incoming worm. This address is concatenated with the incoming port number and the result is used to index the switch's forwarding table. The table lookup returns the outgoing port number that the worm should be routed through. The forwarding tables can be constructed to support both shortest path and nonshortest path adaptive routing. In this paper, we only consider shortest path adaptive routing. Thus, the forwarding tables allow only legal routes with the minimum hop count. When multiple shortest path routes exist from the source to the destination, the forwarding table entry shows alternative forwarding ports. The choice of the outgoing port is decided dynamically based on the ports which are free when the header flits arrive at the switch. In the case of multiple outgoing ports being free, the routing scheme randomly selects one of them.
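The up/down rule and the notion of a minimum-hop legal route can be illustrated with a short Python sketch. It takes precomputed BFS levels as input and searches over (switch, has-gone-down) states; the function names and the six-switch example graph are hypothetical, and the sketch does not model ports, forwarding tables, or the random choice among free ports.

```python
from collections import deque


def goes_up(u, v, level):
    """True if traversing u -> v moves toward the 'up' end of the link."""
    return (level[v], v) < (level[u], u)


def is_legal(path, level):
    """Check the up/down rule: no 'up' hop after any 'down' hop."""
    seen_down = False
    for u, v in zip(path, path[1:]):
        if goes_up(u, v, level):
            if seen_down:
                return False
        else:
            seen_down = True
    return True


def min_legal_hops(adj, level, src, dst):
    """Hop count of a shortest legal route, or None if none exists."""
    start = (src, False)  # state: (switch, already went down?)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u, down = queue.popleft()
        if u == dst:
            return dist[(u, down)]
        for v in adj[u]:
            up = goes_up(u, v, level)
            if up and down:
                continue  # would violate the up/down rule
            state = (v, down or not up)
            if state not in dist:
                dist[state] = dist[(u, down)] + 1
                queue.append(state)
    return None


# Hypothetical network: root 0; switches 3 and 4 also share neighbor 5.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 4], 3: [1, 5], 4: [2, 5], 5: [3, 4]}
level = {0: 0, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3}
print(is_legal([3, 5, 4], level), min_legal_hops(adj, level, 3, 4))
```

In this example the two-hop path 3, 5, 4 goes down and then up, so it is illegal, and the shortest legal route from 3 to 4 passes through the root in four hops. This is one way to see why UD routing can be nonminimal with respect to the physical topology.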

3 CONTENTION-FREE MULTICAST IN IRREGULAR NETWORKS

In this section, we discuss the significance of ordered chains to achieve contention-free multicast with an optimal number of communication steps. We prove that there does not exist an ordered chain of nodes to implement contention-free multicast with a binomial tree-based message pattern on an arbitrary irregular network with the UD routing scheme discussed in Section 2.3.

3.1 Contention-Free Multicast with Ordered Chain

Fig. 2. Organization of a typical k-port switch supporting cut-through switching.

Fig. 3. BFS spanning tree rooted at node 6 corresponding to the example irregular network shown in Fig. 1.

Typically, binomial tree-based algorithms have been used in the literature [21], [22] to implement multicast on meshes, tori, and hypercubes with an optimal number of communication startups (steps). Such an approach requires $\lceil \log_2(d+1) \rceil$ communication steps for a multicast with $d$ destinations. Besides the number of startups, an important factor which affects the overall multicast latency is the contention that messages undergo between different steps of the binomial tree-based algorithm. In [22], it has been shown that if an ordered chain can be generated among the nodes participating in the multicast, a link contention-free binomial multicast tree can be constructed. Let the symbol $<_d$ denote such an ordering. Such an ordered chain exhibits the following property:

Property 1. If there exist four nodes $w$, $x$, $y$, and $z$ in an ordered chain such that $w <_d x <_d y <_d z$, then messages between processors $w$ and $x$ will not contend for any links with messages between processors $y$ and $z$, even for the boundary condition $x = y$ [22].
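The way a binomial communication pattern is laid over an ordered chain can be sketched as a recursive halving schedule: in every step, each node holding the message splits its chain segment in two and sends to one node in the other half. The Python below is a minimal sketch in the spirit of the chain-based approach of [22], not a reproduction of it; in particular, the choice of receiver within the other half and the function names are assumptions.

```python
def min_steps(d):
    """ceil(log2(d + 1)) for d >= 0 destinations, using integer arithmetic."""
    return d.bit_length()


def binomial_schedule(chain, source):
    """Return a list of steps; each step is a list of (sender, receiver).

    chain is the ordered list of participating nodes, including source.
    """
    # Each segment is (lo, hi, holder): chain indices lo..hi, one holder.
    segments = [(0, len(chain) - 1, chain.index(source))]
    steps = []
    while any(lo < hi for lo, hi, _ in segments):
        sends, nxt = [], []
        for lo, hi, h in segments:
            if lo == hi:
                nxt.append((lo, hi, h))
                continue
            mid = (lo + hi) // 2  # halves are lo..mid and mid+1..hi
            if h <= mid:
                r = mid + 1       # assumed: first node of the upper half
                nxt += [(lo, mid, h), (mid + 1, hi, r)]
            else:
                r = mid           # assumed: last node of the lower half
                nxt += [(lo, mid, r), (mid + 1, hi, h)]
            sends.append((chain[h], chain[r]))
        steps.append(sends)
        segments = nxt
    return steps


# Source 0 with seven destinations on the chain 0..7.
for number, sends in enumerate(binomial_schedule(list(range(8)), 0), 1):
    print(number, sends)
```

Within any one step, the sender and receiver pairs lie in disjoint, non-interleaved chain segments, which is exactly the situation Property 1 addresses.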

3.2 Nonexistence of Ordered Chain in Irregular Networks

Using the above property, the contention-free multicast problem in irregular networks reduces to generating an ordered chain among the participating nodes. However, in switch-based networks, concurrent communication between the processors connected to the same switch is contention-free. Thus, the above problem further reduces to generating an ordered chain among participating switches, where a participating switch is defined as a switch having at least one node connected to it which is participating in the multicast. In the worst case of a broadcast, an ordered chain consisting of all the switches in the network must be generated. This chain can be easily reduced to generate the ordered chain for any arbitrary multicast. The following theorem indicates that it is not always possible to generate such an ordered chain for an arbitrary irregular network:

Theorem 1. Given an arbitrary irregular network using the UD routing discussed in Section 2.3, there does not always exist an ordered chain satisfying Property 1 consisting of all the switches in the network.

Proof. Consider an irregular network with the UD routing scheme as discussed in Section 2.3. Let graph $G$ reflect the connectivity between the participating switches for a broadcast. Let us take five switches $\{s_1, \ldots, s_5\}$ in the BFS spanning tree of $G$ such that the subgraph $G'$ in Fig. 4a shows their relative positions in the BFS tree. Let there be no cross links incident on switches $s_1, s_2, s_4, s_5$. It can be easily seen that the shortest valid route from switch $s_i$ to switch $s_j$ is along the links of $G'$, where $1 \le i, j \le 5$. In the following discussion, let square brackets (e.g., $[s_1, s_2]$) indicate that the relative ordering of the switches enclosed within square brackets is not important. We claim that any ordered chain in $G$ containing switches $s_1$ to $s_5$ must have either $[s_1, s_2] <_p s_3 <_p [s_4, s_5]$ or $[s_4, s_5] <_p s_3 <_p [s_1, s_2]$. We prove this by contradiction. If $[s_1, s_2, s_4] <_p s_3 <_p s_5$, then a message from a processor connected to switch $s_3$ to a processor connected to switch $s_5$ will contend for the link $e$ with a message from a processor connected to switch $s_1$ to a processor connected to switch $s_4$. This scenario is shown in Fig. 4b. This violates Property 1 of ordered chains. Similarly, it can be proven that $[s_1, s_2, s_5] <_p s_3 <_p s_4$, $[s_4, s_5, s_1] <_p s_3 <_p s_2$, and $[s_4, s_5, s_2] <_p s_3 <_p s_1$ cannot be true. Thus, any ordered chain in $G$ containing switches $s_1$ to $s_5$ must have either $[s_1, s_2] <_p s_3 <_p [s_4, s_5]$ or $[s_4, s_5] <_p s_3 <_p [s_1, s_2]$.

Now, let us take an example of seven switches $s_1$ to $s_7$ in the BFS spanning tree of $G$ such that the subgraph $G''$ in Fig. 4c shows their relative positions in the BFS tree. Let there be no cross links incident on switches $s_1$ to $s_7$, excluding $s_3$. Using the above reasoning, any ordered chain of $G$ containing switches $s_1$ to $s_7$ must satisfy all three of the following conditions:

1. Either $[s_1, s_2] <_p s_3 <_p [s_4, s_5]$ or $[s_4, s_5] <_p s_3 <_p [s_1, s_2]$;

2. Either $[s_4, s_5] <_p s_3 <_p [s_6, s_7]$ or $[s_6, s_7] <_p s_3 <_p [s_4, s_5]$; and

3. Either $[s_6, s_7] <_p s_3 <_p [s_1, s_2]$ or $[s_1, s_2] <_p s_3 <_p [s_6, s_7]$.

It can easily be observed that such an ordered chain is impossible to generate: each of the three groups $[s_1, s_2]$, $[s_4, s_5]$, and $[s_6, s_7]$ must lie entirely on one side of $s_3$, so at least two groups fall on the same side, which violates the condition relating those two groups. Therefore, an ordered chain does not always exist for an arbitrary irregular graph with the routing discussed in Section 2.3. □

Hence, contention-free multicast cannot, in general, be implemented using the ordered-chain technique. Also, in spite of our best efforts, we found that it is a nontrivial problem to implement contention-free multicast with the optimal number of communication steps in irregular networks using other techniques. Thus, in the next section, we propose alternative ordering schemes and the associated multicast algorithms to implement multicast with reduced contention as well as with a minimum number of steps.
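The counting argument behind Theorem 1 can be checked mechanically. The following Python sketch enumerates every ordering of the switches and keeps those that satisfy the ordering conditions stated in the proof. It tests only those conditions, not link contention itself, and the helper names (`between`, `separated`, `feasible_orders`) are illustrative assumptions rather than anything defined in the paper.

```python
from itertools import permutations

def between(order, left, mid, right):
    """True if all of `left` precede `mid` and `mid` precedes all of `right`."""
    pos = {s: i for i, s in enumerate(order)}
    return (all(pos[a] < pos[mid] for a in left)
            and all(pos[mid] < pos[b] for b in right))

def separated(order, g1, g2, mid="s3"):
    """Condition [g1] <p mid <p [g2] or [g2] <p mid <p [g1]."""
    return between(order, g1, mid, g2) or between(order, g2, mid, g1)

def feasible_orders(switches, conditions):
    """All orderings of `switches` that satisfy every (g1, g2) condition."""
    return [o for o in permutations(switches)
            if all(separated(o, g1, g2) for g1, g2 in conditions)]

A, B, C = ("s1", "s2"), ("s4", "s5"), ("s6", "s7")

# Five-switch case (Fig. 4a): valid orderings exist, all with s3 in the middle.
five = feasible_orders(["s1", "s2", "s3", "s4", "s5"], [(A, B)])

# Seven-switch case (Fig. 4c): the three conditions cannot hold together.
seven = feasible_orders([f"s{i}" for i in range(1, 8)],
                        [(A, B), (B, C), (C, A)])

print(len(five), len(seven))
```

Under these conditions the search finds eight orderings for the five-switch subgraph and none for the seven-switch subgraph, which agrees with the proof.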

812 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 12, NO. 8, AUGUST 2001

Fig. 4. (a) The relative positions of five switches in the subgraph $G'$ of an example BFS tree. (b) A possible scenario of contention in $G'$. (c) Subgraph $G''$ with seven switches, which is part of another example BFS tree.


4 MULTICAST ALGORITHMS

In this section, we present several multicast algorithms. A naive random ordering algorithm is introduced first. Then, we propose three new algorithms with the capability for reduced contention during multicast. These multicast algorithms are illustrated with examples to demonstrate their performance and capability to reduce contention.

4.1 Random Ordering (RO) Algorithm

Let the source of a multicast be $n_s$ and the destination processors be in a set $D$. The naive RO algorithm randomly orders the elements of the set $D \cup \{n_s\}$ into a list, $L'$, and executes a binomial tree-based multicast on it. Current-generation communication layers use such an algorithm for implementing multicast. For example, the popular MPICH implementation of the MPI standard uses this algorithm for supporting multicast [11], [23], [37].

This algorithm is very simple to implement and it takes $\lceil \log_2(|D| + 1) \rceil$ communication startups (steps) to complete. Since the destinations and the source are ordered randomly, nothing can be said about the contention among messages of the multicast. Therefore, this algorithm is likely to be prone to severe contention as $|D|$ increases.

Let us consider a sample multicast, shown in Fig. 5, on the example irregular network in Fig. 1a. Processor 0 is the source and $\{3, 9, 15, 16, 19, 20, 21\}$ is the destination set of this sample multicast. Fig. 7a shows the multicast tree generated using the RO algorithm for the sample multicast. It also shows the list $L'$, which is a random ordering of the elements of $D \cup \{n_s\}$ for the multicast.
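As an illustration of the RO algorithm, the sketch below shuffles $D \cup \{n_s\}$ and then derives a binomial-tree schedule by recursive halving. The function names and the rule for picking the receiver in the other half are assumptions made for this example; the paper's exact tree construction is described in Section 5.1 and may choose different intermediate nodes.

```python
import random

def binomial_steps(num_destinations):
    """ceil(log2(|D| + 1)) computed with integer arithmetic."""
    return num_destinations.bit_length()

def random_ordering(source, destinations, seed=None):
    """Return L': a random permutation of D together with the source."""
    nodes = [source, *destinations]
    random.Random(seed).shuffle(nodes)
    return nodes

def binomial_schedule(chain, source):
    """Recursive-halving multicast over an ordered chain.

    Returns a list of steps; each step is a list of (sender, receiver) pairs.
    """
    steps = []
    segments = [(0, len(chain), chain.index(source))]  # (lo, hi, holder), hi exclusive
    while any(hi - lo > 1 for lo, hi, _ in segments):
        sends, next_segments = [], []
        for lo, hi, holder in segments:
            if hi - lo == 1:                  # single node, nothing left to cover
                next_segments.append((lo, hi, holder))
                continue
            mid = lo + (hi - lo + 1) // 2     # lower half gets the extra node
            receiver = mid if holder < mid else mid - 1
            sends.append((chain[holder], chain[receiver]))
            if holder < mid:
                next_segments += [(lo, mid, holder), (mid, hi, receiver)]
            else:
                next_segments += [(lo, mid, receiver), (mid, hi, holder)]
        steps.append(sends)
        segments = next_segments
    return steps

# Sample multicast from the text: source 0, seven destinations.
order = random_ordering(0, [3, 9, 15, 16, 19, 20, 21], seed=1)
steps = binomial_schedule(order, 0)
for number, sends in enumerate(steps, start=1):
    print(number, sends)
```

For the sample multicast, the schedule completes in three steps regardless of the random order, which matches $\lceil \log_2(7 + 1) \rceil = 3$.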

4.2 Switch-Based Ordering (SO) Algorithm

The SO algorithm sorts the elements of $D \cup \{n_s\}$ into a list $L'$ such that participating processors on the same switch appear adjacent to each other in $L'$. This is done by grouping the processors by switch and then randomly ordering these groups into the list $L'$. As in the RO algorithm, a binomial tree-based multicast is then performed on $L'$. Fig. 7b shows the multicast tree generated using the SO algorithm for the sample multicast shown in Fig. 5 on an irregular network. It also details the list $L'$ for the multicast. A formal specification of the SO algorithm is given in Fig. 6.

Like the RO algorithm, the SO algorithm takes $\lceil \log_2(|D| + 1) \rceil$ startups to complete. However, it reduces contention compared to the RO algorithm. In the later phases of the multicast, nodes send messages to their neighboring nodes in $L'$. Due to the grouping, there is a higher probability of these communications taking place between processors on the same switch. This reduces interswitch traffic considerably during the later phases of the multicast, when the number of messages is quite large. Intraswitch messages do not contribute to contention since these messages do not use interswitch links. Therefore, the SO algorithm promises better performance than the RO algorithm.
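The grouping step of the SO algorithm can be sketched as follows. The mapping from processor to switch (`switch_of`, four processors per switch) is an assumption made for illustration, since the example network of Fig. 1a is not reproduced here; the function name is likewise hypothetical.

```python
import random
from collections import defaultdict

def switch_based_ordering(source, destinations, switch_of, seed=None):
    """Group participants by switch, then place the groups in random order."""
    groups = defaultdict(list)
    for node in [source, *destinations]:
        groups[switch_of(node)].append(node)
    blocks = list(groups.values())
    random.Random(seed).shuffle(blocks)
    return [node for block in blocks for node in block]

def switch_of(node):
    # Assumption for illustration only: four processors per switch.
    return node // 4

L_prime = switch_based_ordering(0, [3, 9, 15, 16, 19, 20, 21], switch_of, seed=7)
print(L_prime)
```

Whatever order the groups end up in, processors that share a switch are adjacent in the resulting list, so the last steps of the binomial tree tend to stay within a switch.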

4.3 Switch-Based Hierarchical Ordering (SHO) Algorithm

The SHO algorithm uses the concepts of leader and hierarchy to guarantee contention-freedom in the later phases of the multicast. The set $D \cup \{n_s\}$ is partitioned into disjoint subsets such that each subset is represented by a leader node. The partitioning is done such that all participating processors connected to a switch form one subset. For subsets not containing the source node $n_s$, the processor with the least UID within the subset is chosen as the leader node. The source node $n_s$ is chosen as the leader node of its subset. A list $L_1$ is formed by randomly ordering all the leader nodes.

A formal specification of the SHO algorithm is given in Fig. 8. The multicast takes place in two stages. The first stage involves executing a binomial tree-based multicast on the elements of the list $L_1$ with $n_s$ as the source. This stage takes $\lceil \log_2 |L_1| \rceil$ startups to complete. Note that contention-freedom is not guaranteed during this stage. During the second stage, each leader node does a binomial tree-based multicast over its associated subset members. This stage of the algorithm takes up to $\lceil \log_2(k - 1) \rceil$ startups to complete, because there could be up to $k - 1$ processors connected to each $k$-port switch (one port is required for interconnection) and, so, each subset could have up to $k - 1$ elements. Since this stage of the multicast consists solely of intraswitch messages, they do not experience any contention with other messages. Therefore, the SHO algorithm has reduced contention compared to the SO algorithm. However, the SHO algorithm takes up to $\lceil \log_2 |L_1| \rceil + \lceil \log_2(k - 1) \rceil$ startups, which could be more than the number of startups for the SO algorithm for small values of $|D|$. This overhead is offset as $|D|$ and the message length increase.

Fig. 5. A sample multicast destination set on the example irregular network.

Fig. 6. Outline of the SO algorithm.

Fig. 7c shows the multicast tree generated using the SHO algorithm for the sample multicast shown in Fig. 5. It also shows the list $L_1$ for the multicast. The communication steps are identified by $(i, j)$, where $i$ corresponds to the step number (as in the examples for the RO and SO algorithms) and $j$ corresponds to the stage number. For this sample destination set, the multicast takes four communication steps to complete.
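The leader selection and the two-stage startup count of the SHO algorithm can be sketched as below. The switch mapping (four processors per switch), the use of the processor number as its UID, and the function names are assumptions for illustration. The second-stage count here uses the actual largest subset rather than the upper bound $k - 1$.

```python
import random
from collections import defaultdict

def ceil_log2(n):
    """Ceiling of log2(n) for n >= 1, using integer arithmetic."""
    return (n - 1).bit_length()

def sho_plan(source, destinations, switch_of, seed=None):
    """Return (L1, stage-1 startups, stage-2 startups)."""
    subsets = defaultdict(list)
    for node in [source, *destinations]:
        subsets[switch_of(node)].append(node)
    # Leader: the source for its own subset, otherwise the smallest UID.
    leaders = [source if source in members else min(members)
               for members in subsets.values()]
    random.Random(seed).shuffle(leaders)       # the list L1
    stage1 = ceil_log2(len(leaders))
    stage2 = max(ceil_log2(len(members)) for members in subsets.values())
    return leaders, stage1, stage2

def switch_of(node):
    # Assumption for illustration only: four processors per switch.
    return node // 4

leaders, stage1, stage2 = sho_plan(0, [3, 9, 15, 16, 19, 20, 21], switch_of, seed=3)
print(leaders, stage1 + stage2)
```

Under the assumed mapping, the sample multicast has five leaders, so the first stage needs three startups and the second stage one, for a total of four steps.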

4.4 Chain Concatenation Ordering (CCO) Algorithm

The above three algorithms do not attempt to reduce contention during the interswitch multicast steps. In order to reduce such contention, we use a new concept, the partial ordered chain (POC), to order the participating switches.

4.4.1 Concept of a Partial Ordered Chain (POC)

A POC is formally defined as follows:

Definition 1. A partial ordered chain (POC) is an ordered list of a subset of the switches in an arbitrary irregular network such that the nodes in the list satisfy Property 1.

As proven by Theorem 1, a global ordered chain does not always exist among the switches of an arbitrary irregular network with the deadlock-free, adaptive routing discussed in Section 2.3. Therefore, we attempt to construct as many longest POCs as possible and concatenate them to form an overall ordering. Such a concatenated chain promises reduced contention among interswitch messages during multicast steps. The following theorem suggests a method of constructing POCs on an irregular network with the routing scheme discussed in Section 2.3:

Theorem 2. Let $P$ be any ordered list of switches $\langle s_1, s_2, \ldots, s_n \rangle$, where $s_i$ is connected to $s_{i+1}$ by a "down" tree link (from the BFS spanning tree) or a "down" cross link connecting switches at different levels of the BFS spanning tree. Then, $P$ forms a partial ordered chain (POC).

Proof. Let us use the symbol $<_{poc}$ to denote the order in the above list $P$. Therefore, $s_i <_{poc} s_{i+1}$. Let $E = \langle e_1, e_2, \ldots, e_{n-1} \rangle$ denote the list of "down" links such that switch $s_i$ is connected to switch $s_{i+1}$ by the "down" link $e_i$, where $s_i, s_{i+1} \in P$. A message from a processor connected to switch $s_i$ to a processor connected to switch $s_j$, where $s_i <_{poc} s_j$, will take only the links from $P$ if all the links $e_i, e_{i+1}, \ldots, e_{j-1}$ are "down" tree links of the BFS spanning tree. Fig. 9a shows the only other possible minimal route for worm $w_{i,j}$ from $s_i$ to $s_j$. The switches in $P$ are highlighted. This scenario cannot occur in a BFS tree because, during the construction of the tree, either the cross link $e_m$ would be a tree edge of subtree $T_m$ or the cross link $e_n$ would be a tree edge of subtree $T_n$. However, if there is a cross link $e_c$ in the list $\{e_i, e_{i+1}, \ldots, e_{j-1}\}$, then a message from a processor connected to $s_i$ to a processor connected to $s_j$ can take links that are not in $P$, as shown in Fig. 9b. In the figure, worm $w_{i,j}$ takes links not in the list $E$. In any case, the links taken by worm $w_{i,j}$ cannot be taken by worm $w_{k,l}$ going from $s_k$ to $s_l$, where $s_i <_{poc} s_j <_{poc} s_k <_{poc} s_l$. This is because the links $e_i$ to $e_{j-1}$ are at a higher level than the links $e_k$ to $e_{l-1}$ and the worms $w_{i,j}$ and $w_{k,l}$ take minimal paths. Therefore, $P$ is a partial ordered chain. □

Fig. 7. Multicast trees for the sample multicast destination set using algorithms: (a) RO, (b) SO, and (c) SHO.

Fig. 8. Outline of the SHO algorithm.

Now, given an arbitrary multicast destination set in an arbitrary irregular network, the results of Theorem 2 need to be used to construct the longest possible POCs. The CCO algorithm, described in the next section, does this efficiently.
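The hypothesis of Theorem 2 is easy to test for a candidate list of switches. The sketch below assumes that a "down" link leads from a switch to a neighbor at a strictly deeper level of the BFS spanning tree; this reading of Section 2.3, the data layout, and the five-switch network are all assumptions made for illustration.

```python
def is_down_chain(chain, level, links):
    """Check the hypothesis of Theorem 2 for an ordered list of switches.

    chain: ordered list of switch ids
    level: switch id -> depth in the BFS spanning tree (root = 0)
    links: set of frozenset({a, b}), one per physical link
    """
    return all(
        frozenset((a, b)) in links and level[a] < level[b]
        for a, b in zip(chain, chain[1:])
    )

# Hypothetical five-switch network (not taken from the paper's figures).
level = {"r": 0, "a": 1, "b": 1, "c": 2, "d": 2}
links = {frozenset(pair) for pair in
         [("r", "a"), ("r", "b"), ("a", "c"), ("b", "d"), ("a", "b"), ("b", "c")]}

print(is_down_chain(["r", "a", "c"], level, links))  # down links only
print(is_down_chain(["r", "b", "c"], level, links))  # ends with a cross link between levels
print(is_down_chain(["a", "b"], level, links))       # same-level cross link
```

The first two lists qualify as POCs under the theorem, while the third does not, because its only link joins two switches at the same level.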

4.4.2 The Algorithm

The CCO algorithm constructs as many longest POCs as possible from the participating processors, concatenates the POCs, and executes a binomial tree-based multicast on this concatenated list. Such an approach promises to minimize the contention because: 1) messages within a POC do not contend with each other and 2) a message within one POC contends with a message within another disjoint POC only if one of these messages takes links not contained in its POC. An example of the latter situation is given in Fig. 9c. In the figure, switches in two POCs, $P$ and $P'$, are highlighted with different shading. The worm $w_{i,j}$ going from $s_i$ to $s_j$ takes links that are not in $E$. Therefore, there is contention between the worms $w_{i,j}$ and $w_{a,b}$ for the links between switches $s_c$ and $s_d$.

A formal specification of the CCO algorithm is given in Fig. 10 as a six-step approach. In the first step, a depth-first search (DFS) is applied on the irregular graph $G$, starting with the root node $r$ of the BFS spanning tree discussed in Section 2.3 and considering only the "down" links specified in Theorem 2. This facilitates the construction of the longest POCs. The step results in a DAG, $T$. Fig. 11a shows the DAG $T$ that is created when the above DFS is applied on the BFS tree in Fig. 3. As in the SHO algorithm, a participating switch is defined as one with at least one participating processor connected to it. In the third step, the resultant DAG $T$ from the DFS is reduced to a DAG, $T'$, which contains only the participating switches. Fig. 11b shows the $T'$ created when the $T$ from Fig. 11a is reduced according to the multicast described in Fig. 5.

In order to determine the longest POCs and concatenate them to form an overall ordered list, we carry out a weighted descendants approach. As indicated in Step 4, each switch is given a weight according to the number of participating processors connected to it and to all its descendant switches. Fig. 11b shows the corresponding weight of each switch in parentheses. The child with the largest weight indicates how to proceed while building the longest POC from the parent. After the weights have been calculated, chains of switches are stripped off from $T'$ according to their weights in Step 5. In other words, the heaviest chain gets stripped first from $T'$ and the lightest last. These chains are concatenated in the order in which they were stripped and each switch is replaced by the participating processors connected to it to form $L$. The chains of switches stripped off from $T'$ in Fig. 11b are $l_1 = \langle 5, 3, 0 \rangle$ and $l_2 = \langle 4, 2 \rangle$, in that order. The switches in $l_1$ and $l_2$ are replaced by the participating processors connected to them to generate the POCs $l'_1 = \langle 21, 20, 15, 3, 0 \rangle$ and $l'_2 = \langle 19, 16, 9 \rangle$. The POCs $l'_1$ and $l'_2$ are concatenated to form the list $L$. Finally, a binomial tree-based multicast is performed on this list $L$, as indicated in Step 6. Fig. 11c shows the resultant multicast tree generated in Step 6 and the list $L$.

It can be observed that the CCO algorithm has significant potential to reduce contention compared to the SHO and the SO algorithms. It incorporates the grouping effect of the SO algorithm by reducing participating processors to participating switches. It counteracts the extra startups due to the hierarchical effect of the SHO algorithm by expanding the switches to the participating processors before the last step. By constructing as many longest POCs as possible and concatenating them together, the contention among messages within POCs is eliminated. The CCO algorithm also takes only $\lceil \log_2(|D| + 1) \rceil$ steps to complete. Thus, this algorithm promises to implement a multicast with a minimum number of communication startups as well as reduced contention.

Fig. 9. Possible minimal paths from $s_i$ to $s_j$ which take links not in $P$. (a) All links between $s_i$ and $s_j$ in $P$ are down tree links of the BFS spanning tree. (b) There exists one cross link between $s_i$ and $s_j$ in $P$. (c) An example of contention between messages of two disjoint POCs.
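Steps 4 and 5 of the CCO algorithm (weighting and chain stripping) can be sketched as follows. This is a minimal illustration under several assumptions: the reduced structure $T'$ is treated as a tree in which every switch has one parent, "heaviest chain first" is read as "start the next chain at the heaviest remaining subtree root", and the switch names, processor numbers, and function names are hypothetical. It does not reproduce Fig. 10 or Fig. 11.

```python
def subtree_weights(children, procs, root):
    """Weight of a switch = participants on it plus on all its descendants."""
    weights = {}

    def visit(switch):
        weights[switch] = len(procs.get(switch, [])) + sum(
            visit(child) for child in children.get(switch, []))
        return weights[switch]

    visit(root)
    return weights

def strip_chains(children, procs, root):
    """Repeatedly strip a chain that follows the heaviest child at each switch."""
    weights = subtree_weights(children, procs, root)
    chains, pending = [], [root]
    while pending:
        pending.sort(key=weights.get, reverse=True)   # heaviest subtree first
        node = pending.pop(0)
        chain = []
        while node is not None:
            chain.append(node)
            kids = sorted(children.get(node, []), key=weights.get, reverse=True)
            node = kids[0] if kids else None
            pending.extend(kids[1:])                  # siblings start later chains
        chains.append(chain)
    return chains

def expand(chains, procs):
    """Concatenate the chains and replace each switch by its participants."""
    return [p for chain in chains for switch in chain for p in procs[switch]]

# Hypothetical reduced tree T' and participating processors per switch.
children = {"A": ["B", "C"], "B": ["D"], "C": ["E"]}
procs = {"A": [10], "B": [20, 21], "C": [30], "D": [40, 41], "E": [50]}

chains = strip_chains(children, procs, "A")
L = expand(chains, procs)
print(chains)
print(L)
```

In this example the chain through the heavier child (A, B, D) is stripped first, followed by (C, E), and the final list places the processors of each switch next to one another.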

5 AN ALGORITHM FOR MULTIPLE MULTICAST

In this section, we consider how algorithms proposed for single multicasts (like the CCO algorithm) behave for the generalized case of multiple multicast. The problem of node contention is described and a technique of using source-based information is applied to propose the Source-Partitioned-CCO (SPCCO) algorithm.

5.1 Contention in Multiple Multicast

Multiple multicast operations (i.e., two or more multicasts executing simultaneously) occur frequently in parallel systems. Examples include cache invalidation in distributed shared memory systems, multiple broadcasts in numerical and scientific applications (LU decomposition, for example), and multiple multicast/broadcast operations during concurrent barrier and reduction operations. In these operations, destination sets of different concurrent multicasts often overlap, leading to nodes participating concurrently in multiple multicasts. In such a scenario, the source node of each multicast uses the same algorithm designed for single multicast and constructs its multicast tree independently. With overlapped destination sets, such construction of trees may result in node contention [18], [16]. Let us see how node contention arises when using the CCO algorithm for multiple multicast.

Fig. 10. Outline of the CCO algorithm.

Fig. 11. Illustrating the steps of the CCO algorithm on the multicast set of Fig. 5. (a) DAG $T$ created by Step 1. (b) DAG $T'$ created according to Step 3 and the weights for switches computed according to Step 4. (c) The list $L$ created by Step 5 and the corresponding multicast tree.

As discussed earlier, the CCO algorithm builds a low-contention ordering of all the nodes and uses a binomial tree to deliver the multicast to the destinations. Let the chain concatenation ordering for a multicast be $\Phi = \langle d_0, d_1, \ldots, d_n \rangle$ and let $d_s \in \Phi$ be the source node. The binomial multicast tree is built in the following manner: The source divides the chain into two halves by sending the message to the node $d_{center}$. The value of $center$ is given as

$$
center =
\begin{cases}
\lceil n/2 \rceil & \text{if } s < n/2, \\
\lfloor n/2 \rfloor & \text{if } s > n/2, \\
s - 1 & \text{if } s = n/2.
\end{cases}
$$

Then, $d_s$ and $d_{center}$ recursively cover the other destinations in their respective halves of the chain. Fig. 12a shows how a multicast message propagates within a sample CCO. The node $d_{center}$, which receives the first copy of the message, is positioned halfway in the chain and is called the half-node. The algorithm recursively identifies quarter-nodes, one-eighth-nodes, and so on as the intermediate nodes.
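The rule for choosing the half-node translates directly into code. The sketch below implements only the first split of the chain for $n \ge 1$; applying it recursively would require re-indexing each half, which is omitted here. The function name is an assumption.

```python
def center(n, s):
    """Index of the first receiver for a chain d_0..d_n with source d_s (n >= 1)."""
    if 2 * s < n:
        return (n + 1) // 2      # ceil(n / 2)
    if 2 * s > n:
        return n // 2            # floor(n / 2)
    return s - 1                 # s == n / 2, which requires n to be even

# Half-node chosen by every possible source on an eight-node chain (n = 7).
n = 7
half_nodes = {s: center(n, s) for s in range(n + 1)}
print(half_nodes)
```

On an eight-node chain, every source in the lower half picks $d_4$ and every source in the upper half picks $d_3$, so concurrent multicasts over the same ordering converge on the same one or two nodes. This is the source of the node contention discussed in this section.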

Now, let us consider two multicasts A and B with identical source-destination sets. According to the CCO algorithm, both these multicasts will have the same CCO, as shown in Fig. 12. Fig. 12a and Fig. 12b show that both multicasts share the same half-node and quarter-nodes. The common half-node for A and B has to serialize the four message startups that it performs. This leads to node contention and two of the messages are delayed. Similarly, if several multicasts have (nearly) identical chain concatenated orderings, they tend to share the same nodes at the key positions along the orderings, leading to hot spots. In the worst case of multiple multicast, many-to-all broadcast, each broadcast has the same chain concatenated ordering. Therefore, all the sources send their first messages to the same node halfway in the ordering, their second messages to the same node quarter-way in the ordering, and so on. This leads to severe node contention and high latency for the multiple multicasts. In an earlier work, we presented a detailed analysis of node contention in the context of regular networks [18], [16].

A method to reduce node contention is to make each multicast choose intermediate nodes that are as different as possible from those of the rest. With dynamic multicast patterns, all concurrent multicasts are unaware of one another. This means that a multicast has no information whatsoever about the source and destinations of the other multicasts. A good multicast algorithm should use some local information to make its tree as unique as possible. The local information that our new algorithm uses is the position of the source in the system, which is unique for each multicast. This technique was proposed and used in [18], [16] to propose the SPUmesh algorithm for regular networks. We use the same technique to propose a new Source-Partitioned-CCO (SPCCO) algorithm for irregular networks.

5.2 Source-Partitioned-CCO (SPCCO) Algorithm

In this section, we propose and discuss the new SPCCO algorithm, which reduces the effect of node contention in multiple multicasts.

5.2.1 The Algorithm

As the name suggests, the Source-Partitioned-CCO algorithm partitions the ordering according to the position of the source in the ordering. Let the concatenated chain ordering (created by the CCO algorithm) containing the source and destinations be $\Phi$. A new ordering, $\Phi'$, is obtained by a rotate-left operation on $\Phi$ until the source shifts to the beginning of $\Phi'$. Now, the binomial tree-based multicast is built on $\Phi'$. The algorithm is formally presented in Fig. 13.
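The rotate-left step is small enough to show in full. The sketch below rotates an ordering until the source is first and then reports the half-node that the binomial tree would pick, taken here as the node at index $\lceil n/2 \rceil$ of the rotated chain of $n + 1$ nodes. The node labels and function names are hypothetical.

```python
def rotate_to_source(order, source):
    """Rotate the ordering left until `source` is its first element."""
    i = order.index(source)
    return order[i:] + order[:i]

def half_node(order, source):
    """First receiver when the binomial tree is built on the rotated ordering."""
    rotated = rotate_to_source(order, source)
    n = len(rotated) - 1
    return rotated[(n + 1) // 2]   # source sits at index 0, so the split is at ceil(n / 2)

# Hypothetical eight-node ordering shared by eight concurrent multicasts.
order = [f"d{i}" for i in range(8)]
picks = {source: half_node(order, source) for source in order}
print(rotate_to_source(order, "d3"))
print(picks)
```

With the rotation in place, the eight sources choose eight different half-nodes, whereas without it they would share one or two.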

5.2.2 Reduced Node Contention

Changing the ordering $\Phi$ to $\Phi'$ causes the multicast pattern to be dependent on the position of the source. In other words, each multicast chooses a different half-node, depending on the position of its corresponding source node. This reduces the node contention for the centrally positioned node. When $\Phi'$ is divided recursively at each stage of the algorithm, the above effect carries over. Therefore, node contention and latency are reduced for multiple multicast as compared to the CCO algorithm.

Fig. 14a and Fig. 14b show the respective multicast patterns using the CCO and the SPCCO algorithms for the sample multicast of Fig. 11. The ordering, $\Phi$, from Fig. 14a has been rotated left until the source, node 0, is at the beginning of the new ordering, $\Phi'$, shown in Fig. 14b. It can be seen that the choice of the half-node and quarter-nodes is now based on the position of the source. Also, it should be noted that the new ordering generated by the SPCCO algorithm has all the POCs of the original ordering (generated by the CCO algorithm) intact, except the POC that contains the source node. This POC is now split into two parts: one part is at the beginning of $\Phi'$ and the remainder is at the end of $\Phi'$. Although it is not apparent in this example, the splitting of the POC might lead to an increase in inter-POC messages. As discussed earlier, intra-POC messages do not contend for links among themselves. However, inter-POC messages are not guaranteed to be contention-free among themselves and with respect to other messages. Thus, a general increase in inter-POC messages (and, therefore, an increase in link contention) is expected using the SPCCO algorithm. However, it is not clear how much this increase in link contention will offset the reduction in node contention for the case of multiple multicast. Section 6.4 studies this issue using detailed latency versus applied load simulation experiments.

Fig. 12. Multicast message pattern for sample CCOs for (a) multicast A and (b) multicast B. The sources, half-nodes, and quarter-nodes are highlighted.

6 SIMULATION EXPERIMENTS AND RESULTS

In this section, we present results of simulation experiments to compare the three algorithms proposed in Section 4 and the SPCCO algorithm proposed in Section 5.

6.1 Experiments and Performance Measures

We used a C++/CSIM-based simulation test-bed [27] for our experiments. The simulation test-bed is capable of modeling a large number of topologies and can model a variety of flow control techniques ranging from wormhole routing to virtual cut-through. We assumed cut-through switching as the flow control technique. For all simulation experiments, we assumed system and technological parameters representative of the current trend in technology. The following default parameters were used: $t_s$ (communication start-up time) = 10.0 microseconds, $t_{phy}$ (link propagation time) = 12.5 nanoseconds, $t_{route}$ (routing delay at switch) = 500 nanoseconds, $t_{sw}$ (switching time across the router crossbar for a flit) = 12.5 nanoseconds, $t_{inj}$ (time to inject a flit into network) = 12.5 nanoseconds, and $t_{cons}$ (time to consume a flit from network) = 12.5 nanoseconds. The default message length was assumed to be 128 flits and the default input buffer size at each port was assumed to be 64 flits. In our earlier work [15], we had presented results assuming single-flit input buffers at each port (wormhole routing). Here, we present generalized results for cut-through switching with large input buffers at each port.

We used two types of experiments to measure the performance of the proposed multicasting schemes. In the first type of experiments, we measured the latency of single multicasts for each of the schemes to study the effect of different parameters on the relative latencies of the schemes. We assumed that exactly one multicast occurs in the system at any given time and that there is no other network traffic. The results from these experiments give us an estimate of the best possible performance of each of the schemes in isolation. Furthermore, the results help us isolate the effect of the various network parameters on the performance of each of the schemes. The destinations and network topologies were generated randomly. For each data point, the multicast latency was averaged over 30 different sets of destinations for each of 10 different network configurations. The 95 percent confidence intervals generated for the data points were observed to be extremely narrow. For our study, we varied each of the following parameters one at a time: the system size, the message length, the startup overhead time, the switch size, the input buffer size in the switches, and the degree of connectivity.


Fig. 14. Multicast message patterns generated by (a) the CCO algorithm and (b) the SPCCO algorithm for the example multicast of Fig. 5.

Fig. 13. Outline of the SPCCO algorithm.


In a real parallel system, however, it is unlikely that, at any given moment, the only traffic in the network is due to a single multicast. A more likely traffic scenario consists of multiple concurrent multicasts in the system. We used such traffic for our second type of experiments. We applied an increasing load consisting of multicast traffic alone and examined the load at which the network saturates with each of the multicasting schemes under the influence of the various parameters. As in [40], [36], we used effective applied load¹ as a measure of our stimulus. For a multicast of degree² $m$ and a load of $B_i$, the effective applied load is $m B_i$. For each data point, the multicast latency reported was calculated by taking the average of the latencies obtained from experiments run on 10 different randomly generated network configurations. We studied the performance of two different degrees of multicasts over the range of loads until saturation. We also varied each of the following parameters one at a time: the message length, the input buffer size in the switches, the switch size, and the startup overhead time. The next section discusses the irregular topologies used for the experiments and how they were generated.

6.2 Generating Random Irregular Topologies

For all experiments of the first type (single multicast), we assumed a default system configuration of 256 processors interconnected by 64 eight-port switches in irregular topologies. For all experiments of the second type (latency-throughput), we assumed a default system configuration of a 32-processor system interconnected by eight eight-port switches in an irregular topology. The smaller system size was required to make the latency-throughput simulations manageable in terms of memory and processing time. However, the results obtained for the smaller system are expected to scale well to larger systems.

Let us look at the process used for generating the irregular topologies mentioned above. To generate a topology with $s$ $k$-port switches and $p$ nodes, we reduced the problem to that of generating interconnections among $(sk - p)$ switch ports so that the graph with the switches as vertices remains connected. It was assumed that all ports of a switch are full duplex. Links were not allowed between ports of the same switch. Depending on a parameter which we call the percentage connectivity, we allow a certain number of switch ports to remain unconnected (i.e., they have no attached links). For an interconnection with 100 percent connectivity, we have a total of $s \times k - p$ ports available, each of which is connected to the port of another switch via a bidirectional link. On the other hand, for a percentage connectivity of 80 percent, $(s \times k - p) \times 0.8$ switch ports are connected to other switches and $(s \times k - p) \times 0.2$ of the switch ports remain unconnected. The default percentage connectivity was fixed at 75 percent. A random number generator was used to generate the port and switch to which a given switch port should be connected or to decide if the port should be connected to a processing node. In the preliminary version of this work [15], we assumed half the ports of each switch to be connected to processors. Here, we place no restriction on the number of processor nodes connected to a switch. This allows us to create certain types of topologies where some switches are used purely for interconnection and have no processor nodes connected to them.
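As a worked example of the port arithmetic above, the sketch below computes how many switch ports are wired to other switches for the two default configurations. Rounding down to a whole number of ports and the function name are assumptions; the text does not say how fractional port counts are handled.

```python
def interswitch_ports(s, k, p, connectivity_percent):
    """Ports wired to other switches: (s * k - p) scaled by the percentage connectivity."""
    free_ports = s * k - p
    if free_ports < 0:
        raise ValueError("more processing nodes than switch ports")
    return free_ports * connectivity_percent // 100   # assumption: round down

single = interswitch_ports(64, 8, 256, 75)   # default single-multicast system
multi = interswitch_ports(8, 8, 32, 75)      # default latency-throughput system

# Each bidirectional link occupies one port on each of two switches.
print(single, single // 2)
print(multi, multi // 2)
```

With the defaults, the 256-processor system has 192 interswitch ports (96 links) and the 32-processor system has 24 interswitch ports (12 links).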

6.3 Single Multicast Performance

We now present our results of the single multicast experiments on the proposed multicasting schemes. One by one, the effect of each parameter on the performance of the schemes is examined. As described earlier, 10 random topologies were generated for each experiment. Then, 30 random multicasts were generated for each multicast set size and for each topology. Each data point reported in the graphs is the average latency of these 300 multicasts.

6.3.1 Effect of System Size

First, we examined the effect of variation in system size on the performance of the proposed multicasting schemes. We simulated the RO, SO, SHO, CCO, and SPCCO algorithms on four different system configurations with 64, 128, 256, and 512 processors, respectively. The switch size was fixed at eight ports, but the number of switches was 16, 32, 64, and 128 for each system configuration, respectively. All other parameters were maintained at their respective default values. Fig. 15 shows these results. It can be observed that the CCO and SPCCO algorithms perform the best for all system sizes and numbers of destinations. As the system size increases, the benefits of the CCO and SPCCO algorithms become more prominent. For example, on a 512-processor system with 256 destinations, the reduction in multicast latency achieved by the CCO algorithm is around 35 percent, 17 percent, and 15 percent compared to the RO, SO, and SHO algorithms, respectively. The CCO algorithm performs marginally better than the SPCCO algorithm, although this is not apparent in Fig. 15. Since the performance of the SPCCO algorithm is nearly identical to that of the CCO algorithm, we only present but do not discuss the SPCCO results in the remaining single multicast performance results.

It can be observed that the RO algorithm performs the worst. The relative multicast latency using the RO algorithm increases considerably as we move to larger systems and larger numbers of destinations. The SO algorithm performs well for small destination sets. However, its latency also increases as we move to larger systems and larger numbers of destinations. The SHO algorithm does not perform well for smaller destination sets because of its additional start-up requirement. However, as the number of destinations increases, it performs reasonably well and its performance falls between the SO and CCO algorithms.


1. The load on a network is a measure of the stress on the network due to the traffic injected into it. This value is typically expressed as a fraction of the maximum value of 1, which corresponds to a traffic pattern where every possible injection channel is injecting one flit into the network every cycle. As described in [40], [36], we need to use a variation of this measure, called the effective applied load, to capture the stress on the network due to multicast traffic. This is because a multicast flit injected into the network corresponds to the injection of many unicast message flits in terms of the impact it has on network resources, since multiple copies are made of the multicast flit as it traverses the network.

2. The degree of a multicast is the number of destinations it covers.


6.3.2 Effect of Message Length

We studied the impact of message length on the four algorithms. Five different message lengths (64, 128, 256, 512, and 1,024 flits) on a 256-processor system with default parameters were considered. Fig. 16 shows the respective results. It can easily be observed that CCO > SHO > SO > RO for all message lengths, where > reflects the capability to implement the multicast with reduced latency. Also, the improvement in performance obtained by the CCO algorithm increases with increase in message length. This is because a longer message accentuates the link contention between messages, which leads to a larger difference in the performance of the algorithms. This is also reflected in the SHO algorithm outperforming the SO algorithm as the message length is increased.

6.3.3 Effect of Communication Start-Up Time

We studied the effect of communication start-up time on the performance of the four algorithms. The default 256-processor system configuration was used with four different communication start-up times: 1.0, 5.0, 10.0, and 20.0 microseconds. Fig. 17 shows the respective multicast latencies. It can be observed that, with higher communication start-up time, the CCO algorithm shows smaller benefits compared to the SO and SHO algorithms. This is expected because, for a given message length, higher start-up times reduce the contention between different phases of the multicast algorithm. The performance of the SHO algorithm worsens with increase in start-up time due to the extra start-up overhead of the SHO algorithm. However, it is clear that, as the start-up time diminishes,

820 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 12, NO. 8, AUGUST 2001

Fig. 16. Multicast latency versus number of destinations for five different message lengths: (a) 64, (b) 128, (c) 256, (d) 512, and (e) 1,024 flits.

Fig. 15. Multicast latency versus number of destinations for four different system configurations: (a) 64, (b) 128, (c) 256, and (d) 512 processors.


the CCO algorithm clearly performs the best for all destination sizes. This is because decreasing the start-up time accentuates the link contention in the network.

Currently, researchers are exploring multiple directions to design efficient network interface architectures [17], [41] and messaging layers [10], [25], [42], [43] to reduce communication start-up time. In this context, the current results indicate that message contention in multicast will gradually dominate with reduction in communication start-up time. Thus, algorithms like the CCO hold great promise for implementing multicast with reduced latency in future systems.
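The interplay between start-up time and link contention can be illustrated with a simple contention-free cost model. The sketch below is not the simulator used in this paper: it assumes a plain binomial tree, in which d destinations are covered in ceil(log2(d + 1)) unicast steps, and a hypothetical per-flit time of 0.01 microseconds. It reports the fraction of each step spent in start-up; the larger this fraction, the smaller the share of time during which messages occupy links and can contend.

```python
def binomial_steps(num_destinations):
    """Number of steps a binomial tree needs to reach all destinations.

    Each step doubles the number of nodes holding the message, so
    ceil(log2(d + 1)) steps are needed. For d >= 1 this equals
    d.bit_length(), which avoids floating-point rounding.
    """
    if num_destinations < 1:
        raise ValueError("need at least one destination")
    return num_destinations.bit_length()


def startup_fraction(startup_us, message_flits, flit_time_us):
    """Fraction of one unicast step that is spent in software start-up."""
    propagation_us = message_flits * flit_time_us
    return startup_us / (startup_us + propagation_us)


def contention_free_latency_us(num_destinations, startup_us, message_flits, flit_time_us):
    """Lower-bound latency of the multicast if no step ever blocks."""
    step_us = startup_us + message_flits * flit_time_us
    return binomial_steps(num_destinations) * step_us


if __name__ == "__main__":
    FLIT_TIME_US = 0.01  # hypothetical value, chosen only for illustration
    for startup_us in (1.0, 5.0, 10.0, 20.0):
        share = startup_fraction(startup_us, 128, FLIT_TIME_US)
        bound = contention_free_latency_us(15, startup_us, 128, FLIT_TIME_US)
        print(startup_us, round(share, 3), round(bound, 2))
```

In this simplified model, the start-up share of a step grows with the start-up time, which is consistent with the observation that a good destination ordering matters most when start-up time is small.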

6.3.4 Effect of Switch Size

We studied the effect of switch size on the performance of the four algorithms. The default 256-processor system was considered with three different switch sizes: 8, 16, and 32 ports. Fig. 18 shows the performance results. It can be observed that, with smaller switch size, the CCO algorithm performs the best. As switch size increases, a greater number of communication steps become intraswitch steps. Since intraswitch steps are contention-free, this leads to reduced contention for the overall multicast and the algorithms start delivering equal performance. However, for a larger number of destinations, contention still exists for the RO and SHO algorithms. Thus, for bigger switch size and a larger number of destinations, either the SO or the CCO algorithm can be used.

6.3.5 Effect of Input Buffer Size

In the preliminary version of this work [15], we showed that the CCO algorithm performs the best with wormhole-routed switches, i.e., cut-through switches with single-flit input buffer size. It is well known that increasing input buffer size will allow blocked worms to pool up at the buffers and release downstream links that would otherwise have remained reserved. This should allow other worms to use these freed links. Current-day cut-through switches provide large input buffers [2], [12], [38]. This leads us to question the very need for low-contention multicast algorithms, since larger input buffers reduce link contention.

To answer this question, we studied the impact of input buffer size (in the switches) on multicast latency. The default system size of 256 processors was considered with five different input buffer sizes: 16 flits, 64 flits, 128 flits, 256 flits, and 512 flits. The default message length of 128 flits was used for these experiments. Fig. 19 shows the associated performance results. These results show that, even with an


Fig. 17. Multicast latency versus number of destinations for four different communication start-up times: (a) 1.0, (b) 5.0, (c) 10.0, and (d) 20.0 microseconds.

Fig. 18. Multicast latency versus number of destinations for three different switch sizes: (a) 8, (b) 16, and (c) 32 ports.


input buffer size of 512 flits (4 times the message length of 128 flits), the multicast latency of the CCO algorithm is clearly less than that of the other schemes. In fact, the multicast latencies of all four schemes do not vary much with buffer size. This is because the increase in buffer space has only moved the contention from the interswitch links to the input buffers of the switches. These results let us draw a very important conclusion: Contention is still an important factor in the design of efficient multicast algorithms, even for systems with large input buffers in switches.

6.3.6 Effect of Degree of Network Connectivity

Finally, we studied the impact of degree of network connectivity on single multicast latency. The default system size of 256 processors was considered with three different degrees of network connectivity: 65 percent, 75 percent, and 90 percent. Fig. 20 shows the associated performance results. With lower connectivity, the number of communication links in an irregular network is reduced, leading to a lower number of adaptive paths and more contention for multicast. Under such circumstances, the CCO algorithm delivers the best performance. As the degree of connectivity increases, the contention effect reduces, but does not get completely eliminated. Thus, with higher connectivity, the CCO algorithm still performs better compared to other algorithms, but the benefits are reduced.

6.4 Latency versus Applied Load for Multiple Multicast

We now present our results for multiple multicast latency under an increasing multicast load for the proposed algorithms. We used two different multicast degrees in our experiments: 15-way multicasts (i.e., multicasts with 15 destinations) and 27-way multicasts. As mentioned earlier, a 32-processor system was assumed for these experiments. For each of our experiments, our simulations were run for at least one million cycles, with measurements beginning after a cold-start time of 500,000 cycles. It is worth keeping in mind that, for each of the networks, the maximum unicast throughput (assuming no software overheads and no contention for the I/O bus) with UD routing has been observed to be less than 0.18 in our simulations and in other work [29]. Also, each of the plots in this section shows multicast latency against effective applied load, as discussed in Section 6.1. Again, 10 random topologies were generated for each experiment, and the results reported are averages over these 10 topologies. The SHO algorithm is not included in any of the results reported in


Fig. 19. Multicast latency versus number of destinations for different input buffer sizes in the switches: (a) 16, (b) 64, (c) 128, (d) 256, and (e) 512 flits.

Fig. 20. Multicast latency versus number of destinations for different degrees of network connectivity: (a) 65 percent, (b) 75 percent, and (c) 90 percent.


this section. This is because the SHO algorithm performed worse than all the remaining schemes due to the extra start-up overhead.
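The measurement procedure described above (discarding a cold-start period and averaging over random topologies) can be sketched as follows. The data layout, a list of (cycle, latency) pairs per topology, and the choice of which cycle a sample is attributed to are assumptions made for illustration; the paper does not specify them.

```python
from statistics import mean

COLD_START_CYCLES = 500_000  # measurements begin after this many cycles


def steady_state_latency(samples, cold_start=COLD_START_CYCLES):
    """Mean latency of the samples recorded at or after the cold-start cycle.

    `samples` is an iterable of (cycle, latency) pairs (hypothetical layout).
    """
    kept = [latency for cycle, latency in samples if cycle >= cold_start]
    if not kept:
        raise ValueError("no samples after the cold-start period")
    return mean(kept)


def average_over_topologies(per_topology_samples, cold_start=COLD_START_CYCLES):
    """Average the steady-state latency over all simulated topologies."""
    return mean(steady_state_latency(s, cold_start) for s in per_topology_samples)


if __name__ == "__main__":
    # Two made-up topologies; the first sample of each falls in the cold-start period.
    topo_a = [(100_000, 900), (600_000, 400), (900_000, 420)]
    topo_b = [(200_000, 800), (700_000, 380), (950_000, 400)]
    print(average_over_topologies([topo_a, topo_b]))
```

Discarding the cold-start samples keeps the transient behavior of an initially empty network out of the reported averages.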

6.4.1 Effect of Message Length

Fig. 21 shows the results of our experiments under variation of the message length: 64, 128, and 256 flits. For a smaller message length of 64 flits, the SPCCO and CCO algorithms perform almost the same for a smaller degree (15), but the SPCCO algorithm outperforms the rest for a higher degree (27) for the same message length. It should be noted that, with increasing message length, the applied load at which the CCO algorithm saturates starts catching up with that of the SPCCO algorithm (and even overtakes it in Fig. 21c). This is shown clearly in the increase in message length from 128 flits to 256 flits (Fig. 21b to Fig. 21c and Fig. 21e to Fig. 21f).

These trends can be explained as follows: For smaller multicast destination sets, the degree of overlapping of the half-nodes, quarter-nodes, etc., of the various concurrent multicasts is not high enough to offset the link contention in the SPCCO algorithm. In other words, the node contention in the CCO algorithm with low degree of multicast (and fewer overlapping destination sets) is not high enough to offset the increased link contention in the SPCCO algorithm. However, with increase in the degree of multicast (27), the degree of overlapping between intermediate nodes of concurrent multicasts increases. This results in an increase in node contention for the CCO, SO, and RO algorithms. This resultant node contention is reduced in the SPCCO algorithm. Therefore, with increase in multicast degree, the performance of the SPCCO algorithm improves in comparison to the other algorithms. This can be clearly seen with the increase in multicast degree from Fig. 21a to Fig. 21d, Fig. 21b to Fig. 21e, and Fig. 21c to Fig. 21f. This trend can also be seen in all the remaining results reported in this section.

With increase in message length, the link contention in the network increases. This is because longer messages hold up more network links for a longer period of time. This increase in link contention affects the SPCCO algorithm more than the CCO algorithm. At some point, the increase in link contention in SPCCO offsets the node contention in the CCO algorithm (as seen with the increase in message length from Fig. 21b to Fig. 21c). Therefore, with increase in message length, the performance of the CCO algorithm improves in comparison to the SPCCO algorithm and the other algorithms as well.

Another point to be noted is that the latency-throughput curves do not have a well-defined knee to indicate the saturation point for a message length of 64 flits. This is because the start-up overhead is too large compared to the propagation time of 64-flit messages in the network. Therefore, the network does not saturate easily with this ratio of start-up overhead time to message propagation time in the network. With increase in message length, there is a reduction in the dominance of the start-up overhead time over the network propagation time. This results in the curves having a well-defined knee to indicate the saturation point.

6.4.2 Effect of Input Buffer Size

Fig. 22 shows the results of our experiments under variation of the input buffer size in the switches: 16, 64, and 128 flits. As in Fig. 21a, Fig. 22a shows that the CCO algorithm outperforms the SPCCO algorithm for a lower degree of multicast (15) for a smaller buffer size. As explained in the above discussion, the SPCCO algorithm outperforms the CCO algorithm with increase in multicast degree (27). This can be clearly seen when comparing any of Fig. 22a, Fig. 22b, and Fig. 22c with Fig. 22d, Fig. 22e, and Fig. 22f, respectively. It can be seen from Fig. 22f that the applied load at which the SPCCO algorithm saturates is around 15 percent, 33 percent, and 45 percent higher than the load


Fig. 21. Multicast latency versus applied load for 15-way and 27-way multicasts with varying message length: (a) 15-way; message length = 64, (b) 15-way; message length = 128, and (c) 15-way; message length = 256 flits; (d) 27-way; message length = 64, (e) 27-way; message length = 128, and (f) 27-way; message length = 256 flits.


at which the CCO, SO, and RO algorithms saturate, respectively.

With increase in buffer size, the relative performance of all the algorithms does not vary much. We saw in Section 6.3.5 that the relative single multicast latency performance of the algorithms is not affected by an increase in buffer size. This trend also holds true for multiple multicast traffic.

6.4.3 Effect of Switch Size

Fig. 23 shows the results of our experiments under variation in switch size. In this experiment, we kept the degree of network connectivity at 100 percent. This is due to the fact that the switch size was required to be varied as 4, 8, and 16 ports. To maintain the same number of switch ports in the system, the number of switches for these configurations was 16, 8, and 4, respectively. Sixteen 4-port switches and 32 processors give 32 free ports, which, with 75 percent connectivity, results in only 24 ports, i.e., 12 bidirectional links. It is obvious that 12 bidirectional links cannot connect a 16-switch system, since connecting 16 switches requires at least 15 links. Therefore, we assumed 100 percent connectivity in this experiment to allow 4-port switch configurations of the system. It is to be noted that lower degrees of connectivity will lead to higher link contention and will thus favor the CCO algorithm over the SPCCO algorithm.

As expected, the performance of the SPCCO algorithm improves compared to that of the CCO algorithm with increase in degree of multicast. Also, an increase in switch size favors the SPCCO algorithm over the CCO algorithm. This is because an increase in switch size results in a greater number of communication steps becoming intraswitch steps. Since intraswitch steps are contention-free, this leads


Fig. 22. Multicast latency versus applied load for 15-way and 27-way multicasts with varying input buffer size in the switches: (a) 16, (b) 64, and (c) 128 flits.

Fig. 23. Multicast latency versus applied load for 15-way and 27-way multicasts with varying switch size: (a) 4, (b) 8, and (c) 16 ports.


to reduced link contention for the multiple multicast traffic. This favors the SPCCO algorithm.
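The port arithmetic behind the choice of 100 percent connectivity in this experiment can be checked with a short calculation. The sketch below assumes that the degree of connectivity is the percentage of free switch ports used for interswitch links and that each bidirectional link consumes two ports; the function names are illustrative. It relies on the fact that connecting n switches requires at least n - 1 links.

```python
def interswitch_links(num_switches, ports_per_switch, num_processors, connectivity_percent):
    """Number of bidirectional interswitch links available in the system."""
    total_ports = num_switches * ports_per_switch
    free_ports = total_ports - num_processors  # one port per attached processor
    used_ports = free_ports * connectivity_percent // 100
    return used_ports // 2  # each link occupies a port on two switches


def enough_links_to_connect(num_switches, links):
    """Necessary (not sufficient) condition for a connected switch network."""
    return links >= num_switches - 1


if __name__ == "__main__":
    for switches, ports in ((16, 4), (8, 8), (4, 16)):
        for percent in (75, 100):
            links = interswitch_links(switches, ports, 32, percent)
            print(switches, ports, percent, links,
                  enough_links_to_connect(switches, links))
```

For sixteen 4-port switches at 75 percent connectivity this gives 12 links, fewer than the 15 needed, whereas 100 percent connectivity gives 16 links.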

6.4.4 Effect of Communication Start-up Time

Fig. 24 shows the results of our experiments under variation of the start-up overhead time: 5.0, 10.0, and 20.0 microseconds. As expected, the performance of the SPCCO algorithm improves compared to that of the CCO algorithm with increase in degree of multicast. With increase in start-up time, the SPCCO algorithm outperforms the CCO and other algorithms. The reason is as follows: A higher start-up time reduces the effect of link contention due to the fact that contention occurs during the propagation time of messages in the network. Thus, if the start-up overhead substantially dominates the propagation time, the effect of link contention is reduced. Also, with increasing start-up time, the effect of node contention is accentuated. Therefore, an increase in start-up time favors the SPCCO algorithm.

It should also be noted that, with an increase in start-up time, the latency-throughput curves do not have a well-defined knee to indicate the saturation point. This can be seen especially in the graphs with start-up time = 20 microseconds. This is due to the fact that the start-up overhead dominates the propagation time and this results in the network not saturating easily. A similar trend is seen (and explained) for small message lengths in Section 6.4.1.

6.4.5 Evaluation with Zero Start-Up Time

Fig. 25 shows the results of our experiments with the start-up time set to zero. These results are presented to give an idea of the throughput obtainable from the proposed multicast algorithms under the ideal assumption of zero start-up time. This assumption unfairly highlights the link contention in each of the algorithms and gives a clear picture of how much the CCO algorithm succeeds in reducing link contention.

It can be clearly seen in Fig. 25 that the CCO algorithm substantially outperforms the remaining algorithms. In fact, the saturation applied load for the CCO algorithm is 50 percent, 50 percent, and 100 percent more than that for the SPCCO, SO, and RO algorithms, respectively.

6.5 Summary of Results

In summary, the CCO algorithm performs significantly better than the RO, SO, and SHO algorithms and marginally better than the SPCCO algorithm for the case of single


Fig. 24. Multicast latency versus applied load for 15-way and 27-way multicasts with varying communication start-up time: (a) 15-way; start-up time = 5, (b) 15-way; start-up time = 10, and (c) 15-way; start-up time = 20 microseconds; (d) 27-way; start-up time = 5, (e) 27-way; start-up time = 10, and (f) 27-way; start-up time = 20 microseconds.

Fig. 25. Multicast latency versus applied load for 15-way and 27-way multicasts with communication start-up time equal to zero: (a) 15-way multicast; start-up time = 0.0. (b) 27-way multicast; start-up time = 0.0.


multicast. The difference in performance of these algorithms increases with increase in message length, decrease in communication start-up time, decrease in switch size, and decrease in degree of network connectivity. Also, the CCO algorithm scales very well with system size: its relative performance with respect to the other algorithms improves with increase in system size. Also, the relative performance of these algorithms does not change with increase in buffer size. This leads to the important conclusion that contention is still an important factor in the design of efficient multicast algorithms for systems with large input buffers in switches. In the case of multiple multicast, 1) the SPCCO algorithm outperforms the CCO algorithm when node contention dominates (with higher degree of multicast and larger switches) and 2) the CCO algorithm outperforms the SPCCO algorithm when link contention dominates (with longer messages and lower communication start-up time). Therefore, when designing efficient collective communication support, it is recommended that either the SPCCO algorithm or the CCO algorithm be used judiciously, depending on the technological parameters (like communication start-up time and switch size) and characteristics of the application (like message length and multicast degree).

7 CONCLUSIONS

In this paper, we have shown efficient ways of implementing multicast on the emerging irregular switch-based cut-through networks using UD routing and unicast message passing. First, we have proven that it is not possible to construct a complete ordered chain of destinations to implement multicast in a contention-free manner with optimal number of communication steps. Then, we have proposed three new multicast algorithms (SO, SHO, and CCO) with their respective orderings of destinations. We have discussed the problem of node contention for multiple multicast traffic and proposed the SPCCO algorithm for efficient multicast in such traffic.

These algorithms, together with a naive random ordering (RO), have been evaluated through simulation for a wide range of system sizes, message lengths, switch sizes, input buffer sizes, degrees of connectivity, destination set sizes, and communication start-up times. The simulation results demonstrate the CCO algorithm to be the best for a wide range of system and technological parameters in the single multicast scenario. This algorithm implements multicast with the least amount of contention and minimum latency. The SO algorithm does better than the SHO algorithm for small sizes of destination sets. However, the SHO outperforms the SO as the system size and the number of destinations increase. Overall, for relatively large systems and a large number of destinations, the four algorithms have been demonstrated to perform in the following order: CCO (best) > SHO > SO > RO (worst). We have also clearly demonstrated that reducing link contention should be a major focus during the design of efficient multicast algorithms, even for systems with large input buffers in the switches. This is because increasing input buffers in switches only shifts the contention from the links to the buffers, but does not reduce the multicast latency. In the case of multiple multicast traffic, we have shown that the SPCCO outperforms the CCO algorithm with higher degree of multicast and larger switches and the CCO algorithm outperforms the SPCCO algorithm with increase in message length and decrease in communication start-up time.

As the network/cluster of workstations platform gradually becomes a more popular alternative for high performance computing, the importance of efficient multicasting on such systems will prove to be critical to the overall performance of the system. With a wealth of research focused on reducing the software start-up overhead at the host workstations, reducing contention while designing efficient multicast algorithms is unavoidable, even for systems with large input buffers in switches. Therefore, the CCO and SPCCO algorithms demonstrate significant potential to be applied to current and future generation networks of workstations with irregular interconnection. Also, it will be an interesting exercise to extend this framework to see how other collective communication operations, like barrier synchronization, complete exchange, etc., can be implemented on irregular networks with low latency.

ACKNOWLEDGMENTS

The authors would like to thank Kiran Bondalapati, who collaborated in the earlier version of this work [15]. The authors would also like to thank other members of the Parallel Architecture and Communication (PAC) research group in the department for providing comments, criticisms, and suggestions to this work. This research was supported in part by US National Science Foundation Career Award MIP-9502294, US National Science Foundation Grant CCR-9704512, an Ohio State University Presidential Fellowship, and an Ohio Board of Regents Collaborative Research Grant. A preliminary version of this paper has been presented at the International Symposium on High Performance Computer Architecture (HPCA-3), Feb. 1997 [15]. This work was done while Ram Kesavan was a graduate student at The Ohio State University. A number of related papers and technical reports are available electronically through the home page of the Parallel Architecture and Communication (PAC) research group. The URL is http://www.cis.ohio-state.edu/~panda/pac.html.

REFERENCES

[1] B. Abali, "A Deadlock Avoidance Method for Computer Networks," Proc. First Int'l Workshop Comm. and Architectural Support for Network-Based Parallel Computing (CANPC '97), pp. 61-72, Feb. 1997.

[2] N.J. Boden et al., "Myrinet: A Gigabit-per-Second Local Area Network," IEEE Micro, pp. 29-35, Feb. 1995.

[3] R.V. Boppana, S. Chalasani, and C.S. Raghavendra, "On Multicast Wormhole Routing in Multicomputer Networks," Proc. Symp. Parallel and Distributed Processing, pp. 722-729, 1994.

[4] J. Bruck, R. Cypher, and C.-T. Ho, "Multiple Message Broadcasting with Generalized Fibonacci Trees," Proc. Symp. Parallel and Distributed Processing, pp. 424-430, 1992.


[5] D. Buntinas, D.K. Panda, J. Duato, and P. Sadayappan, "Broadcast/Multicast over Myrinet Using NIC-Assisted Multidestination Messages," Proc. Fourth Int'l Workshop Comm., Architecture, and Applications for Network-Based Parallel Computing (CANPC '00), Jan. 2000.

[6] L. Cherkasova, V. Kotov, and T. Rokicki, "Fibre Channel Fabrics: Evaluation and Design," Proc. 29th Hawaii Int'l Conf. System Sciences, Feb. 1995.

[7] J. Cohen, P. Fraigniaud, J.C. Konig, and A. Raspaud, "Optimized Broadcasting and Multicasting Protocols in Cut-Through Routed Networks," IEEE Trans. Parallel and Distributed Systems, vol. 9, no. 8, pp. 788-802, Aug. 1998.

[8] L. De Coster, N. Dewulf, and C.-T. Ho, "Efficient Multi-Packet Multicast Algorithms on Meshes with Wormhole and Dimension-Ordered Routing," Proc. Int'l Conf. Parallel Processing, vol. III, pp. 137-141, Aug. 1995.

[9] J. Duato, S. Yalamanchili, and L. Ni, Interconnection Networks: An Engineering Approach. Los Alamitos, Calif.: IEEE CS Press, 1997.

[10] E.W. Felten, R.A. Alpert, A. Bilas, M.A. Blumrich, D.W. Clark, S.N. Damianakis, C. Dubnicki, L. Iftode, and K. Li, "Early Experience with Message-Passing on the SHRIMP Multicomputer," Proc. Int'l Symp. Computer Architecture (ISCA), pp. 296-307, 1996.

[11] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, "A High-Performance, Portable Implementation of the MPI, Message Passing Interface Standard," Parallel Computing, vol. 22, no. 6, pp. 789-828, Sept. 1996.

[12] R. Horst, "ServerNet Deadlock Avoidance and Fractahedral Topologies," Proc. Int'l Parallel Processing Symp., pp. 274-280, 1996.

[13] Intel Corporation, Paragon XP/S Product Overview, 1991.

[14] S.L. Johnsson and C.-T. Ho, "Optimum Broadcasting and Personalized Communication in Hypercubes," IEEE Trans. Computers, vol. 38, no. 9, pp. 1249-1268, Sept. 1989.

[15] R. Kesavan, K. Bondalapati, and D.K. Panda, "Multicast on Irregular Switch-Based Networks with Wormhole Routing," Proc. Int'l Symp. High Performance Computer Architecture (HPCA-3), pp. 48-57, Feb. 1997.

[16] R. Kesavan and D.K. Panda, "Minimizing Node Contention in Multiple Multicast on Wormhole k-Ary n-Cube Networks," Proc. Int'l Conf. Parallel Processing, vol. I, pp. 188-195, Aug. 1996.

[17] R. Kesavan and D.K. Panda, "Optimal Multicast with Packetization and Network Interface Support," Proc. Int'l Conf. Parallel Processing, pp. 370-377, Aug. 1997.

[18] R. Kesavan and D.K. Panda, "Multiple Multicast with Minimized Node Contention on Wormhole k-ary n-cube Networks," IEEE Trans. Parallel and Distributed Systems, vol. 10, no. 4, pp. 371-393, Apr. 1999.

[19] R. Libeskind-Hadas, D. Mazzoni, and R. Rajagopalan, "Optimal Contention-Free Unicast-Based Multicasting in Switch-Based Networks of Workstations," Proc. Merged 12th Int'l Parallel Processing Symp. and Ninth Symp. Parallel and Distributed Processing, pp. 358-364, Apr. 1998.

[20] X. Lin and L.M. Ni, "Deadlock-Free Multicast Wormhole Routing in Multicomputer Networks," Proc. Int'l Symp. Computer Architecture, pp. 116-124, 1991.

[21] P.K. McKinley and D.F. Robinson, "Collective Communication in Wormhole-Routed Massively Parallel Computers," Computer, pp. 39-50, Dec. 1995.

[22] P.K. McKinley, H. Xu, A.-H. Esfahanian, and L.M. Ni, "Unicast-Based Multicast Communication in Wormhole-Routed Networks," IEEE Trans. Parallel and Distributed Systems, vol. 5, no. 12, pp. 1252-1265, Dec. 1994.

[23] Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, Mar. 1994.

[24] L. Ni and P.K. McKinley, "A Survey of Wormhole Routing Techniques in Direct Networks," Computer, pp. 62-76, Feb. 1993.

[25] S. Pakin, M. Lauria, and A. Chien, "High Performance Messaging on Workstations: Illinois Fast Messages (FM)," Proc. Supercomputing, 1995.

[26] D.K. Panda, "Issues in Designing Efficient and Practical Algorithms for Collective Communication in Wormhole-Routed Systems," Proc. ICPP Workshop Challenges for Parallel Processing, pp. 8-15, 1995.

[27] D.K. Panda, D. Basak, D. Dai, R. Kesavan, R. Sivaram, M. Banikazemi, and V. Moorthy, "Simulation of Modern Parallel Systems: A CSIM-Based Approach," Proc. 1997 Winter Simulation Conf. (WSC '97), pp. 1013-1020, Dec. 1997.

[28] D.K. Panda, S. Singal, and R. Kesavan, "Multidestination Message Passing in Wormhole k-Ary n-Cube Networks with Base Routing Conformed Paths," IEEE Trans. Parallel and Distributed Systems, vol. 10, no. 1, pp. 76-96, Jan. 1999.

[29] W. Qiao and L.M. Ni, "Adaptive Routing in Irregular Networks Using Cut-Through Switches," Proc. Int'l Conf. Parallel Processing, vol. I, pp. 52-60, Aug. 1996.

[30] M.D. Schroeder et al., "Autonet: A High-Speed, Self-Configuring Local Area Network Using Point-to-Point Links," Technical Report SRC Research Report 59, Digital Equipment Corp., Apr. 1990.

[31] S.L. Scott and G.M. Thorson, "The Cray T3E Network: Adaptive Routing in a High Performance 3D Torus," Proc. Symp. High Performance Interconnects (Hot Interconnects 4), pp. 147-156, Aug. 1996.

[32] F. Silla, M.P. Malumbres, A. Robles, P. Lopez, and J. Duato, "Efficient Adaptive Routing in Networks of Workstations with Irregular Topology," Proc. First Int'l Workshop Comm. and Architectural Support for Network-Based Parallel Computing (CANPC '97), pp. 46-60, Feb. 1997.

[33] R. Sivaram, R. Kesavan, D.K. Panda, and C.B. Stunkel, "Architectural Support for Efficient Multicasting in Irregular Networks," IEEE Trans. Parallel and Distributed Systems, vol. 12, no. 5, pp. 489-513, May 2001.

[34] R. Sivaram, R. Kesavan, D.K. Panda, and C.B. Stunkel, "Where to Provide Support for Efficient Multicasting in Irregular Networks: Network Interface or Switch?" Proc. 27th Int'l Conf. Parallel Processing (ICPP '98), pp. 452-459, Aug. 1998.

[35] R. Sivaram, C.B. Stunkel, and D.K. Panda, "HIPIQS: A High Performance Switch Architecture Using Input Queuing," Proc. 12th Int'l Parallel Processing Symp., pp. 134-143, Apr. 1998.

[36] R. Sivaram, C.B. Stunkel, and D.K. Panda, "Implementing Multi-Destination Worms in Switch-Based Parallel Systems: Architectural Alternatives and Their Impact," IEEE Trans. Parallel and Distributed Systems, vol. 11, no. 8, pp. 794-812, Aug. 2000.

[37] M. Snir, S.W. Otto, S. Huss-Lederman, D.W. Walker, and J. Dongarra, MPI: The Complete Reference. MIT Press, 1996.

[38] C.B. Stunkel, D. Shea, D.G. Grice, P.H. Hochschild, and M. Tsao, "The SP1 High Performance Switch," Proc. Scalable High Performance Computing Conf., pp. 150-157, 1994.

[39] C.B. Stunkel et al., "The SP2 High-Performance Switch," IBM System J., vol. 34, no. 2, pp. 185-204, 1995.

[40] C.B. Stunkel, R. Sivaram, and D.K. Panda, "Implementing Multi-Destination Worms in Switch-Based Parallel Systems: Architectural Alternatives and Their Impact," Proc. 24th IEEE/ACM Ann. Int'l Symp. Computer Architecture (ISCA-24), pp. 50-61, June 1997.

[41] K. Verstoep, K. Langendoen, and H. Bal, "Efficient Reliable Multicast on Myrinet," Proc. Int'l Conf. Parallel Processing, vol. III, pp. 156-165, Aug. 1996.

[42] T. von Eicken, A. Basu, V. Buch, and W. Vogels, "U-Net: A User-Level Network Interface for Parallel and Distributed Computing," Proc. ACM Symp. Operating Systems Principles, 1995.

[43] T. von Eicken, D.E. Culler, S.C. Goldstein, and K.E. Schauser, "Active Messages: A Mechanism for Integrated Communication and Computation," Int'l Symp. Computer Architecture, pp. 25-266, 1992.


Ram Kesavan received the BTech degree in computer science and engineering from the Indian Institute of Technology, Madras, in 1993 and the PhD degree in computer science from Ohio State University in 1998. He is currently a member of the technical staff in the Content Distribution Business Unit of Network Appliance, Inc. His research interests include operating systems support for efficient interprocessor communication, parallel architecture, networks of workstations, and high performance communication libraries.

Dhabaleswar K. Panda (S'88-M'92) received the BTech degree in electrical engineering from the Indian Institute of Technology, Kanpur, India, in 1984, the ME degree in electrical and communication engineering from the Indian Institute of Science, Bangalore, India, in 1986, and the PhD degree in computer engineering from the University of Southern California, in 1991. He is an associate professor in the Department of Computer and Information Science, Ohio State University, Columbus. His research interests include parallel computer architecture, wormhole-routing, interprocessor communication, collective communication, network-based computing, quality of service, and resource management. He has published more than 90 papers in major journals and international conferences related to these research areas. Dr. Panda has served on program committees and organizing committees of several parallel processing conferences. He was a program cochair of the 1999 International Conference on Parallel Processing, the founding cochair of the 1997 and 1998 Workshops on Communication and Architectural Support for Network-Based Parallel Computing (CANPC), and a coguest editor for two special issue volumes of the Journal of Parallel and Distributed Computing on workstation clusters and network-based computing. He also served as an IEEE Distinguished Visitor Speaker and an IEEE Chapters Tutorials Program Speaker during 1997-2000. Currently, he is serving as an associate editor of the IEEE Transactions on Parallel and Distributed Computing, general cochair of the 2001 International Conference on Parallel Processing, and program cochair of the 2001 Workshop on Communication Architecture for Clusters (CAC). Dr. Panda is a recipient of the US National Science Foundation Faculty Early CAREER Development Award, the Lumley Research Award at Ohio State University, and an Ameritech Faculty Fellow Award. He is a senior member of the IEEE, a member of the IEEE Computer Society, and a member of the ACM.

