
830 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 10, NO. 6, DECEMBER 2002

Concurrent Round-Robin-Based Dispatching Schemes for Clos-Network Switches

Eiji Oki, Member, IEEE, Zhigang Jing, Member, IEEE, Roberto Rojas-Cessa, Member, IEEE, and H. Jonathan Chao, Fellow, IEEE

Abstract—A Clos-network switch architecture is attractive because of its scalability. Previously proposed implementable dispatching schemes from the first stage to the second stage, such as random dispatching (RD), are not able to achieve high throughput unless the internal bandwidth is expanded. This paper presents two round-robin-based dispatching schemes to overcome the throughput limitation of the RD scheme. First, we introduce a concurrent round-robin dispatching (CRRD) scheme for the Clos-network switch. The CRRD scheme provides high switch throughput without expanding internal bandwidth. CRRD implementation is very simple because only simple round-robin arbiters are adopted. We show via simulation that CRRD achieves 100% throughput under uniform traffic. When the offered load reaches 1.0, the pointers of round-robin arbiters at the first- and second-stage modules are completely desynchronized and contention is avoided. Second, we introduce a concurrent master–slave round-robin dispatching (CMSD) scheme as an improved version of CRRD to make it more scalable. CMSD uses hierarchical round-robin arbitration. We show that CMSD preserves the advantages of CRRD, reduces the scheduling time by 30% or more when arbitration time is significant, and has a dramatically reduced number of crosspoints of the interconnection wires between round-robin arbiters in the dispatching scheduler, with a reduction ratio that shrinks as the switch size N grows. This makes CMSD easier to implement than CRRD when the switch size becomes large.

Index Terms—Arbitration, Clos-network switch, dispatching, packet switch, throughput.

I. INTRODUCTION

AS THE Internet continues to grow, high-capacity switches and routers are needed for backbone networks. Several approaches have been presented for high-speed packet switching systems [2], [3], [13]. Most high-speed packet switching systems use a fixed-sized cell in the switch fabric. Variable-length packets are segmented into several fixed-sized cells when they arrive, switched through the switch fabric and reassembled into packets before they depart.
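The segmentation-and-reassembly step described above can be sketched in a few lines. This is a minimal illustration under assumed parameters (the 64-byte cell size, the zero padding, and the out-of-band length are our choices, not the paper's):

```python
CELL = 64  # assumed fixed cell size in bytes

def segment(packet: bytes) -> list[bytes]:
    """Split a variable-length packet into fixed-size cells, padding the last."""
    if not packet:
        return []
    cells = [packet[i:i + CELL] for i in range(0, len(packet), CELL)]
    cells[-1] = cells[-1].ljust(CELL, b"\x00")  # pad the tail cell
    return cells

def reassemble(cells: list[bytes], length: int) -> bytes:
    """Concatenate the cells and strip the padding using the original length."""
    return b"".join(cells)[:length]
```

In a real fabric the original length (or a padding marker) would travel in a cell header; here it is passed out of band for brevity.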

Manuscript received December 14, 2000; revised November 1, 2001; approved by IEEE/ACM TRANSACTIONS ON NETWORKING Editor N. McKeown. This work was supported in part by the New York State Center for Advanced Technology in Telecommunications (CATT), and in part by the National Science Foundation under Grant 9906673 and Grant ANI-9814856.

E. Oki was with the Polytechnic University of New York, Brooklyn, NY 11201 USA. He is now with the NTT Network Innovation Laboratories, Tokyo 180-8585, Japan (e-mail: [email protected]).

Z. Jing and H. J. Chao are with the Department of Electrical Engineering, Polytechnic University of New York, Brooklyn, NY 11201 USA (e-mail: [email protected]; [email protected]).

R. Rojas-Cessa was with the Department of Electrical Engineering, Polytechnic University of New York, Brooklyn, NY 11201 USA. He is now with the Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark, NJ 07102 USA (e-mail: [email protected]).

Digital Object Identifier 10.1109/TNET.2002.804823

For implementation in a high-speed switching system, there are mainly two approaches. One is a single-stage switch architecture. An example of the single-stage architecture is a crossbar switch. Identical switching elements are arranged on a matrix plane. Several high-speed crossbar switches are described in [4], [13], [16]. However, the number of I/O pins in a crossbar chip limits the switch size. This makes a large-scale switch difficult to implement cost effectively as the number of chips becomes large.

The other approach is to use a multiple-stage switch architecture, such as a Clos-network switch [7]. The Clos-network switch architecture, which is a three-stage switch, is very attractive because of its scalability. We can categorize the Clos-network switch architecture into two types. One has buffers to store cells in the second-stage modules and the other has no buffers in the second-stage modules.

Reference [2] demonstrated a gigabit asynchronous transfer mode (ATM) switch using buffers in the second stage. In this architecture, every cell is randomly distributed from the first stage to the second-stage modules to balance the traffic load in the second stage. The purpose of implementing buffers in the second-stage modules is to resolve contention among cells from different first-stage modules [18]. The internal speed-up factor was suggested to be set to more than 1.25 in [2]. However, this requires a re-sequence function at the third-stage modules or the latter modules because the buffers in the second-stage modules cause an out-of-sequence problem. Furthermore, as the port speed increases, this re-sequence function makes a switch more difficult to implement.

Reference [6] developed an ATM switch by using nonbuffered second-stage modules. This approach is promising even if the port line speed increases because it does not cause the out-of-sequence problem. Since there is no buffer in the second-stage modules to resolve the contention, how to dispatch cells from the first to second stage becomes an important issue. There are some buffers to absorb contention at the first and third stages.1

A random dispatching (RD) scheme is used for cell dispatching from the first to second stage [6], as it is adopted in the case of the buffered second-stage modules in [2]. This scheme attempts to distribute traffic evenly to the second-stage modules. However, RD is not able to achieve a high throughput unless the internal bandwidth is expanded, because the contention at the second stage cannot be avoided. To achieve 100%

1Although there are several studies on scheduling algorithms [9], [10] where every stage has no buffer, the assumption used in that literature is out of the scope of this paper. This is because the assumption requires a contention resolution function for output ports (OPs) before cells enter the Clos-network switch. This function is not easy to implement in high-speed switching systems.

1063-6692/02$17.00 © 2002 IEEE



throughput under uniform traffic2 by using RD, the internal expansion ratio has to be set to approximately 1.6 when the switch size is large [6].

One question arises: Is it possible to achieve a high throughput by using a practical dispatching scheme without allocating any buffers in the second stage to avoid the out-of-sequence problem and without expanding the internal bandwidth?

This paper presents two solutions to this question. First, we introduce an innovative dispatching scheme called the concurrent round-robin dispatching (CRRD) scheme for a Clos-network switch. The basic idea of the novel CRRD scheme is to use the desynchronization effect [11] in the Clos-network switch. The desynchronization effect has been studied using such simple scheduling algorithms as iSLIP [11], [14] and dual round-robin matching (DRRM) [4], [5] in an input-queued crossbar switch. CRRD provides high switch throughput without increasing internal bandwidth, and its implementation is very simple because only simple round-robin arbiters are employed. We show via simulation that CRRD achieves 100% throughput under uniform traffic. With slightly unbalanced traffic, we also show that CRRD provides better performance than RD.

Second, this paper describes a scalable round-robin-based dispatching scheme, called the concurrent master–slave round-robin dispatching (CMSD) scheme. CMSD is an improved version of CRRD that provides more scalability. To make CRRD more scalable while preserving CRRD's advantages, we introduce two sets of output-link round-robin arbiters in the first-stage module: master arbiters and slave arbiters. These two sets of arbiters operate in a hierarchical round-robin manner. The dispatching scheduling time is reduced, as is the interconnection complexity of the dispatching scheduler. This makes the hardware of CMSD easier to implement than that of CRRD when the switch size becomes large. In addition, CMSD preserves the advantage of CRRD that the desynchronization effect is obtained in a Clos-network switch. Simulation suggests that CMSD also achieves 100% throughput under uniform traffic without expanding internal switch capacity.

Fig. 1 categorizes several scheduling schemes and shows the analogy of our proposed schemes. In crossbar switches, iSLIP, a round-robin-based algorithm, was developed to overcome the throughput limitation of the parallel iterative matching (PIM) algorithm [1], which uses randomness. In Clos-network switches, CRRD and CMSD, two round-robin-based algorithms, are developed to overcome the throughput limitation and the high implementation complexity of RD, which also uses randomness.

The remainder of this paper is organized as follows. Section II describes a Clos-network switch model that we reference throughout this paper. Section III explains the throughput limitation of RD. Section IV introduces CRRD. Section V describes CMSD as an improved version of CRRD. Section VI presents a performance study of CRRD and CMSD. Section VII discusses

2A switch can achieve 100% throughput under uniform traffic if the switch is stable. The stable state is defined in [12] and [15]. Note that the 100% definition in [15] is more general than the one used here because all independent and admissible arrivals are considered in [15].

Fig. 1. Analogy among scheduling schemes.

Fig. 2. Clos-network switch with VOQs in the IMs.

implementation of CRRD and CMSD. Section VIII summarizes the key points.

II. CLOS-NETWORK SWITCH MODEL

Fig. 2 shows a three-stage Clos-network switch. The terminology used in this paper is as follows.

IM: Input module at the first stage.
CM: Central module at the second stage.
OM: Output module at the third stage.
n: Number of input ports (IPs)/OPs in each IM/OM, respectively.
k: Number of IMs/OMs.
m: Number of CMs.
i: IM number, where 0 ≤ i ≤ k − 1.
j: OM number, where 0 ≤ j ≤ k − 1.
h: IP/OP number in each IM/OM, respectively, where 0 ≤ h ≤ n − 1.
r: CM number, where 0 ≤ r ≤ m − 1.
IM(i): (i + 1)th IM.
CM(r): (r + 1)th CM.
OM(j): (j + 1)th OM.
IP(i, h): (h + 1)th IP at IM(i).
OP(j, h): (h + 1)th OP at OM(j).
VOQ(i, j, h): Virtual output queue (VOQ) at IM(i) that stores cells destined for OP(j, h).
L_I(i, r): Output link at IM(i) that is connected to CM(r).
L_C(r, j): Output link at CM(r) that is connected to OM(j).



The first stage consists of k IMs, each of which has an n × m dimension. The second stage consists of m bufferless CMs, each of which has a k × k dimension. The third stage consists of k OMs, each of which has an m × n dimension.

An IM(i) has nk VOQs to eliminate head-of-line (HOL) blocking. A VOQ is denoted as VOQ(i, j, h). Each VOQ(i, j, h) stores cells that go from IM(i) to OP(j, h) at OM(j). A VOQ can receive at most n cells from the n IPs and can send one cell to a CM in one cell time slot.3

Each IM(i) has m output links. Each output link L_I(i, r) is connected to CM(r).

A CM(r) has k output links, each of which is denoted as L_C(r, j) and is connected to one of the k OMs, OM(j).

An OM(j) has n OPs, each of which is OP(j, h) and has an output buffer.4 Each output buffer receives at most m cells in one cell time slot, and each OP at the OM forwards one cell in a first-in-first-out (FIFO) manner to the output line.5
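As a compact summary of the model above, the switch is fully described by the triple (n, m, k). A small sketch (the class name and fields are our own, not from the paper):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ClosSwitch:
    n: int  # input/output ports per IM/OM
    m: int  # number of CMs (= output links L_I per IM)
    k: int  # number of IMs/OMs

    @property
    def ports(self) -> int:
        """Total switch size N = n * k."""
        return self.n * self.k

    @property
    def voqs_per_im(self) -> int:
        """One VOQ(i, j, h) per output port OP(j, h): n * k VOQs per IM."""
        return self.n * self.k

sw = ClosSwitch(n=8, m=8, k=8)
assert sw.ports == 64 and sw.voqs_per_im == 64
```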

III. RD SCHEME

The ATLANTA switch, developed by Lucent Technologies, uses random selection in its dispatching algorithm [6]. This section describes the basic concept and performance characteristics of the RD scheme. This description helps in understanding the CRRD and CMSD schemes.

A. RD Algorithm

Two phases are considered for dispatching from the first to second stage. The details of RD are described in [6]. We show an example of RD in Section III-B. In phase 1, up to m VOQs from each IM are selected as candidates and each selected VOQ is assigned to an IM output link. A request that is associated with this output link is sent from the IM to the CM. This matching between VOQs and output links is performed only within the IM. The number of VOQs that are chosen in this matching is always the minimum of m and the number of nonempty VOQs. In phase 2, each selected VOQ that is associated with an IM output link sends a request from the IM to the CM. CMs respond with the arbitration results to IMs so that the matching between IMs and CMs can be completed.

• Phase 1: Matching within IM
— Step 1: At each time slot, nonempty VOQs send requests for candidate selection.
— Step 2: IM(i) selects up to m requests out of the nonempty VOQs. For example, a round-robin arbitration can be employed for this selection [6]. Then, IM(i) proposes up to m candidate cells to m randomly selected CMs.

3An L-bit cell must be written to or read from a VOQ memory in a time less than L/[C(n + 1)], where C is the line speed. For example, when L = 64 × 8 bits, C = 10 Gbit/s and n = 8, L/[C(n + 1)] is 5.6 ns. This is feasible when we consider currently available CMOS technologies.

4We assume that the output buffer size at OP(j, h) is large enough to avoid cell loss without flow control between IMs and OMs, so that we can focus the discussion on the properties of dispatching schemes in this paper. However, a flow control mechanism can be adopted between IMs and OMs to avoid cell loss when the output buffer size is limited.

5Similar to the VOQ memory, an L-bit cell must be written to or read from an output memory in a time less than L/[C(m + 1)].

Fig. 3. Example of RD scheme (n = m = k = 2).

• Phase 2: Matching between IM and CM
— Step 1: A request that is associated with L_I(i, r) is sent out to the corresponding CM(r). A CM(r) has k output links L_C(r, j), each of which corresponds to an OM(j). An arbiter that is associated with L_C(r, j) selects one request among those received. A random-selection scheme is used for this arbitration.6 CM(r) sends up to k grants, each of which is associated with one L_C(r, j), to the corresponding IMs.
— Step 2: If a VOQ at the IM receives the grant from the CM, it sends the corresponding cell at the next time slot. Otherwise, the VOQ will be a candidate again at step 2 in phase 1 at the next time slot.
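The two-phase procedure above can be modeled in a few lines. This is our simplified sketch, not the ATLANTA implementation: the per-IM candidate selection of phase 1 is reduced to plain random sampling, and the names and data layout are assumptions:

```python
import random

def rd_dispatch(occupied, m, k, rng):
    """One time slot of random dispatching.

    occupied[i]: set of (j, h) destinations with a nonempty VOQ in IM(i).
    Returns the set of (i, j, h) VOQs granted by the CM arbiters.
    """
    # Phase 1: each IM picks up to m candidate VOQs and assigns each to a
    # distinct, randomly chosen CM (i.e., to one IM output link L_I(i, r)).
    requests = {}  # (r, j) -> competing requests for CM output link L_C(r, j)
    for i in range(k):
        cands = rng.sample(sorted(occupied[i]), min(m, len(occupied[i])))
        for (j, h), r in zip(cands, rng.sample(range(m), len(cands))):
            requests.setdefault((r, j), []).append((i, j, h))
    # Phase 2: each requested CM output link grants one request at random.
    return {rng.choice(reqs) for reqs in requests.values()}
```

Driving this model at saturation with n = m = k = 2 reproduces the contention analyzed in the next subsection: on average only about three of the four output links carry a cell per slot.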

B. Performance of RD Scheme

Although RD can dispatch cells evenly to CMs, a high switch throughput cannot be achieved, due to the contention at the CM, unless the internal bandwidth is expanded. Fig. 3 shows an example of the throughput limitation in the case of n = m = k = 2. Let us assume that every VOQ is always occupied with cells.

Each VOQ sends a request for a candidate at every time slot. We estimate how much utilization an output link L_C(r, j) can achieve for the cell transmission.7 Since the utilization of every L_C(r, j) is the same, we focus only on a single one, i.e., L_C(0, 0). The link utilization of L_C(0, 0) is obtained from the sum of the cell-transmission rates of VOQ(0, 0, 0) and VOQ(0, 0, 1). First, we estimate how much traffic VOQ(0, 0, 0) can send through L_C(0, 0). The probability that VOQ(0, 0, 0) uses L_I(0, 0) to request for L_C(0, 0) is 1/4, because there are four VOQs in IM(0). Consider that VOQ(0, 0, 0) requests for L_C(0, 0) using L_I(0, 0). If either VOQ(1, 0, 0) or VOQ(1, 0, 1) requests for L_C(0, 0) through L_I(1, 0), a contention occurs with the request by VOQ(0, 0, 0) at CM(0). The aggregate probability that either VOQ(1, 0, 0) or VOQ(1, 0, 1), among the four VOQs in IM(1), requests for L_C(0, 0) through L_I(1, 0) is 1/2. In this case, the winning probability of VOQ(0, 0, 0) is 1/2. If there is no contention for L_C(0, 0) caused by requests in IM(1), VOQ(0, 0, 0) can always send a cell with a probability of 1.0 without contention.

6Even when the round-robin scheme is employed as the contention resolution function for L_C(r, j) at CM(r), the result that we will discuss later is the same as for the random selection, due to the effect of the RD.

7The output-link utilization is defined as the probability that the output link is used in cell transmission. Since we assume that the traffic load destined for each OP is not less than 1.0, we consider the maximum switch throughput as the output-link utilization throughout this paper, unless specifically stated otherwise.



Therefore, the traffic that VOQ(0, 0, 0) can send through L_C(0, 0) is given as follows:

(1/4) × [(1/2) × (1/2) + (1/2) × 1.0] = 3/16.    (1)

Since VOQ(0, 0, 0) can use either L_I(0, 0) or L_I(0, 1) (i.e., two CMs), and VOQ(0, 0, 1) is in the same situation as VOQ(0, 0, 0) (i.e., two VOQs), the total link utilization of L_C(0, 0) is

(3/16) × 2 × 2 = 3/4.    (2)

Thus, in this example of n = m = k = 2, the switch can achieve a throughput of only 75%, unless the internal bandwidth is expanded.

This observation can be generally extended. We derive the formula that gives the maximum throughput ρ_max by using RD under uniform traffic in the following equation:

ρ_max = (m/n) × [1 − (1 − 1/k)^k].    (3)

We describe how to obtain (3) in detail in Appendix A. The factor m/n in (3) expresses the expansion ratio.

When the expansion ratio is 1.0 (i.e., m = n), the maximum throughput is only a function of k. As described in the above example of n = m = k = 2, the maximum throughput is 0.75. As k increases, the maximum throughput decreases. When k → ∞, the maximum throughput tends to 1 − 1/e ≈ 0.632 (see Appendix B). In other words, to achieve 100% throughput by using RD, the expansion ratio m/n has to be set to at least 1/(1 − 1/e) ≈ 1.582.
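The maximum RD throughput discussed above, ρ_max = (m/n)[1 − (1 − 1/k)^k] capped at 1.0, and its limits can be checked numerically. A short sketch (the function name is ours):

```python
import math

def rd_max_throughput(n, m, k):
    """Maximum RD throughput: a CM output link L_C(r, j) is idle only if
    none of the k IMs requests it (each does so with probability 1/k)."""
    return min(1.0, (m / n) * (1.0 - (1.0 - 1.0 / k) ** k))

assert rd_max_throughput(2, 2, 2) == 0.75           # the 75% example above
# With m = n, throughput falls toward 1 - 1/e ~ 0.632 as k grows ...
assert abs(rd_max_throughput(8, 8, 10**6) - (1 - math.exp(-1))) < 1e-3
# ... so an expansion ratio m/n of about 1.582 restores 100% throughput.
assert rd_max_throughput(1000, 1582, 10**6) == 1.0
```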

IV. CRRD SCHEME

A. CRRD Algorithm

The switch model described in this section is the same as the one described in Section II. However, to simplify the explanation of CRRD, the order of the VOQs VOQ(i, j, h) in IM(i) is rearranged for dispatching.

Fig. 4. CRRD scheme.

VOQ(i, j, h) is therefore redefined as VOQ(i, v), where v is the rearranged VOQ index and 0 ≤ v ≤ nk − 1.8

Fig. 4 illustrates the detailed CRRD algorithm with an example. To determine the matching between a request from VOQ(i, v) and the output link L_I(i, r), CRRD uses an iterative matching within IM(i). In IM(i), there are m output-link round-robin arbiters and nk VOQ round-robin arbiters. An output-link arbiter is associated with L_I(i, r) and has its own pointer. A VOQ arbiter is associated with VOQ(i, v) and has its own pointer. In CM(r), there are k round-robin arbiters, each of which corresponds to L_C(r, j) and has its own pointer.

As described in Section III-A, two phases are also considered in CRRD. In phase 1, CRRD employs an iterative matching by using round-robin arbiters to assign a VOQ to an IM output link. The matching between VOQs and output links is performed within the IM. It is based on round-robin arbitration and is similar to the iSLIP request/grant/accept approach. A major difference is that in CRRD, a VOQ sends a request to every output-link arbiter; in iSLIP, a VOQ sends a request to only the destined output-link arbiter. In phase 2, each selected VOQ that is matched with an output link sends a request from the IM to the CM. CMs send the arbitration results to IMs to complete the matching between the IM and CM.

• Phase 1: Matching within IM
— First iteration
* Step 1: Each nonempty VOQ sends a request to every output-link arbiter.
* Step 2: Each output-link arbiter chooses one nonempty VOQ request in a round-robin fashion by searching from the position of its pointer. It then sends the grant to the selected VOQ.
* Step 3: The VOQ arbiter sends the accept to one granting output-link arbiter among all those received, chosen in a round-robin fashion by searching from the position of its own pointer.
— Subsequent iterations9

8The purpose of redefining the VOQs is to easily obtain the desynchronization effect.

9The number of iterations is designed by considering the limitation of the arbitration time in advance. CRRD tries to choose as many nonempty VOQs in this matching as possible, but it does not always choose the maximum available number of nonempty VOQs if the number of iterations is not enough. On the other hand, RD can choose the maximum number of nonempty VOQs in phase 1.



Fig. 5. Example of desynchronization effect of CRRD (n = m = k = 2).

* Step 1: Each VOQ unmatched at the previous iterations sends another request to all unmatched output-link arbiters.
* Steps 2 and 3: The same procedure is performed as in the first iteration for matching between unmatched nonempty VOQs and unmatched output links.

• Phase 2: Matching between IM and CM
— Step 1: After phase 1 is completed, L_I(i, r) sends the request to CM(r). Each round-robin arbiter associated with L_C(r, j) then chooses one request by searching from the position of its pointer and sends the grant to the L_I(i, r) of the corresponding IM(i).
— Step 2: If the IM receives the grant from the CM, it sends the corresponding cell from that VOQ at the next time slot. Otherwise, the IM cannot send the cell at the next time slot. A request that is not granted by the CM will be attempted again at the next time slot, because the pointers that are related to the ungranted requests are not moved.

As with iSLIP, the round-robin pointers of the output-link and VOQ arbiters in IM(i) and of the arbiters in CM(r) are updated to one position after the granted position only if the matching within the IM is achieved at the first iteration in phase 1 and the request is also granted by the CM in phase 2.

Fig. 4 shows an example in which CRRD operates at the first iteration in phase 1. At step 1, the nonempty VOQs send requests to all the output-link arbiters. At step 2, each output-link arbiter selects one of the requesting VOQs according to its pointer's position. At step 3, a VOQ that receives grants from two output-link arbiters accepts one of them by using its own VOQ arbiter; a VOQ that receives a single grant simply accepts it. With one iteration, one output link cannot be matched with any nonempty VOQ. At the next iteration, the matching between the unmatched nonempty VOQs and this output link will be performed.
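The phase-1 iterative matching inside one IM can be sketched as follows. This is our simplified model (names and data layout are assumptions), and the pointer-update rule, which in CRRD moves pointers only after a successful CM grant in phase 2, is deliberately left out:

```python
def crrd_phase1(nonempty, m, link_ptr, voq_ptr, iters=1):
    """Iterative request/grant/accept matching inside one IM (sketch).

    nonempty: VOQ indices with cells; link_ptr[r] and voq_ptr[v] are the
    round-robin pointers of output-link arbiter r and VOQ arbiter v.
    Returns a {voq: link} matching.
    """
    nV = len(voq_ptr)
    match = {}                       # v -> r
    free_links = set(range(m))
    for _ in range(iters):
        unmatched = [v for v in nonempty if v not in match]
        if not unmatched or not free_links:
            break
        # Step 2: each free output-link arbiter grants one requesting VOQ,
        # searching round-robin from its pointer position.
        grants = {}                  # v -> links that granted v
        for r in sorted(free_links):
            for off in range(nV):
                v = (link_ptr[r] + off) % nV
                if v in unmatched:
                    grants.setdefault(v, []).append(r)
                    break
        # Step 3: each granted VOQ accepts one grant, searching round-robin
        # from its own pointer over the m link positions.
        for v, rs in grants.items():
            for off in range(m):
                r = (voq_ptr[v] + off) % m
                if r in rs:
                    match[v] = r
                    free_links.discard(r)
                    break
    return match
```

With all pointers at zero and every VOQ nonempty, one iteration matches only a single link (both link arbiters grant the same VOQ), as in the Fig. 4 discussion; a second iteration matches the remaining link.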

B. Desynchronization Effect of CRRD

While RD causes contention at the CM, as described in Appendix A, CRRD decreases contention at the CM because the pointers of the VOQ arbiters, the output-link arbiters, and the CM arbiters are desynchronized.

We demonstrate how the pointers are desynchronized by using a simple example. Let us consider the case of n = m = k = 2, as shown in Fig. 5. We assume that every VOQ is always occupied with cells. Each VOQ sends a request to be selected as a candidate at every time slot. All the pointers are set to zero at the initial state. Only one iteration in phase 1 is considered here.

At the first time slot, since all the pointers are set to zero, only one VOQ in IM(0), which is VOQ(0, 0), can send a cell, through L_I(0, 0) and L_C(0, 0). The pointers related to the grant are updated from zero to one.

At the second time slot, three VOQs can send cells, and the pointers related to the grants are updated. All four VOQs can send cells at the third time slot. In this situation, 100% switch throughput is achieved. From then on, there is no contention at all at the CMs because the pointers are desynchronized.

Similar to the above example, CRRD can achieve the desynchronization effect and provide high throughput even when the switch size is increased.
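The effect in this example can be reduced to a toy model: round-robin pointers that advance only on success spread out until they stop colliding. This abstraction is ours, not the paper's full CRRD state machine:

```python
def desync(k, slots):
    """k round-robin pointers compete for k resources; a pointer advances
    only when it wins its claimed resource. Returns winners per slot."""
    ptr = [0] * k
    wins = []
    for _ in range(slots):
        claims = {}                  # resource -> claiming arbiters
        for a in range(k):
            claims.setdefault(ptr[a], []).append(a)
        winners = [min(aa) for aa in claims.values()]  # lowest index wins ties
        for a in winners:
            ptr[a] = (ptr[a] + 1) % k  # only winners advance
        wins.append(len(winners))
    return wins

# e.g. desync(4, 10) ramps 1, 2, 3, then 4 winners in every later slot:
# after the transient, the pointers stay distinct and contention vanishes.
```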

V. CMSD SCHEME

CRRD overcomes the problem of the limited switch throughput of the RD scheme by using simple round-robin arbiters. CMSD is an improved version of CRRD that preserves CRRD's advantages and provides more scalability.

A. CMSD Algorithm

Two phases are also considered in the description of the CMSD scheme, as in the CRRD scheme.

Fig. 6. CMSD scheme.

The difference between CMSD and CRRD is how the iterative matching within the IM operates in phase 1.

We define several notations to describe the CMSD algorithm, which is illustrated with an example in Fig. 6. G(i, j) denotes a VOQ group that consists of the n VOQs VOQ(i, j, h). In IM(i), there are m master output-link round-robin arbiters, km slave output-link round-robin arbiters, and nk VOQ round-robin arbiters. Each master output-link arbiter is associated with L_I(i, r), is denoted as ML(i, r), and has its own pointer. Each slave output-link arbiter is associated with L_I(i, r) and G(i, j), is denoted as SL(i, j, r), and has its own pointer. A VOQ arbiter associated with VOQ(i, j, h) has its own pointer.

• Phase 1: Matching within IM
— First iteration
* Step 1: There are two sets of requests issued to the output-link arbiters. One set consists of requests sent from each nonempty VOQ(i, j, h) to every associated slave arbiter SL(i, j, r) within G(i, j). The other set consists of group-level requests sent from each G(i, j) that has at least one nonempty VOQ to every master arbiter ML(i, r).
* Step 2: Each ML(i, r) chooses one group-level request among the VOQ groups independently, in a round-robin fashion by searching from the position of its pointer, and sends the grant to the SL(i, j, r) that belongs to the selected G(i, j). SL(i, j, r) receives the grant from its master arbiter, selects one VOQ request in a round-robin fashion by searching from the position of its own pointer,10 and sends the grant to the selected VOQ.
* Step 3: The VOQ arbiter sends the accept to the granting master and slave output-link arbiters among all those received, chosen in a round-robin fashion by searching from the position of its own pointer.

— Subsequent iterations
* Step 1: Each VOQ unmatched at the previous iterations sends a request to all the slave output-link arbiters again. Each G(i, j) that has at least one unmatched nonempty VOQ sends a request to all the unmatched master output-link arbiters again.
* Steps 2 and 3: The same procedure is performed as in the first iteration for matching between the unmatched nonempty VOQs and unmatched output links.

• Phase 2: Matching between IM and CM
This operation is the same as in CRRD.

As with CRRD, the round-robin pointers of the master, slave, and VOQ arbiters in IM(i) and of the arbiters in CM(r) are updated to one position after the granted position only if the matching within the IM is achieved at the first iteration in phase 1 and the request is also granted by the CM in phase 2.

Fig. 6 shows an example in which CMSD is used at the first iteration in phase 1. At step 1,

10In the actual hardware design, SL(i, j, r) can start to search for a VOQ request without waiting to receive the grant from a master arbiter. If SL(i, j, r) does not receive a grant, the search by SL(i, j, r) is invalid. This is an advantage of CMSD. Section VII-A discusses the dispatching scheduling time.



Fig. 7. Example of the desynchronization effect (n = m = k = 2).

the nonempty VOQs send requests to their associated slave output-link arbiters, and the VOQ groups that contain at least one nonempty VOQ send requests to all the master output-link arbiters. At step 2, the master output-link arbiters each select a VOQ group according to their pointers' positions, and the slave output-link arbiters that belong to the selected VOQ groups each choose one requesting VOQ. At step 3, one VOQ receives two grants from two output links and accepts one of them by using its own VOQ arbiter. Another VOQ receives one grant and accepts it. In this iteration, one output link is not matched with any nonempty VOQ. At the next iteration, the matching between an unmatched nonempty VOQ and this output link will be performed.

B. Desynchronization Effect of CMSD

CMSD also decreases the contention at the CM, because the pointers PML(i, r), PSL(i, j, r), and PV(i, j, h), together with the CM pointers, become desynchronized.

We illustrate how the pointers are desynchronized by using a simple example. Let us consider the example of n = m = k = 2, as shown in Fig. 7. We assume that every VOQ is always occupied with cells. Each VOQ sends a request to be selected as a candidate at every time slot. All the pointers are set to zero at the initial state. Only one iteration in phase 1 is considered here.

At time slot T = 0, since all the pointers are set to zero, only one VOQ in IM(0), which is VOQ(0, 0, 0), can send a cell, through L_I(0, 0) and CM(0). The pointers related to that grant are updated from zero to one. At T = 1, three VOQs can send cells, and the pointers related to those grants are updated. Four VOQs can send cells at T = 2. In this situation, 100% switch throughput is achieved. There is no contention at the CMs from T = 2 on, because the pointers are desynchronized.

As in the above example, CMSD achieves the desynchronization effect and provides high throughput even when the switch size is increased, as CRRD does.
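The desynchronization process can be illustrated with a toy simulation. The following Python sketch abstracts the Fig. 7 scenario down to IM-side and CM-side pointers only (it is not the full CMSD scheme, and all names are ours): each IM requests the CM indicated by its pointer, each CM grants one requester per slot, and only granted IMs advance.

```python
def desync_demo(num_ims=2, num_cms=2, slots=4):
    """Toy abstraction of pointer desynchronization (not full CMSD):
    each IM requests the CM its round-robin pointer indicates, each CM
    grants one requesting IM per slot, and only granted IMs advance."""
    im_ptr = [0] * num_ims   # IM-side round-robin pointers
    cm_ptr = [0] * num_cms   # CM-side round-robin pointers
    granted_per_slot = []
    for _ in range(slots):
        requests = {}        # cm -> list of requesting IMs
        for i in range(num_ims):
            requests.setdefault(im_ptr[i], []).append(i)
        granted = 0
        for cm, ims in requests.items():
            # The CM grants the first requester at or after its pointer.
            winner = min(ims, key=lambda i: (i - cm_ptr[cm]) % num_ims)
            cm_ptr[cm] = (winner + 1) % num_ims
            im_ptr[winner] = (im_ptr[winner] + 1) % num_cms
            granted += 1
        granted_per_slot.append(granted)
    return granted_per_slot
```

With two IMs and two CMs, the granted-per-slot sequence is [1, 2, 2, 2]: after one slot of contention the pointers are staggered, and every subsequent slot is contention-free.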

VI. PERFORMANCE OF CRRD AND CMSD

The performance of CRRD and CMSD was evaluated by simulation using uniform and nonuniform traffic.

A. Uniform Traffic

1) CRRD: Fig. 8 shows that CRRD provides higher throughput than RD under uniform traffic. A Bernoulli arrival process is used for the input traffic. Simulation suggests that CRRD can achieve 100% throughput for any number of iterations in the IM.

The reason CRRD provides 100% throughput under uniform traffic is explained as follows. When the offered load is 1.0, if the idle state, in which the internal link is not fully utilized, still occurs due to contention in the IM and CM, a VOQ that fails in the contention has to store backlogged cells. Under uniform traffic, every VOQ keeps backlogged cells until the idle state is eliminated, i.e., until the stable state is reached. The stable state is defined in [12]. In the stable state, every VOQ is occupied with backlogged cells. In this situation, as illustrated in Fig. 5,



Fig. 8. Delay performance of CRRD and RD schemes (n = m = k = 8).

the desynchronization effect is always obtained. Therefore, evenwhen the offered load is 1.0, no contention occurs in the stablestate.

As the number of iterations increases, the delay performance improves when the offered load is less than 0.7, as shown in Fig. 8. This is because the matching between VOQs and output links within the IM increases. When the offered traffic load is not heavy, the desynchronization of the pointers is not completely achieved. At low offered load, the delay performance of RD is better than that of CRRD with one iteration. This is because the matching within the IM in CRRD is not completely achieved, while the complete matching within the IM in RD is always achieved, as described in Section III-A. When the offered load is larger than 0.7, the delay performance of CRRD is not improved by the number of iterations in the IM. The number of iterations in the IM only improves the matching within the IM, but does not improve the matching between the IM and CM. The delay performance is improved by the number of iterations in the IM when the offered load is not heavy. Fig. 8 shows that, when the number of iterations in the IM increases to four, the delay performances almost converge.

Fig. 9 shows that the 99.9% delay of CRRD has the same tendency as the average delay shown in Fig. 8.

Our simulation shows, as presented in Fig. 10, that even when the input traffic is bursty, CRRD provides 100% throughput for any number of iterations. However, the delay performance under bursty traffic becomes worse than that under nonbursty traffic in the heavy-load condition. The reason for the 100% throughput even for bursty traffic can be explained as in the Bernoulli traffic discussion above. We assume that the burst length of the bursty traffic is exponentially distributed. In this evaluation, the burst length is set to ten.

By looking at Fig. 10, we can make the following observations.

When the input offered load ρ is very low, the number of iterations does not affect the delay performance significantly. This is because the average match size ratios, R_IM for the matching within the IM in phase 1 and R_CM for the matching between the IM and CM in phase 2, are close to 1.0, as shown in Figs. 11 and 12.

The average match size ratios R_IM and R_CM are defined as follows.

Fig. 9. 99.9% delay of CRRD (n = m = k = 8).

Fig. 10. Delay performance of CRRD affected by bursty traffic (n = m = k = 8).

Fig. 11. Average match size ratio R_IM in phase 1 (n = m = k = 8).

R_IM is defined as

R_IM = E[ (Σ_i M_IM(i)) / (Σ_i NV(i)) ]   (4)

where M_IM(i) is the match size within IM(i) in phase 1 and NV(i) is the number of nonempty VOQs in IM(i) at each time slot. When Σ_i NV(i) = 0, the ratio inside the expectation is set to 1.0. Note that RD always provides R_IM = 1.0, as described in Section III-A.



Fig. 12. Average match size ratio R_CM in phase 2 (n = m = k = 8).

R_CM is defined as

R_CM = E[ (Σ_{i,r} G(i, r)) / (Σ_{i,r} Q(i, r)) ]   (5)

where

Q(i, r) = 1 if there is a request from L_I(i, r) to CM(r), and 0 otherwise   (6)

and

G(i, r) = 1 if there is a grant from CM(r) to L_I(i, r) in phase 2, and 0 otherwise.   (7)

When Σ_{i,r} Q(i, r) = 0, the ratio inside the expectation is set to 1.0.

When ρ increases to approximately 0.4, R_IM of CRRD with one iteration for Bernoulli traffic decreases due to contention in the IM. R_IM for bursty traffic is less affected by ρ than that for Bernoulli traffic. The desynchronization effect is more difficult to obtain for Bernoulli traffic than for bursty traffic, because the arrival process is more independent. In this load region, around a load of 0.4, the delay performance with one iteration for Bernoulli traffic is worse than for bursty traffic.

There is a load region in which the number of iterations affects the delay performance for Bernoulli traffic, as shown in Fig. 10. In this load region, R_IM with one iteration for Bernoulli traffic is low, as shown in Fig. 11, while R_IM with four iterations is nearly 1.0. The same holds, in a different load region, for the bursty traffic: there, R_IM with one iteration is low, while R_IM with four iterations is nearly 1.0.

When the traffic load becomes very heavy, R_IM and R_CM with any number of iterations, for both Bernoulli and bursty traffic, approach 1.0. In this load region, the desynchronization effect is easy to obtain, as described above.

Fig. 13. Delay performance of CMSD compared with CRRD (n = m = k = 8).

2) CMSD: This section describes the performance of CMSD under uniform traffic by comparing it to CRRD.

Fig. 13 shows that CMSD provides 100% throughput under uniform traffic, as CRRD does, due to the desynchronization effect described in Section V-B. The throughput of CMSD is also independent of the number of iterations in the IM.

In Fig. 13, the number of iterations in the IM improves the delay performance of the CMSD scheme even in heavy load regions. On the other hand, the delay performance of CRRD is improved by the number of iterations in the IM only when the offered load is not heavy. As a result, in the heavy load region, the delay performance of CMSD is better than that of CRRD when the number of iterations in the IM is large.

Since the master output-link arbiter of CMSD chooses a VOQ group, rather than a VOQ itself as in CRRD, the master output-link arbiter of CMSD affects the matching between the IM and CM more than the output-link arbiter of CRRD does. Therefore, CMSD makes the matching between the IM and CM more efficient than CRRD when the number of iterations in the IM increases. Although this is one of the advantages of CMSD, the main advantage is that CMSD is easier to implement than CRRD when the switch size increases; this will be described in Section VII.

Fig. 13 also shows that, when the number of iterations in the IM increases to four for CMSD, the delay performance almost converges.

Fig. 14 shows via simulation that CMSD also provides 100% throughput under uniform traffic even when a bursty arrival process is considered. The bursty traffic assumption in Fig. 14 is the same as that in Fig. 10.

In the heavy-load region, the delay performance of CMSD isbetter than that of CRRD under bursty traffic when the numberof iterations in the IM is large.

3) Expansion Factor: Fig. 15 shows that RD requires an expansion ratio of over 1.5 to achieve 100% throughput, while CRRD and CMSD need no bandwidth expansion.

B. Nonuniform Traffic

We compared the performance of CRRD, CMSD, and RD using nonuniform traffic. The nonuniform traffic considered here is defined by introducing an unbalanced probability w. Let us



Fig. 14. Delay performance of CMSD in bursty traffic compared with CRRD (n = m = k = 8).

Fig. 15. Relationship between switch throughput and expansion factor (n = k = 8).

consider IP(s), OP(d), and the offered input load ρ for each IP. The traffic load from IP(s) to OP(d), denoted ρ_{s,d}, is given by

ρ_{s,d} = ρ(w + (1 − w)/N)   if s = d
ρ_{s,d} = ρ(1 − w)/N        otherwise   (8)

where N = nk is the switch size. Here, the aggregate offered load ρ_d that goes to output OP(d) from all the IPs is given by

ρ_d = Σ_s ρ_{s,d} = ρ.   (9)

When w = 0, the offered traffic is uniform. On the other hand, when w = 1, it is completely unbalanced. This means that all the traffic of IP(s) is destined only for OP(d), where s = d.
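The unbalanced traffic model of (8) and (9) is easy to check numerically. The following Python sketch is our own illustration of the model (the function names are assumptions, not from the paper):

```python
def unbalanced_load(N, rho, w):
    """Traffic matrix of (8): the load from IP(s) to OP(d) is
    rho*(w + (1 - w)/N) when s == d and rho*(1 - w)/N otherwise."""
    return [[rho * (w + (1.0 - w) / N) if s == d else rho * (1.0 - w) / N
             for d in range(N)] for s in range(N)]

def output_load(matrix, d):
    """Aggregate offered load of (9): the column sum for output OP(d)."""
    return sum(row[d] for row in matrix)
```

For any w, the per-output aggregate load sums to ρ, so the parameter w varies the spread of the traffic without changing the total offered load.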

Fig. 16 shows that the throughput of CRRD and CMSD is higher than that of RD when the traffic is slightly unbalanced.11

The throughput of CRRD is almost the same as that of CMSD. We assume that a large enough number of iterations in the IM is adopted in this evaluation, for both the CRRD and CMSD schemes, to observe how nonuniform traffic impacts the matching between the IM and CM. From w = 0 to around w = 0.4, the throughput of CRRD and CMSD decreases. This is because the complete desynchronization of the pointers is hard to achieve under unbalanced traffic.12 However, when w is larger than 0.4,

11Although both results shown in Fig. 16 are obtained by simulation, we also analytically derived the upper bound of the throughput of RD in Appendix C.

12For the same reason, when the input traffic is very asymmetric, throughput degradation occurs.

Fig. 16. Switch throughput under nonuniform traffic (n = m = k = 8).

the throughput of CRRD and CMSD increases, because the contention at the CM decreases. The reason is that, as w increases, more of the traffic from IP(s) is destined for OP(d) with s = d, and less of it is destined for the other output ports, according to (8). That is why the contention at the CMs decreases.

Due to the desynchronization effect, both CRRD and CMSD provide better performance than RD when w is smaller than 0.4. In addition, the throughput of CRRD and CMSD is not worse than that of RD at any w larger than 0.4, as shown in Fig. 16.

Developing a new practical dispatching scheme that avoids the throughput degradation under nonuniform traffic, without expanding the internal bandwidth, is left for further study.

VII. IMPLEMENTATION OF CRRD AND CMSD

This section discusses the implementation of CRRD and CMSD. First, we briefly compare them to RD. Since the CRRD and CMSD schemes are based on round-robin arbitration, their implementations are much simpler than that of RD, which needs random number generators in the IMs. These generators are difficult and expensive to implement at high speed [14]. Therefore, the implementation cost of CRRD and CMSD is reduced compared to RD.

The following sections describe the dispatching scheduling time, hardware complexity, and interconnection-wire complexity of CRRD and CMSD.

A. Dispatching Scheduling Time

1) CRRD: In phase 1, at each iteration, two round-robin arbiters are used. One is an output-link arbiter that chooses one VOQ request out of, at most, nk requests. The other is a VOQ arbiter that chooses one grant from, at most, m output-link grants. We assume that priority encoders are used for the implementation of the round-robin arbiters [14]. Since m ≤ nk, the dispatching scheduling time complexity of each iteration in phase 1 is O(log nk) (see Appendix D) [8]. Ideally, m iterations in phase 1 are preferable, because there are m output links that should be matched with VOQs in the IM. Therefore, the time complexity in phase 1 is O(m log nk). However, we do not practically need m iterations. As described in Section VI, simulation suggests that one iteration is sufficient to achieve 100% throughput under uniform traffic. In this case,



increasing the number of iterations improves the delay performance.

In phase 2, one IM request is chosen from, at most, k requests. The time complexity at phase 2 is O(log k).

As a result, the time complexity of CRRD is O(m log N), where N = nk is the number of switch ports. If we set the number of iterations in phase 1 to i, where i ≪ m, for practical purposes, the time complexity of CRRD is expressed as O(log N).

Next, we consider the required dispatching scheduling time of CRRD. The required time of CRRD, i.e., T_CRRD, is approximately

T_CRRD ≈ c[i(log2 nk + log2 m) + log2 k] + D_CRRD   (10)

where c is a constant coefficient determined by device technologies, i is the number of iterations in phase 1, and D_CRRD is the transmission delay between arbiters.

2) CMSD: In phase 1, at each iteration, three round-robin arbiters are used. The first one is a master output-link arbiter that chooses one out of k VOQ-group requests. The second one is a slave output-link arbiter that chooses one out of n requests. The third one is a VOQ arbiter that chooses one out of m output-link grants. The slave output-link arbiter can perform its arbitration without waiting for the result of the master output-link arbitration, as described in Section V-A. In other words, the master output-link and slave output-link arbiters operate simultaneously, so the longer of the master and slave arbitration times, followed by the VOQ arbitration time, is dominant. Therefore, the time complexity of each iteration in phase 1 is O(max(log k, log n) + log m). The time complexity in phase 1 is O(m(max(log k, log n) + log m)) if m iterations are adopted.

In phase 2, one request is chosen out of the k IM requests. The dispatching scheduling time complexity at phase 2 is O(log k).

As a result, the time complexity of CMSD is O(m(max(log k, log n) + log m) + log k). As in the case of CRRD, for practical purposes with a small constant number of iterations, the time complexity of CMSD is expressed as O(max(log k, log n) + log m). When we assume n = m = k = √N, the time complexity of CMSD is O(log √N), which is equivalent to O(log N). This is the same order of time complexity as CRRD.

We now consider the required dispatching scheduling time of CMSD. With the same assumptions as for CRRD, the required time of CMSD, i.e., T_CMSD, is approximately

T_CMSD ≈ c[i(max(log2 k, log2 n) + log2 m) + log2 k] + D_CMSD   (11)

where D_CMSD is the transmission delay between arbiters.

3) Comparison of Required Dispatching Scheduling Times: Although the order of the scheduling time complexity of CRRD is the same as that of CMSD, the required times are different. To compare them, we define the ratio of T_CMSD to T_CRRD as follows:

R = T_CMSD / T_CRRD.   (12)

TABLE I
RATIO OF T_CMSD TO T_CRRD, R (n = k = m)

First, we assume that the transmission delays between arbiters, D_CRRD and D_CMSD, are negligible compared to the required round-robin execution time, to clarify the dependency on n, m, and k and to eliminate the device-dependent factor c. Using (10)–(12), we obtain the following equation:

R = [i(max(log2 k, log2 n) + log2 m) + log2 k] / [i(log2 nk + log2 m) + log2 k].   (13)

Table I shows the relationship between R and i when we assume n = m = k. Note that when n = m = k, R depends only on i and is independent of n.13

When i = 4, for example, the required time of CMSD is reduced by more than 30% compared with CRRD.
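Equation (13) with n = m = k reduces to R = (2i + 1)/(3i + 1), which is easy to tabulate. The following Python sketch (our own, for illustration) reproduces the trend of Table I:

```python
from math import log2

def scheduling_time_ratio(n, m, k, i):
    """Ratio R of (13), ignoring the wire delays D_CRRD and D_CMSD."""
    t_crrd = i * (log2(n * k) + log2(m)) + log2(k)
    t_cmsd = i * (max(log2(k), log2(n)) + log2(m)) + log2(k)
    return t_cmsd / t_crrd
```

With n = m = k, every logarithm equals log2 n, so the ratio collapses to (2i + 1)/(3i + 1); at i = 4 it is 9/13 ≈ 0.69, i.e., a reduction of more than 30%.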

In (13), we ignored the factors D_CRRD and D_CMSD. When D_CRRD and D_CMSD are large, the reduction effect of CMSD shown in Table I decreases. As the distance between two arbiters gets larger, these factors become significant.

B. Hardware Complexity for Dispatching in the IM

Since the CM arbitration algorithm of CMSD is the same as that of CRRD, we discuss the hardware complexities in the IM only.

1) CRRD: According to the implementation results presented in [14], the hardware complexity of each round-robin arbiter is O(x), where x is the number of requests to be selected by the arbiter.

Each IM has m output-link arbiters and nk VOQ arbiters. Each output-link arbiter chooses one out of nk requests, so the hardware complexity for all the output-link arbiters is O(mnk). Each VOQ arbiter chooses one out of m requests, so the hardware complexity for all the VOQ arbiters is O(nkm). Therefore, the hardware complexity of CRRD in the IM is O(nmk).

2) CMSD: Each IM has m master output-link arbiters, mk slave output-link arbiters, and nk VOQ arbiters. Each master output-link arbiter chooses one out of k requests, so the hardware complexity for all the master output-link arbiters is O(mk). Each slave output-link arbiter chooses one out of n requests, so the hardware complexity for all the slave output-link arbiters is O(mkn). Each VOQ arbiter chooses one out of m requests, so the hardware complexity for all the VOQ arbiters is O(nkm). Therefore, the hardware complexity of CMSD in the IM is O(nmk). Thus, the order of the hardware complexity of CMSD is the same as that of CRRD.

C. Interconnection-Wire Complexity Between Arbiters

1) CRRD: In CRRD, each VOQ arbiter is connected to alloutput-link arbiters with three groups of wires. The first groupis used by a VOQ arbiter to send requests to the output-linkarbiters. The second one is used by a VOQ arbiter to receivegrants from the output-link arbiters. The third one is used by aVOQ arbiter to send grants to the output-link arbiters.

As the switch size increases, the number of wires that connect all the VOQ arbiters with all the output-link arbiters becomes

13R = (2i + 1)/(3i + 1).



Fig. 17. Interconnection-wire model.

large. In this situation, a large number of crosspoints is produced, which causes high layout complexity. When we implement the dispatching scheduler in one chip, the layout complexity affects the interconnection between transistors. However, if we cannot implement it in one chip due to a gate limitation, we need to use multiple chips. In this case, the layout complexity influences not only the interconnection within a chip, but also the interconnection between chips on printed circuit boards (PCBs). In addition, the number of pins in the scheduler chips of CRRD increases.

Consider that there are N_s source nodes and N_d destination nodes. We assume that each source node is connected to all the destination nodes with wires without detouring, as shown in Fig. 17. The number of crosspoints for the interconnection wires between the nodes is given by

X(N_s, N_d) = N_s(N_s − 1)N_d(N_d − 1)/4.   (14)

The derivation of (14) is described in Appendix E. Let the number of crosspoints for the interconnection wires between the nk VOQ arbiters and the m output-link arbiters be X_CRRD. X_CRRD is determined by

X_CRRD = 3X(nk, m).   (15)

Note that the factor of three in (15) expresses the number of groups of wires, as described at the beginning of Section VII-C-1.

2) CMSD: In CMSD, each VOQ arbiter is connected to itsown slave output-link arbiters with three groups of wires. Thefirst group is used by a VOQ arbiter to send requests to the slaveoutput-link arbiters. The second one is used by a VOQ arbiter toreceive grants from the slave output-link arbiters. The third oneis used by a VOQ arbiter to send grants to the slave output-linkarbiters.

In the same way, each VOQ group is connected to all themaster output-link arbiters with three groups of wires. Thefirst is used by a VOQ group to send requests to the masteroutput-link arbiters. The second is used by a VOQ group toreceive grants from the master output-link arbiters. The third isused by a VOQ group to send grants to the master output-linkarbiters.

Let the number of crosspoints for the interconnection wires between arbiters in CMSD be X_CMSD. X_CMSD is determined by

X_CMSD = 3[kX(n, m) + X(k, m)].   (16)

The first term on the right-hand side is the number of crosspoints of the interconnection wires between the n VOQ arbiters and the m slave output-link arbiters in each of the k VOQ groups. The second term is the number of crosspoints of the interconnection wires between the k VOQ groups and the m master output-link arbiters.

3) Comparison of X_CRRD and X_CMSD: As the switch size increases, X_CRRD becomes much larger than X_CMSD. In general, when n = m = k = √N, the ratio X_CMSD/X_CRRD becomes 1/n = 1/√N. For example, when n = m = k and n = 8, 16, and 32, the number of crosspoints of the interconnection wires of CMSD is reduced by 87.5%, 93.8%, and 96.9%, respectively, compared with that of CRRD. Thus, CMSD can dramatically reduce the number of wire crosspoints.
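The crosspoint counts of (14)–(16) can be checked numerically. In the following Python sketch (our own, for illustration), the n = m = k case gives exactly X_CMSD/X_CRRD = 1/n, which yields the 87.5%, 93.8%, and 96.9% reductions quoted above:

```python
def crosspoints(ns, nd):
    # (14): crossings of a flat, non-detouring bipartite interconnection
    return ns * (ns - 1) * nd * (nd - 1) // 4

def x_crrd(n, m, k):
    # (15): nk VOQ arbiters fully wired to m output-link arbiters,
    # times three wire groups (request / grant / accept)
    return 3 * crosspoints(n * k, m)

def x_cmsd(n, m, k):
    # (16): per-group slave wiring plus group-to-master wiring
    return 3 * (k * crosspoints(n, m) + crosspoints(k, m))
```

For n = m = k = 8 this gives 21 168 crosspoints for CMSD against 169 344 for CRRD, an exact factor-of-8 (87.5%) reduction.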

VIII. CONCLUSIONS

A Clos-network switch architecture is attractive because of its scalability. Unfortunately, previously proposed implementable dispatching schemes are not able to achieve high throughput unless the internal bandwidth is expanded.

First, we have introduced a novel dispatching scheme called CRRD for the Clos-network switch. CRRD provides high switch throughput without increasing the internal bandwidth, while only simple round-robin arbiters are employed. Our simulation showed that CRRD achieves 100% throughput under uniform traffic. Even when the offered load reaches 1.0, the pointers of the round-robin arbiters that we use at the IMs and CMs become completely desynchronized and contention is avoided.

Second, we have presented CMSD with hierarchical round-robin arbitration, which was developed as an improved version of CRRD to make it more scalable in terms of dispatching scheduling time and interconnection complexity in the dispatching scheduler when the switch size increases. We showed that CMSD preserves the advantages of CRRD and achieves a more than 30% reduction of the dispatching scheduling time when the arbitration time is significant. Furthermore, with CMSD, the number of interconnection crosspoints is dramatically reduced, with a ratio of 1/√N. When N = 1024, a 96.9% reduction effect is obtained. This makes CMSD easier to implement than CRRD when the switch size becomes large.

This paper assumes that the dispatching scheduling must be completed within one time slot. However, this constraint becomes a bottleneck when the port speed increases. In a future study, pipeline-based approaches should be considered. A pipeline-based scheduling approach for crossbar switches is described in [17]; it relaxes the scheduling timing constraint without throughput degradation.

APPENDIX A
MAXIMUM SWITCH THROUGHPUT WITH THE RD SCHEME UNDER UNIFORM TRAFFIC

To derive (3), we consider the throughput of a single output link; this is the same as the switch throughput that we would like to derive, because the conditions of all the output links are the same. Therefore,

(17)

First, we define several notations. P1 is the probability that one output link in a CM is used by a VOQ's request. Here, P1 does not depend on which output link in the CM is used for the request, because the link is selected randomly. P2 is the probability that the request wins the contention for that output link. By using P1 and P2, the throughput is expressed in the following equation:

(18)

Here, in the last form of (18), we use the fact that P1 and P2 depend on neither i nor j.

Since each VOQ group has n VOQs, each VOQ is equally likely to be selected by each output link. P1 is given by

(19)

Next, we consider P2. P2 is obtained as follows:

(20)

The first term on the right-hand side of (20) is the winning probability of the request when it is the only request for the output link. The second term is the winning probability of the request when there are two requests for the output link. In the same way, the jth term is the winning probability of the request when there are j requests for the output link.

P2 can then be expressed in the following equation:

(21)

By using (17)–(19) and (21), we can derive (3).

APPENDIX B
DERIVATION OF THE LIMIT OF (3)

We expand the terms in (3) in decreasing order as follows:

(22)

In the limit of a large switch size,

(23)

Therefore,

(24)

In the above deduction, we use the following equation:

(25)

APPENDIX C
UPPER BOUND OF THE MAXIMUM SWITCH THROUGHPUT WITH THE RD SCHEME UNDER NONUNIFORM TRAFFIC

The upper bound of the maximum switch throughput with the RD scheme under nonuniform traffic is determined by (26), shown at the bottom of the page.

Equation (26) can easily be derived in the same way as (3). However, since the traffic load of each VOQ is different, (3) has to be modified.

(26)

Fig. 18. Comparison between simulation and analytical upper bound results under nonuniform traffic (n = m = k = 8).

The first term on the right-hand side of (26) is obtained by considering the probability that a heavy-loaded VOQ sends a request to the CM and wins the contention. The second term on the right-hand side is obtained by considering the probability that, when a heavy-loaded VOQ and nonheavy-loaded VOQs send requests to the CM, a nonheavy-loaded VOQ wins the contention. The third term on the right-hand side is obtained by considering the probability that a heavy-loaded VOQ does not send a request to the CM, nonheavy-loaded VOQs send requests to the CM, and a nonheavy-loaded VOQ wins the contention.

Note that (26) gives the upper bound of the switch throughput of the RD scheme in the switch model described in Section II. This is because only one cell can be read from the same VOQ in the same time slot in that switch model. However, in the derivation of (26), we assume for simplicity that more than one cell can be read from the same VOQ in the same time slot. This different assumption causes different results under nonuniform traffic.14

Fig. 18 compares the simulation result with the upper bound obtained by (26). As w increases to approximately 0.6, the difference between the simulation and analytical results increases, due to the reason described above. Beyond that region, the difference decreases with w because the contention probability at the CM decreases.

APPENDIX D
TIME COMPLEXITY OF A ROUND-ROBIN ARBITER

In a round-robin arbiter, the arbitration time complexity is dominated by a priority encoder whose width x is the number of inputs of the arbiter.

The priority encoder is analogous to a maximum or minimum search algorithm. We use a binary search tree, whose computation complexity (and, thus, the timing) is O(d), where d is the depth of the tree; d is equal to ⌈log2 x⌉ for a binary tree.

The priority encoder, in the canonical form, has two stages: the AND-gate stage and the OR-gate stage. In the AND-gate stage,

14However, under uniform traffic, both results are exactly the same because all the VOQs can be assumed to be in a nonempty state. This does not cause any difference in the results.

there are x AND gates, where the largest gate has x inputs. An x-input AND gate can be built by using a number of 2-input AND gates, with ⌈log2 x⌉ levels of delay, i.e., O(log x). As for the OR-gate stage, there are OR gates where the largest also has on the order of x inputs. The largest AND-gate delay, O(log x), is then the delay order for the priority encoder.

APPENDIX E
DERIVATION OF (14)

We define an interconnection wire between source node a, where 1 ≤ a ≤ N_s, and destination node b, where 1 ≤ b ≤ N_d, as w(a, b). We also define the number of crosspoints that wire w(a, b) has as X(w(a, b)).

The total number of crosspoints of the interconnection wires between the N_s source nodes and the N_d destination nodes is expressed by

X(N_s, N_d) = (1/2) Σ_{a=1}^{N_s} Σ_{b=1}^{N_d} X(w(a, b)).   (27)

Here, the factor of 1/2 on the right-hand side of (27) eliminates the double counting of the number of crosspoints.

First, consider the wires of source node 1. Since w(1, 1) has no crosspoints with other wires, X(w(1, 1)) = 0. Since w(1, 2) has crosspoints with w(2, 1), w(3, 1), ..., w(N_s, 1), we have X(w(1, 2)) = N_s − 1. w(1, 3) has crosspoints with w(a, 1) and w(a, 2) for every a > 1, so X(w(1, 3)) = 2(N_s − 1). Therefore, X(w(1, b)) = (b − 1)(N_s − 1). In the same way, X(w(1, N_d)) = (N_d − 1)(N_s − 1).

Next, consider the wires of source node 2. Since w(2, 1) has crosspoints with w(1, 2), w(1, 3), ..., w(1, N_d), we have X(w(2, 1)) = N_d − 1. w(2, 2) has N_d − 2 crosspoints with w(1, 3), ..., w(1, N_d) and N_s − 2 crosspoints with w(3, 1), ..., w(N_s, 1). Therefore, X(w(2, 2)) = (N_d − 2) + (N_s − 2). w(2, 3) has N_d − 3 crosspoints with w(1, 4), ..., w(1, N_d) and 2(N_s − 2) crosspoints with w(a, 1) and w(a, 2) for a > 2. Therefore, X(w(2, 3)) = (N_d − 3) + 2(N_s − 2). In the same way, X(w(2, b)) = (N_d − b) + (N_s − 2)(b − 1).

In general, X(w(a, b)) is given by

X(w(a, b)) = (a − 1)(N_d − b) + (N_s − a)(b − 1).   (28)

We substitute (28) into (27). The following equation is obtained:

X(N_s, N_d) = N_s(N_s − 1)N_d(N_d − 1)/4.   (29)
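Equations (28) and (29) can be verified by brute force: two wires w(a, b) and w(a', b') cross exactly when their source order and destination order disagree. The following Python sketch is our own check of the closed form:

```python
from itertools import combinations

def crossings_brute_force(ns, nd):
    """Count crosspoints pair by pair: wires w(a, b) and w(a2, b2) cross
    exactly when their source order and destination order disagree."""
    wires = [(a, b) for a in range(ns) for b in range(nd)]
    return sum(1 for (a, b), (a2, b2) in combinations(wires, 2)
               if (a - a2) * (b - b2) < 0)
```

For every small N_s and N_d, the pairwise count agrees with the closed form N_s(N_s − 1)N_d(N_d − 1)/4 of (29).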



REFERENCES

[1] T. Anderson, S. Owicki, J. Saxe, and C. Thacker, "High speed switch scheduling for local area networks," ACM Trans. Comput. Syst., vol. 11, no. 4, pp. 319–352, Nov. 1993.

[2] T. Chaney, J. A. Fingerhut, M. Flucke, and J. S. Turner, "Design of a Gigabit ATM switch," in Proc. IEEE INFOCOM, Apr. 1997, pp. 2–11.

[3] H. J. Chao, B.-S. Choe, J.-S. Park, and N. Uzun, "Design and implementation of Abacus switch: A scalable multicast ATM switch," IEEE J. Select. Areas Commun., vol. 15, pp. 830–843, June 1997.

[4] H. J. Chao and J.-S. Park, "Centralized contention resolution schemes for a large-capacity optical ATM switch," in Proc. IEEE ATM Workshop, Fairfax, VA, May 1998, pp. 11–16.

[5] H. J. Chao, "Saturn: A terabit packet switch using Dual Round-Robin," in Proc. IEEE GLOBECOM, Dec. 2000, pp. 487–495.

[6] F. M. Chiussi, J. G. Kneuer, and V. P. Kumar, "Low-cost scalable switching solutions for broadband networking: The ATLANTA architecture and chipset," IEEE Commun. Mag., pp. 44–53, Dec. 1997.

[7] C. Clos, "A study of nonblocking switching networks," Bell Syst. Tech. J., pp. 406–424, Mar. 1953.

[8] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms. Cambridge, MA: MIT Press, 1997, sec. 13.2, p. 246.

[9] C. Y. Lee and A. Y. Oruc, "A fast parallel algorithm for routing unicast assignments in Benes networks," IEEE Trans. Parallel Distrib. Syst., vol. 6, pp. 329–333, Mar. 1995.

[10] T. T. Lee and S.-Y. Liew, "Parallel routing algorithm in Benes-Clos networks," in Proc. IEEE INFOCOM, 1996, pp. 279–286.

[11] N. McKeown, "Scheduling algorithms for input-queued cell switches," Ph.D. dissertation, Dept. Elect. Eng. Comput. Sci., Univ. California at Berkeley, Berkeley, CA, 1995.

[12] N. McKeown, V. Anantharam, and J. Walrand, "Achieving 100% throughput in an input-queued switch," in Proc. IEEE INFOCOM, 1996, pp. 296–302.

[13] N. McKeown, M. Izzard, A. Mekkittikul, W. Ellersick, and M. Horowitz, "Tiny-Tera: A packet switch core," IEEE Micro, pp. 26–33, Jan.–Feb. 1997.

[14] N. McKeown, "The iSLIP scheduling algorithm for input-queued switches," IEEE/ACM Trans. Networking, vol. 7, pp. 188–200, Apr. 1999.

[15] A. Mekkittikul and N. McKeown, "A practical scheduling algorithm to achieve 100% throughput in input-queued switches," in Proc. IEEE INFOCOM, 1998, pp. 792–799.

[16] E. Oki, N. Yamanaka, Y. Ohtomo, K. Okazaki, and R. Kawano, "A 10-Gb/s (1.25 Gb/s × 8) 4×2 0.25-μm CMOS/SIMOX ATM switch based on scalable distributed arbitration," IEEE J. Solid-State Circuits, vol. 34, pp. 1921–1934, Dec. 1999.

[17] E. Oki, R. Rojas-Cessa, and H. J. Chao, "A pipeline-based approach for maximal-sized matching scheduling in input-buffered switches," IEEE Commun. Lett., vol. 5, pp. 263–265, June 2001.

[18] J. Turner and N. Yamanaka, "Architectural choices in large scale ATM switches," IEICE Trans. Commun., vol. E81-B, no. 2, pp. 120–137, Feb. 1998.

Eiji Oki (M’95) received the B.E. and M.E. degreesin instrumentation engineering and the Ph.D. degreein electrical engineering from Keio University, Yoko-hama, Japan, in 1991, 1993, and 1999, respectively.

In 1993, he joined the Communication SwitchingLaboratories, Nippon Telegraph and TelephoneCorporation (NTT), Tokyo, Japan, where he hasresearched multimedia-communication network ar-chitectures based on ATM techniques, traffic-controlmethods, and high-speed switching systems withthe NTT Network Service Systems Laboratories.

From 2000 to 2001, he was a Visiting Scholar with Polytechnic University of New York, Brooklyn. He is currently engaged in research and development of high-speed optical IP backbone networks as a Research Engineer with NTT Network Innovation Laboratories, Tokyo, Japan. He coauthored Broadband Packet Switching Technologies (New York: Wiley, 2001).

Dr. Oki is a member of the Institute of Electronics, Information and Communication Engineers (IEICE), Japan. He was the recipient of the 1998 Switching System Research Award and the 1999 Excellent Paper Award presented by IEICE, Japan, and the IEEE ComSoc 2001 APB Outstanding Young Researcher Award.

Zhigang Jing (M’01) received the B.S., M.S., and Ph.D. degrees in optical fiber communication from the University of Electronic Science and Technology of China, Chengdu, China, in 1993, 1996, and 1999, respectively, all in electrical engineering.

He then joined the Department of Electrical Engineering, Tsinghua University, Beijing, China, where he was a Post-Doctoral Fellow. Since March 2000, he has been with the Department of Electrical Engineering, Polytechnic University of New York, Brooklyn, as a Post-Doctoral Fellow. His current research interests are high-speed networks, terabit IP routers, multimedia communication, Internet QoS, Diffserv, MPLS, etc.

Roberto Rojas-Cessa (S’97–M’01) received the B.S. degree in electronic instrumentation from the Universidad Veracruzana (University of Veracruz), Veracruz, Mexico, in 1991. He graduated with an Honorary Diploma. He received the M.S. degree in electrical engineering from the Centro de Investigacion y de Estudios Avanzados del Instituto Politecnico Nacional (CINVESTAV-IPN), Mexico City, Mexico, in 1995, and the M.S. degree in computer engineering and Ph.D. degree in electrical engineering from the Polytechnic University of New York, Brooklyn, in 2000 and 2001, respectively.

In 1993, he was with the Sapporo Electronics Center in Japan for a Microelectronics Certification. From 1994 to 1996, he was with UNITEC and Iberchip, Mexico City, Mexico. From 1996 to 2001, he was a Teaching and Research Fellow with the Polytechnic University of Brooklyn. In 2001, he was a Post-Doctoral Fellow with the Department of Electrical and Computer Engineering, Polytechnic University of Brooklyn. He has been involved in application-specific integrated-circuit (ASIC) design for biomedical applications and research on scheduling schemes for high-speed packet switches and switch reliability. He is currently an Assistant Professor with the Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark. His research interests include high-speed and high-performance switching, fault tolerance, and implementable scheduling algorithms for packet switches.

Dr. Rojas-Cessa is a member of the Institute of Electronics, Information and Communication Engineers (IEICE), Japan. He was a CONACYT Fellow.

H. Jonathan Chao (S’83–M’85–SM’95–F’01) received the B.S. and M.S. degrees in electrical engineering from the National Chiao Tung University, Taiwan, R.O.C., respectively, and the Ph.D. degree in electrical engineering from The Ohio State University, Columbus.

In January 1992, he joined the Polytechnic University of New York, Brooklyn, where he is currently a Professor of electrical engineering. He has served as a consultant for various companies such as NEC, Lucent Technologies, and Telcordia in the areas of ATM switches, packet scheduling, and MPLS traffic engineering. He has given short courses to industry in the subjects of the IP/ATM/SONET network for over a decade. He was cofounder and, from 2000 to 2001, the CTO of Coree Networks, where he led a team in the implementation of a multiterabit packet switch system with carrier-class reliability. From 1985 to 1992, he was a Member of Technical Staff with Telcordia, where he was involved in transport and switching system architecture designs and ASIC implementations, such as the first SONET-like framer chip, ATM layer chip, sequencer chip (the first chip handling packet scheduling), and ATM switch chip. From 1977 to 1981, he was a Senior Engineer with Telecommunication Laboratories, Taiwan, R.O.C., where he was involved with circuit designs for a digital telephone switching system. He has authored or coauthored over 100 journal and conference papers. He coauthored Broadband Packet Switching Technologies (New York: Wiley, 2001) and Quality of Service Control in High-Speed Networks (New York: Wiley, 2001). He has performed research in the areas of terabit packet switches/routers and QoS control in IP/ATM/MPLS networks. He holds 19 patents with five pending.

Dr. Chao served as a guest editor for the IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS with special topic on “Advances in ATM Switching Systems for B-ISDN” (June 1997) and “Next Generation IP Switches and Routers” (June 1999). He also served as an editor for the IEEE/ACM TRANSACTIONS ON NETWORKING (1997–2000). He was the recipient of the 1987 Telcordia Excellence Award. He was a corecipient of the 2001 Best Paper Award of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY.