

1792 IEEE TRANSACTIONS ON COMMUNICATIONS, VOL. 47, NO. 12, DECEMBER 1999

A High-Performance Two-Stage Packet Switch Architecture

Timothy X Brown

Abstract—This paper contributes a distributed packet controller which reduces queueing to a single stage in two-stage packet switches. Software and neural network based controllers are described. Simulations under a range of traffic conditions for a 1024 × 1024 switch size show the simplest architecture has the best performance.

Index Terms—Asynchronous transfer mode, neural networks, packet switching.

I. INTRODUCTION

In an asynchronous transfer mode (ATM) switch, fixed-size packets arrive in periodic time slots at one of the input ports destined for one of the output ports. Because multiple packets can be destined for the same output, packets must queue somewhere in the switch to be sent in a later time slot (so-called output blocking). Because of speed and complexity constraints, large ATM switches are built from stages of smaller switch modules as in Fig. 1(a). Blocked packets can queue at several points in the switching network with different levels of coordination, ranging from no coordination between switch stages to various back-pressure mechanisms [1]. This paper develops a distributed packet controller that extends a single-stage controller in a prior work [2] and reduces queueing to a single stage. Simulated performance under a range of traffic conditions for a 1024 × 1024 switch size shows the simplest architecture has the best wait time and buffer size performance.

A. Input versus Output Switch Modules

We focus on nonblocking switch modules that buffer blocked packets at output or input ports. The output-buffered switch has maximal throughput and the shortest queues [3], but it is also more expensive, as the switch must operate at N times the input port speed to ensure that every packet can reach its output. Once at the correct output, packets are simply sent, in turn, at every time slot. Shared output buffers [4], although they have no effect on the mean performance, reduce the buffer size needed to absorb queue-size variations and are used in this paper.

The input-buffered switches in this paper use separate input queues that can send any waiting packet. This avoids head-of-line blocking and enables maximal throughput [2], [3]. Unlike an output-buffered switch, a packet arbitrator (PA) must decide which packets to send in every time slot. Given a set of input-queued packets, the PA chooses the largest subset so that at most one packet is sent per input queue, and at most one packet arrives per output port.

Paper approved by P. E. Rynes, the Editor for Switching Systems of the IEEE Communications Society. Manuscript received March 3, 1998; revised June 18, 1999. This work was supported by NSF CAREER Award NCR-9624791 and a grant from the University of Colorado Council on Research and Creative Works. This paper was presented in part at the Fifth IFIP Workshop on Performance Modeling and Evaluation of ATM Networks, Ilkley, U.K., July 1997.

The author is with the Department of Electrical and Computer Engineering, University of Colorado, Boulder, CO 80309-0530 USA (e-mail: [email protected]).

Publisher Item Identifier S 0090-6778(99)09767-6.

Fig. 1. (a) Building an N² × N² switch out of 2N N × N switches. (b) Output buffers of stage 1 switches are input buffers to a stage 2 switch.
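The PA's selection constraint can be illustrated with a small greedy sketch (an illustration only, not one of the paper's controllers; the function name is invented): scan the waiting (input, output) requests and accept at most one per input queue and one per output port.

```python
def greedy_arbitrate(waiting):
    """Pick a subset of (input, output) requests such that each input
    sends at most one packet and each output receives at most one
    packet.  Greedy first-fit; valid but not necessarily maximum."""
    used_in, used_out = set(), set()
    selected = []
    for i, j in waiting:
        if i not in used_in and j not in used_out:
            selected.append((i, j))
            used_in.add(i)
            used_out.add(j)
    return selected

requests = [(0, 1), (0, 2), (1, 1), (2, 3)]
print(greedy_arbitrate(requests))  # [(0, 1), (2, 3)]
```

Note that the greedy choice forgoes the larger subset {(0, 2), (1, 1), (2, 3)}, which is why the paper's SW controller solves a maximum matching instead.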

B. Software (SW) and Neural Network (NN) Packet Arbitrators

The packet arbitration problem reduces to the cardinality graph matching problem, which has a polynomial-time algorithm solution [5]. This algorithm readily computes the largest packet subset in a SW simulation, although it is unlikely to be fast enough for large high-speed switches to make their selection within a time slot. Also, it does not apply to switches with internal blocking such as banyan networks. For this reason, we consider an NN PA.
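The reduction can be sketched as follows (a hedged illustration; [5] covers the general matching algorithm, and the code here uses the simple augmenting-path method with invented names): inputs and outputs form the two sides of a bipartite graph, waiting packets are the edges, and a maximum-cardinality matching is the largest sendable packet subset.

```python
def max_matching(adj, n_inputs, n_outputs):
    """Maximum-cardinality bipartite matching by augmenting paths
    (Kuhn's algorithm).  adj[i] lists the outputs with a packet
    waiting at input i.  Returns (matching size, match_out) where
    match_out[j] is the input matched to output j, or -1."""
    match_out = [-1] * n_outputs

    def try_augment(i, seen):
        # Try to give input i an output, displacing earlier matches
        # along an augmenting path if necessary.
        for j in adj[i]:
            if j in seen:
                continue
            seen.add(j)
            if match_out[j] == -1 or try_augment(match_out[j], seen):
                match_out[j] = i
                return True
        return False

    size = sum(try_augment(i, set()) for i in range(n_inputs))
    return size, match_out

# Inputs 0 and 1 both want output 0; input 1 can also use output 1.
size, m = max_matching({0: [0], 1: [0, 1], 2: [2]}, 3, 3)
print(size)  # 3
```

Each augmenting-path search is polynomial, consistent with the complexity claim above, but the sequential search is the reason a SW arbitrator is unlikely to finish within one time slot of a large high-speed switch.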

An NN PA was presented in [2], and the reader is referred there for details. The neuron matrix solution is governed by the differential equations

(1)

0090–6778/99$10.00 © 1999 IEEE


where, for neuron (i, j), u_ij is its state variable, I_ij is a constant external input, and f(·) is any of the usual neuron sigmoid functions (e.g., f(u) = (1 + e^(−u))^(−1)). If the external inputs are set so that I_ij is excitatory when a packet at input i is waiting for output j and inhibitory otherwise, (1) approaches a stable equilibrium where the set of active neurons always corresponds to a valid packet subset [2]. In simulation, the I_ij are set to match the queue state, and (1) is numerically integrated using a fifth-order adaptive step size algorithm [6]. To speed the simulation, neurons with saturated outputs are locked at their limiting values. To avoid numerical instabilities, each u_ij at t = 0 is a uniform [−0.001, 0.001] random value, and each has a uniform [−0.001, 0.001] perturbation added. An appropriate setting of the constants gives the best performance.
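Since the right-hand side of (1) is not reproduced in this extraction, the following sketch assumes generic Hopfield-style winner-take-all dynamics with mutual inhibition along each row (input) and column (output); it illustrates the mechanism, not the paper's exact equation, and all names and constants are invented for the sketch.

```python
import numpy as np

def nn_arbitrate(waiting, steps=2000, dt=0.01, A=2.0):
    """Illustrative Hopfield-style winner-take-all arbitration.
    waiting[i, j] = 1 if a packet at input i waits for output j.
    Assumed dynamics (NOT the paper's (1)): each neuron is driven
    by its external input and inhibited by active neurons sharing
    its row (same input) or column (same output)."""
    n = waiting.shape[0]
    rng = np.random.default_rng(0)
    u = rng.uniform(-0.001, 0.001, size=(n, n))   # small random start
    for _ in range(steps):
        v = 0.5 * (1.0 + np.tanh(u / 0.05))       # sigmoid output
        row = v.sum(axis=1, keepdims=True) - v    # same-input rivals
        col = v.sum(axis=0, keepdims=True) - v    # same-output rivals
        du = -u + waiting - A * (row + col)       # leaky competition
        u = u + dt * du                           # Euler integration
    return (v > 0.5) & (waiting > 0)

w = np.array([[1, 1, 0], [1, 0, 0], [0, 0, 1]])
sel = nn_arbitrate(w)
# each row and column of sel ends with at most one active neuron
```

The fixed-step Euler loop stands in for the fifth-order adaptive integrator of [6]; the point is only that the inhibition terms drive the network toward a valid (one-per-input, one-per-output) packet subset.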

Prior work indicates that the switch fabric for a 32 × 32 622-Mb/s nonblocking switch or a 40 × 40 neuron NN can fit on a single VLSI chip [7], [8]. Circuits similar to (1) have been built with less than 1-μs decision time [9]. For signaling, at every time slot, a queue sends an N-bit message to the PA with the queue state, and the PA sends back a log N-bit message with the selected packet subqueue [2]. This indicates the input-buffered switch has a compact and fast implementation compared to the more complex output-buffered switch [4].

II. LARGE SWITCH ARCHITECTURES

A common way to build a large switch is to use two stages of smaller switches as in Fig. 1(a). While straightforward, this may not be optimal. Packets buffer in two stages, which, in general, doubles the delay. We consider three versions of the architecture in Fig. 1(a).

A. Two Stages of Output-Buffered Shared Memory Switches

This is the direct two-stage implementation using the highest performance switch. No PA is necessary. This is the baseline for comparison with other switches and is denoted OO.

B. One Stage of Output and One Stage of Input-Buffered Switches with Packet Arbitrator

Output-buffered shared memory switches comprise the first stage. From a second-stage switch's perspective, the output buffers of the first stage appear as input buffers. A single stage of buffering is created with stage one buffers logically acting as input buffers to simpler second-stage switches [Fig. 1(b)]. PA's located in each of the second-stage switches control one output queue in each of the first-stage switches. Each queue in the first-stage switch must now maintain N subqueues so the PA can choose packets destined for any second-stage output. If the queues are maintained as linked lists in a single buffer as in [4], then the overhead for the linked lists is small. Thus, compared to the OO switch, minor modifications are required in the first stage, while the hardware in the second stage is significantly reduced. This switch is denoted OIA.
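The subqueue bookkeeping can be sketched as follows (class and method names are invented for illustration): a first-stage shared buffer keeps one FIFO subqueue per second-stage output, so the PA can pull the head packet for whichever output it selects.

```python
from collections import deque

class SharedBuffer:
    """Shared first-stage buffer with per-destination subqueues,
    mimicking the linked-list organization described in [4]."""
    def __init__(self, n_outputs):
        self.subqueues = [deque() for _ in range(n_outputs)]

    def enqueue(self, packet, dest):
        self.subqueues[dest].append(packet)

    def dequeue(self, dest):
        """Pop the head-of-subqueue packet for the selected output,
        or None if no packet waits for that output."""
        q = self.subqueues[dest]
        return q.popleft() if q else None

    def occupancy(self):
        """Total packets held across all subqueues (shared memory)."""
        return sum(len(q) for q in self.subqueues)

buf = SharedBuffer(4)
buf.enqueue("p1", 2)
buf.enqueue("p2", 2)
buf.enqueue("p3", 0)
print(buf.dequeue(2), buf.occupancy())  # p1 2
```

Because packets are grouped by destination rather than arrival order, a PA selection never stalls behind a packet bound for a different output, which is the head-of-line avoidance the text relies on.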

C. Two Stages of Output-Buffered Switches with Packet Arbitrator

In the OO switch (Section II-A), the second-stage delay depends on the order that packets are sent from the first stage (e.g., Fig. 2). In the OIA switch (Section II-B), the PA eliminates second-stage queueing, but contention may prevent a nonempty first-stage queue from sending a packet. This section's switch sends the head-of-queue packet from such queues to be buffered in the second stage. As in Section II-A, every nonempty first-stage queue sends one packet while the PA reduces second-stage queueing. This switch shows the advantage of packet arbitration and is denoted OOA.

Fig. 2. For a 16 × 16 switch composed of two stages of 4 × 4 output-buffered switches, (a) shows the four first-stage buffers serving one of the second-stage switches. The first-stage buffers both send four packets in (b), but the top set leads to congestion in the second layer, while the bottom set does not.

III. EXPERIMENTS WITH A 1024-INPUT SWITCH

The following experiments are designed to compare the switch designs. Absolute switch performance would require a range of switch sizes under more diverse and realistic traffic types. Instead, the designs are compared under different loads for a 1024 switch size (i.e., two stages of 32 32 × 32 switches). This size is an order of magnitude larger than existing switches today, and the 32 × 32 neurons in each neural PA are well beyond the "toy problem" size. To lower bound the performance, we look at the first-stage buffering and wait times of the OO switch. This represents the irreducible queueing and contention at the first- to second-stage links.

The packet traffic is generated using a Bernoulli process and a burst process with different loads.¹ The switches are simulated for 11 000 time slots starting from empty queues. For a given load and traffic type, the sequence of arrivals is the same for every switch. The first 5000 time slots of data are discarded so that only steady-state behavior is observed. With 32 switches in each stage, the remaining 6000 time slots contain a total of 192 000 observations.

Average packet delay is used as a measure of average performance. To remove details of the switch implementation, the inherent delay (i.e., the delay of a single packet through an empty switch) is subtracted out so that only the queueing delay is measured. In every time slot, the size of every shared buffer is recorded after all arrivals, but before any departures. As a measure of required buffering (an outlier metric), the buffer size exceeded in only 10⁻³ of the time slots is computed directly from the buffer-size trace data. Smaller blocking probabilities, while more realistic, require extrapolation techniques that do not illuminate the comparative results further.

¹For the Bernoulli process, in every time slot a packet is generated at an input with probability ρ. The packet destination is a uniform random distribution over all the outputs. For the burst process, packets arrive in geometrically distributed bursts of at least one packet with mean burst size m. This is modeled by having a burst start in a time slot with probability ρ/m. All packets in a burst arrive in consecutive time slots and have the same destination chosen uniformly. Bursts start immediately or as soon as all prior bursts finish, whichever is first. In this paper, m = 10.

Fig. 3. Performance of different switch configurations with SW packet arbitrator. (a) Queueing delay: Bernoulli arrival data. (b) Queueing delay: bursty arrival data. (c) Buffer size for 10⁻³ loss: Bernoulli arrival data. (d) Buffer size for 10⁻³ loss: bursty arrival data.

Fig. 4. Comparing SW and NN packet arbitrators. (a) Queueing delay: Bernoulli arrival data. (b) Buffer size for 10⁻³ loss: Bernoulli arrival data.
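The two arrival processes of footnote 1 can be sketched as follows (generator names are invented; the load symbol follows the footnote, with mean burst size m = 10 in the paper):

```python
import random

def bernoulli_arrivals(n_slots, n_outputs, load, rng):
    """Per slot: a packet arrives with probability `load`, destined
    to a uniformly random output; None marks an empty slot."""
    return [rng.randrange(n_outputs) if rng.random() < load else None
            for _ in range(n_slots)]

def burst_arrivals(n_slots, n_outputs, load, m, rng):
    """Bursts start with probability load/m per slot, have geometric
    size (mean m, at least one packet), and one shared destination.
    A burst begins immediately or as soon as prior bursts finish."""
    slots, pending = [], []          # pending bursts: [remaining, dest]
    for _ in range(n_slots):
        if rng.random() < load / m:  # a new burst is generated
            size = 1
            while rng.random() < 1.0 - 1.0 / m:
                size += 1            # geometric length with mean m
            pending.append([size, rng.randrange(n_outputs)])
        if pending:                  # serve the oldest burst first
            burst = pending[0]
            slots.append(burst[1])
            burst[0] -= 1
            if burst[0] == 0:
                pending.pop(0)
        else:
            slots.append(None)
    return slots

rng = random.Random(1)
trace = burst_arrivals(20000, 32, 0.5, 10, rng)
print(sum(s is not None for s in trace) / len(trace))  # ≈ 0.5 for long traces
```

Both generators deliver the same long-run load, but the burst process concentrates packets with a shared destination into consecutive slots, which is what stresses the buffers in the bursty-arrival experiments.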

Since packet behavior is correlated over time scales ranging up to several hundred time slots, error bars are computed indirectly. For each metric, six subestimates are computed across successive 1000 time slot intervals, which are then assumed independent. Average delay, being a linear combination of the subestimates, uses the standard deviation of the subestimate average. Buffer size, not being linear, uses the standard deviation of the subestimates. The buffer-size subestimates are from a smaller sample than the final estimate, so this upper bounds the error.
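The outlier buffer metric and the subestimate error bars can be sketched as follows (a hedged illustration on toy data; function names are invented): the buffer requirement is an empirical quantile of the recorded trace, and the delay error bar is the standard deviation of the subestimate average.

```python
import statistics

def buffer_size_for_loss(sizes, loss=1e-3):
    """Smallest buffer size B such that the recorded size exceeds B
    in at most a `loss` fraction of slots: an empirical quantile."""
    ordered = sorted(sizes)
    n_exceed = int(loss * len(ordered))   # slots allowed to exceed B
    return ordered[len(ordered) - 1 - n_exceed]

def delay_error_bar(interval_samples):
    """Mean delay and its error bar from k interval subestimates,
    assumed independent: stdev of the subestimate average.  (For the
    nonlinear buffer-size metric, the stdev of the subestimates
    themselves is used instead.)"""
    subs = [statistics.fmean(batch) for batch in interval_samples]
    k = len(subs)
    return statistics.fmean(subs), statistics.stdev(subs) / k ** 0.5

trace = list(range(1, 10001))             # toy buffer-size trace
print(buffer_size_for_loss(trace))        # 9990

batches = [[1.0, 3.0], [2.0, 2.0], [1.5, 2.5],
           [4.0, 0.0], [2.0, 2.0], [3.0, 1.0]]
mean, err = delay_error_bar(batches)
print(mean, err)                          # 2.0 0.0
```

With the toy trace, exactly 10 of the 10 000 recorded sizes (a 10⁻³ fraction) exceed the returned value, matching the outlier definition in the text.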

Fig. 3 shows results for the three architectures using the SW PA. The OO and OOA switches include the total delay/buffering of both stages. Across metrics, the difference between the OO and the lower bound is a factor of 2. For queueing delay, both arbitrated architectures approach the lower bound at high loads. For buffering, the simpler OIA architecture approaches the lower bound for low loads, while for high loads both arbitrated architectures approach a buffer size midway between the OO and lower bound. The OIA's superior performance is due to the single stage of buffering. Although the OIA has longer first-stage queues, these are the only queues, and all queued packets are therefore under the PA's control.

Results using a NN PA are shown for Bernoulli arrivals in Fig. 4. The NN is guaranteed to provide valid but not necessarily optimal solutions. When applied to the OOA switch, there is little difference. The nonoptimal solutions of the neural PA are compensated for by always being able to send packets from any nonempty queue, so that the number of packets sent is the same with both controllers. In the OIA switch, the nonoptimal neural solutions translate directly into longer queues and delay at high loads.

IV. CONCLUSION

A distributed PA was developed and applied to a large high-speed packet switch. The arbitrated OIA and OOA switches of Sections II-B and II-C bring queueing delays to near-optimal and required buffer sizes to within a factor of 1.4 of optimal at high loads. This result applies equally across different arrival processes. Surprisingly, the best performance is with the less complex OIA switch. Comparing SW and NN PA's, the NN is similar when applied to the OOA switch, while performance degrades at high loads when applied to the OIA switch. Even so, the fast NN decisions could enable high-speed applications. Future work includes controlling max delay/fairness and simpler internal blocking switches.

REFERENCES

[1] A. Pattavina, Switching Theory: Architecture & Performance in Broadband ATM Networks. New York: Wiley, 1998.

[2] T. X Brown and K. H. Lui, "Neural network design of a Banyan network controller," IEEE J. Select. Areas Commun., vol. 8, pp. 1428–1438, Oct. 1990.

[3] M. J. Karol, M. G. Hluchyj, and S. P. Morgan, "Input versus output queueing on a space division packet switch," IEEE Trans. Commun., vol. COM-35, pp. 1347–1356, Dec. 1987.

[4] T. Kozaki et al., "32 × 32 shared buffer type ATM switch VLSI for B-ISDN," IEEE J. Select. Areas Commun., vol. 8, Oct. 1991.

[5] E. L. Lawler, Combinatorial Optimization: Networks and Matroids. New York: Holt, Rinehart and Winston, 1976, p. 195.

[6] W. H. Press et al., Numerical Recipes in C, 2nd ed. Cambridge, U.K.: Cambridge Univ. Press, 1992, pp. 710–722.

[7] W. S. Marcus, "A CMOS batcher and banyan chip set for B-ISDN packet switching," IEEE J. Solid-State Circuits, vol. 25, pp. 1426–1432, Dec. 1990.

[8] S. P. Eberhardt et al., "Competitive neural architecture for hardware solution to the assignment problem," Neural Networks, vol. 4, pp. 431–442, 1991.

[9] J. Choi and B. J. Sheu, "A high-precision VLSI winner-take-all circuit for self-organizing neural networks," IEEE J. Solid-State Circuits, vol. 28, pp. 576–584, May 1993.