IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 6, NO. 2, APRIL 1998 175

Efficient Fair Queueing Algorithms for Packet-Switched Networks

Dimitrios Stiliadis, Member, IEEE, and Anujan Varma, Member, IEEE

Abstract: Although weighted fair queueing (WFQ) has been regarded as an ideal scheduling algorithm in terms of its combined delay bound and proportional fairness properties, its asymptotic time complexity increases linearly with the number of sessions serviced by the scheduler, thus limiting its use in high-speed networks. An algorithm that combines the delay and fairness bounds of WFQ with $O(1)$ timestamp computations had remained elusive so far. In this paper we present two novel scheduling algorithms that have $O(1)$ complexity for timestamp computations and provide the same bounds on end-to-end delay and buffer requirements as those of WFQ. The first algorithm, frame-based fair queueing (FFQ), uses a framing mechanism to periodically recalibrate a global variable tracking the progress of work in the system, limiting any short-term unfairness to within a frame period. The second algorithm, starting potential-based fair queueing (SPFQ), performs the recalibration at packet boundaries, resulting in improved fairness while still maintaining the $O(1)$ timestamp computations. Both algorithms are based on the general framework of rate-proportional servers (RPS’s) introduced in [11]. The algorithms may be used in both general packet networks with variable packet sizes and in asynchronous transfer mode (ATM) networks.

Index Terms: Fair queueing algorithms, performance bounds, switch scheduling, traffic scheduling.

I. INTRODUCTION

Traffic scheduling algorithms are a critical component of future integrated-services packet networks that will provide a broad range of quality-of-service (QoS) guarantees. These guarantees are usually in the form of bounds on end-to-end delay, bandwidth, delay jitter (variation in delay), packet loss rate, or a combination of these parameters. Several scheduling algorithms for providing QoS guarantees have been proposed and analyzed in the literature [1]–[8] (for a survey, see [9]).

The design of a traffic scheduling algorithm involves an inevitable tradeoff among its delay, complexity of implementation, and fairness. Among the three, the delay and implementation complexity are clearly the most important criteria for the selection of an algorithm for use in a real system. While the fairness properties of the algorithm affect only the short-term distribution of service offered to the sessions sharing the link, a larger delay bound implies increased burstiness of the session at the output of the scheduler, thus increasing the amount of buffering needed in the switches to avoid packet losses [10]. In addition to minimizing the end-to-end delay in a network of servers, the delay behavior of an ideal algorithm must include: 1) insensitivity to traffic patterns of other sessions (isolation); 2) delay bounds that are independent of the number of sessions sharing the outgoing link; and 3) ability to control the delay bound of a session without depending on the internal parameters of the scheduler [10], [11].

Manuscript received May 28, 1996; revised March 21, 1997; approved by IEEE/ACM TRANSACTIONS ON NETWORKING Editor S. Floyd. This work was supported by the National Science Foundation Young Investigator Award MIP-9257103.

D. Stiliadis is with Bell Laboratories, Lucent Technologies, Holmdel, NJ 07733 USA (e-mail: [email protected]).

A. Varma is with the Computer Engineering Department, University of California, Santa Cruz, CA 95064 USA (e-mail: [email protected]).

Publisher Item Identifier S 1063-6692(98)02276-6.

As was discussed in the first part of this work [11], based only on the end-to-end delay bounds and fairness properties, generalized processor sharing (GPS) is considered an ideal scheduling discipline [1]. The GPS system is based on a fluid model where the packets are assumed to be infinitely divisible and multiple sessions may transmit traffic through the outgoing link simultaneously at different rates. A packet-by-packet version of the algorithm, known as PGPS or weighted fair queueing (WFQ), is defined in terms of the GPS system [1], [2]. That is, a GPS system is simulated in parallel with the packet-based system in order to identify the set of sessions that are backlogged at each instant. This information is used to compute a timestamp for each arriving packet, indicating the time at which it would depart the system under GPS. Packets are then transmitted in increasing order of their timestamps. A serious problem with this approach is its computational complexity: a maximum of $V$ events may be triggered in the GPS simulator during the transmission of one packet, where $V$ denotes the number of sessions that share the outgoing link. Thus, the time required for completing a scheduling decision is $O(V)$.

1063–6692/98$10.00 © 1998 IEEE

In order to reduce its complexity, an approximate implementation of GPS multiplexing was proposed by Davin and Heybey [12] and later analyzed by Golestani [13], [14] under the name self-clocked fair queueing (SCFQ). In this algorithm the timestamp of an arriving packet is computed based on the timestamp of the packet currently in service. A similar approach was proposed recently under the name start-time fair queueing [15], where the starting time of the packet currently in service is used to compute the timestamp of the arriving packet. This approach reduces the complexity of the algorithm greatly. However, the price paid is in terms of the end-to-end delay bounds that grow linearly with the number of sessions sharing the outgoing link [10], [15]. Thus, the worst-case delay of a session can no longer be controlled just by controlling its reservation, as is possible in WFQ. The higher end-to-end delay also affects the burstiness of sessions within the network, increasing the buffer requirements. The VirtualClock scheduling algorithm [3], on the other hand, provides the same end-to-end delay and burstiness bounds as WFQ with a simple timestamp computation algorithm, but the price paid is in terms of fairness. A session can be starved for an arbitrary amount of time in a VirtualClock server if it has received bandwidth in excess of its reservation in the past [1].
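The two simple timestamp rules contrasted above can be sketched as follows. This is a minimal illustration based on the usual textbook statements of SCFQ and VirtualClock, not code from the cited papers; the function and parameter names are assumptions made for this example.

```python
def scfq_stamp(prev_stamp, in_service_stamp, length, rate):
    """SCFQ-style rule: the reference clock is the timestamp of the
    packet currently in service (hypothetical helper for illustration)."""
    return max(prev_stamp, in_service_stamp) + length / rate


def virtual_clock_stamp(prev_stamp, now, length, rate):
    """VirtualClock-style rule: the reference clock is real time, so a
    session that ran ahead of its reservation keeps a large timestamp."""
    return max(prev_stamp, now) + length / rate
```

Both rules take constant time per packet; they differ only in the reference value that is compared against the previous timestamp of the session, which is where the delay and fairness differences described above originate.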

An algorithm that combines the delay and fairness bounds of WFQ with $O(1)$ timestamp computations had remained elusive so far. In this paper we present two novel scheduling algorithms that have $O(1)$ complexity for timestamp computations and provide the same bounds on end-to-end delay and buffer requirements as those of WFQ. These algorithms are based on the analytical framework of rate-proportional servers (RPS’s) presented in [11].

Schedulers in the RPS class use the concept of potential to track the state of the system. Each session is associated with a session potential that keeps track of the amount of normalized service actually received by the session during the current system busy period,¹ plus any normalized service it missed during the period when it was not backlogged. The session potential is a nondecreasing function of time during a system busy period. The basic system is defined in terms of a fluid model, and the corresponding packet server is obtained by computing a timestamp for each arriving packet that represents the value of the session potential at the instant that the last bit of the packet leaves the fluid system, and scheduling the packets in the order of increasing timestamps.

We assume that $V$ sessions share the outgoing link of the scheduler, a rate $\rho_i$ is allocated to each session $i$, and the total bandwidth assigned to the sessions does not exceed the link capacity $r$. That is,

$$\sum_{i=1}^{V} \rho_i \le r.$$

When session $i$ is backlogged, its potential increases exactly by the normalized service it receives. That is, if $P_i(t)$ denotes the potential of session $i$ at time $t$, then, during any interval $(\tau, t]$ within a backlogged period for session $i$,

$$P_i(t) - P_i(\tau) = \frac{W_i(\tau, t)}{\rho_i}$$

where $W_i(\tau, t)$ denotes the amount of service received by session $i$ during the interval $(\tau, t]$.

The basic objective of an RPS is to equalize the potential of all backlogged sessions at each instant. This is achieved in a fluid server as follows. At any instant, the scheduler services only the subset of sessions with the minimum potential, and each session in this subset receives service in proportion to its reserved rate $\rho_i$. Thus, the scheduler can be seen to increase the potentials of the sessions in this subset at the same rate. At the time that a session becomes backlogged, its potential is updated based on a system potential function that keeps track of the progress of the total work done by the scheduler. The system potential is a nondecreasing function of time.
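The equalizing behavior of the fluid server can be illustrated with a small sketch. It assumes that every listed session stays backlogged for the whole interval and that the link is fully used; the function name, the dictionary layout, and the tolerance `eps` are choices made for this example, not part of the original framework.

```python
def advance_fluid_rps(potentials, rates, capacity, dt, eps=1e-12):
    """Advance the potentials of backlogged sessions by dt time units.

    potentials: session -> current potential (backlogged sessions only)
    rates:      session -> reserved rate
    capacity:   link capacity, shared by the minimum-potential sessions
    """
    pots = dict(potentials)
    remaining = dt
    while remaining > eps:
        low = min(pots.values())
        active = [s for s, p in pots.items() if p - low <= eps]
        higher = [p for p in pots.values() if p - low > eps]
        # Service is split in proportion to reserved rates, so all active
        # potentials rise at the common slope capacity / sum(active rates).
        slope = capacity / sum(rates[s] for s in active)
        step = remaining
        if higher:
            # Stop when the active set catches up with the next level.
            step = min(remaining, (min(higher) - low) / slope)
        for s in active:
            pots[s] = low + slope * step
        remaining -= step
    return pots
```

For instance, with potentials `{"a": 0, "b": 2}`, equal rates of 1, and capacity 2, session `a` alone is served until it reaches potential 2, after which both sessions advance together.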

¹ A system busy period is defined as a maximal interval during which the server is continuously transmitting packets.

When an idle session $i$ becomes backlogged at time $t$, its potential is set as

$$P_i(t) = \max(P_i(t^-), P(t))$$

to account for the service that it missed. Schedulers use different functions to maintain the system potential, giving rise to widely different delay and fairness behaviors. In general, the system potential at time $t$ can be defined as a nondecreasing function of the potentials of the individual sessions before time $t$, and the real time $t$:

$$P(t) = \mathcal{P}(P_1(t^-), P_2(t^-), \ldots, P_V(t^-), t). \tag{1.1}$$

Ideally, the rate of increase of the system potential at each instant should match the rate of increase of the potential of a session currently being serviced by the scheduler. In practice, however, a much more relaxed definition of the system-potential function is adequate.

The system-potential function in an RPS must satisfy two fundamental properties to provide performance bounds comparable to those of a WFQ scheduler. First, during any interval $(t_1, t_2]$ within a system busy period, the system-potential function must be increased with a rate of at least one, that is,

$$P(t_2) - P(t_1) \ge t_2 - t_1. \tag{1.2}$$

Second, the system-potential function must never exceed the potential of any backlogged session:

$$P(t) \le \min_{j \in B(t)} P_j(t) \tag{1.3}$$

where $B(t)$ denotes the set of sessions that are backlogged in the server at time $t$. In addition, if the difference between the system potential and the potential of every backlogged session is bounded, then the server is fair and its fairness (or rather unfairness) can be estimated in terms of this difference [11].

The definition of RPS’s does not specify the exact method of maintaining the system-potential function. This enables a wide range of algorithms to be defined, all with the same delay bound as that of WFQ but with different fairness characteristics. For example, both GPS and the fluid-model equivalent of VirtualClock are RPS’s, but their fairness bounds and implementation complexities occupy two extremes in the RPS framework [11].

The fundamental difficulty in designing a practical RPS is the need to maintain the system-potential function. Tracking the global state of the system precisely requires simulating the corresponding fluid-model RPS in parallel with the packet server. However, the definition of the system-potential function allows considerable flexibility in approximating the global state of the system. This flexibility is exploited in this paper in the design of two practical scheduling algorithms: frame-based fair queueing (FFQ) and starting potential-based fair queueing (SPFQ). Both algorithms maintain the system-potential function only as an approximation of the actual global state in the fluid model, but recalibrate the system potential periodically to correct any discrepancies. This recalibration is key to providing a bounded fairness index, where the fairness index is defined as the maximum difference in normalized service received by any two backlogged sessions during any arbitrary interval. In the FFQ algorithm this recalibration is done at frame boundaries, while in SPFQ the recalibration occurs at packet boundaries. This gives rise to two algorithms with the same delay bound but with slightly different fairness properties. Both algorithms, however, provide bounded unfairness and $O(1)$ timestamp computations. It is interesting to note that the fairness index of SPFQ is actually no worse than that of WFQ.

Both FFQ and SPFQ are timestamp-based algorithms. However, FFQ uses a framing approach similar to that used in frame-based schedulers to recalibrate the system potential periodically. This makes the fairness of the algorithm depend on the frame size chosen by the implementation. SPFQ avoids this sensitivity to the frame size by recalibrating the system potential at the end of transmission of every packet. In comparison to FFQ, SPFQ requires more state information to be maintained, resulting in a more complex implementation; however, this increased complexity does not affect its asymptotic time complexity. Thus, SPFQ is more attractive than FFQ in applications where its improved fairness properties justify the additional cost.

The rest of this paper is organized as follows. In Section II we define a general methodology for estimating and updating the system potential that will form the basis of the algorithms developed in later sections. In Section III we present FFQ in terms of a hypothetical fluid model, and subsequently extend it to a packet server. We also analyze the fairness properties of the algorithm. In Section IV we develop and analyze the SPFQ algorithm. Some concluding remarks are presented in Section V.

II. MAINTAINING THE SYSTEM POTENTIAL

In this section we will develop a general methodology for maintaining the system-potential function in a scheduler so as to bound the unfairness. The resulting algorithms will be referred to as fair RPS (FRPS). Formally, we can define an FRPS as an RPS satisfying the following property.

Definition 1: Let $P(t)$ denote the system potential in an RPS and $P_i(t)$ denote the potential of session $i$ at time $t$. The scheduling algorithm is an FRPS if and only if a finite constant $\Delta P$ can be found such that

$$P_i(t) - P(t) \le \Delta P \quad \text{for any } i \in B(t)$$

where $B(t)$ is the set of backlogged sessions at time $t$.

The above constraint can be satisfied by the periodic use of a recalibration mechanism to bound the maximum difference of the system-potential function from the potential of a backlogged session.

Let us first introduce some notation. We assume that a rate $\rho_i$ is allocated to session $i$. Let $r$ be the bandwidth capacity of the outgoing link; then, $\rho_i / r$ is the fraction of the link rate allocated to session $i$. As in the previous section, let $P_i(t)$ represent the potential of session $i$ at time $t$ and $P(t)$ denote the corresponding value of system potential.

The fluid version of an FRPS follows all of the conditions in [11, Definition 5]. That is, at each instant, the scheduler services only the set of backlogged sessions with the minimum potential, and sessions in this set are serviced at rates proportional to their reservations.

In an idealized fluid server it is possible to update the system potential at any instant of time. However, in a packet server it is desirable to update the system potential only when a packet departs from the system. In order to simplify the implementation of the algorithm we will define a general method for updating the system potential that will be based only on information extracted from the packet-level implementation of the algorithm.

We first define a function $SP(t)$ that we will call the base potential. $SP(t)$ is a nondecreasing function with the following properties.

1) Let $B(t)$ represent the set of sessions that are backlogged at time $t$ in the system. Then

$$SP(t) \le P_i(t) \quad \forall i \in B(t). \tag{2.1}$$

2) A finite constant $\Delta P$, as in Definition 1, can be found such that

$$P_i(t) - SP(t) \le \Delta P \quad \forall i \in B(t). \tag{2.2}$$

That is, the base-potential function can be any nondecreasing function whose value is never higher than the potential of any backlogged session at that instant. It is easy to see that such a function can be used as the reference for updating the system potential periodically, since it satisfies (1.3). Since the above definition of the base-potential function does not specify how to construct such a function, considerable flexibility exists in its choice. Assuming that the base potential is used to recalibrate the system potential periodically, and that the interval between recalibrations is bounded, the condition in (2.2) is sufficient to achieve bounded unfairness, that is, for the algorithm to belong to the FRPS class. Different choices of the function $SP(t)$ result in algorithms with different implementation complexities but all with bounded unfairness. In the later sections we will show two distinct ways to construct the base-potential function $SP(t)$, resulting in the FFQ and SPFQ algorithms.

In general, an FRPS can be constructed by maintaining the system-potential function as follows.

Definition 2: Let the system-potential function in an RPS be defined as follows. When the system is not busy, the system-potential function is equal to zero. During a system busy period, the function $P(t)$ is a piecewise linear function of time $t$. Let $t_0$ be the beginning of the current system busy period. Then

1) At times $\tau_1, \tau_2, \ldots$, with $t_0 < \tau_1 < \tau_2 < \cdots$, a recalibration is performed by updating $P(t)$ to the base potential at that instant, if the system potential is lower than the base potential. That is,

$$P(\tau_j) = \max(P(\tau_j^-), SP(\tau_j)) \tag{2.3}$$

where $P(\tau_j^-)$ denotes the limit of $P(t)$ as $t$ reaches $\tau_j$ from the left.

2) At any time $t$ between updates, $\tau_j \le t < \tau_{j+1}$, the system potential increases linearly with time. That is,

$$P(t) = P(\tau_j) + (t - \tau_j). \tag{2.4}$$

3) The interval between successive recalibrations is bounded, that is,

$$\tau_{j+1} - \tau_j \le \Delta\tau \quad \text{for some finite } \Delta\tau.$$

The update operation in (2.3) enables us to bound the difference between the system potential and the potentials of backlogged sessions. Without such an update mechanism, the system potential may diverge from the session potentials by an arbitrary amount, causing the unfairness of the algorithm to be unbounded. The updates at time instants $\tau_1, \tau_2, \ldots$ are designed to bring the system potential to a value closer to the session potentials. This value is estimated through the base-potential function $SP(t)$.
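A minimal sketch of the bookkeeping implied by Definition 2 is given below, assuming a single busy period and a caller that supplies the base potential at each recalibration. The class and method names are illustrative assumptions, not taken from the paper.

```python
class SystemPotential:
    """System potential that grows with slope one between recalibrations."""

    def __init__(self, busy_start=0.0):
        self.value = 0.0             # potential at the last update instant
        self.last_time = busy_start  # time of the last update instant

    def at(self, t):
        # Linear growth between updates, as in (2.4).
        return self.value + (t - self.last_time)

    def recalibrate(self, t, base_potential):
        # Raise the potential to the base potential if it lags, as in (2.3).
        self.value = max(self.at(t), base_potential)
        self.last_time = t
        return self.value
```

Because a recalibration can only raise the value, the potential gained over any interval is at least the length of that interval, which is the property established in Lemma 1.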

Before proceeding further, it is important to note that the fairness index of any RPS as defined above depends on two factors:

1) the choice of the base-potential function $SP(t)$;
2) the frequency of recalibrations, that is, the choice of the update instants $\tau_j$.

In order to design an efficient packet version of the algorithm, we can only perform these recalibration steps at the times that a packet finishes its service. Thus, the frequency of recalibration is upper bounded by the departure rate of packets from the server.

We now proceed to show that the system-potential function in Definition 2 results in an FRPS. We first need to show that the system-potential function satisfies the two key properties in the definition of an RPS [11]. That is, during a system busy period the system potential must increase at least at the rate of real time, and the system potential must not exceed the potential of any backlogged session. These properties are proven in the following two lemmas.

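Lemma 1: If the system-potential function is maintained as described by Definition 2, then for any interval $(t_1, t_2]$ during a system busy period

$$P(t_2) - P(t_1) \ge t_2 - t_1.$$

Proof: Assume that the system busy period under observation started at time 0. If no recalibrations occurred during the interval $(t_1, t_2]$, then the lemma is true by (2.4). Now consider the case when one or more recalibrations occurred during the interval $(t_1, t_2]$. Let $\tau_1, \tau_2, \ldots, \tau_k$ be the update instants, with $t_1 < \tau_1 < \tau_2 < \cdots < \tau_k \le t_2$. Then, from (2.3) and (2.4),

$$P(\tau_{j+1}) \ge P(\tau_j) + (\tau_{j+1} - \tau_j) \tag{2.5}$$

for $1 \le j < k$. From (2.4) and (2.5) we can express the final potential as

$$P(t_2) = P(\tau_k) + (t_2 - \tau_k) \ge P(\tau_{k-1}) + (t_2 - \tau_{k-1}).$$

Proceeding similarly,

$$P(t_2) \ge P(t_1) + (t_2 - t_1).$$

This concludes the proof of Lemma 1.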

Lemma 2: If the system-potential function is maintained as in Definition 2, then at any time $t$

$$P(t) \le P_i(t) \quad \forall i \in B(t). \tag{2.6}$$

Proof: The proof is by contradiction. Since $P(0) = 0$, (2.6) is satisfied trivially at time 0. Let $t$ be the earliest time during a system busy period at which $P(t) > P_i(t)$ for some $i \in B(t)$. We need to consider two cases.

Case 1: The system potential was not increased at time $t$ as a result of a recalibration. That is, at time $t$ we did not use the function $SP(t)$ to calculate the system potential, but instead used the linear function defined by (2.4). Then, let $(t_1, t]$ be any interval such that $t_1 < t$, session $i$ was continuously under service in the interval $(t_1, t]$, and no changes in $P(t)$ occurred during the interval from a recalibration. Note that, since session $i$ is the first session for which $P_i(t) < P(t)$, such an interval always exists. Otherwise, there is another session $j$ with $P_j(t') < P(t')$ for some $t' < t$.

Since session $i$ is one of the sessions with minimum potential throughout the interval $(t_1, t]$, it was serviced at a minimum rate of $\rho_i$ during this interval. Thus, the potential of session $i$ at $t$ must be at least

$$P_i(t) \ge P_i(t_1) + (t - t_1) \ge P(t_1) + (t - t_1) = P(t).$$

Thus, the result is true by contradiction.

Case 2: The potential was increased at time $t$ as a result of a recalibration. That is, $P(t) > P(t^-)$. Then, by (2.3),

$$P(t) = SP(t). \tag{2.7}$$

From the definition of $SP(t)$ we also know that

$$SP(t) \le P_i(t) \tag{2.8}$$

which implies that $P(t) \le P_i(t)$, a contradiction.

Theorem 1: An RPS with its system-potential function defined as per Definition 2 is an FRPS.

Proof: Lemmas 1 and 2 prove that the system-potential function satisfies the two main conditions imposed by the definition of an RPS. In addition, if the recalibrations are performed at finite intervals, by (2.2), the difference between the system potential and the potential of any backlogged session will be bounded. Thus, the algorithm is an FRPS.

III. FRAME-BASED FAIR QUEUEING

Using the methodology we described in the previous section, we can define several algorithms by choosing different base-potential functions and recalibration intervals. A simple approach is to perform the recalibration periodically, with a maximum period equal to an internal parameter of the algorithm that we call the frame size $T$. This approach results in the definition of FFQ. In this section we will describe the FFQ algorithm and analyze its properties.

We will first define the parameters of the algorithm with respect to a fluid system and subsequently extend them to the packet-based system. We define the frame size parameter $F$ such that exactly $F$ bits can be transmitted during a frame period. That is, $F = rT$. We define $\phi_i$ as

$$\phi_i = \frac{\rho_i}{r} F = \rho_i T.$$

$\phi_i$ denotes the maximum amount of session $i$ traffic that can be serviced during one frame. When a session remains backlogged, its potential increases by the normalized service offered to it. Thus, when $\phi_i$ bits are serviced from session $i$, its potential will increase by $T$. We impose one more restriction on the value of $F$, that the largest packet of a session can be transmitted during a frame period. That is, if $L_i$ is the maximum packet size for session $i$, then

$$L_i \le \phi_i. \tag{3.1}$$
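As a worked illustration of these parameters, the sketch below computes the frame size in bits and the per-session quota for a given frame period, and checks the packet-size restriction. The function name, argument layout, and the numbers in the usage note are assumptions made for this example only.

```python
def frame_parameters(link_rate, frame_period, rates, max_packet):
    """Return (F, phi) for an FFQ frame, checking the packet-size rule.

    link_rate:    link capacity r in bits per second
    frame_period: frame period T in seconds
    rates:        session -> reserved rate rho_i in bits per second
    max_packet:   session -> largest packet size L_i in bits
    """
    frame_bits = link_rate * frame_period  # bits sent in one frame period
    phi = {s: rho * frame_period for s, rho in rates.items()}
    for s, quota in phi.items():
        # The largest packet of each session must fit within one frame.
        if max_packet[s] > quota:
            raise ValueError(f"largest packet of session {s!r} exceeds its frame quota")
    return frame_bits, phi
```

With a 1 Mb/s link and a 10 ms frame period, a session reserved at 250 kb/s obtains a quota of about 2500 bits per frame, so its packets could be at most that long under this restriction.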

We will refer to the process of recalibrating the system potential in an FFQ server as a frame update operation. Each frame update operation marks the beginning of a new frame in the system. If all of the sessions are continuously backlogged, frame updates can be performed in a fluid server exactly at intervals of the frame period $T$. The updates will occur earlier, however, if some of the sessions are idle, causing the potentials of backlogged sessions to rise faster. Thus, in the fluid server, the $k$th frame update can be performed when the potentials of backlogged sessions reach the value $kT$. Note that all backlogged sessions reach the potential of $kT$ at the same time in the fluid server. In a packet server, however, this is not the case. Therefore, in order to avoid simulation of the fluid system to determine the frame update instants, we define the frame update instants in a more relaxed manner as follows. Let $\tau_{k-1}$ denote the last time a frame update occurred. The next update is performed when both of the following conditions hold.

1) The potentials of all backlogged sessions in the fluid server belong in the next frame. That is,

$$P_i(t) \ge kT \quad \forall i \in B(t) \tag{3.2}$$

where $B(t)$ is the set of sessions currently backlogged.

2) The potentials of all sessions fall below the beginning of the next frame, that is,

$$P_i(t) < (k+1)T \quad \text{for all } i. \tag{3.3}$$

Note that the above conditions may be satisfied during a window of time. Performing the next frame update at any instant during this window will result in a valid algorithm. Let us assume that we decide to update the frame at time $\tau_k$. Then, at time $\tau_k$ we set

$$P(\tau_k) = \max(P(\tau_k^-), kT). \tag{3.4}$$

Since analysis of the packet FFQ server requires reference to the corresponding fluid server, we define the frame update instants to be identical in both servers. These update instants can then be determined from only information available in the packet server, so as to fall in the window defined by conditions (3.2) and (3.3) above. This relaxed definition allows recalibrations to be performed only when a packet finishes service in the packet server. Note that the actual duration of a frame, that is, the interval between successive frame updates, never exceeds $T$.

Fig. 1. Behavior of the base-potential and system-potential functions in a fluid FFQ server. $P_i(t)$ represents the potentials of all backlogged sessions, $P(t)$ denotes the system potential, and $SP(t)$ denotes the base potential. Recalibration of the system potential occurs at frame update points $\tau_1$, $\tau_2$, $\tau_3$, and $\tau_4$. The frame update points can fall anywhere within a window where the individual session potentials lie between $kT$ and $(k+1)T$.

We can now define the base-potential function $SP(t)$ for FFQ as follows. $SP(t)$ is a step function whose value is zero when the server is idle and increases by $T$ at every frame update instant. Thus, at the $k$th frame update instant $\tau_k$, $SP(t)$ assumes a value of $kT$.

Fig. 1 illustrates the base-potential and system-potential functions in a fluid FFQ server, where $\tau_1$ through $\tau_4$ denote the instants at which frame updates occur. The system potential grows linearly with time between the update instants. Note that the $k$th frame update can be performed any time during a window when the potential of a backlogged session is between $kT$ and $(k+1)T$.

In a packet server the frame updates can only be performed at packet boundaries. We now show that an update instant can be found in a packet server based only on the timestamps of the queued packets. Recall that the timestamp of a packet denotes the potential of the corresponding session at the instant that the packet completes its service in the fluid system. We make use of the following lemma from [11] to establish a relationship between the potentials of a session in the fluid and packet servers.

Lemma 3: Let $(0, t]$ be an interval within a system busy period in the fluid FFQ server. Let $i$ be a session backlogged in the fluid server at time $t$ such that $i$ received more service in the packet server compared to the fluid server in the interval $(0, t]$. Then there is another session $j$, with $P_j(t) \le P_i(t)$, that received more service in the fluid server than in the packet server during the interval $(0, t]$.

This lemma enables us to find a relationship between the potentials of the backlogged sessions in the fluid server and the timestamps of the backlogged sessions in the packet server. We can now prove the following lemma, which will allow us to perform frame updates in FFQ by using only information extracted from the actual packet-based system. Let us first define the starting potential of a packet of session $i$ as the potential of the session when the packet starts being serviced in the corresponding fluid server, and let $S_i(t)$ denote the starting potential of the first packet in the queue of session $i$ at time $t$. Let $B_p(t)$ denote the set of backlogged sessions at time $t$ in the packet server.

Lemma 4: Assume that at time $t$, for each backlogged session in the FFQ packet server, the starting potential of its first packet belongs in the next frame. That is, if $\tau_{k-1}$ was the last instant at which a frame update occurred,

$$S_i(t) \ge kT \quad \forall i \in B_p(t).$$

Then, the potential of each backlogged session in the fluid FFQ server at time $t$ is also greater than or equal to $kT$.

Proof: We will prove the lemma by contradiction. Let us denote with $i$ the session with the minimum potential in the fluid server and let us assume that the potential of session $i$ is less than $kT$. Session $i$ has received until time $t$ more service in the packet server than in the fluid server. By Lemma 3, there is another session $j$ with potential $P_j(t) \le P_i(t)$ that has received less service in the packet server than in the fluid server. Let $S_j$ be the starting potential of the packet that is being serviced in the fluid server at time $t$ from session $j$. Then $S_j \le P_j(t)$ and, thus, $S_j < kT$. Notice also that this packet has not yet been serviced in the packet server. This is a contradiction.

The significance of Lemma 4 is that we can determine frame update times based only on information available in the packet server, and the scheduling algorithm still remains an RPS. It should be noted that Lemma 4 establishes only that the potential of every backlogged session in the fluid FFQ server at time $t$ is not less than $kT$. To perform the next frame update at time $t$, we must also show that the potential of every backlogged session at time $t$ is below $(k+1)T$. We will prove this latter result in the next section. Thus, the packet server can perform frame updates by keeping track of all of the sessions that are backlogged and have a packet with starting potential in the current frame and finishing potential in the next frame. When all such packets have been transmitted, we know that the potentials of the corresponding sessions in the fluid system have also crossed the frame boundary. Therefore, the completion of transmission of the last packet in this set is a valid time to update the frame and the system-potential function.

We can now describe the packet version of the FFQ algorithm. Without loss of generality we can assume that the service rate of the server is one. Thus, the time to transmit $F$ bits is also equal to $F$. A fraction $\rho_i$ of the output link bandwidth is allocated to session $i$ and, therefore, $\phi_i = \rho_i F$ bits can be sent from session $i$ during a frame. As in the fluid version, we require that the maximum packet size be less than $\phi_i$ so that a single packet can be transmitted within one frame.

On the arrival of a packet, the algorithm in Fig. 2 is executed to calculate the timestamp associated with the packet.

Fig. 2. Algorithm executed on the arrival of a packet in an FFQ server.

If the starting and finishing potentials of the packet belong to different frames, the current packet is one that crosses over to the next frame. Therefore, the packet is marked to indicate that this is the first packet of the session to cross over to the next frame. In addition, a counter is incremented to keep track of the number of sessions that have crossed over into the new frame. The algorithm maintains one counter per frame to keep track of the number of sessions whose packets cross into the next frame. Later, when a marked packet is scheduled for transmission, the corresponding counter is decremented; when the counter reaches zero, the potentials of all of the backlogged sessions have crossed over to the next frame, and a frame update can be performed.
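The arrival-time processing described above can be sketched as follows. This is an illustration reconstructed from the prose, not a reproduction of Fig. 2; the class names, the field names, and the use of a dictionary for the per-frame counters are assumptions made for this example.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Packet:
    length: float
    start: float = 0.0    # starting potential
    stamp: float = 0.0    # finishing potential, used as the timestamp
    marked: bool = False  # True if the packet crosses a frame boundary


@dataclass
class FFQArrivalState:
    frame_period: float   # T
    rates: dict           # session -> reserved rate
    last_stamp: dict = field(default_factory=dict)
    crossing: dict = field(default_factory=dict)  # frame index -> counter


def on_arrival(state, session, packet, system_potential):
    """Timestamp an arriving packet and record a frame crossing."""
    previous = state.last_stamp.get(session, 0.0)
    packet.start = max(previous, system_potential)
    packet.stamp = packet.start + packet.length / state.rates[session]
    state.last_stamp[session] = packet.stamp
    start_frame = math.floor(packet.start / state.frame_period)
    finish_frame = math.floor(packet.stamp / state.frame_period)
    if start_frame < finish_frame:
        # Starting and finishing potentials lie in different frames.
        packet.marked = True
        state.crossing[start_frame] = state.crossing.get(start_frame, 0) + 1
    return packet
```

Each arrival touches only the state of its own session and one counter, so the work per packet is independent of the number of sessions.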

The array of counters is used to count the number of sessions that have packets with a starting potential in each frame. Although an infinite number of frames may need to be serviced, in practice the number of distinct frames into which the potentials of queued packets can fall is limited by the buffer size allocated to the sessions. Thus, the size of the array can be limited to a finite value determined by the buffer space allocated to each session.

If this array size is rounded up to the nearest power of two, then the array can be addressed with the least significant bits of the current frame number. The number of counters can further be reduced to three if steps 4–8 of the algorithm are executed only when a packet reaches the head of the queue of the corresponding session [16].
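The power-of-two addressing trick can be sketched as below. The helper names and the example sizes are hypothetical; the required array size itself would come from the buffer allocation discussed above.

```python
def round_up_pow2(n):
    """Smallest power of two that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


def counter_index(frame_number, array_size):
    """Index of the counter for a frame; array_size must be a power of two."""
    # Masking keeps only the low-order bits of the frame number, which is
    # equivalent to frame_number % array_size without a division.
    return frame_number & (array_size - 1)
```

For example, a required size of 5 counters would be rounded up to 8, and frame number 13 would then map to slot 5.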

When a packet finishes transmission, the algorithm in Fig.3 is executed to update the state of the system. It is possible toavoid testing the second condition in step 7 of the algorithmby modifying the algorithm slightly. The modification consistsof updating the variable and performing the frame updatewhen a packet isselectedfor transmission, rather than whenit completes transmission. In this case, a packet arrivingafter the last marked packet started its service will alwaysreceive a timestamp value in the next frame. To show thatthis modified system remains an RPS with the same latency,consider the following equivalent system. Assume that thetraffic scheduling system consists of a regulator followedby an FFQ scheduler. The regulator holds all packets that


Fig. 3. Algorithm executed on the departure of a packet in an FFQ server.

arrive while the transmitter is busy, and delivers them to thescheduler in batches at the end of transmission of each packet.It is easy to verify that this new system consisting of theregulator and the scheduler is work conserving.

Since packets arrive in the FFQ scheduler only at times when a packet finishes service, it is easy to verify that a packet will never finish transmission in the packet server later than in the corresponding fluid server. (The proof can easily be derived by extending [11, Lemma 3].) An arriving packet may see a maximum delay of in the regulator, equal to the maximum time needed for the current packet to complete service in the transmitter. Thus, the new system, consisting of the regulator and the scheduler, still belongs to the class of latency-rate servers [10] with the same latency as that of a simple RPS. However, we must note here that updating the variable and performing the frame update when a packet is selected for transmission, rather than when it completes transmission, alters the system-potential function and, therefore, may change the transmission sequence of packets.
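The regulator used in this equivalence argument can be pictured with a short sketch. It is only an illustration of the batching behavior described above: the class `Regulator`, its callback `deliver`, and the event-method names are assumptions for the example, and the scheduler behind the callback is not modeled.

```python
class Regulator:
    """Holds packets that arrive while the transmitter is busy."""

    def __init__(self, deliver):
        self.deliver = deliver  # hypothetical callback: receives a list of packets
        self.busy = False       # True while a packet is being transmitted
        self.held = []          # packets waiting for the current transmission to end

    def on_arrival(self, packet):
        if self.busy:
            # Transmitter busy: hold the packet until the current one finishes.
            self.held.append(packet)
        else:
            # Transmitter idle: pass the packet straight to the scheduler.
            self.deliver([packet])

    def on_transmission_start(self):
        self.busy = True

    def on_transmission_end(self):
        # Release everything that arrived during the transmission as one batch.
        self.busy = False
        batch, self.held = self.held, []
        if batch:
            self.deliver(batch)
```

Because held packets are released as soon as the transmission in progress ends, a packet waits in this sketch for at most one packet transmission time, which matches the maximum regulator delay stated in the text.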

A. Correctness of FFQ

In order to be complete, it is necessary to verify that all conditions imposed in the definition of the FFQ algorithm for updating the frame are satisfied when the above algorithm is executed. We have already proven in Lemma 4 that when the frame is updated at time , the potentials of all backlogged sessions in the fluid server are at least equal to . We also have to prove that, at this time, the potential of every session in the fluid server is less than . We will use the following sequence of three lemmas to prove this result. The proofs of these lemmas can be found in the Appendix.

Lemma 5: Let be an interval within a system busy period in an RPS fluid server. Let be a session that received more service in the fluid server compared to the packet server in the interval . Then, there is another session with

that received more service in the packet server than in the fluid server during the interval .

Lemma 6: At time when the frame is updated as described in the packet FFQ algorithm, the server has not yet transmitted any packet with potential greater than or equal to .

Lemma 7: Let be the time at which the $k$th frame update occurs in the packet FFQ server. Then, the potential of all of the sessions in the fluid server at time is less than .

B. Fairness of FFQ

Since FFQ is an RPS, in order to analyze its fairness it is sufficient to prove that the difference between the system potential and the potential of any backlogged session is always bounded. We can state the following lemma for the fluid FFQ server.

Lemma 8: For every session $i$ backlogged in the fluid FFQ server at time $t$

A detailed proof can be found in the Appendix. Although a bound for the unfairness of the packet server can be derived from [11, Th. 4], we can provide a much tighter bound for FFQ. As in [11], let us denote with the service offered by the scheduler to a connection during the interval , assuming that a packet is transmitted as a whole only when its last bit is transmitted; and with the service offered to a session in the packet-by-packet server when the partial service received by a packet under transmission is included. Then, we can prove the following bound on the fairness of a packet FFQ server.

Lemma 9: For any two sessions that are continuously backlogged in the interval in the packet FFQ server

The rather long proof of the above lemma is omitted due to lack of space. Interested readers may consult [16] for a detailed proof. Lemma 9 shows that the fairness of the algorithm depends on the selection of the frame size. The latter, in turn, depends on the maximum packet size of each session and its minimum bandwidth allocation. Thus, the algorithm is especially suited to application in asynchronous transfer mode (ATM) networks where the traffic consists of small fixed-size cells and the frame size can be kept small. Note, however, that the frame size does not affect the latency of the server as is the case in frame-based schedulers such as weighted round-robin and deficit round-robin. In addition, some short-term unfairness is unavoidable in any packet-level scheduler. The difference in normalized service received by two sessions can be proportional to the number of backlogged sessions even in a WFQ server. Thus, the level of fairness provided by FFQ may be quite acceptable in practice.

IV. STARTING POTENTIAL-BASED FAIR QUEUEING

The highest frequency at which recalibration of the system potential can be performed in a packet server is determined by the transmission rate of packets on the outgoing link. Thus, we can attempt to perform a recalibration each time a packet finishes service, thus improving on the fairness properties of FFQ. This approach is used in the definition of the SPFQ algorithm in this section.

Again, let $S_i(t)$ denote the starting potential of the first packet in the queue of a backlogged session $i$ in the packet server. That is, $S_i(t)$ is a step function whose value is updated each time a new packet is placed at the head of the queue of session $i$. Then, we define the base-potential function as

$$S_P(t) = \min_{i \in B(t)} S_i(t) \qquad (4.1)$$

where $B(t)$ denotes the set of backlogged sessions in the packet server at time $t$. That is, the base potential at any time $t$ is defined as the minimum of the starting potentials of the backlogged sessions. This allows $S_P(t)$ to be calculated in an efficient manner: its value needs to be updated only when a packet is moved to the head of a session’s queue.

To complete the specification of the algorithm, we must also define the time instants at which the recalibration of system potential, as defined by (2.3), is performed. We define these instants to be the times at which a packet completes its service in the packet server.

Before proceeding to describe the algorithm further, we first show that the above definition of the base-potential function satisfies the property in (2.2) that its value never exceeds the minimum potential of a backlogged session in the corresponding fluid server.

Lemma 10: If the starting potential of every backlogged session in the packet server is greater than or equal to at time , then the potential of each backlogged session in the corresponding fluid server at time is also greater than or equal to .

Proof: We will prove the lemma by contradiction. Let us denote with the session with the minimum potential in the fluid server at time and assume that its potential in the fluid server is less than . Session has received until time more service in the packet server than in the fluid server. By Lemma 3, there is another session with potential

that has received less service in the packet server than in the fluid server. This means that the packet most recently serviced from session in the fluid server has not yet finished service in the packet server. Let be the starting potential of this packet. Then

By hypothesis, . Therefore, we must have , which contradicts the definition of .

This concludes the proof of Lemma 10.

The above result enables the recalibrations to be performed using only information extracted from the packet server. The scheduler can keep track of all of the sessions that are backlogged in the packet server and determine the minimum starting potential among their packets. When the system potential is lower than the starting potentials of the packets at the head of the queues of all of the backlogged sessions, an update is performed to increase the system potential to the minimum among the starting potentials.
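The update rule in the preceding paragraph amounts to taking a maximum. A minimal sketch, assuming the head-of-queue starting potentials are available in a dictionary keyed by session (the function name `recalibrate` and its parameters are hypothetical):

```python
def recalibrate(system_potential, head_start_potentials):
    """Raise the system potential to the smallest head-of-queue starting potential.

    head_start_potentials maps each backlogged session to the starting
    potential of the packet at the head of its queue (assumed representation).
    """
    if not head_start_potentials:
        # No backlogged session: nothing to recalibrate against.
        return system_potential
    base = min(head_start_potentials.values())
    # The system potential is only ever increased, never decreased.
    return max(system_potential, base)
```

The linear scan used here to find the minimum is for clarity only; it costs time proportional to the number of backlogged sessions.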

We can now describe the packet version of SPFQ more precisely. On the arrival of a packet from session $i$, its timestamp is calculated just as in the FFQ algorithm. However, no special operations are required for marking packets. The only additional step is the addition of the starting potential of the new packet to a separate priority queue, so as to facilitate the recalibration operation.

On the departure of a packet, the system potential is maintained at or above the minimum starting potential of backlogged sessions. It is easy to see that the price paid for the improved fairness of the SPFQ algorithm is in the recalibration step that requires knowledge of the minimum among the starting potentials of the packets at the head of the queues of all of the backlogged sessions. This operation can be implemented efficiently by maintaining the starting potentials of the backlogged sessions in a separate priority queue, so that the minimum value can be retrieved in $O(1)$ time. An entry is added to this priority queue when a packet is moved to the head of the queue of a session. Likewise, when a packet completes its service, the corresponding entry is removed. If the maximum number of sessions sharing the link is $V$, these operations can be performed in $O(\log V)$ time, the same complexity incurred in maintaining the priority queue of packets. Thus, the recalibration step does not affect the asymptotic time complexity of the algorithm, although it requires an additional data structure for the starting potentials.
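One possible realization of the separate priority queue of starting potentials is sketched below with Python's `heapq` module. The class `StartPotentialQueue` and its methods are hypothetical names. Note that the sketch swaps in lazy deletion: stale entries are discarded when they surface at the top of the heap, so removal and minimum retrieval are logarithmic only in an amortized sense, which is weaker than the worst-case bounds discussed in the text.

```python
import heapq


class StartPotentialQueue:
    """Min-queue of head-of-queue starting potentials, one entry per session."""

    def __init__(self):
        self._heap = []   # entries: (starting_potential, sequence_number, session)
        self._live = {}   # session -> sequence number of its current entry
        self._seq = 0

    def push(self, session, start_potential):
        # Called when a packet moves to the head of the session's queue.
        self._seq += 1
        self._live[session] = self._seq
        heapq.heappush(self._heap, (start_potential, self._seq, session))

    def remove(self, session):
        # Called when the session's head packet completes service.
        # The heap entry becomes stale and is dropped lazily later.
        self._live.pop(session, None)

    def min_start(self):
        # Discard stale entries, then report the smallest live potential.
        while self._heap:
            start, seq, session = self._heap[0]
            if self._live.get(session) == seq:
                return start
            heapq.heappop(self._heap)
        return None
```

A hardware implementation would more likely use a structure with true deletion; the lazy variant is chosen here only because it is short and easy to check.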

A. Fairness of SPFQ

In this section we derive bounds on the short-term unfairness of the packet SPFQ algorithm and show that it is comparable to that of WFQ. In order to calculate tight bounds on the unfairness of SPFQ, we will need to take into account the potentials of sessions in both the packet server and the corresponding fluid server. Let us denote with the potential of session $i$ at time $t$ in the packet server, calculated as follows. When a new packet is placed at the head of the queue of session $i$, the function is set equal to the starting potential of that packet. While the packet is waiting for transmission, the potential remains unchanged. When the packet starts transmission, the potential is increased by a step equal to the normalized service offered to session $i$.
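The three rules for the packet-server potential can be traced with a small sketch. It assumes that the normalized service of a packet is its length divided by the rate allocated to the session; the class `PacketServerPotential` and its attribute names are hypothetical.

```python
class PacketServerPotential:
    """Step-function potential of one session in the packet server."""

    def __init__(self, rate):
        self.rate = rate    # allocated rate of the session (assumed unit: bits/s)
        self.value = 0.0    # current potential of the session

    def on_head_of_queue(self, start_potential):
        # Rule 1: a new head packet sets the potential to its starting potential.
        self.value = start_potential

    # Rule 2: while the packet waits, nothing is called and the value is unchanged.

    def on_transmission_start(self, packet_length):
        # Rule 3: step by the normalized service, assumed to be length / rate.
        self.value += packet_length / self.rate
```

For a session with rate 2 whose head packet has starting potential 10 and length 8, the potential stays at 10 while the packet waits and jumps to 14 when its transmission starts.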

As before, we will use $P_i(t)$ to denote the potential of session $i$ at time $t$ in the corresponding fluid server. Note that the system-potential function is identical for the two systems and will be denoted as $P(t)$. Since the packet-based system is based on the fluid system, the service missed by a session while it is absent is the same in both servers. Similarly, the total service received by a session over a system busy period is also the same in both servers. However, at a certain instant of time $t$, the packet server may be ahead of or behind the fluid server in the amount of service offered to a session. Therefore, the potential of the session in the packet server may be different from its potential in the fluid server. This discrepancy, however, is always bounded.

We will first prove a lemma that establishes a correspondence between the amount of service received by a backlogged session in the packet server during an interval and its gain in potential during the same period.


Lemma 11: Let be a session in the packet server with an infinite supply of packets after time . For any interval of time with

(4.2)

Using the above lemma, we can prove the following theorem that will provide a bound for the unfairness of the SPFQ algorithm.

Theorem 2: Let sessions $i$ and $j$ have an infinite supply of packets at time . During any interval with

The rather long proofs of the above results can be found in [16]. Note that, disregarding the term , this unfairness bound over all pairs of sessions is comparable to that of WFQ [10] and SCFQ [13]. Thus, we can conclude that the fairness index of SPFQ is very close to the best-known bound for any traffic scheduling algorithm.

V. CONCLUSIONS

In this paper we introduced and analyzed two novel scheduling algorithms: FFQ and SPFQ. Both algorithms provide the worst-case service guarantees of a WFQ server and comparable fairness. We analyzed the fairness properties of the algorithms and showed that the difference in normalized service offered to any two sessions that are continuously backlogged is always bounded and this bound for SPFQ is comparable to that of WFQ. The main advantage of the algorithms compared to WFQ is that they do not require simulation of a fluid server in parallel, enabling them to be implemented in a simple and efficient manner. All of the information needed for the algorithm can be extracted from the packet server itself.

Compared to FFQ, SPFQ provides the same end-to-end delay bounds but superior fairness properties. However, although SPFQ and FFQ have asymptotically the same implementation complexity, the former requires the use of two priority lists as opposed to one in the latter. Thus, SPFQ is attractive in applications where its improved fairness justifies the additional cost of implementation.

A working prototype of FFQ has been implemented in our FPGA-based Simulation Testbed for ATM Networks (FAST) [17], [18]. The algorithm is incorporated in a shared-memory ATM switch architecture, using a set of parallel priority lists. A central controller arbitrates the sharing of the output link by the distributed shared-memory modules. The prototype works at a 16-MHz clock rate, supporting a link speed of approximately 80 Mb/s and up to 1024 virtual channels. Given the routing, density, and speed limitations of the FPGA devices, the implementation of the algorithm using ASIC technology can support 622 Mb/s links easily. Experimental tests using this prototype are currently being carried out to study the average behavior of several of these algorithms.
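As a back-of-envelope illustration of what these figures imply, the number of clock cycles available per cell time can be computed from the clock rate and the link speed. The calculation below is not from the prototype description; it only assumes the standard 53-byte ATM cell, and the function name `cycles_per_cell` is hypothetical.

```python
ATM_CELL_BITS = 53 * 8  # a standard ATM cell is 53 bytes long


def cycles_per_cell(clock_hz, link_bps, cell_bits=ATM_CELL_BITS):
    """Clock cycles that elapse while one cell is transmitted on the link."""
    cell_time = cell_bits / link_bps   # seconds needed to send one cell
    return clock_hz * cell_time


# 16-MHz clock and an 80-Mb/s link, as quoted for the prototype.
budget = cycles_per_cell(16e6, 80e6)
```

With these numbers the scheduler has roughly 85 clock cycles per cell time in which to complete its timestamp and priority-list operations.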

Since timestamp computations are performed in $O(1)$ time in both the FFQ and SPFQ algorithms, the asymptotic time complexity of the algorithms is determined by the priority-list operations. Traditional heap algorithms for insertion and deletion have a complexity of $O(\log V)$ for $V$ virtual channels. There are a number of ways for reducing this complexity for ATM networks where timestamps take integer values in a finite range. A recursive algorithm was proposed in [19]–[21] for implementing add and delete operations in such a priority queue with time complexity, where is the number of elements in the queue. These algorithms were further refined by Johnson [22] who presented a nonrecursive algorithm with complexity for the add and delete operations. In this algorithm denotes the smallest interval between successive elements in the priority queue. Applying this algorithm to FFQ results in a complexity of , where is the frame size. Furthermore, Dixon presented a method for pipelining such an algorithm [23]. Note that, in an output-buffered ATM switch, a maximum of cells may be added to the priority list in one cell cycle while only one cell is selected for transmission. Thus, in practice, the time complexity imposed by performing multiple insertions into the list may dominate the overall complexity of the algorithm.

A simpler implementation is based on the concept of calendar queues. If the finishing potential values are distinct, a separate queue can be maintained for each timestamp value. A tree of priority encoders can be used to detect the nonempty queue with the smallest timestamp value in logarithmic time. For example, if a tree of 32-bit priority encoders is used, the lowest timestamp value can be detected in , where is the number of slots in the calendar queue (which, in turn, is determined by the granularity of bandwidth allocation).
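A software analogue of the tree of priority encoders is a hierarchy of 32-bit occupancy words, where each bit of an upper-level word records whether the corresponding lower-level word is nonzero. The sketch below is only an illustration under that assumption: the class `SlotBitmapTree` and its methods are hypothetical names, slots are numbered from zero, and timestamp wraparound is not handled.

```python
class SlotBitmapTree:
    """Finds the lowest nonempty calendar slot with one 32-bit lookup per level."""

    WORD = 32

    def __init__(self, slots):
        # levels[0] has one bit per slot; each higher level has one bit
        # per word of the level below, up to a single root word.
        self.levels = []
        n = slots
        while True:
            words = (n + self.WORD - 1) // self.WORD
            self.levels.append([0] * words)
            if words == 1:
                break
            n = words

    def mark(self, slot):
        # Set the slot's bit and the covering bit in every ancestor word.
        for level in self.levels:
            word, bit = divmod(slot, self.WORD)
            level[word] |= 1 << bit
            slot = word

    def clear(self, slot):
        # Clear the slot's bit; propagate upward only while words become empty.
        for level in self.levels:
            word, bit = divmod(slot, self.WORD)
            level[word] &= ~(1 << bit)
            if level[word]:
                break
            slot = word

    def lowest(self):
        # Walk from the root down, taking the lowest set bit at each level.
        if self.levels[-1][0] == 0:
            return None
        index = 0
        for level in reversed(self.levels):
            w = level[index]
            bit = (w & -w).bit_length() - 1   # position of the lowest set bit
            index = index * self.WORD + bit
        return index
```

For 1024 slots the structure has two levels, so `lowest` inspects two words. In a calendar queue, `mark` would be called when a slot's queue becomes nonempty and `clear` when it becomes empty.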

Application of the above algorithms for optimizing the cost of implementing the priority lists is beyond the scope of this paper. We must point out, however, that in a high-speed ATM network environment a more complex approach than the calendar queue implementation is unlikely to be used. Many of the algorithms mentioned in the previous paragraphs are inherently recursive in nature and have large constant factors associated with their time complexity, making their implementation prohibitively expensive.

The fairness of an RPS, that of SPFQ in particular, can be improved by adding a shaping mechanism at the input of the scheduler. The shaper releases packets into the scheduler only when the system potential becomes equal to or greater than the starting potential of the packet. The addition of the shaping mechanism does not affect the delay bound of the scheduler. The properties of this shaped RPS are analyzed in [25]. When SPFQ is used as the scheduler, such a shaper–scheduler combination results in a work-conserving server with improved fairness as compared to that of SPFQ alone. Such a combination of SPFQ with shaping was also proposed by Bennett and Zhang, who called it WF²Q [24].
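The shaping rule can be expressed as an eligibility test applied before the usual smallest-timestamp selection. The sketch below is an illustration only: the `Packet` tuple, the function `select_shaped`, and the use of a plain list of head-of-queue packets are assumptions for the example, not the mechanism analyzed in [25].

```python
from collections import namedtuple

# Hypothetical record for a head-of-queue packet and its two potentials.
Packet = namedtuple("Packet", "session start finish")


def select_shaped(head_packets, system_potential):
    """Pick the eligible packet with the smallest finishing potential.

    A packet is eligible (released by the shaper) only once the system
    potential has reached its starting potential.
    """
    eligible = [p for p in head_packets if p.start <= system_potential]
    if not eligible:
        return None
    return min(eligible, key=lambda p: p.finish)
```

When the system potential is kept at or above the minimum head-of-queue starting potential, as in the SPFQ recalibration, at least one backlogged packet always passes the eligibility test, which is consistent with the combination remaining work conserving.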

Apart from the scheduling algorithms, a major contribution of this paper is in illustrating the power of the framework of RPS’s and the definition of a general methodology for designing FRPS’s. The novelty of our approach is that instead of just designing a scheduling algorithm and analyzing its properties, we have isolated the important properties that a scheduling algorithm must satisfy to achieve low latency and bounded unfairness. The two specific algorithms presented in this paper illustrate how the RPS framework can be used by the designer of a switch or router to balance the cost and performance of the algorithm by trading off fairness with its implementation complexity.

APPENDIX

PROOFS OF LEMMAS AND THEOREMS

Proof of Lemma 5: First note that session is backlogged in the packet server. Since both servers are work conserving, it is clear that if session received more service in the fluid server, there is another session that has received less service in the fluid server.

We will prove the lemma by contradiction. Let us denote with the set of sessions with potential at least equal to that of session and let us assume that all of these sessions have received more or equal service in the fluid server compared to the packet server. That is, for all , and . We will distinguish two cases.

Case 1: Some session is being serviced at time . All other backlogged sessions in the fluid server at time have potential at least equal to and, thus, they belong in the set . However, we also know that there exists a session that has received more service in the packet server than in the fluid server until time . This session can only have potential less than . That is

(A.1)

Since this session received more service in the packet server, it must still be backlogged in the fluid server. Thus, session should be serviced at time instead of session . This is a contradiction.

Case 2: A session that does not belong in the set is being serviced at time and, thus, . Let denote the last time that a session was in service in the fluid server. Then, in the interval all sessions have not received any service in the fluid server and thus

(A.2)

Since the service function is nondecreasing, we can write

(A.3)

We know that every session has received until time more service in the fluid server than in the packet server.

Therefore

(A.4)

By subtracting (A.3) from (A.4)

(A.5)

But at this time, there must exist at least one session that received less service in the fluid server compared to the packet server. This means that session is still backlogged at time in the fluid server. The potential of session cannot be lower than that of session because then a session from the set would not be serviced just before time . Thus, the potential of session is at least equal to . This is a contradiction.

Proof of Lemma 6: We will prove the lemma by contradiction. It is easy to verify from the definition of the algorithm that the frame will be updated the first time the starting potentials of all backlogged sessions in the packet server are greater than or equal to . Let us assume that at some time the server transmitted a packet with a timestamp greater than or equal to . Then at time this packet would have the minimum timestamp. Since we assumed that the frame size is selected such that the largest packet can be transmitted within a frame period, the potential of all backlogged sessions in the packet server at time would be greater than or equal to . Therefore, the $k$th frame update would have occurred at or before , a contradiction.

Proof of Lemma 7: The proof is again by contradiction. Let us assume that a session exists with . Then session has received more service in the fluid server compared to the packet server until time . By Lemma 5, there is another session with that has received more service in the packet server until time compared to the fluid server. Let denote the timestamp of the packet under service in the fluid server for session . Then, and this packet has already been serviced in the packet server. This is a contradiction to Lemma 6.

Proof of Lemma 8: While a session is backlogged in the FFQ server its potential is increasing by the normalized service offered to it. The system potential, on the other hand, is increased in two cases. While the frame is not changing it is increased by the real time, and when the frame changes it becomes at least equal to the starting potential of the current frame. Let us assume that the current time is and that the last frame update occurred at . The next frame update will occur after the time when all backlogged sessions have crossed the potential of in the packet server. As we showed, this will occur before the potential of any session becomes greater than . The largest difference between the system potential and a session potential will appear just before the frame update. At this time

(A.6)

and

(A.7)

Subtracting (A.7) from (A.6)

Note that the fastest way for the potential of a session to reach the value from the time that the frame was last updated is through its normalized service. However, by the time the next frame update occurs, the system-potential function would have increased by at least the time to service bits of session . This bounds the difference in potentials to .

REFERENCES

[1] A. K. Parekh and R. G. Gallager, “A generalized processor sharing approach to flow control—The single node case,” in Proc. IEEE INFOCOM’92, vol. 2, May 1992, pp. 915–924.

[2] A. Demers, S. Keshav, and S. Shenker, “Analysis and simulation of a fair queueing algorithm,” Internetworking: Research and Experience, vol. 1, no. 1, pp. 3–26, 1990.

[3] L. Zhang, “VirtualClock: A new traffic control algorithm for packet switching networks,” ACM Trans. Comput. Syst., vol. 9, pp. 101–124, May 1991.

[4] D. Ferrari and D. Verma, “A scheme for real-time channel establishment in wide-area networks,” IEEE J. Select. Areas Commun., vol. 8, pp. 368–379, Apr. 1990.

[5] M. Katevenis, S. Sidiropoulos, and C. Courcoubetis, “Weighted round-robin cell multiplexing in a general-purpose ATM switch chip,” IEEE J. Select. Areas Commun., vol. 9, pp. 1265–1279, Oct. 1991.

[6] M. Shreedhar and G. Varghese, “Efficient fair queueing using deficit round robin,” in Proc. ACM SIGCOMM’95, Cambridge, MA, Sept. 1995, pp. 231–242.

[7] C. Kalmanek, H. Kanakia, and S. Keshav, “Rate-controlled servers for very high-speed networks,” in Proc. IEEE Global Telecommunications Conf., San Diego, CA, Dec. 1990, pp. 300.3.1–300.3.9.

[8] S. Golestani, “A framing strategy for congestion management,” IEEE J. Select. Areas Commun., vol. 9, pp. 1064–1077, Sept. 1991.

[9] H. Zhang, “Service disciplines for guaranteed performance service in packet-switching networks,” Proc. IEEE, vol. 83, pp. 1374–1396, Oct. 1995.

[10] D. Stiliadis and A. Varma, “Latency-rate servers: A general model for analysis of traffic scheduling algorithms,” in Proc. IEEE INFOCOM’96, San Francisco, CA, Mar. 1996, pp. 111–119.

[11] D. Stiliadis and A. Varma, “Rate-proportional servers: A design methodology for fair queueing algorithms,” IEEE/ACM Trans. Networking, this issue, pp. 164–174.

[12] J. Davin and A. Heybey, “A simulation study of fair queueing and policy enforcement,” Comput. Commun. Rev., vol. 20, pp. 23–29, Oct. 1990.

[13] S. Golestani, “A self-clocked fair queueing scheme for broadband applications,” in Proc. IEEE INFOCOM’94, Toronto, Ont., Canada, Apr. 1994, pp. 636–646.

[14] S. Golestani, “Network delay analysis of a class of fair queueing algorithms,” IEEE J. Select. Areas Commun., vol. 13, pp. 1057–1070, Aug. 1995.

[15] P. Goyal, H. M. Vin, and H. Chen, “Start-time fair queueing: A scheduling algorithm for integrated services packet switching networks,” in Proc. ACM SIGCOMM’96, Stanford, CA, Sept. 1996, pp. 157–169.

[16] D. Stiliadis, “Traffic scheduling in packet switched networks: Analysis, design, and implementation,” Ph.D. thesis, Comput. Eng. Dep., Univ. California, Santa Cruz, June 1996.

[17] D. Stiliadis and A. Varma, “A reconfigurable hardware approach to network simulation,” ACM Trans. Modeling Comput. Simulation, vol. 7, no. 1, pp. 131–156, Jan. 1997.

[18] A. Varma and D. Stiliadis, “FAST: An FPGA-based simulation testbed for ATM switching systems,” in Proc. ICC’96, vol. 1, Dallas, TX, June 1996, pp. 374–378.

[19] P. V. E. Boas, “Preserving order in a forest in less than logarithmic time and linear space,” Inform. Processing Lett., vol. 6, pp. 80–82, Apr. 1977.

[20] P. V. E. Boas, R. Kaas, and E. Zijlstra, “Design and implementation of an efficient priority queue,” Mathemat. Syst. Theory, vol. 10, pp. 99–127, 1977.

[21] K. Mehlhorn, Data Structures and Algorithms. New York: Springer-Verlag, 1984.

[22] D. Johnson, “A priority queue in which initialization and queue operations take O(log log d) time,” Math. Syst. Theory, vol. 15, pp. 295–309, 1982.

[23] B. Dixon, “Concurrency in an O(log log n) priority queue,” in Proc. Parallel and Distributed Computing, Theory and Practice, First Canada–France Conference, Montreal, P.Q., Canada, 1994, pp. 59–71.

[24] J. C. R. Bennett and H. Zhang, “Hierarchical packet fair queueing algorithms,” in Proc. ACM SIGCOMM’96, Stanford, CA, Sept. 1996, pp. 143–156.

[25] D. Stiliadis and A. Varma, “A general methodology for designing efficient traffic scheduling and shaping algorithms,” in Proc. IEEE INFOCOM’97, San Francisco, CA, Apr. 1997, pp. 326–335.

Dimitrios Stiliadis (M’96), for photograph and biography, see this issue, p. 174.

Anujan Varma (M’86), for photograph and biography, see this issue, p. 174.
