
Purdue University
Purdue e-Pubs
Department of Computer Science Technical Reports, Department of Computer Science

2002

Quality of Service Provisioning for Composable Routing Elements

Seung Chul Han
Puneet Zaroo
David K. Y. Yau, Purdue University, yau@cs.purdue.edu
Prem Gopalan
John C. S. Lui

Report Number: 02-012

This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. Please contact [email protected] for additional information.

Han, Seung Chul; Zaroo, Puneet; Yau, David K. Y.; Gopalan, Prem; and Lui, John C. S., "Quality of Service Provisioning for Composable Routing Elements" (2002). Department of Computer Science Technical Reports. Paper 1530. https://docs.lib.purdue.edu/cstech/1530


QUALITY OF SERVICE PROVISIONING FOR COMPOSABLE ROUTING ELEMENTS

Seung Chul Han, Puneet Zaroo, David K. Y. Yau, Prem Gopalan, John C. S. Lui

CSD TR #02-012 June 2002


Quality of Service Provisioning for Composable Routing Elements

Seung Chul Han, Puneet Zaroo, David K. Y. Yau, Prem Gopalan, John C. S. Lui

Abstract

Quality of service (QoS) provisioning for dynamically composable software elements in a programmable router has not been previously studied. We present a router platform that supports extensible and configurable routing elements, and provides them with access to given resource allocations. Scheduling issues for these elements are discussed: (1) flow-based scheduling, (2) the preemptibility of a pipeline of elements, (3) CPU conservation for idle elements, (4) the CPU balance between input, output, and processing elements and its effects on buffer provisioning, and (5) performance interactions between the packet forwarding plane and the service extension control plane. To demonstrate how QoS provisioning in our system can benefit end users, we use a video scaling application that can respond gracefully to network congestion. For the application, we quantify how router resource management impacts the end-to-end quality of decoded video. Ours appears to be the first software system that supports QoS-aware processing of lightweight, dynamic router elements.

Keywords

Software router, routing elements, quality of service, CPU and buffer allocation

I. INTRODUCTION

Value-added processing of packets during their transport, especially at the network edge, is increasingly relevant. Example applications include security firewalls, network address translations, and proxy services to adapt application payload (e.g., a movie being streamed) to network conditions. Moreover, some of these services are not anticipated in advance. For example, in response to emerging security threats, new defense mechanisms will be designed as countermeasures. (Previous instances include proposals such as IP traceback [7] and router throttling [11] to defend against distributed denial-of-service attacks.) Hence, the ability to extend the service interface of a router or proxy server on-the-fly, without disrupting existing services, is attractive.

In providing extensible, value-added services during packet transport, we adopt an approach based on software elements. An element is a self-contained code module implementing a logical routing function. The advantages of using these routing elements are many:

S. C. Han, P. Zaroo and D. K. Y. Yau are with the Department of Computer Sciences, Purdue University, West Lafayette, IN 47907; P. Gopalan is with Mazu Networks, Cambridge, MA (work done while P. Gopalan was a graduate student at Purdue University); J. C. S. Lui is with the Department of Computer Science and Engineering, Chinese University of Hong Kong. Contact author: D. Yau (yau@cs.purdue.edu). Paper submitted to IEEE International Conference on Network Protocols, Paris, France, November 2002.


• The elements can be composed to form a flow processing pipeline. Hence, more complex router services can be constructed from simpler and well understood building blocks. This has important software engineering benefits by isolating design and implementation concerns and facilitating code reuse.

• An element implementing a common routing function can be shared by several flows desiring the function. This contributes to code and memory efficiency.

• Elements can be easily mapped to a lightweight execution context. For example, different elements, possibly belonging to different flows, can be executed in the context of a single thread or process. The overhead of context switching between elements or flows can thus be minimized. A large number of flows can be efficiently supported in a scalable manner.

• Elements can be fetched on demand from a (possibly remote) service repository, and dynamically linked into the runtime environment of a router. This enables the service interface of a router to be extensible on-the-fly, without disrupting existing flows. Services that are hitherto unanticipated can thus be readily introduced into an operational routing infrastructure.

While routing elements have been advanced in prior research and are supported in existing systems (e.g., [5]), their scheduling issues for providing quality of service (QoS) to network flows have not received much attention. In this paper, we present the CROSS/Linux router platform that supports configurable flow graphs of router elements as provided by the Click modular router [5]. Our research contributions beyond Click are in the area of element-related resource allocation and scheduling, which includes the following issues:

• The provision of flow-based resource allocation and scheduling on top of an element-based software architecture.

• The preemption granularity of flow processing. Our system can context switch (with acceptable overhead) from a lower priority flow to a higher priority flow in the middle of processing a packet. This reduces the duration of priority inversion. We study the resulting effects on robust forwarding of network flows with fine time-scale QoS requirements.

• CPU conservation for "idle" elements (i.e., elements which need not run because their packet queues are empty). We provide an architecture in which elements do not have to poll for work to do.

• The CPU balance between the element functions of input, output, and per-flow processing. We study how giving different CPU shares to these functions will affect buffer provisioning and packet forwarding performance.

• The provision of a service control plane, and accompanying resource contention issues between the forwarding and control planes. In particular, we discuss how the concurrent tasks of flow processing and service downloading may affect each other's performance.

A. Paper organization

The balance of the paper is organized as follows. In Section II, we review the Click modular router architecture, which provides background for the configurable elements used in our system. We then go on to discuss the design and implementation of CROSS/Linux. Section III presents the forwarding plane for packet processing. Issues for per-flow resource scheduling will be discussed. Section IV presents the control plane. In particular, it describes the processes of flow signaling and on-the-fly service extension. CROSS/Linux has been implemented on a network of commodity Pentium III desktops configured as gateway routers. We present measurement results on various aspects of QoS provisioning in our system prototype.


Related work is discussed in Section VII. Section VIII concludes.

II. BACKGROUND

The starting point of our work is the existence of an element-based router architecture, such as provided by the Click modular router [4], in which elements can be configured for customized per-flow processing of packets. For completeness, we briefly review the Click software architecture. In Click, elements are C++ kernel modules, each implementing a simple router function (e.g., receive from an input network interface, send to an output interface, packet classification, queuing, and packet scheduling). Elements can be considered nodes in a directed graph, and they can be connected to each other through one or more ports they have. When an output port of an element is connected to an input port of another element, it forms a directed edge from the former (the upstream element) to the latter (the downstream element). A packet can then be passed from the upstream to the downstream element. In general, a packet arriving at an input interface of a router is first processed by an input element, where the packet gets classified to its flow. According to the classification, the packet then flows along the edges of the flow graph, from an output port of each upstream element to an input port of each downstream element. It will receive customized protocol processing according to the actual path it traverses, and finally gets forwarded out of the router by an output element.

An upstream element initiates packet transfer to its immediate downstream neighbor by calling the push virtual function of the neighbor. Hence, packet transfers initiated from upstream (e.g., by network input) are called push processing. It is also possible for a downstream element to request packets from upstream (e.g., when an output network interface becomes ready, it may request a packet to send). This is done by the downstream element calling the pull virtual function of its immediate upstream neighbor. Hence, packet transfers initiated from downstream are called pull processing. Conceptually, push/pull processing is enabled by the arrival of packets at relevant packet queues, and a packet queue in Click is represented by a Queue element.
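To make the element interface concrete, the following C++ sketch shows how push and pull transfers compose, and how a Queue element terminates a push sequence and sources a pull sequence. It is illustrative only: it assumes single-port elements with simplified signatures, and is not taken from the actual Click declarations.

    #include <deque>

    struct Packet { /* payload omitted */ };

    // Simplified stand-in for a Click-style element with one input and one output port.
    class Element {
    public:
        virtual ~Element() = default;
        // Push processing: an upstream neighbor hands this element a packet.
        virtual void push(Packet *p) { if (next_) next_->push(p); }
        // Pull processing: a downstream neighbor asks this element for a packet.
        virtual Packet *pull() { return prev_ ? prev_->pull() : nullptr; }
        // Directed edge from this (upstream) element to a downstream element.
        void connect(Element *downstream) { next_ = downstream; downstream->prev_ = this; }
    protected:
        Element *next_ = nullptr;
        Element *prev_ = nullptr;
    };

    // A Queue element ends a push sequence and begins a pull sequence.
    class Queue : public Element {
    public:
        void push(Packet *p) override { q_.push_back(p); }   // enqueue: the push chain stops here
        Packet *pull() override {                            // downstream (e.g., DeviceOutput) pulls
            if (q_.empty()) return nullptr;
            Packet *p = q_.front();
            q_.pop_front();
            return p;
        }
    private:
        std::deque<Packet *> q_;
    };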

Fig. 1 illustrates a sample flow graph implementing a traffic conditioning block. The graph has two Queue elements, one upstream of the Shaper element and the other downstream of the Meter element. In the example, push processing starting at the Classifier element is enabled by packet arrivals at the input device queue (not shown) served by the classifier, and pull processing starting at the DeviceOutput element is enabled by packet arrivals at either of the two Queue elements shown.

Click has to schedule the execution order of eligible elements. Our definition of an eligible element is one that is the starting point of push/pull processing and has available packets to process in the relevant packet queue(s). From the scheduling point of view, a sequence of push (or pull) function calls cannot be interrupted. A packet must pass through the corresponding sequence of elements, until it is either dropped or queued in the context of a Queue element. For example, the Classifier-Meter-Discard element sequence in Fig. 1 cannot be preempted in the middle. After a packet is dropped or queued, however, the element scheduler regains control, and schedules a next element to run. Hence, the position of Queue elements in a processing path determines the path's preemption granularity in Click scheduling. If more elements are connected in tandem without interposing Queue elements, the preemption granularity becomes coarser, since the scheduler must wait for all the elements to complete before it can reschedule.


Fig. 1. A sample Click flow graph of elements.

III. FORWARDING PLANE PACKET PROCESSING

A fundamental design decision about CROSS/Linux is the scheduling paradigm that should be used for packet processing. A simple approach would be to schedule elements as independent entities, without reference to their execution context. Click chooses such an approach. However, packets sent through a router usually belong to higher level logical flows, which have their own QoS constraints. For example, a video flow may need some minimum forwarding rate to achieve continuity of the pictures. An interactive audio flow may specify some maximum delay bound for its packets, to support high quality voice communication.

To effectively support application-level QoS, we decided to provide a flow abstraction for scheduling the packet forwarding plane. Packets are classified to their flows by a packet classifier, according to flow specifications that are installed. For example, a layer-four IP flow can be defined by the source IP address, destination IP address, transport protocol, transport source port, and transport destination port. Router resources can then be allocated on a per-flow basis. In our current model, flows can be given proportional CPU shares. As a packet gets processed by the sequence of elements that it goes through, the CPU cycles consumed by the processing are charged to the packet's flow, and not to the elements themselves. In particular, an element being shared by two or more flows consumes resources of the flow being processed. Such decoupling of the resource context from the processing entity is the key to providing performance isolation between logically independent flows.
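As a concrete illustration, the following C++ sketch captures a layer-four flow specification and the idea of charging CPU cycles to the flow rather than to shared elements. The types and field names are hypothetical, not taken from the CROSS/Linux code.

    #include <cstdint>
    #include <tuple>

    // Hypothetical layer-four flow specification, following the five fields named in the
    // text: source IP, destination IP, transport protocol, source port, destination port.
    struct FlowSpec {
        uint32_t src_ip;
        uint32_t dst_ip;
        uint8_t  proto;       // e.g., TCP or UDP protocol number
        uint16_t src_port;
        uint16_t dst_port;

        bool operator==(const FlowSpec &o) const {
            return std::tie(src_ip, dst_ip, proto, src_port, dst_port) ==
                   std::tie(o.src_ip, o.dst_ip, o.proto, o.src_port, o.dst_port);
        }
    };

    // Per-flow resource context: CPU cycles consumed while elements process a packet
    // are billed to the packet's flow, not to the (possibly shared) elements.
    struct FlowAccount {
        FlowSpec spec;
        uint64_t cpu_share;        // proportional share granted at flow setup
        uint64_t cycles_charged;   // cycles billed to this flow so far

        void charge(uint64_t cycles) { cycles_charged += cycles; }
    };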

The CROSS/Linux forwarding plane scheduler (henceforth called the flow scheduler) selects the next flow to run from a task queue of all the eligible flows in a router. A flow is eligible if one or more of its elements are eligible. Such a flow is represented on the task queue by an fRouter abstraction that contains all the pertinent scheduling state about the flow. Once a flow is scheduled, it still remains to determine the execution order of the flow's eligible elements. We support this next-level scheduling decision by (1) allowing a flow to in turn apportion its CPU allocation among the constituent elements, and (2) maintaining flow-specific scheduling state for each element.
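A minimal C++ sketch of this two-level arrangement follows, with hypothetical names standing in for the fRouter abstraction and the flow scheduler's task queue; it is not the actual implementation.

    #include <cstdint>
    #include <list>
    #include <vector>

    class Element;   // element type, as in the earlier sketch

    // Hypothetical per-flow scheduling record standing in for the fRouter abstraction.
    struct FRouter {
        uint64_t virtual_time;              // proportional-share virtual time priority
        uint64_t weight;                    // CPU share assigned to the flow
        std::vector<Element *> eligible;    // the flow's currently eligible elements
    };

    // Task queue of eligible flows, kept sorted by virtual-time priority.
    class FlowScheduler {
    public:
        // Insert a newly eligible (or preempted) flow in sorted position.  This is the
        // linear-time step measured in Section VI; a priority queue would make it O(log n).
        void enqueue(FRouter *f) {
            auto it = tasks_.begin();
            while (it != tasks_.end() && (*it)->virtual_time <= f->virtual_time) ++it;
            tasks_.insert(it, f);
        }

        // Pick the flow at the head of the task queue.  The chosen flow then apportions
        // its CPU allocation among its own eligible elements (the second-level decision).
        FRouter *next() {
            if (tasks_.empty()) return nullptr;
            FRouter *f = tasks_.front();
            tasks_.pop_front();
            return f;
        }

    private:
        std::list<FRouter *> tasks_;
    };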


Fig. 2. A sample CROSS/Linux router configuration.

Notice that certain elements do not logically belong to any particular flow. Instead, they perform functions in the global router context. Input and output elements for network interfaces, and an element for vanilla IP forwarding, are important examples. We treat these global elements as belonging to certain "global flows". A global flow is represented in the task queue by an ioRouter object, a counterpart of the fRouter object for non-global flows. For the purpose of scheduling, global flows are quite similar to normal flows. They can be given specified resource allocations, thus allowing their elements to compete for system resources with other per-flow elements. The assignment of global router functions to global flows is flexible. For example, we could have one global flow for each network input element, one global flow for each network output element, and one global flow for vanilla IP forwarding. Or we could have one global flow for all of network input, network output, and vanilla IP forwarding. Fig. 2 shows a router configuration in which a single ioRouter is used for the router global functions, and two fRouters have been created for per-flow user processing.

A. Preemption granularity

Since a flow represents a line of concurrency, it is natural to run each flow as a separate thread or process. The approach, however, requires high context switching overhead (i.e., one full thread context switch) between flows. To reduce the overhead, previous work [6] has advanced the technique of batching, which always tries to process a batch of at least n packets (provided that these packets are available) belonging to one flow before the system will consider switching to another flow. While batching reduces context switching, it also makes the preemption granularity coarse and hence increases the possible duration of priority inversion. For example, a newly backlogged higher priority flow may have to wait for an entire batch of n lower priority packets to finish before it will get a chance to run.

We have described Click's packet preemption mechanism in Section II. As discussed, the preemption granularity is a sequence of elements that usually ends with a Queue element. This means that a packet can be preempted while being processed.


Such smaller preemption granularity than batching is feasible in Click because different packets can be processed by the same thread and no kernel-level thread scheduling is required to switch between them. Since QoS is an important concern in CROSS/Linux and certain applications, like continuous media, may have fine-grained time constraints, we take Click's approach one step further to allow flow preemption at arbitrary element boundaries.

We associate with each flow, say i, a preemption quantum q_i (in µs) for the flow. Once scheduled, if i has been running continuously for q_i time, then the system will attempt to reschedule when the current element being processed for i finishes. To do so, before invoking a downstream push call (respectively, upstream pull call) for i's current packet, we check whether q_i has expired or not. If not, we perform the push (respectively, pull) call as usual. If it has expired, however, then instead of performing the push/pull call, the system checks for the need to reschedule. The current packet of i should be preempted if there is another eligible flow in the system that has higher or the same virtual time priority as i. To carry out the preemption, the system saves a pointer to i's current packet and another pointer to the element that should next process the packet when the packet is resumed. Since each element operates on and transforms a packet independently in our system, we do not need to store further execution state for the preempted packet. The added runtime overhead for our preemption mechanism is therefore quite small.
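The quantum check can be summarized by the following C++ sketch. The names are hypothetical and the helper is a simplification of the mechanism described above, not the actual CROSS/Linux code.

    #include <cstdint>

    struct Packet;
    class Element;

    // Hypothetical per-flow preemption state for the mechanism described above.
    struct FlowState {
        uint64_t quantum_us;               // preemption quantum q_i, in microseconds
        uint64_t ran_us;                   // continuous CPU time used since last scheduled
        Packet  *saved_packet = nullptr;   // packet whose processing was preempted, if any
        Element *resume_at    = nullptr;   // element that should process it when resumed
    };

    // Called for flow i just before handing its current packet to the next element.
    // Returns true if the push/pull call should proceed, false if the flow is preempted.
    bool maybe_preempt(FlowState &i, Packet *pkt, Element *next_elem,
                       bool equal_or_higher_priority_flow_waiting) {
        if (i.ran_us < i.quantum_us)
            return true;                               // quantum not expired: push/pull as usual
        if (!equal_or_higher_priority_flow_waiting)
            return true;                               // expired, but no one to yield to
        // Preempt: remember only the packet and the element that resumes it.  Because each
        // element transforms a packet independently, no further execution state is saved.
        i.saved_packet = pkt;
        i.resume_at    = next_elem;
        return false;                                  // the flow scheduler regains control
    }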

B. CPU conservation for idle elements

Recall from Section II that, conceptually, flow elements are enabled by packet arrivals into their work queue(s). In practice, however, Click does not distinguish between eligible versus ineligible elements. Instead, elements have to poll their packet queue(s) for work to do. When an element is scheduled but finds no packet to process, it simply returns but remains eligible for the CPU. Since we assign CPU shares to elements, this imposes a problem. Specifically, an element that has no non-empty work queue will keep on polling, thus wasting CPU time, until it has used up its allocated CPU share. Although we are not able to further elaborate, because of limited space, this causes various anomalies in flow scheduling.

To address the problem, CROSS/Linux maintains a task queue of eligible flows only, where a flow is eligible if at least one of its elements is eligible. When an element finishes processing its last available packet, it will enter the sleep state. When all the elements of a flow sleep, the flow itself enters the sleep state and, therefore, it will be removed from the task queue. Hence, it will not be chosen to run by the flow scheduler. Later, when a packet for the flow arrives, the packet will enable one of the flow's elements, which will have the effect of waking up the flow and putting it back on the task queue.
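A minimal sketch of this sleep/wake bookkeeping, with illustrative C++ types only (the real system tracks this state inside its element and fRouter structures):

    #include <vector>

    // Hypothetical sleep/wake bookkeeping for the scheme described above.
    struct ElementState {
        bool asleep = false;               // true once its packet queue(s) are empty
    };

    struct FlowEntry {
        std::vector<ElementState *> elements;
        bool on_task_queue = true;

        bool all_asleep() const {
            for (const ElementState *e : elements)
                if (!e->asleep) return false;
            return true;
        }
    };

    // An element finishes its last available packet: it sleeps, and once every element
    // of the flow sleeps, the flow leaves the task queue and is skipped by the scheduler.
    void element_went_idle(FlowEntry &flow, ElementState &elem) {
        elem.asleep = true;
        if (flow.all_asleep())
            flow.on_task_queue = false;
    }

    // A packet arrives for the flow: the target element wakes, and the flow is put back
    // on the task queue so the flow scheduler can choose it again.
    void packet_arrived(FlowEntry &flow, ElementState &elem) {
        elem.asleep = false;
        flow.on_task_queue = true;
    }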

IV. THE CONTROL PLANE

Whereas the forwarding plane processes packet flows, the control plane of a router runs supporting services such as routing (e.g., OSPF, RIP, and BGP) and signaling (e.g., SIP and RSVP) daemons. In the case of an extensible services router, the ability to download code modules on-the-fly is important. It allows services that are not planned a priori to be deployed as they become available or as the need arises. For this purpose, the DARPA active network project has developed the active network daemon, called anetd, for fetching code from a remote repository. We leverage anetd in providing on-demand service extension. System support for interfacing CROSS/Linux with anetd is discussed in Section IV-A.

Control plane services usually run as user-level processes. Fig. 3 illustrates how such a service can be started. In the figure, a request to start anetd is received by the router, and causes the anetd daemon process to be spawned.


Fig. 3. Anetd service startup.


After startup, the daemon "subscribes" to anetd packets through a standard socket-type API. This installs a new rule in the packet classifier for anetd packets to be locally queued for reading by the daemon. Future anetd packets will thus be delivered to the daemon, instead of being forwarded by the router.

Processes in the control plane compete for system resources with each other and with the forwarding plane. To schedule the competing demands, CROSS/Linux implements a system level multiresource scheduling architecture based on resource allocations [10]. Largely in the same manner as described in [10], QoS-aware schedulers for CPU cycles, network bandwidth, disk bandwidth and main memory have been integrated, although the current CPU scheduler supports only proportional shares but not decoupled delay and rate allocations. Notice also that the flow scheduler described in Section III can be treated essentially as a system process and hence, can be given a CPU share relative to other processes or threads in the system. The flow scheduler then allocates the received CPU share to the packet flows that it manages.

A. Flow Signaling and Service Configuration

So far, we have described flow scheduling assuming that the flows have been already set up. CROSS/Linux also allows flows to be dynamically created and flexibly configured as a pipeline of elements. Such flow management is effected by IP control packets with the router alert option being set. Three kinds of control packets are defined: IC_SETUP for creating flows, IC_TEARD for destroying flows, and IC_CONFIG for configuring a flow element. The packet classifier reading from an input interface identifies these control packets and delivers them to a control queue. A system control thread processes packets in the control queue in FIFO order. It runs code implemented in a FlowManager element (also called the flow manager), which is similar to the original Click element for IP classification, but has additional support for adding new ports and filter rules. Such support is clearly crucial for dynamic flow creation.
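The control packet kinds and their dispatch by the flow manager can be sketched in C++ as follows. The enum values, struct layout, and method names are illustrative, not the actual wire format or class interface.

    #include <cstdint>

    // The three control packet kinds named in the text.  The numeric codes here are
    // illustrative, not the actual on-the-wire values.
    enum class ControlKind : uint8_t {
        IC_SETUP  = 1,   // create a flow
        IC_TEARD  = 2,   // destroy a flow
        IC_CONFIG = 3    // add or delete an element in a flow's pipeline
    };

    struct ControlPacket {
        ControlKind kind;
        // ... flow specification, resource parameters, service name, and so on
    };

    // Hypothetical FlowManager dispatch, run by the control thread that drains the
    // control queue in FIFO order.
    class FlowManager {
    public:
        void handle(const ControlPacket &cp) {
            switch (cp.kind) {
            case ControlKind::IC_SETUP:  setup(cp);     break;   // new port, Queue, fRouter
            case ControlKind::IC_TEARD:  teardown(cp);  break;   // drop fRouter and classifier rule
            case ControlKind::IC_CONFIG: configure(cp); break;   // add/delete a pipeline element
            }
        }

    private:
        void setup(const ControlPacket &)     { /* see Section IV-A.1 */ }
        void teardown(const ControlPacket &)  { /* see Section IV-A.3 */ }
        void configure(const ControlPacket &) { /* see Section IV-A.2 */ }
    };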


A.1 Flow setup

When an IC_SETUP packet is received, the flow manager constructs a configuration string representing the flow specification encoded in the packet. Once the string is composed, the original set of configuration strings maintained by the flow manager is reconfigured to include the new string. As part of the reconfiguration process, a new element output port is created for the flow manager. The new port is then connected to a Queue element created for the new flow. In addition, an fRouter object will be created and allocated resources according to parameters carried in the IC_SETUP packet. Later packets that match the classification rule for the new flow are then delivered to the corresponding flow queue.

A.2 Flow configuration

An IC_CONFIG control packet is used to add/delete an element to/from the processing pipeline of an existing flow. In the case of adding an element, the flow manager checks whether the requested service is already available in a local service repository. If not, it signals anetd to download the named service from a remote node. The anetd daemon looks up the remote node having the service. It then reliably fetches the code, as an uninterpreted byte stream, from a web server running on that node, using HTTP. For CROSS/Linux, the byte stream must correspond to a compiled kernel module for the requesting machine. If the download fails (e.g., the requested service cannot be found) in the current implementation, the request to add an element silently fails, in that the sender of the add request is not notified of the failure. If the download succeeds, the fetched code will be entered into the local service repository. Once the code is available locally, it is dynamically linked with the running kernel using the standard Linux insmod utility. Lastly, the linked module is configured into the processing pipeline through the standard Click mechanism of writing a service specification to the kernel through the /proc file system.
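The element-add path can be summarized by the following hedged C++ sketch. The repository path, the anetd request command, and the /proc file name are placeholders rather than the actual CROSS/Linux or Click interfaces.

    #include <cstdlib>
    #include <fstream>
    #include <string>

    // Hypothetical control-plane helper following the steps described above.
    bool add_element_to_flow(const std::string &service, const std::string &service_spec) {
        const std::string module_path = "/var/cross/services/" + service + ".o";

        // 1. Check the local service repository; otherwise ask anetd to fetch the
        //    compiled kernel module from the remote repository over HTTP.
        if (!std::ifstream(module_path).good()) {
            if (std::system(("anetd_fetch " + service).c_str()) != 0)
                return false;    // current implementation: the add request silently fails
        }

        // 2. Dynamically link the module into the running kernel (standard Linux insmod).
        if (std::system(("insmod " + module_path).c_str()) != 0)
            return false;

        // 3. Configure the element into the flow's pipeline by writing a service
        //    specification to the kernel through the /proc file system.
        std::ofstream proc("/proc/click/config");    // illustrative path
        if (!proc)
            return false;
        proc << service_spec << std::endl;
        return proc.good();
    }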

Fig. 4 illustrates the flow configuration process. In the figure, step 2 for spawning a new control thread is optional. In the current implementation, it is invoked only if the control thread is not already running when the IC_CONFIG packet is received. Notice also that code downloading can take place concurrently with normal packet forwarding, and that the code packets returned from the HTTP server are not forwarded but are delivered to anetd. This is because anetd has previously subscribed to the packets.

A.3 Flow delete

When an IC_TEARD packet is received, the flow manager verifies the existence of the named fRouter. If it exists, it is removed from the flow scheduler, its flow specification is removed from the packet classifier, and any memory allocated to it is returned to the kernel.

V. VIDEO SCALING APPLICATION

A media scaling service is reported in [3] for router plugins [1]. The service applies to wavelet-encoded real-time video consisting of a base layer and progressive enhancement layers. Lower layers contain more basic video information, and are needed for higher layers to add to the video quality. By using a plugin to examine the layer information of backlogged video packets at times of network congestion, the router can drop enhancement layer packets before base layer packets, and higher enhancement layer packets before lower enhancement layer packets. This way, it is possible to achieve graceful degradation of video quality under constrained network bandwidth.
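The drop policy can be illustrated with a short C++ sketch. The packet fields are hypothetical; the actual plugin of [3] is wavelet-specific and more involved.

    #include <algorithm>
    #include <cstdint>
    #include <deque>

    // Hypothetical layered-video drop policy matching the behavior described above:
    // under congestion, drop the highest enhancement layer first and the base layer
    // (layer 0) last.
    struct VideoPacket {
        uint8_t layer;   // 0 = base layer, 1..N = progressive enhancement layers
        // ... payload omitted
    };

    // Called when the backlog of video packets exceeds what the output link can carry.
    void drop_one_packet(std::deque<VideoPacket> &backlog) {
        if (backlog.empty())
            return;
        // Victim: the backlogged packet carrying the highest (least important) layer.
        auto victim = std::max_element(
            backlog.begin(), backlog.end(),
            [](const VideoPacket &a, const VideoPacket &b) { return a.layer < b.layer; });
        backlog.erase(victim);   // quality degrades gracefully, one layer at a time
    }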


Fig. 4. The process of service configuration using anetd.

We have ported wavelet video scaling to CROSS/Linux. The service can be fetched and loaded on demand, in response to user requests. While the same service has been demonstrated in [3], our goal is to understand how resource management in CROSS/Linux can impact video quality perceived by end users. In particular, video scaling requires sufficient CPU cycles to be effective. Otherwise, video packets will be dropped in an undifferentiated manner while awaiting processing by the scaling module. We are interested in experimentally assessing how different CPU allocations for the scaling service can affect video quality. Resource allocation issues are particularly relevant for applications like video streaming that have QoS constraints.

VI. EXPERIMENTAL RESULTS

We present experimental results to illustrate application performance on CROSS/Linux. The routing platform used is a Pentium III/866 MHz PC fitted with four PCI 3Com 3c59x (vortex) 10/100 Mb/s ethernet interfaces. The original vortex driver runs in interrupt mode, in which every packet arrival from the network generates a device interrupt. We have made our own changes to the vortex device driver to additionally support polling I/O, in which the device driver polls the network interface for packet arrivals (i.e., there is no interrupt overhead for receiving packets). Polling is much less expensive than interrupt processing, and can significantly increase the efficiency and stability of a router having to deal with frequent packet arrivals [5], [6]. For the global router functions, we schedule them in the context of a single global flow, similar to the configuration shown in Fig. 2.

A. Context switching

As discussed, an element-based architecture allows low context switching overhead between nows, if the

flow elements are run in the context oC one kernel thread. To verify the claim, we measure the overhead of

flow context. switching in CROSS/Linux, as a function of the number of eligible flows in t.he system. Each

flow is given the same CPU share and is always enabled. Fig. 5 shows tbe results. The overhead has two


Fig. 5. Context switch overhead as a function of the number of eligible flows.

The overhead has two components. First, it has a fixed component of about 280 ns, which includes the tasks of dequeuing the incoming flow from the head of the task list, storing the execution state of the flow being switched out (e.g., the next element to process the flow's packet that is being preempted), and updating the proportional-share scheduling state of both the incoming and outgoing flows. Second, it has a linear component that has a measured value of around 5 ns/flow, which accounts for the time required to insert the outgoing flow into the task list in sorted order of the eligible flows' virtual time priorities. The linear time reflects our current implementation of the task list as a doubly linked list of the eligible flows. A priority queue implementation can reduce the implementation complexity to O(log n), where n is the number of eligible flows in the system. To put our numbers in perspective, the reported cost for context switching between forwarding processes in [6] is 3.3 μs, after aggressive performance optimization using continuations.
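Taken together, the measured per-switch cost is therefore roughly 280 ns + 5n ns for n eligible flows, i.e., about 380 ns even with 20 eligible flows, roughly an order of magnitude below the 3.3 μs figure from [6]. The sketch below (our own illustration with hypothetical names and data layout, not the CROSS/Linux source code) shows the two pieces: an O(1) dequeue of the most eligible flow, and the O(n) sorted insert keyed on virtual time that a priority queue would reduce to O(log n).

    /*
     * Illustrative sketch (hypothetical names; not the CROSS/Linux source):
     * eligible flows kept on a doubly linked task list sorted by
     * virtual-time priority.  The sorted insert is the O(n) component
     * measured above; a binary heap would make it O(log n).
     */
    #include <stdio.h>

    struct flow {
        unsigned long vtime;            /* virtual-time priority; smaller runs first */
        struct flow *prev, *next;
    };

    struct task_list {
        struct flow *head;              /* most eligible flow */
    };

    /* O(n) sorted insert: the linear component of a context switch. */
    static void task_list_insert(struct task_list *tl, struct flow *f)
    {
        struct flow *cur = tl->head, *prev = NULL;

        while (cur && cur->vtime <= f->vtime) {     /* scan for insertion point */
            prev = cur;
            cur = cur->next;
        }
        f->prev = prev;
        f->next = cur;
        if (prev) prev->next = f; else tl->head = f;
        if (cur) cur->prev = f;
    }

    /* O(1) dequeue of the most eligible flow: part of the fixed component. */
    static struct flow *task_list_pop(struct task_list *tl)
    {
        struct flow *f = tl->head;
        if (f) {
            tl->head = f->next;
            if (tl->head) tl->head->prev = NULL;
            f->prev = f->next = NULL;
        }
        return f;
    }

    int main(void)
    {
        struct task_list tl = { NULL };
        struct flow flows[4] = { { 30 }, { 10 }, { 20 }, { 40 } };

        for (int i = 0; i < 4; i++)
            task_list_insert(&tl, &flows[i]);
        for (struct flow *f; (f = task_list_pop(&tl)) != NULL; )
            printf("run flow with vtime %lu\n", f->vtime);
        return 0;
    }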

B. Throughput comparison with Click

CROSS/Linux has added support for QoS beyond Click. We verify that the extra mechanism does not compromise the system's efficiency in forwarding packets. To do so, we compare the achievable throughput by Click and CROSS/Linux in forwarding small size (specifically, 64-byte) packets. (Smaller packets stress the router more.) We configure ten flows each with equal CPU share. We vary the aggregate input packet rate from 10K to 100K packets/s for polling mode, and from 10K to 90K for interrupt mode. The results are shown in Fig. 6. For polling, both Click and CROSS/Linux achieve a forwarding rate equal to the input rate (i.e., there is no packet loss) at all the offered loads. For interrupt mode, both Click and CROSS/Linux achieve lossless forwarding at up to about 60K packets/s. When the input rate is 70K to 90K packets/s, losses occur for both systems, and the achieved forwarding rate of CROSS/Linux is very slightly lower than (within 99% of) Click's forwarding rate. We conclude that QoS support in CROSS/Linux does not cause significant loss in system performance.
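As a rough sanity check on these numbers (simple arithmetic from the figures above, not a separate measurement), 90K 64-byte packets/s amounts to about 90,000 × 64 × 8 ≈ 46 Mb/s of packet payload, well under the 100 Mb/s line rate of any single interface even after Ethernet framing overhead. The losses seen above 60K packets/s in interrupt mode therefore reflect per-packet interrupt and processing costs rather than link saturation, which is consistent with the lossless polling results.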


Fig. 6. Click and CROSS/Linux packet forwarding performance in polling and interrupt modes.

Fig. 7. Experimental network setup in which cadiz is an experimental CROSS/Linux router and ponce is a remote code server accessed through the Internet.

C. Service extension

We measure the overhead of configuring and integrating new router services in CROSS/Linux, as described in Section IV-A. In the experiments, the machine cadiz shown in Fig. 7 is the CROSS/Linux router on which the new services are to be installed. It runs in our research lab in the Purdue CS department. If the implemented code is not initially available locally at cadiz, it has to be fetched from ponce (see Fig. 7), a web server owned by the campus computation center, and connected to cadiz via the public campus Internet. Therefore, the experiments give an idea of the kind of performance when code may have to be fetched from remote servers accessed through a typical shared network infrastructure.

Fig. 8. Local and remote service configuration delay for four different code modules.


"local" case) aud when it is not (the "remote" case). Four code modules are measured: WaveScaleBW.o (of

sille 9 kbytes) that I)erforms bandwidth scaling of black and white wavelet video (see Section V), OUIlIlI1Y.O

(of size 9.6 kbytes) that at'Lillcially delays a packet for some time interval, WaveScaleCOLOR.o (9.8 kbytes)

that performs bandwidth scaling of color wavelet video, and ThrotUe.o (10.4 kbytes) for the router throttling

distributed denial-of-service defense mechanism presented in [ll]. In the local ease, the reported time includes

the tasks of dynamically linking a code module into the running Linux kernel and configuring it into a flow

processing pipeline. The times for WaveScaleBW.o, Dummy.o, WaveScaIeCOLOR.o and l'hrottle.o are

10.52, 11.62, 14.25, a1ll.11G.23 lOS, respectively. Notice thaI. the configuration time generally increases with

the code size, though we do not observe a fixed proportional increase between the two Quantities. This

suggests that the code size is an important factor in detennining the configuration time, though it is nol. the

only factor (the complexity of the code module may also playa role). In the remote case, the l'eported time

includes the tasks for the local case and, additionally, the task of fetching the code from t.he server using

HTTP. The times taken for the four modules enumerated above i1Ie, in that order, 102.93, 116.36, 140.79

and 158.18 ms, respecl.ively. Again, the configuration time increases with the code size, since it wiIJ also lake

101lger for the network to deliver a larger code module.
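Subtracting the local times from the remote times (a derived figure, not a separate measurement) gives roughly 92.4, 104.7, 126.5 and about 142 ms for WaveScaleBW.o, Dummy.o, WaveScaleCOLOR.o and Throttle.o, respectively. In other words, the HTTP fetch over the shared campus network accounts for the bulk of the remote configuration delay, and this fetch component also grows with the module size.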

D. Forwarding/control plane contention

The previous experiment measures the standalone cost of service extension running in the control plane. We further examine system performance when the control plane contends with the forwarding plane for resources. To do so, we let our router forward flows as usual. Then, while the forwarding is going on, we send an IC_CONFIG control packet to download and configure the WaveScaleCOLOR.o module into the running kernel. The system level scheduler in Section IV is used to allocate relative CPU shares to the flow scheduler, anetd and the control thread that interacts with anetd. In the experiment, we simply use the default scheduling parameters such that the three threads all have the same CPU share. The forwarding plane has much higher actual load than the other two threads, but it can make use of the CPU cycles not claimed by them. No reservation for network bandwidth is made in the experiment.


Fig. 9. Time to configure WaveScaleCOLOR.o as a function of the competing forwarding plane packet rate.


Fig. 10. Packet forwarding performance, with and without competing service configuration.

We vary the offered traffic rate for the forwarding plane from 10K to 100K 64-byte packets/s. We measure the actual forwarding rate achieved by the forwarding plane and also the time taken for WaveScaleCOLOR.o to be successfully installed. From Fig. 9, notice that the configuration time is partly constant and partly linear with the offered traffic rate. Let y (in ms) be the configuration time and x (in packets/s) be the offered traffic rate. We found that a linear least square polynomial, y = 0.0139x + 139.86, provides a very good fit with an R-coefficient of 0.9972.
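As a worked example of the fit (simple arithmetic, not an additional measurement), an offered rate of x = 2000 packets/s gives y ≈ 0.0139 × 2000 + 139.86 ≈ 167.7 ms. The intercept of about 139.9 ms also agrees closely with the 140.79 ms standalone remote configuration time reported for WaveScaleCOLOR.o in the previous subsection, as one would expect when the competing forwarding load tends to zero.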

For the achieved forwarding rate, we compare the cases when forwarding occurs with and without competition from the service configuration process. From Fig. 10, notice that there is no observable performance difference between the two cases. We conclude that service configuration requires only a small fraction of the system resources such that it makes no significant impact on the forwarding plane.


Fig. 11. Configuration of Flow A and Flow B to evaluate flow-based versus element-based scheduling. Notice that MultPull2Push is shared by both flows.

E. Flow-based versus element-based scheduling

A fundamental design decision about CROSS/Linux is to impose a flow abstraction over Click's element-based architecture. We demonstrate the performance impact of flow-based scheduling. We configure two flows, A and B, as shown in Fig. 11. Notice that the MultPull2Push element is being shared by the two flows. Our objective is to process flow A with twice the actual CPU capacity as B. In the case of Click, CPU shares are assigned per element. Given the sharing objective, we assign B's private Paint element a CPU share of 2, and A's private Paint element a share of 4. It is not easy to assign a CPU share to the MultPull2Push element, since it is being shared. We make the apparently reasonable choice of assigning it a share of (2 + 4)/2 = 3. For CROSS/Linux, CPU shares are assigned per-flow. Hence, we simply assign shares to flows A and B in the ratio of 2:1.

We then generate 64-byte packet arrivals for the two flows so that they are always backlogged. Fig. 12 shows the cumulative CPU consumption of A and B in Click, as a function of time. Given the progress rate of B, the expected progress rate of A is also shown for comparison. Notice from the figure that the actual rate of A is significantly smaller than the expected rate. This is because the MultPull2Push element does not get sufficient CPU cycles to keep up with A's packet arrivals, causing the packets to be dropped. On the other hand, increasing the CPU share of MultPull2Push gives B the potential to be overly aggressive and take away A's intended share. The result demonstrates the difficulty of assigning appropriate CPU shares to shared elements in Click, such that the logical flows will get their desired actual CPU shares. In contrast, Fig. 13 shows the progress rates of the two flows in CROSS/Linux. Notice that our straightforward flow rate assignments easily result in the desired progress ratio of 2:1 for A relative to B. We conclude that flow-based scheduling avoids complex rate assignment problems when elements can be shared between flows. It thus enables simple and intuitive user control over system resource allocations.
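One way to see the difficulty (an illustrative reading, under the assumption that element shares translate roughly into proportional packet-processing rates): every packet of both A and B must pass through MultPull2Push, so the two upstream Paint elements can feed it work in proportion to 4 + 2 = 6 while the shared element itself is entitled to only 3. The shared element therefore becomes the bottleneck, and the packets it drops need not respect the intended 2:1 split between A and B; raising its share, as noted above, merely lets B encroach on cycles intended for A.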

F. CPU and buffer provisioning

Packet arrivals from the network may happen quickly relative to the scheduling of the software that processes the packets. If the software cannot run as soon as the packets arrive, the packets may be lost unless there are sufficient buffers to absorb the burstiness. Such loss may occur, for example, at the hardware network interface, if the input element cannot read the packets and classify them quickly enough. It may also occur at a per-flow packet queue if the per-flow element(s) cannot consume the packets fast enough. We examine several issues that affect buffer provisioning in our system to achieve lossless forwarding of packets.
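As a rough guide for the experiments that follow (a back-of-the-envelope bound rather than a measured result, with r and D as our own symbols): if a flow's packets arrive at a rate of r packets/s and the stage serving the flow can be kept from running for up to D seconds at a stretch, then lossless forwarding requires on the order of r × D packets of buffering at the point where the packets wait. The experiments below vary, in effect, both r and D.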


Fig. 12. Progress of Flow A and Flow B under element-based scheduling. Notice that A's progress rate deviates from the expected rate.


Fig. 13. Progress of Flow A and B under flow-based scheduling. The progress rate of A to B is very close to the expected ratio of 2:1.

F.1 CPU balance

Consider a general flow processing pipeline consisting of three stages: input, per-flow processing, and output. CROSS/Linux can assign different relative CPU shares to the three parts. Let i, f, and o denote the CPU shares given to input, processing, and output, respectively. The ideal ratios between the quantities should depend on the time taken by the corresponding stages. If a function is given too small a CPU share, packet loss may result if the function is not able to keep up with the packet arrivals.

In an experiment, we configure a flow whose input, processing and output stages take about 150 ns, 1.27 μs, and 130 ns, respectively. (Hence, the "ideal" CPU balance between the three stages should be about 1:8:1.) We generate back-to-back 64-byte packets for the flow at a rate of about 30K packets/s.


Processing share, f

Fig. 14. Minimum queue size COl' lossless forwarding as a function oC the processing share I, for both Click

31ld CROSS/Lil1ux.

In a set of runs, we allocate CPU shares for input, processing and output in ratios of 1 : f : 1, where f is varied from 1 to 30. We then measure the minimum buffer size (in number of packets) needed for the flow to achieve lossless forwarding of its packets in each run. Polling mode is used. The results for both Click and CROSS/Linux are shown in Fig. 14. Notice that when f is small, a large buffer size is needed in both systems to prevent packet loss (for f = 1, Click requires 330 packets and CROSS/Linux requires 310 packets). As f increases, the required buffer size decreases rather quickly, until f reaches about 8, which reflects the ideal CPU balance. In Click, the required buffer size first reaches the minimum value of 8 packets when f = 8 and stays the same when f further increases. In CROSS/Linux, the required buffer size is 17 when f = 8 and is 16 when f = 9 or higher. The buffer size stabilizes at different values for Click and CROSS/Linux because the two systems have different implementations of the input and queue elements, but further investigation is needed to pinpoint the exact reasons.
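As a cross-check on the knee in Fig. 14 (simple arithmetic from the stage times above), 1.27 μs / 0.15 μs ≈ 8.5, which is why the required buffer size stops shrinking once f reaches about 8. The total per-packet cost of the pipeline is roughly 0.15 + 1.27 + 0.13 ≈ 1.55 μs, so the 30K packets/s offered load needs only a few percent of the CPU in aggregate; what drives the buffer requirement is how the CPU is split across the three stages, not the overall amount of work.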

In another experiment, we construct another input-processing-output pipeline where processing corresponds to vanilla IP forwarding of packets. We allocate CPU shares to the three stages corresponding to their ideal balance of about 1:10:1. We generate back-to-back 64-byte packets for the flow at a rate of x packets/s, where x is varied to be 9927, 29937, 49355 and 70499 packets/s in a sequence of runs. We then measure the achieved forwarding rate for the flow when the buffer size, denoted by b, is set to be 10, 100, and 1000 packets in different runs.

Table I shows the results for polling mode in CROSS/Linux. Notice that when b is 100 or 1000 packets, forwarding is lossless. When b is 10, however, some loss is observed, and the percentage of forwarded packets ranges from about 99.6% to 99.5%. In the case of interrupt mode, the loss rates vary much more for the different buffer sizes. The results are shown in Fig. 15. Notice that for interrupt, a large buffer size (of about 1000 packets) is needed to realize the packet forwarding capacity of the router.

F.2 Preemption granularity

The preemption granularity of the system, as discussed in Section III-A, will also affect buffer provisioning to achieve lossless forwarding.


Input rate     Forwarding rate (packets/s)      % forwarded
(packets/s)    b = 10      b = 100/1000         b = 10      b = 100/1000
9927           9887        9927                 99.6        100
29937          29833       29937                99.6        100
49355          49144       49355                99.5        100
70499          70198       70499                99.5        100

TABLE I
VANILLA IP PACKET FORWARDING RATE AND PERCENTAGE FOR BUFFER SIZES OF 10, 100 AND 1000 PACKETS, AND AT DIFFERENT OFFERED 64-BYTE PACKET RATES. POLLING MODE.


Fig. 15. Interrupt mode vanilla IP forwarding rate and percentage with buffer sizes of 10, 100 and 1000 packets and at different offered 64-byte packet rates.

This is because, when the preemption granularity is coarse, a flow (even if it has a sufficient long-term CPU rate to process its packets) may have to wait longer before it is given a chance to run. If packets arrive for the flow during this waiting period, they will have to be buffered. Then, when the flow runs, it may process a large number of backlogged packets in a burst. Hence, processing for the flow may appear more bursty, necessitating a larger buffer size to absorb the burstiness.

In an experiment, we measure how the finer preemption granularity proposed in Section III-A may impact resource (i.e., buffer) provisioning compared with Click's original mechanism. We configure two flows, A and B. A has only one simple processing element that does little more than queuing each received packet for the output interface. B has the same simple element as A, but in addition n delay elements - each artificially consuming about 1 μs of CPU time - configured into a processing pipeline with no intervening Queue elements. In the original mechanism, the pipeline of n + 1 elements is not preemptible, but it is preemptible at element boundaries with the proposed changes. We generate 64-byte packet arrivals for the two flows at a rate of about 5200 packets/s. We vary n from 0 to 12 in a set of runs, and report the minimum buffer sizes needed by A to achieve lossless forwarding in the original and new mechanisms, respectively. Fig. 16 shows the results.


Fig. 16. Minimum buffer size for lossless forwarding by flow A, as a function of n, the number of delay elements used in competing flow B's pipeline. Original versus fine-grained preemption mechanisms.

Notice that for the original mechanism, the required buffer size for A increases roughly linearly as n increases. With fine-grained preemption, however, the required buffer size increases from 1 to 2 as n increases from 0 to 1, but stays at the value 2 as n further increases. Hence, although both mechanisms can assure a long-term forwarding rate for A independent of B's processing pipeline, fine-grained preemption has the added advantage of keeping A's buffer requirement largely unchanged in the different runs.
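To put these numbers in perspective (simple arithmetic from the parameters above), at 5200 packets/s flow A's packets arrive roughly every 192 μs, while each of B's delay elements consumes only about 1 μs. The contrast in Fig. 16 thus suggests that what drives A's buffer requirement under the original mechanism is not B's long-term CPU consumption but how long B can hold the CPU without reaching a preemption point, whereas with fine-grained preemption that holding time stays bounded by roughly a single element, independent of n.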

G. Video scaling

Video scaling is designed to respond to network congestion, and is most useful for connections without access to guaranteed link bandwidth. Hence, we do not perform real-time link scheduling in our experiments. Instead, default FIFO packet scheduling is used for each network output port.

The experimental network setup for video scaling is shown in Fig. 7. In the figure, a wavelet video stream consisting of 300 frames and with a peak bandwidth requirement of 2.6 Mb/s is being sent at 25 frames/s from bolling to madrigal, through the CROSS/Linux router cadiz. The video stream, encoded to have one base layer and 127 enhancement layers, is displayed at madrigal when received. At cadiz, it competes for resources with a cross traffic stream of UDP packets, sent at different bit rates and requesting different per-flow processing, from sevilla to madrigal. The direct links shown between machines are 10 Mb/s point-to-point ethernet connections. Interrupt I/O is being used.
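For scale (simple arithmetic from the numbers above), the test stream is 300 / 25 = 12 seconds of video, and its 2.6 Mb/s peak fits within the 10 Mb/s links when it is the only traffic; the congestion studied below comes from the competing cross traffic and, in particular, from contention for the router's CPU.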

In the presence of network congestion, CPU allocations have a significant impact on the quality of the video received. In a set of experiments, we run the video flow with a competing UDP flow generated at a rate of 12,499 packets/s (packet size of 64 bytes). Each UDP packet receives CPU-intensive per-flow processing to create CPU congestion. (The actual CPU utilization is 100% throughout each experiment.) When the video flow is routed through the scaling service, we vary the CPU allocation of the flow to be 0.003%, 0.067% and 0.122%, respectively. The remaining CPU capacity, less 20% given to the global router functions, is entirely allocated to the competing UDP flow. Fig. 17 profiles the PSNR of the received video. The average PSNRs for 0.003%, 0.067% and 0.122% of video CPU allocation are 20.56, 21.67 and 22.61 dB, respectively. All 300 frames are displayed for each experiment using video scaling.


Fig. 17. Received video quality with the video scaling service running at different CPU rates, under CPU and network congestion.

For comparison, we also show the received video quality with drop-tail and 0.183% CPU allocation to the video flow. In spite of the relatively high CPU allocation, the video quality is very low - only 7 frames are successfully displayed, with an average PSNR of 23.12 dB. We conclude that video scaling, when given a sufficient CPU share to run, can significantly improve the video application's ability to gracefully respond to network congestion.

VII. RELATED WORK

Component-based synthesis of network protocols has been advanced in x-kernel [2], and adopted in recent extensible software-based routers [1], [8], [9]. A notable example is router plugins [1] - however, plugin gates are fixed in the IP forwarding path and cannot be dynamically extended. Moreover, the previous work [1], [2], [8], [9] focuses neither on scheduling issues for the software elements themselves nor issues in the context of a complementary service control plane. Our forwarding plane implementation leverages Click [4], [5]. We support the use of Click elements with push/pull data movement as router service components, and exploit Click's configuration language and system support in constructing flow service pipelines. However, Click does not provide the control plane discussed in this paper. Moreover, we have greatly extended Click in many aspects of flow and control plane scheduling.

There has been recent work on resource management in software routers. Qie et al. [6] present very interesting experimental results pertaining to balancing between input, output, and per-flow processing in their software router. We have investigated similar issues of CPU balance in our system. However, our focus is on a system that supports configurable routing elements, whereas their system does not provide such support. To reduce context switching, they use the technique of batching packets. Our system takes a more fine-grained preemption approach that allows a flow's packet to be preempted at element boundaries. Moreover, important features of flow signaling and service extension, and their interactions with the forwarding plane, are not discussed in [6]. CROSS [10] advances a multiresource scheduling architecture based on resource allocations. We use resource allocations in system-level scheduling between the forwarding and control planes. However, CROSS is not element-based and, therefore, does not address a lot of the scheduling issues presented in this paper.



Recently, the use of network processors in a software router, chiefly for data plane services, is reported in [8]. By using different processors (general purpose versus specialized) for various data and control plane services, new scheduling problems arise, which is an interesting area for future research.

VIII. CONCLUSIONS

We have presented the CROSS/Linux software router. The router allows more complex router services to be constructed from simpler and well understood building blocks. Moreover, it is truly dynamically extensible through the flow signaling and on-the-fly service configuration mechanisms. We have examined in detail various issues of QoS provisioning. For the forwarding plane, we discuss flow-based resource scheduling, and exploit the lightweight nature of elements to support fine-grained preemption of flow packets. We have also studied how buffers should be provisioned to achieve lossless forwarding of packets under conditions of polling versus interrupt, and various CPU balance between input, output and processing. We have evaluated resource contention issues between the forwarding and control planes. Diverse experimental results show that our router can achieve robust lossless forwarding of packets, and can provide QoS support without excessive performance penalty. Finally, we have prototyped and evaluated a video scaling service to demonstrate benefits for end users.

REFERENCES

[1] D. Decasper, Z. Dittia, G. Parulkar, and B. Plattner. Router plugins: A software architecture for next generation routers. In Proc. ACM SIGCOMM, Vancouver, Canada, September 1998.

[2] N. C. Hutchinson and L. L. Peterson. The x-kernel: An architecture for implementing network protocols. IEEE Trans. Software Engineering, 17(1):64-76, January 1991.

[3] R. Keller, S. Choi, D. Decasper, M. Dasen, G. Fankhauser, and B. Plattner. An active router architecture for multicast video distribution. In Proc. IEEE Infocom, March 2000.

[4] E. Kohler, R. Morris, B. Chen, J. Jannotti, and M. F. Kaashoek. The Click modular router. ACM Transactions on Computer Systems, 18(3):263-297, August 2000.

[5] R. Morris, E. Kohler, J. Jannotti, and M. F. Kaashoek. The Click modular router. In Proceedings of the 17th ACM Symposium on Operating Systems Principles (SOSP '99), pages 217-231, Kiawah Island, South Carolina, December 1999.

[6] X. Qie, A. Bavier, L. Peterson, and S. Karlin. Scheduling computations on a software-based router. In Proceedings of the ACM SIGMETRICS 2001 Conference, pages 13-24, June 2001.

[7] S. Savage, D. Wetherall, A. Karlin, and T. Anderson. Practical network support for IP traceback. In Proc. ACM SIGCOMM, Stockholm, Sweden, August 2000.

[8] T. Spalink, S. Karlin, L. Peterson, and Y. Gottlieb. Building a robust software-based router using network processors. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP), October 2001. To appear.

[9] D. Wetherall. Active network vision and reality: Lessons from a capsule-based system. In Proc. ACM SOSP, December 1999.

[10] D. K. Y. Yau and X. Chen. Resource management in software-programmable router operating systems. IEEE Journal on Selected Areas in Communications, 19(3), March 2001.

[11] D. K. Y. Yau, J. C. S. Lui, and F. Liang. Defending against distributed denial-of-service attacks with max-min fair server centric router throttles. In Proc. IEEE International Workshop on Quality of Service (IWQoS) 2002, Miami Beach, FL, May 2002.