
Open Packet Processor: a programmable architecture for wire speed platform-independent stateful in-network processing

Giuseppe Bianchi, Marco Bonola, Salvatore Pontarelli, Davide Sanvito, Antonio Capone, Carmelo Cascone
CNIT / University of Rome Tor Vergata / Politecnico di Milano

ABSTRACT

This paper aims at contributing to the ongoing debate on how to bring programmability of stateful packet processing tasks inside the network switches, while retaining platform independence. Our proposed approach, named "Open Packet Processor" (OPP), shows the viability (via a hardware prototype relying on commodity HW technologies and operating in a strictly bounded number of clock cycles) of eXtended Finite State Machines (XFSMs) as a low-level data plane programming abstraction. With the help of examples, including a token bucket and a C4.5 traffic classifier based on a binary tree, we show the ability of OPP to support stateful operation and flow-level feature tracking. Platform independence is accomplished by decoupling the implementation of hardware primitives (registers, conditions, update instructions, forwarding actions, matching facilities) from their usage by an application formally described via an abstract XFSM. We finally discuss limitations and extensions.

1. INTRODUCTION

To face the emerging needs for service flexibility, network efficiency, traffic diversification, and network security and reliability, today's network nodes are called to more flexible and richer packet processing. The original Internet nodes, historically limited to switches and routers providing "just" plain forwarding services, have been massively complemented with a variety of heterogeneous middlebox-type functions [1, 2, 3, 4] such as network address translation, tunneling, load balancing, monitoring, intrusion detection, and so on. The diversification of network equipment and technologies has definitely provided an increased availability of network functionalities, but at the cost of significant extra complexity in the control and management of large-scale multi-vendor networks.

Software-Defined Networking (SDN) emerged as an attempt to address this problem. Coined in 2009 [5] as a direct follow-up of the OpenFlow proposal [6], SDN has broadly evolved since then [7] and is not in principle restricted to OpenFlow (a "minor piece in the SDN architecture", according to the OpenFlow inventors themselves [8]) as the device-level abstraction. Nevertheless, most of the high-level network programming abstractions proposed in the last half a dozen years [9, 10, 11, 12, 13, 14, 15] still rely on OpenFlow as the southbound (using RFC 7426's terminology) programming interface. Indeed, OpenFlow was designed with a desire for rapid adoption, as opposed to first principles [7]; i.e., as a pragmatic attempt to address the dichotomy between (i) flexibility and the ability to support a broad range of innovation, and (ii) compatibility with commodity hardware and vendors' need for closed platforms [6].

The aftermath is that most of the above-mentioned network programming frameworks circumvent OpenFlow's limitations by promoting a "two-tiered" [16] programming model: any stateful processing intelligence of the network applications is delegated to the network controller, whereas OpenFlow switches are limited to installing and enforcing stateless packet forwarding rules delivered by the remote controller. Centralization of the network applications' intelligence may not be a problem (and actually turns out to be an advantage) whenever changes in the forwarding states do not have strict real-time requirements and depend upon global network states. But for applications which rely only on local flow/port states, the latency toll imposed by the reliance on an external controller rules out the possibility to enforce software-implemented control plane tasks at wire speed, i.e., while remaining on the fast path.^1

One might argue that we do not even need such ultra-fast processing and packet-by-packet manipulation and control capabilities. However, not only the large real-world deployment of proprietary hardware network appliances (e.g., for traffic classification, control/balancing, monitoring, etc.), but also the evolution of the OpenFlow specification itself, shows that this may not be the

^1 A 64-byte packet takes about 5 ns at 100 Gbps, roughly the time needed for a signal to reach a control entity placed one meter away. And the execution of even a simple software-implemented control task may take far more time than this. Thus, even the physical, capillary distribution of control agents (as proxies of the remote SDN controller for low-latency tasks) on each network device would hardly meet fast path requirements.

arXiv:1605.01977v1 [cs.NI] 6 May 2016

case. As a matter of fact, since the creation of the Open Networking Foundation (ONF) in 2011, and up to the latest (version 1.5) specification, we have witnessed a hectic evolution of the OpenFlow standard, with several OpenFlow extensions devised to fix punctual shortcomings and accommodate specific needs by incorporating extremely specific stateful primitives (such as meters for rate control, group ports for fast failover support or dynamic selection of one among many action buckets at each time, e.g. for load balancing, synchronized tables for supporting learning-type functionalities, etc.).

Indeed, in the last couple of years, a new research trend has started to pursue improved programmability of the data plane, beyond the elementary "match/action" abstraction provided by OpenFlow, and (even more recently) initial work on higher-level network programming frameworks devised to exploit such newer and more capable lower-level primitives is starting to emerge [17, 16]. Proposals such as POF [18, 19], although not yet targeting stateful flow processing, do significantly improve header matching flexibility and programmability, freeing it from any specific structure of the packet header. Programmability of the packet scheduler inside the switch has been recently addressed in [20]. Works such as OpenState [21, 22] and FAST [23] explicitly add support for per-flow state handling inside OpenFlow switches, although the abstractions therein defined are still simplistic and severely limit the type of applications that can be deployed (for instance, OpenState supports only a special type of Finite State Machine, namely the Mealy Machine, which does not provide the programmer with the possibility to declare and use her own memory or registers). The P4 programming language [24, 25] leverages more advanced hardware technology, namely dedicated processing architectures [26] or Reconfigurable Match Tables [27] as an extension of TCAMs (Ternary Content Addressable Memories), to permit significantly improved programmability of the packet processing pipeline. In its latest 1.0.2 language specification [28], P4 has made a further crucial step in improving stateful processing, by introducing registers defined as "stateful memories [which] can be used in a more general way to keep state". However, the P4 language does not specify how registers should be scalably supported and managed by the underlying HW.

Contribution

This work is an attempt to revisit fast-path programmability, by (i) proposing a programming abstraction which retains the platform-independent features of the original "match/action" OpenFlow abstraction, and by (ii) showing how our abstraction can be directly "executed" over a HW architecture (whose feasibility is concretely shown via a HW FPGA prototype). In analogy with OpenFlow's "match/action" abstraction, which exposes a network node's TCAM to third-party programmability, our abstraction also directly refers to the HW interface, and as such it can be directly exposed to the programmer as a machine-level "configuration" interface, hence without any intermediary compilation or adaptation to the target (i.e., unlike the case of P4).

Figure 1: a) A typical OpenFlow pipeline architecture. b) The OPP-enabled pipeline. OPP "stages" can be pipelined with other OPP stages or ordinary OpenFlow Match/Action stages.

In conceiving our abstraction, we have been largely inspired by [21], where eXtended Finite State Machines [29] (therein referred to as "full" XFSMs) were conjectured as a possible forward-looking abstraction. Our key difference with respect to [21] is that we do not merely postulate that such "full" XFSMs may ultimately be a suitable abstraction, but we concretely show their viability and their "executability" over a HW architecture leveraging commodity HW (standard TCAMs, hash tables, ALUs, and somewhat trivial additional circuitry), and within a strictly bounded number of clock ticks.

A limitation of this paper is our focus on a "single" packet processing stage, as opposed to a more general packet processing pipeline comprising multiple match/action tables. In essence, our work shows the viability of an XFSM-based abstraction as a (significant) generalization of the original single-table OpenFlow match/action. While multiple pipelined instances of our atomic Open Packet Processor stages are clearly possible, exactly as multiple match/action tables can be pipelined since OpenFlow version 1.1 (see Figure 1), our present work does not yet take advantage of HW pipeline optimizations such as Reconfigurable Match Tables [27].

Finally, and similarly to OpenFlow's original design philosophy, even if our proposed architecture is pragmatically limited by the specific set of primitives implemented by the HW (supported packet processing and forwarding actions, matching facilities, arithmetic and logic operations on register values, etc.), it nevertheless remains extensible (by adding new actions or instructions) and largely expressive in terms of how the programmer shall use and combine such primitives within a desired stateful operation. As will hopefully be apparent later on, a "full" XFSM permits formally describing a wide variety of programmable packet processing and control tasks, which our architecture permits to directly convey and deploy inside the switch.


And even if probably not of any practical interest, the fact that not only the OpenFlow legacy statistics, but also further tailored stateful extensions today integrated in the OpenFlow standard (hence hardcoded in the switch), might be externally programmed using an apparently viable platform-agnostic abstraction merits further consideration (see discussion in section 5).

2. CONCEPT

As anticipated in the previous section, our work focuses on the design of a single Open Packet Processor (OPP) stage, as a significant generalization of the traditional OpenFlow Match/Action abstraction. More specifically, our goal is to provide a packet processing stage which holds the following properties.

Ability to process packets directly on the fast path, i.e., while the packet is traveling in the pipeline (nanoseconds time scale). The requirement of performing packet processing tasks in a deterministic and small (bounded) number of HW clock cycles hardly copes with the possibility to employ a standard CPU (and the relevant programming language), and requires us to implement a domain-specific (traffic/network) computing architecture from scratch.

Efficient storage and management of per-flow stateful information. Other than parsing packet header information and exposing such fields to a match/action stage (or a pipeline of match/action stages [27, 26]), we also aim at permitting the programmer to further use the past flow history for defining a desired per-packet processing behaviour. As shown in section 2.2, this can be easily accomplished by pre-pending a dedicated storage table (concretely, a hash table) that permits retrieving, in O(1) time, stateful flow information. We name this structure the Flow Context Table since, in some analogy with context switching in ordinary operating systems, it permits retrieving stateful information associated to the flow to which an arriving packet belongs, and storing back an updated context at the end of the packet processing pipeline. Such (flow) context switching will operate at wire speed, on a packet-by-packet basis.

Ability to specify and compute a wide (and programmable) class of stateful information, including counters, running averages, and in most generality stateful features useful in traffic control applications. It readily follows that the packet processing pipeline, which in standard OpenFlow is limited to match/action primitives, must be enriched with means to describe and (on the fly) enforce conditions on stateful quantities (e.g., the flow rate is above a threshold, or the time elapsed since the last seen packet is greater than the average inter-arrival time), as well as provide arithmetic/logic operations so as to update such stateful features in a bounded number of clock cycles (ideally one).

Platform independence. A key pragmatic insight in the original OpenFlow abstraction was the decision to restrict the OpenFlow switch programmer's ability to just selecting actions among a finite set of supported ones (as opposed to permitting the programmer to develop her own custom actions), and associating a desired action set (bundle) to a specific packet header match. We conceptually follow a similar approach, but we cast it into a more elaborate eXtended Finite State Machine (XFSM) model. As described in section 2.1, an XFSM abstraction permits us to formalize complex behavioral models, involving custom per-flow states, custom per-flow registers, conditions, state transitions, and arithmetic and logic operations. Still, an XFSM model does not require us to know how such primitives are concretely implemented in the hardware platform, but "just" permits us to combine them together so as to formalize a desired behaviour. Hence, it can be ported across platforms which support the same set of primitives.

2.1 XFSM abstraction

The OpenFlow "match/action" abstraction has been widely extended throughout the various standardization steps, with the extension of the match fields (including the possibility to perform matches on metadata), with new actions (and instructions), and with the ability to associate a set of actions to a given match. Nevertheless, the basic abstraction conceptually remains the same. It is instructive to formally re-interpret the (basic) OpenFlow match/action abstraction as a "map" T : I → O, where I = {i1, . . . , iM} is a finite set of Input Symbols, namely all the possible matches which are technically supported by an OpenFlow specification (it being irrelevant, at least for this discussion, to know how such an Input Symbols' set I is established, and that each input symbol is a Cartesian combination of all possible header field matches), and O = {o1, . . . , oK} is a finite set of Output Symbols, i.e., all the possible actions supported by an OpenFlow switch. The obvious limit of this abstraction is that the match/action mapping is statically configured, and can change only upon the controller's intervention (e.g., via flow-mod OpenFlow commands). Finally, note that the "engine" which performs the actual mapping T : I → O is a standard TCAM.

As observed in [21], an OpenFlow switch can be trivially extended to support a more general abstraction which takes the form of a Mealy Machine, i.e., a Finite State Machine with output, and which permits formally modeling dynamic forwarding behaviors, i.e., permits changing in time the specific action(s) associated to a same match. It suffices to add a further finite set S = {s1, s2, . . . , sN} of programmer-specific states, and use the TCAM to perform the mapping T : S × I → S × O. While remaining feasible on ordinary OpenFlow hardware, such a Mealy Machine abstraction brings about two


XFSM formal notation                                  Meaning
I   input symbols                                     all possible matches on packet header fields
O   output symbols                                    OpenFlow-type actions
S   custom states                                     application-specific states, defined by programmer
D   n-dimensional linear space D1 × · · · × Dn        all possible settings of n memory registers; includes both custom per-flow and global switch registers (see text)
F   set of enabling functions fi : D → {0, 1}         conditions (boolean predicates) on registers
U   set of update functions ui : D → D                applicable operations for updating registers' content
T   transition relation T : S × F × I → S × U × O     target state, actions and register update commands associated to each transition

Table 1: eXtended Finite State Machine model

key differences with respect to the original OpenFlow abstraction. First, the (output) action associated to a very same (input) match may now differ depending on the (input) state si ∈ S, i.e., the state in which the flow is found when a packet is being processed. Second, the Mealy Machine permits specifying in which, possibly different, (output) state so ∈ S the flow shall enter once the packet has been processed. While quite interesting, this generalization appears still insufficient to permit the programmer to implement meaningful applications, as it lacks the ability to compute at run time, and exploit in the forwarding decisions, per-flow features commonly used in traffic control algorithms.
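As a toy illustration (not the paper's code; the state, match, and action labels below are invented), the Mealy mapping T : S × I → S × O amounts to a lookup table whose output action depends on the flow's current state:

```python
# Illustrative sketch of the Mealy-machine abstraction T : S x I -> S x O.
# State labels, matches, and actions are hypothetical examples.

# Each entry maps (current_state, match) to (next_state, action).
T = {
    ("DEFAULT", "tcp_syn"): ("OPEN",    "forward"),
    ("OPEN",    "tcp_syn"): ("OPEN",    "drop"),     # same match, different action
    ("OPEN",    "tcp_fin"): ("DEFAULT", "forward"),
}

def process(state, match):
    """One Mealy step: the action taken now depends on the flow's state."""
    return T[(state, match)]

state = "DEFAULT"
state, action = process(state, "tcp_syn")    # first SYN is forwarded
state, action2 = process(state, "tcp_syn")   # a retransmitted SYN is dropped
```

The point of the sketch is only that the map remains a static table (TCAM-implementable), while the behavior it produces over time is dynamic.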

The goal of this paper is to show that a switch architecture can be further easily extended (section 2.2) to support an even more general Finite State Machine model, namely the eXtended Finite State Machine (XFSM) model introduced in [29]. As summarized in Table 1, this model is formally specified by means of a 7-tuple M = (I, O, S, D, F, U, T). Input Symbols I (OpenFlow-type matches) and Output Symbols O (actions) are the same as in OpenFlow. Per-application states S are inherited from the Mealy Machine abstraction [21], and permit the programmer to freely specify the possible states in which a flow can be, in relation to her desired custom application (technically, a state label is handled as a bit string). For instance, in a heavy hitter detection application, a programmer can specify states such as NORMAL, MILD, or HEAVY, whereas in a load balancing application, the state can be the actual switch output port number (or the destination IP address) an already seen flow has been pinned to, or DEFAULT for newly arriving flows or flows that can be rerouted. With respect to a Mealy Machine, the key advantage of the XFSM model resides in the additional programming flexibility along three fundamental aspects.

(1) Custom (per-flow) registers and global (switch-level) parameters. The XFSM model permits the programmer to explicitly define her own registers, by providing an array of per-flow variables whose content (time stamps, counters, average values, last TCP/ACK sequence number seen, etc.) shall be decided by the programmer herself. Additionally, it is useful to expose to the programmer (as further registers) also switch-level states (such as the switch queues' status) or "global" shared variables which all flows can access. Albeit practically very important, a detailed distinction into different register types is not foundational in terms of abstraction, and therefore all registers that the programmer can access (and possibly update) are summarized in the XFSM model presented in Table 1 via the array D of memory registers.

(2) Custom conditions on registers and switch parameters. The sheer majority of traffic control applications rely on comparisons, which permit determining whether a counter exceeded some threshold, or whether some amount of time has elapsed since the last seen packet of a flow (or the first packet of the flow, i.e., the flow duration). The enabling functions fi : D → {0, 1} serve exactly this purpose, by implementing a set of (programmable) boolean comparators, namely conditions whose input can be decided by the programmer, and whose output is 1 or 0, depending on whether the condition is true or false. In turn, the outcome of such comparisons can be exploited in the transition relation, i.e., a state transition can be triggered only if a programmer-specified condition is satisfied.

(3) Registers' updates. Along with the state transition, the XFSM model also permits the programmer to update the content of the deployed registers. As we will show later on, registers' updates require the HW to implement a set of update functions ui : D → D, namely arithmetic and logic primitives which must be provided in the HW pipeline, and whose input and output data shall be configured by the programmer.

Finally, we stress that the actual computational step in an XFSM is the transition relation T : S × F × I → S × U × O, which is nothing else than a "map" (albeit with more complex inputs and outputs than the basic OpenFlow map), and hence is naturally implemented by the switch TCAM, as shown in the next section 2.2.
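To make the "T is still a map" observation concrete, the following sketch (ours, not the paper's; all labels are invented) emulates a TCAM-style prioritized wildcard lookup over (state, condition bits, match) rows, with None standing for a "don't care" entry. A real TCAM performs this match in a single hardware lookup:

```python
# Sketch of the XFSM transition relation T : S x F x I -> S x U x O as a
# TCAM-style prioritized wildcard table. None means "don't care"; the first
# matching row wins, as in a priority-ordered TCAM.

# rows: (state, condition_bits, match) -> (next_state, update, action)
rows = [
    (("DEFAULT", (1,),    "pkt"), ("LONG",    "add_r0", "set_dscp")),
    (("DEFAULT", (None,), "pkt"), ("DEFAULT", "add_r0", "forward")),
    ((None,      (None,), "pkt"), ("DEFAULT", "nop",    "forward")),
]

def tcam_lookup(state, cond_bits, match):
    """Emulate one prioritized wildcard lookup over the XFSM table."""
    for (s, c, m), out in rows:
        if (s is None or s == state) and \
           all(cb is None or cb == b for cb, b in zip(c, cond_bits)) and \
           (m is None or m == match):
            return out
    return None
```

Row order encodes priority: the condition bit only matters in state DEFAULT, exactly the "don't care per state" flexibility described above.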

2.2 OPP architecture

In our view, what makes the previously described XFSM abstraction compelling is the fact that it can be directly executed on the switch's fast path using off-the-shelf HW, as we will prove in section 3 with a concrete HW prototype. As discussed in the next section, practical restrictions of course emerge in terms of the memory deployed for the registers, as well as the capability of the ALUs used for register updates, but such restrictions are mostly related to an actual implementation, rather than to the design, which remains at least in principle very


Figure 2: OPP architecture

general and flexible. A sketch of the proposed Open Packet Processor architecture is illustrated in Figure 2. The packet processing workflow is best explained by means of the following stages.

Stage 1: flow context lookup. Once a packet enters an OPP processing block, the first task is to extract, from the packet, a Flow Identification Key (FK), which identifies the entity to which a state may be assigned. The flow is identified by a unique key composed of a subset of the information stored in the packet header. The desired FK is configured by the programmer (an IP address, a source/destination MAC pair, a 5-tuple flow identifier, etc.) and depends on the specific application. The FK is used as an index to look up a Flow Context Table, which stores the flow context, expressed in terms of (i) the state label si currently associated to the flow, and (ii) an array R = {R0, R1, ..., Rk} of (up to) k + 1 registers defined by the programmer. The retrieved flow context is then appended to the packet as metadata, and the packet is forwarded to the next stage.
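Functionally, Stage 1 can be sketched as follows (a minimal model, not the prototype's code; the 5-tuple FK, the register count, and the DEFAULT state for unseen flows are illustrative assumptions):

```python
# Sketch of Stage 1: the Flow Context Table as a hash table keyed by a
# programmer-chosen Flow Key, returning the flow's state label and its
# register array R in O(1) expected time.
flow_context = {}          # FK -> (state_label, [R0..Rk])
K = 4                      # registers per flow (example value)

def lookup(fk):
    """Return the stored context, or a fresh DEFAULT context for new flows."""
    return flow_context.get(fk, ("DEFAULT", [0] * K))

def writeback(fk, state, regs):
    """Store the updated context back at the end of the pipeline."""
    flow_context[fk] = (state, regs)

fk = ("10.0.0.1", "10.0.0.2", 6, 1234, 80)   # (src, dst, proto, sport, dport)
state, regs = lookup(fk)                      # a never-seen flow starts in DEFAULT
```

The lookup/writeback pair is the "(flow) context switch" the text describes: read at ingress, write back at the end of the pipeline.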

Stage 2: conditions' evaluation. The goal of the Condition Block illustrated in Figure 2 (implemented using ordinary boolean circuitry, see section 3) is to compute programmer-specified conditions, which can take as input the per-flow register values (the array R) as well as global registers delivered to this block as an array G = {G0, G1, ..., Gh} of (up to) h + 1 global variables and/or global switch states. Formally, this block is therefore in charge of implementing the enabling functions specified by the XFSM abstraction. In practice, it is trivial to extend the assessment of conditions also to packet header fields (for instance, port number greater than a given global variable or custom per-flow register). The output of this block is a boolean vector C = {c0, c1, ..., cm} which summarizes whether the i-th condition is true (ci = 1) or false (ci = 0).
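The Condition Block thus reduces to evaluating a programmer-chosen list of predicates over R and G. A minimal sketch (the two example conditions are invented):

```python
# Sketch of Stage 2: enabling functions f_i : D -> {0, 1} evaluated over the
# per-flow registers R and global registers G, yielding the boolean vector C.
conditions = [
    lambda R, G: R[0] > G[0],    # c0: e.g. packet counter above a threshold
    lambda R, G: R[1] >= G[1],   # c1: e.g. elapsed time above the average gap
]

def evaluate(R, G):
    """Compute the condition vector C fed to the TCAM in Stage 3."""
    return [1 if f(R, G) else 0 for f in conditions]
```

In hardware each comparator works in parallel; here the list comprehension stands in for that parallel boolean circuitry.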

Stage 3: XFSM execution step. Since boolean conditions have been transformed into 0/1 bits, they can be provided as input to the TCAM, along with the state label and the necessary packet header fields, to perform a wildcard match (different conditions may apply in different states, i.e., a bit representing a condition can be set to "don't care" for some specific states). Each TCAM row models one transition in the XFSM, and returns a 3-tuple: (i) the next state in which the flow shall be set (which could coincide with the input state in the case of no state transition, i.e., a self-transition in the XFSM), (ii) the actions associated to the transition (usual OpenFlow-type forwarding actions, such as drop, push_label, set_tos, etc.), and (iii) the information needed to update the registers as described below.

Stage 4: register updates. Most applications require arithmetic processing when updating a stateful variable. Operations can be as simple as integer sums (to update counters or byte statistics) or can require tailored floating point processing (averages, exponential decays, etc.). The role of the Update Logic Block component highlighted in Figure 2 is to implement an array of Arithmetic and Logic Units (ALUs) which support a selected set of computation primitives permitting the programmer to update (re-compute) the value of the registers, using as input the information available at this stage (previous values of the registers, information extracted from the packet, etc.). Section 3.1 will describe the specific instruction set implemented in our HW prototype, where (with no pretence of completeness, nor willingness to impose our own set of operations) we implement a set of operations which appear to be both useful to the specific network programmer's needs and computationally effective in terms of implementation (ideally, executable in a single clock tick). It is worth mentioning that the problem of extending the set of supported ALU instructions is merely a technical one, and does not affect the OPP architecture.
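As an illustration of what "a selected set of computation primitives" means, the following sketch models two hypothetical update opcodes (the names ADD and EWMA, and the operand layout, are our invention; section 3.1 lists the instruction set actually implemented in the prototype):

```python
# Sketch of Stage 4: update functions u_i : D -> D as a small, fixed
# instruction set applied to the register file. Opcode names and operand
# encodings here are hypothetical, chosen only for illustration.
def execute(op, regs, operands):
    dst = operands[-1]                       # last operand is the target register
    if op == "ADD":                          # ADD(src, k) -> dst
        regs[dst] = regs[operands[0]] + operands[1]
    elif op == "EWMA":                       # exponentially weighted moving average
        src, sample, alpha = operands[0], operands[1], operands[2]
        regs[dst] = (1 - alpha) * regs[src] + alpha * sample
    return regs
```

Each opcode is deliberately simple enough to be computed in one clock tick by a dedicated ALU, which is the design constraint stated above.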

Extension: cross-flow context handling. As noted in [21], there are many useful stateful control tasks in which states for a given flow are updated by events occurring on different flows. A simple but prominent example is MAC learning: packets are forwarded using the destination MAC address, but the forwarding database is updated using the source MAC address. Thus, it may


be useful to further generalize the XFSM abstraction as suggested in [21], i.e., by permitting the programmer to use one Flow Key during lookup (e.g., to read information associated to a MAC destination address) and employ a possibly different Flow Key (e.g., associated to the MAC source) for updating a state or a register.
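The MAC learning example above can be sketched with two distinct keys per packet (our illustrative model, not the prototype's code; the FLOOD convention and state labels are invented):

```python
# Sketch of cross-flow context handling for MAC learning: the lookup key is
# the destination MAC, while the write-back key is the source MAC.
FLOOD = -1                   # illustrative convention for "flood on all ports"
fwd_db = {}                  # MAC -> (state, [port])

def mac_learning(src_mac, dst_mac, in_port):
    # read the context using the destination MAC ...
    state, regs = fwd_db.get(dst_mac, ("UNKNOWN", [FLOOD]))
    out_port = regs[0]
    # ... but update the context using the source MAC
    fwd_db[src_mac] = ("KNOWN", [in_port])
    return out_port

mac_learning("aa", "bb", 1)   # "bb" unknown: flood; "aa" learned on port 1
mac_learning("bb", "aa", 2)   # "aa" now known: forward; "bb" learned on port 2
```

The essential point is that lookup and write-back operate on different Flow Context Table entries within the processing of a single packet.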

2.3 Programming the OPP

It is useful to conclude this section with at least a sketch of which types of applications (and programs) may be deployed.

A first trivial example of dynamic forwarding actions is that of a simple mechanism which distinguishes short-lived flows from long-lived flows, by considering "long" any flow that has transmitted at least N packets, and applies different DSCP tags. The OPP programmer would simply need to define two states (DEFAULT, also associated to every new flow, and LONG), one per-flow register R0 (a packet counter), one global register G0 (storing the constant threshold N), a condition R0 > G0 applicable when in state DEFAULT, and an update function ADD(R0, 1) → R0. Note that we did not assume any pre-implemented counters or meters in the switch: the counter and the relevant threshold check have been programmed using the OPP abstraction.
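The program just described can be modeled end to end as follows (a minimal Python sketch, not switch code; the DSCP tag names and the value N = 3 are invented, while the states, register, condition, and update follow the text):

```python
# Executable sketch of the short/long flow program: states DEFAULT and LONG,
# per-flow register R0 (packet counter), global register G0 = N, condition
# R0 > G0 checked only in DEFAULT, update ADD(R0, 1) -> R0.
N = 3                          # G0: threshold (example value)
ctx = {}                       # Flow Context Table: FK -> (state, R0)

def on_packet(fk):
    state, r0 = ctx.get(fk, ("DEFAULT", 0))
    r0 += 1                              # update function: ADD(R0, 1) -> R0
    if state == "DEFAULT" and r0 > N:    # condition R0 > G0, DEFAULT only
        state = "LONG"                   # state transition
    ctx[fk] = (state, r0)                # context write-back
    return "dscp_long" if state == "LONG" else "dscp_default"
```

Once a flow enters LONG it stays there, so the condition row is never consulted again for that flow, matching the two-state XFSM described above.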

The usage of packet inter-arrival times and timers is exemplified by a dynamic intra-flow load balancing application, which can reroute a flow while it is in progress. As suggested in [30], rerouting should not occur during packet bursts, to avoid packet reordering and the related performance impairments. Support in OPP just requires, for each packet being transmitted, to update a per-flow register R with the quantity t + ∆, where t is the current packet timestamp and ∆ a suitable threshold. When the next packet arrives (time t1), we check the condition t1 > R. If this is false, we route the packet to the assigned path (indicated by the state label); conversely, we route it to an alternative path, and we change the state accordingly. Again, note that we have not assumed any pre-implemented support from the switch (e.g. timeouts or soft states), besides the ability to provide time information (e.g., via a global register, or timestamps as packet metadata).
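A sketch of this logic, with an illustrative ∆ and path names of our choosing:

```python
# Sketch of burst-aware rerouting: the per-flow register R holds
# t + DELTA; a packet arriving after R (i.e. after an idle gap of at
# least DELTA) may safely be moved to an alternative path.

DELTA = 0.5  # inter-burst gap threshold, seconds (illustrative)
flows = {}   # flow key -> {"R": ..., "path": ...}

def route(flow_key, t, alt_path):
    ctx = flows.setdefault(flow_key, {"R": t + DELTA, "path": "path0"})
    if t > ctx["R"]:          # idle gap elapsed: safe to reroute
        ctx["path"] = alt_path
    ctx["R"] = t + DELTA      # update R with t + DELTA on every packet
    return ctx["path"]

print(route("f", 0.0, "path1"))  # path0 (first packet)
print(route("f", 0.1, "path1"))  # path0 (inside a burst)
print(route("f", 1.0, "path1"))  # path1 (gap > DELTA: rerouted)
```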

Finally, thanks to the integration in the ALU design of monitoring-specific update instructions (averages, variances, smoothing filters, see Section 3.1), several features frequently used in traffic control and classification applications can be computed on the fly along the pipeline. We defer relevant examples to Section 4.

3. HARDWARE FEASIBILITY
Despite the current trend in softwarization of network functions and the widespread deployment of software switches, we believe that the viability of switch-level programming abstractions which challenge OpenFlow limitations, hence including this work, must still be proven in terms of hardware feasibility and ability to run in a strictly bounded number of clock cycles.

[Figure: block scheme of an OPP stage — ingress queues feed a mixer and a packet fields extractor; the flow context memory (state and flow registers), condition logic block, TCAM (XFSM table), update logic block and global registers form the processing loop; an action block, delay queue and egress queues deliver packets; a microcontroller attached via UART handles configuration commands and OPP status.]
Figure 3: Scheme of an OPP stage

Figure 3 provides a block-level overview of a candidate hardware implementation of a single OPP stage, which we prove feasible with an FPGA prototype. Pipelining of an OPP stage with other OPP stages or ordinary match/action tables does not affect the single stage design (although it does not permit us to benefit from hardware extensions and TCAM optimizations such as those introduced in [27], which we leave to future work). Figure 3 also illustrates the necessary auxiliary blocks devised to handle packet input capture and output delivery, in the assumption of a 4 × 4 port switch.

Packet reception and header field extraction. Packets received on the input queues are collected and serialized by a mixer block, so that the OPP block receives one packet per clock cycle. Each packet is then processed by a Packet Fields Extractor, configured to provide, together with the header fields (8 in our prototype), the blocks required in the next processing stages, specifically: i) the Flow Key used to query the Flow Context Table; ii) the header fields used by the Condition block; iii) the header fields used by the Update Logic Block; and iv) the (possibly different) Flow Key used for updating the Flow Context. The Packet Fields Extractor is easily implemented in HW as a parallel array of elementary Shift and Mask (SaM) blocks, where each SaM block selects the beginning of the targeted header field (the shift function) and performs a bit-wise mask operation. This operation closely resembles that proposed in POF [18]: we also use offsets, but instead of lengths we use bit masks.

Flow Context Table. This data structure is in charge of storing both the state and the registers associated to Flow Keys. It consists of a hash table (we implemented a d-left hash table with d = 4) to handle exact matches, plus a TCAM to handle wildcard matches. Unlike the hash table, which must arguably be large so as to store per-flow states, a very small TCAM can be deployed, as it is required to handle only the very few special cases where wildcard matches are needed (mainly default states, where the TCAM priority permits to differentiate default states for different protocols or packet formats). Our implementation uses 128 bit Flow Keys, and returns a 146 bit value, sufficient to support a 16 bit state label, four 32 bit per-flow registers, and two auxiliary bits per entry used by the microcontroller for housekeeping (see below).

Figure 4: Condition logic block array element

Condition Logic Block. This block permits to configure conditions on input pairs (per-flow registers, global registers, header fields), and to evaluate them so as to return as output a boolean 0/1 vector. The block, shown in Figure 4, comprises multiple (8 in our implementation) parallel configurable comparators, each of which takes as input two operands selected among all the flow registers Ri, all the global registers Gi and the header fields Hi coming from the packet field extractor. The selection is performed by two multiplexers (one for each operand). Each comparator supports five arithmetic comparison functions: >, ≥, =, ≤, <.

Type       | Instructions             | Definition
Logic      | NOP                      | do nothing
           | NOT                      | OUT1 ← NOT(IN1)
           | XOR, AND, OR             | OUT1 ← IN1 op IN2
Arit.      | ADD, SUB, MUL, DIV       | OUT1 ← IN1 op IN2
           | ADDI, SUBI, MULI, DIVI   | OUT1 ← IN1 op IMM
Shift/Rot. | LSL (Logical Shift Left) | OUT1 ← IN1 << IMM
           | ROR (Rotate Right)       | OUT1 ← IN1 ror IMM

Table 2: ALU basic instruction set

Input operands (INi) can be any among the available per-flow registers Ri, the global variables Gi, or the header fields Hi provided by the Extractor. Output operands (OUTi) indicate where the result of the instruction must be written (e.g. in a given per-flow register, or in a global variable). In some instructions, one or more of the operands (IOi) are used both as input and output. Our implementation supports 4 per-flow registers, 4 global registers and 8 header fields. Therefore, it may in principle support up to 24/log2(16) = 6 operands. In practice, we envision at most 4 operands (e.g., for the variance or for the ewma smoothing instructions), and thus our implementation may readily support up to 64 among registers and header fields. In the case of logic/arithmetic/shift operations, which require at most two input operands plus a third output, we have also considered the case in which one of the operands is an actual value (immediate value), which can hence use 16 bits.

The packet/flow specific instructions supported in our prototype are implemented as dedicated HW primitives, running at the system clock frequency with a maximum latency of two clock cycles2, and provide domain-specific operations which we deem useful in traffic control applications, and which would normally require multiple clock cycles if implemented using more elementary operations. Such domain specific operations include the online computation of running averages (avg) and variances (var), and the computation of exponentially decaying moving averages (ewma), which can serve the purpose of a moving average but can be incrementally computed and does not require maintaining a window of samples.

Usage and implementation details about packet/flow specific instructions are provided in Table 3. The avg operation stores the number of samples in IO1, and includes a new sample IN1 in the running average IO2. Similarly, the var operation stores the number of samples in IO1, the average of the value IN1 in IO2, and the variance in IO3.

The ewma operation3 was included to permit smoothing.

2As they involve a division, which we had to limit to 16 bits for dividend and divisor to target a 2 clock cycle latency.
3Being tk the last sample time, and xk′ a new sample occurring at time tk′, for simplicity of HW implementation we approximate the exponentially weighted moving average as m(tk′) = m(tk)·α^(tk′−tk) + xk′, and we use α = 1/2 to compute powers as shift operations. The intermediate decay quantity

Instruction | Definition
avg()  | IO1 ← IO1 + 1;  IO2 ← IO2 + (IN1 − IO2)/(IO1 + 1)
var()  | IO1 ← IO1 + 1;  IO2 ← IO2 + (IN1 − IO2)/(IO1 + 1);  IO3 ← IO3 + ((IN1 − IO2)² − IO3)/(IO1 + 1)
ewma() | IO1 ← IN1;  decay = 1

Table 3: Packet/flow specific instructions
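The avg()/var() rows of Table 3 can be transcribed directly (a sketch; the divisor IO1 + 1 reflects that the hardware reads the pre-update IO1, since all register updates of a transition occur in parallel):

```python
# Transcription of the avg()/var() update rules of Table 3, with the
# parallel-update semantics of the Update Logic Block: every right-hand
# side reads the pre-update register values (hence the IO1 + 1 divisor).

def avg_update(io1, io2, in1):
    """avg(): IO1 sample count, IO2 running average, IN1 new sample."""
    return io1 + 1, io2 + (in1 - io2) / (io1 + 1)

def var_update(io1, io2, io3, in1):
    """var(): as avg(), plus IO3 holding the running variance estimate
    (a hardware-friendly recurrence that reads the pre-update mean)."""
    new_io2 = io2 + (in1 - io2) / (io1 + 1)
    new_io3 = io3 + ((in1 - io2) ** 2 - io3) / (io1 + 1)
    return io1 + 1, new_io2, new_io3

n, mean = 0, 0.0
for x in [1, 2, 3, 4]:
    n, mean = avg_update(n, mean, x)
print(n, mean)  # 4 2.5 -- the recurrence tracks the exact mean
```

Note that the variance recurrence uses the pre-update mean, so it is an approximation of the exact running variance, traded for single-pass hardware implementability.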

resource type | Reference switch | OPP switch
# Slice LUTs  | 49436 (11%)      | 71712 (16%)
# Block RAMs  | 194 (13%)        | 393 (26%)

Table 4: Hardware cost of OPP compared with the reference NetFPGA SUME switch.

The size of the SRAM that can be instantiated on a last generation chip is up to 32 MB, corresponding to around 1 million entries in the d-left hash for the flow context table. The size of a TCAM can be up to 40 Mb, corresponding to 256K XFSM table entries.

The system latency, i.e. the time interval from the first table lookup to the last context update, is 6 clock cycles. The FPGA prototype is able to sustain the full throughput of 40 Gbit/s provided by the 4 switch ports. Assuming a minimum packet size of 40 bytes (320 bits), the system is able to process 1 packet per clock cycle, and thus up to 6 packets can be pipelined. However, the feedback loop (not present in the forward-only OpenFlow pipelines [36]) raises a concern: the state update performed for a packet at the sixth clock cycle would be missed by pipelined packets. This could be an issue for packets belonging to the same flow arriving back-to-back (in consecutive clock cycles); in practice, as long as the system is configured to aggregate N ≥ 6 different links, the mixer's round robin policy will separate two packets coming from the same link by N clock cycles, thus solving the problem. Note that the 6 clock cycle latency is fixed by the hardware blocks used in the FPGA (the TCAM and the Block RAMs) and basically does not change when scaling up the number of ingress ports or moving to an ASIC.

The whole system has been synthesized using the standard Xilinx design flow. Table 4 reports the logic and memory resources (in terms of absolute numbers and fraction of available FPGA resources) used by the OPP FPGA implementation, and compares these results with those required for the NetFPGA SUME single-stage reference switch. As expected, the logic uses a small fraction of the total area (the increase with respect to the reference switch is 5% of the available FPGA logic resources), which is dominated by memory (doubled with respect to the reference switch). The synthesis results hence confirm the trend already shown by [27]: the HW area is dominated by memory, while adding intelligence/features in the logic requires a small silicon overhead. The performance in terms of latency of an OPP stage and throughput of the deployed FPGA prototype has been measured by sending several synthetic traces of packets of different sizes. The results are presented in Fig. 5. As expected, the FPGA is able to sustain the expected throughput4

4Due to the limitations of our hardware measurement set-up, we were unable to actually send to the FPGA more than 24 Gbit/s, so the data point for the 64 B packet size could not be measured; the expected theoretical value is reported.

[Figure: measured and theoretical throughput (Mpps) and measured latency (ns) vs. packet size (bytes), for 64–512 byte packets.]
Figure 5: Performance of the FPGA prototype

4. PROGRAMMING EXAMPLES
To functionally test the ability of OPP to support stateful applications, we have developed a complete OPP virtual software environment. For both the switch and controller implementation we have extended the CPqD OpenFlow 1.3 software virtual switch [37] and the widely adopted OpenFlow controller Ryu [38]. The software implementation serves just for testing purposes, hence it closely mimics the described OPP hardware operation, including the relevant limitations. To configure an OPP switch via the controller, we developed an OPP-specific extension of the OpenFlow protocol. Due to space limitations (the interested reader can find configuration files in the repository) we just mention that the configuration of the XFSM in the OPP architecture is a straightforward extension of an OpenFlow configuration: it just requires populating the XFSM table entries and configuring conditions, functions, key extractors and initial global register values. All software components required to test the proposed applications are bundled in a mininet [39] based virtual machine available at a dedicated OPP repository [40], along with our prototype's VHDL HW code.

To understand how an application can be programmed using OPP, let's walk through a simplistic example of a (quite inefficient, but at least trivial to follow) TCP port scan detection and mitigation application. Since the target is to detect IP addresses which behave as scanners, we use the IP address as Flow Key. Figure 6 represents our desired application's behavior, expressed in the form of an XFSM "program", whereas Figure 7 provides the corresponding tabular configuration delivered to the switch. For every IP packet, we check in the Flow Context Table whether the IP source has an associated context; if this is not the case, a DEFAULT state is conventionally returned. The XFSM table then checks whether the packet is a TCP SYN, and only in this case we allocate a Flow Context Table entry for the considered IP source and set it in MONITOR state. In this state, we measure the rate of new TCP SYN arrivals toward hosts behind the switch port 1. Such rate

[Figure: XFSM with states DEFAULT, MONITOR and DROP. Transitions: DEFAULT → MONITOR on NEW_TCP_FLOW [OUT]; MONITOR self-loop on NEW_TCP_FLOW if R0 < G0 [OUT]; MONITOR → DROP on NEW_TCP_FLOW if R0 >= G0 [DROP]; DROP self-loop on ANY_PACKET if R1 > pkt.ts [DROP]; DROP → MONITOR on ANY_PACKET if R1 < pkt.ts [OUT]; MONITOR → DEFAULT on IDLE_TIMEOUT_EXPIRED. Registers: R0 = TCP SYN rate (EWMA); R1 = DROP state expiration timestamp; R2 = last packet timestamp; G0 = rate threshold (global); G1 = DROP duration (global).]

Figure 6: Port scan detection XFSM

C0 | C1 | state | packet fields | next state | packet actions | update functions
*  | *  | 0     | syn=1         | 1          | OUT            | R0=0; R2=pkt.ts
0  | *  | 1     | syn=1         | 1          | OUT            | R0=EWMA(R0, R2, pkt.ts); R2=pkt.ts
1  | *  | 1     | syn=1         | 2          | DROP           | R1=pkt.ts + G1
*  | 1  | 2     | *             | 2          | DROP           |
*  | 0  | 2     | *             | 1          | OUT            | R0=0; R2=pkt.ts

Figure 7: Port scan detection XFSM table

(computed with the EWMA update function) is stored and updated in the flow register R0.

While in MONITOR state, the value of R0 is verified for each new TCP flow. If a given threshold (say 20 SYN/s, a value stored in the global register G0) is exceeded, the state associated to this flow is set to DROP and all packets from this IP address are discarded. Suppose now that the programmer wants to block the scanner for 5 seconds. Lacking explicit timers (a non trivial HW extension), such mechanism is realised by the following procedure: (i) when the flow state transits from MONITOR to DROP, the register R1 is set to the packet timestamp value plus 5 seconds (a value stored in the global register G1); (ii) in DROP state the R1 value is checked for every received packet; (iii) if R1 < pkt.ts (the packet arrival time), the DROP period has expired and the state is set back to MONITOR. The XFSM table is configured as in Fig. 7.
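Putting Figures 6 and 7 together, the per-source logic can be sketched as follows (the EWMA stand-in and the constants are illustrative, not the hardware recurrence):

```python
# Sketch of the port scan XFSM of Figures 6/7: per-IP-source context,
# states DEFAULT/MONITOR/DROP, R0 = SYN rate (a simple EWMA stand-in),
# R1 = DROP expiration timestamp, R2 = last packet timestamp.

G0 = 20.0   # rate threshold (SYN/s)
G1 = 5.0    # DROP duration (s)
ctx = {}    # ip.src -> {"state", "R0", "R1", "R2"}

def ewma(rate, last_ts, now, alpha=0.5):
    # Illustrative EWMA of the SYN arrival rate, not the HW recurrence.
    inst = 1.0 / max(now - last_ts, 1e-9)
    return alpha * rate + (1 - alpha) * inst

def on_packet(ip_src, ts, is_syn):
    c = ctx.get(ip_src)
    if c is None:                       # DEFAULT state
        if is_syn:
            ctx[ip_src] = {"state": "MONITOR", "R0": 0.0, "R1": 0.0, "R2": ts}
        return "OUT"
    if c["state"] == "MONITOR":
        if is_syn:
            c["R0"] = ewma(c["R0"], c["R2"], ts)
            c["R2"] = ts
            if c["R0"] >= G0:           # scanner detected
                c["state"], c["R1"] = "DROP", ts + G1
                return "DROP"
        return "OUT"
    # DROP state: discard until R1 expires, then back to MONITOR.
    if c["R1"] > ts:
        return "DROP"
    c.update(state="MONITOR", R0=0.0, R2=ts)
    return "OUT"

print(on_packet("ip", 0.00, True))   # OUT  (context allocated)
print(on_packet("ip", 0.01, True))   # DROP (SYN rate above G0)
print(on_packet("ip", 6.00, False))  # OUT  (DROP period expired)
```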

4.1 Decision tree based traffic classification
Machine Learning (ML) tools are widely adopted by the networking community for detecting anomalies and classifying traffic patterns [41]. We have tested the feasibility of using OPP to support this kind of traffic monitoring scheme by implementing a decision tree supervised classifier based on the C4.5 algorithm [42], which has been exploited by different works on ML based network traffic classification [43, 44, 45, 46, 47, 48].

C0 | C1 | C2 | C3 | state | next state | packet actions           | update functions
*  | *  | *  | *  | 0     | 1          | Fwd()                    | R4=now() + G0; var(R0,R1,R2,pkt.len); R3 = R3 + pkt.len
0  | *  | *  | *  | 1     | 1          | Fwd()                    | var(R0,R1,R2,pkt.len); R3 = R3 + pkt.len
1  | 1  | *  | 1  | 1     | 3          | Fwd(), SetField(dscp=0)  |
1  | 1  | *  | 0  | 1     | 2          | Fwd(), SetField(dscp=10) |
1  | 0  | 0  | *  | 1     | 2          | Fwd(), SetField(dscp=10) |
1  | 0  | 1  | *  | 1     | 3          | Fwd(), SetField(dscp=0)  |
*  | *  | *  | *  | 2     | 2          | Fwd(), SetField(dscp=10) |
*  | *  | *  | *  | 3     | 3          | Fwd(), SetField(dscp=0)  |

Figure 8: XFSM table for the traffic classifier

Any ML based classification mechanism has two phases: a training phase and a test phase. The training phase is offline and is used to create the classification model by feeding the algorithm with labeled data that associate a measured traffic feature vector to one of n decision classes. In the case of decision tree based ML algorithms, the output of this phase is the binary classification tree. The training phase must obviously be performed outside the switch. For our use case implementation, the decision rules have been created using the Orange data mining framework [49] and a feature set proposed in [50]. We considered a simple scenario where it is necessary to discriminate between WEB and P2P (control) traffic. The selected features for each flow are: packet size average/variance and total number of received bytes. These features are mapped directly to the per-flow registers R1, R2, R3. Moreover, the application XFSM requires two additional registers: R0 (packet counter) and R4 (measurement window expiration time). The input feature vectors are evaluated over a time window of 10 seconds.
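The per-flow feature tracking can be sketched as follows (register names as in the text; the incremental variance is a standard Welford-style recurrence, used here only as a stand-in for the var update instruction):

```python
# Sketch of the per-flow feature tracking used by the classifier:
# R0 packet count, R1/R2 packet-size average and variance, R3 total
# bytes, accumulated over a 10 s window bounded by R4.

WINDOW = 10.0

class FlowFeatures:
    def __init__(self, now):
        self.r0 = 0              # packet counter
        self.r1 = 0.0            # mean packet size
        self.r2 = 0.0            # packet size variance
        self.r3 = 0              # total bytes
        self.r4 = now + WINDOW   # measurement window expiration

    def update(self, pkt_len):
        self.r0 += 1
        d = pkt_len - self.r1
        self.r1 += d / self.r0                             # running mean
        self.r2 += (d * (pkt_len - self.r1) - self.r2) / self.r0  # running variance
        self.r3 += pkt_len

f = FlowFeatures(now=0.0)
for size in [60, 1500, 60, 1500]:
    f.update(size)
print(f.r0, round(f.r1), f.r3)  # 4 780 3120
```

Once the window expires (now > R4, condition C0), the accumulated R1/R2/R3 values are compared against the decision-tree thresholds to pick the DSCP class, as in the XFSM of Figure 8.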

The test phase, which is performed online, consists of two operations: (1) for each flow the feature set described above is computed; (2) after 10 seconds a decision is made according to the decision tree. This testing mechanism is implemented in OPP according to the XFSM shown in Figure 8. The XFSM flow states are encoded as follows: State 0 → default; State 1 → measurement and decision; State 2 → WEB traffic (DSCP class AF11); State 3 → P2P traffic (DSCP class best effort). The condition set is: C0: (now > R4); C1: (R2 > G2); C2: (R3 > G3); C3: R1

C0 | C1 | state | packet fields | next state | packet actions | update functions
*  | *  | 0     | eth.type=IP   | 1          | OUT            | R0 = now − G0; R1 = now + G1
1  | 1  | 1     | eth.type=IP   | 1          | OUT            | R0 = R0 + G1; R1 = R1 + G1
1  | 0  | 1     | eth.type=IP   | 1          | OUT            | R0 = now − G0; R1 = now + G1
0  | 1  | 1     | eth.type=IP   | 1          | DROP           |

Figure 9: Token bucket XFSM. The flow registers R0, R1 are used to store respectively Tmin, Tmax. The global registers G0, G1 are used to store B·Q and Q. The extractor is ip.src.

Since register updates are performed after the condition verification, we cannot update the number of tokens in the bucket based on the packet arrival time before evaluating the condition (token availability) for packet forwarding. For this reason we have implemented an alternative but equivalent algorithm based on a time window. For each flow, a time window W = [Tmin, Tmax] of length B·Q is maintained to represent the availability times of the tokens in the bucket. At each packet arrival, if the arrival time T0 is within W (Case 1), at least one token is available and the bucket is not full, so we shift W by Q to the right and forward the packet. If the arrival time is after Tmax (Case 2), the bucket is full, so the packet is forwarded and W is moved to the right to reflect that B − 1 tokens are now available (Tmin = T0 − (B − 1)Q and Tmax = T0 + Q). Finally, if the packet is received before Tmin (Case 3), no token is available, therefore W is left unchanged and the packet is dropped.

In the OPP implementation, upon receipt of the first flow packet, we make a state transition in which we initialize the two registers: Tmin = T0 − (B − 1)Q and Tmax = T0 + Q (initialization with full bucket).
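The three-case window algorithm can be sketched as follows (B and Q are illustrative; this models the register semantics, not the table encoding of Figure 9):

```python
# Sketch of the time-window token bucket described above:
# W = [Tmin, Tmax] with Tmax - Tmin = B*Q; R0 = Tmin, R1 = Tmax,
# G0 = B*Q, G1 = Q.

B, Q = 4, 1.0  # bucket depth and token period (illustrative)

class Bucket:
    def __init__(self, t0):
        # Initialization with a full bucket, on the first flow packet.
        self.tmin = t0 - (B - 1) * Q
        self.tmax = t0 + Q

    def on_packet(self, t):
        if t < self.tmin:                 # Case 3: no token available
            return "DROP"
        if t > self.tmax:                 # Case 2: bucket is full, refill
            self.tmin = t - (B - 1) * Q
            self.tmax = t + Q
        else:                             # Case 1: consume one token
            self.tmin += Q
            self.tmax += Q
        return "OUT"

b = Bucket(t0=0.0)
burst = [b.on_packet(0.0) for _ in range(6)]
print(burst)  # the first few packets pass, then the bucket empties
```

A packet arriving well after the burst falls after Tmax (Case 2), so the window is re-initialized and forwarding resumes.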

At each subsequent packet arrival we verify two conditions: C0: Tnow >= Tmin; C1: Tnow <= Tmax.

functionalities. Moreover, the ability to store data in persistent registers permits to properly describe, using P4, the "computational loop" characterizing OPP. On the other side, we suffered from the lack, among the P4 constructs, of an explicit state/context table and of a clean way to store and access per-flow data. In our OPP.p4 library, a table functionally equivalent to our Context Table was actually constructed by combining arrays of registers with hash key generators, which are provided as P4 language primitives. However, besides the obvious stretch (P4 registers are generic, and not specifically meant to be deployed on a per-flow basis), this construction also suffers from hash collisions, a non trivial problem if constrained to be addressed while the packet is flying through the pipeline. The availability of a tailored context/state table structure in P4 would greatly simplify the support of an OPP target platform.

Structural limitations and possible extensions. While (we believe) very promising, our proposed approach is not free of structural concerns. If, on one side, limitations in the set of supported enabling functions and ALU functions for register updates may be easily addressed with suitable extensions, and the integration of more flexible packet header parsing (following [24]) is not expected to bring significant changes in the architecture, there are at least three pragmatic compromises which we took in the design, and which suggest future research directions. The first, and major, one resides in the fact that state transitions are "clocked" by packet arrivals: one and only one state transition associated to a flow can be triggered, and only if a packet of that flow arrives; asynchronous events, such as timer expirations, are not supported. So far we have partially addressed this limitation with, on one side, the decoupling between lookup and update functions (the cross-flow state handling feature), and on the other side with programming tricks such as the handling of time in the token bucket example. But further flexibility in this direction is a priority in our future research work. A second shortcoming is the deployment of ALU processing only in the Update Logic Block. This decision was taken in favour of a cleaner abstraction and a simpler implementation. However, (programmable) arithmetic and logic operations would be beneficial also while evaluating conditions (e.g., A − B > C), which, in the most general case, may require postponing the evaluation to the next packet (the update function can store A − B in a register, and the next condition can use such register). A third, minor, shortcoming relates to the fact that all updates occur in parallel. This prevents the programmer from pipelining operations, i.e. using (in the same transition step) the output of an instruction as input to the next one. While this issue is easily addressed by deploying multiple Update Logic Blocks in series, this would increase the latency of the OPP loop.

6. RELATED WORK
This work focuses on data plane programming architectures and abstractions, a relatively recent trend. In this field, the most influential work so far is arguably P4 [24], a programming language specifically focusing on data path packet processing. In turn, this initial work has stimulated the creation of a consortium (p4.org) which has so far produced release 1.0.2 of the language specification [28]. Our OPP work is at a different (lower) level than P4: it describes a hardware programming interface and a relevant architecture which could in the future be adapted to be used as a compilation target [25] for P4. Furthermore, our work deals with stateful processing across different packets of a flow, and as such it appears perfectly complementary to the original P4 proposal [24], which initially focused mainly on programming flexibility of the packet pipeline (P4 registers were introduced in [28]).

Concerning stateful processing, the work closest to ours is OpenState [21], and in part FAST [23]. With respect to OpenState, OPP makes a very significant step forward, as we support full eXtended Finite State Machines (XFSMs, as defined in [29]) as opposed to the much simpler Mealy Machines of OpenState, and hence we significantly broaden the variety of applications that can be programmed on the switch. This step requires additional specialized hardware blocks with respect to OpenState, which instead requires only marginal extensions to an OpenFlow hardware design [22].

Finally, OPP shares some technical similarities with [27] and with the Intel FlexPipe architecture [26], especially for what concerns the handling of ALUs in the packet processing pipeline. However, both the OPP focus (on stateful processing) and its architecture design remain very different from [27] and [26]. Indeed, an advised extension consists in extending OPP to handle multiple pipelined stages and hence exploit the TCAM reconfigurability concepts introduced in [27].

7. CONCLUSIONS
OPP is an attempt to find a pragmatic and viable balance between platform-independent HW configurability and data plane (packet-level) programming flexibility. While permitting programmers to deploy more sophisticated stateful forwarding tasks with respect to the basic OpenFlow static match/action abstraction, we believe that an asset of our configuration interface resides in the fact that it does not significantly depart from OpenFlow-type configurations: our extended finite state machine model is indeed conveyed to the switch in the usual form of a TCAM's Flow Table. We thus hope that our work might stimulate further debate in the research community on how to incrementally deploy programmable traffic processing inside the network nodes, e.g. via gradual OpenFlow extensions.


  • 8. REFERENCES

[1] Z. Wang, Z. Qian, Q. Xu, Z. Mao, and M. Zhang, "An untold story of middleboxes in cellular networks," ACM SIGCOMM Computer Communication Review, vol. 41, no. 4, pp. 374–385, 2011.
    [2] J. Sherry, S. Hasan, C. Scott, A. Krishnamurthy,S. Ratnasamy, and V. Sekar, “Makingmiddleboxes someone else’s problem: networkprocessing as a cloud service,” ACM SIGCOMMComputer Communication Review, vol. 42, no. 4,pp. 13–24, 2012.

[3] Z. A. Qazi, C.-C. Tu, L. Chiang, R. Miao, V. Sekar, and M. Yu, "SIMPLE-fying middlebox policy enforcement using SDN," ACM SIGCOMM Computer Communication Review, vol. 43, no. 4, pp. 27–38, 2013.
    [4] A. Gember-Jacobson, R. Viswanathan,C. Prakash, R. Grandl, J. Khalid, S. Das, andA. Akella, “OpenNF: Enabling innovation innetwork function control,” in ACM SIGCOMMConference, 2014, pp. 163–174.

    [5] K. Greene, “TR10: Software-defined networking,2009,” MIT Technology Review.

    [6] N. McKeown, T. Anderson, H. Balakrishnan,G. Parulkar, L. Peterson, J. Rexford, S. Shenker,and J. Turner, “OpenFlow: enabling innovation incampus networks,” ACM SIGCOMM ComputerCommunication Review, vol. 38, no. 2, pp. 69–74,2008.

    [7] N. Feamster, J. Rexford, and E. Zegura, “Theroad to SDN: an intellectual history ofprogrammable networks,” ACM SIGCOMMComputer Communication Review, vol. 44, no. 2,pp. 87–98, 2014.

    [8] S. Shenker, M. Casado, T. Koponen, N. McKeownet al., “The future of networking, and the past ofprotocols,” Keynote slides, Open NetworkingSummit, vol. 20, 2011.

    [9] N. Gude, T. Koponen, J. Pettit, B. Pfaff,M. Casado, N. McKeown, and S. Shenker, “Nox:towards an operating system for networks,” ACMSIGCOMM Computer Communication Review,vol. 38, no. 3, pp. 105–110, 2008.

    [10] A. K. Nayak, A. Reimers, N. Feamster, andR. Clark, “Resonance: Dynamic access control forenterprise networks,” in 1st ACM Workshop onResearch on Enterprise Networking (WREN09),2009.

    [11] N. Foster, R. Harrison, M. J. Freedman,C. Monsanto, J. Rexford, A. Story, andD. Walker, “Frenetic: A network programminglanguage,” in ACM SIGPLAN Notices, vol. 46,no. 9, 2011, pp. 279–291.

    [12] A. Voellmy, H. Kim, and N. Feamster, “Procera: alanguage for high-level reactive network control,”in Proc. 1st workshop on Hot topics in softwaredefined networks, 2012, pp. 43–48.

    [13] C. Monsanto, J. Reich, N. Foster, J. Rexford, andD. Walker, “Composing Software-DefinedNetworks,” in USENIX NSDI, 2013, pp. 1–13.

    [14] T. Nelson, A. D. Ferguson, M. J. Scheer, andS. Krishnamurthi, “Tierless programming andreasoning for software-defined networks,” USENIXNSDI, 2014.

    [15] H. Kim, J. Reich, A. Gupta, M. Shahbaz,N. Feamster, and R. Clark, “Kinetic: Verifiabledynamic network control,” in USENIX NSDI,May 2015.

    [16] M. T. Arashloo, Y. Koral, M. Greenberg,J. Rexford, and D. Walker, “SNAP: Statefulnetwork-wide abstractions for packet processing,”arXiv preprint arXiv:1512.00822, 2015.

    [17] M. Shahbaz and N. Feamster, “The case for anintermediate representation for programmabledata planes,” in 1st ACM SIGCOMM Symposiumon Software Defined Networking Research, 2015.

    [18] H. Song, “Protocol-oblivious forwarding: Unleashthe power of sdn through a future-proofforwarding plane,” in Proceedings of the SecondACM SIGCOMM Workshop on Hot Topics inSoftware Defined Networking, HotSDN ’13, 2013,pp. 127–132.

    [19] H. Song, J. Gong, H. Chen, and J. Dustzadeh,“Unified POF Programming for Diversified SDNData Plane Devices,” in ICNS 2015, 2015.

    [20] A. Sivaraman, S. Subramanian, A. Agrawal,S. Chole, S.-T. Chuang, T. Edsall, M. Alizadeh,S. Katti, N. McKeown, and H. Balakrishnan,“Towards programmable packet scheduling,” in14th ACM Workshop on Hot Topics in Networks,2015, p. 23.

    [21] G. Bianchi, M. Bonola, A. Capone, andC. Cascone, “Openstate: programmingplatform-independent stateful openflowapplications inside the switch,” ACM SIGCOMMComputer Communication Review, vol. 44, no. 2,pp. 44–51, 2014.

    [22] S. Pontarelli, M. Bonola, G. Bianchi, A. Capone,and C. Cascone, “Stateful openflow: Hardwareproof of concept,” in IEEE High PerformanceSwitching and Routing (HPSR), 2015.

    [23] M. Moshref, A. Bhargava, A. Gupta, M. Yu, andR. Govindan, “Flow-level state transition as a newswitch primitive for SDN,” in 3rd workshop onHot topics in software defined networking, 2014,pp. 61–66.

    [24] P. Bosshart, D. Daly, G. Gibb, M. Izzard,N. McKeown, J. Rexford, C. Schlesinger,D. Talayco, A. Vahdat, G. Varghese et al., “P4:Programming protocol-independent packetprocessors,” ACM SIGCOMM ComputerCommunication Review, vol. 44, no. 3, pp. 87–95,


2014.

[25] L. Jose, L. Yan, G. Varghese, and N. McKeown, "Compiling packet programs to reconfigurable switches," in USENIX NSDI, 2015.

[26] "Intel Ethernet Switch FM6000 Series - Software Defined Networking." [Online]. Available: http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ethernet-switch-fm6000-sdn-paper.pdf

    [27] P. Bosshart, G. Gibb, H.-S. Kim, G. Varghese,N. McKeown, M. Izzard, F. Mujica, andM. Horowitz, “Forwarding metamorphosis: Fastprogrammable match-action processing inhardware for sdn,” in ACM SIGCOMMConference, 2013, pp. 99–110.

    [28] The P4 language Consortium, “The P4 LanguageSpecification, version 1.0.2,” March 2015.

    [29] K. T. Cheng and A. S. Krishnakumar, “AutomaticFunctional Test Generation Using The ExtendedFinite State Machine Model,” in ACM Int. DesignAutomation Conference (DAC), 1993, pp. 86–91.

    [30] S. Kandula, D. Katabi, S. Sinha, and A. Berger,“Dynamic load balancing without packetreordering,” ACM SIGCOMM ComputerCommunication Review, vol. 37, no. 2, pp. 51–62,2007.

    [31] N. Zilberman, Y. Audzevich, G. Covington, andA. Moore, “NetFPGA SUME: Toward 100 Gbpsas Research Commodity,” Micro, IEEE, vol. 34,no. 5, pp. 32–41, Sept 2014.

    [32] “Virtex-7 Family Overview,”http://www.xilinx.com.

    [33] B. Jean-Louis, “Using block RAM for highperformance read/write TCAMs,” 2012.

    [34] Z. Ullah, M. Jaiswal, Y. Chan, and R. Cheung,“FPGA Implementation of SRAM-based TernaryContent Addressable Memory,” in IEEE 26thInternational Parallel and Distributed ProcessingSymposium Workshops & PhD Forum(IPDPSW), 2012.

    [35] W. Jiang, “Scalable ternary content addressablememory implementation using FPGAs,” inArchitectures for Networking andCommunications Systems (ANCS), 2013ACM/IEEE Symposium on, 2013, pp. 71–82.

    [36] Open Networking Foundation, “OpenFlow SwitchSpecification ver 1.4,” Oct. 2013.

    [37] “OpenFlow 1.3 Software Switch,”http://cpqd.github.io/ofsoftswitch13/.

    [38] "RYU software framework," http://osrg.github.io/ryu/.

    [39] "MiniNet," http://www.mininet.org.

    [40] "OPP Source Repository," for blind review purposes currently available at anonymized link https://drive.google.com/folderview?id=0BzXk0zMykkNoR1RTeXMzNmZJdjQ - to be replaced with public one after the review process.

    [41] T. T. Nguyen and G. Armitage, "A survey of techniques for internet traffic classification using machine learning," Communications Surveys & Tutorials, IEEE, vol. 10, no. 4, pp. 56–76, 2008.

    [42] J. R. Quinlan, C4.5: Programs for Machine Learning. Elsevier, 2014.

    [43] N. Williams, S. Zander, and G. Armitage, "A preliminary performance comparison of five machine learning algorithms for practical IP traffic flow classification," ACM SIGCOMM Computer Communication Review, vol. 36, no. 5, pp. 5–16, 2006.

    [44] Z.-S. Pan, S.-C. Chen, G.-B. Hu, and D.-Q. Zhang, "Hybrid neural network and C4.5 for misuse detection," in Machine Learning and Cybernetics, 2003 IEEE International Conference on, vol. 4, 2003, pp. 2463–2467.

    [45] Y. Ma, Z. Qian, G. Shou, and Y. Hu, "Study of information network traffic identification based on C4.5 algorithm," in Wireless Communications, Networking and Mobile Computing, 2008. WiCOM'08. 4th IEEE International Conference on, 2008, pp. 1–5.

    [46] Y. Zhang, H. Wang, and S. Cheng, "A method for real-time peer-to-peer traffic classification based on C4.5," in Communication Technology (ICCT), 12th IEEE International Conference on, 2010, pp. 1192–1195.

    [47] W. Li and A. W. Moore, "A machine learning approach for efficient traffic classification," in Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, MASCOTS'07. 15th International Symposium on, 2007, pp. 310–317.

    [48] R. Alshammari et al., "Machine learning based encrypted traffic classification: identifying SSH and Skype," in Computational Intelligence for Security and Defense Applications, 2009. CISDA 2009. IEEE Symposium on, 2009, pp. 1–8.

    [49] J. Demšar, T. Curk, A. Erjavec, Č. Gorup, T. Hočevar, M. Milutinovič, M. Možina, M. Polajnar, M. Toplak, A. Starič, M. Štajdohar, L. Umek, L. Žagar, J. Žbontar, M. Žitnik, and B. Zupan, "Orange: Data mining toolbox in Python," Journal of Machine Learning Research, vol. 14, pp. 2349–2353, 2013. [Online]. Available: http://jmlr.org/papers/v14/demsar13a.html

    [50] Z. Li, R. Yuan, and X. Guan, "Accurate classification of the internet traffic based on the SVM method," in Communications, 2007. ICC'07. IEEE International Conference on, 2007, pp. 1373–1378.


